
On 3 November 2025, the Lithuanian Language Vector Model (LT-MLKM-modernBERT) was publicly released. The model was developed by the State Digital Solutions Agency (SDSA) in collaboration with Vytautas Magnus University (VMU), UAB Neurotechnology, UAB Tilde Lietuva, and MB Krilas as part of the project “Development of the General Lithuanian Language Corpus and Vectorised Lithuanian Language Models”. Further details regarding the project can be found on the website of the Institute of Digital Resources and Interdisciplinary Research (SITTI).
The SDSA project manager is A. Rakauskas, and the supplier group leader is Assoc. Prof. Dr. Andrius Utka.
LT-MLKM-modernBERT is a Lithuanian masked language model (MLM) built on the ModernBERT architecture and pre-trained on the BLKT Lithuanian Text Corpus Stage 3, comprising over 1.87 billion words and 49 billion training tokens from diverse Lithuanian sources, including news, legal, academic, and public-sector texts. With a context length of 8,192 tokens, the model can take long documents as a single input while maintaining linguistic accuracy and textual coherence.
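Texts longer than the 8,192-token context length have to be split before they are passed to the model. The sketch below shows one common way to do this, with overlapping windows over a list of token IDs. It is an illustration only, not part of the released model or its documentation: the function name `split_into_windows`, the overlap of 128 tokens, and the two positions reserved for special tokens are all assumptions chosen for the example.

```python
from typing import List

MAX_CONTEXT = 8192  # context length stated in the announcement


def split_into_windows(
    token_ids: List[int],
    max_len: int = MAX_CONTEXT,
    overlap: int = 128,   # assumed value, tune for the task
    reserved: int = 2,    # assumed room for start/end special tokens
) -> List[List[int]]:
    """Split token IDs into overlapping windows that fit the context length."""
    body = max_len - reserved  # usable positions per window
    if body <= 0:
        raise ValueError("max_len must be larger than reserved")
    if not 0 <= overlap < body:
        raise ValueError("overlap must be in [0, max_len - reserved)")
    if not token_ids:
        return []

    step = body - overlap  # how far each new window advances
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(token_ids[start:start + body])
        if start + body >= len(token_ids):
            break  # the last window reached the end of the text
        start += step
    return windows
```

Each window shares its first `overlap` tokens with the end of the previous one, so a sentence cut at a window boundary still appears whole in at least one window when it is shorter than the overlap.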
The LT-MLKM-modernBERT model is a high-quality Lithuanian language resource: a pre-trained neural model intended to advance research and development in artificial intelligence and to help adapt digital innovations to real-world requirements.
The complete model information, including model description, instructions for getting started with the model, usage options, training details, limitations, and more, can be found here.
The LT-MLKM-modernBERT model is openly available on the Hugging Face platform.
Also read about the LT-MLKM-modernBERT model in the following sources:

