
On 23 July 2025, CLARIN-LT researchers Jolanta Kovalevskaitė, Erika Rimkutė, and Jurgita Vaičenonienė published the article “Developing new annotated corpora for Lithuanian: Compilation issues” in the scientific journal Studies about Languages. The article presents the origins of Lithuanian text linguistics, gives an overview of grammatically annotated Lithuanian corpora, analyzes the situation of annotated corpora in other languages, examines the structure of the new grammatically annotated corpora, and analyzes the concept of a corpus unit in detail.
The first corpus linguistics studies in Lithuania were initiated at the Computational Linguistics Centre of Vytautas Magnus University. Today, a wide range of publicly accessible corpora and other linguistic resources, including databases, dictionaries, and language analysis tools, are available at the SITTI (Institute of Digital Resources and Interdisciplinary Research) and CLARIN-LT Repository.
In the article, the researchers highlight the importance of annotated corpora and present the morphologically annotated corpus “Matas”, the automatically morphologically annotated “Corpus of the Contemporary Lithuanian Language”, the syntactically annotated Lithuanian corpus “Alksnis”, and the Lithuanian morphological analysis and synthesis tool “Morfuoklis” (for more on its analysis and synthesis functionalities, see the interview with Erika Rimkutė and Virginijus Dadurkevičius in the Tour de CLARIN). The article also outlines the European Union’s NextGenerationEU project for 2024–2026, “Morphologically and Syntactically Annotated Corpora Models for Training (Gold Standards)”.
The researchers give a comprehensive overview of the evolution of grammatically annotated corpora in Lithuania, describing the development process and key features of the morphologically annotated Lithuanian corpus “Matas” and the syntactically annotated Lithuanian corpus “Alksnis”. They mention the international standards CoNLL-U, MULTEXT-East, PDT (Prague Dependency Treebank), and UD (Universal Dependencies), as well as the Lithuanian standard “Jablonskis”. Annotated corpora in various other languages are also discussed, with their sizes and structures detailed and the corpora compared with one another. The authors highlight factors that reduce the comparability of corpora and propose a solution to mitigate this issue. They also present a list of countries possessing the largest annotated corpora, introduce the largest annotated corpora by language, size, and structure, and compare them with English corpora.
Readers are introduced to the structure, proportions, text types, styles, and genres of the new grammatically annotated corpora under development. Texts from administrative, scientific, and literary domains are discussed, along with the conditions and restrictions governing their use.
The advantages and disadvantages of various corpus development strategies, whether based on complete texts or selected excerpts, are examined and clarified. The intricate notion of the corpus unit is explored in detail. The following terms are elucidated: tokenization, token, word, and non-word. Instances are examined where a semantic unit comprises several words, and conversely, when a single word encompasses two semantic units.
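The distinction between tokens, words, and non-words can be illustrated with a minimal Python sketch. It is not the method used in the article: the token rule, the `classify` function, and the criterion "a word is a purely alphabetic token" are assumptions chosen for illustration only, and the sample sentence is invented.

```python
import re

# Assumed token rule (illustrative only): a run of letters/digits, optionally
# joined by an internal hyphen or apostrophe, or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into tokens under the assumed rule above."""
    return TOKEN_RE.findall(text)

def classify(token):
    """Return 'word' for purely alphabetic tokens, 'non-word' otherwise.

    Internal hyphens and apostrophes are ignored before the check, so
    'well-known' counts as a word while '25-hour' and '.' do not.
    """
    core = re.sub(r"[-'’]", "", token)
    return "word" if core.isalpha() else "non-word"

sentence = "Kaunas has 2 rivers."
labelled = [(tok, classify(tok)) for tok in tokenize(sentence)]
print(labelled)
# [('Kaunas', 'word'), ('has', 'word'), ('2', 'non-word'),
#  ('rivers', 'word'), ('.', 'non-word')]
```

Under this rule every word is a token, but not every token is a word. A different criterion, for example one that treats mixed letter-digit strings as words, would draw the boundary elsewhere.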
The article reviews text elements that pose challenges, such as symbols, numbers, abbreviations, and punctuation marks (for example, 3M, i600, FB, 25-hour, !mportant). A detailed list of problem cases specific to the Lithuanian language is presented, accompanied by explanations.
The authors also emphasize that segmenting a text into corpus units is further complicated by the choice of software (such as AntConc, LancsBox, or Sketch Engine): each program defines corpus units differently, so the results obtained vary.
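How much the definition of a corpus unit matters can be seen in a small Python sketch that segments the problem cases 3M, i600, FB, 25-hour, and !mportant under three different rules. The three regular expressions are hypothetical definitions made up for this illustration. They are not the actual tokenization rules of AntConc, LancsBox, Sketch Engine, or any other tool.

```python
import re

# Problem cases of the kind the article discusses, joined into one string.
SAMPLE = "3M i600 FB 25-hour !mportant"

# Hypothetical unit definitions (assumptions, not any real tool's rules).
DEFINITIONS = {
    "whitespace": r"\S+",               # anything between spaces is one unit
    "alphanumeric": r"\w+",             # runs of letters/digits; punctuation dropped
    "alnum_and_punct": r"\w+|[^\w\s]",  # punctuation marks are units of their own
}

def segment(text, pattern):
    """Return the list of units that the given pattern finds in text."""
    return re.findall(pattern, text)

for name, pattern in DEFINITIONS.items():
    units = segment(SAMPLE, pattern)
    print(f"{name:16} {len(units)} units: {units}")
# whitespace       5 units: ['3M', 'i600', 'FB', '25-hour', '!mportant']
# alphanumeric     6 units: ['3M', 'i600', 'FB', '25', 'hour', 'mportant']
# alnum_and_punct  8 units: ['3M', 'i600', 'FB', '25', '-', 'hour', '!', 'mportant']
```

The same five strings yield five, six, or eight units depending on the rule, so corpus sizes and frequency counts are only comparable when the underlying unit definition is known.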
Get the latest CLARIN-LT news by following our Facebook page and visiting our website.

