{"id":3225,"date":"2025-11-17T13:48:24","date_gmt":"2025-11-17T11:48:24","guid":{"rendered":"http:\/\/158.129.51.247:8888\/?p=3225"},"modified":"2025-11-17T13:48:25","modified_gmt":"2025-11-17T11:48:25","slug":"publications-an-article-by-clarin-lt-researchers-has-been-issued","status":"publish","type":"post","link":"https:\/\/clarin-lt.lt\/?p=3225","title":{"rendered":"Publications. An article by CLARIN-LT researchers has been issued"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"19\" src=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-1024x19.png\" alt=\"\" class=\"wp-image-3215\" srcset=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-1024x19.png 1024w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-300x5.png 300w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-768x14.png 768w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-1536x28.png 1536w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-100x2.png 100w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-150x3.png 150w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-200x4.png 200w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-450x8.png 450w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-600x11.png 600w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12-900x16.png 900w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-12.png 1650w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<p>On 23 July 2025, CLARIN-LT researchers \u2013 <a href=\"https:\/\/www.vdu.lt\/cris\/entities\/person\/jolanta-kovalevskaite\">Jolanta Kovalevskait\u0117<\/a><em>, <\/em><a href=\"https:\/\/www.vdu.lt\/cris\/entities\/person\/erika-rimkute\">Erika 
Rimkut\u0117<\/a><em>, <\/em>and<em> <\/em><a href=\"https:\/\/www.vdu.lt\/cris\/entities\/person\/jurgita-vaicenoniene\">Jurgita Vai\u010denonien\u0117<\/a> \u2013 published an article &#8220;<a href=\"https:\/\/kalbos.ktu.lt\/index.php\/KStud\/article\/view\/40544\">Developing new annotated corpora for Lithuanian: Compilation issues<\/a>&#8221; in the scientific journal <a href=\"https:\/\/kalbos.ktu.lt\/index.php\/KStud\/index\"><em>Studies about Languages<\/em><\/a>, which presents the origins of Lithuanian text linguistics, provides an overview of grammatically annotated Lithuanian language corpora, analyzes the situation of annotated corpora in other languages, examines the structure of new grammatically annotated corpora, and provides a detailed analysis of the concept of a corpus unit.<\/p>\n\n\n\n<p>The first corpus linguistics studies in Lithuania were initiated at the <em>Computational Linguistics Centre<\/em> of <a href=\"https:\/\/www.vdu.lt\/en\/\">Vytautas Magnus University<\/a>. 
Today, a wide range of publicly accessible corpora and other linguistic resources, including databases, dictionaries, and language analysis tools, is available from <a href=\"https:\/\/sitti.vdu.lt\/en\/\">SITTI (Institute of Digital Resources and Interdisciplinary Research)<\/a> and the <a href=\"https:\/\/clarin-lt.lt\/?page_id=86\">CLARIN-LT Repository<\/a>.<\/p>\n\n\n\n<p>In the article, the researchers explain the importance of annotated corpora and present the morphologically annotated corpus &#8220;<a href=\"https:\/\/sitti.vdu.lt\/en\/matas-morphologically-annotated-lithuanian-corpus\/\">Matas<\/a>&#8221;, the automatically morphologically annotated &#8220;<a href=\"http:\/\/tekstynas.vdu.lt\/tekstynas\/\">Corpus of the Contemporary Lithuanian Language<\/a>&#8221;, the syntactically annotated Lithuanian corpus &#8220;<a href=\"https:\/\/sitti.vdu.lt\/alksnis-sintaksiskai-anotuotas-tekstynas\/\">Alksnis<\/a>&#8221;, and the Lithuanian morphological analysis and synthesis tool &#8220;<a href=\"https:\/\/sitti.vdu.lt\/morfuoklis\/lt\">Morfuoklis<\/a>&#8221; (for more on the analysis and synthesis functionalities, see the interview with <a href=\"https:\/\/www.vdu.lt\/cris\/entities\/person\/erika-rimkute\">Erika Rimkut\u0117<\/a> and <a href=\"https:\/\/www.vdu.lt\/cris\/entities\/person\/virginijus-dadurkevicius\/datasets\">Virginijus Dadurkevi\u010dius<\/a> in the <a href=\"https:\/\/www.clarin.eu\/blog\/tour-de-clarin-interview-erika-rimkute-and-virginijus-dadurkevicius\">Tour de CLARIN<\/a>). 
The article also outlines the European Union&#8217;s <a href=\"https:\/\/next-generation-eu.europa.eu\/index_en\">NextGenerationEU<\/a> project for 2024\u20132026, &#8220;<a href=\"https:\/\/sitti.vdu.lt\/morfologiskai-ir-sintaksiskai-anotuotu-tekstynu-modeliai-dirbtinio-intelekto-apmokymui\/\">Morphologically and Syntactically Annotated Corpora Models for Training (Gold Standards)<\/a>&#8221;.<\/p>\n\n\n\n<p>The researchers provide a comprehensive overview of the evolution of grammatically annotated corpora in Lithuania, introducing the development process and key features of the morphologically annotated Lithuanian language corpus &#8220;<a href=\"https:\/\/sitti.vdu.lt\/en\/matas-morphologically-annotated-lithuanian-corpus\/\">Matas<\/a>&#8221; and the syntactically annotated Lithuanian language corpus &#8220;<a href=\"https:\/\/sitti.vdu.lt\/alksnis-sintaksiskai-anotuotas-tekstynas\/\">Alksnis<\/a>&#8221;. International standards (CoNLL-U, MULTEXT-East, PDT (Prague Dependency Treebank), and UD (<a href=\"https:\/\/universaldependencies.org\/\">Universal Dependencies<\/a>)) and the Lithuanian standard &#8220;<a href=\"https:\/\/sitti.vdu.lt\/jablonskis-lt\/\">Jablonskis<\/a>&#8221; are mentioned. Furthermore, annotated corpora in various other languages are discussed, with their size and structures detailed, and comparative analyses between the corpora are provided. The authors highlight factors that negatively impact the comparability of corpora and propose a solution to mitigate this issue. They also present a list of countries possessing the largest annotated corpora, introduce the largest annotated corpora by language, size, and structure, and compare them with English corpora.<\/p>\n\n\n\n<p>Readers are introduced to the structure, proportions, text types, styles, and genres of the grammatically annotated corpora currently under development. 
Texts from administrative, scientific, and literary domains are discussed, along with the conditions and restrictions governing their use.<\/p>\n\n\n\n<p>The advantages and disadvantages of various corpus development strategies, whether based on complete texts or selected excerpts, are examined. The intricate notion of the corpus unit is explored in detail, and the following terms are explained: <em>tokenization, token, word,<\/em> and <em>non-word<\/em>. The authors examine cases where a semantic unit comprises several words and, conversely, cases where a single word encompasses two semantic units.<\/p>\n\n\n\n<p>Challenging text elements, such as symbols, numbers, abbreviations, and punctuation marks (for example, <em>3M, i600, FB, 25-hour, !mportant<\/em>), are reviewed. A detailed list of problem cases related to the Lithuanian language, accompanied by explanations, is presented.<\/p>\n\n\n\n<p>The authors also emphasize that the choice of software (such as <em>AntConc<\/em>, <em>LancsBox<\/em>, or <em>Sketch Engine<\/em>) adds a further challenge to segmenting a text into corpus units: each program defines corpus units differently, so the results obtained vary.<\/p>\n\n\n\n<p>Get the latest\u00a0<a href=\"https:\/\/clarin-lt.lt\/?lang=en\">CLARIN-LT<\/a>\u00a0news by following our\u00a0<a href=\"https:\/\/www.facebook.com\/profile.php?id=100087289837974\">Facebook page<\/a>\u00a0and visiting our\u00a0<a href=\"https:\/\/clarin-lt.lt\/?page_id=179\">website<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13.png\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"18\" src=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13.png\" alt=\"\" class=\"wp-image-3216\" srcset=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13.png 975w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13-300x6.png 
300w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13-768x14.png 768w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13-100x2.png 100w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13-150x3.png 150w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13-200x4.png 200w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13-450x8.png 450w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13-600x11.png 600w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2025\/11\/image-13-900x17.png 900w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><\/a><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>On 23 July 2025, CLARIN-LT researchers \u2013 Jolanta Kovalevskait\u0117, Erika Rimkut\u0117, and Jurgita Vai\u010denonien\u0117 \u2013 published an article &#8220;Developing new annotated corpora for Lithuanian: Compilation issues&#8221; in the scientific journal Studies about Languages, which presents the origins of Lithuanian text<span class=\"ellipsis\">&hellip;<\/span><\/p>\n<div class=\"read-more\"><a href=\"https:\/\/clarin-lt.lt\/?p=3225\">Read more &#8250;<\/a><\/div>\n<p><!-- end of .read-more 
--><\/p>\n","protected":false},"author":7,"featured_media":3226,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3225","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/posts\/3225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3225"}],"version-history":[{"count":2,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/posts\/3225\/revisions"}],"predecessor-version":[{"id":3228,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/posts\/3225\/revisions\/3228"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/media\/3226"}],"wp:attachment":[{"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}