{"id":3993,"date":"2026-06-15T08:02:59","date_gmt":"2026-06-15T06:02:59","guid":{"rendered":"https:\/\/clarin-lt.lt\/?p=3993"},"modified":"2026-06-15T08:03:55","modified_gmt":"2026-06-15T06:03:55","slug":"introducing-a-new-clarin-lt-resource-the-simas-corpus","status":"publish","type":"post","link":"https:\/\/clarin-lt.lt\/?p=3993","title":{"rendered":"Introducing a New CLARIN-LT Resource: The SIMAS Corpus"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7.png\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"18\" src=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7.png\" alt=\"\" class=\"wp-image-3988\" srcset=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7.png 975w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7-300x6.png 300w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7-768x14.png 768w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7-100x2.png 100w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7-150x3.png 150w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7-200x4.png 200w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7-450x8.png 450w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7-600x11.png 600w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-7-900x17.png 900w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><\/a><\/figure>\n\n\n\n<p>In May 2025, the <em><a href=\"https:\/\/clarin-repo.lt\/items\/6d76cc64-2192-4081-94ba-2dc5664968b4\">Morphologically and Syntactically Annotated Corpus SIMAS<\/a> <\/em>was deposited in the <a href=\"https:\/\/clarin-lt.lt\/?page_id=86\"><em>CLARIN-LT repository<\/em><\/a>. 
The corpus is part of the European Union\u2019s <a href=\"https:\/\/commission.europa.eu\/strategy-and-policy\/recovery-plan-europe_en\"><em>NextGenerationEU<\/em><\/a> project \u201cMorphologically and syntactically annotated text models for training (gold standards)\u201d (No. 02-098-K-0001), carried out by the <a href=\"https:\/\/sitti.vdu.lt\/en\/\"><em>Institute of Digital Resources and Interdisciplinary Research (SITTI)<\/em><\/a> at <a href=\"https:\/\/www.vdu.lt\/en\/\"><em>Vytautas Magnus University (VMU)<\/em><\/a>. <strong>Project leader:<\/strong> <em>Assoc. Prof. <\/em><a href=\"https:\/\/hdl.handle.net\/20.500.12259\/154977\"><em>Erika Rimkut\u0117<\/em><\/a>. The project fosters technological innovation for the Lithuanian language and offers tangible benefits to the public, government agencies, and the business sector (additional information is available <a href=\"https:\/\/sitti.vdu.lt\/morfologiskai-ir-sintaksiskai-anotuotu-tekstynu-modeliai-dirbtinio-intelekto-apmokymui\/\"><em>here<\/em><\/a>).<\/p>\n\n\n\n<p>The &#8220;SIMAS&#8221; corpus comprises original texts from a range of genres, including fiction, academic writing, administrative texts, and journalism, all written by Lithuanian authors between 2005 and 2025. It consists of complete texts rather than text fragments. The corpus was first annotated automatically for morphology and syntax, and the annotation was then reviewed by linguists. Automatic morphological annotation was performed with the tool <a href=\"https:\/\/sitti.vdu.lt\/morfuoklis\/lt\"><em>Morfuoklis<\/em><\/a>. Automatic syntactic parsing was performed with the international tool <a href=\"https:\/\/lindat.mff.cuni.cz\/services\/udpipe\/\"><em>UDPipe<\/em><\/a>. 
The corpus is annotated according to the international <a href=\"https:\/\/universaldependencies.org\/lt\/\"><em>Universal Dependencies (UD) standard<\/em><\/a>, the morphological annotation standard <a href=\"https:\/\/sitti.vdu.lt\/jablonskis-lt\/\"><em>Jablonskis<\/em><\/a>, and, for syntactic analysis, the <a href=\"https:\/\/sitti.vdu.lt\/wp-content\/uploads\/2026\/05\/UD_standarto_gaires.pdf\"><em>Universal Dependencies Standard: the Lithuanian Syntactic Annotation Guidelines<\/em><\/a>. Corpus size: 10,010,420 words (12,221,575 tokens).<\/p>\n\n\n\n<p>More information about the project: <a href=\"https:\/\/sitti.vdu.lt\/morfologiskai-ir-sintaksiskai-anotuotu-tekstynu-modeliai-dirbtinio-intelekto-apmokymui\/\"><em>Morphologically and syntactically annotated text models for training (gold standards)<\/em><\/a>.<\/p>\n\n\n\n<p>Additionally, the <a href=\"https:\/\/hmf.vdu.lt\/en\/home\/\"><em>VMU Faculty of Humanities<\/em><\/a> website features an article by Migl\u0117 \u017demriet\u0117, <a href=\"https:\/\/hmf.vdu.lt\/morfologiskai-ir-sintaksiskai-anotuotas-lietuviu-kalbos-tekstynas-simas\/\"><em>Morphologically and Syntactically Annotated Lithuanian Corpus SIMAS<\/em><\/a>, which discusses the need for technological resources for the Lithuanian language and the significance of various technological projects for Lithuania.<\/p>\n\n\n\n<p>Follow CLARIN-LT news on our <em><a href=\"https:\/\/www.facebook.com\/profile.php?id=100087289837974\">Facebook account<\/a><\/em> and <a href=\"https:\/\/clarin-lt.lt\/?page_id=104\"><em>website<\/em><\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8.png\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"18\" src=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8.png\" alt=\"\" class=\"wp-image-3989\" srcset=\"https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8.png 975w, 
https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8-300x6.png 300w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8-768x14.png 768w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8-100x2.png 100w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8-150x3.png 150w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8-200x4.png 200w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8-450x8.png 450w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8-600x11.png 600w, https:\/\/clarin-lt.lt\/wp-content\/uploads\/2026\/06\/image-8-900x17.png 900w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><\/a><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>In May 2025, the Morphologically and Syntactically Annotated Corpus SIMAS was deposited in the CLARIN-LT repository. \u00a0This is part of the European Union\u2019s NextGenerationEU project \u201cMorphologically and syntactically annotated text models for training (gold standards)\u201d (No. 
02-098-K-0001) carried out by<span class=\"ellipsis\">&hellip;<\/span><\/p>\n<div class=\"read-more\"><a href=\"https:\/\/clarin-lt.lt\/?p=3993\">Read more &#8250;<\/a><\/div>\n<p><!-- end of .read-more --><\/p>\n","protected":false},"author":7,"featured_media":3994,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3993","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/posts\/3993","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3993"}],"version-history":[{"count":3,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/posts\/3993\/revisions"}],"predecessor-version":[{"id":3997,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/posts\/3993\/revisions\/3997"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=\/wp\/v2\/media\/3994"}],"wp:attachment":[{"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3993"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3993"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/clarin-lt.lt\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3993"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}