Construction of a Lithuanian Treebank ALKSNIS

In 2015, researchers at the Centre of Computational Linguistics (CCL) at Vytautas Magnus University began building ALKSNIS, a Lithuanian treebank (a syntactically annotated corpus). This work is one of the activities of the CLARIN-LT consortium project.

ALKSNIS will consist of 2300 syntactically annotated sentences. The corpus is based on automatically analysed sentences (dependency trees) encoded in the PML (Prague Markup Language) format. This format lets researchers visualise and edit the syntactic trees in the TrED editor[1].

Each node of a tree corresponds to a word, a punctuation mark or another text element (symbol, digit, etc.) within a sentence. The following information is given for each node: 1) the word form as it appears in the text; 2) a lemma; 3) a morphological tag; and 4) a syntactic function (subject, object, etc.). Dependencies are shown as links between words.
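
The node description above can be sketched as a small in-memory data structure. This is only an illustration, not the PML schema used by ALKSNIS: the class name `Node`, the helper `dependencies`, the function labels `Pred`, `Sub` and `Obj`, and the placeholder morphological tags are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """One tree node: a word, punctuation mark or other text element."""
    form: str    # word form as it appears in the sentence
    lemma: str   # dictionary form
    ana: str     # morphological tag (placeholder values below)
    syfun: str   # syntactic function (hypothetical labels below)
    # repr=False keeps the head/children cycle out of the printed form
    head: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def attach(self, child: "Node") -> "Node":
        """Make `child` a dependent of this node and return it."""
        child.head = self
        self.children.append(child)
        return child


def dependencies(root: Node) -> Iterator[Tuple[str, str, str]]:
    """Yield (head form, function, dependent form) for every link below root."""
    for child in root.children:
        yield (root.form, child.syfun, child.form)
        yield from dependencies(child)


# "Vaikas skaito knygą." ("The child reads a book."), punctuation omitted
root = Node("skaito", "skaityti", "<verb-tag>", "Pred")
root.attach(Node("Vaikas", "vaikas", "<noun-tag>", "Sub"))
root.attach(Node("knygą", "knyga", "<noun-tag>", "Obj"))

for head, function, dependent in dependencies(root):
    print(f"{dependent} --{function}--> {head}")
```

Storing the head on each node and the dependents on the head makes it easy to walk the tree in either direction, which is what a tree editor needs for both drawing and editing.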

The morphological tag set of the corpus is based on the MULTEXT-East format[2]. The automatically produced syntactic annotation is corrected according to guidelines created by CCL researchers, following the rules of the Prague Dependency Treebank. All sentences are manually checked and corrected by a group of linguists.

We also attach some of the syntactically annotated sentences. The TrED editor and a style file are needed to view files with the .pml extension. After installing TrED, the user needs to define what information is shown at each node of a syntactic tree. To do this, select the wizard button and type the following code:

context: .*
hint:
node:${lemma}
node:${form}
node:${ana}
node:${syfun}
text:${form}
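
To make the effect of this style code more concrete, here is a rough Python sketch of how the `${...}` placeholders could be expanded for one node. It only imitates the idea and is not TrED's actual implementation; the function names `node_labels` and `sentence_text`, the sample node values, and the joining of forms with spaces are assumptions.

```python
import re

STYLESHEET = """\
context: .*
hint:
node:${lemma}
node:${form}
node:${ana}
node:${syfun}
text:${form}
"""

# Matches placeholders such as ${lemma}
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def expand(pattern, node):
    """Replace each ${name} with the node's value for that attribute."""
    return PLACEHOLDER.sub(lambda m: str(node.get(m.group(1), "")), pattern)


def node_labels(stylesheet, node):
    """Return one label per `node:` line, in stylesheet order."""
    labels = []
    for line in stylesheet.splitlines():
        key, _, pattern = line.partition(":")
        if key.strip() == "node":
            labels.append(expand(pattern.strip(), node))
    return labels


def sentence_text(stylesheet, nodes):
    """Build the sentence line from the `text:` pattern (space-joined here)."""
    for line in stylesheet.splitlines():
        key, _, pattern = line.partition(":")
        if key.strip() == "text":
            return " ".join(expand(pattern.strip(), n) for n in nodes)
    return ""


# Sample node with placeholder tag and a hypothetical function label
node = {"lemma": "knyga", "form": "knygą", "ana": "<noun-tag>", "syfun": "Obj"}
print(node_labels(STYLESHEET, node))
```

Under these assumptions, each node is drawn with four label lines (lemma, form, morphological tag, syntactic function), while the `text:` line determines how the sentence itself is displayed.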

Alternatively, you can view the example trees in the PDF files.

[1] See https://ufal.mff.cuni.cz/tred/

[2] See http://nl.ijs.si/ME/V4/msd/html/index.html
