A Parallel Icelandic Dependency Treebank: Creation, Annotation and Evaluation
This thesis describes the creation, annotation and evaluation of an Icelandic dependency treebank. The treebank holds syntactic annotation that is necessary for parser development and grammar research. Syntactic parsers are useful in various types of information technology applications, and treebanks are the essential training data for data-driven natural language parsers. Parallel corpora have mainly been used for training machine translation systems, but they can also be used for creating dictionaries and ontologies and for multilingual and cross-lingual document classification. The first Icelandic parallel dependency treebank, presented here, is aligned with 19 other languages and is based on the Universal Dependencies (UD) annotation scheme. The number of studies on cross-lingual modeling has grown steadily since the first UD treebanks were published, and joining this international research could be a beneficial step for less-resourced languages such as Icelandic. Creating a treebank can be an extremely laborious task, and it is therefore important to make use of accessible methods and data that are applicable to research. Here, delexicalized parsing was explored as a method for preprocessing syntactic relations. The description of dependency grammar for Icelandic according to the UD annotation scheme is documented in the appendix, and the Icelandic parallel UD corpus, Icelandic PUD, will be published as part of the UD project, version 2.6.