Þórunn Arnardóttir

Project Manager; MA thesis (2020)

MA thesis:

An Icelandic Neural Parsing Pipeline: Training, evaluation and resources

Language technology has progressed considerably in recent decades, following general technological development. Languages differ greatly in how well-equipped they are, particularly in how many natural language processing tools and how much data are available for each of them. Icelandic tools for natural language processing have grown in number over the last decade and their accuracy has improved, but no parser that delivers a deep and accurate parse has yet been made available.
This thesis describes a new Icelandic parsing pipeline, its development and evaluation, and the corpora resulting from it. The pipeline takes raw Icelandic text as input and delivers its parsed counterpart, making parsing accessible to a large group of people. It preprocesses, parses and postprocesses the text, so users do not have to process the text any further. A state-of-the-art neural parser is included which delivers a fast parse, making the parsing of large texts feasible. The structure of the parsing pipeline is described along with the parser's training and evaluation. Two new treebanks created using the pipeline, consisting of approximately 525 million words in total, are also described. The parsing pipeline and the two treebanks are open-source, making them as useful as possible for anyone interested in using or further developing them.