A Faroese part-of-speech tagger built with Icelandic methods. Data preparation, training and evaluation
This thesis describes the development of a dedicated, high-accuracy part-of-speech (PoS) tagger for Faroese. ABLTagger, a state-of-the-art neural PoS tagger for Icelandic, was trained on the 100,000-word Sosialurin PoS-tagged corpus for Faroese, which was standardised with methods previously applied to Icelandic corpora. The tagger was supplemented with a novel Experimental Database of Faroese Inflection (EDFM), which contains morphological information on 67,488 Faroese words with about one million inflectional forms. The resulting PoS-tagging model for Faroese achieves an overall accuracy of 91.40% when evaluated with 10-fold cross-validation, currently the highest accuracy for a dedicated Faroese PoS-tagging implementation. The tagging model, the morphological database, a proposed revised PoS tagset for Faroese, and a revised and standardised Sosialurin corpus are all presented as products of this project and are made available for further research in Faroese language technology.
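To make the evaluation protocol concrete, the following is a minimal sketch of 10-fold cross-validation with token-level accuracy. It is an illustration under stated assumptions, not the code used in this project: the most-frequent-tag baseline is only a stand-in for a real tagger such as ABLTagger, the function names `k_fold_splits`, `train_baseline` and `cross_validate` are hypothetical, the folds are contiguous rather than shuffled, and the sketch reports one accuracy per fold, which may differ from how the overall figure above was aggregated.

```python
from collections import Counter, defaultdict


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if n_items < k:
        raise ValueError("need at least as many items as folds")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def train_baseline(sentences):
    """Most-frequent-tag baseline (assumption: a placeholder for a real tagger)."""
    tags_per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            tags_per_word[word][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # used for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in tags_per_word.items()}
    return lambda word: lexicon.get(word, default_tag)


def cross_validate(sentences, k=10):
    """Return the token-level accuracy of each of the k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(sentences), k):
        tagger = train_baseline([sentences[i] for i in train_idx])
        correct = total = 0
        for i in test_idx:
            for word, gold_tag in sentences[i]:
                correct += tagger(word) == gold_tag
                total += 1
        accuracies.append(correct / total)
    return accuracies
```

Here each sentence is a list of `(word, tag)` pairs. A single summary figure can be obtained as `sum(accuracies) / len(accuracies)`; pooling all tokens across folds instead would give a slightly different number when fold sizes differ.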