Completed MA theses
Hinrik Hafsteinsson (2020)
A Faroese part-of-speech tagger built with Icelandic methods. Data preparation, training and evaluation
This thesis describes the development of a dedicated, high-accuracy part-of-speech (PoS) tagging solution for Faroese. To achieve this, a state-of-the-art neural PoS tagger for Icelandic, ABLTagger, was trained on the 100,000-word Sosialurin PoS-tagged corpus for Faroese, standardised with methods previously applied to Icelandic corpora. This tagger was supplemented with a novel Experimental Database of Faroese Inflection (EDFM), which contains morphological information on 67,488 Faroese words with about one million inflectional forms. This approach produced a PoS-tagging model for Faroese which achieves an overall accuracy of 91.40% when evaluated with 10-fold cross-validation, currently the highest accuracy for a dedicated Faroese PoS-tagging implementation. The tagging model, the morphological database, a proposed revised PoS tagset for Faroese, and a revised and standardised Sosialurin corpus are all presented as products of this project and are made available for use in further research in Faroese language technology.
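As an illustration of the evaluation setup mentioned above, the following is a minimal sketch of 10-fold cross-validation for a tagger. It is not the thesis's evaluation code: the function names `k_fold_splits` and `cross_validated_accuracy`, the `train_fn` interface, and the choice to pool token counts over all folds (rather than average per-fold accuracies) are assumptions made for the example.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists so that every item is held out exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th item starting at position i.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validated_accuracy(sentences, train_fn, k=10):
    """Token-level accuracy pooled over all k held-out folds.

    sentences: list of sentences, each a list of (word, tag) pairs.
    train_fn:  hypothetical callable that takes training sentences and
               returns a tagger mapping a list of words to a list of tags.
    """
    correct = total = 0
    for train, test in k_fold_splits(sentences, k):
        tagger = train_fn(train)
        for sentence in test:
            words = [word for word, _ in sentence]
            gold = [tag for _, tag in sentence]
            predicted = tagger(words)
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
    return correct / total
```

Any training routine can be plugged in as `train_fn`, for example a simple lookup baseline that assigns each word the tag it had in the training folds.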
Hildur Jónsdóttir (2020)
A Parallel Icelandic Dependency Treebank: Creation, Annotation and Evaluation
This thesis describes the creation, annotation and evaluation of an Icelandic dependency treebank. The treebank holds the syntactic annotation that is necessary for parser development and grammar research. Syntactic parsers are useful in various types of information technology applications, and treebanks are the essential training data for data-driven natural language parsers. Parallel corpora have mainly been used for training machine translation systems but can also be used for creating dictionaries and ontologies, and for multilingual and cross-lingual document classification. The first Icelandic parallel dependency treebank, presented here, is aligned with 19 other languages and is based on the Universal Dependencies (UD) annotation scheme. Studies on cross-lingual modeling have been growing steadily since the first UD treebanks were published, and it could be a beneficial step for less-resourced languages like Icelandic to become a part of this international research. Creating a treebank can be an extremely laborious task, and it is therefore important to utilize accessible methods and data applicable for research. Here, the method of preprocessing syntactic relations using delexicalized parsing was explored. The description of dependency grammar for Icelandic according to the UD annotation scheme is documented in an appendix, and the Icelandic parallel UD corpus, Icelandic PUD, will be published as part of the UD project, version 2.6.
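To make the idea of delexicalized parsing concrete, here is a minimal sketch of how CoNLL-U data (the ten-column, tab-separated format used by UD) might be delexicalized before training a parser. It is an assumption-laden illustration, not the procedure used in the thesis: the function name `delexicalize_conllu` is hypothetical, and blanking the FORM and LEMMA columns with `_` is only one common way to remove lexical information.

```python
def delexicalize_conllu(text):
    """Blank the FORM and LEMMA columns of every token line in CoNLL-U text.

    Comment lines (starting with '#') and blank sentence separators are
    passed through unchanged, so '# text = ...' comments still contain
    the original words.
    """
    output = []
    for line in text.splitlines():
        if line and not line.startswith("#"):
            columns = line.split("\t")
            if len(columns) == 10:  # a well-formed CoNLL-U token line
                columns[1] = "_"    # FORM
                columns[2] = "_"    # LEMMA
                line = "\t".join(columns)
        output.append(line)
    return "\n".join(output)
```

A parser trained on such data sees only part-of-speech tags and morphological features, which is what allows a model trained on one language to propose a first-pass analysis for another.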
Steinunn Rut Friðriksdóttir (2020)
The use of confusion sets for automatic spelling correction in Icelandic
This essay covers the use of confusion sets in automatic spelling correction for Icelandic, the compilation and publication of The Icelandic Confusion Set Corpus and the machine learning experiments done on the data. Confusion sets are word pairs that are likely to get mixed up in spelling due to their homophonous properties. In machine learning experiments done with confusion sets, a feature vector is derived from the surrounding context of the word. A classifier is then trained on sentence examples containing the word pair. Through this context sensitivity, real-word errors can be detected and corrected automatically.
The Icelandic Confusion Set Corpus contains 27 categories of homophonous word pairs which are at Levenshtein distance 1 from each other. It contains lists of the words in each category, frequency tables with information on the words taken from the Icelandic Gigaword Corpus, as well as information on whether the words within the pair are grammatically disjoint or identical. The corpus has been made accessible on the open source repository of CLARIN-IS.
In the machine learning experiments done for this thesis, two feature extraction methods were compared. The results show that feature extraction using a bag-of-words method generally yields higher accuracy, precision, recall and F-score than feature extraction with handwritten grammatical rules. The grammatical rules, however, work better on the grammatically disjoint pairs, whereas the bag-of-words model works better for grammatically identical pairs. The results also show that a decision tree or a neural network works best for the grammatical features, but a logistic regression classifier or a neural network works best for the bag-of-words features.
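The distance criterion above can be checked with a standard dynamic-programming implementation of Levenshtein distance, sketched below. The function name is hypothetical, and the word pair in the usage note is chosen for illustration only; it is not claimed to come from the corpus.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, 1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("leiti", "leyti") == 1` holds because the two spellings differ by a single substitution, so such a pair would satisfy the distance-1 condition.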
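As a rough illustration of the bag-of-words approach described in this abstract, the sketch below collects the words surrounding a target word into a count-based feature dictionary that a classifier could consume. The function name, the window size, and the lowercasing step are assumptions for the example; the thesis's actual feature extraction may differ.

```python
from collections import Counter


def context_bag_of_words(tokens, target_index, window=3):
    """Count the words within `window` positions of the target word.

    The target word itself is excluded, since the classifier's job is to
    decide which member of the confusion pair belongs in that slot.
    """
    start = max(0, target_index - window)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return Counter(word.lower() for word in left + right)
```

Each training sentence containing one member of a confusion pair would yield one such feature dictionary, labeled with the word that actually occurred.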
Þórunn Arnardóttir (2020)
An Icelandic Neural Parsing Pipeline: Training, evaluation and resources
Language technology has seen much progress in the last decades following general technological development. Languages vary in how well-equipped they are, and a major difference lies in how many natural language processing tools and how much data are available for each language. Icelandic tools for natural language processing have grown in number in the last decade and their accuracy has improved, but no parser which delivers a deep and accurate parse has yet been made available.
This thesis describes a new Icelandic parsing pipeline, its development, evaluation and the corpora resulting from it. The parsing pipeline takes raw Icelandic text as input and delivers its parsed counterpart, making parsing available to a large group of people. The pipeline preprocesses, parses and postprocesses the text, so users do not have to process the text any further. A state-of-the-art neural parser is included which delivers a fast parse, making the parsing of large texts feasible. The parsing pipeline's structure is described along with the parser's training and evaluation. Two new treebanks created with the pipeline, which together consist of approximately 525 million words, are also described. The parsing pipeline and the two treebanks are open-source, making them as useful as possible for anyone interested, for use and further development.
Starkaður Barkarson (2018)
Training the PoS-tagger Stagger on a new gold standard
PoS-tagging is an important basis for various language technology tasks. The precision of Icelandic PoS-tagging is still relatively low because of the morphological complexity involved and because of the limitations of the gold standard that has been used for previous experiments. Recently, a new gold standard was introduced and the present work examines the possibilities that arise from using this new resource in training. A series of experiments is carried out and presented, including a test of how well the PoS tagger Stagger performs on texts from an unseen genre.
Tinna Frímann Jökulsdóttir (2018)
“I didn't understand that – please try again”: Communication between Icelanders and virtual assistants
The ever-increasing language contact between Icelandic and English has raised a number of concerns regarding its influence on the viability of the Icelandic language. The arrival of so-called dialogue systems, which are largely based on vocal communication, has further increased these concerns due to the interactive, and even personal, nature of the communications that they allow.
The main objective of this project is to map the extent and nature of the use of digital assistants by the Icelandic population. Furthermore, we will study how well Icelandic speakers fare when using these assistants, their attitudes towards their use, whether they would prefer using Icelandic in these communications were it available, and how they foresee their use in 2–3 years. Finally, we will evaluate how well Icelandic language technology is prepared to offer Icelandic voice control.
With this general objective in mind the following report is divided into three parts. In the first part we establish a connection between English influences on the Icelandic linguistic community and predictions of the viability of the Icelandic language. In the second part we present the current state of Icelandic language technology, with special emphasis on dialogue systems. In the third, and main, part, we present the design, execution and results of an on-line survey intended to gather information on the communication of Icelandic language speakers with digital assistants.
Our results show that even though digital assistants are a relatively recent technology, 33.9% of participants reported having used one. Younger participants were more likely to report usage, which was mostly bound to simple commands, although there were some examples of more extended communications. The reports show a general consensus on the importance of making Icelandic available for digital assistants and other voice-controlled technology. A possible explanation of this consensus is the fact that just under 80% of participants reported communication problems with digital assistants, which can be traced either to the assistant's difficulties understanding Icelandic pronunciation of English or to Icelandic names and toponyms. Our results also indicate that the use of digital assistants will continue to increase in the years to come, but that a considerable amount of work is needed in Icelandic language technology if Icelandic is to be the language of these future communications.
Lilja Björk Stefánsdóttir (2018)
Localists and Globalists: Cultural motivation in digital language contact
In recent years, societies have undergone many fundamental changes, many of which are the result of the general trend of globalization. Advancements in access to information and technology have, in a way, completely altered many people’s perceptions of the world around them, leading some to feel the world is getting smaller. Alongside this, the importance of a widely understood international language has increased, and so these transitions can be expected to have an impact on languages, their status and people’s attitudes towards different languages.
The thesis is based on data captured in an online survey which is a part of the project Modeling the Linguistic Consequences of Digital Language Contact, with the aim of examining how globalization and societal changes can affect people’s attitudes towards their native language, Icelandic, and towards the global language of English. I argue that cultural motivation and people’s cultural identities can influence their attitudes towards languages and their willingness to embrace language standards. I do that by systematically comparing two groups from the survey, Cosmopolitans and Localists, focusing on how they answer questions about their attitudes towards Icelandic and English. Comparison of the two groups shows systematic differences in attitudes towards Icelandic and English, indicating that cultural motivation and cultural identity can have an impact on people’s attitudes towards languages.
Kristján Rúnarsson (2017)
Samba: Automatic identification of verbal expressions in Icelandic
This thesis discusses the development of Samba, a software solution designed to identify known verbal expressions in PoS-tagged and lemmatized text. Samba uses a database of verbal expressions which is being developed by Kristín Bjarnadóttir at the Árni Magnússon Institute for Icelandic Studies and to which the author contributed at its inception in the summer of 2015. Samba and the verbal expression database are based on two principles: that the entire predicate-argument structure, along with any other constituents that form a unit with a verb, be included in the analysis of that verb, and that simple verbs and more complex verbal expressions be given a unified treatment. The evaluation of Samba has given positive results, with a usable baseline functionality that was improved significantly during the development process.