The use of confusion sets for automatic spelling correction in Icelandic
This thesis covers the use of confusion sets in automatic spelling correction for Icelandic, the compilation and publication of The Icelandic Confusion Set Corpus, and the machine learning experiments carried out on the data. Confusion sets are word pairs that are likely to be mixed up in spelling because they are homophonous. In machine learning experiments with confusion sets, a feature vector is derived from the context surrounding the word, and a classifier is then trained on example sentences containing either member of the word pair. Because the classifier is sensitive to context, real-word errors can be detected and corrected automatically.
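As an illustration of how a feature vector can be derived from the surrounding context, the following is a minimal sketch and not the feature extraction used in the experiments. The function names `context_bag` and `to_vector`, the window size of two tokens and the example sentence are all assumptions made for this example.

```python
from collections import Counter


def context_bag(tokens, target_index, window=2):
    """Count the words within `window` positions of the target word.

    The target word itself is left out, so the features describe only
    the context in which a member of the confusion set occurs.
    """
    start = max(0, target_index - window)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return Counter(token.lower() for token in left + right)


def to_vector(bag, vocabulary):
    """Turn a bag of context words into a fixed-length count vector."""
    return [bag.get(word, 0) for word in vocabulary]


# Illustrative sentence (an assumption, not taken from the corpus).
tokens = ["að", "mörgu", "leyti", "er", "þetta", "gott"]
bag = context_bag(tokens, target_index=2, window=2)

# Hypothetical vocabulary; in practice it would be built from training data.
vocabulary = ["að", "mörgu", "er", "þetta", "gott"]
vector = to_vector(bag, vocabulary)
```

A vector of this kind, paired with the label of the word that actually occurred, forms one training example for a classifier.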
The Icelandic Confusion Set Corpus contains 27 categories of homophonous word pairs whose members are at Levenshtein distance 1 from each other. It contains word lists for each category, frequency tables with information on the words taken from the Icelandic Gigaword Corpus, and information on whether the words within each pair are grammatically disjoint or identical. The corpus has been made accessible on the open source repository of CLARIN-IS.
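The Levenshtein-distance criterion can be sketched as follows. This is a standard dynamic-programming implementation written for illustration, not the code used to compile the corpus; the helper name `is_candidate_pair` and the example word pair are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_candidate_pair(word_a: str, word_b: str) -> bool:
    """Hypothetical filter: keep only pairs exactly one edit apart."""
    return levenshtein(word_a, word_b) == 1


# Illustrative pair differing only in i/y (assumed example).
print(is_candidate_pair("leiti", "leyti"))
```

The distance check alone does not establish that a pair is homophonous; it only narrows the candidates to words that differ by a single edit.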
In the machine learning experiments done for this thesis, two feature extraction methods were compared. The results show that feature extraction using a bag-of-words method generally yields higher accuracy, precision, recall and F-score than feature extraction using handwritten grammatical rules. The grammatical rules, however, work better on the grammatically disjoint pairs, whereas the bag-of-words model works better on the grammatically identical pairs. The results also show that a decision tree or a neural network works best with the grammatical features, while a logistic regression classifier or a neural network works best with the bag-of-words features.
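For reference, the four evaluation measures named above can be computed for one word pair as in the sketch below. It assumes that one member of the pair is treated as the positive class and that the F-score is the balanced F1 score; neither assumption is stated in the text, and the function name `binary_metrics` and the labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall and F1 for one confusion pair.

    `positive` is the label treated as the positive class (an assumption
    made for this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up gold labels and predictions, for illustration only.
gold = ["leiti", "leiti", "leyti", "leyti"]
predicted = ["leiti", "leyti", "leyti", "leyti"]
scores = binary_metrics(gold, predicted, positive="leiti")
```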