Processing strings that are semantically distinct but easily confused with each other, often because they are pronounced identically, is a prime example of context dependency in Natural Language Processing. The problem arises when a system must decide whether a bank is a ‘river bank’ or a ‘financial institution’. It also challenges systems for context-sensitive spelling and grammar correction, because pairs like their/there and I/me are a common source of the errors that such systems must address. In practice, this type of context dependency can be especially prominent in languages with rich morphology, where large paradigms of inflected word forms lead to a proliferation of such confusion sets. In this paper, we present our novel confusion set corpus for Icelandic, as well as our findings from an experiment that uses well-known classification algorithms to disambiguate the confusion sets that appear in our corpus.
BibTeX:
@INPROCEEDINGS {fridriksdottir2020disambiguating,
author = "Friðriksdóttir, Steinunn Rut and Ingason, Anton Karl",
title = "Disambiguating Confusion Sets in a Language with Rich Morphology",
booktitle = "Proceedings of ICAART 12 (International Conference on Agents and Artificial Intelligence)",
year = "2020"
}