This page lists resources that I have developed along with my collaborators, including linguistic databases and software tools.

Linguistic Databases

  • IcePaHC

    Icelandic Parsed Historical Corpus (CC-BY; LGPL). With Joel Wallenberg, Eiríkur Rögnvaldsson, Einar Freyr Sigurðsson and Kristján Rúnarsson.

    The Icelandic Parsed Historical Corpus (IcePaHC) is a manually corrected treebank, parsed according to the annotation guidelines of The Penn Parsed Corpora of Historical English (PPCHE), with minor modifications that are specific to Icelandic. It consists of about 1 million words from the 12th century to the 21st. The samples in the corpus are close to being evenly distributed over this period. Most of the text consists of narratives and religious material but some samples from other genres are also included. The file format is labeled bracketing as in the Penn Treebank with a UTF-8 encoding. The corpus is released under a CC BY 4.0 license.

  • Neural MIcePaHC

    Neural Machine-Parsed IcePaHC (CC-BY). With Þórunn Arnardóttir.

    The Neural Machine-Parsed IcePaHC is a machine-parsed treebank which consists of Icelandic texts from the 13th to 20th century, mostly Icelandic sagas. The texts were parsed using the IceNeuralParsingPipeline, a parsing pipeline which includes an Icelandic model of the Berkeley Neural Parser along with pre- and postprocessing steps. The parser was trained on IcePaHC and the parsing scheme of the treebank is therefore the same, although the treebank does not include empty phrases or lemmas. The treebank includes 52 texts. The total word count is 1,716,429 and the total number of clauses is 167,815.

  • IceConTree

    The Icelandic Contemporary Treebank (CC-BY). With Þórunn Arnardóttir.

    The Icelandic Contemporary Treebank (IceConTree) is a machine-parsed treebank annotated according to the IcePaHC annotation scheme. It consists of texts from the Icelandic Gigaword Corpus, parsed using the IceNeuralParsingPipeline. It contains 524,601,329 words in 29,929,132 clauses. The treebank consists of 14 texts, which are mainly media, law and parliamentary text. Within each text, files are divided by year. This division was made after the text was parsed, so the assignment of material to years is not completely accurate.

  • ICoSC

    The Icelandic Confusion Set Corpus (CC-BY). With Steinunn Rut Friðriksdóttir.

    The Icelandic Confusion Set Corpus (ICoSC) is available under a CC-BY licence. It was compiled over the course of eight months by Steinunn Rut Friðriksdóttir and Anton Karl Ingason of the language technology department at the University of Iceland. Included in the ICoSC are CSV spreadsheets containing all collected confusion sets of each phonetic category and their frequencies. The spreadsheets are organized so that, for each set, the total frequency of each candidate is given along with the frequency of each possible PoS tag for that candidate. The seventh and eighth columns of the tables contain binary values indicating whether the confusion set is grammatically disjoint (all PoS tags differ for the two candidates) or grammatically identical (all PoS tags are identical for the two candidates). The final column shows the frequency of the less frequent candidate of the set, which can be used to determine which sets are viable in an experiment. Also included are text files containing the list of words from each category and text files containing all sentence examples from the Icelandic Gigaword Corpus which contain the words for each category. As the n/nn examples are by far the most frequent confusion sets, the corpus also includes a word list and sentence examples for the 55 most frequent sets. There are also spreadsheets containing all of the collected word pairs which are grammatically identical, grammatically disjoint or neither. All files have UTF-8 encoding.

  • IceEC

    The Icelandic Error Corpus (CC-BY). With Lilja Björk Stefánsdóttir and Þórunn Arnardóttir.

    The Icelandic Error Corpus (IceEC) is a collection of texts in modern Icelandic annotated for mistakes related to spelling, grammar, and other issues. The texts are organized by genre. The current version includes sentences from student essays, online news texts, and the Icelandic section of Wikipedia.

  • IceL2EC

    The Icelandic L2 Error Corpus (CC-BY). With Lilja Björk Stefánsdóttir and Þórunn Arnardóttir.

    The Icelandic L2 Error Corpus (IceL2EC) is a collection of texts in modern Icelandic, written by learners of Icelandic as a second language. The texts have been annotated for mistakes related to spelling, grammar, and other issues. Each mistake is marked according to error type using an error code, of which there are 234. The corpus currently consists of 14 files with 3,992 categorized error instances.
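
As a rough illustration of how the ICoSC spreadsheets described above might be used, the sketch below keeps only those confusion sets whose less frequent candidate reaches a chosen frequency threshold. It relies only on the column positions mentioned in the ICoSC description (seventh and eighth columns as binary flags, final column as the lower frequency). The presence of a header row, the comma delimiter, the "1"/"0" flag values, the column names and the sample rows are all assumptions made up for the example, not taken from the corpus.

```python
import csv
import io

# Invented sample data: placeholder words and counts, NOT real ICoSC rows.
# Only the positions of columns 7, 8 and the final column follow the description.
SAMPLE = """cand_a,cand_b,freq_a,freq_b,tags_a,tags_b,disjoint,identical,min_freq
aaa,bbb,120,80,x,y,1,0,80
ccc,ddd,500,3,x,x,0,1,3
"""

def viable_sets(csv_text, min_freq, skip_header=True):
    """Return rows whose final column (frequency of the rarer candidate) is >= min_freq."""
    reader = csv.reader(io.StringIO(csv_text))
    if skip_header:
        next(reader, None)  # assumption: the first row holds column names
    kept = []
    for row in reader:
        if row and int(row[-1]) >= min_freq:
            kept.append(row)
    return kept

def is_disjoint(row):
    """Seventh column: assumed to be "1" when all PoS tags differ."""
    return row[6] == "1"

def is_identical(row):
    """Eighth column: assumed to be "1" when all PoS tags are the same."""
    return row[7] == "1"

rows = viable_sets(SAMPLE, min_freq=50)
for row in rows:
    print(row[0], row[1], "disjoint" if is_disjoint(row) else "not disjoint")
```

For a real spreadsheet, the text could be read with `open(path, encoding="utf-8")`, since all ICoSC files are UTF-8 encoded. The threshold itself is a design choice that depends on how many examples an experiment needs per set.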

Software Tools

  • Annotald X

    A Graphical User Interface for editing treebank files in labeled bracketing format.

    Annotald X is a GUI editor for treebank files in labeled bracketing format. It packages Annotald in a configuration similar to the one that was used to annotate the IcePaHC treebank. Annotald X also includes a user-friendly installation package for Windows.

  • IceNeuralParsingPipeline

    The Icelandic Neural Parsing Pipeline (MIT License). With Þórunn Arnardóttir.

    The Icelandic Neural Parsing Pipeline (IceNeuralParsingPipeline) includes all steps necessary for parsing plain Icelandic text, i.e. preprocessing, parsing and postprocessing. The preprocessing step consists of tokenization, which covers both punctuation splitting and matrix clause splitting. The parsing step uses an Icelandic model of the Berkeley Neural Parser, trained on IcePaHC, with a reported F1 score of 84.74. The output’s annotation scheme is the same as IcePaHC’s, except that neither empty phrases (e.g. traces and zero subjects) nor lemmas are shown. The postprocessing step includes minor steps for cleaning and formatting the parsed text.

  • PaCQL Search Engine

    The Parsed Corpus Query Language and a Search Engine implementation.

    The Parsed Corpus Query Language can be used to perform advanced searches on treebanks, including coding queries that output structured data for use in statistical packages like R and SPSS. The Search Engine implementation running at treebankstudio.org is a fast indexed system that outperforms alternatives in common types of research queries.
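
The treebanks and tools on this page all work with files in labeled bracketing format. For readers unfamiliar with it, here is a minimal sketch of how such text can be read into nested Python tuples. The function names and the example tree are hypothetical and use generic Penn-style labels rather than actual corpus data. This is not how Annotald X, the parsing pipeline or the PaCQL engine read files internally.

```python
def parse_brackets(text):
    """Parse labeled bracketing into a list of (label, children) tuples.

    Children are either nested (label, children) tuples or plain word strings.
    A wrapper bracket with no label, as in "( (S ...))", gets the label "".
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # skip the opening "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    trees = []
    while pos < len(tokens):
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        trees.append(read_node())
    return trees

def leaves(tree):
    """Collect the words of a tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

# Illustrative tree only; labels and words are not taken from any corpus.
EXAMPLE = "(S (NP (PRO she)) (VP (VBD read) (NP (N books))))"

trees = parse_brackets(EXAMPLE)
print(trees[0][0])       # root label
print(leaves(trees[0]))  # words in order
```

To apply this to a treebank file, the file contents can be read with `open(path, encoding="utf-8").read()` and passed to `parse_brackets`. The sketch assumes well-formed input and does no error recovery for unbalanced brackets.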