From Icelandic Parsed Historical Corpus (IcePaHC)
Revision as of 08:04, 12 April 2024 by Einarfs (Talk | contribs) (Download Version 2024.03, (CC BY))


The Icelandic Parsed Historical Corpus (IcePaHC) is a project that has built a diachronic corpus with samples of written Icelandic from all periods from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. For historical texts spelling is modernized for phonological change.

Download Version 2024.03, (CC BY)

The current version of the Icelandic Parsed Historical Corpus (IcePaHC) can be downloaded from CLARIN-IS. The download contains the raw data of the corpus in labeled bracketing format.

Version 2024.03 - latest release

The current release, version 2024.03, contains around 1 million words in total, drawn from every century from the 12th to the 21st inclusive. This version includes the same text as version 0.9.

Version 0.9

Version 0.9, released in 2011, contains 1,002,390 words in total, drawn from every century from the 12th to the 21st inclusive. All of the text for version 1.0 is already included, but some minor corrections remain to be finished. If you use Windows, we provide an easy-to-use setup file that installs the corpus and a graphical user interface which can be started from your desktop. Since this is an early preview version, you can expect to find some uncorrected mistakes. Please let us know about those so they can be corrected before our next release.

Version 0.5

Version 0.4

Version 0.3

Version 0.2

Version 0.1

The corpus, as well as software developed as part of the IcePaHC project, is released under an LGPL license to ensure compatibility with other LGPL-licensed NLP tools, notably the IceNLP toolkit, which is used extensively in the development of the corpus.

The corpus is free as in beer and as in speech, and there is no registration wall. We recommend citing the latest released version when using the corpus for research, so that results can be replicated. However, the most up-to-date version and information on the current state of development can be accessed at our version control repository on GitHub.

Getting started using the corpus

If you use Windows, the easiest way to get started is to download IcePaHC for Windows (download above) and follow the on-screen instructions. IcePaHC for Windows uses CorpusSearch to run queries, so read the CorpusSearch documentation in addition to this web page. With IcePaHC for Windows you do not have to type in commands to start the program; you simply click the IcePaHC icon on your desktop. Java is required; if you do not have it installed, the installation will direct you to a Java download page.

Since the corpus uses the labeled bracketing format, it is compatible with programs that assume such annotation. We recommend using the CorpusSearch program developed by Beth Randall at UPenn. If you have copied the corpus to the directory "/home/chomsky/icepahc" and saved the CorpusSearch jar file in "/home/chomsky/corpussearch", you can give a command like the following to search the corpus using a query in a text file named datsubj.q:

java -classpath /home/chomsky/corpussearch/CS_2.002.75.jar csearch/CorpusSearch datsubj.q /home/chomsky/icepahc/*.psd

Let us assume that datsubj.q is a query that picks out all dative subjects. The file could look like the following:

node: IP*

query: (IP* idoms NP-SBJ) AND (NP-SBJ idoms *-D)

If you run the command above using a file like that, CorpusSearch will return a file called datsubj.out with all sentences in the corpus that contain dative subjects. Read the CorpusSearch documentation and the annotation guidelines for the corpus to find out how to do more.
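To make the logic of the query concrete, the following Python sketch applies the same two conditions to hand-written toy trees in labeled bracketing format. It is not how CorpusSearch is implemented, and it is no substitute for it. The two example trees, the function names and the wildcard handling through fnmatch are all assumptions made for illustration; real corpus files also wrap each tree in an outer bracket with an ID node, which this sketch does not handle.

```python
from fnmatch import fnmatchcase

# Hand-written toy trees (not taken from the corpus files).
DATIVE = "(IP-MAT (NP-SBJ (PRO-D honum)) (VBDI líkaði) (NP-OB1 (N-N maturinn)))"
NOMINATIVE = "(IP-MAT (NP-SBJ (PRO-N hann)) (VBDI svaf))"


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf text)
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield the tree itself and every labeled node below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def has_dative_subject(tree):
    """True if some IP* immediately dominates an NP-SBJ that
    immediately dominates a node whose label matches *-D."""
    for label, children in subtrees(tree):
        if not fnmatchcase(label, "IP*"):
            continue
        for child in children:
            if isinstance(child, tuple) and child[0] == "NP-SBJ":
                if any(isinstance(g, tuple) and fnmatchcase(g[0], "*-D")
                       for g in child[1]):
                    return True
    return False


if __name__ == "__main__":
    for text in (DATIVE, NOMINATIVE):
        print(has_dative_subject(parse(text)), text)
```

Note that both conditions are checked against the same NP-SBJ node, which is what makes the match a dative subject rather than any clause that happens to contain a subject and a dative somewhere.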

Note that there are ways to simplify the commands, for example by creating aliases, but these work differently on different operating systems. Read the getting started with CorpusSearch documentation for more information.
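One way to avoid retyping the long command that does not depend on a particular shell is a small wrapper script. The following Python sketch assembles the same java command as shown above and runs it. The function names are hypothetical, and the paths in the usage part are simply the example paths from this page; because no shell is involved, the script expands the *.psd pattern itself.

```python
import glob
import os
import subprocess


def build_command(jar, query, corpus_dir):
    """Return the argument list for one CorpusSearch run over all .psd files."""
    psd_files = sorted(glob.glob(os.path.join(corpus_dir, "*.psd")))
    return ["java", "-classpath", jar, "csearch/CorpusSearch", query, *psd_files]


def run_query(jar, query, corpus_dir):
    """Run the query and raise an error if java exits with a failure code."""
    subprocess.run(build_command(jar, query, corpus_dir), check=True)


if __name__ == "__main__":
    # Example paths from this page; adjust them to your own setup.
    run_query(
        "/home/chomsky/corpussearch/CS_2.002.75.jar",
        "datsubj.q",
        "/home/chomsky/icepahc",
    )
```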

Texts included in Version 0.9

  • 1150: Fyrsta málfræðiritgerðin (The First Grammatical Treatise) (4422 words)
  • 1150: Íslensk hómilíubók (Icelandic book of homilies) (40943 words)
  • 1210: Jarteinabók (10328 words)
  • 1210: Þorláks saga helga (10868 words)
  • 1250: Íslendinga saga (22805 words)
  • 1250: Þetubrot Egils Sögu (Theta manuscript of Egils Saga) (3461 words)
  • 1260: Jómsvíkinga saga (21133 words)
  • 1270: Grágás. Lagasafn íslenska þjóðveldisins. (6203 words)
  • 1275: Morkinskinna (25064 words)
  • 1300: Alexanders saga (23356 words)
  • 1310: Grettis saga Ásmundarsonar (20563 words)
  • 1325: Árna saga biskups (19968 words)
  • 1350: Bandamanna saga (Möðruvallabók text) (13618 words)
  • 1350: Finnboga saga ramma (23036 words)
  • 1350: Mörtu saga og Maríu Magdalenu (17241 words)
  • 1400: Gunnars saga Keldugnúpsfífls (8770 words)
  • 1400: Gunnars saga Keldugnúpsfífls - Part 2 (3164 words)
  • 1400: Víglundar saga (13453 words)
  • 1450: Bandamanna saga (Konungsbók text) (11560 words)
  • 1450: Ectors saga (21063 words)
  • 1450: Júditarbók (6562 words)
  • 1450: Vilhjálms saga Sjóðs (23132 words)
  • 1475: Miðaldaævintýri (18084 words)
  • 1480: Jarlmanns saga og Hermanns (14482 words)
  • 1525: Erasmus saga (Reykjahólabók) (8589 words)
  • 1525: Georgíus saga (Reykjahólabók) (20092 words)
  • 1540: Nýja Testamenti Odds Gottskálkssonar (The New Testament of Oddur Gottskálksson), Postulanna Gjörningar (Acts of the Apostles) (16550 words)
  • 1540: Nýja Testamenti Odds Gottskálkssonar (The New Testament of Oddur Gottskálksson), S. Jóhannis Guðspjöll (Gospel of St. John) (20925 words)
  • 1593: Eintal sálarinnar við sjálfa sig (23327 words)
  • 1611: Okur (15481 words)
  • 1628: Reisubók séra Ólafs Egilssonar (17199 words)
  • 1630: Fimmtíu heilagar hugvekjur Meditationes sacrae (12698 words)
  • 1650: Illuga saga Tagldarbana (20921 words)
  • 1659: Píslarsaga séra Jóns Magnússonar (9825 words)
  • 1661: Reisubók Jóns Ólafssonar Indíafara (23031 words)
  • 1675: Móðars þáttur (3845 words)
  • 1675: Söguþáttur af Ármanni og Þorsteini gála (11228 words)
  • 1675: Um ætt Magnúsar Jónssonar (3187 words)
  • 1680: Sögu-þáttur um Skálholts biskupa fyrir og um siðaskiptin. (10281 words)
  • 1720: Vídalínspostilla (23016 words)
  • 1725: Biskupasögur Jóns prófasts Halldórssonar í Hítardal (22297 words)
  • 1745: Nikulás Klím (22038 words)
  • 1790: Fimmbræðra saga (18860 words)
  • 1791: Ævisaga síra Jóns Steingrímssonar (22369 words)
  • 1830: Hellismanna saga (14988 words)
  • 1835: Um eðli og uppruna jarðarinnar (On the Nature and Origin of the Earth) (3257 words)
  • 1850: Piltur og stúlka (17844 words)
  • 1859: Fimtíu hugvekjur út af pínu og dauða Drottins vors Jesú Krists (20530 words)
  • 1861: Sagan af Heljarslóðarorrustu (20336 words)
  • 1882: Brynjólfur Sveinsson biskup (27342 words)
  • 1883: Hans Vöggur (1927 words)
  • 1888: Grímur kaupmaður deyr (7241 words)
  • 1888: Vordraumur (10753 words)
  • 1902: Upp við fossa (20647 words)
  • 1907: Leysing (20613 words)
  • 1908: Ofurefli (20262 words)
  • 1920: Árin og eilífðin. Prédikanir eftir Harald Níelsson (21234 words)
  • 1985: Margsaga (22295 words)
  • 1985: Sagan öll (20980 words)
  • 2008: Ofsi (21144 words)
  • 2008: Segðu mömmu að mér líði vel - saga um ástir - (21958 words)

Total number of words: 1,002,390
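The per-text counts in the list can be tallied programmatically, for instance to see how the material is distributed over time. The following Python sketch assumes the entries are available as plain lines in the form shown above ("year: title (N words)"); the function name and the grouping by hundred-year block are choices made for illustration, and only three entries are reproduced here.

```python
import re
from collections import Counter

# Matches the leading year and the final "(N words)" group of an entry.
ENTRY = re.compile(r"(\d{4}): .*\((\d+) words\)\s*$")

SAMPLE = [
    "• 1150: Fyrsta málfræðiritgerðin (The First Grammatical Treatise) (4422 words)",
    "• 1150: Íslensk hómilíubók (Icelandic book of homilies) (40943 words)",
    "• 2008: Ofsi (21144 words)",
]


def words_per_hundred_years(lines):
    """Sum word counts per hundred-year block (1150 is counted under 1100)."""
    totals = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if match:
            year, count = int(match.group(1)), int(match.group(2))
            totals[year // 100 * 100] += count
    return totals


if __name__ == "__main__":
    totals = words_per_hundred_years(SAMPLE)
    for block in sorted(totals):
        print(block, totals[block])
    print("sample total:", sum(totals.values()))
```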

Directories included in Version 0.9

  • /psd: for the parsed versions of the texts (.psd), in plain text utf-8.
  • /txt: for versions of the texts with no markup at all (.txt), in plain text utf-8.

Note that each token in a given .psd file corresponds to a single line in the .txt file, so for example, the token with ID number "361" will appear on line 361 in the corresponding .txt file.

  • /info: for philological and other information about each text (.info), in plain text utf-8.
  • /tagged: for versions of the texts with only part-of-speech tags and lemmas, in plain text utf-8, with each word on its own line and tab-separated markup.
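The correspondence between ID numbers in the .psd files and line numbers in the .txt files makes it easy to look up the plain text of a token. The following Python sketch assumes that line numbering starts at 1, as the example with ID number 361 suggests; the function name and the file path in the usage part are hypothetical.

```python
from itertools import islice


def line_for_id(txt_path, id_number):
    """Return line number `id_number` (counting from 1) of a .txt file."""
    if id_number < 1:
        raise ValueError("ID numbers are assumed to start at 1")
    with open(txt_path, encoding="utf-8") as handle:
        # Skip the first id_number - 1 lines, then take the next one.
        line = next(islice(handle, id_number - 1, None), None)
    if line is None:
        raise IndexError(f"{txt_path} has fewer than {id_number} lines")
    return line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical file name; use any file from the /txt directory.
    print(line_for_id("txt/example.txt", 361))
```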

Citation for the Version 0.9 release (of August 26th 2011)

Wallenberg, Joel, Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. Icelandic Parsed Historical Corpus (IcePaHC). Version 0.9.

Treebank team