Download

Revision as of 13:30, 5 July 2011 by Anton (Talk | contribs) (Download Version 0.5, (LGPL))


Introduction

The Icelandic Parsed Historical Corpus (IcePaHC) is a project that aims to construct a diachronic corpus with samples of written Icelandic from all periods, from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. In historical texts, the spelling is modernized with respect to phonological change.

Download Version 0.5 (LGPL)

To access the current version of the Icelandic Parsed Historical Corpus (IcePaHC), download the following zip file, which contains the raw data of the corpus in labeled bracketing format. If you use Windows, we provide an easy-to-use setup file that installs the corpus and a graphical user interface, which can be started from your desktop. Since this is an early preview version, you can expect to find some uncorrected mistakes. Please let us know about them so they can be corrected before our next release.

Version 0.5 - Current release

Version 0.4

Version 0.3

Version 0.2

Version 0.1

The corpus, as well as the software developed as part of the IcePaHC project, is released under an LGPL license to ensure compatibility with other LGPL-licensed NLP tools, notably the IceNLP toolkit, which is used extensively in the development of the corpus.

The corpus is free as in beer and as in speech, and there is no registration wall. We recommend citing the latest released version when using the corpus for research, so that results can be replicated. However, the most up-to-date version and information on the current state of development can be found in our version control repository on GitHub.

Getting started using the corpus

If you use Windows, the easiest way to get started is to download IcePaHC for Windows (download above) and follow the on-screen instructions. IcePaHC for Windows uses CorpusSearch to run queries, so read the CorpusSearch documentation in addition to this web page. With IcePaHC for Windows you do not have to type in commands to start the program; you simply click the IcePaHC icon on your Desktop. Java is required. If you do not have it installed, the installation will direct you to a Java download page.

Since the corpus uses the labeled bracketing format, it is compatible with programs that assume such annotation. We recommend using the CorpusSearch program developed by Beth Randall at UPenn. If you have copied the corpus to the directory "/home/chomsky/icepahc" and saved the CorpusSearch jar file in "/home/chomsky/corpussearch", you can use a command like the following to search the corpus with a query stored in a text file named datsubj.q:

java -classpath /home/chomsky/corpussearch/CS_2.002.75.jar csearch/CorpusSearch datsubj.q /home/chomsky/icepahc/*.psd

Let us assume that datsubj.q is a query that picks out all dative subjects. The file could look like the following:

node: IP*

query: (IP* idoms NP-SBJ) AND (NP-SBJ idoms *-D)

If you run the command above with such a file, CorpusSearch will write a file called datsubj.out containing all sentences in the corpus that have dative subjects. Read the CorpusSearch documentation and the annotation guidelines for the corpus to find out how to do more.
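To make the logic of the query concrete, here is a minimal Python sketch of what its two idoms (immediately dominates) conditions check on a single labeled-bracketing tree. It is not CorpusSearch and does not reproduce its full search semantics (for example, labels carrying numeric indices are not handled). The parser, the function names and the example tree in the usage comment are hypothetical illustrations, not corpus data.

```python
import fnmatch


def parse(text):
    """Parse one labeled-bracketing tree into (label, children) tuples.

    Leaves are plain strings. A wrapper node without a label, as in
    "( (IP-MAT ...) (ID ...))", gets the empty string as its label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip the opening "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield the tree itself and every labeled node below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def idoms(parent, pattern):
    """True if parent has an immediate child whose label matches pattern."""
    return any(
        isinstance(child, tuple) and fnmatch.fnmatchcase(child[0], pattern)
        for child in parent[1]
    )


def has_dative_subject(tree):
    """Mimic: (IP* idoms NP-SBJ) AND (NP-SBJ idoms *-D), same NP-SBJ node."""
    for ip in subtrees(tree):
        if not fnmatch.fnmatchcase(ip[0], "IP*"):
            continue
        for child in ip[1]:
            if isinstance(child, tuple) and child[0] == "NP-SBJ" and idoms(child, "*-D"):
                return True
    return False


# Usage with a simplified, made-up tree:
# has_dative_subject(parse("(IP-MAT (NP-SBJ (PRO-D mér)) (VBPI líkar))"))
```

For real searches, CorpusSearch remains the recommended tool; a sketch like this is only useful for understanding what a query asks for.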

Note that the commands can be simplified, for example by creating aliases, but the details differ between operating systems. Read the CorpusSearch getting-started documentation for more information.
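As one operating-system-independent alternative to shell aliases, the command can be wrapped in a short Python script. The following is only a sketch: the function names build_command and run_query are hypothetical, the paths are the example paths used on this page, and it assumes that java is on your PATH. The script expands the *.psd pattern itself, so it does not depend on the shell doing so.

```python
import glob
import os
import subprocess


def build_command(jar, query, corpus_dir):
    """Build the CorpusSearch command line shown on this page as an argument list."""
    # Expand *.psd here instead of relying on the shell.
    psd_files = sorted(glob.glob(os.path.join(corpus_dir, "*.psd")))
    return ["java", "-classpath", jar, "csearch/CorpusSearch", query, *psd_files]


def run_query(jar, query, corpus_dir):
    """Run CorpusSearch and raise an error if it exits with a non-zero status."""
    subprocess.run(build_command(jar, query, corpus_dir), check=True)


if __name__ == "__main__":
    # Example paths from this page; adjust them to your own installation.
    run_query(
        "/home/chomsky/corpussearch/CS_2.002.75.jar",
        "datsubj.q",
        "/home/chomsky/icepahc",
    )
```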

Texts included in Version 0.5

  • 4438 words from Fyrsta málfræðiritgerðin (The First Grammatical Treatise) (1150)
  • 40844 words from Íslensk hómilíubók (Icelandic book of homilies) (1150)
  • 3458 words from Þetubrot Egils Sögu (Theta manuscript of Egils Saga) (1250)
  • 22719 words from Íslendinga saga (1250)
  • 6194 words from Grágás. Lagasafn íslenska þjóðveldisins. (1270)
  • 25017 words from Morkinskinna (1275)
  • 23347 words from Alexanders saga (1300)
  • 13541 words from Bandamanna saga (Möðruvallabók text) (1350)
  • 23040 words from Finnboga saga ramma (1350)
  • 11485 words from Bandamanna saga (Konungsbók text) (1450)
  • 23040 words from Vilhjálms saga Sjóðs (1450)
  • 18042 words from Miðaldaævintýri (1475)
  • 13151 words from Georgíus saga (Reykjahólabók) (1525)
  • 8582 words from Erasmus saga (Reykjahólabók) (1525)
  • 16420 words from Nýja Testamenti Odds Gottskálkssonar (The New Testament of Oddur Gottskálksson), Postulanna Gjörningar (Acts of the Apostles) (1540)
  • 20682 words from Nýja Testamenti Odds Gottskálkssonar (The New Testament of Oddur Gottskálksson), S. Jóhannis Guðspjöll (Gospel of St. John) (1540)
  • 23384 words from Eintal sálarinnar við sjálfa sig (1593)
  • 15445 words from Okur (1611)
  • 17126 words from Reisubók séra Ólafs Egilssonar (1628)
  • 12690 words from Fimmtíu heilagar hugvekjur Meditationes sacrae (1630)
  • 9759 words from Píslarsaga séra Jóns Magnússonar (1659)
  • 22904 words from Reisubók Jóns Ólafssonar Indíafara (1661)
  • 11220 words from Söguþáttur af Ármanni og Þorsteini gála (1675)
  • 3204 words from Um ætt Magnúsar Jónssonar (1675)
  • 3857 words from Móðars þáttur (1675)
  • 23032 words from Vídalínspostilla (1720)
  • 22297 words from Biskupasögur Jóns prófasts Halldórssonar í Hítardal (1725)
  • 22290 words from Nikulás Klím (1745)
  • 18784 words from Fimmbræðra saga (1790)
  • 22098 words from Ævisaga síra Jóns Steingrímssonar (1791)
  • 3268 words from Um eðli og uppruna jarðarinnar (On the Nature and Origin of the Earth) (1835)
  • 17837 words from Piltur og stúlka (1850)
  • 20380 words from Sagan af Heljarslóðarorrustu (1861)
  • 27191 words from Brynjólfur Sveinsson biskup (1882)
  • 20687 words from Upp við fossa (1902)
  • 20664 words from Leysing (1907)
  • 20305 words from Ofurefli (1908)


Total number of words: 632422

Directories included in Version 0.5

  • /psd: for the parsed versions of the texts (.psd), in plain text utf-8.
  • /txt: for versions of the texts with no markup at all (.txt), in plain text utf-8.

Note that each token in a given .psd file corresponds to a single line in the .txt file, so for example, the token with ID number "361" will appear on line 361 in the corresponding .txt file.

  • /info: for philological and other information about each text (.info), in plain text utf-8.
  • /tagged: for versions of the texts with only part-of-speech tags and lemmas, in plain text utf-8, with each word on its own line and tab-separated markup.
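The correspondence between ID numbers in the .psd files and line numbers in the .txt files can be used to look up the unannotated text of a search hit. The following Python sketch assumes that you have already extracted the numeric ID from the .psd file; the function name and the file name in the usage comment are hypothetical.

```python
def text_line_for_id(txt_path, id_number):
    """Return line number id_number (counting from 1) of a .txt file."""
    if id_number < 1:
        raise ValueError("ID numbers start at 1")
    # The corpus files are plain text in utf-8.
    with open(txt_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line_number == id_number:
                return line.rstrip("\n")
    raise IndexError(f"{txt_path} has fewer than {id_number} lines")


# Usage (hypothetical file name):
# print(text_line_for_id("txt/example.txt", 361))
```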

Citation for the Version 0.5 release (of July 5th 2011)

Wallenberg, Joel, Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. 
Icelandic Parsed Historical Corpus (IcePaHC). 
Version 0.5. http://www.linguist.is/icelandic_treebank

Treebank team