Difference between revisions of "Download"

From Icelandic Parsed Historical Corpus (IcePaHC)

Revision as of 16:57, 11 April 2011

Introduction

The Icelandic Parsed Historical Corpus (IcePaHC) is a project that aims to construct a diachronic corpus with samples of written Icelandic from all periods from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. For historical texts spelling is modernized for phonological change.

Download Version 0.4 (LGPL)

To get access to the current version of the Icelandic Parsed Historical Corpus (IcePaHC) you can download the following zip-file, which contains the raw data of the corpus in labeled bracketing format. Since this is an early preview version you can expect to find some uncorrected mistakes. Please let us know about those so they can be corrected before our next release.

Current release

Previous releases

The corpus, as well as the software developed as part of the IcePaHC project, is released under the LGPL license to ensure compatibility with other LGPL-licensed NLP tools, notably the IceNLP toolkit, which is used extensively in the development of the corpus.

The corpus is free as in beer and as in speech, and there is no registration wall. We recommend citing the latest released version when using the corpus for research, so that results can be replicated. The most up-to-date version, along with information on the current state of development, is available in our version control repository on GitHub.

Getting started using the corpus

Since the corpus uses the labeled bracketing format it is compatible with programs that assume such annotation. We recommend using the CorpusSearch program developed by Beth Randall at UPenn. If you have copied the corpus to the directory "/home/chomsky/icepahc" and saved the CorpusSearch jar file in "/home/chomsky/corpussearch", you can give a command like the following to search the corpus using a query in a text file named datsubj.q.

java -classpath /home/chomsky/corpussearch/CS_2.002.75.jar csearch/CorpusSearch datsubj.q /home/chomsky/icepahc/*.psd

Let us assume that datsubj.q is a query that picks out all dative subjects. The file could look like the following:

node: IP*

query: (IP* idoms NP-SBJ) AND (NP-SBJ idoms *-D)

If you run the command above using a file like that, CorpusSearch will return a file called datsubj.out with all sentences in the corpus that contain dative subjects. Read the CorpusSearch documentation and the annotation guidelines for the corpus to find out how to do more.
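The command line above can also be assembled from a script, which helps when the same query is run repeatedly. The following is a minimal sketch, not part of the IcePaHC or CorpusSearch distribution: the function names `build_corpussearch_command` and `run_query` are hypothetical, and the paths are the example paths used on this page. It assumes `java` is on the system path and that the .psd files sit directly in the given directory.

```python
import glob
import os
import subprocess


def build_corpussearch_command(jar_path, query_file, corpus_dir):
    """Return the argument list for the CorpusSearch call shown above.

    Hypothetical helper: a shell expands *.psd by itself, but subprocess
    does not, so the .psd files are listed explicitly here.
    """
    psd_files = sorted(glob.glob(os.path.join(corpus_dir, "*.psd")))
    return [
        "java",
        "-classpath",
        jar_path,
        "csearch/CorpusSearch",
        query_file,
    ] + psd_files


def run_query(jar_path, query_file, corpus_dir):
    """Run CorpusSearch on every .psd file found in corpus_dir."""
    command = build_corpussearch_command(jar_path, query_file, corpus_dir)
    if len(command) == 5:
        # Only the five fixed arguments are present, so no .psd file matched.
        raise FileNotFoundError("no .psd files found in " + corpus_dir)
    subprocess.run(command, check=True)


# Example call, using the paths assumed on this page:
# run_query(
#     "/home/chomsky/corpussearch/CS_2.002.75.jar",
#     "datsubj.q",
#     "/home/chomsky/icepahc",
# )
```

Building the command as a list rather than a single string avoids quoting problems with paths that contain spaces.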

The commands can be shortened by creating aliases and similar shortcuts, but how this is done differs between operating systems. Read the getting started with CorpusSearch documentation for more information.

Texts included in Version 0.4

  • 4439 words from The First Grammatical Treatise (entire text) (12th century)
  • 40844 words from Íslensk hómilíubók (Icelandic book of homilies) (12th century)
  • 25017 words from Morkinskinna (1275)
  • 3459 words from Egils saga (theta fragment) (13th century)
  • 22720 words from Sturlunga saga (13th century)
  • 23040 words from Finnboga saga ramma (1350)
  • 13541 words from Bandamanna saga - Möðruvallabók manuscript (1350)
  • 11486 words from Bandamanna saga - Konungsbók manuscript (1450)
  • 23041 words from Vilhjálms saga Sjóðs (1450)
  • 18042 words from Miðaldaævintýri (1500)
  • 8582 words from Erasmus saga (1525)
  • 13151 words from Georgius saga (1525)
  • 20683 words from the New Testament's Gospel of John (1540)
  • 16421 words from the New Testament's Acts (1540)
  • 15445 words from Okur, a treatise on usury (1611)
  • 17127 words from Ólafur Egilsson's travelogue (1628)
  • 9760 words from Píslarsaga Jóns Magnússonar (1659)
  • 22905 words from Jón Indíafari's travelogue (1661)
  • 11220 words from Söguþáttur af Ármanni og Þorsteini gála (1675)
  • 3204 words from Um ætt Magnúsar Jónssonar (1675)
  • 3857 words from Móðars þáttur (1675)
  • 23013 words from Vídalínspostilla (1720)
  • 18784 words from Fimmbræðra saga (1790)
  • 22099 words from Jón Steingrímsson's biography (1791)
  • 3269 words from Jónas Hallgrímsson's essay on the nature and origin of the earth (1835)
  • 17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
  • 27192 words from Brynjólfur Sveinsson biskup (novel by Torfhildur Hólm) (1882)

Total number of words: 440166

Directories included in Version 0.4

  • /psd: for the parsed versions of the texts (.psd), in plain text utf-8.
  • /txt: for versions of the texts with no markup at all (.txt), in plain text utf-8.
  • /info: for philological and other information about each text (.info), in plain text utf-8.

Note that each token in a given .psd file corresponds to a single line in the .txt file, so for example, the token with ID number "361" will appear on line 361 in the corresponding .txt file.
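The correspondence between token ID numbers in a .psd file and line numbers in the matching .txt file can be used to look up the plain text for a token. The following is a minimal sketch under that stated assumption (token ID 361 is on line 361, counting from 1); the function name `line_for_token` and the file name in the example call are hypothetical and not part of the corpus release.

```python
def line_for_token(txt_path, token_id):
    """Return the line of a .txt file that matches a .psd token ID.

    Assumes, as described on this page, that the token with ID number N
    appears on line N of the corresponding .txt file (1-based).
    """
    if token_id < 1:
        raise ValueError("token IDs are assumed to start at 1")
    # The corpus files are plain text in utf-8.
    with open(txt_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line_number == token_id:
                return line.rstrip("\n")
    raise IndexError("file has fewer than %d lines" % token_id)


# Example call with a hypothetical file name:
# print(line_for_token("/home/chomsky/icepahc/txt/example.txt", 361))
```

Reading the file line by line keeps memory use low even for the longer texts.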

Citation for the Version 0.3 release (of January 6th 2011)

Wallenberg, Joel, Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. 
Icelandic Parsed Historical Corpus (IcePaHC). 
Version 0.3. http://www.linguist.is/icelandic_treebank

Treebank team