Difference between revisions of "Icelandic Parsed Historical Corpus (IcePaHC)"

From Icelandic Parsed Historical Corpus (IcePaHC)
(Annotation guidelines)
(Grants)
(13 intermediate revisions by 5 users not shown)
Line 1: Line 1:
 
This is the wiki for the '''Icelandic Parsed Historical Corpus (IcePaHC)''' (The Icelandic Treebank). It is mostly used to document the annotation standard for those constructing and using the corpus. The annotation scheme for the Icelandic corpus is mostly compatible with [http://www.ling.upenn.edu/~beatrice/annotation/ the Penn historical corpora], and the guidelines here are written as a supplement to the Penn guidelines, so look at Beatrice Santorini's [http://www.ling.upenn.edu/~beatrice/annotation/ guidelines] for further information.
 
==Download==
+
==Download IcePaHC==
  
'''You can download the Icelandic corpus from the [[download|download page]]'''. The corpus is released under a free and open source license (LGPL) and there is no registration wall. The current release is version 0.5 which is a preview release of about 632.000 words from every century between the 12th and the 20th centuries inclusive. We recommend use of released versions to ensure that results can be replicated but between releases you can watch the development at [http://github.com/antonkarl/icecorpus/ Github].
+
'''You can download the Icelandic corpus from the [[download|download page]]'''. The corpus is released under a free and open source license (LGPL) and there is no registration wall. The current release is version 0.9 of 1,002,390 words total from every century between the 12th and the 21st centuries inclusive. All of the text for version 1.0 is already included but some minor corrections remain to be finished. We recommend use of released versions to ensure that results can be replicated but between releases you can watch the development at [http://github.com/antonkarl/icecorpus/ Github].
  
 
==Annotation guidelines==
 
Line 16: Line 16:
 
*[[Splitting and joining words]]
 
 
*[[Index]]
 
 +
*[[Construction-based corrections]]
  
 
==Citation==
 
For the version 0.5 release of July 5th 2011.
+
For the version 0.9 release of August 29th 2011.
  
 
<pre>
Wallenberg, Joel C., Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011.
Icelandic Parsed Historical Corpus (IcePaHC).
Version 0.5. http://www.linguist.is/icelandic_treebank
+
Version 0.9. http://www.linguist.is/icelandic_treebank
</pre>
  
Line 43: Line 44:
 
*[[Icelandic Resources]] for doing Computational Linguistics and Natural Language Processing
 
 
*[[Treebank Resources]] (language independent)
 
 +
*[http://www.tycho.iel.unicamp.br/~tycho/corpus/en/ Tycho Brahe Parsed Corpus of Historical Portuguese]
 
*[[Penn Parsed Corpora of Historical English]]
 
 
*[[Parsed Corpora]] for other languages
 
Line 61: Line 63:
 
* From the '''U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP)''', grant '''#OISE-0853114, Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English'''
 
 
* From the '''University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands)''', grant '''Icelandic Diachronic Treebank (Sögulegur íslenskur trjábanki)'''
 
 +
 +
 +
Disclaimer:
 +
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Revision as of 09:49, 1 September 2011

This is the wiki for the Icelandic Parsed Historical Corpus (IcePaHC) (The Icelandic Treebank). It is mostly used to document the annotation standard for those constructing and using the corpus. The annotation scheme for the Icelandic corpus is mostly compatible with the Penn historical corpora, and the guidelines here are written as a supplement to the Penn guidelines, so look at Beatrice Santorini's guidelines for further information.

Download IcePaHC

You can download the Icelandic corpus from the download page. The corpus is released under a free and open source license (LGPL), and there is no registration wall. The current release is version 0.9, which contains 1,002,390 words in total, drawn from every century from the 12th to the 21st inclusive. All of the text for version 1.0 is already included, but some minor corrections remain to be finished. We recommend using released versions so that results can be replicated; between releases, you can follow development on GitHub.

Annotation guidelines

Citation

For the version 0.9 release of August 29th 2011.

Wallenberg, Joel C., Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. 
Icelandic Parsed Historical Corpus (IcePaHC). 
Version 0.9. http://www.linguist.is/icelandic_treebank

General information

Annotation:

Resources

Treebank team:

Grants

The project within which the Icelandic corpus is constructed is funded in part by the following grants:

  • From the Icelandic Research Fund (RANNÍS), grant no. 090662011, Viable Language Technology beyond English – Icelandic as a test case.
  • From the U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English.
  • From the University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands), grant Icelandic Diachronic Treebank (Sögulegur íslenskur trjábanki).


Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.