Anton Karl Ingason

Icelandic phrase structure parser

We are making available a grammar for Icelandic phrase structure parsing with the Berkeley parser. This is our initial release of such a grammar, and we aim to continue improving it. The parser was trained on the Icelandic Parsed Historical Corpus (IcePaHC), which was converted to the universal tagset prior to training. The mappings we used can be downloaded here. The grammar also assumes a simplified set of phrase labels compared to the full IcePaHC corpus.

  • Icelandic grammar for the Berkeley parser (grammar version 1.0) (48.7 MB)

  • When running the parser, punctuation marks should be separated from the remainder of a sentence with whitespace. If you have downloaded the Icelandic grammar as grammar_file and have an input file with one sentence per line in input_file, the following command will write machine-parsed trees to output_file. You need to download the Berkeley parser (the .jar file) from its website, where you can also find further documentation.

    java -jar BerkeleyParser-1.7.jar -gr grammar_file -inputFile input_file -outputFile output_file

    Sample input and output files:
  • Sample input file
  • Sample output file
  • Make sure to download the files and open them in a UTF-8 compatible editor. Special characters may not be rendered correctly if you open the files in your browser. The whole parsing process assumes UTF-8 files throughout.
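The input requirements above (whitespace around punctuation, one sentence per line, UTF-8 throughout) can be sketched in Python as follows. This is only a minimal illustration, not the preprocessing used to build the grammar: the function names and the set of characters treated as punctuation are assumptions, and a naive rule like this will also split abbreviations and decimal numbers.

```python
import re

# Assumption: this character set is illustrative only; extend it as needed.
_PUNCT = re.compile(r'([.,;:!?()"„“”])')


def separate_punctuation(sentence):
    """Put whitespace around each punctuation mark and normalize spacing."""
    spaced = _PUNCT.sub(r" \1 ", sentence)
    return " ".join(spaced.split())


def format_input(sentences):
    """Return parser input text: one whitespace-tokenized sentence per line."""
    return "".join(separate_punctuation(s) + "\n" for s in sentences)


def write_input_file(sentences, path):
    """Write the parser input as UTF-8 (path is whatever you pass as input_file)."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(format_input(sentences))


if __name__ == "__main__":
    print(separate_punctuation("Hún las bókina, held ég."))
```

The resulting file can then be passed to the parser as input_file in the command shown above.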

    Note that the universal tagset contains much less information than the full IcePaHC tagset. However, this makes it easier to plug our resources into multilingual solutions that employ the same tagset. See the paper by Slav Petrov, Dipanjan Das and Ryan McDonald for more information about the universal tagset.

    Anton Karl Ingason
    Einar Freyr Sigurðsson
    Eiríkur Rögnvaldsson

    Last Updated on Thursday, 20 November 2014 00:06
     

    Workshop on Formal Ways of Analyzing Variation (FWAV)

    We are planning a workshop on Formal Ways of Analyzing Variation (FWAV) which will be part of the 25th SCL (Scandinavian Conference of Linguistics) in Reykjavik, May 13-15, 2013.

    We invite abstracts for 20-minute papers (plus 10 minutes for questions). Abstract submission for SCL workshops uses the same procedure as the general conference, so please refer to the general call for papers for guidelines (indicate that your abstract is for the FWAV workshop): http://conference.hi.is/scl25/call-for-papers/

    Deadline: November 15, 2012
    Notification of acceptance: December 1, 2012

    Anton Karl Ingason
    Einar Freyr Sigurdsson
    Charles Yang

    Formal Ways of Analyzing Variation (FWAV)

    Labov’s pioneering study on contraction and deletion of the copula in African American Vernacular English (1969) and subsequent work on linguistic variation and change have drawn substantial attention to the relationship between formal analysis and quantitative usage patterns. Robust quantitative regularities have been studied in synchronic as well as diachronic corpus data using a variety of theoretical frameworks. Recently available evidence shows that discrete acceptability judgments in syntax, drawn from a large sample of speakers, also manifest regular quantitative patterns (Thráinsson 2012).

    This themed session is a venue for case studies on formal analyses of variation and their implications for grammatical theory, acquisition and change. A specific focus will be on the use of methodologies that provide ready access to data and development tools to facilitate replication and extension of research results.

    What do formal analyses of variation predict to be possible and impossible?

    The session aims to investigate the empirical content of analyses of speaker variation. Representative research questions include, but are not limited to:
    • What are the limits of variation?
    • Do our analyses provide unifying accounts for apparently disparate clusters of linguistic properties?
    • How does the child analyze a heterogeneous pool of primary linguistic data?
    • What types of diachronic trajectories are consequences of language acquisition under variation?
    • Is the statistical distribution of variation constrained by grammatical factors?
    • How do we make the best use of statistical tools for formal linguistic analysis?
    • On a more practical note, the session hopes to contribute to the practice of replicability, data access, and collaborative development.

    What does the variation attach to?

    We also ask about the relationship between the linguistic machinery and the mechanisms that are responsible for how speakers alternate between functionally equivalent variants. One line of research adopts the design of Chomskyan structure building while proposing independent mechanisms for acquisition of probabilities (Labov 1969, Kroch 1989, Yang 2002). A constraint-based parallel is found in Stochastic OT (Boersma & Hayes 2001). Other proposals suggest that frequency distributions in language use are tightly interwoven with the grammar itself. Guy (1991) argued that repeated rule application in Lexical Phonology was responsible for an exponential decay in final -t/-d production in English. Anttila (1997) and Adger (2006) have proposed analyses where usage probabilities reflect the number of times that equally likely paths through the grammar lead to a particular output. Coetzee (2004) suggested that the comparison-based nature of OT imposes an ordering on the frequency of variants. How can we compare and contrast such a multitude of formal proposals?

    It may not be the case that all instances of variable usage are of the same nature. Even if we assume acquired probabilities are a part of a speaker’s knowledge about language, it may still be the case that the variation is due to other, non-linguistic, factors. Furthermore, different domains of language may be subject to different constraints on variation. It has been suggested that, unlike phonology, syntax is less sensitive to social evaluation (Labov & Harris 1986), but a concrete formulation of this effect is quite a nuanced task (Ingason et al. 2012). The role of interfaces is also important, since variables in syntax can be affected by constraints that operate across the interface, e.g. prosodic constraints on variation in other domains (e.g. Labov 1969, Anttila et al. 2010). Representative questions include:
    • Where does the variation come from and how can we distinguish the formal models empirically?
    • How do we know which type of mechanism is responsible for which part of language usage?
    • How does a formal analysis of variation handle different domains of language and the interfaces between them?

    References

    Adger, David. 2006. Combinatorial variability. Journal of Linguistics 42:503–530.
    Anttila, Arto. 1997. Deriving variation from grammar. In Variation, change, and phonological theory, ed. Frans Hinskens, Roeland van Hout, and W. Leo Wetzels, 35-68. Amsterdam: John Benjamins.
    Anttila, Arto; Matthew Adams; and Michael Speriosu. 2010. The role of prosody in the English dative alternation. Language and Cognitive Processes 25(7-9):946-981.
    Boersma, Paul, and Bruce Hayes. 2001. Empirical tests of the gradual learning algorithm. Linguistic Inquiry 32:45-86. Available on Rutgers Optimality Archive, http://ruccs.rutgers.edu/roa.html.
    Coetzee, Andries. 2004. What it means to be a loser: Non-optimal candidates in Optimality Theory. Ph.D. dissertation, UMass Amherst.
    Fowler, Joy. 1986. The social stratification of (r) in New York City Department Stores, 24 years after Labov. NYU term paper.
    Guy, G. R. 1991. Explanation in variable phonology. Language Variation and Change 3,1:1-22.
    Ingason, Anton Karl, Einar Freyr Sigurðsson and Joel C. Wallenberg. 2012. Antisocial Syntax. Disentangling the Icelandic VO/OV parameter and its lexical remains. Paper presented at DiGS 14, Lisbon, 6 July 2012.
    Kroch, Anthony S. 1989. Reflexes of grammar in patterns of language change. Language Variation and Change 1:199-244.
    Labov, William. 1966. The social stratification of English in New York City. Center for Applied Linguistics, Washington.
    Labov, William. 1969. Contraction, Deletion and Inherent Variability of the English Copula. Language, 45,4:715-762.
    Labov, William, and Wendell A. Harris. 1986. De facto segregation of black and white vernaculars. In Diversity and Diachrony, ed. D. Sankoff, 1–24. Philadelphia: John Benjamins.
    MacDonald, Jeff. 1984. The social stratification of (r) in New York City department stores revisited. Paper written for Anthropology 150, Anthropological Linguistics, for Nancy Bonvillain.
    Thráinsson, Höskuldur. 2012. Ideal speakers and other speakers. The case of dative and other cases. Variation in Datives: A Micro-Comparative Perspective. Oxford Studies in Comparative Syntax, Oxford University Press, Oxford.
    Yang, Charles. 2002. Knowledge and Learning in Natural Language. Oxford: Oxford University Press.

    Last Updated on Sunday, 14 October 2012 16:46
     

    Fix accent problem in TexMaker on Ubuntu

    To fix the accent problem in TexMaker on Ubuntu, where accents stop appearing over the character and are instead written before it, so that you get 'a instead of á (Icelandic, Spanish, etc.), install the ibus-qt4 package:

    sudo apt-get install ibus-qt4

    Fedora equivalent:

    yum install ibus-qt

    (Source thread)

    An easy fix for a very annoying and unpredictable problem. I have no idea why this occasionally happens without that package, but according to online sources the bug also affects some other Linux distributions, even in the latest version of TexMaker.
    Last Updated on Wednesday, 24 August 2011 09:36
     

    IcePaHC 0.9. 1 million words of syntactically parsed (hand-corrected) Icelandic

    We are very pleased to announce that version 0.9 of the Icelandic Parsed Historical Corpus (IcePaHC) is now available for free download.

    The corpus can be downloaded from:
    www.linguist.is/icelandic_treebank/Download

    The corpus is a treebank of over 1 million words in size, annotated for full phrase structure parse, and hand-corrected, using an adaptation of the annotation scheme used by the Penn Treebank and the Penn parsed corpora of historical English (http://www.ling.upenn.edu/hist-corpora/). Note that this release contains all of the text for version 1.0, but some minor corrections remain to be finished.

    The corpus contains:

    - 1 002 361 words total, consisting of ~100 000-word samples from each century from the 12th to the beginning of the 21st century.
    - Annotated with a phrase structure parse, part-of-speech-tagged, and lemmatized.
    - The entire parse, pos-tagging, and lemmata for every sentence have been *hand-corrected*.
    - Text samples are balanced for genre within each century.
    - LGPL license: You are free to copy, modify and redistribute the corpus for research and/or profit with appropriate citation.

    The corpus is distributed as raw UTF-8 data in labeled bracketing format and it is therefore compatible with various existing programs, including CorpusSearch (http://corpussearch.sourceforge.net/).
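As a rough illustration of what labeled bracketing looks like to a program, the following Python sketch reads one bracketed tree into nested (label, children) tuples and collects its terminal tokens. It is not part of the corpus tooling: the function names are hypothetical, the example tree and its labels are made up for illustration, and real corpus files may carry additional material (such as ID nodes or lemma annotations on terminals) that this sketch does not interpret.

```python
def parse_bracketed(text):
    """Parse one labeled-bracketing tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # A wrapper node may have no label of its own, e.g. "( (IP ...))".
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # terminal token
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()


def leaves(tree):
    """Return the terminal tokens of a parsed tree, left to right."""
    _, children = tree
    result = []
    for child in children:
        if isinstance(child, str):
            result.append(child)
        else:
            result.extend(leaves(child))
    return result


if __name__ == "__main__":
    # Hypothetical example tree; the labels are illustrative only.
    example = "(IP-MAT (NP-SBJ (PRO-N hún)) (VBDI las) (NP-OB1 (N-A bókina)))"
    print(leaves(parse_bracketed(example)))
```

For real queries over the corpus, a dedicated tool such as CorpusSearch is the better choice; the sketch only shows why the format is easy to process.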

    A plain text version without markup and a set of info files containing philological information accompany the corpus download.

    The entire corpus may be downloaded as a plain text version, or bundled with a platform-independent GUI or a Windows-compatible GUI for ease of searching.

    Further information on the annotation guidelines and project organization can be found on the project wiki:
    www.linguist.is/icelandic_treebank/


    Joel C. Wallenberg
    Anton Karl Ingason
    Einar Freyr Sigurðsson
    Eiríkur Rögnvaldsson
    University of Iceland

    We were grateful to receive support for this project through the following grants:

    Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language Technology beyond English – Icelandic as a test case".

    U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English".

    University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands), grant "Icelandic Diachronic Treebank" (Sögulegur íslenskur trjábanki).

    Last Updated on Monday, 29 August 2011 14:04
     

    Available: IcePaHC 0.4 (now includes a visual Windows version)

    IcePaHC 0.4, the latest version of the Icelandic Parsed Historical Corpus, is now available for download:

    http://linguist.is/icelandic_treebank/Download

    - 440,000 words total, from every century from the 12th through the 19th, annotated for phrase structure, part-of-speech-tagged and lemmatized
    - An optional easy-to-install visual user interface for Windows
    - LGPL license: You are free to copy, modify and redistribute the corpus for research and/or profit

    Joel C. Wallenberg
    Anton Karl Ingason
    Einar Freyr Sigurðsson
    Eiríkur Rögnvaldsson
    University of Iceland

    The project is funded by the following grants:

    Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language Technology beyond English – Icelandic as a test case".

    U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English".

    --------------------------------

    IcePaHC 0.4, the Icelandic treebank (now with a Windows version)


    IcePaHC 0.4, the latest version of the Icelandic treebank, has been released:

    http://linguist.is/icelandic_treebank/Download

    - A total of 440,000 words from every century from the 12th through the 19th, syntactically parsed, tagged and lemmatized
    - A simple Windows installation of a graphical user interface
    - LGPL license: Users may copy the corpus, modify it and redistribute it for research and/or profit

    Joel C. Wallenberg
    Anton Karl Ingason
    Einar Freyr Sigurðsson
    Eiríkur Rögnvaldsson

    The project is funded by:

    RANNÍS, grant nr. 090662011, "Viable Language Technology beyond English – Icelandic as a test case".

    U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English".
    Last Updated on Tuesday, 12 April 2011 15:45
     