Annotation Process

From Icelandic Parsed Historical Corpus (IcePaHC)

This is a guide for the local annotation team only. This page is under construction.

Starting work on a new file

To start working on a new file:

  • Move the ".txt" file from "texts" to "parsing".
  • Make sure the ".txt" file is formatted correctly for our software.
  • Run ./txt2ipsd.sh with the file name minus the .txt extension, for example ./txt2ipsd.sh ntjohn04 if the file is named ntjohn04.txt.
  • This will run a few scripts and result in a file with an .ipsd extension, like ntjohn04.ipsd.
  • To run the additional CorpusSearch revision queries, run ./runall.sh ntjohn04.ipsd ntjohn04.psd (assuming the input file is named ntjohn04.ipsd).
  • To open the result in CorpusDraw, run CD ntjohn04.psd.
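
The steps above can be sketched as a small dry-run planner. This is only an illustration: the helper name plan_new_file is hypothetical, it only builds the commands and runs nothing, and it assumes that "texts" and "parsing" are sibling directories and that the scripts are invoked from the directory that contains them. Checking that the ".txt" file is formatted correctly remains a manual step.

```python
def plan_new_file(base: str) -> list[list[str]]:
    """List the commands for starting work on `base` (name without .txt).

    Hypothetical helper: it only builds the commands, it runs nothing.
    """
    if base.endswith(".txt"):
        raise ValueError("pass the file name without the .txt extension")
    return [
        # Assumption: "texts" and "parsing" are sibling directories.
        ["mv", f"texts/{base}.txt", f"parsing/{base}.txt"],
        # Runs a few scripts and produces <base>.ipsd.
        ["./txt2ipsd.sh", base],
        # Additional CorpusSearch revision queries on the .ipsd file.
        ["./runall.sh", f"{base}.ipsd", f"{base}.psd"],
        # Open the result in CorpusDraw.
        ["CD", f"{base}.psd"],
    ]


# Print the plan for the example file used in this guide.
for command in plan_new_file("ntjohn04"):
    print(" ".join(command))
```

Printing the plan first makes it easy to spot a wrong base name before any file is moved.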

Documenting the annotation history of a file

Every file that is edited has exactly one file with notes about its edit history. If the file name is piltur1.psd, the corresponding notes file is piltur1.notes.txt.
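
The naming rule can be written down as a one-line mapping. The function name notes_name is hypothetical and is used here only to illustrate the convention.

```python
def notes_name(psd_name: str) -> str:
    """Map a .psd file name to the name of its notes file."""
    suffix = ".psd"
    if not psd_name.endswith(suffix):
        raise ValueError(f"expected a {suffix} file name, got {psd_name!r}")
    # Replace the .psd extension with .notes.txt.
    return psd_name[: -len(suffix)] + ".notes.txt"


print(notes_name("piltur1.psd"))
```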

Syntax of the notes file

  • For each sentence there is a section in the file that starts with its number.
  • Each section is an alphabetized list of notes about the sentence in question.
  • Each note starts with the initials of the annotator who wrote it.
  • The format is always exactly the same.

Example:

1)
a) AKI: changed lemma of "ekki" from "ekki" to "ekkert"

2)
a) AKI: added missing expletive subject to IP-MAT 
b) EFS: changed tag of "epli" from N to NS.

5)
a) AKI: disagree, changed it back to -bila; EFS: changed the lemma for bilast-bila to -bilast
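
Because the format never varies, a notes file can be read mechanically. The following is a minimal sketch of such a reader, not a tool the project is known to use; the name parse_notes is hypothetical, and the sketch assumes that any line which is neither a sentence number nor a lettered note continues the previous note.

```python
import re

# A section header is a sentence number followed by ")", e.g. "2)".
SECTION = re.compile(r"^(\d+)\)\s*$")
# A note starts with a lowercase letter followed by ")", e.g. "a) AKI: ...".
NOTE = re.compile(r"^([a-z])\)\s+(.*)$")


def parse_notes(text: str) -> dict[int, list[tuple[str, str]]]:
    """Return {sentence number: [(letter, note text), ...]}."""
    sections: dict[int, list[tuple[str, str]]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = SECTION.match(line)
        if header:
            current = int(header.group(1))
            sections[current] = []
            continue
        note = NOTE.match(line)
        if note and current is not None:
            sections[current].append((note.group(1), note.group(2)))
        elif current is not None and sections[current]:
            # Assumption: anything else continues the previous note.
            letter, body = sections[current][-1]
            sections[current][-1] = (letter, body + " " + line)
        else:
            raise ValueError(f"line does not fit the notes format: {raw!r}")
    return sections
```

A reader like this fails loudly on a line that appears before any sentence number or lettered note, which is usually a sign that the file does not follow the format.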

Note categories

Every note is classified according to its nature. The types of notes are as follows:

  • (no label): a change that corrects an error in the file. This is the default and the most common kind of note, so no label is needed.
  • NOTE: tells anyone reading the file to pay attention to this point. It carries important information about the parse that does not reflect a change. It is typically used by the first annotator to share information with the reviewers, in particular the arguments for a parse that resulted from a difficult choice between two or more alternatives (in which case citing documentation may be a good idea). NOTE can also be used to express that the meaning of the sentence is unclear to the annotator.
  • DISCUSS: a request that something be discussed among the annotators. Use it sparingly, and try hard to reach a clear conclusion that results in a clear decision (change the parse or keep the previous one). DISCUSS should be used when there is an apparent inconsistency in the corpus or the documentation; its goal is to increase consistency where needed.

Example:

1)
a) JW NOTE: made "að honum látnum" be an IP-SMC complement of P because it looks a lot like English examples with "with" (cf. url-to-docs)
b) AKI: changed lemma of "ekki" from "ekki" to "ekkert"
c) AKI DISCUSS: Treatment of NP-PRN is not consistent with NP-SBJ in "file-x.psd" sentence 4. 
          We should decide between those two parses, correct it in one of the places and document the decision.

2)
a) JW NOTE: I'm not sure what "jarteinir" means here, can this be something other than a noun?
            AKI: yes, this is a verb in this context! changed parse accordingly
b) AKI: added missing expletive subject to IP-MAT 
c) EFS: changed tag of "epli" from N to NS.
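
The three categories can be told apart from the start of a note. The sketch below is illustrative only: the names Note and classify are hypothetical, and it assumes that initials consist of uppercase ASCII letters and that the label, if any, follows the initials after a single space.

```python
import re
from typing import NamedTuple, Optional

LABELS = ("NOTE", "DISCUSS")

# Initials, an optional label, a colon, then the note text.
NOTE_HEAD = re.compile(
    r"^(?P<initials>[A-Z]+)(?: (?P<label>NOTE|DISCUSS))?: (?P<text>.+)$"
)


class Note(NamedTuple):
    initials: str
    label: Optional[str]  # None means an ordinary correction
    text: str


def classify(body: str) -> Note:
    """Split the first line of a note into initials, label and text."""
    match = NOTE_HEAD.match(body.strip())
    if match is None or match["initials"] in LABELS:
        # A bare "NOTE: ..." has a label but no initials.
        raise ValueError(f"note is missing initials or a colon: {body!r}")
    return Note(match["initials"], match["label"], match["text"])


print(classify('AKI: changed lemma of "ekki" from "ekki" to "ekkert"'))
print(classify("AKI DISCUSS: Treatment of NP-PRN is not consistent"))
```

Only the first line of a note is examined, so replies indented under a note are not classified separately.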

How to parse and review parses

General principles

  • Always make sure that all of your notes are labeled with your initials.
  • Quickly review the Checklist before moving on to the next token.
  • Be careful not to spend too much time on decisions.
  • If you have taken over 15 minutes to decide on a parse, consider adding a note and moving on. More specific notes are better.

For example, (CODE {COM:unsure_of_parse}) is fine, but (CODE {COM:unsure_of_dashtag_on_NP}) is better.

  • Be careful not to make notes that cause unnecessary delays or discussions.
  • Still, if something really needs to be discussed, discuss it.
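
Comments of the (CODE {COM:...}) form shown above can be collected from a file so that vague ones are easy to spot and refine. This is a minimal sketch; the function name code_comments is hypothetical, and it assumes every such comment is written exactly in that form on a single line.

```python
import re

# Matches comments of the form (CODE {COM:some_text}).
CODE_COMMENT = re.compile(r"\(CODE \{COM:([^}]*)\}\)")


def code_comments(psd_text: str) -> list[str]:
    """Return the text of every (CODE {COM:...}) comment, in order."""
    return CODE_COMMENT.findall(psd_text)


sample = "(CODE {COM:unsure_of_parse}) (CODE {COM:unsure_of_dashtag_on_NP})"
for comment in code_comments(sample):
    print(comment)
```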

First annotator

  • Create a notes file. If the file name is piltur1.psd, the corresponding notes file is piltur1.notes.txt.
  • If you don't understand the sentence properly, pick a plausible parse and make a NOTE about the problem in the notes file.
  • If you spend a lot of time making a decision (studying documentation, etc.) or if you believe the reviewer(s) need to know about some argument for the parse, make a NOTE (and cite documentation if you think that will be useful).
  • If you are unsure of the parse after spending some time on it, make a NOTE (like "AKI NOTE: Unsure of parse").
  • Sanity checks should be run before passing the file to a reviewer, and again before placing the file in "finished".
  • React to changes made to the file by reviewer(s) as necessary:
    • Write DISAGREE in front of points where you don't accept the change.
    • Discuss the DISAGREE points with the reviewer who made them.

Example:

1)
a) JW NOTE: made "að honum látnum" be an IP-SMC complement of P because it looks a lot like 
            English examples with "with" (cf. url-to-docs)
b) AKI: changed lemma of "ekki" from "ekki" to "ekkert"
c) AKI DISCUSS: Treatment of NP-PRN is not consistent with NP-SBJ in "file-x.psd" sentence 4. 
                We should decide between those two parses, correct it in one of the places and document the decision.

2)
a) JW NOTE: I'm not sure what "jarteinir" means here, can this be something other than a noun?
            AKI: yes, this is a verb in this context! changed parse accordingly
b) DISAGREE, there is a subject there already!, AKI: added missing expletive subject to IP-MAT 
c) EFS: changed tag of "epli" from N to NS.

  • Make sure that you track the state of your file until it has been placed in the "finished" directory (the first annotator of a file is responsible for the file).
  • When the file has reached "finished", remove any backup copies from the "current" directory (move backups you may want to keep to "backup").
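
Before a file moves to "finished", it can help to list the DISAGREE points that still need to be talked through with the reviewer. The following is a sketch under stated assumptions, not an existing project tool: the name find_disagreements is hypothetical, and it assumes DISAGREE is written directly after the note letter, as in the example in this section.

```python
import re

# A section header is a sentence number followed by ")", e.g. "2)".
SECTION = re.compile(r"^(\d+)\)\s*$")
# A lettered note whose text starts with DISAGREE.
DISAGREE = re.compile(r"^([a-z])\)\s+(DISAGREE\b.*)$")


def find_disagreements(text: str) -> list[tuple[int, str, str]]:
    """Return (sentence number, note letter, note text) per DISAGREE point."""
    found = []
    sentence = None
    for raw in text.splitlines():
        line = raw.strip()
        header = SECTION.match(line)
        if header:
            sentence = int(header.group(1))
            continue
        note = DISAGREE.match(line)
        if note and sentence is not None:
            found.append((sentence, note.group(1), note.group(2)))
    return found
```

An empty result means no note in the file is still marked with DISAGREE.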

Review

  • Add notes to the existing collection of notes for this file; do not create a new file!
  • Do not use the review to point out that there is a potential ambiguity in the sentence to discuss. The previous annotator already spent time on making a decision. If you believe an ambiguity was resolved incorrectly, change the parse; otherwise, leave the parse unchanged. If you are unsure whether it should be changed, do not change it.
  • Make sure that all changes you document in the notes file are reflected in the updated version of the psd-file.
  • Some decisions are necessarily judgment calls. Do not spend time on those unless you disagree quite strongly with the previous parse. Those include:
    • PP-attachment (which usually does not have serious effects on searching anyway)
  • At the end of a review, look over any DISCUSS points you made and see if they can be eliminated by making a clear decision.
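
The final check on DISCUSS points can be supported by listing them from the notes file. This is a minimal sketch, assuming the notes format described in this guide; the name open_discuss_points is hypothetical, and only the first line of each note is inspected.

```python
import re

# A section header is a sentence number followed by ")", e.g. "1)".
SECTION = re.compile(r"^(\d+)\)\s*$")
# A lettered note of the form "c) AKI DISCUSS: ...".
DISCUSS = re.compile(r"^([a-z])\)\s+([A-Z]+) DISCUSS: (.*)$")


def open_discuss_points(text: str, initials: str) -> list[tuple[int, str, str]]:
    """Return (sentence number, note letter, text) for one annotator."""
    points = []
    sentence = None
    for raw in text.splitlines():
        line = raw.strip()
        header = SECTION.match(line)
        if header:
            sentence = int(header.group(1))
            continue
        note = DISCUSS.match(line)
        if note and sentence is not None and note.group(2) == initials:
            points.append((sentence, note.group(1), note.group(3)))
    return points
```

Filtering by initials keeps the list limited to the points the reviewer made, which are the ones this step asks the reviewer to revisit.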