Issues
From Icelandic Parsed Historical Corpus (IcePaHC)
Corrections
- add missing case information to ADJ in piltur1
- Remove possessive dollar signs from piltur1 since we don't use those anymore
- Do NP-CMP thing
- correct inconsistencies in "því að CP-ADV" and "af því að" CP-THT-PRN
- sem here introduces CP-ADV, see SEM.
( (IP-MAT (IP-MAT-1 (BEDI var-vera) (NP-SBJ (PRO-N hún-hún)) (ADVP (ADV því-því)) (HAN höfð-hafa) (PP (P á-á) (NP (N-D baðstofugólfi-baðstofugólf)))) (CONJP (CONJ og-og) (IP-MAT=1 (VAN gefin-gefa) (NP-OB1 (N-A mjólk-mjólk)) (PP (P sem-sem) (NP (N-D barni-barn) (PP (P með-með) (NP (N-D pípu-pípa))))))) (. ,-,)))
Script
- ?VBN for VAN komið
- When "Ef" is capitalized, the script gives (PP (P ef-ef) (CP-ADV C 0) instead of CP-ADV C ef-ef (the same happens with other capitalized words, such as "Þegar")
- make nú-nú be ADVP-TMP by default
- (ADJP (ADVR eins-eins)
- Tag "neinn" as Q
- project NP-POSs when needed, like for "minn"
- (ALSO líka), attach to IP (no ADVP)
- give "þó að" proper structure (not CP)
- "sjálfur" is (almost) always NP-PRN
- preserve case on (PRO ðu)
- fix tag "VBI-MA2SP"
- fix tag "D-PMG" for "hinna", should be "D-G" -BS/HO
- fix tag "D-MSA" for "hinn", should be "D-A" -BS/HO
- fix tag "D-NSN" for "hið", should be "D-N" -BS/HO
- fix tag "D-FPN" for "hinar", should be "D-N" -BS/HO
- fix tag "D-FSA" for "hina", should be "D-A" -BS/HO (and likewise for all other forms of "hinn")
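The "hinn" tag fixes above all follow one pattern, so they could be handled by a single rule. The following is a minimal sketch, not the project's actual script. It assumes that the wrong tags always have three feature letters with the case (N, A, D or G) as the last one, as in the five examples listed, and that leaves are written as form-lemma with the lemma "hinn". The name `fix_hinn_tags` is hypothetical.

```python
import re

# Assumption: the wrong tag is "D-" plus three feature letters, the last of
# which is the case (N, A, D or G), and the leaf is written "form-hinn".
HINN_TAG = re.compile(r"\(D-[A-Z]{2}([NADG]) ([^\s()]+-hinn)\)")


def fix_hinn_tags(tree_text):
    """Reduce over-specified determiner tags on forms of "hinn" to D-<case>."""
    return HINN_TAG.sub(r"(D-\1 \2)", tree_text)


# Illustrative input only; not taken from the corpus.
print(fix_hinn_tags("(NP (D-MSA hinn-hinn) (N-A mann-maður))"))
```

Leaves that already carry a tag such as D-A are left untouched, because the pattern only matches tags with three feature letters.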
Sanity checks
- Check if -SPE extension is missing in clauses dominated by other -SPE clauses (exception, -PRN)
- (DONE in sanity checks) CP always doms IP-SUB and the other way around (neither can be missing)
- Make sure that there is a trace where it must be (CP-QUE, CP-REL)
- (DONE in sanity checks) One subject in IP-MAT/IP-SUB, not more, not less
- (DONE in sanity checks) Subjects are not dominated by anything other than IPs, e.g. no NP-VOC idoms NP-SBJ
- Only use valid tags
- (DONE in sanity checks) N is not sister of PRO that idoms -minn (needs NP-POS for minn)
- check that "til að" (IP-INF) is IP-INF-PRP
- check for case agreement (e.g. inside NPs and in conjunction structures)
- check RP words
- check that IP-IMP idoms an imperative verb (VBI ...)
- check that sentence final punctuation tag is not ","
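Two of the open checks in this list (IP-IMP idoms an imperative verb; the sentence-final punctuation tag is not ",") could be prototyped outside CorpusSearch roughly as follows. This is only a sketch under stated assumptions: each token is one bracketed tree wrapped in an outer pair of parentheses, leaves have the shape (TAG form-lemma), and imperative verb tags start with VBI. The helper names `parse`, `walk` and `sanity_check` are hypothetical, and the example trees in the comments are made up for illustration.

```python
import re


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal string "form-lemma"
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return node()


def walk(tree):
    """Yield every node of the tree in preorder."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from walk(child)


def sanity_check(text):
    """Return a list of problem descriptions for one tree (empty if none)."""
    problems = []
    nodes = list(walk(parse(text)))
    for label, children in nodes:
        child_labels = [c[0] for c in children if isinstance(c, tuple)]
        # IP-IMP (with or without extensions) must idom a VBI... verb.
        if label.startswith("IP-IMP") and not any(
            l.startswith("VBI") for l in child_labels
        ):
            problems.append(label + " does not idom an imperative verb")
    # The last leaf of the tree must not carry the "," tag.
    leaves = [n for n in nodes if n[1] and all(isinstance(c, str) for c in n[1])]
    if leaves and leaves[-1][0] == ",":
        problems.append('sentence-final punctuation is tagged ","')
    return problems


# Made-up tree: the verb is missing and the final punctuation is tagged ",".
print(sanity_check("( (IP-IMP (NP-SBJ (PRO-N þú-þú)) (, ,-,)))"))
```

The check is deliberately strict; gapped clauses or other legitimate exceptions would have to be filtered out before the output is useful.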
Semi-automatic checking
- Pick out all typical subjunctive contexts and check verbs
- check case assigned by verbs against a list of known verbs
Post-processing
- Make sure that token final punctuation is always period
- Move punctuation to highest level
- Assign IDs to tokens
- Do some checking that lemmas are consistent with final PoS-tag
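The ID assignment step could look roughly like the sketch below. It assumes that every token is a single string of the form "( (ROOT ...))" and that the ID goes in as a last (ID ...) node inside the outer wrapper. The ID format NAME,.N used here is an assumption for illustration, as are the function name `add_ids` and the text name in the example.

```python
def add_ids(trees, text_name):
    """Append an (ID text_name,.n) node to each wrapped tree, numbering from 1."""
    result = []
    for number, tree in enumerate(trees, start=1):
        tree = tree.strip()
        if not (tree.startswith("(") and tree.endswith(")")):
            raise ValueError("not a wrapped tree: " + tree)
        # Drop the wrapper's closing bracket, add the ID node, close again.
        body = tree[:-1].rstrip()
        result.append("%s (ID %s,.%d))" % (body, text_name, number))
    return result


# Made-up token and text name, for illustration only.
for line in add_ids(["( (IP-MAT (VBPI er-vera) (. .-.)))"], "example"):
    print(line)
```

Numbering by position in the file means the IDs change whenever tokens are split or merged, so this step would have to run after all structural corrections.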
To discuss
- ELLA 'else', have argument about ELSE tag
- LENGI is tagged ADJ, but this is inconsistent with the rest of ADJs because it has no case. Can we do something about this?
- The flat N modifier structure may need to be changed; it gives rather strange results sometimes
- the LÍTILL, MIKILL Q thing: what about dálítill?
(NP-VOC (NPR-N Sigga-sigga) (NP-POS (PRO-N mín-minn)) (ADJ-N góð-góður))
... and when there are many Ds
Docs
- Adjectives page, Comparatives in ADJP, fix NP-CMP and make page for that
Various stuff, incl. IceNLP
- Make parentheses behave nicely in text
- fix Tagset page (perhaps this means "delete page", but some of this info needs to be somewhere)
- Make list of locative stuff, ADVP-LOC
- Make list of temporal stuff, ADVP-TMP