Lemmatization

From Icelandic Parsed Historical Corpus (IcePaHC)
General principles

The words in the corpus all occur with a corresponding lemma, in the form: (POSTAG word-lemma). The lemma is always all lowercase, even for proper names.
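As an illustration of the leaf format, the following minimal Python sketch splits a leaf into tag, word and lemma and checks that the lemma is lowercase. The names Leaf, parse_leaf and lemma_is_lowercase are hypothetical, and the sketch assumes that the lemma is whatever follows the last hyphen in the leaf text; leaves without a hyphen are returned with no lemma.

```python
import re
from typing import NamedTuple, Optional


class Leaf(NamedTuple):
    tag: str
    word: str
    lemma: Optional[str]


# Hypothetical pattern: "(TAG text)", where text has no spaces or parentheses.
LEAF_RE = re.compile(r"\((\S+) ([^\s()]+)\)")


def parse_leaf(leaf: str) -> Leaf:
    """Split a leaf such as "(NEG eigi-ekki)" into tag, word and lemma."""
    match = LEAF_RE.fullmatch(leaf.strip())
    if match is None:
        raise ValueError(f"not a leaf: {leaf!r}")
    tag, text = match.groups()
    # Assumption: the lemma is the part after the LAST hyphen.
    word, sep, lemma = text.rpartition("-")
    if not sep:
        # No hyphen at all, so no lemma is attached to this leaf.
        return Leaf(tag, text, None)
    return Leaf(tag, word, lemma)


def lemma_is_lowercase(leaf: Leaf) -> bool:
    """True if the leaf has no lemma or its lemma is all lowercase."""
    return leaf.lemma is None or leaf.lemma == leaf.lemma.lower()
```

For example, parse_leaf("(NPR Moises-móses)") yields the tag NPR, the word Moises and the lemma móses, which passes the lowercase check.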

In general, the lemma for the word is the dictionary citation form for that word. However, there are some systematic differences between our analysis and traditional Icelandic lexicography, which will be listed here or under Treatment of individual words.

For the Old(er) Icelandic texts in the corpus, some of the words have modernized lemmas, i.e., the lemma for the corresponding word in modern Icelandic is used rather than the citation form found in Old Icelandic / Old Norse dictionaries. This is done primarily when the word has a form that might be confusing to speakers of modern Icelandic.

The systematically modernized lemmas are listed below:

(ADV þ$) (NEG $eygi-ekki)

(NEG eigi-ekki)

(NEG ei-ekki)

(Q ekki-ekkert)

(Q nekkvar-nokkur) / (Q nekkver-nokkur)

(P fyr-fyrir)

(P fyrr-fyrir)

(P und-undir)

(P viður-við)

(P meður-með)

(ADJ-A átta-áttundi), i.e. an old ordinal number that has the same form as the modern cardinal number.

(PRO eg-ég)

(PRO ér-þú)

(WADV hverninn-hvernig)

(ALSO einninn-einnig)

Proper names are systematically modernized, if possible:

(NPR Moises-móses)

(NPR Herodes-heródes)
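The modernization list can be pictured as a lookup keyed on both tag and word form. The Python sketch below is only an illustration: the table MODERNIZED and the function modern_lemma are hypothetical names, the entries are copied from the lists above, and keying on the lowercased form is an assumption.

```python
from typing import Optional

# (tag, form as listed above) -> modernized lemma.
MODERNIZED = {
    ("NEG", "eigi"): "ekki",
    ("NEG", "ei"): "ekki",
    ("Q", "ekki"): "ekkert",
    ("Q", "nekkvar"): "nokkur",
    ("Q", "nekkver"): "nokkur",
    ("P", "fyr"): "fyrir",
    ("P", "fyrr"): "fyrir",
    ("P", "und"): "undir",
    ("P", "viður"): "við",
    ("P", "meður"): "með",
    ("ADJ-A", "átta"): "áttundi",
    ("PRO", "eg"): "ég",
    ("PRO", "ér"): "þú",
    ("WADV", "hverninn"): "hvernig",
    ("ALSO", "einninn"): "einnig",
    ("NPR", "moises"): "móses",
    ("NPR", "herodes"): "heródes",
}


def modern_lemma(tag: str, form: str) -> Optional[str]:
    """Return the modernized lemma, or None if the pair is not listed."""
    # Assumption: matching ignores case in the word form.
    return MODERNIZED.get((tag, form.lower()))
```

Including the tag in the key matters for a form like átta, which is mapped to áttundi only when it is tagged as the old ordinal (ADJ-A); under any other tag the lookup returns None.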

-st middle-verbs

The lemma of an -st verb ends in -st if the meaning is clearly different from that of the corresponding verb without -st, if there is no corresponding verb without -st, or if the syntax of the -st verb is different, notably if it is a DAT-NOM verb:

Different meaning:

andast 'die' != anda 'breathe'
eignast != eigna
gerast 'happen' or 'become (intentionally)' != gera 'do' (note that the -st form is tagged VB*, not DO*)
reiðast 'get angry' != reiða 'transport (on a horse)'
skjótast 'move quickly' != skjóta 'shoot'
villast 'get lost' != villa 'mislead'
kannast 'be familiar with' != kanna 'explore'
látast 'pretend' or 'die' != láta 'let'
þykjast 'pretend' != þykja 'think'

No without-st-verb:

aðhyllast (*aðhylla)
heppnast (*heppna)
iðrast (*iðra)
leiðast (*leiða)
lukkast (*lukka)

Different case pattern:

sýnast (DAT-NOM) != sýna (NOM-ACC)
finnast (DAT-NOM) != finna (NOM-ACC)
óast (NOM subject) != óa (ACC subject)
fyllast (NOM subject) != fylla (ACC subject when it takes a single argument, NOM-ACC when monotransitive)
venjast != venja
verjast (NOM-GEN) != verja (NOM-ACC)
setjast != setja
berast != bera
undrast (NOM-ACC) != undra (ACC-ACC)
minnast (NOM-GEN) != minna (NOM-ACC-PP)
nefnast (NOM-NOM) != nefna (NOM-ACC)
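The three criteria for -st verbs can be summarized as a simple decision. The following Python sketch is a hedged illustration, not the procedure actually used for the corpus: StVerb and st_lemma are hypothetical names, the three flags must be set by hand by the annotator, and the fallback (dropping the final -st to get the lemma of the base verb) is an assumption about what happens when none of the criteria apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StVerb:
    form: str                     # infinitive ending in -st, e.g. "andast"
    different_meaning: bool       # meaning clearly differs from the base verb
    has_base_verb: bool           # a corresponding verb without -st exists
    different_case_pattern: bool  # e.g. DAT-NOM instead of NOM-ACC


def st_lemma(verb: StVerb) -> str:
    """Return the lemma of an -st verb under the three criteria."""
    if not verb.form.endswith("st"):
        raise ValueError(f"not an -st verb: {verb.form!r}")
    keeps_st = (
        verb.different_meaning
        or not verb.has_base_verb
        or verb.different_case_pattern
    )
    if keeps_st:
        return verb.form
    # Assumption: otherwise the lemma is the base verb without -st.
    return verb.form[:-2]
```

With hand-set flags, andast (different meaning), heppnast (no base verb) and sýnast (different case pattern) all keep their -st lemma; only a verb for which all three criteria fail would be lemmatized without -st.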

Pronouns

Gender only matters for personal pronouns.

PRO-N hann-hann
PRO-N hún-hún
PRO-A hana-hún
PRO-N það-það
PRO-D því-það

Number is not a distinguishing factor, so non-singular forms (including the dual) are lemmatized as the singular.

vér,oss-ég
ér,þér-þú
þeir-hann

Determiners and quantifiers are divided only by nature. The lemma uses the masculine as the default gender, the nominative as the default case and the singular as the default number.

D-N sú-sá
D-N það-sá
Q-N engar-enginn

The possessive pronouns sinn, minn and þinn get their own lemmas (sinn,minn,þinn).

Plural possessive pronouns existed in Old Icelandic. They are derived from the personal pronouns, but when they inflect with the noun (as in Old Icelandic) they count as distinct forms and get their own lemmas.

minn,mitt,mín-minn
vár,vor,ór,vort-vor
okkar,okkrum-okkar
þinn-þinn
yðvar,yðar,yðrum-yðar
ykkar,ykkrum-ykkar

If they do not inflect with the noun, they are simply genitive forms of the personal pronouns and are lemmatized as such.

hans-hann
hennar-hún
þess-það
þeirra-hann/hún/það
okkar-ég
ykkar-þú 
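The distinction between inflecting possessives and genitive personal pronouns can be sketched as two lookup tables selected by a flag. In the Python sketch below, the table and function names are hypothetical, the entries are taken from the two lists above, and whether a form inflects with the noun is assumed to be decided by the annotator and passed in by the caller.

```python
from typing import Optional

# Forms that inflect with the noun: lemmatized as possessive pronouns.
POSSESSIVE_LEMMAS = {
    "minn": "minn", "mitt": "minn", "mín": "minn",
    "vár": "vor", "vor": "vor", "ór": "vor", "vort": "vor",
    "okkar": "okkar", "okkrum": "okkar",
    "þinn": "þinn",
    "yðvar": "yðar", "yðar": "yðar", "yðrum": "yðar",
    "ykkar": "ykkar", "ykkrum": "ykkar",
}

# Forms that do not inflect: genitives of the personal pronouns.
# þeirra is left out because its lemma depends on the gender.
GENITIVE_LEMMAS = {
    "hans": "hann",
    "hennar": "hún",
    "þess": "það",
    "okkar": "ég",
    "ykkar": "þú",
}


def possessive_lemma(form: str, inflects_with_noun: bool) -> Optional[str]:
    """Pick the lemma for a possessive-like form, or None if unlisted."""
    table = POSSESSIVE_LEMMAS if inflects_with_noun else GENITIVE_LEMMAS
    return table.get(form.lower())
```

The same surface form can therefore receive two lemmas: okkar is lemmatized as okkar when it inflects with the noun and as ég when it is a plain genitive.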

The personal pronoun það is easily confused with the determiner það (lemmatized sá). When in doubt, PRO is the default.

Individual words

HVORTVEGGJA, HVORTVEGGI: lemmatized as hvortveggja. When it is written as two words, TVEGGJA is lemmatized as tveggi.

Q manngi-manngi

Q engi-enginn

WPRO hvorgi-hvorgi

WPRO hvergi-hvergi

Q hvatki-hvergi

Numbers

Single numerals that appear in the text are lemmatized with the word for the number, not the numeral. This is to prevent confusion (either human or computer) between numerals in the text and indices in the annotation. E.g., LIKE THIS: "7-sjö", NOT LIKE THIS: "7-7".
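The numeral rule can be illustrated with a small Python sketch. Only the pair 7-sjö comes from this page; the rest of the table, the restriction to the digits 1 to 9 and the choice of the masculine nominative for 1 to 4 are assumptions made for the example, and DIGIT_WORDS and numeral_leaf_text are hypothetical names.

```python
# Icelandic words for the digits 1-9 (masculine nominative for 1-4).
DIGIT_WORDS = {
    "1": "einn",
    "2": "tveir",
    "3": "þrír",
    "4": "fjórir",
    "5": "fimm",
    "6": "sex",
    "7": "sjö",
    "8": "átta",
    "9": "níu",
}


def numeral_leaf_text(numeral: str) -> str:
    """Build the word-lemma text for a single-digit numeral, e.g. "7-sjö"."""
    word = DIGIT_WORDS.get(numeral)
    if word is None:
        raise ValueError(f"no word listed for numeral {numeral!r}")
    # The lemma is the number word, never the numeral itself.
    return f"{numeral}-{word}"
```

Here numeral_leaf_text("7") gives "7-sjö", so the lemma can never be mistaken for an index in the annotation.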

Issues

HÉÐAN Í FRÁ, ÞAR ÚT Í FRÁ

The comparative of heilagur is usually helgari, not heilagri, in Íslensk hómilíubók

aldregi: aldregi or aldrei?

fullting: -fullting or -fulltingi?

líkamur, líkhamur 'body': -líkami or -líkamur/-líkhamur?

vor: -vor or -ég?

Jóan: -Jóan or -Jóhannes?

engi: -engi or -enginn (or even -einngi)?

sing. mánaður, pl. mánuður: -mánaður or -mánuður?

sétti: -sétti or -sjötti?

Marie (gen.): -marie or -María; or even LATIN?

VB ríta/VBN ritið: -ríta or -rita?

orðaslaug: what is this?

allmáttkur: allmáttkur or allmáttugur?

ritka and séka: How to express the negative and the pronoun? In Firstgrammar2.psd. -HO

sömnuðu: samna or safna?