Splitting and joining words

From Icelandic Parsed Historical Corpus (IcePaHC)

Latest revision as of 10:10, 26 April 2019

Items treated as compounds

Items that are split

Definite article (determiner).

(NP-VOC (N-N gæska$-gæska)
        (D-N $n-hinn)
        (NP-POS (PRO-N mín-minn)))

Suffixed þú 'you' on finite verbs. The suffixed forms -du, -ðu and -tu are always tagged NP-SBJ:

	      (ADVP-RSP (ADV þá-þá))
	      (VBPI veis$-vita)
	      (NP-SBJ (PRO-N $tu-þú))
	      (NP-OB1 (PRO-A það-það)

When preverbal adverbial particles (RP) are written together with the following verb, the particle is split off, as below.

( (IP-MAT (CODE VS:I_24J)
	  (CONJ Og-og)
	  (NP-SBJ (PRO-N þeir-hann)
		  (CP-REL (WNP-1 0)
			  (C sem-sem)
			  (IP-SUB (NP-SBJ *T*-1)
				  (RP út$-út)
				  (VBDI $sendust-senda))))
	  (, ,-,)
	  (BEDI voru-vera)
	  (PP (P af-af)
	      (NP (NPRS-D faríseis-Faríseis)))
	  (. .-.)))

Lögmálslestur og spámannanna (Tyndale: And after the lawe and the prophetes were redde):

	  (PP (P eftir-eftir)
	      (NP (NP-POS (NP (N-G lögmáls$-lögmál))
			  (CONJP *ICH*-1))
		  (N-A $lestur$-lestur)
		  (D-A $inn-hinn)
		  (CONJP-1 (CONJ og-og)
			   (NP (NS-G spámanna$-spámaður) (D-G $nna-hinn)))))

Óþolinmæðis- og gáleysisorð:

( (IP-MAT (CONJ En-en)
	  (NP-SBJ (D-N þessi-þessi)
		  (NP-POS (N-G óþolinmæðis--óþolinmæði) (CONJ og-og) (N-G gáleysis$-gáleysi))
		  (NS-N $orð-orð))
	  (DODI gjörðu-gera)
	  (NP-OB2 (PRO-D mér-ég))
	  (ADVP-TMP (ADVR síðar-síðar))
	  (NP-OB1 (ADJP (ADV nógu-nógur) (ADJ-A þunga-þungur))
		  (NS-A þanka-þanki))
	  (. .-.)))
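
In the examples above, each leaf has the shape form-lemma, and `$` appears on the edge of a form where an orthographic word was split. The following Python sketch shows one way to recover the original written words from such leaves. The function names `split_leaf` and `rejoin` are hypothetical, and the sketch assumes that the lemma follows the last hyphen in the leaf and that traces such as `*ICH*-1` have already been filtered out; neither assumption is stated in the guidelines themselves.

```python
def split_leaf(leaf):
    """Split a 'form-lemma' leaf at its last hyphen (an assumption of this sketch)."""
    form, _, lemma = leaf.rpartition("-")
    return form, lemma


def rejoin(forms):
    """Glue together adjacent forms where one ends in '$' and the next starts with '$'."""
    words = []
    for form in forms:
        if words and words[-1].endswith("$") and form.startswith("$"):
            # Drop the two '$' markers and concatenate the parts.
            words[-1] = words[-1][:-1] + form[1:]
        else:
            words.append(form)
    return words


leaves = ["lögmáls$-lögmál", "$lestur$-lestur", "$inn-hinn"]
forms = [split_leaf(leaf)[0] for leaf in leaves]
print(rejoin(forms))
```

Because a middle part such as `$lestur$` carries a marker on both edges, the same loop handles words split into more than two parts.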

Items treated as unitary

Items of this kind may be written either as one word or as several. When they are written as one word, they get a simple POS tag; when they are written apart, each part gets its own numbered POS tag, and together the parts project a single tag:

(NP-OB1 (NS-D (N21-A kapal-kapall) (NS22-D hestum-hestur))))))

( (IP-MAT (NP-SBJ (PRO-D Þeim-það))
	  (BEDI var-vera)
	  (NP-OB1 (PRO-N það-það))
	  (PP (P til-til)
	      (NP (N-G (N21 auka-auka) (N22 fyrdæmingar-fyrdæming))
		  (NP-POS (PRO-G sinnar-sinn))))

The first number (2) is the number of parts in the item; the second (1 or 2) shows each part's place within the sequence.
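
As an illustration of the numbering scheme, a numbered tag such as `NS22-D` can be decomposed into its base tag, the number of parts, the position of the part and an optional case suffix. The Python sketch below is not part of the corpus tooling; the name `parse_numbered_tag` is hypothetical, and the pattern assumes that both numbers are single digits and that the case suffix is a single capital letter, as in the examples on this page.

```python
import re

# Base tag, number of parts, position, optional one-letter case suffix.
NUMBERED_TAG = re.compile(r"^([A-Z]+)(\d)(\d)(?:-([A-Z]))?$")


def parse_numbered_tag(tag):
    """Return (base, parts, position, case) or None if the tag is not numbered."""
    match = NUMBERED_TAG.match(tag)
    if match is None:
        return None
    base, parts, position, case = match.groups()
    return base, int(parts), int(position), case


for tag in ["N21-A", "NS22-D", "C21", "NP-OB1"]:
    print(tag, parse_numbered_tag(tag))
```

A phrase label such as `NP-OB1` is not matched, because its digit follows the hyphen rather than the base tag.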

Note that in the case of nouns, adjectives and pronouns, different parts of these items usually don't have the same case, as in the example above; the first part usually gets accusative or genitive since it modifies the last part.

Genitive modifiers in sequences like these do not project NP-POS.

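Names like Skalla Grímur, usually written as one orthographic word, are treated slightly differently from regular nouns like manns bani (which is also usually written as one). In the name, each part gets an ordinary NPR tag directly under the NP, without numbered tags or a projected tag: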

(NP-SBJ (NPR-G Skalla) (NPR-N Grímur))

(NP-SBJ (N-N (N21-G manns) (N22-N bani))

When used as one, the complementizers SEM and AÐ are treated as unitary, i.e. (C (C21 sem-sem) (C22 að-að)):

( (IP-MAT-SPE (CODE VS:VI_57J)
	      (ADVP-LFD (ADVR Líka-líka)
			(CP-CMP-SPE (WADVP-1 0)
				    (C (C21 sem-sem) (C22 að-að))
				    (IP-SUB-SPE (IP-SUB-SPE (ADVP *T*-1)
							    (NP-OB1 (PRO-A mig-ég))
							    (VBDI sendi-senda)
							    (NP-SBJ (VAG lifandi-lifa) (N-N faðir-faðir)))

The use of sem að is very frequent in 1628.olafuregils.

Treatment of individual items and parts

NÉ EITT (NEITT): (NP-SBJ (Q-N (NEG21 né-né) (ONE22-N eitt-einn)))

Ó: (ADJP (ADJ-N (NEG21 ó-ó) (ADJ22-N heilagur-heilagur)))

SYSTUR SON EGILS:

(NP-PRN (N-N (N21-G systur-systir) (N22-N son-sonur))
			  (NP-POS (NPR-G Egils-egill))))
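
To show how unitary items like those above could be picked out of a labelled bracketing, here is a small Python sketch. It is only an illustration under stated assumptions: the functions `parse` and `unitary_items` are hypothetical, the input must be a single balanced bracketing (the excerpts on this page are sometimes cut off and would need their parentheses balanced first), and a unitary item is taken to be any node whose children are all leaves with numbered tags.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")
NUMBERED = re.compile(r"^[A-Z]+\d\d(?:-[A-Z]+)?$")


def parse(text):
    """Parse one balanced bracketing into nested lists: [label, child, ...]."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    return stack[0][0]


def unitary_items(node):
    """Collect (projected tag, [forms]) for nodes made up only of numbered leaves."""
    found = []
    parts = [child for child in node[1:] if isinstance(child, list)]
    if parts and all(
        len(p) == 2 and isinstance(p[1], str) and NUMBERED.match(p[0])
        for p in parts
    ):
        # Keep the word form, i.e. the text before the last hyphen of each leaf.
        found.append((node[0], [p[1].rpartition("-")[0] for p in parts]))
    for child in parts:
        found.extend(unitary_items(child))
    return found


tree = parse("(NP-SBJ (Q-N (NEG21 né-né) (ONE22-N eitt-einn)))")
print(unitary_items(tree))
```

The projected tag is reported together with the forms of its parts, so a numbered sequence under `Q-N` or `N-N` can be compared with the corresponding single-word spelling.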