next up previous
Next: Log-linear models. Up: Disambiguation Previous: Penalty rules.

Dependency relations

We also experimented with statistical models based on dependency relations encoded in the dependency structure. The model assigns a probality to a parse by considering each dependency relation. For this purpose, dependency relations d are 5-tuples $d=\langle
w_h,p_h,r,w_a,p_a\rangle$ where wh is the head word, ph is the corresponding part-of-speech tag taken from a small set of part-of-speechs $\{v,n,a,adv,p,\dots\}$, r is the name of the relation taken from a small set of relation names $\{\mbox{\em su,obj1,obj2,vc,mod,det},\dots\}$; wa is the argument word, and pa is its part of speech.

The probability of a parse y given a sentence x might then be defined as:

p(y\vert x) = {1\over Z(x)}\prod_{d\in y} p(r,w_a,p_a\vert w_h,p_h)

For disambiguation, the normalizing factor Z(x) is the same for every parse of a given sentence and can be ignored.

Due to the occurrence of reentrancies, dependency structures are generally not trees but graphs. Therefore, the product above gives poor results because it will have an unjustified bias against such reentrancies (a reentrancy gives rise to an additional dependency relation). For this reason, we have chosen to score parse trees by determining the mean value of $-\log p$ for each tuple; this improved results considerably. The probability of a dependency is calculated as follows:

p(r,wa,pa|wh,ph) = p(r|wh,ph) * p(pa|wh,ph,r) * p(wa|wh,wp,r,pa)

The three components are each calculated using a linear back-off strategy, where the weights are determined by frequency and diversity (formula 2.66 of [9]). The quantities we use for backing off are given in the following table:

back-off level p(r|wh,ph) p(pa|wh,ph,r) p(wa|wh,wp,r,pa)
1 p(r|ph) p(pa|ph,r) p(wa|wp,r,pa)
2 p(r) p(pa|r) p(wa|r,pa)
3   p(pa) p(wa|pa)
4     p(wa)

Because the size of the treebanks we have currently available is much too small to estimate these quantities accurately, we have chosen to do our estimation using unsupervised learning. We have parsed a large corpus (`de Volkskrant' newspaper text: first four months of 1997) using the penalty rules described in the previous section as our disambiguator. This corpus contains about 350,000 sentences and 6,200,000 words. We only used those sentences that the system could analyse as a single constituent, and within a reasonable amount of time. This meant that we could use the results of about 225,000 sentences. We estimated the quantity p using the best parse (according to the penalty rules) for each of these sentences. Collecting the 225,000 dependency structures took about one month of CPU-time (using the high-performance computing cluster of the University of Groningen).

As can be concluded from table 3, such a model performs much better than the baseline. Moreover, a combined model in which we simply add the rule penalties to the quantity p performs better than either model in isolation.

Table 3: Preliminary results on the cdbl10 and cdbl20 development treebanks for a number of disambiguation techniques. The baseline row lists the percentages obtained if we select for each sentence a random parse tree from the parse forest. The maximum row lists the percentages obtained if we take for each sentence the best parse tree. These two numbers thus indicate the lower and upper bounds for parse selection.
  cdbl10 cdbl20
technique precision recall f-score precision recall f-score
baseline 62.3 63.3 62.8 58.5 59.6 59.0
log linear 76.0 76.6 76.3 66.3 67.6 66.0
penalties 78.6 79.3 78.9 73.1 73.3 73.2
dependency rel's 78.9 79.7 79.3 69.7 71.1 70.4
heur. + dep-rel's 80.9 81.7 81.3 74.6 75.4 75.0
maximum 89.1 90.0 89.6 83.2 84.1 83.7

next up previous
Next: Log-linear models. Up: Disambiguation Previous: Penalty rules.
Noord G.J.M. van