Lexical Resources

Accurate, wide-coverage parsing of unrestricted text requires a lexical component with detailed subcategorization frames. For lexicalist grammar formalisms, the availability of lexical resources which specify subcategorization frames is even more crucial. In HPSG, for instance, phrase structure rules rely on the fact that each head contains a specification of the elements it subcategorizes for. If such specifications are missing, the grammar will wildly overgenerate.
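The overgeneration problem can be illustrated with a deliberately simplified sketch. It is not an HPSG implementation: the function `head_complement_ok` and the representation of a subcategorization list as a sequence of category labels are assumptions made for illustration only. A head without a specification (modelled here as `None`) places no constraint on its complements, so every combination is licensed.

```python
from typing import Optional, Sequence


def head_complement_ok(subcat: Optional[Sequence[str]],
                       complements: Sequence[str]) -> bool:
    """Toy head-complement check (hypothetical, for illustration only).

    subcat is the list of categories the head subcategorizes for.
    subcat=None models a lexical entry whose specification is missing:
    nothing constrains the complements, so any sequence is accepted.
    """
    if subcat is None:
        return True
    return list(subcat) == list(complements)


# A transitive verb that selects exactly one NP complement.
transitive = ["NP"]
print(head_complement_ok(transitive, ["NP"]))        # licensed
print(head_complement_ok(transitive, ["NP", "NP"]))  # rejected
# An unspecified head accepts anything, which is the overgeneration case.
print(head_complement_ok(None, ["NP", "NP", "PP"]))
```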

We have used two existing lexical databases (Celex and Parole) to create a wide-coverage lexicon with detailed subcategorization frames enriched with dependency relations. Celex [2] is a large lexical database for Dutch, with rich phonological and morphological information. For use within the CGN project, this database has been extended with dependency frames [11]. This version of the lexicon contains 11,800 verbal stems, with a total of 21,800 dependency frames. By far the most frequent frames are those for intransitive (4,100) and transitive (6,500) verbs. A fair number of frames occur more than 100 times, but 300 of the 650 different dependency frame types in the database occur only once.

Table 1: Dependency Frames and the number of stems occurring with this frame in both resources, in CGN/Celex only, in Parole only, and the total number of stems with this dependency frame in the Alpino Lexicon.

Dependency Frame                                Overlap  Celex only  Parole only  Total
[SU:NP][OBJ1:NP]                                   1810        1211          240   3261
[SU:NP]                                             257        1697           42   1996
[SU:NP][PC:PP$\langle$pform$\rangle$]               337         541          273   1151
[SU:NP][OBJ1:NP][PC:PP$\langle$pform$\rangle$]      129         375          308    812
[SU:NP][VC:S$\langle$subordinate$\rangle$]          103         136          103    342
[SUP:NP$\langle$het$\rangle$][OBJ1:NP][SU:CP]         7         247            5    259
[SU:NP][OBJ2:NP][OBJ1:NP]                            65         171           28    264
[SU:NP][SE:NP][PC:PP$\langle$pform$\rangle$]         65          62          102    229
[SU:NP][SE:NP]                                       49         137           65    251
[SU:NP][VC:VP]                                       10          16           37     63

The Dutch Parole lexicon$^2$ comes with detailed subcategorization information, including dependency relations. The Parole lexicon is smaller than Celex, with 3,200 verbal stems and a total of 5,000 dependency frames. There are 320 different dependency frame types, 190 of which occur only once.

Dependency frames for the Alpino lexicon have been constructed from the dependency information provided by CGN/Celex and Parole, and by entering definitions by hand. The latter has been done mostly for auxiliary and modal verbs: a small class of high-frequency elements which are exceptional in a number of ways. The CGN/Celex dictionary is very large. As the Celex database comes with frequency information, we currently include only those lexical items whose frequency is above a certain threshold. For verbal stems, this means that roughly 50% of the stems in Celex are included in the Alpino lexicon. All verbal stems from the Parole lexicon with a dependency frame covered by the grammar are included.
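The construction procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Alpino extraction code: the `Entry` record, the `build_lexicon` function, the threshold value, and the toy entries (including their frequencies and frame labels) are all hypothetical and are not drawn from the databases. The sketch also assumes that a CGN/Celex entry must carry a frame that the grammar defines.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Entry:
    stem: str
    frame: str
    frequency: Optional[int] = None  # only Celex entries carry a frequency


def build_lexicon(celex: Iterable[Entry], parole: Iterable[Entry],
                  manual: Iterable[Entry], celex_frames: Set[str],
                  parole_frames: Set[str],
                  min_frequency: int) -> Dict[Tuple[str, str], Set[str]]:
    """Map each (stem, frame) pair to the set of sources that supplied it."""
    lexicon: Dict[Tuple[str, str], Set[str]] = {}

    def add(entry: Entry, source: str) -> None:
        lexicon.setdefault((entry.stem, entry.frame), set()).add(source)

    for entry in celex:
        # Celex: keep only sufficiently frequent items with a covered frame.
        frequent = entry.frequency is not None and entry.frequency > min_frequency
        if frequent and entry.frame in celex_frames:
            add(entry, "celex")
    for entry in parole:
        # Parole: keep every stem whose frame the grammar covers.
        if entry.frame in parole_frames:
            add(entry, "parole")
    for entry in manual:
        # Hand-written definitions (e.g. auxiliaries) are always included.
        add(entry, "manual")
    return lexicon


# Toy data, invented for illustration.
celex = [Entry("lezen", "[SU:NP][OBJ1:NP]", 300),
         Entry("slapen", "[SU:NP]", 120),
         Entry("zeldzaam_werkwoord", "[SU:NP]", 1)]
parole = [Entry("lezen", "[SU:NP][OBJ1:NP]"),
          Entry("wachten", "[SU:NP][PC:PP<pform>]"),
          Entry("wachten", "[UNCOVERED]")]
manual = [Entry("zullen", "[SU:NP][VC:VP]")]

lexicon = build_lexicon(
    celex, parole, manual,
    celex_frames={"[SU:NP]", "[SU:NP][OBJ1:NP]"},
    parole_frames={"[SU:NP][OBJ1:NP]", "[SU:NP][PC:PP<pform>]"},
    min_frequency=10,
)
for (stem, frame), sources in sorted(lexicon.items()):
    print(stem, frame, sorted(sources))
```

Recording the source of each (stem, frame) pair makes it straightforward to count afterwards which pairs come from both resources and which from only one.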

The grammar currently provides a definition for 28 different CGN/Celex dependency frames. This covers over 80% of the verbal dependency frames in the CGN/Celex database, 10,400 of which are sufficiently frequent to be included in the Alpino lexicon. For 15 different dependency frames in the Parole lexicon a definition in Alpino is present. Using these, we extract over 4,100 dependency frames (82% of the total number of dependency frames in the Parole database). An overview of overlap and non-overlap for the most frequent frames extractable from both sources is given in Table 1. For transitive and intransitive verbs, we see that over 85% of the stems in Parole are present in CGN/Celex as well. For most other dependency frames, however, the overlap is much smaller, and a significant portion of the stems present in Parole is not present in Celex. This suggests that, for more specific subcategorization frames, both resources are only partially complete, and that not even the union of both provides exhaustive coverage.$^3$
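The overlap figures can be recomputed directly from Table 1. The sketch below copies the three count columns from the table; the function name `parole_share_in_celex` and the plain-text frame labels (with `<...>` in place of angle brackets) are our own choices for illustration.

```python
# Counts copied from Table 1: (overlap, CGN/Celex only, Parole only).
TABLE_1 = {
    "[SU:NP][OBJ1:NP]": (1810, 1211, 240),
    "[SU:NP]": (257, 1697, 42),
    "[SU:NP][PC:PP<pform>]": (337, 541, 273),
    "[SU:NP][OBJ1:NP][PC:PP<pform>]": (129, 375, 308),
    "[SU:NP][VC:S<subordinate>]": (103, 136, 103),
    "[SUP:NP<het>][OBJ1:NP][SU:CP]": (7, 247, 5),
    "[SU:NP][OBJ2:NP][OBJ1:NP]": (65, 171, 28),
    "[SU:NP][SE:NP][PC:PP<pform>]": (65, 62, 102),
    "[SU:NP][SE:NP]": (49, 137, 65),
    "[SU:NP][VC:VP]": (10, 16, 37),
}


def parole_share_in_celex(overlap: int, celex_only: int, parole_only: int) -> float:
    """Fraction of the Parole stems with a frame that also occur in CGN/Celex."""
    return overlap / (overlap + parole_only)


for frame, counts in TABLE_1.items():
    total = sum(counts)  # stems with this frame in the combined lexicon
    share = parole_share_in_celex(*counts)
    print(f"{frame:32s} total={total:5d} parole_in_celex={share:.1%}")
```

For the transitive frame the share is 1810 / (1810 + 240), roughly 88%, and for the intransitive frame 257 / (257 + 42), roughly 86%, in line with the figure of over 85% given above. For a frame such as [SU:NP][VC:VP] the share drops to 10 / 47, roughly 21%.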

Noord G.J.M. van