next up previous
Next: Dependency Trees Up: The Alpino Dependency Treebank Previous: Introduction

The grammar

Alpino is a wide-coverage grammar: it is designed to analyze sentences of unrestricted Dutch text. The grammar is based on the OVIS grammar (van Noord et al. 1999), which was used in the Dutch public transportation information system, but both the lexicon and the grammar have been extensively modified and extended.

The lexicon currently contains about 100,000 entries. More than 130 different verbal subcategorization frames are distinguished. The lexicon was constructed using lexical information from the databases Celex (Baayen, Piepenbrock and van Rijn 1993), Parole and CGN (Groot 2000). Various unknown-word heuristics further extend the lexical coverage of the system.

The grammar should cover not only the question and answer patterns of the public transportation information system but, in principle, all Dutch syntactic constructions. It has therefore been greatly extended and currently consists of about 335 rules. These rules are linguistically motivated and describe both common constructions and more specific, complex ones such as verb-raising constructions and cleft sentences. The rules are written in a framework that is based on Head-Driven Phrase Structure Grammar (Pollard and Sag 1994; Sag 1997). Following Sag (1997), we have defined construction-specific rules in terms of more general structures and principles.

In the lexicon, each word is assigned a type from a small set of basic lexical types. This type, e.g. noun for the word tafel (table), specifies the set of lexical features the word has. Nouns, for instance, have an agreement feature and a feature NFORM that distinguishes regular nouns from temporal or reflexive nouns. A complementizer, in contrast, lacks those features but is specified for CTYPE (i.e. complementizer type). These lexical features are represented in feature structures. Fig. 1 shows the feature structure for the word tafel, which is a lexical item (ylex), not derived from a verb (ndev) and not a member of any special class of nouns such as temporal or reflexive nouns (norm).
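The relation between lexical types and the features they license can be illustrated with a minimal Python sketch. It is not the Alpino implementation: the feature name AGR, the helper make_entry and the dictionary layout are hypothetical, and only the features named in the text are listed.

```python
# Hypothetical feature inventory per lexical type. Only features mentioned in
# the text appear here; "AGR" is an assumed name for the agreement feature.
TYPE_FEATURES = {
    "noun": {"AGR", "NFORM", "DT"},
    "complementizer": {"CTYPE", "DT"},
}


def make_entry(word, lex_type, **features):
    """Build a lexical entry, rejecting features the type does not define."""
    allowed = TYPE_FEATURES[lex_type]
    given = {name.upper(): value for name, value in features.items()}
    unknown = set(given) - allowed
    if unknown:
        raise ValueError(f"{lex_type} has no feature(s): {sorted(unknown)}")
    return {"WORD": word, "TYPE": lex_type, **given}


# tafel is a regular noun, so its NFORM value is norm.
tafel = make_entry("tafel", "noun", nform="norm")
```

Under these assumptions, asking for NFORM on a complementizer raises an error, which mirrors the statement that the lexical type determines which features a word has.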

The feature DT is shared by all types. It contains information about the relations between a word and the other words with which it can form a constituent. From the DT values of all words in a sentence, a dependency tree is built. This is a structure in which the various dependency relations between the words and constituents of a sentence are expressed. More information about dependency structures can be found in section 3.

Figure 1: Feature structure for the lexical entry tafel

Handwritten grammar rules define how lexical or phrasal items may combine to form larger units. For each syntactic structure, the rules specify the type of the mother node, a head daughter and the non-head daughter(s). In addition, the type of structure that they constitute is specified. Almost all structures are headed structures (structures in which one of the daughters can be identified as the head daughter). The class of headed structures is further subdivided into head-complement, head-adjunct, head-filler and head-extra structures, according to the function of the non-head daughter.

Furthermore, the grammar rules should specify how the lexical information on the daughter nodes is passed on to the mother node. Different inheritance rules apply to different types of features. It would be extremely time-intensive, error-prone and opaque if the inheritance had to be specified separately in each rule and for each feature. Therefore, five general principles are formulated that define how feature values are propagated up the tree. Each principle applies to a group of features. For instance, the Head-feature Principle states that for all features that are marked as head features, the values on the mother node are unified with the values on the head daughter. The Valence, Filler, Extraposition and Adjunct and Dependency principles define similar constraints for the subcategorization, extraction, extraposition and modification and dependency tree features, respectively. These general principles apply to each syntactic structure that is listed as a headed structure in the grammar.
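The Head-feature Principle described above can be sketched in a few lines of Python. This is a simplified illustration, not the grammar's actual machinery: feature structures are modelled as nested dictionaries with atomic leaf values, structure sharing (reentrancy) is not modelled, and the choice of VFORM as the only head feature is an assumption made for the example.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaf values)."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            # Shared features must unify recursively; new ones are added.
            result[key] = unify(result[key], value) if key in result else value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")


# Assumption for this sketch: VFORM is the only feature marked as a head feature.
HEAD_FEATURES = {"VFORM"}


def apply_head_feature_principle(mother, head_daughter):
    """Unify the mother's head features with those of the head daughter."""
    head_values = {f: v for f, v in head_daughter.items() if f in HEAD_FEATURES}
    return unify(mother, head_values)


mother = apply_head_feature_principle(
    {"CAT": "vp"},                        # hypothetical partial mother node
    {"VFORM": "finite", "SC": ["np"]},    # hypothetical head daughter
)
```

In this sketch the mother node receives the VFORM value of the head daughter, while SC is left untouched because it is not a head feature here; a conflicting VFORM value on the mother would raise UnificationFailure instead of being overwritten.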

Lexical information, construction-specific rules and general principles are the three basic components of the grammar. This setup allows the grammarian to formulate simple rules without specifying all the regular feature values on each of the components. The complete rules can be derived from the simple ones by adding the information that is conveyed in the general principles. For example, the rule in (1-a) expands to the rule in (1-b), in which the inheritance of the values for VFORM (finite, infinite, participle), subcategorization frame (SC), long distance dependencies (SLASH and EXTRA) and the dependency relations (DT) is specified. In this rule, $\langle$H|T$\rangle$ is used to denote a list with head H and tail T, and L$\oplus$M represents the concatenation of the two lists L and M.

(1) a. head-complement structure: v ...
    b. ... [7]|[2]$\rangle$, SLASH [3], EXTRA [5], DT [6]
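The two list notations used in the rule, $\langle$H|T$\rangle$ and L$\oplus$M, correspond to ordinary list operations. The following Python sketch only illustrates the notation; the function names and the category labels np and pp are placeholders chosen for the example and are not taken from the grammar.

```python
def cons(head, tail):
    """Build the list <H|T>: head H followed by the elements of tail T."""
    return [head, *tail]


def split(lst):
    """Decompose a non-empty list <H|T> into its head H and tail T."""
    return lst[0], lst[1:]


def concat(left, right):
    """Return the concatenation L (+) M of two lists."""
    return left + right


# Placeholder category labels, used only to show the operations.
sc = cons("np", ["pp"])
head, tail = split(sc)
joined = concat(["np"], ["pp", "np"])
```

Here sc is the two-element list with head np and tail containing pp, and joined is the three-element list obtained by concatenation.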

Noord G.J.M. van