next up previous
Next: acknowledgments Up: An Efficient Implementation of Previous: Head-corner parsing and Robustness


Practical Experience

There does not exist a generally agreed upon method to measure the efficiency of parsers for grammars of the kind we assume here, i.e. constraint-based grammars for natural language understanding. Therefore, I will present the results of the parser for the current version of the OVIS grammar in comparison with a number of other parsers that have been developed in the same project (by the author and his colleagues). Moreover, a similar experiment was performed with two other grammars: the English MiMo2 grammar [30], and the English Alvey NL Tools grammar [6]. It should be clear that the results to be presented should not be taken as a formal evaluation, but are presented solely to give an impression of the practical feasibility of the parser, at least for its present purpose. The following results should be understood with these reservations in mind.

Other Parsers

In the experiments the head-corner parser was compared with a number of other parsers. The parsers are described in further detail in [31] and [29]. The last two parsers of the following list were implemented by Mark-Jan Nederhof.

Note that we have experimented with a number of different versions of each of these parsers. We will report only on the most efficient version. The experiments were performed on a 125Mhz HP-UX 735 machine with 240 Megabytes of memory. Timings measure CPU-time and should be independent of the load on the machine. 7

Experiment 1: OVIS

The OVIS grammar (for Dutch) contains about 1400 lexical entries (many of which are station and city names) and 66 rules (a substantial fraction of those rules is concerned with time and date expressions), including 7 epsilon rules. The most important epsilon rule is part of a gap threading implementation of verb-second. The grammar is documented in detail in [31]. The head-corner table contains 128 pairs, the lexical head-corner table contains 93 pairs, the gap-head-corner table contains 14 pairs. The left-corner table contains 156 pairs, the lexical left-corner table contains 114 pairs, the gap-left-corner table contains 20 pairs. The pre-compiled grammar (which is used by the chart parsers) contains 92 rules.

The input for the parser consists of a test-set of 5000 word-graphs, randomly taken from a corpus of more than 25000 word-graphs. These word-graphs are the latest word-graphs that were available to us; they are `real' output of the current version of the speech recognizer as developed by our project partners. In this application, typical utterances are short. As a consequence, the typical size of word-graphs is rather small too, as can be seen in table 1.

Table 1: The leftmost table gives information concerning the number of transitions per word-graph of the test set for the OVIS grammar. As can be seen from this table, more than half of the corpus consists of word-graphs with at most five transitions. In the rightmost table the number of words per utterance is given. Many utterances consists of less than five words.
# transitions # word-graphs
0-5 2825
6-10 850
11-15 408
16-20 246
21-30 237
31-40 146
41-50 83
51-75 112
76-100 44
101-150 36
151-200 12
263 1
# words # utterances
1-2 2465
3-4 1448
5-6 543
7-8 319
9-10 118
11-12 56
13-14 26
15-16 20
17-18 5

We report on three different experiments with the OVIS grammar and these word-graphs. In the first experiment, the system runs in best-1-mode: the best path is selected from the word-graph using bigram scores and the acoustic scores (present in the word-graph). This best path is then sent to the parser and robustness component. In the second experiment, the parser is given the utterance as it was actually spoken (to simulate a situation in which speech recognition is perfect). In the third experiment, the parser takes the full word-graph as its input. The results are then passed on to the robustness component. As explained in the previous section on robustness, each of the parsers finds all derivations of the start symbol anywhere in the input (this is the case in each of the OVIS experiments).

For the current version of the OVIS system, parsing on the basis of the best path in the word-graph gives results in terms of word-accuracy which are similar to the results obtained with full word-graphs. Results for concept-accuracy are not yet available. Details can be found in [29].

Parsing best path only

In table 2 the CPU-time requirements and the maximum space requirements of the different parsers are listed.
Table 2: Total and average CPU-time and maximal space requirements for a test-set of 5000 best paths through word-graphs (OVIS grammar).
parser total (msec) msec/sentence maximum maximum space
hc 169370 34 530 163
lc 180160 36 530 171
bu-active 291870 58 4220 1627
bu-inactive 545060 109 13050 784
bu-earley 961760 192 24470 2526
lr 1088940 218 416000 4412

In the table we list respectively the total number of milliseconds CPU-time required for all 5000 word-graphs (timings include lexical lookup, parsing and the robustness component), the average number of milliseconds per word-graph, and the maximum number of milliseconds for a word-graph. The final column lists the maximum amount of space requirements (per word-graph, in Kbytes). 8

Parsing sentences

The differences in CPU-time for the corpus of 5000 word-graphs are similar to differences we have found for other test sets. The results are also very similar to the results we obtain if we parse the actually spoken utterances. Table 3 lists the results of parsing the set of 5000 utterances from which the word-graphs were derived.

Table 3: Total and average CPU-time and maximum space requirements for a test-set of 5000 utterances (OVIS grammar).
parser total (msec) msec/sentence maximum maximum space
hc 126930 25 510 137
lc 137090 27 490 174
bu-active 257390 51 4030 1438
bu-inactive 546650 109 15170 1056
bu-earley 934810 187 25490 3558
lr 957980 192 417580 4435

Parsing word-graphs

Obviously, parsing word-graphs is more difficult than parsing only the best path through a word-graph, or parsing an ordinary sentence. In table 4 we list the results for the same set of 5000 word-graphs. This experiment could only be performed for the head-corner and the left-corner parser. The other parsers ran into memory problems for some very large word-graphs.

Table 4: Total and average CPU-time and maximum space requirements for a test-set of 5000 word-graphs (OVIS grammar).
parser total (msec) msec/word-graph maximum maximum space
lc 410670 82 15360 4455
hc 435320 87 16230 4174

In order to compare the other parsers too, I performed the experiment with a time-out of 5000 msec (the memory problems only occur for word-graphs that take longer to process). In table 5 the percentage of word-graphs that can be treated within a certain amount of CPU-time are listed.

Table 5: Percentage of word-graphs that can be treated within time limit (OVIS grammar).
parser 500 1000 2000 3000 4000 5000 time-outs
lc 97.72 99.28 99.78 99.92 99.92 99.92 4
hc 97.42 98.94 99.60 99.84 99.92 99.92 4
lr 91.44 94.42 96.30 96.98 97.34 97.70 115
bu-active 91.84 94.76 96.04 96.84 97.30 97.60 120
bu-inactive 82.36 88.64 92.24 94.10 95.14 95.86 207
bu-earley 77.10 84.26 89.04 91.42 92.64 93.50 325

From the experiments with the OVIS grammar and corpus it can be concluded that the head-corner and left-corner parsers (implemented with selective memoization and goal-weakening) are much more efficient than the other parsers. In the case of word-graphs, the left-corner parser is about 5% faster than the head-corner parser; for strings, the head-corner parser is about 6 to 8% faster than the left-corner parser.

Experiment 2: MiMo2

Another experiment was carried out for the English grammar of the MiMo2 system. This grammar is a unification-based grammar which is compiled into a DCG. The grammar contains 525 lexical entries, 63 rules including 13 gaps. The head-corner relation contains 33 pairs and the lexical head-corner relation contains 18 pairs. The left-corner parser runs into hidden left-recursion problems on the original grammar, so it uses a version of the grammar in which left-most gaps are compiled out. This compiled grammar has 69 rules. The left-corner relation contains 80 pairs; the lexical left-corner relation contains 62 pairs. As a result, the left-corner parser only hypothesizes gaps for non-left-most daughters. Because the grammar never allows gaps as head, the head-corner parser can be optimized in a similar fashion. Both the left-corner and head-corner parser use a goal-weakening operator which only leave the functor symbol. This simplifies the way in which the goal table is maintained.

For this experiment we have no notion of typical input, but instead just made up a set of 25 sentences of various length and difficulty, with a total of 338 readings. In order to be able to complete the experiment a time-out of 60 seconds of CPU-time was used. Timings include lexical lookup and parse tree recovery.

The original parser implemented in the MiMo2 system (a left-corner parser without packing) took 294 seconds of CPU-time to complete the experiment (with 3 time-outs). Because the test environment was (only slightly) different, we have indicated the latter results in italics. Average CPU-time is only given for those parsers which completed each of the sentences within the time limit. The results are given in table 6

Table 6: Total and average CPU-time and maximum space requirements for set of 25 sentences (MiMo2 grammar).
parser total (msec) msec/sentence maximum space time-outs
hc 52670 2107 2062 0
bu-active 52990 2120 30392 0
lc 109750 4390 8570 0
mimo2-lc 294000     3
bu-earley 439050   12910 4
bu-inactive 498610   7236 5

The bottom-up active chart parser performs better on smaller sentences with a small number of readings. For longer and more ambiguous sentences the head-corner parser is (much) more efficient. The other parsers are consistently much less efficient.

Experiment 3: Alvey NL Tools

A final set of experiments was performed for the Alvey NL Tools grammar [6], similar to the experiments discussed in [5]. For a longer description of the grammar and the test sets we refer to this publication. The grammar contains 2363 lexical entries, and 780 rules (8 of which are gaps). The left-corner relation contains 440 pairs; the lexical left-corner relation contains 254 pairs. No gaps are possible as left-most element of the right-hand-side of a rule.

In order to be able to use the head-corner parser we needed to determine for each of the rules which element on the right-hand-side constitutes the head of the rule. The head-corner relation contains 352 pairs; the lexical head-corner relation contains 180 pairs. We also report on experiments in which for each rule the left-most member of the right-hand-side was selected as the head. The goal-weakening operator used for the left-corner and head-corner parser removes all features (only leaving the functor symbol of each category); again this simplifies the maintenance of the goal table considerably.

The bottom-up chart parsers use a version of the grammar in which all epsilon rules are compiled out. The resulting grammar has 1015 rules.

The first test set consists of 129 short sentences (mean length 6.7 words). Our results were obtained with a newer version of the Alvey NL Tools grammar. In the table below we list the results for the same grammar and test set for Carroll's bottom-up left-corner parser (BU-LC). Carroll performed this experiment on a SUN UltraSparc 1/140. It was estimated by Carroll and the author that this machine is about 1.62 times faster than the HP-UX 735 on which the other experiments were performed. 9

In the table below we have multiplied the 13.3 seconds of CPU-time (obtained by Carroll) with this factor in order to compare his results with our results. Clearly, these numbers should be taken with extreme caution, because many factors in the test environment differ (hardware, LISP versus Prolog). For this reason we use italics in table 7.

Table 7: Total and average CPU-time and maximum space requirements for set of 129 short sentences (Alvey NL Tools grammar). Italicized items are offered for cautious comparison.
parser msec msec/sentence max Kbytes
bu-active 18250 141 1276
lc 21900 170 137
Carroll BU-LC 21500 167  
hc (lc mode) 23690 184 165
bu-earley 27670 214 758
hc 68880 534 140
bu-inactive 83690 649 170

The second test set consists of 100 longer and much more complex sentences. The length of the sentences is distributed uniformly between 13 and 30 words (sentences created by Carroll). Many of the sentences have many parses: the maximum number of parses is 2736 for one 29-word sentence. Average number of readings is about 100 readings per sentence.

Again, we list the results Carroll obtained with the BU-LC parser. It took 205.7 seconds on the SUN UltraSparc 1/140. 10 The bottom-up active chart parser ran into memory problems for some very ambiguous sentences and was very slow on many of the other sentences (due to the lack of packing). The results are summarized in table 8.

Table 8: Total and average CPU-time and maximum space requirements for set of 100 longer sentences (Alvey NL Tools grammar). Italicized items are offered for cautious comparison.
parser msec msec/sentence max Kbytes
lc 195850 1959 10955
hc (lc mode) 216180 2162 10969
Carroll BU-LC 333000 3330  
bu-earley 1219120 12191 18232
hc 3053910 30539 7915
bu-inactive 3578370 35784 16936
bu-active > >   > 65000

The implementation of the left-corner parser based on selective memoization and goal-weakening seems to be substantially more efficient than the chart-based implementation of Carroll. The head-corner parser running in left-corner mode is almost as fast as this specialized left-corner parser. This suggests that the use of selective memoization with goal-weakening is on the right track.

From these experiments it can be concluded that the head-corner parser is not suitable for the Alvey NL Tools grammar. The reason seems to be that for this grammar the amount of top-down information that is available through the head-corner table is of limited value. In order to parse a given goal category, too many different lexical head-corners are typically available. For example, in order to parse a sentence possible head-corners include auxiliaries, verbs, adverbs, complementizers, pronouns, prepositions, determiners, nouns and conjunctions. In contrast, in the MiMo2 grammar only verbs can function as the head-corners of sentences. As a result the prediction step introduces too much non-determinism. A related reason for the poor performance for this grammar might be the large amount of lexical ambiguity. The grammar and lexicon used in the experiment is compiled from a compact user notation. In the compiled format, all disjunctions are spelled out in different rules and lexical entries. As a result, many words have a large number of (only slightly different) readings. It may be that the head-corner parser is less suitable in such circumstances. This could also explain the fact that the head-corner parser performs better on strings then on word-graphs: in many respects the generalization to word-graphs is similar to an increase in lexical ambiguity. This suggests that further improvements in the design of the head-corner parser should be sought in the prediction step.

Availability test material

The material used to perform the experiments with the MiMo2 grammar and the Alvey NL Tools grammar, including several versions of the head-corner parser, is available via anonymous ftp and the world-wide-web. The material is ready to be plugged into the Hdrug environment available from the same site.

next up previous
Next: acknowledgments Up: An Efficient Implementation of Previous: Head-corner parsing and Robustness
Noord G.J.M. van