Robust parsing

The input to the NLP module consists of word-graphs produced by the
speech recognizer. A word-graph is a compact representation for all
lists of words that the speech recognizer hypothesizes for a spoken
utterance. The nodes of the graph represent points in time, and
an edge between two nodes represents a word that may have been uttered
between the corresponding points in time. Each edge is associated with
an *acoustic score* representing a measure of confidence that the word
perceived there is the word that was actually uttered. These scores
are negative logarithms of probabilities and therefore require
addition as opposed to multiplication when two scores are combined.

At an early stage, the word-graph is optimized to eliminate the
*epsilon transitions*. Such transitions represent periods of time when the
speech recognizer hypothesizes that no words are uttered. After
this optimization, the word-graph contains exactly one start node and
one or more final nodes, associated with a score,
representing a measure of confidence that the utterance ends at that point.

In the ideal case, the parser will find one or more paths in a given word-graph that can be assigned an analysis according to the grammar, such that the paths cover the complete time span of the utterance, i.e. the paths lead from the start node to a final node. Each analysis gives rise to an update of the dialogue state. From that set of updates, one is then passed on to the dialogue manager.

However, often no such paths can be found in the word-graph, due to:

- errors made by the speech recognizer,
- linguistic constructions not covered in the grammar, and
- irregularities in the spoken utterance.

Our solution is to allow recognition of paths in the word-graph
that do not necessarily span the complete utterance. Each path should
be an instance of some major category from the grammar, such as
S, NP, PP, etc.
In our application, this often comes down to categories such as
``temporal expression'' and ``locative phrases''.
Such paths will be called *maximal projections*.
A list of maximal projections that do not pair-wise overlap and
that lie on a single path from the start node
to a final node in the word-graph represents a reading of the utterance.
The transitions between the maximal projections will be called
*skips*.

The optimal such list is computed, according to criteria to be
discussed below.
The categories of the
maximal projections in the list are then combined and
the update for the complete utterance is computed.
This last phase contains, among other things, some domain-specific linguistic
knowledge dealing with expressions that may be ungrammatical in other
domains; e.g. the utterance *``Amsterdam Rotterdam''*
does not exemplify a general grammatical
construction of Dutch, but in the particular domain of OVIS such an
utterance occurs frequently, with the meaning
*``departure from Amsterdam and arrival in Rotterdam''*.

We will now describe the robust parsing module in more detail. The first phase that is needed is the application of a parsing algorithm which is such that:

- 1.
- grammaticality is investigated for all paths, not only for the complete paths from the first to a final node in the word-graph, and
- 2.
- grammaticality of those paths is investigated for each category from a fixed set.

The second phase is the selection of the optimal list of maximal projections
lying
on a single path from the start node to a final node.
At each node
we visit, we compute a partial score consisting of a tuple (*S*, *P*, *A*),
where *S* is the number of transitions on the path not part of a
maximal projection (the skips),
*P* is the number of maximal projections,
*A* is the sum of the acoustic scores of all the transitions on the path,
including those internal in maximal projections.
We define the relation
on triples such that
(*S*_{1}, *P*_{1}, *A*_{1})
(*S*_{2}, *P*_{2}, *A*_{2}) if and only if:

*S*_{1}<*S*_{2}, or*S*_{1}=*S*_{2}and*P*_{1}<*P*_{2}, or*S*_{1}=*S*_{2}and*P*_{1}=*P*_{2}and*A*_{1}<*A*_{2}.

Our branch-and-bound algorithm maintains a priority queue, which
contains pairs of the form
(*N*,(*S*, *P*, *A*)),
consisting of a node *N* and a triple (*S*, *P*, *A*) found at the node,
or pairs of the form
(,(*S*, *P*, *A*)), with the same meaning
except that *N* is now a final node of which the acoustic score is
incorporated into *A*.
Popping an element from the queue yields a pair of which the
second element is an optimal triple with regard to the relation
fined above.
Initially, the queue contains just
(*N*_{0},(0, 0, 0)), where *N*_{0} is
the start node, and possibly
(,(0, 0, *A*)), if
*N*_{0} is also a final state with acoustic score *A*.

A node *N* is marked as *seen* when a triple has been encountered
at *N* that must be optimal with respect to all paths leading to
*N* from the start node.

The following is repeated until a final node is found with an optimal triple:

- 1.
- Pop an optimal element from the queue.
- 2.
- If it is of the form
(,(
*S*,*P*,*A*)) then return the path leading to that triple at that node, and halt. - 3.
- Otherwise, let that element be
(
*N*,(*S*,*P*,*A*)). - 4.
- If
*N*was already marked as*seen*then abort this iteration and return to step 1. - 5.
- Mark
*N*as*seen*. - 6.
- For each maximal projection from
*N*to*M*with acoustic score*A'*, enqueue (*M*,(*S*,*P*+ 1,*A*+*A'*)). If*M*is a final node with acoustic score*A''*, then furthermore enqueue (,(*S*,*P*+ 1,*A*+*A'*+*A''*)). - 7.
- For each transition from
*N*to*M*with acoustic score*A'*, enqueue (*M*,(*S*+ 1,*P*,*A*+*A'*)). If*M*is a final node with acoustic score*A''*, then furthermore enqueue (,(*S*+ 1,*P*,*A*+*A'*+*A''*)).

Besides *S*, *P*, and *A*,
other factors can be taken into account as well,
such as the *semantic* score, which is obtained by comparing the
updates corresponding to maximal projections with the meaning of the
question generated by the system prior to the user utterance.

We are also experimenting with the bigram score. Bigrams attach a measure of likelihood to the occurrence of a word given a preceding word.

Note that when bigrams are used, simply labelling nodes in the graph
as *seen* is not a valid method to prevent recomputation of
subpaths. The required adaptation to the basic
branch-and-bound algorithm is not discussed here.

Also, in the actual implementation the *X* best readings are produced,
instead of a single best reading.
This requires a generalization of the above procedure
so that instead of using the label *``seen''*,
we attach labels *``seen i times''* to each node, where
0
*i*
*X*.

1998-09-24