In this paper we are concerned with the syntactic analysis phase of a natural language understanding system. Ordinarily, the input to such a system is a sequence of words. However, following Bernard Lang we argue that it might be fruitful to take the input more generally to be a finite state automaton (FSA), in order to model cases in which we are uncertain about the actual input. Parsing uncertain input may be necessary in the case of ill-formed textual input, or in the case of speech input.
For example, if a natural language understanding system is interfaced with a speech recognition component, chances are that this component is uncertain about the actual string of words that has been uttered, and thus produces a word lattice of the most promising hypotheses, rather than a single sequence of words. FSAs, of course, generalize such word lattices.
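To illustrate, a word lattice can be encoded directly as an FSA whose edges are labeled with word hypotheses. The following sketch (the encoding and names are ours, purely for illustration) represents a recognizer that is uncertain between the hypotheses "I scream" and "ice cream":

```python
# A word lattice as an FSA: states are integers, edges map a
# (state, word) pair to a set of successor states. This lattice
# encodes the competing hypotheses "I scream" and "ice cream".
lattice = {
    "start": 0,
    "final": {3},
    "edges": {
        (0, "I"):      {1},
        (1, "scream"): {3},
        (0, "ice"):    {2},
        (2, "cream"):  {3},
    },
}

def accepts(fsa, words):
    """True iff the (nondeterministic) FSA accepts the word sequence."""
    states = {fsa["start"]}
    for w in words:
        # Follow all edges labeled w from every currently reachable state.
        states = set().union(*(fsa["edges"].get((q, w), set()) for q in states))
        if not states:
            return False
    return bool(states & fsa["final"])
```

Here both hypothesized word sequences are accepted, while sequences mixing the two paths (such as "I cream") are not, since the two hypotheses do not share intermediate states.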
As another example, certain techniques to deal with ill-formed input can be characterized as finite state transducers; the composition of an input string with such a finite state transducer results in a FSA that can then be input for syntactic parsing. Such an approach allows for the treatment of missing, extraneous, interchanged or misused words [13,12,9].
Such techniques might be of use for both written and spoken language input. In the latter case, another possible application concerns the treatment of phenomena such as repairs.
Note that we allow the input to be a full FSA (possibly including cycles, etc.), since some of the above-mentioned techniques indeed give rise to cycles. Whereas an ordinary word lattice always defines a finite language, a FSA can easily define an infinite number of sentences. Cycles may emerge in the treatment of unknown sequences of words, i.e. sentences with unknown parts of unknown length.
As suggested by an ACL reviewer, one could also try to model haplology phenomena (such as the 's in English sentences like 'The chef at Joe's hat', where 'Joe's' is the name of a restaurant) using a finite state transducer. In a straightforward approach this would also lead to a finite-state automaton with cycles.
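A cyclic FSA of the kind just described can be sketched as follows (a minimal illustration of ours, not taken from the paper): a self-loop labeled with an unknown-word symbol accepts a stretch of unknown words of arbitrary length, so the automaton defines an infinite language.

```python
# An FSA with a cycle, modeling an utterance whose middle part is an
# unrecognized stretch of unknown length. The symbol "<unk>" and the
# tiny vocabulary are illustrative assumptions.
UNK = "<unk>"

edges = {
    (0, "the"):   {1},
    (1, UNK):     {1},   # cycle: arbitrarily many unknown words
    (1, "chef"):  {2},
    (2, "sleeps"): {3},
}
start, final = 0, {3}

def recognize(words):
    """True iff the cyclic FSA above accepts the word sequence."""
    states = {start}
    for w in words:
        states = set().union(*(edges.get((q, w), set()) for q in states))
    return bool(states & final)
```

The automaton accepts "the chef sleeps", "the <unk> chef sleeps", "the <unk> <unk> chef sleeps", and so on: a single finite machine describing infinitely many sentences.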
It can be shown that computing the intersection of a FSA and a CFG requires only a minimal generalization of existing parsing algorithms: we simply replace the usual string positions with the names of the states in the FSA. It is also straightforward to show that the complexity of this process is cubic in the number of states of the FSA [6,2], assuming that the right-hand sides of grammar rules contain at most two categories (in the case of ordinary parsing the number of states equals n + 1).
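The generalization can be sketched concretely as follows (a minimal recognizer of our own construction, with assumed names; it is not the paper's implementation). Chart items are triples (category, p, q), where p and q are FSA states rather than string positions, and the chart is closed under the grammar rules until a fixpoint is reached. The fixpoint formulation also terminates for cyclic automata, since the number of possible items is finite.

```python
def intersect_recognize(edges, start, finals, lexical, binary, s="S"):
    """Recognize the intersection of a CFG and an FSA.

    edges:   set of FSA transitions (p, word, q)
    lexical: dict mapping a word to the set of categories rewriting it
    binary:  set of rules (A, B, C) meaning A -> B C
    Returns True iff some path from `start` to a final state is in L(G).
    """
    # Seed the chart from lexical rules: A spans edge (p, w, q) if A -> w.
    chart = {(cat, p, q) for (p, w, q) in edges for cat in lexical.get(w, ())}
    changed = True
    while changed:                          # fixpoint closure
        changed = False
        for (A, B, C) in binary:
            for (b, p, q) in list(chart):
                if b != B:
                    continue
                for (c, q2, r) in list(chart):
                    # Combine adjacent items: B from p to q, C from q to r.
                    if c == C and q2 == q and (A, p, r) not in chart:
                        chart.add((A, p, r))
                        changed = True
    return any((s, start, f) in chart for f in finals)
```

For instance, with the linear FSA for "the chef sleeps" (edges {(0, "the", 1), (1, "chef", 2), (2, "sleeps", 3)}), lexical entries Det, N, VP, and rules NP -> Det N and S -> NP VP, the recognizer derives (NP, 0, 2) and then (S, 0, 3). The number of items is at most the number of categories times the number of state pairs, which is the source of the cubic bound in the number of states mentioned above.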
In this paper we investigate whether the same techniques can be applied in the case that the grammar is a constraint-based grammar rather than a CFG. For concreteness we take the grammar to be a Definite Clause Grammar (DCG). A DCG is a simple example of a family of constraint-based grammar formalisms that are widely used in natural language analysis (and generation). The main findings of this paper can be extended to other members of this family of constraint-based grammar formalisms.