In experimenting with finite-state approximation techniques for
context-free and more powerful grammatical formalisms (such as the
techniques presented in [12], [11],
[4]) we have found that the resulting automata often
are extremely large. Moreover, the automata contain many
-moves (jumps). And finally, if such automata are
determinised then the resulting automata are often smaller. It
turns out that a straightforward implementation of the subset
construction determinisation algorithm performs badly for such inputs.
As a motivating example, consider the definite-clause grammar that has
been developed for the OVIS2 Spoken Dialogue System. This grammar is
described in detail in [14]. After removing the feature
constraints of this grammar, and after the removal of the sub-grammar
for temporal expressions, this context-free skeleton grammar was input
to an implementation of the technique described in [11].
1The resulting non-deterministic automaton (labelled zovis2
below) contains 89832 states, 80935
-moves, and 80400
transitions. The determinised automaton contains only 6541 states, and
60781 transitions. Finally, the minimal automaton contains only 78
states and 526 transitions! Other grammars give rise to similar
numbers. Thus, the approximation techniques yield particularly
`verbose' automata for relatively simple languages.
The experiments were performed using the FSA Utilities toolkit
[13]. At the time, an old version of the toolkit was used,
which ran into memory problems for some of these automata. For this
reason, the subset construction algorithm has been re-implemented,
paying special attention to the treatment of
-moves. Three
variants of the subset construction algorithm are identified which
differ in the way
-moves are treated:
The motivation for this paper is the experience that the first
approach turns out to be impractical for automata with very large
numbers of
-moves. An integration of the subset
construction algorithm with the computation of
-reachable
states performs much better in practice. The per subset
algorithm almost always performs better than the per state
approach. However, for automata with a low number of jumps, the per graph algorithm outperforms the others.
In constructing an
-free automaton the number of transitions
increases. Given the fact that the input automaton already is
extremely large (compared to the simplicity of the language it
defines), this is an undesirable situation. An equivalent
-free automaton for the example given above results in an
automaton with 2353781 transitions. The implementation of per
subset is the only variant which succeeds in determinising the
input automaton of this example.
In the following section some background information concerning the
FSA Utilities tool-box is provided. Section 3 then
presents a short statement of the problem (determinise a given
finite-state automaton), and a subset construction algorithm which
solves this problem in the absence of
-moves.
Section 4 identifies three variants of the subset
construction algorithm which take
-moves into account.
Finally, section 5 discusses some experiments in order to
compare the three variants both on randomly generated automata and
on automata generated by approximation algorithms.