In natural language processing, it is often more natural to think of symbols in terms of predicates or classes. The linguistic principle of Community dictates that similar segments behave similarly. Predicates are a means to express this similarity. In computational phonology it is thus more natural to talk about vowels and consonants rather than enumerate each of the phonemes in these classes. Phonological generalizations typically refer to predicates such as fricative, nasal, voiced and very seldomly to individual phonemes directly. Therefore, in finite state computational phonology, some have proposed finite state automata in which transitions are associated with sets of symbols [35,3,8,36].
As a further piece of motivation for the introduction of predicates, consider the unknown symbol regular expression operator, typically written ?, as it is available in some regular expression compilers [18,34]. An obvious implementation will expand the ? operator into a set of transitions for each of the symbols in the alphabet . In our proposal, the ? operator will be expanded into a single transition with an associated predicate which is true for all symbols; this has the advantage that need not be explicitly defined. As a consequence, there is no need to assume that the alphabet is finite. Such considerations become important for applications with large alphabets, such as the Unicode alphabet. Even larger alphabets may surface in natural language processing applications in which the symbols are words. Typical electronic dictionaries have at least 200K words and even this large size alphabet is not enough to handle unrestricted texts. Realistically, robust syntactic parsing requires an infinite alphabet.1
Below, we define predicate augmented finite state automata more precisely; for now it suffices to assume that such automata are similar to classical finite state automata, except that we have predicates instead of symbols.