next up previous
Next: Implementation Up: The Algorithm Previous: The Algorithm


Preliminary Considerations

Before presenting the algorithm proper, we will deal with a couple of meta issues. First, we introduce our version of the finite state calculus in §2.1.1. The treatment of special marker symbols is discussed in §2.1.2. Then in §2.1.3, we discuss various utilities that will be essential for the algorithm.

FSA Utilities

Table 1: Regular expression operators.
[] empty string
[E1,...En] concatenation of E1 ...En
{} empty language
{E1,...En} union of E1,...En
E* Kleene closure
E^ optionality
~E complement
E1-E2 difference
$ E containment
E1 & E2 intersection
? any symbol
A:B pair
E1 x E2 cross-product
A o B composition
domain(E) domain of a transduction
range(E) range of a transduction
identity(E) identity transduction
inverse(E) inverse transduction

The algorithm is implemented in the FSA Utilities [16]. We use the notation provided by the toolbox throughout this paper. Table 1 lists the relevant regular expression operators. FSA Utilities offers the possibility to define new regular expression operators. For example, consider the definition of the nullary operator vowel as the union of the five vowels:

In such macro definitions, Prolog variables can be used in order to define new n-ary regular expression operators in terms of existing operators. For instance, the lenient_composition operator [10] is defined by:
      {Q, ~domain(Q) o R}).
      priority_union(R o C,R)).
Here, priority_union of two regular expressions Q and R is defined as the union of Q and the composition of the complement of the domain of Q with R. Lenient composition of R and C is defined as the priority union of the composition of R and C (on the one hand) and R (on the other hand).

Some operators, however, require something more than simple macro expansion for their definition. For example, suppose a user wanted to match n occurrences of some pattern. The FSA Utilities already has the '*' and '+' quantifiers, but any other operators like this need to be user defined. For this purpose, the FSA Utilities supplies simple Prolog hooks allowing this general quantifier to be defined as:

macro(match_n(N,X),Regex) :-

match_n(N,X,[X|Rest]) :-
   N > 0,
   N1 is N-1,

For example: match_n(3,a) is equivalent to the ordinary finite state calculus expression [a,a,a].

Finally, regular expression operators can be defined in terms of operations on the underlying automaton. In such cases, Prolog hooks for manipulating states and transitions may be used. This functionality has been used in [17] to provide an implementation of the algorithm in [12].

Treatment of Markers

Previous algorithms for compiling rewrite rules into transducers have followed [5] by introducing special marker symbols (markers) into strings in order to mark off candidate regions for replacement. The assumption is that these markers are outside the resulting transducer's alphabets. But previous algorithms have not ensured that the assumption holds.

This problem was recognized by [8], whose algorithm starts with a filter transducer which filters out any string containing a marker. This is problematic for two reasons. First, when applied to a string that does happen to contain a marker, the algorithm will simply fail. Second, it leads to logical problems in the interpretation of complementation. Since the complement of a regular expression R is defined as $ \Sigma$ - R, one needs to know whether the marker symbols are in $ \Sigma$ or not. This has not been clearly addressed in previous literature.

We have taken a different approach by providing a contextual way of distinguishing markers from non-markers. Every symbol used in the algorithm is replaced by a pair of symbols, where the second member of the pair is either a 0 or a 1 depending on whether the first member is a marker or not.2 As the first step in the algorithm, 0's are inserted after every symbol in the input string to indicate that initially every symbol is a non-marker. This is defined as:


Similarly, the following macro can be used to insert a 0 after every symbol in an arbitrary expression E.

      range(E o non_markers)).
Since E is a recognizer, it is first coerced to identity(E). This form of implicit conversion is standard in the finite state calculus.

Note that 0 and 1 are perfectly ordinary alphabet symbols, which may also be used within a replacement. For example, the sequence [1,0] represents a non-marker use of the symbol 1.


Before describing the algorithm, it will be helpful to have at our disposal a few general tools, most of which were described already in [5]. These tools, however, have been modified so that they work with our approach of distinguishing markers from ordinary symbols. So to begin with, we provide macros to describe the alphabet and the alphabet extended with marker symbols:


The macro xsig is useful for defining a specialized version of complementation and containment:

macro(not(X),xsig* - X).

The algorithm uses four kinds of brackets, so it will be convenient to define macros for each of these brackets, and for a few disjunctions.


As in Kaplan & Kay, we define an Intro(S) operator that produces a transducer that freely introduces instances of S into an input string. We extend this idea to create a family of Intro operators. It is often the case that we want to freely introduce marker symbols into a string at any position except the beginning or the end.

%% Free introduction
macro(intro(S),{xsig-S,[] x S}*).

%% Introduction, except at begin

%% Introduction, except at end

%% Introduction, except at begin & end

This family of Intro operators is useful for defining a family of Ignore operators:

macro( ign( E1,S),range(E1 o  intro( S))).
macro(xign( E1,S),range(E1 o xintro( S))).
macro( ignx(E1,S),range(E1 o  introx(S))).
macro(xignx(E1,S),range(E1 o xintrox(S))).

In order to create filter transducers to ensure that markers are placed in the correct positions, Kaplan & Kay introduce the operator P-iff-S(L1,L2). A string is described by this expression iff each prefix in L1 is followed by a suffix in L2 and each suffix in L2 is preceded by a prefix in L1. In our approach, this is defined as:


To make the use of p_iff_s more convenient, we introduce a new operator l_iff_r(L,R), which describes strings where every string position is preceded by a string in L just in case it is followed by a string in R:


Finally, we introduce a new operator if(Condition,Then,Else) for conditionals. This operator is extremely useful, but in order for it to work within the finite state calculus, one needs a convention as to what counts as a boolean true or false for the condition argument. It is possible to define true as the universal language and false as the empty language:

macro(true,? *).   macro(false,{}).

With these definitions, we can use the complement operator as negation, the intersection operator as conjunction and the union operator as disjunction. Arbitrary expressions may be coerced to booleans using the following macro:

      range(E o (true x true))).

Here, E should describe a recognizer. E is composed with the universal transducer, which transduces from anything (?*) to anything (?*). Now with this background, we can define the conditional:

   {  coerce_to_boolean(Cond) o Then,
     ~coerce_to_boolean(Cond) o Else

next up previous
Next: Implementation Up: The Algorithm Previous: The Algorithm
Noord G.J.M. van