Introduction

Context sensitive rewrite rules have been widely used in several areas of natural language processing. Johnson [4] has shown that such rewrite rules are equivalent to finite state transducers in the special case that they are not allowed to rewrite their own output. An algorithm for compilation into transducers was provided by [5]. Improvements and extensions to this algorithm have been provided by [7], [9], [8] and [12]. In this paper, the algorithm will be extended to provide a limited form of backreferencing. Backreferencing has been implicit in previous research, such as in the ``batch rules'' of [5], bracketing transducers for finite-state parsing [8], and the ``LocalExtension'' operation of [13]. The explicit use of backreferencing leads to more elegant and general solutions.

Backreferencing is widely used in editors, scripting languages and
other tools employing regular expressions [3]. For
example, Emacs uses the special brackets `\(`

and `\)`

to
capture strings along with the notation `\`

*n* to recall the
*n*th such string. The expression ** \(a*\)b\1**
matches strings of the form

The central case of an allowable backreference is:

(1) |

This says that each string *x* preceded by
and
followed by
is replaced by *T*(*x*), where
and are arbitrary regular expressions, and *T* is a
transducer.^{1} This contrasts sharply with the rewriting
rules that follow the tradition of Kaplan & Kay:

(2) |

In this case, any string from the language is replaced by any string independently chosen from the language .

We also allow multiple (non-permuting) backreferences of the form:

(3) |

Since transducers are closed under concatenation, handling multiple backreferences reduces to the problem of handling a single backreference:

(4) |

A problem arises if we want capturing to follow the
POSIX standard requiring a longest-capture strategy. Friedl
[3] (p. 117), for example, discusses matching the
regular expression (to|top)(o|polo)?(gical|o?logical)
against the word: **topological**. The desired result is that (once
an overall match is established) the first set of parentheses should
capture the longest string possible (**top**); the second set should
then match the longest string possible from what's left (**o**), and
so on. Such a left-most longest match concatenation operation is
described in §3.

In the following section, we initially concentrate on the simple case in (1) and show how (1) may be compiled assuming left-to-right processing along with the overall longest match strategy described by [8].

The major components of the algorithm are not new, but straightforward modifications of components presented in [8] and [12]. We improve upon existing approaches because we solve a problem concerning the use of special marker symbols (§2.1.2). A further contribution is that all steps are implemented in a freely available system, the FSA Utilities of [16] (§2.1.1).

1999-04-15