Next: Longest Match Capturing Up: The Algorithm Previous: Preliminary Considerations

## Implementation

A rule of the form will be written as replace(T,Lambda,Rho). Rules of the more general form will be discussed in §3. The algorithm consists of nine steps composed as in figure 1.

The names of these steps are mostly derived from [7] and [12] even though the transductions involved are not exactly the same. In particular, the steps derived from Mohri & Sproat (r, f, l1 and l2) will all be defined in terms of the finite state calculus as opposed to Mohri & Sproat's approach of using low-level manipulation of states and transitions.3

The first step, non_markers, was already defined above. For the second step, we first consider a simple special case. If the empty string is in the language described by Right, then r(Right) should insert an rb2 in every string position. The definition of r(Right) is both simpler and more efficient if this is treated as a special case. To insert a bracket in every possible string position, we use:

[[[] x rb2,sig]*,[] x rb2]


If the empty string is not in Right, then we must use intro(rb2) to introduce the marker rb2, followed by l_iff_r to ensure that such markers are immediately followed by a string in Right, or more precisely a string in Right where additional instances of rb2 are freely inserted in any position other than the beginning. This expression is written as:

            intro(rb2)
o
l_iff_r(rb2,xign(non_markers(Right),rb2))


Putting these two pieces together with the conditional yields:

macro(r(R),
if([] & R,       % If: [] is in R:
[[[] x rb2,sig]*,[] x rb2],
intro(rb2)   % Else:
o
l_iff_r(rb2,xign(non_markers(R),rb2)))).


The third step, f(domain(T)) is implemented as:

macro(f(Phi), intro(lb2)
o
l_iff_r(lb2,[xignx(non_markers(Phi),b2),
lb2^,rb2])).


The lb2 is first introduced and then, using l_iff_r, it is constrained to occur immediately before every instance of (ignoring complexities) Phi followed by an rb2. Phi needs to be marked as normal text using non_markers and then xign_x is used to allow freely inserted lb2 and rb2 anywhere except at the beginning and end. The following lb2^ allows an optional lb2, which occurs when the empty string is in Phi.

The fourth step is a guessing component which (ignoring complexities) looks for sequences of the form lb2 Phi rb2 and converts some of these into lb1 Phi rb1, where the b1 marking indicates that the sequence is a candidate for replacement. The complication is that Phi, as always, must be converted to non_markers(Phi) and instances of b2 need to be ignored. Furthermore, between pairs of lb1 and rb1, instances of lb2 are deleted. These lb2 markers have done their job and are no longer needed. Putting this all together, the definition is:

macro(left_to_right(Phi),
[[xsig*,
[lb2 x lb1,
(ign(non_markers(Phi),b2)
o
inverse(intro(lb2))
),
rb2 x rb1]
]*, xsig*]).


The fifth step filters out non-longest matches produced in the previous step. For example (and simplifying a bit), if Phi is ab*, then a string of the form ... rb1 a b lb1 b ... should be ruled out since there is an instance of Phi (ignoring brackets except at the end) where there is an internal lb1. This is implemented as:4

macro(longest_match(Phi),
not($$([lb1, (ignx(non_markers(Phi),brack) &$$(rb1)
),     % longer match must be
rb     % followed by an rb
]))     % so context is ok
o
% done with rb2, throw away:
inverse(intro(rb2))).


The sixth step performs the transduction described by T. This step is straightforwardly implemented, where the main difficulty is getting T to apply to our specially marked string:

macro(aux_replace(T),
{{sig,lb2},
[lb1,
inverse(non_markers)
o T o
non_markers,
rb1 x []
]
}*).


The seventh step ensures that lb1 is preceded by a string in Left:

macro(l1(L),
ign(if_s_then_p(
ignx([xsig*,non_markers(L)],lb1),
[lb1,xsig*]),
lb2)
o
inverse(intro(lb1))).


The eighth step ensures that lb2 is not preceded by a string in Left. This is implemented similarly to the previous step:

macro(l2(L),
if_s_then_p(
ignx(not([xsig*,non_markers(L)]),lb2),
[lb2,xsig*])
o
inverse(intro(lb2))).


Finally the ninth step, inverse(non_markers), removes the 0's so that the final result in not marked up in any special way.

Next: Longest Match Capturing Up: The Algorithm Previous: Preliminary Considerations
Noord G.J.M. van
1999-04-15