# 2. Regular Expressions

Regular expressions are the preferred way to specify regular languages and regular relations. The regular expression compiler compiles a regular expression into an equivalent finite-state automaton.

The syntax of regular expressions in FSA is somewhat non-standard. As usual, atomic symbols normally represent themselves. Concatenation is indicated using a Prolog list where each of the elements is a regular expression itself; union is indicated using curly (`set') brackets, and Kleene star uses the *-suffix. For instance:

[{a,b,c}*,b,{a,b,c}*]

is the set of strings over the alphabet {a,b,c} such that each string contains at least one `b'. Since a list indicates concatenation, the empty list `[]' indicates the empty concatenation, i.e. the empty string (the language consisting of a single string which is the epsilon string). Furthermore, `{}' represents the empty language.

Such regular expressions define regular languages, but regular expressions can also be used to define weighted regular languages, regular relations and weighted regular relations. A regular relation (transducer) is specified for instance by:

[{a:a,b:b,c:c}*,b:a,{a:a,b:b,c:c}*]

Where x:y represents a mapping from symbol x to symbol y. Weights can be attached using a double semi-colon:

{a::0,b::1,c::2}*

This extends to weighted transducers. As an example consider:

[{a:a::0,b:b::1,c:c::2}*,b:a::0,{a:a::2,b:b::1,c:c::0}*]

btw. this is equivalent to:

[{a:a:0,b:b:1,c:c:2}*,b:a:0,{a:a:0,b:b:1,c:c:2}*]

Below we provide more detailed documentation on the regular expression syntax, the type coercion performed implicitly by the regular expression compiler, ways to debug regular expressions, the way in which regular expression operators can be added, and detailed documentation on each of the regular expression operators.

## 2.1. Regular expression syntax

Regular expressions are defined as Prolog terms, and therefore Prolog syntax applies. For detailed information on this, cf. the Prolog manual. The brackets () can always be used to express the desired grouping. The order of precedence of operators is as follows:

: /
..
+ * ^
& -
o x xx
! #

Operators with the same precedence are interpreted left-to-right. For example, the expression

a..z* - b* & c..d*

is interpreted as:

(((a..z)*) - (b*)) & ((c..d)*)

Syntax restrictions

These are all due to the use of Prolog syntax. The benefit of using Prolog syntax is that I don't need to implement a parser, and you have flexibility (by using your own operator definitions). However, a few limitations are inherited as well. Here are a few rules of thumb:

Capitals can be used in a regular expression in the Tk entry field, by putting them between quotes:

'A'..'Z'

At the Regex prompt (after fsa -r) you can use:

'A'..'Z'

At the command-interpreter you can use:

2 |: -r 'A'..'Z'

As part of Expr in the fsa -r Expr command, use (this depends on the shell you are using. This example works for bourne sh, csh, and bash):

fsa -r "'A'..'Z'"

Use space between operators. Use space before and after a question mark (?). Don't use the dot '.' or the vertical bar '|' as (part of) a new operator. Similarly, avoid using the comma ',', the ';', and '->' as (part of a) regular expression operator. It's neither a good idea to use ':-'. Operators can be escaped using (), but hardly ever have to (e.g. the following works, even if o is the binary composition operator!).

fsa -r 'o o o'

Brackets can be used for grouping as well:

fsa -r 'o o o o (o o o)'

## 2.2. Type Coercion

FSA6 supports four types of automaton:

• recognizers

• weighted recognizers (aka string-to-weight transducers)

• transducers (aka string-to-string transducers)

• weighted transducers (aka string-to-string-weight transducers)

If automata of different types are combined by a binary regular expression operator such as the concatenation operator, then the types of the automata are compared, and if neccessary the automata are coerced silently. A similar mechanism is used for operations which require a specific type.

Coercion is performed according to the following type hierarchy, where coercion is possible in upward direction, using the operator indicated within brackets.

WEIGHTED-
TRANSDUCER
/       \
/           \
/               \
(identity)         (zero_weights)
/                   \
/                       \
WEIGHTED-            TRANSDUCER
RECOGNIZER
\                       /
\                   /
\               /
(zero_weights)     (identity)
\         /
\     /
RECOGNIZER

Consider the following examples:

expression:            coerced into:
{ a, b:c     }          { a:a, b:c }
{ a, b::4    }          { a::0, b::4 }
{ a, b:c:4 }          { a:a:0, b:c:4 }
{ a::0,a:b:4 }          { a:a:0, a:b:4 }
{ a:b, a:b:4 }          { a:b:0, a:b:4 }

Some operators which expect a recognizer will temporarily freeze their argument automaton if it is of a different type. In that case, the symbol pairs or triples are treated as atomic symbols. This may or may not be what you want. The following operators work in this way:

complete minimize determinize complement

## 2.3. Spy Points on Regular expressions

The regular expression compiler provides detailed information on the computation time and the size of the resulting automata for certain regular expression operators, namely for those operators Op for which the predicate

bb_get(fsa_rx_spy:Op,on).

succeeds. So you can set a spy-point to operator concat by the directive:

?- bb_put(fsa_rx_spy:concat,on).

The special operator spy(Expr) is equivalent to Expr except that it has an associated spy-point.

## 2.4. Extending the regular expression notation

Using the -aux[file] command line option, or the AuxFile button of the TK Widget, you can load auxiliary regular expression operators. The file should be a Prolog source file. It will be loaded into module fsa_regex_aux. The syntax of regular expressions can be used in this file (in fact it must be used, beware if the file also contains ordinary Prolog code!).

Two relations are important: 1. macro/2 2. rx/2

The first relation is usually defined by unit clauses. It simply states that the regular expression in the first argument is an abbreviation for the regular expression in the second argument. For example:

macro(vowels,{a,e,i,u,o}).

Such macro's can be parameterized using Prolog variables; e.g.:

macro(brz(Expr),determinize(reverse(determinize(reverse(Expr))).

The relation rx/2 can be used for more complicated operations (operations that are cumbersome or impossible to define in terms of simpler regular expression operators). It defines a relation between the regular expression in the first argument and the finite automaton in the second argument. It is often useful to be able to call the regular expression compiler recursively. This should be done through the predicate fsa_regex:rx/2. The following is equivalent to the first example of macro/2 above:

rx( vowels, Fa) :-
fsa_regex:rx({a,e,i,u,o}, Fa).

Consult the Examples directory, for instance in the MohriSproat96, Karttunen95, Karttunen96, Karttunen98, GerdemannVannoord, Queens directories, for some extensive illustrations.

## 2.5. Combining several auxiliary regular expression operator files

Suppose you want to use the definition of a replace operator in some file replace.pl in your analysis of Dutch phonology. In the latter file you can include the definitions from replace.pl by including somewhere at the top of your file the following directives: