This bug was in the way pre-modifiers were handled. Instead of applying a pre-modifier to just the first following head, it was applied to all the remaining heads in the same word as well.
This means that if you have used pre-modifiers, your results were wrong. Sorry.
Sequences of tokens in the data that represent one sound are combined into a set of feature values. Each unique set of feature values is replaced with a unique number. These values are written to the output files, which have the name of the input files with the extension .ftr appended.
In addition, the differences between all sets of feature values are calculated, and saved to the file features.table.out, which can be used by the leven program.
A typical usage is:
    features configfile data/*.txt
    leven -s features.table.out (other options) data/*.txt.ftr

The input datafiles should be in the same format as used by the leven program, except that all data must be in the form of ASCII strings preceded by a minus sign. Data in the form of sequences of numerals preceded by a plus sign are not allowed.
The new files will have data in numeral format.
As an example, part of the input datafile could have these lines:

    : Aachen
    - "t7n@stI_-S

In the output file, those lines could be translated to something like:

    : Aachen
    + 17 116 19 3 27 17 77 14
example: an example of a configuration file.
xstokens: a simpler but less accurate alternative to the features program.
The configuration file has five parts, in a fixed order. All parts must be present, even if one is empty. The five parts start with a key word:
    DEFINES
    FEATURES
    TEMPLATES
    INDELS
    TOKENS

    VERSION 0   # you shouldn't use this
    VERSION 1
    VERSION 2

The program features version 1.00 fixes a methodological error of earlier versions. The program will run as before with old configuration files, or if you set VERSION 0. To use the fix, you need to set VERSION 1 or 2, and make some further changes to older configuration files in part 2: FEATURES.

    TOP 255
    TOP 65535   # this is the default

The Levenshtein program leven reads differences as integer values. These are in the range from 0 to 65535. When the table of differences is very large, it may be necessary to use the alternative compiled program leven-s, which uses differences in the range from 0 to 255, and uses less memory.
The alternative compiled program leven-r uses differences as real values (and uses even more memory, and makes the program slower).
The program features maps the calculated differences between feature value sets onto the range from 0 to the value of TOP, unless you specify the option -g on the command line, in which case the result can be used with leven-r.
    SUBSTMAX 1.0   # this is the default
    SUBSTMAX 20

This value has two purposes:

    INDEL 0.5
    INDEL 10

This is the value of an indel, if it is not specified in another manner. The default is the value of SUBSTMAX divided by two.

    METHOD SUM             # equal to METHOD MINKOWSKI 1 (this is the default)
    METHOD SQUARE
    METHOD EUCLID          # equal to METHOD MINKOWSKI 2
    METHOD MINKOWSKI 1.4

This determines how, from the differences between individual features, the difference between two sets of features is calculated.

    TOKENSTRING RAW   # this is the default
    TOKENSTRING ESC
With TOKENSTRING ESC, tokens can be defined in the configuration file using escape sequences. See below.
    START 0   # this is the default
    START 1
This defines the start condition of the mini-parser. See below.
    RANGE 1 50 1
    RANGE 50 10000 2.1
You can use RANGE zero, one or more times.
This defines a final mapping from calculated value to output value. If a value falls in the range of the first two values (inclusive), then it is replaced by the third value.
If a calculated value falls inside one of these ranges, SUBSTMAX is ignored.
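The RANGE lookup described above can be sketched in a few lines of Python. This is only an illustration: the function name `apply_ranges` is hypothetical, and the rule that the first matching range wins when ranges overlap (as at the value 50 in the example) is an assumption, not something this manual states.

```python
def apply_ranges(value, ranges):
    """Return the replacement for the first range containing value, else None.

    Each range is (low, high, replacement); both bounds are inclusive.
    Assumption: when ranges overlap, the first one listed is used.
    """
    for low, high, replacement in ranges:
        if low <= value <= high:
            return replacement
    return None  # no RANGE applies, so the normal SUBSTMAX handling would be used


# The two RANGE lines from the example above.
ranges = [(1, 50, 1), (50, 10000, 2.1)]
```

With these ranges, a calculated value of 25 would be replaced by 1 and a value of 51 by 2.1, while 0.5 falls outside both ranges and is left to the normal handling.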
There are three types of features, indicated by a letter B, N, or D:
If you have two sets of feature values, a and b, and a feature i, the difference between a[i] and b[i] is:

    B : ( a[i] & b[i] ) ? 0.0 : 1.0
    N : fabs( a[i] - b[i] )
    D : ( a[i] == b[i] ) ? 0.0 : 1.0

(A ? B : C is C-code shorthand for: if A does NOT return 0 then do B else do C)

In prose, for bitmaps: if a[i] and b[i] have at least one bit set to 1 in both bitmaps, then the difference is 0. It is 1 otherwise. The difference between numeric features is the absolute difference between the two values. The difference between two discrete features is 0 if they are equal, and 1 otherwise.
The above values are multiplied by the weight of the feature. So you get:

    B : ( a[i] & b[i] ) ? 0.0 : w
    N : fabs( a[i] - b[i] ) * w
    D : ( a[i] == b[i] ) ? 0.0 : w

The weight of each feature is defined with the definition of the feature itself. The default weight is 1.
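As an illustration, the three weighted difference rules can be written as one small Python function. The name `feature_difference` and its argument layout are hypothetical; only the three formulas themselves come from the text above, and both values are assumed to be defined.

```python
def feature_difference(kind, a, b, weight=1.0):
    """Weighted difference between two defined values of a single feature.

    kind is 'B' (bitmap), 'N' (numeric) or 'D' (discrete).
    """
    if kind == "B":
        # Bitmaps match as soon as they share at least one set bit.
        return 0.0 if (a & b) else weight
    if kind == "N":
        return abs(a - b) * weight
    if kind == "D":
        return 0.0 if a == b else weight
    raise ValueError("unknown feature type: %r" % kind)
```

For example, two bitmaps 6 and 4 share a bit and give 0, while the numeric values 1.5 and -0.5 with weight 2 give 4.0.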
Here are some examples of feature definitions for VERSION 1 and 2 (see part DEFINES above):
    N 2 v_advancement   # numeric feature, with default difference 2.0, weight 1.0
    N 1 v_high          # three more numeric features, with default difference and weight 1.0
    N 1 v_long
    N 1 v_rounded
    D 1 .7 breathy      # a discrete feature, with default difference 1.0 and weight 0.7
    B 1 3 type          # a bitmap feature, with default difference 1.0 and weight 3.0
Differences between versions (set in part DEFINES above):
In VERSION 0, the first value in the lines above is missing. There can be at most one value between the first letter and the label. If there is a value, it sets the weight. In VERSION 0, if the difference between two feature sets needs to be calculated, and the feature is undefined in one or both feature sets, the difference is set to 0. That is probably not what you want, so you should use VERSION 1 or 2 instead.
In VERSION 1, the first value is the feature's default difference, to be used if two feature sets are compared and one or both have this feature undefined. The default difference is multiplied by the weight.
In VERSION 2, the default difference is used only if the feature is defined in one feature set, but not in the other. If the feature is undefined in both feature sets, the difference is set to 0.
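The three VERSION behaviours for undefined values can be compared side by side in a short sketch. It uses a numeric feature, represents an undefined value as `None`, and the function name `numeric_difference` is hypothetical. That VERSION 2 also multiplies the default difference by the weight is an assumption carried over from the VERSION 1 description.

```python
def numeric_difference(version, a, b, default=1.0, weight=1.0):
    """Difference for one numeric feature; None stands for 'undefined'."""
    undefined = (a is None) + (b is None)  # 0, 1 or 2 undefined values
    if undefined == 0:
        return abs(a - b) * weight
    if version == 0:
        return 0.0                 # undefined on either side: no difference
    if version == 1:
        return default * weight    # undefined on one or both sides
    if version == 2:
        # Default only when exactly one side is undefined (weight: assumption).
        return default * weight if undefined == 1 else 0.0
    raise ValueError("unknown VERSION: %r" % version)
```

The sketch makes the practical difference visible: with both values undefined, VERSION 1 still returns the default difference, whereas VERSION 2 returns 0.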
There are three predefined features:
    N 1 WEIGHT
    N 1 INDEL
    B 1 STATE

These features have a special meaning, explained below. They are not used to calculate the differences between feature value sets in the normal way. However, they can be handled (assigned to and modified) like normal features.

    T vowel             # start of template 'vowel'
    F v_long = 1        # assign value 1 to feature 'v_long'
    F v_rounded = -.5
    T v_close           # start of template 'v_close'
    F v_high = 1.5
In this part of the configuration file, the letter T is used for the definition of a template. Here, the letter T can be used to start a single template only.
In the parts of the configuration file that follow, the letter T is used to execute a template, and it can be used with multiple templates at once.
    T consonant c_glottal c_fricative   # like the consonant h
    T vowel v_mid v_central             # like the vowel schwa
    F v_rounded = 0                     # between a rounded and unrounded vowel
Tokens come in three flavours, indicated with the letters H, M or P:
And there is one special token:
One sound, one segment that is to be translated into a single set of feature values, consists of one or more tokens:
Each token can change feature values. Usually, the head assigns initial values to features, while modifiers change those values.
If the input consists of two pre-modifiers (P1, P2), a head (H), and two modifiers (M1, M2), like this:
    P1 P2 H M1 M2

... then the feature value changes are processed in this order:

    H M1 M2 P2 P1
So, the actions for the head are processed first, then the modifiers, and lastly, the pre-modifiers in reverse order.
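The ordering rule can be expressed as a one-line helper. The function name `processing_order` is hypothetical and the tokens are represented simply as strings; this only illustrates the order, not the feature changes themselves.

```python
def processing_order(pre_modifiers, head, modifiers):
    """Order in which the feature changes of one sound are applied:
    head first, then modifiers, then pre-modifiers in reverse order."""
    return [head] + list(modifiers) + list(reversed(pre_modifiers))
```

Calling it with the pre-modifiers P1 and P2, the head H and the modifiers M1 and M2 yields the order H, M1, M2, P2, P1 given above.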
Each token is defined in the configuration file with a letter indicating the type, followed by calls to templates (T) or other feature value changes (F). Examples:
    H y
    T vowel v_close v_front v_rounded
    H @
    T vowel v_mid v_central
    F v_rounded = 0
    # the END OF TEXT token has no substring and sets no features:
    EOT
Examples of actions that can be performed on feature values:
    # Features of type bitmap, B
    F featB = 4   # assign the value 4 (integer)
    F featB - 3   # clear the bits from the value 3: new = old XOR (old AND 3)
    F featB + 3   # set the bits from the value 3: new = old OR 3
    F featB ! 3   # flip the bits from the value 3: new = old XOR 3
    F featB U     # make the bitmap undefined
    # Features of type numeral, N
    F featN = 4   # assign the value 4 (float)
    F featN - 3   # decrease by 3
    F featN + 3   # increase by 3
    F featN * 3   # multiply by 3
    F featN U     # make it undefined
    # Features of type discrete, D
    F featD = 4   # assign the value 4 (integer)
    F featD U     # make it undefined
Note that, usually, you don't need to un-define a feature value. All feature values are undefined until a value is assigned. Also note that you can't modify a feature before you have assigned to it. (The features WEIGHT and STATE are the exceptions. WEIGHT is set to 1 as soon as a head is recognised. STATE is set to 0 at the start of each string, and changes are persistent until the end of the string.)
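For readers less familiar with bit operations, the three bitmap actions (-, + and !) can be spelled out in Python. The function names are hypothetical; the formulas are the ones given in the comments of the listing above.

```python
def clear_bits(old, value):
    """F featB - value : new = old XOR (old AND value)"""
    return old ^ (old & value)


def set_bits(old, value):
    """F featB + value : new = old OR value"""
    return old | value


def flip_bits(old, value):
    """F featB ! value : new = old XOR value"""
    return old ^ value
```

Starting from the bitmap 7 (binary 111), clearing the bits of 3 leaves 4 (binary 100). Setting the bits of 3 on that result gives 7 again.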
If TOKENSTRING ESC is set in the DEFINES part of the configuration file, then you can use escape sequences to define token strings. This is useful if the data is not in a standard character set. Escape sequences are:
With TOKENSTRING ESC, these are equivalent:
    H A\\+
    H \d065\d092\d043
    H \101\134\053
    H \x41\x5C\x2B
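To check that the four definitions above really denote the same three bytes, here is a small Python decoder. It is a sketch built only from what the example shows, so the exact escape forms are assumptions: a doubled backslash for a literal backslash, \d followed by three decimal digits, a backslash followed by three octal digits, and \x followed by two hexadecimal digits. The name `decode_token` is hypothetical.

```python
import re

# Assumed escape forms, inferred from the example lines only.
_ESCAPE = re.compile(r"\\(?:(\\)|d([0-9]{3})|x([0-9A-Fa-f]{2})|([0-7]{3}))")


def decode_token(text):
    """Replace the assumed escape sequences in text by the characters they denote."""
    def replace(match):
        backslash, decimal, hexadecimal, octal = match.groups()
        if backslash:
            return "\\"
        if decimal:
            return chr(int(decimal, 10))
        if hexadecimal:
            return chr(int(hexadecimal, 16))
        return chr(int(octal, 8))
    return _ESCAPE.sub(replace, text)
```

All four token strings decode to the same sequence: the letter A (65), a backslash (92) and a plus sign (43).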
With TOKENSTRING RAW, the same token can only be defined as:
When tokens are defined with a letter I appended to the first letter, then no actions on features are performed. Examples:
    HI x
    MI _y
    PI ^
In case of a token of type head: the complete sound is ignored. There will be no token in the output sequence. Pre-modifiers and modifiers with this head will also be ignored. If STATE was already changed by a pre-modifier, that change will remain in effect.
In case of a token of type modifier or pre-modifier: no feature changes are made, including STATE.
If a value is assigned to the pseudo-feature INDEL, that will be the value of an indel for the current sound, ignoring what was defined in the main parts DEFINES and INDELS of the configuration file.

    H a
    : 7 H a

The first is the ordinary definition. The second is the conditional definition. The token is recognised only if the 'state', interpreted as a bitmap, has at least one bit in the number 7 set (7 = 1 + 2 + 4, or binary: 001 + 010 + 100). It matches if the "bitwise and" is not zero.
When two token definitions of equal length match the input, one token defined with a conditional, the other without, the definition with conditional is used if the condition also matches, and the other definition is used if the condition doesn't match.
If two token definitions of equal length match, both with conditions, and both conditions also match, it is undetermined which of the two definitions is used.
If a definition for a token has a condition on the token itself, and the condition doesn't match, the rule isn't used, and none of the actions on feature values are executed. But you can also use conditions on actions or the call to templates, so the token can match the input and only part of the actions executed. For example:
    ^: 3 F featA + 4

The value of feature featA is increased only if the 'state', interpreted as a bitmap, has no positive match with the number 3. It matches if the "bitwise and" is zero.

    = 0 F featB = 1

The value of feature featB is set to 1 if the state is exactly 0.

    ^= 9 T template1 template2

Both templates template1 and template2 are executed if the state does not match the value 9 exactly.
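The four condition operators can be summarised as a single predicate. This is a minimal sketch; the function name `condition_holds` and the idea of passing the operator as a string are hypothetical, while the tests themselves follow the descriptions above.

```python
def condition_holds(operator, value, state):
    """Evaluate a condition against the (saved) state."""
    if operator == ":":    # at least one bit of value is set in state
        return (state & value) != 0
    if operator == "^:":   # no bit of value is set in state
        return (state & value) == 0
    if operator == "=":    # state is exactly value
        return state == value
    if operator == "^=":   # state is anything but value
        return state != value
    raise ValueError("unknown condition operator: %r" % operator)
```

For instance, with state 4 the condition ": 7" holds (bit 4 is part of 7) and "^: 3" holds as well (neither bit 1 nor bit 2 is set).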
The state is an integer value. It can be changed by changing the value of the pre-defined 'pseudo-feature' STATE. For example:
F STATE + 2 # set the non-zero bits from value 2 (new STATE = old STATE OR 2)
NOTE: Before the processing of each token, the state is saved. That state is used for all tests done for that token, both the token match itself, as well as the execution of templates or other changes of feature values. Changing the value of the feature STATE will have no effect until the next token is processed.
NOTE: Usually, actions for pre-modifiers are executed last, when the actions for head and modifiers have finished. However, changes to STATE have effect as soon as pre-modifiers are parsed. However (again), changes to other features are made under condition of the state at the time of parsing the corresponding token.
Schematically, with two pre-modifiers (P1, P2), head (H), and two modifiers (M1, M2):

    STATE1 = current STATE
    parse P1 :
        - change STATE if requested
    STATE2 = current STATE
    parse P2 :
        - change STATE if requested
    STATE3 = current STATE
    parse H :
        - change features on condition of state STATE3
        - change STATE if requested
    STATE4 = current STATE
    parse M1 :
        - change features on condition of state STATE4
        - change STATE if requested
    STATE5 = current STATE
    parse M2 :
        - change features on condition of state STATE5
        - change STATE if requested
    - change features on condition of state STATE2 for P2
    - change features on condition of state STATE1 for P1

At the start of each string (a sequence of tokens making one dialect item), STATE is set to the value of START as defined in the DEFINES section of the configuration file, or to 0 if START is not set.
For determining the feature set of an indel (part INDELS of the configuration file), STATE is set to 0.
Mini parser and EOT
Using EOT with a STATE condition enables you to check that the mini parser is in the right state at the end of the input string. Example:
    : 1 EOT
    ^: 4 EOT
It is OK if at the end of the input string, the mini parser matches state 1, or doesn't match state 4. Any other state causes an error.
If you don't use EOT, any state at the end of the strings is acceptable. This is identical to using EOT without a condition. This is unnecessary:
All (pre-)modifiers in combination with EOT are ignored.
Mini parser, an example
Stress is usually marked at the start of a syllable. It would make sense to have a feature 'stress' on a vowel. But there may be one or more consonants between the stress marker and the first vowel of the syllable. So the stress must be remembered until it can be translated into a feature. This is how this can work:
    TEMPLATES

    T vowel
    F stress = 0           # no stress
    : 1 F stress = 1.0     # primary stress
    : 2 F stress = 0.5     # secondary stress
    F STATE - 3            # clear stress bits

    TOKENS

    P "                    # primary stress
    F STATE + 1
    P %                    # secondary stress
    F STATE + 2

    : 4 EOT   # end of string is accepted when state matches 4
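To see how the stress bits travel from the marker to the vowel, the example can be simulated in a few lines of Python. This is a simplified sketch, not the program's parser: each sound is given as a hypothetical triple (pre-modifiers, head, is_vowel), modifiers and the EOT check are left out, and the function name `stress_per_vowel` is made up for the illustration.

```python
STRESS_BITS = {'"': 1, "%": 2}  # P " sets bit 1, P % sets bit 2


def stress_per_vowel(sounds, start=0):
    """Return (vowel, stress) pairs for a sequence of sounds."""
    state = start
    result = []
    for pre_modifiers, head, is_vowel in sounds:
        for marker in pre_modifiers:
            state |= STRESS_BITS[marker]  # STATE changes as soon as P is parsed
        saved = state                     # state saved before the head is processed
        if is_vowel:                      # actions of template 'vowel'
            stress = 0.0
            if saved & 1:
                stress = 1.0              # : 1 F stress = 1.0
            if saved & 2:
                stress = 0.5              # : 2 F stress = 0.5
            state ^= state & 3            # F STATE - 3
            result.append((head, stress))
    return result


word = [
    (['"'], "t", False),  # primary stress marker in front of a consonant
    ([], "a", True),
    ([], "n", False),
    (["%"], "d", False),  # secondary stress marker in front of a consonant
    ([], "o", True),
    ([], "e", True),
]
```

In this made-up word, the first vowel receives stress 1.0, the second 0.5 and the third 0.0, even though both stress markers are attached to consonants.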
Suppose you have three features, x, y, and z, with feature weights wx, wy, and wz, and you have two sets of feature values, A and B. In addition, both sets of feature values have a pseudo-feature WEIGHT. Suppose that in the part DEFINES of the configuration file, you have set METHOD SUM. The function d() determines the simple difference between two features, based on type of feature (bitmap, numeric, discrete). The difference F between sets A and B is now determined as follows:
    F = ( d(A[x], B[x]) * wx +
          d(A[y], B[y]) * wy +
          d(A[z], B[z]) * wz )
    F = F * A[WEIGHT] * B[WEIGHT]
    if (F < 0) F = 0
    if (F > SUBSTMAX) F = SUBSTMAX
If you have used METHOD SQUARE, you get:
    F = ( (d(A[x], B[x]) * wx) ^ 2 +
          (d(A[y], B[y]) * wy) ^ 2 +
          (d(A[z], B[z]) * wz) ^ 2 )
    F = F * A[WEIGHT] * B[WEIGHT]
    if (F < 0) F = 0
    if (F > SUBSTMAX) F = SUBSTMAX
And with METHOD EUCLID, you get:
    F = ( (d(A[x], B[x]) * wx) ^ 2 +
          (d(A[y], B[y]) * wy) ^ 2 +
          (d(A[z], B[z]) * wz) ^ 2 )
    F = sqrt(F) * A[WEIGHT] * B[WEIGHT]
    if (F < 0) F = 0
    if (F > SUBSTMAX) F = SUBSTMAX
And with METHOD MINKOWSKI, with value rho, you get:
    F = ( (d(A[x], B[x]) * wx) ^ rho +
          (d(A[y], B[y]) * wy) ^ rho +
          (d(A[z], B[z]) * wz) ^ rho )
    F = F^(1/rho) * A[WEIGHT] * B[WEIGHT]
    if (F < 0) F = 0
    if (F > SUBSTMAX) F = SUBSTMAX
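The four methods can be put side by side in one Python function, which also makes it easy to check that SUM and EUCLID coincide with MINKOWSKI 1 and 2. This is a sketch of the formulas above, not the program's code: the name `set_difference` and its parameters are hypothetical, the simple differences d() are passed in as a ready-made list, and the weighted terms are assumed to be non-negative (otherwise a fractional rho is not meaningful).

```python
import math


def set_difference(diffs, weights, weight_a=1.0, weight_b=1.0,
                   method="SUM", rho=1.0, substmax=1.0):
    """Combine per-feature differences into one difference between two sets.

    diffs[i] is d(A[i], B[i]); weights[i] is the weight of feature i;
    weight_a and weight_b stand for A[WEIGHT] and B[WEIGHT].
    """
    terms = [d * w for d, w in zip(diffs, weights)]
    if method == "SUM":
        f = sum(terms)
    elif method == "SQUARE":
        f = sum(t ** 2 for t in terms)
    elif method == "EUCLID":
        f = math.sqrt(sum(t ** 2 for t in terms))
    elif method == "MINKOWSKI":
        f = sum(t ** rho for t in terms) ** (1.0 / rho)
    else:
        raise ValueError("unknown METHOD: %r" % method)
    f *= weight_a * weight_b
    return min(max(f, 0.0), substmax)  # clip to the range 0 .. SUBSTMAX
```

With two features whose weighted differences are 3 and 4 and a large SUBSTMAX, SUM gives 7, SQUARE gives 25 and EUCLID gives 5. With the default SUBSTMAX of 1.0, all three are clipped to 1.0.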