tokenise datafiles from X-SAMPA or similar


xstokens [-s filename] xs_table_file datafile(s)


-s filename
save token count to file


This program tokenises datafiles that are processed by the leven program. Strings that represent a single sound are replaced with a unique code, possibly with additional codes representing feature characteristics of the sound. The definition for the tokenisation is in the xs_table_file.

The program processes the datafiles given on the command line, creating for each file a new file with .tok as an additional filename extension. The input files should hold dialect data, in the format that is processed by the leven program, but all data should be present as strings, preceded by a minus sign. Data preceded by a plus sign is not allowed. The output files are again in the format that is used by the leven program.

The datafiles are transformed by matching substrings. When multiple substrings match, the longest is used. Each substring is replaced by a token (or group of tokens) that is defined in the xs_table_file. In the output files, each token is replaced by a unique number. The idea is that in the input files each sound is represented by a unique string, while diacritics such as those used in IPA are given as subsequent strings following that sound, so that these transcriptions can be parsed unambiguously by matching substrings from left to right.
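The left-to-right, longest-match rule described above can be illustrated with a minimal Python sketch. This is not the code of xstokens itself: the function name split_longest_match is hypothetical, and raising an error on unmatched input is an assumption, since this page does not say how the program reacts to a string it cannot match.

```python
def split_longest_match(text, substrings):
    """Split text from left to right, always taking the longest known substring."""
    # Try longer candidates first, so that "_h\" wins over "_h".
    ordered = sorted(substrings, key=len, reverse=True)
    pieces = []
    pos = 0
    while pos < len(text):
        for candidate in ordered:
            if text.startswith(candidate, pos):
                pieces.append(candidate)
                pos += len(candidate)
                break
        else:
            # Assumption: unmatched input is treated as an error in this sketch.
            raise ValueError(f"no substring matches at position {pos}")
    return pieces


known = ["i", "y", "_h", "_h\\"]
print(split_longest_match("i_h\\y_h", known))
```

With these four substrings, the input i_h\y_h is split into i, _h\, y and _h: at the second position both _h and _h\ match, and the longer one is taken.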

Two different kinds of strings are distinguished, representing sortable and unsortable substrings. Sortable substrings are sorted as a group. After each unsortable substring there can be a group of one or more sortable substrings. The unsortable substrings represent basic sounds, or an indication of stress or of linking to the next sound. The sortable substrings represent diacritics, which can be given in any order. For instance, the following...

      U2 S3 S2 U1 S4 S1
... would be sorted as...
      U2 S2 S3 U1 S1 S4
This means that two input strings that differ only in the order in which the diacritics are added to a sound will result in identical output strings.
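The reordering can be sketched in a few lines of Python. The function name normalise_order and the (kind, name) pair representation are hypothetical, and sorting by name is an assumption: this page does not state which key xstokens sorts on, only that each group of sortable substrings ends up in a fixed order.

```python
def normalise_order(items):
    """items is a list of (kind, name) pairs, where kind is "U" or "S"."""
    result = []
    group = []  # sortable items collected since the last unsortable one
    for kind, name in items:
        if kind == "S":
            group.append(name)
        else:
            # An unsortable item closes the current group of sortable items.
            result.extend(sorted(group))
            group = []
            result.append(name)
    result.extend(sorted(group))
    return result


sequence = [("U", "U2"), ("S", "S3"), ("S", "S2"),
            ("U", "U1"), ("S", "S4"), ("S", "S1")]
print(" ".join(normalise_order(sequence)))  # U2 S2 S3 U1 S1 S4
```

Note that the unsortable items keep their original positions; only the sortable items between them are rearranged.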

In the xs_table_file, empty lines and lines starting with a hash (#) are ignored. Unsortable substrings are defined with a preceding U:

      # vowel, front, close, unrounded
      U  i  .

      # vowel, front, close, rounded
      U  y  .
After the substring you put the token it should be translated into. A token is any sequence of characters not containing white space. Each time you use a dot, a new unique token is generated. Any other tokens will be the same if they are written the same. Here, the sound from the input data encoded by the single character i will be encoded as a single token in the output data, and the y will be encoded as a different token in the output data.
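As an illustration of how such table lines could be read, here is a minimal Python sketch. It is not the parser used by xstokens: the function name parse_table and the internal names given to dot tokens (<dot1>, <dot2>, ...) are hypothetical, and the named token close in the example is made up for the demonstration.

```python
import itertools


def parse_table(lines):
    """Return {substring: (kind, [tokens])} for the given table lines."""
    fresh = itertools.count(1)
    table = {}
    for line in lines:
        line = line.strip()
        # Empty lines and comment lines are skipped.
        if not line or line.startswith("#"):
            continue
        kind, substring, *tokens = line.split()
        # Every dot becomes a new unique token; named tokens are kept as written.
        tokens = [f"<dot{next(fresh)}>" if t == "." else t for t in tokens]
        table[substring] = (kind, tokens)
    return table


table = parse_table([
    "# vowel, front, close, unrounded",
    "U  i  . close",
    "",
    "# vowel, front, close, rounded",
    "U  y  . close",
])
print(table["i"])  # ('U', ['<dot1>', 'close'])
print(table["y"])  # ('U', ['<dot2>', 'close'])
```

The two dots yield two different tokens, while close is the same token in both definitions because it is written the same way.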

Sortable substrings are preceded by an S:

      S  _h    .
      S  _h\   .

You can translate a sound into a string of tokens, for instance:

      # vowel, front, close, unrounded
      U       i       . frontA1 frontB1 closeA1 closeB1 closeC1 unrounded

      # vowel, front, close, rounded
      U       y       . frontA1 frontB1 closeA1 closeB1 closeC1 rounded
These two sounds differ in only one phonetic feature. Because they are represented by strings of tokens with only two differences between them (dot vs dot, and unrounded vs rounded), the Levenshtein algorithm will find a much smaller difference between these two sounds than between two very dissimilar sounds whose seven tokens are all different.
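The effect can be checked with a plain unit-cost Levenshtein distance over token sequences. This sketch is an assumption about the kind of comparison meant here; the leven program may apply different weights or normalisation. The names dot_i, dot_y and the x1 to x7 tokens are placeholders for unique tokens and are not part of any real table.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two sequences of tokens."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


i_tokens = ["dot_i", "frontA1", "frontB1", "closeA1", "closeB1", "closeC1", "unrounded"]
y_tokens = ["dot_y", "frontA1", "frontB1", "closeA1", "closeB1", "closeC1", "rounded"]
other = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]  # shares no token with i

print(levenshtein(i_tokens, y_tokens))  # 2
print(levenshtein(i_tokens, other))     # 7
```

The shared feature tokens cost nothing, so the two close front vowels end up at distance 2, against 7 for a sound that has no token in common.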

If you have tokens in the input files that should be excluded from the output, put a letter I directly after the first letter. You don't need to add tokens for the output:

    UI  <
    SI  >
This means that the tokens '<' and '>' are ignored. Note that there is a difference between the two above. The second represents a substring that is just ignored. The first, representing an unsortable substring, means that not just this substring should be ignored, but also all sortable substrings following it, up to the next unsortable substring.
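The difference between the two ignore markers can be sketched as follows. The function name drop_ignored and the (kind, name) pair representation are hypothetical; the sketch only mirrors the behaviour described in the paragraph above.

```python
def drop_ignored(items):
    """items is a list of (kind, name) pairs; kind is "U", "S", "UI" or "SI"."""
    kept = []
    skipping = False  # True while we are behind an ignored unsortable item
    for kind, name in items:
        if kind.startswith("U"):
            # Every unsortable item starts a new group; UI silences that group.
            skipping = (kind == "UI")
            if not skipping:
                kept.append(name)
        elif kind == "S" and not skipping:
            kept.append(name)
        # "SI" items are always dropped, as are "S" items following a "UI".
    return kept


items = [("U", "a"), ("S", "_h"), ("SI", ">"),
         ("UI", "<"), ("S", "_h"),
         ("U", "i"), ("S", "_h")]
print(drop_ignored(items))  # ['a', '_h', 'i', '_h']
```

The _h after '<' disappears together with '<', while the _h after a and the _h after i are kept.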

Here is an example of an xs_table_file:


See also

The features program offers a more accurate method to calculate differences based on phonetic features.