LAMSAS research at Rijksuniversiteit Groningen

Reproduction of figures is not allowed without permission from: kleiweg@let.rug.nl

This site desperately needs an overhaul. At the moment, it presents none of the recent results.

A new version of this site will be available soon.

Links


Papers

Verslag
Please note: the Verslag (Dutch for "report") does not reflect the latest results presented on the rest of this site. Its results on lexical measurements are outdated and were obtained with a less well-defined method.


Software

This research was conducted using the Levenshtein software. For details on procedures, refer to the manual of the leven program, especially the section datafiles.


Overview

General
Maps describing the LAMSAS area, informants, fieldworkers, etc.


Results on all LAMSAS data

Results sorted by validation using local incoherence, by region:
1 phonetic, simple 1.62251 (best)
2 phonetic, with superscript 1.63169
3 phonetic, feature sets 1.73534
4 phonetic, emphasis 1.78906
5 phonetic, discarding 1.92346
 
  lexical, >12% 2.15141
  lexical, all 2.6876
 
theoretical optimum 0.0
if all dialect differences were equal 12.2789
average of some random solutions 12.3
theoretical worst 30.4506

Correlation between calculated differences:

     1          2          3          4          5
1    1          0.9929666  0.9489694  0.9243987  0.8934522
2    0.9929666  1          0.9477514  0.9091627  0.8837078
3    0.9489694  0.9477514  1          0.9002553  0.8714213
4    0.9243987  0.9091627  0.9002553  1          0.9472306
5    0.8934522  0.8837078  0.8714213  0.9472306  1
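The page does not say which correlation coefficient was used. As a minimal sketch, assuming a plain Pearson correlation between two lists of pairwise location differences (one list per measurement, in the same pair order), the calculation could look like this. The function name `pearson` and the sample values are illustrative only and are not LAMSAS data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up differences for the same four location pairs under two measurements.
differences_a = [0.10, 0.25, 0.40, 0.55]
differences_b = [0.12, 0.22, 0.45, 0.50]
print(pearson(differences_a, differences_b))
```

A value close to 1 means the two measurements rank and space the location pairs in nearly the same way.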

phonetic

Remarks on phonetic measurements:

1. phonetic, simple
The simplest measurement. Apart from sorting groups of diacritics so that identical typeset output always yields identical string sequences, the data received no special handling.

See also: maps for phonetic, simple, by informants.

2. phonetic, with superscript
In the LAMSAS data set, curly braces are used to indicate superscript. In the simple measurement, the string "ABC{ABC}" is treated as a string of length 8, with 5 different tokens:
        A  B  C  {  A  B  C  }
In the current measurement this becomes a string of length 6, with 6 different tokens:
        A  B  C  {A}  {B}  {C}

3. phonetic, feature sets
An attempt to translate each sound to a set of features, using differentiated distances between sets of features.

4. phonetic, emphasis
All basic sounds are weighted ten times as heavily as diacritical marks.

(Reversing the weights, so that diacritics are ten times as important as other tokens, results in a local incoherence of 1.72508.)

5. phonetic, discarding
All characters that are used more often in one data collection period than in another are discarded. The periods are: before 1939, 1939 to 1941, after 1941.
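The difference in tokenization between measurements 1 and 2 can be illustrated with a short sketch. It assumes that braces are never nested and that every character is a single token. Diacritic handling is left out, and the function names are hypothetical rather than part of the leven program.

```python
def tokenize_simple(s):
    """Measurement 1: every character, braces included, is one token."""
    return list(s)

def tokenize_superscript(s):
    """Measurement 2: each character inside {...} becomes a superscript token."""
    tokens = []
    in_superscript = False
    for ch in s:
        if ch == "{":
            in_superscript = True
        elif ch == "}":
            in_superscript = False
        elif in_superscript:
            tokens.append("{" + ch + "}")  # superscript A differs from plain A
        else:
            tokens.append(ch)
    return tokens

simple = tokenize_simple("ABC{ABC}")
marked = tokenize_superscript("ABC{ABC}")
print(len(simple), len(set(simple)))  # 8 5
print(len(marked), len(set(marked)))  # 6 6
```

With this tokenization a superscript sound no longer matches its plain counterpart, and the braces themselves no longer count as sounds.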

lexical

Running the Levenshtein algorithm on all available lexical strings gives a local incoherence of 2.6876. By removing infrequent words from the data (presumably noise), this value could be reduced to 2.15141, a reduction of 20%.

The LAMSAS data is a set of 151 files, each file containing all variations of one word for all informants (grouped by location for these measurements). Each file is named for its main variant of a word. To remove infrequent words, the following procedure was used: In each file, the number of occurrences of the main word was defined as 100%. Each variant that occurred less than n% as often as the main word was removed. If only one variant remained, the file was not used at all. Results are shown in the graphs below, using circle markers.
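The removal procedure above can be sketched as follows. This is a minimal illustration, assuming each file has already been reduced to a mapping from variant to number of occurrences. The function name `filter_variants` and the counts are made up for the example.

```python
def filter_variants(counts, main_word, n_percent):
    """Keep variants occurring at least n_percent as often as the main word.

    Returns None when at most one variant survives, meaning the file
    would not be used at all.
    """
    threshold = counts[main_word] * n_percent / 100.0
    kept = {word: c for word, c in counts.items() if c >= threshold}
    if len(kept) <= 1:
        return None
    return kept

# Hypothetical counts for one file; "main" stands for the file's main variant.
counts = {"main": 200, "variant_a": 50, "variant_b": 20}

print(filter_variants(counts, "main", 12))  # {'main': 200, 'variant_a': 50}
print(filter_variants(counts, "main", 30))  # None
```

At 12% the threshold is 24 occurrences, so only the rarest variant is dropped. At 30% the threshold is 60, only the main word survives, and the file is discarded.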

When only one word remains in a file, that file does not contribute to the differences calculated between locations, since the difference is always zero. However, these files would contribute to the overall count of differences calculated between certain locations, and since in the end the sum of all differences is divided by the number of calculated differences, the one-word files do have an effect. As can be expected, this effect is harmful. For some percentages of removal of words, this effect was calculated. The results are marked in the graphs below using crosses.
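A small made-up example shows why counting one-word files matters. It assumes the difference between two locations is simply the mean over all files in which a difference was calculated. The numbers are illustrative, not LAMSAS values.

```python
def mean_difference(per_file_differences):
    """Sum of all differences divided by the number of calculated differences."""
    return sum(per_file_differences) / len(per_file_differences)

# Differences between two locations in two files that still have several variants.
informative = [0.5, 0.25]
# A file reduced to a single word always yields a difference of zero.
one_word_file = [0.0]

print(mean_difference(informative))                  # 0.375
print(mean_difference(informative + one_word_file))  # 0.25
```

The zero adds nothing to the sum but still enlarges the divisor, so the difference shrinks for exactly those location pairs that are covered by one-word files.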

Instead of calculating the Levenshtein distance between lexical strings, another test was to use a binary comparison: strings are either identical or not. Some of the results are indicated with square markers in the graphs below.
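The two ways of comparing lexical strings can be contrasted in a short sketch. The Levenshtein function below is the textbook version with unit costs for insertion, deletion, and substitution. The leven program may use other costs or a normalization, so this is an assumption, and the example words are not LAMSAS items.

```python
def levenshtein(a, b):
    """Textbook edit distance with unit costs, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            substitution = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + substitution))
        previous = current
    return previous[-1]

def binary_difference(a, b):
    """Binary comparison: strings are either identical (0) or not (1)."""
    return 0 if a == b else 1

print(levenshtein("kitten", "sitting"))        # 3
print(binary_difference("kitten", "sitting"))  # 1
```

The binary comparison treats a near-identical spelling and a completely different word alike, whereas the Levenshtein distance grades the difference.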

results for all lexical strings

results for lexical strings, 12% or more variants (files with only one variant left are not used)


Results on subsets of LAMSAS data

1939-1941
Most data was collected in the years 1939, 1940, and 1941. These are the results using the phonetic, simple measurements on this subset (except some data from South Carolina).

Guy Lowman
Guy Lowman collected most of the LAMSAS data. These are the results using the phonetic, simple measurements on this subset.

See also: maps for Lowman, by informants.

Informant Type (inftype)
(Text from LAMSAS site)
This column indicates the type classification assigned by the field worker, a subjective measure which was associated by Kurath and his field workers with formal education, private reading, and participation in social activities.
Type I:
Folk speakers, local usage subject to a minimum of education and other outside influence.
Type II:
Common speakers, local usage subject to a moderate amount of education (generally high school), private reading, and other external contacts.
Type III:
Cultivated speakers, representing wide reading and elevated local cultural traditions, generally but not always with higher education.
This classification was an important constituent in the planning of the survey and selection of particular informants. A count of informants belonging to each type is presented below; three informants (formerly auxiliary informants) cannot be classified by type.
Type      Number
Type I    582
Type II   439
Type III  138

Locality Type (commtype)
(Text from LAMSAS site)
Informants have been grouped according to the criteria of the 1940 US Census for relative population of their localities:

Code  Description  Number
U     Urban        277
R     Rural        884

The Urban designation refers to localities with populations greater than 2500. Census categories for Rural Farming and Rural Non-Farming communities have been collapsed into a single Rural category; there is too little evidence in the LAMSAS informant biographies to make reliable farming/non-farming distinctions. LAMSAS communities are typically counties; judgments of locality type were based on specific residence within the LAMSAS community. Missing information prevented classification of one informant.


Remarkable results

results


LAMSAS character set

These are utilities to display strings that were encoded in the LAMSAS phonetic font.

view a single phonetic string

search examples of phonetic items with certain substrings