How to determine the best clustering of an area
Required knowledge: clustering, multi-dimensional scaling (MDS), see
How to map difference between geographic areas
There exist several methods to do clustering, each with its own characteristics. The
theory behind these methods will not be discussed here. Our approach is
practical. We have a table with dialect differences between
places in Germany, and want to know which clustering methods to use to
find the borders between the different German dialects.
We start with a clustering method that goes by the name of Ward's
Method. It is also known as Minimum Variance.
Although this method does not turn out to be the best one in the end, it is very
useful as a method to start with.
Ward's Method has a strong tendency to split data into groups of roughly equal
size. This means that when the "natural" clusters differ greatly
in size, the big ones will be split into smaller parts roughly equal
in size to the smaller "natural" clusters.
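To make the method concrete, here is a minimal sketch of Ward's Method in plain Python. It assumes the table of differences is a symmetric list of lists. The function name ward_clusters is ours, and the update rule on squared differences is one common formulation of Ward's Method, not necessarily the one used to produce the maps in this text.

```python
from itertools import combinations


def ward_clusters(diff, k):
    """Cluster items with Ward's Method until k clusters remain.

    diff is a symmetric table (list of lists) of differences.
    Returns a list of sets of item indices, ordered by smallest index.
    """
    n = len(diff)
    members = {i: {i} for i in range(n)}
    # Assumption: work on squared differences (a common form of Ward).
    d2 = {(i, j): diff[i][j] ** 2 for i, j in combinations(range(n), 2)}

    def get(a, b):
        return d2[(a, b) if a < b else (b, a)]

    next_id = n
    while len(members) > k:
        # Merge the two clusters that are currently closest.
        a, b = min(combinations(sorted(members), 2), key=lambda p: get(*p))
        na, nb = len(members[a]), len(members[b])
        dab = get(a, b)
        for m in members:
            if m in (a, b):
                continue
            nm = len(members[m])
            # Lance-Williams update for Ward: cluster sizes enter the
            # formula, which penalises merging clusters that are already big.
            d2[(m, next_id)] = (
                (na + nm) * get(a, m) + (nb + nm) * get(b, m) - nm * dab
            ) / (na + nb + nm)
        members[next_id] = members.pop(a) | members.pop(b)
        next_id += 1
    return sorted(members.values(), key=min)
```

Called as ward_clusters(table, 8), this returns eight sets of row numbers from the table. For a table with hundreds of places, a numerical library would be a faster choice than this pure-Python loop.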
Clustering with noise (above, right, noise level 1.5
on 8 clusters) suggests that the border between magenta and
dark blue (above, left) is not a true cluster border.
The advantage of Ward's Method is that it doesn't leave any "loose
ends": there are no clusters with only one or a few elements. All data is
grouped in bite-sized chunks, which can be studied further quite easily.
Now we will make use of this property. Below, on the left side, you
will see cluster maps made with Ward's Method. We use multi-dimensional
scaling (MDS) on the table of differences to map the places into two
dimensions, and the result is shown in the graph to the right of each
map. Each place in the MDS graph to the right will have the same colour
as it has in the map on the left.
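The MDS step can be sketched as follows. This is classical (Torgerson) scaling written in plain Python, with the eigenvectors found by simple power iteration. It is an illustration under assumptions, not the program used for the graphs here: the function name classical_mds and its parameters are ours, and the sketch assumes the differences behave roughly like distances, so that the largest eigenvalues are positive.

```python
import math
import random


def classical_mds(diff, dims=2, iterations=500, seed=0):
    """Map items to `dims` coordinates from a symmetric table of differences."""
    n = len(diff)
    sq = [[diff[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # Double-centre the squared differences.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]

    def times_b(v):
        return [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Power iteration: repeated multiplication converges to the
        # eigenvector whose eigenvalue has the largest magnitude.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iterations):
            w = times_b(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to explain on this axis
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, times_b(v)))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Remove this axis from b so the next round finds the next one.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords
```

Each place gets an (x, y) pair. Plotting these pairs with the colour each place has in the cluster map gives a graph of the kind described above.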
Above, right, you can see that the magenta, dark blue, and medium blue clusters
are part of one big group that stands apart from the other group. This
shows the primary dialect border of Germany. The north is the area of Low
German, the south that of High German.
Now we focus on parts of Germany. From the map on the left, we select a number
of clusters we want to examine more closely. We make a smaller table of
differences, a table that has only the differences between the places
of the areas we are interested in. Then we apply MDS to that smaller
table. Because a large part of the original places are removed, there
is more room available to "pull apart" the remaining clusters.
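Making the smaller table is a matter of keeping only the rows and columns of the selected places. A minimal sketch, assuming the table is a list of lists and that we have one cluster label per place (the names sub_table, cluster_of and keep are ours):

```python
def sub_table(diff, cluster_of, keep):
    """Return (kept_indices, smaller_table) for places in the `keep` clusters.

    diff        symmetric table of differences (list of lists)
    cluster_of  cluster label for each place, in table order
    keep        set of cluster labels to examine more closely
    """
    kept = [i for i, label in enumerate(cluster_of) if label in keep]
    small = [[diff[i][j] for j in kept] for i in kept]
    return kept, small
```

The list of kept indices is returned as well, so that each point in the new MDS graph can be traced back to its place in the original table.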
If you look at the graph above right, you see all the places nicely grouped by
colour. There are no points with different colours mixed in one region.
However, there is no clear distance between the three colour groups.
The borders are not natural, but an artifact of the clustering method.
Above, the places from the medium blue group are removed. MDS is applied to the
two remaining groups. Again, no visually clear border between the clusters.
If we didn't use colour (below, left), would you split the dots over
the same two groups as was done by clustering with Ward's Method?
In the south, bright green and cyan are distinct clusters. But what about red
and dark green? Below, there is no visual border.
In the graph above right, some dots are replaced with numbers. These numbers
correspond to those in the map on the left. This allows you not only to
look at groups, but also at individual places within each group. As is
done above, you can mark the most exceptional places. Whether the dialect of
these places really is markedly distinct from that in the surrounding
area, or whether your data from these places may be less reliable, that
is something you might choose to investigate.
Within the north, there were no true cluster borders, which means no dialect
borders. That does not mean that the dialect is the same everywhere in
the north. The pronunciation in the north-east can be very different from
that in the north-west, but the transition from one end of the area to
the other is gradual, and with our data, we cannot identify any distinct sub-areas.
On the other hand, what is visible is that the amount of transition across the north
differs. In the magenta area, the mutual differences are relatively large
compared to the mutual differences in the dark blue area.
Such a difference also exists in the south. The mutual differences in the
yellow area are much larger than those in the dark green area. The
yellow dots use up most of the space in the MDS plot (above, right), even though in
reality, the dark green area is much larger (above, left).
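What the MDS plot shows visually can also be expressed as a number: the average of all mutual differences within one cluster. The sketch below assumes a table as a list of lists and one cluster label per place. The function name mean_mutual_difference is ours.

```python
from itertools import combinations


def mean_mutual_difference(diff, cluster_of):
    """Average pairwise difference inside each cluster.

    Returns {label: mean difference}; a cluster with one place maps to 0.0.
    """
    groups = {}
    for i, label in enumerate(cluster_of):
        groups.setdefault(label, []).append(i)
    result = {}
    for label, idx in groups.items():
        pairs = list(combinations(idx, 2))
        result[label] = (
            sum(diff[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
        )
    return result
```

A cluster with a high value is one whose dots spread out widely in the MDS plot, regardless of how much ground it covers on the map.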
Now we get to the question: which clustering method should we use in our case?
Ward's Method was useful as a tool for exploration, but the overall
picture emerging from the clustering is not correct.
It turns out that, in this case, a method known as Weighted Average
(also known as McQuitty) results in the best representation.
The map below right shows the result of clustering with noise (combined noise
levels of 1.0 and 1.5) into eight groups. The north is not divided
into clusters, and the former red and dark green area is not separated
by a cluster border.
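The idea of clustering with noise can be sketched as follows: add random noise to the table, cluster, repeat this many times, and count how often two places end up in the same group. The details are assumptions on our part, since they are not spelled out here: the noise is drawn uniformly between zero and the noise level times the standard deviation of the differences, and all names (weighted_average_clusters, noisy_co_clustering, runs, seed) are ours.

```python
import random
from itertools import combinations


def weighted_average_clusters(d, k):
    """Agglomerative clustering with the Weighted Average (McQuitty) rule."""
    n = len(d)
    dist = {(i, j): d[i][j] for i, j in combinations(range(n), 2)}
    members = {i: {i} for i in range(n)}

    def get(a, b):
        return dist[(a, b) if a < b else (b, a)]

    next_id = n
    while len(members) > k:
        a, b = min(combinations(sorted(members), 2), key=lambda p: get(*p))
        for m in members:
            if m not in (a, b):
                # The new cluster's distance is the plain mean of its parts.
                dist[(m, next_id)] = (get(a, m) + get(b, m)) / 2
        members[next_id] = members.pop(a) | members.pop(b)
        next_id += 1
    return list(members.values())


def noisy_co_clustering(diff, k, noise_level, runs=100, seed=0):
    """Fraction of runs in which each pair of places shares a cluster.

    Assumption: noise is uniform on [0, noise_level * sd], where sd is the
    standard deviation of all differences in the table.
    """
    n = len(diff)
    values = [diff[i][j] for i, j in combinations(range(n), 2)]
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    rng = random.Random(seed)
    together = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(runs):
        noisy = [[0.0] * n for _ in range(n)]
        for i, j in combinations(range(n), 2):
            value = diff[i][j] + rng.uniform(0.0, noise_level * sd)
            noisy[i][j] = noisy[j][i] = value
        for cluster in weighted_average_clusters(noisy, k):
            for pair in combinations(sorted(cluster), 2):
                together[pair] += 1
    return {pair: count / runs for pair, count in together.items()}
```

Pairs of neighbouring places with a fraction close to 1 are reliably grouped together. A border where neighbouring places have a fraction close to 0 is a border that survives the noise.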
The places with numbers 97 and 123 in the red area are rather
exceptional, as was shown by MDS. Note how these places are marked in the
map on the right.
A clustering method related to Weighted Average is Group
Average. This method does not work well in this case. That does
not mean that Weighted Average is always better than Group Average.
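The two methods differ only in how the distance from a newly merged cluster to another cluster is computed. The sketch below shows the standard update rules for Weighted Average (WPGMA) and Group Average (UPGMA). The function and parameter names are ours.

```python
def weighted_average(d_am, d_bm, n_a, n_b):
    """Weighted Average (McQuitty): both merged clusters count equally.

    d_am, d_bm  distances from clusters a and b to a third cluster m
    n_a, n_b    number of places in a and b (ignored by this rule)
    """
    return (d_am + d_bm) / 2


def group_average(d_am, d_bm, n_a, n_b):
    """Group Average: each cluster counts in proportion to its size."""
    return (n_a * d_am + n_b * d_bm) / (n_a + n_b)


# A cluster of 9 places at distance 2.0 from m is merged with a
# single place at distance 6.0 from m.
print(weighted_average(2.0, 6.0, 9, 1))  # 4.0
print(group_average(2.0, 6.0, 9, 1))     # 2.4
```

With Weighted Average, a single exceptional place pulls the merged cluster as hard as a large group does. With Group Average, the large group dominates.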
Below are two maps resulting from clustering with noise (same levels as used
with Weighted Average), on the left with 16 clusters, on the right with
26 clusters. The border between cyan and red, which was shown to be an
important dialect border, is invisible or nearly invisible. (Red:
Swabian, related to Swiss. Cyan: Bavarian, strongly related to
Below, again, are the results of two clusterings with noise. On the left is the map
made with Ward's Method, the method we started with. On the right is the map
made with Weighted Average, which turned out to be the best method.