“Cluster Analysis of the Newcastle Electronic Corpus of
Tyneside English: A comparison of Methods”
Herbert
Moisl
University of Newcastle
hermann.moisl@ncl.ac.uk
Val
Jones
University of Twente
jones@cs.utwente.nl
The Newcastle Electronic Corpus of Tyneside English (NECTE) project is based
on two separate corpora of recorded speech, one of them collected in the
late 1960s as part of the Tyneside Linguistic Survey (TLS), and the other in
1994 by the Phonological Variation and Change in Contemporary Spoken English
(PVC) project. Its aim is to combine the TLS and the PVC collections into a
single corpus and to make it available to the research community in a
variety of formats: digitized sound, phonetic transcription, and standard
orthographic transcription, all aligned and available on the Web.
We are currently developing a methodology to study NECTE from a
sociolinguistic point of view, and have begun by looking at the one
formulated by the TLS, which was radical at the time and remains so today:
in contrast to the then-universal and even now dominant theory-driven
approach, where social and linguistic factors are selected by the analyst on
the basis of a predefined model, the TLS proposed a fundamentally empirical
approach in which salient factors are extracted from the data itself and
then serve as the basis for model construction. To this end, an electronic
corpus was created from a subset of the data, and various cluster analysis
algorithms were applied to it in order to derive social and linguistic
classifications of the sample. Stability of classifications across different
clustering methods was already a known theoretical problem. The clustering
techniques available at the time, and still widely used today, are sensitive
to factors such as vector distance measure, clustering algorithm, and the
order in which data items are presented —different combinations of these
factors typically yield different analyses of the same dataset. These
effects were observed in the TLS classifications. In an experiment on
artificial data sets Jones (1979) demonstrated that certain combinations of
clustering algorithms are capable of imposing erroneous structure on data
which was inherently unclassifiable (by design). The types of structurings
derived were consistent with theory and observation; for example Ward’s
method tended to ‘discover’ spherical clusters irrespective of the natural
structure of the data and classifications were shown to be sensitive to
input order of datapoints. These properties of clustering techniques raise
at least two issues relating to validation of classifications:
- objectivity—to what extent does a given analysis represent the actual structure of the data, and to what extent is it an artefact of the clustering method?
- selection—upon what criteria does one choose among alternative analyses?
TLS METHODOLOGY
The TLS aimed to model the overall linguistic variability of an urban community, that of Tyneside in north-east England, and more specifically- to identify and exhaustively characterise the varieties of speech which co-occur in that area, and
- to determine the distribution of both the speech varieties and their constituent elements across the relevant social subgroups
SELF-ORGANIZING MAPS
The self-organizing map, also known as the Kohonen net after its inventor, is a k-dimensional surface of processing units, where k is usually 2, together with a buffer into which input vectors are loaded. Associated with each unit is a set of connections from the input buffer such that, for a buffer of length n, there are n connections per unit, and each connection can take on a real-number value or ‘strength’ in some range, typically -1..1 or 0..1 (for clarity, only sample connections are shown in Figure 1):
Because it is an artificial neural network, the SOM is not explicitly
configured or ‘programmed’ to behave in some desired way like a conventional
computer, but rather learns its behaviour by exposure to input data using a
learning algorithm. A full explanation of SOM learning would take us too far
afield; details can be found in most textbooks on artificial neural
networks. In outline, though, learning takes place by repeated presentation
of vectors drawn randomly from a set V, and adjustment of the connections at
each presentation. The SOM is initialized by assigning random values in some
range (ie, 0..1) to the connections. When the first vector vi is presented,
it activates the processing units to varying degrees, depending on the
differences in connection strengths between the input buffer and each unit;
the most highly activated unit uj is selected, and the connections are
adjusted so that, next time vi is presented, uj will be even more highly
activated than before, thus associating vi ever more strongly with a
specific location on the map. As learning proceeds over—usually—many
thousands of presentations, each of the vectors in V is associated with a
specific unit in the map. After learning is complete, the entire set V is
presented once again. Each vector activates the unit with which it has
learned to become associated, and the result is a pattern of activations on
the map surface. That pattern is significant: the distances among activated
units represent the similarity relations in the input vector space.
COMPARATIVE STUDY
This comparative study confines itself to the phonetic-level representation of the TLS corpus. In order to apply cluster analysis to this data, the TLS had to represent it numerically. The method was as follows. For each of the 52 informants whose phonetic-level transcriptions had been digitized, the number of token occurrences of each of the 542 state types S defined in the transcription protocol was counted, where a ‘state’ is a discrete phonetic segment type. Each informant’s phonetic profile was thus represented as a 542-element integer-valued vector V, in which any element Vi contained the number of token occurrences of state Si. The set of informant vectors was stored in a 52 x 542 matrix which, after normalization, served as input to the various clustering algorithms used in the analysis. The present study replicates the TLS data representation and cluster analyses, and then compares the performance of a SOM on the same data, using a variety of settings for initialization of connections, sequence of input vector presentations, and map dimension.CONCLUSIONS
Preliminary results from a relatively small subset of the 52 TLS informants indicate that the SOM performs as well as the other clustering algorithms in terms of its ability to identify and represent clusters, but that it is far less affected by variation in processing parameters. Results for the full TLS dataset will be available if and when this paper is presented.REFERENCES
B. Everit. Cluster Analysis. London: E. Arnold, 1993. 170.
J. Hair R. Anderson R. Tatham C. Black. Multivariate Data Analysis. Englewood Cliff, NJ: Prentice Hall, 1995. 751.
V. Jones-Sargent. Tyne Bytes: a computerized sociolinguistic study of Tyneside English. Frankfurt am Main and New York: P. Lang, 1983. 368.
T. Kohonen. Self-Organizing Maps. Berlin and New York: Springer, 1995. 312.
J. Pellowe. “A dynamic modelling of linguistic variation: the urban
(Tyneside) linguistic survey.” Lingua. 1972. 30: 1-30.
J. Pellowe V. Jones. “On intonational variety in Tyneside speech.” Sociolinguistic Patterns of British English. Ed. P. Trudgill. London: Arnold, 1978. 101-121.
V. Sargent(nee Jones). “Cycles and the equal society.” Classification Society Bulletin. 1979. 4: 31-45.
B. Strang. “The Tyneside Linguistic Survey.” Zeitschrift für Mundartforschung. 1968. 4: 788-794.