Cluster Analysis of the Newcastle Electronic Corpus of
Tyneside English: A Comparison of Methods
Hermann Moisl
University of Newcastle
hermann.moisl@ncl.ac.uk
Val Jones
University of Twente
jones@cs.utwente.nl
The Newcastle Electronic Corpus of Tyneside English (NECTE) project is based
on two separate corpora of recorded speech, one of them collected in the
late 1960s as part of the Tyneside Linguistic Survey (TLS), and the other in
1994 by the Phonological Variation and Change in Contemporary Spoken English
(PVC) project. Its aim is to combine the TLS and the PVC collections into a
single corpus and to make it available to the research community in a
variety of formats: digitized sound, phonetic transcription, and standard
orthographic transcription, all aligned and available on the Web.
We are currently developing a methodology to study NECTE from a
sociolinguistic point of view, and have begun by looking at the one
formulated by the TLS, which was radical at the time and remains so today:
in contrast to the then-universal and even now dominant theory-driven
approach, where social and linguistic factors are selected by the analyst on
the basis of a predefined model, the TLS proposed a fundamentally empirical
approach in which salient factors are extracted from the data itself and
then serve as the basis for model construction. To this end, an electronic
corpus was created from a subset of the data, and various cluster analysis
algorithms were applied to it in order to derive social and linguistic
classifications of the sample. Stability of classifications across different
clustering methods was already a known theoretical problem. The clustering
techniques available at the time, and still widely used today, are sensitive
to factors such as vector distance measure, clustering algorithm, and the
order in which data items are presented; different combinations of these
factors typically yield different analyses of the same dataset. These
effects were observed in the TLS classifications. In an experiment on
artificial data sets, Jones (1979) demonstrated that certain combinations of
clustering algorithms can impose erroneous structure on data
that was, by design, inherently unclassifiable. The structurings
derived were consistent with theory and observation: for example, Ward’s
method tended to ‘discover’ spherical clusters irrespective of the natural
structure of the data, and classifications were shown to be sensitive to
the input order of data points. These properties of clustering techniques raise
at least two issues relating to validation of classifications:
- objectivity—to what extent does a given analysis represent the actual structure of the data, and to what extent is it an artefact of the clustering method?
- selection—upon what criteria does one choose among alternative analyses?
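The dependence on clustering algorithm described above can be illustrated with a small sketch. It is not the TLS procedure, nor a reconstruction of the Jones (1979) experiment: the six one-dimensional points, the function names, and the choice of single versus complete linkage are assumptions made purely for illustration. The points are evenly graded, with no obvious gap, yet each method returns a confident two-cluster answer, and the two answers disagree.

```python
from itertools import combinations

def agglomerate(points, linkage, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations() guarantees i < j
    return sorted(sorted(c) for c in clusters)

def single(a, b):
    """Single linkage: distance between the two nearest members."""
    return min(abs(x - y) for x in a for y in b)

def complete(a, b):
    """Complete linkage: distance between the two farthest members."""
    return max(abs(x - y) for x in a for y in b)

# Hypothetical data: gaps of 10, 11, 12, 13, 14 between neighbours.
points = [0, 10, 21, 33, 46, 60]

single_result = agglomerate(points, single, 2)
complete_result = agglomerate(points, complete, 2)

print(single_result)    # [[0, 10, 21, 33, 46], [60]]
print(complete_result)  # [[0, 10, 21, 33], [46, 60]]
```

Single linkage chains the points together and isolates the last one, whereas complete linkage prefers two more compact groups. Neither partition is obviously the "true" one, which is exactly the objectivity and selection problem raised above.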
TLS METHODOLOGY
The TLS aimed to model the overall linguistic variability of an urban community, that of Tyneside in north-east England, and more specifically:
- to identify and exhaustively characterise the varieties of speech which co-occur in that area, and
- to determine the distribution of both the speech varieties and their constituent elements across the relevant social subgroups.
SELF-ORGANIZING MAPS
The self-organizing map, also known as the Kohonen net after its inventor, is a k-dimensional surface of processing units, where k is usually 2, together with a buffer into which input vectors are loaded. Associated with each unit is a set of connections from the input buffer such that, for a buffer of length n, there are n connections per unit, and each connection can take on a real-number value or ‘strength’ in some range, typically -1..1 or 0..1 (for clarity, only sample connections are shown in Figure 1).
Figure 1
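The structure just described can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study. The grid and buffer sizes, the learning rate, the Gaussian neighbourhood, and all function names are assumptions. The update rule shown is the standard Kohonen one, which the text above does not itself spell out.

```python
import math
import random

def make_map(rows, cols, n, rng):
    """A rows x cols grid of units, each with n connection strengths in 0..1."""
    return [[[rng.random() for _ in range(n)] for _ in range(cols)]
            for _ in range(rows)]

def best_matching_unit(som, vector):
    """Grid coordinates of the unit whose connections lie closest to the input."""
    return min(
        ((r, c) for r in range(len(som)) for c in range(len(som[0]))),
        key=lambda rc: math.dist(som[rc[0]][rc[1]], vector),
    )

def train_step(som, vector, rate, radius):
    """Pull the winning unit and its grid neighbours towards the input vector."""
    br, bc = best_matching_unit(som, vector)
    for r, row in enumerate(som):
        for c, weights in enumerate(row):
            grid_dist = math.hypot(r - br, c - bc)
            if grid_dist <= radius:
                # Assumed Gaussian neighbourhood: full effect at the winner,
                # weaker effect on units further away on the grid.
                influence = rate * math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for i, x in enumerate(vector):
                    weights[i] += influence * (x - weights[i])
    return br, bc

rng = random.Random(0)                        # fixed seed for repeatability
som = make_map(rows=4, cols=4, n=3, rng=rng)  # k = 2, input buffer length n = 3
sample = [0.9, 0.1, 0.5]                      # hypothetical input vector

winner = best_matching_unit(som, sample)
before = math.dist(som[winner[0]][winner[1]], sample)
train_step(som, sample, rate=0.5, radius=1.0)
after = math.dist(som[winner[0]][winner[1]], sample)
```

With a rate of 0.5 the winning unit moves halfway towards the input, so `after` is half of `before`. Repeating the step over many input vectors, while shrinking the rate and radius, is what lets similar inputs settle on neighbouring units of the map.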
COMPARATIVE STUDY
This comparative study confines itself to the phonetic-level representation of the TLS corpus. In order to apply cluster analysis to this data, the TLS had to represent it numerically. The method was as follows. For each of the 52 informants whose phonetic-level transcriptions had been digitized, the number of token occurrences of each of the 542 state types S defined in the transcription protocol was counted, where a ‘state’ is a discrete phonetic segment type. Each informant’s phonetic profile was thus represented as a 542-element integer-valued vector V, in which any element Vi contained the number of token occurrences of state Si. The set of informant vectors was stored in a 52 x 542 matrix which, after normalization, served as input to the various clustering algorithms used in the analysis. The present study replicates the TLS data representation and cluster analyses, and then compares the performance of a SOM on the same data, using a variety of settings for initialization of connections, sequence of input vector presentations, and map dimension.
CONCLUSIONS
Preliminary results from a relatively small subset of the 52 TLS informants indicate that the SOM performs as well as the other clustering algorithms in terms of its ability to identify and represent clusters, but that it is far less affected by variation in processing parameters. Results for the full TLS dataset will be available if and when this paper is presented.
REFERENCES
B. Everitt. Cluster Analysis. London: E. Arnold, 1993. 170.
J. Hair, R. Anderson, R. Tatham, C. Black. Multivariate Data Analysis. Englewood Cliffs, NJ: Prentice Hall, 1995. 751.
V. Jones-Sargent. Tyne Bytes: a computerized sociolinguistic study of Tyneside English. Frankfurt am Main and New York: P. Lang, 1983. 368.
T. Kohonen. Self-Organizing Maps. Berlin and New York: Springer, 1995. 312.
J. Pellowe. “A dynamic modelling of linguistic variation: the urban (Tyneside) linguistic survey.” Lingua. 1972. 30: 1-30.
J. Pellowe, V. Jones. “On intonational variety in Tyneside speech.” Sociolinguistic Patterns of British English. Ed. P. Trudgill. London: Arnold, 1978. 101-121.
V. Sargent (née Jones). “Cycles and the equal society.” Classification Society Bulletin. 1979. 4: 31-45.
B. Strang. “The Tyneside Linguistic Survey.” Zeitschrift für Mundartforschung. 1968. 4: 788-794.