Digital Humanities Abstracts

“Cluster Analysis of the Newcastle Electronic Corpus of Tyneside English: A Comparison of Methods”
Hermann Moisl, University of Newcastle, hermann.moisl@ncl.ac.uk
Val Jones, University of Twente, jones@cs.utwente.nl

The Newcastle Electronic Corpus of Tyneside English (NECTE) project is based on two separate corpora of recorded speech, one collected in the late 1960s as part of the Tyneside Linguistic Survey (TLS) and the other in 1994 by the Phonological Variation and Change in Contemporary Spoken English (PVC) project. Its aim is to combine the TLS and PVC collections into a single corpus and to make it available to the research community in a variety of formats: digitized sound, phonetic transcription, and standard orthographic transcription, all aligned and available on the Web.

We are currently developing a methodology for studying NECTE from a sociolinguistic point of view, and have begun by looking at the one formulated by the TLS, which was radical at the time and remains so today: in contrast to the then-universal and even now dominant theory-driven approach, in which social and linguistic factors are selected by the analyst on the basis of a predefined model, the TLS proposed a fundamentally empirical approach in which salient factors are extracted from the data itself and then serve as the basis for model construction. To this end, an electronic corpus was created from a subset of the data, and various cluster analysis algorithms were applied to it in order to derive social and linguistic classifications of the sample.

Stability of classifications across different clustering methods was already a known theoretical problem. The clustering techniques available at the time, and still widely used today, are sensitive to factors such as the vector distance measure, the clustering algorithm, and the order in which data items are presented; different combinations of these factors typically yield different analyses of the same dataset. These effects were observed in the TLS classifications. In experiments on artificial datasets, Jones (1979) demonstrated that certain combinations of clustering algorithm and distance measure can impose erroneous structure on data designed to be inherently unclassifiable. The types of structure derived were consistent with theory and observation: Ward's method, for example, tended to 'discover' spherical clusters irrespective of the natural structure of the data, and classifications were shown to be sensitive to the input order of data points (the sketch following the list below reproduces this effect). These properties of clustering techniques raise at least two issues relating to the validation of classifications:
  • objectivity—to what extent does a given analysis represent the actual structure of the data, and to what extent is it an artefact of the clustering method?
  • selection—upon what criteria does one choose among alternative analyses?
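The instability described above is easy to reproduce. The following minimal Python sketch (our illustration, using modern scipy and scikit-learn routines rather than the TLS's original software) applies several distance-measure and algorithm combinations to deliberately structureless data, in the spirit of Jones (1979); the cut into three clusters is our arbitrary choice:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(60, 10))  # uniform noise: no inherent cluster structure

    # Cut every dendrogram into three clusters and collect the partitions.
    partitions = {}
    for metric in ("euclidean", "cityblock", "cosine"):
        for method in ("single", "complete", "average"):
            Z = linkage(pdist(X, metric=metric), method=method)
            partitions[(metric, method)] = fcluster(Z, t=3, criterion="maxclust")

    # Ward's method (Euclidean only) still carves compact, roughly spherical
    # 'clusters' out of this deliberately unclassifiable data.
    partitions[("euclidean", "ward")] = fcluster(
        linkage(X, method="ward"), t=3, criterion="maxclust")

    # Pairwise agreement between partitions (1.0 = identical up to relabelling).
    keys = list(partitions)
    scores = [adjusted_rand_score(partitions[a], partitions[b])
              for i, a in enumerate(keys) for b in keys[i + 1:]]
    print(f"mean pairwise agreement (adjusted Rand index): {np.mean(scores):.2f}")

On noise like this, a robust method would at least agree with itself across settings; instead each combination reports its own 'structure', and the agreement score stays near zero.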
We seek an approach to classification that improves on the methods available when the first analyses of the TLS data were conducted: specifically, techniques as insensitive as possible to variation in the parameters identified above. In this paper we consider as a candidate a method that had not yet been invented when the TLS was active: the self-organizing map (SOM). The discussion is in three main parts. The first outlines the TLS methodology, the second describes self-organizing maps, and the third compares the consistency of the analytical methods used by the TLS with that of the SOM on the TLS phonetic data.

TLS METHODOLOGY

The TLS aimed to model the overall linguistic variability of an urban community, that of Tyneside in north-east England, and more specifically
  • to identify and exhaustively characterise the varieties of speech which co-occur in that area, and
  • to determine the distribution of both the speech varieties and their constituent elements across the relevant social subgroups
To achieve this aim, a methodology was proposed which differed fundamentally from the one standard at the time, and which was taken to have important theoretical and practical consequences for the discipline of sociolinguistics. In place of the theory-driven approach which characterized the work of Labov and Trudgill, among others, in which a sociolinguistic model was defined and then used to select a relatively small number of social and linguistic variables for analysis, the TLS was conceived as a set of methods whereby the salient features of linguistic and social diversity were to be determined empirically from the data rather than presupposed in the model. This methodology was designed to generate multiple candidate hypotheses about the data by applying high-dimensional multivariate analysis methods to it in different ways, from which those most useful to the aims of the TLS could be selected.

SELF-ORGANIZING MAPS

The self-organizing map, also known as the Kohonen net after its inventor, is a k-dimensional surface of processing units, where k is usually 2, together with a buffer into which input vectors are loaded. Associated with each unit is a set of connections from the input buffer such that, for a buffer of length n, there are n connections per unit, and each connection carries a real-valued 'strength' in some range, typically -1..1 or 0..1 (for clarity, only sample connections are shown in Figure 1):
Figure 1. A self-organizing map: a grid of processing units, each connected to the input buffer (sample connections only).
Because it is an artificial neural network, the SOM is not explicitly configured or 'programmed' to behave in some desired way like a conventional computer, but rather learns its behaviour by exposure to input data using a learning algorithm. A full explanation of SOM learning would take us too far afield; details can be found in most textbooks on artificial neural networks. In outline, though, learning takes place by repeated presentation of vectors drawn randomly from a set V, with the connections adjusted at each presentation. The SOM is initialized by assigning random values in some range (e.g., 0..1) to the connections. When the first vector vi is presented, it activates the processing units to varying degrees, depending on how closely each unit's connection strengths match the input vector; the most highly activated unit uj is selected, and the connections are adjusted so that, the next time vi is presented, uj will be even more highly activated than before, thus associating vi ever more strongly with a specific location on the map. As learning proceeds over what is usually many thousands of presentations, each of the vectors in V becomes associated with a specific unit in the map. After learning is complete, the entire set V is presented once again. Each vector activates the unit with which it has learned to become associated, and the result is a pattern of activations on the map surface. That pattern is significant: the distances among activated units represent the similarity relations in the input vector space.
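For concreteness, here is a minimal numpy sketch of the learning procedure just outlined. It is a toy illustration, not the implementation used in this study: the grid size, linear decay schedule, and Gaussian neighbourhood function are our assumptions, chosen for simplicity.

    import numpy as np

    def train_som(V, rows=10, cols=10, epochs=100, lr0=0.5, radius0=5.0, seed=0):
        """Train a rows x cols map on the set of input vectors V (n_vectors x n)."""
        rng = np.random.default_rng(seed)
        # One n-element connection ('weight') vector per unit, randomly
        # initialized in 0..1, matching the initialization step described above.
        W = rng.uniform(0.0, 1.0, size=(rows, cols, V.shape[1]))
        # Grid coordinates of every unit, used to find each winner's neighbours.
        grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
        T = epochs * len(V)
        t = 0
        for _ in range(epochs):
            for v in V[rng.permutation(len(V))]:  # random order of presentation
                # The 'most highly activated' unit is the one whose connection
                # strengths best match the input vector.
                bmu = np.unravel_index(
                    np.argmin(np.linalg.norm(W - v, axis=2)), (rows, cols))
                # Learning rate and neighbourhood radius decay over time.
                frac = t / T
                lr = lr0 * (1.0 - frac)
                radius = 1.0 + (radius0 - 1.0) * (1.0 - frac)
                # Pull the winner and its grid neighbours towards v, so that v
                # activates this region more strongly on its next presentation.
                gdist = np.linalg.norm(grid - np.array(bmu), axis=2)
                h = np.exp(-gdist**2 / (2.0 * radius**2))[..., None]
                W += lr * h * (v - W)
                t += 1
        return W

    def map_positions(V, W):
        """After training, the unit each vector activates; distances between
        these positions reflect similarity relations in the input space."""
        return [np.unravel_index(np.argmin(np.linalg.norm(W - v, axis=2)),
                                 W.shape[:2]) for v in V]

The neighbourhood term h is what produces the property noted above: units near the winner on the grid are pulled towards the same inputs, so similar input vectors come to activate nearby map locations.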

COMPARATIVE STUDY

This comparative study confines itself to the phonetic-level representation of the TLS corpus. In order to apply cluster analysis to this data, the TLS had to represent it numerically. The method was as follows. For each of the 52 informants whose phonetic-level transcriptions had been digitized, the number of token occurrences of each of the 542 state types defined in the transcription protocol was counted, where a 'state' is a discrete phonetic segment type. Each informant's phonetic profile was thus represented as a 542-element integer-valued vector in which the i-th element contained the number of token occurrences of the i-th state type. The set of informant vectors was stored in a 52 x 542 matrix which, after normalization, served as input to the various clustering algorithms used in the analysis. The present study replicates the TLS data representation and cluster analyses, and then compares the performance of a SOM on the same data, using a variety of settings for initialization of connections, sequence of input vector presentations, and map dimension (sketched below).
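The setup can be sketched as follows, reusing train_som and map_positions from the previous sketch. The input format for the raw counts is hypothetical, row-normalization to relative frequencies is one plausible reading of 'normalization' (which is not further specified here), and the seeds and map dimensions are likewise illustrative.

    import numpy as np
    from itertools import product

    N_INFORMANTS, N_STATES = 52, 542

    def profile_matrix(counts):
        """Build the informant matrix. `counts` is a sequence of 52 mappings
        from state index (0..541) to token count (a hypothetical input format)."""
        M = np.zeros((N_INFORMANTS, N_STATES))
        for i, c in enumerate(counts):
            for state, n in c.items():
                M[i, state] = n
        # Row-normalize to relative frequencies so that informants with
        # different amounts of recorded speech are comparable (one plausible
        # normalization; the choice used in the original analysis may differ).
        return M / M.sum(axis=1, keepdims=True)

    def som_runs(M):
        """Train SOMs under varied seeds (controlling both weight initialization
        and presentation order) and map dimensions; return each run's layout."""
        runs = {}
        for seed, (r, c) in product(range(3), [(8, 8), (12, 12)]):
            W = train_som(M, rows=r, cols=c, seed=seed)  # from the sketch above
            runs[(seed, r, c)] = map_positions(M, W)
        return runs

Comparing the informants' map positions across the runs in som_runs is then the SOM analogue of comparing partitions across clustering settings: a stable method should place the same informants near one another on every map.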

CONCLUSIONS

Preliminary results from a relatively small subset of the 52 TLS informants indicate that the SOM performs as well as the other clustering algorithms in its ability to identify and represent clusters, but that it is far less affected by variation in processing parameters. Results for the full TLS dataset will be available by the time this paper is presented.

REFERENCES

B. Everitt. Cluster Analysis. London: Edward Arnold, 1993. 170 pp.
J. Hair, R. Anderson, R. Tatham, and W. Black. Multivariate Data Analysis. Englewood Cliffs, NJ: Prentice Hall, 1995. 751 pp.
V. Jones-Sargent. Tyne Bytes: A Computerized Sociolinguistic Study of Tyneside English. Frankfurt am Main and New York: P. Lang, 1983. 368 pp.
T. Kohonen. Self-Organizing Maps. Berlin and New York: Springer, 1995. 312 pp.
J. Pellowe. “A dynamic modelling of linguistic variation: the urban (Tyneside) linguistic survey.” Lingua. 1972. 30: 1-30.
J. Pellowe and V. Jones. “On intonational variety in Tyneside speech.” Sociolinguistic Patterns in British English. Ed. P. Trudgill. London: Arnold, 1978. 101-121.
V. Sargent (née Jones). “Cycles and the equal society.” Classification Society Bulletin. 1979. 4: 31-45.
B. Strang. “The Tyneside Linguistic Survey.” Zeitschrift für Mundartforschung. 1968. 4: 788-794.