“Combining corpus and experimental data: methodological considerations”

Inge de Mönnink University of Nijmegen I.deMonnink@let.kun.nl

Over the last three decades the study of language use on the basis of corpus data has been re-established in linguistics, while the use of corpora has also spread to fields such as speech research, sociolinguistics and lexical studies, where corpora are nowadays widely used. So far, little attention has been given to the methodological issues that arise with the use of corpus data in any of these research contexts. Corpora are exploited for both quantitative information (absolute and relative frequency of occurrence) and qualitative information (distribution). While certain issues are being addressed, such as how large corpora and samples should ideally be, other, related and certainly no less important issues are not being considered at all. One of these is the appropriateness of corpora for the study of relatively infrequent phenomena. In this context the question must be raised whether, and if so how, corpus data can or must be supplemented with experimental data.

A good example of what appears to be an infrequent phenomenon is the variation in the constituent structure of the noun phrase (de Mönnink, 1996). In handbooks on English grammar the noun phrase has been described as comprising an optional determiner followed by zero or more premodifying elements, an obligatory head, and zero or more postmodifying elements. However, as previous research shows, there are several types of noun phrase that do not conform to this basic pattern. Examples are:

1. shifted premodification: the premodifier occurs before the determiner;
2. discontinuous modification: two constituents which are intuitively felt to belong together are split up into two non-adjacent parts, the first preceding the head, the second following it;
3. floating postmodification: the postmodifier is not adjacent to the other constituents of the noun phrase it modifies.

Corpus-based studies generally comprise a qualitative and a quantitative analysis.
The qualitative analysis aims at a detailed description of the phenomenon under study. The quantitative analysis gives a precise picture of the absolute and relative (in)frequency of occurrence of the phenomenon. For the qualitative analysis no generally accepted methodology exists. Which methods are used depends on the phenomenon being studied, the findings of earlier studies, the hypotheses formulated by the researcher and (the nature of the annotation in) the available corpus data. For the quantitative analysis, use is made of generally accepted statistical methods. However, for some statistical tests to be reliable a specific minimum frequency is needed (see e.g. de Haan, 1992). Thus, the quantitative analysis is particularly suitable for frequent phenomena and less so (or perhaps not at all) for infrequent phenomena. This finding is confirmed by studies on the representativeness of corpora. It has been observed (Biber, 1990, 1993; de Haan, 1992) that for infrequent structures to be fully represented in the corpus, samples have to be large. It is, however, difficult to predict beforehand how large a sample has to be. In turn, if the corpus is to contain various text types so that linguistic variation among texts can be identified, the corpus as a whole has to be very large. In other words, for the (quantitative) study of an infrequent phenomenon by means of a corpus that is representative of the population under study, the corpus has to be sufficiently large. How large exactly depends on the actual frequency of the phenomenon.

Table 1 gives the frequency of occurrence of (types of) NPs in a corpus of some hundred and forty thousand words. The corpus contains four text types: fiction, non-fiction, drama and spoken material. For a chi-square test to be reliable, the expected number of observations in any one cell should not be lower than five.
For the three types of NP described above, we see that only the 20,000-word fiction samples have a fair chance of containing enough occurrences of all three types. For floating postmodification, 10,000-word samples are generally large enough; for discontinuous modification, 20,000-word samples are needed; and for shifted premodification, 30,000-word samples, or even larger ones for non-fiction and drama.

Table 1: Number of occurrences of (types of) NPs in different genres
Genre	Sample	Number of words	Number of NPs	Floating postmod.	Discontinuous mod.	Shifted premod.
Fiction	BW	21,558	6318	11	7	8
	MR	20,266	6382	8	2	1
	CC	20,011	6136	34	6	5
Non-fiction	CB	19,368	5788	26	11	2
	CM	10,581	3282	5	1	-
Drama	SI	14,022	4629	11	2	1
	NC	5,642	1866	1	1	-
Spoken	Sp1	14,919	4548	17	2	2
	Sp2	15,938	5182	14	3	2
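
The sample-size reasoning above can be made concrete with a small calculation. The sketch below is an illustration only, not the procedure used in the study: it assumes that the observed rate per word in Table 1 is a fair estimate of the true rate, and it treats "at least five occurrences in a sample" as a rough stand-in for the minimum expected frequency that a chi-square test requires. The names `TABLE_1`, `pooled_counts` and `words_needed` are hypothetical, and a dash in the table is entered as 0.

```python
# Table 1 rows: (genre, sample, words, NPs, floating, discontinuous, shifted).
# A dash in the original table is entered as 0 (assumption).
TABLE_1 = [
    ("Fiction", "BW", 21558, 6318, 11, 7, 8),
    ("Fiction", "MR", 20266, 6382, 8, 2, 1),
    ("Fiction", "CC", 20011, 6136, 34, 6, 5),
    ("Non-fiction", "CB", 19368, 5788, 26, 11, 2),
    ("Non-fiction", "CM", 10581, 3282, 5, 1, 0),
    ("Drama", "SI", 14022, 4629, 11, 2, 1),
    ("Drama", "NC", 5642, 1866, 1, 1, 0),
    ("Spoken", "Sp1", 14919, 4548, 17, 2, 2),
    ("Spoken", "Sp2", 15938, 5182, 14, 3, 2),
]
NP_TYPES = ("floating", "discontinuous", "shifted")
MIN_EXPECTED = 5  # rule-of-thumb minimum for a reliable chi-square test


def pooled_counts(table, genre=None):
    """Sum the word count and the occurrences per NP type,
    over all samples or over the samples of one genre."""
    rows = [r for r in table if genre is None or r[0] == genre]
    words = sum(r[2] for r in rows)
    totals = {name: sum(r[4 + i] for r in rows)
              for i, name in enumerate(NP_TYPES)}
    return words, totals


def words_needed(occurrences, words, minimum=MIN_EXPECTED):
    """Smallest sample size (in words) at which the observed rate
    would yield `minimum` occurrences; None if nothing was observed."""
    if occurrences == 0:
        return None
    # Integer ceiling of minimum * words / occurrences.
    return (minimum * words + occurrences - 1) // occurrences


if __name__ == "__main__":
    for genre in (None, "Fiction", "Non-fiction", "Drama", "Spoken"):
        words, totals = pooled_counts(TABLE_1, genre)
        for name in NP_TYPES:
            print(genre or "All genres", name,
                  words_needed(totals[name], words))
```

Pooled over all nine samples, this estimate comes to roughly 5,600 words for floating postmodification, 20,300 words for discontinuous modification and 33,900 words for shifted premodification. Computed per genre, the estimate for shifted premodification is considerably higher for non-fiction and drama than for fiction. With so few observations per cell these figures are only indicative.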

So far there appears to be a simple solution to the use of corpora for the study of infrequent phenomena: increase the corpus size. For a simple quantitative analysis this would indeed be adequate. However, a quantitative analysis lacks the descriptive richness about the nature of structures which a qualitative analysis can provide. A qualitative analysis, on the other hand, can only give subjective judgments about currency or rarity. For a corpus-based study of a relatively infrequent phenomenon that aims to take into account both the qualitative and the quantitative aspects, a major problem is that large corpora are required while the detailed annotation of such corpora is not feasible. Given the present state of the art, the detailed annotation of corpora still requires a vast amount of manual work. Hand-annotation is so time-consuming and so subject to inconsistencies that the corpus necessarily has to remain small. This demand is directly opposed to the demand for large corpora made by approaches that are primarily interested in the quantitative information a corpus provides. Thus, for the study of the nature and frequency of a relatively infrequent phenomenon, more is needed than the combination of a qualitative and a quantitative analysis of corpus data. Additional data have to be considered. In the past, elicitation data have been used to supplement corpus data (e.g. Quirk and Svartvik, 1966, 1979; Greenbaum, 1970, 1973, 1984). When the experiment is designed with care, elicitation that includes both performance and judgment tests can form an important source of data to supplement corpus data. In de Mönnink (forthcoming) I argue that the combination of corpus and experimental data forms a valuable contribution to the description of language use.
If a phenomenon is too infrequent to be subjected to a corpus-based study alone, elicitation tests enable the linguist to supplement the data, not only with native speakers' judgments on the general acceptability of structures, but also with additional structures, whether expected because they were predicted by the grammar, intuitively considered probable, or unexpected yet acceptable. Although experimental data have been combined with corpus data before, no attention has so far been paid to the problems of combining these two essentially very different approaches to gathering data. Each has its own methodology for collecting, classifying, analysing and reporting the data in a systematic way. While in de Mönnink (1996) I discussed the design of elicitation experiments that can be used to supplement corpus data, in this paper I discuss ways of combining the two approaches with respect to data classification and analysis. I argue that this combination is not simply a matter of integrating statistical outputs, but that it influences both methodologies in such a radical way that it leads to an entirely new methodology for a multi-method approach. I illustrate my findings with the study of non-regular noun phrases.

References

D. Biber. “Methodological Issues Regarding Corpus-based Analysis of Linguistic Variation.” Literary and Linguistic Computing. 1990. 5: 257-69.

D. Biber. “Representativeness in Corpus Design.” Literary and Linguistic Computing. 1993. 8: 243-57.

S. Greenbaum. “Informant Elicitation of Data on Syntactic Variation.” Lingua. 1973. 31: 201-212.

S. Greenbaum. “Corpus Analysis and Elicitation Tests.” Corpus Linguistics. Recent Developments in the Use of Computer Corpora in English Language Research. Ed. J. Aarts and W. Meijs. Amsterdam: Rodopi, 1984. 193-201.

S. Greenbaum and R. Quirk. Elicitation Experiments in English: Linguistic Studies in Use and Attitude. London: Longman, 1970.

P. de Haan. “The Optimum Corpus Sample Size?” New Directions in English Language Corpora. Ed. G. Leitner. Berlin: Mouton de Gruyter, 1992. 3-19.

I. de Mönnink. “A First Approach to the Mobility of Noun Phrase Constituents.” Synchronic Corpus Linguistics. Papers from the sixteenth International Conference on English Language Research on Computerized Corpora (ICAME 16). Ed. C. Percy, F. Meyer and I. Lancashire. Amsterdam: Rodopi, 1996. 143-57.

I. de Mönnink. “Using Corpus and Experimental Data: a Multi-method Approach.” Papers from the seventeenth International Conference on English Language Research on Computerized Corpora (ICAME 17). Ed. M. Ljung. Forthcoming.

R. Quirk and J. Svartvik. Investigating Linguistic Acceptability. The Hague: Mouton, 1966.

R. Quirk and J. Svartvik. “A Corpus of Modern English.” Empirische Textwissenschaft. Aufbau und Auswertung von Textcorpora. Ed. H. Bergenholtz and B. Schaeder. Koenigstein: Scriptor, 1979. 204-218.