Digital Humanities Abstracts

“Tailoring a formal grammar for efficiency without compromising its linguistic motivation”
Nelleke Oostdijk University of Nijmegen oostdijk@let.kun.nl

Introduction

Corpus linguistics can be characterised as the formalized approach to descriptive linguistics. Its main objective is the study of actual language use, and the variation therein, on the basis of text corpora. A corpus is not just any amount of textual data; rather, it is a balanced collection of language data, constituted by samples of connected discourse, usually in a single dialect. In its raw form, the corpus serves as a test-bed for the linguistic hypotheses laid down in a formal grammar. Once annotated (in accordance with the grammar), the corpus constitutes a database that may be consulted in order to obtain information about linguistic structures, their frequency of occurrence and distribution, as well as to gain insight into co-occurrence restrictions (cf. e.g. Oostdijk and de Haan, 1994). In this approach the formal grammar plays a central role: it contains the formalized description of the language under investigation, which is validated in the process of annotating the corpus, and it is also used to automatically derive the parser that is employed to annotate the corpus. The formal grammar is originally conceived on the basis of the linguist's intuitions and any information found in sources such as grammatical handbooks (for English, for example, Quirk et al. 1985) and linguistic monographs. Through iterative testing on the corpus and augmenting the description it contains, the formal grammar is developed until it reaches a satisfactory level of descriptive and observational adequacy. From the point of view of the (descriptive) linguist, the added value of the corpus linguistic approach lies in the fact that the resulting description is explicit, exhaustive, objective and validated, in contrast with other, traditionally informal accounts.
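
To make concrete how such a database might be consulted, the following minimal sketch (in Python, assuming a simplified, hypothetical annotation format rather than any actual scheme) counts the frequency of clause patterns and the co-occurrence of syntactic functions within sentences:

    # Minimal sketch with a hypothetical annotation format: each sentence is
    # stored as a list of (category, function) pairs, and the corpus is then
    # queried for pattern frequencies and function co-occurrences.
    from collections import Counter
    from itertools import combinations

    annotated_corpus = [
        [("NP", "SU"), ("VP", "V"), ("NP", "OD")],
        [("NP", "SU"), ("VP", "V"), ("PP", "A")],
        [("NP", "SU"), ("VP", "V"), ("NP", "OD"), ("PP", "A")],
    ]

    # Frequency and distribution of clause patterns (sequences of functions).
    pattern_counts = Counter(tuple(func for _, func in sent)
                             for sent in annotated_corpus)

    # Co-occurrence of functions within a single sentence.
    cooccurrence = Counter()
    for sent in annotated_corpus:
        for pair in combinations(sorted({func for _, func in sent}), 2):
            cooccurrence[pair] += 1

    print(pattern_counts.most_common())
    print(cooccurrence.most_common())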

Current practice: an evaluation

While the annotation of corpora can serve a dual purpose, so far the creation of databases has been given priority over the advancement of descriptive theory. The reason is simply that this continues to be the most urgent task, since corpora that have been annotated with detailed linguistic information are still rare. Therefore, at present grammars are being constructed for the purpose of analysing corpora that, once analysed, can be used by the general linguistic community. It goes without saying that the linguistic descriptions contained in these grammars should adhere to the standards set by the discipline. In effect this means that the descriptions must conform to a large extent to what is familiar and traditional. The corpus linguistic approach to the linguistic annotation of corpora described above is a rather ambitious one, as the parser should produce for each corpus sentence (at least) the one contextually appropriate analysis. Since the knowledge incorporated into the formal grammar is for various reasons insufficient, overgeneration is unavoidable. This has a negative impact on the efficiency of the analysis process. With the present orientation towards the production of annotated corpora that can serve as databases for further linguistic research, it is required that for each corpus sentence the database contain ONLY the one analysis that is contextually appropriate. In a situation in which the parser is permitted to overgenerate, human intervention becomes necessary (van Halteren and Oostdijk, 1993). This not only slows down the analysis process even further, but the consistency of the analyses is also no longer guaranteed (as it would be if the analysis process ran fully autonomously). So far the view has been upheld that overgeneration is unavoidable, yet its overall effects on the analysis process and the quality of the output are perhaps more far-reaching than is desirable. While it is undoubtedly true, as Aarts et al. (1996) point out, that "consistency is mainly endangered if the human analyst takes the initiative in the analysis process" and that it is therefore better "to have the linguist only react to prompts given by automatic processes, by asking him to choose from a number of possibilities presented by the machine", the negative effects of the interaction with the human analyst in terms of loss of consistency and efficiency remain.

With regard to our short-term goal, there is still a great demand for corpora that have been annotated with detailed linguistic information, while the rate at which such corpora are being produced is far too low. Our long-term goal, the advancement of descriptive linguistic theory through iteratively testing and augmenting the description contained in the formal grammar, cannot be achieved if we do not succeed in shortening the time-span needed to complete a single iteration.
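
To illustrate what overgeneration looks like in practice, the following minimal sketch (in Python, using a toy context-free grammar and NLTK's chart parser, not the TOSCA grammar or parser) shows a single sentence with a PP-attachment ambiguity receiving two analyses, so that the one contextually appropriate analysis still has to be selected, for instance by the human analyst:

    # A toy grammar (hypothetical, not the grammar discussed here) that
    # overgenerates on a PP-attachment ambiguity: both attachments are licensed
    # by the syntax alone, so the single contextually appropriate analysis must
    # be selected afterwards, e.g. through human intervention.
    import nltk

    toy_grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the'
    N  -> 'analyst' | 'sentence' | 'parser'
    V  -> 'checked'
    P  -> 'with'
    """)

    parser = nltk.ChartParser(toy_grammar)
    sentence = "the analyst checked the sentence with the parser".split()

    analyses = list(parser.parse(sentence))
    print(len(analyses), "analyses produced for one sentence")
    for tree in analyses:
        print(tree)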

Tailoring the grammar

In theory, there are two possible solutions to the problem of overgeneration. The first is to resort to underspecification, a strategy which is widely adopted both in tagging and in parsing. The portmanteau tags in various tagsets are typical examples of underspecification. In the Penn Treebank approach (Marcus et al., 1993), for instance, the portmanteau part-of-speech tag IN is assigned to both prepositions and subordinating conjunctions. Underspecification is also found at all levels of English Constraint Grammar (ENGCG, Karlsson et al. 1995). The major drawback of underspecification is of course the loss of information, as the sketch at the end of this section illustrates. An alternative solution is to incorporate into the grammar the knowledge that is currently brought into play through the interaction with the human analyst during the analysis process. The nature of this information is diverse and includes knowledge about semantics, pragmatics, discourse and syntax. At this stage it would be unrealistic to propose incorporating all these types of knowledge. However, closer examination of the overgeneration found in the corpus we have (syntactically) analysed yields the following picture: (1) not all of the syntactic knowledge we have has as yet found its way into the grammar, and (2) the knowledge we have incorporated in our grammar so far has not been used to the full. The two points are obviously related. Both have to do with the fact that, while the formal grammar was being constructed, it was not at all clear what knowledge and what level of detail were required. The construction of the formal grammar has so far amounted to formalizing the knowledge we were aware of and deemed linguistically relevant.

The present paper reports the results of an investigation into the nature of the overgeneration found in the analysis results obtained in the process of annotating a corpus of Modern British English by means of a rule-based parser. As these results show, there is sufficient reason to believe that it should indeed be possible to tailor the grammar for efficiency without compromising its linguistic motivation. Moreover, the nature of (some of) the adaptations is such that they are relevant not only to the specific parser used in the experiment, but also in a broader context.
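
By way of illustration, the sketch below (in Python, with hypothetical fine-grained tag names) shows the information loss that underspecification entails: once prepositions and subordinating conjunctions have both been mapped onto the portmanteau tag IN, the distinction can no longer be recovered from the annotation alone.

    # A minimal sketch of information loss under underspecification: two
    # hypothetical fine-grained tags (PREP for prepositions, CONJSUB for
    # subordinating conjunctions) are collapsed into the single Penn Treebank
    # portmanteau tag IN, and the distinction cannot be recovered afterwards.
    fine_grained = [
        ("before", "PREP"),     # "before the meeting" - preposition
        ("before", "CONJSUB"),  # "before we met"      - subordinating conjunction
    ]

    PORTMANTEAU = {"PREP": "IN", "CONJSUB": "IN"}

    underspecified = [(word, PORTMANTEAU[tag]) for word, tag in fine_grained]
    print(underspecified)  # [('before', 'IN'), ('before', 'IN')] - distinction lost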

References

J. Aarts, H. van Halteren and N. Oostdijk. “The TOSCA analysis system.” Proceedings of the First AGFL Workshop. Ed. C. H. A. Koster and E. Oltmans. Nijmegen: CSI, 1996. 181-191.
H. van Halteren and N. Oostdijk. “Towards a linguistic database: the TOSCA analysis system.” English Language Corpora: Design, analysis and exploitation. Ed. J. Aarts, P. de Haan and N. Oostdijk. Amsterdam - Atlanta: Rodopi, 1993. 145-161.
Constraint Grammar. A Language-Independent System for Parsing Unrestricted Text. Ed. F. Karlsson, A. Voutilainen, J. Heikkilä and A. Anttila. Berlin - New York: Mouton de Gruyter, 1995.
M. Marcus, B. Santorini and M. A. Marcinkiewicz. “Building a large annotated corpus of English: The Penn Treebank.” Computational Linguistics. 1993. 19: 313-330.
N. Oostdijk and P. de Haan. “Clause patterns in Modern British English. A corpus-based (quantitative) study.” ICAME Journal. 1994. 18: 41-80.
R. Quirk, S. Greenbaum, G. Leech and J. Svartvik. A Comprehensive Grammar of the English Language. London: Longman, 1985.