Digital Humanities Abstracts

“A Corpus-Based Methodology for Identifying Non-nominal "It": Rule-Based and Machine Learning Approaches”
Richard Evans, University of Wolverhampton, UK

In this paper, seven uses of "it" in English are identified. These uses involve noun phrase anaphora (e.g. "Do not sweep the dust when dry, you will only recirculate IT."), verb phrase anaphora (e.g. "Raising money for your favourite charity can be fun. You can do IT on your own..."), reference to clauses (e.g. "Not every city would be suited to this approach, IT must be admitted."), reference to entire discourse segments (e.g. "Always use a tool for the job it was intended to do. Always use tools correctly. If IT feels very awkward, stop."), cataphoric reference to entities (e.g. "When IT fell, the glass broke."), pleonastic uses in which the pronoun has no reference and is required only by the grammar (e.g. "IT was {raining, 4 o'clock, All Saints' day, etc.}", "IT's recommended that...", "IT's easier to..."), and uses in idiomatic constructions (e.g. "I take IT you're going now."). In the absence of a suitable term in the literature, the term "non-nominal it" is used here for all cases in which "it" is not in an anaphoric relationship with a noun phrase in the text.
Numerous researchers have proposed hand-crafted, rule-based pattern-matching techniques to identify pleonastic "it". These methods have the drawback that they depend on recognising potentially large and open-ended lists of trigger words and complex expressions. The goal here was to compare a rule-based method with a machine learning method for the same identification task. It was hoped that information such as the position of the pronoun and its complex relation to the surrounding syntactic context would contribute to the accuracy of the identification. We implemented both methods. Corpora were constructed and annotated, then used to train the machine learning classifier and to evaluate the accuracy of both programs, and the two methods were compared.
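The hand-crafted pattern matching mentioned above, and its dependence on trigger lists, can be illustrated with a minimal sketch. The two patterns and the short trigger lists below are hypothetical illustrations chosen to cover the example sentences in this abstract; they are not the rules of Paice & Husk or of any other cited author.

```python
import re

# Hypothetical, deliberately tiny trigger lists. Real systems need much
# larger, open-ended lists, which is the drawback noted in the text.
WEATHER_TIME = r"(?:raining|snowing|\d{1,2} o'clock)"
EVALUATIVE = r"(?:recommended|easier|important|possible|likely)"

PATTERNS = [
    # "it was raining", "it is 4 o'clock"
    re.compile(rf"\bit\s+(?:is|was)\s+{WEATHER_TIME}", re.IGNORECASE),
    # "it's recommended that ...", "it is easier to ..."
    re.compile(rf"\bit(?:'s|\s+is|\s+was)\s+{EVALUATIVE}\s+(?:that|to)\b",
               re.IGNORECASE),
]

def looks_pleonastic(sentence: str) -> bool:
    """Return True if any illustrative pattern matches the sentence."""
    return any(p.search(sentence) for p in PATTERNS)
```

A sentence such as "It was snowing" is caught only because "snowing" happens to be in the list; any unlisted trigger word is silently missed, which is why such lists tend to grow without limit.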
The literature makes clear the importance of recognising non-nominal uses of "it" in anaphora resolution, information retrieval, machine translation and text summarisation. The task is all the more important given that almost one third of the uses of "it" in our corpus of randomly selected texts were non-nominal. In the full paper, the treatment of pleonastic "it" in surveys of English usage is reviewed, as is work by Paice & Husk (1987), Lappin & Leass (1994) and Denber (1998) on methods for automatic recognition of pleonastic "it". The application of machine learning to a different problem in linguistics is covered in a review of Litman's (1996) work on the automatic classification of cue phrases. One of the methods in the present paper applies machine learning to the automatic identification of non-nominal "it".
A novel resource was required for this corpus-based research. A corpus was therefore constructed using 77 randomly selected texts from the BNC (Burnard 1995) and stripped-down versions of the Susanne corpus (Sampson 1995). We implemented a software tool that helps a human annotator add SGML markup to instances of "it" in the corpus. Non-nominal uses of "it" are marked <PLEO ID="XX">it</PLEO>, whereas other instances are left unmarked. On completion, the corpus contained 368830 words and 3171 occurrences of "it", of which 1025 were non-nominal. A DTD was defined for the annotated corpus, and the SGML-aware LT-Chunker (Mikheev 1996) was used to tokenise the corpus while preserving the existing markup. The tokenised file was then processed by a Perl program written to report the paragraph, sentence and word positions of the non-nominal instances of "it". This information was written to a data file and used to evaluate the methods described below.
We implemented a program based on Paice & Husk's (1987) method for recognition of pleonastic "it".
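The position-reporting step applied to the annotated corpus can be sketched as follows. This is an illustration in Python rather than the original Perl program, and it assumes a simplified input (a list of paragraphs, each a list of sentence strings) instead of the LT-Chunker output; the function name and the tokenisation rule are hypothetical.

```python
import re

# A token is either a complete <PLEO ...>...</PLEO> element or a run of
# non-whitespace characters (assumed, simplified tokenisation).
TOKEN = re.compile(r'<PLEO ID="[^"]*">[^<]*</PLEO>|\S+')

def pleo_positions(paragraphs):
    """Return 1-based (paragraph, sentence, word) positions of marked 'it'.

    paragraphs: list of paragraphs, each a list of sentence strings.
    """
    positions = []
    for p, sentences in enumerate(paragraphs, start=1):
        for s, sentence in enumerate(sentences, start=1):
            for w, token in enumerate(TOKEN.findall(sentence), start=1):
                if token.startswith("<PLEO"):
                    positions.append((p, s, w))
    return positions
```

Writing the resulting tuples to a file gives a gold-standard record against which the output of any classifier can be compared.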
In the first step, a plain text version of the corpus was tagged using Tapanainen & Jarvinen's (1997) SGML-blind FDG-Parser. The output from the tagger was converted to an SGML format by our software and then processed by our program based on Paice & Husk's pattern recognition method. In this way a classification was assigned to each instance of "it". Evaluation was performed by comparing the output of the program with the contents of the data file produced earlier.
A machine learning approach was also implemented. It exploits Daelemans' (1999) TiMBL memory-based learning method. TiMBL uses a training file of feature-value vectors, each of which has been assigned a classification: non-nominal or not. The training file was constructed by processing a plain text version of the annotated corpus with the FDG-Parser and the SGML conversion program. The resulting SGML file was input to a program that described each instance of "it" as a vector of feature values. The features used in our approach were designed to describe the position of instances, the lemmas of significant following words such as verbs and adjectives, and the relation of "it" to other structures in the text, such as prepositions and noun phrases. A thorough description appears in the full paper. The vectors were labelled by comparison with the data file constructed earlier, and the resulting set of classified vectors was used as the training file. TiMBL classifies query vectors according to their similarity to the examples in the training file. Ten-fold cross-validation was used to estimate the average accuracy of the technique over our corpus.
The automatic evaluation showed no major difference in accuracy between the two methods. The method based on Paice & Husk's work was slightly more accurate over this corpus (78.81% vs. 78.68%) but had a stronger tendency to misclassify instances as non-nominal (false positives: 265 vs. 243). If false positives are undesirable to the user, the machine learning approach is preferable. In further experiments, the classification of instances in the training set was extended to a 7-ary system, in accordance with the uses given in the introduction; this gave some improvement in the binary distinction between nominal and non-nominal uses. Classification accuracy rose to 78.74% and the number of false positive classifications fell to 209. Predictably, the detection rate for each of the different types of usage was low (50.35% on average).
Given that TiMBL relies on a training file, it will also be beneficial to extend our resource in both size and information content. The present file, with 3171 instances, cannot be considered sufficiently large. The availability of a suitable resource for evaluation is also important for the application of optimisation techniques. Of course, non-nominal pronouns appear in languages other than English, and it would be valuable to build resources for those languages in order to explore machine learning methods for identifying non-nominal pronouns in them.
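The similarity-based classification and the cross-validated evaluation described above can be sketched in a few lines. This is a plain nearest-neighbour classifier with an unweighted overlap metric, which is a simplification and not TiMBL itself; the function names, the fold assignment and the label strings are assumptions made for illustration.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions on which two vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def classify(query, training, k=1):
    """Majority label among the k training examples nearest to the query.

    training: list of (feature_vector, label) pairs.
    """
    nearest = sorted(training,
                     key=lambda ex: overlap_distance(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_validate(examples, folds=10, k=1):
    """Return (accuracy, false_positives) summed over all held-out folds.

    A false positive is an instance classified as "non-nominal" whose
    gold label is something else.
    """
    correct = false_pos = 0
    for i in range(folds):
        test = examples[i::folds]  # every folds-th example is held out
        train = [ex for j, ex in enumerate(examples) if j % folds != i]
        for features, gold in test:
            predicted = classify(features, train, k)
            correct += predicted == gold
            false_pos += predicted == "non-nominal" and gold != "non-nominal"
    return correct / len(examples), false_pos
```

Reporting false positives alongside accuracy mirrors the comparison made in the text, where the two methods differ little in accuracy but more noticeably in their tendency to over-predict the non-nominal class.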

References

L. Burnard. Users Reference Guide British National Corpus Version 1.0. Oxford University Computing Services, UK, 1995.
W. Daelemans. TiMBL: Tilburg Memory Based Learner version 2 Reference Guide. ILK Technical Report ILK 99-01. The Netherlands: Tilburg University, 1999.
M. Denber. Automatic Resolution of Anaphora in English. Eastman Kodak Co., Imaging Science Division, 1998.
S. Lappin and H. J. Leass. “An Algorithm for Pronominal Anaphora Resolution.” Computational Linguistics. 1994. 20.
D. J. Litman. “Cue Phrase Classification Using Machine Learning.” Journal of Artificial Intelligence Research. 1996. 5: 53-94.
A. Mikheev. LT_CHUNK V 2.1. Language Technology Group, University of Edinburgh, UK, 1996.
C. D. Paice and G. D. Husk. “Towards the Automatic Recognition of Anaphoric Features in English Text: The Impersonal Pronoun 'It'.” Computer Speech and Language. Academic Press, 1987. 2: 109-32.
G. Sampson. English for the Computer: The SUSANNE Corpus and Analytic Scheme. Oxford University Press, 1995.
P. Tapanainen and T. Jarvinen. “A Non-Projective Dependency Parser.” The Proceedings of The 5th Conference of Applied Natural Language Processing. ACL, 1997. 64-71.