“A Corpus-Based Methodology for Identifying Non-nominal
"It": Rule-Based and Machine Learning Approaches”
Richard Evans
University of Wolverhampton, UK
In this paper, seven uses of "it" are identified in English. These uses involve
noun phrase anaphora (e.g. "Do not sweep the dust when dry, you will only
recirculate IT."), verb phrase anaphora (e.g. "Raising money for your favourite
charity can be fun. You can do IT on your own..."), reference to clauses (e.g.
"Not every city would be suited to this approach, IT must be admitted."),
reference to entire discourse segments (e.g. "Always use a tool for the job it
was intended to do. Always use tools correctly. If IT feels very awkward,
stop."), cataphoric reference to entities (e.g. "When IT fell, the glass
broke."), pleonastic uses in which the pronoun has no reference and is used only
due to a requirement of the grammar (e.g. "IT was {raining, 4 o'clock, All
Saints' day, etc.}", "IT's recommended that...", "IT's easier to..."), and uses
in idiomatic constructions (e.g. "I take IT you're going now."). Due to the
absence of a suitable term in the literature, the term "non-nominal it" is used
to identify all the cases in which "it" is not in an anaphoric relationship with
a noun phrase in the text. Numerous researchers have so far proposed
hand-crafted rule-based pattern matching techniques to identify pleonastic "it".
These methods have the drawback that they require recognition of potentially
large and open-ended lists of trigger words and complex expressions in order to
succeed. The goal here was to compare a rule-based method with a method devised
to use machine learning to make the identification. It was hoped that
information such as the position of the pronoun and its complex relation to the
surrounding syntactic context would contribute to the accuracy of the
identification. We implemented both methods. Corpora were constructed and
annotated, then used to evaluate and compare the accuracy of the two programs.
The literature makes it easy to infer the importance of recognising non-nominal
uses of "it" in the fields of anaphora resolution, information retrieval,
machine translation and text summarisation. The task matters all the more given
that almost one third of the uses of "it" (1025 of 3171) in our corpus of
randomly selected texts were non-nominal.
In the full paper, the treatment of pleonastic "it" in surveys of English usage
is reviewed, as is work by Paice & Husk (1987), Lappin & Leass (1994)
and Denber (1998) on methods for automatic recognition of pleonastic "it". The
application of machine learning to a different problem in linguistics is
described in the review of Litman (1996) on the automatic classification of cue
phrases. One of the methods in the present paper applies machine learning to the
automatic identification of non-nominal "it".
A novel resource was required for this corpus-based research. A corpus was
therefore constructed using 77 randomly selected texts from the BNC and
stripped-down versions of the Susanne corpus. We implemented a software tool
with which a human annotator adds SGML mark-up to instances of "it" in the
corpus. Non-nominal uses of "it" are marked <PLEO ID="XX">it</PLEO>, whereas
other instances are left unmarked. On completion, the corpus contained 368,830
words and 3171 occurrences of "it", of which 1025 were non-nominal. A DTD was
defined for the annotated corpus and the SGML-aware LT-Chunker (Mikheev 1996)
was used to tokenise the corpus while preserving the prior mark-up. The
tokenised file was then processed by a Perl
program written to report the paragraph, sentence and word positions of the
non-nominal instances of "it". This information was written to a data file and
used to evaluate the methods implemented and described later.
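
The position-reporting step can be illustrated with a minimal Python sketch. It
is not the Perl program used in this work. It assumes, purely for illustration,
that paragraphs are separated by blank lines, that each line holds one sentence,
and that tokens are separated by whitespace; the function name it_positions and
the "@@" sentinel are hypothetical.

```python
import re

# Matches the annotation described above, e.g. <PLEO ID="12">it</PLEO>
PLEO_RE = re.compile(r'<PLEO\s+ID="[^"]*">(it)</PLEO>', re.IGNORECASE)
SENTINEL = "@@"            # hypothetical marker glued to annotated pronouns
PUNCTUATION = ".,;:!?\"()"


def it_positions(annotated):
    """Return (all_it, non_nominal): lists of 1-based
    (paragraph, sentence, word) positions of "it" in the annotated text.

    Assumed layout (not taken from the paper): blank line between
    paragraphs, one sentence per line, whitespace-separated tokens.
    """
    all_it, non_nominal = [], []
    paragraphs = [p for p in annotated.split("\n\n") if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs, start=1):
        sentences = [s for s in paragraph.split("\n") if s.strip()]
        for s_idx, sentence in enumerate(sentences, start=1):
            # Collapse each tagged pronoun into one token so that
            # whitespace tokenisation keeps word positions aligned.
            flat = PLEO_RE.sub(lambda m: SENTINEL + m.group(1), sentence)
            for w_idx, token in enumerate(flat.split(), start=1):
                word = token.strip(PUNCTUATION)
                marked = word.startswith(SENTINEL)
                if marked:
                    word = word[len(SENTINEL):]
                # Treat contracted forms such as "it's" as "it"
                if word.split("'")[0].lower() == "it":
                    position = (p_idx, s_idx, w_idx)
                    all_it.append(position)
                    if marked:
                        non_nominal.append(position)
    return all_it, non_nominal
```

The second list plays the role of the data file against which the classifiers
are later evaluated.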
We implemented a program based on Paice & Husk's (1987) method for
recognition of pleonastic "it". In the first step, a plain text version of the
corpus was tagged using Tapanainen & Jarvinen's (1997) SGML-blind
FDG-Parser. The output from the tagger was converted to an SGML format by our
software and then processed by our program based on Paice & Husk's pattern
recognition method. In this way a classification was assigned to each instance
of "it". Evaluation was performed by comparing the output of the program with
the contents of the data file produced earlier.
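
To give a flavour of pattern-based recognition, the following Python sketch
applies a single toy pattern: "it", followed by a form of "be", followed within
a few words by a trigger word. It does not reproduce Paice & Husk's rules and
works on raw tokens rather than parser output. The trigger lists, the window
size MAX_GAP and all function names are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny trigger lists (not from Paice & Husk)
BE_FORMS = {"is", "was", "'s", "be", "been"}
WEATHER_TIME = {"raining", "snowing", "late", "early"}
DELIMITERS = {"that", "to", "whether"}
MAX_GAP = 4  # assumed window, in tokens, after the form of "be"

# Words, clitics such as 's, and single punctuation characters
TOKEN_RE = re.compile(r"[A-Za-z]+|'[A-Za-z]+|[^\sA-Za-z]")


def is_pleonastic(tokens, i):
    """Guess whether tokens[i] (an "it") is pleonastic.

    Pattern: it + form of "be" + trigger word within MAX_GAP tokens.
    """
    rest = [t.lower() for t in tokens[i + 1:]]
    if not rest or rest[0] not in BE_FORMS:
        return False
    window = rest[1:1 + MAX_GAP]
    return any(w in WEATHER_TIME or w in DELIMITERS for w in window)


def classify(text):
    """Return (token index, is_pleonastic) for every "it" in text."""
    tokens = TOKEN_RE.findall(text)
    return [(i, is_pleonastic(tokens, i))
            for i, tok in enumerate(tokens) if tok.lower() == "it"]
```

For instance, classify("It's recommended that you wear gloves.") flags the
pronoun, whereas classify("You will only recirculate it.") does not. Covering
further cases means adding further trigger words, which is the open-ended list
problem noted for hand-crafted approaches.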
A machine learning approach was also implemented. It exploits Daelemans' (1999)
TiMBL memory-based learning method. TiMBL works from a training file of
feature-value vectors, each of which has been given a classification: either
non-nominal or not. The training file was constructed by processing a plain text
version of the annotated corpus with the FDG-Parser and the SGML conversion
program. The SGML file was input to a program that described each instance of
"it" as a vector of feature values. The features used in our approach were
designed to describe the position of non-nominal instances, the lemmas of
significant "following" words such as verbs and adjectives, as well as the
relation of "it" to other structures in the text, such as prepositions and noun
phrases. A thorough description appears in the full paper. Each vector was
labelled by looking up its instance in the data file constructed earlier. The
resulting set of labelled vectors was then used as the training file. TiMBL
classifies query vectors according to their similarity to the examples in the
training file. Ten-fold cross-validation was used to estimate the average
accuracy of the technique over our corpus.
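
The idea of classifying by similarity to stored examples, and of estimating
accuracy by cross-validation, can be sketched in Python as follows. This is not
TiMBL and does not use its file format: it is a plain nearest-neighbour
classifier with an unweighted overlap distance (the number of mismatching
feature values), and it omits the feature weighting that TiMBL supports. The
function names and the fold assignment (every tenth example) are assumptions.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of positions at which two feature-value vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(train, query, k=1):
    """Label `query` by majority vote among its k nearest stored examples.

    `train` is a list of (vector, label) pairs. Ties are broken by the
    order of the training examples, because sorted() is stable.
    """
    nearest = sorted(train, key=lambda ex: overlap_distance(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(examples, folds=10, k=1):
    """Average accuracy over `folds` train/test splits of `examples`."""
    accuracies = []
    for f in range(folds):
        # Assumed split: every `folds`-th example goes into test fold f
        test = examples[f::folds]
        train = [ex for i, ex in enumerate(examples) if i % folds != f]
        if not test or not train:
            continue
        correct = sum(classify(train, vec, k) == label for vec, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

A usage example with four invented vectors, whose hypothetical features are the
position of "it" in the sentence, the lemma of the following verb, the lemma of
the following adjective and a preceding preposition:

```python
train = [
    (("1", "be", "easy", "-"), "non-nominal"),
    (("1", "rain", "-", "-"), "non-nominal"),
    (("5", "-", "-", "with"), "nominal"),
    (("4", "break", "-", "-"), "nominal"),
]
print(classify(train, ("1", "be", "hard", "-")))
```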
The automatic evaluation showed no major difference in accuracy between the two
methods. The method based on work by Paice & Husk was slightly more accurate on
this corpus (78.81% vs. 78.68%) but had a stronger tendency to misclassify
instances as non-nominal (false positives: 265 vs. 243). Where false positives
are undesirable to the user, the machine learning approach is therefore
preferable.
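
The two figures compared here, accuracy and the number of false positives, can
be computed from position sets along the following lines. This is a minimal
Python sketch, not the evaluation program used in this work; the function name
evaluate and the representation of instances as position tuples are
assumptions.

```python
def evaluate(all_it, gold_non_nominal, predicted_non_nominal):
    """Compare predicted non-nominal instances with the annotated ones.

    Each argument is a collection of hashable instance identifiers,
    e.g. (paragraph, sentence, word) tuples; `all_it` covers every
    instance of "it" and must be non-empty.
    """
    everything = set(all_it)
    gold = set(gold_non_nominal)
    predicted = set(predicted_non_nominal)
    true_pos = len(gold & predicted)
    false_pos = len(predicted - gold)   # nominal "it" labelled non-nominal
    false_neg = len(gold - predicted)   # non-nominal "it" that was missed
    true_neg = len(everything) - true_pos - false_pos - false_neg
    return {
        "accuracy": (true_pos + true_neg) / len(everything),
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }
```

As a point of comparison that follows from the corpus counts alone, labelling
every one of the 3171 instances as nominal would be correct for 3171 - 1025 =
2146 of them, an accuracy of about 67.7%.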
In further experiments, the classification of instances in the training set was
extended to a 7-ary system, in accordance with the uses given in the
introduction. This brought some improvement in the binary distinction between
nominal and non-nominal uses: classification accuracy rose from 78.68% to
78.74% and the number of false positive classifications fell from 243 to 209.
Predictably, the detection rate for each of the different types of usage was
low (50.35% on average).
Given that TiMBL relies on a training file, it will also be beneficial to
extend our resource in size as well as in information content. The present
file, with 3171 instances, cannot be considered large enough. The
availability of a suitable resource for evaluation is also important for the
application of optimisation techniques. Non-nominal pronouns also appear
in languages other than English, and it would be valuable to build resources
for those languages in order to explore machine-learning-based methods of
identifying them.
References
L. Burnard. Users Reference Guide British National Corpus Version 1.0. Oxford University Computing Services, UK, 1995.
W. Daelemans. TiMBL: Tilburg Memory Based Learner version 2 Reference Guide. ILK Technical Report ILK 99-01. Tilburg University, The Netherlands, 1999.
M. Denber. Automatic Resolution of Anaphora in English. Eastman Kodak Co., Imaging Science Division, 1998.
S. Lappin and H. J. Leass. “An Algorithm for Pronominal Anaphora Resolution.” Computational Linguistics. 1994. 20.
D. J. Litman. “Cue Phrase Classification Using Machine Learning.” Journal of Artificial Intelligence Research. 1996. 5: 53-94.
A. Mikheev. LT_CHUNK V 2.1. Language Technology Group, University of Edinburgh, UK, 1996.
C. D. Paice and G. D. Husk. “Towards the Automatic Recognition of Anaphoric Features in English Text: The Impersonal Pronoun 'It'.” Computer Speech and Language. Academic Press, 1987. 2: 109-32.
G. Sampson. English for the Computer: The SUSANNE Corpus and Analytic Scheme. Oxford University Press, 1995.
P. Tapanainen and T. Jarvinen. “A Non-Projective Dependency Parser.” The Proceedings of The 5th Conference of Applied Natural Language Processing. ACL, 1997. 64-71.