“Word sense disambiguation using cross-lingual
inform”
Nancy Ide
Department of Computer Science, Vassar College
Ide@cs.vassar.edu
It is well known that the most nagging issue for word sense disambiguation (WSD)
is the definition of just what a word sense is. At its base, the problem is a
philosophical and linguistic one that is far from being resolved. However, work
in automated language processing has led to efforts to find practical means to
distinguish word senses, at least to the degree that they are useful for natural
language processing tasks such as summarization, document retrieval, and machine
translation. Several criteria have been suggested and exploited to automatically
determine the sense of a word in context, including syntactic behavior, semantic
and pragmatic knowledge, and especially in more recent empirical studies, word
co-occurrence within syntactic relations (e.g., Hearst, 1991; Yarowsky, 1993),
words co-occurring in global context (e.g., Gale et al., 1993; Yarowsky, 1992;
Schütze, 1992, 1993), etc. No clear criteria have emerged, however, and the
problem continues to loom large for WSD work.
The notion that cross-lingual comparison can be useful for sense disambiguation
has served as a basis for some recent work on WSD. For example, Brown et al.
(1991) and Gale et al. (1992a, 1993) used the parallel, aligned Hansard Corpus
of Canadian Parliamentary debates for WSD, and Dagan et al. (1991) and Dagan and
Itai (1994) used monolingual corpora of Hebrew and German and a bilingual
dictionary. These studies rely on the assumption that the mapping between words
and word senses varies significantly among languages. For example, the word duty
in English translates into French as devoir in its obligation sense, and impôt
in its tax sense. By determining the translation equivalent of duty in a
parallel French text, the correct sense of the English word is identified. These
studies exploit this information in order to gather co-occurrence data for the
different senses, which is then used to disambiguate new texts. In related work,
Dyvik (1998) used patterns of translational relations in an English-Norwegian
parallel corpus (ENPC, Oslo University) to define semantic properties such as
synonymy, ambiguity, vagueness, and semantic fields and suggested a derivation
of semantic representations for signs (e.g., lexemes), capturing semantic
relationships such as hyponymy etc., from such translational relations.
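The translation-based labeling described above (the duty example) can be sketched in a few lines. This is only an illustrative sketch, not the procedure of any of the cited studies: the function names, the toy sentence pairs, and the assumption that the parallel text is already sentence-aligned and tokenized are all hypothetical. Only the devoir/impôt mapping comes from the text.

```python
from collections import Counter, defaultdict

# Mapping from French translation equivalents of "duty" to sense labels.
# The two entries come from the example in the text.
TRANSLATION_TO_SENSE = {
    "devoir": "obligation",
    "impôt": "tax",
}

def label_sense(target_tokens, translation_to_sense=TRANSLATION_TO_SENSE):
    """Return the sense signalled by the aligned sentence, or None if unclear."""
    found = {translation_to_sense[t] for t in target_tokens if t in translation_to_sense}
    # Only accept the label when exactly one sense is signalled.
    return found.pop() if len(found) == 1 else None

def collect_cooccurrences(sentence_pairs, target_word="duty"):
    """Count the English context words seen with each translation-derived sense."""
    counts = defaultdict(Counter)
    for en_tokens, fr_tokens in sentence_pairs:
        if target_word not in en_tokens:
            continue
        sense = label_sense(fr_tokens)
        if sense is None:
            continue
        counts[sense].update(t for t in en_tokens if t != target_word)
    return counts

# Toy aligned pairs invented for illustration (not taken from any corpus).
TOY_PAIRS = [
    (["he", "did", "his", "duty"], ["il", "a", "fait", "son", "devoir"]),
    (["the", "duty", "on", "wine", "rose"], ["l'", "impôt", "sur", "le", "vin", "a", "augmenté"]),
]

counts = collect_cooccurrences(TOY_PAIRS)
```

The per-sense counts in `counts` stand in for the co-occurrence data that such studies then use to disambiguate new, monolingual text.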
Recently, Resnik and Yarowsky (1997) suggested that for the purposes of WSD, the
different senses of a word could be determined by considering only sense
distinctions that are lexicalized cross-linguistically. In particular, they
propose that some set of target languages be identified, and that the sense
distinctions to be considered for language processing applications and
evaluation be restricted to those that are realized lexically in some minimum
subset of those languages. This idea would seem to provide an answer, at least
in part, to the problem of determining different senses of a word: intuitively,
one assumes that if another language lexicalizes a word in two or more ways,
there must be a conceptual motivation. If we look at enough languages, we would
be likely to find the significant lexical differences that delimit different
senses of a word.
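One possible reading of this criterion can be made concrete as follows. The sketch assumes that a language "lexicalizes" a distinction when the sets of words it uses for the two senses are disjoint, and that a distinction is kept when at least `min_languages` languages do so. Both assumptions, the function names, and the placeholder language "LangX" are hypothetical; Resnik and Yarowsky do not prescribe this particular test.

```python
def lexicalizing_languages(translations, sense_a, sense_b):
    """Return the languages whose words for the two senses do not overlap."""
    result = []
    for lang, by_sense in translations.items():
        words_a = by_sense.get(sense_a, set())
        words_b = by_sense.get(sense_b, set())
        # Require evidence for both senses and no shared word form.
        if words_a and words_b and not (words_a & words_b):
            result.append(lang)
    return sorted(result)

def is_distinct(translations, sense_a, sense_b, min_languages):
    """Keep a sense distinction if enough languages lexicalize it."""
    return len(lexicalizing_languages(translations, sense_a, sense_b)) >= min_languages

# French entries come from the duty example in the text. "LangX" is a made-up
# placeholder language in which both senses share the word form "x1".
DUTY = {
    "French": {"obligation": {"devoir"}, "tax": {"impôt"}},
    "LangX": {"obligation": {"x1"}, "tax": {"x1"}},
}

supporting = lexicalizing_languages(DUTY, "obligation", "tax")
```

With this toy data only French supports the distinction, so the outcome depends entirely on the chosen threshold, which is exactly the kind of parameter the criterion leaves open.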
However, this suggestion raises several questions. For instance, it is well known
that many ambiguities are preserved across languages (for example, the French
intérêt and the English interest), especially languages that are relatively
closely related, such as English and French. Assuming this problem can be
overcome, should differences found in closely related languages be given lesser
(or greater) weight than those found in more distantly related languages? More
generally, which languages should be considered for this exercise? All
languages? Closely related languages? Languages from different language
families? A mixture of the two? How many languages would be "enough" to provide
adequate information for this purpose?
There is also the question of the criteria that would be used to establish that a
sense distinction is "lexicalized cross-linguistically". How consistent must the
distinction be? Does it mean that two concepts are expressed by mutually
non-interchangeable lexical items in some significant number of other languages,
or need it only be the case that the option of a different lexicalization exists
in a certain percentage of cases?
This paper attempts to provide some preliminary answers to these questions, in
order to eventually determine whether parallel data is a viable means of
determining sense distinctions and, if so, the ways in which this
information might be used. Given the lack of large parallel texts across
multiple languages, the study is necessarily limited; however, close examination
of a small sample of parallel data can, as a first step, provide the basis and
direction for more extensive studies. I have used parallel, aligned versions of
George Orwell's Nineteen Eighty-Four (Erjavec and Ide, 1998) in five languages:
English, Slovene, Estonian, Romanian, and Czech. The study therefore involves
languages from four language families (Germanic, Slavic, Finno-Ugric, and
Romance), as well as two languages
from the same family (Czech and Slovene). Four ambiguous English words were
considered in this study: hard, line, head, and country. Line and hard were
chosen because they have served in various WSD studies to date (e.g., Leacock et
al., 1993) and a corpus of occurrences of these words from the Wall Street
Journal corpus was generously made available for comparison. Serve, another word
frequently used in these studies, did not appear frequently enough in the Orwell
text to be considered, nor did any other suitable ambiguous verb. Country and
head were chosen as substitutes because they appeared frequently enough for
consideration.
All sentences containing an occurrence or occurrences (including morphological
variants) of each of the four words were extracted from the English text,
together with the parallel sentences in which they occur in the texts of the
four comparison languages (Czech, Estonian, Romanian, Slovene). The English
occurrences were grouped into senses, using the relatively coarse sense
distinctions in the Oxford Advanced Learner's Dictionary (OALD) (used to provide
sense distinctions in WordNet [Miller et al., 1990; Fellbaum, forthcoming]). The
sense categorization was performed by
the author and two student assistants; results from the three were compared and
a final, mutually agreeable grouping was established.
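The extraction step described above can be sketched as follows. This is a minimal illustration, not the tooling actually used for the study: the variant lists, the dictionary layout of an aligned unit, the language codes, and the toy sentences (which are invented and not taken from the Orwell text) are all assumptions.

```python
import re

# Hypothetical lists of morphological variants for the four target words.
VARIANTS = {
    "hard": ["hard", "harder", "hardest"],
    "line": ["line", "lines"],
    "head": ["head", "heads"],
    "country": ["country", "countries"],
}

def extract_occurrences(aligned_units, variants=VARIANTS, source_lang="en"):
    """Group aligned sentence units by the target word found in the English side."""
    patterns = {
        lemma: re.compile(
            r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b",
            re.IGNORECASE,
        )
        for lemma, forms in variants.items()
    }
    hits = {lemma: [] for lemma in variants}
    for unit in aligned_units:
        for lemma, pattern in patterns.items():
            # Whole-word match, so "hardly" or "headed" is not counted.
            if pattern.search(unit[source_lang]):
                hits[lemma].append(unit)
    return hits

# Toy aligned units; the non-English sides are placeholders, not real translations.
TOY_UNITS = [
    {"en": "The lines were long.", "cs": "<cs 1>", "et": "<et 1>", "ro": "<ro 1>", "sl": "<sl 1>"},
    {"en": "It was a hard winter in that country.", "cs": "<cs 2>", "et": "<et 2>", "ro": "<ro 2>", "sl": "<sl 2>"},
    {"en": "He could hardly speak.", "cs": "<cs 3>", "et": "<et 3>", "ro": "<ro 3>", "sl": "<sl 3>"},
]

hits = extract_occurrences(TOY_UNITS)
```

Each extracted unit keeps all of its parallel sentences together, so the later manual grouping into senses can be carried over to every comparison language at once.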
For each of the four comparison languages, the corpus of sense-grouped parallel
sentences for English and that language was sent to a linguist and native
speaker of the comparison language. The linguists were asked to provide the
following information for each word occurrence:
1. Provide the lexical item in each parallel sentence that corresponds to the ambiguous English word. If it is inflected, provide both the inflected form and the root form.
2. Is the translation one-to-one (i.e., the English word is translated by a single word in your language)? If not, please provide the phrase (or other means) by which it is translated, or indicate that it is not lexicalized.
3. Are there obvious synonyms for the word in your language that could have been used instead of the one chosen? Are they better or worse as translations?
4. If a given word in any one of its senses is translated using different words in your language (for example, if a word in the "not soft" sense of "hard" is translated differently in different sentences), please indicate why this difference may exist. For example, is it due to the use of a more general term (hypernym)? A more specific word (hyponym)? A different sense?
5. Are any of the translations of one of the ambiguous words themselves ambiguous in your language? Is the ambiguity the same as in the English? If not, is the word ambiguous among different meanings than those for which it is ambiguous in English? If so, what are its other meanings?
References
Ido Dagan and Alon Itai. “Word sense disambiguation using a second language monolingual corpus.” Computational Linguistics. 1994. 20: 563-596.
Ido Dagan, Alon Itai, and Ulrike Schwall. “Two languages are more informative than one.” Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, 18-21 June 1991, Berkeley, California. 1991. 130-137.
Helge Dyvik. “Translations as Semantic Mirrors.” Proceedings of Workshop W13: Multilinguality in the lexicon II, The 13th Biennial European Conference on Artificial Intelligence (ECAI 98), Brighton, UK. 1998. 24-44.
Tomaz Erjavec and Nancy Ide. “The MULTEXT-EAST Corpus.” Proceedings of the First International Conference on Language Resources and Evaluation, 27-30 May 1998, Granada. 1998. 971-74.
WordNet: An Electronic Lexical Database. Ed. Christiane Fellbaum. Cambridge, Massachusetts: MIT Press, forthcoming.
William A. Gale, Kenneth W. Church, and David Yarowsky. “A method for disambiguating word senses in a large corpus.” Computers and the Humanities. 1993. 26: 415-439.
Marti A. Hearst. “Noun homograph disambiguation using local context in large corpora.” Proceedings of the 7th Annual Conference of the University of Waterloo Centre for the New OED and Text Research, Oxford, United Kingdom. 1991. 1-19.
Claudia Leacock, Geoffrey Towell, and Ellen Voorhees. “Corpus-based statistical sense resolution.” Proceedings of the ARPA Human Language Technology Workshop. San Francisco: Morgan Kaufmann, 1993.
I. Dan Melamed. “Measuring Semantic Entropy.” ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How? April 4-5, 1997, Washington, D.C. 1997. 41-46.
George A. Miller, Richard T. Beckwith, Christine D. Fellbaum, Derek Gross, and Katherine J. Miller. “WordNet: An on-line lexical database.” International Journal of Lexicography. 1990. 3: 235-244.
Philip Resnik and David Yarowsky. “A perspective on word sense disambiguation methods and their evaluation.” ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How? April 4-5, 1997, Washington, D.C. 1997. 79-86.
Hinrich Schütze. “Dimensions of meaning.” Proceedings of Supercomputing '92. Los Alamitos, California: IEEE Computer Society Press, 1992. 787-796.
Hinrich Schütze. “Word space.” Advances in Neural Information Processing Systems. Ed. Stephen J. Hanson, Jack D. Cowan, and C. Lee Giles. San Mateo, California: Morgan Kaufmann, 1993. 895-902.
David Yarowsky. “Word sense disambiguation using statistical models of Roget's categories trained on large corpora.” Proceedings of the 14th International Conference on Computational Linguistics, COLING'92, 23-28 August, Nantes, France. 1992. 454-460.
David Yarowsky. “One sense per collocation.” Proceedings of the ARPA Human Language Technology Workshop, Princeton, New Jersey. 1993. 266-271.