Digital Humanities Abstracts

“Corpus Methods for Interlingual Machine Translation”
Michelle Vanni, Georgetown University, U.S. Dept. of Defense

It is generally recognized that corpus analysis has become a fundamental element in the design of natural language processing (NLP) systems: “Efforts in the development of NLP and [information technology] are converging on the recognition of the importance of some sort of corpus-based research as part of the infrastructure for the development of advanced language processing applications” (Atkins, Clear and Ostler 1992:1). To be effective, NLP systems must handle not only those linguistic structures in text which are predictable from explanatory models but also those which are idiosyncratic, occur less frequently, and derive their meaning from convention rather than composition. Corpus studies provide evidence of such usages. Through close examination of categories of linguistic phenomena, they also offer insight into new generalizations not considered by rational theorists. There is a noteworthy congruence between findings in corpus analysis studies and those in machine translation (MT) research regarding the actual coverage of theoretical models which treat syntax and semantics as independent. In support of the suggestion that these two levels are instead interdependent, Sinclair (1991) states that a certain structure may be appropriate only for a particular sense of a word and that, conversely, one word sense may have associated with it only a finite set of common syntactic patterns. Lexical studies in support of interlingual MT make a similar point.
It has been recognized (Levin and Nirenburg 1991, 1993, 1994a, 1994b) that an interlingual MT lexicon must contain two levels of representation in order to account adequately for the meaning of the conventional linguistic expressions which have come to be known as constructions (Fillmore 1988, Goldberg 1994): one which indicates semantic properties from which syntactic behavior can be predicted (B. Levin 1993) and one which expresses meaning as a set of relationships to concepts as defined in a structured model of a particular semantic domain (Goodman and Nirenburg 1992). While the MT research uses cross-linguistic data to argue that neither level alone provides sufficient representation, monolingual data from on-line corpora can be shown to support a similar conclusion: models of processing developed from rational theories account for only a small percentage of what actually occurs in language, and further research on patterns of actual language use is required in order to derive effective grammars which handle the majority of linguistic phenomena occurring in text.
In this paper, we use corpus methods to explore approaches to the analysis of Italian verbs in related semantic fields and of the lexical variation associated with three of a particular verb's morphological forms. Hypotheses regarding the complementary argument structure of frequently occurring verbs in the domains of sensation, cognition and emotion will be tested, and variation among the structures in which present, imperfect and preterit forms appear will be examined for the changes in semantic interpretation with which they may be associated. Based on preliminary findings, an interlingual structure will be proposed to account for these domains and forms.


