“Korean Analysis and Transfer in Multilingual Machine
Translation System”
Sung-Kwon Choi
Systems Engineering Research Institute
skchoi@seri.re.kr
Tae-Wan Kim
Systems Engineering Research Institute
twkim@seri.re.kr
Soo-Hyun Lee
Systems Engineering Research Institute
shlee@seri.re.kr
Dong-In Park
Systems Engineering Research Institute
dipark@seri.re.kr
Abstract
Multilingual machine translation means translation among more than two languages. Existing multilingual machine translation systems can be classified as transfer-based or interlingua-based. In the former, the analysis and generation rules are written separately for each language, so that the commonness of languages is ignored and the total memory requirement grows. The latter have difficulty implementing a universal linguistic model that is applicable to many languages. To overcome the shortcomings of these existing systems, this paper describes a multilingual MT system built on common rules, which capture the commonness of languages and can be shared by many languages.
1 Introduction
The analysis and generation rules in the existing transfer-based multilingual machine translation systems (SYSTRAN, EUROTRA, METAL, LOGOS, GETA etc.) are independent of one another and differ according to the target language [Hutchins 1992]. In other words, these systems do not acknowledge the commonness of languages. As a result they take the form of a bundle of bilingual MT systems, which increases the size of the system. There are also transfer-based multilingual machine translation systems that use an interlingual method to reduce the number of transfer processes (CETA, SALAT, DLT, KANT etc.); however, they face the difficult problem of completing a universal linguistic model [Lewis 1992].
From this point of view, this paper describes a new multilingual machine translation method based on common rules and constraint rules, which overcomes the problems of the existing systems. Common rules are rules that are shared by two or more languages. Because the common rules are loaded into memory only once, they reduce the memory space, increase the consistency of grammatical information, and standardize the information structure of the lexicon. They have a further merit for MT: when a new language is added to the existing system and translated into the existing languages, new grammar modules can be created easily by combining common rules. Constraint rules are rules that control the linguistic characteristics of individual languages.
This paper consists of three parts. Chapter 2 introduces the construction of the whole system. Chapter 3 describes the modules consisting of common rules, that is, the common grammatical rules, the common lexicon information structure, the common structural transfer rules, and the common information transfer rules. Chapter 4 explains the analysis and transfer of Korean through the parameterized common rules and the constraint rules.
2 System construction
Figure 1 shows the construction of the multilingual machine translation system based on common rules and constraint rules.
The middle field of Figure 1 represents the common module. Each 'rn' is a file of common rules belonging to the common module. These files are called by the grammar modules of the individual languages and, together with the constraint rules for a language, constitute the grammar rules of that language. For example, Korean, Japanese, English and German in Figure 1 have the rule file r3 in common, but Korean and Japanese additionally share the rule files r2 and r4, because they are typologically more similar to each other than to English and German.
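The sharing pattern described for Figure 1 can be sketched as follows. This is only an illustration under stated assumptions: the class `GrammarModule`, the dictionary `COMMON_MODULE` and the placeholder rule strings are hypothetical, and only the assignment of r3 to all four languages and of r2 and r4 to Korean and Japanese comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Placeholder contents: the text names the files r2, r3 and r4 but does not
# list the rules they contain. Each file is stored (loaded) only once.
COMMON_MODULE: Dict[str, List[str]] = {
    "r2": ["<common rules of file r2>"],
    "r3": ["<common rules of file r3>"],
    "r4": ["<common rules of file r4>"],
}


@dataclass
class GrammarModule:
    """Grammar of one language: shared rule files plus its own constraint rules."""
    language: str
    rule_files: List[str]
    constraint_rules: List[str] = field(default_factory=list)

    def grammar(self, common_module: Dict[str, List[str]]) -> List[str]:
        rules: List[str] = []
        for name in self.rule_files:
            rules.extend(common_module[name])  # reuse the shared file
        return rules + self.constraint_rules


# Files used only by English and German are omitted because the text
# does not name them.
MODULES = {
    "Korean": GrammarModule("Korean", ["r2", "r3", "r4"], ["<Korean constraint rules>"]),
    "Japanese": GrammarModule("Japanese", ["r2", "r3", "r4"], ["<Japanese constraint rules>"]),
    "English": GrammarModule("English", ["r3"], ["<English constraint rules>"]),
    "German": GrammarModule("German", ["r3"], ["<German constraint rules>"]),
}


def shared_files(a: str, b: str) -> List[str]:
    """Rule files that two languages have in common."""
    return sorted(set(MODULES[a].rule_files) & set(MODULES[b].rule_files))
```

In this sketch a new language would be added by listing the existing rule files it can reuse and writing only its constraint rules.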
3 Common rules
This chapter shows the construction of the common rules. The common rules for analysis consist of the common grammar rules and the common lexicon information structure; those for transfer consist of the common structural transfer rules and the common information transfer rules.
3.1 Common grammatical rules
To handle many languages in a multilingual machine translation system, the common grammatical rules should explain the linguistic phenomena of as many languages as possible. To explain the linguistic phenomena of configurational languages (e.g. English) as well as nonconfigurational languages (e.g. Korean, Japanese, German), whose word order is relatively free, we have made new grammar rules that combine X-bar syntactic theory [Jackendoff 1977] and HPSG [Pollard 1994]. The new grammar uses binary structures, except for coordination, which uses a ternary structure.
Table 1. Common grammar rules
| | head-final-structure | head-first-structure | head-middle-structure |
|---|---|---|---|
| 1 | PRED => ARG PRED | PRED => PRED ARG | COORD => ARG1 COORD ARG2 |
| 2 | MODED => MOD MODED | MODED => MODED MOD | |
| 3 | FUNCT => ARG FUNCT | FUNCT => FUNCT ARG | |
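One way to picture Table 1 is as a rule inventory indexed by head position, from which a language picks the columns it needs. The sketch below is an assumption about representation, not the system's actual data format; the names `COMMON_GRAMMAR_RULES`, `select_rules` and `head_position` are hypothetical.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (mother, daughters)

# The three columns of Table 1, keyed by the position of the head daughter.
COMMON_GRAMMAR_RULES: Dict[str, List[Rule]] = {
    "head-final": [
        ("PRED", ("ARG", "PRED")),
        ("MODED", ("MOD", "MODED")),
        ("FUNCT", ("ARG", "FUNCT")),
    ],
    "head-first": [
        ("PRED", ("PRED", "ARG")),
        ("MODED", ("MODED", "MOD")),
        ("FUNCT", ("FUNCT", "ARG")),
    ],
    "head-middle": [
        ("COORD", ("ARG1", "COORD", "ARG2")),
    ],
}


def select_rules(parameters: List[str]) -> List[Rule]:
    """Collect the common rules for the head-position parameters of a language."""
    rules: List[Rule] = []
    for parameter in parameters:
        rules.extend(COMMON_GRAMMAR_RULES[parameter])
    return rules


def head_position(rule: Rule) -> int:
    """Index of the head daughter, i.e. the daughter labelled like the mother."""
    mother, daughters = rule
    return daughters.index(mother)
```

For a head-final language with ternary coordination, `select_rules(["head-final", "head-middle"])` returns the three binary head-final rules plus the coordination rule.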
3.2 Common lexicon information structure
We need a lexicon information structure in order to enter, manage and correct the lexicon information of the multilingual machine translation system consistently. It is desirable to build not a flat but a multi-level structure, so that the lexicon information structure can represent the relevant linguistic information and groups of information can be passed on together. From this point of view I have selected the feature structure as the multilingual lexicon information structure and made the attributes the same across languages. Appendix 2 shows an example of the multilingual lexicon information structure.
3.3 Common structural transfer rules
There is also a part of the transfer process that many languages can share. It is the compositional transfer, which copies a node of the source language to a node of the target language if the analysis structure of the former and the generation structure of the latter are the same. In order to transfer structures that differ between languages compositionally, our multilingual machine translation system deletes the functional words and then transforms the syntactic nodes into 'predicate-argument-modifier' nodes. The noncompositional structural transfer rules, which cannot be covered by the common structural transfer rules, are recorded in the transfer lexicon because they depend on individual lexemes. The transfer rules are applied in a priority order: first the noncompositional structural transfer rules, second the common structural transfer rules, and last the lexical transfer rules in the lexicon. The following rule is the common structural transfer rule:
(1) common_structural_transfer_rule = {}.[+node] <=> {}.[+node].
Rule (1) says that all compositionally transferable trees, that is, '+node', are transferred unchanged from the source language to the target language.
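The priority order of the transfer rules can be illustrated with a small sketch. It is not the CAT2 implementation: the function `transfer`, the dictionary-based tree format and the node labels are assumptions, and the lexemes are given in illustrative citation forms.

```python
from typing import Callable, Dict


def transfer(node: dict,
             noncompositional_rules: Dict[str, Callable[[dict], dict]],
             transfer_lexicon: Dict[str, str]) -> dict:
    """Transfer one 'predicate-argument-modifier' tree to the target language."""
    # 1. Noncompositional structural transfer rules are lexeme-dependent
    #    and take priority over everything else.
    rule = noncompositional_rules.get(node["lex"])
    if rule is not None:
        return rule(node)
    # 2. Common structural transfer: copy the node and its daughters unchanged.
    target = {
        "label": node["label"],
        "lex": node["lex"],
        "children": [
            transfer(child, noncompositional_rules, transfer_lexicon)
            for child in node["children"]
        ],
    }
    # 3. Lexical transfer: replace the source lexeme by its target equivalent.
    target["lex"] = transfer_lexicon.get(target["lex"], target["lex"])
    return target


# Hypothetical semantic tree for "The government made a new plan."
tree = {
    "label": "pred", "lex": "malyenhata", "children": [
        {"label": "arg", "lex": "cengpwu", "children": []},
        {"label": "arg", "lex": "kyeyhoykan", "children": [
            {"label": "mod", "lex": "saylowun", "children": []},
        ]},
    ],
}
lexicon = {"malyenhata": "make", "cengpwu": "government",
           "kyeyhoykan": "plan", "saylowun": "new"}
english_tree = transfer(tree, {}, lexicon)
```

With an empty set of noncompositional rules, the target tree has exactly the shape of the source tree and only the lexemes differ.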
3.4 Common information transfer rules
The transfer process in multilingual machine translation can also be simplified by separating the structure from the information. In the existing transfer-based machine translation systems, the structural transfer has included the information transfer. This has caused duplication of information and an increase in memory space. Isolating the structure from the information removes these shortcomings. In this sense, the common information transfer rules have the function of transferring the information that is common to many languages, that is, they are rules that copy the semantic information from the source language to the target language. The semantic information is produced during analysis by the mapping from a form to its meaning. The following rules are the common information transfer rules (we use the notation of the CAT2 system):
(2) Common information transfer rules
- Lexical_semantic_transfer =
{head:{ehead:{sem:SEM}}}.[*] <=> {head:{ehead:{sem:SEM}}}.[*].
- Transfer_of_semantic_roles =
{role:ROLE}.[*] <=> {role:ROLE}.[*].
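The effect of the two rules in (2) can be sketched with nested dictionaries standing in for feature structures. The function names mirror the rule names, but the dictionary encoding and the feature values in the usage example are assumptions made for illustration.

```python
def lexical_semantic_transfer(source: dict, target: dict) -> None:
    """Copy head:ehead:sem from the source node to the target node."""
    sem = source.get("head", {}).get("ehead", {}).get("sem")
    if sem is not None:
        target.setdefault("head", {}).setdefault("ehead", {})["sem"] = sem


def transfer_of_semantic_roles(source: dict, target: dict) -> None:
    """Copy the semantic role from the source node to the target node."""
    if "role" in source:
        target["role"] = source["role"]


def transfer_information(source: dict) -> dict:
    """Apply both common information transfer rules to every node of a tree."""
    target: dict = {}
    lexical_semantic_transfer(source, target)
    transfer_of_semantic_roles(source, target)
    target["children"] = [
        transfer_information(child) for child in source.get("children", [])
    ]
    return target
```

Only the semantic features are copied. A language-specific feature on the source node, such as a hypothetical `case` value, does not reach the target node, which reflects the separation of structure and information described above.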
4 Korean analysis and transfer by constraint rules
The grammar of an individual language consists of universal rules and their parameters [Chomsky 1981]. Languages can be classified typologically by these parameters [Greenberg 1963]. There is an example of machine translation [Dorr 1993] that has used universal principles and their parameters. According to Greenberg's parameterized word order, we can describe the standard word order of Korean as follows:
(3) Standard Word Order of Korean
- SOV
- Number-Noun
- Demonstrator-Noun
- Adjective-Noun
- Possessive Pronoun-Noun
- Relative clause-Noun
4.1 Korean analysis by parameterized common rules
According to the standard word order of Korean, the head word must always follow its argument or modifier. From this point of view we can select the head-final common rules for Korean from the multilingual common grammatical rules in Table 1. The head-final rules of Table 1, together with the Head Feature Principle (HFP), which percolates the information of a lexical head to its phrase, are as follows. (The coordination structure of Korean can be considered as part of the 'Argument-Functional word structure', but I treat it as a ternary structure for the sake of efficient analysis.)
Table 2. Parameterized common grammar rules for Korean
| | head-final-structure | head-middle-structure |
|---|---|---|
| 1 | PRED => ARG PRED | COORD => ARG1 COORD ARG2 |
| 2 | MODED => MOD MODED | |
| 3 | FUNCT => ARG FUNCT | |
(5) cengpwunun saylowun kyeyhoykanul malyenhayessta.
government+SUBJ new plan+OBJ make+PAST+DECL
The government made a new plan.
In (5) the fine line shows the application of 'FUNCT => ARG FUNCT', the dotted line that of 'MODED => MOD MODED' and the thick line that of 'PRED => ARG PRED'.
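The application of the three head-final rules to (5) can be traced with a small shift-reduce sketch. Several things here are assumptions rather than the paper's method: the morpheme segmentation, the part-of-speech labels, the mapping from category pairs to rules in `combine`, the treatment of the verb with its endings as one token, and the greedy reduction strategy. The Head Feature Principle is modelled by letting the mother share the `head` value of its right daughter.

```python
from typing import List, Optional, Tuple


def leaf(form: str, cat: str) -> dict:
    return {"head": {"cat": cat, "lex": form}, "phrase": False,
            "rule": None, "daughters": []}


def combine(left: dict, right: dict) -> Optional[dict]:
    """Return the mother node if a head-final rule licenses left + right."""
    lcat, rcat = left["head"]["cat"], right["head"]["cat"]
    if lcat == "noun" and rcat == "particle" and not right["phrase"]:
        rule = "FUNCT => ARG FUNCT"
    elif lcat == "adj" and rcat == "noun":
        rule = "MODED => MOD MODED"
    elif lcat == "particle" and left["phrase"] and rcat == "verb":
        rule = "PRED => ARG PRED"
    else:
        return None
    # Head Feature Principle: the mother shares the head features of the
    # head daughter, which is always the right daughter in a head-final rule.
    return {"head": right["head"], "phrase": True,
            "rule": rule, "daughters": [left, right]}


def parse(tokens: List[Tuple[str, str]]) -> Tuple[List[dict], List[str]]:
    """Greedy shift-reduce parse; returns the final stack and the rules applied."""
    stack: List[dict] = []
    applied: List[str] = []
    for form, cat in tokens:
        stack.append(leaf(form, cat))  # shift
        while len(stack) >= 2:
            mother = combine(stack[-2], stack[-1])
            if mother is None:
                break
            stack[-2:] = [mother]      # reduce
            applied.append(mother["rule"])
    return stack, applied


# Assumed segmentation of sentence (5).
SENTENCE = [("cengpwu", "noun"), ("nun", "particle"), ("saylowun", "adj"),
            ("kyeyhoykan", "noun"), ("ul", "particle"),
            ("malyenhayessta", "verb")]
stack, applied = parse(SENTENCE)
```

Under these assumptions the sentence reduces to a single PRED node whose head features are those of the verb, using the FUNCT rule twice, the MODED rule once and the PRED rule twice.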
4.2 Korean analysis by grammatical constraint rules
When analysing Korean for machine translation, we must pay special attention to the following characteristics [Oh 1994]:
(6) Korean Characteristics
- Phonological peculiarity
sonyen-i, sonye-ka
boy-SUBJ, girl-SUBJ
boy, girl
- Double objects
kunun seoulul yehayngul hayessta.
He-SUBJ Seoul-OBJ trip-OBJ make-PAST-DECL
He made a trip to Seoul.
- Honorifics
kyoswunimkkeyse osipnita.
professor-SUBJ(HON) come-HON-DECL
The professor comes.
Table 3. Common rules and constraint rules
| Korean characteristics | Common rules | Constraint rules |
|---|---|---|
| Phonological peculiarities | FUNCT => ARG FUNCT | Phonological rule |
| Double objects | PRED => ARG PRED | Argument exchange |
| Honorifics | HFP | Context information |
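The first row of Table 3 can be illustrated with the subject particle from (6): the common rule 'FUNCT => ARG FUNCT' joins a noun and a particle, and a phonological constraint decides which form of the particle is admissible. The sketch below is an assumption about how such a constraint could be stated. It relies on the fact that the subject particle is 'i' after a consonant-final noun and 'ka' after a vowel-final noun, and it assumes that in the romanization used in the examples a vowel-final noun ends in one of the letters a, e, i, o, u or y. The function names are hypothetical.

```python
# Assumption: in the romanization of the examples, a word ending in one of
# these letters ends in a vowel; any other final letter is a consonant.
VOWEL_FINAL_LETTERS = set("aeiouy")


def subject_particle(noun: str) -> str:
    """Choose the form of the subject particle for a romanized noun."""
    return "ka" if noun[-1] in VOWEL_FINAL_LETTERS else "i"


def phonological_constraint(noun: str, particle: str) -> bool:
    """Constraint on 'FUNCT => ARG FUNCT' for the subject particle forms.

    Other particles are outside the scope of this sketch and are accepted
    without a check.
    """
    if particle in ("i", "ka"):
        return particle == subject_particle(noun)
    return True
```

With this check, 'sonyen' combines with 'i' and 'sonye' with 'ka', as in (6), while the opposite pairings are rejected.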
Table 4. Lexicon of 'hata (do/make)' | ||||
---|---|---|---|---|
lex | hata | |||
arg1 | ARG1 | |||
arg2 | ARG2 | |||
frame | cat | noun | ||
arg3 | frame | arg1 | ARG1 | |
arg2 | ARG2 |
4.3 Transfer constraint rules
The syntactic tree of Korean is converted into a semantic tree through tree transformation. The semantic tree has a 'predicate-argument-modifier' arrangement, and the HFP is also applied to its nodes. The Korean syntactic tree (5) is transduced into the following semantic tree through the transformation rules.
(7) cengpwunun saylowun kyeyhoykanul malyenhayessta.
government-SUBJ new plan-OBJ make-PAST-DECL
The government made a new plan.
The semantic tree becomes the input of transfer. All semantic trees that can be transferred compositionally are transferred to the target language by the 'common structural transfer rules' and the 'common information transfer rules'. There are, however, cases of compositional transfer to which the common information transfer rules cannot be applied. Idiomatic expressions with the functional verbs 'hata (do/make)' or 'toyta (be done/be made)' are such cases. We delete 'hata' during the transformation from syntactic tree to semantic tree and copy the information of 'hata' to the feature 'functional verb' of the predicate noun, so that the predicate noun becomes the predicate of the sentence. But there is no multilingual rule that can control the relation between the predicate noun of the source language and the predicate noun of the target language, or between the predicate noun of the source language and a verb or adjective of the target language. For this reason we need rules that constrain the common transfer rules. These are the transfer constraint rules for the common information transfer rules.
(8) Constraint rule of predicate noun
- idiomatic expression vs idiomatic expression
Copy the information of the Korean functional verb to the functional verb of the target language, if the lexeme of the target language has a functional verb that is equivalent to the Korean idiomatic expression with 'hata'.
ex.) sanpolul hata => take a walk, einen Spaziergang machen, sanpowo suru
ilul hata => sikotowo suru
- idiomatic expression vs verb or adjective
Copy the information of the Korean functional verb to the lexeme of the target language, if the lexeme of the target language has no functional verb that is equivalent to the Korean idiomatic expression with 'hata'.
ex.) ilul hata => work, arbeiten
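The two cases of constraint rule (8) can be sketched as a lookup followed by a copy. The lexicon entries are taken from the examples above (with the spellings used there); the dictionary layout, the function name `transfer_predicate_noun`, the language codes and the `tense` feature are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# (Korean predicate noun, target language) -> target lexeme and, if the
# target language has one, the equivalent functional verb.
TRANSFER_LEXICON: Dict[Tuple[str, str], Dict[str, Optional[str]]] = {
    ("sanpo", "en"): {"lex": "walk", "functional_verb": "take"},
    ("sanpo", "de"): {"lex": "Spaziergang", "functional_verb": "machen"},
    ("sanpo", "ja"): {"lex": "sanpo", "functional_verb": "suru"},
    ("il", "ja"): {"lex": "sikoto", "functional_verb": "suru"},
    ("il", "en"): {"lex": "work", "functional_verb": None},
    ("il", "de"): {"lex": "arbeiten", "functional_verb": None},
}


def transfer_predicate_noun(node: dict, target_language: str) -> dict:
    """Transfer a predicate noun that carries the information of 'hata'."""
    entry = TRANSFER_LEXICON[(node["lex"], target_language)]
    info = dict(node["functional_verb"]["info"])  # information of 'hata'
    target: dict = {"lex": entry["lex"]}
    if entry["functional_verb"] is not None:
        # Case 1: idiomatic expression vs idiomatic expression.
        target["functional_verb"] = {"lex": entry["functional_verb"], "info": info}
    else:
        # Case 2: idiomatic expression vs verb or adjective.
        target["info"] = info
    return target


# Hypothetical semantic node for 'sanpolul hayessta' after 'hata' was deleted.
walk_node = {"lex": "sanpo",
             "functional_verb": {"lex": "hata", "info": {"tense": "past"}}}
```

Calling `transfer_predicate_noun(walk_node, "en")` attaches the information to the functional verb 'take', whereas transferring a node for 'il' into English attaches it directly to the lexeme 'work'.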
5 Conclusion
In this paper I have proposed a new approach to multilingual machine translation that accepts the commonness of languages in order to reduce the memory space of the multilingual machine translation system and to simplify the transfer process. This approach is realized by common rules for many languages and constraint rules for the individual languages. For example, the analysis of Korean is carried out by the parameterized common rules and the constraint rules, and the transfer from Korean to other target languages is carried out by the common structural transfer rules, the common information transfer rules, and the transfer constraint rules. The following table shows the number of common and constraint rules used for the analysis and transfer of Korean in the translation of 300 Korean sentences into English or German.
| Syntactic Analysis: Common | Syntactic Analysis: Constraint | Semantic Analysis: Common | Semantic Analysis: Constraint | Transfer: Common | Transfer: Constraint |
|---|---|---|---|---|---|
| 9 | 55 | 39 | 8 | 43 | 3 |
- Truncation of the number of the parse trees
- Conflict between the old and the new lexical information
- Recognizing the idiomatic expressions and collocations
- Disambiguation of polysemy
- Usage of the probabilistic method
- Information processing by the multiple inheritance
- Implementation of the compound unit recognizer
- Usage of the domain
References
Kil-Lok Oh, Key-Sun Choi, and Sey-Young Park. Korean Language Engineering. Tae-Young-Sa, 1994.
N. Chomsky. Lectures on Government and Binding. The Pisa Lectures. Studies in Generative Grammar 9. Dordrecht, Holland & Cinnaminson, U.S.A.: Foris Publications, 1981.
J. H. Greenberg. “Some universals of grammar with particular reference to the order of meaningful elements.” Universals of Language. Ed. Joseph H. Greenberg. Cambridge, Massachusetts: The M.I.T. Press, 1963.
B. J. Dorr. Machine Translation: A View from the Lexicon. Cambridge, Massachusetts and London, England: MIT Press, 1993.
W. J. Hutchins and H. L. Somers. An Introduction to Machine Translation. Academic Press, 1992.
R. S. Jackendoff. X-bar Syntax: A Study of Phrase Structure. Cambridge: MIT Press, 1977.
D. Lewis. “Computers and Translation.” Computers and Written Texts. Ed. Christopher Butler. Blackwell, 1992. 75-114.
C. Pollard and I. Sag. Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguistics. Chicago & London: The University of Chicago Press, 1994.
R. Sharp. CAT2 Reference Manual Version 3.6. IAI Working Papers N.27. Saarbruecken, Germany, 1994.