Digital Humanities Abstracts

“Korean Analysis and Transfer in Multilingual Machine Translation System”
Sung-Kwon Choi, Systems Engineering Research Institute, skchoi@seri.re.kr
Tae-Wan Kim, Systems Engineering Research Institute, twkim@seri.re.kr
Soo-Hyun Lee, Systems Engineering Research Institute, shlee@seri.re.kr
Dong-In Park, Systems Engineering Research Institute, dipark@seri.re.kr

Abstract

Multilingual machine translation means translation among more than two languages. Existing multilingual machine translation systems can be classified as transfer-based or interlingua-based. In the former, the analysis and generation rules were written separately for each language, so that the commonness of the languages was ignored and the total memory requirement increased. The latter had difficulty implementing a universal linguistic model applicable to many languages. To overcome these shortcomings, this paper describes a multilingual MT system built on common rules, which capture the commonness of languages and can be shared by many languages.

1 Introduction

In existing transfer-based multilingual machine translation systems (SYSTRAN, EUROTRA, METAL, LOGOS, GETA, etc.), the analysis and generation rules are independent and differ according to the target language [Hutchins 1992]. This means that these systems do not acknowledge the commonness of languages. As a result, they take the form of a bundle of bilingual MT systems, which increases the size of the system. There are also multilingual machine translation systems that use the interlingual method to reduce the number of transfer processes (CETA, SALAT, DLT, KANT, etc.); however, they face the difficult problem of completing a universal linguistic model [Lewis 1992]. From this point of view, this paper describes a new multilingual machine translation method based on common rules and constraint rules that overcomes the problems of the existing systems. The common rules are rules shared by two or more languages. Because the common rules are loaded into memory only once, they reduce the memory requirement, improve the consistency of grammatical information, and standardize the information structure of the lexicon. They have a further merit for MT: when a new language is added to the existing system and translated into the existing languages, new grammar modules can be created easily by combining common rules. The constraint rules are rules that control the linguistic characteristics of individual languages.
This paper consists of three parts. Chapter 2 introduces the construction of the whole system. Chapter 3 describes the modules consisting of common rules, that is, the common grammatical rules, the common lexicon information structure, the common structural transfer rules, and the common information transfer rules. Chapter 4 explains the analysis and transfer of Korean through the parameterized common rules and the constraint rules.

2 System construction

Figure 1 shows the construction of the multilingual machine translation system based on the common rules and constraint rules:
Figure 1. Construction of multilingual machine translation system
The middle field of Figure 1 represents the common module. Each 'rn' is a file of common rules belonging to the common module. These files of common rules are called by the grammar modules of the individual languages and, together with the constraint rules for a language, constitute the grammar rules of that language. For example, Korean, Japanese, English and German in Figure 1 have the rule file r3 in common, but Korean and Japanese additionally share the rule files r2 and r4 because they are typologically closer to each other than to English and German.
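The sharing arrangement in Figure 1 can be sketched as follows. This is a minimal illustration in Python, not CAT2 code: the names `GRAMMAR_MODULES`, `load_rule_file` and `build_grammar` are hypothetical, and the file contents are placeholders. Only the sharing pattern stated in the text is encoded (r3 is called by all four languages, r2 and r4 by Korean and Japanese); which further files English and German call is not stated, so none are listed.

```python
from functools import lru_cache

# Which common rule files each grammar module calls. Only the facts stated
# in the text are encoded: all four languages call r3; Korean and Japanese
# additionally call r2 and r4.
GRAMMAR_MODULES = {
    "Korean": ("r2", "r3", "r4"),
    "Japanese": ("r2", "r3", "r4"),
    "English": ("r3",),
    "German": ("r3",),
}

LOAD_LOG = []  # records every real load, to show each file is loaded once


@lru_cache(maxsize=None)
def load_rule_file(name):
    """Load a common rule file once; later calls reuse the cached object."""
    LOAD_LOG.append(name)
    return {"file": name, "rules": ()}  # placeholder content (assumption)


def build_grammar(language, constraints=()):
    """Combine shared rule files with language-specific constraint rules."""
    common = [load_rule_file(name) for name in GRAMMAR_MODULES[language]]
    return {"language": language, "common": common,
            "constraints": list(constraints)}


grammars = {lang: build_grammar(lang) for lang in GRAMMAR_MODULES}
shared_by_all = set.intersection(
    *(set(files) for files in GRAMMAR_MODULES.values())
)
```

In this sketch four grammar modules are built, but each rule file is loaded a single time and the same object is referenced by every module that calls it, which is the source of the memory saving the paper describes.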

3 Common rules

In this chapter I show the construction of the common rules. The common rules for analysis consist of the common grammatical rules and the common lexicon information structure; those for transfer consist of the common structural transfer rules and the common information transfer rules.

3.1 Common grammatical rules

To handle many languages in a multilingual machine translation system, the common grammatical rules should explain the linguistic phenomena of as many languages as possible. To account for configurational languages (e.g. English) as well as nonconfigurational languages (e.g. Korean, Japanese, German), whose word order is relatively free, we have written new grammar rules that combine X-bar syntactic theory [Jackendoff 1977] and HPSG [Pollard 1994]. The new grammar uses binary structures, except for coordination, which uses a ternary structure.
Table 1. Common grammar rules
   head-final-structure    head-first-structure    head-middle-structure
1  PRED => ARG PRED        PRED => PRED ARG        COORD => ARG1 COORD ARG2
2  MODED => MOD MODED      MODED => MODED MOD
3  FUNCT => ARG FUNCT      FUNCT => FUNCT ARG
The common grammar rules of Table 1 are given in Appendix 1 in the notation of the CAT2 machine translation system.
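As an illustration, the rules of Table 1 can be held as plain data and selected by head position. The sketch below is in Python rather than CAT2 notation; the names `COMMON_GRAMMAR_RULES`, `select_rules` and `head_index` are hypothetical and serve only to show how a head-position parameter could pick a subset of the shared rules.

```python
# Table 1 as data: each rule is (mother, daughters), grouped by head position.
COMMON_GRAMMAR_RULES = {
    "head-final": [
        ("PRED", ("ARG", "PRED")),
        ("MODED", ("MOD", "MODED")),
        ("FUNCT", ("ARG", "FUNCT")),
    ],
    "head-first": [
        ("PRED", ("PRED", "ARG")),
        ("MODED", ("MODED", "MOD")),
        ("FUNCT", ("FUNCT", "ARG")),
    ],
    "head-middle": [
        ("COORD", ("ARG1", "COORD", "ARG2")),
    ],
}


def select_rules(*head_positions):
    """Return the common rules for the chosen head-position parameters."""
    return [rule
            for position in head_positions
            for rule in COMMON_GRAMMAR_RULES[position]]


def head_index(rule):
    """Position of the head daughter: the one labelled like the mother."""
    mother, daughters = rule
    return daughters.index(mother)


# The selection used for Korean in section 4.1: head-final plus coordination.
korean_rules = select_rules("head-final", "head-middle")
```

Selecting `"head-final"` and `"head-middle"` yields the four rules that section 4.1 lists for Korean, and no rule has to be written again for the new language.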

3.2 Common lexicon information structure

We need a lexicon information structure in order to enter, manage and correct the lexical information of the multilingual machine translation system consistently. It is desirable to build a multiple rather than a monotonic structure, so that the lexicon information structure can represent all the linguistic information that may be needed and so that this information can be moved collectively. From this point of view I have selected the feature structure as the multilingual lexicon information structure and made the attributes the same across languages. Appendix 2 shows an example of the multilingual lexicon information structure.

3.3 Common structural transfer rules

There is also a part of the transfer process that many languages can share: compositional transfer, which copies a node of the source language to a node of the target language if the analysis structure of the former and the generation structure of the latter are the same. To transfer structures that differ between languages compositionally, our system deletes the functional words and then transforms the syntactic nodes into 'predicate-argument-modifier' nodes. Noncompositional structural rules, which the common structural transfer rules cannot handle, are recorded in the transfer lexicon because they depend on individual lexemes. The transfer rules apply in a priority order: first the noncompositional structural transfer rules, second the common structural transfer rules, and last the lexical transfer rules in the lexicon. The following rule is the common structural transfer rule:
(1) common_structural_transfer_rule = {}.[+node] <=> {}.[+node].
Rule (1) says that all compositionally transferable trees, that is, those marked '+node', are transferred unchanged from the source language to the target language.
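The priority order described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the CAT2 mechanism: the function `transfer`, the dictionary-based tree shape and the `noncompositional` table are hypothetical, and the stages are assumed to apply node by node. For simplicity the lexicon keys are the surface forms of example (5) in section 4.1 and the values follow its English gloss.

```python
def transfer(node, noncompositional, lexicon):
    """Transfer one source node to the target side in the stated order."""
    # 1. Noncompositional structural rules, keyed on the lexeme, come first.
    rule = noncompositional.get(node["lex"])
    if rule is not None:
        return rule(node)
    # 2. Common structural transfer: the node shape is copied unchanged,
    #    as in rule (1), and the daughters are transferred recursively.
    children = [transfer(child, noncompositional, lexicon)
                for child in node.get("children", [])]
    # 3. Lexical transfer: look the lexeme up in the transfer lexicon.
    return {"lex": lexicon.get(node["lex"], node["lex"]),
            "children": children}


# Surface forms from example (5), mapped to the words of its gloss.
LEXICON = {
    "cengpwunun": "government",
    "saylowun": "new",
    "kyeyhoykanul": "plan",
    "malyenhayessta": "make",
}

# A predicate-argument-modifier tree for example (5) (shape is an assumption).
source_tree = {
    "lex": "malyenhayessta",
    "children": [
        {"lex": "cengpwunun"},
        {"lex": "kyeyhoykanul", "children": [{"lex": "saylowun"}]},
    ],
}

target_tree = transfer(source_tree, noncompositional={}, lexicon=LEXICON)
```

Because the structural step only copies, the target tree has exactly the shape of the source tree; only an entry in the `noncompositional` table can change that shape.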

3.4 Common information transfer rules

The transfer process in multilingual machine translation can also be simplified by separating structure from information. In existing transfer-based machine translation systems, structural transfer has included information transfer, which has led to duplicated information and increased memory use. Separating structure from information removes these shortcomings. In this sense, the common information transfer rules transfer the common information available to many languages; that is, they are rules that copy semantic information from the source language to the target language. The semantic information is produced during analysis by mapping form to meaning. The following are the common information transfer rules, written in the notation of the CAT2 system:
(2) Common information transfer rules
  • Lexical_semantic_transfer =
    {head:{ehead:{sem:SEM}}}.[*] <=> {head:{ehead:{sem:SEM}}}.[*].
  • Transfer_of_semantic_roles =
    {role:ROLE}.[*] <=> {role:ROLE}.[*].
The lexical semantic transfer rule says that the lexical semantic information of the source language is copied to the target language at the same node level, and vice versa ('<=>' indicates that the rule is bidirectional). The transfer of semantic roles copies the semantic role information between the source language and the target language.
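The two rules in (2) amount to copying a feature value between corresponding nodes. A minimal Python sketch, with nodes modelled as nested dictionaries: the function names mirror the rule names, while the feature values `"institution"` and `"agent"` are invented for illustration and do not come from the paper.

```python
def lexical_semantic_transfer(source, target):
    """Copy head.ehead.sem from the source node to the target node."""
    sem = source["head"]["ehead"]["sem"]
    target.setdefault("head", {}).setdefault("ehead", {})["sem"] = sem
    return target


def transfer_of_semantic_roles(source, target):
    """Copy the semantic role from the source node to the target node."""
    target["role"] = source["role"]
    return target


# Hypothetical source node; the sem and role values are made up.
source_node = {
    "lex": "cengpwu",
    "head": {"ehead": {"sem": "institution"}},
    "role": "agent",
}
# The target node already has its own lexeme from structural/lexical transfer.
target_node = {"lex": "government"}

lexical_semantic_transfer(source_node, target_node)
transfer_of_semantic_roles(source_node, target_node)
```

Since both rules are symmetric, swapping the arguments gives the reverse direction, which corresponds to the bidirectional '<=>' of the CAT2 notation.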

4 Korean analysis and transfer by constraint rules

The grammar of an individual language consists of universal rules and their parameters [Chomsky 1981]. Languages can be classified typologically by these parameters [Greenberg 1963]. There is an example of machine translation [Dorr 1993] that uses universal principles and their parameters. Following Greenberg's parameterized word order, we can describe the standard word order of Korean as follows:
(3) Standard Word Order of Korean
  • SOV
  • Number-Noun
  • Demonstrative-Noun
  • Adjective-Noun
  • Possessive Pronoun-Noun
  • Relative clause-Noun
This standard word order gives a clue to the parameter settings of an individual language. The next section presents the parameterized common grammatical rules for Korean.

4.1 Korean analysis by parameterized common rules

According to the standard word order of Korean, the head word must always follow its argument or modifier. From this point of view we can select the head-final common rules for Korean from the multilingual common grammatical rules in Table 1. The head-final rules and the Head Feature Principle (HFP), which percolates the information of the lexical head to its phrase, are as follows. (The coordination structure of Korean can be considered part of the 'argument-functional word structure'. I treat the coordination structure of Korean as a ternary structure for efficient analysis.)
Table 2. Parameterized common grammar rules for Korean
   head-final-structure    head-middle-structure
1  PRED => ARG PRED        COORD => ARG1 COORD ARG2
2  MODED => MOD MODED
3  FUNCT => ARG FUNCT
(4) Head_Feature_Principle = {head:HEAD}.[{},{head:HEAD}].
A Korean sentence analysed by the parameterized common grammar rules and the HFP yields the following:
(5) cengpwunun saylowun kyeyhoykanul malyenhayessta.
    government+SUBJ new plan+OBJ make+PAST+DECL
    The government made a new plan.
In (5) the fine line shows the application of 'FUNCT => ARG FUNCT', the dotted line that of 'MODED => MOD MODED' and the thick line that of 'PRED => ARG PRED'.
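The effect of the Head Feature Principle in (4) can be sketched in Python. In a binary head-final structure the mother shares its head features with the second (right) daughter. The helper names `leaf` and `apply_hfp` are hypothetical, the leaves are labelled with the glosses of (5), and the bracketing below is only one possible reading, chosen for illustration.

```python
def leaf(gloss):
    """A lexical node whose head features carry its gloss."""
    return {"label": gloss, "head": {"lex": gloss}, "daughters": []}


def apply_hfp(mother_label, left, right):
    """Build a binary mother that shares head features with the right daughter."""
    return {"label": mother_label, "head": right["head"],
            "daughters": [left, right]}


# FUNCT => ARG FUNCT: noun plus functional word
subject = apply_hfp("FUNCT", leaf("government"), leaf("SUBJ"))
# MODED => MOD MODED: modifier plus modified noun
modified = apply_hfp("MODED", leaf("new"), leaf("plan"))
obj = apply_hfp("FUNCT", modified, leaf("OBJ"))
# PRED => ARG PRED: argument plus predicate, applied twice
verb = leaf("make+PAST+DECL")
inner = apply_hfp("PRED", obj, verb)
sentence = apply_hfp("PRED", subject, inner)
```

In this sketch the head features of the whole sentence are the very object that belongs to the verb, which is what percolation of the lexical head's information to its phrase means.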

4.2 Korean analysis by grammatical constraint rules

When analysing Korean for machine translation, we must pay special attention to the following [Oh 1994]:
(6) Korean Characteristics
  • Phonological peculiarity
    sonyen-i, sonye-ka
    boy-SUBJ, girl-SUBJ
    boy, girl
  • Double objects
    kunun seoulul yehayngul hayessta.
    He-SUBJ Seoul-OBJ trip-OBJ make-PAST-DECL
    He made a trip to Seoul
  • Honorifics
    kyoswunimkkeyse osipnita.
    professor-SUBJ(HON) come-HON-DECL
    The professor comes.
These peculiarities of Korean can be handled by the constraint rules. Table 3 shows the relation between the common rules and their constraint rules.
Table 3. Common rules and constraint rules
Korean characteristics        Common rules          Constraint rules
Phonological peculiarities    FUNCT => ARG FUNCT    Phonological rule
Double objects                PRED => ARG PRED      Argument exchange
Honorifics                    HFP                   Context information
- Phonological rule
  Every morpheme records its last phoneme, which is subcategorized for and predicted by the functional word.
  example) sonyen{phon:con} i{phon:voc,frame:{arg1:{phon:con}}}
           boy{phon:con} SUBJ{phon:voc,frame:{arg1:{phon:con}}}
- Argument exchange
  The subcategorization structures of the functional verb 'hata (= do/make)' and those of the predicate noun are exchanged for each other in the lexicon:
Table 4. Lexicon of 'hata (do/make)'
lex    hata
frame  arg1  ARG1
       arg2  ARG2
       arg3  cat    noun
             frame  arg1  ARG1
                    arg2  ARG2
  example) kunun(arg1) seoulul(arg2) yehayngul(arg3(arg1,arg2)) ha(arg1,arg2,arg3)yessta.
           He-SUBJ Seoul-OBJ trip-OBJ make-PAST-DECL
           He made a trip to Seoul.
- Context information
  The context information of the sentence subject agrees with that of the verb phrase.
  example) kyoswunimkkeyse(context:honor) osi(context:honor)pnita.
           professor-SUBJ(HON) come-HON-DECL
           The professor comes.
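The phonological rule and the context agreement can both be read as feature checks. The following Python sketch is an illustration, not CAT2 code: `subsumes`, `functional_word_accepts` and `context_agrees` are hypothetical helpers. The entries for 'sonyen', 'i' and the honorific example follow the feature values given above; the entries for 'sonye' and 'ka' are inferred from the pair sonyen-i / sonye-ka and are assumptions.

```python
def subsumes(constraint, features):
    """True if every feature required by `constraint` holds in `features`."""
    for key, required in constraint.items():
        if key not in features:
            return False
        if isinstance(required, dict):
            if not isinstance(features[key], dict):
                return False
            if not subsumes(required, features[key]):
                return False
        elif features[key] != required:
            return False
    return True


def functional_word_accepts(funct, arg):
    """Phonological rule: the functional word predicts its argument's phoneme."""
    return subsumes(funct["frame"]["arg1"], arg)


def context_agrees(subject, verb_phrase):
    """Context information: subject and verb phrase carry the same context."""
    return subject.get("context") == verb_phrase.get("context")


sonyen = {"phon": "con"}                                   # from the example
i_subj = {"phon": "voc", "frame": {"arg1": {"phon": "con"}}}  # from the example
sonye = {"phon": "voc"}                                    # assumption
ka_subj = {"phon": "voc", "frame": {"arg1": {"phon": "voc"}}}  # assumption

professor = {"context": "honor"}   # kyoswunimkkeyse
comes = {"context": "honor"}       # osi...pnita
plain_verb = {}                    # hypothetical verb without honorific marking
```

Under these assumptions 'i' combines with 'sonyen' but not with 'sonye', 'ka' combines with 'sonye', and an honorific subject is rejected with a verb phrase that lacks the honorific context.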

4.3 Transfer constraint rules

The syntactic tree of Korean is converted into a semantic tree through tree transformation. The semantic tree has a 'predicate-argument-modifier' arrangement, and the HFP is also applied to its nodes. The transformation rules convert the Korean syntactic tree (5) into the following semantic tree:
(7) cengpwunun saylowun kyeyhoykanul malyenhayessta.
    government-SUBJ new plan-OBJ make-PAST-DECL
    The government made a new plan.
The semantic tree becomes the input to transfer. All semantic trees that can be transferred compositionally are transferred to the target language by the 'common structural transfer rules' and the 'common information transfer rules'. There are, however, cases of compositional transfer to which the common information transfer rules cannot be applied. Idiomatic expressions with the functional verbs 'hata (do/make)' or 'toyta (be done/be made)' are examples. We delete 'hata' during the transformation from syntactic tree to semantic tree and copy its information to the feature 'functional verb' of the predicate noun, so that the predicate noun becomes the predicate of the sentence. But there is no multilingual rule that can control the relation between the predicate noun of the source language and the predicate noun of the target language, or between the predicate noun of the source language and a verb or adjective of the target language. For this reason we need a rule that constrains the common transfer rules. The following are the transfer constraint rules for the common information transfer rules.
(8) Constraint rule of predicate noun
  • idiomatic expression vs idiomatic expression

    Copy the information of the Korean functional verb to the functional verb of the target language if the lexeme of the target language has a functional verb that is equivalent to the Korean idiomatic expression with 'hata'.

    ex.) sanpolul hata => take a walk, einen Spaziergang machen, sanpowo suru
    ilul hata => sikotowo suru
  • idiomatic expression vs verb or adjective

    Copy the information of the Korean functional verb to the lexeme of the target language if the lexeme of the target language has no functional verb that is equivalent to the Korean idiomatic expression with 'hata'.

    ex.) ilul hata => work, arbeiten
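The two cases of (8) can be sketched as a lookup in a transfer lexicon. The Python below is a hedged illustration: `TRANSFER_LEXICON` contains only the examples given above, while the function `transfer_predicate_noun`, the result layout and the feature `tense` are hypothetical. A (functional verb, noun) pair marks a target expression that keeps a functional verb; a plain string marks a target verb.

```python
# Transfer lexicon built only from the examples in (8).
TRANSFER_LEXICON = {
    "sanpolul hata": {
        "English": ("take", "a walk"),
        "German": ("machen", "einen Spaziergang"),
        "Japanese": ("suru", "sanpowo"),
    },
    "ilul hata": {
        "Japanese": ("suru", "sikotowo"),
        "English": "work",
        "German": "arbeiten",
    },
}


def transfer_predicate_noun(expression, target_language, functional_verb_info):
    """Apply the constraint rule of predicate noun to one 'hata' expression."""
    entry = TRANSFER_LEXICON[expression][target_language]
    if isinstance(entry, tuple):
        # Case 1: the target has an equivalent functional verb, so the
        # information of 'hata' is copied to that functional verb.
        verb, noun = entry
        return {"pred": noun,
                "functional_verb": {"lex": verb, **functional_verb_info}}
    # Case 2: no equivalent functional verb, so the information of 'hata'
    # is copied to the target lexeme itself.
    return {"pred": entry, **functional_verb_info}


hata_info = {"tense": "past"}  # hypothetical information carried by 'hata'
walk_en = transfer_predicate_noun("sanpolul hata", "English", hata_info)
work_de = transfer_predicate_noun("ilul hata", "German", hata_info)
work_ja = transfer_predicate_noun("ilul hata", "Japanese", hata_info)
```

The same Korean expression 'ilul hata' thus takes the first branch for Japanese and the second for English and German, which is why the rule has to be a constraint on the common transfer rules rather than a common rule itself.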

5 Conclusion

In this paper I have proposed a new approach to multilingual machine translation that accepts the commonness of languages in order to reduce the memory requirement of the multilingual machine translation system and to simplify the transfer process. This approach is realized by common rules for many languages and constraint rules for the individual languages. For example, the analysis of Korean is handled by the parameterized common rules and the constraint rules, and the transfer from Korean to other target languages is handled by the common structural transfer rules, the common information transfer rules, and the transfer constraint rules. The following table shows the number of common and constraint rules used for the analysis and transfer of Korean in the translation of 300 Korean sentences into English or German.
Syntactic Analysis      Semantic Analysis       Transfer
Common   Constraint     Common   Constraint     Common   Constraint
9        55             39       8              43       3
- Further work
Although multilingual machine translation by the common rules and the constraint rules performs reasonably well, reducing the analysis rules and simplifying the transfer process, many problems remain to be solved:
  • Truncation of the number of the parse trees
  • Conflict between the old and the new lexical information
  • Recognizing the idiomatic expressions and collocations
  • Disambiguation of polysemy
In order to solve the problems we are testing the following methods:
  • Usage of the probabilistic method
  • Information processing by the multiple inheritance
  • Implementation of the compound unit recognizer
  • Usage of the domain

References

Kil-Lok Oh, Key-Sun Choi, and Sey-Young Park. Korean Language Engineering. Tae-Young-Sa, 1994.
N. Chomsky. Lectures on Government and Binding. The Pisa Lectures. Studies in Generative Grammar 9. Dordrecht, Holland & Cinnaminson, U.S.A.: Foris Publications, 1981.
J. H. Greenberg. “Some universals of grammar with particular reference to the order of meaningful elements.” Universals of Language. Ed. Joseph H. Greenberg. Cambridge, Massachusetts: The M.I.T. Press, 1963.
B. J. Dorr. Machine Translation: A View from the Lexicon. Cambridge, Massachusetts and London, England: MIT Press, 1993.
W. J. Hutchins and H. L. Somers. An Introduction to Machine Translation. Academic Press, 1992.
R. S. Jackendoff. X-bar Syntax: A Study of Phrase Structure. Cambridge: MIT Press, 1977.
D. Lewis. “Computers and Translation.” Computers and Written Texts. Ed. Christopher Butler. Blackwell, 1992. 75-114.
C. Pollard and I. Sag. Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguistics. Chicago & London: The University of Chicago Press, 1994.
R. Sharp. CAT2 Reference Manual Version 3.6. IAI Working Papers N.27. Saarbruecken, Germany, 1994.

Endnote

This paper summarizes an experiment with the multilingual machine translation system CAT2 [Sharp 1994]. The CAT2 system currently runs on a UNIX workstation. It is written in PROLOG and uses a 'constraint bottom-up chart' parser. We are now translating Korean into English and German, and are testing translation from Korean into French, Chinese, Russian, and Japanese as target languages.

Appendix 1. Multilingual common grammar rules written in CAT2 notation

Appendix 2. Multilingual Lexicon Information Structure