“Compound Unit Recognition for Efficient English-Korean Translation”

Hanmin Jung, Machine Translation Laboratory, Systems Engineering Research Institute, Korea, jhm@seri.re.kr
Sanghwa Yuh, Machine Translation Laboratory, Systems Engineering Research Institute, Korea, shyuh@seri.re.kr
Taewan Kim, Machine Translation Laboratory, Systems Engineering Research Institute, Korea, twkim@seri.re.kr
Dong-In Park, Machine Translation Laboratory, Systems Engineering Research Institute, Korea, dipark@seri.re.kr

1. Introduction

If morphologically analyzed sentences are parsed directly, the limitations of word-to-word translation can cause a number of translation failures. The following two sentences and their translations illustrate this difficulty.

"He gave to his opinion." -> "Geu-neun geu-eui euigyeon-eul balpyo-hayeotda."
"I don't think his work is up to much." -> "Na-neun geu-eui jakpoom-i daedan-chiantago saenggak-handa."

The underlined phrases cannot be translated directly from the meanings of their individual words. We define a compound unit (CU) as a bundle of words that is difficult to translate directly or that appears in the same form regardless of context. These units must be recognized to obtain a more natural translation. However, previous work such as [Bond95], [Lauer94] and [Li95] addresses only one or two of the following categories. [SHYoon94] classifies CUs into five categories.

1. traditional metaphorical idioms: "kick the bucket" -> "jookda"
2. typical word usage: "prevent ~ from -ing" -> "~ga -haneun geot-eul banghae-hada"
3. compound words: "operating system" -> "oonyeong chaeje"
4. lexical gaps: "by name" -> "jimyeong-hayeo"
5. frozen expressions: "How are you?" -> "annyeonhaseyo?"

To these categories we add a sixth, (6) phrasal verbs: "put on ~" -> "~eul ipda". CU recognition reduces the search space of the parser through a divide-and-conquer strategy and resolves some part-of-speech (POS) ambiguities [HGLee94]. It also lightens the load on the morphological/syntactic generation module by providing predefined natural translations. We use several mechanisms for more efficient recognition: (1) cooccurrent constraint POS information, (2) cooccurrent constraint string sets, (3) verb type information and (4) pseudo syntactic tags. We also use syntactic tags for CUs whose boundary is unclear, which lets the syntactic analyzer parse them with a predictable top-down method.

2. Terminology

(1) Conjugation Tags: These tags represent the various conjugation forms. VBD, VBG, VBN, VBP and VBZ are the conjugation tags in [Marcus93]. If the first constituent of a CU carries one of these tags, the conjugation tag is used to represent the unit instead of the predefined representative POS tag stored in the CU dictionary.

(2) Verb Type Information: This information consists of two categories, action mode/position information and verb-modifee form information. Action mode/position information consists of seven types that represent the properties of a verb; for example, a T-type verb is a transitive verb with an object. Verb-modifee form information restricts the constituents that are modified by the current verb; for instance, 1-type is one or more nouns following the verb. We use verb type information modified from [Longman83] and [GCKim92].

(3) Variable Constituent and Fixed Constituent: The CU "keep ~ in ~ mind" can be found in the sentence "I kept the words in my mind yesterday". The words "keep", "in" and "mind" of the unit keep a fixed form regardless of context. We define words of this kind as fixed constituents. "the words" and "my" can be replaced with other words depending on the context. These context-sensitive words are defined as variable constituents.

(4) Pseudo Syntactic Tag: A pseudo syntactic tag is zero or a positive integer assigned to CUs and variable constituents. The tag carries a syntactic meaning that can be mapped to a syntactic tag. For example, the CU "keep ~ in ~ mind" is converted into "keep *1#1 in *1#2 mind" as a dictionary entry. "*1" is the pseudo syntactic tag that represents the unit, and "#1" and "#2" are the tags for the two variable constituents. Every pseudo syntactic tag is attached to the first word of the corresponding unit or constituent. These forms allow CUs with embedded structure to be represented hierarchically. The following example shows an embedded structure and its expression.

[sentence] "Investors continued to pour into money funds"
[CU] "continue to #1(VB)" -> *1, "pour into #1(NP)" -> *2, "money funds" -> *3
[sentence with pseudo syntactic tags] "Investors continued(*1) to pour(*2, *1#1) into money(*3, *2#1) funds"
{*n | n = 1, 2, 3, ...} is for CUs, {#n | n = 0, 1, 2, ...} is for variable constituents
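The tagging in the example above can be reproduced with a small sketch. It assumes that CU recognition has already produced, for each unit, the index of its first word and the index of the first word of each variable constituent. The function name `attach_tags` and this input layout are hypothetical and are not taken from the system.

```python
from typing import Dict, List, Tuple

def attach_tags(words: List[str], units: List[Tuple[int, List[int]]]) -> str:
    """Attach pseudo syntactic tags to the first word of each unit/constituent.

    units[n-1] = (index of the unit's first word,
                  [index of the first word of each variable constituent]).
    This input layout is an assumption made for illustration.
    """
    unit_tags: Dict[int, List[str]] = {}
    var_tags: Dict[int, List[str]] = {}
    for n, (head, var_starts) in enumerate(units, start=1):
        unit_tags.setdefault(head, []).append(f"*{n}")
        for k, idx in enumerate(var_starts, start=1):
            var_tags.setdefault(idx, []).append(f"*{n}#{k}")

    out = []
    for i, word in enumerate(words):
        # A word's own unit tag is listed before any variable-constituent tag.
        tags = unit_tags.get(i, []) + var_tags.get(i, [])
        out.append(f"{word}({', '.join(tags)})" if tags else word)
    return " ".join(out)

words = "Investors continued to pour into money funds".split()
units = [
    (1, [3]),  # "continue to #1(VB)": #1 starts at "pour"
    (3, [5]),  # "pour into #1(NP)":   #1 starts at "money"
    (5, []),   # "money funds"
]
tagged = attach_tags(words, units)
print(tagged)
```

Because "pour" is both the head of *2 and the first word of variable constituent #1 of *1, it receives two tags, which is how the embedding hierarchy becomes visible in the flat sentence.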

3. System Structure

Figure 1 shows the system structure of CU recognizer.

The CU search module extracts all possible CUs for each word by consulting the index. The index is the in-memory view of the index dictionary, which is a binary form of the CU dictionary. The POS attachment module assigns a representative POS tag to each extracted CU. When the representative constituent of the unit (the first fixed constituent of the CU) is a verb or one of its conjugated forms, the module does not apply. This alternative processing enables the syntactic analyzer to use the original meaning of the input context. The recognition result creation module draws translations from the recognized units and builds the appropriate data structures from the results.
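One way to read the POS attachment rule, combined with the definition of conjugation tags in Section 2, is sketched below. The function `representative_pos` and its arguments are hypothetical. The sketch only illustrates the stated rule that a conjugation tag on the first constituent takes precedence over the representative POS tag stored in the dictionary.

```python
# Conjugation tags listed in Section 2 (Penn Treebank tag names).
CONJUGATION_TAGS = {"VBD", "VBG", "VBN", "VBP", "VBZ"}

def representative_pos(dictionary_tag: str, first_word_tag: str) -> str:
    """Return the tag that represents a recognized CU.

    dictionary_tag: representative POS tag stored in the CU dictionary entry.
    first_word_tag: tag the tagger gave to the CU's first constituent.
    Both parameter names are assumptions made for this sketch.
    """
    if first_word_tag in CONJUGATION_TAGS:
        return first_word_tag      # keep the conjugation seen in the input
    return dictionary_tag          # otherwise use the predefined tag

print(representative_pos("VB", "VBD"))  # "kept ... in ... mind" stays past tense
print(representative_pos("NN", "NN"))   # "operating system" keeps the dictionary tag
```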

4. Compound Unit Dictionary

The following shows the entry format of the CU dictionary.

(FFC FFCN CN ... {"FC" FCi} | {"VC" VCj CCSSF CCSSk CCPN {CCP | CCPSl} VMFC} ... RPTC APC TN ... {TCNi ... {Tij STN ... STk ...} ...} ...)

Table 1. Acronyms used in the CU dictionary and their meanings
acronym	meaning
FFC	first fixed constituent
FFCN	the number of first fixed constituents
CN	the number of constituents
"FC", "VC"	constituent type identifier
FCi	ith fixed constituent
VCj	jth variable constituent
CCSSF	cooccurrent constraint string set flag (0/1)
CCSSk	kth cooccurrent constraint string set
CCPN	the number of cooccurrent constraint POS (0/1/2)
CCP	cooccurrent constraint POS
CCPSl	lth cooccurrent constraint POS set
VMFC	verb-modifee form code
RPTC	representative POS tag code
APC	action mode/position code
TN	the number of translations
TCNi	the number of ith translation constituents
Tij	jth translation constituent of ith translation
STN	the number of syntactic tags
STk	kth syntactic tag
{}	virtual set
|	selecting one alternative

The CU dictionary consists of (1) CU search information, (2) CU information, (3) representative POS information, (4) action mode/position information and (5) translation information. CU search information holds the fields used for sorting the dictionary and building the index file. Since CUs are defined over English, each constituent corresponds to one word. The contents of CU information depend on the constituent type, fixed or variable: a fixed constituent carries no additional information, whereas a variable constituent carries CU information. Since translations are in Korean, each translation constituent corresponds to one eojeol (Korean word form). Figure 2 shows the information hierarchy of the CU dictionary.

The following example is an entry of the CU dictionary.

[sentence] "I kept the words in my mind yesterday."
[CU] "keep #1 in #2 mind"
[entry of CU dictionary]
(keep 1                                   // CU search information
 5 FC keep                                // CU information
   VC #1 0 0 1
   FC in
   VC #2 0 2 one's 1
   FC mind
 VB                                       // representative POS information
 D                                        // action mode/position information
 1                                        // translation information
 4 #2 0 maeum-e 0 #1-eul 0 saegyeoduda 0)
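The translation information of this entry lists four translation constituents ("#2", "maeum-e", "#1-eul", "saegyeoduda"), two of which contain slots for the variable constituents. A minimal sketch of how such a template could be filled is given below. The function `render_translation` is hypothetical, and the Korean fillers "geu mal" and "nae" are illustrative assumptions rather than output of the system.

```python
import re
from typing import Dict, List

def render_translation(constituents: List[str], bindings: Dict[str, str]) -> str:
    """Fill #n slots in translation constituents with translated fillers.

    constituents: translation constituents from a CU dictionary entry.
    bindings: maps a slot such as "#1" to the translation of the words that
              matched that variable constituent (supplied by the caller).
    """
    def fill(match: "re.Match[str]") -> str:
        return bindings[match.group(0)]

    # r"#\d+" matches a whole slot, so "#1" is never confused with "#10".
    return " ".join(re.sub(r"#\d+", fill, c) for c in constituents)

template = ["#2", "maeum-e", "#1-eul", "saegyeoduda"]   # from the entry above
fillers = {"#1": "geu mal", "#2": "nae"}                # assumed translations
rendered = render_translation(template, fillers)
print(rendered)
```

Keeping the case particle inside the constituent ("#1-eul") means the filler only has to supply the bare noun phrase, which matches the paper's point that predefined translations reduce the work left to the generation module.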

5. Compound Unit Search Algorithms

The principle of CU search is "most-specific-expression-first" [SHYoon94]. This means "fixed constituent first, variable constituent next" and "longer expression (CU) first"; that is, the longest of the expressions successfully found for a word is expected to be the best. It also implies that at most one expression can be extracted for each word in a sentence. Thus, the number of extracted expressions is at most the number of words in the sentence. "Most-specific-expression-first" can be defined by the "more-specific-than" relation >> as follows.

fixed constituent >> variable constituent
a >> b iff a1 >> b1; otherwise, if a1 = b1, then a >> b iff a2a3...an >> b2b3...bn (recursively)
where a = a1a2a3...an and b = b1b2b3...bn

CUs are searched in the index. The index is the in-memory view of the index dictionary, which is built from the CU dictionary. Its structure is a modified trie that can represent heterogeneous constituent types (fixed constituents and variable constituents). Figure 3 shows the index structure in memory.
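The "more-specific-than" relation can be sketched as a comparison function. The sketch makes two assumptions that the definition above leaves open: constituents of the same kind at the same position are treated as equal for ordering purposes, and when one expression is a prefix of the other, the longer one is taken to be more specific ("longer expression first"). The names `is_fixed` and `more_specific` are hypothetical.

```python
from typing import List

def is_fixed(constituent: str) -> bool:
    """Variable constituents are written '#n'; everything else is fixed."""
    return not constituent.startswith("#")

def more_specific(a: List[str], b: List[str]) -> bool:
    """Return True if expression a >> expression b.

    Compares position by position: a fixed constituent outranks a variable
    one. Ties on kind fall through to the next position (assumption), and
    a remaining length difference favours the longer expression (assumption).
    """
    for x, y in zip(a, b):
        if is_fixed(x) and not is_fixed(y):
            return True
        if is_fixed(y) and not is_fixed(x):
            return False
    return len(a) > len(b)

print(more_specific(["keep", "on"], ["keep", "#1"]))
print(more_specific(["keep", "#1", "in", "#2", "mind"], ["keep", "#1"]))
```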

The index structure consists of (1) a beginning index, (2) a constituent index and (3) a representative information index. Each element of the beginning index is the first two characters of the first fixed constituent. Empirically, using the first two characters instead of one reduces the number of searched nodes by about 20~80%. The constituent index is a modified trie structure that represents the two kinds of constituents. We use a "method" mechanism to handle the heterogeneous types: control at a constituent node is transferred according to its "method". The following are the "method" types and their actions.

1. DO_GO_CHILD: in case of exact matching for fixed constituent, go to child node
2. DO_GO_SIBLING: in case of matching failure, go to sibling node
3. DO_SKIP_TO_CHILD: in case of no-constraint variable constituent, skip to child node
4. DO_SKIP_TO_NEXT_WORD: in case of matching failure after DO_SKIP_TO_CHILD, skip to next word
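The four methods above can be illustrated with a small first-child/next-sibling trie. This is a sketch under stated assumptions and not the system's index: the `Node` layout, the functions `search` and `descend`, lemmatized input words, and the reading of DO_SKIP_TO_NEXT_WORD as "let the variable constituent absorb one more word" are all assumptions made for illustration. The beginning index and the representative information index are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    word: Optional[str]               # None marks a variable constituent
    child: Optional["Node"] = None    # next constituent of the same CU
    sibling: Optional["Node"] = None  # alternative constituent at this depth
    entry: Optional[str] = None       # CU that ends at this node, if any

def descend(node: Node, words: List[str], i: int, trace: List[str]):
    """Prefer a longer CU below this node, then a CU ending at this node."""
    if node.child is not None:
        found = search(node.child, words, i, trace)
        if found:
            return found
    return (node.entry, i) if node.entry else None

def search(node: Optional[Node], words: List[str], i: int,
           trace: List[str]) -> Optional[Tuple[str, int]]:
    """Return (CU, index after its last word) or None, logging methods used."""
    while node is not None:
        if node.word is not None:                      # fixed constituent
            if i < len(words) and words[i] == node.word:
                trace.append("DO_GO_CHILD")
                found = descend(node, words, i + 1, trace)
                if found:
                    return found
        else:                                          # variable constituent
            trace.append("DO_SKIP_TO_CHILD")
            for j in range(i + 1, len(words) + 1):     # absorb one more word per round
                found = descend(node, words, j, trace)
                if found:
                    return found
                trace.append("DO_SKIP_TO_NEXT_WORD")
        trace.append("DO_GO_SIBLING")
        node = node.sibling
    return None

# Index for "keep on" and "keep #1 in #2 mind"; the fixed sibling comes first.
mind = Node("mind", entry="keep #1 in #2 mind")
in_ = Node("in", child=Node(None, child=mind))
on = Node("on", entry="keep on", sibling=Node(None, child=in_))
root = Node("keep", child=on)

trace: List[str] = []
words = ["keep", "the", "words", "in", "my", "mind", "yesterday"]
result = search(root, words, 0, trace)
print(result, trace)
```

Placing the fixed sibling "on" before the variable sibling follows "fixed constituent first", and trying the child before accepting a node's own entry follows "longer expression first".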

Each entry of the representative information index corresponds to one CU. The entry holds the information common to the unit: representative POS information, action mode/position information and translation information.

"distinguish oneself" -> key("oneself") -> CCSS("oneself") = {myself, himself, themselves, ...}
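The "distinguish oneself" example suggests that a key word in a CU can stand for a whole cooccurrent constraint string set (CCSS). A minimal sketch of such a check follows. The table `CCSS` and the function `satisfies` are hypothetical, and the set holds only the three members named in the example, since the remaining members are elided in the source.

```python
from typing import Dict, Set

# Only the members listed in the example; the source elides the rest ("...").
CCSS: Dict[str, Set[str]] = {
    "oneself": {"myself", "himself", "themselves"},
}

def satisfies(key: str, word: str) -> bool:
    """True if `word` is allowed where the CU lists `key`.

    A key without a string set must match literally (assumption).
    """
    return word in CCSS.get(key, {key})

print(satisfies("oneself", "himself"))
print(satisfies("oneself", "table"))
```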

6. Experimental Results

6.1 Test Corpus

Our test corpus is the "Wall Street Journal" portion of the Penn Treebank [Marcus93]. We extracted and tested 1281 sentences for experimental analysis. The CU dictionary has 1222 entries that were extracted manually for the 1281 sentences. The average number of words in a sentence is 15.4 with a standard deviation of 3.7. Our tagger yields an average of 157.2 paths per sentence with a standard deviation of 391.5; that is, POS ambiguities lead the tagger to produce about 157 paths for a sentence.

6.2 Compound Unit Recognition

A sentence in the test corpus contains 1.7 CUs on average. The representative POSs drawn from the corpus are VB (verb), NN (noun), IN (preposition), RB (adverb) and JJ (adjective). Of the 170 CUs found, 63.19% are compound nouns, 28.22% are phrasal verbs and 8.59% are of other types. Our system recognizes 95.88% of the CUs correctly. Table 2 shows the experimental results for recognition.

Table 2. Experimental results for CU recognition
extracted and tested sentences	1281
recognition rate	95.88%
unrecognition rate	4.12%
misrecognition rate	1.76%
average number of segmentations	4.1 (standard deviation 2.3)
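The percentages in Section 6.2 and Table 2 can be cross-checked with a little arithmetic. The raw counts are not reported in the paper; the integers below are assumptions, chosen only because they reproduce the published percentages from the stated total of 170 CUs. Under these assumed counts, the category shares come out as fractions of the 163 correctly recognized CUs rather than of all 170.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as in the reported figures."""
    return round(100 * part / whole, 2)

total_found = 170                                  # stated in Section 6.2
correct, unrecognized, misrecognized = 163, 7, 3   # assumed counts, not reported

# Assumed category counts over the correctly recognized CUs.
categories = {"compound nouns": 103, "phrasal verbs": 46, "others": 14}

print(pct(correct, total_found), pct(unrecognized, total_found),
      pct(misrecognized, total_found))
for name, count in categories.items():
    print(name, pct(count, correct))
```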

Most unrecognized CUs are caused by the unexpected insertion of an adverb or adverbial phrase. For example, "heavily" is inserted into the CU "invest in" to give "invest heavily in" (adverb insertion), and "at midnight Tuesday" is inserted into "drop to" to give "drop at midnight Tuesday to" (adverbial phrase insertion). All of these cases occur between a verb and a preposition. They can be eliminated by checking for unexpected insertions at specific positions during CU search.

Misrecognized results fall into two categories. One is a sub-CU that lies inside another recognized CU, such as "as #1 as" in "twice as many as". The other is caused by insufficient cooccurrent constraint or syntactic tag information; for example, "take much to" is recognized as "take #1 to". We expect that these two kinds of misrecognition can be resolved by boundary checking and by supplementing the missing information.

The number of segmentations is the number of pieces into which a sentence is split for divide-and-conquer. The experimental results show an average of 4.1; that is, a sentence is normally divided into 4.1 pieces by our system. Divide-and-conquer is achieved by parsing each piece and merging the results.

[sentence] "Despite recent declines in yields, investors continue to pour cash into money funds."
[CU] "declines in", "continue to #1", "pour #1 into #2", "money funds" (4)
[spaces for divide-and-conquer] Despite recent / declines in / yields / , investors / continue to pour / cash / into / money funds / . (9)
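The insertion check proposed above for unrecognized CUs could look roughly like the following sketch. It covers only single adverbs tagged RB between a verb and a preposition, as in "invest heavily in", and does not handle adverbial phrases such as "at midnight Tuesday". The function `match_verb_prep`, the `max_skip` limit and the (lemma, POS) input format are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

def match_verb_prep(tagged: List[Tuple[str, str]], i: int,
                    verb: str, prep: str, max_skip: int = 2) -> Optional[int]:
    """Match a verb+preposition CU at position i, tolerating inserted adverbs.

    tagged: list of (lemma, POS tag) pairs for the sentence.
    Returns the index of the preposition, or None if the CU does not match.
    """
    if i >= len(tagged) or tagged[i][0] != verb:
        return None
    j = i + 1
    skipped = 0
    # Skip a bounded number of adverbs between the verb and the preposition.
    while j < len(tagged) and tagged[j][1] == "RB" and skipped < max_skip:
        j += 1
        skipped += 1
    if j < len(tagged) and tagged[j][0] == prep:
        return j
    return None

sentence = [("they", "PRP"), ("invest", "VBP"), ("heavily", "RB"),
            ("in", "IN"), ("funds", "NNS")]
print(match_verb_prep(sentence, 1, "invest", "in"))
```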

6.3 Compound Unit Search

Figure 4 shows the ratio of "method" types used on index.

The average number of method invocations per sentence is 9.5 for DO_GO_CHILD, 117.4 for DO_GO_SIBLING, 0.9 for DO_SKIP_TO_CHILD and 3.9 for DO_SKIP_TO_NEXT_WORD. Currently, DO_GO_SIBLING is used far more often than the other methods. The reason lies in the structure of the constituent index: it is a modified trie that requires several levels of depth to search a whole entry, and experimentally each level has many more siblings than children, so the index is much broader than it is deep. We leave this problem for future work.
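The shares behind these averages follow directly from the four reported numbers. The short calculation below uses the method names from Section 5; only the per-sentence averages come from the text, and the derived shares are computed here rather than quoted.

```python
# Average method invocations per sentence, as reported above.
counts = {
    "DO_GO_CHILD": 9.5,
    "DO_GO_SIBLING": 117.4,
    "DO_SKIP_TO_CHILD": 0.9,
    "DO_SKIP_TO_NEXT_WORD": 3.9,
}

total = sum(counts.values())
shares = {name: value / total for name, value in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Sibling traversal accounts for roughly nine out of ten method invocations, which is the imbalance the paragraph above attributes to the breadth of the constituent index.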

7. Conclusion

Our CU recognizer is designed to find units of six categories. As the experimental results show, the approach is useful and is a promising way to build a high-performance parser in terms of parsing time and space. The strong points of the system are as follows.

1. Various kinds of CUs are recognized, not only restricted simple idioms or compound nouns.
2. Parsing space is reduced by divide-and-conquer, which is made possible by combining parsing with CU recognition.
3. Cooccurrent constraint POS information resolves some POS ambiguities.
4. Cooccurrent constraint POS and string sets provide flexible recognition.
5. Verb type information makes high-performance parsing possible by restricting the forms that can follow a verb.
6. Translation information that includes pseudo syntactic tags offers natural translation.
7. Pseudo syntactic tags for both the representative constituent and the variable constituents make it possible to process embedded structures.
8. A syntactic tag in a CU supports predictable top-down parsing when the range of the unit cannot be determined.

Several tasks remain for future work. First, we will extend type information to POSs other than verbs. Second, we will apply a more efficient structure and mechanism for CU search. Third, we will benchmark the combination of the CU recognizer and the parser.

References

[Bond95] F. Bond, K. Ogura, T. Kawaoka. “Noun Phrase Reference in Japanese-to-English Machine Translation.” Proceedings of TMI, 1995.

[GCKim92] G. C. Kim. “Research on English-to-Korean MT System (III): Development of Grammar Writing Tools and English Analysis Grammars.” KAIST, 1992.

[HGLee94] H. G. Lee. “Recognition of Korean-English Bilingual Idioms using Idiom Dispersion Characteristics.” Seoul National University, 1994.

[Lauer94] M. Lauer, M. Dras. “A Probabilistic Model of Compound Nouns.” Proceedings of the Seventh Joint Australian Conference on Artificial Intelligence, 1994.

[Li95] W. Li, H. Pan, et al. “Corpus-based Maximal-length Chinese Noun Phrase Extraction.” Proceedings of NLPRS, 1995.

[Longman83] Longman Dictionary of Contemporary English. Longman Dictionaries, 1983.

[Marcus93] M. Marcus, B. Santorini, M. Marcinkiewicz. “Building a Large Annotated Corpus of English: The Penn Treebank.” Computational Linguistics, 19, 1993.