Compound Unit Recognition for Efficient English-Korean Translation
Hanmin Jung
Machine Translation Laboratory, Systems Engineering Research Institute, Korea
jhm@seri.re.kr
Sanghwa Yuh
Machine Translation Laboratory, Systems Engineering Research Institute, Korea
shyuh@seri.re.kr
Taewan Kim
Machine Translation Laboratory, Systems Engineering Research Institute, Korea
twkim@seri.re.kr
Dong-In Park
Machine Translation Laboratory, Systems Engineering Research Institute, Korea
dipark@seri.re.kr
1. Introduction
If morphologically analyzed sentences are parsed directly, translation failures may occur because of the limitations of word-to-word translation. The following two sentences and their translations illustrate this difficulty.
"He gave to his opinion." -> "Geu-neun geu-eui euigyeon-eul balpyo-hayeotda."
"I don't think his work is up to much." -> "Na-neun geu-eui jakpoom-i daedan-chiantago saenggak-handa."
The underlined phrases cannot be translated directly from the meanings of their individual words. We define a compound unit (CU) as a bundle of words that is difficult to translate directly or that appears in the same form regardless of context. These units must be recognized to obtain more natural translations. [SHYoon94] classifies CUs into the five categories listed below, whereas previous work such as [Bond95], [Lauer94] and [Li95] addresses only one or two of them.
- 1. traditional metaphorical idioms: "kick the bucket" -> "jookda"
- 2. typical word usage: "prevent ~ from -ing" -> "~ga -haneun geot-eul banghae-hada"
- 3. compound words: "operating system" -> "oonyeong chaeje"
- 4. lexical gaps: "by name" -> "jimyeong-hayeo"
- 5. frozen expressions: "How are you?" -> "annyeonhaseyo?"
2. Terminology
(1) Conjugation Tags: These tags represent conjugated verb forms. VBD, VBG, VBN, VBP and VBZ are the conjugation tags in [Marcus93]. If the first constituent of a CU carries one of these tags, that conjugation tag is used to represent the unit instead of the pre-defined representative POS tag stored in the CU dictionary.
(2) Verb Type Information: This information consists of two categories: action mode/position information and verb-modifee form information. Action mode/position information consists of seven types that represent the properties of a verb; for example, a T-type verb is a transitive verb with an object. Verb-modifee form information constrains the constituents that are modified by the current verb; for instance, 1-type means that one or more nouns follow the verb. We use verb type information adapted from [Longman83] and [GCKim92].
(3) Variable Constituent and Fixed Constituent: The CU "keep ~ in ~ mind" can be found in the sentence "I kept the words in my mind yesterday". The words "keep", "in" and "mind" of the unit are fixed regardless of context; we define such words as fixed constituents. "the words" and "my" can be replaced with other words depending on the context; these context-sensitive words are defined as variable constituents.
(4) Pseudo Syntactic Tag: A pseudo syntactic tag is a non-negative integer assigned to CUs and variable constituents. The tag carries a syntactic meaning that can be mapped to a syntactic tag. For example, the CU "keep ~ in ~ mind" is converted into "keep *1#1 in *1#2 mind" as a dictionary entry, where "*1" is the pseudo syntactic tag representing the unit, and "#1" and "#2" are those of the two variable constituents. All pseudo syntactic tags are attached to the first word of the corresponding unit or constituent. This notation allows CUs with embedded structure to be represented hierarchically. The following example shows an embedded structure and its expression.
[sentence] "Investors continued to pour into money funds"
[CU] "continue to #1(VB)" -> *1, "pour into #1(NP)" -> *2, "money funds" -> *3
[sentence with pseudo syntactic tags] "Investors continued(*1) to pour(*2, *1#1) into money(*3, *2#1) funds"
{*n | n = 1, 2, 3, ...} is for CU, {#n | n = 0, 1, 2, ...} is for variable constituent.
3. System Structure
Figure 1 shows the system structure of the CU recognizer. The CU search module extracts all possible CUs for each word by referencing the index, which is the in-memory view of the index dictionary, a binary form of the CU dictionary. The POS attachment module assigns a representative POS tag to each extracted CU. When the representative constituent of the unit (the first fixed constituent of the CU) is a verb or one of its conjugated forms, this module is not applied. This alternative processing enables the syntactic analyzer to use the original meaning of the input context. The recognition result creation module retrieves translations for the recognized units and builds the corresponding data structures from the results.
4. Compound Unit Dictionary
The following shows the entry format of the CU dictionary.
(FFC FFCN CN ... {"FC" FCi} | {"VC" VCj CCSSF CCSSk CCPN {CCP | CCPSl} VMFC} ... RPTC APC TN ... {TCNi ... {Tij STN ... STk ...} ...} ...)
Table 1. Acronyms and their meanings in the CU dictionary
acronym | meaning |
---|---|
FFC | first fixed constituent |
FFCN | the number of first fixed constituents |
CN | the number of constituents |
"FC", "VC" | constituent type identifier |
FCi | ith fixed constituent |
VCj | jth variable constituent |
CCSSF | co-occurrence constraint string set flag (0/1) |
CCSSk | kth co-occurrence constraint string set |
CCPN | the number of co-occurrence constraint POS (0/1/2) |
CCP | co-occurrence constraint POS |
CCPSl | lth co-occurrence constraint POS set |
VMFC | verb-modifee form code |
RPTC | representative POS tag code |
APC | action mode/position code |
TN | the number of translations |
TCNi | the number of ith translation constituents |
Tij | jth translation constituent of ith translation |
STN | the number of syntactic tags |
STk | kth syntactic tag |
{} | virtual set |
\| | selecting one alternative |
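
As an illustration of the entry format, the following Python sketch models a heavily simplified CU dictionary entry. It is a minimal sketch under stated assumptions, not the binary layout used by the system: the class names `Constituent` and `CUEntry`, their field names, the sample POS constraint, and the placeholder translation string are all hypothetical. Only the acronyms quoted in the comments come from Table 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    kind: str                                  # "FC" (fixed) or "VC" (variable)
    text: Optional[str] = None                 # FCi: surface form, fixed constituents only
    pos_constraints: List[str] = field(default_factory=list)  # CCP / CCPSl (assumed values)


@dataclass
class CUEntry:
    constituents: List[Constituent]            # CN is simply len(constituents)
    rep_pos: str                               # RPTC: representative POS tag
    translations: List[str]                    # flattened stand-in for Tij

    @property
    def first_fixed(self) -> str:
        """FFC: the first fixed constituent, used as the lookup key."""
        return next(c.text for c in self.constituents if c.kind == "FC")

    def pattern(self) -> str:
        """Render the entry with #n placeholders for variable constituents."""
        parts, n = [], 0
        for c in self.constituents:
            if c.kind == "FC":
                parts.append(c.text)
            else:
                n += 1
                parts.append(f"#{n}")
        return " ".join(parts)


# Hypothetical entry for the CU "keep ~ in ~ mind" from Section 2.
keep_in_mind = CUEntry(
    constituents=[
        Constituent("FC", "keep"),
        Constituent("VC", pos_constraints=["NP"]),   # assumed constraint
        Constituent("FC", "in"),
        Constituent("VC"),                           # no-constraint variable constituent
        Constituent("FC", "mind"),
    ],
    rep_pos="VB",
    translations=["<Korean translation template>"],  # placeholder, not real data
)

print(keep_in_mind.first_fixed)   # keep
print(keep_in_mind.pattern())     # keep #1 in #2 mind
```

Keeping fixed and variable constituents in one ordered list mirrors the alternation of "FC" and "VC" items in the entry format; the verb type codes (VMFC, APC) and syntactic tags (STk) are omitted here for brevity.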
5. Compound Unit Search Algorithms
The principle of CU search is "most-specific-expression-first" [SHYoon94]. This means "fixed constituent first, variable constituent next" and "longer expression (CU) first"; that is, the longest of the successfully found expressions for a word is expected to be the best. It also implies that at most one expression can be extracted for each word in a sentence. Thus, the number of expressions is less than or equal to the number of words in a sentence. The "most-specific-expression-first" principle can be defined by the "more-specific-than" relation >> as follows.
fixed constituent >> variable constituent
a >> b iff a1 >> b1, or a1 = b1 and a2a3...an >> b2b3...bn (applied recursively),
where a = a1a2a3...an and b = b1b2b3...bn.
CUs are searched in the index, which is the in-memory view of the index dictionary built from the CU dictionary. Its structure is a modified trie that can represent heterogeneous constituent types (fixed and variable). Figure 3 shows the index structure in memory. The index consists of (1) a beginning index, (2) a constituent index and (3) a representative information index. Each element of the beginning index is the first two characters of a first fixed constituent. Empirically, using the first two characters instead of one reduces the number of searched nodes by about 20 to 80%. The constituent index is a modified trie that represents the two kinds of constituents. We use a "method" mechanism to handle the heterogeneous types: control at a constituent node is transferred according to its "method". The "method" types and their actions are as follows.
- 1. DO_GO_CHILD: in case of exact matching for fixed constituent, go to child node
- 2. DO_GO_SIBLING: in case of matching failure, go to sibling node
- 3. DO_SKIP_TO_CHILD: in case of no-constraint variable constituent, skip to child node
- 4. DO_SKIP_TO_NEXT_WORD: in case of matching failure after DO_SKIP_TO_CHILD, skip to next word
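The four "method" types above can be sketched in Python as a walk over a small trie. This is a simplified illustration under several assumptions, not the system's actual index: the `Node` layout, the `VAR` marker, the `max_skip` bound on how many words a variable constituent may absorb, and the use of already lemmatized input words are all hypothetical. Only the method names and the two-character beginning index come from the description above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

VAR = None  # assumed marker for a no-constraint variable constituent


@dataclass
class Node:
    label: Optional[str]                        # fixed word, or VAR
    children: List["Node"] = field(default_factory=list)
    entry: Optional[str] = None                 # CU name if a CU ends at this node


def build_index(entries: Dict[str, List[Optional[str]]]) -> Dict[str, Node]:
    """Beginning index keyed by the first two characters of the first fixed constituent."""
    index: Dict[str, Node] = {}
    for name, pattern in entries.items():
        node = index.setdefault(pattern[0][:2], Node(label=None))
        for label in pattern:
            child = next((c for c in node.children if c.label == label), None)
            if child is None:
                child = Node(label)
                node.children.append(child)
                # "fixed constituent first, variable constituent next"
                node.children.sort(key=lambda c: c.label is VAR)
            node = child
        node.entry = name
    return index


def search(node: Node, words: List[str], i: int, stats: Counter,
           max_skip: int = 4) -> Optional[Tuple[int, str]]:
    """Return (exclusive end index, CU name) of the longest CU reachable from node."""
    best = (i, node.entry) if node.entry else None
    for child in node.children:
        found = None
        if child.label is VAR:
            stats["DO_SKIP_TO_CHILD"] += 1
            for k in range(1, max_skip + 1):    # variable constituent absorbs k words
                if i + k > len(words):
                    break
                found = search(child, words, i + k, stats, max_skip)
                if found is not None:
                    break
                stats["DO_SKIP_TO_NEXT_WORD"] += 1
        elif i < len(words) and words[i] == child.label:
            stats["DO_GO_CHILD"] += 1
            found = search(child, words, i + 1, stats, max_skip)
        else:
            stats["DO_GO_SIBLING"] += 1
        # "longer expression first"; ties keep the earlier (fixed-first) sibling
        if found is not None and (best is None or found[0] > best[0]):
            best = found
    return best


def recognize(index: Dict[str, Node], words: List[str], start: int,
              stats: Counter) -> Optional[Tuple[int, str]]:
    root = index.get(words[start][:2])
    return search(root, words, start, stats) if root is not None else None


index = build_index({
    "keep ~ in ~ mind": ["keep", VAR, "in", VAR, "mind"],
    "money funds": ["money", "funds"],
})
stats = Counter()
words = "i keep the words in my mind yesterday".split()  # assumed lemmatized input
result = recognize(index, words, 1, stats)
print(result)        # (7, 'keep ~ in ~ mind')
print(dict(stats))
```

In this sketch each failed comparison against a sibling is counted as one DO_GO_SIBLING, so a node with many siblings and few children drives that counter up quickly, which is the kind of behavior the method counts are meant to expose.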
6. Experimental Results
6.1 Test Corpus
Our test corpus is the "Wall Street Journal" portion of the Penn Treebank [Marcus93]. 1281 sentences were extracted and tested for experimental analysis. The CU dictionary has 1222 entries, which were extracted manually from the 1281 sentences. The average number of words per sentence is 15.4, with a standard deviation of 3.7. With our tagger, the sentences have an average of 157.2 paths, with a standard deviation of 391.5; that is, the tagger produces about 157 paths per sentence owing to POS ambiguities.
6.2 Compound Unit Recognition
A sentence in the test corpus contains 1.7 CUs on average. The representative POSs drawn from the corpus are VB (verb), NN (noun), IN (preposition), RB (adverb) and JJ (adjective). Of the 170 CUs found, 63.19% are compound nouns, 28.22% are phrasal verbs and 8.59% belong to other types. Our system achieves a correct recognition rate of 95.88%. Table 2 shows the experimental results for recognition.
Table 2. Experimental results for CU recognition
item | value |
---|---|
extracted and tested sentences | 1281 |
recognition rate | 95.88% |
unrecognition rate | 4.12% |
misrecognition rate | 1.76% |
average number of segmentations | 4.1 (standard deviation 2.3) |
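
The reported rates can be cross-checked with simple arithmetic. The counts used below (163 recognized, 7 unrecognized and 3 misrecognized out of 170, and a 103/46/14 split by type) are not stated in the paper; they are inferred here as integers that reproduce the published percentages, so they should be read as an assumption rather than as reported data.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as in Table 2."""
    return round(100.0 * part / whole, 2)


total_cus = 170            # from the text
recognized = 163           # inferred, not reported
unrecognized = 7           # inferred, not reported
misrecognized = 3          # inferred, not reported

print(pct(recognized, total_cus))      # 95.88
print(pct(unrecognized, total_cus))    # 4.12
print(pct(misrecognized, total_cus))   # 1.76

# Inferred split by type; note the denominator is 163, not 170.
compound_nouns, phrasal_verbs, others = 103, 46, 14
print(pct(compound_nouns, recognized))  # 63.19
print(pct(phrasal_verbs, recognized))   # 28.22
print(pct(others, recognized))          # 8.59
```

If these inferred counts are right, the type percentages are taken over the correctly recognized CUs rather than over all 170, and the recognition and unrecognition rates sum to 100% while the misrecognition rate is counted separately.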
6.3 Compound Unit Search
Figure 4 shows the ratio of "method" types used on the index. The average number of method invocations per sentence is 9.5 for GO-CHILD, 117.4 for GO-SIBLING, 0.9 for SKIP-TO-CHILD and 3.9 for SKIP-TO-NEXT-WORD. Currently, GO-SIBLING is used far more often than the other methods. The reason lies in the structure of the constituent index: it is a modified trie in which a whole entry is found only after descending several levels, and, experimentally, the index has far more siblings spread across each level than children. We leave this problem for future work.
7. Conclusion
Our CU recognizer is designed to find units of six categories. The experimental results show the usefulness of our system, which makes this approach a promising way to build a high-performance parser in terms of parsing time and space. The strong points of the system are as follows.
- 1. Various CUs, not just restricted simple idioms or compound nouns, are recognized.
- 2. Parsing space is reduced by the divide-and-conquer strategy made possible by combining parsing with CU recognition.
- 3. Co-occurrence constraint POS information resolves some POS ambiguities.
- 4. Co-occurrence constraint POS/string sets provide flexible recognition.
- 5. Verb type information makes high-performance parsing possible by means of restricted form information.
- 6. Translation information including pseudo syntactic tags offers natural translation.
- 7. Pseudo syntactic tags for both the representative constituent and variable constituents make the processing of embedded structure possible.
- 8. The syntactic tag in a CU enables predictive top-down parsing when the range of the unit cannot be determined.
References
F. Bond, K. Ogura, T. Kawaoka. “Noun Phrase Reference in Japanese-to-English Machine Translation.” Proceedings of TMI, 1995.
G. C. Kim. “Research on English-to-Korean MT System (III): Development of Grammar Writing Tools and English Analysis Grammars.” KAIST, 1992.
H. G. Lee. “Recognition of Korean-English Bilingual Idioms using Idiom Dispersion Characteristics.” Seoul National University, 1994.
M. Lauer, M. Dras. “A Probabilistic Model of Compound Nouns.” Proceedings of the Seventh Joint Australian Conference on Artificial Intelligence, 1994.
W. Li, H. Pan, et al. “Corpus-based Maximal-length Chinese Noun Phrase Extraction.” Proceedings of NLPRS, 1995.
Longman Dictionary of Contemporary English. Longman Dictionaries, 1983.
M. Marcus, B. Santorini, M. Marcinkiewicz. “Building a Large Annotated Corpus of English: The Penn Treebank.” Computational Linguistics, 19, 1993.