CORP - A Corpus-oriented Parser

“CORP - A Corpus-oriented Parser”

Hong Liang Qiao University of Bergen qiao@hd.uib.no

CORP, a corpus-oriented parser, was developed in the University of Queensland. It was written in C and can be run on UNIX. The purpose of the software design is to test a novel corpus-based parsing technique. The basic idea of the parsing is to use the data extracted from the Lancaster Parsed Corpus as the training corpus and then test parsing on both the training corpus and some unseen sentences. The major types of data retrieved to support parsing are T-tags and probabilistic grammar rules. T-tags are structural boundary labels annotated in the Lancaster Parsed Corpus (hereafter called "the LPC"). T-tag is a terminology in corpus linguistics, which means that between a pair of parts-of-speech, there exist grammatical solutions of higher level structures. For example, between a noun and a verb, most probably it will end a noun phrase and open a verb phrase. However, T-tags actually found between a noun and a verb are far more than that. T-tags are established in a simple Markov model and are dynamic in syntactic context. The whole idea of T-tag oriented parsing is based on the hypothesis that if T-tags can be extracted from a corpus of systematically annotated texts in terms of syntactic structures, then such T-tags can, in return, be used in parsing sentences with such annotation by placing T-tags between pairs of tags. CORP carries out parsing by assigning T-tags between tag pairs in tag (part-of-speech) sequences to test whether T-tags can be assigned till the end of the tag sequence with proper structural openings and closings. In other words, each opening structural label should find its closing counterpart. One condition which is crucial for such a parsing approach is that the parsed corpus used to train the parser should be big enough to tackle unrestricted texts. Due to the limited hardware condition, T-tag oriented parsing will not be feasible without the application of grammar rules. Besides, some particular techniques on the basis of linguistic indication found in the study of the LPC were crucial in making the parsing practical. These techniques include the detection of >17 structural openings, inconsistent closings and STG pruning. Probabilistic data that were used finally to make a judgement on the selection of the best one, when some parses are generated. The results showed that T-tag oriented parsing is a feasible parsing approach. It also demonstrated a great potential for improvement. A multi-order Markov model may make the parser achieve a better quality of parsing as well.