“CORP - A Corpus-oriented Parser”
Hong Liang Qiao
University of Bergen
qiao@hd.uib.no
CORP, a corpus-oriented parser, was developed at the University of Queensland. It
was written in C and runs on UNIX. The software was designed to test a novel
corpus-based parsing technique. The basic idea is to use data extracted from the
Lancaster Parsed Corpus as training material and then to test the parser on both
the training corpus and some unseen sentences. The two major types of data
retrieved to support parsing are T-tags and probabilistic grammar rules.
T-tags are structural boundary labels annotated in the Lancaster Parsed Corpus
(hereafter called "the LPC"). T-tag is a term from corpus linguistics: it
denotes the higher-level grammatical structures that close or open between a
pair of adjacent parts of speech. For example, the boundary between a noun and a
verb will most probably close a noun phrase and open a verb phrase. However, the
T-tags actually found between a noun and a verb are far more varied than that.
T-tags are established in a simple Markov model and are dynamic, depending on
syntactic context. The
whole idea of T-tag oriented parsing is based on the hypothesis that if T-tags
can be extracted from a corpus of systematically annotated texts in terms of
syntactic structures, then such T-tags can, in return, be used in parsing
sentences with such annotation by placing T-tags between pairs of tags. CORP
parses by assigning T-tags between tag pairs in a tag (part-of-speech) sequence
and testing whether T-tags can be assigned up to the end of the sequence with
properly matched structural openings and closings. In other words, each opening
structural label must find its closing counterpart. One condition crucial to
this approach is that the parsed corpus used to train the parser be large enough
to cover unrestricted text.
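The idea of extracting T-tags and then assigning them between tag pairs can be illustrated with a minimal Python sketch. It is not CORP's implementation (CORP was written in C), and everything in it is an assumption made for illustration: the toy bracket notation (`[N` opens a structure, `N]` closes it), the tag names, the `<s>` and `</s>` boundary markers, and the function names `extract_ttags`, `apply_ttag` and `parse`. A T-tag is modelled here simply as the tuple of bracket labels found between two adjacent tags.

```python
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary pseudo-tags


def is_label(tok):
    """True for structural labels such as '[N' (opening) or 'N]' (closing)."""
    return tok.startswith("[") or tok.endswith("]")


def extract_ttags(parsed):
    """Collect, for every adjacent tag pair, the label sequences seen between them."""
    table = defaultdict(set)
    for tokens in parsed:
        left, between = START, []
        for tok in tokens + [END]:
            if is_label(tok):
                between.append(tok)
            else:
                table[(left, tok)].add(tuple(between))
                left, between = tok, []
    return table


def apply_ttag(stack, ttag):
    """Return the new stack of open labels, or None if a closing does not match."""
    stack = list(stack)
    for label in ttag:
        if label.startswith("["):
            stack.append(label[1:])
        elif not stack or stack.pop() != label[:-1]:
            return None
    return stack


def parse(tags, table):
    """Yield every T-tag assignment that reaches the end with all openings closed."""
    seq = [START] + tags + [END]

    def walk(i, stack, out):
        if i == len(seq) - 1:
            if not stack:  # every opening found its closing counterpart
                yield out
            return
        for ttag in sorted(table.get((seq[i], seq[i + 1]), ())):
            new_stack = apply_ttag(stack, ttag)
            if new_stack is not None:
                yield from walk(i + 1, new_stack, out + list(ttag) + [seq[i + 1]])

    for result in walk(0, [], []):
        yield result[:-1]  # drop the END marker


# Toy training data in the assumed notation (not actual LPC text).
corpus = [
    ["[S", "[N", "AT", "NN", "N]", "[V", "VBD", "V]", "S]"],
    ["[S", "[N", "NN", "N]", "[V", "VBD", "[N", "AT", "NN", "N]", "V]", "S]"],
]
table = extract_ttags(corpus)

# An unseen tag sequence, parsed only from the T-tags observed in training.
for p in parse(["AT", "NN", "VBD", "AT", "NN"], table):
    print(" ".join(p))  # [S [N AT NN N] [V VBD [N AT NN N] V] S]
```

The sketch searches exhaustively, so the number of candidate assignments can grow quickly with sentence length and with the number of T-tags observed per tag pair.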
Given limited hardware resources, T-tag oriented parsing would not be
feasible without the application of grammar rules. In addition, several
techniques based on linguistic evidence found in the study of the LPC
were crucial in making the parsing practical. These techniques include the
detection of >17 structural openings, inconsistent closings and STG pruning.
Finally, when more than one parse was generated, probabilistic data were used to
select the best one.
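How probabilistic grammar rules might be used to choose among competing parses can be sketched as follows. The abstract does not say how CORP combines its probabilities, so this Python sketch assumes one common scheme: rule probabilities are estimated by relative frequency from a parsed corpus, and each candidate is scored by the sum of the log probabilities of its rules. The bracket notation, the toy treebank, the `floor` value for unseen rules, and the names `rules_of`, `train`, `log_score` and `best_parse` are all hypothetical.

```python
import math
from collections import Counter


def rules_of(tokens):
    """List the (parent, children) rules used in one bracketed parse."""
    found, stack = [], []  # each stack frame is [label, children]
    for tok in tokens:
        if tok.startswith("["):
            stack.append([tok[1:], []])
        elif tok.endswith("]"):
            label, children = stack.pop()
            found.append((label, tuple(children)))
            if stack:
                stack[-1][1].append(label)
        else:
            stack[-1][1].append(tok)
    return found


def train(parsed):
    """Relative-frequency estimate of P(children | parent)."""
    rule_counts = Counter(r for tokens in parsed for r in rules_of(tokens))
    parent_counts = Counter()
    for (parent, _), n in rule_counts.items():
        parent_counts[parent] += n
    return {r: n / parent_counts[r[0]] for r, n in rule_counts.items()}


def log_score(tokens, probs, floor=1e-6):
    """Sum of log rule probabilities; unseen rules get a small assumed floor."""
    return sum(math.log(probs.get(r, floor)) for r in rules_of(tokens))


def best_parse(candidates, probs):
    """Pick the candidate parse with the highest log score."""
    return max(candidates, key=lambda c: log_score(c, probs))


# Toy treebank in the assumed notation (not actual LPC text).
treebank = [
    "[S [N AT NN N] [V VBD V] S]".split(),
    "[S [N AT NN N] [V VBD [N AT NN N] V] S]".split(),
    "[S [N NN N] [V VBD V] S]".split(),
]
probs = train(treebank)

# Two competing parses of the same tag sequence NN VBD AT NN.
cand_a = "[S [N NN N] [V VBD [N AT NN N] V] S]".split()
cand_b = "[S [N NN N] [V VBD V] [N AT NN N] S]".split()
print(" ".join(best_parse([cand_a, cand_b], probs)))
# [S [N NN N] [V VBD [N AT NN N] V] S]
```

Candidate B loses because it needs the rule S -> N V N, which never occurs in the toy treebank and therefore receives only the floor probability.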
The results showed that T-tag oriented parsing is a feasible approach and
that it has great potential for improvement. A multi-order Markov model
may also enable the parser to achieve better parsing quality.