“Construction of a Russian Corpus-Driven Dictionary Based on Monitor Corpora”

Serge A. Yablonsky, St. Petersburg University of Transport, Russicon Company, Russia

1. Introduction

Monitor corpora are of interest to lexicographers and language learners, who can trawl a stream of new texts for new words or for changing meanings of old words (Collins COBUILD, 1995; McEnery T., Wilson A., 1996). Their main advantages are that they are not static and that they provide a large and broad sample of the language. Applying language processing technologies to the construction of shareable, multifunctional language corpora has produced promising results (Varile G. B., Zampolli A., 1996). Progress in Russian language processing now makes it possible to build Russian monitor corpora that are closely linked, by means of linguistic software, to a set of electronic dictionaries. Our approach depends in particular on the language processor Russicon and on extensive use of the Russicon electronic dictionaries (Yablonsky S.A., 1998).

2. Composition of the corpora

The main part of the corpus was described in (Yablonsky S.A., 1999 a,b). The current corpus covers a wide range of 19th- and 20th-century Russian texts: literature, criticism, philosophy, religion, newspapers, memoirs, law, business, computing, historical documents, stenographic records, translations, folklore, Internet literature, "underground" literature, etc. The texts are taken from printed sources, CD-ROMs and the Internet. ASCII and Unicode are the basic text encoding standards. In addition, SGML, HTML and XML markup is produced by specially designed conversion programs written in C. SGML configuration of the texts is done with the SoftQuad SGML Publishing Suite. The text collection will continue to grow as resources are created and encoded. The open-ended (constantly growing) Russian monitor corpus supports dictionary building because it lets lexicographers keep track of new words entering the language, existing words changing their meanings, and shifts in the balance of their use across genres.
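The idea of watching incoming texts for new words can be illustrated with a minimal Python sketch. It is not the Russicon implementation: the function names, the regular-expression tokenizer, the tiny set of known forms and the sample sentences are all assumptions made for illustration, and a real system would check tokens against the electronic dictionaries rather than a hand-written set.

```python
import re
from collections import Counter

# Assumption: a word is a run of Unicode letters, optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Return lower-cased word tokens found in the text."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]


def new_word_candidates(new_texts, known_forms, min_count=2):
    """Count tokens from newly added texts that are absent from known_forms.

    Only tokens seen at least min_count times are returned, so that
    one-off typos are less likely to be reported as new words.
    """
    counts = Counter()
    for text in new_texts:
        counts.update(t for t in tokenize(text) if t not in known_forms)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical data: a toy list of known word forms and a batch of new texts.
known_forms = {"новый", "текст", "в", "корпусе"}
batch = ["Новый блогер в корпусе.", "Блогер пишет новый текст."]
candidates = new_word_candidates(batch, known_forms, min_count=2)
print(candidates)
```

The frequency threshold is a design choice of this sketch only; the candidates it returns would still need review by a lexicographer.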

3. Corpus-driven dictionary

The chief distinction of the corpus is its close connection with the set of Russian electronic dictionaries and language processing tools, in particular the language processor Russicon. Every word of the corpus is at the same time an entry word of the corpus-driven dictionary, and vice versa. For any input form of a Russian word, the dictionary outputs:

one or several lemmas (lexical homonyms);
one or several sets (in the case of morphological homonyms) of the following grammatical characteristics: part of speech, case, gender, number, tense, person, degree of comparison, voice, aspect, mood, form, type, transitivity, reflexivity, animacy;
the synonym set(s);
the antonyms;
the precise definitions;
the explanatory comments;
all or several examples of usage in the corpora.
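The output listed above can be pictured as a record per word form. The following Python sketch shows one possible shape for such a record; the class names, field names and the sample entry are hypothetical and are not taken from the Russicon dictionaries, whose actual storage format the abstract does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """One reading of a word form: a lemma plus its grammatical characteristics."""
    lemma: str
    grammemes: dict  # e.g. {"part_of_speech": "noun", "case": "genitive"}


@dataclass
class Entry:
    """Everything the dictionary returns for a single input word form."""
    word_form: str
    analyses: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    antonyms: list = field(default_factory=list)
    definitions: list = field(default_factory=list)
    comments: list = field(default_factory=list)
    examples: list = field(default_factory=list)  # usage examples from the corpus

    def lemmas(self):
        """Distinct lemmas in order of first appearance."""
        return list(dict.fromkeys(a.lemma for a in self.analyses))


# Illustrative entry: the form "стали" belongs both to the noun "сталь" (steel)
# and to the verb "стать" (to become), a case of lexical homonymy.
entry = Entry(
    word_form="стали",
    analyses=[
        Analysis("сталь", {"part_of_speech": "noun", "gender": "feminine",
                           "number": "singular", "case": "genitive"}),
        Analysis("стать", {"part_of_speech": "verb", "tense": "past",
                           "number": "plural", "aspect": "perfective"}),
    ],
    examples=["Они стали друзьями."],
)
print(entry.lemmas())
```

Keeping the analyses as a list lets one word form carry several lemmas and several sets of grammatical characteristics, which is what the homonymy cases in the list require.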

At the same time, users can search for word-combination patterns, check word frequencies, and see examples of all the uses of particular words. The pilot system is implemented on the IBM PC using Visual Basic 6.0 and MS SQL Server 7.0 and runs in single-user and local-network modes.
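The three kinds of query mentioned here (word frequencies, usage examples and word-combination patterns) can be sketched in a few lines of Python. This is an illustration only: the pilot system itself is built with Visual Basic 6.0 and MS SQL Server 7.0, and the function names, the tokenizer and the sample sentences below are assumptions. The sketch matches exact word forms; lemma-based search would additionally need the dictionary.

```python
import re
from collections import Counter

# Assumption: a word is a run of Unicode letters, optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Return lower-cased word tokens found in the text."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]


def word_frequencies(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def concordance(texts, word, width=2):
    """Return (left context, keyword, right context) for every occurrence of word."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == word:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines


def bigram_counts(texts):
    """Count adjacent word pairs within each text, a simple word-combination pattern."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical two-sentence corpus.
texts = ["Белая ночь в городе.", "Ночь была белая и тихая."]
freq = word_frequencies(texts)
kwic = concordance(texts, "ночь", width=1)
pairs = bigram_counts(texts)
print(freq["ночь"], kwic, pairs[("белая", "ночь")])
```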

References

B. M. Belyaev, A. S. Surcis, S. A. Yablonsky. “Russian Language Processor RUSSICON: Design and Applications.” Proceedings of the East-West Artificial Intelligence Conference (EWAIC-93), Moscow, 1993. 175-180.

Collins COBUILD on CD-ROM. London: HarperCollins, 1995.

T. McEnery, A. Wilson. Corpus Linguistics. Edinburgh: Edinburgh University Press, 1996.

G. B. Varile, A. Zampolli. Survey of the State of the Art in Human Language Technology. Cambridge: Cambridge University Press, 1996.

S. A. Yablonsky. “Russicon Slavonic Language Resources and Software.” Proceedings of the First International Conference on Language Resources & Evaluation, Granada, Spain. Ed. A. Rubio, N. Gallardo, R. Castro, A. Tejada. 1998. 1141-1147.

S. A. Yablonsky. “Russian Written Language Corpora Development.” Proceedings of the International Seminar Dialog99, May 30-June 8, Tarussa, Russia. 1999a.

S. A. Yablonsky. “Russian 20th Century Literature Digital Library for Language Teaching.” Proceedings of the International Conference of the ACH/ALLC Digital Libraries for Humanities Scholarship and Teaching, June 9-13, 1999, University of Virginia, Charlottesville, Virginia, USA. 1999b.