“A System for Dynamic Text Corpus Management (with an
Example Corpus of the Russian Mass Media of the 1990s)”
Grigori
Sidorov
National Polytechnic Institute (IPN), Mexico
Anatoly
Baranov
Russian Academy of Sciences, Russia
Mikhail
Mikhailov
Russian Academy of Sciences, Russia
We present a system for text corpus processing built around the idea of a
"dynamic text corpus". With its help a user can search for examples of usage
(words, phrases, and even morphemes), build word lists and concordances, and
compile his own subcorpus. The software was used to compile a text corpus of
modern Russian mass media: a collection of texts from Russian newspapers and
magazines of the 1990s with a total size of about 15 MB. Each text of the corpus
is classified by several parameters, including source, date, author(s), genre,
and topic(s). These parameters are later used to generate subcorpora that
conform to users' needs.
Introduction
Corpus linguistics is the part of computational linguistics that deals with the compilation, representation, and analysis of large text collections. One of the most complex problems in modern corpus linguistics is defining the principles of text corpus compilation. Ideally, a text corpus should meet the criteria of representativeness and at the same time be much smaller than the whole field it covers. On the other hand, the representativeness of a text corpus is directly connected with the research objectives. For example, research on text macrostructure needs quite different parameters than sociolinguistic research or the description of the contexts of usage of a certain morpheme or word.
The difficulty of reconciling statistical representativeness with user demands means that many existing corpora have no explicit and clear criteria for text selection. For example, there are no clear-cut selection criteria for the well-known Birmingham corpus of English texts; the situation is the same with the German text corpora.
We suggest a definite strategy for text corpus compilation that allows a user to create his own subset of texts from a corpus for his own task (as a new subcorpus). We call the initial text corpus, which is the source for further manipulation and selection, together with the corresponding software, a dynamic text corpus. To compile our corpus we used texts from the Russian mass media of the 1990s.
General strategy of initial text corpus compilation
Taking into account the requirement of representativeness, we paid special attention to choosing the most prominent mass media editions of different political orientations, a factor that was fairly important for society during the period covered by the research (the 1990s), and to representing them in proportion to their popularity and significance. As the criterion of popularity we used the results of the last elections, in which, roughly speaking, 25 percent voted for the communists, 10 for the ultra left, 25 for the right, and 40 for the center.
The second important factor in compiling the corpus was the quantity of texts. There should be enough texts to reflect the relevant features of the field in question. The upper limit was set only by pragmatic considerations: disk space and the speed of the service software. In our case, during the project, which ran from 1996 to 1998, we collected around 15 megabytes of text.
As stated above, different users have different tasks and expect different things from a text corpus. It is also necessary to take into account that some users may not be linguists. Such users may be interested in how the mass media reflected certain events during a certain period, and they would probably like to read whole texts and not just concordances. To meet these different requirements, the text corpus must be compiled from whole texts rather than from extracts. The idea of using extracts (so-called sampling) was popular at the early stage of corpus linguistics, e.g., in the famous "Brown corpus", which consists of text extracts of about 2000 words each.
It is also necessary to take into account that linguists from different linguistic areas have different requirements for text corpora. For example, for morphological or syntactic research a 1-million-word text corpus would be sufficient.
Sometimes it is even more convenient to use a relatively small corpus, because concordances of function words may occupy thousands of pages and most of the examples will be trivial. However, even for grammar research it seems reasonable to include texts of different structure and genre in the corpus. At the same time, the text corpus should be large enough to ensure the presence of rare words; only then is the corpus of interest to a lexicologist or a lexicographer.
Thus, the task of the compilers of a text corpus is to take into account all the different and sometimes contradictory requirements of users. We suggest allowing the user to construct his own subset of texts (his own corpus) from the dynamic text corpus. To make this possible, each document has a certain search pattern, which allows the software to filter the initial corpus and construct a corpus that fits the needs of the user.
Encoding of corpus units
After analysis of the text data, the following parameters were chosen as corpus-forming:
- 1. Source (the mass media printed editions),
- 2. Author (about 1000 authors),
- 3. Title of the article (1369 articles),
- 4. Political orientation (left, ultra left, right, center),
- 5. Genre (memoir, interview, critique, discussion, essay, reportage, review, article, feuilleton),
- 6. Theme (internal policy, external policy, literature, arts, etc.; 39 themes in total),
- 7. Date (exact date of publication; in our case, articles published during the 1990s).
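The seven parameters above can be pictured as the fields of one record per article. The following Python sketch is only an illustration under our own assumptions: the class name `CorpusText`, the field names, and the helpers `matches` and `build_subcorpus` are hypothetical and do not come from the system described here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CorpusText:
    """One corpus unit: a whole article plus its seven parameters (hypothetical layout)."""
    source: str
    author: str
    title: str
    orientation: str   # "left", "ultra left", "right", or "center"
    genre: str
    themes: tuple      # an article may carry several themes
    published: date
    body: str = ""


def matches(text, **criteria):
    """Return True if the text satisfies every criterion.

    A criterion may be a single value or a set of accepted values;
    for `themes`, sharing at least one theme is enough.
    """
    for name, wanted in criteria.items():
        accepted = wanted if isinstance(wanted, (set, frozenset)) else {wanted}
        actual = getattr(text, name)
        if name == "themes":
            if not accepted & set(actual):
                return False
        elif actual not in accepted:
            return False
    return True


def build_subcorpus(corpus, **criteria):
    """Filter the initial corpus down to the texts the user asked for."""
    return [text for text in corpus if matches(text, **criteria)]
```

With such records, a request like `build_subcorpus(corpus, genre={"essay", "review"}, themes="arts")` would keep only essays and reviews that have "arts" among their themes.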
Software description
A text corpus is incomplete and hard to work with without software that provides a user-friendly interface and allows different kinds of processing. A general problem of corpus software is selecting the texts to work with. If the user wants to deal with only certain parts of the corpus, he has to select them manually by choosing file names. This is typical of corpus software, and it is not convenient. The other possibility, having all text files merged, simply does not allow any additional selection within the corpus. In our system, however, it is possible to select texts automatically using their feature sets. All the user has to do is describe his requirements for his own corpus.
We should mention that here the collection of texts with descriptions is only raw material, while in the traditional technology it is the final result. In the technology suggested in this article, the 'big corpus' is a source for compiling subcorpora that answer the user's needs with greater accuracy.
The initial text corpus is stored as a database in which each text is a record and each parameter is a field. The texts of the articles are stored in a MEMO field. The manually marked articles are imported into the database by a special utility. On the basis of this information a user can create his own corpus by indicating a set of parameters. He does so by going through a sequence of dialogue routines, answering questions or choosing from lists. The resulting corpus is a text file containing the texts that match the selected parameters.
The system provides the following main functions:
- 1. Standard browsing of the texts and their parameters.
- 2. Selection and ordering of texts according to the chosen parameters or their logical combinations. The system has a standard set of QBE queries, which are translated automatically into SQL. Experienced users can write SQL queries directly.
- 3. Generating a text corpus that is a subset of the initial corpus on the basis of random selection with a given percentage for each parameter value.
- 4. Generating a user's text corpus.
- 5. Browsing the user's text corpora and text processing: building concordances or word lists.
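The functions listed above can be sketched in a few lines of Python. This is a rough illustration and not the system's actual implementation: SQLite stands in for the unnamed database, a TEXT column stands in for the MEMO field, and the table name `texts`, its columns, and all function names are hypothetical.

```python
import random
import re
import sqlite3
from collections import Counter


def create_corpus_db(rows):
    """Store each text as a record; rows are (source, orientation, genre, body)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE texts (id INTEGER PRIMARY KEY, source TEXT,"
        " orientation TEXT, genre TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO texts (source, orientation, genre, body) VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def select_ids(conn, where="1=1", params=()):
    """Function 2: select texts by an SQL condition, ordered by id.

    `where` is interpolated into the query, so it must come from a trusted user.
    """
    rows = conn.execute(f"SELECT id FROM texts WHERE {where} ORDER BY id", params)
    return [row[0] for row in rows]


def sample_by_share(conn, field, shares, total, seed=0):
    """Function 3: random choice with a given percentage for each value of `field`.

    `field` must be a trusted column name; each quota is capped by what is available.
    """
    rng = random.Random(seed)
    chosen = []
    for value, percent in shares.items():
        ids = select_ids(conn, f"{field} = ?", (value,))
        quota = min(len(ids), round(total * percent / 100))
        chosen.extend(rng.sample(ids, quota))
    return sorted(chosen)


def _body(conn, text_id):
    """Fetch the full text of one record."""
    return conn.execute("SELECT body FROM texts WHERE id = ?", (text_id,)).fetchone()[0]


def concordance(conn, ids, word, width=30):
    """Function 5: keyword-in-context lines for `word` in the chosen texts."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    lines = []
    for text_id in ids:
        body = _body(conn, text_id)
        for m in pattern.finditer(body):
            left = body[max(0, m.start() - width):m.start()]
            right = body[m.end():m.end() + width]
            lines.append(f"{left:>{width}}[{m.group()}]{right}")
    return lines


def word_list(conn, ids):
    """Function 5: word frequencies over the chosen texts."""
    counts = Counter()
    for text_id in ids:
        counts.update(re.findall(r"\w+", _body(conn, text_id).lower()))
    return counts
```

For instance, a call such as `sample_by_share(db, "orientation", {"left": 25, "ultra left": 10, "right": 25, "center": 40}, total=100)` would request a subcorpus following the election-based proportions mentioned earlier, on the assumption that the communist share maps to the "left" orientation. The fixed `seed` only makes the random choice repeatable.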