Digital Humanities Abstracts

“A System for Dynamic Text Corpus Management (with an Example Corpus of the Russian Mass Media of the 1990s)”
Grigori Sidorov, National Polytechnic Institute (IPN), Mexico; Anatoly Baranov, Russian Academy of Sciences, Russia; Mikhail Mikhailov, Russian Academy of Sciences, Russia

We present a system for text corpus processing built around the idea of a "dynamic text corpus". With its help a user can search for examples of usage (words, phrases, and even morphemes), build word lists and concordances, and compile a custom subcorpus. The software was used to compile a text corpus of the modern Russian mass media: a collection of texts from Russian newspapers and magazines of the 1990s with a total size of about 15 MB. Each text of the corpus is classified by a set of parameters, including source, date, author(s), genre, and topic(s). These parameters are later used to generate subcorpora that conform to users' needs.

Introduction

Corpus linguistics is the part of computational linguistics that deals with the compilation, representation, and analysis of large text collections. One of the most complex problems in modern corpus linguistics is defining the principles of text corpus compilation. Ideally, a text corpus should meet the criterion of representativeness and at the same time be much smaller than the whole domain it represents. On the other hand, the representativeness of a text corpus depends directly on the research objectives. For example, research on text macrostructure needs quite different parameters than sociolinguistic research or the description of the contexts of usage of a certain morpheme or word. The difficulty of reconciling statistical representativeness with user demands means that many existing corpora have no explicit and clear criteria for text selection. For example, there are no clear-cut selection criteria for the well-known Birmingham corpus of English texts; the situation is the same with the German text corpora. We suggest a definite strategy for text corpus compilation that allows a user to create his own subset of texts from a corpus for his own task (as a new subcorpus). We call the initial text corpus, which is the source for further manipulation and selection, together with the corresponding software, a dynamic text corpus. To compile our corpus we used texts from Russian mass media of the 1990s.

General strategy of initial text corpus compilation

To meet the requirement of representativeness, we paid special attention to choosing the most prominent mass media publications of differing political orientation, a factor that was important to society during the period covered by the research (the 1990s), and to representing them in proportion to their popularity and significance. As the criterion of popularity we used the results of the last elections when, roughly speaking, 25 percent voted for the communists, 10 for the ultra left, 25 for the right, and 40 for the center.

The second important factor in compiling the corpus was the quantity of texts. There should be enough texts to reflect the relevant features of the domain. The upper limit was set only by pragmatic considerations: disk space and the speed of the service software. In our case, during the project, which ran in 1996-1998, we collected around 15 megabytes of text.

As stated above, different users have different tasks and expect different things from a text corpus. Some users may not be linguists. They may be interested in how the mass media reflected certain events during a certain period, and they will probably want to read whole texts and not just concordances. To accommodate such differing requirements, the corpus has to be compiled from whole texts and not from extracts. The idea of using extracts (so-called sampling) was popular at the early stage of corpus linguistics, e.g., in the famous "Brown corpus", which consists of text extracts of about 2000 words each.

Linguists from different areas also have different requirements for text corpora. For example, for morphological or syntactic research a 1 million word text corpus would be sufficient. Sometimes it is even more convenient to use a relatively small corpus, because the concordances of function words may occupy thousands of pages and most of the examples will be trivial. However, even for grammar research it seems reasonable to have texts of different structure and genre in the corpus. At the same time the text corpus should be large enough to ensure the presence of rare words. Only then is the corpus interesting for a lexicologist or a lexicographer.

Thus, the task of the compilers of a text corpus is to take into account all the different and sometimes contradictory requirements of users. We suggest allowing the user to construct his own subset of texts (his own corpus) from the dynamic text corpus. To make this possible, each document has a search pattern, a set of parameter values, which allows the software to filter the initial corpus and construct a corpus that fits the needs of the user.

Encoding of corpus units

After analysis of the text data, the following parameters were chosen as corpus-forming:
  • 1. Source (the mass media printed editions),
  • 2. Author (about 1000 authors),
  • 3. Title of the article (1369 articles),
  • 4. Political orientation (left, ultra left, right, center),
  • 5. Genre (memoir, interview, critique, discussion, essay, reportage, review, article, feuilleton),
  • 6. Theme (internal policy, external policy, literature, arts, etc.; 39 themes in total),
  • 7. Date (exact date of publication; in our case we used articles published during the 1990s).
The following printed editions (magazines and newspapers) were used: VEK, Druzba Narodov, Zavtra, Znamia, Izvestiya, Itogi, Kommunist, Literaturnaya gazeta, Molodaya gvardiya, Moskovskiy komsomolec, Moskovskie novosti, Nash sovremennik, Nezavisimaya gazeta, Novyi mir, Ogonyok, Rossiiskaya gazeta, Russki vestnik, Segodnya, Sobesednik, Sovetskaya Rossiya, Trud, Ekspert, Elementy, Evraziiskoe obozrenie. Every text in the corpus is characterized by a set of these features. At the current stage, this annotation was done manually. The best-represented sources are the following: Vek (8%), Zavtra (14%), Itogi (11%), Literaturnaya gazeta (6%), Moskovskie novosti (8%), Novy mir (8%).
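
The feature set attached to each text can be pictured as a small record. The following Python sketch is only an illustration under stated assumptions: the class name `CorpusText`, the function `matches`, the field names, and the sample records (including the placeholder sources "Source A" and "Source B") are hypothetical and are not taken from the system itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CorpusText:
    """One corpus unit described by the seven corpus-forming parameters."""
    source: str
    author: str
    title: str
    orientation: str   # "left", "ultra left", "right" or "center"
    genre: str
    themes: tuple      # assumption: a text may carry more than one theme
    published: date
    body: str = ""


def matches(text, **wanted):
    """Return True if every requested parameter fits the text.

    The pseudo-parameter `theme` is satisfied when the value is one of
    the text's themes; all other parameters are compared for equality.
    """
    for name, value in wanted.items():
        if name == "theme":
            if value not in text.themes:
                return False
        elif getattr(text, name) != value:
            return False
    return True


# Invented sample records, for illustration only.
texts = [
    CorpusText("Source A", "Author A", "Title 1", "center", "interview",
               ("internal policy",), date(1997, 3, 4)),
    CorpusText("Source B", "Author B", "Title 2", "left", "essay",
               ("literature", "arts"), date(1996, 11, 12)),
]

# A subcorpus is simply the set of texts whose features match the request.
subcorpus = [t for t in texts if matches(t, genre="essay", theme="arts")]
print([t.title for t in subcorpus])
```

Keeping the features separate from the text body is what makes later filtering cheap: a subcorpus can be selected without reading the articles themselves.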

Software description

A text corpus is incomplete and hard to work with without software that provides a user-friendly interface and allows different kinds of processing. A general problem of corpus software is selecting the texts to work with. If the user wants to deal with only certain parts of the corpus, he has to select them manually by choosing file names. This is typical of corpus software, and it is not convenient. The other possibility, having all text files merged, simply does not allow any further selection within the corpus. In our system, however, it is possible to select texts automatically using their feature sets. All the user has to do is describe the requirements for his own corpus.

Note that the collection of annotated texts is only raw material here, whereas in the traditional approach it is the final result. In the approach suggested in this article, the 'big corpus' is a source for compiling subcorpora that answer the user's needs with greater accuracy.

The initial text corpus is stored as a database where each text is a record and each parameter is a field. The texts of the articles are stored in a MEMO field. The manually marked articles are imported into the database by a special utility. On the basis of this information a user can create his own corpus by indicating a set of parameters. He does so by going through a sequence of dialogue routines, answering questions or choosing from lists. The resulting corpus is a text file containing the texts that match the selected parameters. The system provides the following main functions:
  • 1. Standard browsing of the texts and their parameters.
  • 2. Selection and ordering of texts according to the chosen parameters or their logical combinations. The system has a standard set of query-by-example (QBE) queries, which are translated automatically into SQL. Experienced users can write SQL queries directly.
  • 3. Generating a text corpus that is a subset of the initial corpus on the basis of random selection and a given percentage for each parameter.
  • 4. Generating a user's text corpus.
  • 5. Browsing the user's text corpora and text processing: building concordances or word lists.
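
Functions 2, 4 and 5 above can be sketched roughly as follows. This is a minimal illustration and not the system's own code: SQLite stands in for the original database, the table layout, the function names (`select_subcorpus`, `word_list`, `concordance`) and the English sample rows are all invented, and each text is given a single theme for simplicity.

```python
import re
import sqlite3
from collections import Counter

# Each text is a record and each parameter is a field (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts (id INTEGER PRIMARY KEY, source TEXT, genre TEXT, "
    "theme TEXT, published TEXT, body TEXT)"
)
rows = [
    ("Source A", "essay", "arts", "1996-05-01",
     "The new play opened in the old theatre."),
    ("Source B", "interview", "internal policy", "1997-02-10",
     "The minister spoke about the budget."),
    ("Source A", "essay", "arts", "1997-09-30",
     "Critics praised the play and the cast."),
]
conn.executemany(
    "INSERT INTO texts (source, genre, theme, published, body) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def select_subcorpus(conn, genre, theme):
    """Select and order texts by a logical combination of parameters."""
    cur = conn.execute(
        "SELECT body FROM texts WHERE genre = ? AND theme = ? "
        "ORDER BY published",
        (genre, theme),
    )
    return [row[0] for row in cur]


def word_list(texts):
    """Count lower-cased word forms across all selected texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def concordance(texts, keyword, width=20):
    """Return keyword-in-context lines with `width` characters per side."""
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    lines = []
    for text in texts:
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            lines.append(f"{left:>{width}}[{m.group()}]{right}")
    return lines


subcorpus = select_subcorpus(conn, "essay", "arts")
print(word_list(subcorpus).most_common(3))
for line in concordance(subcorpus, "play"):
    print(line)
```

A real Russian corpus would additionally need morphological processing to group inflected forms, which this sketch does not attempt.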
The program contains four standard variants of the initial corpus: the whole corpus and its proportional subsets, each containing 25% of the initial corpus, for the parameters source, theme, and genre.
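
One way to read "proportional subset" is as stratified random sampling: the same fraction of texts is drawn at random for every value of the chosen parameter, so the subset keeps the proportions of the initial corpus. The sketch below illustrates that reading only; the function name `proportional_subset`, the dictionary-based records, and the rule of keeping at least one text per value are assumptions, not a description of the actual program.

```python
import random
from collections import defaultdict


def proportional_subset(texts, key, fraction=0.25, seed=0):
    """Draw `fraction` of the texts at random within each value of `key`."""
    rng = random.Random(seed)          # fixed seed makes the draw repeatable
    groups = defaultdict(list)
    for text in texts:
        groups[text[key]].append(text)

    subset = []
    for value in sorted(groups):
        members = groups[value]
        # Assumption: keep at least one text so that no value disappears.
        k = max(1, round(len(members) * fraction))
        subset.extend(rng.sample(members, k))
    return subset


# Invented toy corpus: 40 articles and 20 interviews.
texts = (
    [{"id": i, "genre": "article"} for i in range(40)]
    + [{"id": 40 + i, "genre": "interview"} for i in range(20)]
)

sample = proportional_subset(texts, "genre")
print(len(sample))
```

With these toy numbers the 2:1 ratio of articles to interviews is preserved in the 25% subset.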

Conclusions

We developed a system that implements dynamic text corpus management (the software is included in the notion of the dynamic corpus) for Russian mass media texts. The system can be applied to any corpus. All texts of the corpus are classified according to the parameters described above. The system makes corpus processing easy for the user. The corpus is representative from the point of view of the chosen parameters. This means that all values and their combinations are present in the corpus, except the impossible ones: for example, the magazine Novy mir (a literary magazine) has no articles on finance, and the magazine Ekspert (a financial magazine) has no articles on literature.