“Austrian Academy Corpus - Doing literary markup by means of XML”
Karlheinz Moerth, Austrian Academy of Sciences, Austria
Hanno Biber, Austrian Academy of Sciences, Austria
The poster we intend to present gives some details concerning the pilot phase of the newly founded Austrian Academy Corpus (AAC). The AAC is a project that was started with the aim of building up a digital corpus tailored to the particular needs of scholars doing research in the field of literary studies. So far, the build-up of electronic text collections has been the domain of linguists. Although literary texts are becoming available in ever-increasing numbers, many of these products originated from linguistic endeavours to gather data for lexicographic or learners' purposes. Literary scholars have rather neglected the whole issue. Most of the existing digital literary resources are poorly edited and serve for little more than simplistic word searches. As the exigencies of work on literary texts differ considerably from those of linguistic studies, the AAC has been trying to work towards solutions that meet the necessities of the literary domain.
In its work the AAC proceeds from digital data which have been collected at the Austrian Academy of Sciences during the last few years. At that time the department's focus was on literary sources classified as functional literary text types, i.e. magazines, diaries, sermons, speeches, letters, obituaries and the like. The materials in the corpus thus display considerable variety as to content and structure. The data were accumulated primarily for text-lexicographic purposes, in preparation of a phraseological dictionary published in 1999. This text-dictionary was based on a literary satirical magazine which appeared for the first time in 1899 in fin-de-siècle Vienna and was published until 1936. This magazine, 'Die Fackel', which was very popular among intellectuals in its time, is still of utmost interest to scholars of German literature trying to understand this crucial period of history.
The electronic text of 'Die Fackel' was generated in a first run by means of rather unsophisticated OCR, a process which was completed several years ago. During the compilation of the above-mentioned text-dictionary the quality of this electronic text was improved in several runs of proofreading, and basic markup developed specifically for this purpose was added. Only very few tags were applied, indicating for example the position of images or identifying special characters which were not supported by the standard code-pages of the user interface.
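The following fragment sketches the kind of minimal tagging meant here; the element names are invented purely for illustration and are not the scheme that was actually used during the proofreading phase.

    <!-- hypothetical illustration only, not the AAC's original tag set -->
    <page n="17">
      <img pos="top"/>                        <!-- records the position of an image -->
      Running text of the page, with a
      <special name="greek-alpha"/>           <!-- a character the code-page could not represent -->
      marking a character that could not be entered directly.
    </page>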
The texts being digitised at the moment (primarily historical magazines) undergo several stages of processing: they are first scanned and made readable by means of OCR, then corrected several times. The application of markup is the last step in this process. Formatting information of the digitised texts is in the first instance preserved in file formats such as RTF or by means of standardised style sheets, which yield quite reasonable results. With the increasing availability of more advanced text encoding techniques, the working group of the AAC started to think about a more general approach to cope with the manifold problems of integrating text and text-related secondary data.
We had to look for a markup scheme which allowed for labelling different types of data within one coherent markup system. In our endeavours to find appropriate ways of describing the structure of texts, linking up data and facilitating structured searches, we started experimenting with applying XML (Extensible Markup Language) to our data.
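By way of illustration, one coherent document could hold the transcribed text together with text-related secondary data such as bibliographic details or editorial notes. The fragment below is a purely hypothetical sketch and not a sample of the AAC's actual encoding.

    <document>
      <header>
        <source>Die Fackel, Nr. 1, 1899</source>    <!-- bibliographic (secondary) data -->
      </header>
      <text>
        <page id="p1" n="1">Transcribed text of the first page ...</page>
      </text>
      <notes>
        <note target="p1">Editorial comment linked to the page above.</note>
      </notes>
    </document>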
The issue of XML, especially the question of XML versus SGML, has been touched upon repeatedly in Humanities computing. XML, as the youngest descendant of the SGML/HTML family, is a subset of SGML and can be viewed as the logical further development of SGML. As a general rule one can assume that SGML-encoded data can easily be transformed into XML data, and to a certain extent even the other way round. The XML-related experiments at our department started in the early stages of XML's appearance on the scene, i.e. in 1998. As yet only parts of the corpora of the AAC have been provided with XML-conforming markup, as we are still trying to fathom out the potential benefits of XML for our work. Basic markup (start and end of pages, line breaks), character encoding, as well as the concatenation of hyphenated words, were carried out by means of a special conversion engine.
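A short sketch of what such conversion output might look like is given below; the element and attribute names are illustrative assumptions, not the AAC's actual tag set. Where the printed page hyphenates a word across two lines, the word is rejoined and the line-break element can record that fact.

    <!-- sketch only: element and attribute names are illustrative -->
    <pb n="24"/>                               <!-- start of page 24 -->
    <lb n="1"/>First line of the running text of the page,
    <lb n="2"/>second line ending in a hyphenated word that
    <lb n="3" joined="true"/>has been concatenated again during conversion.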
There are many different ways to perform such transformations. Many would prefer to accomplish such a task using scripting languages such as Perl, which, of course, can be a quite practicable and efficient way to do it. We tried to use graphical user interfaces from the very beginning of our work in order to ensure constant monitoring and controlling of the transformation processes. We therefore developed our tools using the programming language Delphi, a RAD environment which helped to cut development time. The absence of any XML-compliant browser software in the early phase forced us to seek a solution of our own and consequently to develop a graphical interface that brings our data to the screen and displays them in readable form.
These experiments showed that the fundamentally strict structure of XML documents makes parsing fairly straightforward. Even in the absence of a definitive set of tags, XML files can still be processed without parsing an attached DTD (Document Type Definition).
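The following sketch illustrates the point: the file declares no DTD at all, yet because it is well-formed it can be read by any XML parser without reference to a document type definition (the element names are again purely illustrative).

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- no DOCTYPE declaration: the document is well-formed but not validated -->
    <issue n="1" year="1899">
      <page n="1">
        <line>Text of the first line ...</line>
      </page>
    </issue>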
As the texts we are working on contain passages in various languages and scripts, XML's compliance with the Unicode Standard has proved extremely helpful. Unicode has brought about the unification of a huge number of diverging character encoding standards and will in the future ensure the exchangeability and interoperability of all sorts of textual data.
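In practice this means that a single UTF-8 encoded XML file can hold passages in several languages and scripts side by side, distinguished by xml:lang attributes; the element names below are illustrative assumptions.

    <?xml version="1.0" encoding="UTF-8"?>
    <quotations>
      <quote xml:lang="de">Wien um die Jahrhundertwende</quote>
      <quote xml:lang="fr">la fin de siècle</quote>
      <quote xml:lang="grc">ἐν ἀρχῇ ἦν ὁ λόγος</quote>
    </quotations>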
XML not only allows generic, highly specialised encoding but also enables the flawless exchange of data among different systems.