Digital Humanities Abstracts

“The Austrian Academy Corpus - Digital Resources and Textual Studies”
Hanno Biber, Austrian Academy Corpus, hanno.biber@oeaw.ac.at
Evelyn Breiteneder, Austrian Academy Corpus, evelyn.breiteneder@oeaw.ac.at
Karlheinz Moerth, Austrian Academy Corpus, karlheinz.moerth@oeaw.ac.at

In this paper we will describe the aims, the main research objectives and the crucial computational aspects of establishing a large electronic text corpus. In the first part the concept of our corpus approach will be presented and its background as well as its consequences discussed. In the second part of the paper notable features of the digitisation processes based upon the application of XML Schemas will be discussed. The Austrian Academy Corpus (AAC) is a newly founded institution based within the Austrian Academy of Sciences in Vienna. Its aim is to set up a text corpus and also to conduct research in the field of electronic text corpora. Electronic text collections to date have generally been focused on linguistic studies and lexicography, and designed and set up for language-orientated research. Recently, the perspective has changed towards providing resources for scholars from various fields within the humanities. The AAC is attempting to establish a corpus that meets the needs of textual studies and conveys essential information about language and history. The AAC functions as an example of an experimental corpus that is predominantly designed for textual studies. It will be a complexly structured text collection in which sources from a variety of fields will be included. The AAC also aims to include a wide range of texts from various cultural domains, carefully selected for their historical and cultural significance. The AAC will create, structure, provide and analyse selected text sources from the past two centuries, taking advantage of the latest standards and techniques in electronic text processing. The AAC intends to digitally store a wide selection of different sources of scholarly, journalistic and political texts which were of considerable influence in the period between 1848 and 1989.
It has started the digitisation and structured integration of texts, among them several influential and notable literary and political journals, such as “Die Weltbühne” or “Die Aktion”, published in Berlin in the first decades of the last century, and the Austrian journal “Der Brenner”, published in Innsbruck, as well as many other sources. The famous satirical magazine “Die Fackel”, published by Karl Kraus in Vienna, will constitute the core of the AAC and will be a starting point for future selections of texts. Images and manuscripts will be included in the corpus, where necessary, because the original graphical and typographical information is important for the meaning and interpretation of digitised texts. This is particularly the case with complex text structures such as newspapers or literary journals, which comprise a whole variety of functionally different text types within their structure. Digital resources in the form of electronic text corpora should be regarded as structures for representing complex information. The electronic text collections established at the AAC so far and its future projects will focus on electronic representations not only of literary texts, literary magazines, journals and newspapers but also of a carefully considered selection of texts from several other cultural and social domains. Special emphasis will be placed on areas that have been rather neglected in humanities computing to date. Journals and newspapers pose an especially difficult task when it comes to their representation in digital form. An equally difficult task is the analysis and description of the media’s decisive historical influences and contexts. The study and detailed investigation of texts has always been crucial for our understanding of historical processes. The knowledge of texts and the accessibility of textual knowledge can be furthered by means of large text corpora like the AAC.
The text selection for the AAC, which will take place at the same time as the corpus work, will be guided by thematic and empirical criteria, as well as factors specifically related to the type of text. The specificity of text type is therefore, among other things, a decisive factor not only for the selection of texts but also for their categorisation in a corpus: letters by Oskar Kokoschka, anecdotes by Max Liebermann, writings of Adolf Loos, narrations by Adalbert Stifter, feature articles by Daniel Spitzer, funeral sermons, electoral speeches, propaganda and advertising slogans, pop song lyrics, political speeches, comic books, instructions, travel guides, TV programmes, mailing catalogues, and so on. In recent years, the establishment of large German language corpora has been restricted to the field of linguistic and lexicographic studies. So far, there have not been any large-scale initiatives in the area of text-centred studies. Although more and more literary texts are becoming available, many of these came into existence as by-products of efforts to amass data for lexicographic research. Generally speaking, the historical period on which the Austrian Academy Corpus is working is poorly documented in terms of digital literary texts. This applies even more when it comes to collective text ‘carriers’ such as magazines, papers, yearbooks, commemorative volumes and similar materials. Among the sources being digitised for the AAC are a considerable number of historical literary magazines of major importance. One example is the journal “Der Brenner”, which was published by Ludwig Ficker in Innsbruck from 1910 until 1954. Among its contributors are figures as renowned as Carl Dallago, Theodor Haecker, Else Lasker-Schüler, Adolf Loos, and Georg Trakl. Other sources on which the AAC is working were published in Berlin, for instance the journal “Die Aktion”, edited by Franz Pfemfert between 1911 and 1932.
Among its contributors were Peter Altenberg, Hermann Bahr, Walter Benjamin, Max Brod, Richard Dehmel, Salomo Friedlaender, Georg Heym, Kurt Hiller, Max Oppenheimer, Egon Schiele and August Strindberg. Another journal to be mentioned here, and perhaps the most important one in the pipeline, is the weekly Berlin journal “Die Schaubühne” (1905 - 18), later renamed “Die Weltbühne” (1918 - 33), which was edited by Siegfried Jacobsohn, Kurt Tucholsky and Carl von Ossietzky. Among the writers who contributed to “Die Weltbühne” were Henri Barbusse, Bertolt Brecht, Alfred Döblin, Lion Feuchtwanger, Arthur Koestler, Heinrich Mann, Alfred Polgar, Romain Rolland, and Leon Trotsky. To produce a digital version of, for example, the literary journals “Der Brenner” or “Die Weltbühne”, the original text has to undergo the usual stages of electronic processing. After being scanned, the text is made machine-readable by means of OCR. The structure of the text (pages, paragraphs and lines) is identified by automatic routines. Application of markup is the last step in this process. Tags encoding content are carefully inserted by literary scholars specially trained for this task. This process, which takes several runs, is accompanied by proofreading against the original and constant checking and validating of the achieved results. Literary encoding projects in the past have employed SGML, very often following the TEI Guidelines (P3). The AAC makes extensive use of XML’s modular system of specifications. Aside from the basic XML specification, several other specifications exist, all of them having a more or less defined place within the overall framework provided by XML. The exact nature of some of these specifications is not yet clear (XLink, XML Query), as development work still continues apace. Those that are classified as recommendations are XSLT (Extensible Stylesheet Language Transformations) and XPath (a language for addressing parts of an XML document).
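The automatic identification of pages, paragraphs and lines, and the XPath-style addressing of the resulting structure, can be illustrated with a minimal Python sketch. It is not the AAC's actual routine: the element names `text`, `page`, `p` and `l`, the function `structure_ocr_text` and the sample string are hypothetical, and the sketch assumes that the OCR output separates pages with a form feed and paragraphs with a blank line.

```python
import xml.etree.ElementTree as ET


def structure_ocr_text(text):
    """Wrap plain OCR output in page / paragraph / line elements.

    Assumptions (illustrative only): a form feed separates pages and
    a blank line separates paragraphs.
    """
    root = ET.Element("text")
    for page_no, page in enumerate(text.split("\f"), start=1):
        page_el = ET.SubElement(root, "page", n=str(page_no))
        for para in page.split("\n\n"):
            # Drop empty lines and surrounding whitespace left by OCR.
            lines = [ln.strip() for ln in para.splitlines() if ln.strip()]
            if not lines:
                continue
            p_el = ET.SubElement(page_el, "p")
            for line_no, line in enumerate(lines, start=1):
                l_el = ET.SubElement(p_el, "l", n=str(line_no))
                l_el.text = line
    return root


# Hypothetical OCR output: two pages, the first with two paragraphs.
SAMPLE = (
    "First line of page one\nsecond line\n\n"
    "New paragraph\f"
    "Page two starts here"
)

doc = structure_ocr_text(SAMPLE)

# ElementTree supports a subset of XPath; this expression addresses
# every line on the second page.
for line in doc.findall("./page[@n='2']/p/l"):
    print(line.text)
```

Content markup by trained encoders would then be added on top of this purely structural layer.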
The implications of others such as XLink (XML Linking Language), XPointer (an abstract language that specifies locations), and XQL (XML Query Language) for literary computing will have to be considered in due course. As XML comes of age, the issue of a standard way of defining the structure of documents becomes more and more important. Both traditional DTDs (document type definitions) and XML Schemas are formats which model document structures. Whereas DTDs in the traditional sense have been around for some time and are widely accepted in the field of SGML-based text encoding, XML Schemas must be regarded as a fledgling technology that still has to win its spurs. XML Schemas are commonly regarded as an attempt at an XML answer to the problem of defining the structure, content and semantics of documents. There are several arguments in favour of XML Schemas, among which are XML syntax, object orientation, inheritance, polymorphism and datatyping. Firstly, XML Schemas follow XML syntax rules, which makes it possible to parse them with XML tools. Nowadays, authors of XML documents often regard traditional DTDs as unwieldy and inconsistent with the structure of XML. With XML Schemas, validating parsers can be built on the basis of XML syntax. Secondly, XML Schemas may include explicit restrictions on the data types an element may hold. They allow the text programmer to attribute data types such as strings, numbers (integer, floating point), date and time formats, boolean and others to the elements constituting an XML document. In addition, XML Schemas are also intended to allow the definition of new data types to further refine the markup scheme being used. For the corpus holdings of the AAC such applications are investigated and implemented for the benefit of various corpus-based linguistic and textual studies.
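Two of the arguments above, that a schema is itself an XML document and that it can assign and derive data types, can be made concrete with a small Python sketch. The schema fragment is hypothetical: the element names `issue`, `number`, `published` and `title` and the derived type `issueNumber` are invented for illustration and are not taken from the AAC's schemas. The script only reads the schema with a generic XML parser; it does not validate documents against it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment; all names are illustrative assumptions.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="issueNumber">
    <xs:restriction base="xs:positiveInteger">
      <xs:maxInclusive value="999"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="issue">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="number" type="issueNumber"/>
        <xs:element name="published" type="xs:date"/>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def declared_types(schema_text):
    """Return {element name: type name} for element declarations with a type."""
    root = ET.fromstring(schema_text)
    return {
        el.get("name"): el.get("type")
        for el in root.iter(f"{{{XS}}}element")
        if el.get("type") is not None
    }


def derived_types(schema_text):
    """Return {new type name: base type} for simple types defined by restriction."""
    root = ET.fromstring(schema_text)
    result = {}
    for simple_type in root.iter(f"{{{XS}}}simpleType"):
        restriction = simple_type.find(f"{{{XS}}}restriction")
        if simple_type.get("name") and restriction is not None:
            result[simple_type.get("name")] = restriction.get("base")
    return result


print(declared_types(SCHEMA))
print(derived_types(SCHEMA))
```

The same parsing step would not work on a traditional DTD, whose declarations use a separate, non-XML syntax.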