Digital Humanities Abstracts

“AAC - Digital Resources in Textual Studies”
Hanno Biber, AAC (Austrian Academy Corpus), Austrian Academy of Sciences, hanno.biber@oeaw.ac.at
Evelyn Breiteneder, AAC (Austrian Academy Corpus), Austrian Academy of Sciences, evelyn.breiteneder@oeaw.ac.at
Karlheinz Moerth, AAC (Austrian Academy Corpus), Austrian Academy of Sciences, karlheinz.moerth@oeaw.ac.at

The Austrian Academy Corpus (AAC) is a newly founded institution based at the Austrian Academy of Sciences in Vienna. It was set up to build a text corpus and to conduct research in the field of electronic text corpora. The AAC working group has had expertise in digital text studies and in lexicography for more than ten years. For a long time, electronic text collections were focused primarily on linguistic studies and on lexicography. Only recently has the perspective shifted towards providing material for scholars interested in texts from various fields of the humanities. The AAC has been trying to find solutions that meet the needs of textual studies and convey essential information about language and history. The aim of the proposed poster is to investigate the potential of digital resources for textual studies in various fields of the humanities. The electronic text collections established at the AAC so far, and its future projects, will focus mainly on electronic representations not only of literary texts, literary magazines, journals and newspapers but also of a carefully considered selection of texts from many cultural and social domains. The poster will consider the advance of new systems of digital representation and its implications for the study of language, literature and cultural history. Digital resources in the form of electronic text corpora should be regarded as structures for representing complex information. Journals and newspapers pose an especially difficult task when it comes to representation in digital form. An equally difficult task is the analysis and description of the media's decisive historical influences and contexts. The poster, which will be a digital projection, will comprise three parts that show the range of interests pursued in the AAC research group. The first part will deal with the general organisational structures of the AAC.
The second part will be concerned with the specific selection criteria for the great variety of texts which will form the AAC. Finally, the third part will examine practical issues in digitising the magazine “Die Weltbühne”, giving special attention to the applicability of XML Schemas in literary computing.

1) AAC Structures

Research projects in the field of humanities computing rely heavily on cooperation, collaboration and the constant exchange of knowledge and expertise. The AAC will eventually be accessible on the internet, where innovative and collaborative presentation techniques and current graphic design developments will be utilised. However, being a research unit within the Austrian Academy of Sciences, the AAC is also set within the wider framework of a trilateral research scheme organised and planned at the Austrian Academy of Sciences in Vienna, the Berlin-Brandenburg Academy of Sciences and Humanities in Berlin and the Swiss Academy of Humanities and Social Sciences in Berne. In such a wider perspective, the common and individual settings and conditions of the German language will have to be taken into consideration, as will their historical and contemporary literatures of various kinds. The constant cultural exchange between the three countries opens up particular research areas and fields of study for the linguists and scholars engaged in establishing digital resources and in computing activities in the humanities. Whereas the efforts undertaken in Berlin and Berne are predominantly concerned with providing selected data and texts of the 20th century, mainly for lexicographic purposes, the Austrian Academy of Sciences intends to tackle the problem of the digital representation of scholarly, journalistic and political texts which were of considerable influence between 1848 and 1989. The Berlin-Brandenburg Academy of Sciences and Humanities has started its project of a Digital Dictionary of the 20th Century German Language, the main task being to develop a dictionary system and a prime source for linguistic and lexicographic information. The Swiss Academy of Humanities and Social Sciences in Berne has recently joined the efforts of this trilateral cooperation.

2) AAC Selection

Setting up a text corpus requires attention to several conditions and considerations. In the past twenty years, electronic text corpora have been built up in academic institutions of many European countries, such as France, Norway, Sweden, Slovenia, Spain, the Czech Republic, and the UK. The setting up of these corpora is motivated by the state's will to document the national language in a comprehensive manner and to make the corpora available for scientific, especially linguistic, application. The AAC has a different starting point. For the construction of an Austrian corpus one must consider complicated issues relating to the history of the past two hundred years on the one hand and to our own specific interests on the other. The text selection for the AAC, which will take place at the same time as the corpus work, will be guided by thematic and empirical criteria, as well as by factors specifically related to the type of text. The specificity of text type is therefore a factor not only in the choice of texts but also in their categorisation in a corpus. Letters by Oskar Kokoschka, anecdotes by Max Liebermann, writings of Adolf Loos, narrations by Adalbert Stifter, feature articles by Daniel Spitzer, funeral sermons and electoral speeches, propaganda slogans and advertising slogans, pop song lyrics and political speeches, comic books, instructions, travel guides, TV programmes and mail-order catalogues: these and other text types, as well as the various kinds of text ‘carrier’, are important for the choice of text. In recent years, the establishment of large German language corpora has been restricted to the field of linguistic and lexicographic studies. So far, there have not been any large-scale endeavours in the area of text-centred studies. Although more and more literary texts are becoming available, many of these came into existence as by-products of efforts to amass data for lexicographic research.
Generally speaking, the historical period on which the AAC is working is poorly documented in terms of digital literary texts. This applies even more to collective text ‘carriers’ such as magazines, papers, year-books, commemorative volumes and similar materials. To our knowledge, no large collections of digitised historical magazines or papers in the German language exist. The sources being digitised for the AAC at the moment are historical literary magazines of major importance. In the first instance there is “Der Brenner”, which was published by Ludwig Ficker in Innsbruck from 1910 until 1934. Among the contributors to the “Brenner” are figures as renowned as Carl Dallago, Theodor Haecker, Else Lasker-Schüler, Adolf Loos and Georg Trakl. The other two magazines on which the AAC is working were both published in Berlin. The journal “Die Aktion” (1911-1932) was edited by Franz Pfemfert. Among its contributors were Peter Altenberg, Hermann Bahr, Walter Benjamin, Max Brod, Richard Dehmel, Salomo Friedlaender, Georg Heym, Kurt Hiller, Max Oppenheimer, Egon Schiele and August Strindberg. The last journal to be mentioned here, and perhaps the most important of those being worked on at the moment, is the weekly Berlin journal “Die Schaubühne” (1905-1918), later renamed “Die Weltbühne” (1918-1933), which was edited by Siegfried Jacobsohn, Kurt Tucholsky and Carl von Ossietzky. Among the writers who contributed to the “Weltbühne” were Henri Barbusse, Bertolt Brecht, Alfred Döblin, Lion Feuchtwanger, Arthur Koestler, Heinrich Mann, Alfred Polgar, Romain Rolland and Leon Trotsky.

3) AAC XML-Schemas

To produce a digital version of the magazine “Die Weltbühne”, the original text has to undergo the usual stages of electronic processing: after being scanned, the text is made machine-readable by means of up-to-date OCR. Then pages, paragraphs and lines are identified by automatic routines. The application of markup is the last step in this process. Tags describing contents are carefully inserted by literary scholars specially trained for this job. This process, which takes several runs, is accompanied by proofreading against the original and by constant checking and validating of the results achieved. Literary projects in the past used to employ SGML, very often in connection with the TEI guidelines. The AAC also makes extensive use of XML's modular system of specifications. Aside from the basic XML specification, several other specifications exist, all of them having their more or less well-defined place within the overall framework. The exact nature of some of these sub-specifications is not yet clear (XLink, XML Query), as everything is very much in a state of flux at the moment. Those that are classified as recommendations are XSLT (XSL Transformations) and XPath (a language for addressing parts of an XML document). The implications of others such as XLink (XML Linking Language), XPointer (an abstract language that specifies locations), and XQL (XML Query Language) for literary computing will have to be considered in due course. As XML comes of age, the issue of a standard way of defining the structure of documents becomes more and more important. Both traditional DTDs (document type definitions) and XML Schemas are technologies that provide such descriptions of document structures. Whereas DTDs in the traditional sense have been around for some time and are widely accepted in the field of SGML-based text-encoding, XML Schemas must be regarded as a fledgling technology that still has to win its spurs.
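The step from segmented OCR output to structural markup, and the role of XPath in addressing parts of the result, can be illustrated with a minimal sketch. It is not the AAC's actual encoding: the element names `issue`, `page`, `p` and `l`, the attribute `n`, the function `build_issue` and the placeholder text are all assumptions made for this example, and Python's standard `xml.etree.ElementTree` module implements only a subset of XPath 1.0.

```python
import xml.etree.ElementTree as ET

# Placeholder for OCR output that has already been segmented:
# a list of pages, each a list of paragraphs, each a list of lines.
segmented_pages = [
    [["first line", "second line"], ["a new paragraph"]],
    [["text on the second page"]],
]

def build_issue(pages, title):
    """Wrap segmented text in structural markup (hypothetical element names)."""
    issue = ET.Element("issue", {"title": title})
    for page_no, paragraphs in enumerate(pages, start=1):
        page = ET.SubElement(issue, "page", {"n": str(page_no)})
        for paragraph in paragraphs:
            p = ET.SubElement(page, "p")
            for line in paragraph:
                ET.SubElement(p, "l").text = line
    return issue

issue = build_issue(segmented_pages, "Die Weltbühne")

# XPath-style addressing of parts of the document.
all_lines = issue.findall(".//l")                    # every line in the issue
page_two_lines = issue.findall("./page[@n='2']/p/l")  # lines on page 2 only

print(len(all_lines), [line.text for line in page_two_lines])
print(ET.tostring(issue, encoding="unicode"))
```

Content-describing tags of the kind inserted by trained scholars would be added on top of such a structural skeleton; the sketch covers only the automatic part of the process.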
XML Schemas are commonly regarded as an attempt at an XML answer to the problem of defining the structure, content and semantics of documents. There are several arguments in favour of XML Schemas, among which are XML syntax, object orientation, inheritance, polymorphism and datatyping. Firstly, XML Schemas follow XML syntax rules, which makes it possible to parse them with XML tools. Nowadays, authors of XML documents often regard traditional DTDs as unwieldy and inconsistent with the structure of the overall XML system. With XML Schemas, validating parsers can be built on the basis of XML syntax. Secondly, XML Schemas may include explicit restrictions on the data types an element may hold. They let the text programmer assign data types such as strings, numbers (integer, floating point), date and time formats, boolean and others to the elements constituting an XML document. In addition, XML Schemas are supposed to allow the text worker to define new data types in order to refine the markup system being used. The poster will describe the AAC's experiences in applying this new technology, focusing on the issue of DTDs and XML Schemas in processing text corpora, and will give some details of the pilot phase of the AAC's project of establishing a corpus.
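The two arguments singled out above, XML syntax and datatyping, can be made concrete with a small sketch. The schema fragment is hypothetical and not taken from the AAC: the element `issue`, its attributes and the user-defined type `weltbuehneYear` are assumptions, and the namespace used is that of the later final W3C Recommendation. Because the schema is itself XML, an ordinary XML parser can read it. The function `conforms` is a deliberately simplified stand-in for a validating parser: it handles only three types, and only approximately.

```python
import xml.etree.ElementTree as ET
from datetime import date

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: two built-in datatypes and one user-defined type.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="weltbuehneYear">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1918"/>
      <xs:maxInclusive value="1933"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="issue">
    <xs:complexType>
      <xs:attribute name="published" type="xs:date" use="required"/>
      <xs:attribute name="number" type="xs:positiveInteger" use="required"/>
      <xs:attribute name="year" type="weltbuehneYear" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def attribute_types(schema_root, element_name):
    """Map attribute name -> declared type for one element declaration."""
    decl = schema_root.find(f"{{{XS}}}element[@name='{element_name}']")
    return {a.get("name"): a.get("type") for a in decl.iter(f"{{{XS}}}attribute")}

def integer_range(schema_root, type_name):
    """Read the inclusive bounds of a user-defined integer restriction."""
    simple_type = schema_root.find(f"{{{XS}}}simpleType[@name='{type_name}']")
    restriction = simple_type.find(f"{{{XS}}}restriction")
    low = int(restriction.find(f"{{{XS}}}minInclusive").get("value"))
    high = int(restriction.find(f"{{{XS}}}maxInclusive").get("value"))
    return low, high

def conforms(value, type_name, schema_root):
    """Simplified datatype check; not a conforming XML Schema validator."""
    try:
        if type_name == "xs:date":
            date.fromisoformat(value)  # approximation of the xs:date lexical form
            return True
        if type_name == "xs:positiveInteger":
            return int(value) > 0
        low, high = integer_range(schema_root, type_name)
        return low <= int(value) <= high
    except ValueError:
        return False

schema_root = ET.fromstring(SCHEMA)  # the schema is parsed like any XML document
types = attribute_types(schema_root, "issue")
print(types)
```

A DTD could declare the same three attributes only as character data; the restriction of `year` to a range of integers is the kind of refinement that the datatype mechanism adds.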
