Text Structure vs. Encoded Structure - Dealing with Mixed Genres and Ambiguous Texts

Introduction:

'Henrik Ibsen's Writings' aims to produce and publish both an electronic version and a book version of all of Ibsen's writings: dramas, poems, drafts, letters, articles and notes. All manuscripts and editions in Norwegian/Danish from the playwright's lifetime will be encoded using SGML/TEI. Our encoding is rather detailed, as we wish to reproduce every text witness with full accuracy. In this paper we examine some empirical problems that have surfaced when encoding text structure and textual features in Ibsen's plays, and relate them to the problem of how to encode the textual features of Ibsen's 'Norma' (1851). Judged by genre, this text could be defined as a drama, but it could also be considered a covert article. An analytical perspective on the text exposes a mixture of different genre elements, and the text's physical representation, i.e. its typography, shares the same ambiguity. Each of these approaches yields a different 'logical' structure, the structures overlap one another, and whichever approach we choose forces us to give one structure an unfortunate priority in the encoding. To address these problems we have looked into different solutions for encoding text structure and overlapping textual features. In brief, we have asked the basic questions: What is text, what defines structure and other textual features, and how does this affect the encoding of texts? These questions have been much discussed within the text encoding community, but the approach in this paper differs somewhat from much of the earlier work. We examine the questions from a more practical point of view, with a focus on the encoding of 'Norma', and hope thereby to add a new perspective to the discussion.

Ibsen's 'Norma':

The problems of overlapping and ambiguous textual features and structures are particularly manifest in texts with mixed features that cannot easily be combined. Ibsen's 'Norma', which was written and published in the newspaper 'Andhrimner' in 1851, is an example of such a text. Choosing which textual features to encode in 'Norma' is not easy: it is problematic to define the text as a drama, but even more problematic to define it as something else. We have chosen to regard it as a drama, but several textual elements in 'Norma' are difficult to incorporate in the drama structure. A particularly problematic case is a 'speech' from 'the curtain' (which is not listed in the cast list), typographically rendered not as speech but as a stage direction. Other problems concern footnotes (both in the cast list and in the speeches) and speeches in brackets that do not seem to be 'asides' (asides are marked by stage directions). These problems, and others like them, will be presented and further discussed at the conference.

Encoding Text Structure:

The nature of a text has been much discussed, in several different contexts. One answer to the question of what a text is holds that it is made of several interwoven features, or content objects, which together form 'a text'. On this view the nature of a text is complex, and it seems difficult, or even impossible, to find a single structure that 'is' the text. Some would argue that texts even seem able to include things 'outside' themselves. Another view, much discussed within the text encoding community, is the claim that 'text is an ordered hierarchy of content objects', the so-called OHCO thesis. This claim has been thoroughly examined by Renear et al. (1993), and further discussed e.g. in Biggs & Huitfeldt (1997). We will leave this discussion aside here, and only point out that the OHCO thesis (at least in its simplest form) may be appealing from a text encoding perspective, but that it is far from unproblematic. When encoding texts one generally chooses a declarative markup language based on SGML or on its subset XML. We are using SGML/TEI, but are considering moving to XML. As Sperberg-McQueen & Huitfeldt (1998) have pointed out, SGML in its simplest form uses a straightforward model for markup: elements nest within each other, so that the SGML document forms a hierarchical structure. The basic model of SGML associates single occurrences of features with single SGML elements. Tagging a textual element as an SGML element of a particular type, and giving it particular attribute values, thus claims that this element exhibits the textual features associated with that element type and those attribute values. The relationship between SGML element types and text features depends, however, on the encoder's understanding and interpretation of the genre, structure and perspective of the text. The encoder defines text objects as elements that match the chosen hierarchical structure.
One of the major challenges in applying SGML to existing texts is finding suitable representations, within SGML's tree-based data model, for multiple hierarchies and for textual features that overlap each other. Such overlapping textual features seem to be an inescapable fact of textual life, but they present a problem in the simple SGML model: two textual features or hierarchies may overlap, but two SGML elements may not.
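
The constraint can be illustrated with a minimal sketch. It assumes XML well-formedness as a stand-in for SGML's nesting rule and uses Python's standard XML parser, not any tool from the project. The element names follow TEI drama markup, while the bracketed content is placeholder text and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder fragments: bracketed text stands in for real wording.
# NESTED keeps the stage direction inside the paragraph of the speech.
NESTED = (
    "<sp><speaker>[speaker]</speaker>"
    "<p>[words] <stage>[direction]</stage> [more words]</p></sp>"
)

# OVERLAPPING lets the stage direction start inside the speech and end
# after it, i.e. two features that overlap instead of nesting.
OVERLAPPING = "<sp><p>[words] <stage>[more words]</p></sp> [rest]</stage>"


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment can be parsed into a single tree."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("nested:", is_well_formed(NESTED))
    print("overlapping:", is_well_formed(OVERLAPPING))
```

The parser accepts the nested fragment and rejects the overlapping one, which is the situation an encoder faces whenever two features of the source text cross each other's boundaries.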

Encoding Overlapping Features:

The problem of overlapping features has been discussed before. The TEI Guidelines describe several ways to overcome some of the problems: one may e.g. use 'milestone' elements, where a feature is taken to hold for the span of text between one milestone element and the next. Other techniques rely on fragmenting one element into multiple SGML elements and then knitting the fragments into a whole; this is the method used, for example, in the part attribute of the <l>-element and in the <join>-element. There are also several possibilities that permit 'true' overlapping features. Within SGML/TEI, the CONCUR feature allows a document to be marked up concurrently using more than one DTD, with each tag labelled with the name of the DTD to which it belongs. Non-SGML systems include MECS (Multi-Element Code System), developed at the Wittgenstein Archives at the University of Bergen. MECS permits any two codes to overlap, but it has no specific document grammar, and therefore no SGML-like document validation is possible. We have chosen not to use such systems in our project, as they do not seem suitable for our purposes, partly because they make text encoding too complex. Furthermore, encoding 'true' overlapping structures with CONCUR, MECS or similar systems does not really solve the problem of deciding how to structure the text: one gains the possibility of encoding overlapping features, but one still has to decide which features to encode, and even which not to encode. The boundaries of the different features are not always clear, and text features seem to exist both dependently and independently of each other. Our project uses the first printed editions as base texts for the edition, and when encoding we try to reproduce these texts as accurately as possible. Having chosen standard TEI, we thus have to deal with the problem of fitting the textual features into hierarchical SGML documents. This also means choosing which text features to encode and which not to encode, when that is necessary.
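
The fragmentation technique mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the project's encoding: the fragment is XML rather than SGML, the verse text is placeholder wording, and the function name knit_lines is hypothetical. It relies only on the TEI convention that part="I", "M" and "F" mark the initial, medial and final fragments of a verse line.

```python
import xml.etree.ElementTree as ET

# A verse line shared between two speakers cannot be one <l> element,
# because it would overlap the two <sp> elements. It is therefore split
# into fragments that carry the part attribute.
SCENE = """
<div type="scene">
  <sp><speaker>A</speaker>
    <l part="I">first half of a verse line,</l></sp>
  <sp><speaker>B</speaker>
    <l part="F">second half of the same line.</l>
    <l>A complete line.</l></sp>
</div>
"""


def knit_lines(xml_text: str) -> list[str]:
    """Rebuild whole verse lines from <l> fragments in document order."""
    root = ET.fromstring(xml_text)
    lines: list[str] = []
    pending: list[str] = []  # fragments of a line that is not yet complete
    for l in root.iter("l"):
        text = "".join(l.itertext()).strip()
        part = l.get("part", "N")  # assumption: a missing value means a whole line
        if part in ("I", "M"):
            pending.append(text)
        elif part == "F":
            pending.append(text)
            lines.append(" ".join(pending))
            pending = []
        else:
            lines.append(text)
    return lines


if __name__ == "__main__":
    for line in knit_lines(SCENE):
        print(line)
```

The speech hierarchy stays intact in the markup, while the metrical hierarchy has to be reconstructed by processing. This shows the priority that a single tree imposes on one of two competing structures.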

Encoding 'Norma':

In the case of 'Norma' we have ended up encoding the text as an ordinary drama. Additional features that do not seem to be part of a 'normal' drama structure, but can still easily be encoded within it (e.g. footnotes and speeches in brackets), are also incorporated, even though this means 'violating' the logical structure of a drama. On the other hand, additional features that would require redefining the whole conceptualization of drama (e.g. speeches in stage directions) are so far ignored in the encoding, but documented elsewhere, e.g. in the header of the document. 'Norma' is an ambiguous text, and while we want to incorporate as much of this ambiguity as possible in the encoded version, not all of it can be kept; however the encoding is done, the encoded text may also become ambiguous in new ways. It may seem paradoxical that encoding the features needed for electronic processing and analysis of texts at the same time involves interpretation that may restrict the use of the texts. There is no simple way out of this problem, but if the interpretation in the encoding is kept at a reasonable level, the encoding can open up the text more than it delimits it. Renear et al. (1993) state that 'It should be a commonplace that machine-readable texts are "subjective" and "interpretive", but not especially subjective or interpretative', and that encoding a text is in this respect much like making a traditional edition.
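
The kind of decision described above can be sketched in markup. Everything in the following example is an assumption made for illustration: the bracketed text is placeholder wording, not a transcription of 'Norma', and the element and attribute choices (including rend="brackets") are hypothetical rather than the project's actual encoding. The Python wrapper only counts what the fragment contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment with placeholder content throughout.
DRAMA = """
<TEI>
  <teiHeader>
    <encodingDesc><editorialDecl>
      <p>The passage attributed to the curtain is printed as a stage
         direction and is encoded as stage, not as sp.</p>
    </editorialDecl></encodingDesc>
  </teiHeader>
  <text><body>
    <castList>
      <castItem><role id="r1">[role name]</role>
        <note place="foot">[footnote in the cast list]</note></castItem>
    </castList>
    <sp who="r1"><speaker>[role name]</speaker>
      <p>[speech text]<note place="foot">[footnote in a speech]</note></p>
      <p rend="brackets">[speech printed in brackets]</p></sp>
    <stage>[words of the curtain, set as a stage direction]</stage>
  </body></text>
</TEI>
"""


def summarize(xml_text: str) -> dict:
    """Count the drama features present in the encoded fragment."""
    root = ET.fromstring(xml_text)
    cast_ids = {role.get("id") for role in root.iter("role")}
    speeches = list(root.iter("sp"))
    speakers = {sp.get("who") for sp in speeches if sp.get("who")}
    return {
        "speeches": len(speeches),
        "stage_directions": len(list(root.iter("stage"))),
        "notes": len(list(root.iter("note"))),
        # Speakers referenced by a speech but absent from the cast list.
        "speakers_not_in_cast": sorted(speakers - cast_ids),
    }


if __name__ == "__main__":
    print(summarize(DRAMA))
```

In this sketch the footnotes and the bracketed speech fit inside the drama structure, whereas the curtain's words appear only as a stage direction, so any processing that counts speeches will not see them. That loss is what the note in the header is meant to record.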
