Introduction:
'Henrik Ibsen's Writings' aims to produce and publish both an electronic
version and a book version of all Ibsen's writings: dramas, poems, drafts,
letters, articles, notes. All manuscripts and editions in Norwegian/Danish
from the playwright's lifetime will be encoded using SGML/TEI. Our encoding
is rather detailed, as we wish to reproduce every text witness as
accurately as possible.
In this paper we will examine some empirical problems which have surfaced
when encoding text structure and textual features in Ibsen's plays. We will
relate these to problems concerning how to encode the textual features in
Ibsen's 'Norma' (1851). In terms of genre one could define this text as a
drama, but it could also be considered a covert article. An analytical
perspective on the text reveals a mixture of different genre elements. The
text's physical representation, i.e. its typography, shares the same
ambiguity. Each of these approaches yields a different 'logical' structure,
the structures overlap each other, and whichever approach we choose forces
us to give one structure an unfortunate priority in the encoding. To solve
these problems we have looked into different solutions for encoding text
structure and overlapping text features. In brief, we have asked the basic
questions: What is text, what defines structure and other textual features,
and how does this affect the encoding of texts?
These questions have been much discussed within the text encoding community
in the past, but the approach in this paper differs somewhat from much of
the earlier work. We will try to examine the questions from a more practical
point of view, with a focus on the encoding of 'Norma', and we hope thereby
to add a new perspective to the discussion.
Ibsen's 'Norma':
The problems of overlapping and ambiguous textual features and structures are
particularly manifest when one has to deal with texts with mixed features
which cannot easily be combined. Ibsen's 'Norma', which was written and
published in the newspaper 'Andhrimner' in 1851, is an example of such a
text. Choosing which textual features to encode in 'Norma' is not easy, as
it is problematic to define the text as a drama, but even more problematic
to define it as something else. We have chosen to regard it as a drama, but
several textual elements in 'Norma' are difficult to incorporate in the
drama structure. One particularly problematic case is a 'speech' from 'the
curtain' (which does not appear in the cast list), typographically rendered
not as speech but as a stage direction. Other problems concern footnotes
(both in the cast list and in the speeches) and speeches in brackets that do
not seem to be 'asides' (asides are marked by stage directions). These
problems, and others like them, will be presented and further discussed at
the conference.
Encoding Text Structure:
The nature of a text has been much discussed, in several different contexts.
One answer is that a text is made up of several interwoven features, or
content objects, which together form 'a text'. The nature of a text is thus
complex, and it seems difficult, or even impossible, to find a single
structure that 'is' the text. Some would argue that texts even seem to be
able to include things 'outside' themselves. Another view of the text, much
discussed within the text encoding community, is the claim that 'text is an
ordered hierarchy of content objects', the so-called OHCO thesis. This
claim has been thoroughly examined by Renear et al. (1993), and further
discussed e.g. in Biggs & Huitfeldt (1997). We will leave this discussion
aside here, and only point out that the OHCO thesis (at least in its
simplest form) may be appealing from a text encoding perspective, but that
it is far from unproblematic.
When encoding texts one generally chooses a declarative markup language
based on SGML or on its subset XML. We are using SGML/TEI, but are
considering moving over to XML. As Sperberg-McQueen & Huitfeldt (1998) have
pointed out, SGML markup in its simplest form uses a straightforward model:
elements nest within each other so that the SGML document forms a
hierarchical structure. The basic model of SGML associates single
occurrences of features with single SGML elements. Tagging a textual element
as an SGML element of a particular type, and giving it particular attribute
values, thus amounts to a claim that this element exhibits the textual
features associated with that element type and those attribute values. The
relationship between SGML element types and text features depends, however,
on the encoder's understanding and interpretation of the genre, structure
and perspective of the text. The encoder maps text objects onto elements
that match the chosen hierarchical structure.
One of the major challenges in applying SGML to existing texts is finding
suitable representations in SGML's tree-based data model for multiple
hierarchies and textual features which overlap each other. Such overlapping
textual features seem to be an inescapable fact of textual life, but they
present a problem in the simple SGML model: while two textual features or
hierarchies may overlap, two SGML elements may not.
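The nesting constraint can be illustrated with a minimal sketch. It uses
Python's standard XML parser as a stand-in for an SGML parser (an assumption
made for convenience; the project itself uses SGML/TEI), and the sample
wording and tagging are invented for illustration, not taken from 'Norma'.
A verse line that lies wholly inside a speech parses, while a verse line
that crosses a speech boundary does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: wording and tagging are invented for illustration.
# Nested: the verse line <l> lies wholly inside the speech <sp>.
NESTED = "<div><sp><l>A line of verse</l></sp></div>"

# Overlapping: the <l> starts in the first <sp> but ends in the second one.
OVERLAPPING = "<div><sp><l>A line of</sp><sp>verse</l></sp></div>"


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses into a single element tree."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised when an end tag does not match the currently open element.
        return False


print(is_well_formed(NESTED))
print(is_well_formed(OVERLAPPING))
```

The second fragment is rejected even though, as a description of the text,
it says nothing unreasonable: the limitation lies in the tree model, not in
the textual features themselves.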
Encoding Overlapping Features:
The problem of overlapping features has been discussed before. The TEI
Guidelines offer several ways to work around some of the problems. One may,
for example, use 'milestone' elements, where a feature is predicated of the
span of text between one milestone element and the next. Other techniques
fragment one logical element into multiple SGML elements and then knit the
fragments back into a whole; this is the method used, for example, by the
part attribute of the <l> element and by the <join> element.
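Both workarounds can be sketched in a few lines of Python. The fragment
below is hypothetical (speakers, wording and page numbers are invented), and
XML syntax with Python's standard parser stands in for SGML. The <pb/>
element acts as a milestone for the page hierarchy, and one verse line
shared by two speeches is split into an initial (part="I") and a final
(part="F") fragment. The helper names whole_lines and page_of_each_line are
our own, introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical drama fragment, invented for illustration.
SAMPLE = """
<div>
  <pb n="1"/>
  <sp who="A"><l part="I">Will you come</l></sp>
  <sp who="B"><l part="F">when I call?</l>
    <pb n="2"/>
    <l>A complete line.</l></sp>
</div>
"""


def whole_lines(root):
    """Knit <l> fragments (part I/M/F) back into whole verse lines."""
    lines, pending = [], []
    for line in root.iter("l"):
        text = "".join(line.itertext()).strip()
        part = line.get("part", "N")  # assumption: "N" means not fragmented
        if part in ("I", "M"):
            pending.append(text)       # initial or medial fragment: keep collecting
        elif part == "F":
            pending.append(text)       # final fragment: close the verse line
            lines.append(" ".join(pending))
            pending = []
        else:
            lines.append(text)         # an unfragmented line
    return lines


def page_of_each_line(root):
    """Assign each <l> to the page opened by the most recent <pb/> milestone."""
    page, result = None, []
    for el in root.iter():             # iterates in document order
        if el.tag == "pb":
            page = el.get("n")
        elif el.tag == "l":
            result.append((page, "".join(el.itertext()).strip()))
    return result


root = ET.fromstring(SAMPLE)
print(whole_lines(root))
print(page_of_each_line(root))
```

In this sketch the page break falls in the middle of a speech and the verse
line straddles two speeches, yet the document remains a single hierarchy of
speeches. The price is that pages and whole verse lines are no longer
elements in their own right and have to be reconstructed by processing.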
There are also several possibilities permitting 'true' overlapping features.
Within SGML/TEI the CONCUR feature allows a document to be marked up
concurrently using more than one DTD, with each tag labelled with the name
of the DTD to which it belongs. Non-SGML systems include MECS
(Multi-Element Code System), a system developed at the Wittgenstein Archives
at the University of Bergen. MECS permits any two codes to overlap, but on
the other hand has no specific document grammar, and therefore no SGML-like
document validation is possible.
We have chosen not to use such systems in our project, as they do not seem
suitable for our purposes, partly because they make text encoding too
complex. Furthermore, encoding 'true' overlapping structures with CONCUR,
MECS, or similar systems does not really solve the problem of deciding how
to structure the text: such systems make it possible to encode overlapping
features, but one still has to decide which feature(s) to encode, and even
which not to encode.
The boundaries of the different features are not always clear, and text
features seem to exist both dependently and independently of each other. Our
project uses the first printed editions as base texts for the edition. When
encoding texts, we try to reproduce these texts as accurately as possible.
Having chosen to use standard TEI, we thus have to deal with the problem of
encoding the textual features into hierarchical SGML documents. This also
means choosing, where necessary, which text features to encode and which
not to encode.
Encoding 'Norma':
In the case of 'Norma' we have ended up encoding the text as an ordinary
drama. The additional features of the text that do not seem to be part of a
'normal' drama structure, but still can easily be encoded into the drama
structure (e.g. footnotes and speeches in brackets), are also incorporated,
even though this means 'violating' the logical structure of a drama. On the
other hand, additional features that would require redefining the whole
conceptualization of dramas (e.g. speeches in stage directions) have so far
been ignored in the encoding (but documented elsewhere, e.g. in the header
of the document). 'Norma' is an ambiguous text, and while we want to
incorporate as much as possible of this ambiguity in the encoded version of
it, not all of it can be preserved, and however the encoding is done, the
encoded text may become ambiguous in new ways.
It may seem paradoxical that encoding the features needed for electronic
processing and analysis of texts at the same time involves interpretation
that may restrict the use of those texts. There is no simple way out of this
problem, but if the interpretation in the encoding is kept at a reasonable
level, the encoding can open up the text more than it restricts it. Renear
et al. (1993) state that 'It should be a
commonplace that machine-readable texts are "subjective" and "interpretive",
but not especially subjective or interpretative.', and that encoding a text
is in this respect much like making a traditional edition.