Applying the TEI: Problems in the classification of proper nouns

Julia Flanders, Brown University, Julia_Flanders@brown.edu
Sydney Bauman, Brown University, Sydney_Bauman@brown.edu
Paul Caton, Brown University, Paul_Caton@brown.edu
Mavis Cournane, Computer Center, University College Cork, mavis@eolas.ucc.ie
Willard McCarty, King's College London, Willard.McCarty@kcl.ac.uk
John Bradley, King's College London, john.bradley@kcl.ac.uk

Abstract

The testing of the TEI Guidelines since their release has thus far taken a somewhat private form. Scholarly text encoding projects have availed themselves of the Guidelines' exceptional richness and nuance, but with the aim of doing the greatest possible justice to the complexity of their own data, or the particular needs of their own users, rather than with any concern for developing consistency between projects. To a certain extent this is justifiable; the point of a flexible standard is precisely that it can accommodate the multiple needs of its various users. However, where divergence is only the result of random choice among equivalent options, rather than being motivated by real constraints, it serves no purpose and only impedes the exchange and use of data. Now that the Guidelines have been in use long enough to create a substantial base of encoded data, projects whose source material and encoding strategies are similar can benefit from comparing approaches to common problems, and assessing whether their divergences are justified by differences in data or philosophy, or merely represent unnecessary variation in the application of the TEI. One area of primary source transcription which deserves examination along these lines is the classification of proper nouns and similar words and phrases, using the elements described in Chapter 20 of the TEI Guidelines: <name>, <rs>, and the suite of more specific elements such as <placeName>, <orgName>, <foreName>, <surName>, <roleName>, etc. These elements describe a set of phenomena whose retrieval and processing are important to the scholarly user of the encoded text, but whose boundaries are quite fluid and often involve the application of theoretical considerations quite unrelated to text encoding. (For example, is "God" a personal name?) 
The proposed session will present several perspectives on this problem, with three aims: first, of allowing the participating projects (and those represented in the audience) to compare practices and discuss the status of their variation; second, of situating the specific problem of encoding proper nouns within the context of scholarly analysis, so as to create a more precise sense of the needs which the encoding is intended to address; and third, of thinking more broadly about the pressures and constraints on classification systems in text encoding. Two of the three papers in this session come from encoding projects which use the TEI, and which have used Chapter 20 in particularly detailed and carefully considered ways. The first of these is the Brown University Women Writers Project, in a paper co-authored by Julia Flanders, Paul Caton, and Sydney Bauman, which will address the WWP's approach to the use of Chapter 20, and its attempt to balance scholarly needs and cost-effectiveness. The second is the Thesaurus Linguarum Hiberniae (TLH) project, discussed by Mavis Cournane, who will examine TLH's use of TEI to classify and specify different kinds of proper nouns within TLH's corpus of writing in Ireland. The last paper, by Willard McCarty and John Bradley, will consider these issues from the perspective of a non-TEI project dealing intensively with names and their classification, An Analytical Onomasticon to the Metamorphoses of Ovid. This paper will discuss the encoding of names in relation to the complex issues of literary criticism and analysis, with an in-depth exploration of examples from the Metamorphoses.

Nouns Proper and Improper: Using the TEI for primary sources

Julia Flanders, Sydney D. Bauman, Paul Caton

Introduction

The TEI approaches the encoding of names as a problem having largely to do with the need to give labels to existing phenomena: Chapter 20, "Names and Dates", begins by saying that the elements provided therein offer the encoder "a detailed substructure" and the ability "to distinguish explicitly between names of persons, places or organizations" [P3, p. 583]. The elements offered in this section and elsewhere in the TEI are indeed sufficient to encode most if not all of the name-related phenomena found in the texts with which the Women Writers Project is concerned. However, this sufficiency on the SGML side of the equation does not assist with the other side: the fact that the encoder must in fact "distinguish explicitly" between the names of persons, places, organizations, mythical creatures, objects, and the like. That is, we must decide what the thing is before we can encode it, and this is not always easy.

The Women Writers Project

The WWP has an additional challenge, however, which is that in working with older texts we are confronted with a set of phenomena which the text itself identifies--by typographical emphasis of some sort--as being of linguistic or rhetorical importance. In texts printed in the 17th and 18th centuries, this set includes the elements discussed in Chapter 20, but also some related textual features such as abstract nouns and adjectives derived from proper nouns. Thus texts from this period themselves identify a set of features in which a scholar might well be interested, but which shade into one another and may be difficult to identify and classify with any certainty. For example, if one wants to distinguish names of persons from abstract nouns, one runs into challenges in the case of allegory or moral poetry, where virtues may be apostrophized as if they were human, or may be identified with a human agent, or may be in fact the name of that agent. Similarly if one wishes to distinguish names of persons from the names of other kinds of things (such as non-human creatures or objects) one needs not only the encoding equipment to label these but also a clear definition of what it means to be human. Test cases here might be the Medusa, the Minotaur, mermaids, centaurs, Niobe after her transformation into a stone, and the vexed question of the human status of various deities. An additional challenge arises in the case of adjectives which are derived from proper names, like Caesarian or Plutonic; for these there is not even a clear TEI element for the purpose, since <rs> is technically reserved for nouns.

Problems of Classification

Discussion of these issues often verges on the whimsical, for instance when one is forced to articulate what a "person" is (human from the neck up? able to speak or write? able to mate with humans?), but also engages with more serious issues concerning the nature of naming. If one does not wish to make naming and reference the centerpiece of one's encoding system (as would be appropriate for a text like the Metamorphoses, but not for an eclectic collection like the WWP's), one needs to draw a line between the category of names and the other things which shade into them: for instance epithets, vocatives like "Milady", or terms like "the Cockatrice" whose unique reference is vitiated by the presence of an article, and the whole range of apostrophes to abstract qualities like "Fair Virtue". Without such a line, it is hard to know where to stop, and the result is a huge set of features from which it is impossible to retrieve the information one wants. The natural response to this problem is to attempt to classify these in turn, for instance with type attributes, an approach which other projects (CURIA for instance) have taken with success. The WWP has found, however, that for our texts it is very difficult to create a sufficiently comprehensive and unambiguous set of values to categorize these features in a way that would allow researchers actually to do systematic work on them. The WWP's path to this conclusion involved several attempts to create a system which could do justice to this complexity. We tried dividing our field of features into names of persons (using <persName> and its various components), names of non-persons (using <name>), and non-name references to both persons and non-persons (using <rs>). 
This last category was especially baroque, since it included the most heterogeneous group (abstractions, epithets, personifications of inanimate things, symbols, apostrophes, and references to mythical or imaginary creatures), and in fact each iteration of the classification process proved again that a substantial challenge lay in what to do with the residuum, the things which are only alike in being unlike some other, more clearly delimited category. The conclusion we found ourselves drawing was that although the concepts we were dealing with were fairly distinct, their application to specific textual phenomena was not by any means straightforward. Furthermore, although the components of the "residuum" were easily identifiable as categories which did not fit into the other two (personal and non-personal names), we were not confident that they represented categories which would be useful for scholarly study, though we were confident that trying to use them would prove to be extremely time-consuming and hence expensive. As a result, we eventually decided to use a simplified system which made no attempt to classify things beyond the element level; we now distinguish between personal names and the names of non-persons, and any other kind of reference which the text identifies as a proper noun is encoded using <rs> without a type attribute.
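The simplified three-way scheme can be illustrated with a short sketch of how a retrieval tool might index the resulting markup. The sample sentence is invented for illustration and is not taken from a WWP text; the assignment of "Fair Virtue" and "the Cockatrice" to <rs> is one plausible reading of the policy described above, not a record of WWP practice.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative fragment only (an assumption, not a WWP transcription).
SAMPLE = (
    "<p>So sang <persName>Ovid</persName> of <name>Rome</name>, "
    "and of <rs>Fair Virtue</rs>, and of <rs>the Cockatrice</rs>.</p>"
)

def names_by_element(xml_text):
    """Group tagged strings by element: persName, name, or untyped rs."""
    root = ET.fromstring(xml_text)
    found = defaultdict(list)
    for el in root.iter():
        if el.tag in ("persName", "name", "rs"):
            found[el.tag].append("".join(el.itertext()))
    return dict(found)

index = names_by_element(SAMPLE)
```

Because nothing is classified below the element level, a researcher can retrieve every <rs> but must sort the abstractions from the creatures by hand; that is the cost accepted in exchange for cheaper and more consistent encoding.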

Conclusion

The conclusion which emerges from this attempt seems to be that despite the various provisions of the TEI for encoding these complex textual phenomena, the limiting factor really is human use and the ability to define and enforce categorization. The question which leads from this is one of how to regard and apply the TEI: if it is imagined as a system for accounting to one's own satisfaction for what one finds in the text, then the complexity available is essential. However, if the TEI is regarded as a method of communicating textual information to others, as long as the text itself is allowed to determine the encoding solution we will find this communication extremely difficult. Put another way, if an encoding project develops a TEI-based encoding system based on the assumption that its own data has unique requirements, that very assumption limits drastically the possibility of integrating that data with that of other projects to build larger resources, or the possibility of users being able to make common assumptions about how data will be treated. As a strategic matter, these possibilities are best kept open by the counterassumption that data can be treated similarly (even if that counterassumption is to some degree false). At this stage in the TEI's development, projects working on similar undertakings (similar materials, similar methodologies) have had the opportunity to discover the uniqueness of their own data and to revel in it, and they need to turn their attention to finding ways to share it. The ultimate goal of this session is therefore to discuss the degree to which this is possible, and the costs of doing so.

The application of SGML/TEI to the reality of Irish Texts

Mavis Cournane

Abstract

This paper will look at how the TEI DTD is used to encode names in Irish texts. Most of the questions and difficulties encountered by the TLH encoder of names will have presented themselves to others. Some of the problems of encoding are generated by the TEI DTD itself, while others are due to the inherent complexity of the texts themselves. In addressing the problems, I hope to take encoding decisions out of the realm of the arbitrary. The proposed solutions will demonstrate how the TEI DTD can be manipulated and the need for consistency in encoding.

Introduction

The TEI DTD is a descriptive DTD rather than a prescriptive one. This descriptive nature is problematic for the encoder because very little in TEI is mandatory, and the encoder is given several choices of how to encode various textual features. From the outset the would-be encoder is faced with the problems of what to encode, how to encode it, and for whom to encode it. The target audience determines the degree of markup necessary in a text and also strongly influences the way in which you mark something up. Irish texts by nature are rich in prose, poetry, chivalry, and hagiography, and in linguistic, genealogical, and historical data. They appear in five languages, Old Irish, Old Norse, Norman French, Latin, and Hiberno-English, and contain some transliterated Hebrew and Greek. Consequently, their encoding requires a great depth and variety of markup. The application of the theoretical world of TEI to the real world of text presents challenges and problems. These are particularly acute in the encoding of names. TEI puts at your disposal several elements for naming people and things, all of which would work, or appear reasonable. You are provided with tags such as <name>, <persname>, <surname>, <forename>, <rolename>, <addname>, <genname>, <namelink>, and <placename>. The encoder then has to decide which of these best suits her needs. Close attention also has to be paid to the nesting requirements within TEI. For example, one cannot simply decide to encode a name as a <forename> without first nesting it within a <persname>. The same holds true for <surname>, <rolename>, <addname>, <genname> and <namelink>.
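The nesting requirement just described can be sketched as a simple consistency check. This is a minimal illustration, not a DTD validator: the element names are written in lower case as in the text above, the list of components is taken from that passage, and the sample name is used only as a convenient example.

```python
import xml.etree.ElementTree as ET

# Name components that, per the passage above, must sit inside <persname>.
COMPONENTS = {"forename", "surname", "rolename", "addname", "genname", "namelink"}

def unnested_components(xml_text):
    """Return the tags of name components that lack a <persname> ancestor."""
    root = ET.fromstring(xml_text)
    problems = []

    def walk(el, inside_persname):
        if el.tag in COMPONENTS and not inside_persname:
            problems.append(el.tag)
        for child in el:
            walk(child, inside_persname or el.tag == "persname")

    walk(root, False)
    return problems

GOOD = "<p><persname><forename>Mavis</forename> <surname>Cournane</surname></persname></p>"
BAD = "<p><forename>Mavis</forename> <surname>Cournane</surname></p>"
```

Here unnested_components(GOOD) reports nothing, while unnested_components(BAD) reports both components, since neither is wrapped in a <persname>.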

The problems

1. What needs to be encoded?

2. How much needs to be encoded?

3. Should there be a difference between encoding the names of sacred personages, people and objects?

4. In historical texts which span centuries, personal names are dynamic. What denoted a forename in the 11th century indicated a surname by the 13th century. How can markup be consistent yet reflect the semantic accuracy of the text?

5. Placenames present the same difficulty. Their function can change over time. For example, the place Armagh functioned initially as a church, then it graduated to a monastery, a monastic town, then a church, and in a 20th century context it denotes a town and is also the seat of the archbishops. For example:

   <pn type="church">Ard Macha</pn> d' fothuccadh
   <pn type="monastery">Ard Macha</pn> do losccadh
   <pn type="town">Ard Macha</pn>

   Further inconsistency occurs for search purposes when there is variation in the spelling of placenames. If a place can be spelt eight different ways and has had a dynamic function over eight centuries, how is the untrained person to conduct meaningful searches?

6. In non-English historical texts it can be problematic to distinguish between <placename> and <orgName>. Many elements have been abbreviated in TLH for ease of reading when markup is on display. For example, <orgName> is abbreviated to <on>. It is used to mark up organizations in the widest possible sense, be they historical groups, parties, lineages etc. In many instances it is not clear whether one should encode a name as a place or an organization. Many placenames take their name from dynastic names. This poses problems not only for the encoder, but also for the user. Since much of the encoding in this instance is subjective, the user, seeking to search the database, will have her own prejudices and expectations. For example:

   (1) <on type="people/dynasty">Connacht</on>
   (2) <pn type="kingdom">Connacht</pn>
   (3) <pn type="province">Connacht</pn>

   In example (1) it is the dynasty Connacht which is being referred to, and in (2) and (3) it is the place, the kingdom or province Connacht, which is in question. These ambiguities mean that close attention has to be given to the semantics of the text to discern how to encode.
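The effect of the Connacht encodings on searching can be sketched as follows. The three tagged strings are the ones given above; wrapping them in a single <p> and the function name find_name are assumptions made only so that the fragment can be parsed and queried.

```python
import xml.etree.ElementTree as ET

# The three encodings from the example, wrapped in one element for parsing.
SAMPLE = (
    '<p><on type="people/dynasty">Connacht</on> '
    '<pn type="kingdom">Connacht</pn> '
    '<pn type="province">Connacht</pn></p>'
)

def find_name(xml_text, wanted):
    """Return (element, type) for every <pn> or <on> whose text equals `wanted`."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, el.get("type"))
        for el in root.iter()
        if el.tag in ("pn", "on") and "".join(el.itertext()).strip() == wanted
    ]

hits = find_name(SAMPLE, "Connacht")
places = [h for h in hits if h[0] == "pn"]
```

A search on the string alone returns all three hits; only the element and its type attribute let the user separate the dynasty from the kingdom and the province, which is why the encoder's reading of each passage matters so much.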

Solutions

(1) The development of an encoding scheme based on a scholarly rationale rather than one legitimized by Lust und Laune (whim).
(2) Greater use of the attributes provided by TEI to further categorise, regularise and normalise encoded text.
These solutions need not be mutually exclusive.
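A minimal sketch of what regularisation through attributes might look like in practice is given here. It assumes a reg attribute holding a regularised form on each <pn>, and it assumes "Armagh" as the regularised value for "Ard Macha"; neither is taken from the TLH files, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The Ard Macha example, with an assumed reg attribute added to each name.
SAMPLE = """<p>
<pn type="church" reg="Armagh">Ard Macha</pn> d' fothuccadh
<pn type="monastery" reg="Armagh">Ard Macha</pn> do losccadh
<pn type="town" reg="Armagh">Ard Macha</pn>
</p>"""

def search_regularised(xml_text, reg_form):
    """Find placenames by regularised form, whatever spelling the text uses."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("type"), "".join(el.itertext()))
        for el in root.iter("pn")
        if el.get("reg") == reg_form
    ]

by_reg = search_regularised(SAMPLE, "Armagh")
# A plain string search for the modern form finds nothing in the running text.
plain_hits = "".join(ET.fromstring(SAMPLE).itertext()).count("Armagh")
```

The untrained user who types the modern form retrieves every occurrence through the attribute, while the type value still records which function the place had at that point in the text.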

Conclusion

Names in Irish texts are dynamic and ambiguous. Their encoding will only serve a meaningful purpose for the end user if regularisation and consistency guide the marking-up process. Without due consideration of these factors, search and retrieval would be a complicated and unsatisfactory exercise.

Theft of fire: meaning in the markup of names

Willard McCarty, John Bradley

Our subject is what happens when a computational metalinguistic tagging scheme is imposed on a poetic text in order to make a subset of the data accessible to automatic processing. In our case, tagging is employed to mark 'names' in the broadest sense, i.e. all devices of language by which persons are identified. Our text is the Metamorphoses, a highly complex mythological compendium written by the Roman poet Ovid during the reign of Augustus. Its 12,000 lines of Latin hexameter contain approximately 50,000 such devices, a systematic accounting of which is the aim of the project. The result, generated automatically by software from the tagged text, is An Analytical Onomasticon to the Metamorphoses of Ovid. This is a new kind of reference work designed to help Ovidian specialists figure out how the poem might cohere. For humanities computing, however, the primary interest lies in the radical 'loss in translation' when ambiguous poetic phenomena are rendered as meta-linguistic tags, and in how this loss is turned to advantage. As one of us has argued elsewhere, the translation model is quite useful in thinking about the literary and linguistic consequences of tagging a poetic text (McCarty 1994). The radical poverty of tagging 'languages' makes the discussion about loss-in-translation, pervasive in the literature, especially valuable (Barnstone 1993). The question of this loss devolves into the more fundamental one of expression itself, where it takes the useful form of meditations on that which seems curiously to be in but not of language - George Steiner's "flame of the spirit in the momentary fixity of the letter". What happens to this flame in the act of translation? What happens to it when the target is a computational pseudo-language, therefore radically deficient for representing the rich ambiguities of poetry? A computing project that attempts to render what we loosely call 'meaning' into processible form faces such questions at every working moment.
The struggle is, of course, central to computing as a whole insofar as it models complex realities or perceptions of them with crude mechanical constructs (McCarty 1994: 278-81). For humanities computing as such, the crudity of our methods is a central issue; it is only avoided, not answered, by construing the computer as a 'mere tool' or by taking refuge in progress. Arguably we step over the most significant threshold in (or rather into) the field when we realise that the inevitable failure of all such modeling is anything but a death-sentence to our common project, but rather the source of its integrity and power. What we lose in tagging is in a sense what we gain.
In this paper we will focus closely on how the construction of a taxonomy for naming simultaneously falsifies and illuminates the text. Using quite specific examples from the Metamorphoses we will demonstrate in detail how the application of such a taxonomy to a text, in the process of encoding it, itself constitutes a kind of literary criticism. The central problem that the Onomasticon is intended to address, the coherence of the Metamorphoses as a work of literature, will require brief explication in the paper to demonstrate how well the poem serves the interests of humanities computing. Briefly, we will argue that because Ovid's poem reflects the central problem of tagging literary text - seeking "the flame of the spirit" as it transmigrates through "the momentary fixity" of bodies - the Metamorphoses presents in a radically stubborn form just the kind of challenge our common project needs to advance intellectually.
For the Metamorphoses, the spoor of this transmigrating spirit is supplied by numerous kinds of language data through which apparently disparate narrative elements are associated. For a manageable subset, we chose all references to persons, i.e. names, because persons are to be found in every story of the poem and each reference to a person can be exactly identified with particular elements of the textual data. (The same could not consistently be said, for example, of metaphors, allusions, or themes.) No particular theory on how the Metamorphoses is or could be constructed was assumed, although a theory arguing for multiple, simultaneous constructions is offered in the book. Details of the mark-up scheme and of how tags are processed are available but mostly irrelevant for present purposes. Our focus is rather on the taxonomy and how well it works, or rather how well it fails and in failing serves a scholarly purpose.
The basic taxonomy is simple. Onomastic devices are classified under the headings of proper names (including patronymics, matronymics, toponyms, et sim.), nominals (nouns and adjectives, including phrases), pronouns, verbs, and personal attributes, i.e. nominals referring to anything that in context is closely enough associated to evoke the person. Text quoted in the tag is lemmatised insofar as the syntax of the quoted segment will permit, given a taxonomic type, and assigned to the person. (More precisely, the onomastic device is given a 'standard name', or editorial lemma that in most cases is simply the name of the character but is used to identify synonymous references, to declare disparate references as such, and otherwise to assign an identity to something, such as a metamorphosed being.)
Although interpretative problems occur at every step, the most interesting ones are in assigning names to persons. This should not be surprising for a poem in which people become things and, less often, things people, but the particular demands of markup turn the expected into a surprisingly complex and revealing operation. At root is the question of what constitutes a person.
For a majority in the Metamorphoses the answer is as obvious as in daily life, but for a large number of doubtful instances, we found that the most useful way to approach the question was to look for ontological shift. Hence, any sub-human entity (animal, vegetable, mineral) is regarded as a person, and so tagged, if it undergoes metamorphosis up the 'chain of being' or is otherwise personified; if it reverts to sub-human state, it ceases to be a person and so is not tagged. Any anthropomorph that has undergone downward metamorphosis (including a god that has temporarily changed shape) is said to be a different though closely related person, 'X-in-the-form-of-Y'. Certain figures of speech that suggest but do not manifest ontological shift are similarly marked by assigning a distinct but closely related standard name: similes, 'X-compared-to-Y', and multiple persons having a corporate identity by acting as one, 'X-and-Y-and-Z'. Dis-personification of entities formerly persons (e.g. venus for sexual passion, bacchus for wine) is treated by dropping the entity as a person but including the reference as his or her attribute. Persons, then, are treated in essence as momentary constructions, apt to appear as such under certain conditions of language, or to change identity or vanish altogether when those conditions change. Thus much is obvious given the text under consideration. For the Analytical Onomasticon to be useful, however, the tagging must be governed rigorously by consistent editorial policies that specify these conditions of language. The first of these specifies a fundamental binary state: a textual phenomenon is either identified with a person or not; no 'weights' or degrees of identity are allowed. Beyond that, our policies have evolved inductively, through a long series of heuristic intermediaries, almost continuously revised during the work.
Although they are to be reviewed in the final stage of the project, they are mostly stable now and will be described briefly in the paper. The Analytical Onomasticon contains a long theoretical introduction in which they are discussed in great detail. Apart from the fact that these policies are fundamental to the usefulness of the Analytical Onomasticon, they enable the user intelligently to criticise the work, and since plans are to publish all the component materials in electronic form, to modify the tagged text effectively and so to regenerate the book along different lines. (The scope of our paper unfortunately does not allow further discussion of the publishing aspects.) More importantly for the subject of this paper, these policies by nature aim explicitly to specify how certain well-known but imperfectly understood literary phenomena actually happen. Personification has, for example, been well studied, both within and well beyond classical studies, but nowhere does one find an attempt to spell out its linguistic conditions. Similarly, metamorphosis is an obvious and well-studied topic, yet one does not find an empirical guide to its boundaries, when it can be said to occur, when not, and while it is happening, of what the process exactly consists. Tagging forces one to say, or more precisely, to make the attempt. In the paper, explication of particular examples will demonstrate such attempts and in particular focus on the literary-critical value of their failure: (1) the personification of Sol in the story of Phaethon, Met. 1.751ff, and the dis-personification of Bacchus and Venus at various points throughout; (2) Apollo's pursuit and metamorphosis of Daphne, 1.525-567, and her ambiguous status as the laurel tree later in