“Meaning and Interpretation of Markup”

C. M. Sperberg-McQueen, World Wide Web Consortium, USA; Claus Huitfeldt, University of Bergen, Norway; Allen Renear, Brown University, USA

Markup is inserted into textual material not at random, but to convey some meaning. An author may supply markup as part of the act of composing a text; in this case the markup expresses the author's intentions. The author creates certain textual structures simply by tagging them; the markup has performative significance. In other cases, markup is supplied as part of the transcription in electronic form of pre-existing material. In such cases, markup reflects the understanding of the text held by the transcriber; we say that the markup expresses a claim about the text. In the one case, markup is constitutive of the meaning; in the other, it is interpretive. In each case, the reader (for all practical purposes, readers include software which processes marked up documents) may legitimately use the markup to make inferences about the structure and properties of the text. For this reason, we say that markup licenses certain inferences about the text.

If markup has meaning, it seems fair to ask how to identify the meaning of the markup used in a document, and how to document the meaning assigned to particular markup constructs by specifications of markup languages (e.g. by DTDs and their documentation). In this paper, we propose an account of how markup licenses inferences, and how to tell, for a given marked up text, what inferences are actually licensed by its markup. As a side effect, we will also provide an account of what is needed in a specification of the meaning of a markup language.

We begin by proposing a simple method of expressing the meaning of SGML or XML element types and attributes; we then identify a fundamental distinction between distributive and sortal features of texts, which affects the interpretation of markup. We describe a simple model of interpretation for markup, and note various ways in which it must be refined in order to handle standard patterns of usage in existing markup schemes; this allows us to define a simple measure of complexity, which allows direct comparison of the complexity of different ways of expressing the same information (i.e. licensing the same inferences) about a given text, using markup.

For simplicity, we formulate our discussion in terms of SGML or XML markup, applied to documents or texts. Similar arguments can be made for other uses of SGML and XML, and may be possible for some other families of markup language.

Related work has been done by Simons (in the context of translating between marked up texts and database systems), Sperberg-McQueen and Burnard (in an informal introduction to the TEI), Langendoen and Simons (also with respect to the TEI), Huitfeldt and others in Bergen (in discussions of the Wittgenstein Archive at the University of Bergen, and in critiques of SGML), Renear and others at Brown University, and Welty and Ide (in a description of systems which draw inferences from markup). Much of this earlier work, however, has focused on questions of subjectivity and objectivity in text markup, or on the nature of text, and the like. The approach taken in this paper is somewhat more formal, while still much less formal and rigorous than that taken by Wadler in his recent work on XSLT.

Let us begin with a concrete example. Among the papers of the American historical figure Henry Laurens is a draft Laurens prepared of a letter to be sent from the Commons House of Assembly of South Carolina to the royal governor, Lord William Campbell, in 1775. Some words have lines through them, and others are written above the line. The editors of Laurens's papers interpret the lines through words as cancellations, and the words above the lines as insertions; an electronic version of the document, using TEI markup and reflecting these interpretations, might read thus:
<DEL>It was be</DEL> <DEL>For</DEL> When we applied to Your Excellency for leave to adjourn it was because we foresaw that we <DEL>were</DEL> <ADD>should continue</ADD> wasting our own time ...
From the DEL elements, the reader of the document is licensed to infer that the letters "It was be", "For", and "were" are marked as deleted; from the ADD element, the reader may infer that the words "should continue" have been added. Software might rely on these inferences in the course of making a concordance or displaying a clear text; human readers will rely on them in interpreting the historical document. Note that the markup here stops short of licensing the inference that "should continue" was substituted for "were". The editors could license that inference as well by appropriate markup, if they wished. Human readers may make the inference on their own, given the linguistic context; software cannot safely infer a substitution every time an addition is adjacent to a deletion.

A simple way to capture the meaning of markup is to define, for each markup construct, a set of open sentences - sentences with unbound variables - which express the inferences licensed by the use of that construct. In formal reasoning, such open sentences may be transformed into logical predicates in the usual way. For example, the TEI element type DEL is said by the documentation to mark "a letter, word or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, annotator or corrector" (TEI P3, p. 922). We take this to mean that when a DEL element is encountered in a document, the reader is licensed to infer that the material so marked has been deleted. In formal contexts, we may write "deleted(X)"; we can specify the meaning of the DEL element and of the logical predicate "deleted(X)" by means of an open sentence: "X has been deleted, or marked as deleted, or ..." etc.
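
As an illustration only, the pairing of element types with open sentences might be sketched in Python as follows. The table OPEN_SENTENCES, the function licensed_inferences, the wording of the sentences, and the enclosing TEXT element (added so that the fragment parses as well-formed XML) are all assumptions of this sketch, not part of the TEI or of the account given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: element type -> (predicate name, open sentence).
# The sentence wording paraphrases the discussion; it is not TEI text.
OPEN_SENTENCES = {
    "DEL": ("deleted", "{X!r} has been deleted or marked as deleted"),
    "ADD": ("added", "{X!r} has been added"),
}

# The Laurens draft, wrapped in an assumed root element TEXT.
SAMPLE = (
    "<TEXT><DEL>It was be</DEL> <DEL>For</DEL> When we applied to Your "
    "Excellency for leave to adjourn it was because we foresaw that we "
    "<DEL>were</DEL> <ADD>should continue</ADD> wasting our own time ...</TEXT>"
)

def licensed_inferences(xml_text):
    """Return (predicate, X, sentence) for each element with a known meaning."""
    results = []
    for element in ET.fromstring(xml_text).iter():  # document order
        if element.tag in OPEN_SENTENCES:
            predicate, sentence = OPEN_SENTENCES[element.tag]
            x = "".join(element.itertext())  # bind X to the element's contents
            results.append((predicate, x, sentence.format(X=x)))
    return results

for predicate, x, sentence in licensed_inferences(SAMPLE):
    print(f"{predicate}({x!r}): {sentence}")
```

Nothing in the table yields a substitution predicate, so the sketch, like the markup itself, stops short of inferring that "should continue" replaced "were".
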
The variable X is to be bound, in practice, to the contents of the DEL element. If we imagine a variable named 'this', instantiated to each element of a document in turn, and a function 'contents' which returns the contents of its argument, then the meaning of the DEL element becomes "deleted(contents(this))", or equivalently "contents(this) has been deleted ..." etc. The TEI element type HI, similarly, "marks [its contents] as graphically distinct from the surrounding text" (TEI P3, p. 1013). We can capture the meaning of HI by the open sentence "X is graphically distinct from the surrounding text", or "highlighted(X)", where X is, as before, to be replaced by "contents(this)". Attributes may be treated similarly. The 'rend' attribute on the HI element "describes the rendition or presentation of the word or phrase highlighted". In the example
<HI REND="gothic">And this Indenture further witnesseth</HI> that the said <HI REND="italic">Walter Shandy</HI>, merchant, in consideration of the said intended marriage ... 
the HI elements convey the information that the contents of those elements are distinct from their surroundings, while the 'rend' attributes on the HI elements specify how. The meaning of the 'rend' attribute is expressed by the open sentence "X is rendered in style Y." An HI element with a 'rend' attribute thus means "X is graphically distinct from its surroundings, and X is rendered in style Y".

Perhaps the simplest method of interpreting markup is to assume that

1. The meaning of every element type is expressed by an open sentence whose single unbound variable is to be bound to 'contents(this)'.
2. The meaning of every attribute is expressed by an open sentence with two unbound variables, one of which is to be bound to 'contents(this)' and the other to 'value(this,attribute-name)' (i.e. to the value of the attribute in question). In other words, each attribute defines some relation R which holds between the contents of the element and the value of the attribute.
3. All inferences licensed by any two elements are compatible.
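
Read together, these three assumptions suggest a straightforward procedure: the inferences that apply to a run of text are simply those licensed by every element that contains it. A minimal Python sketch, assuming the Walter Shandy fragment is wrapped in a hypothetical P element and that inferences are represented as plain strings, might look like this (the function names are illustrative only):

```python
import xml.etree.ElementTree as ET

# The Walter Shandy example, wrapped in an assumed root element P.
SAMPLE = (
    '<P><HI REND="gothic">And this Indenture further witnesseth</HI> '
    'that the said <HI REND="italic">Walter Shandy</HI>, merchant</P>'
)

def element_inferences(element):
    """Inferences licensed by one element: its type plus its REND attribute."""
    found = set()
    if element.tag == "HI":
        found.add("highlighted")
        if "REND" in element.attrib:
            found.add(f"rendered in style {element.get('REND')}")
    return found

def union_model(element, inherited=frozenset()):
    """Yield (text, inferences) for each run of text, in document order."""
    here = inherited | element_inferences(element)
    if element.text:
        yield element.text, here
    for child in element:
        yield from union_model(child, here)
        if child.tail:  # text following a child belongs to the parent
            yield child.tail, here

for text, inferences in union_model(ET.fromstring(SAMPLE)):
    print(repr(text), sorted(inferences))
```

Because the sketch only ever adds inferences while descending the tree, it embodies the third assumption: it has no way to notice or resolve a conflict between an outer and an inner element.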

The set of inferences applicable to any given location L is then the union of the inferences licensed by all the elements within which L is contained. Let us call this the 'union model' of interpretation.

The union model is simple, and provides a good first approximation of the rules of inference for marked up text. But it is not wholly adequate.

First, it fails to distinguish distributed properties (such as 'italic' or 'highlighted') from sortal properties (such as paragraphs, sections, or - as illustrated above - deletion). It is as true to say "The word 'And' is in black-letter" as to say it of the entire phrase, and the meaning of the example given above would not change if the HI elements were split into two or more adjacent pieces each with the same 'rend' value. Conversely, two HI elements with the same attribute values can be merged without changing the meaning of the markup. Other elements mark properties which are NOT distributed equally among the contents, and cannot be split or joined without changing the meaning of the markup. From the markup
<P>Reader, I married him.</P>
we can infer the existence of one paragraph, but we cannot infer that "Reader" is itself a paragraph. Such properties we call 'sortal' properties, borrowing a term of art from linguistics. Elements marking sortals are usefully countable; those marking distributed properties are not.

Second, the union model fails to allow a correct interpretation of inherited values and overrides, as illustrated by the TEI 'lang' attribute or the xml:lang attribute of XML. In fact, some inferences do contradict each other, and specifications of the meaning of markup need to say which inferences are compatible, and which are in conflict, and how to adjudicate conflicts.

Third, the union model allows inferences about a location L only on the basis of markup on open elements (those which contain L); in order to handle common idioms of SGML and XML, a model of interpretation must handle

upward propagation: the meaning of an element may depend in part on its contents; this is unusual in colloquial SGML/XML systems, but is a regular feature of proposals to eliminate attributes from markup languages.
context dependency: the meaning of an element may depend on its context; trivial examples include TEI's HI and FOREIGN, which can mean 'not-Roman' and 'not-English' in one context, and 'not-italic' and 'not-German' in others.
ordinal position, relative or absolute: dependence of meaning upon ordinal position is seldom an explicit feature of markup languages, but dependence of processing based on position is a standard feature of style-sheet languages.
milestone elements: these convey information by position in the beginning-to-end scan of the linear form of the document, rather than by position in the tree.
linking: out-of-line or 'standoff' markup conveys information about location L based not only on open elements, but on elements which point at L or some ancestor of L.
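
To make one of these idioms concrete, the following Python sketch handles milestone elements. It assumes a hypothetical empty element PB whose N attribute gives the number of the page beginning at that point, and it assigns a page to each run of text by scanning the document from beginning to end rather than by consulting the elements that contain the text. The sample document and the function name pages_by_scan are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: PB is an empty milestone element marking a page break.
SAMPLE = (
    '<TEXT><PB N="1"/><P>First paragraph, which runs <PB N="2"/>across a '
    'page break.</P><P>Second paragraph.</P></TEXT>'
)

def pages_by_scan(xml_text):
    """Pair each run of text with the most recent PB seen in document order."""
    current_page = None
    runs = []

    def visit(element):
        nonlocal current_page
        if element.tag == "PB":
            current_page = element.get("N")  # the milestone updates scan state
        if element.text:
            runs.append((current_page, element.text))
        for child in element:
            visit(child)
            if child.tail:
                runs.append((current_page, child.tail))

    visit(ET.fromstring(xml_text))
    return runs

for page, text in pages_by_scan(SAMPLE):
    print(page, repr(text))
```

In this sample the second paragraph is assigned to page 2 even though no element containing it says so; the information comes from a PB element that is neither its ancestor nor its sibling, which is what places milestones outside the union model.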

Other methods of associating markup with meaning are imaginable, but we believe a survey of existing DTDs will show that all or virtually all current practice is covered by any model of interpretation which encompasses the complications just outlined. Essentially, these can be handled by extending the rules for binding variables in the open sentences which specify the meaning of a given markup construct. The simple union model allows only 'contents(this)' and 'value(this,attribute-name)'; the constructs listed above require more complex expressions, roughly equivalent in expressiveness to the TEI extended-pointer notation or to the patterns of the XPath language defined by W3C. Complexity of the semantics associated with an element type or attribute may be measured by the number of unbound variables in the open slots, by the complexity of the expressions which are to fill them, and by the amount or kind of memory required to allow full generation of the inferences licensed by markup in a particular text.
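
The difference between the simple and the extended binding rules can be sketched in Python. The first two functions below correspond to 'contents(this)' and 'value(this,attribute-name)'; the third, inherited_value, is a hypothetical extended rule which binds a variable to the nearest value of an attribute on the element or one of its ancestors, so that an inner value overrides an outer one. The sample document, the unprefixed LANG attribute, and all function names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a language declared on the root and overridden
# on an inner element.
SAMPLE = (
    '<TEXT LANG="en"><P>A paragraph that mentions a '
    '<FOREIGN LANG="de">Weltanschauung</FOREIGN> in passing.</P></TEXT>'
)

root = ET.fromstring(SAMPLE)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def contents(this):
    """Simple rule: the text contents of the element."""
    return "".join(this.itertext())

def value(this, name):
    """Simple rule: the attribute value on the element itself, or None."""
    return this.get(name)

def inherited_value(this, name):
    """Extended rule: nearest value on the element or an ancestor, or None."""
    node = this
    while node is not None:
        if name in node.attrib:
            return node.get(name)
        node = parent_of.get(node)
    return None

p = root.find("P")
foreign = root.find("P/FOREIGN")
print(value(p, "LANG"), inherited_value(p, "LANG"))
print(contents(foreign), inherited_value(foreign, "LANG"))
```

The extended rule needs the parent map, that is, memory of the element's context, which the two simple rules do not; this is the kind of additional cost the complexity measure is meant to register.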

References

Steve DeRose et al. “What is Text, Really?” Journal of Computing in Higher Education. 1990. 1: 3-26.

Claus Huitfeldt. “Multi-Dimensional Texts in a One-Dimensional Medium.” Computers and the Humanities. 1995. 28: 235-241.

D. Terence Langendoen and Gary F. Simons. “Rationale for the TEI Recommendations for Feature-Structure Markup.” Computers and the Humanities. 1995. 29: 191-209.

[Laurens, Henry.] “Commons House of Assembly to Lord William Campbell.” The Papers of Henry Laurens. Ed. David R. Chesnutt et al. Columbia, S.C.: University of South Carolina Press, 1985. Vol. 10: 305-308.

Alois Pichler. “What is Transcription, Really?” ACH/ALLC '93, Georgetown, 1993.

Allen Renear, David G. Durand, and Elli Mylonas. “Refining our notion of what text really is: the problem of overlapping hierarchies.” Research in Humanities Computing. Oxford: Oxford University Press, 1995.

Gary F. Simons. “Conceptual Modeling versus Visual Modeling: A Technological Key to Building Consensus.” Computers and the Humanities. 1997. 30: 303-319.

Guidelines for Electronic Text Encoding and Interchange (TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: ACH, ALLC, and ACL, 1994.

C. M. Sperberg-McQueen and Lou Burnard. “The Design of the TEI Encoding Scheme.” Computers and the Humanities. 1995. 29: 17-39.

Philip Wadler. “A formal semantics of patterns in XSLT.” Paper presented at Markup Technologies '99, 1999.

Christopher Welty and Nancy Ide. “Using the Right Tools: Enhancing Retrieval from Marked-up Documents.” Computers and the Humanities. 1999. 33: 58-84.