“Meaning and Interpretation of Markup”
C. M. Sperberg-McQueen, World Wide Web Consortium, USA
Claus Huitfeldt, University of Bergen, Norway
Allen Renear, Brown University, USA
Markup is inserted into textual material not at random, but to convey some
meaning. An author may supply markup as part of the act of composing a text; in
this case the markup expresses the author's intentions. The author creates
certain textual structures simply by tagging them; the markup has performative
significance. In other cases, markup is supplied as part of the transcription in
electronic form of pre-existing material. In such cases, markup reflects the
understanding of the text held by the transcriber; we say that the markup
expresses a claim about the text.
In the one case, markup is constitutive of the meaning; in the other, it is
interpretive. In each case, the reader (for all practical purposes, readers
include software which processes marked up documents) may legitimately use the
markup to make inferences about the structure and properties of the text. For
this reason, we say that markup licenses certain inferences about the text.
If markup has meaning, it seems fair to ask how to identify the meaning of the
markup used in a document, and how to document the meaning assigned to
particular markup constructs by specifications of markup languages (e.g. by DTDs
and their documentation).
In this paper, we propose an account of how markup licenses inferences, and how
to tell, for a given marked up text, what inferences are actually licensed by
its markup. As a side effect, we will also provide an account of what is needed
in a specification of the meaning of a markup language. We begin by proposing a
simple method of expressing the meaning of SGML or XML element types and
attributes; we then identify a fundamental distinction between distributive and
sortal features of texts, which affects the interpretation of markup. We
describe a simple model of interpretation for markup, and note various ways in
which it must be refined in order to handle standard patterns of usage in
existing markup schemes; this in turn lets us define a simple measure of
complexity, which permits direct comparison of the complexity of different ways
of expressing the same information (i.e. licensing the same inferences) about a
given text using markup.
For simplicity, we formulate our discussion in terms of SGML or XML markup,
applied to documents or texts. Similar arguments can be made for other uses of
SGML and XML, and may be possible for some other families of markup
language.
Related work has been done by Simons (in the context of translating between
marked up texts and database systems), Sperberg-McQueen and Burnard (in an
informal introduction to the TEI), Langendoen and Simons (also with respect to
the TEI), Huitfeldt and others in Bergen (in discussions of the Wittgenstein
Archive at the University of Bergen, and in critiques of SGML), Renear and
others at Brown University, and Welty and Ide (in a description of systems which
draw inferences from markup). Much of this earlier work, however, has focused on
questions of subjectivity and objectivity in text markup, or on the nature of
text, and the like. The approach taken in this paper is somewhat more formal,
while still much less formal and rigorous than that taken by Wadler in his
recent work on XSLT.
Let us begin with a concrete example. Among the papers of the American historical
figure Henry Laurens is a draft Laurens prepared of a letter to be sent from the
Commons House of Assembly of South Carolina to the royal governor, Lord William
Campbell, in 1775. Some words have lines through them, and others are written
above the line. The editors of Laurens's papers interpret the lines through
words as cancellations, and the words above the lines as insertions; an
electronic version of the document, using TEI markup and reflecting these
interpretations, might read thus:
<P><DEL>It was be</DEL> <DEL>For</DEL> When we applied to Your Excellency for leave to adjourn it was because we foresaw that we <DEL>were</DEL> <ADD>should continue</ADD> wasting our own time ... </P>
From the DEL elements, the reader of the document is licensed to infer that the letters "It was be", "For", and "were" are marked as deleted; from the ADD element, the reader may infer that the words "should continue" have been added. Software might rely on these inferences in the course of making a concordance or displaying a clear text; human readers will rely on them in interpreting the historical document.
Note that the markup here stops short of licensing the inference that "should continue" was substituted for "were". The editors could license that inference as well by appropriate markup, if they wished. Human readers may make the inference on their own, given the linguistic context; software cannot safely infer a substitution every time an addition is adjacent to a deletion.
A simple way to capture the meaning of markup is to define, for each markup construct, a set of open sentences - sentences with unbound variables - which express the inferences licensed by the use of that construct. In formal reasoning, such open sentences may be transformed into logical predicates in the usual way. For example, the TEI element type DEL is said by the documentation to mark "a letter, word or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, annotator or corrector" (TEI P3, p. 922). We take this to mean that when a DEL element is encountered in a document, the reader is licensed to infer that the material so marked has been deleted. In formal contexts, we may write "deleted(X)"; we can specify the meaning of the DEL element and of the logical predicate "deleted(X)" by means of an open sentence: "X has been deleted, or marked as deleted, or ..." etc.
The variable X is to be bound, in practice, to the contents of the DEL element. If we imagine a variable named 'this', instantiated to each element of a document in turn, and a function 'contents' which returns the contents of its argument, then the meaning of the DEL element becomes "deleted(contents(this))", or equivalently "contents(this) has been deleted ..." etc.
The TEI element type HI, similarly, "marks [its contents] as graphically distinct from the surrounding text" (TEI P3, p. 1013). We can capture the meaning of HI by the open sentence "X is graphically distinct from the surrounding text", or "highlighted(X)", where X is, as before, to be replaced by "contents(this)".
Attributes may be treated similarly. The 'rend' attribute on the HI element "describes the rendition or presentation of the word or phrase highlighted". In the example
<P><HI REND="gothic">And this Indenture further witnesseth</HI> that the said <HI REND="italic">Walter Shandy</HI>, merchant, in consideration of the said intended marriage ... </P>
the HI elements convey the information that the contents of those elements are distinct from their surroundings, while the 'rend' attributes on the HI elements specify how. The meaning of the 'rend' attribute is expressed by the open sentence "X is rendered in style Y." An HI element with a 'rend' attribute thus means "X is graphically distinct from its surroundings, and X is rendered in style Y".
Perhaps the simplest method of interpreting markup is to assume that
1. The meaning of every element type is expressed by an open sentence whose single unbound variable is to be bound to 'contents(this)'.
2. The meaning of every attribute is expressed by an open sentence with two unbound variables, one of which is to be bound to 'contents(this)' and the other to 'value(this,attribute-name)' (i.e. to the value of the attribute in question). In other words, each attribute defines some relation R which holds between the contents of the element and the value of the attribute.
3. All inferences licensed by any two elements are compatible.
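The three assumptions above can be made concrete with a small sketch. It is an illustration only, not an implementation described in this paper: the dictionaries `ELEMENT_SENTENCES` and `ATTRIBUTE_SENTENCES`, the function names, and the shortened wording of the open sentences are all hypothetical, and the sample is parsed as case-sensitive XML rather than SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical open sentences, one per element type; {X} is the unbound variable.
ELEMENT_SENTENCES = {
    "DEL": "{X!r} has been deleted",
    "HI": "{X!r} is graphically distinct from the surrounding text",
}

# Hypothetical open sentences for attributes; {X} is the element's contents
# and {Y} is the attribute value.
ATTRIBUTE_SENTENCES = {
    "REND": "{X!r} is rendered in style {Y!r}",
}


def contents(element):
    """Return the character data of an element, with whitespace normalised."""
    return " ".join("".join(element.itertext()).split())


def licensed_inferences(xml_text):
    """Bind 'this' to each element in turn and instantiate its open sentences."""
    root = ET.fromstring(xml_text)
    inferences = []
    for this in root.iter():  # document order, starting with the root
        sentence = ELEMENT_SENTENCES.get(this.tag)
        if sentence is not None:
            inferences.append(sentence.format(X=contents(this)))
        for name, value in this.attrib.items():
            relation = ATTRIBUTE_SENTENCES.get(name)
            if relation is not None:
                inferences.append(relation.format(X=contents(this), Y=value))
    return inferences


SAMPLE = (
    '<P><HI REND="gothic">And this Indenture further witnesseth</HI> '
    'that the said <HI REND="italic">Walter Shandy</HI>, merchant</P>'
)

for line in licensed_inferences(SAMPLE):
    print(line)
```

Because the sketch takes assumption 3 for granted, it simply collects every instantiated sentence and never checks whether two of them conflict.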
From the markup
<P>Reader, I married him.</P>
we can infer the existence of one paragraph, but we cannot infer that "Reader" is itself a paragraph. Such properties we call 'sortal' properties, borrowing a term of art from linguistics. Elements marking sortals are usefully countable; those marking distributed properties are not.
Second, the union model fails to allow a correct interpretation of inherited values and overrides, as illustrated by the TEI 'lang' attribute or the xml:lang attribute of XML. In fact, some inferences do contradict each other, and specifications of the meaning of markup need to say which inferences are compatible, which are in conflict, and how to adjudicate conflicts.
Third, the union model allows inferences about a location L only on the basis of markup on open elements (those which contain L); in order to handle common idioms of SGML and XML, a model of interpretation must handle
- upward propagation: the meaning of an element may depend in part on its contents; this is unusual in colloquial SGML/XML systems, but is a regular feature of proposals to eliminate attributes from markup languages.
- context dependency: the meaning of an element may depend on its context; trivial examples include TEI's HI and FOREIGN, which can mean 'not-Roman' and 'not-English' in one context, and 'not-italic' and 'not-German' in others.
- ordinal position, relative or absolute; dependence of meaning upon ordinal position is seldom an explicit feature of markup languages, but dependence of processing based on position is a standard feature of style-sheet languages.
- milestone elements; these convey information by position in the beginning-to-end scan of the linear form of the document, rather than by position in the tree.
- linking: out-of-line or 'standoff' markup conveys information about location L based not only on open elements, but on elements which point at L or some ancestor of L.
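The inheritance-and-override problem raised above for xml:lang can be illustrated with a short sketch. It follows the rule in the XML specification that a language declaration applies to an element's content unless overridden on an element within that content. The function name `language_of_text` and the sample sentence are hypothetical; the sketch does not cover the other items in the list (milestones, linking, ordinal position).

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared 'xml' prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_of_text(xml_text, default=None):
    """Pair each run of character data with the language in force at that point.

    The nearest enclosing element that carries xml:lang wins, so an inner
    declaration overrides an outer one instead of being added to it.
    """
    root = ET.fromstring(xml_text)
    runs = []

    def visit(element, inherited):
        lang = element.get(XML_LANG, inherited)
        if element.text and element.text.strip():
            runs.append((element.text.strip(), lang))
        for child in element:
            visit(child, lang)
            # Tail text follows the child's end tag, so it belongs to the parent.
            if child.tail and child.tail.strip():
                runs.append((child.tail.strip(), lang))

    visit(root, default)
    return runs


SAMPLE = (
    '<P xml:lang="en">He muttered <FOREIGN xml:lang="de">Zeitgeist</FOREIGN> '
    "and left.</P>"
)

print(language_of_text(SAMPLE))
# [('He muttered', 'en'), ('Zeitgeist', 'de'), ('and left.', 'en')]
```

A pure union of the inferences from all open elements would assert both "en" and "de" of the word "Zeitgeist"; the override rule is one way of adjudicating that conflict.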
References
Steve DeRose et al. “What is Text, Really?” Journal of Computing in Higher Education. 1990. 1: 3-26.
Claus Huitfeldt. “Multi-Dimensional Texts in a One-Dimensional Medium.” Computers and the Humanities. 1995. 28: 235-241.
D. Terence Langendoen and Gary F. Simons. “Rationale for the TEI Recommendations for Feature-Structure Markup.” Computers and the Humanities. 1995. 29: 191-209.
[Laurens, Henry.] “Commons House of Assembly to Lord William Campbell.” The Papers of Henry Laurens. Ed. David R. Chesnutt et al. Columbia, S.C.: University of South Carolina Press, 1985. Vol. 10: 305-308.
Alois Pichler. “What is Transcription, Really?” ACH/ALLC '93, Georgetown, 1993.
Allen Renear, David G. Durand, and Elli Mylonas. “Refining our notion of what text really is: the problem of overlapping hierarchies.” Research in Humanities Computing. Oxford: Oxford University Press, 1995.
Gary F. Simons. “Conceptual Modeling versus Visual Modeling: A Technological Key to Building Consensus.” Computers and the Humanities. 1997. 30: 303-319.
Guidelines for Electronic Text Encoding and Interchange (TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: ACH, ALLC, and ACL, 1994.
C. M. Sperberg-McQueen and Lou Burnard. “The Design of the TEI Encoding Scheme.” Computers and the Humanities. 1995. 29: 17-39.
Philip Wadler. “A formal semantics of patterns in XSLT.” Paper presented at Markup Technologies '99, 1999.
Christopher Welty and Nancy Ide. “Using the Right Tools: Enhancing Retrieval from Marked-up Documents.” Computers and the Humanities. 1999. 33: 58-84.