“A logic programming environment for document semantics
and inference”
David Dubin
University of Illinois at Champaign-Urbana
ddubin@uiuc.edu
Michael Sperberg-McQueen
World Wide Web Consortium, USA
cmsmcq@acm.org
Allen Renear
University of Illinois at Champaign-Urbana
renear@alexia.lis.uiuc.edu
Claus Huitfeldt
University of Bergen, Norway
Claus.Huitfeldt@hit.uib.no
Recently Sperberg-McQueen and others have argued that markup functions by
licensing inferences about a text. They remark, however, that the information
warranting such inferences may not be entirely explicit in the syntax of the
markup language used to encode the text (Sperberg-McQueen et al., 2001).
For example, a language defined in SGML or XML may include an attribute (such as
'lang') that an encoder may apply to an element with the generic identifier
'QUOT'. One might then infer that the QUOT element marks an identifiable
component of the document (called a quotation) and that the quotation has the
property of being in a particular language (as indicated by the 'lang'
attribute). It may also be valid to infer that children of the 'QUOT' element
share the property of being in that language, unless overridden with a language
attribute of their own. On the other hand, there may not be such a simple
one-to-one mapping between components and elements: for example, a single
quotation may be broken across two or more 'QUOT' elements.
There are a number of other inferences that are typically assumed by tag set
designers and application designers alike, but which cannot be formally
expressed in the DTD, and may or may not be informally expressed in the tag set
documentation.
To represent such inferences (the "meaning of markup") adequately, the
Sperberg-McQueen group developed techniques for expressing in predicate logic
(i) the facts signalled by the encoding of a particular document instance and
(ii) the logical relationships commonly understood to exist and to license
further inferences. A Prolog database was used to demonstrate the effectiveness
of this approach.
The present paper builds directly on this previous work and reports new results:
more rigorous and explanatory layers of abstraction, and progress in
understanding problems with "deictic" expressions, domains of variables, and
related issues. The fundamental new result, however, is a complete, integrated
working system with a substantially redesigned Prolog database at its core. The
database has been redesigned to reflect the theoretical results more closely and
to improve functionality, flexibility, and performance.
The system permits an analyst to specify facts about the markup syntax (e.g.,
generic identifiers and attribute values) separately from facts and rules of
inference about semantic entities and properties. The system provides a level of
abstraction at which the performative or interpretive meaning of the markup can
be explicitly represented in machine-readable and executable form. Inferences
can then be drawn regarding document components, including problematic
structures, such as those participating in overlapping hierarchies.
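The separation described above can be illustrated with a minimal sketch, written in Python rather than the Prolog of the system itself. The tuple layouts, rule tables, and the names `infer`, `COMPONENT_RULES`, and `PROPERTY_RULES` are assumptions made for illustration; they are not taken from the system. For simplicity the sketch reuses node identifiers for the semantic entities.

```python
# Syntactic facts, as a parser might report them (hypothetical layout):
#   (node_id, generic_identifier) and (node_id, attribute_name, value)
ELEMENTS = {("n1", "QUOT"), ("n2", "P")}
ATTRIBUTES = {("n1", "lang", "de")}

# Semantic rules, stated separately from the syntax (illustrative only).
# Generic identifier -> kind of document component it is taken to signal.
COMPONENT_RULES = {"QUOT": "quotation", "P": "paragraph"}
# Attribute name -> property it is taken to assert of that component.
PROPERTY_RULES = {"lang": "in_language"}


def infer(elements, attributes, component_rules, property_rules):
    """Derive semantic facts from syntactic facts using the analyst's rules."""
    facts = set()
    for node, gi in elements:
        if gi in component_rules:
            facts.add(("instance_of", node, component_rules[gi]))
    for node, name, value in attributes:
        if name in property_rules:
            facts.add((property_rules[name], node, value))
    return facts


SEMANTIC_FACTS = infer(ELEMENTS, ATTRIBUTES, COMPONENT_RULES, PROPERTY_RULES)
```

Because the rule tables are data, an analyst could revise the interpretation of a tag set without touching the syntactic facts.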
The new Prolog database is integrated with an SGML/XML parser so that SGML and
XML instances can be input and output. Facts and rules of inference concerning
the document are expressed in Prolog's standard declarative syntax. We have
developed a collection of predicates that emulate a subset of the W3C's Document
Object Model methods for navigating the hierarchical structure of nodes, and
retrieving attribute values and information from the document type definition.
These predicates afford a clear separation of the syntactic information captured
by the parser and the document semantics expressed by the analyst.
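As a rough illustration of such accessors, the following Python sketch (not the system's Prolog) defines a few functions over a flat store of parse-tree records. The store layout and the function names are hypothetical; they loosely echo DOM method names but do not reproduce DOM behaviour exactly (for example, a missing attribute yields `None` here).

```python
# Hypothetical flat store of parse-tree facts: node id -> record.
NODES = {
    "n1": {"gi": "TEXT", "parent": None, "children": ["n2", "n3"], "attrs": {}},
    "n2": {"gi": "P", "parent": "n1", "children": ["n4"], "attrs": {"lang": "en"}},
    "n3": {"gi": "P", "parent": "n1", "children": [], "attrs": {}},
    "n4": {"gi": "QUOT", "parent": "n2", "children": [], "attrs": {"lang": "de"}},
}


def node_name(node):
    """Return the generic identifier of a node."""
    return NODES[node]["gi"]


def parent_node(node):
    """Return the parent's id, or None for the root."""
    return NODES[node]["parent"]


def child_nodes(node):
    """Return the ids of a node's children in document order."""
    return list(NODES[node]["children"])


def next_sibling(node):
    """Return the following sibling's id, or None if there is none."""
    parent = parent_node(node)
    if parent is None:
        return None
    siblings = child_nodes(parent)
    position = siblings.index(node)
    return siblings[position + 1] if position + 1 < len(siblings) else None


def get_attribute(node, name):
    """Return an attribute value, or None if the attribute is absent."""
    return NODES[node]["attrs"].get(name)
```

Semantic rules written against these accessors never touch `NODES` directly, so the storage layout could change without affecting them.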
Another collection of predicates supports deictic expression resolution. These
allow rules of inference to include location-relative pointing from one part of
the document to another. For example, we have predicates for resolving an
element's closest ancestor having a particular generic identifier, attribute, or
attribute-value pair. Another set of predicates identifies the nearest element
of a particular type in the linear order of the document (rather than the
closest in the hierarchy). A third set of predicates supports the tracing of
connections across elements, such as those linked by ID and IDREF attribute
pairs.
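The three kinds of resolution just described might be sketched as follows, in Python rather than Prolog. The record layout, the element and attribute names (`PTR`, `target`, `id`), and the function names are all assumptions for illustration. The linear search here looks only backwards in document order, which is one possible reading of "most closely".

```python
# Hypothetical node records in document order: (id, gi, parent, attrs).
DOC = [
    ("n1", "DIV", None, {"lang": "en"}),
    ("n2", "NOTE", "n1", {"id": "note1"}),
    ("n3", "P", "n1", {}),
    ("n4", "QUOT", "n3", {}),
    ("n5", "PTR", "n3", {"target": "note1"}),
]
BY_ID = {record[0]: record for record in DOC}
ORDER = [record[0] for record in DOC]


def closest_ancestor(node, gi=None, attr=None, value=None):
    """Nearest ancestor matching a generic identifier, attribute, or pair."""
    current = BY_ID[node][2]
    while current is not None:
        _, current_gi, parent, attrs = BY_ID[current]
        gi_ok = gi is None or current_gi == gi
        attr_ok = attr is None or attr in attrs
        value_ok = value is None or (attr is not None and attrs.get(attr) == value)
        if gi_ok and attr_ok and value_ok:
            return current
        current = parent
    return None


def nearest_preceding(node, gi):
    """Nearest earlier node of the given type in linear document order."""
    for candidate in reversed(ORDER[:ORDER.index(node)]):
        if BY_ID[candidate][1] == gi:
            return candidate
    return None


def resolve_idref(node, ref_attr):
    """Follow an IDREF-style attribute to the node carrying the matching id."""
    target = BY_ID[node][3].get(ref_attr)
    if target is None:
        return None
    for node_id, _, _, attrs in DOC:
        if attrs.get("id") == target:
            return node_id
    return None
```

In this sample, `NOTE` is not an ancestor of the `QUOT` element, so hierarchical resolution fails where linear resolution succeeds.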
Rules (axioms) represent the further logical relationships mentioned above, such
as the defeasible inheritance of properties and the distribution of distributive
properties.
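Defeasible inheritance of a property such as language can be sketched in a few lines. The Python below is an illustration under assumed names (`TREE`, `effective_value`, `holders`), not the system's Prolog axioms. It treats distribution simply as the set of nodes that end up holding a value once overrides are taken into account.

```python
# Hypothetical tree: node id -> (parent id, explicitly stated attributes).
TREE = {
    "n1": (None, {"lang": "en"}),
    "n2": ("n1", {}),
    "n3": ("n2", {"lang": "de"}),
    "n4": ("n3", {}),
}


def effective_value(node, attr):
    """Defeasible inheritance: a node's own value wins; otherwise the
    value stated on its nearest ancestor applies."""
    current = node
    while current is not None:
        parent, attrs = TREE[current]
        if attr in attrs:
            return attrs[attr]
        current = parent
    return None


def holders(attr, value):
    """All nodes to which the value is distributed, after overrides."""
    return {node for node in TREE if effective_value(node, attr) == value}
```

Here the `lang` value stated on `n3` defeats the value inherited from `n1`, and the override in turn passes down to `n4`.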
In developing the architecture of this system, we have adopted an object-oriented
strategy: each node identified by the parser and each semantic entity
instantiated via a rule of inference has a unique identifier assigned by the
system. Predicates for retrieving or manipulating information about these
objects are written with the aim of hiding the underlying data structure. The
system architecture can therefore be understood to have several distinct layers
of representation:
1. A parser that handles the serialized document instance.
2. Predicates for processing the output of the parser.
3. Predicates for storing a representation of the parse tree in the Prolog database.
4. Predicates for emulating DOM methods, deictic expression resolution, object instantiation, and general characteristics of properties (such as their inheritance and distribution).
5. Facts and rules of inference expressing the document semantics.
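The layering above can be mimicked end to end in a short Python sketch. It is an analogy only: the system itself is written in Prolog and accepts SGML as well as XML, whereas this sketch uses Python's standard `xml.etree.ElementTree` parser, which handles XML only. The function names, the record layout, and the shared identifier counter are assumptions for illustration.

```python
import itertools
import xml.etree.ElementTree as ET

# One shared counter, so parse-tree nodes and semantic entities never collide.
_counter = itertools.count(1)


def new_id(prefix):
    """Assign a unique identifier (hypothetical naming scheme)."""
    return f"{prefix}{next(_counter)}"


def parse(serialized):
    """Layer 1: a parser handles the serialized instance."""
    return ET.fromstring(serialized)


def store(root):
    """Layers 2 and 3: walk the parser output and store flat node records."""
    facts = {}

    def visit(element, parent_id):
        node_id = new_id("n")
        facts[node_id] = {
            "gi": element.tag,
            "parent": parent_id,
            "attrs": dict(element.attrib),
        }
        for child in element:
            visit(child, node_id)

    visit(root, None)
    return facts


def nodes_named(facts, gi):
    """Layer 4: accessor that hides the storage layout."""
    return [node for node, record in facts.items() if record["gi"] == gi]


def get_attribute(facts, node, name):
    """Layer 4: attribute lookup, None if absent."""
    return facts[node]["attrs"].get(name)


def quotations(facts):
    """Layer 5: an illustrative rule; each QUOT element signals a quotation."""
    entities = {}
    for node in nodes_named(facts, "QUOT"):
        entities[new_id("e")] = {
            "kind": "quotation",
            "node": node,
            "language": get_attribute(facts, node, "lang"),
        }
    return entities


FACTS = store(parse('<TEXT><P>He said <QUOT lang="de">Guten Tag</QUOT>.</P></TEXT>'))
ENTITIES = quotations(FACTS)
```

The rule in `quotations` refers only to the accessors, so it is insulated from both the parser and the record layout.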
Bibliography
C. M. Sperberg-McQueen. “Text in the Electronic Age: Textual Study and Text
Encoding, with Examples from Medieval Texts.” Literary & Linguistic Computing. 1991. 6: .
C. M. Sperberg-McQueen, Allen Renear, and Claus Huitfeldt. “Meaning and Interpretation of Markup.” Markup Languages: Theory and Practice. 2001. 2: 215-234.