“XML Content Integration: An Example from the Heml Project”
Bruce Robertson
Mount Allison University, Canada
brobertson@mta.ca
Introduction
This paper illustrates how humanists' XML documents from around the web can be shared and integrated using careful schema design and simple XSLT programs. In the web publishing world, resource sharing and syndication are all the rage. Using XML-based methods like the RDF Site Summary (RSS) or Channel Definition Format (CDF), sites can reuse and share their headlines and stories across the web. The Meerkat service at oreillynet.com is an impressive example of the power of this technology. Computerists in the humanities have applied less effort to syndication and document sharing. However, scholarly markup is beginning to describe realms of knowledge such as archaeology, anthropology, geography or history, whose content is impossible to confine, and XML namespaces make it possible to blend and reuse documents one within another. It is therefore clear that we need to build into our markup endeavours the means of integrating resources from around the web.
As briefly illustrated in a paper at ACH/ALLC 2001, this has been a facet of the research of the Historical Event Markup and Linking Project (Heml) from its inception two years ago. Heml aims to provide the means to coordinate historical information on the web through a markup language that describes historical events and through transformations of conforming documents into historical maps, timelines and tables. After defining its markup language in XML schemas (version 2001 07 02) and producing conforming documents outlining Roman Republican history, Heml has turned its attention to exploring the problems and opportunities offered by a semantic web of disparate documents.
Two conclusions have been drawn from this work: first, that rich scholarly markup like Heml has content integration requirements more complex than the problems that current syndication techniques aim to solve; and secondly, that these requirements can be met using simple and ubiquitous computational tools.
The Difficulties of Content Integration
The more complex content integration requirements of Heml and similar schemes can be illustrated with the example of collecting recipes. Syndication schemes in common use today gather XML content like recipes in a file-box: documents or links are placed alongside each other, creating a larger document seriatim. However, as illustrated in Figure 1, this process fails to recognize the overlapping identities between documents of richly marked-up content: the id attributes flour and apples each appear twice, confusing matters and possibly rendering the combined document invalid.
Figure 1. Gathering Documents Seriatim
Figure 2. Integrating Content
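The collision that arises from gathering documents seriatim can be reproduced in a few lines. The following is a minimal sketch in Python, chosen only for brevity; it is not the project's own code, which relies on XSLT. The recipe vocabulary (`recipes`, `ingredient`, `recipe`, `uses`) and the helper names `gather_seriatim` and `duplicate_ids` are hypothetical and are not drawn from any Heml schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two hypothetical recipe documents. The element and attribute names are
# invented for illustration only.
APPLE_PIE = """<recipes>
  <ingredient id="flour">Flour</ingredient>
  <ingredient id="apples">Apples</ingredient>
  <recipe id="apple-pie"><uses ref="flour"/><uses ref="apples"/></recipe>
</recipes>"""

APPLE_CRISP = """<recipes>
  <ingredient id="flour">Flour</ingredient>
  <ingredient id="apples">Apples</ingredient>
  <ingredient id="oats">Oats</ingredient>
  <recipe id="apple-crisp"><uses ref="flour"/><uses ref="apples"/></recipe>
</recipes>"""


def gather_seriatim(*sources):
    """Place the children of each document one after another under a new root."""
    combined = ET.Element("recipes")
    for source in sources:
        combined.extend(list(ET.fromstring(source)))
    return combined


def duplicate_ids(root):
    """Return, in sorted order, the id values that occur more than once."""
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


gathered = gather_seriatim(APPLE_PIE, APPLE_CRISP)
print(duplicate_ids(gathered))  # ['apples', 'flour']
```

Because XML requires values of type ID to be unique within a document, a validating parser would reject the gathered result even though each source document is valid on its own.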
Design Goals
There are many ways to approach the problem of integrating such materials; some approaches have the virtue of not requiring metadata at all. The following design goals dictated the Heml project's solution:
- 1. Each constituent document and the resulting integrated document should be intelligible and valid.
- 2. It should be possible to refer to constituent documents through URLs.
- 3. A third party (that is, someone who is not the creator of any of the documents and has no influence in their design) should be able to integrate documents as satisfactorily as their authors.
- 4. The integrating code should be as portable as possible, ideally running on clients or servers.
- 5. The integration process should be recursive so that its results can be the input of a further integration process.
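Goals 1 and 5 can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the Heml project's implementation (which the paper describes as XSLT-based): it assumes a hypothetical recipe vocabulary, treats two elements with the same `id` as the same entity, and keeps the first one seen. The function names `integrate` and `top_level_ids` are invented for this example, and the inputs are in-memory strings rather than documents fetched by URL.

```python
import copy
import xml.etree.ElementTree as ET


def integrate(*documents):
    """Merge documents under one root, keeping one element per id value.

    Assumption: equal ids denote the same entity, and the first
    occurrence is authoritative. The result has the same shape as the
    inputs, so it can itself be passed to integrate() again.
    """
    merged = ET.Element("recipes")
    seen = set()
    for document in documents:
        for child in document:
            key = child.get("id")
            if key is not None:
                if key in seen:
                    continue  # identity already present; skip the repeat
                seen.add(key)
            merged.append(copy.deepcopy(child))  # leave the inputs untouched
    return merged


def top_level_ids(root):
    """List the id values of the root's children in document order."""
    return [el.get("id") for el in root if el.get("id") is not None]


# Hypothetical constituent documents.
pie = ET.fromstring(
    '<recipes><ingredient id="flour"/><ingredient id="apples"/>'
    '<recipe id="apple-pie"/></recipes>'
)
crisp = ET.fromstring(
    '<recipes><ingredient id="flour"/><ingredient id="apples"/>'
    '<ingredient id="oats"/><recipe id="apple-crisp"/></recipes>'
)
sauce = ET.fromstring(
    '<recipes><ingredient id="apples"/><ingredient id="sugar"/>'
    '<recipe id="applesauce"/></recipes>'
)

first_pass = integrate(pie, crisp)
second_pass = integrate(first_pass, sauce)  # output reused as input
print(top_level_ids(second_pass))
```

Keeping the output in the same vocabulary as the inputs is what allows the second call to treat `first_pass` like any other constituent document; a real integration would also need a policy for elements that share an id but differ in content, which this sketch ignores.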