XML Content Integration: An Example from the Heml Project

Introduction

This paper illustrates how humanists' XML documents from around the web can be shared and integrated using careful schema design and simple XSLT programs. In the web publishing world, resource sharing and syndication are all the rage. Using XML-based methods like the RDF Site Summary (RSS) or Channel Definition Format (CDF), sites can reuse and share their headlines and stories across the web. oreillynet.com's Meerkat is an impressive example of the power of this technology. Computerists in the humanities have applied less effort to syndication and document sharing. However, as scholarly markup begins to describe realms of knowledge such as archaeology, anthropology, geography or history, whose content is impossible to confine, and as XML namespaces make it possible to blend and reuse documents one within another, it is clear that we need to build into our markup endeavours the means of integrating resources from around the web.

As briefly illustrated in a paper at ACH/ALLC 2001, this has been a facet of the research of the Historical Event Markup and Linking Project (Heml) from its inception two years ago. Heml aims to provide the means to coordinate historical information on the web through a markup language that describes historical events and through transformations of conforming documents into historical maps, timelines and tables. After defining its markup language in XML schemas (version 2001 07 02) and producing conforming documents outlining Roman Republican history, Heml has turned its attention to exploring the problems and opportunities offered by a semantic web of disparate documents across the web. Two conclusions have been drawn from this work: first, that rich scholarly markup like Heml has requirements in content integration that are more complex than the problems which current syndication techniques aim to solve; and secondly, that these requirements can be met using simple and ubiquitous computational tools.

The Difficulties of Content Integration

The more complex content integration requirements of Heml and similar schemes can be illustrated with the example of collecting recipes. Syndication schemes in common use today gather XML content like recipes in a file-box: documents or links are placed alongside each other creating a larger document seriatim. However, as illustrated in Figure 1, this process fails to recognize the overlapping identities between documents of richly marked-up content: the id attributes flour and apples appear twice, confusing matters and possibly causing the document to no longer be valid.

Figure 1. Gathering Documents Seriatim
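
The id collision that seriatim gathering produces can be reproduced in a few lines. The following is only a minimal sketch: the recipe markup, the element names (recipes, recipe, ingredient) and the two sample documents are hypothetical stand-ins for the recipes in the figure, not part of Heml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two hypothetical recipe documents; each has unique ids on its own.
APPLE_PIE = """<recipe id="apple_pie">
  <ingredient id="flour"/>
  <ingredient id="apples"/>
</recipe>"""

APPLE_CAKE = """<recipe id="apple_cake">
  <ingredient id="flour"/>
  <ingredient id="apples"/>
  <ingredient id="eggs"/>
</recipe>"""


def gather_seriatim(documents):
    """Place the documents one after another under a single new root."""
    root = ET.Element("recipes")  # hypothetical wrapper element
    for text in documents:
        root.append(ET.fromstring(text))
    return root


def duplicated_ids(root):
    """Return, sorted, every id value that occurs more than once."""
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


combined = gather_seriatim([APPLE_PIE, APPLE_CAKE])
print(duplicated_ids(combined))
```

Any id reported here would violate the uniqueness that a validating parser requires of ID-typed attributes, which is why the gathered document may no longer be valid.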

Figure 2 illustrates the preferred outcome. This process gathers the recipes as they are in a cookbook, where ingredients are identified properly with each other and so would, for instance, appear only once in an index.

Figure 2. Integrating Content

Content integration of Heml documents has to be of the second, more complex sort because its schema is built largely upon the identification of entities through id and idref attributes. In brief, Heml markup comprises a series of event elements; each of these includes information about participants, chronology, locations, keywords and supporting documents. Example 1 is a Heml fragment that identifies the location 'Rome'. Once a location is defined in this way, subsequent references to it within the document use an XML reference element, thus: <LocationRef idref="Rome">. These corresponding Concept and ConceptRef elements are a design feature of Heml markup: the only elements whose names end in the string Ref are those that carry an idref attribute, and that idref is meant to point to an element with the corresponding name (a LocationRef refers to a Location).

Example 1. heml:Location element

    <Location id="Rome">
      <LocationLabelSet>
        <Label xml:lang="en">Rome</Label>
        <Label xml:lang="la">Roma</Label>
        <Label xml:lang="el">Ῥώμη</Label>
      </LocationLabelSet>
      <Latitude>
        <GeographicalHourLatitude>41</GeographicalHourLatitude>
        <GeographicalMinute>49</GeographicalMinute>
        <GeographicalSecond>2</GeographicalSecond>
      </Latitude>
      <Longitude>
        <GeographicalHourLongitude>12</GeographicalHourLongitude>
        <GeographicalMinute>19</GeographicalMinute>
        <GeographicalSecond>8</GeographicalSecond>
      </Longitude>
    </Location>
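
Because the Ref naming convention is regular, it lends itself to a mechanical check. The sketch below is an illustration, not part of the Heml tools: it assumes un-namespaced element names, and the wrapper element Events and the dangling reference to Carthage are invented for the test fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one Location definition and two references,
# the second of which has nothing to point to.
FRAGMENT = """<Events>
  <Location id="Rome"/>
  <LocationRef idref="Rome"/>
  <LocationRef idref="Carthage"/>
</Events>"""


def dangling_refs(root):
    """List (tag, idref) pairs whose target element is not defined.

    A FooRef element with idref="x" is satisfied only by a Foo element
    with id="x", following the Concept/ConceptRef convention.
    """
    defined = {
        (el.tag, el.get("id")) for el in root.iter() if el.get("id") is not None
    }
    problems = []
    for el in root.iter():
        idref = el.get("idref")
        if idref is None or not el.tag.endswith("Ref"):
            continue
        target_tag = el.tag[: -len("Ref")]  # LocationRef -> Location
        if (target_tag, idref) not in defined:
            problems.append((el.tag, idref))
    return problems


print(dangling_refs(ET.fromstring(FRAGMENT)))
```

A check of this kind is stricter than plain ID/IDREF validation, which only requires that some element with the given id exist, whatever its name.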

Design Goals

There are many ways to approach the problem of integrating such materials; some approaches have the virtue of not requiring metadata at all. The following design goals dictated the Heml project's solution:

1. Each constituent document and the resulting integrated document should be intelligible and valid.
2. It should be possible to refer to constituent documents through URLs.
3. A third party (that is, someone who is not the creator of any of the documents and has no influence in their design) should be able to integrate documents as satisfactorily as their authors.
4. The integrating code should be as portable as possible, ideally running on clients or servers.
5. The integration process should be recursive so that its results can be the input of a further integration process.

Solution

The Heml project has fulfilled these goals through a metadata file and an algorithm implemented in the XSLT XML document transformation language. The meta-documents that control this process have been named 'Jackdaw' documents, since like their namesakes they gather and put to use disparate materials. Example 2 is a Jackdaw file that controls the integration of documents relating to the history of Rome down to 201 BCE. Reading from the bottom of the document up, the <filelist> element collects URLs that refer to the documents whose content this Jackdaw integrates. Further up, each <IdEquivalence> element identifies a <Master> element in one document with one or more <Duplicate> elements in other documents. In our example the element identified as 'Rome' in the second_punic_war.xml document is listed as the master of similarly named elements in the other two.

The XSLT code that operates on the Jackdaws can be obtained from the Heml CVS server. Though the Heml project presently organizes its XSLT transformations using the server-side Cocoon2 engine from the Apache group, advanced web browsers (Microsoft Internet Explorer 5.5 and greater and most builds of Mozilla 0.9.5 and greater) perform XSLT transformations on documents that include the proper XML processing instruction.

The integrating algorithm is simple and based on the assumption that document URLs are unique and that ids are always unique within their document. The result of a function that concatenates an input id with its document's URL is therefore also assumed to be unique in the integrated output document. The integrating algorithm blindly copies all <Event> elements and their children from every document addressed in the <file> elements, except that it generates new id or idref attributes based on the URL and id of the old ones. Furthermore, if an idref points to an element whose id is among those listed as an <IdEquivalence><Duplicate>, the new id of the corresponding <IdEquivalence><Master> is output instead.
If an id is among those listed as an <IdEquivalence><Duplicate>, the XSLT generates the appropriate ConceptRef element and gives it the idref attribute generated from the corresponding <IdEquivalence><Master>.

Example 2. 'Jackdaw' Metadata file

    <Jackdaw>
      <IdEquivalences>
        <IdEquivalence>
          <Master id="http://localhost:8080/heml-cocoon/source/second_punic_war.xml#Rome"/>
          <Duplicate id="http://www.java.utoronto.ca/~brucerob/early_history.xml#Rome"/>
          <Duplicate id="http://heml.mta.ca/~brucerob/first_punic_war.xml#Rome"/>
        </IdEquivalence>
        <IdEquivalence>
          <Master id="http://localhost:8080/heml-cocoon/source/second_punic_war.xml#Carthage"/>
          <Duplicate id="http://heml.mta.ca/~brucerob/first_punic_war.xml#Carthage"/>
        </IdEquivalence>
      </IdEquivalences>
      <filelist>
        <file>http://localhost:8080/heml-cocoon/source/second_punic_war.xml</file>
        <file>http://heml.mta.ca/~brucerob/first_punic_war.xml</file>
        <file>http://www.java.utoronto.ca/~brucerob/early_history.xml</file>
      </filelist>
    </Jackdaw>

This process is reasonably speedy. As a functioning example for this paper I ran the Jackdaw file in Example 2 using Apache's Xalan 2.0 XSLT engine on a Celeron 400-class machine running RedHat Linux 7.2 and the IBM 1.3-8.0 Java2 SDK. With all caching switched off, it took under ten seconds to gather the resulting 35k document, and three of these seconds appear to be overhead required just to start the Java processes. (xsltproc, an XSLT engine written in C, completes the same task in 1.7 seconds!) Network access seems to be the limiting factor for these reasonably small files.

Of course, the resulting integrated XML document is usually invisible to the user, who navigates one or more further transformations of that document into HTML or images. Among the transformations available for Heml documents is a dynamic SVG map. (A browser plugin, available from Adobe for Windows and Macintosh, is required to view this image.)
Passing the cursor over a dot on the map will bring up the name of the location and a list of events that took place at that location. It can be seen that the lists for Rome and Carthage include events from all periods, as instructed in the Jackdaw file in Example 2.
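
The integrating algorithm described in the Solution section can be sketched outside XSLT as well. The following Python is an illustration only and is not the project's XSLT: the example.org URLs, the sample documents, the wrapper elements Heml and Integrated, and the fetch callback are all hypothetical, and Event elements are assumed not to be nested. It also uses the bare URL#id string as the new identifier, as the Jackdaw file does; such a string is not a legal XML ID value, so an implementation that must produce a valid document would need to encode it, and the source does not say how the XSLT does this.

```python
import xml.etree.ElementTree as ET

# Hypothetical Jackdaw file and constituent documents.
JACKDAW = """<Jackdaw>
  <IdEquivalences>
    <IdEquivalence>
      <Master id="http://example.org/b.xml#Rome"/>
      <Duplicate id="http://example.org/a.xml#Rome"/>
    </IdEquivalence>
  </IdEquivalences>
  <filelist>
    <file>http://example.org/a.xml</file>
    <file>http://example.org/b.xml</file>
  </filelist>
</Jackdaw>"""

DOCS = {
    "http://example.org/a.xml": """<Heml>
      <Event id="e1"><Location id="Rome"><Label>Rome</Label></Location></Event>
      <Event id="e2"><LocationRef idref="Rome"/></Event>
    </Heml>""",
    "http://example.org/b.xml": """<Heml>
      <Event id="e1"><Location id="Rome"/></Event>
    </Heml>""",
}


def load_jackdaw(text):
    """Return (duplicate id -> master id, list of file URLs)."""
    root = ET.fromstring(text)
    to_master = {}
    for eq in root.iter("IdEquivalence"):
        master = eq.find("Master").get("id")
        for dup in eq.findall("Duplicate"):
            to_master[dup.get("id")] = master
    files = [f.text.strip() for f in root.iter("file")]
    return to_master, files


def rewrite(el, url, to_master):
    """Rewrite ids and idrefs in place, qualifying each with its URL."""
    idref = el.get("idref")
    if idref is not None:
        key = f"{url}#{idref}"
        # A reference to a duplicate is redirected to the master.
        el.set("idref", to_master.get(key, key))
    ident = el.get("id")
    if ident is not None:
        key = f"{url}#{ident}"
        if key in to_master:
            # A duplicate definition becomes a ConceptRef to its master.
            el.tag += "Ref"
            del el.attrib["id"]
            el.set("idref", to_master[key])
            for child in list(el):
                el.remove(child)
            el.text = None
            return
        el.set("id", key)
    for child in el:
        rewrite(child, url, to_master)


def integrate(jackdaw_text, fetch):
    """Copy every Event from every listed file into one document."""
    to_master, files = load_jackdaw(jackdaw_text)
    out = ET.Element("Integrated")  # hypothetical wrapper element
    for url in files:
        doc = ET.fromstring(fetch(url))
        for event in list(doc.iter("Event")):
            rewrite(event, url, to_master)
            out.append(event)
    return out


integrated = integrate(JACKDAW, DOCS.__getitem__)
print(ET.tostring(integrated, encoding="unicode"))
```

In this sketch the duplicate Location in a.xml is reduced to a LocationRef pointing at the master in b.xml, so the integrated document holds a single definition of Rome. Because the output again carries unique ids and resolvable idrefs, it could in principle be listed in another Jackdaw, which is the sense in which the process is recursive.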