Objective
The objective of this project was to investigate SGML as a vehicle for
porting documents between diverse hypertext systems. The documents were
provided by European Passenger Services (EPS) who issue them in a
traditional printed format. They form two major publications:- factsheet giving information about EPS and its services. It
consists of text interspersed with images and tables;
- an educational package for children at primary school. This
contains a higher proportion of images but also questions to test
reading comprehension and simple mathematics.
The two types of material were very different and were treated accordingly.
The aim with the factsheet was merely to improve the accessibility and
browsability of the information through an electronic text while the
teaching material demanded more genuinely interactive features that would
convert it into a Computer Assisted Learning Package.
Tagging and Parsing
For both data sets, however, the first stage was to define a Document Type
Definition (DTD) in terms of the predefined structural components (ie
chapters, sections, paragraphs and sentences) and to tag every occurrence of
each element in the document. Subsequently, the documents were validated
using Mark-It, an SGML parser which reads a DTD and checks that a given
document instance conforms with it. Mark-It can also be used to produce a
canonical form of the document, in which any markup missing owing to tag
minimisation and use of short-refs is replaced. Finally, and most
importantly, it has a generator or link mechanism that will facilitate
systematic replacement of tags (and content) in the canonical version. This
can be used to replace the SGML tags by formatting instructions to produce a
traditional printed text or, more ambitiously, to produce hypertext versions
of the document. It relies on the introduction into the DTD of a generator
section containing the application control and link processes that will be
used by Mark-It to transform an EPS document instance from SGML into another
format.
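The tag replacement step described above can be illustrated with a minimal Python sketch. It is not Mark-It's generator syntax, which is not reproduced here; the element names (title, para) and the two rule tables are hypothetical, and the input is assumed to be a canonical document in which every start and end tag is already explicit.

```python
import re

# Hypothetical rule tables: element name -> (text for start tag, text for end tag).
HTML_RULES = {
    "title": ("<h1>", "</h1>"),
    "para": ("<p>", "</p>"),
}
PRINT_RULES = {
    "title": ("", "\n\n"),
    "para": ("", "\n"),
}

# Matches simple start and end tags without attributes, e.g. <para> or </para>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def replace_tags(canonical: str, rules: dict) -> str:
    """Replace every tag in a canonical document using one rule table."""

    def swap(match: re.Match) -> str:
        closing, name = match.group(1), match.group(2).lower()
        if name not in rules:
            raise KeyError(f"no rule for element {name!r}")
        start, end = rules[name]
        return end if closing else start

    return TAG.sub(swap, canonical)


sample = "<title>Services</title><para>Trains run daily.</para>"
html_version = replace_tags(sample, HTML_RULES)
print_version = replace_tags(sample, PRINT_RULES)
```

The same tagged source yields a different output for each rule table, which is the sense in which one document instance can feed several target formats.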
Target Hypertext Systems
In the project, documents were converted to:
- 1. source format for the Unix Guide hypertext system,
- 2. HTML format for browsers such as Netscape and Mosaic,
and
- 3. PC Guide.
Each of these raised individual problems which demanded independent
solutions and cast doubt on the viability of using a single DTD for
multiple purposes. For example, the Unix Guide markup produced a concise
hypertext document that could be navigated by selecting hypertext buttons
and using the scroll bar. For aesthetic reasons, and to improve
functionality, the Guide file was split into smaller files and linked using
usage buttons, rather than replace buttons. Because Guide naturally uses
expansion buttons, the "contents page" of the original document was
redundant and the contents element was removed from the DTD.
However, HTML needed a "contents" page whose hypertext links connect
each topic to the relevant section within the document. To maintain DTD
consistency, a contents page was generated automatically together with the
HTML document file. This, while increasing generator complexity, was
inherently pleasing because it meant the contents section did not have to be
manually maintained and updated. Again, functionality was enhanced by the
use of many small files, rather than a single big file.
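As an illustration of generating the contents page together with the document files, the following Python sketch writes one small HTML file per section and derives the contents page from the same data. The function name, file naming scheme and input structure are assumptions made for the example, not details of the project's generator.

```python
from html import escape


def build_site(sections):
    """Turn a list of (title, body) pairs into a dict of filename -> HTML text.

    One small file is produced per section, plus a contents page whose
    links are derived from the same list, so it never needs manual upkeep.
    """
    files = {}
    links = []
    for number, (title, body) in enumerate(sections, start=1):
        name = f"section{number}.html"  # hypothetical naming scheme
        files[name] = f"<h1>{escape(title)}</h1>\n<p>{escape(body)}</p>\n"
        links.append(f'<li><a href="{name}">{escape(title)}</a></li>')
    files["contents.html"] = "<ul>\n" + "\n".join(links) + "\n</ul>\n"
    return files


site = build_site([("Trains", "Fast."), ("Tickets", "Buy & go.")])
```

Because the links are built in the same loop as the section files, adding or renaming a section updates the contents page automatically.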
PC Guide offered the most comprehensive markup possibilities, although it
required the use of an intermediary GUIDE Writer in the conversion. Firstly,
Mark-It was used to convert SGML into Hypertext Markup Language (HML). HML,
not to be confused with HTML, is a subset of SGML containing 27 element
tags that can be used to describe document structure in terms of specific
types of GUIDE objects. GUIDE Write creates and links these hypertext
objects, automatically generating separate files connected by
hyperlinks.
Conversion Comparison
For browsing and navigation, all three hypertext conversions were realisable,
and to that extent, demonstrated the portability of the data through an
accurate and application independent description of the information.
Nevertheless, the project raised issues about the interaction between data
portability and system functionality and the level of portability that can
be achieved. For example, optimising functionality for each system required
modifications whenever a new system was introduced, to match the requirements
of that particular system. Also, the conversion is ultimately dependent on
the mechanisms and power of the Mark-It parser and the ingenuity of the user
in writing appropriate link routines.
Multimedia
Multimedia information incorporating sound and video exacerbates these
differences. Whilst the initial focus of SGML was textual data, other media
can be included by one of two mechanisms. The more obvious is through data
entities which provide a detailed and formal description of data. However,
if the main aim is portability between hypertext systems, this method is
cumbersome as the entity and notation declarations must be
modified to conform with the format of each target hypertext system.
Consequently data portability is more easily achieved by using empty
elements to represent different media. In this method the DTD remains
constant (although, of course, commands appropriate to each hypertext format
must be included in the generator section). Only one element name is
declared for each medium; different files are distinguished using Mark-It's
`counter' mechanism.
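The empty-element approach can be sketched in Python as follows. This imitates the idea of a per-medium counter only; it is not Mark-It's counter mechanism, and the element names, file names and output templates are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical per-target templates; {n} is filled from the per-medium counter.
TEMPLATES = {
    "html": {
        "sound": '<a href="sound{n}.au">Play sound {n}</a>',
        "video": '<a href="video{n}.mpg">Play video {n}</a>',
    },
    "script": {
        "sound": "play sound{n}.au",
        "video": "show video{n}.mpg",
    },
}

# One empty element per medium; the markup itself carries no file name.
EMPTY = re.compile(r"<(sound|video)>")


def expand_media(text: str, target: str) -> str:
    """Replace each empty media element with a target-specific command."""
    counters = defaultdict(int)

    def swap(match: re.Match) -> str:
        medium = match.group(1)
        counters[medium] += 1  # the nth <sound> refers to the nth sound file
        return TEMPLATES[target][medium].format(n=counters[medium])

    return EMPTY.sub(swap, text)


marked_up = "<sound> then <video> then <sound>"
as_script = expand_media(marked_up, "script")
as_html = expand_media(marked_up, "html")
```

The tagged text stays the same for every target; only the template table changes, which mirrors the point that the DTD remains constant while the generator section varies.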
Comparing the use of sound and video across the three hypertext systems
focuses attention on the role of their respective browsers. While the Unix
Guide interface itself was primarily designed for text and graphics,
multimedia applications can be easily launched by creating usage buttons to
run script commands and there are no restrictions on the use of different
sound and video formats save those arising from the limitations of the
multimedia software on the supporting platform. While web browsers such as
Mosaic and Netscape were specifically designed to process media other than
text and graphics, the browsers must be configured so that they are linked
to the multimedia software via the Multipurpose Internet Mail Extensions
protocol (MIME) which enables documents in varying formats to be transmitted
over the Internet. Consequently the range of sound and video formats
supported by web browsers is more restricted than with Unix Guide although
the linking method is very straightforward involving only a simple external
hypertext link; the browser then performs all necessary work automatically.
PC Guide has similar restrictions: the sound and video formats that are used
are determined by the MS Windows platform. Again, inclusion is not
complicated, as the platform performs all necessary tasks via the Media
Control Interface (MCI).
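The way a browser's MIME configuration limits the usable formats can be illustrated with a small Python sketch. The suffix table and the helper program names are hypothetical and do not describe how Mosaic or Netscape is actually configured; the point is only that a linked file whose type is not configured cannot be played.

```python
import os

# Hypothetical configuration: file suffix -> MIME type.
MIME_BY_SUFFIX = {
    ".au": "audio/basic",
    ".mpg": "video/mpeg",
    ".html": "text/html",
}

# Hypothetical helper applications registered for some MIME types.
HELPERS = {
    "audio/basic": "audioplayer",
    "video/mpeg": "videoplayer",
}


def dispatch(filename: str):
    """Decide how a linked file would be handled under this configuration."""
    suffix = os.path.splitext(filename)[1].lower()
    mime = MIME_BY_SUFFIX.get(suffix)
    if mime is None:
        return ("unsupported", None)  # no MIME type configured for this suffix
    if mime in HELPERS:
        return ("helper", HELPERS[mime])  # hand the file to external software
    return ("inline", mime)  # the browser displays it itself


sound_result = dispatch("welcome.au")
page_result = dispatch("index.html")
other_result = dispatch("clip.xyz")
```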
Conclusion
The markup of EPS information was an evolutionary process: it was not clear
at the start what features should be tagged, and continual modification of
the DTD was required to ensure that the same definition could be used
effectively for all three hypertext formats. The question of the optimum
level of markup for portability is still not wholly resolved. An SGML
document ensures that there is an accurate description of the logical
structure and components, but to achieve maximum functionality from the
target hypertext system, modifications were frequently desirable and much
vital and system dependent work had to be done during the parsing stage.
Data portability is still greatly constrained by the requirements of the
target system.