Introduction
This session explores the reasons for and the challenges of setting up a text
processing application using SGML for lexicographic data. More particularly,
it presents the experience of a group of researchers in the Humanities who
were forced to become familiar with the SGML standard, to help design a
Document Type Definition (DTD), and to get used to using SGML authoring
tools to write a dictionary.
The dictionary in question is the Bilingual Canadian Dictionary (BCD), which
is still in preparation. As its tentative title indicates, it is a bilingual
dictionary which will reflect English and French as they are used in Canada.
The creation of this dictionary is the major objective of a vast
collaborative research project, called "Comparative Lexicography of French
and English in Canada", funded by the Social Sciences and Humanities
Research Council. The project involves three universities: the University of
Ottawa (which is also the administrative centre), the University of
Montreal, and Laval University.
The Long Road from Wordprocessing to SGML
Dictionary entry preparation, which got off to a modest start in 1988, was
done using a wordprocessing program for eight years (till 1996). Although we
realized almost from the start that this was a very unsatisfactory method
both for entry preparation and for future dissemination, it took us years of
research on available technologies to find a suitable solution.
The primary problem seemed to be that a dictionary entry has many components,
a number of which are optional or repeatable at various points. Thus, a
dictionary entry can be very long (e.g. the BCD entry for coeur) or quite
short (e.g. our entry for motoneigiste) or simply a cross-reference to
another entry (e.g. our entry for naveau). Another problem was the number of
entries we have planned for the dictionary (approximately 80,000).
Terminological database management systems, such as AQUILA and MC4, were
obviously unsuitable for our needs, because of their predetermined structure
which cannot be modified. MTX, also used for terminological data, offered
more flexibility because it does not preestablish fields, but it has the
disadvantage of separating, into two different zones, information on the
source language word from information related to this word (the target
language equivalents, illustrative examples, etc.) and of restricting the
scope of each of these zones. In any case, it is not intended for large
numbers of entries (or lexicographic records).
After rejecting the possibility of using any of the readily available
terminological database management systems, we began, in 1993, a worldwide
search for lexicographic database management systems. However, despite our
conviction that such systems exist - after all, dictionaries are constantly
published - we were able to locate only two: one in Copenhagen (called
GestorLEX) and another in Cambridge designed for the Cambridge International
Dictionary of English. While both GestorLEX and the CIDE database system
seemed to be comprehensive, well thought-out packages, with many specialized
features, they both have two major disadvantages: (a) no North American
technical support is available; and (b) both are "closed systems", in that
they store their document bases in an internal format which can only be read
by the system, even though GestorLEX does support SGML marking to facilitate
information interchange. In addition, both would have needed to be adapted
to our specific needs, and we were unable to obtain test copies to verify
just how much further programming would be required on our part.
However, programming was not our strong point, as we soon realized after
attempting to adapt a general-purpose database management system, Rbase, to
meet our needs. And efforts by the University of Montreal computer centre to
design a system for us in Prolog progressed so slowly, apparently because of
the complexity of some of our entries, that after a year and a half only the
"introductory zone" was completed.
By 1995, we had reached two major decisions: (a) given the fact that our
project is a long-term project, we could not run the risk of being tied to
one platform or one specific database system; (b) we needed expert advice,
which was not available at the university level, on text processing and
text retrieval applications.
By this point, we had acquired some knowledge of SGML, and thought that an
SGML application could provide us with many advantages:
- (a) Since an SGML document exists simply as plain text, it is
easily portable from one application to another, or from one platform
to another. This is of considerable importance to us for several
reasons: we have three lexicographic centres, which need to
"exchange" information constantly; since the final product, the BCD,
will appear not only in electronic form but also in print, SGML
would be a dramatic time-saver both to us and to the publisher;
and since, over the duration of the project, we would be updating our
hardware, the fact that SGML can be used on any hardware was a
definite asset.
- (b) Since SGML does not require authors to concern
themselves with formatting and layout, lexicographers would be
better able to concentrate on the actual content of entries.
- (c) Since SGML markup provides the document's structure,
lexicographers would not accidentally forget elements that need to be
included in the entry. Moreover, since SGML entries are
automatically parsed when they are saved, a number of lexicographic
omissions would be flagged at this point.
- (d) Since SGML markup is multipurpose, the tagging can serve for
verification of specific entry elements during the final revision of
all the dictionary entries, and for the creation of future
subproducts (e.g. a bilingual dictionary of Canadianisms).
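Point (c) can be illustrated with a minimal sketch. It assumes, purely for illustration, that an entry is written as well-formed XML-style markup (SGML proper allows tag omission, which this parser does not handle), and the element names entry, headword, pos, sense and translation are hypothetical, as is the sample content; none of them is taken from the BCD DTD. The table of required children stands in for what a validating parser would derive from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical required children per element; NOT the real BCD DTD.
REQUIRED = {
    "entry": ["headword", "pos", "sense"],
    "sense": ["translation"],
}

def missing_components(markup):
    """Return (parent, missing child) pairs for one entry, in document order."""
    root = ET.fromstring(markup)
    problems = []
    for element in root.iter():
        for child_name in REQUIRED.get(element.tag, []):
            if element.find(child_name) is None:
                problems.append((element.tag, child_name))
    return problems

# Invented sample entries, for illustration only.
complete = (
    "<entry><headword>motoneigiste</headword><pos>n</pos>"
    "<sense><translation>snowmobiler</translation></sense></entry>"
)
incomplete = "<entry><headword>motoneigiste</headword><sense></sense></entry>"

print(missing_components(complete))
print(missing_components(incomplete))
```

A real SGML parser performs this check against the full content model, including element order and repetition; the sketch covers only the "required element is absent" case.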
Armed with the conviction that SGML would provide us with a solution, we
initially worked on a DTD with the help of a visiting professor from Rennes.
Realizing, however, that the DTD design was only one element of an
integrated system for our entries, we consulted with three computer
consulting firms to see if they could help us realize our goal. While two of
them could only promise us some sort of useable SGML product in the more or
less distant future, we had the great good fortune to find at Microstar in
Ottawa a consultant, Dr. David Megginson, who not only specializes in SGML,
but who had also done previous lexicographic programming. With his
assistance, we moved, in the space of six months, from wordprocessing to a
fully integrated SGML application.
Preparation of the Document Type Definition
On Dr. Megginson's advice, the BCD team set up a committee of nine members
(lexicographers, revisers, and professors) to work with him on the
preparation of the DTD. It took seven day-long meetings to decide on the DTD
structure, although we already had a good starting point.
Since the BCD team had already designed a tentative entry structure and had
produced about 10,000 entries, we began by examining the proposed dictionary
microstructure and its realization in about thirty entries. The proposed
microstructure was complex because of the number of components and
combination of components possible. The entries chosen for examination were
selected to represent as many of the components as possible. The main
selection criteria were the following: representation of different parts of
speech (e.g. aîné n, aîné adj); representation of monosemous and polysemous
word entries (e.g. allopathy, a monosemous noun, versus ball, which is a
highly polysemous noun); representation of compounds and fixed expressions
in certain entries (e.g. feu n); representation of "marked" word entries
(e.g. bébelle n) as well as unmarked word entries (e.g. adjust v);
representation of entries created in different centres; and representation
of entries from French to English as well as English to French.
On the basis of all this supporting documentation, the DTD committee first
identified all possible components and described each component in a
Component Form.
At the end of this stage, we had identified seventy-six components. These
were then reviewed and revised. For example, we had initially identified as
a component Annotated Translation Example, which we had defined as
"translation of a source language free combination, collocation, fixed
expression or compound preceded by one or more of the following: sense
indication, actants, referents, comments." However, at the review stage, we
realized that we did not need to identify this as a separate component,
since all its constituents (translation, sense indication, actant, referent,
comment) had been individually identified. At the review stage, we also
added a couple of components, which had been previously missed.
After all the components were identified and described, we decided to set up
two separate DTDs - one for a Full Dictionary Entry, another for a
Cross-Reference entry - since the latter contains very little information.
We then started ordering the components for the Full Dictionary Entry, both
vertically and horizontally, using Microstar's product, Near and Far
Designer, which uses a simple "drag-and-drop" graphic interface to program
the DTD. As the Dictionary Component Form presented above illustrates, while
identifying the components, we had already specified to some extent the
relations between components. These relations needed to be further clarified
at this stage and elements grouped in blocks and groups. Thus, the elements
irregular feminine, irregular plural and irregular singular were grouped
together in a Noun Information Block, while the elements irregular feminine,
irregular plural, regular comparative, regular superlative, irregular
comparative and irregular superlative were placed together in the Modifier
Information Block. At this point, we also had to decide which components
should be elements and which should be attributes (of the elements). We
ended up treating as attributes several components that would not appear in
the final dictionary entry: source codes and lexicographer's notes, for
instance. After much discussion and reordering, we finally arrived at what
seemed a satisfactory DTD structure.
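The element/attribute distinction can be sketched as follows. This is an illustration under stated assumptions, not the BCD DTD: the names sense, translation, source and note, and all sample values, are hypothetical, and XML-style markup stands in for SGML. The point shown is that content carried in attributes stays with the entry but is not emitted when only element content is extracted for publication.

```python
import xml.etree.ElementTree as ET

# Hypothetical names and values, for illustration only.
sense = ET.Element(
    "sense",
    {"source": "corpus-A", "note": "check register with reviser"},
)
translation = ET.SubElement(sense, "translation")
translation.text = "snowmobiler"

def published_text(element):
    """Concatenate element content only; attribute values are never emitted."""
    return "".join(element.itertext())

print(published_text(sense))   # element content only
print(sense.get("note"))       # working note, still available to the team
```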
We then tested it for a couple of months using the SGML authoring program
InContext. Entry preparation using the DTD revealed a certain number of
problems. For example, we could not add the grammatical comment NonC at the
start of a sense division for a word that was non-count in only one or two
of its senses (e.g. bush). While we had a repeatable irregular feminine form
element and a repeatable irregular plural form element, we had forgotten
that we needed to distinguish the irregular masculine plural
from the irregular feminine plural. We established lists of "problems", i.e.
elements that were either not appropriately placed or subelements that we
had ignored. These problems led to further changes to the DTD.
We now have a DTD that works for all words. While we still notice minor
"problems" from time to time, the changes required are so minimal that they
do not affect the overall structure. However, since every entry is different
from another, we have taken precautions to allow for future additions. We
have added a "loophole" at the end of each block, which can later be used to
add so-far unforeseen elements.
SGML Authoring Tools Used by the BCD
We began preparing entries using InContext, an authoring tool generously
donated to the project by Robert Arn, president of InContext Corporation.
This structured editor provides a dual view of structure and content.
However, only compulsory elements are visible in the left hand column. The
lexicographer has to click between compulsory elements to see what other
elements could be added at that point. Some lexicographers see this as a
disadvantage. Another disadvantage is that attributes are not immediately
visible to lexicographers, and given that the program allows only about 100
characters for attributes, some lexicographer's notes (which are attributes)
are cut short, unbeknownst to the lexicographer. However, despite these
limitations, InContext is an ideal tool for a lexicographer who is just
beginning to author entries in SGML, since the basic structure is clearly
revealed but separated from the content zone and it guides the lexicographer
through the authoring process.
More recently, we have begun to use the SGML option in Corel's WordPerfect 7.
This program allows the lexicographer to see, at a given point, all the
elements and attributes in the DTD on the screen. However, the elements and
attributes are accompanied by SGML tags, which can "clutter up" the screen.
While the tags can be hidden, certain lexicographers still prefer using
InContext.
Formatting Entries Written with SGML Markup
The greatest advantage for lexicographers authoring entries is that they no
longer have to worry about formatting and layout. Using a wordprocessing
program, they not only had to bold, italicize, and indent, but they also had
to remember what components needed such marking. They have now been freed of
these tasks. However, a dictionary entry authored using SGML is very
difficult for revisers and editors to read, even if the SGML tags are
removed at the time of printing. To facilitate the revision stage, we needed
style sheets.
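The idea behind a style sheet can be sketched as a mapping from element names to formatting applied at output time. This is only an illustration under stated assumptions: the element names, the sample entry, and the use of asterisks as stand-ins for bold and italic are all hypothetical, and XML-style markup stands in for SGML. It is not a description of the style sheets actually used by the BCD.

```python
import xml.etree.ElementTree as ET

# Hypothetical style sheet: element name -> (prefix, suffix).
# "**" stands in for bold, "*" for italic.
STYLE = {
    "headword": ("**", "** "),
    "pos": ("*", "* "),
    "translation": ("", ""),
}

def render(element):
    """Render an element and its descendants using the STYLE table."""
    prefix, suffix = STYLE.get(element.tag, ("", ""))
    parts = [prefix, element.text or ""]
    for child in element:
        parts.append(render(child))
        parts.append(child.tail or "")
    parts.append(suffix)
    return "".join(parts)

# Invented sample entry, for illustration only.
entry = ET.fromstring(
    "<entry><headword>motoneigiste</headword><pos>n</pos>"
    "<translation>snowmobiler</translation></entry>"
)
print(render(entry))
```

Because the formatting lives in the table rather than in the entry, changing how a component is displayed means editing one line of the style sheet, not every entry.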
Using Dictionary Entries as a Database
All our SGML entries are stored in a common directory on a SPARCstation and
can be consulted using LiveLink Search by the Open Text Corporation. This
search engine allows us not only to locate particular entries in the
directory, but also, by searching the SGML tags, to pull out specific
content elements. For example, we can thus identify all the words or senses
that are Canadianisms or informal in register, see if a particular
illustrative example has already been used, and ensure that cross-referenced
material is indeed in the entry where it is supposed to be. In other words,
LiveLink Search allows us to use the SGML entries as a "free-floating"
database.
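Tag-based retrieval of this kind can be sketched in a few lines. The sketch does not use or imitate LiveLink Search; it assumes XML-style markup in place of SGML, and the element names (entry, headword, sense, label) and the labels attached to the sample entries are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample entries; element names and labels are hypothetical.
ENTRIES = [
    "<entry><headword>bébelle</headword><sense>"
    "<label>Canadianism</label><label>informal</label></sense></entry>",
    "<entry><headword>adjust</headword><sense></sense></entry>",
]

def headwords_with_label(entries, label):
    """Return the headwords of entries that carry the given label anywhere."""
    hits = []
    for markup in entries:
        root = ET.fromstring(markup)
        labels = [element.text for element in root.iter("label")]
        if label in labels:
            hits.append(root.findtext("headword"))
    return hits

print(headwords_with_label(ENTRIES, "Canadianism"))
```

The same pattern, with a different element name in place of label, would serve to check whether an illustrative example has already been used elsewhere.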
Mission Accomplished?
We have come a long way in the last year in terms of suitable computerization
of our entries. And our transition from wordprocessing to SGML has been
relatively smooth. This is, in large part, due to the very efficient
assistance we received from our Microstar consultant David Megginson. But it
is also due to the fact that, having involved lexicographers themselves from
the start in the conversion to SGML, we have not had to cope with the
attitude problems described by Ed Hicks in his article "Battling Hydra -
Introduction of SGML in a Government Environment" (1996). The fact that most
of our lexicographers are young graduate students open to change and used to
technology has undoubtedly also been a factor in our not having to face what
Hicks terms "IIABDFI" (if it ain't broke, don't fix it).
However, one major task remains: converting over
12,000 completed entries from WordPerfect format to SGML. We are looking
into the possibility of using a parser for this purpose. But, meanwhile,
while new entries are being authored in SGML, former entries are still in
WordPerfect. So our goal of SGMLizing all our entries has not yet been
accomplished.