SGMLizing the Bilingual Canadian Dictionary: Reasons, Process, and Problems

Introduction

This session explores the reasons for and the challenges of setting up a text processing application using SGML for lexicographic data. More particularly, it presents the experience of a group of researchers in the Humanities who were forced to become familiar with the SGML standard, to help design a Document Type Definition (DTD), and to get used to using SGML authoring tools to write a dictionary. The dictionary in question is the Bilingual Canadian Dictionary(BCD), which is still in preparation. As its tentative title indicates, it is a bilingual dictionary which will reflect English and French as they are used in Canada. The creation of this dictionary is the major objective of a vast collaborative research project, called "Comparative Lexicography of French and English in Canada", funded by the Social Sciences and Humanities Research Council. The project involves three universities: the University of Ottawa (which is also the administrative centre), the University of Montreal, and Laval University.

The Long Road from Wordprocessing to SGML

Dictionary entry preparation, which got off to a modest start in 1988, was done using a wordprocessing program for eight years (till 1996). Although we realized almost from the start that this was a very unsatisfactory method both for entry preparation and for future dissemination, it took us years of research on available technologies to find a suitable solution. The primary problem seemed to be that a dictionary entry has many components, a number of which are optional or repeatable at various points. Thus, a dictionary entry can be very long (e.g. the BCD entry for coeur) or quite short (e.g. our entry for motoneigiste) or simply a cross-reference to another entry (e.g. our entry for naveau). Another problem was the number of entries we have planned for the dictionary (approximately 80,000). Terminological database management systems, such as AQUILA and MC4, were obviously unsuitable for our needs, because of their predetermined structure which cannot be modified. MTX, also used for terminological data, offered more flexibility because it does not preestablish fields, but it has the disadvantage of separating, into two different zones, information on the source language word from information related to this word (the target language equivalents, illustrative examples, etc.) and of restricting the scope of each of these zones. In any case, it is not intended for large numbers of entries (or lexicographic records). After rejecting the possibility of using any of the readily available terminological database management systems, we began, in 1993, a worldwide search for lexicographic database management systems. However, despite our conviction that such systems exist - after all, dictionaries are constantly published - we were able to locate only two: one in Copenhagen (called GestorLEX) and another in Cambridge designed for the Cambridge International Dictionary of English. While both GestorLEX and the CIDE database system seemed to be comprehensive, well thought-out packages, with many specialized features, they both have two major disadvantages: (a) no North American technical support is available; and (b) both are "closed systems", in that they store their document bases in an internal format which can only be read by the system, even though GestorLEX does support SGML marking to facilitate information interchange. In addition, both would have needed to be adapted to our specific needs, and we were unable to obtain test copies to verify just how much further programming would be required on our part. However, programming was not our strong point, as we soon realized after attempting to adapt a general-purpose database management system, Rbase, to meet our needs. And efforts by the University of Montreal computer centre to design a system for us in Prolog progressed so slowly, apparently because of the complexity of some of our entries, that after a year and a half only the "introductory zone" was completed. By 1995, we had reached two major decisions: (a) given the fact that our project is a long-term project, we could not run the risk of being tied to one platform or one specific database system; (b) we needed expert advice, which was not available at the university level, on a text processing and text retrieval applications. By this point, we had acquired some knowledge of SGML, and thought that an SGML application could provide us with many advantages:

(a) Since an SGML document exists simply as plain text, it is easily portable from one application to another, or another platform to another. This is of considerable importance to us for several reasons: we have three lexicographic centres, which need to "exchange" information constantly; since the final product, the BCD will appear not only in electronic form but also in print, SGML would be a dramatic time-saver both to us and to the publisher; since, over the duration of the project, we would be updating our hardware, the fact that SGML can be used on any hardware was a definite asset.
(b) Since an SGML document does not need authors to concern themselves with formatting and layout, lexicographers would be better able to concentrate on the actual content of entries.
(c) Since SGML markup provides the document's structure, lexicographers would not accidently forget elements that need to be included in the entry. Moreover, since SGML entries are automatically parsed when they are saved, a number of lexicographic omissions would be flagged at this point.
(d) Since SGML markup is multipurpose, the tagging can serve for verification of specific entry elements during the final revision of all the dictionary entries, and for the creation of future subproducts (e.g. a bilingual dictionary of Canadianisms).

Armed with the conviction that SGML would provide us with a solution, we initially worked on a DTD with the help of a visiting professor from Rennes. Realizing, however, that the DTD design was only one element of an integrated system for our entries, we consulted with three computer consulting firms to see if they could help us realize our goal. While two of them could only promise us some sort of useable SGML product in the more or less distant future, we had the great good fortune to find at Microstar in Ottawa a consultant, Dr. David Megginson, who not only specializes in SGML, but who had also done previous lexicographic programming. With his assistance, we moved, in the space of six months, from wordprocessing to a fully integrated SGML application.

Preparation of the Document Type Definition

On Dr. Megginson's advice, the BCD team set up a committee of nine members (lexicographers, revisers, and professors) to work with him on the preparation of the DTD. It took seven day-long meetings to decide on the DTD structure, although we already had a good starting point. Since the BCD team had already designed a tentative entry structure and had produced about 10,000 entries, we began by examining the proposed dictionary microstructure and its realization in about thirty entries. The proposed microstructure was complex because of the number of components and combination of components possible. The entries chosen for examination were selected to represent as many of the components as possible. The main selection criteria were the following: representation of different parts of speech (e.g. aîné n, aîné adj); representation of monosemous and polysemous word entries (e.g. allopathy, a monosemous noun, versus ball, which is a highly polysemous noun); representation of compounds and fixed expressions in certain entries (e.g. feu n); representation of "marked" word entries (e.g. bébelle n) as well as unmarked word entries (e.g. adjust v); representation of entries created in different centres; and representation of entries from French to English as well as English to French. On the basis of all this supporting documentation, the DTD committee first identified all possible components and described each component in a Component Form. At the end of this stage, we had identified seventy-six components. These were then reviewed and revised. For example, we had initially identified as a component Annotated Translation Example, which we had defined as "translation of a source language free combination, collocation, fixed expression or compound preceded by one or more of the following: sense indication, actants, referents, comments." However, at the review stage, we realized that we did not need to identify this as a separate component, since all its constituents (translation, sense indication, actant, referent, comment) had been individually identified. At the review stage, we also added a couple of components, which had been previously missed. After all the components were identified and described, we decided to set up two separate DTDs - one for a Full Dictionary Entry, another for a Cross-Reference entry - since the latter contains very little information. We then started ordering the components for the Full Dictionary Entry, both vertically and horizontally, using Microstar's product, Near and Far Designer, which uses a simple "drag-and- drop" graphic interface to program the DTD. As the Dictionary Component Form presented above illustrates, while identifying the components, we had already specified to some extent the relations between components. These relations needed to be further clarified at this stage and elements grouped in blocks and groups. Thus, the elements irregular feminine, irregular plural and irregular singular were grouped together in a Noun Information Block, while the elements irregular feminine, irregular plural, regular comparative, regular superlative, irregular comparative and irregular superlative were placed together in the Modifier Information Block. At this point, we also had to decide which components should be elements and which should be attributes (of the elements). We ended up by considering several components that would disappear in the final dictionary entry as attributes: source codes and lexicographer's notes, for instance. After much discussion and reordering, we finally arrived at what seemed a satisfactory DTD structure. We then tested it for a couple of months using the SGML authoring program InContext. Entry preparation using the DTD revealed a certain number of problems. For example, we could not add the grammatical comment NonC at the start of a sense division for a word that was non-count in only one or two of its senses (e.g. bush). While we had a repeatable irregular feminine form element and a repeatable irregular plural form element, we had forgotten the fact that we needed to distinguish between the irregular masculine plural from the irregular feminine plural. We established lists of "problems", i.e. elements that were either not appropriately placed or subelements that we had ignored. These problems led to further changes to the DTD. We now have a DTD that works for all words. While we still notice minor "problems" from time to time, the changes required are so minimal that they do not affect the overall structure. However, since every entry is different from another, we have taken precautions to allow for future additions. We have added a "loophole" at the end of each block, which can later be used to add so-far unforeseen elements.

SGML Authoring Tools Used by the BCD

We began preparing entries using InContext, an authoring tool generously donated to the project by Robert Arn, president of InContext Corporation. This structured editor provides a dual view of structure and content. However, only compulsory elements are visible in the left hand column. The lexicographer has to click between compulsory elements to see what other elements could be added at that point. Some lexicographers see this as a disadvantage. Another disadvantage is that attributes are not immediately visible to lexicographers, and given that the program allows only about 100 characters for attributes, some lexicographer's notes (which are attributes) are cut short, unbeknownst to the lexicographer. However, despite these limitations, InContext is an ideal tool for a lexicographer who is just beginning to author entries in SGML, since the basic structure is clearly revealed but separated from the content zone and it guides the lexicographer through the authoring process. More recently, we have begun to use the SGML option in Corel's WordPerfect 7. This program allows the lexicographer to see, at a given point, all the elements and attributes in the DTD on the screen. However, the element and attribute are accompanied by SGML tags, which can "clutter up" the screen. While the tags can be hidden, certain lexicographers still prefer using InContext.

Formatting Entries Written with SGML Markup

The greatest advantage for lexicographers authoring entries is that they no longer have to worry about formatting and layout. Using a wordprocessing program, they not only had to bold, italicize, and indent, but they also had to remember what components needed such marking. They have now been freed of these tasks. However, a dictionary entry authored using SGML is very difficult for revisers and editors to read, even if the SGML tags are removed at the time of printing. To facilitate the revision stage, we needed style sheets. The greatest advantage for lexicographers authoring entries is that they no longer have to worry about formatting and layout. Using a wordprocessing program, they not only had to bold, italicize, and indent, but they also had to remember what components needed such marking. They have now been freed of these tasks. However, a dictionary entry authored using SGML is very difficult for revisers and editors to read, even if the SGML tags are removed at the time of printing. To facilitate the revision stage, we needed style sheets.

Using Dictionary Entries as a Database

All our SGML entries are stored in a common directory on a SPARCstation and can be consulted using LiveLink Search by the Open Text Corporation. This search engine allows us not only to locate particular entries in the directory, but also, by searching the SGML tags, to pull out specific content elements. For example, we can thus identify all the words or senses that are Canadianisms or informal in register, see if a particular illustrative example has already been used, ensure that crossreferenced material is indeed in the entry where it is supposed to be. In other words, Livelink Search allows use to use the SGML entries as a "free-floating" database.

Mission Accomplished?

We have come a long way in the last year in terms of suitable computerization of our entries. And our transition from wordprocessing to SGML has been relatively smooth. This is, in large part, due to the very efficient assistance we received from our Microstar consultant David Megginson. But it is also due to the fact that, having involved lexicographers themselves from the start in the conversion to SGML, we have not had to cope with the attitude problems described by Ed Hicks in his article "Battling Hydra - Introduction of SGML in a Government Environment" (1996). The fact that most of our lexicographers are young graduate students open to change and used to technology has undoubtedly also been a factor in our not having to face what Hicks terms "IIABDFI" (if it ain't broke, don't fix it). However, we still have one major task that remains: that of converting over 12,000 completed entries from WordPerfect format to SGML. We are looking into the possibility of using a parser for this purpose. But, meanwhile, while new entries are being authored in SGML, former entries are still in WordPerfect. So our goal of SGMLizing all our entries has not yet been accomplished.