“In Defense of Invalid SGML”
David J. Birnbaum
University of Pittsburgh
djbpitt+@pitt.edu
SGML was developed primarily for encoding new texts, an environment in which the
rigorous adherence to a DTD ensures that the resulting documents will observe a
coherent structure. For example, if a dictionary is divided into, among other
things, lexical entries, which consist of a keyword followed by a phonetic
transcription followed by examples, this structural feature can be represented
partially by the following declaration:
<!ELEMENT entry - - (keyword, phonetic, example+)>

Users who create new dictionaries based on this DTD will be required by their SGML editing and validating tools to include exactly one keyword followed by exactly one phonetic transcription followed by at least one example. The SGML software is not, of course, able to ensure that the #PCDATA content of these component elements is correct, but it can guarantee, for example, that the element phonetic will precede, and not follow, the element example.

This model is appropriate in an environment where SGML tools are used to create structured documents, but it is far less well suited to a different environment that happens to be extremely common in humanities computing: producing electronic versions of preexisting documents. The problem is that preexisting documents that were created outside SGML editors may, because of the fallibility of human editors, violate the overall logical structure of those documents, leaving the editor of the electronic edition with three unattractive choices: 1) "correct" the original text during transcription, 2) create a loose DTD that does not enforce a strict element order, or 3) create an invalid document that violates its DTD.

The first of these alternatives, "correcting" errors in the original source document, is unattractive because transcriptions of (to continue the original example) existing print dictionaries have two essences: they are new and functional electronic dictionaries and they are electronic records of existing archaeographic materials, viz. print dictionaries. While "correcting" errors observes the spirit of the first of these essences, it runs counter to the second.

The second alternative, creating a loose DTD that does not enforce a strict element order (by, for example, replacing the commas with ampersands in the content model portion of the element declaration above), is unattractive because it sacrifices structural information.
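The difference between the strict and the loose content model can be illustrated outside SGML with a minimal Python sketch. It is not an SGML parser: it assumes that an entry has already been reduced to a list of its child element names, and the one-letter codes, the function names, and the sample entries are all hypothetical. The loose pattern models the ampersand version of the declaration, in which all three parts must occur but may do so in any order.

```python
import re

# One-letter codes for the children of <entry> (an assumption of this sketch).
CODES = {"keyword": "k", "phonetic": "p", "example": "e"}

# Strict model from the declaration: (keyword, phonetic, example+)
STRICT = re.compile(r"kpe+")

# Loose model with ampersands: (keyword & phonetic & example+),
# i.e. all three parts are required, but in any order.
LOOSE = re.compile(r"kpe+|ke+p|pke+|pe+k|e+kp|e+pk")


def encode(children):
    """Turn a list of child element names into a string of one-letter codes.

    Raises KeyError for a name that is not a known child of <entry>.
    """
    return "".join(CODES[name] for name in children)


def is_strict(children):
    """True if the children satisfy the strict, ordered content model."""
    return STRICT.fullmatch(encode(children)) is not None


def is_loose(children):
    """True if the children satisfy the loose, unordered content model."""
    return LOOSE.fullmatch(encode(children)) is not None


# Hypothetical entries: one regular, one with the phonetic
# transcription misplaced after the example.
regular = ["keyword", "phonetic", "example", "example"]
slip = ["keyword", "example", "phonetic"]

for entry in (regular, slip):
    print(entry, "strict:", is_strict(entry), "loose:", is_loose(entry))
```

Both entries satisfy the loose model, so a DTD built on it cannot distinguish the regular entry from the slip; only the strict model marks the second entry as exceptional.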
If the evidence clearly points to the transgression of a structure involving strict order, rather than to conformity to a loose structure that does not require strict order, a DTD based on the latter conceals information about the document, viz. what is regular and what is exceptional.

The solution that best preserves the integrity of the original source while simultaneously encoding the difference between norm and violation of norm is the third alternative, the creation of an invalid document that violates its DTD. This solution is unavailable, and, indeed, almost kinky or subversive in an SGML context because our perception of SGML as a way of modeling document structure assumes both that documents may be structured and that if they are structured, that structure should be represented by the DTD. The weakness in this perspective is that documents created in obvious conformity to a very explicit structure (such as dictionaries), but created without the assistance of SGML tools, may occasionally violate their structure through human error. These violations are informational, at least from an archaeographic perspective, and should be preserved. And what should be preserved is not merely that the offending portions observe a different structure, and certainly not that the overall document structure is generally loose. What should be preserved is what document analysis reveals: the document has a highly structured DTD and the offending portions are conceptual DTD violations, rather than alternative valid structures.

I do not advocate, of course, that we prepare and publish invalid SGML, or that SGML processing software be enhanced to react affirmatively not only to valid SGML events, but also to SGML errors. But I would suggest that when we perform document analysis on existing texts, we recognize that some oddities may at least logically (although perhaps not practically) be represented not as document structure, but as violations of document structure.
The requirement that SGML be valid seems in most contexts so obvious that it would never be questioned, but if document analysis of existing documents reveals violations of structure, the most appropriate model of this information in SGML terms involves invalid SGML. If the creation of invalid SGML is foreclosed for practical reasons, our most honest alternative is to enrich whatever solution we do adopt with annotations in markup that tell the truth: what we are encoding is the equivalent of parser error messages, and the fact that our document violates its basic structure in specific places is informational.
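One way to picture such annotations is the following minimal Python sketch, which describes each departure from the strict model (keyword, phonetic, example+) in the manner of a parser error message. Everything in it is an assumption made for illustration: the function names, the wording of the messages, and above all the "violation" attribute, which is hypothetical and not part of any existing DTD.

```python
# Child elements of <entry> in the order required by the strict model.
EXPECTED_ORDER = ["keyword", "phonetic", "example"]


def violations(children):
    """Return parser-style messages for departures from the strict model.

    An empty list means the entry conforms to (keyword, phonetic, example+).
    Raises ValueError for a name that is not a known child of <entry>.
    """
    messages = []
    stage = 0      # rank of the latest element kind accepted so far
    seen = set()   # element names encountered so far
    for position, name in enumerate(children, start=1):
        rank = EXPECTED_ORDER.index(name)
        if rank < stage:
            messages.append(
                f"child {position}: {name} occurs after {EXPECTED_ORDER[stage]}"
            )
        elif name in seen and name != "example":
            messages.append(f"child {position}: {name} is repeated")
        else:
            stage = rank
        seen.add(name)
    for name in EXPECTED_ORDER:
        if name not in seen:
            messages.append(f"{name} is missing")
    return messages


def start_tag(children):
    """Build an <entry> start tag, recording any violations in a
    hypothetical 'violation' attribute."""
    found = violations(children)
    if not found:
        return "<entry>"
    return '<entry violation="' + "; ".join(found) + '">'


# Hypothetical entry with the phonetic transcription after the example.
print(start_tag(["keyword", "example", "phonetic"]))
```

The transcription itself would still have to be valid against whatever DTD is adopted; the sketch only shows how the fact of a violation, and its location, might travel with the entry instead of being silently absorbed by a looser content model.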