“SMART Project: Methods for Computer-based Research of
Premodern Chinese Texts”
Christian Wittern
Chung-Hwa Institute of Buddhist Studies, Taiwan
This presentation will start with a look at some of the problems encountered so
far in a number of projects that tried to apply TEI [TEIP3] markup to premodern
Chinese Buddhist texts. I have been working with the TEI Guidelines for more
than seven years and published the first text, rather heavily marked up in TEI
fashion, in 1995°. Since then I have become involved with some
other projects digitizing Chinese Buddhist texts, most prominently the work by
the Chinese Buddhist Electronic Texts Association (CBETA)°. We now have about
200 MB of texts basically marked up° according to the Guidelines.
All of these projects worked from printed editions published 80-100 years ago.
One of the most obvious problems we encountered is the large number of
non-standard characters found in these texts. TEI, and SGML in general, are
quite able to handle this elegantly; nevertheless, there are some important
details that should be noted°. Some of the more subtle
problems involve structural elements specific to texts of the sphere of Chinese
cultural influence. Examples of these elements include the notion of a scroll,
which is carried over from the time when the documents were actually written on
scrolls but still marks divisions in the printed editions. Being based on the
physical medium, these divisions fall into a similar category as the LB, PB and
MILESTONE elements in TEI, but they are usually associated with some other
heading-like text, colophons and the like. While this could be handled in some
way with the FW element, we decided to come up with our own solution, which was
to introduce a new element, JUAN (Chinese for scroll), and encode the
information therein.
Other structural elements that presented difficulties include sound glosses in
the text, and colophons or other backmatter-like text that appears at the end of
a scroll but in the middle of a DIV element that continues on the next scroll.
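To make the scroll problem concrete, here is a minimal sketch in Python of how such markup might be read back. Only the element name JUAN comes from the text above; the sample fragment, the attribute names `n` and `fun`, the lowercase spelling used for XML, and the helper `list_juan` are assumptions made for illustration and are not taken from any actual project DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one DIV that continues across a scroll boundary.
# The attribute names "n" and "fun" are illustrative assumptions.
SAMPLE = """<div type="chapter">
<p>Text at the end of the first scroll.</p>
<juan n="1" fun="close">End of scroll one (colophon-like text)</juan>
<juan n="2" fun="open">Scroll two (heading-like text)</juan>
<p>The same chapter continues on the second scroll.</p>
</div>"""


def list_juan(xml_text):
    """Return (number, function, text) for every juan element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (j.get("n"), j.get("fun"), (j.text or "").strip())
        for j in root.iter("juan")
    ]


for n, fun, text in list_juan(SAMPLE):
    print(n, fun, text)
```

In this sketch the scroll markers sit inside the DIV without closing it, so the logical division stays intact while the heading-like and colophon-like text of each scroll is still captured.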
A second part of this presentation will give an overview of the recent
developments in the SMART (System for Markup and Retrieval of Texts)
project°. This
project aims at providing a working environment for research and markup on East
Asian texts by utilizing the TEI Guidelines (see also [SpMcQ91]) and other
international, open standards. The environment is meant to enable network-based
collaboration and layered, private markup added to a central repository of
texts, but it should also be usable on stand-alone machines
without a live connection to the Internet. So far, the basic framework has been
outlined and some of the utilities built. Originally, the plan was to develop
this into a collection of open modules that can interact through an open
protocol in the spirit of presentations at ACH/ALLC 1999 by Michael
Sperberg-McQueen, Jon Bradley and others. However, since such a protocol
specification is far from being finalized, I found that I would rather have a
concrete implementation to play with and to iron out problems. I therefore
recently decided to build the tools I would need on top of the Zope°
Web-Application platform. This is an OpenSource™
project built mainly with Python, implementing an object-oriented database and a
complete framework for developing dynamic Web-Applications. It has strong
support for XML and related standards and thus seems especially suited for the
purpose at hand. All the methods are exposed through a URL-based interface and
are also callable through XML-RPC.
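As a rough illustration of these two access paths, the following Python sketch builds the same call once as a URL and once as an XML-RPC request body. The base URL, the method name `search_text`, and its arguments are hypothetical; the abstract does not list the actual methods of the SMART tools. No network connection is made; the sketch only shows what would be sent.

```python
import xmlrpc.client
from urllib.parse import urlencode

# Hypothetical values, chosen only for illustration.
BASE_URL = "http://localhost:8080/smart"
METHOD = "search_text"
ARGS = ("T0001", "菩薩")  # assumed: a text identifier and a search term

# 1. URL-based access: the method name becomes part of the path,
#    the arguments become query parameters (parameter names are assumed).
url = BASE_URL + "/" + METHOD + "?" + urlencode({"text": ARGS[0], "query": ARGS[1]})
print(url)

# 2. XML-RPC access: the same call serialized as the body of an HTTP POST.
request_xml = xmlrpc.client.dumps(ARGS, methodname=METHOD)
print(request_xml)

# Decoding the request again, as the receiving side would do.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

Against a running server, the same call could be issued with `xmlrpc.client.ServerProxy(BASE_URL)` instead of serializing the request by hand.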
The presentation in the context of the ALLC/ACH conference aims at contributing
to a discussion of how such an open framework can be implemented, while at the
same time showing some of the problems that arise when dealing with East Asian
languages (see [ApWi96] and [CCAG80-85]). East Asian languages do not normally
mark word boundaries, and even the definition of a word is highly disputed
among linguists. In this situation, a list of all occurring words in the manner
of a word-wheel cannot be applied. Additionally, the texts used here cont