SMART Project: Methods for Computer-based Research of Premodern Chinese Texts

Christian Wittern, Chung-Hwa Institute of Buddhist Studies, Taiwan

This presentation will start with a look at some of the problems encountered so far in a number of projects that tried to apply TEI [TEIP3] markup to premodern Chinese Buddhist texts. I have been working with the TEI Guidelines for more than seven years and published the first text, rather heavily marked up in TEI fashion, in 1995°. Since then I have become involved with several other projects digitizing Chinese Buddhist texts, most prominently the work of the Chinese Buddhist Electronic Texts Association (CBETA)°. We now have about 200 MB of texts with basic markup° according to the Guidelines. All of these projects worked from printed editions published 80-100 years ago. One of the most obvious problems we encountered is the large number of non-standard characters found in these texts. TEI, and SGML in general, are quite able to handle this elegantly; nevertheless, there are some important details that should be noted°. Some of the more subtle problems involve structural elements specific to texts from the sphere of Chinese cultural influence. One such element is the scroll, a notion carried over from the time when the documents were actually written on scrolls, which still marks divisions in the printed editions. Being based on the physical medium, scroll boundaries fall into a similar category as the LB, PB and MILESTONE elements in TEI, but they are usually associated with other heading-like text, colophons and the like. While this could be handled in some way with the FW element, we decided on our own solution, which was to introduce a new element, JUAN (Chinese for scroll), and encode the information therein. Other structural elements that presented difficulties include sound glosses in the text, and colophons or other backmatter-like text that appears at the end of a scroll but in the middle of a DIV element that continues on the next scroll.
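As a rough illustration of how a scroll element can cut across a division, the following sketch parses a small invented sample and groups paragraphs by scroll. The lowercase element name `juan`, the attributes `n` and `type`, the sample text and the helper `paragraphs_by_scroll` are all assumptions made for this example; they are not taken from the project's actual DTD.

```python
import xml.etree.ElementTree as ET

# Invented sample: one division that continues across two scrolls.
# The attribute names "n" and "type" are hypothetical.
SAMPLE = """
<div type="sutra">
  <juan n="1" type="open">Heading of scroll 1</juan>
  <p>First passage.</p>
  <juan n="1" type="close">Colophon of scroll 1</juan>
  <juan n="2" type="open">Heading of scroll 2</juan>
  <p>Second passage, same division.</p>
</div>
"""


def paragraphs_by_scroll(xml_text):
    """Group the <p> children of the root element by the enclosing scroll."""
    root = ET.fromstring(xml_text.strip())
    current = None  # number of the scroll that is currently open
    result = {}
    for child in root:
        if child.tag == "juan":
            if child.get("type") == "open":
                current = child.get("n")
                result.setdefault(current, [])
            else:
                # A closing juan carries the colophon; no scroll is open after it.
                current = None
        elif child.tag == "p" and current is not None:
            result[current].append(child.text)
    return result


if __name__ == "__main__":
    for scroll, paragraphs in paragraphs_by_scroll(SAMPLE).items():
        print(scroll, paragraphs)
```

The point of the sketch is only that the scroll grouping is recovered from markers inside a single division, rather than from the division hierarchy itself.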
The second part of this presentation will give an overview of recent developments in the SMART (System for Markup and Retrieval of Texts) project°. This project aims to provide a working environment for research and markup on East Asian texts by utilizing the TEI Guidelines (see also [SpMcQ91]) and other international, open standards. The environment is meant to enable network-based collaboration and layered, private markup added to a central repository of texts, while remaining usable on stand-alone machines without a live connection to the Internet. So far, the basic framework has been outlined and some of the utilities have been built. Originally, the plan was to develop this into a collection of open modules that can interact through an open protocol, in the spirit of presentations at ACH/ALLC 1999 by Michael Sperberg-McQueen, Jon Bradley and others. However, since such a protocol specification is far from being finalized, I found that I would rather have a concrete implementation to experiment with and to iron out problems. I therefore recently decided to build the tools I need on top of the Zope° Web-Application platform. This is an OpenSource™ project built mainly with Python, implementing an object-oriented database and a complete framework for developing dynamic Web-Applications. It has strong support for XML and related standards and thus seems especially suited to the purpose at hand. All the methods are exposed through a URL-based interface and are also callable through XML-RPC. The presentation in the context of the ALLC/ACH conference aims to contribute to a discussion of how such an open framework can be implemented, while at the same time showing some of the problems that arise when dealing with East Asian languages (see [ApWi96] and [CCAG80-85]). East Asian languages do not normally mark word boundaries, and even the definition of a word is highly disputed among linguists.
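To make the consequence for retrieval concrete, the following minimal sketch indexes text by overlapping character bigrams instead of words, which is one common workaround when word boundaries are not marked. This technique is an assumption introduced here for illustration and is not necessarily the approach taken in SMART; the function names, text identifiers and sample strings are likewise hypothetical.

```python
from collections import defaultdict


def build_ngram_index(texts, n=2):
    """Map every overlapping character n-gram to the places where it starts."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add((text_id, i))
    return index


def search(index, query, n=2):
    """Return sorted (text_id, position) pairs where the query string occurs."""
    if len(query) < n:
        raise ValueError("query must be at least n characters long")
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    # Start from the first n-gram, then keep only the hits whose
    # following n-grams appear at the expected offsets.
    hits = set(index.get(grams[0], ()))
    for offset, gram in enumerate(grams[1:], start=1):
        later = index.get(gram, set())
        hits = {(tid, pos) for (tid, pos) in hits if (tid, pos + offset) in later}
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical sample strings; no segmentation into words is needed.
    texts = {"a": "如是我聞一時佛在", "b": "我聞如是一時"}
    index = build_ngram_index(texts)
    print(search(index, "如是我聞"))
    print(search(index, "一時"))
```

Such an index supports substring search without any decision about what counts as a word, at the cost of a larger index than a word list would need.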
In this situation, a list of all occurring words in the manner of a word-wheel cannot be applied. Additionally, the texts used here cont