“A Digital Library System for Japanese Classical Literature”
Shoichiro Hara
National Institute of Japanese Literature
hara@nijl.ac.jp
Hisashi Yasunaga
National Institute of Japanese Literature
yasunaga@nijl.ac.jp
1. Overview
The National Institute of Japanese Literature (NIJL) has been designing, building, managing, and maintaining databases on Japanese classical literature for academic researchers both in Japan and abroad. NIJL's database system consists of a computer and a network, and provides three catalogue databases: the Catalogue of Holding Microfilms of Manuscripts and Printed Books on Japanese Classical Literature, the Catalogue of Holding Manuscripts and Printed Books on Japanese Classical Literature, and the Bibliography of Research Papers on Japanese Classical Literature. A feature of NIJL's computer system is that all data processing, from data compilation and correction through database service to publishing, is executed on a mainframe computer system. However, after more than ten years of operation, the database system has accumulated many unsolved problems in both software and hardware. To solve these problems, NIJL has started a new project to build a digital library for Japanese classical literature. Over several years, this project will downsize the mainframe system and reconstruct it as a distributed computer system. The key words of the project are "standardization of data," "data independent of systems," and "multimedia oriented." At present, following this policy, we are reconstructing the catalogue databases and full-text databases, and from this year we will start constructing a new image database of Holding Manuscripts and Printed Books on Japanese Classical Literature. After a few years of experiments, we have recognized that a digital library alone cannot always contribute to the research activities of humanities scholars. A digital library is only a bank of raw material; valuable results are produced within individual research environments.
Thus, we believe that better and more effective software tools, linked with digital libraries for downloading raw data and uploading research results, can assist researchers in their work. We have begun a new study of software for the humanities, which we call a "Digital Study System." In the following, chapter two describes the ongoing projects of the digital library, and chapter three describes the new image database project. Finally, chapter four describes the new study of the "Digital Study System" for the humanities.
2. Ongoing Projects
2.1 SGML as the Basis of Data Description
There are several languages or standards for describing text structures, including SGML (Standard Generalized Markup Language), TeX, PostScript, and ODA (Open Document Architecture). Among these, SGML is the only language that can describe the logical structure of text. Because it is established as an ISO and JIS (Japanese Industrial Standard) standard, many applications have been developed for it. At present, we are reconstructing the catalogue databases and full-text databases. Both kinds of data can essentially be considered as nested, variable-length string fields. SGML can describe complicated text structures such as repeating groups, nesting, order of appearance, and number of appearances. If a data search is regarded as "a search for a specific string in text data," it is possible to construct a database system that uses a string-searching device. Indeed, in research on Japanese literature, searching by string is more common than searching by number. Meanwhile, fast string-search devices and software are being developed and sold, and all of these products are capable of handling SGML data. Consequently, we have carried out several projects based on SGML.
2.2 Catalog Databases
Catalogue data is used for various purposes, such as on-line database service, publication in printed form, and publication on CD-ROM. The database system was designed more than 10 years ago around the devices available at that time. Because the latest computer systems cannot support these devices, we are taking this opportunity to reconstruct the whole database system. Having reviewed the old systems, we adopted a new policy of making data independent of hardware and software; specifically, we introduced SGML to describe the data. Because the original data was prepared and compiled by librarians from their own point of view, some researchers find the contents insufficient for their research purposes. We are adopting their advice and expanding the data structure while reconstructing the systems. Based on these ideas, we began reconstructing the catalogue database systems. Specifically, we have:
- 1) Created a DTD (Document Type Definition) for the new catalogue data.
- 2) Converted original data to SGML data.
- 3) Built a prototype database system using a string searching tool.
- 4) Converted SGML data to LaTeX data and output in printed form.
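Steps 2 and 3 above can be illustrated with a minimal sketch. It assumes hypothetical element names (`record`, `title`, `author`, `holding`, `form`, `volumes`) and invented sample values, since the actual NIJL DTD and data are not shown here. The helper functions `to_sgml` and `search_field` are likewise illustrative, and a regular expression stands in for the string searching tool; this is not a real SGML parser.

```python
import re
from html import escape  # escapes &, <, > (also markup-significant in SGML)
from typing import List


def to_sgml(record: dict) -> str:
    """Serialize nested fields as SGML-style markup.

    A list value becomes a repeating element; a dict value becomes a
    nested element. Dict order is kept, so order of appearance is preserved.
    """
    parts = []
    for tag, value in record.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                body = to_sgml(item)
            else:
                body = escape(str(item), quote=False)
            parts.append(f"<{tag}>{body}</{tag}>")
    return "".join(parts)


def search_field(sgml: str, tag: str, needle: str) -> List[str]:
    """Return the content of each <tag> element that contains needle.

    Plain string matching; assumes <tag> is never nested inside itself.
    """
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    return [content for content in pattern.findall(sgml) if needle in content]


# Invented sample values, for illustration only.
sample = {
    "record": {
        "title": "Example Tale",
        "author": ["Author A"],
        "holding": [  # repeating group of nested fields
            {"form": "manuscript", "volumes": "54"},
            {"form": "microfilm", "volumes": "12"},
        ],
    }
}

catalogue_sgml = to_sgml(sample)
print(catalogue_sgml)
print(search_field(catalogue_sgml, "form", "micro"))
```

Restricting the search to one element, as `search_field` does, is one simple way to treat a catalogue query as a string search over tagged, variable-length fields.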
2.3 Full Text Database
When we began constructing our full-text database, there was little movement in Japan toward standardizing text data description. We considered SGML a favorable standard for defining text structure and describing text data. However, its Japanese standard had not yet been established and, worse, there were no applications that could handle the Japanese language. For these reasons, we had to establish our own text description rules based on SGML. We call these the KOKIN rules (KOKubungaku (Japanese literature) INformation). Because the KOKIN rules were designed to be easy to understand and use, they have been favored by humanities researchers. However, because they are independent of any other standard, there are no good tools for parsing and checking KOKIN-based texts. SGML was originally developed as a document markup language for publishers, but it has recently come to be regarded as an encoding scheme for transmitting data between systems. Against this background, we concluded that our text data should be converted to SGML-based data for effective data circulation. As SGML has recently become popular in Japan, we began a project to construct a new full-text database based on SGML. We used "the Anthology of Storiette" as a sample. This is a collection of short stories of citizens in the Edo period. The text has already been transcribed by one of our co-workers at NIJL and marked up according to the KOKIN rules. It has complex structures such as editorial corrections, side notes, Japanese renderings, and so on. We conducted the following experiments:
- 1) Creating a DTD for "the Anthology of Storiette."
- 2) Converting the original data to the SGML data.
- 3) Constructing the database system using a string searching tool.
- 4) Converting SGML data to LaTeX data and printing a block-copy manuscript.
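Step 4 above, converting tagged text to LaTeX, might look roughly like the following sketch. The element names `corr` (editorial correction) and `sidenote` (side note) and their LaTeX mappings are assumptions made for illustration; the DTD for "the Anthology of Storiette" is not given here. The converter handles only simple start and end tags without attributes, so it is a toy stand-in for a real SGML parser.

```python
import re

# Hypothetical element-to-command mapping (not taken from the actual DTD).
LATEX_COMMANDS = {
    "sidenote": r"\marginpar",  # side note -> margin paragraph
    "corr": r"\emph",           # editorial correction -> emphasized text
}

# A few characters that are special in LaTeX and need escaping in running text.
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_", "$": r"\$"}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters in plain text."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def sgml_to_latex(sgml: str) -> str:
    """Convert simple SGML-style markup to LaTeX, checking that tags nest properly."""
    out = []
    stack = []  # currently open elements
    pos = 0
    for match in TAG.finditer(sgml):
        out.append(escape_latex(sgml[pos:match.start()]))
        closing, name = match.group(1), match.group(2)
        if name not in LATEX_COMMANDS:
            raise ValueError(f"unknown element: {name}")
        if closing:
            if not stack or stack.pop() != name:
                raise ValueError(f"unexpected end tag: {name}")
            out.append("}")
        else:
            stack.append(name)
            out.append(LATEX_COMMANDS[name] + "{")
        pos = match.end()
    if stack:
        raise ValueError(f"unclosed element: {stack[-1]}")
    out.append(escape_latex(sgml[pos:]))
    return "".join(out)


print(sgml_to_latex("text <corr>fixed</corr> more<sidenote>a & b</sidenote>"))
```

Rejecting unknown or badly nested tags during conversion gives a small part of the checking that a DTD-driven parser would provide, which is the kind of tool support the KOKIN-based texts lacked.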