Technical aspects of the production process of digital books using XML-TEI at the Miguel de Cervantes digital library

Alejandro Bia, Miguel de Cervantes Digital Library, University of Alicante (abia@dlsi.ua.es)

We describe the digital-book production process followed at the Miguel de Cervantes Digital Library, from book acquisition to Internet publishing, highlighting the main requirements and design considerations of the workflow system. Our library covers many different areas, from a "library of voices" to academic theses, and it includes all kinds of multimedia material: text, images, audio and video. However, the vast majority of our resources are in text format. These are our 4000 digital books: public-domain Hispanic classics from the twelfth century to the present day, including narrative, theater, poetry, history and other subjects. Many professionals and technicians take part in the development of our digital books: librarians, scanner operators, correctors, markup specialists and computer technicians. The poster describes the production process of the digital books and invites discussion of markup issues concerning our approach using XML-TEI encoding.

Figure 1. Architecture of the digital-book production process

The production process begins with a bibliographic search to find interesting, available books to digitize. After selecting new literary works to add to the collection, the librarians prepare the orders to be sent to various sources (conventional libraries, bookstores, publishers, private collectors in the case of rare books, etc.). Bibliographic information associated with each book is stored in a catalogue database. This information is used for many purposes: it helps control both the production process and the publication process, it allows catalogue searches, and it is provided to readers in the form of a digital bibliographic card accessible through the Internet.

The source physical books and the digital books produced from them do not always relate on a one-to-one basis. In some cases a physical book gives rise to many digital books, as with collections or "complete works" that may be split into several digital books. In a digital library there is no reason to group different literary works as is done in a printed book, since the criteria used for traditional books do not apply to their digital siblings. There are exceptions, however: literary experts may decide to group poems from different collections into a single digital book. Titles may also differ, since some works are known by several titles in different editions.

Upon reception, the books are catalogued. Information such as subject, authors and collaborators, universal decimal classification, and search keys that simplify the location and retrieval of the books is stored in a database. At this point, the production work itself begins. The librarians mark the received source titles as available for processing, and each d-book is assigned a unique code that identifies it permanently. This code is used within the workflow system and also in the names of all the files related to the book (during production and for publication purposes).
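
The relation between source books and d-books, and the role of the permanent code in file names, can be illustrated with a minimal Python sketch. It is not the library's actual implementation: the class names, the "DB" plus six-digit code format, and the file-naming pattern are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List


@dataclass
class DigitalBook:
    code: str       # permanent unique identifier (format is hypothetical)
    title: str
    source_id: str  # catalogue id of the physical source book

    def file_name(self, stage: str, extension: str) -> str:
        """Build a file name that embeds the d-book code (assumed naming scheme)."""
        return f"{self.code}_{stage}.{extension}"


@dataclass
class SourceBook:
    source_id: str
    title: str
    # One physical book may give rise to several digital books.
    digital_books: List[DigitalBook] = field(default_factory=list)


class Catalogue:
    """Toy in-memory stand-in for the catalogue database."""

    def __init__(self) -> None:
        self._sequence = count(1)
        self.sources: Dict[str, SourceBook] = {}

    def register_source(self, source_id: str, title: str) -> SourceBook:
        source = SourceBook(source_id, title)
        self.sources[source_id] = source
        return source

    def assign_digital_book(self, source_id: str, title: str) -> DigitalBook:
        # The code is generated once and never reused or changed.
        code = f"DB{next(self._sequence):06d}"
        book = DigitalBook(code, title, source_id)
        self.sources[source_id].digital_books.append(book)
        return book


catalogue = Catalogue()
catalogue.register_source("SRC-1", "Obras completas")
galatea = catalogue.assign_digital_book("SRC-1", "La Galatea")
novelas = catalogue.assign_digital_book("SRC-1", "Novelas ejemplares")
print(galatea.file_name("ocr", "txt"))  # DB000001_ocr.txt
print(len(catalogue.sources["SRC-1"].digital_books))  # 2
```

Keeping the code independent of the title reflects the point made above: titles may vary between editions, while the identifier used in the workflow and in file names stays fixed.
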
At every stage of the production process, start and end date-time information is recorded along with the operator's identification, for follow-up and production control purposes. At any time, the librarians can access the records of the d-books under development to modify bibliographic catalogue information. The output of the scanning process is a set of files of two classes: scanned images and optical character recognition (OCR) text documents. The former are stored on backup media for future projects. The latter, after an automatic error recovery process, are passed to the correction stage. At quality control, if too many errors are detected, measures are taken to adjust the scanning-OCR process and improve its output, since a high error rate raises the time cost of the rest of the process. As our library handles books of