“Technical aspects of the production process of digital
books using XML-TEI at the Miguel de Cervantes digital library”
Alejandro Bia
Miguel de Cervantes Digital Library, University of Alicante
abia@dlsi.ua.es
We describe the digital-book production process followed at the Miguel de
Cervantes Digital Library, from book acquisition up to Internet publishing,
highlighting the main requirements and design considerations of the workflow
system.
Our library covers many different areas, from a "library of voices" to
academic theses, and it includes all kinds of multimedia material: text,
images, audio and video. However, the vast majority of our resources are in
text format. These are our 4000 digital books, public domain Hispanic
classics from the twelfth century to the present day, including narrative,
theater, poetry, history and other subjects. Many professionals and
technicians take part in the development of our digital books: librarians,
scanner operators, correctors, markup specialists and computer technicians.
The poster describes the production process of the digital books and invites
discussion of markup issues concerning our approach using XML-TEI
encoding.
The production process begins with a bibliographic search to find interesting
available books to digitize. After selecting new literary works to add to
the collection, the librarians prepare the orders to be sent to various
sources (conventional libraries, bookstores, publishers, private collectors
in the case of rare books, etc.).
Bibliographic information associated with each book is stored in a catalogue
database. This information is used for many purposes: it helps in the control
of both the production process and the publication process, it allows
catalogue searches, and is provided to the readers in the form of a digital
bibliographic card accessible through the Internet.
The source physical books and the produced digital books do not always relate
on a one-to-one basis. In some cases, a physical book will give rise to
many digital books, as is the case of collections or "complete works" that
may be split into several digital books. In a digital library there is no
reason to group different literary works as is done in a printed book, since
the criteria used for traditional books do not apply to their digital
siblings. However, there are exceptions: literary experts may decide to group
poems from different collections into a single digital book. Titles may also
differ, since some works are known by several titles according to different
editions.
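The relation between printed sources and digital books can be sketched as a simple link table. This is only an illustration under stated assumptions: the identifiers (`P1`, `D1`, ...) and the names `SourceLink`, `digital_books_from` and `sources_of` are hypothetical and do not come from the library's actual catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLink:
    """One link between a printed source and a digital book (hypothetical model)."""
    physical_id: str   # assumed identifier of a printed volume
    digital_code: str  # assumed identifier of a digital book


def digital_books_from(links: list[SourceLink], physical_id: str) -> list[str]:
    """Return the digital books derived from one printed volume."""
    return sorted({l.digital_code for l in links if l.physical_id == physical_id})


def sources_of(links: list[SourceLink], digital_code: str) -> list[str]:
    """Return the printed volumes that contributed to one digital book."""
    return sorted({l.physical_id for l in links if l.digital_code == digital_code})


links = [
    SourceLink("P1", "D1"),  # a "complete works" volume split into two d-books
    SourceLink("P1", "D2"),
    SourceLink("P2", "D3"),  # poems from two collections grouped into one d-book
    SourceLink("P3", "D3"),
]
```

Keeping the links separate from both kinds of record lets the same structure express the split case and the grouped case.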
Upon reception, the books are cataloged. Information like subject, authors
and collaborators, universal decimal classification and search keys that
will simplify the location and retrieval of the books is stored in a
database. At this point, the production process begins. The librarians mark
the received source titles as available for processing, and a unique code
that will permanently identify the digital book (d-book) is assigned. This
code is used
within the workflow system, and also in the names of all the files related
to the book (during production and also for publication purposes).
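As a rough sketch of how a single code can tie together all the files of a d-book, the following assumes a naming scheme of the form `<code>_<kind>.<extension>`. The scheme, the alphanumeric code format and the function names are assumptions made for illustration; the actual convention is not specified here.

```python
import re

# Assumed for illustration: d-book codes are purely alphanumeric.
CODE_PATTERN = re.compile(r"^[A-Za-z0-9]+$")


def dbook_file_name(code: str, kind: str, extension: str) -> str:
    """Build a file name that embeds the d-book code (hypothetical scheme)."""
    if not CODE_PATTERN.match(code):
        raise ValueError(f"invalid d-book code: {code!r}")
    return f"{code}_{kind}.{extension}"


def code_from_file_name(file_name: str) -> str:
    """Recover the d-book code from a file name built by dbook_file_name."""
    return file_name.split("_", 1)[0]
```

Because the code is the leading component, every file belonging to a book can be found with a simple prefix match, both during production and after publication.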
At every stage of the production process, start and end date-time information
is recorded along with the operator identification for follow-up and
production control purposes. At any time, the librarians can access the
records of the d-books under development to modify bibliographic-catalogue
information.
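The per-stage tracking described above might be represented by a record like the one below. It is a minimal sketch: the field names, the stage label and the operator identifier are hypothetical, and the real workflow system presumably stores this in a database rather than in memory.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StageRecord:
    """Start/end times and operator for one production stage (hypothetical)."""
    dbook_code: str
    stage: str
    operator_id: str
    started_at: datetime
    ended_at: Optional[datetime] = None

    def close(self, when: datetime) -> None:
        """Mark the stage as finished, rejecting an end before the start."""
        if when < self.started_at:
            raise ValueError("end time precedes start time")
        self.ended_at = when

    def duration_seconds(self) -> Optional[float]:
        """Elapsed time in seconds, or None while the stage is still open."""
        if self.ended_at is None:
            return None
        return (self.ended_at - self.started_at).total_seconds()


record = StageRecord("000123", "scanning", "op07", datetime(2001, 5, 14, 9, 0))
record.close(datetime(2001, 5, 14, 10, 30))
```

Records of this kind are enough to answer the basic production-control questions: who worked on a book, at which stage, and for how long.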
The resulting output of the scanning process is a set of files of two
classes: scanned images and optical character recognition (OCR) text
documents. The former are stored in backup media for future
projects. The latter, after an automatic error recovery process, are passed
to the correction stage.
At quality control, if too many errors are detected, measures are taken to
adjust the scanning-OCR process to improve the resulting output, since a
high rate of mistakes raises the time cost of the rest of the process.
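The quality-control decision can be illustrated with a simple error-rate check. The threshold of one error per hundred characters is an arbitrary value chosen for the example, not a figure used by the library, and both function names are hypothetical.

```python
def error_rate(errors_found: int, characters_checked: int) -> float:
    """Fraction of checked characters that were found to be wrong."""
    if characters_checked <= 0:
        raise ValueError("characters_checked must be positive")
    return errors_found / characters_checked


def needs_ocr_adjustment(errors_found: int, characters_checked: int,
                         max_rate: float = 0.01) -> bool:
    """True when the sample exceeds the tolerated rate (max_rate is assumed)."""
    return error_rate(errors_found, characters_checked) > max_rate
```

Checking a sample early is cheaper than letting a badly recognized text reach the correction stage, which is the motivation given above for adjusting the scanning-OCR settings.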
As our library handles books of