Digital Humanities Abstracts

“The "Biblioteca Italiana Telematica" Project (http://cibit.unipi.it): a Progress Report”
Mirko Tavoni, University of Pisa, Italy; Eugenio Picchi, CNR, Italy

The Biblioteca Italiana Telematica (Italian Library Online) is a digital library of representative texts of the Italian cultural tradition from the Middle Ages to the 20th century (literature, language, history, philosophy, art, music and cultural history in the broad sense). The library aims to be a major research and teaching facility at the service of researchers and students of Italian language and culture throughout the world.

The library has been set up, and is managed, by the CIBIT (Centro Interuniversitario Biblioteca Italiana Telematica), which brings together interdisciplinary research groups in 16 Italian universities. The project's software implementation has been entrusted to Eugenio Picchi of the Institute of Computational Linguistics (ILC) of the Italian National Research Council (CNR). The project is financed by the Italian Ministry of Higher Education (MURST) under two research programmes of relevant national interest: "Biblioteca Telematica Italiana: the Italian cultural tradition on the Internet" (1998-99) and "Textual memory: editions, studies and tools for computer analysis of the Italian heritage" (2000-2001).

The design of the Biblioteca Italiana Telematica has been quite complex because of the many different functions that had to be developed and integrated within it. The purpose of the present paper is to illustrate:
  • the project's objectives;
  • the competencies deemed to be necessary and how they have been provided for;
  • system choices made and development procedures performed up to now;
  • the results achieved to date;
  • further lines of ongoing development.

Objectives

The objectives established for the project can be summarised as follows:
  • to create an online reference and research tool that satisfies the demands of a broad and varied public, including:
    • the student or general user interested in retrieving a text, obtaining bibliographical information on it, and reading and/or downloading it;
    • the researcher pursuing philological and linguistic research on single texts or on sub-corpora of texts that can be defined dynamically on the basis of numerous selection parameters;
    • the researcher interested in carrying out advanced philological and linguistic research on particular sub-corpora of texts marked up according to various criteria (grammar, metrics, concepts, etc.);
  • to manage texts that include graphics and sound;
  • to bring the whole textbase into conformity with SGML-TEI standards;
  • to endow the library with a true catalogue, that is, a rich, professional body of bibliographical information on the texts, for two reasons:
    • to furnish multiple access points for the selection of sub-corpora to be queried according to the most varied research demands;
    • to allow the catalogue to interface with the online catalogues (OPACs) of the large libraries, with the aim of integrating the data and making the Biblioteca Italiana Telematica available for consultation from such OPACs;
  • to secure the textbase by restricting the download of texts through suitable login procedures and, possibly, download charges.

Competencies

The project includes four main kinds of specialised expertise:
  • content-related (literature, language, history, philosophy, art, music, etc.), spread widely among the participants in the research project;
  • linguistic-computational (the team directed by Eugenio Picchi at the ILC-CNR);
  • bibliographical and library science (A. Petrucciani, University of Pisa; A. Scolari, University of Genoa);
  • digital graphics (Studio X-Lab, Pisa).

System choices and development procedures

Particular attention has been paid to the quality of the services offered and to their optimisation. Besides offering the possibility of reading and downloading texts, we have also aimed to provide online all of the text-analysis tools offered by stand-alone applications for linguistics, philology and computational lexicography. We hold such tools to be crucial to the very concept of an online library. The procedures for storage, analysis and querying of the chosen text corpus are those of the DBT system (Data Base Testuale) developed at the Pisa ILC by Eugenio Picchi. The main effort in integrating the text-analysis tools with the online library has been directed at providing, over a network connection, the same functionality and uses as the local programs.

Reading and querying texts are by far the most frequently requested services. The utmost attention has therefore been devoted to response times, to the ability to serve the greatest possible number of users simultaneously, and to offering the best possible guarantee of recovery from most problems that may arise. Such considerations underlie the decision to develop the consultation and query system in the Java language, which satisfies a wide range of requirements: it is a stable tool on both the server and the client side; it makes the service accessible from all Internet-capable hardware and software platforms through a suitable browser; and it provides state-sensitive sessions, that is, sessions able to maintain information on the succession of requests made by each user. A set of specialised applets therefore gives access to the library's consultation functions: searching the catalogue of available e-texts in order to select the desired text and/or sub-corpora using a whole series of pre-set bibliographical selection keys; reading single texts; and querying the selected texts through the typical querying and text-analysis procedures of DBT.
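The notion of a state-sensitive session can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the Java used by the project, and the class name, its methods, the text identifiers and the two sample lines are all hypothetical; it only shows how a selection of texts and a query history might persist between one user's requests, and makes no claim about how the DBT applets actually do it.

```python
class Session:
    """Per-user state kept across successive requests (hypothetical structure)."""

    def __init__(self, user):
        self.user = user
        self.selected_texts = set()   # identifiers of the current sub-corpus
        self.history = []             # words searched so far, in order

    def select(self, *text_ids):
        # Extend the sub-corpus that later queries will run against.
        self.selected_texts.update(text_ids)

    def query(self, corpus, word):
        # Count occurrences of `word` in the currently selected texts only.
        self.history.append(word)
        needle = word.lower()
        return {
            text_id: corpus[text_id].lower().split().count(needle)
            for text_id in sorted(self.selected_texts)
        }


# Two opening lines used as a toy corpus (identifiers are made up).
CORPUS = {
    "dante_inferno": "Nel mezzo del cammin di nostra vita",
    "petrarca_rvf": "Voi ch'ascoltate in rime sparse il suono",
}

session = Session(user="guest")
session.select("dante_inferno")
first = session.query(CORPUS, "vita")    # runs on one text
session.select("petrarca_rvf")
second = session.query(CORPUS, "rime")   # runs on both texts selected so far
```

The second query runs against both texts because the selection made before the first query is still held in the session.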
The applets have been developed with the aim of maintaining the greatest possible compatibility with the DBT system, to which many users are accustomed.

For the most-used functions described above, our efforts were largely directed at providing optimal service; for other, more specialist functions, it was decided instead to favour simplicity of use, which is furnished more readily by a development and access technology other than Java. The techniques adopted here are ISAPI/CGI scripts, which establish lighter client/server connections and create HTML pages on the fly in response to different user requests and interactions. Such scripting has been applied to the querying of lemmatised texts, that is, texts for which morpho-syntactic analysis and classification have been performed beforehand. Queries can thereby also be run on the para-textual data, which hold a wide range of information on each word, such as its lemma, the lemma's grammatical classification and the form's morpho-syntactic classification. Such a wealth of information stored with a text allows for more precise and far more productive searches.

Particular tools for querying texts with metric markers have been developed using MS-Access. Pending the Java or ISAPI applications needed to make such tools directly available online, they have for the moment been made available through Windows NT Terminal Edition, which, through a special plug-in, allows browser access to the server-side OS over an Internet connection.

Finally, by integrating the contributions of the different specialists outlined in the preceding section, we have set up the Biblioteca Italiana Telematica Web site (http://cibit.unipi.it), in which access to the different functions is organised around four fundamental points of entry: Reading, Catalogue, Collections and Advanced Searches.
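The querying of lemmatised texts described in this section can be sketched as follows. The token structure, the tag values and the `search` function are assumptions made for illustration only; they are not the project's annotation scheme or the DBT query syntax. The point is simply that, once each form carries its lemma and classifications, a search can target any of these layers.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str    # word as it appears in the text
    lemma: str   # dictionary headword
    pos: str     # grammatical class of the lemma (illustrative tag set)
    morph: str   # morpho-syntactic description of the form (illustrative)


# Hand-annotated sample; the tags are illustrative, not the project's scheme.
TOKENS = [
    Token("mi", "mi", "PRON", "1.sg"),
    Token("ritrovai", "ritrovare", "VERB", "ind.past.1.sg"),
    Token("per", "per", "PREP", ""),
    Token("una", "uno", "DET", "f.sg"),
    Token("selva", "selva", "NOUN", "f.sg"),
    Token("oscura", "oscuro", "ADJ", "f.sg"),
]


def search(tokens, *, lemma=None, pos=None, morph=None):
    """Return (position, token) pairs matching every criterion that is given."""
    hits = []
    for i, tok in enumerate(tokens):
        if lemma is not None and tok.lemma != lemma:
            continue
        if pos is not None and tok.pos != pos:
            continue
        if morph is not None and tok.morph != morph:
            continue
        hits.append((i, tok))
    return hits


by_lemma = search(TOKENS, lemma="oscuro")            # finds the form "oscura"
fem_nouns = search(TOKENS, pos="NOUN", morph="f.sg")  # combines two layers
```

A search by lemma retrieves inflected forms that a plain string search for the headword would miss, which is the gain in precision the text refers to.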

Results

The textbase currently online (Dec. 1999) contains about 900 texts. The Biblioteca Italiana Telematica currently offers the following services through the Internet:
  • consulting the general catalogue through different cross-indexed access points (Author, Prose/Poetry, Literary genre, Language, Chronology); these stem from a relational database of records containing bibliographical information concerning the text itself, its reference edition and its electronic edition;
  • consulting the graphics catalogue;
  • accessing the texts and the related text attributes and iconographic and sound add-ons (through the DBT Java applet described above);
  • querying individual texts with advanced computational-linguistics search functions (through the DBT Java applet);
  • querying the entire corpus or selected subsets (through the DBT Java applet);
  • querying a sub-corpus (Dante Alighieri's vernacular and Latin works) with grammatical markers (through the lemmatised ISAPI/CGI DBT application described above);
  • querying a sub-corpus (medieval-Renaissance lyric poetry) with metric markers (through the Windows NT Terminal Edition described above);
  • downloading texts in varied formats, including SGML/TEI, upon completion of suitable identification and login procedures.
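The selection of a sub-corpus through cross-indexed catalogue access points can be illustrated with a small sketch. The table layout, the column names, the `select_subcorpus` function and the placeholder citations are all assumptions; the sketch only mirrors the three kinds of information the catalogue is said to hold (the text, its reference edition, its electronic edition) and does not reproduce the project's actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work (
    id INTEGER PRIMARY KEY, author TEXT, title TEXT,
    form TEXT, language TEXT, century INTEGER
);
CREATE TABLE reference_edition (work_id INTEGER REFERENCES work(id), citation TEXT);
CREATE TABLE electronic_edition (work_id INTEGER REFERENCES work(id), encoding TEXT);
""")
conn.executemany("INSERT INTO work VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Dante Alighieri", "Commedia", "poetry", "Italian", 14),
    (2, "Dante Alighieri", "De vulgari eloquentia", "prose", "Latin", 14),
    (3, "Giovanni Boccaccio", "Decameron", "prose", "Italian", 14),
    (4, "Niccolò Machiavelli", "Il Principe", "prose", "Italian", 16),
])
# Placeholder values: no real edition data is asserted here.
conn.executemany("INSERT INTO reference_edition VALUES (?, ?)",
                 [(i, f"placeholder citation {i}") for i in range(1, 5)])
conn.executemany("INSERT INTO electronic_edition VALUES (?, ?)",
                 [(i, "SGML/TEI") for i in range(1, 5)])

# Access points a user may combine (hypothetical subset of the real ones).
ACCESS_POINTS = {"author", "form", "language", "century"}


def select_subcorpus(**criteria):
    """Return titles of works matching every given access point."""
    unknown = set(criteria) - ACCESS_POINTS
    if unknown:
        raise ValueError(f"unknown access point(s): {sorted(unknown)}")
    # Column names come from the fixed whitelist; values are bound parameters.
    where = " AND ".join(f"w.{col} = ?" for col in criteria) or "1"
    sql = (
        "SELECT w.title FROM work w "
        "JOIN electronic_edition e ON e.work_id = w.id "
        f"WHERE {where} ORDER BY w.id"
    )
    return [row[0] for row in conn.execute(sql, tuple(criteria.values()))]


latin_dante = select_subcorpus(author="Dante Alighieri", language="Latin")
trecento_prose = select_subcorpus(form="prose", century=14)
```

Each keyword argument plays the role of one access point, and combining them narrows the sub-corpus that would then be passed on for querying.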

Further lines of development

  • Development of tools for online query of texts with metric markers (corpus of medieval-Renaissance lyric poetry), as well as musical texts (corpus of 17th century Italian poetry for music).
  • Enhancing the search system to allow consultation of bibliographical databanks according to structured DBT procedures.
  • Implementing a prototype able to natively process SGML/XML files, particularly in TEI format (Text Encoding Initiative).
  • Testing procedures for interaction between, on the one hand, library catalogues and other bibliographical databanks based on the UNIMARC international standard and, on the other, digital libraries using forthcoming standards for network delivery (meta-data, particularly Dublin Core). Inter-communication and the greatest possible integration between the two environments are in fact essential to avoid a rift between the two major sources of publicly available information.
  • Implementation of a terminological repository containing thesaurus relationships and the related procedures for conversion to and from the UNIMARC and Dublin Core standards.
  • Testing procedures for importing and exporting standard UNIMARC format data to and from the Biblioteca Italiana Telematica catalogue, with the aim of storing the electronic resources of the Biblioteca Italiana Telematica within various library OPACs, starting with those of the universities co-operating in the project.
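The kind of conversion between UNIMARC and Dublin Core envisaged above can be sketched with a deliberately simplified crosswalk. The dictionary-based record layout, the `unimarc_to_dc` function and the choice of fields are assumptions for illustration; a real conversion would need to cover many more fields, indicators and repeatable subfields than the five mappings shown here.

```python
# Simplified crosswalk: (UNIMARC field, subfields to join, Dublin Core element).
CROSSWALK = [
    ("200", ("a",), "title"),        # title proper
    ("700", ("a", "b"), "creator"),  # personal name: entry element, rest of name
    ("210", ("c",), "publisher"),    # name of publisher
    ("210", ("d",), "date"),         # date of publication
    ("101", ("a",), "language"),     # language of the text
]


def unimarc_to_dc(record):
    """Map a record given as {tag: [{subfield: value}, ...]} to Dublin Core."""
    dc = {}
    for tag, codes, element in CROSSWALK:
        for field in record.get(tag, []):
            parts = [field[c] for c in codes if field.get(c)]
            if parts:
                dc.setdefault(element, []).append(", ".join(parts))
    return dc


# Minimal hypothetical record; field 210 is absent on purpose.
record = {
    "101": [{"a": "ita"}],
    "200": [{"a": "Decameron"}],
    "700": [{"a": "Boccaccio", "b": "Giovanni"}],
}
dublin_core = unimarc_to_dc(record)
```

Fields missing from the source record, such as 210 in this example, simply produce no Dublin Core element, so the mapping degrades gracefully on incomplete records.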