“A Workbook Application for Digital Text Analysis”

Worthy N. Martin University of Virginia, USA Olga Gurevich University of Virginia, USA Thomas B. Horton Florida Atlantic University, USA Robert Bingler University of Virginia, USA

Introduction

The workbook facility for scholars in the humanities aims to help them organize the results of their research in a convenient and easily accessible manner. The recent proliferation of marked-up electronic corpora has produced a need for tools that would allow structured search and extraction from texts. However, not all potentially interesting features of a text can be described in terms of the mark-up hierarchy: some features involve overlapping elements of markup, others are too fine-grained to be marked up. Thus we need a mark-up independent method of searching, and tools that combine both. The text region-based approach to processing digital text resources is the proposed method of searching and extracting both marked-up and non marked-up information from a collection of texts. The workbook facility is a prototype application of this method that allows extraction and linking of portions of XML-formatted texts. The selection of the regions can be based either on their structural characteristics or on other features.

Concepts

A text region (a.k.a. span) is a continuous portion of a document identified by its start and end offsets. It can be a complete XML element or just a string of characters. A text region can be created through a variety of operations described below. In our use of the concept, most XML elements are assigned a unique identifier within the document. The offsets for text regions are therefore relative to the nearest preceding ID within the document. A text occurrence object (TOO) consists of one or more text regions as well as notes and a user-defined name. The text regions can come from one or more documents and do not have to be contiguous. SGREP (structured grep) is a command-line search tool. It allows structured searches on XML and SGML formatted texts and collections of texts, as well as simple searches. The search results are returned as a set of text regions that can be organized into a text object occurrence. SGREP allows nested searches (for example an XML element labeled "verse" containing the word "Hamlet" within a DIV1 element) as well as unions and intersections of search expressions.

Workbook Facility

The workbook facility aims to help humanities scholars to organize the results of their research in a convenient and easily accessible manner. It provides a way to bookmark and annotate documents without changing the original texts, and to store and link annotated extractions from texts. The workbook consists of a set of TOO's, each of which contains one or more text regions that can originate from different documents. A TOO can thus link portions of texts from different places in a document or from different documents. The workbook facility has a built-in XML parser that creates a DOM structure. We are using IBM's Java-based parser, and the rest of the workbook is also written in Java. The following operations are available to the user:

Select a collection of XML documents with which he wants to work (the base collection).
For each document, view the raw text or the DOM (Document Object Model) structure resulting from parsing the XML document. The DOM structure is displayed in a tree control.
Create text occurrence objects in several different ways.
Select a complete XML element from the DOM, in which case the TOO will consist of a single text region.
Select any continuous portion of raw text from any document in the collection to create a TOO.
Run an SGREP search on one or more documents in the collection. If the search is successful, SGREP will return one or more text regions which will be put together into a TOO.
Name and annotate all TOO's in the same manner, regardless of how a particular TOO was created. The list of TOO's constitutes the workbook and is displayed separately.
By clicking on any text region within a TOO, view the spot within the document from which the text region originated. Thus, it is possible to view the larger context of a particular extraction.
Produce a word distribution list for any particular TOO, and compare word distributions between several TOO's. The word distribution list can be sorted by relative frequency of words or by alphabetical order, and two lists can be viewed side by side.
Order and re-order the TOO's within the workbook.
Organize the TOO's into folders with arbitrary depth of nesting, similar to how files are organized on a disk.
Save the workbook (i.e. the individual TOO's along with folders or TOO's) and re-open it later. The ability to return to the original documents remains after the workbook has been re-opened.
Save particular TOO's and sets of TOO's as separate XML documents and include them in the base collection.
Run SGREP searches on documents created from parts of the workbook in the same way as on the original ones.

Impact

The proposed workbook facility will be useful for several research goals. Extracting, ordering and naming textual fragments is a convenient way for an instructor to prepare for a lecture about a particular text. Scholars that study different versions of the same text (i.e. versions in different languages or different editions) can use the workbook to link parallel passages and annotate the resulting TOO. Since the creation of text regions can be markup-independent, this can be done even if the parallel passages in two documents are not contained within a single XML element. Moreover, extracting regions that share particular features can be automated with the help of SGREP. The word distribution feature of the program is intended to demonstrate that operations found in software like TACT and similar tools can be easily integrated with our workbook approach. Once a workbook is created, it still contains links to the original documents and the history of how the extractions were made. That is, the process is completely retraceable, and the user can view the context from which any text region came.

References

unknown. DOM (Document Object Model) standard. : ,

Thomas B. Horton. “A region-based approach for Processing Digital Text Resources.” Digital Resources for the Humanities, King's College, London, Sept. 12-15, 1999. : , 1999. 47-49.

Jani Jaakkola Pekka Kilpeläinen. SGREP (structured grep). at the University of Helsinki, Finland: ,

unknown. XML standard. : ,

unknown. XML4J, the XML parser for Java produced by IBM. : ,