A broadcast architecture for distributed text tools

Steven J. DeRose, Computer Center, University of Illinois at Chicago
C. M. Sperberg-McQueen, Computer Center, University of Illinois at Chicago, cmsmcq@acm.org

Adequate textual analysis software is difficult to create. Scholarly users have special requirements seldom met by commercial packages, e.g. lexical, syntactic, and statistical analysis; special layouts for interlinear texts; synchronized scrolling of multiple translations or editions; and flexible tools for searching and for organizing search results and making latent patterns visible. Disparity in document formats and levels of tagging and meta-information long made it difficult to share text software. And the cost of software development frequently exceeds the resources available for humanities computing infrastructure.

Thanks to SGML, XML, the TEI, and even HTML, we are now closer to having a uniform way to exchange information about documents and their structures. And thanks to other existing and emergent standards, it is now possible to specify a simple architecture that can help organize a modular system, into which a variety of analysis, display, and other tools can be plugged. This would allow independent development, maintenance, and use of far more tools than could ever be handled with a monolithic approach.

A simple scenario

Consider a user viewing a large collection of texts (perhaps all the literary works of a single author or period) using several tools:

a fully-formatted view;
a word list, from which searches may be issued;
a Key Word in Context (KWIC) view;
an interlinear view with grammatical, thematic, or other information displayed in association with text portions.

When the user selects a different hit in the KWIC display, the full-text view might scroll to the new location; when the user selects a different word from the word list, both the KWIC display and the full-text display might change accordingly. Each view has its own set of configuration options.

Our architecture is designed to exploit several insights:

1. Almost all scholarly analysis tools can be construed as "views" of an underlying corpus.
2. Little communication is required between the views. Each must have efficient access to the underlying data, but individual views only need communicate terse information (e.g. the new focal word-type) to others when they change state.
3. When one view changes state, other views can respond by changing their own state; the first view need not control the others. This allows the user to have some KWIC or text views that respond to new selections in the word list, and others that are unaffected. Thus, our architecture decentralizes inter-view control: any view can respond to others, but no view is controlled by another.
4. A view's response to events elsewhere may be simple (e.g. scroll) or arbitrarily complex (e.g. recalculate a statistical description of the text).
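
The four insights above can be illustrated with a minimal sketch in Python. It is not the system described here: the names Event, Bus, WordList, and KwicView, the "focus-word" event kind, and the `follow` option are all hypothetical, and the corpus is reduced to a plain list of tokens. The sketch only shows the shape of the idea: a view broadcasts a terse notice when its state changes, and every other view decides for itself whether and how to respond.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    """A terse state-change notice (all field names are hypothetical)."""
    source: str  # name of the view that changed state
    kind: str    # kind of change, e.g. "focus-word"
    value: str   # small payload, e.g. the new focal word-type


class Bus:
    """Delivers every event to every subscriber; it controls no view."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[Event], None]] = []

    def subscribe(self, listener: Callable[[Event], None]) -> None:
        self._listeners.append(listener)

    def broadcast(self, event: Event) -> None:
        for listener in list(self._listeners):
            listener(event)


class WordList:
    """A view that announces the user's selection and nothing more."""

    def __init__(self, bus: Bus, name: str = "wordlist") -> None:
        self.bus = bus
        self.name = name

    def select(self, word: str) -> None:
        self.bus.broadcast(Event(self.name, "focus-word", word))


class KwicView:
    """A view that may, by its own configuration, follow the focal word."""

    def __init__(self, bus: Bus, corpus: List[str], name: str,
                 follow: bool = True, width: int = 2) -> None:
        self.corpus = corpus  # each view reads the underlying data directly
        self.name = name
        self.follow = follow  # per-view option: respond to others or not
        self.width = width    # tokens of context on each side of a hit
        self.lines: List[str] = []
        bus.subscribe(self.on_event)

    def on_event(self, event: Event) -> None:
        # The decision to respond is made here, not by the sender.
        if not self.follow or event.kind != "focus-word":
            return
        self.lines = [
            " ".join(self.corpus[max(0, i - self.width): i + self.width + 1])
            for i, token in enumerate(self.corpus)
            if token == event.value
        ]


corpus = "the cat sat on the mat and the dog sat too".split()
bus = Bus()
words = WordList(bus)
following = KwicView(bus, corpus, "kwic-1")
pinned = KwicView(bus, corpus, "kwic-2", follow=False)

words.select("sat")
print(following.lines)  # ['the cat sat on the', 'the dog sat too']
print(pinned.lines)     # []
```

Here the word list knows nothing about the KWIC views. One KWIC view recomputes its lines, while the other, configured not to follow, stays unchanged. A more expensive response, such as recalculating statistics, would simply be a different `on_event` body.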

Overall architecture

The underlying data-access layer

The base of the system is an XML repository, which gives application modules (views) access to the data through the W3C Document Object Model (DOM). Application modules communicate directly with the XML data through the repository; they need not interact intimately with each other. This simplifies protocols and implementation considerably. Since the repository uses standard interfaces, multiple implementations can co-exist and compete on thei