Digital Humanities Abstracts

“Mots15 - An interactive concordance system (built from mostly off-the-shelf parts)”
Paul Meurer, University of Bergen, paul.meurer@hit.uib.no
Michael Sperberg-McQueen, World Wide Web Consortium, cmsmcq@acm.org

Mots 15 is an interactive Web-based concordance or full-text retrieval system built mostly out of off-the-shelf software. The goals of the Mots 15 project are:
  • to build a reasonably capable full-text retrieval system, with functionality generally similar to Tact, ARRAS, and the like, but with better markup awareness
  • to keep minimum investment low for both implementors and users
  • to allow experimentation with interesting parts of the query system
From these design goals follow several design principles:
  • simplicity of implementation
  • use of off-the-shelf components wherever possible
  • modularity, loose coupling among modules using predefined interfaces wherever possible

1. Basic interfaces in a query system

1.1. Monoliths

At a very simple level, an interactive query system accepts queries from a user and returns responses drawn from the data. In systems like ARRAS and Tact, a single monolithic software package controls everything in the diagram.
Figure 1. A monolithic query system

1.2. Web interface

With the advent of graphical browsers for the World Wide Web, however, it is possible to provide a fairly attractive interface at a much lower cost than would otherwise be possible. It may still make sense to devise special-purpose user interface software for specific purposes, but we can go a long way without it, just relying on the users to have chosen a Web browser they like reasonably well. The Web, that is, exposes an interface between the user interface and the data in the back end.
Figure 2. A Web-based query system
This interface sets certain limits to our freedom: we must now use HTML to describe what the user sees (or arbitrary XML, if we are willing to require that the user have an XML-capable browser), and the user's interactions with the server are limited to what can be done using HTML forms (and possibly a browser scripting language like JavaScript). Within those limits, however, we can develop better user interfaces at a lower cost than if we were building from scratch. Even more important, we can now swap front and back ends in and out. We can experiment with different user interfaces by writing different front-end forms and HTML style sheets. In theory, we can also experiment with different back ends by substituting one for another behind the same front end. In practice, the existing systems built on this model do not make this easy, because the interface between the front end and the back end varies with the specific product used as the back end, and different commercial products rarely support identical interfaces.

1.3. Mots 15

The Mots 15 system differs from the generic Web-based system primarily by exposing a generic query interface in front of the back-end-specific query interface, in order to buffer the front end and back end from each other.
Figure 3. Basic plan of the Mots 15 query system
Ideally, this generic query interface should follow some open specification. It should also provide all the functionality we want (to keep life simple) and no more (so that it is easy to build back ends if we want to do it ourselves); the exact choice depends on the tradeoff between these incompatible goals. Given a suitable query language and a way to translate from it into the query language of the back end, any XML query engine may be used as the back end. The task of translating from the open query language to the proprietary back-end query language is, of course, simplified if the back end accepts the open query language itself. A SQL DBMS may be the most flexible back end. The design made by MSM (Michael Sperberg-McQueen) for this involves a few lightweight scripts which run on top of the SQL database system. In this design, the SQL system itself would produce not elements but element pointers, which would be used to extract the elements from a saved copy of the XML file. Another alternative would be to use the corpus management system Corpus WorkBench (IMS, Stuttgart) as the Mots 15 back end, possibly combined with an XML query facility; this would allow for more advanced linguistic queries.
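The element-pointer idea can be illustrated with a minimal Python sketch. It is not the Mots 15 implementation: the table layout, the child-index pointer format, the sample document, and the function names are all assumptions made for illustration, and SQLite stands in for whatever SQL DBMS is actually used. The database answers the query with pointers only, and a small script resolves each pointer against the saved copy of the XML.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in for the saved copy of the XML file (hypothetical sample data).
XML_SOURCE = """<text>
  <div type="chapter"><head>One</head><p>The cat sat.</p><p>The dog ran.</p></div>
  <div type="chapter"><head>Two</head><p>A cat slept.</p></div>
</text>"""


def index_elements(root, conn):
    """Store one row per element: document order, pointer, tag, and text content."""
    conn.execute(
        "CREATE TABLE elements (seq INTEGER, ptr TEXT, tag TEXT, content TEXT)"
    )
    rows = []

    def walk(elem, path):
        # The pointer is the chain of child indexes from the root, e.g. "0/2".
        ptr = "/".join(str(i) for i in path)
        rows.append((len(rows), ptr, elem.tag, "".join(elem.itertext())))
        for i, child in enumerate(elem):
            walk(child, path + [i])

    walk(root, [])
    conn.executemany("INSERT INTO elements VALUES (?, ?, ?, ?)", rows)


def resolve(root, ptr):
    """Follow an element pointer back into the parsed XML document."""
    elem = root
    if ptr:  # the empty pointer denotes the root element itself
        for step in ptr.split("/"):
            elem = elem[int(step)]
    return elem


def search(conn, root, tag, word):
    """Ask SQL for pointers only, then extract the elements from the XML copy."""
    # LIKE is a plain substring match here, so "cat" would also match "category".
    rows = conn.execute(
        "SELECT ptr FROM elements WHERE tag = ? AND content LIKE ? ORDER BY seq",
        (tag, f"%{word}%"),
    )
    return [resolve(root, ptr) for (ptr,) in rows]


root = ET.fromstring(XML_SOURCE)
conn = sqlite3.connect(":memory:")
index_elements(root, conn)
for hit in search(conn, root, "p", "cat"):
    print(ET.tostring(hit, encoding="unicode").strip())
```

Keeping only pointers in the database means the XML text is stored once, in the file, and the SQL side stays small; the cost is the extra resolution step for every hit.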

2. Pieces of Mots 15

Mots 15 is designed to make it relatively simple to specify and implement each piece of the system. The better we succeed in this goal, the easier it will be for us to experiment with different parts of the system, and the easier it will be for eventual users to customize it for their own purposes. Eventually, the designers hope that Mots 15 will grow into a library of reusable and customizable pieces, which individuals and small projects can modify to make useful special-purpose systems. The Mots 15 design requires the following pieces of software:
  • browser: an off-the-shelf Web browser; this handles the actual display of results on the user's screen and interaction with the user
  • forms: one or more HTML forms which allow the user to specify searches; these produce an HTML-forms data stream which the parser hands to an appropriate CGI script
  • form-to-query translator: a program to translate the forms data into a query, expressed in the open query language
  • query-to-query translator: a program to translate the query from the open query language into the query language supported by the back end
  • back end: a program which accepts queries in some (possibly proprietary) query language and returns results, e.g. a set of XML elements
  • wrapper: a program which takes the results and places them in a two-level wrapper: (a) an outermost mots:result XML element and (b) an element, depending on the hit type, wrapped around each hit, each with attributes providing useful information about the query and its results
  • XML-to-HTML translator: a program which takes the wrapped results and translates them into HTML suitable for display in the user's off-the-shelf browser
  • transaction manager: a CGI script to manage the query/response transaction, by calling (or incorporating) the various other programs in this list; it may also be responsible for session management
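How the pieces listed above fit together can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Mots 15 code: the namespace URI, the mots:hit element name, the attribute names, the dictionary used as an "open query", and the sample corpus are all hypothetical, and a plain Python function stands in for the XML-to-HTML translator.

```python
import xml.etree.ElementTree as ET
from html import escape
from urllib.parse import parse_qs

MOTS_NS = "http://example.org/mots"  # hypothetical namespace URI
ET.register_namespace("mots", MOTS_NS)

# Hypothetical sample corpus held in memory by the back end.
CORPUS = ET.fromstring(
    "<text><div n='1'><p>Call me Ishmael.</p></div>"
    "<div n='2'><p>It was a dark night.</p><p>Call the watch.</p></div></text>"
)


def form_to_query(form_data):
    """Form-to-query translator: URL-encoded form data -> open query (a dict here)."""
    fields = parse_qs(form_data)
    element, word = fields["element"][0], fields["word"][0]
    if not element.isalnum():
        raise ValueError("element name must be alphanumeric")
    return {"element": element, "word": word}


def query_to_backend(query):
    """Query-to-query translator: open query -> back-end query (path plus word)."""
    return f".//{query['element']}", query["word"]


def back_end(path, word):
    """Back end: return the elements on the path whose text contains the word."""
    return [
        elem for elem in CORPUS.findall(path)
        if word.lower() in "".join(elem.itertext()).lower()
    ]


def wrap(hits, query):
    """Wrapper: an outer mots:result element with one inner wrapper per hit."""
    result = ET.Element(
        f"{{{MOTS_NS}}}result",
        {"query": query["word"], "count": str(len(hits))},
    )
    for n, hit in enumerate(hits, start=1):
        ET.SubElement(result, f"{{{MOTS_NS}}}hit", {"n": str(n)}).append(hit)
    return result


def to_html(result):
    """XML-to-HTML translator: render the wrapped results as an HTML fragment."""
    items = "".join(f"<li>{escape(''.join(hit.itertext()))}</li>" for hit in result)
    summary = f"{escape(result.get('count'))} hit(s) for {escape(result.get('query'))}"
    return f"<p>{summary}</p><ol>{items}</ol>"


def handle_request(form_data):
    """Transaction manager: call each piece in turn for one query/response cycle."""
    query = form_to_query(form_data)
    path, word = query_to_backend(query)
    return to_html(wrap(back_end(path, word), query))


print(handle_request("element=p&word=call"))
```

Because each stage only sees the output of the previous one, any single function can be replaced, for example by a different back end or a different HTML rendering, without touching the others.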

3. Open problems and opportunities

The existing implementation of Mots 15 (as of November 2001) is a minimal system with:
  • a choice of several simple Web interfaces
  • support for straightforward XML documents only
  • XSLT stylesheets for XML-HTML translation
  • a limited (XPath-based) query language, extended with word frequency queries
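As a rough illustration of what a word frequency query on top of a path-based query language might look like, the Python sketch below selects elements with a path expression and counts the words they contain. It uses the limited XPath subset of Python's ElementTree; the function name, the tokenization rule, and the sample document are assumptions, not a description of the Mots 15 query syntax.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document.
DOC = ET.fromstring(
    "<text><div type='verse'><l>the sea, the sea</l></div>"
    "<div type='prose'><p>The sea was calm.</p></div></text>"
)


def word_frequencies(root, path):
    """Count lower-cased word forms in the text of every element the path selects."""
    counts = Counter()
    for elem in root.findall(path):
        # If the path selects nested elements, their shared text is counted twice.
        counts.update(re.findall(r"\w+", "".join(elem.itertext()).lower()))
    return counts


print(word_frequencies(DOC, ".//div[@type='verse']").most_common())
print(word_frequencies(DOC, ".//div").most_common())
```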
There are several obvious challenges for the future development of Mots 15:
  • serious Web interface (room for experiment)
  • ‘XML++’ support
  • display of parallel versions, textual variation
  • external and user-supplied annotation
  • proximity searching
  • exploiting grammatical annotation of text
  • supporting documents with overlap (i.e. TexMECS)
  • allowing users to search as if the text were marked up more simply than it is (e.g. with a uniform chapter/section/paragraph/sentence hierarchy)
  • supporting more powerful back ends (either by means of wormholes in the open query language, or by means of a second interface)
  • managing selection of texts from a corpus or collection; federated searches

References

John Price-Wilkin. “Using the World-Wide Web to Deliver Complex Electronic Documents: Implications for Libraries.” Public-Access Computer Systems Review. 1994. 5: 5-21.
John Price-Wilkin. “A Gateway between the World Wide Web and PAT: Exploring SGML Through the Web.” Public-Access Computer Systems Review. 1994. 5: 527.
John Price-Wilkin. “The Feasibility of Wide-area Textual Analysis Systems in Libraries: A Practical Analysis.” Presented at Literary Texts in an Electronic Age: Scholarly Implications and Library Services, the 31st Annual Clinic on Library Applications of Data Processing (University of Illinois at Urbana-Champaign). April 10-12, 1994.
John Price-Wilkin. “Just-in-time Conversion, Just-in-case Collections: Effectively leveraging rich document formats for the WWW.” D-Lib Magazine. 1997.