Mots 15 - An interactive concordance system (built from mostly off-the-shelf parts)
Paul Meurer
University of Bergen
paul.meurer@hit.uib.no
Michael Sperberg-McQueen
World Wide Web Consortium
cmsmcq@acm.org
Mots 15 is an interactive Web-based concordance or full-text retrieval system
built mostly out of off-the-shelf software.
The goals of the Mots 15 project are:
- to build a reasonably capable full-text retrieval system, with functionality generally similar to Tact, ARRAS, and the like, but with better markup awareness
- to keep minimum investment low for both implementors and users
- to allow experimentation with interesting parts of the query system
The design is guided by the following principles:
- simplicity of implementation
- use of off-the-shelf components wherever possible
- modularity, loose coupling among modules using predefined interfaces wherever possible
1. Basic interfaces in a query system
1.1. Monoliths
At the simplest level, an interactive query system accepts queries from a user and returns responses drawn from the data. In systems like ARRAS and Tact, a single monolithic software package controls everything in the diagram (Figure 1).
Figure 1: A monolithic query system
1.2. Web interface
With the advent of graphical browsers for the World Wide Web, however, it is possible to provide a fairly attractive interface at a much lower cost than would otherwise be possible. It may still make sense to devise special-purpose user interface software for specific purposes, but we can go a long way without it by relying on users to have chosen a Web browser they like reasonably well. The Web, that is, exposes an interface between the user interface and the data in the back end (Figure 2).
Figure 2: A Web-based query system
1.3. Mots 15
The Mots 15 system differs from the generic Web-based system primarily by exposing a generic query interface in front of the back-end-specific query interface, which buffers the front end and the back end from each other (Figure 3).
Figure 3: Basic plan of the Mots 15 query system
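The buffering role of the generic query interface can be illustrated with a small Python sketch. Everything named here is an assumption made for illustration: the `OpenQuery` shape, the `translate` function, and the two target dialects (an XPath expression and an invented "toy" syntax) are not taken from the Mots 15 implementation. The point is only that the front end produces one back-end-neutral query, and a separate translator per back end turns it into whatever that back end understands.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class OpenQuery:
    """Hypothetical back-end-neutral query: find `word` inside `element`."""
    word: str
    element: str


def to_xpath(q: OpenQuery) -> str:
    # One possible back-end dialect: an XPath 1.0 expression.
    # Words containing an apostrophe are not handled in this sketch.
    return f"//{q.element}[contains(., '{q.word}')]"


def to_toy_dialect(q: OpenQuery) -> str:
    # An invented syntax standing in for some other back end.
    return f'FIND "{q.word}" WITHIN {q.element}'


# Registry of back-end-specific translators (names are illustrative).
TRANSLATORS: Dict[str, Callable[[OpenQuery], str]] = {
    "xpath": to_xpath,
    "toy": to_toy_dialect,
}


def translate(q: OpenQuery, back_end: str) -> str:
    """Translate the open query into the language of the chosen back end."""
    return TRANSLATORS[back_end](q)


if __name__ == "__main__":
    query = OpenQuery(word="whale", element="p")
    print(translate(query, "xpath"))
    print(translate(query, "toy"))
```

Under this arrangement, swapping the back end means adding one translator to the registry; the forms and the code that builds the open query stay unchanged.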
2. Pieces of Mots 15
Mots 15 is designed to make it relatively simple to specify and implement each piece of the system. The better we succeed in this goal, the easier it will be for us to experiment with different parts of the system, and the easier it will be for eventual users to customize it for their own purposes. Eventually, the designers hope that Mots 15 will grow into a library of reusable and customizable pieces, which individuals and small projects can modify to make useful special-purpose systems.
The Mots 15 design requires the following pieces of software:
- browser: an off-the-shelf Web browser; this handles the actual display of results on the user's screen and interaction with the user
- forms: one or more HTML forms which allow the user to specify searches; these produce an HTML-forms data stream which the parser hands to an appropriate CGI script
- form-to-query translator: a program to translate the forms data into a query, expressed in the open query language
- query-to-query translator: a program to translate the query from the open query language into the query language supported by the back end
- back end: a program which accepts queries in some (possibly proprietary) query language and returns results, e.g. a set of XML elements
- wrapper: a program which takes the results and places them in a two-level wrapper: (a) an outermost mots:result XML element and (b) an element, depending on the hit type, wrapped around each hit; each level carries attributes providing useful information about the query and its results
- XML-to-HTML translator: a program which takes the wrapped results and translates them into HTML suitable for display in the user's off-the-shelf browser
- transaction manager: a CGI script to manage the query/response transaction, by calling (or incorporating) the various other programs in this list; it may also be responsible for session management
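The way these pieces chain together can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Mots 15 code: the namespace URI, the `mots:hit` element name, the attribute names, the dictionary used as the "open query", and the sample text are all invented here, and the XML-to-HTML step is written in plain Python as a stand-in for a stylesheet.

```python
import html
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

MOTS_NS = "http://example.org/ns/mots"  # placeholder namespace (assumption)
ET.register_namespace("mots", MOTS_NS)

# Tiny sample document standing in for the text collection.
TEXT = ET.fromstring(
    "<text>"
    "<p n='1'>Call me Ishmael.</p>"
    "<p n='2'>Some years ago, never mind how long.</p>"
    "<p n='3'>Ishmael went to sea.</p>"
    "</text>"
)


def form_to_query(form_data: str) -> dict:
    """Form-to-query translator: forms data stream -> open query (a dict here)."""
    fields = parse_qs(form_data)
    return {"word": fields["word"][0], "element": fields.get("element", ["p"])[0]}


def query_to_backend(query: dict) -> tuple:
    """Query-to-query translator: open query -> (ElementTree path, search word)."""
    return f".//{query['element']}", query["word"]


def back_end(path: str, word: str) -> list:
    """Back end: return the XML elements on `path` whose text contains `word`."""
    return [el for el in TEXT.findall(path) if word in "".join(el.itertext())]


def wrap(hits: list, query: dict) -> ET.Element:
    """Wrapper: an outer mots:result element, plus one element around each hit."""
    result = ET.Element(
        f"{{{MOTS_NS}}}result", {"word": query["word"], "hits": str(len(hits))}
    )
    for i, hit in enumerate(hits, start=1):
        holder = ET.SubElement(result, f"{{{MOTS_NS}}}hit", {"n": str(i)})
        holder.append(hit)
    return result


def xml_to_html(result: ET.Element) -> str:
    """XML-to-HTML translator: render each wrapped hit as a list item."""
    items = "".join(
        f"<li>{html.escape(''.join(hit.itertext()))}</li>"
        for hit in result.findall(f"{{{MOTS_NS}}}hit")
    )
    return f"<ol>{items}</ol>"


def transaction(form_data: str) -> str:
    """Transaction manager: call the other pieces in order."""
    query = form_to_query(form_data)
    path, word = query_to_backend(query)
    return xml_to_html(wrap(back_end(path, word), query))


if __name__ == "__main__":
    print(transaction("word=Ishmael&element=p"))
```

Because each stage only sees the output of the previous one, any single function can be replaced (a different form, a different back end, a different rendering) without touching the others, which is the loose coupling the design aims for.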
3. Open problems and opportunities
The existing implementation of Mots 15 (as of November 2001) is a minimal system with:
- a choice of several simple Web interfaces
- support for straightforward XML documents only
- XSLT stylesheets for XML-HTML translation
- a limited (XPath-based) query language, extended with word-frequency queries
Open problems and opportunities for further work include:
- a serious Web interface (room for experiment)
- XML++ support
- display of parallel versions, textual variation
- external and user supplied annotation
- proximity searching
- exploiting grammatical annotation of text
- supporting documents with overlap (e.g. TexMECS)
- allowing users to search as if the text were marked up more simply than it is (e.g. with a uniform chapter/section/paragraph/sentence hierarchy)
- supporting more powerful back ends (either by means of wormholes in the open query language, or by means of a second interface)
- managing selection of texts from a corpus or collection; federated searches
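To give a sense of what a path-based query extended with word-frequency counting might look like, here is a minimal Python sketch. It is only an illustration: it uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module, and the sample document, the function name `word_frequencies`, and the tokenization rule are assumptions rather than features of the Mots 15 query language.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Small sample document (invented for this example).
DOC = ET.fromstring(
    "<text>"
    "<div type='chapter'><p>The sea, the sea.</p><p>A ship at sea.</p></div>"
    "<div type='appendix'><p>The sea again.</p></div>"
    "</text>"
)


def word_frequencies(root: ET.Element, path: str) -> Counter:
    """Count lower-cased word forms in the text of all elements matching `path`."""
    counts = Counter()
    for el in root.findall(path):
        text = "".join(el.itertext()).lower()
        counts.update(re.findall(r"\w+", text))
    return counts


if __name__ == "__main__":
    # Restrict the count to paragraphs inside chapter divisions.
    freq = word_frequencies(DOC, ".//div[@type='chapter']/p")
    for word, n in freq.most_common():
        print(word, n)
```

The path expression selects the markup context, and the frequency count is applied only to the text inside the selected elements, so the appendix paragraph in the sample does not contribute to the totals.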