“Query Visualization, Markup, and a Region-based
Document Model”
Thomas Horton
University of Virginia
horton@virginia.edu
Sunish Parikh
Florida Atlantic University
Robert Nash
Florida Atlantic University
Abhijit Pandya
Florida Atlantic University
abhi@cse.fau.edu
This paper will address our work in the area of software models and architectures
for supporting the development of software tools for text corpora and digital
libraries. Previously we have presented a region-based approach for defining and
manipulating things found inside text documents [6]. We believe that combining a
region-based approach with a markup-centered approach that supports XML and SGML
has many benefits. For example, our approach allows software tools to manipulate
text features that are not marked up and features that are non-hierarchical in
nature (including arbitrary user selections of parts of a document). Both
approaches can be integrated into one tool; for example, a prototype in Java has
been developed in earlier work [4] that can manipulate an XML document using a
DOM (Document Object Model) implementation while also supporting the
region-based operations that we propose.
In our paper we will specifically address one benefit of our approach: a general
method for the visualization of query results. Before presenting this method, we
must first provide the terminology used to define it.
Our region-based approach is largely inspired by sgrep, a free utility for
searching structured text documents. It takes a query from a user and returns
sections of a text file that match the query. Both its query language and the
results it finds are based on an algebra of *regions* and *region sets*:
- A *region* is a chunk of text, determined by a starting and ending byte position in the text file.
- A *region set* is a collection of regions that can be manipulated in powerful and well-defined ways. For example, two sets can be concatenated, or merged to produce all regions in the first set that are contained within some region in the second set.
Examples of region sets include:
- all words in the document;
- all occurrences of a given token;
- all DIV1 elements in an XML document;
- all XML elements that have an attribute with a given value.
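The region and region-set definitions above can be illustrated with a minimal Python sketch. It is not the sgrep implementation or the authors' Java prototype. The names `Region`, `words`, `occurrences`, `combine`, and `contained_in` are hypothetical, end offsets are assumed to be inclusive, and character offsets are treated as byte offsets (true only for single-byte encodings such as ASCII). `combine` is a plain set union standing in for the concatenation mentioned above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Region:
    """A chunk of text given by start and end offsets (end inclusive, by assumption)."""
    start: int
    end: int

    def contains(self, other: "Region") -> bool:
        """True if `other` lies entirely inside this region."""
        return self.start <= other.start and other.end <= self.end


def words(text):
    """Region set of all whitespace-delimited words (a simplifying assumption)."""
    return {Region(m.start(), m.end() - 1) for m in re.finditer(r"\S+", text)}


def occurrences(text, token):
    """Region set of all whole-word occurrences of `token`."""
    pattern = r"\b" + re.escape(token) + r"\b"
    return {Region(m.start(), m.end() - 1) for m in re.finditer(pattern, text)}


def combine(a, b):
    """All regions found in either set, in document order."""
    return sorted(set(a) | set(b))


def contained_in(a, b):
    """Regions of `a` that are contained within some region of `b`."""
    return sorted(r for r in a if any(outer.contains(r) for outer in b))
```

For the text `"to be or not to be"`, `occurrences(text, "be")` yields two regions, and `contained_in` restricted to a hypothetical phrase region `Region(9, 17)` (covering "not to be") keeps only the second. Nothing in these operations depends on whether a region came from a word, a markup element, or a user selection, which is the property the model relies on.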
[Figure: example TileBars output, http://www.acm.org/sigchi/chi95/Electronic/documnts/papers/mah fg3.gif]
In this output, each document is represented by a rectangle. Each rectangle has a "row" for each query term, and its "columns" represent the segments into which the document is divided. The shade of grey at each row-column intersection indicates how often the query term occurs in that segment.
This visualization technique displays the results of queries along several dimensions: documents, terms, and segments. Various attributes of the query results can easily be seen. First, the size of each document (in terms of segments) is shown by the size of the tile. Second, the occurrence of terms by segment is indicated when a cell is grey, revealing the relative location within a document where the terms occur. Third, the strength of occurrence of each term is indicated by the shade of grey. Finally, co-occurrence of multiple terms is indicated by columns in which both rows are shaded.
Our modifications to the original TileBar method are as follows. First, our query terms are not limited to word types. Any text object or combination of text objects can be used, including, of course, markup elements. Second, the context that defines the segments in each document can be completely controlled by the user, as long as a region set is selected. The context could be quickly and easily changed to allow a user to see results by, say, act, scene, or line in a play. Each of these improvements is directly supported by our region-based model for documents and their contents.
In conclusion, we believe a region-based approach can be combined with a markup-based view of documents to produce software architectures that can meet a variety of user needs. For example, our approach provides a common model for representing words and markup that can more flexibly support the needs of users of digital text resources. This common model shows its advantages in developing a useful and generalized visualization scheme based on the TileBars approach.
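As an illustration of the generalized TileBars scheme described above, the following Python sketch builds the grid for one document from region sets: one row per query term, one column per segment. It is a sketch under stated assumptions, not the authors' implementation. The names `Region`, `tile_rows`, `shade`, and `cooccurring` are hypothetical, a term is counted in a segment only when its region lies entirely inside it, and the character ramp stands in for shades of grey.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Start and end offsets of a chunk of text (end inclusive, by assumption)."""
    start: int
    end: int


def tile_rows(term_sets, segments):
    """One row per query term; each cell counts that term's regions
    lying entirely inside the corresponding segment."""
    ordered = sorted(segments)
    return [
        [sum(1 for r in terms if seg.start <= r.start and r.end <= seg.end)
         for seg in ordered]
        for terms in term_sets
    ]


def shade(rows, levels=" .:#"):
    """Render counts as characters; later characters mean more occurrences.
    Counts above the last level are clamped to the darkest character."""
    return ["".join(levels[min(c, len(levels) - 1)] for c in row) for row in rows]


def cooccurring(rows):
    """Indices of columns (segments) in which every query term occurs."""
    if not rows:
        return []
    return [j for j in range(len(rows[0])) if all(row[j] > 0 for row in rows)]
```

Because the segments are themselves a region set, switching the context (say, from acts to scenes) only means passing a different `segments` argument; the query-term region sets stay unchanged.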
References
[1] Rob Gaizauskas et al. GATE User Guide.
[2] Gaston H. Gonnet, Ricardo A. Baeza-Yates, and Tim Snider. “New Indices for Text: PAT Trees and PAT Arrays.” Information Retrieval: Data Structures and Algorithms. Ed. William B. Frakes and Ricardo Baeza-Yates. Prentice Hall PTR, 1992. 66-82.
[3] Ralph Grishman et al. TIPSTER Text Phase II Architecture Design Version 2.3 9. New York University, 1996.
[4] Olya Gurevich, Thomas B. Horton, Robert Bingler, and Worthy N. Martin. “A Workbook for Humanities Scholars.” Submitted for presentation at ALLC/ACH 2000, Glasgow, 21-25 July 2000.
[5] Marti A. Hearst. “TileBars: Visualization of Term Distribution Information in Full Text Information Access.” Proceedings of CHI'95, Conference on Human Factors and Computing Systems. Assoc. for Computing Machinery, 1995.
[6] Thomas B. Horton. “A Region-based Approach for Processing Digital Text Resources.” Conference Abstracts of the Digital Resources for the Humanities, September 12-15, 1999, King's College, London. 1999. 47-49.
[7] Jani Jaakkola and Pekka Kilpeläinen. SGREP (structured grep). University of Helsinki, Finland.