The MATE Workbench - a Tool for XML Corpus Annotation
Amy Isard
University of Edinburgh, UK
David McKelvie
University of Edinburgh, UK
Andreas Mengel
Universitaet Stuttgart, Germany
Morten Baun Moeller
Odense University, Denmark
The MATE workbench is an annotation tool which allows transcription, annotation,
display and querying of speech and text corpora. It is not tied to any particular
annotation scheme, and allows users to define interfaces which suit their particular
needs.
The main difference between the MATE workbench and other annotation tools is its
flexibility. Any annotation scheme which can be expressed in XML [7] can be used
with the workbench (for a discussion of overlapping hierarchies see below), and the
display and annotation interfaces are defined using a language based on the
stylesheet language XSLT [8]. The workbench is written entirely in Java and is
therefore platform independent.
Annotation can be a very tedious task for humans, and many tools have been developed
to make it easier. Before beginning development of the workbench we reviewed some
existing annotation tools [3], and found that many of them share a fixed user
interface or are restricted to a particular coding scheme. One tool which has some
similarities to MATE is GATE [1], which can also be used with any annotation scheme.
GATE has a different internal architecture, however, based on the US Tipster
architecture rather than XML, and its main aim is to make it possible to integrate
different automatic annotation components within one system. MATE aims to provide a
framework where stylesheets can be used to provide user-defined annotation and
display interfaces. Because the stylesheet language is quite high level, it is
easier to write a stylesheet to provide a given interface than to write an entire
coding tool from scratch.
The MATE project has developed annotation schemes for five sets of linguistic
phenomena, and examples of markup using these schemes will be distributed with the
workbench, along with stylesheets for their annotation and display. Users of the
workbench are by no means limited to these schemes, however.
To display annotated data in the workbench, a user must have a MATE project file,
which specifies one or more XML annotation or transcription files and sound files if
appropriate, and a stylesheet which is to be applied to these files. Several
examples of these will be provided with the workbench. When the workbench is
started, a corpus folder window appears with a display of all the available project
files. After selecting a project file, the user clicks on the "run" button; the
specified files are processed, and one or more display or annotation windows
appear, depending on what was specified in the stylesheet. A different stylesheet
can be used with the same files to produce different behaviour.
The other major use of the workbench is in performing queries over a corpus. A query
language [4] was developed within the project which is tailored to our internal
representation of the data, including the treatment of multiple hierarchies as
defined below. To perform a query, the user first loads in one or more documents,
then opens a Query Window, which provides support for building complex queries. The
results of the queries are saved in XML format within the workbench, and are then
also displayed using a stylesheet, allowing a flexible representation of the
results.
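
The idea of saving query results as XML, so that they can in turn be displayed with a stylesheet, can be illustrated with a minimal sketch. This is not the MATE query language [4], which is not reproduced here. The element names, the id and type attributes, the href syntax and the function names are all assumptions made for illustration, and Python is used only for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation file: dialogue moves, each with an id and a type.
MOVES_XML = """
<moves>
  <move id="m1" type="instruct"/>
  <move id="m2" type="acknowledge"/>
  <move id="m3" type="instruct"/>
</moves>
""".strip()


def run_query(root, tag, **attrs):
    """Return every element named `tag` whose attributes equal `attrs`."""
    return [
        el for el in root.iter(tag)
        if all(el.get(name) == value for name, value in attrs.items())
    ]


def results_as_xml(matches, source):
    """Build a result document holding one link per matching element."""
    results = ET.Element("results")
    for el in matches:
        # The href syntax is invented for this sketch.
        ET.SubElement(results, "match", href=f"{source}#id({el.get('id')})")
    return results


moves = ET.fromstring(MOVES_XML)
hits = run_query(moves, "move", type="instruct")
result_doc = results_as_xml(hits, "moves.xml")

# The result is itself XML, so it could be passed on to a display step.
print(ET.tostring(result_doc, encoding="unicode"))
```

Because the result document only points back at the matched elements, the same display machinery that handles annotation files could in principle be applied to it.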
One question which often comes up when XML annotation schemes are being developed is
how to represent multiple overlapping hierarchies of markup. The TEI [5] describes
several possible ways to represent overlap in SGML. One of these, CONCUR, is not
available in XML, which was designed to be a less complicated and easier to use
version of SGML, and therefore left out some features. We have chosen to take a
different approach from any of these, but one which has proved successful in at
least one large-scale corpus annotation project [2]. Our solution is to keep each level of
annotation, and each data-stream (in the case of multi-speaker conversations for
instance) separate, and to link each level to a common base-level. This base level
would normally be the smallest unit on which all the other annotations depend. This
may often be the word level, but could also be phonemes in the case of speech,
higher-level units such as sentences or paragraphs, or indeed anything else as
appropriate. The MATE workbench will therefore deal appropriately with any data
which are marked up in this way using hyperlinks, as defined in the XLINK proposal
[6].
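
The linking of a separate annotation level to a common base level can be sketched as follows. This is a minimal illustration under stated assumptions, not the workbench's own file format or code: the element names, the id attributes and the href syntax (words.xml#id(w1)..id(w2)) are hypothetical, and Python is used only for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical base level: one element per word, each with a unique id.
WORDS_XML = """
<words>
  <word id="w1">go</word>
  <word id="w2">left</word>
  <word id="w3">okay</word>
</words>
""".strip()

# Hypothetical annotation level, kept separate from the words. Each move
# points at a span of base-level words instead of containing them.
MOVES_XML = """
<moves>
  <move id="m1" type="instruct" href="words.xml#id(w1)..id(w2)"/>
  <move id="m2" type="acknowledge" href="words.xml#id(w3)"/>
</moves>
""".strip()


def resolve(href, base_words):
    """Return the base-level elements covered by an href (inclusive span)."""
    ids = re.findall(r"id\(([^)]+)\)", href)
    order = [w.get("id") for w in base_words]
    start = order.index(ids[0])
    end = order.index(ids[-1])
    return base_words[start:end + 1]


words = list(ET.fromstring(WORDS_XML))
moves = ET.fromstring(MOVES_XML)

# Map each move id to the text of the words it links to.
spans = {
    move.get("id"): [w.text for w in resolve(move.get("href"), words)]
    for move in moves
}

for move in moves:
    print(move.get("type"), spans[move.get("id")])
```

Since the moves never contain the words, a second hierarchy over the same words can be added in another file without any conflict between the two sets of element boundaries.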
Another advantage of the generality of the MATE workbench is that it makes it easier
to combine views of annotation done using different schemes on the same corpus.
These annotations may be done at different sites without any contact, but if both
use hyperlinks to the same base level, then it is possible to create stylesheets
which display both the annotations at the same time. It is also possible to write a
stylesheet which will display one level and allow annotation of another level at the
same time.
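
A combined view of two independently produced annotation levels can be sketched in the same spirit. The example below is an illustration only and assumes hypothetical element names, id attributes and href syntax. It stands in for what a stylesheet would do in the workbench, and is written in Python for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical shared base level.
WORDS = ET.fromstring(
    '<words><word id="w1">go</word><word id="w2">left</word>'
    '<word id="w3">okay</word></words>'
)

# Two hypothetical annotation levels produced independently, for example at
# two sites. Both link to the same base-level word ids.
SITE_A = ET.fromstring(
    '<moves><move type="instruct" href="words.xml#id(w1)..id(w2)"/>'
    '<move type="acknowledge" href="words.xml#id(w3)"/></moves>'
)
SITE_B = ET.fromstring(
    '<phrases><phrase type="VP" href="words.xml#id(w1)..id(w2)"/></phrases>'
)


def covered_ids(href, order):
    """Return the base-level ids spanned by an href (inclusive)."""
    ids = re.findall(r"id\(([^)]+)\)", href)
    start, end = order.index(ids[0]), order.index(ids[-1])
    return order[start:end + 1]


def combined_view(words, *levels):
    """Attach the label from every annotation level to each base word."""
    order = [w.get("id") for w in words]
    rows = {w.get("id"): {"text": w.text} for w in words}
    for level in levels:
        for unit in level:
            for word_id in covered_ids(unit.get("href"), order):
                rows[word_id][unit.tag] = unit.get("type")
    return [rows[word_id] for word_id in order]


view = combined_view(list(WORDS), SITE_A, SITE_B)
for row in view:
    print(row)
```

Neither annotation level needs to know about the other. The shared word ids are the only point of contact, which is what makes it possible to merge them after the fact.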
The MATE workbench has just been completed, and testing and evaluation are about to
begin. We will be able to provide a section on this evaluation for the final paper.
We will ask testers to use the workbench for a variety of different annotation
tasks and to provide feedback on ease of use and power, and we will also evaluate whether
the stylesheet language allows testers to define new annotation and display
interfaces easily. We will also be submitting a proposal to ALLC/ACH 2000 for a demo
of the workbench.
References
[1] GATE.
[2] A. Isard, D. McKelvie, and H. S. Thompson. “Towards a Minimal Standard for Dialogue Transcripts: A New SGML Architecture for the HCRC Map Task Corpus.” Proceedings of the 5th International Conference on Spoken Language Processing, ICSLP98, Sydney, 1998.
[3] MATE Deliverable D3.1: Specification of Coding Workbench.
[4] A. Mengel and U. Heid. “Enhancing Reusability of Speech Corpora by Hyperlinked Query Output.” Eurospeech 99, Budapest, September 1999.
[5] TEI Guidelines for Electronic Text Encoding. Ed. C. M. Sperberg-McQueen and Lou Burnard.
[6] XLINK.
[7] XML.
[8] XSLT.