“Towards standards for lexicons and the linguistic
annotation of texts.”
Nicoletta Calzolari
Istituto di Linguistica Computazionale del CNR
glottolo@ilc.pi.cnr.it
Antonio Zampolli
Istituto di Linguistica Computazionale del CNR
eagles@ilc.pi.cnr.it
Ulrich Heid
IMS-CL, Universitaet Stuttgart
heid@ims.uni-stuttgart.de
Antonio Sanfilippo
Oxford University
Ralph Grishman
New York
Catherine McLeod
New York
Motivation
As more and more machine-readable text material becomes available, the importance of linguistic annotation of this material is steadily increasing. This is true not only in the field of Natural Language Processing (NLP) and Language Engineering, but also in Humanities Computing: for example, it is evident that linguistically informed free text search and text retrieval (especially if the texts are written in morphologically richer languages) is more precise (less noise) than search in texts that have not been linguistically pre-analyzed. Linguistic annotation includes:
- the identification and tagging of word, sentence and paragraph boundaries;
- the identification and tagging of the category (POS, word class) of word forms in running text;
- the identification and tagging of morphological features (tense, number, person, etc.);
- the identification and tagging of syntactic properties of predicates (syntactic subcategorization).
Workshop objectives
The workshop aims to present and discuss recent and ongoing work towards standards for linguistic classification and annotation of word forms in texts and lexicons; the second main objective is to gather the feedback of the Humanities Computing community on the standardization work. Specific objectives include the following:
- Identify and discuss the need for, and the problems related to, standards in the field of linguistic resources (in particular lexicons and corpora);
- Discuss questions of the interaction between lexicon and corpus: if there is a common underlying classification of linguistic material, at the levels indicated above, interesting new possibilities for `compound' resources are opened up: dynamic links from the lexicon to the corpus, corpus-based lexicon validation, new possibilities for linguistic acquisition, etc.;
- Describe the EAGLES approach to the definition of standards proposals, the representations used, and the mechanisms available for validation, consistency checking, etc.;
- Describe the existing proposals for syntactic (and possibly semantic) annotation in texts and lexicons, based on efforts in EAGLES and in the COMLEX project at NYU;
- Discuss the EAGLES proposals from the point of view of `users': if a lexicon design project or a corpus analysis project is set up, does the use of annotation standards contribute to the efficiency of the project?
Standards for lexicons and corpora -- Areas, interaction between lexicon and corpus, current state of EAGLES
Nicoletta Calzolari
When we highlight the complex structure of the interrelationships
between lexicon and corpus, we have to work on the assumption of an
interdependence between the two views, and we have to take into
account this interdependence in any lexical or corpus analysis or
application.
This was also the approach taken within the LRE EAGLES project (see
Calzolari, McNaught, EAGLES Editors' Introduction, 1996)
towards the development of standards both in Morphosyntax and
Syntax: the awareness of the interdependence between lexical
specifications on the one hand, and corpus tagsets/syntactic
annotations on the other, has guided the formulation of the
proposals for standards and recommendations in both the Corpus and
the Lexicon Work Groups of EAGLES. Corpus tagging/annotating was
considered as the first obvious application of a Computational
Lexicon. Therefore attention was given to the definition of
compatible sets of attributes and values.
The presentation will address problems of the interaction between
the two types of resources, corpora and lexicons, in particular from
the perspective of a standardization project.
From specifications to tagsets and coding guidelines: EAGLES morphosyntax annotations in lexicons and texts
Ulrich Heid
EAGLES has first established a core set of commonly agreed
annotations for morphosyntax, basically by collecting, comparing and
filtering existing annotation proposals from lexicons and tagsets.
Once the synthesis is available, how can it be put to use in both
lexicon building and text annotation work? This question is
addressed in our contribution.
We want the EAGLES morphosyntax annotation to be applicable in
different usage contexts, especially in both lexicons for NLP and
text corpora. Moreover, unlike in many tagsets for corpus
annotation, the classifications used must be strict: the classes
form a hierarchy, and any item to be described has to fall in one
branch of the hierarchy. Along with this structure, there is need,
however, for support for manual tagging: how can we make sure that
different people will classify the same facts in the same way?
In EAGLES, we have tried to come close to solutions for some of the
requirements stated above. We have defined a typed class hierarchy,
to specify the classifications underlying our language-specific
morphosyntactic coding systems. These hierarchies can be mapped onto
lexicon codes as well as onto corpus tagsets; the latter mapping is even
automatic. For a subset of languages, we have written guidelines for
manual annotation, which contain discussions of borderline cases,
tests and a large collection of examples, to support manual coders.
We have applied the tagset and guidelines to the manual coding of
60,000-word reference corpora for German and Italian.
The contribution will summarize the experiences gained in this
exercise, and we will point to the resources and tools produced
therein.
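The idea of deriving a corpus tagset automatically from a strict class hierarchy can be illustrated with a minimal sketch. It is not the EAGLES formalism: the class names, the short codes and the dotted tag format are all assumptions made for illustration only. Each leaf of the hierarchy yields exactly one tag, so any item classified in the hierarchy falls in one branch.

```python
from dataclasses import dataclass, field


@dataclass
class MorphClass:
    """One node of a hypothetical morphosyntactic class hierarchy."""
    name: str                                   # descriptive class name
    code: str                                   # short code contributed to a tag
    children: list = field(default_factory=list)


def derive_tagset(node, path=(), codes=()):
    """Return {tag: class path} with one entry per leaf of the hierarchy."""
    path = path + (node.name,)
    codes = codes + (node.code,)
    if not node.children:
        # A leaf: the concatenated codes form the corpus tag.
        return {".".join(codes): " > ".join(path)}
    tagset = {}
    for child in node.children:
        tagset.update(derive_tagset(child, path, codes))
    return tagset


# Illustrative hierarchy only; the real EAGLES classes are richer.
noun = MorphClass("noun", "N", [
    MorphClass("common", "C", [
        MorphClass("singular", "SG"),
        MorphClass("plural", "PL"),
    ]),
    MorphClass("proper", "P"),
])

tagset = derive_tagset(noun)
for tag, classes in sorted(tagset.items()):
    print(tag, "=", classes)
```

A mapping onto lexicon codes could reuse the same traversal with a different code scheme; that part is left out here because the source does not describe it.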
The Comlex Syntax Lexicon and the EAGLES Subcategorization Standard
Ralph Grishman / Catherine McLeod
Comlex Syntax is a broad-coverage English lexicon which was developed
specifically for natural language processing and includes detailed
information on complements for verbs, nouns, and adjectives. The
information we seek to capture is similar to that of EAGLES
syntactic lexicons, but our focus on a single language has allowed
us to define a fixed set of complement classes. These classes are
defined in the dictionary as combinations of standard syntactic
constituents.
Comlex Syntax is accompanied by a corpus with tagged examples of the
subcategorization patterns of various high-frequency verbs. Our
definition mechanism for complement classes allowed us to readily
add new classes for previously unseen and rare complement
structures. The tagging has also required additional features
(besides those in the lexicon) to account for subcategorization
patterns in context.
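A minimal sketch may help to show what it means to define complement classes as combinations of syntactic constituents and to add new classes as corpus tagging turns them up. It does not reproduce the Comlex Syntax notation: the class names, the constituent labels, the toy lexicon and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplementClass:
    """A complement class defined as a sequence of constituent labels."""
    name: str
    constituents: tuple


# Registry of complement classes (hypothetical names and labels).
CLASSES = {}


def define_class(name, *constituents):
    """Register a complement class; new classes can be added at any time."""
    cls = ComplementClass(name, tuple(constituents))
    CLASSES[name] = cls
    return cls


define_class("INTRANS")              # no complement
define_class("NP", "NP")             # single noun phrase
define_class("NP-PP", "NP", "PP")    # noun phrase plus prepositional phrase

# Toy lexicon: each verb lists the complement classes it allows.
LEXICON = {
    "sleep": ["INTRANS"],
    "eat": ["INTRANS", "NP"],
    "put": ["NP-PP"],
}


def licenses(verb, observed):
    """True if one of the verb's classes matches the observed constituents."""
    return any(
        CLASSES[name].constituents == tuple(observed)
        for name in LEXICON.get(verb, [])
    )


print(licenses("put", ["NP", "PP"]))
print(licenses("sleep", ["NP"]))
```

Because a class is only a named combination of constituents, a pattern first seen during tagging can be covered by one further `define_class` call without changing existing entries.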
We are currently extending Comlex Syntax to include information on
nominalizations. This requires us, for the first time, to link
subcategorization frames from different parts of speech.