Digital Humanities Abstracts

“A corpora-based environment for linguistic knowledge exploration”
Rochdi Oueslati, ERIC (Équipe de Recherche en Ingénierie des Connaissances), rochdi@eric.u-strasbg.fr

Introduction

Well-designed linguistic tools can help a linguist explore concepts and relations in texts. Analysing texts and acquiring linguistic concepts from corpora is an interactive, successive-refinement process: a linguist may analyse a domain text and perform a series of experiments (exploring a list of domain terms, searching for contexts, and then interpreting the results) in search of linguistic concepts. For such an approach to be practical, the linguist needs a user-friendly system that supports:
  • text analysis without requiring the user to be aware of the underlying storage structures and processing strategies
  • easy formulation and execution of queries against the text
  • a natural representation of the relationships between words, terms, and linguistic concepts in texts.
To support text analysis and knowledge exploration we developed an environment called LEXICA (lexical and linguistic concept acquisition), which processes texts and allows the user (a linguist or a terminologist) to enter queries in order to search for linguistic concepts in texts.

1. The major design goals of LEXICA

Acquiring and exploiting linguistic knowledge from texts is widely seen as an important step in the construction of knowledge bases (Czap and Nedobity, 1990; Meyer et al., 1992; Skuce, 1993). The major design goals that contribute to our system's usefulness as an environment for linguistic knowledge acquisition and structuring are to:
  • provide a set of modules to perform text analysis
  • provide a query facility to search for linguistic concepts

2. How LEXICA can help a linguist

Consider a linguist charged with studying domain-specific relations. One way to search for these relations is to formulate a query which searches for co-occurring domain terms. For instance, the linguist can enter the formula art+term+verb+art+term (an article, followed by a term, followed by a verb, followed by an article, followed by a term), which can be read as a list of linguistic constraints applied in order to retrieve all contexts in which two terms co-occur around a verb. The remainder of this paper first discusses methodologies for term and linguistic knowledge acquisition from texts and describes our system LEXICA (section 3). We then describe how such a system can support tasks of linguistic knowledge acquisition and exploitation (section 4). Finally, the paper closes with a brief conclusion (section 5).

3. System architecture

3.1 Introduction

Corpora-based techniques have been widely used for term acquisition and text analysis (Delisle and Szpakowicz, 1991; Delisle et al., 1994; Müller, 1989). Classical methods often rely on grammars and domain dictionaries (Mars et al., 1994), so they cannot be used in totally new domains where domain dictionaries and conceptual hierarchies have not yet been established, and building such resources is time-consuming. We propose an original approach to text analysis which uses repeated word sequences to identify terms denoting concept labels and to perform term structuring.
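To make the idea concrete, the following Common Lisp sketch (our own illustration, not the actual MANTEX code) collects repeated word sequences from a text assumed to be already split into sentences of word strings, so that no sequence crosses punctuation:

(defun repeated-sequences (sentences &key (min-n 2) (max-n 4))
  "Return an alist of (sequence . count) for every n-word sequence
(MIN-N <= n <= MAX-N) occurring at least twice in SENTENCES."
  (let ((counts (make-hash-table :test #'equal))
        (result '()))
    (dolist (sentence sentences)
      (loop for n from min-n to max-n do
        (loop for tail on sentence
              while (>= (length tail) n)
              do (incf (gethash (subseq tail 0 n) counts 0)))))
    (maphash (lambda (sequence count)
               (when (>= count 2)
                 (push (cons sequence count) result)))
             counts)
    result))

;; (repeated-sequences '(("the" "dna" "molecules") ("the" "dna" "markers")))
;; => ((("the" "dna") . 2))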

3.2. Description of our system

Our system is implemented in Macintosh Common Lisp on a Macintosh workstation. Two parts of the system must be distinguished:
  • 1) a preprocessing part (the MANTEX, TSTRUCT, and VERB modules), which performs text and term processing and structuring (identification of words, sentences, collocations, and terms);
  • 2) a query part (the LEX module), a user-friendly language which allows the user to handle a set of predefined linguistic categories and search for linguistic relations.

3.2.1. The preprocessing part

- The MANTEX module
Text --> MANTEX (automatic) --> term candidates --> term validation (interactive)
Many term acquisition tools (Enguehard, 1992; Smadja, 1993; Bourrigault and Condamines, 1995; Justeson and Katz, 1995; Reinert, 1995) use different methods to perform term acquisition and linguistic knowledge structuring. For instance, Smadja uses statistical methods to collect relevant pairs of co-occurring words, which are then used to reconstruct n-word collocations (called n-grams). Our approach to term acquisition uses repeated word sequences and word distributions to help locate meaningful entities in text. Repeated word sequences (Lebart and Salem, 1994) are n-word strings (n > 1) occurring at least twice in the text (strings containing punctuation are not considered). The semantic hypothesis behind repeated word sequences is based on Harris (1968): language offers discrete and arbitrary linguistic units which may be clustered according to linguistic constraints (i.e., terms are made up of linguistic units chosen to cluster according to linguistic constraints). MANTEX generates a list of term candidates which may then be validated by a terminologist.
- The TSTRUCT module
Terms --> TSTRUCT (automatic) --> term tree
The TSTRUCT module produces a structured list of terms: terms generally occur in texts with different extensions, so the main term can be seen as the head of a tree and the extensions as its branches. In French, the first noun of a noun phrase usually expresses the main concept, while the rest expresses a specification. Although our system has so far addressed only French terminology, we think it can easily be ported to other languages; for instance, we are currently extracting terms from a genetic engineering corpus in English. In "DNA molecules", the main concept is "DNA" and "molecules" specifies it. Example of a term tree:
(DNA)
  |-- (molecules)
  |-- (markers)
  |-- (sequences)
  |-- (strands)
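A tree like this can be produced by grouping term candidates that share a head. Here is a minimal sketch (our own code, not the TSTRUCT module itself, assuming terms are represented as lists of words and following the French head-first convention):

(defun term-tree (terms)
  "Group TERMS (lists of words) under their head word: returns an
alist mapping each head to the list of its extensions."
  (let ((tree '()))
    (dolist (term terms (nreverse tree))
      (let ((entry (assoc (first term) tree :test #'string=)))
        (if entry
            (push (rest term) (cdr entry))
            (push (list (first term) (rest term)) tree))))))

;; (term-tree '(("dna" "molecules") ("dna" "markers") ("dna" "strands")))
;; => (("dna" ("strands") ("markers") ("molecules")))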
In English, the difficulty is that the main noun of a noun phrase is often the last, as in "DNA strands", but may sometimes be the first, as in "strands of DNA".
- The VERB module
Text, morphological rules --> VERB (automatic) --> verbs
To build a list of domain verbs we use a set of morphological rules. Let (L) = (e, ent, ait, aient, ant, er) be a list of verb endings. If a word (w) ends with a member (m) of (L), then our system takes the string consisting of (w) without (m) as a radical candidate. If this candidate is also found with another member of (L), then it is retained as a verb (for example, "montr(ent)" and "montr(ait)" yield the verb "montr(er)").
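A minimal reconstruction of this heuristic (our own sketch; LEXICA's actual implementation may differ) is:

(defparameter *verb-endings*
  ;; the list (L) above, longest endings first so that "er" is tried before "e"
  '("aient" "ait" "ant" "ent" "er" "e"))

(defun radical (word)
  "Return the radical if WORD ends with a known verb ending, else NIL."
  (dolist (ending *verb-endings*)
    (let ((cut (- (length word) (length ending))))
      (when (and (> cut 0) (string= word ending :start1 cut))
        (return (subseq word 0 cut))))))

(defun find-verbs (words)
  "A radical attested under at least two distinct word forms is
retained as a verb, reported as an infinitive (radical + \"er\")."
  (let ((forms (make-hash-table :test #'equal))
        (verbs '()))
    (dolist (word words)
      (let ((rad (radical word)))
        (when rad
          (pushnew word (gethash rad forms) :test #'string=))))
    (maphash (lambda (rad attested)
               (when (>= (length attested) 2)
                 (push (concatenate 'string rad "er") verbs)))
             forms)
    verbs))

;; (find-verbs '("montrent" "montrait" "table")) => ("montrer")
;; "table" strips to the candidate "tabl" but is attested with only
;; one ending, so it is correctly rejected.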

3.2.2. The query part

- The LEX module
Text, terms, categories, formula --> LEX (automatic) --> contexts
The LEX module handles lexical data (words, terms, categories, ...) and morphological data (the different forms of a verb, terms in singular/plural form, ...).
- A personalized dictionary
The linguist may use a personalized dictionary which contains a few predefined linguistic categories, such as art (for article) and term (for domain term). In addition, the user may interact with the system to create new categories. For instance, one may create the category det (for determiner), which refers to the following linguistic categories: articles (a, an, the) and possessive adjectives (his, her, ...); a sketch of one possible dictionary representation follows the table below. The following table summarizes the predefined categories:
category   explanation                    examples
abr        abbreviation                   "DNA"
prep       preposition                    to, for, ...
art        article                        a, the, ...
artd       definite article               the
arti       indefinite article             a, an
term       domain term
period     end-of-sentence punctuation    ! . : ; ?
pon        other punctuation marks
comma      comma                          ,
verb       a verb
pron       pronoun                        he, she, ...
text       any string of text
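As announced above, here is a sketch of one way such a category dictionary could be represented (the representation and all names are our own assumptions, not LEXICA's internals): each category maps to a membership predicate over words, so that a user-defined category like det can be built from existing ones.

(defparameter *categories* (make-hash-table :test #'eq))

(defun defcategory (name predicate)
  "Register PREDICATE as the membership test for category NAME."
  (setf (gethash name *categories*) predicate))

(defun in-category-p (word name)
  "True if WORD belongs to the category NAME."
  (let ((test (gethash name *categories*)))
    (and test (funcall test word) t)))

;; A few of the predefined categories:
(defcategory 'arti (lambda (w) (and (member w '("a" "an") :test #'string-equal) t)))
(defcategory 'artd (lambda (w) (string-equal w "the")))
(defcategory 'art  (lambda (w) (or (in-category-p w 'arti) (in-category-p w 'artd))))

;; The user-defined category det from the example above:
(defcategory 'det
  (lambda (w)
    (or (in-category-p w 'art)
        (and (member w '("his" "her" "its" "their") :test #'string-equal) t))))

;; (in-category-p "the" 'det) => T

Representing categories as predicates rather than fixed word lists lets a new category combine existing ones freely.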
- The LEXICA formula
A LEXICA formula is a list of constraints C1, ..., Cn. Each constraint may be a variable, a string of text, or a linguistic category. Variables and categories may be followed by a list of logical predicates (in a Lisp-like form). For instance, the user may specify that a constraint is the verb "to contain": verb/(="contain"). Two kinds of constraints must be distinguished:
- local constraints (described above);
- global constraints, which are specified at the end of the formula (after the sign "&") and apply to the whole formula. For instance, the user may execute the formula art+x+verb+y and require that the number of contexts found equal 1; with this global constraint the formula becomes: art+x+verb+y & (context= 1).
To express local and global constraints, we developed a set of predefined predicates which may be used in a formula, for instance (a sketch of their evaluation follows the table):
predicate             example                 meaning
=                     x/(="contain")          true if the value of x equals the word "contain"
in                    x/(in list-of-words)    true if x belongs to list-of-words
context=, context<,   (context> 1)            constrain the number of contexts to be found
context>
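To show how such predicates might be evaluated, here is a sketch (our own representation, not LEXICA's): a local condition such as (= "contain"), or a combination built with the operators and, or, not (see the syntax below), is tested against the word that instantiated the constraint. Global predicates like context= apply to the result set as a whole and are omitted here.

(defun eval-predicate (predicate word)
  "Evaluate a local predicate such as (= \"contain\") against WORD."
  (destructuring-bind (op &rest args) predicate
    (ecase op
      (=  (string-equal word (first args)))
      (in (and (member word (first args) :test #'string-equal) t)))))

(defun eval-condition (condition word)
  "A condition is a predicate or an and/or/not combination of predicates."
  (case (first condition)
    (and (every (lambda (c) (eval-condition c word)) (rest condition)))
    (or  (some  (lambda (c) (eval-condition c word)) (rest condition)))
    (not (not (eval-condition (second condition) word)))
    (t   (eval-predicate condition word))))

;; (eval-condition '(= "contain") "contain")      => T
;; (eval-condition '(or (= "a") (= "an")) "an")   => T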
- Syntax of the formula:
<symb> ::= a Lisp symbol
<symbp> ::= a predefined Lisp symbol
<number> ::= {0|1|2|3|4|5|6|7|8|9}+
<string> ::= a string of text
<variable> ::= {<symb>}
<category> ::= {<symbp>}
<element> ::= <string> | <variable> | <category>
<formula> ::= {<element>}*
<predicate> ::= (<symb> {<element>}*)
<condition> ::= ({<predicate>}) | {<operator> {<predicate>}+}*
<selection> ::= {/<condition>}
<constraint> ::= (<element> <selection>)
<operator> ::= and | or | not
Each symbol handled by a formula is a Lisp symbol. The constraints are separated by the sign "+". Global constraints are expressed with the sign "&" at the end of the formula, followed by a list of predicates. A <constraint> of a formula is an <element> optionally followed by a selection (introduced by the sign "/") of a condition, which is a predicate or a list of predicates connected by a Lisp <operator>.
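To make the matching semantics concrete, here is a sketch of the matching step (our reconstruction, not LEXICA's code), reusing the category dictionary sketched above: a literal string must match the word exactly, a known category symbol is tested against the dictionary, any other symbol acts as a variable matching anything; selections and global constraints are omitted for brevity.

(defun match-constraint-p (constraint word)
  "A string constraint matches literally; a known category symbol is
tested against the dictionary; any other symbol is a variable."
  (cond ((stringp constraint) (string-equal constraint word))
        ((gethash constraint *categories*) (in-category-p word constraint))
        (t t)))

(defun match-formula (formula words)
  "Return every window of WORDS satisfying FORMULA (one constraint
per word), i.e. the contexts collected by a query."
  (let ((n (length formula))
        (contexts '()))
    (loop for tail on words
          while (>= (length tail) n)
          when (every #'match-constraint-p formula tail)
            do (push (subseq tail 0 n) contexts))
    (nreverse contexts)))

;; (match-formula '(art x "molecules") '("the" "dna" "molecules"))
;; => (("the" "dna" "molecules")), provided art is defined as in the
;; dictionary sketch above; x is an unregistered symbol, so it acts
;; as a variable.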

4. Exploring linguistic knowledge

In this section we describe how the terms can be exploited to handle relations which hold between them. We are interested in external relations holding between two terms, term1 and term2. Our main idea is that a verb occurring with two terms, <term1 verb term2>, may describe a domain-specific relation. 1) Example: we apply the formula term+verb/(="montrer")+term & (context> 1). The LEX module collects contexts containing two terms which co-occur with the verb "montrer" (to show):
term                  form of the verb     term
(coronarographie)     montre (present)     (stenose)
(coronarographie)     montre (present)     (reseau coronaire)
(coronarographie)     montrait (past)      (lesions bitronculaires)
(coronarographie)     montre (present)     (obliteration de l'IVA)
(coronarographie)     montre (present)     (minimes irrégularites)
(vaisseaux du cou)    montraient (past)    (occlusion de la carotide)
(ventriculographie)   montre (present)     (fuite mitrale minime)
2) Then we apply the formula "coronarographie"+text+"lesion" & (context> 1). The LEX module collects further contexts in which the two terms "coronarographie" and "lesion" co-occur with other expressions:
term               expression               term
coronarographie    met en evidence          lesion
coronarographie    a confirme l'existance   lesion
The linguist may analyse all these structures and interpret these contexts as expressing a single relation, called "montre" (shows). He may then apply the formula x+"met en evidence"+y & (context> 1) to the corpus in order to collect other contexts which may contain new terms verifying the current relation.
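To illustrate, the first step of this loop can be expressed with the match-formula sketch from section 3 (this assumes the term and verb categories have been registered in the sketch's dictionary, treats terms as single tokens for simplicity, and uses a hypothetical *corpus-words* variable holding the tokenized corpus):

(defun term-pairs-for-verb (verb-form words)
  "Collect the (term1 term2) pairs of every term+verb+term context
whose verb slot equals VERB-FORM."
  (loop for (term1 verb term2) in (match-formula '(term verb term) words)
        when (string-equal verb verb-form)
          collect (list term1 term2)))

;; (term-pairs-for-verb "montre" *corpus-words*)
;; => e.g. (("coronarographie" "stenose") ...)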

5. Conclusion

We have described a linguistic environment called LEXICA which is effective for identifying terms and other linguistic categories. On the basis of example texts, we illustrated how terms can be exploited to identify domain-specific relations. The query language of LEXICA allows the user to enter a query in the form of a list of linguistic constraints (described by labels and predicates) in order to search for contexts and explore new linguistic concepts in texts.

6. References

D. Bourrigault and A. Condamines. “Réflexions sur le concept de Base de Connaissances Terminologiques.” Actes des Journées du PRC-IA. Nancy, 1995.
D. Bourrigault and P. Lepine. “Méthodologie d'utilisation de LEXTER pour l'acquisition des connaissances à partir de textes.” Actes de JAVA. Strasbourg, 1994.
H. Czap and W. Nedobity. “Terminology and Knowledge Engineering.” Proceedings of the 2nd International Congress on Terminology and Knowledge Engineering. Frankfurt: Indeks Verlag, 1990.
S. Delisle, K. Baker, J. F. Delannoy, S. Matwin, et al. “Du texte aux clauses de Horn par combinaison de l'analyse linguistique et de l'apprentissage symbolique.” Actes de JAVA 94. Strasbourg, 1994.
S. Delisle and S. Szpakowicz. “A broad coverage parser for knowledge acquisition from technical texts.” Proceedings of the 5th International Conference on Symbolic and Logical Computing (ICEBOL5). Madison, SD, USA, 1991. 169-183.
C. Enguehard. “ANA: Apprentissage Naturel Automatique d'un réseau sémantique.” UTC Compiègne et CEA Cadarache, 1992.
Z. Harris. Mathematical Structures of Language. New York: Wiley Interscience, 1968.
J. S. Justeson and S. M. Katz. “Technical terminology: some linguistic properties and an algorithm for identification in text.” Natural Language Engineering. 1995. 1: 9-27.
L. Lebart and A. Salem. Statistique textuelle. Paris: Dunod, 1994.
N. Mars, H. D. Jong, P. H. Speel, G. Wilco, et al. “Semi-automatic knowledge acquisition in Plinius: an engineering approach.” Proceedings of the 8th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop. 1994.
I. Meyer, D. Skuce, L. Bowker, and K. Eck. “Towards a new generation of terminological resources: an experiment in building a terminological knowledge base.” Proceedings of COLING. 1992. 956-960.
J. U. Müller. “Knowledge acquisition from texts.” Proceedings of EKAW 88. St Augustin, 1989. 25-1 to 25-16.
Reinert. ADT: manuel de l'utilisateur. Société Image, 1995.
D. Skuce. “A multi-functional knowledge management system.” Knowledge Acquisition. 1993. 305-346.
F. Smadja. “XTRACT: an overview.” Computers and the Humanities. 1993. 26.