"NeoloSearch: Automatic detection of neologisms in French Internet documents"

Tatjana Janicijevic
French Studies, Queen's University at Kingston
4tj2@qsilver.queensu.ca

Derek Walker
Computing and Information Science, Queen's University at Kingston
walker@qucis.queensu.ca
Due to the influx of technical terminology, the globalizing effects of the Internet, and the natural processes of language change, even the most resistant languages are currently being infused with significant numbers of new words. But since the search for neologisms is a time-consuming and labour-intensive process, automating the search is a highly desirable goal.
Our aim is to develop an efficient and accurate automated method for identifying neologisms in an open corpus. Baayen and Lieber (p. 801) suggest that neologisms tend to occur as hapax legomena in corpora. We explore this idea in the hope of generalizing it through the use of relative token frequencies. The second part of the research attempts to deduce what kinds of productive processes are involved in word formation and to identify productive affixes in the retrieved neologisms. The program that performs all of these tasks has been aptly named NeoloSearch.
The analysis consists of four fully automated phases. In the first phase the corpus is created from documents collected over the Internet with a specially developed tool. Because of the difficulties of representing and processing typographic marks such as accents, relatively few electronic corpus studies have been conducted on languages such as French. In response to this gap, French was used as the basis of our study, though the method should be easily adaptable to other languages.
The second phase performs a statistical analysis on the tokens contained in the
corpus in an attempt to identify neologisms. The third phase weeds out rare
words, proper names, and typographic errors in the neologism candidate list.
Finally, the context of each token in the list is analysed for additional
information and common affixes are identified.
The Internet provides a wealth of online documents that can potentially be used as the basis of a corpus. A fully automated Internet 'robot' was constructed to speed up the acquisition of documents using a breadth-first search algorithm. Retrieved documents are validated against a predetermined content heuristic: documents must meet a specified size requirement and be in the correct language. A simple counting method checks the percentage of tokens that also appear in an online corpus of inflected forms. Any indecipherable tokens, for instance those containing numerals, are discarded, and accents are converted to a standard format.
Baayen and Lieber (p. 809) demonstrated several ways to quantify the productivity of an affix relative to that of other affixes found in the corpora. The rate at which an affix gives rise to new lexemes indicates its productivity. If the object of interest is not the affix but the resulting lexeme, then it follows from their work (and from common intuition) that while common forms occur with high frequency, neologisms tend to occur with very low frequency.
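One of Baayen and Lieber's measures can be written, in LaTeX notation, as

    \[ P = \frac{n_1}{N} \]

where n_1 is the number of hapax legomena formed with a given affix and N is the total number of tokens containing that affix. An affix whose formations are disproportionately represented by words occurring only once is thus judged more productive, and it is this link between low frequency and novelty that motivates the frequency-based search for neologisms described below.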
The statistical analysis begins by calculating the frequency of each token in the corpus. These values are then tabulated to yield a table of frequencies of token frequencies. The mean, standard deviation, and z-score for each occurrence frequency are calculated and tabulated. Only tokens whose z-scores fall below an arbitrary cut-off point are selected for study. The conversion to standard units allows the threshold principle to be generalized to a corpus of any size.
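A minimal sketch of this step, under the assumption that each token's raw frequency is standardized against the mean and standard deviation of all token frequencies (the exact standardization used by NeoloSearch is not specified here, and the cut-off is arbitrary, as in the text):

    # Hypothetical sketch of the frequency analysis: tokens whose standardized
    # frequency falls below a cut-off are kept as neologism candidates.
    from collections import Counter
    from statistics import mean, pstdev

    def candidate_tokens(tokens, z_cutoff):
        # z_cutoff is arbitrary and would be chosen empirically per corpus.
        counts = Counter(tokens)              # frequency of each token
        freqs = list(counts.values())
        mu, sigma = mean(freqs), pstdev(freqs) or 1.0
        # Standardizing frequencies lets the same cut-off be applied to
        # corpora of any size.
        return {tok for tok, f in counts.items() if (f - mu) / sigma < z_cutoff}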
If the corpus is of sufficient size, there is a high probability that the only tokens occurring in the reduced set will be proper names, rare words, typographic errors, and neologisms. Consequently, the tokens that fall within the selected group are looked up in a corpus of exclusion containing common inflected forms and proper names. Any forms found in the corpus of exclusion are discarded. A filter based on approximate string matching then attempts to eliminate tokens that result from typographic errors by pairing each token with its closest match in the corpus of exclusion.
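The sketch below illustrates this filtering step. It uses Python's difflib as a stand-in for whatever approximate string matching technique NeoloSearch itself employs, and the similarity threshold is an assumption.

    # Hypothetical sketch of the exclusion and typo filters; difflib stands in
    # for the program's actual approximate string matching.
    import difflib

    def filter_candidates(candidates, exclusion, similarity=0.9):
        survivors = []
        for token in candidates:
            if token in exclusion:
                continue  # common inflected form or proper name
            # If the token is very close to an entry in the corpus of exclusion,
            # treat it as a probable typographic error and discard it.
            close = difflib.get_close_matches(token, exclusion, n=1, cutoff=similarity)
            if not close:
                survivors.append(token)
        return survivors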
Studying the production of new words requires more than a list of anticipated neologisms. Consequently, the tokens that pass through all the above stages are subjected to a simple parsing analysis that attempts to deduce word type and proximity to productive verbs using a +/- n-word context. Observations are summed and reported by word type and context, and concordances are observed. Finally, string matching techniques are employed to identify commonly occurring affixes in the list of remaining tokens. This helps the researcher deduce the sources of the neologisms and the types of productive forces at work. Statistics on all phases of the analysis are also part of the program output at the end of the run, and +/-1 sentence contexts can be generated for each neologism retrieved from the corpus using a simple query.
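The following sketch illustrates the kind of context extraction and affix counting described above; the window size, affix lengths, and the decision to count suffixes only are illustrative assumptions rather than a description of NeoloSearch's own parsing analysis.

    # Hypothetical sketch of the context and affix analysis.
    from collections import Counter

    def context_windows(tokens, target, n=5):
        # Return the +/- n-word context around each occurrence of the target token.
        return [tokens[max(0, i - n):i + n + 1]
                for i, tok in enumerate(tokens) if tok == target]

    def common_affixes(candidates, min_len=2, max_len=5):
        # Count word-final strings shared by the candidate neologisms; frequent
        # endings suggest productive suffixes (prefixes can be counted the same way).
        suffixes = Counter()
        for word in candidates:
            for k in range(min_len, min(max_len, len(word) - 1) + 1):
                suffixes[word[-k:]] += 1
        return suffixes.most_common(20)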
Several important discussions arise from this work. World Wide Web documents lie somewhere between spoken language and printed media in terms of permanence and represent a unique means of communication. The Internet has many mass-media characteristics, with millions of subscribers world-wide and a multitude of on-line resources, including electronic journals and newspapers. It would be useful to contrast the results of this experiment with those obtained from a closed corpus consisting of a more conventional mixture of spoken language and printed media.
The statistical methods employed make the identification of collocations, semantic neologisms, and some compound words improbable, and future research is needed to explore these areas. The user can, however, give feedback to the program after each analysis iteration by adding to the corpus of exclusion and by manually retrieving collocations. The program consequently possesses a limited amount of flexibility in this area.
In summary, NeoloSearch pursues two goals: to provide a statistical understanding of neologisms and their relationship to productivity, and to provide an automated system for neologism retrieval and analysis.
Bibliography
H. Baayen and R. Lieber. "Productivity and English derivation: a corpus-based study." Linguistics. 1991. 29: 801-843.
Elisabeth Brandon. "Choisir un logiciel de terminologie." La Banque des mots. 1991.
Maria Teresa Cabré and Lluis de Yzaguirre. "Stratégie pour la détection semi-automatique des néologismes de presse." Traduction-Terminologie-Rédaction. 1995. 8: 89-100.
Gabriel Otman. "Terminologie et intelligence artificielle." La Banque des mots. 1989. 63-95.
Gabriel Otman. "Des ambitions et des performances d'un système de dépouillement terminologique assisté par ordinateur." La Banque des mots. 1991. 59-96.
Paul Wijnands. "La néonymie et les systèmes experts." La Banque des mots. 1992. 15-26.