Linguistics Meets Exact Sciences

Jan Hajič

Linguistics is a science that studies human language, both spoken and written. It studies the structure of language and its usage, the function of its elements, and its relation to neighboring disciplines (psycholinguistics, sociolinguistics, etc.), of which the most important related field today is computer science. Diachronic linguistics studies language from the perspective of its development over time, whereas synchronic linguistics studies the function and structure of the current living language. General linguistics studies human language as a system, but particular languages (such as English, Mandarin, Tamil, or any other) are studied in detail as well. Due to its complexity, the study of language is often divided into several areas. Phonetics and phonology are related to speech (or more precisely, to the spoken language), whereas orthography deals with the standard written form of a particular language, including capitalization and hyphenation where appropriate. Morphology studies the composition of words from morphemes (prefixes, roots, suffixes, endings, and segmentation in general) and its relation to syntax, which introduces structure into the description of language at the level of phrases, clauses, and sentences. Sentences are typically considered the basic units of language. Semantics and pragmatics study the relation of the lower levels to meaning and content, respectively.

A description of the (correct) behavior of a particular language is typically called a grammar. A grammar usually generalizes: it describes the language structurally and in terms of broad categories, avoiding a listing of all possible words in all possible clauses (languages are believed to be infinite, making such a listing impossible). An English grammar, for example, states that a sentence consisting of a single clause typically contains, as a minimum, a subject, expressed by a noun phrase, and a verb phrase, expressed by a finite verb form. A grammar refers to a lexicon (a set of lexical units) containing word-specific information such as part of speech (noun, verb, adjective, particle, etc.) or syntactic subcategorization (e.g., that the verb "to attach" takes a subject and an indirect object with the preposition "to").
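The division of labor between a generalizing grammar and a word-specific lexicon can be illustrated with a minimal sketch (the rules and words below are invented for illustration, not taken from any real grammar): three rewrite rules and a five-word lexicon generate sixteen sentences, and adding a word to the lexicon enlarges the language without touching the grammar.

```python
import itertools

# Hypothetical toy grammar: rules generalize over broad categories,
# while the lexicon holds word-specific information (part of speech).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "Det": ["the", "a"],
    "N":   ["dog", "cat"],
    "V":   ["chased"],
}

def expand(symbol):
    """Return all terminal strings derivable from a symbol."""
    if symbol in LEXICON:                        # preterminal: look up words
        return LEXICON[symbol]
    results = []
    for rhs in GRAMMAR[symbol]:                  # try each rewrite rule
        parts = [expand(s) for s in rhs]
        for combo in itertools.product(*parts):  # combine sub-derivations
            results.append(" ".join(combo))
    return results

sentences = expand("S")
# 2 determiners x 2 nouns per NP, so 4 subjects x 1 verb x 4 objects = 16
```

A real grammar would of course need recursion (making the language infinite) and far richer categories; the point here is only the grammar/lexicon separation.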

From a historical perspective, the scientific, economic, and political developments in the world before and during World War II were preparing the field of linguistics for what happened shortly thereafter. Natural language moved into the focus of people working in several scientific fields previously quite distant from linguistics (and from the other humanities as well): computer science, signal processing, and information theory, supported by mathematics and statistics. Today, we can say that one of the turning points for linguistics was Claude Shannon's work (1948). Shannon was an expert in communication theory, and his work belongs to what is known today as information theory, a probabilistic and statistical description of information content. However, he was interested not only in the mathematical aspects of technical communication (such as signal transfer over telegraph wires); he and Warren Weaver also tried to generalize this approach to human language communication. Although forgotten by many, this work was the first attempt to describe the use of natural language by strictly formal (mathematical and statistical, or stochastic) methods. The recent revival of stochastic methods in linguistics only underscores the historical importance of their work.

Developments in theoretical linguistics that led to the use of strictly formal methods for language description were strongly influenced by Ferdinand de Saussure's work (1916). His work turned the focus of linguists to so-called synchronic linguistics (as opposed to diachronic linguistics). Based on this shift in paradigm, the language system as a whole began to be studied. Later, Noam Chomsky (even though his view differs from de Saussure's in many respects) came up with the first systematic formalization of the description of the sentences of natural language (1957). It should be noted, however, that Chomsky himself has always emphasized that his motivation for introducing formal grammar has never been connected with computerization. Moreover, he has renounced probabilistic and statistical approaches. For various reasons, most notably the lack of computer power needed for probabilistic and other computationally intensive approaches, his work stayed dominant in the field of computational linguistics for more than thirty years.

In line with the formal means he proposed, Chomsky also adopted the view (introduced by the descriptivist school of American linguists) that sentence structure in natural language can be represented essentially by recursive bracketing, which puts together smaller, immediately adjacent phrases (so-called constituents) to form bigger and bigger units, eventually leading to a treelike sentence structure. An alternative theory, known in essence from the nineteenth century (Becker 1837) but developed in its modern form by Tesnière (1959) and other European linguists, states that the relation between two sentence constituents is not one of adjacency but one of dependency. This theory offers more flexibility in expressing relations among constituents that are not immediately adjacent, and it is also more adequate for a functional view of language structure.
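The two representations can be contrasted on a toy example (the sentence and both analyses are illustrative, not drawn from the chapter): the constituent view nests adjacent phrases into a tree, while the dependency view simply records, for each word, which word governs it, so non-adjacent words can be related directly.

```python
sentence = ["the", "dog", "chased", "a", "cat"]

# Constituent (phrase-structure) view: recursive bracketing of
# immediately adjacent phrases into bigger and bigger units.
constituents = ("S",
    ("NP", ("Det", "the"), ("N", "dog")),
    ("VP", ("V", "chased"),
           ("NP", ("Det", "a"), ("N", "cat"))))

# Dependency view: each word points to its governor.
# Keys/values are 1-based positions in `sentence`; 0 marks the root.
dependencies = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}

def leaves(node):
    """Flatten a constituent tree back into its word sequence."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]            # preterminal: one word
    return [w for c in children for w in leaves(c)]
```

Note that the dependency table needs no notion of adjacency at all, which is what gives the dependency view its extra flexibility.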

Other formal theories emerged during the 1960s, 1970s, and early 1980s. Generalized phrase structure grammar (Gazdar et al. 1985) is still close to the context-free grammar formalism as proposed by Chomsky. The formalism later developed into so-called head-driven phrase structure grammar (Pollard and Sag 1992), which has characteristics of both immediate constituent structure and dependency structure. Similarly, the independently developed lexical-functional grammar (Kaplan and Bresnan 1983) explicitly separates the surface form, or constituent structure (so-called c-structure), from the functional structure (f-structure, which is close to the predicate-argument structure) and uses lexical information heavily.

Chomsky himself proposed a number of modifications of his original theory during the same period, most notably the government and binding theory (1981), later referred to as the principles and parameters theory, and recently the minimalist theory (1993). While substantially different from the previous theories and enthusiastically followed by some, they seem to contribute little to actual computational linguistics goals, such as building wide-coverage parsers of naturally occurring sentences. Chomsky's work is thus more widely used and respected in the field of computer science itself, namely in the area of formal languages, such as syntax of programming or markup languages.

During the 1980s, stochastic methods (based largely on Shannon and Weaver's work, cf. above) re-emerged on a higher level, primarily thanks to the greatly increased power of computers and their widespread availability even for university-based research teams. First, the use of stochastic methods has led to significant advances in the area of speech recognition (Bahl et al. 1983), then it was applied to machine translation (Brown et al. 1990), and in the late 1990s, almost every field of computational linguistics was using stochastic methods for automatic text and speech processing. More complex, more sophisticated ways of using probability and statistics are now available, and formal non-probabilistic means of language descriptions are being merged with the classic information-theoretic methods, even though we are still waiting for a real breakthrough in the way the different methods are combined.

Computational Linguistics

Computational linguistics is a field of science that deals with computational processing of a natural language. On its theoretical side, it draws from modern (formal) linguistics, mathematics (probability, statistics, information theory, algebra, formal language theory, etc.), logic, psychology and cognitive science, and theoretical computer science. On the applied side, it uses mostly the results achieved in modern computer science, user studies, artificial intelligence and knowledge representation, lexicography (see chapter 6), and language corpora (see chapter 21). Conversely, the results of computational linguistics contribute to the development of the same fields, most notably lexicography, electronic publishing (see chapter 35), and access to any kind of textual or spoken material in digital libraries (see chapter 36).

Computational linguistics can be divided (with many overlaps, of course) into several subfields, although many projects deal with several of them at once.

Theoretical computational linguistics deals with formal theories of language description at various levels, such as phonology, morphology, syntax (surface shape and underlying structure), semantics, discourse structure, and lexicology. Accordingly, these subfields are called computational phonology, computational semantics, etc. Definition of formal systems of language description also belongs here, as well as certain research directions combining linguistics with cognitive science and artificial intelligence.

Work on stochastic methods adapts probabilistic modeling, statistics, and machine learning to the specifics of natural language processing. It uses heavy mathematics from probability theory, statistics, optimization, numerical mathematics, algebra, and even calculus; both discrete and continuous mathematical disciplines are involved.

Applied computational linguistics (not to be confused with commercial applications) tries to solve well-defined problems in the area of natural language processing. On the analysis side, it deals with phonetics and phonology (the sounds and phonological structure of words), morphological analysis (discovering the structure of words and their function), tagging (disambiguation of part-of-speech and/or morphological function in sentential context), and parsing (discovering the structure of sentences; parsing can be purely structure-oriented, or deep, trying to discover the linguistic meaning of the sentence in question). Word sense disambiguation tries to resolve polysemy in sentential context, and it is closely related to lexicon creation and use by other applications in computational linguistics. In text generation, a correct sentence form is created from some formal description of meaning. Language modeling (a probabilistic formulation of language correctness) is used primarily in systems based on stochastic methods. Machine translation usually combines most of the above to provide translation from one natural language to another, or even from many to many (see Zarechnak 1979). In information retrieval, the goal is to retrieve complete written or spoken documents from very large collections, possibly across languages, whereas information extraction finds specific information in large text collections; summarization, which condenses documents to their essential content, plays a key role here. Question answering is even more focused: the system must find an answer to a targeted question not just in one, but possibly in several documents, in large as well as small collections of documents. Topic detection and tracking classifies documents into areas of interest and follows a story topic in a document collection over time. Keyword spotting (flagging a document that might be of interest based on preselected keywords) has obvious uses.
Finally, dialogue systems (for man-machine communication) and multi-modal systems (combining language, gestures, images, etc.) complete the spectrum.

Development tools for linguistic research and applications are needed for quick prototyping and implementation of systems, especially in the area of applied computational linguistics as defined in the previous paragraph. Such support consists of lexicons for natural language processing, tools for morphological processing, tagging, parsing and other specific processing steps, creation and maintenance of language corpora (see chapter 21) to be used in systems based on stochastic methods, and annotation schemas and tools.

Despite sharing many of the problems with written language processing, speech (spoken language) processing is usually considered a separate area of research, apart even from the field of computational linguistics. Speech processing uses almost exclusively stochastic methods. It is often referred to as automatic speech recognition (ASR), but the field is wider: it can be subdivided into several subfields, some of them dealing more with technology, some with applications. Acoustic modeling relates the digitized speech signal and phonemes as they occur in words, whereas language modeling for speech recognition shares many common features with language modeling of written language, but it is used essentially to predict what the speaker will utter next based on what she said in the immediate past. Speaker identification is another obvious application, but it is typically only loosely related to the study of language, since it relies more on acoustic features. Small-vocabulary speech recognition systems are the basis for the most successful commercial applications today. For instance, systems for recognizing digits, typically in difficult acoustic conditions such as when speaking over a telephone line, are widely used. Speaker-independent speech recognition systems with large vocabularies (sometimes called dictation) are the focus of the main research direction in automatic speech recognition. Speaker adaptation (where the speech recognizers are adapted to a particular person's voice) is an important subproblem that is believed to move today's systems to a new level of precision. Searching spoken material (information retrieval, topic detection, keyword spotting, etc.) is similar to its text counterpart, but the problems are more difficult due to the lack of perfect automatic transcription systems. A less difficult, but commercially very viable field is text-to-speech synthesis (TTS).
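The role of a language model, predicting what the speaker will utter next from the immediate past, can be sketched with a maximum-likelihood bigram model (the toy corpus below is invented; real recognizers train on millions of words and smooth the estimates to avoid zero probabilities):

```python
from collections import Counter

# A hypothetical tiny training corpus, already tokenized.
corpus = "the dog barks . the dog sleeps . the cat sleeps .".split()

# Count word pairs, and single words at every position that has a successor.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_next(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
```

In this corpus "the" is followed by "dog" two times out of three, so `p_next("the", "dog")` is 2/3; an unseen pair such as ("the", "barks") gets probability zero, which is exactly why practical language models must smooth their estimates.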

It is clear however that any such subdivision can never be exact; often, a project or research direction draws upon several of the above fields, and sometimes it uses methods and algorithms even from unrelated fields, such as physics or biology, since more often than not their problems share common characteristics.

One very important aspect of research in computational linguistics, as opposed to many other areas in the humanities, is its ability to be evaluated. Usually, a gold-standard result is prepared in advance, and system results are compared (using a predefined metric) to such a gold-standard (also called a test) dataset. The evaluation metric is usually defined in terms of the number of errors that the system makes; when this is not possible, some other measure (such as test data probability) is used. The complement of error rate is accuracy. If the system produces multiple results, recall (the rate at which the system hits the correct solution by at least one of its outputs) and precision (the proportion of system outputs that are correct) have to be used, usually combined into a single figure (called the F-measure). Objective, automatic system evaluation entered computational linguistics with the revival of statistical methods and is considered one of the most important changes in the field since its inception; such evaluation is believed to have been the driving force behind the fast pace of advances in the recent past.
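These measures are simple to state precisely. The sketch below computes precision, recall, and the balanced F-measure for a system whose outputs and the gold standard are given as sets (the sets in the usage note are invented examples):

```python
def precision_recall_f1(system, gold):
    """Compare a set of system outputs against a gold-standard set."""
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0  # outputs that are right
    recall = correct / len(gold) if gold else 0.0         # truths that were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                 # harmonic mean (F-measure)
    return precision, recall, f1
```

For example, a system returning {"a", "b", "c", "d"} against a gold standard of {"a", "b", "e"} scores precision 0.5 (two of four outputs correct) and recall 2/3 (two of three truths found), combining to an F-measure of 4/7.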

The Most Recent Results and Advances

Given the long history of research in computational linguistics, one might wonder what the state-of-the-art systems can do for us today, at the beginning of the twenty-first century. This section summarizes the recent results in some of the subfields of computational linguistics.

The most successful results have been achieved in the speech area. Speech synthesis and recognition systems are now used even commercially (see also the next section). Speech recognizers are evaluated by using so-called word error rate, a relatively simple measure that essentially counts how many words the system missed or confused. The complement of word error rate is the accuracy, as usual. The latest speech recognizers can handle vocabularies of 100,000 or more words (some research systems already contain million-word vocabularies). Depending on recording conditions, they have up to a 95 percent accuracy rate (with a closely mounted microphone, in a quiet room, speaker-adapted). Broadcast news is recognized at about 75–90 percent accuracy. Telephone speech (with specific topic only, smaller vocabulary) gives 30–90 percent accuracy, depending on vocabulary size and domain definition; the smaller the domain, and therefore the more restricted the grammar, the better the results (very tiny grammars and vocabularies are usually used in successful commercial applications). The worst situation today is in the case of multi-speaker spontaneous speech under, for example, standard video recording conditions: only 25–45 percent accuracy can be achieved.
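Word error rate is the word-level edit distance (substitutions, insertions, and deletions) between the recognizer's output and a reference transcript, divided by the reference length. A minimal implementation of the standard dynamic-programming computation (the example strings in the test are invented):

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])    # substitution or match
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)  # deletion, insertion
    return d[len(ref)][len(hyp)] / len(ref)
```

Accuracy, as mentioned above, is simply the complement: one minus the word error rate.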

It depends very much on the application whether these accuracy rates are sufficient or not; for example, for a dictation system even 95 percent accuracy might not be enough, whereas for spoken material information retrieval an accuracy of about 30 to 40 percent is usually sufficient.

In part-of-speech tagging (morphological disambiguation), the accuracy for English (Brants 2000) has reached 97–98 percent (measuring the tag error rate). However, current accuracy is substantially lower (85–90 percent) for other languages that are morphologically more complicated or for which not enough training data for stochastic methods is available. Part-of-speech tagging is often used for experiments when new analytical methods are considered because of its simplicity and test data availability.

In parsing, the number of crossing brackets (i.e., wrongly grouped phrases) is used for measuring the accuracy of parsers producing a constituent sentence structure, and a dependency error rate is used for dependency parsers. Current state-of-the-art English parsers (Collins 1999; Charniak 2001) achieve 92–94 percent combined precision and recall in the crossing-brackets measure (the number would be similar if dependency accuracy is measured). Due to the lack of training data (treebanks) for other languages, there are only a few such parsers; the best-performing published result of a foreign-language parser (on Czech, see Collins et al. 1999) achieves 80 percent dependency accuracy.
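Dependency accuracy itself is straightforward to compute: each word's predicted governor is compared with the gold-standard governor. The head indices below (1-based, with 0 marking the root) are an invented example, not data from the cited parsers:

```python
def dependency_accuracy(system_heads, gold_heads):
    """Fraction of words whose predicted head index matches the gold standard."""
    assert len(system_heads) == len(gold_heads)
    correct = sum(s == g for s, g in zip(system_heads, gold_heads))
    return correct / len(gold_heads)

# Heads for "the dog chased a cat" (hypothetical gold annotation):
gold = [2, 3, 0, 5, 3]
system = [2, 3, 0, 5, 2]   # the parser attached the last word to the wrong head
```

Here four of the five words receive the correct head, giving 80 percent dependency accuracy, the same figure reported above for the best published Czech parser.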

Machine translation (or any system that produces free, plain text) is much more difficult (and expensive) to evaluate. Human evaluation is subjective, and it is not even clear what exactly should be evaluated, unless a specific task and environment of the machine translation system being evaluated is known. Machine translation evaluation exercises administered by DARPA (Defense Advanced Research Projects Agency) in the 1990s led to about 60 percent subjective translation quality of both the statistical machine translation (MT) systems as well as the best commercial systems. Great demand for an automatic evaluation of machine translation systems by both researchers and users has led to the development of several automated metrics (most notably, those of Papineni, see Papineni et al. 2001) that try to simulate human judgment by computing a numeric match between the system's output and several reference (i.e., human) translations. These numbers cannot be interpreted directly (current systems do not go over 0.30–0.35 even for short sentences), but only in relation to a similarly computed distance between human translations (for example, that number is about 0.60 with 4 reference translations; the higher the number, the better).
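The Papineni metric (BLEU) combines modified n-gram precisions against the reference translations with a brevity penalty. The sketch below follows the published formula in simplified form (no smoothing and a plain geometric mean), so its scores will not match a production implementation exactly:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (after Papineni et al. 2001)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # clip each candidate n-gram to its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0:
            return 0.0            # unsmoothed: any zero precision kills the score
        log_precisions.append(math.log(clipped / total))
    # brevity penalty: punish candidates shorter than the closest reference
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to a reference scores 1.0, while real system outputs, as noted above, typically stay well below the score computed between independent human translations.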

Successful Commercial Applications

The success of commercial applications is determined not so much by the absolute accuracy of the technology itself as by its relative suitability to the task at hand. Speech recognition is currently the most successful among commercial applications of language processing, using almost exclusively stochastic methods. But even though there are several dictation systems on the market with accuracy over 95 percent, we are for the most part still not dictating e-mails to our computers, because many other factors make dictation unsuitable: integration with corrections is poor, and noisy environments decrease the accuracy, sometimes dramatically. On the other hand, scanning broadcast news to spot certain topics is routinely used by intelligence agencies. Call routing in the customer service departments of major telephone companies (mostly in the USA) is another example of a successful application of speech recognition and synthesis: even if telephone speech recognition is far from perfect, it can direct most customer calls correctly, saving the work of hundreds of frontline telephone operators. Directory inquiries and call switching in some companies are now also handled automatically by speech recognition; and we now have cell phones with ten-number voice dialing (a trivial but fashionable application of speech recognition technology), not to mention mass-market toys with similar capabilities.

Successful applications of non-speech natural language processing technology are much rarer, if we do not count the ubiquitous spelling and grammar checkers in word processing software. The most obvious example is machine translation. General machine translation, an application that has been under development at many places for almost fifty years now, still performs poorly in most instances, even if it works after a fashion for rough information gathering (on the Web, for example); most translation bureaus find translation memories, which include bilingual and multilingual dictionaries and previously translated phrases or sentences, a much more effective tool. Only one system, SYSTRAN, stands out as a general machine translation system; it is now being used also by the European Commission for translating among many European languages (Wheeler 1984). Targeted, sharply constrained, domain-oriented systems are much more successful: an excellent example is the Canadian METEO system (Grimaila and Chandioux 1992), in use since May 24, 1977, for translating weather forecasts between English and French. Research systems using stochastic methods do surpass current general systems, but a successful commercial system using them has yet to be built.

Searching on the Web is an example of a possible application of natural language processing, since most of the Web consists of natural language texts. Yet most search engines (with the notable exception of Ask Jeeves and its products <www.ask.com>, which allow queries to be posted in plain English) still use simple string matching, and even commercial search applications do not usually go beyond simple stemming. This may be sufficient for English, but with the growing proportion of foreign-language web pages, the necessity of more language-aware search techniques will soon become apparent.

Future Perspectives

Thanks to the existence of corpora and to the ascent of stochastic methods and evaluation techniques, computational linguistics has become a mostly experimental science – something that can hardly be said about many other branches of the humanities. Research applications are now invariably tested against real-world data, virtually guaranteeing quick progress in all subfields of computational linguistics. However, natural language is neither a physical nor a mathematical system with deterministic (albeit unknown) behavior; it remains a social phenomenon that is very difficult to handle automatically and explicitly (and therefore by computers), regardless of the methods used. Probability and statistics did not solve the problem in the 1950s (weak computer power), formal computational linguistics did not solve the problem in the 1960s and 1970s (language seems to be too complex to be described by mere introspection), and the stochastic methods of the 1980s and 1990s apparently did not solve the problem either (due to the lack of data needed for current data-hungry methods). It is not unreasonable to expect that we are now, at the beginning of the twenty-first century, on the verge of another shift of research paradigm in computational linguistics. Whether it will be more linguistics (a kind of return to the 1960s and 1970s, while certainly not abandoning the new experimental character of the field), or more data (i.e., advances in computation: statistics plus the huge amounts of textual data now becoming available), or neural networks (a long-term promise and failure at the same time), a combination of all of these, or something completely different, is an open question.

References for Further Reading

There are excellent textbooks now for those interested in learning the latest developments in computational linguistics and natural language processing. For speech processing, Jelinek (2000) is the book of choice; for those interested in (text-oriented) computational linguistics, Charniak (1996), Manning and Schütze (1999), and Jurafsky and Martin (2000) are among the best.

Bahl, L. R., F. Jelinek, and R. L. Mercer (1983). A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 2: 179–220.

Becker, K. F. (1837). Ausführliche Deutsche Grammatik als Kommentar der Schulgrammatik. Zweite Abtheilung [Detailed German Grammar as Notes to the School Grammar]. Frankfurt am Main: G. F. Kettembeil.

Brants, T. (2000). TnT - A Statistical Part-of-Speech Tagger. In S. Nirenburg, Proceedings of the 6th ANLP (pp. 224–31). Seattle, WA: ACL.

Brown, P. F., J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin (1990). A Statistical Approach to Machine Translation. Computational Linguistics 16, 2: 79–85.

Charniak, E. (1996). Statistical Language Learning. Cambridge, MA: MIT Press.

Charniak, E. (2001). Immediate-head Parsing for Language Models. In N. Reithinger and G. Satta, Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (pp. 116–23). Toulouse: ACL.

Chomsky, A. N. (1957). Syntactic Structures. The Hague: Mouton.

Chomsky, A. N. (1981). Lectures on Government and Binding (The Pisa Lectures). Dordrecht and Cinnaminson, NJ: Foris.

Chomsky, A. N. (1993). A Minimalist Program for Linguistic Theory. In K. Hale and S. J. Keyser (eds.), The View from Building 20: Essays in Linguistics in Honor of Sylvain Bromberger (pp. 1–52). Cambridge, MA: MIT Press.

Collins, M. (1999). Head-driven Statistical Models for Natural Language Parsing. PhD dissertation, University of Pennsylvania.

Collins, M., J. Hajič, L. Ramshaw, and C. Tillmann (1999). A Statistical Parser for Czech. In R. Dale and K. Church, Proceedings of ACL 99 (pp. 505–12). College Park, MD: ACL.

Gazdar, G., E. Klein, G. K. Pullum, and I. A. Sag (1985). Generalized Phrase Structure Grammar. Oxford: Blackwell.

Grimaila, A. and J. Chandioux (1992). Made to Measure Solutions. In J. Newton (ed.), Computers in Translation, A Practical Appraisal (pp. 33–45). New York: Routledge.

Jelinek, F. (2000). Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press.

Jurafsky, D. and J. H. Martin (2000). Speech and Language Processing. New York: Prentice Hall.

Kaplan, R. M., and J. Bresnan (1983). Lexical-functional Grammar: A Formal System for Grammatical Representation. In J. Bresnan, The Mental Representation of Grammatical Relations (pp. 173–381). Cambridge, MA: MIT Press.

Manning, C. D. and H. Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

Papineni, K., S. Roukos, T. Ward, and Wei-Jing Zhu (2001). Bleu: A Method for Automatic Evaluation of Machine Translation. Published as IBM Report RC22176. Yorktown Heights, NY: IBM T. J. Watson Research Center.

Pollard, C. and I. Sag (1992). Head-driven Phrase Structure Grammar. Chicago: University of Chicago Press.

Saussure, F. de (1949). Cours de linguistique générale, 4th edn. Paris: Librairie Payot.

Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27: 379–423, 623–56.

Tesnière, L. (1959). Éléments de syntaxe structurale. Paris: Éditions Klincksieck.

Wheeler, P. J. (1984). Changes and Improvements to the European Commission's SYSTRAN MT System, 1976–1984. Terminologie Bulletin 45: 25–37. Luxembourg: European Commission.

Zarechnak, M. (1979). The History of Machine Translation. In B. Henisz-Dostert, R. Ross Macdonald, and M. Zarechnak (eds.), Machine Translation. Trends in Linguistics: Studies and Monographs, vol. 11 (pp. 20–8). The Hague: Mouton.