Digital Humanities Abstracts

“COLT on TACT: A demonstration of the TACTweb software as applied to the Bergen Corpus of London Teenage Language”
Kristine Hasund, Dept. of English, University of Bergen
Gisle Andersen, Dept. of English, University of Bergen, gisle.andersen@eng.uib.no

The Bergen Corpus of London Teenage Language (COLT) is the first large English corpus focusing on the speech of teenagers. It was collected in 1993 and consists of the spoken language of 13- to 17-year-old boys and girls from different boroughs of London. The aim of the COLT project is to compile a 500,000-word corpus of spoken teenage language and to make it available to students of English at the University of Bergen, as well as to language researchers worldwide.

This poster presents the use of TACTweb on the COLT corpus. TACTweb, which connects the text-retrieval program TACT to the World Wide Web, enables the user to search a database of spoken conversations for the location of words, word combinations and word formation patterns. In the COLT database, TACTweb is applied to give the distribution of an item in relation to certain non-linguistic variables. Searches in the corpus are made possible by the indexing of the texts in the database. The COLT database has the following indices:
  • 1. reference number for each text file (e.g. <REF> B132401)
  • 2. who= index for speaker identity (e.g. who=1)
  • 3. id= index for speaker turn number (e.g. id=1); this index is the same as the one used in the British National Corpus (BNC)
  • 4. speaker's age (e.g. <AGE1> 14)
  • 5. speaker's gender (e.g. <GEN1> f)
  • 6. speaker's socioeconomic group (e.g. <SOC1> 2)
  • 7. speaker's occupation (e.g. <OCC1> student)
  • 8. location of conversation (e.g. <LOC> Hackney)
  • 9. setting of conversation (e.g. <SET> classroom)
  • 10. number of participants (e.g. <AUD> 5)
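
As a rough illustration of how tagged index values of this kind could be read into a program, here is a minimal Python sketch. It is not part of TACT or TACTweb. The one-tag-per-line layout, the reading of the digit in tags such as <AGE1> as a speaker number, and the name parse_header are all assumptions made for the example; the sample values are simply those quoted in the list above.

```python
import re

# Hypothetical header layout: one tag per line, using the example
# values quoted in the list of indices. The real COLT files may differ.
SAMPLE_HEADER = """\
<REF> B132401
<AGE1> 14
<GEN1> f
<SOC1> 2
<OCC1> student
<LOC> Hackney
<SET> classroom
<AUD> 5
"""

# Tag name in capitals, optional digits (assumed to be a speaker number),
# then the value.
TAG_LINE = re.compile(r"^<([A-Z]+)(\d*)>\s*(.+?)\s*$")

def parse_header(text):
    """Split tagged lines into text-level and per-speaker values."""
    text_level = {}
    per_speaker = {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # ignore anything that is not a tag line
        name, speaker, value = match.groups()
        if speaker:
            per_speaker.setdefault(int(speaker), {})[name] = value
        else:
            text_level[name] = value
    return text_level, per_speaker

text_level, per_speaker = parse_header(SAMPLE_HEADER)
print(text_level)
print(per_speaker)
```

Keeping text-level values (location, setting) apart from speaker-level values (age, gender) mirrors the way the list separates properties of the conversation from properties of the speaker.
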
Four types of display are available for searches in the corpus:

KWIC - Key Words In Context

A KWIC display lists all the occurrences of a word with one line of context. Figure 1 shows the occurrences of the word "Peter". The number in parentheses in the top line gives the total number of occurrences of "Peter" in the entire corpus. The numbers at the start of each line give the reference number of the text file, followed by the number of the turn in which the word occurs. The target word appears in the middle of the line. Clicking on the target word displays the full text, which allows a closer study of each occurrence. The KWIC display lets the user browse a large number of occurrences quickly, which is useful for seeing how a particular word is used and for handling words that have many occurrences.
Figure 1. KWIC display of the occurrences of "Peter"
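
The layout of a KWIC display can be approximated in a few lines of Python. The sketch below is only an illustration of the idea, not TACTweb's implementation: the function name kwic, the column widths and the sample turns are all invented for the example, and the sample sentences are not taken from COLT.

```python
# Invented sample turns (reference number, turn number, text); not COLT data.
SAMPLE_TURNS = [
    ("B132401", 1, "Have you seen Peter today?"),
    ("B132401", 2, "No, I thought he was with you."),
    ("B132401", 3, "Peter said he would be late."),
]

def kwic(turns, keyword, width=30):
    """Return one display line per occurrence of `keyword`."""
    target = keyword.lower()
    lines = []
    for ref, turn_id, text in turns:
        tokens = text.split()
        for i, token in enumerate(tokens):
            # Compare without surrounding punctuation and without case.
            if token.strip(".,?!").lower() != target:
                continue
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            # Right-align the left context so the target words line up.
            lines.append(f"{ref} {turn_id:>4}  {left:>{width}}  {token}  {right}")
    return lines

hits = kwic(SAMPLE_TURNS, "Peter")
print(f"Peter ({len(hits)})")  # top line: total number of occurrences
for line in hits:
    print(line)
```

Each output line starts with the reference number and turn number, and the left context is padded so that the target word falls in the same column on every line.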

Variable Context Display

Whereas the KWIC display gives only one line of context, the Variable Context Display allows the user to control the amount of context in which a word is displayed. For example, one can ask for the word "Peter" to be displayed with three lines of context before and three lines after the occurrence.
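
The idea of a user-controlled context window can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a description of how TACTweb works internally: the function name variable_context and its before/after parameters are hypothetical, and the text is assumed to be available as a plain list of lines.

```python
def variable_context(lines, keyword, before=3, after=3):
    """Return, for each line containing `keyword`, that line together
    with up to `before` preceding and `after` following lines."""
    target = keyword.lower()
    hits = []
    for i, line in enumerate(lines):
        words = [w.strip(".,?!").lower() for w in line.split()]
        if target in words:
            start = max(0, i - before)  # do not run past the start of the text
            hits.append(lines[start:i + after + 1])
    return hits

# Placeholder lines, invented for the example.
text = [f"line {n}" for n in range(10)]
text[5] = "ask Peter about it"

for window in variable_context(text, "Peter", before=3, after=3):
    print("\n".join(window))
```

With before=0 and after=0 the function returns only the matching line itself, which corresponds to the one-line context of the KWIC display.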

Distribution

This display shows how the occurrences of a word are distributed across the variables speaker identity, age, gender, socioeconomic group, location, setting, occupation, and number of participants. For example, one can ask how the word "shit" is distributed according to age.
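
A distribution of this kind amounts to counting occurrences of a word per value of a chosen variable. The Python sketch below shows one way to do that; it is an assumption-laden illustration rather than TACTweb's method. The function name distribution, the dictionary layout of a turn and the sample turns are invented for the example and are not COLT data.

```python
from collections import Counter

def distribution(turns, keyword, variable):
    """Count occurrences of `keyword` per value of `variable`.

    Each turn is a dict holding the text and the speaker variables.
    """
    target = keyword.lower()
    counts = Counter()
    for turn in turns:
        words = [w.strip(".,?!").lower() for w in turn["text"].split()]
        n = words.count(target)
        if n:
            counts[turn[variable]] += n
    return counts

# Invented turns with hypothetical variable names.
SAMPLE_TURNS = [
    {"age": 14, "gender": "f", "text": "Well, well, that is new."},
    {"age": 14, "gender": "m", "text": "Oh well."},
    {"age": 16, "gender": "m", "text": "Well then, shall we go?"},
    {"age": 15, "gender": "f", "text": "Nothing to report."},
]

for age, count in sorted(distribution(SAMPLE_TURNS, "well", "age").items()):
    print(age, count)
```

Raw counts like these are easier to compare across groups once they are related to the total number of words each group contributes, a step the sketch leaves out.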

Word List

The Word List display gives a list of all the words that match a particular pattern. For instance, it is possible to produce a list of all words ending in a particular letter or sequence of letters. This is particularly useful for a researcher who is interested in the productivity of certain morphemes, such as -able:
  • unavailable (1)
  • unbelievable (4)
  • uncomfortable (2)
  • unfuckingtouchable (2)
  • unreliable (1)
  • unscrewable (1)
  • unsociable (1)
  • untouchable (1)
  • up-gradable (2)
  • vulnerable (4)
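
A pattern-based word list with frequencies, as in the list above, can be imitated with a short Python sketch. The shell-style wildcard syntax used here (via the standard fnmatch module) is an assumption for illustration and is not claimed to be TACT's own pattern syntax; the function name word_list and the sample sentence are likewise invented.

```python
from collections import Counter
from fnmatch import fnmatchcase

def word_list(tokens, pattern):
    """Return (word, frequency) pairs, alphabetically sorted, for all
    word forms that match a shell-style wildcard `pattern`."""
    counts = Counter(token.lower() for token in tokens)
    return sorted((word, n) for word, n in counts.items()
                  if fnmatchcase(word, pattern))

# Invented sample text, already split into word tokens.
tokens = "an unbelievable story about a vulnerable table quite unbelievable".split()

for word, n in word_list(tokens, "*able"):
    print(f"{word} ({n})")
```

The sample also returns "table", which ends in the letters -able but does not contain the suffix. A purely orthographic pattern therefore over-generates, and the resulting list still has to be checked by hand before conclusions about morphological productivity are drawn.
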
The purpose of the poster presentation is to demonstrate these and other facilities, focusing on TACTweb as a useful tool for the linguistic researcher. Moreover, an overview of ongoing research will be given.