“Working with Alignment of Text and Sound in Spoken Corpora”

Knut Hofland, University of Bergen, Norway

The Bergen Corpus of London Teenage Language (COLT) has been transcribed, and the cassette tapes have been digitized as Windows WAV files (9 GB). The texts have been time-aligned with the sound files at the word level by the company Softsound in the UK. The poster will describe how this material is made available through the Corpus WorkBench from IMS in Stuttgart. The user can search the corpus with a Web browser and, from the resulting concordance, play the sound corresponding to each occurrence (5-15 seconds). For this purpose a program was written to deliver small pieces of a sound file across the Web. These sound extracts can be saved by the user and analyzed further with signal processing programs.

Two Norwegian spoken corpora are also available for searching in this way. In the first corpus, a mark was inserted manually in the transcripts every 10 seconds, and a program then generated an interpolated time stamp for each word. In the second corpus, the program SyncWriter was used while transcribing the text. This program keeps track of time information for each transcribed unit, and this information can be extracted from the data file together with the text. The time stamp for each word is interpolated between these values, and the text and time information are indexed by the search software.
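The two programs described above can be illustrated with a minimal Python sketch. It is not the software used for COLT or the Norwegian corpora. The function names interpolate_word_times and extract_clip are hypothetical, the interpolation is assumed to be linear in the number of words between two known time marks (the abstract does not specify the method), and the sound files are assumed to be uncompressed PCM WAV readable by Python's standard wave module.

```python
import io
import wave


def interpolate_word_times(segments):
    """Assign an approximate start time to every word.

    segments: list of (start_sec, end_sec, text) tuples, where the times are
    the known marks (e.g. manual marks every 10 seconds) around the text.
    Assumption: words are spread evenly over the segment.
    Returns a list of (word, time_sec) pairs.
    """
    timed_words = []
    for start, end, text in segments:
        words = text.split()
        if not words:
            continue
        step = (end - start) / len(words)
        for i, word in enumerate(words):
            timed_words.append((word, start + i * step))
    return timed_words


def extract_clip(src, start_sec, end_sec):
    """Return the bytes of a complete WAV file holding one piece of src.

    src: path or binary file object of an uncompressed PCM WAV file.
    The requested interval is clamped to the length of the recording.
    """
    with wave.open(src, "rb") as wav:
        params = wav.getparams()
        total = wav.getnframes()
        first = max(0, min(int(start_sec * params.framerate), total))
        last = max(first, min(int(end_sec * params.framerate), total))
        wav.setpos(first)
        frames = wav.readframes(last - first)

    # Write the selected frames with a fresh header into memory.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(params.nchannels)
        out.setsampwidth(params.sampwidth)
        out.setframerate(params.framerate)
        out.writeframes(frames)
    return buf.getvalue()
```

In a Web setting, the time stamp of a concordance hit would determine start_sec and end_sec (for example a window of 5-15 seconds around the word), and the bytes returned by extract_clip could be sent as the body of the HTTP response so that the user can play or save the extract.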

References:

COLT

Softsound Speech/Text alignment

Corpus WorkBench

SyncWriter

Demo concordance