“Working with Alignment of Text and Sound in Spoken Corpora”
Knut Hofland
University of Bergen, Norway
The Bergen Corpus of London Teenage Language (COLT) has been transcribed, and the
cassette tapes have been digitized as Windows WAV files (9 GB). The texts have
been time-aligned with the sound files at the word level by the company
Softsound in the UK. The poster will describe how this material is made
available through the Corpus WorkBench from IMS in Stuttgart. The user can
search the corpus with a Web browser and, from the resulting concordance, play
the sound corresponding to each occurrence (5-15 seconds). For this purpose, a
program was written to deliver small pieces of a sound file across the Web. The
user can save these sound extracts and analyze them further with
signal-processing programs.
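The extraction step can be illustrated with a minimal sketch. It is not the program used for COLT; it only assumes that the recordings are uncompressed PCM WAV files and that the search software supplies a start and end time in seconds. The function name `extract_clip` and its parameters are hypothetical.

```python
import io
import wave


def extract_clip(wav_path, start_s, end_s):
    """Return a complete WAV file, as bytes, holding the audio from start_s to end_s."""
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        total = src.getnframes()
        # Convert seconds to frame offsets and clamp them to the file length.
        first = max(0, min(total, int(start_s * params.framerate)))
        last = max(first, min(total, int(end_s * params.framerate)))
        src.setpos(first)
        frames = src.readframes(last - first)

    # Write the frames to an in-memory WAV so the clip carries its own header
    # and can be saved and opened by other programs.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(frames)
    return buf.getvalue()
```

A Web program could return these bytes with an audio content type, so that only the requested seconds of a large file travel across the network.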
Two Norwegian spoken corpora can also be searched in this way. In one corpus, a
mark was inserted manually in the transcripts every 10 seconds, and a program
then generated an interpolated time stamp for each word. In the other corpus,
the program SyncWriter was used during transcription. This program records time
information for each transcribed unit, and this information can be extracted
from the data file together with the text. The time stamp for each word is
interpolated between these values, and the text and time information are indexed
by the search software.
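The interpolation can be sketched as follows. The abstract does not state the exact method, so this example assumes the simplest one: the words between two known time marks are spaced evenly by word count. The function name `stamp_words` and the input layout are hypothetical.

```python
def stamp_words(segments, end_s):
    """Assign an interpolated time stamp to every word.

    segments: [(start_s, [word, ...]), ...] in time order, one entry per
              known time mark (a manual 10-second mark or a transcribed unit).
    end_s:    time at which the last segment ends.
    Returns a list of (word, time_s) pairs.
    """
    stamped = []
    for i, (start, words) in enumerate(segments):
        # A segment runs until the next mark, or until end_s for the last one.
        stop = segments[i + 1][0] if i + 1 < len(segments) else end_s
        # Assumption: words are spaced evenly between the two marks.
        step = (stop - start) / len(words) if words else 0.0
        stamped.extend((w, start + k * step) for k, w in enumerate(words))
    return stamped
```

With marks at 0 and 10 seconds and four words between them, the words receive the stamps 0.0, 2.5, 5.0, and 7.5 seconds. Weighting by word length instead of word count would be an equally plausible variant.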
References:
COLT
Softsound Speech/Text alignment
Corpus WorkBench
SyncWriter
Demo concordance