Digital Humanities Abstracts

“The Scottish Corpus of Texts and Speech: problems of corpus design”
Fiona Douglas, University of Glasgow, F.Douglas@englang.arts.gla.ac.uk

In recent years the use of large corpora has revolutionised the way we study language. There are now numerous well-established corpus projects which have set the standard for future corpus-based research. As more and more corpora are developed and technology continues to offer ever greater scope, the emphasis has shifted from corpus size to establishing norms of good practice. There is also an increasingly critical appreciation of the crucial role played by corpus design.
Corpus design can, however, present peculiar problems for particular types of source material, and the development of the Scottish Corpus of Texts and Speech illustrates the problems which may be encountered when dealing with a complicated linguistic situation such as that which exists in Scotland. The Scottish Corpus of Texts and Speech is the first large-scale corpus project specifically dedicated to the languages of Scotland, and it therefore faces many unanswered questions, such as those outlined below, which will have a direct impact on the corpus design.
The project is a joint venture by the Department of English Language and the STELLA project at the University of Glasgow, and the Language Technology Group at the University of Edinburgh, and is funded by the Engineering and Physical Sciences Research Council. The project seeks to address the current gap in knowledge about the languages of Scotland by building a publicly available electronic corpus of written and spoken texts mounted on the Internet.
The linguistic situation in Scotland is complex, with Scottish English, Scots, Gaelic and numerous non-indigenous community languages all playing a role. However, surprisingly little reliable information is available on a variety of issues such as the survival of Scots, the distinguishing characteristics of Scottish English, the use of non-indigenous languages, or the way they have developed in Scotland. The first phase of the corpus is focusing on the collection of Scots and Scottish English texts.
However, the language varieties Scots and Scottish English are themselves difficult to describe, and between these two extremes lie multifarious other language varieties which defy rigid categorisation. Established norms of practice for ensuring corpus representativeness cannot be easily applied, as these Scottish language varieties have disparate and shifting functional roles. Scottish English is generally accepted in a wider variety of formal contexts than Scots, but Scots has stronger local and community ties which may also exert a pressure. Social class and education also influence when and where each language variety may be used. Indeed, the labels 'Scots' and 'Scottish English' are themselves problematic, as written and spoken varieties of Scots and Scottish English are not as closely linked as might be assumed. There are numerous different local varieties, and so there is a strong regional dimension to be considered.
Native Scots themselves often disagree about what is and is not 'Scots', before they even reach considerations of where its use is and is not considered to be appropriate. The perceived status of Scots thus has important implications for the text types and modes in which it is used. To date there has been no large-scale study to identify where each of these language varieties is deemed acceptable usage by native Scots. Indeed, native Scots themselves have ambivalent and wide-ranging opinions on these language varieties, and there are unspoken but nevertheless tangible rules which affect where, how and when they are used.
Present-day Scots also has no agreed standard spelling system, which presents problems when developing search tools for the corpus. A balanced corpus which seeks to reflect the true linguistic situation in Scotland must be sensitive to these problems and anomalies.
It must reflect the variety and breadth of possible linguistic options without skewing the data along preconceived notions of what is and is not Scots or Scottish English. It must also gather its texts from a discourse community which has very ambivalent views about the range of language varieties it encompasses.
This paper considers the problems presented for corpus design in view of the complex linguistic situation that exists in Scotland. It considers questions such as how to decide what should be included, how to choose between candidate texts, and in what proportions texts should be included relative to the corpus as a whole and to the range of possible language varieties. It examines the problematic issue of how to construct a well-balanced and representative corpus in what is largely uncharted linguistic territory. The paper will also consider points of comparison with other corpora.
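The search problem raised by the lack of an agreed Scots spelling system can be illustrated with a minimal sketch. It is not a description of the project's own search tools. It assumes a hand-built table that groups spellings a search should treat as equivalent; the table contents, the function names and the sample sentences are all invented for illustration, and deciding which spellings belong together is itself a linguistic judgement.

```python
import re

# Hypothetical variant table (an assumption for illustration only): each key is
# an arbitrary search label, and the set lists spellings to treat as equivalent.
VARIANTS = {
    "guid": {"guid", "gude", "gweed"},
    "hoose": {"hoose", "hous"},
}

# Reverse index from any listed spelling to its group label.
LOOKUP = {form: label for label, forms in VARIANTS.items() for form in forms}


def expand(query):
    """Return every spelling grouped with `query`, or just the query itself."""
    word = query.lower()
    label = LOOKUP.get(word)
    return set(VARIANTS[label]) if label else {word}


def search(texts, query):
    """Return the texts that contain any spelling variant of `query` as a whole word."""
    forms = sorted(expand(query))
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    return [t for t in texts if pattern.search(t)]


# Invented sample sentences, not taken from the corpus.
sample = [
    "It wis a gude day.",
    "She bides in a braw hoose.",
    "A guid conceit o hersel.",
]
hits = search(sample, "guid")
```

Here a query for one spelling retrieves texts that use any other spelling in the same group. A fixed table of this kind builds the compiler's own view of what counts as the same word into the search results, which is the kind of preconception the corpus design seeks to avoid, so any such table would need to be transparent and adjustable.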