English literature, electronic text and computer analysis: An impossible combination?

Claire Warwick, Department of Information Studies, University of Sheffield, c.warwick@sheffield.ac.uk

In 1991 Corns discovered that, despite the potential usefulness of computational text analysis techniques in the study of English literature, very little work published in the field showed any evidence of their use. He hoped that this was due to a lack of knowledge on the part of more traditional literary professionals. Knowledge is now more widespread, and electronic texts and analysis tools are easier to find and use. However, applying the same method of quantitative analysis to the research output of selected journals suggests that computational analysis of English literary texts is no more common now than it was eight years ago. This paper will suggest reasons for this, and argue that the discontinuity between the ways in which machines and humans read prevents the more widespread use of electronic texts by literary scholars. Electronic text is still basically defined in terms of its content (Renear). Thus the tools at our disposal for analysing electronic literary text work in terms of information extraction (e.g. how many times does a word occur, and in what collocation?). Even if the text is encoded, the searches we can perform are more complex versions of a content model (e.g. how many times does Hamlet, as speaker, utter the word 'Ophelia', as opposed to the reverse?). Computational and corpus linguists have been able to produce a great deal of valuable work based on this sort of data, yet to date very little has emerged from applying computer analysis of electronic text in the field of English literature. Researchers who are interested in tracking cultural or historical patterns in large amounts of data, or in charting textual variants, may find computational techniques of great use. However, most scholars still believe that the core activity of the literary critic, in whatever language, is critical analysis and close reading.
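The content-based queries described above (how often a word occurs, and in what collocation) can be illustrated with a minimal Python sketch. It is not drawn from any particular text analysis tool: the function names, the simple regular-expression tokenizer, the window size and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequency(tokens, word):
    """Return how many times `word` occurs in the list of tokens."""
    return Counter(tokens)[word.lower()]


def collocates(tokens, word, window=2):
    """Count the words found within `window` tokens either side of each occurrence of `word`."""
    word = word.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == word:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            # Count every neighbour in the window except the node word itself.
            counts.update(t for j, t in enumerate(tokens[start:end], start=start) if j != i)
    return counts


# Invented sample sentence, used only to show the calls; not a real corpus.
sample = "The king saw the queen, and the queen saw the king."
tokens = tokenize(sample)
print(word_frequency(tokens, "queen"))
print(collocates(tokens, "queen", window=1).most_common())
```

Both functions return only counts. Deciding whether a count is significant, and what it means for a reading of the text, is left entirely to the critic.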
Although we have not fully understood what we do when we read a literary text, we know that we do not simply collect quantitative data. Reading conflates the activities of information retrieval (how many times does x occur?); text analysis, in which we examine the significance of the data (i.e. having found out how many times a word occurs in a given writer, is this different from its frequency in any of his contemporaries, and if so, does it matter to me?); and the identification of emotional effects (I notice that a character tends to be presented in a certain way, and this determines how I as reader perceive that character and the action in which they are involved). Therefore, while critics may use quantitative data to support further analysis, 'close reading' itself is much less easy to define. What we do know is that it involves intangible concepts such as sensibility, originality and creativity, and is predicated upon things that are nuanced and unprovable. These characteristics can be comprehended by humans, but they are much more difficult to adapt to the right-or-wrong, on-or-off world of logical hierarchies that is ideal for computer analysis. Furthermore, unlike linguists, literary scholars often do not need large quantities of information in order to come to their judgements, which they admit should not necessarily be absolute or objective. Humanists do not necessarily expect that a problem can be solved once and for all, nor that their findings must be incontestable (Watisboone). To make any profitable use of computer techniques of analysis, humans must also be able to define exactly the problem under investigation, what the nature of the data is, and why the results are significant. This is something that many English literature scholars find difficult, and this may be a reflection as much of the nature of the subject as of the competence of the researcher.
Text encoders might suggest that the text under analysis is insufficiently well analysed and marked up for the user's purpose, and that users should therefore spend some time marking up their text. But what should they mark up? Even if they could define the sort of literary nuances that they are looking for, or translate them into an encoding system, would this really be a good use of time? The text would have to be so heavily marked up that the critic might as well just read it anyway. Should h