Digital Humanities Abstracts

“Experiments in Multivariate Analysis and Authorship Attribution”
David L. Hoover New York University david.hoover@nyu.edu

Although statistical stylistics has never been a very popular area of study, it is attractive because its powerful techniques, widely used in the sciences and social sciences, seem especially appropriate for the large amounts of information that texts represent. I am working on a project that reexamines the statistical techniques used by the most careful and respected practitioners of the method, specifically methods inspired by or derived from the work of John F. Burrows (1987, 1989, 1992; Burrows and Hassall, 1988; Burrows and Craig, 1994; Craig, 1999). In particular, I am working with cluster analysis of the frequencies of high-frequency words. The high frequency and low semantic load of the most frequent function words have led researchers to assume that their use is likely to escape the conscious control of authors. If so, their frequencies may reflect deeply ingrained linguistic habits and provide what might be called “wordprints” for authors. If such wordprints exist, they may provide the kind of objective measure of style that has been sought since the 18th century.

The techniques pioneered by Burrows have been quite well received because they are careful, reasonable, and compelling. They have been extended to examinations of the authorship of the Book of Mormon (Holmes, 1992), tested on the Federalist Papers (Holmes and Forsyth, 1995), and applied to the question of Milton's authorship of De Doctrina Christiana (Tweedie, Holmes, and Corns, 1998); see also Holmes (1994), Baayen, Van Halteren, and Tweedie (1996), and Tweedie and Baayen (1998). This work tends to confirm the accuracy and effectiveness of multivariate analysis in authorship attribution, but in each case the field of claimants and the range of texts is relatively limited. No one has taken up Burrows's suggestion “to match a natural desire to work on celebrated cases like Henry VIII and The Revenger's Tragedy with a more sober, though less immediately rewarding, concern for testing our methods thoroughly on cases where the true answers are not in any doubt” (Burrows, 1992, 174). I am interested mainly in the possible application of statistical techniques to stylistic analysis, especially in the areas of character development, genre definition, and stylistic variability within works or authors, but I would like to demonstrate some of the work required for the task Burrows suggests. After all, only those statistical techniques that can effectively and reliably distinguish known authors and known texts from each other seem likely to be useful in characterizing and comparing the styles of those authors.

My first experiment analyzed the first 3,000 words of the opening chapters of 50 current novels by 27 authors, downloaded from HTTP://WWW.CONTENTVILLE.COM. My second experiment analyzed the first 30,000 words of 46 novels by 31 authors, mainly taken from Hoover (1999). Another experiment analyzed 4,000-word sections of 25 pieces of current literary criticism by 14 authors, downloaded from Project Muse (http://muse.jhu.edu/journals/elh/). Unfortunately, none of these experiments produced the kind of results one would hope for. In fact, the best result was less than 90% accurate in attributing texts to the correct authors. This was true even when first- and third-person narration was separated and when function-word homographs were distinguished. Using the results of the analyses, I then selected some of the most problematic texts in the second and third groups for further analysis.
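The general procedure behind these experiments can be sketched briefly. The Python code below is only a minimal illustration of the method (hierarchical clustering of texts by the relative frequencies of their most frequent words), not the TACT/FoxPro/Minitab workflow described below; the file names, the 3,000-word sample size, and the choice of the 50 most frequent words are assumptions made for the example.

# A minimal sketch, assuming a hypothetical corpus of plain-text files and
# Python with numpy, scipy, and matplotlib. It illustrates the general
# method only: cluster texts by the relative frequencies of the most
# frequent words of the combined corpus.
from collections import Counter
import re

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

TEXTS = {"novel_a1.txt": "AuthorA", "novel_a2.txt": "AuthorA",
         "novel_b1.txt": "AuthorB", "novel_c1.txt": "AuthorC"}  # hypothetical files
N_WORDS = 50      # the experiments used the 50-500 most frequent words
SAMPLE = 3000     # words per text, as in the first experiment

def tokens(path):
    # Lowercased word tokens from the first SAMPLE words of a file.
    with open(path, encoding="utf-8") as f:
        return re.findall(r"[a-z']+", f.read().lower())[:SAMPLE]

corpus = {name: tokens(name) for name in TEXTS}

# The most frequent words of the combined corpus define the variables.
combined = Counter(w for toks in corpus.values() for w in toks)
mfw = [w for w, _ in combined.most_common(N_WORDS)]

# One row per text: relative frequency of each word (zero if absent).
freqs = []
for toks in corpus.values():
    c = Counter(toks)
    freqs.append([c[w] / len(toks) for w in mfw])
matrix = np.array(freqs)

# Ward clustering of the frequency profiles; if the method works, texts
# by the same author should join before texts by different authors do.
labels = [f"{author}: {name}" for name, author in TEXTS.items()]
dendrogram(linkage(matrix, method="ward"), labels=labels)
plt.show()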
When I analyzed the texts of only 2-4 authors, cluster analysis still failed to group texts by the same author or to distinguish texts by different authors accurately. This lack of accuracy suggests that further work is necessary before such techniques can be accepted as important tools in authorship attribution or stylistic studies. In my presentation, I would like to show how the process of multivariate analysis works, from the point when the texts have been collected to the production of cluster graphs or PCA plots. Based on conversations with other humanities computing people, I believe that there is a need for this kind of fairly explicit and basic introduction to statistical analysis. At the same time, the results of my experiments seem very interesting and significant in their own right: although a proof of the accuracy of the techniques on large groups of varied texts would have been more welcome, a demonstration of their inaccuracy may, in the long run, be just as useful.

The main components of my own technique are TACT, used to analyze the word frequencies of the individual texts and of a text that combines all of the texts; FoxPro, a programmable database, used to import the word frequency data, tag it with author and text information, cull the data so that it includes only the desired number of most frequent words (generally the 50-500 most frequent words), and create zero-frequency records for frequent words that do not appear in one or more of the individual texts (note: I am currently looking into the feasibility of moving the techniques to Microsoft Access, with Visual Basic as the programming language); and Minitab, a statistical analysis program, used to perform the actual PCA and cluster analysis (a rough sketch of this stage appears below). The techniques I use allow for quick and relatively painless analysis of many different groups of texts of many different kinds, and so have the potential to provide a wide range of tests of the techniques on extremely varied and extensive groups of texts, something that has not, to my knowledge, been done before.

For my poster presentation, I would need access to a PC running Windows 98 (NOT 2000 or ME; I need access to DOS for running batch files), on which I could install FoxPro (Access might be needed, if I get the conversion done in time), Minitab, TACT, and a selection of texts for possible analysis.
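As a rough analogue of the culling, zero-filling, and PCA stages just described, the following sketch uses Python in place of FoxPro and Minitab, with invented word counts standing in for TACT output; the text names, counts, and cutoff are assumptions for illustration. It reduces a frequency table to the most frequent words, pads it with explicit zero-frequency cells, and projects it onto its first two principal components for plotting.

# A sketch of the data-preparation and PCA stages, assuming invented counts
# in place of TACT output and Python in place of FoxPro and Minitab.
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt

# Raw per-text word counts as they might come from a word-frequency tool.
counts = {
    "AuthorA: text1": {"the": 210, "and": 150, "of": 95, "i": 60},
    "AuthorA: text2": {"the": 190, "and": 160, "of": 80, "i": 70},
    "AuthorB: text1": {"the": 230, "and": 90, "of": 140, "she": 55},
}

# Cull to the N most frequent words of the combined corpus and create
# explicit zero-frequency cells for words a given text lacks.
N = 4
combined = Counter()
for c in counts.values():
    combined.update(c)
mfw = [w for w, _ in combined.most_common(N)]
rows = np.array([[counts[t].get(w, 0) for w in mfw] for t in counts], float)
rows /= rows.sum(axis=1, keepdims=True)      # convert to relative frequencies

# PCA by singular value decomposition of the mean-centered matrix,
# followed by a plot of the first two principal components.
centered = rows - rows.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[:2].T
plt.scatter(scores[:, 0], scores[:, 1])
for label, (x, y) in zip(counts, scores):
    plt.annotate(label, (x, y))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

In a plot of this kind, as in the cluster graphs, texts by the same author should ideally fall near one another; the experiments reported above test how often they actually do.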

Works Cited

R. Harald Baayen. “Statistical Models for Word Frequency Distributions: A Linguistic Evaluation.” Computers and the Humanities. 1992. 26: 347-363.
R. Harald Baayen. “The Effect of Lexical Specialization on the Growth Curve of the Vocabulary.” Computational Linguistics. 1996. 22: 455-480.
R. Harald Baayen, Hans Van Halteren, and Fiona J. Tweedie. “Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution.” Literary & Linguistic Computing. 1996. 11: 121-131.
J. F. Burrows. Computation into Criticism. Oxford: Clarendon Press, 1987.
J. F. Burrows. “'A Vision' as a Revision?” Eighteenth-Century Studies. 1989. 22: 551-565.
J. F. Burrows. “Computers and the Study of Literature.” Computers and Written Texts: An Applied Perspective. Ed. Christopher Butler. Oxford: Blackwell, 1992. 167-204.
J. F. Burrows and A. J. Hassall. “Anna Boleyn and the Authenticity of Fielding's Feminine Narratives.” Eighteenth-Century Studies. 1988. 21: 427-453.
J. F. Burrows and D. H. Craig. “Lyrical Drama and the 'Turbid Mountebanks': Styles of Dialogue in Romantic and Renaissance Tragedy.” Computers and the Humanities. 1994. 28: 63-86.
Hugh Craig. “Contrast and Change in the Idiolects of Ben Jonson Characters.” Computers and the Humanities. 1999. 33: 221-240.
D. I. Holmes. “A Stylometric Analysis of Mormon Scripture and Related Texts.” Journal of the Royal Statistical Society, Series A. 1992. 155: 91-120.
D. I. Holmes. “Authorship Attribution.” Computers and the Humanities. 1994. 28: 87-106.
D. I. Holmes and R. S. Forsyth. “The Federalist Revisited: New Directions in Authorship Attribution.” Literary & Linguistic Computing. 1995. 10: 111-127.
David L. Hoover. Language and Style in The Inheritors. Lanham, MD: University Press of America, 1999.
F. J. Tweedie and R. H. Baayen. “How Variable May a Constant Be? Measures of Lexical Richness in Perspective.” Computers and the Humanities. 1998. 32: 323-352.
F. J. Tweedie, D. I. Holmes, and Thomas N. Corns. “The Provenance of De Doctrina Christiana, Attributed to John Milton: A Statistical Investigation.” Literary & Linguistic Computing. 1998. 13: 77-87.