“Collocations, Authorship Attribution, and Authorial Style”

David Hoover New York University david.hoover@nyu.edu

Authorship attribution typically seeks a small number of textual characteristics that effectively distinguish the texts of different authors from one another (see Morton, 1978 for a classic discussion). With small groups of texts, these features can be found by examining frequency lists manually, but statistical tests such as the t-test can also be used (see Binongo and Smith, 1999). For the purposes of authorship attribution, a few items that occur at consistent, and consistently different, frequencies in all of the known texts by all of the claimants may be sufficient for confident attribution. Most multivariate authorship work focuses on frequent words, following the lead of Burrows (1987, 1988, 1989, 1992a, 1992b, 1994). Much persuasive recent work continues this tradition (Craig, 1999a, 1999b, 1999c, 1999d; Forsyth et al., 1999; Holmes et al., 2001a, 2001b; McKenna and Antonia, 2001; Tweedie et al., 1998).
In two recent studies, however, I have shown that cluster analyses based on frequent words often fail to attribute known texts to their authors, and that analyses based on word sequences are sometimes more effective (Hoover, 2001, 2002). Continuing along these lines, I will test the accuracy of analyses based on collocations, while simultaneously examining the effects of using much larger numbers of items than are typically used. Large numbers of words, sequences, and collocations provide more information for potential stylistic analyses, ensure that the results take into account a large proportion of the texts under consideration, and, as we will see, usually produce more accurate results. The results of my investigation also show that analyses based on collocations are often more accurate than those based on frequent words or sequences. For this investigation I will define collocations simply as any two words that appear repeatedly within a certain span of words.
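The working definition of a collocation given above can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the tokenizer, the treatment of pairs as ordered, the reading of "span" as a window of that many consecutive words, and the names `tokenize`, `collocation_counts`, `repeated_collocations`, and `min_count` are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def collocation_counts(tokens, span):
    """Count ordered word pairs that fall within `span` consecutive words.

    Assumption: a pair (first, second) is counted once for every position at
    which `second` follows `first` inside a window of `span` words.
    """
    counts = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + span]:
            counts[(first, second)] += 1
    return counts


def repeated_collocations(tokens, span, min_count=2):
    """Keep only the pairs that appear repeatedly (at least `min_count` times)."""
    counts = collocation_counts(tokens, span)
    return {pair: n for pair, n in counts.items() if n >= min_count}


tokens = tokenize("The cat sat on the mat and the cat sat")
print(repeated_collocations(tokens, span=3))
```

With a span of three, only the pairs that recur in the toy sentence survive; widening the span admits more distant pairs at the cost of many more candidate items.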
Preliminary tests show that, perhaps contrary to intuition, meaningful collocations like house...yard or car...highway are not very effective for authorship attribution. They do not occur very frequently, and their occurrence depends too much on the content of the text. Many multivariate analyses have been based on function words alone, in the belief that such frequent and relatively insignificant words are most likely to reflect unconscious and regular authorial habits. This suggests the use of collocations of function words, but preliminary tests show that these are also not very effective. The most effective collocations are simply those that occur at the highest frequencies, with the exception of collocations of personal pronouns, which, like collocations of meaningful words, seem too strongly conditioned by the content (especially the characters) of the texts. I omit personal pronouns and any items for which a single text provides more than 80% of the occurrences (typically proper names).
To test the effectiveness of collocations in authorship attribution, it seems best to begin with a corpus of texts by known authors, so that various spans, numbers of collocations, and statistical methods can be tested for effectiveness before trying the method on real authorship questions. I begin with a corpus of 10,000 words of pure narrative from fourteen third-person novels by six authors from about 1900, and, as a baseline, test the effectiveness of frequent words and sequences. For the restriction to narrative and to third-person, see Burrows (1987, 1992) and Hoover (2001). The best results cluster the texts of five of the six authors. Although analysts usually select a small number of items (e.g., the 50 most frequent function words), much larger numbers of frequent words are often more effective. I test the 50, 100, 200, 300, 400, 500, 600, 700, and 800 most frequent items, except where fewer than 800 items occur frequently enough to be included. (For this corpus, the best results for frequent words are based on the 300-800 most frequent.)
When collocations are tested, various spans and linkages give various results, but several analyses correctly cluster the texts of all six authors, as Fig. 1 shows. A representative completely correct cluster analysis is shown in Fig. 2.
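The two exclusion rules and the selection of the N most frequent items described above might be sketched as follows. This is an illustrative reading, not the study's implementation: the pronoun list, the function name `select_items`, and the `max_share` parameter are assumptions, and the input is assumed to be one frequency table per text.

```python
from collections import Counter

# Assumed list of personal pronouns; the exact list used in the study is not given.
PERSONAL_PRONOUNS = {
    "i", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves",
    "you", "your", "yours", "yourself", "yourselves", "he", "him", "his",
    "himself", "she", "her", "hers", "herself", "it", "its", "itself",
    "they", "them", "their", "theirs", "themselves",
}


def select_items(per_text_counts, n, max_share=0.8):
    """Return up to `n` of the most frequent items across all texts.

    `per_text_counts` maps a text name to a table of item counts, where an
    item is a word (str) or a collocation (tuple of words). Items containing
    a personal pronoun are skipped, as are items for which a single text
    supplies more than `max_share` of all occurrences.
    """
    totals = Counter()
    for counts in per_text_counts.values():
        totals.update(counts)  # Counter.update adds counts together

    kept = []
    for item, total in totals.most_common():
        words = item if isinstance(item, tuple) else (item,)
        if any(word in PERSONAL_PRONOUNS for word in words):
            continue
        largest = max(counts.get(item, 0) for counts in per_text_counts.values())
        if largest / total > max_share:
            continue
        kept.append(item)
        if len(kept) == n:
            break
    return kept
```

Calling `select_items` with n set to 50, 100, 200, and so on up to 800 would yield the nested item lists tested here; when fewer items survive the filters, the function simply returns a shorter list.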

Figure 1.

Figure 2.
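A cluster analysis of this general kind can be sketched in a few lines of standard-library Python. The distance measure (Euclidean), the linkage (average), the normalization to occurrences per 1,000 words, and all function names are assumptions chosen for illustration; the study itself tests several linkages and reads the results from dendrograms, whereas this sketch simply stops merging when a chosen number of clusters remains.

```python
import math


def relative_frequencies(counts, items, text_length):
    """Occurrences of each selected item per 1,000 words of one text."""
    return [1000 * counts.get(item, 0) / text_length for item in items]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage_clusters(vectors, k):
    """Agglomerative clustering: merge the two closest clusters until k remain.

    `vectors` maps a text label to its frequency vector. The distance between
    two clusters is the mean pairwise distance between their members.
    """
    clusters = [[label] for label in vectors]

    def dist(c1, c2):
        total = sum(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


def authors_cluster_correctly(clusters, author_of):
    """True if each cluster holds one author's texts and no author is split."""
    seen = []
    for cluster in clusters:
        authors = {author_of[text] for text in cluster}
        if len(authors) != 1:
            return False
        seen.append(authors.pop())
    return len(seen) == len(set(seen))
```

Setting k to the number of authors and passing the result to `authors_cluster_correctly` gives a rough stand-in for the judgment "all texts cluster correctly"; it is stricter in some cases and looser in others than a visual reading of a dendrogram.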

It seems useful to test the methods on another genre, as I did in previous work (Hoover, 2001), so my next corpus consists of the first 4,000 words of twenty-one contemporary literary critical articles by ten authors. Here, analyses based on frequent words and sequences each correctly cluster all of the texts once. Analyses based on collocations with spans of two, five, and ten words also succeed.
Analyses based on collocations thus seem to be quite effective in attributing texts to their authors in cases of known authorship, and can now be tested in an authorship simulation to see how well they work under conditions that more closely resemble true attribution problems. The simulation includes the fourteen narratives by six authors discussed above, adds four novels by two new authors, and then adds two “anonymous” novels, each known to be by one of the eight authors. Frequent sequences succeed for only six of the eight authors. Frequent words still fail to cluster Kipling’s texts correctly, but they do successfully cluster the four texts of the two new authors. They also consistently cluster one of the anonymous texts with Cather’s texts and the other with London’s. Analyses based on collocations with a span of four words are extremely effective and consistent: the 400, 500, 600, 700, and 800 most frequent correctly cluster all of the known texts, even when the graphs are strictly interpreted, as Fig. 3 shows. Like analyses based on frequent words, these also consistently cluster the anonymous texts with those of Cather and London. These identifications are correct. What makes these results even more impressive is the fact that four of the six added texts, including the two anonymous ones, are first-person rather than third-person narratives.

Figure 3.

The results of my study confirm what many researchers have found: analyses based on the frequencies of frequent words are quite effective in attributing texts to their authors. Analyses based on frequent sequences of words are also often effective, and are more effective under certain conditions, as I have shown elsewhere (Hoover, 2002). Frequent collocations, however, are often more effective than either words or sequences, producing the only completely correct attributions in some cases and producing more consistently correct attributions in others. The frequencies of frequent collocations clearly reflect important aspects of authorial style. Analyses based on them constitute a promising method of authorship attribution and may also prove useful in stylistic studies.

REFERENCES

J. N. G. Binongo and M. W. A. Smith. “The application of principal component analysis to stylometry.” Literary and Linguistic Computing. 1999. 14: 445–65.

J. F. Burrows. Computation into Criticism. Oxford: Clarendon Press, 1987.

J. F. Burrows and A. J. Hassall. “Anna Boleyn and the authenticity of Fielding's feminine narratives.” Eighteenth Century Studies. 1988. 21: 427–453.

J. F. Burrows. “‘A Vision’ as a revision.” Eighteenth Century Studies. 1989. 22: 551–65.

J. F. Burrows. “Computers and the study of literature.” Computers and Written Texts. Ed. C. S. Butler. Oxford: Blackwell, 1992. 167–204.

J. F. Burrows. “Not unless you ask nicely: the interpretive nexus between analysis and information.” Literary and Linguistic Computing. 1992. 7: 91–109.

J. F. Burrows and D. H. Craig. “Lyrical drama and the ‘turbid mountebanks’: styles of dialogue in Romantic and Renaissance tragedy.” Computers and the Humanities. 1994. 28: 63–86.

D. H. Craig. “Authorial attribution and computational stylistics: if you can tell authors apart, have you learned anything about them?” Literary and Linguistic Computing. 1999. 14: 103–113.

H. Craig. “Contrast and change in the idiolects of Ben Jonson characters.” Computers and the Humanities. 1999. 33: 221–40.

H. Craig. “Jonsonian chronology and the styles of A Tale of a Tub.” Re-Presenting Ben Jonson: Text, History, Performance. Ed. Martin Butler. Houndmills, England: MacMillan, St. Martin's, 1999. 210–32.

H. Craig. “The weight of numbers: common words and Jonson's dramatic style.” Ben Jonson Journal: Literary Contexts in the Age of Elizabeth, James and Charles. 1999. 6: 243–59.

R. S. Forsyth, D. I. Holmes, and Emily K. Tse. “Cicero, Sigonio, and Burrows: investigating the authenticity of the Consolatio.” Literary and Linguistic Computing. 1999. 14: 375–400.

D. I. Holmes, L. J. Gordon, and C. Wilson. “A Widow and her Soldier: Stylometry and the American Civil War.” Literary and Linguistic Computing. 2001. 16: 403–420.

D. I. Holmes, M. Robertson, and R. Paez. “Stephen Crane and the New-York Tribune: a case study in traditional and non-traditional authorship attribution.” Computers and the Humanities. 2001. 35: 315–331.

D. L. Hoover. “Statistical stylistics and authorship attribution: an empirical investigation.” Literary and Linguistic Computing. 2001. 16: 421–44.

D. L. Hoover. “New Directions in Statistical Stylistics and Authorship Attribution.” Association for Literary and Linguistic Computing and Association for Computers and the Humanities, Joint International Conference, Tübingen, Germany, July 24–28, 2002.

C. W. F. McKenna and A. Antonia. “The Statistical Analysis of Style: Reflections on Form, Meaning, and Ideology in the ‘Nausicaa’ Episode of Ulysses.” Literary and Linguistic Computing. 2001. 16: 353–373.

A. Q. Morton. Literary Detection: How to Prove Authorship and Fraud in Literature and Documents. New York: Scribner, 1978.

F. J. Tweedie, D. I. Holmes, and T. N. Corns. “The provenance of De Doctrina Christiana, attributed to John Milton: a statistical investigation.” Literary and Linguistic Computing. 1998. 13: 77–87.