“New Directions in Statistical Stylistics and Authorship
Attribution”
David Hoover
New York University
david.hoover@nyu.edu
This presentation will describe an investigation that compares the relative
effectiveness and accuracy of multivariate analysis (cluster analysis) of the
frequencies of very frequent words and the frequencies of very frequent word
sequences in correctly attributing texts to authors. Cluster analyses based on
the most frequent words are fairly accurate for corpora of texts by known
authors, whether the texts are 30,000- or 10,000-word sections of modern British
and American novels, or 4,000-word sections of contemporary literary critical
texts. They are, however, only rarely completely accurate; furthermore, when
small groups of problematic texts taken from the corpora are used in simulated
authorship studies, analyses based on frequent words rather consistently fail to
cluster them correctly. But when frequent word sequences are used rather than
frequent words or in addition to them, the analyses often improve in accuracy,
sometimes quite significantly, suggesting that analyses based on frequent word
sequences constitute improved tools for authorship attribution and statistical
stylistic studies.
One of the most popular places to search for a "wordprint" that can characterize
the style of an author has been among the frequencies of the most frequent words
of the language. In his seminal work on Jane Austen (1987), John F. Burrows
demonstrated fairly convincingly that the frequencies of extremely frequent
words like the, and, of, a, and to, can often be used to distinguish different
authors, novels, and even different characters within a single novel. In spite
of their intuitively insignificant nature, such words can even have interesting
and potentially significant stylistic effects. This seems surprising when we
remember that the five words above normally constitute roughly 20% of the word
tokens in a novel. Yet their high frequency and the extreme unlikelihood that
authors can or even wish to consciously control them suggest habitual or
routinized use that may reflect an author's style across all his or her texts,
in spite of differing subjects, themes, and points of view. Because of this, and
because their frequencies often vary significantly among different authors,
texts, and characters, in spite of their uniformly high frequencies (see
Burrows, 1987: 3-4), frequent words have been popular targets for various kinds
of multivariate analysis (see also, Burrows and Hassall, 1988, and Burrows,
1992).
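As a minimal sketch of the kind of measurement involved, the relative frequencies of the most frequent words in a text can be computed as below. The tokenizer (lowercased runs of letters and apostrophes), the function names, and the sample sentence are illustrative assumptions, not details taken from Burrows's work.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a word is a lowercased run of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def top_word_frequencies(text, n=5):
    """Return the n most frequent words with their share of all word tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, count / total) for word, count in counts.most_common(n)]

# Hypothetical sample text, used only to exercise the function.
sample = "The cat and the dog sat on the mat, and the bird sang to the cat."
for word, share in top_word_frequencies(sample, n=3):
    print(f"{word}: {share:.3f}")
```

Applied to each text in a corpus, profiles of this kind supply the variables on which a multivariate analysis operates.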
Much recent work with multivariate analysis of the frequencies of frequent words
has produced interesting and significant results, especially in the field of
authorship attribution (see Holmes, 1992; Holmes and Forsyth, 1995; Tweedie,
Holmes, and Corns, 1998). As I have shown in a recent study (Hoover, 2001),
however, cluster analysis of the frequencies of the most frequent words is very
often not completely accurate in attributing texts to their authors when
performed on a corpus of texts by known authors.
As interesting as authorship attribution is, multivariate analysis is, I would
argue, of potentially more interest in statistical stylistics and corpus
stylistics. If techniques can be found that can accurately distinguish authors
from each other, those techniques should be able to tell us something
significant about the styles of those authors. To further the search for more
accurate analytic methods, I have been evaluating cluster analyses based on
frequent word sequences (defined simply as groups of contiguous words) rather
than, or combined with, frequent words. (The idea for this project came out of a
discussion with Gary Shawver of the Humanities Computing group at New York
University about the possibility of looking at frequent collocations.) One
reason that sequences are attractive is that the order of words within them
provides information that cannot be retrieved from the frequencies of the
constitutive elements alone.
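The point about word order can be illustrated with a small sketch. The two toy texts and the function names here are hypothetical, and sequences are taken to be two contiguous words purely for the example; the texts have identical word frequencies but different sequence frequencies.

```python
from collections import Counter

def word_counts(tokens):
    # Frequency of each individual word.
    return Counter(tokens)

def sequence_counts(tokens, length=2):
    # Frequency of each run of `length` contiguous words.
    return Counter(
        tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)
    )

# Hypothetical toy texts containing exactly the same words in a different order.
text_a = "the dog bit the man".split()
text_b = "the man bit the dog".split()

same_words = word_counts(text_a) == word_counts(text_b)
same_sequences = sequence_counts(text_a) == sequence_counts(text_b)
print(same_words, same_sequences)  # True False
```

Word frequencies alone cannot separate the two texts, while the sequence counts differ ("dog bit" occurs only in the first, "man bit" only in the second).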
My investigation has shown that analyses involving frequent word sequences are
often superior to analyses of the frequencies of frequent words in attributing
known texts to known authors. Some analyses using frequent sequences produce
results that are completely accurate where frequent words alone fail, and some
analyses using combinations of frequent sequences and frequent words are more
effective than either by themselves, again sometimes producing completely
accurate attributions in relatively intractable cases.
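One way to combine the two kinds of evidence is to place the relative frequencies of the most frequent words and of the most frequent sequences in a single vector per text and cluster those vectors. The sketch below does this for a toy corpus. The Euclidean distance, the average-linkage merging rule, the two-word sequence length, the cutoff of four items per list, and the sample texts are all assumptions made for illustration; this abstract does not specify the clustering procedure actually used.

```python
import math
from collections import Counter
from itertools import combinations

def grams(tokens, length):
    # All runs of `length` contiguous words (length 1 gives single words).
    return [tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)]

def most_frequent(texts, length, n):
    # The n most frequent sequences of the given length across the corpus.
    total = Counter()
    for tokens in texts.values():
        total.update(grams(tokens, length))
    return [g for g, _ in total.most_common(n)]

def features(tokens, vocabulary, length):
    # Relative frequency in one text of each sequence in the vocabulary.
    found = grams(tokens, length)
    counts = Counter(found)
    return [counts[g] / len(found) for g in vocabulary]

def cluster(vectors, k):
    # Assumed method: average-linkage agglomerative clustering, Euclidean distance.
    clusters = [[name] for name in vectors]

    def dist(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(math.dist(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters

# Hypothetical toy corpus: two "authors" (A and B), two texts each.
texts = {
    "A1": "the dog bit the man and the dog ran".split(),
    "A2": "the dog bit the boy and the dog ran".split(),
    "B1": "of the sea and of the sky and of the land".split(),
    "B2": "of the sea and of the sun and of the land".split(),
}
words = most_frequent(texts, length=1, n=4)
pairs = most_frequent(texts, length=2, n=4)
vectors = {
    name: features(tokens, words, 1) + features(tokens, pairs, 2)
    for name, tokens in texts.items()
}
result = cluster(vectors, k=2)
print(result)
```

Passing only the word features, or only the sequence features, to `cluster` gives the words-only and sequences-only analyses that the combined analysis is compared against.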
I begin with an initial corpus of twenty-nine 30,000-word sections of modern
British and American novels by fourteen authors. I go on to analyze a subset of
this corpus, consisting of twenty novels by eight authors, limit the analysis
still further to a corpus consisting of the sixteen third-person novels
extracted from among the twenty novels, and then to the pure narrative of the
same sixteen novels. I turn next to a very different kind of corpus, analyzing
twenty-five contemporary articles of literary criticism, to test whether
frequent collocations produce improved results for other genres. Finally,
extracting the texts of two problematic authors from the sixteen pure narratives
and two more from the literary criticism corpus, I test the
effectiveness of the analysis of frequent sequences under circumstances that
more closely resemble traditional authorship problems.
Although analyses based on frequent sequences or combinations of frequent
sequences and frequent words are not universally more effective than those based
on frequent words alone, and still fail to achieve completely correct results in
some cases, they do seem promising as an additional tool in authorship
attribution, and, potentially, in stylistic studies as well. Figures 1 and 2
show the improvement in two analyses when frequent sequences are used instead
of, or in combination with, frequent words. Figure 3 shows that, in some cases,
frequent sequences give correct results, even when frequent words uniformly
fail.
Bibliography
J. F. Burrows. Computation into Criticism. Oxford: Clarendon Press, 1987.
J. F. Burrows. “Computers and the Study of Literature.” Computers and Written Texts. Ed. Christopher S. Butler. Oxford: Blackwell, 1992. 167-204.
J. F. Burrows and A. J. Hassall. “Anna Boleyn and the Authenticity of Fielding's Feminine
Narratives.” Eighteenth Century Studies. 1988. 21: 427-453.
Computers and Written Texts. Ed. Christopher S. Butler. Oxford: Blackwell, 1992.
D. I. Holmes. “A Stylometric Analysis of Mormon Scripture and Related
Texts.” Journal of the Royal Statistical Society (A). 1992. 155: 91-120.
D. I. Holmes and R. S. Forsyth. “The Federalist Revisited: New Directions in Authorship
Attribution.” Literary & Linguistic Computing. 1995. 10: 111-127.
D. L. Hoover. “Making Use of Statistical Measures of Style.” MLA Convention, San Francisco, December 28, 1998.
D. L. Hoover. Language and Style in The Inheritors. Lanham, MD: University Press of America, 1999.
D. L. Hoover. “Statistical Stylistics and Authorship Attribution: an
Empirical Investigation.” Literary & Linguistic Computing. 2001. 16: 421-44.
F. J. Tweedie, D. I. Holmes, and Thomas N. Corns. “The Provenance of De Doctrina Christiana, Attributed to
John Milton: A Statistical Investigation.” Literary & Linguistic Computing. 1998. 13: 77-87.