Digital Humanities Abstracts

“Stylometric Analysis Using Discriminant Analysis: A Study of Sherlock Holmes Stories”
Peter Smith, The City University, London


Stylometric analysis may be defined as the quantitative analysis of the recurrence of particular features of style for the purpose of deducing the authorship and/or chronology of texts. Methods of stylometric analysis may be broadly subdivided into lexical and non-lexical approaches (Binongo 2000). The approach taken in this study is to analyse the frequencies of function words by means of discriminant analysis. The use of function words has often been criticised by stylistic experts as lacking in scientific validity (see, for example, Waith 1984). An ultimate aim of this work is to establish scientific foundations for the use of function words in stylometric analysis. One promising line of enquiry is the study of aphasic patients, where it has been suggested that function words may be processed by the brain separately from lexical words (Garrett 1982); if so, we may have less conscious control over our use of them. Three plausible variables for stylometric analysis to consider are chronology, genre and author. A writer's stylistic tendencies cannot be assumed to remain constant across an entire writing career, hence when a work was written is just as important as who wrote it. Similarly, it is unreasonable to assume that style will remain fixed across different genres. One aim of this study is to fix two of the three variables as closely as possible while studying the effect of the third.

The Basis for this Study

The primary aim of this study is to examine the effectiveness of discriminant analysis of function words as a means of distinguishing authors, while examining the effects of genre and time. The Sherlock Holmes stories of Arthur Conan Doyle were chosen as subjects for study because they are freely available in machine-readable form and because several comparable texts were written at the same time in the same basic genre. Conan Doyle was also a prolific writer who wrote different types of fiction very close together in time. The Sherlock Holmes stories were originally serialised in The Strand Magazine. However, when "The Hound of the Baskervilles" appeared it was plagued with controversy. Conan Doyle had insisted that his name appear jointly with that of a Mr. Fletcher Robinson. It has been suggested that Fletcher Robinson had written part of the story, in which case stylometric analysis might be able to throw light on this mystery. However, if Conan Doyle was merely given an idea for a story, then there is not much that can be discovered by this technique. Nine texts were chosen for analysis, which can be divided into three equal groups:
  • Sherlock Holmes stories, including “The Hound of the Baskervilles”, as close in time as possible to it.
  • Other works written by Conan Doyle at the same time as the three Sherlock Holmes stories.
  • Other works by different writers in the same or similar genre (including some published in The Strand Magazine).
The sets of Sherlock Holmes stories that were written closest in time were chosen as comparand texts (note that all stories were serialised and published monthly). Thus the three Sherlock Holmes texts were: The Memoirs of Sherlock Holmes (1892/3), The Hound of the Baskervilles (1902) and The Return of Sherlock Holmes (1904) (Conan Doyle 1986). The three comparand texts written by Conan Doyle around the same time were: “The Parasite” (1894), The Adventures of Gerard (1903) and Sir Nigel (1906). The three texts written in a similar genre around the same time but by different authors were: “The Ponsonby Diamonds” (Meale and Halifax 1894, published in The Strand Magazine), The Old Man in the Corner (Baroness Orczy 1901) and The Scarlet Pimpernel (Baroness Orczy 1905).
The texts were prepared in a manner that follows very closely the technique employed by Smith (1993). The 20 most frequently occurring function words from "The Hound of the Baskervilles" were chosen. A discriminant analysis was then run using SPSS 10.00, with the three texts under comparison forming three groups. This produced a "nearness" metric for the texts. A series of tests will be presented that provide a strong basis for the use of discriminant analysis. The tests demonstrate the ability of this technique to separate texts by author, by genre (when author is fixed), or even by time. Control tests were also carried out to ensure that the separations were not arbitrary but reflected real variation between texts.
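The profiling step described above can be sketched as follows. This is a minimal Python illustration, not the procedure actually used in the study, which followed Smith (1993) and was run in SPSS. The candidate word list, the function names and the block size are all assumptions introduced here, since the abstract does not list the words chosen or say how the texts were divided into samples.

```python
import re
from collections import Counter

# Hypothetical candidate list: the abstract does not give the words actually used.
CANDIDATE_FUNCTION_WORDS = {
    "the", "and", "of", "to", "a", "in", "that", "it", "was", "he",
    "i", "his", "with", "as", "had", "for", "but", "at", "not", "on",
    "is", "be", "which", "by", "this", "from", "have", "you", "her", "my",
}

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_function_words(reference_text, n=20):
    """Return the n most frequent candidate function words in a reference text."""
    counts = Counter(
        token for token in tokenize(reference_text)
        if token in CANDIDATE_FUNCTION_WORDS
    )
    return [word for word, _ in counts.most_common(n)]

def rate_profile(tokens, words):
    """Occurrences of each chosen word per 1,000 tokens."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [1000.0 * counts[word] / total for word in words]

def block_profiles(text, words, block_size=1000):
    """Split a text into consecutive full-size blocks and profile each one.

    The block size is an assumed value; a trailing partial block is dropped.
    """
    tokens = tokenize(text)
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    return [rate_profile(block, words) for block in blocks if len(block) == block_size]
```

Each row returned by `block_profiles` is a vector of function word rates for one sample of text. Vectors of this kind, labelled by the text they come from, are the sort of input a discriminant analysis would take.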

Discussion and Further Work

The starting point for this research was to investigate whether there is any evidence to support the thesis that Conan Doyle may not have written all of "The Hound of the Baskervilles". The results provide no support for this thesis. The technique tested for authorship while attempting to control two other major variables: time and genre. It distinguished texts by author consistently. It also appeared capable of separating texts by genre, and it separated Conan Doyle's works by time as well (The Hound of the Baskervilles was closer to The Return of Sherlock Holmes, written only two years later, than it was to The Memoirs of Sherlock Holmes, written some nine years earlier). One possible reason why the Sherlock Holmes stories differ from the story used as a comparand text in a different genre is that they employ a considerable amount of spoken dialogue, whereas the other story contains far more narrative text. This has yet to be investigated. Principal Components Analysis (PCA) has been used successfully in stylometric analysis (Burrows 1987; Binongo 2000; Binongo and Smith 1999). Binongo (2000) seems overly pessimistic in his dismissal of discriminant analysis, arguing against its use because of its assumption of multivariate normality. However, his work reveals several worrying aspects of the use of principal components analysis. Firstly, the most frequent words tend to have the least discriminatory power, and the first principal component may not be able to reveal authorship accurately. If the frequencies of function words are standardised, this may in turn cause more frequent words to be swamped by less frequent words. Binongo and Smith (1999) demonstrated that PCA was capable of distinguishing differences in genre in a comparison between the essays and plays of Oscar Wilde.
In a later study (Binongo and Smith 1999) they also demonstrated the success of this technique in a comparison of the works of two contemporaneous American authors, Nathaniel Hawthorne and Herman Melville, using 25 function words. Principal Components Analysis was also applied to the texts used in this study. Although it appeared to differentiate reliably between three different authors, the principal components appeared to be less reliable in the genre tests, and especially where two groups were drawn from the same text. In this case the Kaiser-Meyer-Olkin (KMO) statistic, which measures sampling adequacy (Kaiser 1970), indicated that the extracted components might be unreliable. Kaiser (1974) recommends accepting only values greater than 0.5, and even values between 0.5 and 0.7 are considered mediocre (see also Field 2000). When PCA was applied to the same text split into two groups, KMO scores of between 0.4 and 0.55 were observed. Experiments with the number of function words were also carried out. Increasing the set of function words from 20 to 25 or 30 made only a marginal difference. Running a MANOVA test on the function word data allowed us to identify function words that were unreliable, and re-running the discriminant analysis with these words removed produced a slight improvement. As the number of function words was progressively decreased, the sensitivity of the test diminished. The drawback of this approach is that it requires large amounts of text to produce reliable results (something approaching the size of a novella or short novel as a minimum). The test will therefore not be very sensitive to the insertion or interleaving of text by different authors. If a form of dimension reduction can be established, then it might be possible to treat a text as a time series and drag a window over the text to look for anomalous sections that might indicate a change of author.
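The windowing idea can be illustrated with a short sketch. It is a stand-in under stated assumptions, not a tested method: in place of the dimension reduction the text calls for, it simply tracks the rate of a single word, and the window size, step and threshold are hypothetical values chosen for illustration.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sliding_rates(text, word, window=2000, step=500):
    """Rate of `word` per 1,000 tokens in each overlapping window.

    Tracking one word is an assumption made here; a real application
    would track a reduced-dimension score built from many function words.
    """
    tokens = tokenize(text)
    hits = [1 if token == word else 0 for token in tokens]
    rates = []
    for start in range(0, len(tokens) - window + 1, step):
        rates.append(1000.0 * sum(hits[start:start + window]) / window)
    return rates

def flag_anomalies(rates, threshold=2.0):
    """Indices of windows more than `threshold` standard deviations from the mean."""
    n = len(rates)
    if n < 2:
        return []
    mean = sum(rates) / n
    variance = sum((r - mean) ** 2 for r in rates) / (n - 1)
    if variance == 0:
        return []
    sd = variance ** 0.5
    return [i for i, r in enumerate(rates) if abs(r - mean) / sd > threshold]
```

A flagged window marks only a stretch of text whose usage departs from the rest. Whether such a departure reflects a change of author, rather than a shift between dialogue and narrative, would still need to be established.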
A further way in which this work can be developed is to examine the function words themselves and to ask why authors vary in their usage of them. Some function words are used as what Schiffrin (1987) calls discourse markers, and as higher-level indicators of structure their use may well vary from one writer to the next. A linguistic basis for the variation in function words needs to be established, if only to demonstrate the scientific credentials of stylometric analysis.


J. N. G. Binongo and W. Smith. “The Application of principal components analysis to stylometry.” Literary & Linguistic Computing. 1999. 14: 445-466.
J. N. G. Binongo. “Stylometry and its implementation by Principal Components Analysis.” University of Ulster, Co. Antrim, Northern Ireland, UK, 2000.
J. F. Burrows. Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method. Oxford: Clarendon, 1987.
Arthur Conan Doyle. The Illustrated Sherlock Holmes. London: Omega Books, 1986.
A. Field. Discovering Statistics using SPSS for Windows. London: Sage Publications, 2000.
M. F. Garrett. “Production of Speech: Observations from Normal and Pathological Language Use.” Normality and Pathology in Cognitive Functions. Ed. A. Ellis. London: Academic Press, 1982.
H. F. Kaiser. “A Second Generation little jiffy.” Psychometrika. 1970. 35: 401-415.
H. F. Kaiser. “An Index of factorial simplicity.” Psychometrika. 1974. 39: 31-36.
D. Schiffrin. Discourse Markers. Cambridge: Cambridge University Press, 1987.
W. Smith. “Edmund Ironside.” Notes and Queries. 1993. 238: 202-5.
Titus Andronicus: The Oxford Shakespeare. Ed. E. M. Waith. Oxford University Press, 1984.