Digital Humanities Abstracts

“'DNA' and Non-traditional Authorship Attribution: An Inclusive Model”
Joseph Rudman, Carnegie Mellon University, jr20@andrew.cmu.edu

“Anything a person writes contains the code of his intellectual DNA, or whatever you want to call it.” Webb 1994
“The greater the number of features and the more the features belong to different categories (e.g., syntactic structures, type of grammatical subject, inflexions, vocabulary, spelling, and so on) the stronger the case for shared authorship.” Eagleson 1989

INTRODUCTION:

For many years it has been evident from the literature that most non-traditional authorship attribution studies that rely on one style-marker, or on some other small number of them, do not carry the weight of scientific validity with the majority of other authorship attribution practitioners, with specialists in the field of the study, or with the general public. (In addition to Eagleson, see Banks and Rudman; also Rudman 1998.) During a talk on the "Style-Marker Mapping Project" at the ALLC-ACH 2000 conference in Glasgow, I mentioned in passing an attribution model based on a "DNA" concept. (Rudman 2000) The remark was illustrative and not "on topic." However, the audience picked up on it, and some of the ensuing questioning and discussion kept moving away from the Style-Marker Mapping Project. This paper presents a non-traditional authorship attribution model based on that "DNA" analogy. It is only an analogy -- a framework for explaining the techniques of the "Inclusive Model" -- and there are obvious fundamental differences between DNA and style. Because some of the terms in this paper may be unfamiliar to the expected audience, a clear and concise definition is given the first time each such term is used.

I. BACKGROUND AND DEFINITIONS

If we look at style as a living organism, style-markers are its genetic material, which makes the Style-Marker Mapping Project (Rudman 2000) analogous to the Human Genome Project. I would like to extend this biological analogy: the Inclusive Authorship Attribution Model is analogous to DNA analysis. The earliest reference to DNA and style that I have seen is Bailey's comparison of the tools used to decode the underlying makeup of the two -- X-ray diffraction for DNA, the computer for style. Bailey does not move towards a DNA model for stylistics. (Bailey) The lead quote by Webb is also quoted in Forsyth's dissertation, yet Forsyth does not use the intent of the quote to move towards a DNA model. (Forsyth) Since the mid-1980s I have been leaning towards a more inclusive attribution model that would utilize a large number of style-markers. Other researchers have also recognized the need to expand the number of style-markers in attribution studies. As the structure of DNA was decoded and the comparison methods were refined, DNA analysis became the analogous model of choice. I first mentioned the model at the ALLC-ACH Oxford conference in 1992. (Banks and Rudman) The thrust of that presentation was a statistical method of combining the results of different statistical tests on various style-markers.

This section briefly traces the evolution of the DNA model through various publications and presentations. Clear and concise definitions of the DNA autoradiogram are given. (Kirby) A brief explanation of why this model is necessary closes the section. (Willing)

II. THE MODEL

“Outline a method of analysis which will allow organization of these features [the entire range of linguistic features] so as to facilitate comparison of any one use of language with any other” (Carter, Crystal and Davy, and Darbyshire). McMenamin 1993
  • A) How the Inclusive Model differs from other models (e.g. multivariate models and Burrows' Delta Project). (Holmes, Burrows)
  • B) The DNA analogy is explicated.
    It is shown how each locus of the autoradiogram is equivalent to a different style-marker. The determination of each style-marker locus is discussed.
    Forsyth's suggestion at the Glasgow conference that a list of "proven" style-markers should be provided and used is discussed.
  • C) Visual Representation
    A method of visual representation of the results of the model is shown.
  • D) The following two statistical methods of combining each style-marker locus into a final answer are presented and discussed:
    • (1) If the style-markers that are used can be shown to be independent of one another (e.g. word length distribution, percentage of nouns starting sentences, type/token ratio) a procedure based on Fisher's method for combining significance probabilities from independent statistical tests can be used. (Fisher)
    • (2) If the style-markers that are used are not independent of each other (e.g. word length distribution, word length correlation, percentage of Latinate words) the statistical method employed by DNA researchers can be used.
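
The combination step in D(1) can be illustrated with a minimal sketch of Fisher's method for independent tests. It assumes that each style-marker locus has already produced its own significance probability. The function name `fisher_combined_p` and the three p-values in the usage lines are hypothetical, chosen only for illustration, and are not taken from the study. Under the null hypothesis the statistic X = -2 Σ ln(p_i) follows a chi-square distribution with 2k degrees of freedom, where k is the number of tests. Because 2k is always even, the tail probability has a closed form and no statistics library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i)
    and combined_p is the upper-tail chi-square probability with
    2 * len(p_values) degrees of freedom.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")

    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)

    # For an even number of degrees of freedom (2k), the chi-square
    # survival function is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical p-values, one per independent style-marker test
# (e.g. word length distribution, sentence-initial nouns, type/token ratio).
marker_p_values = [0.04, 0.10, 0.20]
statistic, combined_p = fisher_combined_p(marker_p_values)
print(f"Fisher statistic: {statistic:.3f}, combined p: {combined_p:.4f}")
```

The sketch is valid only when the tests are independent, as case (1) requires. For dependent style-markers, as in case (2), this combination would overstate the evidence and should not be used.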

CONCLUSION

The methods of determining DNA loci and style-marker loci differ. A single technique is employed to determine all of the DNA loci, whereas each style-marker locus is determined, for the most part, by a different experimental technique, and some of the style-marker loci are themselves the result of multivariate statistical analysis. The Inclusive Authorship Attribution Model promises a degree of acceptability not seen in most non-traditional attribution studies -- especially in studies of the type McMenamin describes as the "'Population Model' where there are no obvious authorship candidates, and texts from an entire population of possible authors are considered against texts by one suspected author." (McMenamin)

Preliminary Bibliography

Richard W. Bailey. “The Future of Computational Stylistics.” ALLC Bulletin. 1979. 7: 4-11.
David J. Balding and Peter Donnelly. “Inference in Forensic Identification.” JOURNAL OF THE ROYAL STATISTICAL SOCIETY A. 1995. 158: 21-53.
David L. Banks and Joseph Rudman. “Questionable Attribution in the Canon of Daniel Defoe: A Study of Techniques.” ALLC-ACH'92 Conference. Oxford University, April 7, 1992.
John Burrows. “Questions of Authorship: Attribution and Beyond. A Lecture Delivered on the Occasion of the Roberto Busa Award.” ACH-ALLC01 Conference. New York University, New York, June 14, 2001.
Robert D. Eagleson. “Linguist for the Prosecution.” WORDS AND WORDSMITHS. Ed. Geraldine Barnes et al. Sydney: The University of Sydney Press, 1989. 22-31.
R. A. Fisher. STATISTICAL METHODS FOR RESEARCH WORKERS. London: Hafner, 1969.
Richard S. Forsyth. “Stylistic Structures: A Computational Approach to Text Classification.” University of Nottingham, 1995.
David I. Holmes. “Authorship Attribution and the Book of Mormon: A Case Study in Stylometric Techniques.” University of London, King's College, 1990.
David I. Holmes. “Vocabulary Richness and the Prophetic Voice.” University of London, King's College, 1990.
Lorne T. Kirby. DNA FINGERPRINTING: AN INTRODUCTION. New York: W. H. Freeman, 1992.
Gerald R. McMenamin. FORENSIC STYLISTICS. Amsterdam: Elsevier, 1993.
Joseph Rudman. “The Style-Marker Mapping Project: A Rationale and Progress Report.” ALLC/ACH 2000 Conference, University of Glasgow, Scotland, July 25, 2000.
Joseph Rudman. “The State of Authorship Attribution Studies: Some Problems and Solutions.” COMPUTERS AND THE HUMANITIES. 1997. 31: 351-365.
Charles Webb. “Interview in.” THE INDEPENDENT MAGAZINE. 1994: 35.
Richard Willing. “Mismatch Calls DNA Tests Into Question.” USA TODAY. 2000: 3A.