Digital Humanities Abstracts

“On Determining a Valid Text for Non-Traditional Authorship Attribution Studies: Editing, Unediting, and De-Editing”
Joseph Rudman Carnegie Mellon jr20@andrew.cmu.edu

INTRODUCTION:

The work’s material history since its inception, the vast and largely uncharted alterations imposed by the history and by the mediation of generation upon generation of printers, editors, publishers—this is a relativism we are prone to ignore, but ignore at our peril.(Marcus 1996)
The literary texts often are not homogenous since they may comprise dialogues, narrative parts, etc. An integrated approach, therefore, would require the development of text sampling tools for selecting the parts of the text that best illustrate an author’s style. (Stamatatos et al. 2001)
Most non-traditional authorship attribution studies place too much emphasis on statistics, stylistics, and the computer, and too little on the integrity and validity of the primary data: the text itself. It is intuitively obvious, and easily shown empirically, that a study of the patterns of an author’s stylistic usage (e.g. Daniel Defoe’s) will be systematically degraded by each interpolation of non-Defoe text, and even by each interpolation of Defoe text of a different genre or a significantly different time period. This paper concerns one important element in the empirical methodology of a valid non-traditional authorship attribution study: the preparation of the text for stylistic and statistical analysis through unediting, de-editing, and editing. The emphasis of this presentation is on prose analysis, with some peripheral treatment of drama and poetry.
  • I. BACKGROUND AND DEFINITIONS
    • A. Why a valid text is necessary should not even be asked. No valid experiment can be done if the input data is flawed—garbage in, garbage out!
      Too many practitioners simply grab a text from any available source, without any thought to its pedigree (e.g. Khmelev and Tweedie’s “Using Markov Chains for Identification of Writers”).
      Are undertakings such as Project Gutenberg or the Oxford Text Archive, with their easily available machine-readable texts, a boon or a bane to non-traditional authorship studies? This question is explored in some detail.
    • B. Selecting a starting text
      The validity of using texts from the oral tradition and the scribal tradition is discussed.
      Before any manipulation and analysis of a text is carried out, a valid starting text must be acquired that fulfills many necessary requirements. This selection is primarily bibliographically driven. If a practitioner is not savvy in the bibliographical arts, a collaborator who is should be recruited.
      Examples of bad starting texts causing problems are given (e.g. Peng and Hengartner’s “Quantitative Analysis of Literary Styles”).
      If you cannot obtain a valid text, do not do the study.
    • C. Unediting—getting back to the state of “not yet edited”
      De-editing—removing selected text
      Editing—changing (preparing) a text for statistical analysis
  • II. EXPLICATION
    The statement, “each age, each author, each study demands a different mixture of the following particulars,” is discussed.
    • A. Unediting
      As a rule, the closest text to the holograph should be found and used.
      • 1. Editorial interpolation
        • a. Filled in lacunae
        • b. Marginal notation
        • c. ‘Changes’ in the text
        • d. Critical editions
      • 2. Printer interpolation
        For the Printer is a beast, and understands nothing I can say to him of correcting the press. Dryden (Ward p. 97)
        • a. Catchwords (the first word of the next leaf or gathering)
        • b. Signatures (combinations of letters and numerals used something like catchwords)
        • c. Removing obvious typesetting mistakes (a slippery slope)
          • i. ‘f’ for the long ‘s’
          • ii. Double words (e.g. ‘the the’, ‘was was’)
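The two typesetting items under II.A.2.c can be illustrated with a minimal Python sketch. It is only one possible approach, not the procedure used in any of the studies discussed: it flags candidates for human review rather than changing the text, since the outline itself calls this step a slippery slope. The function names and the `lexicon` argument (a set of accepted word forms supplied by the practitioner) are assumptions made for illustration.

```python
import re
from itertools import product

# An immediately repeated word, e.g. "the the" or "was was".
DOUBLED = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def find_doubled_words(text):
    """Return (offset, word) pairs for immediately repeated words.

    Nothing is deleted: some repetitions ("had had") are legitimate,
    so each hit still needs an editorial decision.
    """
    return [(m.start(), m.group(1)) for m in DOUBLED.finditer(text)]


def long_s_candidates(token, lexicon):
    """Return lexicon words reachable by reading some 'f' in token as 's'.

    `lexicon` is a hypothetical, practitioner-supplied set of accepted forms.
    A token already in the lexicon is left alone.
    """
    if token in lexicon:
        return []
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    found = set()
    # Try every combination of keeping 'f' or reading it as 's'.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate != token and candidate in lexicon:
            found.add(candidate)
    return sorted(found)
```

A token that yields more than one candidate, or a doubled word that may be intentional, is exactly the kind of case that should be settled against the printed source and recorded, not resolved automatically.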
    • B. De-editing
      • 1. Quotes
        • a. Factual, unattributed
        • b. Factual, attributed
        • c. Self quotes from earlier writings
      • 2. Plagiarism
        • a. Direct copy
        • b. Paraphrasing
        • c. Imitation
      • 3. Collaboration
        • a. Sectional
        • b. Phrasal
        • c. Word level
        • d. Ghostwriting
      • 4. Genre
        • a. Poetry, prose, drama, letters, etc.
        • b. Mixture (e.g. verse drama)
      • 5. Graphs and Numbers
        • a. Tables
        • b. Lists
        • c. Arabic and Roman numerals
      • 6. Guide words
        • a. Titles—chapter headings—the end word ‘Finis’
        • b. Marginal annotation
      • 7. Foreign Languages
        • a. Sentence level and greater
        • b. Phrase or word level
      • 8. Translations
        • a. Verbatim
        • b. Concepts
      • 9. Examples of items de-edited (or not de-edited) incorrectly by practitioners
        • a. Biblical quotes
        • b. Titles in direct apposition
        • c. Numbers that are spelled out
        • d. Words with an initial capital
    • C. Editing
      • 1. Encoding the text
        • a. Why (e.g. homographic forms)
        • b. TEI
      • 2. Regularizing
        • a. Spelling
        • b. Contracted forms (simple, compound)
        • c. Hyphenation
        • d. Masked words (e.g. ‘D_ _ _ e’ for ‘Defoe’)
      • 3. Lemmatizing
        • a. Pro
        • b. Con
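Regularizing (II.C.2) can be sketched as a table-driven substitution that logs every change it makes. The sketch below is a minimal illustration under stated assumptions, not a recommended tool: the function name `regularize` and the sample table entries are hypothetical, matching is case-sensitive, and straight apostrophes are assumed in the transcription. Each table entry, including the expansion of a masked word, stands for an editorial decision that would have to be justified for the particular author and period.

```python
import re


def regularize(text, table):
    """Replace whole tokens listed in `table` and log every change.

    Returns (new_text, log), where log holds (offset, original, replacement)
    tuples; offsets refer to the text before regularizing.
    """
    if not table:
        return text, []
    # Longest keys first so a longer form is not pre-empted by a shorter one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    log = []

    def replace(match):
        original = match.group(0)
        log.append((match.start(), original, table[original]))
        return table[original]

    return pattern.sub(replace, text), log


# Hypothetical, study-specific decisions (spelling, contraction, masked word).
table = {"publick": "public", "tho'": "though", "D_ _ _ e": "Defoe"}
new_text, log = regularize("Mr. D_ _ _ e, tho' a publick man", table)
```

Keeping the log alongside the regularized text means the unregularized forms remain recoverable, which matters if spelling or contraction habits are later treated as style markers in their own right.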
    • D. Special Problems in Drama and Poetry
      • 1. Stage directions
      • 2. The ‘age’ dependency of transmission and technique.
  • III. SOME EXAMPLES
    Studies that are compromised by mistakes of commission and/or omission in editing, unediting, or de-editing.
    • A. Historia Augusta
      • 1. Twelve individual studies
    • B. Shakespeare
      • 1. Elliott and Valenza
      • 2. Foster
      • 3. Horton
    • C. Defoe
      • 1. Hargevik
      • 2. Rothman
  • IV. CONCLUSION
    • 1. Some items that are de-edited are valid style markers in their own right (e.g. Latin phrases, different genres) and should be treated as such in a parallel study.
    • 2. No matter which text is selected, the practitioner must disclose which text was used and everything that was done to it.
    • 3. The same care must be taken with every text in the study—the anonymous text, the suspected author’s text, and all of the control texts.
    • 4. If valid texts cannot be located and correctly edited, unedited, and de-edited, do not do the study.
    • 5. A valid text does not guarantee a valid study. However, a non-valid text guarantees a non-valid study.
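
Conclusions 1 and 2 suggest keeping, for every text in a study, a record of the source used, each operation applied, and the material removed. The following Python sketch shows one way such a record might look. The class name `PreparedText`, its fields, and the category label are assumptions introduced for illustration; the spans to remove are taken as given, since identifying them is the bibliographical work described above.

```python
from dataclasses import dataclass, field


@dataclass
class PreparedText:
    source: str                                      # description of the starting text
    text: str
    operations: list = field(default_factory=list)   # disclosed steps, in order
    removed: dict = field(default_factory=dict)      # category -> removed passages

    def de_edit(self, spans):
        """Remove (start, end, category) spans from the current text.

        Removed passages are kept by category so they can be analyzed
        in a parallel study instead of being discarded.
        """
        ordered = sorted(spans)
        kept, cursor = [], 0
        for start, end, category in ordered:
            if start < cursor or end < start:
                raise ValueError("spans must be well-formed and non-overlapping")
            kept.append(self.text[cursor:start])
            self.removed.setdefault(category, []).append(self.text[start:end])
            cursor = end
        kept.append(self.text[cursor:])
        self.text = "".join(kept)
        self.operations.append(("de-edit", ordered))


# Hypothetical example: remove a Latin phrase and keep it for separate study.
sample = PreparedText("hypothetical edition", "He said carpe diem and left.")
sample.de_edit([(8, 19, "foreign language")])
```

The same record would be kept for the anonymous text, the suspected author’s text, and every control text, so that the preparation of each can be disclosed and compared.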

REFERENCES

Richard D. Altick and John J. Fenstermaker. The Art of Literary Research. New York: W.W. Norton & Company, 1993.
John Burrows. “Questions of Authorship: Attribution and Beyond. A Lecture Delivered on the Occasion of the Roberto Busa Award.” ACH-ALLC01 Conference. New York University, New York, June 14, 2001.
Ward E. Y. Elliott and Robert J. Valenza. “So Many Hardballs, So Few Over the Plate: Conclusions From Our ‘Debate’ With Donald Foster.” Computers and the Humanities. 2002. 36: 450–460.
Don Foster. Author Unknown: On the Trail of Anonymous. New York: Henry Holt and Company, 2000.
Bertrand A. Goldgar. “Imitation and Plagiarism: The Lauder Affair and Its Critical Aftermath.” Studies in Literary Imagination. 2001. 34: 1–16.
D. C. Greetham. Textual Scholarship: An Introduction. New York: Garland, 1992.
Gregory Grefenstette and Pasi Tapanainen. “What is a Word, What is a Sentence? Problems of Tokenization.” Proceedings of the 3rd International Conference on Computational Lexicography. Budapest: Research Institute for Linguistics, Hungarian Academy of Sciences, 1994.
Stieg Hargevik. The Disputed Assignment of “Memoirs of an English Officer” to Daniel Defoe. Stockholm: Almqvist and Wiksell, 1974.
David I. Holmes, et al. “A Widow and Her Soldier: Stylometry and the American Civil War.” Literary and Linguistic Computing. 2001. 16: 403–420.
Thomas B. Horton. “The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher.” University of Edinburgh, 1987.
Dmitri V. Khmelev and Fiona J. Tweedie. “Using Markov Chains for Identification of Writers.” Literary and Linguistic Computing. 2001. 16: 299–307.
Alexander Lindey. Plagiarism and Originality. New York: Harper and Brothers, 1952.
Leah S. Marcus. “Afterword: Confessions of a Reformed Uneditor.” The Renaissance Text: Theory, Editing, Textuality. Manchester: Manchester University Press, 2000. 211–216.
Leah S. Marcus. Unediting the Renaissance: Shakespeare, Marlowe, Milton. London: Routledge, 1996.
Maximillian E. Novak. “The Defoe Canon: Attribution and De-attribution.” Huntington Library Quarterly. 1997. 59: 83–104.
Roger D. Peng and Nicolas W. Hengartner. “Quantitative Analysis of Literary Styles.” The American Statistician. 2002. 56: 175–185.
Project Gutenberg.
Pat Rogers. The Text of Great Britain: Theme and Design in Defoe’s ‘Tour’. Cranbury, NJ, 1998.
Irving N. Rothman. “Defoe De-Attributions Scrutinized Under Hargevik Criteria: Applying Stylometrics to the Canon.” Papers of the Bibliographical Society of America. 2000. 94: 375–398.
Joseph Rudman. “The State of Authorship Attribution Studies: Some Problems and Solutions.” Computers and the Humanities. 1998. 31: 351–365.
Joseph Rudman. “Non-Traditional Authorship Attribution Studies in the Historia Augusta: Some Caveats.” Literary and Linguistic Computing. 1998. 13: 151–157.
Eliot Slater. The Problem of “The Reign of King Edward III”: A Statistical Approach. Cambridge: Cambridge University Press, 1988.
E. Stamatatos, et al. “Computer-Based Authorship Attribution Without Lexical Measures.” Computers and the Humanities. 2001. 35: 193–214.
Text Encoding Initiative.
James Thorpe. Watching the Ps & Qs: Editorial Treatment of Accidentals. Lawrence, Kansas: University of Kansas Printing Service, 1971.
The Letters of John Dryden: With Letters Addressed to Him. Ed. Charles E. Ward. Durham, NC: Duke University Press, 1942.
David S. Williams. Stylometric Authorship Studies in Flavius Josephus and Related Literature. Lewiston, New York: The Edwin Mellen Press, 1992.