“Dating Dickinson: an Experimental Approach to Stylochronometry”
Constantina Stamou
University of Luton, UK
Richard S. Forsyth
University of Luton, UK
Stylometry is the statistical analysis of literary style, whose two primary
applications are authorship attribution and chronological problems. It
originated in 1851 when Augustus de Morgan suggested that it is possible to
settle authorship by determining if one text "does not deal in longer words"
than another (Holmes, 1998). Stylometry is based upon the notion that it is
possible to detect an author's 'signature' by examining quantifiable features of
written texts. The two applications differ in one assumption: attributional
studies hold that certain features of an author's style are produced
unconsciously and therefore remain fixed, whilst chronological studies hold
that stylistic fingerprints evolve smoothly throughout an author's life. The
apparent contradiction is resolved by the choice of features: attribution relies
on those that stay stable, chronology on those that change over time.
Stylochronometry, a term used to cover the dating of texts from stylistic
evidence, concerns itself with problems of specifying the sequence of
composition of the works of a given author. Famous cases are the dating of
Plato's dialogues, of certain of Shakespeare's plays, and the dating of the New
Testament scriptures, although their true chronology will never be known since
there is not enough external evidence to back up such stylometric findings.
Scientific approaches to chronology begin by choosing a group of texts that are
more or less securely dated, then apply stylometric methods to identify the
variables that correlate best with the dates of those texts. Once the methods
assign the correct dates to this initial set, the final step is to employ the
same methods on disputed cases. Such stylometric variables include
high-frequency words, function and common words, the type-token ratio,
vocabulary richness measures and others.
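Two of the variables just listed are simple to compute. The following is a minimal sketch, assuming a naive tokenizer that lower-cases the text and keeps only letters and apostrophes; the function names and the sample sentence are hypothetical and are not taken from any of the studies cited here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def relative_frequencies(tokens, words):
    """Occurrences of each listed word per 1000 tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (1000.0 * counts[w] / total if total else 0.0) for w in words}


# Invented sample sentence, used only to show the calls.
tokens = tokenize("the cat and the dog and the bird")
ttr = type_token_ratio(tokens)                          # 5 types / 8 tokens
rates = relative_frequencies(tokens, ["the", "and"])    # per-1000 rates
```

The type-token ratio falls as a text gets longer, so values are only comparable between samples of similar length; that matters when the texts are short poems.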
A famous example comes from Brainerd (1980) on the chronology of Shakespeare's
plays. He examined the percentage occurrence of 120 lemmata, mainly
high-frequency lexical items, together with the average verse line length in
words, the percentage of split lines and the type-token ratio, concentrating
initially on a group of plays with fairly accurate dates of composition. Since
only 20 of the 120 lemmata proved to be useful discriminators for chronology, he
used those to construct a function that would predict the dates of the control
group. Once his method produced the desired results, the final step was to apply
it to those of Shakespeare's plays whose dates were disputed.
Difficulties arose related to the possibility of multiple authorship in certain
cases, authorial revision at some stage, and the status of manuscripts used for
the preparation of the basic copy texts. However, multivariate statistics proved
useful for detecting which plays were likely to be products of multiple
authorship.
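The idea of keeping only those variables that discriminate by date, and then building a predictive function from them, can be illustrated in miniature. The sketch below is not Brainerd's procedure, which is not spelled out here: it assumes that variables are ranked by the absolute value of their Pearson correlation with date, and it fits a single-variable least-squares line where his function was multivariate. The names `word_a`, `word_b` and `word_c` and all the numbers are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_features(rates, dates, k):
    """Return the k feature names most strongly correlated with date."""
    ranked = sorted(rates, key=lambda f: abs(pearson(rates[f], dates)),
                    reverse=True)
    return ranked[:k]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


# Invented control group: five securely dated texts and three word rates.
dates = [1590, 1595, 1600, 1605, 1610]
rates = {
    "word_a": [10.0, 12.0, 14.0, 16.0, 18.0],  # rises steadily with date
    "word_b": [5.0, 5.0, 5.0, 5.0, 5.0],       # constant, so uninformative
    "word_c": [9.0, 3.0, 7.0, 4.0, 8.0],       # no clear trend
}

best = select_features(rates, dates, k=1)
slope, intercept = fit_line(rates[best[0]], dates)

# Estimated date for a hypothetical disputed text whose rate is 15.0.
predicted = slope * 15.0 + intercept
```

In practice the fitted function would first be checked against the known dates of the control group, and only then applied to the disputed texts.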
In poetry, though, it was not possible until recently to date texts of fewer
than 500 words. Forsyth (1999) at BSRU investigated a method of dating short
pieces of text (averaging 114 words in length) and tested it on W.B. Yeats's
work. This method, among others, will be used in our project, which aims
at building on collaborative work begun by Dr Forsyth and Prof. Margaret Freeman
of Valley College, California, on the investigation of chronological changes in
the style of the American poet Emily Dickinson (1830-1886).
Born in Amherst, Massachusetts, Dickinson lived at her father's house most of her
life and in her later years became a recluse. Because of her individualistic
style, which, it is now accepted, set her ahead of her time, only 10 of her
poems were published during her lifetime. Moreover, because of her difficult
handwriting and idiosyncratic punctuation, these were heavily edited, since
the public was not yet prepared for her eccentric masterpieces. At the time of
her death, 1775 poems were discovered arranged in 60 small packets. Her
relatives then made efforts to get all the poems published, but her poetry was
still heavily edited. Her impact on the American public
gradually became intense, and in 1955 a complete edition of her work was
published by Thomas H. Johnson, this time using her own punctuation and
vocabulary. Today she is known for her startling originality, her bold
experiments in prosody, her tragic vision, and the range of her intellectual and
emotional explorations.
Johnson's edition provides approximate dates of composition (Johnson, 1961),
based on a study of the changes in her handwriting by Theodora Ward, who
collaborated with Johnson. The exceptions are a few poems that can be dated
precisely, either because Dickinson sent them as parts of letters to various
friends or because she mentions contemporary events.
Our investigation will initially concentrate on control authors whose works are
securely dated, such as Christina Rossetti and W.B. Yeats. Both poets lived in
the 19th century, as Dickinson did. We propose to use a feature-finding program
developed by Forsyth & Holmes (1996) at BSRU, a tagger such as TOSCA from
Nijmegen University, and a content analysis tool. We will thus tap into
linguistic information of different kinds: lexical, syntactic and semantic. Our
aim is to detect the type of linguistic information that is useful for
discriminating between the early and late works of our poets, with the intention
of applying the techniques used on the control authors to date Dickinson's work.
Laan (1995) argues that there is no hard evidence to suggest that authors have
both a conscious and an unconscious aspect to their writing style, as stylometry
suggests. On the other hand, as Laan (1995) admits, other possibilities exist:
an author's unconscious style may have both a stable and an adaptable part, or
some authors may change the unconscious features of their style while others do
not. The extent to which such claims are true has been investigated by Robinson
(1992) and Keyser (1992), who both suggest proceeding from authors with known
publication dates to authors with unknown publication dates.
Initial studies, to be reported at this conference, have investigated the idea
that authors generally exhibit a trend towards decreasing complexity as they
grow older. Using the Fog Index, a measure of the density of language based on
the proportion of long words and average sentence length, we have obtained
equivocal results. Other measures, however, do seem to show increased simplicity
with time. We believe that this brings us a step closer to correct chronology.
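For reference, the Fog Index in its usual (Gunning) form is 0.4 times the sum of the average sentence length in words and the percentage of words with three or more syllables. The following is a simplified sketch, not the implementation used in this study. It assumes that syllables can be estimated by counting runs of vowels, that sentences end at `.`, `!` or `?`, and it omits the usual exclusions for proper nouns and common suffixes. The function names and the sample text are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: the number of runs of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fog_index(text):
    """0.4 * (average sentence length + percentage of words with 3+ syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    long_words = [w for w in words if count_syllables(w) >= 3]
    average_sentence_length = len(words) / len(sentences)
    percent_long = 100.0 * len(long_words) / len(words)
    return 0.4 * (average_sentence_length + percent_long)


# Invented sample: 7 words, 2 sentences, 1 word ("animal") counted as long.
example = "The cat sat. The animal ran away."
score = fog_index(example)
```

The sentence-splitting rule is the weakest assumption for verse: a poet with unconventional punctuation may give few reliable sentence boundaries, so the unit of "sentence length" has to be chosen with care.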
References
B. Brainerd. “The Chronology of Shakespeare's Plays: A Statistical Study.” Computers and the Humanities. 1980. 14: 221-230.
R. S. Forsyth. “Stylochronometry with Substrings Or: A Poet Young and Old.” Literary & Linguistic Computing. 1999. 14: 1-11.
R. S. Forsyth and D. I. Holmes. “Feature Finding for Text Classification.” Literary & Linguistic Computing. 1996. 11: 163-174.
D. I. Holmes. “The Evolution of Stylometry in Humanities Scholarship.” Literary & Linguistic Computing. 1998. 13: 111-117.
The Complete Poems of Emily Dickinson. Ed. T. H. Johnson. Boston: Little, Brown and Company, 1961.
P. Keyser. “Stylometric Method and the Chronology of Plato's Works (review article).” Bryn Mawr Classical Review. 1992. 3: 58-74.
N. M. Laan. “Stylometry and Method: The Case of Euripides.” Literary & Linguistic Computing. 1995. 10: 271-278.
T. M. Robinson. “Plato and the Computer.” Ancient Philosophy. 1992. 12: 375-382.