“Towards a text benchmark suite”
Richard S. Forsyth
University of the West of England
rs-forsyth@csm.uwe.ac.uk
1. Introduction
In many areas of computing, benchmarking is a routine practice. There is insufficient room here to go into the pros and cons of benchmarking in any depth, except to acknowledge that sets of benchmarks do have drawbacks as well as advantages. Nevertheless benchmarking does have a role to play in setting objective standards. For example, in the field of forecasting, the work of Makridakis and colleagues (e.g. Makridakis & Wheelwright, 1989), who tested a number of forecasting methods on a wide range of time series, transformed the field -- leading to both methodological and practical advances. Likewise, in machine learning, the general acceptance of the Machine-Learning Database Repository (Murphy & Aha, 1991) as an agreed standard, and its employment in extensive comparative tests (e.g. Michie et al., 1994) has thrown new light on the strengths and weaknesses of competing algorithms. Although billion-byte public-domain archives of text exist, e.g. Project Gutenberg and the Oxford Text Archive, stylometry currently lacks an equivalent set of accepted test problems. Therefore we at Bristol have compiled a textual benchmark suite. The current version of this suite is known as Tbench96. Despite its deficiencies, it does present a broader variety of test problems than other workers in stylometry and allied fields have previously used.
1.1 Selection Criteria
The text-categorization problems in this suite were selected to fulfil a number of requirements.
- 1. Provenance: the true category of each text should be well attested.
- 2. Variety: problems other than authorship should be included.
- 3. Language: not all the texts should be in English.
- 4. Difficulty: both hard and easy problems should be included.
- 5. Size: the training texts should be of `modest' size, such as might be expected in practical applications.
1.2 Pre-processing
In order to impose uniformity of layout and thus reduce the effect of factors such as line-length (not usually an authorial decision), all text samples have been passed through a program called PRETEXT. This program makes some minor formatting changes, e.g. case-folding and conversion of tabs into blanks. However, the most important change made is to break running text into segments that are then treated as cases to be classified. Just what constitutes a natural unit of text is by no means obvious. Different researchers have made different decisions about the best way of segmenting long texts. Some have used fixed-length blocks (e.g. Elliott & Valenza, 1991); others have respected natural subdivisions in the text (e.g. Ule, 1982). Both approaches have merits as well as disadvantages. Because linguistic materials have a hierarchical structure, there is no universally correct segmentation scheme. In Tbench96 each block boundary is taken as the first new-line in the text on or after the 999th byte in the block being formed. Such units will be referred to as kilobyte lines. The number of words per kilobyte line varies according to the type of writing. A representative figure for Tbench96 as a whole is 185 words per line. Thus this is an attempt to work with text units near the lower limit of what has previously been considered feasible. Evidence of this is provided by the two quotations below, made 20 years apart.
"It is clear in the present study that there is considerable loss in discriminatory power when samples fall below 500 words." (Baillie, 1974)
"We do not think it likely that authorship characteristics would be strongly apparent at levels below say 500 words, or approximately 2500 letters. Even using 500 word samples we should anticipate a great deal of unevenness, and that expectation is confirmed by these results." (Ledger & Merriam, 1994)
Although Felton (1996) has studied 100-word text blocks (in New Testament Greek) and Simonton (1990) even analyzed word usage in the final couplets of Shakespeare's 154 sonnets (averaging 17.6 words each), the block size in Tbench96 is small relative to most previous stylometric studies; therefore it poses a relatively challenging series of tests.
2. Details of Data Sets
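As an illustration of the segmentation rule stated in Section 1.2 (a block ends at the first new-line on or after its 999th byte), a minimal sketch is given here. It is not the PRETEXT program itself: the function name `kilobyte_lines`, the replacement of internal new-lines by blanks, and the treatment of a short final remainder as a block of its own are all assumptions made for the example.

```python
def kilobyte_lines(data: bytes, min_bytes: int = 999) -> list[bytes]:
    """Split raw text into 'kilobyte lines'.

    Each block ends at the first new-line found on or after the
    min_bytes-th byte of the block being formed. The terminating
    new-line is dropped, and new-lines inside a block are turned into
    blanks so that every block is a single line (an assumption; the
    source does not say how PRETEXT handles them).
    """
    blocks = []
    start = 0
    n = len(data)
    while start < n:
        # 0-based index of the min_bytes-th byte of the current block.
        probe = start + min_bytes - 1
        cut = data.find(b"\n", probe) if probe < n else -1
        if cut == -1:
            # No qualifying new-line: keep the remainder as a final,
            # shorter block (assumed behaviour, not taken from the source).
            blocks.append(data[start:].replace(b"\n", b" "))
            break
        blocks.append(data[start:cut].replace(b"\n", b" "))
        start = cut + 1
    return blocks


if __name__ == "__main__":
    # 25 lines of 99 characters plus a new-line each: 2500 bytes in total.
    sample = (b"x" * 99 + b"\n") * 25
    for block in kilobyte_lines(sample):
        print(len(block))
```

A caller wanting only full-sized cases could discard any final block shorter than 999 bytes; whether Tbench96 does so is not stated in the text.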
The 13 text-classification problems that constitute Tbench96 (Text Benchmark Suite, 1996 edition) form an enhanced version of the test suite used by Forsyth (1995). They constitute a potentially valuable resource for future studies in text analysis. Summary information is given below about the texts used in the benchmark suite.
Note: A policy adhered to throughout was never to split a single work (article, essay, poem or song) between training and test sets.
Authorship / Prose
FEDS (2 classes): A selection of papers by two Federalist authors, Hamilton and Madison. This difficult authorship problem -- subject of a ground-breaking analysis by Mosteller & Wallace (1984 [1964]) -- is possibly the best candidate for an accepted benchmark in stylometry. An electronic text of the entire Federalist papers was obtained by anonymous ftp from Project Gutenberg at GUTNBERG@vmd.cso.uiuc.edu. For checking purposes the Dent Everyman edition was used (Hamilton et al., 1992 [1788]). Division into test and training sets was as follows.
Author    Training                                                          Test
Hamilton  6, 7, 9, 11, 12, 17, 22, 27, 32, 36, 61, 67, 68, 69, 73, 76, 81   1, 13, 16, 21, 29, 30, 31, 34, 35, 60, 65, 75, 85
Madison   10, 14, 37-48                                                     49-58, 62, 63
This division implies accepting the view expounded by Martindale & McKenzie (1995), who state that: "Mosteller and Wallace's conclusion that Madison wrote the disputed Federalist papers is so firmly established that we may take it as given."
JOJO (2 classes): Writings by Joseph Smith, the founder of the Mormon religion, and Joanna Southcott, a religious prophet contemporary with Smith -- from files kindly donated by Dr David Holmes of UWE Bristol. Southcott's work was supplied in four files: one from her diaries, two files of prophetic meditations, and one file of prophetic verse. Smith's three files were all extracts from his diaries. These texts (and others) have been analyzed by Holmes (1992).
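For the FEDS problem, the division of papers between training and test sets can be written down as data and checked mechanically, in keeping with the policy of never splitting a single work between the two sets. The sketch below is illustrative only: the paper numbers follow the FEDS division table as read here, while the names `FEDS_SPLIT`, `expand` and `check_disjoint` are hypothetical and not part of Tbench96.

```python
def expand(spec: str) -> set[int]:
    """Expand a spec such as '10, 14, 37-48' into a set of paper numbers."""
    papers = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            papers.update(range(int(low), int(high) + 1))
        else:
            papers.add(int(part))
    return papers


# Paper numbers as given in the FEDS division table (names are hypothetical).
FEDS_SPLIT = {
    "Hamilton": {
        "train": "6, 7, 9, 11, 12, 17, 22, 27, 32, 36, 61, 67, 68, 69, 73, 76, 81",
        "test": "1, 13, 16, 21, 29, 30, 31, 34, 35, 60, 65, 75, 85",
    },
    "Madison": {
        "train": "10, 14, 37-48",
        "test": "49-58, 62, 63",
    },
}


def check_disjoint(split: dict) -> dict:
    """Map each paper to (author, role); fail if any paper appears twice."""
    seen = {}
    for author, roles in split.items():
        for role, spec in roles.items():
            for paper in expand(spec):
                if paper in seen:
                    raise ValueError(f"paper {paper} is assigned more than once")
                seen[paper] = (author, role)
    return seen


if __name__ == "__main__":
    assignment = check_disjoint(FEDS_SPLIT)
    print(len(assignment), "papers assigned")
```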
Authorship / Poetry
EZRA (3 classes): Poems by Ezra Pound, T.S. Eliot and William B. Yeats -- three contemporaries who influenced each other's writings. For example, Pound is known to have given editorial assistance to Yeats and, famously, Eliot (Kamm, 1993). A random selection of poems by Ezra Pound written up to 1926 was taken from Selected Poems 1908-1969 (Pound, 1977), and entered by hand. It was supplemented by a random selection of 18 pre-1948 Cantos, obtained from the Oxford Text Archive. Poems by T.S. Eliot were from Collected Poems 1909-1962 (Eliot, 1963). A random selection of 148 poems by W.B. Yeats was taken from the Oxford Text Archive. For checking purposes Collected Poems (Yeats, 1961) was used.
NAMESAKE (2 classes): Poems by Bob Dylan and Dylan Thomas. Songs by Bob Dylan (born Robert A. Zimmerman) were obtained from Lyrics 1962-1985 (Dylan, 1994). In addition, two tracks from the album Knocked Out Loaded (Dylan, 1988) and the whole A-side of Oh Mercy (Dylan, 1989) were transcribed by hand and included, to give fuller coverage. Poems of Dylan Thomas were obtained from Collected Poems 1934-1952 (Thomas, 1952) with four more early works added from The Notebook Poems 1930-1934 (Maud, 1989).
Chronology
ED (2 classes): Poems by Emily Dickinson, early work being that written up to 1863 and later work that written after 1863. Emily Dickinson had a great surge of poetic composition in 1862 and a lesser peak in 1864, after which her output tailed off gradually. The work included is all of A Choice of Emily Dickinson's Verse selected by Ted Hughes (Hughes, 1993) as well as a random selection of 32 other poems from the Complete Poems (edited by T.H. Johnson, 1970).
JP (3 classes): Poems by John Pudney, divided into three classes. The first category came from Selected Poems (Pudney, 1946) and For Johnny: Poems of World War II (Pudney, 1976); the second from Spill Out (Pudney, 1967) and the third from Spandrels (Pudney, 1969). Every distinct poem in these four books was used. John Pudney (1909-1977) described his career as follows: "My poetic life has been a football match. The war poems were the first half. Then an interval of ten years. Then another go of poetry from 1967 to the present time" (Pudney, 1976). Here the task is to distinguish his war poems (published before 1948) from poems in two other volumes, published in 1967 and 1969.
WY (2 classes): Early and late poems of W.B. Yeats. Early work is taken as that written up to 1914, the start of the First World War, and later work as that written in or after 1916, the date of the Irish Easter Rising, which had a profound effect on Yeats's beliefs about poetry.
For these problems the classification objective was to discriminate between early and late works by the same poet.
Subject-Matter
MAGS (2 classes): This used articles from two academic journals, Literary and Linguistic Computing (75 articles) and Machine Learning (69 articles). The task was to classify texts according to which journal they came from. In fact, each `article' consisted of the Abstract and first paragraph of a single paper.
NEWS (4 classes): This data-set consists of news stories extracted from the Associated Press wire service during December 1979. A total of about 250,000 words was obtained from the Oxford Text Archive, where it was deposited by Dr G. Akers in 1980. Stories in this archive are classified into at least six mutually exclusive categories. For Tbench96, four of these story types were extracted: F -- Financial stories; I -- International stories; S -- Sports stories; and W -- Washington stories. The Washington category covers US domestic politics. For training data, stories up to 15th December were used. For test data, stories after that date were used.
TROY (2 classes): Electronic versions of the complete texts of Homer's Iliad and Odyssey, both transliterated into the Roman alphabet in the same manner, were kindly supplied by Professor Colin Martindale of the University of Maine at Orono. Traditionally each work is divided into 24 sections or `books'. For both works the training sample comprises the odd-numbered books and the test sample consists of the even-numbered books. The classification task is to tell which work each kilobyte line comes from. (It is possible that this task is an authorship discrimination as well (Griffin, 1980).)
Miscellaneous:
GENDERS (2 classes): Short stories written by first-year undergraduate students at the University of Maine on the subject: boy meets girl (or vice versa). These texts were kindly supplied by Professor Colin Martindale of the Psychology Department of the University of Maine at Orono. These stories arrived in an arbitrary order. Even-numbered stories were used as training data, odd-numbered stories as test data. The objective was to distinguish tales written by males from those written by females.
AUGUSTAN (2 classes): The Augustan Prose Sample donated by Louis T. Milic to the Oxford Text Archive. For details of the rationale behind this corpus and its later development, see Milic (1990). This data consists of extracts by many English authors written during the period 1678 to 1725. It is held as a sequence of records each of which contains a single sentence. Sentence boundaries identified by Milic were respected.
RASSELAS (2 classes): The complete text of Rasselas by Samuel Johnson, written in 1759. This was obtained in electronic form from the Oxford Text Archive. For checking purposes, the Clarendon Press edition was used (Johnson, 1927 [1759]). This novel consists of 49 chapters. These were allocated alternately to four different files.
The inclusion of random or quasi-random data may need justification. The chief objective of doing so here was to provide an opportunity for what statisticians call overfitting to manifest itself. The author's view is that some `null' cases should form part of any benchmark suite: as well as finding what patterns do exist, a good classifier should avoid finding patterns that don't exist.
Acknowledgements
Thanks are due to Dr David Holmes and Professor Colin Martindale for providing some of the text files used in this benchmarking suite, as well as for helpful comments. In addition, the following institutions -- the Oxford Text Archive, Project Gutenberg, and UWE's Bolland Library -- have also provided resources without which this collection could not have been compiled.
References
W. M. Baillie. “Authorship Attribution in Jacobean Dramatic Texts.” Computers in the Humanities. Ed. J. L. Mitchell. Edinburgh Univ. Press, 1974.
B. Dylan. Knocked Out Loaded. Sony Music Entertainment Inc., 1988.
B. Dylan. Oh Mercy. CBS Records Inc., 1989.
B. Dylan. Lyrics 1962-1985. London: Harper Collins Publishers, 1994.
T. S. Eliot. Collected Poems 1909-1962. London: Faber & Faber Limited, 1963.
W. E. Y. Elliott & R. J. Valenza. “A Touchstone for the Bard.” Computers & the Humanities. 1991. 25: 199-209.
R. Felton. Personal Communication. 1996.
R. S. Forsyth. “Stylistic Structures: a Computational Approach to Text Classification.” Faculty of Science, University of Nottingham, 1995.
J. Griffin. Homer. Oxford: Oxford University Press, 1980.
A. Hamilton, J. Madison & J. Jay. The Federalist Papers. Ed. W. R. Brock. London: Dent, 1992.
D. I. Holmes. “A Stylometric Analysis of Mormon Scripture and Related Texts.” J. Royal Statistical Society (A). 1992. 155: 91-120.
E. J. Hughes. A Choice of Emily Dickinson's Verse. London: Faber & Faber Limited, 1993.
S. Johnson. The History of Rasselas, Prince of Abyssinia. Oxford: Clarendon Press, 1927.
Emily Dickinson: Collected Poems. Ed. T. H. Johnson. London: Faber & Faber Limited, 1970.
A. Kamm. Biographical Dictionary of English Literature. Glasgow: HarperCollins, 1993.
G. R. Ledger & T. V. N. Merriam. “Shakespeare, Fletcher, and the Two Noble Kinsmen.” Literary & Linguistic Computing. 1994. 9: 235-248.
S. Makridakis & S. C. Wheelwright. Forecasting Methods for Managers. New York: John Wiley & Sons, 1989.
C. Martindale & D. P. McKenzie. “On the Utility of Content Analysis in Authorship Attribution: the Federalist.” Computers & the Humanities. 1995. 29.
Dylan Thomas: the Notebook Poems 1930-1934. Ed. R. Maud. London: J.M. Dent & Sons Limited, 1989.
Machine Learning, Neural and Statistical Classification. Ed. D. Michie, D. J. Spiegelhalter & C. C. Taylor. Chichester: Ellis Horwood, 1994.
L. T. Milic. “The Century of Prose Corpus.” Literary & Linguistic Computing. 1990. 5: 203-208.
F. Mosteller & D. L. Wallace. Applied Bayesian and Classical Inference: the Case of the Federalist Papers. New York: Springer-Verlag, 1984.
P. M. Murphy & D. W. Aha. UCI Repository of Machine Learning Databases. Dept. Information & Computer Science, University of California at Irvine, CA., 1991.
E. L. Pound. Selected Poems. London: Faber & Faber Limited, 1977.
J. S. Pudney. Selected Poems. London: John Lane The Bodley Head Ltd., 1946.
J. S. Pudney. Spill Out. London: J.M. Dent & Sons Ltd., 1967.
J. S. Pudney. Spandrels. London: J.M. Dent & Sons Ltd., 1969.
J. S. Pudney. For Johnny: Poems of World War II. London: Shepheard-Walwyn, 1976.
D. K. Simonton. “Lexical Choices and Aesthetic Success: a Computer Content Analysis of 154 Shakespeare Sonnets.” Computers & the Humanities. 1990. 24: 251-264.
D. M. Thomas. Collected Poems 1934-1952. London: J.M. Dent & Sons Ltd., 1952.
L. Ule. “Recent Progress in Computer Methods of Authorship Determination.” ALLC Bulletin. 1982. 10: 73-89.
W. B. Yeats. The Collected Poems of W.B. Yeats. London: Macmillan & Co. Limited, 1961.