Retired, therefore outwith the university sector
Professor Emeritus, Department of Mathematics and Statistics, The College of New Jersey
Thomas Paine's
This paper presents the results of a stylometric study of Thomas Paine's
More than those of any other writer, the writings of Thomas Paine (1737-1809)
illustrate the transformation in the meaning of the term
With the help of Dr. Benjamin Rush, who had suggested the title and also found him a printer, in 1776 Paine published
In the five years after the end of the American Revolutionary War, Paine more
or less left politics behind and became absorbed in a series of scientific
experiments, in particular the design and construction of a single-arched
iron bridge. In 1787, again following the advice of Benjamin Franklin, who
told him he would do well to seek sponsors for his bridge in either Paris or
London, Paine returned to Europe at the very time when revolutionary and
radical pressures were building there. He did not lack for well-placed
friends. His admirer Thomas Jefferson had been appointed American
Minister to France, and his old friend the Marquis de Lafayette, wreathed
with American laurels, was also at his disposal. Lafayette kept a copy of
the American Declaration of Independence on one panel of his study, leaving
the opposite panel empty in the hope that it would one day be adorned by a
similar French one.
Paine's
Now Paine had fallen victim to a gigantic counter-revolution in revolutionary disguise, one which had succeeded in entrenching his original foe, the monarchy. Hearing from his old friend Thomas Jefferson that he was welcome to return to America, he gave up France as a bad job and in September 1802 sailed to Baltimore. His writings bear witness to his revolutionary activities and provide us with a detailed picture of the evolution of social and political change at the end of the eighteenth century.
Jonathan Clark, the Hall Distinguished Professor of British History at the University of Kansas, calls into question the belief that Thomas Paine wrote the whole work, in an article on the authorship of
I will here, as concisely as I can, trace out the growth of the French revolution, and mark the circumstances that have contributed to produce it
Who, then, could have written this 6000-word narrative? Clark suggests that
its author was probably the Marquis de Lafayette. It seems to embody not
neutral history but his very personal perspective on events and has as hero
none other than Lafayette himself. Clark thinks that this passage was
expressed in the third person in order to conceal the
Common Sense [Paine] is writing a Book for you — there you will see a part of My Adventures.
Clark concludes that this passage is very probably not a history primarily
written by Paine but Lafayette's self-serving publicity, part of his attempt
to become the
For our main study, related to the
As the works available in English by the Marquis de Lafayette are much fewer than those of Paine, and in genres (letters and memoir) that are absent or rare in Paine's writings, we selected works by a number of additional control authors to contrast with Paine's written style. These authors were all contemporaries of Paine, active in the American Revolution (like Paine and Lafayette), with whom he interacted. The texts from these control authors are summarized in Table 2.
Having examples from this set of control authors allows us to make comparisons
not just between Paine and Lafayette (where the genre distribution is inevitably
unbalanced) but between Paine's works and representatives of the kinds of
writing on topics of concern to him which were being written and read (by Paine
among others) during the time when he was active. The
Not all of the text files in our corpus represent individual whole works. Some are sections or chapters of longer works. The 2 cases where a larger work was split into more than 2 files were
In addition, as part of a subsidiary study on the theme of co-authorship and its stylistic signals, we also collected a parallel corpus, summarized in Table 4 and further detailed in Appendix 2. This will be used in subsections 3.4 and 4.1 to study how well our analytic tools can detect co-authorship, in a naturally occurring case where we have the great advantage of knowing just where each author's contribution begins and ends.
The median size of these texts was 3094 word tokens. Additionally, five individually authored sections of
In a pioneering work first published in 1964, Mosteller and Wallace used relative frequencies of commonly occurring words — mainly words such as prepositions, conjunctions and articles — as discriminators to investigate the mystery of the authorship of the
The choice of text size in stylometric studies is always problematic. Smaller
text units are too short to provide opportunities for stylistic habits to
operate on the arrangement of internal constituents, while larger units are
insufficiently frequent to provide enough examples for reliable statistical
inference. Forsyth and Holmes found the median text block size in a selection of
stylometric studies to be around 3500 words
The value of N used varies by application and genre but typically lies between 50
and 100, the implication being that these words should be among the most common
in the language and that content words should generally be avoided.
Attributional studies have achieved success with N set as low as 50
As it happened, 44 words were removed by this procedure, leaving 100 words to be used as features. These 100 words, as well as the 44 words excluded can be seen in Appendix 3. Using these words, a grid of size 55 by 100 was produced, with each of the 55 rows representing a text and each of the 100 columns giving the occurrence rates (per hundred) of a particular word in those texts.
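The construction of such a grid of occurrence rates can be sketched in a few lines of Python. This is a minimal illustration under obvious assumptions about tokenization; the function and variable names are ours, not those of the software actually used:

```python
from collections import Counter

def rate_grid(texts, features):
    """Occurrence rates (per hundred tokens) of each feature word per text.

    texts    : list of token lists, one per document
    features : the chosen high-frequency words (100 in our study)
    Returns one row of rates per text, one column per feature word.
    """
    grid = []
    for tokens in texts:
        counts = Counter(tokens)
        n = len(tokens)
        grid.append([100.0 * counts[w] / n for w in features])
    return grid

# Toy example with a 2-word feature list:
docs = [["the", "rights", "of", "man", "the"],
        ["of", "common", "sense"]]
grid = rate_grid(docs, ["the", "of"])
# grid[0] == [40.0, 20.0]
```

In the study proper, each row of the resulting 55-by-100 grid describes one text in terms of the 100 retained feature words.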
The first phase in this investigation was designed to assess the validity of
the proposed technique. We were interested to discover how clearly the texts
of the three
For this purpose, a Principal Components Analysis was performed on 42 rows of this dataset, representing 10 works by Hamilton, 5 by Jefferson, 5 by Madison and 22 unquestionably by Paine. This reduced the 100 variables in the original data to 8 composite variables which between them accounted for 51.74% of the overall variance. Figure 1 shows the 42 texts plotted in the space of the first two (most important) of these components (PCs). PC1 accounted for 11.18 percent and PC2 for 9.38 percent of the overall variance (20.56% together).
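The dimensionality reduction step can be illustrated as follows, using a minimal PCA via singular value decomposition on stand-in random data of the same shape as the 42-by-100 grid. This is a sketch of the standard technique, not the study's actual computation:

```python
import numpy as np

def pca_scores(X, k):
    """Project rows of X onto its first k principal components.

    A minimal PCA via SVD of the mean-centred data matrix. Returns the
    component scores and the fraction of total variance each of the k
    components explains.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T            # coordinates of each text on PC1..PCk
    var_explained = (s ** 2 / np.sum(s ** 2))[:k]
    return scores, var_explained

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 100))        # stand-in for 42 texts x 100 word rates
scores, ve = pca_scores(X, 2)
# scores gives each text's position in a plot like Figure 1;
# ve gives the proportion of variance captured by PC1 and PC2.
```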
This diagram shows an imperfect yet encouraging degree of separation. The first, horizontal, component, clearly expresses a strong authorial signal. Taken together with the second, vertical, component, we find a nearly perfect separation of Paine, in the upper right-hand area of the graph, from the other authors. It also looks as if Hamilton and Jefferson are quite easy to differentiate, mainly on the basis of the vertical dimension.
Broadly speaking, this pattern recapitulates the finding of Sigelman et al.
The appearance of three State of the Union addresses, by Jefferson and Madison, towards the lower left of the graph suggests that dropping pronominals has not eliminated genre effects, as does the tendency of letters (including part 1 of Paine's rebuttal of the Abbé Raynal, which is framed as a letter) to have low scores on the second component.
To assist in interpreting this graph, the 6 words loading most positively and
negatively on these two Principal Components are listed in Table 5. The
contrast of
The next step is to see how well our sample of texts by Lafayette can be
distinguished from those of the
Another Principal Components Analysis was performed, this time requiring 6 components to account for more than 50% of the overall variance in the data. Figure 2 shows the 30 texts plotted in the space of the first 2 components (PCs). PC1 accounts for 21.48 percent and PC2 for 9.11 percent of the overall variance (30.59% together).
This presents an interesting picture. The samples from
Of particular interest are the three letters by Lafayette, to Dayen,
Rochambeau and Vergennes, which are poorly distinguished from Jefferson.
These letters were written originally in French and translated into English
when Lafayette's collected works were published. We have retained them in
our sample out of curiosity. That may be said to introduce noise into the
system, but studies by Rybicki (2012) and Forsyth & Lam (2014) have
shown, rather counter-intuitively, that authorship attribution is possible
in translated works
Overall then, this part of the analysis shows that
The most negatively and positively loaded words on the first two Principal
Components are shown in Table 6. The presence of
Thus we have imperfect discrimination, where authorship is to some extent confounded with other signals. Nevertheless Paine in general is relatively well distinguished from his contemporaries and a pair of works by Lafayette is very different from a variety of works by those contemporaries. Even Lafayette's translated letters are atypical when compared to texts by Hamilton, Jefferson and Madison. In short, this approach does offer clues to authorship when applied to writings of the relevant vintage, provided that those clues are taken as evidence rather than proof.
We next examine whether our two principal authors of interest, Lafayette and Paine, are distinguishable; and, in particular, whether the queried parts of TROM resemble one of these authors more than another. To do this we apply PCA to the 10 texts by Lafayette and the 22 by Paine already analyzed in the preceding sections, along with three further samples from TROM.
Figure 3 shows a plot of these 35 texts in the space of the first 2 Principal Components (PCs). PC1 accounts for 20.81 percent and PC2 10.48 percent of the variance in the data (31.29 percent together). Table 7 shows the words with the strongest loadings on these dimensions.
Again Paine's sample contains an outlier, his letter to Franklin of 1778. Lafayette's texts split into two well-separated groups. Even so, it is possible to separate the undisputed Paine samples from the undisputed Lafayette samples on the first Principal Component with only one exception.
TROM_Rights_1, the first part of
This graph is thus highly compatible with the assertion that TROM_Rights_1 is solely or mainly by Paine. It is also compatible with the proposition that the queried passage is by neither of them. The idea that it could be a co-authorship is by no means ruled out, an idea that will be followed up in the succeeding section.
To gain another perspective on the relations among these texts we performed a hierarchical cluster analysis using Ward's method on a distance matrix obtained from the first 6 Principal Components (enough to account for at least half of the overall variance, 53.46% to be precise). Figure 4 shows the result of this clustering.
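The clustering step can be sketched as follows, applying Ward's method to stand-in component scores rather than our actual data (scipy's standard hierarchical-clustering routines are assumed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for texts' scores on the first 6 principal components:
# two groups with well-separated means.
rng = np.random.default_rng(1)
pc_scores = np.vstack([rng.normal(0.0, 1.0, size=(5, 6)),
                       rng.normal(8.0, 1.0, size=(5, 6))])

# Ward's method merges, at each step, the pair of clusters whose union
# gives the smallest increase in within-cluster variance.
Z = linkage(pc_scores, method="ward")

# Cutting the resulting dendrogram into two clusters recovers the groups.
labels = fcluster(Z, t=2, criterion="maxclust")
```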
This clustering presents a twofold grouping at the top level. In the left-hand group are the seven chunks of
To summarise this phase of the analysis, it is fair to say that TROM_Rights_1 behaves overall like a typical text by Thomas Paine, while the queried sections do not. There is some evidence linking them to Lafayette, but it is inconclusive. The notion of co-authorship remains plausible.
To gain a fuller idea of what might be expected in cases of co-authorship, we
applied the same approach to the small corpus described in Table 4, above.
In other words, we performed a self-examination on texts written by
ourselves, including co-authorships, as well as a
Like most researchers, we have been co-authors ourselves on several occasions. We decided to exploit this fact by focussing on a paper with particular relevance to the era of Tom Paine,
Co-authorship can take many forms (see:
In total, as tabulated in Appendix 2, we collected 45 text files for this subsidiary study, of which 20 were singly authored by Forsyth and 14 by Holmes. In addition, this corpus contains the co-authored text of
Initially, we performed a Principal Components Analysis, comparable with that in subsection 3.3, on 43 of these files, excluding only the two sections of The Federalist Revisited with lengths less than 800 words. For this purpose 100 words were again selected as features, using the same procedure as described previously in section 3, namely, taking the 144 most frequent words of the corpus, then removing gendered pronouns (one instance) and 43 other words which were found in fewer than 32 of the 43 documents. The complete list of words, and exclusions, is shown in Appendix 4.
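This feature-selection procedure can be sketched as follows. The code is a hypothetical re-implementation for illustration; the gendered-pronoun list and the toy corpus are ours, not the study's:

```python
from collections import Counter

def select_features(doc_tokens, top_k, min_docs,
                    drop=("he", "she", "him", "her", "his")):
    """Take the top_k most frequent corpus words, drop gendered pronouns,
    then drop words occurring in fewer than min_docs documents.
    The pronoun list here is illustrative, not the one actually used.
    """
    corpus_freq = Counter()
    doc_freq = Counter()
    for tokens in doc_tokens:
        corpus_freq.update(tokens)
        doc_freq.update(set(tokens))    # count each word once per document
    candidates = [w for w, _ in corpus_freq.most_common(top_k)]
    return [w for w in candidates
            if w not in drop and doc_freq[w] >= min_docs]

# Toy corpus: "she" is filtered as a pronoun, "rare" by document frequency.
docs = [["the", "she", "rare"], ["the", "she"], ["the", "she"]]
feats = select_features(docs, top_k=3, min_docs=2)
# feats == ["the"]
```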
Figure 5 shows these texts plotted in the space of the first 2 Principal Components (PCs) which together accounted for 27.55% of the overall variance. PC1 accounted for 14.92% of the total variance and PC2 for 12.63%.
Here the picture is not clear-cut. The horizontal axis (PC1) does not distinguish the authors. Even knowing the texts concerned it is not simple to interpret this component, although it is influenced by topic: only one of the texts scoring higher than 2.5 on this component relates to linguistic themes, whereas the majority of those scoring lower than this (to the left of the diagram) do relate to linguistic or textual analysis.
However, some degree of authorial discrimination is achieved by PC2. Items in the lower part of the diagram, scoring less than zero on this component, are predominantly by RF, while the majority of points in the upper part are by DH. The lower left quadrant, however, is mixed, and rather difficult to make out. In fact, a plot of the second and third Principal Components illustrates the situation more clearly, as shown in Figure 6, even though PC3 (which accounts for 7.76% of the total variance) discriminates only weakly between the two main authors. Table 8 shows, for each of the first three Principal Components, the five words that load most positively and most negatively.
In Figure 6, PC2, the horizontal axis, still carries the only strong authorship signal, but at least the points are better spread for interpretive purposes. The 34 texts singly authored by either DH or RF (i.e. all except co-authorships, parts of CO_Feds and the distractor, MO_Authorid), can be separated by a straight line, shown dashed in the diagram, with just 2 mistakes. To the left of this line we find all 20 main texts by RF, as well as the two co-authored texts by RF and someone other than DH, and one part (RF_Fed4) of the Federalist paper (CO_Feds). However, we also find two texts written solely by DH and the distractor written by Michael Oakes.
To the right of this dashed line we find 12 of the 14 main texts by DH, as well as DH_Fed2, the longest of the sections of CO_Feds written by DH. We also find the two papers co-authored by DH with someone other than RF. If we were to treat this demarcation line as definitive, however, we would not only assign the whole of CO_Feds to DH, we would also assign part of that paper written by RF (RF_Fed1) to DH as well. It can be seen that while CO_Feds falls near the centre of the DH distribution on PC2, it represents an outlier on that dimension for RF. Interestingly enough, RF_Fed1 is even more of an outlier on that dimension, yet it was written solely by RF.
A conclusion to be drawn from this side study is that while Principal Component Analysis using high-frequency words does very often reveal authorial signals, it cannot on its own be relied on to tease out contributions in a co-authorship. This conclusion is reinforced by Figure 7, which shows the results of a hierarchical cluster analysis, using Ward's method, based on the first 6 Principal Components derived from this data, which between them account for 51.77% of the overall variance.
Ignoring the isolate (RF_TeskeyRev) which happens to be the shortest of these texts, we can form five groupings by cutting the vertical lines in Figure 7 horizontally at about level 15 on the vertical axis (though this number has no unambiguous interpretation). The leftmost of these five clusters is a group of 3 DH texts that can be seen as an outlying subgroup in Figures 5 and 6. They are relatively informal writings compared with the rest of the corpus. Next, reading from the left, is a group of 6 texts all by RF on topics other than authorship. Then come five texts of which four are by DH on stylometric themes but one, RF_Foreword, is an introduction by RF to a book on finance. Next from the left comes a large group of 15 texts of which 12 are by RF, 2 are co-authorships by RF and someone other than DH, and one (MO_Authorid) is the distractor written by Michael Oakes. Finally, the rightmost grouping contains 8 texts by DH, 2 co-authorships by DH with someone other than RF and the prime focus text, CO_Feds, written by both DH and RF. But it also contains 2 texts solely by RF (RF_Cons1to4 and RF_Fed1).
On the basis of this phase of the analysis, if we suspected that the paper on the
Thus it can be said that this classic multivariate approach, based on occurrence rates of high-frequency words, does provide suggestive clues in cases of this type, but cannot be regarded as conclusive. It is, after all, explicitly a mode of exploratory data analysis. To be more definitive, we turn to other, complementary, modes of analysis.
To gain another perspective on the probable authorship of the queried passages in TROM, we employed the TOCCATA text-classification system, version 9, available at the following address: http://www.richardsandesforsyth.net/software.html.
TOCCATA (which stands for Text-Oriented Computational Classifier, Applicable To Authorship) is designed as a test harness for various text-categorization strategies. It comes supplied with software libraries that enable five different text-classification techniques to be used. In addition, the user may write bespoke libraries (in Python3) to implement other techniques. It allows methods to be assessed by cross-validation on a training sample of texts with known class membership (known authorship in this context) and also applied to classify holdout texts that may include disputed items or classes unseen in the training data.
Experience in this field has taught us that no single method is likely to emerge
as
The first three of these methods use individual words as features. Thus they
follow what has become the conventional "bag-of-words" approach, even though
language is not merely a bag of words.
Prior to applying our toolkit of text-classification techniques to the main problem under study, we thought it instructive to discover how the methods performed in a case study where we possess privileged information.
Once again, this preliminary investigation concerns the corpus summarized in Table 4 and detailed in Appendix 2. These texts are those investigated by exploratory methods in subsection 3.4, above, including the co-authored text of
The TOCCATA program initially runs a subsampling phase on the training data. In this phase, it repeatedly picks a random subsample of n texts from the training data of N texts, where n = int(sqrt(N)). It then builds a model using the remaining larger sample (size N-n) and uses that model to predict the categories of the items in the smaller sample. In the present case, the larger subsample would contain 24 (= 29 - 5) texts and the smaller 5 texts. This random subsampling is repeated until the required number of held-out items have been classified (255 trials in the present instance). This implements a mode of cross-validation, meaning that the success-rate statistics printed at the end of this process should be a relatively unbiased estimate of how the method would perform with fresh unseen data.
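The subsampling scheme can be sketched as follows, with a deliberately trivial stand-in classifier in place of a real TOCCATA method (classify_fn and the toy data are our inventions for illustration only):

```python
import math
import random

def subsample_cv(items, classify_fn, trials_needed):
    """Repeated random subsampling as in TOCCATA's first phase (a sketch).

    items       : list of (text, label) training pairs
    classify_fn : hypothetical stand-in for a TOCCATA method; builds a model
                  from the larger sample and predicts a held-out text
    Returns the proportion of held-out predictions that were correct.
    """
    N = len(items)
    n = int(math.sqrt(N))              # size of each held-out subsample
    done = correct = 0
    while done < trials_needed:
        random.shuffle(items)
        held_out, train = items[:n], items[n:]
        for text, label in held_out:
            correct += (classify_fn(train, text) == label)
            done += 1
    return correct / done

# Toy 1-nearest-value "classifier" on well-separated numeric classes.
nearest = lambda train, x: min(train, key=lambda t: abs(t[0] - x))[1]
data = ([(i, "low") for i in range(10)] +
        [(i + 100, "high") for i in range(19)])   # N = 29, so n = 5
rate = subsample_cv(data, nearest, trials_needed=255)
# rate == 1.0 here, since the toy classes cannot be confused
```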
Part of the output from this phase, using repeated subsampling on the 29 training texts with the MAWS method, is reproduced below.
Here the cross-validated success rate is 84.71 percent, respectable though not outstanding in a 2-class problem. We can see from these figures a slight bias, in that Recall is higher (and Precision lower) for DH than RF. This indicates that the method is slightly more likely to predict DH than it should.
More interesting is what happens in the third phase, when the system applies
the model generated from all 29 training texts to classifying the 16 holdout
texts, which it did not see during the training phase. Here the system
classifies each text as belonging to whichever class it estimates to be more
likely. It also ranks them according to a certainty measure, called
All TOCCATA models work by assigning similarity scores to each text that
measure the similarity of that text to the models of each category in the
training data.
The value labelled
Overall, therefore, ranking by credence should give an empirically based indication of how much credibility to attach to each decision on the holdout sample. For the 16 holdout texts under consideration, the output is listed below.
Thus, for example, DH_ProphetVoice_1991 with credence 0.9436 is very confidently ascribed to DH; while RF_Fed1_1995 with credence 0.0784 is ascribed to RF but with very little confidence. It is correct, but only just. (In this example all 10 holdout texts of known authorship are correctly classified.)
The above listing illustrates the use of a single method, MAWS. Table 10, below, collates the results of six such runs. The 16 holdout texts are listed in Table 10 along with their rankings when each of the 6 chosen methods was used to create a model on the training data and that model applied to the holdout texts.
The entries in this table are ranked according to the values in the column
labelled
Overall, this aggregation of results from 6 methods shows that DH_ProphetVoice_1991 was judged most surely by DH and RF_Robayes_1996 most surely by RF. Of the top 6, four are by DH and two by DH with a co-author not present in the training sample. Of the bottom 6, four are by RF and two, again, by RF with co-authors absent from the training sample.
The interesting items are the middle four, i.e. those decisions where the system is most uncertain. Two of these are portions of
This output is more naturally interpreted with the aid of a graph. Figure 8 shows this data in 2 dimensions. The horizontal axis simply shows the scores used to rank Table 10, divided by 96, their theoretical maximum. The vertical axis is the average credence score given by TOCCATA to the decision. It is an index of certainty, as explained above.
Since these scores are related, an approximate U-shape or V-shape of the
plotted points is inevitable. However, the
This gives us an idea of what to expect when the same procedure is applied to a more genuinely problematic case. It is apparent that the document written jointly by both the contrasted authors (CO_Feds_1995) sits almost exactly on the borderline of the polarity between them. Interestingly, we see that co-authored pieces where the second author is not part of the contrast in focus can be assigned quite confidently to the author who is part of that contrast. The single distractor, MO_Autho_2014, does not gravitate strongly towards either pole, although appearing somewhat more Forsythian than Holmesian. The two pieces that land so close together as to be almost illegible on the plot are CO_FoundTran_2014 and RF_Fed4, both on the RF side.
From this graph it would seem that anything with a mean credence score of
less than about 0.4 cannot confidently be ascribed to either of the two
authors being compared, as this implies a relatively neutral polarity whose
small difference from zero could easily be due to chance. This would put 6
of the 16 texts into a zone of uncertainty. The two texts deepest into this
zone of uncertainty are RF_Fed1 and DH_Fed5. These are the (historical)
introduction and conclusions sections respectively. Although we wrote these
individually, they are the most general sections, arising out of many
discussions and attempting to convey a coherent overall message. Perhaps it
is unsurprising that we achieved something like a common style for this
joint enterprise. By contrast, in parts 2, 3 and 4, we were describing the
results of applying our own specialized analytical techniques. It is
evident, therefore, that despite using methods that rely mainly on frequent
function words or part-of-speech tags, the factors of topic and/or genre
still have some influence on the classification process. Hence this method
is only partially immune to effects other than authorship.
Thus, turning back to Paine and his contemporaries, it must be acknowledged that we have no magic bullet, but we do have a way of separating some of the stronger signals from the noise.
For the corresponding experiment on the
The idea of treating Non-Paine as a distinct text category raises several
issues, not least of which is: how could anyone sample the writings of this
composite
As an illustration, an extract from the output listing using the tokspans method with a spansize of 3 is reproduced below.
Once again the credence ranking does what it is meant to do: there are 2 mistakes, but they lie at the bottom of the ranking; in fact, the last 6 of these 20 items consist of four queried cases and two errors, both errors involving authors outside the training data, the final item even being in another language.
The top 10 NonPaine & top 10 Paine token spans actually chosen in this run are shown in Table 11.
This shows that many of the items are shorter than the maximum spansize of 3.
For example, the span ('it', 'is') defines a triple containing 'it' and 'is'
in that order along with one other word not in the high-frequency list,
which in this case consisted of 112 words. Thus
This holdout sample includes five texts treated as being uncontentiously ascribed to Paine (including TROM_Rights_1 which is the initial part of the
The two other dubious texts are TROM_Mid1 and TROM_Mid2, the first and second half of the 6000-word passage immediately preceding the
Finally, the non-Paine texts include three
In this case, all five texts unambiguously by Paine are clearly shown as such. In addition the four
The three clear-cut
We repeated this procedure a number of times, selecting different training and holdout sets at random, though keeping
We regard these results as further supportive evidence for Jonathan Clark's contention. In this connection it may be worth noting that TROM_Mid2, the second half of the doubtful passage is rated somewhat more similar to Paine than the first half. This could be a genre effect, in that the former section is a brief history of the dramatic events leading up to the King's journey from Versailles to attend the Parliament in Paris in which a starring role is played by a certain M. de la Fayette — rather like some parts of the autobiographical work
In summary, this form of analysis, based on using the TOCCATA package, tends to support the findings of our analysis in section 3, based on a Burrows-style approach. This lends further weight to the assertion that the queried 6000-word portion of the
Up to this point, we have used prior knowledge, in the case of our own paper, or
prior hypothesis, in the case of TROM, to divide our questioned texts into
portions for individual scrutiny. This approach could be viewed as falling
within a hypothesis-testing paradigm. Recently, however, an alternative, more
exploratory, mode, known as
In this approach, a questioned text is divided sequentially into fixed-length blocks, usually overlapping, and each block is compared to a training sample of texts with known authorship (or, more generally, known class membership). Then some measure of proximity or distance to the known texts (or an aggregation of them by category) is computed for each block, and these scores are plotted as y-values against sequential position in the questioned text.
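The block-slicing step of rolling stylometry can be sketched as follows. Dropping the trailing partial block is one common convention, assumed here; it is not necessarily what stylo itself does:

```python
def rolling_blocks(tokens, size=1000, step=500):
    """Yield (start_position, block) pairs of overlapping fixed-length blocks.

    The trailing partial block is dropped; this is one common convention
    (an assumption here, not necessarily stylo's behaviour).
    """
    for start in range(0, len(tokens) - size + 1, step):
        yield start, tokens[start:start + size]

tokens = [f"w{i}" for i in range(2200)]      # stand-in for a tokenized text
blocks = list(rolling_blocks(tokens))
# Each block could now be scored against each author's training sample
# and the per-block distances plotted against start position.
```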
This approach suits our research question very well, so we decided to apply the
rolling-stylometry functions of the stylo package
To produce Figure 11, we selected 14 of our larger texts, nine by Paine, four by Lafayette and one by Hamilton, for comparison with TROM and used the rolling.delta() function of the stylo package. (For this purpose we joined together the separate portions of
The black points and lines show the Delta-distances of each block from the texts by Paine; the green and red lines show distances to Lafayette's works; and the blue line, for reference, shows distances to the control text by Hamilton. The red line is in almost all places the most distant from the TROM sections, presumably indicating a genre difference: it is an inter-personal letter (written in English). However, there are three sections where the lowest, i.e. most similar, text is by Lafayette. The first of these starts at position 6000; the second at position 9000; and the third falls between 23000 and 25000, almost in the middle of the passage queried by Clark.
We take this third section as further corroboration of the results reported above in sections 3 to 4. More interesting in this context, perhaps, is the indication that two earlier sections of the work are also atypical of Paine's style, and indeed similar to Lafayette's. To investigate this further, we applied a second function supplied with the stylo package, namely rolling.classify(). For this we again used a block size of 1000 words with a step size of 500. Other settings were system defaults. We used the same comparison texts, except that Hamilton's was omitted, to give a simple 2-class classification problem. The program produced the graph shown here as Figure 12.
Here the portions ascribed to Paine are coloured green, while those ascribed to
Lafayette are coloured red. The three horizontal bars of colour indicate the
system's first, second and third choice category. Again we see a work mostly
attributed to Paine, but with three places where the text resembles writing by
Lafayette. The dotted vertical lines labelled
The earlier portions, just after position 5000 and just before position 10000 are
also intriguing. To investigate these in particular, we developed a sequential
version of the TOCCATA software described earlier which allows us to use
TOCCATA's methods in a
This Python3 program (slabsim.py) operates in two distinct phases. Phase 1 uses
the leave-1-out method of cross-validation in which one text at a time is
removed from the training sample and a block of the requested size is taken at
random from this left-out text (or two blocks if the text is large enough), to
be classified by a model formed from the remaining training texts. Since TOCCATA
models produce similarity scores, this permits collection of a series of
similarity scores of text blocks not used to form the model when compared with
that model. This set is sorted and can be used in the subsequent phase to yield
empirical
For example, let us consider a case using deltoid similarity (1/Delta, since TOCCATA works with similarities not dissimilarities). If the training texts yield 80 blocks of the requisite size having deltoid similarity scores ranging from 0.8 to 2.5 with reference to the models built by excluding each of those 80 blocks in turn, these 80 numbers can be sorted and saved for phase 2. Let us further suppose that the 8th and 9th largest of these 80 similarity values are 1.7 and 1.6.
In phase 2, the program runs sequentially through the blocks of the test text. For each block a similarity score is computed comparing the text block to the training-set model. This score is not used directly; instead it is expressed as a centile or percentage with reference to the set generated during the leave-1-out phase. To continue the above example, if the deltoid similarity for a given block were 1.65, this would place it between the 8th and 9th highest of the values retained from phase 1. This beats all but 8 (i.e. 72) of the 80 scores, so it would be given a centile score of 72/80, or 90 in percentage terms. Thus the centile score for each block in the test text is the percentage of similarity scores obtained during the cross-validation phase which are beaten by, i.e. lower than, the score obtained for the block concerned when the classification model is applied to the text under consideration.
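The centile computation can be sketched as follows, reproducing the worked example with the deltoid similarity. This is a minimal illustration, not the actual slabsim.py code:

```python
import bisect

def centile_score(phase1_scores, value):
    """Percentage of phase-1 similarity scores strictly beaten by `value`."""
    ranked = sorted(phase1_scores)
    beaten = bisect.bisect_left(ranked, value)   # count of scores below value
    return 100.0 * beaten / len(ranked)

# The worked example: 80 phase-1 scores whose 8th and 9th largest
# are 1.7 and 1.6 (the remaining values are arbitrary fillers here).
scores = [1.7] * 8 + [1.6] + [1.0] * 71
centile = centile_score(scores, 1.65)
# centile == 90.0
```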
One reason for this two-step procedure is that many different similarity measures are available in TOCCATA, some of which have no natural interpretation; converting to centiles puts them all onto a common scale. Another is that the initial cross-validation phase gives an unbiased empirical estimate of the expected distribution of similarity scores for unseen blocks of this size and known category when compared with a model formed from a training set of that category.
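The centile conversion just described can be sketched as follows; `centile_score` and the synthetic data are our own illustrative names, not part of slabsim.py itself.

```python
import bisect

def centile_score(block_score, phase1_scores):
    """Percentage of phase-1 (leave-1-out) similarity scores that are
    strictly lower than the score obtained for the current test block."""
    ranked = sorted(phase1_scores)
    beaten = bisect.bisect_left(ranked, block_score)  # how many scores are beaten
    return 100.0 * beaten / len(ranked)

# Synthetic version of the worked example: 80 phase-1 scores,
# 72 of which lie below the block's deltoid similarity of 1.65.
phase1 = [1.0] * 72 + [2.0] * 8
print(centile_score(1.65, phase1))  # -> 90.0
```

Note that a block scoring above every phase-1 value receives a centile of 100, and one scoring below them all receives 0, matching the interpretation of centiles as percentages of beaten scores.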
Figure 13 shows results of applying the Vote method to successive 1000-word
blocks of TROM, again with a 500-word step size, using 86 training texts, 24 by
Paine and 62 by our other authors, including 17 by Lafayette, all categorized
for this exercise as
Here the vertical lines mark the start of the section queried by Clark and the end of the
The graph broadly agrees with the results of
Nearly as dramatic is the dip before word-position 10000. This is due to two groups of 3 points which on this measure are unlike Paine's style. The first defines a segment of 2000 words starting at position 5000; the second defines a segment of 2000 words beginning at position 8000.
It is natural to wonder what these sections of the work contain. Referring to the two earlier, smaller, atypical sections as Block 1 and Block 2, and the main atypical section as Block 3, Appendix 7 shows the paragraphs nearest the beginnings and ends of these blocks. All three blocks describe historical events leading up to the
energetic apostrophe by M. de la Fayette.
To illustrate what sort of plot would result from applying this technique, with the same training set and the same block size and step size, to a different text, we created an artificial composite text-file consisting of the following four files: Lafayette_In America_2b, Queried_Federalist_19 (jointly written by Hamilton & Madison), PaineT_AgeReason_2iii, and Lafayette_toWashington_17790613. None of these was present in the training sample of 86 texts used above. The part by Paine of this mixed-author text begins at position 5306 and ends at position 10080.
Of course, concatenating slabs by different writers is not how co-authorship
operates in reality, except perhaps in extreme cases. Among other things, the
topic focus of each part of this synthetic
As might be expected, this gives a more clear-cut picture. The vertical lines delimit the portion by Paine. Note that the point preceding the left vertical line, at startpos 5000, does contain some text from Tom Paine's section. Likewise, the two points preceding the right vertical line contain text by Lafayette.
What this highlights, in comparison with Figure 13, is that the section of TROM following the
A plausible explanation for this pattern would be that Lafayette supplied Paine with a historical account of events leading up to the
This conjecture is compatible with the stylometric evidence, and indeed suggested by that evidence, though it would require corroboration by information external to the texts here examined to be accepted as historical fact.
The results from the foregoing investigations lead us to believe that the
contention of Jonathan Clark (2015) is highly credible.
All three passages refer to momentous historical events, so the possibility remains that Paine adapted his habitual language when writing in a genre that he seldom otherwise used. Nevertheless, Paine's familiarity with Lafayette, who had first-hand knowledge of these events, along with the fact that the main queried passage yields results very like those of our known co-authorship (subsection 4.1), leads us to believe that some degree of co-authorship is the most likely explanation. The fact that all three passages assign a central role to Lafayette merely strengthens this belief.
A likely scenario is that Lafayette supplied written descriptions of the events about which he knew more than Paine, which Paine then inserted into his book at points he considered suited to the thrust of his argument, with greater or lesser degrees of editing. Stylometry alone cannot prove this. The putative co-authors are dead. But the stylometric evidence presented here is entirely consistent with such a supposition.
From a methodological viewpoint, we believe this study shows that the TOCCATA
package warrants consideration as an addition to the toolkit of stylometric
researchers. Furthermore, using
We wish to thank Juliana Hessel of The College of New Jersey for collecting some of the electronic texts used in this investigation and for performing a pilot study on this question. We would also like to thank two anonymous reviewers for helpful comments on a previous draft of this paper.
A zipped file containing these texts is available for research purposes at: http://www.richardsandesforsyth.net/pubs.html.
These texts were only lightly pre-processed before analysis. Normalization was
limited to standardizing the character-representations of apostrophes, dashes,
hyphens and quotation marks. Embedded quotations were not deleted except in the
single case of TROM_Declaration itself, which we believe
is the longest quoted passage in any of these works. This was extracted into an
individual file (listed above).
These texts were pre-processed in the same way as those listed in Appendix 1; i.e. apostrophes, dashes, hyphens and quotation marks were standardized. In addition, in those papers where Reference Lists were present, they were removed.
In the studies reported above, the setting of parameter
Module Deltoid is an implementation of Burrows's delta, in which delta values (di)
are converted to similarities as 1.0/di. The
number, N, of most-frequent words to employ is a user-selectable parameter
but if this is absent the system sets N to be the square root of the
vocabulary size V (i.e. total different vocabulary items, not total running
tokens), which is usually a reasonable choice. In the present study the
default, N=round(sqrt(V)), was used.
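As a rough illustration of the computation described above, assuming relative word frequencies as input; the function name and data layout are ours, not TOCCATA's:

```python
import math

def deltoid_similarity(test_freqs, model_means, model_sds, words):
    """Burrows's delta over the chosen word list (mean absolute
    z-score difference), returned as a similarity 1.0/delta,
    so that larger values mean greater likeness."""
    zdiffs = [abs((test_freqs[w] - model_means[w]) / model_sds[w])
              for w in words]
    delta = sum(zdiffs) / len(zdiffs)
    return 1.0 / delta

# Default feature-list size used in the study: N = round(sqrt(V)),
# where V is the number of distinct vocabulary items, not running tokens.
V = 10000
N = round(math.sqrt(V))  # -> 100
```

With, say, two words each one standard deviation from the model means, delta is 1.0 and the similarity is likewise 1.0.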
This library module implements a method inspired by what Mosteller &
Wallace, in their classic work (1964/1984) on the disputed Federalist
papers, call their
Module Vote is exceptional in that it actually uses every single word-type in
the training corpus as a feature. The
This method attempts to capture some of the information inherent in
Step 1 goes through the texts examining each segment of S consecutive words,
where S is a parameter called spansize, set by the user. With spansize=3, as
in the trials reported above, the triplet
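Step 1 might be sketched like this (our own naming, not the TOCCATA library's); with spansize=3, every run of three consecutive tokens becomes a candidate feature:

```python
def token_spans(tokens, spansize=3):
    """Yield each segment of `spansize` consecutive tokens as a tuple,
    preserving within-span order."""
    return [tuple(tokens[i:i + spansize])
            for i in range(len(tokens) - spansize + 1)]

print(token_spans("these are the times that".split()))
# -> [('these', 'are', 'the'), ('are', 'the', 'times'), ('the', 'times', 'that')]
```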
This method actually uses the same software library as Tokspans, above, with
different parameter settings. Among the chief differences are that sets of
tokens were collected, not tuples, i.e. that sequence within spans was
ignored, and that part-of-speech tags (as assigned by the GoTagger; see
Appendix 6) were used as tokens not orthographic words. In addition, the
spansize parameter was set to 5. Hence, again, many sets chosen as features
consist of fewer than 5 items. For example, the quintet dt nn of dt nn
would thus generate the potential feature ['dt', 'nn', 'of'], since sets
don't include replications. (On output, the items are listed in alphabetic
order, which may not reflect the order in the original text.)
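A minimal sketch of this set-based variant, using the worked example above (the function name is ours, not from the TOCCATA library):

```python
def tag_sets(tags, spansize=5):
    """Collect the set of tags within each `spansize`-tag window, listed
    in alphabetic order; repeated tags collapse into one item, so many
    feature sets contain fewer than `spansize` members."""
    return [sorted(set(tags[i:i + spansize]))
            for i in range(len(tags) - spansize + 1)]

# The worked example from the text: the quintet dt nn of dt nn
print(tag_sets(['dt', 'nn', 'of', 'dt', 'nn']))  # -> [['dt', 'nn', 'of']]
```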
Taverns is another method exploiting sequential information. The name stands for Textual Affinity Values Employing Repeated N-gram Sequences. It employs a technique borrowed from the formulib package, developed to explore formulaic language, which is available at the website below.
http://www.richardsandesforsyth.net/software.html
Essentially this method allows short n-grams to overlap. In the experiments reported above, n-grams of length from 2 to 4 items were generated for each text category and the most frequent at each size retained. Thus, for instance, in Tom Paine's writings the 4-grams "of the united states" and "united states of america" and the 3-gram "the people of" were kept, among many others.
When classifying a fresh text using the list of frequent n-grams in each category, the occurrences of each are not just counted. Rather, the proportion of the text covered jointly by all the n-grams in the list is computed. Thus, if the text contained the 8-gram "the people of the united states of america" that would count as 8 tokens covered by a combination of the three items above: "the people of" would be marked as covered by the 3-gram; "of the united states" would be covered by the first 4-gram; "united states of america" would be covered by the second 4-gram. The fact that some words were covered twice wouldn't matter. The eventual similarity score would only depend on what proportion of tokens in the text being classified had been covered overall, not on how many times each word was covered nor how many times particular n-grams had been found in the text. Note that any word can appear in an n-gram: in our experiments there was no preliminary exclusion of words on the basis of their individual frequencies.
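The coverage computation can be sketched as follows; this is our own illustration of the idea, not the Taverns implementation:

```python
def coverage_proportion(tokens, ngrams):
    """Proportion of token positions in `tokens` covered by at least one
    occurrence of the retained n-grams; overlaps and repeated matches
    earn no extra credit, since only membership in the covered set counts."""
    covered = set()
    for gram in ngrams:
        g = gram.split()
        for i in range(len(tokens) - len(g) + 1):
            if tokens[i:i + len(g)] == g:
                covered.update(range(i, i + len(g)))
    return len(covered) / len(tokens)

# The worked example from the text: all 8 tokens are covered by the
# three retained n-grams, despite the overlaps between them.
text = "the people of the united states of america".split()
grams = ["the people of", "of the united states", "united states of america"]
print(coverage_proportion(text, grams))  # -> 1.0
```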
The GoTagger software, by Kazuaki Goto, is freely available at the following website: http://web4u.setsunan.ac.jp/Website/TreeOnline.htm
It is not the most modern part-of-speech tagger, and its accuracy is not state-of-the-art by current computational-linguistics standards; but it is fast, free and consistent, the last being the most important attribute in the context of authorship attribution. We have written a short post-processing program to deal with some quirks of GoTagger that caused problems for TOCCATA's tokenization routine, so the codes above are not quite identical with those listed on the website above. (Punctuation was not used in the experiments reported, so the last 7 symbols above were ignored.)