DHQ: Digital Humanities Quarterly
2021
Volume 15 Number 1
Can an author style be unveiled through word distribution?
Abstract
The inclusion of semantic features in the stylometric analysis of literary texts appears to be poorly investigated. In this work, we experiment with the application of Distributional Semantics to a corpus of Italian literature to test whether word distribution can convey stylistic cues. To verify our hypothesis, we have set up an Authorship Attribution experiment. The results we have obtained suggest that the style of an author can reveal itself through word distribution too.
Introduction
Stylometry, that is, the quantitative study of linguistic style, offers a
means of capturing the elusive character of an author’s style by quantifying
some of its features. The basic stylometric assumption is that each writer has a
“human stylome”
[Van Halteren et al. 2005], that is a set of certain stylistic
idiosyncrasies that define their style. Analysis based on stylometry is often
used for Authorship Attribution (AA) tasks, since the main idea behind
computationally supported AA is that by measuring some textual features, it is
possible to distinguish between texts written by different authors [Stamatatos 2009]. One of the less investigated stylistic features
is the way in which authors use words from a semantic point of view, e.g. whether,
when dealing with polysemous words, they tend to prefer a certain sense over
the others, or senses that differ (even slightly) from the one that is more
commonly used (as typically happens in poetry). It would then be
interesting to discover whether the semantics an author bestows on words is actually
part of their “human stylome.” Computing semantics,
though, is far from easy.
A possible approach to the analysis of this characteristic is to consider the
textual contexts in which certain words appear. According to Distributional
Semantics (DS), certain aspects of the meaning of lexical expressions depend on
the distributional properties of such expressions, or better, on the contexts in
which they are observed [Lenci 2008]
[Miller and Charles 1991]. The semantic properties of a word can then
be defined by inspecting a significant number of linguistic contexts,
representative of the distributional behavior of such word.
In this work, we would like to investigate if the analysis of the distribution
of words in a text can be exploited to provide a stylistic cue. In order to
inspect that, we have experimented with the application of DS to the stylometric
analysis of literary texts belonging to a corpus constituted by texts pertaining
to the work of six Italian writers of the late nineteenth century. In the
following, Section 2 both provides a brief survey on the works related to
stylometry and introduces the fundamental Distributional Semantics concepts on
which this work relies. Section 3 describes the approach together with the
corpus used to conduct our investigation and Section 4 discusses results.
Finally, Section 5 draws some conclusions and outlines some possible future
work.
Background Knowledge
Stylometry
Although the first attempts at quantifying the writing style of an author date
back to the first half of the 20th century ([Yule 1938]
[Yule 1944]
[Zipf 1932]), the work that is believed to be the starting
point of the so-called “non-traditional” Authorship Attribution is a
study by Mosteller and Wallace
(1964). They conducted an investigation on the authorship of the
“Federalist Papers,” a series of political
essays written by John Jay, Alexander Hamilton, and James Madison, 12 of
which have ambiguous paternity, being claimed by both Hamilton and Madison.
From then on (to, at least, the end of the 1990s), research in AA mainly
coincided with “stylometry,” i.e. defining features to quantify the
style of an author by using measures based on counting frequencies of words,
characters, and sentences [Holmes 1994]
[Holmes 1998].
Although they worked well, these systems followed a methodology that
suffered from some limitations, such as the limited statistical homogeneity of
the analyzed texts or the fact that the evaluation of the developed methods
was mainly intuitive, using corpora that were not controlled for topic;
moreover the lack of benchmark data made it difficult to compare different
methods. These problems were partially overcome at the end of the 1990s,
when the internet made a vast amount of electronic texts available, also
highlighting all different areas in which AA could be useful, beyond that of
literary research (e.g. intelligence [Abbasi and Chen 2005],
criminal and civil law [Chaski 2005]
[Grant 2007], computer forensics [Frantzeskou et al. 2006] and so on). Nowadays, the main emphasis in AA tasks is on the objective
evaluation of the proposed methods using common benchmark corpora [Juola 2004].
As previously stated, the very first attempts to analyze the style of an
author were mainly based on lexical features, such as sentence length
or word length counts, which can be applied to any language and corpus
without additional requirements [Koppel and Schler 2004]
[Stamatatos 2006]
[Zhao and Zobel 2005]
[Argamon et al. 2007]. Character measures, too, have been proven to
be useful in quantifying the writing style. A text can then be viewed as a
sequence of characters on which various measures (such as alphabetic, digit,
uppercase and lowercase characters count) can be defined [Zheng et al. 2006]
[Grieve 2007]
[De Vel et al. 2001]. A more elaborate text representation method is
based on the assumption that authors tend to use similar syntactic patterns,
so syntactic information is employed, being considered a more reliable
authorial fingerprint than lexical information [Gamon et al. 2004]
[Stamatatos et al. 2000]
[Stamatatos et al. 2001]
[Hirst and Feiguina 2007]
[Uzuner and Katz 2005].
Very few attempts to exploit high-level features for stylometric purposes
have been made, due to the fact that tasks such as full syntactic parsing,
semantic analysis, or pragmatic analysis cannot yet be handled adequately by
current NLP technologies. The most important method of exploiting semantic
information, so far, was based on the theory of Systemic Functional Grammar
(SFG) [Halliday 1994] and consisted in the definition of a set
of functional features that associate certain words or phrases with semantic
information, as described in Argamon et al.
(2007).
However, the goal of our work is to use only information about word
distribution, in order to discover whether a correlation between an author's
style and word distribution exists. In order to analyze word distribution,
we rely on the Distributional Hypothesis and, consequently, on
Distributional Semantics. Their theoretical basis will be presented in the
next subsection.
Distributional Semantics
The assumption behind all distributional semantics models (DSMs) is that the
notion of semantic similarity can be defined in terms of linguistic
distributions. This is known as the Distributional Hypothesis, which is
stated as follows: “The degree of semantic similarity between two
linguistic expressions a and b depends on the similarity of the
linguistic contexts in which a and b can appear.” In accord with this
definition, certain aspects of the meaning of lexical expressions depend on
the contexts in which they are observed. The semantic properties of a word
can then be defined by inspecting a significant number of linguistic
contexts, representative of the distributional behavior of such word.
Despite the huge consensus reached lately by this theory in Computational
Linguistics, its origins reside in the context of Zellig Harris’ proposal of
Distributional Semantics as the bedrock of linguistics as a scientific
discipline [Harris 1970]. Harris’ proposal was conceived for
phonemic analysis and later turned into a general methodology to be applied
at every linguistic level. The distribution procedure was regarded as a way
for linguists to give a methodological base to their analysis. He then not
only extended his theory to meaning but also assumed that meaning could actually
be explained on distributional grounds. The Distributional Hypothesis can be
considered a cognitive hypothesis about the form and origin of semantic
representations [Miller and Charles 1991]. Some of the most
influential models for distributional semantics, such as Latent Semantic
Analysis (LSA; [Landauer and Dumais 1997]) and Hyperspace Analogue
to Language (HAL; [Burgess and Lund 1997]) have been developed for
cognitive and psychological research, meant to represent how semantic
representations are learned by extracting co-occurrence patterns [Landauer 2007].
Within the corpus linguistics tradition, there was no need to motivate the
Distributional Hypothesis as a methodological principle for semantic
analysis. This is best summarized in the well-known slogan by Firth:
“You shall know a word by the company it
keeps”
[Firth 1957]. In corpus linguistics, the Distributional
Hypothesis is often claimed to be the only possible source of evidence for
the exploration of meaning.
Distributional Semantics has gained popularity in computational linguistics
starting from the late 1980s, with the progressive predominance of
corpus-based statistical methods for language processing. Within this new
paradigm, statistical methods were naturally applied to lexical-semantic
analysis. Corpora are indeed connected to Distributional Semantics since
they can be used as repositories of linguistic usages, thus representing the
primary source of information to identify the distributional properties of
words. Their role has been enhanced by the current availability of
huge collections of texts, together with increasingly sophisticated
computational linguistics techniques to process them and extract the
relevant context features to build distributional semantic representations.
Although it is currently corpus-based, distributional semantics is not
prevented from addressing aspects of the format and origin of word meaning and
the issue of how and to what extent features extracted from the linguistic
input actually shape meaning.
The way in which it is possible to infer a geometric
representation starting from distributional information can be traced back
to Harris (1970), who writes that “the distribution of an element will be understood as the
sum of all its environments.” In linguistics, an environment is
called a context, and we here assume a context to be the setting of a word,
phrase, etc., among the surrounding words, phrases, etc., often used for
helping to explain the meaning of the word, phrase, etc.
One way to collect this information is to tabulate the contextual
information, so that for each word we provide a list of the co-occurrents of
the word and the number of times they have co-occurred. In a second step, we
take away the actual words and only leave the co-occurrence counts. Also, we
make each list equally long by adding zeroes in the places where we lack
co-occurrence information. We also sort each list so that the co-occurrence
counts for each context come in the same places in the lists. The
mathematical backbone of Latent Semantic Analysis [Landauer and Dumais 1997] is Singular Value Decomposition, a
well-known linear algebra technique aimed at extracting the most informative
dimensions in a matrix of data and here used to reconstruct the latent
structure behind the distributional hypothesis [Deerwester et al. 1990]. The names vector semantics, word space and semantic space merely highlight
specific mathematical techniques used to formalize the notion of contextual
representation. This information can then be modeled as vectors, as
described in Schütze (1992, 1993), who builds context vectors
(which he calls “term vectors” or “word vectors”) in the following way: co-occurrence
counts are collected in a words-by-words matrix, in which the elements
record the number of times two words co-occur within a set window of word
tokens.
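The tabulation described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the experiment: the function names, the toy sentence and the window size of 2 are assumptions introduced here.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count how often each word co-occurs with every other word
    within `window` tokens to its left and to its right."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts


def to_matrix(counts):
    """Drop the words and keep only the counts: every row has the same
    length and the same column order, with zeroes where no co-occurrence
    was observed."""
    vocab = sorted(counts)
    matrix = [[counts[word][context] for context in vocab] for word in vocab]
    return vocab, matrix


# Toy input, invented for illustration only.
tokens = "the dog chased the cat and the cat chased the mouse".split()
counts = cooccurrence_counts(tokens, window=2)
vocab, matrix = to_matrix(counts)
```

Each row of `matrix` is the context vector of the word at the same position in `vocab`. Because the window is symmetric, the resulting words-by-words matrix is symmetric as well.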
Context vectors are then defined as the rows or the columns of the matrix
(the matrix is symmetric, so it does not matter if the rows or the columns
are used). A cell fij of the co-occurrence matrix records the frequency of
occurrence of the word i in the context of the word j or of the document j,
as shown in Figure 1. Context vectors do not only allow us to go from distributional
information to a geometric representation, but they also make it possible
for us to compute proximity between words. Thus, the point of the context
vectors is that they allow us to define (distributional, semantic)
similarity between words in terms of vector similarity. There are many ways
to compute the similarity between vectors, and the measures can be divided
into similarity measures and distance measures. The difference is that
similarity measures produce a high score for similar objects, whereas
distance measures produce a low score for the same objects: large similarity
equals small distance, and conversely. A convenient way to compute
normalized vector similarity is to calculate the cosine of the angle
between two vectors x and y, defined as:
$$ sim_{cos}(\vec x,\vec y) = \frac{\vec x \cdot \vec y}{|\vec x| \cdot |\vec y|}=\frac{\sum_{i=1}^n x_{i} \cdot y_{i}}{\sqrt{\sum_{i=1}^n x_{i}^{2}} \cdot \sqrt{\sum_{i=1}^n y_{i}^{2}}}$$
The cosine measure
corresponds to taking the scalar product of the vectors and then dividing by
the product of their norms. It is the most frequently utilized similarity metric in
word-space research. It is attractive because it provides a fixed measure of
similarity: it ranges from 1 for vectors pointing in the same direction, through 0 for orthogonal
vectors, to -1 for vectors pointing in opposite directions. Figure 2 shows an example of the
graphical representation of word vectors related to the matrix depicted in
Figure 1. The words “cat” and “dog” in the matrix above are
depicted as never appearing together in the same context. This does not mean
that they are not semantically similar, because they can actually occur in
very similar contexts; it just means they do not appear together in the
context windows we defined. In any case, here we represent “cat” and
“dog” as dissimilar, the angle between their vectors being almost
90 degrees. The vector representing the word “animal” is instead
as similar to that of “dog” as to that of “cat”, while the
vector representing the word “canine” is closer to the vector
of “dog” than to that of “cat”, and the vector
of “feline” is closer to “cat” than to “dog”,
meaning that “canine” is more semantically similar to “dog” and
“feline” is more semantically similar to “cat”. Also, the
vectors of “canine” and “feline” are both close to “animal”,
suggesting that the two words are often used in similar contexts in the
texts analyzed to generate the co-occurrence vectors represented here. Of
course, this example is not representative of the linguistic and semantic
reality of things; it is purely illustrative and meant to
convey the concept of vectorial representation of semantic similarity.
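As a minimal sketch of the cosine measure defined in this section, the function below computes the scalar product of two vectors divided by the product of their norms. The four-dimensional vectors are invented for illustration and are not the values of Figure 1; returning 0 for an all-zero vector is a convention assumed here.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for a word with no observed contexts
    return dot / (norm_x * norm_y)


# Hypothetical co-occurrence counts, made up for this example.
vectors = {
    "dog": [5, 0, 4, 1],
    "cat": [0, 5, 1, 4],
    "canine": [3, 0, 2, 0],
}

dog_canine = cosine(vectors["dog"], vectors["canine"])
cat_canine = cosine(vectors["cat"], vectors["canine"])
```

With these invented counts, `dog_canine` is much larger than `cat_canine`, which is the geometric counterpart of saying that “canine” is distributionally closer to “dog” than to “cat”.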
Experimental Setup
First, we want to specify that it is not our purpose to propose new ways to
improve state-of-the-art AA algorithms. Indeed, our aim is just to verify the
hypothesis that the distribution of words can provide an indication of a
distributional stylistic fingerprint of an author. To do this, we have set up a
simple classification task. Subsection 3.1 briefly depicts the data set we used,
Section 3.2 describes why and how we chose the authors that would constitute our
dataset and Section 3.3 depicts the steps implemented in our experiment.
Data Set Construction
In order to build the reference and test corpora, we started from texts
pertaining to the work of six Italian writers working at the turn of the
20th century, namely, Luigi Capuana, Federico De Roberto, Luigi Pirandello,
Italo Svevo, Federigo Tozzi and Giovanni Verga. We chose contiguous authors
in a chronological sense, whose texts are available in digital format (in
fact, we could not carry out a similar survey on the narrative of the 1990s because
it is still under copyright). We used texts freely available for
download from the digital library of the Manuzio project, via the
LiberLiber website (www.liberliber.it). Since they were encoded in various formats,
such as .epub, .odt, and .txt, our pre-processing consisted in converting
them all to .txt format and removing all XML tags, together with
footnotes and editors’ notes and comments.
Authors and Texts Choice
Between 1875 and the early 1900s, a literary movement peaked in Italy:
Verismo (meaning “realism”, from Italian vero, meaning “true”).
The main exponents of this literary movement, as well as the authors of its
manifesto, were Giovanni Verga and Luigi Capuana. Verismo did not constitute
a formal school, but it was still based on specific principles, its birth
being influenced by a positivist climate which put absolute faith in
science, empiricism, and research and which developed from 1830 until the
end of the 19th century.
All the authors selected to build the corpus used for this work belong to
the time span in which literary Verismo developed, but not all of
them are proponents of this movement. Indeed, three of the selected authors are
considered to be Verist authors (Giovanni Verga, Federico De Roberto, and
Luigi Capuana) while three (Luigi Pirandello, Federigo Tozzi, and Italo
Svevo) are representative of another literary movement: Modernism. We
decided to choose texts pertaining to those authors and literary movements
firstly because they were all written in the same time span. This
allowed us to avoid any possible lexical bias due to the difference in
the language of works published in different epochs.
Also, the selection was performed having in mind a possible future evolution
of this work. Using texts pertaining to the same period, but to different
literary movements, would allow us to investigate the ability of our method
to recognize the literary movement the texts pertain to, instead of the
author that wrote them. This style-based text categorization task, known as
genre detection, is quite similar to authorship attribution, even if there
are characteristics that distinguish the one from the other. An important
question to investigate, then, is how it would be possible to discriminate
between two basic factors: authorship and genre, and if semantics could be
useful not only for recognizing the author of a literary work but also the
literary genre it pertains to.
Another line of research that has not been adequately examined so far is the
development of robust attribution techniques that can be trained on texts
from one genre and applied to texts of another genre by the same authors.
The way we selected and balanced the texts composing the corpus could be
useful for this task, too.
Experiment Description
According to Rudman (1997), a striking
problem in stylometry is due to the lack of homogeneity of the examined
corpora, in particular to the improper selection or fragmentation of the
texts that might cause alterations in the writers’ style. As already
described in the previous section, the corpus has been built according to
this assumption, trying to use texts pertaining to the same time span and
balanced between the two selected literary movements. Also, in order to
create balanced reference corpora, i.e. covering all the authors’ different
stylistic and thematic phases, for each author, as shown in Figure 3, we built a reference corpus by
combining 70% of every single work (usually a novel). The same
technique was used to create the test corpus by using the remaining 30% of
each work. Typical AA approaches consist of analyzing known authors and
assigning authorship to previously unseen text on the basis of various
features. Train and test sets should then contain different texts. Contrary
to the classical AA task, our train and test sets contain different parts of
the same texts. Indeed, with this experiment, we wanted to understand whether the
semantics that an author bestows on a word is peculiar to his writing. To
test this, we wanted to cover all the different stylistic and thematic
phases an author can go through during his activity, hence the partition of
all his texts into a reference and a test portion.
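The per-work partition can be sketched as follows. The text does not say which 70% of each work goes into the reference corpus, so the sketch assumes, purely for illustration, that the first 70% of the tokens is used and the remainder is kept for testing. The function names are hypothetical.

```python
def split_work(tokens, reference_percent=70):
    """Split one work into a reference part and a test part.

    Assumption: the reference part is the first `reference_percent`
    percent of the tokens; the source does not specify this detail.
    """
    cut = len(tokens) * reference_percent // 100
    return tokens[:cut], tokens[cut:]


def build_author_corpora(works, reference_percent=70):
    """Concatenate the reference parts and the test parts of all the
    works of one author, so both corpora cover every work."""
    reference, test = [], []
    for tokens in works:
        ref_part, test_part = split_work(tokens, reference_percent)
        reference.extend(ref_part)
        test.extend(test_part)
    return reference, test


# Two toy "works" made of integer placeholders instead of real tokens.
works = [list(range(10)), list(range(100, 120))]
reference, test = build_author_corpora(works)
```

Splitting inside each work, rather than assigning whole works to one side, is what lets both corpora span every phase of an author's production.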
Our work relies on the assumption that the works of an author are
representative of the author's thought, so it is assumed that the semantics
that an author associates with certain words is representative of his
thought. One possible flaw of this kind of approach is that if an author
changes the semantics that he associates with concepts in different works
over time, the method might not work. It can happen that an author who has a
long career changes his point of view on things, and this should obviously
be reflected in his works. This is one of the main reasons why we decided to use
all the works from each author as training text. We wanted to take into
account each different phase an author may go through during his career,
especially considering changes in the semantics associated with concepts.
Using all the works of each author gives us a complete picture of
the author, and allows us to understand the semantics associated with
concepts through all his work, even accounting for changes. In fact, the
association between words extracted using our method would highlight changes
in semantics by changing the score associated with a pair of words. Let us
hypothesize that a strong association for the young Verga (i.e. for Verga in
his first works) is sun and joy, while later on the strongest association
is, let us say, moon and joy. Our hypothesis would be that, while in his first
works Verga associated the concept of “sun” with that of “joy,”
later on it is the concept of “moon” that is associated with that of
“joy.” Deciding to use some works of Verga for training and some others
for testing might then be misleading, because what is semantically true for his
first works is not true later on. Using our method, if an association holds
just for some works, and not for all, its score is evened out and the pair of
words is not that semantically relevant, and thus is not used for
classification. Pairs with a high score are those for which the association
between the concepts holds throughout all the work of an author, or
which report a score that is so high in one or more works that it cannot be
evened out by the scores reported for the same association in all the
other works of that particular author.
We then analyzed each reference and test corpus with a Part-of-Speech (PoS)
tagger and a lemmatizer for Italian [Dell’Orletta et al. 2014]. For
every author, we built two lists of word pairs (with their lemma and PoS),
one relative to the tagged reference corpus (reference pairs) and the other
to the tagged test set (test pairs), where each word was paired with all the
other words with the same PoS. We also filtered the pairs to leave only
nouns, adjectives and verbs. Starting from the tagged corpora, we built two
words-by-words matrices of co-occurrence counts for each author. Since the
corpus is relatively small and poses no particular computational issues, we
chose not to apply decomposition techniques to reduce the size of the
matrices (and thus did not lose any information). We tried different
empirical settings of the window size and chose the one that gave the most
suitable results, in line with what is stated by Kruszewski and Baroni: the
context window was then set to 3 words prior and 3 words following the one
under examination [Kruszewski and Baroni 2014]. The chosen DS model
[Baroni and Lenci 2010] was applied to each matrix to calculate
the cosine between the vectors representing the two words of each pair. This
allowed us to evaluate the semantic relatedness between the words by
assessing their proximity in the distributional space as represented by the
cosine value: as explained in Section 2.2, the more this value tends to 1,
the more the two words of the pair are considered to be related. We then
obtained two related word pair (RWP) lists for each author A: RWPrefA and
RWPtestA. Figure 3 depicts the process described above.
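The construction of a related word pair list from a tagged corpus might look like the sketch below. It is not the DS model or the tagger output actually used: the tag names `NOUN`, `ADJ` and `VERB`, the choice of counting every surrounding lemma as context, and the toy input are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical tag names; the tagset of the actual tagger may differ.
CONTENT_POS = {"NOUN", "ADJ", "VERB"}


def context_vectors(tagged, window=3):
    """Build a sparse context vector for every noun, adjective and verb,
    counting the lemmas found 3 tokens before and 3 tokens after it."""
    vectors = defaultdict(Counter)
    for i, (lemma, pos) in enumerate(tagged):
        if pos not in CONTENT_POS:
            continue
        start = max(0, i - window)
        end = min(len(tagged), i + window + 1)
        for j in range(start, end):
            if j != i:
                vectors[(lemma, pos)][tagged[j][0]] += 1
    return vectors


def cosine(u, v):
    """Cosine between two sparse vectors stored as Counters."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_word_pairs(tagged, window=3):
    """Pair each word with every other word sharing its PoS and
    score the pair with the cosine of the two context vectors."""
    vectors = context_vectors(tagged, window)
    pairs = {}
    for a, b in combinations(sorted(vectors), 2):
        if a[1] == b[1]:  # same part of speech
            pairs[(a[0], b[0])] = cosine(vectors[a], vectors[b])
    return pairs


# Toy (lemma, PoS) sequence, invented for illustration.
tagged = [("il", "DET"), ("cane", "NOUN"), ("correre", "VERB"),
          ("il", "DET"), ("gatto", "NOUN"), ("dormire", "VERB")]
rwp = related_word_pairs(tagged)
```

Applying `related_word_pairs` once to the reference corpus and once to the test corpus of an author would give the two lists that the text calls RWPref and RWPtest.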
Hypothesis and Discussion
Since we wanted to focus on the analysis of the semantic distribution of words,
we decided to exclude any possible “lexical bias.” For this reason, we
restricted the analysis to a common vocabulary, i.e. a vocabulary constituted by
the intersection of the six authors’ vocabularies. In this way, we prevent our
classifier from exploiting, as a feature, the presence of words used by some (but not
all) of the authors. Moreover, we removed from the RWPtest lists all those pairs
of words occurring frequently together in the same context, since they might
constitute a multiword expression that, once again, could belong to the
signature lexicon of each author. To remove them, we computed the number of
times (#co-occ in Figure 4) they appeared together
in the context window, as well as their total number of occurrences (#occa and
#occb) and we excluded from the analysis those pairs for which the ratio between
the number of co-occurrences and the total occurrences of the less frequent word
was higher than the empirically set threshold of 0.5. The first two pairs of
Figure 4 would be removed as probable multiword expressions (PM column in Figure 4):
“scoppio” (burst) and “risa” (laughter) could mostly co-occur in
“scoppio di risa” (meaning “burst of laughter”) and the words
“man” and “mano” (both meaning “hand”) could mostly co-occur
in “man mano” (meaning “little by little,” or “progressively”).
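The multiword filter just described reduces to a single ratio test, sketched below. The function name and the counts in the example are hypothetical and are not taken from Figure 4. Only the 0.5 threshold and the use of the less frequent word come from the text.

```python
def is_probable_multiword(co_occ, occ_a, occ_b, threshold=0.5):
    """Flag a pair when the two words co-occur in the context window in
    more than `threshold` of the occurrences of the less frequent word."""
    less_frequent = min(occ_a, occ_b)
    if less_frequent == 0:
        return False  # a word that never occurs cannot form a multiword
    return co_occ / less_frequent > threshold


# Invented counts: the pair co-occurs 8 times, the words occur 10 and 40 times.
flagged = is_probable_multiword(co_occ=8, occ_a=10, occ_b=40)  # ratio 0.8
kept = is_probable_multiword(co_occ=2, occ_a=10, occ_b=40)     # ratio 0.2
```

Normalizing by the less frequent word makes the test sensitive to expressions in which one rare word appears almost only next to its partner.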
Finally, we reduced the size of the six RWPref and RWPtest lists by
sorting them in decreasing order of the cosine value and then by keeping the
pairs with the highest cosine, selected using a percentage parameter θ as a
threshold. We chose to introduce the parameter θ for two reasons: first of all,
we wanted to prevent the classification algorithm from being disturbed by noisy (i.e.
not significant) pairs which would not hold any relevant stylistic cue; secondly,
we wanted to ease the work of a literary scholar in interpreting the results, who
would have to analyze just a limited selection of (potentially) semantically related
word pairs. For the last phase of our experiment, we defined a classification
algorithm to test the effective presence of stylistic cues inside the obtained
RWPtest lists. We defined a classifier using a nearest-cosine method to
attribute each test list to an author. The method consisted in searching for each
pair of words contained in the test list inside each reference list and
incrementing by 1 the score of the author whose reference list included the pair
with the most similar cosine value (i.e. having the minimum difference): the
chosen author was the one with the highest score. Figure 5 shows the
classification results for θ = 5%. As summarized in Figure 6, the correct classification of all
RWPtest lists has been obtained with a θ value of 5%. To help in interpreting the failure of the algorithm in classifying
Tozzi’s test list for θ values lower than 5% (as shown in Figure 6) we
calculated the cardinality of the RWPtest lists for each author as the
θ value changes (Figure 7). It is possible to observe how the choice of θ influences the correct
classification of Tozzi’s test list. Indeed, the use of a θ value below 5% has
the effect of remarkably reducing an already small test list (RWPtestTozzi) as
shown in Figure 7. It is apparent that by increasing the value of θ, and
consequently the number of significant RWPs that are analyzed, the system is
able to correctly classify RWPtestTozzi (see the values in Tozzi’s row of Figure
6).
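The θ selection and the nearest-cosine attribution described in this section can be sketched as follows. This is an illustrative reading of the procedure, not the original code. It assumes that pairs are stored as dictionary keys in a consistent order, that the top θ% is rounded up, that test pairs absent from every reference list are skipped, and that ties go to the first author encountered. The author labels and cosine values are invented.

```python
import math


def top_theta(pairs, theta_percent):
    """Keep the theta percent of pairs with the highest cosine value."""
    ranked = sorted(pairs.items(), key=lambda item: item[1], reverse=True)
    keep = math.ceil(len(ranked) * theta_percent / 100)  # rounding is assumed
    return dict(ranked[:keep])


def classify(test_pairs, reference_lists):
    """Attribute a test list to the author whose reference list most often
    holds the same pair with the closest cosine value."""
    scores = {author: 0 for author in reference_lists}
    for pair, test_cosine in test_pairs.items():
        differences = {
            author: abs(reference[pair] - test_cosine)
            for author, reference in reference_lists.items()
            if pair in reference
        }
        if differences:  # pairs found in no reference list are skipped
            scores[min(differences, key=differences.get)] += 1
    return max(scores, key=scores.get), scores


# Invented reference lists for two hypothetical authors "A" and "B".
reference_lists = {
    "A": {("x", "y"): 0.9, ("x", "z"): 0.2},
    "B": {("x", "y"): 0.5, ("x", "z"): 0.8},
}
test_pairs = {("x", "y"): 0.85, ("x", "z"): 0.3, ("q", "r"): 0.7}
predicted, scores = classify(test_pairs, reference_lists)

selected = top_theta({"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}, 50)
```

In this toy case both shared pairs are closest to the values in the list of author "A", so "A" collects both points and is chosen. A very small θ leaves few test pairs to vote, which is consistent with the sensitivity observed for the shortest test list.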
Conclusion and Next Steps
In this paper, we investigated the possibility that an analysis of the semantic
distribution of words in a text can be potentially exploited to get cues about
the style of an author. In order to validate our hypothesis, we conducted a
first experiment on six different Italian authors. Of course, it is not our
intent, with this paper, to define new methods for enhancing state-of-the-art
authorship attribution algorithms. However, the obtained results seem to suggest
that the way words are distributed across a text can provide a valid stylistic
cue to distinguish an author’s work. In light of what we have shown up to this
point, the direction of our next steps can be twofold. On the one hand, our
research will focus on detecting and providing useful indications about the
style of an author. This can be done by highlighting, for example, atypical
distributions of words (e.g. with contrastive methods) or by analyzing their
distributional variability. Furthermore, it could be interesting to use a
distributional measure other than the cosine to test our hypothesis. On the
other hand, it would be interesting to tackle the computational task of
authorship attribution, by measuring the effective contribution that a feature
based on distributional semantics would provide to a canonical classification
process. Also, as highlighted in Section 3.2,
another interesting development of this work would be to investigate
the ability of our method to recognize the literary movement the texts pertain
to, instead of the author that wrote them.
Works Cited
Abbasi and Chen 2005 Abbasi A., Chen H. 2005.
“Applying authorship analysis to extremist-group web
forum messages.”
IEEE Intelligent Systems, 20(5), 67-75.
Argamon et al. 2007 Argamon S., Whitelaw C.,
Chase P., Hota S. R., Garg N., and Levitan S. 2007. “Stylistic text classification using functional lexical features.”
Journal of the American Society for Information Science and
Technology, 58(6):802–822, April.
Baroni and Lenci 2010 Baroni M. and Lenci A.
2010. “Distributional memory: A general framework for
corpus-based semantics.”
Computational Linguistics, 36(4):673–721.
Buitelaar et al. 2014 Buitelaar P., Aggarwal
N., and Tonra J. 2014. “Using distributional semantics to
trace influence and imitation in romantic orientalist poetry.” In
AHA!-Workshop 2014 on Information Discovery in Text. ACL.
Burgess and Lund 1997 Burgess, Curt, and
Kevin Lund. 1997. “Representing abstract words and emotional
connotation in a high-dimensional memory space.” Proceedings of the
Cognitive Science Society. 1997.
Chaski 2005 Chaski C. E. 2005. “Who’s at the keyboard? Authorship attribution in digital
evidence investigations.”
International Journal of Digital Evidence, 4(1).
De Vel et al. 2001 De Vel O., Anderson A., Corney
M., and Mohay G. 2001. “Mining e-mail content for author
identification forensics.”
ACM Sigmod Record, 30(4):55–64.
Deerwester et al. 1990 Deerwester, Scott;
Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard.
1990. “Indexing by Latent Semantic Analysis.”
Journal of the American Society for Information
Science. 41 (6): 391–407
Dell’Orletta et al. 2014 Dell’Orletta F.,
Venturi G., Cimino A., and Montemagni S. 2014. “T2k2: a
system for automatically extracting and organizing knowledge from
texts.” In LREC, pages 2062–2070.
Firth 1957 Firth J. R. 1957. “Modes of meaning.” Papers in Linguistics.
Frantzeskou et al. 2006 Frantzeskou G.,
Stamatatos E., Gritzalis S. and Katsikas S. 2006. “Effective
identification of source code authors using byte-level information.”
In Proceedings of the 28th International Conference on
Software Engineering (pp. 893-896).
Gamon et al. 2004 Gamon M. 2004. “Linguistic correlates of style: authorship classification with
deep linguistic analysis features.” In Proceedings of the 20th international conference on Computational
Linguistics, page 611. Association for Computational Linguistics.
Grant 2007 Grant T. D. 2007. “Quantifying evidence for forensic authorship analysis.”
International Journal of Speech-Language and the
Law, 14(1), 1-25.
Grieve 2007 Grieve J. 2007. “Quantitative Authorship Attribution: An Evaluation of Techniques.”
Literary and Linguistic Computing, 22(3):251–270,
May.
Halliday 1994 Halliday M. A. K. 1994. Functional grammar. London: Edward Arnold.
Harris 1970 Harris Z. S. 1970. Distributional structure. Springer.
Herbelot 2015 Herbelot A. 2015. “The semantics of poetry: A distributional reading.”
Digital Scholarship in the Humanities,
30(4):516–531.
Hirst and Feiguina 2007 Hirst G. and
Feiguina O. 2007. “Bigrams of Syntactic Labels for
Authorship Discrimination of Short Texts.”
Literary and Linguistic Computing, 22(4):405–417,
September.
Holmes 1994 Holmes D. I. 1994. “Authorship attribution.”
Computers and the Humanities, 28, 87–106.
Holmes 1998 Holmes D. I. 1998. “The evolution of stylometry in humanities scholarship.”
Literary and Linguistic Computing, 13(3), 111-117.
Juola 2004 Juola P. 2004. “Ad-hoc authorship attribution competition.” In Proceedings of the Joint Conference of the Association for Computers and
the Humanities and the Association for Literary and Linguistic
Computing (pp. 175-176).
Koppel and Schler 2004 Koppel M. and Schler
J. 2004. “Authorship verification as a one-class
classification problem.” In Proceedings of the
twenty-first international conference on Machine learning, page 62.
ACM.
Kruszewski and Baroni 2014 Kruszewski G.
and Baroni M. 2014. “Dead parrots make bad pets: Exploring
modifier effects in noun phrases.”
Lexical and Computational Semantics (*SEM 2014),
page 171.
Landauer 2007 Landauer, Thomas K. 2007. “LSA as a theory of meaning.”
Handbook of latent semantic analysis 3 (2007):
32.
Landauer and Dumais 1997 Landauer, Thomas
K., and Susan T. Dumais 1997. “A solution to Plato's
problem: The latent semantic analysis theory of acquisition, induction, and
representation of knowledge.”
Psychological review 104.2 (1997): 211.
Lenci 2008 Lenci A. 2008. “Distributional semantics in linguistic and cognitive research.”
Italian journal of linguistics, 20(1):1–31.
Li et al. 2006 Li J., Zheng R., and Chen H. 2006.
“From fingerprint to writeprint.”
Communications of the ACM, 49(4):76–82.
Miller and Charles 1991 Miller G. A. and
Charles W. G. 1991. “Contextual correlates of semantic
similarity.”
Language and cognitive processes, 6(1):1–28.
Mosteller and Wallace 1964 Mosteller F.
and Wallace D. L. 1964. “Inference and disputed authorship:
The Federalist.” Addison-Wesley.
Rudman 1997 Rudman J. 1997. “The state of authorship attribution studies: Some problems and
solutions.”
Computers and the Humanities, 31(4):351–365.
Schütze 1992 Schütze H. 1992. “Dimensions of meaning.” In Supercomputing’92,
Proceedings, pages 787–796. IEEE.
Schütze 1993 Schütze H. 1993. “Word space.” In Advances in Neural Information
Processing Systems 5. Citeseer.
Stamatatos 2006 Stamatatos E. 2006. “Authorship attribution based on feature set subspacing
ensembles.”
International Journal on Artificial Intelligence
Tools, 15(05):823–838.
Stamatatos 2009 Stamatatos E. 2009. “A survey of modern authorship attribution methods.”
J. Am. Soc. Inf. Sci. Technol., 60(3):538–556,
March.
Stamatatos et al. 2000 Stamatatos E.,
Fakotakis N., and Kokkinakis G. 2000. “Automatic text
categorization in terms of genre and author.”
Computational linguistics, 26(4):471–495.
Stamatatos et al. 2001 Stamatatos E.,
Fakotakis N., and Kokkinakis G. 2001. “Computer-based
authorship attribution without lexical measures.”
Computers and the Humanities, 35(2):193–214.
Teng et al. 2004 Teng G., Lai M.S., Ma J.B., and Li
Y. 2004. “E-mail authorship mining based on SVM for computer
forensics.” In Machine Learning and Cybernetics, 2004. Proceedings of
2004 International Conference on, volume 2, pages 1204–1207. IEEE.
Uzuner and Katz 2005 Uzuner O. and Katz B.
2005. “A comparative study of language models for book and
author recognition.” In Natural Language Processing–IJCNLP 2005,
pages 969–980. Springer.
Van Halteren et al. 2005 Van Halteren H.,
Baayen H., Tweedie F., Haverkort M., and Neijt A. 2005. “New
machine learning methods demonstrate the existence of a human
stylome.”
Journal of Quantitative Linguistics, 12(1):65–77.
Yule 1938 Yule G. U. 1938. “On
sentence-length as a statistical characteristic of style in prose, with
application to two cases of disputed authorship.”
Biometrika, 30, 363-390.
Yule 1944 Yule G. U. 1944. The
statistical study of literary vocabulary. Cambridge University
Press.
Zhao and Zobel 2005 Zhao Y. and Zobel J. 2005.
“Effective and scalable authorship attribution using
function words.” In Information Retrieval
Technology, pages 174–189. Springer.
Zheng et al. 2006 Zheng R., Li J., Chen H., and
Huang Z. 2006. “A framework for authorship identification of
online messages: Writing-style features and classification
techniques.”
Journal of the American Society for Information Science and
Technology, 57(3):378–393, February.
Zipf 1932 Zipf G. K. 1932. Selected studies of the principle of relative frequency in
language. Harvard University Press, Cambridge, MA.