Giulia Benotto was educated at the University of Pisa, where she graduated in Digital Humanities (with a specialization in Language Technology) and earned a Ph.D. in Computational Linguistics. After various experiences in both industry and research, she now works at Extra Group as an NLP Expert, with a strong focus on developing conversational agents.
The inclusion of semantic features in the stylometric analysis of literary texts appears to be poorly investigated. In this work, we experiment with the application of Distributional Semantics to a corpus of Italian literature to test whether word distribution can convey stylistic cues. To verify our hypothesis, we have set up an Authorship Attribution experiment. The results we have obtained suggest that the style of an author can reveal itself through word distribution too.
This article explores whether authorship can be attributed through word distribution by applying distributional semantics to a corpus of Italian literature.
Stylometry, that is, the quantitative study of linguistic style, offers a means of capturing the elusive character of an author's style by quantifying some of its features. The basic stylometric assumption is that each writer has a unique "human stylome", an identifiable set of measurable stylistic traits.
Computing semantics,
though, is far from easy.
A possible approach to the analysis of this characteristic is to consider the
textual contexts in which certain words appear. According to Distributional
Semantics (DS), certain aspects of the meaning of lexical expressions depend on
the distributional properties of such expressions, or better, on the contexts in
which they are observed.
In this work, we investigate whether the analysis of the distribution of words in a text can be exploited to provide a stylistic cue. To test this, we have experimented with the application of DS to the stylometric analysis of literary texts belonging to a corpus drawn from the work of six Italian writers of the late nineteenth century. In the following, Section 2 provides a brief survey of the works related to stylometry and introduces the fundamental Distributional Semantics concepts on which this work relies. Section 3 describes the approach, together with the corpus used to conduct our investigation, and Section 4 discusses the results. Finally, Section 5 draws some conclusions and outlines possible future work.
Even if the first attempt at computing the writing style of an author dates back to the first half of the 20th century, the foundational work in non-traditional Authorship Attribution (AA) is the study by Mosteller and Wallace (1964), who conducted an investigation on the authorship of the disputed Federalist Papers. Their work laid the basis of stylometry, i.e. defining features to quantify the style of an author by using measures based on counting frequencies of words, characters, and sentences.
Although these systems worked well, they followed a methodology that suffered from some limitations, such as the little statistical homogeneity of the analyzed texts or the fact that the evaluation of the developed methods was mainly intuitive, using corpora that were not controlled for topic; moreover, the lack of benchmark data made it difficult to compare different methods. These problems were partially overcome at the end of the 1990s, when the internet made a vast amount of electronic texts available, also highlighting the different areas in which AA could be useful beyond literary research (e.g. intelligence and forensic applications).
As previously stated, the very first attempts to analyze the style of an author were mainly based on lexical features, such as sentence length or word length counts, which can be applied to any language and corpus without additional requirements.
Very few attempts to exploit high-level features for stylometric purposes have been made, since tasks such as full syntactic parsing, semantic analysis, or pragmatic analysis cannot yet be handled adequately by current NLP technologies. The most important method of exploiting semantic information, so far, has been based on the theory of Systemic Functional Grammar (SFG).
However, the goal of our work is to use only information about word distribution, in order to discover whether a correlation between an author's style and word distribution exists. To analyze word distribution, we rely on the Distributional Hypothesis and, consequently, on Distributional Semantics. Their theoretical basis is presented in the next subsection.
The assumption behind all distributional semantics models (DSMs) is that the
notion of semantic similarity can be defined in terms of linguistic
distributions. This is known as the Distributional Hypothesis, which can be stated as follows: "The degree of semantic similarity between two linguistic expressions a and b depends on the similarity of the linguistic contexts in which a and b can appear."
In accordance with this definition, certain aspects of the meaning of lexical expressions depend on the contexts in which they are observed. The semantic properties of a word can then be defined by inspecting a significant number of linguistic contexts, representative of the distributional behavior of such a word.
Despite the wide consensus recently reached by this theory in Computational Linguistics, its origins reside in Zellig Harris' proposal of Distributional Semantics as the bedrock of linguistics as a scientific discipline.
Within the corpus linguistics tradition, there was no need to motivate the Distributional Hypothesis as a methodological principle for semantic analysis. This is best summarized in the well-known slogan by Firth: "You shall know a word by the company it keeps."
Distributional Semantics has gained popularity in computational linguistics starting from the late 1980s, with the progressive predominance of corpus-based statistical methods for language processing. Within this new paradigm, statistical methods were naturally applied to lexical-semantic analysis. Corpora are indeed connected to Distributional Semantics since they can be used as repositories of linguistic usages, thus representing the primary source of information for identifying the distributional properties of words. Their role has been enhanced by the current availability of huge collections of texts, together with increasingly sophisticated computational linguistics techniques to process them and extract the relevant context features needed to build distributional semantic representations. Despite its currently being corpus-based, distributional semantics is not prevented from addressing questions about the format and origin of word meaning, or the issue of how and to what extent features extracted from the linguistic input actually shape meaning.
The way in which it is possible to infer a geometric representation starting from distributional information originates with Harris (1970), who writes that "the distribution of an element will be understood as the sum of all its environments."
In linguistics, an environment is
called a context, and we here assume a context to be the setting of a word,
phrase, etc., among the surrounding words, phrases, etc., often used for
helping to explain the meaning of the word, phrase, etc.
One way to collect this information is to tabulate the contextual
information, so that for each word we provide a list of the co-occurrents of
the word and the number of times they have co-occurred. In a second step, we
take away the actual words and only leave the co-occurrence counts. Also, we
make each list equally long by adding zeroes in the places where we lack
co-occurrence information. We also sort each list so that the co-occurrence
counts for each context come in the same places in the lists. The
mathematical backbone of Latent Semantic Analysis builds such context vectors (also called term vectors or word vectors) in the following way: co-occurrence counts are collected in a words-by-words matrix, in which the elements record the number of times two words co-occur within a set window of word tokens.
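As a minimal illustration of this construction (our own sketch, not the paper's code; whitespace tokenization and a window of two tokens are assumptions), such a words-by-words matrix can be built as follows:

    def cooccurrence_matrix(tokens: list[str], window: int = 2):
        """Count, for each pair of words, how often they co-occur within
        `window` tokens of each other (a symmetric context window)."""
        vocab = sorted(set(tokens))
        index = {w: i for i, w in enumerate(vocab)}
        counts = [[0] * len(vocab) for _ in vocab]
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    counts[index[w]][index[tokens[j]]] += 1
        return vocab, counts

    vocab, matrix = cooccurrence_matrix("the dog barks at the cat".split())

Note that the matrix produced this way is symmetric by construction, since each pair of positions inside a window is counted in both directions.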
Context vectors are then defined as the rows or the columns of the matrix
(the matrix is symmetric, so it does not matter if the rows or the columns
are used). A cell $f_{ij}$ of the co-occurrence matrix records the frequency of occurrence of the word i in the context of the word j or of the document j, as shown in Figure 1. Context vectors do not only allow us to go from distributional
information to a geometric representation, but they also make it possible
for us to compute proximity between words. Thus, the point of the context
vectors is that they allow us to define (distributional, semantic)
similarity between words in terms of vector similarity. There are many ways
to compute the similarity between vectors, and the measures can be divided
into similarity measures and distance measures. The difference is that
similarity measures produce a high score for similar objects, whereas
distance measures produce a low score for the same objects: large similarity
equals small distance, and conversely. A convenient way to compute normalized vector similarity is to calculate the cosine of the angle between two vectors x and y, defined as:

\[ \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} \]

For example, "cat" and "dog" in the matrix above are depicted as never appearing together in the same context. This does not mean they are not semantically similar, because they can actually occur in very similar contexts; it just means they do not appear together within the context windows we defined. Here, however, we represent "cat" and "dog" as dissimilar, the angle between their vectors being almost 90 degrees. The vector representing the word "animal" is as similar to that of "dog" as to that of "cat", while the vector representing "canine" is closer to the vector representing "dog" than to that of "cat", and the vector representing "feline" is closer to "cat" than to "dog", meaning that "canine" is more semantically similar to "dog" and "feline" is more semantically similar to "cat". Also, the vectors of "canine" and "feline" are both close to "animal", suggesting that the two words are often used in similar contexts in the texts analyzed to generate the co-occurrence vectors represented here. Of course, this example is not representative of the linguistic and semantic reality of things; it is purely illustrative, apt to describe the concept of a vectorial representation of semantic similarity.
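Continuing the sketch, the cosine can be computed directly from this definition; the toy vectors below are ours and purely illustrative, echoing the cat/dog example:

    import math

    def cosine(x: list[float], y: list[float]) -> float:
        """Cosine of the angle between two context vectors."""
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norm if norm else 0.0

    cat = [10, 0, 1]     # hypothetical counts over three contexts
    dog = [0, 10, 1]
    canine = [1, 8, 2]
    print(cosine(cat, dog))     # ~0.01: nearly orthogonal, i.e. dissimilar
    print(cosine(dog, canine))  # ~0.98: "canine" distributionally close to "dog"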
First, we want to specify that it is not our purpose to propose new ways to improve state-of-the-art AA algorithms. Our aim is simply to verify the hypothesis that the distribution of words can provide an indication of a distributional stylistic fingerprint of an author. To do this, we have set up a simple classification task. Section 3.1 briefly describes the data set we used, Section 3.2 explains why and how we chose the authors that constitute our data set, and Section 3.3 outlines the steps implemented in our experiment.
In order to build the reference and test corpora, we started from texts pertaining to the work of six Italian writers active at the turn of the 20th century, namely Luigi Capuana, Federico De Roberto, Luigi Pirandello, Italo Svevo, Federigo Tozzi, and Giovanni Verga. We chose authors who were contiguous in a chronological sense and whose texts are available in digital format (in fact, we could not conduct a similar survey on the narrative of the 1990s because it is still under copyright). We used texts freely available for download from the digital library of the Manunzio project, via the LiberLiber website (www.liberliber.it). Since they were encoded in various formats, such as .epub, .odt, and .txt, our pre-processing consisted in converting them all to .txt format and removing all XML tags, together with footnotes and editors' notes and comments.
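A rough sketch of the cleanup step described above (our own simplification, assuming residual well-formed markup tags; the paper's actual pipeline is not specified):

    import re

    def clean(raw: str) -> str:
        """Strip residual XML tags and collapse whitespace."""
        text = re.sub(r"<[^>]+>", " ", raw)       # drop markup tags
        return re.sub(r"\s+", " ", text).strip()  # normalize whitespace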
Between 1875 and the early 1900s, a literary movement peaked in Italy: Verismo (meaning "realism", from the Italian "vero", meaning "true").
The main exponents of this literary movement, as well as the authors of its
manifesto, were Giovanni Verga and Luigi Capuana. Verismo did not constitute
a formal school, but it was still based on specific principles, its birth
being influenced by a positivist climate which put absolute faith in
science, empiricism, and research and which developed from 1830 until the
end of the 19th century.
All the authors selected to build the corpus used for this work belong to the temporal span in which literary Verismo developed, but not all of them are proponents of that genre. Three of the selected authors are considered Verist authors (Giovanni Verga, Federico De Roberto, and Luigi Capuana), while three (Luigi Pirandello, Federigo Tozzi, and Italo Svevo) are representative of another literary movement: Modernism. We decided to choose texts pertaining to these authors and literary movements firstly because they were all written in the same temporal span. This allowed us to get rid of any possible lexical bias due to the differences in language between works published in different epochs.
The selection was also performed with a possible future evolution of this work in mind. Using texts belonging to the same period but to different literary movements would allow us to investigate the ability of our method to recognize the literary movement the texts pertain to, instead of the author who wrote them. This style-based text categorization task, known as genre detection, is quite similar to authorship attribution, even if there are characteristics that distinguish one from the other. An important question to investigate, then, is how it would be possible to discriminate between two basic factors, authorship and genre, and whether semantics could be useful not only for recognizing the author of a literary work but also the literary genre it pertains to.
Another line of research that has not been adequately examined so far is the development of robust attribution techniques that can be trained on texts from one genre and applied to texts of another genre by the same authors. The way we selected and balanced the texts composing the corpus could be useful for this task, too.
According to Rudman (1997), a striking problem in stylometry is the lack of homogeneity of the examined corpora, and in particular the improper selection or fragmentation of the texts, which might cause alterations in the writers' style. As described in the previous section, the corpus has been built according to this assumption, using texts pertaining to the same time span and balanced between the two selected literary movements. Also, in order to create balanced reference corpora, i.e. covering all the authors' different stylistic and thematic phases, for each author, as shown in Figure 3, we built a reference corpus as the composition of 70% of every single work (usually a novel). The same technique was used to create the test corpus by using the remaining 30% of each work. Typical AA approaches consist of analyzing known authors and assigning authorship to previously unseen texts on the basis of various features; train and test sets should then contain different texts. Contrary to the classical AA task, our train and test sets contain different parts of the same texts. Indeed, with this experiment, we wanted to understand whether the semantics that an author bestows on a word is peculiar to his writing. To prove this, we wanted to cover all the different stylistic and thematic phases an author can go through during his activity, hence the partition of all his texts into a reference and a test portion.
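A minimal sketch of this corpus construction; the paper does not state the unit used for the 70/30 split, so splitting on whitespace tokens is our assumption:

    def split_work(text: str, ref_share: float = 0.7) -> tuple[str, str]:
        """Split one work into a reference part (first 70% of tokens)
        and a test part (the remaining 30%)."""
        tokens = text.split()
        cut = int(len(tokens) * ref_share)
        return " ".join(tokens[:cut]), " ".join(tokens[cut:])

    def build_corpora(works: list[str]) -> tuple[str, str]:
        """Concatenate the 70% slices into an author's reference corpus
        and the 30% slices into his test corpus."""
        parts = [split_work(w) for w in works]
        return " ".join(r for r, _ in parts), " ".join(t for _, t in parts)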
Our work relies on the assumption that the works of an author are representative of the author's thought, so it is assumed that the semantics that an author associates with certain words is representative of his thought. One possible flaw of this kind of approach is that if an author changes the semantics that he associates with concepts across different works over time, the method might not work. It can happen that an author with a long career changes his point of view on things, and this should obviously be reflected in his works. This is one of the main reasons why we decided to use all the works of each author as training text. We wanted to take into account each different phase an author may go through during his career, especially considering changes in the semantics associated with concepts.
Using all the works of each author allows us to have a complete picture of the author himself, and allows us to understand the semantics associated with concepts throughout all his work, even accounting for changes. In fact, the associations between words extracted using our method would highlight changes in semantics through changes in the score associated with a pair of words. Let us hypothesize that a strong association for the young Verga (i.e. for Verga in his first works) is "sun" and "joy", while later on the strongest association is, say, "moon" and "joy". Our hypothesis would be that, while in his first works Verga associated the concept of "sun" with that of "joy", later on it is the concept of "moon" that is associated with that of "joy". Deciding to use some works of Verga for training and some others for testing might then be misleading, because what is semantically true for his first works is not true later on. Using our method, if an association holds just for some works and not for all, its score is evened out and the pair of words is not considered semantically relevant, and thus is not used for classification. Pairs with a high score are those for which the association between the concepts holds throughout all the work of an author, or those that report a score so high in one or more works that it could not be evened out by the scores reported in all the other associations from all the other works of that particular author.
We then analyzed each reference and test corpus with a Part-of-Speech (PoS) tagger and a lemmatizer for Italian.
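The paper does not name the tools used; as an illustration, an equivalent PoS-tagging and lemmatization step could be run with spaCy's Italian model (an assumption on our part, requiring it_core_news_sm to be installed):

    # Hypothetical substitute for the unnamed Italian PoS tagger/lemmatizer:
    # pip install spacy && python -m spacy download it_core_news_sm
    import spacy

    nlp = spacy.load("it_core_news_sm")

    def lemmatize(text: str) -> list[tuple[str, str]]:
        """Return (lemma, PoS) pairs for each non-punctuation token."""
        return [(tok.lemma_.lower(), tok.pos_)
                for tok in nlp(text) if not tok.is_punct]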
Since we wanted to focus on the analysis of the semantic distribution of words, we decided to exclude any possible lexical bias. For this reason, we restricted the analysis to a common vocabulary, i.e. a vocabulary constituted by the intersection of the six authors' vocabularies. In this way, we prevent our classifier from exploiting, as a feature, the presence of words used by some (but not all) of the authors. Moreover, we removed from the RWPtest lists all those pairs of words occurring frequently together in the same context, since they might constitute a multiword expression that, once again, could pertain to the signature lexicon of an author. To remove them, we computed the number of times (#co-occ in Figure 4) they appeared together in the context window, as well as their total numbers of occurrences (#occa and #occb), and we excluded from the analysis those pairs for which the ratio between the number of co-occurrences and the total occurrences of the less frequent word was higher than the empirically set threshold of 0.5. The first two pairs of Figure 4 would be removed as probable multiwords (PM column in Figure 4): "scoppio" (burst) and "risa" (laughter) mostly co-occur in "scoppio di risa" (meaning "burst of laughter"), and the words "man" and "mano" (both meaning "hand") mostly co-occur in "man mano" (meaning "little by little" or "progressively").
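A minimal sketch of this filtering criterion, assuming the co-occurrence and occurrence counts have already been collected (the function name and example counts are ours):

    def probable_multiword(cooc: int, occ_a: int, occ_b: int,
                           threshold: float = 0.5) -> bool:
        """Flag a pair as a probable multiword expression when its
        co-occurrence count exceeds `threshold` of the occurrences of
        the less frequent member."""
        return cooc / min(occ_a, occ_b) > threshold

    # Illustrative counts: a pair like scoppio/risa, almost always
    # appearing together, would be filtered out.
    assert probable_multiword(cooc=80, occ_a=90, occ_b=400)
    assert not probable_multiword(cooc=5, occ_a=200, occ_b=150)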
Finally, we reduced the size of the six RWPref and RWPtest lists by sorting them in decreasing order of cosine value and then keeping the pairs with the highest cosine, selected using a percentage parameter θ as a threshold. We chose to introduce the parameter θ for two reasons: first of all, we wanted to prevent the classification algorithm from being disturbed by noisy (i.e. not significant) pairs which would not hold any relevant stylistic cue; secondly, we wanted to ease the interpretation of the results for a literary scholar, who would only have to analyze a limited selection of (potentially) semantically related word pairs. For the last phase of our experiment, we defined a classification algorithm to test the effective presence of stylistic cues inside the obtained RWPtest lists. We defined a classifier using a nearest-cosine method to attribute each test list to an author. The method consisted in searching for each pair of words contained in the test list inside each reference list and incrementing by 1 the score of the author whose reference list included the pair with the most similar cosine value (i.e. having the minimum difference): the chosen author was the one with the highest score. Figure 5 shows the classification results for θ = 5%. As summarized in Figure 6, the correct classification of all RWPs in the RWPtest lists has been obtained with a θ value of 5%. To help interpret the failure of the algorithm in classifying Tozzi's test list for θ values lower than 5% (as shown in Figure 6), we calculated the cardinality of the RWPtest lists for each author as θ varies (Figure 7). It is possible to observe how the choice of θ influences the correct classification of Tozzi's test list. Indeed, the use of a θ value below 5% has the effect of remarkably reducing an already small test list (RWPtestTozzi), as shown in Figure 7. It is apparent that by increasing the value of θ, and consequently the number of significant RWPs that are analyzed, the system is able to correctly classify RWPtestTozzi (see the values in Tozzi's row of Figure 6).
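A minimal sketch of the θ-thresholding and nearest-cosine attribution described above, assuming each RWP list is a mapping from a word pair to its cosine score (the data structures and names are ours, not the paper's):

    def top_theta(rwp: dict[tuple[str, str], float], theta: float) -> dict:
        """Keep the top `theta` fraction of pairs by decreasing cosine."""
        k = max(1, int(len(rwp) * theta))
        return dict(sorted(rwp.items(), key=lambda kv: kv[1], reverse=True)[:k])

    def attribute(test_list: dict[tuple[str, str], float],
                  references: dict[str, dict[tuple[str, str], float]]) -> str:
        """Credit, for each test pair, the author whose reference list holds
        the same pair with the nearest cosine; return the top-scoring author."""
        scores = {author: 0 for author in references}
        for pair, test_cos in test_list.items():
            candidates = {a: refs[pair] for a, refs in references.items()
                          if pair in refs}
            if candidates:
                best = min(candidates, key=lambda a: abs(candidates[a] - test_cos))
                scores[best] += 1
        return max(scores, key=scores.get)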
In this paper, we investigated the possibility that an analysis of the semantic distribution of words in a text can be exploited to get cues about the style of an author. In order to validate our hypothesis, we conducted a first experiment on six different Italian authors. Of course, it is not our intent, with this paper, to define new methods for enhancing state-of-the-art authorship attribution algorithms. However, the results obtained seem to suggest that the way words are distributed across a text can provide a valid stylistic cue to distinguish an author's work. In light of what we have shown up to this point, the direction of our next steps can be twofold. On the one hand, our research will focus on detecting and providing useful indications about the style of an author. This can be done by highlighting, for example, atypical distributions of words (e.g. with contrastive methods) or by analyzing their distributional variability. Furthermore, it could be interesting to use a distributional measure other than the cosine to test our hypothesis. On the other hand, it would be interesting to tackle the computational task of authorship attribution by measuring the effective contribution that a feature based on distributional semantics would provide to a canonical classification process. Also, as highlighted in Section 3.2.3, another interesting development of this work would be to investigate the ability of our method to recognize the literary movement the texts pertain to, instead of the author who wrote them.