Francesca Frontini obtained a PhD from the University of Pavia with a thesis on corpus linguistics; she later joined the Institute for Computational Linguistics in Pisa, working on several European projects with a focus on computational lexicography and natural language processing. Her current research interests lie in Named Entity Recognition and text classification; in particular, she has worked on NLP methods for the analysis of literary texts and literary criticism. In addition, she has published extensively on issues relating to language resource documentation, preservation and standardisation, and was involved in the development of the Italian consortium of the CLARIN infrastructure. Today, she is maître de conférences (associate professor) of Linguistique Informatique at the Université Paul-Valéry in Montpellier.
Mohamed-Amine Boukhaled is a temporary assistant professor at Paris-6 University. He pursued his PhD on computational stylistics at the computer science laboratory of Paris-6 University (Laboratoire d'Informatique de Paris 6). Before that, he graduated from Grenoble-1 University in 2013 with a Master's degree in Artificial Intelligence. His current research lies in the areas of computational stylistics and text mining. More specifically, he is working on modelling and developing sequential data mining techniques for the extraction of relevant syntactic patterns.
Jean-Gabriel Ganascia is Professor (outstanding class) of Computer Science at the University Pierre and Marie Curie (UPMC). He pursues his research activities in the LIP6 (Laboratory of Computer Science of the UPMC), where he heads the ACASA team. He is also deputy director of the OBVIL Laboratory of Excellence, in which the humanists of the Paris Sorbonne University cooperate with the computer scientists of the UPMC on the literary side of Digital Humanities. Jean-Gabriel Ganascia is a EurAI fellow, a senior member of the Institut Universitaire de France, and the chairman of COMETS, the CNRS ethics committee. His current research activities are focused on Artificial Intelligence, Machine Learning, Computational Philosophy, Computer Ethics and Digital Humanities.
This paper presents and describes a bottom-up methodology for the detection of stylistic traits in the syntax of literary texts. The extraction of syntactic patterns is performed blindly by a sequential pattern mining algorithm, while the identification of significant and interesting features is performed at a later stage by using correspondence analysis and by ranking patterns by contribution.
Computational stylistics is a form of computer-aided literary analysis that
aims to extract significant stylistic traits characterising a literary work, an
author, a genre, a period… Computational stylistics shares similarities with
computer-aided authorship attribution; indeed, stylometric methods have been
developed in order to identify the most likely author of a text of unknown
attribution (see
The term
Very basic features are traits such as the number of sentences in a text, the
number of words in a text, the average number of words per sentence, average
word length, and punctuation frequency. Other features are more properly linguistic,
relating to the vocabulary size of a text (lexical richness), the frequency of function
words, and the frequency of Part of Speech (PoS) tags or PoS n-grams
Once features have been selected and counted, each text can be seen as a vector that contains, for each feature, its count in the text. The different texts to be compared constitute a matrix such as the one represented in Table 1. Counts can be expressed in absolute or relative terms; normalisation is recommended when the sizes of the texts differ. Nevertheless, one should consider that smaller texts generally have a higher internal variability; it is therefore recommended not to compare texts of too different sizes, and experiments are sometimes carried out on selected samples of equal size.
Several methods can then be used to measure and compare the texts based on the frequencies of the features. They generally rely on vector distance measurements that allow one to identify whether two texts show the same behaviour with respect to the selected features. In Table 1, for instance, it is evident that Texts 1 and 2 show similar distributions, despite the difference in scale, as they have roughly the same proportions of Features 1, 2 and 3, while Text 3 shows a totally different behaviour, with a higher count for Feature 3 than for Features 1 and 2.
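The vector representation and distance comparison just described can be sketched in a few lines; the text names and counts below are invented for illustration and are not the paper's data:

```python
# Hypothetical feature counts for three texts (rows) over three
# features (columns); names and numbers are invented for illustration.
texts = {
    "Text A": [100, 80, 20],
    "Text B": [500, 400, 100],  # same proportions as Text A, larger scale
    "Text C": [30, 40, 130],    # over-uses the third feature
}

def relative(counts):
    """Normalise absolute counts to relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

profiles = {name: relative(v) for name, v in texts.items()}
d_ab = euclidean(profiles["Text A"], profiles["Text B"])
d_ac = euclidean(profiles["Text A"], profiles["Text C"])
assert d_ab < d_ac  # after normalisation, A and B are stylistically close
```

Normalisation to relative frequencies is what makes texts of different sizes comparable here: without it, the raw counts of the larger text would dominate any distance measure.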
The similarity of methods notwithstanding, the purpose of computational
stylistics is profoundly different from that of authorship attribution.
Indeed, attribution methods aim to identify unconscious traits in the work of a
given author, traits that give him/her away and that are for this reason normally
defined as
On the other hand, literary style is something that an author masters in a more conscious way. Different works by the same author may show different stylistic traits, although others may be found in all of his/her works. Generally speaking, we can assume that more complex linguistic features are used in a more conscious and controlled way; thus, when some of them are strongly over-used or under-used by an author with respect to others, this may be taken as a possible stylistic trait.
Moreover, authorship attribution can be clearly framed as a classification
problem (who is the most likely author of text A, given a set of candidates)
and is indeed applied as such not only to literature but also in forensic
contexts. Computational stylistics is an open-ended problem
Clearly, such measures are more difficult to evaluate in terms of the accuracy
measures commonly used in information retrieval. A debate is currently ongoing
on whether computational stylistic methods should be a way to radically change
the methodology of literary criticism and make it more scientific
. The
influential book by Ramsay,
The method presented in this paper extracts and ranks
This is not an uncommon scenario in today's computational stylistics research; it can be compared to the early stages of historical linguistics, when researchers established their comparative method of genetic reconstruction on the Romance languages, for which the antecedent (Latin) was in fact available. Only the possibility of independently verifying their methods on an attested source could establish the correctness of comparative methods and allow researchers to subsequently reconstruct other proto-languages for which no attestation was present (Proto-Germanic, Indo-European).
This caveat notwithstanding, we believe that our methodology, alongside other
similar approaches, can already be useful for specialists, who can find
confirmation of known facts and thus substantiate their claims with more data.
An added advantage is that such algorithms are able to easily process large
quantities of text, and can thus be applied to that part of literature that
Franco Moretti
Works investigating the lexical differences between authors using stylometric
techniques are common in the literature; generally, studies count and compare
individual lexical elements (see some examples on Shakespeare and other
playwrights in
Sometimes collocations are extracted and analysed with concordancing tools
When working with a small number of pre-selected features, a commonly used technique, which gives the user insight into the decision process that the algorithm used to produce the visualisation, is the so-called bi-plot (see Figure 1 for an example).
As can be seen from the plot, texts are represented in the bi-dimensional space together with the selected features. The visualisation places each feature near the texts with which it is most strongly associated. This means that the researcher can easily identify which features are most responsible for the differences between texts. Also, the relative positioning of the features with respect to each other is indicative of the possible meaning that the representation takes along the two axes. The blue labels outside the plot in Figure 1 are the interpretation that a human may give of the bi-dimensional distribution of the features.
Such visualisation methods play a very important role in that they allow the researcher not only to confirm or discard a given a priori classification, but also to explain why this is the case. In this sense, alongside hypothesis-driven studies, they can also provide a tool for investigation and analysis that is more in line with common practices in literary criticism.
Clearly, such techniques can be used not only to study an author's lexicon, but also to detect syntactic differences in style. Reliable parsing is not always available, and may work less well on literary texts even for English; nevertheless, syntactic patterns in the form of PoS n-grams or constructions can be easily obtained, as PoS tagging is nowadays available for many languages and domains.
The two main options when trying to port multivariate analysis to syntax are the
following:
These two scenarios mirror a distinction introduced by
Clearly the second option is more in line with the idea of an exploratory tool, and gives us some hope of being able to use such techniques in the future to discover new facts about literary texts. Nevertheless, pattern extraction algorithms (such as sequential pattern mining, which we shall present in the next section) are known to produce a huge number of patterns, and thus large vectors, even for small portions of text. Bi-plots, which are so useful for exploration, thus become unreadable due to the projection of thousands of features. In the following sections of this paper a feature ranking method is presented that aims to overcome this impediment, allowing researchers to combine a bottom-up feature selection procedure with an exploratory visualisation of the results.
The proposed methodology is based on the exploitation of two different
techniques, sequential pattern mining and correspondence analysis, followed by
an interpretation of the results, and is subdivided into five steps:
Data is first segmented into sentences; syntactic categories are then
annotated using a freely available tool, TreeTagger. For example, the sentence Le livre est sur la table. ("The book is on the table.") is tagged as:
DET:art - NOM - VER:pres - PRP - DET:art - NOM - SENT
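As a minimal sketch, assuming TreeTagger's usual tab-separated token/PoS/lemma output, the tagged text can be turned into per-sentence PoS sequences by splitting on the SENT tag; the sample below reproduces the example sentence above:

```python
# TreeTagger-style output: one "token<TAB>PoS<TAB>lemma" line per token.
SAMPLE = """\
Le\tDET:art\tle
livre\tNOM\tlivre
est\tVER:pres\têtre
sur\tPRP\tsur
la\tDET:art\tle
table\tNOM\ttable
.\tSENT\t.
"""

def pos_sentences(tagged):
    """Group the PoS column into sentences, splitting on the SENT tag."""
    sentences, current = [], []
    for line in tagged.splitlines():
        token, pos, lemma = line.split("\t")
        current.append(pos)
        if pos == "SENT":          # sentence boundary
            sentences.append(current)
            current = []
    if current:                    # trailing tokens without a final SENT
        sentences.append(current)
    return sentences

print(pos_sentences(SAMPLE))
# [['DET:art', 'NOM', 'VER:pres', 'PRP', 'DET:art', 'NOM', 'SENT']]
```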
Sequential pattern mining is then applied to the PoS-tagged text, namely a data
mining technique introduced by
Patterns are extracted with their counts. Three types of filtering are applied.
The first two are based on threshold settings: users can set an absolute threshold,
filtering away, for instance, patterns with a frequency lower than, say, 5 in a
text, or a relative one, filtering away, for instance, patterns that do not occur
in at least 1% of the sentences. Finally, automatic filtering is applied in order
to eliminate patterns that are included in another one. So, for instance, if we
find the following results:
we can deduce that all instances of the shorter pattern occur only in the
context of the longer one, and Pattern 2 can thus be filtered away.
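The extraction-and-filtering step can be sketched as follows. This is not EREMOS itself: contiguous PoS n-grams stand in for sequential patterns, and the subsumption filter implements our reading of the rule above, dropping a pattern when it is contained in a longer kept pattern with the same frequency:

```python
from collections import Counter

def extract(sentences, nmin=2, nmax=4):
    """Count all contiguous PoS n-grams of length nmin..nmax."""
    counts = Counter()
    for sent in sentences:
        for n in range(nmin, nmax + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    return counts

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous subsequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def filter_patterns(counts, min_freq=2):
    # absolute frequency threshold
    kept = {p: c for p, c in counts.items() if c >= min_freq}
    # subsumption filter: drop p when some longer kept q contains it
    # with the same count (all instances of p fall inside q)
    return {p: c for p, c in kept.items()
            if not any(q != p and kept[q] == c and contains(q, p) for q in kept)}

# toy PoS sentences (invented)
sents = [["DET", "NOM", "VER"], ["DET", "NOM", "VER"], ["DET", "NOM", "PRP"]]
print(filter_patterns(extract(sents, 2, 3), min_freq=2))
```

In this toy run, ("NOM", "VER") is filtered away because it only ever occurs inside ("DET", "NOM", "VER"), while ("DET", "NOM") survives since it also occurs in a third context.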
This extraction method has been tested on several corpora, including theatrical plays, poems and novels. Depending on the size of the corpora and the settings, it can produce up to 10,000 patterns, which can be seen globally as a syntactic description of the text. The method is therefore meant to bypass the feature-selection phase: the researcher does not need to pre-compile a list of possible syntactic sequences that may differentiate one text from the others. Patterns are extracted bottom-up and blindly. Obviously, a large quantity of such patterns will be insignificant for stylistic differentiation, as they probably have the same frequency in all texts.
Thus correspondence analysis is then performed as follows:
The last three points are crucial and require further description.
Correspondence analysis (CA) is a dimensionality reduction technique developed
by Jean Paul Benzécri (
The third is the most important result table for our methodology and contains
the
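The ranking of patterns by contribution can be illustrated with a toy reimplementation of the standard CA contribution formula (this is not FactoMineR's output; the pattern names, masses and coordinates below are invented): the contribution of a row point i to axis k is mass_i · coord_ik² / eigenvalue_k.

```python
def contributions(masses, coords):
    """CA contributions of row points to one axis.

    The eigenvalue of the axis is the mass-weighted sum of squared
    coordinates, so the contributions on an axis sum to 1.
    """
    eigenvalue = sum(m * c * c for m, c in zip(masses, coords))
    return [m * c * c / eigenvalue for m, c in zip(masses, coords)]

patterns = ["Pattern_A", "Pattern_B", "Pattern_C"]  # hypothetical names
masses = [0.5, 0.3, 0.2]     # row masses (relative pattern frequencies)
coords = [0.1, -0.2, 0.9]    # coordinates on the first axis

ctr = contributions(masses, coords)
ranked = sorted(zip(patterns, ctr), key=lambda t: -t[1])
assert abs(sum(ctr) - 1.0) < 1e-9  # contributions on one axis sum to 1
```

Note how a low-mass pattern with an extreme coordinate (Pattern_C) can still dominate the ranking: contribution rewards displacement, not raw frequency.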
Finally, the extraction tool EREMOS is also equipped with an instance retrieval method that allows researchers to see all instances in the text corresponding to any given pattern. This feature is very important, as experts can verify the evidence in the texts and map the automatically identified patterns to the actual linguistic structures that they mirror.
We shall now see with an example how this works in practice. The current discussion is not intended as a thorough critical analysis of the chosen texts; it aims only to show what possible uses experts may make of the data.
In order to show how the methodology works in practice, four classic French
novels were chosen:
The idea is to compare these four 19th-century works of fiction in order to
extract differentiating stylistic traits, without any a priori targeted
structure. In the present experiment, for simplicity, a basic configuration is
chosen for EREMOS, extracting patterns
In this configuration, EREMOS basically works as a 3-, 4- and 5-gram extractor.
When performing CA with the basic settings, the result is Figure 2.
As can be seen, the large number of patterns makes the plot unusable and the positions of the texts themselves invisible.
Using one of the settings provided by FactoMineR, we are able to produce a more readable Figure 3, where patterns are unlabelled and printed in grey with partial transparency.
Figure 3 shows a clearer picture. The novels (here labelled with the names of their authors) diverge along the two axes.
To understand the positioning of patterns and texts in a bi-plot, the metaphor of
a magnetic field can be used. The majority of patterns are concentrated in the
centre, because they are equally attracted by (represented in) all texts. On the
other hand, some patterns are strongly attracted (over-represented) by just one
text and repelled by the others, positioning themselves at the extremities.
Others are equally attracted by two texts only, positioning themselves somewhere
in between. Moreover, the force of attraction is not the same for all patterns:
some seem to be stronger in pulling a text towards them; in our case, for
instance, Balzac and Zola are less central. This can be interpreted as meaning
that these texts have stronger characterising features than the other two.
By using the figures produced by FactoMineR, we can go further, and actually remove the cloud of central patterns, while retaining for further analysis those patterns that are most contributive in terms of the displacement of the texts over the two axes. Figure 4 shows for instance a plot displaying only the 10 most contributive patterns. Moreover, by combining contribution and proximity, it is possible to select, among the patterns with high contribution, those that are nearer to one text than to the other three.
First of all, let us look at the resulting aggregated table, which provides the results of CA in textual form. Table 2 prints the ten most contributive patterns of the analysis. For each one, the author it is most strongly associated with is indicated; this is calculated by measuring the Euclidean distance between the position of each text and the feature, and choosing the nearest text.
Moreover, a simple algorithm can be used to extract the top 5 most contributive patterns for each novel, namely the top 5 patterns that are most associated with each text.
Algorithm 1: procedure to extract 5 top contributive patterns for each author.
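The procedure can be sketched as follows, under two assumptions: patterns arrive sorted by decreasing contribution, and each pattern carries its (x, y) position in the CA plane. All names and coordinates below are hypothetical, not the paper's results:

```python
def nearest(point, texts):
    """Name of the text closest to `point` (squared Euclidean distance)."""
    return min(texts, key=lambda name: (texts[name][0] - point[0]) ** 2
                                       + (texts[name][1] - point[1]) ** 2)

def top5_per_text(ranked_patterns, texts):
    """First five patterns (by contribution rank) assigned to each text."""
    out = {name: [] for name in texts}
    for pat, xy in ranked_patterns:          # decreasing contribution
        t = nearest(xy, texts)
        if len(out[t]) < 5:
            out[t].append(pat)
        if all(len(v) == 5 for v in out.values()):
            break                            # five patterns found for every text
    return out

texts = {"Text_A": (0.8, 0.1), "Text_B": (-0.7, 0.2)}
ranked = [("Pattern_1", (0.9, 0.0)), ("Pattern_2", (-0.6, 0.3))]
print(top5_per_text(ranked, texts))
# {'Text_A': ['Pattern_1'], 'Text_B': ['Pattern_2']}
```

Because patterns are consumed in contribution order, a text whose nearby patterns sit far down the ranking (as with Hugo below) only fills its five slots at much higher rank values.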
By running Algorithm 1 we can extract the following lists of patterns, which we analyse in detail in the following paragraphs. Notice again how the ranks of the first 5 patterns for Hugo are much higher than the others, meaning that these patterns have a lower contribution.
As we can see from Tables 3 to 6, the analytic results confirm the intuition derived from the plot. Zola and Balzac are associated with the most contributive patterns, namely with patterns that are strongly over-used in their respective novels. Among the top 10, only one is associated with Flaubert and none with Hugo. In fact, the first five patterns in order of contribution that stand closest to Hugo (Table 6) are ranked 80 to 339, while all the other novels have associated patterns in the first 10 positions.
Is it possible to say that Zola and Balzac show a more syntactically marked
language? In order to answer this question, we need to analyse the instances of
each pattern more closely, and see whether the differences in pattern frequencies
are due to stylistic reasons or to other, more epiphenomenal facts. This phase of
analysis is very important, because superficial formatting differences in the
text may sometimes cause tagging errors, or simply push up the frequency of some
insignificant patterns
In what follows some individual patterns among those extracted for each novel are discussed.
Pattern_1211 is very distinctive of Zola’s
[1211_A] clarté vivante, l'idole saine et solide de la charcuterie ; et on ne la nomma plus que la belle Lisa.
[1211_B] vue troublée, les pieds comme tirés, sans qu'il en eût conscience, par cette image de Paris, au loin, très loin, derrière l'horizon, qui l'appelait, qui l'attendait.
From the analysis of these instances it becomes clear that this pattern is
used in descriptions and enumerations [1211_A]; it is also frequently used in
parenthetical phrases [1211_B] functioning as free adjuncts with adverbial
value, namely modifying the verb (here
The same can be said of Pattern_1208 and Pattern_1209, which often occur as a concatenation of 1211:
, la vue troublée, les pieds comme tirés, ...
, la vue troublée, les pieds comme tirés, ...
Pattern_1207 also seems to express the same preference of Zola's for implicit clauses that modify the verb and express manner.
Notice how all these patterns contain punctuation elements, often commas. Zola's style is dry and effective, with frequent use of parentheticals rather than explicit forms.
marchait, dormant à demi, dodelinant des oreilles, lorsque, à la hauteur de la rue de Longchamp, un sursaut de peur le planta net sur ses quatre pieds.
Instead the second most important pattern for
The analysis thus seems to be in line with the received idea of Zola's intentional choice of a realist style, one meant to represent reality authentically and objectively.
A first look at
The first pattern is strongly associated with dialogical structures, which are very frequent in this work:
The same can be said of Pattern_1016, which is used mostly to (post-) introduce direct speech:
, dit-il au vigneron
, dit-il à Eugénie
Pattern_1025 is associated with two structures, both subordinate infinitives, used either with an explicative value (1025_A) or to describe co-occurring events (1025_B).
[1025_A] usage pour indiquer les vignobles qui produisent la première qualité de vin.
[1025_B] en prenant une pincée de tabac, et offrant sa tabatière à la ronde : Qui mieux que madame, dit-il, pourrait faire à monsieur les honneurs de Saumur ?
Pattern_1023 is used in phrases containing proper names, often place names, functioning as modifiers of a noun.
de France est là tout entière.
[1023_A] Les habitants de Saumur étant peu révolutionnaires, ...
Pattern_1018 shows a main transitive verb with its object and an implicit subordinate phrase. Like Pattern_1025, it is used to further specify actions or events. Notice that this type of pattern basically constitutes the counterpart of those used by Zola, who prefers verbless forms of predicate modification. With a little imagination we could write Zola's version of [1018_B] as something like "Grandet, la bouche fermée, regarda sa fille".
[1018_A] tendit la main en défaisant son anneau
[1018_B] regarda sa fille sans trouver un mot à dire.
Balzac's style is thus more verbose and more explicit. The use of prepositions to introduce phrases or clauses is important in highlighting the relationship between head and modifier, and makes sentences less difficult to interpret. Balzac is considered the father of realism, but he aimed at a broader and more popular audience than Zola did (for financial as well as artistic reasons). His style possibly reflects this necessity, as well as the time constraints of his immense production.
All of
autour du cou, et, l'ayant fait asseoir au bord du lit, se mettait à lui parler de ses chagrins : il l'oubliait, il en aimait une autre !
Patterns indicating a certain style of punctuation should always be taken
with a grain of salt, since punctuation in the edited version does not
always reflect the author's choices, but may be subject to editorial
guidelines. Nevertheless Mangiapane
As shown above, Hugo's work is less syntactically marked than that of the others. The patterns that do show some overrepresentation in
Two of these patterns (Pattern_31 and Pattern_27) are absent in Zola and Balzac, but shared with Flaubert. Pattern_31 is the longest; it seems to be used mostly in descriptions of places, which are very rich in Hugo's historical novel and help the reader enter the world of medieval Paris.
place comme une cascade dans un lac.
Pattern_27 is often used in subordinate clauses that show a preference for demonstrative pronouns to underline situations.
que ce public qui l'entourait était du peuple.
que ce peuple avait été sur le point de se rebeller contre monsieur le bailli, par impatience d'entendre son ouvrage !
Pattern_520 and Pattern_833 are shared with the other authors, though slightly overrepresented in Hugo. Here too the punctuation variant found in Flaubert emerges, though not as strongly.
, et regarda les assaillants avec le grincement de dents d'un tigre fâché.
Finally, Pattern_190 is used in comparisons and descriptions.
aussi gaie que si elle était veuve .
From this analysis, Hugo's style emerges as full of lively descriptions: simple, personal, engaging and popular, just as we know it from literary tradition.
We have presented a detailed outline of a methodology for the extraction of
syntactic patterns in texts and for the measurement of their interestingness in
a corpus. The results suggest that this methodology holds substantial promise as
a hermeneutical instrument offered to experts in the literary domain for
investigating style in texts and extracting interesting stylistic features. For
this reason, both EREMOS, the tool required to perform the pattern extraction,
and the R scripts used to perform correspondence analysis on its output have
been released and made available to the community
Other interestingness measures besides correspondence analysis have been tested,
which are based on the distribution of the patterns in different parts of the
same text
New experiments using this methodology on a number of other texts have been
carried out; in particular, parallel research has studied the syntactic
aspects of characterisation in Molière's plays (
A further interesting domain of application could be the study of
This research was supported by French state funds managed by the ANR within the Investissements d'Avenir programme under reference ANR-11-IDEX-0004-02, and by an IFER "Fernand Braudel" scholarship awarded by the Fondation Maison des Sciences de l'Homme. We thank the anonymous reviewers of GDDH 2015 and DHQ for their helpful comments.