Shlomo Argamon is an associate professor of computer science at the Illinois Institute of Technology, where he is the director of the Linguistic Cognition Laboratory. He received his B.Sc. in applied mathematics from Carnegie-Mellon University in 1988, his Ph.D. in computer science from Yale University, where he was a Hertz Foundation Fellow, in 1994, and was a Fulbright Fellow at Bar-Ilan University in Israel from 1994 to 1996. Dr. Argamon's research focuses on the development of computational text analysis techniques, with applications mainly in computational stylistics, authorship attribution, sentiment analysis, and scientometrics.
Jean-Baptiste Goulain received his diplôme d'ingénieur (2007) in computer science and applied mathematics from École Nationale Supérieure d'Informatique et de Mathématiques Appliquées in Grenoble, France. During this time, he spent a semester at the Illinois Institute of Technology where he was a member of the Linguistic Cognition Laboratory. He is currently a student intern at Société Générale bank in New York.
Russell Horton is a research programmer at The ARTFL Project and the Digital Library Development Center at the University of Chicago, where he received his BA in Linguistics in 2002. He works on machine learning and text analysis software for the humanities.
Mark Olsen is the Assistant Director of the ARTFL Project at the University of Chicago. Mark received his Ph.D. in French history from the University of Ottawa in 1991 and has been involved in digital humanities and computer-aided text analysis since the early 1980s. His current ambition is to write a biography of the Marquis de Pastoret in candle-light with a quill.
In this study, a corpus of 300 male-authored and 300 female-authored French literary and historical texts is classified for author gender using the Support Vector Machine (SVM) implementation SVMLight, achieving up to 90% classification accuracy. The sets of words that were most useful in distinguishing male and female writing are extracted from the support vectors. The results reinforce previous findings from statistical analyses of the same corpus, and exhibit remarkable cross-linguistic parallels with the results garnered from SVM models trained in gender classification on selections from the British National Corpus. It is found that female authors use personal pronouns and negative polarity items at a much higher rate than their male counterparts, and male authors demonstrate a strong preference for determiners and numerical quantifiers. Among the words that characterize male or female writing consistently over the time period spanned by the corpus, a number of cohesive semantic groups are identified. Male authors, for example, use religious terminology rooted in the church, while female authors use secular language to discuss spirituality. Such differences would take an enormous human effort to discover by a close reading of such a large corpus, but once identified through text mining, they frame intriguing questions which scholars may address using traditional critical analysis methods.
Patterns of gender difference in historical French texts parallel those in modern English.
Adam's Rib, 1949
Amanda Bonner: What I said was true, there's no difference between the sexes. Men, women, the same.
Adam Bonner: They are?
Amanda Bonner: Well, maybe there is a difference, but it's a little difference.
Adam Bonner: Well, you know as the French say...
Amanda Bonner: What do they say?
Adam Bonner: Vive la difference!
Amanda Bonner: Which means?
Adam Bonner: Which means hurrah for that little difference.
Attempts to identify and characterize differences between male and female discourse have utilized methods such as close reading and sociolinguistic modeling.
This study was based on the same male and female corpora used by Olsen in previous statistical analyses (Sand Effect).
Because we are working with the same corpus previously subjected to a purely statistical analysis, we can compare the features highlighted by the machine learner directly against those earlier results.
Because machine learning algorithms are fundamentally rooted in the exploitation of differential distributions of features (in our case, words), we would expect many of the same words to appear as highly weighted features in our machine learning results that Olsen found to be significant in his statistical analysis. However, we would not expect the lists to be identical, because additional factors influence SVM-trained weights that are not captured by differential frequency statistics or other statistical measures such as information gain (IG). Differential frequency and IG are innate properties of an individual word's distribution between sub-corpora, whereas an SVM weight has meaning only within the context of a particular model generated by the learning algorithm, and must be considered in relation to the weights of other features in that model. Differential rates and IG may simply be calculated according to a set formula with unvarying results, whereas SVM weights are heuristically assigned and refined by the learning algorithm in a search for maximum performance on the classification problem.
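To make the contrast concrete, information gain for a single word can be computed directly from its document counts in the two sub-corpora. The following Python sketch (the function names and the presence/absence encoding are illustrative choices, not the code used in this study) shows the fixed, formula-driven character of the measure:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(n_male, n_female, male_with, female_with):
    """IG of a word's presence/absence for the male/female document split.

    n_male, n_female:       documents in each sub-corpus
    male_with, female_with: documents in each sub-corpus containing the word
    """
    n = n_male + n_female
    base = entropy([n_male, n_female])
    with_word = [male_with, female_with]
    without_word = [n_male - male_with, n_female - female_with]
    conditional = (sum(with_word) / n) * entropy(with_word) \
                + (sum(without_word) / n) * entropy(without_word)
    return base - conditional
```

Called as information_gain(300, 300, 250, 50), say, the function returns the same value every time; an SVM weight for the same word would depend on every other feature in the trained model.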
Information gain and other statistical measures of distribution are commonly used as heuristics for reducing feature set dimensionality and for setting initial weights for machine learning algorithms, but there is no guarantee that every word with a highly differential frequency in the corpora will be assigned a high weight by the machine learner in the final model. The SVM produces two weighted sets of words, male and female, which, taken together, are maximally effective (to the extent that the algorithm can find an optimal solution) at discriminating between texts from the two corpora. Words which exhibit interesting distributions but which do not fit well into a particular model will not be assigned high weights and will escape our notice. It is therefore useful to perform a variety of machine learning runs, find what works, and search for common threads in the results. Ultimately, results must find support from a knowledgeable reading of the texts and be fitted with a critical hypothesis to be of great interest from the literary scholar's point of view, although predictive models may have practical uses, such as adding guessed metadata to unclassified documents, independent of their critical value or validity.
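The weighted word lists discussed below were extracted from SVMLight's trained models; as a rough stand-in, the following sketch uses scikit-learn's LinearSVC, whose coef_ attribute exposes the per-word weight vector of a trained linear model (the function name and the choice of learner are illustrative, not the tooling used in this study):

```python
import numpy as np
from sklearn.svm import LinearSVC

def top_gendered_words(X, y, words, k=20):
    """Train a linear SVM and report the k words weighted most strongly
    toward each class (+1 = male-authored, -1 = female-authored).

    X: document-term matrix; y: +1/-1 labels; words: column index -> word.
    """
    model = LinearSVC().fit(X, y)
    w = model.coef_.ravel()                    # one weight per word
    order = np.argsort(w)
    female = [(words[i], w[i]) for i in order[:k]]        # most negative
    male = [(words[i], w[i]) for i in order[::-1][:k]]    # most positive
    return male, female
```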
The machine learning algorithm chosen for this classification task is the SVM implementation SVMLight.
For our preliminary experiments, we prepared 8 sets of vectors: the two collections (the full 600-document corpus and the 184-document subset) in four versions each, namely the surface forms of the words, the lemmas, the parts of speech (POS) of the words as assigned by TreeTagger, and a simplified part-of-speech grouping with broader categories (POSgroup). Each matrix consisted of either 600 or 184 vectors, labeled 1 for male-authored and -1 for female-authored documents. A sketch of the generic data preparation process for text classification follows.
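The sketch below writes labeled word-count vectors in SVMLight's sparse <label> <feature>:<value> input format (the helper name and the use of raw counts as feature values are illustrative assumptions):

```python
def write_svmlight(docs, labels, path):
    """Write labeled word-count vectors in SVMLight's sparse input format.

    docs:   list of {word: count} dictionaries, one per document
    labels: parallel list of +1 (male-authored) / -1 (female-authored)
    SVMLight feature ids are positive integers, so words are numbered from 1,
    and each line lists its features in increasing id order.
    """
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab) + 1)
    with open(path, "w") as out:
        for doc, label in zip(docs, labels):
            feats = sorted((vocab[w], c) for w, c in doc.items())
            out.write(f"{label:+d} " + " ".join(f"{i}:{c}" for i, c in feats) + "\n")
    return vocab
```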
We then trained SVMLight on each matrix and, after cross-validation, obtained the accuracies given in Tables 1 and 2. Surface form and lemma accuracies cluster around 85%, meaning that overall, the models generated by SVMLight can correctly predict the gender of the author about 85% of the time. This is a significant result and indicates that the model has indeed found generalizable differences between the texts in the two corpora. The differences in accuracy between the surface and lemma forms of the words are insignificant, and the POS and POSgroup accuracy differences are generally quite slight as well. The most notable distinction is that POS and POSgroup accuracies are consistently much lower than word and lemma accuracies, hovering around 70%, which we have adopted as the borderline for a significant result on a binary classification problem. 70% accuracy is not a particularly compelling result on a coin-flip problem, because it is only 20 percentage points above the 50% agreement expected by random chance. Naturally, the more accurate our model is, the more importance we can attach to the words the model weights toward each author gender.
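The train-and-evaluate loop can be restated in outline as follows; here scikit-learn's linear SVM and a 10-fold split stand in for SVMLight's own cross-validation machinery, and the file name is illustrative:

```python
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Load the +1/-1 labeled vectors prepared earlier (path is illustrative).
X, y = load_svmlight_file("corpus_surface.svmlight")

# Cross-validated accuracy: roughly 85% for surface forms and lemmas,
# and roughly 70% for POS vectors, in the experiments reported above.
scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```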
In order to test whether our accuracies were an artifact of the classifier used, rather than demonstrative of true differences between our corpora, we performed the same experiment with each document randomly labeled as male or female, regardless of true author gender. Over multiple runs, the classifier never achieved more than 50% accuracy in this random falsification experiment, so we can be confident that SVMLight cannot reliably distinguish between randomly assigned sub-corpora in this collection.
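The falsification check is straightforward to restate in the same framework: permute the labels so that each document receives an arbitrary gender (a permutation, which also preserves the 300/300 balance, is one way to randomize) and verify that accuracy collapses to chance:

```python
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = load_svmlight_file("corpus_surface.svmlight")  # illustrative path
rng = np.random.default_rng(0)

# Shuffle the true labels so each document gets an arbitrary class,
# then re-run the identical cross-validation; mean accuracy near 0.50
# indicates the classifier finds no exploitable signal in the split.
y_random = rng.permutation(y)
scores = cross_val_score(LinearSVC(), X, y_random, cv=10, scoring="accuracy")
print(f"random-label accuracy: {scores.mean():.3f}")
```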
We can try to learn from our failures here. The fact that SVMLight cannot construct a very accurate prediction model based on POS vectors is a kind of weak evidence against any theory of gendered authorship that holds that men and women speak radically different languages. If, in fact, men and women used the basic building blocks of language in substantially different ways, we might expect to see strong mechanical differences between male and female writing reflected in POS usage rates that the model could exploit to make accurate classifications. That such differences do not widely obtain in this corpus is strongly suggested by the inability of SVMLight to construct a very accurate model to distinguish between the gendered corpora on that basis. Of course, this does not rule out mechanical and stylistic differences that aren't reflected in the simple metric of POS frequencies, but it does suggest a base level of linguistic similarity between the two classes.
Based on these initial results, we decided to proceed with further experiments using the surface forms of the words, that being the simplest method and tied for most accurate with the lemmatized forms. All runs cited hereafter were executed within PhiloMine, the data mining extension to the PhiloLogic text search engine.
Our first impulse when examining the feature list was to scan for the presence of shibboleths: words that trivially identify some subset of works as definitively male- or female-authored, either because they are explicit markers of author gender (such as metadata tags inadvertently retained in the document), or because they occur in only one or a relative handful of works that are homogeneous for author gender. Such terms are gifts to the machine learner, greedily seized upon by our classification model but unlikely to generate any penetrating insight for the scholar. Proper names are the prime example of such features, and we saw several in Table 5.
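Such features can be screened for mechanically before interpretation. A minimal sketch, with thresholds that are purely illustrative, flags any word confined to a handful of documents that are homogeneous (or nearly so) for author gender:

```python
from collections import defaultdict

def find_shibboleths(docs, labels, max_docs=5, purity=1.0):
    """Flag words confined to a few documents all on one side of the
    gender split; such words classify trivially but teach little.

    docs:   list of {word: count} dictionaries, one per document
    labels: parallel list of +1 (male-authored) / -1 (female-authored)
    """
    doc_counts = defaultdict(lambda: [0, 0])   # word -> [male, female] doc counts
    for doc, label in zip(docs, labels):
        for word in doc:
            doc_counts[word][0 if label == 1 else 1] += 1
    suspects = set()
    for word, (m, f) in doc_counts.items():
        total = m + f
        if total <= max_docs and max(m, f) / total >= purity:
            suspects.add(word)
    return suspects
```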
The highest-ranked words in each category are common function words — pronouns, articles, quantifiers, adpositions, and common verb forms.
These results are striking in that they replicate almost exactly those of a similar analysis of female- and male-authored texts in the British National Corpus (BNC).
Somewhat lower down the list than the function words, we start to encounter content words, and some of the same phenomena noted by Olsen in his statistical analyses are apparent: "[female] space may be characterized by a more personal, emotive and interactive frame that is not explained by differences in genre or period."
Having found support for previous findings in Argamon et al. and Olsen, we next looked for words that remain consistently weighted toward one gender across the time periods spanned by the corpus.
Within the male and female lists, it is possible to identify a number of interesting semantic groupings of words. Reassuringly, the female pronouns and negative polarity items and the male quantifiers discussed earlier are still present. In addition, a number of other semantic categories of words appear to cohere.
The number of strongly cohesive thematic groupings that can be constructed from the highly weighted features that obtain in both time periods suggests that male and female writers in the corpus exercise markedly different topic selection. Although the identification of these persistent themes marks the endpoint of this machine learning analysis of the corpus, the themes themselves form a natural starting point for a scholar interested in pursuing the differences between male and female writing from a traditional literary critical viewpoint. It would be quite interesting, for example, to explore why male authors favor religious terminology rooted within the church, whereas female authors spend more time discussing spirituality in a personal, more secular language. Similarly, why should so many anatomical terms rank at the very top of male-weighted features, and are they literal expressions of physicality, or rooted in metaphorical usage? These thematic groupings cannot be taken as definitive, universal statements about gendered authorship, but they are clearly identifiable trends that provide a neat snapshot of some basic differences between male and female authors, while suggesting potentially fruitful areas for further analysis, either computer-assisted or using traditional methods. Scholars intrigued by these questions could narrow the context for a close reading by refining the text mining analysis, focusing on questions such as which authors and works best exemplify the discovered trends, and which provide exceptions and counter-examples.
Our research demonstrates the utility of support vector machine models for finding contrasting features of male and female writing by interrogating the trained models to identify patterns of word usage that distinguish the gendered corpora. We found little advantage to using lemmatized forms of words as our features and a significant disadvantage to using parts of speech, and therefore used the surface forms of the words for the bulk of our research, achieving classification accuracies between 80% and 90%. Among the words found to be most useful in distinguishing male and female writing, several distinct functional and semantic groupings were identified. The more personal and emotional frame of reference found in female authors' writing by Olsen in his statistical analysis of the same corpus was supported by our machine learning models. The marked male preference for determiners and the female preference for personal pronouns and negative polarity items were particularly promising findings, as they echo very closely previous work by Argamon et al.