Volume 3 Number 2
Vive la Différence! Text Mining Gender Difference in French Literature
Abstract
In this study, a corpus of 300 male-authored and 300 female-authored French literary and historical texts is classified for author gender using the Support Vector Machine (SVM) implementation SVMLight, achieving up to 90% classification accuracy. The sets of words that were most useful in distinguishing male and female writing are extracted from the support vectors. The results reinforce previous findings from statistical analyses of the same corpus, and exhibit remarkable cross-linguistic parallels with the results garnered from SVM models trained in gender classification on selections from the British National Corpus. It is found that female authors use personal pronouns and negative polarity items at a much higher rate than their male counterparts, and male authors demonstrate a strong preference for determiners and numerical quantifiers. Among the words that characterize male or female writing consistently over the time period spanned by the corpus, a number of cohesive semantic groups are identified. Male authors, for example, use religious terminology rooted in the church, while female authors use secular language to discuss spirituality. Such differences would take an enormous human effort to discover by a close reading of such a large corpus, but once identified through text mining, they frame intriguing questions which scholars may address using traditional critical analysis methods.
Amanda Bonner: What I said was true, there's no difference between the sexes. Men, women, the same.
Adam Bonner: They are?
Amanda Bonner: Well, maybe there is a difference, but it's a little difference.
Adam Bonner: Well, you know as the French say...
Amanda Bonner: What do they say?
Adam Bonner: Vive la difference!
Amanda Bonner: Which means?
Adam Bonner: Which means hurrah for that little difference. (Adam's Rib, 1949)
Introduction
Comparison With Previous Research
Experimental Design
Machine Learning Runs
Author gender | Word | Lemma | PoS | PoSgroup |
Male | 88.3% | 87.3% | 73.0% | 69.7% |
Female | 83.3% | 84.4% | 75.7% | 78.7% |
All | 85.7% | 85.9% | 74.4% | 74.2% |
Author gender | Word | Lemma | PoS | PoSgroup |
Male | 91.3% | 92.4% | 73.9% | 73.9% |
Female | 81.5% | 81.5% | 78.3% | 69.6% |
All | 86.4% | 87.0% | 76.1% | 71.7% |
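The relationship between the per-gender rows and the "All" row in accuracy tables of this kind can be illustrated with a minimal sketch. The counts below are hypothetical and chosen only for illustration; they are not the study's results, and the function name `accuracy_by_class` is an assumption introduced here.

```python
def accuracy_by_class(counts):
    """Return per-class and overall accuracy.

    counts maps a class label to a (correct, total) pair of document counts.
    """
    # Per-class accuracy: correctly classified documents over all documents of that class.
    result = {label: correct / total for label, (correct, total) in counts.items()}
    # Overall accuracy pools the counts, so it is weighted by class size.
    total_correct = sum(correct for correct, _ in counts.values())
    total_docs = sum(total for _, total in counts.values())
    result["All"] = total_correct / total_docs
    return result


# Hypothetical counts, not taken from the study.
hypothetical = {"Male": (90, 100), "Female": (80, 100)}
result = accuracy_by_class(hypothetical)
for label, acc in result.items():
    print(f"{label}: {acc:.1%}")
```

When both classes contain the same number of documents, the pooled "All" figure is simply the mean of the two per-class accuracies.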
153 persistent features in Male-authored documents: 1, a, abord, action, affaire, ajouta, amie, article, au, aura, auteur, autour, autre, aux, avons, bas, bouche, bras, c, capitaine, cent, chacun, chair, champ, charles, chez, christ, ciel, cinq, comment, comtesse, contre, corps, coup, coups, crime, côté, d', des, deux, diable, dis, docteur, doigts, dont, doute, droite, du, entre, est, face, fait, façon, femme, feu, fin, fit, fois, foule, gens, gros, haut, histoire, homme, hé, hôtel, ils, in, jacques, jean, juge, jusqu', la, laquelle, le, les, leurs, ligne, long, lorsque, main, mains, maîtresse, messieurs, mis, mit, moins, monseigneur, monsieur, montre, mot, même, nez, nom, nombre, nos, oeil, oeuvres, ordre, oreille, ou, oui, où, par, passage, pied, pieds, présente, président, prêtre, quatre, quelqu', quelque, quelques, question, qui, quoi, reprit, reste, rue, récit, saint, saints, salut, sang, second, seconde, selon, ses, seulement, simple, sire, soit, sous, sur, table, tirer, tour, toute, trente, trois, un, v, ventre, vers, vieux, village, vin, vingt, voici, y, yeux, à |
192 persistent features in Female-authored documents: absence, admiration, afin, agréable, ai, aimable, aime, aimer, aller, amitié, amour, anglais, angleterre, auguste, auprès, aurais, avais, avait, avec, avez, avoir, beaucoup, belle, bien, bonheur, bonne, brillante, but, cacher, car, caractère, celle, chagrin, chercher, chère, coeur, comprendre, compte, comte, confiance, conserver, cour, crois, destinée, disant, donner, douceur, douleur, doux, elle, elles, empêcher, encore, enfance, enfant, enfants, entièrement, envie, esprit, espérance, estime, eût, faisait, fallait, faut, fièvre, fleurs, france, frère, fût, gloire, goût, grande, grandes, généreux, henri, hiver, ici, il, imagination, impossible, inquiétude, inspire, inspirer, instant, intérêt, jamais, jardin, jours, liberté, lui, lumières, m, ma, mais, malgré, manière, manières, me, moi, mon, montrer, mère, ne, ni, nécessaire, opinion, parce, parler, parlez, passion, pauvre, pays, personne, personnes, petite, peut, peuvent, plaire, plaisir, pleurs, plusieurs, possible, pourquoi, pourrais, pouvait, prince, princes, princesse, pu, puisque, puissance, père, quand, que, quitter, regarder, reine, repos, retrouver, revenir, roi, sais, sait, sans, savoir, secret, sentiment, sentir, seule, si, son, souffrir, souvenir, souvent, soyez, suis, supporter, surprise, tant, toi, toujours, tous, toutes, trop, trouva, trouver, très, tu, utile, veux, vie, vit, vivre, voir, vois, vos, votre, voulait, voulut, vous, voyage, voyant, véritable, âme, éducation, égard, égards, émotion, épouser, était, êtes |
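One plausible way to arrive at lists of persistent features like those above is to keep only the words that rank among the distinguishing features in every time slice of the corpus. The sketch below assumes that definition; the exact criterion used in the study is not stated in this section, and the period labels, groupings, and the function name `persistent_features` are hypothetical.

```python
def persistent_features(top_by_period):
    """Return, in sorted order, the words present in every period's feature list."""
    period_sets = [set(words) for words in top_by_period.values()]
    if not period_sets:
        return []
    # A word persists only if it appears in all periods.
    return sorted(set.intersection(*period_sets))


# Hypothetical per-period lists. The words are borrowed from the male list above,
# but the assignment of words to periods is invented for illustration.
male_top = {
    "period_1": ["saint", "prêtre", "monsieur", "cent"],
    "period_2": ["saint", "prêtre", "docteur", "cent"],
    "period_3": ["saint", "prêtre", "cent", "rue"],
}
persistent = persistent_features(male_top)
print(persistent)
```

Words that mark one gender in a single period only (here "monsieur", "docteur", and "rue") drop out, leaving the terms that characterize that gender's writing across the whole span.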
Enduring Male Terms | Enduring Female Terms |
|
|
Conclusion
Male Features | | Female Features | |
Word | Weight | Word | Weight |
qui | 3.032 | elle | -4.270 |
un | 2.706 | ne | -2.768 |
à | 2.568 | vous | -2.256 |
le | 2.512 | pas | -1.812 |
des | 2.392 | et | -1.594 |
du | 1.993 | avec | -1.435 |
les | 1.847 | mais | -1.433 |
au | 1.598 | lui | -1.365 |
monsieur | 1.396 | était | -1.346 |
est | 1.302 | si | -1.245 |
deux | 1.264 | avait | -1.178 |
de | 1.250 | me | -1.127 |
sur | 1.033 | ma | -1.069 |
a | 0.953 | pour | -0.952 |
homme | 0.884 | sans | -0.811 |
par | 0.867 | moi | -0.794 |
ce | 0.746 | consuelo | -0.779 |
madame | 0.690 | quand | -0.779 |
d' | 0.656 | bien | -0.702 |
une | 0.594 | roi | -0.676 |
ces | 0.590 | l' | -0.666 |
ses | 0.586 | il | -0.614 |
dont | 0.566 | beaucoup | -0.570 |
quelque | 0.554 | n' | -0.560 |
femme | 0.535 | henri | -0.543 |
ils | 0.528 | m' | -0.535 |
où | 0.511 | jamais | -0.523 |
tems | 0.496 | reine | -0.513 |
charles | 0.493 | je | -0.482 |
ou | 0.487 | princesse | -0.479 |
autre | 0.451 | toujours | -0.470 |
aux | 0.449 | car | -0.465 |
yeux | 0.429 | ai | -0.462 |
main | 0.417 | votre | -0.459 |
fit | 0.392 | esprit | -0.453 |
leurs | 0.386 | avais | -0.447 |
quelques | 0.384 | m | -0.444 |
leur | 0.380 | personne | -0.430 |
cette | 0.379 | albert | -0.419 |
fait | 0.379 | temps | -0.400 |
après | 0.374 | mon | -0.393 |
avois | 0.374 | bonne | -0.383 |
reste | 0.363 | être | -0.381 |
mille | 0.355 | dans | -0.379 |
même | 0.327 | ça | -0.371 |
saint | 0.326 | se | -0.365 |
fille | 0.324 | liberté | -0.364 |
francs | 0.309 | la | -0.360 |
tout | 0.307 | âme | -0.356 |
lettre | 0.299 | très | -0.356 |
étoit | 0.298 | enfants | -0.349 |
entre | 0.287 | peut | -0.347 |
Male Features | | Female Features | |
Word | Weight | Word | Weight |
qui | 3.043 | elle | -4.291 |
un | 2.716 | ne | -2.780 |
à | 2.578 | vous | -2.265 |
le | 2.522 | pas | -1.820 |
des | 2.400 | et | -1.599 |
du | 2.000 | avec | -1.441 |
les | 1.856 | mais | -1.439 |
au | 1.603 | lui | -1.366 |
monsieur | 1.400 | était | -1.348 |
est | 1.305 | si | -1.250 |
deux | 1.269 | avait | -1.179 |
de | 1.252 | me | -1.127 |
sur | 1.037 | ma | -1.072 |
a | 0.956 | pour | -0.956 |
homme | 0.888 | sans | -0.814 |
par | 0.870 | moi | -0.795 |
ce | 0.749 | quand | -0.782 |
madame | 0.690 | bien | -0.706 |
d' | 0.657 | roi | -0.679 |
une | 0.597 | l' | -0.668 |
ces | 0.592 | il | -0.621 |
ses | 0.587 | beaucoup | -0.572 |
dont | 0.568 | n' | -0.564 |
quelque | 0.555 | henri | -0.549 |
femme | 0.537 | m' | -0.536 |
ils | 0.530 | jamais | -0.526 |
où | 0.513 | reine | -0.515 |
tems | 0.498 | je | -0.483 |
charles | 0.495 | princesse | -0.481 |
ou | 0.488 | toujours | -0.471 |
autre | 0.452 | car | -0.466 |
aux | 0.450 | ai | -0.462 |
yeux | 0.430 | votre | -0.460 |
main | 0.418 | esprit | -0.455 |
fit | 0.394 | avais | -0.447 |
leurs | 0.387 | m | -0.445 |
quelques | 0.386 | personne | -0.431 |
cette | 0.381 | albert | -0.420 |
leur | 0.381 | temps | -0.402 |
fait | 0.380 | mon | -0.392 |
après | 0.375 | bonne | -0.385 |
avois | 0.375 | être | -0.380 |
reste | 0.364 | dans | -0.378 |
mille | 0.356 | ça | -0.375 |
même | 0.329 | se | -0.366 |
saint | 0.327 | liberté | -0.365 |
fille | 0.325 | la | -0.358 |
francs | 0.311 | très | -0.358 |
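Per-word weights of the kind tabulated above can be recovered from a linear-kernel SVM by summing the support vectors, each scaled by its signed coefficient: w = Σ αᵢ yᵢ xᵢ. The sketch below illustrates that standard identity on toy data. It is not the authors' extraction code, the support vectors and values are invented, and the function name `linear_svm_weights` is an assumption. The sign convention follows the tables, where positive weights indicate male features and negative weights female features.

```python
from collections import defaultdict


def linear_svm_weights(support_vectors):
    """Compute w = sum_i (alpha_i * y_i) * x_i for a linear-kernel SVM.

    support_vectors is an iterable of (alpha_times_y, {feature: value}) pairs.
    """
    weights = defaultdict(float)
    for alpha_y, features in support_vectors:
        for word, value in features.items():
            weights[word] += alpha_y * value
    return dict(weights)


# Toy support vectors (invented): a positive coefficient for one class,
# a negative coefficient for the other.
toy_svs = [
    (0.5, {"qui": 3.0, "un": 1.0}),
    (-1.0, {"elle": 3.0, "qui": 0.5}),
]
weights = linear_svm_weights(toy_svs)

# Rank from most positive to most negative weight.
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
for word, weight in ranked:
    print(f"{word} | {weight:.3f}")
```

Sorting the resulting weights and reading off both ends of the ranking gives the most strongly male-associated and female-associated words, which is the layout the two tables use.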