Cameron Blevins is a digital historian studying the nineteenth-century United States and the American West. He is currently a postdoctoral fellow at Rutgers University's history department and the Rutgers Center for Historical Analysis.
Lincoln Mullen is an assistant professor in the Department of History and Art History at George Mason University, where he works on American religious history and digital history.
This article describes a new method for inferring the gender
of personal names using large historical datasets. In
contrast to existing methods of gender prediction that treat
names as if they are timelessly associated with one gender,
this method uses a historical approach that takes into
account how naming practices change over time. It uses
historical data to measure the likelihood that a name was
associated with a particular gender based on the time or
place under study. This approach generates more accurate
results for sources that encompass changing periods of time,
providing digital humanities scholars with a tool to
estimate the gender of names across large textual
collections. The article first describes the methodology as
implemented in the gender
package for the R
programming language. It goes on to apply the
method to a case study in which we examine gender and
gatekeeping in the American historical profession over the
past half-century. The gender package illustrates the
importance of incorporating historical approaches into
computer science and related fields. The code used to
create this article is available in the
lmullen/gender-article repository on GitHub.
Introduces a method for algorithmically inferring gender using large historical datasets
The Leslie Problem
The analytical category of gender has transformed humanities
research over the past half-century, and, more recently,
widespread digitization has opened up opportunities to apply
this category to new sources and in new ways. Initiatives
like the Orlando
Project, the Poetess
Archive, and the Women Writers
Project have done crucial work to recover the
voices of female authors and writers. As the digital
archive grows larger and larger, researchers have been able
not only to access these recovered sources but also to
extract new kinds of information from them, such as the
personal names that appear within them.
One of the basic pieces of information researchers can infer
from a personal name is the gender of that individual:
Jane Fay,
for instance, is likely a woman, while
her brother John Fay
is likely a man. A computer
program can infer their respective genders by looking up
Jane
and John
in a dataset that
matches names to genders. But this approach runs into a
problem: the link between gender and naming practices, like
language itself, is not static. What about Jane and John
Fay’s sibling, Leslie? If Leslie was born in the past sixty
years, chances are good that Leslie would be their sister.
But if the three siblings were born in the early twentieth
century, chances are good that Leslie would be their
brother. That is because the conventional gender for the
name Leslie switched over the course of one hundred years.
In 1900 some 92% of the babies born in the United States who
were named Leslie were classified as male, while in 2000
about 96% of the Leslies born in that year were classified
as female.
For those working on contemporary subjects, the Leslie
problem
is not an especially pressing one. There are
a variety of tools that use current databases of names. Genderize.io, for
instance, predicts the gender of a first name from the user
profiles of unnamed social media sites. Some ambiguous
names, such as Jan, have a different causal mechanism than
changing naming practices over time. We speculate that Jan
as an English name is female but that Jan as a
Scandinavian name is male. Because the name has remained
relatively uncommon, causes other than changing
gender preferences can explain its ambiguity.
The Leslie problem
is not just for researchers who
think of themselves as historians. Anyone studying a period
longer than a few years, or anyone studying a group whose
demographics do not match the groups used by a contemporary
tool such as Genderize.io, will also encounter this problem.
A literary scholar studying a corpus of poems from the
mid-twentieth century might wish to compare the stanza
structures of male and female poets. Existing tools based on
contemporary data risk misidentifying these writers, many
of whom were born (and named) more than a century prior to
the creation of these modern name datasets. This problem
will only increase with growing life expectancies: as of
2012, the average American lifespan was nearly seventy-nine
years — more than enough time for naming practices to change
quite dramatically between a person’s birth and death.
Our solution to the Leslie problem
is to create a
software package that combines a predictive algorithm with
several historical datasets suitable to various times and
regions. The algorithm calculates the proportion of male or
female uses of a name in a given birth year or range of
years. It thus can provide not only a prediction of the
gender of a name, but also a measure of that prediction’s
accuracy. For example, our program predicts that a person
named Leslie who was born in 1950 is female, but reports a
low level of certainty in that prediction, since just 52
percent of the babies named Leslie born in the United States
in 1950 were girls and 48 percent were boys. These
probabilities help a user determine what level of certainty
is acceptable for predicting the gender of different
names.
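The core of this calculation is simple enough to sketch in a few lines. The following Python sketch (the package itself is written in R) uses invented tallies keyed by name and birth year; only the Leslie proportions quoted above are drawn from the article.

```python
# Year-aware gender prediction, sketched in Python. COUNTS holds
# invented tallies keyed by (name, birth year); only the Leslie
# proportions match the figures quoted in the text.
COUNTS = {
    ("leslie", 1900): {"female": 8, "male": 92},
    ("leslie", 1950): {"female": 52, "male": 48},
    ("leslie", 2000): {"female": 96, "male": 4},
}

def predict_gender(name, year):
    """Return (gender, certainty) for a name in a given birth year."""
    counts = COUNTS.get((name.lower(), year))
    if counts is None:
        return (None, None)  # name-year pair absent from the data
    prop_female = counts["female"] / (counts["female"] + counts["male"])
    if prop_female >= 0.5:
        return ("female", prop_female)
    return ("male", 1 - prop_female)

print(predict_gender("Leslie", 1950))  # ('female', 0.52): a low-certainty call
print(predict_gender("Leslie", 1900))  # ('male', 0.92): a high-certainty call
```

Returning the proportion alongside the prediction, rather than the prediction alone, is what lets a researcher decide how much uncertainty to tolerate.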
Researchers who use this method need to be fully aware of its
limitations, in particular its dependency on a state-defined
gender binary. Inferring gender from personal names is a
blunt tool to study a complex subject. Gender theorists and
feminist scholars have spent decades unpacking the full
range of meanings behind gender as an analytical concept,
including the fluid relationship between biological sex and
gender identity, as well as what the categories male or
female mean in any given historical and social context.
Imperfect as it is, this method nevertheless gives digital humanities scholars a much-needed tool to study gender in textual collections. We have made the method as transparent and flexible as possible through the inclusion of probability figures. This makes the underlying data visible to users and allows them the opportunity to interrogate the assumptions that come with it. When used thoughtfully, this method can infer additional information about individuals who stand at the heart of humanities scholarship, providing another way to see and interpret sources across large scales of analysis.
The remainder of this article is divided into two sections. First, we describe our method in more detail, outline its advantages over existing methods, and explain how to use it. Second, we apply the method to a case study of gatekeeping in the historical profession in order to demonstrate its usefulness for humanities scholars.
Why do scholars need a method for inferring gender that
relies on historical data? Let’s start with a comparison.
One existing method for predicting gender is available in
the Natural Language Toolkit (NLTK) for the Python
programming language. The Kantrowitz corpus provides the
list of names, with each name classified as male, female, or
either. One can then easily write a function which looks up
the gender of a given name.
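Such a lookup might be sketched as follows. The two sets here are tiny stand-ins for the male and female name lists in the Kantrowitz corpus, not the corpus itself; note that the lookup has no notion of time.

```python
# A timeless name lookup in the style of the Kantrowitz corpus
# shipped with NLTK. These sets are tiny stand-ins for the
# corpus's male and female name lists.
MALE_NAMES = {"John", "Bill", "Leslie"}
FEMALE_NAMES = {"Jane", "Abby", "Leslie"}

def lookup_gender(name):
    """Classify a name as male, female, either, or unknown."""
    if name in MALE_NAMES and name in FEMALE_NAMES:
        return "either"  # listed under both genders
    if name in MALE_NAMES:
        return "male"
    if name in FEMALE_NAMES:
        return "female"
    return "unknown"

# The lookup cannot distinguish a Leslie born in 1900 from one
# born in 2000: the corpus treats names as timeless.
print(lookup_gender("Jane"))    # female
print(lookup_gender("Leslie"))  # either
```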
The most significant problem with the Kantrowitz names corpus
and thus the NLTK implementation is that it assumes that
names are timeless. As pointed out above, this makes the
corpus problematic for historical purposes. Furthermore,
the Kantrowitz corpus includes other oddities which make it
less useful for research. Some names such as Abby are
overwhelmingly female and some such as Bill are
overwhelmingly male, but the corpus includes them as both
male and female. The Kantrowitz corpus contains only 7,576
unique names, a mere 8.3% of the 91,320 unique names in a
dataset provided by the Social Security Administration and
2.23% of the 339,967 unique names in the census records
provided in the Integrated Public Use Microdata Series
(IPUMS) USA dataset. There are therefore many names that it
cannot identify. Assuming for the moment that our method
provides more accurate results, we estimate that 4.74
percent of the names in the Kantrowitz corpus are classified
as ambiguous when a gender could be reasonably predicted
from the name, that 1.24 percent of the names are
classified as male when they should be classified as female,
and that 1.82 percent are classified as female when they should be
classified as male. This error rate is a separate concern
from the much smaller size of the Kantrowitz
corpus.
We mention the Kantrowitz name corpus as implemented in NLTK
because the Natural Language Toolkit is rightly regarded as
influential for scholarship. Its flaws for predicting
gender, which are a minor part of the software’s total
functionality, are also typical of the problems with most
other implementations of gender prediction algorithms. The
Genderize.io API is, for example, a more sophisticated
implementation of gender prediction than the NLTK algorithm.
Besides predicting male or female for gender, it also
reports the proportion of male and female names, along with
a count of the number of uses of the name on which it is
basing its prediction. Genderize.io will also permit the user
to customize a prediction for different countries, which is
an important feature. Genderize.io reports that its database
contains 142,848 distinct names across 77 countries and 85
languages.
Genderize.io is unsuitable for historical work, however, because it is
based only on contemporary data. According to the
documentation for its API, it utilizes “big datasets of
information, from user profiles across major social
networks.”
It would be anachronistic
to apply these datasets to the past, and Genderize.io
provides no functionality to filter results chronologically
as it does geographically. In addition, Genderize.io does
not make clear exactly what comprises the dataset and how it
was gathered, which keeps scholars from interrogating the
value of the source.
R and Python are two of the most commonly used languages for
data analysis. Surveys and analysis of usage
point out the growth of R and the continuing popularity of
Python for data science generally, and scholars are
producing guides to using these languages for historical or
humanities research.
To that end we have created the gender package for R which includes both a predictive algorithm and an associated genderdata package containing various historical datasets. This R implementation is based on an earlier Python implementation by Cameron Blevins and Bridget Baird.
We have chosen the R programming language for several reasons. The R language is open-source, so it is freely
available to scholars and students. The language has a
strong tradition of being friendly to scholarship. It was
originally created for statistics and many of its core
contributors are academics; it provides facilities for
citing packages as scholarship. CRAN (the
Comprehensive R Archive Network) offers a central location
for publishing R packages for all users, and its submission
policies include a rigorous set of standards to ensure the
quality of contributed packages. R has a number of language
features, such as
data frames, which permit easy manipulation of
data.
Inferring gender from names depends on two things. First, it requires a suitable (and suitably large) dataset for the time period and region under study. Unsurprisingly such datasets are almost always gathered in the first instance by governments, though their compilation and digitization may be undertaken by scholarly researchers. Second, it depends on a suitable algorithm for estimating the proportion of male and female names for a given year or range of years, since often a person cannot be associated with an exact date. It is especially important that the algorithm take into account any biases in the data to formulate more accurate predictions. Development of the R package has had two primary aims. The first is to abstract the predictive algorithm to the simplest possible form so that it is usable for a wide range of historical problems rather than depending on the format of any particular data set. The second has been to provide as many datasets as possible in order for users to tailor the algorithm’s predictions to particular times and places.
The gender package currently uses several datasets which make
it suitable for studying the United States from the first
federal census in 1790 onwards. The first dataset contains
names of applicants for Social Security and is available
from Data.gov. The Social Security Administration (SSA) Baby
Names dataset was created as a result of the Social Security
Act of 1935 during the New Deal. The term baby names for the
data provided by the SSA is thus a serious misnomer. When
Social Security became available during the New Deal, its
first beneficiaries were adults past or near retirement age.
The dataset goes back to 1880, the birth year of a
55-year-old adult when Social Security was enacted. Even
after 1935, registration at birth for
Social Security was not mandatory until 1986. As we will
demonstrate below, the way in which the data was gathered
requires an adjustment to our predictions of gender.
The IPUMS-USA dataset, contributed by Benjamin Schmidt,
contains records from the United States decennial census
from 1790 to 1930. This dataset includes the birth year and
numbers of males and females under the age of 62 for all the
years in that range. This data has been aggregated by IPUMS at the
University of Minnesota and is released as a sample of the
total census data. Unlike the SSA dataset, which includes a
100% sample for every name reported to the Social Security
Administration and used more than five times, the IPUMS data
contains 5% or 10% samples of names from the total census
data. Because the gender()
function relies on
proportions of uses of names, rather than raw counts of
people with the names, the sampling does not diminish the
reliability of the function’s predictions.
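A short, hypothetical calculation shows why: scaling both counts by the same sampling rate leaves their proportion unchanged (setting aside random sampling error, which shrinks as samples grow). The counts below are invented.

```python
# Invented counts showing why a sampled dataset yields the same
# proportions as a full count: the sampling rate cancels out.
full_female, full_male = 930, 70    # hypothetical full-count tallies
sample_female, sample_male = 93, 7  # the same counts under a 10% sample

prop_full = full_female / (full_female + full_male)
prop_sample = sample_female / (sample_female + sample_male)

print(prop_full)                 # 0.93
print(prop_full == prop_sample)  # True: the sampling rate cancels out
```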
The gender package also features a dataset from the North Atlantic
Population Project. This dataset includes names
from Canada, Great Britain, Germany, Iceland, Norway, and
Sweden from 1801 to 1910. The package’s datasets depart from
the tidy data framework proposed by Hadley Wickham for
performance reasons. By keeping the data in a wider format
instead of a tidy format as defined by Wickham, the function
is able to make its predictions faster. Each dataset records
a name, a year, and the number of female and male uses of
that name in that particular year. This simple format makes
it possible to extend the package to include any place and
time period for which there is suitable data.
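The wide format can be pictured as rows of name, year, and counts. In this Python sketch, the 1935 counts for Sidney echo the SSA figures quoted later in the article, while the 2012 row is invented; indexing rows by name and year is one reason the wide layout keeps lookups fast.

```python
# Rows in the package's wide data format: one row per name-year
# pair, with separate female and male count columns. The 1935
# counts for Sidney come from the SSA data; the 2012 row is invented.
rows = [
    {"name": "sidney", "year": 1935, "female": 93,  "male": 974},
    {"name": "sidney", "year": 2012, "female": 152, "male": 98},
]

# Indexing rows by (name, year) gives constant-time lookups.
index = {(r["name"], r["year"]): r for r in rows}

row = index[("sidney", 1935)]
print(row["male"], row["female"])  # 974 93
```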
Our method for predicting gender is best understood through a series of examples. First we will use it to predict the gender of a single name in order to demonstrate a simplified version of the inner workings of the function. We will then apply it to a small sample dataset to show how a researcher might use it in practice.
The method for predicting gender from a name using the
package’s datasets is simple. Let’s begin by assuming
that we want to predict the gender of someone named
Sidney who was born in 1935 using the Social Security
Administration dataset. Because the dataset contains a
list of names for each year, we can simply look up the
row for Sidney in 1935. Using the dplyr
package for R, which provides a grammar for data
manipulation, this can be expressed with the action
filter.
Thus, according to the Social Security Administration,
there were 974 boys and 93 girls named Sidney born in
1935. We can add another command to calculate the
proportions of male and female uses (a
mutate
in the vocabulary of the dplyr
package) rather than raw numbers.
In other words, there is an approximately 91.3 percent
chance that a person born in 1935 named Sidney was male.
In 2012, for comparison, there was an approximately
60.7 percent chance that a person born named Sidney was
female.
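Outside of R, the same filter-and-mutate arithmetic can be checked with a few lines of Python, using the 1935 counts quoted above.

```python
# The SSA counts for Sidney in 1935, as quoted above.
male, female = 974, 93

# The mutate step amounts to dividing each count by the total.
total = male + female
prop_male = male / total
prop_female = female / total

print(round(prop_male, 3))    # 0.913, i.e. roughly a 91.3 percent chance
print(round(prop_female, 3))  # 0.087
```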
The method is only slightly more complex if we do not
know the exact year when someone was born, as is often
the case for historical data. Suppose we know that
Sidney was born in the 1930s but cannot identify the
exact year of his or her birth. Using the same method as
above we can look up the name for all of those years.
Next we can sum up the male and female columns
(summarize
in dplyr
package vocabulary) and calculate the proportions of
female and male uses of Sidney
during that
decade.
In other words, for the decade of the 1930s, we can
calculate that there is a 93.2 percent chance that a
person named Sidney was male. This is roughly the same
as the probability we calculated above for just 1935,
but our method also returns the figures it used to
calculate those probabilities: 1,067 instances of
Sidney
in 1935 versus 10,110 total instances
for the decade as a whole.
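The decade-level calculation follows the same pattern: sum the yearly counts, then divide (dplyr's summarize step). The per-year counts in this sketch are invented; only the logic mirrors the method described above.

```python
# The decade calculation: sum yearly counts, then divide. The
# per-year counts here are invented for illustration.
counts_by_year = {
    1930: {"male": 1010, "female": 70},
    1931: {"male": 990,  "female": 75},
    1932: {"male": 1005, "female": 68},
}

male_total = sum(c["male"] for c in counts_by_year.values())
female_total = sum(c["female"] for c in counts_by_year.values())
prop_male = male_total / (male_total + female_total)

print(male_total, female_total)  # 3005 213
print(round(prop_male, 3))       # 0.934
```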
The method’s real utility stems from being able to
process larger datasets than a single name. Let’s use,
for example, a hypothetical list of editors from a
college newspaper to illustrate how a researcher might
apply it to their own data. The package’s prediction
function allows researchers to choose which reference
datasets they would like to use and the range of years
for making their predictions.
By taking into account the year of birth, we find that four of our six names were likely male, whereas we might otherwise have predicted that all six were female. We also now know the approximate likelihood that our predictions are correct: at least 80% for all of these predictions.
It is also possible to use a range of years for a dataset like this. If our list of editors contained the year in which the person served on the newspaper rather than the birth year, we could make a reasonable assumption that their ages were likely to be between 18 and 24. We could then calculate a minimum and a maximum year of birth for each, and run the prediction function on that range for each person. The exact code to accomplish these types of analysis can be found in the vignette for the gender package.
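The derivation of a birth-year range can be sketched as follows. The editor list and the 18-to-24 age assumption are illustrative; the code stands in for, rather than reproduces, the package's vignette.

```python
# Deriving a birth-year range from a year of service, assuming
# editors were between 18 and 24 years old. The editor list is
# invented for illustration.
MIN_AGE, MAX_AGE = 18, 24

editors = [
    {"name": "Leslie", "service_year": 1962},
    {"name": "Jan",    "service_year": 1975},
]

for editor in editors:
    editor["min_birth_year"] = editor["service_year"] - MAX_AGE
    editor["max_birth_year"] = editor["service_year"] - MIN_AGE

for editor in editors:
    print(editor["name"], editor["min_birth_year"], editor["max_birth_year"])
# Leslie 1938 1944
# Jan 1951 1957
```

The prediction function can then be run over each person's range of possible birth years rather than a single year.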
As previously mentioned, the history of how the Social Security Administration collected the data affects its validity. Specifically, because the data extends back to 1880 but the first applications were gathered after 1935, the sex ratios in the dataset are skewed in the years before 1930. For example, this dataset implies that thirty percent of the people born in 1900 were male.
It is extremely improbable that nearly seventy percent of
the people born in 1900 were female.
The solution to this problem is two-fold. First, we
recommend that researchers use the IPUMS-USA dataset to
make predictions for years from 1790 to 1930 (which
avoids the skewed years in the SSA data). Second, we have
implemented a correction within the gender()
function. If we assume that
the secondary sex ratio (that is, the ratio of male to
female births) in any given year does not deviate from
0.5
(that is, equality), it is possible
to calculate a correction factor for each year or range
of years to even out the dataset. We apply this
correction factor automatically when using the SSA
dataset.
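One way such a correction can work is to rescale the male counts for a year so that the year's totals match the assumed even sex ratio. The totals below are invented to mimic the skew described above; the package's exact implementation may differ.

```python
# Rescaling male counts so a year's totals match an assumed even
# secondary sex ratio. The totals are invented to mimic the
# roughly 30-percent-male skew reported for 1900.
year_male_total, year_female_total = 30_000, 70_000

# Correction factor for the year: scale male counts up until the
# male total equals the female total.
factor = year_female_total / year_male_total

# Apply the factor to one (invented) name's counts for that year.
name_male, name_female = 90, 30
adjusted_male = name_male * factor

prop_male = adjusted_male / (adjusted_male + name_female)
print(round(factor, 3))     # 2.333
print(round(prop_male, 3))  # 0.875
```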
A case study that relies on the gender package illustrates
how the method can reveal hidden patterns within large
textual collections. In a 2005 report for the American
Historical Association, Elizabeth Lunbeck acknowledged
“a sea change in the [historical]
profession with respect to gender”
before going
on to describe the limits to this change for female
historians, who face ongoing personal discrimination, lower
salaries, and barriers to securing high-ranking positions.
Lunbeck’s report drew in part on a survey of 362 female
historians that produced a rich source of responses
detailing the entrenched and multi-faceted challenges facing
women in the profession.
We begin with the history dissertation, often the defining
scholarly output of a historian’s early career. The
completion of a dissertation marks a standardized moment of
transition out of the training phase of the historical
profession. To identify how many women and men completed
PhD-level training in history each year, we used data
supplied by ProQuest for roughly eighty thousand PhD
dissertations completed in the United States between 1950
and 2012.
Another way to examine this trend is to ignore changes in the absolute number of dissertations produced each year and to instead look at changes in the proportion of male and female dissertation authors. The proportion of dissertations written by women has steadily increased over the past half-century, a change that began in the late 1960s and continued through the early 2000s. Since that point, the proportion of dissertations written by women has largely plateaued at a few percentage points below the proportion written by men. Female historians have achieved something approaching parity with male historians in terms of how many women and men complete dissertations each year.
But the dissertation is only the first major piece of
scholarship produced by an academic historian. The second
usually centers on the writing of an academic monograph to
be read and evaluated by their peers. These monographs
remain the coin of the realm
for many historians, and
the value of that coin often depends on it being reviewed in
academic journals. To study the role of gender in monograph
reviews, we turn to one of the leading journals in the
historical profession: the American Historical Review (AHR),
which publishes book reviews across
every major field of historical study.
A few caveats are in order. First, a book that is never reviewed by the AHR is not necessarily a lesser work of scholarship, since the journal can review only a fraction of the books published each year.
Scraping the table of contents of every issue of the AHR allowed us to identify the authors of the books the journal has reviewed.
But a closer look shows substantial differences between the proportion of women as the authors of history dissertations and the proportion of women appearing in the AHR as the authors of reviewed books.
Not every historian wants or seeks a career path that neatly moves from writing a dissertation to authoring a monograph to being reviewed in the AHR.
As its editors note, the AHR “receives over 3,000 books a year; we have the resources to publish at most 1,000 reviews a year (approximately 200 per issue).”
Examples like the Pulitzer and Bancroft Prizes are only the
most obvious cases of gender inequity in the historical
profession. Overt discrimination still exists, but inequity
operates in ways that are far more cloaked and far more
complex, making it difficult to recognize and remedy. It
takes the form of subconscious bias that subtly weakens the
assessment of work done by female historians, whether in a
seminar paper or a job application. Moreover, gender
discrimination frequently intersects with other kinds of
racial and socioeconomic inequity that erect even higher
barriers in women’s careers.
Finally, we used the gender package to reveal some of the subtler ways in which gender inequity operates in the historical profession. After inferring the gender of the book authors reviewed by the AHR, we examined how often women appeared in the journal as experts, whether as the authors of reviewed books or as the reviewers evaluating them.
The second finding we arrived at when examining the gender of
The findings related to
One 2013 analysis of 2,500 recent history PhDs found that
gender played little role in
employment patterns across particular professions and
industries.
The digital turn has made large datasets widely available,
yet in the midst of abundance many humanists continue to
deal with the problem of incompleteness and scarcity within
their sources. Our method falls well short of the “radical,
unrealized potential of digital humanities” to capture
gender in all its complexity. One scholar has observed that
the binary sex-class “gets us pretty far,” even as she
remains pessimistic about whether “there’s any way we can
code large corpora for something subtler about
gender than sex-class.”
The gender package makes a pragmatic trade-off between
complexity and discovery. Its reliance on a male/female
binary dampens the complexity of gender identity while
simultaneously pointing towards new discoveries about gender
within large textual collections. Its use of historical
datasets helps its users uncover patterns that were
previously unknown, providing further grist for the mill of
more traditional methods. Furthermore, the gender package is
entirely open-source, and we have provided detailed
documentation for its code and underlying data so that those
who wish to build upon it can modify the package
accordingly. For instance, future researchers might compile
datasets of names that move beyond the gender binary imposed
by government agencies. They could then incorporate these
datasets into the gender package, replacing male
and
female
designations with a much wider spectrum of
potential gender identities.
As it currently stands, the gender package meets an important
methodological need. As of November 2015, the gender package
has been downloaded more than 80,000 times and continues to
be downloaded about 19,000 times per month. It has also been
incorporated as a pedagogical tool within digital humanities
courses.