Daniel Burckhardt studied Mathematics and History of Science and Technology. He built the content management systems for H-Soz-Kult and H-ArtHist as well as the database for the federally funded project
For the past fifteen years, scholarly communication networks such as H-Soz-Kult –
the German Information Service for Historians – and H-ArtHist – a specialized
discussion and information network for art history based in Germany with an
international reach – have been steadily publishing conference announcements and
reports. Since both services were born digital, starting with the listserv
infrastructure of the Michigan-based H-Net and later supplemented by
database-driven websites, the archives are easily accessible by electronic
means. The aim of this paper is to demonstrate that the archives of scholarly
communication provide a suitable basis for conducting an assessment of broad
fields such as
This paper aims to demonstrate that the archives of scholarly communication provide a suitable basis for conducting an assessment of broad fields
The starting point for the following analysis was Lev Manovich’s provocative talk
don’t start with a research question.
By exploring an extensive corpus of texts of a single genre – H-Soz-Kult has published over 5,000 conference reports since 1998 – we were confident of revealing relevant insights into the disciplinary culture of historians as practiced in German-speaking countries over the past fifteen years.
The initial approach was primarily technology-driven. Readily available tools for
Natural Language Processing (NLP) such as Carrot2 for dynamic text
clustering
As the following example screen shows, LingoEdition
. Other clusters were labeled
with stop words (
) that carry little
information about the actual content within this corresponding set of
documents.
Similar effects could be observed in the list of representative terms for a set
of 100 topics generated from the conference reports by an initial run of MALLET:
While some of them are immediately recognizable (
), others again looked like fairly random
aggregates of stop words (
).
These initial difficulties shouldn’t be taken as a general argument against the
two clustering methods: As a recent study of scholarly blogging platforms has
shown,
Stanford NER provides a classifier explicitly trained on large German language
sets
This list of persons grouped by conferences then formed the basis for a bipartite
network where each appearance as a speaker defines an edge between a conference
and a person. The one-mode projection to the speaker nodes reveals
Web-services such as genderize.io
Visualizing the speakers’ network in Gephi
As mentioned above, the initial dataset consisted of 3,537 conference reports as
listed on the website of H-Soz-Kult from January 2008 until summer
2014.
The website archive for the second dataset, H-ArtHist’s conference announcements, starts a couple of years later, towards the end of 2010. The average of roughly 39 conferences per month is slightly lower than the number of reports on H-Soz-Kult but in a similar range.
The yearly conference cycle shows distinct peaks in November and June of every year and doesn’t follow the German academic calendar as closely. This confirms the more international focus of H-ArtHist that we observe among both the countries of subscribers and conference locations:
Germany dominates, but a significant share of subscribers and conferences is spread throughout Western Europe, especially the UK. For the US, we note a striking difference: while more than 15% of the subscribers are based in that country, only 8% of the conferences announced are held there. Two factors may account for this difference: many of the influential art history departments belong to private universities, and there is simply no tradition of widely circulating information about conferences. In addition, conferences as announced on H-ArtHist are closely tied to a mode of research driven by third-party funding, which is less common in the humanities in North America. For France and Italy, we observe the opposite relation: they feature more prominently among conference locations than the subscribers’ share would suggest. This is mainly due to the three independent German art history institutes abroad, each with an active and actively announced conference schedule: the Kunsthistorisches Institut in Florence, the Bibliotheca Hertziana in Rome, and the Deutsches Forum für Kunstgeschichte in Paris.
The essential step for building a bipartite network between the conference
reports or announcements and the respective speakers was the extraction of the
person entities from the unstructured texts. In the first case, the H-Soz-Kult
conference reports, a suitable regular expression led to a reasonable set of
candidates:

\b(\p{Lu}[\p{Lu}\x{00df}-\']+\s[\p{Lu}\x{00df}-\'\s.]+[\p{Lu}\x{00df}])s?\b

\p{Lu}, short for \p{Uppercase_Letter}: a Unicode upper-case letter
\x{00df}: German ß (has no upper-case variant)
s?: optional lower-case s at the end for the genitive (HANS MEIERs Vortrag)
A significant number of near duplicates resulted from variations of upper and
lower case, umlauts, accents and hyphens, illustrated by a single person in our
corpus alternatively spelled as Xosé Manoel Núñez, Xosé-Manoel Núñez and
Xose-Manoel Nunez. Our solution was to build a
We still found numerous misspelled names in the original reports: between one and two percent of the names contained errors besides variations in accents and punctuation. Since the median of speakers per report is – as will be detailed below – a bit above 10, more than 400 reports had to be manually adjusted. Some of the misspellings could easily be found by checking against a list of known first names (e.g. Urlike instead of Ulrike). Harder to catch were variations among given names with multiple correct spellings such as Matthias and Mathias. Even for a human, it was sometimes tricky to decide whether such variants (e.g. Höppner, Annika vs. Höppner, Anika) should be merged into a single person or left as they are because they refer to two different, correctly spelled individuals.
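One plausible way to group near duplicates that differ only in case, accents, umlauts and hyphens is a normalized comparison key. The sketch below is an assumption about how such a key could look, not a description of the solution actually used; `name_key` and `group_variants` are hypothetical names.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Build a comparison key that ignores case, diacritics and hyphens."""
    # NFKD splits letters such as "é" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Treat hyphens like spaces, fold case, and collapse repeated whitespace.
    return " ".join(without_marks.replace("-", " ").casefold().split())


def group_variants(names):
    """Map each key to the set of spellings that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return dict(groups)


variants = ["Xosé Manoel Núñez", "Xosé-Manoel Núñez", "Xose-Manoel Nunez"]
print(group_variants(variants))
```

A key of this kind cannot catch genuine misspellings (Urlike) or legitimate alternatives (Matthias and Mathias); those still need a name list or manual review.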
Both our regular expression and more advanced NER algorithms produce false
positives. One could be tempted to train a classifier to distinguish human
names from company or room names such as BMW AG or
Salle Vasari.
This turns out to be very hard for certain edge cases,
such as LINDE AG (the company) compared to LINDE NG (a person that actually has
a Facebook profile).Brage Bei Der Wieden
as a
person’s actual name, while
mountains along the Via Regia – is the title of a talk appearing in the conference overview appended to the report. Instead of investing additional technical effort in a programmatic solution, we took care of these cases by skimming the full list of person entities and manually removing false positives.
Even after this manual correction step, there are known mistakes in the full list
of speakers. Since the effort to manually research the following systematic
errors was deemed too high, no attempts were made to unify the following
variations:
In addition, due to changing affiliations and broad fields of interest that pose a serious obstacle to a systematic disambiguation even by human investigators, we did not try to resolve persons with the same exact name. So ANDREAS SCHNEIDER (Berlin) and ANDREAS SCHNEIDER (Meiningen) as well as HARALD MÜLLER (medieval history) vs. HARALD MÜLLER (law) show up as a single person entity in our dataset. From our experience with the H-ArtHist speakers’ list where GND-identifiers were manually researched for the 1,000 speakers with most conference appearances, we can be reasonably sure that namesakes are a relatively rare phenomenon in a corpus with predominantly Western names affecting at most half a percent of the persons.
After all these steps, we were left with 30,502 speakers among H-Soz-Kult’s 3,537 conference reports. In the case of H-ArtHist, the somewhat messier list resulted in 26,023 persons extracted from 2,459 conference announcements.
While the editorial convention in the H-Soz-Kult reports differentiates speakers (e.g. MORITZ FÖLLMER) from persons being spoken about (e.g. Walter Benjamin), this is not the case for H-ArtHist’s conference announcements. Therefore, key figures in the historiography of the arts such as Giorgio Vasari, Walter Benjamin and Aby Warburg showed up prominently among the most active persons and had to be manually flagged as non-speaker entities. H-Soz-Kult also differentiates between persons acting as speakers and those welcoming the attendees or taking part in the discussion. Clearly separating these roles can be difficult even for a human reporter: when does a welcome note turn into an introduction, and how does a respondent’s statement differ from a far-reaching question during the discussion?
In this light, we should expect the number of speakers per H-Soz-Kult conference report to be significantly lower than the average number of persons found in H-ArtHist’s conference announcements. The actual difference is relatively small (H-Soz-Kult: median of 13, average: 8.62, H-ArtHist: median 14, average: 10.8). This can partially be explained by the observation that there are only a limited number of reports on small conferences with just a few speakers (40 reports with 1 to 3 speakers compared to 101 announcements with the same person count).
Sorting the speakers’ list by the highest number of conference appearances provided a simple means for identifying the core group of very active (art) historians.
It has been shown that preferential attachment – well known conference speakers have a higher chance of being invited to the next conference – can lead to power law distributions. Both for H-Soz-Kult and H-ArtHist, we can observe such a pattern when graphing the distribution of the number of speakers in our list by the number of conference appearances on a double logarithmic scale:
In the case of H-Soz-Kult, there are more than 22,000 persons appearing in just a single report. Around 4,300 appear in two reports and 1,650 in three. At the opposite end, we find only 23 people appearing twenty times or more. At the far right of this distribution, we find a single person who presented at 37 conferences in less than seven years. Very prominent among this highly active group are the leaders of non-university research institutes. If we focus on the group of the most active university professors, we see that younger researchers clearly dominate over eminent scholars towards the end of their academic career.
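The distribution described above can be tabulated with a few lines of Python. The sketch below counts how many persons have exactly k appearances and fits a straight line to the points on a double logarithmic scale. The least-squares fit is a crude illustration only, not a rigorous power-law test, and the function names as well as the input format (person, conference pairs) are assumptions.

```python
import math
from collections import Counter


def appearance_distribution(appearances):
    """Map k to the number of persons with exactly k conference appearances.

    `appearances` is an iterable of (person, conference) pairs.
    """
    per_person = Counter(person for person, _ in appearances)
    return Counter(per_person.values())


def loglog_slope(distribution):
    """Least-squares slope of log10(persons) against log10(appearances)."""
    points = [
        (math.log10(k), math.log10(n))
        for k, n in distribution.items()
        if k > 0 and n > 0
    ]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return covariance / variance


# Rounded head of the H-Soz-Kult distribution as quoted in the text.
head = {1: 22000, 2: 4300, 3: 1650}
print(round(loglog_slope(head), 2))
```

A clearly negative slope over several orders of magnitude is what one would expect from a power-law-like pattern; three points are of course far too few to draw any conclusion.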
Twenty-two thousand persons appearing at just one conference over more than six
years seems a remarkably high figure. Random samples show it to be a very
diverse set, ranging from graduate students presenting their topic for the
first time, visiting scholars from abroad and scholars from neighboring
disciplines to politicians and other well-known speakers who only rarely
appear in an academic setting. More important: the total number of 30,000
people is in a similar range as the total number of subscriptions to H-Soz-Kult,
which is currently around 25,000 people. Due to the wide variety of professional
backgrounds just mentioned, we wouldn’t expect a full overlap. But we can safely
conjecture that of the roughly 8,000 people appearing twice or more, a
significant number will also turn out to be subscribers to the mailing list.
This group forms a
DE-Statis, the Federal Statistical Office, counted slightly more than 6,000 full-
or part-time employees at German universities in 2013; assuming a
comparable per capita share in Austria and the German-speaking part of
Switzerland, we’d expect a bit more than 7,000 full- or part-time employees in
history in the German-speaking academy.
For H-ArtHist, the number of entities extracted from the announcements amounts to
slightly more than 26,000 persons, with roughly 21,000 appearing just once. Due
to the less accurate name extraction process and no manual filtering so far, we
should take this number with a pinch of salt. But if we again assume that the
remaining 5,000 people who appear more than once represent the academic core
of art history, and further assume that half of them are active in the
German-speaking academy (based on the country distribution of the
announcements), we can estimate that there are about 2,500 academic art
historians, a count again not
too far from the employment figures reported by DE-Statis.
As mentioned above, checking the first word of extracted person entities against a given set of well known given names was initially introduced to identify typos and false positives. Since genderize.io provided a suitable API for this job, we ended up with an automated assignment of female or male gender for the wide majority of persons extracted from our datasets.
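A lookup of this kind could be sketched as follows. The endpoint `https://api.genderize.io` with a `name` query parameter and the `gender` and `probability` response fields are assumed from the service's public documentation and may change or be rate limited. The helper names, the injectable `fetch` argument and the `min_probability` threshold are hypothetical and not taken from the original workflow.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public genderize.io endpoint; check the current documentation.
API_URL = "https://api.genderize.io"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def guess_gender(full_name, fetch=fetch_json, min_probability=0.9):
    """Return "female", "male" or None for an extracted person entity.

    Only the first word (the presumed given name) is sent to the service.
    `min_probability` is a hypothetical cut-off for discarding weak guesses.
    """
    given_name = full_name.split()[0]
    payload = fetch(f"{API_URL}?{urlencode({'name': given_name})}")
    gender = payload.get("gender")
    if gender is None or payload.get("probability", 0.0) < min_probability:
        return None
    return gender
```

Passing a stub as `fetch` keeps the function testable without network access; names the service cannot classify come back as `None` and stay unassigned.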
Gender inequality in academia has been extensively studied for the past fifty
years.Why have there been no great women
[art] historians?
For an overview of recent studies focusing on
teaching and research in the US see
If we look at the figures from our reports, we find a The sudden
drop [among women] happens in the postdoc phase, when family planning is
pending, but the academic system only offers uncertain perspectives. Female
academics that have to muddle through temporary and poorly paid third-party
projects until the age of 40 are often forced to choose between a child and
a career.
Art history, perhaps not unexpectedly, shows a significantly smaller gender
gap: the distribution among all persons mentioned in our announcements is 56%
(male) to 44% (female). If we only take into consideration the people
mentioned in five or more announcements, the share of women drops slightly
to 40%, and for 10 or more announcements to 35%. This is roughly comparable to
the share of habilitations by women in German art history.
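The thresholded shares quoted above follow from a simple filter over the person list. A minimal sketch, assuming each person is represented as a pair of appearance count and assigned gender (the function name and data layout are hypothetical, and the toy data is invented for illustration only):

```python
def share_of_women(persons, min_appearances=1):
    """Share of women among persons with at least `min_appearances`.

    `persons` is an iterable of (appearances, gender) pairs, where gender is
    "female", "male" or None for names that could not be assigned.
    Unassigned persons are left out of the denominator.
    """
    selected = [
        gender
        for appearances, gender in persons
        if appearances >= min_appearances and gender in ("female", "male")
    ]
    if not selected:
        return None
    return sum(gender == "female" for gender in selected) / len(selected)


# Toy data, not taken from the corpus.
toy_persons = [
    (1, "female"), (1, "female"), (1, "male"),
    (5, "male"), (6, "female"), (12, "male"), (10, None),
]
for threshold in (1, 5, 10):
    print(threshold, share_of_women(toy_persons, threshold))
```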
After these rather simple but nevertheless insightful counts of basic properties,
we finally move over to the more complex relations among persons studied through
the network of
If we compare our tie to those commonly studied in bibliometrics, co-speakership
is certainly weaker than co-authorship, a relation we still only very rarely
observe in the humanities. On the other hand, our tie seems quite a bit stronger
than citations among different authors, a primarily intellectual connection
that – at least in larger disciplines – only rarely implies a real-world
social interaction.
Since speakers that only appear once don’t add any additional connections between otherwise unrelated speakers, we started with the reduced network including only persons appearing at two or more conferences. In the case of H-Soz-Kult, we get a graph with 8,361 speaker nodes connected by more than 130,000 edges. For H-ArtHist, the corresponding graph consists of 5,165 person nodes and a bit less than 65,000 edges.
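The construction of the reduced co-speaker network can be sketched in plain Python. The sketch assumes the extracted data is available as (conference, person) pairs; the function name and the edge weighting by number of shared conferences are illustrative choices, not necessarily those of the original analysis.

```python
from collections import defaultdict
from itertools import combinations


def co_speaker_network(appearances, min_conferences=2):
    """One-mode projection of the conference/person network onto persons.

    `appearances` is an iterable of (conference, person) pairs. Persons with
    fewer than `min_conferences` distinct conferences are dropped first.
    Returns the retained persons and a dict mapping person pairs to the
    number of conferences they share.
    """
    conferences_of = defaultdict(set)
    for conference, person in appearances:
        conferences_of[person].add(conference)

    core = {
        person
        for person, conferences in conferences_of.items()
        if len(conferences) >= min_conferences
    }

    speakers_at = defaultdict(set)
    for person in core:
        for conference in conferences_of[person]:
            speakers_at[conference].add(person)

    edge_weights = defaultdict(int)
    for speakers in speakers_at.values():
        # Every pair of core speakers at the same conference shares an edge.
        for a, b in combinations(sorted(speakers), 2):
            edge_weights[(a, b)] += 1
    return core, dict(edge_weights)


toy = [
    ("c1", "A"), ("c1", "B"), ("c1", "C"),
    ("c2", "A"), ("c2", "B"),
    ("c3", "B"), ("c3", "D"),
]
print(co_speaker_network(toy))
```

The resulting edge list can be exported for tools such as Gephi; dropping one-time speakers before the projection keeps the number of edges manageable.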
Though the full picture shows the typical hairball one would expect, coloring
the graph by modularity classes, Gephi’s community detection
algorithm
No such division could be observed in the corresponding Gephi-graph for art history. We find a few well known scholars in the center of the graph, but their neighborhood is much less defined by specific periods or topics than by the geographical location of the research institute.
The more unitary structure of the narrower field of art history is probably due to its smaller institutes and departments, which entail wider teaching duties and a lower degree of specialization.
Of the commonly studied centrality measures for one-mode networks, Degree
distribution and Betweenness centrality directly correlate with the total number
of conference appearances of a person in our datasets and therefore add no
additional insights to our analysis.assessing how well
connected an individual is to the parts of the network with the greatest
connectivity.
Going back to the full network of H-Soz-Kult, one sees that the early modern period shows a group of highly connected people (marked by the size of the labels). This corresponds to the gateway function between the medieval and modern periods as well as the tight connections among themselves. So it is absolutely not surprising that the only working group of the German Historian’s association organized around a specific period is the
Another fact worth investigating is the number of
conference-
This relative rareness of co-appearances leads to a very simple but amazingly
effective clustering algorithm for
Due to the rise of electronic publications and communications (e-mail, Twitter), large-scale social network analysis will become more and more feasible, especially in politics, contemporary history and science studies. As the case of H-ArtHist has shown, readily available NER services such as AlchemyAPI work reasonably well and yield roughly comparable results on corpora where implicit (H-Soz-Kult) or explicit (well-structured TEI editions) markup of persons is missing.
The data we’ve been analyzing covers the past six (H-Soz-Kult) and five (H-ArtHist) years respectively. In the humanities, this is a relatively brief period, roughly comparable to the time needed for writing a PhD or a habilitation thesis. Therefore we cannot yet trace personal careers from the first conference appearance – usually at the beginning of a PhD – all the way up to the appointment as a professor, generally more than a decade later. Hopefully, both services will continue their work so that this analysis can be repeated in a few years with a special focus on newly appointed faculty. Such an analysis might identify successful personal and thematic networks similar to the
While citation indices and impact factors are widely accepted in many branches of
science, social sciences and economics, we are well aware of the fact that
attempts to quantify and measure institutional or personal research activity
have been strongly opposed by professional organizations in the humanities,
especially in Germany.
But even where a suitable corpus is available, professional curiosity eager to
reveal fine grained images by assembling pieces of public data must not forget
the right of individuals not to be spied upon.