Matthew Milner is a digital historian of late medieval and early modern England, specializing in the history of the senses and religious life, and cultural network modeling. He is the author of
Stephen Wittek is Assistant Professor of literature at Carnegie Mellon University, Pittsburgh. He is the author of
Stéfan Sinclair is an Associate Professor in Digital Humanities at McGill University. His research focuses primarily on the design, development and theorization of tools for the digital humanities, especially for text analysis and visualization. He has led or contributed significantly to projects such as Voyeur Tools, Simulated Environment for Theatre, and BonPatron. Other professional activities include serving as associate editor for
This is the source
We provide a comprehensive introduction to DREaM (Distant Reading Early Modernity), a hybrid text analysis and text archive project that opens up new possibilities for working with the collection of early modern texts in the EEBO-TCP collection (Phases I & II). Key functionalities of DREaM include i) management of orthographic variance; ii) the ability to create specially-tailored subsets of the EEBO-TCP corpus based on criteria such as date, title keyword, or author; and iii) direct export of subsets to Voyant Tools, a multi-purpose environment for textual visualization and analysis.
An introduction to DREaM (Distant Reading Early Modernity), a hybrid text analysis and text archive project.
DREaM
The creators of DREaM (Matthew Milner, Stéfan Sinclair, and Stephen Wittek) are members of the Digital Humanities Team for
Early Modern Conversions: earlymodernconversions.com
Early Modern Conversions Digital Humanities Team:http://www.earlymodernconversions.com/people/digital-humanities-team/.
one of the constitutive notions of Western consciousness and conscience,arguing that,
in effect, one can represent the whole history of the West as a ceaseless effort at renewal by perfecting the techniques ofconversion,which is to say the techniques intended to transform human reality, either by bringing it back to its original essence (conversion-return) or by radically modifying it (conversion-mutation).
distant readingtechniques to a collection of texts like those made available through EEBO-TCP, with the long-term goal of lighting a way forward for similar investigations of other early modern corpora in the future
A canon of two hundred novels, for instance, sounds very large for nineteenth-century Britain (and is much larger than the current one), but is still less than one per cent of the novels that were actually published: twenty thousand, thirty, more, no one really knows — and close reading won’t help here, a novel a day every day of the year would take a century or so.Matt Jockers makes a similar point:
Macroanalysis is not a competitor pitted against close reading. Both the theory and the methodology are aimed at the discovery and delivery of evidence. This evidence is different from what is derived through close reading, but it is evidence, important evidence. At times, the new evidence will confirm what we have already gathered through anecdotal study. At other times, the evidence will alter our sense of what we thought we knew. Either way the result is a more accurate picture of our subject. This is not the stuff of radical campaigns or individual efforts toconquer and lay waste to traditional modes of scholarship.
The EEBO-TCP collection does not lend itself to computer-based analysis without a great deal of labour-intensive preparation and re-organization, primarily because of two distinct sets of problems: a) the relative inflexibility in the EEBO interface (see http://quod.lib.umich.edu/e/eebogroup/), which functions very well as a finding aid, but is a poor mechanism for compiling subsets of full texts; and b) orthographic irregularity. To get a sense of how these problems can impact a workflow, suppose a user wanted to analyze the 740 texts in the EEBO-TCP corpus dating from 1623 to 1625. Although identifying the desired material on EEBO requires nothing more than a simple date search, the process of bringing all the files together into a unified, searchable subset is considerably more difficult because the EEBO interface does not offer options for batch downloading, and therefore requires users to download files in a desired subset on a one-by-one basis. Once assembled, any subset created in this manner will also require significant editing because the plain text file one can download from EEBO does not come along with an accompanying file for metadata (i.e., information to indicate the title, date of publication, author, etc.). Rather, metadata appears within the file itself, a convention that can significantly complicate or contaminate the results of macro-scale analysis.
On a similar note, texts in the EEBO-TCP collection also feature a high degree of
orthographic irregularity, a characteristic that adds an extra layer of complication
for researchers who want to conduct any sort of query that involves finding and
measuring words. As anyone who has ever read an unedited early modern text will be
aware, writers in the period did not have comprehensive standards for spelling, so
any given term can have multiple iterations (e.g., wife, wif, wiv,
wyf, wyfe, wyff, etc.). Without a method for managing spelling variance,
one cannot reliably track the distribution of words in a corpus, or perform many of
the basic functions fundamental to textual analytics.
DREaM addresses both of these issues directly. Although the interface is similar in
some ways to the interface for EEBO, it is much more flexible, and offers different
kinds of search parameters, making the process of subset creation more powerful and
convenient, and allowing users of the EEBO-TCP corpus to perform macro-scale textual
analysis with greater ease. Secondly, and perhaps most important for corpus
text-analysis, DREaM utilizes a version of the EEBO-TCP corpus specially encoded
with normalizations for orthographic variants, a feature that enables frequency
tracking across multiple iterations of a given term. In other words, DREaM can
search orthographically standardized versions of texts
Figure 2 shows an example search on the DREaM
interface. In the top half of the screen, there are four fields that one can use to
define a subset in terms of keyword, title keyword, author, and publisher. Just
below these fields, there is a horizontal slider with two handles that users can
drag to define a date range. As one enters the subset parameters, a number appears
in the top right hand corner of the Export button to indicate the number of texts
the proposed subset will contain (for example, in Figure
2, the user has defined a subset of 56 texts that feature the term
conversion
in the title). To the right of the Export button, a thumbnail
line graph offers a rough idea of text distribution across the date range (the graph
in Figure 2 shows a significant peak toward the latter
end of the range).
Clicking on the Export button will bring up a window where users can choose to download the subset as a ZIP archive, or send it directly in Voyant Tools, a multi-purpose environment for textual visualization and analysis. Users can also choose to download the subset as a collection of XML or plain text files. At the bottom of the Export window, a convenient drag-and-drop mechanism offers options for naming the files according to year, title, author, publisher, or combinations thereof, a functionality that facilitates custom tailoring for specific sorts of enquiries. For example, a researcher comparing works by various authors would likely want to put the author at the beginning of the file name, while a researcher tracking developments across a specific date range would likely want to begin the file name with the year.
Clicking on Send to Voyant Tools
in the Export window will open up a new page
that shows textual analysis from a suite of digital tools (see Figure 3). The tools displayed in the default settings
are i)
distant readingand
close readingperspectives, and also makes it easier to follow up on unexpected discoveries, or explore specific items of interest on the fly.
In order to clarify the intervention that DREaM aims to bring to digital humanities research on early English print, it will help to briefly review two similar projects based around the EEBO-TCP collection, and to situate them in comparison to DREaM. The first is
Despite overlap of certain functions, DREaM, EMP, and BYU-EEBO represent very
different responses to a growing demand for tools that reach beyond the conventional
use scenarios envisioned by the designers of the EEBO interface in the late
nineties. Each project has distinct strengths and weaknesses. At the risk of
over-generalization, one might say that BYU-EEBO caters primarily to linguistics
research, while DREaM aims for a more open-ended, exploratory style of corpus
interrogation — and EMP is somewhere in between. Although all three projects bring
benefits of value to a growing field, it is important to note that, because it works
seamlessly with Voyant Tools, DREaM is the only one that enables full, direct access
to the texts in the EEBO-TCP corpus, a feature that allow users to check the source
of data very quickly, or follow up on elements of particular interest. On a similar
note, DREaM is also the only platform designed to work in conjunction with other
tools. For example, a user could create a subset in DREaM, download it, and conduct
analysis in other platforms, such as R or Python.
Our vision of an easy-to-use interface, and quick corpus creation required transformation and enhancement of the existing EEBO-TCP collection in two major ways. First, was the normalization, or standardization, of orthographic variants which are critical for the statistical analytics that power distant reading methods. Second was the enhancement of the metadata, allowing a richer set of parameters for building corpora of texts to analyze. Both were iterative processes. The resulting texts which power DREaM are new versions of the EEBO-TCP corpus that combine the EEBO-TCP metadata header with the outputs of each process: a full normalized version of each text containing the EEBO-TCP TEI SGML, with tagged normalizations, and new metadata drawn from linked open data provided by OCLC.
Before we could normalize the spelling of our 44,418-document corpus the texts
need some pre-processing. Although the EEBO-TCP is primarily English, it also
contains texts in Latin, French, German, Dutch, Spanish, Portuguese, Italian,
Hebrew, and Welsh. To make things more complicated, some EEBO-TCP texts are
multi-lingual. We decided to use VARD2, a tool built specifically for Early
Modern English.<text
lang="eng">
. Since TCP allows for multiple values in the language
attribute, we initially identified potential texts if eng
appeared as a
value of the attribute, e.g. <text lang="eng lat ita">
. The
result was a working set of 40,170 documents which contained declared English
text in some place or another.
In EEBO-TCP <text>
elements can be nested. Combined with
multilingual values for the language attribute of the element, this creates
potentially complicated conditions for extraction. We quickly found that there
were 128 possible xPath locations of <text>
in the TCP SGML
documents. Here are a few examples of these locations:
Since we were only interested in normalizing English text, we had to isolate the
<text lang="eng">
elements, and preserve their order. As
noted by the following examples, where the parent was English-only, we extracted
the parent; however, when <text>
elements contained
multilingual language attribute values extracted English-only children.
The script we created concatenated the matching <text
lang="eng">
elements, and remarried the existing EEBO-TCP metadata
file (*.hdr in the EECO-TCP file dump) to the new English-only body
to
produce a corpus of 40,170 truncated texts that could stand on its own, or be
used for orthographic normalization.
Normalizing the 40,170 English-only texts was iterative, as we optimized and
tweaked our method. Prior to producing any final sets using VARD2, we ran it on
the entire English corpus, and edited the dictionary by hand, catching some 373
normalizations which were problematic in one way or another. These amounted to
462,975 changes overall — only 1.03% of the total number of normalizations.
Several examples are illustrative of the problem normalizations:
strawberie
became strawy
rather than strawberry,
and
hoouering
became hoovering
rather than hovering
,
effectively creating a word that simply didn’t exist in period English. VARD’s
statistical method is not robust enough to analyze the particular context that
would allow it to discern which was appropriate: more ambiguous words like
peece
could either be piece
or peace
. In such cases, we
decided normalization was too problematic, and thus set VARD to ignore
peece
entirely.
Our pre-processing script not only extracted the <text>
elements, it also altered them prior to loading the texts into VARD2. We
experimented with both the texts we input into VARD2, and with its orthographic
normalization parameters, creating twelve overall versions of the EEBO-TCP
English corpus. The purpose was straightforward: to ascertain what degree of
normalization best balanced the contextual ambiguities like peece
but
achieved a level of optimal normalization for text-analysis tools like Voyant,
and to see whether limited pre-processing of the EEBO-TCP texts themselves,
prior to VARD2 processing, improved the results. We ran each degree of
normalization on three versions of the English-only <text>
elements: a Regular
unedited version acted as our control; a
Cleaned
version which removed characters and tags that impeded VARD’s
normalization process by splitting words; and an Expanded
version that
expanded the macron diacritic, typically used in early modern English to
represent a compressed m
or n
on the preceding vowel (e.g.
com̄ited
became committed
).The Cleaned
version removed
pipe characters | and editorial square brackets [], but also
<supr>
and <subs>
tags which broke up
words. We opted not to remove any <GAP DESC="illegible"...>
elements as it quickly became apparent that doing so would create more problems
than it might resolve. Although VARD could likely handle a single missing
character, the highly variable nature of a text gap made it questionable what
gaps we should let VARD patch
, and which it could not: two characters, or
only a single character? Is that character a word on its own? We felt that the
subjective nature of matching whether legibility was accurately assessed or not,
and where to set the bar for VARD normalization, made the removal of these
elements unpredictable enough that it would confound later analytical interests.
We ran each of these versions in VARD2, in turn, at four distinct match
levels in order to assess what level seems best for the overall corpus. VARD
normalizes words if it finds a match that is +0.01% higher than the match
parameter a user sets: a 50% setting will only alter a variant with a
statistical match of 50.01%. We first ran VARD at 50%, it was evident this
excluded a large number of obvious variants we needed normalized which appeared
between 45% and 50%. We re-ran the normalization at 45%. We also ran VARD2 at
35% and 65% both as controls, with a mind to producing sets of the English-only
corpus that might help us determine whether normalization levels were best
adjusted as we proceeded chronologically in the corpus. In each case we set the
output to XML
so that VARD2 would tag the normalizations, retaining the
original spelling as an attribute in the surrounding
<normalized>
tag. These tags were critical later on for
indexing the texts in DREaM itself. In the end we decided to use the 45% match
Cleaned
version of the English only EEBO-TCP corpus for DREaM because
it seems to have the best balanced normalization following the removal of tags
that prevented VARD2 from working correctly, but without any expansions or
<GAP>
elements. When collating the final file, we noted
the set name, as well as the date and match level, as attributes in a wrapper
element that enclosed the normalized text. We then coupled the entire XML text
to the EEBO-TCP TEI header, and enclosed everything once again in an
<EEBO>
element.
VARD2 itself was fairly easy to use. Even so, it took some doing to ensure it
operated smoothly with the variable sizes of EEBO-TCP texts. Although it has a
command line batch mode, we quickly ran into trouble as VARD2 would crash
handling 5MB texts on the recommended memory settings of –Xms256M
–Xmx512M
. The crashes occurred frequently enough, despite raising the
memory settings to over 1GB, that we decided to run VARD2 through our own php
script, executing the program once for each of the 40,170 input files, with the
appropriate setting. Even then, 1-3GB memory settings were insufficient to
handle the ~100 or so EEBO-TCP files that are over 10MB. We ended up running our
VARD2 processing script over a four-day period at -Xms6000M -Xmx7000M
–
or, significantly higher than the recommended settings. Undoubtedly this took
longer than batch mode, but it was more stable.
Enhancing the EEBO-TCP metadata was also iterative, but did not use a specific
tool; rather, we processed the metadata using a combination of php, mysql, and
text files to build gazetteer data. The existing metadata files of the EEBO-TCP
(*.hdr) contain a wealth of information about the file encoding process, as well
as several types of identifiers. They also employ a standardized or canonized
list of authors and publication places, as well as publication dates. Though the
metadata contains the critical identifiers from the Short Title Catalogue, it
lacks data that is now available via Open Data resources like OCLC and VIAF that
is often of interest to scholars using the EEBO text, such as gender of authors,
or better dates of birth and death. Moreover, in both OCLC and EEBO-TCP
metadata, the <publisher>
remains an unparsed string despite
containing a wealth of information on publication such as publishers, dates, and
historic addresses.
The enhancement of the metadata was two-fold. First, it was to find possible
matching OCLC records and IDs for EEBO-TCP files, and pull in OCLC data to
create a new metadata header containing authority dates (some
<date>
elements in EEBO-TCP contained artifacts like the
letter L for the number 1 and the letter O for the number 0), places of
publication, and titles, as well as data like gender, and birth and death dates
from VIAF for EEBO-TCP authors, and referencing the OCLC and VIAF IDs as online
resources in the new XML. Second, to identify individuals, places, and addresses
in the unparsed <publisher>
element.
Matching EEBO-TCP texts to OCLC records was not as straightforward as it might
appear. EEBO-TCP comprises textual witnesses or instances, while OCLC collates
records from its partner libraries in order to build records that are
manifestations or works
. This conceptual distinction is critical as it
means that there might well be several OCLC IDs for an individual EEBO-TCP text,
potentially with variable metadata. Initially we had thought it possible to
obtain a dump of the OCLC using EEBO as a <sex>
element, rather
than gender, to describe this information despite the problems inherent with sex
/ gender distinction). This version was coupled to the English only corpus we
processed with VARD2, along with a truncated version of the original EEBO-TCP
metadata. Combining the new metadata, and the original, in a larger metadata
header for the files, gave us the ability to present users with the option of
searching using the original EEBO-TCP metadata, or the hybrid OCLC-EEBO-TCP
metadata.
The second version of the DREaM metadata, which we produced in early fall 2015,
took up the challenge of parsing the <publisher>
element
which remains unparsed in both OCLC and EEBO-TCP metadata. EEBO-TCP A00257
provides a good example of this string: <publisher>By Iohn Allde and
Richarde Iohnes and are to be solde at the long shop adioining vnto S.
Mildreds Churche in the Pultrie and at the litle shop adioining to the
northwest doore of Paules Churche,</publisher>
John Allde and
Richard Jones are not mentioned anywhere in the original EEBO-TCP metadata, only
the author <author>H. B., fl. 1566</author>
. Isolating
these individuals required creation of two gazetteers: first, for place names,
and second, one for known agents (from the EEBO-TCP author list). Rather than
working with 44,418 entries, we ran our script iteratively over the 23,644
distinct <publisher>
strings, adding the possible places and
individuals to the growing gazetteers, and pulling in variants for authors’
names from VIAF’s RDF XML files (<schema:alternateName>
)
retrieved by using the results from VIAF’s AutoSuggest API (http://viaf.org/viaf/AutoSuggest?query={searchterms}). We also
created a short list to translate common early modern first names and
abbreviations into modern versions, such as Io. for John, or Wyliam for William.
After some 20 passes over the data, including manual editing of the agents
gazetteer, patching for new scenarios, and comparing possible matches to the
dates of publications in EEBO-TCP texts, we were left with a gazetteer of
195,213 variants representing 24,076 distinct possible names for 19,836 distinct
VIAF IDs. Some, like F.M.
, were too imprecise to resolve. Our work also
alerted OCLC to problems with their AutoSuggest API when it returned the
canonical names of an author with another individual’s VIAF ID, usually for
co-author (e.g. Nicholas Bourne’s VIAF ID appeared for Thomas Goodwill). This
required additional processing of incoming data to see if the names it returned
matched those we wanted to query. Equally problematic were dates, which appeared
inconsistently throughout the VIAF open data. RDF XML might lack any dates,
while the VIAF <ns2:birthDate>
and <ns2:deathDate>
or
as part of a MARC individual agent record element <ns2:subfield
code="d">
. Many values in these locations were 0
despite
the presence of birth and death years as part of the canonical name; this became
a third option for checking whether someone could be the author of the text in
question. Last resort was using the decades of publishing activity indicated in
the Cluster XML as <ns2:dates max="201" min="150">
.
Consequently, all of these matches have a confidence marker.0
denotes an exact match or a match with a
EEBO-TCP author, with a publication date which falls in between an
individual’s birth and death dates; 1
lacked either birth or death
dates, or used publishing activity dates; and lastly 2
lacked any
dates.<dreamheader>
is richer than the original EECO-TCP
metadata, containing not only the biographical data of VIAF records for the
author, H. B., but also similar data for the individuals found in the
<publisher>
element.
The addition of confidence markers in our new DREaM metadata indicates a
different approach to open metadata. Rather than seeing metadata as an
authoritative or concrete accounting of actual attribution and representation
(as is the practice of archivists and cataloguers), the DREaM metadata
(especially in regards to the parsed <publisher>
data) should
be viewed more as a kind of contingent scholarly assertion. DREaM does not have
the resources to double check all some 40,000 texts to ensure exact accuracy of
the matching of EECO-TCP metadata with VIAF and OCLC identifiers. In many cases,
doing so requires expert domain knowledge, and means to accurately resolve
entities. A good example of this is B. Alsop
: is it Bernard or Benjamin
Alsop? The two printer publishers were most likely related, but in some
instances it isn’t possible to distinguish between the two, as in the case of
EEBO-TCP text A00012, Robert Aylett,
B. Alsop. By documenting both, and marking a confidence level, we’re asserting that metadata is very much the product of ongoing research: it should not be seen as definitive. While DREaM allows researchers the ability to create corpora based on either exact or fuzzy searches, we’re also publishing this metadata separately, and offering it to EEBO-TCP so that the wider scholarly community can refine and critique it.
DREaM has led to the enrichment and enhancement of the EEBO-TCP corpus, but the project has also led to some significant architectural and functionality improvements in Voyant Tools.
In particular:
In addition, work on DREaM helped prioritized other planned functionality, such as reordering documents in a corpus, editing document metadata, access management for corpora (to reflect restrictions for EEBO-TCP), and overall scalability improvements. We had not yet worked on a single corpus with more than 44,000 texts of variable lengths (the input XML for DREaM weighs in at more than 10GB). In short, DREaM provided an ideal test bed for efforts to enhance the scalability of Voyant Tools. This use-case driven development seems to us an ideal scenario for building new generations of resources in the Digital Humanities.