Tom J. Lynch is a Senior Application Designer at CSC, Computer Sciences Corporation. Since 1997, his professional interests have included the design and development of web-based software applications, data modeling, and user-interface design. In 2012, he completed a Master of Library and Information Science at the University of Illinois at Urbana-Champaign where his research interests included data curation, digital humanities, and text mining.
For digital humanists planning to build tools for cyberinfrastructure, several variables ought to be defined for each project. Pay close attention to the balance between traditional methods and new ways of conducting research. When gathering resources to do the job, seek contributions from experts in different domains. Careful consideration of a tool’s intended scope will also help refine the resources needed to complete a project. This case study illustrates how one project, the Social Networks and Archival Context Project (SNAC), has defined these variables. The process of building a new tool also benefits from an awareness of the infrastructure that has come before it. SNAC illustrates this awareness in the way it has taken advantage of existing infrastructure, both cyber and not, by extending its purpose and building new features on top of it.
A case study of the Social Networks and Archival Context Project
When speculating about the future of scholarship, it seems certain that research and
teaching will continue to be affected by the evolution of traditional infrastructure
into digital forms. This is as true for the humanities as it is for other
disciplines, and digital humanists in particular should have an important role to
play in shaping the infrastructure for their domains. Digital humanists can
both identify the needs of mainstream humanities scholars and suggest acceptable
computational solutions to those needs.
The American Council of Learned Societies (ACLS) report on cyberinfrastructure
defines cyberinfrastructure as the layer of information, expertise,
standards, policies, tools, and services that are shared broadly across
communities of inquiry but developed for specific scholarly purposes:
cyberinfrastructure is something more specific than the network itself, but
it is something more general than a tool or a resource developed for a
particular project, a range of projects, or, even more broadly, for a
particular discipline.
This case study also relies on the notion of an infrastructure stack. The sense of “stack” I use is
the one that refers to a pile of objects, one on top of the other. In this case the
stacked objects are abstract systems and technologies. The oldest technology rests
at the bottom and newer technologies pile up to the top. If you remove one from the
stack, the objects above it and supported by it could not exist without the missing
foundation. To illustrate this concept and elaborate on the balancing act of tool
design, I offer a case study of The Social Networks and Archival Context (SNAC)
Project, which is an emerging example of cyberinfrastructure. SNAC serves this
purpose by demonstrating the ability to balance many of the variables necessary for
cyberinfrastructure, and the project is built upon an easily definable infrastructure
stack. I will begin this case study with a description of SNAC and its prototype
user interface.
The SNAC project aims to provide scholars with improved access to distributed
historical records with new discovery tools. The heart of the project is a very
large (and still growing) dataset of archival creator descriptions expressed in
the international metadata standard Encoded Archival Context-Corporate Bodies,
Persons, and Families (EAC-CPF). The records describe people and corporate
bodies who are the creators of primary humanities resources that have been
collected into archives and libraries around the world. The descriptions contain
biographical data, lists of associated resources, and lists of associations with
other people. This curated set of data enables SNAC to build new discovery tools
for researchers. SNAC’s goal is to provide improved access to the resources that
document the lives, work, and events surrounding historical persons, and provide
unprecedented access to the biographical-historical contexts of the people
documented in the resources, including the social-professional networks within
which they lived and worked.
SNAC’s emphasis on social networks is, of course, no accident. Social networks
are a hot topic. However, social networks have been around far longer than
Facebook. Simply put, a network is a set of relations between objects; a social
network the set of relations between people. Kadushin describes social networks
as having been at the core of human society
since we were hunters and gatherers. People were tied together through
their relations with one another and their dependence on one
another…Kinship and family relations are social networks. Neighbors,
villages, and cities are crisscrossed with networks of obligations and
relationships.
What is new are the systematic ways of talking about social networks, depicting
them, analyzing them, and showing how they are related to more formal social
arrangements such as organizations and governments.
From Facebook to Flickr, Twitter to Tumblr, the social networking concept is
embodied everywhere online through many widely used Web applications. Taking
advantage of the relationships captured in its dataset, SNAC has developed a
social network web application that has helped the project earn the nickname
“Facebook for the dead.”
Figure 1 shows an example of a personal profile in the online SNAC prototype
(http://socialarchive.iath.virginia.edu/snac/search), in this case the
one for J. Robert Oppenheimer. The profile contains descriptive metadata such as
life dates, occupations, subjects to which the person is related, alternative
forms of his or her name, and a biographical history. The central column
contains “Links to collections” and “Related names in SNAC,” which list the
connections from the personal profile to archival collections, other
people, corporate bodies, and resources, and provide links to traverse
those connections. For example, Oppenheimer is listed as the
creator of 38 collections housed at the Library of Congress and several
university libraries. Additionally, SNAC lists another 163 archival collections
in which Oppenheimer is referenced. For each of these collections SNAC provides
a link to its entry in WorldCat, the online catalog of the world’s largest
network of library content and services.
The profile pages, the links between them, and the lists of external related
resources are not created manually by experts. Another characteristic SNAC
shares with social networking sites is the bottom-up, rather than top-down,
approach to organizing its content. Individuals in the online community define
the content and structure of a typical social network site instead of
professional information providers.
Using SNAC as an example, this case study will show what a software tool may look like within humanities cyberinfrastructure and describe its relationship with the other components listed in the definition of cyberinfrastructure from the ACLS report quoted above (information, expertise, standards, etc.). Alongside discussing SNAC specifically, I will address cyberinfrastructure more generally by defining a set of variables to consider when approaching the design of a new tool. When reviewing the existing literature about humanities cyberinfrastructure, a picture emerges that helps define these variables. Imagine a three-dimensional graph where examples of cyberinfrastructure technology could be plotted. Each of the three axes of the graph runs between opposing extremes: the makeup of the project team (humanists and non-humanists), the scope of the project (broad and narrow), and the choice of methods and technologies (traditional and advanced).
I will discuss each of these axes in turn by highlighting ideas from the literature and by showing how SNAC offers a real-world example of a project that has found its place on the graph by striking an appropriate balance for its context on each axis. Plotting a multitude of projects on the axes imagined here is beyond the scope of this article, but I do believe SNAC falls close to the middle of the graph, hence its use as an illustrative example.
Due to its complexity and necessarily broad range of required skills, the
development of cyberinfrastructure often requires collaboration between
humanists and non-humanists. First and foremost, scholars of the humanities
must lead the development of their own cyberinfrastructure, despite the fact
that tool development is not yet rewarded on par with more traditional
scholarly outputs. Success also requires cooperation with librarians,
curators, and archivists; the involvement of experts in the
sciences, law, business, and entertainment; and active participation
from and endorsement by the general public.
In the case of SNAC, its emphasis on social networks is conducive to
attracting collaborative partners. The study of social networks has been
identified as a cross-cutting agenda where the digital scholarship of
humanists, technology researchers, and others converge in a way that
encourages collaboration and, in particular, brings humanists to a seat at
the table where scholarly infrastructure decisions are being made.
The Institute for Advanced Technology in the Humanities (IATH), University of
Virginia, leads the team developing SNAC. IATH is known for collaborative
research and technology development with the practical needs of humanities
scholars as its primary motivation. The list of other participating
institutions and individuals on SNAC’s project team and advisory board
suggests a wide range of disciplinary expertise is contributing to the
project, including historians, English scholars, librarians, archivists,
information technology developers, information scientists, computer
scientists, and more.
Given the great need for teams of collaborators, it should be evident that
the scale of cyberinfrastructure research is large, the goals are many and
multi-faceted, and the aim is to serve a large audience. However, attempting
to do too much, to make a technology useful for everyone in the humanities,
risks the possibility of making it useful for nobody, because a tool with a
very broad audience in mind is unlikely to meet the requirements of
specialized fields.
SNAC’s focus on social networks and creator description limits its scope by
reducing the types of data to be managed and the kinds of functionality
designers might build into the tool for analysis. However, it does not
restrict the size of the project. SNAC is poised to expand widely within a
certain band of information, specifically the description of primary
resource creators and the relations between them. SNAC has compiled over 2.6
million EAC-CPF records describing persons, families, and corporate bodies.
With its narrow, but deep, focus on creator description, SNAC intends to do a
handful of tasks extremely well and integrate itself into existing
infrastructure to offer researchers more functionality. The challenges
involved in creating, managing, and providing access to this massive amount
of data are a big enough task for one project, so SNAC will rely on others
to extend its functionality. For example, SNAC does not intend to collect
digitized primary resources or provide direct access to such resources, but
rather prefers to leverage existing cyberinfrastructure by providing links
to WorldCat (http://www.worldcat.org/) and the online finding aids of archival
institutions to bring researchers closer to the resources they seek. Also,
SNAC has a specific audience in mind that is large, but certainly not
all-inclusive: anyone interested in historical research. SNAC will serve
this audience in a few ways: integrated access to distributed primary and
secondary resources about people and organizations; access to historical and
biographical descriptions; and access to the social networks of people of
historical interest.
Humanities cyberinfrastructure should be a step forward for research
infrastructure; otherwise, there is no reason to build it. It should offer new
solutions to old problems or enable humanists to ask new questions.
Therefore, to build such systems there is a need to utilize methods and
technologies that go beyond the traditional tools of the humanities.
Traditional humanities infrastructure includes intellectual categories such
as literary genres and linguistic phenomena, material artifacts like books
and maps, buildings such as libraries and book stores, organizations
including universities and journals, business models such as subscriptions
and memberships, and social practices such as publication and peer review.
SNAC pushes the boundaries of traditional infrastructure through its use of
digital data. The creator description data that drives the project is a
standards-based, internally consistent dataset with high potential for
interoperability with other systems and purposes. Data of this sort (and the
tools that can use it) provide a desirable characteristic of
cyberinfrastructure: that access to the data should be seamless across
repositories.
The radial graph visualization initially puts the name of an individual at
the center of a circle of nodes (other names) to which he or she is related
in some way. The example in Figure 3 shows “Oppenheimer, J. Robert” as
the central node with lines extending out to the names of all the other
people and organizations he had been associated with in his lifetime. In
addition to the lines drawn between the central node in the network and all
the other surrounding nodes, relationships between the surrounding nodes in
the circle are also indicated with a line. The visualization enables further
exploration by responding to mouse clicks on any node in the graph by
redrawing the visualization to include the immediate social network of the
selected node. For an amusing example, from the graph in Figure 3, clicking
on the node for “Eliot, Thomas Stearns” opens his social network and
reveals that Oppenheimer, the father of the atomic bomb, was only two
degrees of separation from Groucho Marx.
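The “degrees of separation” reading of such a graph can be made concrete with a breadth-first search over an undirected network of names. The following is a minimal sketch in Python, not SNAC’s implementation: the function names are hypothetical, the edge list contains only the two relations described above, and the name form “Marx, Groucho” is an assumption made for illustration.

```python
from collections import defaultdict, deque


def build_network(edges):
    """Build an undirected adjacency map from (name, name) pairs."""
    network = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return network


def ego_network(network, name):
    """Names directly related to `name`: the ring of nodes drawn around it."""
    return sorted(network.get(name, set()))


def degrees_of_separation(network, start, goal):
    """Length of the shortest chain of relations, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, distance = queue.popleft()
        for neighbor in network.get(current, set()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Illustrative edges only: the two relations mentioned in the text.
EDGES = [
    ("Oppenheimer, J. Robert", "Eliot, Thomas Stearns"),
    ("Eliot, Thomas Stearns", "Marx, Groucho"),
]
NETWORK = build_network(EDGES)
```

Clicking a node in the visualization corresponds roughly to calling `ego_network` on that name, while `degrees_of_separation(NETWORK, "Oppenheimer, J. Robert", "Marx, Groucho")` counts the links in the shortest chain between the two names.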
While the current phase of SNAC development does not feature the radial
graph, the feature is archived at another location (http://socialarchive.iath.virginia.edu/xtf/view?mode=RGraph&docId=oppenheimer-j-robert-1904-1967-cr.xml).
It stands as an early experiment in what is possible with SNAC's data,
providing a hint of what is to come in future features such as a graph
visualization of social-document networks, visualizations of chronologies,
and geographic displays of collection locations.
To build upon existing infrastructure, an understanding of its layered history and the relationships between its various components may be beneficial. For example, when trying to introduce a new public utility or a new mode of transportation to a community, thorough knowledge of the existing layers of infrastructure may reward planners with insights such as knowing the gaps in existing service that a new one may fill, discovering ways to save development effort for a new project by integrating with current infrastructure, and understanding usage patterns that may help predict the best places to initially roll out a new service. Look at the streetscape of a typical city and you’ll see the layers, old and new. Sewers, sidewalks, bike lanes, buses, and telephone poles line the streets. The poles are slung with wires that carry electricity, telephone, cable TV, and the information superhighway. The infrastructure that surrounds us is the result of an iterative process of change and evolution over time. Humanities cyberinfrastructure will be no different.
SNAC is not possible without the many layers of humanities infrastructure
that have come before it. SNAC is built on top of a long-standing
infrastructure stack that has served the scholarly community in a reciprocal
relationship, where it has been shaped and honed by scholarly use and in
turn has also shaped how scholars do their research. Any attempt to build
cyberinfrastructure ought to include a process of looking back while moving
forward. Which facets of prior infrastructure continue to be important
today? How can new technology be used to augment and extend those facets to
create a new layer of the infrastructure stack? Of course, some may ask, is
it even worth building? Certainly there are some who will have doubts about
the value of cyberinfrastructure and this points to the importance of the
Z-axis of the evaluation criteria. Cyberinfrastructure will not emerge
unless it is designed in a way that balances traditional research methods
along with the pursuit of advanced technology. One way to design technology
that respects the traditions of the past is to look back at the
infrastructure stack upon which the new technology is being built and not
neglect the scholarly activity it supports. If one can build a technology
that eliminates artificial or unnecessary restrictions on scholarly
activity, freeing scholars to do what they really want to do — read, write,
analyze, produce knowledge, and distribute it — then that technology will be
successful.
Arguably one of the earliest forms of infrastructure supporting information
for and about humans is the practice of naming people and their
organizations. The abstract appellation is a shorthand way of referencing a
whole person in narratives, news, rumors, documents, and data systems like
SNAC. Look back at Figures 3 and 4 in this article and notice the primacy of
names in SNAC's radial graph. Figure 5 shows a portion of the SNAC
prototype's home page where a searchable list of names provides access to
data contained in SNAC. The history of naming is long, but is perhaps most
significantly related to SNAC in the way personal names reflect the growing
complexity of society. For example, the legal and judicial systems of
expanding towns in medieval Europe needed to identify individuals clearly in
order to attach them to property, to tax and recruit them, and to prosecute
them, eventually leading to compulsory naming in the modern era for these
reasons.
Archives are institutions that collect the unique records of corporations and
the papers of individuals and families, the unselfconscious by-products of
corporate bodies conducting business and people living their lives. Any record of human
experience can be a data source to a humanities scholar.
A finding aid is a printed description of all the records left in an archive
with a common creator or source. A finding aid contains a description of the
creator, functions performed by the creator, and the records generated by
the creator through the performance of those functions.
The Encoded Archival Description (EAD) Document Type Definition (DTD) is a
standard for encoding archival finding aids using Extensible Markup Language
(XML).
Before reaching the top of this mostly archival infrastructure stack, a side
step to library infrastructure is necessary. An authority record is a tool used by librarians to
establish forms of names (for persons, places, meetings, and
organizations), titles, and subjects used on bibliographic
records. An authority record contains the authorized form of a name
(such as “Twain, Mark, 1835-1910”) and the list of
cross-references that lead to the authorized name (such as “Snodgrass,
Quintus Curtius, 1835-1910” and “Clemens, Samuel Langhorne,
1835-1910”). For reasons that will be described shortly, SNAC relies
on authority files to flesh out the biographical descriptions of people and
organizations in its data.
Encoded Archival Context – Corporate Bodies, Persons, and Families (EAC-CPF)
is the top of the infrastructure stack that supports SNAC. The EAC-CPF Schema is
a standard for encoding contextual information about persons, corporate
bodies, and families related to archival materials using XML.
As I mentioned before, one of SNAC’s primary objectives is the creation of a
large (and growing) collection of EAC-CPF records that power its prototype
access system. As its original National Endowment for the Humanities (NEH)
proposal describes, SNAC takes three steps to generate creator descriptions
in EAC-CPF.
In the second step, duplicate EAC-CPF records that describe the same person,
family, or corporate body are automatically matched and merged into a single
record. This process presents a challenge for SNAC. It’s not unusual for
people and corporate bodies to use different names and leave records behind
under those various names. Additionally, the EAD records, while structured,
often contain text data created manually by humans. Consequently, names may
contain misspellings or the same name may be entered in a variety of ways,
such as “Twain, Mark, 1835-1910,” “Twain, Mark,” “Mark Twain,” or “M. Twain.”
For more information on this step and
further technical details about the entire SNAC project, see Larson and
Janakiraman.
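To give a rough sense of what matching name variants involves, the sketch below normalizes each name string to a (surname, forename) key and groups strings whose keys are compatible. It is an illustrative heuristic written for this example, not the method SNAC uses; the function names and the rule that a single initial matches any forename beginning with that letter are assumptions.

```python
import re

# Trailing life dates such as ", 1835-1910" or ", 1835-".
DATES = re.compile(r",?\s*\d{4}-(?:\d{4})?\s*$")


def _clean(text):
    """Lowercase a name part and drop punctuation."""
    return re.sub(r"[^\w\s]", "", text).strip().lower()


def name_key(raw):
    """Reduce a name string to a lowercase (surname, forename) pair."""
    text = DATES.sub("", raw.strip())
    if "," in text:
        # Inverted form: "Surname, Forename"
        surname, _, forename = text.partition(",")
    else:
        # Direct form: "Forename Surname"
        forename, _, surname = text.rpartition(" ")
    return _clean(surname), _clean(forename)


def compatible(a, b):
    """True if two name strings plausibly refer to the same entity."""
    (sur_a, fore_a), (sur_b, fore_b) = name_key(a), name_key(b)
    if sur_a != sur_b:
        return False
    short, long_ = sorted((fore_a, fore_b), key=len)
    # Assumed rule: an initial matches a forename starting with that letter.
    return short == long_ or (len(short) == 1 and long_.startswith(short))


def merge_names(names):
    """Greedily group name strings that are compatible with a group's first member."""
    groups = []
    for name in names:
        for group in groups:
            if compatible(group[0], name):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Applied to the four Twain variants above, `merge_names` places them in a single group. A rule this simple would also merge a hypothetical “Twain, Mary” with “M. Twain,” which suggests why automated matching at this scale is a challenge.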
The third step involves matching the set of EAC-CPF records against authority records from the Library of Congress Name Authority File (LCNAF) and the Getty Vocabulary Program Union List of Artists’ Names (ULAN). Doing so will enable the project to set the authoritative form of a name as the primary name in the EAC-CPF record and to incorporate a list of alternative forms. Also, biographical/historical data found in the ULAN files will be added to the data in the EAC-CPF record to form a more complete description of the person, family, or corporate body.
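The idea of this step can be sketched as a lookup from any known form of a name to its authority record. The Python below is a simplified illustration under stated assumptions: the class and function names are hypothetical, matching is by exact (case-insensitive) string rather than whatever matching SNAC actually performs, and the single authority entry reuses the Mark Twain forms cited earlier in this article.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    authorized: str         # the authorized form of the name
    cross_references: list  # other forms that lead to the authorized form


@dataclass
class CreatorRecord:
    primary_name: str
    alternative_names: list = field(default_factory=list)


def build_index(authority_file):
    """Map every known form of a name (lowercased) to its authority record."""
    index = {}
    for record in authority_file:
        for form in [record.authorized, *record.cross_references]:
            index[form.lower()] = record
    return index


def apply_authority(creator, index):
    """Use the authorized name when a match is found; otherwise return
    the creator record unchanged."""
    match = index.get(creator.primary_name.lower())
    if match is None:
        return creator
    alternatives = list(match.cross_references)
    if (creator.primary_name != match.authorized
            and creator.primary_name not in alternatives):
        alternatives.append(creator.primary_name)
    return CreatorRecord(match.authorized, alternatives)


# Illustrative authority file with one entry taken from the text.
AUTHORITY_FILE = [
    AuthorityRecord(
        authorized="Twain, Mark, 1835-1910",
        cross_references=[
            "Snodgrass, Quintus Curtius, 1835-1910",
            "Clemens, Samuel Langhorne, 1835-1910",
        ],
    )
]
```

With this index, a creator record entered under “Clemens, Samuel Langhorne, 1835-1910” would come back with “Twain, Mark, 1835-1910” as its primary name and the cross-references as alternative forms, while a name with no authority match is left as it was.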
As of August 2014, SNAC has derived more than 2.6 million EAC-CPF records by
extracting data from 2.2 million WorldCat archival descriptions and nearly
300,000 British Library authority records. Future phases of development
expect to derive data from nearly 190,000 more descriptions of historical
collections from other government and academic archives and libraries.
This case study has discussed the development of cyberinfrastructure at the level of building tools to enhance research. I used SNAC as an example because it allowed me to illustrate several variables that ought to be considered when designing a research tool for humanities cyberinfrastructure: size and scope of the project, members of the project’s team, and choice of technologies. SNAC aims to be a project of great size focused on a narrow mission — creating a network of archival data. To meet the specific research needs of its anticipated community of users SNAC will leverage the power of its data to provide access to research materials through integration with existing infrastructure. This goal could not be achieved without a diverse group of collaborators, from technology specialists to humanists, each representing the various domains of expertise that contribute to the project. This variety of influence on SNAC’s design contributes to decision-making that weighs the merits of traditional infrastructure and advanced cyberinfrastructure. SNAC adopts and develops technologies that best meet the needs of humanities researchers while pushing the limits of what is possible. One key way designers of SNAC ensure that the needs of current scholars are met is to heed the heritage of the project’s infrastructure stack. The creators of SNAC have understood the infrastructure of the past and present, identified what has served users well and which aspects need improvement, and thought creatively about how to augment existing services and solve current problems. Further examples of new cyberinfrastructure are likely to evolve naturally from the layers of existing infrastructure in this way as opportunities to build new layers are realized in the future, just as the infrastructure layers of the past were once new solutions to old research challenges. Research infrastructure both reflects and inspires humanities thinking and methods. 
As old obstacles to research crumble through the development of new infrastructure, new methods evolve and generate new challenges to bedevil researchers. The developers of infrastructure respond with technology and ingenuity, creating new services to meet these challenges. The cycle continues to this day, where digital technologies are inspiring new methods of research and new tools are being designed to facilitate, enhance, and accelerate those methods while eventually setting the stage for the next evolution in humanities thinking.
While I have discussed mainly tool building, cyberinfrastructure is much more
than that. Cyberinfrastructure is the digital manifestation of a research
culture. However, even while designing cyberinfrastructure at this grand
scale, some of the variables that contribute to the success of a single tool
may also contribute to successful shaping of a digital culture. SNAC can
serve as an illustrative example in this regard, too. To better serve the
entire research community SNAC intends to grow from the size of a research
project to a self-sustaining national program. The Institute of Museum and
Library Services (IMLS) has funded a project named Building a National
Archival Authorities Infrastructure to help meet this goal. A proper mix of
collaborators that represent the breadth of the project’s stakeholders is
essential to bring this plan to fruition. The IMLS funding will support a
meeting of leaders in the archive, library, museum, scholarly, and funding
communities to determine the requirements of building a sustainable National
Archival Authorities Cooperative (NAAC). A standard for creator
description will facilitate building international, national,
regional, and institutional biographical and historical databases
that can serve as resources.
As progress continues with SNAC, NAAC, and other related research efforts, it
appears that tools like those discussed in this case study will continue to
grow and serve their intended community of scholars. In time, perhaps new
projects will emerge to fill gaps in other disciplines. As this happens, we
will get closer and closer to the goal set forth by the ACLS.