S. L. Ziegler is Head of Digital Programs and Services at LSU Libraries in Baton Rouge, Louisiana, where they manage the digitization initiatives, digital preservation operations, and the digital library infrastructure. They build partnerships throughout the state of Louisiana to ensure long-term access to at-risk digital cultural heritage material. Before joining LSU Libraries, Ziegler was the Head of Digital Scholarship and Technology at the American Philosophical Society in Philadelphia, Pennsylvania, where they were the founding head of the Center for Digital Scholarship and oversaw all aspects of digital production and outreach.
Treating collections in cultural institutions as data encourages novel approaches to the use of historic collections. To reframe collections as data is to focus on how digitized collection material, collection metadata, and transcriptions can be used and reused for various types of computational analysis. Scholars active in the field of digital humanities have long taken advantage of computational data. This paper focuses on the work of cultural heritage institutions, which are increasingly offering collections as data. This paper outlines the collections as data project and examines specific examples of cultural institutions active in this space. The paper then details the practices of data brokers, and explores how the data broker model can frame the use of data in cultural heritage institutions. In closing, a number of experiments are described that might help mitigate the harm that data in cultural institutions might cause. As we create and share data, can we be sure we are better than data brokers?
Analyzes the interaction between data brokers and cultural heritage institutions' data collections to explore how to mitigate the harm potentially caused by data in cultural institutions
Recently, increased attention has been paid to data in cultural
institutions. The Always Already Computational: Collections as Data (AAC)
project set out to “lead to the creation of a framework to support library
collections as data, the identification of methods for making
computationally-amenable library collections more discoverable, use cases
and user stories for such collections, and guidance for future technical
development.”
The AAC project has done a great deal of important work in bringing together a
wide variety of practitioners and examples and for this reason situating an
exploration of data in cultural heritage institutions within the framework of
the collections as data conversation is beneficial. As a catalyst within a wider
world of data-oriented endeavors in cultural institutions, the AAC project has
opened new avenues of investigation and has amplified the need for collaboration
among institutions and practitioners. It is becoming increasingly common to see
issues related to data in special collections libraries appearing in syllabi,
library strategic goals, and position papers. “The growing interest in collections as data,”
writes Chela Scott Weber in a recent OCLC Research Position Paper, “means we must collaborate with colleagues in scholarly
communications, data services, and elsewhere across the library to grapple
with what computational access to our collections might look like.”
Collections as data explores an expansive definition of data. “To see collections
as data begins with reframing all digital objects as data,” Thomas Padilla
writes. “Data are defined as ordered information, stored
digitally, that are amenable to computation. Wax cylinders, reel to reel
tape, vellum manuscripts, websites, masterworks, musical scores, social
media, code and software in digital collections are brought onto the same
field of consideration.”
“Data as well as the data
that describe those data,”
explains the Santa Barbara Statement,
“are considered in scope. For example, images and the
metadata, finding aids, and/or catalogs that describe them are equally in
scope. Data resulting from the analysis of those data are also in
scope.”
In many ways treating collections as data eases some barriers to sharing data.
However, collections as data is not the same as open data. Open data has few, if
any, restrictions on use and reuse. “Accessibility and reusability,”
write Koster and
Woutersen-Windhouwer, “do not require collections and
objects to be freely available, modifiable and shareable with free
tools,
as the open definition requires. Some
metadata or objects will be copyright protected, have privacy issues or
local law issues.”
In November 2019 the AAC grant project came to an end and was succeeded by a new
phase, Collections as Data: Part to Whole (“Part to Whole” n.d.). Collections as
Data: Part to Whole “fosters the development of broadly
viable models that support implementation and use of collections as
data” by funding project teams that will “develop
models that support collections as data implementation and holistic
reconceptualization of services and roles that support scholarly use”
(“Part to Whole” n.d.). Even beyond the AAC project, the number and scale of
cultural heritage collections available as data continue to increase. In 2019,
the Library of Congress, with funding from the Andrew W. Mellon Foundation,
launched the Computing Cultural Heritage in the Cloud (CCHC) project “to pilot ways to combine cutting edge technology and the
collections of the largest library in the world, to support digital research
at scale.”
“collections
as data in a machine-readable format, widening the scope for digital
research and analysis.”
The AAC team has compiled a number of Facets, or case studies, that draw attention
to the many ways cultural institutions are creating and using data. “A facet documents a collections as data implementation. An
implementation consists of the people, services, practices,
technologies, and infrastructure that aim to encourage computational use
of cultural heritage collections.”
see “allowing people to creatively re-imagine
and re-engineer our collection in the digital space”
“data on approximately 28,269 objects across all
departments of the museum[:] fine arts, decorative arts, photography,
contemporary art, and the Heinz Architectural Center”
“The case to provide the public increased access to museum data
was not a difficult one at the Carnegie Museum of Art,”
explain the
authors of the data, “the museum considers engagement and
education to be a core part of its mission, and firmly believes in Open
Access as essential to museum practice.”
In addition to sharing metadata, many libraries and museums are allowing full
access to digital facsimiles. The Getty, for example, “makes
available, without charge, all available digital images to which the Getty
holds the rights or that are in the public domain to be used for any
purpose. No permission is required.”
“contains complete sets of high-resolution
archival images of manuscripts … along with machine-readable TEI P5
descriptions and technical metadata. All materials on this site are in the
public domain or released under Creative Commons licenses as Free Cultural
Works.”
Beyond metadata and digital facsimiles, collections reformatted as structured data is also a growing trend in cultural institutions. This often involves transcribing text and applying some type of structure. Haverford College Libraries reformatted collections into TEI-encoded text for the projects
A fuller exploration of a particular open data project will help highlight both
the process and possible pitfalls of collections as data work in cultural
heritage institutions. The American Philosophical Society Library (APS)
digitized, reformatted as data, and opened historic prison records. The APS
holds three admission books created by the Eastern State Penitentiary. “cellular isolation”
in which each
inmate spends all of his or her time in a cell by themselves. For more
information, see the APS finding aid: https://search.amphilsoc.org/collections/view?docId=ead/Mss.365.P381p-ead.xml,
and the website of the Eastern State Historic Site: https://www.easternstate.org.
In 2015, the team
Taking the process one step further, in 2016 and 2017 the team transcribed the
content into spreadsheets.
However, the process is not always obvious, and choices do matter. For example,
the handwriting can often be difficult to read, and conventions about how to
depict abbreviations, spelling errors and indecipherable words had to be
established. “Exactly as they appear”: Another Look at the Notes of a 1766
Treason Trial in Poughkeepsie, New York, with Some Musings on the
Documentary Foundations of Early American History.
From the point of view of the APS Library data initiatives team, the CSV files
themselves are the finished product. The Open Data Initiative of the APS Library
aims to increase access to computational data by identifying material in the
collection that is conducive to being reconfigured as
datasets.
As the examples in this section highlight, collections as data in cultural institutions can take many forms. Metadata enables the re-imagining of art collections, as in the case with the Carnegie Museum of Art. Digitized objects allow use and reuse for both scholarly and imaginative purposes. Collection content can be transcribed as structured open data allowing for analysis, visualizations, access through innovative interfaces, and new types of research. However, there is always interpretive work that needs to be done when creating datasets from cultural heritage material.
In June 2016, HBO’s Last Week Tonight With John Oliver
purchased nearly 15 million dollars of debt for 60 thousand dollars. The
purchase was part of an exposé on the debt buying business, what John Oliver
called “a grimy business.”
A large part of the
griminess of the business is the ability to buy and sell personal information
about individuals. Names, addresses, social security numbers, and other
information are passed from one buyer to the next, often emailed in spreadsheets.
The buyers hope to pressure the named individuals into paying the debts,
creating a profit for the debt buyer. While the
In March 2014, CBS’
“Every piece of data about us now seems to be worth something to somebody,” said Tim Sparapani during the show.
“And lots more people are giving up information about people they do business with, from state Departments of Motor Vehicles, to pizza parlors.”
“The dossiers are about individuals,” one interviewee continued.
“That's the whole point of these dossiers. It is information that is individually identified to an individual or linked to an individual.”
In 2014, the FTC released a report on data brokers that highlighted how little we
know about the data being collected, and how much data there is about us.
Because most of the data gathered about us do not come directly from us, most
people do not fully grasp the amount of information data brokers collect and
sell. Some of the information data brokers collect, like
“bankruptcy information, voting history, consumer purchase data, web browsing
activities and warranty registrations,”
is gathered from other
sources. “Potentially sensitive categories include those that primarily focus on
ethnicity and income-levels, a consumer’s age, or health-related conditions
like
‘Expectant Parent,’
‘Diabetes Interest,’
and ‘Cholesterol Focus.’”
The practice of data brokerage is secretive, and there is often no way to appeal
incorrect information. The profiles these companies assign to us are often
incorrect. “In the world of data brokers, you have no idea
who all has bought, acquired or harvested information about you, what they
do with it, who they provide it to, whether it is right or wrong or how much
money is being made on your digital identity,”
writes Kalev Leetaru
of his efforts to determine who is making money from his information. “Nor do you have the right to demand that they delete their
profile on you.”
“nearly 50 percent of the data in the report about me was
incorrect”
In 2017, Equifax announced a data breach that allowed the personal information of
143 million people to be stolen. In 2018, Facebook announced that the data
analysis company Cambridge Analytica used personal data in ways that easily
match John Oliver’s definition of grimy. The use of personal data by companies
large and small to profit is an important backdrop against which to evaluate
open data in cultural institutions. Equifax and Cambridge Analytica are not,
technically, data brokers. The former is a credit monitoring company and the
latter, before declaring bankruptcy
Data brokers, along with credit monitoring companies and data analytic companies,
benefit from the information of other people. This information often harms
individuals through the categories they create. Examples such as “single mom struggling in an urban setting”
or “people
who did not speak English and felt more comfortable speaking in
Spanish”
or “gamblers”
Data brokers offer an important example of one way of interacting with data. This is an example against which we should compare ourselves when we release data in cultural institutions and use data for digital humanities. We should always ask ourselves, are we better than data brokers?
Data brokers profit from other people’s information. Those described in their datasets often have no way of knowing how they are being represented, and have no way of questioning or correcting this representation. As data becomes more prevalent in cultural institutions, and many of us, through publishing papers and presenting on our work, benefit from data about other people, now is a good time to evaluate ourselves in relation to data brokers. This section explores examples of harm done by institutions as they represent individuals and groups.
Identifying specific cases of harm can be difficult. For this reason, it is common to focus on groups that are historically marginalized, and who have reason to be suspicious of their representation in mainstream culture. Anyone who has ever had a reason to fear categorization by a dominant culture can more easily understand the power of data. Many groups have reason to be suspicious. For the purposes of this paper, however, examples will focus on African American representations. This decision is meant to both draw attention to the unique position of the African American community as a marginalized group and to honor the important work of generations of scholars who struggle to educate the dominant culture about these and related dangers.
Are people harmed by the data that we have and share? It is not standard practice
for cultural institutions to share social security numbers, credit card numbers,
or other sensitive personal information in either physical records, digitized
facsimiles, or datasets.
Writing about the role of the media in enforcing negative representations of African Americans, and thus the media’s culpability in historic lynchings, Sherrilyn Ifill writes,
“ordinariness” of the black people's lives … undermined the ability of whites to see their black neighbors, servants, and laborers as human beings
Whites could at best ignore the conditions in which most blacks lived and at worst develop a sense that blacks did not lead normal lives in which education, work and family were paramount and central. Instead, blacks could be seen as ‘other,’ ‘different,’ not possessed of the same humanity as whites ... The complicity of ordinary whites, who stood and watched a lynching without interfering, was made possible by the dehumanizing choices the media made in their coverage of blacks.
“The lingering remnants of these dehumanizing portrayals of blacks in the media,” writes Ifill, bringing the issue to the present,
“have modern currency,” including over-incarceration of African American males, hyper policing of black communities, and police brutality.
The representation of one group by another group can range from obvious fiction
to pretense of objective truth. “All groups tell
stories,”
writes David Pilgrim, founding curator of the Jim Crow
Museum, “but some groups have the power to impose their
stories on others, to label others, stigmatize others, paint others as
undesirables, and to have these social labels presented as scientific fact,
God’s will or wholesome entertainment.”
“When we watch movies or read novels,”
continues Pilgrim, “we know that they are stories; we identify the characters,
follow the plot and anticipate the conclusion. But there are other stories
that are not so easily identified - sometimes they masquerade as objective,
race-neutral truth.”
In what way are cultural institutions doing the same? The collecting practices of
cultural institutions have long been marred by the racial bias of the archivists
and curators who build collections. The decisions made about what is collected
are colored by the opinions of those doing the collecting and this has tipped
the scales on how African Americans are represented in archives. An overt
example from the author's own institution is a case in point. In 1945, the LSU
Department of Archives and Manuscripts (a precursor to the current Special
Collections) was offered the opportunity to acquire the collection of African
American bibliophile and book collector, Henry P. Slaughter. The quality of the
collection was endorsed by archivist Herbert Kellar and book dealer Forest Sweet,
who wrote, “it is [in] no sense a collection to be filed away -
it is rather a collection to be worn out with legitimate use for what it can
offer as a basis of study of the negro problem.”
Despite this endorsement, the University eventually passed on the acquisition on
the advice of Archivist and History Professor, Edwin Davis: “the collection has been selected with the plan to emphasize
the Negro’s point of view of the race problem. If this is true it is my
opinion that the collection will have a considerable amount, perhaps an
appreciable percentage, of what might be termed weak material. I have
however, no evidence for this opinion. It might prove to be a very valuable
collection.”
Decisions about what material cultural institutions collect have long term
repercussions that are felt for generations. In March 2019, Tamara Lanier filed
a lawsuit against Harvard University over ownership of daguerreotype
photographs of an enslaved ancestor named Renty and his daughter, Delia. The
photographs were commissioned by Professor Louis Agassiz and used as
supposed evidence of the inferiority of the African American race. The photographs were
rediscovered in 1976, hidden away in the attic of a campus museum. Since then the
University has loaned the photographs to other museums but has also limited the use
of the images by researchers due to their sensitive
nature.
What is collected matters, and so does its description. The role of the library
catalog in reinforcing dominant points-of-view has been explored many
times. “ways
that sections of library classifications were constructed based on ideas
about African Americans”
The business of gathering, combining, and selling data is not new. However, the
scope of surveillance and reach of the systems created with the data is
unprecedented due to new tools and methods. In the same way, library catalogs
have long represented groups of people in problematic ways. What is different are
the new tools and methods we are using to promote the use and reuse of these
descriptions and collections. The collections as data framework in cultural
institutions carries with it the possibility for our descriptions of people to
be shared, combined with other data, and used to negatively affect groups. “The
ability to exert control over group and personal identity
and memory,”
writes Noble, “must become a matter
of concern for archivists, librarians, and information workers.”
This section posits three possible directions that are still in the early phases
of exploration. These possibilities are listed as sincere efforts to investigate
possible steps toward a future in which my work with data feels less grimy. As
white librarians, we work in a field that has long struggled to be inclusive to
historically marginalized communities.
The grimy business of data brokers is legal.
Arguing for a move away from thin legalistic frameworks, Michelle Caswell and
Marika Cifor explore the role of feminist ethics in reconceptualizing the role
of archives and the people represented in our collections. This approach is
applicable for all cultural institutions, collections as data work, and digital
humanities projects that use this data. “In a feminist
ethics approach,”
they write, “archivists are
seen as caregivers, bound to records creators, subjects, users, and
communities through a web of mutual affective responsibility.”
“counted,
classified, studied, enslaved, traded as property, and/or murdered”
[A] feminist approach
guides the archivist to an affective responsibility to empathize with
the subjects of the records and, in so doing, to consider their
perspectives in making archival decisions. This is in contrast to the
dominant Western mode of archival practice, in which archivists solely
consider the legal rights of records creators … In the feminist
approach, the archivist cares about and for and with subjects; she
empathizes with them.
If we would work differently when those represented in our work can see it, how could we ensure that this happens? The particular individuals described in datasets by cultural institutions are likely to be deceased, as is the case for the historic prison records described above. However, we can work with members of affected communities.
Safiya Noble, writing about the shortcomings of Google and other tech companies,
calls for the combination of technology and critical studies. “We need people designing technologies for society to have
training and an education on the histories of marginalized people, at a
minimum,”
Noble writes, “and we need them working
alongside people with rigorous training and preparation from the social
sciences and humanities.”
Many cultural institutions are unlikely to be able to create new professional positions specifically for individuals trained in the histories of marginalized communities. However, it is critical that we find ways to pay the scholars whose help we seek. In academic libraries and archives this might take the form of graduate assistantships. Digital humanists will likely be similarly restricted. However, including these roles in project plans and grant applications is one way to normalize the process of asking for help. Money is not the only incentive; this could be a valuable means of exposing scholars to possible careers in cultural institutions and digital humanities. However, monetary compensation is an important part of letting scholars know we take their expertise seriously.
As we have seen, it can be very difficult to know what information data brokers
have about us, and to correct what is incorrect.
We can also include context in downloadable data. For example, the CSV files that
constitute the core of the historic prison data project, described above, have
the transcriptions of the admission books. No metadata is included in the
download, and no context. Instead of simple files containing structured
transcriptions, we could use data packets to bundle contextualizing information
together with the transcriptions.
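One way to picture such a bundle is sketched below in Python. It is only an illustration under stated assumptions: the descriptor layout loosely follows the Frictionless Data "Data Package" convention, which may or may not be what an institution would choose, and every file name, column name, and value shown is a hypothetical placeholder rather than anything taken from the APS prison records.

```python
import csv
import io
import json
import zipfile

# All names and values below are hypothetical placeholders, not the actual
# APS file names, column headings, or records.
FIELDS = [
    {"name": "entry_id", "type": "string",
     "description": "Identifier assigned by the transcription team."},
    {"name": "remarks", "type": "string",
     "description": "Clerk's remarks as written; [illegible] marks unreadable words."},
]
ROWS = [{"entry_id": "A-0001", "remarks": "[illegible] on admission"}]
CONTEXT = {
    "name": "admission-books-example",
    "title": "Example admission book transcription",
    "description": "Who created the original records, why, and how they were transcribed.",
    "sources": [{"title": "Finding aid for the original admission books"}],
}
NOTES = "Transcription conventions and known limitations go here.\n"


def build_descriptor(csv_name, fields, context):
    """Describe one CSV file and its context as a plain dictionary."""
    descriptor = dict(context)
    descriptor["resources"] = [{
        "name": "transcription",
        "path": csv_name,
        "format": "csv",
        "schema": {"fields": fields},
    }]
    return descriptor


def bundle(csv_name, rows, fields, context, notes):
    """Return the bytes of a zip holding the CSV, its descriptor, and a README."""
    table = io.StringIO()
    writer = csv.DictWriter(table, fieldnames=[f["name"] for f in fields])
    writer.writeheader()
    writer.writerows(rows)

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(csv_name, table.getvalue())
        zf.writestr("datapackage.json",
                    json.dumps(build_descriptor(csv_name, fields, context), indent=2))
        zf.writestr("README.txt", notes)
    return archive.getvalue()


if __name__ == "__main__":
    with open("admission_books_example.zip", "wb") as out:
        out.write(bundle("admissions.csv", ROWS, FIELDS, CONTEXT, NOTES))
```

The point of the design is that the transcription never travels alone: anyone who downloads the archive also receives a description of each column and a plain-language account of how the data were made.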
Part of the context we could add would be explanations about decisions that we
make while creating data. A danger of institutionally-generated collections as
data is the perception of objectivity. Devon Mordell, writing about this danger,
proposes we frame collections data work in cultural heritage institutions within
a new paradigm, a collections-as-data paradigm, that considers both the
conceptual and practical concerns related to the use of data in archives, to “ensure that a
social justice critique is maintained within”
the emerging work
related to collections as data. “[A]ctive
participation and critical discourse”
around the tools and practices
are needed to ensure that new technologies do not reinscribe a false sense of neutrality.
These four directions are presented as a means to begin conversations about practical implementation toward the goal of ensuring that the work we do is better than data brokering. There is still much to learn, and the implementation of any of these proposed solutions will certainly open new possibilities, of both success and failure. Many of the issues we hope to address have taken decades, and sometimes centuries, to create. There will be no quick solution.
The examination of data brokers is meant to offer a warning to those of us working in cultural heritage institutions. However, it is not just one example taken at random. Rather it is a generative lens to view our work because we are always already existing as data in the world of data brokers. The power that data brokers have over us to collect, analyze, describe, and sell our data should bring into sharp focus the power that we have over those with whose data we work.
Thinking within the collections as data framework creates an ideal moment for
reflection on the creation and use of data in cultural institutions and digital
humanities. This reconceptualization enables unprecedented access to and
interaction with collection material in libraries, archives, and museums,
“including but not limited to text mining, data
visualization, mapping, image analysis, audio analysis, and network
analysis.”
This is also a moment to consider how we do not want to interact with data. Having a ready example against which to judge our behavior is useful, and data brokers provide a perfect use case. Companies that buy, combine, and resell personal data to the detriment of individuals and groups can be the example we need. Defining ourselves in contrast to data brokers also grants us the opportunity to reflect on the historically problematic aspects of cultural institutions’ descriptive practices.