By the People, For the People: Assessing the Value of Crowdsourced, User-Generated Metadata
Christina Manzo is currently a reference librarian at the Boston Public Library, but her past employers include the Harvard Law Library, the Crowdsourcing Consortium for Libraries and Archives, and Digital Commonwealth. Her background is in user-interface design, and she looks forward to bringing libraries into the future of technology.
Geoff Kaufman is a postdoctoral researcher at Tiltfactor, the game design and research laboratory at Dartmouth College. He holds a Ph.D. and M.A. in social psychology from Ohio State University, and a B.A. in psychology from Carnegie Mellon University. His primary research focuses on how experience-taking – the mental simulation of characters’ experiences in fictional narratives, virtual worlds, or games – can change individuals’ self-concepts, attitudes, behaviors, and emotions. He has investigated how such transformative experiences can build interpersonal understanding and empathy, reduce stereotypes and prejudice, and inspire higher levels of social consciousness.
Sukdith "Sukie" Punjasthitkul is a project manager and designer at Tiltfactor, the game design and research laboratory at Dartmouth College. Sukie has a diverse background that includes video, audio, and web production, QA, system administration, and project management. With Tiltfactor, he has been involved with a number of projects, including POX: SAVE THE PEOPLE, Buffalo, Awkward Moment, ZombiePOX, and Metadata Games.
Mary Flanagan is a creative pioneer and the founder of the Tiltfactor game lab at Dartmouth College. Her fifth book, Values at Play in Digital Games, was co-authored with philosopher Helen Nissenbaum and published by MIT Press in 2014.
With the growing volume of user-generated classification systems arising from media tagging-based platforms (such as Flickr and Tumblr) and the advent of new crowdsourcing platforms for cultural heritage collections, determining the value and usability of crowdsourced, folksonomic, or user-generated, freely chosen keywords has become a pressing concern. This article presents a study investigating the value and accuracy of folksonomies for cultural heritage institutions.
Classification is a basic, integral, and historically significant human function, defined by Golder and Huberman as an act through which meaning emerges.
With the rise of such classification schemes as Dewey, Cutter, and Library of Congress Classification, local classification was mostly retired in the late 19th century in favor of a unified system that would allow for understanding across all libraries. This remained the case, of course, until the rise of Internet tagging-based platforms such as Flickr, Twitter, Tumblr, and Delicious presented a challenge to these standardized, centralized systems.
Currently, information professionals are facing an unprecedented amount of unstructured classification. Classification systems generated by such tagging-based platforms are referred to as folksonomies: “a type of classification system for online content, created by an individual user who tags information with freely chosen keywords; also, the cooperation of a group of people to create such a classification system.”
While folksonomies represented an increased diversity of classification, they were perceived mostly as sources of entertainment and documentation of casual colloquialisms rather than as a formal system of documentation. However, in 2006 Golder and Huberman conducted a study of folksonomic data generated by users of the website Delicious, demonstrating that user-provided tags not only formed coherent groups but also accurately described the basic elements of the items that were tagged, such as their subject matter and format.
Golder and Huberman’s analysis also revealed two highly subjective categories that may diminish the potential value of crowdsourced metadata: Self Reference (i.e., any tag including the word “my,” such as “mydocuments”) and Task Organizing (i.e., tags denoting actions, such as “readlater” or “sendtomom”). In examining the overall accuracy and reliability of tags, Golder and Huberman concluded: “Because stable patterns emerge in tag proportions, minority opinions can coexist alongside extremely popular ones without disrupting the nearly stable consensus choices made by many users.”
This research was expanded upon in 2007, when Noll and Meinel compared tags from the websites Delicious and Google against author-created metadata and found that the former provided a more accurate representation of items’ overall content. Bischoff et al. [2009] also examined folksonomic metadata within the context of the music industry and found that tags submitted by users of the website Last.fm were comparable with the professional metadata produced at AllMusic.com. Syn and Spring [2009] examined folksonomic classifications within the domain of academic publishing, and also found authoritative metadata to be lacking when compared with its user-generated counterparts. As stated by Noll and Meinel, “tags provide additional information which is not directly contained within a document. We therefore argue that integrating tagging information can help to improve or augment document classification and retrieval techniques and is worth further research.”
While some may have reservations regarding the use of folksonomic data, folksonomies “have come about in part in reaction to the perception that many classificatory structures represent an outdated worldview, in part from the belief that since there is no single purpose, goal or activity that unifies the universe of knowledge” a universal scheme is “unlikely to be very successful.” Traditional subject searches, moreover, “are becoming less frequent as patrons’ behavior is shaped by keyword search engines,” and, as has been observed, “[in today’s culture] if it cannot be found, it may as well not exist.”
Our research aimed to provide new empirical evidence supporting the value of folksonomies by: (1) directly testing the benefits of adding user-generated folksonomic metadata to a search index and (2) comparing the range and accuracy of tags produced by library and information science professionals and non-professional users. The three main questions guiding this work were whether folksonomic metadata speeds item retrieval, whether it increases the number of items successfully retrieved, and whether non-professionals produce tags of comparable quality to those of professionals.
The present research employed a hybrid form of usability testing utilizing the
Metadata Games platform [http://www.metadatagames.org], an online, free and open-source suite of
games supported by the National Endowment for the Humanities and developed by
Dartmouth College’s Tiltfactor Laboratory. The Metadata Games Project, launched in
January 2014, aims to use games to help augment digital records by collecting
metadata on image, audio, and film/video artifacts through gameplay.
This research represents a collaborative project between the first author, who chose
to use the Metadata Games platform as the focus for an independent study project at
the Simmons College School of Library and Information Science, and the co-authors from
Dartmouth College’s Tiltfactor Laboratory. To be clear, the goal of the reported
study was not to provide a validation of the Metadata Games platform, but rather to
study the value of folksonomic metadata more generally; that is, the focus of this
research was on the data itself, and the tool employed was intended to be largely
incidental and peripheral to the study’s aims. At the time, the Metadata Games
project was one of the few open-source metadata gathering tools available for
cultural heritage institutions.
According to Nielsen, the number of participants needed for a usability test to be
considered statistically relevant is five.
The study was divided into two main tasks. In the first task, participants were
presented with physical facsimiles of five images from the Leslie Jones Collection of
the Boston Public Library and instructed to retrieve these items using the Metadata
Games search platform. The images presented to participants were divided into the
following categories: Places, People (Recognizable), People (Unidentified or
Unrecognizable), Events/Actions, Miscellaneous Formats (Posters, Drawings,
Manuscripts, etc.), as seen in Figures 1-5 below. Upon
being given each physical facsimile, participants were timed from the moment they
entered their first search term until the moment they clicked on the correct digital
item retrieved from the Metadata Games search platform. This practice was adapted to
reflect the digital items that would most commonly exist in a typical humanities-based collection (i.e., photographs, manuscripts, postcards, glass plate negatives, and other miscellanea). This design mirrored the common everyday occurrence of patrons attempting to retrieve a specific media item that they have in mind when using a library search index. According to a 2013 Pew Research study, 82% of people who used the library in the previous 12 months did so to look for a specific book, DVD, or other resource.
For this image search component of the study, participants were randomly assigned to
one of two search index conditions: one with access to both controlled vocabularies and folksonomic metadata (i.e., the “folksonomy” condition) and the other with access restricted to controlled vocabularies only (i.e., the “controlled vocabulary” condition) [see Figure 6 for a schematic
representation of the study design]. Searches were conducted using two different,
customized versions of the search index on the Metadata Games website [play.metadatagames.org/search]. The
folksonomic metadata was generated by previous users of the Metadata Games platform,
whereas the controlled vocabularies attached to the items were generated by Boston
Public Library staff. The process of inputting the controlled vocabularies into both
versions of the search index required some reformatting. For example, because the version of the search platform used in the study did not allow for special characters such as the dash (“-”) or the comma (“,”), terms such as “Boston Red Sox (baseball team)” were imported as two individual tags: “Boston Red Sox” and “baseball team.” Additionally, the system returned exact matches only, which meant that if a participant searched for “sailboat” and the only term present in the system was “sailboats,” the search would be unsuccessful. This aspect of the study design was necessitated by the technical specifications and functionality of the version of the Metadata Games search index used in the study, rather than by a strategic methodological choice. The frequency of preventable search failures of this kind was recorded.
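To make the indexing constraints concrete, the following minimal sketch (ours, not the actual Metadata Games implementation) models an exact-match search index and the tag flattening described above; the function names and the exact splitting rule are illustrative assumptions.

```python
import re

# Hypothetical illustration of the index behavior described above;
# not the actual Metadata Games code.

def split_term(term: str) -> list[str]:
    """Flatten a controlled vocabulary term for an index that rejects
    special characters: split on dashes, commas, and parentheses."""
    # e.g., "Boston Red Sox (baseball team)" -> ["boston red sox", "baseball team"]
    parts = re.split(r"[(),\-]+", term)
    return [p.strip().lower() for p in parts if p.strip()]

def build_index(items: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each flattened tag to the set of item ids it retrieves."""
    index: dict[str, set[str]] = {}
    for item_id, terms in items.items():
        for term in terms:
            for tag in split_term(term):
                index.setdefault(tag, set()).add(item_id)
    return index

index = build_index({"img3": ["Sailboats", "Harbors", "Glass negatives"]})

# The index returns exact matches only, so a singular query fails
# against a plural tag -- the kind of preventable failure noted above.
print(index.get("sailboats"))  # {'img3'}
print(index.get("sailboat"))   # None
```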
To further illustrate the differences between the two search index conditions, consider the case of a participant in each condition attempting to retrieve Image 3 (see Figure 3 above). In the controlled vocabulary condition, the only search terms that would yield a successful retrieval were “harbors,” “sailboats,” “marblehead harbor,” and “glass negatives.” In contrast, in the folksonomy condition, a participant would successfully retrieve this item by entering any of the following search terms: “harbor,” “sailboats,” “water,” “woman,” “sail boats,” “porch,” “scenic,” “view,” “sail,” “sun,” “summer,” “marblehead harbor,” “boats,” “dock,” “harbors,” “veranda,” “balcony,” “girl looking at boats,” “marina,” “glass negatives,” “sailboats on water,” and “yacht.”
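Continuing the hypothetical sketch above (and reusing its build_index helper), the two conditions amount to two indexes over the same item, built from the term lists just enumerated:

```python
# Continuing the sketch: the same item indexed under each condition.
# (Reuses build_index from the previous sketch.)
controlled_vocab = ["Harbors", "Sailboats", "Marblehead Harbor", "Glass negatives"]
folksonomic_tags = [
    "harbor", "water", "woman", "sail boats", "porch", "scenic", "view",
    "sail", "sun", "summer", "boats", "dock", "veranda", "balcony",
    "girl looking at boats", "marina", "sailboats on water", "yacht",
]

cv_index = build_index({"img3": controlled_vocab})
folk_index = build_index({"img3": controlled_vocab + folksonomic_tags})

# "dock" retrieves the item only when folksonomic metadata is indexed.
print("dock" in cv_index)    # False: the search fails in this condition
print("dock" in folk_index)  # True: the user-generated tag adds an access point
```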
Immediately following the image search task, participants were instructed to play three different single-player tagging games from the Metadata Games suite.
Tags were scored by the lead author using a revised version of the Voorbij and Kipp scales used by Thomas et al. Scores ranged from one, for the closest match to the professional metadata, to seven, for a tag with no usable relation to it.
A score of one was thus reserved for an exact match between a user-provided tag and the professional metadata, including punctuation. To illustrate, consider the sample image provided in Figure 10 and the corresponding professional and folksonomic metadata provided in Tables 1 and 2 below. In this example, the user-provided term “Hindenburg Airship” would not be deemed an exact match because, as indicated in Table 1, the controlled vocabulary term encloses “Airship” in parentheses. Scores three through five were based on judgments made by the Library of Congress in its subject heading hierarchy. For example, “dog” would represent a broader term relative to the controlled vocabulary term “Golden retriever,” whereas the tag “biology” would represent a narrower term than the controlled vocabulary term “Science.” We reserved “related terms” (a score of five) for tags referring to objects or concepts that were represented in the image but not expressed in the professional metadata. A score of six was awarded only if, after research, the conclusion was made that the term was unrelated to the image or to any of the terms included in the controlled vocabulary. A score of seven, though rare, was reserved for useless tags (i.e., the “Self Reference” or “Task Organizing” classes mentioned previously).
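As a compact restatement of the rubric, the sketch below encodes the score bands described above. The article specifies scores one and five through seven explicitly; assigning score two to near matches and scores three and four to broader and narrower terms, respectively, is our reading of the Library of Congress hierarchy example.

```python
from enum import IntEnum

class TagScore(IntEnum):
    """Hedged encoding of the revised Voorbij/Kipp scale as described above."""
    EXACT_MATCH = 1     # identical to the controlled vocabulary term, punctuation included
    NEAR_MATCH = 2      # assumed: e.g., "Hindenburg Airship" vs. "Hindenburg (Airship)"
    BROADER_TERM = 3    # assumed: e.g., "dog" for "Golden retriever"
    NARROWER_TERM = 4   # assumed: e.g., "biology" for "Science"
    RELATED_TERM = 5    # depicted in the image but absent from the professional metadata
    UNRELATED = 6       # unrelated to the image and its controlled vocabulary
    USELESS = 7         # "Self Reference" / "Task Organizing" tags, e.g., "sendtomom"

# Lower scores indicate a closer match to the professional metadata.
assert TagScore.RELATED_TERM == 5
```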
Search Time. On average, participants in the controlled vocabulary index condition took six times longer to locate each image than did participants in the folksonomy condition.
Items Found. Participants were permitted to abandon searches that they could not complete. When taking the failed searches of the controlled vocabulary condition into account, there were a total of 131 completely unsuccessful searches.
Recall that all tags generated by participants in the gameplay portion of the study were coded using the Voorbij and Kipp scales; Figure 13 (below) depicts the breakdown of scores assigned to the 811 tags generated by the participants.
As shown above, a score of five (“related terms”) accounts for the largest segment of recorded tags, meaning that 50% of all of the tags entered were valid classifications not included in traditional metadata. This implies a fundamental semantic gap between traditional classification and folksonomies.
As illustrated in Figures 14 and 15 (above), the distribution of Voorbij and Kipp scores was consistent and nearly identical between the LIS and non-LIS subsamples. This suggests that, when given the same instructions, both librarians and non-librarians can and do produce the same types of useful, accurate data.
Additionally, a score of seven, for a so-called “useless” tag, was rare in both subsamples.
Another concern for cultural heritage institutions is determining which media subjects are best suited to collecting new metadata through crowdsourcing. As previously mentioned, the images that participants tagged in the present study were divided into five key subject groups (see Figure 17).
Results revealed that the images garnering the highest number of unique tags were
those that fell into the categories People (Unrecognizable) and Miscellaneous
Formats (in this case, a digitized newspaper). The image that generated the fewest
tags was Image 4, a picture of Thomas Edison, Harvey Firestone, and Henry Ford.
Few people recognized the inventors, and many simply input tags such as “old men,” although it is important to note that several participants did express some level of familiarity with the figures in the image (e.g., one participant uttered, “I feel like I should know this”). These results suggest that the best subjects for crowdsourced metadata might be media items that require no prior knowledge. For example, the unrecognizable person and the digitized newspaper were among the only instances in which the intent of the photograph was either completely subjective (the unrecognizable person) or objectively stated (the digitized newspaper). Many other images of famous historical figures and events
simply caused the participants to become frustrated with their own lack of
knowledge. In light of this fact, crowdsourcing platforms may be well-advised to
provide users with the tools to perform their own research about the content of
the media to fill in any gaps in knowledge or recollection that they experience.
This is a challenge that Metadata Games has begun to address, with the addition of
a Wikipedia search bar to encourage users to research what they do not know about
a particular media item.
As of now, there remains debate about the comparative value of traditional and folksonomic metadata as organizational systems for today’s information needs. Nonetheless, there is growing recognition that folksonomies offer libraries an ideal return-on-investment scenario.
The research team would like to thank all study participants, Tom Blake, Linda Gallant, the Boston Public Library, The Digital Public Library of America, Mary Wilkins Jordan, Jeremy Guillette, and the National Endowment for the Humanities.