What is Preservation and Why Does it Matter?
The purpose of preserving cultural and intellectual resources is to make their use possible at some unknown future time. A
complete and reliable record of the past is important for many reasons, not the least of which is to provide an audit trail
of the actions, thoughts, deeds, and misdeeds of those who have gone before us. For the humanities – a field of open-ended inquiry into the nature of humankind and especially of the culture it creates – access to the recorded information and knowledge of the past is absolutely crucial, as both its many subjects of inquiry
and its methodologies rely heavily on retrospective as well as on current resources. Preservation is a uniquely important
public good that underpins the health and well-being of humanistic research and teaching.
Preservation is also vitally important for the growth of an intellectual field and the professional development of its practitioners.
Advances in a field require that there be ease of communication between its practitioners and that the barriers to research
and publishing be as low as possible within a system that values peer review and widespread sharing and vetting of ideas.
Digital technologies have radically lowered the barriers to communication between colleagues, and between teachers and students.
Before e-mail, it was not easy to keep current with one's colleagues in distant locations; and before listservs and search
engines it was hard to learn about the work others were doing or to hunt down interesting leads in fields related to one's
own. Those who aspire to make a mark in the humanities may be attracted to new technologies to advance their research agenda,
but those who also aspire to make a career in the humanities now feel hampered by the barriers to electronic publishing, peer
review, and reward for work in advanced humanities computing. The common perception that digital creations are not permanent
is among the chief obstacles to the widespread adoption of digital publishing, and few scholars are rewarded and promoted
for their work in this area.
In research-oriented institutions such as libraries, archives, and historical societies, primary and secondary sources should
be maintained in a state that allows – if not encourages – use, and therefore the concept of "fitness for use" is the primary principle that guides preservation decisions, actions, and investments. (This is in contrast to museums, which
seldom loan objects to patrons or make them available for people to touch and manipulate.) Of course, fitness for use also
entails describing or cataloguing an item, ensuring that it is easily found in storage and retrieved for use, and securing
the object from mishandling, accidental damage, or theft (Price and Smith 2000). Imagine a run of journals that contains information a researcher wants to consult: the researcher must be able to know
what the title is (it is often found in a catalogue record in a database) and where it is held. He or she must be able to
call the journals up from their location, whether on site, in remote storage, or through inter-library loan or document delivery.
Finally, the researcher must find the journals to be the actual title and dates requested, with no pages missing in each volume
and no volume missing from the run, for them to be of use.
In the digital realm, the ability to know about, locate and retrieve, and then verify (or reasonably assume) that a digital
object is authentic, complete, and undistorted is as crucial to "fitness for use" or preservation as it is for analogue objects – the manuscripts and maps, posters and prints, books and journals, or other genres of information that are captured in continuous
waves, as opposed to discrete bits, and then recorded onto physical media for access (see chapter 32, this volume).
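The verification step described above is commonly implemented with checksums. The following is a minimal sketch of the idea in Python; the sample bytes and the `fixity` helper are illustrative, not a real repository API.

```python
# A minimal sketch of one common way repositories verify that a digital
# object is complete and undistorted: compare a checksum ("fixity value")
# stored with the object against a fresh one computed from the bit stream.

import hashlib

def fixity(data: bytes) -> str:
    """Return a SHA-256 digest of the object's bit stream."""
    return hashlib.sha256(data).hexdigest()

original = b"Scanned page 1 of the journal run."
recorded_digest = fixity(original)  # stored alongside the object's metadata

# On later retrieval, any silent change to the bits is detectable.
retrieved = b"Scanned page 1 of the journal run."
assert fixity(retrieved) == recorded_digest

corrupted = b"Scanned page 1 of the jovrnal run."
assert fixity(corrupted) != recorded_digest
```

In practice such digests are recorded in the object's preservation metadata and rechecked on a schedule, so that bit-level loss is caught while an uncorrupted copy still exists.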
The general approach to preserving analogue and digital information is exactly the same – to reduce risk of information loss to an acceptable level – but the strategies used to insure against loss are quite different. In the analogue realm, information is recorded on to
and retrieved from a physical medium, such as paper, cassette tapes, parchment, film, and so forth. But as paper turns brittle,
cassette tapes break, or film fades, information is lost. Therefore, the most common strategy for preserving the information
recorded onto these media is to ensure the physical integrity of the medium, or carrier. The primary technical challenges
to analogue preservation involve stabilizing, conserving, or protecting the material integrity of recorded information. Physical
objects, such as books and magnetic tapes, inevitably age and degrade, and both environmental stresses such as excess heat
or humidity, and the stresses of use tend to accelerate that loss. Reducing stress to the object, either by providing optimal
storage conditions or by restricting use of the object in one way or another (including providing a copy or surrogate to the
researcher rather than the original object), are the most common means to preserve fragile materials. Inevitably, preservation
involves a trade-off of benefits between current and future users, because every use of the object risks some loss of information,
some deterioration of the physical artifact, some compromise to authenticity, or some risk to the data integrity.
In the digital realm, there are significant trade-offs between preservation and access as well, but for entirely different
reasons. In this realm information is immaterial, and the bit stream is not fixed on to a stable physical object but must
be created ("instantiated" or "rendered") each time it is used. The trade-offs made between long-term preservation and current ease of access stem not from so-called
dependencies on physical media per se, but rather from dependencies on hardware and software and, to a lesser degree, on the physical carrier itself.
What is Archiving?
The concept of digital preservation is widely discussed among professional communities, including librarians, archivists,
computer scientists, and engineers, but for most people, preservation is not a common or commonly understood term. "Archiving", though, is widely used by computer users. To non-professionals, including many scholars and researchers unfamiliar with
the technical aspects of librarianship and archival theory, digital archiving means storing non-current materials some place
"offline" so that they can be used again. But the terms "archiving", "preservation", and "storage" have meaningful technical distinctions – as meaningful as the difference between "brain" and "mind" to a neuroscientist. To avoid confusion, and to articulate the special technical needs of managing digital information as
compared with analogue, many professionals are now using the term "persistence" to mean long-term access – preservation by another name. The terms "preservation", "long-term access", and "persistence" will be used interchangeably here.
Technical Challenges to Digital Preservation and Why They Matter
The goal of digital preservation is to ensure that digital information – be it textual, numeric, audio, visual, or geospatial – remains accessible to a future user in an authentic and complete form. Digital objects are made of bit streams of 0s and 1s arranged
in a logical order that can be rendered onto an interface (usually a screen) through computer hardware and software. The persistence
of both the bit stream and the logical order for rendering is essential for long-term access to digital objects.
As described by computer scientists and engineers, the two salient challenges to digital preservation are:
• physical preservation: how to maintain the integrity of the bits, the 0s and 1s that reside on a storage medium such as a
CD or hard drive; and
• logical preservation: how to maintain the integrity of the logical ordering of the object, that code that makes the bits "renderable" into digital objects.
In the broader preservation community outside the sphere of computer science, these challenges are more often spoken of as:
• media degradation: how to ensure that the bits survive intact and that the magnetic tape, disk, or drive on which they are stored does not degrade, demagnetize, or otherwise cause data loss (this type of loss is also referred to as "bit rot"); and
• hardware/software dependencies: how to ensure that data can be rendered or read in the future when the software they were written in
and/or the hardware on which they were designed to run are obsolete and no longer supported at the point of use.
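The distinction between physical and logical preservation can be made concrete with a small example. In the sketch below (not from the chapter), the preserved bytes survive bit-perfectly, yet the object is still distorted when the logical layer – here, knowledge of the text encoding – is lost.

```python
# A minimal sketch of the physical/logical distinction: the bytes below
# survive intact (physical preservation succeeds), but a future reader
# who assumes the wrong encoding "renders" a distorted object from
# perfectly intact bits (logical preservation fails).

stored = "Bibliothèque".encode("utf-8")  # the preserved bit stream

# Physical preservation succeeded: the bits are unchanged.
assert stored == "Bibliothèque".encode("utf-8")

# Logical preservation can still fail: decoding with the wrong
# assumption garbles the rendered object.
correct = stored.decode("utf-8")    # 'Bibliothèque'
mangled = stored.decode("latin-1")  # 'BibliothÃ¨que'
print(correct, "vs", mangled)
```

The bits never changed; only the knowledge of how to interpret them was lost – which is why documentation of formats and encodings is itself a preservation object.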
Because of these technical dependencies, digital objects are by nature very fragile, often more at risk of data loss and even
sudden death than information recorded on brittle paper or nitrate film stock.
And from these overarching technical dependencies devolve nearly all other factors that put digital data at high risk of corruption,
degradation, and loss – the legal, social, intellectual, and financial factors that will determine whether or not we are able to build an infrastructure
that will support preservation of these valuable but fragile cultural and intellectual resources into the future. It may well
be that humanities scholars, teachers, and students are most immediately affected by the copyright restrictions, economic
barriers, and intellectual challenges to working with digital information, but it will be difficult to gain leverage over
any of these problems without a basic understanding of the ultimate technical problems from which all these proximate problems
arise. Without appreciating the technical barriers to preservation and all the crucial dependencies they entail, humanists
will not be able to create or use digital objects that are authentic, reliable, and of value into the future. So a bit more
detail is in order.
Media degradation: Magnetic tape, a primary storage medium for digital as well as analogue information, is very vulnerable to physical deterioration,
usually in the form of separation of the signal (the encoded information itself) from the substrate (the tape on which the
thin layer of bits reside). Tapes need to be "exercised" (wound and rewound) to maintain even tension and to ensure that the signal is not separating. Tapes also need to be reformatted
from time to time, though rates of deterioration are surprisingly variable (as little as five years in some cases) and so
the only sure way to know tapes are sound is by frequent and labor-intensive examinations. CDs are also known to suffer from
physical degradation, also in annoyingly unpredictable ways and time frames. (Preservationists have to rely on predictable
rates of loss if they are to develop preservation strategies that go beyond hand-crafted solutions for single-item treatments.
Given the scale of digital information that deserves preservation, all preservation strategies will ultimately need to be
automated in whole or in part to be effective.) Information stored on hard drives is generally less prone to media degradation.
But these media have not been in use long enough for there to be meaningful data about how they have performed over the decades.
Finally, a storage medium may itself be physically intact and still carry information, or signal, that has suffered degradation
– tape that has been demagnetized is such an example.
Hardware/software obsolescence: Data can be perfectly intact physically on a storage medium and yet be unreadable because the hardware and software – the playback machine and the code in which the data are written – are obsolete. We know a good deal about hardware obsolescence and its perils already from the numerous defunct playback machines
that "old" audio and visual resources required, such as Beta video equipment, 16 mm home movie projectors, and numerous proprietary
dictation machines. The problem of software obsolescence may be newer, but it is chiefly the proliferation of software codes,
and their rapid supersession by the next release, that makes this computer software concern intractable. In the case of software,
which comprises the operating system, the application, and the format, there are multiple layers that each require attending
to when preservation strategies, such as those listed below, are developed.
There are currently four strategies under various phases of research, development, and deployment for addressing the problems
of media degradation and hardware/software obsolescence (Greenstein and Smith 2002).
• Migration. Digital information is transferred, or rewritten, from one hardware/software configuration to a more current one over time as the old formats are superseded by new ones. Often, a digital repository
where data are stored will reformat or "normalize" data going into the repository; that is, the repository will put the data into a standard format that can be reliably managed
over time. As necessary and cost-effective as this process may be in the long run, it can be expensive and time-consuming.
In addition, digital files translated into another format will lose some information with each successive reformatting (loss
similar to that of translations from one language to another), ranging from formatting or presentation information to potentially
more serious forms of loss. Migration works best for simple data formats and does not work well at all for multimedia objects.
It is the technique most commonly deployed today and it shows considerable reliability with ASCII text and some numeric databases
of the sort that financial institutions use.
• Emulation. Emulation aims to preserve the look and feel of a digital object, that is, to preserve the functionality of the software
as well as the information content of the object. It requires that information about the encoding and hardware environments
be fully documented and stored with the object itself so that it can be emulated or essentially recreated on successive generations
of hardware/software (though, of course, if that information is itself digital, further problems of accessibility to that information
must also be anticipated). Emulation for preservation is currently only in the research phase. People are able to emulate
retrospectively recently deceased genres of digital objects – such as certain computer games – but not prospectively for objects to be read on unknown machines and software programs in the distant future. Many in the
field doubt that proprietary software makers will ever allow their software code to accompany objects, as stipulated by emulation,
and indeed some in the software industry say that documentation is never complete enough to allow for the kind of emulation
100 or 200 years out that would satisfy a preservation demand (Rothenberg 1999; Bearman 1999; Holdsworth and Wheatley 2000). Perhaps more to the point, software programs that are several decades or centuries old may well be no more accessible to
contemporary users than medieval manuscripts are accessible to present-day readers who have not been trained to read medieval
Latin in a variety of idiosyncratic hands. Nonetheless, emulation remains a tantalizing notion that continues to attract research attention.
• Persistent object preservation. A relatively new approach being tested by the National Archives and Records Administration for electronic records such as
e-mails, persistent object preservation (POP) "entails explicitly declaring the properties (e.g., content, structure, context, presentation) of the original digital information
that ensure its persistence" (Greenstein and Smith 2002). It envisions wrapping a digital object with the information necessary to recreate it on current software (not the original
software envisioned by emulation). This strategy has been successfully tested in its research phase and the Archives is now
developing an implementation program for it. At present it seems most promising for digital information objects such as official
records and other highly structured genres that do not require extensive normalization – that is, changing them to conform to a common preservation norm upon deposit into the repository. Ironically, this approach is conceptually related to the efforts of some digital artists who, creating very idiosyncratic and "un-normalizable" digital works, are declaring at the time of creation how to recreate the art at some time in the future.
They do so by specifying which features of the hardware and software environment are intrinsic and authentic, and which are
fungible and need not be preserved (such things as screen resolution, processing speed, and so forth, that can affect the
look and feel of the digital art work) (Thibodeau 2002; Mayfield 2002).
• Technology preservation. This strategy addresses future problems of obsolescence by preserving the digital object together with the hardware, operating
system, and program of the original. While many will agree that, for all sorts of reasons, someone somewhere should be collecting
and preserving all generations of hardware and software in digital information technology, it is hard to imagine this approach
as much more than a technology museum attempting production-level work, doomed to an uncertain future. It is unlikely to be
scalable as an everyday solution for accessing information on orphaned platforms, but it is highly likely that something like
a museum of old technology, with plentiful documentation about the original hardware and software, will be important for future
digital archaeology and data mining. Currently, digital archaeologists are able, with often considerable effort, to rescue
data from degraded tapes and corrupted (or erased) hard drives. With some attention now to capturing extensive information
about successive generations of hardware and software, future computer engineers should be able to get some information off
the old machines.
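Of the four strategies above, migration is the easiest to illustrate concretely. The toy sketch below shows its characteristic trade-off: a record is rewritten into a simpler, more manageable target format, and some information is lost at the margins. The "legacy" record and the plain-ASCII target are invented for illustration; real repositories normalize into well-documented standard formats.

```python
# A toy illustration of migration with measurable loss: normalizing a
# record into a simpler target format (here, plain ASCII) discards what
# the target cannot carry -- analogous to losing diacritics or
# presentation information in a real format migration.

import unicodedata

def migrate_to_ascii(record: dict) -> dict:
    """Normalize string fields to plain ASCII, silently discarding
    characters the target format cannot represent."""
    migrated = {}
    for key, value in record.items():
        if isinstance(value, str):
            decomposed = unicodedata.normalize("NFKD", value)
            migrated[key] = decomposed.encode("ascii", "ignore").decode("ascii")
        else:
            migrated[key] = value
    return migrated

legacy = {"title": "Études sur l'archivage", "year": 1998}
current = migrate_to_ascii(legacy)
print(current["title"])  # accents lost: "Etudes sur l'archivage"
```

Each successive migration of this kind compounds the loss, which is why the chapter compares repeated reformatting to repeated translation between languages.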
The preservation and access trade-offs for digital information are similar to those for analogue. To make an informed decision
about how to preserve an embrittled book whose pages are crumbling or whose spine is broken, for example, one must weigh the relative
merits of aggressive and expensive conservation treatment to preserve the book as an artifact, versus the less expensive option
of reformatting the book's content onto microfilm or scanning it, losing most of the information integral to the artifact
as physical object in the process. One always needs to identify the chief values of a given resource to decide between a number
of preservation options. In this case, the question would be whether one values the artifactual or the informational content
more highly. This consideration would apply equally in the digital realm: some of the technologies described above will be
cheaper, more easily automated, and in that sense more scalable over time than others. Migration appears to be suitable for
simpler formats in which the look and feel matters less and in which loss of information at the margins is an acceptable risk.
The other approaches have not been tested yet in large scale over several decades, but the more options we have for ensuring
persistence, the likelier we are to make informed decisions about what we save and how. There is no silver bullet for preserving
digital information, and that may turn out to be good news in the long run.
Crucial Dependencies and Their Implications for the Humanities
Preservation by benign neglect has proven an amazingly robust strategy over time, at least for print-on-paper. One can passively
manage a large portion of library collections fairly cheaply. One can put a well-catalogued book on a shelf in good storage
conditions and expect to be able to retrieve it in 100 years in fine shape for use if no one has called it from the shelf.
But neglect in the digital realm is never benign. Neglect of digital data is a death sentence. A digital object needs to be
optimized for preservation at the time of its creation (and often again at the time of its deposit into a repository), and
then it must be conscientiously managed over time if it is to stand a chance of being used in the future.
The need for standard file formats and metadata and the role of data creators
All the technical strategies outlined above are crucially dependent on standard file formats and metadata schemas for the
creation and persistence of digital objects. File formats that are proprietary are often identified as being especially at
risk, because they are in principle dependent on support from an enterprise that may go out of business. Even a format so
widely used that it is a de facto standard, such as Adobe Systems, Inc.'s portable document format (PDF), is treated with great caution by those responsible
for persistence. The owner of a such a de facto standard has no legal obligation to release its source code or any other proprietary information in the event that it goes
bankrupt or decides to stop supporting the format for one reason or another (such as creating a better and more lucrative
Commercial interests are not always in conflict with preservation interests, but when they are, commercial interests must
prevail if the commerce is to survive. For that reason, the effort to develop and promote adoption of non-proprietary software,
especially so-called open source code, is very strong among preservationists. (Open source, as opposed to proprietary, can
be supported by non-commercial as well as commercial users.) But to the extent that commercial services are often in a better
position to support innovation and development efforts, the preservation community must embrace both commercial and non-commercial
formats. While preservationists can declare which standards and formats they would like to see used, dictating to the marketplace,
or ignoring it altogether, is not a promising solution to this problem. One way to ensure that important but potentially vulnerable
proprietary file formats are protected if they are orphaned is for leading institutions with a preservation mandate – national libraries, large research institutions, or government archives – to develop so-called fail-safe agreements with software makers that allow the code for the format to go into receivership
or be deeded over to a trusted third party. (For more on file formats, see chapter 32, this volume.)
Metadata schemas – approaches to describing information assets for access, retrieval, preservation, or internal management – are another area in which there is a delicate balance between what is required for ease of access (and of creation) and what
is required to ensure persistence. Extensive efforts have been made by librarians, archivists, and scholars to develop sophisticated
markup schemes that are preservation-friendly, such as the Text Encoding Initiative (TEI) Guidelines, which were first expressed in SGML, or Encoded Archival Description, which, like the current instantiation of the TEI, is
written in XML, and open for all communities, commercial and non-commercial alike. The barriers to using these schemas can
be high, however, and many authors and creators who understand the importance of creating good metadata nevertheless find
these schemas too complicated or time-consuming to use consistently. It can be frustrating to find out that there are best practices for the creation of preservable digital objects, but that those practices are prohibitively labor-intensive for most creators to follow.
The more normalized and standard a digital object is, the easier it is for a digital repository to take it in (a process curiously
called "ingest"), to manage it over time, and to provide the objects back to users in their original form. Indeed, most repositories under
development in libraries and archives declare that they will assume responsibility for persistence only if the objects they
receive are in certain file formats accompanied by certain metadata. This is in sharp contrast to the more straightforward
world of books, photographs, or maps, where one can preserve the artifact without having to catalogue it first. Boxes of unsorted
and undescribed sources can languish for years before being discovered, and once described or catalogued, they can have a
productive life as a resource. Although fully searchable text could, in theory, be retrieved without much metadata in the
future, it is hard to imagine how a complex or multimedia digital object that goes into storage of any kind could ever survive,
let alone be discovered and used, if it were not accompanied by good metadata. This creates very large up-front costs for
digital preservation, both in time and money, and it is not yet clear who is obligated to assume those costs.
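The kind of descriptive metadata a repository might require at ingest can be sketched in miniature. Real schemas (TEI headers, EAD, Dublin Core) are far richer; the element names below loosely follow Dublin Core, and the record itself is invented for illustration.

```python
# A toy descriptive-metadata record of the sort a repository might
# require before accepting an object for long-term management.
# Illustrative only: real ingest profiles are much more demanding.

import xml.etree.ElementTree as ET

record = ET.Element("metadata")
for element, value in [
    ("title", "Field Notes on the Salem Trials"),
    ("creator", "Unknown"),
    ("date", "2003"),
    ("format", "text/xml"),
]:
    ET.SubElement(record, element).text = value

print(ET.tostring(record, encoding="unicode"))
```

Even this skeletal record must be produced by someone; multiplied across thousands of complex objects, that labor is the up-front cost the paragraph above describes.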
For non-standard or unsupported formats and metadata schemas, digital repositories might simply promise that they will deliver
back the bits as they were received (that is, provide physical preservation) but will make no such promises about the legibility
of the files (that is, not guarantee logical preservation). Though developed in good faith, these policies can be frustrating
to creators of complex digital objects, or to those who are not used to or interested in investing their own time in preparing
their work for permanent retention. This is what publishers and libraries have traditionally done, after all. Some ask why it should be different now.
There is no question that a digital object's file format and metadata schema greatly affect its persistence and how it will
be made available in the future. This crucial dependency of digital information on format and markup raises the question of
who should pay for file preparation and what economic model will support this expensive enterprise. Who are the stakeholders
in digital preservation, and what are their roles in this new information landscape? There is now an interesting negotiation under way between data creators and distributors on the one hand, and libraries on
the other, about who will bear the costs of ingest. Various institutions of higher learning that are stepping up to the challenge
of digital preservation are working out a variety of local models that bear watching closely (see below, the section on solutions
and current activities).
Need for early preservation action and the role of copyright
Regardless of the outcome, it seems clear that those who create intellectual property in digital form need to be more informed
about what is at risk if they ignore longevity issues at the time of creation. This means that scholars should be attending to the information resources crucial to their fields by developing and adopting
the document standards vital for their research and teaching, with the advice of preservationists and computer scientists
where appropriate. The examples of computing-intensive sciences such as genomics that have developed professional tracks in
informatics might prove fruitful for humanists as more and more computing power is applied to humanistic inquiry and pedagogy.
Such a function would not be entirely new to the humanities; in the nineteenth century, a large number of eminent scholars
became heads of libraries and archives in an age when a scholar was the information specialist par excellence.
One of the crucial differences between the information needs of scientists and those of humanists is that the latter tend
to use a great variety of sources that are created outside the academy and that are largely protected by copyright. Indeed,
there is probably no type of information created or recorded by human beings that could not be of value for humanities research
at some time, and little of it may be under the direct control of the researchers who most value it for its research potential.
The chief concern about copyright that impinges directly on preservation is the length of copyright protection that current
legislation extends to the rights holders – essentially, the life of the creator plus 70 years (or more) (Copyright Office, website). Why that matters to preservation
goes back to the legal regime that allows libraries and archives to preserve materials that are protected by copyright. Institutions
with a preservation mission receive or buy information – books, journals, manuscripts, maps – to which they may have no intellectual rights. But rights over the physical objects themselves do transfer, and the law allows
those institutions to copy the information in those artifacts for the purposes of preservation.
This transfer of property rights to collecting institutions breaks down with the new market in digital information. Publishers
and distributors of digital information very seldom sell their wares. They license them. This means that libraries no longer
own the journals, databases, and other digital intellectual property to which they provide access, and they have no incentive
to preserve information that they essentially rent. Because publishers are not in the business of securing and preserving
information "in perpetuity", as the phrase goes, there is potentially a wealth of valuable digital resources that no institution is claiming to preserve
in this new information landscape. Some libraries, concerned about the potentially catastrophic loss of primary sources and
scholarly literature, have successfully negotiated "preservation clauses" in their licensing agreements, stipulating that publishers give them physical copies (usually CDs) of the digital data if
they cease licensing it, so that they can have perpetual access to what they paid for. While CDs are good for current access
needs, few libraries consider them to be archival media. Some commercial and non-commercial publishers of academic literature
have forged experimental agreements with libraries, ensuring that in the event of a business failure, the digital files of
the publisher will go to the library.
It is natural that a bibliocentric culture such as the academy has moved first on the issue of what scholars themselves publish.
The greater threat to the historical record, however, is not to the secondary literature on which publishers and libraries
are chiefly focused. The exponential growth of visual resources and sound recordings in the past 150 years has produced a
wealth of primary source materials in audiovisual formats to which humanists will demand access in the future. It is likely
that most of these resources, from performing arts to moving image, photographs, music, radio and television broadcasting,
geospatial objects, and more, are created for the marketplace and are under copyright protection (Lyman and Varian 2000).
Efforts have barely begun to negotiate with the major media companies and their trade associations to make film and television
studios, recording companies, news photo services, digital cartographers, and others aware of their implied mandate to preserve
their corporate digital assets for the greater good of a common cultural heritage. As long as those cultural and intellectual
resources are under the control of enterprises that do not know about and take up their preservation mandate, there is a serious
risk of major losses for the future, analogous to the fate of films in the first 50 years of their existence. More than 80
percent of silent films made in the United States and 50 percent of those made before 1950 are lost, presumably forever. Viewing films as "commercial products", the studios had no interest in retaining them after their productive lives, and libraries and archives had no interest in
acquiring them on behalf of researchers. It is critical that humanists start today to identify the digital resources that
may be of great value now or in the future so that they can be captured and preserved before their date of expiration arrives.
Need to define the values of the digital object and the role of the humanist
Among the greatest values of digital information technologies for scholars and students is the ability of digital information
to transform the very nature of inquiry. Not bound to discrete physical artifacts, digital information is available anywhere
at any time (dependent on connectivity). Through computing applications that most computer users will never understand in
fine grain, digital objects can be easily manipulated, combined, erased, and cloned, all without leaving the physical traces
of tape, erasure marks and whiteouts, the tell-tale shadow of the photocopied version, and other subtle physical clues that
apprise us of the authenticity and provenance, or origin, of the photograph or map we hold in our hands.
Inquiring minds who need the sources they rely on to be authentic – for something to be what it purports to be – face special concerns in the digital realm, and considerable work is being done in this area to advance our trust
in digital sources and to develop digital information literacy among users (Bearman and Trant 1998; CLIR 2000). Where this issue of the malleability of digital information most affects humanistic research and preservation,
beyond the crucial issue of authenticity, is that of fixity and stability. While the genius of digital objects is their ability
to be modified for different purposes, there are many reasons why information must be fixed and stable at some point to be
reliable in the context of research and interpretation. For example, to the extent that research and interpretation build
on previous works, both primary and secondary, it is important for the underlying sources of an interpretation, experiment,
or observation to be accessible to users in the form in which they were cited by the creator. It would be useless
to have an article proposing a new interpretation of the Salem witch trials rely on diary sources that are not accessible
to others to investigate and verify. But when writers cite web-based materials as their primary sources and the reader can
find only a dead link, the effect is the same.
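The link rot described above can at least be audited mechanically. The sketch below is a minimal illustration, not any citation service's actual tool; the names `check_link` and `audit_citations` are invented for this example. It probes each cited URL and reports whether the link still resolves, which is the first step a journal or author could take before relying on web-based sources.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10):
    """Return 'live' if the URL currently resolves, 'dead' otherwise.

    Many distinct causes of link rot (retired domains, restructured
    sites, moved servers, malformed citations) all surface here as
    errors, so a single probe suffices for a first-pass audit.
    """
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "live" if resp.status < 400 else "dead"
    except (urllib.error.URLError, ValueError, OSError):
        return "dead"


def audit_citations(urls):
    """Map each cited URL to its current status."""
    return {url: check_link(url) for url in urls}
```

A "dead" verdict is only a snapshot, of course; a persistent-identifier scheme, not repeated probing, is what actually solves the problem the text raises.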
To the extent that the pursuit of knowledge in any field builds upon the work of others, the chain of reference and ease of
linking to reference sources are crucial. Whose responsibility is it to maintain the persistence of links in an author's article
or a student's essay? Somehow, we expect the question of persistence to be taken care of by some vital but invisible infrastructure, not unlike
the water that comes out of the tap when we turn the knob. Clearly, that infrastructure does not yet exist. But even if it
did, there are still nagging issues about persistence that scholars and researchers need to resolve, such as the one known
as "versioning", or deciding which iteration of a dynamic and changing resource should be captured and curated for preservation. This is
a familiar problem to those who work in broadcast media, and digital humanists can profit greatly from the sophisticated thinking
that has gone on in audiovisual archives for generations.
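One simple answer to the versioning question, which iteration of a dynamic resource to capture, is to record a dated snapshot only when the content has actually changed. The class below is a hypothetical sketch (the name `SnapshotStore` is invented for this example), not a description of how any real web archive works, but it makes the idea concrete.

```python
import hashlib
from datetime import datetime, timezone


class SnapshotStore:
    """Keep dated versions of a changing resource, one per distinct state.

    A new snapshot is recorded only when the content's checksum differs
    from the most recent version, so repeated captures of an unchanged
    resource cost nothing.
    """

    def __init__(self):
        # Each entry is (timestamp, sha256 digest, raw content).
        self.versions = []

    def capture(self, content: bytes) -> bool:
        """Record a snapshot; return True if a new version was stored."""
        digest = hashlib.sha256(content).hexdigest()
        if self.versions and self.versions[-1][1] == digest:
            return False  # unchanged since the last capture
        stamp = datetime.now(timezone.utc).isoformat()
        self.versions.append((stamp, digest, content))
        return True

    def latest(self):
        """Return the most recently captured content, or None."""
        return self.versions[-1][2] if self.versions else None
```

The design choice here, deduplicating by checksum rather than by capture date, mirrors the thinking in audiovisual archives: what matters is each distinct state of the work, not every occasion on which it was observed.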
The problem of persistent linking to sources has larger implications for the growth of the humanities as a part of academic
life, and for the support of emerging trends in scholarship and teaching. Until publishing a journal article, a computer model,
or a musical analysis in digital form is seen as persistent and therefore a potentially long-lasting contribution to the chain
of knowledge creation and use, few people will be attracted to work for reward and tenure in these media, no matter how superior
the media may be for the research into and expression of an idea.
Solutions and Current Activities
There has been considerable activity in both the basic research communities (chiefly among computer scientists and information
scientists) and at individual institutions (chiefly libraries and federal agencies) to address many of the critical technical
issues of building and sustaining digital repositories for long-term management and persistence. The private sector, while
clearly a leading innovator in information technologies, both in the development of hardware and software and in the management
of digital assets such as television, film, and recorded sound, has not played a leading public role in the development of
digital preservation systems. That is primarily because the time horizons of the preservation community and of the commercial
sectors are radically different. Most data storage systems in the private sector aim to retain data for no more than
five to ten years (and few storage vendors will commit even to the upper end of that range). The time horizon of preservation
for libraries, archives, and research institutions must include many generations of inquiring humans, not just the next two
generations of hardware or software upgrades.
Because digital preservation is so complex and expensive, it is unlikely that digital repositories will spring up in the thousands
of institutions that have traditionally served as preservation centers for books. Nor should they. In a networked environment
in which one does not need access to a physical object to have access to information, the relationship between ownership (and
physical custody) of information and access to it will be transformed. Within the research and nonprofit communities, it is
likely that the system of digital preservation repositories, or digital archives, will be distributed among a few major actors
that work on behalf of a large universe of users. They will be, in other words, part of the so-called public goods information
economy that research and teaching have traditionally relied upon for core services such as preservation and collection building.
Among the major actors in digital archives will be academic disciplines whose digital information assets are crucial to the
field as a whole. Examples include organizations such as the Inter-university Consortium for Political and Social Research
(ICPSR), which manages social science datasets, and the Human Genome Data Bank, which preserves genetic data. Both are supported
directly by the disciplines themselves and through federal grants. The data in these archives are not necessarily complex,
but they are highly structured and the depositors are responsible for preparing the data for deposit. JSTOR, a non-commercial
service that preserves and provides access to digital versions of key scholarly journals, is another model of a preservation
enterprise designed to meet the needs of researchers. It is run on behalf of researchers and financially supported by libraries
through subscriptions. (Its start-up costs were provided by a private foundation.)
Some large research university libraries, such as the University of California, Harvard University, Massachusetts Institute
of Technology, and Stanford University, are beginning to develop and deploy digital repositories that will be responsible
for some circumscribed portion of the digital output of their faculties. Another digital library leader, Cornell, has recently
taken under its wing the disciplinary pre-print archive developed to serve the high-energy physics community and its need
for rapid dissemination of information among the small but geographically far-flung members of that field. In the sciences,
there are several interesting models of digital information being created, curated, and preserved by members of the discipline,
among which arXiv.org is surely the best known. This model appears difficult to emulate in the humanities because arts and
humanities disciplines do not create shared information resources that are then used by many different research teams. Nevertheless,
where redundant collections, such as journals and slide libraries, exist across many campuses, subscription-based services
are being developed to provide access to and preserve those resources through digital surrogates. Examples include JSTOR,
AMICO, and ARTstor, now under development. The economies of scale can be achieved only if the libraries and museums that subscribe
to these services believe that the provider will persistently manage the digital surrogates and that they are therefore able
to dispose of extra copies of journals or slides.
Learned societies may be a logical locus of digital archives, as they are trusted third parties within a field and widely
supported by members of the communities they serve. But they are not, as a rule, well positioned to undertake the serious
capital expenditures that repository services require. That seems to be the reason why these subscription-based preservation
and access services have appeared in the marketplace.
The Library of Congress (LC), which is the seat of the Copyright Office and receives for inclusion in its collections one
or more copies of all works deposited for copyright protection, is beginning to grapple with the implications of digital deposits
and what they mean for the growth of the Library's collections. (At present, with close to 120 million items in its collections,
it is several times larger than any other humanities collection in the United States.) LC is developing a strategy to build a
national infrastructure for the preservation of digital heritage that would leverage the existing and future preservation
activities across the nation (and around the globe) to ensure that the greatest number of people can have persistent rights-protected
access to that heritage. The National Archives is also working to acquire and preserve the digital output of the federal government,
though this will entail an expenditure of public resources for preservation that is unprecedented in a nation that prides
itself on its accountability to its people.
The final word about major actors in preservation belongs to the small group of visionary private collectors who have fueled
the growth of great humanities collections for centuries. The outstanding exemplar of the digital collector is Brewster Kahle,
who has designed and built the Internet Archive, which captures and preserves a large number of publicly available sites.
While the visible and publicly available Web that the Internet Archive harvests is a small portion of the total Web (Lyman 2002), the Archive has massive amounts of culturally rich material. Kahle maintains the Archive as a preservation repository – that is its explicit mission – and in that sense his enterprise is a beguiling peek into the future of collecting in the digital realm.
Selection for Preservation
While much work remains to ensure that digital objects of high cultural and research value persist into the future, many experts
are cautiously optimistic that, with enough funding and will, technical issues will be addressed and acceptable solutions
will be found. The issue that continues to daunt the most thoughtful among those engaged in preservation strategies is selection:
how to determine what, of the massive amount of information available, should be captured and stored and managed over time.
In theory, there is nothing created by the hands of mankind that is not of potential research value for humanities scholars,
even the humblest scrap of data – tax records, laundry lists, porn sites, personal websites, weblogs, and so forth. Technology optimists who believe that it
will be possible to "save everything" through automated procedures have advocated doing so. Some teenager is out there today, they point out, who will be president
of the United States in 30 years, and she no doubt already has her own website. If we save all websites we are bound to save
hers. If completeness of the historical record is a value that society should support, then saving everything that can be
saved appears to be the safest policy to minimize the risk of information loss.
There may well be compelling social reasons to capture everything from the Web and save it forever – if it were possible and if there were no legal and privacy issues. (Most of the Web, the so-called Deep Web, is not publicly
available, and many copyright experts tend to think that all non-federal sites are copyright-protected.) Certainly, everything
created by public officials in the course of doing their business belongs in the public record, in complete and undistorted
form. Further, there are the huge and expensive stores of data about our world – from census data to the petabytes of data sent back to Earth from orbiting satellites – that may prove invaluable in future scientific problem solving. On the other hand, humanists have traditionally valued the
enduring quality of an information object as much as the quantity of raw data it may yield. There is reason to think that
humanists in the future will be as interested in the depth of information in a source as in its sheer quantity.
Indeed, some historians working in the modern period already deplore the promiscuity of paper-based record making and
keeping in contemporary life.
But the digital revolution has scarcely begun, and the humanities have been slower to adopt the technology and test its potential
for transforming the nature of inquiry than have other disciplines that rely more heavily on quantitative information. Some
fields – history is a good example – have gone through periods when quantitative analysis has been widely used, but this was before the advent of computing power
on today's scale. If and when humanists discover the ways that computers can truly change the work of research and teaching,
then we can expect to see the growth of large and commonly used databases that will demand heavy investments in time and money
and that will, therefore, raise the question of persistence.
We will not be able in the future to rely on traditional assessments of value for determining what deserves preservation.
In the digital realm, there will be no uniqueness, no scarcity, no category of "rare." There will remain only the signal categories of evidential value, aesthetic value, and associational value, the very criteria
that are, by their subjectivity, best assessed by scholar experts. The role of humanists in building and preserving collections
of high research value will become as important as it was in the Renaissance or the nineteenth century. Unlike those eras,
however, when scholars could come to understand the value of sources as those sources revealed themselves over time, there will be no distinction
between collecting "just in case" something proves later to be valuable and collecting "just in time" for someone to use now. Scholars cannot leave it to later generations to collect materials created today. They must assume
a more active role in the stewardship of research collections than they have played since the nineteenth century.
Digital Preservation as a Strategy for Preserving Non-digital Collections
There seems little doubt that materials that are "born digital" need to be preserved in digital form. Yet another reason why ensuring the persistence of digital information is crucial to
the future of humanities scholarship and teaching is that a huge body of very fragile analogue materials demands digital reformatting
for preservation. Most moving image and recorded sound sources exist on media that are fragile, such as nitrate film, audio
tapes, or lacquer disks, and they all require reformatting on to fresher media to remain accessible for use. Copying analogue
information to another analogue format results in significant loss of signal within a generation or two (imagine copying a
video of a television program over and over), so most experts believe that reformatting onto digital media is the safest strategy.
Digital reformatting results in no significant loss of information, and in most cases no loss at all.
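The claim that digital copying, unlike analogue dubbing, loses nothing across generations can be verified mechanically with checksums. The sketch below is purely illustrative: it fingerprints a byte sequence, passes it through many "generations" of copying, and confirms that the fingerprint never drifts. (The sample data and the `fingerprint` helper are invented for this example.)

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest: identical bytes always yield an identical digest."""
    return hashlib.sha256(data).hexdigest()


# Simulate many generations of copying. Each "copy" is a byte-for-byte
# duplicate, so the fingerprint never changes, unlike an analogue dub,
# which degrades the signal with every generation.
original = b"sample frames from a digitized nitrate film reel"
copy = bytes(original)
for _ in range(100):
    copy = bytes(copy)  # another generation of copying

assert fingerprint(copy) == fingerprint(original)
```

In practice this is exactly how repositories audit their holdings: a stored checksum lets a curator prove, years later, that a file is still bit-for-bit what was deposited.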
For those who prize access to the original source materials for research and teaching, the original artifact is irreplaceable.
For a large number of other uses, an excellent surrogate or access copy is fine. Digital technology can enhance the preservation
of artifacts by providing superlative surrogates of original sources while at the same time protecting the artifact from overuse.
(See chapter 32, this volume.)
All this means that, as creators and consumers of digital information, humanists are vitally interested in digital preservation,
both for digital resources and for the abundance of valuable but physically fragile analogue collections that must rely on
digitization. Given the extraordinary evanescence of digital information, it is crucial now for them to engage the copyright
issues, to develop the economic models for building and sustaining the core infrastructure that will support persistent access,
and, most proximate to their daily lives, to ensure that students and practitioners of the humanities are appropriately trained
and equipped to conduct research, write up results for dissemination, and enable a new generation of students to engage in
the open-ended inquiry into culture that is at the core of the humanistic enterprise.
The author thanks Amy Friedlander, Kathlin Smith, and David Rumsey for their invaluable help.
Bearman, David (1999). Reality and Chimeras in the Preservation of Electronic Records. D-Lib Magazine 5, 4. At http://www.dlib.org/dlib/april99/bearman/04bearman.html.
Bearman, David and Jennifer Trant (1998). Authenticity of Digital Resources: Towards a Statement of Requirements in the Research Process. D-Lib Magazine (June). At http://www.dlib.org/dlib/june98/06bearman.html.
Copyright Office. Circulars 15 and 15a. Accessed April 27, 2004. At http://www.loc.gov/copyright/circs/circ15.pdf and http://www.loc.gov/copyright/circs/circ15a.pdf.
Council on Library and Information Resources (CLIR) (2000). Authenticity in a Digital Environment. Washington, DC: Council on Library and Information Resources. At http://www.clir.org/pubs/reports/pub92/contents.html.
Greenstein, Daniel and Abby Smith (2002). Digital Preservation in the United States: Survey of Current Research, Practice, and Common Understandings. At http://www.digitalpreservation.gov.
Holdsworth, David and Paul Wheatley (2000). Emulation, Preservation and Abstraction. CAMiLEON Project, University of Leeds. At http://184.108.40.206/CAMiLEON/dh/ep5.html.
Lyman, Peter (2002). Archiving the World Wide Web. In Building a National Strategy for Digital Preservation: Issues in Digital Media Archiving. Washington, DC: Council on Library and Information Resources and the Library of Congress. At http://www.clir.org/pubs/reports/pub106/contents.html; and http://www.digitalpreservation.gov/ndiipp/repor/repor_back_web.html.
Lyman, Peter, and Hal R. Varian (2000). How Much Information? At http://www.sims.berkeley.edu/how-much-info.
Mayfield, Kendra (2002). How to Preserve Digital Art. Wired (July 23). At http://www.wired.com/news/culture/0,1284,53712,00.html.
Price, Laura and Abby Smith (2000). Managing Cultural Assets from a Business Perspective. Washington, DC: Council on Library and Information Resources. At http://www.clir.org/pubs/reports/pub90/contents.html.
Rothenberg, Jeff (1999). Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Washington, DC: Council on Library and Information Resources. At http://www.clir.org/pubs/reports/pub77/contents.html.
Thibodeau, Kenneth (2002). Overview of Technological Approaches to Digital Preservation and Challenges in Coming Years. In The State of Digital Preservation: An International Perspective. Conference Proceedings. Documentation Abstracts, Inc., Institutes for Information Science, Washington, DC, April 24–25, 2002. Washington, DC: Council on Library and Information Resources. At http://www.clir.org/pubs/reports/pub107/contents.html.
Websites of Organizations and Projects Noted
Art Museum Image Consortium (AMICO). http://www.amico.org.
Internet Archive, http://www.archive.org.
Inter-university Consortium for Political and Social Research (ICPSR). http://www.icpsr.umich.edu.
Library of Congress. The National Digital Information Infrastructure and Preservation Program (NDIIPP) of the Library of Congress. http://www.digitalpreservation.gov.
National Archives and Records Administration (NARA). Electronic Records Archives. http://www.archives.gov/electronic_records_archives.
For Further Reading
On the technical, social, organizational, and legal issues related to digital preservation, the best source continues to be
Preserving Digital Information: Report of the Task Force on Archiving of Digital Information (Washington, DC, and Mountain View, CA: Commission of Preservation and Access and the Research Libraries Group, Inc.). Available
Digital preservation is undergoing rapid development, and few publications remain current for long. To keep abreast of developments
in the field, see the following sources, which regularly digest or cite the latest information on the subject
and present the leading research of interest to humanists:
D-Lib Magazine. http://www.dlib.org.
Council on Library and Information Resources (CLIR). http://www.clir.org.
Digital Library Federation (DLF). http://www.diglib.org.
National Science Foundation (NSF), Digital Libraries Initiative (DLI). http://www.dli2.nsf.gov.
National Library of Australia, Preserving Access to Digital Information (PADI). http://www.nla.gov.au/padi/format/case.html.