What is Preservation and Why Does it Matter?
The purpose of preserving cultural and intellectual resources is to make their use possible at some unknown future time. A complete and reliable record of the past is important for many reasons, not the least of which is to provide an audit trail of the actions, thoughts, deeds, and misdeeds of those who have gone before us. For the humanities – a field of open-ended inquiry into the nature of humankind and especially of the culture it creates – access to the recorded information and knowledge of the past is absolutely crucial, as both its many subjects of inquiry and its methodologies rely heavily on retrospective as well as on current resources. Preservation is a uniquely important public good that underpins the health and well-being of humanistic research and teaching.
Preservation is also vitally important for the growth of an intellectual field and the professional development of its practitioners. Advances in a field require that there be ease of communication between its practitioners and that the barriers to research and publishing be as low as possible within a system that values peer review and widespread sharing and vetting of ideas. Digital technologies have radically lowered the barriers of communication between colleagues, and between teachers and students. Before e-mail, it was not easy to keep current with one's colleagues in distant locations; and before listservs and search engines it was hard to learn about the work others were doing or to hunt down interesting leads in fields related to one's own. Those who aspire to make a mark in the humanities may be attracted to new technologies to advance their research agenda, but those who also aspire to make a career in the humanities now feel hampered by the barriers to electronic publishing, peer review, and reward for work in advanced humanities computing. The common perception that digital creations are not permanent is among the chief obstacles to the widespread adoption of digital publishing, and few scholars are rewarded and promoted for their work in this area.
In research-oriented institutions such as libraries, archives, and historical societies, primary and secondary sources should be maintained in a state that allows – if not encourages – use, and therefore the concept of "fitness for use" is the primary principle that guides preservation decisions, actions, and investments. (This is in contrast to museums, which seldom loan objects to patrons or make them available for people to touch and manipulate.) Of course, fitness for use also entails describing or cataloguing an item, ensuring that it is easily found in storage and retrieved for use, and securing the object from mishandling, accidental damage, or theft (Price and Smith 2000). Imagine a run of journals that contains information a researcher wants to consult: the researcher must be able to know what the title is (it is often found in a catalogue record in a database) and where it is held. He or she must be able to call the journals up from their location, whether on site, in remote storage, or through inter-library loan or document delivery. Finally, the researcher must find the journals to be the actual title and dates requested, with no pages missing in each volume and no volume missing from the run, for them to be of use.
In the digital realm, the ability to know about, locate and retrieve, and then verify (or reasonably assume) that a digital object is authentic, complete, and undistorted is as crucial to "fitness for use" or preservation as it is for analogue objects – the manuscripts and maps, posters and prints, books and journals, or other genres of information that are captured in continuous waves, as opposed to discrete bits, and then recorded onto physical media for access (see chapter 32, this volume).
The general approach to preserving analogue and digital information is exactly the same – to reduce risk of information loss to an acceptable level – but the strategies used to insure against loss are quite different. In the analogue realm, information is recorded onto and retrieved from a physical medium, such as paper, cassette tapes, parchment, film, and so forth. But as paper turns brittle, cassette tapes break, or film fades, information is lost. Therefore, the most common strategy for preserving the information recorded onto these media is to ensure the physical integrity of the medium, or carrier. The primary technical challenges to analogue preservation involve stabilizing, conserving, or protecting the material integrity of recorded information. Physical objects, such as books and magnetic tapes, inevitably age and degrade, and both environmental stresses such as excess heat or humidity, and the stresses of use tend to accelerate that loss. Reducing stress to the object, either by providing optimal storage conditions or by restricting use of the object in one way or another (including providing a copy or surrogate to the researcher rather than the original object), is the most common means of preserving fragile materials. Inevitably, preservation involves a trade-off of benefits between current and future users, because every use of the object risks some loss of information, some deterioration of the physical artifact, some compromise to authenticity, or some risk to the data integrity.
In the digital realm, there are significant trade-offs between preservation and access as well, but for entirely different reasons. In this realm information is immaterial, and the bit stream is not fixed on to a stable physical object but must be created ("instantiated" or "rendered") each time it is used. The trade-offs made between long-term preservation and current ease of access stem not from so-called data dependencies on physical media per se, but rather from data dependencies on hardware, software, and, to a lesser degree, on the physical carrier as well.
What is archiving?
The concept of digital preservation is widely discussed among professional communities, including librarians, archivists, computer scientists, and engineers, but for most people, preservation is not a common or commonly understood term. "Archiving", though, is widely used by computer users. To non-professionals, including many scholars and researchers unfamiliar with the technical aspects of librarianship and archival theory, digital archiving means storing non-current materials some place "offline" so that they can be used again. But the terms "archiving", "preservation", and "storage" have meaningful technical distinctions – as meaningful as the difference between "brain" and "mind" to a neuroscientist. To avoid confusion, and to articulate the special technical needs of managing digital information as compared with analogue, many professionals are now using the term "persistence" to mean long-term access – preservation by another name. The terms "preservation", "long-term access", and "persistence" will be used interchangeably here.
Technical Challenges to Digital Preservation and Why They Matter
The goal of digital preservation is to ensure that digital information – be it textual, numeric, audio, visual, or geospatial – be accessible to a future user in an authentic and complete form. Digital objects are made of bit streams of 0s and 1s arranged in a logical order that can be rendered onto an interface (usually a screen) through computer hardware and software. The persistence of both the bit stream and the logical order for rendering is essential for long-term access to digital objects.
As described by computer scientists and engineers, the two salient challenges to digital preservation are:
• physical preservation: how to maintain the integrity of the bits, the 0s and 1s that reside on a storage medium such as a CD or hard drive; and
• logical preservation: how to maintain the integrity of the logical ordering of the object, the code that makes the bits "renderable" into digital objects.
In the broader preservation community outside the sphere of computer science, these challenges are more often spoken of as:
• media degradation: how to ensure that the bits survive intact and that the magnetic tape or disk or drive on which they are stored does not degrade, demagnetize, or otherwise suffer data loss (this type of loss is also referred to as "bit rot"); and
• hardware/software dependencies: how to ensure that data can be rendered or read in the future when the software they were written in and/or the hardware on which they were designed to run are obsolete and no longer supported at the point of use.
Because of these technical dependencies, digital objects are by nature very fragile, often more at risk of data loss and even sudden death than information recorded on brittle paper or nitrate film stock.
And from these overarching technical dependencies devolve nearly all other factors that put digital data at high risk of corruption, degradation, and loss – the legal, social, intellectual, and financial factors that will determine whether or not we are able to build an infrastructure that will support preservation of these valuable but fragile cultural and intellectual resources into the future. It may well be that humanities scholars, teachers, and students are most immediately affected by the copyright restrictions, economic barriers, and intellectual challenges to working with digital information, but it will be difficult to gain leverage over any of these problems without a basic understanding of the ultimate technical problems from which all these proximate problems arise. Without appreciating the technical barriers to preservation and all the crucial dependencies they entail, humanists will not be able to create or use digital objects that are authentic, reliable, and of value into the future. So a bit more detail is in order.
Media degradation: Magnetic tape, a primary storage medium for digital as well as analogue information, is very vulnerable to physical deterioration, usually in the form of separation of the signal (the encoded information itself) from the substrate (the tape on which the thin layer of bits resides). Tapes need to be "exercised" (wound and rewound) to maintain even tension and to ensure that the signal is not separating. Tapes also need to be reformatted from time to time, though rates of deterioration are surprisingly variable (as little as five years in some cases) and so the only sure way to know tapes are sound is by frequent and labor-intensive examinations. CDs are also known to suffer from physical degradation, also in annoyingly unpredictable ways and time frames. (Preservationists have to rely on predictable rates of loss if they are to develop preservation strategies that go beyond hand-crafted solutions for single-item treatments. Given the scale of digital information that deserves preservation, all preservation strategies will ultimately need to be automated in whole or in part to be effective.) Information stored on hard drives is generally less prone to media degradation. But these media have not been in use long enough for there to be meaningful data about how they have performed over the decades. Finally, a storage medium may itself be physically intact and still carry information, or signal, that has suffered degradation – tape that has been demagnetized is such an example.
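One common way to automate the detection of bit-level loss of this kind is a "fixity check": a checksum is recorded when an object is deposited and recomputed later, and any mismatch signals that the bits have changed. The following is a minimal sketch in Python, not a description of any particular repository's practice; the function names sha256_of and is_intact are hypothetical, and the choice of SHA-256 is an assumption made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded_digest: str) -> bool:
    """True if the file's current digest matches the digest recorded at deposit."""
    return sha256_of(path) == recorded_digest
```

A check of this sort can only report that the bits have changed. It cannot repair them, which is why it is normally paired with multiple copies, and it says nothing about whether the intact bits can still be rendered.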
Hardware/software obsolescence: Data can be perfectly intact physically on a storage medium and yet be unreadable because the hardware and software – the playback machine and the code in which the data are written – are obsolete. We know a good deal about hardware obsolescence and its perils already from the numerous defunct playback machines that "old" audio and visual resources required, such as Beta video equipment, 16 mm home movie projectors, and numerous proprietary dictation machines. The problem of software obsolescence may be newer, but it is chiefly the proliferation of software codes, and their rapid supersession by the next release, that makes this computer software concern intractable. In the case of software, which comprises the operating system, the application, and the format, there are multiple layers that each require attending to when preservation strategies, such as those listed below, are developed.
There are currently four strategies under various phases of research, development, and deployment for addressing the problems of media degradation and hardware/software obsolescence (Greenstein and Smith 2002).
• Migration. Digital information is transferred, or rewritten, from one hardware/software configuration to a more current one over time as the old formats are superseded by new ones. Often, a digital repository where data are stored will reformat or "normalize" data going into the repository; that is, the repository will put the data into a standard format that can be reliably managed over time. As necessary and cost-effective as this process may be in the long run, it can be expensive and time-consuming. In addition, digital files translated into another format will lose some information with each successive reformatting (loss similar to that of translations from one language to another), ranging from formatting or presentation information to potentially more serious forms of loss. Migration works best for simple data formats and does not work well at all for multimedia objects. It is the technique most commonly deployed today and it shows considerable reliability with ASCII text and some numeric databases of the sort that financial institutions use.
• Emulation. Emulation aims to preserve the look and feel of a digital object, that is, to preserve the functionality of the software as well as the information content of the object. It requires that information about the encoding and hardware environments be fully documented and stored with the object itself so that it can be emulated or essentially recreated on successive generations of hardware/software (though, of course, if that information is itself digital, further problems of accessibility to that information must also be anticipated). Emulation for preservation is currently only in the research phase. People are able to emulate retrospectively recently deceased genres of digital objects – such as certain computer games – but not prospectively for objects to be read on unknown machines and software programs in the distant future. Many in the field doubt that proprietary software makers will ever allow their software code to accompany objects, as stipulated by emulation, and indeed some in the software industry say that documentation is never complete enough to allow for the kind of emulation 100 or 200 years out that would satisfy a preservation demand (Rothenberg 1999; Bearman 1999; Holdsworth and Wheatley 2000). Perhaps more to the point, software programs that are several decades or centuries old may well be no more accessible to contemporary users than medieval manuscripts are accessible to present-day readers who have not been trained to read medieval Latin in a variety of idiosyncratic hands. Nonetheless, emulation remains a tantalizing notion that continues to attract research dollars.
• Persistent object preservation. A relatively new approach being tested by the National Archives and Records Administration for electronic records such as e-mails, persistent object preservation (POP) "entails explicitly declaring the properties (e.g., content, structure, context, presentation) of the original digital information that ensure its persistence" (Greenstein and Smith 2002). It envisions wrapping a digital object with the information necessary to recreate it on current software (not the original software envisioned by emulation). This strategy has been successfully tested in its research phase and the Archives is now developing an implementation program for it. At present it seems most promising for digital information objects such as official records and other highly structured genres that do not require extensive normalization, that is, conversion to a common preservation norm upon deposit into the repository. Ironically, this approach is conceptually related to the efforts of some digital artists, creating very idiosyncratic and "un-normalizable" digital creations, who are making declarations at the time of creation on how to recreate the art at some time in the future. They do so by specifying which features of the hardware and software environment are intrinsic and authentic, and which are fungible and need not be preserved (such things as screen resolution, processing speed, and so forth, that can affect the look and feel of the digital art work) (Thibodeau 2002; Mayfield 2002).
• Technology preservation. This strategy addresses future problems of obsolescence by preserving the digital object together with the hardware, operating system, and program of the original. While many will agree that, for all sorts of reasons, someone somewhere should be collecting and preserving all generations of hardware and software in digital information technology, it is hard to imagine this approach as much more than a technology museum attempting production-level work, doomed to an uncertain future. It is unlikely to be scalable as an everyday solution for accessing information on orphaned platforms, but it is highly likely that something like a museum of old technology, with plentiful documentation about the original hardware and software, will be important for future digital archaeology and data mining. Currently, digital archaeologists are able, with often considerable effort, to rescue data from degraded tapes and corrupted (or erased) hard drives. With some attention now to capturing extensive information about successive generations of hardware and software, future computer engineers should be able to get some information off the old machines.
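The information loss that migration can introduce, described under the first strategy above, can be illustrated on a very small scale with text encodings. The sketch below, in Python, migrates the same legacy text to two target encodings: one that can represent every character and one that cannot. The function names migrate_text and survives_migration are hypothetical, and a character encoding stands in here, as a simplifying assumption, for the far richer file formats a repository would actually handle.

```python
def migrate_text(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Re-encode text, substituting '?' for characters the target cannot represent."""
    text = raw.decode(source_encoding)
    return text.encode(target_encoding, errors="replace")


def survives_migration(raw: bytes, source_encoding: str, target_encoding: str) -> bool:
    """True if the migrated text reads back identically to the original text."""
    migrated = migrate_text(raw, source_encoding, target_encoding)
    return migrated.decode(target_encoding) == raw.decode(source_encoding)


# Illustrative legacy data: accented text stored in the Latin-1 encoding.
legacy = "caf\u00e9 r\u00e9sum\u00e9".encode("latin-1")

# Migrating to UTF-8 keeps every character; migrating to ASCII does not.
lossless = survives_migration(legacy, "latin-1", "utf-8")
lossy = not survives_migration(legacy, "latin-1", "ascii")
```

Comparing the content before and after migration, as survives_migration does, is one simple way to find out at the time of migration, rather than decades later, that something was lost.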
The preservation and access trade-offs for digital information are similar to those for analogue. To make an informed decision about how to preserve an embrittled book whose pages are crumbling or spine is broken, for example, one must weigh the relative merits of aggressive and expensive conservation treatment to preserve the book as an artifact, versus the less expensive option of reformatting the book's content onto microfilm or scanning it, losing most of the information integral to the artifact as physical object in the process. One always needs to identify the chief values of a given resource to decide between a number of preservation options. In this case, the question would be whether one values the artifactual or the informational content more highly. This consideration would apply equally in the digital realm: some of the technologies described above will be cheaper, more easily automated, and in that sense more scalable over time than others. Migration appears to be suitable for simpler formats in which the look and feel matters less and in which loss of information at the margins is an acceptable risk. The other approaches have not been tested yet in large scale over several decades, but the more options we have for ensuring persistence, the likelier we are to make informed decisions about what we save and how. There is no silver bullet for preserving digital information, and that may turn out to be good news in the long run.
Crucial Dependencies and Their Implications for the Humanities
Preservation by benign neglect has proven an amazingly robust strategy over time, at least for print-on-paper. One can passively manage a large portion of library collections fairly cheaply. One can put a well-catalogued book on a shelf in good storage conditions and expect to be able to retrieve it in 100 years in fine shape for use if no one has called it from the shelf. But neglect in the digital realm is never benign. Neglect of digital data is a death sentence. A digital object needs to be optimized for preservation at the time of its creation (and often again at the time of its deposit into a repository), and then it must be conscientiously managed over time if it is to stand a chance of being used in the future.
The need for standard file formats and metadata and the role of data creators
All the technical strategies outlined above are crucially dependent on standard file formats and metadata schemas for the creation and persistence of digital objects. File formats that are proprietary are often identified as being especially at risk, because they are in principle dependent on support from an enterprise that may go out of business. Even a format so widely used that it is a de facto standard, such as Adobe Systems, Inc.'s portable document format (PDF), is treated with great caution by those responsible for persistence. The owner of such a de facto standard has no legal obligation to release its source code or any other proprietary information in the event that it goes bankrupt or decides to stop supporting the format for one reason or another (such as creating a better and more lucrative file format).
Commercial interests are not always in conflict with preservation interests, but when they are, commercial interests must prevail if the commerce is to survive. For that reason, the effort to develop and promote adoption of non-proprietary software, especially so-called open source code, is very strong among preservationists. (Open source, as opposed to proprietary, can be supported by non-commercial as well as commercial users.) But to the extent that commercial services are often in a better position to support innovation and development efforts, the preservation community must embrace both commercial and non-commercial formats. While preservationists can declare which standards and formats they would like to see used, dictating to the marketplace, or ignoring it altogether, is not a promising solution to this problem. One way to ensure that important but potentially vulnerable proprietary file formats are protected if they are orphaned is for leading institutions with a preservation mandate – national libraries, large research institutions, or government archives – to develop so-called fail-safe agreements with software makers that allow the code for the format to go into receivership or be deeded over to a trusted third party. (For more on file formats, see chapter 32, this volume.)
Metadata schemas – approaches to describing information assets for access, retrieval, preservation, or internal management – are another area in which there is a delicate balance between what is required for ease of access (and of creation) and what is required to ensure persistence. Extensive efforts have been made by librarians, archivists, and scholars to develop sophisticated markup schemes that are preservation-friendly, such as the Text Encoding Initiative (TEI) Guidelines, which were first expressed in SGML, or Encoded Archival Description, which, like the current instantiation of the TEI, is written in XML, and open for all communities, commercial and non-commercial alike. The barriers to using these schemas can be high, however, and many authors and creators who understand the importance of creating good metadata nevertheless find these schemas too complicated or time-consuming to use consistently. It can be frustrating to find out that there are best practices for the creation of preservable digital objects, but that those practices are prohibitively labor-intensive for most practitioners.
The more normalized and standard a digital object is, the easier it is for a digital repository to take it in (a process curiously called "ingest"), to manage it over time, and to provide the objects back to users in their original form. Indeed, most repositories under development in libraries and archives declare that they will assume responsibility for persistence only if the objects they receive are in certain file formats accompanied by certain metadata. This is in sharp contrast to the more straightforward world of books, photographs, or maps, where one can preserve the artifact without having to catalogue it first. Boxes of unsorted and undescribed sources can languish for years before being discovered, and once described or catalogued, they can have a productive life as a resource. Although fully searchable text could, in theory, be retrieved without much metadata in the future, it is hard to imagine how a complex or multimedia digital object that goes into storage of any kind could ever survive, let alone be discovered and used, if it were not accompanied by good metadata. This creates very large up-front costs for digital preservation, both in time and money, and it is not yet clear who is obligated to assume those costs.
For non-standard or unsupported formats and metadata schemas, digital repositories might simply promise that they will deliver back the bits as they were received (that is, provide physical preservation) but will make no such promises about the legibility of the files (that is, not guarantee logical preservation). Though developed in good faith, these policies can be frustrating to creators of complex digital objects, or to those who are not used to or interested in investing their own time in preparing their work for permanent retention. Preparing work for permanent retention is what publishers and libraries have traditionally done, after all, and some ask why it should be different now.
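The two-tier commitment described above can be expressed as a simple ingest rule. The sketch below, in Python, is an illustration only: the SUPPORTED_FORMATS set, the Deposit record, and the preservation_level function are all hypothetical, and real repository policies are considerably more detailed.

```python
from dataclasses import dataclass

# Hypothetical list of formats a repository has agreed to keep renderable.
SUPPORTED_FORMATS = frozenset({"xml", "txt", "tiff"})


@dataclass
class Deposit:
    filename: str
    file_format: str    # for example "xml" or "tiff"
    has_metadata: bool  # whether the required metadata accompanies the file


def preservation_level(deposit: Deposit, supported=SUPPORTED_FORMATS) -> str:
    """Return the level of commitment the repository makes for a deposit.

    "logical"  = bits kept intact and kept renderable over time
    "physical" = bits returned exactly as received, with no promise of legibility
    """
    if deposit.file_format.lower() in supported and deposit.has_metadata:
        return "logical"
    return "physical"
```

Under this rule a well-described XML file would receive the fuller commitment, while the same file deposited without metadata, or a file in an unsupported proprietary format, would receive bit-level storage only.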
There is no question that a digital object's file format and metadata schema greatly affect its persistence and how it will be made available in the future. This crucial dependency of digital information on format and markup raises the question of who should pay for file preparation and what economic model will support this expensive enterprise. Who are the stakeholders in digital preservation, and what are their roles in this new information landscape? There is now an interesting negotiation under way between data creators and distributors on the one hand, and libraries on the other, about who will bear the costs of ingest. Various institutions of higher learning that are stepping up to the challenge of digital preservation are working out a variety of local models that bear watching closely (see below, the section on solutions and current activities).
Need for early preservation action and the role of copyright
Regardless of the outcome, it seems clear that those who create intellectual property in digital form need to be more informed about what is at risk if they ignore longevity issues at the time of creation. This means that scholars should be attending to the information resources crucial to their fields by developing and adopting the document standards vital for their research and teaching, with the advice of preservationists and computer scientists where appropriate. The examples of computing-intensive sciences such as genomics that have developed professional tracks in informatics might prove fruitful for humanists as more and more computing power is applied to humanistic inquiry and pedagogy. Such a function would not be entirely new to the humanities; in the nineteenth century, a large number of eminent scholars became heads of libraries and archives in an age when a scholar was the information specialist par excellence.
One of the crucial differences between the information needs of scientists and those of humanists is that the latter tend to use a great variety of sources that are created outside the academy and that are largely protected by copyright. Indeed, there is probably no type of information created or recorded by human beings that could not be of value for humanities research at some time, and little of it may be under the direct control of the researchers who most value it for its research potential. The chief concern about copyright that impinges directly on preservation is the length of copyright protection that current legislation extends to the rights holders – essentially, the life of the creator plus 70 years (or more) (Copyright Office, website). Why that matters to preservation goes back to the legal regime that allows libraries and archives to preserve materials that are protected by copyright. Institutions with a preservation mission receive or buy information – books, journals, manuscripts, maps – to which they may have no intellectual rights. But rights over the physical objects themselves do transfer, and the law allows those institutions to copy the information in those artifacts for the purposes of preservation.
This transfer of property rights to collecting institutions breaks down with the new market in digital information. Publishers and distributors of digital information very seldom sell their wares. They license them. This means that libraries no longer own the journals, databases, and other digital intellectual property to which they provide access, and they have no incentive to preserve information that they essentially rent. Because publishers are not in the business of securing and preserving information "in perpetuity", as the phrase goes, there is potentially a wealth of valuable digital resources that no institution is claiming to preserve in this new information landscape. Some libraries, concerned about the potentially catastrophic loss of primary sources and scholarly literature, have successfully negotiated "preservation clauses" in their licensing agreements, stipulating that publishers give them physical copies (usually CDs) of the digital data if they cease licensing it, so that they can have perpetual access to what they paid for. While CDs are good for current access needs, few libraries consider them to be archival media. Some commercial and non-commercial publishers of academic literature have forged experimental agreements with libraries, ensuring that in the event of a business failure, the digital files of the publisher will go to the library.
It is natural that a bibliocentric culture such as the academy has moved first on the issue of what scholars themselves publish. The greater threat to the historical record, however, is not to the secondary literature on which publishers and libraries are chiefly focused. The exponential growth of visual resources and sound recordings in the past 150 years has produced a wealth of primary source materials in audiovisual formats to which humanists will demand access in the future. It is likely that most of these resources, from performing arts to moving image, photographs, music, radio and television broadcasting, geospatial objects, and more, are created for the marketplace and are under copyright protection (Lyman and Varian 2000).
Efforts have barely begun to negotiate with the major media companies and their trade associations to make film and television studios, recording companies, news photo services, digital cartographers, and others aware of their implied mandate to preserve their corporate digital assets for the greater good of a common cultural heritage. As long as those cultural and intellectual resources are under the control of enterprises that do not know about and take up their preservation mandate, there is a serious risk of major losses for the future, analogous to the fate of films in the first 50 years of their existence. More than 80 percent of silent films made in the United States, and 50 percent of films made before 1950, are lost, presumably for ever. Viewing film as "commercial product", the studios had no interest in retaining them after their productive life, and libraries and archives had no interest in acquiring them on behalf of researchers. It is critical that humanists start today to identify the digital resources that may be of great value now or in the future so that they can be captured and preserved before their date of expiration arrives.
Need to define the values of the digital object and the role of the humanist
Among the greatest values of digital information technologies for scholars and students is the ability of digital information to transform the very nature of inquiry. Not bound to discrete physical artifacts, digital information is available anywhere at any time (dependent on connectivity). Through computing applications that most computer users will never understand in fine grain, digital objects can be easily manipulated, combined, erased, and cloned, all without leaving the physical traces of tape, erasure marks and whiteouts, the tell-tale shadow of the photocopied version, and other subtle physical clues that apprise us of the authenticity and provenance, or origin, of the photograph or map we hold in our hands.
Inquiring minds who need to rely on the sources they use to be authentic – for something to be what it purports to be – are faced with special concerns in the digital realm, and considerable work is being done in this area to advance our trust in digital sources and to develop digital information literacy among users (Bearman and Trant 1998; CLIR 2000). Beyond the crucial issue of authenticity, the malleability of digital information most affects humanistic research and preservation in the matter of fixity and stability. While the genius of digital objects is their ability to be modified for different purposes, there are many reasons why information must be fixed and stable at some point to be reliable in the context of research and interpretation. For example, to the extent that research and interpretation build on previous works, both primary and secondary, it is important for the underlying sources of an interpretation or scientific experiment or observation to be accessible to users in the form in which they were cited by the creator. It would be useless to have an article proposing a new interpretation of the Salem witch trials rely on diary sources that are not accessible to others to investigate and verify. But when writers cite as their primary sources web-based materials and the reader can find only a dead link, that is, in effect, the same thing.
To the extent that the pursuit of knowledge in any field builds upon the work of others, the chain of reference and ease of linking to reference sources are crucial. Whose responsibility is it to maintain the persistence of links in an author's article or a student's essay? Somehow, we expect the question of persistence to be taken care of by some vital but invisible infrastructure, not unlike the water that comes out of the tap when we turn the knob. Clearly, that infrastructure does not yet exist. But even if it did, there are still nagging issues about persistence that scholars and researchers need to resolve, such as the one known as "versioning", or deciding which iteration of a dynamic and changing resource should be captured and curated for preservation. This is a familiar problem to those who work in broadcast media, and digital humanists can profit greatly from the sophisticated thinking that has gone on in audiovisual archives for generations.
The problem of persistent linking to sources has larger implications for the growth of the humanities as a part of academic life, and for the support of emerging trends in scholarship and teaching. Until publishing a journal article, a computer model, or a musical analysis in digital form is seen as persistent and therefore a potentially long-lasting contribution to the chain of knowledge creation and use, few people will be attracted to work for reward and tenure in these media, no matter how superior the media may be for the research into and expression of an idea.
Solutions and Current Activities
There has been considerable activity in both the basic research communities (chiefly among computer scientists and information scientists) and at individual institutions (chiefly libraries and federal agencies) to address many of the critical technical issues of building and sustaining digital repositories for long-term management and persistence. The private sector, while clearly a leading innovator in information technologies, both in the development of hardware and software and in the management of digital assets such as television, film, and recorded sound, has not played a leading public role in the development of digital preservation systems. That is primarily because the time horizons of the preservation community and of the commercial sectors are radically different. Most data storage systems in the private sector aim for retention of data for no more than five to ten years (the latter not being a number that a data storage business will commit to). The time horizon of preservation for libraries, archives, and research institutions must include many generations of inquiring humans, not just the next two generations of hardware or software upgrades.
Because digital preservation is so complex and expensive, it is unlikely that digital repositories will spring up in the thousands of institutions that have traditionally served as preservation centers for books. Nor should they. In a networked environment in which one does not need access to a physical object to have access to information, the relationship between ownership (and physical custody) of information and access to it will be transformed. Within the research and nonprofit communities, it is likely that the system of digital preservation repositories, or digital archives, will be distributed among a few major actors that work on behalf of a large universe of users. They will be, in other words, part of the so-called public goods information economy that research and teaching have traditionally relied upon for core services such as preservation and collection building.
Among the major actors in digital archives will be academic disciplines whose digital information assets are crucial to the field as a whole. Examples include organizations such as the Inter-university Consortium for Political and Social Research (ICPSR), which manages social science datasets, and the Human Genome Data Bank, which preserves genetic data. Both are supported directly by the disciplines themselves and through federal grants. The data in these archives are not necessarily complex, but they are highly structured and the depositors are responsible for preparing the data for deposit. JSTOR, a non-commercial service that preserves and provides access to digital versions of key scholarly journals, is another model of a preservation enterprise designed to meet the needs of researchers. It is run on behalf of researchers and financially supported by libraries through subscriptions. (Its start-up costs were provided by a private foundation.)
Some large research university libraries, such as the University of California, Harvard University, Massachusetts Institute of Technology, and Stanford University, are beginning to develop and deploy digital repositories that will be responsible for some circumscribed portion of the digital output of their faculties. Another digital library leader, Cornell, has recently taken under its wing the disciplinary pre-print archive developed to serve the high-energy physics community and its need for rapid dissemination of information among the small but geographically far-flung members of that field. In the sciences, there are several interesting models of digital information being created, curated, and preserved by members of the discipline, among which arXiv.org is surely the best known. This model appears difficult to emulate in the humanities because arts and humanities disciplines do not create shared information resources that are then used by many different research teams. Nevertheless, where redundant collections, such as journals and slide libraries, exist across many campuses, subscription-based services are being developed to provide access to and preserve those resources through digital surrogates. Examples include JSTOR, AMICO, and ARTstor, now under development. The economies of scale can be achieved only if the libraries and museums that subscribe to these services believe that the provider will persistently manage the digital surrogates and that they are therefore able to dispose of extra copies of journals or slides.
Learned societies may be a logical locus of digital archives, as they are trusted third parties within a field and widely supported by members of the communities they serve. But they are not, as a rule, well positioned to undertake the serious capital expenditures that repository services require. That seems to be the reason why these subscription-based preservation and access services have appeared in the marketplace.
The Library of Congress (LC), which is the seat of the Copyright Office and receives for inclusion in its collections one or more copies of all works deposited for copyright protection, is beginning to grapple with the implications of digital deposits and what they mean for the growth of the Library's collections. (At present, with close to 120 million items in its collections, it is several times larger than other humanities collections in the United States.) LC is developing a strategy to build a national infrastructure for the preservation of digital heritage that would leverage the existing and future preservation activities across the nation (and around the globe) to ensure that the greatest number of people can have persistent rights-protected access to that heritage. The National Archives is also working to acquire and preserve the digital output of the federal government, though this will entail an expenditure of public resources for preservation that is unprecedented in a nation that prides itself on its accountability to its people.
The final word about major actors in preservation belongs to the small group of visionary private collectors that have fueled the growth of great humanities collections for centuries. The outstanding exemplar of the digital collector is Brewster Kahle, who has designed and built the Internet Archive, which captures and preserves a large number of publicly available sites. While the visible and publicly available Web that the Internet Archive harvests is a small portion of the total Web (Lyman 2002), the Archive has massive amounts of culturally rich material. Kahle maintains the Archive as a preservation repository – that is its explicit mission – and in that sense his enterprise is a beguiling peek into the future of collecting in the digital realm.
Selection for Preservation
While much work remains to ensure that digital objects of high cultural and research value persist into the future, many experts are cautiously optimistic that, with enough funding and will, technical issues will be addressed and acceptable solutions will be found. The issue that continues to daunt the most thoughtful among those engaged in preservation strategies is selection: how to determine what, of the massive amount of information available, should be captured and stored and managed over time.
In theory, there is nothing created by the hands of mankind that is not of potential research value for humanities scholars, even the humblest scrap of data – tax records, laundry lists, porn sites, personal websites, weblogs, and so forth. Technology optimists who believe that it will be possible to "save everything" through automated procedures have advocated doing so. Some teenager is out there today, they point out, who will be president of the United States in 30 years, and she no doubt already has her own website. If we save all websites we are bound to save hers. If completeness of the historical record is a value that society should support, then saving everything that can be saved appears to be the safest policy to minimize the risk of information loss.
There may well be compelling social reasons to capture everything from the Web and save it forever – if it were possible and if there were no legal and privacy issues. (Most of the Web, the so-called Deep Web, is not publicly available, and many copyright experts tend to think that all non-federal sites are copyright-protected.) Certainly, everything created by public officials in the course of doing their business belongs in the public record, in complete and undistorted form. Further, there are the huge and expensive stores of data about our world – from census data to the petabytes of data sent back to Earth from orbiting satellites – that may prove invaluable in future scientific problem solving. On the other hand, humanists have traditionally valued the enduring quality of an information object as much as the quantity of raw data it may yield. There is reason to think that humanists in the future will be equally interested in the depth of information in a source as they are in the sheer quantity of it. Indeed, some historians working in the modern period already deplore the promiscuity of paper-based record making and keeping in contemporary life.
But the digital revolution has scarcely begun, and the humanities have been slower to adopt the technology and test its potential for transforming the nature of inquiry than have other disciplines that rely more heavily on quantitative information. Some fields – history is a good example – have gone through periods when quantitative analysis has been widely used, but this was before the advent of computing power on today's scale. If and when humanists discover the ways that computers can truly change the work of research and teaching, then we can expect to see the growth of large and commonly used databases that will demand heavy investments in time and money and that will, therefore, beg the question of persistence.
We will not be able in the future to rely on traditional assessments of value for determining what deserves preservation. In the digital realm, there will be no uniqueness, no scarcity, no category of "rare." There will remain only the signal categories of evidential value, aesthetic value, and associational value, the very criteria that are, by their subjectivity, best assessed by scholar experts. The role of humanists in building and preserving collections of high research value will become as important as it was in the Renaissance or the nineteenth century. Unlike those eras, however, when scholars could understand the value of sources as they revealed themselves over time, there is now no distinction between collecting "just in case" something proves later to be valuable, and "just in time" for someone to use now. Scholars cannot leave it to later generations to collect materials created today. They must assume a more active role in the stewardship of research collections than they have played since the nineteenth century or, indeed, ever.
Digital Preservation as a Strategy for Preserving Non-digital Collections
There seems little doubt that materials that are "born digital" need to be preserved in digital form. Yet another reason why ensuring the persistence of digital information is crucial to the future of humanities scholarship and teaching is that a huge body of very fragile analogue materials demands digital reformatting for preservation. Most moving image and recorded sound sources exist on media that are fragile, such as nitrate film, audio tapes, or lacquer disks, and they all require reformatting onto fresher media to remain accessible for use. Copying analogue information to another analogue format results in significant loss of signal within a generation or two (imagine copying a video of a television program over and over), so most experts believe that reformatting onto digital media is the safest strategy. Once the information is in digital form, further copying does not result in significant loss of information (in most cases, it results in no loss at all).
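The contrast between analogue and digital copying can be made concrete with a small simulation. The sketch below is illustrative only and is not drawn from any preservation system: the test tone, the noise level of 0.05 per copy, the 8-bit quantization, and the function names are all assumptions chosen for the example. It copies a signal through five analogue generations, each of which adds a little fresh noise, and through five digital generations, each of which duplicates the stored values exactly.

```python
import math
import random


def analogue_copy(signal, noise_std, rng):
    """Return a copy of the signal with fresh random noise added to every sample."""
    return [sample + rng.gauss(0.0, noise_std) for sample in signal]


def digital_copy(samples):
    """Return an exact duplicate of a sequence of stored byte values."""
    return bytes(samples)


def snr_db(original, copy):
    """Signal-to-noise ratio of a copy relative to the original, in decibels."""
    signal_power = sum(s * s for s in original)
    noise_power = sum((c - s) ** 2 for s, c in zip(original, copy))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


rng = random.Random(0)  # fixed seed so the run is repeatable
original = [math.sin(2 * math.pi * i / 50) for i in range(1000)]  # assumed test tone

# Analogue chain: every generation is copied from the previous copy,
# so the noise added at each step accumulates.
analogue = original
analogue_snr = []
for _ in range(5):
    analogue = analogue_copy(analogue, noise_std=0.05, rng=rng)
    analogue_snr.append(snr_db(original, analogue))

# Digital chain: the signal is quantized once (assumed 8-bit),
# then every later generation is a bit-for-bit duplicate.
digital_master = bytes(round((s + 1) / 2 * 255) for s in original)
digital = digital_master
for _ in range(5):
    digital = digital_copy(digital)

for generation, value in enumerate(analogue_snr, start=1):
    print(f"analogue generation {generation}: SNR {value:.1f} dB")
print("digital generation 5 identical to master:", digital == digital_master)
```

In this sketch the only loss on the digital side occurs at the initial conversion, where the signal is quantized; every copy made afterwards is identical to the digital master, whereas each analogue generation is measurably further from the original.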
For those who prize access to the original source materials for research and teaching, the original artifact is irreplaceable. For a large number of other uses, an excellent surrogate or access copy is fine. Digital technology can enhance the preservation of artifacts by providing superlative surrogates of original sources while at the same time protecting the artifact from overuse. (See chapter 32, this volume.)
All this means that, as creators and consumers of digital information, humanists are vitally interested in digital preservation, both for digital resources and for the abundance of valuable but physically fragile analogue collections that must rely on digitization. Given the extraordinary evanescence of digital information, it is crucial now for them to engage the copyright issues, to develop the economic models for building and sustaining the core infrastructure that will support persistent access, and, most proximate to their daily lives, to ensure that students and practitioners of the humanities are appropriately trained and equipped to conduct research, write up results for dissemination, and enable a new generation of students to engage in the open-ended inquiry into culture that is at the core of the humanistic enterprise.
The author thanks Amy Friedlander, Kathlin Smith, and David Rumsey for their invaluable help.
Bearman, David (1999). Reality and Chimeras in the Preservation of Electronic Records. D-Lib Magazine 5, 4. At http://www.dlib.org/dlib/april99/bearman/04bearman.html.
Bearman, David and Jennifer Trant (1998). Authenticity of Digital Resources: Towards a Statement of Requirements in the Research Process. D-Lib Magazine (June). At http://www.dlib.org/dlib/june98/06bearman.html.
Council on Library and Information Resources (2000). Authenticity in a Digital Environment. Washington, DC: Council on Library and Information Resources. At http://www.clir.org/pubs/reports/pub92/contents.html.
Greenstein, Daniel and Abby Smith (2002). Digital Preservation in the United States: Survey of Current Research, Practice, and Common Understandings. At http://www.digitalpreservation.gov.
Holdsworth, David and Paul Wheatley (2000). Emulation, Preservation and Abstraction. CAMiLEON Project, University of Leeds. At http://220.127.116.11/CAMiLEON/dh/ep5.html.
Lyman, Peter (2002). Archiving the World Wide Web. In Building a National Strategy for Digital Preservation: Issues in Digital Media Archiving. Washington, DC: Council on Library and Information Resources and the Library of Congress. At http://www.clir.org/pubs/reports/pub106/contents.html; and http://www.digitalpreservation.gov/ndiipp/repor/repor_back_web.html.
Lyman, Peter, and Hal R. Varian (2000). How Much Information? At http://www.sims.berkeley.edu/how-much-info.
Mayfield, Kendra (2002). How to Preserve Digital Art. Wired (July 23). At http://www.wired.com/news/culture/0,1284,53712,00.html.
Price, Laura and Abby Smith (2000). Managing Cultural Assets from a Business Perspective. Washington, DC: Council on Library and Information Resources. At http://www.clir.org/pubs/reports/pub90/contents.html.
Rothenberg, Jeff (1999). Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Washington, DC: Council on Library and Information Resources. At http://www.clir.org/pubs/reports/pub77/contents.html.
Thibodeau, Kenneth (2002). Overview of Technological Approaches to Digital Preservation and Challenges in Coming Years. In The State of Digital Preservation: An International Perspective. Conference Proceedings. Documentation Abstracts, Inc., Institutes for Information Science, Washington, DC, April 24–25, 2002. Washington, DC: Council on Library and Information Resources. At http://www.clir.org/pubs/reports/pub107/contents.html.
Websites of Organizations and Projects Noted
Art Museum Image Consortium (AMICO). http://www.amico.org.
Internet Archive. http://www.archive.org.
Inter-university Consortium for Political and Social Research (ICPSR). http://www.icpsr.umich.edu.
Library of Congress. National Digital Information Infrastructure and Preservation Program. http://www.digitalpreservation.gov.
National Archives and Records Administration (NARA). Electronic Records Archives. http://www.archives.gov/electronic_records_archives.
For Further Reading
On the technical, social, organizational, and legal issues related to digital preservation, the best source continues to be Preserving Digital Information: Report of the Task Force on Archiving of Digital Information (Washington, DC, and Mountain View, CA: Commission on Preservation and Access and the Research Libraries Group, Inc.). Available at http://www.rlg.org/ArchTF.
Digital preservation is undergoing rapid development, and few publications remain current for long. To keep abreast of developments in digital preservation, see the following sources, which themselves regularly digest or cite the latest information on the subject and present the leading research of interest to humanists:
D-Lib Magazine. http://www.dlib.org.
Council on Library and Information Resources (CLIR). http://www.clir.org.
Digital Library Federation (DLF). http://www.diglib.org.
National Science Foundation (NSF), Digital Libraries Initiative (DLI). http://www.dli2.nsf.gov.
National Library of Australia, Preserving Access to Digital Information (PADI). http://www.nla.gov.au/padi/format/case.html.