Fading Away... The challenge of sustainability in digital studies
Abstract
This paper emphasizes the need to treat sustainability as a key element of digital studies and digital hermeneutics. It addresses the inherent tensions between the long-term needs of data preservation and maintenance on the one hand, and the short life cycles of data formats, platforms and infrastructures on the other. The challenges are not limited to the technical maintenance of software, tools and data; they also extend to the wider institutional contexts, epistemic traditions and social practices in which research in the social sciences and humanities is embedded. We explore these tensions across several levels, temporalities and key stages of research, namely gaining access to data and building a corpus, establishing a research framework and conducting analysis, and finally using and disseminating results.
The dominant economy of the traces of each era is based on agents of different ages. Within the framework of slow-moving institutions, whose role is precisely to make the system stagnate, the rules of this economy change at the faster pace of social change, which it adjusts with technologies that change faster still, and which can produce unexpected or contrary effects on the logic of the first two levels. This stratification of collective equilibria induces an interweaving of technical environments and innovative societal projects with older or even archaic beliefs or hierarchies. (Merzeau 1998, our translation)
Introduction
1. From data access to corpus building
1.1 Diversity and instability of data search and gathering
- Researchers (and librarians) who collect data in real time on Twitter run the risk of lacking distance and potentially overlooking important trends. As explained by Thomas Drugeon, who works at INA in France and launched an emergency Twitter collection during the Charlie Hebdo attack in January 2015 (a schematic sketch at the end of this subsection returns to the coverage problem he describes):
I added hashtags to the collection at lunchtime on the day after the events in the editorial offices at Charlie Hebdo. We acted more quickly for the events of 13 November. The Twitter flows weren’t organised in quite the same way either. The hashtags were more varied in November, whereas in January nearly everything was concentrated on the #jesuischarlie hashtag. In November, at least five hashtags emerged and we could observe movements, cycles too, for example day/night reflecting the different time zones at international level. For #jesuischarlie we missed the peak at the beginning. [...]
Valérie Schafer: And did you also collect #jesuisahmed, #jenesuispascharlie?
Thomas Drugeon: No, we didn’t capture them ourselves, but we could access them indirectly, as a side effect. We didn’t want to start adding hashtags along the way as we wanted to make sure that we would have a homogeneous set.[6]
- Researchers who react after an event will not be able to fully access the data unless they buy them from Twitter or use datasets created by others (where available), be they individuals or librarians. Researchers then face other challenges: even where such datasets are documented, which underlines the importance of metadata [Pomerantz 2015], they remain reliant on the choices of others, possibly even of robots.
- As pointed out by Thomas Bottini and Virginie Julliard, who used INA archives to analyze the semiotic dimension of the “gender theory controversy” that played out on Twitter,[7] researchers need to “consider the often non-explicit archiving logic of these databases, as well as changes in their architecture and editorial policies” [Bottini and Julliard 2017, 47]. The massive availability of data in archived collections is not synonymous with “turnkey” corpora [Barats 2016] [Brunet 2012], since collecting involves specific criteria and curation, which call for knowledge of the archiving choices made.
Factiva, for example, collects and digitises texts without preserving the layout and format in which they were distributed. Such choices should be made accessible to researchers, whether or not they remain stable over time. Etienne Brunet compared the Frantext[9] textual database with the French works on Google Books: Frantext provides the researcher with defined and explicit collection criteria [Brunet 2012].
The source text [...] is then turned into various types of semiotic text, including a first, the “digital” text per se, in which it is encoded using 0s and 1s. This text is generally illegible for humans. A second, known as an “image text,” can be displayed on a screen or printed on paper and read as such, but any analysis of this version is typically “manual.” Then come the dynamic text, the annotated text and the edited text, and finally the readable, analysable text. So [...] digitisation does not produce a single “digital” copy of the text, but an entire galaxy of digitised texts which, interlinked and organised hierarchically, open up new avenues for interpretation and analysis. [Meunier 2018]
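Meunier’s “galaxy” can be pictured as a small hierarchy of interlinked representations, each derived from another. The following Python sketch is purely illustrative: the layer names are taken from the quotation above, but the data structure is ours, not Meunier’s.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TextLayer:
        """One representation of a source text in the 'galaxy' of digitised texts."""
        name: str
        derived_from: Optional[str]  # parent layer, None for the source text
        description: str

    # A schematic derivation chain for a single digitised work.
    galaxy = [
        TextLayer("source text", None, "the original document"),
        TextLayer("digital text", "source text", "0s and 1s, generally illegible for humans"),
        TextLayer("image text", "digital text", "rendered on screen or paper, analysed manually"),
        TextLayer("dynamic text", "digital text", "machine-manipulable character stream"),
        TextLayer("annotated text", "dynamic text", "enriched with explicit markup"),
        TextLayer("edited text", "annotated text", "the readable, analysable edition"),
    ]

    for layer in galaxy:
        print(f"{layer.name:15} <- {layer.derived_from or '-':15} {layer.description}")

Keeping such derivation links explicit is one way of documenting which of the many digitised texts an analysis was actually performed on.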
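Returning to the real-time collection problem Drugeon describes above: when hashtags are added to a live collection at different moments, each hashtag has its own coverage window, and recording those windows is itself essential metadata for later users of the dataset. The following minimal Python sketch is ours, not INA’s actual system, and the tweet format is an assumption.

    from datetime import datetime, timezone

    class HashtagCollector:
        """Illustrative real-time collector that records when each hashtag
        was added, so that every capture window is explicit metadata."""

        def __init__(self):
            self.tracked = {}   # hashtag -> UTC time it was added to the collection
            self.captured = []  # collected tweets

        def add_hashtag(self, tag):
            # Tweets posted before this moment will NOT be in the corpus:
            # this is how the #jesuischarlie peak was missed.
            self.tracked.setdefault(tag, datetime.now(timezone.utc))

        def collect(self, tweet):
            # 'tweet' is assumed to be a dict with a 'text' field.
            if any(tag in tweet["text"].lower() for tag in self.tracked):
                self.captured.append(tweet)

        def coverage_report(self):
            # Per-hashtag start of coverage, to be published with the dataset.
            return {tag: since.isoformat() for tag, since in self.tracked.items()}

    collector = HashtagCollector()
    collector.add_hashtag("#jesuischarlie")  # added at lunchtime, the day after the attack
    # Adding further hashtags along the way would open new, later windows,
    # which is precisely the homogeneity problem Drugeon describes.
    print(collector.coverage_report())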
1.2 Long-term preservation, trial and error
2. Conflicting temporalities and distributed materialities
2.1 Tools’ evolution and obsolescence
2.2 Temporalities of users
In general, we deal with the volatility of the data on a day-to-day basis: we have transferred our servers three times. Part of the corpus data is hosted on personal servers, outside the research institutions. However, we are fortunate to have redundant servers in the laboratory that we manage internally: all our data are saved in XML format (txt), which remains accessible and guarantees their use in the long term, and in a compressed form that goes from one server to another several times a day. In each of our projects, a substantial part of our funding is earmarked for 1) the legal protection of the data, ensuring the sharing and diffusion of the data entered (without confiscation), and 2) the saving of data without a subscription (even if the data is hosted by a provider, to avoid giving priority to one university project over another; no “cloud” storage is used, but hard drives are purchased and CRON backup programming is used on the websites of both project partners, etc.). Each time, the solution has to be reinvented according to the type of financing and the project itself: an editorial project? A “pure data” project? etc. (DH and Language Interview, May 2018)
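As an illustration of the kind of routine this interviewee describes (compressed copies of an XML corpus shipped to a second server several times a day under cron), the following Python sketch uses hypothetical paths and a hypothetical partner host; it is not the project’s actual setup.

    from datetime import datetime
    from pathlib import Path
    import subprocess
    import tarfile

    CORPUS_DIR = Path("/srv/corpus/xml")             # hypothetical corpus location
    STAGING_DIR = Path("/srv/backups/outgoing")      # local staging area
    MIRROR = "partner-server:/srv/backups/incoming"  # hypothetical second server

    def make_archive() -> Path:
        """Compress the XML corpus into a timestamped .tar.gz archive."""
        STAGING_DIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%dT%H%M")
        archive = STAGING_DIR / f"corpus-{stamp}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(CORPUS_DIR, arcname="corpus")
        return archive

    def mirror(archive: Path) -> None:
        """Copy the archive to the partner server (rsync over ssh)."""
        subprocess.run(["rsync", "-a", str(archive), MIRROR], check=True)

    if __name__ == "__main__":
        mirror(make_archive())

    # Run several times a day with a crontab entry such as:
    # 0 */6 * * *  /usr/bin/python3 /srv/scripts/mirror_corpus.py

Keeping the corpus in plain, compressed XML and copying whole archives between independently managed machines is what makes such a routine survivable across the server migrations the interviewee mentions.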
3. Open Access and Digital Wastelands
3.1 Sharing and Open Access
3.2 “404 not found” and other challenges raised by time
On 15–16 January 1996, the DDS servers were down for most of the night to allow for a full backup of De Digitale Stad. A full 1-on-1 disk copy of all the servers running DDS was created on 3 Digital Linear Tapes (DLT). DDS congratulated itself with a city frozen in time, preserved “to be studied by archaeologists in a distant future: the FREEZE.” [...] In restoring old data, it soon came to light that the package would not simply unwrap, or defrost. The DLT tapes holding the FREEZE did not easily render their content. [Alberts et al. 2017]
The planning of the DDS founders here was remarkable. However, being prepared is not always sufficient to avoid risk. One survey respondent described how earmarking funds to store data beyond the end of an international research project was still insufficient to keep the data available in the medium term: “The data was hosted by a university institution which could no longer cover the costs of data storage, leading to the loss of all the data. We had planned the financing of data maintenance, but the funding was limited to a ten-year period and could only be temporarily extended, leading to the loss of data that could have been used for future research” (ICS interview, June 2018).