Geoffrey Rockwell is a Professor of Philosophy and Digital Humanities at the University of Alberta, where he is also the Director of the Kule Institute for Advanced Study and Associate Director of the AI for Society signature area. He publishes on textual visualization, text analysis, pachinko, ethics and technology, and on the digital humanities, including a co-authored book with Stéfan Sinclair titled Hermeneutica (MIT Press, 2016).
Stéfan Sinclair was an Associate Professor of Digital Humanities at McGill University. His primary area of research was the design, development, usage, and theorization of tools for the digital humanities, especially for text analysis and visualization. He led or contributed significantly to projects such as Voyant Tools, the Text Analysis Portal for Research (TAPoR), the MONK Project, the Simulated Environment for Theatre, the Mandala Browser, and BonPatron. In addition to his work developing sophisticated scholarly tools, he had numerous publications related to research and teaching in the digital humanities, including Visual Interface Design for Digital Cultural Heritage, co-authored with Stan Ruecker and Milena Radzikowska (Ashgate 2011), and Hermeneutica, with Geoffrey Rockwell (MIT Press 2016). He was active in the digital humanities community, serving as President of the Association for Computers and the Humanities (ACH), Vice-President of the Canadian Society for Digital Humanities / Société pour l'étude des médias interactifs (SDH/SEMI), and an editor of Digital Humanities Quarterly. We grieve his untimely passing on August 6th, 2020. An obituary is available at https://csdh-schn.org/stefan-sinclair-in-memoriam-2/.
This paper looks at innovations in Busa's Index Thomisticus project through attempts at replication. It begins with a 1950 letter in which the philosopher Daniel McGloin dismissed the proposed concordance as "tremendous mechanical labour ... of no great utility." We use this criticism to draw attention to the very different meshing of human and mechanical labour developed at Busa's concording factory.
From the standpoint of philosophy and theology, it is my opinion that the proposed work would have no utility commensurate with the tremendous mechanical labor it would involve. While an Index of technically philosophical and theological terms occurring in the Opera Omnia of St. Thomas would be very useful, the extension of the work to include all words in St. Thomas' works (including, I presume, conjunctions, prepositions, etc.) seems to me: 1) of no great utility; 2) a sort of fetish of scholarship gone wild, and 3) a drift in the direction of pure mechanical verbalism which would tend to deaden rather than revivify the thought of St. Thomas. (I think he himself would have been horrified at the thought!)
Early in 1950, Daniel L. McGloin, the Chair of Philosophy at Loyola University of Los Angeles (now Loyola Marymount), wrote a letter critical of the idea of the Index Thomisticus, calling it "a sort of fetish of scholarship gone wild." It was the one negative response we have on record, and it is clear that McGloin was appalled.
What exactly bothered McGloin so much? What can we learn from the criticism of what is arguably the first digital humanities project? In what could be the first instance of a long tradition of pointing out the dark side of automation in the humanities, McGloin the philosopher was horrified by what he perceived to be the "pure mechanical verbalism which would tend to deaden rather than revivify the thought of St. Thomas." He could see the value of a concordance of philosophical and theological terms, but an index of all the words seemed to him "of no great utility," especially considering the "tremendous mechanical labor" it would involve. The project was, in short, insufficiently interpretative, a complaint that would continue to be brought against computing in the humanities for decades and, indeed, continues today to figure in critiques of the digital humanities.
Unconsciously, we have come to view life after the analogy of an assembly-line. We construct an educational system as we blueprint an efficient factory, which is an aggregation of machines and operators. Bring your material in, run it through the machines, and out comes a tank.
This paper sets out to understand that tension by revisiting the mechanical labor
of Busa’s project, or more precisely, by trying to replicate exactly how
automation was developed for concording Aquinas. Replication here is a praxis of
media archaeology that can help us understand the context of a technology as
understood in its time.
From the comfortable distance of more than half a century later, living as we do
in an epoch defined by language engineering companies like Alphabet (Google), we
can say that Busa and his IBM-based collaborator Tasman proved to be prescient,
despite McGloin’s complaints. No one today would question the usefulness of the
tremendous language processing techniques that are currently used, and that is
part of what makes it harder to understand the work that went into actually
developing ways for processing textual data back then. What were the mechanical
processes? What was the data processed and what processing techniques allowed
Busa and Tasman to generate a "first example of word indexes automatically compiled and printed by IBM punched card machines" (the subtitle of Busa's 1951 Varia Specimina Concordantiarum)?
If projects are central to digital humanities practice, then the history of a founding project like Busa's deserves study, and that study depends on several types of resources.
In the case of GIRCSE and the Index Thomisticus, a first type of resource is the primary material, such as the documents, photographs, and cards preserved in the Busa Archive at the Università Cattolica del Sacro Cuore in Milan.
A second type of important resource is secondary sources like Steven Jones' book Roberto Busa, S.J., and the Emergence of Humanities Computing (2016).
Thirdly, there are oral testimonies by those who worked on the project or knew of it, and related materials. Julianne Nyhan and Melissa Terras have been gathering information about the women who worked on the project; see their "Uncovering 'hidden' contributions to the history of Digital Humanities: the Index Thomisticus' female keypunch operators."
Finally, there are the publications by the project leaders like Busa and Tasman describing the methods they were developing. Julianne Nyhan and Marco Passarotti are putting together an edited and translated collection of some of Busa’s key papers which will make it much easier to study Busa’s published thought. The key publications used in this paper are, in chronological order:
Busa's Varia Specimina Concordantiarum (1951), which offered "a first example of word indexes automatically compiled and printed by IBM punched card machines," as the subtitle put it. This publication was the proof-of-concept for the much larger Index Thomisticus.
In all of these there is a summary of the technical process they developed, a clear indication of their awareness at the time of the importance of the process they were developing. In the introduction of the Varia Specimina Busa writes:
I bring down to five stages the most material part of compiling a concordance:
- transcription of the text, broken down into phrases, on to separate cards; [The word for phrase used in the parallel Italian, which Busa probably wrote first, is the more technical word pericope, which means a coherent unit of thought and etymologically comes from a cutting-out. (See the Wikipedia entry on Pericope, https://en.wikipedia.org/wiki/Pericope.) In Tasman (1957) the word used is "phrases (meaningful sub-grouping of words …)" (p. 253).]
- multiplication of the cards (as many as there are words on each);
- indicating on each card the respective entry (lemma);
- the selection and placing in alphabetical order of all the cards according to the lemma and its purely material quality;
- finally, once that formal elaboration of the alphabetical order of the words which only an expert's intelligence can perform, has been done, the typographical composition of the pages to be published. (p. 20)
He goes on to say that the IBM system could “carry out all the material part of the work” of steps 2, 3, 4, and 5, though he also talks (p. 26) about the need to have a philologist intervene at stage 3 to disambiguate and lemmatize words. In Paul Tasman's "Literary Data Processing" (1957), instead of a single stage for the transcription of text there are two steps, that of the scholar who marks up the text and that of the keypunch operator who copies it onto cards. This more clearly separates the scholarly from the data entry work.
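Before turning to the other descriptions, it may help to make the five stages concrete. Here is a minimal Python sketch (our own notation and toy data, not anything from the project) of the pipeline, marking which stages Busa assigned to the machines and which required human intelligence:

```python
# A minimal sketch (our own notation, not the project's) of Busa's five
# stages. Stages 2, 4, and 5 are the "material part" the IBM machines
# could carry out; stage 1 and the lemma decisions in stage 3 required
# a scholar or philologist.

def transcribe(text):
    """Stage 1 (human): break the text into phrases, one card each."""
    return [p.strip() for p in text.split(";") if p.strip()]

def multiply(phrase_cards):
    """Stage 2 (machine): one card per word on each phrase card."""
    return [(word, i) for i, phrase in enumerate(phrase_cards)
            for word in phrase.split()]

def lemmatize(word_cards, lemma_table):
    """Stage 3 (machine plus philologist): mark each card with its lemma."""
    return [(lemma_table.get(word.lower(), word.lower()), word, i)
            for word, i in word_cards]

def sort_cards(lemma_cards):
    """Stage 4 (machine): alphabetical order by lemma."""
    return sorted(lemma_cards)

def compose(sorted_cards):
    """Stage 5 (machine): lay out the entries for printing."""
    return "\n".join(f"{lemma}: {word} (phrase {i + 1})"
                     for lemma, word, i in sorted_cards)

# A toy run with a hypothetical lemma table:
cards = transcribe("in principio erat verbum; et verbum erat apud deum")
print(compose(sort_cards(lemmatize(multiply(cards), {"erat": "sum"}))))
```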
In “The Use of Punched Cards” Busa describes yet another version of the process, for which he provides a flow chart.
One thing that stands out in this process, especially when one looks at other
materials like the photographs of the space where the work was done, is how many
people were needed for what we today think of as an automated process. The
machines may have done some of the tedious work, but it took human labor to
prepare texts for data entry, then there was the data entry itself, there was
programming, there was the moving of stacks of cards from machine to machine,
and then there were all the management tasks. The 1962 project proposal gives a breakdown of the staff needed, admittedly for an accelerated completion. The number of staff proposed, divided by department, is listed as:
This is a total of 70 staff for a project that was supposed to be showing the advantages of automation! What also stands out is the gendering of the departments. The operators of the Machine Department, which was the largest proposed unit, were mostly women. As Terras points out in her Ada Lovelace day blog post (2013), it was women who actually ran the machines and they are often shown in the photographs demonstrating the computing to men.
Returning to the processes that were automated, and looking at the details of the descriptions Busa and Tasman give us, two key innovations stand out that made linguistic data processing possible: representing unstructured text as data punched on cards, and mechanically processing that data into word tokens.
These innovations changed the integration of people and technologies that went
into concording, and for that matter, changed how data was processed (previously
cards had been used for structured data such as census information, but not for
unstructured linguistic data). It seems obvious now, but these innovative uses
of machines made possible a different organization of people and technology
where data entry was separated from programming which was separated from the
scholarly work. Technologies, as philosophers Ihde (1998) and Winner (1980)
remind us, are not neutral tools. McGloin may have intuited how Busa’s project
would still call for tremendous human labor despite the machines and that much
of that labor would be mechanical for the women involved.
In sum, Busa’s team conceived a new way to break down the work of concording by first figuring out how to represent text as data that could be processed by machine and then figuring out how to process that data into tokens (words) that could then be manipulated to generate various types of indexes. In these two connected innovations they essentially developed literary data processing, no small feat at the time.
The next two parts of this paper will look at these two stages and discuss the
replication of what they might have actually done as a way of probing our
understanding. This is a story from early in the
technological revolution, when the application was out searching for the
hardware, from a time before the Internet, a time before the PC, before the
chip, before the mainframe. From a time even before programming itself.
From today's perspective it is hard to appreciate how different the data processing technology of the 1950s was.
The one technology that had been at least partially standardized was the punched
card as a way of entering and storing data. The punched card, despite later
discussion about folding, spindling, or mutilating, was a robust way of entering
information so that it could be processed manually or by machine.
The Busa project used the IBM card format. This had been developed in 1928 and had become a de facto standard. Each card was 7⅜ inches wide by 3¼ inches tall. They were made of stiff paper with a notch in the upper left for orientation. The arrangement of holes was standardized so that all relevant machines could process them. Given the size, the IBM card could fit 80 columns and 12 rows of punch locations. While the dimensions and punch zones were standardized, projects could overprint with ink whatever they wanted on the cards as what was printed wouldn’t affect the data processing; printing was intended for human processing, not machine processing.
It is also worth noting that Busa's project was large and different enough that they had their own non-standard cards, including some with bubbles for manual pencil marks. Figure 2 shows an example card from the Busa Archive with the areas and labels related to their particular workflow.
To return to understanding what Busa may have done in the way of digitally representing textual data we have tried to virtually replicate the data dimension of his punch cards with a simple interactive punch card that shows you what holes would be punched in order to represent the data you want the card to bear (see below). This is set up with our best guess of the card data format they used — feel free to enter examples and see if you get what you expected. We have here used replication as a way of thinking through the history of computing. In the replication of a technology we can discover aspects of the technology that wouldn’t occur when just documenting it and others can test our hypothesis as to how data was encoded. Among other things the punched card interactive helped us better understand how this encoding system was actually a curious blend of a decimal-based system (0 to 9 with a couple of extra control rows) and the more binary-based digital processing (0 or 1). A typical punched card had 12 rows to represent any one character, which in fact is potentially very high resolution (12-bit = 4096 possibilities), whereas the first character sets for computers were at most 4, 5 or 6-bit (64 possibilities).
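To illustrate that blend, here is a small Python sketch of the standard Hollerith zone-plus-digit scheme for letters and digits. This was the general IBM convention of the period; that Busa's cards used exactly these combinations is our assumption, and his Latin texts may well have required extensions:

```python
# Sketch of the standard Hollerith card code: a digit is a single punch
# in its own row (0-9); a letter combines one zone punch (12, 11, or 0)
# with one digit punch. This is the era's general IBM convention; that
# Busa's project used exactly these combinations is our assumption.

def char_to_punches(ch):
    """Return the rows punched in a single column for one character."""
    ch = ch.upper()
    if ch.isdigit():
        return [ch]                                  # e.g. "7" -> row 7
    if "A" <= ch <= "I":
        return ["12", str(ord(ch) - ord("A") + 1)]   # A = 12-1 ... I = 12-9
    if "J" <= ch <= "R":
        return ["11", str(ord(ch) - ord("J") + 1)]   # J = 11-1 ... R = 11-9
    if "S" <= ch <= "Z":
        return ["0", str(ord(ch) - ord("S") + 2)]    # S = 0-2 ... Z = 0-9
    return []                                        # blank: no punch at all

def encode_card(text, columns=80):
    """Map one line of text onto the 80 columns of a card."""
    return [char_to_punches(c) for c in text[:columns].ljust(columns)]

print(char_to_punches("A"))   # ['12', '1']
print(char_to_punches("T"))   # ['0', '3']
```

Note how few of the 4096 possible hole patterns in a 12-row column such a scheme actually uses, which is the decimal and binary blend mentioned above.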
In this case we were forced to figure out exactly which IBM format was used.
Emulating the cards raises the issue of standards.
It is worth noting, however, that they did not need to adhere to any particular standard the way we need to today. There was no operating system or off-the-shelf software enforcing standards. There were no other projects that might use their data. The only issues were what the keypunch could enter, what could be read, and what the printers could print, if printing was needed. At that time you could extend things and program the machines to operate on your own hole combinations. It could very well be that the project developed hole combinations of its own for its particular needs.
Having explored some of the encoding, we can turn to the contents of the cards themselves. As Busa describes the phrase cards: "Each phrase is preceded by the reference to the place where this line is found and provided with a serial number and a special reference sign."
What was actually held on the two types of cards? The sentence or phrase cards held the location/reference in the original text, the text of the phrase, a number for the phrase (serial number), and a special reference mark to indicate whether the phrase was thought to be by St. Thomas himself or a reference to words by another. The amount of text for each card, given the limitation of 80 columns (not counting the columns reserved for reference data), would be decided by the scholar who divided the text up into logical thoughts and then into meaningful phrases that would fit on a card. It is important to note that human intervention was needed to fit the continuous unstructured text onto 80-column cards.
The EWCs had less text and more associated information. By the time of Tasman’s
article, each word card would have encoded in the punched holes the following
(some fields are further explained below):
The Form Cards were the headings for each different word. The Entry Cards were the entry headings that presumably would appear in the final printed concordance for the different word types after lemmatization and disambiguation.
There are a few noteworthy aspects about the project’s use of punched cards. First, data on cards was material in the sense that users could physically touch the carrier of the data — the punched card. Many of the processes from keypunching to sorting involved physically manipulating the cards themselves, not data in memory. The human was also part of the sequence of operations as someone would have to load cards, move stacks from machine to machine, and cards would actually be consulted by humans at certain points. Busa would have been aware that he was setting up a system that integrated people and machines in new configurations, at least for the humanities.
Busa and Tasman made full use of the materiality of the punched card. They
actually had at least three orders of information on the cards. There was what
we are going to call the data proper that was punched on the card and could be
operated on mechanically. Then they would print (not punch) extra information on
the front or back of the cards for human consultation. For example, they would
print the data punched on the top of the fronts of the cards so that they could
be checked by eye, or they could print up to 9 lines (phrases) of context on the
backs of EWCs so a scholar could easily check the context when lemmatizing
without having to call up the phrase card. Finally, they had an area on the left
of cards that would take lead pencil marks so scholars could mark cards for
additional processing.
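A simple data model can make these three orders of information explicit. The sketch below is our own abstraction with hypothetical field names, not a reconstruction of an actual card layout from the project:

```python
# Our own abstraction (hypothetical field names) of the three orders of
# information on a card: data punched for the machines, information
# printed for human eyes, and a pencil area for scholars' marks.

from dataclasses import dataclass, field

@dataclass
class Card:
    punched_text: str                    # the data proper, machine-operable
    printed_front: str = ""              # punched data echoed in print for proofing
    printed_back: list = field(default_factory=list)  # up to 9 lines of context
    pencil_marks: list = field(default_factory=list)  # marks for further processing

    def mark(self, note: str):
        """A scholar pencils a note on the card."""
        self.pencil_marks.append(note)

# An Each Word Card carrying context on its back, so the lemmatizing
# scholar need not call up the phrase card:
ewc = Card(punched_text="VERBUM",
           printed_front="VERBUM",
           printed_back=["in principio erat verbum"])
ewc.mark("lemma: verbum")
```

Only the first of these three orders could be operated on mechanically; the other two existed for the humans in the loop.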
One way to appreciate the difference between data then and now is to reflect on
how the project was not about using computers or software as we understand them
now. The actual work involved punching information onto cards, proofing of
cards, using electromechanical machines to process the cards, creating new
cards, printing things on cards, moving cards from machine to machine,
duplicating them, and marking them for further processing.
Now to the second innovation of Busa and Tasman: the algorithmic one.
It is fascinating to reconstruct the mechanical processes with which all this
information could be generated and added to the cards in successive passes with
a limited number of standard machines. For example, to add the first letter of
each following card the stack of EWCs would be sorted in reverse order of
appearance and then run through starting with the last word. The first character
of each word would then be carried forward (or backward in this case) and
written out on the next card, which illustrates another property of the material
cards, the fact that data can be incrementally added to the blank areas with
proper planning. We do not have space here to go through the whole succession of
processes. Instead, we will focus on the one process that made literary data
processing possible, and that is the tokenizing of the Sentence Cards to
generate Each Word Cards (EWCs). This is the crucial innovation that showed how
systems designed for accounting could be used for unstructured text. As Busa described it in 1951:

This is equivalent to state that each line was multiplied as
many times as words it contained. I must confess that in actual practice
this was not so simple as I endeavoured to make it in the description; the
second and the successive words did not actually commence in the same column
on all cards. In fact, it was this lack of determined fields which
constituted the greatest hindrance in transposing the system from the
commercial and statistical uses to the sorting of words from a literary
text. The result was attained by exploring the cards, column by column, in
order to identify by the non-punched columns the end of the previous word
and the commencement of the following one; thus, operating with the sorter
and reproducer together, were produced only those words commencing and
finishing in the same columns.
This would have been considered difficult at the time. As Busa writes in 1951,
the existing commercial accounting and statistical uses of punched cards assumed
that there would be zones on the card that constituted predictable fields. These
fields could hold alphabetic information like names in a personnel dataset, but
the start of each name (the zone of the card where it appeared) would be known
in advance, allowing processing operations to be easily set up.
By contrast, with a phrase of unstructured text only the first word on each card is in a predictable place, and even then there is no information about where the field ends. Thus the problem they had to solve was how to explore the Sentence Cards over and over, identifying each successive word … and in 1951 they had to do this using only sorting machines with reproducers to create the new EWCs. This was the key mechanical process that allowed one type of data, phrase cards, to be processed to generate another, word cards. Everything else could in theory then be generated from these two sets of cards (along with a bit of human work lemmatizing and disambiguating).
To figure out how this could be done we have again replicated the process, though
in code rather than with an emulated machine.
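Here is a minimal sketch of such a replication in Python, with made-up card data. It explores each phrase card column by column, using blank (non-punched) columns to find where one word ends and the next begins, and then groups the word cards by their start and end columns, mimicking the passes in which the sorter and reproducer together produced "only those words commencing and finishing in the same columns":

```python
# A minimal sketch (our own code, with made-up card data) of the key
# mechanical process: exploring phrase cards column by column, using
# non-punched (blank) columns to detect word boundaries, and producing
# Each Word Cards (EWCs) in passes, one pass per (start, end) column
# pair, as the sorter and reproducer did when operated together.

from collections import defaultdict

def find_words(card):
    """Yield (start_col, end_col, word) for each word on an 80-column
    card; columns are 1-based and a blank stands in for a non-punched
    column."""
    col, n = 0, len(card)
    while col < n:
        if card[col] == " ":
            col += 1
            continue
        start = col
        while col < n and card[col] != " ":
            col += 1
        yield (start + 1, col, card[start:col])

def sorter_reproducer_passes(phrase_cards):
    """Group word cards by (start, end) columns. Each group corresponds
    to one pass in which only the words commencing and finishing in the
    same columns were selected and reproduced onto new cards."""
    passes = defaultdict(list)
    for serial, card in enumerate(phrase_cards, start=1):
        for start, end, word in find_words(card):
            passes[(start, end)].append({"word": word, "phrase": serial})
    return passes

cards = ["IN PRINCIPIO ERAT VERBUM".ljust(80),
         "ET VERBUM ERAT APUD DEUM".ljust(80)]
for (start, end), ewcs in sorted(sorter_reproducer_passes(cards).items()):
    print(f"columns {start}-{end}: {[e['word'] for e in ewcs]}")
```

Even this toy run needs eight separate passes for two short phrases, which gives a feel for why the lack of determined fields was "the greatest hindrance" in adapting the machines.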
Later they found other ways to do this. By 1958 Busa had actually developed three ways to carry out this crucial process. In one account he writes:
If you intend only an index verborum, then work will be extremely easy and economical for then you could start by punching the word cards directly. (p. 1 of 2)
Another way was "using the Cardatype, recently developed by IBM."
The Cardatype was a modular and multi-function machine which makes it hard to
figure out exactly what they did with it. It had one or more IBM electric
typewriters, a verifier, and could have a tape-punching unit attached that would
output a paper tape which could then control a card punch that would punch the
EWCs.
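We cannot be certain of the configuration, but the attraction is clear: with the paper tape in between, tokenization could happen at entry time rather than in later sorting passes. Here is a conjectural Python sketch of that tape-driven, tokenize-at-entry data flow (entirely our guess at the mechanics):

```python
# A conjectural sketch (our guess, not a documented configuration) of
# the Cardatype workflow: typing produces a punched paper tape, and the
# tape drives a card punch that cuts one word card per word, so no
# later sorter/reproducer passes are needed to tokenize.

def type_to_tape(text):
    """The typist keys a phrase; the attached unit punches a tape."""
    return list(text)                    # the tape as a stream of characters

def tape_to_word_cards(tape):
    """The tape-driven card punch starts a new card at each word break."""
    cards, word = [], []
    for ch in tape + [" "]:              # trailing blank flushes the last word
        if ch == " ":
            if word:
                cards.append("".join(word).ljust(80))
                word = []
        else:
            word.append(ch)
    return cards

for card in tape_to_word_cards(type_to_tape("ET VERBUM ERAT APUD DEUM")):
    print(repr(card[:12]))
```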
To conclude, we begin by noting that Busa provided answers, of sorts, to McGloin's criticism. Father Busa, in his 1980 reflection on the project, wrote that:
all functional or grammatical words … manifest the deepest logic of being which generates the basic structures of human discourse. It is this basic logic that allows the transfer from what the words mean today to what they meant to the writer.
A different sort of answer can be found in an article by Paul Tasman, Busa’s IBM
engineering collaborator. He believed that innovations like the
machine-searching application developed by the project may "initiate a new era of language engineering":
The indexing and coding techniques developed by this
method offer a comparatively fast method of literature searching, and it
appears that the machine-searching application may initiate a new era of
language engineering. It should certainly lead to improved and more
sophisticated techniques for use in libraries, chemical documentation,
and abstract preparation, as well as in literary analysis.
Tasman saw the business potential for IBM in the innovations of the project.
Having a sense of some of the tremendous mechanical labor involved in Busa’s
project, something McGloin didn’t really have when commenting, we can conclude
by asking again if such labor was worth it. Obviously, from an efficiency point
of view, it was worth it if you grant that Busa was right about the value of a
full concordance. Tasman was also right about the coming era of language engineering. As boyd and Crawford (2012) observe, Big Data has emerged as a system of knowledge "that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community."
McGloin was right to focus on labor and the changes in attention, what he calls a fetish, though we disagree with his unquestioned association of mechanism with a
loss of life. Changes in the labor of research change the object of research and
the methods with which we think-through (and vice versa). Busa and Tasman were
aware of this in that they discussed some of the implications of the new methods
they were developing and could see some of the possibilities for different types
of research and answers. McGloin is right to ask about these changes, but wrong
to assume there is something necessarily dead about mechanised methods while the
existing objects and methods are sacred. McGloin poses the critical question
about what and how we are interpreting, but he is a romantic who does not
question his own practices. As boyd and Crawford argue for big data (2012)
it is time to start critically interrogating this
phenomenon, its assumptions, and its biases
(665). We would add that
to critically interrogate a new phenomenon one needs to also ask about that with
which it was contrasted in its time and also ask about the metaphors of
contrast. One of the best ways to interrogate such a phenomenon of literary and
linguistic computing is to do the archaeology of the technologies that actually
changed the labor, objectives and methods, and that is what we have tried to do
here. Much can be brought to life in the replication of liminal moments like
Busa's Index Thomisticus.
Following boyd and Crawford, an aspect of the labor that McGloin might have
objected to if he had had more time would have been the factory-like management
systems developed for this big data project and the ways they might change what
is considered knowledge in the humanities. Busa and Tasman didn’t just develop
new technologies, they had to develop a physical center where the work could
take place and an organization capable of carrying out so much human and
mechanical labor. This wasn't the first large-scale humanities project; there
had been large projects for some time, including concording projects, dictionary
projects, archaeological projects, and editorial projects. It was, however, one
of the first projects to integrate information technology and human labor so
closely. The project needed deliberate planning, fund raising, management,
public relations, and training in the use of computers. Jones (2016) has gone a
long way towards documenting how this was the first humanities computing
project. Nyhan and Terras have documented the gendered division of labor between
male scholars and female punch card operators.
From today’s perspective McGloin might have commented on how the challenge of dealing with so much text led to the development of a new organizational model for humanities projects where the integration of technology forced a new division of labor on the humans. It forced humanists to develop practices which integrated scholars with punch card operators, technical staff and engineers. We are still struggling with issues of credit and project management around these new configurations. These new configurations change the communities of knowledge and may have also changed the way knowledge is conceived. Information, often in the form of large amounts of data, has become synonymous with knowledge. The datafication of the cultural record has changed our thinking about knowledge. Scholarship has given way to information management.
All images from the Busa Archive are property of the Biblioteca dell'Università Cattolica. They are shared in this presentation with permission. For further information, or to request permission for reuse, please contact Paolo Senna at paolo_dot_senna_at_unicatt_dot_it.
This paper was originally presented at Loyola, Chicago in 2016. Rockwell and Passarotti also presented some of these ideas as part of a different paper in Rome, Italy in 2017 at the 6th AIUCD Conference. That paper was published in 2019 under the title "The Index Thomisticus as a Big Data Project."
Lubar, Steven. "Do Not Fold, Spindle or Mutilate: A Cultural History of the Punch Card." Journal of American Culture 15.4 (1992).
Nyhan, Julianne and Melissa Terras. "Uncovering 'hidden' contributions to the history of Digital Humanities: the Index Thomisticus' female keypunch operators."