Digital Humanities Abstracts

“ACAD - a Cambridge Alumni Database”
John Dawson University of Cambridge, UK

1. Introduction

From 1922-27 Venn published the four volumes of Part I of Alumni Cantabrigienses, a biographical list of all known students, graduates and holders of office at the University of Cambridge, from the earliest times to 1751. This was followed from 1940-54 by the six volumes of Part II, covering 1752-1900. Subsequent archival research unearthed much more detail, and many more names, for the period up to 1500, and in 1963 Emden published his two-volume A Biographical Register of the University of Cambridge to 1500. Together, these twelve volumes cover approximately 180000 names, with some overlap. It goes without saying that all this information is of the utmost importance for historical research, covering as it does a large proportion of the religious, legal, administrative, medical, academic, and royal appointments in Britain, the Empire, and the Colonies, as well as many other countries. A good deal of social history is also included, albeit patchily. However, all these publications have a great defect for research: there is no index. Also, since Venn's work, many corrections and additions to the information have come to light, and without the incorporation of this new material it is easy to be misled by the original books. So, to find all the Vicars of Trumpington mentioned by Venn and Emden requires an exhaustive search of twelve large volumes and many card indexes of Addenda and Corrigenda.

2. Database

We therefore set about the creation of an on-line database to make all this information accessible. Other sources, such as the Tripos Lists (lists of degrees awarded), and College Registers (especially those of the women's colleges, which were ignored by Venn) have been included. It is envisaged that the database will be made freely available for public searching on the World Wide Web, though it is not yet clear what mechanism for searching will be provided. Similar projects, based on Emden's Registers for Oxford and Cambridge, were undertaken in the 1970s.[3] In those studies, the data was highly coded to allow easy cross-tabulation. Several important articles using results from the studies have appeared.[1][2][4] There is, however, no resource for Oxford comparable to that of Venn for the later period. For many years we were unable to find a simple and reliable way to put the data into machine-readable form. Venn's books are in small hand-set type, printed on thick rough paper, and are full of italics, all of which proved completely intractable to the OCR packages available until recently. By chance, just as we had found suitable technology to cope with Venn's printing, we discovered that Ancestry.com had already prepared machine-readable versions of most of the volumes of Part II. Negotiations between Cambridge University Press (the copyright holders) and Ancestry.com soon led to an agreement to share this data, as their product and ours are for essentially different purposes, theirs being mainly accessed for genealogical information. Ancestry.com are also planning to put the remaining volumes of Venn into the computer, and to make the data available to us. Emden's Biographical Register, the Tripos Lists, and the registers of the women's colleges have proved relatively easy to read using OCR and the services of an excellent methodical proofreader.

3. Structural Analysis

A typical entry from Emden looks like this (with references abbreviated):
Dawson, John (Dauson).* Entered in C.L. ET 1484; grace that study for 6 yr in C. and Cn.L. suffice for entry in Cn.L. gr. 1488-9; Inc. C.L., adm. June 1490 [Ref_1]; D.C.L. R. of Debden, Essex, clk, adm. 17 May 1484; till death [Ref_2]. Died 1492. Will dated 10 Aug. 1492; proved 12 Feb. 1493 [Ref_3]. Requested burial in S. Michael's, Cambridge.
and has the following structure:
heading event 1 event 2 ...
where each event in general comprises:
topic type place date(s) reference(s)
The initial form of the database is an SGML-tagged text, from which subsequent databases and searching/sorting structures can easily be obtained. My first attempts at analysis were written in Perl, a widely available string-handling language which allows complex regular expressions. (A regular expression is just a pattern which is used to match parts of the data and extract those parts which can vary.) It soon became apparent that the complexity of the regular expressions needed for the recognition of large-scale structures such as these entries uses too much memory in Perl, and the programs frequently failed. At Cambridge we have a locally-written programmable text editor called NE which has good regular expression handling. It may seem a retrograde step to use a one-off local program like NE in preference to a widely used standard such as Perl, but in our case only the product (the database) is useful; the process used to make the product is different for each text analysed, so the ephemeral nature of the analysis programs is not significant. Events will in general be split over several input lines, so it was first necessary to combine the lines of a complete paragraph into a single line, then to split them at punctuation such as semicolons, and to put references on separate lines. It was clear that some type of formal, structured, but readable output would be needed in the first instance. This could then be converted automatically as input to any required database package. SGML provides an adequate structure for these needs, and is widely used by publishers of machine-readable databases. First attempts at analysis were very heuristic, but served to clarify the problems in my mind. Writing a DTD for the SGML structure was then very helpful, as it forced me to take decisions about nesting of fields, etc. Initially, my regular expressions tried to match complete events, including place names and dates, but two problems arose: the programs ran out of time or store, or NE's regular expression processor found the structure too complex to analyse. Automatically pre-tagging identifiable structures such as dates and place names enabled simpler regular expressions to be written.

4. Results

A discussion of the complete DTD and the analytical processes used will be presented. Various types of results will be used to illustrate the processing, including complete updated entries amalgamated from all sources, and statistics about certain types of event such as religious appointments. The Figures (see below) are based on only part of the data (one volume of Venn), as the analysis is not yet complete. The Figures should be used with care, because although they represent approximately ten thousand individuals, they are constrained by having surnames beginning 'Abbey' to 'Challis', so a preponderance of one family attending one college may skew the results. Figure 1 will show the range of ages at admission to all colleges, and holds no surprises. Age at admission is given in only about half of the entries in Venn. Most of the older admissions are men who have already been ordained. Figure 2 will show the admissions to the two largest colleges, Trinity and St John's, between 1752 and 1900, and illustrates the general increase in size of all the colleges, and hence the whole university, in that period. Figures 3 and 4 will show the admissions to other colleges during that period (except Downing, Selwyn, and the women's colleges, which were not founded until the nineteenth century). The outstanding feature of Figure 4 is the dramatic increase in admissions at Queens' College from 1821-1830. Figure 5 will show the number of religious appointments (Curate, Vicar, or Rector) per county.

5. References

T. H.Ashton. “Oxford's Medieval Alumni.” Past & Present. 1977. 74: 3-35.
T. H.Ashton G. D.Duncan T. A. R.Evans. “The Medieval Alumni of the University of Cambridge.” Past & Present. 1980. 86: 9-86.
R. Evans. “The Analysis by Computer of A.B. Emden's Biographical Registers of the Universities of Oxford and Cambridge.” Medieval Lives and the Historian: Studies in Medieval Prosopography. Ed. N. Bulst J.-P. Genet. Kalamazoo: Medieval Institute Publications, Western Michigan University, 1986.
R. B.Dobson. “Recent Prosopographical Research in Late Medieval English History: University Graduates, Durham Monks, and York Canons.” Medieval Lives and the Historian: Studies in Medieval Prosopography. Ed. N. Bulst J.-P. Genet. Kalamazoo: Medieval Institute Publications, Western Michigan University, 1986.