Volume 10 Number 3

# Six Degrees of Francis Bacon: A Statistical Method for Reconstructing Large Historical Social Networks

## Abstract

In this paper we present a statistical method for inferring historical social networks from biographical documents as well as the scholarly aims for doing so. Existing scholarship on historical social networks is scattered across an unmanageable number of disparate books and articles. A researcher interested in how persons were connected to one another in our field of study, early modern Britain (c. 1500-1700), has no global, unified resource to which to turn. Manually building such a network is infeasible, since it would need to represent thousands of nodes and tens of millions of potential edges just to include the relations among the most prominent persons of the period. Our Six Degrees of Francis Bacon project takes up recent statistical techniques and digital tools to reconstruct and visualize the early modern social network.

We describe in this paper the natural language processing tools and statistical graph learning techniques that we used to extract names and infer relations from the Oxford Dictionary of National Biography. We then explain the steps taken to test inferred relations against the knowledge of experts in order to improve the accuracy of the learning techniques. Our argument here is twofold: first, that the results of this process, a global visualization of Britain’s early modern social network, will be useful to scholars and students of the period; second, that the pipeline we have developed can, with local modifications, be reused by other scholars to generate networks for other historical or contemporary societies from biographical documents.

# Introduction

*transference* of text “from bibliographical to digital machines.” SDFB tackles a related but more difficult problem: the *transformation* of biographical text, which focuses on a single person but contains rich information about social relations, into a global (non-egocentric) network graph, which requires extracting information about nodes (persons) and edges (relations) while ignoring or discarding other kinds of biographical information.

# 1. Source Material

# 2. Pre-Processing Source Material

| Subset | Recall | Precision |
|---|---|---|
| Stanford (ST), Person Tags Only | 63.51% | 91.75% |
| LingPipe (LP), Person Tags Only | 52.44% | 72.11% |
| ST, Person and Organization Tags | 70.74% | 74.02% |
| LP, Person and Organization Tags | 67.83% | 46.19% |
| ST + LP, Person Tags Only | 79.37% | 77.91% |
| ST + LP, Person and Organization Tags | 85.66% | 51.61% |
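The recall and precision figures in the table above follow the usual set-based definitions. The sketch below illustrates them on a tiny invented example; the names in `gold`, `tagger_a`, and `tagger_b` are made up for illustration and are not taken from the project's data. Modelling the combined "ST + LP" rows as the union of two taggers' outputs is an assumption, not something the table itself states.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predicted names against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: hand-labelled person names and two hypothetical tagger outputs.
gold = {"Francis Bacon", "Anne Boleyn", "John Milton", "Ben Jonson"}
tagger_a = {"Francis Bacon", "John Milton"}
tagger_b = {"Anne Boleyn", "John Milton", "Oxford"}  # "Oxford" is a false positive

# Assumption: two taggers are combined by taking the union of their tags.
combined = tagger_a | tagger_b

print(precision_recall(tagger_a, gold))  # (1.0, 0.5)
print(precision_recall(combined, gold))  # (0.75, 0.75)
```

In this toy case the union finds more of the gold names than either tagger alone, while inheriting the false positive from the second tagger. That is the same trade-off between recall and precision that the table reports.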

*n* × *p* matrix, Y, has *n* rows representing documents and *p* columns representing people, or actors, in the network. The number of times person *j* is mentioned by name in document *i* gives us \(Y_{ij}\), a non-negative integer for each document/person pair. We used this document-count matrix to infer the social network.
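As a minimal sketch of the document-count matrix described above, the following builds Y by counting exact occurrences of each name in each document. The function name `count_matrix` and the sample sentences are invented for illustration. Exact string matching is a simplifying assumption; it stands in for the named-entity recognition and name-matching steps of the actual pipeline.

```python
import re


def count_matrix(documents, names):
    """Return Y as a list of rows, where Y[i][j] is the number of times
    names[j] appears in documents[i]."""
    # Assumption: a mention is an exact, whole-word match of the full name.
    patterns = [re.compile(r"\b" + re.escape(name) + r"\b") for name in names]
    return [[len(pattern.findall(doc)) for pattern in patterns] for doc in documents]


# Invented sample text, not drawn from the source biographies.
documents = [
    "Francis Bacon wrote to Ben Jonson. Francis Bacon later met John Donne.",
    "John Donne preached in London.",
]
names = ["Francis Bacon", "Ben Jonson", "John Donne"]

Y = count_matrix(documents, names)
print(Y)  # [[2, 1, 1], [0, 0, 1]]
```

Each row is one document and each column one person, so the first row records two mentions of Francis Bacon and one each of Ben Jonson and John Donne.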

# 3. Statistical Inference

# 4. Expert Validation

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|
| london famili merchant coloni trade work | earl lord parliament second king london | bishop colleg church minist preach london | armi command return forc captain ship | work publish letter write public book |

| Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 |
|---|---|---|---|---|
| king polit parliament appoint duke lord | work london publish book physician | work publish poem play translat edit | earl london famili marriag second will | king queen english england court royal |

| Measure | Within-Topic | Between-Topic |
|---|---|---|
| Fraction of Edges with Confidence ≥ 90% | 0.0001444 | 0.0000117 |
| Fraction of Edges with Confidence ≥ 75% | 0.0004523 | 0.0000527 |
| Fraction of Edges with Confidence ≥ 50% | 0.0016402 | 0.0003082 |
| Fraction of Edges with Confidence ≥ 30% | 0.0033207 | 0.0007585 |

| Confidence Interval | Number of SDFB Inferred Relationships | Precision (# correct / # found) | Article Recall (# found in article / # in article) | SDFB Recall (# found in article also in SDFB / # from article) |
|---|---|---|---|---|
| 80-100 (certain) | 5 | 80.00% | 1.98% | 3.96% |
| 60-100 (likely) | 28 | 89.29% | 8.42% | 16.83% |
| 40-100 (possible) | 107 | 74.77% | 25.74% | 51.49% |
| 10-100 (unlikely) | 283 | ≥28.27%[4] | 33.66% | 67.33% |

# 5. Humanities Significance

# Conclusion

# APPENDIX: The Poisson Graphical Lasso

## Introduction

## Poisson Graphical Lasso

## Modifications

- relationships produced by simple correlation
- relationships produced by our model with document sectioning
- relationships produced by our model without document sectioning

## Confidence Estimate Procedure

1. Sample half of the rows in the data matrix.
2. Fit Poisson Graphical Lasso on this data as follows:
   - For each j (column), fit the model in Equation 3, and obtain the coefficient estimates for ϱ=0.001.
   - Ignore any coefficient that has been estimated as negative.
3. Repeat steps (1) and (2) 100 times.
4. Estimate the confidence of an edge between nodes j and k as \(\widehat{C}_{jk} = \frac{1}{B}\sum_{t = 1}^{B} I\left( \widehat{\theta}_{jk}^{(t)} > 0 \text{ or } \widehat{\theta}_{kj}^{(t)} > 0 \right)\), where \(\widehat{\theta}_{jk}^{(t)}\) is the coefficient estimate from the t-th repetition, \(I(\cdot)\) is the indicator function, and B is the number of repetitions (here 100).
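The subsampling and aggregation steps of this procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: Equation 3 is not reproduced in this section, so the per-column fit is passed in as a function (`fit_column`, a hypothetical name). The `covariance_stand_in` fitter exists only to make the example runnable and is not a Poisson Graphical Lasso fit.

```python
import random
from collections import Counter


def edge_confidence(Y, fit_column, B=100, seed=0):
    """Estimate edge confidences by repeated half-sampling of the rows of Y.

    fit_column(sub, j) must return a dict {k: theta_jk} of coefficient
    estimates for column j fitted on the subsample `sub`.
    """
    rng = random.Random(seed)
    n, p = len(Y), len(Y[0])
    hits = Counter()
    for _ in range(B):
        # Step 1: sample half of the rows without replacement.
        rows = rng.sample(range(n), n // 2)
        sub = [Y[i] for i in rows]
        # Step 2: fit one model per column.
        theta = [fit_column(sub, j) for j in range(p)]
        for j in range(p):
            for k in range(j + 1, p):
                # Negative coefficients are ignored: only positive ones count.
                if theta[j].get(k, 0.0) > 0 or theta[k].get(j, 0.0) > 0:
                    hits[(j, k)] += 1
    # Step 4: fraction of repetitions in which the edge was selected.
    return {(j, k): hits[(j, k)] / B for j in range(p) for k in range(j + 1, p)}


def covariance_stand_in(sub, j):
    """Placeholder fitter (NOT the Poisson lasso): returns the sample
    covariance of column j with every other column as a 'coefficient'."""
    m = len(sub)
    mean_j = sum(row[j] for row in sub) / m
    coefs = {}
    for k in range(len(sub[0])):
        if k == j:
            continue
        mean_k = sum(row[k] for row in sub) / m
        coefs[k] = sum((row[j] - mean_j) * (row[k] - mean_k) for row in sub) / m
    return coefs


# Invented toy data: columns 0 and 1 move together, column 2 moves against them.
Y = [[i, i, 7 - i] for i in range(8)]
conf = edge_confidence(Y, covariance_stand_in)
print(conf)  # {(0, 1): 1.0, (0, 2): 0.0, (1, 2): 0.0}
```

With a real penalized Poisson regression supplied as `fit_column`, the returned values would play the role of the confidence estimates \(\widehat{C}_{jk}\). Keeping the fitter as a parameter separates the resampling logic from the choice of model.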