Eric Hoyt is Assistant Professor of Media & Cultural Studies in the Department of Communication Arts at the University of Wisconsin-Madison. He is the author of
Kevin Ponto is Assistant Professor in the Living Environments Lab and the Design Studies Department in the School of Human Ecology. Ponto received his bachelor’s from the University of Wisconsin-Madison, his master’s from the University of California, Irvine, and his Ph.D. from the University of California, San Diego. Ponto's research is focused on advancing the state of the art in the field of virtual reality, ranging from creating novel and natural interfaces for immersive virtual environments to developing methods, techniques and tools to better understand, evaluate, and develop interactive virtual experiences.
Carrie Roy is Coordinator for the Humanities Research
Bridge at the University of Wisconsin, Madison. Her research
interests span digital humanities tools and data visualizations,
folklore, Scandinavian studies, medieval literature, art, folk art, and
material culture. Collaborations on tools, programming and research
involve partners in the humanities, biological sciences, healthcare,
statistics, computer science, and the most recent, Victorian Eyes, an art
exhibition.
Of all narrative textual forms, the motion picture screenplay may be the most perfectly pre-disposed for computational analysis. Screenplays contain capitalized character names, indented dialogue, and other formatting conventions that enable an algorithmic approach to analyzing and visualizing film narratives. In this article, the authors introduce their new tool, ScripThreads, which parses screenplays, outputs statistical values which can be analyzed, and offers four different types of visualization, each with its own utility. The visualizations represent character interactions across time as a single 3D or 2D graph. The authors model the utility of the tool for the close analysis of a single film (Lawrence Kasdan’s
“distant reading” by identifying patterns of character presence across a dataset of 674 screenplays.
In November 2009, the website xkcd.com published a series of info-graphics that visualized character interactions in movies such as the
The challenge of conveying complex information (character appearances, relations
with other characters, and absences) as it changes over time — along with our
shared interests in the Digital Humanities and our more singular scholarly
focuses on cinema and television (Hoyt), visualization and storytelling (Roy),
and computational approaches (Ponto) — spurred us to develop a tool that
algorithmically analyzes and visualizes screenplays. In building a tool to aid
Humanists, we sought to heed Abello, Broadwell, and Tangherlini’s call for
computational approaches that combine distant reading and close reading
distant reading
that Franco Moretti first proposed and, in our
case, allowing a researcher to study 1,000 screenplays rather than 10 or 20
In this article, we introduce our tool, ScripThreads, and discuss some of our initial research findings from using ScripThreads to analyze and visualize hundreds of screenplays from the American Film Scripts Online collection. We model how the tool can be productively used in film analysis as a tool for close reading by analyzing and comparing two screenplays co-written and directed by Lawrence Kasdan,
In the process of building the tool’s prototype and writing this article, we have
come to appreciate the many ways that a computer reads a screenplay differently
from you or me. Humans gather insight from watching and experiencing the
emotion, tension, and dynamics of a movie or screenplay. Rather than attempting
to train a computer to understand a film in the exact same way we do, we would
prefer to ask a computer to do tasks that it is designed to do well and that we
humans struggle with. Humans have memory limitations when it comes to matters of
sequential timing and the entrances and exits of dozens of characters. In
contrast, computers are excellent at gathering and recording these sorts of
details from structured texts. Lev Manovich has suggested that one of the most
valuable things that comes from combining computational analysis and the
visualization of vast amounts of information in a single image is that it
defamiliarizes our understanding of the works that we study in the Humanities
Of all narrative textual forms, the motion picture screenplay may be the most perfectly pre-disposed for computational analysis. As Murtagh, Ganz, and Reddington explain,
Our research and software development contribute to a lively and growing area of
research about screenwriting history, form, theory, and practice. As a
discipline, Film Studies has long lived in the shadow of the
A small contingent of screenwriting researchers are, like us, pursuing the
computational analysis of screenplays
Regardless of the software platform, there are some inherent limitations to an automated approach to screenplay analysis. First, a screenplay nearly always differs to some degree from the completed film version, which may include improvised dialogue from the film’s production or lack scenes or characters that were cut in post-production. Screenplay analysis requires us to qualify arguments we may want to make about the entire film. Another major challenge to this line of research is access — specifically, digital access to the authoritative versions of screenplays. There are several websites offering free, downloadable screenplays of contemporary Hollywood movies. The provenance, authoritativeness, and legal status of these screenplays, though, are not clear. Moreover, these websites generally have very few screenplays for movies produced prior to the 1980s.
American Film Scripts Online (AFSO) is the digital resource that we believe provides access to the largest number of authoritative, digitized screenplays. The resource’s creator, Alexander Street Press, licensed 1,009 screenplays from Warner Bros., Universal, and a number of other rights holders. Roughly half of the screenplays are available as PDF facsimiles of the original documents, and all of the screenplays are available as HTML documents, which have been re-keyed, eliminating most of the problems that come from uncorrected OCR text. The HTML mark-up is not consistently or semantically structured, but this productively forced us to code ScripThreads’s parsing algorithm so that it could handle a wider variety of screenplays and not depend on rigid mark-up standards. As for AFSO’s selection of screenplays, the collection is stronger in some areas than others. Over half of the 1,009 screenplays derive from 1930s and 1940s Hollywood movies — primarily, productions from Warner Bros., RKO, and MGM (it’s no coincidence that Time Warner holds the rights to all three of these studio film libraries). AFSO is well suited, then, for research questions focusing on Hollywood’s
The ScripThreads software prototype is a cross-platform tool for the analysis and visualization of screenplays. The tool is written in C++ and utilizes the Qt toolkit for its graphical user interface, making it easy to port to multiple systems. Figure 2 shows a screenshot of the graphical user interface, which offers separate windows for the Reading View, Visualization View, and Character and Settings View. By September 2014, users will be able to download the ScripThreads prototype at http://scripthreads.org. ScripThreads takes text and HTML screenplay files as input, parses them, and generates data for visualization and analysis. The features of the tool are described below and showcased with visualizations from
While screenplays have a far more defined structure than other media, we still found substantial variation between works. For instance, while most works contained indentation followed by the character name in full capitalization before each paragraph of dialogue, the number of spaces of indentation varied greatly from work to work. Furthermore, some authors used indentation and capitalization to indicate other screenplay attributes such as sound effects, locations, and times.
For these reasons we used a two-step process to automatically find characters in a screenplay. In the first pass, each line was analyzed to determine if it was potentially the indication of a character. Lines which were considered candidates were pushed to a list. After the first pass, the list was analyzed to determine the most likely amount of indentation before a character name. For instance, a given screenplay may have 12 spaces of indentation before a character name with the spoken lines included in the paragraph below.
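The first pass described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ScripThreads implementation (which is written in C++): the regular expression `CANDIDATE`, the function names, and the sample lines are all hypothetical, and "most likely indentation" is read here as the most frequent indentation among candidate lines.

```python
import re
from collections import Counter

# Hypothetical heuristic: an indented line written entirely in capitals
# is treated as a possible character cue.
CANDIDATE = re.compile(r"^( +)([A-Z][A-Z .'\-]*)$")


def candidate_lines(lines):
    """Collect (indent, text) for every line that might name a character."""
    found = []
    for line in lines:
        match = CANDIDATE.match(line.rstrip())
        if match:
            found.append((len(match.group(1)), match.group(2).strip()))
    return found


def likely_character_indent(candidates):
    """Return the most frequent indentation among candidates, or None."""
    if not candidates:
        return None
    counts = Counter(indent for indent, _ in candidates)
    return counts.most_common(1)[0][0]


def character_names(lines):
    """Keep only the candidates that sit at the most frequent indentation."""
    candidates = candidate_lines(lines)
    indent = likely_character_indent(candidates)
    return sorted({name for i, name in candidates if i == indent})


# Invented sample: character cues at 12 spaces, a transition at 24 spaces.
sample = [
    "INT. KITCHEN - NIGHT",
    " " * 6 + "Mack enters and looks around.",
    " " * 12 + "MACK",
    " " * 6 + "Where is everybody?",
    " " * 12 + "CLAIRE",
    " " * 6 + "Right here.",
    " " * 24 + "FADE OUT",
    " " * 12 + "MACK",
    " " * 6 + "Good.",
]

print(character_names(sample))  # ['CLAIRE', 'MACK']
```

In this sketch the capitalized transition at 24 spaces is collected as a candidate but discarded, because 12 spaces is the more common indentation. That is the kind of false positive the two-step design is meant to filter out.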
A second pass was then undertaken to determine which characters
were in which scene. This pass consisted of parsing for three different items:
character names, scene breaks, and meta-information. Character names were
gathered from the first pass. Additional names could also be entered by the
user. Scenes were determined by looking for a user-defined set of keywords, such
as “int.” or “ext.” In practice, we found that a list of
10-12 keywords captured scene changes well. Finally, meta-information, such
as page numbers, was found using simple matching techniques.
This second pass generated data for each scene as to how long the scene lasted, which characters were involved, and for how many pages in the original screenplay the scene encompassed. From this data, four different types of visualization are available to users, each with its own utility.
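A rough Python sketch of this second pass follows. It is an assumption-laden illustration rather than the tool's own code: the keyword tuple `SCENE_KEYWORDS`, the function `split_scenes`, and the sample lines are hypothetical, and scene length is counted in non-blank lines instead of pages to keep the example short.

```python
# Hypothetical, user-adjustable scene-break keywords (the article mentions
# that 10-12 such keywords worked well; only three are shown here).
SCENE_KEYWORDS = ("INT.", "EXT.", "FADE IN")


def split_scenes(lines, names, keywords=SCENE_KEYWORDS):
    """Group lines into scenes and record which known characters appear."""
    scenes = []
    current = None
    for line in lines:
        text = line.strip()
        if text.upper().startswith(keywords):
            # A scene-break keyword opens a new scene record.
            current = {"heading": text, "length": 0, "characters": set()}
            scenes.append(current)
            continue
        if current is None or not text:
            continue  # skip blank lines and anything before the first scene
        current["length"] += 1  # length in non-blank lines (an assumption)
        if text in names:
            current["characters"].add(text)
    return scenes


# Invented sample lines and character names.
lines = [
    "INT. CAR - NIGHT",
    "MACK",
    "Come on, start.",
    "",
    "EXT. STREET - NIGHT",
    "SIMON",
    "You call a tow truck?",
    "MACK",
    "I did.",
]

for scene in split_scenes(lines, {"MACK", "SIMON"}):
    print(scene["heading"], scene["length"], sorted(scene["characters"]))
```

Matching on a line prefix is deliberately naive. A real parser would also need to handle page numbers and other meta-information, which this sketch leaves out.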
After the characters are detected, ScripThreads applies the framework of network
theory by drawing a relationship (edge) between characters (nodes) who share the
same scene. Force directed graphs are a common method for visualizing this type of
information. Digital Humanities scholars have used force directed graphs to
represent networks of characters, literary authors, and topics
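The idea of drawing an edge between characters who share a scene can be illustrated with a short Python sketch. The function name `cooccurrence_edges` and the sample scenes are hypothetical. Weighting each edge by the number of shared scenes is also an assumption made for illustration, since the text only states that an edge is drawn.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(scenes):
    """Count, for each pair of characters, the scenes in which both appear."""
    edges = Counter()
    for cast in scenes:
        # Sorting gives each pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(cast), 2):
            edges[(a, b)] += 1
    return edges


# Invented example: each set lists the characters present in one scene.
scenes = [
    {"MACK"},
    {"MACK", "SIMON"},
    {"CLAIRE", "MACK"},
    {"MACK", "SIMON"},
]

for pair, weight in sorted(cooccurrence_edges(scenes).items()):
    print(pair, weight)
```

An edge list of this kind is the usual input to a force directed layout, where more heavily weighted pairs are pulled closer together.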
Unfortunately, this approach is only designed for representing connectivity information, and is not designed for a time-based approach. One approach to encode temporal information along with connectivity information is to treat each time-step as an individual force directed graph. These graphs can then be converted into a 3D data structure which can be analyzed from arbitrary positions. The limitation of this approach on its own is that each time-step is handled on an individual basis; there is no guarantee that the nodes will appear as temporally connected threads.
To overcome this limitation, ScripThreads accounts for the temporal dimension
through the use of a single 3D data structure. Each node is placed on a series of
time-step planes on the z-axis and is not only connected to the other relevant
nodes in its time-step, but also connected to its previous and future
states. This enables each character to be viewed as a virtual thread that can
become entangled in other threads when relationships occur. An added value to
the vertical alignment of character threads, as opposed to the horizontal
arrangement of the xkcd examples, is that the entire script is visible and able
to be scrolled alongside the visualization — enabling the integrated close and
distant reading noted in the introduction. The rendering system also simulates
the idea of a virtual thread, by rendering each character as a continuous
character thread — thick when the character is active, thin when not active.
This enables unbiased color blending to occur on a model level, often referred
to as color weaving
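One way to picture the single data structure described above is as a graph whose nodes are (character, time-step) pairs. The Python sketch below is a hypothetical illustration, not the ScripThreads data model: the function `build_thread_graph` and its inputs are assumptions, and the layout and rendering steps are omitted entirely.

```python
from itertools import combinations


def build_thread_graph(scenes, characters):
    """Build nodes (character, t) plus within-step and across-step edges."""
    nodes = [(c, t) for t in range(len(scenes)) for c in characters]
    edges = []
    for t, cast in enumerate(scenes):
        # Within a time-step: connect characters who share the scene.
        for a, b in combinations(sorted(cast), 2):
            edges.append(((a, t), (b, t)))
        # Across time-steps: tie each character to its own next state,
        # so that every character forms one continuous thread.
        if t + 1 < len(scenes):
            for c in characters:
                edges.append(((c, t), (c, t + 1)))
    return nodes, edges


# Invented example with two scenes and two characters.
nodes, edges = build_thread_graph(
    [{"MACK", "SIMON"}, {"MACK"}],
    ["MACK", "SIMON"],
)
print(len(nodes), len(edges))  # 4 3
```

Because every character has a node on every time-step plane, a character who is absent from a scene still has a thread passing through it. This matches the description of threads that are thin when a character is inactive and thick when active.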
The force directed visualizations reveal insights into a screenplay’s narrative
structure, especially for films featuring episodic segments or parallel
narratives, parallel protagonists, or parallel lines of action. In screenwriting
manuals, “narrative structure” is often synonymous with “three act
structure” and the goal-oriented protagonist whose pursuit of some goal
pushes the story from one act to the next. The goal-oriented protagonist is
fundamental to Hollywood storytelling, and we acknowledge that ScripThreads does
not capture this important dimension of screenplay structure.
Another recurring visual pattern we have detected is alternating convergences of differently colored character threads. Such alternating clusters strongly suggest that the screenplay features parallel protagonists, parallel narratives, or parallel lines of action that play out across different spaces. The force directed graph of
In contrast to the
In ScripThreads’ absence graph, the x-axis measures presence and absence.
A thread’s distance from the center of the x-axis conveys the length of a
character’s absence, measured forward and backward in time. The resulting visualization
can be read like a bus map: characters run parallel routes when they both appear
in a scene. When a character is not in a scene, his or her bus route splits off.
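The text does not spell out how the length of absence is computed, so the following Python sketch shows only one possible reading: for each scene, a character's offset is the number of scenes to his or her nearest appearance, looking both backward and forward. The function `absence_offsets` and this definition are assumptions made for illustration.

```python
def absence_offsets(presence):
    """Distance, in scenes, from each scene to the nearest appearance."""
    n = len(presence)
    far = n  # placeholder larger than any real distance

    def sweep(indices):
        # Walk the scenes in one direction, counting scenes since the
        # character was last seen.
        out = [far] * n
        gap = far
        for i in indices:
            gap = 0 if presence[i] else min(gap + 1, far)
            out[i] = gap
        return out

    backward = sweep(range(n))            # scenes since the last appearance
    forward = sweep(range(n - 1, -1, -1))  # scenes until the next appearance
    return [min(b, f) for b, f in zip(backward, forward)]


# Invented example: present in scenes 0 and 4 only.
print(absence_offsets([True, False, False, False, True, False]))
# [0, 1, 2, 1, 0, 1]
```

Under this reading, an offset of 0 keeps a thread on the shared route, and larger offsets push it further from the center, which is consistent with the bus map analogy.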
The presence graph provides a quick view of when a character is
active in a scene. The thread is wider when the character is active
and narrower when the character is not active. Time is shown on the y-axis
from top to bottom. Horizontal lines indicate dialog between the characters.
If we return to
The increasing graph is useful for communicating, in a single image, character activity and storytelling techniques across the course of a narrative. Unlike the force directed and convergence graphs, ScripThreads’ increasing graph is not rooted in network theory. Perhaps for this reason, though, we’ve found that Humanities researchers unfamiliar with networks tend to find the increasing graph the fastest to grasp and interpret.
The increasing graph rotates the axes from the convergence graph: the x-axis becomes time and the y-axis becomes character presence. If a character is present in a scene, then his or her colored thread rises vertically. If a character is not present in a scene, then his or her thread remains flat.
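The rise-or-stay-flat rule amounts to a running total of scenes in which a character is present. A minimal Python sketch, assuming one unit of rise per scene (the tool may instead scale the rise by scene length), with a hypothetical function name:

```python
from itertools import accumulate


def increasing_series(presence):
    """Running count of scenes in which the character has appeared."""
    return list(accumulate(1 if present else 0 for present in presence))


# Invented example: present in scenes 0, 1 and 3 but absent from scene 2.
print(increasing_series([True, True, False, True]))  # [1, 2, 2, 3]
```

A character who appears in every scene produces a straight diagonal, while a character who is often absent produces a line with long flat stretches.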
“ensemble” or “multi-character” screenplay with such a tight range of presence levels for this many characters.
The screenplays that Kasdan wrote and directed immediately before and after
When we looked at hundreds of screenplay increasing graphs, we noticed a sub-group in which the red thread shoots up diagonally in a straight line, far exceeding any other thread line (Figure 13). These graphs are indicative of screenplays that focus on a single protagonist and in which the protagonist appears in every scene or nearly every scene. We analyze this pattern in greater depth in Section 5 of this article.
ScripThreads also gives users the option to export two different types of data: scene statistics and character statistics. In the
Thus far, we have demonstrated that ScripThreads generates visualizations that reveal storytelling techniques, character interactions, and character activity within a screenplay. In sharing our work, though, we have been asked: how does this tool yield knowledge that couldn’t be gained simply through reading the screenplay, watching the film closely, or turning to the existing body of scholarship on narratology, cognitivism, and Hollywood storytelling? While we are enthusiastic about the potential of ScripThreads for distant reading, we also recognize that the close analysis of individual films will always be an important activity of film criticism and scholarship. In this section, we model how ScripThreads can be used as an interpretative tool that enhances — rather than replaces — the use of narrative theory and the method of close reading. To model how a scholar’s engagement with ScripThreads can enrich an understanding of a film’s narrative structure, we will continue our focus on Lawrence Kasdan’s work and analyze the screenplay for
In the 80’s he brought us
So
reads the tag line on the movie poster for
The ScripThreads visualizations for
network narrative— a multi-protagonist film that follows numerous characters whose lives intersect at different moments
If we put aside the comparative question and instead focus on the details of
In Lawrence Kasdan’s Grand Canyon (1991), the married couple Mack and Claire and the brother-sister pair of Simon and [Deborah] are given roughly equal emphasis… other plotlines show Mack’s son falling in love with a girl he meets at camp, [Deborah]’s son being alienated, and Mack’s friend Davis vowing to stop making ultraviolent movies. The subsidiary characters don’t encounter all the customary obstacles and setbacks, yet their wants are developed beyond the limits of a traditional subplot, providing thematic echoes or counterpoints.
Does this mean that David Bordwell’s analysis of the film is incorrect? No. Bordwell never claims that Mack, Claire, Simon, and Deborah are given equal screen time. Instead, he’s suggesting that the storytelling techniques invite the audience to think of the characters as equally important. In fact, Bordwell’s book offers insights that explain the discrepancy between Mack’s on-screen involvement and the audience’s understanding of Mack as one of multiple roughly equal characters. Bordwell describes
“ensemble films” in which “several protagonists are given equal emphasis, based on screen time, star wattage, control over events, or other spotlighting maneuvers.” The star wattage and spotlighting maneuvers are especially significant to our interpretation of
The screenplay’s two major spotlighting maneuvers, which occur at the beginning
and end, further encourage us to perceive Mack (Kevin Kline) and Simon (Danny
Glover) as equal in narrative importance. The script’s most important
spotlighting maneuver occurs from pages 6 to 16: the white lawyer Mack’s car
breaks down at night in South Central Los Angeles; armed young black men
approach him and tell him to get out of the car; Simon, a black tow truck
driver, pulls up, tells the armed men that Mack is his responsibility, and takes
Mack back to his house in an affluent neighborhood. The screenplay interweaves
scenes of Mack waiting for the tow truck and scenes of his family back in
Brentwood. When Simon and Mack are leaving South Central, Simon’s line, “My sister and her kids live near here,”
motivates a transition to a scene that introduces Simon’s sister and her
troubled son. The Mack and Simon characters are contrasted by their race and
social class, yet treated as equals through the attention to each one’s family
and their shared sense that, in the words of Simon, “the world ain't supposed to work like this... Everything's
supposed to be different than it is.”
This loss of faith in the
social and moral order — counterbalanced by the possibility for human kindness,
growth, and, even, miracles — provides the thematic glue for the entire film.
The final scene of Mack, Simon, and their loved ones gazing in wonder at the
Grand Canyon reestablishes our understanding of the narrative parity between the
lives of Mack and Simon. Simon’s question, “What do you think?” and Mack’s response, “I think… it’s not all bad. Not at all,”
provide an affirmative,
glass-half-full answer to the existential questions that have run throughout the
136-page screenplay.
The many scenes that occur between the car breakdown and visit to the Grand
Canyon further encourage the audience to think of “the married couple Mack and Claire and the
brother-sister pair of Simon and [Deborah]” as “roughly equal [in]
emphasis.”
As our analysis has shown, the ScripThreads graphs can help scholars, critics, and practitioners better appreciate how storytelling techniques shape the audience’s perception of a narrative. In the case of
Some research in the Digital Humanities begins with a fixed research question
and a clear process for gathering evidence. But as Sinclair, Ruecker, and
Radzikowska suggest, another important task for the Humanities is to locate (or discover) new
material, with no prior knowledge of the kinds of details used for
retrieval
After identifying the pattern, we began searching for all instances of the pattern both visually and mathematically. We used R to write and execute a simple algorithm that targeted the exported Character Stats CSV files and extracted, from each screenplay, information on the character with the highest percentage of involvement. ScripThreads’ Character Stats function calculates the percentage of character involvement by: A) identifying whether a character appears in a scene (yes or no); B) calculating how much of the screenplay any given scene takes up as a percentage; C) adding all of the percentage points for the instances when a character is present. When we conducted this analysis, roughly one quarter of the 935 screenplays did not parse properly, and we chose to disregard their results. We could have opted to use ScripThreads’ Advanced Settings and gone one-by-one through the screenplays that returned inaccurate results, adjusting the settings to more clearly identify the way a particular screenplay notes characters and scene breaks. And if we were using ScripThreads for close reading, this is exactly what we would have done. However, because we wanted to test how the ScripThreads prototype performed at scale, we moved forward in analyzing the reduced corpus of 674 screenplays. As we continue to improve the tool, we anticipate that the share of screenplays that parse accurately on the computer’s first pass will increase.
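Steps A through C of the involvement calculation can be sketched as follows. The sketch is written in Python rather than the R used for the study, and it is an illustration, not the Character Stats function itself: the function `involvement`, the (length, cast) input format, and the sample scenes are hypothetical.

```python
def involvement(scenes):
    """Percentage of the screenplay in which each character is present.

    `scenes` is a list of (length, cast) pairs, where length is the size
    of the scene in any consistent unit (for example, pages or lines).
    """
    total = sum(length for length, _ in scenes)
    shares = {}
    for length, cast in scenes:
        share = 100.0 * length / total  # step B: the scene's share of the whole
        for name in cast:               # step A: is the character present?
            shares[name] = shares.get(name, 0.0) + share  # step C: add it up
    return shares


# Invented example: three scenes totalling 10 units.
scenes = [(4, {"MACK"}), (3, {"MACK", "SIMON"}), (3, {"CLAIRE"})]
shares = involvement(scenes)
lead = max(shares, key=shares.get)
print(lead, round(shares[lead]))  # MACK 70
```

Taking the character with the largest share from each screenplay yields the "maximum character involvement" figure on which the rest of this section relies.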
Out of the 674 screenplays, the median percentage of maximum character involvement was 80% and the mean percentage was 78%. Only 70 of the screenplays (roughly one-tenth) had a main character present in 94% or more of the screenplay. This group of 70 screenplays formed the sub-set of screenplays featuring a hyper-present protagonist that we analyzed in more detail. Specifically, we were interested in three variables: historical era, genre, and author. Which, if any, of these variables held the most significance for stories featuring the lead character in every scene?
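The corpus-level summary and the 94% cut-off can be expressed compactly. The Python sketch below uses invented placeholder titles and percentages, not values from the AFSO corpus; the function `hyper_present` and its `threshold` parameter are hypothetical names.

```python
from statistics import mean, median


def hyper_present(max_involvement, threshold=94.0):
    """Titles whose most-present character meets or exceeds the threshold."""
    return sorted(
        title for title, pct in max_involvement.items() if pct >= threshold
    )


# Placeholder data: maximum character involvement (%) per screenplay.
corpus = {
    "Script A": 96.0,
    "Script B": 80.0,
    "Script C": 71.0,
    "Script D": 94.0,
    "Script E": 62.0,
}

print(median(corpus.values()))          # 80.0
print(round(mean(corpus.values()), 1))  # 80.6
print(hyper_present(corpus))            # ['Script A', 'Script D']
```

The threshold is left as a parameter because the choice of cut-off is an analytical decision: lowering it widens the sub-set, and raising it narrows the sub-set toward screenplays where the lead appears in literally every scene.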
When we examined the production dates of our sub-set of data, we found examples of the hyper-present protagonist in screenplays ranging from 1932 to the 2000s.
However, the data indicates that this storytelling tradition did not become prominent in American cinema until the 1940s. Out of the 189 screenplays from the 1930s that we analyzed, we found only two that clearly featured a hyper-present protagonist —
Ultimately, our historical analysis was a case in which distant reading
confirmed what the leading historians of film style and narrative have argued.
But as Matthew Jockers points out, we should not expect distant reading and
computational literary analysis to always overturn previous understandings.
There is value to “bring[ing] a new type of evidence
and a new perspective to the matter and in so doing fortify…the existing
hypothesis.”
In our analysis of genre, we confirmed certain existing assumptions about film
storytelling and challenged others. Screenwriting guidebooks generally discuss
the detective movie as the genre most likely to be presented as a
Interestingly, though, most of the hyper-present protagonist films we found did
not belong to the detective genre. Numerous dramas used the
If the detective genre is especially fertile ground for the hyper-present protagonist, then are there other genres in which such a character is rarely found? We have yet to find any occurrences in the romantic comedy or musical (a genre that frequently sets a romantic comedy story to song). The romantic comedy, by its nature, depends on multiple characters and obstacles to delay their happy union until the end. These conflicts and obstacles are often rooted in misunderstandings, which require the audience to know information that one of the characters does not know. Beyond the romantic comedy, we found that comedy screenplays, in general, almost never feature a hyper-present protagonist. One explanation might be that writers depend on the protagonist’s absence to create situations that serve as set-ups for the jokes later delivered verbally or physically by the protagonist. The very title of
Only one comedy screenplay in our dataset featured a hyper-present protagonist. This outlier was the Jim Carrey comedy
“What if a hotshot lawyer could not tell a lie for 24 hours due to his son’s birthday wish?” The premise enables screenwriters Paul Guay and Stephen Mazur to generate nearly all the jokes with the protagonist present. This comedy depends on incongruity, but it’s different from the incongruity derived from placing the Marx Brothers at fancy restaurants and the opera. Instead, the incongruity comes from the difference between Fletcher’s (Jim Carrey’s) compulsive lying before his son’s wish and how he must adapt after he can no longer tell a lie. This example demonstrates that a film’s concept (and perhaps the desire of producers to fully capitalize on their highly paid star) can override the pattern of character presence typical for a particular genre.
Finally, we explored the question of authorship as it relates to patterns of the protagonist’s presence. Do some screenwriters have a tendency to write films that focus on one protagonist and place that character in nearly every scene? The answer, we found, was yes. As noted earlier, screenwriter Brown Holmes co-wrote the two outliers we identified from the 1930s,
As we examined the AFSO corpus for hyper-present protagonists, one author leapt out at us for his tendency to structure films with a main character present in nearly every scene. The AFSO corpus contains the screenplays for fifteen films that were either written or co-written by Paul Schrader, who is best known for writing and directing dark, character-oriented dramas, such as
Schrader’s tendency to frame stories around single characters who are almost always present can be seen in Figure 14, which superimposes the fifteen increasing graphs of Schrader’s screenplays onto one another. The consistency of Schrader’s approach is clear from the cluster of red lines (each one representing the most present character from a different Schrader film) thrusting diagonally in a nearly straight line. For a point of comparison, we can turn back to Lawrence Kasdan. Figure 15 superimposes the seven Kasdan screenplays that are available in the AFSO corpus. The Kasdan image shows a screenwriter who works across numerous genres and utilizes a wide variety of storytelling approaches — one film featuring a hyper-present protagonist (
In some ways, Figure 14 provides an illustration for what film critics and scholars already assume about Schrader: he is a filmmaker who writes character studies about men who are psychologically and/or existentially anguished.
all my life has been dedicated to the existential hero, and the existential hero seems to have come to the end of his path, replaced by the ironic hero
Schrader’s best known screenplays follow the hyper-present, existential protagonist model we have described. However, a more heterogeneous portrait of Schrader as an author emerges when we examine his twelve unproduced screenplays in the AFSO corpus. To be clear, these are screenplays written by Schrader that were never made into films. Figure 16 shows the superimposed increasing graphs of these 12 screenplays. This group of screenplays includes four hyper-present protagonist scripts: three music-oriented biopics (
Schrader’s unproduced project in which the protagonist is least present (59%) is his script for a retelling of
The increasing graphs of Schrader’s unproduced work open up a series of
questions that we plan to investigate further. What is the relationship between
genre and authorship in how narratives are structured? How do certain genres or
stories override a screenwriter’s established narrative techniques? Finally, how
do industrial and cultural assumptions about a particular screenwriter shape the
types of projects the writer is offered and that make it into production? By
examining Schrader’s unproduced works at a distance, we can speculate that
industry assumptions about what constitutes a
In this article, we have introduced the ScripThreads tool and demonstrated how it can be used for closely analyzing one screenplay (
In using ScripThreads to closely analyze a single film, the continuities and
differences between the viewer’s perception and the computer’s visualizations are
a powerful starting point for uncovering storytelling techniques and better
understanding cognitive reception. In using ScripThreads to analyze a large
group of screenplays, the visualizations and CSV output files allow researchers
to recognize patterns without having prior knowledge of the films. To draw
accurate and meaningful conclusions, though, some domain expertise in film
history is essential (just as a researcher would want some knowledge of 19th-century literature before making arguments about
the century based on topic modeling 100 Victorian novels). Whether applied
toward close reading or distant reading, ScripThreads is meant to help researchers
gain a richer understanding of the text or texts they are studying. This is a
tool to aid Humanities scholars in analysis and interpretation, not a substitute
for screenwriting and criticism.
ScripThreads offers one additional affordance — the ability to quickly visualize the narrative structure of an unproduced screenplay. Films and television programs play out their stories visually in sequences of photographed and edited action; a produced screenplay has already been visualized for the screen. However, due to the difficulty and expense of making a film, the vast majority of screenplays are never produced, never visualized in this fashion. The increasing graphs of Schrader’s unproduced screenplays (Figure 16) provide what may be the first transformative visualizations of these twelve works. The graphs allow us to quickly recognize one way (the protagonist’s level of presence) that some of these stories differ from Schrader’s better known screenplays. What if we could apply a similar analysis to the screenplay libraries of Hollywood’s studios, producers, and talent agencies? The results would yield not simply graphs of character presences, co-occurrences, and absences, but transformative renderings of thousands of stories that have remained absent from audiences. We would no longer be visualizing American film history; we would be visualizing a history that might have been.