After completing a Ph.D. in English and Systems Design Engineering at the University of Waterloo, Adam James Bradley joined the Visualization for Information Analysis Lab (Vialab) at Ontario Tech University as a research scientist. Adam's work is concerned with the problems of subjectivity, ambiguity, and interpretation that are omnipresent in digital tool design and text analysis. In some way, all of Adam's work, from visual text analytics to interactive pen and paper applications to visual search platforms, tries to address the same question: How can we use technology to augment human actions in a way that allows us to be creative and imaginative, while still leveraging the power and speed of the machine?
Victor Sawal is a developer in the Visualization for Information Analysis Lab (Vialab) at Ontario Tech University.
Sheelagh Carpendale is a Canada Research Chair in Information Visualization in Computing Science at Simon Fraser University. Her many awards include the IEEE Visualization Career Award, an NSERC STEACIE, and a BAFTA. She is a Fellow in the Royal Society of Scientists and has been inducted into both the IEEE Visualization Academy and the ACM CHI Academy. By studying how people interact with data or information both in work and social settings, she works towards designing more natural, accessible and understandable interactive visual representations of data.
Christopher Collins received the PhD degree from University of Toronto in 2010. He is currently the Canada Research Chair in Linguistic Information Visualization and Associate Professor at Ontario Tech University. His research focus combines information visualization and human-computer interaction with natural language processing. He is a past member of the executive of the IEEE Visualization and Graphics Technical Committee and has served several roles on the IEEE VIS Conference Organizing Committee.
In this paper, we present a system that automatically adds visualizations and natural language processing applications to analog texts, using any web-based device with a camera. After taking a picture of a particular page or set of pages from a book or uploading an existing image, our system builds an interactive digital object that automatically inserts modular elements in a digital space. Leveraging the findings of previous studies, our framework augments the reading of analog texts with digital tools, making it possible to work with texts in both a digital and analog environment.
Presents a system for automatically adding visualizations and NLP applications to analog texts, using any web-based device with a camera.
Printed books persist in the workflow of scholars even though
there is a plethora of digital options available that afford great power and
flexibility to the user. Word processors and other applications have been
completely integrated into people’s daily lives and have started to replace pen
and paper as a modality for interacting with the written word; when it comes to
books, the affordances offered by digital platforms such as search and copy are
considered paradigm shifting additions to the act of reading. But, even though
these tools exist, scholars still write on paper and still have books on their
bookshelves. There is a tension that exists between these new digital formats
and our history. We often create digital tools to mimic the affordances of
books, but while they improve steadily, the weight, smell, and sounds of a book
are still unique to bound paper and ink. It is important to note that it is
still not known how these affordances affect cognition, and for literary
scholarship, interpretation. Mehta et al. studied fourteen literary critics and
found that each had idiosyncratic methods of marking up a literary document, but
most importantly, all of them engaged in some form of annotation when working
with poetic texts
Beyond the affordances of physical texts, many older texts used for research
often do not have reliable digital versions, and many corpora are still
digitized as images such as Early English Books Online
To start this project, we asked ourselves the question: How can we allow users
the ability to interact with both analog and digital text at the same time? Not
knowing exactly what we lose when we digitize a text in terms of interpretation
or cognition is a much larger problem, and we wanted to address how to interact
with these texts without completely discarding the originals. While there are
great efforts to digitize the world’s books, such as the Google books projects
In this paper, we present a framework that extends the power of the digital to physical books in near real-time. Our contribution is bringing together ideas studied in digital document spaces and existing word-scale visualizations to demonstrate how these known quantities can be leveraged to bridge analog and digital reading and writing. Our framework is informed by previous results describing document spaces and the different ways they are used with analog documents. We outline each of the document spaces and describe how they can be used in tool design, and we implement a prototype to demonstrate the robustness of this framework (see Figure 1).
Through Textension we offer quick access, applicable to the analog text at hand, to an integrated digital/analog environment, only requiring common equipment such as the camera in a phone and a preferred web-enabled device. Simply photograph or upload an existing picture of a document, display it in Textension on a web-browser and start interacting. By making paper documents interactive on mobile devices Textension allows for a smooth transition between our history and our present by allowing users a quick way to digitize documents while working on-site in places like libraries. Our system produces in-line visualizations and interactive elements directly on the newly built digital document, allowing for work to continue while having an augmented digital document at the ready.
While we present a prototype in this paper, we see our main contribution as a discussion and amalgamation of what is possible when bringing together analog and digital affordances as it relates to text. We see Textension as a way to leverage past studies and computational approaches to natural language to quickly create digital documents from analog texts using readily accessible mobile technology.
Bridging the affordances of paper documents and digital technologies is not a new pursuit. Many projects have attempted to cross the boundaries between these two modalities to leverage what is best in both.
Early work on Fluid Documents
Pen-based systems that cross between physical and digital interfaces have
also been explored by Weibel et al.
Paper-augmented digital documents merge physical annotations on paper with a
digital representation
Our tool uses several computer vision techniques to aid in the OCR from the
camera of a mobile device. The following papers all describe techniques that
helped us to consider the problems and solutions of in-the-wild document
digitization. Digitizing historical documents is a difficult prospect and
the binarization and filtering techniques presented in da Silva’s work
Post-processing for text documents tends to be idiosyncratic to the documents
themselves. For example, a document with a completely white background may
not need to be gamma corrected. We built into our tool the ability to
control multiple processing parameters. Influence for this decision came
from work such as perspective correction
Text vis is an enormous subsection of information visualization. Rather than
a survey of available text visualizations, we have included a list of
references that directly affected our work and design decisions. Early text
visualizations such as ThemeRiver and TextArc demonstrate novel ways of
visualizing text
Perhaps the pioneering work in word-scale visualization was Tufte’s proposal
of Sparklines
Previous studies on digital document interaction, annotations, e-readers, and
marginal interactions
From the previous work, we have identified five different types of spaces in
digital documents that can be augmented:
Word Space: Space inside the bounding boxes of words, lines, and paragraphs.
Line Space: The space within the bounding box of the text on the page that makes up white space between the lines.
Margin Space: Space that is outside of the text bounding box but within the boundaries of the document itself.
Occlusion Space: Any overlay on the document, whether permanent or impermanent, that covers up the existing text or space.
Canvas Space: Space that is created outside of the borders of the original text and can be infinitely expanded.
It is important to note when discussing document spaces as part of a larger
framework that they can be used alone or in concert with each other and that
sometimes the lines blur between them. For example, the occlusion space bounding box of a paragraph includes the line spaces from that paragraph and the word spaces above each word. Despite this stacking, it would be
possible, for example, to design annotations which are complementary at each
level.
Word Space is any portion of a document that has the printed
word on it. The OCR engine we used, Tesseract, can identify and create bounding
boxes around both printed and hand-written words, lines, and paragraphs.
Depending on the specific implementation, word space could be considered any or
all of these. In an analog book the actual printed type is an element that can
be interacted with cognitively, but as we move into digital representations of
that text it allows us to alter and query the text in interesting ways. The
possibilities here are great, with one of the motivations being simply on-demand
OCR. But when the text is digitized it opens up possibilities for text analysis,
computational linguistics, and machine learning based on language. Marshall et
al. reference both highlighting and annotation as a way that this space is
interacted with on paper documents
Line space is any space that exists between existing printed
lines but remains inside the bounding box of the entire block of printed text on
the page. Digitizing documents using our framework allows for on-demand opening
and closing of these spaces. The manipulation of line space can create room for
additional elements, such as ink annotations, inserted figures, and data
visualizations that relate to the text. We synthesize the background of the
document to avoid jarring the reader, and once this is done effectively any
amount of space can be added to the document. Previous studies have found that
this space is most often used for annotation, specifically in-line notes and for
connectors such as arrows between words.
Margin space is any area outside of the bounding box of the
text but still within the bounds of the original document. This is the space
commonly used for free-form note-taking. For example, when studying, editing, or
conducting a close reading of a document
Occlusion space is a layer that covers the document.
Additions in the occlusion space can obscure the text, so semi-transparent and
impermanent elements are appropriate to maintain document legibility. In an
analog setting, the occlusion space is often used with physical additions such
as sticky notes. This space can be accessed in multiple ways, but the underlying
function of the space tends to be information that is needed in the moment, but
not on a continual basis. We demonstrate how to interact with this space by
using tool tips that show the definitions of words on demand. Most of the
techniques we demonstrate in this paper could be taken from one of the other
spaces and placed into occlusion space; the question of permanence or
impermanence is a decision that rests with the designer.
Canvas space is a concept that we could loosely attribute to
the desk that a printed book is placed on or the space outside a page of text
taped on a whiteboard. When we move to a digital representation of the book, the
canvas space could be infinitely expandable allowing for the insertion of larger
interactive elements. We have chosen to demonstrate how linguistic analysis
could lead to automatic insertion of images, figures, and tables within the
canvas space. It is important to note that external images could also be
inserted in this space as Cheema et al. propose in AnnotateVis
For this project, we imagined a tool that scholars of the humanities could use to do their required work on printed manuscripts, edited collections, and books, while still having access to digital affordances. Imagine the scenario where a literary scholar is in a rare book archive and cannot write directly on the document, or a scenario where a humanist is interested in the linguistic statistics of a text, but lacks the training to execute the digitization and processing of a text using code. The latter is one of the main problems that is ever present within the emerging field of digital humanities, the roadblock of technical knowledge needed to produce tools. We set out to build an extensible framework that would allow a humanities scholar with limited technical knowledge the ability to process, augment, and export digital versions of analog texts. To achieve this, we bring together multiple technologies including OCR, machine translation, and information visualization. By enumerating the set of spaces that can be used for these techniques and demonstrating their possibilities with examples, we hope to inspire users to add their own document augmentations to our existing framework.
The Textension framework starts with a document image, processes the image to discern both the content as well as the use of space on the page, adds space to the page as needed, and creates augmentations, both static and interactive, to insert into the newly digital object. The resulting processed image is presented to the user for further exploration, annotation, and interaction. The framework architecture is illustrated in Figure 2.
We provide a specific implementation of Textension in a web-based system which offers a selection of document augmentations and interactive tools, which we will describe in this section.
When a user comes to the opening screen of Textension they are presented with two input options. They can either use the camera that is built into their device (webcam, phone camera, front facing tablet camera) and take a snapshot of the document they wish to process, or they can upload an image file that has been previously prepared. Document images can be single or multiple pages and are uploaded with a drag and drop interface. The next stage of document processing begins immediately after the upload completes. The system uses image processing from the Python Imaging Library and image manipulation from OpenCV. We have found that a combination of binarization, grey-scaling, and image sharpening has a noticeable effect on the results of the OCR, which is the next stage of processing.
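The three preprocessing steps named above can be illustrated without an imaging library. The following is a minimal sketch, assuming the image is a nested list of RGB tuples; the function names `to_grey`, `sharpen`, and `binarize`, the fixed threshold of 128, and the 3x3 kernel are illustrative assumptions rather than the settings used in the prototype, which relies on the Python Imaging Library and OpenCV.

```python
def to_grey(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


def sharpen(grey):
    """Apply a simple 3x3 sharpening kernel; border pixels are left unchanged."""
    height, width = len(grey), len(grey[0])
    out = [row[:] for row in grey]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            value = (5 * grey[y][x]
                     - grey[y - 1][x] - grey[y + 1][x]
                     - grey[y][x - 1] - grey[y][x + 1])
            out[y][x] = max(0, min(255, value))  # clamp to the valid range
    return out


def binarize(grey, threshold=128):
    """Map each pixel to 0 (ink) or 255 (paper) using a fixed threshold."""
    return [[0 if value < threshold else 255 for value in row] for row in grey]
```

In this sketch the steps would be chained as `binarize(sharpen(to_grey(image)))`; which order and which parameters work best is likely to depend on the document, as the discussion of adjustable processing parameters suggests.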
We used the open source Tesseract OCR engine in the Textension prototype.
Smith provides an overview and a history of the development of the
engine
While great effort is being taken by many companies to digitize the world’s books, this process is expensive, hardware dependent, and time-consuming. The Tesseract OCR engine provides high-quality open source OCR in a local setting for printed and hand-written text. Often when working between analog and digital platforms scholars are forced to type passages out, for example, to extract a quote from a book for insertion into a manuscript. This is because many digital book readers, for copyright reasons, do not give access to text that can be copied and pasted, or in some cases provide only images of paper documents. Our domain expert has been using Textension to quickly digitize small portions of text for inclusion into working documents. This application of Textension allows for easy transfer of quotable information from analog books to digital platforms using only the OCR functionality. In addition, the OCR engine provides the content in a machine-readable form for later linguistic processing, linking, and other augmentations. Tesseract also provides bounding boxes for each word, which we use to identify document spaces.
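To make the link between bounding boxes and document spaces concrete, the following sketch derives the text bounding box, the line spaces, and the margin space from a set of line boxes. The `Box` fields mirror the left, top, width, and height values that Tesseract reports for each element; the class and the helper names `union`, `line_spaces`, and `margins` are hypothetical and are not taken from the Textension code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self):
        return self.left + self.width

    @property
    def bottom(self):
        return self.top + self.height


def union(boxes):
    """Smallest box enclosing all given boxes (the text bounding box)."""
    left = min(b.left for b in boxes)
    top = min(b.top for b in boxes)
    right = max(b.right for b in boxes)
    bottom = max(b.bottom for b in boxes)
    return Box(left, top, right - left, bottom - top)


def line_spaces(line_boxes):
    """Vertical gaps (top, bottom) between consecutive lines of text."""
    ordered = sorted(line_boxes, key=lambda b: b.top)
    return [(upper.bottom, lower.top)
            for upper, lower in zip(ordered, ordered[1:])
            if lower.top > upper.bottom]


def margins(text_box, page_width, page_height):
    """Width of the margin space on each side of the text bounding box."""
    return {"left": text_box.left,
            "top": text_box.top,
            "right": page_width - text_box.right,
            "bottom": page_height - text_box.bottom}
```

Word space corresponds to the boxes themselves, so only the line and margin spaces need to be computed; occlusion and canvas space are layers added on top of or around the page rather than regions of it.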
In order to augment documents with helpful annotations, or to provide space for users to make pen-based annotations, document spaces often need to be enlarged. This is not possible when working with an analog document. However, in the digital version, we can manipulate the image to provide the needed space. For example, to place a translation of text between lines, the inter-line spacing first needs to be increased. Document backgrounds can be complicated, with changing lighting conditions almost guaranteed using mobile phone and tablet cameras. To retain the original look of the document image, we created a method for inserting space by synthesizing sections of the document background which seamlessly integrate with the original. These regions can optionally be marked clearly, for example by using a different color. This may be preferable in situations where differentiating the original document from manipulations is important, for example in archival and preservation work. Background pre-processing is a computationally intensive process, so an option for low or high-resolution processing is included. Low-resolution processing is suitable for quick interactive applications, whereas high-resolution processing is more suitable for printing and saving the results of an analysis session.
To improve image capture quality, which can affect background synthesis, we provide the user with a frame to set their image in. While it is possible to adjust skew correction and automatically crop text from images, we found from our internal testing that forcing the user to frame the image themselves resulted in much better OCR and therefore a much better experience. There is precedent for this type of interaction in commercial settings such as remote cheque deposits for online banking, where a user is forced to frame and focus the cheque before the system will accept the image. Once we have the image we use the bounding boxes provided by the OCR engine to rebuild the document in image fragments within the web platform. Each space and word is modeled separately to allow us to manipulate those elements within the browser.
The important image regions for the space insertion algorithm are
illustrated in Figure 3. To expand the
space between the lines (expand line space) in
high-resolution mode we first use Hough line detection to identify
x-values where vertical lines exist on the page
The algorithm then copies a slice of the image from between each
individual line from one edge of the page to the other. The height of
the slice (the pixel cut height) is set to the height of
the unimpeded space between the bounding boxes of the lines of text
above and below. For the low-resolution processing, copies of this slice
are inserted vertically to create space in the document. Depending on
the complexity of the background this process is sometimes adequate.
However, in most cases, this results in image streaking, which is usable
for testing and exploration, but can be distracting and is not
sufficient for photo realistic background additions.
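The low-resolution approach can be sketched as follows, assuming the image is a list of pixel rows and that the slice to copy starts at row `gap_top` and is `cut_height` rows tall. The function name `insert_line_space` and its parameters are hypothetical; the sketch only illustrates why repeated copies of the same slice produce visible streaks.

```python
def insert_line_space(rows, gap_top, cut_height, extra):
    """Insert `extra` rows below a slice of background by tiling copies of it.

    rows       -- image as a list of pixel rows
    gap_top    -- index of the first row of the clear slice between two lines
    cut_height -- number of rows in the slice (the pixel cut height)
    extra      -- number of rows of new space to insert
    """
    slice_rows = rows[gap_top:gap_top + cut_height]
    if not slice_rows:
        raise ValueError("the slice between the lines is empty")
    # Repeat the slice top to bottom until enough rows have been produced.
    filler = [list(slice_rows[i % len(slice_rows)]) for i in range(extra)]
    insert_at = gap_top + len(slice_rows)
    return rows[:insert_at] + filler + rows[insert_at:]
```

Because every inserted row is an exact copy of a row from the slice, any variation along the slice is repeated vertically, which is the streaking effect described above.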
As local lighting and color effects are so prevalent in scanned and
photographed documents, especially historical documents, we wanted to
model the backgrounds from as local a position as possible. The
intuition behind our approach is that we can randomly reorder pixels in
a local region to reduce streaking while retaining local lighting. From
the extracted slice of the document, we select a patch of the original
image that has the dimensions of pixel cut height by pixel cut width. To
insert a new patch below it we
simply randomize the pixels from the current patch and insert it. As we
scan across the line we continue this in increments of pixel cut width
until the line is complete. The one exception is when
the patch location falls within a definable threshold of the x-values of
the vertical lines found by the Hough line algorithm at the start of the
process. In this case, the pixels in that patch are not randomized but
rather copied, to retain the sharpness of the detected edge. This allows
us to sample local lighting effects in the background of the image. For
adding horizontal space between words a similar process is used. When
space is added only between two words on a single line, we also add the
same amount of space to all lines by distributing it across all
inter-word spaces. This preserves the original justification of the
document (see Figure 4).
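The justification-preserving step can be illustrated with a short sketch. It assumes each line is represented only by the widths of its inter-word spaces, and the names `distribute_space` and `widen_lines` are hypothetical. The line the user interacts with receives all of the new space in one gap, while every other line receives the same total spread across its gaps.

```python
def distribute_space(extra, gap_count):
    """Split `extra` pixels across `gap_count` gaps as evenly as possible."""
    if gap_count <= 0:
        return []
    base, remainder = divmod(extra, gap_count)
    return [base + (1 if i < remainder else 0) for i in range(gap_count)]


def widen_lines(lines, target_line, target_gap, extra):
    """Return new gap widths after opening `extra` pixels in one gap.

    lines -- list of lines, each a list of inter-word gap widths in pixels
    """
    result = []
    for index, gaps in enumerate(lines):
        if index == target_line:
            widened = list(gaps)
            widened[target_gap] += extra          # all space goes to one gap
        else:
            shares = distribute_space(extra, len(gaps))
            widened = [gap + share for gap, share in zip(gaps, shares)]
        result.append(widened)
    return result
```

Every line with at least one gap grows by the same number of pixels, so the right edge of the text block stays aligned. A line consisting of a single word has no gap to absorb the space and would need separate handling, which this sketch does not attempt.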
The pixel cut height and pixel cut width
parameters control the locality of the modeling. Reasonable defaults are
provided, but they can be varied in the settings screen to obtain the
best result.
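The patch-based synthesis described in this section can be sketched as below. The sketch assumes the extracted slice is a list of pixel rows and that the x-values of detected vertical lines are already available; the function name `synthesize_band`, the `edge_tolerance` parameter, and the exact test for "near a vertical line" are assumptions made for illustration.

```python
import random


def synthesize_band(band, cut_width, edge_xs=(), edge_tolerance=2, rng=None):
    """Build a new band of background by shuffling pixels patch by patch.

    band      -- list of pixel rows; its height is the pixel cut height
    cut_width -- width of each patch (the pixel cut width)
    edge_xs   -- x-values of vertical lines; patches near them are copied
    """
    rng = rng or random.Random()
    height, width = len(band), len(band[0])
    out = [row[:] for row in band]           # start from a plain copy
    for x0 in range(0, width, cut_width):
        x1 = min(x0 + cut_width, width)
        near_edge = any(x0 - edge_tolerance <= ex < x1 + edge_tolerance
                        for ex in edge_xs)
        if near_edge:
            continue                         # keep the copy to preserve the edge
        pixels = [band[y][x] for y in range(height) for x in range(x0, x1)]
        rng.shuffle(pixels)                  # reorder pixels within the patch
        shuffled = iter(pixels)
        for y in range(height):
            for x in range(x0, x1):
                out[y][x] = next(shuffled)
    return out
```

Smaller values of `cut_width` keep the shuffled pixels closer to where they came from, which preserves local lighting at the cost of less mixing; larger values mix more but can blur gradients across the page.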
We have also found that artifacts on the page disrupt the color balance within each individual cut and can affect the quality of the background rendering. We offer the option of removing artifacts on the input screen. To remove blemishes on the page, the cropped line is binarized, and any pixels that turn black in the binarization process are excluded from the color randomization that synthesizes the background color. This leaves the original artifact but does not propagate it during the space insertion process.
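One way to picture the artifact handling is as a filter on the pool of pixels available for randomization. The sketch below assumes grey-scale pixel values and a fixed binarization threshold of 128; the names `clean_pixel_pool` and `synthesize_clean_patch` are hypothetical. Because the filtered pool is smaller than the patch, this sketch samples from it with replacement, which is a simplification rather than the method described in the text.

```python
import random


def clean_pixel_pool(patch, threshold=128):
    """Collect pixels that stay white when the patch is binarized."""
    return [value for row in patch for value in row if value >= threshold]


def synthesize_clean_patch(patch, threshold=128, rng=None):
    """Fill a patch-sized region using only non-artifact pixels."""
    rng = rng or random.Random()
    pool = clean_pixel_pool(patch, threshold)
    if not pool:
        # Nothing usable in this patch; fall back to a plain copy.
        return [row[:] for row in patch]
    return [[rng.choice(pool) for _ in row] for row in patch]
```

The original patch is never modified, so the blemish remains visible in its original position but does not appear in the synthesized space.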
Background synthesis and artifact removal are pre-processed across all candidate regions of the document to a pre-set threshold of inserted space. The synthesized image data is stored for quick access and insertion during document augmentation and interaction.
After creating an interactive, expandable document from the captured image, augmentations can be added to provide supportive features as required for the specific task and context. For example, a learner may require word definitions, while a literary scholar may be interested in the contemporary use of the words in the document. Augmentations can take the form of inserted glyphs, images, overlays, and annotations in the document spaces, or they may replace or change the words in the document. Augmentations can be temporary or permanent, as appropriate for their purpose and the document space in which they appear. The insertion and placement of augmentations and the provision of interactivity on the document and its augmentations is provided by the layout engine (see Figure 5).
After upload, the images are broken into individual word and space objects that are then recompiled in order onto an HTML canvas to reproduce the original image with the added flexibility of moving, inserting, and changing elements. Augmentations are placed on the canvas as a layer on top of the image objects. Textension has been developed to support the creation of new augmentations, which can draw on custom data processing, local datasets, or public APIs and data. Textension was built using Flask, a Python server back-end; Bootstrap, for UI elements; Jinja, a template engine for Python; and jQuery, for data handling.
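The idea of rebuilding a line from word and space objects can be sketched in a few lines. The prototype does this in the browser on an HTML canvas; the Python version below only illustrates the data model, and the `Fragment` class and the functions `layout_line` and `widen_space` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    kind: str        # "word" or "space"
    width: int       # width of the image fragment in pixels
    text: str = ""   # OCR text for word fragments


def layout_line(fragments, x_start=0):
    """Place fragments left to right and return (x, fragment) pairs."""
    placed, x = [], x_start
    for fragment in fragments:
        placed.append((x, fragment))
        x += fragment.width
    return placed


def widen_space(fragments, index, extra):
    """Return a new fragment list with one space fragment made wider."""
    target = fragments[index]
    if target.kind != "space":
        raise ValueError("only space fragments can be widened")
    updated = list(fragments)
    updated[index] = Fragment("space", target.width + extra)
    return updated
```

Since positions are recomputed from the ordered fragment list, widening one space automatically shifts everything to its right, which is what makes insertion and substitution cheap once the page has been split into pieces.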
What we present in this section are a series of concrete implementations of document augmentations that demonstrate a subset of the possibilities of the Textension framework. The selected examples highlight the possible breadth available when considering the five document spaces and the possibilities for interaction with those spaces. We explore insertion augmentations, as well as temporary and permanent overlays. The availability of the plain text allows easy integration of natural language processing, and the fact that the digital document is built in pieces allows for easy insertion of space to accommodate for the adding of new features. In this way, we envision Textension as both a sandbox for designing interactive elements for digital documents and a way to use both digital and analog affordances simultaneously when working with texts.
To allow for interaction with books off the shelf we wanted to cross over
between digital and analog affordances. The ability to write notes directly
on the pages of a book is one of the analog affordances that is constantly
used, much to the dismay of librarians throughout the world. As seen in
studies by Marshall line space and the margin space.
The pre-processed background is used to insert space within the text, to
allow for writing notes or inserting elements such as maps.
There are two ways to insert space in Textension. The first is to simply tap and hold on a line with the stylus, or click with a mouse, and that line will open up, allowing space for writing. This works both vertically and horizontally; the trigger for vertical line creation is in the space between lines and for the horizontal space insertion it is in the space between words (see Figure 4). The size of the space to be added is determined in the parameter settings for the tool. By default, we have set the opening increment at 20 pixels. Inserted space can be edited as a text box by clicking on it, or, with draw mode toggled on, written on freehand in any way the user chooses (see Figure 6). The second method is to open all spaces of a given type (e.g. line spaces) through a menu function. Spaces can also be inserted by other augmentations which require space. For example, inserting space is a precursor to inserting translations between lines. Together, these methods allow for flexible interaction that can be used for editing and annotation.
Once space has been opened up we wanted to maintain the ability to type and write on the document. Both functions have a toggle and a color picker that allows for this type of interactivity. You can draw anywhere on the document, but typing has been constrained to text boxes that have been created in the new space of the document (see Figure 6).
Often when digitizing analog texts OCR confidence is very important.
Textension provides a feature where users can see an overlay of how
uncertain the OCR algorithm was for each word. This augmentation is
displayed as an overlay in the word space. The darker
the color the less confident the score. This mapping was designed to draw
attention to and
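A mapping from OCR confidence to overlay darkness could look like the sketch below. It assumes confidence is reported on a 0 to 100 scale, as in Tesseract's word-level output, and that the overlay is a black layer whose opacity grows as confidence falls; the function name `confidence_to_overlay`, the choice of black, and the `max_alpha` cap are assumptions, since the text does not specify the color scale.

```python
def confidence_to_overlay(conf, max_alpha=0.6):
    """Return an (r, g, b, alpha) overlay color for one word.

    Lower confidence gives a more opaque, and therefore darker, overlay.
    Values outside 0-100 are clamped, so a missing score is treated as 0.
    """
    clamped = max(0.0, min(100.0, float(conf)))
    alpha = round(max_alpha * (100.0 - clamped) / 100.0, 3)
    return (0, 0, 0, alpha)
```

Capping the opacity keeps even the least certain words legible underneath the overlay, which matters because those are the words a reader is most likely to want to check.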
While the OCR confidence overlay can allow users to understand how well their
scan has been processed, we added another feature that works in concert with
the previous ones to place text within the line space.
The OCR text feature will spread all the lines in the document and insert
the OCR text into editable text boxes (see Figure
7). The user can then correct the OCR directly on the newly built
document and export the finished text into a text file. Our own project’s
domain expert is already using this feature to solve the problem of not
being able to cut and paste from digital books while writing Humanities
essays. He has been taking pictures of quotations from physical copies of
the books and exporting the OCR directly into a word processor.
Once the user has tuned the OCR to their liking, they can toggle the
auto-translate menu button which will then use the Google Translate API and
automatically insert a translation of the text in the line
space (see Figure 8). This method
works for all of the languages currently supported by Google and its one
limitation is typographical, in that books often split words on the end of
lines. Future work will address this limitation.
The manipulation of the word space on the level of the
text is an option that could be used in many scenarios. This widget provides
the ability to select an individual word, erase it from the document image,
and substitute in a word provided by the user. New words are scaled to fit
into the space of the existing word and are highlighted to show that they
were additions. Possible scenarios for this type of inclusion could be
manual translation, gender pronoun switching or switching between the Latin
and common names of scientific organisms.
With named entity recognition, we demonstrate the power of digital
affordances with photographed texts by inserting maps. During
pre-processing, we detect and store place names within the text. When the
map feature is toggled Textension highlights the place name in the document,
and automatically inserts a map from the Google Maps API directly into the
document in the margin space. This feature is a
demonstration of the power of combining existing technology, such as the
Google Maps API, with automatic document space expansion. Because the
document is built in pieces we can freely move interactive elements into
different document spaces to see which works best for the specific
implementation.
Word space visualizations have been showing promise as
ways to augment digital texts
A context map lists all of the ways that a particular word or phrase has been
used within a document. This digital affordance uses canvas
space to build interactive concordance lines that highlight the
four words before and after the word in question. The maps are built using
the image patches of words in the document to maintain the document
aesthetics and reduce the impact of OCR errors. This is an example of the
types of things that can be done with ready access to linguistic information
and expandable canvas space (see Figure
10).
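The concordance lines behind a context map can be sketched with plain OCR tokens. The function name `context_map` is hypothetical, and the sketch works on text only; the prototype assembles the same rows from the image patches of the words rather than from the recognized strings.

```python
def context_map(tokens, target, window=4):
    """Return (left context, match, right context) for each use of `target`.

    tokens -- the document's words in reading order
    window -- number of words to keep on each side of the match
    """
    target = target.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            rows.append((left, token, right))
    return rows
```

Keeping the token index `i` alongside each row would let a renderer look up the matching image patches, so that OCR errors in the surrounding words do not appear in the displayed lines.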
The second word space visualization we implemented was
showing the uniqueness of each word within the language (see Figure 11). When this mode is toggled active
the user is given two time sliders, one for the upper and one for the lower
bounds of the time in question and small bar charts are automatically
inserted for each word in the document. Each chart is a relative
representation of the word’s uniqueness within the given document, meaning
that as the upper and lower bounds time sliders are adjusted, each glyph
will adjust relative to the other. This type of interaction could be
adjusted to address specific historical questions from literary scholars and
could be extended to display anything that has data relating to the text.
Possible scenarios for this include showing etymological information, usage
information, or using color as a visual variable to display languages of
origin.
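The relative scaling of the bar charts can be illustrated under some loudly stated assumptions: that a table of per-year usage counts is available for each word, that a word is considered more unique the less it was used in the selected period, and that bars are scaled against the most unique word in the document. The function `relative_uniqueness`, the `yearly_freq` table, and the inverse-frequency score are all hypothetical; the text does not specify the data source or the formula.

```python
def relative_uniqueness(words, yearly_freq, start_year, end_year):
    """Score each word from 0 to 1 relative to the other words in the document.

    words       -- words appearing in the document
    yearly_freq -- mapping of word -> {year: usage count} (assumed data)
    start_year, end_year -- bounds chosen with the two time sliders
    """
    totals = {}
    for word in set(words):
        series = yearly_freq.get(word, {})
        totals[word] = sum(count for year, count in series.items()
                           if start_year <= year <= end_year)
    # Rarer words score higher; adding 1 avoids division by zero.
    raw = {word: 1.0 / (1.0 + total) for word, total in totals.items()}
    top = max(raw.values(), default=1.0)
    return {word: score / top for word, score in raw.items()}
```

Because every score is divided by the largest score in the document, moving either slider changes all bars together, which matches the behavior described above where each glyph adjusts relative to the others.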
To demonstrate the possibilities of the occlusion space
we have implemented a widget that allows the user to hover on a word within
the newly digitized document and get dictionary information scraped from
Webster’s Online Dictionary API
As the features are toggled on and off within the tool, the user is given the option to save both the editable text as an external text document and high-resolution images of the current state within the program. The user can export whatever document state they create. Many features of Textension can be used at the same time so it is possible to create and export multiple variations of a single document.
The tension that exists between our analog pasts and our digital present can be addressed using our framework. Our prototype, Textension, leverages the power of OCR and digitally manipulates the five document spaces in near real-time. The system we present is an implementation of previous studies brought together in a way that can be extended easily for domain-specific analysis tasks. The web-based platform allows for easy integration with mobile technology and makes it possible to use Textension in a variety of locations and scenarios. We have demonstrated a breadth of possible use cases and have chosen widgets that demonstrate the utility of each of the document spaces.
Textension can be used in situations where quick digitization is necessary
and digital versions may not be available, such as within the stacks of a
university library. The web-based framework allows for easy document
digitization using any web-enabled camera. This could be on a mobile phone
or a desktop computer with a webcam. The system also allows for the
uploading of previously digitized texts. The drag and drop interface allows
for the uploading of PDFs and digital images, providing a robust range of
input possibilities. The provided text augmentations support humanities
activities such as annotation and close reading. By bringing in linked
reference resources such as maps and lexical uniqueness scores, Textension
can situate an unknown text in the greater spatial and linguistic context,
assisting with tasks associated with distant reading
While we designed Textension to demonstrate the usefulness and power of
bringing together affordances from paper and digital documents, there are
still several ways that we can expand the system. The first is by using
larger canvases. Textension focuses on space within documents and provides a
limited extended canvas space. An infinite canvas workspace, such as the
zoomable interface of PAD++
The second addition that we envision is to apply these techniques in the opposite direction, namely to augment digital books with the same types of interactive elements. We have already seen some of these approaches within existing e-readers like Amazon’s Kindle, but there is a lot of room to experiment with that design space. Another welcome addition would be horizontal space organization to solve problems like line breaks when using the Google Translate API. Because the OCR uses lines as an organizing principle, hyphenated words often disrupt the OCR and the translation algorithm. Reconnecting hyphenated words would require reflowing the document to maintain a justified layout.
A final addition that would solve a problem in the digital humanities is to provide a way to easily create new augmentations for Textension so that users with limited programming abilities would be able to add features to the interface. The current implementation makes it easy to add new features with modest programming skills, but we would like to make that more accessible in the future.