DHQ: Digital Humanities Quarterly
2021
Volume 15 Number 1

Using an Advanced Text Index Structure for Corpus Exploration in Digital Humanities

Abstract

With suitable index structures many corpus exploration tasks can be solved in an efficient way without rescanning the text repository in an online manner. In this paper we show that symmetric compacted directed acyclic word graphs (SCDAWGs) - a refinement of suffix trees - offer an ideal basis for corpus exploration, helping to answer many of the questions raised in DH research in an elegant way. From a simplified point of view, the advantages of SCDAWGs rely on two properties. First, needing linear computation time, the index offers a joint view on the similarities (in terms of common substrings) and differences between all text. Second, structural regularities of the index help to mine interesting portions of texts (such as phrases and concept names) and their relationship in a language independent way without using prior linguistic knowledge. As a demonstration of the power of these principles we look at text alignment, text reuse in distinct texts or between distinct authors, automated detection of concepts, temporal distribution of phrases in diachronic corpora, and related problems.

Introduction

A text/corpus index is a kind of table that, given a string $$w$$, stores the positions of all occurrences of $$w$$ in the given text/corpus. The computation of the index is a preprocessing step to be applied only once. Corpus index structures considerably simplify corpus analysis since they help to avoid rescanning the complete texts for each task. The index helps to directly answer many interesting questions, hence index based methods in general are much faster and more elegant. In this paper we look at an advanced corpus index structure and explain its potential use in Digital Humanities. This index, called symmetric compacted directed acyclid word graph (SCDAWG), is used to represent a collection of texts, each text is considered as a flat sequence of symbols [Inenaga et al. 2005]. SCDAWGs can be considered as a refinement of suffix trees [Weiner 1973], [McCreight 1976], [Ukkonen 1995], [Gusfield 1997]. The SCDAWG for a given text collection, as the suffix tree, can be computed in time and space linear in the size of the corpus. As a first advantage, the SCDAWG of a corpus is much smaller than the corresponding suffix tree. Even more importantly, SCDAWGs have two special features that make them very attractive for various corpus exploration tasks.
Finding co-occurrences. Using the SCDAWG it is simple to find all co-occurrences of the same portions of text in two, several, or all texts of the corpus. The index enables a joint search in all texts, there is no need for comparing all pairs of texts individually. This joint search, which leads to all co-occurrences, can be realized using a single scan of the index. The importance of this feature in our context should be obvious: research in Digital Humanities is often concentrated on the problem of finding phrases and text portions that occur in distinct (two, several, or all) texts of a collection. For all these problems the use of SCDAWGs offers an elegant solution. As a demonstration we look at alignment of two or several texts, detection of text reuses in distinct texts, and detection of text reuses between distinct authors.
Mining interesting concepts and their relationship. The nodes of the SCDAWG represent portions of text (sequences of symbols, infixes). In general, the number of all distinct infixes of a corpus $$C$$ is quadratic in size $$|C|$$ of the corpus. In contrast, the number of nodes of the SCDAWG is always linear in $$|C|$$. Yet, in a sense to be explained below, it contains all infixes that are interesting from a linguistic point of view.[1] For example, names, concepts and phrases etc. in general correspond to nodes of the SCDAWG. Compared to suffix trees, SCDAWGs yield a much smaller set of nodes/infixes with this property. Using structural regularities of the index, the set of all nodes can be even further filtered to approximate the subset of linguistically relevant infixes. Also hierarchical relationships and dependencies between these concepts and phrases can be detected. From the perspective of Digital Humanities, these mining techniques are relevant since they are completely language independent and do not use any kind of prior linguistic knowledge. They point to interesting concepts and bricks of text, the index helps to study their use, structure and relationship.
In this paper we first formally introduce SCDAWGs in Section 2. We compare three related index structures, SCDAWGs, suffix trees and DAWGs (directed acyclic graphs [Blumer et al. 1987]), compare the sizes of these index structures and explain the advantages of the SCDAWG for corpus exploration tasks as those mentioned above. In Section 3 we give a brief overview of the corpora used for the experiments described in the paper. In Section 4 we show how SCDAWGs can be utilized to solve tasks in the fields of text alignment and text reuse detection [Büchler et al. 2012]. For these problems, the full set of nodes of the SCDAWG is used. In Section 5 we sketch the afore-mentioned filtering of nodes/infixes based on structural regularities (in Gerdjikov and Schulz [2016] the filtering method is described in full detail). Using the example corpora we illustrate how concepts and their relationships can be mined in a language independent way. In Section 6 we look at extensions of the SCDAWG index, adding metadata information (e.g., on authors, temporal periods) from the texts. Examples from our corpora show how to detect text reuses, e.g., between distinct authors and to reveal the temporal flow of phrases across texts/authors. Before we come to a short conclusion, Section 7 provides loose ends for future research, indicating how SCDAWGs might help to treat diachronic language variation and various text classification tasks.

SCDAWGs as a corpus index structure

In the introduction we explained the general benefits that can be obtained for corpus analysis when using an index structure. If we are only interested in the distribution and occurrences of single words, a simple index structure is sufficient that represents the corpus as a “bag of words”. Usually an “inverted file” then stores for each word $$w$$ occurring in the corpus the texts and positions where $$w$$ occurs. However, corpus analysis in Digital Humanities and other fields is often focussed on other pieces of text: re-used portions of text, names, terminological and other multi-word expressions, phrases that express semantic relationships, syllables, morphems, to name a few. In this situation advanced index structures are preferable that give direct access to the occurrences of arbitrary infixes[2]. This explains why corpus analysis tools in Digital Humanities or Computational Biology are often based on the latter, more powerful type of index structures. Among the latter type of index structure, directed acyclic word graphs (DAWGs) and suffix trees have often been used for research, but refinements have not received proper attention. In this paper we describe compacted directed acyclic word graphs (CDAWGs) and a symmetric variant (SCDAWGs) and argue that these index structures are preferable for many corpus analysis tasks.
The common ideas behind DAWGS, suffix trees, CDAWGs and SCDAWGs are summarized in the following way.
• Each index represents a graph, nodes representing specific infixes of the corpus.
• Each infix of the corpus is represented at most once in the index.
• The total size of the index structure is linear in the size of the corpus.
• Given any string $$v$$ we may use the index to decide in time $$O(\vert v\vert)$$ if $$v$$ is an infix of the corpus.[3] It is also possible to use the index “as a guide” for finding all texts and positions where $$v$$ occurs.
The nodes of a DAWG represent the infixes $$v$$ that are “left-closed” in the sense that there does not exist a unique symbol $$\sigma$$ directly in front of each occurrence of $$v$$ in the corpus. The nodes of a suffix tree represent the infixes $$v$$ that are “right-closed” in the sense that there does not exist a unique symbol $$\sigma$$ directly after each occurrence of $$v$$ in the corpus. The CDAWG and the SCDAWG for a corpus have the same set of nodes representing the infixes $$v$$ that are “left- and right-closed” at the same time. Edges of a suffix tree, DAWG, or CDAWG represent extensions of infixes in reading order (i.e., to the right). SCDAWGs have two type of edges, respectively representing right extensions in standard reading order and left-extensions in left-to-right order. Example 2.1 As an example, consider the corpus with the two toy “texts”:
• op 1 in A
• op 2 in B
Infixes n and in are not left-closed since we find the symbol i directly in front of each occurrence of n and the blank directly in front of each occurence of in in the corpus. In contrast, infixes in , i and op 1 are left-closed. Each prefix of each text is left-closed, the corpus contains $$20$$ left-closed infixes, its DAWG has $$20$$ nodes (cf. Figure 1). Infixes o and op are not right-closed since we find the symbol p directly after each occurrence of o and the blank directly after each occurence of op in the corpus. In contrast, infixes op and p, 1 in A are right-closed. Each suffix of each text is right-closed, the corpus contains $$25$$ right-closed infixes, accordingly its suffix tree has $$25$$ nodes (cf. Figure 1). The only infixes that are left- and right-closed are the empty string, the blank, the strings op and in and the two texts. Hence the (S)CDAWG only has $$6$$ nodes (cf. Figure 1). Lines/transitions in the figures represent extensions of nodes/infixes. In the DAWG in Figure 1, node in has two extensions with $$A$$ and $$B$$, respectively leading to the left-closed infixes op 1 in A and op 2 in B. In the suffix tree, node op has two extensions with symbols $$1$$ and $$2$$, respectively leading to the right-closed infixes op 1 in A and op 2 in B. In the SCDAWG in Figure 1, black (blue) links represent right (left) extensions. Labels of left extensions are written in right-to-left order.
As we mentioned above, each index can be used to check if a string $$v$$ represents an infix of the corpus in time linear in $$\vert v\vert$$. We sketch the procedure for the SCDAWG in Figure 1. Assume we want to check if p 1 is an infix. Starting from the top node (empty string) we look for an extension to the right where the label starts with the letter p. We find an extension with label p leading to op. We continue to look for a right extension of the new node starting with 1. We find the extension with label 1 in A leading to op 1 in A. Hence p 1 is an infix.

The strength of SCDAWGs compared to DAWGs and suffix trees relies on three features.
1. Number of nodes. In general, the number of nodes in the (S)CDAWG of a corpus is much smaller than the number of nodes in a suffix tree or DAWG. This is seen in Figure 1. A more informative picture is found in Table 1 where we compare the size of these index structures for a small collection of corpora of distinct sizes. For the sake of completeness, the table also includes data for suffix tries, a simplified version of suffix trees where each suffix represents a node.
2. Naturalness of nodes. Most of the infixes representing the nodes of a (S)CDAWG are very natural from an intuitive point of view[4]: imagine we index the German version of Goethe's Faust. Among the nodes of the suffix tree, we will find right-closed infixes such as ephisto, phisto, retchen, and etchen. In the DAWG we find left-closed infixes such as Mephist, Mephis, Gretche and Gretch. In the (S)CDAWG we only have the two-sided closures Mephisto and Gretchen. The example indicates that two-sided closure is a very natural property when looking for “interesting” and “meaningful” infixes of a corpus.
3. Node Extension Property. Among the infixes that represent the nodes of the index (suffix tree, DAWG, or (S)CDAWG) in general there are many inclusion relations where one infix $$v$$ is a substring of another one $$w$$. In a suffix tree, DAWG, or CDAWG extension steps departing from node $$v$$ in many cases do not lead to node $$w$$. For example, in the suffix tree in Figure 1 we cannot reach op 1 in A from in following the transitions. In contrast, the two-directional edge structure of SCDAWGs completely covers infix containment in the sense that whenever node $$v$$ is a substring of $$w$$ there exists a chain of left and right extensions leading from $$v$$ to $$w$$. In other words, starting from a node $$v$$ and traversing the index we find each context containing $$v$$. For the sake of reference, this property will be called the “Node Extension Property”. It can be used, for example, to find in a very elegant way all texts where an infix $$v$$ occurs: starting from the node that represents (the two-sided closure of) $$v$$, compute all maximal chains of left extensions followed by all maximal chains of right extensions. The nodes reached at the end represent the texts in which $$v$$ occurs. For example, in Figure 1 starting from op or in we reach the nodes representing the two input texts.
 Suffix Trie Suffix Tree DAWG CDAWG SCDAWG abcbc$$abcab$$: 12 Bytes; 12 symbols Nodes 66 17 16 6 6 Edges 65 16 23 13 21 OCR page: 7 KB; 6.945 symbols Nodes 21.418.626 7.355 11.995 1.163 1.163 Edges 21.418.625 7.354 14.915 4.083 8.094 Excerpt EU-Corpus: 106 KB; ca. 106.000 symbols Nodes - 161.001 165.962 25.273 25.273 Edges - 161.000 227.515 86.826 170.358 Small Corpus: 1.2 MB; ca. 1.250.000 symbols Nodes - 1.921.704 1.922.811 366.070 366.070 Edges - 1.921.703 2.730.597 1.173.856 2.355.669
Table 1.
Comparing the size of distinct index structures for four input texts.
Node-documents function. Using the Node Extension Property in one scan of the SCDAWG we may compute and store a function $$T$$ that assigns to each node $$\mu$$ (represented as a number) the set of texts where $$\mu$$ occurs as an infix [Blumer et al. 1987]. $$T$$ is called the node-documents function. Let us assume that each text comes with a number. We define a procedure FindTexts, the input is a node number $$\mu$$, the output is a set of text numbers. The procedure is called in the form FindTexts(root). As a result, the function $$T$$ is computed.

Procedure FindTexts($$\mu$$)
if $$T(\mu)$$ is already defined, then
output $$T(\mu)$$;
else if $$\mu$$ is a leaf (text), then
let $$n$$ be the number of the text;
define $$T(\mu) := \{n\}$$;
output $$\{n\}$$;
otherwise
let $$\nu_1,\ldots,\nu_k$$ be the (left or right) immediate successors of $$\mu$$.
call FindTexts($$\nu_1$$), ...., FindTexts($$\nu_k$$);
define $$T(\mu) :=\bigcup_{i=1,\ldots, k} T(\nu_i);$$
output $$T(\mu)$$;

The procedure first visits all successor nodes of the input node $$\mu$$ and then assigns the union of the sets of text numbers of the direct successors to $$\mu$$. It is simple to see that each link of the index is used only once. Hence, if the number of all distinct texts is treated as a constant, the procedure is linear in the size of the index.

Corpora used for experiments

For the experiments we selected corpora of different periods, languages, and genres. Languages covered are Latin, English, German, French and Italian. Domains vary from political texts to poetry, lyrics and literature. Metadata include author, temporal data, and others.
Poems Corpus. The Poems Corpus consists of roughly 40k poems which had been compiled from TEI – XML annotated documents from the TextGrid Repository1. It contains poems written by German poets from about the time when printing was invented to the beginning of the 20th century. Amongst these authors are well known names like Johann Wolfgang v. Goethe, Friedrich Schiller, Joseph v. Eichendorff, or Georg Trakl. All poems are written in German language. In addition to the author each poem is furnished with the volume in which the poem was published as well as dates referring to the timespan in which the volume had been created. If this is not known the lifespan of the author is used for the dates.
Lyrics Corpus. The Lyrics Corpus contains about 15k of popular lyrics from songs, which had been present in the single charts of the German-speaking world, reaching from the mid nineteen fifties to nowadays. This corpus had been created by means of webcrawling. Aside from the interpreter it stores the year, in which the song had been in the charts, the chart position and whether the chart position had been a top 10 position, as well as the language, in which the song was performed, as its metadata. The language had been assigned automatically via n-gram based language classification. Since the manner of appearance of the considered single charts has changed over time, songs of the 80s and 90s are prevailing.
Parallel OCR Corpus. A smaller corpus of historic books, which also had been digitized by means of OCR. All documents in this collection had been created by two different OCR engines, leading to a parallel corpus of the same content. The corpus was only used in the alignments, and there no metadata had been considered.
Wittgenstein Corpus. The Wittgenstein Corpus consists of 3,871 original remarks of the German philosopher Ludwig Wittgenstein. Although the general language is German, the corpus also contains a few remarks in English.
Medline Corpus. The Medline Corpus is a collection of about 15 Million medical abstracts in English. Each abstract consists of about 10 sentences. We considered two subsets of this corpus. They were obtained by randomly selecting 0.5% and 10%, respectively, of the abstracts and gathering all the sentences in the resulting abstracts. The 0.5%-Medline Corpus consists of 800 K sentences, whereas the 10%-Medline Corpus consists of about 15 M sentences.

Using the index for alignment and text reuse

In this section we use the node-documents function defined in Section 2 to find large strings that co-occur in several texts of the collection. Such strings point to some form of text parallelism/reuse. Two applications are considered. In the first subsection we show how to efficiently align two or more parallel texts in a linear manner. In the second subsection we show how to explore large collections, looking at maximal strings that co-occur in several texts of the corpus

Index-based text alignment

In our context, to “align” two strings means to find a common linear representation where “common parts” of the two strings are shown, defining a linear sceleton, and where “deviations” or regions where the two strings do not agree are associated with the “holes” of the sceleton. The classical method for this task is the Needleman-Wunsch Algorithm [Needleman and Wunsch 1970]: The input are two texts $$A$$ and $$B$$ of length $$m=|A|$$, $$n=|B|$$. The algorithm finds an optimal alignment, given a scoring function. Using dynamic programming it fills the cells of a $$|A|\times |B|$$ matrix, cells come with “backward pointers”. The backward pointers are then used to find an optimal alignment. The complexity is $$O(mn)$$, which is problematic for large texts. E.g., comparing two texts with $$10^6$$ symbols each, we need several terrabytes to store the full matrix. For a better solution we look at a strategy to translate the problem
Finding longest common subsequences and longest increasing subsequences. Such a translation can be found in the longest common subsequence or the longest increasing subsequence problem: Assume the scoring function for an alignment assigns value 1 to each pair of matching symbols and value 0 to each disagreement (here: insertion, deletion, or substitution). Then the matching symbols of the alignment are given by the longest common subsequence (LCS) of the two strings. A subsequence of a string $$A= a_1\ldots a_m$$ is any string of the form $$A= a_{i_1}\ldots a_{j_k}$$ where $$k \leq m$$ and $$i_1 < \ldots < i_k$$. For example, “bdf” is a subsequence of “abcdef”.
This means that for a particular and natural scoring function, finding the LCS of two strings $$A$$ and $$B$$ helps to solve the global alignment problem for $$A$$ and $$B$$. To avoid the above matrix computation with its large complexity we may use an alternative approach to solve the LCS, utilizing the longest increasing subsequence (LIS) problem. Given a sequence of natural numbers, we look for the longest increasing subsequence. For example, the LIS for $$1-5-2-7-3-4$$ is $$1-2-3-4$$. Given $$A= a_1\ldots a_m$$ and $$B= b_1\ldots b_n$$, the LCS problem for $$A$$ and $$B$$ can be reduced to the LIS problem [Gusfield 1997]. We ignore the details of the translation, but look at the complexity of the new procedure for solving the LCS. Let $$r(i)$$ denote the number of occurrences of the letter $$a_i$$ in $$B$$, let $$r = \sum^m_{i=1} r(i)$$. Using translation into LIS, the LCS problem for $$A$$ and $$B$$ can be solved without dynamic programming in time $$O(r\ log(n))$$.
In many cases, using the LIS method helps to reduce the computation time for an alignment tasks. However, for large texts the value for the number $$r$$ is typically very large and the computation time remains enormous. For example, the “innocent” text $$A$$ shown in Figure 2, which is taken from a historical document, already leads to a value of $$r \sim 130,000$$ when compared with a similar text $$B$$. For more details on the LCS- or LIS- problem see [Gusfield 1997]. Improved index-based alignment. Consider the SCDAWG of two texts $$A$$ and $$B$$. Applying the Node-documents function $$T$$ it can be determined whether the infix represented by each node occurs in both texts. The main observation now is that using this information we immediately find large portions of the two texts that have a perfect alignment. We use the following definition.
Definition 4.1 An index node $$\mu$$ of the SCDAWG for two texts $$A$$ and $$B$$ is called quasi maximal if
1. the infix associated with $$\mu$$ occurs in both texts $$A$$ and $$B$$,
2. $$\mu$$ does not have any transition that leads to another node $$\mu$$’ with the Property 1.
As an illustration we refer to Figure 1. Our two example “texts” lead to two quasi maximal nodes op and in . Based on the Node-documents function there exists a simple procedure to find all quasi-maximal nodes: a queue is initialized with the root of the SCDAWG (empty string). We treat all entries of the queue in the order they are added to the queue. The treatment of a node $$\mu$$ involves two steps:
• We consider each node reached from $$\mu$$ with a single left or right transition in the SCDAWG. Each such node representing an infix occurring both in $$A$$ and $$B$$ is added to the end of the queue.
• If $$\mu$$ does not have any extension representing an infix occurring both in $$A$$ and $$B$$, then $$\mu$$ is added to the list of quasi maximal nodes.
Note that each index node is added to the queue and treated at most once. At the end we obtain the full list of all quasi maximal nodes. Simple additional calculations provide the positions of the occurrences of each quasi maximal nodes. Technical details are omitted.
The full alignment method proceeds as follows: We compute the SCDAWG index for the texts $$A$$ and $$B$$ (linear time). We traverse the index to find all quasi maximal nodes and the end positions of occurrences in $$A$$ and $$B$$. We use end positions and the LIS method to define the linear sceleton of the alignment. When using the LIS method, the important new point to note is that the strings corresponding to the quasi maximal nodes are treated as single “symbols”. In this way, the factor $$r$$ mentioned in the above formula is drastically reduced. In the experiment from Figure 2, the usual procedure leads to $$r \sim 130.000$$, using the SCDAWG and quasi maximal nodes the value is $$r=29$$. When aligning two OCR output texts of $$5.500$$ symbols, the standard method yields $$r \sim 2.300.000$$, using the SCDAWG and quasi maximal nodes the value is only $$r=169$$.
In a project related to improving OCR on historical documents we use this technique to align the outputs of distinct OCR engines for historical texts. “Holes” obtained from the alignment based on quasi maximal nodes sometimes cover portions of texts where a finer subanalysis helps to improve results. Figure 3 shows an alignment result for two OCR outputs and two texts, with distinct levels of disagreement.
Multiple string alignment. When dealing with $$n>2$$ texts $$A_1,\ldots, A_n$$, the notion of a quasi maximal node is generalized. Definition 4.2 An index node $$\mu$$ of the SCDAWG for $$A_1,\ldots, A_n$$ is called quasi maximal if
1. the infix associated with $$\mu$$ occurs in all texts $$A_1,\ldots, A_n$$,
2. $$\mu$$ does not have any transition that leads to another node $$\mu$$’ with Property 1.
For technical reasons we actually have the additional restriction that quasi maximal nodes used for alignment only occur once in each text. A generalization of the LIS algorithm can be used to find a maximal linear sequence of quasi maximal nodes of all texts, which gives the desired alignment sceleton. As an illustration, Figure 4 shows the simultaneous alignment of three texts. Two sequences represent parallel parts of the OCR-outputs of two OCR engines on a historical document, the third sequence shows the corresponding part of the ground truth file.

Detecting text reuse and similarities across collections}

When comparing large collections of texts $$A_1,\ldots A_n$$, in most cases it is not natural to ask for substrings that co-occur in all texts. For the following experiment a node of the SCDAWG index for $$A_1,\ldots A_n$$ is called quasi maximal if it is quasi maximal for two texts in sense of Definition 4.1. We have seen that we can extract the set of all quasi maximal nodes from the SCDAWG index in this sense, and for each such node $$\mu$$ with string $$v_{\mu}$$ we immediately get the information in which texts $$A_i$$ the string $$v_{\mu}$$ occurs. In Figure 5 this information is used to compute “survey” graphs for text reuses in two complete collections (Poems Corpus and Lyrics Corpus). Each graph contains two types of nodes. Large and small ellipses respectively represent quasi maximal nodes and texts numbers of the collection. For each node $$\mu$$ a prefix and a suffix of the node text $$v_{\mu}$$ is shown. A link from a quasi maximal node $$\mu$$ to a text number $$k$$ indicates that $$v_{\mu}$$ is a substring of $$A_k$$.
Depending on the text reuses in the collection, the graph becomes very large. Still it is very useful to find interesting subregions with eye-catching multiple text reuses. In the lyrics corpus, many reuse effects can be traced back to covered songs. Since the song texts in the corpus were collected in a community based way, texts differ in details. In the survey graph eye-catching regions are mainly caused by songs that have been covered many times. It is then possible to zoom to these regions and to see reused text portions and the texts where these findings occur. Figures 6and 7 show such zoomed regions. In Figure 6 it can be seen that five poems in the collection have many text reuse connections. Some examples for reused text are (cf. Figure 6):
• laut und leise, Unterric... träumt, Angenehme zu hören
• wälzend kam die Sturmesf ... der Chor vom ganzen Haine
In Figure 7 many test reuses are centered around the word “Hallelujah”. Examples are
• She tied you to a kitche .. Hallelujah, Hallelojah}
• s seen the light It's a -- ah Hallelujah, Hallelujah}

Refined “linguistic” view on a corpus

We have seen that the set of nodes of the (S)CDAWG for a corpus represents a relatively small set of infixes that often are “natural” portions of text. Experiments in Gerdjikov and Schulz (2016) show that many of the phrases corpus analyzers would look at (names, terminological expressions, stylistic phrases, ...) are typically left- and right-closed and appear among the nodes of the (S)CDAWG. This means that when just using the index nodes for corpus exploration we do not miss important portions of text.
From this point, the (S)CDAWG can be considered as a first step towards a new kind of “linguistic corpus index” where nodes - in the ideal case - exactly represent the linguistic units of interest (morphems, words, phrases, sentences, ...) of the corpus and their relationships. In a way, such an index would be the optimal basis for all corpus analysis tasks focussed on linguistic units [5]. However, for arbitrary texts there is not even a useful formal definition of a “phrase” or “linguistic unit”, let alone a procedure for finding this strings.
There are two possible main paths how to come closer to a “linguistic refinement” of the (S)CDAWG. On the one hand side, we can take any available NLP prodecure (parser, lexical analyser, phrase detector,..). Assuming that the units found represent nodes of the (S)CDAWG, we arrive at a (S)CDAWG substructure that directly points to units of interest and their occurrences in the corpus. On the other hand we may try to further analyze regularities found in the corpus to find a restricted set of nodes that comes closer to the idea of a linguistic phrase. In Gerdjikov and Schulz (2016), we followed the second path, which is completely language independent. Crucial assumptions are:
• phrases appear in distinct contexts,
• phrases are combined using function words as connectors, and
• sentences have natural decompositions into phrases, overlaps representing function words.
We then describe a bootstrapping method for finding function words, phrases and sentence decompositions into phrases.[6] Using the same kind of techniques, phrases can be further decomposed into sub(sub)phrases. As a matter of fact, for this form of decomposition also given lists of function words could be used. The main point is that the bootstrapping method in a completely unsupervised and language independent way leads to an interesting set of phrases and subphrases that helps to considerably reduce the set of nodes considered, thus coming closer to a “linguistic” index.
Finding important concepts of the corpus. When ignoring bordering function words, atomic subphrases obtained in this way typically represent content words or “concepts”. Analyzing the role of these concepts in phrase decomposition it is possible to compute a ranked list of characteristic concepts of the corpus. See Gerdjikov and Schulz (2016) for details. Below some examples for most characteristic atomic phrases in the example corpora are given. For example, in the Wittgenstein Corpus, central concepts found using the above methods are (cf. Figure 9) Bedeutung (meaning), Farbe(colour), Sprache (language), and Erklärung (explanation).
Exploring the use of concepts. After computing characteristic concepts, the aforementioned decomposition of phrases helps to find all larger phrases where a given concept occurs. In this way the use of the concept in the corpus can be studied. Figure 8 shows the hierarchical structure of phrases of the Wittgenstein Corpus extending Bedeutung.
Exploring the relationship of concepts in a visual way. A third goal is to explore interesting relationships between the concepts. Existing corpus exploration tools and automated approaches for exploring the paradigmatics of lexical units typically look at the co-occurrence of terms in documents, paragraphs, sentences or fixed size neighbourhoods [Schütze and Pedersen 1993],[Storjohann 2010]. Another, more “syntactic” view is obtained when looking at the co-occurrence of concepts in phrases. In Figures 9 and 10 we see the relationship of concepts (characteristic atomic kernels) in terms of co-occurrences in phrases. For example, in the Wittgenstein Corpus, Bedeutung (meaning) is strongly related to Erklärung (explanation), as can be seen from phrases like Erklärung der Bedeutung eines Worts (explanation of the meaning of a word). In the Medline Corpus (cf. Figure 10), concepts found to be related to tumor are, e.g., risk (witness phrase risk for tumor), chemotherapy (witness phrase tumor cells to chemotherapy), and vaccines (witness phrase tumor vaccines).

When metadata are available, many of the aforementioned techniques can be refined for revealing similarities in the space of metadata. Technically, we assign the metadata for the original documents to the nodes in the SCDAWG that represent those documents. For the rest of the nodes we can retrieve this information on demand by simple traversal of small parts of the SCDAWG.
Comparing authors. Our first illustration uses the Poems Corpus, where we have authorship and temporal metadata for texts. In Figure 11 we use the aforementioned refined “linguistic” view. After adding authorship information to the maximal nodes of the index we first adapt the notion of a quasi maximal node: a node $$\mu$$ with text $$v_{\mu}$$ is called quasi maximal with respect to authors if $$v_{\mu}$$ occurs in texts of two distinct authors, while any phrase that properly extends $$v_{\mu}$$ only occurs in the works of one author.
The nodes in Figure 11 represent important concepts in the corpus - some of the most characteristic atomic subphrases for the Poems Corpus in the sense considered above. Our language independent techniques revealed that Welt (world), Sonne (sun), Himmel (sky/heaven), Gold (gold), and Schnee (snow) are important concepts in the Poems Corpus. In Figure 11, each link between two concepts stands for a phrase that
• contains both concepts, and
• is quasi maximal with respect to authors.
The authors that have used the phrase are annotated with the link. For example it is seen that
• Schnee zur Erde (snow to earth) was both used by Nikolaus Lenau and Hermann Löns, and
• der Himmel mit der Erde (the heaven/sky with the earth) was both used by Betty Paoli and Friedrich Rückert.
This gives a basis for comparing two authors, now asking if both authors used similar (quasi-maximal) phrases with the central concepts of the corpus.
Temporal development of similarities between authors. In Figure 12 we see a dual variant of the graph where the role of nodes and edges is changed. Nodes now represent authors, authors are linked if they have used identical quasi maximal (w.r.t authors) phrases. Using colouring of nodes, also temporal information about authors is added - we simply used the year defining the middle point in the life of an author. The spectral ordering of colours red-green-blue-purple represents the temporal timeline from earlier periods to later periods. Links always point from earlier authors to later authors.
In the lower part of the graph we find a path from Anna Louisa Karsch (1756) Figure 12, circle marker (1)) $$\longrightarrow$$ Friedrich Gottlieb Klopstock (1763) $$\longrightarrow$$ Ernst Schulze (1803) Figure 12, circle marker (2)). The latter is also reached with a link from Johann Wolfgang von Goethe (1794) Figure 12, circle marker (3)). Goethe has several successors. In this graph, an even more central person is Ludwig Achim von Arnim (1806) Figure 12, circle marker (4)), who builds the center of the large cluster in the middle of the figure. The temporal spectrum of this cluster corresponds to the first half of the 19th century. Authors of the 17th and 18th century are found in the second cluster on the top right part. An early “influential” author of this period is Simon Dach (1632) Figure 12, circle marker (5)).
The two clusters indicate a change of literary concepts covered between the Baroque/Pre-Romantic period and the Romantic period. The three authors (John Brinckman (1842), Fritz Reuter (1842), and Klaus Groth (1859) Figure 12, circle marker (6)), who form the cluster right to Romantic period cluster however are not connected to the cluster consisting of other authors of their period due to the fact that all three were writing poems in the german dialect of niederdeutsch.
When constructing this kind of graph for the Lyrics Corpus similar observations can be made. The nodes of Figure 13 again are connected by co-occurring quasi maximal phrases. Instead of the author now band names are used together with the mean of the release years of their corresponding songs. The colour coding is the same as used in the previous experiment. One of the first things that can be observed is a cluster of Italian singers centered around Eros Ramazotti (1998) (See Figure 13, circle marker (1)), located at the right to the center of the graph. Also lots of subgraphs, which only contain of two or three interpreters are seen. They often vizualize relations between cover songs, which are contained in the corpus. Therefore for instance Right Said Fred (1998) and Peter Sarstedt (1969) (See Figure 13, circle marker (2)) can be found at the bottom left. Another example of this is a three node cluster at the top right with edges leading from Julie Covington (1977) and Don McLean (1972) to Madonna (1996) (See Figure 13, circle marker (3)), who famously released covers of the songs “Don't Cry For Me Argentina” and “American Pie”. Other subgraphs connect interpreters who obviously belong to the same genre, e.g. at little to the right of the center a path can be found leading from 2Pac (2000) $$\longrightarrow$$ Busta Rhymes (2001) $$\longrightarrow$$ Kanye West (2007) (See Figure 13, circle marker (4)), with the interpreters representing famous hip-hop artists.

Loose ends - diachronic language variation and classification tasks

Even though the focus of the current paper are the problems of text-reuse and mining of relationships based on it, we suggest that our approach can be extended to other related problems, e.g. diachronic language variation and text classification.
In Gerdjikov et al. (2013) and Sariev et al. (2014), the authors develop a historical text normalization system. Based on a modern dictionary and several thousands of pairs, historical word and its modern variant, they show that the DAWG structure can be used to automatically learn and extract spelling variations. The main benefit of the infix structure is that: (i) it does not restrict the spelling variation to any particular predefined patterns and (ii) it reflects the entire structure of the language (on word level). In Sariev et al. (2014) the combination of this technique with a language model for the modern language yields a complete historical text normalization system.
Of course, the requirement for training data sets practical limitations -- training data is often unavailable and its production is expensive and time-consuming. In Mitankin et al. (2014) the authors try to address this problem and suggest a completely automatic historical text normalization approach. The main idea is to automatically generate pairs of historical word and its modern variant. Mitankin et al. (2014) proposes to use Levenshtein edit distance and approximate search in the modern dictionary in order to generate for each historical word its most relevant modern variant(s). Afterwards we are in the situation to run the historical text normalization system from Mitankin et al. (2014) and Sariev et al. (2014). Unfortunately, the Levenshtein edit distance, as a generator of pairs, turns out to introduce a lot of noise in the system. In particular, the accuracy of the normalization system drops from 94% to 82%.
The results from the current paper suggest two straightforward approaches in order to reduce the noise introduced by the unsupervised approximate search. The first approach is the following. Instead of searching for historical words and their modern word variants, to search for historical phrases and their modern phrase variants, automatically extracted from the SCDAWG structure for the historical texts and modern texts, respectively. In this way we plug in additional language context in the query, which in general will exclude orthographically similar but semantically irrelevant candidates. On the other hand, if the modern text corpus covers the domain of the historical texts, it is possible that the basic phrases in the historic language have indeed been preserved in the modern language and do have their spelling variants in the modern language. Thus, without seriously reducing the recall, we could increase the quality of the automatically generated pairs historical phrases and their modern phrase var. The production of pairs of words is then straightforward[7].
The second approach is the following. Instead of modifying the searching space, we can modify the edit-distance and use more relevant edit-distance mechanism. As described in Section 4, we can use the SCDAWG structure in order to align different editions of the same source. In particular, if the editions belong to close time periods, the most disagreements in the alignment will be due to uncertainties in the spelling that have been typical in this time period. This phenomenon can be used in order to constrain the elementary edit-operations and the contexts in which they are applicable. As a result we can expect that the generator of pairs, historic word and its modern variant, will be more relevant with respect to the structure of the historical language.
We should stress that the arguments raised in the previous two paragraphs require further research. It is also a challenging open problem to directly map the SCDAWG structure of a historical corpus to a substructure of the SCDAWG structure for a large modern language. The latter, of course, would resolve the normalization problem.
The substrings of a SCDAWG index could also be used to train models for text classification tasks. A similiar approach, which uses maximal substrings of a suffix tree and led to promising results, had been persued in Okanohara and Tsujii (2009). As stated in Section 2 the two-sided closures, which form the nodes of the SCDAWG, represent more natural infixes than those of a suffix tree or DAWG. Therefore they may render descriptive features which still are not restricted to fixed word boundaries. To find optimal features in the substrings of a SCDAWG of a corpus again metadata can be used to find longest or shortest substrings which occur only in a certain instance of a metadata attribute. For example the longest/shortest substrings which belong to the author “Goethe” can be used as features to train a text classification model with regard of this metadata attribute.

Conclusion

In this paper we have shown how symmetric compact acyclic word graphs (SCDAWGs) can help to efficiently mine interesting portions of texts in corpora, which serve as a basis for many ways of comparing texts like alignment and the detection of text reuse (Section 4). These findings can be refined by adding a form of “linguistic analysis” where “phrases”, “subphrases”, “function words”, and “important content words” (characteristic atomic kernels) are computed in an unsupervised way without using prior linguistic knowledge (Section 5). If metadata for the texts in the collection are available, additional information stored in the index and the above techniques give a basis for finding similarities in the space of metadata (Section 6).
Of course many of the here mentioned techniques would require more detailed analysis and comparsion with other existing methods. However this was not the purpose of this paper, since its focus didn't lie on the development and description of a certain method but rather on showing how manifold and versatile the application of such an index can be to the task of large scale corpus exploration. In many cases, various variants exist for the concrete methods used in our examples. When moving to real applications, experts from Digital Humanities are needed to find those variants that are most adequate and lead to real insights.

Acknowledgements

We express our gratitude to: Uwe Springmann for his help with OCR recognition of historic documents and the corpus derived from them. We also thank Stefanie Schneider for the provision of the lyrics corpus. Part of the results described in this paper were obtained during the stay of the third author at LMU for which he has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007--2013) under REA grant agreement 625160.

Notes

[1] As a matter of fact, there is no formal definition of text units (sequences of symbols) that are linguistically relevant.
[2] A string $$v$$ is an infix of a text $$t$$ if $$t$$ can be represented as a concatenation of the form $$t=uvw$$. A string is an infix of a corpus if it is an infix of a text of the corpus.
[3] $$\vert v\vert$$ denotes the length of the string $$v$$.
[4] An exception are very small infixes, which often also are left- and right-closed in the above sense.
[5] More generally, the ideal index structure for corpus analyzers would point to all and exactly those infixes that are relevant for the analysis, their positions and their relationships.
[6] Since we have a symbol-based view on corpora, the expression function “word” is misleading. The blank, as well as strings such as as well as may serve as a function “word”.
[7] Actually, the approach in Gerdjikov et al. (2013) is not limited to words.

Works Cited

Blumer et al. 1987 Blumer, A., Blumer, J., Haussler, D., McConnell, R., and Ehrenfeucht, A.“Complete inverted files for efficient text retrieval and analysis”. Journal of the ACM (JACM) 34 (1987): 578-595.
Büchler et al. 2012 Büchler, M., Crane, G., Moritz, M., and Babeu, A. “ Increasing recall for text re-use in historical documents to support research in the humanities ”. In Proceedings of 16th International Conference on Theory and Practice of Digital Libraries, (tpdl 2012): pp. 95–100 Springer Berlin Heidelberg.
Gerdjikov and Schulz 2016 Gerdjikov, S., and Schulz, K. U. “Corpus analysis without prior linguistic knowledge-unsupervised mining of phrases and subphrase structure”. ArXiv e-prints (2016): 1602.05772.
Gerdjikov et al. 2013 Gerdjikov, S. and Mihov, S. and Nenchev, V. “Extraction of spelling variations from language structure for noisy text correction”. In Proc. Int. Conf. for Document Analysis and Recognition (2013): 324-328.
Gusfield 1997 Gusfield, D. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, Cambridge, 1997.
Inenaga et al. 2005 Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S., Mauri, G., and Pavesi, G.“On-line construction of compact directed acyclic word graphs”. Discrete Applied Mathematics 146 (2005): 156-179.
McCreight 1976 McCreight, Edward M.“A space-economical suffix tree construction algorithm”. Journal of the ACM (JACM) 23 (1976): 262-272.
Mitankin et al. 2014 Mitankin, P. and Gerdjikov, S. and Mihov, S. “ An approach to unsupervised historical text normalization ”. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage: (2014) 29-34.
Needleman and Wunsch 1970 Needleman, S. B., and Wunsch, C. D.“A general method applicable to the search for similarities in the amino acid sequence of two proteins.”. Journal of molecular biology 48 (1970): 443-453.
Okanohara and Tsujii 2009 Okanohara, Daisuke and Tsujii, Jun'ichi“Text categorization with all substring features”. In Proceedings of the 2009 SIAM International Conference on Data Mining: (2009) 839-846.
Sariev et al. 2014 Sariev, Andrei and Nenchev, Vladislav and Gerdjikov, Stefan and Mitankin, Petar and Ganchev, Hristo and Mihov, Stoyan and Tinchev, Tinko“Flexible noisy text correction”. In Proceedings of Document Analysis Systems: (2014)
Schütze and Pedersen 1993 Schütze, H., and Pedersen, J.“ A vector model for syntagmatic and paradigmatic relatedness ”. In Proceedings of the 9th Annual Conference of the UW Centre for the New OED and Text Research, (oed 1993): pp. 104–113.
Storjohann 2010 Storjohann, P. (Ed.) Lexical-semantic relations: theoretical and practical perspectives (Vol. 28). John Benjamins Publishing, Cambridge, 2010.
Ukkonen 1995 Ukkonen, Esko. “On-line construction of suffix trees”. Algorithmica14 (1995): 249-260.
Weiner 1973 Weiner, P. “Linear pattern matching algorithms”. In Proceedings of 14th Annual Symposium on Switching and Automata Theory, (swat 1973): pp. 1–11 IEEE.