Working on and with Categories for Text Analysis: Challenges and Findings from and for Digital Humanities Practices

Dominik Gerstorfer <dominik_dot_gerstorfer_at_tu-darmstadt_dot_de>, Technical University of Darmstadt

https://orcid.org/0000-0002-8095-2540

Evelyn Gius <evelyn_dot_gius_at_tu-darmstadt_dot_de>, Technical University of Darmstadt

https://orcid.org/0000-0001-8888-8419

Janina Jacke <janina_dot_jacke_at_uni-goettingen_dot_de>, Kiel University

https://orcid.org/0000-0001-7217-3136

Abstract

This is the editorial of the special issue “Working on and with Categories for Text Analysis.”

Why we need new theoretical and methodological perspectives on categories in the digital humanities

In the realm of digital humanities, computational social sciences and related fields, categories are omnipresent. Not only do we use them to systematize our objects of interest, such as texts or aesthetic artifacts, and organize their representations in databases and repositories. Categories also play an important role in text analysis, especially when extracting or annotating parts of texts in a structured manner.

Categories serve as powerful tools for both purposes. They allow, for example, the linguistic labeling of objects or elements and their subsequent grouping according to selected relevant features. In this context, categories seem suited to integrate two important complementary tasks: if they are based on adequate parameters, they can further a sensible reduction of complexity, thereby facilitating the analysis of and communication about complex textual artifacts and data sets. At the same time, categories offer the possibility of a detailed description through the creation of subcategories. When categories are organized in systems such as ontologies or taxonomies, they can additionally provide information about the relationship between relevant phenomena. Moreover, since creating categories usually requires defining terms explicitly, categories greatly facilitate the scholarly exchange of information on subjects in the humanities, cultural studies, or social sciences – among other things, through enhancing understanding and comparability of claims and hypotheses.

The prevalence of categories in the digital humanities can be attributed, in part, to the influence of standards from the formal sciences in this field. By contrast, developing and using categories is the exception rather than the rule in most traditional humanities disciplines, where it is limited to certain sub-disciplines. This contrast, together with the fact that categories are omnipresent in the digital humanities yet rarely the subject of systematic reflection, raises a number of questions: Which categories or which types of category systems are appropriate for objects in the humanities? What determines the validity and fruitfulness of categories in this field? How can we develop and revise category systems using existing or new procedures? And how can we employ categories to address complex, and often hermeneutical, questions that are central to most humanities disciplines?

Approaching categories in digital humanities from many perspectives

Answering these questions requires considering multiple perspectives, not only from different humanities disciplines and social sciences, but also from information science and technology. Taking this as impetus, we have organized two interdisciplinary workshops on the topic of categories in the digital humanities.[1] The first workshop focused on theoretical and formalistic aspects, exploring non-hierarchical concept ontologies and markup schemas. The second workshop emphasized methodological and application-oriented aspects, specifically the development and application of category systems for text research. Drawing from the inspiring ideas and projects presented and discussed there, we decided to collect the ideas in a special issue. With an open call for papers, we therefore invited the workshop contributors as well as other interested researchers to contribute to this issue.

One focus of this issue lies on the work on and with categories for text annotation and analysis. A second focus is on systems and methods for the organization and classification of texts in the context of databases – as well as the interplay between these two areas, where category systems play a crucial role. We requested that the contributions be based on concrete studies in the field of the digital humanities or related fields, or provide an information science perspective that has been, or can be, adapted in the digital humanities.

The contributions that were selected for this issue cover a wide range of work on and with categories. They delve not only into the development and application of category systems themselves and the disciplines that inform and require them, but also into the various conceptual and pragmatic problems encountered during these processes.

The contributions of this issue

In their contribution “Interlinking Text and Data with Semantic Annotation and Ontology Design Patterns to Analyse Historical Travelogues”, Sandra Balck, Ingo Frank and Hermann Beyer-Thoma present an approach to the digital edition of travelogues by Franz Xaver Bronner. The project is a modular digital research infrastructure for creating and distributing digital editions of historical travelogues. It uses a model that facilitates ontology-based semantic annotations and links between text and data by combining widely accepted standards such as CIDOC CRM, Dublin Core, and SKOS with the project's own TEI.

With “Category Development at the Interface of Interpretive Pragmalinguistic Annotation and Machine Learning: Annotation, detection and classification of linguistic routines of discourse referencing in political debates”, Michael Bender, Maria Becker, Carina Kiemes and Marcus Müller present an approach to detecting references to previous oral or written utterances in the speeches of the German Bundestag. They discuss the challenges that the phenomenon of discourse reference poses to human annotators and provide insights into the development of a robust annotation procedure as well as its integration into a machine learning approach. This approach is designed to yield both linguistically interesting findings and successful machine learning results.

The article “Making the Whole Greater than the Sum of its Parts: Taxonomy development as a site of negotiation and compromise in an interdisciplinary software development project” by Jennifer Edmond, Alejandro Benito-Santos, Michelle Doran, Roberto Therón, Michał Kozak, Cezary Mazurek and Eveline Wandl-Vogt presents the design of a taxonomy of sources of uncertainty in digital humanities datasets by an interdisciplinary and international group. A special focus of this contribution is on the process of finding common ground between different communities of practice that have been collaborating widely.

Marlene Ernst, Sebastian Gassner, Markus Gerstmeier and Malte Rehbein contribute an article titled “Categorising Legal Records – Deductive, Pragmatic, and Computational Strategies.” This paper presents three different approaches to categorizing semi-structured information concerning legal history. It discusses the development of a categorization system for the analysis of approximately 10,000 inventory entries for legal cases from the Special Court Munich (1933–1945).

In their article “Made to Be a Woman. A case study on the categorization of gender using an individuation-based approach in the analysis of literary texts”, Marie Flüh and Mareike Schumacher introduce a category system for gender analysis in texts that includes non-binary forms of gender. The authors make use of both theory-driven approaches (based on the gender roles presented in Simone de Beauvoir’s Second Sex) and data-driven approaches (using networked graph visualization). The practical use of the category system for text analysis is then demonstrated by applying it to a literary corpus inspired by de Beauvoir’s choice of texts from which she developed her gender roles.

Maria Hinzmann’s “Categorial Relations in (Re)constructing Topoi and in (Re)modeling Topology as a Methodology: Vertical, horizontal, heuristic and epistemological interdependencies” explores categories in terms of their textual manifestation, instantiation as concepts, underlying principles, and the overarching epistemological framework. Using the example of topoi in travel literature as a starting point, she puts forth five types of categories and identifies the key relationships between them that are pertinent to category systems in digital humanities.

Jan Horstmann, Christian Lück and Immanuel Normann address the formalization and annotation of intertextual relations in their paper “Systems of Intertextuality: Towards a formalization of text relations for manual annotation and automated reasoning”. Instead of focusing on automatic recognition, the article proposes a coherent category system using description logic to make intertextuality computable without sacrificing expressiveness. The model is presented in a machine-readable RDF format, allowing a structured representation of relations and related entities. The paper illustrates the application of this theory-driven model with examples from literary studies.

In their article “Are Ontologies Trees or Lattices?” Claus Huitfeldt and Michael Sperberg-McQueen argue that ontologies are not (simply) trees, nor are they (simply) lattices, just as territories are not (simply) maps. The authors contend that ontologies based on superset and subset relations, modeled as lattices, offer more flexibility and usefulness than tree models. They use the CATMA annotation system to evaluate their findings and assert that a lattice structure makes it easier for users and developers to perform essential tasks.

With their article “Visualization of Categorization: How to see the wood and the trees”, Ophir Münz-Manor and Itay Marienberg Milikowsky discuss how figurative language has been annotated in a corpus of Late Antiquity Hebrew Liturgical Poetry – first on paper and then with the annotation and analysis software CATMA – and which new kinds of insight this transition from analog to digital made possible. The contribution also introduces an approach, using the visualization tool Vis-À-Vis, to visualizing the annotations in a way that supports category-based hermeneutic speculation about the analyzed texts and phenomena.

In his case study “Annotating German in Austria: A Case-study of manual annotation in and for digital variationist linguistics”, Markus Pluschkovitz aims to contribute to the discussion on the epistemological status of annotation (in the sense of classification of data) using the annotation structure of the project “Deutsch in Österreich”. This project investigates the spoken varieties of German in Austria employing hierarchical stand-off annotations. Pluschkovitz illustrates how this method organizes annotations and introduces a semantic model that improves the transparency of the annotation process. Furthermore, he argues that the hierarchical organization provides a more robust epistemological foundation than one-dimensional annotations.

The paper “From Semi-structured Text to Tangible Categories: Analysing and annotating death lists in 18th-century newspaper issues” by Claudia Resch, Nina C. Rastinger and Thomas Kirchmair reports on the process of developing and applying categories for annotating death lists in the 18th-century Viennese newspaper Wien[n]erisches Diarium and provides examples of the usefulness of the generated metadata in the context of (quantitatively) analyzing topographical and biographical data.

The challenges that the absence of formalized concepts poses for computational analysis in digital humanities are discussed in Shanmugapriya T’s article “Developing Computational Models for Formalizing Concepts in the British Colonial India Corpus.” The exploration and analysis of the British Colonial India Corpus, with its intricate and non-standard format, are at the center of the contribution, which also implements a computational approach based on affinity propagation. Shanmugapriya T shows the possibilities and also the challenges of formalizing these concepts through concept-based formal models and metamodels for advanced text mining.

Angelika Zirker’s and Michael Göggelmann’s contribution “Case Study: Annotating the ambiguous modality of must in Jane Austen’s Emma” reports on a teaching experiment where students annotated occurrences of “must” in Jane Austen’s Emma according to its function – based on the assumption that the ambiguous quality of passages potentially containing free indirect discourse hinges on this word and its possible functions. A focus lies on the question of how students handled this task while using either CATMA or CorefAnnotator.

Conclusion

In terms of disciplines and research questions, and with regard to the suggested approaches to handling categories, the range of work presented in this special issue is considerable. What becomes apparent is the complexity that arises in reflections on categories, a complexity that takes many forms. We have therefore resisted the urge to categorize the approaches developed here. Instead, as readers may have noticed, we have presented the contributions in alphabetical order. Consolidation of work with categories, so it seems, needs to happen in many regards. For text analysis in the digital humanities, it involves covering the very phenomena that are captured in categories, devising organizing principles (categories of categories), and reflecting on the relations between these and the computational implementation of text analysis, to mention only the most critical aspects. Another interesting issue is where the criteria for developing and implementing the categories are set. Again, this spans a wide range of possibilities, as the contributions in this issue demonstrate: the criteria may foreground the underlying theoretical and conceptual work, the collaboration – and thus the intersubjective understanding – of experts, the computational possibilities and implications, or the very objective of the analysis. Most importantly, reflecting on categories brings to light many of the major challenges of digital humanities and opens them up for discussion. This is a conversation worth continuing.

Notes

[1] Both workshops were associated with the project forTEXT; workshop reports can be accessed via the project website: https://fortext.net/news/2020/workshop-report-non-hierarchical-concept-ontologies-and-markup-schemas and https://fortext.net/news/2021/workshop-report-development-and-application-of-category-systems-for-text-research.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

URL: http://www.digitalhumanities.org/dhq/vol/17/3/000704/000704.html
Affiliated with: Digital Scholarship in the Humanities
DHQ has been made possible in part by the National Endowment for the Humanities.

Unless otherwise noted, the DHQ web site and all DHQ published content are published under a Creative Commons Attribution-NoDerivatives 4.0 International License. Individual articles may carry a more permissive license, as described in the footer for the individual article, and in the article’s metadata.
