India is a multilingual society with hundreds of years of continuous literary traditions
in dozens of languages, some stretching as far back as two and a half millennia. We use
“India” here not only to refer to the contemporary nation-state but also to the greater
Indian subcontinent, taking note of the complex, overlapping territorial histories of
the premodern “Hind” or “Hindustan,” the “des” or “desh,” “British India,” and Cold War
“South Asia.” Additionally, we acknowledge the interconnectedness of India and its
diaspora — the world's largest — with other regions. For any project on Indian
literature(s), it makes more sense to speak of multiple literary traditions instead of
one. When we speak of “Indian poetics,” we are actually referring to the longstanding
traditions of poetics in what came to be understood in the modern period as distinct
literary traditions in multiple languages, such as Hindi, Urdu, Bengali, Marathi, Tamil,
and Telugu among many others, which cumulatively stand for “Indian Literature(s).”
While the specificities of literary traditions are of interest in themselves, we choose
to emphasize commonalities and interactions in various poetic traditions of modern
Indian languages by reading them comparatively.
Indian literature, imagined this way, is an enormously broad conception that demands
philological skills that exceed the capacity of any single individual. Therefore, for
our research project, we started with Hindi and Urdu, two languages with which our
larger research group, consisting of students and faculty in India and the United
States, generally felt some familiarity. These languages have an entangled yet estranged
relationship, and they are the two Indian languages, other than English, that our
multilingual team knows the best. It is not unusual in India for a person to know two or
three languages well enough to share, at least partially, in their literary traditions. Languages
such as Punjabi and Bengali have some common sources of vocabulary and generally similar
grammars to Hindi and Urdu but are written in different scripts with varying
pronunciations. Other languages, such as Malayalam, have, in addition to script and
phoneme differences, fully divergent grammars — Dravidian rather than Indo-Aryan —
though they also incorporate some common sources, such as Sanskrit.
Hindi and Urdu share a common ancestry: the North Indian speech of the Delhi area.
Literary histories of “Urdu” and “Hindi” name this common ancestor Hindi/Hindui/Hindavi, though they note that this name could simply be Persian for “Indian” or for any language spoken in India (Hind). Even though the literary
language around Delhi began to put “undue — and sometimes even almost mindless emphasis on ‘correct’ or ‘standard, sanctioned’ speech in poetry and prose” sometime in
the eighteenth century, perhaps due to the growing prestige of Persian, it was the British bias in
language policy that eventually led to the separation of the North Indian speech into
two separate languages defined along religious lines [
Faruqi 2003, 850] [
McGregor 2003, 912–957].
The gulf only widened further in the late nineteenth century, when a Hindi-language
movement led to the development of Modern Standard Hindi. The movement fashioned Hindi
as the language of Hindus by embracing Sanskrit-derived vocabulary and disparaging
“Urdu” and its literature as foreign to South Asia because of its “Persian” language
elements and metaphors [
King 1994]. Tied primarily to the ambitions of identity
formation in the emerging struggles of nationalism, as well as the economic competition
that was a consequence of colonial linguistic policy in North India, Hindi and
Urdu were reconceptualized as representing different religious identities [
Rai 2001].
That separation saw its high point during the partition of British India and continues to
reverberate across South Asia today. We believe that the intimate yet fraught nature of
the history that these languages share provides a fruitful ground for comparisons.
Our collaborative project between the Department of English at Jamia Millia Islamia and
Michigan State University, titled “Digital Apprehensions of Indian Poetics,” is
supported by India's Ministry of Education’s Scheme for Promotion of Academic and
Research Collaboration (SPARC), which aims to facilitate academic and research
collaboration between higher education institutes in India and abroad. This new
collaboration facilitates data-intensive textual studies in Indian languages by creating
machine-readable, genre-specific corpora and by curating datasets of annotations. We
attend to the specific conditions of Indian and South Asian languages, for which there
are few scholarly digital editions and where access to computing is often minimal, even
at research institutes and universities [
Shanmugapriya and Menon 2020]. Our goal is to
create digital editions based on minimal computing principles for use by scholars, the
public, and students and teachers in the classroom, as well as for ingestion into
information systems. This involves creating corpora using optical character recognition
(OCR) and adapting existing digital resources, which include those developed by
passionate individuals based outside of the academy, or citizen scholars. Therefore,
we aim to facilitate the study of texts by the general public, as well as professional
scholars. Additionally, we aim to build a vocabulary to document, interpret, analyze,
and visualize these textual corpora using linked open data, visualizations, and
keywords. Throughout our work, we presume a multilingual audience and aim to facilitate
digital humanities research across Indian languages.
In what follows, we expound on our desired outcomes, approach, and architecture. As
noted above, we aim to develop digital critical editions and datasets of annotations,
including poetic keywords. We approach corpora development as a form of making rooted in
the principle of jugaad, a North Indian term for reuse and
innovation in the presence of constraints. We then describe the minimal computing architecture that we have adopted for our publishing, editing, and annotation work, an architecture that can be accessed across Indian languages — starting
with Hindi, Urdu, and English — using minimal computational resources. Finally, we
conclude with a note on plain text and the promise it holds for data-intensive textual
studies in Indian languages.
Critical Editions and Keywords for Indian Poetics
Critical editions have long been a central component of traditional humanities, primarily
for framing our “perception of history, literature, art, thinking, language” by
establishing “reliable sources for research” and authorizing and canonizing “certain
readings” [
Sahle 2016, 19]. Digital critical editions add a new suite of possibilities,
including “interactivity, multimedia, hypertext, and immaterial and highly dynamic (or
fluctuating) ways of representing content,” which are absent from printed critical
editions [
Hillesund et al. 2017, 122]. The digital paradigm renders the static text
of a printed critical edition into a “laboratory where the user is invited to work with
the text more actively,” with the help of integrated features and tools “allowing for
customization, personalization, manipulation and contribution” [
Sahle 2016, 30].
Harnessing the networked nature of computation, we envision developing a digital
platform for scholarly collaboration in publishing. Our goal is to curate and annotate
critical editions of individual poets' works, while providing critical bibliographies to
facilitate research and appreciation.
Through such digital critical editions, our aim is not merely to “bring the past into
the future,” but also to furnish it for newer modes of computational inquiries
[
Hillesund et al. 2017, 123]. Remediating the text not only breathes life into
discussions of Indian poetics but also makes the works of a variety of poets more easily
accessible to scholars and readers alike. More than that, the endeavor also enables
reconsideration of questions about the very notion of the digital text and what it means
to read digitally within the context of South Asian languages and literature(s).
We draw from Raymond Williams’ method of “developing accounts of words as reflective
essays” for Indian poetics in order to make sense of our textual corpus [
The Keywords Project 2020].
Through a mix of distant and close reading approaches on a textual corpus
derived through OCR, we hope to develop a historically informed vocabulary of poetic
terms and genres that are crucial for understanding the technical field of poetry in
Indian languages. We extend Williams’ keyword approach to take stock of the emergent
vocabulary that continues to displace established frameworks in the humanities,
especially as a consequence of shifting media paradigms. Our objective with the
fundamental vocabulary of poetic traditions is not to focus on “fixing (their)
definition”; rather, like Williams, we hope to explore the “complex uses” of a variety
of conceptual categories to convey the contested nature of their meanings as clearly and
succinctly as possible [
Bennett 2005, xvii]. When we speak of a fundamental
vocabulary, we are obviously presupposing the existence of a canon, and our plan is to
foreground the basic implications of the accepted canon in respective literary traditions
as much as it is to explore the marginal.
While we want to make this resource accessible to a general audience, we are also keen
on designing it for use by specialists, by teachers and students in the classroom, and in computational or traditional research, by providing a snapshot of the historical
evolution of the semantic fields associated with these terms. While the historical
variations of meaning might not be as evident for topics pertaining to prosody, we
believe this might be particularly useful for pinning down both the “history of words
and the contestation of their meanings” for technical poetic vocabularies in different
literary traditions over generations, eras, and epochs, acknowledging their
discontinuities, ruptures, erasures, and reconstructions, especially in moments of
political and social upheaval [
The Keywords Project 2020].
[1]
Making Digital Corpora for Indian Languages
Because of the absence of digital corpora for Indian languages, especially for
literary texts, “making” became a necessary component of our research. Making has
undoubtedly been central to digital humanities and, as recent debates on the issue have
clarified, doesn’t necessarily have to begin “from scratch” but instead can start “in
media res” [
Sayers 2017, 11]. This conception of making — rooted in collaboration,
sharing, sustainability, and improvising upon what already exists while critically
attending to the specificities of the localities where such making is taken up — focuses
on maintaining or remaking instead of reinventing the wheel. Such “critical making,” in
Matt Ratto’s sense of the term, not only opens the disciplinary boundaries of digital
humanities to introspection but also enriches the field by drawing upon practices unfamiliar to its predominantly Anglo-American roots.
For us, critical making also entails localization, whereby we hope to develop, reuse,
and repurpose open-source tools and technologies for the particularities of Indian
languages and scripts. Our conception of making is rooted in
jugaad, an intrinsically Indian way of making do, as Pankaj Sekhsaria
describes, by “reconfiguring materialities” of existing tools to “overcome obstacles
and find solutions” [
Sekhsaria 2013, 1153]. Our conception of
jugaad, which leans less towards engineering than towards bricolage, is intrinsically
tied to localization, and because of this, we inhabit a position closer to a bricoleur
than to an engineer. We see our labor as bricolage, in the sense that Claude
Lévi-Strauss articulates: using what's at hand, we design new tools by redeploying a
finite set of heterogeneous tools in contexts for which they were not originally
designed, in turn not only extending the intended applications of these tools, but also
renewing or enriching the stock of tools for future use [
Levi-Strauss 1974]. For
Sekhsaria, jugaad underlies every conversation on technological innovation in India, “particularly north of the Vindhya Mountain Range,” that is, the territories associated with the Indo-Aryan languages in which the word occurs; given that the Vindhya range essentially splits the country in half, this also testifies to the word's prevalence among a large part of the Indian populace [Sekhsaria 2013, 1153].
Analyzing
jugaad as a “techno-myth” — alongside other cultural
practices of making in the Global South that Ernesto Oroza would deem “technological
disobedience” [
Gil 2016] — Kat Braybrooke and Tim Jordan contrast the element of
necessity in
jugaad with the element of leisure and choice in the
maker movement in the West [
Braybrooke and Jordan 2017, 30–31]. This notion of necessity
in
jugaad is also central to Padmini Ray Murray and Chris Hand’s
account of making culture in Indian digital humanities, wherein they advocate for a
critical dialogue between academic practices and local modes of making [
Murray and Hand 2015].
Jugaad,
then, as a quintessentially Indian practice, is primarily a way of making do in response to resource constraints. To borrow from Oroza, “misery is not an
alternative” for us, as we are motivated to engage with “hybrid” means to achieve our
scholarly objectives [
Gil 2016].
For example, our initial efforts to localize open-source OCR tools for Indian languages
take advantage of two ongoing development efforts in France and Germany. First, we
focus on creating ground truth corpora to provide accurate transcriptions for training
and testing of both automatic text recognition (ATR) and automatic layout analysis (ALA)
models for historical documents in Indian languages using eScriptorium. eScriptorium is
an open-source tool, currently under development for handwritten text recognition (HTR) at École Pratique des Hautes Études, Université Paris Sciences et Lettres (EPHE – PSL),
that adapts well for bidirectional scripts. It enables easy application of
state-of-the-art neural networks for transcribing and annotating historical documents
using an intuitive graphical user interface in modern web browsers. Developed as part of
the Scripta project
[2], it integrates the Kraken
engine for OCR/HTR
[3], and can be installed both locally and as preconfigured Docker
images [
Stokes et al. 2021] [
Kiessling 2019].
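Once ground-truth pages have been transcribed and a recognition model trained, the Kraken engine can also be scripted outside the eScriptorium interface. The following is a minimal sketch, assuming Kraken's documented legacy Python API and a hypothetical, locally trained model file (urdu_print.mlmodel); exact calls may vary across Kraken versions:

# Hedged sketch: recognizing one page image with Kraken's Python API.
# "urdu_print.mlmodel" is a hypothetical, locally trained recognition model.
from PIL import Image
from kraken import binarization, pageseg, rpred
from kraken.lib import models

image = Image.open("page.png")
bw = binarization.nlbin(image)                      # binarize the page image
# right-to-left reading order for Nastaliq; use "horizontal-lr" for Devanagari
segmentation = pageseg.segment(bw, text_direction="horizontal-rl")
model = models.load_any("urdu_print.mlmodel")

# rpred yields one record per detected line, with the predicted transcription
for record in rpred.rpred(model, bw, segmentation):
    print(record.prediction)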
We also draw from ongoing developments within the context of the DFG funded OCR-D
project
[4]
[
Neudecker et al. 2019]. Although it is designed for a different context, OCR-D’s clear
guidelines for transcribing and labelling ground truth in the PAGE-XML format offer us a
broader framework to maintain consistency across our ground truth. More importantly, we
rely on the modularity of OCR-D’s software stack for streamlining workflows best suited
for our documents. OCR-D maintains, builds, and repackages a collection of open-source
projects in the form of what they call modules. These modules can be painlessly
installed either with Makefiles or as pre-built Docker images. Each module comes with an
array of tools, which are called processors. These processors can be easily called from
the command line for each task in an end-to-end OCR pipeline. Here, image data stored in
a workspace or local directories can be easily exposed to either a single processor or
several processors sequentially for different processing tasks through their
corresponding Metadata Encoding and Transmission Standard (METS) files. Once a task is
finished, the information about the processing step(s) is written to the METS file,
while the processed data is saved in the respective output directory. This integrated
pipeline with pre-configured dependencies becomes extremely salient and functional for
the needs of our larger group.
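To give a sense of how such a pipeline looks in practice, the sketch below chains a few processors from Python. The processor names, file groups, and workspace path are illustrative assumptions rather than our fixed workflow; each call reads one file group (-I), writes another (-O), and records its step in the workspace's METS file:

# Hedged sketch: driving an OCR-D workspace from Python via subprocess.
# Processor names and file groups are examples, not a prescribed pipeline.
import subprocess

workspace = "ocrd-workspace/"   # hypothetical directory containing mets.xml and OCR-D-IMG

steps = [
    ["ocrd-cis-ocropy-binarize", "-I", "OCR-D-IMG", "-O", "OCR-D-BIN"],
    ["ocrd-tesserocr-segment-region", "-I", "OCR-D-BIN", "-O", "OCR-D-SEG"],
    ["ocrd-tesserocr-recognize", "-I", "OCR-D-SEG", "-O", "OCR-D-OCR"],
]

for step in steps:
    # each processor appends its output and provenance to the METS file
    subprocess.run(step, cwd=workspace, check=True)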
In this part of our project, working with historical documents in different layouts and typefaces, we will create a publicly available textual corpus in Hindi and Urdu that can be
further processed downstream for a host of data-driven humanistic inquiries. These
include information extraction methods, such as named entity recognition. Our objective
is to work primarily with publicly available digital images accessed through
International Image Interoperability Framework (IIIF) API endpoints hosted at different
cultural institutions, including the British Library and Internet Archive. Additionally,
we will publish well-documented state-of-the-art OCR models in public repositories,
along with their corresponding provenance and ground truth datasets, for use in further
development of open source OCR for historical print in Hindi and Urdu.
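Because the IIIF Image API encodes region, size, rotation, quality, and format directly in the request URL, page images can be fetched at whatever resolution a task requires without hosting them ourselves. A minimal sketch, with a placeholder endpoint standing in for an actual collection item:

# Hedged sketch: requesting a page image through the IIIF Image API.
# The base URL is a placeholder; real identifiers come from a library's IIIF manifest.
import requests

base = "https://iiif.example.org/iiif/some-manuscript-page"
# IIIF Image API pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}
url = f"{base}/full/!1000,1000/0/default.jpg"   # full page, scaled to fit 1000 x 1000 px

response = requests.get(url, timeout=30)
response.raise_for_status()
with open("page.jpg", "wb") as handle:
    handle.write(response.content)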
In the absence of robust institutional support, citizen scholars around the world have
done much of the work to make Hindi and Urdu texts available on the Internet. Care for
languages and poetic experience has led impassioned citizen scholars to launch digital
platforms such as Rekhta (Urdu), KavitaKosh (Hindi), and PunjabiKavita (Punjabi). The
reach of these projects, especially Rekhta, is enormous, spanning from social media like
Twitter, Facebook, and Instagram, to festivals and conferences, to classrooms, to more
personal spaces such as family chat groups. Though these platforms provide enormous
resources to scholars, their primary and intended audience is not the academy. Our
collaboration intervenes here to provide access to these texts as data for an audience
that includes the quotidian Indian as well as scholars, poets, information
technologists, library systems, and nonhuman agents. Proper metadata is essential for
such aims. Using linked open data, we incorporate citizen-scholar projects and resources
into our projects as well.
Citizen scholarship is only one example of the changes in the portability of texts
through new digital technologies. The malleability and flexibility of digital texts,
coupled with their easy accessibility and reproducibility, make them increasingly
desirable for scholarly inquiry. These affordances motivate our focus on the production
of high-quality, accessible digital editions. While learning from community-driven
initiatives by citizen-scholars, our objective is to digitally remediate the critical
editions of individual poets’ corpora in Indian languages, using minimal markup schemes
such as Markdown that are easy to reproduce for anyone willing to “decode, receive, and
revise” them [
Tenen 2017, 21]. The sharing and archiving of these digital editions draw
inspiration from projects such as OpenArabicPE and OpenITI. These projects not only
version plain-text data using Git but also link it to facsimiles, wherever available,
with minimal markup so that scholars can review, contribute, and track changes.
As we will describe below, we follow this model of using plain text in human- and
machine-readable Markdown files. While the interface we use to access this data may
change, the form of the data, a simple text file, is remarkably versatile. The
transformation from Markdown to Text Encoding Initiative (TEI) XML, too, can be a
simple and automatable procedure. However, we require additional layers of annotation,
especially for poetry, as well as a multilingual interface.
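As a rough illustration, the sketch below reads the YAML header and body of a Markdown file and emits a skeletal TEI document; the field names follow the simple front matter used later in this article, and the output is only a starting point for a fuller transformation:

# Hedged sketch: converting a Markdown file with a YAML header into minimal TEI XML.
# Field names (title, author) follow the simple front matter used in our examples.
import yaml
import xml.etree.ElementTree as ET

def markdown_to_tei(path):
    text = open(path, encoding="utf-8").read()
    _, header, body = text.split("---", 2)          # front matter delimited by ---
    meta = yaml.safe_load(header)

    tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
    title_stmt = ET.SubElement(ET.SubElement(ET.SubElement(tei, "teiHeader"),
                                             "fileDesc"), "titleStmt")
    ET.SubElement(title_stmt, "title").text = meta.get("title", "")
    ET.SubElement(title_stmt, "author").text = meta.get("author", "")

    body_el = ET.SubElement(ET.SubElement(tei, "text"), "body")
    lg = ET.SubElement(body_el, "lg")               # one line group for the poem
    for line in body.strip().splitlines():
        ET.SubElement(lg, "l").text = line
    return ET.tostring(tei, encoding="unicode")

print(markdown_to_tei("bol.md"))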
Minimal Computing Approaches to Indian Digital Texts
Prefaced by traditions of localization and open-source activism within India, such as
those of Delhi's Sarai program
[5] at the Centre for the Study of Developing Societies, we grapple with questions
of software translation and localization of linguistic resources. Our focus is on plain
text and the need for an interface that supports both the conceptual and practical
dimensions of studying poetics in Indian languages, especially when the digital text
itself, no matter how plain, is insufficient. Our Hindi and Urdu texts span alphabets
and scripts — left-to-right and right-to-left, Indic, Perso-Arabic, Roman, Devanagari,
and unwritten — and yet are mutually aurally comprehensible, for the most part. While we
are working under constraints, we dare to dream big and use computing to get around
barriers that otherwise look imposing.
We use Git to organize and collaboratively write and contribute to our project, as it
makes both minimal dependence and minimal maintenance possible. Though Git is used primarily to version computer code, we use it both in software development and in our other collaborative efforts, including writing. It allows us to avoid dependence on research computing professionals and on expensive, high-maintenance technology. Minimal
computing helps us avoid the alienation of users or the fetishization of tools. While a
certain amount of code goes into any GitHub or GitLab project, this approach makes entry into project development possible and accessible for beginners.
[6] As
humanists, the "making" of Git-based projects appeals to our sensibilities. It is useful not only for learning but also for tinkering around in a perfect jugaad fashion, using code and tools in ways they were not originally intended for, allowing for backtracking and trying out new branches. Git makes
minimal automation possible by circumventing reliance on dedicated web servers; instead,
writing and editing work can be done on one's local system and then committed to the central GitHub repository. This feature is particularly useful where Internet access is unreliable or during power outages. Git also uses little bandwidth, since only the changes are transferred on each update.
As noted, we turn to adaptation and localization of existing tools for our collaborative
writing workflow instead of reinventing the wheel. Specifically, we use Jupyter Book,
part of the Executable Book Project, which enables users to assemble books and articles
from a combination of Markdown files, executable Jupyter Notebooks and other sources
using Sphinx, the robust documentation engine developed for Python. Though originally
used for collaborative computing in STEM fields, we find the executable notebook and
book approach of great interest for our digital humanities work, as it allows code to be
embedded alongside text. Sharing the source files allows for one team member's work to
be reviewed by others, permitting open access to any calculations and visualizations as
well as the techniques used to render them. By utilizing Sphinx, Jupyter Book allows
these various documents to be easily combined and cross-referenced. As with Markdown,
the text can be converted into HTML for website viewing, as well as into other
formats.
In our workflow, we had to make some adjustments to attend to the specific requirements
of Indian languages and of our multilingual audiences, as well as to the needs of our
larger research group. In Unicode, the standard system used to digitally encode the
world's writing systems, only the orthographic information about a word can be encoded.
Unlike in English, there can be more than one way to write certain ligatures in both
Devanagari and Nastaliq scripts, which therefore requires some normalization.
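To illustrate with a simple case, the Devanagari letter क़ (qa) can be stored either as a single precomposed code point or as क followed by a combining nukta sign; applying Unicode normalization makes the two encodings comparable:

# Two encodings of Devanagari qa: precomposed U+0958 versus ka + combining nukta.
import unicodedata

precomposed = "\u0958"        # क़ as one code point
decomposed = "\u0915\u093c"   # क followed by the nukta sign

print(precomposed == decomposed)                     # False: different code points
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))      # True once both are normalized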
Transliteration between scripts, metrical analysis, lexicography, and interfacing with information systems, however, require additional layers. For example,
there is simply not enough information outside of context to distinguish between کیا
(kiyaa: done) and کیا (kyaa: what). For
our larger group, another concern is getting their computer systems ready for the
minimal computing toolkit. Though the technical requirements are few, minimal computing
still requires some training and setup. Running Jupyter Book locally, for example,
requires installing several programming languages and libraries: Python, to run the documentation generator Sphinx, and Pandoc, a Haskell tool, to generate bibliographic citations. We wanted to have the option of two different approaches: 1) to be able to
sync from a locally running machine to a remote Git repository and 2) to have an
alternative to syncing between a locally running server and the remote.
To address these issues, we took advantage of two newer developments: the JAMstack and
using Git as a content management system.
[7]
In the JAMstack, web pages are typically pre-generated and written to disk through a
process referred to as “Server-Side Rendering” (SSR). SSR is advantageous from a search
engine optimization perspective. As the web crawlers of Internet search engines visit a
page, they learn not only that there's a page at a given address but also what its
contents are. Another advantage is that pages load very quickly because they are pre-generated and static. Finally, these websites are usually relatively
small and can often be hosted for free on GitHub, GitLab, Netlify, or other web
providers.
While there are several popular JAMstack frameworks, all of which could have handled the
tasks at hand, we were attracted to the Vue.js JavaScript framework and within it the
Gridsome JAMstack framework. Vue.js recommends a "single-file component" approach to web design, whereby code, template, and style declarations are all kept together in
one file. We decided to embrace this new approach, hoping it would help us get our
projects going quickly. Gridsome uses Vue.js and adds features such as a
GraphQL (Graph Query Language) interface to query collections. It relies on plugins,
which can import data from different sources. Commonly used sources include hidden or
public blogs, Drupal websites, and, as in our case described below, a directory of plain
text Markdown files. In addition to markup of text, Markdown files can also contain data
of nearly any sort in their header. The following example shows a sample header encoded
in the YAML format, a human-readable way of storing or transmitting data:
---
title: Bol
author: Faiz Ahmed Faiz
---
Bol kih lab… (Poem text is here in the body)
The sample above includes a “title” and an “author” field. These fields, which usually
provide metadata about the text in the “body” of the Markdown file, can be used to store
numbers, dates, tags, and nearly any sort of data.
The second trend that we adopted is the move to use Git not only to write or code
together but also as a content management system. Here, we used the open-source,
JavaScript-based Netlify CMS, which is written using React, an alternative to Vue.js.
Easily added to Gridsome and other frameworks, Netlify CMS allows users to authenticate
with a Git repository — we used GitLab — and make and commit changes to the repository
via a web-accessible editor page. A plugin to Gridsome adds a route to a webpage where
users can edit the Markdown header fields online according to a configuration that we specify; that is, we can determine what their content should be (e.g., dates, strings, numbers,
lists of strings, or references to other nodes). Netlify CMS then handles the updates to
the Git repository. As a result, we are able to have people access and update the data
without requiring them to install the full JAMstack on their local computer.
We also knew we wanted to have certain content and data be available in Hindi, Urdu, and
English. Fortunately, we were able to adapt the internationalization (i18n) features of
Netlify CMS to work with those of Gridsome. Netlify CMS handles translation of Markdown
header and body fields by keeping certain common fields in the default locale — we chose
English (“en”) for this project — and by storing “localized” (translated) fields in
other locales, as in this YAML header:
---
en:
  title: Faiz Ahmed Faiz
  bday: 1911.02.13
  body: Urdu poet
hi:
  title: "फ़ैज़ अहमद फ़ैज़"
  body: उर्दू शायर
ur:
  title: فیض احمد فیض
  body: اردو شاعر
---
In this sample from an “author” collection, the field “bday” only appears in the default
locale (“en”). The fields that can be translated (“title” and “body”) appear also in the
Hindi (“hi”) and Urdu (“ur”) locales. Note that the “body” field, which followed the
header in the previous Markdown header example, is now contained as a field within the
header itself.
The Markdown file of a text can then reference its author(s) using the “author”
field:
---
en:
  title: Bol
  author:
    - Faiz Ahmed Faiz
  body: |-
    bol kih lab aazaad hai tere
    bol zabaa;n ab tak terii hai
hi:
  author:
    - Faiz Ahmed Faiz
  title: बोल
  body: |-
    बोल कि लब आज़ाद है तेरे
    बोल ज़बाँ अब तक तेरी है
ur:
  author:
    - Faiz Ahmed Faiz
  title: بول
  body: |-
    بول کہ لب آزاد ہے تیرے
    بول زباں اب تک تیری ہے
---
In this example, we specify that there can be more than one “author,” hence that field
contains a list, indicated in YAML by the hyphen. The author is referenced using the
Roman version of the poet's name. This allows for a fully human- and machine-readable
version of this document. In this way, the plain text files in the directory of a Git
repository are treated as a document-oriented database. (Gridsome, in fact, uses the
speedy JavaScript LokiJS database internally.) When displayed, fields are presented on a
webpage in accordance with the client's chosen locale — Hindi, Urdu, or English.
Through this combination of the JAMstack and a Git repository content management system,
we can provide localized access not only to viewers but to our contributors, even if
they are on a mobile phone or tablet without proper access to a computer. By using
continuous integration — in our case, the running of a script when a commit is made to
the Git repository — the website is automatically updated when changes to the content
are made, and tests are run to assert that the changes are valid. Updates to the data
can also be federated; by adjusting the Netlify CMS settings to use permission levels in
Git, some users can automatically make changes while others require approval. These
proposed changes can come from the public, too, offering a straightforward pathway to
crowdsourcing.
For web-based annotation, we are developing a custom widget that starts from the
transcription of the text in its original script (OCRed, if appropriate) stored in the
“body” field of a Markdown file. A “genre” field determines how exactly the text will be
treated. In general, we address the location of the individual tokens/words by their
coordinates in the Markdown file. This allows us to add multiple layers of annotation.
The transcribed words are treated as “phrases” that can contain multiple “words.” A
name, for example, can be a phrase, but so can a compound word. (Linguists also prefer
to have some features, such as future case markers, separated.) Sentences are stored as
a span of coordinates (e.g., for poetry indexed by line group, line, and phrase) after we split the original text at spaces, punctuation, and paragraph breaks. While editing, changes made in the custom widget are mapped to a model representation of the text in the web browser and then written to disk following updates.
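As a rough illustration of this coordinate scheme, the sketch below splits a short poem into line groups, lines, and phrases and addresses each token by its position; the splitting rules here are deliberately simplified stand-ins for the fuller tokenization the widget performs:

# Hedged sketch: addressing tokens of a poem by (line group, line, phrase) coordinates.
# Splitting on blank lines, newlines, and whitespace simplifies our actual rules.
poem = """bol kih lab aazaad hai tere
bol zabaa;n ab tak terii hai"""

coordinates = {}
for lg_index, line_group in enumerate(poem.split("\n\n")):       # stanzas
    for l_index, line in enumerate(line_group.splitlines()):     # verse lines
        for p_index, phrase in enumerate(line.split()):          # whitespace tokens
            coordinates[(lg_index, l_index, p_index)] = phrase

# annotation layers can now point at coordinates rather than character offsets
print(coordinates[(0, 1, 0)])   # "bol", the first phrase of the second line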
We are also able to produce a view of the text in Text Encoding Initiative (TEI) XML, which is widely used in digital humanities. Sentences map to TEI’s “<s>” element, phrases to the “<phr>” element, and words to its “<w>” element. In
this way, we avoid the awkwardness of dealing with right-to-left text in XML editors.
Individual words or phrases, moreover, can have additional views or links attached to
them, such as scholarly or library-system transliteration or the International Phonetic
Alphabet. Also, we can offer views of the individual sentences using the CoNLL-U format
used by Universal Dependencies (UD), a framework for grammatical annotation, allowing us
to take advantage of the rich set of annotation tools developed for UD.
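A minimal sketch of that token-level mapping, reusing the simplified tokenization above; sentence-level <s> elements are omitted for brevity:

# Hedged sketch: serializing tokenized verse into <lg>, <l>, <phr>, and <w> elements.
import xml.etree.ElementTree as ET

lines = ["bol kih lab aazaad hai tere",
         "bol zabaa;n ab tak terii hai"]

lg = ET.Element("lg")
for line in lines:
    l = ET.SubElement(lg, "l")
    for token in line.split():
        phr = ET.SubElement(l, "phr")          # each whitespace token is a phrase
        ET.SubElement(phr, "w").text = token   # which contains one or more words

print(ET.tostring(lg, encoding="unicode"))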
For the purposes of developing an annotated corpus, this jugaad — using the JAMstack and Git as a CMS — is sufficient. Our plain-text data is
relatively small, and it can be read by humans and machines alike. We avoid hosting
large image files by instead providing IIIF links to allow transcribers and developers
to query the images. Using Git, changes can be tracked and
undone. There are few limitations as to the data or links we can provide. For example,
interfacing with citizen scholar projects or Wikidata merely requires adding additional
fields. For interoperability, the Markdown header field can also approximate (to a
certain extent) the graph-based Resource Description Framework (RDF), which
uses a subject-predicate-object (S-P-O) approach. For example, the previous example can
be transformed to RDF as: poem “Bol” (S) “hasAuthor” (P) “Faiz Ahmed Faiz” (O). The
“document” or “node” serves as a “subject”; the metadata field (e.g., “author:”) as the
RDF “predicate” (e.g., “hasAuthor”); and the value (e.g., “Faiz Ahmed Faiz”) as the
“object.” In Gridsome, the entire network graph of relations is available both through
the query language GraphQL and as a database. Finally, the whole system can be accessed
and updated either on a local machine that has the JAMstack implemented or through a web
interface to Git using Netlify CMS. While the former is especially useful for those of
us doing computational analysis, the latter allows updates to be made from any phone.
The data can be easily versioned and progressively archived from Git to Zenodo or other
repositories, and the interface expands to meet our needs.
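A minimal sketch of that front matter-to-RDF mapping, using the rdflib library; the namespace URI and predicate names are illustrative assumptions rather than a published vocabulary:

# Hedged sketch: expressing front-matter fields as RDF triples with rdflib.
# The namespace URI and predicate names are illustrative, not a published vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef

POETICS = Namespace("https://example.org/poetics/")

graph = Graph()
poem = URIRef("https://example.org/poems/bol")   # the document/node as RDF subject
graph.add((poem, POETICS.hasAuthor, Literal("Faiz Ahmed Faiz")))   # from "author:"
graph.add((poem, POETICS.title, Literal("Bol", lang="en")))        # from "title:"

print(graph.serialize(format="turtle"))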