1. Introduction
In an article in these pages on “Genealogy of Distant Reading” Ted Underwood made the valid point that distant reading as a form of
“macroscopic literary inquiry” did not originate with digital text [
Underwood 2017]. Still, all DH methods, by definition, process digital data, and most can only be applied effectively only within a computational setting.
While some methods (e.g. markup or GIS) extend analog practice (such as editing or cartography) into the digital, others are digital natives.
Topic modeling is one of the computational methods that over the last twenty years have made their way from Natural Language Processing [
Blei et al2003] into the Digital Humanities.
It has been consistently associated with distant reading. In the introduction to a special issue on topic modeling in the
Journal of the Digital Humanities the editors even called it
“distant reading in the most pure sense” [
Meeks and Weingart 2012].
[1]
As digital corpora grow ever larger, so does the need for distant reading. Thus topic modeling has survived early hype and critique [
Schmidt 2012]
and is still going strong in DH [
Du 2019] and in information retrieval in general. Originally dominated by the application of Latent Dirichlet Allocation [
Blei et al2003], by now there are several different approaches to topic modeling.
Comparing their respective strengths and weaknesses has resulted in a veritable sub-genre of comparative studies [
Kherwa and Bansal 2019], [
Vayansky and Kumar 2020], [
Fu et al 2020], [
Ma et al 2021], [
Egger and Yu 2022], [
Rüdiger et al 2022], [
Chen et al 2023].
In this paper we will use a modified version of BERTopic [
Grootendorst 2022], a new, neural network based approach, to distant read two related text corpora in Buddhist Chinese, a low-resource idiom.
Although an attempt has been made to use BERTopic on Chinese poetry [
Fang 2023], to our knowledge this is the first time BERTopic has been used on canonical Buddhist texts in any language.
Our first question is whether BERTopic, which relies on BERT for word embeddings, does return coherent topics at all when confronted with a low-resource idiom such as Buddhist Chinese, which was not represented in the training data for BERT,
and for which tokenization algorithms are available but not widely implemented [
Wang 2020]. Secondly, we would like to know whether the application of the BERTopic workflow can yield meaningful topics that
can be taken to express concerns in Chinese Buddhist texts created between 500 and 800 CE and might in particular be used to distinguish two different corpora, Indian-Chinese and Chinese-Chinese texts, and how they might be used in furthering our understanding of the texts.
Why this particular time-frame? In the centuries between 500 and 800 CE, Buddhism in China turned into Chinese Buddhism. The first Indian Buddhist texts were translated into Chinese in the second century. Starting from the late third century
we are able to reconstruct an unbroken social network that connects generations of Chinese Buddhists until today [
Bingenheimer 2021].
[2]
Until the 6th century this network was mainly informed by Indian-Chinese texts, i.e. Indian texts
translated into Chinese. Chinese-Chinese texts, i.e. texts authored by Chinese (or Korean or Japanese) writers, began as paratext (prefaces, catalogs etc.) to these translations, then developed via short essays, apocrypha and commentaries,
eventually blossoming into long historiographical and doctrinal works in the 6th century. The time between 500 and 800 is special in the sense that in spite of the continuing reception and translation of Indian texts, it were the Chinese-Chinese texts produced
during that period that became distinctive for later Chinese Buddhism. The philosophical works of Zhiyi 智顗 (538–597) and Fazang 法藏 (643–712),
the devotional treatises of Tanluan曇鸞 (476-542) and Shandao善導 (613–681), and the recorded sayings of early Chan masters like Huineng 慧能 (638–713) and Mazu Daoyi 馬祖道一 (709–788),
to give but a few examples, became the textual foundation for the dominant traditions within Chinese, and indeed East Asian, Buddhism of the second millennium: Chan, Pure Land, and Tiantai Buddhism.
While in the year 500 CE Buddhism in China was still very much dominated by the influx of Indian Buddhist ideas, by 800 CE Chinese Buddhists had started to put their own spin on Buddhism. Most features that would come to characterize Chinese Buddhism were in place,
even if it took another few centuries for them to fully form: the dichotomy, both stable and dynamic, between meditative Chan and devotional Pure Land practice, an interest in historiography and lineage, a commitment to the lay-monastic distinction,
and the necessity of apologetics in the framework of the Chinese state. While Buddhism declined in India after 800 CE and became virtually extinct there after the 12th century, Chinese Buddhism blossomed throughout the Song Dynasty (960–1279)
and today remains the largest world religion in China with more than 200 million adherents [
Pew Research Center 2023].
The topics returned by topic modeling methods are notoriously open to interpretation. Whether or not they cohere with respect to a particular domain can only be evaluated by domain experts. In order to communicate how we read a topic to a wider audience
we combine the lists of terms that represent topics into “virtual paragraphs” (Appendix A). In that way we hope to make our reading transparent for non-specialists who can easily grasp what the topic is about.
Creating virtual paragraphs from topics could also be used in teaching. When practiced in conjunction with close, linear reading of classical texts, students stand to gain familiarity with concepts and semantic fields that are latent in larger corpora.
3. Method
Our method follows BERTopic [
Grootendorst 2022], a recent topic modeling framework that has quickly attracted attention.
[6]
We use our own variation of the BERTopic pipeline, which we make available at
https://github.com/mbingenheimer/cbetaCorpusSorted.
[7]
Specifically, our implementation involves the following steps:
- Using a large language model trained on classical Chinese (Koichi Yasuoka’s variant of GuwenBERT), we embed every sentence in our corpus into a high dimensional vector space.
- We perform UMAP to reduce the sentence embeddings to a 3-dimensional set.
- We use HDBSCAN to determine clusters which we interpret as topics.
- The words in the sentences of each topic are collected and ranked according to c-tf-idf to measure the degree to which they represent the entire topic.
- The individual sentences are tagged as to their provenance (whether they come from a Chinese-Chinese or a Chinese-Indian source). This allows us to compute the monochromaticity of each cluster and determine which clusters are mixed versus
which ones are representative of one or the other of our corpora.
The main idea of BERTopic is to take advantage of the capacity of modern neural networks to create sentence embeddings which respect semantic similarity, meaning that linguistic elements with similar meaning should be embedded close to each other in the target vector space.
While many previous techniques are based on mapping language elements to vectors (with word2vec probably being the best-known), encoder-only models like BERT have the advantage of embedding language in a way that is sensitive to semantics. In particular the embedding of a multivalent word “bank” is dependent on the surrounding sentence.
Thus homonyms can be disambiguated in the embedding. At the same time synonyms like “happy” and “joyful” that have a similar probability of occurring at a particular position in a sentence are defined with similar embeddings. As a result, any BERT model can be expected to display some degree of identifying semantic similarity.
Specifically, these models operate by learning how to predict missing tokens (e.g. words) from a natural language sentence. To be successful, the model has to learn the probabilities associated with various possible words at a specific point in the sentence.
This will necessarily entail disambiguating homonyms – if it would represent all instances of the word bank in the same way then it could not correctly learn the context-dependent likelihoods of the words money and river occurring elsewhere in the sentence.
Similarly, it would be helpful (if not strictly necessary) for synonyms to have close embeddings; this would allow them to automatically treat their influence on the rest of the sentence similarly.
Moreover, BERT models can be explicitly trained to respect semantic similarity – this requires a custom dataset designed for this purpose. In our case we chose Koichi Yasuoka’s variant of GuwenBERT as our base model.
[8]
GuwenBERT is a BERT model which is trained on the Daizhige 殆知阁 dataset, which according to the creators of GuwenBERT contains “15,694 books in Classical Chinese,
covering Buddhism, Confucianism, Medicine, History, ...Taoism, [and others]”.
[9]
Whereas the original GuwenBERT was trained on the corpus in simplified Chinese, Yasuoka’s variant allows for traditional Chinese characters. To our knowledge, there is no semantic similarity training set for Classical Chinese;
thus we rely on the level of semantic similarity native to our base model.
[10]
This results in the danger of outputting collocations instead of topics: For example, our pipeline sometimes picks up on ngrams that contain “十一eleven (-somethings)” or “四十forty (-somethings)”, resulting in clusters that are based on one or two characters rather than genuine,
domain-specific semantic clustering. By grouping for instance, “ 四十 (forty),四十二 (Forty-two), 四十九 (forty nine), 四十八 (forty-eight), 四十四 (Forty-four), 四十一 (forty one), 四十六 (Forty -six)” in one cluster the pipeline does its job
(it picks up on the signal of “forty”), but because we are (not yet) able to fine-tune it to produce more semantically relevant clusters. The more a cluster looks like a snippet from a KWIC index,
the less interesting it is for topic modeling, because we are looking for semantically related terms not for the same term in different contexts.
We pass each sentence of each document into our base model, thereby obtaining a high-dimensional vector representation of each sentence. These sentences are tagged according to the corpus they originated from. It is understood within data science that high dimensional datasets are difficult to work with;
many mathematical algorithms designed for low dimensional datasets break down in high dimensions – this is known as the
curse of dimensionality. Thus our set of high dimensional sentence representation is then passed through the UMAP algorithm [
McInnes et al 2018].
which reduces the representations to being 3-dimensional. UMAP is designed so that representations which are close (in some sense) in the high dimensional space will get transformed into representations which are still close in 3-dimensional space.
This ensures that clusters that are observed in the 3 dimensional space do actually correspond to representations which are close in the high-dimensional space. If our embeddings display a high degree of semantic similarity,
this should mean that sentences whose representations are clustered close together actually represent sentences with related meanings.
Having reduced the sentences representations to three dimensions, we use HDBSCAN [
McInnes et al 2017], a clustering algorithm, to identify and extract clusters. Each cluster is comprised of a set of sentences. To uncover what topic each cluster represents, the sentences are broken into words.
Each word in the cluster is given a score determined by the frequency of the word in the cluster, multiplied by the information content (or rarity) of the word in the corpus. This metric is known as c-tf-idf. We take the top 20 scoring words as indicative of the cluster's content.
As usual with topic modeling, the total number of topics varies with parameterization.
Finally we score each cluster by how monochromatic (in our case: purely derived from the “Chinese-Chinese” corpus) and how large it is (i.e. how many sentences are represented in the cluster). Our intuition is that large clusters are more representative of the corpus.
Monochromaticity (MC) is expressed by a score between 0 and 1, with values converging to 0 indicating a cluster largely derived from the Indian-Chinese corpus, convergence to 1 indicates a cluster mostly derived from the Chinese-Chinese corpus, and clusters with MC around 0.5 are derived equally from both corpora.
Below an attempt to visualize the results of this process:
4. Discussion
Orienting ourselves by monochromaticity and cluster size allows us to focus on the “
coherent” topics that have the best chance at being “
meaningful” for distinguishing between Indian-Chinese and Chinese-Chinese corpora. We found it useful to distinguish between these two orders of sense-making:
We define “coherent” here not by one of the several computed coherence metrics for topic modeling (such as Umass, C-V etc.).
In the case of BERTopic, that is already taken care of during the dimensionality reduction via HDBSCAN.
Rather, we use human evaluation to decide whether the terms in a topic semantically relate to each other in a way that allows a reader with domain knowledge
to string them together in virtual paragraphs (Appendix A) without forcing associations. Domain knowledge remains indispensable for such exercises.
On trying out the classic tf-idf formula on Woolf’s The Waves, Stephen Ramsay [
Ramsay 2011, 12] wrote: “Few readers of The Waves would fail to see some emergence of pattern in this list”.
It should be added that, conversely, for someone who has not read the text “patterns” are highly unlikely to emerge. A modified form of tf-idf remains part of BERTopic and other topic modeling frameworks,
and although the topic coherence as assessed by computable metrics arguably has improved, the need for domain knowledge for “the ‘lighting up’ of an aspect (das ‘Aufleuchten’ eines Aspekts) remains.
[11]
Beyond seeing a coherent pattern in the word lists that represent topics, a second-order of sense making is relevant. “Meaningful” topics must be meaningful not only in themselves (i.e. coherent) but also relevant in the wider context of a research question.
Do the topics make heuristic sense for a researcher in Buddhist studies? Are they insightful to the distant reader? In our case, do they actually distinguish the Indian-Chinese from the Chinese-Chinese texts or are they just different?
Obviously, coherence here is a condition of meaningfulness. If the top-twenty words that represent a topic do not cohere, there is no topic to discuss in a wider context.
The particular iteration from which the ten topics below are taken yielded c. 750 clusters, but there is no natural upper or lower limit for the amount of clusters produced in any given pipeline.
How did we select the ten clusters we discuss below, that we have identified as coherent and meaningful? Like with the overall amount of clusters,
“ten” too is an arbitrary limit that takes into account what can be comfortable presented to DHQ readers.
Arguments could be made for longer discussions of five topics, or a more concise presentation of twenty.
Our selection “algorithm” was: Descending from the most monochromatic clusters
(which are most likely to distinguish Indian-Chinese from Chinese-Chinese texts)
select the first ten clusters that are both “coherent” and “meaningful” in the sense sketched above.
Different researchers of Chinese Buddhism might come up with slightly different sets but that is the not yet the point.
Distant reading allows for hermeneutic difference just as close reading. Are there many more clusters that are coherent and meaningful?
This is hard to quantify, because “meaningful” implies a concrete research questions. In our case, reading through these lists of topics,
our intuition is that about one in ten clusters seems coherent, but of these, only those with high monochromaticity should be expected
to meaningfully distinguish the Indian-Chinese from the Chinese-Chinese corpus.
Below, we will discuss the heuristic value of ten topics suggested by BERTopic. These are presented as ‘virtual paragraphs’ in Appendix A.
What signals does a distant reading of our two corpora discern? Among the topics that align strongly with the
Indian-Chinese corpus the least surprising ones
are two that relate to the introduction of tantric, esoteric Buddhism to China:
maṇḍala (A1) and
mantra (A2).
Maṇḍalas were used in Indian esoteric Buddhism in visualization practice, as part of rituals, and as an art form.
These three aspects are related as sophisticated
maṇḍala paintings are not only used in rituals, but also as models for meditators who learn to visualize them as part of their meditative practice.
In India, the earliest Buddhist uses of
maṇḍala images are attested for the 6th century, well within our time-frame.
In China, a host of texts on how to use
maṇḍalas in rituals were translated in the 8th century (e.g. in the Taishō edition (T):
T0850, T0852a, T0852b, T0862, T0911, T0912, T0959, T1001, T1004, T1040, T1067, T1167, T1168B, T1184).
Texts containing
mantras or the longer
dhāraṇī spells (e.g. T0402, T0899, T0901, T0902, T0903, T0905, T0907, T0918, T0933, T0944A, T0952, T0956, T0962, T0963, T0964, T0967, T0968)
were already mentioned above as the reason why the Indian-Chinese corpus has twice as many texts but an overall lower character count than the Chinese-Chinese corpus.
Neither poetry nor prose,
mantra and
dhāraṇī texts are a distinctive genre in themselves. That distinctiveness, not surprisingly, appears in the BERTopic output.
Topic A2 illustrates a ‘mantra’ topic; the terms
mantra and
dhāraṇī echo through several other topics and indeed also appear together in A5.
Maṇḍalas and
mantras/dhāraṇīs are the visual and aural elements of esoteric ritual and meditation practice that was introduced to China in the 8th century (and from there to Japan in the early 9th century).
This was the last great transmission of a distinct Buddhist tradition from India to China and topic modeling picks up a clear signal of this process.
This proves at the very least that BERTopic works for our corpus, and can identify “typical” Indian-Chinese topics.
But there are other, more subtle topics, such as
A3, “Yama Heaven,” where things get more interesting to think with.
Indian cosmography posits a number of heavens above (and hells below) our “middle earth” (
madhyadeśa).
[12]
In Chinese Buddhism between 500 and 800 CE, the writings of Tanluan, Daochuo and Shandao laid the foundation for the Pure Land school in which practitioners aim to be reborn in the heavenly paradise of Amitābha Buddha.
Their writings in turn are based on Mahāyāna Indian sūtra texts translated before 500 CE. The “Yama Heaven” topic signals that in the Indian texts translated 500-800 CE the traditional view of “layered” heavens still persisted,
and was not yet subsumed into the otherworldly Pure Land of the West that became so extraordinarily influential in East Asian Buddhism in the second millennium.
This generates new avenues for research: The texts connected to the topic (e.g. T21n1340, T13n0416, T16n0675, T18n0892, T14n0455, T17n0721)
can now be further explored e.g. to see how the heavenly realms in Mahāyāna Indian-Chinese texts differed from Amitābha’s Pure Land extolled in the Chinese-Chinese works in our period.
Another Indian topic related to place is
A4 “The palace 宮 of Śākyamuni.” Like early Buddhist iconography the topic is in a way aniconic: it describes the city where Prince Siddhartha,
the Sage (muni) of the Śākya clan grew up, but names only his father, Śuddhodana, not Śākyamuni Buddha himself.
That the monochromaticity of the topic converges to 0 suggests the Buddha legend features highly in Indian-Chinese texts.
And indeed, the topic appears not only in a dedicated Buddha legend epic (T0190), but also in (Mūlasārvastivādin) Vinaya texts
(T1442, T1443, T1450), and encyclopedic collections (T2121, T2122). The latter are compiled in China, but contain extensive quotations from translated Indian texts.
The topic marks an influx of Indian texts which speak of the historical Buddha in a Chinese Mahāyāna environment,
where many other texts propagate the existence of a multitude of Buddhas “innumerable as grains of sands in the Ganges.”
It is a reminder that the proliferation of Buddhas did not overwrite interest in the story of Śākyamuni, the historical Buddha.
Topic
A5 reflects an ongoing concern with the “propagation of the Dharma.” That this topic lights up in Indian texts translated 500-800 CE is perhaps an indication of the pressures Buddhism faced in India.
While in China the Sui (581–618), Tang (618–907), and Song (960–1279) dynasties were (mostly) a long golden summer for Buddhism, in India autumn had set in by the 7th century.
The invasion and rule of North India by the Alchon Huns in the sixth and early fifth centuries (c. 460-530 CE) was adversarial to Buddhism,
and Xuanzang who traveled to India some hundred years later (629-645 CE) found many pilgrimage sites in decline.
Besides the last blossoming of Buddhist philosophy in the works of Dharamakīrti and Candrakīrti (7th century),
which were not translated into Chinese, the development and transmission of tantric esoteric Buddhism was the last major doctrinal development in Indian Buddhism.
As Indian Buddhism slowly lost ground to powerful Hindu movements (Vaishnavism, Shaivaism, Bhakti etc.)
the need to teach and propagate the Buddhist teachings to laypeople remained a concern in Buddhist literature.
Two highly monochromatic topics are associated with monastic life and its rules, one
(A6) tending to the Indian-Chinese corpus,
the other
(B1) to the Chinese-Chinese corpus. Whereas
A6 draws on the more technical discourse of Buddhist canon law, the Vinaya, and its history,
B1 is about precepts, the rules that Buddhists ought to follow. Returning to the texts we realize that topic
A6 is connected to the translations of Yijing 義淨 (635–713),
who went to India and returned with the Mūlasarvastivāda Vinaya and commentaries. Although in the end the authoritative version of the monastic rules in East Asian Buddhism relied on a different Vinaya tradition,
A6 can be said to reflect an ongoing exchange between India and China in terms of canon law.
Although associated with translated Indian texts the topic does not so much reflect a new concern within Indian Buddhism,
but rather an ongoing Chinese interest in the Vinaya. Indeed, later historiographers have asserted the development of a “Vinaya School”
律宗 in seventh and eighth century China. In contrast,
B1 is marked as a Chinese-Chinese topic by the peculiar term jieti 戒體, the “essence” of the precepts,
which had great traction in the Chinese Vinaya tradition, but for which there is no ready Indian equivalent. Distant reading
A6 against
B1,
one can discern an important polarity which reflects two competing Vinaya traditions in East Asia.
Next to the mainstream Indian Vinaya, there were the “Bodhisattva precepts” of an Eastern Mahāyāna Vinaya that formed around the,
probably apocryphal [
Funayama 1996],
Fanwang jing梵網經. Both traditions were present in China during and beyond our timeframe in China as well as in Japan and Korea,
but are rarely addressed as distinct. The two topics thus do not merely reflect historical reality, but highlight a fundamental distinction in the development of Buddhist norms.
In addition, the association of the Fanwang jing vocabulary of
B1 with Chinese-Chinese texts supports the claim that this sūtra was
indeed apocryphal and topically related to Chinese, not Indian concerns.
Regarding topic
B2, “Early Translators,” one might at first be tempted to think that any “translation” topic (there are other, less monochromatic ones),
might be due to paratext, such as the translator byline that precedes most fascicles of an Indian-Chinese text, but obviously this topic is aligned with the Chinese-Chinese corpus.
Moreover, the topic is specifically about “early” translators, active from the second to the fourth centuries, before our timeframe.
It therefore cannot reflect paratext, but rather a specific Chinese concern with Buddhist historiography, a fundamental difference between Chinese and Indian Buddhism in any period.
Chinese Buddhists made use of Chinese historiographic genres to create catalogs, biographies, annals and more. Compared with India,
the historical record for Chinese Buddhism of the first millennium is extremely rich and detailed.
Although this might be better understood as a general cultural trait, not specific to Buddhism, topic B2 points to the role Buddhist historiography played in the formation of a distinct Chinese Buddhism.
Another “typical” Chinese topic is “Pillars of the state”
(B3) which clusters the titles of Chinese government officials, none of which would appear in an Indian text,
where the administrative hierarchy was much less sophisticated. Which texts are responsible for the topic? The first two texts associated with it are anthologies of apologetic writing (T2110, T2103)
where Buddhists are in debate with officials, friendly or adversarial. Next is the encyclopedic Fayuan zhulin 法苑珠林 (T2122), which, like T2110, too was compiled by Falin 法琳 (571-639).
Then there are scriptural catalogs (T2156, T2157) and biographies (T2051 (again Falin), T2060). Thus a closer look at the sources of the terms that represent Topic B3
encourages us to reevaluate Falin’s role in the sinicization of Buddhism [
Jülch 2014].
A final monochromatic topic associated with the Chinese-Chinese corpus is
B4 which seems to built around collocations of
sheng 乘 ‘vehicle’ (Sk.
yāna).
The topic is meaningful in that it reflects a major concern in medieval Chinese Buddhism, namely the categorization of Buddhism in different traditions.
After c. 400 CE Chinese Buddhists increasingly understood themselves as followers of Mahāyāna, a distinctive new development within Indian Buddhism.
While in India Mahāyāna Buddhism developed further into esoteric or tantric Buddhism (Vajrayāna, Tantrayāna), Chinese Buddhism continue to define itself as Mahāyāna,
although, as we saw above, was exposed to esoteric Buddhism in our time frame. In the
sheng 乘 imaginaire earlier Indian mainstream Buddhism was cast as the
‘smaller (or ‘lesser’) vehicle’ Hīnayāna, in contrast to the ‘larger (or ‘greater’) vehicle’, the Mahāyāna. Topic B4 reflects this coming to terms with different counts of yānas,
such as one (the single transcendent truth applicable to all), two (Hīnayāna and Mahāyāna), or five yānas (the teachings of non-Buddhist humans, deities, śrāvaka Buddhists, pratyekabuddhas, and bodhisattva Buddhists).
Although not used in academic narratives of Buddhist history, these distinctions are still of interest to modern Chinese Buddhists and BERTopic is able to identify this concern in the texts produced between 500 and 800 CE.
5. Conclusion
Topic modeling is a real upgrade on divination. Traditional divination systems set up a randomized procedure that selects symbolic tokens from a given set.
It is then left to the diviner to interpret the tokens for the question at hand. Whereas with divination systems the number of elements in the set is usually
fixed (78 cards in the Marseille Tarot, 64 Yijing hexagrams, 12 zodiac signs etc.), topic modeling builds its topics as it goes along and can return any number of them.
However, like divination, it relies heavily on the interpreter to perceive the internal coherence as well as the overall meaning of a topic. The output of topic modeling is certainly grounded in more sophisticated probabilistic methods than the randomness of lot drawing, but its susceptibility to parametrization and the fact that different approaches yield somewhat different topics, lend the procedure a vagueness and ambivalence not unlike the hermeneutic play of divination. In the BERTopic pipeline a degree of randomness, specifically in the UMAP dimensionality reduction, means that even with the same data and parameters, the top twenty words in a topic may be slightly different. As Benjamin Schmidt (2012) warned at the beginning of the DH topic modeling boom, topics are neither necessarily coherent nor stable. There is no surefire method to separate a “true” topic from a “fata morgana topic”. And whether a topic is meaningful in the context of a research question can (for now) only be decided by humans.
This being said, our experiments with different BERTopic workflows, give us confidence that the latent topics are stable at least within this framework.
Moreover, as argued in Section 4, our experiments indeed turned up coherent and meaningful topics. These topics mark differences between the Indian-Chinese
and the Chinese-Chinese corpus, in spite of BERTopic working on a low-resource idiom (Buddhist Chinese). The caveats are: First, there is no guarantee that we
have identified all relevant topics – the rate of recall is unknown and perhaps unknowable. Second, there is no guarantee to precision either, as other
approaches might turn up different, but equally plausible topics.
To counter these limitations to a degree, we are working on a follow-up study in which we use machine translated versions of the texts instead of the Chinese originals. The approach might
confirm or undermine the stability of topics. Furthermore automating the production of “virtual paragraphs” might lead to greater involvement by the wider
community of scholars interested in the corpus. In the end, to us the glass is half full: Our modified BERTopic workflow yields coherent topics that meaningfully distinguish Indian-Chinese and
Chinese-Chinese texts in our time frame. Some of the topics are more surprising than others, but even those that at first glance seem obvious, such as the “translation” topic
(B2) associated with the Chinese-Chinese corpus,
turned out to be more complex than expected. Since all topics can be traced back to texts, it is possible to follow up and study them further. Thus the distant reading via topic
modeling can nudge our close reading and research in directions it would not have taken otherwise.