Chiara Palladino is Assistant Professor of Classics at Furman University working on semantic annotation and translation alignment.
Maryam Foradi is a postdoctoral researcher working on Translation in Digital Age and the Problem of Semantics in Corpus-based Translation Studies at the Institute for Applied Linguistics and Translation Studies, University of Leipzig.
Tariq Yousef is a research associate at Leipzig University, working on Computational Linguistics, Textual Alignment, and Data Visualization.
This paper proposes text alignment in digital environments as a way to empower language learning. It presents the principles and goals of text alignment in Natural Language Processing, and introduces Ugarit, a web-based translation alignment editor for the collection of aligned language pairs. It then reports observations on the application of translation alignment in historical language courses at Tufts University and Furman University between 2017 and 2019.
The World Wide Web has introduced a massive change in the cognitive processes of
reading: skimming has become the most popular way in which readers interact with
written texts in digital media, with potentially enormous consequences for how
human beings process information and articulate complex thought.
Text alignment, and more specifically translation alignment, is a type of
annotation,
This paper collects a set of preliminary observations on groups of students who have used translation alignment in language courses at various levels of expertise. We propose text alignment as a way to heighten the perception of the complexity of a written source, but also as a method to overcome the usual obstacles to reading in a different language by engaging directly with original literary texts.
Translation alignment is one of the most popular applications of Natural Language
Processing. It is defined as the comparison of two or more texts in different
languages, also called parallel texts or parallel corpora
The alignment of texts in different languages, however, is an exceptionally
complex task, because of the several variables involved. It is often difficult
to find perfect correspondences across languages that express ideas through
different morphosyntactic constructs, with variations in word order, sentence
length, and even underlying cultural significance. Machine-actionable systems
are often inefficient in providing equivalences for wordplays, metaphors, and
other rhetorical devices. The resulting aligned pairs may be one-to-one (one
word in the source text corresponds to one word in the translation), but often
align as one-to-many, many-to-many, or many-to-one. Each word correspondence may
be complete or perfect (with complete overlap between two words), but also
possible or incomplete (partial overlap, or both words being a translation of
each other only in certain contexts).
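The four link types described above can be illustrated with a short Python sketch. It assumes that an alignment is stored as a list of (source index, target index) links between tokens; the function name `classify_links` and this data layout are illustrative assumptions, not the format used by any particular alignment tool.

```python
from collections import defaultdict


def classify_links(links):
    """Group word-level links into aligned units and label each unit.

    `links` is an iterable of (source_index, target_index) pairs.
    Tokens connected by links are merged into one unit, and each unit
    is labelled by how many source and target tokens it contains.
    """
    parent = {}  # union-find structure over ("s", i) and ("t", j) nodes

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for s, t in links:
        parent[find(("s", s))] = find(("t", t))

    # Collect the source and target indices that belong to each unit.
    groups = defaultdict(lambda: (set(), set()))
    for node in list(parent):
        side, index = node
        src, tgt = groups[find(node)]
        (src if side == "s" else tgt).add(index)

    units = []
    for src, tgt in groups.values():
        kind = ("one" if len(src) == 1 else "many") + "-to-" + \
               ("one" if len(tgt) == 1 else "many")
        units.append((sorted(src), sorted(tgt), kind))
    return sorted(units)


# Latin "amavit puellam" against English "he loved the girl":
# amavit -> he, loved; puellam -> the, girl
print(classify_links([(0, 0), (0, 1), (1, 2), (1, 3)]))
# [([0], [0, 1], 'one-to-many'), ([1], [2, 3], 'one-to-many')]
```

The sketch only labels the shape of each aligned unit; whether a correspondence is complete or partial is an interpretive judgment that would have to be recorded by the annotator.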
For these reasons, manually aligned word pairs are one of the main sources of
training data, and are not only employed for the implementation of machine
translation systems, but are increasingly useful for many other purposes,
including information extraction and automated creation of bilingual lexica
The operation of collecting training data is often configured as a Citizen
Science effort: for example, Glosbe (https://glosbe.com/) provides thousands of dictionaries created from
the manual alignment of word pairs performed by a community of users.
The first public web-based environment for manual text alignment of parallel corpora was conceived within the Alpheios Project (http://alpheios.net/), and is currently hosted by the Perseids Project (https://www.perseids.org/). The editor offers an easy environment to manually match corresponding words in two parallel texts.
The Ugarit Translation Alignment Editor (http://ugarit.ialigner.com/),
developed in 2016 by Tariq Yousef at the University of Leipzig, was conceived as
the first version of an improved interface that could perform more extensive
alignment tasks than the Alpheios editor: it allows alignment in up to three
texts, incorporates transliteration of non-Latin alphabets, and collects
manually aligned pairs as training datasets
The first version of the tool was made public in March 2017. Since then, Ugarit has registered an ever-increasing number of visits and new users.
Originally, Ugarit was developed to collect training data for the implementation of statistical machine translation of historical languages, mainly Ancient Greek, Latin, and Persian, for which few or no aligned datasets exist. Ideally, historical languages are closed systems with a finite number of words and very limited change in the foreseeable future. Therefore, it should be possible to create adequately efficient automated methods of statistical machine translation based on a relatively small training dataset.
However, after the tool was made public, the number and variety of languages added by users steadily increased, going far beyond the original intent: at the time of writing, Ugarit includes 45 languages, among them Ancient Greek, Latin, Arabic, Persian, English, French, German, Italian, Portuguese, Croatian, Armenian, Japanese, Chinese, Syriac, Sanskrit, Coptic, Egyptian, Akkadian, and Ethiopic, and hosts 465 unique users and about 24,500 parallel texts.
The home page of Ugarit offers a quick overview of the languages currently available (the graph showing the proportions of the respective corpora), the number of users and texts, the available word pairs, and the newest alignments.
The workflow is simple: users can upload their corpus in text format, or import
it by calling its CTS URN from the Perseus Digital Library (http://www.perseus.tufts.edu/hopper/)
Currently, Ugarit works with up to three texts in different languages, and there is virtually no limit to the number and length of texts that can be uploaded. A clickable option lets the user decide whether the alignment is publicly visible on the website. When the default value is kept, the text appears on the home page of the website, and other users can inspect the alignment by hovering the mouse over individually paired tokens, which are automatically highlighted. The proportion of aligned tokens is indicated in the colored bar below the text: green indicates the rate of matching words, red the rate of non-matching words.
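The colored bar amounts to a simple ratio of aligned to unaligned tokens. The following Python sketch shows one way such a rate could be computed; the function `alignment_rate` and its inputs are hypothetical and are not taken from Ugarit's code.

```python
def alignment_rate(tokens, aligned_indices):
    """Return (matched, unmatched) proportions for one text.

    `tokens` is the list of tokens in the text; `aligned_indices` holds
    the positions of tokens that take part in at least one aligned pair.
    """
    total = len(tokens)
    if total == 0:
        return 0.0, 0.0
    # Ignore any index that falls outside the text.
    aligned = len(set(aligned_indices) & set(range(total)))
    matched = aligned / total
    return matched, 1.0 - matched


# Three of four tokens are aligned: 75% "green", 25% "red".
print(alignment_rate(["arma", "virumque", "cano", "Troiae"], {0, 2, 3}))
# (0.75, 0.25)
```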
For languages with non-Latin alphabets, Ugarit offers automatic transliteration,
which is visible when the pointer hovers on the desired word. This feature is
currently available for Greek, Arabic, Persian,
Translation pairs are stored in a local database that gathers data from the
manually aligned texts, but also from bilingual dictionaries and automatically
aligned texts with GIZA++.
As such, the database of translation pairs serves two main functions: first, it
serves search queries to provide information on all the available translations
of a word. Therefore, if the user looks for a word (for instance, “life”)
with the search function available on the home page, the output displays all the
available aligned pairs currently hosted in Ugarit. In this way, the user can
access an extended number of translations of a word in many different
languages.
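The search function described above can be pictured as a lookup over stored translation pairs. The Python sketch below assumes a flat list of (source language, source word, target language, target word) records; this layout, the sample data, and the name `lookup` are illustrative assumptions rather than Ugarit's actual database schema.

```python
from collections import Counter

# Hypothetical sample of stored translation pairs.
PAIRS = [
    ("eng", "life", "grc", "βίος"),
    ("eng", "life", "lat", "vita"),
    ("eng", "life", "grc", "βίος"),
    ("eng", "life", "grc", "ζωή"),
    ("eng", "war", "lat", "bellum"),
]


def lookup(word, pairs=PAIRS):
    """Return every recorded translation of `word` with its frequency.

    The word may appear on either side of a pair, so the search works
    in both directions. The most frequent translations come first.
    """
    counts = Counter()
    for src_lang, src, tgt_lang, tgt in pairs:
        if src == word:
            counts[(tgt_lang, tgt)] += 1
        elif tgt == word:
            counts[(src_lang, src)] += 1
    return counts.most_common()


for (lang, translation), freq in lookup("life"):
    print(lang, translation, freq)
```

Counting how often each pair was aligned gives a rough indication of which translations are most widely attested in the collected data.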
Second, Ugarit provides training data to the Dynamic Lexicon (http://dynamiclexicon.com/), which
applies the principle of triangulation to extract bilingual pairs across Ancient
Greek, Persian, and Latin, using bridge pairs in Greek/English, Persian/English,
and Latin/English: the Dynamic Lexicon fetches alignment information from all
available aligned corpora, and it uses aligned languages as pivot to fetch
translated words in other languages that may bear the same meaning
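The principle of triangulation through a pivot language can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Dynamic Lexicon's implementation: the function `triangulate`, its input format, and the sample words are hypothetical.

```python
from collections import defaultdict


def triangulate(pairs_a, pairs_b):
    """Propose translation candidates between languages A and B.

    `pairs_a` holds (word_in_A, pivot_word) pairs and `pairs_b` holds
    (word_in_B, pivot_word) pairs, where the pivot is a shared bridge
    language such as English. Two words become candidates when they
    are aligned to the same pivot word.
    """
    by_pivot = defaultdict(set)
    for word_b, pivot in pairs_b:
        by_pivot[pivot].add(word_b)

    candidates = defaultdict(set)
    for word_a, pivot in pairs_a:
        for word_b in by_pivot.get(pivot, ()):
            candidates[(word_a, word_b)].add(pivot)

    # Keep the supporting pivot words so that candidates can be reviewed.
    return {pair: sorted(pivots) for pair, pivots in candidates.items()}


greek_english = [("βίος", "life"), ("πόλεμος", "war")]
latin_english = [("vita", "life"), ("bellum", "war")]
print(triangulate(greek_english, latin_english))
# {('βίος', 'vita'): ['life'], ('πόλεμος', 'bellum'): ['war']}
```

Because a pivot word can be ambiguous, candidates produced this way only "may bear the same meaning", and pairs supported by several distinct pivot words are generally more reliable than those supported by one.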
In other words, Ugarit was designed as a Citizen Science resource to collect
training datasets on historical languages from a variety of sources and
projects. One of the most important initiatives in this regard was the Hafez
Project (http://dynamiclexicon.com/hafez/; http://www.divan-hafez.com) led by
Maryam Foradi at the University of Leipzig
These results suggested that alignment could serve as a pedagogical tool with some effect on long-term vocabulary retention. Between 2017 and 2019, we conducted a pilot study investigating how students of historical languages could use translation alignment to improve their learning experience through direct engagement with original texts. The lack of scholarship on the pedagogical application of alignment as a tool for language learning, and the complexity involved in reproducing the experience, compel us to start with pre-experimental observations from specific cases in which translation alignment was applied as a standard method during language courses. Our initial observations provide the justification for a more experimental design.
Ugarit was used in graduate and undergraduate semester-long Classics courses at
Tufts University and Furman University, with different focuses: language (mainly
Ancient Greek and Latin) at elementary, intermediate, and upper levels,
literature in translation, literature surveys, and individual research
projects.
The students reported their observations on the process, evaluated how
it affected their understanding of the source, and performed a qualitative
analysis on the languages in question, examining specific morphosyntactic,
semantic, and expressive phenomena.
We observed the application of this workflow among students with very diverse
language skills, including those who had mastered other modern languages, and those
whose native language was not English. We propose three case studies:
In the following paragraphs, we report the considerations formulated by
the students themselves, and our observations on the outcomes of the
process.
Students of Case 1 were enrolled in intermediate and upper level courses in
historical languages, mainly Ancient Greek or Latin. They performed two
tasks:
The purpose of the first exercise was to set the groundwork for a
structural analysis of the most important morphosyntactic and semantic
devices of the source language, comparing how those features were rendered
in a different language (often English), of which the students had complete
mastery. In the second exercise, they built upon this experience to draw an
evaluation of two competing translations, focusing on the different ways in
which complex linguistic phenomena can be rendered in another language.
Graduate, native English speaker, upper level Ancient Greek. The student created bilingual alignments of Plato’s
Undergraduate, native English speaker, intermediate Latin. The student performed a three-text alignment of a passage from the
translator’s voice.
Undergraduate, English-Chinese speaker, upper-elementary Greek. The student aligned the original text of
The typical case for trilingual alignment was a student who was proficient in two languages and wished to improve a third. The student would perform a trilingual alignment (possibly preceded by a pairwise alignment between the two better-known languages), systematically comparing a lesser-known language against two translations. The student would use the knowledge of the first two languages to overcome the obstacles in understanding the third one, by recognizing common syntactical patterns, morphologies, or similarities in vocabulary.
Interestingly, this case was not limited to the study of historical languages: a collateral advantage of trilingual alignment was that students with proficiency in a historical language could use this opportunity to focus on another language that they knew less well, often a second modern language that they were studying at the same time.
Undergraduate, native English speaker, advanced Latin, basic German. The student performed a trilingual alignment of Tacitus’
Graduate, advanced French, intermediate Ancient Greek, basic Arabic. The
student performed a trilingual alignment of the famous Loqman fables, to
pursue a research project that focused on gathering systematic evidence
of the relation between Loqman and the Aesopic corpus, which the sage is
often said to have translated
Graduate, Chinese native speaker, English as second language, intermediate Ancient Greek. The student performed a trilingual alignment of Sophocles’
The third case involved students, enrolled in courses on culture or
literature, who had no prior knowledge of the source language and were also
not studying it. The students were given short sections of a source text and
corresponding translations in various languages, and performed several tasks
aimed at gradually building a critical approach to the original:
Interestingly, we observed that students who had mastery of more
than one language almost immediately learned to use trilingual alignment to
overcome the difficulties of dealing with a language that they had never
seen before. This strategy proved effective in various cases: many students
completed the first two assignments using both English and a third modern
language (often Spanish or French); non-native English speakers used their
native language to better understand complex words and expressions; students
who mastered another classical language tested if they could use it to
better engage with phenomena like inflected forms, participial uses, and
subordination. Overall, trilingual alignment proved to be an effective
strategy to engage with the source text, as already assessed in Case 2. By the end of the course, students were
able to align relatively long passages against one or more translations.
“holy”, “honor”, or “love”, and align them with the
corresponding words in a translation in a chosen language. The goal
of this exercise was to develop an understanding of the depth of
crucial cultural concepts, by assessing the many different ways in
which the same word in the source text could be translated.
Undergraduate, native English speaker, advanced Ancient Greek, no Latin. The student used trilingual alignment to verify whether knowledge of Greek, with English as a bridge language, could help overcome unfamiliarity with a text in Latin. The passage chosen, from Herodotus’
Undergraduate, Chinese native speaker, English as second language, no
Ancient Greek. The student created a trilingual alignment of Euripides’
Bacchae, comparing the English translation by I. Johnson (http://jelks.nu/libri/classics/bacchae.html) and a revised
Chinese version based on the English (as there were no available Chinese
translations of the original Greek). As a Chinese native speaker, the
student used Chinese as a bridge to establish accurate correspondences
between the English and the Ancient Greek words that had never been seen
before. The result was an investigation into the meaning of Ancient
Greek concepts through a meaningful comparison with similar Chinese
terms, often with an emphasis on imperfect grammatical and semantic
correspondences between the Greek and its translation (such as “cleverness”,
translated as “being clever”).
At the end of the course, all the students, whatever their level of mastery
of the languages, developed an acute sense of how limited their
understanding of the source text was when based on the translation alone, as
well as the critical recognition that they could acquire in-depth knowledge of
key cultural concepts expressed in the source language by contrasting them with
their modern-language translations and investigating their various
meanings.
Students of the source language developed a tangible sense for the fluidity of translation by evaluating different strategies employed by professional scholars to approach complex phenomena; they systematically approached complex morphosyntactic constructs and were compelled to discuss their function and significance, while assessing the necessarily imperfect character of any translation of them. In addition, students who were experimenting with learning a new modern language could approach unknown expressions and vocabulary by using their skills in other languages as a bridge to convey similar constructs.
One of the most relevant outcomes of this study was that students with no prior knowledge of the source language could start learning it by directly reading original literary texts. This approach could be revolutionary in the field of slow reading and language learning, and it also promises non-trivial consequences for passive users: readers with no knowledge of a source language could easily use already available alignments to perform a dynamic reading of the original aligned with its translation, gaining a basic understanding of vocabulary and syntax in the original.
We hope to have shown the pedagogical potential of translation alignment. However, we are also aware that significant further development is needed to improve some underdeveloped aspects and to allow a more systematic use of a translation alignment interface in teaching practices.
We recognize that the tool is currently open to every category of user, and that the aligned pairs are partly the result of unsupervised work, which may affect the quality and consistency of the information collected in the database. The lack of a suitable evaluation and correction workflow not only leaves the dataset insufficiently accurate for machine translation, but also prevents organic teacher supervision. Therefore, a voting system that lets teachers or expert users evaluate the accuracy of alignments is one of the most needed additions. A related problem is that it is currently impossible to work on collaborative projects in which multiple users work together on the same corpus.
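One possible shape for such a voting system is sketched below in Python. It is purely illustrative: the function `accepted_pairs`, the ballot format, and the thresholds are assumptions, not a description of an existing or planned Ugarit feature.

```python
def accepted_pairs(votes, min_votes=3, min_approval=0.66):
    """Select the translation pairs that reviewers have approved.

    `votes` maps each (source_word, target_word) pair to a list of
    booleans, one per reviewer (True means the reviewer approves).
    A pair is accepted only if enough reviewers voted on it and a
    sufficient share of them approved it.
    """
    accepted = []
    for pair, ballots in votes.items():
        if len(ballots) < min_votes:
            continue  # not enough evidence either way
        approval = sum(ballots) / len(ballots)
        if approval >= min_approval:
            accepted.append(pair)
    return accepted


votes = {
    ("vita", "life"): [True, True, True],
    ("vita", "war"): [True, False, False],
    ("bellum", "war"): [True, True],  # too few votes so far
}
print(accepted_pairs(votes))
# [('vita', 'life')]
```

Requiring a minimum number of votes keeps a single enthusiastic reviewer from promoting a pair into the training data.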
Secondly, not all translation pairs are equally useful as training data. We
should expect alignment to produce conflicting, mutually exclusive results:
after all, translation alignment is the result of interpretation, and not
all cases are easily classifiable as “right” or “wrong”. In some
cases, typically literary texts, it is extremely difficult to establish
perfect or even partial correspondences between expressions and concepts in
potentially very different languages: we have found that, in the absence of
specific guidelines or gold standards for each language pair, users
tend to create their own set of rules to keep a consistent alignment
strategy, but these rules tend to differ widely according to the
particular purpose of the alignment. Therefore, we need a system to filter
translation pairs: one option would be to allow users to classify different
kinds of translation pairs, for example distinguishing categories such as
“perfect/complete” or “partial/incomplete”, and “literal” or “free/poetical”.
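The proposed classification could be represented as two labels attached to each pair, which then serve as a filter. The Python sketch below is a hypothetical illustration of that idea; the class `LabeledPair`, its field names, and the label values are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    source: str
    target: str
    completeness: str  # "complete" or "partial"
    style: str         # "literal" or "free"


def select_pairs(pairs, completeness="complete", style="literal"):
    """Keep only the pairs that carry the requested labels."""
    return [
        p for p in pairs
        if p.completeness == completeness and p.style == style
    ]


pairs = [
    LabeledPair("vita", "life", "complete", "literal"),
    LabeledPair("vita", "way of living", "partial", "free"),
    LabeledPair("bellum", "war", "complete", "literal"),
]

# Pairs suited to training data: complete and literal.
for pair in select_pairs(pairs):
    print(pair.source, "->", pair.target)
```

With labels of this kind, pairs marked as complete and literal could feed a training dataset, while partial or free pairs would remain available for the study of translation strategies.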
Finally, we recognize that immediate access to part-of-speech tagging and available word pairs while performing the alignment would greatly benefit non-specialized users and students, and would improve analytical passive reading. This highly desirable feature is also planned for future development.