Daniel Schüller studied philosophy, linguistics and history at Friedrich-Wilhelms-Universität Bonn and RWTH Aachen University, focusing on philosophy of science and logic as well as on philosophy of language. In 2013 he graduated with an M.A. thesis in which he comparatively investigated the use and heuristics of fictional models in history and physics. Since 2014, he has been a research assistant and doctoral student at the chair of Linguistics and Cognitive Semiotics and the Natural Media Lab at RWTH Aachen University. His main research interests include linguistic and semiotic theory – with special emphases on sign processes in co-speech gesture, semiotics in and of gesture research, motion-capture technology, and the field of digital humanities in general.

Christian Beecks is head of the Data Management and Analytics Group in the Computer Science Department at the University of Münster. In addition, he is a Senior Researcher in the User-Centered Ubiquitous Computing Group at the Fraunhofer Institute for Applied Information Technology FIT. He received his PhD in 2013 from RWTH Aachen University.

His research interests include Multimedia Data Engineering, Real-time Data Management and Smart Data Analysis. He has authored more than 70 conference and journal papers and won two best paper awards. In addition, he is a reviewer for various international conferences and journals.

Marwan Hassani is an assistant professor in the Architecture of Information Systems group at Eindhoven University of Technology, The Netherlands. Previously, he worked as a postdoctoral researcher and associate teaching assistant in the data management and data exploration group at RWTH Aachen University, Germany. His research interests include stream data mining, sequential pattern mining of multiple streams, stream process mining, efficient anytime clustering of big data streams and exploration of evolving graph data. Marwan received his PhD (2015) from RWTH Aachen University. He received an equivalent Master's degree in Computer Science from RWTH Aachen University (2009). He has coauthored more than 42 scientific publications and serves on several program committees.

Jennifer Hinnell is a doctoral candidate in the Department of Linguistics at the University of Alberta, Edmonton, Canada. Her research centers on communication in interaction. She uses multimodal corpus data, 3D motion capture data, and experimental methods to explore how people use their bodies, in conjunction with semantic and syntactic structures in speech, to create and convey meaning. Jennifer enjoys fruitful research partnerships with the Little Red Hen Distributed Learning Lab (UCLA) and the Natural Media Lab at RWTH Aachen University, Germany.

Bela Brenger studied linguistics and computer science at RWTH Aachen University. He graduated in 2015 with an interdisciplinary thesis analyzing motion-capture data of head gestures in dialogues. Since 2016 he has been part of the scientific staff at the chair of Linguistics and Cognitive Semiotics and manages the Natural Media Motion-Capture-Lab. His main interests are data-driven analysis of multimodal communication, with emphasis on methods to integrate spatial gesture data and speech.

Thomas Seidl is Professor of Computer Science at Ludwig-Maximilians-Universität München (LMU Munich), Germany.

Irene Mittelberg is Professor of Linguistics and Cognitive Semiotics at the Institute of English, American and Romance Studies at RWTH Aachen University. She directs the Natural Media Lab at Human Technology Centre (HumTec) and the Center for Sign Language and Gesture (SignGes). After gaining an M.A. in French linguistics and art history from Hamburg University, she completed an M.A. and a Ph.D. in Linguistics and Cognitive Studies at Cornell University. Combining embodiment research with classic semiotic theories (e.g. C.S. Peirce, R. Jakobson), Mittelberg’s cross-disciplinary research on language, gesture, space, embodied cognition and the visual arts has emphasized the role of metonymy, metaphor, frames, constructions, and image schemas in multimodal communication. Moreover, Mittelberg and her research team have developed tools and methods to use optical motion-capture technology for empirical gesture research at the juncture of linguistics, semiotics, architectural design, computer science, social neuroscience, and digital humanities.


The question of how to model similarity between gestures plays an important role in current studies in the domain of human communication. Most research into recurrent patterns in co-verbal gestures – manual communicative movements emerging spontaneously during conversation – is driven by qualitative analyses relying on observational comparisons between gestures. Because these kinds of gestures are not bound to well-formedness conditions, however, we propose a quantitative approach consisting of a distance-based similarity model for gestures recorded and represented in motion capture data streams. To this end, we model gestures by flexible feature representations, namely gesture signatures, which are then compared via signature-based distance functions such as the Earth Mover's Distance and the Signature Quadratic Form Distance. Experiments on real conversational motion capture data evidence the appropriateness of the proposed approaches in terms of their accuracy and efficiency. Our contribution to gesture similarity research and gesture data analysis allows for new quantitative methods of identifying patterns of gestural movements in human face-to-face interaction, i.e., in complex multimodal data sets.


Introduction

Given the central place of the

Indeed, human communication typically involves multiple modalities such as
vocalizations, spoken or signed discourse, manual gestures, eye gaze, body posture
and facial expressions. In face-to-face communication, manual gestures play an
important role by conveying meaningful information and guiding the interlocutors’
attention to objects and persons talked about. Gestures here are understood as
spontaneously emerging, dynamic configurations and movements of the speakers’ hands
and arms that contribute to the communicative content and partake in the interactive
organization of a spoken dialogue situation (e.g.

Drawing on this large body of gesture research across various fields of the
humanities and social sciences, the interdisciplinary approach presented here aims at
identifying and visualizing patterns of gestural behavior with the help of
custom-tailored computational tools and methods. Although co-speech gestures tend to
be regarded as highly idiosyncratic with respect to their spontaneous individual
articulation by speakers in spoken dialogue situations, it is safe to assume that
there are recurring forms of dynamic hand configurations and movement patterns which
are performed by speakers sharing the same cultural background. From this assumption
follows the hypothesis that, on the one hand, a general degree of similarity between
gestural forms may be presumed – trivially – due to the shared morphology of the
human body (e.g. kinaesthemes

; Kendon's locution clusters

; McNeill's catchments

; Ladewig's recurrent gestures

; and Cienki

In this paper, we will focus on certain kinds of co-verbal gestures, i.e. specific
image-schematic gestalts, e.g. spirals, circles, and straight paths

Research Objective

Whereas the gesture research discussed above mostly relies on observational methods
and qualitative video analyses, our aim is to add to the catalogue of methods for
empirical linguistics and gesture studies by outlining a computational, quantitative
and comparative 3D-model driven approach in gesture research. While there is a trend
to combine qualitative with quantitative as well as experimental methods in
multimodal communication research

and then to apply this methodology to the recorded 3D numerical MoCap data of
a group of participants.

Both the alignment of gestures with the co-occurring speech, and the semantic
comparison of the established (formally) sufficiently similar gesture-speech
constructions, still have to be done manually by human gesture researchers, through
semiotic analyses of the multimodal, speech and behavioral data corpora. The primary
aim of developing an automated indicator of gesture similarity is to identify
recurrent movement patterns of interest from the recorded 3D corpus data
computationally, and thus to enable human gesture researchers to handle these data
sets in a more efficient manner. In order to make gesture similarity automatically
accessible, we propose a distance-based similarity model for gestures arising in
three-dimensional motion capture data streams. In comparison to two-dimensional video
capture technology, working with numerical three-dimensional motion capture
technology has the advantage of measuring and visualizing the temporal and spatial
dynamics of otherwise invisible movement traces with the highest possible accuracy.
We aim at maintaining this accuracy by aggregating movement traces, also called
trajectories, into a gesture signature.

Properties of the 3D Data Model
A Vicon motion capture system was used in this study.
Participants wear a series of markers attached to predetermined body parts of
interest (fingers, wrists, elbows, neck, head, etc.). The movement of the markers is
tracked by 14 Vicon infrared cameras, and the system automatically represents the
physical trajectory of each marker as a chart of numerical 4-tuples of Euclidean
space-time coordinates. These space-time charts
form the data sets that are investigated algorithmically, relieving the gesture
analyst of the difficult, and subjective, task of manually examining highly ephemeral
real-world dialogue situations. But what are the crucial features that such a
numerical representation must have in order to enable researchers to not only
investigate a model but also to finally derive statements and theories about a
modeled real-world situation? We address the following research questions: Which
logical features of the model are essential if one wants to investigate the real
world by investigating a model? And secondly, what are the epistemic benefits of
investigating models instead of real-world situations?

From a philosophy of science point of view, before computational algorithms can be applied to naturalistic real-world gestures, the real-world dialogue situations in which people speak and gesture, and from which data are captured, must be translated into a computable set of data. For this purpose, a marker-based motion capture system is employed.

The most important feature is that the model identity

relation, in that two entities have an identical relation to a
third entity – a frame of reference, or a

Definition: Transitivity

For a binary relation R over a set A, transitivity is defined as:

∀x,y,z ∈ A: xRy & yRz → xRz

The transitivity of

Regarding the above-mentioned definition of transitivity, let our variables take the following values:

x = movement of body part from position a to b;

y = movement of marker M from position a to b;

z = trajectory of marker M

Given these values, we outline the transitivity relation as follows:

∀x,y,z ∈ A: xRy & yRz → xRz: x [movement of body part from position a to b]
R y [movement of marker M from position a to b] & y
[movement of marker M from position a to b] R z [trajectory
of marker M] → x [movement of body part from position a to b] R z [trajectory of marker M].
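The chain of relations spelled out above can be made concrete with a small program; the relation check is generic, while the string labels are purely illustrative stand-ins for the entities discussed in the text:

```python
from itertools import product

def is_transitive(relation):
    """Check ∀x,y,z: xRy & yRz → xRz for a relation given as a set of pairs."""
    return all((x, z) in relation
               for (x, y), (y2, z) in product(relation, repeat=2)
               if y == y2)

# Hypothetical toy instantiation of the chain in the text:
# body movement R marker movement R trajectory.
R = {("body movement a->b", "marker movement a->b"),
     ("marker movement a->b", "trajectory of M"),
     ("body movement a->b", "trajectory of M")}
```

Removing the third pair from R would make `is_transitive(R)` fail, mirroring the argument that the trajectory can only stand in for the body movement if the relation is closed under transitivity.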

This means that if

into

fails to be an event

and aggregated states of
affairs

as being synonymous, the problem completely disappears. Otherwise, we
have to re-translate the problematic concept into one which suits our needs. In terms
of epistemic benefits, one major advantage of the proposed distance-based
gesture-similarity model (see the following section), i.e. the combination of gesture
signatures with signature-based distance functions, is its applicability to any type
of gestural pattern and to data sets of any size. In fact, distance-based similarity
models can be utilized in order to model similarity between gestural patterns whose
movement types are well known and between gestural patterns whose inherent structures
are completely unknown. In this way, they provide an unsupervised way of modeling
gesture similarity. This flexibility is attributable to the fact that the proposed
approaches are model independent, i.e. no complex gesture model has to be learned in
a comprehensive training phase prior to indexing and query processing. Another
advantage of the proposed distance-based gesture-similarity model is the possibility
of efficient query processing. Although calculating the distance between two gesture
signatures is a computationally expensive task, which results in at least a quadratic
computation time complexity with respect to the number of relevant trajectories, many
approaches such as the independent minimization lower bound of the Earth Mover's
Distance on feature signatures

Modeling Gesture Similarity

In this section, we present a distance-based similarity model for the comparison of
gestures within three-dimensional motion capture data streams. To this end, we first
introduce gesture signatures and then describe applicable trajectory and gesture
signature distance functions.

Gesture Signatures

Motion capture data streams can be thought of as sequences of points in a three-dimensional Euclidean space. In the scope of this work, these points arise from several reflective markers which are attached to the body and in particular to the hands of a participant. The motion of the markers is triangulated via multiple cameras and finally recorded every 10 milliseconds. In this way, each marker defines a finite trajectory of points in a three-dimensional space. The formal definition of a trajectory is given below.

Definition: Trajectory

Given a three-dimensional feature space R3, a trajectory t:{1,…,n}→R3 of length
n∈N is defined for all 1≤i≤n as:

t(i)=(xi,yi,zi)

A trajectory describes the motion of a single marker in a three-dimensional
space. It is worth noting that the time information is abstracted to integral
numbers in order to model trajectories arising from different time intervals.
Since a gesture typically arises from multiple markers within a certain period of
time, we aggregate several trajectories including their individual relevance by
means of a gesture signature. For this purpose, we denote the set of all finite
trajectories as trajectory space T=∪k∈N{t| t:{1,…,k}→
R3} , which is time-invariant, and define a gesture
signature as a function from the trajectory space T into the real numbers R. The
formal definition of a gesture signature is given below.

Definition: Gesture Signature

Let T be a trajectory space. A gesture signature S over T is defined as:

S:T→R subject to |S⁻¹(R∖{0})|<∞

A gesture signature formalizes a gesture by assigning a finite number of
trajectories non-zero weights reflecting their importance. Negative weights are
immaterial in practice but ensure that the gesture space S={S:T→R | |S⁻¹(R∖{0})|<∞}
forms a vector space. While a weight of zero indicates the insignificance of a trajectory, a
positive weight is utilized to indicate contribution to the corresponding gesture.
In this way, a gesture signature allows us to focus on the trajectories arising
from those markers which actually form a gesture. For example, if a gesture is
expressed by the participant's hands, only the corresponding hand markers and thus
trajectories have to be weighted positively.
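To make the abstraction concrete, a trajectory can be represented as a finite sequence of 3D points and a gesture signature as a finite mapping from trajectories to non-zero weights. The following Python sketch is purely illustrative and not the implementation used in this study:

```python
# A trajectory: a finite sequence of (x, y, z) points; tuples of tuples are
# used here so that a trajectory can serve as a dictionary key.
spiral = ((0.0, 0.0, 0.0), (0.1, 0.2, 0.0), (0.0, 0.4, 0.1))
straight = ((0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0))

# A gesture signature: maps each relevant trajectory to a non-zero weight;
# trajectories absent from the mapping implicitly carry weight zero.
signature = {spiral: 1.0, straight: 1.0}

def weight(signature, trajectory):
    """S(t): the weight of a trajectory under a gesture signature."""
    return signature.get(trajectory, 0.0)
```

Only the finitely many trajectories stored in the mapping carry non-zero weight, which matches the finiteness condition in the definition above.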

A gesture signature defines a generic mathematical model but omits a concrete
functional implementation. In fact, given a subset of relevant trajectories 𝒯+⊂T, the most naive way of defining a gesture signature is to weight all relevant trajectories uniformly, i.e. S(t)=1 if t∈𝒯+ and S(t)=0 otherwise.

The isotropic behavior of this approach, however, completely ignores the inherent
characteristics of the relevant trajectories. We therefore weight each relevant
trajectory according to its inherent properties of motion distance and motion
variance, which are defined below.

Definition: Motion Distance and Motion Variance

Let T be a trajectory space and t:{1,…,n}→R3 be a trajectory. The motion distance
mδ:T→R of trajectory t is defined as:

mδ(t) = ∑1≤i<n ‖t(i+1) − t(i)‖

where ‖·‖ denotes the Euclidean norm. The motion variance mσ2:T→R of trajectory t
with respect to its centroid t̄ = (1/n)·∑1≤i≤n t(i) is defined as:

mσ2(t) = (1/n)·∑1≤i≤n ‖t(i) − t̄‖2

The intuition behind motion distance and motion variance is to take into account
the overall movement and vividness of a trajectory. The higher these qualities,
the more information the trajectory may contain and vice versa. Their utilization
with respect to a set of relevant trajectories finally leads to the definitions of
a

Definition: Motion Distance Gesture Signature and Motion
Variance Gesture Signature

Let T be a trajectory space and 𝒯+⊂T be a subset of relevant trajectories. A
motion distance gesture signature Smδ∈S is defined for all t∈T as:

Smδ(t) = mδ(t) if t∈𝒯+ and Smδ(t) = 0 otherwise.

A motion variance gesture signature Smσ2∈S is defined analogously for all t∈T as:

Smσ2(t) = mσ2(t) if t∈𝒯+ and Smσ2(t) = 0 otherwise.
Motion distance and motion variance gesture signatures are able to reflect the characteristics of the expressed gestures with respect to the corresponding relevant trajectories by adapting the number and weighting of relevant trajectories. As a consequence, the computation of a (dis)similarity value between gesture signatures is frequently based on the (dis)similarity values among the involved trajectories in the trajectory space. We thus outline applicable trajectory distance functions in the following section.
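Reading motion distance as the total path length traveled and motion variance as the mean squared deviation from a trajectory's centroid, the two weighting schemes can be sketched as follows (illustrative code under these assumptions, not the study's implementation):

```python
from math import dist, fsum

def motion_distance(traj):
    """m_delta(t): total Euclidean distance traveled along the trajectory."""
    return fsum(dist(p, q) for p, q in zip(traj, traj[1:]))

def motion_variance(traj):
    """m_sigma^2(t): mean squared Euclidean distance to the centroid."""
    n = len(traj)
    centroid = tuple(fsum(p[d] for p in traj) / n for d in range(3))
    return fsum(dist(p, centroid) ** 2 for p in traj) / n

# A motion distance gesture signature weights each relevant trajectory
# by its motion distance (hypothetical single-trajectory example).
relevant = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0))]
sig = {t: motion_distance(t) for t in relevant}
```

A motion variance gesture signature is obtained the same way by substituting `motion_variance` for `motion_distance` in the dictionary comprehension.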

Trajectory Distance Functions

Due to the nature of trajectories whose inherent properties are rarely
expressible in a single figure, trajectories are frequently compared by aligning
their coincident similar points with each other. A prominent example is the

Definition: Dynamic Time Warping Distance

Let tn:{1,…,n}→R3 and tm:{1,…,m}→R3 be two trajectories from T and let
δ:R3×R3→R be a distance function. The Dynamic Time Warping Distance
DTWδ:T×T→R between tn and tm is recursively defined as:

DTWδ(tn,tm) = δ(tn(n),tm(m)) + min{DTWδ(tn′,tm′), DTWδ(tn′,tm), DTWδ(tn,tm′)}

with base cases DTWδ(t0,t0) = 0 and DTWδ(tn,t0) = DTWδ(t0,tm) = ∞, where t′
denotes a trajectory shortened by its last point and t0 denotes the empty
trajectory.

As can be seen in the definition above, the Dynamic Time Warping Distance is
defined recursively by minimizing the accumulated ground distances between
aligned trajectory points.
Although there exist further approaches for the comparison of trajectories, such
as

Given a ground distance in the trajectory space T, we will show in the following
section how to lift this ground distance to the gesture space S⊂RT in order to compare gesture signatures with each
other.

Gesture Signature Distance Functions

Gesture signatures can differ in size and length, i.e., in the number of relevant
trajectories and in the lengths of those trajectories. In order to quantify the
distance between differently structured gesture signatures, we apply
signature-based distance functions

The Earth Mover's Distance takes its name from Stolfi's vivid description of the
transportation problem, which he likened to finding the minimal cost of moving a
total amount of earth from earth hills into holes

Definition: Earth Mover’s Distance

Let S1,S2∈S be two gesture signatures and let δ:T×T→R be a ground distance. The
Earth Mover's Distance EMDδ:S×S→R between S1 and S2 is defined as a minimum cost
flow over all possible flows f:T×T→R:

EMDδ(S1,S2) = min f ∑ti∈T ∑tj∈T f(ti,tj)·δ(ti,tj)

subject to the constraints:

f(ti,tj) ≥ 0 for all ti,tj∈T
∑tj∈T f(ti,tj) ≤ S1(ti) for all ti∈T
∑ti∈T f(ti,tj) ≤ S2(tj) for all tj∈T
∑ti∈T ∑tj∈T f(ti,tj) = min{∑ti∈T S1(ti), ∑tj∈T S2(tj)}

As can be seen in the definition above, the Earth Mover's Distance between two
gesture signatures is defined as a linear optimization problem subject to
non-negative flows which do not exceed the corresponding limitations given by the
weights of the trajectories of both gesture signatures. The computation of the
Earth Mover's Distance can be restricted to the relevant trajectories of both
gesture signatures and follows a specific variant of the simplex algorithm

The idea of the Signature Quadratic Form Distance consists in adapting the
generic concept of a similarity correlation to gesture signatures. Given a
similarity function s:T×T→R, the similarity correlation ⟨S1,S2⟩s of two gesture
signatures S1,S2∈S is defined as:

⟨S1,S2⟩s = ∑ti∈T ∑tj∈T S1(ti)·S2(tj)·s(ti,tj)

The similarity correlation between two gesture signatures finally leads to the definition of the Signature Quadratic Form Distance, as shown below.

Definition: Signature Quadratic Form Distance

Let S1,S2∈S be two gesture signatures and let s:T×T→R be a similarity function.
The Signature Quadratic Form Distance SQFDs:S×S→R between S1 and S2 is defined as:

SQFDs(S1,S2) = √(⟨S1,S1⟩s − 2·⟨S1,S2⟩s + ⟨S2,S2⟩s)

The Signature Quadratic Form Distance is defined by adding the intra-similarity
correlations ⟨S1,S1⟩s and ⟨S2,S2⟩s of the gesture signatures S1 and S2 and
subtracting twice their inter-similarity correlation ⟨S1,S2⟩s. The smaller the
differences among the intra-similarity and inter-similarity correlations, the lower
the resulting Signature Quadratic Form Distance, and vice versa. The computation
of the Signature Quadratic Form Distance can be restricted to the relevant
trajectories of both gesture signatures and has a quadratic computation time
complexity with respect to the number of relevant trajectories.
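A direct implementation follows from the definition; the sketch below represents gesture signatures as dictionaries from (hashable) trajectories to weights, and the Gaussian-style similarity function over scalar stand-in "trajectories" is an illustrative assumption, not the kernel used in the study:

```python
from math import exp, sqrt

def sqfd(s1, s2, sim):
    """Signature Quadratic Form Distance between two gesture signatures
    (dicts mapping trajectories to weights) under a similarity function."""
    def corr(a, b):
        # Similarity correlation <a, b>_s: sum over all trajectory pairs.
        return sum(wa * wb * sim(ta, tb)
                   for ta, wa in a.items()
                   for tb, wb in b.items())
    return sqrt(corr(s1, s1) - 2 * corr(s1, s2) + corr(s2, s2))

def gauss(a, b):
    # Illustrative similarity function decreasing with the ground distance.
    return exp(-abs(a - b))
```

Identical signatures yield a distance of zero, and the double sum makes the quadratic time complexity in the number of relevant trajectories directly visible.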

More details regarding the Earth Mover's Distance and the Signature Quadratic
Form Distance as well as possible similarity functions can be found for instance
in the PhD thesis of Beecks

Experimental Evaluation

Evaluating the performance of distance-based similarity models is a highly empirical
task: it is hardly foreseeable in advance which approach will provide the best
retrieval performance in terms of accuracy. We therefore qualitatively evaluated the
proposed distance-based approaches to gesture similarity by using a natural media
corpus of motion capture data collected for this project. This dataset comprises
three-dimensional motion capture data streams arising from eight participants during
a guided conversation. The participants were equipped with a multitude of reflective
markers which were attached to the body and in particular to the hands. The motion of
the markers was tracked optically via cameras at a frequency of 100 Hz. In the scope
of this work, we used the right wrist marker and two markers attached to the right
thumb and right index finger each. The gestures arising within the conversation were
classified by domain experts according to the following types of movement: spiral,
circle, and straight. Example gestures of these movement types are sketched in Figure
1. A total of 20 gesture signatures containing five trajectories each was obtained
from the motion capture data streams. The trajectories of the gesture signatures have
been normalized to the unit cube [0,1]3⊂R3 in order to achieve translation invariance.
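One plausible reading of this normalization step is a per-axis min-max rescaling into the unit cube, which removes absolute position (and per-axis scale); the sketch below is an assumption about the procedure, not the study's exact preprocessing:

```python
def normalize_unit_cube(traj):
    """Min-max normalize a trajectory of (x, y, z) points into [0, 1]^3 per
    axis, discarding absolute position and per-axis extent."""
    mins = [min(p[d] for p in traj) for d in range(3)]
    maxs = [max(p[d] for p in traj) for d in range(3)]
    # Guard against a zero span on a motionless axis to avoid division by zero.
    spans = [mx - mn or 1.0 for mn, mx in zip(mins, maxs)]
    return tuple(
        tuple((p[d] - mins[d]) / spans[d] for d in range(3)) for p in traj
    )
```

After this step, two gestures performed in different regions of the capture volume become directly comparable by the distance functions above.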

The resulting distance matrices between all gesture signatures with respect to the
Earth Mover's Distance and the Signature Quadratic Form Distance are shown in Figure
2 and Figure 3, respectively. As described in the previous section, we utilized the
Dynamic Time Warping Distance based on Euclidean Distance as trajectory distance for
the Earth Mover's Distance and converted this trajectory distance by means of the
power kernel

As can be seen in Figure 2 and Figure 3, both the Earth Mover's Distance and the Signature Quadratic Form Distance show the same tendency in terms of gestural dissimilarity. Although the distance values computed through the aforementioned distance functions have different orders of magnitude, both gesture signature distance functions are generally able to distinguish gesture signatures from different movement types. On average, gesture signatures belonging to the same movement type are less dissimilar to each other than gesture signatures from different movement types. We further observed that the distinction between gesture signatures from the movement types spiral and straight is most challenging. This is caused by a similar sequence of movement in these two gestural types. While gesture signatures belonging to the movement type straight follow a certain direction, e.g., movement on the horizontal axis, gesture signatures from the movement type spiral additionally oscillate with respect to a certain direction. Since this oscillation can be dominated by the movement direction, the underlying trajectory distance functions are often unable to distinguish oscillating from non-oscillating trajectories and thus gesture signatures of movement type spiral from those of movement type straight.

Apart from the quality of accuracy, efficiency is another important aspect when
evaluating the performance of gesture similarity models. For this purpose, we
measured the computation times needed to perform single distance computations on a
single-core 3.4 GHz machine. We implemented the proposed distance-based approaches in
Java 1.7. The Earth Mover's Distance, which needs on average 148.6 milliseconds for a
single distance computation, is approximately three times faster than the Signature
Quadratic Form Distance, which needs on average 479.8 milliseconds for a single
distance computation. In spite of the theoretically exponential and empirically
super-cubic computation time complexity of the Earth Mover's Distance

To sum up, the experimental evaluation reveals that the proposed distance-based approaches are able to model gesture similarity in a flexible and model-independent way. Without the need for a preceding training phase, the Earth Mover's Distance and the Signature Quadratic Form Distance are able to provide similarity models for searching similar gestures which are formalized through gesture signatures.

Conclusions and Future Work

In this paper, we have investigated distance-based approaches to measuring similarity between gestures arising in three-dimensional motion capture data streams. To this end, we have explicated gesture signatures as a way of aggregating the inherent characteristics of spontaneously produced co-speech gestures, and signature-based distance functions such as the Earth Mover's Distance and the Signature Quadratic Form Distance in order to quantify dissimilarity between gesture signatures. The experiments conducted on real data evidence the appropriateness of the proposed approaches in terms of accuracy and efficiency.

In future work, we intend to extend our research on gesture similarity towards indexing and efficient query processing. While the focus of the present paper lies on dissimilarity between pairs of gestures, we further plan to quantitatively analyze motion capture data streams in a query-driven way in order to support the domain experts' qualitative analyses of gestural patterns within multi-media contexts. The overall goal of this research is to contribute to the advancement of automated methods of pattern recognition in gesture research by enhancing qualitative analyses of complex multimodal data in the humanities and social sciences. While this paper focuses on formal features of the gestural movements, further steps will entail examining the semantic and pragmatic dimensions of these patterns in light of the cultural contexts and embodied semiotic practices they emerge from.

Acknowledgment

This work is partially funded by the Excellence Initiative of the German federal and
state governments and DFG grant SE 1039/7-1. This work extends