Volume 11 Number 2

# Automated Pattern Analysis in Gesture Research: Similarity Measuring in 3D Motion Capture Models of Communicative Action

## Abstract

The question of how to model similarity between gestures plays an important role in current studies in the domain of human communication. Most research into recurrent patterns in co-verbal gestures – manual communicative movements emerging spontaneously during conversation – is driven by qualitative analyses relying on observational comparisons between gestures. Because these kinds of gestures are not bound to well-formedness conditions, however, we propose a quantitative approach consisting of a distance-based similarity model for gestures recorded and represented in motion capture data streams. To this end, we model gestures by flexible feature representations, namely gesture signatures, which are then compared via signature-based distance functions such as the Earth Mover's Distance and the Signature Quadratic Form Distance. Experiments on real conversational motion capture data demonstrate the appropriateness of the proposed approaches in terms of their accuracy and efficiency. Our contribution to gesture similarity research and gesture data analysis allows for new quantitative methods of identifying patterns of gestural movements in human face-to-face interaction, i.e., in complex multimodal data sets.

# Introduction

*embodied mind* in experientialist approaches to language, co-verbal gestures have become a valuable data source in cognitive, functional and anthropological linguistics (e.g. [Sweetser 2007]). While there exist various views on embodiment, the core idea is that human higher cognitive abilities are shaped by the morphology of our bodies and the way we interact with the material, spatial and social environment (e.g. [Gibbs 2006]; [Johnson 1987]). Drawing on these premises, some gesture scholars stress that gestures are conditioned by the forms and affordances of their material habitat as well as the speakers’ interactive and collaborative practices (e.g. [Enfield 2009]; [Streeck 2011]). Pioneering work done by Kendon (e.g. [Kendon 1972], [Kendon 2004]), McNeill (e.g. [McNeill 1985], [McNeill 1992], [McNeill 2005]) and Müller [Müller 1998] has shown that manual gestures are an integral part of utterance formation and communicative interaction. The state of the art of research in the growing interdisciplinary field of *gesture studies* has recently been presented in the International Handbook on Multimodality in Human Interaction ([Müller 2013], Vol. 1; [Müller 2014], Vol. 2). One quintessence to be drawn from this large body of work is that language, whether spoken or signed, is embodied, dynamic, multimodal and intersubjective (see also [Duncan 2007]; [Gibbs 2006]; [Jäger 2004]; [Mittelberg 2013]; [Müller 2008]).

# Research Objective

- whether there are shared or common reoccurring gestural movement patterns in a given set of 3D recorded, behavioral communication data,
- exactly which forms there are, and
- the extent to which they occur,

*gesture signature* [Beecks 2015]. This gesture signature has the ability of weighting trajectories according to their relevance. Based on this lossless feature representation, we propose to measure similarity by means of distance-based approaches [Beecks 2013], [Beecks 2010]. We particularly investigate the *Earth Mover's Distance* [Rubner 2000] and the *Signature Quadratic Form Distance* [Beecks 2010] for the comparison of two gesture signatures on real conversational motion capture data.

# Properties of the 3D Data Model

*represents* its representandum. The representandum in question is a set of relevant features and relations of a given part of reality, namely the change in space-time location of certain body parts caused by the participants’ kinetic movement. The representing model, on the other hand, is the virtual, computational 3D recording of the real-world kinetic movement, mapped onto Euclidean space. Representation itself is the relation which holds between the model and its representandum. Representation, as we understand the term here, is a non-reflexive and non-symmetric relation, which simply means that an *a* does not necessarily represent *a*, and that if *a* represents *b*, then *b* does not necessarily represent *a*. We further assume that representation depends on transitive relations, such as identity between some complex relational features of the entities to be represented in a model and the entities that represent/model them. In short, this means that if there are entities *x, y, z*, there must be at least one mutual complex relational feature *Fr(x,y,z)*, which these entities have in common. The transitive relation *R*, then, is the “identity” relation, in that two entities have an identical relation to a third entity – a frame of reference, or a tertium comparationis.

*R* over a set *A* consisting of *x, y, z*, the following statement holds true:
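Assuming that the standard notion of transitivity is what is meant here, the condition on *R* can be written as follows (a sketch in conventional notation, not a quotation):

$$\forall x,y,z \in A:\ \left( x\,R\,y \wedge y\,R\,z \right) \Rightarrow x\,R\,z$$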

*R* is so important here because, in terms of modeling, it is the crucial feature that *R* must have. The different relata involved are: movements of body parts *(x)*, movements of markers *(y)*, and computational trajectories *(z)*. The relation *R* which holds between these relata is the identity of their curve with respect to space – either to a given virtual Euclidean space, or the physical space (which functions as a reference frame). The identity of the movement curve of body parts and the movement curve of markers simply stems from their physical attachment/conjunction in physical space. The identity of marker movements and the computational trajectories is a result of metering the markers’ light reflections, by means of the 14 Vicon infrared cameras, and numerically mapping the outcome onto Euclidean space coordinates. But if one differentiates between physical and Euclidean space, there is hardly any identity one could honestly speak of in this case. In what sense could physical and virtual movement curves ever be identical? Only in the sense that we *identify* physical and Euclidean space by conventions of scientific modeling and metering practices; the term *curve* is itself an indication of how familiar that convention is. Since Euclidean space is a conventionalized and well-accepted geometrical model of physical space – we are well accustomed to talking about physical movement in terms of distances, trajectories, vectors, miles, kilometres and change in space/time coordinates etc. – one can say that Euclidean space is our familiar standard model for describing our perception of movement in physical space, both in everyday conversation and scientific discussion. Thus, it is justified to speak of the *identity* of the spatial curves of trajectories in Euclidean space and those of moving physical objects such as Motion Capture markers, only if we accept this convention. Regarding representation, what does this imply? If representation depends on the transitivity of relation *R* holding between the entities *x, y, z*, and the identity of that relation depends on conventions of metering and modeling, representation additionally depends on following conventions. Given a 4-tuple of coordinates of a MoCap marker, the movement of this marker is modeled by a vector which points from tuple-1 to tuple-2 to tuple-n. The crucial feature of this kind of modeling is that the movement of a single marker *a* is represented as a dynamic space-time trajectory which aggregates the consecutively changing coordinates of *a* over a given time frame.

y = movement of marker M from position a to b;

z = trajectory of marker M

If *x, y, z* obtain a transitive relation *R* in the above sense, then *x, y,* and *z* are to be regarded as homomorphous abstract *concepts* that all denote the same event of spatiotemporal movement, and *R* is an equivalence relation. So the *extensional* equivalence of these concepts is a necessary and sufficient condition that allows us to investigate reality by investigating the model: If a language and its translation are equivalent, it should be equally valid to investigate one or the other. However, since the translation of the concept “*movement* of marker” into “*aggregated* marker *coordinates*” fails to be an *intensionally* adequate translation, i.e. the concepts do not *mean* the same (event vs. aggregated states of affairs), it could at first seem odd to describe real movement in terms of trajectories. But, since the concepts are at least phenomenologically and extensionally equivalent, this basically remains a question of the interpreter’s ontology (see [Quine 1980]) and how the final research result is to be formulated. If we decide to treat “event” and “aggregated states of affairs” as being synonymous, the problem completely disappears. Otherwise, we have to re-translate the problematic concept into one which suits our needs.

In terms of epistemic benefits, one major advantage of the proposed distance-based gesture-similarity model (see the following section), i.e. the combination of gesture signatures with signature-based distance functions, is its applicability to any type of gestural pattern and to data sets of any size. In fact, distance-based similarity models can be utilized in order to model similarity between gestural patterns whose movement types are well known and between gestural patterns whose inherent structures are completely unknown. In this way, they provide an unsupervised way of modeling gesture similarity. This flexibility is attributable to the fact that the proposed approaches are model independent, i.e. no complex gesture model has to be learned in a comprehensive training phase prior to indexing and query processing.

Another advantage of the proposed distance-based gesture-similarity model is the possibility of efficient query processing. Although calculating the distance between two gesture signatures is a computationally expensive task, which results in at least a quadratic computation time complexity with respect to the number of relevant trajectories, many approaches such as the independent minimization lower bound of the Earth Mover's Distance on feature signatures [Uysal 2014] and metric indexing [Beecks 2011], as well as the Ptolemaic indexing [Hetland 2013] of the Signature Quadratic Form Distance, are available for efficient query processing and, thus, for assessing gesture similarity on a larger quantitative scale.

# Modeling Gesture Similarity

*gesture signatures* as a formal model of gestures arising in motion capture data streams. Since gesture signatures comprise multiple three-dimensional trajectories, we continue with outlining distance functions for trajectories before we investigate distance functions applicable to gesture signatures.

## Gesture Signatures

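A trajectory \(t:\{ 1,\ldots,n\} \rightarrow \mathbb{R}^{3}\) is defined for all \(1 \leq i \leq n\) as: $$t(i) = \left( x_{i},y_{i},z_{i} \right)$$

A *gesture signature* \(S \in \mathbb{R}^{\mathbb{T}}\) is defined as: $$S:\mathbb{T} \rightarrow \mathbb{R}\quad\text{subject to}\quad\left| S^{- 1}\left( \mathbb{R} \smallsetminus \{ 0\} \right) \right| < \infty$$

The class of gesture signatures \(\mathbb{S} = \{ S \mid S \in \mathbb{R}^{\mathbb{T}} \land \left| S^{- 1}\left( \mathbb{R} \smallsetminus \{ 0\} \right) \right| < \infty\}\) forms a vector space. While a weight of zero indicates insignificance of a trajectory, a positive weight is utilized to indicate contribution to the corresponding gesture. In this way, a gesture signature allows us to focus on the trajectories arising from those markers which actually form a gesture. For example, if a gesture is expressed by the participant's hands, only the corresponding hand markers and thus trajectories have to be weighted positively.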

*S* consists in assigning relevant trajectories a weight of one and irrelevant trajectories a weight of zero, i.e. by defining *S* for all \(t \in \mathbb{T}\) as follows: $$S(t) = \begin{cases} 1, & \text{if } t \in \mathcal{T}^{+} \\ 0, & \text{otherwise} \end{cases}$$
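The binary weighting can be illustrated with a minimal Python sketch. The names `binary_gesture_signature` and `weight` are hypothetical, and representing a signature as a dictionary from trajectories to weights is an assumption made for illustration: only trajectories with non-zero weight are stored, which mirrors the finite-support condition on gesture signatures.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float]
Trajectory = Tuple[Point, ...]  # a tuple of points is hashable, so it can be a dict key


def binary_gesture_signature(relevant: Iterable[Trajectory]) -> Dict[Trajectory, float]:
    """Assign weight 1 to every relevant trajectory (the set T+ in the text).

    Trajectories that are absent from the dictionary implicitly carry weight 0.
    """
    return {t: 1.0 for t in relevant}


def weight(signature: Dict[Trajectory, float], t: Trajectory) -> float:
    """Look up S(t), returning 0 for any trajectory outside the support."""
    return signature.get(t, 0.0)


# Illustrative data only: one hand-marker trajectory is relevant, the head marker is not.
hand = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
head = ((0.0, 2.0, 0.0), (0.0, 2.0, 0.1))
S = binary_gesture_signature([hand])
```

With this representation, `weight(S, hand)` yields the weight of a relevant trajectory and `weight(S, head)` falls back to zero.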

*motion distance* and *motion variance*. These properties are defined below.

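Let \(t:\{ 1,\ldots,n\} \rightarrow \mathbb{R}^{3}\) be a trajectory. The *motion distance* \(m_{\delta}:\mathbb{T} \rightarrow \mathbb{R}\) of trajectory *t* is defined as: $$m_{\delta}(t) = \sum_{i = 1}^{n - 1}\left\| t(i) - t(i + 1) \right\|_{2}$$

The *motion variance* \(m_{\sigma^{2}}:\mathbb{T} \rightarrow \mathbb{R}\) of trajectory *t* is defined with mean \(\mu(t) = \frac{1}{n} \cdot \sum_{i = 1}^{n}t(i)\) as: $$m_{\sigma^{2}}(t) = \frac{1}{n} \cdot \sum_{i = 1}^{n}\left\| t(i) - \mu(t) \right\|_{2}^{2}$$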
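As a small worked illustration, the two motion properties can be computed as in the following Python sketch. The function names are hypothetical, and the sketch assumes that \(\mu(t)\) denotes the arithmetic mean of the trajectory's points.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def motion_distance(t: Sequence[Point]) -> float:
    """Sum of Euclidean distances between consecutive points of the trajectory."""
    return sum(math.dist(t[i], t[i + 1]) for i in range(len(t) - 1))


def motion_variance(t: Sequence[Point]) -> float:
    """Mean squared Euclidean distance of the points to their centroid."""
    n = len(t)
    # Assumption: mu(t) is the component-wise arithmetic mean of all points.
    mean = tuple(sum(p[k] for p in t) / n for k in range(3))
    return sum(math.dist(p, mean) ** 2 for p in t) / n


# Illustrative trajectory: a 5-unit step followed by a 12-unit step.
path = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
```

A marker that travels far accumulates a large motion distance even if it oscillates around one spot, whereas motion variance grows only when the points spread away from their centroid.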

*motion distance gesture signature* and a *motion variance gesture signature*, as shown below.

A *motion distance gesture signature* \(S_{m_{\delta}}\) is defined for all \(t \in \mathbb{T}\) as: $$S_{m_{\delta}}(t) = \begin{cases} m_{\delta}(t), & \text{if } t \in \mathcal{T}^{+} \\ 0, & \text{otherwise} \end{cases}$$

A *motion variance gesture signature* \(S_{m_{\sigma^{2}}}\) is defined for all \(t \in \mathbb{T}\) as: $$S_{m_{\sigma^{2}}}(t) = \begin{cases} m_{\sigma^{2}}(t), & \text{if } t \in \mathcal{T}^{+} \\ 0, & \text{otherwise} \end{cases}$$

## Trajectory Distance Functions

*Dynamic Time Warping Distance*, which was first introduced in the field of speech recognition by Itakura [Itakura 1975] and Sakoe and Chiba [Sakoe 1978] and later brought to the domain of pattern detection in databases by Berndt and Clifford [Berndt 1994]. The idea of this distance is to locally replicate points of the trajectories in order to fit the trajectories to each other. The point-wise distances finally yield the Dynamic Time Warping Distance, whose formal definition is given below.

Let \(t_{n}:\{ 1,\ldots,n\} \rightarrow \mathbb{R}^{3}\) and \(t_{m}:\{ 1,\ldots,m\} \rightarrow \mathbb{R}^{3}\) be two trajectories from \(\mathbb{T}\) and \(\delta:\mathbb{R}^{3} \times \mathbb{R}^{3} \rightarrow \mathbb{R}\) be a distance function. The *Dynamic Time Warping Distance* \(DTW_{\delta}:\mathbb{T} \times \mathbb{T} \rightarrow \mathbb{R}\) between \(t_{n}\) and \(t_{m}\) is recursively defined as: $$DTW_{\delta}\left( t_{n},t_{m} \right) = \delta\left( t_{n}(n),t_{m}(m) \right) + \min\left\{ \begin{matrix} DTW_{\delta}\left( t_{n - 1},t_{m - 1} \right) \\ DTW_{\delta}\left( t_{n},t_{m - 1} \right) \\ DTW_{\delta}\left( t_{n - 1},t_{m} \right) \\ \end{matrix} \right\}$$

with the following base cases:

- \(DTW_{\delta}\left( t_{0},t_{0} \right) = 0\)
- \(DTW_{\delta}\left( t_{i},t_{0} \right) = \infty\ \ \forall 1 \leq i \leq n\)
- \(DTW_{\delta}\left( t_{0},t_{j} \right) = \infty\ \ \forall 1 \leq j \leq m\)
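The recursive definition with these base cases can be evaluated bottom-up with dynamic programming. The following Python sketch is one minimal way to do so; the function name `dtw` is hypothetical, and the Euclidean default for \(\delta\) is an assumption, since the definition leaves the point distance generic.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]


def dtw(tn: Sequence[Point], tm: Sequence[Point],
        delta: Callable[[Point, Point], float] = math.dist) -> float:
    """Dynamic Time Warping Distance between two trajectories.

    table[i][j] holds the distance between the first i points of tn and the
    first j points of tm; index 0 stands for the empty prefix t_0.
    """
    n, m = len(tn), len(tm)
    # Base cases: empty vs. empty costs 0, empty vs. non-empty costs infinity.
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = delta(tn[i - 1], tm[j - 1])
            table[i][j] = cost + min(table[i - 1][j - 1],  # advance both
                                     table[i][j - 1],      # replicate point of tn
                                     table[i - 1][j])      # replicate point of tm
    return table[n][m]


# Illustrative data: b repeats the middle point of a, so warping aligns them exactly.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
```

The table has (n+1)·(m+1) entries and each is filled in constant time, so the run time is proportional to n·m.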

*δ* between replicated elements of the trajectories. In this way, the distance *δ* assesses the spatial proximity of two points while the Dynamic Time Warping Distance preserves their temporal order within the trajectories. By utilizing Dynamic Programming, the computation time complexity of the Dynamic Time Warping Distance lies in 𝒪(n·m).

*Edit Distance on Real Sequences* [Chen 2005], *Minimal Variance Matching* [Latecki 2005], and *Mutual Nearest Point Distance* [Fang 2009], we have decided to utilize the Dynamic Time Warping Distance for the following reasons: (i) the distance value is based on all points of the trajectories with respect to their temporal order and is not attributed to partial characteristics of the trajectories, (ii) it provides the ability of exact indexing by lower bounding [Keogh 2002], and (iii) it showed superior accuracy in preliminary investigations.

## Gesture Signature Distance Functions

*Earth Mover's Distance* [Rubner 2000] and the correlation-based *Signature Quadratic Form Distance* [Beecks 2010] in the remainder of this section.

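Let \(S_{1},S_{2} \in \mathbb{S}\) be two gesture signatures and \(\delta:\mathbb{T} \times \mathbb{T} \rightarrow \mathbb{R}\) be a trajectory distance function. The *Earth Mover’s Distance* \(EMD_{\delta}:\mathbb{S} \times \mathbb{S} \rightarrow \mathbb{R}\) between \(S_{1}\) and \(S_{2}\) is defined as a minimum cost flow of all possible flows \(F = \{ f \mid f:\mathbb{T} \times \mathbb{T} \rightarrow \mathbb{R}\}\), subject to the following constraints: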

- \(\forall t,t'\mathbb{\in T};f(t,t') \geq 0\)
- \(\forall t \in \mathbb{T};\Sigma_{t'\mathbb{\in T}}f(t,t') \leq S_{1}(t)\)
- \(\forall t'\mathbb{\in T};\Sigma_{t\mathbb{\in T}}f(t,t') \leq S_{2}(t')\)
- \(\Sigma_{t \in \mathbb{T}}\Sigma_{t'\mathbb{\in T}}\ f(t,t') = \min_{}\left\{ \Sigma_{t \in \mathbb{T}}S_{1}(t),\Sigma_{t'\mathbb{\in T}}S_{2}(t') \right\}\)
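Assuming the normalized formulation of Rubner et al. [Rubner 2000], to which the four constraints above correspond, the quantity being minimized can be sketched as the total transport cost divided by the total flow. This is written out under that assumption rather than quoted:

$$EMD_{\delta}\left( S_{1},S_{2} \right) = \min_{f \in F}\left\{ \frac{\sum_{t \in \mathbb{T}}\sum_{t' \in \mathbb{T}}f(t,t') \cdot \delta(t,t')}{\min\left\{ \sum_{t \in \mathbb{T}}S_{1}(t),\sum_{t' \in \mathbb{T}}S_{2}(t') \right\}} \right\}$$

Here \(f(t,t')\) is the amount of weight moved from trajectory \(t\) of \(S_{1}\) to trajectory \(t'\) of \(S_{2}\). The first constraint forbids negative flow, the second and third cap the flow by the available weights, and the fourth forces as much weight as possible to be moved.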

*correlation* to gesture signatures. In general, correlation is the most basic measure of bivariate relationship between two variables [Rodgers 1988] and can be interpreted as the amount of variance these variables share [Rovine 1997]. In order to apply the concept of correlation to gesture signatures, all trajectories and corresponding weights are related with each other based on a trajectory similarity function \(s:\mathbb{T} \times \mathbb{T} \rightarrow \mathbb{R}\). The resulting *similarity correlation* between two gesture signatures \(S_{1},S_{2} \in \mathbb{S}\) is then defined as: $$\left\langle S_{1},S_{2} \right\rangle_{s} = \sum_{t \in \mathbb{T}}\sum_{t' \in \mathbb{T}}S_{1}(t) \cdot S_{2}(t') \cdot s(t,t')$$

Let \(S_{1},S_{2} \in \mathbb{S}\) be two gesture signatures and \(s:\mathbb{T} \times \mathbb{T} \rightarrow \mathbb{R}\) be a trajectory similarity function. The *Signature Quadratic Form Distance* \(SQFD_{s}:\mathbb{S} \times \mathbb{S} \rightarrow \mathbb{R}\) between \(S_{1}\) and \(S_{2}\) is defined as: $$SQFD_{s}\left( S_{1},S_{2} \right) = \sqrt{\left\langle S_{1},S_{1} \right\rangle_{s} - 2 \cdot \left\langle S_{1},S_{2} \right\rangle_{s} + \left\langle S_{2},S_{2} \right\rangle_{s}}$$

\(\left\langle S_{1},S_{1} \right\rangle_{s}\) and \(\left\langle S_{2},S_{2} \right\rangle_{s}\) of the gesture signatures \(S_{1}\) and \(S_{2}\) and subtracting their inter-similarity correlation \(\left\langle S_{1},S_{2} \right\rangle_{s}\). The smaller the differences among the intra-similarity and inter-similarity correlations, the lower the resulting Signature Quadratic Form Distance, and vice versa. The computation of the Signature Quadratic Form Distance can be restricted to the relevant trajectories of both gesture signatures and has a quadratic computation time complexity with respect to the number of relevant trajectories.
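The computation restricted to relevant trajectories can be sketched in Python as follows. The names are hypothetical, signatures are assumed to be dictionaries that store only non-zero weights, and the similarity correlation is assumed to be the weighted double sum over all trajectory pairs, as is usual for the Signature Quadratic Form Distance. The clamp to zero guards against small negative values that can arise from rounding or from a similarity function that is not positive semi-definite.

```python
import math
from typing import Callable, Dict, Hashable

Signature = Dict[Hashable, float]          # trajectory -> non-zero weight
Similarity = Callable[[Hashable, Hashable], float]


def similarity_correlation(s1: Signature, s2: Signature, sim: Similarity) -> float:
    """Weighted sum of pairwise similarities over the relevant trajectories only."""
    return sum(w1 * w2 * sim(t1, t2)
               for t1, w1 in s1.items()
               for t2, w2 in s2.items())


def sqfd(s1: Signature, s2: Signature, sim: Similarity) -> float:
    """Signature Quadratic Form Distance between two gesture signatures."""
    value = (similarity_correlation(s1, s1, sim)
             - 2.0 * similarity_correlation(s1, s2, sim)
             + similarity_correlation(s2, s2, sim))
    return math.sqrt(max(value, 0.0))  # clamp guards against rounding below zero
```

Each similarity correlation iterates over all pairs of stored trajectories, which is where the quadratic cost in the number of relevant trajectories comes from.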

# Experimental Evaluation

*α*=1 into a trajectory similarity function for the Signature Quadratic Form Distance. Since weighting relevant trajectories by motion distance and by motion variance shows approximately similar behavior, we include the results regarding motion variance gesture signatures only. We depict small and large distance values by bluish and reddish colors in order to visually indicate the performance of our proposal: gesture signatures from the same movement type should result in bluish colors while gesture signatures from different movement types should result in reddish colors.