
2017 IEEE International Conference on Robotics and Automation (ICRA)

Singapore, May 29 - June 3, 2017

Recognizing Social Touch Gestures using Recurrent and Convolutional Neural Networks

Dana Hughes1, Alon Krauthammer2 and Nikolaus Correll1

Abstract: Deep learning approaches have been used to perform classification in several applications with high-dimensional input data. In this paper, we investigate the potential of deep learning for classifying affective touch on robotic skin in a social setting. Three models are considered: a convolutional neural network, a convolutional-recurrent neural network and an autoencoder-recurrent neural network. These models are evaluated on two publicly available affective touch datasets, and compared with models built to classify the same datasets. The deep learning approaches provide a similar level of accuracy, and allow gestures to be predicted in real time at a rate of 6 to 9 Hertz. The memory requirements of the models demonstrate that they can be implemented on small, inexpensive microcontrollers, showing that classification can be performed in the skin itself by collocating computing elements with the sensor array.

I. INTRODUCTION
Full-body tactile sensitive robotic skins based on large pressure sensitive arrays have found several applications in manipulation, exploration, navigation, and human-robot interaction [1]. Tactile sensing modalities have been especially useful for conveying emotion and intent in therapeutic and companion robots [2]. Over the last few years, detecting and identifying high-level affective touch gestures have been explored for social human-robot interaction [3]–[5]. Specifically, several robotic pets have been developed for therapeutic and companion purposes for the elderly and ill, such as the one shown in Figure 1.

Fig. 1. Tactile sensitive companion robot interacting through affective touch (from [3]) and basis for the HAART dataset.

Motivated by the need for high quality gesture classifiers, two affective touch datasets [4], [6] have been made publicly available, which has allowed several different classification approaches to be explored [7]–[10]. These investigations have focused primarily on manually developing features to be extracted from gestures, usually based on common features from other domains, such as speech recognition and image processing. In general, the features used in these investigations are extracted from a gesture as a whole, which requires detecting the onset and end of a gesture, and limits the ability to perform real-time classification.

We are interested in using deep learning techniques, such as convolutional (CNN) and recurrent neural networks (RNN), for classifying social touch gestures in the context of robotic materials. To the best of our knowledge, deep learning approaches for social touch recognition have not been explored, with the exception of using deep autoencoders to compress individual frames in [10]. Deep learning approaches have shown several advantages over hand-engineered features: CNNs are useful for automatically generating features from training data, the learned features exhibit invariance to translation and scaling, and RNNs can provide predictions at each point in a sequence, rather than one prediction for the sequence as a whole.

In addition, we are interested in framing pressure-sensitive robotic skins as a robotic material [11], [12]. In the robotic material paradigm, a network of inexpensive sensing and computing elements is embedded in and tightly coupled with a physical material. Individual computing elements process local sensor data and share information with neighboring nodes in order to perform global tasks, such as gesture classification. This effectively allows a significant amount of computing to be performed in the material itself, reducing communication to and processing in a centralized computer, and allowing the material to scale.

The purpose of this investigation is to explore the suitability of deep learning for affective touch classification, determine the accuracy trade-off when classifying entire gestures versus providing predictions several times per second, and determine if deep learning models may be suitable for use in a robotic materials context.

D. Hughes was supported by the Air Force Office of Scientific Research. We are grateful for this support.
1 Department of Computer Science, University of Colorado Boulder, Boulder, CO 80309-0430, USA {dana.hughes,nikolaus.correll}@colorado.edu
2 The Aerospace Corporation, El Segundo, CA 90245 alon.krauthammer@aero.org



II. RELATED WORK

Several tactile sensitive robots have been developed specifically for exploring affective touch in HRI. Robotic pets, such as the Haptic Creature [3], The Huggable [13] and CuddleBot [6], have been created which are capable of detecting touch pressure. For these robots, touch is detected using a sensing array made of pressure sensitive fabrics [3], [6] or quantum tunneling composites [13].

In the last two years, two specific datasets, the Corpus of Social Touch (CoST) [4] and the Human-Animal Affective Robot Touch (HAART) [6] dataset, have been made publicly available. These datasets were used for the 2015 Social Touch Challenge, which evaluated gesture classification approaches from several research groups [5]. Four research groups participated in this initial challenge [7]–[10], resulting in features being adapted from several other similar domains, such as image or audio processing.

In [7], sixteen time domain signals were produced by calculating a summary of the pressure (mean pressure, centroid of pressure, maximum pressure, location of maximum pressure and polar moment) and a summary of the region of contact (area, convex hull, major and minor axis lengths, eccentricity, orientation, diameter and Euler number). For each signal, various statistics (mean, median, variance, minimum and maximum value), energy, Hurst exponent and Hjorth complexity, as well as coefficients from an autoregressive model, were extracted.

In [8], histograms of the number of cells in no touch, touch and high pressure value ranges were calculated at the frame and gesture level. Additionally, the Binary Motion History was calculated by mapping which cells were touched with high pressure, and the Motion Statistical Distribution was calculated from the statistics (mean, median, minimum, maximum, area, first and third quartiles, interquartile range, variance, skewness and kurtosis) of each cell during a gesture. Spatial Multi-Scale Motion History Histograms were used to track the motion of pressure on the taxels at various time and spatial scales.

In [9], statistics were extracted from the gesture as a whole (number of frames, mean and maximum pressure, and variation around specific values) and for individual channels (mean and variation of each channel, and percentage of time over a fixed threshold). Additionally, spectral information was extracted from the mean pressure of each frame using a Fast Fourier transform and discrete cosine transform.

Finally, in [10], individual frames were summarized using geometric moments, as well as using an autoencoder to reduce the dimensionality of frames to 10 real-valued numbers. Using these summaries as observations, Gaussian hidden Markov models (GHMM) were trained for each gesture, and the likelihood of each model was used as features for a classifier. These features were also augmented with statistics of the gesture as a whole, as well as spectral information.
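To make this style of hand-engineered feature pipeline concrete, the sketch below computes a few per-frame pressure summaries and gesture-level statistics with NumPy. It is illustrative only: the contact threshold and the particular signals and statistics are assumptions, and the cited approaches use considerably richer feature sets.

```python
import numpy as np

def frame_summary(frame, contact_threshold=50):
    """Summaries of one 8x8 pressure frame: mean, max, centroid and contact area."""
    rows, cols = np.indices(frame.shape)
    total = frame.sum() + 1e-9                       # avoid division by zero on empty frames
    return {
        "mean_pressure": float(frame.mean()),
        "max_pressure": float(frame.max()),
        "centroid_row": float((rows * frame).sum() / total),
        "centroid_col": float((cols * frame).sum() / total),
        "contact_area": int((frame > contact_threshold).sum()),
    }

# One second of synthetic CoST-like data (135 Hz, 8x8 taxels, 10-bit values).
capture = np.random.randint(0, 1024, size=(135, 8, 8))
mean_signal = np.array([frame_summary(f)["mean_pressure"] for f in capture])

# Gesture-level statistics of one time-domain signal, in the spirit of [7].
gesture_features = [mean_signal.mean(), np.median(mean_signal),
                    mean_signal.var(), mean_signal.min(), mean_signal.max()]
```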
III. APPROACH

We explore three deep neural network architectures: CNNs, CNN-RNNs and Autoencoder-RNNs. The CNNs perform classification using frames in a short window of time. The CNN-RNNs extend the CNNs by incorporating a recurrent layer to utilize temporal information about the gesture. In a similar manner, the Autoencoder-RNN extends the autoencoders used in [10], which were used to reduce the dimensionality of each frame to 10 real-valued numbers, by using the code layer as input to a recurrent network.

A. Datasets

Networks were built and evaluated for the CoST and HAART datasets. The provided datasets were separated into training and test sets for the 2015 Social Touch Challenge [5]. The CoST dataset contains 14 touch gestures (grab, hit, massage, pat, pinch, poke, press, rub, scratch, slap, squeeze, stroke, tap, and tickle) collected from 31 subjects for a total of 5,203 gesture captures. The HAART dataset contains 7 gestures (constant, no touch, pat, rub, scratch, stroke, and tickle), collected from 10 subjects for a total of 829 gesture captures. Each dataset was captured using an 8x8 pressure sensor grid with 10-bit sensor values. The CoST dataset was sampled at 135 Hz and the HAART dataset was sampled at 54 Hz.

For CNNs and CNN-RNNs, the CoST data was split into windows with a window size of 45 samples (333 ms) and a hop size of 15 samples (111 ms). The HAART data was split into windows with a window size of 27 samples (500 ms) and a hop size of 9 samples (167 ms).

The CoST dataset consisted of gestures of varying durations (from 10 samples to 1,297 samples, or 1 sample window to 85 sample windows). In order to efficiently train the CNN-RNN model, we limited the number of windows in a training sample to 36, which results in some of the gesture captures being split into two or three training samples. The HAART dataset had consistent gesture durations, and each training sample consisted of a complete capture.

For each dataset, the Social Touch Challenge randomly split subjects into a training set and a test set. For the CoST dataset, 21 subjects were used for training and 10 subjects were used for testing, and for the HAART dataset, 7 subjects were used for training and 3 subjects were used for testing. We split the provided training data into a training and validation set, with 85% of the data used for training and 15% used for validation. As splitting the training and test data by subject results in a more difficult classification task, due to the test data being from novel subjects, we maintained this approach to allow for comparison with the results in [5].
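A minimal sketch of this windowing step follows, assuming the window and hop sizes above; how captures shorter than one window were handled is not stated in the paper, so that branch is an assumption.

```python
import numpy as np

def split_into_windows(gesture, window_size, hop_size):
    """Split one gesture capture of shape (n_frames, 8, 8) into overlapping windows."""
    n_frames = gesture.shape[0]
    if n_frames < window_size:
        # Not specified in the paper; short captures could instead be zero-padded.
        return np.empty((0, window_size) + gesture.shape[1:], dtype=gesture.dtype)
    starts = range(0, n_frames - window_size + 1, hop_size)
    return np.stack([gesture[s:s + window_size] for s in starts])

# CoST: 45-sample (333 ms) windows with a 15-sample (111 ms) hop.
capture = np.random.randint(0, 1024, size=(300, 8, 8))   # synthetic stand-in for one capture
windows = split_into_windows(capture, window_size=45, hop_size=15)
print(windows.shape)   # (18, 45, 8, 8)
```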
B. Convolutional Neural Network

The convolutional neural networks used consist of 3D convolutional layers and max pooling layers. Each convolutional layer extracts spatio-temporal features from the gesture window, similar in operation to [14]. The model, shown in Figure 2, consists of two sets of convolution and max-pooling layers, followed by a fully connected layer and a softmax classification layer. The convolution layers implement 3D convolutional kernels, described by five values (time, frame width, frame height, number of input channels, number of output channels), and the max-pool layers have a pool size and stride of 2 in each dimension. The outputs of the max-pooling layers and fully connected layer are activated with a rectified linear unit (ReLU) activation function.

To determine the hyperparameters (i.e., kernel dimensions), multiple CNNs were trained to classify individual windows. To determine the number and dimensionality of kernels for each layer, a coarse grid search over these parameters was performed iteratively for two layers. The kernel parameters were selected by flattening the output of the CNN layer being optimized, and connecting this output to a softmax layer, ignoring the intermediate fully connected layer. The network was trained to convergence, and the kernel parameters were selected from the model that resulted in the lowest validation cost without showing evidence of overfitting the training data.

The resulting CNN for the CoST dataset consisted of two layers: the first layer's kernel dimensions were (20, 2, 2, 1, 30), and the second layer's kernel dimensions were (5, 2, 2, 30, 10). The CNN for the HAART dataset consisted of two layers: the first layer's kernel dimensions were (10, 3, 3, 1, 30), and the second layer's kernel dimensions were (5, 2, 2, 30, 20).

The two convolutional / max-pooling layers were fully connected to a layer of 50 ReLU units, which is then connected to the softmax layer for final classification purposes. The number of units was selected by determining the number of units at which validation cost stopped improving.

Fig. 2. Convolutional Neural Network Architecture. Filter dimensions for each layer and window duration are arbitrarily selected.
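A minimal TensorFlow/Keras sketch of the CoST CNN described above. The kernel dimensions, 50-unit hidden layer and RMSProp settings come from this section and Section IV-A; padding, strides and other training details are not fully specified in the paper and are assumptions here, so parameter counts will not exactly match Table VIII.

```python
import tensorflow as tf

NUM_CLASSES = 14   # CoST gestures
WINDOW = 45        # frames per CoST window

cnn = tf.keras.Sequential([
    # Kernels of (time, width, height, in, out) = (20, 2, 2, 1, 30) and (5, 2, 2, 30, 10);
    # pooling with size and stride 2, ReLU applied to the pooled outputs.
    tf.keras.layers.Conv3D(30, kernel_size=(20, 2, 2), input_shape=(WINDOW, 8, 8, 1)),
    tf.keras.layers.MaxPooling3D(pool_size=2, strides=2),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Conv3D(10, kernel_size=(5, 2, 2)),
    tf.keras.layers.MaxPooling3D(pool_size=2, strides=2),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
cnn.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
            loss="sparse_categorical_crossentropy", metrics=["accuracy"])
cnn.summary()
```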
C. CNN-RNN

Recurrent layers can be employed to adapt CNNs, which perform classification on a single time window, for use with sequential data. To create CNN-RNNs for sequential classification, the architectures developed in Section III-B can be extended by flattening the output of the final pooling layer in the CNN, which is then connected to the input of a recurrent layer, followed by an optional fully connected layer and a softmax layer. While the output of the trained CNNs could be computed for each time window and used as input for a separate RNN, the complete network can be trained end-to-end, which may result in more suitable features being extracted for the sequential data.

We determined the number of units used in the recurrent layer and the number of hidden units by performing a grid search over these two hyperparameters, and selecting the parameters which produced the lowest validation cost while avoiding overfitting the training set, similar to determining the hyperparameters for the CNNs. The hyperparameters for the CNN portion of the CNN-RNN architecture were those found to be optimal for the CNN architecture. Searching over the range of 10 to 200 units for each parameter resulted in selecting 50 units for each of these layers.
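A sketch of the CNN-RNN, again in Keras: the convolutional stack from the CNN above is applied to each window of a (variable-length) sequence and feeds a 50-unit recurrent layer and a 50-unit hidden layer, with a per-window softmax so a prediction is available at every step. The use of a plain (non-gated) recurrent cell and the per-window output head are assumptions consistent with, but not spelled out in, the text.

```python
import tensorflow as tf

NUM_CLASSES = 14
WINDOW = 45

# Per-window feature extractor: the CoST convolution / pooling stack without its classifier head.
window_features = tf.keras.Sequential([
    tf.keras.layers.Conv3D(30, kernel_size=(20, 2, 2), input_shape=(WINDOW, 8, 8, 1)),
    tf.keras.layers.MaxPooling3D(pool_size=2, strides=2),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Conv3D(10, kernel_size=(5, 2, 2)),
    tf.keras.layers.MaxPooling3D(pool_size=2, strides=2),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Flatten(),
])

# Sequence model: input is (n_windows, WINDOW, 8, 8, 1); output is one class
# distribution per window, so the whole network can be trained end-to-end.
cnn_rnn = tf.keras.Sequential([
    tf.keras.layers.TimeDistributed(window_features, input_shape=(None, WINDOW, 8, 8, 1)),
    tf.keras.layers.SimpleRNN(50, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(50, activation="relu")),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")),
])
cnn_rnn.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
                loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```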

D. Autoencoder-RNN

The final architecture extends the autoencoder used in [10] by connecting the output of the code layer to an RNN layer, as shown in Figure 3. In [10], an autoencoder [15] was used to reduce the size of individual frames from 64 units to 10 linear units. This output was used as observations for hidden Markov models (HMM) to estimate the likelihood that a sequence of frames belonged to each type of gesture. In this investigation, the HMM was replaced with a recurrent layer. The resulting architecture has the advantage of allowing the autoencoder portion to be pre-trained to reconstruct individual frames (as in [10]) prior to training the classifier. Additionally, training the network for classification refines the encoding portion of the network so that the encoded frame provides better information for classification, as opposed to simply performing data compression.

The autoencoder portion of the architecture matches that used in [10]: the encoding layers consist of four fully connected layers with 200, 100, 50 and 25 units, followed by a code layer of 10 units. The decoding layer is simply a mirror of the encoding layer: 25, 50, 100, and 200 units. In [10], the code layer used a linear activation function, while the encoding and decoding layers consisted of sigmoid units. Here, rectified linear units were used instead of sigmoid units, as these types of units learn faster than sigmoid units.

The number of units in the RNN layer and fully connected output layer was determined similarly to the CNN-RNN architecture. Both the RNN and fully connected layers used 25 units; a larger number of units typically resulted in a model which would output a single class regardless of input, and would show little or no change in validation cost during training.

Fig. 3. Autoencoder-RNN Architecture.
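A sketch of the Autoencoder-RNN under the layer sizes above: the encoder can be pre-trained on individual flattened frames and then reused, per frame, ahead of a 25-unit recurrent layer and a 25-unit hidden layer. The reconstruction loss, the optimizer for pre-training, the decoder's output activation and the per-step softmax head are assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 14   # CoST; use 7 for HAART

# Frame encoder: 64 -> 200 -> 100 -> 50 -> 25 -> 10 (linear code), ReLU elsewhere.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(200, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(10, activation="linear"),        # code layer
])

# Mirrored decoder, used only to pre-train the encoder to reconstruct frames.
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dense(64, activation="linear"),         # reconstructed 8x8 frame (flattened)
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(frames, frames, ...)   # pre-train on individual flattened frames

# Classifier: encode each frame of a capture, then 25-unit recurrent and hidden layers.
classifier = tf.keras.Sequential([
    tf.keras.layers.TimeDistributed(encoder, input_shape=(None, 64)),
    tf.keras.layers.SimpleRNN(25, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(25, activation="relu")),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")),
])
classifier.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
                   loss="sparse_categorical_crossentropy")
```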

IV. RESULTS

The quality of each model is measured by its accuracy as well as its memory requirements. The memory requirements of a model are critical for determining if it can be used in the context of a robotic material: models with sufficiently low memory requirements may be implemented on inexpensive microcontrollers for local processing, while larger models will require processing on a larger computing sink, increasing overall communication and central computation requirements.

A. Classification Results

Each model was trained on both the CoST and HAART datasets, using the model parameters determined in Section III. Models were implemented using TensorFlow [16], and trained using the RMSProp algorithm with a learning rate of 0.0001. The full training sets (i.e., the combined training and validation sets used in Section III) were used to train each model, and models were trained until training cost converged. The models were evaluated using the test sets provided from [5].

TABLE I
Summary of results for the three models and [5].

Dataset   Model                      Test F1   Test Accuracy
CoST      CNN                        0.4125    42.34%
CoST      CNN-RNN                    0.4724    52.86%
CoST      Autoencoder-RNN            0.3253    33.52%
CoST      Balli Altuglu et al. [7]   -         26.00%
CoST      Gaus et al. [8]            -         58.70%
CoST      Hughes et al. [10]         -         47.20%
CoST      Ta et al. [9]              -         61.30%
HAART     CNN                        0.5538    56.10%
HAART     CNN-RNN                    0.6028    61.35%
HAART     Autoencoder-RNN            0.5193    55.78%
HAART     Balli Altuglu et al. [7]   -         61.00%
HAART     Gaus et al. [8]            -         66.50%
HAART     Hughes et al. [10]         -         67.70%
HAART     Ta et al. [9]              -         70.90%

Each gesture produces multiple measurement windows, and the two RNN models provide a prediction of the gesture for each measurement window. To convert this to a single accuracy, the prediction of the class for the gesture capture as a whole was set as the most common prediction (i.e., majority voting) over the windows in the gesture capture. For the CNN model, we report the accuracy from the predictions of each window, rather than performing majority voting over the gesture as a whole. This increases the number of test cases, but ensures that the results reflect the lack of temporal context this model would have when evaluating gestures in real time.
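A minimal sketch of the majority-voting step, assuming per-window softmax outputs; the tie-breaking rule is not specified in the paper and is an assumption here.

```python
import numpy as np

def gesture_prediction(window_probs):
    """Majority vote over per-window predictions for one gesture capture.

    window_probs: array of shape (n_windows, n_classes) of softmax outputs.
    Ties are broken by the lowest class index (an assumption).
    """
    window_labels = window_probs.argmax(axis=1)
    counts = np.bincount(window_labels, minlength=window_probs.shape[1])
    return int(counts.argmax())

# Example: 9 windows of a 14-class CoST capture.
probs = np.random.dirichlet(np.ones(14), size=9)
print(gesture_prediction(probs))
```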
Table I summarizes the final classification accuracy and F1 scores on the test data for the three models. For comparison, the highest test accuracy reported by each of the four participating groups in [5] is also provided.

Comparing the models explored in this paper, we note that, while the CNN model performed relatively well, incorporating the recurrent layer into the model provides an improvement in overall accuracy of 5% to 10%. This demonstrates that, while gestures can be classified to some extent from a short sample, the temporal aspect of touch gestures is very important for identification purposes.

When compared to the four models in [5], our models typically do not perform as well, though the final classification accuracy of the CNN-RNN model compares well, and outperforms [7] for both datasets. However, the four models consider the gesture as a whole and can extract features based on all data points in the gesture capture, while our approaches are limited to data in a short window of time and any prior information gathered. Additionally, for the CoST data, the duration of each capture is an informative feature which was used by others [9], [10], and is unavailable to our approach. As our approach provides gesture predictions at a rate of 6 to 9 times per second, the reduction in accuracy is a reasonable trade-off for continual, real-time classification.

To investigate the results in more detail, confusion matrices were generated for each model and dataset. The confusion matrices for the HAART dataset using the CNN, CNN-RNN and Autoencoder-RNN models are given in Tables II, III, and IV, respectively. For all three models, the No Touch, Constant and Scratch gestures are generally classified correctly. Tickle and Scratch are commonly confused, and to a lesser extent, Pat and Stroke. These specific confusions were also seen in the models summarized in [5], were specifically noted in [8], and are attributed to the assumed similarity between the gestures.

The confusion matrices for the CoST dataset using the CNN, CNN-RNN and Autoencoder-RNN architectures are given in Tables V, VI, and VII. As with the HAART results, the confused gestures (e.g., slap and pat) are similar to those found in the Social Touch Challenge. The Autoencoder-RNN, which resulted in poor classification accuracy, demonstrates a bias towards the Tickle gesture, though the cause of this is unknown.

TABLE II
Confusion matrix (%) for the HAART dataset using the CNN architecture. Rows are true classes, columns are predicted classes.

True Class     (A)   (B)   (C)   (D)   (E)   (F)   (G)
(A) No Touch   92.9  0.0   0.6   1.7   0.0   2.6   2.1
(B) Constant   2.9   85.8  5.4   4.8   0.1   1.0   0.1
(C) Pat        6.1   3.1   43.9  13.3  1.7   25.3  6.7
(D) Rub        0.3   6.3   14.2  42.9  16.5  13.8  6.1
(E) Scratch    0.0   0.6   6.1   7.2   60.1  8.3   17.7
(F) Stroke     13.0  0.9   9.9   26.4  5.0   37.3  7.4
(G) Tickle     0.5   0.8   7.5   3.4   52.8  4.3   30.7

TABLE III
Confusion matrix (%) for the HAART dataset using the CNN-RNN architecture. Rows are true classes, columns are predicted classes.

True Class     (A)   (B)   (C)   (D)   (E)   (F)   (G)
(A) No Touch   97.2  0.0   0.0   0.0   0.0   2.8   0.0
(B) Constant   2.9   85.7  0.0   5.7   2.9   2.9   0.0
(C) Pat        2.8   0.0   41.7  13.9  16.7  5.6   19.4
(D) Rub        0.0   16.7  11.1  27.8  27.8  13.9  2.8
(E) Scratch    0.0   0.0   5.6   5.6   77.8  0.0   11.1
(F) Stroke     2.8   5.6   8.3   16.7  2.8   61.1  2.8
(G) Tickle     0.0   0.0   5.6   0.0   52.8  2.8   38.9

TABLE IV
Confusion matrix (%) for the HAART dataset using the Autoencoder-RNN architecture. Rows are true classes, columns are predicted classes.

True Class     (A)   (B)   (C)   (D)   (E)   (F)   (G)
(A) No Touch   91.7  0.0   0.0   2.8   0.0   5.6   0.0
(B) Constant   2.9   97.1  0.0   0.0   0.0   0.0   0.0
(C) Pat        2.8   0.0   47.2  8.3   19.4  19.4  2.8
(D) Rub        0.0   5.6   0.0   5.6   77.8  0.0   11.1
(E) Scratch    0.0   0.0   11.1  2.8   86.1  0.0   0.0
(F) Stroke     16.7  0.0   8.3   5.6   13.9  44.4  11.1
(G) Tickle     0.0   0.0   13.9  0.0   75.0  0.0   11.1

B. Model Size

The second aspect of interest is the overall memory footprint of each model. Models which can be implemented on small, inexpensive microcontrollers allow for offloading the task of classification onto the material itself, allowing only the predicted gesture to be communicated to an external computing sink. Assuming that models are trained off-line on a larger (e.g., desktop) computer, memory requirements can be divided into model parameters (i.e., weights and biases) and neuron activation levels. Measurements and neuron activation levels need to be stored in RAM, which is typically very limited on microcontrollers (e.g., 8 kB on an XMega128 series), while model parameters can be stored in program memory (FLASH), which is generally more plentiful, though still limited (e.g., 256 kB on an XMega128).

The memory requirements can be determined directly from each model: the number of values stored in RAM is the number of data points per sample window plus the total number of neurons in the model, and the number of values stored in FLASH is the number of weights and biases in the model. We summarize these two values for each model in Table VIII. For the CNN and CNN-RNN models, convolution and pooling can be performed jointly, reducing the need to store the full output of each convolutional layer. We report the number of sensors and neurons for both implementations, including full convolutional outputs in parentheses.

For the two sets of models, the Autoencoder-RNN requires the largest amount of FLASH memory, primarily due to the large number of full connections. In contrast, the convolutional kernels in the CNN and CNN-RNN models provide a large amount of weight sharing, which greatly reduces the memory requirements compared to fully connected layers. Consequently, the FLASH footprints of the CNN and CNN-RNN models are much smaller. Including the recurrent layer also requires only a minimal amount of additional memory, namely the additional full connection from the CNN output to the recurrent layer input and the neurons in the recurrent layer. This increase in memory requirement is a small trade-off for the large increase in overall accuracy.

The total memory requirement depends on the number of bits required to represent a single value. From the values given in Table VIII, the maximum amount of FLASH required (Autoencoder-RNN model) would be 80 kB or 160 kB for 16-bit and 32-bit values, respectively, and the maximum amount of RAM required (CNN-RNN model) would be 13 kB (16-bit) or 26 kB (32-bit). Based on these values, a 16-bit implementation of any of these models could fit on an inexpensive ($5) microcontroller. For example, the Atmel SAM D21 series microcontroller contains an ARM Cortex-M0+ CPU running at up to 48 MHz, and has up to 256 kB Flash and 32 kB SRAM, which is sufficient for the model requirements given.
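The arithmetic behind these figures is a straightforward conversion from value counts to bytes. The sketch below reproduces it, assuming fixed-point values of 16 or 32 bits; the value counts are taken from Table VIII as reconstructed in this paper.

```python
# Convert the value counts of Table VIII into storage footprints for 16-bit and
# 32-bit fixed-point implementations. FLASH holds weights and biases; RAM holds
# the sample window and neuron activations (counts without full conv outputs).
TABLE_VIII = {
    ("CoST",  "CNN"):             (15_714, 6_414),
    ("CoST",  "CNN-RNN"):         (18_264, 6_464),
    ("CoST",  "Autoencoder-RNN"): (39_949, 488),
    ("HAART", "CNN"):             (17_777, 4_255),
    ("HAART", "CNN-RNN"):         (20_327, 4_305),
    ("HAART", "Autoencoder-RNN"): (39_767, 481),
}

def footprint_kb(n_values, bits):
    """Kilobytes needed to store n_values words of the given bit width."""
    return n_values * (bits // 8) / 1024.0

for (dataset, model), (flash, ram) in TABLE_VIII.items():
    print(f"{dataset:5s} {model:16s} "
          f"FLASH {footprint_kb(flash, 16):5.1f}/{footprint_kb(flash, 32):6.1f} kB  "
          f"RAM {footprint_kb(ram, 16):5.1f}/{footprint_kb(ram, 32):5.1f} kB  (16/32-bit)")
```

The largest FLASH footprint (Autoencoder-RNN) comes to roughly 78 kB at 16 bits, and the largest RAM footprint (CNN-RNN) to roughly 12.6 kB, consistent with the approximately 80 kB and 13 kB quoted above.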
menting the data by collecting additional gesture samples,

TABLE V
Confusion matrix (%) for the CoST dataset using the CNN architecture. Rows are actual classes, columns are predicted classes.

Grab Hit Massage Pat Pinch Poke Press Rub Scratch Slap Squeeze Stroke Tap Tickle
Grab 60.2 2.8 8.2 0.6 4.1 0.6 2.5 1.0 1.1 0.1 14.0 1.7 0.3 2.7
Hit 3.8 22.8 1.0 10.3 14.8 9.5 2.8 1.5 0.5 7.0 4.8 5.8 5.0 10.5
Massage 5.1 0.7 46.3 0.5 6.7 0.2 2.4 13.4 2.1 0.5 6.1 4.4 0.1 11.6
Pat 2.4 6.2 2.1 45.9 1.8 1.3 7.3 2.8 1.3 6.0 1.6 3.4 10.2 7.8
Pinch 3.1 4.0 5.8 1.9 60.0 2.7 2.4 1.3 0.2 0.5 9.8 3.2 0.3 4.9
Poke 1.9 6.9 2.5 3.2 21.9 31.6 3.8 1.3 1.2 0.7 1.2 2.7 11.4 9.6
Press 6.4 2.7 3.3 2.7 5.8 6.9 50.0 5.0 0.3 1.2 10.8 2.0 0.7 2.2
Rub 1.1 1.5 11.9 1.4 2.1 0.5 4.2 39.2 4.2 0.6 1.5 14.0 0.4 17.5
Scratch 0.7 1.5 6.2 2.5 1.5 0.9 0.7 14.6 17.9 0.7 1.2 5.3 1.3 45.0
Slap 3.4 22.0 0.6 12.4 5.3 1.2 5.6 3.1 1.5 19.2 5.3 4.3 2.2 14.0
Squeeze 38.5 1.7 15.4 0.6 15.4 0.4 1.0 0.8 0.1 0.2 21.2 2.0 0.1 2.6
Stroke 1.9 2.7 5.4 1.3 4.5 0.1 2.1 28.8 4.0 0.7 2.1 25.1 0.5 21.0
Tap 1.5 6.9 1.8 21.5 7.1 2.7 6.0 1.5 1.6 3.8 3.6 3.5 21.2 17.2
Tickle 0.5 1.0 5.4 2.0 3.7 1.0 0.2 4.5 5.5 0.4 0.7 3.7 1.0 70.5

TABLE VI
Confusion matrix (%) for the CoST dataset using the CNN-RNN architecture. Rows are actual classes, columns are predicted classes.

Grab Hit Massage Pat Pinch Poke Press Rub Scratch Slap Squeeze Stroke Tap Tickle
Grab 73.0 0.0 1.6 0.8 0.8 0.0 0.0 0.0 3.3 0.0 17.2 3.3 0.0 0.0
Hit 0.0 1.7 0.0 94.2 1.7 0.0 0.0 0.0 0.0 0.8 0.8 0.8 0.0 0.0
Massage 0.0 0.0 67.6 2.8 1.1 0.0 0.0 2.8 7.8 0.0 2.2 11.7 0.0 3.9
Pat 0.0 0.0 0.0 91.9 0.0 0.0 0.0 0.0 0.0 2.4 0.0 2.4 0.0 0.8
Pinch 1.7 0.0 3.3 4.2 70.0 0.8 0.8 0.0 0.0 0.0 18.3 0.8 0.0 0.0
Poke 0.0 0.0 0.0 45.0 10.0 34.2 0.0 0.0 0.0 0.0 4.2 0.0 5.0 1.7
Press 14.2 0.0 0.0 0.8 4.2 4.2 63.3 0.0 0.0 0.0 10.8 2.5 0.0 0.0
Rub 0.0 0.0 9.2 0.0 0.0 0.0 3.8 22.3 13.8 0.0 2.3 42.3 0.0 6.2
Scratch 0.0 0.0 10.9 5.1 1.4 0.0 0.0 10.1 31.9 0.0 1.4 16.7 0.0 22.5
Slap 0.0 0.0 0.0 85.8 0.8 0.0 0.0 0.0 0.8 10.0 0.8 1.7 0.0 0.0
Squeeze 50.4 0.0 0.8 0.0 10.1 0.0 0.0 0.0 0.0 0.0 37.8 0.8 0.0 0.0
Stroke 0.0 0.0 0.8 0.8 0.8 0.0 0.0 2.5 0.0 0.8 0.0 88.4 0.0 5.8
Tap 0.0 0.0 0.0 82.5 0.0 0.8 0.8 0.0 0.8 0.0 0.0 0.0 15.0 0.0
Tickle 0.0 0.0 8.6 4.6 1.3 0.0 0.0 1.3 0.7 0.0 0.7 1.3 2.0 75.5

TABLE VII
Confusion matrix (%) for the CoST dataset using the Autoencoder-RNN architecture. Rows are actual classes, columns are predicted classes.

Grab Hit Massage Pat Pinch Poke Press Rub Scratch Slap Squeeze Stroke Tap Tickle
Grab 29.2 0.0 18.2 0.0 0.0 0.7 0.0 0.0 0.0 0.0 47.4 0.7 0.0 3.6
Hit 0.0 7.5 0.8 0.0 4.2 21.6 0.0 0.0 0.0 2.5 0.0 0.8 39.2 23.3
Massage 2.3 1.3 57.4 0.0 3.5 1.3 0.0 11.9 1.6 0.3 3.5 6.1 0.6 10.0
Pat 0.0 5.8 0.8 5.8 3.3 2.5 0.0 0.8 0.0 9.2 0.0 5.8 25.8 40.0
Pinch 0.0 0.8 26.8 0.0 21.1 17.9 0.0 0.0 2.4 0.8 4.9 4.1 0.8 20.3
Poke 0.0 7.4 0.0 0.0 9.9 59.5 0.0 0.0 0.0 0.0 0.0 0.0 4.1 19.0
Press 9.0 5.7 9.8 1.6 1.6 4.1 39.3 9.8 4.1 1.6 5.7 4.1 1.6 1.6
Rub 2.1 1.5 18.6 0.5 0.5 1.5 13.9 22.2 2.6 4.1 0.0 10.3 3.6 18.6
Scratch 0.5 2.4 9.7 3.4 2.4 3.4 1.9 5.8 13.1 1.0 0.5 10.2 4.9 40.8
Slap 0.0 5.8 0.0 0.0 3.3 10.0 0.0 0.0 0.0 5.8 0.0 0.0 59.2 15.8
Squeeze 19.7 0.7 23.5 0.0 3.0 1.5 0.0 0.0 0.0 0.0 47.7 0.0 2.3 2.3
Stroke 0.0 0.0 6.9 0.0 4.9 1.4 0.0 11.1 2.1 2.1 0.0 37.5 2.8 31.3
Tap 0.0 4.9 0.8 4.1 5.8 8.3 0.0 0.0 0.0 5.8 0.0 0.8 29.8 39.7
Tickle 0.0 0.4 8.4 0.0 5.9 5.5 0.0 0.0 2.5 0.0 0.0 5.9 3.8 67.6

V. DISCUSSION

While deep learning approaches provide state-of-the-art results in many fields, our models do not perform better than the results in [9]. There may be multiple reasons for this lack of performance. First, we wish to maximize model accuracy while ensuring low memory requirements. Using additional layers, increasing the number of output channels, and decreasing the time dimension of each kernel in the CNN may improve accuracy, but would increase memory requirements, making the model unsuitable for in-material processing. Second, RNN layers suffer from an inability to capture long-term dependencies in a sequence; using long short-term memory (LSTM, [17]) layers could improve results, though at the expense of more parameters. A limiting factor to implementing these, however, is the possibility of overfitting due to a large number of model parameters. Augmenting the data by collecting additional gesture samples, adding noise, or rotating or translating samples may mitigate this issue and provide improved results.
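A minimal sketch of such frame-level augmentation, assuming additive Gaussian noise and 90-degree rotations and flips of the 8x8 taxel grid; the paper proposes augmentation only as a possible remedy and does not evaluate any specific scheme.

```python
import numpy as np

def augment(capture, rng):
    """Return a jittered copy of a gesture capture of shape (n_frames, 8, 8)."""
    noisy = capture + rng.normal(0.0, 5.0, size=capture.shape)    # simulated sensor noise
    noisy = np.clip(noisy, 0, 1023)                               # stay in the 10-bit range
    rotated = np.rot90(noisy, k=rng.integers(0, 4), axes=(1, 2))  # rotate the taxel grid
    if rng.integers(0, 2):
        rotated = np.flip(rotated, axis=2)                        # mirror left/right
    return rotated

rng = np.random.default_rng(0)
capture = rng.integers(0, 1024, size=(135, 8, 8)).astype(float)
augmented = augment(capture, rng)
print(augmented.shape)   # (135, 8, 8)
```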
Splitting the data based on subject for training and test sets introduces subject dependence in the classifier. Model performance is expected to be reduced in this scenario, when compared to splitting the data based on individual gestures. In many applications, such as with assistive or companion robots, we expect the same user to be interacting with the robot, rather than new users. In these applications, we would expect to see an increase in model performance, as models can learn gestures specific to one or a few subjects, rather than needing to generalize to a large population using gesture information from a few individuals.

VI. CONCLUSION

We present three deep learning models for the task of identifying affective touch for social robots: a CNN model,

a CNN-RNN model and an Autoencoder-RNN model. These models provide a level of accuracy similar to the models presented in the 2015 Social Touch Challenge. While none of the models outperform the best models in the Social Touch Challenge in terms of overall accuracy, the deep learning models explored here provide a touch gesture prediction in real time at a rate of 6 to 9 times per second, while the models in the Social Touch Challenge only provide predictions once the entire gesture is captured. Additionally, our models automatically extract meaningful features from training data, rather than requiring hand-engineered features to be designed and evaluated.

The models we consider can provide high prediction rates based on the hop size of the sampled windows. The datasets used were collected for social touch classification, where applications involve determining the emotion conveyed by a human subject and rapid response is not critical. In other applications, a quick response to a touch gesture may be important and justify a lower classification accuracy, such as with assistive robots or providing tactile-based commands in industrial settings.

The memory requirements for these models were also evaluated to determine if deep learning approaches are suitable for gesture recognition in a Robotic Materials context. We have determined that the models could be implemented on inexpensive microcontrollers, though the memory consumption would be a significant portion of the available memory, demonstrating that deep learning approaches are promising for Robotic Materials. The memory requirements for the models presented in the Social Touch Challenge are not known, so we are unable to compare the requirements of our models with those. However, as the entire gesture is needed to extract the features, we can estimate the RAM needed to store an entire gesture to be 55 kB, implying that the approaches used in the Social Touch Challenge may not be suitable for a Robotic Material approach.

TABLE VIII
Memory requirements (number of stored values) for each model. RAM values in parentheses include full convolutional outputs.

Dataset   Model              FLASH    RAM
CoST      CNN                15,714   6,414 (44,994)
CoST      CNN-RNN            18,264   6,464 (45,044)
CoST      Autoencoder-RNN    39,949   488
HAART     CNN                17,777   4,255 (24,095)
HAART     CNN-RNN            20,327   4,305 (24,145)
HAART     Autoencoder-RNN    39,767   481

REFERENCES

[1] R. S. Dahiya, P. Mittendorfer, M. Valle, G. Cheng, and V. J. Lumelsky, "Directions toward effective utilization of tactile skin: A review," IEEE Sensors Journal, vol. 13, no. 11, pp. 4121–4138, 2013.
[2] S. Yohanan and K. E. MacLean, "The role of affective touch in human-robot interaction: Human intent and expectations in touching the haptic creature," International Journal of Social Robotics, vol. 4, no. 2, pp. 163–180, 2012.
[3] A. Flagg and K. MacLean, "Affective touch gesture recognition for a furry zoomorphic machine," in Proceedings of the 7th International Conference on Tangible, Embedded and Embodied Interaction. ACM, 2013, pp. 25–32.
[4] M. M. Jung, R. Poppe, M. Poel, and D. K. J. Heylen, "Touching the void - introducing CoST: Corpus of social touch," in Proceedings of the 16th International Conference on Multimodal Interaction. ACM, 2014, pp. 120–127.
[5] M. M. Jung, X. L. Cang, M. Poel, and K. E. MacLean, "Touch challenge '15: Recognizing social touch gestures," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 387–390.
[6] X. L. Cang, P. Bucci, A. Strang, J. Allen, K. MacLean, and H. Liu, "Different strokes and different folks: Economical dynamic surface sensing and affect-related touch recognition," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 147–154.
[7] T. Balli Altuglu and K. Altun, "Recognizing touch gestures for social human-robot interaction," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 407–413.
[8] Y. F. A. Gaus, T. Olugbade, A. Jan, R. Qin, J. Liu, F. Zhang, H. Meng, and N. Bianchi-Berthouze, "Social touch gesture recognition using random forest and boosting on distinct feature sets," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 399–406.
[9] V.-C. Ta, W. Johal, M. Portaz, E. Castelli, and D. Vaufreydaz, "The Grenoble system for the social touch challenge at ICMI 2015," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, pp. 391–398.
[10] D. Hughes, N. Farrow, H. Profita, and N. Correll, "Detecting and identifying tactile gestures using deep autoencoders, geometric moments and gesture level features," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 415–422.
[11] M. A. McEvoy and N. Correll, "Materials that couple sensing, actuation, computation and communication," Science, vol. 347, no. 6228, p. 1261689, March 2015.
[12] D. Hughes and N. Correll, "Distributed machine learning in materials that couple sensing, actuation, computation and communication," arXiv preprint arXiv:1606.03508, June 2016.
[13] W. D. Stiehl, J. Liberman, C. Breazeal, L. Basel, L. Lalla, and M. Wolf, "Design of a therapeutic robotic companion for relational, affective touch," in Proceedings of the International Workshop on Robot and Human Interactive Communication (ROMAN), 2005, pp. 408–415.
[14] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human activity recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[15] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504–507, 2006.
[16] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[17] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

