PerceptionNet: A Deep Convolutional Neural Network for Late Sensor Fusion
We develop a Deep Convolutional Neural Network that applies a late 2D convolution to multimodal time-series sensor data, in order to extract more efficient features than those engineered by humans or learned by Early Fusion (EF) approaches. To test our approach, we use two publicly available Human Activity Recognition (HAR) datasets and show that the proposed model surpasses the performance of other state-of-the-art deep learning HAR methods.
- Layer 1: 48 1D convolutional filters with a size of (1,15), i.e., W1 has the shape (1, 15, 1, 48). This is followed by a ReLU activation function, a (1,2) strided 1D max-pooling operation and a dropout probability equal to 0.4.
- Layer 2: 96 1D convolutional filters with a size of (1,15), i.e., W2 has the shape (1, 15, 48, 96). This is followed by a ReLU activation function, a (1,2) strided 1D max-pooling operation and a dropout probability equal to 0.4.
- Layer 3: 96 2D convolutional filters with a size of (3,15) and a stride of (3,1), i.e., W3 has the shape (3, 15, 96, 96). This is followed by a ReLU activation function, a global average-pooling operation and a dropout probability equal to 0.4.
- Layer 4: 6 output units (one per activity class), i.e., W4 has the shape (96, 6), followed by a softmax activation function.
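The layer list above can be checked with a short shape and weight-count calculation. This is a minimal sketch under assumptions not stated in the text: the input is taken to be 6 sensor axes by 128 time steps with one channel, every convolution and pooling step is assumed to use "same" padding (output length = ceil(input / stride)), and biases are ignored. The helper name `same_out` is hypothetical.

```python
import math


def same_out(size, stride):
    """Output length of a 'same'-padded convolution or pooling step (assumed padding)."""
    return math.ceil(size / stride)


# Assumed input: 6 sensor axes (rows) x 128 time steps (columns), 1 channel.
height, width, channels = 6, 128, 1

# (name, kernel (h, w), conv stride (h, w), filters, pooling stride (h, w) or None)
layers = [
    ("layer1", (1, 15), (1, 1), 48, (1, 2)),
    ("layer2", (1, 15), (1, 1), 96, (1, 2)),
    ("layer3", (3, 15), (3, 1), 96, None),  # followed by global average pooling
]

shapes = {}
weight_counts = {}
for name, (kh, kw), (sh, sw), filters, pool in layers:
    # Weight tensor shape is (kh, kw, in_channels, filters), as in W1..W3.
    weight_counts[name] = kh * kw * channels * filters
    height, width = same_out(height, sh), same_out(width, sw)
    if pool is not None:
        height, width = same_out(height, pool[0]), same_out(width, pool[1])
    channels = filters
    shapes[name] = (height, width, channels)

# Global average pooling collapses each feature map to one value.
features = channels
num_classes = 6  # six activities, matching W4 of shape (96, 6)
weight_counts["layer4"] = features * num_classes
total_weights = sum(weight_counts.values())

for name in ("layer1", "layer2", "layer3"):
    print(name, "output shape:", shapes[name], "weights:", weight_counts[name])
print("layer4 weights:", weight_counts["layer4"], "total:", total_weights)
```

Under these assumptions the first two layers only convolve along time, so the six sensor rows stay separate until the (3,15) kernel with a (3,1) stride in Layer 3 merges each group of three axes, which is the late fusion step.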
We used the UCI HAR dataset, which consists of tri-axial accelerometer and tri-axial gyroscope data collected by a waist-mounted smartphone (Samsung Galaxy S II). A group of 30 volunteers, with ages ranging from 19 to 48 years, performed six daily activities (standing, sitting, laying down, walking, walking downstairs and walking upstairs). The sensors recorded 3-axial linear acceleration and 3-axial angular velocity at a sampling rate of 50 Hz, and the signals were segmented into time windows of 128 values (2.56 sec) with a 50% overlap. The resulting dataset contains 10,299 samples and is partitioned by volunteer into two sets: 70% of the volunteers (21 volunteers) were selected to generate the training data (7,352 samples) and the remaining 30% (9 volunteers) the test data (2,947 samples).
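The segmentation described above (128-value windows, 50% overlap, 50 Hz) can be illustrated with a small sketch. The function `sliding_windows` and the toy signal are hypothetical and stand in for the dataset's own preprocessing, which is not reproduced here.

```python
def sliding_windows(signal, window=128, overlap=0.5):
    """Split a 1D sequence into fixed-length windows with fractional overlap."""
    step = int(window * (1 - overlap))  # 64 samples for a 50% overlap
    last_start = len(signal) - window
    return [signal[start:start + window] for start in range(0, last_start + 1, step)]


SAMPLING_RATE_HZ = 50
window_seconds = 128 / SAMPLING_RATE_HZ  # duration of one window in seconds

# Hypothetical signal of 320 samples (6.4 s at 50 Hz), used only for illustration.
signal = list(range(320))
windows = sliding_windows(signal)

# Consistency check on the sample counts reported in the text.
train_samples, test_samples = 7352, 2947
total_samples = train_samples + test_samples

print(len(windows), "windows of", len(windows[0]), "samples;", window_seconds, "s each")
print("total samples:", total_samples)
```

Because consecutive windows share half of their samples, each new window starts 64 samples (1.28 sec) after the previous one.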
Fig. 2 shows that, when the t-SNE algorithm is applied to the output of the last convolutional operation, most of the instances of the six activity classes are easily categorized.
We achieved an accuracy of 0.9725 on the test data. Fig. 3 presents the confusion matrix, which reveals the difficulty of distinguishing standing from sitting and vice versa. This misclassification is a common problem, and an extra IMU on the thigh would be a solution.
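As a reminder of how accuracy and a confusion matrix relate, the following minimal sketch builds both from label lists. The labels and predictions are toy values chosen to mimic a sitting/standing confusion; they are not the results reported here, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy data (hypothetical): sitting and standing are confused with each other.
labels = ["sitting", "standing", "walking"]
y_true = ["sitting", "sitting", "standing", "standing", "walking", "walking"]
y_pred = ["sitting", "standing", "standing", "sitting", "walking", "walking"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(cm)
for label, row in zip(labels, cm):
    print(label, row)
print("accuracy:", round(acc, 4))
```

Off-diagonal counts in the sitting and standing rows are the kind of errors the confusion matrix exposes, even when the overall accuracy is high.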
This work is funded by the European Commission under project TRILLION, grant number H2020-FCT-2014, REA grant agreement n° . Moreover, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan-X GPU used for this research.
1 C. A. Ronao & S.-B. Cho, “Human activity recognition with smartphone sensors using deep learning neural networks,” Expert Systems with Applications, vol. 59, pp. 235-244, Oct. 2016.
2 D. Anguita, A. Ghio, L. Oneto, X. Parra & J. L. Reyes-Ortiz, “A Public Domain Dataset for Human Activity Recognition Using Smartphones,” in 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013, April 2013.
3 W. Jiang & Z. Yin, “Human Activity Recognition using Wearable Sensors by Deep Convolutional Neural Networks,” in Proceedings of the 23rd ACM international conference on Multimedia, pp. 1307-1310, 2015.