Towards good practices for very deep two-stream ConvNets (CoRR)
"... Deep convolutional networks have achieved great suc-cess for object recognition in still images. However, for ac-tion recognition in videos, the improvement of deep convo-lutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the curre ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement from deep convolutional networks is not so evident. We argue that there are two reasons that could explain this result. First, the current network architectures (e.g. Two-stream ConvNets [12]) are relatively shallow compared with the very deep models in the image domain (e.g. VGGNet [13], GoogLeNet [15]), and therefore their modeling capacity is constrained by their depth. Second, and probably more importantly, the training datasets for action recognition are extremely small compared with the ImageNet dataset, so it is easy to over-fit on the training data. To address these issues, this report presents very deep two-stream ConvNets for action recognition, adapting recent very deep architectures to the video domain. However, this extension is not straightforward, since action recognition datasets are quite small. We design several good practices for training very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, and (iv) a high dropout ratio. Meanwhile, we extend the Caffe toolbox with a multi-GPU implementation that has high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the UCF101 dataset, where they achieve a recognition accuracy of 91.4%.
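As a rough illustration of practices (i) to (iv), the sketch below applies them to the spatial stream of a two-stream network. It is written in PyTorch rather than the paper's Caffe toolbox, and the specific choices (VGG-16 as the very deep backbone, the dropout ratios, the learning rate, and the augmentation pipeline) are assumptions for illustration, not the paper's exact settings.

# Illustrative sketch only; not the paper's Caffe implementation.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms

# (i) Pre-training: start from an ImageNet-pretrained very deep model
# (VGG-16 chosen here as an example backbone).
net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# (iv) High dropout ratio in the fully connected layers
# (the values 0.9 and 0.8 are assumed for illustration).
net.classifier[2] = nn.Dropout(p=0.9)
net.classifier[5] = nn.Dropout(p=0.8)
net.classifier[6] = nn.Linear(4096, 101)  # UCF101 has 101 action classes.

# (ii) Smaller learning rate than typical ImageNet training
# (1e-3 is an assumed value).
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

# (iii) More data augmentation: aggressive multi-scale cropping plus
# horizontal flipping (this particular composition is assumed).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

The temporal stream would be handled analogously, with stacked optical-flow frames as input instead of RGB images.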