A survey of action recognition algorithms based on deep learning
    Abstract:

    Human action recognition has long been an active topic in computer vision research and is widely applied in areas such as virtual reality and short video. Meanwhile, the rapid development of deep learning in recent years has driven progress in action recognition algorithms. Compared with traditional methods, action recognition algorithms based on deep learning offer stronger robustness and higher accuracy. Here, we survey the deep-learning-based action recognition algorithms proposed in recent years, focusing on those derived from the two-stream network and the 3D convolutional network; we then summarize their performance and main results, and finally discuss prospects for the field.

Get Citation

HU Kai, ZHENG Fei, LU Feiyu, HUANG Yukun. A survey of action recognition algorithms based on deep learning[J]. Journal of Nanjing University of Information Science & Technology, 2021, 13(6): 730-743

History
  • Received: November 07, 2019
  • Online: January 21, 2022

Address: No. 219, Ningliu Road, Nanjing, Jiangsu Province

Postcode: 210044

Phone: 025-58731025