Human motion recognition based on key frame two-stream convolutional network

Authors: ZHANG Congcong, HE Ning

Funding: National Natural Science Foundation of China (61872042, 61572077); Joint Key Project of the Beijing Natural Science Foundation and the Beijing Municipal Education Commission (KZ201911417048)

    Abstract:

    Aiming at the problems of large information redundancy and low accuracy in human motion recognition from video sequences, a human motion recognition method based on a key frame two-stream convolutional network is proposed. The method constructs a network framework consisting of three modules: feature extraction, key frame extraction, and spatial-temporal feature fusion. First, the single-frame RGB image from the spatial domain and the optical flow image stacked from multiple frames in the temporal domain are fed into the VGG16 network model to extract deep features of the video. Second, key frames are extracted: the importance of each video frame is continuously predicted, frames carrying sufficient information are pooled and fed to the neural network for training, and key frames are thereby selected while redundant frames are discarded. Finally, the Softmax outputs of the two models are weighted and fused as the output, yielding a multi-model fusion human motion recognizer that realizes key frame processing of the video and full utilization of the spatial-temporal information of the action. Experimental results on the public UCF-101 dataset show that, compared with current mainstream methods for human motion recognition, the proposed method achieves a higher recognition rate while relatively reducing network complexity.

Cite this article:

ZHANG Congcong, HE Ning. Human motion recognition based on key frame two-stream convolutional network[J]. Journal of Nanjing University of Information Science & Technology (Natural Science Edition), 2019, 11(6): 716-721

History
  • Received: 2019-10-07
  • Published online: 2020-01-19

Address: 219 Ningliu Road, Nanjing, Jiangsu Province    Postcode: 210044

Tel: 025-58731025    E-mail: nxdxb@nuist.edu.cn

Journal of Nanjing University of Information Science & Technology ® 2025 All rights reserved