CN108985223B - Human body action recognition method

Info

Publication number
CN108985223B
Authority
CN
China
Prior art keywords
network
sequence
deep learning
convolution
recognition
Prior art date
Legal status
Active
Application number
CN201810766185.5A
Other languages
Chinese (zh)
Other versions
CN108985223A (en)
Inventor
张德馨
史玉坤
Current Assignee
Tianjin Isecure Technology Co ltd
Original Assignee
Tianjin Isecure Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Tianjin Isecure Technology Co ltd
Priority to CN201810766185.5A
Publication of CN108985223A
Application granted
Publication of CN108985223B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Abstract

The invention provides a human body action recognition method based on deep learning. The method comprises a training stage and a recognition stage, and the network used in both stages contains a sequence feature extraction module comprising a color image deep learning network, an optical flow deep learning network and a CNN network; the color image deep learning network contains three LSTM layers and the optical flow deep learning network contains two LSTM layers. With the LSTM layers added, the recognition method can learn long image sequences, so the temporal information of the video sequence is used more fully and detection accuracy is effectively improved. In addition, a four-layer convolution network is used in the deep learning network to vary the receptive field of the feature code, so that parts of the image sequence also participate in determining the detection result.

Description

Human body action recognition method
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a human body action recognition method.
Background
Traditional human body action recognition attaches acquisition equipment, such as biological or mechanical sensors, to the person's body. It is therefore a contact-based motion detection method and can cause discomfort or fatigue. With the development of technology, this mode of recognition has gradually been replaced by image-based recognition methods.
Deep learning has brought breakthrough progress to machine learning and has opened a new direction for human body action recognition. Unlike traditional recognition methods, deep learning can automatically learn high-level features from low-level features, which solves the problems that feature selection depends too heavily on the task and that the tuning process is time-consuming.
Disclosure of Invention
In the prior art, human body action recognition feeds the whole feature directly into a fully connected layer, so detection is based on the entire feature. This causes problems: for example, when the action is fast, the picture sub-sequence that actually contains the action is much shorter than the unit complete sequence length set for detection, and the action may go undetected. Moreover, the prior art does not consider the historical information of the sequence images, so detection accuracy still needs improvement. On this basis, the human body action recognition method adopts the following technical scheme:
The human body action recognition method is based on deep learning and comprises a training stage and a recognition stage. The network used in both stages contains a sequence feature extraction module, which comprises a color image deep learning network, an optical flow deep learning network and a CNN network; the color image deep learning network contains three LSTM layers and the optical flow deep learning network contains two LSTM layers.
Further, the hidden layer of each LSTM layer has 200 neurons.
Further, the training stage includes the following steps:
Step 1: acquire an action video, split it into frame images, compute the optical flow maps, extract one image every 16 frames as the sequence center frame, and mark the action position;
Step 2: from the video sequence images, respectively generate the sequence picture samples and labels, the center-frame picture samples and position labels, and the sequence optical flow picture samples and labels, which are used to train the corresponding feature extraction models;
Step 3: send the sequence picture samples and labels into the color image deep learning network, the center-frame picture samples and position labels into the CNN network, and the sequence optical flow picture samples into the optical flow deep learning network, and extract features;
Step 4: fuse the features extracted by the three network models to generate the feature code of the video sequence;
Step 5: send the feature code into a convolution network, which varies the receptive field of the video sequence features over different time scales;
Step 6: send the feature code samples with different receptive fields into a video recognition network to generate a recognition model;
Step 7: iterate training until the recognition model converges.
Further, in the recognition stage the feature code of the video sequence is generated by the sequence feature extraction module, and the feature code is recognized and classified after the convolution network has varied its receptive field.
Further, the convolution network adopts a four-layer structure.
Compared with the prior art, the invention has the beneficial effects that:
1. The redesigned deep learning network structure extracts the features of the video sequence more effectively, giving high action recognition accuracy.
2. The four-layer convolution network varies the receptive field over the video sequence feature code; while preserving real-time recognition, it effectively solves the problem that an action cannot be detected when the picture sub-sequence containing it is much shorter than the complete sequence.
Drawings
FIG. 1 is a flow chart of the model training of the present invention;
FIG. 2 is a workflow diagram of the color image deep learning network;
FIG. 3 is a workflow diagram of the optical flow deep learning network;
FIG. 4 is a workflow diagram of the CNN network;
FIG. 5 is a flow chart of the action recognition of the present invention;
FIG. 6 is a workflow diagram of the convolution network.
Detailed Description
As shown in FIG. 1, the training stage of the human body action recognition method of the present invention includes:
Step 1: acquire an action video, split it into frame images, compute the optical flow maps, extract one image every 16 frames as the sequence center frame, and mark the action position;
Step 2: send the video sequence images into the image sequence processing unit, the center-frame image processing unit and the optical flow sequence processing unit respectively, generating the sequence picture samples and labels, the center-frame picture samples and position labels, and the sequence optical flow picture samples and labels, which are used to train the corresponding feature extraction models (a preprocessing sketch for Steps 1 and 2 follows this list);
Step 3: send the sequence picture samples and labels into the color image deep learning network, the center-frame picture samples and position labels into the CNN network, and the sequence optical flow picture samples into the optical flow deep learning network, and extract features;
Step 4: fuse the features extracted by the three network models to generate the feature code of the video sequence;
Step 5: send the feature code into a convolution network, which varies the receptive field of the video sequence features over different time scales;
Step 6: send the feature code samples with different receptive fields into a video recognition network to generate a recognition model;
Step 7: iterate training until the recognition model converges.
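To make Steps 1 and 2 concrete, the following Python sketch shows one plausible preprocessing pipeline. The 16-frame interval follows the description above; the Farnebäck dense optical flow routine, the choice of the middle frame of each unit as the center frame, and all function and variable names are illustrative assumptions, not details fixed by the patent.

```python
import cv2  # OpenCV: frame extraction and dense optical flow

def preprocess_video(path, interval=16):
    """Steps 1-2 (one plausible reading): split the video into frames,
    compute a dense optical flow map between consecutive frames, and
    take the middle frame of each `interval`-frame unit sequence as
    the sequence center frame."""
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        return [], [], []
    frames, flows = [prev], []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense flow is an assumption; the patent only says
        # that an optical flow map is computed.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        frames.append(frame)
        prev_gray = gray
    cap.release()
    centers = frames[interval // 2::interval]  # one center frame per unit
    return frames, flows, centers
```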
The image sequence processing unit, the center-frame image processing unit, the optical flow sequence processing unit, the color image deep learning network, the CNN network, the optical flow deep learning network and the feature fusion unit together form the sequence feature extraction module.
Because human motion is continuous while the acquired image frames are discrete, the historical information of previous frames is correlated with the current frame. Deep learning networks are mostly built as CNN networks; the invention constructs the color image deep learning network and the optical flow deep learning network on a CNN basis. The CNN network adopts an SSD network layer to extract the specific position of the action in the key frame. As shown in FIG. 2 and FIG. 3, the color image deep learning network adds three LSTM layers and the optical flow deep learning network adds two, where the hidden layer of each LSTM layer has 200 neurons. With the LSTM layers added, the recognition method can learn long image sequences; compared with algorithms that recognize from a single picture, the reconstructed deep learning network makes better use of the temporal information of the video sequence and effectively improves detection accuracy.
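For illustration, a minimal PyTorch sketch of the color image branch and of the feature fusion of Step 4 is given below. The patent fixes only the LSTM layer counts (three for the color branch, two for the optical flow branch), the 200 hidden neurons, and the use of an SSD layer in the separate CNN network; the per-frame backbone, the feature dimensions, and the concatenation rule for fusion are assumptions introduced for the example.

```python
import torch
import torch.nn as nn

class ColorSequenceNet(nn.Module):
    """Color image branch: per-frame CNN features followed by three
    stacked LSTM layers with 200 hidden neurons each. The optical flow
    branch would follow the same pattern with lstm_layers=2."""
    def __init__(self, feat_dim=256, hidden=200, lstm_layers=3):
        super().__init__()
        # Placeholder per-frame backbone; the patent does not fix this
        # architecture (only the SSD layer in the separate CNN network).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Stacked LSTM layers: three here, two in the flow branch.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=lstm_layers,
                            batch_first=True)

    def forward(self, clip):                  # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        x = self.cnn(clip.flatten(0, 1))      # (B*T, feat_dim)
        out, _ = self.lstm(x.view(b, t, -1))  # (B, T, hidden)
        return out[:, -1]                     # last-step feature: (B, 200)

def fuse(color_feat, flow_feat, position_feat):
    """Step 4 (assumed fusion rule): concatenate the three branch
    features into one feature code for the video sequence."""
    return torch.cat([color_feat, flow_feat, position_feat], dim=1)
```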
As shown in FIG. 5, the recognition stage of the human body action recognition method of the present invention includes:
Step 1: acquire an action video, split it into frame images, compute the optical flow maps, extract one image every 16 frames as the sequence center frame, and mark the action position;
Step 2: generate the feature code of the video sequence with the sequence feature extraction module;
Step 3: send the feature code into the convolution network, which varies the receptive field of the video sequence features over different time scales;
Step 4: classify the feature codes with different receptive fields;
Step 5: obtain the human body action recognition result.
As shown in FIG. 6, the convolution network used in training and recognition has a four-layer structure and serves to vary the receptive field of the feature code: after the feature code passes through the four convolution layers, its receptive field has been changed four times. The purpose of varying the receptive field is to let parts of an image sequence of a given length also participate in determining the detection result, i.e. the result is decided jointly by the whole feature code and by parts of it. The convolution network is composed of temporal convolutions; each layer uses a conv9 one-dimensional convolution with stride 1, and each convolution layer is paired with one pooling layer.
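Under the same caveat, the following sketch shows one way to realize the four-layer temporal convolution network: each layer is a one-dimensional convolution over the time axis with stride 1, reading "conv9" as a kernel of width 9, and each convolution layer is paired with a pooling layer, so the temporal receptive field over the feature code grows at every stage. The channel count, padding, pooling type and classification head are illustrative assumptions.

```python
import torch.nn as nn

class ReceptiveFieldNet(nn.Module):
    """Four temporal conv layers (kernel 9, stride 1), each paired with
    a pooling layer; every stage widens the receptive field over the
    feature code, so sub-sequences also influence the decision."""
    def __init__(self, in_ch=64, num_actions=10):
        super().__init__()
        layers = []
        for _ in range(4):                      # four-layer structure
            layers += [nn.Conv1d(in_ch, in_ch, kernel_size=9,
                                 stride=1, padding=4),
                       nn.ReLU(),
                       nn.MaxPool1d(2)]         # one pooling layer per conv
        self.body = nn.Sequential(*layers)
        # Classification head (assumed): pool to length 1 and classify.
        self.head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(in_ch, num_actions))

    def forward(self, code):    # code: (B, channels, time steps)
        return self.head(self.body(code))
```

With four MaxPool1d(2) stages, a feature code covering a 16-frame unit sequence is reduced to length 1 before classification, so both the whole code and its pooled sub-segments influence the result.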
The above embodiments are merely preferred embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A human body action recognition method based on deep learning, characterized in that the method comprises a training stage and a recognition stage; a sequence feature extraction module is provided in the network used in the training and recognition stages; the sequence feature extraction module comprises a color image deep learning network, an optical flow deep learning network and a CNN network; the color image deep learning network adds three LSTM layers on the basis of the CNN network, the optical flow deep learning network adds two LSTM layers on the basis of the CNN network, and the CNN network adopts an SSD network layer;
the hidden layer of each LSTM layer has 200 neurons;
the training stage comprises the following steps:
Step 1: acquire an action video, split it into frame images, compute the optical flow maps, extract one image every 16 frames as the sequence center frame, and mark the action position;
Step 2: from the video sequence images, respectively generate the sequence picture samples and labels, the center-frame picture samples and position labels, and the sequence optical flow picture samples and labels, which are used to train the corresponding feature extraction models;
Step 3: send the sequence picture samples and labels into the color image deep learning network, the center-frame picture samples and position labels into the CNN network, and the sequence optical flow picture samples into the optical flow deep learning network, and extract features;
Step 4: fuse the features extracted by the three network models to generate the feature code of the video sequence;
Step 5: send the feature code into a convolution network, which varies the receptive field of the video sequence features over different time scales;
Step 6: send the feature code samples with different receptive fields into a video recognition network to generate a recognition model;
Step 7: iterate training until the recognition model converges;
in the recognition stage, the feature code of the video sequence is generated by the sequence feature extraction module, and the feature code is recognized after the convolution network has varied its receptive field;
the convolution network adopts a four-layer structure and is composed of temporal convolutions; and
each convolution layer in the convolution network uses a one-dimensional convolution with stride 1, and each convolution layer is paired with one pooling layer.
CN201810766185.5A 2018-07-12 2018-07-12 Human body action recognition method Active CN108985223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810766185.5A 2018-07-12 2018-07-12 Human body action recognition method

Publications (2)

Publication Number Publication Date
CN108985223A CN108985223A (en) 2018-12-11
CN108985223B (en) 2024-05-07

Family

ID=64537893


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685213B (en) * 2018-12-29 2022-01-07 百度在线网络技术(北京)有限公司 Method and device for acquiring training sample data and terminal equipment
CN110084259B (en) * 2019-01-10 2022-09-20 谢飞 Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics
CN109902565B (en) * 2019-01-21 2020-05-05 深圳市烨嘉为技术有限公司 Multi-feature fusion human behavior recognition method
CN109919031B (en) * 2019-01-31 2021-04-09 厦门大学 Human behavior recognition method based on deep neural network
CN110544301A (en) * 2019-09-06 2019-12-06 广东工业大学 Three-dimensional human body action reconstruction system, method and action training system
CN112257568B (en) * 2020-10-21 2022-09-20 中国人民解放军国防科技大学 Intelligent real-time supervision and error correction system and method for individual soldier queue actions

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933417A (en) * 2015-06-26 2015-09-23 苏州大学 Behavior recognition method based on sparse spatial-temporal characteristics
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN107273800A (en) * 2017-05-17 2017-10-20 大连理工大学 A kind of action identification method of the convolution recurrent neural network based on attention mechanism
CN107292247A (en) * 2017-06-05 2017-10-24 浙江理工大学 A kind of Human bodys' response method and device based on residual error network
CN107463949A (en) * 2017-07-14 2017-12-12 北京协同创新研究院 A kind of processing method and processing device of video actions classification
CN108229338A (en) * 2017-12-14 2018-06-29 华南理工大学 A kind of video behavior recognition methods based on depth convolution feature
CN108108699A (en) * 2017-12-25 2018-06-01 重庆邮电大学 Merge deep neural network model and the human motion recognition method of binary system Hash

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Shreyank Jyoti et al.; Expression Empowered ResiDen Network for Facial Action Unit Detection; arXiv; 2018-06-14; Section 1 *
Jeff Donahue et al.; Long-term Recurrent Convolutional Networks for Visual Recognition and Description; 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015-10-15; Sections 1 and 4, Fig. 1 *
Yang Ping et al.; A sign language gesture recognition method based on fused multi-sensor information; Space Medicine & Medical Engineering; 2012-08; Vol. 25, No. 4; Abstract *
Wang Xinpei; Research on abnormal behavior classification algorithms based on two-stream CNN; China Master's Theses Full-text Database, Information Science & Technology; 2018, No. 2; I138-2191 *

Also Published As

Publication number Publication date
CN108985223A (en) 2018-12-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant