CN110110686A - Human action recognition method based on a multi-loss two-stream convolutional neural network - Google Patents

Human action recognition method based on a multi-loss two-stream convolutional neural network

Info

Publication number
CN110110686A
CN110110686A (application CN201910400344.4A)
Authority
CN
China
Prior art keywords
network
neural network
convolutional
two-stream
action recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910400344.4A
Other languages
Chinese (zh)
Inventor
吴春雷
曹海文
王雷全
魏燚伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN201910400344.4A priority Critical patent/CN110110686A/en
Publication of CN110110686A publication Critical patent/CN110110686A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human action recognition method based on a multi-loss two-stream convolutional neural network. It belongs to the technical field of action recognition and addresses the problems that traditional two-stream networks lose motion detail and cannot extract spatio-temporal features. The invention improves on the temporal segment network with a multi-loss spatial network and a multi-loss temporal network; architecturally, the multi-loss two-stream convolutional neural network consists of three branches: action recognition, action recovery, and difference penalty. The action recovery branch adds a recovery loss that retains motion detail and balances the extracted motion features. The difference penalty branch classifies actions from differences between appearance features, so that effective spatio-temporal features are obtained. The multi-loss two-stream convolutional neural network is trained end to end, and the action recognition loss, recovery loss, and difference loss jointly assist the recognition module in extracting rich video representations, which improves the accuracy of action recognition.

Description

Human action recognition method based on a multi-loss two-stream convolutional neural network
Technical field
The present invention relates to the fields of computer vision and pattern recognition, and in particular to a human action recognition method based on a multi-loss two-stream convolutional neural network. It belongs to the field of action recognition.
Background art
Action recognition identifies the actions performed by people in a video. With the internet and digital devices becoming ever more ubiquitous, the processing and analysis of video, and video action recognition in particular, are widely studied topics in computer vision, with applications in many fields such as intelligent video surveillance, human-computer interaction, and human behavior analysis. Because convolutional neural networks have been enormously successful in image classification, and video-based action recognition can be treated as a classification task, action recognition methods are no longer limited to traditional hand-crafted features but are instead built on convolutional neural networks. The field still faces many challenges: camera motion, background clutter, and changes in lighting all affect recognition accuracy.
In recent years, progress in video action recognition has focused mainly on fusing the static and dynamic information of a video. Given the effectiveness of convolutional neural networks in computer vision tasks, they were naturally adopted as the spatial feature extraction network for action recognition. However, capturing only the static appearance of a video is insufficient for complex action recognition tasks. Optical flow, as a complementary input modality, captures the dynamic information of a video in a temporal network and has proven very effective for action recognition. The two-stream convolutional network proposed by Karen Simonyan et al. combines a spatial network and a temporal network and has become one of the mainstream approaches to action recognition, but it is limited to single-frame and single-optical-flow inputs and does not consider the sequential structure of video. As an improvement, Limin Wang et al. proposed the temporal segment network, which models long-range temporal structure using a sparse temporal sampling strategy and video-level supervision, allowing the entire action video to be learned efficiently. However, the temporal segment network still has difficulty distinguishing similar actions, easily loses motion detail, and does not take the spatio-temporal features of the video into account. The present invention therefore builds on the temporal segment network and introduces a recovery loss and a difference loss that help the action recognition module extract spatio-temporal features and retain motion detail.
Summary of the invention
The purpose of the present invention is to solve the low recognition accuracy of traditional two-stream convolutional neural networks, which is caused by their inability to extract spatio-temporal features and their loss of motion detail.
The technical solution adopted by the present invention to solve the above technical problem is as follows:
S1. Divide each video V in the dataset evenly into K segments {S_1, S_2, ..., S_K} (K is an empirical value, K = 3), and randomly sample one frame and one optical flow image from each segment as the input of the multi-loss two-stream convolutional neural network.
S2. Build the multi-loss two-stream convolutional neural network architecture.
S3. Feed the frames and optical flow images sampled in step S1 into the multi-loss two-stream convolutional neural network and train it so that the loss function is minimized.
S4. Feed test frames and optical flow images into the trained multi-loss two-stream convolutional neural network, fuse the two streams, and complete video-based human action recognition.
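For illustration, a minimal Python sketch of the sparse sampling in step S1 follows; the function name sample_segments and the list-of-images video representation are assumptions made for the example, not part of the patent.

    import random

    def sample_segments(frames, flows, K=3):
        """Step S1: divide a video into K equal segments and randomly pick
        one RGB frame and one optical-flow image from each segment.
        Assumes the video has at least K frames."""
        seg_len = len(frames) // K
        rgb_samples, flow_samples = [], []
        for k in range(K):
            idx = random.randrange(k * seg_len, (k + 1) * seg_len)
            rgb_samples.append(frames[idx])
            flow_samples.append(flows[idx])
        return rgb_samples, flow_samples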
Specifically, building the multi-loss two-stream convolutional neural network includes the following:
The multi-loss two-stream convolutional neural network is an improvement on the temporal segment network. The spatial network and the temporal network share the same architecture (only the input modality differs: frames for one, optical flow for the other), and in the multi-loss two-stream convolutional neural network each of the two networks is divided into three branches: action recognition, action recovery, and difference penalty.
(1) Action recognition
The action recognition branch uses a network based on BN-Inception. To model long-range temporal structure, the invention samples sparsely over the entire video and aggregates the segment features for action recognition.
(2) Action recovery
The input data is recovered from the output of the last convolutional layer of the action recognition branch. The invention uses four deconvolution layers and four skip connections for the recovery and computes the recovery error with a Euclidean distance loss, ensuring that part of the motion detail is retained in the action recognition network.
(3) Difference penalty
The difference penalty branch shares the feature encoding network with the action recognition and action recovery branches; the difference penalty is applied after the last convolutional layer of the action recognition branch. The invention performs action recognition on the feature differences between adjacent segments, helping the recognition network extract rich spatio-temporal features.
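For concreteness, a schematic PyTorch sketch of one stream with its three branches follows. A two-layer toy encoder stands in for BN-Inception, the four-deconvolution recovery decoder is shortened to two layers, and the skip connections are omitted; all layer sizes and names are illustrative assumptions rather than the patent's exact configuration.

    import torch
    import torch.nn as nn

    class MultiLossStream(nn.Module):
        """One stream (spatial or temporal): a shared feature encoder feeding
        the action recognition, action recovery, and difference penalty branches."""

        def __init__(self, in_channels=3, num_classes=101):
            super().__init__()
            # Shared encoder: a toy stand-in for BN-Inception.
            self.encoder = nn.Sequential(
                nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Action recognition head on the last convolutional features.
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_classes),
            )
            # Action recovery decoder: deconvolutions back to the input size
            # (the patent uses four deconvolution layers plus skip connections).
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, in_channels, 4, stride=2, padding=1),
            )

        def forward(self, x):
            feat = self.encoder(x)           # last-conv features f_k
            scores = self.classifier(feat)   # recognition branch
            recon = self.decoder(feat)       # recovery branch
            return scores, recon, feat       # feat also feeds the difference penalty

The temporal stream would be built the same way, with in_channels set to the number of stacked optical flow channels.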
Specifically, the multi-loss two-stream convolutional neural network is trained by using a model pre-trained on the ImageNet dataset, first training the action recognition module and then, once the recognition network is trained, training the whole network, optimized with stochastic gradient descent.
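A minimal sketch of one such training stage under SGD follows; the function signature, epoch count, learning rate, and momentum are illustrative assumptions.

    import torch

    def train_stage(model, loader, loss_fn, epochs=10, lr=1e-3):
        """One training stage optimized with stochastic gradient descent:
        run first for the recognition module, then for the whole network."""
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for inputs, labels in loader:
                loss = loss_fn(model, inputs, labels)  # recognition loss alone,
                opt.zero_grad()                        # or the full multi-loss
                loss.backward()
                opt.step()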
The loss functions of the multi-loss two-stream convolutional neural network are computed as follows:
(1) Action recognition
The entire video V is divided evenly into K segments {S_1, S_2, ..., S_K}, and one frame {I_1, I_2, ..., I_K} is sampled at random from each segment as the network input; the predicted score of the video for each action class is:
R(I_1, I_2, ..., I_K) = P(h(C(I_1; W), C(I_2; W), ..., C(I_K; W)))   (1)
where W are the network parameters, C(I_k; W) computes the class scores of each input through the network, k ∈ {1, 2, ..., K}, h fuses the outputs of the K segments into the final class scores, and P is the softmax operation.
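A minimal sketch of Eq. (1) follows, assuming the fusion function h is the mean over segments (a common choice for temporal segment networks; the patent only describes h as fusing the K segment outputs).

    import torch
    import torch.nn.functional as F

    def video_prediction(segment_scores):
        """Eq. (1): fuse per-segment class scores C(I_k; W) with h (here the
        mean over the K segments) and apply the softmax operation P."""
        # segment_scores: tensor of shape (K, num_classes)
        fused = segment_scores.mean(dim=0)   # h(C(I_1; W), ..., C(I_K; W))
        return F.softmax(fused, dim=-1)      # P(h(...))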
Therefore, the loss function of the action recognition module is:
L_r(y, H_r) = -∑_{i=1}^{n} y_i ( H_i - log ∑_{j=1}^{n} exp H_j )   (2)
where n is the total number of action classes, y_i is the ground-truth label, and H_r = h(C(I_1; W), C(I_2; W), ..., C(I_K; W)); that is, H_i = h(C_i(I_1), C_i(I_2), ..., C_i(I_K)) is the fused score of the K segments for action class i.
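Read with a one-hot label y, Eq. (2) is the standard cross-entropy over the fused scores; a sketch follows, assuming fused_scores holds the pre-softmax scores H_r for a batch and labels holds class indices.

    import torch
    import torch.nn.functional as F

    def recognition_loss(fused_scores, labels):
        """Eq. (2): cross-entropy between the fused segment scores H_r
        and the ground-truth action labels y."""
        # fused_scores: (batch, n) pre-softmax scores; labels: (batch,) class ids
        return F.cross_entropy(fused_scores, labels)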
(2) Action recovery
This module is optimized with a Euclidean distance loss:
L_g = ∑_{k=1}^{K} || Î_k - I_k ||²   (3)
where Î_k is the feature map output by the recovery network for the k-th segment and I_k is the original feature map.
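A sketch of the recovery loss in Eq. (3) follows; summing (rather than averaging) the squared distances over segments is our reading of the formula.

    import torch

    def recovery_loss(recovered, original):
        """Eq. (3): squared Euclidean distance between the recovered
        feature maps I_hat_k and the originals I_k, summed over segments."""
        # recovered, original: tensors of shape (K, C, H, W)
        return ((recovered - original) ** 2).sum()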
(3) Difference penalty
First, compute the difference between the features (f_k, f_{k+1}) that two adjacent frames (or optical flow images) produce at the last convolutional layer of the action recognition network:
d_k = f_{k+1} - f_k   (4)
Then perform action recognition on the difference features d_k; the loss function of the difference penalty is:
L_d(y, H_d) = -∑_{i=1}^{n} y_i ( H_{d,i} - log ∑_{j=1}^{n} exp H_{d,j} )   (5)
where H_d is the fused class score computed from the difference features.
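A sketch of Eqs. (4) and (5) for a single video follows, assuming the K-1 difference-feature scores are fused by averaging before the cross-entropy; the patent does not spell out the fusion for this branch.

    import torch
    import torch.nn.functional as F

    def difference_loss(features, classify, label):
        """Eqs. (4)-(5): difference features of adjacent segments are
        classified and scored with the same cross-entropy loss."""
        # features: (K, C, H, W) last-conv features of one video's K segments
        # label: zero-dim long tensor holding the ground-truth class id
        diffs = features[1:] - features[:-1]    # Eq. (4): d_k = f_{k+1} - f_k
        scores = classify(diffs).mean(dim=0)    # fuse the K-1 difference scores
        return F.cross_entropy(scores.unsqueeze(0), label.unsqueeze(0))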
In conclusion the total losses functions for losing double-current convolutional neural networks more are as follows:
L=Lr(y,Hr)+Lg+Ld(y,Hd) (6)
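Combining the three terms with equal weight, as Eq. (6) is written (any re-weighting would be a further design choice), gives:

    import torch
    import torch.nn.functional as F

    def total_loss(fused_scores, labels, recovered, original, diff_scores):
        """Eq. (6): L = L_r(y, H_r) + L_g + L_d(y, H_d), with equal weights."""
        l_r = F.cross_entropy(fused_scores, labels)   # Eq. (2), recognition
        l_g = ((recovered - original) ** 2).sum()     # Eq. (3), recovery
        l_d = F.cross_entropy(diff_scores, labels)    # Eq. (5), difference penalty
        return l_r + l_g + l_d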
Compared with the prior art, the above technical solution of the present invention has the following beneficial effects:
(1) The invention samples sparsely over the entire video, obtaining a long-range temporal description of the video.
(2) The invention adds a recovery loss, which reduces the loss of motion detail to a certain extent and balances the extracted video representation.
(3) The invention proposes a difference penalty term: adjacent segment features are subtracted to obtain difference features, which are then used for action recognition, helping the recognition network extract spatio-temporal features.
(4) The invention optimizes the action recognition network with multiple losses: the action recognition loss, the recovery loss, and the difference loss jointly assist the recognition module in extracting better spatio-temporal video representations, substantially improving the precision of action recognition.
Brief description of the drawings
Fig. 1 is a structural diagram of the multi-loss two-stream convolutional neural network used in an embodiment of the present invention.
Fig. 2 is a structural comparison of the temporal segment network and the multi-loss two-stream convolutional neural network provided in an embodiment of the present invention.
Detailed description of the embodiments
The accompanying drawings are for illustration only and shall not be construed as limiting this patent.
The present invention is further elaborated below with reference to the drawings and embodiments.
Fig. 1 is a structural diagram of the multi-loss two-stream convolutional neural network used in an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
S1. Divide each video V in the dataset evenly into K segments {S_1, S_2, ..., S_K} (K is an empirical value, K = 3), and randomly sample one frame and one optical flow image from each segment as the input of the multi-loss two-stream convolutional neural network.
S2. Build the multi-loss two-stream convolutional neural network architecture.
S3. Feed the frames and optical flow images sampled in step S1 into the multi-loss two-stream convolutional neural network and train it so that the loss function is minimized.
S4. Feed test frames and optical flow images into the trained multi-loss two-stream convolutional neural network, fuse the two streams, and complete video-based human action recognition.
Specifically, building the multi-loss two-stream convolutional neural network includes the following:
The multi-loss two-stream convolutional neural network is an improvement on the temporal segment network. The spatial network and the temporal network share the same architecture (only the input modality differs: frames for one, optical flow for the other), and in the multi-loss two-stream convolutional neural network each of the two networks is divided into three branches: action recognition, action recovery, and difference penalty.
(1) Action recognition
The action recognition branch uses a network based on BN-Inception. To model long-range temporal structure, the invention samples sparsely over the entire video and aggregates the segment features for action recognition.
(2) Action recovery
The input data is recovered from the output of the last convolutional layer of the action recognition branch. The invention uses four deconvolution layers and four skip connections for the recovery and computes the recovery error with a Euclidean distance loss, ensuring that part of the motion detail is retained in the action recognition network.
(3) Difference penalty
The difference penalty branch shares the feature encoding network with the action recognition and action recovery branches; the difference penalty is applied after the last convolutional layer of the action recognition branch. The invention performs action recognition on the feature differences between adjacent segments, helping the recognition network extract rich spatio-temporal features.
The loss functions of the multi-loss two-stream convolutional neural network are computed as follows:
(1) Action recognition
The entire video V is divided evenly into K segments {S_1, S_2, ..., S_K}, and one frame {I_1, I_2, ..., I_K} is sampled at random from each segment as the network input; the predicted score of the video for each action class is:
R(I_1, I_2, ..., I_K) = P(h(C(I_1; W), C(I_2; W), ..., C(I_K; W)))   (1)
where W are the network parameters, C(I_k; W) computes the class scores of each input through the network, k ∈ {1, 2, ..., K}, h fuses the outputs of the K segments into the final class scores, and P is the softmax operation.
Therefore, the loss function of the action recognition module is:
L_r(y, H_r) = -∑_{i=1}^{n} y_i ( H_i - log ∑_{j=1}^{n} exp H_j )   (2)
where n is the total number of action classes, y_i is the ground-truth label, and H_r = h(C(I_1; W), C(I_2; W), ..., C(I_K; W)); that is, H_i = h(C_i(I_1), C_i(I_2), ..., C_i(I_K)) is the fused score of the K segments for action class i.
(2) Action recovery
This module is optimized with a Euclidean distance loss:
L_g = ∑_{k=1}^{K} || Î_k - I_k ||²   (3)
where Î_k is the feature map output by the recovery network for the k-th segment and I_k is the original feature map.
(3) Difference penalty
First, compute the difference between the features (f_k, f_{k+1}) that two adjacent frames (or optical flow images) produce at the last convolutional layer of the action recognition network:
d_k = f_{k+1} - f_k   (4)
Then perform action recognition on the difference features d_k; the loss function of the difference penalty is:
L_d(y, H_d) = -∑_{i=1}^{n} y_i ( H_{d,i} - log ∑_{j=1}^{n} exp H_{d,j} )   (5)
where H_d is the fused class score computed from the difference features.
In summary, the total loss function of the multi-loss two-stream convolutional neural network is:
L = L_r(y, H_r) + L_g + L_d(y, H_d)   (6)
Fig. 2 is a structural comparison of the temporal segment network and the multi-loss two-stream convolutional neural network provided in an embodiment of the present invention. As shown in Fig. 2, Fig. 2(a) is the temporal segment network and Fig. 2(b) is the multi-loss two-stream convolutional neural network. The present invention adds a recovery loss and a difference penalty term on top of the temporal segment network and uses the multiple losses to optimize the extracted video features, giving higher robustness and accuracy.
In this work, the present invention provides a new method for video-based human action recognition. Compared with existing methods, the invention uses a recovery loss and a difference penalty term on top of the traditional temporal segment network to assist the optimization of the action recognition module, extracting video representations that carry spatio-temporal information and reducing the loss of motion detail, so that action recognition accuracy is greatly improved.
Finally, the details of the above embodiment merely illustrate an example of the invention. For those skilled in the art, any modification, improvement, or substitution of the above embodiment shall fall within the protection scope of the claims of the present invention.

Claims (4)

1. A human action recognition method based on a multi-loss two-stream convolutional neural network, characterized in that the method comprises the following steps:
S1. Divide each video V in the dataset evenly into K segments {S_1, S_2, ..., S_K} (K is an empirical value, K = 3), and randomly sample one frame and one optical flow image from each segment as the input of the multi-loss two-stream convolutional neural network.
S2. Build the multi-loss two-stream convolutional neural network architecture.
S3. Feed the frames and optical flow images sampled in step S1 into the multi-loss two-stream convolutional neural network and train it so that the loss function is minimized.
S4. Feed test frames and optical flow images into the trained multi-loss two-stream convolutional neural network, fuse the two streams, and complete video-based human action recognition.
2. The human action recognition method based on a multi-loss two-stream convolutional neural network according to claim 1, characterized in that the detailed process of S2 is as follows:
The multi-loss two-stream convolutional neural network is an improvement on the temporal segment network. The spatial network and the temporal network share the same architecture (only the input modality differs: frames for one, optical flow for the other), and in the multi-loss two-stream convolutional neural network each of the two networks is divided into three branches: action recognition, action recovery, and difference penalty.
(1) Action recognition
The action recognition branch uses a network based on BN-Inception. To model long-range temporal structure, the invention samples sparsely over the entire video and aggregates the segment features for action recognition.
(2) Action recovery
The input data is recovered from the output of the last convolutional layer of the action recognition branch. The invention uses four deconvolution layers and four skip connections for the recovery and computes the recovery error with a Euclidean distance loss, ensuring that part of the motion detail is retained in the action recognition network.
(3) Difference penalty
The difference penalty branch shares the feature encoding network with the action recognition and action recovery branches; the difference penalty is applied after the last convolutional layer of the action recognition branch. The invention performs action recognition on the feature differences between adjacent segments, helping the recognition network extract rich spatio-temporal features.
3. The human action recognition method based on a multi-loss two-stream convolutional neural network according to claim 1, characterized in that the detailed process of S3 is as follows:
Specifically, the multi-loss two-stream convolutional neural network is trained by using a model pre-trained on the ImageNet dataset, first training the action recognition module and then, once the recognition network is trained, training the whole network, optimized with stochastic gradient descent.
The loss functions of the multi-loss two-stream convolutional neural network are computed as follows:
(1) Action recognition
The entire video V is divided evenly into K segments {S_1, S_2, ..., S_K}, and one frame {I_1, I_2, ..., I_K} is sampled at random from each segment as the network input; the predicted score of the video for each action class is:
R(I_1, I_2, ..., I_K) = P(h(C(I_1; W), C(I_2; W), ..., C(I_K; W)))   (1)
where W are the network parameters, C(I_k; W) computes the class scores of each input through the network, k ∈ {1, 2, ..., K}, h fuses the outputs of the K segments into the final class scores, and P is the softmax operation.
Therefore, the loss function of the action recognition module is:
L_r(y, H_r) = -∑_{i=1}^{n} y_i ( H_i - log ∑_{j=1}^{n} exp H_j )   (2)
where n is the total number of action classes, y_i is the ground-truth label, and H_r = h(C(I_1; W), C(I_2; W), ..., C(I_K; W)); that is, H_i = h(C_i(I_1), C_i(I_2), ..., C_i(I_K)) is the fused score of the K segments for action class i.
(2) Action recovery
This module is optimized with a Euclidean distance loss:
L_g = ∑_{k=1}^{K} || Î_k - I_k ||²   (3)
where Î_k is the feature map output by the recovery network for the k-th segment and I_k is the original feature map.
(3) Difference penalty
First, compute the difference between the features (f_k, f_{k+1}) that two adjacent frames (or optical flow images) produce at the last convolutional layer of the action recognition network:
d_k = f_{k+1} - f_k   (4)
Then perform action recognition on the difference features d_k; the loss function of the difference penalty is:
L_d(y, H_d) = -∑_{i=1}^{n} y_i ( H_{d,i} - log ∑_{j=1}^{n} exp H_{d,j} )   (5)
where H_d is the fused class score computed from the difference features.
In summary, the total loss function of the multi-loss two-stream convolutional neural network is:
L = L_r(y, H_r) + L_g + L_d(y, H_d)   (6).
4. The human action recognition method based on a multi-loss two-stream convolutional neural network according to claim 1, characterized in that, when the trained multi-loss two-stream convolutional neural network is tested in S4, each video uses one frame or one optical flow image as the input of the corresponding stream of the multi-loss two-stream network to predict the action recognition scores, and the scores output by the spatial network and the temporal network are finally fused as the final test score of the multi-loss two-stream convolutional neural network.
CN201910400344.4A 2019-05-14 2019-05-14 Human action recognition method based on a multi-loss two-stream convolutional neural network Pending CN110110686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910400344.4A CN110110686A (en) 2019-05-14 2019-05-14 Human action recognition method based on a multi-loss two-stream convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910400344.4A CN110110686A (en) 2019-05-14 2019-05-14 Human action recognition method based on a multi-loss two-stream convolutional neural network

Publications (1)

Publication Number Publication Date
CN110110686A true CN110110686A (en) 2019-08-09

Family

ID=67490072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910400344.4A Pending CN110110686A (en) Human action recognition method based on a multi-loss two-stream convolutional neural network

Country Status (1)

Country Link
CN (1) CN110110686A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006133866A (en) * 2004-11-02 2006-05-25 Advanced Telecommunication Research Institute International Robot device and position memory unit
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 Video summarization method and system for user-generated videos
CN108764128A (en) * 2018-05-25 2018-11-06 华中科技大学 Video action recognition method based on a sparse temporal segment network
CN108805078A (en) * 2018-06-11 2018-11-13 山东大学 Video pedestrian re-identification method and system based on average pedestrian state
CN109492581A (en) * 2018-11-09 2019-03-19 中国石油大学(华东) Human action recognition method based on the TP-STG framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张晶晶 (Zhang Jingjing): "Salient foreground segmentation for gait recognition", China Excellent Master's Theses Full-text Database *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Two-stream convolutional neural network human behavior recognition method based on a limited sample set
CN111027377A (en) * 2019-10-30 2020-04-17 杭州电子科技大学 Two-stream neural network temporal action localization method
CN110969191A (en) * 2019-11-07 2020-04-07 吉林大学 Glaucoma prevalence probability prediction method based on similarity-preserving metric learning
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior recognition method based on a reinforcement learning attention mechanism
CN111539290A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN111539290B (en) * 2020-04-16 2023-10-20 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112258381A (en) * 2020-09-29 2021-01-22 北京达佳互联信息技术有限公司 Model training method, image processing method, device, equipment and storage medium
CN112258381B (en) * 2020-09-29 2024-02-09 北京达佳互联信息技术有限公司 Model training method, image processing method, device, equipment and storage medium
CN113139467A (en) * 2021-04-23 2021-07-20 西安交通大学 Fine-grained video action recognition method based on a hierarchical structure
CN113139467B (en) * 2021-04-23 2023-04-25 西安交通大学 Fine-grained video action recognition method based on a hierarchical structure

Similar Documents

Publication Publication Date Title
CN110110686A (en) Human action recognition method based on a multi-loss two-stream convolutional neural network
CN107330362B (en) Video classification method based on space-time attention
Zhang et al. Real-time action recognition with enhanced motion vector CNNs
Simonyan et al. Two-stream convolutional networks for action recognition in videos
CN106096568B (en) Pedestrian re-identification method based on CNN and convolutional LSTM networks
CN108830252A (en) Convolutional neural network human action recognition method fusing global spatio-temporal features
Biswas et al. Structural recurrent neural network (SRNN) for group activity analysis
CN109101896A (en) Video behavior recognition method based on spatio-temporal fusion features and an attention mechanism
CN110188637A (en) Behavior recognition method based on deep learning
CN110147743A (en) Real-time online pedestrian analysis and counting system and method in complex scenes
CN111626171B (en) Group behavior recognition method based on a video-segment attention mechanism and interaction-relation activity graph modeling
CN110119703A (en) Human action recognition method fusing an attention mechanism and spatio-temporal graph convolutional neural networks in security surveillance scenes
CN109815785A (en) Facial emotion recognition method based on two-stream convolutional neural networks
CN109829443A (en) Video behavior recognition method based on image enhancement and 3D convolutional neural networks
CN109891897A (en) Method for analyzing media content
CN105138953B (en) Method for action recognition in video based on continuous multiple-instance learning
CN105574510A (en) Gait recognition method and device
CN109214285A (en) Fall detection method based on deep convolutional neural networks and long short-term memory networks
CN107025420A (en) Method and apparatus for human behavior recognition in video
CN106650694A (en) Face recognition method using a convolutional neural network as feature extractor
CN106529477A (en) Video human behavior recognition method based on salient trajectories and spatio-temporal evolution information
CN110348364A (en) Basketball video group behavior recognition method combining unsupervised clustering with spatio-temporal deep networks
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
CN107392131A (en) Action recognition method based on skeleton node distances
Xu et al. Scene image and human skeleton-based dual-stream human action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20231215

AD01 Patent right deemed abandoned