CN108647599A - Human behavior recognition method combining 3D skip-layer connections and a recurrent neural network - Google Patents
Human behavior recognition method combining 3D skip-layer connections and a recurrent neural network
- Publication number
- CN108647599A CN108647599A CN201810394571.6A CN201810394571A CN108647599A CN 108647599 A CN108647599 A CN 108647599A CN 201810394571 A CN201810394571 A CN 201810394571A CN 108647599 A CN108647599 A CN 108647599A
- Authority
- CN
- China
- Prior art keywords
- video
- recognition
- neural network
- recurrent neural
- feature
- Prior art date: 2018-04-27
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
Abstract
The present invention discloses a human behavior recognition method combining 3D skip-layer connections and a recurrent neural network, comprising the following steps: Step 1, divide each video into N segments and extract L frames from each segment, where N and L are natural numbers; Step 2, extract spatio-temporal features from the video with a trained 3D convolutional neural network, and concatenate the spatio-temporal features of different levels into a high-dimensional feature vector; Step 3, normalize the high-dimensional feature vector obtained in Step 2; Step 4, feed the normalized high-dimensional feature vector from Step 3 into a recurrent neural network for feature fusion; Step 5, classify the fused features from Step 4 to obtain the action category of the video. The method requires no manual extraction of low-level motion information; compared with hand-crafted motion-feature design methods it is more robust, and it can effectively handle long videos.
Description
Technical field
The invention belongs to the technical field of computer vision recognition, and in particular relates to a human behavior recognition method combining 3D convolutional skip-layer connections and a recurrent neural network.
Background art
Human behavior recognition has important application prospects and market value in fields such as video surveillance, human-computer interaction, and virtual reality, so video-based human action recognition has become one of the research hotspots in computer vision. Meanwhile, as deep learning, and convolutional neural networks in particular, has achieved notable results in computer vision, human behavior recognition based on convolutional neural networks has attracted much attention from researchers.
The patent CN201611117772.9, "Behavior recognition method based on trajectory and convolutional neural network feature extraction", first extracts trajectories from the input image/video data, then uses a convolutional neural network to extract convolutional features, combines the trajectories with trajectory-constrained convolutional-layer features to extract stacked local Fisher vector features, and finally trains a support vector machine model for classification.
The patent CN201510527937.9, "Human behavior recognition method based on 3D convolutional neural networks", first screens and stores images showing apparent human behavior features, then extracts five channels of information from the stored images, namely grayscale, x- and y-direction gradients, and optical flow, extracts convolutional features from the five channels with a convolutional neural network, and finally performs classification.
Both of the above methods require low-dimensional motion information to be extracted from the video data in advance; the raw video data cannot be fed directly into the network, so end-to-end classification prediction cannot be achieved.
The patent CN201610047682.0, "Behavior recognition method based on deep learning and multi-scale information", first splits a deep video into multiple video segments, then learns each video segment with its own branch neural network, then performs a simple fusion connection of the high-level representations learned by the parallel branches, and finally feeds the fused representation into fully connected and classification layers for recognition. With this method, a long input video makes the fused feature dimension too high, so the network becomes difficult to train.
In summary, although action recognition based on convolutional neural networks has been studied extensively at home and abroad, problems remain, such as the need for manual motion-information extraction from the video data in advance or the inability to handle long videos.
Summary of the invention
The object of the present invention is to provide a human behavior recognition method combining 3D skip-layer connections and a recurrent neural network that requires no manual extraction of low-level motion information; compared with hand-crafted motion-feature design methods, the invention is more robust and can effectively handle long videos.
To achieve the above object, the solution of the invention is:
A human behavior recognition method combining 3D skip-layer connections and a recurrent neural network, comprising the following steps:
Step 1: divide each video into N segments and extract L frames from each segment, where N and L are natural numbers;
Step 2: extract spatio-temporal features from the video with a trained 3D convolutional neural network, and concatenate the spatio-temporal features of different levels into a high-dimensional feature vector;
Step 3: normalize the high-dimensional feature vector obtained in Step 2;
Step 4: feed the normalized high-dimensional feature vector from Step 3 into a recurrent neural network for feature fusion;
Step 5: classify the fused features from Step 4 to obtain the action category of the video.
In Step 1 above, a video is discarded if its total frame count is below 48; if the total frame count is not divisible by L, the last few frames are discarded.
In Step 1 above, dividing each video into N segments and extracting L frames from each segment means: a video is divided evenly by frame count into N = 3 parts, each part containing the same number of frames, and L = 16 frames are extracted at equal intervals from each part.
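For illustration only, the following is a minimal NumPy sketch of this sampling scheme; the function name sample_segments and the exact interval arithmetic are our assumptions, since the patent specifies only the N = 3 parts, L = 16 frames, and the 48-frame minimum.

```python
import numpy as np

def sample_segments(frames, n_parts=3, frames_per_part=16):
    """Split a video of shape (T, H, W, 3) into n_parts equal parts and
    pick frames_per_part frames at equal intervals from each part."""
    total = frames.shape[0]
    if total < n_parts * frames_per_part:   # patent: discard videos shorter than 48 frames
        return None
    total -= total % n_parts                # patent: drop the last few frames if they do not divide evenly
    part_len = total // n_parts
    segments = []
    for p in range(n_parts):
        idx = np.linspace(p * part_len, (p + 1) * part_len - 1,
                          frames_per_part).astype(int)
        segments.append(frames[idx])
    return np.stack(segments)               # shape (3, 16, H, W, 3)

clips = sample_segments(np.zeros((120, 240, 320, 3), dtype=np.float32))
print(clips.shape)                          # (3, 16, 240, 320, 3)
```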
The detailed process of Step 2 above is:
Transfer learning: the convolution and pooling layers of a trained C3D network serve as the feature extractor. Spatio-temporal features are extracted from each 16-frame input obtained in Step 1, yielding an output vector of pool5num dimensions; extracting spatio-temporal features from the whole video gives a result represented by the two-dimensional tensor (3, pool5num), where pool5num is the output dimension of pooling layer 5 of the feature extractor.
Skip-layer concatenation: for each 16-frame input, the outputs of pooling layers 1, 2, 3, and 5 of the feature extractor are concatenated into a feature vector of poolall_num dimensions. Applying this concatenation to the whole video gives a result represented by the two-dimensional tensor (3, poolall_num), where poolall_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num, pool3num are the output dimensions of pooling layers 1, 2, and 3, respectively, as sketched below.
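The sketch below illustrates the skip-layer concatenation with a toy 3D CNN standing in for the C3D feature extractor; the layer widths are invented for brevity, the pool outputs are simply flattened before concatenation, and PyTorch's channel-first layout (segments, channels, frames, height, width) replaces the patent's channel-last tensors.

```python
import torch
import torch.nn as nn

class TinyC3D(nn.Module):
    """Toy stand-in for the C3D feature extractor; layer widths are illustrative."""
    def __init__(self):
        super().__init__()
        self.conv1, self.pool1 = nn.Conv3d(3, 8, 3, padding=1), nn.MaxPool3d((1, 2, 2))
        self.conv2, self.pool2 = nn.Conv3d(8, 16, 3, padding=1), nn.MaxPool3d(2)
        self.conv3, self.pool3 = nn.Conv3d(16, 32, 3, padding=1), nn.MaxPool3d(2)
        self.conv5, self.pool5 = nn.Conv3d(32, 64, 3, padding=1), nn.MaxPool3d(2)

    def forward(self, x):                        # x: (segments, 3, 16, 112, 112)
        p1 = self.pool1(torch.relu(self.conv1(x)))
        p2 = self.pool2(torch.relu(self.conv2(p1)))
        p3 = self.pool3(torch.relu(self.conv3(p2)))
        p5 = self.pool5(torch.relu(self.conv5(p3)))
        # skip-layer connection: concatenate the flattened pool-1, 2, 3, 5 outputs
        return torch.cat([t.flatten(1) for t in (p1, p2, p3, p5)], dim=1)

clips = torch.randn(3, 3, 16, 112, 112)          # one video: 3 clips of 16 frames
features = TinyC3D()(clips)                      # (3, poolall_num)
print(features.shape)
```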
In Step 3 above, the detailed normalization process is:
The mean E[x^(k)] and variance Var[x^(k)] of each dimension k of the high-dimensional feature vectors from Step 2 are computed over the entire training set, and each dimension of a feature vector is then standardized. The standardization formula is:

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

where x^(k) denotes an activation value and x̂^(k) the standardized value.
Then x̂^(k) is transformed by the following formula to obtain the new value y^(k) after scaling by γ^(k) and shifting by β^(k); y^(k) is the feature value after normalization:

y^(k) = γ^(k) · x̂^(k) + β^(k)

where γ^(k) and β^(k) are recurrent-neural-network parameters obtained by network learning.
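A minimal NumPy sketch of this two-step normalization, with gamma and beta fixed for illustration (in the invention they are learned); the small eps term is our addition for numerical safety:

```python
import numpy as np

def standardize(features, gamma, beta, eps=1e-5):
    """Per-dimension standardization followed by the learned affine transform.
    features: (num_samples, dim); gamma, beta: (dim,) scale and shift."""
    mean = features.mean(axis=0)                     # E[x^(k)] over the training set
    var = features.var(axis=0)                       # Var[x^(k)]
    x_hat = (features - mean) / np.sqrt(var + eps)   # standardized value
    return gamma * x_hat + beta                      # y^(k) = gamma^(k) * x_hat^(k) + beta^(k)

feats = 3.0 * np.random.randn(100, 8) + 1.0
y = standardize(feats, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per dimension
```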
In Step 4 above, feeding the normalized high-dimensional feature vectors from Step 3 into the recurrent neural network for feature fusion proceeds as follows: the normalized two-dimensional tensor (3, poolall_num) is fed into the recurrent neural network, whose time-step count is 3 and which contains one hidden layer of 256 neurons.
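A hedged PyTorch sketch of this fusion stage follows; the patent does not name the recurrent cell type, so a vanilla nn.RNN is assumed, and POOLALL_NUM and NUM_CLASSES are invented placeholders. The linear head feeding a softmax corresponds to the Step 5 classifier described next.

```python
import torch
import torch.nn as nn

POOLALL_NUM = 1024    # invented placeholder for the concatenated feature dimension
NUM_CLASSES = 101     # invented placeholder for the number of action categories

class FusionRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # one hidden layer of 256 neurons, unrolled over 3 time steps (one per segment)
        self.rnn = nn.RNN(input_size=POOLALL_NUM, hidden_size=256, batch_first=True)
        self.classifier = nn.Linear(256, NUM_CLASSES)   # linear layer feeding a softmax

    def forward(self, x):                 # x: (batch, 3, POOLALL_NUM)
        _, h_n = self.rnn(x)              # h_n: (1, batch, 256), the fused temporal feature
        return self.classifier(h_n.squeeze(0))

logits = FusionRNN()(torch.randn(4, 3, POOLALL_NUM))
probs = torch.softmax(logits, dim=1)      # multi-class softmax classification (Step 5)
print(probs.shape)                        # (4, NUM_CLASSES)
```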
In Step 5 above, a multi-class Softmax classifier performs linear classification on the output of the recurrent neural network from Step 4 (the classifier head in the sketch above plays this role).
After adopting the above scheme, the beneficial effects of the present invention are as follows:
(1) The C3D network extracts the spatio-temporal information of the video directly, so no motion information needs to be extracted from the video data in advance, and recognition is end to end.
(2) The feature information of different levels extracted by the convolution kernels is concatenated; compared with hand-designed low-level motion features extracted from the video, the low-level spatio-temporal information output by the convolution kernels is more robust and more complete.
(3) Concatenating the features of different levels in the feature extractor yields a high-dimensional feature vector containing information from different levels; this step noticeably improves recognition accuracy.
(4) Normalizing the high-dimensional feature vector accelerates network convergence.
(5) The recurrent neural network performs further temporal fusion on the normalized feature vectors, so the overall network structure can handle long video inputs.
Description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the network structure of the present invention;
Fig. 3 is a detail view of the recurrent neural network.
Detailed description of the embodiments
The technical scheme and beneficial effects of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the present invention provides a human behavior recognition method combining 3D skip-layer connections and a recurrent neural network; the detailed process is embodied in the following steps.
Video segmentation: a video is divided evenly by frame count into 3 parts, and 16 frames are extracted at equal intervals from each part to form a clip; a video is discarded if its total frame count is below 48, and if the total frame count is not divisible by 3, the last few frames are discarded.
After segmentation, a video is represented by the 5-dimensional tensor (3, 16, H, W, 3) and each 16-frame clip by the 4-dimensional tensor (16, H, W, 3), where the leading 3 indicates that the video is divided evenly into 3 parts, 16 is the number of frames extracted from each part, H and W are the height and width of a frame, and the trailing 3 is the number of channels (RGB pictures here).
The training-set videos are divided according to the above principle, after which each video in the training set is represented as a 5-dimensional tensor (3, 16, H, W, 3). Each video is then scaled to size 3 × 16 × 128 × 171 × 3, so each video is represented by the 5-dimensional tensor (3, 16, 128, 171, 3), where 16 is the number of frames per clip and 128, 171, 3 are the height, width, and channel count of each frame. A sketch of this scaling step follows.
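A short illustrative sketch of the scaling step, assuming OpenCV for the per-frame resize (the interpolation method is not specified in the patent):

```python
import cv2
import numpy as np

def scale_video(video, h=128, w=171):
    """Resize every frame of a (3, 16, H, W, 3) video to height h, width w."""
    out = np.empty((video.shape[0], video.shape[1], h, w, 3), dtype=np.float32)
    for s in range(video.shape[0]):                  # segments
        for f in range(video.shape[1]):              # frames per segment
            out[s, f] = cv2.resize(video[s, f], (w, h))  # cv2 expects (width, height)
    return out
```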
For each video, the 5-dimensional tensor (3, 16, 128, 171, 3) is converted into three 4-dimensional tensors of the form (16, 128, 171, 3). All training-set data are processed in this way, so each video consists of three consecutive 4-dimensional tensors (16, 128, 171, 3).
The mean over all 4-dimensional tensors (16, 128, 171, 3) of the training set is computed; the resulting mean is itself a 4-dimensional tensor mean = (16, 128, 171, 3). This mean is subtracted from every clip in the training set so that the pixel values in the training set are distributed around zero; this step reduces the influence of noise on classification.
For each video, the three consecutive mean-subtracted 4-dimensional tensors (16, 128, 171, 3) are converted back into a 5-dimensional tensor (3, 16, 128, 171, 3).
All video data in the training set are converted into this 5-dimensional representation (3, 16, 128, 171, 3), and the mean-subtracted 5-dimensional tensors are cropped to size (3, 16, 112, 112, 3), as sketched below.
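A NumPy sketch of the mean subtraction and cropping described above; the patent does not state whether the 112 × 112 crop is random or centered, so a center crop is shown, and the training clips here are random placeholders:

```python
import numpy as np

def center_crop(clips, size=112):
    """Crop (N, 16, 128, 171, 3) clips to (N, 16, size, size, 3) at the center."""
    _, _, H, W, _ = clips.shape
    top, left = (H - size) // 2, (W - size) // 2
    return clips[:, :, top:top + size, left:left + size, :]

train_clips = np.random.rand(30, 16, 128, 171, 3).astype(np.float32)  # placeholder data
mean = train_clips.mean(axis=0)        # the (16, 128, 171, 3) training-set mean
centered = train_clips - mean          # distributes pixel values around zero
cropped = center_crop(centered)        # (30, 16, 112, 112, 3)
```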
The processed videos are fed into the C3D feature extractor: for each video, its three 16-frame clips are fed in one after another, each as a 4-dimensional tensor (16, 112, 112, 3), and each feeding outputs a pool5num-dimensional vector, so the features of each video are finally represented by the two-dimensional tensor (3, pool5num), where pool5num is the output dimension of pooling layer 5 of the feature extractor.
For each video, the outputs of pooling layers 1, 2, 3, and 5 of the feature extractor are concatenated, as shown in Fig. 2. The concatenated high-dimensional feature is represented by the two-dimensional tensor (3, poolall_num), where poolall_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num, pool3num are the output dimensions of pooling layers 1, 2, and 3, respectively.
The entire training set is passed through the feature extractor and the concatenation operation, yielding the high-dimensional feature training data.
The high-dimensional feature training data are then fed into the recurrent neural network, as shown in Fig. 2. A normalization layer is applied before the recurrent neural network, also shown in Fig. 2; it is added to accelerate network convergence and improve the converged result.
The normalization operation consists of two steps. First, the features are standardized: the mean E[x^(k)] and variance Var[x^(k)] of each dimension of the high-dimensional feature are computed over the entire training set, and each activation input x^(k) is standardized, with x̂^(k) denoting the standardized value:

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

Second, so as not to reduce the expressive power of the feature vector, x̂^(k) is transformed by the following formula into the new value y^(k) after scaling by γ^(k) and shifting by β^(k); y^(k) is the feature value after normalization:

y^(k) = γ^(k) · x̂^(k) + β^(k)

where γ^(k) and β^(k) are obtained by network learning.
Using backpropagation, the recurrent-neural-network parameters and the parameters γ^(k), β^(k) are trained, yielding the trained network.
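A condensed, illustrative PyTorch training step under the same assumptions as the earlier sketches; nn.BatchNorm1d stands in for the normalization layer and supplies γ^(k) and β^(k), while the feature sizes, labels, and optimizer settings are placeholders:

```python
import torch
import torch.nn as nn

POOLALL_NUM, NUM_CLASSES = 1024, 101        # invented placeholder sizes

bn = nn.BatchNorm1d(POOLALL_NUM)            # learns gamma^(k), beta^(k)
rnn = nn.RNN(POOLALL_NUM, 256, batch_first=True)
head = nn.Linear(256, NUM_CLASSES)
params = list(bn.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.CrossEntropyLoss()             # cross-entropy over softmax outputs

x = torch.randn(8, 3, POOLALL_NUM)          # a batch of high-dimensional features
y = torch.randint(0, NUM_CLASSES, (8,))     # placeholder action labels
opt.zero_grad()
x = bn(x.reshape(-1, POOLALL_NUM)).reshape(8, 3, POOLALL_NUM)  # normalize every time step
_, h_n = rnn(x)                             # fuse the 3 time steps
loss = loss_fn(head(h_n.squeeze(0)), y)
loss.backward()                             # backpropagation trains RNN weights and gamma, beta
opt.step()
```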
To predict an input video, the video is divided evenly by frame count into 3 parts and 16 frames are extracted at equal intervals from each part to form a clip; the video is then represented by the 5-dimensional tensor (3, 16, H, W, 3).
The video to be predicted, (3, 16, H, W, 3), is first scaled to size (3, 16, 128, 171, 3); the mean mean = (16, 128, 171, 3) is then subtracted from each 16-frame clip, and each frame is cropped at the picture center, so the processed video to be predicted is represented by the 5-dimensional tensor (3, 16, 112, 112, 3).
The processed video to be predicted, (3, 16, 112, 112, 3), is converted into three 4-dimensional tensors (16, 112, 112, 3) and fed into the network one after another, giving the concatenated high-dimensional feature (3, poolall_num).
The high-dimensional feature (3, poolall_num) of the video to be predicted is fed into the trained BN layer and recurrent neural network to obtain the prediction output.
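Putting the inference steps together, a hedged end-to-end sketch that reuses the helper functions and modules from the earlier sketches (all of them illustrative stand-ins, not the patent's actual implementation):

```python
import numpy as np
import torch

def predict(frames, extractor, bn, rnn, head, mean):
    """frames: (T, H, W, 3) video array; extractor/bn/rnn/head: trained
    modules from the sketches above; mean: the (16, 128, 171, 3) training mean."""
    clips = sample_segments(frames)                    # (3, 16, H, W, 3)
    clips = scale_video(clips)                         # (3, 16, 128, 171, 3)
    clips = center_crop(clips - mean)                  # (3, 16, 112, 112, 3)
    x = torch.from_numpy(np.ascontiguousarray(clips)).float()
    x = x.permute(0, 4, 1, 2, 3)                       # channel-first for Conv3d
    with torch.no_grad():                              # bn and rnn assumed in eval mode
        feats = extractor(x)                           # (3, poolall_num) skip-layer features
        feats = bn(feats).unsqueeze(0)                 # normalize, add batch dimension
        _, h_n = rnn(feats)                            # temporal fusion over the 3 clips
        probs = torch.softmax(head(h_n.squeeze(0)), dim=1)
    return int(probs.argmax(dim=1))                    # predicted action category
```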
The above embodiment merely illustrates the technical idea of the present invention and does not limit the scope of protection of the invention; any change made on the basis of the technical solution in accordance with the technical idea proposed by the invention falls within the scope of protection of the present invention.
Claims (7)
1. A human behavior recognition method combining 3D skip-layer connections and a recurrent neural network, characterized by comprising the following steps:
Step 1: divide each video into N segments and extract L frames from each segment, where N and L are natural numbers;
Step 2: extract spatio-temporal features from the video with a trained 3D convolutional neural network, and concatenate the spatio-temporal features of different levels into a high-dimensional feature vector;
Step 3: normalize the high-dimensional feature vector obtained in Step 2;
Step 4: feed the normalized high-dimensional feature vector from Step 3 into a recurrent neural network for feature fusion;
Step 5: classify the fused features from Step 4 to obtain the action category of the video.
2. The human behavior recognition method combining 3D skip-layer connections and a recurrent neural network according to claim 1, characterized in that in Step 1 a video is discarded if its total frame count is below 48, and if the total frame count is not divisible by L, the last few frames are discarded.
3. The human behavior recognition method combining 3D skip-layer connections and a recurrent neural network according to claim 1, characterized in that in Step 1 dividing each video into N segments and extracting L frames from each segment means: a video is divided evenly by frame count into N = 3 parts, each part containing the same number of frames, and L = 16 frames are extracted at equal intervals from each part.
4. The human behavior recognition method combining 3D skip-layer connections and a recurrent neural network according to claim 2, characterized in that the detailed process of Step 2 is:
transfer learning: the convolution and pooling layers of a trained C3D network serve as the feature extractor; spatio-temporal features are extracted from each 16-frame input obtained in Step 1, yielding an output vector of pool5num dimensions; extracting spatio-temporal features from the whole video gives a result represented by the two-dimensional tensor (3, pool5num), where pool5num is the output dimension of pooling layer 5 of the feature extractor;
skip-layer concatenation: for each 16-frame input, the outputs of pooling layers 1, 2, 3, and 5 of the feature extractor are concatenated into a feature vector of poolall_num dimensions; applying this concatenation to the whole video gives a result represented by the two-dimensional tensor (3, poolall_num), where poolall_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num, pool3num are the output dimensions of pooling layers 1, 2, and 3, respectively.
5. The human behavior recognition method combining 3D skip-layer connections and a recurrent neural network according to claim 1, characterized in that in Step 3 the detailed normalization process is:
the mean E[x^(k)] and variance Var[x^(k)] of each dimension of the high-dimensional feature vectors from Step 2 are computed over the entire training set, and each dimension of a feature vector is then standardized; the standardization formula is:

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

where x^(k) denotes an activation value and x̂^(k) the standardized value;
then x̂^(k) is transformed by the following formula to obtain the new value y^(k) after scaling by γ^(k) and shifting by β^(k), and y^(k) is the feature value after normalization:

y^(k) = γ^(k) · x̂^(k) + β^(k)

where γ^(k) and β^(k) are recurrent-neural-network parameters obtained by network learning.
6. The human behavior recognition method combining 3D skip-layer connections and a recurrent neural network according to claim 1, characterized in that in Step 4 feeding the normalized high-dimensional feature vectors from Step 3 into the recurrent neural network for feature fusion means: the normalized two-dimensional tensor (3, poolall_num) is fed into the recurrent neural network, whose time-step count is 3 and which contains one hidden layer of 256 neurons.
7. The human behavior recognition method combining 3D skip-layer connections and a recurrent neural network according to claim 1, characterized in that in Step 5 a multi-class Softmax classifier performs linear classification on the output of the recurrent neural network from Step 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810394571.6A CN108647599B (en) | 2018-04-27 | 2018-04-27 | Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647599A true CN108647599A (en) | 2018-10-12 |
CN108647599B CN108647599B (en) | 2022-04-15 |
Family
ID=63747937
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810394571.6A Active CN108647599B (en) | 2018-04-27 | 2018-04-27 | Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647599B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017211395A1 (en) * | 2016-06-07 | 2017-12-14 | Toyota Motor Europe | Control device, system and method for determining the perceptual load of a visual and dynamic driving scene |
CN106599907A (en) * | 2016-11-29 | 2017-04-26 | 北京航空航天大学 | Multi-feature fusion-based dynamic scene classification method and apparatus |
CN107506712A (en) * | 2017-08-15 | 2017-12-22 | 成都考拉悠然科技有限公司 | Human behavior recognition method based on 3D deep convolutional networks |
CN107811626A (en) * | 2017-09-10 | 2018-03-20 | 天津大学 | Arrhythmia classification method based on a one-dimensional convolutional neural network and the S-transform |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109961037A (en) * | 2019-03-20 | 2019-07-02 | 中共中央办公厅电子科技学院(北京电子科技学院) | Abnormal behavior recognition method for examination room video surveillance |
CN109977854A (en) * | 2019-03-25 | 2019-07-05 | 浙江新再灵科技股份有限公司 | Abnormal behavior detection and analysis system for elevator surveillance environments |
CN110839156A (en) * | 2019-11-08 | 2020-02-25 | 北京邮电大学 | Future frame prediction method and model based on video image |
CN111460889A (en) * | 2020-02-27 | 2020-07-28 | 平安科技(深圳)有限公司 | Abnormal behavior identification method, device and equipment based on voice and image characteristics |
CN111460889B (en) * | 2020-02-27 | 2023-10-31 | 平安科技(深圳)有限公司 | Abnormal behavior recognition method, device and equipment based on voice and image characteristics |
CN112449155A (en) * | 2020-10-21 | 2021-03-05 | 苏州怡林城信息科技有限公司 | Video monitoring method and system for protecting privacy of personnel |
CN112863482A (en) * | 2020-12-31 | 2021-05-28 | 思必驰科技股份有限公司 | Speech synthesis method and system with rhythm |
Also Published As
Publication number | Publication date |
---|---|
CN108647599B (en) | 2022-04-15 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |