CN106650674B - A method for action recognition using deep convolutional features with a mixed pooling strategy - Google Patents
A method for action recognition using deep convolutional features with a mixed pooling strategy Download PDF Info
- Publication number
- CN106650674B CN106650674B CN201611229368.0A CN201611229368A CN106650674B CN 106650674 B CN106650674 B CN 106650674B CN 201611229368 A CN201611229368 A CN 201611229368A CN 106650674 B CN106650674 B CN 106650674B
- Authority
- CN
- China
- Prior art keywords
- time
- video
- depth
- space
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 43
- 230000009471 action Effects 0.000 title claims abstract description 21
- 238000010586 diagram Methods 0.000 claims abstract description 18
- 230000009467 reduction Effects 0.000 claims abstract description 14
- 238000004458 analytical method Methods 0.000 claims abstract description 10
- 238000012706 support-vector machine Methods 0.000 claims description 9
- 238000013527 convolutional neural network Methods 0.000 description 6
- 230000008569 process Effects 0.000 description 6
- 230000000694 effects Effects 0.000 description 5
- 238000011160 research Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 230000003542 behavioural effect Effects 0.000 description 2
- 238000013480 data collection Methods 0.000 description 2
- 230000007812 deficiency Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 239000012530 fluid Substances 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 238000000513 principal component analysis Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses an action recognition method using deep convolutional features with a mixed pooling strategy, comprising: 1) applying a spatial-stream deep network model to each frame of the input video to obtain per-frame appearance features, and applying a temporal-stream deep network model to every 10 consecutive frames of the video to extract motion features; 2) applying temporal filter pooling to the deep feature maps output by the last convolutional layer of the spatial-stream and temporal-stream deep networks to obtain a corresponding feature representation, then reducing its dimensionality with principal component analysis (PCA) to obtain a first descriptor feature; in parallel, applying spatio-temporal pyramid pooling to the same feature maps to obtain another feature representation, and reducing its dimensionality with PCA to obtain a second descriptor feature; 3) concatenating the first and second descriptor features from step 2) to form the feature descriptor of the input video, and classifying it with a linear support vector machine to obtain the recognition accuracy.
Description
Technical field
The present invention relates to the field of computer vision, and in particular to an action recognition method using deep convolutional features with a mixed pooling strategy.
Background art
With the development of science and technology, camera equipment has become ubiquitous, and enormous volumes of video data are produced as a result. Applications targeting this video have emerged accordingly: intelligent video surveillance, video classification, advanced human-computer interaction, and so on. In these applications, understanding human actions is the central problem and the focus of research.
Because human action recognition has great potential value, it has remained a research hotspot for at least a decade, and many methods have been proposed, for example methods based on dense trajectories (DT), methods based on spatio-temporal interest points, and methods based on convolutional neural networks (CNN). Among these, CNN-based techniques are the most numerous and currently achieve the best results. However, most deep CNNs treat each feature map as a whole and ignore the local information within it. Our action recognition research therefore targets a method based on multi-channel pyramid pooling of deep convolutional features, so as to extract the local information in deep features.
The main idea of CNN-based methods is as follows: first, multiple convolutional, pooling, and fully connected layers are applied to the video to extract its descriptor features; these features are then fed to a classifier to complete the final recognition process. Many researchers have explored and improved on this basis. Annane et al. proposed a two-stream convolutional network for action recognition, consisting of a spatial stream and a temporal stream: the spatial stream extracts appearance features of video frames, the temporal stream extracts motion features of consecutive frames, and the two are fused to boost recognition performance. Wang et al. fused deep convolutional features with hand-crafted features, learning the complementary advantages of the two feature types. The above methods all achieve good results, but existing research based on deep networks usually treats each deep feature map as a whole and ignores the local information within it, even though this cue helps improve the recognition accuracy of deep networks.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention provides an action recognition method using deep convolutional features with a mixed pooling strategy. The method takes videos from a video dataset as input, performs feature extraction and recognition, and finally outputs the classification result of each video. It is simple to implement and achieves good recognition performance.
To achieve the above object, the technical solution adopted by the present invention is as follows.
An action recognition method using deep convolutional features with a mixed pooling strategy comprises the following steps:
(1) Input the video to be recognized. Apply a spatial-stream deep network model to each frame of the input video to obtain the appearance features of every frame; at the same time, apply a temporal-stream deep network model to every 10 consecutive frames of the input video to obtain motion features. Both the spatial-stream and temporal-stream deep network models contain 5 convolutional layers, 3 pooling layers, and 3 fully connected layers.
(2) Apply temporal filter pooling to the deep feature maps output by the last convolutional layer of the spatial-stream and temporal-stream deep network models to obtain a corresponding feature representation; temporal sequences with intervals of different lengths capture both the global and the local motion of the video. Reduce the feature dimensionality with principal component analysis (PCA) to obtain the first descriptor feature.
At the same time, apply spatio-temporal pyramid pooling to the same deep feature maps to obtain another feature representation; a 4-level spatio-temporal pyramid structure captures the local information in the deep feature maps and is robust to target and geometric deformation. Likewise reduce the dimensionality with PCA to obtain the second descriptor feature.
(3) Concatenate the first and second descriptor features extracted in step (2) to form the final vector representation of the video. Classify the features with a support vector machine (SVM) and output the classification result, giving the action recognition result of the video; an accuracy of 90.8% is achieved on the UCF50 human action dataset.
The present invention is based on deep convolutional neural networks. By exploiting the local information and motion information in deep feature maps, it proposes a new deep convolutional feature based on a mixed pooling strategy, which effectively captures the local and motion information of feature maps at different scales and significantly improves action recognition accuracy.
Preferably, in step (1), the spatial-stream and temporal-stream deep network models take each video frame as input and apply multiple layers of convolution and pooling to the raw image; the output of every layer is a set of deep feature maps, forming increasingly abstract image features.
Preferably, in step (2), the feature maps output by the last convolutional layer of the spatial-stream and temporal-stream networks are subjected to temporal filter pooling. Specifically, filters with 4 different temporal intervals (1, 4, 8, 16) analyze the motion of the deep features in the time domain: interval 1 corresponds to temporal motion over the entire video range, i.e. global motion, while interval 16 corresponds to local temporal motion at the largest scale. For each interval, the deep features over the whole temporal extent of the video are divided into multiple time slices. For the features within each time slice, max pooling and sum pooling are applied simultaneously to obtain the most representative features of that slice, and the two pooling results are concatenated to represent the motion within the slice. PCA dimensionality reduction is then applied to the video features obtained after temporal filter pooling.
Preferably, in step (2), the multi-channel feature maps output by the last convolutional layer of the spatial-stream and temporal-stream networks are subjected to spatio-temporal pyramid pooling. Specifically, a 4-level spatio-temporal pyramid structure (1 × 1 × 1, 2 × 2 × 2, 3 × 3 × 3, 4 × 4 × 4) is applied to the feature maps: the first level (1 × 1 × 1) corresponds to the feature map over the entire temporal and spatial range, while the fourth level (4 × 4 × 4) corresponds to local spatio-temporal feature blocks at the largest scale. The pyramid structure thus yields the local blocks of the feature map at different temporal and spatial scales. Max pooling is applied to each local spatio-temporal block, taking the maximum value within the block as the feature representation of that block. Since the feature map of each channel extracts different image/video information, the features of the local blocks at the same spatio-temporal position across all channels are concatenated to form the multi-channel feature descriptor of that block. Finally, all spatio-temporal block features in the video are concatenated to form the feature representation of the video, and PCA dimensionality reduction is applied to the video features obtained after spatio-temporal pyramid pooling.
Preferably, in step (3), the two kinds of deep video features obtained after temporal filter pooling and spatio-temporal pyramid pooling are concatenated to obtain the final feature representation of the video. The features are classified with a support vector machine to obtain the action class label of the video.
Compared with the prior art, the present invention has the following advantages and effects:
1. The invention proposes a new descriptor feature that fully captures the motion information and local information at different scales, improving recognition performance.
2. The invention applies pooling jointly to the same region of the feature maps across different channels, capturing different aspects of that region, such as edges or texture.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it. In addition, the technical features involved in the various embodiments described below can be combined with each other as long as they do not conflict.
The accompanying drawing shows the operating process of the invention. As shown, an action recognition method using deep convolutional features with a mixed pooling strategy comprises the following steps:
(1) Input the video to be recognized. Apply a spatial-stream deep network model to each frame of the input video to obtain the appearance features of every frame; at the same time, apply a temporal-stream deep network model to every 10 consecutive frames of the input video to obtain motion features. Both the spatial-stream and temporal-stream deep network models contain 5 convolutional layers, 3 pooling layers, and 3 fully connected layers.
(2) Apply temporal filter pooling to the deep feature maps output by the last convolutional layer of the spatial-stream and temporal-stream network models to obtain a corresponding feature representation; temporal sequences with intervals of different lengths capture the global and local motion of the video, and the feature dimensionality is reduced with principal component analysis.
(3) Apply spatio-temporal pyramid pooling to the deep feature maps output by the last convolutional layer of the spatial-stream and temporal-stream network models to obtain a corresponding feature representation; a 4-level spatio-temporal pyramid structure captures the local information in the deep feature maps and is robust to target and geometric deformation. Likewise, the dimensionality is reduced with principal component analysis.
(4) Concatenate the descriptor features extracted in steps (2) and (3) to form the final vector representation of the video. Classify the features with a support vector machine (SVM), output the classification result, and predict the action class label of the video; an accuracy of 90.8% is achieved on the UCF50 human action dataset.
Further, the detailed process of step (1) is as follows: the spatial-stream and temporal-stream deep network models take each video frame as input and apply multiple layers of convolution and pooling to the raw image; the output of every layer is a set of deep feature maps, forming increasingly abstract image features.
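As an illustration of the step (1) inputs, the sketch below assembles the two stream inputs for a toy video. The 20-channel layout for the temporal stream (horizontal and vertical optical flow of 10 consecutive frames) follows the common two-stream convention and is an assumption, as are all array shapes and the random placeholder data:

```python
import numpy as np

# Hypothetical toy video: T frames of H x W RGB (random stand-in values).
T, H, W = 25, 224, 224
rng = np.random.default_rng(0)
video = rng.random((T, H, W, 3)).astype(np.float32)

# Spatial-stream input: one RGB frame, channels-first (3 x H x W).
spatial_input = video[0].transpose(2, 0, 1)

# Temporal-stream input: horizontal/vertical optical flow of 10 consecutive
# frames stacked along the channel axis (2 x 10 = 20 channels). Real flow
# would come from an optical-flow estimator; random arrays stand in here.
flow = rng.random((10, 2, H, W)).astype(np.float32)
temporal_input = flow.reshape(20, H, W)

print(spatial_input.shape)   # (3, 224, 224)
print(temporal_input.shape)  # (20, 224, 224)
```

Each network then applies its convolution and pooling layers to these tensors to produce the deep feature maps used in steps (2) and (3).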
The detailed process of step (2) is as follows: the feature maps output by the last convolutional layer of the spatial-stream and temporal-stream networks are subjected to temporal filter pooling. Filters with 4 different temporal intervals (1, 4, 8, 16) analyze the motion of the deep features in the time domain: interval 1 corresponds to temporal motion over the entire video range, i.e. global motion, while interval 16 corresponds to local temporal motion at the largest scale. For each interval, the deep features over the whole temporal extent of the video are divided into multiple time slices; for the features within each time slice, max pooling and sum pooling are applied simultaneously to obtain the most representative features of the slice, and the two pooling results are concatenated to represent the motion within the slice. PCA dimensionality reduction is then applied to the video features obtained after temporal filter pooling.
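The temporal filter pooling described above can be sketched as follows. The patent gives no implementation, so this is a minimal NumPy interpretation in which each interval value is read as the number of time slices (1 slice = global motion, 16 slices = the finest local scale); the feature dimension and frame count are illustrative:

```python
import numpy as np

def temporal_filter_pool(feats, num_slices=(1, 4, 8, 16)):
    """Pool a sequence of per-frame feature vectors at several temporal scales.

    feats: (T, D) array of last-conv-layer features over time.
    For each scale k the sequence is split into k time slices; each slice is
    summarized by its max-pooled and sum-pooled vectors, concatenated.
    """
    T, _ = feats.shape
    parts = []
    for k in num_slices:
        bounds = np.linspace(0, T, k + 1).astype(int)
        for s, e in zip(bounds[:-1], bounds[1:]):
            sl = feats[s:e]
            parts.append(sl.max(axis=0))  # max pooling over the slice
            parts.append(sl.sum(axis=0))  # sum pooling over the slice
    return np.concatenate(parts)

# 32 frames of 512-dim features (random stand-ins for conv features).
feats = np.random.default_rng(0).random((32, 512)).astype(np.float32)
desc = temporal_filter_pool(feats)
# (1 + 4 + 8 + 16) slices x 2 pooling types x 512 dims = 29696
print(desc.shape)  # (29696,)
```

The resulting descriptor would then be reduced with PCA as the text describes.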
The detailed process of step (3) is as follows: the multi-channel feature maps output by the last convolutional layer of the spatial-stream and temporal-stream networks are subjected to spatio-temporal pyramid pooling, using a 4-level spatio-temporal pyramid structure (1 × 1 × 1, 2 × 2 × 2, 3 × 3 × 3, 4 × 4 × 4). The first level (1 × 1 × 1) corresponds to the feature map over the entire temporal and spatial range, while the fourth level (4 × 4 × 4) corresponds to local spatio-temporal feature blocks at the largest scale; the pyramid structure thus yields the local blocks of the feature map at different temporal and spatial scales. Max pooling is applied to each local spatio-temporal block, taking the maximum value within the block as the feature representation of that block. Since the feature map of each channel extracts different image/video information, the features of the local blocks at the same spatio-temporal position across all channels are concatenated to form the multi-channel feature descriptor of that block. Finally, all spatio-temporal block features in the video are concatenated to form the feature representation of the video, and PCA dimensionality reduction is applied to the video features obtained after spatio-temporal pyramid pooling.
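A minimal NumPy sketch of this spatio-temporal pyramid pooling is given below. The (C, T, H, W) layout, channel count, and map sizes are assumptions for illustration; the patent only fixes the 4-level block structure and per-block max pooling:

```python
import numpy as np

def st_pyramid_pool(maps, levels=(1, 2, 3, 4)):
    """Spatio-temporal pyramid max pooling of conv feature maps.

    maps: (C, T, H, W) per-channel feature maps stacked over time.
    At level n the (T, H, W) volume is split into n x n x n blocks; each
    block is max-pooled per channel, and the C channel responses of a block
    stay together, so each block contributes a C-dim local descriptor.
    """
    C, T, H, W = maps.shape
    parts = []
    for n in levels:
        tb = np.linspace(0, T, n + 1).astype(int)
        hb = np.linspace(0, H, n + 1).astype(int)
        wb = np.linspace(0, W, n + 1).astype(int)
        for t0, t1 in zip(tb[:-1], tb[1:]):
            for h0, h1 in zip(hb[:-1], hb[1:]):
                for w0, w1 in zip(wb[:-1], wb[1:]):
                    block = maps[:, t0:t1, h0:h1, w0:w1]
                    parts.append(block.max(axis=(1, 2, 3)))  # per-channel max
    return np.concatenate(parts)

# 256 channels over 8 time steps of 13 x 13 maps (random stand-ins).
maps = np.random.default_rng(0).random((256, 8, 13, 13)).astype(np.float32)
desc = st_pyramid_pool(maps)
# (1 + 8 + 27 + 64) blocks x 256 channels = 25600
print(desc.shape)  # (25600,)
```

Again, PCA would then be applied to the concatenated descriptor as described.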
The detailed process of step (4) is as follows: the two kinds of deep video features obtained after temporal filter pooling and spatio-temporal pyramid pooling are concatenated to obtain the final feature representation of the video. The features are classified with a support vector machine to obtain the action class label of the video.
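The final assembly of step (4) can be sketched as follows, with PCA implemented via SVD. The descriptor widths, the target dimensionality of 16, and the number of videos are arbitrary placeholders, and the random arrays stand in for the pooled features of a real dataset:

```python
import numpy as np

def pca_reduce(X, dim):
    """PCA via SVD: project row vectors X (N, D) onto `dim` principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)   # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

rng = np.random.default_rng(0)
N = 20                               # hypothetical number of videos
tf_desc = rng.random((N, 29696))     # temporal-filter-pooled descriptors
sp_desc = rng.random((N, 25600))     # pyramid-pooled descriptors

# Reduce each descriptor with PCA, then concatenate into the final vector.
final = np.hstack([pca_reduce(tf_desc, 16), pca_reduce(sp_desc, 16)])
print(final.shape)  # (20, 32)
```

In the method as claimed, these final vectors would then be fed to a linear SVM (e.g. scikit-learn's `LinearSVC`) to predict each video's action class.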
Obviously, the above embodiments are merely examples given to clearly illustrate the present invention and are not a limitation on its implementation. For those of ordinary skill in the art, other variations or changes in different forms can be made on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modifications, equivalent replacements, and improvements made within the spirit and principle of the invention shall fall within the protection scope of the claims of the present invention.
Claims (3)
1. An action recognition method using deep convolutional features with a mixed pooling strategy, characterized by comprising the following steps:
(1) inputting the video to be recognized, applying a spatial-stream deep network model to each frame of the input video to obtain the appearance features of every frame, and at the same time applying a temporal-stream deep network model to every N consecutive frames of the input video to obtain motion features, wherein both the spatial-stream and temporal-stream deep network models contain 5 convolutional layers, 3 pooling layers, and 3 fully connected layers;
(2) applying temporal filter pooling to the deep feature maps output by the last convolutional layer of the spatial-stream and temporal-stream deep network models to obtain a corresponding feature representation, using temporal sequences with intervals of different lengths to capture the global and local motion of the video, and reducing the feature dimensionality with principal component analysis to obtain a first descriptor feature;
at the same time, applying spatio-temporal pyramid pooling to the deep feature maps output by the last convolutional layer of the spatial-stream and temporal-stream deep network models to obtain a corresponding feature representation, using a 4-level spatio-temporal pyramid structure to capture the local information in the deep feature maps, which is robust to target and geometric deformation, and likewise reducing the dimensionality with principal component analysis to obtain a second descriptor feature;
(3) concatenating the first and second descriptor features extracted in step (2) to form the final vector representation of the video, classifying the features with a support vector machine (SVM), and outputting the classification result to obtain the action recognition result of the video;
in said step (2), the feature maps output by the last convolutional layer of the spatial-stream and temporal-stream deep networks are subjected to temporal filter pooling, specifically using filters with 4 different temporal intervals (1, 4, 8, 16) to analyze the motion of the deep features in the time domain, wherein interval 1 corresponds to temporal motion over the entire video range, i.e. global motion, and interval 16 corresponds to local temporal motion at the largest scale; for each interval, the deep features over the whole temporal extent of the video are divided into multiple time slices; for the features within each time slice, max pooling and sum pooling are applied simultaneously to obtain the most representative features of the slice, and the two pooling results are concatenated to represent the motion within the slice; PCA dimensionality reduction is then applied to the video features obtained after temporal filter pooling;
in said step (2), the multi-channel feature maps output by the last convolutional layer of the spatial-stream and temporal-stream deep networks are subjected to spatio-temporal pyramid pooling, specifically using a 4-level spatio-temporal pyramid structure (1 × 1 × 1, 2 × 2 × 2, 3 × 3 × 3, 4 × 4 × 4), wherein the first level (1 × 1 × 1) corresponds to the feature map over the entire temporal and spatial range and the fourth level (4 × 4 × 4) corresponds to local spatio-temporal feature blocks at the largest scale, so that the pyramid structure yields the local blocks of the feature map at different temporal and spatial scales; max pooling is applied to each local spatio-temporal block, taking the maximum value within the block as the feature representation of that block; since the feature map of each channel extracts different image/video information, the features of the local blocks at the same spatio-temporal position across all channels are concatenated to form the multi-channel feature descriptor of that block; finally, all spatio-temporal block features in the video are concatenated to form the feature representation of the video, and PCA dimensionality reduction is applied to the video features obtained after spatio-temporal pyramid pooling.
2. The action recognition method using deep convolutional features with a mixed pooling strategy according to claim 1, characterized in that in said step (1), the spatial-stream and temporal-stream deep network models take each video frame as input and apply multiple layers of convolution and pooling to the raw image, the output of every layer being a set of deep feature maps that form increasingly abstract image features.
3. The action recognition method using deep convolutional features with a mixed pooling strategy according to claim 1, characterized in that in said step (3), the two kinds of deep video features obtained after temporal filter pooling and spatio-temporal pyramid pooling are concatenated to obtain the final feature representation of the video, and the features are classified with a support vector machine to obtain the action class label of the video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611229368.0A CN106650674B (en) | 2016-12-27 | 2016-12-27 | A method for action recognition using deep convolutional features with a mixed pooling strategy
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611229368.0A CN106650674B (en) | 2016-12-27 | 2016-12-27 | A method for action recognition using deep convolutional features with a mixed pooling strategy
Publications (2)
Publication Number | Publication Date |
---|---|
CN106650674A CN106650674A (en) | 2017-05-10 |
CN106650674B true CN106650674B (en) | 2019-09-10 |
Family
ID=58832925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611229368.0A Active CN106650674B (en) | 2016-12-27 | 2016-12-27 | A method for action recognition using deep convolutional features with a mixed pooling strategy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106650674B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108305240B (en) * | 2017-05-22 | 2020-04-28 | 腾讯科技(深圳)有限公司 | Image quality detection method and device |
CN107609460B (en) * | 2017-05-24 | 2021-02-02 | 南京邮电大学 | Human body behavior recognition method integrating space-time dual network flow and attention mechanism |
CN107292247A (en) * | 2017-06-05 | 2017-10-24 | 浙江理工大学 | A human behavior recognition method and device based on residual networks |
CN107437083B (en) * | 2017-08-16 | 2020-09-22 | 广西荷福智能科技有限公司 | Self-adaptive pooling video behavior identification method |
CN107944488B (en) * | 2017-11-21 | 2018-12-11 | 清华大学 | Long time series data processing method based on stratification depth network |
CN108416795B (en) * | 2018-03-04 | 2022-03-18 | 南京理工大学 | Video action identification method based on sorting pooling fusion space characteristics |
CN108647625A (en) * | 2018-05-04 | 2018-10-12 | 北京邮电大学 | A kind of expression recognition method and device |
CN109308444A (en) * | 2018-07-16 | 2019-02-05 | 重庆大学 | An abnormal behavior recognition method for indoor environments |
CN110032942B (en) * | 2019-03-15 | 2021-10-08 | 中山大学 | Action identification method based on time domain segmentation and feature difference |
CN110163286B (en) * | 2019-05-24 | 2021-05-11 | 常熟理工学院 | Hybrid pooling-based domain adaptive image classification method |
CN111460876B (en) * | 2019-06-05 | 2021-05-25 | 北京京东尚科信息技术有限公司 | Method and apparatus for identifying video |
CN112241673B (en) * | 2019-07-19 | 2022-11-22 | 浙江商汤科技开发有限公司 | Video processing method and device, electronic equipment and storage medium |
CN110991617B (en) * | 2019-12-02 | 2020-12-01 | 华东师范大学 | Construction method of kaleidoscope convolution network |
CN111325149B (en) * | 2020-02-20 | 2023-05-26 | 中山大学 | Video action recognition method based on time sequence association model of voting |
CN113111822B (en) * | 2021-04-22 | 2024-02-09 | 深圳集智数字科技有限公司 | Video processing method and device for congestion identification and electronic equipment |
CN113536683B (en) * | 2021-07-21 | 2024-01-12 | 北京航空航天大学 | Feature extraction method based on fusion of artificial features and convolution features of deep neural network |
CN113537164B (en) * | 2021-09-15 | 2021-12-07 | 江西科技学院 | Real-time action time sequence positioning method |
CN114926905B (en) * | 2022-05-31 | 2023-12-26 | 江苏濠汉信息技术有限公司 | Cable accessory procedure discriminating method and system based on gesture recognition with glove |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8345984B2 (en) * | 2010-01-28 | 2013-01-01 | Nec Laboratories America, Inc. | 3D convolutional neural networks for automatic human action recognition |
CN103164694B (en) * | 2013-02-20 | 2016-06-01 | 上海交通大学 | A human action recognition method |
CN103927561B (en) * | 2014-04-29 | 2017-02-22 | 东南大学 | Behavior recognition method based on probability fusion and dimensionality reduction technology |
CN104268568B (en) * | 2014-09-17 | 2018-03-23 | 电子科技大学 | Activity recognition method based on Independent subspace network |
CN105354528A (en) * | 2015-07-15 | 2016-02-24 | 中国科学院深圳先进技术研究院 | Depth image sequence based human body action identification method and system |
CN105678216A (en) * | 2015-12-21 | 2016-06-15 | 中国石油大学(华东) | Spatio-temporal data stream video behavior recognition method based on deep learning |
CN105894045B (en) * | 2016-05-06 | 2019-04-26 | 电子科技大学 | A recognition method using a deep network model based on spatial pyramid pooling |
-
2016
- 2016-12-27 CN CN201611229368.0A patent/CN106650674B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106650674A (en) | 2017-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106650674B (en) | A method for action recognition using deep convolutional features with a mixed pooling strategy | |
Ullah et al. | Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications | |
CN106845329A (en) | An action recognition method based on multi-channel pyramid pooling of deep convolutional features | |
CN108734290B (en) | Convolutional neural network construction method based on attention mechanism and application | |
CN111523462B (en) | Video sequence expression recognition system and method based on self-attention enhanced CNN | |
CN105956517B (en) | An action recognition method based on dense trajectories | |
CN112784798A (en) | Multi-modal emotion recognition method based on feature-time attention mechanism | |
CN109767456A (en) | A target tracking method based on the SiameseFC framework and a PFP neural network | |
CN103971095B (en) | Large-scale facial expression recognition method based on multiscale LBP and sparse coding | |
CN112597985B (en) | Crowd counting method based on multi-scale feature fusion | |
CN109858407B (en) | Video behavior recognition method based on multiple information flow characteristics and asynchronous fusion | |
CN110390952A (en) | City sound event classification method based on bicharacteristic 2-DenseNet parallel connection | |
CN103955682B (en) | Activity recognition method and device based on SURF points of interest | |
CN113011504A (en) | Virtual reality scene emotion recognition method based on visual angle weight and feature fusion | |
CN111387974A (en) | Electroencephalogram feature optimization and epileptic seizure detection method based on depth self-coding | |
CN109389035A (en) | Low latency video actions detection method based on multiple features and frame confidence score | |
CN106778444A (en) | An expression recognition method based on multi-view convolutional neural networks | |
CN110458235A (en) | Movement posture similarity comparison method in a kind of video | |
CN109635812A (en) | The example dividing method and device of image | |
Dar et al. | Efficient-SwishNet based system for facial emotion recognition | |
CN114863572B (en) | Myoelectric gesture recognition method of multi-channel heterogeneous sensor | |
CN113192076B (en) | MRI brain tumor image segmentation method combining classification prediction and multi-scale feature extraction | |
CN103345623B (en) | A kind of Activity recognition method based on robust relative priority | |
CN105956604A (en) | Action identification method based on two layers of space-time neighborhood characteristics | |
CN112801009B (en) | Facial emotion recognition method, device, medium and equipment based on double-flow network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |