CN105740773A - Deep learning and multi-scale information based behavior identification method - Google Patents

Deep learning and multi-scale information based behavior identification method

Info

Publication number
CN105740773A
CN105740773A CN201610047682.0A
Authority
CN
China
Prior art keywords
video
seg
coarseness
frequency band
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610047682.0A
Other languages
Chinese (zh)
Other versions
CN105740773B (en)
Inventor
刘智
冯欣
张杰
张杰慧
张凌
黄智勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN201610047682.0A priority Critical patent/CN105740773B/en
Publication of CN105740773A publication Critical patent/CN105740773A/en
Application granted granted Critical
Publication of CN105740773B publication Critical patent/CN105740773B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning and multi-scale information based behavior identification method. The method constructs multiple deep networks arranged in a parallel structure to study human behavior recognition in depth video: the depth video is first split into multiple video segments, the parallel branch neural networks then learn from the segments, the high-level representations learned by the branch networks are fused by concatenation, and finally the fused high-level representation is passed to a fully connected layer and a classification layer for classification and recognition. The deep learning method can effectively perform behavior recognition; in particular, when the behaviors differ greatly from one another, the recognition rate is significantly improved, and real-time performance is good.

Description

Behavior recognition method based on deep learning and multi-scale information
Technical field
The present invention relates to the field of human behavior recognition, and in particular to a behavior recognition method based on deep learning and multi-scale information.
Background technology
With the maturation of hardware such as computers and cameras, and the growing requirements of social management, research on human behavior recognition has drawn increasing attention from computer vision researchers and is widely applied in automatic surveillance, event detection, human-machine interfaces, video retrieval, and other fields. Traditional human behavior recognition methods first extract features from each video describing human behavior, such as Histograms of Oriented Gradients (HOG) and Motion History Images (MHI), and then classify the extracted features with classifiers such as support vector machines or random forests. Such computational approaches to human behavior recognition have achieved many excellent results, but some hard problems remain: the extracted features are task-specific and do not generalize easily to other data, and the computational cost is too high to achieve real-time performance.
Deep learning can automatically extract multi-layer feature representations hidden in data, and deep learning research based on convolutional neural networks (CNNs) has achieved great success in image classification, recognition, localization, segmentation, and so on. However, the convolution used in image processing is a two-dimensional operation and cannot be applied directly to the three-dimensional videos that describe human behavior.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a behavior recognition method based on deep learning and multi-scale information. The deep learning approach can effectively perform behavior recognition; in particular, when the behaviors differ greatly from one another, the recognition rate is significantly improved. The invention also generalizes well: it can be trained on a large data set and then applied to behavior recognition domains that lack training data, greatly reducing the time overhead of behavior recognition with good real-time performance.
The present invention takes depth video data as the object of study. By building a CNN-based deep neural network structure and fusing multi-scale information such as global human behavior information and local hand motion, it uses traditional two-dimensional CNNs to study three-dimensional human behavior recognition.
The present invention builds multiple deep networks arranged in a parallel structure to study human behavior recognition in depth video. The depth video is first split into multiple video segments, which are then learned by the parallel branch neural networks. The high-level representations learned by the branch networks are fused: the data vectors of all branch networks are concatenated into a one-dimensional vector for input to the subsequent fully connected layer. Finally, the fused high-level representation is passed to the fully connected layer and the classification layer for classification and recognition. Meanwhile, because most behaviors in the MSRDailyActivity3D data set differ only subtly in hand motion, such as reading, writing, using a laptop, and playing a game, the present invention proposes fusing multi-scale information: the coarse-grained global behavior information together with the fine-grained hand motion.
The object of the present invention is achieved as follows: a behavior recognition method based on deep learning and multi-scale information, comprising the steps of:
(1) Establishing a training data set; the coarse-grained global behavior videos in the training data set are selected from the MSRDailyActivity3D data set.
(2) Building a deep neural network model with several parallel deep convolutional neural networks;
(3) Segmenting the coarse-grained global behavior videos in the training data set with a set stride L_Stride, where each segment has length L_Seg, yielding N_Seg coarse-grained video segment matrices; the number of segments is N_Seg = 1 + (N_F - L_Seg) / L_Stride, where N_F is the frame count of the coarse-grained global behavior video;
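The segment count formula above can be sketched in a few lines of code. This is an illustrative sketch only: the function and variable names are not from the patent, and it assumes (N_F - L_Seg) divides evenly by the stride.

```python
# Sketch of the segmentation step: split an N_F-frame video into N_Seg
# overlapping segments of length L_Seg taken at stride L_Stride.

def segment_video(frames, seg_len, stride):
    """Split a sequence of frames into fixed-length segments.

    N_Seg = 1 + (N_F - L_Seg) / L_Stride, assuming exact division.
    """
    n_frames = len(frames)
    n_seg = 1 + (n_frames - seg_len) // stride
    return [frames[i * stride : i * stride + seg_len] for i in range(n_seg)]

# With the patent's experimental values N_F = 192, L_Seg = 16, L_Stride = 16:
segments = segment_video(list(range(192)), seg_len=16, stride=16)
assert len(segments) == 12              # N_Seg = 1 + (192 - 16) / 16 = 12
assert all(len(s) == 16 for s in segments)
```

With a stride smaller than the segment length, adjacent segments overlap, which the stride experiment later in the description exploits.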
(4) Obtaining a fine-grained local behavior video from the coarse-grained global behavior video of step (3), and segmenting the fine-grained local behavior video with the same method as step (3) to obtain N_Seg fine-grained video segment matrices. Each frame of a fine-grained video segment matrix has the same size as each frame of a coarse-grained video segment matrix. The fine-grained local behavior video is cropped from each frame of the coarse-grained global behavior video. The fine-grained local behavior may be hand motion, or a detail motion of another body part. The fine-grained video is obtained as follows: centered on the left-hand joint in each frame of the coarse-grained global behavior video, a frame of size W/4 × H/4 is cropped, forming a new video of size N_F × W/4 × H/4; this video is the fine-grained hand motion video, where W, H, and N_F are the width and height of the original depth video frames and the number of frames in the video, respectively. This size matches the size of the coarse-grained video after down-sampling.
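The hand-centered crop described in this step can be sketched as follows. Frames are plain nested lists for illustration; the function name, joint coordinates, and clamping behavior at the frame border are assumptions, not details from the patent.

```python
# Minimal sketch of the fine-grained crop: take a W/4 x H/4 window
# centred on the left-hand joint in every frame.

def crop_hand_video(video, joints, crop_w, crop_h):
    """video: list of frames (each a list of rows); joints: list of (x, y)."""
    out = []
    for frame, (cx, cy) in zip(video, joints):
        h, w = len(frame), len(frame[0])
        # Clamp the window so it stays fully inside the frame.
        x0 = min(max(cx - crop_w // 2, 0), w - crop_w)
        y0 = min(max(cy - crop_h // 2, 0), h - crop_h)
        out.append([row[x0:x0 + crop_w] for row in frame[y0:y0 + crop_h]])
    return out

# With W = H = 128 the crop is 32 x 32, matching a 1/4-down-sampled frame.
video = [[[0] * 128 for _ in range(128)] for _ in range(4)]   # 4 test frames
joints = [(64, 64)] * 4                                       # hand at centre
hand = crop_hand_video(video, joints, 128 // 4, 128 // 4)
assert len(hand) == 4 and len(hand[0]) == 32 and len(hand[0][0]) == 32
```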
(5) Feeding the N_Seg coarse-grained video segment matrices from step (3) and the N_Seg fine-grained video segment matrices from step (4) in parallel into the deep neural network model built in step (2), which has 2N_Seg parallel deep convolutional neural networks, for training;
(6) Applying steps (3) and (4) to a coarse-grained global behavior video to be recognized, obtaining N_Seg coarse-grained video segment matrices and N_Seg fine-grained video segment matrices, and feeding them in parallel into the trained deep neural network model from step (5) for behavior recognition. The coarse-grained global behavior video to be recognized is a preprocessed video.
The deep neural network of step (2) uses convolutional neural networks as building blocks, with one classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer. Each parallel deep convolutional neural network comprises, connected in sequence: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a third pooling layer, a first fully connected layer, a second fully connected layer, and a classification layer.
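The branch topology (three conv/pool stages followed by two fully connected layers and a classifier) can be checked with a shape walk-through. The patent's actual kernel and pooling sizes live in its Table 1, which is not reproduced here, so the sizes below are assumptions chosen only to show how a 32 × 32 input shrinks layer by layer.

```python
# Shape walk-through for one parallel branch, under assumed layer sizes.

def conv_out(size, kernel, stride=1, pad=0):
    # Standard "valid"-style convolution output size.
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, window):
    # Non-overlapping pooling.
    return size // window

size = 32                        # one down-sampled 32 x 32 frame
for kernel in (5, 3, 3):         # assumed kernel sizes for conv layers 1-3
    size = pool_out(conv_out(size, kernel), 2)   # conv, then 2 x 2 pooling
assert size == 2                 # spatial size entering the first FC layer
```

The same arithmetic is what pins down how many units the first fully connected layer must accept.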
Each frame of the coarse-grained global behavior video in step (3) is down-sampled before segmentation, which serves to: 1. reduce the amount of computation; 2. make the frame size of the coarse-grained video segment matrices identical to that of the fine-grained video segment matrices, simplifying network input.
The coarse-grained global behavior video is a depth video.
The coarse-grained global behavior videos in the training data set are preprocessed videos, as is the coarse-grained global behavior video to be recognized. The preprocessing is as follows: first, interpolation is used to normalize all videos in the data set to a unified length, taken as the median of all video lengths. Second, the background is removed, keeping only the person-centered portion of the video, and the video is resized to a fixed size. Third, the min-max method is used to normalize the x, y, and z coordinate values of all videos to the range [0, 1]. Finally, all samples are horizontally flipped to form new samples, doubling the training samples in the data set.
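Two of the preprocessing operations, min-max normalization to [0, 1] and horizontal flipping for augmentation, can be sketched directly; length interpolation and background removal are omitted for brevity, and all names are illustrative.

```python
# Sketch of min-max normalization and flip augmentation from the
# preprocessing step above.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def flip_horizontal(frame):
    """frame: list of rows; mirror each row left-to-right."""
    return [row[::-1] for row in frame]

depths = [500, 750, 1000]                 # raw z values (e.g. millimetres)
assert min_max_normalize(depths) == [0.0, 0.5, 1.0]

frame = [[1, 2, 3],
         [4, 5, 6]]
assert flip_horizontal(frame) == [[3, 2, 1], [6, 5, 4]]

# Augmentation: originals plus their flips doubles the training set.
samples = [frame]
augmented = samples + [flip_horizontal(f) for f in samples]
assert len(augmented) == 2 * len(samples)
```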
A behavior recognition method based on deep learning and multi-scale information, comprising the steps of:
(1) Establishing a training data set; the depth videos in the training data set are selected from the MSRDailyActivity3D data set;
(2) Building a deep neural network model with several parallel deep convolutional neural networks;
(3) Segmenting the behavior videos in the training data set with a set stride L_Stride, where each segment has length L_Seg, yielding N_Seg video segment matrices; the number of segments is N_Seg = 1 + (N_F - L_Seg) / L_Stride, where N_F is the frame count of the depth video;
(4) Feeding the N_Seg video segment matrices from step (3) in parallel into the deep neural network model built in step (2), which has N_Seg parallel deep convolutional neural networks, for training;
(5) Applying step (3) to a behavior video to be recognized, obtaining N_Seg video segment matrices, and feeding them in parallel into the trained deep neural network model for behavior recognition. The behavior video to be recognized is a preprocessed video.
The deep neural network of step (2) uses convolutional neural networks as building blocks, with one classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer.
The behavior video is a depth video.
The behavior videos in the training data set are preprocessed, as is the behavior video to be recognized. The preprocessing is as follows: first, interpolation is used to normalize all videos in the data set to a unified length, taken as the median of all video lengths. Second, the background is removed, keeping only the person-centered portion of the video, and the video is resized to a fixed size. Third, the min-max method is used to normalize the x, y, and z coordinate values of all videos to the range [0, 1]. Finally, all samples are horizontally flipped to form new samples, doubling the training samples in the data set.
The beneficial effects of the invention are as follows: the invention obtains coarse-grained and fine-grained video matrices, trains the designed parallel deep convolutional neural networks, and uses the trained deep neural network for behavior classification and recognition. This gives the invention good generalization: it can be trained on a large data set and then applied to behavior recognition domains that lack training data.
The invention designs a parallel deep convolutional neural network; by feeding the behavior video segments in parallel, the time overhead of behavior recognition is greatly reduced, and real-time performance is good.
The invention takes depth video as the object of study; depth video describes object geometry and is insensitive to lighting and color.
Experiments and results show that the CNN-based deep learning method proposed by the present invention can effectively recognize human behaviors represented by depth video. On the MSRDailyActivity3D data set, the average recognition rate for the five clearly distinct behaviors (lying on the sofa, walking, playing guitar, standing up, and sitting down) is 98%, and the recognition rate over all behaviors in the whole data set is 60.625%.
The invention is further described below in conjunction with the drawings and specific embodiments.
Brief description of the drawings
Fig. 1 is a schematic block diagram of the behavior recognition method based on deep learning and multi-scale information of the present invention;
Fig. 2 shows behavior videos from MSRDailyActivity3D before preprocessing (top: drinking; bottom: writing);
Fig. 3 shows behavior videos from MSRDailyActivity3D after preprocessing (top: drinking; bottom: writing).
Detailed description of the invention
Embodiment one
Referring to Fig. 1, a behavior recognition method based on deep learning and multi-scale information comprises the steps of:
(1) Establishing a training data set. The coarse-grained global behavior videos in the training data set are selected from the MSRDailyActivity3D data set and are preprocessed videos, as is the coarse-grained global behavior video to be recognized. The preprocessing is as follows: first, interpolation is used to normalize all videos in the data set to a unified length, taken as the median of all video lengths; second, the background is removed, keeping only the person-centered portion of the video, and the video is resized to a fixed size; third, the min-max method is used to normalize the x, y, and z coordinate values of all videos to the range [0, 1]; finally, all samples are horizontally flipped to form new samples, doubling the training samples in the data set.
(2) Building a deep neural network model with several parallel deep convolutional neural networks. The deep neural network of step (2) uses convolutional neural networks as building blocks, with one classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer. The classification layer of the present invention uses a softmax classifier. Each parallel deep convolutional neural network of this embodiment comprises, connected in sequence: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a third pooling layer, a first fully connected layer, a second fully connected layer, and a classification layer.
(3) Segmenting the coarse-grained global behavior videos in the training data set with a set stride L_Stride, where each segment has length L_Seg, yielding N_Seg coarse-grained video segment matrices; the number of segments is N_Seg = 1 + (N_F - L_Seg) / L_Stride, where N_F is the frame count of the coarse-grained global behavior video. Each frame of the coarse-grained global behavior video is down-sampled before segmentation, which serves to (a) reduce the amount of computation and (b) make the frame size of the coarse-grained video segment matrices identical to that of the fine-grained video segment matrices, simplifying network input. The object of study, the coarse-grained global behavior video, is a depth video.
(4) Obtaining a fine-grained local behavior video from the coarse-grained global behavior video of step (3), and segmenting it with the same method as step (3) to obtain N_Seg fine-grained video segment matrices. Each frame of a fine-grained video segment matrix has the same size as each frame of a coarse-grained video segment matrix. The fine-grained local behavior video is cropped from each frame of the coarse-grained global behavior video. The fine-grained local behavior may be hand motion, or a detail motion of another body part; it is determined by the concrete application. The detail motions in this data set are concentrated in the hands, but if the detail motion occurred at another body part, that part's detail motion could be chosen instead. In this embodiment, frames of the set size are cropped centered on the hand joint of each frame of the coarse-grained global behavior video, forming a fine-grained local behavior video of N_F frames.
(5) Feeding the N_Seg coarse-grained video segment matrices from step (3) and the N_Seg fine-grained video segment matrices from step (4) in parallel into the deep neural network model built in step (2), which has 2N_Seg parallel deep convolutional neural networks, for training;
(6) Applying steps (3) and (4) to a coarse-grained global behavior video to be recognized, obtaining N_Seg coarse-grained video segment matrices and N_Seg fine-grained video segment matrices, and feeding them in parallel into the trained deep neural network model for behavior recognition. In this embodiment, the first N_Seg networks process the coarse-grained video and the last N_Seg networks process the fine-grained video.
Embodiment two
This embodiment discloses a behavior recognition method based on deep learning and multi-scale information that uses only the coarse-grained global behavior information for recognition. It comprises the steps of:
(1) Establishing a training data set. The depth videos in the training data set are selected from the MSRDailyActivity3D data set. The behavior videos in the training data set are preprocessed, as is the behavior video to be recognized. The preprocessing is as follows: first, interpolation is used to normalize all videos in the data set to a unified length, taken as the median of all video lengths; second, the background is removed, keeping only the person-centered portion of the video, and the video is resized to a fixed size; third, the min-max method is used to normalize the x, y, and z coordinate values of all videos to the range [0, 1]; finally, all samples are horizontally flipped to form new samples, doubling the training samples in the data set.
(2) Referring to Fig. 1, building a deep neural network model with several parallel deep convolutional neural networks. The deep neural network uses convolutional neural networks as building blocks, with one classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer.
The classification layer of the present invention uses a softmax classifier.
(3) Segmenting the depth videos in the training data set with a set stride L_Stride, where each segment has length L_Seg, yielding N_Seg video segment matrices; the number of segments is N_Seg = 1 + (N_F - L_Seg) / L_Stride, where N_F is the frame count of the depth video;
(4) Feeding the N_Seg video segment matrices from step (3) in parallel into the deep neural network model built in step (2), which has N_Seg parallel deep convolutional neural networks, for training;
(5) Applying step (3) to a depth video to be recognized, obtaining N_Seg video segment matrices, and feeding them in parallel into the trained deep neural network model for behavior recognition.
The experimental procedure of the present invention is as follows: assume the normalized video representing one behavior has size N_F × W × H (192 × 128 × 128 in the present invention), where W and H are the width and height of a video frame.
(1) Segment the behavior video of N_F frames with stride L_Stride, where each segment has length L_Seg; the number of segments is then N_Seg = 1 + (N_F - L_Seg) / L_Stride. Then down-sample each video frame by 1/4, forming after segmentation a video segment matrix of size N_Seg × L_Seg × W/4 × H/4;
(2) Centered on the left-hand joint of each frame of the depth video, crop frames of size W/4 × H/4 to form a new video of size N_F × W/4 × H/4, and apply the same method as step (1) to the new video to obtain a video segment matrix of size N_Seg × L_Seg × W/4 × H/4;
(3) Fuse the video segment matrices of steps (1) and (2) to obtain a video segment matrix of size 2N_Seg × L_Seg × W/4 × H/4. This matrix is the input of the deep network; that is, the network has 2N_Seg parallel deep convolutional neural networks, and the input of each deep neural network is a video of size L_Seg × W/4 × H/4.
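The shape arithmetic of steps (1)-(3) can be verified directly with the patent's stated values; the variable names are illustrative.

```python
# Shape arithmetic for the fused input, using the patent's experimental
# values: N_F = 192, W = H = 128, L_Seg = L_Stride = 16.

N_F, W, H = 192, 128, 128
L_SEG, L_STRIDE = 16, 16

n_seg = 1 + (N_F - L_SEG) // L_STRIDE          # 12 coarse-grained segments
coarse = (n_seg, L_SEG, W // 4, H // 4)        # after 1/4 down-sampling
fine = (n_seg, L_SEG, W // 4, H // 4)          # hand crop is already W/4 x H/4
fused = (2 * n_seg,) + coarse[1:]              # concatenate along branch axis

assert coarse == (12, 16, 32, 32)
assert fine == coarse
assert fused == (24, 16, 32, 32)               # 24 parallel branches, each fed
                                               # one 16 x 32 x 32 segment
```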
(4) Train the parallel deep convolutional neural networks on the training data set, then test human behavior recognition on the test data set; the two sets are completely disjoint. The data set was recorded from 10 subjects: the behavior videos performed by subjects {1, 3, 5, 7, 9} are used for training, and those performed by subjects {2, 4, 6, 8, 10} are used for testing.
Assuming L_Seg = 16 and L_Stride = 16, the deep neural network architecture requires 24 parallel networks, and the input of each network is a video segment sequence of 16 × 32 × 32; that is, each segment contains 16 frames of video, and each frame is 32 × 32.
Table 1: Deep networks used in the present invention and their parameters
Experiment and discussion
1. Data set and preprocessing
The present invention uses the MSRDailyActivity3D data set collected by Microsoft with a Kinect device. The data set covers 16 behaviors common in daily life: drinking, eating, reading, making a phone call, writing, using a laptop, using a vacuum cleaner, cheering, standing still, tearing paper, playing a game, lying on the sofa, walking, playing guitar, standing up, and sitting down. Each behavior is performed by each subject in two different ways: sitting on the sofa or standing. The whole data set contains 320 behavior videos. Fig. 2 shows some behavior samples from the data set. The data set records the human behavior together with the surrounding environment, the extracted depth information contains a great deal of noise, and most behaviors in the data set differ only in subtle local details, as shown in Figs. 2 and 3, which makes it very challenging.
For preprocessing, each video was processed simply. First, interpolation was used to normalize all videos in the data set to a unified length, taken as the median of all video lengths. Second, the background was removed, keeping only the person-centered portion, and the videos were resized to a fixed size, as shown in Fig. 3. Third, the min-max method was used to normalize the x, y, and z coordinate values of all videos to [0, 1]. Finally, all samples were horizontally flipped to form new samples, doubling the training samples in the data set. The experiments of the present invention were written on the Torch platform [20], with a learning rate of 1 × 10^-4 and the platform's built-in softmax loss function.
2. HAR based on multi-scale information fusion and deep learning
The present invention uses the 2CNN2F network of Table 1, taking multi-scale information, namely the coarse-grained global behavior video and the fine-grained hand motion sequence, as the input of the deep network. In this experiment the stride L_Stride and the segment length L_Seg are both set to 16; the 12 × 16 × 32 × 32 global behavior sequence extracted from the whole video and the 12 × 16 × 32 × 32 local hand motion sequence are fused into a 24 × 16 × 32 × 32 input video matrix. Table 2 compares the recognition performance of the proposed method with other methods on the MSRDailyActivity3D data set, where 2CNN2F uses only the coarse-grained global behavior information and 2CNN2F+Joint denotes the multi-scale information fusion method of the present invention. As the table shows, the recognition accuracy of the inventive method is 60.625%; using only the coarse-grained global behavior information, the recognition rate drops slightly to 56.875%, which is comparable to traditional hand-crafted feature extraction methods. Notably, if only behaviors 11-16 (playing a game, lying on the sofa, walking, playing guitar, standing up, and sitting down) are recognized, the recognition rate reaches 98%. This is likely because behaviors 11-16 differ greatly from one another, while the differences among many of the other behaviors in the data set are very subtle; for example, reading, writing, and using a laptop differ only slightly in hand motion. The experimental results show that the deep learning method can effectively perform behavior recognition, and that the recognition rate improves significantly when the behaviors differ greatly from one another.
Table 2: Recognition performance of the inventive method compared with other methods on the MSRDailyActivity3D data set
Algorithm Recognition rate
LOP features[8] 42.5%
Joint Position features[8] 68%
Dynamic Temporal Warping[21] 54%
2CNN2F 56.875%
2CNN2F+Joint 60.625%
3. Effect of network depth on recognition
To probe the effect of network depth on recognition, the present invention also builds neural networks containing 3 and 4 CNN layers, namely 3CNN2F_8 and 4CNN2F (see Table 3); the network parameters are as shown in Table 1. Because the network depth increases, to keep the network from overfitting this experiment uses video sequences of 24 × 8 × 128 × 128 as the neural network input: the normalized 192 × 128 × 128 video is split with a stride of 8 into 24 video segments of 8 × 128 × 128, which are fed simultaneously into the neural network with 24 parallel branches. As Table 3 shows, the recognition rate with the 3CNN2F_8 network is 52.5%, and that of 4CNN2F is 58.75%. The experimental results show that increasing network depth can effectively improve the behavior recognition rate.
Table 3: Parameter configurations and recognition rates of the different networks
4. Effect of the split stride on recognition
To examine the effect of the split stride on recognition, the present invention builds two networks with different inputs based on the 3CNN2F architecture: 3CNN2F_8 and 3CNN2F_4. The input of 3CNN2F_8 is a video sequence of 24 × 8 × 128 × 128, while the input of 3CNN2F_4 has size 47 × 8 × 128 × 128; that is, the normalized 192 × 128 × 128 video is split with a stride of 4 into 47 video segments of 8 × 128 × 128, with a 4-frame overlap between adjacent segments. The experimental results are shown in Table 3. With a stride of 8 the recognition accuracy is 52.5%; with a stride of 4 it is 56.875%. The recognition rate improves markedly, mainly because reducing the stride causes two changes. First, the smaller the stride, the more video segments are produced, so the deep network needs more parallel branches and becomes wider, with more parameters, and the generalization ability of the network improves. Second, the smaller stride and the larger number of split segments also increase the training data, so the network trains better.
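The segment counts behind the two configurations follow from the same formula used throughout; the helper name is illustrative.

```python
# Segment counts for the stride experiment: a 192-frame normalised video
# split into 8-frame segments at strides 8 and 4.

def n_segments(n_frames, seg_len, stride):
    return 1 + (n_frames - seg_len) // stride

assert n_segments(192, 8, 8) == 24    # 3CNN2F_8 input: 24 x 8 x 128 x 128
assert n_segments(192, 8, 4) == 47    # 3CNN2F_4 input: 47 x 8 x 128 x 128
# At stride 4, adjacent segments overlap by seg_len - stride = 4 frames.
```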
In view of deep video has the insensitive feature describing object geometry and light, color, the present invention is with deep video for object of study, adopt traditional two-dimentional CNN (convolutional neural networks) to build deep neural network model, the behavior in MSRDailyActivity3D data set is carried out Classification and Identification.Experiment and result show, the human body behavior represented with deep video can effectively be identified by the degree of depth learning method based on CNN that this article proposes, in MSRDailyActivity3D data set behavior difference comparatively significantly lie down sofa, five behaviors of walking, play guitar, stand and sit down average recognition rate be 98%, the discrimination to behaviors all on whole data set is 60.625%.The discrimination how improving degree of depth study has also been carried out certain explorative experiment by the present invention simultaneously.Research finds to split the reduction of video-frequency band step-length, merges coarseness and fine-grained video information, suitably increases network depth and all can be effectively improved the discrimination of degree of depth network.
The present invention is not limited to the above embodiment; technical solutions with minor modifications that do not depart from the spirit of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. An activity recognition method based on deep learning and multi-scale information, characterized in that it comprises the following steps:
(1) establishing a training data set;
(2) building a deep neural network model having several parallel deep convolutional neural networks;
(3) choosing a coarse-grained global behavior video from the training data set and segmenting it with a set step length LStride, wherein the length of each segment is set to LSeg; NSeg coarse-grained video segment matrices are formed after segmentation, the number of segments being NSeg=1+(NF-LSeg)/LStride, where NF is the frame number of the coarse-grained global behavior video;
(4) obtaining a fine-grained local behavior video from the coarse-grained global behavior video of step (3), and segmenting the fine-grained local behavior video by the same method as step (3) to obtain NSeg fine-grained video segment matrices;
(5) feeding the NSeg coarse-grained video segment matrices obtained in step (3) and the NSeg fine-grained video segment matrices obtained in step (4) in parallel into the deep neural network model built in step (2), which has 2NSeg parallel deep convolutional neural networks, for training;
(6) choosing a coarse-grained global behavior video to be identified, performing steps (3) and (4) on it to obtain NSeg coarse-grained video segment matrices and NSeg fine-grained video segment matrices respectively, and feeding the obtained NSeg coarse-grained video segment matrices and NSeg fine-grained video segment matrices in parallel into the trained deep neural network model obtained in step (5) to carry out activity recognition.
2. The activity recognition method based on deep learning and multi-scale information according to claim 1, characterized in that: the deep neural network in step (2) takes convolutional neural networks as building blocks and has one classification layer, at least one convolutional layer, at least one pooling layer and at least one fully connected layer.
3. The activity recognition method based on deep learning and multi-scale information according to claim 1, characterized in that: each frame of the coarse-grained global behavior video in step (3) is down-sampled before segmentation, so that the size of each frame of the coarse-grained video segment matrices is identical to the size of each frame of the fine-grained video segment matrices.
4. The activity recognition method based on deep learning and multi-scale information according to claim 1, characterized in that: the coarse-grained global behavior video is a depth video.
5. The activity recognition method based on deep learning and multi-scale information according to claim 1 or 4, characterized in that: the coarse-grained global behavior video in the training data set is a video that has undergone preprocessing, and the coarse-grained global behavior video to be identified is a video that has undergone preprocessing.
6. The activity recognition method based on deep learning and multi-scale information according to claim 1, characterized in that: the fine-grained local behavior video is obtained by intercepting, in each frame of the coarse-grained global behavior video, the local region where the behavior sequence is concentrated.
7. An activity recognition method based on deep learning and multi-scale information, characterized in that it comprises the following steps:
(1) establishing a training data set;
(2) building a deep neural network model having several parallel deep convolutional neural networks;
(3) choosing a behavior video from the training data set and segmenting it with a set step length LStride, wherein the length of each segment is set to LSeg; NSeg video segment matrices are formed after segmentation, the number of segments being NSeg=1+(NF-LSeg)/LStride, where NF is the frame number of the depth video;
(4) feeding the NSeg video segment matrices obtained in step (3) in parallel into the deep neural network model built in step (2), which has NSeg parallel deep convolutional neural networks, for training;
(5) choosing a behavior video to be identified, performing step (3) on it to obtain NSeg video segment matrices, and feeding the obtained NSeg video segment matrices in parallel into the trained deep neural network model to carry out activity recognition.
8. The activity recognition method based on deep learning and multi-scale information according to claim 7, characterized in that: the deep neural network in step (2) takes convolutional neural networks as building blocks and has one classification layer, at least one convolutional layer, at least one pooling layer and at least one fully connected layer.
9. The activity recognition method based on deep learning and multi-scale information according to claim 7, characterized in that: the behavior video is a depth video.
10. The activity recognition method based on deep learning and multi-scale information according to claim 7 or 9, characterized in that: the behavior video in the training data set is a video that has undergone preprocessing, and the behavior video to be identified is a video that has undergone preprocessing.
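The segmentation in step (3) of the claims can be sketched as follows. This is a minimal numpy illustration under the formula NSeg=1+(NF-LSeg)/LStride, assuming a video array of shape (NF, H, W); it is not the patented implementation:

```python
import numpy as np

def split_video(video, seg_len, stride):
    """Split a (n_frames, H, W) video into overlapping segment matrices of
    shape (seg_len, H, W); N_seg = 1 + (n_frames - seg_len) / stride."""
    n_frames = video.shape[0]
    n_seg = 1 + (n_frames - seg_len) // stride
    return [video[i * stride:i * stride + seg_len] for i in range(n_seg)]

# Example: a normalized 192-frame 128x128 video, 8-frame segments, step length 4.
video = np.zeros((192, 128, 128), dtype=np.float32)
segments = split_video(video, seg_len=8, stride=4)
print(len(segments))      # 47 segment matrices
print(segments[0].shape)  # (8, 128, 128)
```

Each of the resulting segment matrices would be fed to its own parallel deep convolutional branch of the model built in step (2).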
CN201610047682.0A 2016-01-25 2016-01-25 Activity recognition method based on deep learning and multi-scale information Expired - Fee Related CN105740773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610047682.0A CN105740773B (en) 2016-01-25 2016-01-25 Activity recognition method based on deep learning and multi-scale information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610047682.0A CN105740773B (en) 2016-01-25 2016-01-25 Activity recognition method based on deep learning and multi-scale information

Publications (2)

Publication Number Publication Date
CN105740773A true CN105740773A (en) 2016-07-06
CN105740773B CN105740773B (en) 2019-02-01

Family

ID=56247501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610047682.0A Expired - Fee Related CN105740773B (en) 2016-01-25 2016-01-25 Activity recognition method based on deep learning and multi-scale information

Country Status (1)

Country Link
CN (1) CN105740773B (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203503A (en) * 2016-07-08 2016-12-07 天津大学 A kind of action identification method based on skeleton sequence
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106504266A (en) * 2016-09-29 2017-03-15 北京市商汤科技开发有限公司 The Forecasting Methodology of walking behavior and device, data processing equipment and electronic equipment
CN106778576A (en) * 2016-12-06 2017-05-31 中山大学 A kind of action identification method based on SEHM feature graphic sequences
CN106951872A (en) * 2017-03-24 2017-07-14 江苏大学 A kind of recognition methods again of the pedestrian based on unsupervised depth model and hierarchy attributes
CN107066979A (en) * 2017-04-18 2017-08-18 重庆邮电大学 A kind of human motion recognition method based on depth information and various dimensions convolutional neural networks
WO2018019126A1 (en) * 2016-07-29 2018-02-01 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic apparatus
CN107837087A (en) * 2017-12-08 2018-03-27 兰州理工大学 A kind of human motion state recognition methods based on smart mobile phone
CN107886117A (en) * 2017-10-30 2018-04-06 国家新闻出版广电总局广播科学研究院 The algorithm of target detection merged based on multi-feature extraction and multitask
CN108038107A (en) * 2017-12-22 2018-05-15 东软集团股份有限公司 Sentence sensibility classification method, device and its equipment based on convolutional neural networks
CN108182441A (en) * 2017-12-29 2018-06-19 华中科技大学 Parallel multichannel convolutive neural network, construction method and image characteristic extracting method
CN108182416A (en) * 2017-12-30 2018-06-19 广州海昇计算机科技有限公司 A kind of Human bodys' response method, system and device under monitoring unmanned scene
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN108524209A (en) * 2018-03-30 2018-09-14 江西科技师范大学 Blind-guiding method, system, readable storage medium storing program for executing and mobile terminal
CN108664931A (en) * 2018-05-11 2018-10-16 中国科学技术大学 A kind of multistage video actions detection method
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN109214375A (en) * 2018-11-07 2019-01-15 浙江大学 A kind of embryo's pregnancy outcome prediction meanss based on block sampling video features
CN109558805A (en) * 2018-11-06 2019-04-02 南京邮电大学 Human bodys' response method based on multilayer depth characteristic
CN109657546A (en) * 2018-11-12 2019-04-19 平安科技(深圳)有限公司 Video behavior recognition methods neural network based and terminal device
CN110119760A (en) * 2019-04-11 2019-08-13 华南理工大学 A kind of sequence classification method based on the multiple dimensioned Recognition with Recurrent Neural Network of stratification
CN110163127A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 A kind of video object Activity recognition method from thick to thin
CN110222587A (en) * 2019-05-13 2019-09-10 杭州电子科技大学 A kind of commodity attribute detection recognition methods again based on characteristic pattern
CN110222598A (en) * 2019-05-21 2019-09-10 平安科技(深圳)有限公司 A kind of video behavior recognition methods, device, storage medium and server
CN110321963A (en) * 2019-07-09 2019-10-11 西安电子科技大学 Based on the hyperspectral image classification method for merging multiple dimensioned multidimensional sky spectrum signature
CN111242110A (en) * 2020-04-28 2020-06-05 成都索贝数码科技股份有限公司 Training method of self-adaptive conditional random field algorithm for automatically breaking news items
WO2020244279A1 (en) * 2019-06-05 2020-12-10 北京京东尚科信息技术有限公司 Method and device for identifying video

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101866429A (en) * 2010-06-01 2010-10-20 中国科学院计算技术研究所 Training method of multi-moving object action identification and multi-moving object action identification method
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN103593464A (en) * 2013-11-25 2014-02-19 华中科技大学 Video fingerprint detecting and video sequence matching method and system based on visual features
CN104299012A (en) * 2014-10-28 2015-01-21 中国科学院自动化研究所 Gait recognition method based on deep learning


Non-Patent Citations (2)

Title
WANQING LI et al.: "Action Recognition Based on A Bag of 3D Points", 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops *
LI Ruifeng et al.: "A Survey of Human Action and Behavior Recognition Research", Pattern Recognition and Artificial Intelligence *

Cited By (41)

Publication number Priority date Publication date Assignee Title
CN106203503A (en) * 2016-07-08 2016-12-07 天津大学 A kind of action identification method based on skeleton sequence
CN106203503B (en) * 2016-07-08 2019-04-05 天津大学 A kind of action identification method based on bone sequence
WO2018019126A1 (en) * 2016-07-29 2018-02-01 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic apparatus
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106228240B (en) * 2016-07-30 2020-09-01 复旦大学 Deep convolution neural network implementation method based on FPGA
CN106504266A (en) * 2016-09-29 2017-03-15 北京市商汤科技开发有限公司 The Forecasting Methodology of walking behavior and device, data processing equipment and electronic equipment
CN106504266B (en) * 2016-09-29 2019-06-14 北京市商汤科技开发有限公司 The prediction technique and device of walking behavior, data processing equipment and electronic equipment
US10817714B2 (en) 2016-09-29 2020-10-27 Beijing Sensetime Technology Development Co., Ltd Method and apparatus for predicting walking behaviors, data processing apparatus, and electronic device
CN106778576A (en) * 2016-12-06 2017-05-31 中山大学 A kind of action identification method based on SEHM feature graphic sequences
CN106778576B (en) * 2016-12-06 2020-05-26 中山大学 Motion recognition method based on SEHM characteristic diagram sequence
CN106951872B (en) * 2017-03-24 2020-11-06 江苏大学 Pedestrian re-identification method based on unsupervised depth model and hierarchical attributes
CN106951872A (en) * 2017-03-24 2017-07-14 江苏大学 A kind of recognition methods again of the pedestrian based on unsupervised depth model and hierarchy attributes
CN107066979A (en) * 2017-04-18 2017-08-18 重庆邮电大学 A kind of human motion recognition method based on depth information and various dimensions convolutional neural networks
CN107886117A (en) * 2017-10-30 2018-04-06 国家新闻出版广电总局广播科学研究院 The algorithm of target detection merged based on multi-feature extraction and multitask
CN107837087A (en) * 2017-12-08 2018-03-27 兰州理工大学 A kind of human motion state recognition methods based on smart mobile phone
CN108038107B (en) * 2017-12-22 2021-06-25 东软集团股份有限公司 Sentence emotion classification method, device and equipment based on convolutional neural network
CN108038107A (en) * 2017-12-22 2018-05-15 东软集团股份有限公司 Sentence sensibility classification method, device and its equipment based on convolutional neural networks
CN108182441A (en) * 2017-12-29 2018-06-19 华中科技大学 Parallel multichannel convolutive neural network, construction method and image characteristic extracting method
CN108182441B (en) * 2017-12-29 2020-09-18 华中科技大学 Parallel multichannel convolutional neural network, construction method and image feature extraction method
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN108182416A (en) * 2017-12-30 2018-06-19 广州海昇计算机科技有限公司 A kind of Human bodys' response method, system and device under monitoring unmanned scene
CN108524209A (en) * 2018-03-30 2018-09-14 江西科技师范大学 Blind-guiding method, system, readable storage medium storing program for executing and mobile terminal
CN108664931B (en) * 2018-05-11 2022-03-01 中国科学技术大学 Multi-stage video motion detection method
CN108664931A (en) * 2018-05-11 2018-10-16 中国科学技术大学 A kind of multistage video actions detection method
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN109558805A (en) * 2018-11-06 2019-04-02 南京邮电大学 Human bodys' response method based on multilayer depth characteristic
CN109214375B (en) * 2018-11-07 2020-11-24 浙江大学 Embryo pregnancy result prediction device based on segmented sampling video characteristics
CN109214375A (en) * 2018-11-07 2019-01-15 浙江大学 A kind of embryo's pregnancy outcome prediction meanss based on block sampling video features
CN109657546A (en) * 2018-11-12 2019-04-19 平安科技(深圳)有限公司 Video behavior recognition methods neural network based and terminal device
CN110119760B (en) * 2019-04-11 2021-08-10 华南理工大学 Sequence classification method based on hierarchical multi-scale recurrent neural network
CN110119760A (en) * 2019-04-11 2019-08-13 华南理工大学 A kind of sequence classification method based on the multiple dimensioned Recognition with Recurrent Neural Network of stratification
CN110163127A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 A kind of video object Activity recognition method from thick to thin
CN110222587A (en) * 2019-05-13 2019-09-10 杭州电子科技大学 A kind of commodity attribute detection recognition methods again based on characteristic pattern
WO2020232886A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Video behavior identification method and apparatus, storage medium and server
CN110222598A (en) * 2019-05-21 2019-09-10 平安科技(深圳)有限公司 A kind of video behavior recognition methods, device, storage medium and server
WO2020244279A1 (en) * 2019-06-05 2020-12-10 北京京东尚科信息技术有限公司 Method and device for identifying video
US11967134B2 (en) 2019-06-05 2024-04-23 Beijing Jingdong Shangke Information Technology Co., Ltd. Method and device for identifying video
CN110321963A (en) * 2019-07-09 2019-10-11 西安电子科技大学 Based on the hyperspectral image classification method for merging multiple dimensioned multidimensional sky spectrum signature
CN110321963B (en) * 2019-07-09 2022-03-04 西安电子科技大学 Hyperspectral image classification method based on fusion of multi-scale and multi-dimensional space spectrum features
CN111242110B (en) * 2020-04-28 2020-08-14 成都索贝数码科技股份有限公司 Training method of self-adaptive conditional random field algorithm for automatically breaking news items
CN111242110A (en) * 2020-04-28 2020-06-05 成都索贝数码科技股份有限公司 Training method of self-adaptive conditional random field algorithm for automatically breaking news items

Also Published As

Publication number Publication date
CN105740773B (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN105740773A (en) Deep learning and multi-scale information based behavior identification method
Jia et al. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot
Parvathi et al. Detection of maturity stages of coconuts in complex background using Faster R-CNN model
Jia et al. Apple harvesting robot under information technology: A review
CN109829443A (en) Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks
CN109034210A (en) Object detection method based on super Fusion Features Yu multi-Scale Pyramid network
CN105574510A (en) Gait identification method and device
CN110008842A (en) A kind of pedestrian's recognition methods again for more losing Fusion Model based on depth
CN107977671A (en) A kind of tongue picture sorting technique based on multitask convolutional neural networks
CN107316058A (en) Improve the method for target detection performance by improving target classification and positional accuracy
CN109271888A (en) Personal identification method, device, electronic equipment based on gait
CN107527351A (en) A kind of fusion FCN and Threshold segmentation milking sow image partition method
CN109241871A (en) A kind of public domain stream of people's tracking based on video data
CN108898620A (en) Method for tracking target based on multiple twin neural network and regional nerve network
Burie et al. ICFHR2016 competition on the analysis of handwritten text in images of balinese palm leaf manuscripts
CN103164694A (en) Method for recognizing human motion
CN109508675A (en) A kind of pedestrian detection method for complex scene
CN106650804B (en) A kind of face sample cleaning method and system based on deep learning feature
CN110135502A (en) A kind of image fine granularity recognition methods based on intensified learning strategy
CN113470076B (en) Multi-target tracking method for yellow feather chickens in flat raising chicken house
CN114387499A (en) Island coastal wetland waterfowl identification method, distribution query system and medium
CN107808376A (en) A kind of detection method of raising one's hand based on deep learning
Lv et al. A visual identification method for the apple growth forms in the orchard
CN107729363A (en) Based on GoogLeNet network model birds population identifying and analyzing methods
CN109871905A (en) A kind of plant leaf identification method based on attention mechanism depth model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190201

Termination date: 20220125

CF01 Termination of patent right due to non-payment of annual fee