CN112560668A - Human behavior identification method based on scene prior knowledge

Human behavior identification method based on scene prior knowledge

Info

Publication number
CN112560668A
CN112560668A
Authority
CN
China
Prior art keywords
scene
prior knowledge
video
human behavior
human
Prior art date
2020-12-14
Legal status
Pending
Application number
CN202011470438.8A
Other languages
Chinese (zh)
Inventor
袁家斌
刘昕
王天星
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
2020-12-14
Filing date
2020-12-14
Publication date
2021-03-26
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202011470438.8A
Publication of CN112560668A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 Recognition of whole body movements, e.g. for sport training


Abstract

The invention discloses a human behavior recognition method based on scene prior knowledge, comprising the following steps: preprocessing an input video; establishing an indoor scene–human behavior prior knowledge base; establishing a video scene recognition model and a human behavior recognition model M; and performing scene prediction on the input video, then fusing the corresponding scene prior knowledge into the human behavior recognition network model M based on the scene recognition result to obtain the human behavior classification. The method fully exploits the correlation between scenes and human activity, optimizes the objective function by converting the prior knowledge into constraints on the weights of the behavior recognition model, and effectively improves human behavior recognition in video.

Description

Human behavior identification method based on scene prior knowledge
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a human behavior identification method based on scene prior knowledge.
Background
In recent years, the growth of video platforms has made video one of the most widely used forms of data, and understanding human behavior in video has attracted increasing attention; recognizing human behavior across different environments using the elements contained in video remains a challenging task. In the real world, human behavior is closely related to the scene in which it occurs: elements of the scene, such as its objects, environmental structure and scene attributes, influence the behavior of the subject. While some actions are relatively scene-independent, in a particular scene a subject may only be able to perform particular behaviors.
Mainstream human behavior recognition methods in the prior art optimize a deep neural network, or improve how human features are extracted and how the extracted features are processed. A typical two-stream network learns the spatial features of single frames with a spatial stream channel and the motion information of optical flow images with a temporal stream channel, and finally fuses the two channels by their softmax scores to obtain the behavior classification. A 3D convolutional network exploits the fact that 3D convolution is better suited to learning spatio-temporal features, and fully fuses low-dimensional and high-dimensional human body features. These methods effectively improve the accuracy of human behavior recognition, but they focus on the single task of human behavior recognition and ignore the objective relation between the scene in which a human body is located and the human subject.
Regarding the relation between scene context and human behavior, some human behavior recognition methods feed human features and scene features into different channels and extract and analyze them separately; such methods do not fully exploit the constraint that human activities are limited by the scene in which they take place.
Disclosure of Invention
The invention provides a human behavior recognition method based on scene prior knowledge, in order to solve the low accuracy of behavior recognition caused in the prior art by neglecting the association between scenes and human behaviors.
To achieve this purpose, the invention adopts the following technical scheme:
a human behavior identification method based on scene prior knowledge comprises the following steps:
s1, preprocessing an input video;
S2, establishing an indoor scene–human behavior prior knowledge base;
S3, establishing a video scene recognition model and a human behavior recognition model;
s4, model training and testing;
and S5, inputting the video to perform human behavior recognition to obtain human behavior classification.
Further, the specific steps of step S1 are:
S11, arbitrarily extracting one frame of the video to serve as a scene image, and establishing a mapping relation between the scene image and the video after extraction;
S12, sparsely sampling the input video with the TSN (Temporal Segment Networks) method: the video is divided evenly into N segments by frame count and one frame is randomly extracted from each segment; after the optical flow maps are obtained with the TV-L1 method in OpenCV, one optical flow image is likewise randomly extracted from each segment using the TSN method;
S13, center-cropping all preprocessed images to a size of 224 × 224 pixels.
Further, the specific steps of step S2 are:
S21, quantifying the prior knowledge: the prior knowledge G_j of the j-th scene is expressed as

G_j = (g_j1, g_j2, ……, g_jt, ……, g_jk)

where g_jt represents the prior probability that the t-th action occurs in the j-th scene;
S22, storing the prior knowledge of all scenes in a prior knowledge base; the mapping relation G_S from scenes to behaviors is expressed as G_S = (G_1, G_2, ……, G_j)^T, where G_j denotes the prior knowledge of the j-th scene and T denotes the transpose.
Further, the specific steps of step S3 are:
S31, selecting ResNet152 as the network model for scene recognition, and pre-training the network on the large-scale Places365 and SUN397 datasets;
S32, establishing an improved network model M based on I3D: the prior knowledge is fused in the softmax function with the predicted classification result according to the prior-knowledge influence factor μ to obtain the final prediction result, and for a training set with k classes the loss function, optimized by comparing the label values with the softmax predictions, is

L(Y, y) = −Σ_{t=1}^{k} Y_t · log(y_t)

where Y denotes the correct label vector, y denotes the final prediction vector, and Y_t and y_t are the label value and predicted value of the t-th class.
Further, the specific steps of step S4 are:
S41, inputting the scene image into the scene recognition model for scene classification, taking the TOP-3 of the recognition result, and renormalizing the three scores in proportion so that they sum to 100%;
S42, based on the result of step S41, recombining the prior knowledge corresponding to each scene, in proportion to the scene scores, into new fused scene prior knowledge;
S43, the network model M has a modular structure comprising convolution layers, max pooling layers, an average pooling layer, Inception modules and a final softmax. The fused scene prior knowledge is introduced to constrain the softmax weights, so that the final softmax output excludes classes of behavior that are impossible in the scene, improving the accuracy of behavior recognition. Meanwhile, to avoid the vanishing-gradient problem, two auxiliary softmax classifiers for propagating the gradient are added to the network model M on the basis of I3D; each takes the output of an intermediate Inception module for classification, and the two auxiliary classifiers are fused into the final classification result with a weight of 0.2. In the final test, the two auxiliary classifiers are removed.
Compared with the prior art, the invention has the following beneficial effects:
In the human behavior recognition method based on scene prior knowledge, the input video is preprocessed to obtain a scene image and sparsely sampled images; an indoor scene–human behavior prior knowledge base is then established; the preprocessed video is input into the video scene recognition model to predict the scene of the input video, and based on the scene recognition result the corresponding scene prior knowledge is fused into the human behavior recognition network model M to obtain the human behavior classification. In this way the method accomplishes multi-task content recognition of the video.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a model diagram of the scene recognition model and the human behavior recognition network model M.
Fig. 3 is a block diagram of the human behavior recognition network model M.
Detailed Description
The present invention will be further described with reference to the following examples.
As shown in fig. 1, a human behavior recognition method based on scene prior knowledge includes the following steps:
s1, preprocessing an input video;
S11, one frame of the video is arbitrarily extracted to serve as the scene image, and after extraction a mapping relation between the scene image and the video is established, for use when associating scene, video and prior knowledge;
S12, the input video is sparsely sampled with the TSN (Temporal Segment Networks) method: the video is divided evenly into N segments by frame count and one frame is randomly extracted from each segment; after the optical flow maps are obtained with the TV-L1 method in OpenCV, one optical flow image is likewise randomly extracted from each segment using the TSN method;
S13, all preprocessed images are center-cropped to a size of 224 × 224 pixels.
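As an illustration, a minimal Python sketch of steps S11–S13 follows (TSN-style sparse sampling, TV-L1 optical flow, center cropping). It is a sketch under assumptions, not the patent's code: the video is assumed to be decoded into a list of BGR frames, opencv-contrib-python is assumed for the TV-L1 implementation, and the helper names and segment count N are illustrative.

```python
import random
import cv2

def center_crop(img, size=224):
    # S13: crop the central size x size region.
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def tsn_sample(items, n_segments):
    # S12: divide evenly into N segments, draw one random item per segment.
    seg_len = len(items) // n_segments
    return [items[i * seg_len + random.randrange(seg_len)]
            for i in range(n_segments)]

def preprocess(frames, n_segments=3):
    # S11: one arbitrary frame serves as the scene image.
    scene_image = center_crop(random.choice(frames))
    # S12: sparse RGB sampling with the TSN scheme.
    rgb_samples = [center_crop(f) for f in tsn_sample(frames, n_segments)]
    # S12: TV-L1 optical flow between consecutive grayscale frames
    # (requires the opencv-contrib "optflow" module).
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = [tvl1.calc(grays[i], grays[i + 1], None)
             for i in range(len(grays) - 1)]
    flow_samples = [center_crop(f) for f in tsn_sample(flows, n_segments)]
    return scene_image, rgb_samples, flow_samples
```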
S2, establishing an indoor scene–human behavior prior knowledge base;
S21, the prior knowledge is quantified: the prior knowledge G_j of the j-th scene is expressed as

G_j = (g_j1, g_j2, ……, g_jt, ……, g_jk)

where g_jt represents the prior probability that the t-th action occurs in the j-th scene;
S22, the prior knowledge of all scenes is stored in a prior knowledge base; the mapping relation G_S from scenes to behaviors is expressed as G_S = (G_1, G_2, ……, G_j)^T, where G_j denotes the prior knowledge of the j-th scene and T denotes the transpose.
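A small numpy sketch of such a knowledge base follows; the scene and action counts are hypothetical, and in practice each g_jt would be estimated from an annotated indoor scene–behavior corpus rather than from random counts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_scenes, n_actions = 4, 6           # j scenes, k behavior classes (illustrative)
counts = rng.integers(1, 50, size=(n_scenes, n_actions)).astype(float)

# G_S stacks one prior vector G_j per scene; g_jt is the prior probability
# of the t-th action occurring in the j-th scene, so each row sums to 1.
G_S = counts / counts.sum(axis=1, keepdims=True)
print(G_S[2], G_S[2].sum())          # prior knowledge G_j for scene j = 2
```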
S3, establishing a video scene recognition model and a human behavior recognition model;
S31, ResNet152 is selected as the network model for scene recognition, and the network is pre-trained on the large-scale Places365 and SUN397 datasets;
S32, an improved network model M based on I3D is established: the prior knowledge is fused in the softmax function with the predicted classification result according to the prior-knowledge influence factor μ to obtain the final prediction result, and for a training set with k classes the loss function, optimized by comparing the label values with the softmax predictions, is

L(Y, y) = −Σ_{t=1}^{k} Y_t · log(y_t)

where Y denotes the correct label vector, y denotes the final prediction vector, and Y_t and y_t are the label value and predicted value of the t-th class.
In particular: the main body framework of the human behavior recognition part of the network model M adopts an I3D network with a good effect on human behavior modeling, the I3D network is a network which expands a 2D filter into 3D, the 2D neural network uses an increment-V1 network, the expanded 3D network can use parameters of the 2D network in ImageNet pre-training, and the method is that the parameters of a 2D convolution kernel are copied along time and then divided by the time dimension of the 3D convolution kernel. Using 9 Inception modules, in the last average pooling layer of the model, reducing the dimensions of the human behavior features extracted from the previous layers into 1024-dimensional characteristic vectors, inputting the 1024-dimensional characteristic vectors into a softmax classification layer, and outputting a result A of the softmax classification layer for a training set with k classificationsiTo representComprises the following steps: a. thei=(ai1,ai2,……,ait,……aik) Wherein a isitRepresenting the probability value of the ith video belonging to the t-th action, fusing the priori knowledge in the softmax function with the predicted classification result according to the influence factor mu of the priori knowledge to obtain the final prediction result, and comparing the label value with the prediction result of the softmax function to optimize the loss function of a single training sample
Figure BDA0002833595970000042
Wherein: y denotes the correct tag value, Y denotes the final predicted value, YtIs a tag value, ytFor the prediction, the final loss function J is expressed as
Figure BDA0002833595970000043
Wherein, ω isi、biMu is weight, bias value and priori knowledge weight to be learned of softmax, N is video number, Y isiIs the tag value, y, of the ith videoiIs a prediction value for the ith video.
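The 2D-to-3D parameter inflation described above can be sketched in a few lines of numpy; the kernel shape and the first-layer example are illustrative, not the patent's exact configuration.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    # Copy a 2D kernel (out_c, in_c, kh, kw) t times along a new time axis
    # and divide by t, giving a 3D kernel (out_c, in_c, t, kh, kw) whose
    # response on a temporally constant clip matches the 2D network's.
    return np.repeat(w2d[:, :, np.newaxis, :, :], t, axis=2) / t

w2d = np.random.randn(64, 3, 7, 7)   # e.g. an ImageNet-pretrained first-layer kernel
w3d = inflate_2d_kernel(w2d, t=7)
print(w3d.shape)                     # (64, 3, 7, 7, 7)
```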
S4, model training and testing, wherein the process is shown in figure 2;
S41, the scene image is input into the scene recognition model for scene classification; the TOP-3 of the recognition result is taken, and the three scores are renormalized in proportion so that they sum to 100%;
S42, based on the result of step S41, the prior knowledge corresponding to each scene is recombined, in proportion to the scene scores, into new fused scene prior knowledge;
The recalculated prior knowledge of the i-th video is

G'_i = Σ_{m=1}^{3} p_i^{j_m} · G_{j_m}

where p_i^{j_m} denotes the renormalized probability that the i-th video belongs to scene j_m, and m indicates which of the TOP-3 recognition results of scene recognition is meant, so m ranges from 1 to 3, with

Σ_{m=1}^{3} p_i^{j_m} = 1

and G'_i = (g'_i1, g'_i2, ……, g'_ik) giving the behavior occurrence probability of each of the k behaviors in the prior knowledge corresponding to the video.
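Steps S41–S42 can be illustrated with a short numpy sketch: take the TOP-3 scene scores, renormalize them to 100%, and mix the corresponding per-scene priors G_j into the fused prior G'_i. All numbers below are illustrative.

```python
import numpy as np

scene_scores = np.array([0.05, 0.40, 0.30, 0.10, 0.15])   # scene softmax output
G_S = np.array([                                           # one prior row G_j per scene
    [0.60, 0.30, 0.10],
    [0.10, 0.70, 0.20],
    [0.25, 0.25, 0.50],
    [0.40, 0.40, 0.20],
    [0.33, 0.33, 0.34],
])

top3 = np.argsort(scene_scores)[-3:]               # indices of the TOP-3 scenes
p = scene_scores[top3] / scene_scores[top3].sum()  # S41: renormalize to 100%
G_fused = p @ G_S[top3]                            # S42: G'_i = sum_m p_m * G_{j_m}
print(G_fused, G_fused.sum())                      # still a distribution over behaviors
```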
The scene prior knowledge G'_i is introduced into the training process through the influence factor μ. In the backward learning process, if the prior knowledge is positively correlated with the prediction result, the prior-knowledge weight μ is increased; if the prior knowledge is inaccurate, μ is decreased. Meanwhile, the deep neural network learns the samples by analyzing the distribution of the data, and to prevent the introduced prior knowledge from having a decisive influence on the final result, μ is set as the influence factor of the prior knowledge with 0 < μ < 1, initialized to μ = 0.5. The behavior recognition result of the i-th video combined with the prior knowledge is represented as

y_i = (1 − μ) · A_i + μ · G'_i
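A minimal sketch of the fused prediction and the per-sample loss follows. The additive form y_i = (1 − μ)·A_i + μ·G'_i matches the reconstruction above but is an assumption about the exact fusion rule, and μ, learnable in the patent, is held fixed here for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fused_prediction(logits, g_fused, mu=0.5):
    a = softmax(logits)                   # A_i: softmax scores over the k behaviors
    return (1.0 - mu) * a + mu * g_fused  # y_i: combination with the scene prior G'_i

def sample_loss(y_pred, y_true):
    # L(Y, y) = -sum_t Y_t * log(y_t): cross-entropy of the fused prediction.
    return -np.sum(y_true * np.log(y_pred + 1e-12))

logits = np.array([2.0, 0.5, -1.0])       # behavior logits for one video
g_fused = np.array([0.70, 0.25, 0.05])    # fused scene prior G'_i for that video
y_true = np.array([1.0, 0.0, 0.0])        # one-hot label Y_i
print(sample_loss(fused_prediction(logits, g_fused), y_true))
```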
S43, the network model M has a modular structure, as shown in FIG. 3, comprising convolution layers, max pooling layers, an average pooling layer, Inception modules and a final softmax. Specifically, the convolution kernel of the last convolution layer is 1 × 1 to generate the classification scores, all other convolution layers are followed by batch normalization (BN) and ReLU activation, which allows the learning rate to be increased, and the dropout layer is placed only after the average pooling layer. The fused scene prior knowledge is introduced to constrain the softmax weights, so that the final softmax output excludes classes of behavior that are impossible in the scene, improving the accuracy of behavior recognition. Meanwhile, to avoid the vanishing-gradient problem, two auxiliary softmax classifiers for propagating the gradient forward are added to the network model M on the basis of I3D; each takes the output of an intermediate Inception module for classification, and the two auxiliary classifiers are fused into the final classification result with a weight of 0.2. In the final test, the two auxiliary classifiers are removed.
And S5, inputting the video to perform human behavior recognition to obtain human behavior classification.
The human behavior recognition method based on scene prior knowledge can accomplish multi-task content recognition of video: it fully exploits the association between scene information and human behavior, optimizes the objective function by converting the prior knowledge into constraints on the weights in the behavior recognition model, and improves the accuracy of behavior recognition.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (5)

1. A human behavior recognition method based on scene prior knowledge is characterized by comprising the following steps:
s1, preprocessing an input video;
S2, establishing an indoor scene–human behavior prior knowledge base;
S3, establishing a video scene recognition model and a human behavior recognition model;
s4, model training and testing;
and S5, inputting the video to perform human behavior recognition to obtain human behavior classification.
2. The human behavior recognition method based on scene prior knowledge of claim 1, wherein the specific steps of step S1 are:
S11, arbitrarily extracting one frame of the video to serve as a scene image, and establishing a mapping relation between the scene image and the video after extraction;
S12, sparsely sampling the input video with the TSN (Temporal Segment Networks) method: the video is divided evenly into N segments by frame count and one frame is randomly extracted from each segment; after the optical flow maps are obtained with the TV-L1 method in OpenCV, one optical flow image is likewise randomly extracted from each segment using the TSN method;
S13, center-cropping all preprocessed images to a size of 224 × 224 pixels.
3. The human behavior recognition method based on scene prior knowledge of claim 1, wherein the specific steps of step S2 are:
S21, quantifying the prior knowledge: the prior knowledge G_j of the j-th scene is expressed as

G_j = (g_j1, g_j2, ……, g_jt, ……, g_jk)

where g_jt represents the prior probability that the t-th action occurs in the j-th scene;
S22, storing the prior knowledge of all scenes in a prior knowledge base; the mapping relation G_S from scenes to behaviors is expressed as G_S = (G_1, G_2, ……, G_j)^T, where G_j denotes the prior knowledge of the j-th scene and T denotes the transpose.
4. The human behavior recognition method based on scene prior knowledge of claim 1, wherein the specific steps of step S3 are:
S31, selecting ResNet152 as the network model for scene recognition, and pre-training the network on the large-scale Places365 and SUN397 datasets;
S32, establishing an improved network model M based on I3D: the prior knowledge is fused in the softmax function with the predicted classification result according to the prior-knowledge influence factor μ to obtain the final prediction result, and for a training set with k classes the loss function, optimized by comparing the label values with the softmax predictions, is

L(Y, y) = −Σ_{t=1}^{k} Y_t · log(y_t)

where Y denotes the correct label vector, y denotes the final prediction vector, and Y_t and y_t are the label value and predicted value of the t-th class.
5. The human behavior recognition method based on scene prior knowledge of claim 1, wherein the specific steps of step S4 are:
S41, inputting the scene image into the scene recognition model for scene classification, taking the TOP-3 of the recognition result, and renormalizing the three scores in proportion so that they sum to 100%;
S42, based on the result of step S41, recombining the prior knowledge corresponding to each scene, in proportion to the scene scores, into new fused scene prior knowledge;
S43, the network model M has a modular structure comprising convolution layers, max pooling layers, an average pooling layer, Inception modules and a final softmax; the fused scene prior knowledge is introduced to constrain the softmax weights so that the final softmax output excludes classes of behavior that are impossible in the scene; two auxiliary softmax classifiers for propagating the gradient forward are additionally added to the network model M on the basis of I3D, each taking the output of an intermediate Inception module for classification, and the two auxiliary classifiers are fused into the final classification result according to the weight, but in the final test the two auxiliary classifiers are removed.
CN202011470438.8A 2020-12-14 2020-12-14 Human behavior identification method based on scene prior knowledge Pending CN112560668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011470438.8A CN112560668A (en) 2020-12-14 2020-12-14 Human behavior identification method based on scene prior knowledge


Publications (1)

Publication Number Publication Date
CN112560668A true CN112560668A (en) 2021-03-26

Family

ID=75063141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011470438.8A Pending CN112560668A (en) 2020-12-14 2020-12-14 Human behavior identification method based on scene prior knowledge

Country Status (1)

Country Link
CN (1) CN112560668A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642644A (en) * 2021-08-13 2021-11-12 北京赛目科技有限公司 Method and device for determining vehicle environment grade, electronic equipment and storage medium
CN113642644B (en) * 2021-08-13 2024-05-10 北京赛目科技有限公司 Method and device for determining vehicle environment level, electronic equipment and storage medium
WO2023108968A1 (en) * 2021-12-14 2023-06-22 北京邮电大学 Image classification method and system based on knowledge-driven deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination