CN113449564A - Behavior image classification method based on human body local semantic knowledge - Google Patents
- Publication number
- CN113449564A (application CN202010228189.5A)
- Authority
- CN
- China
- Prior art keywords
- behavior
- human body
- body part
- local
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
An image classification method based on human body local behavior semantic knowledge: a human body part behavior state recognition model that yields a local fine-grained semantic representation of the human body is established and trained; visual information in the image under test is then converted, via natural language understanding, into language-based prior knowledge, which is fused with the visual information to generate a fine-grained behavior characterization vector that is transferred to computer vision behavior recognition tasks; finally, the overall behavior is inferred by combining the local fine-grained human body features, completing the behavior understanding process and obtaining a classification result. The invention achieves a substantial recognition performance improvement on a number of complex behavior understanding tasks; moreover, a single pre-training supports repeated transfer to a variety of tasks, giving the method generality and flexibility.
Description
Technical Field
The invention relates to a technology in the field of image recognition and artificial intelligence, in particular to an image classification method based on human body local behavior semantic knowledge.
Background
Human behavior detection is an important branch of computer vision whose goal is to infer human behavior and interaction with the environment in an image or video. Behavior detection is widely applied in intelligent driving, security and robotics, is one of the artificial intelligence technologies most important to industry, and attracts growing attention. Machine learning studies computer algorithms that improve automatically through experience, generally obtaining, abstracting and summarizing key information and knowledge from large amounts of experience data; the artificial neural network is an important branch of machine learning and is now widely applied to artificial intelligence tasks. Existing image behavior detection methods infer a person's behavior directly from image-level features; because the modal gap between image-level features and human behavior is large, such methods easily run into a performance bottleneck.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an image classification method based on human body local behavior semantic knowledge that achieves a substantial recognition performance improvement on a variety of complex behavior understanding tasks; moreover, a single pre-training supports repeated transfer to a variety of tasks, giving the method generality and flexibility.
The invention is realized by the following technical scheme:
the invention relates to an image classification method based on human body local behavior semantic knowledge, which comprises the steps of establishing a human body part behavior state recognition model for obtaining human body local fine-grained semantic representation and carrying out model training; then, converting visual information in the image to be detected into language-based priori knowledge by using natural language understanding, fusing the priori knowledge and the visual information to generate a fine-grained behavior characterization vector, and transferring the fine-grained behavior characterization vector to a computer visual behavior and recognition task; and finally, reasoning the overall behavior by combining the local fine-grained characteristics of the human body to finish the behavior understanding process to obtain a classification result.
The human body part behavior state recognition model comprises: a pre-trained 50-layer residual convolutional neural network, ten 512-dimensional region-of-interest pooling layers, ten two-layer perceptrons with ReLU nonlinear activation layers, and ten human body part behavior state classifiers with 76-dimensional outputs.
The model training uses a human body part behavior state training sample set, obtained as follows: on an image data set containing human behaviors and their annotations (comprising the human bounding box b_h, the object bounding box b_o when the behavior is a human-object interaction, and the behavior label label_action), the human body part behavior states of the persons participating in the interaction are defined, yielding 76 distinct human body part states; based on these definitions, the body part states of every person behavior instance in the image data set are annotated, the result comprising two parts: the body part state label label_pasta and the human body part attention vector label_att, which characterizes whether each part contributes to the behavior sample; two-dimensional human pose estimation is then performed on the persons in the training set, and bounding boxes b_p1~b_p10 of ten parts of each person are generated from the estimate. All of the above bounding boxes are four-dimensional vectors (x_1, y_1, x_2, y_2), where (x_1, y_1) is the top-left corner of the box and (x_2, y_2) the bottom-right corner.
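The pose-to-part-box step can be sketched as follows. The grouping of keypoints into a specific part and the padding factor are assumptions for illustration only; the patent states merely that ten part boxes b_p1~b_p10 are generated from the 2-D pose estimate.

```python
# Sketch: derive one part bounding box (x1, y1, x2, y2) from a set of
# pose keypoints belonging to that part. Grouping and padding are
# illustrative assumptions, not the patented procedure.

def part_box(keypoints, pad=0.2):
    """keypoints: list of (x, y) pairs; returns a padded tight box."""
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    w, h = x2 - x1, y2 - y1
    # pad the tight box so some context around the part is included
    return (x1 - pad * w, y1 - pad * h, x2 + pad * w, y2 + pad * h)

# e.g. a hypothetical "right hand" part from wrist and finger keypoints
box = part_box([(100, 200), (110, 215), (95, 210)])
```

Repeating this for each of the ten keypoint groups yields b_p1~b_p10 in the four-dimensional (x_1, y_1, x_2, y_2) format described above.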
The visual information is obtained as follows: the public image set HICO-DET, which contains human behavior labels, is used as the migration task data set, on which the trained human body part behavior state recognition model extracts the human body local fine-grained visual semantic representation and the estimate of the human body part attention vectors.
The language-based prior knowledge is: following a natural language understanding approach, the language representation vector of each human body part name is extracted with a BERT model (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding).
The fusion is: the language-based prior knowledge and the visual information are combined by concatenation to obtain the fine-grained behavior characterization vector.
The computer vision behavior recognition task is as follows: an overall behavior inference model based on the human body local fine-grained semantic representation is constructed, which takes f_pasta as input and derives an inferred score S_pasta of human behavior. The overall behavior inference model may be any of: a hierarchical graph model, a linear combination, a multilayer perceptron, a graph convolutional network, a sequence model, or tree-structured information propagation, wherein: the hierarchical graph model partitions the human body parts by functional module, merges and aggregates them layer by layer, and performs behavior reasoning; the linear combination, multilayer perceptron, graph convolutional network, sequence model and tree-structured propagation classify the fine-grained behavior characterization vector with a single fully connected layer, multiple fully connected layers, graph convolution, LSTM and tree-structured operations respectively, so as to infer the human behavior.
The loss function used to train the overall behavior inference model is L = L_pasta + L_cls^pasta + L_cls^inst, wherein: L_pasta is the loss function used to train the human body part behavior state recognition model, taken as the cross entropy between the model output and the labels, and omitted when the migration task carries no human body part state information; L_cls^pasta is the cross entropy computed from the behavior detection result obtained by feeding f_pasta into the model; L_cls^inst is the cross entropy computed by the conventional method, and is omitted when no conventional method is combined.
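A minimal runnable sketch of this composite loss, assuming plain unweighted summation of the terms; `cross_entropy` below is the textbook definition, not necessarily the exact implementation used in the patent.

```python
import math

def cross_entropy(probs, label):
    """Cross entropy between a predicted distribution and a class index."""
    return -math.log(probs[label])

def total_loss(l_pasta=None, l_cls_pasta=0.0, l_cls_inst=None):
    # L = L_pasta + L_cls^pasta + L_cls^inst; L_pasta is omitted when the
    # migration task has no part-state labels, L_cls^inst when no
    # conventional image-level branch is combined.
    return sum(t for t in (l_pasta, l_cls_pasta, l_cls_inst) if t is not None)

# e.g. a migration task without part-state labels and without a
# conventional branch reduces to L = L_cls^pasta alone
loss = total_loss(l_pasta=None, l_cls_pasta=0.7, l_cls_inst=None)
```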
The inference of the overall behavior by combining the local fine-grained human body features is: the behavior detection score output by the overall behavior inference model is combined with the behavior detection score S_inst output by a method that maps image-level features directly to human behaviors; the combined output S = S_pasta + S_inst gives the final detection result.
The invention further relates to a recognition system implementing the above method, comprising: an image feature extraction unit, a local state recognition unit, a local state language feature unit and a behavior reasoning unit, wherein: the image feature extraction unit extracts features from the input image and passes them to the connected local state recognition unit, which recognizes the local states and extracts their visual features; the local state language feature unit reads the recognition result of the local state recognition unit and converts it into language features; the local state recognition unit and the local state language feature unit pass the visual features and the language features, respectively, to the behavior reasoning unit for the final behavior recognition.
Technical effects
The invention addresses the overall problem that the large number of behavior categories hampers transfer learning: by learning human body local behavior semantic knowledge, which has fewer categories and transfers more easily, knowledge is shared across different behaviors, improving behavior recognition under small-sample conditions.
Compared with the prior art, the method markedly improves the accuracy of human behavior detection in images. By introducing human body local behavior semantic knowledge and combining visual and language information, it constructs a fine-grained human body local semantic representation, which typically yields an improvement of about 10% on common behavior understanding data sets; moreover, through transfer learning, the feature extraction model trained once can be applied to a variety of behavior understanding and recognition tasks, such as human-object interaction understanding and video or image behavior understanding.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the system of the present invention;
FIG. 3 is a schematic diagram illustrating the effect of the present invention.
Detailed Description
As shown in fig. 1, the present embodiment relates to an image classification method based on human body local behavior semantic knowledge, comprising the following steps:
step 1, constructing a data set: using the public image data set containing human behavior and obtaining the human bounding box bhObject bounding box bo(when the behavior is human-object interaction behavior) and behavior tag labelactionDefining the human body part behavior states of the people participating in the interaction, and finally obtaining 76 different human body part states; based on these definitions, the human body part status of each human behavior instance in the image dataset is labeled, and the following results are obtained: body part state labelpastaAnd a human body part attention vector label with a length of 10attCharacterizing whether each part contributes to the behavior sample; and carrying out two-dimensional human body posture estimation on the people in the training set, and generating a boundary box b of ten parts of each person according to the estimation resultp1~bp10。
The bounding boxes are all four-dimensional vectors (x)1,y1,x2,y2) The coordinate of the upper left corner of the bounding box is (x)1,y1) The coordinate of the lower right corner is (x)2,y2)。
Step 2: training a human body part behavior state recognition model.
Step 2.1: construct the human body part behavior state recognition model, comprising: a pre-trained 50-layer residual convolutional neural network, ten 512-dimensional region-of-interest pooling layers, ten two-layer perceptrons with ReLU nonlinear activation, and ten human body part behavior state classifiers with 76-dimensional outputs, wherein: the RGB three-channel color image I_RGB is fed into the residual convolutional neural network to obtain a 1024-channel feature map at 1/16 of the original resolution; the feature map and b_p1~b_p10 are fed into the region-of-interest pooling, producing ten features, one per human body part, which are then sent to the corresponding multilayer perceptrons and human body part behavior state classifiers to obtain P_pasta.
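The structure of step 2.1 can be sketched as below. This is an illustrative stand-in, not the patented implementation: a single stride-16 convolution replaces the pretrained ResNet-50 (1024-channel map at 1/16 resolution), and crop-plus-average replaces true region-of-interest pooling.

```python
import torch
from torch import nn

class PartStateModel(nn.Module):
    """Sketch of the part-state recognition model: a stride-16 conv stands in
    for the pretrained ResNet-50 backbone; each of the ten parts gets a
    two-layer 512-d perceptron ending in a 76-way state classifier."""
    def __init__(self, n_parts=10, n_states=76, feat_ch=1024):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_ch, kernel_size=16, stride=16)
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_ch, 512), nn.ReLU(),
                          nn.Linear(512, 512), nn.ReLU(),
                          nn.Linear(512, n_states))
            for _ in range(n_parts))

    def forward(self, image, part_boxes):
        fmap = self.backbone(image)                      # (1, C, H/16, W/16)
        scores = []
        for head, (x1, y1, x2, y2) in zip(self.heads, part_boxes):
            # crude RoI pooling: crop the 1/16-scale feature map, then average
            roi = fmap[:, :, int(y1) // 16:int(y2) // 16 + 1,
                             int(x1) // 16:int(x2) // 16 + 1]
            scores.append(head(roi.mean(dim=(2, 3))))
        return torch.stack(scores, dim=1)                # (1, n_parts, n_states)

model = PartStateModel()
image = torch.randn(1, 3, 224, 224)     # stand-in for I_RGB
boxes = [(0, 0, 64, 64)] * 10           # stand-ins for b_p1..b_p10
p_pasta = model(image, boxes)           # per-part state scores
```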
Step 2.2: train the model with the data set constructed in step 1: input the training data I_RGB, b_p1~b_p10 and the corresponding human body part state labels into the human body part behavior state recognition model, compute the loss function L_pasta from the output, and iteratively train the model with a gradient back-propagation algorithm.
The loss function L_pasta is the loss used to train the human body part behavior state recognition model, taken as the cross entropy between the model output and the labels; it is omitted when the migration task carries no human body part state information.
Step 3: obtaining the human body local fine-grained semantic representation.
Step 3.1: acquiring a human bounding box b in an open image data set HICO-DET containing human behaviors as a migration task data sethObject bounding box bo(when the behavior is human-object interaction behavior) and behavior tag labelaction. Because the data set has behavior state labels of human body parts, corresponding label is also obtainedpastaAnd labelattIt is divided into a training set and a test set as input information, namely a three-channel RGB image I comprising human behaviorsRGBAnd bounding boxes b of people, human body parts, objects (e.g. for human-object interaction behavior)h,bo,bp。
Step 3.2: inputting the data obtained in the step 3.1 into the human body part behavior state recognition model trained in the step 2, and outputting the data to represent the local fine-grained visual semantic meaning of the human bodyRecognition result of human body part behavior state And estimation of attention vectors of human body partsObtaining the visual characteristics of the local state through final splicingWherein:is a pair ofAndand (4) splicing.
The human body local fine-grained visual semantic representation is the output of the last fully connected layer of each human body part behavior state classifier; its length n_1, the output dimension, is 512 in this embodiment.
In the overall training of step 2.2, the loss function is computed from the labels corresponding to the input information and the network output, and the neural network parameters are iteratively optimized with a gradient back-propagation algorithm; the loss function is L_pasta = Σ_i (L_pasta^i + L_att^i), where L_pasta^i is the cross entropy loss for estimating the behavior state of the i-th human body part and L_att^i is the cross entropy loss for estimating the attention of the i-th human body part.
The length of the visual information is 1024 in this embodiment.
Step 3.3: local behavior state language features based on self-language understanding are generated and are combined with the local state visual features obtained in the step 3.2Fusing to generate fine-grained behavior characterization vectors: specifically, the local state identified by the local behavior state identification unit is converted into the local state language feature based on the natural language word description by using J Devrlin and the like, which are described in the document 'book of Pre-training of deep bidirectional transformations for language understanding' (Pre-training of deep bidirectional transformation for language understanding):n2is the length of the language feature vector, associated with the selected language model; then, obtainThen it is rightAndcarrying out fusion: will be provided withAndare spliced to obtain
Step 4: training the overall behavior inference model based on the human body local fine-grained semantic representation.
Step 4.1: constructing a whole behavior inference model based on human body local fine-grained semantic representation, wherein the modelThe type comprises two layers 102 of four-dimensional multi-layer perceptron with activation function ReLU and full connection layer classifier behind the perceptron and is represented by fpastaAs input, inferred scores of human behavior are output.
Step 4.2: f belonging to the training set and obtained in the step 3pastaInputting the data into a model to obtain a behavior detection score SpastaAnd calculating a loss function L therefrompastaThe model is iteratively trained and updated using a gradient back propagation algorithm.
Step 5: perform behavior classification based on human body local behavior semantic knowledge with the trained models: input the f_pasta of the test set obtained in step 3 into the model to obtain the output S_pasta, combine it with the result S_inst output by the method using only image-level features, and take S = S_pasta + S_inst as the final detection result.
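The inference head of step 4.1 and the late score fusion of step 5 can be sketched together as below. All dimensions are illustrative assumptions (600 corresponds to the HICO-DET interaction-category count), and the image-level score is a random stand-in.

```python
import torch
from torch import nn

n_in, n_actions = 1280, 600   # illustrative sizes, not from the patent text

# two-layer MLP with ReLU plus a fully connected classifier, as in step 4.1
head = nn.Sequential(nn.Linear(n_in, 1024), nn.ReLU(),
                     nn.Linear(1024, 1024), nn.ReLU(),
                     nn.Linear(1024, n_actions))

f_pasta = torch.randn(1, n_in)        # fine-grained behavior characterization vector
s_pasta = head(f_pasta)               # part-based behavior score S_pasta
s_inst = torch.randn(1, n_actions)    # stand-in for the image-level-only score S_inst
s_final = s_pasta + s_inst            # S = S_pasta + S_inst
```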
After combination, the result is a 29% relative improvement over the score before combination.
As shown in fig. 2, the present embodiment further relates to a recognition system implementing the above method, comprising: an image feature extraction unit, a local state recognition unit, a local state language feature unit and a behavior reasoning unit, wherein: the image feature extraction unit extracts features from the input image and passes them to the connected local state recognition unit, which recognizes the local states and extracts their visual features; the local state language feature unit reads the recognition result of the local state recognition unit and converts it into language features; the local state recognition unit and the local state language feature unit pass the visual features and the language features, respectively, to the behavior reasoning unit for the final behavior recognition.
Preferably, the system captures the dynamic local changes of consecutive video frames through a video-based human body part tracking unit, thereby obtaining the local behavior states over a time period; the behavior reasoning unit accepts the multi-frame input and produces the overall behavior recognition result for the period, so that the system can be used for everyday video behavior recognition. Judging the local dynamic temporal states of the human body improves behavior recognition in video, raising accuracy by 4.2% on the large-scale public video behavior data set AVA.
As shown in fig. 3, for the image-level human behavior classification task, with a batch size of 16 on a single Nvidia Titan X GPU, an initial learning rate of 1e-5 with cosine decay, and a stochastic gradient descent optimizer with momentum 0.9, after 80k training iterations and 20k fine-tuning iterations the method reaches 46.3 mAP on the HICO data set, the state of the art at the time.
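The training configuration above maps directly onto standard PyTorch components; in this sketch the tiny linear model is a placeholder, and only the optimizer and schedule settings come from the text.

```python
import torch
from torch import nn

model = nn.Linear(8, 2)   # placeholder network, not the patented model

# SGD with momentum 0.9 and initial learning rate 1e-5, as described above
opt = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
# cosine decay over the 80k training iterations mentioned in the text
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=80000)

for _ in range(3):        # a few illustrative iterations
    loss = model(torch.randn(16, 8)).pow(2).mean()   # batch size 16
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()          # learning rate follows the cosine curve
```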
Compared with the prior art, the performance indices of the method improve because the recognition of human body part behavior states is introduced, avoiding the large gap of mapping directly from the image to the human behavior; moreover, local states can be shared across different overall behaviors and transfer well, which particularly benefits behavior recognition under small-sample learning.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (9)
1. An image classification method based on human body local behavior semantic knowledge, characterized in that a human body part behavior state recognition model yielding a local fine-grained semantic representation of the human body is established and trained; visual information in the image under test is then converted, via natural language understanding, into language-based prior knowledge, which is fused with the visual information to generate a fine-grained behavior characterization vector that is transferred to computer vision behavior recognition tasks; finally, the overall behavior is inferred by combining the local fine-grained human body features, completing the behavior understanding process and obtaining a classification result.
2. The image classification method according to claim 1, wherein the human body part behavior state recognition model comprises: a pre-trained 50-layer residual convolutional neural network, ten 512-dimensional region-of-interest pooling layers, ten two-layer perceptrons with ReLU nonlinear activation layers, and ten human body part behavior state classifiers with 76-dimensional outputs.
3. The image classification method according to claim 1 or 2, wherein the model training uses a human body part behavior state training sample set obtained as follows: on the image data set containing human behaviors and their annotations, the human body part behavior states of the persons participating in the interaction are defined, yielding 76 distinct human body part states; based on these definitions, the body part states of every person behavior instance in the image data set are annotated, the result comprising two parts: the body part state label label_pasta and the human body part attention vector label_att, which characterizes whether each part contributes to the behavior sample; two-dimensional human pose estimation is performed on the persons in the training set, and bounding boxes b_p1~b_p10 of ten parts of each person are generated from the estimate; the bounding boxes are all four-dimensional vectors (x_1, y_1, x_2, y_2), where (x_1, y_1) is the top-left corner of the box and (x_2, y_2) the bottom-right corner.
4. The image classification method according to claim 1, wherein the visual information is obtained as follows: the public image set HICO-DET containing human behavior labels is used as the migration task data set, on which the trained human body part behavior state recognition model extracts the human body local fine-grained visual semantic representation and the estimate of the human body part attention vectors;
the language-based prior knowledge is: following a natural language understanding approach, the language representation vector of each human body part name is extracted with a pre-trained deep bidirectional Transformer language understanding model (BERT);
the fusion is: the language-based prior knowledge and the visual information are combined by concatenation to obtain the fine-grained behavior characterization vector.
5. The image classification method according to claim 1, wherein the computer vision behavior recognition task is: an overall behavior inference model based on the human body local fine-grained semantic representation is constructed, which takes f_pasta as input and derives an inferred score S_pasta of human behavior, the overall behavior inference model comprising: a hierarchical graph model, a linear combination, a multilayer perceptron, a graph convolutional network, a sequence model, or tree-structured information propagation, wherein: the hierarchical graph model partitions the human body parts by functional module, merges and aggregates them layer by layer, and performs behavior reasoning; the linear combination, multilayer perceptron, graph convolutional network, sequence model and tree-structured propagation classify the fine-grained behavior characterization vector with a single fully connected layer, multiple fully connected layers, graph convolution, LSTM and tree-structured operations respectively, so as to infer the human behavior.
6. The image classification method according to claim 1, wherein the training of the overall behavior inference model uses the loss function L = L_pasta + L_cls^pasta + L_cls^inst, wherein: L_pasta is the loss function used to train the human body part behavior state recognition model, taken as the cross entropy between the model output and the labels, and omitted when the migration task carries no human body part state information; L_cls^pasta is the cross entropy computed from the behavior detection result obtained by feeding f_pasta into the model; L_cls^inst is the cross entropy computed by the conventional method, and is omitted when no conventional method is combined.
7. The image classification method according to claim 1, wherein the inference of the overall behavior by combining the local fine-grained human body features is: the behavior detection score output by the overall behavior inference model is combined with the behavior detection score S_inst output by a method that maps image-level features directly to human behaviors; the combined output S = S_pasta + S_inst gives the final detection result.
8. An identification system for implementing the method of any one of claims 1 to 7, comprising: an image feature extraction unit, a local state recognition unit, a local state language feature unit and a behavior reasoning unit, wherein: the image feature extraction unit extracts features from the input image and passes them to the connected local state recognition unit, which recognizes the local states and extracts their visual features; the local state language feature unit reads the recognition result of the local state recognition unit and converts it into language features; the local state recognition unit and the local state language feature unit pass the visual features and the language features, respectively, to the behavior reasoning unit for the final behavior recognition.
9. The identification system of claim 8, further comprising a video-based human body part tracking unit that captures the dynamic local changes of consecutive video frames to obtain the local behavior states over a time period; the behavior reasoning unit accepts the multi-frame input and produces the overall behavior recognition result for the period, so that the system can be used for everyday video behavior recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010228189.5A CN113449564B (en) | 2020-03-26 | 2020-03-26 | Behavior image classification method based on human body local semantic knowledge |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113449564A true CN113449564A (en) | 2021-09-28 |
CN113449564B CN113449564B (en) | 2022-09-06 |
Family
ID=77807763
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010228189.5A Active CN113449564B (en) | 2020-03-26 | 2020-03-26 | Behavior image classification method based on human body local semantic knowledge |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449564B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115968087A (en) * | 2023-03-16 | 2023-04-14 | 中建八局发展建设有限公司 | Interactive light control device of exhibitions center |
CN117197843A (en) * | 2023-11-06 | 2023-12-08 | 中国科学院自动化研究所 | Unsupervised human body part area determination method and device |
Application Events
2020-03-26: Application filed (CN202010228189.5A); patent granted as CN113449564B, status Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103942851A (en) * | 2014-04-02 | 2014-07-23 | Beijing Zhongjiao Huilian Information Technology Co., Ltd. | Method and device for monitoring vehicle state and driving behavior |
CN108367442A (en) * | 2016-02-25 | 2018-08-03 | Olympus Corporation | Effector system and its control method |
CN107578106A (en) * | 2017-09-18 | 2018-01-12 | University of Science and Technology of China | Neural network natural language inference method fusing word semantic knowledge |
CN108830334A (en) * | 2018-06-25 | 2018-11-16 | Jiangxi Normal University | Fine-grained object recognition method based on adversarial transfer learning |
CN109077704A (en) * | 2018-07-06 | 2018-12-25 | Shanghai Xuanzhong Medical Technology Co., Ltd. | Infant nursing recognition method and system |
CN109783666A (en) * | 2019-01-11 | 2019-05-21 | Sun Yat-sen University | Image scene graph generation method based on iterative refinement |
CN110750669A (en) * | 2019-09-19 | 2020-02-04 | iDeepWise Artificial Intelligence Robot Technology (Beijing) Co., Ltd. | Method and system for generating image captions |
CN110728203A (en) * | 2019-09-23 | 2020-01-24 | Tsinghua University | Sign language translation video generation method and system based on deep learning |
CN110909736A (en) * | 2019-11-12 | 2020-03-24 | Beijing University of Technology | Image captioning method based on long short-term memory model and object detection algorithm |
Non-Patent Citations (2)
Title |
---|
Michalis Raptis et al.: "Poselet Key-framing: A Model for Human Activity Recognition", IEEE Xplore * |
Lei Qing et al.: "New Advances in Human Action Recognition Research in Complex Scenes", Computer Science * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115968087A (en) * | 2023-03-16 | 2023-04-14 | China Construction Eighth Engineering Division Development & Construction Co., Ltd. | Interactive lighting control device for an exhibition center |
CN117197843A (en) * | 2023-11-06 | 2023-12-08 | Institute of Automation, Chinese Academy of Sciences | Unsupervised human body part region determination method and device |
CN117197843B (en) * | 2023-11-06 | 2024-02-02 | Institute of Automation, Chinese Academy of Sciences | Unsupervised human body part region determination method and device |
Also Published As
Publication number | Publication date |
---|---|
CN113449564B (en) | 2022-09-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Empowering things with intelligence: a survey of the progress, challenges, and opportunities in artificial intelligence of things | |
Deng et al. | MVF-Net: A multi-view fusion network for event-based object classification | |
CN109948475B (en) | Human body action recognition method based on skeleton features and deep learning | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN110135249B (en) | Human behavior recognition method based on temporal attention mechanism and LSTM | |
Zheng et al. | Recent advances of deep learning for sign language recognition | |
KR101887637B1 (en) | Robot system | |
CN113449564B (en) | Behavior image classification method based on human body local semantic knowledge | |
CN114896434B (en) | Hash code generation method and device based on center similarity learning | |
CN116524593A (en) | Dynamic gesture recognition method, system, equipment and medium | |
Luqman | An efficient two-stream network for isolated sign language recognition using accumulative video motion | |
He et al. | Global and local fusion ensemble network for facial expression recognition | |
CN112800979B (en) | Dynamic expression recognition method and system based on characterization flow embedded network | |
Musthafa et al. | Real time Indian sign language recognition system | |
CN112949501A (en) | Method for learning object availability from teaching video | |
CN117496567A (en) | Facial expression recognition method and system based on feature enhancement | |
CN116662924A (en) | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism | |
CN112861848B (en) | Visual relation detection method and system based on known action conditions | |
Zhao et al. | Research on human behavior recognition in video based on 3DCCA | |
Saif et al. | Aggressive action estimation: a comprehensive review on neural network based human segmentation and action recognition | |
KR101913140B1 (en) | Apparatus and method for Optimizing Continuous Features in Industrial Surveillance using Big Data in the Internet of Things | |
Liu | Improved convolutional neural networks for course teaching quality assessment | |
Mittel et al. | Peri: Part aware emotion recognition in the wild | |
CN112784631A (en) | Method for recognizing face emotion based on deep neural network | |
Nan et al. | 3D RES-inception network transfer learning for multiple label crowd behavior recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||