CN113313039B - Video behavior recognition method and system based on action knowledge base and ensemble learning - Google Patents
- Publication number: CN113313039B
- Application number: CN202110618201.8A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
The invention discloses a video behavior recognition method and system based on an action knowledge base and ensemble learning. A 3D deep residual network extracts global features from the input video, while the action knowledge base is used to extract vision-based and language-based action state features. The extracted features are organized into graph structures according to human body parts, and a multi-head graph convolutional feature fusion network fuses information over the constructed graphs. Five weak classifiers with similar structures are built: the first three take the three kinds of features as input, and the last two take the concatenated features. A dynamic cross entropy loss function is proposed to integrate and classify the outputs of the different weak classifiers. The method classifies the actions contained in a video segment and improves classification accuracy.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video behavior recognition method and system based on an action knowledge base and ensemble learning.
Background
With the development of deep learning, research in computer vision is no longer limited to image understanding, and more tasks targeting video information have emerged. The purpose of behavior recognition is to understand a segmented video: to analyze the action information in an input video segment and assign a class label to the action it contains. In the internet age, people live amid an explosion of information, and large amounts of video data appear in daily life across many fields, for example security surveillance, human-computer interaction, large video websites and short-video applications. Behavior analysis is one of the key technologies for processing such information, and it sits at the intersection of several fields, such as computer vision, pattern recognition, and computer science and technology.
In recent years, solving video understanding problems with deep learning has become a research hotspot. The processing of input information by a deep network can be regarded as a feature extraction process, so exploring suitable feature extraction solutions is of lasting importance. A good feature extractor should produce features with several characteristics. First, discriminability: for different categories, the extracted features should distinguish them clearly. Second, efficiency: the dependence on external conditions such as computing power and storage during feature extraction should be reduced as much as possible to ease deployment. Third, robustness: the features should remain discriminative across the content of different videos, which often differ greatly in illumination, focal length, resolution and so on, making extraction challenging. Research on feature extraction solutions should satisfy these conditions, which matters greatly for the practical deployment of a model. Compared with other computer vision tasks, behavior recognition has broad practical significance, and research on it strongly promotes the many applications of video-related tasks in daily life.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a video behavior recognition method and system based on an action knowledge base and ensemble learning. It supplies behavior recognition with features supported by action knowledge and integrates the final classification results, letting the model autonomously choose between knowledge-based features and features from the 3D convolutional network, thereby improving performance on the behavior recognition task.
The invention adopts the following technical scheme:
A video behavior recognition method based on an action knowledge base and ensemble learning comprises the following steps:

S1, detecting human body part frames in the input video frames;

S2, extracting vision-based action state features of different parts via region-of-interest pooling on the part frames B obtained in step S1, constrained by the human body part states defined in the action knowledge base;

S3, forming triplet phrases <part, action state, object> from the vision-based action state features extracted in step S2 and the action state labels of the corresponding parts, and feeding the triplet phrases into the natural language processing tool Bert to obtain language-based action state features;

S4, inputting the original video frames into a pre-trained and fine-tuned 3D deep residual convolutional network to obtain a spatiotemporal feature of the whole video segment under the constraint of the video segment labels;

S5, constructing human body part graph structures whose nodes are the six part features obtained in steps S2 and S3, plus the video-level spatiotemporal feature obtained in step S4 as one further node, giving seven nodes in total; the two resulting graphs combine, respectively, the vision-based action state features with the video-level spatiotemporal feature, and the language-based action state features with the video-level spatiotemporal feature;

S6, constructing a multi-head graph convolutional network with several parallel, mutually independent branches, each with its own adjacency matrix, which processes the input graph structure in parallel and pools the processed features into a single feature;

S7, using the multi-head graph convolutional network of step S6 to construct several such networks, whose inputs are the two features generated in step S5; building an integrated network model, fusing the results of the multi-head graph convolutional networks through a dynamic cross entropy loss function, and outputting the final predicted category of the fused feature through a fully connected layer for video behavior recognition.
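As a concrete illustration of step S5, the sketch below (NumPy, with the dimensions taken from the later description and randomized feature values) builds the two 7-node graphs; the 512-to-1536 transformation matrix W is a stand-in for the learnable mapping mentioned in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions as described later in the text.
D_VIS, D_LANG, D_VIDEO = 512, 1536, 512

f_P = rng.standard_normal((6, D_VIS))    # six vision-based action state features
f_L = rng.standard_normal((6, D_LANG))   # six language-based action state features
f_V = rng.standard_normal(D_VIDEO)       # video-level spatiotemporal feature

# Graph 1: visual part features + video-level feature (dimensions already match).
graph_visual = np.vstack([f_P, f_V[None, :]])            # (7, 512)

# Graph 2: language part features + video-level feature; a transformation
# matrix W (randomly initialized here, learnable in practice) maps 512 -> 1536.
W = rng.standard_normal((D_VIDEO, D_LANG)) * 0.01
graph_language = np.vstack([f_L, (f_V @ W)[None, :]])    # (7, 1536)
```

Each graph is then handed to its own multi-head graph convolutional network, as step S6 describes.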
Specifically, in step S1, the human body part frames cover 10 parts, namely the head, the trunk, the hands, the two lower limbs, the buttocks and the two feet, and the corresponding 10 part frames are B = {b_1, b_2, …, b_i, …, b_10}, where b_i is the bounding box of the i-th part in the image.
Specifically, in step S2, the vision-based action state feature f^P is:

f^P = [f_1^P, f_2^P, …, f_6^P]

where f_i^P is the vision-based action state feature extracted for the corresponding body part.
Specifically, in step S3, the language-based action state feature f^L is:

f^L = [f_1^L, f_2^L, …, f_6^L]

where f_i^L is the language-based action state feature extracted for the corresponding body part.
Specifically, in step S6, the multi-head graph convolutional network is composed of several branch graph convolutional networks, and one graph convolution layer is expressed as follows:

Z^l = A X^l W^l

where A denotes the adjacency matrix, X^l is the N×d input node feature matrix, W^l is a learnable d×d' parameter matrix, and l indexes the l-th layer of the graph convolutional network.
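A minimal NumPy sketch of one such layer; the tanh activation and the row-normalized, fully connected adjacency matrix are illustrative assumptions, since the patent does not fix them:

```python
import numpy as np

def gcn_layer(A, X, W, activation=np.tanh):
    """One graph convolution layer: Z = act(A X W).

    A: (N, N) adjacency matrix (row-normalized here, so each node
       averages its neighbours' features; an assumed choice).
    X: (N, d) node feature matrix; W: (d, d_out) learnable weights.
    """
    return activation(A @ X @ W)

# Tiny example: a 7-node graph with 512-d features, as in the text.
rng = np.random.default_rng(1)
A = np.ones((7, 7)) / 7.0                  # fully connected, row-normalized
X = rng.standard_normal((7, 512))
W = rng.standard_normal((512, 512)) * 0.02
Z = gcn_layer(A, X, W)                     # updated node features, (7, 512)
```

Stacking such layers (with different W^l per layer) gives the per-branch graph convolutional network.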
Specifically, step S6 includes:

S601, mapping the adjacency matrix in different ways to obtain a multi-head adjacency matrix;

S602, after obtaining the multi-head adjacency matrix, constructing a multi-branch parallel graph convolutional network in which the branches are mutually independent;

S603, aggregating the information of all nodes on each branch using the mapped adjacency matrix, thereby aggregating the information of all points in the graph structure network.
Further, in step S603, the aggregation is specifically:

Z = G(Z_1, Z_2, Z_3, …, Z_m)

where G(·) is the aggregation function, Z_1, Z_2, Z_3, …, Z_m are the output features of the last graph convolution layer on each branch, Z is the aggregated output feature, and m denotes the m graph convolution branches.
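A sketch of steps S601 to S603 under stated assumptions: the branch-specific adjacency mappings and the mean aggregation function G are placeholders, since the text does not specify them:

```python
import numpy as np

def multi_head_gcn(A, X, Ws, maps, agg=np.mean):
    """m parallel graph-convolution branches over differently mapped
    adjacency matrices, aggregated into one feature vector.

    The mappings in `maps` and the mean aggregation are assumptions;
    the patent only states that each branch has its own adjacency matrix.
    """
    outputs = []
    for branch_map, W in zip(maps, Ws):
        A_i = branch_map(A)                      # S601: branch-specific adjacency
        outputs.append(np.tanh(A_i @ X @ W))     # S602: independent branch, Z_i
    Z = agg(np.stack(outputs), axis=0)           # S603: aggregate across branches
    return Z.mean(axis=0)                        # pool nodes into a single feature

rng = np.random.default_rng(2)
A = np.ones((7, 7)) / 7.0
X = rng.standard_normal((7, 512))
Ws = [rng.standard_normal((512, 512)) * 0.02 for _ in range(3)]
maps = [lambda A: A, lambda A: A.T, lambda A: np.eye(7)]   # hypothetical mappings
feat = multi_head_gcn(A, X, Ws, maps)                      # (512,) fused feature
```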
Specifically, in step S7, the video-level spatiotemporal feature is used as an additional graph node alongside the vision-based and language-based action state features, changing the adjacency matrix from 6×6 to 7×7. For the 512-dimensional vision-based action state features, the video-level feature is used as a node directly; for the 1536-dimensional language-based action state features, the video-level feature is first mapped by a transformation matrix W to 1536 dimensions. This yields three inputs: the 512-dimensional video-level feature from the 3D residual network, the 7×512-dimensional combination of vision-based action state features and the video-level feature, and the 7×1536-dimensional combination of language-based action state features and the video-level feature. The video-level feature is classified with a multi-layer perceptron, while the other two inputs are each fused by a multi-head graph convolutional network. The three classifiers thus output three 512-dimensional features, which are fused and then classified by two further fully connected layers.
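The five-classifier wiring described above can be sketched as follows (random weights, a hypothetical class count, and plain MLPs standing in for the multi-head graph convolutional branches):

```python
import numpy as np

rng = np.random.default_rng(3)
n_classes = 10  # hypothetical number of action classes

def mlp_classifier(x, W1, W2):
    """A weak classifier: a 512-d hidden feature, then class scores."""
    h = np.tanh(x @ W1)          # 512-d intermediate feature
    return h, h @ W2             # (feature, logits)

# The three base inputs described in the text (graph features flattened).
f_video = rng.standard_normal(512)            # 3D residual network video feature
f_vis   = rng.standard_normal(7 * 512)        # visual graph feature, flattened
f_lang  = rng.standard_normal(7 * 1536)       # language graph feature, flattened

feats, logits = [], []
for x in (f_video, f_vis, f_lang):
    W1 = rng.standard_normal((x.size, 512)) * 0.02
    W2 = rng.standard_normal((512, n_classes)) * 0.02
    h, z = mlp_classifier(x, W1, W2)
    feats.append(h)
    logits.append(z)

# Classifiers 4 and 5 take the concatenated (cascaded) 512-d features.
cascade = np.concatenate(feats)               # 3 x 512 = 1536-d
for _ in range(2):
    W1 = rng.standard_normal((1536, 512)) * 0.02
    W2 = rng.standard_normal((512, n_classes)) * 0.02
    _, z = mlp_classifier(cascade, W1, W2)
    logits.append(z)
```

The five logit vectors are what the dynamic cross entropy loss of the next step operates on.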
Specifically, in step S7, the dynamic cross entropy loss function is constructed specifically as follows:
S701, let the fully connected layer output of the i-th weak classifier be p_i ∈ R^n, where n is the number of classes finally classified; the outputs of all classifiers are normalized with a softmax function, and the normalized result is q_i;

S702, taking the maximum entry of q_i gives the prediction ŷ_i output by the i-th classifier; the predictions of all classifiers are tallied by voting, and the majority result is defined as y_most, the predicted class that occurs most often;

S703, the classifiers whose prediction equals y_most are trained independently, and the outputs of these classifiers are averaged to obtain the output of the final classifier q̄; the cross entropy loss function is obtained as:

L = -Σ_j y_j log(q̄_j)

where q̄ is the averaged output of the selected classifiers and y_j is the true label of the current input video.
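Steps S701 to S703 can be sketched as follows (the small epsilon for numerical stability is an implementation assumption; the gradient updates for the selected classifiers are omitted):

```python
import numpy as np
from collections import Counter

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_cross_entropy(logits_list, y_true):
    """Keep only the classifiers that agree with the majority-vote
    prediction, average their normalized outputs, and compute the
    cross entropy against the true label."""
    probs = [softmax(z) for z in logits_list]            # S701: normalize
    preds = [int(np.argmax(p)) for p in probs]           # S702: per-classifier vote
    y_most = Counter(preds).most_common(1)[0][0]         # majority result
    selected = [p for p, y in zip(probs, preds) if y == y_most]
    p_bar = np.mean(selected, axis=0)                    # S703: averaged output
    return y_most, -np.log(p_bar[y_true] + 1e-12)        # cross entropy loss

# Five weak classifiers over three classes; three of them predict class 2.
logits = [np.array([0.1, 0.2, 2.0]), np.array([0.0, 0.1, 1.5]),
          np.array([1.8, 0.2, 0.1]), np.array([0.3, 0.1, 2.2]),
          np.array([0.2, 1.9, 0.3])]
y_most, loss = dynamic_cross_entropy(logits, y_true=2)
```

Only the classifiers in the majority then receive the loss, which is what lets each classifier specialize on the samples it already handles well.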
The invention also provides a video behavior recognition system based on an action knowledge base and ensemble learning, which comprises:

the detection module, used for detecting human body part frames in the input video frames;

the extraction module, used for extracting vision-based action state features of different parts via region-of-interest pooling on the part frames B obtained by the detection module, constrained by the human body part states defined in the action knowledge base;

the phrase module, used for forming triplet phrases <part, action state, object> from the vision-based action state features extracted by the extraction module and the action state labels of the corresponding parts, and feeding the triplet phrases into the natural language processing tool Bert to obtain language-based action state features;

the feature module, which inputs the original video frames into a pre-trained and fine-tuned 3D deep residual convolutional network to obtain a spatiotemporal feature of the whole video segment under the constraint of the video segment labels;

the structure module, used for constructing human body part graph structures whose nodes are the six part features obtained by the extraction module and the phrase module, plus the video-level spatiotemporal feature obtained by the feature module as one further node, giving seven nodes in total; the two resulting graphs combine, respectively, the vision-based action state features with the video-level spatiotemporal feature, and the language-based action state features with the video-level spatiotemporal feature;

the network module, used for constructing a multi-head graph convolutional network with several parallel, mutually independent branches, each with its own adjacency matrix, which processes the input graph structure in parallel and pools the processed features into a single feature;

the recognition module, used for constructing several multi-head graph convolutional networks from the network module, whose inputs are the two features generated by the structure module, building an integrated network model, fusing the results of the multi-head graph convolutional networks through the dynamic cross entropy loss function, and outputting the final predicted category of the fused feature through a fully connected layer for video behavior recognition.
Compared with the prior art, the invention has at least the following beneficial effects:
The behavior recognition method based on an action knowledge base and ensemble learning subdivides the human body into parts and uses the action knowledge base to define the states of those parts, taking the correlation between different parts into account. The vision-based and language-based action state features, obtained with the help of the natural language processing tool Bert, are not learned from the dataset distribution but introduce external action knowledge, which greatly improves the robustness of the model. A multi-head graph convolutional network is constructed to fuse these non-structural features, overcoming the difficulty traditional convolutional networks have in processing non-structural data. Constructing multiple adjacency matrices extracts the different correlation information present in the human body part graph and exploits the strengths of the different features, remedying the single adjacency matrix of traditional graph convolution, which cannot extract correlation information sufficiently. Building multiple weak classifiers that process different inputs alleviates the overfitting and weak generalization of a single classifier, and the dynamic cross entropy loss function autonomously selects which classifiers to use during training and testing, further amplifying each classifier's performance advantage on specific data.
Furthermore, the body of the moving subject in the video is divided into ten parts according to human body structural knowledge, so that features are extracted not only from the motion information of the whole video but also from the motion of each body part, which is analyzed separately. Extracting the global and local information of the video at the same time improves the recognition effect.

Further, the vision-based motion state features of each part are obtained using the guiding constraints of each part in the behavior action knowledge base, so that interference from the global and background information on recognizing the behavior of the moving subject is suppressed. Focusing on the visual information of local motion makes it easier to find the visual cues that are key to behavior recognition.

Further, the language-based motion state information of each part is obtained using the guiding constraints of each part in the behavior action knowledge base together with the semantic relations between parts captured by natural language processing. Extracting the interaction information between each part and the environment reveals the information shared between the local parts and the whole, improving recognition accuracy for actions that interact with the environment.

Furthermore, the visual and language features corresponding to different human body parts are updated with a graph convolutional network, which respects the graph structure of the part features; node information interaction through graph convolution is more effective than through a traditional convolutional network.

Furthermore, the multi-head graph convolutional network remedies the limited ability of a single adjacency matrix to represent the connection information of the graph structure; mapping the original adjacency matrix into multiple branches helps discover the different correlation information in the graph structure and fuses node information more fully.

Furthermore, the fused node information is further condensed with an aggregation function into a single feature with structural characteristics, which is convenient to classify with a multi-layer perceptron.

Further, several weak classifier models are constructed, each with a different input, so that the diversity of classifier inputs reduces the variance of a single classifier, alleviates overfitting, and improves the effect of the model on test data.

Furthermore, the dynamic cross entropy loss function screens the classifiers during training, so that each sample trains the classifiers that perform well on it; this strengthens the classification ability of individual classifiers on their corresponding samples and thus of the whole integrated model.

In summary, by introducing the two kinds of features based on the action knowledge base, the invention effectively improves the robustness of the model on the behavior recognition task; the ensemble learning method and the dynamic cross entropy loss function bring out the distinctive advantage of each weak classifier and improve the classification accuracy of the behavior recognition task.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a network diagram of visual and linguistic based action state feature extraction based on an action knowledge base;
FIG. 3 is a human body part frame generation diagram;
FIG. 4 is a 3D residual convolution network diagram;
FIG. 5 is a diagram of the ensemble learning and the dynamic cross entropy loss function.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Various structural schematic diagrams according to the disclosed embodiments of the present invention are shown in the accompanying drawings. The figures are not drawn to scale, wherein certain details are exaggerated for clarity of presentation and may have been omitted. The shapes of the various regions, layers and their relative sizes, positional relationships shown in the drawings are merely exemplary, may in practice deviate due to manufacturing tolerances or technical limitations, and one skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions as actually required.
The invention provides a video behavior recognition method based on an action knowledge base and ensemble learning. A 3D deep residual network extracts global features from the input video, and the action knowledge base is used to extract vision-based and language-based action state features. The extracted features are organized into graph structures according to human body parts, and a multi-head graph convolutional feature fusion network fuses information over the constructed graphs. Five weak classifiers with similar structures are built: the first three take the three kinds of features as input, and the last two take the concatenated features. A dynamic cross entropy loss function integrates and classifies the outputs of the different weak classifiers, classifying the actions contained in a video segment and improving classification accuracy.
Referring to fig. 1, the method for identifying video behaviors based on action knowledge base and ensemble learning of the present invention includes the following steps:
S1, detecting human body part frames in the input video frames using AlphaPose;
Keypoints with confidence higher than 0.7 are treated as visible. Each obtained keypoint is taken as the center of a generated human body part frame, whose size is one third of the distance to the nearest neighboring keypoint. Ten human body part keypoints are finally selected for the head, trunk, hands, two lower limbs, buttocks and feet, and the corresponding 10 part frames are B = {b_1, b_2, …, b_i, …, b_10}, where b_i is the bounding box of the i-th part in the image. The detection result is shown in fig. 3.
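A sketch of this part-frame generation rule under the stated constants (confidence threshold 0.7, box size one third of the nearest-keypoint distance); square boxes centered on the keypoint are an assumption, since the exact box shape is not specified:

```python
import numpy as np

def part_boxes(keypoints, scores, conf_thresh=0.7, scale=1/3):
    """Build square part boxes around visible keypoints: box center at
    the keypoint, side length = one third of the distance to the
    nearest other visible keypoint."""
    visible = [i for i, s in enumerate(scores) if s > conf_thresh]
    boxes = {}
    for i in visible:
        others = [keypoints[j] for j in visible if j != i]
        d = min(np.linalg.norm(keypoints[i] - o) for o in others)
        half = (d * scale) / 2
        cx, cy = keypoints[i]
        boxes[i] = (cx - half, cy - half, cx + half, cy + half)
    return boxes

# Three toy keypoints; the third is below the visibility threshold.
kps = np.array([[10.0, 10.0], [10.0, 40.0], [50.0, 10.0]])
boxes = part_boxes(kps, scores=[0.9, 0.8, 0.5])
```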
S2, extracting vision-based action state features of different parts via region-of-interest pooling on the part frames B obtained in step S1, constrained by the human body part states in the action knowledge base;
referring to fig. 2, according to the obtained part frames, the features corresponding to the human body part frames are obtained through region-of-interest pooling. The obtained features are input into a 512-dimensional fully connected layer to get the final state score of the corresponding part; a cross-entropy loss function is constructed from the state score and the action state label in the action knowledge base, and the output of the last fully connected layer is taken as the vision-based action state feature. Only one feature is extracted for each symmetrical pair of human body parts, so six vision-based action state features are extracted, where f_i^P is the vision-based action state feature extracted for the corresponding body part.
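The per-part head described above can be sketched as below; this is a minimal NumPy stand-in, and the input dimension, number of states, and initialisation scheme are illustrative assumptions, not values from the patent.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class StateHead:
    """Sketch of the step-S2 head: a 512-dimensional fully connected
    layer whose output serves as the vision-based action state feature,
    followed by a state-scoring layer trained with cross-entropy
    against the knowledge-base state label."""

    def __init__(self, in_dim, n_states, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.01, (in_dim, 512))
        self.W2 = rng.normal(0.0, 0.01, (512, n_states))

    def forward(self, roi_feat):
        feat = np.maximum(roi_feat @ self.W1, 0.0)  # 512-d state feature
        scores = softmax(feat @ self.W2)            # per-state confidences
        return feat, scores

    @staticmethod
    def cross_entropy(scores, label):
        # loss against the action state label from the knowledge base
        return -np.log(scores[label] + 1e-12)
```

In the actual method the RoI-pooled feature would come from the backbone; here any flat vector of the right length stands in for it.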
S3, using the vision-based action state features extracted in step S2 together with the action state labels of the corresponding parts to form the triplet phrase &lt;part, action state, object&gt;, and inputting it into the natural language processing tool BERT to obtain language-based action state features;
Referring to fig. 2, the vision-based action state features extracted in step S2 are used together with the detected action state labels of the corresponding parts; meanwhile, the objects in the image are detected by the method of step S1 to generate object features, and if no object exists in the image, the feature generated from the global image is used as the object feature. The triplet phrase &lt;part, action state, object&gt; is formed and input into the natural language processing tool BERT to obtain the language-based action state features. Only one feature is extracted for each symmetrical pair of human body parts, so six language-based action state features are extracted, where f_i^L is the language-based action state feature extracted for the corresponding body part.
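Assembling the triplet phrase is straightforward; a minimal sketch is below. The fallback word used when no object is detected is a hypothetical placeholder (the text instead substitutes the global-image feature), and the resulting string is what would be handed to BERT for encoding.

```python
def build_triplet_phrase(part, state, obj=None):
    """Assemble the <part, action state, object> phrase that step S3
    feeds to BERT.  'scene' is an assumed placeholder for the
    no-object case."""
    return f"{part} {state} {obj if obj is not None else 'scene'}"
```

For example, a detected hand in a "hold" state near a cup would yield the phrase "hand hold cup".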
S4, inputting the original input video frames into a pre-trained and fine-tuned 3D deep residual convolutional network, and obtaining the space-time features of the whole video segment under the constraint of the video segment labels;
referring to fig. 4, the 3D deep residual convolutional network evolves from the classical 2D residual convolutional network: the 2D convolutions in the original network are converted into 3D convolutions, and the original 4-dimensional image input becomes 5-dimensional video data. Skip-layer connections are added to alleviate the gradient vanishing problem caused by increasing the number of network layers. The original input video frames are input into the 3D deep residual convolutional network, pre-trained and fine-tuned on a designated data set, and under the constraint of the video segment label, the length-512 output of the last fully connected layer is taken as the space-time feature of the whole video segment.
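The skip-layer idea can be sketched as follows. This is not the patent's network: the convolution is replaced by a random channel-mixing stand-in so the block stays self-contained, but the identity shortcut, which is what alleviates gradient vanishing, is the same structural device.

```python
import numpy as np

def conv3d_stub(x, seed):
    """Stand-in for a 3D convolution: a random 1x1x1 channel-mixing
    map applied at every (t, h, w) location (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.1, (x.shape[0], x.shape[0]))
    return np.einsum('ij,jthw->ithw', w, x)

def residual_block_3d(x):
    """Skip connection of a 3D residual block: the input (C, T, H, W)
    is added back to the branch output, easing gradient flow as the
    network deepens."""
    y = np.maximum(conv3d_stub(x, seed=1), 0.0)
    y = conv3d_stub(y, seed=2)
    return np.maximum(y + x, 0.0)  # identity shortcut
```

With the batch axis included, the input becomes the 5-dimensional video tensor the text mentions.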
S5, constructing human body part graph structures, wherein the nodes in a graph are the features of the six different parts obtained in steps S2 and S3, plus the space-time feature of the whole video segment obtained in step S4 as one more node, so each graph structure has seven nodes in total; the two resulting graph structures are, respectively, a graph formed by the vision-based action state features plus the whole-video-segment space-time feature, and a graph formed by the language-based action state features plus the whole-video-segment space-time feature;
the adjacency matrix formed by the edges of the graph structure is a network parameter, and the corresponding parameters are learned independently by the network.
S6, constructing a multi-head graph convolutional network, in which several parallel branches exist; each branch exists independently, the adjacency matrices in the branches differ, the input graph structures are processed in parallel, and the processed features are pooled to finally form one feature;
the multi-head graph convolutional network is composed of several branched graph convolutional networks. A graph convolutional network can integrate the information of a node and its surrounding nodes, realizing information interaction between them, and as graph convolution layers are stacked, the node features are continuously updated. The graph convolutional neural network is represented as follows:
Z^l = A X^l W^l
where A denotes the adjacency matrix, X is an N×d matrix representing the input node features, W is a matrix of learnable parameters, and l denotes the l-th layer of the graph convolutional neural network.
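The layer Z = A X W above is a single matrix product chain and can be written directly; a minimal sketch:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph convolution layer Z = A X W: A (N x N) mixes each
    node's feature with its neighbours', and W (d x d') transforms
    the mixed features."""
    return A @ X @ W
```

With A equal to the identity, the layer degenerates to an ordinary per-node linear map, which makes the role of the adjacency matrix as the neighbour-mixing operator easy to see.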
S601, mapping the adjacency matrix into different forms;
a single adjacency matrix cannot reflect the different connection patterns among human body parts: the limbs that exchange information differ across action types; for example, kicking a football is completed by the legs and feet in cooperation, while drinking water is completed by the upper limbs and the head in cooperation. A single matrix therefore cannot fully extract the information shared between the features. Based on this consideration, the improved multi-head graph convolutional network first maps the adjacency matrix into different forms:
where A is the initial adjacency matrix, initialized with a uniform strategy; a 6×6 mapping matrix transforms it, and A_i is the mapped multi-head adjacency matrix.
S602, after the multi-head adjacency matrices are obtained, constructing a multi-branch parallel graph convolutional neural network, wherein the branches are mutually independent and use the mapped adjacency matrices; the specific process is as follows:
where l ∈ {1, 2, 3}. Through the multi-head graph convolutional network, each branch fuses information among the different nodes, and because each branch is based on a different adjacency matrix, the features it finally fuses have their own independent characteristics.
S603, aggregating the information of all nodes on each branch;
where G(·) is an aggregation function that gathers the information of all nodes on the graph structure; any function with an aggregating effect may be used, and mean-pooling is used in the present invention.
Through the above flow, the features extracted by several independent graph convolutional networks are obtained: each graph convolutional network fuses the N×d node features into one d-dimensional feature, so m parallel graph convolution branches yield m d-dimensional features.
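Steps S601 to S603 can be sketched together as below. This is a hedged reconstruction: the per-head mapping matrices and weights are randomly initialised here purely for illustration, whereas in the method they are learned network parameters.

```python
import numpy as np

def multi_head_gcn(A, X, n_heads=3, seed=0):
    """Sketch of S601-S603: each head maps the shared initial
    adjacency A through its own matrix, runs an independent graph
    convolution, and mean-pools the N node features into one
    d-dimensional feature."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    outs = []
    for _ in range(n_heads):
        M = rng.normal(0.0, 0.1, (N, N))  # per-head mapping matrix
        A_i = A @ M                        # head-specific adjacency (S601)
        W = rng.normal(0.0, 0.1, (d, d))
        Z = np.maximum(A_i @ X @ W, 0.0)   # independent branch (S602)
        outs.append(Z.mean(axis=0))        # mean-pooling G(.): N x d -> d (S603)
    return outs                            # n_heads d-dimensional features
```

Each head sees the same nodes but a different connectivity, which is what lets the branches specialise to different part interactions.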
S7, constructing several multi-head graph convolutional networks by means of the multi-head graph convolutional network of step S6, whose inputs are respectively the two features generated in step S5; constructing an integrated network model, fusing the results of the multi-head graph convolutional network models through the dynamic cross-entropy loss function, and outputting the final prediction category of the fused features through a fully connected layer for video behavior recognition.
Referring to fig. 5, in order not to make the performance difference between the different parts too large, the space-time feature of the whole video is also used as one graph node alongside the vision-based and language-based action state features, so the size of the adjacency matrix changes from 6×6 to 7×7. For the length-512 vision-based action state features, the video space-time feature is used directly as a node; for the length-1536 language-based action state features, a transformation matrix W converts the video space-time feature into a 1536-dimensional feature.
A total of three inputs are thus formed: the 512-dimensional video-level feature of the 3D residual network; the 7×512-dimensional combination of the vision-based action state features and the video-level feature; and the 7×1536-dimensional combination of the language-based action state features and the video-level feature.
Finally, in view of the characteristics of video recognition, a multi-layer perceptron is used to classify the video-level feature, and two multi-head graph convolutional networks are used to fuse the other two inputs respectively. The three classifiers output three kinds of features, each 512-dimensional; after the features are fused, two fully connected layers perform the respective classifications.
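Assembling the three inputs above can be sketched as follows; the initialisation of the transformation matrix W is an illustrative assumption, since in the method it is a learned parameter.

```python
import numpy as np

def build_inputs(video_feat, vis_feats, lang_feats, seed=0):
    """Assemble the three classifier inputs: the 512-d video-level
    feature alone; the six 512-d vision-based state features stacked
    with the video feature (7 x 512); and the six 1536-d language-based
    state features stacked with the video feature after projection
    by a matrix W (7 x 1536)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.01, (512, 1536))  # learnable transformation W
    g_vis = np.vstack([vis_feats, video_feat[None, :]])
    g_lang = np.vstack([lang_feats, (video_feat @ W)[None, :]])
    return video_feat, g_vis, g_lang
```

The stacked 7×d arrays are exactly the node-feature matrices X fed to the two multi-head graph convolutional networks.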
The construction of the dynamic cross entropy loss function is specifically as follows:
s701, setting the output of the fully connected layers of all classifiers, where i represents the i-th weak classifier and n is the number of classes to be finally classified; the outputs of all classifiers are normalized using a softmax function, and the normalized result is defined accordingly.
where the normalized output is the confidence with which the classifier predicts class i.
S702, obtaining the majority result by voting; the majority result approximates the correct result and is defined as y_most, the prediction category that occurs the most times;
where the term is the current prediction result of the i-th weak classifier, and most_cur(·) is a function returning the prediction class with the largest number of occurrences overall.
S703, only the classifiers whose vote equals y_most participate in back-propagation, i.e., only the corresponding classifiers are back-propagated; the outputs of these classifiers are averaged to obtain the output of the final classifier.
where M is the number of weak classifiers satisfying the above condition.
The final loss function is still the cross-entropy loss function L_cla:
where the averaged output of the selected classifiers is used and y_j is the ground-truth label of the current input video.
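The dynamic cross-entropy of steps S701 to S703 can be sketched end to end; this is a minimal reconstruction of the voting-then-averaging rule, with the back-propagation restriction implied by which classifiers contribute to the loss.

```python
import numpy as np
from collections import Counter

def dynamic_ce_loss(logits_list, y_true):
    """Dynamic cross-entropy sketch: softmax each weak classifier's
    output (S701), take the majority-vote class y_most (S702), average
    the outputs of only the classifiers that predicted y_most, and
    apply cross-entropy to the average (S703)."""
    probs = [np.exp(z - z.max()) / np.exp(z - z.max()).sum()
             for z in logits_list]                 # S701: softmax per head
    preds = [int(p.argmax()) for p in probs]
    y_most = Counter(preds).most_common(1)[0][0]   # S702: majority vote
    chosen = [p for p, c in zip(probs, preds) if c == y_most]
    avg = np.mean(chosen, axis=0)                  # S703: average of M heads
    return -np.log(avg[y_true] + 1e-12)           # cross-entropy L_cla
```

Because only the majority-voting heads enter the average, gradients flow back only through those heads, which is the "only the corresponding classifier is back-propagated" behaviour described above.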
In still another embodiment of the present invention, a video behavior recognition system based on an action knowledge base and ensemble learning is provided, which can be used to implement the above video behavior recognition method; specifically, the system includes a detection module, an extraction module, a phrase module, a feature module, a structure module, a network module, and a recognition module.
The detection module is used for detecting the human body part frame of the input video frame to obtain the human body part frame;
The extraction module is used for extracting vision-based action state features of different parts through region-of-interest pooling according to the human body part frame B obtained by the detection module, with the constraints on human body part states in the action knowledge base;
the phrase module is used for forming the triplet phrase &lt;part, action state, object&gt; from the vision-based action state features extracted by the extraction module together with the action state labels of the corresponding parts, and inputting it into the natural language processing tool BERT to obtain language-based action state features;
the feature module inputs the original input video frames into a pre-trained and fine-tuned 3D deep residual convolutional network, and obtains the space-time features of the whole video segment under the constraint of the video segment labels;
the structure module is used for constructing human body part graph structures, wherein the nodes in a graph are the features of the six different parts obtained by the extraction module and the phrase module, plus the space-time feature of the whole video segment obtained by the feature module as one more node, so each graph structure has seven nodes in total; the two resulting graph structures are, respectively, a graph formed by the vision-based action state features plus the whole-video-segment space-time feature, and a graph formed by the language-based action state features plus the whole-video-segment space-time feature;
The network module is used for constructing a multi-head graph convolutional network, in which several parallel branches exist; each branch exists independently, the adjacency matrices in the branches differ, the input graph structures are processed in parallel, and the processed features are pooled to finally form one feature;
the recognition module is used for constructing several multi-head graph convolutional networks by means of the multi-head graph convolutional network of the network module, whose inputs are respectively the two features generated in the structure module; an integrated network model is constructed, the results of the multi-head graph convolutional network models are fused through the dynamic cross-entropy loss function, and the fused features output the final prediction category through a fully connected layer for video behavior recognition.
In yet another embodiment of the present invention, a terminal device is provided, the terminal device including a processor and a memory, the memory being used for storing a computer program, the computer program including program instructions, and the processor being used for executing the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computational and control core of the terminal, adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor in the embodiment of the invention can be used for the operation of the video behavior recognition method based on the action knowledge base and ensemble learning, including:
Detecting a human body part frame of an input video frame to obtain the human body part frame; according to the human body part frame B, extracting vision-based action state features of different parts through region-of-interest pooling, with the constraints on human body part states in the action knowledge base; forming the triplet phrase &lt;part, action state, object&gt; from the vision-based action state features together with the action state labels of the corresponding parts, and inputting it into the natural language processing tool BERT to obtain language-based action state features; inputting the original input video frames into a pre-trained and fine-tuned 3D deep residual convolutional network, and obtaining the space-time feature of the whole video segment under the constraint of the video segment labels; constructing human body part graph structures, wherein the nodes in a graph are the obtained features of the six different parts, plus the space-time feature of the whole video segment as one more node, so each graph structure has seven nodes in total; the two resulting graph structures are, respectively, a graph formed by the vision-based action state features plus the whole-video-segment space-time feature, and a graph formed by the language-based action state features plus the whole-video-segment space-time feature; constructing a multi-head graph convolutional network, in which several parallel branches exist, each branch exists independently, the adjacency matrices in the branches differ, the input graph structures are processed in parallel, and the processed features are pooled to finally form one feature; and constructing several multi-head graph convolutional networks by means of the multi-head graph convolutional network, whose inputs are respectively the two generated features, constructing an integrated network model, fusing the results of the multi-head graph convolutional network models through the dynamic cross-entropy loss function, and outputting the final prediction category of the fused features through a fully connected layer for video behavior recognition.
In a further embodiment of the present invention, the present invention also provides a storage medium, in particular, a computer readable storage medium (Memory), which is a Memory device in a terminal device, for storing programs and data. It will be appreciated that the computer readable storage medium herein may include both a built-in storage medium in the terminal device and an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the above-described embodiments with respect to a video behavior recognition method based on an action knowledge base and ensemble learning; one or more instructions in a computer-readable storage medium are loaded by a processor and perform the steps of:
Detecting a human body part frame of an input video frame to obtain the human body part frame; according to the human body part frame B, extracting vision-based action state features of different parts through region-of-interest pooling, with the constraints on human body part states in the action knowledge base; forming the triplet phrase &lt;part, action state, object&gt; from the vision-based action state features together with the action state labels of the corresponding parts, and inputting it into the natural language processing tool BERT to obtain language-based action state features; inputting the original input video frames into a pre-trained and fine-tuned 3D deep residual convolutional network, and obtaining the space-time feature of the whole video segment under the constraint of the video segment labels; constructing human body part graph structures, wherein the nodes in a graph are the obtained features of the six different parts, plus the space-time feature of the whole video segment as one more node, so each graph structure has seven nodes in total; the two resulting graph structures are, respectively, a graph formed by the vision-based action state features plus the whole-video-segment space-time feature, and a graph formed by the language-based action state features plus the whole-video-segment space-time feature; constructing a multi-head graph convolutional network, in which several parallel branches exist, each branch exists independently, the adjacency matrices in the branches differ, the input graph structures are processed in parallel, and the processed features are pooled to finally form one feature; and constructing several multi-head graph convolutional networks by means of the multi-head graph convolutional network, whose inputs are respectively the two generated features, constructing an integrated network model, fusing the results of the multi-head graph convolutional network models through the dynamic cross-entropy loss function, and outputting the final prediction category of the fused features through a fully connected layer for video behavior recognition.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The effects of the present invention are further described below with reference to simulation diagrams.
1. Simulation conditions
The hardware conditions of the simulation are: an intelligent perception and image understanding laboratory graphic workstation with four 8-core Intel Xeon E5-2630 CPUs at a main frequency of 2.4 GHz, 64 GB of memory, and an Nvidia TITAN X GPU with 12 GB of video memory. The operating system is Ubuntu 18.04, and the deep learning framework is PyTorch. The data sets used in the simulation are UCF101 and HMDB51, with 101 and 51 action categories, respectively. Five main categories can be distinguished: human-object interaction, actions of the main person only, human-human interaction, playing various musical instruments, and various common sports performed by the main person, such as basketball, high jump, and eyebrow tattooing. The diversity of each category is guaranteed to the greatest extent: backgrounds are different, complex and changeable; viewing angles differ, from close-up to distant views; and there is rich diversity in illumination conditions, person scale, camera motion, and so on. UCF101 has 9537 training videos and 3783 test videos. HMDB51 has 6766 video clips; for each action category, the videos are divided into a training set of 70 clips and a test set of 30 clips, with no overlap between the training and test sets, i.e., no training and test videos come from the same source video.
2. Emulation content
Using the two data sets, the method of the invention performs behavior recognition, and the comparison with other model results is shown in Table 1.
Table 1: Accuracy of the method on the HMDB51 and UCF101 datasets
3. Simulation result analysis
As can be seen from table 1, to avoid the large computational cost and time of computing optical flow, the method inputs only RGB images into the model. The overall accuracy of the ensemble method with multi-head graph convolution fusion improves by 0.4 percentage points on UCF101 and by 0.47 percentage points on the HMDB51 dataset. Meanwhile, compared with some classical methods the method performs well; in particular, compared with methods that add optical flow input, the method achieves experimental results comparable to those of Two-Stream, VideoLSTM and other optical-flow-based methods while using only images.
In summary, in the video behavior recognition method and system based on the action knowledge base and ensemble learning, the local information of the human body parts is extracted using the action knowledge base on the one hand, and the global motion information in the video is extracted using the convolutional network on the other hand; the information is fused by constructing the multi-branch graph convolutional network, and the local and global motion information is considered comprehensively, so the accuracy of the final behavior recognition is improved. In addition, the ensemble learning method constructs several weak classifiers, reducing the overall variance of the model and improving its performance on test data.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.
Claims (10)
1. A video behavior recognition method based on an action knowledge base and ensemble learning, characterized by comprising the following steps:
S1, detecting a human body part frame of an input video frame to obtain the human body part frame;
s2, extracting vision-based action state features of different parts through region-of-interest pooling according to the human body part frame B obtained in step S1, with the constraints on human body part states in the action knowledge base;
s3, using the vision-based action state features extracted in step S2 together with the action state labels of the corresponding parts to form the triplet phrase &lt;part, action state, object&gt;, and inputting it into the natural language processing tool BERT to obtain language-based action state features;
s4, inputting the original input video frames into a pre-trained and fine-tuned 3D deep residual convolutional network, and obtaining the space-time features of the whole video segment under the constraint of the video segment labels;
s5, constructing human body part graph structures, wherein the nodes in a graph are the features of the six different parts obtained in steps S2 and S3, plus the space-time feature of the whole video segment obtained in step S4 as one more node, so each graph structure has seven nodes in total; the two resulting graph structures are, respectively, a graph formed by the vision-based action state features plus the whole-video-segment space-time feature, and a graph formed by the language-based action state features plus the whole-video-segment space-time feature;
S6, constructing a multi-head graph convolutional network, in which several parallel branches exist; each branch exists independently, the adjacency matrices in the branches differ, the input graph structures are processed in parallel, and the processed features are pooled to finally form one feature;
s7, constructing several multi-head graph convolutional networks by means of the multi-head graph convolutional network of step S6, whose inputs are respectively the two features generated in step S5; constructing an integrated network model, fusing the results of the multi-head graph convolutional network models through the dynamic cross-entropy loss function, and outputting the final prediction category of the fused features through a fully connected layer for video behavior recognition.
2. The method according to claim 1, wherein in step S1 the human body part frames cover 10 parts, namely the head, trunk, two hands, two lower limbs, buttocks and two feet, and the corresponding 10 part frames are B = {b_1, b_2, …, b_i, …, b_10}, where b_i is the bounding box of the i-th object in the image.
3. The method according to claim 1, wherein in step S2 the vision-based action state feature f^P is:
wherein,, Visual-based motion state features extracted for corresponding body parts.
4. The method according to claim 1, wherein in step S3, the language-based action state feature f_L is:

f_L = {f_L^1, f_L^2, ..., f_L^6},

where f_L^i is the language-based action state feature extracted for the corresponding body part.
5. The method according to claim 1, wherein in step S6 the multi-head graph convolutional network is composed of a plurality of branch graph convolutional networks, the graph convolutional neural network being expressed as follows:
Z^l = A X^l W^l
wherein A denotes the adjacency matrix, X is an N×d matrix representing the input features, W is a matrix of learnable parameters of dimension d, and l denotes the l-th layer of the graph convolutional neural network.
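As a rough illustration, the layer formula above can be sketched in Python with NumPy. The 7-node graph and the 512-dimensional features follow steps S5 and S2 of the claim; the adjacency and weight values are random placeholders, not the patent's learned parameters:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer, Z^l = A X^l W^l:
    A is the (N, N) adjacency matrix, X the (N, d) node features,
    W the (d, d_out) learnable weight matrix."""
    return A @ X @ W

# Toy 7-node graph as in step S5 (6 body-part nodes + 1 video-level node).
rng = np.random.default_rng(0)
A = np.full((7, 7), 1.0 / 7.0)       # placeholder row-normalized adjacency
X = rng.standard_normal((7, 512))    # 512-d vision-based node features
W = rng.standard_normal((512, 256))  # example learnable weights
Z = gcn_layer(A, X, W)
print(Z.shape)  # (7, 256)
```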
6. The method according to claim 1, wherein step S6 specifically comprises:
S601, mapping the adjacency matrix into different forms to obtain a multi-head adjacency matrix;
S602, after the multi-head adjacency matrix is obtained, constructing a multi-path parallel graph convolutional neural network in which the paths are mutually independent;
S603, aggregating the information of all nodes on each path using the mapped adjacency matrices, thereby aggregating the information of all points on the graph structure network.
7. The method according to claim 6, wherein in step S603, the aggregation is specifically:

Z = G(Z_1, Z_2, Z_3, ..., Z_m),

where G(·) is the aggregation function, Z_1, Z_2, ..., Z_m are the output features of the last graph-convolution layer of each path, Z is the aggregated output feature, and m denotes the m paths of graph convolution.
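A minimal sketch of the multi-path structure and the aggregation, assuming mean aggregation as a stand-in for G(·) and a ReLU between layers (the claim fixes neither; both are illustrative assumptions):

```python
import numpy as np

def multi_head_gcn(X, adjacencies, weight_stacks):
    """Run m independent graph-convolution branches, each with its own
    mapped adjacency matrix, then aggregate the last-layer outputs
    Z_1 ... Z_m with G(.) (mean is used here as a placeholder)."""
    outputs = []
    for A, Ws in zip(adjacencies, weight_stacks):
        Z = X
        for W in Ws:                      # stacked layers of one branch
            Z = np.maximum(A @ Z @ W, 0)  # Z^l = A X^l W^l, then ReLU (assumed)
        outputs.append(Z)
    return np.mean(outputs, axis=0)       # G(Z_1, ..., Z_m)

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 64))
m = 3                                     # three parallel paths
adjacencies = [np.eye(7) + 0.1 * rng.random((7, 7)) for _ in range(m)]
weight_stacks = [[rng.standard_normal((64, 64)) for _ in range(2)]
                 for _ in range(m)]
Z = multi_head_gcn(X, adjacencies, weight_stacks)
print(Z.shape)  # (7, 64)
```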
8. The method according to claim 1, wherein in step S7, the spatio-temporal feature based on the whole video is taken as one graph node alongside the vision-based and the language-based action state features, so the size of the adjacency matrix grows from 6×6 to 7×7. For the 512-dimensional vision-based action state features the video-level feature is used as a node directly; for the 1536-dimensional language-based action state features, a transformation matrix W is used to map the 512-dimensional video feature to a 1536-dimensional feature. Three inputs are thus formed: the 512-dimensional video-level feature from the 3D residual network, the 7×512-dimensional combination of the vision-based action state features and the video-level feature, and the 7×1536-dimensional combination of the language-based action state features and the video-level feature. A multi-layer perceptron classifies the video-level feature, and the two multi-head graph convolutional networks are fused respectively; the three classifiers output three features in total, each 512-dimensional, and after fusing the features they are classified by two fully connected layers.
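The dimension alignment described above (appending the video-level feature as a seventh node, and mapping 512 → 1536 with a transformation matrix W for the language graph) can be sketched as follows; all tensors are random placeholders standing in for the features named in the claim:

```python
import numpy as np

rng = np.random.default_rng(1)
video_feat   = rng.standard_normal(512)        # video-level spatio-temporal feature (3D ResNet)
visual_nodes = rng.standard_normal((6, 512))   # six vision-based part features
lang_nodes   = rng.standard_normal((6, 1536))  # six language-based part features (Bert)

# Visual graph: dimensions already match, append the video feature directly.
g_visual = np.vstack([visual_nodes, video_feat])   # 7 x 512

# Language graph: first map the 512-d video feature to 1536-d with W.
W = rng.standard_normal((512, 1536))               # transformation matrix
g_lang = np.vstack([lang_nodes, video_feat @ W])   # 7 x 1536

print(g_visual.shape, g_lang.shape)  # (7, 512) (7, 1536)
```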
9. The method according to claim 1, wherein in step S7, the dynamic cross-entropy loss function is constructed as follows:
S701, let the output of the fully connected layer of each classifier be o_i, where i denotes the i-th weak classifier and n is the number of final classes; the outputs of all classifiers are normalized using a softmax function, the normalized result being ŷ_i;
S702, taking the maximum of the results in ŷ_i gives the prediction result output by the i-th classifier; the prediction results of all classifiers are counted by voting to obtain the majority result, defined as y_most, i.e. the prediction category that occurs most often;
S703, the classifiers whose output result is y_most are trained independently, and the outputs of the corresponding classifiers are averaged to obtain the output of the final classifier ŷ_avg; the cross-entropy loss function is then obtained as L = −Σ_j y_j log(ŷ_avg^j), where ŷ_avg is the averaged output result of the selected classifiers and y_j is the true label of the current input video.
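Steps S701 to S703 can be sketched as below. The logits are toy placeholders, and the loss form is a reconstruction from the claim's description, not the patent's exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_cross_entropy(classifier_logits, true_label):
    """S701: softmax-normalize each classifier's output.
    S702: take each classifier's argmax prediction and majority-vote y_most.
    S703: average the outputs of classifiers that predicted y_most and
    compute cross entropy of that average against the true label."""
    probs = np.array([softmax(o) for o in classifier_logits])
    preds = probs.argmax(axis=1)
    y_most = np.bincount(preds).argmax()        # majority-vote class
    avg = probs[preds == y_most].mean(axis=0)   # average the selected outputs
    return -np.log(avg[true_label])

# Three weak classifiers over 4 classes; two of them agree on class 2.
logits = [np.array([0.1, 0.2, 2.0, 0.0]),
          np.array([0.0, 0.1, 1.5, 0.3]),
          np.array([1.8, 0.2, 0.1, 0.0])]
loss = dynamic_cross_entropy(logits, true_label=2)
print(loss > 0)  # True
```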
10. A video behavior recognition system based on an action knowledge base and ensemble learning, characterized by comprising:
the detection module, used for performing human body part frame detection on the input video frames to obtain the human body part frames;
the extraction module, used for extracting the vision-based action state features of the different parts through a region-of-interest pooling operation, according to the human body part frames B obtained by the detection module and under the constraint conditions of the human body part states in the action knowledge base;
the phrase module, used for forming triplet phrases <part, action state, object> from the vision-based action state features extracted by the extraction module together with the action state labels of the corresponding parts, and inputting them into the natural language processing tool Bert to obtain the language-based action state features;
the feature module, used for inputting the original input video frames into a pre-trained and fine-tuned 3D deep residual convolutional network and obtaining the spatio-temporal feature based on the whole video segment under the constraint of the video segment labels;
the structure module, used for constructing a human body part graph structure, wherein the nodes of the graph are the features of the six different parts obtained in the extraction module and the phrase module, and the spatio-temporal feature based on the whole video segment obtained in the feature module is taken as one further node, so that the graph structure has seven nodes in total; the two finally obtained graph structures are, respectively, a graph structure formed from the vision-based action state features plus the whole-segment spatio-temporal feature, and a graph structure formed from the language-based action state features plus the whole-segment spatio-temporal feature;
the network module, used for constructing a multi-head graph convolutional network, in which a plurality of parallel branches exist; each branch exists independently and the adjacency matrices in the branches differ; the input graph structure is processed in parallel, and a binary pooling operation is performed on the processed features to finally form a single feature;
the recognition module, used for constructing a plurality of multi-head graph convolutional networks using the network module, whose inputs are respectively the two graph structures generated in the structure module; an integrated network model is constructed, the results of the multi-head graph convolutional network models are fused through a dynamic cross-entropy loss function, and the fused feature is passed through a fully connected layer to output the final prediction category for video behavior recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110618201.8A CN113313039B (en) | 2021-05-31 | 2021-05-31 | Video behavior recognition method and system based on action knowledge base and ensemble learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113313039A CN113313039A (en) | 2021-08-27 |
CN113313039B true CN113313039B (en) | 2023-07-25 |
Family
ID=77377189
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110618201.8A Active CN113313039B (en) | 2021-05-31 | 2021-05-31 | Video behavior recognition method and system based on action knowledge base and ensemble learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113313039B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126218A (en) * | 2019-12-12 | 2020-05-08 | 北京工业大学 | Human behavior recognition method based on zero sample learning |
CN111325099A (en) * | 2020-01-21 | 2020-06-23 | 南京邮电大学 | Sign language identification method and system based on double-current space-time diagram convolutional neural network |
WO2020228141A1 (en) * | 2019-05-13 | 2020-11-19 | 清华大学 | Electromagnetic signal identification method and device for constructing graph convolutional network on basis of implicit knowledge |
CN112597883A (en) * | 2020-12-22 | 2021-04-02 | 武汉大学 | Human skeleton action recognition method based on generalized graph convolution and reinforcement learning |
Non-Patent Citations (1)
Title |
---|
A Survey of Deep Learning-Based Behavior Recognition Algorithms; He Lei; Shao Zhanpeng; Zhang Jianhua; Zhou Xiaolong; Computer Science (Issue S1); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fenil et al. | Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM | |
Moussa et al. | An enhanced method for human action recognition | |
CN112434721A (en) | Image classification method, system, storage medium and terminal based on small sample learning | |
Liu et al. | Learning human pose models from synthesized data for robust RGB-D action recognition | |
CN104021381B (en) | Human movement recognition method based on multistage characteristics | |
CN114049381A (en) | Twin cross target tracking method fusing multilayer semantic information | |
Naik et al. | Deep-violence: individual person violent activity detection in video | |
Yan et al. | An end-to-end deep learning network for 3D object detection from RGB-D data based on Hough voting | |
CN106257496A (en) | Mass network text and non-textual image classification method | |
CN113971815A (en) | Small sample target detection method based on singular value decomposition characteristic enhancement | |
CN113569607A (en) | Motion recognition method, motion recognition device, motion recognition equipment and storage medium | |
CN114332473A (en) | Object detection method, object detection device, computer equipment, storage medium and program product | |
Wang et al. | Hierarchical image segmentation ensemble for objectness in RGB-D images | |
Aliakbarian et al. | Deep action-and context-aware sequence learning for activity recognition and anticipation | |
Park et al. | Binary dense sift flow based two stream CNN for human action recognition | |
CN102609715A (en) | Object type identification method combining plurality of interest point testers | |
Zong et al. | A cascaded refined rgb-d salient object detection network based on the attention mechanism | |
Li et al. | Spatial-temporal dynamic hand gesture recognition via hybrid deep learning model | |
Zhou et al. | Learning semantic context feature-tree for action recognition via nearest neighbor fusion | |
Manzo et al. | FastGCN+ ARSRGemb: a novel framework for object recognition | |
CN113313039B (en) | Video behavior recognition method and system based on action knowledge base and ensemble learning | |
Zheng et al. | Bi-heterogeneous Convolutional Neural Network for UAV-based dynamic scene classification | |
CN113139540B (en) | Backboard detection method and equipment | |
CN116863260A (en) | Data processing method and device | |
Vaca-Castano et al. | Holistic object detection and image understanding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||