CN117789292A - Behavior recognition method, training method and device, electronic device, and storage medium

Publication number: CN117789292A
Application number: CN202311718052.8A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 程虎, 殷兵, 殷保才, 刘文超, 林垠
Applicant/Assignee: iFlytek Co Ltd
Legal status: Pending
Classification: Image Analysis (AREA)
Abstract

The invention relates to the field of computer technology and provides a behavior recognition method, a training method, a device, an electronic device, and a storage medium. The behavior recognition method includes: acquiring a video to be recognized and text features, where the text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model; and, based on a behavior recognition model, extracting temporal features from consecutive frame images in the video to be recognized and performing behavior recognition on the video by applying the extracted temporal features and the text features. The behavior recognition method, training method, device, electronic device, and storage medium enhance the behavior recognition model's ability to represent fine-grained objects, thereby improving recognition accuracy and generalization.

Description

Behavior recognition method, training method and device, electronic device, and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular to a behavior recognition method, a training method, a device, an electronic device, and a storage medium.
Background
Recognizing and analyzing a student's behavior while studying, and computing the proportions of time spent concentrating versus distracted, plays a vital role in helping parents monitor and guide improvements in the student's concentration.
In the related art, action-category recognition may be based on keypoints or on images. However, behavior recognition for student concentration detection also requires finer-grained information. For example, electronic products may be defined as distractions, so a student playing with a mobile phone, a tablet, or a handheld game console indicates distraction, while handling a pencil case, a book, or a pencil sharpener may be defined as normal actions. Such behaviors are difficult to distinguish from the actions alone: the system needs both action-discrimination capability and fine-grained recognition of the objects the hands interact with. Because real-world scenes contain too many objects to enumerate exhaustively, a behavior recognition scheme with fine-grained object recognition capability and good generalization is needed.
Disclosure of Invention
The invention provides a behavior recognition method, a training method, a device, an electronic device, and a storage medium, which address the poor recognition of fine-grained objects in the prior art.
The invention provides a behavior recognition method, which comprises the following steps:
acquiring a video to be recognized and text features, where the text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model;
and, based on a behavior recognition model, extracting temporal features from consecutive frame images in the video to be recognized, and performing behavior recognition on the video by applying the extracted temporal features and the text features.
According to the behavior recognition method provided by the invention, extracting temporal features from consecutive frame images in the video to be recognized based on the behavior recognition model, and performing behavior recognition by applying the extracted temporal features and the text features, comprises:
extracting local temporal features from consecutive frame images in the video to be recognized based on the behavior recognition model;
fusing the text features and the local temporal features to obtain local fusion features;
and performing behavior recognition on the video to be recognized based on the local fusion features.
According to the behavior recognition method provided by the invention, fusing the text features and the local temporal features to obtain local fusion features comprises:
fusing the text features and the local temporal features based on the degree of correlation between them to obtain the local fusion features.
According to the behavior recognition method provided by the invention, extracting temporal features from consecutive frame images in the video to be recognized based on the behavior recognition model, and performing behavior recognition by applying the extracted temporal features and the text features, comprises:
extracting global temporal features from consecutive frame images in the video to be recognized based on the behavior recognition model;
predicting the importance of each text feature for behavior recognition, and fusing the text features based on their importance to obtain a fused text feature;
and performing behavior recognition on the video to be recognized based on the global temporal features and the fused text feature.
According to the behavior recognition method provided by the invention, fusing the text features based on their importance to obtain a fused text feature comprises:
determining a fusion weight for each text feature based on its importance;
and fusing the text features based on the fusion weights to obtain the fused text feature.
The invention also provides a training method for the behavior recognition model, which comprises the following steps:
performing feature extraction on preset description text based on the contrastive-learning pre-trained large model to obtain text features;
and training a behavior recognition model based on the text features and sample videos, where the behavior recognition model is used to extract temporal features from consecutive frame images in a video to be recognized and to perform behavior recognition on the video by applying the extracted temporal features together with the text features, or by applying the extracted temporal features alone.
According to the training method provided by the invention, training the behavior recognition model based on the text features and sample videos comprises:
acquiring an initial model;
extracting initial temporal features from consecutive frame images in the sample videos based on the initial model;
and performing distillation learning on the initial model based on the similarity between each matched text feature and the corresponding initial temporal feature, and the similarity between the non-matched text features and that initial temporal feature, to obtain the behavior recognition model.
The invention also provides a behavior recognition device, which comprises:
a video acquisition unit for acquiring a video to be recognized and text features, where the text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model;
and a behavior recognition unit for extracting temporal features from consecutive frame images in the video to be recognized based on a behavior recognition model, and performing behavior recognition on the video by applying the extracted temporal features and the text features.
The invention also provides a training device for the behavior recognition model, which comprises:
a feature extraction unit for performing feature extraction on preset description text based on the contrastive-learning pre-trained large model to obtain text features;
and a training unit for training a behavior recognition model based on the text features and sample videos, where the behavior recognition model is used to extract temporal features from consecutive frame images in a video to be recognized and to perform behavior recognition by applying the extracted temporal features together with the text features, or by applying the extracted temporal features alone.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements any of the behavior recognition or training methods described above.
The invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the behavior recognition or training methods described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements any of the behavior recognition or training methods described above.
In the behavior recognition method, training method, device, electronic device, and storage medium according to the invention, the text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model. Using these text features as pre-trained model knowledge enhances the behavior recognition model's ability to represent fine-grained objects, thereby improving recognition accuracy and generalization.
Drawings
To illustrate the technical solutions of the present invention or the prior art more clearly, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. The drawings described below depict some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is the first schematic flowchart of the behavior recognition method provided by the present invention;
FIG. 2 is the second schematic flowchart of the behavior recognition method provided by the present invention;
FIG. 3 is the first flowchart of step 120 of the behavior recognition method provided by the present invention;
FIG. 4 is the third schematic flowchart of the behavior recognition method provided by the present invention;
FIG. 5 is the second flowchart of step 120 of the behavior recognition method provided by the present invention;
FIG. 6 is the fourth schematic flowchart of the behavior recognition method provided by the present invention;
FIG. 7 is a schematic flowchart of the training method of the behavior recognition model provided by the present invention;
FIG. 8 is a flowchart of step 720 of the training method of the behavior recognition model provided by the present invention;
FIG. 9 is a schematic diagram of the training architecture of the behavior recognition model provided by the present invention;
FIG. 10 is the fifth schematic flowchart of the behavior recognition method provided by the present invention;
FIG. 11 is a schematic structural diagram of the behavior recognition device provided by the present invention;
FIG. 12 is a schematic structural diagram of the training device provided by the present invention;
FIG. 13 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, embodiments of the invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Current behavior detection techniques can be broadly divided into keypoint-based and image-based schemes. Keypoint-based schemes mainly address human actions that do not involve human-object interaction: human keypoints are obtained with a keypoint model and fed into a graph convolutional network (GCN) for prediction.
Among image-based schemes, there are streaming schemes built on deep neural network models for video understanding, such as the Temporal Shift Module (TSM) or the Temporal Convolutional Network (TCN). In 3D or 2D+1D schemes, the input is a sequence of consecutive frames, and the model extracts spatio-temporal information through 3D or 2D+1D convolutions to identify the action category; such schemes are difficult to deploy in real time and have slow inference. TSM-based streaming schemes, by contrast, support convenient, fast, real-time inference: the TSM structure fuses temporal information by shifting channels between historical information and current-frame information.
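For illustration, the channel-shift idea behind TSM can be sketched as follows. This is a minimal illustrative implementation in PyTorch, not the patent's own code; the shift proportion is an assumption.

```python
# Minimal illustrative sketch (assumed, not the patent's code) of the
# TSM-style channel shift: one fraction of channels carries information from
# the previous frame and another from the next, so temporal fusion is
# achieved purely by shifting channels across time.
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    # x: (batch, time, channels, height, width)
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward: frame t sees frame t-1 (historical info)
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward: frame t sees frame t+1 (offline only)
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels stay with the current frame
    return out

# In the streaming setting described above, only the forward shift is used:
# cached channels from the previous frame are mixed into the current frame.
```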
However, in human-object interaction scenes, besides learning temporal action information, the model must also recognize the objects in the hand at fine granularity. For example, using correction fluid and writing with a pen differ very little in terms of motion. Real scenes contain many objects, and data collection alone can hardly cover them all, so the model needs some few-shot or zero-shot recognition capability; a TSM-based behavior recognition network, however, relies on supervised data and has neither few-shot nor zero-shot capability.
Therefore, to improve recognition of fine-grained objects and achieve better generalization, the inventive concept of the invention is as follows: when the behavior recognition model performs recognition, text features are introduced. The text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model; used as pre-trained model knowledge, they enhance the behavior recognition model's ability to represent fine-grained objects, thereby improving recognition accuracy and generalization.
Based on this inventive concept, the invention provides a behavior recognition method, a training method, a device, an electronic device, and a storage medium, which can be applied to behavior recognition scenarios in new-generation information technology, such as human-object interaction recognition and student concentration detection, to improve recognition of fine-grained objects and generalization.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings. FIG. 1 is the first schematic flowchart of the behavior recognition method provided by the invention. Each step of the method may be executed by a behavior recognition device, which may be implemented in software and/or hardware and integrated into an electronic device; the electronic device may be a terminal device (such as a smartphone or personal computer), a server (such as a local server, a cloud server, or a server cluster), a processor, a chip, or the like. As shown in FIG. 1, the method comprises the following steps:
Step 110: acquire a video to be recognized and text features, where the text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model;
Step 120: based on a behavior recognition model, extract temporal features from consecutive frame images in the video to be recognized, and perform behavior recognition on the video by applying the extracted temporal features and the text features.
Specifically, the video to be recognized is a video in which human-object interaction behavior needs to be recognized, e.g., interaction between a person's hands or eyes and an object, such as identifying the object a hand is touching or the object the eyes are watching. The video to be recognized may be captured by a video acquisition device or pre-stored; the embodiments of the invention place no particular restriction on this.
To improve recognition of fine-grained objects and achieve better generalization, embodiments of the invention perform behavior recognition on the video to be recognized with a pre-trained behavior recognition model.
Performing behavior recognition with the behavior recognition model means first extracting temporal features from consecutive frame images in the video to be recognized, and then applying the extracted temporal features and the text features to recognize the behavior in the video. The behavior recognition model may be built on the TSM model structure, with the text features enhancing the model's recognition of fine-grained objects in human-object interaction scenes.
FIG. 2 is the second schematic flowchart of the behavior recognition method provided by the invention. As shown in FIG. 2, the method builds on the TSM model structure; the behavior recognition model comprises a temporal feature extractor and a classifier. Consecutive frames are first cut from the video to be recognized, spatial features are extracted from each frame with a spatial encoder, and the spatial features are then fed into the temporal feature extractor to learn temporal action information. Applying the text features and the extracted temporal features, a behavior result is predicted for each frame, and a queue is maintained for smoothing over historical frames.
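The historical-frame smoothing queue mentioned above can be illustrated with a minimal sketch; the window size and the majority-vote rule are assumptions, not the patent's specification.

```python
# Illustrative sketch of per-frame prediction smoothing over a fixed-length
# queue of historical results (window size and voting rule are assumptions).
from collections import Counter, deque

class PredictionSmoother:
    def __init__(self, window: int = 16):
        self.history = deque(maxlen=window)  # queue of recent per-frame labels

    def update(self, frame_label: int) -> int:
        self.history.append(frame_label)
        # smoothed output: the most frequent label in the recent window
        return Counter(self.history).most_common(1)[0][0]
```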
The text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model, i.e., a large model pre-trained through multi-modal contrastive learning. Contrastive learning is a common pre-training approach that imposes constraints by constructing positive and negative sample pairs: paired text and image features are pulled closer together while unpaired features are pushed apart. Because the training data is abundant, the image and text encoders of such a pre-trained large model carry rich semantic information and have been shown to bring significant improvements on few-shot tasks; the current common practice is to adapt them to downstream tasks by fine-tuning the image encoder or learning text prompts.
However, inference with either the text encoder or the image encoder is very time-consuming. In the embodiments of the invention, therefore, no text encoder needs to be introduced; only the pre-trained large-model knowledge, namely the text features, is applied during behavior recognition to improve the feature extractor's ability to represent fine-grained objects. Since no parameters, or only very few, are added to the original model structure, terminal devices can run inference in real time, and the original model's inference time does not increase.
The contrastive-learning pre-trained large model may be, for example, CLIP, a cross-modal image-text learning model that achieves joint understanding of images and text by contrastively learning both representations simultaneously.
The text features can be extracted in advance; in the embodiments of the invention, the behavior recognition model does not add a text encoder but directly applies the pre-extracted text features. The description text may be a semantic description of the human-object interaction scene in each frame of the video to be recognized. For example, the description text may be predefined as "A person holding [obj] in the hand", where [obj] is a preset object category. Extracting features from the description text with the text encoder of the contrastive-learning pre-trained large model yields an N×512 feature representation, i.e., the text features, where N is the number of object categories and can be defined per task. Text features obtained this way represent the object categories in the human-object interaction scenes of the video to be recognized.
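For illustration, the N×512 text features can be precomputed offline along these lines, assuming OpenAI's CLIP package with a ViT-B/32 text encoder (which outputs 512-dimensional embeddings); the category list here is a hypothetical placeholder.

```python
# Sketch of precomputing the N x 512 text features offline with OpenAI's CLIP
# (ViT-B/32 text embeddings are 512-dimensional). The prompt template is the
# one given above; the category list is a hypothetical placeholder.
import torch
import clip

categories = ["mobile phone", "book", "pen"]  # hypothetical; define N categories per task
prompts = [f"A person holding {obj} in the hand" for obj in categories]

model, _ = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    tokens = clip.tokenize(prompts)            # (N, 77) token ids
    text_features = model.encode_text(tokens)  # (N, 512)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)  # unit-normalize
torch.save(text_features, "text_features.pt")  # reused at training/inference; no online text encoder needed
```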
Temporal feature extraction from consecutive frame images in the video to be recognized can be performed by the temporal feature extractor of the behavior recognition model: consecutive frames are first cut from the video, spatial features are extracted from each frame with a spatial encoder, and the spatial features are then input to the temporal feature extractor to learn temporal action information, i.e., the temporal features. These may be global or local temporal features; the embodiments of the invention place no particular restriction on this.
For behavior recognition of the video to be recognized, the extracted temporal features and the text features can be applied together: the temporal features reflect the spatial and temporal characteristics of the consecutive frames, while the text features reflect the categories of the human-object interaction objects in those frames and carry rich semantic information. Applying both for behavior recognition improves recognition of fine-grained objects.
For example, the behavior recognition model may perform behavior recognition based on the temporal features and the text features separately and then fuse the two recognition results into a final result; alternatively, the temporal and text features may be fused first and behavior recognition performed on the fused features. The embodiments of the invention place no particular restriction on this.
In the method provided by the embodiments of the invention, text features are introduced when the behavior recognition model performs recognition. Obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model, the text features serve as pre-trained model knowledge that enhances the behavior recognition model's ability to represent fine-grained objects, thereby improving recognition accuracy and generalization.
In addition, since the behavior recognition model adds no parameters, or only very few, to the model structure, terminal-side devices can run inference in real time, and the original model's inference time does not increase.
Based on any of the above embodiments, FIG. 3 is the first flowchart of step 120 of the behavior recognition method provided by the invention. As shown in FIG. 3, step 120 specifically comprises:
Step 121: based on the behavior recognition model, extract local temporal features from consecutive frame images in the video to be recognized;
Step 122: fuse the text features and the local temporal features to obtain local fusion features;
Step 123: perform behavior recognition on the video to be recognized based on the local fusion features.
Specifically, embodiments of the invention can use text features to enhance local temporal features. Local temporal features are first extracted from consecutive frame images in the video to be recognized. They may come from the shallower layers of the behavior recognition model, close to the input (e.g., the first five blocks). Local temporal features extracted this way typically contain more pixel-level information, such as color, texture, edge, and corner information, which is useful for identifying local details of an image. However, because local temporal features are semantically shallow, their value for recognizing the overall content of an image may be limited.
In this embodiment, after the local temporal features are obtained, they are fused with the text features. Because the text features carry rich semantic information, the resulting local fusion features are enhanced at the semantic level of object categories. The fusion may be implemented as feature concatenation, feature addition, or attention-based feature fusion, among others; the embodiments of the invention place no particular restriction on this. Behavior recognition is then performed based on the local fusion features.
Further feature extraction may be performed on the local fusion features before feeding them into a classifier for behavior recognition, or the local fusion features may be input directly into the classifier; the embodiments of the invention place no particular restriction on this.
Based on any of the above embodiments, step 122, fusing the text features and the local temporal features to obtain local fusion features, specifically comprises:
fusing the text features and the local temporal features based on the degree of correlation between them to obtain the local fusion features.
Specifically, the fusion of text features and local temporal features can be implemented as attention-based feature fusion: the correlation between a text feature and a local temporal feature is computed by an attention mechanism. The greater the correlation, the larger the fusion weight; the smaller the correlation, the smaller the fusion weight.
FIG. 4 is the third schematic flowchart of the behavior recognition method provided by the invention. As shown in FIG. 4, local feature extraction is performed on the consecutive frames by the first-stage module (stage1) of the behavior recognition model, yielding local temporal features of size 512×H/16×W/16. These are fused with the text features by an attention module to obtain the local fusion features, which are input into the second-stage module (stage2) of the behavior recognition model for behavior recognition. Here stage1 and stage2 correspond to different phases of the model: for example, stage1 may be a feature extraction phase and stage2 a classification phase, or stage1 may be a feature extraction phase and stage2 may include both feature extraction and classification.
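For illustration, the attention module between the N×512 text features and the 512×H/16×W/16 local temporal features might be sketched as follows; the single attention head and the residual connection are assumptions, not the patent's specification.

```python
# Illustrative sketch of the attention-based fusion: each spatial position of
# the local temporal feature map queries the text features, so positions that
# show an object absorb the semantically correlated text information.
import torch
import torch.nn as nn

class TextLocalFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, local_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # local_feat: (B, 512, H/16, W/16); text_feat: (N, 512)
        b, c, h, w = local_feat.shape
        q = local_feat.flatten(2).transpose(1, 2)      # (B, H*W, 512): one query per spatial position
        kv = text_feat.unsqueeze(0).expand(b, -1, -1)  # (B, N, 512): text features as keys/values
        fused, _ = self.attn(q, kv, kv)                # correlation-weighted text information
        fused = fused + q                              # residual keeps the original local information
        return fused.transpose(1, 2).reshape(b, c, h, w)
```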
Based on any of the above embodiments, FIG. 5 is the second flowchart of step 120 of the behavior recognition method provided by the invention. As shown in FIG. 5, step 120 specifically comprises:
Step 124: based on the behavior recognition model, extract global temporal features from consecutive frame images in the video to be recognized;
Step 125: predict the importance of each text feature for behavior recognition, and fuse the text features based on their importance to obtain a fused text feature;
Step 126: perform behavior recognition on the video to be recognized based on the global temporal features and the fused text feature.
In this embodiment of the invention, the text features and the global temporal features are fused so that they complement each other. Global temporal features contain motion information such as the trajectories, speeds, and directions of objects in the video, while text features contain textual descriptions, keywords, and so on; together the two kinds of features provide more comprehensive behavior information.
Specifically, FIG. 6 is the fourth schematic flowchart of the behavior recognition method provided by the invention. As shown in FIG. 6, global temporal features are extracted from consecutive frame images in the video to be recognized based on the behavior recognition model.
The behavior recognition model then predicts the importance of each text feature for behavior recognition and fuses the text features accordingly. The text features matter to different degrees: a text feature that matches the image semantics is relatively important, whereas one that does not match is relatively unimportant. Therefore the importance of each text feature for behavior recognition is predicted first, and the text features are fused based on their importance to obtain the fused text feature, in which the more important text features are emphasized and the less important ones are attenuated.
Based on any of the above embodiments, fusing the text features based on their importance to obtain a fused text feature in step 125 comprises:
determining a fusion weight for each text feature based on its importance;
and fusing the text features based on the fusion weights to obtain the fused text feature.
Specifically, to further improve recognition accuracy, one can start by improving the expressive power of the fused text feature. First, the fusion weight of each text feature is determined from its importance for behavior recognition, for example by normalizing the importance scores. The more important a text feature is for behavior recognition, the larger its fusion weight; conversely, the less important it is, the smaller its fusion weight.
The text features are then fused according to these weights to obtain the fused text feature.
In the method provided by this embodiment, the importance of each text feature for behavior recognition is predicted, the text features are fused based on their importance into a fused text feature, and behavior recognition is then performed on the video to be recognized based on the global temporal features and the fused text feature. Fusing features of different modalities further improves the accuracy of recognizing fine-grained objects.
Based on any of the above embodiments, FIG. 7 is a schematic flowchart of the training method of the behavior recognition model provided by the invention. As shown in FIG. 7, the training method comprises:
Step 710: perform feature extraction on preset description text based on the contrastive-learning pre-trained large model to obtain text features;
Step 720: train a behavior recognition model based on the text features and sample videos, where the behavior recognition model is used to extract temporal features from consecutive frame images in a video to be recognized and to perform behavior recognition by applying the extracted temporal features together with the text features, or by applying the extracted temporal features alone.
Specifically, the behavior recognition model can be trained on the basis of the TSM model structure. To improve the model's recognition of fine-grained objects and its generalization, embodiments of the invention first perform feature extraction on preset description text with the contrastive-learning pre-trained large model to obtain text features.
The contrastive-learning pre-trained large model is a large model pre-trained through multi-modal contrastive learning. Contrastive learning is a common pre-training approach that imposes constraints by constructing positive and negative sample pairs: paired text and image features are pulled closer together while unpaired features are pushed apart. Because the training data is abundant, the image and text encoders of such a model carry rich semantic information and have been shown to bring significant improvements on few-shot tasks; the common practice is to adapt them to downstream tasks by fine-tuning the image encoder or learning text prompts.
However, inference with either encoder is very time-consuming. In the embodiments of the invention, therefore, no text encoder needs to be introduced; only the pre-trained large-model knowledge, namely the text features, is applied during training to improve the feature extractor's ability to represent fine-grained objects. Since no parameters, or only very few, are added to the original model structure, terminal devices can run inference in real time, and the original model's inference time does not increase.
The contrastive-learning pre-trained large model may be, for example, CLIP, a cross-modal image-text learning model that achieves joint understanding of images and text by contrastively learning both representations simultaneously.
The description text may be a semantic description of the human-object interaction scene in each frame of the video to be recognized. For example, the description text may be predefined as "A person holding [obj] in the hand", where [obj] is a preset object category. Extracting features from the description text with the text encoder of the pre-trained large model yields an N×512 feature representation, i.e., the text features, where N is the number of object categories and can be defined per task. Text features obtained this way represent the object categories in the human-object interaction scenes of the video to be recognized.
Once the text features are available, the behavior recognition model can be trained on them together with the sample videos. During training, the model can learn the semantic correspondence between the text features and the consecutive image frames, which strengthens its ability to represent fine-grained objects in the consecutive frames of a video to be recognized.
Because the text features already encode cross-modal image-text semantic information, several uses are possible: the text features can be aligned with the global features extracted by the behavior recognition model, strengthening the global features' representation of fine-grained objects; the text features can impose local constraints on the model's local features through spatial feature enhancement; or the model can predict the importance of each text feature for behavior recognition, fuse the text features by importance, and perform behavior recognition on the fused text feature together with the global features. One or more of these modes can be selected flexibly as needed to train the behavior recognition model.
The trained behavior recognition model can extract temporal features from consecutive frame images in a video to be recognized and perform behavior recognition on the video by applying the extracted temporal features together with the text features, or by applying the extracted temporal features alone.
The video to be recognized is a video in which human-object interaction behavior needs to be recognized, e.g., interaction between a person's hands or eyes and an object, such as identifying the object a hand is touching or the object the eyes are watching. The video may be captured by a video acquisition device or pre-stored; the embodiments of the invention place no particular restriction on this.
Temporal feature extraction from consecutive frame images can be performed by the temporal feature extractor of the behavior recognition model: consecutive frames are first cut from the video, spatial features are extracted from each frame with a spatial encoder, and the spatial features are input to the temporal feature extractor to learn temporal action information, i.e., the temporal features. These may be global or local temporal features; the embodiments of the invention place no particular restriction on this.
For behavior recognition, the temporal features can be applied directly: because the behavior recognition model performed feature alignment with the text features during training, the extracted temporal features already reflect the category characteristics of fine-grained objects, improving recognition in human-object interaction scenes.
Alternatively, the extracted temporal features and the text features can be applied together: the temporal features reflect the spatial and temporal characteristics of the consecutive frames, while the text features reflect the categories of the human-object interaction objects in those frames and carry rich semantic information, so applying both improves recognition of fine-grained objects.
For example, the behavior recognition model may perform behavior recognition based on the temporal features and the text features separately and fuse the two recognition results into a final result, or the temporal and text features may be fused first and behavior recognition performed on the fused features; the embodiments of the invention place no particular restriction on this.
In the method provided by this embodiment, before the behavior recognition model is trained, feature extraction is performed on preset description text with the contrastive-learning pre-trained large model to obtain text features; the behavior recognition model is then trained on the text features and sample videos. Using the text features as pre-trained model knowledge enhances the model's ability to represent fine-grained objects, so the trained model achieves better recognition accuracy and generalization.
In addition, since the behavior recognition model adds no parameters, or only very few, to the model structure, terminal-side devices can run inference in real time, and the original model's inference time does not increase.
Based on any of the above embodiments, FIG. 8 is a flowchart of step 720 of the training method of the behavior recognition model provided by the invention. As shown in FIG. 8, training the behavior recognition model based on the text features and sample videos in step 720 specifically comprises:
Step 721: acquire an initial model;
Step 722: extract initial temporal features from consecutive frame images in the sample videos based on the initial model;
Step 723: perform distillation learning on the initial model based on the similarity between each matched text feature and the corresponding initial temporal feature, and the similarity between the non-matched text features and that initial temporal feature, to obtain the behavior recognition model.
Specifically, behavior recognition in the related art usually groups the objects held in the hand by function: playing electronic products, for example, covers mobile phones, tablets, handheld game consoles, and so on; eating covers potato chips, apples, bananas, and so on. Directly merging such different categories into one group produces large intra-class variance. In the embodiments of the invention, therefore, the behavior recognition model is obtained by extracting text features for fine-grained objects with the pre-trained model and performing distillation learning on the initial model.
Here, the initial model may adopt the TSM model structure, cutting consecutive frame images from the sample video and extracting an initial temporal feature for each frame.
A matched text feature is one that matches the semantics expressed by the initial temporal feature, i.e., the text and the image express the same meaning. For example, if a frame shows a hand holding a mobile phone, the matched text feature may be "A person holding a mobile phone in the hand" or a text feature with the same or similar meaning. A non-matched text feature is one whose semantics do not match the initial temporal feature; for the same frame, the text feature "A person holding an apple in the hand" is a non-matched text feature.
Distillation learning is performed on the initial model based on the similarity between each matched text feature and the corresponding initial temporal feature, and between the non-matched text features and that initial temporal feature, to obtain the behavior recognition model. Using the contrastive loss, the distillation loss can be expressed as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(T_i, I_i)\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(T_j, I_i)\right)}$$

where $T_i$ is the $i$-th matched text feature, $I_i$ is the $i$-th initial temporal feature, $T_j$ is the $j$-th text feature (non-matched for $j \neq i$), $N$ is the number of text features, and $\mathrm{sim}(\cdot,\cdot)$ denotes feature similarity (e.g., cosine similarity).
In the method provided by this embodiment, distillation learning on the initial model uses the contrastive loss to push apart the non-matched text features and the initial temporal features while pulling together the matched text features and the initial temporal features, so that the trained behavior recognition model absorbs pre-trained knowledge learned from large amounts of data, improving its representation of fine-grained objects.
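For illustration, the contrastive distillation objective described above can be sketched as follows; the temperature parameter and the cross-entropy formulation over cosine similarities are standard InfoNCE choices assumed here.

```python
# Illustrative sketch of the contrastive distillation loss: for each clip,
# the similarity to its matched text feature is maximized relative to all
# non-matched text features (InfoNCE form; the temperature is an assumption).
import torch
import torch.nn.functional as F

def distillation_loss(timing_feat: torch.Tensor, text_feat: torch.Tensor,
                      labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # timing_feat: (B, D) initial temporal features I_i of the sample clips
    # text_feat:   (N, D) precomputed text features T_j
    # labels:      (B,)   index of the matched text feature for each clip
    i = F.normalize(timing_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = i @ t.T / tau                  # (B, N) cosine similarities sim(I_i, T_j)
    return F.cross_entropy(logits, labels)  # pulls matched pairs together, pushes others apart
```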
FIG. 9 is a schematic diagram of the training architecture of the behavior recognition model provided by the invention. As shown in FIG. 9, initial temporal features are extracted from consecutive frame images in the sample video, and the text features are aligned with the initial temporal features using the contrastive loss to strengthen the behavior recognition model's representation of fine-grained objects.
It should be noted that in this embodiment of the invention, only the behavior recognition model after distillation learning is used at the inference stage to extract temporal features; applying the temporal features directly already achieves accurate recognition of fine-grained objects in the human-object interaction scenes of the video to be recognized. This raises inference speed and lets terminal devices run inference in real time.
Based on any of the above embodiments, FIG. 10 is the fifth schematic flowchart of the behavior recognition method provided by the invention. As shown in FIG. 10, a behavior recognition method comprises:
S1: perform feature extraction on preset description text based on the contrastive-learning pre-trained large model to obtain text features;
S2: train a behavior recognition model based on the text features and sample videos. Specifically: acquire an initial model; extract initial temporal features from consecutive frame images in the sample videos based on the initial model; and perform distillation learning on the initial model based on the similarity between each matched text feature and the corresponding initial temporal feature, and between the non-matched text features and that initial temporal feature, to obtain the behavior recognition model.
S3: based on the behavior recognition model, extract local temporal features from consecutive frame images in the video to be recognized; fuse the text features and the local temporal features based on the degree of correlation between them to obtain local fusion features.
S4: based on the behavior recognition model, perform global temporal feature extraction on the local fusion features to obtain global temporal features; predict the importance of each text feature for behavior recognition, and fuse the text features based on their importance to obtain a fused text feature; fuse the global temporal features and the fused text feature, and perform behavior recognition on the video to be recognized based on the fused features.
The method provided by this embodiment makes skillful use of pre-trained model knowledge, applying text features at the training stage and/or the inference stage while introducing no parameters, or only very few, on top of the original network, so terminal devices can run inference in real time. Moreover, the text features improve the model's discrimination of fine-grained objects, and fusing pre-trained model knowledge improves generalization to objects outside the collected set.
The behavior recognition device provided by the invention is described below; the device described below and the behavior recognition method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, FIG. 11 is a schematic structural diagram of the behavior recognition device provided by the invention. As shown in FIG. 11, the behavior recognition device comprises a video acquisition unit 1101 and a behavior recognition unit 1102, where:
the video acquisition unit 1101 is configured to acquire a video to be recognized and text features, where the text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model;
the behavior recognition unit 1102 is configured to extract temporal features from consecutive frame images in the video to be recognized based on a behavior recognition model, and to perform behavior recognition on the video by applying the extracted temporal features and the text features.
In the behavior recognition device provided by this embodiment, text features are introduced when the behavior recognition model performs recognition. Obtained by feature extraction on preset description text with a contrastive-learning pre-trained large model, the text features serve as pre-trained model knowledge that enhances the model's ability to represent fine-grained objects, improving recognition accuracy and generalization.
Based on any of the above embodiments, the behavior recognition unit is specifically configured to:
extract local temporal features from consecutive frame images in the video to be recognized based on the behavior recognition model;
fuse the text features and the local temporal features to obtain local fusion features;
and perform behavior recognition on the video to be recognized based on the local fusion features.
Based on any of the above embodiments, the behavior recognition unit is specifically configured to:
fuse the text features and the local temporal features based on the degree of correlation between them to obtain the local fusion features.
Based on any of the above embodiments, the behavior recognition unit is specifically configured to:
extract global temporal features from consecutive frame images in the video to be recognized based on the behavior recognition model;
predict the importance of each text feature for behavior recognition, and fuse the text features based on their importance to obtain a fused text feature;
and perform behavior recognition on the video to be recognized based on the global temporal features and the fused text feature.
Based on any of the above embodiments, the behavior recognition unit is specifically configured to:
determine a fusion weight for each text feature based on its importance;
and fuse the text features based on the fusion weights to obtain the fused text feature.
Based on any of the above embodiments, FIG. 12 is a schematic structural diagram of the training device provided by the invention. As shown in FIG. 12, the training device of the behavior recognition model comprises a feature extraction unit 1201 and a training unit 1202, where:
the feature extraction unit 1201 is configured to perform feature extraction on preset description text based on the contrastive-learning pre-trained large model to obtain text features;
the training unit 1202 is configured to train a behavior recognition model based on the text features and sample videos, where the behavior recognition model is used to extract temporal features from consecutive frame images in a video to be recognized and to perform behavior recognition by applying the extracted temporal features together with the text features, or by applying the extracted temporal features alone.
Based on any of the above embodiments, the training unit 1202 is specifically configured to:
acquire an initial model;
extract initial temporal features from consecutive frame images in the sample videos based on the initial model;
and perform distillation learning on the initial model based on the similarity between each matched text feature and the corresponding initial temporal feature, and between the non-matched text features and that initial temporal feature, to obtain the behavior recognition model.
FIG. 13 illustrates the physical structure of an electronic device. As shown in FIG. 13, the electronic device may include a processor 1310, a communications interface 1320, a memory 1330, and a communication bus 1340, where the processor 1310, the communications interface 1320, and the memory 1330 communicate with one another over the communication bus 1340. The processor 1310 may call logic instructions in the memory 1330 to perform a behavior recognition method comprising:
acquiring a video to be recognized and text features, where the text features are obtained by performing feature extraction on preset description text with a contrastive-learning pre-trained large model;
and, based on a behavior recognition model, extracting temporal features from consecutive frame images in the video to be recognized, and performing behavior recognition on the video by applying the extracted temporal features and the text features.
The processor may also call logic instructions in the memory to perform a training method of the behavior recognition model, comprising:
performing feature extraction on preset description text based on the contrastive-learning pre-trained large model to obtain text features;
and training a behavior recognition model based on the text features and sample videos, where the behavior recognition model is used to extract temporal features from consecutive frame images in a video to be recognized and to perform behavior recognition by applying the extracted temporal features together with the text features, or by applying the extracted temporal features alone.
Further, the logic instructions in the memory 1330 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer is capable of performing the behavior recognition method provided by the methods described above, the method comprising:
acquiring a video to be recognized and text features, wherein the text features are obtained by performing feature extraction on a preset description text through a contrast learning pre-training large model;
and based on a behavior recognition model, extracting time sequence features of continuous frame images in the video to be recognized, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features and the text features.
When the computer program is executed by the processor, the computer can also execute the training method of the behavior recognition model provided by the methods described above, the method comprising:
based on the contrast learning pre-training large model, extracting features of a preset description text to obtain text features;
training a behavior recognition model based on the text features and the sample video, wherein the behavior recognition model is used for performing time sequence feature extraction on continuous frame images in a video to be recognized, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features together with the text features, or by applying the extracted time sequence features alone.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the behavior recognition method provided by the methods described above, the method comprising:
acquiring a video to be recognized and text features, wherein the text features are obtained by performing feature extraction on a preset description text through a contrast learning pre-training large model;
and based on a behavior recognition model, extracting time sequence features of continuous frame images in the video to be recognized, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features and the text features.
The computer program, when executed by a processor, also implements the training method of the behavior recognition model provided by the methods described above, the method comprising:
based on the contrast learning pre-training large model, extracting features of a preset description text to obtain text features;
training a behavior recognition model based on the text features and the sample video, wherein the behavior recognition model is used for performing time sequence feature extraction on continuous frame images in a video to be recognized, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features together with the text features, or by applying the extracted time sequence features alone.
The apparatus embodiments described above are merely illustrative, in which the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A behavior recognition method, comprising:
acquiring a video to be recognized and text features, wherein the text features are obtained by performing feature extraction on a preset description text through a contrast learning pre-training large model;
and based on a behavior recognition model, extracting time sequence features of continuous frame images in the video to be recognized, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features and the text features.
2. The behavior recognition method according to claim 1, wherein the performing time sequence feature extraction on continuous frame images in the video to be recognized based on the behavior recognition model, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features and the text features, includes:
based on the behavior recognition model, performing local time sequence feature extraction on continuous frame images in the video to be recognized to obtain local time sequence features;
fusing the text features and the local time sequence features to obtain local fusion features;
and carrying out behavior recognition on the video to be recognized based on the local fusion characteristics.
3. The behavior recognition method according to claim 2, wherein the fusing the text features and the local time sequence features to obtain local fusion features includes:
fusing the text features and the local time sequence features based on the correlation degree between the text features and the local time sequence features to obtain the local fusion features.
4. The behavior recognition method according to claim 1, wherein the performing time sequence feature extraction on continuous frame images in the video to be recognized based on the behavior recognition model, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features and the text features, includes:
based on the behavior recognition model, performing global time sequence feature extraction on continuous frame images in the video to be recognized to obtain global time sequence features;
predicting the importance degree of each text feature for behavior recognition, and fusing each text feature based on the importance degree to obtain fused text features;
and performing behavior recognition on the video to be recognized based on the global time sequence feature and the fusion text feature.
5. The behavior recognition method according to claim 4, wherein the fusing each text feature based on the importance degree to obtain fused text features includes:
determining fusion weights of the text features based on the importance degrees;
and fusing the text features based on the fusion weights to obtain the fused text features.
6. A method for training a behavior recognition model, comprising:
based on the contrast learning pre-training large model, extracting features of a preset description text to obtain text features;
training a behavior recognition model based on the text features and the sample video, wherein the behavior recognition model is used for performing time sequence feature extraction on continuous frame images in a video to be recognized, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features together with the text features, or by applying the extracted time sequence features alone.
7. The training method according to claim 6, wherein the training a behavior recognition model based on the text features and the sample video comprises:
acquiring an initial model;
extracting initial time sequence features of continuous frame images in the sample video based on the initial model;
and performing distillation learning on the initial model based on the similarity between the initial time sequence features and the matched text feature among the text features, and on the similarity between the initial time sequence features and the non-matched text features among the text features, to obtain the behavior recognition model.
8. A behavior recognition apparatus, comprising:
the video acquisition unit is used for acquiring a video to be recognized and text features, wherein the text features are obtained by performing feature extraction on a preset description text through a contrast learning pre-training large model;
the behavior recognition unit is used for performing time sequence feature extraction on continuous frame images in the video to be recognized based on the behavior recognition model, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features and the text features.
9. A training device for a behavior recognition model, comprising:
the feature extraction unit is used for performing feature extraction on a preset description text based on the contrast learning pre-training large model to obtain text features;
the training unit is used for training a behavior recognition model based on the text features and the sample video, wherein the behavior recognition model is used for performing time sequence feature extraction on continuous frame images in a video to be recognized, and performing behavior recognition on the video to be recognized by applying the extracted time sequence features together with the text features, or by applying the extracted time sequence features alone.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the behavior recognition method of any one of claims 1 to 5 or the training method of any one of claims 6 to 7.
11. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the behavior recognition method of any one of claims 1 to 5 or the training method of any one of claims 6 to 7.
Application CN202311718052.8A (priority date 2023-12-13, filing date 2023-12-13): Behavior recognition method, training device, electronic equipment and storage medium. Status: Pending. Publication: CN117789292A (en).

Priority Applications (1)

Application Number: CN202311718052.8A; Priority Date: 2023-12-13; Filing Date: 2023-12-13; Title: Behavior recognition method, training device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202311718052.8A; Priority Date: 2023-12-13; Filing Date: 2023-12-13; Title: Behavior recognition method, training device, electronic equipment and storage medium

Publications (1)

Publication Number: CN117789292A; Publication Date: 2024-03-29

Family

ID=90395526

Family Applications (1)

Application Number: CN202311718052.8A; Priority Date: 2023-12-13; Filing Date: 2023-12-13; Status: Pending

Country Status (1)

CN: CN117789292A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination