CN113989940B - Method, system, device and storage medium for identifying actions in video data - Google Patents
- Publication number
- CN113989940B CN113989940B CN202111363930.XA CN202111363930A CN113989940B CN 113989940 B CN113989940 B CN 113989940B CN 202111363930 A CN202111363930 A CN 202111363930A CN 113989940 B CN113989940 B CN 113989940B
- Authority
- CN
- China
- Prior art keywords
- video data
- dependency
- tensor
- dependent
- pooling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/153—Multidimensional correlation or convolution
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The invention discloses a method, system, device and storage medium for identifying actions in video data. The method models multiple content dependencies of the video data: the original video feature tensor is pooled along different directions and across different dimensions, and dependency activation is performed with a convolution layer to obtain the corresponding dependency characterizations; an attention mechanism with a query structure then aggregates the dependency characterizations, refines the original video feature tensor, and the refined result is used for action recognition. The scheme can be directly inserted into a convolution-based action recognition model, introduces almost no additional parameters or computation, and experiments show that it significantly improves the classification performance of the action recognition model.
Description
Technical Field
The present invention relates to the field of computer vision, and in particular, to a method, system, device, and storage medium for identifying actions in video data.
Background
In the multimedia age, terminal devices such as mobile phones, cameras and surveillance cameras continuously generate massive amounts of video data, and action classification is an effective way to analyze it. However, compared with image data, video data has an additional time dimension that introduces more content dependencies and greatly increases the difficulty of video action recognition.
For the content-dependency modeling problem in action recognition, the following methods currently exist:
1) Implicit dependency modeling. These methods directly extend an existing image classification network, such as ResNet, by inflating its two-dimensional convolution kernels into three-dimensional ones, and rely solely on the optimized three-dimensional kernels to implicitly learn features from the video data. Such methods depend entirely on the number of stacked layers to model long-range dependencies, so only the last few layers of the network can perceive them. Meanwhile, the crude dimension inflation severely increases computation and model size, making these methods difficult to train.
2) Temporal dependency modeling. These methods focus on the time dimension that video data adds over image data, and explicitly exploit temporal information to capture the dynamic characteristics of the video. Compared with implicit dependency modeling, the dedicated design for the time dimension avoids heavy three-dimensional convolution kernels, reducing model complexity and improving performance. However, these methods ignore the other content dependencies that are widespread in video data, so their performance is limited.
3) Global spatiotemporal point attention. These methods add a global attention mechanism to the action classification model and capture long-range content dependencies through pairwise matching between spatiotemporal points in the video data. However, computing the relation between every pair of spatiotemporal points makes the model bloated and slow.
In general, none of the above approaches balances modeling diverse content dependencies against maintaining model efficiency, so both the performance and the computational overhead of action recognition models remain to be optimized.
Disclosure of Invention
The invention aims to provide a method, system, device and storage medium for identifying actions in video data that, for action recognition tasks, model and aggregate multiple content dependencies in the video data while adding almost no parameters or computation to the action recognition model, thereby improving its classification performance.
The aim of the invention is achieved by the following technical scheme:
a method of motion recognition in video data, comprising:
acquiring an original video characteristic tensor extracted from video data by an action recognition model;
pooling, by means of multi-content dependency modeling of the video data, the original video feature tensor along different directions and at different scales to obtain a plurality of groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
using an attention mechanism with a query structure, introducing a query vector that is matched against all dependency characterizations, computing the weight of each dependency characterization from the matching response intensity, and weighted-summing them into a final dependency characterization; applying a threshold operation with the final dependency characterization to the original video data feature tensor to obtain an optimized video data feature tensor;
and inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
A motion recognition system in video data for implementing the foregoing method, the system comprising:
an original video feature tensor obtaining unit, configured to obtain an original video feature tensor of the motion recognition model extracted from the video data;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor along different directions and at different scales to obtain a plurality of groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
the dependency aggregation unit is used for introducing, via an attention mechanism with a query structure, a query vector that is matched against all dependency characterizations, computing the weight of each dependency characterization from the matching response intensity, weighted-summing them into a final dependency characterization, and applying a threshold operation with the final dependency characterization to the original video data feature tensor to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program, characterized in that the aforementioned method is implemented when the computer program is executed by a processor.
It can be seen from the technical scheme provided by the invention that the method is plug-and-play: it can be directly inserted into a convolution-based action recognition model, introduces almost no additional parameters or computation, and experiments show that it significantly improves the classification performance of the action recognition model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for identifying actions in video data according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a model structure of a method for identifying actions in video data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of dependency aggregation based on the attention mechanism with a query structure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a motion recognition system in video data according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The following describes the method for identifying actions in video data in detail. Whatever is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. Where specific conditions are not noted, the embodiments were carried out under conditions conventional in the art or suggested by the manufacturer. Reagents or apparatus used without an indicated manufacturer are conventional, commercially available products.
As shown in fig. 1, a method for identifying actions in video data mainly includes the following steps:
and step 1, acquiring an original video characteristic tensor extracted from video data by the action recognition model.
And step 2, adopting the video data multi-content dependency modeling to pool the original video feature tensor along different directions and at different scales to obtain a plurality of groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain the corresponding dependency characterizations.
And step 3, using an attention mechanism with a query structure, introducing a query vector that is matched against all dependency characterizations, computing the weight of each dependency characterization from the matching response intensity, and weighted-summing them into a final dependency characterization; a threshold operation with the final dependency characterization is applied to the original video data feature tensor to obtain the optimized video data feature tensor.
And step 4, inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
The scheme of the embodiment of the invention can be applied to systems such as surveillance video processing and video content analysis. Taking the widely used temporal segment network (TSN) and temporal shift module (TSM) as baselines, specific experiments are given later that verify the effectiveness of the invention in improving action recognition performance; of course, other action recognition models may also be used with the present invention.
For ease of understanding, the following detailed description is provided for each of the various aspects of the invention.
1. An original video feature tensor is obtained.
In the embodiment of the invention, the input of the motion recognition model is video data, and the output is an original video characteristic tensor; the motion recognition model can be any existing motion recognition model with any structure.
2. Video data multi-content dependent modeling (abbreviated MDM).
As shown in fig. 2, a classical model structure unit for video action recognition contains three convolution layers (conv1, conv2, conv3 on the left of fig. 2), and the proposed method (SDA) acts between the second and third convolution layers. For the video feature tensor $Y \in \mathbb{R}^{T \times H \times W \times C}$ output by the action recognition model, the MDM mines various spatiotemporal content dependencies from it. First, to lighten the model, the MDM uses a convolution layer and a ReLU activation function to compress $Y$ from $C$ channels to $C/r_c$ channels; the resulting feature tensor is denoted $Y' \in \mathbb{R}^{T \times H \times W \times C/r_c}$, where $\mathbb{R}$ is the set of real numbers, $T$, $H$ and $W$ denote in turn the length, height and width of the video feature tensor, and $r_c$ is the compression coefficient.
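As an illustrative sketch only (not the patented implementation; the numpy formulation and all sizes below are assumptions), the channel compression step is a $1 \times 1 \times 1$ convolution followed by ReLU, which amounts to applying the same linear map at every spatiotemporal position:

```python
import numpy as np

def channel_compress(y, w):
    """1x1x1 convolution (the same linear map at every spatiotemporal
    position) followed by ReLU; y: T x H x W x C, w: C x C', C' = C/r_c."""
    t, h, wd, c = y.shape
    out = y.reshape(-1, c) @ w              # per-position channel mixing
    return np.maximum(out, 0.0).reshape(t, h, wd, w.shape[1])

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 7, 7, 16))      # assumed sizes: T=8, H=W=7, C=16
W = rng.standard_normal((16, 4))            # r_c = 4, so C/r_c = 4
Yp = channel_compress(Y, W)
print(Yp.shape)                             # (8, 7, 7, 4)
```

The weight matrix `W` stands in for the learned convolution kernel; in the actual model it would be trained end to end.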
Taking the generated feature tensor $Y'$ as input, the MDM outputs a series of dependency characterizations, denoted $\{R_1, R_2, \dots, R_M\} = \mathrm{MDM}(Y')$. Every dependency characterization is computed with a unified pipeline: feature compression → dependency activation.
1) And (5) compressing the characteristics.
In the embodiment of the invention, feature compression is realized by a pooling operation. Exploiting the spatiotemporal structure of the feature tensor $Y'$, the MDM pools it along different directions (e.g., the spatial direction and the temporal direction) and at different scales (e.g., the global scale and the local scale).
The pooling kernel is denoted $W_{pool} = (p_t, p_h, p_w)$, where $p_t$, $p_h$, $p_w$ indicate the size of the pooling receptive field; different settings of $p_t$, $p_h$, $p_w$ correspond to different directions and scales. To capture the overall data characteristics of the video in each direction, the invention specifically selects average pooling, denoted $\mathrm{pool}_{avg}()$, as the pooling operation. The compressed dependency feature tensor $A$ is computed from the feature tensor $Y'$ as:

$$A = \mathrm{pool}_{avg}(Y'; W_{pool})$$

The compressed dependency feature tensor provides the data characteristics within the pooling receptive field for the subsequent dependency modeling.
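For intuition, non-overlapping average pooling with a kernel $W_{pool} = (p_t, p_h, p_w)$ can be sketched in numpy as follows. This is a minimal illustration under assumed tensor sizes, not the patented code, and the kernel choices below are examples only:

```python
import numpy as np

def avg_pool3d(y, p):
    """Average-pool a T x H x W x C tensor with a non-overlapping kernel
    p = (p_t, p_h, p_w); each dimension must be divisible by the kernel."""
    pt, ph, pw = p
    t, h, w, c = y.shape
    blocks = y.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    return blocks.mean(axis=(1, 3, 5))      # average within each block

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 6, 6, 4))       # assumed: T=8, H=W=6, C/r_c=4
A_global = avg_pool3d(Y, (8, 6, 6))         # global scale: 1 x 1 x 1 x 4
A_time   = avg_pool3d(Y, (1, 6, 6))         # keeps the time axis: 8 x 1 x 1 x 4
A_space  = avg_pool3d(Y, (8, 1, 1))         # keeps the spatial axes: 1 x 6 x 6 x 4
A_local  = avg_pool3d(Y, (2, 3, 3))         # local scale: 4 x 2 x 2 x 4
```

Each choice of kernel compresses a different direction at a different scale, matching the "different directions, different scales" description above.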
2) Dependent on activation.
After the dependency feature tensor $A$ is obtained, the MDM applies a convolution operator and a ReLU operator to realize dependency activation, and then a reshape operator restores the pooled dependency feature tensor to its size before pooling for the convenience of subsequent computation.

Analogous to the pooling operation, the convolution kernel is denoted $W_{conv} = (c_t, c_h, c_w)$, where $c_t$, $c_h$, $c_w$ indicate the size of the convolution kernel, and the convolution operation is denoted $\mathrm{Conv}_{3d}()$. Convolving the dependency feature tensor $A$ yields the corresponding dependency characterization $R$:

$$R = \mathrm{ReLU}(\mathrm{Conv}_{3d}(A; W_{conv}))$$
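The dependency activation step can be illustrated with a minimal single-channel, "same"-padded 3D convolution followed by ReLU. This is a didactic numpy sketch, not the patented multi-channel layer, and the all-ones kernel is an assumed example:

```python
import numpy as np

def conv3d_relu(a, kernel):
    """'Same'-padded single-channel 3D convolution followed by ReLU.
    a: T x H x W, kernel: c_t x c_h x c_w with odd sizes."""
    ct, ch, cw = kernel.shape
    pad = ((ct // 2,) * 2, (ch // 2,) * 2, (cw // 2,) * 2)
    ap = np.pad(a, pad)                     # zero padding keeps the shape
    out = np.empty_like(a, dtype=float)
    t, h, w = a.shape
    for i in range(t):
        for j in range(h):
            for k in range(w):
                out[i, j, k] = np.sum(ap[i:i + ct, j:j + ch, k:k + cw] * kernel)
    return np.maximum(out, 0.0)

A = np.ones((4, 2, 2))                      # a toy dependency feature tensor
R = conv3d_relu(A, np.ones((3, 1, 1)))      # an assumed temporal 3x1x1 kernel
```

With the all-ones input, interior time steps sum three neighbors while the two boundary steps see one zero-padded neighbor, so the output is 3 in the interior and 2 at the boundaries.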
Based on the above principle, in the embodiment of the present invention, two groups of content dependencies are set in the multi-content dependency modeling of video data, as shown in parts (a) and (b) on the right of fig. 2:

The first group is long-range content dependencies, which reflect the relationship between video content from three perspectives: spatiotemporal, temporal, and spatial. A pooling kernel that covers the global spatiotemporal extent yields the long-range spatiotemporal dependency (LST for short); a pooling kernel that preserves the time axis while compressing the spatial extent yields the long-range temporal dependency (LT for short); and a pooling kernel that preserves the spatial axes while compressing the time axis yields the long-range spatial dependency (LS for short). Each of these three pooling kernels is paired with a convolution kernel of matching shape in the corresponding convolution layer.

Based on the three pooling kernels, three compressed dependency feature tensors are obtained; likewise, the three convolution operations mix the information between channels and yield the corresponding dependency characterizations $R_1$, $R_2$ and $R_3$.

The second group is the short-range content dependency, which compresses information within a local spatiotemporal receptive field: a local pooling kernel with spatial size $a$ and temporal size $b$ compresses the dynamic information presented within the local receptive field, and the corresponding convolution layer uses a kernel with spatial size $c$ and temporal size $d$. The corresponding dependency feature tensor $A_S$ and dependency characterization $R_4$ are computed as described above.
In the embodiment of the invention, $1 < a, c < \min(H, W)$ and $1 < b, d < T$. By way of example, one may set $a = 3$, $b = 3$, $c = 2$, $d = 2$.
After the various dependency characterizations are obtained, they are scaled back to size $T \times H \times W \times C/r_c$ by element replication.
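Scaling a pooled characterization back to the full $T \times H \times W \times C/r_c$ size by element replication amounts to nearest-neighbor repetition along each pooled axis. A minimal sketch under assumed sizes:

```python
import numpy as np

def replicate_back(a, shape):
    """Repeat a pooled tensor a along each pooled axis until it matches
    shape = (T, H, W, C'); each target dim must be a multiple of a's dim."""
    t, h, w, _ = shape
    at, ah, aw, _ = a.shape
    out = np.repeat(a, t // at, axis=0)     # replicate along time
    out = np.repeat(out, h // ah, axis=1)   # replicate along height
    return np.repeat(out, w // aw, axis=2)  # replicate along width

a = np.arange(4.0).reshape(1, 1, 1, 4)      # e.g. a globally pooled tensor
R = replicate_back(a, (8, 6, 6, 4))         # back to T x H x W x C'
```

Every spatiotemporal position of the output then carries the same channel vector that the pooled cell held, which is what makes the subsequent element-wise aggregation well defined.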
3. Selective dependency aggregation (abbreviated SEC).

The most intuitive dependency aggregation method is to average the obtained dependency characterizations. However, different videos have different dependency preferences, and simple average aggregation attends to irrelevant dependencies while ignoring important ones.
As shown in FIG. 3, the present invention uses an attention mechanism with a query structure (QSA for short) to aggregate the dependency characterizations, automatically emphasizing important dependencies by assigning weights to the different dependency characterizations. Specifically:
A learnable query vector $q \in \mathbb{R}^{C/r_c}$ is introduced, and all dependency characterizations $\{R_1, \dots, R_M\}$ are compressed by a global average pooling layer into an $M \times C/r_c$ matrix $K$, which serves as the keys in the attention mechanism:

$$K = [\mathrm{pool}_{avg}(R_1); \dots; \mathrm{pool}_{avg}(R_M)]$$

The dependency characterizations themselves are stacked as the values:

$$V = [R_1; \dots; R_M]$$

where $M$ is the number of dependency characterizations (e.g., $M = 4$ in the preceding example), $C$ is the number of channels of the original video feature tensor, and $r_c$ is the compression coefficient.
The matching response intensity of each dependency characterization, which serves as the weight for the subsequent weighted sum, is obtained by computing the vector inner product between the query vector and each dependency key (written as a matrix multiplication for convenience):

$$\mathrm{Attention}(q, K) = \mathrm{softmax}(q K^{\top})$$

The final dependency characterization is then obtained as the weighted sum of the dependency characterizations:

$$R_{sec} = \mathrm{Attention}(q, K) \times V$$

where $\mathrm{softmax}()$ denotes the softmax function, $\top$ denotes transposition, and $\times$ denotes matrix multiplication.
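The query-structured attention can be sketched end to end in a few lines of numpy. This is an illustrative toy with assumed shapes and a hand-picked example; in practice the query $q$ would be learned:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                 # shift for numerical stability
    return e / e.sum()

def qsa_aggregate(q, reps):
    """Weight M dependency characterizations (each T x H x W x C') by the
    inner product between query q (length C') and their pooled keys."""
    K = np.stack([r.mean(axis=(0, 1, 2)) for r in reps])   # keys: M x C'
    w = softmax(q @ K.T)                                   # Attention(q, K)
    V = np.stack(reps)                                     # values: M x ...
    return np.tensordot(w, V, axes=1), w                   # R_sec, weights

reps = [np.full((2, 2, 2, 3), float(i)) for i in range(4)]  # 4 toy characterizations
R_sec, w = qsa_aggregate(np.zeros(3), reps)                 # zero query -> uniform weights
```

With a zero query all matching scores are equal, so the weights are uniform (0.25 each) and the aggregation degenerates to the simple average; a trained query skews the weights toward the dependencies that the video actually prefers.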
The above operations are performed by the Dependency Aggregation Block in fig. 2. With this design, SEC adds almost no extra parameters or computation.
In the embodiment of the present invention, a $1 \times 1 \times 1$ three-dimensional convolution kernel restores the number of channels of the final dependency characterization $R_{sec}$ to that of the original video feature tensor $Y$; a Sigmoid activation function then maps the result into the interval $(0, 1)$, and the mapped tensor is finally multiplied element-wise with the original video feature tensor $Y$ to obtain the optimized video data feature tensor $Z$:

$$Z = \mathrm{Sigmoid}(\mathrm{Conv}_{3d}(R_{sec}; 1 \times 1 \times 1)) \odot Y$$

where $\odot$ denotes element-wise multiplication and $\mathrm{Conv}_{3d}()$ denotes the convolution operation.
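The final gating step — restore the channels with a $1 \times 1 \times 1$ convolution, squash with a Sigmoid, and multiply element-wise into $Y$ — can be sketched as follows (assumed shapes; the weight matrix stands in for the learned convolution):

```python
import numpy as np

def sigmoid_gate(r_sec, w, y):
    """Z = Sigmoid(Conv3d(R_sec; 1x1x1)) ⊙ Y, with the 1x1x1 convolution
    written as a channel matrix w of shape C' x C."""
    t, h, wd, cp = r_sec.shape
    g = r_sec.reshape(-1, cp) @ w           # back to C channels
    g = 1.0 / (1.0 + np.exp(-g))            # Sigmoid -> gate values in (0, 1)
    return g.reshape(y.shape) * y           # element-wise gating of Y

rng = np.random.default_rng(0)
R_sec = rng.standard_normal((2, 2, 2, 3))   # final dependency characterization
Wc = rng.standard_normal((3, 5))            # C' = 3 -> C = 5
Y = rng.standard_normal((2, 2, 2, 5))       # original video feature tensor
Z = sigmoid_gate(R_sec, Wc, Y)
```

Because every gate value lies strictly in $(0, 1)$, the optimized tensor $Z$ never exceeds $Y$ in magnitude: the module can only attenuate features, re-weighting them according to the aggregated dependencies.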
And taking the optimized video data characteristic tensor Z as the output of SEC.
4. And (5) identifying actions.
Inputting the optimized video data feature tensor $Z$ into the action recognition model yields the action recognition result; the underlying recognition principle follows conventional techniques and is not repeated herein.
To illustrate the effectiveness of the present invention, the following experiments were performed.
Experiments were performed on four real datasets: Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with action classification accuracy (Acc) as the evaluation index. The widely used temporal segment network (TSN) and temporal shift module (TSM) serve as baselines. The experiment was divided into three parts:
1) The effects of the various dependency modelings proposed by the present invention were verified on Something-Something V1 with TSN as the baseline; the results are shown in Table 1.
TABLE 1 Improvement of the action recognition model by content dependency modeling
Here #P denotes the total number of model parameters, and FLOPs denotes the number of floating-point operations required to classify the actions in one video, a measure of computation. As can be seen from Table 1, the proposed dependency modeling adds only a small number of parameters and little computation. Second, each of the three proposed long-range dependency modelings effectively improves the action recognition performance of the baseline model TSN, and using all three together gives better results than using any single long-range dependency alone. Table 1 also compares several short-range dependency variants, where Sxxx denotes the short-range dependency model with $W_{pool} = (x, x, x)$; S122 achieves the best performance, and using three short-range dependencies simultaneously does not further strengthen classification performance. Finally, the invention achieves its largest improvement over TSN when the three long-range dependencies and S122 are used together.
2) The two dependency aggregation schemes proposed by the present invention, selective aggregation and average summation, were compared on Something-Something V1; the results are shown in Table 2, where AVG denotes average aggregation and SEC denotes selective aggregation.

TABLE 2 Comparison of selective aggregation with average aggregation

As can be seen from Table 2, under different settings the selective aggregation SEC significantly improves the accuracy of action recognition over the average aggregation AVG, while the additional parameters and computation are almost zero.
3) On Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with TSN and TSM as baselines, the action recognition accuracy of the present invention was compared with other state-of-the-art action recognition models; the results are shown in Table 3.

TABLE 3 Performance comparison of the models based on selective dependency aggregation with other state-of-the-art models

In Table 3, SDA-TSN and SDA-TSM denote the combination of the proposed scheme with the two baseline models TSN and TSM; C3D, GST and TAM are current state-of-the-art action recognition models. As can be seen from Table 3, SDA-TSN and SDA-TSM far exceed their original baseline models on all datasets, as well as the current state-of-the-art models.
Another embodiment of the present invention further provides a motion recognition system in video data, which is mainly used for implementing the method provided in the foregoing embodiment, as shown in fig. 4, where the system mainly includes:
an original video feature tensor obtaining unit, configured to obtain an original video feature tensor of the motion recognition model extracted from the video data;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor along different directions and at different scales to obtain a plurality of groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
the dependency aggregation unit is used for introducing, via an attention mechanism with a query structure, a query vector that is matched against all dependency characterizations, computing the weight of each dependency characterization from the matching response intensity, weighted-summing them into a final dependency characterization, and applying a threshold operation with the final dependency characterization to the original video data feature tensor to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
In addition, the main technical details related to the above system parts are described in detail in the previous method embodiments, so they will not be described in detail.
Another embodiment of the present invention also provides a processing apparatus, as shown in fig. 5, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (RAM) or non-volatile memory, such as disk storage.
Another embodiment of the present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiment.
The readable storage medium according to the embodiment of the present invention may be provided as a computer-readable storage medium in the aforementioned processing apparatus, for example as the memory in the processing apparatus. The readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution readily conceivable to those skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A method for identifying actions in video data, comprising:
acquiring an original video characteristic tensor extracted from video data by an action recognition model;
pooling the original video feature tensor in different directions and at different scales by means of multi-content dependency modeling of the video data to obtain a plurality of groups of compressed dependent feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
introducing a query vector to be matched against all dependency characterizations by means of the query-structured attention mechanism, calculating the weight of each dependency characterization according to the matching response strength, performing a weighted summation to obtain a final dependency characterization, and performing a threshold operation on the original video data feature tensor with the final dependency characterization to obtain an optimized video data feature tensor;
inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result;
prior to pooling the original video feature tensor in different directions and at different scales, the method comprises: using a convolution layer and a ReLU activation function to compress the original video feature tensor from $C$ channels to $C/r_c$ channels; the resulting video feature tensor is denoted $Y \in \mathbb{R}^{(C/r_c) \times T \times H \times W}$, where $\mathbb{R}$ is the set of real numbers, $T$, $H$ and $W$ denote in turn the length, height and width of the video feature tensor, and $r_c$ denotes the compression coefficient;
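The channel-compression step amounts to a pointwise ($1 \times 1 \times 1$) convolution over channels followed by a ReLU. A minimal NumPy sketch under assumed toy dimensions (the weight matrix `Wc` and all names here are illustrative, not taken from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, H, W, r_c = 8, 4, 6, 6, 4          # toy sizes; r_c is the compression coefficient
X = rng.standard_normal((C, T, H, W))    # stand-in for the original video feature tensor
Wc = rng.standard_normal((C // r_c, C))  # hypothetical 1x1x1 convolution weights

# a 1x1x1 convolution is a linear map over channels at every (t, h, w) location
Y = np.einsum('oc,cthw->othw', Wc, X)
Y = np.maximum(Y, 0.0)                   # ReLU activation

print(Y.shape)  # (C // r_c, T, H, W) = (2, 4, 6, 6)
```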
pooling at different scales from different directions includes:
the pooling kernel is denoted $p_t \times p_h \times p_w$ and the average pooling operation is denoted $\mathrm{pool}_{avg}(\cdot)$; the compressed dependent feature tensor $A$ is computed from the video feature tensor $Y$ as follows:

$A = \mathrm{pool}_{avg}(Y;\, p_t \times p_h \times p_w)$

wherein $p_t$, $p_h$, $p_w$ denote the size of the pooling kernel's receptive field; different values of $p_t$, $p_h$, $p_w$ correspond to different directions and different scales;
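The multi-direction, multi-scale average pooling can be sketched in NumPy, assuming non-overlapping pooling with stride equal to the kernel size (an assumption; the function name `pool_avg` mirrors the notation above, but the implementation is illustrative):

```python
import numpy as np

def pool_avg(Y, p_t, p_h, p_w):
    """Non-overlapping average pooling with kernel (p_t, p_h, p_w); stride = kernel (assumed)."""
    c, t, h, w = Y.shape
    assert t % p_t == 0 and h % p_h == 0 and w % p_w == 0  # toy assumption: exact tiling
    return (Y.reshape(c, t // p_t, p_t, h // p_h, p_h, w // p_w, p_w)
             .mean(axis=(2, 4, 6)))

Y = np.arange(2 * 4 * 4 * 4, dtype=float).reshape(2, 4, 4, 4)
A_temporal = pool_avg(Y, 1, 4, 4)  # pool away space: one value per frame per channel
A_spatial  = pool_avg(Y, 4, 1, 1)  # pool away time: one spatial map per channel
print(A_temporal.shape, A_spatial.shape)  # (2, 4, 1, 1) (2, 1, 4, 4)
```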
performing dependency activation by using a convolution layer, and obtaining corresponding dependency characterization includes:
the convolution kernel is denoted $c_t \times c_h \times c_w$ and the convolution operation is denoted $\mathrm{Conv}_{3d}(\cdot)$; a convolution operation is performed on the dependent feature tensor $A$ to obtain the corresponding dependency characterization $R$, as follows:

$R = \mathrm{Conv}_{3d}(A;\, c_t \times c_h \times c_w)$

wherein $c_t$, $c_h$, $c_w$ denote the size of the convolution kernel;
two groups of content dependencies are set in the video data multi-content dependency modeling:
the first group is long-range content dependencies, which reflect relationships between video contents from three perspectives: temporal, spatial, and spatio-temporal; when the pooling kernel is $T \times H \times W$, it reflects long-range spatio-temporal dependence, and the convolution kernel of the corresponding convolution layer is $1 \times 1 \times 1$; when the pooling kernel is $1 \times H \times W$, it reflects long-range temporal dependence, and the convolution kernel of the corresponding convolution layer is $c_t \times 1 \times 1$; when the pooling kernel is $T \times 1 \times 1$, it reflects long-range spatial dependence, and the convolution kernel of the corresponding convolution layer is $1 \times c_h \times c_w$;
The second group is short-range content dependencies, which focus on information compressed within local spatio-temporal receptive fields; the corresponding pooling kernel is $b \times a \times a$, and the convolution kernel of the corresponding convolution layer is $d \times c \times c$;
wherein $1 < a, c < \min(H, W)$ and $1 < b, d < T$;
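One plausible reading of these kernel settings (an assumption for illustration; the patent's drawings define the exact shapes) tabulates the dependency branches as (pooling kernel, convolution kernel) pairs:

```python
T, H, W = 8, 16, 16       # toy tensor extents
a, b, c, d = 2, 2, 4, 4   # hypothetical short-range sizes: 1 < a, c < min(H, W); 1 < b, d < T
c_t, c_h, c_w = 3, 3, 3   # illustrative convolution extents

# hypothetical (pooling kernel, convolution kernel) per dependency branch
branches = {
    'long-range spatio-temporal': ((T, H, W), (1, 1, 1)),
    'long-range temporal':        ((1, H, W), (c_t, 1, 1)),
    'long-range spatial':         ((T, 1, 1), (1, c_h, c_w)),
    'short-range local':          ((b, a, a), (d, c, c)),
}
for name, (pool_k, conv_k) in branches.items():
    print(f'{name}: pool {pool_k}, conv {conv_k}')
```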
introducing a query vector to be matched against all dependency characterizations by means of the query-structured attention mechanism, calculating the weight of each dependency characterization according to the matching response strength, and performing a weighted summation to obtain the final dependency characterization, comprises the following steps:
introducing a learnable query vector $q$, and compressing all dependency characterizations $R_1, R_2, \dots, R_M$ through a global average pooling layer into an $M \times C/r_c$ matrix $K$, which serves as the keys in the attention mechanism:

$K = [\mathrm{GAP}(R_1); \mathrm{GAP}(R_2); \dots; \mathrm{GAP}(R_M)]$

stacking $R_1, R_2, \dots, R_M$ as the values:

$V = [R_1; R_2; \dots; R_M]$
wherein $M$ is the number of dependency characterizations, $C$ denotes the number of channels of the video feature tensor, and $r_c$ denotes the compression coefficient;
calculating the inner product of the query vector with each key vector by the following formula to obtain the matching response strength of each dependency characterization, which serves as the weight for the subsequent weighted summation:
Attention(q,K)=softmax(q×K T )
the final dependency characterization is obtained by weighted summation:
R sec =Attention(q,K)×V
where softmax () represents the softmax function, T is the matrix transpose sign, x represents the matrix multiplication.
2. The method of claim 1, wherein thresholding the original video data feature tensor with the final dependency characterization to obtain the optimized video data feature tensor comprises:
a $1 \times 1 \times 1$ three-dimensional convolution kernel is used to restore the number of channels of the final dependency characterization $R_{sec}$ to that of the original video feature tensor $Y$; a Sigmoid activation function then maps the result into the interval (0.0, 1.0), which is finally multiplied element-wise with the original video feature tensor $Y$ to obtain the optimized video data feature tensor $Z$, expressed as follows:

$Z = \mathrm{Sigmoid}(\mathrm{Conv}_{3d}(R_{sec};\, 1 \times 1 \times 1)) \odot Y$

wherein $\odot$ denotes element-wise multiplication and $\mathrm{Conv}_{3d}(\cdot)$ denotes a convolution operation.
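The gating step of claim 2 restores the channel count with a 1×1×1 convolution, squashes the result into (0.0, 1.0) with a Sigmoid, and multiplies it element-wise into the original tensor. A NumPy sketch with hypothetical names and toy shapes (the dependency characterization is assumed already resized to Y's spatio-temporal extent):

```python
import numpy as np

rng = np.random.default_rng(2)
C, Cr, T, H, W = 8, 2, 4, 4, 4
Y = rng.standard_normal((C, T, H, W))       # original video feature tensor
R_sec = rng.standard_normal((Cr, T, H, W))  # final dependency characterization (assumed resized)

W1 = rng.standard_normal((C, Cr))           # hypothetical 1x1x1 conv weights: Cr -> C channels
restored = np.einsum('oc,cthw->othw', W1, R_sec)

gate = 1.0 / (1.0 + np.exp(-restored))      # Sigmoid maps into the interval (0, 1)
Z = gate * Y                                # element-wise (Hadamard) product with Y

print(Z.shape)  # (C, T, H, W) = (8, 4, 4, 4)
```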
3. A motion recognition system in video data for implementing the method of any one of claims 1-2, the system comprising:
an original video feature tensor obtaining unit, configured to obtain an original video feature tensor of the motion recognition model extracted from the video data;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor in different directions and at different scales by means of multi-content dependency modeling of the video data to obtain a plurality of groups of compressed dependent feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
the dependency aggregation unit is used for introducing a query vector to be matched against all dependency characterizations by means of the query-structured attention mechanism, calculating the weight of each dependency characterization according to the matching response strength, performing a weighted summation to obtain a final dependency characterization, and performing a threshold operation on the original video data feature tensor with the final dependency characterization to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
4. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-2.
5. A readable storage medium storing a computer program, characterized in that the method according to any one of claims 1-2 is implemented when the computer program is executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111363930.XA CN113989940B (en) | 2021-11-17 | 2021-11-17 | Method, system, device and storage medium for identifying actions in video data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113989940A CN113989940A (en) | 2022-01-28 |
CN113989940B true CN113989940B (en) | 2024-03-29 |
Family
ID=79749106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111363930.XA Active CN113989940B (en) | 2021-11-17 | 2021-11-17 | Method, system, device and storage medium for identifying actions in video data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113989940B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114926770B (en) * | 2022-05-31 | 2024-06-07 | 上海人工智能创新中心 | Video motion recognition method, apparatus, device and computer readable storage medium |
CN115861901B (en) * | 2022-12-30 | 2023-06-30 | 深圳大学 | Video classification method, device, equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325145A (en) * | 2020-02-19 | 2020-06-23 | 中山大学 | Behavior identification method based on combination of time domain channel correlation blocks |
WO2020233010A1 (en) * | 2019-05-23 | 2020-11-26 | 平安科技(深圳)有限公司 | Image recognition method and apparatus based on segmentable convolutional network, and computer device |
CN112131943A (en) * | 2020-08-20 | 2020-12-25 | 深圳大学 | Video behavior identification method and system based on dual attention model |
CN112926396A (en) * | 2021-01-28 | 2021-06-08 | 杭州电子科技大学 | Action identification method based on double-current convolution attention |
CN113297964A (en) * | 2021-05-25 | 2021-08-24 | 周口师范学院 | Video target recognition model and method based on deep migration learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11315570B2 (en) * | 2018-05-02 | 2022-04-26 | Facebook Technologies, Llc | Machine learning-based speech-to-text transcription cloud intermediary |
2021-11-17: application CN202111363930.XA filed in China (CN); granted as patent CN113989940B, status Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020233010A1 (en) * | 2019-05-23 | 2020-11-26 | 平安科技(深圳)有限公司 | Image recognition method and apparatus based on segmentable convolutional network, and computer device |
CN111325145A (en) * | 2020-02-19 | 2020-06-23 | 中山大学 | Behavior identification method based on combination of time domain channel correlation blocks |
CN112131943A (en) * | 2020-08-20 | 2020-12-25 | 深圳大学 | Video behavior identification method and system based on dual attention model |
CN112926396A (en) * | 2021-01-28 | 2021-06-08 | 杭州电子科技大学 | Action identification method based on double-current convolution attention |
CN113297964A (en) * | 2021-05-25 | 2021-08-24 | 周口师范学院 | Video target recognition model and method based on deep migration learning |
Non-Patent Citations (2)
Title |
---|
Wang Huitao; Hu Yan. An efficient video classification method based on global spatio-temporal receptive fields. Journal of Chinese Computer Systems, 2020, (08), full text. *
Xie Huaiqi; Le Hongbing. Video human action recognition based on a channel attention mechanism. Electronic Technology & Software Engineering, 2020, (04), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN113989940A (en) | 2022-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220392234A1 (en) | Training neural networks for vehicle re-identification | |
CN110929622B (en) | Video classification method, model training method, device, equipment and storage medium | |
Khan et al. | An effective framework for driver fatigue recognition based on intelligent facial expressions analysis | |
CN113989940B (en) | Method, system, device and storage medium for identifying actions in video data | |
WO2020228525A1 (en) | Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device | |
Zheng et al. | Discriminative dictionary learning via Fisher discrimination K-SVD algorithm | |
CN111242208A (en) | Point cloud classification method, point cloud segmentation method and related equipment | |
JP2017062781A (en) | Similarity-based detection of prominent objects using deep cnn pooling layers as features | |
Hong et al. | D3: recognizing dynamic scenes with deep dual descriptor based on key frames and key segments | |
CN112529068B (en) | Multi-view image classification method, system, computer equipment and storage medium | |
US11804043B2 (en) | Detecting objects in a video using attention models | |
CN115294563A (en) | 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability | |
CN114419732A (en) | HRNet human body posture identification method based on attention mechanism optimization | |
CN115131218A (en) | Image processing method, image processing device, computer readable medium and electronic equipment | |
CN114821251B (en) | Method and device for determining point cloud up-sampling network | |
Meena et al. | Effective curvelet-based facial expression recognition using graph signal processing | |
Zhou et al. | Adaptive weighted locality-constrained sparse coding for glaucoma diagnosis | |
Lamba et al. | A texture based mani-fold approach for crowd density estimation using Gaussian Markov Random Field | |
CN114358246A (en) | Graph convolution neural network module of attention mechanism of three-dimensional point cloud scene | |
CN111914809B (en) | Target object positioning method, image processing method, device and computer equipment | |
Zeng et al. | Domestic activities classification from audio recordings using multi-scale dilated depthwise separable convolutional network | |
CN115860802A (en) | Product value prediction method, device, computer equipment and storage medium | |
Danelakis et al. | A robust spatio-temporal scheme for dynamic 3D facial expression retrieval | |
CN114782684B (en) | Point cloud semantic segmentation method and device, electronic equipment and storage medium | |
Li et al. | Geometry-invariant texture retrieval using a dual-output pulse-coupled neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |