CN113989940B - Method, system, device and storage medium for identifying actions in video data - Google Patents
- Publication number
- CN113989940B CN113989940B CN202111363930.XA CN202111363930A CN113989940B CN 113989940 B CN113989940 B CN 113989940B CN 202111363930 A CN202111363930 A CN 202111363930A CN 113989940 B CN113989940 B CN 113989940B
- Authority
- CN
- China
- Prior art keywords
- video data
- dependency
- tensor
- dependent
- pooling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/153—Multidimensional correlation or convolution
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The invention discloses a method, system, device and storage medium for identifying actions in video data. The method models multiple content dependencies of the video data: the original video feature tensor is pooled along different directions and across different dimensions, and dependency activation is performed with a convolution layer to obtain the corresponding dependency characterizations; an attention mechanism with a query structure then aggregates the dependency characterizations, refines the original video feature tensor, and the refined result is used for action recognition. The scheme can be directly inserted into a convolution-based action recognition model, introduces almost no additional parameters or computation, and experiments show that it significantly improves the classification performance of the action recognition model.
Description
Technical Field
The present invention relates to the field of computer vision, and in particular, to a method, system, device, and storage medium for identifying actions in video data.
Background
In the multimedia age, terminal devices such as mobile phones, cameras and surveillance cameras continuously generate massive amounts of video data, and action classification is an effective way to analyze it. However, compared with image data, video data has an additional time dimension that introduces more content dependencies and greatly increases the difficulty of video action recognition.
For the content-dependency modeling problem in action recognition, the following methods currently exist:
1) Implicit dependency modeling. These methods directly extend an existing image classification network, such as ResNet, by inflating its two-dimensional convolution kernels into three-dimensional ones, and rely solely on the optimized three-dimensional kernels to implicitly learn features from the video data. Such methods depend entirely on the number of stacked layers to model long-range dependencies, so only the last few layers of the network can perceive them. Meanwhile, the crude dimension inflation severely increases computation and model size, making these methods difficult to train.
2) Temporal dependency modeling. These methods focus on the time dimension that video data adds over image data, and explicitly exploit temporal information to capture the dynamic characteristics of the video. Compared with implicit dependency modeling, the dedicated design for the time dimension avoids heavy three-dimensional convolution kernels, reducing model complexity and improving performance. However, these methods ignore the other content dependencies that are widespread in video data, so their performance is limited.
3) Global spatiotemporal point attention. These methods add a global attention mechanism to the action classification model and capture long-range content dependencies through pairwise matching between spatiotemporal points in the video data. However, computing the relation between every pair of spatiotemporal points makes the model bloated and slow.
In general, none of the above approaches balances modeling diverse content dependencies against maintaining model efficiency, so both the performance and the computational overhead of action recognition models remain to be optimized.
Disclosure of Invention
The invention aims to provide a method, system, device and storage medium for identifying actions in video data that, for action recognition tasks, model and aggregate multiple content dependencies in the video data while adding almost no parameters or computation to the action recognition model, thereby improving its classification performance.
The aim of the invention is achieved by the following technical scheme:
a method of motion recognition in video data, comprising:
acquiring an original video characteristic tensor extracted from video data by an action recognition model;
pooling, by means of multi-content dependency modeling of the video data, the original video feature tensor along different directions and at different scales to obtain a plurality of groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
using an attention mechanism with a query structure, introducing a query vector that is matched against all dependency characterizations, computing the weight of each dependency characterization from the matching response intensity, and weighted-summing them into a final dependency characterization; applying a threshold operation with the final dependency characterization to the original video data feature tensor to obtain an optimized video data feature tensor;
and inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
A motion recognition system in video data for implementing the foregoing method, the system comprising:
an original video feature tensor obtaining unit, configured to obtain an original video feature tensor of the motion recognition model extracted from the video data;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor along different directions and at different scales to obtain a plurality of groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
the dependency aggregation unit is used for introducing, via an attention mechanism with a query structure, a query vector that is matched against all dependency characterizations, computing the weight of each dependency characterization from the matching response intensity, weighted-summing them into a final dependency characterization, and applying a threshold operation with the final dependency characterization to the original video data feature tensor to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program, characterized in that the aforementioned method is implemented when the computer program is executed by a processor.
It can be seen from the technical scheme provided by the invention that the method is plug-and-play: it can be directly inserted into a convolution-based action recognition model, introduces almost no additional parameters or computation, and experiments show that it significantly improves the classification performance of the action recognition model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for identifying actions in video data according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a model structure of a method for identifying actions in video data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of dependency aggregation based on the attention mechanism with a query structure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a motion recognition system in video data according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The following describes the method for identifying actions in video data in detail. Whatever is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. Where specific conditions are not noted, the embodiments were carried out under conditions conventional in the art or suggested by the manufacturer. Reagents or apparatus used without an indicated manufacturer are conventional, commercially available products.
As shown in fig. 1, a method for identifying actions in video data mainly includes the following steps:
and step 1, acquiring an original video characteristic tensor extracted from video data by the action recognition model.
And step 2, adopting the video data multi-content dependency modeling to pool the original video feature tensor along different directions and at different scales to obtain a plurality of groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain the corresponding dependency characterizations.
And step 3, using an attention mechanism with a query structure, introducing a query vector that is matched against all dependency characterizations, computing the weight of each dependency characterization from the matching response intensity, and weighted-summing them into a final dependency characterization; a threshold operation with the final dependency characterization is applied to the original video data feature tensor to obtain the optimized video data feature tensor.
And step 4, inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
The scheme of the embodiment of the invention can be applied to systems such as surveillance video processing and video content analysis. Taking the widely used temporal segment network (TSN) and temporal shift module (TSM) as baselines, specific experiments are given later that verify the effectiveness of the invention in improving action recognition performance; of course, other action recognition models may also be used with the present invention.
For ease of understanding, the following detailed description is provided for each of the various aspects of the invention.
1. An original video feature tensor is obtained.
In the embodiment of the invention, the input of the motion recognition model is video data, and the output is an original video characteristic tensor; the motion recognition model can be any existing motion recognition model with any structure.
2. Video data multi-content dependent modeling (abbreviated MDM).
As shown in fig. 2, a classical model structure unit for video action recognition contains three convolution layers (conv1, conv2, conv3 on the left of fig. 2), and the proposed method (SDA) acts between the second and third convolution layers. For the video feature tensor $Y \in \mathbb{R}^{T \times H \times W \times C}$ output by the action recognition model, the MDM mines various spatiotemporal content dependencies from it. First, to lighten the model, the MDM uses a convolution layer and a ReLU activation function to compress $Y$ from $C$ channels to $C/r_c$ channels; the resulting feature tensor is denoted $Y' \in \mathbb{R}^{T \times H \times W \times C/r_c}$, where $\mathbb{R}$ is the set of real numbers, $T$, $H$ and $W$ denote in turn the length, height and width of the video feature tensor, and $r_c$ is the compression coefficient.
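As an illustrative sketch only (not the patented implementation; the numpy formulation and all sizes below are assumptions), the channel compression step is a $1 \times 1 \times 1$ convolution followed by ReLU, which amounts to applying the same linear map at every spatiotemporal position:

```python
import numpy as np

def channel_compress(y, w):
    """1x1x1 convolution (the same linear map at every spatiotemporal
    position) followed by ReLU; y: T x H x W x C, w: C x C', C' = C/r_c."""
    t, h, wd, c = y.shape
    out = y.reshape(-1, c) @ w              # per-position channel mixing
    return np.maximum(out, 0.0).reshape(t, h, wd, w.shape[1])

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 7, 7, 16))      # assumed sizes: T=8, H=W=7, C=16
W = rng.standard_normal((16, 4))            # r_c = 4, so C/r_c = 4
Yp = channel_compress(Y, W)
print(Yp.shape)                             # (8, 7, 7, 4)
```

The weight matrix `W` stands in for the learned convolution kernel; in the actual model it would be trained end to end.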
Taking the generated feature tensor $Y'$ as input, the MDM outputs a series of dependency characterizations, denoted $\{R_1, R_2, \dots, R_M\} = \mathrm{MDM}(Y')$. Every dependency characterization is computed with a unified pipeline: feature compression → dependency activation.
1) And (5) compressing the characteristics.
In the embodiment of the invention, feature compression is realized by a pooling operation. Exploiting the spatiotemporal structure of the feature tensor $Y'$, the MDM pools it along different directions (e.g., the spatial direction and the temporal direction) and at different scales (e.g., the global scale and the local scale).
The pooling kernel is denoted $W_{pool} = (p_t, p_h, p_w)$, where $p_t$, $p_h$, $p_w$ indicate the size of the pooling receptive field; different settings of $p_t$, $p_h$, $p_w$ correspond to different directions and scales. To capture the overall data characteristics of the video in each direction, the invention specifically selects average pooling, denoted $\mathrm{pool}_{avg}()$, as the pooling operation. The compressed dependency feature tensor $A$ is computed from the feature tensor $Y'$ as:

$$A = \mathrm{pool}_{avg}(Y'; W_{pool})$$

The compressed dependency feature tensor provides the data characteristics within the pooling receptive field for the subsequent dependency modeling.
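For intuition, non-overlapping average pooling with a kernel $W_{pool} = (p_t, p_h, p_w)$ can be sketched in numpy as follows. This is a minimal illustration under assumed tensor sizes, not the patented code, and the kernel choices below are examples only:

```python
import numpy as np

def avg_pool3d(y, p):
    """Average-pool a T x H x W x C tensor with a non-overlapping kernel
    p = (p_t, p_h, p_w); each dimension must be divisible by the kernel."""
    pt, ph, pw = p
    t, h, w, c = y.shape
    blocks = y.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    return blocks.mean(axis=(1, 3, 5))      # average within each block

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 6, 6, 4))       # assumed: T=8, H=W=6, C/r_c=4
A_global = avg_pool3d(Y, (8, 6, 6))         # global scale: 1 x 1 x 1 x 4
A_time   = avg_pool3d(Y, (1, 6, 6))         # keeps the time axis: 8 x 1 x 1 x 4
A_space  = avg_pool3d(Y, (8, 1, 1))         # keeps the spatial axes: 1 x 6 x 6 x 4
A_local  = avg_pool3d(Y, (2, 3, 3))         # local scale: 4 x 2 x 2 x 4
```

Each choice of kernel compresses a different direction at a different scale, matching the "different directions, different scales" description above.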
2) Dependent on activation.
After the dependency feature tensor $A$ is obtained, the MDM applies a convolution operator and a ReLU operator to realize dependency activation, and then a reshape operator restores the pooled dependency feature tensor to its size before pooling for the convenience of subsequent computation.

Analogous to the pooling operation, the convolution kernel is denoted $W_{conv} = (c_t, c_h, c_w)$, where $c_t$, $c_h$, $c_w$ indicate the size of the convolution kernel, and the convolution operation is denoted $\mathrm{Conv}_{3d}()$. Convolving the dependency feature tensor $A$ yields the corresponding dependency characterization $R$:

$$R = \mathrm{ReLU}(\mathrm{Conv}_{3d}(A; W_{conv}))$$
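The dependency activation step can be illustrated with a minimal single-channel, "same"-padded 3D convolution followed by ReLU. This is a didactic numpy sketch, not the patented multi-channel layer, and the all-ones kernel is an assumed example:

```python
import numpy as np

def conv3d_relu(a, kernel):
    """'Same'-padded single-channel 3D convolution followed by ReLU.
    a: T x H x W, kernel: c_t x c_h x c_w with odd sizes."""
    ct, ch, cw = kernel.shape
    pad = ((ct // 2,) * 2, (ch // 2,) * 2, (cw // 2,) * 2)
    ap = np.pad(a, pad)                     # zero padding keeps the shape
    out = np.empty_like(a, dtype=float)
    t, h, w = a.shape
    for i in range(t):
        for j in range(h):
            for k in range(w):
                out[i, j, k] = np.sum(ap[i:i + ct, j:j + ch, k:k + cw] * kernel)
    return np.maximum(out, 0.0)

A = np.ones((4, 2, 2))                      # a toy dependency feature tensor
R = conv3d_relu(A, np.ones((3, 1, 1)))      # an assumed temporal 3x1x1 kernel
```

With the all-ones input, interior time steps sum three neighbors while the two boundary steps see one zero-padded neighbor, so the output is 3 in the interior and 2 at the boundaries.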
Based on the above principle, in the embodiment of the present invention, two groups of content dependencies are set in the multi-content dependency modeling of video data, as shown in parts (a) and (b) on the right of fig. 2:

The first group is long-range content dependencies, which reflect the relationship between video content from three perspectives: spatiotemporal, temporal, and spatial. A pooling kernel that covers the global spatiotemporal extent yields the long-range spatiotemporal dependency (LST for short); a pooling kernel that preserves the time axis while compressing the spatial extent yields the long-range temporal dependency (LT for short); and a pooling kernel that preserves the spatial axes while compressing the time axis yields the long-range spatial dependency (LS for short). Each of these three pooling kernels is paired with a convolution kernel of matching shape in the corresponding convolution layer.

Based on the three pooling kernels, three compressed dependency feature tensors are obtained; likewise, the three convolution operations mix the information between channels and yield the corresponding dependency characterizations $R_1$, $R_2$ and $R_3$.

The second group is the short-range content dependency, which compresses information within a local spatiotemporal receptive field: a local pooling kernel with spatial size $a$ and temporal size $b$ compresses the dynamic information presented within the local receptive field, and the corresponding convolution layer uses a kernel with spatial size $c$ and temporal size $d$. The corresponding dependency feature tensor $A_S$ and dependency characterization $R_4$ are computed as described above.
In the embodiment of the invention, $1 < a, c < \min(H, W)$ and $1 < b, d < T$. By way of example, one may set $a = 3$, $b = 3$, $c = 2$, $d = 2$.
After the various dependency characterizations are obtained, they are scaled back to size $T \times H \times W \times C/r_c$ by element replication.
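Scaling a pooled characterization back to the full $T \times H \times W \times C/r_c$ size by element replication amounts to nearest-neighbor repetition along each pooled axis. A minimal sketch under assumed sizes:

```python
import numpy as np

def replicate_back(a, shape):
    """Repeat a pooled tensor a along each pooled axis until it matches
    shape = (T, H, W, C'); each target dim must be a multiple of a's dim."""
    t, h, w, _ = shape
    at, ah, aw, _ = a.shape
    out = np.repeat(a, t // at, axis=0)     # replicate along time
    out = np.repeat(out, h // ah, axis=1)   # replicate along height
    return np.repeat(out, w // aw, axis=2)  # replicate along width

a = np.arange(4.0).reshape(1, 1, 1, 4)      # e.g. a globally pooled tensor
R = replicate_back(a, (8, 6, 6, 4))         # back to T x H x W x C'
```

Every spatiotemporal position of the output then carries the same channel vector that the pooled cell held, which is what makes the subsequent element-wise aggregation well defined.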
3. Selective dependency aggregation (abbreviated SEC).

The most intuitive dependency aggregation method is to average the obtained dependency characterizations. However, different videos have different dependency preferences, and simple average aggregation attends to irrelevant dependencies while ignoring important ones.
As shown in FIG. 3, the present invention uses an attention mechanism with a query structure (QSA for short) to aggregate the dependency characterizations, automatically emphasizing important dependencies by assigning weights to the different dependency characterizations. Specifically:
A learnable query vector $q \in \mathbb{R}^{C/r_c}$ is introduced, and all dependency characterizations $\{R_1, \dots, R_M\}$ are compressed by a global average pooling layer into an $M \times C/r_c$ matrix $K$, which serves as the keys in the attention mechanism:

$$K = [\mathrm{pool}_{avg}(R_1); \dots; \mathrm{pool}_{avg}(R_M)]$$

The dependency characterizations themselves are stacked as the values:

$$V = [R_1; \dots; R_M]$$

where $M$ is the number of dependency characterizations (e.g., $M = 4$ in the preceding example), $C$ is the number of channels of the original video feature tensor, and $r_c$ is the compression coefficient.
The matching response intensity of each dependency characterization, which serves as the weight for the subsequent weighted sum, is obtained by computing the vector inner product between the query vector and each dependency key (written as a matrix multiplication for convenience):

$$\mathrm{Attention}(q, K) = \mathrm{softmax}(q K^{\top})$$

The final dependency characterization is then obtained as the weighted sum of the dependency characterizations:

$$R_{sec} = \mathrm{Attention}(q, K) \times V$$

where $\mathrm{softmax}()$ denotes the softmax function, $\top$ denotes transposition, and $\times$ denotes matrix multiplication.
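The query-structured attention can be sketched end to end in a few lines of numpy. This is an illustrative toy with assumed shapes and a hand-picked example; in practice the query $q$ would be learned:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                 # shift for numerical stability
    return e / e.sum()

def qsa_aggregate(q, reps):
    """Weight M dependency characterizations (each T x H x W x C') by the
    inner product between query q (length C') and their pooled keys."""
    K = np.stack([r.mean(axis=(0, 1, 2)) for r in reps])   # keys: M x C'
    w = softmax(q @ K.T)                                   # Attention(q, K)
    V = np.stack(reps)                                     # values: M x ...
    return np.tensordot(w, V, axes=1), w                   # R_sec, weights

reps = [np.full((2, 2, 2, 3), float(i)) for i in range(4)]  # 4 toy characterizations
R_sec, w = qsa_aggregate(np.zeros(3), reps)                 # zero query -> uniform weights
```

With a zero query all matching scores are equal, so the weights are uniform (0.25 each) and the aggregation degenerates to the simple average; a trained query skews the weights toward the dependencies that the video actually prefers.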
The above operations are performed by the Dependency Aggregation Block in fig. 2. With this design, SEC adds almost no extra parameters or computation.
In the embodiment of the present invention, a $1 \times 1 \times 1$ three-dimensional convolution kernel restores the number of channels of the final dependency characterization $R_{sec}$ to that of the original video feature tensor $Y$; a Sigmoid activation function then maps the result into the interval $(0, 1)$, and the mapped tensor is finally multiplied element-wise with the original video feature tensor $Y$ to obtain the optimized video data feature tensor $Z$:

$$Z = \mathrm{Sigmoid}(\mathrm{Conv}_{3d}(R_{sec}; 1 \times 1 \times 1)) \odot Y$$

where $\odot$ denotes element-wise multiplication and $\mathrm{Conv}_{3d}()$ denotes the convolution operation.
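The final gating step — restore the channels with a $1 \times 1 \times 1$ convolution, squash with a Sigmoid, and multiply element-wise into $Y$ — can be sketched as follows (assumed shapes; the weight matrix stands in for the learned convolution):

```python
import numpy as np

def sigmoid_gate(r_sec, w, y):
    """Z = Sigmoid(Conv3d(R_sec; 1x1x1)) ⊙ Y, with the 1x1x1 convolution
    written as a channel matrix w of shape C' x C."""
    t, h, wd, cp = r_sec.shape
    g = r_sec.reshape(-1, cp) @ w           # back to C channels
    g = 1.0 / (1.0 + np.exp(-g))            # Sigmoid -> gate values in (0, 1)
    return g.reshape(y.shape) * y           # element-wise gating of Y

rng = np.random.default_rng(0)
R_sec = rng.standard_normal((2, 2, 2, 3))   # final dependency characterization
Wc = rng.standard_normal((3, 5))            # C' = 3 -> C = 5
Y = rng.standard_normal((2, 2, 2, 5))       # original video feature tensor
Z = sigmoid_gate(R_sec, Wc, Y)
```

Because every gate value lies strictly in $(0, 1)$, the optimized tensor $Z$ never exceeds $Y$ in magnitude: the module can only attenuate features, re-weighting them according to the aggregated dependencies.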
And taking the optimized video data characteristic tensor Z as the output of SEC.
4. And (5) identifying actions.
Inputting the optimized video data feature tensor $Z$ into the action recognition model yields the action recognition result; the underlying recognition principle follows conventional techniques and is not repeated herein.
To illustrate the effectiveness of the present invention, the following experiments were performed.
Experiments were performed on four real datasets: Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with action classification accuracy (Acc) as the evaluation index. The widely used temporal segment network (TSN) and temporal shift module (TSM) serve as baselines. The experiment was divided into three parts:
1) The effects of the various dependency modelings proposed by the present invention were verified on Something-Something V1 with TSN as the baseline; the results are shown in Table 1.
TABLE 1 Improvement of the action recognition model by content dependency modeling
Here #P denotes the total number of model parameters, and FLOPs denotes the number of floating-point operations required to classify the actions in one video, a measure of computation. As can be seen from Table 1, the proposed dependency modeling adds only a small number of parameters and little computation. Second, each of the three proposed long-range dependency modelings effectively improves the action recognition performance of the baseline model TSN, and using all three together gives better results than using any single long-range dependency alone. Table 1 also compares several short-range dependency variants, where Sxxx denotes the short-range dependency model with $W_{pool} = (x, x, x)$; S122 achieves the best performance, and using three short-range dependencies simultaneously does not further strengthen classification performance. Finally, the invention achieves its largest improvement over TSN when the three long-range dependencies and S122 are used together.
2) The two dependency aggregation schemes proposed by the present invention, selective aggregation and average summation, were compared on Something-Something V1; the results are shown in Table 2, where AVG denotes average aggregation and SEC denotes selective aggregation.

TABLE 2 Comparison of selective aggregation with average aggregation

As can be seen from Table 2, under different settings the selective aggregation SEC significantly improves the accuracy of action recognition over the average aggregation AVG, while the additional parameters and computation are almost zero.
3) On Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with TSN and TSM as baselines, the action recognition accuracy of the present invention was compared with other state-of-the-art action recognition models; the results are shown in Table 3.

TABLE 3 Performance comparison of the models based on selective dependency aggregation with other state-of-the-art models

In Table 3, SDA-TSN and SDA-TSM denote the combination of the proposed scheme with the two baseline models TSN and TSM; C3D, GST and TAM are current state-of-the-art action recognition models. As can be seen from Table 3, SDA-TSN and SDA-TSM far exceed their original baseline models on all datasets, as well as the current state-of-the-art models.
Another embodiment of the present invention further provides a motion recognition system in video data, which is mainly used for implementing the method provided in the foregoing embodiment, as shown in fig. 4, where the system mainly includes:
an original video feature tensor obtaining unit, configured to obtain an original video feature tensor of the motion recognition model extracted from the video data;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor along different directions and at different scales to obtain a plurality of groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
the dependency aggregation unit is used for introducing, via an attention mechanism with a query structure, a query vector that is matched against all dependency characterizations, computing the weight of each dependency characterization from the matching response intensity, weighted-summing them into a final dependency characterization, and applying a threshold operation with the final dependency characterization to the original video data feature tensor to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
In addition, the main technical details related to the above system parts are described in detail in the previous method embodiments, so they will not be described in detail.
Another embodiment of the present invention also provides a processing apparatus, as shown in fig. 5, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (RAM) or non-volatile memory, such as disk storage.
Another embodiment of the present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiment.
The readable storage medium according to the embodiment of the present invention may be provided as a computer-readable storage medium in the aforementioned processing apparatus, for example as the memory in the processing apparatus. The readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution readily conceivable to those skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A method for identifying actions in video data, comprising:
acquiring an original video characteristic tensor extracted from video data by an action recognition model;
pooling the original video feature tensor in different directions and at different scales by means of multi-content dependency modeling of the video data to obtain a plurality of groups of compressed dependent feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
introducing a query vector to be matched against all dependency characterizations by means of the query-structured attention mechanism, calculating the weight of each dependency characterization according to the matching response strength, performing a weighted summation to obtain a final dependency characterization, and performing a threshold operation on the original video data feature tensor with the final dependency characterization to obtain an optimized video data feature tensor;
inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result;
prior to pooling the original video feature tensor in different directions and at different scales, the method comprises: using a convolution layer and a ReLU activation function to compress the original video feature tensor from $C$ channels to $C/r_c$ channels; the resulting video feature tensor is denoted $Y \in \mathbb{R}^{(C/r_c) \times T \times H \times W}$, where $\mathbb{R}$ is the set of real numbers, $T$, $H$ and $W$ denote in turn the length, height and width of the video feature tensor, and $r_c$ denotes the compression coefficient;
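The channel-compression step amounts to a pointwise ($1 \times 1 \times 1$) convolution over channels followed by a ReLU. A minimal NumPy sketch under assumed toy dimensions (the weight matrix `Wc` and all names here are illustrative, not taken from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, H, W, r_c = 8, 4, 6, 6, 4          # toy sizes; r_c is the compression coefficient
X = rng.standard_normal((C, T, H, W))    # stand-in for the original video feature tensor
Wc = rng.standard_normal((C // r_c, C))  # hypothetical 1x1x1 convolution weights

# a 1x1x1 convolution is a linear map over channels at every (t, h, w) location
Y = np.einsum('oc,cthw->othw', Wc, X)
Y = np.maximum(Y, 0.0)                   # ReLU activation

print(Y.shape)  # (C // r_c, T, H, W) = (2, 4, 6, 6)
```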
pooling at different scales from different directions includes:
the pooling kernel is denoted $p_t \times p_h \times p_w$ and the average pooling operation is denoted $\mathrm{pool}_{avg}(\cdot)$; the compressed dependent feature tensor $A$ is computed from the video feature tensor $Y$ as follows:

$A = \mathrm{pool}_{avg}(Y;\, p_t \times p_h \times p_w)$

wherein $p_t$, $p_h$, $p_w$ denote the size of the pooling kernel's receptive field; different values of $p_t$, $p_h$, $p_w$ correspond to different directions and different scales;
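The multi-direction, multi-scale average pooling can be sketched in NumPy, assuming non-overlapping pooling with stride equal to the kernel size (an assumption; the function name `pool_avg` mirrors the notation above, but the implementation is illustrative):

```python
import numpy as np

def pool_avg(Y, p_t, p_h, p_w):
    """Non-overlapping average pooling with kernel (p_t, p_h, p_w); stride = kernel (assumed)."""
    c, t, h, w = Y.shape
    assert t % p_t == 0 and h % p_h == 0 and w % p_w == 0  # toy assumption: exact tiling
    return (Y.reshape(c, t // p_t, p_t, h // p_h, p_h, w // p_w, p_w)
             .mean(axis=(2, 4, 6)))

Y = np.arange(2 * 4 * 4 * 4, dtype=float).reshape(2, 4, 4, 4)
A_temporal = pool_avg(Y, 1, 4, 4)  # pool away space: one value per frame per channel
A_spatial  = pool_avg(Y, 4, 1, 1)  # pool away time: one spatial map per channel
print(A_temporal.shape, A_spatial.shape)  # (2, 4, 1, 1) (2, 1, 4, 4)
```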
performing dependency activation by using a convolution layer, and obtaining corresponding dependency characterization includes:
the convolution kernel is denoted $c_t \times c_h \times c_w$ and the convolution operation is denoted $\mathrm{Conv}_{3d}(\cdot)$; a convolution operation is performed on the dependent feature tensor $A$ to obtain the corresponding dependency characterization $R$, as follows:

$R = \mathrm{Conv}_{3d}(A;\, c_t \times c_h \times c_w)$

wherein $c_t$, $c_h$, $c_w$ denote the size of the convolution kernel;
two groups of content dependencies are set in the video data multi-content dependency modeling:
the first group is long-range content dependencies, which reflect relationships between video contents from three perspectives: temporal, spatial, and spatio-temporal; when the pooling kernel is $T \times H \times W$, it reflects long-range spatio-temporal dependence, and the convolution kernel of the corresponding convolution layer is $1 \times 1 \times 1$; when the pooling kernel is $1 \times H \times W$, it reflects long-range temporal dependence, and the convolution kernel of the corresponding convolution layer is $c_t \times 1 \times 1$; when the pooling kernel is $T \times 1 \times 1$, it reflects long-range spatial dependence, and the convolution kernel of the corresponding convolution layer is $1 \times c_h \times c_w$;
The second group is short-range content dependencies, which focus on information compressed within local spatio-temporal receptive fields; the corresponding pooling kernel is $b \times a \times a$, and the convolution kernel of the corresponding convolution layer is $d \times c \times c$;
wherein $1 < a, c < \min(H, W)$ and $1 < b, d < T$;
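One plausible reading of these kernel settings (an assumption for illustration; the patent's drawings define the exact shapes) tabulates the dependency branches as (pooling kernel, convolution kernel) pairs:

```python
T, H, W = 8, 16, 16       # toy tensor extents
a, b, c, d = 2, 2, 4, 4   # hypothetical short-range sizes: 1 < a, c < min(H, W); 1 < b, d < T
c_t, c_h, c_w = 3, 3, 3   # illustrative convolution extents

# hypothetical (pooling kernel, convolution kernel) per dependency branch
branches = {
    'long-range spatio-temporal': ((T, H, W), (1, 1, 1)),
    'long-range temporal':        ((1, H, W), (c_t, 1, 1)),
    'long-range spatial':         ((T, 1, 1), (1, c_h, c_w)),
    'short-range local':          ((b, a, a), (d, c, c)),
}
for name, (pool_k, conv_k) in branches.items():
    print(f'{name}: pool {pool_k}, conv {conv_k}')
```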
introducing a query vector to be matched against all dependency characterizations by means of the query-structured attention mechanism, calculating the weight of each dependency characterization according to the matching response strength, and performing a weighted summation to obtain the final dependency characterization, comprises the following steps:
introducing a learnable query vector $q$, and compressing all dependency characterizations $R_1, R_2, \dots, R_M$ through a global average pooling layer into an $M \times C/r_c$ matrix $K$, which serves as the keys in the attention mechanism:

$K = [\mathrm{GAP}(R_1); \mathrm{GAP}(R_2); \dots; \mathrm{GAP}(R_M)]$

stacking $R_1, R_2, \dots, R_M$ as the values:

$V = [R_1; R_2; \dots; R_M]$
wherein $M$ is the number of dependency characterizations, $C$ denotes the number of channels of the video feature tensor, and $r_c$ denotes the compression coefficient;
calculating the inner product of the query vector with each key vector by the following formula to obtain the matching response strength of each dependency characterization, which serves as the weight for the subsequent weighted summation:
Attention(q,K)=softmax(q×K T )
the final dependency characterization is obtained by weighted summation:
R sec =Attention(q,K)×V
where softmax () represents the softmax function, T is the matrix transpose sign, x represents the matrix multiplication.
2. The method of claim 1, wherein thresholding the original video data feature tensor with the final dependency characterization to obtain the optimized video data feature tensor comprises:
a $1 \times 1 \times 1$ three-dimensional convolution kernel is used to restore the number of channels of the final dependency characterization $R_{sec}$ to that of the original video feature tensor $Y$; a Sigmoid activation function then maps the result into the interval (0.0, 1.0), which is finally multiplied element-wise with the original video feature tensor $Y$ to obtain the optimized video data feature tensor $Z$, expressed as follows:

$Z = \mathrm{Sigmoid}(\mathrm{Conv}_{3d}(R_{sec};\, 1 \times 1 \times 1)) \odot Y$

wherein $\odot$ denotes element-wise multiplication and $\mathrm{Conv}_{3d}(\cdot)$ denotes a convolution operation.
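The gating step of claim 2 restores the channel count with a 1×1×1 convolution, squashes the result into (0.0, 1.0) with a Sigmoid, and multiplies it element-wise into the original tensor. A NumPy sketch with hypothetical names and toy shapes (the dependency characterization is assumed already resized to Y's spatio-temporal extent):

```python
import numpy as np

rng = np.random.default_rng(2)
C, Cr, T, H, W = 8, 2, 4, 4, 4
Y = rng.standard_normal((C, T, H, W))       # original video feature tensor
R_sec = rng.standard_normal((Cr, T, H, W))  # final dependency characterization (assumed resized)

W1 = rng.standard_normal((C, Cr))           # hypothetical 1x1x1 conv weights: Cr -> C channels
restored = np.einsum('oc,cthw->othw', W1, R_sec)

gate = 1.0 / (1.0 + np.exp(-restored))      # Sigmoid maps into the interval (0, 1)
Z = gate * Y                                # element-wise (Hadamard) product with Y

print(Z.shape)  # (C, T, H, W) = (8, 4, 4, 4)
```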
3. A motion recognition system in video data for implementing the method of any one of claims 1-2, the system comprising:
an original video feature tensor obtaining unit, configured to obtain an original video feature tensor of the motion recognition model extracted from the video data;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor in different directions and at different scales by means of multi-content dependency modeling of the video data to obtain a plurality of groups of compressed dependent feature tensors, and then performing dependency activation with a convolution layer to obtain corresponding dependency characterizations;
the dependency aggregation unit is used for introducing a query vector to be matched against all dependency characterizations by means of the query-structured attention mechanism, calculating the weight of each dependency characterization according to the matching response strength, performing a weighted summation to obtain a final dependency characterization, and performing a threshold operation on the original video data feature tensor with the final dependency characterization to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
4. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-2.
5. A readable storage medium storing a computer program, characterized in that the method according to any one of claims 1-2 is implemented when the computer program is executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111363930.XA CN113989940B (en) | 2021-11-17 | 2021-11-17 | Method, system, device and storage medium for identifying actions in video data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113989940A CN113989940A (en) | 2022-01-28 |
CN113989940B true CN113989940B (en) | 2024-03-29 |
Family
ID=79749106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111363930.XA Active CN113989940B (en) | 2021-11-17 | 2021-11-17 | Method, system, device and storage medium for identifying actions in video data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113989940B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114926770B (en) * | 2022-05-31 | 2024-06-07 | 上海人工智能创新中心 | Video motion recognition method, apparatus, device and computer readable storage medium |
CN115861901B (en) * | 2022-12-30 | 2023-06-30 | 深圳大学 | Video classification method, device, equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325145A (en) * | 2020-02-19 | 2020-06-23 | 中山大学 | Behavior identification method based on combination of time domain channel correlation blocks |
WO2020233010A1 (en) * | 2019-05-23 | 2020-11-26 | 平安科技(深圳)有限公司 | Image recognition method and apparatus based on segmentable convolutional network, and computer device |
CN112131943A (en) * | 2020-08-20 | 2020-12-25 | 深圳大学 | Video behavior identification method and system based on dual attention model |
CN112926396A (en) * | 2021-01-28 | 2021-06-08 | 杭州电子科技大学 | Action identification method based on double-current convolution attention |
CN113297964A (en) * | 2021-05-25 | 2021-08-24 | 周口师范学院 | Video target recognition model and method based on deep migration learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11315570B2 (en) * | 2018-05-02 | 2022-04-26 | Facebook Technologies, Llc | Machine learning-based speech-to-text transcription cloud intermediary |
2021-11-17: application CN202111363930.XA filed in China (CN); granted as patent CN113989940B, status Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020233010A1 (en) * | 2019-05-23 | 2020-11-26 | 平安科技(深圳)有限公司 | Image recognition method and apparatus based on segmentable convolutional network, and computer device |
CN111325145A (en) * | 2020-02-19 | 2020-06-23 | 中山大学 | Behavior identification method based on combination of time domain channel correlation blocks |
CN112131943A (en) * | 2020-08-20 | 2020-12-25 | 深圳大学 | Video behavior identification method and system based on dual attention model |
CN112926396A (en) * | 2021-01-28 | 2021-06-08 | 杭州电子科技大学 | Action identification method based on double-current convolution attention |
CN113297964A (en) * | 2021-05-25 | 2021-08-24 | 周口师范学院 | Video target recognition model and method based on deep migration learning |
Non-Patent Citations (2)
Title |
---|
Wang Huitao; Hu Yan. An efficient video classification method based on global spatio-temporal receptive fields. Journal of Chinese Computer Systems, 2020, (08), full text. *
Xie Huaiqi; Le Hongbing. Video human action recognition based on a channel attention mechanism. Electronic Technology & Software Engineering, 2020, (04), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN113989940A (en) | 2022-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220392234A1 (en) | Training neural networks for vehicle re-identification | |
CN110929622B (en) | Video classification method, model training method, device, equipment and storage medium | |
Khan et al. | An effective framework for driver fatigue recognition based on intelligent facial expressions analysis | |
CN113989940B (en) | Method, system, device and storage medium for identifying actions in video data | |
WO2020228525A1 (en) | Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device | |
Zheng et al. | Discriminative dictionary learning via Fisher discrimination K-SVD algorithm | |
CN111242208A (en) | Point cloud classification method, point cloud segmentation method and related equipment | |
JP2017062781A (en) | Similarity-based detection of prominent objects using deep cnn pooling layers as features | |
Hong et al. | D3: recognizing dynamic scenes with deep dual descriptor based on key frames and key segments | |
CN112529068B (en) | Multi-view image classification method, system, computer equipment and storage medium | |
US11804043B2 (en) | Detecting objects in a video using attention models | |
CN115294563A (en) | 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability | |
CN114419732A (en) | HRNet human body posture identification method based on attention mechanism optimization | |
CN115131218A (en) | Image processing method, image processing device, computer readable medium and electronic equipment | |
CN114821251B (en) | Method and device for determining point cloud up-sampling network | |
Meena et al. | Effective curvelet-based facial expression recognition using graph signal processing | |
Zhou et al. | Adaptive weighted locality-constrained sparse coding for glaucoma diagnosis | |
Lamba et al. | A texture based mani-fold approach for crowd density estimation using Gaussian Markov Random Field | |
CN114358246A (en) | Graph convolution neural network module of attention mechanism of three-dimensional point cloud scene | |
CN111914809B (en) | Target object positioning method, image processing method, device and computer equipment | |
Zeng et al. | Domestic activities classification from audio recordings using multi-scale dilated depthwise separable convolutional network | |
CN115860802A (en) | Product value prediction method, device, computer equipment and storage medium | |
Danelakis et al. | A robust spatio-temporal scheme for dynamic 3D facial expression retrieval | |
CN114782684B (en) | Point cloud semantic segmentation method and device, electronic equipment and storage medium | |
Li et al. | Geometry-invariant texture retrieval using a dual-output pulse-coupled neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |