CN113989940B - Method, system, device and storage medium for identifying actions in video data

Method, system, device and storage medium for identifying actions in video data

Info

Publication number
CN113989940B
CN113989940B (application number CN202111363930.XA; published as CN113989940A)
Authority
CN
China
Prior art keywords
video data
dependency
tensor
dependent
pooling
Prior art date
Legal status
Active
Application number
CN202111363930.XA
Other languages
Chinese (zh)
Other versions
CN113989940A (en)
Inventor
郝艳宾
谭懿
何向南
杨勋
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202111363930.XA
Publication of CN113989940A
Application granted
Publication of CN113989940B
Active legal status
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • G06F17/153Multidimensional correlation or convolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for identifying actions in video data. The method models multiple content dependencies of the video data by pooling the original video feature tensor along different directions and at different scales, then applies a convolution layer for dependency activation to obtain the corresponding dependency characterizations; a query-structured attention mechanism aggregates these characterizations, the aggregate is used to optimize the original video feature tensor, and the optimized tensor is used for action recognition. The scheme can be directly inserted into convolution-based action recognition models, introduces almost no extra parameters or computation, and, as experiments show, can significantly improve their classification performance.

Description

Method, system, device and storage medium for identifying actions in video data
Technical Field
The present invention relates to the field of computer vision, and in particular, to a method, system, device, and storage medium for identifying actions in video data.
Background
In the multimedia age, terminal devices such as mobile phones, cameras and surveillance cameras continuously generate massive amounts of video data, and action classification is an effective way to analyze them. However, video data adds a temporal dimension relative to image data, which introduces more content dependencies and greatly increases the difficulty of video action recognition.
For the content-dependency modeling problem in action recognition, the following methods currently exist:
1) Implicit dependency modeling. These methods directly inflate an existing image classification network, such as a ResNet, from two-dimensional to three-dimensional convolution kernels, and rely solely on the three-dimensional kernels to be optimized to implicitly learn features from the video data. Such methods depend entirely on the number of stacked layers to model long-range dependencies, so only the last few layers of the network can perceive them. Meanwhile, the crude dimension expansion severely increases computation and model size, making such methods difficult to train.
2) Temporal dependency modeling. These methods focus on the temporal dimension that video data adds relative to image data, and explicitly use temporal information to capture the dynamic characteristics of the video. Compared with implicit dependency modeling, the dedicated design for the temporal dimension avoids heavy three-dimensional convolution kernels, reducing model complexity and improving performance. However, such methods ignore the other content dependencies that are widespread in video data, so their performance is limited.
3) Global spatio-temporal point attention. These methods add a global attention mechanism to the action classification model and capture long-range content dependencies through the pairwise matching relations between spatio-temporal points in the video data. However, computing these relations point pair by point pair makes the model bloated and slow.
In general, none of the above approaches reconciles modeling diverse content dependencies with maintaining model efficiency, so both the performance and the computational overhead of action recognition models remain to be optimized.
Disclosure of Invention
The object of the invention is to provide a method, a system, a device and a storage medium for identifying actions in video data which, for action recognition tasks, model and aggregate multiple content dependencies in the video data while adding almost no parameters or computation to the action recognition model, thereby improving its classification performance.
The object of the invention is achieved through the following technical scheme:
a method of motion recognition in video data, comprising:
acquiring an original video characteristic tensor extracted from video data by an action recognition model;
pooling the original video feature tensor along different directions and at different scales, in a manner that models the multiple content dependencies of the video data, to obtain several groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain the corresponding dependency characterizations;
introducing, using a query-structured attention mechanism, a query vector that is matched against all the dependency characterizations, calculating a weight for each dependency characterization according to its matching response strength, and performing a weighted summation to obtain the final dependency characterization, then performing a threshold operation on the original video feature tensor with the final dependency characterization to obtain an optimized video feature tensor;
and inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
A motion recognition system in video data for implementing the foregoing method, the system comprising:
an original video feature tensor obtaining unit, configured to obtain the original video feature tensor extracted by the action recognition model from the video data;
a video data multi-content dependency modeling unit, configured to pool the original video feature tensor along different directions and at different scales, in a manner that models the multiple content dependencies of the video data, to obtain several groups of compressed dependency feature tensors, and then perform dependency activation with a convolution layer to obtain the corresponding dependency characterizations;
a dependency aggregation unit, configured to introduce, using a query-structured attention mechanism, a query vector that is matched against all the dependency characterizations, calculate a weight for each dependency characterization according to its matching response strength, perform a weighted summation to obtain the final dependency characterization, and perform a threshold operation on the original video feature tensor with the final dependency characterization to obtain an optimized video feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program, characterized in that the aforementioned method is implemented when the computer program is executed by a processor.
As can be seen from the technical scheme provided above, the invention is a plug-and-play method: it can be directly inserted into convolution-based action recognition models, introduces almost no additional parameters or computation, and, as experiments show, can significantly improve their classification performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for identifying actions in video data according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a model structure of a method for identifying actions in video data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of dependency aggregation based on the query-structured attention mechanism according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a motion recognition system in video data according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The method for identifying actions in video data is described in detail below. Details not described in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples of the present invention, the conventional conditions in the art or the conditions suggested by the manufacturer were followed; reagents or apparatus whose manufacturer is not noted are conventional products available commercially.
As shown in fig. 1, a method for identifying actions in video data mainly includes the following steps:
and step 1, acquiring an original video characteristic tensor extracted from video data by the action recognition model.
And step 2, adopting a mode of modeling the multi-content dependence of the video data to pool the original video characteristic tensors from different directions and with different scales to obtain a plurality of groups of compressed dependent characteristic tensors, and then utilizing a convolution layer to perform dependent activation to obtain corresponding dependent characterization.
And 3, introducing a query vector to be matched with all the dependency characterizations by using an attention mechanism of the query structure, calculating weights of various dependency characterizations according to the matching response intensity, weighting and summing to obtain a final dependency characterizations, and carrying out threshold operation on the original video data characteristic tensor by using the final dependency characterizations to obtain the optimized video data characteristic tensor.
And step 4, inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
The scheme of the embodiment of the invention can be applied to systems such as surveillance video processing and video content analysis. Taking the widely used temporal segment network (TSN) and temporal shift module (TSM) as baselines, specific experiments are given later, verifying the effectiveness of the invention in improving the performance of action recognition systems; of course, other action recognition models may also be used with the present invention.
For ease of understanding, the following detailed description is provided for each of the various aspects of the invention.
1. An original video feature tensor is obtained.
In the embodiment of the invention, the input of the motion recognition model is video data, and the output is an original video characteristic tensor; the motion recognition model can be any existing motion recognition model with any structure.
2. Video data multi-content dependency modeling (abbreviated MDM).
As shown in fig. 2, the model structure unit of a classical video action recognition method comprises three convolution layers (conv1, conv2, conv3, shown on the left side of fig. 2), and the method proposed by the present invention (SDA) acts between the second and third convolution layers. For the video feature tensor Y ∈ ℝ^(T×H×W×C) output by the action recognition model, the MDM mines various spatio-temporal content dependencies from it. First, to lighten the model, the MDM uses a convolution layer and a ReLU activation function to compress the video feature tensor Y from C channels to C/r_c channels; the generated feature tensor is denoted Y′ ∈ ℝ^(T×H×W×C/r_c), where ℝ is the set of real numbers, T, H and W denote in turn the length, height and width of the video feature tensor, and r_c denotes the compression coefficient.
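For concreteness, the sketch below shows this channel-compression step in PyTorch. This is an illustrative assumption: the patent names no framework, and PyTorch stores tensors channels-first, i.e. (batch, C, T, H, W), rather than in the T×H×W×C layout used above; all variable names and sizes are examples.

```python
import torch
import torch.nn as nn

C, r_c = 64, 16                              # channel count and compression coefficient (example values)
compress = nn.Sequential(
    nn.Conv3d(C, C // r_c, kernel_size=1),   # 1x1x1 convolution mixes channels only
    nn.ReLU(inplace=True),
)

Y = torch.randn(2, C, 8, 56, 56)             # (batch, C, T, H, W); PyTorch is channels-first
Y_prime = compress(Y)                        # (batch, C/r_c, T, H, W)
print(Y_prime.shape)                         # torch.Size([2, 4, 8, 56, 56])
```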
Taking the generated feature tensor Y′ as input, the MDM outputs a series of dependency characterizations, denoted {R_1, R_2, …, R_M} = MDM(Y′). Each dependency characterization is computed by a unified flow: feature compression → dependency activation.
1) Feature compression.
In the embodiment of the invention, feature compression is realized by a pooling operation. Given the spatio-temporal nature of the feature tensor Y′, the MDM pools it along different directions (e.g., the spatial and temporal directions) and at different scales (e.g., the global and local scales).
The pooling kernel is denoted W_pool = (p_t, p_h, p_w), where p_t, p_h, p_w indicate the size of the pooling receptive field. To obtain the overall data characteristics of the video data in each direction, the invention specifically selects average pooling as the pooling operation, denoted pool_avg(·). The compressed dependency feature tensor A ∈ ℝ^(T/p_t × H/p_h × W/p_w × C/r_c) is calculated from the feature tensor Y′ as follows:

A = pool_avg(Y′; W_pool)

where different p_t, p_h, p_w sizes correspond to different directions and different scales.
The compressed dependency feature tensor provides, for subsequent dependency modeling, the data characteristics within the receptive field of the pooling kernel.
2) Dependency activation.
After the dependency feature tensor A is obtained, the MDM uses a convolution operator and a ReLU operator to realize dependency activation, and then uses a reshape operator to restore the pooled dependency feature tensor to its size before pooling, to facilitate subsequent computation.
Analogously to the pooling operation, the convolution kernel is denoted W_conv = (c_t, c_h, c_w), where c_t, c_h, c_w indicate the size of the convolution kernel, and the convolution operation is denoted Conv3d(·). The dependency feature tensor A is convolved and activated to obtain the corresponding dependency characterization R:

R = ReLU(Conv3d(A; W_conv))
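As a minimal sketch of this unified compression → activation flow, the snippet below implements a single dependency branch in PyTorch; the helper name dependency_branch, the kernel sizes, and the use of nearest-neighbor interpolation for the size restoration are assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dependency_branch(y, pool_kernel, conv):
    """Feature compression -> dependency activation, then restore the pre-pooling size."""
    a = F.avg_pool3d(y, kernel_size=pool_kernel)                  # compressed dependency tensor A
    r = F.relu(conv(a))                                           # dependency characterization R
    return F.interpolate(r, size=y.shape[2:], mode='nearest')    # element replication back to (T, H, W)

c = 4                                                             # = C / r_c channels
y = torch.randn(2, c, 8, 56, 56)                                  # compressed tensor Y'
conv_t = nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0))
R = dependency_branch(y, pool_kernel=(1, 56, 56), conv=conv_t)    # a temporal-style branch
print(R.shape)                                                    # torch.Size([2, 4, 8, 56, 56])
```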
Based on the above principle, in the embodiment of the present invention, two groups of content dependencies are set in the multi-content dependency modeling of video data, as shown in parts (a) and (b) on the right side of fig. 2:
the first group is long-range content dependencies that reflect the relationship between video content from three perspectives of time, space, and space-time: when the pooling core isWhen the time-space dependence (LST) is reflected, the convolution of the corresponding convolution layerThe core is->When the pooling core is +.>In the case of response to long-distance time dependence (abbreviated as LT), the convolution kernel of the corresponding convolution layer is +.>When the pooling core is +.>When the space dependence of long distance is reflected (LS for short), the convolution kernel of the corresponding convolution layer is +.>
Based on the three pooling kernels, the three dependency feature tensors A_LST, A_LT and A_LS can be obtained.
Likewise, based on the three convolution kernels described above, three convolution operations mix the information between channels and yield the corresponding dependency characterizations:

R_1 = ReLU(Conv3d(A_LST)), R_2 = ReLU(Conv3d(A_LT)), R_3 = ReLU(Conv3d(A_LS))
the second group is short distance contentDepending on which information is compressed in the local spatiotemporal receptive field, a local pooling kernel is used to compress the dynamic information presented within the local receptive field. Corresponding pooling core isThe convolution kernel of the corresponding convolution layer is +.>The corresponding dependent characteristic tensor A can be calculated according to the above method S And dependent characterization R 4
In the embodiment of the invention, a is more than 1 and c is more than min (H, W), b is more than 1 and d is more than T. By way of example, it is possible to provide that: a=3, b=3, c=2, d=2.
After the various dependency characterizations are obtained, they are scaled back to T×H×W×C/r_c using element replication.
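A hedged sketch assembling the two dependency groups is given below. The pooling and convolution kernel shapes follow the directions and scales described above (global, spatial-only, temporal-only, local), but the exact sizes are assumptions chosen from the stated example (a = 3, b = 3, c = 2, d = 2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def branch(y, pool_k, conv):
    a = F.avg_pool3d(y, kernel_size=pool_k)                      # feature compression
    r = F.relu(conv(a))                                          # dependency activation
    return F.interpolate(r, size=y.shape[2:], mode='nearest')   # scale back by element replication

c, T, H, W = 4, 8, 56, 56
y = torch.randn(2, c, T, H, W)                                   # compressed tensor Y'
reps = [
    branch(y, (T, H, W), nn.Conv3d(c, c, 1)),                               # LST: global pooling
    branch(y, (1, H, W), nn.Conv3d(c, c, (3, 1, 1), padding=(1, 0, 0))),    # LT: keep the time axis
    branch(y, (T, 1, 1), nn.Conv3d(c, c, (1, 3, 3), padding=(0, 1, 1))),    # LS: keep the spatial axes
    branch(y, (2, 2, 2), nn.Conv3d(c, c, 3, padding=1)),                    # short-range: local pooling
]
print([r.shape for r in reps])   # four tensors of shape (2, c, T, H, W)
```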
3. Selective dependency aggregation (abbreviated SEC).
The most intuitive dependency aggregation method is to average the obtained dependency characterizations; however, different videos have different dependency preferences, and simple average aggregation attends to irrelevant dependencies while ignoring important ones.
As shown in FIG. 3, the present invention uses a query-structured attention mechanism (QSA for short) to aggregate the dependency characterizations, automatically emphasizing important dependencies by assigning weights to the different dependency characterizations. Specifically:
A learnable query vector q is introduced, and all dependency characterizations {R_1, …, R_M} are compressed by a global average pooling layer into a matrix K of dimensions M×C/r_c, which acts as the keys in the attention mechanism:

K = [GAP(R_1); GAP(R_2); …; GAP(R_M)]

The dependency characterizations {R_1, …, R_M} are stacked as the values:

V = [R_1; R_2; …; R_M]

where M is the number of dependency characterizations (e.g., M = 4 in the preceding example), GAP(·) denotes global average pooling, C denotes the number of channels of the original video feature tensor, and r_c denotes the compression coefficient.
The matching response strength of each dependency characterization, used as its weight in the subsequent weighted summation, is obtained by computing the vector inner product between the query vector and the dependency characterizations (written as a matrix multiplication for convenience of representation):

Attention(q, K) = softmax(q × K^T)

The final dependency characterization is then obtained by weighted summation of the dependency characterizations:

R_sec = Attention(q, K) × V

where softmax(·) denotes the softmax function, ^T is the transpose symbol, and × denotes matrix multiplication.
The above operations are carried out by the Dependency Aggregation Block in fig. 2; based on the above design, the SEC adds almost no additional parameters or computation.
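The following PyTorch sketch, under the same assumed channels-first layout as the earlier snippets, illustrates the QSA aggregation: a learnable query is matched against globally pooled keys, and the softmax weights combine the stacked dependency characterizations.

```python
import torch
import torch.nn as nn

M, c = 4, 4                                     # number of dependency characterizations, C/r_c channels
reps = [torch.randn(2, c, 8, 56, 56) for _ in range(M)]

q = nn.Parameter(torch.randn(c))                # learnable query vector
K = torch.stack([r.mean(dim=(2, 3, 4)) for r in reps], dim=1)   # keys via GAP -> (batch, M, c)
V = torch.stack(reps, dim=1)                    # values -> (batch, M, c, T, H, W)

w = torch.softmax(K @ q, dim=1)                 # matching response strengths, (batch, M)
R_sec = (w[:, :, None, None, None, None] * V).sum(dim=1)        # weighted sum -> (batch, c, T, H, W)
print(R_sec.shape)                              # torch.Size([2, 4, 8, 56, 56])
```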
In the embodiment of the present invention, a 1×1×1 three-dimensional convolution kernel is used to restore the number of channels of the final dependency characterization R_sec to that of the original video feature tensor Y; a Sigmoid activation function then maps the result to the interval (0.0, 1.0), and it is finally multiplied element-wise with the original video feature tensor Y to obtain the optimized video feature tensor Z, expressed as follows:

Z = Sigmoid(Conv3d(R_sec; 1×1×1)) ⊙ Y

where ⊙ denotes element-wise multiplication and Conv3d(·) denotes the convolution operation.
The optimized video feature tensor Z is taken as the output of the SEC.
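A minimal sketch of this gating step, assuming the channels-first layout of the earlier snippets:

```python
import torch
import torch.nn as nn

C, r_c = 64, 16
Y = torch.randn(2, C, 8, 56, 56)                 # original video feature tensor
R_sec = torch.randn(2, C // r_c, 8, 56, 56)      # aggregated dependency characterization

restore = nn.Conv3d(C // r_c, C, kernel_size=1)  # 1x1x1 conv restores the channel count
Z = torch.sigmoid(restore(R_sec)) * Y            # Sigmoid gate in (0, 1), applied element-wise
print(Z.shape)                                   # same shape as Y
```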
4. Action recognition.
The optimized video feature tensor Z is input into the action recognition model to obtain the action recognition result; the recognition principle involved can refer to conventional techniques and is not repeated herein.
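Pulling the pieces together, the self-contained sketch below wraps MDM and SEC into one shape-preserving module; the class name SDABlock and all kernel sizes are illustrative assumptions, not the patent's reference implementation. Because the output shape equals the input shape, such a block can sit between conv2 and conv3 of a baseline model as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDABlock(nn.Module):
    """Sketch of MDM (multi-dependency modeling) + SEC (selective dependency aggregation)."""

    def __init__(self, C, r_c=16):
        super().__init__()
        c = C // r_c
        self.compress = nn.Sequential(nn.Conv3d(C, c, 1), nn.ReLU(inplace=True))
        # one activation convolution per dependency branch: LST, LT, LS, short-range
        self.convs = nn.ModuleList([
            nn.Conv3d(c, c, 1),
            nn.Conv3d(c, c, (3, 1, 1), padding=(1, 0, 0)),
            nn.Conv3d(c, c, (1, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(c, c, 3, padding=1),
        ])
        self.q = nn.Parameter(torch.randn(c))   # learnable query vector
        self.restore = nn.Conv3d(c, C, 1)       # 1x1x1 conv restores the channel count

    def forward(self, Y):
        T, H, W = Y.shape[2:]
        y = self.compress(Y)
        pools = [(T, H, W), (1, H, W), (T, 1, 1), (2, 2, 2)]     # example kernels
        reps = []
        for k, conv in zip(pools, self.convs):
            a = F.avg_pool3d(y, kernel_size=k, ceil_mode=True)   # feature compression
            r = F.relu(conv(a))                                  # dependency activation
            reps.append(F.interpolate(r, size=(T, H, W), mode='nearest'))
        K = torch.stack([r.mean(dim=(2, 3, 4)) for r in reps], dim=1)   # keys
        V = torch.stack(reps, dim=1)                                    # values
        w = torch.softmax(K @ self.q, dim=1)                            # dependency weights
        R_sec = (w[..., None, None, None, None] * V).sum(dim=1)         # selective aggregation
        return torch.sigmoid(self.restore(R_sec)) * Y                   # gated, shape-preserving output

x = torch.randn(2, 64, 8, 56, 56)
assert SDABlock(64)(x).shape == x.shape
```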
To illustrate the effectiveness of the present invention, the following experiments were performed.
Experiments were performed on four real datasets, Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with action classification accuracy (Acc) as the evaluation index. The widely used temporal segment network (TSN) and temporal shift module (TSM) were taken as baselines. The experiments were divided into three parts:
1) The effects of the various dependency modelings proposed by the present invention were verified on Something-Something V1 with TSN as the baseline; the results are shown in Table 1.
TABLE 1 Improvement of the action recognition model by content-dependency modeling
Here #P denotes the total number of model parameters and FLOPs denotes the number of floating-point operations, which measures the computation required to classify the action in a video. As can be seen from Table 1, the dependency modeling proposed by the invention adds only a small number of parameters and a small amount of computation. Second, each of the three long-range dependency modelings proposed by the invention effectively improves the action recognition performance of the baseline TSN, and using the three together gives better results than using any one alone. Table 1 also compares various short-range dependency settings, where Sxyz denotes the use of W_pool = (x, y, z); it can be seen that S122 achieves the best performance, and stacking three short-range dependencies does not make action classification much stronger. Finally, the invention achieves its largest gain over TSN when the three long-range dependencies and S122 are used simultaneously.
2) The two dependency aggregation schemes, selective aggregation and average summation, were compared on Something-Something V1; the results are shown in Table 2, where AVG denotes average aggregation and SEC denotes selective aggregation.
TABLE 2 Comparison of selective aggregation with average aggregation
As can be seen from Table 2, under different settings, selective aggregation (SEC) significantly improves the accuracy of action recognition over average aggregation (AVG), while the added parameters and computation are almost zero.
3) On Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with TSN and TSM as baselines, the action recognition accuracy of the present invention was compared with other state-of-the-art action recognition models; the results are shown in Table 3.
TABLE 3 Performance comparison of the selective-dependency-aggregation-based models with other state-of-the-art models
In Table 3, SDA-TSN and SDA-TSM denote combining the scheme of the present invention with the two baseline models TSN and TSM; C3D, GST and TAM are current state-of-the-art action recognition models. As can be seen from Table 3, SDA-TSN and SDA-TSM far exceed the original baseline models on all datasets, as well as the current state-of-the-art models.
Another embodiment of the present invention further provides a motion recognition system in video data, which is mainly used for implementing the method provided in the foregoing embodiment, as shown in fig. 4, where the system mainly includes:
an original video feature tensor obtaining unit, configured to obtain the original video feature tensor extracted by the action recognition model from the video data;
a video data multi-content dependency modeling unit, configured to pool the original video feature tensor along different directions and at different scales, in a manner that models the multiple content dependencies of the video data, to obtain several groups of compressed dependency feature tensors, and then perform dependency activation with a convolution layer to obtain the corresponding dependency characterizations;
a dependency aggregation unit, configured to introduce, using a query-structured attention mechanism, a query vector that is matched against all the dependency characterizations, calculate a weight for each dependency characterization according to its matching response strength, perform a weighted summation to obtain the final dependency characterization, and perform a threshold operation on the original video feature tensor with the final dependency characterization to obtain an optimized video feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional modules is illustrated; in practical applications, the above functions may be allocated to different functional modules as needed, i.e., the internal structure of the system may be divided into different functional modules to perform all or part of the functions described above.
In addition, the main technical details of the above system have been described in detail in the preceding method embodiments and are therefore not repeated here.
Another embodiment of the present invention also provides a processing apparatus, as shown in fig. 5, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Another embodiment of the present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiment.
The readable storage medium according to the embodiment of the present invention may be provided as a computer-readable storage medium in the aforementioned processing apparatus, for example, as the memory in the processing apparatus. The readable storage medium may be any medium capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (5)

1. A method for identifying actions in video data, comprising:
acquiring an original video characteristic tensor extracted from video data by an action recognition model;
pooling the original video feature tensor along different directions and at different scales, in a manner that models the multiple content dependencies of the video data, to obtain several groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain the corresponding dependency characterizations;
introducing, using a query-structured attention mechanism, a query vector that is matched against all the dependency characterizations, calculating a weight for each dependency characterization according to its matching response strength, and performing a weighted summation to obtain the final dependency characterization, then performing a threshold operation on the original video feature tensor with the final dependency characterization to obtain an optimized video feature tensor;
inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result;
prior to pooling the original video feature tensor along different directions and at different scales, the method comprises: using a convolution layer and a ReLU activation function to compress the original video feature tensor Y ∈ ℝ^(T×H×W×C) from C channels to C/r_c channels, the generated video feature tensor being denoted Y′ ∈ ℝ^(T×H×W×C/r_c), where ℝ is the set of real numbers, T, H and W denote in turn the length, height and width of the video feature tensor, and r_c denotes the compression coefficient;
pooling along different directions and at different scales comprises:
denoting the pooling kernel as W_pool = (p_t, p_h, p_w) and the average pooling operation as pool_avg(·), and calculating the compressed dependency feature tensor A ∈ ℝ^(T/p_t × H/p_h × W/p_w × C/r_c) from the video feature tensor Y′ as:

A = pool_avg(Y′; W_pool)

where p_t, p_h, p_w indicate the size of the pooling receptive field, and different p_t, p_h, p_w sizes correspond to different directions and different scales;
performing dependency activation with a convolution layer to obtain the corresponding dependency characterization comprises:
denoting the convolution kernel as W_conv = (c_t, c_h, c_w) and the convolution operation as Conv3d(·), and performing a convolution operation on the dependency feature tensor A to obtain the corresponding dependency characterization R:

R = ReLU(Conv3d(A; W_conv))

where c_t, c_h, c_w denote the size of the convolution kernel;
two groups of content dependencies are set in the video data multi-content dependency modeling:
the first group is long-range content dependencies, which reflect the relationships between video contents from the three perspectives of time, space and space-time: when the pooling kernel is W_pool = (T, H, W), it reflects long-range spatio-temporal dependency, and the convolution kernel of the corresponding convolution layer is (1, 1, 1); when the pooling kernel is W_pool = (1, H, W), it reflects long-range temporal dependency, and the convolution kernel of the corresponding convolution layer is (c_t, 1, 1); when the pooling kernel is W_pool = (T, 1, 1), it reflects long-range spatial dependency, and the convolution kernel of the corresponding convolution layer is (1, c_h, c_w);
the second group is short-range content dependency, which focuses on the information compressed within a local spatio-temporal receptive field; the corresponding pooling kernel is W_pool = (d, c, c), and the convolution kernel of the corresponding convolution layer is (b, a, a);
wherein 1 < a, c < min(H, W) and 1 < b, d < T;
introducing a query vector to be matched against all the dependency characterizations using a query-structured attention mechanism, calculating the weight of each dependency characterization according to its matching response strength, and performing a weighted summation to obtain the final dependency characterization comprises:
introducing a learnable query vector q, and compressing all dependency characterizations {R_1, …, R_M} by a global average pooling layer into a matrix K of dimensions M×C/r_c, which acts as the keys in the attention mechanism:

K = [GAP(R_1); GAP(R_2); …; GAP(R_M)]

stacking the dependency characterizations {R_1, …, R_M} as the values:

V = [R_1; R_2; …; R_M]

where M is the number of dependency characterizations, GAP(·) denotes global average pooling, C denotes the number of channels of the video feature tensor, and r_c denotes the compression coefficient;
calculating the vector inner product between the query vector and the dependency characterizations to obtain the matching response strength of each dependency characterization as the weight for the subsequent weighted summation:

Attention(q, K) = softmax(q × K^T)

and obtaining the final dependency characterization by weighted summation:

R_sec = Attention(q, K) × V

where softmax(·) denotes the softmax function, ^T is the matrix transpose symbol, and × denotes matrix multiplication.
2. The method of claim 1, wherein thresholding the original video data feature tensor with the final dependency characterization to obtain the optimized video data feature tensor comprises:
the final dependency representation R is characterized using a 1X 1 three-dimensional convolution kernel sec The number of channels of (2) is restored to the number of the original video feature tensors Y, the section mapped to (0.0, 1.0) is multiplied to the original video feature tensors Y element by element finally by using a Sigmoid activation function, and an optimized video data feature tensor Z is obtained and expressed as follows:
Z=Sigmoid(Conv3d(R sec ;1×1×1))⊙Y
wherein, as indicated by element-wise multiplication, conv 3d () Representing a convolution operation.
3. A motion recognition system in video data for implementing the method of any one of claims 1-2, the system comprising:
an original video feature tensor obtaining unit, configured to obtain the original video feature tensor extracted by the action recognition model from the video data;
a video data multi-content dependency modeling unit, configured to pool the original video feature tensor along different directions and at different scales, in a manner that models the multiple content dependencies of the video data, to obtain several groups of compressed dependency feature tensors, and then perform dependency activation with a convolution layer to obtain the corresponding dependency characterizations;
a dependency aggregation unit, configured to introduce, using a query-structured attention mechanism, a query vector that is matched against all the dependency characterizations, calculate a weight for each dependency characterization according to its matching response strength, perform a weighted summation to obtain the final dependency characterization, and perform a threshold operation on the original video feature tensor with the final dependency characterization to obtain an optimized video feature tensor;
and the action recognition unit is used for inputting the optimized video data characteristic tensor into the action recognition model to obtain an action recognition result.
4. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-2.
5. A readable storage medium storing a computer program, characterized in that the method according to any one of claims 1-2 is implemented when the computer program is executed by a processor.
CN202111363930.XA 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data Active CN113989940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111363930.XA CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111363930.XA CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Publications (2)

Publication Number Publication Date
CN113989940A CN113989940A (en) 2022-01-28
CN113989940B true CN113989940B (en) 2024-03-29

Family

ID=79749106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111363930.XA Active CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Country Status (1)

Country Link
CN (1) CN113989940B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926770B (en) * 2022-05-31 2024-06-07 上海人工智能创新中心 Video motion recognition method, apparatus, device and computer readable storage medium
CN115861901B (en) * 2022-12-30 2023-06-30 深圳大学 Video classification method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325145A (en) * 2020-02-19 2020-06-23 中山大学 Behavior identification method based on combination of time domain channel correlation blocks
WO2020233010A1 (en) * 2019-05-23 2020-11-26 平安科技(深圳)有限公司 Image recognition method and apparatus based on segmentable convolutional network, and computer device
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN113297964A (en) * 2021-05-25 2021-08-24 周口师范学院 Video target recognition model and method based on deep migration learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11315570B2 (en) * 2018-05-02 2022-04-26 Facebook Technologies, Llc Machine learning-based speech-to-text transcription cloud intermediary

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020233010A1 (en) * 2019-05-23 2020-11-26 平安科技(深圳)有限公司 Image recognition method and apparatus based on segmentable convolutional network, and computer device
CN111325145A (en) * 2020-02-19 2020-06-23 中山大学 Behavior identification method based on combination of time domain channel correlation blocks
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN113297964A (en) * 2021-05-25 2021-08-24 周口师范学院 Video target recognition model and method based on deep migration learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王辉涛; 胡燕. Efficient video classification method based on the global spatio-temporal receptive field. Journal of Chinese Computer Systems (小型微型计算机系统), 2020, (08), full text. *
解怀奇; 乐红兵. Video human action recognition based on a channel attention mechanism. Electronic Technology & Software Engineering (电子技术与软件工程), 2020, (04), full text. *

Also Published As

Publication number Publication date
CN113989940A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
US20220392234A1 (en) Training neural networks for vehicle re-identification
CN110929622B (en) Video classification method, model training method, device, equipment and storage medium
Khan et al. An effective framework for driver fatigue recognition based on intelligent facial expressions analysis
CN113989940B (en) Method, system, device and storage medium for identifying actions in video data
WO2020228525A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
Zheng et al. Discriminative dictionary learning via Fisher discrimination K-SVD algorithm
CN111242208A (en) Point cloud classification method, point cloud segmentation method and related equipment
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
Hong et al. D3: recognizing dynamic scenes with deep dual descriptor based on key frames and key segments
CN112529068B (en) Multi-view image classification method, system, computer equipment and storage medium
US11804043B2 (en) Detecting objects in a video using attention models
CN115294563A (en) 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
CN114821251B (en) Method and device for determining point cloud up-sampling network
Meena et al. Effective curvelet-based facial expression recognition using graph signal processing
Zhou et al. Adaptive weighted locality-constrained sparse coding for glaucoma diagnosis
Lamba et al. A texture based mani-fold approach for crowd density estimation using Gaussian Markov Random Field
CN114358246A (en) Graph convolution neural network module of attention mechanism of three-dimensional point cloud scene
CN111914809B (en) Target object positioning method, image processing method, device and computer equipment
Zeng et al. Domestic activities classification from audio recordings using multi-scale dilated depthwise separable convolutional network
CN115860802A (en) Product value prediction method, device, computer equipment and storage medium
Danelakis et al. A robust spatio-temporal scheme for dynamic 3D facial expression retrieval
CN114782684B (en) Point cloud semantic segmentation method and device, electronic equipment and storage medium
Li et al. Geometry-invariant texture retrieval using a dual-output pulse-coupled neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant