CN116895038A - Video motion recognition method and device, electronic equipment and readable storage medium - Google Patents

Video motion recognition method and device, electronic equipment and readable storage medium Download PDF

Info

Publication number
CN116895038A
CN116895038A CN202311162287.3A
Authority
CN
China
Prior art keywords
feature
features
video
frames
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311162287.3A
Other languages
Chinese (zh)
Other versions
CN116895038B (en)
Inventor
姚成辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202311162287.3A priority Critical patent/CN116895038B/en
Publication of CN116895038A publication Critical patent/CN116895038A/en
Application granted granted Critical
Publication of CN116895038B publication Critical patent/CN116895038B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video action recognition method, a device, electronic equipment and a readable storage medium, belonging to the technical field of data processing, wherein the method comprises the following steps: extracting a plurality of first frames from a target video sequence, and extracting a second frame from the plurality of first frames; inputting the plurality of first frames into a time sequence feature extraction module (TPEM) for feature extraction to obtain time sequence features; inputting the second frame into a spatial feature extraction module (SPEM) for feature extraction to obtain spatial features; fusing the time sequence features and the spatial features to obtain fused features; and determining the video action according to the fused features. The TPEM contains a resnet network structure and a transformer network structure, and the SPEM contains a resnet network structure. A temporal-spatial dual-branch structure is adopted, so that spatial information and temporal information are extracted separately and then fused, which avoids the loss of relevant information; the resnet network structure fuses the features of the video frames at multiple scales, and the attention mechanism in the transformer network structure widens the receptive field, so that video action recognition is more accurate.

Description

Video motion recognition method and device, electronic equipment and readable storage medium
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a video action recognition method, a video action recognition device, electronic equipment and a readable storage medium.
Background
The goal of video motion recognition is to recognize the motion occurring in a video. A video can be seen as a data structure formed by arranging a group of image frames in temporal order, so motion recognition needs to analyze the content of each image in the video and to mine clues from the temporal information among the video frames.
One important dimension of information for motion recognition is the temporal one. Without the temporal sequence, only a single image frame is seen and "action ambiguity" easily arises. For example, for a person bending over, a single frame cannot tell whether the person is sitting down or standing up, so the judgement has to be made from the actions in the preceding frames.
Existing video motion recognition mainly uses the following methods, each of which has obvious drawbacks:
1. Each frame is processed using a 2D convolutional neural network (Convolutional Neural Networks, CNN) and the results are fused, which ignores a full representation of the temporal sequence.
2. Modeling with 3D CNNs, which is computationally very expensive.
3. Compensating for the insufficient temporal expression of the action with representations such as optical flow; however, optical flow features are difficult to obtain, resource-intensive, and of limited applicability.
Disclosure of Invention
The embodiment of the application provides a video motion recognition method, a device, electronic equipment and a readable storage medium, which can solve the problem that an efficient and accurate video motion recognition method is currently lacking.
In a first aspect, a video action recognition method is provided, including:
extracting a plurality of first frames from the target video sequence, and extracting a second frame from the plurality of first frames;
inputting the plurality of first frames into a time sequence feature extraction module TPEM for feature extraction to obtain time sequence features;
inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features;
fusing the time sequence features and the space features to obtain fusion features;
determining a video action according to the fusion characteristics;
the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises the neural network model with the resnet network structure.
Optionally, the TPEM includes a first convolutional neural network CNN model having a resnet network structure and a second CNN model having a transformer network structure;
the step of inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features includes:
inputting the plurality of first frames into the first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into the second CNN model for characteristic extraction to obtain the time sequence characteristic;
wherein the category code is associated with a category of the video action and the category code is randomly initialized, the position code being associated with a temporal position of each of the first frames in the target video sequence.
Optionally, the adding position coding to the second feature data through a second coding process includes:
The position code is calculated by the following formula:

$$P_t^{(i)} = \begin{cases} \sin\!\left(t \big/ 10000^{2k/d}\right), & i = 2k \\ \cos\!\left(t \big/ 10000^{2k/d}\right), & i = 2k+1 \end{cases}$$

wherein $t$ is the actual temporal position of the feature encoding in the video sequence, $P_t$ is the position vector of the t-th feature encoding among the plurality of feature encodings, $P_t^{(i)}$ is the value of the i-th element in the position vector, $d$ is the dimension of the feature code, $i = 2k$ indicates that the i-th element is an even-numbered element, and $i = 2k+1$ indicates that the i-th element is an odd-numbered element.
Optionally, the SPEM includes a third CNN model having a resnet network structure;
inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features, wherein the method comprises the following steps:
and inputting the second frame into the third CNN model to perform feature extraction to obtain spatial features.
Optionally, the fusing the time sequence feature and the space feature to obtain a fused feature includes:
and performing channel splicing on the time sequence features and the space features to obtain the fusion features.
Optionally, the determining a video action according to the fusion feature includes:
determining the video action according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
In a second aspect, there is provided a video motion recognition apparatus, comprising:
an extraction module for extracting a plurality of first frames from the target video sequence and extracting a second frame from the plurality of first frames;
the first feature extraction module is used for inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features;
the second feature extraction module is used for inputting the second frame into the SPEM to perform feature extraction to obtain spatial features;
the fusion module is used for fusing the time sequence features and the space features to obtain fusion features;
the determining module is used for determining video actions according to the fusion characteristics;
the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises the neural network model with the resnet network structure.
Optionally, the TPEM includes a first convolutional neural network CNN model having a resnet network structure and a second CNN model having a transformer network structure;
the first feature extraction module is specifically configured to:
inputting the plurality of first frames into the first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into the second CNN model for characteristic extraction to obtain the time sequence characteristic;
wherein the category code is associated with a category of the video action and the category code is randomly initialized, the position code being associated with a temporal position of each of the first frames in the target video sequence.
Optionally, the first feature extraction module is specifically configured to:
The position code is calculated by the following formula:

$$P_t^{(i)} = \begin{cases} \sin\!\left(t \big/ 10000^{2k/d}\right), & i = 2k \\ \cos\!\left(t \big/ 10000^{2k/d}\right), & i = 2k+1 \end{cases}$$

wherein $t$ is the actual temporal position of the feature encoding in the video sequence, $P_t$ is the position vector of the t-th feature encoding among the plurality of feature encodings, $P_t^{(i)}$ is the value of the i-th element in the position vector, $d$ is the dimension of the feature code, $i = 2k$ indicates that the i-th element is an even-numbered element, and $i = 2k+1$ indicates that the i-th element is an odd-numbered element.
Optionally, the SPEM includes a third CNN model having a resnet network structure;
the second feature extraction module is specifically configured to:
and inputting the second frame into the third CNN model to perform feature extraction to obtain spatial features.
Optionally, the fusion module is specifically configured to:
and performing channel splicing on the time sequence features and the space features to obtain the fusion features.
Optionally, the determining module is specifically configured to:
determining the video action according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
In a third aspect, there is provided an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method as described in the first aspect.
In a fourth aspect, there is provided a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to the first aspect.
In a sixth aspect, there is provided a chip comprising a processor and a communication interface coupled to the processor for running a program or instructions to implement the method of the first aspect.
In a seventh aspect, there is provided a computer program/program product stored in a storage medium, the program/program product being executed by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, a plurality of first frames are extracted from a target video sequence, one second frame is extracted from the plurality of first frames, time sequence feature extraction is carried out on the plurality of first frames, spatial feature extraction is carried out on the second frame, the extracted time sequence features and spatial features are fused, and finally the video action is determined according to the fused features, wherein the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises a neural network model with a resnet network structure. The embodiment of the application adopts a temporal-spatial dual-branch structure in which spatial information and temporal information are extracted separately and then fused, which avoids the loss of relevant information; the resnet network structure fuses the features of the video frames at multiple scales and takes into account both low-level high-resolution information and high-level strong-semantic information, so that video action recognition is more efficient; and the attention mechanism in the transformer network structure widens the receptive field, which can improve the performance of video action recognition to a certain extent, so that video action recognition is more accurate.
Drawings
Fig. 1 is a flow chart of a video motion recognition method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a module architecture to which the video motion recognition method according to the embodiment of the present application is applied;
fig. 3 is a schematic structural diagram of a video motion recognition device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the application, fall within the scope of protection of the application.
The terms "first," "second," and the like, herein, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or otherwise described herein, and that the "first" and "second" distinguishing between objects generally are not limited in number to the extent that the first object may, for example, be one or more. Furthermore, "and/or" in the present application means at least one of the connected objects. For example, "a or B" encompasses three schemes, scheme one: including a and excluding B; scheme II: including B and excluding a; scheme III: both a and B. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The video motion recognition method provided by the embodiment of the application is described in detail below through some embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present application provides a video action recognition method, including:
step 101: a plurality of first frames are extracted from the target video sequence and a second frame is extracted from the plurality of first frames.
Step 102: and inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features.
Step 103: and inputting the second frame into the SPEM to perform feature extraction to obtain spatial features.
Step 104: and fusing the time sequence features and the space features to obtain fused features.
Step 105: and determining the video action according to the fusion characteristics.
The time sequence feature extraction module (Temporal Embedding, TPEM) comprises a neural network model with a residual (resnet) network structure and a neural network model with a transformer network structure, and the spatial feature extraction module (Spatial Embedding, SPEM) comprises a neural network model with a resnet network structure.
It should be noted that, in the frame extraction of step 101, a plurality of first frames are extracted for time sequence feature extraction. Considering that the contents of adjacent frames are relatively close, interval sampling is adopted to improve recognition accuracy; specifically, the first frames may be extracted at an interval of 1 frame, and the first frames may also be called key frames. Correspondingly, the second frame is extracted from the plurality of extracted first frames for spatial feature extraction, and the middle frame of the plurality of first frames may generally be used as the second frame. The embodiment of the application does not limit the specific selection of the first frames and the second frame; for example, the first frames may also be extracted at intervals of 2 or 3 frames, and the second frame may also be selected from the first half or the second half of the first frames, which can be set flexibly according to actual requirements.
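As an illustration only (not part of the claimed embodiments), the sampling described above can be sketched as follows; the function name sample_frames and the use of plain Python lists are assumptions made for the example.

```python
from typing import List, Tuple

def sample_frames(video_frames: List, interval: int = 1) -> Tuple[List, object]:
    """Interval sampling: take every (interval+1)-th frame as a first (key) frame,
    and the middle key frame as the second frame for the spatial branch."""
    first_frames = video_frames[::interval + 1]           # e.g. 32 consecutive frames -> 16 key frames
    second_frame = first_frames[len(first_frames) // 2]   # middle frame of the key frames
    return first_frames, second_frame
```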
In the embodiment of the application, a plurality of first frames are extracted from a target video sequence, one second frame is extracted from the plurality of first frames, time sequence feature extraction is carried out on the plurality of first frames, spatial feature extraction is carried out on the second frame, the extracted time sequence features and spatial features are fused, and finally the video action is determined according to the fused features, wherein the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises a neural network model with a resnet network structure. The embodiment of the application adopts a temporal-spatial dual-branch structure in which spatial information and temporal information are extracted separately and then fused, which avoids the loss of relevant information; the resnet network structure fuses the features of the video frames at multiple scales and takes into account both low-level high-resolution information and high-level strong-semantic information, so that video action recognition is more efficient; and the attention mechanism in the transformer network structure widens the receptive field, which can improve the performance of video action recognition to a certain extent, so that video action recognition is more accurate.
Optionally, the TPEM includes a first convolutional neural network CNN model having a resnet network structure and a second CNN model having a transformer network structure.
Inputting a plurality of first frames into a TPEM for feature extraction to obtain time sequence features, wherein the method comprises the following steps:
(1) And inputting the plurality of first frames into a first CNN model for feature extraction to obtain first feature data with a plurality of feature codes.
The plurality of first frames are input into a CNN model with a resnet network structure, and the residual structure of the resnet network structure can well solve network degradation and combine low-level high-resolution information and high-level strong semantic information. After feature extraction of the first CNN model, first feature data, which may also be referred to as a feature map, is obtained.
(2) And adding category codes to the first characteristic data through the first coding process to obtain second characteristic data.
The feature map output by the first CNN model is encoded by adding a class code (class token), i.e., one more data dimension is added. The class code is associated with the class of the video action and is used for classifying the video action. Since the class code is randomly initialized rather than based on image content, bias towards a specific token can be avoided, which improves the accuracy of video action recognition.
(3) And adding position coding to the second characteristic data through second coding processing to obtain third characteristic data.
Considering that the transformer network structure loses position information, position encoding is performed before the data is fed into the transformer network; the position encoding is associated with the temporal position of each first frame in the target video sequence.
(4) And inputting the third characteristic data into a second CNN model to perform characteristic extraction, so as to obtain time sequence characteristics.
The attention mechanism of the transformer network structure is used to strengthen the feature expression of actions that change in the time dimension. Specifically, a multi-head attention mechanism may be introduced, and the dimension is then enlarged and reduced back through a multilayer perceptron block (Multilayer Perceptron Block, MLP Block) so that the output dimension stays consistent with that of the spatial features extracted by the SPEM. The MLP Block may be included in the transformer network structure or may be arranged independently outside it, which is not particularly limited in the embodiment of the present application.
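A minimal sketch of such a multi-head attention plus MLP block is given below, assuming a PyTorch implementation; the embedding dimension, number of heads and expansion ratio are illustrative assumptions rather than values specified by the embodiment.

```python
import torch
import torch.nn as nn

class TemporalEncoderBlock(nn.Module):
    """Multi-head self-attention followed by an MLP block that enlarges the
    dimension and reduces it back, so input and output dimensions match."""
    def __init__(self, dim: int = 512, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),   # enlarge the dimension
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),   # reduce it back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, num_tokens, dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                   # multi-head self-attention
        x = x + attn_out
        return x + self.mlp(self.norm2(x))
```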
Optionally, adding position coding to the second feature data by a second coding process includes:
The position code is calculated by the following formula:

$$P_t^{(i)} = \begin{cases} \sin\!\left(t \big/ 10000^{2k/d}\right), & i = 2k \\ \cos\!\left(t \big/ 10000^{2k/d}\right), & i = 2k+1 \end{cases}$$

wherein $t$ is the actual temporal position of the feature encoding in the video sequence, $P_t$ is the position vector of the t-th feature encoding among the plurality of feature encodings, $P_t^{(i)}$ is the value of the i-th element in the position vector, $d$ is the dimension of the feature code, $i = 2k$ indicates that the i-th element is an even-numbered element, and $i = 2k+1$ indicates that the i-th element is an odd-numbered element.
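This is the standard sinusoidal position encoding; a small sketch of how it could be computed is shown below (assuming PyTorch tensors; the function name is an assumption for illustration).

```python
import math
import torch

def temporal_position_encoding(num_positions: int, dim: int) -> torch.Tensor:
    """Even-indexed elements use sin, odd-indexed elements use cos, as in the formula above."""
    pe = torch.zeros(num_positions, dim)
    for t in range(num_positions):
        for i in range(dim):
            k2 = i - (i % 2)                       # equals 2k for both i = 2k and i = 2k + 1
            angle = t / (10000 ** (k2 / dim))
            pe[t, i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return pe                                       # added element-wise to the token embeddings
```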
Optionally, a third CNN model with a resnet network structure is included in the SPEM.
Inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features, wherein the method comprises the following steps:
and inputting the second frame into a third CNN model for feature extraction to obtain spatial features.
Considering that the overall video appearance transformation is slow and stable, the second frame can be directly input into the CNN model for feature extraction, wherein the CNN model with a resnet network structure is adopted, the network degradation can be well solved by utilizing the residual structure of the resnet network structure, and the low-level high-resolution information and the high-level high-semantic information are combined.
Optionally, fusing the temporal feature and the spatial feature to obtain a fused feature, including:
and performing channel splicing on the sequence features and the space features to obtain fusion features.
In the embodiment of the application, a fusion module (CAEM) can be created to perform feature fusion rather than simple channel stacking. Specifically, the outputs of the SPEM and the TPEM can be converted into the same shape through a convolution layer and then spliced along the channel dimension. To better fuse the spatial and temporal features, an attention module can be added after channel splicing to perform self-attention over the channel dimension, so that the information of the temporal and spatial dimensions is fused more thoroughly.
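A minimal sketch of such a fusion module in PyTorch, under the assumption that both branches produce 2D feature maps; a squeeze-and-excitation style channel re-weighting is used here as a simple stand-in for the channel self-attention mentioned above, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Project both branches to the same shape, splice along the channel dimension,
    then re-weight channels so temporal and spatial information are mixed."""
    def __init__(self, temporal_ch: int, spatial_ch: int, out_ch: int = 256):
        super().__init__()
        self.proj_t = nn.Conv2d(temporal_ch, out_ch, kernel_size=1)
        self.proj_s = nn.Conv2d(spatial_ch, out_ch, kernel_size=1)
        self.channel_attn = nn.Sequential(            # SE-style channel attention (assumption)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * out_ch, out_ch // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 2, 2 * out_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, temporal_feat: torch.Tensor, spatial_feat: torch.Tensor) -> torch.Tensor:
        t = self.proj_t(temporal_feat)                # (B, out_ch, H, W)
        s = self.proj_s(spatial_feat)                 # (B, out_ch, H, W)
        fused = torch.cat([t, s], dim=1)              # channel splicing -> (B, 2*out_ch, H, W)
        return fused * self.channel_attn(fused)       # channel-wise re-weighting
```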
Optionally, determining the video action according to the fusion feature includes:
and determining the video action according to the fusion characteristics and the preset corresponding relation.
The preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
In the embodiment of the application, the classification result can be output through a convolution layer and a linear mapping. The correspondence between specific fusion features and video actions can be preset, so that the corresponding video action can be obtained directly once the fusion features have been obtained through the above process.
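A sketch of the convolution-plus-linear-mapping classification head, assuming PyTorch; the channel sizes and pooling choice are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Convolution layer followed by a linear mapping that outputs action class scores."""
    def __init__(self, in_ch: int, num_actions: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_ch // 2, num_actions)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:   # fused: (B, in_ch, H, W)
        x = self.pool(self.conv(fused)).flatten(1)
        return self.fc(x)                                      # argmax gives the predicted video action
```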
The following description of the embodiment of the present application is given with reference to fig. 2, and it should be noted that specific parameters adopted in the following embodiments are examples, and do not limit parameters of the technical solution of the present application.
Referring to fig. 2, a dual-branch structure adopted by the video action recognition method provided by the embodiment of the present application is shown, where the structure gives consideration to spatial and temporal feature expression of video features, and a specific scheme flow is as follows:
step one: data preparation.
For the action video, 32 consecutive frames are selected and key frames are extracted at an interval of 1 frame. The 16 extracted frames are preprocessed and input into the time sequence feature extraction module (TPEM), and the middle frame of the 16 frames is taken as the key frame and input into the spatial feature extraction module (SPEM).
Step two: and (5) extracting spatial characteristics.
Since the overall appearance of the video changes slowly and stably, the middle frame of the 16 frames is extracted as the spatial feature expression of the video. Considering that gradient vanishing may occur as the network depth keeps increasing, a resnet structure is used. Meanwhile, considering that the motion amplitude in the video is not fixed and both large-amplitude motion and fine motion exist, the outputs of the different-scale convolution layers conv2, conv3, conv4 and conv5 of the resnet are fused; the results of fusing the different convolution layers are denoted P2, P3, P4 and P5 respectively, the low-level high-resolution information and the high-level strong-semantic information are then combined from top to bottom, the last layer is taken as the output, and finally dimensionality reduction is performed through a convolution layer.
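A minimal sketch of such top-down multi-scale fusion over resnet stage outputs, assuming PyTorch and torchvision; the choice of resnet50 and the output channel widths are assumptions, since the embodiment does not fix them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class MultiScaleResNet(nn.Module):
    """ResNet backbone whose conv2-conv5 stage outputs are fused top-down (P5 -> P2),
    combining low-level high-resolution and high-level strong-semantic information."""
    def __init__(self, out_ch: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)   # depth is an assumption
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_ch, kernel_size=1) for c in (256, 512, 1024, 2048))
        self.reduce = nn.Conv2d(out_ch, out_ch // 2, kernel_size=1)   # final dimensionality reduction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        x = self.stem(x)
        for stage in self.stages:                   # C2, C3, C4, C5
            x = stage(x)
            feats.append(x)
        p = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(p) - 1, 0, -1):          # top-down path: P5 -> P4 -> P3 -> P2
            p[i - 1] = p[i - 1] + F.interpolate(p[i], size=p[i - 1].shape[-2:], mode="nearest")
        return self.reduce(p[0])                    # highest-resolution fused map as output
```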
Step three: video timing feature extraction.
The key frame has been processed, but looking only at the key frame leads to the problem of action ambiguity. To eliminate this problem, the temporal sequence must be considered, so past frames are used in addition to the key frame. Therefore, a video segment containing the key frame is input alongside the key frame itself. To process this segment, a time sequence feature extraction module TPEM is created, and time sequence features are extracted from it mainly through the following steps.
1. And extracting video frame information.
The 16 extracted key frames are sent into a convolutional network for feature extraction, and the structure of this network is the same as that used for the key frame in the spatial branch.
2. Token embedding.
The feature map extracted in the previous step is divided into fixed-size parts of 7×7 each, so that each feature map produces 64 parts of the same size, i.e., the token sequence length is 64. A class token is then added, which is mainly used for classifying the video action. The class token is randomly initialized and, as the training of the network is continuously updated, it gathers information from all other tokens (global feature aggregation); because it is not based on image content, bias towards a specific token can be avoided, and because its position encoding is fixed, its output is not disturbed by the position encoding. The token embedding is thus finally formed.
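A sketch of this token-embedding step, assuming PyTorch and a 56×56 input feature map (so that 7×7 parts yield 64 tokens); the embedding dimension is an assumed value.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Split a feature map into fixed 7x7 parts, turn each part into a token,
    and prepend a randomly initialised class token."""
    def __init__(self, channels: int, patch: int = 7, dim: int = 512):
        super().__init__()
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)   # one token per 7x7 part
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))                   # random init, not image-based

    def forward(self, feat: torch.Tensor) -> torch.Tensor:        # feat: (B, channels, 56, 56)
        tokens = self.proj(feat).flatten(2).transpose(1, 2)        # (B, 64, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)        # same class token for every sample
        return torch.cat([cls, tokens], dim=1)                     # (B, 65, dim)
```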
3. And (5) time position coding.
The video frames are played in temporal order. Considering that each token lies at a different position and that the transformer structure loses position information, position encoding is applied to the tokens before they are sent into the transformer network, and the position encoding is added to the token embedding. The specific calculation formula of the position embedding is as follows:
$$P_t^{(i)} = \begin{cases} \sin\!\left(t \big/ 10000^{2k/d}\right), & i = 2k \\ \cos\!\left(t \big/ 10000^{2k/d}\right), & i = 2k+1 \end{cases}$$

wherein $t$ is the actual temporal position of the feature encoding in the video sequence, $P_t$ is the position vector of the t-th feature encoding among the plurality of feature encodings, $P_t^{(i)}$ is the value of the i-th element in the position vector, $d$ is the dimension of the feature code, $i = 2k$ indicates that the i-th element is an even-numbered element, and $i = 2k+1$ indicates that the i-th element is an odd-numbered element.
4. Time attention mechanism.
A temporal attention mechanism is introduced to strengthen the feature expression of actions that change in the time dimension. The result of the previous step is mapped into q, k and v as input, a multi-head attention mechanism is applied, and the dimension is then enlarged and reduced back through an MLP Block so that the input and output dimensions stay consistent. Optionally, 6 such blocks are stacked in total, and the final output serves as the feature expression of the temporal module TPEM.
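Continuing the earlier sketch, stacking six such blocks could look like the following; the block class and its sizes are the assumed ones from the sketch above, not values fixed by the embodiment.

```python
import torch.nn as nn

# Reuses the TemporalEncoderBlock sketched earlier; a depth of 6 follows the example in the text.
temporal_encoder = nn.Sequential(*[TemporalEncoderBlock(dim=512, num_heads=8) for _ in range(6)])
# tokens of shape (batch, 65, 512) go in; the output of the same shape is the TPEM feature expression
```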
Step four: and (5) feature fusion.
A CAEM module is created to perform feature fusion. Instead of simple channel stacking, the outputs of the SPEM and the TPEM are converted into the same shape through a convolution layer, and channel splicing is then performed.
Step five: and outputting a result.
And finally, outputting a classification result through a convolution layer and linear mapping.
According to the video motion recognition method provided by the embodiment of the application, the execution subject can be a video motion recognition device. In the embodiment of the application, a method for executing video motion recognition by a video motion recognition device is taken as an example, and the video motion recognition device provided by the embodiment of the application is described.
Referring to fig. 3, an embodiment of the present application provides a video motion recognition apparatus, including:
an extracting module 301, configured to extract a plurality of first frames from the target video sequence, and extract a second frame from the plurality of first frames;
a first feature extraction module 302, configured to input a plurality of first frames into the TPEM to perform feature extraction, so as to obtain a time sequence feature;
a second feature extraction module 303, configured to input a second frame into the SPEM for feature extraction, so as to obtain a spatial feature;
the fusion module 304 is configured to fuse the time sequence feature and the space feature to obtain a fusion feature;
a determining module 305, configured to determine a video action according to the fusion feature;
the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises the neural network model with the resnet network structure.
Optionally, the TPEM includes a first convolutional neural network CNN model having a resnet network structure and a second CNN model having a transformer network structure;
the first feature extraction module is specifically configured to:
inputting a plurality of first frames into a first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into a second CNN model for characteristic extraction to obtain time sequence characteristics;
wherein the category codes are associated with categories of video actions and the category codes are randomly initialized and the position codes are associated with a temporal position of each first frame in the target video sequence.
Optionally, the first feature extraction module is specifically configured to:
The position code is calculated by the following formula:

$$P_t^{(i)} = \begin{cases} \sin\!\left(t \big/ 10000^{2k/d}\right), & i = 2k \\ \cos\!\left(t \big/ 10000^{2k/d}\right), & i = 2k+1 \end{cases}$$

wherein $t$ is the actual temporal position of the feature encoding in the video sequence, $P_t$ is the position vector of the t-th feature encoding among the plurality of feature encodings, $P_t^{(i)}$ is the value of the i-th element in the position vector, $d$ is the dimension of the feature code, $i = 2k$ indicates that the i-th element is an even-numbered element, and $i = 2k+1$ indicates that the i-th element is an odd-numbered element.
Optionally, the SPEM includes a third CNN model having a resnet network structure;
the second feature extraction module is specifically configured to:
and inputting the second frame into a third CNN model for feature extraction to obtain spatial features.
Optionally, the fusion module is specifically configured to:
and performing channel splicing on the sequence features and the space features to obtain fusion features.
Optionally, the determining module is specifically configured to:
determining video actions according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
The video motion recognition device in the embodiment of the application can be an electronic device, for example, an electronic device with an operating system, or can be a component in the electronic device, for example, an integrated circuit or a chip. The electronic device may be a terminal, or may be other devices than a terminal. By way of example, the other device may be a server, network attached storage (Network Attached Storage, NAS), etc., and embodiments of the present application are not limited in detail.
The video action recognition device provided by the embodiment of the application can realize each process realized by the embodiment of the method and achieve the same technical effect, and in order to avoid repetition, the description is omitted here.
Referring to fig. 4, an embodiment of the present application provides an electronic device 400, including: at least one processor 401, a memory 402, a user interface 403 and at least one network interface 404. The various components in electronic device 400 are coupled together by bus system 405.
It is understood that the bus system 405 is used to enable connected communications between these components. The bus system 405 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus system 405 in fig. 4.
The user interface 403 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It will be appreciated that the memory 402 in embodiments of the application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 402 described in embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 402 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: an operating system 4021 and application programs 4022.
The operating system 4021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 4022 include various application programs such as a media player, a browser, and the like for implementing various application services. A program for implementing the method of the embodiment of the present application may be included in the application program 4022.
In an embodiment of the present application, the electronic device 400 may further include: a program stored on the memory 402 and executable on the processor 401, which when executed by the processor 401, implements the steps of the method provided by the embodiment of the application.
The method disclosed in the above embodiments of the present application may be applied to the processor 401 or implemented by the processor 401. The processor 401 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 401 or by instructions in the form of software. The processor 401 may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a computer readable storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The computer readable storage medium is located in the memory 402, and the processor 401 reads the information in the memory 402 and completes the steps of the above method in combination with its hardware. In particular, the computer readable storage medium has a computer program stored thereon.
It is to be understood that the embodiments of the application described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more ASICs, DSPs, digital Signal Processing Devices (DSPDs), programmable logic devices (Programmable Logic Device, PLDs), FPGAs, general purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
The embodiment of the application also provides a readable storage medium, on which a program or an instruction is stored, which when executed by a processor, implements each process of the video motion recognition method embodiment, and can achieve the same technical effects, and in order to avoid repetition, the description is omitted here.
Wherein, the processor is the processor in the terminal described in the above embodiments. The readable storage medium includes a computer readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk. In some examples, the readable storage medium may be a non-transitory readable storage medium.
The embodiment of the application further provides a chip, which comprises a processor and a communication interface, wherein the communication interface is coupled with the processor, and the processor is used for running programs or instructions to realize the processes of the embodiment of the video motion recognition method, and can achieve the same technical effects, so that repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, or the like.
The embodiments of the present application further provide a computer program/program product, where the computer program/program product is stored in a storage medium, and the computer program/program product is executed by at least one processor to implement each process of the embodiments of the video motion recognition method, and the same technical effects can be achieved, so that repetition is avoided, and details are not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the description of the embodiments above, it will be apparent to those skilled in the art that the above-described example methods may be implemented by means of a computer software product plus a necessary general purpose hardware platform, but may also be implemented by hardware. The computer software product is stored on a storage medium (such as ROM, RAM, magnetic disk, optical disk, etc.) and includes instructions for causing a terminal or network side device to perform the methods according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms of embodiments may be made by those of ordinary skill in the art without departing from the spirit of the application and the scope of the claims, which fall within the protection of the present application.

Claims (14)

1. A method for identifying video actions, comprising:
extracting a plurality of first frames from the target video sequence, and extracting a second frame from the plurality of first frames;
inputting the plurality of first frames into a time sequence feature extraction module TPEM for feature extraction to obtain time sequence features;
inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features;
fusing the time sequence features and the space features to obtain fusion features;
determining a video action according to the fusion characteristics;
the TPEM comprises a neural network model with a residual (resnet) network structure and a neural network model with a transformer network structure, and the SPEM comprises a neural network model with the resnet network structure.
2. The method according to claim 1, wherein the TPEM comprises a first CNN model having a resnet network structure and a second CNN model having a transformer network structure;
the step of inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features includes:
inputting the plurality of first frames into the first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into the second CNN model for characteristic extraction to obtain the time sequence characteristic;
wherein the category code is associated with a category of the video action and the category code is randomly initialized, the position code being associated with a temporal position of each of the first frames in the target video sequence.
3. The method according to claim 2, wherein the adding position coding to the second feature data by the second coding process includes:
The position code is calculated by the following formula:

$$P_t^{(i)} = \begin{cases} \sin\!\left(t \big/ 10000^{2k/d}\right), & i = 2k \\ \cos\!\left(t \big/ 10000^{2k/d}\right), & i = 2k+1 \end{cases}$$

wherein $t$ is the actual temporal position of the feature encoding in the video sequence, $P_t$ is the position vector of the t-th feature encoding among the plurality of feature encodings, $P_t^{(i)}$ is the value of the i-th element in the position vector, $d$ is the dimension of the feature code, $i = 2k$ indicates that the i-th element is an even-numbered element, and $i = 2k+1$ indicates that the i-th element is an odd-numbered element.
4. The method of claim 1, wherein the SPEM includes a third CNN model having a resnet network structure;
inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features, wherein the method comprises the following steps:
and inputting the second frame into the third CNN model to perform feature extraction to obtain spatial features.
5. The method of claim 1, wherein the fusing the temporal feature and the spatial feature to obtain a fused feature comprises:
and performing channel splicing on the time sequence features and the space features to obtain the fusion features.
6. The method of claim 1, wherein said determining a video action from said fusion feature comprises:
determining the video action according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
7. A video motion recognition apparatus, comprising:
an extraction module for extracting a plurality of first frames from the target video sequence and extracting a second frame from the plurality of first frames;
the first feature extraction module is used for inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features;
the second feature extraction module is used for inputting the second frame into the SPEM to perform feature extraction to obtain spatial features;
the fusion module is used for fusing the time sequence features and the space features to obtain fusion features;
the determining module is used for determining video actions according to the fusion characteristics;
the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises the neural network model with the resnet network structure.
8. The apparatus of claim 7, wherein the TPEM comprises a first CNN model having a resnet network structure and a second CNN model having a transformer network structure;
the first feature extraction module is specifically configured to:
inputting the plurality of first frames into the first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into the second CNN model for characteristic extraction to obtain the time sequence characteristic;
wherein the category code is associated with a category of the video action and the category code is randomly initialized, the position code being associated with a temporal position of each of the first frames in the target video sequence.
9. The apparatus according to claim 8, wherein the first feature extraction module is specifically configured to:
The position code is calculated by the following formula:

$$P_t^{(i)} = \begin{cases} \sin\!\left(t \big/ 10000^{2k/d}\right), & i = 2k \\ \cos\!\left(t \big/ 10000^{2k/d}\right), & i = 2k+1 \end{cases}$$

wherein $t$ is the actual temporal position of the feature encoding in the video sequence, $P_t$ is the position vector of the t-th feature encoding among the plurality of feature encodings, $P_t^{(i)}$ is the value of the i-th element in the position vector, $d$ is the dimension of the feature code, $i = 2k$ indicates that the i-th element is an even-numbered element, and $i = 2k+1$ indicates that the i-th element is an odd-numbered element.
10. The apparatus of claim 7, wherein the SPEM includes a third CNN model having a resnet network structure;
the second feature extraction module is specifically configured to:
and inputting the second frame into the third CNN model to perform feature extraction to obtain spatial features.
11. The apparatus of claim 7, wherein the fusion module is specifically configured to:
and performing channel splicing on the time sequence features and the space features to obtain the fusion features.
12. The apparatus of claim 7, wherein the determining module is specifically configured to:
determining the video action according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
13. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the video action recognition method of any one of claims 1 to 6.
14. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the video action recognition method according to any one of claims 1 to 6.
CN202311162287.3A 2023-09-11 2023-09-11 Video motion recognition method and device, electronic equipment and readable storage medium Active CN116895038B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311162287.3A CN116895038B (en) 2023-09-11 2023-09-11 Video motion recognition method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311162287.3A CN116895038B (en) 2023-09-11 2023-09-11 Video motion recognition method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN116895038A true CN116895038A (en) 2023-10-17
CN116895038B CN116895038B (en) 2024-01-26

Family

ID=88311127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311162287.3A Active CN116895038B (en) 2023-09-11 2023-09-11 Video motion recognition method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116895038B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN115019239A (en) * 2022-07-04 2022-09-06 福州大学 Real-time action positioning method based on space-time cross attention
CN116453025A (en) * 2023-05-11 2023-07-18 南京邮电大学 Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN116580453A (en) * 2023-04-26 2023-08-11 哈尔滨工程大学 Human body behavior recognition method based on space and time sequence double-channel fusion model
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN115019239A (en) * 2022-07-04 2022-09-06 福州大学 Real-time action positioning method based on space-time cross attention
CN116580453A (en) * 2023-04-26 2023-08-11 哈尔滨工程大学 Human body behavior recognition method based on space and time sequence double-channel fusion model
CN116453025A (en) * 2023-05-11 2023-07-18 南京邮电大学 Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A. Arnab et al.: "ViViT: A Video Vision Transformer", IEEE/CVF International Conference on Computer Vision, pages 6816-6826 *
Edward Fish et al.: "Two-Stream Transformer Architecture for Long Form Video Understanding", arXiv, pages 1-14 *

Also Published As

Publication number Publication date
CN116895038B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN108615036B (en) Natural scene text recognition method based on convolution attention network
Guo et al. Eaten: Entity-aware attention for single shot visual text extraction
Wang et al. Multi-granularity prediction for scene text recognition
JP2019008778A (en) Captioning region of image
WO2021098689A1 (en) Text recognition method for natural scene, storage apparatus, and computer device
JP2010250814A (en) Part-of-speech tagging system, training device and method of part-of-speech tagging model
US11562734B2 (en) Systems and methods for automatic speech recognition based on graphics processing units
WO2023202197A1 (en) Text recognition method and related apparatus
CN114863539A (en) Portrait key point detection method and system based on feature fusion
CN113901909A (en) Video-based target detection method and device, electronic equipment and storage medium
CN114462356A (en) Text error correction method, text error correction device, electronic equipment and medium
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN113157941B (en) Service characteristic data processing method, service characteristic data processing device, text generating method, text generating device and electronic equipment
CN116895038B (en) Video motion recognition method and device, electronic equipment and readable storage medium
CN110502236B (en) Front-end code generation method, system and equipment based on multi-scale feature decoding
KR102559849B1 (en) Malicious comment filter device and method
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN113283241B (en) Text recognition method and device, electronic equipment and computer readable storage medium
CN111325016B (en) Text processing method, system, equipment and medium
CN111325068B (en) Video description method and device based on convolutional neural network
CN112699882A (en) Image character recognition method and device and electronic equipment
US8386922B2 (en) Information processing apparatus and information processing method
CN117912005B (en) Text recognition method, system, device and medium using single mark decoding
CN114090928B (en) Nested HTML entity decoding method and device, computer equipment and storage medium
CN112329925B (en) Model generation method, feature extraction method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant