CN114973049B - Lightweight video classification method with unified convolution and self-attention - Google Patents


Info

Publication number
CN114973049B
CN114973049B (application CN202210010323.3A)
Authority
CN
China
Prior art keywords
aggregator
unified
tokens
relation
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210010323.3A
Other languages
Chinese (zh)
Other versions
CN114973049A (en)
Inventor
黎昆昌
王亚立
乔宇
高鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202210010323.3A priority Critical patent/CN114973049B/en
Publication of CN114973049A publication Critical patent/CN114973049A/en
Application granted granted Critical
Publication of CN114973049B publication Critical patent/CN114973049B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight video classification method with unified convolution and self-attention. The method comprises the following steps: acquiring a target video image to be identified; treating the target video image as a series of tokens and extracting features using a pre-trained transformer model framework, wherein the transformer model framework is arranged into a plurality of layers, each layer comprises a plurality of unified transformer modules, the output of each layer is reduced in resolution through feature aggregation, and each unified transformer module comprises a dynamic position encoder, a multi-head relation aggregator and a feed-forward neural network; the dynamic position encoder dynamically integrates the three-dimensional position information of the target video image into all tokens, the multi-head relation aggregator aggregates each token with its context tokens, and the feed-forward network enhances the information of each token independently; the target video image is then classified based on the extracted features. The model provided by the invention has a smaller computational cost and higher classification accuracy.

Description

Lightweight video classification method with unified convolution and self-attention
Technical Field
The invention relates to the technical field of video classification, in particular to a lightweight video classification method with unified convolution and self-attention.
Background
Because of the large amount of local redundancy and the complex global dependencies among video frames, existing models find it difficult to learn rich, multi-level spatio-temporal semantic representations from high-dimensional video information. The video understanding field currently has two mainstream models: the three-dimensional convolutional neural network and the spatio-temporal vision transformer. Although convolutional neural networks can efficiently learn spatio-temporal context information in local areas, their limited receptive field makes it difficult to capture long-range temporal dependencies. The vision transformer can learn the global context effectively with its self-attention mechanism, but it blindly performs similarity comparisons over all feature tokens in all layers, which brings huge computational redundancy.
Early work has demonstrated that the multi-head self-attention mechanism can be equivalent to a convolution operator, but it proposes replacing convolution with self-attention rather than combining the two. Recently, in the image field, much work has introduced convolution into the vision transformer. However, this work focuses mainly on image tasks such as recognition, detection or segmentation, without considering the spatio-temporal operations required for video understanding. In the video field, the combination of convolution and self-attention has remained unexplored: existing work either simply introduces convolution to compress the input video features, or simply stacks global self-attention transformer blocks in the deep layers of the network, lacking a unified overall design, which introduces a large amount of computation and is difficult to optimize.
The field of video understanding presents two distinct challenges. On the one hand, because the motion of objects between adjacent frames is very small, video contains a large amount of spatio-temporal redundancy. On the other hand, video contains complex spatio-temporal dependencies, since the target relationships between widely spaced frames are dynamic. The currently common frameworks of three-dimensional convolutional neural networks and spatio-temporal vision transformers each focus on only one of these challenges. For example, three-dimensional convolution processes each pixel in the context of a small three-dimensional neighborhood, so it can capture detailed local spatio-temporal features and reduce the spatio-temporal redundancy between adjacent frames; however, three-dimensional convolution has difficulty learning long-range dependencies because of its limited receptive field. In contrast, the vision transformer learns global dependencies using self-attention over visual tokens. However, analysis shows that the vision transformer encodes local spatio-temporal features inefficiently in the shallow layers of the network. Specifically, spatial attention is focused mainly on neighboring tokens (e.g., a 3 x 3 local region) without learning from the other tokens in the same frame. Similarly, temporal attention is typically focused only on tokens in adjacent frames, while tokens in distant frames are ignored. More importantly, such local representations are learned from a global token-to-token similarity comparison, which wastes a significant amount of computation.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a lightweight video classification method with unified convolution and self-attention, the method comprising:
Acquiring a target video image to be identified;
Taking the target video image as a series of tokens and extracting video features by using a pre-trained transformer model framework, wherein the transformer model framework is arranged into a plurality of layers, each layer comprises a plurality of unified transformer modules, the output of each layer reduces resolution through feature aggregation, each unified transformer module comprises a dynamic position encoder, a multi-head relation aggregator and a feedforward neural network, and the dynamic position encoder dynamically integrates three-dimensional position information of the target video image into all the tokens so as to perform video modeling by using the spatiotemporal sequence of the tokens; the multi-head relation aggregator is used for aggregating each token with its context tokens; the feedforward network is used for carrying out independent information enhancement on each token;
the target video image is classified based on the extracted video features.
Compared with the prior art, the unified transformer has the advantage that the strengths of three-dimensional convolution and spatio-temporal self-attention can be integrated into a compact transformer form, achieving a better balance between computational cost and video understanding accuracy. Unlike a traditional vision transformer, the unified transformer provided by the invention contains a novel relation aggregator that handles spatio-temporal redundancy and dependency by learning local and global token affinities in the shallow and deep layers respectively, thereby realizing efficient and lightweight learning of video spatio-temporal representations.
Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a unified convolution and self-attention lightweight video classification method according to an embodiment of the present invention;
FIG. 2 is a framework diagram of the unified transformer according to one embodiment of the invention;
FIG. 3 is a schematic diagram of the functional blocks in a unified transformer module according to one embodiment of the invention;
fig. 4 is a schematic diagram of an application process according to one embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Referring to fig. 1, a method for lightweight video classification with unified convolution and self-attention is provided that includes the following steps.
Step S110, an end-to-end model framework for video classification is constructed, wherein the model framework comprises a plurality of unified transformer modules and a classification layer, and the output of each unified transformer module is input to the next layer after feature aggregation.
Conventional vision transformers view an image as a series of patch tokens. A token is a small portion of an image consisting of multiple pixels, commonly referred to as a "patch". With multi-head self-attention, the vision transformer recombines and processes the patch tokens at each layer according to the relationship between each pair of tokens, and in this way generates a global representation of the whole image. Tokens are created at the input level by splitting the image evenly into multiple parts; at intermediate layers, the output of the previous layer becomes the tokens of the next layer. The vision transformer model separates each image into a sequence of tokens of fixed length and then applies multiple transformer layers to model their global relationships for classification. While representing an image with more tokens yields higher prediction accuracy, it also results in a dramatic increase in computational cost.
Compared with the prior art, the present invention effectively unifies three-dimensional convolution and spatio-temporal self-attention in a compact transformer form, and the proposed unified transformer model achieves a better balance between efficiency and effectiveness. Referring to fig. 2, the provided unified transformer model framework generally includes a plurality of unified transformer modules and a classification layer, wherein each group of unified transformer modules can be regarded as one layer of the model framework. Feature aggregation locally aggregates the output features of each layer so that the resolution is reduced layer by layer; different layers can be assigned different numbers and types of unified transformer modules, and feature aggregation is implemented using spatially non-overlapping spatio-temporal convolution. Each unified transformer module includes a dynamic position encoder (DPE), a multi-head relation aggregator (MHRA), and a feed-forward network (FFN).
Specifically, the model framework may be configured with four layers, each layer including a plurality of unified transformer modules, and the number of unified transformer modules differs per layer. For example, for a small model the number of modules per layer may be set to 3, 4, 8, 3, and for a large model to 5, 8, 20, 7; the number of channels per layer may be set to 64, 128, 320, 512, respectively. By setting different numbers of unified transformer modules, layers, and channels, the method can be adapted to video classification in different scenarios.
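For illustration only, the staging described above can be written down as a small configuration table. The following Python sketch uses assumed names ("small", "large", STAGE_CONFIGS) that are not terms from the patent:

```python
# Illustrative stage configurations following the numbers given above.
# The variant labels and field names are assumptions, not the patent's terms.
STAGE_CONFIGS = {
    "small": {"blocks_per_stage": (3, 4, 8, 3),  "channels": (64, 128, 320, 512)},
    "large": {"blocks_per_stage": (5, 8, 20, 7), "channels": (64, 128, 320, 512)},
}

# The first two stages use local relation aggregators (shallow layers),
# the last two use global relation aggregators (deep layers).
STAGE_AGGREGATOR = ("local", "local", "global", "global")

def describe(variant: str) -> None:
    cfg = STAGE_CONFIGS[variant]
    for i, (n, c, kind) in enumerate(
            zip(cfg["blocks_per_stage"], cfg["channels"], STAGE_AGGREGATOR), 1):
        print(f"stage {i}: {n} unified transformer blocks, {c} channels, {kind} aggregation")

if __name__ == "__main__":
    describe("small")
```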
For clarity, the input-output relationship of the unified transformer module may be uniformly expressed as:
X = DPE(X_in) + X_in
Y = MHRA(Norm(X)) + X      (1)
Z = FFN(Norm(Y)) + Y
where X_in represents the input token tensor (frame volume), and Norm represents a normalization operation, which may differ for different relation aggregators.
As can be seen from equation (1), the dynamic position encoder first dynamically integrates 3D position information into all tokens, so that the spatio-temporal order of the tokens can be used efficiently for video modeling. Each token is then aggregated with its context tokens using the multi-head relation aggregator. Finally, the feed-forward network enhances the information of each token point-wise.
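As one concrete reading of equation (1), the following PyTorch sketch assembles a shallow-layer (local) unified transformer block from standard operators. It is a minimal illustration under the assumption that the local aggregator and the FFN can be expressed with 3D convolutions; it is not the reference implementation of the invention.

```python
import torch
import torch.nn as nn

class UnifiedBlockLocal(nn.Module):
    """Minimal sketch of a shallow-layer unified transformer block, equation (1)."""
    def __init__(self, dim: int, neighborhood: int = 5):
        super().__init__()
        # DPE: zero-padded 3D depthwise convolution, added back as a residual.
        self.dpe = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Batch normalization is used before the local relation aggregator.
        self.norm1 = nn.BatchNorm3d(dim)
        # Local MHRA stand-in: a learnable affinity over a 5x5x5 neighborhood,
        # realized here as a depthwise 3D convolution between two point-wise projections.
        self.mhra = nn.Sequential(
            nn.Conv3d(dim, dim, kernel_size=1),                  # value projection
            nn.Conv3d(dim, dim, kernel_size=neighborhood,
                      padding=neighborhood // 2, groups=dim),    # local relative-position affinity
            nn.Conv3d(dim, dim, kernel_size=1),                  # output projection
        )
        self.norm2 = nn.BatchNorm3d(dim)
        # FFN: two point-wise (1x1x1) layers with a nonlinearity in between.
        self.ffn = nn.Sequential(
            nn.Conv3d(dim, dim * 4, kernel_size=1),
            nn.GELU(),
            nn.Conv3d(dim * 4, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, T, H, W)
        x = x + self.dpe(x)                # X = DPE(X_in) + X_in
        x = x + self.mhra(self.norm1(x))   # Y = MHRA(Norm(X)) + X
        x = x + self.ffn(self.norm2(x))    # Z = FFN(Norm(Y)) + Y
        return x

if __name__ == "__main__":
    block = UnifiedBlockLocal(64)
    video = torch.randn(1, 64, 8, 28, 28)  # (batch, channels, frames, height, width)
    print(block(video).shape)              # torch.Size([1, 64, 8, 28, 28])
```

A deep-layer block would swap the local aggregator for the global one sketched further below and use layer normalization instead of batch normalization, as described later.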
In one embodiment, as shown in FIG. 3, dynamic position encoding is implemented by a three-dimensional depthwise separable convolution with a residual connection, where the convolution uses zero padding and overlapping windows. The zero padding makes tokens on the boundary aware of their absolute positions, so all tokens can progressively encode their absolute spatio-temporal position information by querying their neighbors. The dynamic position encoder preferably employs a 3×3×3 or 5×5×5 convolution kernel. By introducing dynamic position encoding, the drawback that conventional absolute position encoding is difficult to adapt to different resolutions is overcome.
Each token is then aggregated with its context tokens using the multi-head relation aggregator. Compared with traditional video transformers, the invention designs a distinctive relation aggregator. First, instead of using the self-attention mechanism at all layers, different relation aggregators are used to address video redundancy and video dependency respectively. Specifically, in the shallow layers of the overall model framework, the aggregator learns local relations through a small learnable parameter matrix and aggregates only the contexts of adjacent tokens in a three-dimensional neighborhood, which significantly reduces computation. In the deep layers, the aggregator learns global relations through similarity comparison and can flexibly build long-range token dependencies across different frames of the video. In addition, unlike the separated space-time attention of traditional transformers, the relation aggregator provided by the invention jointly encodes the global spatio-temporal context, and this joint learning further enhances the video representation. Finally, the unified transformer modules are stacked hierarchically, layer by layer, to build the model framework, and the cooperation of the local and global relation aggregators realizes efficient spatio-temporal representation learning.
It should be noted that, for the core multi-head relation aggregator, two forms are designed, local and global; different relation aggregators are used at different layers, with local relation aggregators in the shallow layers and global relation aggregators in the deep layers. In the shallow layers, the goal is to learn detailed video representations of the local spatio-temporal context within a three-dimensional neighborhood. In one embodiment, the token affinity is designed as a matrix of learnable parameters operating over a local three-dimensional neighborhood, i.e. given an anchor token, the relation aggregator learns the local spatio-temporal affinity between that token and the other tokens in a small neighborhood, and its value depends only on the relative three-dimensional positions between the tokens. The local relation aggregator may employ different convolution kernel sizes; a variety of kernel sizes such as 3×3×3, 7×7×7 and 9×9×9 were tried, and experiments show that the 5×5×5 convolution kernel works best. With the local relation aggregator, important local structures in the video, such as edges and lines, can be identified.
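Because the affinity of the local relation aggregator depends only on the relative 3D position between tokens, it can be realized as a learnable depthwise 3D convolution applied to a value projection. The sketch below is one such reading under the 5×5×5 neighborhood mentioned above; the module and parameter names are illustrative assumptions, and giving each channel its own affinity is a slight generalization of a per-head affinity.

```python
import torch
import torch.nn as nn

class LocalRelationAggregator(nn.Module):
    """Sketch of a local MHRA: the token affinity is a learnable parameter matrix
    over a small 3D neighborhood, i.e. it depends only on the relative
    spatio-temporal position between tokens, not on their content."""
    def __init__(self, dim: int, neighborhood: int = 5):
        super().__init__()
        self.value = nn.Conv3d(dim, dim, kernel_size=1)           # value projection
        # One learnable neighborhood x neighborhood x neighborhood affinity per channel.
        self.affinity = nn.Conv3d(dim, dim, kernel_size=neighborhood,
                                  padding=neighborhood // 2, groups=dim, bias=False)
        self.proj = nn.Conv3d(dim, dim, kernel_size=1)            # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: (B, C, T, H, W)
        return self.proj(self.affinity(self.value(x)))

if __name__ == "__main__":
    agg = LocalRelationAggregator(64)
    print(agg(torch.randn(2, 64, 8, 28, 28)).shape)   # torch.Size([2, 64, 8, 28, 28])
```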
In the deep layers of the overall model framework, the focus is on capturing long-term token dependencies across the global video segment, which matches the design of the self-attention mechanism. Thus, the token affinity is designed by globally comparing the content similarity of all tokens. Conventional video transformers employ self-attention at every layer, which introduces a significant amount of computation. To reduce the dot-product computation, previous studies tend to separate spatial and temporal attention, but this compromises the spatio-temporal relationships between tokens. In contrast, the multi-head relation aggregator provided by the invention performs local relation aggregation in the shallow layers, which largely saves the computation of token similarity comparisons. Further, to achieve a better accuracy-computation trade-off, the spatio-temporal relationships of all tokens are encoded jointly in the deep layers, rather than decomposing spatial and temporal attention.
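A deep-layer aggregator of this kind amounts to multi-head self-attention computed jointly over all spatio-temporal tokens. The following sketch flattens the T×H×W tokens and applies standard multi-head attention; it is an illustrative approximation, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

class GlobalRelationAggregator(nn.Module):
    """Sketch of a deep-layer MHRA: token affinity comes from the content similarity
    of all tokens, i.e. joint spatio-temporal multi-head self-attention."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                   # (B, T*H*W, C): all tokens jointly
        out, _ = self.attn(tokens, tokens, tokens)              # global similarity comparison
        return out.transpose(1, 2).reshape(b, c, t, h, w)

if __name__ == "__main__":
    agg = GlobalRelationAggregator(64)
    print(agg(torch.randn(1, 64, 4, 7, 7)).shape)               # torch.Size([1, 64, 4, 7, 7])
```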
Preferably, the local relation aggregator uses batch normalization (normalizing each channel across different samples, i.e. feature-dimension normalization), and the global relation aggregator uses layer normalization (normalizing across the channels of the same sample, i.e. per-sample normalization).
In addition, it should be noted that, for the model framework of fig. 2, from the viewpoint of feature aggregation, the feature aggregation forms a multi-level structure: non-overlapping convolutional downsampling is adopted to build a feature pyramid, and in this way the resolution is progressively reduced. For example, feature aggregation uses a spatio-temporal convolution for the first downsampling with kernel size 3×2×2 and stride 2×2×2, and for the remaining downsamplings a convolution with kernel size 1×2×2 and stride 1×2×2. In other embodiments, overlapping convolution kernels may be used, but experiments verify that the computational cost is smaller under the above convolution parameter settings when the trade-off between computation and performance is considered.
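Taking the kernel and stride values stated above at face value (the original figures appear partially garbled, so the reconstructed sizes should be treated as assumptions), the layer-by-layer feature aggregation could be sketched as follows:

```python
import torch
import torch.nn as nn

def downsample_layer(in_dim: int, out_dim: int, first: bool) -> nn.Module:
    """Spatio-temporal convolution for feature aggregation between layers.
    Kernel/stride values follow the sizes stated in the text and are assumptions
    where the original numbers are ambiguous."""
    if first:
        # First downsampling: kernel 3x2x2, stride 2x2x2 (slight temporal overlap).
        return nn.Conv3d(in_dim, out_dim, kernel_size=(3, 2, 2),
                         stride=(2, 2, 2), padding=(1, 0, 0))
    # Remaining downsamplings: kernel 1x2x2, stride 1x2x2 (spatially non-overlapping).
    return nn.Conv3d(in_dim, out_dim, kernel_size=(1, 2, 2), stride=(1, 2, 2))

if __name__ == "__main__":
    x = torch.randn(1, 3, 16, 224, 224)             # input clip
    x = downsample_layer(3, 64, first=True)(x)      # -> (1, 64, 8, 112, 112)
    x = downsample_layer(64, 128, first=False)(x)   # -> (1, 128, 8, 56, 56)
    print(x.shape)
```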
Under this design, the shallow layers operate at large resolution with local relation aggregation, and the deep layers operate at small resolution with global relation aggregation, which improves accuracy while reducing computation. That is, the feature aggregation scheme can be determined from the positions of the local and global relation aggregators, so that feature aggregation is organically combined with the local-to-global progression; the progression from local aggregation to global aggregation is designed according to the characteristics of video tasks. For example, for a four-layer model framework, the first two layers are treated as shallow layers and the last two layers as deep layers; "progressive" means that the shallow layers use local relation aggregation, the deep layers use global relation aggregation, and the normalization operations in the deep and shallow layers differ.
In summary, the multi-head relation aggregator provided by the invention handles local video redundancy and global video dependency by flexibly designing the token affinity learning mode in the shallow and deep layers, and it organically combines the feature aggregation scheme with the local-to-global progression.
Finally, independent information enhancement is performed for each token using the feed-forward network. Still referring to fig. 3, the feed-forward network may adopt the common design used in vision transformers, comprising two linear layers and a nonlinear activation function. The feed-forward network is a static nonlinear mapping, and complex nonlinear processing capability can be obtained by composing simple nonlinear processing units. Enhancing each token independently with the feed-forward network keeps the computation low and yields a lightweight video classification model.
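A feed-forward network of the common design mentioned above, two linear layers with a nonlinear activation applied to each token independently, could look like the following sketch; the expansion ratio of 4 is an assumption.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Sketch of the per-token FFN: two linear layers with a nonlinear activation,
    applied to each token independently (the expansion ratio of 4 is assumed)."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, N, C)
        return self.net(tokens)                                 # each token enhanced on its own

if __name__ == "__main__":
    ffn = FeedForward(64)
    print(ffn(torch.randn(1, 196, 64)).shape)   # torch.Size([1, 196, 64])
```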
Step S120, training the constructed unified transformer model framework using the sample dataset.
In this step, the unified transformer model framework is trained using a known sample data set to learn the parameters of the model, such as weights and biases. Depending on the type of sample data set, video classification in various scenarios can be realized. Video classification refers to classifying the content contained in a given video clip. The category may be an action (e.g., making a cake), a scene (e.g., beach), an object (e.g., car), and the like. The invention can realize classification and recognition of various types of videos or images, such as facial expression recognition, actions involving object interaction, common body movements, body movements involving interaction with people, actions involving human-object interaction, actions involving human-human interaction, and the like.
The constructed unified transformer model framework is pre-trained with the sample data set to obtain the model parameters; the pre-training process can be performed offline on a server or in the cloud.
Step S130, the features of the target video image are extracted using the trained unified transformer model framework, and the classification and recognition result is then obtained.
The trained models are embedded into the electronic equipment, so that real-time video classification can be realized. The electronic device may be a terminal device or a server, and the terminal device includes, but is not limited to, any terminal device such as a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a point of sale (POS), a vehicle-mounted computer, a smart wearable device (smart watch, virtual reality glasses, virtual reality helmet, etc.). The server includes, but is not limited to, an application server or a Web server, and may be a stand-alone server or a cluster server or a cloud server, etc. In an actual model application, the electronic device may acquire the target video image to be classified directly from the video image capturing device. Or the target video image may be transmitted to the electronic device via a network. The invention is not limited to the manner in which the target video image is acquired. In fig. 4, an application scenario is schematically shown, a user can upload videos to be identified to a cloud server through application programs at different terminals, the cloud server calls a deployed model to perform category judgment, and a corresponding result is returned after the judgment is completed.
To further verify the effect of the invention, a number of experiments were performed on common scene-related and temporally-related video classification benchmark data sets. The experimental results are shown in tables 1 and 2 below. The verification shows that the invention achieves higher classification accuracy with a smaller amount of computation, and reaches the best performance on multiple benchmark data sets.
Table 1 scene correlation dataset
Table 2 time-series related data sets
In summary, the invention provides a unified multi-head relation aggregator that organically unifies convolution and self-attention: a local multi-head relation aggregator and a global multi-head relation aggregator are designed, with local aggregation used in the shallow layers and global aggregation in the deep layers, thereby addressing the redundancy and the dependency in video. The proposed three-dimensional dynamic position encoding uses a zero-padded depthwise separable three-dimensional convolution to learn the spatio-temporal position information of the tokens dynamically, without additional interpolation or fine-tuning for different resolutions. A local-to-global progressive aggregation is designed according to the characteristics of video tasks: the proposed multi-layer progressive structure forms a feature pyramid, uses non-overlapping spatio-temporal convolutions for downsampling, and performs local and global operations progressively, trading off efficiency and accuracy. Experiments show that the unified transformer obtains the best performance on several popular video classification benchmark data sets. In short, the invention does not simply introduce self-attention to learn global relationships; instead, it explores the relationship between convolution and self-attention and designs a unified transformer. Unlike the separated space-time attention of existing video transformers, the invention performs joint spatio-temporal attention learning and realizes efficient and lightweight learning of video spatio-temporal representations.
It is to be noted that modifications or variations can be made to the above-described embodiments by persons skilled in the art without departing from the spirit and scope of the invention. For example, the local relation aggregator may employ other local operations such as windowed attention. The unified transformer and the local-to-global progressive structure are not limited to the video field and are equally applicable to the image field and to downstream tasks such as detection and segmentation. For another example, the size of the convolution kernel, the stride, the size of the three-dimensional neighborhood, and so on may be set appropriately according to the requirements on computation, recognition accuracy, and the like.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disk read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk, C++, Python, etc., and conventional procedural programming languages, such as the "C" language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (8)

1. A lightweight video classification method for unified convolution and self-attention, comprising the steps of:
Acquiring a target video image to be identified;
Taking the target video image as a series of tokens and extracting video features by using a pre-trained transformer model framework, wherein the transformer model framework is arranged into a plurality of layers, each layer comprises a plurality of unified transformer modules, the output of each layer reduces resolution through feature aggregation, each unified transformer module comprises a dynamic position encoder, a multi-head relation aggregator and a feedforward neural network, and the dynamic position encoder dynamically integrates three-dimensional position information of the target video image into all the tokens so as to perform video modeling by using the spatiotemporal sequence of the tokens; the multi-head relation aggregator is used for aggregating each token with its context tokens; the feedforward neural network is used for carrying out independent information enhancement on each token;
classifying the target video image based on the extracted video features;
For each unified transformer module, dividing a multi-head relation aggregator contained in the unified transformer module into a local relation aggregator and a global relation aggregator, wherein the local relation aggregator is used for learning local representations, and the global relation aggregator is used for learning global representations;
Wherein the plurality of unified transformer modules are divided into shallow portions and deep portions according to positions in the whole model frame, wherein a multi-headed relation aggregator located in the shallow portions is set as a local relation aggregator, and for a known anchor token, local relation is learned by learning local spatiotemporal affinities between the anchor token and other tokens in a set three-dimensional neighborhood, the values of the local spatiotemporal affinities being dependent on relative three-dimensional position information between the tokens; the multi-head relation aggregator in the deep part is set as a global relation aggregator, and token similarity is designed by globally comparing the content similarity of all tokens, and the space-time relation of all tokens is coded together.
2. The method of claim 1, wherein the dynamic position encoder is a three-dimensional depthwise separable convolution with a residual connection, the convolution using zero padding so that tokens on boundaries of the target video image can locate their absolute positions from the zero-padded result, and all tokens progressively encode their absolute spatio-temporal position information by querying their neighbors.
3. The method of claim 1, wherein a first layer of the transformer model framework performs feature aggregation using a first downsampling convolution with kernel size set to 3×2×2 and stride set to 2×2×2, and the other layers of the transformer model framework use a second downsampling convolution with kernel size set to 1×2×2 and stride set to 1×2×2.
4. The method of claim 1, wherein the transformer model framework is provided with four layers, wherein the first two layers are divided into the shallow portion and the last two layers are divided into the deep portion.
5. The method of claim 1, wherein the three-dimensional neighborhood is set to 5×5×5.
6. The method of claim 1, wherein the number of unified transformer modules included in each layer of the transformer model framework is different.
7. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor realizes the steps of the method according to any of claims 1 to 6.
8. An electronic device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when the program is executed.
CN202210010323.3A 2022-01-05 2022-01-05 Lightweight video classification method with unified convolution and self-attention Active CN114973049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210010323.3A CN114973049B (en) 2022-01-05 2022-01-05 Lightweight video classification method with unified convolution and self-attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210010323.3A CN114973049B (en) 2022-01-05 2022-01-05 Lightweight video classification method with unified convolution and self-attention

Publications (2)

Publication Number Publication Date
CN114973049A (en) 2022-08-30
CN114973049B (en) 2024-04-26

Family

ID=82975103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210010323.3A Active CN114973049B (en) 2022-01-05 2022-01-05 Lightweight video classification method with unified convolution and self-attention

Country Status (1)

Country Link
CN (1) CN114973049B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761437B (en) * 2022-11-09 2024-02-06 北京百度网讯科技有限公司 Image processing method, training method and electronic equipment based on vision converter
CN115879119B (en) * 2023-03-02 2023-05-16 北京科技大学 Robust vision transformer visual perception method and device for resisting general patch attack
CN117274184B (en) * 2023-09-19 2024-05-28 河北大学 Kidney cancer PET-CT image-specific prediction ki-67 expression method
CN117351003B (en) * 2023-11-24 2024-02-23 华南理工大学 Multi-model integrated multi-phase MRI tumor classification method based on video actions
CN117935251A (en) * 2024-03-22 2024-04-26 济南大学 Food identification method and system based on aggregated attention

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
WO2021258084A1 (en) * 2021-07-29 2021-12-23 Innopeak Technology, Inc. Spatio-temporal pseudo three-dimensional (stp-3d) network for performing video action recognition
CN113850135A (en) * 2021-08-24 2021-12-28 中国船舶重工集团公司第七0九研究所 Dynamic gesture recognition method and system based on time shift frame

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210043995A (en) * 2019-10-14 2021-04-22 삼성전자주식회사 Model training method and apparatus, and sequence recognition method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021258084A1 (en) * 2021-07-29 2021-12-23 Innopeak Technology, Inc. Spatio-temporal pseudo three-dimensional (stp-3d) network for performing video action recognition
CN113850135A (en) * 2021-08-24 2021-12-28 中国船舶重工集团公司第七0九研究所 Dynamic gesture recognition method and system based on time shift frame
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer

Also Published As

Publication number Publication date
CN114973049A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN114973049B (en) Lightweight video classification method with unified convolution and self-attention
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN112070044B (en) Video object classification method and device
CN112862828B (en) Semantic segmentation method, model training method and device
CN112950471A (en) Video super-resolution processing method and device, super-resolution reconstruction model and medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
WO2023174098A1 (en) Real-time gesture detection method and apparatus
Okawara et al. Action recognition from a single coded image
Greco et al. Gender recognition in the wild: a robustness evaluation over corrupted images
CN113066018A (en) Image enhancement method and related device
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN116205962A (en) Monocular depth estimation method and system based on complete context information
CN115082306A (en) Image super-resolution method based on blueprint separable residual error network
Kavitha et al. Convolutional Neural Networks Based Video Reconstruction and Computation in Digital Twins.
Yi et al. Elanet: effective lightweight attention-guided network for real-time semantic segmentation
Wang et al. An enhanced 3DCNN‐ConvLSTM for spatiotemporal multimedia data analysis
US11948090B2 (en) Method and apparatus for video coding
Huang et al. Dynamic sign language recognition based on CBAM with autoencoder time series neural network
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
WO2023174256A1 (en) Data compression method and related device
Bing et al. Collaborative image compression and classification with multi-task learning for visual Internet of Things
CN115082840B (en) Action video classification method and device based on data combination and channel correlation
CN115115972A (en) Video processing method, video processing apparatus, computer device, medium, and program product
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant