WO2022104293A1 - Multi-modal video transformer (MM-ViT) for compressed video action recognition


Info

Publication number
WO2022104293A1
Authority
WO
WIPO (PCT)
Application number
PCT/US2021/065233
Other languages
French (fr)
Inventor
Jiawei Chen
Chiu Man HO
Original Assignee
Innopeak Technology, Inc.
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Publication of WO2022104293A1 publication Critical patent/WO2022104293A1/en


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 10/7715 — Feature extraction, e.g., by transforming the feature space, e.g., multi-dimensional scaling [MDS]; mappings, e.g., subspace methods
    • G06V 20/44 — Scenes; scene-specific elements in video content; event detection
    • G06V 20/46 — Extracting features or characteristics from the video content, e.g., video fingerprints, representative shots or key frames

Definitions

  • the present disclosure relates, in general, to methods, systems, and apparatuses for implementing neural network, artificial intelligence (“Al”), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing multi-modal video transformer (“MM-ViT”) for performing compressed video action recognition.
  • the techniques of this disclosure generally relate to tools and techniques for implementing neural network, Al, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing MM-ViT for performing compressed video action recognition.
  • a method may be provided for training a multi-modal video transformer neural network to perform compressed video action recognition.
  • the method may be implemented by a computing system and may comprise generating a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file, and/or the like; calculating at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial- temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; and calculating an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score.
  • the method may further comprise implementing, using a multi-modal video transformer ("MM- ViT”) neural network, an artificial intelligence (“Al”) model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and training the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file.
  • a system may be provided that is operable to perform the method as described above, for training a multi-modal video transformer neural network to perform compressed video action recognition.
  • the system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor.
  • the first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: generate a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file; calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; implement, using a multi-modal video transformer ("MM-ViT") neural network, an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file.
  • a method may be provided for implementing a trained multi-modal video transformer neural network to perform compressed video action recognition.
  • the method may be implemented by a computing system and may comprise generating a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file, and/or the like; calculating at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; and calculating an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score.
  • the method may further comprise implementing, using a trained multi-modal video transformer (“MM-ViT”) neural network, a trained artificial intelligence (“Al”) model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and outputting, using the computing system, the generated video action classification of the one or more actions captured in the compressed video file.
  • MM-ViT multi-modal video transformer
  • Al trained artificial intelligence
  • a system may be provided that is operable to perform the method as described above, for implementing a trained multi-modal video transformer neural network to perform compressed video action recognition.
  • the system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor.
  • the first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: generate a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file, and/or the like; calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; implement, using a trained multi-modal video transformer ("MM-ViT") neural network, a trained artificial intelligence ("Al") model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and output the generated video action classification of the one or more actions captured in the compressed video file.
  • FIG. 1 is a schematic diagram illustrating a system for implementing multi-modal video transformer ("MM-ViT”) for performing compressed video action recognition, in accordance with various embodiments.
  • FIGs. 2A and 2B are diagrams illustrating various non-limiting examples of compressed video frames that may be used by a MM-ViT for performing compressed video action recognition and corresponding MM-ViT focus, in accordance with various embodiments.
  • FIGs. 3A-3G are schematic block flow diagrams illustrating non-limiting examples of training of a MM-ViT neural network and inferencing by a trained MM-ViT neural network for performing compressed video action recognition, in accordance with various embodiments.
  • Figs. 4A-4F are flow diagrams illustrating a method for implementing training of a MM-ViT neural network and inferencing by a trained MM-ViT neural network for performing compressed video action recognition, in accordance with various embodiments.
  • FIG. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig. 6 is a block diagram illustrating a networked system of computers, computing systems, or system hardware architecture, which can be used in accordance with various embodiments.
  • Various embodiments provide tools and techniques for implementing neural network, artificial intelligence (“Al”), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing multi-modal video transformer (“MM-ViT”) for performing compressed video action recognition.
  • a computing system may be used to train a multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "training" or the like). The computing system may then use the trained multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "inferencing" or the like).
  • the computing system may receive a request to train the MM-ViT neural network to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like.
  • Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
  • the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like.
  • the computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like.
  • the computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score.
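  • Purely for illustration, the multi-head attention score and weighted-sum-of-values computation described above can be sketched with standard scaled dot-product attention; the token count, embedding width, head count, and function name below are assumptions rather than values from this disclosure.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(tokens, w_q, w_k, w_v, num_heads):
    """tokens: (N, d) vision tokens; w_q, w_k, w_v: (d, d) query/key/value projections."""
    n, d = tokens.shape
    d_h = d // num_heads                                              # per-head dimension
    q = (tokens @ w_q).view(n, num_heads, d_h).transpose(0, 1)        # (heads, N, d_h)
    k = (tokens @ w_k).view(n, num_heads, d_h).transpose(0, 1)
    v = (tokens @ w_v).view(n, num_heads, d_h).transpose(0, 1)
    # Multi-head attention scores over all pairs of tokens
    scores = F.softmax(q @ k.transpose(-2, -1) / d_h ** 0.5, dim=-1)  # (heads, N, N)
    # Output tokens: weighted sum of the value vectors, heads re-concatenated to width d
    return (scores @ v).transpose(0, 1).reshape(n, d)

# Toy usage with hypothetical sizes: 196 tokens of width 768, 12 heads
tokens = torch.randn(196, 768)
w_q, w_k, w_v = (torch.randn(768, 768) * 768 ** -0.5 for _ in range(3))
output_tokens = multi_head_attention(tokens, w_q, w_k, w_v, num_heads=12)
```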
  • the MM-ViT neural network may implement an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file, and may update the Al model based at least in part on the comparison.
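  • A hedged sketch of one training step implementing the prediction/label comparison described above, assuming a generic PyTorch classification head stands in for the Al model and that cross-entropy is used as the training criterion (the disclosure does not fix a particular loss):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the MM-ViT Al model's classification head (768-d token, 101 classes)
model = nn.Sequential(nn.Linear(768, 512), nn.ReLU(), nn.Linear(512, 101))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def training_step(output_token, action_label):
    """output_token: (B, 768) calculated output tokens; action_label: (B,) class indices."""
    prediction = model(output_token)            # first prediction of video action classification
    loss = criterion(prediction, action_label)  # comparison with the video action label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # update the Al model based on the comparison
    return loss.item()

loss = training_step(torch.randn(8, 768), torch.randint(0, 101, (8,)))
```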
  • the computing system may include, without limitation, at least one of a MM-ViT, a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • the MM-ViT neural network may include, but is not limited to, at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network (“ANN”), a recurrent neural network (“RNN”), a convolutional neural network (“CNN”), or a fully convolutional network (“FCN”), and/or the like.
  • the compressed video file may be a video file compressed using a video compression algorithm or format including, but not limited to, one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like.
  • the cross-modal attention model may include, without limitation, one of a merged attention model, a co-attention model, or a shift-merge attention model, and/or the like.
  • generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
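  • As a non-authoritative sketch of the per-modality tokenization just described (decomposition into non-overlapping patches, a learnable linear embedding layer, and added spatiotemporal positional encoding), one set of vision tokens might be produced as follows; a separate instance with its own embedding layer and positional encoding would be used for each of the first, second, and third sets, and all sizes, names, and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Decompose frames into non-overlapping P x P patches, linearly embed them,
    and add learnable spatiotemporal positional encodings (illustrative sizes)."""
    def __init__(self, frames_t=8, channels=3, height=224, width=224, patch=16, dim=768):
        super().__init__()
        self.patch = patch
        n_patches = (height // patch) * (width // patch)
        self.embed = nn.Linear(channels * patch * patch, dim)           # learnable linear embedding layer
        self.pos = nn.Parameter(torch.zeros(frames_t, n_patches, dim))  # spatiotemporal positional encoding

    def forward(self, frames):                                 # frames: (T, C, H, W)
        t, c, h, w = frames.shape
        p = self.patch
        patches = frames.unfold(2, p, p).unfold(3, p, p)       # (T, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(t, -1, c * p * p)
        return self.embed(patches) + self.pos                  # (T, N, dim) patch tokens

tokens_i = PatchTokenizer()(torch.randn(8, 3, 224, 224))       # e.g., tokens for the regular image frames
```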
  • the compressed video file may further comprise a compressed audio file containing an audio waveform.
  • the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens.
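  • The audio path just described might be sketched as follows; the stand-in linear layer replacing the pretrained VGGish projection, the segment count, and all dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def tokenize_audio(waveform, num_segments=8, n_spatial=196, dim=768, audio_dim=128):
    """Partition a 1D waveform into segments, project each to an audio vector, map it to
    the vision-token dimension, add temporal positional encoding, and replicate along the
    spatial dimension of the vision tokens (all sizes illustrative)."""
    segments = waveform.reshape(num_segments, -1)                    # partition into T segments
    # Freshly initialized linear layer used here only as a stand-in for a pretrained VGGish projection
    audio_vec = nn.Linear(segments.shape[1], audio_dim)(segments)    # (T, audio_dim)
    to_vision = nn.Linear(audio_dim, dim)(audio_vec)                 # project to the vision-token space
    temporal_pe = torch.zeros(num_segments, dim)                     # temporal positional encoding (learnable in practice)
    audio_tokens = to_vision + temporal_pe                           # (T, dim) audio tokens
    return audio_tokens.unsqueeze(1).expand(-1, n_spatial, -1)       # replicate to (T, N, dim)

audio_tokens = tokenize_audio(torch.randn(8 * 16000))                # e.g., 8 one-second segments at 16 kHz
```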
  • Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens.
  • calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
  • the computing system may resize the plurality of frames of the compressed video file to a first predetermined two-dimensional size; and, in some cases, may utilize at least one of random horizontal flipping or random cropping of the plurality of frames of the compressed video file to enhance training of the Al model of the MM-ViT neural network; and/or the like.
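  • For example, the resizing and augmentation just described might be sketched with standard torchvision transforms; the target sizes and flip probability below are illustrative assumptions rather than values specified by this disclosure.

```python
import torch
from torchvision import transforms

# Frames resized to a predetermined 2D size, then randomly flipped/cropped for augmentation.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),           # resize to a first predetermined two-dimensional size
    transforms.RandomHorizontalFlip(p=0.5),  # random horizontal flipping
    transforms.RandomCrop(224),              # random cropping
])

frames = torch.rand(8, 3, 320, 480)          # T frames extracted from the compressed stream
augmented = train_transform(frames)          # (8, 3, 224, 224)
```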
  • the computing system may receive a request to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like.
  • Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
  • the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like.
  • the computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like.
  • the computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score.
  • the trained MM-ViT neural network may implement a trained Al model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may output the generated video action classification of the one or more actions captured in the compressed video file.
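  • A minimal inference-time sketch of generating and outputting the video action classification from the calculated output token, assuming a hypothetical trained classification head and label set:

```python
import torch
import torch.nn as nn

trained_model = nn.Linear(768, 101)                    # stand-in for the trained MM-ViT classification head
class_names = [f"action_{i}" for i in range(101)]      # hypothetical label set

@torch.no_grad()
def classify_action(output_token):
    """Generate and output a video action classification from the calculated output token."""
    probs = torch.softmax(trained_model(output_token), dim=-1)
    conf, idx = probs.max(dim=-1)
    return class_names[idx.item()], conf.item()

label, confidence = classify_action(torch.randn(1, 768))
```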
  • generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
  • the compressed video file may further comprise a compressed audio file containing an audio waveform.
  • the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens.
  • Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens.
  • calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
  • This allows for recognizing and classifying actions in compressed video files without having to decode these video files.
  • MM-ViT utilizes modeling of complex inter-modal relations - and, in some cases, utilizing audio data contained in the compressed video files as well - for enhancing action recognition.
  • MM-ViT also provides a new way to understand and organize video content in a search and could make video data more easily accessible to users.
  • some embodiments can improve the functioning of user equipment or systems themselves (e.g., action recognition systems, video action recognition systems, compressed video action recognition systems, machine learning systems, deep learning systems, Al systems, etc.), for example, for training, by, after receiving a request to train the MM-ViT neural network, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like; may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; may implement, using the MM-ViT neural network, an Al model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file; and/or the like.
  • some embodiments can improve the functioning of user equipment or systems themselves, for example, for inferencing, by, after receiving the request to perform compressed video action recognition in the compressed video file, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like; may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; may implement, using the trained MM-ViT neural network, a trained Al model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may output the generated video action classification of the one or more actions captured in the compressed video file; and/or the like.
  • These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, optimized video action recognition functionalities that enable recognizing and classifying actions in compressed video files without having to decode these video files, that utilize modeling of complex inter-modal relations - and, in some cases, utilize audio data contained in the compressed video files as well - for enhancing action recognition, and that provide a new way to understand and organize video content in a search (e.g., by providing video recognition of compressed video files to enable labelling of said compressed video files, with such labelling being usable as searchable metadata or tags, etc.), and thus could make video data more easily accessible to users, at least some of which may be observed or measured by users, game/content developers, and/or user device manufacturers.
  • Figs. 1-6 illustrate some of the features of the method, system, and apparatus for implementing neural network, artificial intelligence (“Al”), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing multi-modal video transformer (“MM-ViT”) for performing compressed video action recognition, as referred to above.
  • the methods, systems, and apparatuses illustrated by Figs. 1-6 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments.
  • the description of the illustrated methods, systems, and apparatuses shown in Figs. 1-6 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments.
  • Fig. 1 is a schematic diagram illustrating a system 100 for implementing MM-ViT for performing compressed video action recognition, in accordance with various embodiments.
  • system 100 may comprise computing system 105 - including, but not limited to, multi-modal video transformer ("MM-ViT") 105a, or the like - and an artificial intelligence (“Al”) system 110.
  • the computing system 105, the MM-ViT 105a, and/or the Al system 110 may be part of an action recognition system 115, or may be separate, yet communicatively coupled with, the action recognition system 115.
  • the computing system 105, the MM-ViT 105a, and/or the Al system 110 may be embodied as an integrated system.
  • Alternatively, the computing system 105, the MM-ViT 105a, and/or the Al system 110 may be embodied as separate, yet communicatively coupled, systems.
  • computing system 105 may include, without limitation, at least one of MM-ViT 105a, a machine learning system, Al system 110, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • the MM-ViT neural network may include, but is not limited to, at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like.
  • System 100 may further comprise one or more content sources 120 (and corresponding database(s) 125) and content distribution system 130 (and corresponding database(s) 135) that communicatively couple with at least one of computing system 105, MM-ViT 105a, Al system 110, and/or action recognition system 115, via network(s) 140.
  • System 100 may further comprise one or more user devices 145a-145n (collectively, "user devices 145" or the like) that communicatively couple with at least one of computing system 105, MM-ViT 105a, Al system 110, and/or action recognition system 115, either directly via wired (not shown) or wireless communications links (denoted by lightning bolt symbols in Fig. 1), or indirectly via network(s) 140.
  • the user devices 145 may each include, but is not limited to, a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a server computer, a digital photo album platform-compliant device, a web-based digital photo album platform-compliant device, a software application ("app") -based digital photo album platform-compliant device, a video sharing platform-compliant device, a web-based video sharing platform-compliant device, an app-based video sharing platform-compliant device, a law enforcement computing system, a security system computing system, a surveillance system computing system, a military computing system, and/or the like.
  • At least one of computing system 105, MM-ViT 105a, Al system 110, and/or action recognition system 115 may be used to train a multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as “training” or the like), in accordance with the various embodiments.
  • the computing system may then use the trained multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as “inferencing” or the like), in accordance with the various embodiments.
  • the computing system may receive a request to train the MM-ViT neural network to perform compressed video action recognition in a compressed video file (e.g., compressed video data 150, or the like), the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like.
  • Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
  • the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like.
  • the computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like.
  • the computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score.
  • the MM-ViT neural network may implement an Al model of the MM-ViT neural network to generate a first prediction of video action classification (e.g., action class(es) 155, or the like) of one or more actions captured in the compressed video file, based at least in part on the calculated output token, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file, and may update the Al model based at least in part on the comparison.
  • the compressed video file may be a video file compressed using a video compression algorithm or format including, but not limited to, one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like.
  • the cross-modal attention model may include, without limitation, one of a merged attention model, a co-attention model, or a shift-merge attention model, and/or the like.
  • generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
  • the compressed video file may further comprise a compressed audio file containing an audio waveform.
  • the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens.
  • Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens.
  • calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
  • the computing system may resize the plurality of frames of the compressed video file to a first predetermined two-dimensional size; and, in some cases, may utilize at least one of random horizontal flipping or random cropping of the plurality of frames of the compressed video file to enhance training of the Al model of the MM-ViT neural network; and/or the like.
  • the computing system may receive a request to perform compressed video action recognition in a compressed video file (e.g., compressed video data 150, or the like), the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like.
  • Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
  • the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like.
  • the computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like.
  • the computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score.
  • the trained MM-ViT neural network may implement a trained Al model of the trained MM-ViT neural network to generate a video action classification (e.g., action class(es) 155, or the like) of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may output the generated video action classification of the one or more actions captured in the compressed video file, in some cases, by displaying the generated video action classification on a display screen on each of at least one user device 145 among the user devices 145a- 145n, or the like, and/or sending the generated video action classification to content distribution system 130 or other system over network(s) 140, or the like.
  • generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
  • the compressed video file may further comprise a compressed audio file containing an audio waveform.
  • the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens.
  • Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens.
  • calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
  • MM-ViT allows for recognizing and classifying actions in compressed video files without having to decode these video files. Further, in some embodiments, MM-ViT utilizes modeling of complex inter-modal relations - and, in some cases, utilizing audio data contained in the compressed video files as well - for enhancing action recognition. MM-ViT also provides a new way to understand and organize video content in a search (e.g., by providing video recognition of compressed video files to enable labelling of said compressed video files, with such labelling being usable as searchable metadata or tags, etc.), and thus could make video data more easily accessible to users.
  • FIGs. 2A and 2B are diagrams illustrating various non-limiting examples 200 and 200' of compressed video frames that may be used by a MM-ViT for performing compressed video action recognition and corresponding MM-ViT focus, in accordance with various embodiments.
  • Referring to Figs. 2A and 2B, examples obtained from applying MM-ViT to, e.g., UCF-101 videos are shown (namely, examples for "Apply Eye Makeup" (as shown in Fig. 2A) and for "Knitting" (as shown in Fig. 2B)).
  • MM-ViT may attend to the relevant regions in the input space. For example, when applied to classify the compressed video "Apply Eye Makeup," the model concentrates on the eye area and the eyeshadow brush (as depicted by highlighting in the lower row of frames for each of the I-Frames, Motion Vector Frames, and Residual Frames in Fig. 2A). In some cases, MM-ViT may also perceive the phrases or words that are semantically aligned with the content of the action. For instance, the model may focus on the words "lower eyelid" when classifying the video "Apply Eye Makeup."
  • Similarly, as shown in Fig. 2B, MM-ViT may also perceive the phrases or words that are semantically aligned with the content of the action. For instance, the model may focus on the words "make the knit stitch" when classifying the video "Knitting."
  • Figs. 3A-3G are schematic block flow diagrams illustrating non-limiting examples of training 300 of a MM-ViT neural network (Figs. 3A-3F) and inferencing 300' by a trained MM-ViT neural network (Figs. 3G and 3B-3F) for performing compressed video action recognition, in accordance with various embodiments.
  • the compressed video file may include a plurality of frames 310, including, but not limited to, at least one or a combination of I-Frames 310a (also referred to as "compressed regular image frames” or the like), the Motion Vectors or Motion Vector Frames 310b (also referred to as “compressed tracked image change frames,” in particular, “compressed image frames containing motion vector data” or the like), the Residuals or Residual Frames 310c (also referred to as “compressed tracked image change frames,” in particular, “compressed image frames containing residual data” or the like), or Audio data 310d (also referred to as "compressed audio file” or “audio waveform” or the like), and/or the like.
  • the Motion Vectors 310b and the Residuals 310c may be contained within P-Frames or B-Frames, which are frames that contain changes in the image relative to a previous frame (for P-Frames) or relative to both the previous frame and a next frame (for B-Frames), respectively.
  • the system may embed compressed video files (at block 320), by generating a first set of vision tokens based on the I-Frames 310a (or the compressed regular image frames, or the like) (at block 325a) and embedding the first set of vision tokens (at block 330a) as an input to MM-ViT 335; generating a second set of vision tokens based on the Motion Vectors 310b (or the compressed image frames containing motion vector data, or the like) (at block 325b) and embedding the second set of vision tokens (at block 330b) as another input to MM-ViT 335; generating a third set of vision tokens based on the Residuals 310c (or the compressed image frames containing residual data, or the like) (at block 325c) and embedding the third set of vision tokens (at block 330c) as yet another input to MM-ViT 335; and generating a fourth set of tokens (audio tokens) based on the Audio data 310d (or the compressed audio file, or the like) and embedding the fourth set of tokens as a further input to MM-ViT 335.
  • MM-ViT may operate on a compressed video clip V, as follows.
  • the vision modalities may include T (number of) sampled I-frames, motion vectors, and residuals of height H and width W.
  • the audio modality may contain a 1D waveform of length T'. This may be formally expressed as $\mathcal{V} = \{\mathcal{I}, \mathcal{MV}, \mathcal{R}, \mathcal{A}\}$, where $\mathcal{I}$, $\mathcal{MV}$, $\mathcal{R}$, and $\mathcal{A}$ represent the I-frame, motion vector, residual, and audio modalities, respectively.
  • the 1D audio waveform may be partitioned into T segments and each segment may be projected to a d-dimensional (e.g., 128-dimensional, or the like) vector using a pretrained VGGish model, or the like.
  • Each RGB I-frame may be decomposed into N non-overlapping patches of size P × P. Then, those patches may be projected into a token embedding using a learnable linear embedding layer (e.g., $E^{\mathcal{I}} \in \mathbb{R}^{d \times 3P^{2}}$, or the like).
  • a spatiotemporal positional encoding (e.g., $PE_{(t,p)} \in \mathbb{R}^{d}$, or the like) may be added to each patch token in order to preserve the positional information.
  • for the audio modality, a linear layer (e.g., $E^{\mathcal{A}} \in \mathbb{R}^{d \times 128}$, or the like) may first be applied to project it to the same dimensional space as the vision tokens, and then a temporal positional encoding $PE_{(t)}$ may be added, where the transformation function that produces the audio segment embeddings may be parameterized by the VGGish model, or the like.
  • the output embedding of the classification token (e.g., $z_{cls}$, or the like) may be used as the aggregated representation for the entire input sequence.
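  • As an illustrative sketch with toy sizes, prepending a learnable classification token and reading out its output embedding as the aggregated representation might look like the following; the stock PyTorch encoder merely stands in for the MM-ViT layers.

```python
import torch
import torch.nn as nn

dim = 128                                                  # toy token width
cls_token = nn.Parameter(torch.zeros(1, dim))              # learnable classification token (assumed)
tokens = torch.randn(3 * 4 * 49, dim)                      # flattened multi-modal tokens (toy: 3 x 4 x 49)
sequence = torch.cat([cls_token, tokens], dim=0)           # prepend the classification token

layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4)   # stand-in for the MM-ViT layers
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoded = encoder(sequence.unsqueeze(1)).squeeze(1)        # (1 + M*T*N, dim)
aggregated = encoded[0]                                    # output embedding of the classification token
logits = nn.Linear(dim, 101)(aggregated)                   # aggregated representation fed to a classifier
```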
  • the MM-ViT neural network (e.g., MM-ViT 335, or the like) - details of example embodiments of which are described in detail with respect to Figs. 3B-3E, or the like - may implement an Al model of the MM-ViT neural network to generate a first prediction of video action classification (e.g., video action classification 340, or the like) of one or more actions captured in the compressed video file (e.g., compressed video file 305), based at least in part on these inputs, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification 340 with a video action label 315 associated with the compressed video file (at block 345), and may update the Al model based at least in part on the comparison (as depicted by the arrow between block 345 and MM-ViT 335 in Fig. 3A).
  • MM-ViT 335 may include four multi-modal video transformer architectures.
  • Fig. 3B depicts an architecture that simply adopts the standard self-attention mechanism to measure all pairwise token relations.
  • Figs. 3C-3E depict variants of this model that factorize the self-attention computation over the space-time-modality 4D volume with distinct strategies, as described below.
  • In Fig. 3B, a first MM-ViT model (also referred to as "MM-ViT I" or "Joint Space-Time-Modality Attention MM-ViT" or the like) is depicted.
  • Each transformer layer of this model measures pairwise interactions between all input tokens.
  • MM-ViT I may include L (number of) transformer layers (e.g., transformer layers 350a, 365a, etc.).
  • the self-attention weights for a query patch token $q$ may be given by the standard scaled dot-product form, $\alpha = \mathrm{softmax}\big(q K^{\top} / \sqrt{d_h}\big)$, where $K$ contains the keys of all input tokens across space, time, and modality and $d_h$ is the per-head dimension.
  • the output token $z_{(p,t)}$ may be further obtained by first computing the weighted sum of the value vectors based on the self-attention weights, followed by a linear projection through a Multi-Layer Perceptron ("MLP") block (e.g., MLP 360, or the like). A residual connection may then be employed to promote robustness.
  • Multi-Head Self-Attention ("MSA") (e.g., Joint Spatial-Temporal Attention MSA 355a, or the like), which may yield better performance, may be employed.
  • MSA may use $h$ sets of $(W_Q, W_K, W_V)$ projection matrices, one set per attention head.
  • the outputs of the h heads may be concatenated and forwarded to the next layer; that is, the resulting encodings are concatenated across heads at the end.
  • because this model allows interactions between all token pairs, it has quadratic computational complexity with respect to the number of tokens.
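  • A hedged sketch of one such transformer layer: joint multi-head self-attention over all space/time/modality tokens, an MLP block, and a residual connection after each. The layer normalization, dimensions, and names are assumptions in the spirit of standard transformer layers, not a definitive implementation of MM-ViT I.

```python
import torch
import torch.nn as nn

class JointSpaceTimeModalityLayer(nn.Module):
    """One MM-ViT-I-style layer (sketch): joint multi-head self-attention over all
    space/time/modality tokens, followed by an MLP block, each with a residual connection."""
    def __init__(self, dim=192, heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                          # z: (B, M*T*N, dim) tokens from all modalities
        x = self.norm1(z)
        a, _ = self.attn(x, x, x)                  # attention over every pair of tokens
        z = z + a                                  # residual connection after attention
        return z + self.mlp(self.norm2(z))         # residual connection after the MLP block

z = torch.randn(1, 3 * 4 * 49, 192)                # e.g., 3 modalities x 4 frames x 49 patches (toy sizes)
z = JointSpaceTimeModalityLayer()(z)
```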
  • In Fig. 3C, a second MM-ViT model (also referred to as "MM-ViT II" or "Factorized Space-Time Attention MM-ViT" or the like) is depicted.
  • this model factorizes the self-attention operation along the spatial and temporal dimensions separately.
  • given a token from layer L-1, one may first conduct self-attention temporally (or spatially) by comparing it with all tokens at the same spatial location across all modalities.
  • then, a spatial attention (or temporal attention) followed by a linear projection may be applied to the resulting tokens.
  • the factorized space-time attention may thus be defined as a temporal attention step followed by a spatial attention step (each implemented as its own MSA operation), with a linear projection applied to the result.
  • This architecture introduces more parameters than MM-ViT I due to one additional MSA operation (in this case, Temporal Attention MSA 355b and Spatial Attention MSA 355c, compared with Spatial-Temporal Attention MSA 355a in MM-ViT I of Fig. 3B).
  • compared with MM-ViT I, MM-ViT II may reduce the computational complexity per patch, since each token attends over the temporal and spatial dimensions in two separate steps rather than over the full space-time-modality volume at once.
  • a third MM-ViT model 335c (also referred to as "MM-ViT III" or "Factorized Space-Time Cross-Modal Attention MM-ViT" or the like) is depicted in Fig. 3D.
  • the third model further factorizes self-attention over the modality dimension.
  • the transformer layers (e.g., transformer layers 350c and 365c, etc.) may attend to the space, time, and modality dimensions sequentially, thus further reducing the computational cost per patch.
  • a patch token from layer l may, for example, be calculated as follows: $z'_{(p,t,m)} = \mathrm{MSA}_{\mathrm{time}}\!\left(\mathrm{LN}(z^{l-1}_{(p,t,m)})\right) + z^{l-1}_{(p,t,m)}$, $z''_{(p,t,m)} = \mathrm{MCA}\!\left(\mathrm{LN}(z'_{(p,t,m)})\right) + z'_{(p,t,m)}$, $z'''_{(p,t,m)} = \mathrm{MSA}_{\mathrm{space}}\!\left(\mathrm{LN}(z''_{(p,t,m)})\right) + z''_{(p,t,m)}$, and $z^{l}_{(p,t,m)} = \mathrm{MLP}\!\left(\mathrm{LN}(z'''_{(p,t,m)})\right) + z'''_{(p,t,m)}$, where LN denotes layer normalization and MCA denotes Multi-Head Cross-Attention, which is specifically designed for modeling cross-modal relations.
  • to provide an effective cross-modal attention (e.g., Cross-Modal Attention MCA 355d, etc.), three different cross-modal attention mechanisms are developed, as described below with respect to Fig. 3F.
  • a fourth MM-ViT model 335d (also referred to as "MM-ViT IV" or "Factorized Local Space-Time Cross-Modal Attention MM-ViT" or the like) is depicted in Fig. 3E.
  • This architecture may restrict the factorized spatial and temporal attention in MM-ViT III to non-overlapping local windows, thereby further reducing the computational cost.
  • Supposing a local spatial window and a local temporal window contain M and F patches, respectively, the computational complexity per patch becomes on the order of M + F (plus the cost of the cross-modal attention), rather than scaling with the full spatial and temporal extents.
  • a convolution layer may be inserted, after the local temporal and spatial attention (e.g., Local Temporal Attention MSA 355e and Local Spatial Attention MSA 355f, etc.), to strengthen the connection between the neighboring windows.
  • the convolution kernel size may be the same as the window size, and the stride size may be equal to 1.
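  • For illustration only, the following is a minimal PyTorch sketch of local (windowed) temporal attention followed by a convolution whose kernel size equals the window size and whose stride equals 1; the window size, shapes, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn
from einops import rearrange

class LocalTemporalAttention(nn.Module):
    """Self-attention restricted to non-overlapping temporal windows of `window`
    positions, followed by a 1-D convolution (kernel size = window size,
    stride 1) that strengthens connections between neighboring windows."""

    def __init__(self, dim: int = 768, heads: int = 12, window: int = 2):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=window, stride=1, padding='same')

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) tokens for one (spatial location, modality);
        # T is assumed to be divisible by the window size.
        b = x.shape[0]
        w = self.norm(rearrange(x, 'b (n f) d -> (b n) f d', f=self.window))  # local windows
        a, _ = self.attn(w, w, w)                        # attention only within each window
        x = x + rearrange(a, '(b n) f d -> b (n f) d', b=b)
        # Convolution across time mixes information between neighboring windows.
        return x + self.conv(x.transpose(1, 2)).transpose(1, 2)

y = LocalTemporalAttention(window=2)(torch.randn(2, 8, 768))
```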
  • a cross-modal attention (e.g., Cross-Modal Attention MCA 355d, etc.) similar to that used for the third MM-ViT model 335c (in Fig. 3D) may be used.
  • the cross-modal attention model may include, without limitation, one of (1) a Merged Attention model, (2) a Co-Attention model, or (3) a Shift-Merge Attention model, and/or the like.
  • in the Shift-Merge Attention model, for example, portions of the value embeddings may be shifted and mixed across the different modalities that share the same spatial and temporal index, with the resulting encoding r then being used by the attention operation in place of the original per-modality values.
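  • For illustration only, the following is a minimal sketch of one way such a shift-and-mix of value embeddings across modalities could be realized; the shift ratio and tensor layout are assumptions for the example.

```python
import torch

def shift_merge_values(v: torch.Tensor, shift_ratio: float = 0.5) -> torch.Tensor:
    """Mix value embeddings across modalities that share the same spatial and
    temporal index by rotating a portion of the channels along the modality axis.

    v: (batch, T, S, M, dim) value vectors for T frames, S spatial positions,
    and M modalities; `shift_ratio` controls how many channels are exchanged."""
    n_shift = int(v.shape[-1] * shift_ratio)
    mixed = v.clone()
    # Roll the first n_shift channels by one step along the modality axis, so each
    # modality receives part of a neighboring modality's values at the same (t, s).
    mixed[..., :n_shift] = torch.roll(v[..., :n_shift], shifts=1, dims=3)
    return mixed

mixed = shift_merge_values(torch.randn(2, 8, 196, 3, 64))
```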
  • FIGs. 4A-4F are flow diagrams illustrating a method 400 for implementing training of a MM-ViT neural network (Figs. 4A-4E) and inferencing by a trained MM-ViT neural network (Figs. 4F and 4B-4E) for performing compressed video action recognition, in accordance with various embodiments.
  • while each of the systems, examples, or embodiments 100, 200, 200', 300, and 300' of Figs. 1, 2A, 2B, 3A-3F, and 3B-3G can operate according to the method 400 illustrated by Figs. 4A-4F (e.g., by executing instructions embodied on a computer readable medium), such systems, examples, or embodiments can each also operate according to other modes of operation and/or perform other suitable procedures.
  • method 400 at block 405, may comprise receiving, using a computing system, a request to train a MM-ViT neural network to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like.
  • Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
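  • For illustration only, the following NumPy sketch shows how a residual frame relates to a regular (reference) frame and per-block motion vectors; the 16 x 16 block size, array layout, and helper name motion_compensate are illustrative assumptions rather than the codec's actual data structures.

```python
import numpy as np

def motion_compensate(reference: np.ndarray, motion_vectors: np.ndarray, block: int = 16) -> np.ndarray:
    """Build a prediction of the current frame by copying block-sized regions of
    the reference frame displaced by their (dy, dx) motion vectors."""
    h, w = reference.shape[:2]
    prediction = np.zeros_like(reference)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            prediction[by:by + block, bx:bx + block] = reference[sy:sy + block, sx:sx + block]
    return prediction

# The residual frame is the pixel difference between the actual frame and its
# motion-compensated prediction derived from the reference frame.
ref = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
cur = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
mvs = np.zeros((4, 4, 2), dtype=int)            # one (dy, dx) vector per 16 x 16 block
residual = cur.astype(np.int16) - motion_compensate(ref, mvs).astype(np.int16)
```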
  • the computing system may comprise at least one of a MM-ViT, a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • the MM-ViT neural network may comprise at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network (“ANN”), a recurrent neural network (“RNN”), a convolutional neural network (“CNN”), or a fully convolutional network (“FCN”), and/or the like.
  • the compressed video file may be a video file compressed using a video compression algorithm or format comprising one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like.
  • method 400 may further comprise performing, using the computing system, at least one of resizing, random horizontal flipping, or random cropping of the plurality of frames, in some cases, to enhance training of the MM-ViT neural network.
  • Method 400 may further comprise, at block 415, generating, using the computing system, a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like.
  • the compressed video file may further comprise a compressed audio file containing an audio waveform.
  • method 400 may further comprise, at optional block 420, generating, using the computing system, audio tokens based at least in part on the compressed audio file.
  • method 400 may comprise calculating, using the computing system, at least one multi-head attention score based at least in part on using the generated plurality of vision tokens (and in some cases, the generated audio tokens also) as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like.
  • the cross-modal attention model may comprise one of a merged attention model, a co-attention model, or a shift-merge attention model, and/or the like, as described in detail with respect to Fig. 3F.
  • Method 400 may further comprise calculating, using the computing system, an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score (block 430).
  • Method 400 may further comprise, at block 435, implementing, using the MM- ViT neural network, an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token.
  • Method 400 may further comprise training the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file (block 440), and updating the Al model based at least in part on the comparison (block 445).
  • Method 400 may subsequently return to the process at block 410, and the processes at blocks 410 to 445 may be repeated as necessary or as desired to enhance training of the Al model of the MM-ViT neural network.
  • generating the plurality of vision tokens may comprise generating a first set of vision tokens (block 450, Fig. 4B), generating a second set of vision tokens (block 455, Fig. 4C), and generating a third set of vision tokens (block 460, Fig. 4D).
  • generating the plurality of vision tokens may comprise generating a first set of vision tokens (block 450), by: decomposing, using the computing system, each of the at least one compressed regular image frame into a plurality of first non-overlapping patches (block 450a); projecting, using the computing system, the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens (block 450b); and adding, using the computing system, first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens (block 450c).
  • generating the plurality of vision tokens may comprise generating a second set of vision tokens (block 455), by: decomposing, using the computing system, each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches (block 455a); projecting, using the computing system, the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens (block 455b); and adding, using the computing system, second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens (block 455c).
  • generating the plurality of vision tokens may comprise generating a third set of vision tokens (block 460), by: decomposing, using the computing system, each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first nonoverlapping patches (block 460a); projecting, using the computing system, the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens (block 460b); and adding, using the computing system, third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens (block 460c).
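  • For illustration only, the following is a minimal PyTorch sketch of such a vision tokenizer (patch decomposition, modality-specific learnable linear embedding, and learnable spatiotemporal positional encoding); the channel counts, patch size of 16, and frame count are assumptions for the example.

```python
import torch
import torch.nn as nn

class VisionTokenizer(nn.Module):
    """Decompose frames into non-overlapping 16 x 16 patches, project each patch
    with a modality-specific learnable linear embedding, and add a learnable
    spatiotemporal positional encoding."""

    def __init__(self, in_channels: int, dim: int = 768, patch: int = 16,
                 frames: int = 8, image_size: int = 224):
        super().__init__()
        n_patches = (image_size // patch) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear embedding layer.
        self.embed = nn.Conv2d(in_channels, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, frames * n_patches, dim))

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, T, channels, H, W) for one modality (I-frame, motion
        # vector, or residual).
        b, t = clip.shape[:2]
        x = self.embed(clip.flatten(0, 1))       # (b*t, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)         # (b*t, n_patches, dim)
        x = x.reshape(b, -1, x.shape[-1])        # (b, t*n_patches, dim)
        return x + self.pos                      # spatiotemporal positional encoding

# One tokenizer per visual modality; the channel counts (3 for I-frames and
# residuals, 2 for motion vectors) are assumptions for the example.
iframe_tok, mv_tok, res_tok = VisionTokenizer(3), VisionTokenizer(2), VisionTokenizer(3)
tokens = torch.cat([iframe_tok(torch.randn(1, 8, 3, 224, 224)),
                    mv_tok(torch.randn(1, 8, 2, 224, 224)),
                    res_tok(torch.randn(1, 8, 3, 224, 224))], dim=1)
```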
  • generating audio tokens may comprise: partitioning, using the computing system, the audio waveform into a plurality of segments (block 465a); projecting, using the computing system, each segment among the plurality of segments to an audio vector (block 465b); applying, using the computing system, a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments (block 465c); adding, using the computing system, temporal positional encoding to the projected audio segments to produce at least one audio token (block 465d); and replicating, using the computing system, each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens (block 465e).
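  • For illustration only, the following is a minimal PyTorch sketch of such an audio tokenizer; the 128-dimensional segment vectors are assumed to come from an audio embedding network (random placeholders are used here), and the token dimension and patch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioTokenizer(nn.Module):
    """Project per-segment audio vectors into the vision-token dimension, add
    temporal positional encoding, and replicate each audio token across the
    spatial dimension so it aligns with the vision tokens."""

    def __init__(self, audio_dim: int = 128, dim: int = 768,
                 frames: int = 8, n_patches: int = 196):
        super().__init__()
        self.proj = nn.Linear(audio_dim, dim)            # project to the same dimensional space
        self.temporal_pos = nn.Parameter(torch.zeros(1, frames, dim))
        self.n_patches = n_patches

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (batch, T, audio_dim), one vector per temporally aligned
        # 1-second audio segment (e.g., produced by an audio embedding network).
        x = self.proj(segments) + self.temporal_pos            # (batch, T, dim)
        x = x.unsqueeze(2).expand(-1, -1, self.n_patches, -1)  # replicate spatially
        return x.flatten(1, 2)                                 # (batch, T*n_patches, dim)

audio_tokens = AudioTokenizer()(torch.randn(1, 8, 128))
```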
  • generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens.
  • method 400 may comprise receiving, using a computing system, a request to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like.
  • each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
  • the computing system may comprise at least one of a MM-ViT, a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • the MM-ViT neural network may comprise at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like.
  • the compressed video file may be a video file compressed using a video compression algorithm or format comprising one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like.
  • method 400 may further comprise performing, using the computing system, at least one of resizing or random cropping of the plurality of frames.
  • the processes at blocks 415-430, as described above with respect to Figs. 4A-4F in the context of training the MM-ViT neural network, may be repeated here with respect to using the trained MM-ViT to perform compressed video action recognition.
  • Method 400 may further comprise, at block 480, implementing, using a trained MM-ViT neural network, an Al model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token.
  • Method 400 may further comprise outputting, using the computing system, the generated video action classification of the one or more actions captured in the compressed video file (block 485).
  • the MM-ViT models described herein were evaluated on three popular video action recognition datasets: UCF-101, Something-Something-V2, and Kinetics-600.
  • UCF-101 contains 13,320 trimmed short videos from 101 action categories. It has three training/testing splits. The average performance of the MM-ViT models across the three splits is provided below.
  • Kinetics-600 contains around 480,000 10-second long videos for 600 action classes.
  • the Something-Something-v2 (“SSv2”) contains about 220,000 videos with a time span from 2 to 6 seconds for 174 action classes. Different from other datasets, SSv2 placed more emphasis on a model's ability to recognize fine-grained actions since the same background scenes can be shared across many classes.
  • the released version of SSv2 has no audio stream, thus only the extracted visual modalities from SSv2 were used to evaluate the MM-ViT models.
  • all compressed videos were converted to MPEG-4 codec, which encoded a video into I-frames and P-frames. On average, an I-frame was followed by 11 P-frames.
  • All the training videos were first resized to 340 x 256. Then, random horizontal flipping (which is omitted for SSv2) and random cropping (224 x 224) were applied to I-frames, motion vectors, and residuals for data augmentation. Patch size was set to 16 x 16 across the visual or vision modalities.
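  • For illustration only, the following is a minimal sketch of such preprocessing using torchvision, assuming the same flip and crop parameters are shared across the I-frame, motion vectors, and residual so the modalities stay spatially aligned; note that flipping motion vectors would, strictly speaking, also require negating their horizontal components, which is omitted here for brevity.

```python
import random
import torch
import torchvision.transforms.functional as TF
from torchvision.transforms import RandomCrop

def augment_triplet(iframe, motion, residual, train: bool = True):
    """Resize to 340 x 256 (width x height), then apply the same random
    horizontal flip and the same random 224 x 224 crop to the I-frame, motion
    vectors, and residual. Inputs are (C, H, W) tensors."""
    frames = [TF.resize(f, [256, 340]) for f in (iframe, motion, residual)]
    if train:
        if random.random() < 0.5:                     # random horizontal flip
            frames = [TF.hflip(f) for f in frames]
        i, j, h, w = RandomCrop.get_params(frames[0], output_size=(224, 224))
        frames = [TF.crop(f, i, j, h, w) for f in frames]
    else:
        frames = [TF.center_crop(f, [224, 224]) for f in frames]
    return frames

out = augment_triplet(torch.rand(3, 480, 640), torch.rand(2, 480, 640), torch.rand(3, 480, 640))
```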
  • the audio waveform was partitioned into 1-second-long segments and projected onto 128-dimensional vectors by VGGish. ViT-B/16 pretrained on ImageNet-21K was used as the backbone and was fine-tuned using SGD with a batch size of 8. The learning rate started from 0.5 and was divided by 10 when the validation accuracy plateaued.
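  • For illustration only, the following is a minimal sketch of such an optimizer and learning-rate schedule in PyTorch; the placeholder model, momentum value, and scheduler patience are assumptions for the example.

```python
import torch

# Placeholder for the MM-ViT backbone/classification head; the momentum value
# and scheduler patience are assumptions for the example.
model = torch.nn.Linear(768, 600)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=1)   # "max" because validation accuracy is tracked

for epoch in range(3):
    # ... training over batches of size 8 would go here ...
    val_accuracy = 0.5                               # placeholder validation metric
    scheduler.step(val_accuracy)                     # divides the lr by 10 when accuracy plateaus
```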
  • the input included 8 uniformly sampled triplets of I-frames, motion vectors, and residuals each with a crop size of 224 x 224, and audio features (which were omitted for SSv2) that were temporally aligned with the visual or vision features.
  • the accuracy was averaged over the three spatial crops (namely, left, center, and right) to obtain the final prediction.
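  • For illustration only, the following is a minimal sketch of such a multi-view inference protocol, assuming mm_vit returns class logits and views is a list of token tensors (e.g., one per sampled clip and spatial crop); the names are placeholders.

```python
import torch

def predict(mm_vit, views):
    """Average class scores over multiple views of one video (e.g., 8 uniformly
    sampled triplets of I-frames / motion vectors / residuals, each with three
    spatial crops) and return the predicted action class."""
    mm_vit.eval()
    with torch.no_grad():
        scores = torch.stack([mm_vit(v).softmax(dim=-1) for v in views]).mean(dim=0)
    return scores.argmax(dim=-1)
```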
  • ablation studies refer to studies of performance of Al systems by removing some components of the Al system to determine their contribution to the overall Al system.
  • Table 1 summarizes the detailed experimental results for each model. All models were initialized with ViT weights that were pretrained on ImageNet-21K. Top-1 accuracy was used to measure the classification performance.
  • MM-ViT I (as shown in Fig. 3B) appeared to underperform compared with the factorized alternatives (MM-ViT II & III), although it incurs the highest computational cost. This may be due to the lack of dedicated parameters to model spatial, temporal, and cross-modal attentions separately.
  • the results show that factorizing self-attention over the input dimensions may consistently improve both recognition accuracy and efficiency - e.g., MM-ViT II (as shown in Fig. 3C) outperforms MM-ViT I by 0.83% on UCF-101 while incurring 32% fewer FLOPs (i.e., floating point operations). Meanwhile, MM-ViT III (Merged-Att.) (as shown in Figs. 3D and 3F) outperforms MM-ViT II by 1.75% in accuracy and requires 3.5% fewer FLOPs.
  • T, S, and M represent temporal, spatial, and cross-modal attention, respectively.
  • the best performing attention order may be "Temporal → Cross-Modal → Spatial."
  • Effect of input modality. [0121] To evaluate the importance of each data modality, an ablation study was conducted by training and evaluating the best performing model (i.e., MM-ViT III with "Merged-Attention") with different modality combinations on UCF-101, as depicted in Table 3.
  • the I-frame is the most essential data modality as removing it alone decreases Top-1 accuracy by 4.11%.
  • the motion vector and residual frame also play important roles in video action recognition, as removing either modality can lead to an accuracy drop of up to 2.54%.
  • the audio modality has a major impact on video action recognition as well, which is confirmed by a significant performance degradation (2.42% drop in Top-1 accuracy) when excluding the audio input. This is likely because the audio contains dynamics and contextual temporal information that are beneficial for video action recognition (e.g., the sound of an axe hitting a tree is discriminative for recognizing "Cutting Tree").
  • Table 4 summarizes the performance of the MM-ViT model and other competing methods on UCF-101.
  • MM-ViT already outperforms all other methods that operate using the same or similar compressed video modalities (e.g., EMV-CNN, DTMV-CNN, CoViAR, DMC-Net, etc.), by up to 6.9% in Top-1 accuracy. This suggests that the explicit reasoning of inter-modal relations in MM-ViT is effective. Further improvements are achieved by incorporating the audio signal (an increase of 2.1% in Top-1 accuracy) and pretraining the model on Kinetics-600 (an increase of 3.5% in Top-1 accuracy). In addition, MM-ViT surpasses all CNN alternatives with or without optical flow, and thus establishes a new state-of-the-art video action recognition model for UCF-101.
  • MM-ViT surpasses TimeSformer and ViViT, both of which are also pure-transformer models.
  • MM-ViT is more efficient in terms of inference FLOPs (i.e., floating point operations). This confirms that the additional motion vector and residual modalities used by MM-ViT provide important complementary motion features, which could benefit classification on "temporally-heavy" datasets like SSv2.
  • MM-ViT consistently performs better than the CNN counterparts that operate in the single RGB modality (by > 1.5% in Top-1 accuracy). Although it slightly underperforms compared to CNN-based MSNet-R50 and bLVNet, which use optical flow as an auxiliary modality, MM-ViT eliminates the huge burden of optical flow computation and storage.
  • Kinetics-600 is a larger video classification dataset, and performance comparisons using Kinetics-600 are shown in Table 6.
  • FIG. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig. 5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., computing system 105, multi-modal video transformer ("MM-ViT") 105a, 335, and 335', artificial intelligence (“Al”) system 110, action recognition system 115, content source(s) 120, content distribution system 130, and user devices 145a- 145n, etc.), as described above.
  • FIG. 5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate.
  • Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
  • the computer or hardware system 500 - which might represent an embodiment of the computer or hardware system (i.e., computing system 105, MM-ViT 105a, 335, and 335', Al system 110, action recognition system 115, content source(s) 120, content distribution system 130, and user devices 145a- 145n, etc.), described above with respect to Figs. 1-4 - is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include one or more processors 510, including, without limitation, one or more general- purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
  • the computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random access memory (“RAM”) and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
  • the computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a BluetoothTM device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like.
  • the communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein.
  • the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
  • the computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • a set of these instructions and/or code might be encoded and/or stored on a non- transitory computer readable storage medium, such as the storage device(s) 525 described above.
  • the storage medium might be incorporated within a computer system, such as the system 500.
  • the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon.
  • These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
  • some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention.
  • some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535.
  • Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525.
  • execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
  • machine readable medium and “computer readable medium,” as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion.
  • various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals).
  • a computer readable medium is a non-transitory, physical, and/or tangible storage medium.
  • a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like.
  • Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525.
  • Volatile media includes, without limitation, dynamic memory, such as the working memory 535.
  • a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices).
  • transmission media can also take the form of waves (including, without limitation, radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications).
  • Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution.
  • the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer.
  • a remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500.
  • These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
  • the communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions.
  • the instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.
  • a set of embodiments comprises methods and systems for implementing neural network, artificial intelligence (“Al”), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing multi-modal video transformer (“MM-ViT”) for performing compressed video action recognition.
  • Fig. 6 illustrates a schematic diagram of a system 600 that can be used in accordance with one set of embodiments.
  • the system 600 can include one or more user computers, user devices, or customer devices 605.
  • a user computer, user device, or customer device 605 can be a general purpose personal computer (including, merely by way of example, desktop computers, tablet computers, laptop computers, handheld computers, and the like, running any appropriate operating system, several of which are available from vendors such as Apple, Microsoft Corp., and the like), cloud computing devices, a server(s), and/or a workstation computer(s) running any of a variety of commercially-available UNIXTM or UNIX-like operating systems.
  • a user computer, user device, or customer device 605 can also have any of a variety of applications, including one or more applications configured to perform methods provided by various embodiments (as described above, for example), as well as one or more office applications, database client and/or server applications, and/or web browser applications.
  • a user computer, user device, or customer device 605 can be any other electronic device, such as a thin-client computer, Internet-enabled mobile telephone, and/or personal digital assistant, capable of communicating via a network (e.g., the network(s) 610 described below) and/or of displaying and navigating web pages or other types of electronic documents.
  • Although the system 600 is shown with two user computers, user devices, or customer devices 605, any number of user computers, user devices, or customer devices can be supported.
  • Some embodiments operate in a networked environment, which can include a network(s) 610.
  • the network(s) 610 can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of commercially- available (and/or free or proprietary) protocols, including, without limitation, TCP/IP, SNATM, IPXTM, AppleTalkTM, and the like.
  • the network(s) 610 (similar to network(s) 140 of Fig. 1, or the like) can include, without limitation, a local area network ("LAN"); a wide-area network ("WAN"); a wireless wide area network ("WWAN"); a virtual network, such as a virtual private network ("VPN"); a public switched telephone network ("PSTN"); a wireless network, including, without limitation, a network operating under any of the IEEE 802.11 suite of protocols, the BluetoothTM protocol known in the art, and/or any other wireless protocol; and/or any combination of these and/or other networks.
  • the network might include an access network of the service provider (e.g., an Internet service provider (“ISP”)).
  • the network might include a core network of the service provider, and/or the Internet.
  • Embodiments can also include one or more server computers 615.
  • Each of the server computers 615 may be configured with an operating system, including, without limitation, any of those discussed above, as well as any commercially (or freely) available server operating systems.
  • Each of the servers 615 may also be running one or more applications, which can be configured to provide services to one or more clients 605 and/or other servers 615.
  • one of the servers 615 might be a data server, a web server, a cloud computing device(s), or the like, as described above.
  • the data server might include (or be in communication with) a web server, which can be used, merely by way of example, to process requests for web pages or other electronic documents from user computers 605.
  • the web server can also run a variety of server applications, including HTTP servers, FTP servers, CGI servers, database servers, Java servers, and the like.
  • the web server may be configured to serve web pages that can be operated within a web browser on one or more of the user computers 605 to perform methods of the invention.
  • the server computers 615 might include one or more application servers, which can be configured with one or more applications accessible by a client running on one or more of the client computers 605 and/or other servers 615.
  • the server(s) 615 can be one or more general purpose computers capable of executing programs or scripts in response to the user computers 605 and/or other servers 615, including, without limitation, web applications (which might, in some cases, be configured to perform methods provided by various embodiments).
  • a web application can be implemented as one or more scripts or programs written in any suitable programming language, such as JavaTM, C, C#TM or C++, and/or any scripting language, such as Perl, Python, or TCL, as well as combinations of any programming and/or scripting languages.
  • the application server(s) can also include database servers, including, without limitation, those commercially available from OracleTM, MicrosoftTM, SybaseTM, IBMTM, and the like, which can process requests from clients (including, depending on the configuration, dedicated database clients, API clients, web browsers, etc.) running on a user computer, user device, or customer device 605 and/or another server 615.
  • an application server can perform one or more of the processes for implementing neural network, Al, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing MM-ViT for performing compressed video action recognition, as described in detail above.
  • Data provided by an application server may be formatted as one or more web pages (comprising HTML, JavaScript, etc., for example) and/or may be forwarded to a user computer 605 via a web server (as described above, for example).
  • a web server might receive web page requests and/or input data from a user computer 605 and/or forward the web page requests and/or input data to an application server.
  • a web server may be integrated with an application server.
  • one or more servers 615 can function as a file server and/or can include one or more of the files (e.g., application code, data files, etc.) necessary to implement various disclosed methods, incorporated by an application running on a user computer 605 and/or another server 615.
  • a file server can include all necessary files, allowing such an application to be invoked remotely by a user computer, user device, or customer device 605 and/or server 615.
  • the system can include one or more databases 620a-620n (collectively, "databases 620").
  • The location of each of the databases 620 is discretionary: merely by way of example, a database 620a might reside on a storage medium local to (and/or resident in) a server 615a (and/or a user computer, user device, or customer device 605).
  • a database 620n can be remote from any or all of the computers 605, 615, so long as it can be in communication (e.g., via the network 610) with one or more of these.
  • a database 620 can reside in a storage-area network ("SAN") familiar to those skilled in the art.
  • the database 620 can be a relational database, such as an Oracle database, that is adapted to store, update, and retrieve data in response to SQL-formatted commands.
  • the database might be controlled and/or maintained by a database server, as described above, for example.
  • system 600 may further comprise computing system 625 (similar to computing system 105 of Fig. 1, or the like) - including, but not limited to, multi-modal video transformer ("MM-ViT") 625a (similar to MM-ViT 105a of Fig. 1, or the like) - and artificial intelligence (“Al") system 630 (similar to Al system 110 of Fig. 1, or the like), each part of an action recognition system 635 (similar to action recognition system 115 of Fig. 1, or the like).
  • System 600 may further comprise one or more content sources 640 and corresponding database(s) 645 (similar to one or more content sources 120 and corresponding database(s) 125 of Fig. 1, or the like).
  • computing system 625, MM-ViT 625a, Al system 630, and/or action recognition system 635 may be used to train a multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "training” or the like), in accordance with the various embodiments.
  • the computing system may then use the trained multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as “inferencing” or the like), in accordance with the various embodiments.
  • the computing system may receive a request to train the MM-ViT neural network to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like.
  • Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
  • the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like.
  • the computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like.
  • the computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score.
  • the MM-ViT neural network may implement an Al model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file, and may update the Al model based at least in part on the comparison.
  • the compressed video file may be a video file compressed using a video compression algorithm or format including, but not limited to, one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like.
  • the cross- modal attention model may include, without limitation, one of a merged attention model, a co-attention model, or a shift-merge attention model, and/or the like.
  • generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third nonoverlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
  • the compressed video file may further comprise a compressed audio file containing an audio waveform.
  • the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens.
  • Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens.
  • calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
  • the computing system may resize the plurality of frames of the compressed video file in a first predetermined two-dimensional size; and, in some cases, may utilize at least one of random horizontal flipping or random cropping of the plurality of frames of the compressed video file to enhance training of the Al model of the MM-ViT neural network; and/or the like.
  • the computing system may receive a request to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like.
  • Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
  • the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like.
  • the computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like.
  • the computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score.
  • the trained MM-ViT neural network may implement a trained Al model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may output the generated video action classification of the one or more actions captured in the compressed video file.
  • generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens.
  • Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third nonoverlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
  • the compressed video file may further comprise a compressed audio file containing an audio waveform.
  • the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens.
  • Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens.
  • calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Novel tools and techniques are provided for implementing multi-modal video transformer ("MM-ViT") for performing compressed video action recognition. In various embodiments, for training, a computing system may implement an artificial intelligence ("AI") model of a MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in a compressed video file, may train the AI model of the MM-ViT neural network to perform compressed video action recognition based on comparison of the first prediction of video action classification with a video action label associated with the compressed video file, and may update the AI model accordingly. For inferencing, the computing system may implement a trained AI model of a trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, and may output the generated video action classification.

Description

MULTI-MODAL VIDEO TRANSFORMER (MM-VIT) FOR
COMPRESSED VIDEO ACTION RECOGNITION
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Patent Application Ser. No. 63/271,809 (the " '809 Application"), filed October 26, 2021, by Jiawei Chen et al. (attorney docket no. INNOPEAK-1021-147-P), entitled, "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition," the disclosure of which is incorporated herein by reference in its entirety for all purposes.
COPYRIGHT STATEMENT
[0002] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
FIELD
[0003] The present disclosure relates, in general, to methods, systems, and apparatuses for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing multi-modal video transformer ("MM-ViT") for performing compressed video action recognition. BACKGROUND
[0004] Most conventional video action recognition systems and techniques solely utilize decoded RGB frames, thereby ignoring other modalities in compressed video files. Those conventional video action recognition systems and techniques that do utilize modalities in compressed video files lack the capability to model complex inter-modal relations for action recognition. Conventional video action recognition systems and techniques also fail to utilize audio waveforms that may be contained in compressed video files for action recognition.
[0005] Hence, there is a need for more robust and scalable solutions for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications.
SUMMARY
[0006] The techniques of this disclosure generally relate to tools and techniques for implementing neural network, Al, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing MM-ViT for performing compressed video action recognition.
[0007] In an aspect, a method may be provided for training a multi-modal video transformer neural network to perform compressed video action recognition. The method may be implemented by a computing system and may comprise generating a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file, and/or the like; calculating at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial- temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; and calculating an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score. The method may further comprise implementing, using a multi-modal video transformer ("MM- ViT") neural network, an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and training the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file. [0008] In another aspect, a system may be provided that is operable to perform the method as described above, for training a multi-modal video transformer neural network to perform compressed video action recognition. The system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor. 
The first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: generate a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file; calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; implement, using a multi-modal video transformer ("MM-ViT") neural network, an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file.
[0009] In yet another aspect, a method may be provided for implementing a trained multi-modal video transformer neural network to perform compressed video action recognition. The method may be implemented by a computing system and may comprise generating a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file, and/or the like; calculating at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; and calculating an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score. The method may further comprise implementing, using a trained multi-modal video transformer ("MM-ViT") neural network, a trained artificial intelligence ("Al") model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and outputting, using the computing system, the generated video action classification of the one or more actions captured in the compressed video file.
[0010] In still another aspect, a system may be provided that is operable to perform the method as described above, for implementing a trained multi-modal video transformer neural network to perform compressed video action recognition. The system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor. The first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: generate a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file, and/or the like; calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; implement, using a trained multi-modal video transformer ("MM-ViT") neural network, a trained artificial intelligence ("Al") model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and output the generated video action classification of the one or more actions captured in the compressed video file.
[0011] Various modifications and additions can be made to the embodiments discussed without departing from the scope of the invention. For example, while the embodiments described above refer to particular features, the scope of this invention also includes embodiments having different combinations of features and embodiments that do not include all of the above-described features.
[0012] The details of one or more aspects of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques described in this disclosure will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.
[0014] Fig. 1 is a schematic diagram illustrating a system for implementing multi-modal video transformer ("MM-ViT") for performing compressed video action recognition, in accordance with various embodiments.
[0015] Figs. 2A and 2B are diagrams illustrating various non-limiting examples of compressed video frames that may be used by a MM-ViT for performing compressed video action recognition and corresponding MM-ViT focus, in accordance with various embodiments.
[0016] Figs. 3A-3G are schematic block flow diagrams illustrating non-limiting examples of training of a MM-ViT neural network and inferencing by a trained MM-ViT neural network for performing compressed video action recognition, in accordance with various embodiments.

[0017] Figs. 4A-4F are flow diagrams illustrating a method for implementing training of a MM-ViT neural network and inferencing by a trained MM-ViT neural network for performing compressed video action recognition, in accordance with various embodiments.
[0018] Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
[0019] Fig. 6 is a block diagram illustrating a networked system of computers, computing systems, or system hardware architecture, which can be used in accordance with various embodiments.
DETAILED DESCRIPTION
[0020] Overview
[0021] Various embodiments provide tools and techniques for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing multi-modal video transformer ("MM-ViT") for performing compressed video action recognition.
[0022] In various embodiments, a computing system may be used to train a multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "training" or the like). The computing system may then use the trained multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "inferencing" or the like).
[0023] For training the MM-ViT neural network to perform compressed video action recognition, the computing system may receive a request to train the MM-ViT neural network to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like. Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame. After receiving the request to train the MM-ViT neural network, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like. The computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like. The computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score. The MM-ViT neural network may implement an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file, and may update the Al model based at least in part on the comparison.
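For concreteness, the training flow just described may be sketched as a single optimization step. The sketch below is a minimal, non-limiting illustration in PyTorch; the MM-ViT-style model object, the modality dictionary, and the optimizer interface are assumed placeholders rather than elements defined by this disclosure.

```python
# Hypothetical training step for an MM-ViT-style classifier (PyTorch sketch).
# `model` stands in for the tokenization and transformer stages described above;
# it is not a reference implementation of the disclosed MM-ViT.
import torch
import torch.nn as nn

def train_step(model: nn.Module,
               optimizer: torch.optim.Optimizer,
               clip_modalities: dict,       # {"iframe": ..., "motion": ..., "residual": ..., "audio": ...}
               action_label: torch.Tensor   # integer class label for the clip
               ) -> float:
    model.train()
    optimizer.zero_grad()

    # Forward pass: the model internally tokenizes each modality, computes
    # multi-head attention scores, and aggregates the output token into class
    # logits (the "first prediction of video action classification").
    logits = model(clip_modalities)                   # shape: (1, num_classes)

    # Compare the prediction with the video action label and update the model.
    loss = nn.functional.cross_entropy(logits, action_label.view(1))
    loss.backward()
    optimizer.step()
    return loss.item()
```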
[0024] In some embodiments, the computing system may include, without limitation, at least one of a MM-ViT, a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. In some instances, the MM-ViT neural network may include, but is not limited to, at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like. In some cases, the compressed video file may be a video file compressed using a video compression algorithm or format including, but not limited to, one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like. In some cases, the cross-modal attention model may include, without limitation, one of a merged attention model, a co-attention model, or a shift-merge attention model, and/or the like.

[0025] According to some embodiments, generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens. Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens. Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
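A minimal sketch of the patch tokenization just described (decompose a frame into non-overlapping P x P patches, project each patch with a learnable linear embedding layer, and add a spatiotemporal positional encoding) is shown below for one vision modality. The class name, the learnable positional-encoding parameterization, and the tensor layout are illustrative assumptions, not the disclosed implementation.

```python
# Illustrative patch tokenizer for one vision modality (I-frames, motion
# vectors, or residuals); a sketch only.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, channels: int, patch: int, dim: int, t: int, n: int):
        super().__init__()
        self.patch = patch
        # Learnable linear embedding layer projecting flattened patches to dim.
        self.embed = nn.Linear(channels * patch * patch, dim)
        # Learnable spatiotemporal positional encoding, one vector per (p, t).
        self.pos = nn.Parameter(torch.zeros(1, t * n, dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) -> non-overlapping P x P patches.
        b, t, c, h, w = frames.shape
        p = self.patch
        x = frames.unfold(3, p, p).unfold(4, p, p)         # (B, T, C, H/p, W/p, p, p)
        x = x.permute(0, 1, 3, 4, 2, 5, 6).reshape(b, t, -1, c * p * p)
        tokens = self.embed(x)                             # (B, T, N, dim)
        tokens = tokens.reshape(b, -1, tokens.shape[-1])   # (B, T*N, dim)
        return tokens + self.pos                           # add positional encoding
```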
[0026] In some embodiments, the compressed video file may further comprise a compressed audio file containing an audio waveform. In such cases, the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens. Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens. In some instances, calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
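The audio tokenization steps above may be sketched as follows; the fixed per-segment length, the simple linear stand-in for the pretrained VGGish encoder, and the layer names are assumptions made only for illustration.

```python
# Illustrative audio tokenizer: partition the waveform into T segments, map each
# segment to a 128-d vector (a stand-in for a pretrained VGGish encoder), project
# it to the vision-token dimension, add a temporal positional encoding, and
# replicate each audio token N times along the spatial dimension.  Sketch only.
import torch
import torch.nn as nn

class AudioTokenizer(nn.Module):
    def __init__(self, dim: int, t: int, n: int, seg_len: int = 1600, seg_feat: int = 128):
        super().__init__()
        self.t, self.n = t, n
        self.segment_encoder = nn.Linear(seg_len, seg_feat)  # placeholder for VGGish
        self.proj = nn.Linear(seg_feat, dim)                 # project to vision-token space
        self.temporal_pos = nn.Parameter(torch.zeros(1, t, dim))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (B, T * seg_len), i.e. T equal-length segments (assumed).
        b = waveform.shape[0]
        segments = waveform.reshape(b, self.t, -1)
        feats = self.segment_encoder(segments)               # (B, T, 128)
        tokens = self.proj(feats) + self.temporal_pos        # (B, T, dim)
        # Replicate each audio token along the spatial dimension (N positions).
        tokens = tokens.unsqueeze(2).repeat(1, 1, self.n, 1)
        return tokens.reshape(b, self.t * self.n, -1)
```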
[0027] According to some embodiments, the computing system may resize the plurality of frames of the compressed video file to a first predetermined two-dimensional size; and, in some cases, may utilize at least one of random horizontal flipping or random cropping of the plurality of frames of the compressed video file to enhance training of the Al model of the MM-ViT neural network; and/or the like.
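A brief, non-limiting example of such preprocessing using torchvision transforms is shown below; the resize and crop sizes are assumed values, not sizes prescribed by this disclosure.

```python
# Illustrative resize plus random-flip/random-crop augmentation for training.
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((256, 256)),           # resize frames to a predetermined 2D size (assumed)
    T.RandomHorizontalFlip(p=0.5),  # random horizontal flipping
    T.RandomCrop(224),              # random cropping
])

frames = torch.rand(8, 3, 300, 400)   # e.g., a stack of 8 frames (C, H, W each)
augmented = train_transform(frames)   # -> (8, 3, 224, 224)
```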
[0028] For inferencing by a trained MM-ViT neural network to perform compressed video action recognition, the computing system may receive a request to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like. Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame. After receiving the request to perform compressed video action recognition in the compressed video file, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like. The computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like. The computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score. The trained MM-ViT neural network may implement a trained Al model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may output the generated video action classification of the one or more actions captured in the compressed video file.
[0029] According to some embodiments, generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens. Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens. Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
[0030] In some embodiments, the compressed video file may further comprise a compressed audio file containing an audio waveform. In such cases, the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens. Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens. In some instances, calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
[0031] In the various aspects described herein, multi-modal video transformer ("MM-ViT") is provided for performing compressed video action recognition. This allows for recognizing and classifying actions in compressed video files without having to decode these video files. Further, in some embodiments, MM-ViT utilizes modeling of complex inter-modal relations - and, in some cases, utilizing audio data contained in the compressed video files as well - for enhancing action recognition. MM-ViT also provides a new way to understand and organize video content in a search and could make video data more easily accessible to users.
[0032] These and other aspects of the system and method for implementing multi-modal video transformer ("MM-ViT") for performing compressed video action recognition are described in greater detail with respect to the figures.
[0033] The following detailed description illustrates a few embodiments in further detail to enable one of skill in the art to practice such embodiments. The described examples are provided for illustrative purposes and are not intended to limit the scope of the invention.

[0034] In the following description, for the purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments of the present invention may be practiced without some of these details. In other instances, some structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.
[0035] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth used should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered non-exclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.
[0036] Various embodiments as described herein - while embodying (in some cases) software products, computer-performed methods, and/or computer systems - represent tangible, concrete improvements to existing technological areas, including, without limitation, action recognition technology, video action recognition technology, compressed video action recognition technology, machine learning technology, deep learning technology, Al technology, and/or the like. In other aspects, some embodiments can improve the functioning of user equipment or systems themselves (e.g., action recognition systems, video action recognition systems, compressed video action recognition systems, machine learning systems, deep learning systems, Al systems, etc.), for example, for training, by, after receiving a request to train the MM-ViT neural network, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like; may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; may implement, using a MM-ViT neural network, an Al model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file, and may update the Al model based at least in part on the comparison; and/or the like.
Alternatively, or additionally, some embodiments can improve the functioning of user equipment or systems themselves, for example, for inferencing, by, after receiving the request to perform compressed video action recognition in the compressed video file, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like; may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like; may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; may implement, using the trained MM-ViT neural network, a trained Al model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may output the generated video action classification of the one or more actions captured in the compressed video file; and/or the like.
[0037] In particular, to the extent any abstract concepts are present in the various embodiments, those concepts can be implemented as described herein by devices, software, systems, and methods that involve novel functionality (e.g., steps or operations), such as, training and implementing multi-modal video transformer neural network Al to perform video action classification of actions captured in compressed video files, and/or the like, to name a few examples, that extend beyond mere conventional computer processing operations. These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, optimized video action recognition functionalities that enable recognizing and classifying actions in compressed video files without having to decode these video files, that utilize modeling of complex inter-modal relations - and, in some cases, utilizing audio data contained in the compressed video files as well - for enhancing action recognition, and that provide a new way to understand and organize video content in a search (e.g., by providing video recognition of compressed video files to enable labelling of said compressed video files, with such labelling being usable as searchable metadata or tags, etc.), and thus could make video data more easily accessible to users, at least some of which may be observed or measured by users, game/content developers, and/or user device manufacturers.
[0038] Some Embodiments
[0039] We now turn to the embodiments as illustrated by the drawings. Figs. 1-6 illustrate some of the features of the method, system, and apparatus for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing multi-modal video transformer ("MM-ViT") for performing compressed video action recognition, as referred to above. The methods, systems, and apparatuses illustrated by Figs. 1-6 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments. The description of the illustrated methods, systems, and apparatuses shown in Figs. 1-6 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments.
[0040] With reference to the figures, Fig. 1 is a schematic diagram illustrating a system 100 for implementing MM-ViT for performing compressed video action recognition, in accordance with various embodiments.
[0041] In the non-limiting embodiment of Fig. 1, system 100 may comprise computing system 105 - including, but not limited to, multi-modal video transformer ("MM-ViT") 105a, or the like - and an artificial intelligence ("Al") system 110. The computing system 105, the MM-ViT 105a, and/or the Al system 110 may be part of an action recognition system 115, or may be separate, yet communicatively coupled with, the action recognition system 115. In some instances, the computing system 105, the MM-ViT 105a, and/or the Al system 110 may be embodied as an integrated system. Alternatively, the computing system 105, the MM-ViT 105a, and/or the Al system 110 may be embodied as separate, yet communicatively coupled, systems. In some embodiments, computing system 105 may include, without limitation, at least one of MM-ViT 105a, a machine learning system, Al system 110, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. In some instances, the MM-ViT neural network may include, but is not limited to, at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like.
[0042] System 100 may further comprise one or more content sources 120 (and corresponding database(s) 125) and content distribution system 130 (and corresponding database(s) 135) that communicatively couple with at least one of computing system 105, MM-ViT 105a, Al system 110, and/or action recognition system 115, via network(s) 140. System 100 may further comprise one or more user devices 145a-145n (collectively, "user devices 145" or the like) that communicatively couple with at least one of computing system 105, MM-ViT 105a, Al system 110, and/or action recognition system 115, either directly via wired (not shown) or wireless communications links (denoted by lightning bolt symbols in Fig. 1), or indirectly via network(s) 140 and via wired (not shown) and/or wireless communications links (denoted by lightning bolt symbols in Fig. 1). According to some embodiments, the user devices 145 may each include, but is not limited to, a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a server computer, a digital photo album platform-compliant device, a web-based digital photo album platform-compliant device, a software application ("app")-based digital photo album platform-compliant device, a video sharing platform-compliant device, a web-based video sharing platform-compliant device, an app-based video sharing platform-compliant device, a law enforcement computing system, a security system computing system, a surveillance system computing system, a military computing system, and/or the like.
[0043] In operation, at least one of computing system 105, MM-ViT 105a, Al system 110, and/or action recognition system 115 (collectively, "computing system") may be used to train a multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "training" or the like), in accordance with the various embodiments. The computing system may then use the trained multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "inferencing" or the like), in accordance with the various embodiments.
[0044] For training a MM-ViT neural network to perform compressed video action recognition, the computing system may receive a request to train the MM-ViT neural network to perform compressed video action recognition in a compressed video file (e.g., compressed video data 150, or the like), the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like. Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame. After receiving the request to train the MM-ViT neural network, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like. The computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like. The computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score. The MM-ViT neural network may implement an Al model of the MM-ViT neural network to generate a first prediction of video action classification (e.g., action class(es) 155, or the like) of one or more actions captured in the compressed video file, based at least in part on the calculated output token, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file, and may update the Al model based at least in part on the comparison.
[0045] In some cases, the compressed video file may be a video file compressed using a video compression algorithm or format including, but not limited to, one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like. In some cases, the cross-modal attention model may include, without limitation, one of a merged attention model, a co-attention model, or a shift-merge attention model, and/or the like.
[0046] According to some embodiments, generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens. Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens. Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
[0047] In some embodiments, the compressed video file may further comprise a compressed audio file containing an audio waveform. In such cases, the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens. Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens. In some instances, calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
[0048] According to some embodiments, the computing system may resize the plurality of frames of the compressed video file to a first predetermined two-dimensional size; and, in some cases, may utilize at least one of random horizontal flipping or random cropping of the plurality of frames of the compressed video file to enhance training of the Al model of the MM-ViT neural network; and/or the like.
[0049] For inferencing by a trained MM-ViT neural network to perform compressed video action recognition, the computing system may receive a request to perform compressed video action recognition in a compressed video file (e.g., compressed video data 150, or the like), the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like. Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame. After receiving the request to perform compressed video action recognition in the compressed video file, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like. The computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like. The computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score. The trained MM-ViT neural network may implement a trained Al model of the trained MM-ViT neural network to generate a video action classification (e.g., action class(es) 155, or the like) of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may output the generated video action classification of the one or more actions captured in the compressed video file, in some cases, by displaying the generated video action classification on a display screen on each of at least one user device 145 among the user devices 145a-145n, or the like, and/or sending the generated video action classification to content distribution system 130 or other system over network(s) 140, or the like.
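A minimal sketch of inferencing with a trained model of this kind is shown below; `trained_model`, `clip_modalities`, and `CLASS_NAMES` are hypothetical placeholders for the trained MM-ViT, the tokenized compressed-domain inputs, and the label set, and are not defined by this disclosure.

```python
# Illustrative inference with a trained MM-ViT-style model: produce class
# scores for a compressed clip and report the predicted action.  Sketch only.
import torch

@torch.no_grad()
def classify_clip(trained_model, clip_modalities, class_names):
    trained_model.eval()
    logits = trained_model(clip_modalities)      # (1, num_classes)
    probs = torch.softmax(logits, dim=-1)
    top = probs.argmax(dim=-1).item()
    return class_names[top], probs[0, top].item()

# Example usage (hypothetical):
# label, confidence = classify_clip(trained_model, clip_modalities, CLASS_NAMES)
# print(f"Predicted action: {label} ({confidence:.2%})")
```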
[0050] According to some embodiments, generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens. Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens. Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
[0051] In some embodiments, the compressed video file may further comprise a compressed audio file containing an audio waveform. In such cases, the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens. Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens. In some instances, calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
[0052] In some aspects, MM-ViT allows for recognizing and classifying actions in compressed video files without having to decode these video files. Further, in some embodiments, MM-ViT utilizes modeling of complex inter-modal relations - and, in some cases, utilizing audio data contained in the compressed video files as well - for enhancing action recognition. MM-ViT also provides a new way to understand and organize video content in a search (e.g., by providing video recognition of compressed video files to enable labelling of said compressed video files, with such labelling being usable as searchable metadata or tags, etc.), and thus could make video data more easily accessible to users.
[0053] These and other functions of the system 100 (and its components) are described in greater detail below with respect to Figs. 2-4.
[0054] Figs. 2A and 2B (collectively, "Fig. 2") are diagrams illustrating various non-limiting examples 200 and 200' of compressed video frames that may be used by a MM-ViT for performing compressed video action recognition and corresponding MM-ViT focus, in accordance with various embodiments.
[0055] To qualitatively evaluate the proposed model described in detail with respect to Figs. 1, 3, and 4, one may visualize model attention from the output tokens to the input space via the Attention Rollout method. Turning to Fig. 2, examples obtained from applying MM-ViT to, e.g., UCF-101 videos are shown (namely, examples for "Applying Eye Makeup" (as shown in Fig. 2A) and for "Knitting" (as shown in Fig. 2B)).
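For reference, a minimal sketch of the Attention Rollout computation is shown below, assuming the model exposes one head-averaged attention matrix per transformer layer; the 0.5 residual weighting is a common convention for this method, not a value mandated by this disclosure.

```python
# Minimal Attention Rollout sketch: combine per-layer attention matrices
# (averaged over heads, with an identity term approximating residual
# connections) to trace output-token attention back to the input tokens.
import torch

def attention_rollout(per_layer_attn: list) -> torch.Tensor:
    rollout = None
    for attn in per_layer_attn:                  # each: (tokens, tokens) tensor
        eye = torch.eye(attn.shape[-1])
        a = 0.5 * attn + 0.5 * eye               # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)      # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout                               # row 0 (CLS) -> attention over inputs
```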
[0056] As shown in Fig. 2, MM-ViT may attend to the relevant regions in the input space. For example, when applied to classify the compressed video "Apply Eye Makeup," the model concentrates on the eye area and the eyeshadow brush (as depicted by highlighting in the lower row of frames for each of the I-Frames, Motion Vector Frames, and Residual Frames in Fig. 2A). In some cases, MM-ViT may also perceive the phrases or words that are semantically aligned with the content of the action. For instance, the model may focus on the words "lower eyelid" when classifying the video "Apply Eye Makeup."

[0057] Similarly, as shown in Fig. 2B, e.g., when applied to classify the compressed video "Knitting," the model concentrates on the tip of the knitting needle and the hands of the knitter (as depicted by highlighting in the lower row of frames for each of the I-Frames, Motion Vector Frames, and Residual Frames in Fig. 2B). In some cases, MM-ViT may also perceive the phrases or words that are semantically aligned with the content of the action. For instance, the model may focus on the words "make the knit stitch" when classifying the video "Knitting."
[0058] The remarkable consistency of the quantitative and qualitative results (also as described below) confirms the effectiveness of the proposed MM-ViT in complex spatial-temporal-audio reasoning.
[0059] As described herein with respect to Figs. 1, 3, and 4, the I-Frames (also referred to as "compressed regular image frames" or the like), the Motion Vector Frames (also referred to as "compressed tracked image change frames," in particular, "compressed image frames containing motion vector data" or the like), the Residual Frames (also referred to as "compressed tracked image change frames," in particular, "compressed image frames containing residual data" or the like), and Audio data (also referred to as "compressed audio file" or "audio waveform" or the like) may be used to generate corresponding vision and audio tokens that may be used as inputs to the MM-ViT for training the MM-ViT neural network to perform compressed video action recognition and/or for using a trained MM-ViT neural network to perform compressed video action recognition, as described in detail herein.
[0060] Figs. 3A-3G (collectively, "Fig. 3") are schematic block flow diagrams illustrating non-limiting examples of training 300 of a MM-ViT neural network (Figs. 3A-3F) and inferencing 300' by a trained MM-ViT neural network (Figs. 3G and 3B-3F) for performing compressed video action recognition, in accordance with various embodiments.

[0061] With reference to Fig. 3A, for training a MM-ViT neural network, a compressed video file (e.g., compressed video file 305, or the like) may be used. The compressed video file may include a plurality of frames 310, including, but not limited to, at least one or a combination of I-Frames 310a (also referred to as "compressed regular image frames" or the like), the Motion Vectors or Motion Vector Frames 310b (also referred to as "compressed tracked image change frames," in particular, "compressed image frames containing motion vector data" or the like), the Residuals or Residual Frames 310c (also referred to as "compressed tracked image change frames," in particular, "compressed image frames containing residual data" or the like), or Audio data 310d (also referred to as "compressed audio file" or "audio waveform" or the like), and/or the like. In MPEG-4-based compressed video files (or equivalent formats), the Motion Vectors 310b and the Residuals 310c may be contained within P-Frames or B-Frames, which are frames that encode changes in the image relative to a previous frame (e.g., for P-Frames) or relative to both the previous frame and a next frame (e.g., for B-Frames), respectively.
[0062] The system may embed compressed video files (at block 320), by generating a first set of vision tokens based on the I-Frames 310a (or the compressed regular image frames, or the like) (at block 325a) and embedding the first set of vision tokens (at block 330a) as an input to MM-ViT 335; generating a second set of vision tokens based on the Motion Vectors 310b (or the compressed image frames containing motion vector data, or the like) (at block 325b) and embedding the second set of vision tokens (at block 330b) as another input to MM-ViT 335; generating a third set of vision tokens based on the Residuals 310c (or the compressed image frames containing residual data, or the like) (at block 325c) and embedding the third set of vision tokens (at block 330c) as yet another input to MM-ViT 335; and generating a fourth set of vision tokens based on the Audio 310d (or the compressed audio file, or the like) (at block 325d) and embedding the fourth set of vision tokens (at block 330d) as still another input to MM-ViT 335.
[0063] MM-ViT (e.g., MM-ViT 335, or the like) may operate on a compressed video clip V, as follows. The vision modalities may include T (number of) sampled I-frames, motion vectors, and residuals of height H and width W. The audio modality may contain a 1D waveform of length $T'$. This may be formally expressed as follows:

$$\mathcal{V} = \big\{\, \mathcal{I} \in \mathbb{R}^{T \times H \times W \times 3},\ \mathcal{M} \in \mathbb{R}^{T \times H \times W \times 2},\ \mathcal{R} \in \mathbb{R}^{T \times H \times W \times 3},\ \mathcal{A} \in \mathbb{R}^{1 \times T'} \,\big\}$$

where $\mathcal{I}$, $\mathcal{M}$, $\mathcal{R}$, and $\mathcal{A}$ represent the I-frame, motion vector, residual, and audio modality, respectively. To (roughly or approximately) align the visual and audio signals, the 1D audio waveform may be partitioned into T segments and each segment may be projected to a d-dimensional (e.g., 128-dimensional, or the like) vector using a pretrained VGGish model, or the like. Each RGB I-frame may be decomposed into N non-overlapping patches of size $P \times P$. Then, those patches may be projected into a token embedding using a learnable linear embedding layer (e.g., $E^{\mathcal{I}} \in \mathbb{R}^{d \times 3P^2}$, or the like).

[0064] Additionally, a spatiotemporal positional encoding (e.g., $PE_{(p,t)} \in \mathbb{R}^{d}$, or the like) may be added to each patch token in order to preserve the positional information. The same operations may be applied to tokenize the motion vectors and residuals as well:

$$z^{\mathcal{I}}_{(p,t)} = E^{\mathcal{I}} x^{\mathcal{I}}_{(p,t)} + PE_{(p,t)}, \qquad z^{\mathcal{M}}_{(p,t)} = E^{\mathcal{M}} x^{\mathcal{M}}_{(p,t)} + PE_{(p,t)}, \qquad z^{\mathcal{R}}_{(p,t)} = E^{\mathcal{R}} x^{\mathcal{R}}_{(p,t)} + PE_{(p,t)}$$

where $z^{\mathcal{I}}_{(p,t)}$, $z^{\mathcal{M}}_{(p,t)}$, and $z^{\mathcal{R}}_{(p,t)}$ are the resulting vision tokens ($p = 1, \dots, N$; $t = 1, \dots, T$), and $x^{\mathcal{I}}_{(p,t)}$, $x^{\mathcal{M}}_{(p,t)}$, and $x^{\mathcal{R}}_{(p,t)}$ denote the corresponding flattened patches.
[0065] For the audio feature, a linear layer (e.g., $E^{\mathcal{A}} \in \mathbb{R}^{d \times 128}$, or the like) may first be applied to project it to the same dimensional space as the vision tokens, then a temporal positional encoding $PE_{t}$ may be added:

$$z^{\mathcal{A}}_{t} = E^{\mathcal{A}}\,\delta\big(x^{\mathcal{A}}_{t}\big) + PE_{t}$$

where the transformation function $\delta$ may be parameterized by the VGGish model, or the like.
[0066] To facilitate fully spatiotemporal self-attention across visual and audio modalities, each audio token $z^{\mathcal{A}}_{t}$ may be replicated N times along the spatial dimension, thus $z^{\mathcal{A}}_{(p,t)} = z^{\mathcal{A}}_{t}$ for $p = 1, \dots, N$.
[0067] The resulting token sequences $\big\{\, z^{\mathcal{I}}_{(p,t)},\ z^{\mathcal{M}}_{(p,t)},\ z^{\mathcal{R}}_{(p,t)},\ z^{\mathcal{A}}_{(p,t)} \,\big\}$, $p = 1, \dots, N$; $t = 1, \dots, T$, and a special classification ("CLS") token $z_{(0,0)}$ may constitute the input to the MM-ViT (e.g., MM-ViT 335, or the like). The output embedding of $z_{(0,0)}$ may be used as the aggregated representation for the entire input sequence.
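A short, non-limiting sketch of how the resulting token sequences and the CLS token may be assembled into the MM-ViT input is shown below; the batch size, token dimension, and concatenation order are assumed purely for illustration.

```python
# Illustrative assembly of the MM-ViT input: vision tokens from the I-frame,
# motion-vector, and residual tokenizers, replicated audio tokens, and a
# prepended CLS token.  Shapes follow the notation above (T frames, N patches).
import torch

B, T, N, dim = 2, 8, 196, 512                  # assumed sizes
z_iframe   = torch.rand(B, T * N, dim)         # z^I_(p,t)
z_motion   = torch.rand(B, T * N, dim)         # z^M_(p,t)
z_residual = torch.rand(B, T * N, dim)         # z^R_(p,t)
z_audio    = torch.rand(B, T * N, dim)         # z^A_(p,t), replicated along space

cls_token = torch.zeros(B, 1, dim)             # special classification ("CLS") token
tokens = torch.cat([cls_token, z_iframe, z_motion, z_residual, z_audio], dim=1)
print(tokens.shape)                            # (B, 1 + 4*T*N, dim) -> transformer input
```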
[0068] The MM-ViT neural network (e.g., MM-ViT 335, or the like) - details of example embodiments of which are described in detail with respect to Figs. 3B-3E, or the like - may implement an Al model of the MM-ViT neural network to generate a first prediction of video action classification (e.g., video action classification 340, or the like) of one or more actions captured in the compressed video file (e.g., compressed video file 305), based at least in part on these inputs, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification 340 with a video action label 315 associated with the compressed video file (at block 345), and may update the Al model based at least in part on the comparison (as depicted by the arrow between block 345 and MM-ViT 335 in Fig. 3A, or the like).

[0069] As shown in Figs. 3B-3E, various embodiments of MM-ViT 335 include four multi-modal video transformer architectures. Fig. 3B depicts an architecture that simply adopts the standard self-attention mechanism to measure all pairwise token relations, while Figs. 3C-3E depict variants of this model that factorize the self-attention computation over the space-time-modality 4D volume with distinct strategies, as described below.
[0070] Referring to Fig. 3B, a first MM-ViT model 335a (also referred to as "MM-ViT I" or "Joint Space-Time-Modality Attention MM-ViT" or the like) is depicted. Each transformer layer of this model measures pairwise interactions between all input tokens. For instance, MM-ViT I may include L (number of) transformer layers (e.g., transformer layers 350a, 365a, etc.). At each layer, a set of query (q), key (k), and value (v) vectors may be first computed for every input token embedding
z^{(l-1,s)}_{(p,t)}

from the preceding layer as below:

q^{(l,s)}_{(p,t)} = W^{l}_{Q} LN( z^{(l-1,s)}_{(p,t)} ),    k^{(l,s)}_{(p,t)} = W^{l}_{K} LN( z^{(l-1,s)}_{(p,t)} ),    v^{(l,s)}_{(p,t)} = W^{l}_{V} LN( z^{(l-1,s)}_{(p,t)} ),
where W^{l}_{Q}, W^{l}_{K}, W^{l}_{V} ∈ R^{d_h×d} denote learnable embedding matrices, LN(·) denotes layer normalization, and s ∈ S = {I, MV, R, A}.
[0071] The self-attention weights for query patch q^{(l,s)}_{(p,t)} may be given by:

α^{(l,s)}_{(p,t)} = Softmax( ( q^{(l,s)}_{(p,t)} / √d_h ) · [ k^{(l)}_{(0,0)}, { k^{(l,s')}_{(p',t')} }_{p'=1,...,N; t'=1,...,T; s'∈S} ] ),    (Eqn. 9)
[0072] The output token z^{(l,s)}_{(p,t)} may be further obtained by first computing the weighted sum of the value vectors based on the self-attention weights, followed by a linear projection through a Multi-Layer Perceptron ("MLP") block (e.g., MLP 360, or the like). A residual connection may then be employed to promote robustness.
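A single-head sketch of this joint space-time-modality attention follows (the use of PyTorch and the simplified output projection are assumptions; the multi-head extension and the MLP block described in the surrounding paragraphs are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpaceTimeModalityAttention(nn.Module):
    """Single-head joint attention over the flattened (modality, time, space) token set."""
    def __init__(self, dim=768, head_dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_q = nn.Linear(dim, head_dim, bias=False)   # W_Q
        self.to_k = nn.Linear(dim, head_dim, bias=False)   # W_K
        self.to_v = nn.Linear(dim, head_dim, bias=False)   # W_V
        self.out = nn.Linear(head_dim, dim)
        self.scale = head_dim ** -0.5

    def forward(self, z):               # z: (B, |S|*T*N + 1, dim), CLS token included
        x = self.norm(z)                                     # layer normalization LN(.)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # all pairwise relations
        out = attn @ v                                       # weighted sum of the value vectors
        return z + self.out(out)                             # residual connection
```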
[0073] In some embodiments, Multi-Head Self-Attention ("MSA") (e.g., Joint Spatial-Temporal Attention MSA 355a, or the like), which may yield better performance, may be employed. Specifically, MSA may use h sets of { W^{l}_{Q}, W^{l}_{K}, W^{l}_{V} }. The outputs of the h heads may be concatenated across heads in the end and forwarded to the next layer. Although this model allows interactions between all token pairs, it has quadratic computational complexity with respect to the number of tokens. [0074] Turning to Fig. 3C, a second MM-ViT model 335b (also referred to as "MM-ViT II" or "Factorized Space-Time Attention MM-ViT" or the like) is depicted. Instead of computing self-attention across all pairs of input tokens, this model factorizes the operation along the spatial and temporal dimensions separately. As shown in Fig. 3C, given a token from layer L-1, one may first conduct self-attention temporally (or spatially) by comparing it with all tokens at the same spatial location across all modalities. Next, a spatial attention (or temporal attention) followed by a linear projection may be applied to
generate the output embedding z^{(l,s)}_{(p,t)} from layer L. Formally, the factorized space-time attention may be defined as:

y^{(l,s)}_{(p,t)} = MSA_time( LN( z^{(l-1,s)}_{(p,t)} ) ) + z^{(l-1,s)}_{(p,t)},
y'^{(l,s)}_{(p,t)} = MSA_space( LN( y^{(l,s)}_{(p,t)} ) ) + y^{(l,s)}_{(p,t)},
z^{(l,s)}_{(p,t)} = MLP( LN( y'^{(l,s)}_{(p,t)} ) ) + y'^{(l,s)}_{(p,t)}.
[0075] This architecture introduces more parameters than MM-ViT I due to one additional MSA operation (in this case, Temporal Attention MSA 355b and Spatial Attention MSA 355c, compared with Spatial-Temporal Attention MSA 355a in MM-ViT I of Fig. 3B). However, by decoupling self-attention over the input spatial and temporal dimensions, MM-ViT II may reduce computational complexity per patch from
O(N · T · |S|) to O(N · |S| + T · |S|).
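As a rough illustration of this factorization (the tensor layout, the placement of residual connections, and the treatment of the attention submodules as black boxes are assumptions), the temporal-then-spatial computation might be organized as follows:

```python
import torch

def factorized_space_time_attention(z, temporal_msa, spatial_msa, mlp):
    """Illustrative only. z: (B, S, T, N, d) with S modalities, T frames, N patches.
    temporal_msa, spatial_msa, and mlp are assumed to map (batch, seq_len, d) or
    (..., d) to the same shape without residuals; residuals are added here."""
    B, S, T, N, d = z.shape

    # Temporal attention: tokens at the same spatial location attend across all
    # frames and modalities (sequence length S*T per spatial position).
    zt = z.permute(0, 3, 1, 2, 4).reshape(B * N, S * T, d)
    zt = zt + temporal_msa(zt)
    zt = zt.reshape(B, N, S, T, d).permute(0, 2, 3, 1, 4)    # back to (B, S, T, N, d)

    # Spatial attention: tokens in the same frame attend across all patches and
    # modalities (sequence length S*N per frame).
    zs = zt.permute(0, 2, 1, 3, 4).reshape(B * T, S * N, d)
    zs = zs + spatial_msa(zs)
    zs = zs.reshape(B, T, S, N, d).permute(0, 2, 1, 3, 4)    # (B, S, T, N, d)

    # Per-token linear projection (MLP) with a residual connection.
    return zs + mlp(zs)
```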
[0076] With reference to Fig. 3D, a third MM-ViT model 335c (also referred to as "MM-ViT III" or "Factorized Space-Time Cross-Modal Attention MM-ViT" or the like) is depicted. The third model further factorizes self-attention over the modality dimension. At each transformer layer (e.g., transformer layers 350c and 365c, etc.), MM-ViT III may attend to space, time, and modality dimensions sequentially, thus reducing the
computational complexity per patch to O(N + T + |S|). For instance, a patch token z^{(l,s)}_{(p,t)} from layer L may be calculated as follows:

y^{(l,s)}_{(p,t)} = MSA_time( LN( z^{(l-1,s)}_{(p,t)} ) ) + z^{(l-1,s)}_{(p,t)},
y'^{(l,s)}_{(p,t)} = MSA_space( LN( y^{(l,s)}_{(p,t)} ) ) + y^{(l,s)}_{(p,t)},
y''^{(l,s)}_{(p,t)} = MCA_modal( LN( y'^{(l,s)}_{(p,t)} ) ) + y'^{(l,s)}_{(p,t)},
z^{(l,s)}_{(p,t)} = MLP( LN( y''^{(l,s)}_{(p,t)} ) ) + y''^{(l,s)}_{(p,t)},
where MCA denotes Multi-Head Cross-Attention, which is specifically designed for modeling cross-modal relations. Here, an effective cross-modal attention (e.g., Cross-Modal Attention MCA 355d, etc.) may be provided to facilitate learning from multi-modal data. To this end, three different cross-modal attention mechanisms are developed, as described below with respect to Fig. 3F.
[0077] Referring to Fig. 3E, a fourth MM-ViT model 335d (also referred to as "MM-ViT IV" or "Factorized Local Space-Time Cross-Modal Attention MM-ViT" or the like) is depicted. This architecture may restrict the factorized spatial and temporal attention in MM-ViT III to non-overlapping local windows, thereby further reducing the computational cost. Supposing a local spatial and temporal window contains M and F patches, respectively, the computational complexity per patch becomes O(M + F + |S|). In some embodiments, one may set M = N/2 and F = T/2.
[0078] However, limiting the receptive field to a local window may adversely affect the model's performance. To alleviate this issue, a convolution layer may be inserted, after the local temporal and spatial attention (e.g., Local Temporal Attention MSA 355e and Local Spatial Attention MSA 355f, etc.), to strengthen the connection between the neighboring windows. The convolution kernel size may be the same as the window size, and the stride size may be equal to 1. Here, cross-modal attention (e.g., Cross-Modal Attention MCA 355d, etc.) similar to that used for the third MM-ViT model 335c (in Fig. 3D) may be used.
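A hedged sketch of this local-window variant is shown below (the window splitting, the padding choice, and the shape handling are assumptions); temporal attention is restricted to non-overlapping windows of F frames, and a convolution with kernel size equal to the window size and stride 1 then reconnects neighboring windows:

```python
import torch
import torch.nn as nn

def local_temporal_attention(z, msa, F_win):
    """z: (B*N*S, T, d), with T assumed divisible by F_win. Attention is restricted
    to non-overlapping windows of F_win frames; msa maps (batch, seq, d) -> same shape."""
    BNS, T, d = z.shape
    w = z.reshape(BNS * (T // F_win), F_win, d)    # split the time axis into local windows
    w = w + msa(w)                                 # attend only within each window
    return w.reshape(BNS, T, d)

class CrossWindowConv(nn.Module):
    """1D convolution over time (kernel = window size, stride = 1) applied after the
    local attention to strengthen connections between neighboring windows."""
    def __init__(self, dim=768, F_win=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=F_win, stride=1, padding=F_win // 2)

    def forward(self, z):                          # z: (B*N*S, T, d)
        y = self.conv(z.transpose(1, 2)).transpose(1, 2)
        return y[:, : z.shape[1]]                  # crop back to the original length T
```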
[0079] Turning to Fig. 3F, the cross-modal attention model may include, without limitation, one of (1) a Merged Attention model, (2) a Co-Attention model, or (3) a Shift-Merge Attention model, and/or the like.
[0080] For (1) a Merged Attention model, given a query from one modality, the model considers all of the keys and values regardless of the modality type. The output of this cross-attention module for query
q^{(l,s)}_{(p,t)} may be defined as:

MA( q^{(l,s)}_{(p,t)} ) = Σ_{s'} α^{(l,s,s')}_{(p,t)} v^{(l,s')}_{(p,t)},    with    α^{(l,s,·)}_{(p,t)} = Softmax( { q^{(l,s)}_{(p,t)} · k^{(l,s')}_{(p,t)} / √d_h }_{s'} ),
where s' = S.
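For illustration only (the dictionary-based organization, tensor shapes, and function name are assumptions), the following sketch computes such a cross-modal attention for a single query at one spatial-temporal position; restricting the keys and values to the other modalities, via the "co" mode, corresponds to the Co-Attention variant described next:

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(q, k, v, query_modality, mode="merged"):
    """q, k, v: dicts mapping a modality name to a (B, d_h) tensor at one (p, t) position.
    'merged' uses keys/values from all modalities (s' = S); 'co' excludes the query's
    own modality (s' = S \\ {s})."""
    modalities = list(k.keys()) if mode == "merged" else \
                 [m for m in k.keys() if m != query_modality]
    K = torch.stack([k[m] for m in modalities], dim=1)            # (B, |s'|, d_h)
    V = torch.stack([v[m] for m in modalities], dim=1)            # (B, |s'|, d_h)
    scale = K.shape[-1] ** -0.5
    scores = (q[query_modality].unsqueeze(1) * K).sum(-1) * scale # (B, |s'|)
    attn = F.softmax(scores, dim=-1)
    return (attn.unsqueeze(-1) * V).sum(dim=1)                    # weighted sum of values
```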
[0081] Alternatively, one can allow queries to interact only with keys and values from other modalities, thus s' = S \ {s}. This cross-modal attention is referred to as "(2) Co-Attention." [0082] Third, a computation-free shift-based method called "(3) Shift-Merge Attention" may be used to assist interactions across modalities. This shares a similar spirit to the shift approaches proposed in the CNN domain, seeking to strike a balance between accuracy and efficiency. More specifically, queries and keys are discarded, and the method works directly on the value embeddings by first evenly splitting each v^{(l,s)}_{(p,t)} into four portions v^{(l,s)}_{1(p,t)}, v^{(l,s)}_{2(p,t)}, v^{(l,s)}_{3(p,t)}, and v^{(l,s)}_{4(p,t)}.
[0083] Then, the value embedding portions may be shifted and mixed from different modalities, but at the same spatial and temporal index, as follows:

r^{(l,I)}_{(p,t)} = v^{(l,I)}_{1(p,t)} || v^{(l,MV)}_{2(p,t)} || v^{(l,R)}_{3(p,t)} || v^{(l,A)}_{4(p,t)},

and analogously, with the modality assignments cyclically shifted, for r^{(l,MV)}_{(p,t)}, r^{(l,R)}_{(p,t)}, and r^{(l,A)}_{(p,t)},
where r denotes the resulting encoding and || represents concatenation. A residual connection may also be added to preserve the learning capability.
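A minimal sketch of this Shift-Merge idea is given below, under stated assumptions (four equal portions per value embedding and a cyclic shift of the portions across modalities at the same spatial-temporal index; the exact shifting pattern of a given embodiment may differ):

```python
import torch

def shift_merge(values):
    """values: dict mapping a modality name to its (B, d_h) value embedding at one
    (p, t) index. Each embedding is split into four portions, portions are gathered
    from cyclically shifted modalities and concatenated, and a residual is added."""
    names = list(values.keys())                                  # e.g., ["I", "MV", "R", "A"]
    chunks = {m: torch.chunk(values[m], 4, dim=-1) for m in names}
    mixed = {}
    for i, m in enumerate(names):
        # Portion j is taken from the modality offset by j positions from m.
        parts = [chunks[names[(i + j) % len(names)]][j] for j in range(4)]
        mixed[m] = torch.cat(parts, dim=-1) + values[m]          # concatenation + residual
    return mixed
```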
[0084] These and other functions of the system 100 (and its components) are described in greater detail below with respect to Figs. 1, 2, and 4.
[0085] Figs. 4A-4F (collectively, "Fig. 4") are flow diagrams illustrating a method 400 for implementing training of a MM-ViT neural network (Figs. 4A-4E) and inferencing by a trained MM-ViT neural network (Figs. 4F and 4B-4E) for performing compressed video action recognition, in accordance with various embodiments.
[0086] While the techniques and procedures are depicted and/or described in a certain order for purposes of illustration, it should be appreciated that certain procedures may be reordered and/or omitted within the scope of various embodiments. Moreover, while the method 400 illustrated by Fig. 4 can be implemented by or with (and, in some cases, are described below with respect to) the systems, examples, or embodiments 100, 200, 200', 300, and 300' of Figs. 1, 2A, 2B, 3A-3F, and 3B-3G, respectively (or components thereof), such methods may also be implemented using any suitable hardware (or software) implementation. Similarly, while each of the systems, examples, or embodiments 100, 200, 200', 300, and 300' of Figs. 1, 2A, 2B, 3A-3F, and 3B-3G, respectively (or components thereof), can operate according to the method 400 illustrated by Fig. 4 (e.g., by executing instructions embodied on a computer readable medium), the systems, examples, or embodiments 100, 200, 200', 300, and 300' of Figs. 1, 2A, 2B, 3A-3F, and 3B-3G can each also operate according to other modes of operation and/or perform other suitable procedures.
[0087] In the non-limiting embodiment of Fig. 4A, method 400, at block 405, may comprise receiving, using a computing system, a request to train a MM-ViT neural network to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like. Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
[0088] In some embodiments, the computing system may comprise at least one of a MM-ViT, a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. In some instances, the MM-ViT neural network may comprise at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like. In some cases, the compressed video file may be a video file compressed using a video compression algorithm or format comprising one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like.
[0089] At optional block 410, method 400 may further comprise performing, using the computing system, at least one of resizing, random horizontal flipping, or random cropping of the plurality of frames, in some cases, to enhance training of the MM-ViT neural network.
[0090] Method 400 may further comprise, at block 415, generating, using the computing system, a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like. In some embodiments, the compressed video file may further comprise a compressed audio file containing an audio waveform. In such cases, method 400 may further comprise, at optional block 420, generating, using the computing system, audio tokens based at least in part on the compressed audio file.
At block 425, method 400 may comprise calculating, using the computing system, at least one multi-head attention score based at least in part on using the generated plurality of vision tokens (and in some cases, the generated audio tokens also) as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like. In some cases, the cross-modal attention model may comprise one of a merged attention model, a co-attention model, or a shift-merge attention model, and/or the like, as described in detail with respect to Fig. 3F. Method 400 may further comprise calculating, using the computing system, an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score (block 430).
[0092] Method 400 may further comprise, at block 435, implementing, using the MM- ViT neural network, an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token. Method 400 may further comprise training the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file (block 440), and updating the Al model based at least in part on the comparison (block 445). Method 400 may subsequently return to the process at block 410, and the processes at blocks 410 to 445 may be repeated as necessary or as desired to enhance training of the Al model of the MM-ViT neural network.
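By way of a non-limiting illustration of the generate-compare-update cycle described above (the model interface, data loader, loss function, and optimizer are assumptions rather than the claimed implementation), a single training epoch might be organized as follows:

```python
import torch
import torch.nn as nn

def train_one_epoch(mm_vit, dataloader, optimizer, device="cuda"):
    """Illustrative loop: predict an action class from the multi-modal tokens,
    compare the prediction against the video action label, and update the model."""
    criterion = nn.CrossEntropyLoss()
    mm_vit.train()
    for tokens, action_label in dataloader:    # tokens: pre-built multi-modal token tensor
        tokens, action_label = tokens.to(device), action_label.to(device)
        logits = mm_vit(tokens)                # first prediction of video action classification
        loss = criterion(logits, action_label) # comparison with the associated action label
        optimizer.zero_grad()
        loss.backward()                        # update the model based on the comparison
        optimizer.step()
```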
[0093] With reference to Figs. 4B-4D, generating the plurality of vision tokens (at block 415) may comprise generating a first set of vision tokens (block 450, Fig. 4B), generating a second set of vision tokens (block 455, Fig. 4C), and generating a third set of vision tokens (block 460, Fig. 4D). [0094] In particular, as shown in Fig. 4B, generating the plurality of vision tokens (at block 415) may comprise generating a first set of vision tokens (block 450), by: decomposing, using the computing system, each of the at least one compressed regular image frame into a plurality of first non-overlapping patches (block 450a); projecting, using the computing system, the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens (block 450b); and adding, using the computing system, first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens (block 450c).
[0095] Likewise, as shown in Fig. 4C, generating the plurality of vision tokens (at block 415) may comprise generating a second set of vision tokens (block 455), by: decomposing, using the computing system, each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches (block 455a); projecting, using the computing system, the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens (block 455b); and adding, using the computing system, second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens (block 455c).
[0096] Similarly, as shown in Fig. 4D, generating the plurality of vision tokens (at block 415) may comprise generating a third set of vision tokens (block 460), by: decomposing, using the computing system, each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first nonoverlapping patches (block 460a); projecting, using the computing system, the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens (block 460b); and adding, using the computing system, third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens (block 460c).
[0097] Referring to Fig. 4E, generating audio tokens (at optional block 420) may comprise: partitioning, using the computing system, the audio waveform into a plurality of segments (block 465a); projecting, using the computing system, each segment among the plurality of segments to an audio vector (block 465b); applying, using the computing system, a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments (block 465c); adding, using the computing system, temporal positional encoding to the projected audio segments to produce at least one audio token (block 465d); and replicating, using the computing system, each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens (block 465e). In such cases, generating the first prediction of video action classification of the one or more actions captured in the compressed video file (at block 435) may further be based at least in part on the generated audio tokens.
[0098] With reference to Fig. 4F, for inferencing, method 400, at block 470, may comprise receiving, using a computing system, a request to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like. As with training as described above with respect to Fig. 4A, each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
[0099] In some embodiments, as with training as described above with respect to Fig. 4A, the computing system may comprise at least one of a MM-ViT, a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. In some instances, the MM-ViT neural network may comprise at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like. In some cases, the compressed video file may be a video file compressed using a video compression algorithm or format comprising one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like.
[0100] At optional block 475, method 400 may further comprise performing, using the computing system, at least one of resizing or random cropping of the plurality of frames. [0101] The processes at blocks 415-430, as described above with respect to Figs. 4A-4F in the context of training the MM-ViT neural network, may be repeated here with respect to using the trained MM-ViT to perform compressed video action recognition.
[0102] Method 400 may further comprise, at block 480, implementing, using a trained MM-ViT neural network, an Al model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token. Method 400 may further comprise outputting, using the computing system, the generated video action classification of the one or more actions captured in the compressed video file (block 485).
[0103] The following results of empirical studies illustrate the effectiveness of the Al model of the MM-ViT neural network, compared with conventional techniques and systems.
[0104] 1. Setup
[0105] Datasets:
[0106] The MM-ViT models as described herein were evaluated on three popular video action recognition datasets: UCF-101, Something-Something-V2, and Kinetics-600. UCF-101 contains 13,320 trimmed short videos from 101 action categories. It has three training-testing splits. The average performance of the MM-ViT models across the three splits is provided below. Kinetics-600 contains around 480,000 10-second long videos for 600 action classes. The Something-Something-v2 ("SSv2") contains about 220,000 videos with a time span from 2 to 6 seconds for 174 action classes. Different from other datasets, SSv2 places more emphasis on a model's ability to recognize fine-grained actions since the same background scenes can be shared across many classes.
[0107] In addition, the released version of SSv2 has no audio stream, thus only the extracted visual modalities from SSv2 were used to evaluate the MM-ViT models. As mentioned above, SSv2 places more emphasis on a model's ability to recognize fine-grained actions. In our experiments, all compressed videos were converted to MPEG-4 codec, which encoded a video into I-frames and P-frames. On average, an I-frame was followed by 11 P-frames.
[0108] Training Details:
[0109] All the training videos were first resized to 340 x 256. Then, random horizontal flipping (which was omitted for SSv2) and random cropping (224 x 224) were applied to I-frames, motion vectors, and residuals for data augmentation. Patch size was set to 16 x 16 across the visual or vision modalities. The audio waveform was partitioned into 1-second-long segments and projected onto 128-dimensional vectors by VGGish. ViT-B/16 that was pretrained on ImageNet-21K was used as the backbone and was fine-tuned using SGD with a batch size of 8. The learning rate started from 0.5 and was divided by 10 when the validation accuracy plateaued.
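For illustration only (the use of torchvision and the exact transform composition are assumptions; only the resize, flip, and crop parameters are taken from the description above), the visual data augmentation could be expressed as:

```python
import torchvision.transforms as T

# Training-time augmentation for I-frames, motion vectors, and residuals
# (the random horizontal flip would be dropped for SSv2).
train_transform = T.Compose([
    T.Resize((256, 340)),            # resize to 340 x 256 (given here as (height, width))
    T.RandomHorizontalFlip(p=0.5),
    T.RandomCrop(224),               # random 224 x 224 crop
    T.ToTensor(),
])

# Evaluation-time preprocessing uses deterministic resizing and cropping instead.
eval_transform = T.Compose([
    T.Resize((256, 340)),
    T.CenterCrop(224),
    T.ToTensor(),
])
```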
[0110] Inference Details:
[0111] During inferencing, unless otherwise mentioned, the input included 8 uniformly sampled triplets of I-frames, motion vectors, and residuals, each with a crop size of 224 x 224, and audio features (which were omitted for SSv2) that were temporally aligned with the visual or vision features. The accuracy from the three spatial crops (namely, left, center, and right) was reported, and the scores were then averaged for a final prediction.
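A schematic sketch of this three-crop inference and score averaging is shown below (the function and variable names are assumptions):

```python
import torch

def predict_with_spatial_crops(mm_vit, clip_tokens_per_crop):
    """clip_tokens_per_crop: list of three token tensors (left, center, and right crops).
    The per-crop class scores are averaged to form the final prediction."""
    mm_vit.eval()
    with torch.no_grad():
        scores = [mm_vit(tokens).softmax(dim=-1) for tokens in clip_tokens_per_crop]
    return torch.stack(scores).mean(dim=0)        # averaged scores -> final prediction
```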
[0112] 2. Ablation studies
[0113] Analysis of the proposed model variants:
[0114] Here, ablation studies refer to studies of performance of Al systems by removing some components of the Al system to determine their contribution to the overall Al system. The performance of the MM-ViT model variants on UCF-101 and SSv2, in terms of accuracy and efficiency, was compared. Table 1 summarizes the detailed experimental results for each model. All models were initialized with ViT weights that were pretrained on ImageNet-21K. Top-1 accuracy was used to measure the classification performance.
Table 1. Performance comparison of the MM-ViT model variants on UCF101 and SSv2 (with FLOP numbers reported for UCF-101 where both visual and audio modalities are involved).
[0115] MM-ViT I (as shown in Fig. 3B) appeared to underperform compared with the factorized alternatives (MM-ViT II & III), although it incurs the highest computational cost. This may be due to the lack of dedicated parameters to model spatial, temporal, and cross-modal attentions separately. The results show that factorizing self-attention over the input dimensions may consistently improve both recognition accuracy and efficiency - e.g., MM-ViT II (as shown in Fig. 3C) outperforms MM-ViT I by 0.83% on UCF-101 while incurring 32% fewer FLOPs (i.e., floating point operations). Meanwhile, MM-ViT III (Merged-Att.) (as shown in Figs. 3D and 3F) outperforms MM-ViT II by 1.75% in accuracy while requiring 3.5% fewer FLOPs.
[0116] Among the three cross-modal attention mechanisms, "Merged-Attention" (as shown in Fig. 3F) achieves the best accuracy on both UCF-101 and SSv2. This suggests that sharing keys and values across all modalities may be critical to obtain a comprehensive understanding of the video content. Interestingly, the "Shift-Merge Attention" performs comparably to the "Merged-Attention" while being more efficient, making it attractive in resource-constrained scenarios. When restricting the self-attention to local views (such as in MM-ViT IV, as shown in Fig. 3E), the accuracy has a significant drop (↓ 3.11% on UCF-101, ↓ 4.72% on SSv2), indicating that a more sophisticated cross-window connection may be needed to mitigate the information loss from using local attention views.
[0117] Effect of attention order:
[0118] The effect of attention order was evaluated by enumerating all possible orders of the spatial, temporal, and cross-modal attention. For simplicity, only the results from MM-ViT III with Merged-Attention on UCF-101 are reported, as summarized in Table 2.
Table 2. Effect of attention order on UCF101. T, S, and M represent temporal, spatial, and cross-modal attention, respectively.
[0119] As shown in Table 2, conducting temporal attention before spatial attention may slightly, but consistently, perform better than the opposite, which may indicate that temporal attention provides key clues for distinguishing actions that share similar appearance features. Placing cross-attention in front of temporal and spatial attention may induce performance degradation, in some cases, because the temporal information plays a critical role for distinguishing actions that share similar appearance information. In particular, the best performing attention order may be "Temporal → Cross-Modal → Spatial." [0120] Effect of input modality: [0121] To evaluate the importance of each data modality, an ablation study was conducted by training and evaluating the best performing model (i.e., MM-ViT III with "Merged-Attention") with different modality combinations on UCF-101, as depicted in Table 3.
Table 3. Performance comparison of different modality combinations on UCF101.
[0122] As shown in Table 3, the I-frame is the most essential data modality, as removing it alone decreases Top-1 accuracy by 4.11%. The motion vector and residual frame also play important roles for video action recognition, as removing either modality can lead to an accuracy drop of up to 2.54%. Interestingly, the audio modality has a major impact on video action recognition as well, which is confirmed by a significant performance degradation (2.42% drop in Top-1 accuracy) when excluding audio input. This is likely due to the audio containing dynamics and contextual temporal information that is beneficial for video action recognition (e.g., the sound of an axe hitting a tree is discriminative for recognizing "Cutting Tree").
[0123] 3. Comparison to State-of-the-Art
[0124] The best performing architecture (i.e., MM-ViT III with "Merged-Attention") is compared with conventional approaches on the UCF-101, SSv2, and Kinetics-600 datasets. Unless otherwise specified, results from 1 x 3 views (i.e., 1 temporal and 3 spatial crops) are shown.
[0125] UCF101:
[0126] Table 4 summarizes the performance of the MM-ViT model and other competing methods on UCF-101.
Table 4. Performance comparison with conventional methods on UCF101.
[0127] As shown in Table 4, even without using audio, MM-ViT already outperforms all other methods that operate using the same or similar compressed video modalities (e.g., EMV-CNN, DTMV-CNN, CoViAR, DMC-Net, etc.), by up to 6.9% in Top-1 accuracy. This suggests that the explicit reasoning of inter-modal relations in MM-ViT is effective. Further improvements are achieved by incorporating the audio signal (↑ 2.1% in Top-1 accuracy) and pretraining the model on Kinetics-600 (↑ 3.5% in Top-1 accuracy). In addition, the MM-ViT surpasses all CNN alternatives with or without optical flow, and thus establishes a new state-of-the-art video action recognition model for UCF-101.
[0128] SSv2:
[0129] In Table 5, detailed results on SSv2, including Top-1 and Top-5 accuracy, inference resolution, and computational cost (in FLOPs), are presented.
Table 5. Performance comparison with conventional methods on SSv2. The inference resolution is denoted by M x T x S^2 for the number of modalities, temporal, and spatial sizes.
[0130] As shown in Table 5, MM-ViT surpasses Timesformer and ViViT, both of which also propose pure-transformer models. At the same time, MM-ViT is more efficient in terms of inference FLOPs (i.e., floating point operations). This confirms that the additional motion vector and residual modalities used by MM-ViT provide important complementary motion features, which could benefit the classification on "temporally-heavy" datasets like SSv2. Furthermore, MM-ViT consistently performs better than the CNN counterparts that operate in the single RGB modality (by > 1.5% in Top-1 accuracy). Although it slightly underperforms compared to CNN-based MSNet-R50 and bLVNet, which use optical flow as an auxiliary modality, MM-ViT eliminates the huge burden of optical flow computation and storage.
[0131] Kinetics-600:
[0132] Kinetics-600 is a larger video classification dataset, and performance comparisons using Kinetics-600 are shown in Table 6.
Table 6. Performance comparison with conventional methods on Kinetics-600. The inference resolution is denoted by M x T x S^2 for the number of modalities, temporal, and spatial sizes.
[0133] As shown in Table 6, MM-ViT (T = 16) achieves 83.5% Top-1 accuracy, which results in relative improvement over the Timesformer and ViViT by 1.3% and 0.5%, respectively, while it remains more computationally efficient. This accuracy is also higher than CNN alternatives that either operate in the single RGB modality or use additional flow information. Again, this verifies that the MM-ViT model is effective in learning across multiple modalities for the complex video classification task.
[0134] Examples of System and Hardware Implementation
[0135] Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments. Fig. 5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., computing system 105, multi-modal video transformer ("MM-ViT") 105a, 335, and 335', artificial intelligence ("Al") system 110, action recognition system 115, content source(s) 120, content distribution system 130, and user devices 145a- 145n, etc.), as described above. It should be noted that Fig. 5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate. Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
[0136] The computer or hardware system 500 - which might represent an embodiment of the computer or hardware system (i.e., computing system 105, MM-ViT 105a, 335, and 335', Al system 110, action recognition system 115, content source(s) 120, content distribution system 130, and user devices 145a- 145n, etc.), described above with respect to Figs. 1-4 - is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 510, including, without limitation, one or more general- purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
[0137] The computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
[0138] The computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like. The communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein. In many embodiments, the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
[0139] The computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
[0140] A set of these instructions and/or code might be encoded and/or stored on a non- transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
[0141] It will be apparent to those skilled in the art that substantial variations may be made in accordance with particular requirements. For example, customized hardware (such as programmable logic controllers, field-programmable gate arrays, application-specific integrated circuits, and/or the like) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.
[0142] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
[0143] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion. In an embodiment implemented using the computer or hardware system 500, various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525. Volatile media includes, without limitation, dynamic memory, such as the working memory 535. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications) .
[0144] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
[0145] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
[0146] The communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions. The instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.
[0147] As noted above, a set of embodiments comprises methods and systems for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing multi-modal video transformer ("MM-ViT") for performing compressed video action recognition. Fig. 6 illustrates a schematic diagram of a system 600 that can be used in accordance with one set of embodiments. The system 600 can include one or more user computers, user devices, or customer devices 605. A user computer, user device, or customer device 605 can be a general purpose personal computer (including, merely by way of example, desktop computers, tablet computers, laptop computers, handheld computers, and the like, running any appropriate operating system, several of which are available from vendors such as Apple, Microsoft Corp., and the like), cloud computing devices, a server(s), and/or a workstation computer(s) running any of a variety of commercially-available UNIX™ or UNIX-like operating systems. A user computer, user device, or customer device 605 can also have any of a variety of applications, including one or more applications configured to perform methods provided by various embodiments (as described above, for example), as well as one or more office applications, database client and/or server applications, and/or web browser applications. Alternatively, a user computer, user device, or customer device 605 can be any other electronic device, such as a thin-client computer, Internet-enabled mobile telephone, and/or personal digital assistant, capable of communicating via a network (e.g., the network(s) 610 described below) and/or of displaying and navigating web pages or other types of electronic documents. Although the system 600 is shown with two user computers, user devices, or customer devices 605, any number of user computers, user devices, or customer devices can be supported.
[0148] Some embodiments operate in a networked environment, which can include a network(s) 610. The network(s) 610 can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of commercially- available (and/or free or proprietary) protocols, including, without limitation, TCP/IP, SNA™, IPX™, AppleTalk™, and the like. Merely by way of example, the network(s) 610 (similar to network(s) 140 of Fig. 1, or the like) can each include a local area network ("LAN"), including, without limitation, a fiber network, an Ethernet network, a Token- Ring™ network, and/or the like; a wide-area network ("WAN"); a wireless wide area network ("WWAN"); a virtual network, such as a virtual private network ("VPN"); the Internet; an intranet; an extranet; a public switched telephone network ("PSTN"); an infrared network; a wireless network, including, without limitation, a network operating under any of the IEEE 802.11 suite of protocols, the Bluetooth™ protocol known in the art, and/or any other wireless protocol; and/or any combination of these and/or other networks. In a particular embodiment, the network might include an access network of the service provider (e.g., an Internet service provider ("ISP")). In another embodiment, the network might include a core network of the service provider, and/or the Internet.
[0149] Embodiments can also include one or more server computers 615. Each of the server computers 615 may be configured with an operating system, including, without limitation, any of those discussed above, as well as any commercially (or freely) available server operating systems. Each of the servers 615 may also be running one or more applications, which can be configured to provide services to one or more clients 605 and/or other servers 615.
[0150] Merely by way of example, one of the servers 615 might be a data server, a web server, a cloud computing device(s), or the like, as described above. The data server might include (or be in communication with) a web server, which can be used, merely by way of example, to process requests for web pages or other electronic documents from user computers 605. The web server can also run a variety of server applications, including HTTP servers, FTP servers, CGI servers, database servers, Java servers, and the like. In some embodiments of the invention, the web server may be configured to serve web pages that can be operated within a web browser on one or more of the user computers 605 to perform methods of the invention. [0151] The server computers 615, in some embodiments, might include one or more application servers, which can be configured with one or more applications accessible by a client running on one or more of the client computers 605 and/or other servers 615. Merely by way of example, the server(s) 615 can be one or more general purpose computers capable of executing programs or scripts in response to the user computers 605 and/or other servers 615, including, without limitation, web applications (which might, in some cases, be configured to perform methods provided by various embodiments). Merely by way of example, a web application can be implemented as one or more scripts or programs written in any suitable programming language, such as Java™, C, C#™ or C++, and/or any scripting language, such as Perl, Python, or TCL, as well as combinations of any programming and/or scripting languages. The application server(s) can also include database servers, including, without limitation, those commercially available from Oracle™, Microsoft™, Sybase™, IBM™, and the like, which can process requests from clients (including, depending on the configuration, dedicated database clients, API clients, web browsers, etc.) running on a user computer, user device, or customer device 605 and/or another server 615. In some embodiments, an application server can perform one or more of the processes for implementing neural network, Al, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing MM-ViT for performing compressed video action recognition, as described in detail above. Data provided by an application server may be formatted as one or more web pages (comprising HTML, JavaScript, etc., for example) and/or may be forwarded to a user computer 605 via a web server (as described above, for example). Similarly, a web server might receive web page requests and/or input data from a user computer 605 and/or forward the web page requests and/or input data to an application server. In some cases, a web server may be integrated with an application server.
[0152] In accordance with further embodiments, one or more servers 615 can function as a file server and/or can include one or more of the files (e.g., application code, data files, etc.) necessary to implement various disclosed methods, incorporated by an application running on a user computer 605 and/or another server 615. Alternatively, as those skilled in the art will appreciate, a file server can include all necessary files, allowing such an application to be invoked remotely by a user computer, user device, or customer device 605 and/or server 615. [0153] It should be noted that the functions described with respect to various servers herein (e.g., application server, database server, web server, file server, etc.) can be performed by a single server and/or a plurality of specialized servers, depending on implementation-specific needs and parameters.
[0154] In some embodiments, the system can include one or more databases 620a-620n (collectively, "databases 620"). The location of each of the databases 620 is discretionary: merely by way of example, a database 620a might reside on a storage medium local to (and/or resident in) a server 615a (and/or a user computer, user device, or customer device 605). Alternatively, a database 620n can be remote from any or all of the computers 605, 615, so long as it can be in communication (e.g., via the network 610) with one or more of these. In a particular set of embodiments, a database 620 can reside in a storage-area network ("SAN") familiar to those skilled in the art. (Likewise, any necessary files for performing the functions attributed to the computers 605, 615 can be stored locally on the respective computer and/or remotely, as appropriate.) In one set of embodiments, the database 620 can be a relational database, such as an Oracle database, that is adapted to store, update, and retrieve data in response to SQL-formatted commands. The database might be controlled and/or maintained by a database server, as described above, for example.
[0155] According to some embodiments, system 600 may further comprise computing system 625 (similar to computing system 105 of Fig. 1, or the like) - including, but not limited to, multi-modal video transformer ("MM-ViT") 625a (similar to MM-ViT 105a of Fig. 1, or the like) - and artificial intelligence ("Al") system 630 (similar to Al system 110 of Fig. 1, or the like), each part of an action recognition system 635 (similar to action recognition system 115 of Fig. 1, or the like). System 600 may further comprise one or more content sources 640 and corresponding database(s) 645 (similar to one or more content sources 120 and corresponding database(s) 125 of Fig. 1, or the like) and one or more content distribution system 650 and corresponding database(s) 655 (similar to one or more content distribution system 130 and corresponding database(s) 135 of Fig. 1, or the like). [0156] In operation, at least one of computing system 625, MM-ViT 625a, Al system 630, and/or action recognition system 635 (collectively, "computing system") may be used to train a multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "training" or the like), in accordance with the various embodiments. The computing system may then use the trained multi-modal video transformer neural network to perform compressed video action recognition (referred to herein as "inferencing" or the like), in accordance with the various embodiments.
[0157] For training a MM-ViT neural network to perform compressed video action recognition, the computing system may receive a request to train the MM-ViT neural network to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like. Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame. After receiving the request to train the MM-ViT neural network, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like. The computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like. The computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score. The MM-ViT neural network may implement an Al model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token, may train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file, and may update the Al model based at least in part on the comparison.
[0158] In some cases, the compressed video file may be a video file compressed using a video compression algorithm or format including, but not limited to, one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC) video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format, and/or the like. In some cases, the cross- modal attention model may include, without limitation, one of a merged attention model, a co-attention model, or a shift-merge attention model, and/or the like.
[0159] According to some embodiments, generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens. Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens. Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third nonoverlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
[0160] In some embodiments, the compressed video file may further comprise a compressed audio file containing an audio waveform. In such cases, the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens. Generating the first prediction of video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens. In some instances, calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
[0161] According to some embodiments, the computing system may resize the plurality of frames of the compressed video file in a first predetermined two-dimensional size; and, in some cases, may utilize at least one of random horizontal flipping or random cropping of the plurality of frames of the compressed video file to enhance training of the Al model of the MM-ViT neural network; and/or the like.
[0162] For inferencing by a trained MM-ViT neural network to perform compressed video action recognition, the computing system may receive a request to perform compressed video action recognition in a compressed video file, the compressed video file comprising a plurality of frames comprising at least one compressed regular image frame and at least one compressed tracked image change frame, or the like. Each at least one compressed tracked image change frame may include, without limitation, one of one or more compressed image frames containing motion vector data or one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame. After receiving the request to perform compressed video action recognition in the compressed video file, the computing system may generate a plurality of vision tokens based at least in part on one or more of the at least one compressed regular image frame, the one or more compressed image frames containing motion vector data, or the one or more compressed image frames containing residual data, and/or the like. The computing system may calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model, and/or the like. The computing system may calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score. The trained MM-ViT neural network may implement a trained Al model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and may output the generated video action classification of the one or more actions captured in the compressed video file.
[0163] According to some embodiments, generating the plurality of vision tokens may comprise generating a first set of vision tokens, by: decomposing each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens. Generating the plurality of vision tokens may further comprise generating a second set of vision tokens, by: decomposing each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens. Generating the plurality of vision tokens may further comprise generating a third set of vision tokens, by: decomposing each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
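As a further non-limiting illustration, one possible per-modality tokenizer corresponding to the patch decomposition, learnable linear embedding, and spatiotemporal positional encoding described above might be sketched as follows. The patch size of 16, the three input channels, and the 768-dimensional embedding are assumptions of the sketch (a motion-vector or residual modality might use a different channel count), and three separate instances of such a module could serve as the first, second, and third learnable linear embedding layers.

import torch.nn as nn

class PatchTokenizer(nn.Module):
    # Illustrative sketch only; patch_size, in_channels, and d_model are assumed values.
    def __init__(self, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        # a strided convolution decomposes each frame into non-overlapping patches and
        # applies a learnable linear embedding to every patch in a single operation
        self.embed = nn.Conv2d(in_channels, d_model,
                               kernel_size=patch_size, stride=patch_size)

    def forward(self, frames, pos_encoding):
        # frames: (batch * time, channels, height, width)
        # pos_encoding: (batch * time, num_patches, d_model) spatiotemporal positional encoding
        patches = self.embed(frames)                 # (batch * time, d_model, H/p, W/p)
        tokens = patches.flatten(2).transpose(1, 2)  # (batch * time, num_patches, d_model)
        return tokens + pos_encoding                 # add spatiotemporal positional encoding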
[0164] In some embodiments, the compressed video file may further comprise a compressed audio file containing an audio waveform. In such cases, the computing system may generate audio tokens, by: partitioning the audio waveform into a plurality of segments; projecting each segment among the plurality of segments to an audio vector; applying a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens. Generating the video action classification of the one or more actions captured in the compressed video file may further be based at least in part on the generated audio tokens. In some instances, calculating the at least one multi-head attention score may comprise calculating at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model, and/or the like.
[0165] These and other functions of the system 600 (and its components) are described in greater detail above with respect to Figs. 1-4.
[0166] While particular features and aspects have been described with respect to some embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while particular functionality is ascribed to particular system components, unless the context dictates otherwise, this functionality need not be limited to such and can be distributed among various other system components in accordance with the several embodiments.
[0167] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with — or without — particular features for ease of description and to illustrate some aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method for training a multi-modal video transformer neural network to perform compressed video action recognition, the method implemented by a computing system and comprising: generating a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file; calculating at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model; calculating an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; implementing, using a multi-modal video transformer ("MM-ViT") neural network, an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and training the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file.
2. The method of claim 1, wherein the computing system comprises at least one of a multi-modal video transformer ("MM-ViT"), a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, wherein the MM-ViT neural network comprises at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN").
3. The method of claim 1 or 2, wherein the compressed video file is a video file compressed using a video compression algorithm or format comprising one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format.
4. The method of any of claims 1-3, wherein the compressed video file comprises a plurality of frames comprising the at least one compressed regular image frame and at least one compressed tracked image change frame, each at least one compressed tracked image change frame comprising one of the one or more compressed image frames containing motion vector data or the one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
5. The method of any of claims 1-4, wherein generating the plurality of vision tokens comprises: generating a first set of vision tokens, by: decomposing, using the computing system, each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting, using the computing system, the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding, using the computing system, first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens; generating a second set of vision tokens, by: decomposing, using the computing system, each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting, using the computing system, the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding, using the computing system, second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens; and generating a third set of vision tokens, by: decomposing, using the computing system, each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting, using the computing system, the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding, using the computing system, third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
6. The method of any of claims 1-5, wherein the cross-modal attention model comprises one of a merged attention model, a co-attention model, or a shift-merge attention model.
7. The method of any of claims 1-6, wherein the compressed video file further comprises a compressed audio file containing an audio waveform, wherein the method further comprises: generating audio tokens, by: partitioning, using the computing system, the audio waveform into a plurality of segments; projecting, using the computing system, each segment among the plurality of segments to an audio vector; applying, using the computing system, a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding, using the computing system, temporal positional encoding to the projected audio segments to produce at least one audio token; and
replicating, using the computing system, each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens; wherein generating the first prediction of video action classification of the one or more actions captured in the compressed video file is further based at least in part on the generated audio tokens.
8. The method of claim 7, wherein calculating the at least one multi-head attention score comprises calculating, using the computing system, at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model.
9. The method of any of claims 1-8, further comprising: resizing, using the computing system, the plurality of frames of the compressed video file to a first predetermined two-dimensional size; and utilizing, using the computing system, at least one of random horizontal flipping or random cropping of the plurality of frames of the compressed video file to enhance training of the Al model of the MM-ViT neural network.
10. The method of any of claims 1-9, further comprising: updating the Al model based at least in part on the comparison.
11. A system operable to perform the method of claims 1-9, for training a multi-modal video transformer neural network to perform compressed video action recognition, the system comprising: a computing system, comprising: at least one first processor; and a first non-transitory computer readable medium communicatively coupled to the at least one first processor, the first non-transitory computer readable medium having stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: generate a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file; calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model; calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; implement, using a multi-modal video transformer ("MM-ViT") neural network, an artificial intelligence ("Al") model of the MM-ViT neural network to generate a first prediction of video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and train the Al model of the MM-ViT neural network to perform compressed video action recognition based at least in part on comparison of the generated first prediction of video action classification with a video action label associated with the compressed video file.
12. The system of claim 11, wherein the computing system comprises at least one of a multi-modal video transformer ("MM-ViT"), a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, wherein the MM-ViT neural network comprises at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN").
13. A method for implementing a trained multi-modal video transformer neural network to perform compressed video action recognition, the method implemented by a computing system and comprising: generating a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file; calculating at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model; calculating an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; implementing, using a trained multi-modal video transformer ("MM-ViT") neural network, a trained artificial intelligence ("Al") model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and outputting, using the computing system, the generated video action classification of the one or more actions captured in the compressed video file.
14. The method of claim 13, wherein the computing system comprises at least one of a multi-modal video transformer ("MM-ViT"), a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, wherein the MM- ViT neural network comprises at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed- forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN").
15. The method of claim 13 or 14, wherein the compressed video file is a video file compressed using a video compression algorithm or format comprising one of moving picture experts group 4 ("MPEG-4") video compression format, advanced video coding ("AVC" or "H.264" or "MPEG-4 AVC") video compression format, or high efficiency video coding ("HEVC" or "H.265") video compression format.
16. The method of any of claims 13-15, wherein the compressed video file comprises a plurality of frames comprising the at least one compressed regular image frame and at least one compressed tracked image change frame, each at least one compressed tracked image change frame comprising one of the one or more compressed image frames containing motion vector data or the one or more compressed image frames containing residual data that represent motion of pixels and pixel difference, respectively, between a compressed regular image frame among the at least one compressed regular image frame and said compressed tracked image change frame among the at least one compressed tracked image change frame.
17. The method of any of claims 13-15, wherein generating the plurality of vision tokens comprises: generating a first set of vision tokens, by: decomposing, using the computing system, each of the at least one compressed regular image frame into a plurality of first non-overlapping patches; projecting, using the computing system, the plurality of first non-overlapping patches into a plurality of first token embeddings, using a first learnable linear embedding layer, to produce a plurality of first patch tokens; and adding, using the computing system, first spatiotemporal positional encoding to each first patch token to produce the first set of vision tokens; generating a second set of vision tokens, by: decomposing, using the computing system, each compressed image frame containing motion vector data into a plurality of second non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting, using the computing system, the plurality of second non-overlapping patches into a plurality of second token embeddings, using a second learnable linear embedding layer, to produce a plurality of second patch tokens; and adding, using the computing system, second spatiotemporal positional encoding to each second patch token to produce the second set of vision tokens; and generating a third set of vision tokens, by: decomposing, using the computing system, each compressed image frame containing residual data into a plurality of third non-overlapping patches corresponding to the plurality of first non-overlapping patches; projecting, using the computing system, the plurality of third non-overlapping patches into a plurality of third token embeddings, using a third learnable linear embedding layer, to produce a plurality of third patch tokens; and adding, using the computing system, third spatiotemporal positional encoding to each third patch token to produce the third set of vision tokens.
18. The method of any of claims 13-17, wherein the cross-modal attention model comprises one of a merged attention model, a co-attention model, or a shift-merge attention model.
19. The method of any of claims 13-18, wherein the compressed video file further comprises a compressed audio file containing an audio waveform, wherein the method further comprises: generating audio tokens, by: partitioning, using the computing system, the audio waveform into a plurality of segments; projecting, using the computing system, each segment among the plurality of segments to an audio vector; applying, using the computing system, a linear layer to project the audio vector to the same dimensional space as the plurality of vision tokens to produce projected audio segments; adding, using the computing system, temporal positional encoding to the projected audio segments to produce at least one audio token; and replicating, using the computing system, each of the at least one audio token by a number corresponding to a spatial dimension of one of the sets of vision tokens;
wherein generating the video action classification of the one or more actions captured in the compressed video file is further based at least in part on the generated audio tokens.
20. The method of claim 19, wherein calculating the at least one multi-head attention score comprises calculating, using the computing system, at least one multi-head attention score based at least in part on using both the generated plurality of vision tokens and the generated audio tokens as inputs for the at least one of the joint spatial-temporal modal attention model, the temporal attention model, the spatial attention model, the cross-modal attention model, the local temporal attention model, or the local spatial attention model.
21. A system operable to perform the method of claims 13-20, for implementing a trained multi-modal video transformer neural network to perform compressed video action recognition, the system comprising: a computing system, comprising: at least one first processor; and a first non-transitory computer readable medium communicatively coupled to the at least one first processor, the first non-transitory computer readable medium having stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: generate a plurality of vision tokens based at least in part on one or more of at least one compressed regular image frame, one or more compressed image frames containing motion vector data, or one or more compressed image frames containing residual data contained in a compressed video file; calculate at least one multi-head attention score based at least in part on using the generated plurality of vision tokens as inputs for at least one of a joint spatial-temporal modal attention model, a temporal attention model, a spatial attention model, a cross-modal attention model, a local temporal attention model, or a local spatial attention model;
calculate an output token based on calculation of a weighted sum of value vectors based on the at least one calculated multi-head attention score; implement, using a trained multi-modal video transformer ("MM-ViT") neural network, a trained artificial intelligence ("Al") model of the trained MM-ViT neural network to generate a video action classification of one or more actions captured in the compressed video file, based at least in part on the calculated output token; and output the generated video action classification of the one or more actions captured in the compressed video file.
22. The system of claim 21, wherein the computing system comprises at least one of a multi-modal video transformer ("MM-ViT"), a machine learning system, an Al system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, wherein the MM-ViT neural network comprises at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN").
PCT/US2021/065233 2021-10-26 2021-12-27 Multi-modal video transformer (mm-vit) for compressed video action recognition WO2022104293A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163271809P 2021-10-26 2021-10-26
US63/271,809 2021-10-26

Publications (1)

Publication Number Publication Date
WO2022104293A1 true WO2022104293A1 (en) 2022-05-19

Family

ID=81601823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/065233 WO2022104293A1 (en) 2021-10-26 2021-12-27 Multi-modal video transformer (mm-vit) for compressed video action recognition

Country Status (1)

Country Link
WO (1) WO2022104293A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138813A1 (en) * 2016-03-11 2019-05-09 Gracenote, Inc. Digital Video Fingerprinting Using Motion Segmentation
US20180240221A1 (en) * 2017-02-17 2018-08-23 Cogisen S.R.L. Method for image processing and video compression
US20210232850A1 (en) * 2020-01-23 2021-07-29 Adobe Inc. Generating Descriptions of Image Relationships

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021283A1 (en) * 2022-07-28 2024-02-01 深圳职业技术学院 Person re-identification method, apparatus, and device based on heterogeneous network feature interaction
CN116128158A (en) * 2023-04-04 2023-05-16 西南石油大学 Oil well efficiency prediction method of mixed sampling attention mechanism
CN116402811A * 2023-06-05 2023-07-07 长沙海信智能系统研究院有限公司 Fighting behavior identification method and electronic equipment
CN116402811B * 2023-06-05 2023-08-18 长沙海信智能系统研究院有限公司 Fighting behavior identification method and electronic equipment
CN117994861A (en) * 2024-03-21 2024-05-07 之江实验室 Video action recognition method and device based on multi-mode large model CLIP

Similar Documents

Publication Publication Date Title
WO2022104293A1 (en) Multi-modal video transformer (mm-vit) for compressed video action recognition
Ge et al. Low-resolution face recognition in the wild via selective knowledge distillation
Zhao et al. Former-dfer: Dynamic facial expression recognition transformer
Li et al. Occlusion aware facial expression recognition using CNN with attention mechanism
US11373390B2 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
US11580745B2 (en) Video visual relation detection methods and systems
US10032072B1 (en) Text recognition and localization with deep learning
US9852363B1 (en) Generating labeled images
Wang et al. Dynamic attention guided multi-trajectory analysis for single object tracking
US20230046066A1 (en) Method and apparatus for video recognition
US11983903B2 (en) Processing images using self-attention based neural networks
Mühling et al. Deep learning for content-based video retrieval in film and television production
WO2023040506A1 (en) Model-based data processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
US11030726B1 (en) Image cropping with lossless resolution for generating enhanced image databases
CN109690471A (en) Use the media hype of orientation metadata
JP2019057329A (en) Fast orthogonal projection
CN113255625B (en) Video detection method and device, electronic equipment and storage medium
Putra et al. A deep neural network model for multi-view human activity recognition
Purwanto et al. Extreme low resolution action recognition with spatial-temporal multi-head self-attention and knowledge distillation
US20240095951A1 (en) Pose parsers
Nguyen et al. Video action recognition collaborative learning with dynamics via PSO-ConvNet Transformer
Xu et al. Exploiting attention-consistency loss for spatial-temporal stream action recognition
Ysique‐Neciosup et al. DeepHistory: A convolutional neural network for automatic animation of museum paintings
Nie et al. Weakly supervised image retrieval via coarse-scale feature fusion and multi-level attention blocks
Dornier et al. Scaf: Skip-connections in auto-encoder for face alignment with few annotated data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893043

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE