CN117251821A - Video-language understanding method and system - Google Patents

Video-language understanding method and system

Info

Publication number
CN117251821A
Authority
CN
China
Prior art keywords
video
text
feature matrix
encoder
understanding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311179398.5A
Other languages
Chinese (zh)
Inventor
甘甜
王霄
高晋飞
聂礼强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202311179398.5A
Publication of CN117251821A
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/31Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the temporal domain

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video-language understanding method and system in the technical field of video understanding. The method acquires the video and text to be understood and processes the input video and text with a trained understanding model to generate a final understanding result. The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result. The invention jointly considers the three challenges of information redundancy, time dependence and scene complexity, proposes an understanding model composed of three key components each aimed at a specific challenge, and significantly improves the video-language understanding capability of an agent.

Description

Video-language understanding method and system
Technical Field
The invention belongs to the technical field of video understanding, and particularly relates to a video-language understanding method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In recent years, both artificial intelligence technology and video streaming have grown rapidly. Video-language understanding capability reflects an agent's ability to perceive and interpret visual and textual content in the real world, and can be applied to a range of tasks such as text-to-video cross-modal retrieval, video description generation, and video question answering. However, this technology faces three major challenges: information redundancy, time dependence, and scene complexity. Because these challenges are complementary (for example, reducing redundant information in a video also significantly reduces the complexity of the video scene), it is particularly important to consider all three together.
Existing methods mainly address information redundancy by selecting meaningful tokens or key frames; however, such selection can break the spatial consistency of the information and thereby make temporal modeling more difficult. Other methods place the feature selection module at the end of the pipeline to handle the time-dependence challenge, but this may let the preceding modules introduce too much redundant information, which is not an optimal solution. Still other approaches simply ignore the redundancy of video information and focus only on the time dependence and scene complexity problems.
Thus, existing methods mainly address only one or two of the challenges and fail to take the remaining factors affecting video understanding into account, resulting in low accuracy and performance.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a video-language understanding method and system that jointly consider the three challenges of information redundancy, time dependence and scene complexity, and proposes an understanding model composed of three key components, each aimed at a specific challenge, thereby significantly improving the video-language understanding capability of an agent.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the first aspect of the present invention provides a video-language understanding method.
A video-language understanding method acquires the video and text to be understood,
and processes the input video and text with a trained understanding model to generate a final understanding result.
The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result.
The understanding model can handle at least text-video cross-modal retrieval, video description generation, and video question-answering tasks.
Further, the video encoder further comprises a ViT layer;
the ViT layer is used for carrying out block coding on each video frame to obtain image block embedding with semantic meaning, and the image block embedding belonging to the same video frame forms video frame embedding.
Further, filtering the image blocks of all video frames in the video with the clustering component specifically comprises:
grouping the video frame embeddings output by the ViT layers into S segments, each segment containing F/S frames;
clustering the F/S × (1+P_in) image blocks contained in each segment with a clustering algorithm to produce (1+P_in) clusters, and selecting the block located at the center of each cluster to form the video embedding V_k with redundant information removed;
where S is the number of segments, F is the number of video frames, and P_in is the number of blocks into which each video frame is partitioned.
Further, in the time sequence component, an information token mechanism is used to reconstruct the time sequence dependencies between video frame embeddings, specifically:
the video frame embedding V_k with redundant information removed is input to the ViT-ATM layers to obtain a video feature matrix with time sequence dependencies.
Further, the text query component integrates an L-layer MoED. The MoED consists of four modules: a bi-directional self-attention mechanism (BiSA), causal self-attention (Causal SA), cross attention (CA) and a feed-forward network (FFN); these four modules are combined into three variants that perform the corresponding task-specific text embedding, namely a text encoder, a video-based text encoder and a video-based text decoder.
For the text-video cross-modal retrieval task, the similarity between the video feature matrix and the text feature matrix obtained by the text encoder is calculated to obtain a set of videos that satisfy the similarity condition;
for the video description generation task, the video feature matrix is input into the video-based text decoder to generate text as the description text of the video;
for the video question-answering task, the video feature matrix and the question text are input into the video-based text encoder to obtain a multi-modal feature matrix, which is then input into the video-based text decoder to generate text as the answer to the question.
Further, the text encoder encodes the input text using BiSA and FFN at each layer, prepends a [CLS] token to the text input, and outputs a text feature matrix;
the video-based text encoder collects task-related visual information by adding CA between the BiSA and FFN in each layer of the text encoder; in the CA, the input text serves as the query and the video feature matrix serves as the key and value, generating a multi-modal feature matrix;
the video-based text decoder replaces the BiSA layers of the video-based text encoder with Causal SA layers and decodes the input multi-modal feature matrix into text.
Further, the text-video cross-modal retrieval task is divided into two stages, recall and reordering;
the recall stage recalls the Top-Q videos by calculating the cosine similarity between the video feature matrix and the [cls] token in the text feature matrix obtained by the text encoder;
and in the reordering stage, the video-text query is input into the video-based text encoder, the output [Encode] embedding is fed into a fully connected layer and a sigmoid function to obtain a final score, the Q videos are reordered according to the score, and a preset number of videos are taken to form the video set.
Further, the video question-answering task also comprises a multiple-choice question-answering task, which is treated as a classification task: the question and each candidate answer are concatenated into a complete sentence, which is then input together with the video feature matrix into the video-based text encoder to encode the video and question-answer text pair into a multi-modal feature matrix; finally, the multi-modal feature matrix is fed into a linear layer and a Softmax layer to obtain the score of the best answer.
Further, training of the understanding model specifically includes:
for the text encoder, a video-text contrastive loss is used to encourage the [cls] embeddings of matching video-text pairs to have more similar representations than those of non-matching pairs, so as to align the feature spaces of video and text;
for the video-based text encoder, a video-text matching loss is used to learn a video-text multi-modal representation and capture fine-grained alignment between video and text;
for the video-based text decoder, a language modeling loss is used to optimize a cross-entropy loss, training the model to maximize the likelihood of the text in an autoregressive manner.
A second aspect of the present invention provides a video-language understanding system.
A video-language understanding system comprising an acquisition module and an understanding module:
an acquisition module configured to: acquiring videos and texts to be understood;
an understanding module configured to: based on the trained understanding model, processing the input video and text to generate a final understanding result;
The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result.
The understanding model can handle at least text-video cross-modal retrieval, video description generation, and video question-answering tasks.
A third aspect of the present invention provides a computer readable storage medium having stored thereon a program which when executed by a processor performs steps in a video-language understanding method according to the first aspect of the present invention.
A fourth aspect of the invention provides an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in a video-language understanding method according to the first aspect of the invention when the program is executed.
The one or more of the above technical solutions have the following beneficial effects:
(1) To improve the video-language understanding capability of an agent and overcome the limitations of the prior art, the invention eliminates redundant information in the video and reconstructs the temporal dependencies of the features based on cross-modal embeddings of images and texts, and uses different architectures for different tasks to cope with the complexity of video scenes, helping the agent complete video-language understanding tasks effectively. Specifically, the invention designs two independent neural network modules (the video encoder and the text query component) containing three special components that cooperate to generate the final task objective, and each component may be implemented by any suitable method.
(2) The invention fuses multi-modal information to achieve natural language understanding and inference of video content. Specifically, the invention obtains an effective encoding of the video through the video encoder, and processes the text and the video encoding in different ways through the text query component to handle the three different video-language understanding tasks of cross-modal retrieval, video description generation and video question answering and obtain the corresponding results. This effectively reduces the influence of complex video scenes on the result and improves the understanding capability of the model.
(3) The invention uses the clustering component and the time sequence component in the video encoder module to extract the key blocks of the video frame embeddings and reconstruct the temporal dependencies, effectively reducing the redundancy of the video information while maintaining the time dependence of the embedded features, thereby further improving the robustness of the model.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a flow chart of a method of a first embodiment.
Fig. 2 is an overall process flow diagram of the first embodiment.
Fig. 3 is a flow chart of a first embodiment text-to-video cross-modality retrieval task.
Fig. 4 is a flowchart of a first embodiment video description generation task.
Fig. 5 is a flowchart of an open question-answering task in the video question-answering task according to the first embodiment.
Fig. 6 is a flowchart of a multiple choice question-answering task among the video question-answering tasks of the first embodiment.
Detailed Description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Term interpretation:
Self-Attention (SA): an attention mechanism for processing sequential data and the core component of the Transformer model. It maps queries, keys and values into a high-dimensional space and computes the similarity between them to learn the relationships between different positions of the input sequence, enabling the model to encode and model dependencies inside the sequence and capture long-range dependencies.
Transformer: a model built on the self-attention mechanism that uses multi-head self-attention to learn the relationships between different positions in parallel, improving the representation and generalization capability of the model.
ViT: a Transformer model used to perform feature extraction and representation learning on image sequences, capturing the relationships between different regions of an image and effectively learning global context information. ViT is composed of multiple Transformer modules, each containing several self-attention heads and a feed-forward neural network; these modules are stacked to build the ViT layers, and each module receives the output of the previous layer and produces a richer feature representation.
ViT-ATM: a further improvement of the Vision Transformer (ViT) model that adds a temporal module for processing video data sequences. The conventional ViT model can only process a single still image and cannot model a video sequence; ViT-ATM extends the ViT model by introducing a temporal block that captures the relationships between different frames of a video sequence and learns their temporal characteristics, so as to better understand video data.
k-medoids++: an improved clustering algorithm for partitioning a dataset into k different clusters. It selects the initial centers with a probability distribution (in the style of k-means++) to increase representativeness and diversity, uses actual sample points as the cluster centers, is more robust to outliers, and can handle non-Euclidean distance measures.
[cls] token: a special token used in natural language processing (NLP). It usually serves as the starting token of an input sequence and plays an important role in Transformer and other attention-based models.
Bi-directional Self Attention (BiSA): a variant of the attention mechanism that models the dependencies between the words of a sentence by considering contextual information in both directions. It uses two independent attention mechanisms: forward attention and backward attention. In forward attention, each word attends to all words preceding it to obtain forward context information; similarly, in backward attention, each word attends to all words following it to obtain backward context information. Combining the forward and backward attention outputs yields a more comprehensive representation that better captures the relationships between the words of a sentence.
Causal Self Attention (Causal SA): a variant of the attention mechanism mainly applied to sequence data processing. It introduces the concept of causality to ensure that the model can only rely on previous information when predicting. Causality is achieved by modifying the attention matrix so that each word can only attend to the words preceding it; in other words, the model can only predict the output of the current position from the context already observed and cannot use future information.
Cross Attention (CA): an attention mechanism that introduces correlations between multiple input sequences. It computes correlation scores between a query sequence and a key sequence and then uses these scores as weights for a weighted sum over the value sequence, realizing cross-sequence information transfer.
Feed Forward Network (FFN): a common feed-forward neural network structure in machine learning that transforms the input data through multiple fully connected layers and nonlinear activation functions to extract features and enhance the representation capability of the model.
BERT: a pre-trained language model that learns generic sentence representations through self-supervised training on large-scale unlabeled text. Unlike conventional language models, BERT adopts a Transformer architecture containing multiple layers of self-attention.
Cosine similarity: a measure of the similarity between two vectors, computed as the cosine of the angle between them and ranging from -1 to 1. When the cosine similarity is close to 1, the two vectors point in very similar directions; when it is close to -1, their directions are opposite; when it is close to 0, there is no obvious similarity between the two vectors.
The present invention takes the three aforementioned challenges into account jointly and proposes an understanding model, RTQ, that consists of three key components, each directed at a particular challenge. First, the first component (the clustering component) adopts a clustering method to eliminate redundant information in adjacent video frames and select representative blocks. Next, the second component (the time sequence component) perceives and interprets the temporal relationships between blocks through temporal modeling, thereby avoiding the problem of broken spatial consistency between the representative blocks. Finally, the third component (the text query component), comprising a text encoder, a video-based text encoder and a video-based text decoder, gradually gathers task-related information through language queries. These three components, each of which can be implemented by any suitable method, effectively address the three challenges and significantly enhance the video-language understanding capability of the agent.
Example 1
In one or more embodiments, a video-language understanding method is disclosed, as shown in fig. 1, comprising:
step one: acquiring videos and texts to be understood;
step two: based on the trained understanding model, processing the input video and text to generate a final understanding result;
The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result.
The understanding model can handle at least text-video cross-modal retrieval, video description generation, and video question-answering tasks.
The implementation procedure of a video-language understanding method of this embodiment will be described in detail.
This embodiment aims to design a novel deep learning model that learns video features and text features and improves the video-language understanding capability of an agent according to the task, effectively addressing the three challenges of information redundancy, time dependence and scene complexity in the agent's video-language understanding process, and completing the three application tasks of text-to-video cross-modal retrieval, video description generation and video question answering. The overall processing flow, shown in fig. 2, specifically comprises:
First, given a video, each video frame is encoded by the Vision Transformer (ViT) layers in the video encoder to obtain semantically meaningful image block embeddings; the clustering component then eliminates redundant blocks while preserving representative ones.
The remaining blocks are input to the ViT-ATM component to capture the time dependence between video frames and generate a video feature matrix as the output of the video encoder.
Finally, the text query component gathers task-related information layer by layer from the video feature matrix and the text, and outputs the result corresponding to the task.
The three components may be implemented by any suitable method; this embodiment describes one implementation, which specifically includes the following steps:
S1: construct video frame embeddings that contain temporal relationships and use the clustering component to eliminate redundant information, obtaining video features with the redundancy removed. The specific steps are as follows:
S11: the video input to the video encoder is embedded by K ViT layers into semantically meaningful image block embeddings of size F × (1+P_in) × d, where F is the number of video frames input to the model, 1 corresponds to the [cls] token, P_in is the number of blocks each video frame is divided into by the ViT layers, and d is the hidden dimension.
S12: before the K-th ViT layer, a [cls] token is prepended to each video frame embedding to generate information token embeddings, and self-attention (SA) is performed over these information tokens to obtain the temporal dependencies between video frames;
the resulting temporal-dependency embeddings m_k are then fed into the K-th ViT layer together with the other frame embeddings to obtain the image block embeddings.
S13: the video frame embeddings output by the K-th ViT layer are grouped into S segments, each containing F/S frames.
S14: the F/S × (1+P_in) image blocks contained in each segment are clustered using the k-medoids++ algorithm to produce (1+P_in) clusters, and the block located at the center of each cluster is selected to form the video embedding V_k with redundant information removed.
The clustering algorithm is not limited to k-medoids++; other clustering methods may also be used.
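The patent does not prescribe a concrete implementation of steps S13-S14; the following Python sketch illustrates one possible segment-wise medoid selection over ViT block embeddings. The cosine-distance choice, the simplified k-medoids++-style initialization, and all function names are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def kmedoids_pp(dist: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Simplified k-medoids with a k-means++-style probabilistic initialization.
    dist: (N, N) pairwise distance matrix; returns k medoid indices."""
    n = dist.size(0)
    medoids = [int(torch.randint(n, (1,)))]
    for _ in range(k - 1):                                # spread-out initialization
        d_min = dist[:, medoids].min(dim=1).values
        probs = d_min / d_min.sum().clamp_min(1e-8)
        medoids.append(int(torch.multinomial(probs, 1)))
    medoids = torch.tensor(medoids)
    for _ in range(iters):                                # alternate assignment / medoid update
        assign = dist[:, medoids].argmin(dim=1)
        for c in range(k):
            members = (assign == c).nonzero(as_tuple=True)[0]
            if len(members) > 0:                          # new medoid minimizes intra-cluster distance
                medoids[c] = members[dist[members][:, members].sum(dim=1).argmin()]
    return medoids

def select_medoid_blocks(block_emb: torch.Tensor, num_segments: int) -> torch.Tensor:
    """block_emb: (F, 1 + P_in, d) block embeddings from the K-th ViT layer.
    Returns (num_segments, 1 + P_in, d), keeping one representative block per cluster.
    Assumes F is divisible by num_segments."""
    num_frames, tokens, d = block_emb.shape
    frames_per_seg = num_frames // num_segments
    kept = []
    for s in range(num_segments):
        seg = block_emb[s * frames_per_seg:(s + 1) * frames_per_seg]   # (F/S, 1+P_in, d)
        pts = seg.reshape(-1, d)                                       # F/S * (1+P_in) blocks
        dist = 1 - F.cosine_similarity(pts.unsqueeze(1), pts.unsqueeze(0), dim=-1)
        kept.append(pts[kmedoids_pp(dist, k=tokens)])                  # (1+P_in) medoid blocks
    return torch.stack(kept)
```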
S2: and reconstructing the video features after the redundant information is eliminated through a time sequence component so as to maintain the time dependence of the video features.
Specifically, in the time sequence component an information token mechanism is used to reconstruct the temporal dependencies between video embeddings; that is, the time sequence component consists of (L-K) ViT-ATM layers, where L is the total number of layers of the video encoder. The video embedding V_k with redundant information removed, obtained in step S14, is input to the time sequence component, finally yielding a temporally dependent video embedding that serves as the video feature matrix output by the video encoder.
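The internal structure of the ViT-ATM layers is only characterized above (an information token mechanism that reconstructs temporal dependencies); the sketch below shows one hypothetical temporal block in that spirit, with per-segment information tokens attending to each other across segments before a standard spatial Transformer block. All module and argument names are assumptions, not the actual ViT-ATM definition:

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Hypothetical ViT-ATM-style block: temporal attention over the
    per-segment information tokens, followed by a standard spatial block."""
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.spatial_block = nn.TransformerEncoderLayer(d, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (S, 1 + P, d) segment embeddings; token 0 is the information token
        info = self.norm(x[:, 0:1, :]).transpose(0, 1)     # (1, S, d): info tokens across segments
        t, _ = self.temporal_attn(info, info, info)        # temporal self-attention across segments
        x = torch.cat([x[:, 0:1, :] + t.transpose(0, 1),   # write temporal context back to token 0
                       x[:, 1:, :]], dim=1)
        return self.spatial_block(x)                       # per-segment spatial attention + FFN
```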
S3: for three tasks of cross-modal retrieval, video description generation and video question-answering, a text query component is used for task-specific text embedding, and a final understanding result is obtained, including:
s31: after the first two components, the video has been encoded into a time-aware representation with high information density, i.e., a video feature matrix; however, because of the complexity of video scenes, there still exist a lot of information unrelated to tasks; to address this problem, a text query component is introduced that uses task-specific queries to gradually collect relevant details and generate final understanding results.
The text query component integrates an L-layer MoED, which consists of four modules: BiSA, Causal SA, CA and FFN. These four modules are combined into three variants to complete the corresponding tasks.
The three variants of MoED are:
(1) The text encoder, like BERT, encodes the text using BiSA and FFN at each layer, with a [CLS] token prepended to the text input to summarize it.
(2) The video-based text encoder collects task-related visual information through cross attention (CA) inserted between the bi-directional self-attention (BiSA) and the feed-forward network (FFN) in each layer of the text encoder. In the cross attention, the text input serves as the query and the video feature matrix serves as the keys and values. For task-specific purposes, an [Encode] token is added to the text input, and the resulting embedding contains a multi-modal representation of the video-text pair.
(3) The video-based text decoder is responsible for gathering task-specific visual information to generate the desired text output; it replaces the BiSA layers of the video-based text encoder with Causal SA layers. The start of the sequence is marked with a [Decode] token and the end with an end-of-sequence token.
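The following sketch illustrates how one MoED layer could combine the four modules and be reused for the three variants described above (BiSA + FFN for the text encoder, plus CA for the video-based encoder, plus a causal mask for the decoder). The class and its interface are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class MoEDLayer(nn.Module):
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)   # BiSA or Causal SA
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # CA (text -> video)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, text, video=None, causal=False):
        # text: (B, T, d) text embeddings; video: (B, N, d) video feature matrix or None
        T = text.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=text.device), 1) if causal else None
        h, _ = self.self_attn(self.n1(text), self.n1(text), self.n1(text), attn_mask=mask)
        x = text + h
        if video is not None:                                  # only for the video-based variants
            h, _ = self.cross_attn(self.n2(x), video, video)   # text queries, video keys/values
            x = x + h
        return x + self.ffn(self.n3(x))

# Variant usage (per layer):
#   text encoder:             layer(text)                       BiSA + FFN
#   video-based text encoder: layer(text, video)                BiSA + CA + FFN
#   video-based text decoder: layer(text, video, causal=True)   Causal SA + CA + FFN
```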
The three variants are utilized to show the video-language understanding capability of the intelligent agent for three tasks of text-video cross-modal retrieval, video description generation and video question-answering, and specifically comprise the following steps:
(1) For the text-video cross-modal retrieval task, the task is divided into two stages of recall and reordering.
As shown in fig. 3, the recall stage first recalls the Top-Q videos by calculating the cosine similarity between the video feature matrix and the [cls] token in the text embedding obtained by the text encoder. Then, in the reordering stage, the video-text query is input into the video-based text encoder, the output [Encode] embedding is fed into a fully connected layer and a sigmoid function to obtain a final score, and the Q videos are reordered according to this score; the SA layers and FFN layers of the text encoder and the video-based text encoder share parameters.
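A hedged sketch of the recall-then-rerank procedure described above; the encoder and head interfaces (`vid_text_encoder`, `rerank_head`) are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def retrieve(text_cls, video_cls, videos, text_tokens,
             vid_text_encoder, rerank_head, Q=16, top_k=5):
    """text_cls: (d,) [cls] vector of the text query; video_cls: (N, d) per-video [cls] vectors;
    videos: list of N video feature matrices. Returns indices of the top_k videos."""
    Q = min(Q, video_cls.size(0))
    sim = F.cosine_similarity(text_cls.unsqueeze(0), video_cls, dim=-1)    # recall stage
    cand = sim.topk(Q).indices                                             # Top-Q candidates
    scores = []
    for i in cand.tolist():                                                # reordering stage
        enc = vid_text_encoder(text_tokens, videos[i])                     # (T, d) multi-modal features
        scores.append(torch.sigmoid(rerank_head(enc[0])))                  # [Encode] token -> FC -> sigmoid
    order = torch.stack(scores).squeeze(-1).argsort(descending=True)
    return cand[order][:top_k]
```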
(2) For the video description generation task, as shown in fig. 4, a video feature matrix is input to a video-based text decoder to generate description text for the video.
(3) The video question-answering task comprises an open-ended question-answering task and a multiple-choice question-answering task.
For the open-ended question-answering task, as shown in fig. 5, the video feature matrix and the question text are input into the video-based text encoder to obtain a multi-modal feature matrix, which is then input into the video-based text decoder to generate the answer; the encoder and the decoder share parameters.
For the multiple-choice question-answering task, which is treated as a classification task, as shown in fig. 6, the question and each candidate answer are concatenated into a complete sentence, which is then input together with the video feature matrix into the video-based text encoder to encode the video and question-answer text pair into a multi-modal feature matrix; finally, the multi-modal feature matrix is fed into a linear layer and a Softmax layer to obtain the score of the best answer.
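The multiple-choice scoring described above can be sketched as follows; the tokenizer, encoder, and classification head are hypothetical placeholders:

```python
import torch

def answer_multiple_choice(question, candidates, video_feats,
                           tokenizer, vid_text_encoder, cls_head):
    """Scores each (question + candidate answer) sentence against the video
    and returns the index of the best-scoring candidate."""
    logits = []
    for ans in candidates:
        tokens = tokenizer(question + " " + ans)       # question and answer as one sentence
        enc = vid_text_encoder(tokens, video_feats)    # (T, d) multi-modal feature matrix
        logits.append(cls_head(enc[0]))                # linear layer on the [Encode] token
    probs = torch.softmax(torch.stack(logits).squeeze(-1), dim=0)
    return int(probs.argmax())
```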
S4: reasoning and training, including in particular:
(1) For the text-to-video cross-modal retrieval task, the text encoder and the video-based text encoder are jointly trained.
For the text encoder, the video-text contrastive loss (VTC) is used to encourage the [cls] embeddings of matching video-text pairs to have more similar representations than those of non-matching pairs, so as to align the feature spaces of video and text.
First, for the i-th video-text pair, their [CLS] embeddings are passed, following CLIP [5], through a linear projection layer and an L2 normalization layer to obtain the video hidden vector and the text hidden vector.
To maximize the benefit of large-batch contrastive learning, three memory banks are maintained to store the most recent M video vectors, text vectors, and the corresponding videos.
Then the text-to-video contrastive loss and the video-to-text contrastive loss are calculated,
where the positives are the matching pairs and τ is a learnable temperature parameter.
Finally, the two losses are combined to obtain the VTC loss.
To compensate for potential false negative samples, the momentum distillation strategy of ALBEF [6] is used, in which a momentum encoder generates soft labels.
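A minimal sketch of a symmetric video-text contrastive loss with memory banks, in the spirit of the VTC description above; the exact loss in the patent (positive-set handling and soft labels from momentum distillation) is not reproduced here, and the function signature is an assumption:

```python
import torch
import torch.nn.functional as F

def vtc_loss(v, t, v_bank, t_bank, temperature):
    """v, t: (B, d) L2-normalized video / text hidden vectors of matching pairs;
    v_bank, t_bank: (M, d) memory banks of recent vectors; temperature: scalar tensor."""
    t2v = t @ torch.cat([v, v_bank]).T / temperature    # (B, B+M) text-to-video similarities
    v2t = v @ torch.cat([t, t_bank]).T / temperature    # (B, B+M) video-to-text similarities
    targets = torch.arange(v.size(0), device=v.device)  # the in-batch pair is the positive
    return 0.5 * (F.cross_entropy(t2v, targets) + F.cross_entropy(v2t, targets))
```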
For the video-based text encoder, the video-text matching loss (VTM) is used to capture fine-grained alignment between video and text and to learn a video-text multi-modal representation. VTM is a binary classification task in which the model uses a VTM head (a linear layer) to predict whether a video-text pair is positive (matching) or negative (non-matching), given the multi-modal feature of the [Encode] token.
For the i-th video-text pair, its positive matching score is first calculated; a video/text is then randomly sampled to replace the original one, yielding a negative matching score; finally, the video-text matching loss is obtained from the two.
To make the VTM loss more informative, the negatives are sampled with a hard negative mining strategy.
The VTC loss and the VTM loss are added to obtain the final loss.
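A simplified sketch of the VTM objective as a binary classification on the [Encode] feature; in-batch random negatives stand in for the hard negative mining described above, and the encoder/head interfaces are assumptions:

```python
import torch
import torch.nn.functional as F

def vtm_loss(vid_text_encoder, vtm_head, text_tokens, video_feats):
    """text_tokens: list of B tokenized texts; video_feats: list of the B matching
    video feature matrices (B > 1 assumed so a mismatched negative always exists)."""
    B = len(text_tokens)
    feats, labels = [], []
    for i in range(B):
        feats.append(vid_text_encoder(text_tokens[i], video_feats[i])[0])   # positive pair
        labels.append(1)
        j = (i + int(torch.randint(1, B, (1,)))) % B                        # random mismatched video
        feats.append(vid_text_encoder(text_tokens[i], video_feats[j])[0])   # negative pair
        labels.append(0)
    logits = vtm_head(torch.stack(feats))                                   # (2B, 2) VTM head outputs
    return F.cross_entropy(logits, torch.tensor(labels, device=logits.device))
```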
(2) For the video description generation task, the language modeling (LM) loss is used for the decoder: a cross-entropy loss is optimized, training the model to maximize the likelihood of the text in an autoregressive manner. For each video-text pair (v, t), the loss takes the form L_LM = -Σ_{l=1..L} log P(t_l | t_<l, v),
where L is the total length of the sentence; label smoothing of 0.1 is used when computing the loss. Compared with the masked language modeling loss widely used in video-language pre-training, the LM loss gives the model the generative capability to convert visual information into coherent captions.
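A minimal sketch of the autoregressive LM loss with label smoothing 0.1, assuming a hypothetical causal decoder interface:

```python
import torch
import torch.nn.functional as F

def lm_loss(decoder, caption_ids, video_feats):
    """caption_ids: (L,) token ids of the caption; video_feats: video feature matrix."""
    inputs, targets = caption_ids[:-1], caption_ids[1:]    # shift by one for next-token prediction
    logits = decoder(inputs, video_feats)                  # (L-1, vocab) causal decoder outputs
    return F.cross_entropy(logits, targets, label_smoothing=0.1)
```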
(3) For the video question-answering task, open-ended question answering uses the LM loss, while multiple-choice question answering uses the VTM loss. Unlike text-to-video retrieval, the negative samples come from incorrect question-answer pairs rather than being generated by sampling.
This embodiment aims to improve the video-language understanding capability of an agent. Systematic analysis shows that current video-language understanding methods focus on limited aspects of the task, and that methods targeting different challenges can complement one another. In view of this, a framework is proposed that integrates redundancy elimination, temporal modeling, and query components to jointly address information redundancy, temporal dependencies, and scene complexity, respectively. Extensive experimental evaluation demonstrates the effectiveness and superiority of the method of this embodiment. Future work will pre-train the model to help it acquire more knowledge and develop more efficient redundancy elimination, temporal modeling, and query components to further improve overall performance, with the hope of contributing to the development of agent video-language understanding technology.
Example two
In one or more embodiments, a video-language understanding system is disclosed that includes an acquisition module and an understanding module:
an acquisition module configured to: acquiring videos and texts to be understood;
an understanding module configured to: based on the trained understanding model, processing the input video and text to generate a final understanding result;
The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result.
The understanding model can handle at least text-video cross-modal retrieval, video description generation, and video question-answering tasks.
Example III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs steps in a video-language understanding method according to an embodiment of the present disclosure.
Example IV
An object of the present embodiment is to provide an electronic apparatus.
An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing steps in a video-language understanding method as described in the first embodiment of the present disclosure when the program is executed.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video-language understanding method, comprising:
acquiring videos and texts to be understood;
based on the trained understanding model, processing the input video and text to generate a final understanding result;
wherein the understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component; the clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and the text query component, based on the video feature matrix, embeds the text to obtain the final understanding result;
and wherein the understanding model can handle at least a text-video cross-modal retrieval task, a video description generation task, or a video question-answering task.
2. The video-language understanding method of claim 1, wherein the video encoder further comprises a ViT layer;
the video encoder comprises ViT layers, a clustering component and a time sequence component which are sequentially connected;
the ViT layer is used for carrying out block coding on each video frame to obtain image block embedding with semantic meaning, and the image block embedding belonging to the same video frame forms video frame embedding.
3. The video-language understanding method according to claim 2, wherein filtering the image blocks of all video frames in the video with the clustering component specifically comprises:
grouping the video frame embeddings output by the ViT layers into S segments, each segment containing F/S frames;
clustering the F/S × (1+P_in) image blocks contained in each segment with a clustering algorithm to produce (1+P_in) clusters, and selecting the block located at the center of each cluster to form the video embedding V_k with redundant information removed;
where S is the number of segments, F is the number of video frames, and P_in is the number of blocks into which each video frame is partitioned.
4. The video-language understanding method of claim 1, wherein in the time sequence component an information token mechanism is used to reconstruct the time sequence dependencies between video frame embeddings, specifically:
the video frame embedding V_k with redundant information removed is input to the ViT-ATM layers to obtain a video feature matrix with time sequence dependencies.
5. The video-language understanding method of claim 1, wherein the text query component integrates an L-layer MoED; the MoED consists of four modules, namely a bi-directional self-attention mechanism BiSA, causal self-attention Causal SA, cross attention CA and a feed-forward network FFN, and these four modules form three variants that perform the corresponding task-specific text embedding, namely a text encoder, a video-based text encoder and a video-based text decoder;
for the text-video cross-modal retrieval task, the similarity between the video feature matrix and the text feature matrix obtained by the text encoder is calculated to obtain a set of videos satisfying the similarity condition;
for the video description generation task, the video feature matrix is input into the video-based text decoder to generate text as the description text of the video;
for the video question-answering task, the video feature matrix and the question text are input into the video-based text encoder to obtain a multi-modal feature matrix, which is then input into the video-based text decoder to generate text as the answer to the question.
6. The video-language understanding method of claim 5, wherein the text encoder encodes the input text using BiSA and FFN at each layer, prepends a [CLS] token to the text input, and outputs a text feature matrix;
the video-based text encoder collects task-related visual information by adding CA between the BiSA and FFN in each layer of the text encoder; in the CA, the input text serves as the query and the video feature matrix serves as the key and value, generating a multi-modal feature matrix;
the video-based text decoder replaces the BiSA layers of the video-based text encoder with Causal SA layers and decodes the input multi-modal feature matrix into text.
7. A video-language understanding method as claimed in claim 1, wherein the text-video cross-modality retrieval task is performed in two stages of recall and reorder;
the recall stage recalls the Top-Q videos by calculating the cosine similarity between the video feature matrix and the [cls] token in the text feature matrix obtained by the text encoder;
and in the reordering stage, the video-text query is input into the video-based text encoder, the output [Encode] embedding is fed into a fully connected layer and a sigmoid function to obtain a final score, the Q videos are reordered according to the score, and a preset number of videos are taken to form the video set.
8. The video-language understanding method of claim 1, wherein the video question-answering task further comprises a multiple-choice question-answering task, which is treated as a classification task: the question and each candidate answer are concatenated into a complete sentence, which is then input together with the video feature matrix into the video-based text encoder to encode the video and question-answer text pair into a multi-modal feature matrix; finally, the multi-modal feature matrix is fed into a linear layer and a Softmax layer to obtain the score of the best answer.
9. The video-language understanding method according to claim 1, wherein the training of the understanding model is specifically:
for the text encoder, a video-text contrastive loss is used to encourage the [cls] embeddings of matching video-text pairs to have more similar representations than those of non-matching pairs, so as to align the feature spaces of video and text;
for the video-based text encoder, a video-text matching loss is used to learn a video-text multi-modal representation and capture fine-grained alignment between video and text;
for the video-based text decoder, a language modeling loss is used to optimize a cross-entropy loss, training the model to maximize the likelihood of the text in an autoregressive manner.
10. A video-language understanding system, comprising an acquisition module and an understanding module:
an acquisition module configured to: acquiring videos and texts to be understood;
an understanding module configured to: based on the trained understanding model, processing the input video and text to generate a final understanding result;
wherein the understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component; the clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and the text query component, based on the video feature matrix, embeds the text to obtain the final understanding result;
and wherein the understanding model can handle at least a text-video cross-modal retrieval task, a video description generation task, or a video question-answering task.
CN202311179398.5A 2023-09-13 2023-09-13 Video-language understanding method and system Pending CN117251821A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311179398.5A CN117251821A (en) 2023-09-13 2023-09-13 Video-language understanding method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311179398.5A CN117251821A (en) 2023-09-13 2023-09-13 Video-language understanding method and system

Publications (1)

Publication Number Publication Date
CN117251821A true CN117251821A (en) 2023-12-19

Family

ID=89134219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311179398.5A Pending CN117251821A (en) 2023-09-13 2023-09-13 Video-language understanding method and system

Country Status (1)

Country Link
CN (1) CN117251821A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117640947A (en) * 2024-01-24 2024-03-01 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium
CN117640947B (en) * 2024-01-24 2024-05-10 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium
CN117765450A (en) * 2024-02-20 2024-03-26 浪潮电子信息产业股份有限公司 Video language understanding method, device, equipment and readable storage medium
CN117765450B (en) * 2024-02-20 2024-05-24 浪潮电子信息产业股份有限公司 Video language understanding method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN114169330B (en) Chinese named entity recognition method integrating time sequence convolution and transform encoder
CN117251821A (en) Video-language understanding method and system
CN111309971A (en) Multi-level coding-based text-to-video cross-modal retrieval method
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
Yu et al. Learning from inside: Self-driven siamese sampling and reasoning for video question answering
CN114996513B (en) Video question-answering method and system based on cross-modal prompt learning
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN114627162A (en) Multimodal dense video description method based on video context information fusion
CN113423004B (en) Video subtitle generating method and system based on decoupling decoding
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
CN114821271A (en) Model training method, image description generation device and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
Hsu et al. Video summarization with spatiotemporal vision transformer
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN115719510A (en) Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning
Alkalouti et al. Encoder-decoder model for automatic video captioning using yolo algorithm
CN116091978A (en) Video description method based on advanced semantic information feature coding
CN115906857A (en) Chinese medicine text named entity recognition method based on vocabulary enhancement
Le et al. Learning to reason with relational video representation for question answering
CN116863920B (en) Voice recognition method, device, equipment and medium based on double-flow self-supervision network
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Phuc et al. Video captioning in Vietnamese using deep learning
CN115310461A (en) Low-resource speech translation method and system based on multi-modal data optimization
CN114612826A (en) Video and text similarity determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination