CN117251821A - Video-language understanding method and system - Google Patents

Video-language understanding method and system

Info

Publication number
CN117251821A
Authority
CN
China
Prior art keywords
video
text
feature matrix
encoder
understanding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311179398.5A
Other languages
Chinese (zh)
Inventor
甘甜
王霄
高晋飞
聂礼强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202311179398.5A
Publication of CN117251821A
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/31Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the temporal domain

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video-language understanding method and system in the technical field of video understanding. The method acquires the video and text to be understood and processes the input video and text with a trained understanding model to generate a final understanding result. The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result. The invention jointly considers the three challenges of information redundancy, time dependence and scene complexity, proposes an understanding model composed of three key components each aimed at a specific challenge, and significantly improves the video-language understanding capability of an agent.

Description

Video-language understanding method and system
Technical Field
The invention belongs to the technical field of video understanding, and particularly relates to a video-language understanding method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In recent years, both artificial intelligence technology and video streaming have grown rapidly. Video-language understanding capability reflects an agent's ability to perceive and interpret visual and textual content in the real world, and can be applied to a range of tasks such as text-to-video cross-modal retrieval, video description generation, and video question answering. However, this technology faces three major challenges: information redundancy, time dependence, and scene complexity. Because these challenges are complementary (for example, reducing redundant information in a video also significantly reduces the complexity of the video scene), it is particularly important to consider all three together.
Existing methods mainly address information redundancy by selecting meaningful tokens or key frames; however, such selection can break the spatial consistency of the information and thereby make temporal modeling more difficult. Other methods place the feature selection module at the end of the pipeline to handle the time-dependence challenge, but this may let the preceding modules introduce too much redundant information, which is not an optimal solution. Still other approaches simply ignore the redundancy of video information and focus only on the time dependence and scene complexity problems.
Thus, existing methods mainly address only one or two of the challenges and fail to take the remaining factors affecting video understanding into account, resulting in low accuracy and performance.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a video-language understanding method and system that jointly consider the three challenges of information redundancy, time dependence and scene complexity, and proposes an understanding model composed of three key components, each aimed at a specific challenge, thereby significantly improving the video-language understanding capability of an agent.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the first aspect of the present invention provides a video-language understanding method.
A video-language understanding method acquires the video and text to be understood,
and processes the input video and text with a trained understanding model to generate a final understanding result.
The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result.
The understanding model can handle at least text-video cross-modal retrieval, video description generation, and video question-answering tasks.
Further, the video encoder further comprises a ViT layer;
the ViT layer is used for carrying out block coding on each video frame to obtain image block embedding with semantic meaning, and the image block embedding belonging to the same video frame forms video frame embedding.
Further, filtering the image blocks of all video frames in the video with the clustering component specifically comprises:
grouping the video frame embeddings output by the ViT layers into S segments, each segment containing F/S frames;
clustering the F/S × (1+P_in) image blocks contained in each segment with a clustering algorithm to produce (1+P_in) clusters, and selecting the block located at the center of each cluster to form the video embedding V_k with redundant information removed;
where S is the number of segments, F is the number of video frames, and P_in is the number of blocks into which each video frame is partitioned.
Further, in the time sequence component, an information token mechanism is used to reconstruct the time sequence dependencies between video frame embeddings, specifically:
the video frame embedding V_k with redundant information removed is input to the ViT-ATM layers to obtain a video feature matrix with time sequence dependencies.
Further, the text query component integrates an L-layer MoED. The MoED consists of four modules: a bi-directional self-attention mechanism (BiSA), causal self-attention (Causal SA), cross attention (CA) and a feed-forward network (FFN); these four modules are combined into three variants that perform the corresponding task-specific text embedding, namely a text encoder, a video-based text encoder and a video-based text decoder.
For the text-video cross-modal retrieval task, the similarity between the video feature matrix and the text feature matrix obtained by the text encoder is calculated to obtain a set of videos that satisfy the similarity condition;
for the video description generation task, the video feature matrix is input into the video-based text decoder to generate text as the description text of the video;
for the video question-answering task, the video feature matrix and the question text are input into the video-based text encoder to obtain a multi-modal feature matrix, which is then input into the video-based text decoder to generate text as the answer to the question.
Further, the text encoder encodes the input text using BiSA and FFN at each layer, prepends a [CLS] token to the text input, and outputs a text feature matrix;
the video-based text encoder collects task-related visual information by adding CA between the BiSA and FFN in each layer of the text encoder; in the CA, the input text serves as the query and the video feature matrix serves as the key and value, generating a multi-modal feature matrix;
the video-based text decoder replaces the BiSA layers of the video-based text encoder with Causal SA layers and decodes the input multi-modal feature matrix into text.
Further, the text-video cross-modal retrieval task is divided into two stages, recall and reordering;
the recall stage recalls the Top-Q videos by calculating the cosine similarity between the video feature matrix and the [cls] token in the text feature matrix obtained by the text encoder;
and in the reordering stage, the video-text query is input into the video-based text encoder, the output [Encode] embedding is fed into a fully connected layer and a sigmoid function to obtain a final score, the Q videos are reordered according to the score, and a preset number of videos are taken to form the video set.
Further, the video question-answering task also comprises a multiple-choice question-answering task, which is treated as a classification task: the question and each candidate answer are concatenated into a complete sentence, which is then input together with the video feature matrix into the video-based text encoder to encode the video and question-answer text pair into a multi-modal feature matrix; finally, the multi-modal feature matrix is fed into a linear layer and a Softmax layer to obtain the score of the best answer.
Further, training of the understanding model specifically includes:
for the text encoder, a video-text contrastive loss is used to encourage the [cls] embeddings of matching video-text pairs to have more similar representations than those of non-matching pairs, so as to align the feature spaces of video and text;
for the video-based text encoder, a video-text matching loss is used to learn a video-text multi-modal representation and capture fine-grained alignment between video and text;
for the video-based text decoder, a language modeling loss is used to optimize a cross-entropy loss, training the model to maximize the likelihood of the text in an autoregressive manner.
A second aspect of the present invention provides a video-language understanding system.
A video-language understanding system comprising an acquisition module and an understanding module:
an acquisition module configured to: acquiring videos and texts to be understood;
an understanding module configured to: based on the trained understanding model, processing the input video and text to generate a final understanding result;
The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result.
The understanding model can handle at least text-video cross-modal retrieval, video description generation, and video question-answering tasks.
A third aspect of the present invention provides a computer readable storage medium having stored thereon a program which when executed by a processor performs steps in a video-language understanding method according to the first aspect of the present invention.
A fourth aspect of the invention provides an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in a video-language understanding method according to the first aspect of the invention when the program is executed.
The one or more of the above technical solutions have the following beneficial effects:
(1) To improve the video-language understanding capability of an agent and overcome the limitations of the prior art, the invention eliminates redundant information in the video and reconstructs the temporal dependencies of the features based on cross-modal embeddings of images and texts, and uses different architectures for different tasks to cope with the complexity of video scenes, helping the agent complete video-language understanding tasks effectively. Specifically, the invention designs two independent neural network modules (the video encoder and the text query component) containing three special components that cooperate to generate the final task objective, and each component may be implemented by any suitable method.
(2) The invention fuses multi-modal information to achieve natural language understanding and inference of video content. Specifically, the invention obtains an effective encoding of the video through the video encoder, and processes the text and the video encoding in different ways through the text query component to handle the three different video-language understanding tasks of cross-modal retrieval, video description generation and video question answering and obtain the corresponding results. This effectively reduces the influence of complex video scenes on the result and improves the understanding capability of the model.
(3) The invention uses the clustering component and the time sequence component in the video encoder module to extract the key blocks of the video frame embeddings and reconstruct the temporal dependencies, effectively reducing the redundancy of the video information while maintaining the time dependence of the embedded features, thereby further improving the robustness of the model.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a flow chart of a method of a first embodiment.
Fig. 2 is an overall process flow diagram of the first embodiment.
Fig. 3 is a flow chart of a first embodiment text-to-video cross-modality retrieval task.
Fig. 4 is a flowchart of a first embodiment video description generation task.
Fig. 5 is a flowchart of an open question-answering task in the video question-answering task according to the first embodiment.
Fig. 6 is a flowchart of a multiple choice question-answering task among the video question-answering tasks of the first embodiment.
Detailed Description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Term interpretation:
Self-Attention (SA): an attention mechanism for processing sequential data and the core component of the Transformer model. It maps queries, keys and values into a high-dimensional space and computes the similarity between them to learn the relationships between different positions of the input sequence, enabling the model to encode and model dependencies inside the sequence and capture long-range dependencies.
Transformer: a model built on the self-attention mechanism that uses multi-head self-attention to learn the relationships between different positions in parallel, improving the representation and generalization capability of the model.
ViT: a Transformer model used to perform feature extraction and representation learning on image sequences, capturing the relationships between different regions of an image and effectively learning global context information. ViT is composed of multiple Transformer modules, each containing several self-attention heads and a feed-forward neural network; these modules are stacked to build the ViT layers, and each module receives the output of the previous layer and produces a richer feature representation.
ViT-ATM: a further improvement of the Vision Transformer (ViT) model that adds a temporal module for processing video data sequences. The conventional ViT model can only process a single still image and cannot model a video sequence; ViT-ATM extends the ViT model by introducing a temporal block that captures the relationships between different frames of a video sequence and learns their temporal characteristics, so as to better understand video data.
k-medoids++: an improved clustering algorithm for partitioning a dataset into k different clusters. It selects the initial centers with a probability distribution (in the style of k-means++) to increase representativeness and diversity, uses actual sample points as the cluster centers, is more robust to outliers, and can handle non-Euclidean distance measures.
[cls] token: a special token used in natural language processing (NLP). It usually serves as the starting token of an input sequence and plays an important role in Transformer and other attention-based models.
Bi-directional Self Attention (BiSA): a variant of the attention mechanism that models the dependencies between the words of a sentence by considering contextual information in both directions. It uses two independent attention mechanisms: forward attention and backward attention. In forward attention, each word attends to all words preceding it to obtain forward context information; similarly, in backward attention, each word attends to all words following it to obtain backward context information. Combining the forward and backward attention outputs yields a more comprehensive representation that better captures the relationships between the words of a sentence.
Causal Self Attention (Causal SA): a variant of the attention mechanism mainly applied to sequence data processing. It introduces the concept of causality to ensure that the model can only rely on previous information when predicting. Causality is achieved by modifying the attention matrix so that each word can only attend to the words preceding it; in other words, the model can only predict the output of the current position from the context already observed and cannot use future information.
Cross Attention (CA): an attention mechanism that introduces correlations between multiple input sequences. It computes correlation scores between a query sequence and a key sequence and then uses these scores as weights for a weighted sum over the value sequence, realizing cross-sequence information transfer.
Feed Forward Network (FFN): a common feed-forward neural network structure in machine learning that transforms the input data through multiple fully connected layers and nonlinear activation functions to extract features and enhance the representation capability of the model.
BERT: a pre-trained language model that learns generic sentence representations through self-supervised training on large-scale unlabeled text. Unlike conventional language models, BERT adopts a Transformer architecture containing multiple layers of self-attention.
Cosine similarity: a measure of the similarity between two vectors, computed as the cosine of the angle between them and ranging from -1 to 1. When the cosine similarity is close to 1, the two vectors point in very similar directions; when it is close to -1, their directions are opposite; when it is close to 0, there is no obvious similarity between the two vectors.
The present invention takes the three aforementioned challenges into account jointly and proposes an understanding model, RTQ, that consists of three key components, each directed at a particular challenge. First, the first component (the clustering component) adopts a clustering method to eliminate redundant information in adjacent video frames and select representative blocks. Next, the second component (the time sequence component) perceives and interprets the temporal relationships between blocks through temporal modeling, thereby avoiding the problem of broken spatial consistency between the representative blocks. Finally, the third component (the text query component), comprising a text encoder, a video-based text encoder and a video-based text decoder, gradually gathers task-related information through language queries. These three components, each of which can be implemented by any suitable method, effectively address the three challenges and significantly enhance the video-language understanding capability of the agent.
Example 1
In one or more embodiments, a video-language understanding method is disclosed, as shown in fig. 1, comprising:
step one: acquiring videos and texts to be understood;
step two: based on the trained understanding model, processing the input video and text to generate a final understanding result;
The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result.
The understanding model can handle at least text-video cross-modal retrieval, video description generation, and video question-answering tasks.
The implementation procedure of a video-language understanding method of this embodiment will be described in detail.
This embodiment aims to design a novel deep learning model that learns video features and text features and improves the video-language understanding capability of an agent according to the task, effectively addressing the three challenges of information redundancy, time dependence and scene complexity in the agent's video-language understanding process, and completing the three application tasks of text-to-video cross-modal retrieval, video description generation and video question answering. The overall processing flow, shown in fig. 2, specifically comprises:
First, given a video, each video frame is encoded by the Vision Transformer (ViT) layers in the video encoder to obtain semantically meaningful image block embeddings; the clustering component then eliminates redundant blocks while preserving representative ones.
The remaining blocks are input to the ViT-ATM component to capture the time dependence between video frames and generate a video feature matrix as the output of the video encoder.
Finally, the text query component gathers task-related information layer by layer from the video feature matrix and the text, and outputs the result corresponding to the task.
The three components may be implemented by any suitable method; this embodiment describes one implementation, which specifically includes the following steps:
S1: construct video frame embeddings that contain temporal relationships and use the clustering component to eliminate redundant information, obtaining video features with the redundancy removed. The specific steps are as follows:
S11: the video input to the video encoder is embedded by K ViT layers into semantically meaningful image block embeddings of size F × (1+P_in) × d, where F is the number of video frames input to the model, 1 corresponds to the [cls] token, P_in is the number of blocks each video frame is divided into by the ViT layers, and d is the hidden dimension.
S12: before the K-th ViT layer, a [cls] token is prepended to each video frame embedding to generate information token embeddings, and self-attention (SA) is performed over these information tokens to obtain the temporal dependencies between video frames;
the resulting temporal-dependency embeddings m_k are then fed into the K-th ViT layer together with the other frame embeddings to obtain the image block embeddings.
S13: the video frame embeddings output by the K-th ViT layer are grouped into S segments, each containing F/S frames.
S14: the F/S × (1+P_in) image blocks contained in each segment are clustered using the k-medoids++ algorithm to produce (1+P_in) clusters, and the block located at the center of each cluster is selected to form the video embedding V_k with redundant information removed.
The clustering algorithm is not limited to k-medoids++; other clustering methods may also be used.
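The patent does not prescribe a concrete implementation of steps S13-S14; the following Python sketch illustrates one possible segment-wise medoid selection over ViT block embeddings. The cosine-distance choice, the simplified k-medoids++-style initialization, and all function names are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def kmedoids_pp(dist: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Simplified k-medoids with a k-means++-style probabilistic initialization.
    dist: (N, N) pairwise distance matrix; returns k medoid indices."""
    n = dist.size(0)
    medoids = [int(torch.randint(n, (1,)))]
    for _ in range(k - 1):                                # spread-out initialization
        d_min = dist[:, medoids].min(dim=1).values
        probs = d_min / d_min.sum().clamp_min(1e-8)
        medoids.append(int(torch.multinomial(probs, 1)))
    medoids = torch.tensor(medoids)
    for _ in range(iters):                                # alternate assignment / medoid update
        assign = dist[:, medoids].argmin(dim=1)
        for c in range(k):
            members = (assign == c).nonzero(as_tuple=True)[0]
            if len(members) > 0:                          # new medoid minimizes intra-cluster distance
                medoids[c] = members[dist[members][:, members].sum(dim=1).argmin()]
    return medoids

def select_medoid_blocks(block_emb: torch.Tensor, num_segments: int) -> torch.Tensor:
    """block_emb: (F, 1 + P_in, d) block embeddings from the K-th ViT layer.
    Returns (num_segments, 1 + P_in, d), keeping one representative block per cluster.
    Assumes F is divisible by num_segments."""
    num_frames, tokens, d = block_emb.shape
    frames_per_seg = num_frames // num_segments
    kept = []
    for s in range(num_segments):
        seg = block_emb[s * frames_per_seg:(s + 1) * frames_per_seg]   # (F/S, 1+P_in, d)
        pts = seg.reshape(-1, d)                                       # F/S * (1+P_in) blocks
        dist = 1 - F.cosine_similarity(pts.unsqueeze(1), pts.unsqueeze(0), dim=-1)
        kept.append(pts[kmedoids_pp(dist, k=tokens)])                  # (1+P_in) medoid blocks
    return torch.stack(kept)
```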
S2: and reconstructing the video features after the redundant information is eliminated through a time sequence component so as to maintain the time dependence of the video features.
Specifically, in the time sequence component an information token mechanism is used to reconstruct the temporal dependencies between video embeddings; that is, the time sequence component consists of (L-K) ViT-ATM layers, where L is the total number of layers of the video encoder. The video embedding V_k with redundant information removed, obtained in step S14, is input to the time sequence component, finally yielding a temporally dependent video embedding that serves as the video feature matrix output by the video encoder.
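The internal structure of the ViT-ATM layers is only characterized above (an information token mechanism that reconstructs temporal dependencies); the sketch below shows one hypothetical temporal block in that spirit, with per-segment information tokens attending to each other across segments before a standard spatial Transformer block. All module and argument names are assumptions, not the actual ViT-ATM definition:

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Hypothetical ViT-ATM-style block: temporal attention over the
    per-segment information tokens, followed by a standard spatial block."""
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.spatial_block = nn.TransformerEncoderLayer(d, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (S, 1 + P, d) segment embeddings; token 0 is the information token
        info = self.norm(x[:, 0:1, :]).transpose(0, 1)     # (1, S, d): info tokens across segments
        t, _ = self.temporal_attn(info, info, info)        # temporal self-attention across segments
        x = torch.cat([x[:, 0:1, :] + t.transpose(0, 1),   # write temporal context back to token 0
                       x[:, 1:, :]], dim=1)
        return self.spatial_block(x)                       # per-segment spatial attention + FFN
```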
S3: for three tasks of cross-modal retrieval, video description generation and video question-answering, a text query component is used for task-specific text embedding, and a final understanding result is obtained, including:
s31: after the first two components, the video has been encoded into a time-aware representation with high information density, i.e., a video feature matrix; however, because of the complexity of video scenes, there still exist a lot of information unrelated to tasks; to address this problem, a text query component is introduced that uses task-specific queries to gradually collect relevant details and generate final understanding results.
The text query component integrates an L-layer MoED, which consists of four modules: BiSA, Causal SA, CA and FFN. These four modules are combined into three variants to complete the corresponding tasks.
The three variants of MoED are:
(1) The text encoder, like BERT, encodes the text using BiSA and FFN at each layer, with a [CLS] token prepended to the text input to summarize it.
(2) The video-based text encoder collects task-related visual information through cross attention (CA) inserted between the bi-directional self-attention (BiSA) and the feed-forward network (FFN) in each layer of the text encoder. In the cross attention, the text input serves as the query and the video feature matrix serves as the keys and values. For task-specific purposes, an [Encode] token is added to the text input, and the resulting embedding contains a multi-modal representation of the video-text pair.
(3) The video-based text decoder is responsible for gathering task-specific visual information to generate the desired text output; it replaces the BiSA layers of the video-based text encoder with Causal SA layers. The start of the sequence is marked with a [Decode] token and the end with an end-of-sequence token.
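The following sketch illustrates how one MoED layer could combine the four modules and be reused for the three variants described above (BiSA + FFN for the text encoder, plus CA for the video-based encoder, plus a causal mask for the decoder). The class and its interface are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class MoEDLayer(nn.Module):
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)   # BiSA or Causal SA
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # CA (text -> video)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, text, video=None, causal=False):
        # text: (B, T, d) text embeddings; video: (B, N, d) video feature matrix or None
        T = text.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=text.device), 1) if causal else None
        h, _ = self.self_attn(self.n1(text), self.n1(text), self.n1(text), attn_mask=mask)
        x = text + h
        if video is not None:                                  # only for the video-based variants
            h, _ = self.cross_attn(self.n2(x), video, video)   # text queries, video keys/values
            x = x + h
        return x + self.ffn(self.n3(x))

# Variant usage (per layer):
#   text encoder:             layer(text)                       BiSA + FFN
#   video-based text encoder: layer(text, video)                BiSA + CA + FFN
#   video-based text decoder: layer(text, video, causal=True)   Causal SA + CA + FFN
```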
The three variants are utilized to show the video-language understanding capability of the intelligent agent for three tasks of text-video cross-modal retrieval, video description generation and video question-answering, and specifically comprise the following steps:
(1) For the text-video cross-modal retrieval task, the task is divided into two stages of recall and reordering.
As shown in fig. 3, the recall stage first recalls the Top-Q videos by calculating the cosine similarity between the video feature matrix and the [cls] token in the text embedding obtained by the text encoder. Then, in the reordering stage, the video-text query is input into the video-based text encoder, the output [Encode] embedding is fed into a fully connected layer and a sigmoid function to obtain a final score, and the Q videos are reordered according to this score; the SA layers and FFN layers of the text encoder and the video-based text encoder share parameters.
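A hedged sketch of the recall-then-rerank procedure described above; the encoder and head interfaces (`vid_text_encoder`, `rerank_head`) are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def retrieve(text_cls, video_cls, videos, text_tokens,
             vid_text_encoder, rerank_head, Q=16, top_k=5):
    """text_cls: (d,) [cls] vector of the text query; video_cls: (N, d) per-video [cls] vectors;
    videos: list of N video feature matrices. Returns indices of the top_k videos."""
    Q = min(Q, video_cls.size(0))
    sim = F.cosine_similarity(text_cls.unsqueeze(0), video_cls, dim=-1)    # recall stage
    cand = sim.topk(Q).indices                                             # Top-Q candidates
    scores = []
    for i in cand.tolist():                                                # reordering stage
        enc = vid_text_encoder(text_tokens, videos[i])                     # (T, d) multi-modal features
        scores.append(torch.sigmoid(rerank_head(enc[0])))                  # [Encode] token -> FC -> sigmoid
    order = torch.stack(scores).squeeze(-1).argsort(descending=True)
    return cand[order][:top_k]
```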
(2) For the video description generation task, as shown in fig. 4, a video feature matrix is input to a video-based text decoder to generate description text for the video.
(3) The video question-answering task comprises an open-ended question-answering task and a multiple-choice question-answering task.
For the open-ended question-answering task, as shown in fig. 5, the video feature matrix and the question text are input into the video-based text encoder to obtain a multi-modal feature matrix, which is then input into the video-based text decoder to generate the answer; the encoder and the decoder share parameters.
For the multiple-choice question-answering task, which is treated as a classification task, as shown in fig. 6, the question and each candidate answer are concatenated into a complete sentence, which is then input together with the video feature matrix into the video-based text encoder to encode the video and question-answer text pair into a multi-modal feature matrix; finally, the multi-modal feature matrix is fed into a linear layer and a Softmax layer to obtain the score of the best answer.
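The multiple-choice scoring described above can be sketched as follows; the tokenizer, encoder, and classification head are hypothetical placeholders:

```python
import torch

def answer_multiple_choice(question, candidates, video_feats,
                           tokenizer, vid_text_encoder, cls_head):
    """Scores each (question + candidate answer) sentence against the video
    and returns the index of the best-scoring candidate."""
    logits = []
    for ans in candidates:
        tokens = tokenizer(question + " " + ans)       # question and answer as one sentence
        enc = vid_text_encoder(tokens, video_feats)    # (T, d) multi-modal feature matrix
        logits.append(cls_head(enc[0]))                # linear layer on the [Encode] token
    probs = torch.softmax(torch.stack(logits).squeeze(-1), dim=0)
    return int(probs.argmax())
```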
S4: reasoning and training, including in particular:
(1) For the text-to-video cross-modal retrieval task, the text encoder and the video-based text encoder are jointly trained.
For the text encoder, the video-text contrastive loss (VTC) is used to encourage the [cls] embeddings of matching video-text pairs to have more similar representations than those of non-matching pairs, so as to align the feature spaces of video and text.
First, for the i-th video-text pair, their [CLS] embeddings are passed, following CLIP [5], through a linear projection layer and an L2 normalization layer to obtain the video hidden vector and the text hidden vector.
To maximize the benefit of large-batch contrastive learning, three memory banks are maintained to store the most recent M video vectors, text vectors, and the corresponding videos.
Then the text-to-video contrastive loss and the video-to-text contrastive loss are calculated,
where the positives are the matching pairs and τ is a learnable temperature parameter.
Finally, the two losses are combined to obtain the VTC loss.
To compensate for potential false negative samples, the momentum distillation strategy of ALBEF [6] is used, in which a momentum encoder generates soft labels.
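A minimal sketch of a symmetric video-text contrastive loss with memory banks, in the spirit of the VTC description above; the exact loss in the patent (positive-set handling and soft labels from momentum distillation) is not reproduced here, and the function signature is an assumption:

```python
import torch
import torch.nn.functional as F

def vtc_loss(v, t, v_bank, t_bank, temperature):
    """v, t: (B, d) L2-normalized video / text hidden vectors of matching pairs;
    v_bank, t_bank: (M, d) memory banks of recent vectors; temperature: scalar tensor."""
    t2v = t @ torch.cat([v, v_bank]).T / temperature    # (B, B+M) text-to-video similarities
    v2t = v @ torch.cat([t, t_bank]).T / temperature    # (B, B+M) video-to-text similarities
    targets = torch.arange(v.size(0), device=v.device)  # the in-batch pair is the positive
    return 0.5 * (F.cross_entropy(t2v, targets) + F.cross_entropy(v2t, targets))
```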
For the video-based text encoder, the video-text matching loss (VTM) is used to capture fine-grained alignment between video and text and to learn a video-text multi-modal representation. VTM is a binary classification task in which the model uses a VTM head (a linear layer) to predict whether a video-text pair is positive (matching) or negative (non-matching), given the multi-modal feature of the [Encode] token.
For the i-th video-text pair, its positive matching score is first calculated; a video/text is then randomly sampled to replace the original one, yielding a negative matching score; finally, the video-text matching loss is obtained from the two.
To make the VTM loss more informative, the negatives are sampled with a hard negative mining strategy.
The VTC loss and the VTM loss are added to obtain the final loss.
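A simplified sketch of the VTM objective as a binary classification on the [Encode] feature; in-batch random negatives stand in for the hard negative mining described above, and the encoder/head interfaces are assumptions:

```python
import torch
import torch.nn.functional as F

def vtm_loss(vid_text_encoder, vtm_head, text_tokens, video_feats):
    """text_tokens: list of B tokenized texts; video_feats: list of the B matching
    video feature matrices (B > 1 assumed so a mismatched negative always exists)."""
    B = len(text_tokens)
    feats, labels = [], []
    for i in range(B):
        feats.append(vid_text_encoder(text_tokens[i], video_feats[i])[0])   # positive pair
        labels.append(1)
        j = (i + int(torch.randint(1, B, (1,)))) % B                        # random mismatched video
        feats.append(vid_text_encoder(text_tokens[i], video_feats[j])[0])   # negative pair
        labels.append(0)
    logits = vtm_head(torch.stack(feats))                                   # (2B, 2) VTM head outputs
    return F.cross_entropy(logits, torch.tensor(labels, device=logits.device))
```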
(2) For the video description generation task, the language modeling (LM) loss is used for the decoder: a cross-entropy loss is optimized, training the model to maximize the likelihood of the text in an autoregressive manner. For each video-text pair (v, t), the loss takes the form L_LM = -Σ_{l=1..L} log P(t_l | t_<l, v),
where L is the total length of the sentence; label smoothing of 0.1 is used when computing the loss. Compared with the masked language modeling loss widely used in video-language pre-training, the LM loss gives the model the generative capability to convert visual information into coherent captions.
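A minimal sketch of the autoregressive LM loss with label smoothing 0.1, assuming a hypothetical causal decoder interface:

```python
import torch
import torch.nn.functional as F

def lm_loss(decoder, caption_ids, video_feats):
    """caption_ids: (L,) token ids of the caption; video_feats: video feature matrix."""
    inputs, targets = caption_ids[:-1], caption_ids[1:]    # shift by one for next-token prediction
    logits = decoder(inputs, video_feats)                  # (L-1, vocab) causal decoder outputs
    return F.cross_entropy(logits, targets, label_smoothing=0.1)
```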
(3) For the video question-answering task, open-ended question answering uses the LM loss, while multiple-choice question answering uses the VTM loss. Unlike text-to-video retrieval, the negative samples come from incorrect question-answer pairs rather than being generated by sampling.
This embodiment aims to improve the video-language understanding capability of an agent. Systematic analysis shows that current video-language understanding methods focus on limited aspects of the task, and that methods targeting different challenges can complement one another. In view of this, a framework is proposed that integrates redundancy elimination, temporal modeling, and query components to jointly address information redundancy, temporal dependencies, and scene complexity, respectively. Extensive experimental evaluation demonstrates the effectiveness and superiority of the method of this embodiment. Future work will pre-train the model to help it acquire more knowledge and develop more efficient redundancy elimination, temporal modeling, and query components to further improve overall performance, with the hope of contributing to the development of agent video-language understanding technology.
Example two
In one or more embodiments, a video-language understanding system is disclosed that includes an acquisition module and an understanding module:
an acquisition module configured to: acquiring videos and texts to be understood;
an understanding module configured to: based on the trained understanding model, processing the input video and text to generate a final understanding result;
The understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component. The clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and, based on the video feature matrix, the text query component embeds the text to obtain the final understanding result.
The understanding model can handle at least text-video cross-modal retrieval, video description generation, and video question-answering tasks.
Example III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs steps in a video-language understanding method according to an embodiment of the present disclosure.
Example IV
An object of the present embodiment is to provide an electronic apparatus.
An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing steps in a video-language understanding method as described in the first embodiment of the present disclosure when the program is executed.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video-language understanding method, comprising:
acquiring videos and texts to be understood;
based on the trained understanding model, processing the input video and text to generate a final understanding result;
wherein the understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component; the clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and the text query component, based on the video feature matrix, embeds the text to obtain the final understanding result;
and wherein the understanding model can handle at least a text-video cross-modal retrieval task, a video description generation task, or a video question-answering task.
2. The video-language understanding method of claim 1, wherein the video encoder further comprises a ViT layer;
the video encoder comprises ViT layers, a clustering component and a time sequence component which are sequentially connected;
the ViT layer is used for carrying out block coding on each video frame to obtain image block embedding with semantic meaning, and the image block embedding belonging to the same video frame forms video frame embedding.
3. The video-language understanding method according to claim 2, wherein filtering the image blocks of all video frames in the video with the clustering component specifically comprises:
grouping the video frame embeddings output by the ViT layers into S segments, each segment containing F/S frames;
clustering the F/S × (1+P_in) image blocks contained in each segment with a clustering algorithm to produce (1+P_in) clusters, and selecting the block located at the center of each cluster to form the video embedding V_k with redundant information removed;
where S is the number of segments, F is the number of video frames, and P_in is the number of blocks into which each video frame is partitioned.
4. The video-language understanding method of claim 1, wherein in the time sequence component an information token mechanism is used to reconstruct the time sequence dependencies between video frame embeddings, specifically:
the video frame embedding V_k with redundant information removed is input to the ViT-ATM layers to obtain a video feature matrix with time sequence dependencies.
5. The video-language understanding method of claim 1, wherein the text query component integrates an L-layer MoED; the MoED consists of four modules, namely a bi-directional self-attention mechanism BiSA, causal self-attention Causal SA, cross attention CA and a feed-forward network FFN, and these four modules form three variants that perform the corresponding task-specific text embedding, namely a text encoder, a video-based text encoder and a video-based text decoder;
for the text-video cross-modal retrieval task, the similarity between the video feature matrix and the text feature matrix obtained by the text encoder is calculated to obtain a set of videos satisfying the similarity condition;
for the video description generation task, the video feature matrix is input into the video-based text decoder to generate text as the description text of the video;
for the video question-answering task, the video feature matrix and the question text are input into the video-based text encoder to obtain a multi-modal feature matrix, which is then input into the video-based text decoder to generate text as the answer to the question.
6. The video-language understanding method of claim 5, wherein the text encoder encodes the input text using BiSA and FFN at each layer, prepends a [CLS] token to the text input, and outputs a text feature matrix;
the video-based text encoder collects task-related visual information by adding CA between the BiSA and FFN in each layer of the text encoder; in the CA, the input text serves as the query and the video feature matrix serves as the key and value, generating a multi-modal feature matrix;
the video-based text decoder replaces the BiSA layers of the video-based text encoder with Causal SA layers and decodes the input multi-modal feature matrix into text.
7. A video-language understanding method as claimed in claim 1, wherein the text-video cross-modality retrieval task is performed in two stages of recall and reorder;
the recall stage recalls the Top-Q videos by calculating the cosine similarity between the video feature matrix and the [cls] token in the text feature matrix obtained by the text encoder;
and in the reordering stage, the video-text query is input into the video-based text encoder, the output [Encode] embedding is fed into a fully connected layer and a sigmoid function to obtain a final score, the Q videos are reordered according to the score, and a preset number of videos are taken to form the video set.
8. The video-language understanding method of claim 1, wherein the video question-answering task further comprises a multiple-choice question-answering task, which is treated as a classification task: the question and each candidate answer are concatenated into a complete sentence, which is then input together with the video feature matrix into the video-based text encoder to encode the video and question-answer text pair into a multi-modal feature matrix; finally, the multi-modal feature matrix is fed into a linear layer and a Softmax layer to obtain the score of the best answer.
9. The video-language understanding method according to claim 1, wherein the training of the understanding model is specifically:
for the text encoder, a video-text contrastive loss is used to encourage the [cls] embeddings of matching video-text pairs to have more similar representations than those of non-matching pairs, so as to align the feature spaces of video and text;
for the video-based text encoder, a video-text matching loss is used to learn a video-text multi-modal representation and capture fine-grained alignment between video and text;
for the video-based text decoder, a language modeling loss is used to optimize a cross-entropy loss, training the model to maximize the likelihood of the text in an autoregressive manner.
10. A video-language understanding system, comprising an acquisition module and an understanding module:
an acquisition module configured to: acquiring videos and texts to be understood;
an understanding module configured to: based on the trained understanding model, processing the input video and text to generate a final understanding result;
wherein the understanding model comprises a video encoder and a text query component, and the video encoder comprises a clustering component and a time sequence component; the clustering component filters the image blocks of all video frames in the video to obtain video frame embeddings with redundant information removed; the time sequence component reconstructs the time sequence dependencies between the video frame embeddings to obtain a video feature matrix; and the text query component, based on the video feature matrix, embeds the text to obtain the final understanding result;
and wherein the understanding model can handle at least a text-video cross-modal retrieval task, a video description generation task, or a video question-answering task.
CN202311179398.5A 2023-09-13 2023-09-13 Video-language understanding method and system Pending CN117251821A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311179398.5A CN117251821A (en) 2023-09-13 2023-09-13 Video-language understanding method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311179398.5A CN117251821A (en) 2023-09-13 2023-09-13 Video-language understanding method and system

Publications (1)

Publication Number Publication Date
CN117251821A true CN117251821A (en) 2023-12-19

Family

ID=89134219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311179398.5A Pending CN117251821A (en) 2023-09-13 2023-09-13 Video-language understanding method and system

Country Status (1)

Country Link
CN (1) CN117251821A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117640947A (en) * 2024-01-24 2024-03-01 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium
CN117640947B (en) * 2024-01-24 2024-05-10 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium
CN117765450A (en) * 2024-02-20 2024-03-26 浪潮电子信息产业股份有限公司 Video language understanding method, device, equipment and readable storage medium
CN117765450B (en) * 2024-02-20 2024-05-24 浪潮电子信息产业股份有限公司 Video language understanding method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN114169330B (en) Chinese named entity recognition method integrating time sequence convolution and transform encoder
CN117251821A (en) Video-language understanding method and system
CN111309971A (en) Multi-level coding-based text-to-video cross-modal retrieval method
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
Yu et al. Learning from inside: Self-driven siamese sampling and reasoning for video question answering
CN114996513B (en) Video question-answering method and system based on cross-modal prompt learning
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN114627162A (en) Multimodal dense video description method based on video context information fusion
CN113423004B (en) Video subtitle generating method and system based on decoupling decoding
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
CN114821271A (en) Model training method, image description generation device and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
Hsu et al. Video summarization with spatiotemporal vision transformer
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN115719510A (en) Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning
Alkalouti et al. Encoder-decoder model for automatic video captioning using yolo algorithm
CN116091978A (en) Video description method based on advanced semantic information feature coding
CN115906857A (en) Chinese medicine text named entity recognition method based on vocabulary enhancement
Le et al. Learning to reason with relational video representation for question answering
CN116863920B (en) Voice recognition method, device, equipment and medium based on double-flow self-supervision network
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Phuc et al. Video captioning in Vietnamese using deep learning
CN115310461A (en) Low-resource speech translation method and system based on multi-modal data optimization
CN114612826A (en) Video and text similarity determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination