CN114332729B - Video scene detection labeling method and system - Google Patents

Video scene detection labeling method and system

Info

Publication number
CN114332729B
CN114332729B (application number CN202111678887.6A)
Authority
CN
China
Prior art keywords: window, video, audio, text, scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111678887.6A
Other languages
Chinese (zh)
Other versions
CN114332729A (en
Inventor
徐亦飞
桑维光
罗海伦
李斌
徐武将
朱利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202111678887.6A
Publication of CN114332729A
Application granted
Publication of CN114332729B
Legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video scene detection and labeling method and system. From the input video, audio and text-embedding modal information sources, pre-trained models are adopted to acquire the modal features of the video, audio and text. The acquired modal features are aligned and fused to form a window basic cross-modal representation, which is evolved into an adaptive context-aware representation according to multi-temporal attention and the differences between adjacent windows. Scenes are then detected from the adaptive context-aware representation: the attribute of each window is determined by a window attribute classifier, and the accurate position of the scene boundary within the window is obtained by a position offset regressor. Based on the obtained scene boundaries, a plurality of labels are assigned to each scene to realize scene annotation. Scene detection is decomposed into window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of a two-stage classifier. The problems of error propagation and huge computational cost are solved through a unified network of cross-modal cues.

Description

Video scene detection labeling method and system
Technical Field
The invention belongs to the field of video processing, and particularly relates to a video scene detection and labeling method and system.
Background
With the rapid development of 5G technology, video advertising has seen tremendous growth in short-video applications. In the creation, delivery and policy stages of the advertisement ecosystem, a deep understanding of advertisement content becomes increasingly important, and the requirements are correspondingly higher. As a key step in the semantic understanding of video advertisements, the purpose of scene detection and annotation is to temporally parse a video into different scenes and to predict labels for each scene along different dimensions, such as presentation form, style and location. This is useful in a variety of potential applications, including video ad insertion, video summarization, and video indexing and retrieval.
In previous work, scene detection and annotation were studied independently and sequentially. A recent series of efforts on scene detection tends to mark a salient shot as a scene boundary. This suffers from three problems: 1) Error propagation: errors caused by shot detection propagate when determining whether a shot boundary is a scene boundary. 2) Huge computational cost: if scene detection is built directly on frame features, a significant amount of computation is consumed during training. 3) Poor generalization: with segmented scenes of variable length, the scene detector sometimes cannot generalize across different scenes. Video annotation has also been studied, but most existing methods infer descriptions of events in the video and do not perform well on our task; because there are few tags, they are not accurate enough to mark complex scenes. In general, existing studies treat scene annotation as post-processing of scene detection, resulting in a serious dependence on scene detection accuracy.
With the popularity of deep learning, several methods utilize CNNs for scene detection. Under an unsupervised setting, some methods extract joint representations of shots from visual (audio, text) cues and then predict scene boundaries based on shot similarity. However, a disadvantage of these methods is that they rely heavily on manually tuned settings for different videos. In recent years, with the advent of manually labeled datasets, supervised learning methods have evolved. The MovieNet dataset is a holistic multimodal movie-understanding dataset that facilitates the analysis of complex scene semantics. Although multi-modal pre-training schemes have made significant progress in scene detection, they require strong assumptions about the semantic relevance of the auxiliary task and incur expensive computational costs.
For multi-label video annotation, a set of interesting concepts is used to tag the video, including scenes, objects, events, etc. Many efforts have been made to label videos with multiple labels in a versatile manner. One unified framework models reliable relationships between evoked moods and movie types with CNN-specific feature combinations. Most of these methods capture temporal information from short frame sequences. Although GCN-based methods have achieved great success in modeling intra- and inter-frame relationships, they still suffer from two problems for multi-label video classification. 1) Scalability: as video duration increases, model complexity grows geometrically. 2) Compatibility: matrix reasoning requires a fixed length in the time dimension, which is not suitable for annotating video scenes of variable duration.
Disclosure of Invention
The invention aims to provide a video scene detection and labeling method and system to overcome the defects of the prior art.
A video scene detection labeling method comprises the following steps:
s1, acquiring modal characteristics of video, audio and text by adopting a pre-training model according to an embedded modal information source of the input video, audio and text;
s2, aligning and fusing the mode characteristics of the acquired video, audio and text to form a basic cross-mode representation of the window;
s3, according to the difference between the multi-temporal attention and the adjacent window, the basic cross-modal representation of the window is evolved into a self-adaptive context sensing representation;
s4, detecting a scene according to the acquired self-adaptive context sensing representation, determining the attribute of a window through a window attribute classifier, and acquiring the accurate position of a scene boundary in the window through a position offset regressor; and designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation.
Further, according to the modal information source of the video, a Swin Transformer is used to generate visual features F_visual of dimension h×L×C_v; a VGGish network encodes the audio recording into L×C_a dimensional vectors to form the audio features F_audio; and a BERT network providing 512×C_t channels is used to obtain the text embedding features F_text.
Furthermore, the modal features of the video and audio are encoded and connected through adjacent branches of successive LayerNorm+Conv1D layers to obtain video-audio features; channel attention is combined with the video-audio features, and a multi-head attention mechanism is adopted to force the visual information, audio and text to be explicitly aligned semantically.
Further, the window basic cross-modal representation is evolved into an adaptive context-aware representation using long-time dependencies between different view windows, multi-scale extended attention modules, and shift window sequence modules.
Further, the multi-scale expansion attention module comprises a multi-scale expansion window and a context awareness attention module, wherein the multi-scale expansion window operation is adopted to establish time dependence relation for all windows, and when all context representations are constructed, the context representations are input into a linear layer to generate a feature vector with 2C dimension.
Further, the feature is filled with zero-fill windows to make the input and output sizes uniform.
Further, the shift window sequence module calculates the window-shifted WBCR by subtracting the window's WBCR from the WBCR of the previous window, captures the temporal order using a sequential transduction network, and then performs a linear operation to obtain the SWSM output F_sw.
A video scene detection labeling system comprises a feature acquisition module, a feature fusion module and a detection labeling module;
the feature acquisition module is used for acquiring the modal features of the video, the audio and the text based on the pre-training model according to the modal information sources embedded in the input video, the audio and the text;
the feature fusion module is used for aligning and fusing the acquired modal features of the video, the audio and the text to form a window basic cross-modal representation, and according to the multi-temporal attention and the difference between adjacent windows, the window basic cross-modal representation is evolved into a self-adaptive context sensing representation;
the detection labeling module detects the scene according to the acquired self-adaptive context sensing representation, determines the attribute of the window through the window attribute classifier, and acquires the accurate position of the scene boundary in the window through the position offset regressor; and designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention relates to a video scene detection labeling method, which comprises the steps of acquiring modal characteristics of video, audio and text by adopting a pre-training model according to an input video, audio and text embedded modal information source, aligning and fusing the acquired modal characteristics of the video, audio and text to form a window basic cross-modal representation, converting the window basic cross-modal representation into a self-adaptive context perception representation according to the difference between multi-temporal attention and adjacent windows, detecting a scene according to the acquired self-adaptive context perception representation, determining the attribute of the window by a window attribute classifier, and acquiring the accurate position of a scene boundary in the window by a position offset regressor; based on the obtained scene boundaries, designating a plurality of labels for each scene to realize scene annotation, classifying the scene detection into window attribute classification and position offset regression, and solving the problem of multi-label annotation through the integrated learning of a two-stage classifier.
The invention solves the problems of error propagation and huge calculation cost through a unified network of cross-modal clues; and (3) classifying the scene detection into window attribute classification and position offset regression, and solving the multi-label labeling problem through the integrated learning of the two-stage classifier.
Drawings
Fig. 1 is an overall framework diagram of a multi-Modal Adaptive Context Network (MACN) of the present invention;
FIG. 2 is a comparison diagram of MSCD module and MDC (multi-scale extended convolution) proposed in the present invention to address scene annotation;
fig. 3 is a flow chart of the MDAM module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawing figures:
in order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "comprises" and "comprising," along with any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
A video scene detection labeling method comprises the following steps:
s1, acquiring modal characteristics of video, audio and text by adopting a pre-training model according to an embedded modal information source of the input video, audio and text;
s2, aligning and fusing the modal characteristics of the acquired video, audio and text to form a window basic cross-modal representation (WBCR) serving as a sharing task agnostic characteristic;
s3, according to the difference between the multi-temporal attention and the adjacent window, the basic cross-modal representation of the window is evolved into an adaptive context sensing representation (ACR);
s4, detecting a scene according to the acquired self-adaptive context sensing representation, determining the attribute of a window through a window attribute classifier, and acquiring the accurate position of a scene boundary in the window through a position offset regressor; and designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation.
The window attribute classifier is implemented as a binary classifier, specifically an FC layer followed by a Sigmoid activation function.
The concrete model structure is shown in fig. 1:
in S1, the pre-training model (pre-training network) is used to obtain the respective modal characteristics of the input video, audio and text, and the specific process of obtaining the modal characteristics of the video, audio and text is as follows:
To obtain the modal features of the video, i.e., the visual features, a Swin Transformer is pre-trained on ImageNet (a visual database); using this Swin Transformer, visual features F_visual of dimension h×L×C_v are generated from the modal information source of the video.
A VGGish network is obtained by pre-training on AudioSet and is used to encode the audio recording into L×C_a dimensional vectors, which form the modal features of the audio, i.e., the audio features F_audio.
A BERT network is obtained by pre-training on the BookCorpus and Wikipedia datasets and provides 512×C_t channels, yielding the text embedding features F_text.
The modal features of the video, audio and text embedding are thus characterized by three two-dimensional matrices. If the video duration is less than the set duration of L seconds, it is zero-padded to form an L-second video; if the video duration exceeds L seconds, it is truncated from the end, leaving an L-second video clip.
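As an illustrative sketch of this preprocessing step (not the patented implementation), the Python fragment below pads or truncates the per-window feature matrices to the fixed length L before fusion. The encoder callables swin_encode, vggish_encode and bert_encode are hypothetical placeholders for the pre-trained Swin Transformer, VGGish and BERT models, and the value of L is assumed.

```python
import numpy as np

L = 120  # assumed maximum video length in one-second windows (hypothetical value)

def pad_or_truncate(feat: np.ndarray, length: int = L) -> np.ndarray:
    """Zero-pad a (T, C) feature matrix to (length, C), or truncate from the end."""
    t, c = feat.shape
    if t < length:
        return np.concatenate([feat, np.zeros((length - t, c), feat.dtype)], axis=0)
    return feat[:length]

def extract_modal_features(video, audio, text, swin_encode, vggish_encode, bert_encode):
    """Return the three 2-D modal feature matrices described in s1.

    swin_encode / vggish_encode / bert_encode are assumed wrappers around the
    pre-trained models; each returns a (T, C_x) array for its modality.
    """
    f_visual = pad_or_truncate(swin_encode(video))   # (L, C_v), pooled over h
    f_audio = pad_or_truncate(vggish_encode(audio))  # (L, C_a)
    f_text = bert_encode(text)                       # (512, C_t), fixed token length
    return f_visual, f_audio, f_text
```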
The modal features of the video, audio and text acquired in S1 are aligned and fused to provide a window basic cross-modal representation (WBCR) as a shared task agnostic feature.
The specific working procedure is as follows:
For vision and audio, F_visual and F_audio are encoded and concatenated by adjacent branches of successive LayerNorm+Conv1D layers to obtain the coarse video-audio features F_c_va. Channel attention is combined with F_c_va, and Dropout and LayerNorm are then applied to obtain the video-audio features F_va. For text embedding, a multi-head attention mechanism is employed to force the visual information, audio and text to be explicitly aligned semantically. Specifically, F_va is embedded into the query matrix Q_i = F_va·W_i^q, and the key matrix K_i = F_text·W_i^k and value matrix V_i = F_text·W_i^v are constructed; the aligned text embedding is then computed as:
head_i = Attention(Q_i, K_i, V_i)
F_atext = Concat(head_1, head_2, ..., head_r)·W^o
where the projections are parameter matrices W_i^q ∈ R^(2C×2C/r), W_i^k, W^o ∈ R^(2C×2C), and d_k denotes a scale factor; r is 16, and U ∈ R^(L×512) is a learnable prior matrix for learning rules between the text and the visual and audio information. The WBCR F_w is obtained by summing F_atext and F_va, where F_w^i denotes the WBCR of window w_i.
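A minimal PyTorch sketch of this fusion and alignment step is given below, assuming 2C-dimensional video-audio features, a squeeze-and-excitation-style channel-attention gate, and r = 16 attention heads; the module names and the omission of the prior matrix U are simplifications, not the patented implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Encode visual/audio with LayerNorm+Conv1D branches, gate with channel attention,
    then align text to the video-audio stream via multi-head attention."""

    def __init__(self, c_v: int, c_a: int, c_t: int, c: int, heads: int = 16, p: float = 0.1):
        super().__init__()
        # 2*c is assumed divisible by the number of heads
        self.v_norm, self.v_conv = nn.LayerNorm(c_v), nn.Conv1d(c_v, c, kernel_size=3, padding=1)
        self.a_norm, self.a_conv = nn.LayerNorm(c_a), nn.Conv1d(c_a, c, kernel_size=3, padding=1)
        self.channel_attn = nn.Sequential(nn.Linear(2 * c, 2 * c // 4), nn.ReLU(),
                                          nn.Linear(2 * c // 4, 2 * c), nn.Sigmoid())
        self.drop, self.norm = nn.Dropout(p), nn.LayerNorm(2 * c)
        self.text_proj = nn.Linear(c_t, 2 * c)
        self.align = nn.MultiheadAttention(2 * c, heads, batch_first=True)

    def forward(self, f_visual, f_audio, f_text):
        # f_visual: (B, L, C_v), f_audio: (B, L, C_a), f_text: (B, 512, C_t)
        v = self.v_conv(self.v_norm(f_visual).transpose(1, 2)).transpose(1, 2)  # (B, L, C)
        a = self.a_conv(self.a_norm(f_audio).transpose(1, 2)).transpose(1, 2)   # (B, L, C)
        f_c_va = torch.cat([v, a], dim=-1)                                       # (B, L, 2C)
        gate = self.channel_attn(f_c_va.mean(dim=1)).unsqueeze(1)                # channel attention
        f_va = self.norm(self.drop(f_c_va * gate))                               # (B, L, 2C)
        # queries from the video-audio stream, keys/values from the text embedding
        text = self.text_proj(f_text)
        f_atext, _ = self.align(f_va, text, text)
        return f_va + f_atext                                                    # WBCR F_w
```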
S3, in order to adaptively capture the long-time dependencies of variable-length scenes, the WBCR is evolved into an adaptive context-aware representation (ACR) by taking into account multi-temporal attention and the differences between adjacent windows, exploiting long-time dependencies between different view windows through a multi-scale expansion attention module (MDAM) and a shift window sequence module (SWSM). The specific working procedure is as follows:
The multi-scale expansion attention module (MDAM) includes two parts: a multi-scale extended window and a context-aware attention module. The multi-scale extended window operation is adopted to establish time dependencies for all windows: for window i, the adjacent windows within a d-hop range, i.e., windows i-d and i+d, are incorporated into the contextual representation of window i, and their WBCRs are combined with F_w, where d is the expansion rate. When all the context representations are constructed, they are input to a linear layer, producing a feature vector of 2C dimensions. Features are padded with zero-fill windows to make the input and output sizes uniform. To model the importance of the extended windows, context-aware attention is introduced to learn the weights of the different extended windows. For each extended window i, its s×2C feature map is multiplied by the i-th column of the adaptive scale factor, and the results are added to output a multi-scale attention feature vector, as in Fig. 3, where s denotes the number of scales. In the last step, the output F_ca ∈ R^(L×2C) of the context-aware attention module is obtained by concatenating the multi-scale attention feature vectors of all extended windows and then passing them through a Sequential Transduction Network (STN) and a linear operation.
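The following sketch illustrates the multi-scale extended-window idea under simplifying assumptions: each window's WBCR is concatenated with its d-hop neighbours for several expansion rates, projected back to 2C dimensions, and the scales are combined with learned weights as a stand-in for the context-aware attention; a bidirectional GRU stands in for the Sequential Transduction Network (STN).

```python
import torch
import torch.nn as nn

class MDAM(nn.Module):
    """Simplified multi-scale extended attention: d-hop neighbour concatenation per scale,
    learned per-scale weights, then a sequence model (GRU as a stand-in STN) + linear."""

    def __init__(self, c2: int, dilations=(1, 2, 4)):   # c2 = 2C
        super().__init__()
        self.dilations = dilations
        self.proj = nn.ModuleList([nn.Linear(3 * c2, c2) for _ in dilations])
        self.scale_weight = nn.Parameter(torch.ones(len(dilations)))
        self.stn = nn.GRU(c2, c2 // 2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(c2, c2)

    def forward(self, f_w):                              # f_w: (B, L, 2C) window WBCRs
        feats = []
        for k, d in enumerate(self.dilations):
            left = torch.roll(f_w, shifts=d, dims=1).clone()    # window i-d
            right = torch.roll(f_w, shifts=-d, dims=1).clone()  # window i+d
            left[:, :d] = 0.0                                    # zero-fill windows at the borders
            right[:, -d:] = 0.0
            feats.append(self.proj[k](torch.cat([left, f_w, right], dim=-1)))
        w = torch.softmax(self.scale_weight, dim=0)
        fused = sum(w[k] * f for k, f in enumerate(feats))       # simplified scale weighting
        seq, _ = self.stn(fused)
        return self.out(seq)                                     # F_ca: (B, L, 2C)
```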
Shift Window Sequence Module (SWSM): for each window, the window-shifted WBCR is calculated by directly subtracting the window's WBCR from the WBCR of the previous window; to capture the temporal order, a Sequential Transduction Network (STN) is added, and a linear operation is then performed to obtain the SWSM output F_sw.
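A corresponding sketch of the shift window sequence module: the difference between adjacent windows' WBCRs is computed, passed through a GRU (again standing in for the STN), and a linear layer produces F_sw; the sign convention and the GRU choice are assumptions.

```python
import torch
import torch.nn as nn

class SWSM(nn.Module):
    """Shifted-WBCR branch: difference with the previous window, sequence model, linear."""

    def __init__(self, c2: int):                             # c2 = 2C
        super().__init__()
        self.stn = nn.GRU(c2, c2 // 2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(c2, c2)

    def forward(self, f_w):                                  # (B, L, 2C)
        prev = torch.cat([torch.zeros_like(f_w[:, :1]), f_w[:, :-1]], dim=1)
        shifted = f_w - prev   # difference with the previous window's WBCR (sign convention assumed)
        seq, _ = self.stn(shifted)
        return self.out(seq)                                 # F_sw: (B, L, 2C)
```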
S4, for scene detection, solving by a window attribute classifier and a position offset regressor:
Scene detection is expressed as a binary classification, whose purpose is to determine whether the window attribute is positive or negative, and a position offset regression, whose purpose is to find the exact position of the scene boundary within the window. Two sub-networks are used to obtain the results, each computed by an FC layer followed by a Sigmoid.
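A minimal sketch of the two sub-networks, assuming each is a single FC layer followed by a Sigmoid over the adaptive context-aware representation: one head predicts the window attribute (positive or negative) and the other regresses the normalized offset of the scene boundary within the window.

```python
import torch
import torch.nn as nn

class SceneDetectionHead(nn.Module):
    """Window attribute classifier and position offset regressor (FC + Sigmoid each)."""

    def __init__(self, c2: int):                             # c2 = 2C
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(c2, 1), nn.Sigmoid())  # P(window contains a boundary)
        self.reg = nn.Sequential(nn.Linear(c2, 1), nn.Sigmoid())  # offset in [0, 1] within window

    def forward(self, acr):          # acr: (B, L, 2C) adaptive context-aware representation
        return self.cls(acr).squeeze(-1), self.reg(acr).squeeze(-1)
```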
A video scene detection labeling system comprises a feature acquisition module, a feature fusion module and a detection labeling module;
the feature acquisition module is used for acquiring the modal features of the video, the audio and the text based on the pre-training model according to the modal information sources embedded in the input video, the audio and the text;
the feature fusion module is used for aligning and fusing the acquired modal features of the video, the audio and the text to form a window basic cross-modal representation, and according to the multi-temporal attention and the difference between adjacent windows, the window basic cross-modal representation is evolved into a self-adaptive context sensing representation;
the detection labeling module detects the scene according to the acquired self-adaptive context sensing representation, determines the attribute of the window through the window attribute classifier, and acquires the accurate position of the scene boundary in the window through the position offset regressor; and designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation.
Specifically, the pre-training model refers to models pre-trained on their respective datasets, which are used to extract the feature representations of the video, audio and text in a given dataset. The detection labeling module comprises a scene detection part and a scene labeling part; scene detection is decomposed into window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of two-stage classifiers (window level and scene level).
S5, in a scene labeling task, based on the generated scene boundary, assembling a two-stage classifier to assign a plurality of labels to each scene.
The specific working procedure is as follows:
(5.1) If the scene ranges from window i to window j, the representation F_ca is sent to adjacent Sequential Transduction Networks (STN) to generate a start sequence representation and an end sequence representation, which characterize the start window and the end window, respectively. On this basis, an L×L×C_l feature map is created, whose [i, j, :] element (i ≤ j) is the dot product of the start representation of window i and the end representation of window j.
(5.2) However, it is not sufficient to represent a scene with only the [i, j, :] element. Thus, multi-scale semantic constraint dilation (MSCD) is developed to capture the features of the scene center window. MSCD can be considered a variant of multi-scale dilation convolution (MDC): in contrast to the nine features considered in dilated convolution, only up to four valid features, indexed as [i+d, j, :], [i, j, :], [i+d, j-d, :] and [i, j-d, :], allow convolution operations to be performed, as shown in FIG. 2.
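A simplified sketch of the MSCD idea, assuming the L×L×C scene map is already built: for each cell [i, j] only the four neighbours [i+d, j], [i, j], [i+d, j-d] and [i, j-d] contribute, each through its own linear map; boundary masking and the i ≤ j validity constraint are omitted for brevity.

```python
import torch
import torch.nn as nn

class MSCD(nn.Module):
    """Sketch of multi-scale semantic constraint dilation over the L×L×C scene map."""

    def __init__(self, c: int, dilations=(1, 2)):
        super().__init__()
        self.dilations = dilations
        # one C×C weight matrix per neighbour position and per scale
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(4, c, c) * 0.02) for _ in dilations])

    def forward(self, m: torch.Tensor) -> torch.Tensor:      # m: (L, L, C)
        out = torch.zeros_like(m)
        for w, d in zip(self.weights, self.dilations):
            neighbours = [
                torch.roll(m, shifts=(-d, 0), dims=(0, 1)),   # [i+d, j]
                m,                                            # [i, j]
                torch.roll(m, shifts=(-d, d), dims=(0, 1)),   # [i+d, j-d]
                torch.roll(m, shifts=(0, d), dims=(0, 1)),    # [i, j-d]
            ]
            for k, nb in enumerate(neighbours):
                out = out + nb @ w[k]        # (L, L, C) @ (C, C) -> (L, L, C)
        return out
```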
(5.3) although MSCDs are specifically designed to reduce information loss, representations from window i to window j have not yet been fully expressed. Thus, we first predict the multi-label for each window in the scene and then average the predicted probabilities of windows i through j as the scene predicted probabilities. We average the two predictions from the window level and scene level classifiers to get the final scene tag.
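The ensemble step can be summarized as follows: the window-level classifier's probabilities are averaged over windows i..j, then averaged with the scene-level classifier's prediction; the final thresholding is an assumed post-processing choice.

```python
import numpy as np

def fuse_scene_labels(window_probs: np.ndarray, scene_probs: np.ndarray,
                      i: int, j: int, threshold: float = 0.5) -> np.ndarray:
    """Ensemble of the two-stage multi-label classifiers for one scene spanning windows i..j.

    window_probs: (L, E) per-window label probabilities from the window-level classifier.
    scene_probs:  (E,)   label probabilities from the scene-level classifier.
    """
    window_level = window_probs[i:j + 1].mean(axis=0)     # average windows i..j
    final = (window_level + scene_probs) / 2.0            # average the two classifiers
    return (final >= threshold).astype(np.int32)          # assumed thresholding
```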
To improve prediction performance, we combine two multi-label sub-classifiers (window level and scene level) from different levels. The method unifies scene detection and annotation: the problems of error propagation and huge computational cost are solved through a unified network of cross-modal cues, scene detection is decomposed into window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of the two-stage classifier. The advantages of our model over previous state-of-the-art methods are demonstrated on the TADScene dataset.
The loss function is used to optimize the model during training. Its specific composition is as follows:
scene detection loss:
wherein N is pos Representing the number of windows for which the result is positive,is a predictive and true binary indicator of window i. Delta j ,δ′ j Is the predicted and true offset position of the j-th positive window. L (L) cls Is a binary cross entropy loss, L reg Is the L1 loss.
Scene annotation loss L_A: this loss is computed over both real scenes and windows. Here N_s and E are the number of real scenes and the number of tag types, respectively; for real scene i, Y_{i,j} and Y'_{i,j} are the j-th predicted and true tags, respectively; for window w_p, Z_{p,q} is the q-th predicted tag and Z'_{p,q} is the q-th true tag inherited from its scene tag.
Clustering constraint loss L_C.
total loss is l=λ 1 L D2 L A3 L C
The foregoing description of the embodiments of the invention has been presented in conjunction with the drawings. It will be appreciated by those skilled in the art that the invention is not limited to the embodiments described above. On the basis of the technical scheme of the invention, various modifications or variations which can be made by the person skilled in the art without the need of creative efforts are still within the protection scope of the invention.

Claims (5)

1. The video scene detection labeling method is characterized by comprising the following steps of:
s1, acquiring modal characteristics of video, audio and text by adopting a pre-training model according to an embedded modal information source of the input video, audio and text;
s2, aligning and fusing the mode characteristics of the acquired video, audio and text to form a basic cross-mode representation of the window;
s3, according to the difference between the multi-temporal attention and the adjacent window, the basic cross-modal representation of the window is evolved into a self-adaptive context sensing representation;
s4, detecting a scene according to the acquired self-adaptive context sensing representation, determining the attribute of a window through a window attribute classifier, and acquiring the accurate position of a scene boundary in the window through a position offset regressor; designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation;
generating, according to the modal information source of the video, visual features F_visual of dimension h×L×C_v using a Swin Transformer; encoding the audio recording into L×C_a dimensional vectors using a VGGish network to form the audio features F_audio; using a BERT network providing 512×C_t channels to obtain the text embedding features F_text; encoding and connecting the modal features of the video and audio through adjacent branches of successive LayerNorm+Conv1D layers to obtain video-audio features, combining channel attention with the video-audio features, and adopting a multi-head attention mechanism to force the visual information, audio and text to be explicitly aligned semantically; embedding F_va into the query matrix Q_i = F_va·W_i^q and constructing the key matrix K_i = F_text·W_i^k and value matrix V_i = F_text·W_i^v; the aligned text embedding is then computed as:
head_i = Attention(Q_i, K_i, V_i)
F_atext = Concat(head_1, head_2, ..., head_r)·W^o
where the projections are parameter matrices W_i^q ∈ R^(2C×2C/r), W_i^k, W^o ∈ R^(2C×2C), and d_k denotes a scale factor; r is 16, and U ∈ R^(L×512) is a learnable prior matrix for learning rules between the text and the visual and audio information; the WBCR F_w is obtained by summing F_atext and F_va, where F_w^i denotes the WBCR of window w_i; the window basic cross-modal representation is evolved into an adaptive context-aware representation using long-time dependencies between different view windows, multi-scale extended attention modules, and shift window sequence modules.
2. The method of claim 1, wherein the multi-scale expansion attention module comprises a multi-scale expansion window and a context awareness attention module, wherein the multi-scale expansion window is used to establish time dependency relationships for all windows, and when all context representations are constructed, they are input to the linear layer to generate a feature vector of 2C dimension.
3. A video scene detection annotation method as claimed in claim 2, wherein the feature is filled with zero-fill windows to match the input and output sizes.
4. The video scene detection annotation method of claim 1, wherein the shift window sequence module calculates the window-shifted WBCR by subtracting the window's WBCR from the WBCR of the previous window, captures the temporal order using a sequential transduction network, and then performs a linear operation to obtain the SWSM output F_sw.
5. A video scene detection labeling system based on the method of claim 1, comprising a feature acquisition module, a feature fusion module and a detection labeling module;
the feature acquisition module is used for acquiring the modal features of the video, the audio and the text based on the pre-training model according to the modal information sources embedded in the input video, the audio and the text;
the feature fusion module is used for aligning and fusing the acquired modal features of the video, the audio and the text to form a window basic cross-modal representation, and according to the multi-temporal attention and the difference between adjacent windows, the window basic cross-modal representation is evolved into a self-adaptive context sensing representation;
the detection labeling module detects the scene according to the acquired self-adaptive context sensing representation, determines the attribute of the window through the window attribute classifier, and acquires the accurate position of the scene boundary in the window through the position offset regressor; designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation;
generating, according to the modal information source of the video, visual features F_visual of dimension h×L×C_v using a Swin Transformer; encoding the audio recording into L×C_a dimensional vectors using a VGGish network to form the audio features F_audio; using a BERT network providing 512×C_t channels to obtain the text embedding features F_text; encoding and connecting the modal features of the video and audio through adjacent branches of successive LayerNorm+Conv1D layers to obtain video-audio features, combining channel attention with the video-audio features, and adopting a multi-head attention mechanism to force the visual information, audio and text to be explicitly aligned semantically; embedding F_va into the query matrix Q_i = F_va·W_i^q and constructing the key matrix K_i = F_text·W_i^k and value matrix V_i = F_text·W_i^v; the aligned text embedding is then computed as:
head_i = Attention(Q_i, K_i, V_i)
F_atext = Concat(head_1, head_2, ..., head_r)·W^o
where the projections are parameter matrices W_i^q ∈ R^(2C×2C/r), W_i^k, W^o ∈ R^(2C×2C), and d_k denotes a scale factor; r is 16, and U ∈ R^(L×512) is a learnable prior matrix for learning rules between the text and the visual and audio information; the WBCR F_w is obtained by summing F_atext and F_va, where F_w^i denotes the WBCR of window w_i; the window basic cross-modal representation is evolved into an adaptive context-aware representation using long-time dependencies between different view windows, multi-scale extended attention modules, and shift window sequence modules.
CN202111678887.6A 2021-12-31 2021-12-31 Video scene detection labeling method and system Active CN114332729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111678887.6A CN114332729B (en) 2021-12-31 2021-12-31 Video scene detection labeling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111678887.6A CN114332729B (en) 2021-12-31 2021-12-31 Video scene detection labeling method and system

Publications (2)

Publication Number Publication Date
CN114332729A (en) 2022-04-12
CN114332729B (en) 2024-02-02

Family

ID=81022355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111678887.6A Active CN114332729B (en) 2021-12-31 2021-12-31 Video scene detection labeling method and system

Country Status (1)

Country Link
CN (1) CN114332729B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984842A (en) * 2023-02-13 2023-04-18 广州数说故事信息科技有限公司 Multi-mode-based video open tag extraction method

Citations (5)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007114796A1 (en) * 2006-04-05 2007-10-11 Agency For Science, Technology And Research Apparatus and method for analysing a video broadcast
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN112801762A (en) * 2021-04-13 2021-05-14 浙江大学 Multi-mode video highlight detection method and system based on commodity perception
CN113762322A (en) * 2021-04-22 2021-12-07 腾讯科技(北京)有限公司 Video classification method, device and equipment based on multi-modal representation and storage medium
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Su Xiaohan; Feng Hongcai; Wu Shiyao. A multimodal video scene segmentation algorithm based on deep networks. Journal of Wuhan University of Technology (Information & Management Engineering), 2020, No. 3, full text. *
Huang Jun; Wang Cong; Liu Yue; Bi Tianteng. A survey of monocular depth estimation techniques. Journal of Image and Graphics, 2019, No. 12, full text. *

Also Published As

Publication number Publication date
CN114332729A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
CN112163122B (en) Method, device, computing equipment and storage medium for determining label of target video
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
Tian et al. Multimodal deep representation learning for video classification
US20180260414A1 (en) Query expansion learning with recurrent networks
WO2017070656A1 (en) Video content retrieval system
US20090141940A1 (en) Integrated Systems and Methods For Video-Based Object Modeling, Recognition, and Tracking
US20100106486A1 (en) Image-based semantic distance
CN112364204B (en) Video searching method, device, computer equipment and storage medium
Choi et al. Automatic face annotation in personal photo collections using context-based unsupervised clustering and face information fusion
Anuranji et al. A supervised deep convolutional based bidirectional long short term memory video hashing for large scale video retrieval applications
CN112487822A (en) Cross-modal retrieval method based on deep learning
Abdul-Rashid et al. Shrec’18 track: 2d image-based 3d scene retrieval
CN113177141A (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
Zou et al. Multi-label enhancement based self-supervised deep cross-modal hashing
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
Liu et al. Attention guided deep audio-face fusion for efficient speaker naming
CN113111836A (en) Video analysis method based on cross-modal Hash learning
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
CN114332729B (en) Video scene detection labeling method and system
Abed et al. KeyFrame extraction based on face quality measurement and convolutional neural network for efficient face recognition in videos
Ejaz et al. Video summarization using a network of radial basis functions
CN116703531B (en) Article data processing method, apparatus, computer device and storage medium
CN112949534A (en) Pedestrian re-identification method, intelligent terminal and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant