CN113159034A - Method and system for automatically generating subtitles by using short video

Method and system for automatically generating subtitles by using short video

Info

Publication number
CN113159034A
Authority
CN
China
Prior art keywords
event
video
network
layer
segment
Prior art date
Legal status
Pending
Application number
CN202110442856.4A
Other languages
Chinese (zh)
Inventor
颜成钢
高含笑
潘潇恺
孙垚棋
张继勇
李宗鹏
张勇东
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110442856.4A priority Critical patent/CN113159034A/en
Publication of CN113159034A publication Critical patent/CN113159034A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components


Abstract

The invention discloses a method and a system for automatically generating subtitles for short videos. Video features are extracted through a 3D convolutional network to obtain a video feature sequence; a DAPs model performs event detection on the received video feature sequence to obtain predicted event segments, and each segment is scored. Each obtained event segment is processed separately: a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into a Transformer model to obtain the predicted text. By fusing cross-modal techniques and including event detection in the preprocessing, the invention narrows the scope of the subsequent text generation work and increases the degree of matching between the generated text and the event. The text generation part exploits the excellent performance of the Transformer in feature encoding and decoding.

Description

Method and system for automatically generating subtitles by using short video
Technical Field
The invention belongs to the field of natural language processing, and relates to a text generation technology in the field of natural language processing.
Background
Short video refers to video content pushed at high frequency and played on various new media, suitable for viewing while on the move or during short leisure breaks; its duration is generally within 5 minutes. Since 2016, short video has become popular rapidly thanks to its low creation threshold, strong social attributes and fragmented entertainment style. The short video industry has developed quickly in China and is now mature; active short video platforms include Weibo, Douyin, Kuaishou and Xiaohongshu. Short video has become a carrier for bloggers to record and share their lives, a new way for audiences to be entertained and to receive product recommendations, and can even be used for brand promotion, bringing income to original creators.
The creation of a short video consists of content planning, video shooting, post-processing and other links. Subtitle generation, as one part of post-processing, is generally completed by the author adding subtitles manually, which is time-consuming and labor-intensive. At present, some subtitle generation software for videos has been developed and applied: text subtitles are derived from the speech in the video through speech recognition technology, supplemented by powerful functions such as manual modification and translation, so the software is highly practical. However, such software mainly depends on the audio information in the video, and no text can be extracted without clearly pronounced audio. In real life, an ordinary user who wants to record daily life with short videos must narrate while shooting, and adding narration afterwards brings back the problem of wasted time and energy. According to our research, automatically generating subtitles for short videos without an audio background remains a pain point and a technical gap.
Text generation technology has a wide range of application scenarios and can be applied to tasks such as information extraction, question answering and text creation: question-answering systems have given rise to chat robots, text creation has made it unremarkable for machines to write lyrics and compositions, and cross-modal text generation has become a research hotspot in recent years. Cross-modal processing combines image, audio and language processing; common applications include describing a picture in words and automatically generating subtitles for instructional videos.
The technology for automatically generating subtitles is cross-modal and relies on the fusion of image vision techniques and natural language processing. In terms of the overall framework, "Sequence to Sequence - Video to Text" (2015) proposed a model based on the LSTM framework: the original frames and the optical-flow images are processed separately, a 2D convolutional neural network extracts per-frame image features as the input of the LSTM encoding layer, the output of the LSTM decoding layer is the prediction, the probability of each word is predicted one by one, and finally the probabilities generated from the original frames and from the optical-flow images are combined by weighted summation. In 2015, "Video Description Generation Incorporating Spatio-Temporal Features and a Soft-Attention Mechanism" borrowed the attention mechanism from image description: features are first extracted from every frame, using a GoogLeNet network for 2D features and a 3D convolutional network for 3D features; when predicting each word, the attention mechanism computes a weight for each feature and the image features are summed according to these weights. In 2016, "Describing Videos using Multi-modal Fusion" proposed fusing image features, video features, environmental sound features, speech features, category features and so on as the feature representation of the video; the so-called fusion is a weighted average of these features, after which a video description is generated by an LSTM model.
The innovation points of video description technology come from three aspects: first, optimizing the extraction of video features; second, optimizing the event detection model; and third, optimizing the text generation model.
3D convolution was proposed in "Learning Spatiotemporal Features with 3D Convolutional Networks"; it is suitable for extracting spatio-temporal features and attends better to information in the temporal dimension.
Event detection aims to segment a long video into several segments according to semantics, so that each segment contains one event; the event detection model DAPs was proposed for this task. Models before DAPs set sliding windows of different sizes to scan the whole video several times and then used maximum likelihood estimation to find the most suitable event segments, which is slow. DAPs obtains event segments of different scales with only one sliding window, greatly improving both running speed and accuracy. "Attention Is All You Need" proposed the Transformer model, which is suitable for natural language processing tasks such as machine translation, text generation and text summarization; the whole network is composed entirely of attention mechanisms and fully attends to the contribution of context information.
Disclosure of Invention
The application scenario of the invention is the automatic generation of subtitles for short videos on platforms such as Weibo, Douyin and Kuaishou. The model adopted by the invention is an end-to-end model based on the Transformer, with video features as input and descriptive text as output. The main tasks are as follows: first, extracting video features from a short video; second, segmenting the video and cutting out several segments according to events; and third, generating a summary that describes each video event.
The technical problems to be solved by the invention are as follows: first, designing a video feature extraction model to optimize the extraction of video features; second, designing a video segmentation model to improve the accuracy of video segmentation; and third, designing a text generation model so that the generated text fits the picture displayed by the video.
Aiming at the defects in the prior art, the invention provides a method and a system for automatically generating subtitles by using short videos.
A system for automatically generating subtitles by using short video comprises a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to perform event detection on the received video feature sequence: the video feature sequence is input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window then scans the whole feature sequence to obtain predicted event segments, each of which is scored.
The text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model, from which the predicted text is obtained.
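For clarity, the cooperation of the three modules can be sketched as follows; this is a minimal, hypothetical PyTorch-style sketch, and the class and function names are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn

class SubtitlePipeline(nn.Module):
    """Sketch of the three-module pipeline: 3D-CNN features -> event proposals -> captions."""
    def __init__(self, feature_extractor, event_detector, text_generator):
        super().__init__()
        self.feature_extractor = feature_extractor  # 3D convolutional network (C3D-style)
        self.event_detector = event_detector        # DAPs-style proposal model
        self.text_generator = text_generator        # Transformer-based captioner

    def forward(self, video):                       # video: (1, 3, L, H, W)
        features = self.feature_extractor(video)    # (1, T, D) video feature sequence
        segments = self.event_detector(features)    # [(start, end, score), ...] per event
        captions = []
        for start, end, score in segments:
            clip_feats = features[:, start:end]     # features of one event segment
            captions.append(self.text_generator(clip_feats))  # one subtitle per event
        return captions
```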
A method for automatically generating subtitles by short video comprises the following steps:
step (1), constructing a data set;
the method comprises the steps of constructing a data set, randomly sampling videos with subtitles in various large and short video platforms in China, uniformly preprocessing the videos, manually checking the subtitles of each video, correcting description which does not meet the facts, confirming that the subtitles can completely and simply summarize scenes or events displayed by the videos, and using the processed subtitles as tags of the videos.
Step (2), extracting video characteristics;
the 3D convolutional network in the video feature extraction module is used to extract the video features and obtain the video feature sequence. The reason is that a 3D convolutional network is better suited than a 2D convolutional network to learning spatio-temporal features: the 3D convolution kernel and the 3D pooling operation model temporal information better, and the network extracts video features through convolutional layers, pooling layers, fully connected layers and so on.
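As a hedged illustration of how a 3D convolution also covers the temporal dimension, the following minimal sketch (assuming PyTorch; the clip size and channel counts are illustrative and not taken from the patent) applies a 3D convolution and a 3D pooling to a clip of L frames:

```python
import torch
import torch.nn as nn

# A 16-frame RGB clip of 112x112 images: (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 16, 112, 112)

conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)  # 3x3x3 kernel
pool3d = nn.MaxPool3d(kernel_size=(1, 2, 2))  # pool only spatially, keep all 16 frames

out = pool3d(conv3d(clip))
print(out.shape)  # torch.Size([1, 64, 16, 56, 56]); the kernel slides over time as well as space
```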
Step (3), event detection;
the DAPs model in the event detection module is used to detect events. On the basis of step (2), the acquired video features are input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window scans the whole feature sequence to obtain predicted event segments, each of which is scored. The DAPs model is trained with the score of each segment and the segment matching accuracy as the loss function.
Step (4), generating a text;
the Transformer model in the text generation module is used to generate text. Its advantage is that it relies entirely on a self-attention mechanism to capture the global dependencies between input and output, which helps to parallelize computation and allows the correlation between any two words to be computed directly, without being passed through a hidden layer. Each event segment obtained in step (3) is processed separately: a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model to obtain the predicted text.
Step (5), the network models of steps (2) to (4) are trained on the data set, and the trained network models then complete the automatic generation of short video subtitles.
The invention has the following beneficial effects:
crawling contributes data sets to short videos and subtitles such as microblog, trembling, fast hands. And secondly, a cross-modal technology is fused, and the pre-processing comprises event detection, so that the range of the subsequent text generation work is narrowed, and the matching degree of the generated text and the event is increased. The text generation section exerts excellent performance of the Transformer in terms of feature encoding and decoding.
Drawings
FIG. 1 is a schematic diagram of a 3D convolution operation;
FIG. 2 is a schematic diagram of a C3D network structure;
FIG. 3 is a text generation flow diagram;
FIG. 4 is a schematic diagram of a Transformer network structure;
FIG. 5 is a schematic diagram of a Transformer macro network structure.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and examples.
The invention provides a method and a system for automatically generating subtitles by using short video.
A method for automatically generating subtitles by short videos comprises the following steps:
step (1), constructing a data set;
the method comprises the steps of constructing a data set, randomly sampling videos with subtitles in various large and short video platforms in China, uniformly preprocessing the videos, manually checking the subtitles of each video, correcting description which does not meet the facts, confirming that the subtitles can completely and simply summarize scenes or events displayed by the videos, and using the processed subtitles as tags of the videos.
Step (2), extracting video features.
The 3D convolutional network used is a C3D network. C3D is a deep 3-dimensional convolutional network for processing video (an image set with temporal order), and its operation is shown in FIG. 1, where k is the height and width of the 3D convolution kernel, D is the depth of the 3D convolution kernel, W and H are the width and height of each frame in the video, and L is the number of frames in the video. In the spatial dimension, as in the 2D convolution operation, the kernel traverses the image set with a set stride, and the video features are captured through the 3D convolution operation. The complete network framework of C3D is shown in FIG. 2: the C3D network includes 8 convolutional layers, 5 max pooling layers, 2 fully connected layers and a nonlinear output layer. The 3D convolution kernels (Conv1a-Conv5b) are all of size 3 × 3 × 3, with stride 1 in both space and time. The 5 max pooling layers are pool1-pool5 in sequence; the pooling kernel of pool1 is 1 × 2 × 2, the other pooling kernels are 2 × 2 × 2, and the 2 fully connected layers (fc6-fc7) each have 4096 output units. The video is processed by the C3D network to obtain the video feature sequence.
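The layer configuration described above can be sketched in code. This is a minimal reconstruction under stated assumptions: the kernel sizes, pooling sizes and layer counts follow the text, while the channel widths (64 to 512), the input clip size of 16 × 112 × 112 and the padding on pool5 are common C3D settings assumed for illustration.

```python
import torch
import torch.nn as nn

class C3DSketch(nn.Module):
    """Sketch of the described C3D layout: 8 conv layers, 5 max-pool layers, 2 FC layers."""
    def __init__(self, feature_dim=4096):
        super().__init__()
        def conv(cin, cout):  # 3x3x3 kernel, stride 1 in space and time
            return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, padding=1), nn.ReLU())
        self.features = nn.Sequential(
            conv(3, 64),                    nn.MaxPool3d((1, 2, 2)),                     # conv1a, pool1
            conv(64, 128),                  nn.MaxPool3d((2, 2, 2)),                     # conv2a, pool2
            conv(128, 256), conv(256, 256), nn.MaxPool3d((2, 2, 2)),                     # conv3a-3b, pool3
            conv(256, 512), conv(512, 512), nn.MaxPool3d((2, 2, 2)),                     # conv4a-4b, pool4
            conv(512, 512), conv(512, 512), nn.MaxPool3d((2, 2, 2), padding=(0, 1, 1)),  # conv5a-5b, pool5
        )
        self.fc6 = nn.Linear(512 * 1 * 4 * 4, feature_dim)  # for a 16x112x112 input clip
        self.fc7 = nn.Linear(feature_dim, feature_dim)

    def forward(self, clip):                                   # clip: (N, 3, 16, 112, 112)
        x = torch.flatten(self.features(clip), 1)
        return torch.relu(self.fc7(torch.relu(self.fc6(x))))  # 4096-dimensional clip feature
```

Sliding this network over consecutive clips of the video yields the video feature sequence used by the event detection module.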
Step (3), event detection:
because the lengths of the event segments are different, different sliding windows need to be set for scanning the whole video for multiple times in a model before DAPs, and then the most appropriate event segment is found by using a maximum likelihood estimation method, so that the running speed is very low. DAPs can obtain event segments with different scales only by using one sliding window, so that the running speed is greatly increased. Inputting the video feature sequence obtained in the step (2) into an LSTM network of the DAPs model to connect the features in series, outputting a hidden layer of the LSTM as a time feature sequence, scanning the whole feature sequence by using a sliding window through an Anchor mechanism to obtain predicted event segments, and scoring each segment. A mapping point of the center of a sliding window on an original video sequence is called as Anchor, different-scale propulses are generated by taking the Anchor as the center, each propulsal is scored according to whether an event exists or not, the matching accuracy of the propulsal and an actual event fragment is calculated, a loss function is formed by scoring and matching accuracy of whether a predicted event exists or not, the former requires that the probability that the propulsal contains the event is as high as possible, and the latter requires that the predicted fragment fits an interval of a real event fragment as much as possible.
The final result of event detection is: <start time of event 1, end time of event 1>, …, <start time of event n, end time of event n>.
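A hedged sketch of this proposal step is given below, assuming PyTorch; the hidden size, the set of proposal scales and the score threshold are illustrative choices, not values given in the patent.

```python
import torch
import torch.nn as nn

class DAPsSketch(nn.Module):
    """Sketch of a DAPs-style proposal head: an LSTM links the clip features in series,
    and at each sliding-window position (Anchor) K proposals of different scales are scored."""
    def __init__(self, feat_dim=4096, hidden=512, num_scales=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.scales = [2 ** (i + 1) for i in range(num_scales)]  # proposal lengths in clips
        self.score = nn.Linear(hidden, num_scales)               # one confidence per scale

    def forward(self, feats, threshold=0.5):                     # feats: (1, T, feat_dim)
        h, _ = self.lstm(feats)                                  # hidden states = temporal features
        scores = torch.sigmoid(self.score(h[0]))                 # (T, K) proposal scores
        proposals = []
        for t in range(scores.size(0)):                          # t = Anchor position
            for k, length in enumerate(self.scales):
                if scores[t, k] > threshold and t - length + 1 >= 0:
                    proposals.append((t - length + 1, t, scores[t, k].item()))
        return proposals                                         # [(start, end, score), ...]
```

In training, the score term and the overlap with the labelled event intervals (the matching accuracy) together form the loss, as described above.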
Step (4), text generation:
and (4) intercepting a video segment of each event by using the result of the event detection, and performing video description on each event segment in step (4), wherein the steps are shown in fig. 3. The text generation model adopts a Transformer-based text generation model. Firstly, visual embedding is carried out, a video sequence of an event segment is input into a C3D network to extract video characteristics, then LSTM network sharing time sequence information is input, and the hidden layer output of the LSTM is used as the input of a transform encoder.
The structure of the Transformer model is shown in FIG. 4. It consists of two parts, an encoder and a decoder: the input of the encoder is the video feature sequence, the input of the decoder is the encoded feature sequence, and the output of the decoder is the predicted text. The encoder is composed of 6 encoding blocks and the decoder of 6 decoding blocks, as shown in FIG. 5.
Specifically, an encoding/decoding block comprises a multi-head self-attention layer, a residual normalization layer and a feed-forward neural network layer. The encoding block and the decoding block have the same network structure but different input and output contents.
The structure and function of each layer are as follows:
the formula of a single-layer self-attention mechanism in the multi-head self-attention layer is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Q is the query vector, K is the key vector, V is the value vector and d_k is the dimension of the key vectors; Q, K and V in an encoding block or a decoding block come from the input of the layer above that block. The multi-head self-attention mechanism applies several linear transformations to the input of the encoding or decoding block, computes the attention value for each, and then concatenates the results and applies a linear transformation, according to the formulas:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O
where W_i^Q is the linear transformation weight for Q, and W_i^K and W_i^V are defined in the same way; head_i is the attention result of the i-th head, Concat is the column-wise concatenation operation, and W^O is the output linear transformation weight.
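These formulas can be implemented directly; the following is a minimal sketch assuming PyTorch, with illustrative dimensions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Direct implementation of the formulas above: per-head linear maps of Q, K and V,
    scaled dot-product attention, column-wise concatenation, and the output projection W^O."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.dk = num_heads, d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)   # stacks W_i^Q for all heads
        self.wk = nn.Linear(d_model, d_model)   # stacks W_i^K
        self.wv = nn.Linear(d_model, d_model)   # stacks W_i^V
        self.wo = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v):                 # each (B, T, d_model)
        B, T, _ = q.shape
        def split(x, w):                        # -> (B, h, T, dk)
            return w(x).view(B, -1, self.h, self.dk).transpose(1, 2)
        Q, K, V = split(q, self.wq), split(k, self.wk), split(v, self.wv)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.dk)            # QK^T / sqrt(d_k)
        heads = torch.softmax(scores, dim=-1) @ V                        # softmax(...)V per head
        concat = heads.transpose(1, 2).reshape(B, T, self.h * self.dk)   # Concat(head_1..head_h)
        return self.wo(concat)                                           # ...W^O
```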
Feed-forward neural network layer (Feed Forward): it consists of a ReLU activation layer and a fully connected layer, and its purpose is to adjust the output dimension. The formula is as follows:
FFN(x) = max(0, xW1 + b1)W2 + b2
where W1 is the weight of the activation layer, b1 is the bias of the activation layer, W2 is the weight of the fully connected layer, b2 is the bias of the fully connected layer, and x is the input vector.
Residual normalization layer (Add & Norm): it consists of a residual connection and layer normalization. The residual connection eases the training of multi-layer networks by letting the network attend only to the current residual part, and the normalization operation accelerates network convergence. The formula is as follows:
output=LayerNorm(x+F(x))
where x is the input vector and F(x) is the operation of the layer preceding the residual normalization layer, i.e. the multi-head attention mechanism or the feed-forward neural network.
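Putting the three layers together, one encoding block can be sketched as follows, assuming PyTorch and the MultiHeadSelfAttention sketch above; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class EncodingBlock(nn.Module):
    """One encoding block as described: multi-head self-attention, Add & Norm, feed-forward, Add & Norm."""
    def __init__(self, d_model=512, d_ff=2048, num_heads=8):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)
        self.ffn = nn.Sequential(                 # FFN(x) = max(0, xW1 + b1)W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (B, T, d_model)
        x = self.norm1(x + self.attn(x, x, x))    # output = LayerNorm(x + F(x)), F = attention
        x = self.norm2(x + self.ffn(x))           # F = feed-forward network
        return x
```

A decoding block uses the same layers; its attention additionally takes the encoder output as K and V, which is how the encoded feature sequence enters the decoder.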
Step (5), the network models of steps (2) to (4) are trained on the data set, and the trained network models then complete the automatic generation of short video subtitles.
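A hedged sketch of the end-to-end training in step (5); the optimizer, the token-level cross-entropy loss, the padding index and the pipeline's training interface (forward(clips, target_tokens) returning per-token logits) are assumptions made for illustration, not details given in the patent.

```python
import torch
import torch.nn as nn

def train(pipeline, dataloader, vocab_size, epochs=10, lr=1e-4):
    """Train the captioning model with token-level cross-entropy against the subtitle labels."""
    optimizer = torch.optim.Adam(pipeline.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=0)           # index 0 assumed to be padding
    for epoch in range(epochs):
        for clips, caption_tokens in dataloader:              # caption_tokens: (B, S) subtitle labels
            logits = pipeline(clips, caption_tokens[:, :-1])  # teacher forcing on the decoder
            loss = criterion(logits.reshape(-1, vocab_size),
                             caption_tokens[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The DAPs proposal loss (segment score plus matching accuracy, step (3)) would be trained in the same loop or separately, depending on implementation.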
A system for automatically generating subtitles by using short video comprises a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to perform event detection on the received video feature sequence: the video feature sequence is input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window then scans the whole feature sequence to obtain predicted event segments, each of which is scored.
The text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model, from which the predicted text is obtained.

Claims (5)

1. A system for automatically generating subtitles by using short video is characterized by comprising a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to perform event detection on the received video feature sequence: the video feature sequence is input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window then scans the whole feature sequence to obtain predicted event segments, each of which is scored;
the text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model, from which the predicted text is obtained.
2. A method for automatically generating subtitles by short videos is characterized by comprising the following steps:
step (1), constructing a data set;
constructing a data set: subtitled videos are randomly sampled from the major domestic short video platforms and preprocessed uniformly, the subtitles of each video are checked manually, descriptions that do not match the facts are corrected, it is confirmed that the subtitles summarize the scenes or events shown in the video completely and concisely, and the processed subtitles are used as the labels of the videos;
step (2), extracting video characteristics;
extracting the video characteristics by using a 3D convolutional network in a video characteristic extraction module to obtain a video characteristic sequence;
step (3), event detection;
using the DAPs model in the event detection module to detect events: on the basis of step (2), the acquired video features are input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window scans the whole feature sequence to obtain predicted event segments, each of which is scored; the DAPs model is trained with the score of each segment and the segment matching accuracy as the loss function;
step (4), generating a text;
generating text with the Transformer model in the text generation module: each event segment obtained in step (3) is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model to obtain the predicted text;
and (5) training the network models from the step (2) to the step (4) through a data set, and finishing automatic generation of the short video captions through the trained network models.
3. The method for automatically generating subtitles by using short video according to claim 2, wherein the step (2) is specifically as follows:
the 3D convolutional network used is a C3D network; C3D is a deep 3-dimensional convolutional network for processing video, where k is the height and width of the 3D convolution kernel, D is the depth of the 3D convolution kernel, W and H are the width and height of each frame in the video, and L is the number of frames in the video; in the spatial dimension, as in the 2D convolution operation, the kernel traverses the image set with a set stride, and the video features are captured through the 3D convolution operation; the C3D network includes 8 convolutional layers, 5 max pooling layers, 2 fully connected layers and a nonlinear output layer; the 3D convolution kernels (Conv1a-Conv5b) are all of size 3 × 3 × 3, with stride 1 in both space and time; the 5 max pooling layers are pool1-pool5 in sequence, the pooling kernel of pool1 is 1 × 2 × 2, the other pooling kernels are 2 × 2 × 2, and the 2 fully connected layers (fc6-fc7) each have 4096 output units; the video is processed by the C3D network to obtain the video feature sequence.
4. The method for automatically generating subtitles by using short video according to claim 3, wherein the step (3) is specifically as follows:
inputting the video feature sequence obtained in step (2) into the LSTM network of the DAPs model to connect the features in series, using the hidden-layer output of the LSTM as the temporal feature sequence, and scanning the whole feature sequence with a sliding window through an Anchor mechanism to obtain predicted event segments, each of which is scored; the mapping point of the center of the sliding window on the original video sequence is called an Anchor, proposals of different scales are generated centered on the Anchor, each proposal is scored according to whether it contains an event, and the matching accuracy between the proposal and the actual event segment is computed; the loss function is formed from the event score and the matching accuracy, the former requiring that the probability that a proposal contains an event be as high as possible, and the latter requiring that the predicted segment fit the interval of the real event segment as closely as possible;
and finally obtaining the result of event detection: < start time of event 1, end time of event 1 >, …, < start time of event n, end time of event n >.
5. The method for automatically generating subtitles according to claim 4, wherein the step (4) is specifically as follows:
the text generation model is based on the Transformer; first, visual embedding is performed: the video sequence of the event segment is input into the C3D network to extract video features, the features are then fed into an LSTM network to fuse the temporal information, and the hidden-layer output of the LSTM is used as the input of the Transformer encoder;
the Transformer model is composed of an encoder and a decoder, wherein the input of the encoder is a video characteristic sequence, the input of the decoder is a coded characteristic sequence, the output of the decoder is a predicted text, the encoder is composed of 6 coding blocks, and the decoder is composed of 6 decoding blocks;
specifically, one coding block/decoding block comprises a multi-head self-attention mechanism layer, a residual error normalization layer and a feedforward neural network layer; the network structure of the coding block and the decoding block is the same, and the input content and the output content of the coding block and the decoding block are different;
the structure and function of each layer are as follows:
the multi-head self-attention mechanism layer has the following formula of a single-layer self-attention mechanism:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Q is the query vector, K is the key vector, V is the value vector and d_k is the dimension of the key vectors; Q, K and V in an encoding block or a decoding block come from the input of the layer above that block; the multi-head self-attention mechanism applies several linear transformations to the input of the encoding or decoding block, computes the attention value for each, and then concatenates the results and applies a linear transformation, according to the formulas:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O
where W_i^Q is the linear transformation weight for Q, and W_i^K and W_i^V are defined in the same way; head_i is the attention result of the i-th head, Concat is the column-wise concatenation operation, and W^O is the output linear transformation weight;
feed-forward neural network layer (Feed Forward): it consists of a ReLU activation layer and a fully connected layer, and its purpose is to adjust the output dimension, with the following formula:
FFN(x)=max(0,xW1+b1)W2+b2
where W1 is the weight of the activation layer, b1 is the bias of the activation layer, W2 is the weight of the fully connected layer, b2 is the bias of the fully connected layer, and x is the input vector;
residual normalization layer (Add & Norm): the method is characterized by comprising a residual error network and normalization, wherein the residual error connection solves the problem of multi-layer network training, the network can only pay attention to the current difference part, the normalization operation can accelerate network convergence, and the formula is as follows:
output=LayerNorm(x+F(x))
where x is the input vector, F (x) is the previous layer operation of the residual normalization layer, the multi-headed attention mechanism or the feedforward neural network.
CN202110442856.4A 2021-04-23 2021-04-23 Method and system for automatically generating subtitles by using short video Pending CN113159034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110442856.4A CN113159034A (en) 2021-04-23 2021-04-23 Method and system for automatically generating subtitles by using short video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110442856.4A CN113159034A (en) 2021-04-23 2021-04-23 Method and system for automatically generating subtitles by using short video

Publications (1)

Publication Number Publication Date
CN113159034A true CN113159034A (en) 2021-07-23

Family

ID=76869972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110442856.4A Pending CN113159034A (en) 2021-04-23 2021-04-23 Method and system for automatically generating subtitles by using short video

Country Status (1)

Country Link
CN (1) CN113159034A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113423004A (en) * 2021-08-23 2021-09-21 杭州一知智能科技有限公司 Video subtitle generating method and system based on decoupling decoding
CN113569717A (en) * 2021-07-26 2021-10-29 上海明略人工智能(集团)有限公司 Short video event classification method, system, device and medium based on label semantics
CN113609259A (en) * 2021-08-16 2021-11-05 山东新一代信息产业技术研究院有限公司 Multi-mode reasoning method and system for videos and natural languages
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 System and method for generating video description text
CN113837083A (en) * 2021-09-24 2021-12-24 焦点科技股份有限公司 Video segment segmentation method based on Transformer
CN114222193A (en) * 2021-12-03 2022-03-22 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114782848A (en) * 2022-03-10 2022-07-22 沈阳雅译网络技术有限公司 Picture subtitle generating method applying characteristic pyramid
CN116310984A (en) * 2023-03-13 2023-06-23 中国科学院微电子研究所 Multi-mode video subtitle generating method based on Token sampling

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569717A (en) * 2021-07-26 2021-10-29 上海明略人工智能(集团)有限公司 Short video event classification method, system, device and medium based on label semantics
CN113609259A (en) * 2021-08-16 2021-11-05 山东新一代信息产业技术研究院有限公司 Multi-mode reasoning method and system for videos and natural languages
CN113423004B (en) * 2021-08-23 2021-11-30 杭州一知智能科技有限公司 Video subtitle generating method and system based on decoupling decoding
CN113423004A (en) * 2021-08-23 2021-09-21 杭州一知智能科技有限公司 Video subtitle generating method and system based on decoupling decoding
CN113784199B (en) * 2021-09-10 2022-09-13 中国科学院计算技术研究所 System, method, storage medium and electronic device for generating video description text
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 System and method for generating video description text
CN113837083A (en) * 2021-09-24 2021-12-24 焦点科技股份有限公司 Video segment segmentation method based on Transformer
CN114222193A (en) * 2021-12-03 2022-03-22 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114222193B (en) * 2021-12-03 2024-01-05 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114782848A (en) * 2022-03-10 2022-07-22 沈阳雅译网络技术有限公司 Picture subtitle generating method applying characteristic pyramid
CN114782848B (en) * 2022-03-10 2024-03-26 沈阳雅译网络技术有限公司 Picture subtitle generation method applying feature pyramid
CN116310984A (en) * 2023-03-13 2023-06-23 中国科学院微电子研究所 Multi-mode video subtitle generating method based on Token sampling
CN116310984B (en) * 2023-03-13 2024-01-30 中国科学院微电子研究所 Multi-mode video subtitle generating method based on Token sampling


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination