CN113159034A - Method and system for automatically generating subtitles by using short video

Method and system for automatically generating subtitles by using short video

Info

Publication number
CN113159034A
Authority
CN
China
Prior art keywords
event
video
network
layer
segment
Prior art date
Legal status
Pending
Application number
CN202110442856.4A
Other languages
Chinese (zh)
Inventor
颜成钢
高含笑
潘潇恺
孙垚棋
张继勇
李宗鹏
张勇东
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110442856.4A priority Critical patent/CN113159034A/en
Publication of CN113159034A publication Critical patent/CN113159034A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components


Abstract

The invention discloses a method and a system for automatically generating subtitles for short videos. Video features are extracted through a 3D convolutional network to obtain a video feature sequence; a DAPs model performs event detection on the received video feature sequence to obtain predicted event segments, and each segment is scored. Each obtained event segment is processed separately: a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into a Transformer model to obtain the predicted text. By fusing cross-modal techniques and including event detection in the preprocessing, the invention narrows the scope of the subsequent text generation work and increases the degree of matching between the generated text and the event. The text generation part exploits the excellent performance of the Transformer in feature encoding and decoding.

Description

Method and system for automatically generating subtitles by using short video
Technical Field
The invention belongs to the field of natural language processing, and relates to a text generation technology in the field of natural language processing.
Background
Short video refers to video content pushed at high frequency and played on various new media, suitable for viewing while on the move or during short leisure breaks; its duration is generally within 5 minutes. Since 2016, short video has become popular rapidly thanks to its low creation threshold, strong social attributes and fragmented entertainment style. The short video industry has developed quickly in China and is now mature; active short video platforms include Weibo, Douyin, Kuaishou and Xiaohongshu. Short video has become a carrier for bloggers to record and share their lives, a new way for audiences to be entertained and to receive product recommendations, and can even be used for brand promotion, bringing income to original creators.
The creation of a short video consists of content planning, video shooting, post-processing and other links. Subtitle generation, as one part of post-processing, is generally completed by the author adding subtitles manually, which is time-consuming and labor-intensive. At present, some subtitle generation software for videos has been developed and applied: text subtitles are derived from the speech in the video through speech recognition technology, supplemented by powerful functions such as manual modification and translation, so the software is highly practical. However, such software mainly depends on the audio information in the video, and no text can be extracted without clearly pronounced audio. In real life, an ordinary user who wants to record daily life with short videos must narrate while shooting, and adding narration afterwards brings back the problem of wasted time and energy. According to our research, automatically generating subtitles for short videos without an audio background remains a pain point and a technical gap.
Text generation technology has a wide range of application scenarios and can be applied to tasks such as information extraction, question answering and text creation: question-answering systems have given rise to chat robots, text creation has made it unremarkable for machines to write lyrics and compositions, and cross-modal text generation has become a research hotspot in recent years. Cross-modal processing combines image, audio and language processing; common applications include describing a picture in words and automatically generating subtitles for instructional videos.
The technology for automatically generating subtitles is cross-modal and relies on the fusion of image vision techniques and natural language processing. In terms of the overall framework, "Sequence to Sequence - Video to Text" (2015) proposed a model based on the LSTM framework: the original frames and the optical-flow images are processed separately, a 2D convolutional neural network extracts per-frame image features as the input of the LSTM encoding layer, the output of the LSTM decoding layer is the prediction, the probability of each word is predicted one by one, and finally the probabilities generated from the original frames and from the optical-flow images are combined by weighted summation. In 2015, "Video Description Generation Incorporating Spatio-Temporal Features and a Soft-Attention Mechanism" borrowed the attention mechanism from image description: features are first extracted from every frame, using a GoogLeNet network for 2D features and a 3D convolutional network for 3D features; when predicting each word, the attention mechanism computes a weight for each feature and the image features are summed according to these weights. In 2016, "Describing Videos using Multi-modal Fusion" proposed fusing image features, video features, environmental sound features, speech features, category features and so on as the feature representation of the video; the so-called fusion is a weighted average of these features, after which a video description is generated by an LSTM model.
The innovation points of video description technology come from three aspects: first, optimizing the extraction of video features; second, optimizing the event detection model; and third, optimizing the text generation model.
3D convolution was proposed in "Learning Spatiotemporal Features with 3D Convolutional Networks"; it is suitable for extracting spatio-temporal features and attends better to information in the temporal dimension.
Event detection aims to segment a long video into several segments according to semantics, so that each segment contains one event; the event detection model DAPs was proposed for this task. Models before DAPs set sliding windows of different sizes to scan the whole video several times and then used maximum likelihood estimation to find the most suitable event segments, which is slow. DAPs obtains event segments of different scales with only one sliding window, greatly improving both running speed and accuracy. "Attention Is All You Need" proposed the Transformer model, which is suitable for natural language processing tasks such as machine translation, text generation and text summarization; the whole network is composed entirely of attention mechanisms and fully attends to the contribution of context information.
Disclosure of Invention
The application scenario of the invention is the automatic generation of subtitles for short videos on platforms such as Weibo, Douyin and Kuaishou. The model adopted by the invention is an end-to-end model based on the Transformer, with video features as input and descriptive text as output. The main tasks are as follows: first, extracting video features from a short video; second, segmenting the video and cutting out several segments according to events; and third, generating a summary that describes each video event.
The technical problems to be solved by the invention are as follows: first, designing a video feature extraction model to optimize the extraction of video features; second, designing a video segmentation model to improve the accuracy of video segmentation; and third, designing a text generation model so that the generated text fits the picture displayed by the video.
Aiming at the defects in the prior art, the invention provides a method and a system for automatically generating subtitles by using short videos.
A system for automatically generating subtitles by using short video comprises a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to perform event detection on the received video feature sequence: the video feature sequence is input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window then scans the whole feature sequence to obtain predicted event segments, each of which is scored.
The text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model, from which the predicted text is obtained.
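For clarity, the cooperation of the three modules can be sketched as follows; this is a minimal, hypothetical PyTorch-style sketch, and the class and function names are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn

class SubtitlePipeline(nn.Module):
    """Sketch of the three-module pipeline: 3D-CNN features -> event proposals -> captions."""
    def __init__(self, feature_extractor, event_detector, text_generator):
        super().__init__()
        self.feature_extractor = feature_extractor  # 3D convolutional network (C3D-style)
        self.event_detector = event_detector        # DAPs-style proposal model
        self.text_generator = text_generator        # Transformer-based captioner

    def forward(self, video):                       # video: (1, 3, L, H, W)
        features = self.feature_extractor(video)    # (1, T, D) video feature sequence
        segments = self.event_detector(features)    # [(start, end, score), ...] per event
        captions = []
        for start, end, score in segments:
            clip_feats = features[:, start:end]     # features of one event segment
            captions.append(self.text_generator(clip_feats))  # one subtitle per event
        return captions
```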
A method for automatically generating subtitles by short video comprises the following steps:
step (1), constructing a data set;
the method comprises the steps of constructing a data set, randomly sampling videos with subtitles in various large and short video platforms in China, uniformly preprocessing the videos, manually checking the subtitles of each video, correcting description which does not meet the facts, confirming that the subtitles can completely and simply summarize scenes or events displayed by the videos, and using the processed subtitles as tags of the videos.
Step (2), extracting video characteristics;
the 3D convolutional network in the video feature extraction module is used to extract the video features and obtain the video feature sequence. The reason is that a 3D convolutional network is better suited than a 2D convolutional network to learning spatio-temporal features: the 3D convolution kernel and the 3D pooling operation model temporal information better, and the network extracts video features through convolutional layers, pooling layers, fully connected layers and so on.
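As a hedged illustration of how a 3D convolution also covers the temporal dimension, the following minimal sketch (assuming PyTorch; the clip size and channel counts are illustrative and not taken from the patent) applies a 3D convolution and a 3D pooling to a clip of L frames:

```python
import torch
import torch.nn as nn

# A 16-frame RGB clip of 112x112 images: (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 16, 112, 112)

conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)  # 3x3x3 kernel
pool3d = nn.MaxPool3d(kernel_size=(1, 2, 2))  # pool only spatially, keep all 16 frames

out = pool3d(conv3d(clip))
print(out.shape)  # torch.Size([1, 64, 16, 56, 56]); the kernel slides over time as well as space
```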
Step (3), event detection;
the DAPs model in the event detection module is used to detect events. On the basis of step (2), the acquired video features are input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window scans the whole feature sequence to obtain predicted event segments, each of which is scored. The DAPs model is trained with the score of each segment and the segment matching accuracy as the loss function.
Step (4), generating a text;
the Transformer model in the text generation module is used to generate text. Its advantage is that it relies entirely on a self-attention mechanism to capture the global dependencies between input and output, which helps to parallelize computation and allows the correlation between any two words to be computed directly, without being passed through a hidden layer. Each event segment obtained in step (3) is processed separately: a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model to obtain the predicted text.
Step (5), the network models of steps (2) to (4) are trained on the data set, and the trained network models then complete the automatic generation of short video subtitles.
The invention has the following beneficial effects:
crawling contributes data sets to short videos and subtitles such as microblog, trembling, fast hands. And secondly, a cross-modal technology is fused, and the pre-processing comprises event detection, so that the range of the subsequent text generation work is narrowed, and the matching degree of the generated text and the event is increased. The text generation section exerts excellent performance of the Transformer in terms of feature encoding and decoding.
Drawings
FIG. 1 is a schematic diagram of a 3D convolution operation;
FIG. 2 is a schematic diagram of a C3D network structure;
FIG. 3 is a text generation flow diagram;
FIG. 4 is a schematic diagram of a Transformer network structure;
FIG. 5 is a schematic diagram of a Transformer macro network structure.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and examples.
The invention provides a method and a system for automatically generating subtitles by using short video.
A method for automatically generating subtitles by short videos comprises the following steps:
step (1), constructing a data set;
the method comprises the steps of constructing a data set, randomly sampling videos with subtitles in various large and short video platforms in China, uniformly preprocessing the videos, manually checking the subtitles of each video, correcting description which does not meet the facts, confirming that the subtitles can completely and simply summarize scenes or events displayed by the videos, and using the processed subtitles as tags of the videos.
Step (2), extracting video features.
The 3D convolutional network used is a C3D network. C3D is a deep 3-dimensional convolutional network for processing video (an image set with temporal order), and its operation is shown in FIG. 1, where k is the height and width of the 3D convolution kernel, D is the depth of the 3D convolution kernel, W and H are the width and height of each frame in the video, and L is the number of frames in the video. In the spatial dimension, as in the 2D convolution operation, the kernel traverses the image set with a set stride, and the video features are captured through the 3D convolution operation. The complete network framework of C3D is shown in FIG. 2: the C3D network includes 8 convolutional layers, 5 max pooling layers, 2 fully connected layers and a nonlinear output layer. The 3D convolution kernels (Conv1a-Conv5b) are all of size 3 × 3 × 3, with stride 1 in both space and time. The 5 max pooling layers are pool1-pool5 in sequence; the pooling kernel of pool1 is 1 × 2 × 2, the other pooling kernels are 2 × 2 × 2, and the 2 fully connected layers (fc6-fc7) each have 4096 output units. The video is processed by the C3D network to obtain the video feature sequence.
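The layer configuration described above can be sketched in code. This is a minimal reconstruction under stated assumptions: the kernel sizes, pooling sizes and layer counts follow the text, while the channel widths (64 to 512), the input clip size of 16 × 112 × 112 and the padding on pool5 are common C3D settings assumed for illustration.

```python
import torch
import torch.nn as nn

class C3DSketch(nn.Module):
    """Sketch of the described C3D layout: 8 conv layers, 5 max-pool layers, 2 FC layers."""
    def __init__(self, feature_dim=4096):
        super().__init__()
        def conv(cin, cout):  # 3x3x3 kernel, stride 1 in space and time
            return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, padding=1), nn.ReLU())
        self.features = nn.Sequential(
            conv(3, 64),                    nn.MaxPool3d((1, 2, 2)),                     # conv1a, pool1
            conv(64, 128),                  nn.MaxPool3d((2, 2, 2)),                     # conv2a, pool2
            conv(128, 256), conv(256, 256), nn.MaxPool3d((2, 2, 2)),                     # conv3a-3b, pool3
            conv(256, 512), conv(512, 512), nn.MaxPool3d((2, 2, 2)),                     # conv4a-4b, pool4
            conv(512, 512), conv(512, 512), nn.MaxPool3d((2, 2, 2), padding=(0, 1, 1)),  # conv5a-5b, pool5
        )
        self.fc6 = nn.Linear(512 * 1 * 4 * 4, feature_dim)  # for a 16x112x112 input clip
        self.fc7 = nn.Linear(feature_dim, feature_dim)

    def forward(self, clip):                                   # clip: (N, 3, 16, 112, 112)
        x = torch.flatten(self.features(clip), 1)
        return torch.relu(self.fc7(torch.relu(self.fc6(x))))  # 4096-dimensional clip feature
```

Sliding this network over consecutive clips of the video yields the video feature sequence used by the event detection module.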
Step (3), event detection:
because the lengths of the event segments are different, different sliding windows need to be set for scanning the whole video for multiple times in a model before DAPs, and then the most appropriate event segment is found by using a maximum likelihood estimation method, so that the running speed is very low. DAPs can obtain event segments with different scales only by using one sliding window, so that the running speed is greatly increased. Inputting the video feature sequence obtained in the step (2) into an LSTM network of the DAPs model to connect the features in series, outputting a hidden layer of the LSTM as a time feature sequence, scanning the whole feature sequence by using a sliding window through an Anchor mechanism to obtain predicted event segments, and scoring each segment. A mapping point of the center of a sliding window on an original video sequence is called as Anchor, different-scale propulses are generated by taking the Anchor as the center, each propulsal is scored according to whether an event exists or not, the matching accuracy of the propulsal and an actual event fragment is calculated, a loss function is formed by scoring and matching accuracy of whether a predicted event exists or not, the former requires that the probability that the propulsal contains the event is as high as possible, and the latter requires that the predicted fragment fits an interval of a real event fragment as much as possible.
The final result of event detection is: <start time of event 1, end time of event 1>, …, <start time of event n, end time of event n>.
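A hedged sketch of this proposal step is given below, assuming PyTorch; the hidden size, the set of proposal scales and the score threshold are illustrative choices, not values given in the patent.

```python
import torch
import torch.nn as nn

class DAPsSketch(nn.Module):
    """Sketch of a DAPs-style proposal head: an LSTM links the clip features in series,
    and at each sliding-window position (Anchor) K proposals of different scales are scored."""
    def __init__(self, feat_dim=4096, hidden=512, num_scales=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.scales = [2 ** (i + 1) for i in range(num_scales)]  # proposal lengths in clips
        self.score = nn.Linear(hidden, num_scales)               # one confidence per scale

    def forward(self, feats, threshold=0.5):                     # feats: (1, T, feat_dim)
        h, _ = self.lstm(feats)                                  # hidden states = temporal features
        scores = torch.sigmoid(self.score(h[0]))                 # (T, K) proposal scores
        proposals = []
        for t in range(scores.size(0)):                          # t = Anchor position
            for k, length in enumerate(self.scales):
                if scores[t, k] > threshold and t - length + 1 >= 0:
                    proposals.append((t - length + 1, t, scores[t, k].item()))
        return proposals                                         # [(start, end, score), ...]
```

In training, the score term and the overlap with the labelled event intervals (the matching accuracy) together form the loss, as described above.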
Step (4), text generation:
and (4) intercepting a video segment of each event by using the result of the event detection, and performing video description on each event segment in step (4), wherein the steps are shown in fig. 3. The text generation model adopts a Transformer-based text generation model. Firstly, visual embedding is carried out, a video sequence of an event segment is input into a C3D network to extract video characteristics, then LSTM network sharing time sequence information is input, and the hidden layer output of the LSTM is used as the input of a transform encoder.
The structure of the Transformer model is shown in FIG. 4. It consists of two parts, an encoder and a decoder: the input of the encoder is the video feature sequence, the input of the decoder is the encoded feature sequence, and the output of the decoder is the predicted text. The encoder is composed of 6 encoding blocks and the decoder of 6 decoding blocks, as shown in FIG. 5.
Specifically, an encoding/decoding block comprises a multi-head self-attention layer, a residual normalization layer and a feed-forward neural network layer. The encoding block and the decoding block have the same network structure but different input and output contents.
The structure and function of each layer are as follows:
the formula of a single-layer self-attention mechanism in the multi-head self-attention layer is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Q is the query vector, K is the key vector, V is the value vector and d_k is the dimension of the key vectors; Q, K and V in an encoding block or a decoding block come from the input of the layer above that block. The multi-head self-attention mechanism applies several linear transformations to the input of the encoding or decoding block, computes the attention value for each, and then concatenates the results and applies a linear transformation, according to the formulas:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O
where W_i^Q is the linear transformation weight for Q, and W_i^K and W_i^V are defined in the same way; head_i is the attention result of the i-th head, Concat is the column-wise concatenation operation, and W^O is the output linear transformation weight.
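These formulas can be implemented directly; the following is a minimal sketch assuming PyTorch, with illustrative dimensions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Direct implementation of the formulas above: per-head linear maps of Q, K and V,
    scaled dot-product attention, column-wise concatenation, and the output projection W^O."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.dk = num_heads, d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)   # stacks W_i^Q for all heads
        self.wk = nn.Linear(d_model, d_model)   # stacks W_i^K
        self.wv = nn.Linear(d_model, d_model)   # stacks W_i^V
        self.wo = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v):                 # each (B, T, d_model)
        B, T, _ = q.shape
        def split(x, w):                        # -> (B, h, T, dk)
            return w(x).view(B, -1, self.h, self.dk).transpose(1, 2)
        Q, K, V = split(q, self.wq), split(k, self.wk), split(v, self.wv)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.dk)            # QK^T / sqrt(d_k)
        heads = torch.softmax(scores, dim=-1) @ V                        # softmax(...)V per head
        concat = heads.transpose(1, 2).reshape(B, T, self.h * self.dk)   # Concat(head_1..head_h)
        return self.wo(concat)                                           # ...W^O
```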
Feed-forward neural network layer (Feed Forward): it consists of a ReLU activation layer and a fully connected layer, and its purpose is to adjust the output dimension. The formula is as follows:
FFN(x) = max(0, xW1 + b1)W2 + b2
where W1 is the weight of the activation layer, b1 is the bias of the activation layer, W2 is the weight of the fully connected layer, b2 is the bias of the fully connected layer, and x is the input vector.
Residual normalization layer (Add & Norm): it consists of a residual connection and layer normalization. The residual connection eases the training of multi-layer networks by letting the network attend only to the current residual part, and the normalization operation accelerates network convergence. The formula is as follows:
output=LayerNorm(x+F(x))
where x is the input vector and F(x) is the operation of the layer preceding the residual normalization layer, i.e. the multi-head attention mechanism or the feed-forward neural network.
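Putting the three layers together, one encoding block can be sketched as follows, assuming PyTorch and the MultiHeadSelfAttention sketch above; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class EncodingBlock(nn.Module):
    """One encoding block as described: multi-head self-attention, Add & Norm, feed-forward, Add & Norm."""
    def __init__(self, d_model=512, d_ff=2048, num_heads=8):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)
        self.ffn = nn.Sequential(                 # FFN(x) = max(0, xW1 + b1)W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (B, T, d_model)
        x = self.norm1(x + self.attn(x, x, x))    # output = LayerNorm(x + F(x)), F = attention
        x = self.norm2(x + self.ffn(x))           # F = feed-forward network
        return x
```

A decoding block uses the same layers; its attention additionally takes the encoder output as K and V, which is how the encoded feature sequence enters the decoder.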
Step (5), the network models of steps (2) to (4) are trained on the data set, and the trained network models then complete the automatic generation of short video subtitles.
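A hedged sketch of the end-to-end training in step (5); the optimizer, the token-level cross-entropy loss, the padding index and the pipeline's training interface (forward(clips, target_tokens) returning per-token logits) are assumptions made for illustration, not details given in the patent.

```python
import torch
import torch.nn as nn

def train(pipeline, dataloader, vocab_size, epochs=10, lr=1e-4):
    """Train the captioning model with token-level cross-entropy against the subtitle labels."""
    optimizer = torch.optim.Adam(pipeline.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=0)           # index 0 assumed to be padding
    for epoch in range(epochs):
        for clips, caption_tokens in dataloader:              # caption_tokens: (B, S) subtitle labels
            logits = pipeline(clips, caption_tokens[:, :-1])  # teacher forcing on the decoder
            loss = criterion(logits.reshape(-1, vocab_size),
                             caption_tokens[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The DAPs proposal loss (segment score plus matching accuracy, step (3)) would be trained in the same loop or separately, depending on implementation.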
A system for automatically generating subtitles by using short video comprises a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to perform event detection on the received video feature sequence: the video feature sequence is input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window then scans the whole feature sequence to obtain predicted event segments, each of which is scored.
The text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model, from which the predicted text is obtained.

Claims (5)

1. A system for automatically generating subtitles by using short video is characterized by comprising a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to perform event detection on the received video feature sequence: the video feature sequence is input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window then scans the whole feature sequence to obtain predicted event segments, each of which is scored;
the text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model, from which the predicted text is obtained.
2. A method for automatically generating subtitles by short videos is characterized by comprising the following steps:
step (1), constructing a data set;
constructing a data set: subtitled videos are randomly sampled from the major domestic short video platforms and preprocessed uniformly, the subtitles of each video are checked manually, descriptions that do not match the facts are corrected, it is confirmed that the subtitles summarize the scenes or events shown in the video completely and concisely, and the processed subtitles are used as the labels of the videos;
step (2), extracting video characteristics;
extracting the video characteristics by using a 3D convolutional network in a video characteristic extraction module to obtain a video characteristic sequence;
step (3), event detection;
using the DAPs model in the event detection module to detect events: on the basis of step (2), the acquired video features are input into the LSTM network of the DAPs model to connect the features in series, the hidden-layer vectors of the LSTM network are used as temporal features, and a sliding window scans the whole feature sequence to obtain predicted event segments, each of which is scored; the DAPs model is trained with the score of each segment and the segment matching accuracy as the loss function;
step (4), generating a text;
generating text with the Transformer model in the text generation module: each event segment obtained in step (3) is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model to obtain the predicted text;
and (5) training the network models from the step (2) to the step (4) through a data set, and finishing automatic generation of the short video captions through the trained network models.
3. The method for automatically generating subtitles by using short video according to claim 2, wherein the step (2) is specifically as follows:
the 3D convolutional network used is a C3D network; C3D is a deep 3-dimensional convolutional network for processing video, where k is the height and width of the 3D convolution kernel, D is the depth of the 3D convolution kernel, W and H are the width and height of each frame in the video, and L is the number of frames in the video; in the spatial dimension, as in the 2D convolution operation, the kernel traverses the image set with a set stride, and the video features are captured through the 3D convolution operation; the C3D network includes 8 convolutional layers, 5 max pooling layers, 2 fully connected layers and a nonlinear output layer; the 3D convolution kernels (Conv1a-Conv5b) are all of size 3 × 3 × 3, with stride 1 in both space and time; the 5 max pooling layers are pool1-pool5 in sequence, the pooling kernel of pool1 is 1 × 2 × 2, the other pooling kernels are 2 × 2 × 2, and the 2 fully connected layers (fc6-fc7) each have 4096 output units; the video is processed by the C3D network to obtain the video feature sequence.
4. The method for automatically generating subtitles by using short video according to claim 3, wherein the step (3) is specifically as follows:
inputting the video feature sequence obtained in step (2) into the LSTM network of the DAPs model to connect the features in series, using the hidden-layer output of the LSTM as the temporal feature sequence, and scanning the whole feature sequence with a sliding window through an Anchor mechanism to obtain predicted event segments, each of which is scored; the mapping point of the center of the sliding window on the original video sequence is called an Anchor, proposals of different scales are generated centered on the Anchor, each proposal is scored according to whether it contains an event, and the matching accuracy between the proposal and the actual event segment is computed; the loss function is formed from the event score and the matching accuracy, the former requiring that the probability that a proposal contains an event be as high as possible, and the latter requiring that the predicted segment fit the interval of the real event segment as closely as possible;
and finally obtaining the result of event detection: < start time of event 1, end time of event 1 >, …, < start time of event n, end time of event n >.
5. The method for automatically generating subtitles according to claim 4, wherein the step (4) is specifically as follows:
the text generation model is based on the Transformer; first, visual embedding is performed: the video sequence of the event segment is input into the C3D network to extract video features, the features are then fed into an LSTM network to fuse the temporal information, and the hidden-layer output of the LSTM is used as the input of the Transformer encoder;
the Transformer model is composed of an encoder and a decoder, wherein the input of the encoder is a video characteristic sequence, the input of the decoder is a coded characteristic sequence, the output of the decoder is a predicted text, the encoder is composed of 6 coding blocks, and the decoder is composed of 6 decoding blocks;
specifically, one coding block/decoding block comprises a multi-head self-attention mechanism layer, a residual error normalization layer and a feedforward neural network layer; the network structure of the coding block and the decoding block is the same, and the input content and the output content of the coding block and the decoding block are different;
the structure and function of each layer are as follows:
the multi-head self-attention mechanism layer has the following formula of a single-layer self-attention mechanism:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Q is the query vector, K is the key vector, V is the value vector and d_k is the dimension of the key vectors; Q, K and V in an encoding block or a decoding block come from the input of the layer above that block; the multi-head self-attention mechanism applies several linear transformations to the input of the encoding or decoding block, computes the attention value for each, and then concatenates the results and applies a linear transformation, according to the formulas:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O
where W_i^Q is the linear transformation weight for Q, and W_i^K and W_i^V are defined in the same way; head_i is the attention result of the i-th head, Concat is the column-wise concatenation operation, and W^O is the output linear transformation weight;
feed-forward neural network layer (Feed Forward): it consists of a ReLU activation layer and a fully connected layer, and its purpose is to adjust the output dimension, with the following formula:
FFN(x)=max(0,xW1+b1)W2+b2
where W1 is the weight of the activation layer, b1 is the bias of the activation layer, W2 is the weight of the fully connected layer, b2 is the bias of the fully connected layer, and x is the input vector;
residual normalization layer (Add & Norm): the method is characterized by comprising a residual error network and normalization, wherein the residual error connection solves the problem of multi-layer network training, the network can only pay attention to the current difference part, the normalization operation can accelerate network convergence, and the formula is as follows:
output=LayerNorm(x+F(x))
where x is the input vector, F (x) is the previous layer operation of the residual normalization layer, the multi-headed attention mechanism or the feedforward neural network.
CN202110442856.4A 2021-04-23 2021-04-23 Method and system for automatically generating subtitles by using short video Pending CN113159034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110442856.4A CN113159034A (en) 2021-04-23 2021-04-23 Method and system for automatically generating subtitles by using short video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110442856.4A CN113159034A (en) 2021-04-23 2021-04-23 Method and system for automatically generating subtitles by using short video

Publications (1)

Publication Number Publication Date
CN113159034A true CN113159034A (en) 2021-07-23

Family

ID=76869972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110442856.4A Pending CN113159034A (en) 2021-04-23 2021-04-23 Method and system for automatically generating subtitles by using short video

Country Status (1)

Country Link
CN (1) CN113159034A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113423004A (en) * 2021-08-23 2021-09-21 杭州一知智能科技有限公司 Video subtitle generating method and system based on decoupling decoding
CN113569717A (en) * 2021-07-26 2021-10-29 上海明略人工智能(集团)有限公司 Short video event classification method, system, device and medium based on label semantics
CN113609259A (en) * 2021-08-16 2021-11-05 山东新一代信息产业技术研究院有限公司 Multi-mode reasoning method and system for videos and natural languages
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 System and method for generating video description text
CN113837083A (en) * 2021-09-24 2021-12-24 焦点科技股份有限公司 Video segment segmentation method based on Transformer
CN114222193A (en) * 2021-12-03 2022-03-22 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114782848A (en) * 2022-03-10 2022-07-22 沈阳雅译网络技术有限公司 Picture subtitle generating method applying characteristic pyramid
CN116310984A (en) * 2023-03-13 2023-06-23 中国科学院微电子研究所 Multi-mode video subtitle generating method based on Token sampling

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569717A (en) * 2021-07-26 2021-10-29 上海明略人工智能(集团)有限公司 Short video event classification method, system, device and medium based on label semantics
CN113609259A (en) * 2021-08-16 2021-11-05 山东新一代信息产业技术研究院有限公司 Multi-mode reasoning method and system for videos and natural languages
CN113423004B (en) * 2021-08-23 2021-11-30 杭州一知智能科技有限公司 Video subtitle generating method and system based on decoupling decoding
CN113423004A (en) * 2021-08-23 2021-09-21 杭州一知智能科技有限公司 Video subtitle generating method and system based on decoupling decoding
CN113784199B (en) * 2021-09-10 2022-09-13 中国科学院计算技术研究所 System, method, storage medium and electronic device for generating video description text
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 System and method for generating video description text
CN113837083A (en) * 2021-09-24 2021-12-24 焦点科技股份有限公司 Video segment segmentation method based on Transformer
CN114222193A (en) * 2021-12-03 2022-03-22 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114222193B (en) * 2021-12-03 2024-01-05 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114782848A (en) * 2022-03-10 2022-07-22 沈阳雅译网络技术有限公司 Picture subtitle generating method applying characteristic pyramid
CN114782848B (en) * 2022-03-10 2024-03-26 沈阳雅译网络技术有限公司 Picture subtitle generation method applying feature pyramid
CN116310984A (en) * 2023-03-13 2023-06-23 中国科学院微电子研究所 Multi-mode video subtitle generating method based on Token sampling
CN116310984B (en) * 2023-03-13 2024-01-30 中国科学院微电子研究所 Multi-mode video subtitle generating method based on Token sampling


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination