CN113159034A - Method and system for automatically generating subtitles by using short video - Google Patents
- Publication number
- CN113159034A (application number CN202110442856.4A)
- Authority
- CN
- China
- Prior art keywords
- event
- video
- network
- layer
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/635—Overlay text, e.g. embedded captions in a TV program
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The invention discloses a method and a system for automatically generating subtitles for short videos. Video features are extracted through a 3D convolutional network to obtain a video feature sequence; a DAPs model performs event detection on the feature sequence to obtain predicted event segments, and each segment is scored. Each event segment is then processed independently: a visual embedding operation is first applied to the segment, and the embedded segment is input into a Transformer model to obtain the predicted text. By fusing cross-modal techniques and including event detection in the pre-processing, the invention narrows the scope of the subsequent text generation and increases the match between the generated text and the event. The text generation part exploits the excellent performance of the Transformer in feature encoding and decoding.
Description
Technical Field
The invention belongs to the field of natural language processing, and relates to a text generation technology in the field of natural language processing.
Background
Short video refers to high-frequency pushed video content played on various new media, suited to viewing on the move or during short leisure breaks, and generally lasting no more than 5 minutes. Since 2016, short videos have rapidly gained popularity thanks to their low creation threshold, strong social attributes, and fragmented entertainment value; the short video industry has developed quickly in China and is mature today, with active platforms including Weibo, Douyin, Kuaishou, and Xiaohongshu. Short video has become a carrier for bloggers to record and share their lives, a new mode of audience entertainment and product recommendation, and can even promote brands and bring income to original authors.
Short video creation consists of stages such as content planning, video shooting, and post-processing; subtitle generation, as one part of post-processing, is usually completed by the author manually, which costs time and energy. Some video subtitling software has already been developed and applied: it derives text subtitles from the speech in the video through speech recognition, supplemented by powerful manual editing and translation functions, and is highly practical. However, such software relies mainly on the audio information in the video; without clearly pronounced audio, no text can be extracted. In real life, an ordinary user who wants to record life with short video must either narrate while shooting, which not everyone is willing to do, or dub the video in post-production, which returns to the problem of wasted time and energy. According to our research, automatically generating subtitles for short videos that lack an audio track is both a pain point and a technical gap.
Text generation technology has wide application scenarios and can be applied to tasks such as information extraction, question-answering systems, and text creation: chatbots were born from question-answering systems, and machines that write poems and compose essays are no longer mere talk thanks to text creation. Cross-modal text generation has been a research hotspot in recent years; it requires cross-modal processing techniques that combine image, audio, and language processing, and common applications include describing a picture in words and automatically generating subtitles for teaching videos.
The technology for automatically generating subtitles is cross-modal, relying on the fusion of image vision techniques and natural language processing. At the framework level, "Sequence to Sequence - Video to Text" (2015) proposed an LSTM-based model: the original frames and optical-flow images are processed separately, per-frame image features are extracted with a 2D convolutional neural network as input to an LSTM encoding layer, the LSTM decoding layer outputs the prediction by predicting word probabilities one by one, and the probabilities generated from the original frames and from the optical-flow images are finally combined by weighted summation. In 2015, "Video Description Generation Incorporating Spatio-Temporal Features and a Soft-Attention Mechanism" borrowed the attention mechanism from image description: features are first extracted from each frame, using a GoogLeNet network for 2D features and a 3D convolutional network for 3D features; each time a word is predicted, the attention mechanism computes a weight for each feature and the image features are summed according to those weights. In 2016, "Describing Videos using Multi-modal Fusion" proposed fusing image features, video features, environmental sound features, speech features, genre features, and so on as the feature representation of the video; in this fusion each feature is weighted and averaged, and the video description is then generated by an LSTM model.
Innovations in video description technology come from three aspects: first, optimizing the extraction of video features; second, optimizing the event detection model; and third, optimizing the text generation model.
3D convolution was proposed in "Learning Spatiotemporal Features with 3D Convolutional Networks"; it is well suited to extracting spatio-temporal features and pays better attention to information in the temporal dimension.
Event detection aims to segment a long video into several segments by semantics so that each segment contains one event; the DAPs event detection model was proposed for this purpose. Models before DAPs set different sliding windows to scan the whole video multiple times and then used maximum likelihood estimation to find the most appropriate event segments, which was slow. DAPs obtains event segments of different scales with only one sliding window, greatly improving both running speed and accuracy. "Attention Is All You Need" proposed the Transformer model, suitable for natural language processing tasks such as machine translation, text generation, and text summarization; the entire network is composed of attention mechanisms and fully attends to the contribution of contextual information.
Disclosure of Invention
The application scenario of the invention is the automatic generation of subtitles for short videos on platforms such as Weibo, Douyin, and Kuaishou. The adopted model is an end-to-end Transformer-based model, with video features as input and the descriptive text as output. The main tasks are: first, extracting video features from a video segment; second, segmenting the video, intercepting several segments according to events; and third, generating a summary describing the video events.
The technical problem to be solved by the invention is as follows: firstly, designing a video feature extraction model to optimize the extraction of video features, secondly designing a video segmentation model to improve the accuracy of video segmentation, and thirdly designing a text generation model to make a generated text fit with a picture displayed by a video.
Aiming at the defects in the prior art, the invention provides a method and a system for automatically generating subtitles by using short videos.
A system for automatically generating subtitles by using short video comprises a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to carry out event detection on the received video characteristic sequence, inputs the video characteristic sequence into an LSTM network in the DAPs model, connects the characteristics in series, uses a hidden layer vector of the LSTM network as a time characteristic, and then uses a sliding window to scan the whole characteristic sequence to obtain predicted event segments and scores each segment.
The text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model to obtain the predicted text.
A method for automatically generating subtitles by short video comprises the following steps:
step (1), constructing a data set;
the method comprises the steps of constructing a data set, randomly sampling videos with subtitles in various large and short video platforms in China, uniformly preprocessing the videos, manually checking the subtitles of each video, correcting description which does not meet the facts, confirming that the subtitles can completely and simply summarize scenes or events displayed by the videos, and using the processed subtitles as tags of the videos.
Step (2), extracting video characteristics;
The 3D convolutional network in the video feature extraction module is used to extract video features and obtain a video feature sequence. The reason is that a 3D convolutional network is more suitable than a 2D convolutional network for learning spatio-temporal features: the 3D convolution kernel and 3D pooling operation better model temporal information, and the video features are extracted through convolutional layers, pooling layers, fully connected layers, and so on.
Step (3), event detection;
The DAPs model in the event detection module is used for event detection: on the basis of step (2), the acquired video features are input into the LSTM network in the DAPs model to connect the features in series; the hidden-layer vector of the LSTM network serves as the temporal feature, a sliding window scans the whole feature sequence to obtain predicted event segments, and each segment is scored. The DAPs model is trained with the score of each segment and the segment matching accuracy as the loss function.
Step (4), generating a text;
The Transformer model in the text generation module is used to generate text. Its advantage is that it relies entirely on a self-attention mechanism to capture the global dependencies between input and output, which helps parallelize computation and allows the correlation between any two words to be computed directly, without passing through a hidden layer. Each event segment obtained in step (3) is processed separately: a visual embedding operation is first applied to the segment, and the embedded segment is then input into the Transformer model to obtain the predicted text.
And (5) training the network models from the step (2) to the step (4) through a data set, and finishing automatic generation of the short video captions through the trained network models.
The invention has the following beneficial effects:
First, short videos with subtitles are crawled from platforms such as Weibo, Douyin, and Kuaishou, contributing a data set. Second, cross-modal techniques are fused and event detection is included in the pre-processing, narrowing the scope of the subsequent text generation and increasing the match between the generated text and the event. Third, the text generation part exploits the excellent performance of the Transformer in feature encoding and decoding.
Drawings
FIG. 1 is a schematic diagram of a 3D convolution operation;
FIG. 2 is a schematic diagram of a C3D network structure;
FIG. 3 is a text generation flow diagram;
FIG. 4 is a schematic diagram of a Transformer network structure;
FIG. 5 is a schematic diagram of a Transformer macro network structure.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and examples.
The invention provides a method and a system for automatically generating subtitles by using short video.
A method for automatically generating subtitles by short videos comprises the following steps:
step (1), constructing a data set;
the method comprises the steps of constructing a data set, randomly sampling videos with subtitles in various large and short video platforms in China, uniformly preprocessing the videos, manually checking the subtitles of each video, correcting description which does not meet the facts, confirming that the subtitles can completely and simply summarize scenes or events displayed by the videos, and using the processed subtitles as tags of the videos.
And (2) extracting video features.
The 3D convolutional network used is a C3D network. C3D is a deep 3-dimensional convolutional network for processing video (an image set with temporal order); its operation is shown in fig. 1, where k is the spatial (length and width) size of the 3D convolution kernel, D is the depth of the 3D convolution kernel, W and H are the width and height of each frame, and L is the number of frames in the video. In the spatial dimension, as in a 2D convolution, the kernel traverses the image set with a set stride, and video features are captured through the 3D convolution operation. The complete C3D network framework is shown in fig. 2: the network comprises 8 convolutional layers, 5 max-pooling layers, 2 fully connected layers, and a nonlinear output layer. The 3D convolution kernels (Conv1a-Conv5b) are all 3 × 3 × 3 with stride 1 in both space and time. The 5 max-pooling layers are pool1-pool5 in sequence; the pooling kernel of pool1 is 1 × 2 × 2 and the other pooling kernels are 2 × 2 × 2. The 2 fully connected layers (fc6-fc7) each have 4096 output units. The video is processed through the C3D network to obtain the video feature sequence.
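As an illustration of the 3D convolution operation in fig. 1, the following is a minimal numpy sketch of a single-channel 3D convolution: a D × k × k kernel slides over an L × H × W video volume with stride 1 in space and time. The naive triple loop and the toy dimensions are illustrative only, not C3D's actual implementation (which uses many kernels per layer and input clips of 16 frames at 112 × 112).

```python
import numpy as np

def conv3d_single(video, kernel, stride=1):
    """Naive single-channel 3D convolution: slide a d x k x k kernel
    over an L x H x W video with the given stride, no padding."""
    L, H, W = video.shape
    d, k, _ = kernel.shape
    out = np.zeros(((L - d) // stride + 1,
                    (H - k) // stride + 1,
                    (W - k) // stride + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = video[t*stride:t*stride+d,
                              i*stride:i*stride+k,
                              j*stride:j*stride+k]
                out[t, i, j] = np.sum(patch * kernel)
    return out

# Toy input; C3D's real input clips are 16 frames of 112 x 112.
video = np.ones((8, 10, 10))
kernel = np.ones((3, 3, 3)) / 27.0   # a 3 x 3 x 3 kernel as in Conv1a-Conv5b
feat = conv3d_single(video, kernel)  # output shape (L-2, H-2, W-2)
```

With stride 1 and no padding, each output dimension shrinks by the kernel extent minus one, which is why an averaging kernel over an all-ones volume returns an all-ones feature map.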
Step (3), event detection:
because the lengths of the event segments are different, different sliding windows need to be set for scanning the whole video for multiple times in a model before DAPs, and then the most appropriate event segment is found by using a maximum likelihood estimation method, so that the running speed is very low. DAPs can obtain event segments with different scales only by using one sliding window, so that the running speed is greatly increased. Inputting the video feature sequence obtained in the step (2) into an LSTM network of the DAPs model to connect the features in series, outputting a hidden layer of the LSTM as a time feature sequence, scanning the whole feature sequence by using a sliding window through an Anchor mechanism to obtain predicted event segments, and scoring each segment. A mapping point of the center of a sliding window on an original video sequence is called as Anchor, different-scale propulses are generated by taking the Anchor as the center, each propulsal is scored according to whether an event exists or not, the matching accuracy of the propulsal and an actual event fragment is calculated, a loss function is formed by scoring and matching accuracy of whether a predicted event exists or not, the former requires that the probability that the propulsal contains the event is as high as possible, and the latter requires that the predicted fragment fits an interval of a real event fragment as much as possible.
And finally obtaining the result of event detection: < start time of event 1, end time of event 1 >, …, < start time of event n, end time of event n >.
Step (4), text generation:
The result of event detection is used to cut out the video segment of each event, and step (4) performs video description on each event segment, as shown in fig. 3. The text generation model is Transformer-based. First, visual embedding is performed: the video sequence of an event segment is input into a C3D network to extract video features, which are then fed into an LSTM network to share temporal information; the hidden-layer output of the LSTM serves as the input of the Transformer encoder.
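A minimal sketch of the visual-embedding step, under the assumption of a single-layer LSTM whose four gate weight blocks are stacked in `W`, `U`, `b` (illustrative names and dimensions): the hidden state at each time step is collected, and this sequence would serve as the Transformer encoder input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_hidden_sequence(feats, W, U, b, d_h):
    """Run a single-layer LSTM over a sequence of C3D features and
    return all hidden states. W, U, b stack the input, forget, cell,
    and output gates (4*d_h rows)."""
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    hidden = []
    for x in feats:
        z = W @ x + U @ h + b
        i = sigmoid(z[:d_h])            # input gate
        f = sigmoid(z[d_h:2 * d_h])     # forget gate
        g = np.tanh(z[2 * d_h:3 * d_h]) # candidate cell state
        o = sigmoid(z[3 * d_h:])        # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        hidden.append(h)
    return np.stack(hidden)

rng = np.random.default_rng(3)
d_in, d_h, T = 8, 6, 10                  # illustrative dimensions
W = rng.standard_normal((4 * d_h, d_in)) * 0.1
U = rng.standard_normal((4 * d_h, d_h)) * 0.1
b = np.zeros(4 * d_h)
feats = rng.standard_normal((T, d_in))   # stand-in for C3D features
enc_input = lstm_hidden_sequence(feats, W, U, b, d_h)
```

Each hidden state is bounded by the output gate and tanh, so the encoder receives a well-scaled sequence of one vector per time step.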
The Transformer model structure is shown in fig. 4 and is composed of two parts, namely an encoder and a decoder, wherein the input of the encoder is a video feature sequence, the input of the decoder is an encoded feature sequence, the output of the decoder is predicted text, the encoder is composed of 6 encoding blocks, and the decoder is composed of 6 decoding blocks, as shown in fig. 5.
Specifically, one coding/decoding block comprises a multi-head self-attention mechanism layer, a residual normalization layer and a feedforward neural network layer. The coding block and the decoding block have the same network structure, and have different input and output contents.
The structure and function of each layer are as follows:
The multi-head self-attention layer: the formula of a single self-attention head is:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

where Q is the query vector, K is the key vector, V is the value vector, d_k is the dimension of K, and Q, K, V in a coding block or decoding block come from the input of the layer above. The multi-head self-attention mechanism applies several linear transformations to the input of a coding or decoding block, computes the attention value for each, and then concatenates the results and applies a linear transformation, with the formula:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W_O

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

where W_i^Q is the linear transformation weight for Q, and W_i^K and W_i^V likewise for K and V; head_i is the i-th attention head, Concat is the column-wise concatenation operation, and W_O is the output linear transformation weight.
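The two formulas above can be transcribed in numpy as follows; the head count and dimensions are illustrative, and the weight lists `Wq`, `Wk`, `Wv` and matrix `Wo` are hypothetical stand-ins for the learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(x, Wq, Wk, Wv, Wo):
    """Project the block input x to per-head Q, K, V, attend per
    head, concatenate column-wise, then apply the output weight."""
    heads = [attention(x @ wq, x @ wk, x @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d_model, h, d_k = 5, 16, 4, 4   # sequence length, model dim, heads, head dim
Wq = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wk = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wv = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wo = rng.standard_normal((h * d_k, d_model))
x = rng.standard_normal((n, d_model))
out = multi_head(x, Wq, Wk, Wv, Wo)   # one (n, d_model) output per block
```

Because the h concatenated heads have total width h·d_k = d_model, the output keeps the model dimension and can feed the next sublayer directly.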
Feed-forward neural network layer (Feed Forward): composed of a ReLU activation layer and a fully connected layer, intended to adjust the output dimension, with the formula:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
where W_1 is the weight of the activation layer, b_1 its bias, W_2 the weight of the fully connected layer, b_2 its bias, and x is the input vector.
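A direct numpy transcription of the feed-forward formula; the inner dimension `d_ff` is an assumption, not specified by the patent.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2: ReLU expansion followed
    by a fully connected layer restoring the model dimension."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
d_model, d_ff = 16, 64        # inner width d_ff is illustrative
x = rng.standard_normal((5, d_model))
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))
out = ffn(x, W1, np.zeros(d_ff), W2, np.zeros(d_model))
```

Setting both weight matrices to zero reduces the expression to the bias b_2, a quick way to sanity-check the formula.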
Residual normalization layer (Add & Norm): composed of a residual connection and normalization. The residual connection eases the training of multi-layer networks, since the network only needs to attend to the current residual part, and the normalization operation accelerates network convergence, with the formula:
output=LayerNorm(x+F(x))
where x is the input vector and F(x) is the operation of the layer preceding the residual normalization layer, i.e. the multi-head attention mechanism or the feed-forward neural network.
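The Add & Norm formula transcribes directly; the sketch below uses a per-row LayerNorm without learned gain and bias, which is a simplification of the usual layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and
    unit variance (no learned gain/bias, for illustration)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer):
    """output = LayerNorm(x + F(x)), where F is the preceding
    multi-head attention or feed-forward sublayer."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(2)
x = rng.standard_normal((5, 16))
out = add_and_norm(x, lambda v: 0.5 * v)   # toy sublayer F(x) = 0.5x
```

Because the sublayer output is added back to its input, each block only has to model the residual correction, which is what stabilizes deep stacks of encoder and decoder blocks.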
And (5) training the network models from the step (2) to the step (4) through a data set, and finishing automatic generation of the short video captions through the trained network models.
A system for automatically generating subtitles by using short video comprises a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to carry out event detection on the received video characteristic sequence, inputs the video characteristic sequence into an LSTM network in the DAPs model, connects the characteristics in series, uses a hidden layer vector of the LSTM network as a time characteristic, and then uses a sliding window to scan the whole characteristic sequence to obtain predicted event segments and scores each segment.
The text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model to obtain the predicted text.
Claims (5)
1. A system for automatically generating subtitles by using short video is characterized by comprising a video feature extraction module, an event detection module and a text generation module;
the video feature extraction module extracts features of the video through a 3D convolutional network to obtain a video feature sequence, and sends the obtained video feature sequence to the event detection module;
the event detection module uses a DAPs model to carry out event detection on a received video characteristic sequence, inputs the video characteristic sequence into an LSTM network in the DAPs model, connects the characteristics in series, uses a hidden layer vector of the LSTM network as a time characteristic, and then uses a sliding window to scan the whole characteristic sequence to obtain predicted event segments and scores each segment;
the text generation module uses a Transformer model to generate text: each event segment obtained by the event detection module is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model to obtain the predicted text.
2. A method for automatically generating subtitles by short videos is characterized by comprising the following steps:
step (1), constructing a data set;
constructing a data set: videos with subtitles are randomly sampled from major domestic short-video platforms and uniformly preprocessed; the subtitles of each video are manually checked, descriptions inconsistent with the facts are corrected, and it is confirmed that the subtitles completely and concisely summarize the scene or event shown in the video; the processed subtitles serve as the labels of the videos;
step (2), extracting video characteristics;
extracting the video characteristics by using a 3D convolutional network in a video characteristic extraction module to obtain a video characteristic sequence;
step (3), event detection;
using the DAPs model in the event detection module to perform event detection: on the basis of step (2), inputting the acquired video features into the LSTM network in the DAPs model to connect the features in series, taking the hidden-layer vector of the LSTM network as the temporal feature, scanning the whole feature sequence with a sliding window to obtain predicted event segments, and scoring each segment; training the DAPs model with the score of each segment and the segment matching accuracy as the loss function;
step (4), generating a text;
generating text by using the Transformer model in the text generation module: each event segment obtained in step (3) is processed separately, a visual embedding operation is first applied to the event segment, and the embedded event segment is then input into the Transformer model to obtain the predicted text;
and (5) training the network models from the step (2) to the step (4) through a data set, and finishing automatic generation of the short video captions through the trained network models.
3. The method for automatically generating subtitles by using short video according to claim 2, wherein the step (2) is specifically as follows:
the 3D convolutional network used is a C3D network; C3D is a deep 3-dimensional convolutional network for processing video, where k is the spatial (length and width) size of the 3D convolution kernel, D is the depth of the 3D convolution kernel, W and H are the width and height of each frame, and L is the number of frames in the video; in the spatial dimension, as in a 2D convolution, the kernel traverses the image set with a set stride, and video features are captured through the 3D convolution operation; the C3D network comprises 8 convolutional layers, 5 max-pooling layers, 2 fully connected layers, and a nonlinear output layer; the 3D convolution kernels (Conv1a-Conv5b) are all 3 × 3 × 3 with stride 1 in both space and time; the 5 max-pooling layers are pool1-pool5 in sequence, the pooling kernel of pool1 is 1 × 2 × 2 and the other pooling kernels are 2 × 2 × 2; the 2 fully connected layers (fc6-fc7) each have 4096 output units; and the video is processed through the C3D network to obtain the video feature sequence.
4. The method for automatically generating subtitles by using short video according to claim 3, wherein the step (3) is specifically as follows:
inputting the video feature sequence obtained in step (2) into the LSTM network of the DAPs model to connect the features in series, taking the LSTM hidden-layer output as the temporal feature sequence, scanning the whole feature sequence with a sliding window through an Anchor mechanism to obtain predicted event segments, and scoring each segment; the mapping point of the center of the sliding window onto the original video sequence is called the Anchor; proposals of different scales are generated centered on the Anchor, each proposal is scored according to whether it contains an event, and its matching accuracy against the actual event segment is computed; the loss function is formed from the event score and the matching accuracy, where the former requires that the probability that a proposal contains an event be as high as possible and the latter requires that the predicted segment fit the interval of the real event segment as closely as possible;
and finally obtaining the result of event detection: < start time of event 1, end time of event 1 >, …, < start time of event n, end time of event n >.
5. The method for automatically generating subtitles according to claim 4, wherein the step (4) is specifically as follows:
the text generation model adopts a Transformer-based text generation model; firstly, visual embedding is carried out, a video sequence of an event segment is input into a C3D network to extract video characteristics, then LSTM network sharing time sequence information is input, and the hidden layer output of the LSTM is used as the input of a transform encoder;
the Transformer model is composed of an encoder and a decoder, wherein the input of the encoder is a video characteristic sequence, the input of the decoder is a coded characteristic sequence, the output of the decoder is a predicted text, the encoder is composed of 6 coding blocks, and the decoder is composed of 6 decoding blocks;
specifically, one coding block/decoding block comprises a multi-head self-attention mechanism layer, a residual error normalization layer and a feedforward neural network layer; the network structure of the coding block and the decoding block is the same, and the input content and the output content of the coding block and the decoding block are different;
the structure and function of each layer are as follows:
the multi-head self-attention mechanism layer has the following formula of a single-layer self-attention mechanism:
where Q is the query vector, K the key vector and V the value vector; in a coding block, Q, K and V all come from the input of the previous layer of the coding block or decoding block; the multi-head self-attention mechanism applies several linear transformations to the input of the coding or decoding block, computes the Attention value of each head separately, and then concatenates the results and applies a further linear transformation, as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

where W_i^Q, W_i^K and W_i^V are the linear-transformation weight matrices applied to Q, K and V for the i-th head, head_i is the output of the i-th attention head, Concat is the column-wise concatenation operation, and W^O is the output linear-transformation weight;
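The two formulas above can be transcribed compactly in NumPy (the head count and dimensions below are illustrative assumptions, not values specified by the patent):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """MultiHead(Q,K,V) = Concat(head_1..head_h) W^O, with
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo
```

In self-attention, Q, K and V are all the same sequence (the previous layer's output), which is why the concatenated head dimension can be projected back to the model dimension by W^O.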
feed-forward neural network layer (Feed Forward): it consists of a ReLU activation layer and a fully connected layer, and serves to adjust the output dimension; the formula is as follows:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

where W_1 is the weight of the activation layer, b_1 the bias of the activation layer, W_2 the weight of the fully connected layer, b_2 the bias of the fully connected layer, and x the input vector;
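The FFN formula above has a direct NumPy transcription (shapes are illustrative):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2: a ReLU activation layer
    followed by a fully connected layer that restores the output dimension."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

W1 typically expands the feature dimension and W2 projects it back, so the block's input and output shapes match, as required by the surrounding residual connections.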
residual normalization layer (Add & Norm): it consists of a residual connection followed by layer normalization; the residual connection eases the training of deep multi-layer networks by letting each layer learn only the difference from its input, while the normalization operation accelerates network convergence; the formula is as follows:
output=LayerNorm(x+F(x))
where x is the input vector and F(x) is the operation of the sub-layer preceding the residual normalization layer, i.e. the multi-head attention mechanism or the feed-forward neural network.
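The Add & Norm operation can likewise be sketched in NumPy (an illustrative sketch, not the patented code; LayerNorm's learnable gain and bias are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit
    variance over the last axis (gain/bias parameters omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer):
    """output = LayerNorm(x + F(x)), F being the preceding sub-layer."""
    return layer_norm(x + sublayer(x))
```

Because the residual path passes x through unchanged, the sub-layer F only has to model the correction to its input, which is what makes stacking six such blocks trainable.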
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110442856.4A CN113159034A (en) | 2021-04-23 | 2021-04-23 | Method and system for automatically generating subtitles by using short video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113159034A true CN113159034A (en) | 2021-07-23 |
Family
ID=76869972
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110442856.4A Pending CN113159034A (en) | 2021-04-23 | 2021-04-23 | Method and system for automatically generating subtitles by using short video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113159034A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113423004A (en) * | 2021-08-23 | 2021-09-21 | 杭州一知智能科技有限公司 | Video subtitle generating method and system based on decoupling decoding |
CN113569717A (en) * | 2021-07-26 | 2021-10-29 | 上海明略人工智能(集团)有限公司 | Short video event classification method, system, device and medium based on label semantics |
CN113609259A (en) * | 2021-08-16 | 2021-11-05 | 山东新一代信息产业技术研究院有限公司 | Multi-mode reasoning method and system for videos and natural languages |
CN113784199A (en) * | 2021-09-10 | 2021-12-10 | 中国科学院计算技术研究所 | System and method for generating video description text |
CN113837083A (en) * | 2021-09-24 | 2021-12-24 | 焦点科技股份有限公司 | Video segment segmentation method based on Transformer |
CN114222193A (en) * | 2021-12-03 | 2022-03-22 | 北京影谱科技股份有限公司 | Video subtitle time alignment model training method and system |
CN114782848A (en) * | 2022-03-10 | 2022-07-22 | 沈阳雅译网络技术有限公司 | Picture subtitle generating method applying characteristic pyramid |
CN116310984A (en) * | 2023-03-13 | 2023-06-23 | 中国科学院微电子研究所 | Multi-mode video subtitle generating method based on Token sampling |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||