CN113095435A - Video description generation method, device, equipment and computer readable storage medium - Google Patents

Video description generation method, device, equipment and computer readable storage medium

Info

Publication number
CN113095435A
Authority
CN
China
Prior art keywords
features
auditory
visual
coding
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110470037.0A
Other languages
Chinese (zh)
Other versions
CN113095435B (en)
Inventor
罗剑
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110470037.0A priority Critical patent/CN113095435B/en
Publication of CN113095435A publication Critical patent/CN113095435A/en
Application granted granted Critical
Publication of CN113095435B publication Critical patent/CN113095435B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application belongs to the technical field of intelligent decision making and provides a video description generation method, apparatus, device and computer-readable storage medium. The method includes the following steps: acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described; encoding the visual features and the auditory features respectively through a multi-modal attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features; processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features; decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoded word from the keywords according to these posterior probabilities; and generating the video description of the video to be described from the decoded words. The method and apparatus can improve the accuracy of video description.

Description

Video description generation method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of intelligent decision making technologies, and in particular, to a video description generation method, apparatus, device, and computer-readable storage medium.
Background
Video description is a technique for automatically generating a textual description of the content of a video. With the continuous development of the mobile internet, short videos have gradually become the most popular form of content dissemination, and automatically generating descriptions for short videos therefore has important application value: it provides a reference for users, helps optimize short-video recommendation algorithms and search engines, and improves the efficiency of short-video content review. Unlike a standalone image description or a standalone audio description, a video contains complex spatio-temporal relationships between objects, for example, "footsteps come from a wooden ladder as two people slowly walk closer"; how to automatically generate a video description is therefore a challenge in the field of computer vision.
In the related art, a classical attention-based encoder-decoder algorithm is generally used to generate a video description. However, this algorithm only exploits the visual features of the video; because the features come from a single modality, the quality of the generated video description is low and the video content cannot be described accurately.
Disclosure of Invention
The main purpose of the present application is to provide a video description generation method, apparatus, device and computer-readable storage medium, so as to solve the technical problem that the video descriptions generated by existing automatic video description generation methods are not sufficiently accurate.
In a first aspect, the present application provides a video description generation method, including:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described;
respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features;
processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features;
decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword;
and generating the video description of the video to be described according to the decoding words.
In a second aspect, the present application further provides a video description generation apparatus, including:
the extraction module is used for acquiring a video to be described and extracting visual features, auditory features and word features of the video to be described;
the coding module is used for coding the visual features and the auditory features respectively through a multi-mode attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features;
a target assistant feature generation module, configured to process the visual coding features and the auditory coding features through an assistant model of the video description generation system to generate target assistant features;
the decoding module is used for decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoding word from each keyword according to the posterior probability of each keyword;
and the video description generation module is used for generating the video description of the video to be described according to the decoding words.
In a third aspect, the present application further provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the video description generation method as described above.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the video description generation method as described above.
The application discloses a video description generation method, a video description generation apparatus, a computer device and a computer-readable storage medium. The video description generation method first acquires a video to be described and extracts visual features, auditory features and word features of the video to be described; it then encodes the visual features and the auditory features through the multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features, and processes the visual coding features and the auditory coding features through the auxiliary model of the video description generation system to generate target auxiliary features; it further decodes the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, selects decoded words from the keywords according to these posterior probabilities, and finally generates the video description of the video to be described from the decoded words. The video description generation system fuses visual and auditory features through the multi-modal attention mechanism main body model and adds auxiliary features through the auxiliary model, which provides rich features for video description generation and lays the foundation for accurately selecting words that match the video's scenes and events, thereby improving the accuracy of the video description.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a video description generation method according to an embodiment of the present application;
fig. 2 is a schematic architecture diagram of a video description generation system according to an embodiment of the present application;
fig. 3 is a schematic architecture diagram of a scene classification auxiliary model according to an embodiment of the present application;
fig. 4 is a schematic architecture diagram of a keyword evaluation auxiliary model according to an embodiment of the present application;
fig. 5 is a schematic block diagram of a video description generation apparatus provided in an embodiment of the present application;
fig. 6 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiments of the present application provide a video description generation method, apparatus, device and computer-readable storage medium. The video description generation method is mainly applied to a video description generation device, which can be any device with data processing capability, such as a mobile terminal, a personal computer (PC), a portable computer or a server. The video description generation device carries a video description generation system, and the video description generation system may be implemented as part of a multimedia application.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a video description generation method according to an embodiment of the present application.
As shown in fig. 1, the video description generation method includes steps S101 to S105.
Step S101, obtaining a video to be described, and extracting visual features, auditory features and word features of the video to be described.
As shown in fig. 2, fig. 2 is a schematic structural diagram of the video description generation system. The video description generation system is a video description generation model consisting mainly of three parts: a main model and two auxiliary models. The main model is an encoder-decoder model based on a multi-modal attention mechanism (referred to as the multi-modal attention mechanism main body model, shown in the dashed box of fig. 2), and the two auxiliary models are a scene classification auxiliary model and a keyword evaluation auxiliary model. The multi-modal attention mechanism main body model introduces a multi-modal attention mechanism into the traditional self-attention-based encoder-decoder algorithm and can jointly extract and fuse the visual and auditory features of the video to be described. In addition, a scene classification auxiliary model driven by visual features and a keyword evaluation auxiliary model driven by auditory features are introduced into the video description generation system to assist the fused audio-visual features and the word features, so that words conforming to the current scene and events can be selected accurately, thereby improving the description accuracy for the video to be described.
As shown in the dashed box of FIG. 2, the dashed box of FIG. 2 is an architectural diagram of the multi-modal attention mechanism main body model, which includes a visual feature encoder (denoted VE_θv), an auditory feature encoder (denoted AE_θa) and a text decoder (denoted D). The visual feature encoder and the auditory feature encoder fuse and extract the visual features and the auditory features, and the text decoder decodes the decoded words based on the visual features, the auditory features and the word features.
First, the video to be described is acquired, and the visual features (denoted φ), the auditory features (denoted ψ) and the word features (denoted w_{n-1}) of the video to be described are extracted.
In one embodiment, the visual features are obtained by extracting features from the visual information of the video to be described with an Inflated 3D convolutional network (I3D ConvNet, I3D) pre-trained on the action-recognition dataset Kinetics-600. It will be appreciated that the visual features take the form of a Tv × dv feature sequence φ = (φ_1, ..., φ_Tv), where Tv denotes the length of the input sequence, dv denotes the feature dimension, and φ_t denotes the visual feature at time point t.
In one embodiment, the auditory features are obtained by extracting features from the auditory information of the video to be described with a VGGish model pre-trained on Google's AudioSet dataset. When the VGGish model extracts features from the auditory information of the video to be described, it first resamples the audio of the video to mono; illustratively, the audio is resampled to 16 kHz mono. A short-time Fourier transform with a 25 ms Hanning window and a 10 ms frame shift is then applied to the mono audio to obtain a spectrogram, the spectrogram is mapped onto a 64-band mel filter bank and the logarithm is taken to obtain a stable log-mel spectrum, and the features are finally framed into segments of 0.96 s, each segment containing 96 frames of 10 ms with 64 mel bands each. Similar to the visual feature extraction, the VGGish model outputs a Ta × da feature sequence ψ = (ψ_1, ..., ψ_Ta), where Ta denotes the length of the input sequence, i.e. the audio duration / 0.96, and da denotes the feature dimension, which may be 128.
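To make the audio preprocessing above concrete, the following is a minimal sketch of VGGish-style log-mel extraction, assuming librosa and numpy are available; the parameter values (16 kHz mono, 25 ms Hann window, 10 ms hop, 64 mel bands, 0.96 s segments of 96 frames) follow the description above, while the function and variable names are illustrative and not taken from the patent.

```python
import numpy as np
import librosa

def log_mel_patches(audio_path, sr=16000, n_mels=64,
                    win_ms=25, hop_ms=10, patch_frames=96):
    """Resample to 16 kHz mono, compute a 64-band log-mel spectrogram
    (25 ms Hann window, 10 ms hop) and split it into 0.96 s segments."""
    y, _ = librosa.load(audio_path, sr=sr, mono=True)    # mono, 16 kHz
    win = int(sr * win_ms / 1000)                        # 400 samples
    hop = int(sr * hop_ms / 1000)                        # 160 samples
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop,
        n_mels=n_mels, window="hann")
    log_mel = np.log(mel + 1e-6).T                       # (frames, 64)
    # Group frames into non-overlapping 0.96 s segments (96 x 64 each).
    n_patches = log_mel.shape[0] // patch_frames
    return log_mel[: n_patches * patch_frames].reshape(
        n_patches, patch_frames, n_mels)
```

Each 96 × 64 segment would then be fed to the pre-trained VGGish network to produce one da-dimensional (e.g. 128-dimensional) auditory feature ψ_t, yielding the Ta × da sequence used by the encoder.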
In one embodiment, the word features of the video to be described are extracted for the words generated at previous time steps, w_{n-1} = (w_1, ..., w_{n-1}). A lookup table is obtained by pre-training fastText on the public Common Crawl dataset, so that each previously generated word in w_{n-1} = (w_1, ..., w_{n-1}) can be represented by a d_w-dimensional vector.
Step S102: encode the visual features and the auditory features respectively through the multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features.
The extracted visual features φ and auditory features ψ are encoded by the multi-modal attention mechanism main body model to obtain the visual coding features v = VE_θv(φ) and the auditory coding features a = AE_θa(ψ).
In an embodiment, the multi-modal attention mechanism main body model includes a visual feature encoder and an auditory feature encoder, and encoding the visual features and the auditory features respectively through the multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features specifically includes: performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features; performing multi-modal attention calculation on the visual multi-head attention features and the auditory multi-head attention features through the visual feature encoder to obtain visual features fused with auditory attention, and performing multi-modal attention calculation on the auditory multi-head attention features and the visual multi-head attention features through the auditory feature encoder to obtain auditory features fused with visual attention; and sequentially performing first layer regularization, feedforward calculation and second layer regularization on the visual features fused with auditory attention through the visual feature encoder to obtain the visual coding features output by the visual feature encoder, and sequentially performing first layer regularization, feedforward calculation and second layer regularization on the auditory features fused with visual attention through the auditory feature encoder to obtain the auditory coding features output by the auditory feature encoder.
With continued reference to fig. 2, the left dashed-box portion of fig. 2 is a schematic structural diagram of the visual feature encoder and the auditory feature encoder of the multi-modal attention mechanism main body model. The encoding layers of the visual feature encoder and the auditory feature encoder each include five sublayers: the first layer is a multi-head attention (Multi-head Attention) layer, the second layer is a multi-modal attention (Multi-modal Attention) layer, where multi-modal attention is a variant of multi-head attention, the third layer is a first layer regularization (Layer Normalization) layer, the fourth layer is a feedforward neural network, and the fifth layer is a second layer regularization layer.
The extracted visual features φ are encoded by the visual feature encoder VE_θv to obtain the visual coding features v = VE_θv(φ), and the extracted auditory features ψ are encoded by the auditory feature encoder AE_θa to obtain the auditory coding features a = AE_θa(ψ).
Specifically, the visual features φ and the auditory features ψ are input into the visual feature encoder and the auditory feature encoder respectively. In the corresponding encoding layer, φ and ψ are first fed into the corresponding multi-head attention layer for multi-head attention calculation, giving the outputs of the multi-head attention layers, namely the visual multi-head attention features V_self and the auditory multi-head attention features A_self (which serve as the queries Q of the next layer). The outputs of the multi-head attention layers are then fed into the corresponding multi-modal attention layers for multi-modal attention calculation, in which the other modality's output serves as the keys K and values V, giving the visual features fused with auditory attention, V_mm = MultiHeadAttention(V_self, A_self, A_self), and the auditory features fused with visual attention, A_mm = MultiHeadAttention(A_self, V_self, V_self). The multi-modal attention outputs are then fed into the corresponding first layer regularization layer; the outputs of the first layer regularization layer are fed into the corresponding feedforward neural network layer for feedforward calculation; and the outputs of the feedforward neural network layer are finally fed into the corresponding second layer regularization layer. The complete encoder is stacked N times and finally outputs the visual coding features v = VE_θv(φ) and the auditory coding features a = AE_θa(ψ), where VE and AE denote the visual feature encoder and the auditory feature encoder, respectively, and θ_v, θ_a denote their parameter spaces. It should be noted that, to alleviate problems such as vanishing gradients, residual (shortcut) connections are added between the input layer of the corresponding encoding layer and the first layer regularization layer, and between the first layer regularization layer and the second layer regularization layer.
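The following PyTorch sketch illustrates one such encoding layer for both modalities (self-attention, cross-modal attention, two layer regularizations, a feedforward network, and residual connections). It is a minimal illustration under the assumption that both modalities have already been projected to a common model dimension (512 here); the class and parameter names are illustrative, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class CrossModalEncoderLayer(nn.Module):
    """Encodes one modality: self-attention, then multi-modal attention
    against the other modality's self-attended features, then
    LayerNorm -> feedforward -> LayerNorm with residual connections."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def self_only(self, x):
        out, _ = self.self_attn(x, x, x)      # V_self or A_self
        return out

    def fuse(self, x, x_self, other_self):
        # e.g. V_mm = MultiHeadAttention(V_self, A_self, A_self)
        mm, _ = self.cross_attn(x_self, other_self, other_self)
        h = self.norm1(x + mm)                # residual + first layer regularization
        return self.norm2(h + self.ffn(h))    # feedforward + second layer regularization


def encode_pair(vis_layer, aud_layer, phi, psi):
    """One shared encoding step for both modalities (stack N times)."""
    v_self, a_self = vis_layer.self_only(phi), aud_layer.self_only(psi)
    v = vis_layer.fuse(phi, v_self, a_self)   # visual features + auditory attention
    a = aud_layer.fuse(psi, a_self, v_self)   # auditory features + visual attention
    return v, a


# Usage sketch (batch size, sequence lengths and dimensions are assumptions).
vis_layer, aud_layer = CrossModalEncoderLayer(), CrossModalEncoderLayer()
v, a = encode_pair(vis_layer, aud_layer,
                   torch.randn(1, 32, 512), torch.randn(1, 20, 512))
```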
Among the sublayers of the encoding layers of the visual feature encoder and the auditory feature encoder, multi-head attention is the most important transformation mapping. Multi-head attention is built from scaled dot-product attention (Scaled Dot-Product Attention), whose formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
where 1/√d_k is the scaling factor, and Q, K, V are the query, key and value sequences, respectively.
In multi-head attention, the queries (Q), keys (K) and values (V) first pass through linear transformations W_i^Q, W_i^K, W_i^V and are then fed into scaled dot-product attention. This step is repeated H times (H is the number of heads in multi-head attention), using different linear transformation parameter matrices each time; the results of the H scaled dot-product attentions are concatenated and passed through one more linear transformation W_out, and the result is taken as the output of multi-head attention. The specific formulas are as follows:
head_i(Q, K, V) = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHeadAttention(Q, K, V) = [head_1(Q, K, V), ..., head_H(Q, K, V)] W_out
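The following is a minimal PyTorch implementation of the two formulas above, written from scratch rather than taken from the patent; the dimension values and the usage example at the end are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return F.softmax(scores, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    """H heads, each with its own linear maps W_i^Q, W_i^K, W_i^V,
    concatenated and projected by W_out."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_out = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V):
        B = Q.size(0)
        def split(x, w):  # (B, T, d_model) -> (B, H, T, d_k)
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        heads = scaled_dot_product_attention(split(Q, self.w_q),
                                             split(K, self.w_k),
                                             split(V, self.w_v))
        concat = heads.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_out(concat)

# Example: the multi-modal attention layer is this module called with the
# other modality's features as K and V, e.g. V_mm = mha(V_self, A_self, A_self).
mha = MultiHeadAttention()
v_mm = mha(torch.randn(1, 32, 512), torch.randn(1, 20, 512), torch.randn(1, 20, 512))
```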
step S103, processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features.
The visual coding features and the auditory coding features may then be processed through an auxiliary model of the video description generation system to generate target auxiliary features.
In an embodiment, the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model; the processing the visual coding features and the auditory coding features through the auxiliary model of the video description generation system to generate target auxiliary features specifically includes: inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model; and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
With continued reference to FIG. 2, the auxiliary models of the video description generation system include a scene classification auxiliary model and a keyword evaluation auxiliary model. For the visual coding features v output by the visual feature encoder, the video description generation system inputs v into the scene classification auxiliary model for processing to obtain the first auxiliary features m_v output by the scene classification auxiliary model. For the auditory coding features a output by the auditory feature encoder, the video description generation system inputs a into the keyword evaluation auxiliary model for processing to obtain the second auxiliary features m_a output by the keyword evaluation auxiliary model. The target auxiliary features m are then generated from the first auxiliary features and the second auxiliary features.
In an embodiment, the inputting the visual coding feature into the scene classification auxiliary model for processing to obtain a first auxiliary feature output by the scene classification auxiliary model specifically includes: inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features; carrying out nonlinear mapping on the vision coding features after linear transformation through a linear rectification function to obtain vision coding feature mapping; performing linear transformation on the visual coding feature mapping; and performing softmax logistic regression calculation on the vision coding feature mapping after the linear transformation to obtain a first auxiliary feature output by the scene classification auxiliary model.
As shown in fig. 3, fig. 3 is an architecture diagram of the scene classification auxiliary model, which includes four sublayers: the first is a linear transformation layer (Linear), the second is a linear rectification function (ReLU) activation layer, the third is another linear transformation layer, and the fourth is a Softmax logistic regression layer.
After the visual coding features v output by the visual feature encoder are input into the scene classification auxiliary model, the scene classification auxiliary model first passes v through the first linear transformation layer to obtain the output of that layer; the output of the linear transformation layer is then fed into the linear rectification function for nonlinear mapping to obtain the visual coding feature mapping; the visual coding feature mapping is fed into the second linear transformation layer for another linear transformation; finally, the output of the second linear transformation layer is fed into the Softmax logistic regression layer for Softmax logistic regression calculation, giving the probability scores m_v of the Ka preset scenes output by the Softmax function as the final output of the scene classification auxiliary model, where m_v is a Ka-dimensional probability vector.
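A minimal sketch of this Linear-ReLU-Linear-Softmax scene head is given below, assuming PyTorch; the temporal mean-pooling of the visual coding features and the concrete sizes (512-dimensional features, 365 scene classes) are illustrative assumptions, since the patent does not fix these values.

```python
import torch
import torch.nn as nn

class SceneClassificationHead(nn.Module):
    """Linear -> ReLU -> Linear -> Softmax over Ka preset scene classes."""
    def __init__(self, d_model=512, d_hidden=256, n_scenes=365):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, n_scenes))

    def forward(self, v):
        # v: (B, Tv, d_model) visual coding features; pool over time, then
        # output scene probability scores m_v of shape (B, Ka).
        return torch.softmax(self.mlp(v.mean(dim=1)), dim=-1)

m_v = SceneClassificationHead()(torch.randn(1, 32, 512))  # (1, 365)
```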
in an embodiment, the inputting the auditory coding features into the keyword assessment assistant model for processing to obtain second assistant features output by the keyword assessment assistant model specifically includes: inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features; carrying out nonlinear mapping on the hearing coding features after linear transformation through a linear rectification function to obtain hearing coding feature mapping; performing a linear transformation on the auditory coding feature map;
calculating the auditory coding feature mapping after linear transformation through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary; performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword; ranking the scores of the keywords, and selecting a preset number of keywords according to the order of the scores from large to small so as to search the indexes of the selected keywords in a dictionary; and combining the searched indexes to obtain a second auxiliary characteristic output by the keyword evaluation auxiliary module.
As shown in fig. 4, fig. 4 is a schematic diagram of the keyword evaluation auxiliary model, which includes six sublayers: the first layer is a first linear transformation layer, the second layer is an activation function (a linear rectification function), the third layer is a second linear transformation layer, the fourth layer is a Sigmoid function, the fifth layer is a max pooling layer, and the sixth layer is a sorting & selection layer.
After the auditory coding features a output by the auditory feature encoder are input into the keyword evaluation auxiliary model, the keyword evaluation auxiliary model first passes a through the first linear transformation layer to obtain the output of that layer; the output of the linear transformation layer is then fed into the activation function, i.e. the linearly transformed auditory coding features are nonlinearly mapped by the linear rectification function to obtain the auditory coding feature mapping; the auditory coding feature mapping is then fed into the second linear transformation layer for another linear transformation; the output of the second linear transformation layer is fed into the Sigmoid function, which outputs the posterior probability Z of each keyword in the dictionary, with P(Z_C,t | a) denoting the probability of keyword C at time step t. The posterior probabilities Z output by the Sigmoid function are further fed into the max pooling layer for keyword evaluation, giving the keyword scores P(Z_C | a) output by the max pooling layer, where P(Z_C | a) = max_t P(Z_C,t | a). Finally, the keyword scores P(Z_C | a) output by the max pooling layer are fed into the sorting & selection layer for ranking, and the dictionary indexes of the top K keywords in descending order of score are taken to form the output m_a of the keyword evaluation auxiliary model, where K denotes a preset number that can be set flexibly according to the actual situation.
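A minimal PyTorch sketch of this keyword head follows; the vocabulary size, hidden width and K = 10 are illustrative assumptions, and only the Linear-ReLU-Linear-Sigmoid, max-pool-over-time and top-K structure is taken from the description above.

```python
import torch
import torch.nn as nn

class KeywordEvaluationHead(nn.Module):
    """Linear -> ReLU -> Linear -> Sigmoid -> max-pool over time -> top-K."""
    def __init__(self, d_model=512, d_hidden=256, vocab_size=10000, top_k=10):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, vocab_size))
        self.top_k = top_k

    def forward(self, a):
        # a: (B, Ta, d_model) auditory coding features.
        z = torch.sigmoid(self.mlp(a))      # P(Z_{C,t} | a) per keyword and time step
        scores = z.max(dim=1).values        # P(Z_C | a) = max_t P(Z_{C,t} | a)
        # Dictionary indexes of the top-K keywords in descending order of score.
        return scores.topk(self.top_k, dim=-1).indices   # m_a: (B, K)

m_a = KeywordEvaluationHead()(torch.randn(1, 20, 512))   # (1, 10)
```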
In an embodiment, the generating a target assist feature from the first assist feature and the second assist feature comprises: carrying out keyword embedding processing and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with reduced feature dimensions; and splicing the second auxiliary features with the reduced feature dimensions with the first auxiliary features to obtain target auxiliary features.
With continued reference to FIG. 2, the first auxiliary features m_v output by the scene classification auxiliary model and the second auxiliary features m_a output by the keyword evaluation auxiliary model are spliced together. Before splicing, the second auxiliary features m_a are subjected to keyword embedding processing and a linear transformation in sequence to reduce their feature dimension, and the dimension-reduced second auxiliary features are then spliced with the first auxiliary features m_v to obtain the target auxiliary features m.
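The sketch below shows one plausible reading of this fusion step in PyTorch: the keyword indexes m_a are embedded, linearly reduced, flattened and concatenated with the scene scores m_v. The exact tensor layout of m is not fully specified in the text, so the shapes and sizes here are assumptions.

```python
import torch
import torch.nn as nn

class TargetAuxiliaryFusion(nn.Module):
    """Embed the top-K keyword indexes m_a, reduce their dimension with a
    linear layer, then splice them with the scene scores m_v to form m."""
    def __init__(self, vocab_size=10000, d_embed=300, d_reduced=64):
        super().__init__()
        self.keyword_embedding = nn.Embedding(vocab_size, d_embed)
        self.reduce = nn.Linear(d_embed, d_reduced)

    def forward(self, m_v, m_a):
        # m_v: (B, Ka) scene probability scores; m_a: (B, K) keyword indexes.
        kw = self.reduce(self.keyword_embedding(m_a))   # (B, K, d_reduced)
        kw = kw.flatten(start_dim=1)                    # (B, K * d_reduced)
        return torch.cat([m_v, kw], dim=-1)             # target auxiliary features m

fusion = TargetAuxiliaryFusion()
m = fusion(torch.softmax(torch.randn(1, 365), dim=-1),
           torch.randint(0, 10000, (1, 10)))            # m: (1, 365 + 10 * 64)
```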
Step S104: decode the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and select a decoded word from the keywords according to these posterior probabilities.
In the decoding stage, when decoding the n-th word, the visual coding features, the auditory coding features, the target auxiliary features and the word features are decoded by the multi-modal attention mechanism main body model to obtain the posterior probability of the n-th word finally output by the main body model, P(w_n | v, a, m, w_{n-1}) = D_θd(v, a, m, w_{n-1}), where D is the text decoder and θ_d denotes its parameter space.
In an embodiment, the multi-modal attention mechanism principal model includes a text decoder, and the multi-modal attention mechanism principal model decodes the visual coding features, the auditory coding features, the target assistant features, and the word features to obtain posterior probabilities of the keywords, specifically: sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features; performing multi-mode attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-mode attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention; bridging the word features fused with the visual attention and the word features fused with the auditory attention to obtain bridging word features; performing layer regularization on the bridge word features, and performing multi-head attention calculation on the bridge word features after the layer regularization and the target auxiliary features to obtain word features fused with the target auxiliary features; sequentially carrying out first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder; and sequentially carrying out linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
As shown in the dashed-box portion on the right of fig. 2, the right dashed box of fig. 2 is a structural diagram of the text decoder, which includes nine sublayers: the first layer is a first multi-head attention layer; the second layer is a first layer regularization layer; the third layer consists of two different multi-modal attention layers, MultiHeadAttention(W_self, v, v) and MultiHeadAttention(W_self, a, a); the fourth layer is a bridging layer; the fifth layer is a second layer regularization layer; the sixth layer is a second multi-head attention layer, MultiHeadAttention(W_norm, m, m); the seventh layer is a third layer regularization layer; the eighth layer is a feedforward neural network layer; and the ninth layer is a fourth layer regularization layer.
For the word features, the video description generation system inputs them into the text decoder. In the decoding layer of the text decoder, the word features are first fed into the first multi-head attention layer for multi-head attention calculation; the output of the first multi-head attention layer is then fed into the first layer regularization layer, giving the output W_self of the first layer regularization layer, i.e. the word-layer regularization features. W_self is then fed into the two different multi-modal attention layers, where it is fused with the visual coding features and the auditory coding features respectively by multi-modal attention calculation: performing multi-modal attention between W_self and the visual coding features gives the word features fused with visual attention, and performing multi-modal attention between W_self and the auditory coding features gives the word features fused with auditory attention. The outputs of the two multi-modal attention layers are then fed into the bridging layer to be bridged (converting the shape from 2d_w × (n-1) back to d_w × (n-1)), giving the bridged word features. The output of the bridging layer is fed into the second layer regularization layer, giving the output W_norm. W_norm is further fed into the second multi-head attention layer, MultiHeadAttention(W_norm, m, m), which fuses the word features with the target auxiliary features, giving the word features fused with the target auxiliary features. This output is fed into the third layer regularization layer, then into the corresponding feedforward neural network layer for feedforward calculation, and finally into the fourth layer regularization layer; the output of the fourth layer regularization layer is the output of the text decoder and serves as the output of the multi-modal attention mechanism main body model. It should be noted that residual (shortcut) connections are added between the input layer of the decoding layer and the first layer regularization layer, between the first and second layer regularization layers, between the second and third layer regularization layers, and between the third and fourth layer regularization layers.
Continuing to refer to fig. 2, the output of the multi-modal attention mechanism main body model is passed through a linear transformation, and the result is passed through a Softmax logistic regression calculation, finally giving the output of the video description generation system, namely the posterior probability P(w_n | v, a, m, w_{n-1}) of the n-th keyword. It can be understood that the higher the posterior probability, the better the corresponding keyword matches the content of the video to be described, and the keyword with the highest posterior probability is determined as the decoded word.
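The following is a condensed PyTorch sketch of one decoder layer plus the final vocabulary projection. The bridging layer is realized as concatenation followed by a linear map (reducing 2·d_w to d_w as described above), the causal mask over the word self-attention is omitted for brevity, and the target auxiliary features m are assumed to be presented as a short sequence of d_model-dimensional vectors; all names and sizes are illustrative, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Condensed text-decoder layer: word self-attention, two multi-modal
    attentions (against v and a), a bridging layer, attention over the
    target auxiliary features m, a feedforward network, and a final
    vocabulary projection with Softmax."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, vocab_size=10000):
        super().__init__()
        make_attn = lambda: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn, self.vis_attn, self.aud_attn, self.aux_attn = (
            make_attn(), make_attn(), make_attn(), make_attn())
        self.norm1, self.norm2, self.norm3, self.norm4 = (
            nn.LayerNorm(d_model) for _ in range(4))
        self.bridge = nn.Linear(2 * d_model, d_model)     # 2*d_w -> d_w
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, w, v, a, m):
        # w: word features (B, n-1, d); v, a: visual/auditory coding features;
        # m: target auxiliary features as a sequence (B, Lm, d).
        w_self = self.norm1(w + self.self_attn(w, w, w)[0])
        w_v = self.vis_attn(w_self, v, v)[0]    # words fused with visual attention
        w_a = self.aud_attn(w_self, a, a)[0]    # words fused with auditory attention
        w_norm = self.norm2(w_self + self.bridge(torch.cat([w_v, w_a], dim=-1)))
        fused = self.norm3(w_norm + self.aux_attn(w_norm, m, m)[0])
        out = self.norm4(fused + self.ffn(fused))
        # Posterior probability over the vocabulary for the next word.
        return torch.softmax(self.out_proj(out[:, -1]), dim=-1)

dec = DecoderLayerSketch()
p = dec(torch.randn(1, 5, 512), torch.randn(1, 32, 512),
        torch.randn(1, 20, 512), torch.randn(1, 11, 512))   # (1, vocab_size)
```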
Step S105: generate the video description of the video to be described from the decoded words.
Since the video description is a natural-language sentence formed by the sequence of decoded words, steps S101 to S104 are repeated at each decoding step, the decoded words are generated one by one to form the sequence of decoded words, and the video description of the video to be described is generated from this sequence.
In summary, based on the visual features φ, the auditory features ψ and the word features w_{n-1} = (w_1, ..., w_{n-1}), the current word w_n is generated, so that a complete word sequence (w_1, ..., w_n) is produced for describing the content of the video to be described.
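As an illustration of this word-by-word generation, the sketch below runs a greedy decoding loop around a decoder with the interface shown earlier; the begin/end token ids, the word-embedding module, the stand-in decoder and the maximum length are assumptions for the example.

```python
import torch
import torch.nn as nn

def greedy_decode(decoder, v, a, m, word_embed, bos_id=1, eos_id=2, max_len=20):
    """Repeatedly pick the keyword with the highest posterior probability
    P(w_n | v, a, m, w_{n-1}) until an end token or max_len is reached."""
    words = [bos_id]
    for _ in range(max_len):
        w = word_embed(torch.tensor([words]))      # (1, n-1, d_w) word features
        probs = decoder(w, v, a, m)                # posterior over the vocabulary
        next_word = int(probs.argmax(dim=-1))      # decoded word for this step
        words.append(next_word)
        if next_word == eos_id:
            break
    return words[1:]                               # decoded word sequence

# Usage with a stand-in decoder (any callable with this interface works,
# e.g. the DecoderLayerSketch shown earlier).
dummy_decoder = lambda w, v, a, m: torch.softmax(torch.randn(1, 10000), dim=-1)
word_embed = nn.Embedding(10000, 512)
sentence = greedy_decode(dummy_decoder, torch.randn(1, 32, 512),
                         torch.randn(1, 20, 512), torch.randn(1, 11, 512),
                         word_embed)
```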
The video description generation method provided by this embodiment first acquires a video to be described and extracts visual features, auditory features and word features of the video to be described; it then encodes the visual features and the auditory features through the multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features, and processes the visual coding features and the auditory coding features through the auxiliary model of the video description generation system to generate target auxiliary features; it further decodes the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, selects decoded words from the keywords according to these posterior probabilities, and finally generates the video description of the video to be described from the decoded words. The video description generation system fuses visual and auditory features through the multi-modal attention mechanism main body model and adds auxiliary features through the auxiliary model, which provides rich features for video description generation and lays the foundation for accurately selecting words that match the video's scenes and events, thereby improving the accuracy of the video description.
Referring to fig. 5, fig. 5 is a schematic block diagram of a video description generating apparatus according to an embodiment of the present application.
As shown in fig. 5, the video description generating apparatus 400 includes: an extraction module 401, an encoding module 402, a target assistant feature generation module 403, a decoding module 404, and a video description generation module 405.
The extraction module 401 is configured to acquire a video to be described, and extract visual features, auditory features, and word features of the video to be described;
the encoding module 402 is configured to encode the visual features and the auditory features through a multi-modal attention mechanism main body model of the video description generation system, so as to obtain visual encoding features and auditory encoding features;
a target assistant feature generation module 403, configured to process the visual coding features and the auditory coding features through an assistant model of the video description generation system to generate target assistant features;
a decoding module 404, configured to decode the visual coding features, the auditory coding features, the target auxiliary features, and the word features through the multi-modal attention mechanism main body model to obtain posterior probabilities of the keywords, and select a decoded word from the keywords according to the posterior probabilities of the keywords;
a video description generating module 405, configured to generate a video description of the video to be described according to the decoded word.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and each module and unit described above may refer to the corresponding processes in the foregoing video description generation method embodiment, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram illustrating a structure of a computer device according to an embodiment of the present disclosure. The computer device may be a Personal Computer (PC), a server, or the like having a data processing function.
As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the video description generation methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for running a computer program in the non-volatile storage medium, which when executed by the processor causes the processor to perform any of the video description generation methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described; respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features; processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features; decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword; and generating the video description of the video to be described according to the decoding words.
In some embodiments, the multi-modal attention mechanism principal model includes a visual feature encoder and an auditory feature encoder, and the processor implements the encoding of the visual features and the auditory features by the multi-modal attention mechanism principal model of the video description generation system, respectively, to obtain visual coding features and auditory coding features, including:
performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features;
performing multi-modal attention calculation on the visual multi-head attention feature and the auditory multi-head attention feature through the visual feature encoder to obtain a visual feature fused with auditory attention, and performing multi-modal attention calculation on the auditory multi-head attention feature and the visual multi-head attention feature through the auditory feature encoder to obtain an auditory feature fused with visual attention;
sequentially performing first layer regularization, feedforward calculation and second layer regularization on the visual features fused with auditory attention through the visual feature encoder to obtain the visual coding features output by the visual feature encoder, and sequentially performing first layer regularization, feedforward calculation and second layer regularization on the auditory features fused with visual attention through the auditory feature encoder to obtain the auditory coding features output by the auditory feature encoder.
In some embodiments, the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model, and the processor implements the auxiliary models by the video description generation system to process the visual coding features and the auditory coding features to generate target auxiliary features, including:
inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model;
and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
In some embodiments, the inputting the visual coding features into the scene classification assistant model for processing by the processor to obtain the first assistant features output by the scene classification assistant model includes:
inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features;
carrying out nonlinear mapping on the vision coding features after linear transformation through a linear rectification function to obtain vision coding feature mapping;
performing linear transformation on the visual coding feature mapping;
and performing softmax logistic regression calculation on the vision coding feature mapping after the linear transformation to obtain a first auxiliary feature output by the scene classification auxiliary model.
In some embodiments, the processor implements the inputting of the auditory coding features into the keyword assessment assistant model for processing to obtain second assistant features output by the keyword assessment assistant model, and the method includes:
inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features;
carrying out nonlinear mapping on the hearing coding features after linear transformation through a linear rectification function to obtain hearing coding feature mapping;
performing a linear transformation on the auditory coding feature map;
calculating the auditory coding feature mapping after linear transformation through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary;
performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword;
ranking the scores of the keywords, and selecting a preset number of keywords according to the order of the scores from large to small so as to search the indexes of the selected keywords in a dictionary;
and combining the searched indexes to obtain the second auxiliary features output by the keyword evaluation auxiliary model.
In some embodiments, the processor implements the generating a target assist feature from the first assist feature and the second assist feature, including:
carrying out keyword embedding processing and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with reduced feature dimensions;
and splicing the second auxiliary features with the reduced feature dimensions with the first auxiliary features to obtain target auxiliary features.
In some embodiments, the multi-modal attention mechanism principal model comprises a text decoder, and the processor implements the decoding of the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism principal model to obtain the posterior probability of each keyword, including:
sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features;
performing multi-modal attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-modal attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention;
bridging the word features fused with the visual attention and the word features fused with the auditory attention to obtain bridging word features;
performing layer regularization on the bridging word features, and performing multi-head attention calculation on the layer-regularized bridging word features and the target auxiliary features to obtain word features fused with the target auxiliary features;
sequentially carrying out first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder;
and sequentially carrying out linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
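The decoder flow above can be sketched as a single decoder layer; the bridging-by-concatenation, the residual connection around the feedforward block and the projection of the target auxiliary features to the model dimension are assumptions made for this example.

```python
import torch
import torch.nn as nn

class TextDecoderLayer(nn.Module):
    """Self-attention on word features, separate multi-modal attention into the
    visual and auditory coding features, bridging of the two results, attention
    over the target auxiliary features, a feedforward block, and the final
    linear + softmax projection to keyword posteriors."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, vocab_size=10000):
        super().__init__()
        def attn():
            return nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn, self.vis_attn, self.aud_attn, self.aux_attn = attn(), attn(), attn(), attn()
        self.norm_words = nn.LayerNorm(d_model)
        self.bridge = nn.Linear(2 * d_model, d_model)   # merge the bridged features (assumed)
        self.norm_bridge = nn.LayerNorm(d_model)
        self.norm1 = nn.LayerNorm(d_model)              # first layer regularization
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)              # second layer regularization
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, words, visual_coding, auditory_coding, target_aux):
        w, _ = self.self_attn(words, words, words)      # multi-head attention on word features
        w = self.norm_words(w)                          # word layer regularization features
        vis, _ = self.vis_attn(w, visual_coding, visual_coding)      # fuse visual attention
        aud, _ = self.aud_attn(w, auditory_coding, auditory_coding)  # fuse auditory attention
        bridged = self.norm_bridge(self.bridge(torch.cat([vis, aud], dim=-1)))
        fused, _ = self.aux_attn(bridged, target_aux, target_aux)    # fuse target auxiliary features
        x = self.norm1(fused)
        x = self.norm2(x + self.ffn(x))                 # residual connection is an assumption
        return torch.softmax(self.out(x), dim=-1)       # posterior probability of each keyword

decoder = TextDecoderLayer()
posteriors = decoder(torch.randn(1, 10, 512),   # embedded word features generated so far
                     torch.randn(1, 32, 512),   # visual coding features
                     torch.randn(1, 40, 512),   # auditory coding features
                     torch.randn(1, 1, 512))    # target auxiliary features, projected to d_model (assumed)
decoding_word = posteriors[:, -1].argmax(dim=-1)  # greedy selection of the next decoding word
```

A causal mask on the self-attention and beam search over the posteriors would normally be added; they are omitted here to keep the sketch short.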
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions, and for the method implemented when the program instructions are executed, reference may be made to the embodiments of the video description generation method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, for example, a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash memory card (Flash Card) provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of video description generation, the method comprising the steps of:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described;
respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features;
processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features;
decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword;
and generating the video description of the video to be described according to the decoding words.
2. The video description generation method according to claim 1, wherein the multi-modal attention mechanism principal model includes a visual feature encoder and an auditory feature encoder;
the method for obtaining the visual coding features and the auditory coding features by respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system comprises the following steps:
performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features;
performing multi-modal attention calculation on the visual multi-head attention feature and the auditory multi-head attention feature through the visual feature encoder to obtain a visual feature fused with auditory attention, and performing multi-modal attention calculation on the auditory multi-head attention feature and the visual multi-head attention feature through the auditory feature encoder to obtain an auditory feature fused with visual attention;
and performing, through the visual feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the visual features fused with auditory attention to obtain the visual coding features output by the visual feature encoder, and performing, through the auditory feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the auditory features fused with visual attention to obtain the auditory coding features output by the auditory feature encoder.
3. The video description generation method according to claim 1, wherein the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model;
the processing, by an assistant model of the video description generation system, the visually encoded features and the aurally encoded features to generate target assistant features, comprising:
inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model;
and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
4. The method according to claim 3, wherein the inputting the visual coding features into the scene classification auxiliary model for processing to obtain the first auxiliary features output by the scene classification auxiliary model comprises:
inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features;
performing nonlinear mapping on the linearly transformed visual coding features through a linear rectification function to obtain a visual coding feature map;
performing linear transformation on the visual coding feature map;
and performing softmax logistic regression calculation on the linearly transformed visual coding feature map to obtain the first auxiliary features output by the scene classification auxiliary model.
5. The method of claim 3, wherein the inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain the second auxiliary features output by the keyword evaluation auxiliary model comprises:
inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features;
performing nonlinear mapping on the linearly transformed auditory coding features through a linear rectification function to obtain an auditory coding feature map;
performing linear transformation on the auditory coding feature map;
processing the linearly transformed auditory coding feature map through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary;
performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword;
ranking the scores of the keywords, and selecting a preset number of keywords in descending order of score, so as to look up the indexes of the selected keywords in the dictionary;
and combining the looked-up indexes to obtain the second auxiliary features output by the keyword evaluation auxiliary model.
6. The video description generation method according to claim 3, wherein the generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature comprises:
performing keyword embedding and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with a reduced feature dimension;
and concatenating the dimension-reduced second auxiliary features with the first auxiliary features to obtain the target auxiliary features.
7. The video description generation method of claim 1, wherein the multi-modal attention mechanism principal model comprises a text decoder;
the decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and the method comprises the following steps:
sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features;
performing multi-modal attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-modal attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention;
bridging the word features fused with the visual attention and the word features fused with the auditory attention to obtain bridging word features;
performing layer regularization on the bridging word features, and performing multi-head attention calculation on the layer-regularized bridging word features and the target auxiliary features to obtain word features fused with the target auxiliary features;
sequentially carrying out first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder;
and sequentially carrying out linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
8. A video description generation apparatus, characterized in that the video description generation apparatus comprises:
the extraction module is used for acquiring a video to be described and extracting visual features, auditory features and word features of the video to be described;
the coding module is used for coding the visual features and the auditory features respectively through a multi-mode attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features;
a target auxiliary feature generation module, configured to process the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features;
the decoding module is used for decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoding word from each keyword according to the posterior probability of each keyword;
and the video description generation module is used for generating the video description of the video to be described according to the decoding words.
9. A computer device, characterized in that the computer device comprises a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the video description generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, wherein the computer program, when executed by a processor, implements the steps of the video description generation method according to any one of claims 1 to 7.
CN202110470037.0A 2021-04-28 2021-04-28 Video description generation method, device, equipment and computer readable storage medium Active CN113095435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110470037.0A CN113095435B (en) 2021-04-28 2021-04-28 Video description generation method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110470037.0A CN113095435B (en) 2021-04-28 2021-04-28 Video description generation method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113095435A true CN113095435A (en) 2021-07-09
CN113095435B CN113095435B (en) 2024-06-04

Family

ID=76681011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110470037.0A Active CN113095435B (en) 2021-04-28 2021-04-28 Video description generation method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113095435B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020190112A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating captioning information of multimedia data
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning
CN111541910A (en) * 2020-04-21 2020-08-14 华中科技大学 Video barrage comment automatic generation method and system based on deep learning

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023201990A1 (en) * 2022-04-19 2023-10-26 苏州浪潮智能科技有限公司 Visual positioning method and apparatus, device, and medium

Also Published As

Publication number Publication date
CN113095435B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
US20210342670A1 (en) Processing sequences using convolutional neural networks
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
WO2021232746A1 (en) Speech recognition method, apparatus and device, and storage medium
WO2021037113A1 (en) Image description method and apparatus, computing device, and storage medium
CN110489567B (en) Node information acquisition method and device based on cross-network feature mapping
CN110781306B (en) English text aspect layer emotion classification method and system
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN112417855A (en) Text intention recognition method and device and related equipment
CN109344242B (en) Dialogue question-answering method, device, equipment and storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN114091450B (en) Judicial domain relation extraction method and system based on graph convolution network
CN115083435B (en) Audio data processing method and device, computer equipment and storage medium
CN116543768A (en) Model training method, voice recognition method and device, equipment and storage medium
CN111563161B (en) Statement identification method, statement identification device and intelligent equipment
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN115203372A (en) Text intention classification method and device, computer equipment and storage medium
CN113095435B (en) Video description generation method, device, equipment and computer readable storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN117648469A (en) Cross double-tower structure answer selection method based on contrast learning
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN116775873A (en) Multi-mode dialogue emotion recognition method
CN115017900B (en) Conversation emotion recognition method based on multi-mode multi-prejudice
CN114743018B (en) Image description generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant