CN113191262B - Video description data processing method, device and storage medium - Google Patents


Info

Publication number
CN113191262B
CN113191262B (application CN202110476061.5A)
Authority
CN
China
Prior art keywords
lens
video
similarity
feature
shot
Prior art date
Legal status
Active
Application number
CN202110476061.5A
Other languages
Chinese (zh)
Other versions
CN113191262A (en)
Inventor
蔡晓东
黄庆楠
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202110476061.5A
Publication of CN113191262A
Application granted
Publication of CN113191262B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video description data processing method, a device and a storage medium, wherein the method comprises the following steps: importing a video sequence and dividing the video sequence into a plurality of video pictures; performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets; merging and analyzing all the shot data sets through the preset convolutional neural network to obtain a plurality of merged shot data sets; performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence; and converting the video description feature sequence into video description information through a preset video description model. The method converts the natural language problem directly into an image problem, without generating a text description for each piece of shot data and then combining those descriptions into a final description, thereby reducing the redundancy of the generated description and improving the fluency of the text description.

Description

Video description data processing method, device and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method and an apparatus for processing video description data, and a storage medium.
Background
At present, video description is evaluated mainly with machine translation accuracy metrics, sentence fluency metrics and the like, and improving the fluency of video descriptions remains a difficult problem. In the prior art, a video is segmented into a plurality of shot data sets, each shot data set is input into a convolutional neural network to generate a series of features, and the features are then input into a video description model to generate sentences. This process incurs a heavy computational load; moreover, when two shots with high similarity are input separately, the convolutional neural network produces many similar features, each of which is described by the video description model, so the computational cost of the model grows further, and the final sentences are disjointed and unnatural, differing greatly from manual descriptions.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method, an apparatus and a storage medium for processing video description data, aiming at the defects of the prior art.
The technical scheme for solving the technical problems is as follows: a video description data processing method, comprising the steps of:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
and converting the video description feature sequence into video description information through a preset video description model.
The invention has the beneficial effects that: feature segmentation analysis is performed on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets; all the shot data sets are merged and analyzed through the preset convolutional neural network to obtain a plurality of merged shot data sets; feature extraction is performed on the merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence; and the video description feature sequence is converted into video description information through a preset video description model. There is no need to generate a text description for each piece of shot data and then combine those descriptions into a final description; the natural language problem is converted directly into an image problem, which reduces the redundancy of the generated description and improves the fluency of the text description.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets includes:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
and when the video similarity is smaller than or equal to a preset segmentation threshold, taking the video picture corresponding to the video similarity and all the previous video pictures as shot data sets, thereby obtaining a plurality of shot data sets.
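A minimal sketch of this segmentation step is given below. It assumes the per-frame features z_i have already been produced by the preset convolutional neural network; the function names and the threshold value 0.8 are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between two feature vectors (the "cosine distance function" cos)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def segment_into_shots(frame_features, t1: float = 0.8):
    """Group per-frame CNN features into shot data sets.

    A shot boundary is placed whenever the similarity between two adjacent
    frame features drops to or below the segmentation threshold T1
    (illustrative value; the patent treats T1 as an empirically trained threshold).
    """
    if not frame_features:
        return []
    shots, current = [], [0]
    for i in range(len(frame_features) - 1):
        sim = cosine_sim(frame_features[i], frame_features[i + 1])  # L_n = cos(z_i, z_{i+1})
        if sim <= t1:
            shots.append(current)   # close the current shot data set
            current = []
        current.append(i + 1)
    shots.append(current)
    return shots  # each entry is the list of frame indices forming one shot data set
```

For example, `segment_into_shots([cnn(p) for p in pictures])` would return the shot data sets for a list of video pictures, given some feature extractor `cnn`.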
The beneficial effect of adopting the further scheme is that: the characteristics of all video pictures are segmented and analyzed through the preset convolutional neural network to obtain a plurality of shot data sets, the same shots can be clustered, a data base is provided for subsequent processing, natural language problems can be directly converted into image problems, the redundancy of generated description is reduced, and the fluency of character description is improved.
Further, the process of calculating the similarity of the two video features in each group to obtain the video similarity corresponding to the video features includes:
and calculating the similarity of the two video features of each group through a first formula to obtain the video similarity corresponding to the video features, wherein the first formula is as follows:
L_n = cos(z_i, z_{i+1}),
where L_n is the video similarity between the i-th video feature and the (i+1)-th video feature, cos is the cosine distance function, z_i is the i-th video feature, and z_{i+1} is the (i+1)-th video feature.
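For completeness, the cosine distance function used in the first formula can be written out explicitly (a standard definition, not restated in the patent):

```latex
\cos(z_i, z_{i+1}) = \frac{z_i \cdot z_{i+1}}{\lVert z_i \rVert \, \lVert z_{i+1} \rVert}
```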
The beneficial effect of adopting the further scheme is that: the video similarity corresponding to the video features is obtained by calculating the similarity of the two video features of each group in the first mode, a data base is provided for subsequent processing, the problem of natural language can be directly converted into an image problem, the redundancy of generated description is reduced, and the fluency of character description is improved.
Further, the process of merging and analyzing all the shot data sets through the preset convolutional neural network to obtain a plurality of merged shot data sets includes:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
and when the summation value is greater than or equal to a preset summation threshold value, merging the shot data set corresponding to the shot similarity screening data set and the next shot data set corresponding to the shot similarity screening data set into a merged shot data set.
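A minimal sketch of this merging analysis follows, assuming each shot feature sequence is a list of feature vectors produced by the preset convolutional neural network. The index-wise pairing of features, the threshold values 0.7 and 2.0, and the greedy left-to-right merging strategy are illustrative assumptions; the patent only fixes the screening and summation criteria.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def should_merge(feats_a, feats_b, t2: float = 0.7, t3: float = 2.0) -> bool:
    """Apply the screening and summation criteria to two adjacent shot feature sequences."""
    # s_i = cos(f_i, f_i'): pair each feature with the corresponding feature of the next shot
    sims = [cosine_sim(fa, fb) for fa, fb in zip(feats_a, feats_b)]
    q = [s for s in sims if s > t2]   # shot similarity screening data set Q
    k = sum(q)                        # summation value K
    return k >= t3                    # merge when K reaches the preset summation threshold

def merge_shot_data_sets(shot_feature_seqs, shot_data_sets, t2=0.7, t3=2.0):
    """Greedily merge adjacent shot data sets whose summation value reaches T3."""
    if not shot_data_sets:
        return []
    merged = [list(shot_data_sets[0])]
    tail_feats = shot_feature_seqs[0]
    for feats, shot in zip(shot_feature_seqs[1:], shot_data_sets[1:]):
        if should_merge(tail_feats, feats, t2, t3):
            merged[-1].extend(shot)   # fold this shot into the previous merged shot
        else:
            merged.append(list(shot))
        tail_feats = feats            # always compare against the most recent shot
    return merged
```

Comparing each shot only with its immediate neighbour keeps the computation linear in the number of shots, which matches the patent's requirement that only adjacent shot feature sequences are compared.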
The beneficial effect of adopting the above further scheme is: and a plurality of merged shot data sets are obtained by merging and analyzing all the shot data sets through a preset convolutional neural network, so that the data volume can be reduced, the redundancy of generated description is reduced, and the fluency of character description is improved.
Further, the process of calculating the similarity between each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain the lens similarity corresponding to the lens feature in the lens feature sequence includes:
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in a corresponding next lens feature sequence through a second formula to obtain lens similarity corresponding to the lens features in the lens feature sequences, wherein the second formula is as follows:
s_i = cos(f_i, f_i'),
where s_i is the shot similarity between the i-th shot feature in the shot feature sequence and the i-th shot feature in the next shot feature sequence, cos is the cosine distance function, f_i is the i-th shot feature in the shot feature sequence, and f_i' is the i-th shot feature in the next shot feature sequence.
The beneficial effect of adopting the further scheme is that: and respectively calculating the similarity of each shot feature in each shot feature sequence and each shot feature in the corresponding next shot feature sequence through a second formula to obtain the shot similarity corresponding to the shot features in the shot feature sequence, thereby providing a data basis for subsequent processing, reducing the redundancy of generated description and improving the fluency of character description.
Further, the process of screening the feature similarity corresponding to each of the shot features in each of the shot feature sequences and the shot similarity corresponding to each of the shot features in a corresponding next shot feature sequence, and collecting all the screened shot similarities as the shot similarity screening dataset corresponding to the shot dataset includes:
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the corresponding next lens feature sequence through a third formula, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
Q = {Q_i} if card(S_i) > T_2,
where card(S_i) denotes an element of the shot similarity sequence S, T_2 is the preset shot similarity threshold, Q_i is the i-th shot similarity that satisfies the condition, Q is the shot similarity screening data set, and S_i is the shot similarity between the i-th shot feature and the i-th shot feature in the next shot feature sequence.
The beneficial effect of adopting the above further scheme is: and respectively taking all the screened shot similarities of the feature similarity screening set corresponding to each shot feature in each shot feature sequence and the shot similarity screening set corresponding to each shot feature in the next shot feature sequence as the shot similarity screening data set corresponding to the shot data set through a third formula, so that the processed data is reduced, the redundancy of the generated description is reduced, and the fluency of the character description is improved.
Further, the process of respectively summing all the shot similarity screening data in each shot similarity screening data set to obtain a sum value corresponding to the shot similarity screening data set includes:
respectively summing all the lens similarity screening data in each lens similarity screening data set by a fourth formula to obtain a summation value corresponding to the lens similarity screening data set, wherein the fourth formula is as follows:
K = Σ_{i=1}^{n-1} Q_i,
where K is the summation value, Q_i is the i-th shot similarity in the shot similarity screening data set, and n-1 is the index of the last shot similarity screening datum in the shot similarity screening data set.
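As a small worked example of the third and fourth formulas (all numbers illustrative):

```latex
S = \{0.9,\ 0.4,\ 0.8,\ 0.95\},\quad T_2 = 0.7
\;\Rightarrow\; Q = \{0.9,\ 0.8,\ 0.95\},\quad
K = \sum_{i} Q_i = 0.9 + 0.8 + 0.95 = 2.65
```

With a preset summation threshold of, say, T_3 = 2.5, the two adjacent shot data sets would be merged, since K ≥ T_3.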
The beneficial effect of adopting the above further scheme is: and respectively carrying out summation calculation on all shot similarity screening data in each shot similarity screening data set to obtain a summation value corresponding to the shot similarity screening data set through a fourth formula, and further reducing processing data, thereby reducing the redundancy of generated description and improving the fluency of character description.
Another technical solution of the present invention for solving the above technical problems is as follows: a video description data processing apparatus comprising:
the video segmentation module is used for importing a video sequence and segmenting the video sequence into a plurality of video pictures;
the characteristic segmentation analysis module is used for carrying out characteristic segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of lens data sets;
the merging analysis module is used for merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
the feature extraction module is used for extracting features of the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
and the video description information acquisition module is used for converting the video description feature sequence into video description information through a preset video description model.
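A structural sketch of the apparatus is given below, with one injected callable per module; the class and parameter names are illustrative, and the internals of the preset convolutional neural network and the preset video description model are not fixed by the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class VideoDescriptionProcessor:
    """One field per module; each module is an injected callable."""
    split_video: Callable[[object], List]                # video segmentation module
    segment_features: Callable[[List], List[List]]       # feature segmentation analysis module
    merge_shots: Callable[[List[List]], List[List]]      # merging analysis module
    extract_features: Callable[[List[List]], Sequence]   # feature extraction module
    describe: Callable[[Sequence], str]                  # video description information acquisition module

    def run(self, video_sequence) -> str:
        pictures = self.split_video(video_sequence)        # video sequence -> video pictures
        shot_data_sets = self.segment_features(pictures)   # pictures -> shot data sets
        merged = self.merge_shots(shot_data_sets)          # shot data sets -> merged shot data sets
        feature_sequence = self.extract_features(merged)   # merged shots -> video description feature sequence
        return self.describe(feature_sequence)             # feature sequence -> video description information
```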
Another technical solution of the present invention for solving the above technical problems is as follows: a video description data processing apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor when executing the computer program implementing a video description data processing method as described above.
Another technical solution of the present invention for solving the above technical problems is as follows: a computer-readable storage medium, storing a computer program which, when executed by a processor, implements a video description data processing method as described above.
Drawings
Fig. 1 is a schematic flow chart of a video description data processing method according to an embodiment of the present invention;
fig. 2 is a block diagram of a video description data processing apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth to illustrate, but are not to be construed to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a video description data processing method according to an embodiment of the present invention.
Example 1:
as shown in fig. 1, a video description data processing method includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
and converting the video description feature sequence into video description information through a preset video description model.
It should be understood that the shot data sets are generated sequentially according to the time at which events occur in the input video, for example a first shot data set covering seconds 0 to 10 of the video and a second covering seconds 11 to 20, each shot data set being composed of several of the video pictures.
It should be understood that the preset convolutional neural network is constructed first, with V = {i_1, i_2, …, i_n} denoting a video sequence; the imported video sequence is then divided into individual frame pictures (i.e., the video pictures).
It should be understood that the preset video description model is used for adjusting the feature weight of the input features, then generating words, and then arranging the words in word order to form a sentence.
Specifically, the plurality of synthesized shots (i.e., the merged shot data sets) are input into the preset convolutional neural network, which outputs a video description feature sequence M = {f_1, f_2, …, f_n}; the video description feature sequence M is then input into the preset video description model for conversion to generate a text description (i.e., the video description information).
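A minimal runnable sketch of these final two steps is shown below, with toy stand-ins for the preset convolutional neural network and the preset video description model; the stand-ins and all names are illustrative assumptions, not the patent's models.

```python
import numpy as np

def generate_description(merged_shot_data_sets, cnn, description_model) -> str:
    # M = {f_1, f_2, ..., f_n}: one feature per merged shot, in temporal order
    m = [cnn(shot) for shot in merged_shot_data_sets]
    # the description model re-weights the features, generates words and
    # arranges them in order to form a sentence (the video description information)
    return description_model(m)

# toy stand-ins so the sketch runs end to end
toy_cnn = lambda shot: np.asarray(shot, dtype=float).mean(axis=0)              # mean-pool the frames of a shot
toy_description_model = lambda m: " ".join(f"word{i}" for i in range(len(m)))  # one placeholder word per feature

merged_shots = [np.random.rand(5, 8), np.random.rand(3, 8)]                    # two merged shots of 8-dim frame features
print(generate_description(merged_shots, toy_cnn, toy_description_model))      # -> "word0 word1"
```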
In this embodiment, feature segmentation analysis is performed on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets; all the shot data sets are merged and analyzed through the preset convolutional neural network to obtain a plurality of merged shot data sets; feature extraction is performed on the merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence; and the video description feature sequence is converted into video description information through a preset video description model. There is no need to generate a text description for each piece of shot data and then combine those descriptions into a final description; the natural language problem is converted directly into an image problem, which reduces the redundancy of the generated description and improves the fluency of the text description.
Example 2:
as shown in fig. 1, a video description data processing method includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
and when the video similarity is smaller than or equal to a preset segmentation threshold, taking the video picture corresponding to the video similarity and all the previous video pictures as shot data sets, thereby obtaining a plurality of shot data sets.
It should be understood that the video pictures are sequentially input into the preset convolutional neural network, each video picture yielding one video feature, so that a feature set Z = {z_1, z_2, …, z_i} (i.e., the plurality of video features) is obtained; clustering is then performed according to the video similarity, and video pictures whose video similarity is larger than the preset segmentation threshold T_1 are clustered into one shot data set, the shot data in the constructed shot data sets occurring in temporal order.
In the embodiment, the preset convolutional neural network is used for carrying out feature segmentation analysis on all video pictures to obtain a plurality of shot data sets, the same shots can be clustered, a data base is provided for subsequent processing, the problem of natural language can be directly converted into an image problem, the redundancy of generated description is reduced, and the fluency of character description is improved.
Example 3:
as shown in fig. 1, a video description data processing method includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of calculating the similarity of the two video features in each group to obtain the video similarity corresponding to the video features comprises the following steps:
and calculating the similarity of the two video features of each group through a first formula to obtain the video similarity corresponding to the video features, wherein the first formula is as follows:
L_n = cos(z_i, z_{i+1}),
where L_n is the video similarity between the i-th video feature and the (i+1)-th video feature, cos is the cosine distance function, z_i is the i-th video feature, and z_{i+1} is the (i+1)-th video feature.
In the embodiment, the video similarity corresponding to the video features is obtained by calculating the similarity of the two video features of each group in the first mode, so that a data base is provided for subsequent processing, the problem of natural language can be directly converted into the problem of images, the redundancy of generated description is reduced, and the fluency of text description is improved.
Example 4:
as shown in fig. 1, a video description data processing method includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of performing merging analysis on all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
and when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and the next lens data set corresponding to the lens similarity screening data set into a merged lens data set.
It should be understood that a set V = {{v_1, v_2, …, v_{m-1}}_1, {v_m, v_{m+1}, …, v_{w-1}}_2, …, {v_w, v_{w+1}, …, v_n}_i} is defined. First, the i-th shot data set in the set is input into the preset convolutional neural network, which outputs the shot feature sequence {f_1, f_2, …, f_i}; then the (i+1)-th shot data set in the set is input into the convolutional neural network, which outputs the shot feature sequence {f_1', f_2', …, f_i'}; and the feature similarity (i.e., the shot similarity) between each shot feature of the i-th shot feature sequence and each shot feature of the (i+1)-th shot feature sequence is calculated.
It should be understood that if the summation value K is greater than or equal to the preset summation threshold T_3, this indicates a strong correlation between the two shot data sets, and they can be combined into one merged shot data set.
In the embodiment, the preset convolutional neural network is used for merging and analyzing all shot data sets to obtain a plurality of merged shot data sets, so that the data volume can be reduced, the redundancy of generated description is reduced, and the fluency of text description is improved.
Example 5:
as shown in fig. 1, a method for processing video description data includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and a next lens data set corresponding to the lens similarity screening data set into a merged lens data set;
the process of respectively calculating the similarity between each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain the lens similarity corresponding to the lens features in the lens feature sequences includes:
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in a corresponding next lens feature sequence through a second formula to obtain lens similarity corresponding to the lens features in the lens feature sequences, wherein the second formula is as follows:
s_i = cos(f_i, f_i'),
where s_i is the shot similarity between the i-th shot feature in the shot feature sequence and the i-th shot feature in the next shot feature sequence, cos is the cosine distance function, f_i is the i-th shot feature in the shot feature sequence, and f_i' is the i-th shot feature in the next shot feature sequence.
In the above embodiment, the similarity between each shot feature in each shot feature sequence and each shot feature in the corresponding next shot feature sequence is calculated by the second formula to obtain the shot similarity corresponding to the shot feature in the shot feature sequence, so that a data basis is provided for subsequent processing, the redundancy of generated description is reduced, and the fluency of text description is improved.
Example 6:
as shown in fig. 1, a method for processing video description data includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and a next lens data set corresponding to the lens similarity screening data set into a merged lens data set;
the process of screening the feature similarity corresponding to each of the shot features in each of the shot feature sequences and the shot similarity corresponding to each of the shot features in a next shot feature sequence, and collecting all the screened shot similarities as a shot similarity screening dataset corresponding to the shot dataset includes:
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the corresponding next lens feature sequence through a third formula, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
Q = {Q_i} if card(S_i) > T_2,
where card(S_i) denotes an element of the shot similarity sequence S, T_2 is the preset shot similarity threshold, Q_i is the i-th shot similarity that satisfies the condition, Q is the shot similarity screening data set, and S_i is the shot similarity between the i-th shot feature and the i-th shot feature in the next shot feature sequence.
It should be understood that the preset segmentation threshold, the preset shot similarity threshold and the preset summation threshold are all empirical thresholds obtained by training on labelled samples.
It should be understood that Q is defined as the set of shot similarities (i.e., the shot similarity screening data set) that remain after discarding those smaller than the preset shot similarity threshold.
It should be understood that the formula means that an element of the shot similarity sequence S between adjacent shot feature sequences is put into the set Q if it is greater than the set threshold.
In the above embodiment, all the shot similarities obtained through screening of the feature similarity corresponding to each shot feature in each shot feature sequence and the shot similarity corresponding to each shot feature in the corresponding next shot feature sequence are respectively used as the shot similarity screening data set corresponding to the shot data set by the third formula, so that the processed data is reduced, the redundancy of the generated description is reduced, and the fluency of the text description is improved.
Example 7:
as shown in fig. 1, a method for processing video description data includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of performing merging analysis on all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and a next lens data set corresponding to the lens similarity screening data set into a merged lens data set;
the process of respectively performing summation calculation on all the shot similarity screening data in each shot similarity screening data set to obtain a summation value corresponding to the shot similarity screening data set comprises the following steps:
summing all the shot similarity screening data in each shot similarity screening data set respectively through a fourth formula to obtain a summation value corresponding to the shot similarity screening data set, where the fourth formula is as follows:
K = Σ_{i=1}^{n-1} Q_i,
where K is the summation value, Q_i is the i-th shot similarity in the shot similarity screening data set, and n-1 is the index of the last shot similarity screening datum in the shot similarity screening data set.
It should be understood that n is the number of the video features; after the pairwise comparison of adjacent features, there are n-1 similarity values, one fewer than the number of features.
In the embodiment, the fourth expression is used for respectively summing all the shot similarity screening data in each shot similarity screening data set to obtain the sum value corresponding to the shot similarity screening data set, so that the processing data is further reduced, the redundancy of the generated description is reduced, and the fluency of the text description is improved.
Fig. 2 is a block diagram of a video description data processing apparatus according to an embodiment of the present invention.
Example 8:
as shown in fig. 2, a video description data processing apparatus, comprising:
the video segmentation module is used for importing a video sequence and segmenting the video sequence into a plurality of video pictures;
the characteristic segmentation analysis module is used for carrying out characteristic segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of lens data sets;
the merging analysis module is used for merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
the feature extraction module is used for extracting features of the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
and the video description information acquisition module is used for converting the video description feature sequence into video description information through a preset video description model.
Example 9:
a video description data processing apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, when executing the computer program, implementing a video description data processing method as described above. The device may be a computer or the like.
Example 10:
a computer-readable storage medium, storing a computer program which, when executed by a processor, implements a video description data processing method as described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A method for processing video description data, comprising the steps of:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of performing merging analysis on all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
and when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and the next lens data set corresponding to the lens similarity screening data set into a merged lens data set.
2. The method according to claim 1, wherein the performing similarity calculation on the two video features in each group to obtain the video similarity corresponding to the video features comprises:
and calculating the similarity of the two video features of each group through a first formula to obtain the video similarity corresponding to the video features, wherein the first formula is as follows:
L_n = cos(z_i, z_{i+1}),
where L_n is the video similarity between the i-th video feature and the (i+1)-th video feature, cos is the cosine distance function, z_i is the i-th video feature, and z_{i+1} is the (i+1)-th video feature.
3. The method according to claim 1, wherein the step of calculating the similarity between each of the shot features in each of the shot feature sequences and each of the shot features in a corresponding next shot feature sequence to obtain the shot similarity corresponding to the shot features in the shot feature sequences comprises:
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in a corresponding next lens feature sequence through a second formula to obtain lens similarity corresponding to the lens features in the lens feature sequences, wherein the second formula is as follows:
s_i = cos(f_i, f_i'),
where s_i is the shot similarity between the i-th shot feature in the shot feature sequence and the i-th shot feature in the next shot feature sequence, cos is the cosine distance function, f_i is the i-th shot feature in the shot feature sequence, and f_i' is the i-th shot feature in the next shot feature sequence.
4. The method according to claim 1, wherein the process of screening the shot similarities corresponding to the shot features in each shot feature sequence and the corresponding next shot feature sequence, and collecting all the shot similarities that pass the screening as the shot similarity screening data set corresponding to the shot data set comprises:
screening, through a third formula, the shot similarities corresponding to the shot features in each shot feature sequence and the corresponding next shot feature sequence, and collecting all the shot similarities that pass the screening as the shot similarity screening data set corresponding to the shot data set, wherein the third formula is as follows:
Q = {Q_i}, if card(S_i) > T_2,
wherein card(·) denotes the number of elements taken from the shot feature sequence, T_2 is the preset shot similarity threshold, Q_i is the i-th shot similarity screening data that meets the condition, Q is the shot similarity screening data set, and S_i is the shot similarity between the i-th shot feature and the (i+1)-th shot feature.
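Read as a filtering step, the third formula keeps only the shot similarities that exceed the preset threshold T_2. A sketch of that reading, with a placeholder threshold value:

```python
def screen_similarities(shot_sims, t2=0.8):
    """Third formula (one reading): collect the shot similarities that exceed T_2."""
    return [s for s in shot_sims if s > t2]

# Example: only the similarities above 0.8 survive the screening.
print(screen_similarities([0.95, 0.40, 0.87, 0.91]))  # [0.95, 0.87, 0.91]
```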
5. The method according to claim 1, wherein the process of summing all the shot similarity screening data in each shot similarity screening data set to obtain the summation value corresponding to the shot similarity screening data set comprises:
summing, through a fourth formula, all the shot similarity screening data in each shot similarity screening data set to obtain the summation value corresponding to the shot similarity screening data set, wherein the fourth formula is as follows:
K = ∑ Q_i,
wherein K is the summation value, the sum runs over all the shot similarity screening data in the shot similarity screening data set, Q_i is the i-th shot similarity screening data in the shot similarity screening data set, and n−1 indexes the last shot similarity screening data in the shot similarity screening data set.
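The fourth formula is a plain sum over the screened similarities, and the merge decision compares that sum with the preset summation threshold. A minimal sketch; the threshold value is illustrative only:

```python
def should_merge(screened_sims, sum_threshold=5.0):
    """Fourth formula plus the merge test: K = sum of screened similarities, merge if K >= threshold."""
    k = sum(screened_sims)
    return k >= sum_threshold, k

merge, k = should_merge([0.95, 0.87, 0.91, 0.93, 0.89, 0.90])
print(merge, round(k, 2))  # (True, 5.45) with these toy values
```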
6. A video description data processing apparatus, comprising:
the video segmentation module is used for importing a video sequence and segmenting the video sequence into a plurality of video pictures;
the feature segmentation analysis module is used for performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
the merging analysis module is used for performing merging analysis on all the shot data sets through the preset convolutional neural network to obtain a plurality of merged shot data sets;
the feature extraction module is used for extracting features from the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
the video description information acquisition module is used for converting the video description feature sequence into video description information through a preset video description model;
the feature segmentation analysis module is specifically configured to:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing every two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain the video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking the video picture corresponding to the video similarity and all preceding video pictures as one shot data set, so as to obtain a plurality of shot data sets;
the merging analysis module is specifically configured to:
extract features from each shot data set through the preset convolutional neural network to obtain a shot feature sequence corresponding to each shot data set, wherein each shot feature sequence comprises a plurality of shot features;
perform similarity calculation on each shot feature in each shot feature sequence and each shot feature in the corresponding next shot feature sequence to obtain the shot similarity corresponding to the shot features in the shot feature sequence;
screen the shot similarities corresponding to the shot features in each shot feature sequence and the corresponding next shot feature sequence, and collect all the shot similarities that pass the screening as a shot similarity screening data set corresponding to the shot data set;
sum all the shot similarity screening data in each shot similarity screening data set to obtain the summation value corresponding to the shot similarity screening data set;
and when the summation value is greater than or equal to a preset summation threshold value, merge the shot data set corresponding to the shot similarity screening data set and the next shot data set into one merged shot data set.
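Claim 6 packages the same pipeline as cooperating modules. A structural sketch of how such an apparatus might be organized, assuming the stages are injected as callables (segment_video, split_into_shots, merge_shots, extract_sequence_features, describe are hypothetical names, not taken from the patent):

```python
class VideoDescriptionApparatus:
    """Module layout mirroring claim 6: segmentation -> shot splitting -> merging -> description."""

    def __init__(self, segment_video, split_into_shots, merge_shots,
                 extract_sequence_features, describe):
        self.segment_video = segment_video                            # video segmentation module
        self.split_into_shots = split_into_shots                      # feature segmentation analysis module
        self.merge_shots = merge_shots                                # merging analysis module
        self.extract_sequence_features = extract_sequence_features    # feature extraction module
        self.describe = describe                                      # video description information module

    def run(self, video_path):
        frames = self.segment_video(video_path)            # video sequence -> video pictures
        shot_sets = self.split_into_shots(frames)          # video pictures -> shot data sets
        merged_sets = self.merge_shots(shot_sets)          # shot data sets -> merged shot data sets
        features = self.extract_sequence_features(merged_sets)  # -> video description feature sequence
        return self.describe(features)                     # -> video description information
```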
7. A video description data processing apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that when the computer program is executed by the processor, the video description data processing method according to any one of claims 1 to 5 is implemented.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a video description data processing method according to any one of claims 1 to 5.
CN202110476061.5A 2021-04-29 2021-04-29 Video description data processing method, device and storage medium Active CN113191262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110476061.5A CN113191262B (en) 2021-04-29 2021-04-29 Video description data processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110476061.5A CN113191262B (en) 2021-04-29 2021-04-29 Video description data processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN113191262A CN113191262A (en) 2021-07-30
CN113191262B true CN113191262B (en) 2022-08-19

Family

ID=76980672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110476061.5A Active CN113191262B (en) 2021-04-29 2021-04-29 Video description data processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113191262B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597341A (en) * 2018-05-25 2021-04-02 中科寒武纪科技股份有限公司 Video retrieval method and video retrieval mapping relation generation method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495B (en) * 2015-10-23 2019-06-04 天津大学 A kind of video presentation method summarized based on deep learning and text
EP3166075B1 (en) * 2015-11-05 2020-08-05 Facebook, Inc. Systems and methods for processing content using convolutional neural networks
CN108447501B (en) * 2018-03-27 2020-08-18 中南大学 Pirated video detection method and system based on audio words in cloud storage environment
CN108228915B (en) * 2018-03-29 2021-10-26 华南理工大学 Video retrieval method based on deep learning
CN109189989B (en) * 2018-07-23 2020-11-03 北京市商汤科技开发有限公司 Video description method and device, computer equipment and storage medium
CN109359214A (en) * 2018-10-15 2019-02-19 平安科技(深圳)有限公司 Video presentation generation method, storage medium and terminal device neural network based
CN110909207B (en) * 2019-09-08 2023-06-02 东南大学 News video description data set construction method containing sign language
US11676278B2 (en) * 2019-09-26 2023-06-13 Intel Corporation Deep learning for dense semantic segmentation in video with automated interactivity and improved temporal coherence

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597341A (en) * 2018-05-25 2021-04-02 中科寒武纪科技股份有限公司 Video retrieval method and video retrieval mapping relation generation method and device

Also Published As

Publication number Publication date
CN113191262A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN108984530B (en) Detection method and detection system for network sensitive content
Zhang et al. Face sketch synthesis via sparse representation-based greedy search
Klibisz et al. Fast, simple calcium imaging segmentation with fully convolutional networks
Zhao et al. A language model based evaluator for sentence compression
CN114023412B (en) ICD code prediction method and system based on joint learning and denoising mechanism
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
L Fernandes et al. A novel decision support for composite sketch matching using fusion of probabilistic neural network and dictionary matching
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
Reza et al. A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model
CN112765354B (en) Model training method, model training device, computer apparatus, and storage medium
CN112329663B (en) Micro-expression time detection method and device based on face image sequence
CN113191262B (en) Video description data processing method, device and storage medium
Sharma et al. A generalized zero-shot quantization of deep convolutional neural networks via learned weights statistics
CN109344252B (en) Microblog text classification method and system based on high-quality theme extension
Doughman et al. Time-aware word embeddings for three Lebanese news archives
CN116935057A (en) Target evaluation method, electronic device, and computer-readable storage medium
CN110796134A (en) Method for combining words of Chinese characters in strong-noise complex background image
CN113496228B (en) Human body semantic segmentation method based on Res2Net, transUNet and cooperative attention
US11270155B2 (en) Duplicate image detection based on image content
CN114979620A (en) Video bright spot segment detection method and device, electronic equipment and storage medium
Rodin et al. Document image quality assessment via explicit blur and text size estimation
Rust et al. Towards Privacy-Aware Sign Language Translation at Scale
CN111340329A (en) Actor assessment method and device and electronic equipment
CN113392722A (en) Method and device for recognizing emotion of object in video, electronic equipment and storage medium
CN113191263B (en) Video description method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant