CN113191262B - Video description data processing method, device and storage medium - Google Patents


Info

Publication number
CN113191262B
CN113191262B (application CN202110476061.5A)
Authority
CN
China
Prior art keywords
lens
video
similarity
feature
shot
Prior art date
Legal status
Active
Application number
CN202110476061.5A
Other languages
Chinese (zh)
Other versions
CN113191262A (en)
Inventor
蔡晓东
黄庆楠
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202110476061.5A
Publication of CN113191262A
Application granted
Publication of CN113191262B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video description data processing method, a device and a storage medium, wherein the method comprises the following steps: importing a video sequence and dividing the video sequence into a plurality of video pictures; performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets; merging and analyzing all the shot data sets through the preset convolutional neural network to obtain a plurality of merged shot data sets; performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence; and converting the video description feature sequence into video description information through a preset video description model. The method converts the natural language problem directly into an image problem, without generating a text description for each piece of shot data and then combining those descriptions into a final description, thereby reducing the redundancy of the generated description and improving the fluency of the text description.

Description

Video description data processing method, device and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method and an apparatus for processing video description data, and a storage medium.
Background
At present, video description is evaluated mainly with machine translation accuracy metrics, sentence fluency metrics and the like, and improving the fluency of video descriptions remains a difficult problem. In the prior art, a video is segmented into a plurality of shot data sets, each shot data set is input into a convolutional neural network to generate a series of features, and the features are then input into a video description model to generate sentences. This process incurs a heavy computational load; moreover, when two shots with high similarity are input separately, the convolutional neural network produces many similar features, each of which is described by the video description model, so the computational cost of the model grows further, and the final sentences are disjointed and unnatural, differing greatly from manual descriptions.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method, an apparatus and a storage medium for processing video description data, aiming at the defects of the prior art.
The technical scheme for solving the technical problems is as follows: a video description data processing method, comprising the steps of:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
and converting the video description feature sequence into video description information through a preset video description model.
The invention has the beneficial effects that: feature segmentation analysis is performed on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets; all the shot data sets are merged and analyzed through the preset convolutional neural network to obtain a plurality of merged shot data sets; feature extraction is performed on the merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence; and the video description feature sequence is converted into video description information through a preset video description model. There is no need to generate a text description for each piece of shot data and then combine those descriptions into a final description; the natural language problem is converted directly into an image problem, which reduces the redundancy of the generated description and improves the fluency of the text description.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets includes:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
and when the video similarity is smaller than or equal to a preset segmentation threshold, taking the video picture corresponding to the video similarity and all the previous video pictures as shot data sets, thereby obtaining a plurality of shot data sets.
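A minimal sketch of this segmentation step is given below. It assumes the per-frame features z_i have already been produced by the preset convolutional neural network; the function names and the threshold value 0.8 are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between two feature vectors (the "cosine distance function" cos)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def segment_into_shots(frame_features, t1: float = 0.8):
    """Group per-frame CNN features into shot data sets.

    A shot boundary is placed whenever the similarity between two adjacent
    frame features drops to or below the segmentation threshold T1
    (illustrative value; the patent treats T1 as an empirically trained threshold).
    """
    if not frame_features:
        return []
    shots, current = [], [0]
    for i in range(len(frame_features) - 1):
        sim = cosine_sim(frame_features[i], frame_features[i + 1])  # L_n = cos(z_i, z_{i+1})
        if sim <= t1:
            shots.append(current)   # close the current shot data set
            current = []
        current.append(i + 1)
    shots.append(current)
    return shots  # each entry is the list of frame indices forming one shot data set
```

For example, `segment_into_shots([cnn(p) for p in pictures])` would return the shot data sets for a list of video pictures, given some feature extractor `cnn`.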
The beneficial effect of adopting the further scheme is that: the characteristics of all video pictures are segmented and analyzed through the preset convolutional neural network to obtain a plurality of shot data sets, the same shots can be clustered, a data base is provided for subsequent processing, natural language problems can be directly converted into image problems, the redundancy of generated description is reduced, and the fluency of character description is improved.
Further, the process of calculating the similarity of the two video features in each group to obtain the video similarity corresponding to the video features includes:
and calculating the similarity of the two video features of each group through a first formula to obtain the video similarity corresponding to the video features, wherein the first formula is as follows:
L_n = cos(z_i, z_{i+1}),
where L_n is the video similarity between the i-th video feature and the (i+1)-th video feature, cos is the cosine distance function, z_i is the i-th video feature, and z_{i+1} is the (i+1)-th video feature.
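For completeness, the cosine distance function used in the first formula can be written out explicitly (a standard definition, not restated in the patent):

```latex
\cos(z_i, z_{i+1}) = \frac{z_i \cdot z_{i+1}}{\lVert z_i \rVert \, \lVert z_{i+1} \rVert}
```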
The beneficial effect of adopting the further scheme is that: the video similarity corresponding to the video features is obtained by calculating the similarity of the two video features of each group in the first mode, a data base is provided for subsequent processing, the problem of natural language can be directly converted into an image problem, the redundancy of generated description is reduced, and the fluency of character description is improved.
Further, the process of merging and analyzing all the shot data sets through the preset convolutional neural network to obtain a plurality of merged shot data sets includes:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
and when the summation value is greater than or equal to a preset summation threshold value, merging the shot data set corresponding to the shot similarity screening data set and the next shot data set corresponding to the shot similarity screening data set into a merged shot data set.
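A minimal sketch of this merging analysis follows, assuming each shot feature sequence is a list of feature vectors produced by the preset convolutional neural network. The index-wise pairing of features, the threshold values 0.7 and 2.0, and the greedy left-to-right merging strategy are illustrative assumptions; the patent only fixes the screening and summation criteria.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def should_merge(feats_a, feats_b, t2: float = 0.7, t3: float = 2.0) -> bool:
    """Apply the screening and summation criteria to two adjacent shot feature sequences."""
    # s_i = cos(f_i, f_i'): pair each feature with the corresponding feature of the next shot
    sims = [cosine_sim(fa, fb) for fa, fb in zip(feats_a, feats_b)]
    q = [s for s in sims if s > t2]   # shot similarity screening data set Q
    k = sum(q)                        # summation value K
    return k >= t3                    # merge when K reaches the preset summation threshold

def merge_shot_data_sets(shot_feature_seqs, shot_data_sets, t2=0.7, t3=2.0):
    """Greedily merge adjacent shot data sets whose summation value reaches T3."""
    if not shot_data_sets:
        return []
    merged = [list(shot_data_sets[0])]
    tail_feats = shot_feature_seqs[0]
    for feats, shot in zip(shot_feature_seqs[1:], shot_data_sets[1:]):
        if should_merge(tail_feats, feats, t2, t3):
            merged[-1].extend(shot)   # fold this shot into the previous merged shot
        else:
            merged.append(list(shot))
        tail_feats = feats            # always compare against the most recent shot
    return merged
```

Comparing each shot only with its immediate neighbour keeps the computation linear in the number of shots, which matches the patent's requirement that only adjacent shot feature sequences are compared.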
The beneficial effect of adopting the above further scheme is: and a plurality of merged shot data sets are obtained by merging and analyzing all the shot data sets through a preset convolutional neural network, so that the data volume can be reduced, the redundancy of generated description is reduced, and the fluency of character description is improved.
Further, the process of calculating the similarity between each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain the lens similarity corresponding to the lens feature in the lens feature sequence includes:
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in a corresponding next lens feature sequence through a second formula to obtain lens similarity corresponding to the lens features in the lens feature sequences, wherein the second formula is as follows:
s_i = cos(f_i, f_i'),
where s_i is the shot similarity between the i-th shot feature in the shot feature sequence and the i-th shot feature in the next shot feature sequence, cos is the cosine distance function, f_i is the i-th shot feature in the shot feature sequence, and f_i' is the i-th shot feature in the next shot feature sequence.
The beneficial effect of adopting the further scheme is that: and respectively calculating the similarity of each shot feature in each shot feature sequence and each shot feature in the corresponding next shot feature sequence through a second formula to obtain the shot similarity corresponding to the shot features in the shot feature sequence, thereby providing a data basis for subsequent processing, reducing the redundancy of generated description and improving the fluency of character description.
Further, the process of screening the feature similarity corresponding to each of the shot features in each of the shot feature sequences and the shot similarity corresponding to each of the shot features in a corresponding next shot feature sequence, and collecting all the screened shot similarities as the shot similarity screening dataset corresponding to the shot dataset includes:
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the corresponding next lens feature sequence through a third formula, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
Q = {Q_i} if card(S_i) > T_2,
where card(S_i) denotes an element of the shot similarity sequence S, T_2 is the preset shot similarity threshold, Q_i is the i-th shot similarity that satisfies the condition, Q is the shot similarity screening data set, and S_i is the shot similarity between the i-th shot feature and the i-th shot feature in the next shot feature sequence.
The beneficial effect of adopting the above further scheme is: and respectively taking all the screened shot similarities of the feature similarity screening set corresponding to each shot feature in each shot feature sequence and the shot similarity screening set corresponding to each shot feature in the next shot feature sequence as the shot similarity screening data set corresponding to the shot data set through a third formula, so that the processed data is reduced, the redundancy of the generated description is reduced, and the fluency of the character description is improved.
Further, the process of respectively summing all the shot similarity screening data in each shot similarity screening data set to obtain a sum value corresponding to the shot similarity screening data set includes:
respectively summing all the lens similarity screening data in each lens similarity screening data set by a fourth formula to obtain a summation value corresponding to the lens similarity screening data set, wherein the fourth formula is as follows:
K = Σ_{i=1}^{n-1} Q_i,
where K is the summation value, Q_i is the i-th shot similarity in the shot similarity screening data set, and n-1 is the index of the last shot similarity screening datum in the shot similarity screening data set.
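As a small worked example of the third and fourth formulas (all numbers illustrative):

```latex
S = \{0.9,\ 0.4,\ 0.8,\ 0.95\},\quad T_2 = 0.7
\;\Rightarrow\; Q = \{0.9,\ 0.8,\ 0.95\},\quad
K = \sum_{i} Q_i = 0.9 + 0.8 + 0.95 = 2.65
```

With a preset summation threshold of, say, T_3 = 2.5, the two adjacent shot data sets would be merged, since K ≥ T_3.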
The beneficial effect of adopting the above further scheme is: and respectively carrying out summation calculation on all shot similarity screening data in each shot similarity screening data set to obtain a summation value corresponding to the shot similarity screening data set through a fourth formula, and further reducing processing data, thereby reducing the redundancy of generated description and improving the fluency of character description.
Another technical solution of the present invention for solving the above technical problems is as follows: a video description data processing apparatus comprising:
the video segmentation module is used for importing a video sequence and segmenting the video sequence into a plurality of video pictures;
the characteristic segmentation analysis module is used for carrying out characteristic segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of lens data sets;
the merging analysis module is used for merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
the feature extraction module is used for extracting features of the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
and the video description information acquisition module is used for converting the video description feature sequence into video description information through a preset video description model.
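A structural sketch of the apparatus is given below, with one injected callable per module; the class and parameter names are illustrative, and the internals of the preset convolutional neural network and the preset video description model are not fixed by the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class VideoDescriptionProcessor:
    """One field per module; each module is an injected callable."""
    split_video: Callable[[object], List]                # video segmentation module
    segment_features: Callable[[List], List[List]]       # feature segmentation analysis module
    merge_shots: Callable[[List[List]], List[List]]      # merging analysis module
    extract_features: Callable[[List[List]], Sequence]   # feature extraction module
    describe: Callable[[Sequence], str]                  # video description information acquisition module

    def run(self, video_sequence) -> str:
        pictures = self.split_video(video_sequence)        # video sequence -> video pictures
        shot_data_sets = self.segment_features(pictures)   # pictures -> shot data sets
        merged = self.merge_shots(shot_data_sets)          # shot data sets -> merged shot data sets
        feature_sequence = self.extract_features(merged)   # merged shots -> video description feature sequence
        return self.describe(feature_sequence)             # feature sequence -> video description information
```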
Another technical solution of the present invention for solving the above technical problems is as follows: a video description data processing apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor when executing the computer program implementing a video description data processing method as described above.
Another technical solution of the present invention for solving the above technical problems is as follows: a computer-readable storage medium, storing a computer program which, when executed by a processor, implements a video description data processing method as described above.
Drawings
Fig. 1 is a schematic flow chart of a video description data processing method according to an embodiment of the present invention;
fig. 2 is a block diagram of a video description data processing apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth to illustrate, but are not to be construed to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a video description data processing method according to an embodiment of the present invention.
Example 1:
as shown in fig. 1, a video description data processing method includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
and converting the video description feature sequence into video description information through a preset video description model.
It should be understood that the shot data sets are generated sequentially according to the time at which events occur in the input video, for example a first shot data set covering seconds 0 to 10 of the video and a second covering seconds 11 to 20, each shot data set being composed of several of the video pictures.
It should be understood that the preset convolutional neural network is constructed first, with V = {i_1, i_2, …, i_n} denoting a video sequence; the imported video sequence is then divided into individual frame pictures (i.e., the video pictures).
It should be understood that the preset video description model is used for adjusting the feature weight of the input features, then generating words, and then arranging the words in word order to form a sentence.
Specifically, the plurality of synthesized shots (i.e., the merged shot data sets) are input into the preset convolutional neural network, which outputs a video description feature sequence M = {f_1, f_2, …, f_n}; the video description feature sequence M is then input into the preset video description model for conversion to generate a text description (i.e., the video description information).
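A minimal runnable sketch of these final two steps is shown below, with toy stand-ins for the preset convolutional neural network and the preset video description model; the stand-ins and all names are illustrative assumptions, not the patent's models.

```python
import numpy as np

def generate_description(merged_shot_data_sets, cnn, description_model) -> str:
    # M = {f_1, f_2, ..., f_n}: one feature per merged shot, in temporal order
    m = [cnn(shot) for shot in merged_shot_data_sets]
    # the description model re-weights the features, generates words and
    # arranges them in order to form a sentence (the video description information)
    return description_model(m)

# toy stand-ins so the sketch runs end to end
toy_cnn = lambda shot: np.asarray(shot, dtype=float).mean(axis=0)              # mean-pool the frames of a shot
toy_description_model = lambda m: " ".join(f"word{i}" for i in range(len(m)))  # one placeholder word per feature

merged_shots = [np.random.rand(5, 8), np.random.rand(3, 8)]                    # two merged shots of 8-dim frame features
print(generate_description(merged_shots, toy_cnn, toy_description_model))      # -> "word0 word1"
```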
In this embodiment, feature segmentation analysis is performed on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets; all the shot data sets are merged and analyzed through the preset convolutional neural network to obtain a plurality of merged shot data sets; feature extraction is performed on the merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence; and the video description feature sequence is converted into video description information through a preset video description model. There is no need to generate a text description for each piece of shot data and then combine those descriptions into a final description; the natural language problem is converted directly into an image problem, which reduces the redundancy of the generated description and improves the fluency of the text description.
Example 2:
as shown in fig. 1, a video description data processing method includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
and when the video similarity is smaller than or equal to a preset segmentation threshold, taking the video picture corresponding to the video similarity and all the previous video pictures as shot data sets, thereby obtaining a plurality of shot data sets.
It should be understood that the video pictures are sequentially input into the preset convolutional neural network, each video picture yielding one video feature, so that a feature set Z = {z_1, z_2, …, z_i} (i.e., the plurality of video features) is obtained; clustering is then performed according to the video similarity, and video pictures whose video similarity is larger than the preset segmentation threshold T_1 are clustered into one shot data set, the shot data in the constructed shot data sets occurring in temporal order.
In the embodiment, the preset convolutional neural network is used for carrying out feature segmentation analysis on all video pictures to obtain a plurality of shot data sets, the same shots can be clustered, a data base is provided for subsequent processing, the problem of natural language can be directly converted into an image problem, the redundancy of generated description is reduced, and the fluency of character description is improved.
Example 3:
as shown in fig. 1, a video description data processing method includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of calculating the similarity of the two video features in each group to obtain the video similarity corresponding to the video features comprises the following steps:
and calculating the similarity of the two video features of each group through a first formula to obtain the video similarity corresponding to the video features, wherein the first formula is as follows:
L_n = cos(z_i, z_{i+1}),
where L_n is the video similarity between the i-th video feature and the (i+1)-th video feature, cos is the cosine distance function, z_i is the i-th video feature, and z_{i+1} is the (i+1)-th video feature.
In the embodiment, the video similarity corresponding to the video features is obtained by calculating the similarity of the two video features of each group in the first mode, so that a data base is provided for subsequent processing, the problem of natural language can be directly converted into the problem of images, the redundancy of generated description is reduced, and the fluency of text description is improved.
Example 4:
as shown in fig. 1, a video description data processing method includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of performing merging analysis on all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
and when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and the next lens data set corresponding to the lens similarity screening data set into a merged lens data set.
It should be understood that a set V = {{v_1, v_2, …, v_{m-1}}_1, {v_m, v_{m+1}, …, v_{w-1}}_2, …, {v_w, v_{w+1}, …, v_n}_i} is defined. First, the i-th shot data set in the set is input into the preset convolutional neural network, which outputs the shot feature sequence {f_1, f_2, …, f_i}; then the (i+1)-th shot data set in the set is input into the convolutional neural network, which outputs the shot feature sequence {f_1', f_2', …, f_i'}; and the feature similarity (i.e., the shot similarity) between each shot feature of the i-th shot feature sequence and each shot feature of the (i+1)-th shot feature sequence is calculated.
It should be understood that if the summation value K is greater than or equal to the preset summation threshold T_3, this indicates a strong correlation between the two shot data sets, and they can be combined into one merged shot data set.
In the embodiment, the preset convolutional neural network is used for merging and analyzing all shot data sets to obtain a plurality of merged shot data sets, so that the data volume can be reduced, the redundancy of generated description is reduced, and the fluency of text description is improved.
Example 5:
as shown in fig. 1, a method for processing video description data includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and a next lens data set corresponding to the lens similarity screening data set into a merged lens data set;
the process of respectively calculating the similarity between each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain the lens similarity corresponding to the lens features in the lens feature sequences includes:
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in a corresponding next lens feature sequence through a second formula to obtain lens similarity corresponding to the lens features in the lens feature sequences, wherein the second formula is as follows:
s_i = cos(f_i, f_i'),
where s_i is the shot similarity between the i-th shot feature in the shot feature sequence and the i-th shot feature in the next shot feature sequence, cos is the cosine distance function, f_i is the i-th shot feature in the shot feature sequence, and f_i' is the i-th shot feature in the next shot feature sequence.
In the above embodiment, the similarity between each shot feature in each shot feature sequence and each shot feature in the corresponding next shot feature sequence is calculated by the second formula to obtain the shot similarity corresponding to the shot feature in the shot feature sequence, so that a data basis is provided for subsequent processing, the redundancy of generated description is reduced, and the fluency of text description is improved.
Example 6:
as shown in fig. 1, a method for processing video description data includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and a next lens data set corresponding to the lens similarity screening data set into a merged lens data set;
the process of screening the feature similarity corresponding to each of the shot features in each of the shot feature sequences and the shot similarity corresponding to each of the shot features in a next shot feature sequence, and collecting all the screened shot similarities as a shot similarity screening dataset corresponding to the shot dataset includes:
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the corresponding next lens feature sequence through a third formula, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
Q = {Q_i} if card(S_i) > T_2,
where card(S_i) denotes an element of the shot similarity sequence S, T_2 is the preset shot similarity threshold, Q_i is the i-th shot similarity that satisfies the condition, Q is the shot similarity screening data set, and S_i is the shot similarity between the i-th shot feature and the i-th shot feature in the next shot feature sequence.
It should be understood that the preset segmentation threshold, the preset shot similarity threshold and the preset summation threshold are all empirical thresholds obtained by training on labelled samples.
It should be understood that Q is defined as the set of shot similarities (i.e., the shot similarity screening data set) that remain after discarding those smaller than the preset shot similarity threshold.
It should be understood that the formula means that an element of the shot similarity sequence S between adjacent shot feature sequences is put into the set Q if it is greater than the set threshold.
In the above embodiment, all the shot similarities obtained through screening of the feature similarity corresponding to each shot feature in each shot feature sequence and the shot similarity corresponding to each shot feature in the corresponding next shot feature sequence are respectively used as the shot similarity screening data set corresponding to the shot data set by the third formula, so that the processed data is reduced, the redundancy of the generated description is reduced, and the fluency of the text description is improved.
Example 7:
as shown in fig. 1, a method for processing video description data includes the following steps:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of performing merging analysis on all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and a next lens data set corresponding to the lens similarity screening data set into a merged lens data set;
the process of respectively performing summation calculation on all the shot similarity screening data in each shot similarity screening data set to obtain a summation value corresponding to the shot similarity screening data set comprises the following steps:
summing all the shot similarity screening data in each shot similarity screening data set respectively through a fourth formula to obtain a summation value corresponding to the shot similarity screening data set, where the fourth formula is as follows:
K = Σ_{i=1}^{n-1} Q_i,
where K is the summation value, Q_i is the i-th shot similarity in the shot similarity screening data set, and n-1 is the index of the last shot similarity screening datum in the shot similarity screening data set.
It should be understood that n is the number of the video features; after the pairwise comparison of adjacent features, there are n-1 similarity values, one fewer than the number of features.
In the embodiment, the fourth expression is used for respectively summing all the shot similarity screening data in each shot similarity screening data set to obtain the sum value corresponding to the shot similarity screening data set, so that the processing data is further reduced, the redundancy of the generated description is reduced, and the fluency of the text description is improved.
Fig. 2 is a block diagram of a video description data processing apparatus according to an embodiment of the present invention.
Example 8:
as shown in fig. 2, a video description data processing apparatus, comprising:
the video segmentation module is used for importing a video sequence and segmenting the video sequence into a plurality of video pictures;
the characteristic segmentation analysis module is used for carrying out characteristic segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of lens data sets;
the merging analysis module is used for merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
the feature extraction module is used for extracting features of the plurality of combined shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
and the video description information acquisition module is used for converting the video description feature sequence into video description information through a preset video description model.
Example 9:
a video description data processing apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, when executing the computer program, implementing a video description data processing method as described above. The device may be a computer or the like.
Example 10:
a computer-readable storage medium, storing a computer program which, when executed by a processor, implements a video description data processing method as described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A method for processing video description data, comprising the steps of:
importing a video sequence and dividing the video sequence into a plurality of video pictures;
performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
merging and analyzing all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets;
performing feature extraction on the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
converting the video description feature sequence into video description information through a preset video description model;
the process of performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets comprises the following steps:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing the two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking a video picture corresponding to the video similarity and all previous video pictures as shot data sets so as to obtain a plurality of shot data sets;
the process of performing merging analysis on all the lens data sets through the preset convolutional neural network to obtain a plurality of merged lens data sets comprises the following steps:
extracting features of each lens data set through the preset convolutional neural network to obtain lens feature sequences corresponding to the lens data sets, wherein each lens feature sequence comprises a plurality of lens features;
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in the corresponding next lens feature sequence to obtain lens similarity corresponding to the lens features in the lens feature sequences;
respectively screening the feature similarity corresponding to each lens feature in each lens feature sequence and the lens similarity corresponding to each lens feature in the next corresponding lens feature sequence, and collecting all the screened lens similarities as a lens similarity screening data set corresponding to the lens data set;
respectively carrying out summation calculation on all the lens similarity screening data in each lens similarity screening data set to obtain a summation value corresponding to the lens similarity screening data set;
and when the summation value is greater than or equal to a preset summation threshold value, merging the lens data set corresponding to the lens similarity screening data set and the next lens data set corresponding to the lens similarity screening data set into a merged lens data set.
2. The method according to claim 1, wherein the performing similarity calculation on the two video features in each group to obtain the video similarity corresponding to the video features comprises:
and calculating the similarity of the two video features of each group through a first formula to obtain the video similarity corresponding to the video features, wherein the first formula is as follows:
L_n = cos(z_i, z_{i+1}),
where L_n is the video similarity between the i-th video feature and the (i+1)-th video feature, cos is the cosine distance function, z_i is the i-th video feature, and z_{i+1} is the (i+1)-th video feature.
3. The method according to claim 1, wherein the step of calculating the similarity between each of the shot features in each of the shot feature sequences and each of the shot features in a corresponding next shot feature sequence to obtain the shot similarity corresponding to the shot features in the shot feature sequences comprises:
respectively carrying out similarity calculation on each lens feature in each lens feature sequence and each lens feature in a corresponding next lens feature sequence through a second formula to obtain lens similarity corresponding to the lens features in the lens feature sequences, wherein the second formula is as follows:
s_i = cos(f_i, f_i'),
where s_i is the shot similarity between the i-th shot feature in the shot feature sequence and the i-th shot feature in the next shot feature sequence, cos is the cosine distance function, f_i is the i-th shot feature in the shot feature sequence, and f_i' is the i-th shot feature in the next shot feature sequence.
4. The method according to claim 1, wherein the process of screening the shot similarities corresponding to the shot features in each shot feature sequence and the corresponding next shot feature sequence, and collecting all the shot similarities that pass the screening as the shot similarity screening data set corresponding to the shot data set comprises:
screening, through a third formula, the shot similarities corresponding to the shot features in each shot feature sequence and the corresponding next shot feature sequence, and collecting all the shot similarities that pass the screening as the shot similarity screening data set corresponding to the shot data set, wherein the third formula is as follows:
Q = {Q_i}, if card(S_i) > T_2,
wherein card(·) denotes the number of elements taken from the shot feature sequence, T_2 is the preset shot similarity threshold, Q_i is the i-th shot similarity screening data that meets the condition, Q is the shot similarity screening data set, and S_i is the shot similarity between the i-th shot feature and the (i+1)-th shot feature.
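Read as a filtering step, the third formula keeps only the shot similarities that exceed the preset threshold T_2. A sketch of that reading, with a placeholder threshold value:

```python
def screen_similarities(shot_sims, t2=0.8):
    """Third formula (one reading): collect the shot similarities that exceed T_2."""
    return [s for s in shot_sims if s > t2]

# Example: only the similarities above 0.8 survive the screening.
print(screen_similarities([0.95, 0.40, 0.87, 0.91]))  # [0.95, 0.87, 0.91]
```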
5. The method according to claim 1, wherein the process of summing all the shot similarity screening data in each shot similarity screening data set to obtain the summation value corresponding to the shot similarity screening data set comprises:
summing, through a fourth formula, all the shot similarity screening data in each shot similarity screening data set to obtain the summation value corresponding to the shot similarity screening data set, wherein the fourth formula is as follows:
K = ∑ Q_i,
wherein K is the summation value, the sum runs over all the shot similarity screening data in the shot similarity screening data set, Q_i is the i-th shot similarity screening data in the shot similarity screening data set, and n−1 indexes the last shot similarity screening data in the shot similarity screening data set.
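The fourth formula is a plain sum over the screened similarities, and the merge decision compares that sum with the preset summation threshold. A minimal sketch; the threshold value is illustrative only:

```python
def should_merge(screened_sims, sum_threshold=5.0):
    """Fourth formula plus the merge test: K = sum of screened similarities, merge if K >= threshold."""
    k = sum(screened_sims)
    return k >= sum_threshold, k

merge, k = should_merge([0.95, 0.87, 0.91, 0.93, 0.89, 0.90])
print(merge, round(k, 2))  # (True, 5.45) with these toy values
```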
6. A video description data processing apparatus, comprising:
the video segmentation module is used for importing a video sequence and segmenting the video sequence into a plurality of video pictures;
the feature segmentation analysis module is used for performing feature segmentation analysis on all the video pictures through a preset convolutional neural network to obtain a plurality of shot data sets;
the merging analysis module is used for performing merging analysis on all the shot data sets through the preset convolutional neural network to obtain a plurality of merged shot data sets;
the feature extraction module is used for extracting features from the plurality of merged shot data sets through the preset convolutional neural network to obtain a video description feature sequence;
the video description information acquisition module is used for converting the video description feature sequence into video description information through a preset video description model;
the feature segmentation analysis module is specifically configured to:
respectively extracting the features of each video picture through a preset convolutional neural network to obtain the video features corresponding to the video pictures;
dividing every two adjacent video features into one group, and performing similarity calculation on the two video features in each group to obtain the video similarity corresponding to the video features;
when the video similarity is smaller than or equal to a preset segmentation threshold, taking the video picture corresponding to the video similarity and all preceding video pictures as one shot data set, so as to obtain a plurality of shot data sets;
the merging analysis module is specifically configured to:
extract features from each shot data set through the preset convolutional neural network to obtain a shot feature sequence corresponding to each shot data set, wherein each shot feature sequence comprises a plurality of shot features;
perform similarity calculation on each shot feature in each shot feature sequence and each shot feature in the corresponding next shot feature sequence to obtain the shot similarity corresponding to the shot features in the shot feature sequence;
screen the shot similarities corresponding to the shot features in each shot feature sequence and the corresponding next shot feature sequence, and collect all the shot similarities that pass the screening as a shot similarity screening data set corresponding to the shot data set;
sum all the shot similarity screening data in each shot similarity screening data set to obtain the summation value corresponding to the shot similarity screening data set;
and when the summation value is greater than or equal to a preset summation threshold value, merge the shot data set corresponding to the shot similarity screening data set and the next shot data set into one merged shot data set.
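Claim 6 packages the same pipeline as cooperating modules. A structural sketch of how such an apparatus might be organized, assuming the stages are injected as callables (segment_video, split_into_shots, merge_shots, extract_sequence_features, describe are hypothetical names, not taken from the patent):

```python
class VideoDescriptionApparatus:
    """Module layout mirroring claim 6: segmentation -> shot splitting -> merging -> description."""

    def __init__(self, segment_video, split_into_shots, merge_shots,
                 extract_sequence_features, describe):
        self.segment_video = segment_video                            # video segmentation module
        self.split_into_shots = split_into_shots                      # feature segmentation analysis module
        self.merge_shots = merge_shots                                # merging analysis module
        self.extract_sequence_features = extract_sequence_features    # feature extraction module
        self.describe = describe                                      # video description information module

    def run(self, video_path):
        frames = self.segment_video(video_path)            # video sequence -> video pictures
        shot_sets = self.split_into_shots(frames)          # video pictures -> shot data sets
        merged_sets = self.merge_shots(shot_sets)          # shot data sets -> merged shot data sets
        features = self.extract_sequence_features(merged_sets)  # -> video description feature sequence
        return self.describe(features)                     # -> video description information
```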
7. A video description data processing apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that when the computer program is executed by the processor, the video description data processing method according to any one of claims 1 to 5 is implemented.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a video description data processing method according to any one of claims 1 to 5.
CN202110476061.5A 2021-04-29 2021-04-29 Video description data processing method, device and storage medium Active CN113191262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110476061.5A CN113191262B (en) 2021-04-29 2021-04-29 Video description data processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110476061.5A CN113191262B (en) 2021-04-29 2021-04-29 Video description data processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN113191262A CN113191262A (en) 2021-07-30
CN113191262B true CN113191262B (en) 2022-08-19

Family

ID=76980672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110476061.5A Active CN113191262B (en) 2021-04-29 2021-04-29 Video description data processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113191262B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597341A (en) * 2018-05-25 2021-04-02 中科寒武纪科技股份有限公司 Video retrieval method and video retrieval mapping relation generation method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495B (en) * 2015-10-23 2019-06-04 天津大学 A kind of video presentation method summarized based on deep learning and text
EP3166075B1 (en) * 2015-11-05 2020-08-05 Facebook, Inc. Systems and methods for processing content using convolutional neural networks
CN108447501B (en) * 2018-03-27 2020-08-18 中南大学 Pirated video detection method and system based on audio words in cloud storage environment
CN108228915B (en) * 2018-03-29 2021-10-26 华南理工大学 Video retrieval method based on deep learning
CN109189989B (en) * 2018-07-23 2020-11-03 北京市商汤科技开发有限公司 Video description method and device, computer equipment and storage medium
CN109359214A (en) * 2018-10-15 2019-02-19 平安科技(深圳)有限公司 Video presentation generation method, storage medium and terminal device neural network based
CN110909207B (en) * 2019-09-08 2023-06-02 东南大学 News video description data set construction method containing sign language
US11676278B2 (en) * 2019-09-26 2023-06-13 Intel Corporation Deep learning for dense semantic segmentation in video with automated interactivity and improved temporal coherence

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597341A (en) * 2018-05-25 2021-04-02 中科寒武纪科技股份有限公司 Video retrieval method and video retrieval mapping relation generation method and device

Also Published As

Publication number Publication date
CN113191262A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN108984530B (en) Detection method and detection system for network sensitive content
Zhang et al. Face sketch synthesis via sparse representation-based greedy search
Klibisz et al. Fast, simple calcium imaging segmentation with fully convolutional networks
Zhao et al. A language model based evaluator for sentence compression
CN114023412B (en) ICD code prediction method and system based on joint learning and denoising mechanism
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
L Fernandes et al. A novel decision support for composite sketch matching using fusion of probabilistic neural network and dictionary matching
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
Reza et al. A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model
CN112765354B (en) Model training method, model training device, computer apparatus, and storage medium
CN112329663B (en) Micro-expression time detection method and device based on face image sequence
CN113191262B (en) Video description data processing method, device and storage medium
Sharma et al. A generalized zero-shot quantization of deep convolutional neural networks via learned weights statistics
CN109344252B (en) Microblog text classification method and system based on high-quality theme extension
Doughman et al. Time-aware word embeddings for three Lebanese news archives
CN116935057A (en) Target evaluation method, electronic device, and computer-readable storage medium
CN110796134A (en) Method for combining words of Chinese characters in strong-noise complex background image
CN113496228B (en) Human body semantic segmentation method based on Res2Net, transUNet and cooperative attention
US11270155B2 (en) Duplicate image detection based on image content
CN114979620A (en) Video bright spot segment detection method and device, electronic equipment and storage medium
Rodin et al. Document image quality assessment via explicit blur and text size estimation
Rust et al. Towards Privacy-Aware Sign Language Translation at Scale
CN111340329A (en) Actor assessment method and device and electronic equipment
CN113392722A (en) Method and device for recognizing emotion of object in video, electronic equipment and storage medium
CN113191263B (en) Video description method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant