CN111339369A - Video retrieval method, system, computer equipment and storage medium based on depth features - Google Patents


Info

Publication number: CN111339369A
Application number: CN202010115194.5A
Authority: CN (China)
Prior art keywords: video, frame, key frame, DenseNet, key
Legal status: Pending
Original language: Chinese (zh)
Inventors: 曾凡智, 程勇, 周燕, 陈嘉文
Current Assignee: Foshan University
Original Assignee: Foshan University
Application filed by Foshan University
Priority date / Filing date: 2020-02-25
Publication date: 2020-06-26

Classifications

    • G06F16/783 — Information retrieval of video data; retrieval characterised by metadata automatically derived from the content
    • G06F16/71 — Information retrieval of video data; indexing; data structures therefor; storage structures
    • G06F16/738 — Information retrieval of video data; querying; presentation of query results
    • G06F16/75 — Information retrieval of video data; clustering; classification
    • G06N3/045 — Neural networks; architectures; combinations of networks
    • G06N3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video retrieval method, system, computer equipment and storage medium based on depth features. The method comprises the following steps: constructing a convolutional neural network, wherein the convolutional neural network is a DenseNet model; acquiring a plurality of videos; extracting depth feature vectors of the video frames in each video using the DenseNet model; for each video, extracting key frames according to the depth feature vectors of its video frames and outputting a key frame set; establishing an index relationship between each video and its key frame set and storing the relationship in a video feature database; and retrieving videos from the video feature database according to an image or short video provided by the user and outputting a video retrieval result. The invention realizes a video retrieval function with high accuracy and high recall.

Description

Video retrieval method, system, computer equipment and storage medium based on depth features
Technical Field
The invention relates to a video retrieval method, a system, computer equipment and a storage medium based on depth features, and belongs to the field of video retrieval.
Background
Currently, video retrieval based on text labeling is relatively mature and widely applied in the market. This approach requires the videos in a library to be manually summarized and annotated in advance, so the retrieval result depends entirely on the user's wording and on the manually labelled information. However, as the number of videos grows and their content diversifies, conventional retrieval based on manual text annotation can no longer meet the demand for higher-level video retrieval. Meanwhile, most content-based video retrieval systems rely on features such as color, texture, shape and SIFT, which are susceptible to video blur, noise and illumination changes.
In recent years, deep learning has achieved excellent results in video and image processing. Deep feature descriptors have a strong capacity to describe image content, so retrieval built on them can satisfy the demand for higher-level video retrieval, with broad application prospects in security monitoring, remote online education, film and television copyright protection, network short-video review and other fields.
Disclosure of Invention
In view of the above, the present invention provides a video retrieval method, system, computer device and storage medium based on depth features, which realize a video retrieval function with high accuracy and high recall.
A first object of the invention is to provide a video retrieval method based on depth features.
A second object of the invention is to provide a video retrieval system based on depth features.
A third object of the invention is to provide a computer device.
A fourth object of the invention is to provide a storage medium.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a method for depth feature-based video retrieval, the method comprising:
constructing a convolutional neural network; wherein the convolutional neural network is a DenseNet model;
acquiring a plurality of videos;
extracting a depth feature vector of a video frame in each video by using a DenseNet model;
for each video, extracting key frames according to the depth feature vectors of the video frames, and outputting a key frame set;
establishing an index relationship between each video and the key frame set of each video, and storing the index relationship in a video feature database;
and retrieving videos from the video feature database according to an image or short video provided by the user, and outputting a video retrieval result.
Further, the DenseNet model adopts a DenseNet-201 model;
the DenseNet-201 model comprises a convolution layer, a pooling layer, a first dense block, a first transition layer, a second dense block, a second transition layer, a third dense block, a third transition layer, a fourth dense block and a classification layer which are sequentially connected.
Further, the extracting key frames according to the depth feature vectors of the video frames and outputting a key frame set specifically comprises:
setting the 1st frame as the reference frame, taking the reference frame as a key frame, and adding it to the key frame set;
calculating the cosine angle similarity between the current frame and the reference frame according to their depth feature vectors;
if the cosine angle similarity is smaller than a threshold, comparing the current frame with the key frame set; if it does not repeat an existing key frame, taking the current frame as a key frame, adding it to the key frame set, and updating the current frame to be the reference frame;
if the updated reference frame is not the last frame, repeating the cosine angle similarity calculation between the current frame and the reference frame and executing the subsequent operations; if the updated reference frame is the last frame, outputting the key frame set.
Further, the cosine angle similarity is calculated as follows:

T = cos(I_k, I_ref) = (I_k · I_ref) / (||I_k|| ||I_ref||)

where I_k denotes the depth feature vector of the current frame and I_ref denotes the depth feature vector of the reference frame.
Further, retrieving videos from the video feature database according to an image provided by the user and outputting a video retrieval result specifically comprises:
extracting the features of the image with the DenseNet model according to the image provided by the user, comparing them against the database by cosine angle similarity, and outputting the N most similar videos in descending order of similarity.
Further, retrieving videos from the video feature database according to a short video provided by the user and outputting a video retrieval result specifically comprises:
extracting the features of the short video with the DenseNet model according to the short video provided by the user, matching the key frame set of the short video against all key frame sets in the database by similarity in a sliding-window manner, sorting by similarity in descending order, and outputting the N most similar videos.
The second purpose of the invention can be achieved by adopting the following technical scheme:
a depth feature based video retrieval system, the system comprising:
the convolutional neural network construction module is used for constructing a convolutional neural network; wherein the convolutional neural network is a DenseNet model;
the video acquisition module is used for acquiring a plurality of videos;
the video frame feature extraction module is used for extracting a depth feature vector of a video frame in each video by using a DenseNet model;
the key frame extraction module is used for extracting key frames and outputting a key frame set according to the depth feature vectors of the video frames aiming at each video;
the index establishing module is used for establishing an index relationship between each video and the key frame set of each video, and storing the index relationship in a video feature database;
and the video retrieval module is used for retrieving videos from the video feature database according to an image or short video provided by the user, and outputting a video retrieval result.
Further, the DenseNet model adopts a DenseNet-201 model;
the DenseNet-201 model comprises a convolution layer, a pooling layer, a first dense block, a first transition layer, a second dense block, a second transition layer, a third dense block, a third transition layer, a fourth dense block and a classification layer which are sequentially connected.
The third purpose of the invention can be achieved by adopting the following technical scheme:
a computer device comprises a processor and a memory for storing processor executable programs, and when the processor executes the programs stored in the memory, the video retrieval method is realized.
The fourth purpose of the invention can be achieved by adopting the following technical scheme:
a storage medium stores a program which, when executed by a processor, implements the video retrieval method described above.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention first uses the DenseNet model as the convolutional neural network. The DenseNet model extends convolutional network connectivity beyond the ResNet model: for any layer within a dense block, the feature maps of all preceding layers are its input, and its own feature map is an input to all subsequent layers. This design alleviates the vanishing-gradient problem, strengthens feature-map propagation, improves feature reuse, greatly reduces the number of parameters, and makes the extracted features richer and more diverse. Secondly, whereas the color, texture and shape features adopted by traditional content-based video retrieval are easily disturbed by noise and illumination, the convolutional neural network can extract highly abstract, well-generalizing and robust deep features of images; on this basis, video shot segmentation, video frame depth feature extraction, key frame extraction and video feature database construction are realized, and finally a content-based video retrieval function is achieved.
2. The invention provides an image depth feature descriptor: when video frame features are extracted, a DenseNet model is introduced and the feature vector of its penultimate, fully connected layer is used as the image feature. The network model reaches 95% top-5 classification accuracy on the large-scale ImageNet dataset, its depth features overcome the susceptibility of traditional color, texture and shape features to noise and illumination, and it generalizes well; experiments show that the method outperforms the current state of the art, realizing a video retrieval function with high accuracy and high recall.
3. In extracting video key frames, a reference frame mechanism is introduced and key frames are extracted adaptively according to a threshold, eliminating the shot segmentation and key frame clustering that traditional key frame extraction methods must perform first.
4. The invention retrieves videos in the video feature database by image or by short video, so videos with similar content features can be found directly, quickly and accurately among massive videos; this image- or short-video-based retrieval model provides precise querying for a search engine, letting users find the most relevant videos and improving work efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in their description are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a video retrieval method according to embodiment 1 of the present invention.
Fig. 2 is a structural diagram of the DenseNet model.
Fig. 3 is a video inter-frame similarity curve according to embodiment 1 of the present invention.
Fig. 4 is a flowchart of extracting key frames according to embodiment 1 of the present invention.
Fig. 5 is a flowchart of retrieving a video according to an image provided by a user according to embodiment 1 of the present invention.
Fig. 6 is a flowchart of retrieving a video according to a short video provided by a user according to embodiment 1 of the present invention.
Fig. 7 is a block diagram of a video retrieval system according to embodiment 2 of the present invention.
Fig. 8 is a block diagram of a computer device according to embodiment 3 of the present invention.
Fig. 9 is a block diagram of a main program of video retrieval in video retrieval software installed in a computer device according to embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below with reference to the drawings. It is obvious that the described embodiments are only some, not all, embodiments of the present invention; all other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Example 1:
as shown in fig. 1, the present embodiment provides a video retrieval method based on depth features, which includes the following steps:
and S101, constructing a convolutional neural network.
In the field of computer vision, convolutional neural networks have become the mainstream method. Compared with low-level handcrafted features, the image features used here are based on a Convolutional Neural Network (CNN) model: a trained convolutional neural network model is used to extract the image depth features of a single key frame.
In recent years, five classical convolutional neural network models have appeared — in order of appearance, AlexNet, VGGNet, InceptionNet (GoogLeNet), ResNet and DenseNet — which respectively won championships in the image recognition task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) from 2012 to 2017. Compared with traditional video retrieval algorithms based on shape, color, texture, SIFT and the like, these network models perform excellently in the field of image recognition. The core of the ResNet model is to train a deeper convolutional neural network by establishing "short-circuit connections" between earlier and later layers, which helps gradients propagate backwards during training; the DenseNet model follows the same basic idea but establishes dense connections from all earlier layers to all later layers, as shown in fig. 2. Another great feature of the DenseNet model is feature reuse through concatenating features along the channel dimension, which allows it to achieve better performance than the ResNet model with fewer parameters and less computation; the DenseNet paper accordingly won the CVPR 2017 Best Paper Award.
The advantages of the DenseNet model are mainly the following: 1) owing to the dense connection pattern, the DenseNet model promotes backward propagation of gradients, making the network easier to train; 2) it has fewer parameters and computes more efficiently, because the short-circuit connections are realized by concatenating features, features are reused, a small growth rate is adopted, and each layer's own feature map is small; 3) thanks to feature reuse, the final classifier also exploits low-level features.
The DenseNet model of this embodiment adopts the DenseNet-201 model, implemented with the TensorFlow framework. The number of convolutional layers reaches 201, yet the parameter size is only 80 MB, making it a lightweight network model, and it reaches 95% top-5 classification accuracy on the large-scale ImageNet dataset. The specific structure of the DenseNet-201 network is shown in Table 1: it comprises, connected in sequence, a convolutional layer (Convolution), a pooling layer (Pooling), a first dense block (Dense Block 1), a first transition layer (Transition Layer 1), a second dense block (Dense Block 2), a second transition layer (Transition Layer 2), a third dense block (Dense Block 3), a third transition layer (Transition Layer 3), a fourth dense block (Dense Block 4) and a classification layer (Classification Layer), where k = 32 denotes the growth rate of the number of channels.
TABLE 1 DenseNet-201 model structure (growth rate k = 32)

Layer                | Output size | Configuration
Convolution          | 112×112     | 7×7 conv, stride 2
Pooling              | 56×56       | 3×3 max pool, stride 2
Dense Block 1        | 56×56       | [1×1 conv, 3×3 conv] × 6
Transition Layer 1   | 28×28       | 1×1 conv; 2×2 average pool, stride 2
Dense Block 2        | 28×28       | [1×1 conv, 3×3 conv] × 12
Transition Layer 2   | 14×14       | 1×1 conv; 2×2 average pool, stride 2
Dense Block 3        | 14×14       | [1×1 conv, 3×3 conv] × 48
Transition Layer 3   | 7×7         | 1×1 conv; 2×2 average pool, stride 2
Dense Block 4        | 7×7         | [1×1 conv, 3×3 conv] × 32
Classification Layer | 1×1         | 7×7 global average pool; 1000D fully connected, softmax
S102, acquiring a plurality of videos.
The videos of this embodiment can be acquired by collection, for example by shooting a plurality of videos with a camera.
S103, extracting the depth feature vector of the video frame in each video by using a DenseNet model.
Using the DenseNet-201 network model loaded with pre-trained parameters, a 1920-dimensional feature vector is extracted for each video frame in every video, taken from the output of the penultimate, fully connected layer of the network model.
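Although the patent gives no code, this step maps directly onto the TensorFlow framework it names. A minimal sketch, assuming the Keras-bundled DenseNet-201 ImageNet weights and using global average pooling to obtain the 1920-dimensional vector; the helper name extract_frame_features is illustrative:

```python
import numpy as np
import tensorflow as tf

# Sketch, not the patent's code: DenseNet-201 with ImageNet weights and the
# classifier removed. pooling='avg' applies global average pooling, yielding
# exactly the 1920-dimensional vector that feeds the classification layer.
feature_extractor = tf.keras.applications.DenseNet201(
    include_top=False, weights="imagenet", pooling="avg")

def extract_frame_features(frames):
    """frames: list of HxWx3 uint8 RGB video frames (convert from BGR first
    if they were decoded with OpenCV). Returns an (N, 1920) float array."""
    batch = np.stack([tf.image.resize(f, (224, 224)).numpy() for f in frames])
    batch = tf.keras.applications.densenet.preprocess_input(batch)
    return feature_extractor.predict(batch, verbose=0)
```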
And S104, aiming at each video, extracting key frames according to the depth feature vectors of the video frames, and outputting a key frame set.
In this embodiment, the cosine angle distance is used to measure the similarity between consecutive frames, and key frame extraction is achieved by threshold comparison. The cosine angle similarity takes values in [0, 1] and is calculated as follows:

T = cos(I_k, I_{k-1}) = (I_k · I_{k-1}) / (||I_k|| ||I_{k-1}||)    (1)

where I_k denotes the depth feature vector of the current frame and I_{k-1} denotes the depth feature vector of the previous frame. The resulting inter-frame similarity curve is shown in fig. 3.
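Formula (1) is a few lines of NumPy; the following sketch (an illustration, not taken from the patent) is reused by the later steps. The [0, 1] range holds here because the ReLU-activated DenseNet features are non-negative:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine angle similarity of two depth feature vectors; lies in [0, 1]
    for non-negative (ReLU) features, with 1 meaning identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```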
In the key frame extraction process, a reference frame mechanism is introduced so that gradual shot transitions and key frame de-duplication are handled at the same time. As shown in fig. 4, the key frames are extracted according to the depth feature vectors of the video frames and a key frame set is output, specifically as follows (a code sketch of the complete loop follows step S1044):
S1041, setting the 1st frame as the reference frame and adding it, as a key frame, to the key frame set.
Let the video have N video frames. The 1st frame (i.e. the 1st video frame) is set as the reference frame, taken as a key frame, and added to the key frame set.
S1042, calculating the cosine angle similarity T between the current frame and the reference frame according to their depth feature vectors.
The cosine angle similarity T of this embodiment follows the cosine angle distance of formula (1) above, with the previous frame replaced by the reference frame:

T = cos(I_k, I_ref) = (I_k · I_ref) / (||I_k|| ||I_ref||)    (2)

where I_k denotes the depth feature vector of the current frame and I_ref denotes the depth feature vector of the reference frame.
S1043, if the cosine angle similarity T is smaller than the threshold e, comparing the current frame with the key frame set; if it does not repeat an existing key frame, taking the current frame as a key frame, adding it to the key frame set, updating the current frame to be the reference frame, and proceeding to step S1044. If the cosine angle similarity T is greater than or equal to the threshold e, taking the next frame as the current frame and returning to step S1042.
S1044, if the updated reference frame is not the last frame (i.e. the Nth frame), the loop has not ended: return to step S1042. If the updated reference frame is the last frame, the loop ends: output the key frame set.
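Steps S1041 to S1044 amount to a single pass over the frame features. A sketch under the assumptions that `features` holds the per-frame vectors from the DenseNet model and `cosine_similarity` is the helper above; the threshold values and the exact de-duplication test are illustrative, since the patent only says the candidate is "compared with the key frame set":

```python
def extract_key_frames(features, e=0.85, dup_threshold=0.95):
    """Reference-frame key frame extraction (steps S1041-S1044).
    features: depth feature vectors of the N frames, in temporal order.
    e: shot-change threshold (illustrative value).
    dup_threshold: candidates at least this similar to an existing key
        frame are treated as repeats and skipped (assumed rule).
    Returns the indices of the key frames."""
    key_frames = [0]  # S1041: frame 1 is both the reference and a key frame
    ref = 0
    for k in range(1, len(features)):
        t = cosine_similarity(features[k], features[ref])       # S1042
        if t < e:                                                # S1043
            if not any(cosine_similarity(features[k], features[j])
                       >= dup_threshold for j in key_frames):
                key_frames.append(k)
                # The patent leaves open whether the reference also moves
                # when the candidate is a repeat; here it moves only when
                # a key frame is added.
                ref = k
    return key_frames                                            # S1044
```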
S105, establishing an index relationship between each video and the key frame set of each video, and storing it in the video feature database.
Specifically, an index relationship is established between the video id of each video and its key frame set, and the relationship is stored in the video feature database, as shown in Table 2 below.
TABLE 2 Video feature database

Video id (Video_id) | Key frame feature (Key_frame_feature) | Time (Time)
Video A             | Key frame 1                           | 0:30
Video A             | Key frame 2                           | 1:34
Video B             | Key frame 1                           | 0:17
Video C             | Key frame 1                           | 0:19
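One possible realization of this index, sketched with SQLite; the patent does not name a storage engine, and the schema simply mirrors Table 2, with the 1920-dimensional vector serialized as a BLOB (an illustrative choice):

```python
import sqlite3
import numpy as np

conn = sqlite3.connect("video_features.db")
conn.execute("""CREATE TABLE IF NOT EXISTS key_frames (
                    video_id TEXT,
                    key_frame_feature BLOB,
                    time TEXT)""")

def index_video(video_id, key_frame_features, timestamps):
    """Store each key frame's depth feature vector under its video id,
    mirroring the Video_id / Key_frame_feature / Time columns of Table 2."""
    for vec, ts in zip(key_frame_features, timestamps):
        blob = np.asarray(vec, dtype=np.float32).tobytes()
        conn.execute("INSERT INTO key_frames VALUES (?, ?, ?)",
                     (video_id, blob, ts))
    conn.commit()
```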
Steps S101 to S105 constitute the video storage stage, and step S106 is the video retrieval stage. It can be understood that steps S101 to S105 are completed on one computer device; the retrieval stage of step S106 can then be performed on that computer device or on other networked computer devices.
S106, retrieving videos from the video feature database according to the image or short video provided by the user, and outputting a video retrieval result.
At present, mainstream video retrieval is mainly keyword-based, and with the emergence of massive videos, keyword retrieval consumes a large amount of manual annotation time. This embodiment can instead retrieve videos from the video feature database in two ways, retrieval by image and retrieval by short video, described as follows:
1) Retrieving videos from the video feature database according to an image provided by the user, and outputting a video retrieval result.
Specifically, according to the image provided by the user, the 1920-dimensional features of the image are extracted with the DenseNet model and compared against the database by cosine angle similarity, and the top N most similar videos are output in descending order of similarity, as shown in fig. 5.
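A sketch of the image query path, reusing extract_frame_features and cosine_similarity from above; scoring each video by its best-matching key frame is an assumption about how the comparison against the database is aggregated:

```python
def retrieve_by_image(image_rgb, db_rows, n=10):
    """db_rows: iterable of (video_id, feature_vector) pairs loaded from the
    video feature database. Returns the top-N (video_id, score) pairs."""
    query = extract_frame_features([image_rgb])[0]    # 1920-D query vector
    best = {}
    for video_id, vec in db_rows:
        s = cosine_similarity(query, vec)
        best[video_id] = max(best.get(video_id, 0.0), s)  # best key frame wins
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:n]
```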
2) Retrieving videos from the video feature database according to a short video provided by the user, and outputting a video retrieval result.
Specifically, according to the short video provided by the user, the features of the short video are extracted with the DenseNet model, the key frame set of the short video is matched against all key frame sets in the database by similarity in a sliding-window manner, and the top N most similar videos are output in descending order of similarity, as shown in fig. 6.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
It should be noted that although the method operations of the above embodiments are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the depicted steps may be executed in a different order; additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be broken down into multiple steps.
Example 2:
as shown in fig. 7, this embodiment provides a depth feature-based video retrieval system, which includes a convolutional neural network construction module 701, a video acquisition module 702, a video frame feature extraction module 703, a key frame extraction module 704, an index establishment module 705, and a video retrieval module 706, where specific functions of each module are as follows:
the convolutional neural network constructing module 701 is configured to construct a convolutional neural network; wherein the convolutional neural network is a DenseNet model.
The video obtaining module 702 is configured to obtain a plurality of videos.
The video frame feature extraction module 703 is configured to extract a depth feature vector of a video frame in each video by using a DenseNet model.
The key frame extracting module 704 is configured to, for each video, extract a key frame according to the depth feature vector of the video frame, and output a key frame set.
The index establishing module 705 is configured to establish an index relationship between each video and the key frame set of each video, and store the index relationship in the video feature database.
The video retrieval module 706 is configured to retrieve videos in the video feature database according to images or short videos provided by the user, and output a video retrieval result.
For the specific implementation of each module in this embodiment, reference may be made to embodiment 1, which is not repeated here. It should be noted that the system provided in this embodiment is only illustrated by the above division of functional modules; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.
Example 3:
This embodiment provides a computer device, which may be a computer. As shown in fig. 8, it comprises a processor 802, a memory, an input device 803, a display 804 and a network interface 805 connected by a system bus 801. The processor provides computing and control capability; the memory includes a nonvolatile storage medium 806 and an internal memory 807, where the nonvolatile storage medium 806 stores an operating system, computer programs and a database, and the internal memory 807 provides an environment for running the operating system and the computer programs in the nonvolatile storage medium. When the processor 802 executes the computer programs stored in the memory, the video retrieval method of embodiment 1 above is implemented as follows:
constructing a convolutional neural network; wherein the convolutional neural network is a DenseNet model;
acquiring a plurality of videos;
extracting a depth feature vector of a video frame in each video by using a DenseNet model;
for each video, extracting key frames according to the depth feature vectors of the video frames, and outputting a key frame set;
establishing an index relationship between each video and the key frame set of each video, and storing the index relationship in a video feature database;
and retrieving videos from the video feature database according to an image or short video provided by the user, and outputting a video retrieval result.
Further, the DenseNet model adopts a DenseNet-201 model;
the DenseNet-201 model comprises a convolution layer, a pooling layer, a first dense block, a first transition layer, a second dense block, a second transition layer, a third dense block, a third transition layer, a fourth dense block and a classification layer which are sequentially connected.
Further, the extracting key frames according to the depth feature vectors of the video frames and outputting a key frame set specifically comprises:
setting the 1st frame as the reference frame, taking the reference frame as a key frame, and adding it to the key frame set;
calculating the cosine angle similarity between the current frame and the reference frame according to their depth feature vectors;
if the cosine angle similarity is smaller than a threshold, comparing the current frame with the key frame set; if it does not repeat an existing key frame, taking the current frame as a key frame, adding it to the key frame set, and updating the current frame to be the reference frame;
if the updated reference frame is not the last frame, repeating the cosine angle similarity calculation between the current frame and the reference frame and executing the subsequent operations; if the updated reference frame is the last frame, outputting the key frame set.
Further, retrieving videos from the video feature database according to an image provided by the user and outputting a video retrieval result specifically comprises:
extracting the features of the image with the DenseNet model according to the image provided by the user, comparing them against the database by cosine angle similarity, and outputting the N most similar videos in descending order of similarity.
Further, retrieving videos from the video feature database according to a short video provided by the user and outputting a video retrieval result specifically comprises:
extracting the features of the short video with the DenseNet model according to the short video provided by the user, matching the key frame set of the short video against all key frame sets in the database by similarity in a sliding-window manner, sorting by similarity in descending order, and outputting the N most similar videos.
The computer device of this embodiment can be installed with video retrieval software implementing the above video retrieval method. As shown in fig. 9, the video retrieval program of this software consists of video warehousing and video retrieval. Video warehousing mainly comprises building the convolutional neural network, extracting video frame features, extracting key frames and establishing the index, so that videos and the depth features of their key frames are indexed and stored for subsequent retrieval; video retrieval includes retrieving videos with a picture and retrieving videos with a short video.
Example 4:
This embodiment provides a storage medium, namely a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the video retrieval method of embodiment 1 above is implemented as follows:
constructing a convolutional neural network; wherein the convolutional neural network is a DenseNet model;
acquiring a plurality of videos;
extracting a depth feature vector of a video frame in each video by using a DenseNet model;
for each video, extracting key frames according to the depth feature vectors of the video frames, and outputting a key frame set;
establishing an index relationship between each video and the key frame set of each video, and storing the index relationship in a video feature database;
and retrieving videos from the video feature database according to an image or short video provided by the user, and outputting a video retrieval result.
Further, the DenseNet model adopts a DenseNet-201 model;
the DenseNet-201 model comprises a convolution layer, a pooling layer, a first dense block, a first transition layer, a second dense block, a second transition layer, a third dense block, a third transition layer, a fourth dense block and a classification layer which are sequentially connected.
Further, the extracting key frames according to the depth feature vectors of the video frames and outputting a key frame set specifically comprises:
setting the 1st frame as the reference frame, taking the reference frame as a key frame, and adding it to the key frame set;
calculating the cosine angle similarity between the current frame and the reference frame according to their depth feature vectors;
if the cosine angle similarity is smaller than a threshold, comparing the current frame with the key frame set; if it does not repeat an existing key frame, taking the current frame as a key frame, adding it to the key frame set, and updating the current frame to be the reference frame;
if the updated reference frame is not the last frame, repeating the cosine angle similarity calculation between the current frame and the reference frame and executing the subsequent operations; if the updated reference frame is the last frame, outputting the key frame set.
Further, retrieving videos from the video feature database according to an image provided by the user and outputting a video retrieval result specifically comprises:
extracting the features of the image with the DenseNet model according to the image provided by the user, comparing them against the database by cosine angle similarity, and outputting the N most similar videos in descending order of similarity.
Further, retrieving videos from the video feature database according to a short video provided by the user and outputting a video retrieval result specifically comprises:
extracting the features of the short video with the DenseNet model according to the short video provided by the user, matching the key frame set of the short video against all key frame sets in the database by similarity in a sliding-window manner, sorting by similarity in descending order, and outputting the N most similar videos.
The storage medium described in this embodiment may be a magnetic disk, an optical disc, a computer memory, a random access memory (RAM), a USB flash drive, a removable hard disk, or other media.
In summary, the invention first uses the DenseNet model as the convolutional neural network. The DenseNet model extends convolutional network connectivity beyond the ResNet model: for any layer within a dense block, the feature maps of all preceding layers are its input, and its own feature map is an input to all subsequent layers. This design alleviates the vanishing-gradient problem, strengthens feature-map propagation, improves feature reuse, greatly reduces the number of parameters, and makes the extracted features richer and more diverse. Secondly, whereas the color, texture and shape features adopted by traditional content-based video retrieval are easily disturbed by noise and illumination, the convolutional neural network can extract highly abstract, well-generalizing and robust deep features of images, realizing video shot segmentation, video frame depth feature extraction, key frame extraction and video feature database construction, and finally a content-based video retrieval function.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any substitution or change made by a person skilled in the art within the technical solution and inventive concept of the present invention falls within the protection scope of the present invention.

Claims (10)

1. A method for video retrieval based on depth features, the method comprising:
constructing a convolutional neural network; wherein the convolutional neural network is a DenseNet model;
acquiring a plurality of videos;
extracting a depth feature vector of a video frame in each video by using a DenseNet model;
for each video, extracting key frames according to the depth feature vectors of the video frames, and outputting a key frame set;
establishing an index relationship between each video and the key frame set of each video, and storing the index relationship in a video feature database;
and retrieving videos from the video feature database according to an image or short video provided by the user, and outputting a video retrieval result.
2. The video retrieval method of claim 1, wherein the DenseNet model employs a DenseNet-201 model;
the DenseNet-201 model comprises a convolution layer, a pooling layer, a first dense block, a first transition layer, a second dense block, a second transition layer, a third dense block, a third transition layer, a fourth dense block and a classification layer which are sequentially connected.
3. The video retrieval method according to claim 1, wherein extracting key frames according to the depth feature vectors of the video frames and outputting a key frame set specifically comprises:
setting the 1st frame as the reference frame, taking the reference frame as a key frame, and adding it to the key frame set;
calculating the cosine angle similarity between the current frame and the reference frame according to their depth feature vectors;
if the cosine angle similarity is smaller than a threshold, comparing the current frame with the key frame set; if it does not repeat an existing key frame, taking the current frame as a key frame, adding it to the key frame set, and updating the current frame to be the reference frame;
if the updated reference frame is not the last frame, repeating the cosine angle similarity calculation between the current frame and the reference frame and executing the subsequent operations; if the updated reference frame is the last frame, outputting the key frame set.
4. The video retrieval method of claim 3, wherein the cosine angle similarity is calculated as follows:

T = cos(I_k, I_ref) = (I_k · I_ref) / (||I_k|| ||I_ref||)

wherein I_k denotes the depth feature vector of the current frame and I_ref denotes the depth feature vector of the reference frame.
5. The video retrieval method according to any one of claims 1 to 4, wherein retrieving videos in the video feature database according to an image provided by a user and outputting a video retrieval result specifically comprises:
extracting the features of the image with the DenseNet model according to the image provided by the user, comparing them against the database by cosine angle similarity, and outputting the N most similar videos in descending order of similarity.
6. The video retrieval method according to any one of claims 1 to 4, wherein retrieving videos in the video feature database according to a short video provided by a user and outputting a video retrieval result specifically comprises:
extracting the features of the short video with the DenseNet model according to the short video provided by the user, matching the key frame set of the short video against all key frame sets in the database by similarity in a sliding-window manner, sorting by similarity in descending order, and outputting the N most similar videos.
7. A depth feature-based video retrieval system, the system comprising:
the convolutional neural network construction module is used for constructing a convolutional neural network; wherein the convolutional neural network is a DenseNet model;
the video acquisition module is used for acquiring a plurality of videos;
the video frame feature extraction module is used for extracting a depth feature vector of a video frame in each video by using a DenseNet model;
the key frame extraction module is used for extracting key frames and outputting a key frame set according to the depth feature vectors of the video frames aiming at each video;
the index establishing module is used for establishing an index relationship between each video and the key frame set of each video, and storing the index relationship in a video feature database;
and the video retrieval module is used for retrieving videos from the video feature database according to an image or short video provided by the user, and outputting a video retrieval result.
8. The video retrieval system of claim 7, wherein the DenseNet model employs a DenseNet-201 model;
the DenseNet-201 model comprises a convolution layer, a pooling layer, a first dense block, a first transition layer, a second dense block, a second transition layer, a third dense block, a third transition layer, a fourth dense block and a classification layer which are sequentially connected.
9. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the video retrieval method of any one of claims 1 to 6.
10. A storage medium storing a program, wherein the program, when executed by a processor, implements the video retrieval method of any one of claims 1 to 6.
CN202010115194.5A 2020-02-25 2020-02-25 Video retrieval method, system, computer equipment and storage medium based on depth features Pending CN111339369A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010115194.5A | 2020-02-25 | 2020-02-25 | Video retrieval method, system, computer equipment and storage medium based on depth features (CN111339369A)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010115194.5A | 2020-02-25 | 2020-02-25 | Video retrieval method, system, computer equipment and storage medium based on depth features (CN111339369A)

Publications (1)

Publication Number Publication Date
CN111339369A (en) 2020-06-26

Family

ID=71185653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010115194.5A Pending CN111339369A (en) 2020-02-25 2020-02-25 Video retrieval method, system, computer equipment and storage medium based on depth features

Country Status (1)

Country Link
CN (1) CN111339369A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053313A (en) * 2020-08-31 2020-12-08 西安工业大学 Night vision anti-halation video processing method for heterogeneous image fusion
CN112069967A (en) * 2020-08-31 2020-12-11 西安工业大学 Night-vision anti-halation pedestrian detection and tracking method based on heterogeneous video fusion
CN112487242A (en) * 2020-11-27 2021-03-12 百度在线网络技术(北京)有限公司 Method and device for identifying video, electronic equipment and readable storage medium
CN112836600A (en) * 2021-01-19 2021-05-25 新华智云科技有限公司 Method and system for calculating video similarity
CN113139517A (en) * 2021-05-14 2021-07-20 广州广电卓识智能科技有限公司 Face living body model training method, face living body model detection method, storage medium and face living body model detection system
CN113627342A (en) * 2021-08-11 2021-11-09 人民中科(济南)智能技术有限公司 Method, system, device and storage medium for video depth feature extraction optimization
CN117540047A (en) * 2023-11-24 2024-02-09 中科世通亨奇(北京)科技有限公司 Method, system, equipment and storage medium for retrieving video based on picture
CN117612215A (en) * 2024-01-23 2024-02-27 南京中孚信息技术有限公司 Identity recognition method, device and medium based on video retrieval

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN109359725A (en) * 2018-10-24 2019-02-19 北京周同科技有限公司 Training method, device, equipment and the computer readable storage medium of convolutional neural networks model
CN109918537A (en) * 2019-01-18 2019-06-21 杭州电子科技大学 A kind of method for quickly retrieving of the ship monitor video content based on HBase
CN110162665A (en) * 2018-12-28 2019-08-23 腾讯科技(深圳)有限公司 Video searching method, computer equipment and storage medium
CN110278449A (en) * 2019-06-26 2019-09-24 腾讯科技(深圳)有限公司 A kind of video detecting method, device, equipment and medium
CN110751209A (en) * 2019-10-18 2020-02-04 北京邮电大学 Intelligent typhoon intensity determination method integrating depth image classification and retrieval

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN109359725A (en) * 2018-10-24 2019-02-19 北京周同科技有限公司 Training method, device, equipment and the computer readable storage medium of convolutional neural networks model
CN110162665A (en) * 2018-12-28 2019-08-23 腾讯科技(深圳)有限公司 Video searching method, computer equipment and storage medium
CN109918537A (en) * 2019-01-18 2019-06-21 杭州电子科技大学 A kind of method for quickly retrieving of the ship monitor video content based on HBase
CN110278449A (en) * 2019-06-26 2019-09-24 腾讯科技(深圳)有限公司 A kind of video detecting method, device, equipment and medium
CN110751209A (en) * 2019-10-18 2020-02-04 北京邮电大学 Intelligent typhoon intensity determination method integrating depth image classification and retrieval

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张惠凡 et al.: "Research on bird video image retrieval based on convolutional neural networks" (基于卷积神经网络的鸟类视频图像检索研究) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053313A (en) * 2020-08-31 2020-12-08 西安工业大学 Night vision anti-halation video processing method for heterogeneous image fusion
CN112069967A (en) * 2020-08-31 2020-12-11 西安工业大学 Night-vision anti-halation pedestrian detection and tracking method based on heterogeneous video fusion
CN112487242A (en) * 2020-11-27 2021-03-12 百度在线网络技术(北京)有限公司 Method and device for identifying video, electronic equipment and readable storage medium
CN112836600A (en) * 2021-01-19 2021-05-25 新华智云科技有限公司 Method and system for calculating video similarity
CN112836600B (en) * 2021-01-19 2023-12-22 新华智云科技有限公司 Video similarity calculation method and system
CN113139517A (en) * 2021-05-14 2021-07-20 广州广电卓识智能科技有限公司 Face living body model training method, face living body model detection method, storage medium and face living body model detection system
CN113139517B (en) * 2021-05-14 2023-10-27 广州广电卓识智能科技有限公司 Face living body model training method, face living body model detection method, storage medium and face living body model detection system
CN113627342A (en) * 2021-08-11 2021-11-09 人民中科(济南)智能技术有限公司 Method, system, device and storage medium for video depth feature extraction optimization
CN113627342B (en) * 2021-08-11 2024-04-12 人民中科(济南)智能技术有限公司 Method, system, equipment and storage medium for video depth feature extraction optimization
CN117540047A (en) * 2023-11-24 2024-02-09 中科世通亨奇(北京)科技有限公司 Method, system, equipment and storage medium for retrieving video based on picture
CN117612215A (en) * 2024-01-23 2024-02-27 南京中孚信息技术有限公司 Identity recognition method, device and medium based on video retrieval
CN117612215B (en) * 2024-01-23 2024-04-26 南京中孚信息技术有限公司 Identity recognition method, device and medium based on video retrieval

Similar Documents

Publication Publication Date Title
CN111339369A (en) Video retrieval method, system, computer equipment and storage medium based on depth features
CN108108499B (en) Face retrieval method, device, storage medium and equipment
US8232996B2 (en) Image learning, automatic annotation, retrieval method, and device
CN109359725B (en) Training method, device and equipment of convolutional neural network model and computer readable storage medium
CN102184242B (en) Cross-camera video abstract extracting method
CN106649663B (en) A kind of video copying detection method based on compact video characterization
US9665773B2 (en) Searching for events by attendants
CN114694185B (en) Cross-modal target re-identification method, device, equipment and medium
Wang et al. Duplicate discovery on 2 billion internet images
Papadopoulos et al. Automatic summarization and annotation of videos with lack of metadata information
CN111723692B (en) Near-repetitive video detection method based on label features of convolutional neural network semantic classification
Zhong et al. Video-based person re-identification based on distributed cloud computing
WO2024032177A1 (en) Data processing method and apparatus, electronic device, storage medium, and program product
CN109241342B (en) Video scene retrieval method and system based on depth clues
Xu et al. A novel shot detection algorithm based on clustering
CN114973099A (en) Intelligent object searching method and system based on traceable target identification
CN111506754B (en) Picture retrieval method, device, storage medium and processor
Kamde et al. Entropy supported video indexing for content based video retrieval
CN111581420A (en) Medical image real-time retrieval method based on Flink
Ravani et al. Parallel CBIR system based on color coherence vector
CN114612985B (en) Portrait query method, system, equipment and storage medium
Nasreen et al. Parallelizing Multi-featured Content Based Search and Retrieval of Videos through High Performance Computing
Hiriyannaiah et al. Deep learning and its applications for content-based video retrieval
CN112597329B (en) Real-time image retrieval method based on improved semantic segmentation network
CN115442660B (en) Self-supervision countermeasure video abstract extraction method, device, equipment and storage medium

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination