CN109815364A

CN109815364A - A kind of massive video feature extraction, storage and search method and system

Info

Publication number: CN109815364A
Application number: CN201910047518.3A
Authority: CN
Inventors: 李传朋; 顾寅铮; 谢锦滨
Original assignee: Shanghai Jilian Network Technology Co Ltd
Current assignee: Shanghai Jilian Network Technology Co Ltd
Priority date: 2019-01-18
Filing date: 2019-01-18
Publication date: 2019-05-28
Anticipated expiration: 2039-01-18
Also published as: WO2020147857A1; CN109815364B

Abstract

The present invention relates to technical field of video processing, specifically, it is related to a kind of massive video feature extraction, storage and search method and system, feature extraction passes through training depth convolutional neural networks and extracts, storage method includes video slicing, interval sampling, terrestrial reference identification, depth convolutional neural networks handle to obtain Hash coding and feature, selection key frame and etc., search method includes landmark information and picture retrieval, when inputting picture, Hash coding and feature are extracted by depth convolutional neural networks first, uncommon coding quick-searching in hash index library, then it is accurately matched in video features library with feature, video information is obtained in video information library after obtaining similarly target index；The present invention has preferably accuracy, and precision is high.

Description

A kind of massive video feature extraction, storage and search method and system

Technical field

The present invention relates to technical field of video processing, specifically, being related to a kind of massive video feature extraction, storage and inspection Rope method and system.

Background technique

Image retrieval technologies can be divided into text based retrieval method and content-based retrieval method, text based Search method inputs the mark of keyword and all images firstly the need of text marking is added in advance for the image in all search libraries Explanatory notes is originally matched.The search method retrieval flow of image content-based is similar with text based retrieval method, different Be content-based retrieval mode be description using the visual signature of image as image, Visual Feature Retrieval Process have at present sift, The conventional methods such as hog, harr, gist, retrieval mode are to extract characteristic matching retrieval in Image Visual Feature, with feature database.

Current video retrieval method accuracy rate is low, and precision is small, is not able to satisfy the demand of people.

Summary of the invention

The contents of the present invention are to provide a kind of massive video feature extraction, storage and search method and system, can gram Take certain or certain defects of the prior art.

A kind of massive video feature extracting method according to the present invention comprising following steps:

A, by video input to depth convolutional neural networks；

B, depth convolutional neural networks carry out feature extraction to video；

C, video features are obtained.

2, a kind of massive video feature extracting method according to claim 1, it is characterised in that: the depth convolution For neural network based on ResNet101, conv-5 includes c5-1, c5-2 and c5-3, carries out attention behaviour in conv-5 Make as follows:

One layer of convolution kernel of c5-1 connection be the convolution of 3x3 and export 512 characteristic patterns, then connects the convolution of one layer of 3x3 and defeated A characteristic pattern out, attention map of this feature figure as the direction spatial；

The convolution of one layer of 1x1 of c5-2 connection keeps dimension 2048 to tie up constant, and does global poolization operation, obtains channel The attention map in direction；

The calculation formula of feature extraction is as follows:

Wherein, F_c-3Indicate c5-3 layers of characteristic pattern,The operation of representing matrix element multiplication,Representing matrix element addition Operation, M_sAnd M_cRespectively indicate spatial attention operation and channel attention operation；Attention is last Feature vector of the one layer of convolution by 2048 dimension of Generalized-mean pooling output, and use L2 normalization characteristic.

Preferably, the loss function of depth convolutional neural networks is contrastive loss, training set is Structure-from-motion, after the completion of training, immobilizing foundation model parameter, trains the Hash coding layer of 512 dimensions respectively, And support the full articulamentum of 4784 terrestrial references.

Preferably, increase the full articulamentum FC of one layer of 2048 dimension after training depth convolutional neural networks extract feature, and Concat operation is carried out to feature and FC, 4096 dimensional features are formed, by pca dimensionality reduction to 2048, as final feature.

The present invention extracts video features by depth convolutional neural networks, becomes apparent from feature, more accurately.

The present invention also provides a kind of massive video Feature Extraction System based on depth convolutional neural networks, the depth Convolutional neural networks based on ResNet101, the conv-5 module on ResNet101 include c5-1, c5-2 and c5-3 with And attention operation module, attention operation module include spatial attention operation module, channel Attention operation module and final operation module, in which:

Spatial attention operation module is used to handle the convolution that one layer of convolution kernel of c5-1 connection is 3x3 and output 512 characteristic patterns, then connect the convolution of one layer of 3x3 and export a characteristic pattern；

Channel attention operation module is used to handle the convolution of one layer of 1x1 of c5-2 connection, and dimension 2048 is kept to tie up It is constant, and do global poolization operation；

The relationship that final operation module is used to handle between c5-1, c5-2 and c5-3 obtains final feature.

A kind of massive video storage method according to the present invention comprising following steps:

One, by video slicing, it is divided into the segment being made of camera lens by Shot Detection, and be stored in video information library；

Two, segment interval sampling generate sample frame, sample frame input depth convolutional neural networks after obtain Hash coding and Feature；

Three, terrestrial reference identification is carried out in sample frame input terrestrial reference identification model, is as a result stored in video terrestrial reference storehouse；

Four, the correlation between comparative feature selects key frame, and the feature of key frame is stored in video features library, crucial The Hash coding deposit hash index library of frame.

Preferably, video slicing is carried out in video task scheduling system in step 1, tracking stream and the overall situation are utilized Color of image continuity comprehensive characteristics determine camera lens time boundary, by Video segmentation are the segment being made of camera lens.

Preferably, the information in segment deposit video information library includes start frame number, terminates frame number, frame length in step 1 With affiliated video.

Preferably, segment is sampled according to 30 frame periods in step 2.

Preferably, ground identifies method for distinguishing are as follows: if some classification softmax in terrestrial reference identification model in step 3 Value is greater than 0.8 and thinks that the ground of current lens is designated as the category, and result is stored in video terrestrial reference storehouse.

Preferably, in step 4, the method for the correlation between comparative feature are as follows: high using threshold value 0.55 as boundary Think similar in the threshold value, similar feature only retains first feature, to obtain one or more pass for representing the camera lens Key frame.

Preferably, the separability that the building of Hash coding reinforces coding by increasing sigmoid in training, with 0.5 threshold Value is divided into 0,1 coding.

The present invention also provides a kind of massive video storage systems, use a kind of above-mentioned sea based on deep learning Measure video storage method.

The present invention also provides a kind of massive video search methods comprising following steps:

(1), landmark information or image are inputted；

(2), when inputting landmark information, into being searched in video terrestrial reference storehouse；When inputting picture, pass through depth convolution mind first Hash coding and feature are extracted through network；

(3), Hash coding quick-searching in hash index library, is then accurately matched in video features library with feature, Video information is obtained in video information library after obtaining similarly target index.

Preferably, after extracting Hash coding and feature, Hash coding is searched in history search library is in step (2) It is no to have similar retrieval, if so, directly accurately being matched with feature, if nothing, carry out step (3).

Preferably, accurate matched method are as follows: obtained aspect indexing after being retrieved according to Hash obtains candidate feature Number N, form [N, 2048] eigenmatrix, with retrieving image feature carry out COS distance calculating, according to matching similarity into After row sequence, video segment information is inquired from video information library, cooperation screenshot is output to interface.

The present invention also provides a kind of massive video searching systems, use a kind of above-mentioned massive video search method.

The present invention include video lens segmentation, key-frame extraction, characteristic storage matched with index, quick-searching, terrestrial reference with The technologies such as identification fully describe the terrestrial reference feature in image using the depth characteristic of depth convolutional neural networks, before increasing terrestrial reference The accuracy rate of phase identification and later period index.By building hash index library and video features library, while guaranteeing quick-searching, Improve the matching precision of retrieval.The present invention provide it is a kind of based on deep neural network for landmark image feature extraction, deposit Storage, the frame indexed adapt to the terrestrial reference retrieval and identification of extensive video.

Detailed description of the invention

Fig. 1 is a kind of flow chart of massive video feature extracting method in embodiment 1；

Fig. 2 is a kind of structural frames of the massive video Feature Extraction System based on depth convolutional neural networks in embodiment 1 Figure；

Fig. 3 is a kind of flow chart of massive video storage method in embodiment 1；

Fig. 4 is a kind of flow chart of massive video search method in embodiment 1.

Specific embodiment

To further appreciate that the contents of the present invention, the present invention is described in detail in conjunction with the accompanying drawings and embodiments.It should be understood that , embodiment be only to the present invention explain and and it is non-limiting.

Embodiment 1

As shown in Figure 1, present embodiments providing a kind of massive video feature extracting method, comprising the following steps:

A, by video input to depth convolutional neural networks；

B, depth convolutional neural networks carry out feature extraction to video；

C, video features are obtained.

In the present embodiment, for the depth convolutional neural networks based on ResNet101, conv-5 includes c5-1, c5- It is as follows to carry out attention operation in conv-5 by 2 and c5-3:

The calculation formula of feature extraction is as follows:

In the present embodiment, the loss function of depth convolutional neural networks is contrastive loss, and training set is Structure-from-motion, after the completion of training, immobilizing foundation model parameter, trains the Hash coding layer of 512 dimensions respectively, And support the full articulamentum of 4784 terrestrial references.

In the present embodiment, after training depth convolutional neural networks extract feature, increase the full articulamentum FC of one layer of 2048 dimension, And concat operation is carried out to feature and FC, 4096 dimensional features are formed, by pca dimensionality reduction to 2048, as final feature.

As shown in Fig. 2, present embodiments providing a kind of massive video feature extraction system based on depth convolutional neural networks System, the depth convolutional neural networks based on ResNet101, the conv-5 module on ResNet101 include c5-1, C5-2 and c5-3 and attention operation module, attention operation module include spatial attention operation mould Block, channel attention operation module and final operation module, in which:

As shown in figure 3, present embodiments providing a kind of massive video storage method comprising following steps:

Two, segment interval sampling generates sample frame, after sample frame inputs depth convolutional neural networks (V-DIR i.e. in figure) Obtain Hash coding and feature；

Four, the correlation between comparative feature selects key frame, and the feature of key frame is stored in video features library, crucial The Hash coding deposit hash index library of frame.Massive video is analyzed in advance by depth convolutional neural networks, with certain in video Interval sampling frame, and according to feature selecting key frame, characteristic storage only selects key frame to be stored, and accounts for reduce space With.

Depth convolutional neural networks are a kind of depth convolutional neural networks, and classification training characteristics extract model, Hash coding Layer, terrestrial reference identification model.It is basic model with ResNet101, conv5 convolution is defeated by Generalized-mean pooling The feature vector of 2048 dimensions out, training uses contrastive loss as loss function, after the completion of training, fixes base respectively Plinth model parameter, the Hash coding layer of 512 dimension of training are made of one layer of full connection and sigmoid, and support 4784 terrestrial references Full articulamentum, be made of one layer of full connection and softmax.

Terrestrial reference identification model is the latter linked full articulamentum of characteristic layer in depth convolutional neural networks model, using there is mark What the landmark image of label was trained, classification is predicted by softmax mode.

In the present embodiment, in step 1, video slicing is to carry out in video task scheduling system, and complete video is pressed Fragment is carried out according to 3500 frame lengths, determines camera lens time side using tracking stream and global image distribution of color continuity comprehensive characteristics Video segmentation is the segment being made of camera lens by boundary.

In the present embodiment, in step 1, the information in segment deposit video information library includes start frame number, terminates frame number, frame Long and affiliated video.

In the present embodiment, in step 2, segment is sampled according to 30 frame periods.

In the present embodiment, in step 3, ground identifies method for distinguishing are as follows: if some classification in terrestrial reference identification model Softmax value is greater than 0.8 and thinks that the ground of current lens is designated as the category and then thinks that the ground of current lens is designated as the category, will tie Fruit is stored in video terrestrial reference storehouse.

In the present embodiment, in step 4, the method for the correlation between comparative feature are as follows: compared with COS distance, with threshold Value 0.55 is used as boundary, thinks similar higher than the threshold value, and similar feature only retains first feature, represents the camera lens to obtain One or more key frame.It will be stored in the 2048 dimensional features deposit video features library of key frame, the Hash coding of 512 dimensions It is stored in hash index library.

In the present embodiment, the separability that the building of Hash coding reinforces coding by increasing sigmoid in training, with 0.5 Threshold value is divided into 0,1 coding.

A kind of massive video storage system is present embodiments provided, a kind of above-mentioned massive video storage method is used.

As shown in figure 4, present embodiments providing a kind of massive video search method comprising following steps:

(1), landmark information or image are inputted；

In the present embodiment, in step (2), after extracting Hash coding and feature, Hash coding is searched in history search library Whether there is similar retrieval, if so, directly accurately being matched with feature, if nothing, carries out step (3).

In the present embodiment, accurate matched method are as follows: obtained aspect indexing after being retrieved according to Hash obtains candidate spy Number N is levied, the eigenmatrix of [N, 2048] is formed, the feature with retrieving image carries out COS distance calculating, according to matching similarity After being ranked up, video segment information is inquired from video information library, cooperation screenshot is output to interface.Made using depth Hash For quick indexing mode, the COS distance for calculating whole features is not needed, Hash code storage in the database, uses Hamming distance is matched, and quick-searching is guaranteed.After quick-searching terrestrial reference, and the uncertain terrestrial reference retrieved with it is defeated Enter terrestrial reference matching, compares to further calculate COS distance by video features library, to guarantee precise search.

A kind of massive video searching system is present embodiments provided, a kind of above-mentioned massive video search method is used.

4784 classes marked are supported in storage, but retrieval is not limited thereto, and input the landmark image in non-4784 class, Terrestrial reference in similar video can be equally retrieved, and is accurately positioned position in video.

Schematically the present invention and embodiments thereof are described above, description is not limiting, institute in attached drawing What is shown is also one of embodiments of the present invention, and actual structure is not limited to this.So if the common skill of this field Art personnel are enlightened by it, without departing from the spirit of the invention, are not inventively designed and the technical solution Similar frame mode and embodiment, are within the scope of protection of the invention.

Claims

1. a kind of massive video feature extracting method, it is characterised in that: the following steps are included:

A, by video input to depth convolutional neural networks；

B, depth convolutional neural networks carry out feature extraction to video；

C, video features are obtained.

2. a kind of massive video feature extracting method according to claim 1, it is characterised in that: the depth convolutional Neural For network based on ResNet101, conv-5 includes c5-1, c5-2 and c5-3, carries out attention operation such as in conv-5 Under:

One layer of convolution kernel of c5-1 connection is the convolution of 3x3 and exports 512 characteristic patterns, then connects the convolution of one layer of 3x3 and export one Open characteristic pattern, attention map of this feature figure as the direction spatial；

The convolution of one layer of 1x1 of c5-2 connection keeps dimension 2048 to tie up constant, and does global poolization operation, obtains the direction channel Attention map；

Feature calculation formula is as follows:

Wherein, F_c-3Indicate c5-3 layers of characteristic pattern,The operation of representing matrix element multiplication,Representing matrix element add operation, M_sAnd M_cRespectively indicate spatial attention operation and channel attention operation；Attention the last layer volume Feature vector of the product by 2048 dimension of Generalized-meanpooling output, and use L2 normalization characteristic.

3. a kind of massive video feature extracting method according to claim 2, it is characterised in that: the depth convolutional Neural The loss function of network is contrastive loss, training set structure-from-motion, after the completion of training, point Other immobilizing foundation model parameter, the Hash coding layer of 512 dimension of training, and support the full articulamentum of 4784 terrestrial references.

4. a kind of massive video feature extracting method according to claim 3, it is characterised in that: the training depth convolution After neural network extracts feature, increase the full articulamentum FC of one layer of 2048 dimension, and concat operation, composition are carried out to feature and FC 4096 dimensional features, by pca dimensionality reduction to 2048, as final feature.

5. a kind of massive video Feature Extraction System based on depth convolutional neural networks, it is characterised in that: the depth convolution Neural network based on ResNet101, the conv-5 module on ResNet101 include c5-1, c5-2 and c5-3 and Attention operation module, attention operation module include spatial attention operation module, channel Attention operation module and final operation module, in which:

Spatial attention operation module is used to handle the convolution that one layer of convolution kernel of c5-1 connection is 3x3 and exports 512 Characteristic pattern, then connect the convolution of one layer of 3x3 and export a characteristic pattern；

Channel attention operation module is used to handle the convolution of one layer of 1x1 of c5-2 connection, keeps the dimension of dimension 2048 not Become, and does global poolization operation；

6. a kind of massive video storage method, it is characterised in that: the following steps are included:

Two, segment interval sampling generates sample frame, obtains Hash coding and feature after sample frame input depth convolutional neural networks；

Four, the correlation between comparative feature selects key frame, and the feature of key frame is stored in video features library, key frame Hash coding deposit hash index library.

7. a kind of massive video storage method according to claim 6, it is characterised in that: in step 1, video slicing is It is carried out in video task scheduling system, determines camera lens using tracking stream and global image distribution of color continuity comprehensive characteristics Video segmentation is the segment being made of camera lens by time boundary.

8. a kind of massive video storage method according to claim 7, it is characterised in that: in step 1, segment deposit view The information of frequency information bank includes start frame number, terminates frame number, frame length and affiliated video.

9. a kind of massive video storage method according to claim 8, it is characterised in that: in step 2, segment is according to 30 Frame period sampling.

10. a kind of massive video storage method according to claim 9, it is characterised in that: in step 3, terrestrial reference identification Method are as follows: think that the ground of current lens is designated as such if some classification softmax value in terrestrial reference identification model is greater than 0.8 Not, result is stored in video terrestrial reference storehouse.

11. a kind of massive video storage method according to claim 10, it is characterised in that: in step 4, comparative feature Between correlation method are as follows: using threshold value 0.55 as boundary, think similar higher than the threshold value, similar feature only retains head A feature, to obtain one or more key frame for representing the camera lens.

12. a kind of massive video storage method according to claim 11, it is characterised in that: the building of Hash coding is by instructing Increase the separability that sigmoid reinforces coding in white silk, 0,1 coding is divided into 0.5 threshold value.

13. a kind of massive video storage system, it is characterised in that: it uses one kind of any one of claim 6-12 to be based on The massive video storage method of deep learning.

14. a kind of massive video search method, it is characterised in that: the following steps are included:

(1), landmark information or image are inputted；

(2), when inputting landmark information, into being searched in video terrestrial reference storehouse；When inputting picture, pass through depth convolutional Neural net first Network extracts Hash coding and feature；

(3), Hash coding quick-searching in hash index library, is then accurately matched in video features library with feature, is obtained Similarly video information is obtained in video information library after target index.

15. a kind of massive video search method as claimed in claim 7, it is characterised in that: in step (2), extract Hash and compile After code and feature, Hash coding has searched whether similar retrieval in history search library, if so, directly with feature accurate Match, if nothing, carries out step (3).

16. a kind of massive video search method as claimed in claim 8, it is characterised in that: accurate matched method are as follows: according to Obtained aspect indexing after Hash retrieval, obtains candidate feature number N, forms the eigenmatrix of [N, 2048], same to retrieving image Feature carry out COS distance calculating, after being ranked up according to matching similarity, inquire video clip from video information library Information, cooperation screenshot are output to interface.

17. a kind of massive video searching system, it is characterised in that: it uses a kind of sea of any one of claim 14-16 Measure video features search method.