CN109543735A - Video copying detection method and its system - Google Patents

Video copying detection method and its system Download PDF

Info

Publication number
CN109543735A
Authority
CN
China
Prior art keywords
image
video
key frame
feature
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811353711.1A
Other languages
Chinese (zh)
Inventor
石慧杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN201811353711.1A priority Critical patent/CN109543735A/en
Publication of CN109543735A publication Critical patent/CN109543735A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present invention belongs to the field of multimedia information technology and provides a video copy detection method and system. The method comprises: obtaining the image labels Y in an image training set and applying a copy attack to each image label Y to obtain a corresponding first image X; constructing a training model of a deep feature extraction model and training it to obtain the deep feature extraction model; obtaining, for every second image, a first deep feature and a first DCT coefficient feature, fusing them into a first fusion feature, and saving the first fusion features as a fusion feature database; extracting the third key frames of a query video and fusing their second deep features and second DCT coefficient features into second fusion features; and matching each second fusion feature against each first fusion feature to obtain the detection result for the query video. Thereby, the present invention improves the discriminative power and robustness of the video features and the accuracy of the video copy detection system.

Description

Video copying detection method and its system
Technical field
The present invention relates to the field of multimedia information technology, and more particularly to a video copy detection method and system.
Background art
With the development of science and technology and the wide availability of the Internet, digital video piracy has become an increasingly prominent problem. Content-based video copy detection judges whether a video to be detected constitutes piracy or infringement by comparing its similarity with an original video. Video infringement includes not only the duplication of the original video but also operations such as scaling, cropping, rotation, blurring, and insertion of text or graphics. Content-based video copy detection therefore plays a vital role in video copyright protection and management.
Conventional video copy detection techniques are broadly divided into methods based on global features and methods based on local features. Representative global features include color histograms, GIST features and DCT coefficients; global-feature methods are simple and effective for large-scale video retrieval and can detect changes such as color and brightness, but they perform poorly for partial copy variations such as scene cropping and translation. Representative local features include SIFT, SURF and BRIEF; local features achieve higher performance for a variety of copy attacks and high matching precision, but the descriptor dimensionality is very high, a single key frame may yield hundreds or thousands of local features, the computational overhead is large, and feature extraction and matching are slow.
In recent years, with the introduction of deep learning, a large number of deep-learning-based methods have emerged in the field of computer vision and achieved great success. Convolutional neural networks show a powerful ability to extract image features: a multi-layer network extracts features of the input signal by convolution and then abstracts them through pooling layers to obtain higher-level deep features. However, existing copy detection techniques that use deep learning usually just extract deep features directly from the penultimate or antepenultimate layer of a pre-trained neural network model, without any design or training targeted at the copy detection task. The features extracted in this way are not discriminative enough, and video information constructed from a single type of deep feature cannot comprehensively describe the video content, so the representation lacks robustness to a wide range of variations and the copy detection accuracy is low.
The Chinese invention patent application with publication number CN106649663A, filed by Dalian University of Technology, published on May 10, 2017 and entitled "A video copy detection method based on compact video characterization", discloses densely extracting key frames of the library videos and the query video, extracting image sparse features of those key frames, and fusing the image sparse features of the library videos and the query video respectively by pooling to form compact video features. Although that application uses a deep learning method to extract deep video features, it directly uses a network model pre-trained on an image classification task for feature extraction, ignoring the particularity of the copy detection task; the network is neither designed nor retrained for copy detection, so the extracted deep features are not discriminative enough. In addition, that method densely extracts all frames of the video and feeds every video frame into the neural network to obtain a 4096-dimensional deep feature, which is computationally expensive and burdens the machine; and because it obtains only one compact video representation per second of video, the copy localization result is coarse when matching against the library videos.
In conclusion the detection technique of existing copy video there will naturally be inconvenient and defect in actual use, so It is necessary to be improved.
Summary of the invention
In view of the above defects, the purpose of the present invention is to provide a video copy detection method and system that significantly improve the discriminative power and robustness of the video features while improving the accuracy of the video copy detection system.
To achieve the above goal, the present invention provides a video copy detection method comprising the following steps:
An original video data set processing step: obtain an original video data set containing a plurality of videos and divide it into a training video set and a video database, the videos in the training video set being first videos and the videos in the video database being second videos; extract the first key frames of all the first videos of the training video set to form an image training set, and extract the second key frames of all the second videos of the video database to form an image database; each first key frame in the image training set is an image label Y, and each image label Y is subjected to a copy attack to obtain the first image X corresponding to that image label Y;
An offline step of obtaining a deep feature extraction model: construct a training model of the deep feature extraction model, the training model of the deep feature extraction model including a loss-value computation model of a loss function; input each image label Y and the corresponding first image X into the training model of the deep feature extraction model and train it until the loss value of the loss function stops decreasing, thereby obtaining the deep feature extraction model;
A step of obtaining a fusion feature database: each second key frame in the image database is a second image; pass every second image through the deep feature extraction model to obtain the first deep feature of that second image; perform a DCT transform on every second image of the image database to obtain the first DCT coefficient feature of that second image; fuse the first deep feature and the first DCT coefficient feature of every second image to obtain the first fusion feature of that second image; save the first fusion features of all second images of the image database as the fusion feature database;
An online query video detection step: extract the third key frames of the query video and pass each third key frame of the query video through the deep feature extraction model to obtain a second deep feature; perform a DCT transform on the third key frame to obtain the second DCT coefficient feature of the third key frame; fuse the second deep feature and the second DCT coefficient feature to obtain the second fusion feature of the third key frame; match each second fusion feature against each first fusion feature to judge whether a video in the video database is the copy source of the query video, and if so, simultaneously obtain the position of the query video in the copy source.
According to the method, the offline step of obtaining the deep feature extraction model includes: constructing the training model of the deep feature extraction model, the training model of the deep feature extraction model including a variational autoencoder neural network model, the variational autoencoder neural network model including an encoder that uses a deep convolutional neural network as its base network and a decoder built from a fully connected neural network and a deconvolutional neural network structure; inputting each image label Y and the corresponding first image X into the variational autoencoder neural network model and applying a first preprocessing to each image label Y and the corresponding first image X; initializing the parameters of the encoder and the decoder; and training the variational autoencoder neural network model until the loss value of the loss function stops decreasing, thereby obtaining the deep feature extraction model.
According to the method, the encoder extracts image features. The deep convolutional neural network of the encoder is a ResNet-101 network, and the encoder consists of an input layer, hidden layers and an output layer, wherein the input layer receives the first image X after the first preprocessing, the hidden layers consist of the ResNet-101 network with its fully connected layer removed, and the output layer includes two fully connected layers that respectively output the means of m Gaussian distributions and the logarithms of the variances of the Gaussian distributions. The output layer outputs a vector Z of dimension m, where the vector Z follows the normal distribution N(mean, exp(varlog)). A random noise ε obeying the normal distribution N(0, 1) is added to the vector Z, and the output of the encoder is then:

Z = mean + exp(varlog / 2) · ε

where mean is the mean of the Gaussian distribution output by the output layer of the encoder and varlog is the logarithm of the variance of the Gaussian distribution output by the output layer of the encoder. The decoder decodes the image features extracted by the encoder; the input of the decoder is the vector Z and the output of the decoder is the vector Ŷ. The first preprocessing normalizes each image label Y and the corresponding first image X, scales the normalized image label Y and first image X to a first size, converts the image label Y and the first image X scaled to the first size to RGB space, and normalizes the pixel values of the image label Y and the first image X converted to RGB space. The loss function is:

Loss = ||Y − Ŷ||^2 − 0.5 · Σ (1 + varlog − mean^2 − exp(varlog))

where the sum runs over the m dimensions, mean is the mean of the Gaussian distribution output by the output layer of the encoder, varlog is the logarithm of the variance of the Gaussian distribution output by the output layer of the encoder, Y is the image label Y, and Ŷ is the output of the decoder.
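The reparameterization and loss above admit a compact illustration. The following is a minimal PyTorch sketch assuming the standard variational-autoencoder formulation (squared Euclidean reconstruction term plus analytic KL term); tensor and function names are hypothetical and this is not code from the patent.

```python
import torch

def reparameterize(mean: torch.Tensor, varlog: torch.Tensor) -> torch.Tensor:
    """Sample Z ~ N(mean, exp(varlog)) with the reparameterization trick."""
    eps = torch.randn_like(mean)          # random noise obeying N(0, 1)
    return mean + torch.exp(0.5 * varlog) * eps

def vae_loss(y: torch.Tensor, y_hat: torch.Tensor,
             mean: torch.Tensor, varlog: torch.Tensor) -> torch.Tensor:
    """Euclidean reconstruction term between label Y and decoder output, plus KL divergence to N(0, 1)."""
    recon = torch.sum((y - y_hat) ** 2)   # squared Euclidean distance (assumed reconstruction term)
    kl = -0.5 * torch.sum(1 + varlog - mean ** 2 - torch.exp(varlog))
    return recon + kl
```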
According to the method, the step of obtaining the fusion feature database includes: applying a second preprocessing to every second image, the second preprocessing being a size normalization of every second image, in which the second image after normalization is scaled to the first size, the second image scaled to the first size is converted to RGB space, and the pixel values of the second image converted to RGB space are normalized;
passing every second image after the second preprocessing through the deep feature extraction model to obtain the m-dimensional first deep feature of that second image;
applying a third preprocessing to every second image, the third preprocessing being a size normalization of every second image, in which the second image after normalization is scaled to 64 × 64 pixels;
dividing the second image scaled to 64 × 64 pixels into 64 first blocks of 8 × 8 pixels, applying a DCT transform to each first block and, following the Zig-Zag order, taking the low-frequency coefficients at the first 4 positions of each first block to form four first one-dimensional vectors A_i of length 64 each; computing the expectation mean_i and variance var_i of each of the four one-dimensional vectors A_i and standardizing each A_i with its mean_i and var_i to obtain F_i; concatenating the F_i into a 256-dimensional first one-dimensional vector F, thereby obtaining the first DCT coefficient feature; and directly concatenating the first deep feature with the first DCT coefficient feature to obtain the first fusion feature (illustrated by the sketch following this step);
saving the first fusion features of all second images of the image database as the fusion feature database.
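The block-DCT feature and the fusion by direct concatenation described above can be sketched as follows. The use of SciPy for the 2-D DCT, operating on a single luminance-like channel, the exact zig-zag positions kept from each block, and standardizing each A_i by dividing by its variance are assumptions; all function names are illustrative.

```python
import numpy as np
from scipy.fftpack import dct

# First 4 positions of an 8x8 block in Zig-Zag order (assumed): DC plus 3 low-frequency AC terms.
ZIGZAG_4 = [(0, 0), (0, 1), (1, 0), (2, 0)]

def dct2(block: np.ndarray) -> np.ndarray:
    """2-D DCT of one 8x8 block."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def dct_coefficient_feature(gray64: np.ndarray) -> np.ndarray:
    """gray64: a 64x64 array (e.g. the luminance of the scaled image) -> 256-dimensional DCT feature."""
    groups = [[], [], [], []]                      # one group per zig-zag position, 64 values each
    for r in range(0, 64, 8):
        for c in range(0, 64, 8):
            coeffs = dct2(gray64[r:r + 8, c:c + 8])
            for k, (i, j) in enumerate(ZIGZAG_4):
                groups[k].append(coeffs[i, j])
    feature = []
    for a in groups:                               # standardize each length-64 vector A_i
        a = np.asarray(a)
        mean_i, var_i = a.mean(), a.var()
        feature.append((a - mean_i) / (var_i + 1e-8))   # assumption: centred by mean_i, scaled by var_i
    return np.concatenate(feature)                 # 4 * 64 = 256 dimensions

def fuse(deep_feature: np.ndarray, dct_feature: np.ndarray) -> np.ndarray:
    """First fusion feature: direct concatenation of the deep feature and the DCT coefficient feature."""
    return np.concatenate([deep_feature, dct_feature])
```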
According to the method, the online query video detection step includes:
extracting the third key frames of the query video with a key frame algorithm based on shot segmentation, to obtain the third key frame set of the query video;
applying a fourth preprocessing to each third key frame of the third key frame set, the fourth preprocessing being a size normalization of each third key frame, in which the third key frame after normalization is scaled to the first size, the third key frame scaled to the first size is converted to RGB space, and the pixel values of the third key frame converted to RGB space are normalized;
passing the third key frame after the fourth preprocessing through the deep feature extraction model to obtain the m-dimensional second deep feature;
applying a fifth preprocessing to each third key frame, the fifth preprocessing being a size normalization of each third key frame, in which the third key frame after normalization is scaled to 64 × 64 pixels;
dividing the third key frame scaled to 64 × 64 pixels into 64 second blocks of 8 × 8 pixels, applying a DCT transform to each second block and, following the Zig-Zag order, taking the low-frequency coefficients at the first 4 positions of each second block to form four second one-dimensional vectors B_i of length 64 each; computing the expectation Mean_i and variance Var_i of each of the four one-dimensional vectors B_i and standardizing each B_i with its Mean_i and Var_i to obtain H_i; concatenating the H_i into a 256-dimensional second one-dimensional vector H, thereby obtaining the second DCT coefficient feature; and directly concatenating the second deep feature with the second DCT coefficient feature to obtain the second fusion feature;
matching each second fusion feature against each first fusion feature, that is: computing the cosine similarity between each second fusion feature and every first fusion feature in the fusion feature database and obtaining, for each second fusion feature, the N first fusion features closest to it; arranging those N closest first fusion features into a similarity match list according to the chronological order of the frames of the query video; taking the first fusion features in the similarity match list as nodes, connecting nodes by an edge when they respect the chronological order of the frames and their time interval is less than M frames, and setting the distance of each edge to 1; using the Floyd-Warshall algorithm to obtain the longest distance between any two nodes; and, given a distance threshold T, when the longest distance between two nodes is greater than the distance threshold T, judging that at least one second video in the video database is the copy source of the query video and obtaining the position of the query video in the copy source, and when the longest distance between two nodes is less than the distance threshold T, judging that the query video is not a copy video.
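The matching computation above can be illustrated by the following minimal sketch, which assumes cosine similarity over the fused vectors and a Floyd-Warshall-style relaxation (with max in place of min) to obtain the longest distance in the match graph; the exact edge rule is one reasonable reading of the claim, and all names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def top_n_matches(query_feature, database_features, n):
    """Indices of the N first fusion features most similar to one second fusion feature."""
    sims = [cosine_similarity(query_feature, f) for f in database_features]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:n]

def longest_match_distance(match_frames, m_frames):
    """match_frames: library frame indices of the similarity match list, in query-frame order.
    One reading of the edge rule: an edge of distance 1 links match i to match j when the
    query order is kept (i < j) and the library frame interval is below M frames.
    The longest node-to-node distance is obtained by Floyd-Warshall-style relaxation with max."""
    n = len(match_frames)
    dist = [[1 if i < j and 0 < match_frames[j] - match_frames[i] < m_frames else -np.inf
             for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                dist[i][j] = max(dist[i][j], dist[i][k] + dist[k][j])
    return max((dist[i][j] for i in range(n) for j in range(n)), default=-np.inf)

# The query video is judged to be a copy when the longest distance exceeds the threshold T.
```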
To realize another object of the invention, the present invention also provides a video copy detection system, comprising:
an original video data set processing module, configured to obtain an original video data set containing a plurality of videos and divide it into a training video set and a video database, the videos in the training video set being first videos and the videos in the video database being second videos, to extract the first key frames of all the first videos of the training video set to form an image training set, and to extract the second key frames of all the second videos of the video database to form an image database; each first key frame in the image training set is an image label Y, and each image label Y is subjected to a copy attack to obtain the first image X corresponding to that image label Y;
a deep feature extraction model acquisition module, configured to construct a training model of the deep feature extraction model, the training model of the deep feature extraction model including a loss-value computation model of a loss function, to input each image label Y and the corresponding first image X into the training model of the deep feature extraction model, and to train it until the loss value of the loss function stops decreasing, thereby obtaining the deep feature extraction model;
a fusion feature database acquisition module, each second key frame in the image database being a second image, configured to pass every second image through the deep feature extraction model to obtain the first deep feature of that second image, to perform a DCT transform on every second image of the image database to obtain its first DCT coefficient feature, to fuse the first deep feature and the first DCT coefficient feature of every second image into its first fusion feature, and to save the first fusion features of all second images of the image database as the fusion feature database;
an online query video detection module, configured to extract the third key frames of the query video, to pass each third key frame of the query video through the deep feature extraction model to obtain a second deep feature, to perform a DCT transform on the third key frame to obtain its second DCT coefficient feature, to fuse the second deep feature and the second DCT coefficient feature into the second fusion feature of the third key frame, and to match each second fusion feature against each first fusion feature to judge whether a video in the video database is the copy source of the query video and, if so, simultaneously obtain the position of the query video in the copy source.
According to the system, the deep feature extraction model acquisition module includes:
a training submodule, configured to construct the training model of the deep feature extraction model, the training model of the deep feature extraction model further including a variational autoencoder neural network model, the variational autoencoder neural network model including an encoder that uses a deep convolutional neural network as its base network and a decoder built from a fully connected neural network and a deconvolutional neural network structure;
a data processing submodule, configured to input each image label Y and the corresponding first image X into the variational autoencoder neural network model, to apply a first preprocessing to each image label Y and the corresponding first image X, and to initialize the parameters of the encoder and the decoder; the training submodule trains the variational autoencoder neural network model until the loss value of the loss function stops decreasing, at which point the deep feature extraction model is obtained.
According to the system, the encoder extracts image features. The deep convolutional neural network of the encoder is a ResNet-101 network, and the encoder consists of an input layer, hidden layers and an output layer, wherein the input layer receives the image label Y and the first image X after the first preprocessing, the hidden layers consist of the ResNet-101 network with its fully connected layer removed, and the output layer includes two fully connected layers that respectively output the means of m Gaussian distributions and the logarithms of the variances of the Gaussian distributions. The output layer outputs a vector Z of dimension m, where the vector Z follows the normal distribution N(mean, exp(varlog)). A random noise ε obeying N(0, 1) is added to the vector Z, and the output of the encoder is then:

Z = mean + exp(varlog / 2) · ε

where mean is the mean of the Gaussian distribution output by the output layer of the encoder and varlog is the logarithm of the variance of the Gaussian distribution output by the output layer of the encoder. The decoder decodes the image features extracted by the encoder; the input of the decoder is the vector Z and the output of the decoder is the vector Ŷ. The first preprocessing normalizes each image label Y and the corresponding first image X, scales the normalized image label Y and first image X to a first size, converts the image label Y and the first image X scaled to the first size to RGB space, and normalizes the pixel values of the image label Y and the first image X converted to RGB space. The loss function is:

Loss = ||Y − Ŷ||^2 − 0.5 · Σ (1 + varlog − mean^2 − exp(varlog))

where the sum runs over the m dimensions, mean is the mean of the Gaussian distribution output by the output layer of the encoder, varlog is the logarithm of the variance of the Gaussian distribution output by the output layer of the encoder, Y is the image label Y, and Ŷ is the output of the decoder.
According to the system, the fusion feature database acquisition module includes:
a second preprocessing submodule, configured to apply a second preprocessing to every second image, the second preprocessing being a size normalization of every second image, in which the second image after normalization is scaled to the first size, the second image scaled to the first size is converted to RGB space, and the pixel values of the second image converted to RGB space are normalized;
a first deep feature acquisition submodule, configured to pass every second image after the second preprocessing through the deep feature extraction model to obtain the m-dimensional first deep feature of that second image;
a third preprocessing submodule, configured to apply a third preprocessing to every second image, the third preprocessing being a size normalization of every second image, in which the second image after normalization is scaled to 64 × 64 pixels;
a first fusion feature acquisition submodule, configured to divide the second image scaled to 64 × 64 pixels into 64 first blocks of 8 × 8 pixels, apply a DCT transform to each first block, take, following the Zig-Zag order, the low-frequency coefficients at the first 4 positions of each first block to form four first one-dimensional vectors A_i of length 64 each, compute the expectation mean_i and variance var_i of each of the four one-dimensional vectors A_i, standardize each A_i with its mean_i and var_i to obtain F_i, concatenate the F_i into a 256-dimensional first one-dimensional vector F to obtain the first DCT coefficient feature, and directly concatenate the first deep feature with the first DCT coefficient feature to obtain the first fusion feature;
a fusion feature database acquisition submodule, configured to save the first fusion features of all second images of the image database as the fusion feature database.
According to the system, the online query video detection module includes:
a third key frame extraction submodule, configured to extract the third key frames of the query video with a key frame algorithm based on shot segmentation, to obtain the third key frame set of the query video;
a fourth preprocessing submodule, configured to apply a fourth preprocessing to each third key frame of the third key frame set, the fourth preprocessing being a size normalization of each third key frame, in which the third key frame after normalization is scaled to the first size, the third key frame scaled to the first size is converted to RGB space, and the pixel values of the third key frame converted to RGB space are normalized;
a second deep feature acquisition submodule, configured to pass the third key frame after the fourth preprocessing through the deep feature extraction model to obtain the m-dimensional second deep feature;
a fifth preprocessing submodule, configured to apply a fifth preprocessing to each third key frame, the fifth preprocessing being a size normalization of each third key frame, in which the third key frame after normalization is scaled to 64 × 64 pixels;
a second fusion feature acquisition submodule, configured to divide the third key frame scaled to 64 × 64 pixels into 64 second blocks of 8 × 8 pixels, apply a DCT transform to each second block, take, following the Zig-Zag order, the low-frequency coefficients at the first 4 positions of each second block to form four second one-dimensional vectors B_i of length 64 each, compute the expectation Mean_i and variance Var_i of each of the four one-dimensional vectors B_i, standardize each B_i with its Mean_i and Var_i to obtain H_i, concatenate the H_i into a 256-dimensional second one-dimensional vector H to obtain the second DCT coefficient feature, and directly concatenate the second deep feature with the second DCT coefficient feature to obtain the second fusion feature;
a judging submodule, configured to match each second fusion feature against each first fusion feature by computing the cosine similarity between each second fusion feature and every first fusion feature in the fusion feature database and obtaining, for each second fusion feature, the N first fusion features closest to it; the N closest first fusion features are arranged into a similarity match list according to the chronological order of the frames of the query video, the first fusion features in the similarity match list are taken as nodes, nodes that respect the chronological order of the frames and whose time interval is less than M frames are connected by an edge, the distance of each edge is set to 1, and the Floyd-Warshall algorithm is used to obtain the longest distance between any two nodes; given a distance threshold T, when the longest distance between two nodes is greater than the distance threshold T it is judged that at least one second video in the video database is the copy source of the query video and the position of the query video in the copy source is obtained, and when the longest distance between two nodes is less than the distance threshold T it is judged that the query video is not a copy video.
In the present invention, the training model of the deep feature extraction model is trained to obtain the deep feature extraction model; the second key frames of the video database and the third key frames of the query video are extracted; each second key frame is passed through the deep feature extraction model to obtain its first deep feature, its first DCT coefficient feature is obtained, and the first deep feature and the first DCT coefficient feature are directly concatenated into the first fusion feature. By combining the deep feature with a traditional hand-crafted feature (the DCT coefficient feature), the method makes up for the fact that video information constructed from a single type of feature cannot comprehensively describe the video content, so the fusion feature is more robust. The first fusion features of all second key frames form the fusion feature database. Each third key frame is passed through the deep feature extraction model to obtain its second deep feature, its second DCT coefficient feature is obtained, and the second deep feature and the second DCT coefficient feature are directly concatenated into the second fusion feature. Each second fusion feature is matched against each first fusion feature to judge whether a video in the video database is the copy source of the query video and, if so, the position of the query video in the copy source is obtained at the same time. The present invention thus uses a deep learning method to extract deep features with stronger discriminative power while fusing a traditional hand-crafted feature (the DCT coefficient feature), which improves the robustness of the video copy detection system to various copy variations and improves the copy detection accuracy.
Detailed description of the invention
Fig. 1 is a schematic diagram of the module structure of the video copy detection system provided by a preferred embodiment of the present invention;
Fig. 2 is a schematic diagram of the structure of the variational autoencoder neural network provided by a preferred embodiment of the present invention;
Fig. 3 is a schematic flow diagram of fusion feature acquisition provided by a preferred embodiment of the present invention;
Fig. 4 is a flowchart provided by a preferred embodiment of the present invention;
Fig. 5 is a flowchart provided by a preferred embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
Referring to FIG. 1 to FIG. 3, the first embodiment of the present invention provides a video copy detection system 100, which includes:
an original video data set processing module 10, configured to obtain an original video data set containing a plurality of videos and divide it into a training video set and a video database, the videos in the training video set being first videos and the videos in the video database being second videos, to extract the first key frames of all the first videos of the training video set to form an image training set, and to extract the second key frames of all the second videos of the video database to form an image database, each first key frame in the image training set being an image label Y and each image label Y being subjected to a copy attack to obtain the first image X corresponding to that image label Y; a deep feature extraction model acquisition module 20, configured to construct a training model of the deep feature extraction model, the training model including a loss-value computation model of a loss function, to input each image label Y and the corresponding first image X into the training model of the deep feature extraction model, and to train it until the loss value of the loss function stops decreasing, thereby obtaining the deep feature extraction model;
a fusion feature database acquisition module 30, each second key frame in the image database being a second image, configured to pass every second image through the deep feature extraction model to obtain the first deep feature of that second image, to perform a DCT transform on every second image of the image database to obtain its first DCT coefficient feature, to fuse the first deep feature and the first DCT coefficient feature of every second image into its first fusion feature, and to save the first fusion features of all second images of the image database as the fusion feature database;
an online query video detection module 40, configured to extract the third key frames of the query video, to pass each third key frame of the query video through the deep feature extraction model to obtain a second deep feature, to perform a DCT transform on the third key frame to obtain its second DCT coefficient feature, to fuse the second deep feature and the second DCT coefficient feature into the second fusion feature of the third key frame, and to match each second fusion feature against each first fusion feature to judge whether a video in the video database is the copy source of the query video and, if so, simultaneously obtain the position of the query video in the copy source.
This embodiment includes an offline training stage and an online detection stage. In the offline training stage, the original video data set is obtained by the original video data set processing module 10. In this embodiment the original video data set comes from the public standard CC_WEB_VIDEO data set and contains 12790 videos; it is divided at a ratio of 2:8, so the training video set has 2558 first videos and the video database has 10232 second videos. The first key frames of the training video set are extracted with a key frame extraction algorithm based on shot segmentation to form the image training set, and the second key frames of the video database are extracted with the shot-segmentation-based key frame extraction algorithm to form the image database; obviously duplicated images in the image training set are removed to reduce the size of the image training set. During training, 20000 image labels Y are randomly selected from the image training set for model training, and each image label Y is subjected to copy attacks such as translation, rotation, scaling, blurring, flipping, blackening of borders, brightness adjustment, contrast adjustment, and caption insertion, forming the first image X corresponding to each image label Y. The deep feature extraction model acquisition module 20 constructs the training model of the deep feature extraction model, the training model including the loss-value computation model of the loss function, which is used to compute the loss value of the loss function; each image label Y and the corresponding first image X are input into the training model of the deep feature extraction model and it is trained until the loss value of the loss function stops decreasing, at which point the deep feature extraction model is obtained. The fusion feature database is obtained by the fusion feature database acquisition module 30. In the online detection stage, the online query video detection module 40 extracts the third key frames of the query video, fuses the second deep feature and the second DCT coefficient feature of each obtained third key frame to obtain the second fusion feature of that third key frame, and matches each second fusion feature against every first fusion feature in the fusion feature database. Because the deep feature and the DCT coefficient feature are combined, the inability of video information constructed from a single type of feature to comprehensively describe the video content is compensated for, and the feature is more robust. Finally, it is judged whether a video in the video database is the copy source of the query video and, if so, the position of the query video in the copy source is obtained at the same time.
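As an illustration of how copy-attacked training pairs (X, Y) could be generated from the image labels, the following sketch applies a few of the attacks listed above with Pillow; the attack parameters and the random-selection strategy are assumptions, since the embodiment only names the attack types.

```python
import random
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

def random_copy_attack(label: Image.Image) -> Image.Image:
    """Build a first image X by applying one randomly chosen copy attack to an image label Y."""
    attacks = [
        lambda im: im.rotate(random.uniform(-15, 15)),                            # rotation (assumed range)
        lambda im: im.resize((im.width // 2, im.height // 2)),                    # scaling
        lambda im: im.filter(ImageFilter.GaussianBlur(radius=2)),                 # blurring
        lambda im: ImageOps.mirror(im),                                           # flipping
        lambda im: ImageEnhance.Brightness(im).enhance(random.uniform(0.5, 1.5)), # brightness adjustment
        lambda im: ImageEnhance.Contrast(im).enhance(random.uniform(0.5, 1.5)),   # contrast adjustment
        lambda im: ImageOps.expand(im, border=20, fill="black"),                  # blackened border (assumed form)
    ]
    return random.choice(attacks)(label)

# Usage: for each image label Y, X = random_copy_attack(Y) forms one (X, Y) training pair.
```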
Referring to FIG. 1 to FIG. 3, in the second embodiment of the present invention, the deep feature extraction model acquisition module 20 includes:
a training submodule 21, configured to construct the training model of the deep feature extraction model, the training model further including a variational autoencoder neural network model, the variational autoencoder neural network model including an encoder that uses a deep convolutional neural network as its base network and a decoder built from a fully connected neural network and a deconvolutional neural network structure;
a data processing submodule 22, configured to input each image label Y and the corresponding first image X into the variational autoencoder neural network model, to apply a first preprocessing to each image label Y and the corresponding first image X, and to initialize the parameters of the encoder and the decoder; the training submodule 21 trains the variational autoencoder neural network model until the loss value of the loss function stops decreasing and stabilizes at a certain value, at which point training stops and the deep feature extraction model is obtained.
In this embodiment, the data processing submodule 22 inputs each image label Y and the corresponding first image X into the variational autoencoder neural network model and applies the first preprocessing to each image label Y and the corresponding first image X. Specifically, the first preprocessing normalizes each image label Y and the corresponding first image X, scales the normalized image label Y and first image X to a first size (in this embodiment the first size is 224 × 224 pixels), converts the image label Y and the first image X scaled to the first size to RGB space, and normalizes the pixel values of the image label Y and the first image X converted to RGB space. The training submodule 21 constructs the training model of the deep feature extraction model, which is a variational autoencoder neural network model; the variational autoencoder neural network model includes the loss-value computation model of the loss function, an encoder that uses a deep convolutional neural network as its base network, and a decoder built from a fully connected neural network and a deconvolutional neural network structure. The encoder extracts image features and uses a ResNet-101 deep neural network pre-trained on the ImageNet data set as its base network; it consists of an input layer, hidden layers and an output layer. The input layer receives the first image X after the first preprocessing, the hidden layers consist of the ResNet-101 network with its fully connected layer removed, and the output layer consists of two parallel fully connected layers Fc6 and Fc7, each with m neurons, which respectively output the means of m Gaussian distributions and the logarithms of their variances; the output vectors of Fc6 and Fc7 are combined to obtain the output vector Z of the output layer, whose dimension is m (in this embodiment m = 800). The vector Z follows the normal distribution N(mean, exp(varlog)), but this sampling operation makes mean and varlog non-differentiable, which would cause the gradient descent algorithm to fail in the back-propagation stage and make the variational autoencoder neural network model untrainable. Therefore a random noise ε obeying N(0, 1) is added to the vector Z, and the output of the encoder becomes:

Z = mean + exp(varlog / 2) · ε

where mean is the mean of the Gaussian distribution output by the output layer of the encoder and varlog is the logarithm of the variance of the Gaussian distribution output by the output layer of the encoder. Before the variational autoencoder neural network model is trained with stochastic gradient descent, the data processing submodule 22 initializes the parameters of the encoder and the decoder: the ResNet-101 part of the encoder is initialized with the weights pre-trained on ImageNet, the convolution kernels of the other convolutional and deconvolutional layers are initialized with a Gaussian of standard deviation 0.01 and their biases are initialized to 0, and the fully connected layers are initialized with random numbers of standard deviation 0.01 with their biases initialized to 0. The learning rate ε is set at the same time; in this embodiment ε = 0.0001. The decoder decodes the image features extracted by the encoder; the input of the decoder is the vector Z and the output of the decoder is the vector Ŷ.
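A minimal sketch of such an encoder is given below, assuming the torchvision ResNet-101 backbone and the initialization described in this embodiment; names such as Encoder, fc6_mean and fc7_varlog are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class Encoder(nn.Module):
    """ResNet-101 backbone (fully connected layer removed) with two parallel heads Fc6/Fc7."""
    def __init__(self, m: int = 800):
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)  # ImageNet pre-training
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        self.fc6_mean = nn.Linear(2048, m)      # means of the m Gaussian distributions
        self.fc7_varlog = nn.Linear(2048, m)    # log-variances of the m Gaussian distributions
        for fc in (self.fc6_mean, self.fc7_varlog):
            nn.init.normal_(fc.weight, std=0.01)    # std-0.01 initialization, bias 0 (per the embodiment)
            nn.init.zeros_(fc.bias)

    def forward(self, x: torch.Tensor):
        h = self.features(x).flatten(1)         # (batch, 2048)
        mean, varlog = self.fc6_mean(h), self.fc7_varlog(h)
        z = mean + torch.exp(0.5 * varlog) * torch.randn_like(mean)  # reparameterization with N(0, 1) noise
        return z, mean, varlog

# The embodiment trains with stochastic gradient descent at learning rate 0.0001,
# e.g. torch.optim.SGD(model.parameters(), lr=1e-4).
```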
Table one
Table one is a schematic table of the structure of the encoder;
Table two
Table two is a schematic table of the structure of the decoder. As described in Table two, the decoder includes a fully connected layer Fc8 containing 25088 neurons, whose output vector is reshaped to 14 × 14 × 128, followed by a BN layer and a rectified linear unit (ReLU) activation function, followed by 4 deconvolution layers; the four deconvolution layers all have 5 × 5 convolution kernels with a stride of 2, the first deconvolution layer outputs 64 feature maps, the second deconvolution layer outputs 32 feature maps, the third deconvolution layer outputs 16 feature maps, and the fourth deconvolution layer outputs 3 feature maps. The variational autoencoder neural network model needs to learn the nonlinear mapping from a first image X that has undergone a copy attack to the original image (i.e. the image label Y); a convolutional neural network therefore has to be trained to solve this problem. To reach this goal, the Euclidean distance and the KL divergence are used to compute the loss in the offline training stage; specifically, the loss function is:

Loss = ||Y − Ŷ||^2 − 0.5 · Σ (1 + varlog − mean^2 − exp(varlog))

where the sum runs over the m dimensions, mean is the mean of the Gaussian distribution output by the output layer of the encoder, varlog is the logarithm of the variance of the Gaussian distribution output by the output layer of the encoder, Y is the image label Y, and Ŷ is the output of the decoder. By measuring the similarity between the output Ŷ of the decoder and the input image label Y and gradually reducing the loss function, a first image X that has undergone a copy attack is brought closer to the original image (i.e. the image label Y) in the feature space. Specifically, in the forward propagation stage of the network the loss value of the loss function is computed and the image deep feature is obtained at the same time; in the back-propagation stage the parameters of the network are updated until the loss value no longer changes, and the deep feature extraction model is finally obtained.
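A sketch of a decoder matching the structure described in Table two follows, under the assumption that each 5 × 5, stride-2 deconvolution doubles the spatial resolution (14 → 28 → 56 → 112 → 224) so that the output matches the 224 × 224 first size; the padding and output-padding values and the ReLU activations between the deconvolution layers are assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Fc8 (25088 neurons) reshaped to 14x14x128, BN + ReLU, then four 5x5 stride-2 deconvolutions."""
    def __init__(self, m: int = 800):
        super().__init__()
        self.fc8 = nn.Linear(m, 25088)                     # 25088 = 14 * 14 * 128
        self.head = nn.Sequential(nn.BatchNorm2d(128), nn.ReLU(inplace=True))
        def deconv(cin, cout):
            return nn.ConvTranspose2d(cin, cout, kernel_size=5, stride=2,
                                      padding=2, output_padding=1)  # doubles height and width
        self.deconvs = nn.Sequential(
            deconv(128, 64), nn.ReLU(inplace=True),        # first deconvolution: 64 feature maps
            deconv(64, 32), nn.ReLU(inplace=True),         # second deconvolution: 32 feature maps
            deconv(32, 16), nn.ReLU(inplace=True),         # third deconvolution: 16 feature maps
            deconv(16, 3),                                 # fourth deconvolution: 3 feature maps (RGB output)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.fc8(z).view(-1, 128, 14, 14)
        return self.deconvs(self.head(x))                  # (batch, 3, 224, 224)
```

Feeding the 800-dimensional vector Z from the encoder through such a decoder yields a 224 × 224 × 3 output Ŷ that is compared with the image label Y by the loss above.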
Referring to FIG. 1 to FIG. 3, in the third embodiment of the present invention, the fusion feature database acquisition module 30 includes:
a second preprocessing submodule 31, configured to apply a second preprocessing to every second image, the second preprocessing being a size normalization of every second image, in which the second image after normalization is scaled to the first size, the second image scaled to the first size is converted to RGB space, and the pixel values of the second image converted to RGB space are normalized;
a first deep feature acquisition submodule 32, configured to pass every second image through the deep feature extraction model to obtain the m-dimensional first deep feature of that second image;
a third preprocessing submodule 33, configured to apply a third preprocessing to every second image, the third preprocessing being a size normalization of every second image, in which the second image after normalization is scaled to 64 × 64 pixels;
a first fusion feature acquisition submodule 34, configured to divide the second image scaled to 64 × 64 pixels into 64 first blocks of 8 × 8 pixels, apply a DCT transform to each first block, take, following the Zig-Zag order, the low-frequency coefficients at the first 4 positions of each first block to form four first one-dimensional vectors A_i of length 64 each, compute the expectation mean_i and variance var_i of each of the four one-dimensional vectors A_i, standardize each A_i with its mean_i and var_i to obtain F_i, concatenate the F_i into a 256-dimensional first one-dimensional vector F to obtain the first DCT coefficient feature, and directly concatenate the first deep feature with the first DCT coefficient feature to obtain the first fusion feature;
a fusion feature database acquisition submodule 35, configured to save the first fusion features of all second images of the image database as the fusion feature database.
In this embodiment, the second preprocessing submodule 31 applies the second preprocessing to every second image: the second image is size-normalized (in this embodiment to 224 × 224 pixels), the second image after normalization is scaled to the first size, the second image scaled to the first size is converted to RGB space, and the pixel values of the second image converted to RGB space are normalized; the normalization of the pixel values means that the per-channel average is subtracted from each RGB channel value of the second image, the RGB three-channel averages being 123.68, 116.78 and 103.94 respectively. The first deep feature acquisition submodule 32 passes every second image after the second preprocessing through the deep feature extraction model to obtain the m-dimensional first deep feature of that second image. The third preprocessing submodule 33 applies the third preprocessing to every second image: the second image is size-normalized and the second image after normalization is scaled to 64 × 64 pixels. The first fusion feature acquisition submodule 34 divides the second image scaled to 64 × 64 pixels into 64 first blocks of 8 × 8 pixels, applies a DCT transform to each first block and, following the Zig-Zag ("Z"-shaped) order, takes the low-frequency coefficients at the first 4 positions of each first block to form four first one-dimensional vectors A_i of length 64 each, computes the expectation mean_i and variance var_i of each of the four one-dimensional vectors A_i, standardizes each A_i with its mean_i and var_i to obtain F_i, concatenates the F_i into a 256-dimensional first one-dimensional vector F to obtain the first DCT coefficient feature, and directly concatenates the first deep feature with the first DCT coefficient feature to obtain the first fusion feature. Finally, the fusion feature database acquisition submodule 35 saves the first fusion features of all second images of the image database as the fusion feature database.
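For illustration, the second preprocessing described above (scaling to 224 × 224, conversion to RGB, per-channel mean subtraction) might look as follows with Pillow and NumPy; treating the stated averages as per-channel RGB means subtracted after resizing is an assumption.

```python
import numpy as np
from PIL import Image

RGB_MEANS = np.array([123.68, 116.78, 103.94], dtype=np.float32)  # per-channel averages from the embodiment

def second_preprocessing(image: Image.Image, first_size: int = 224) -> np.ndarray:
    """Scale a second image to the first size, convert it to RGB and subtract the channel means."""
    rgb = image.convert("RGB").resize((first_size, first_size))
    return np.asarray(rgb, dtype=np.float32) - RGB_MEANS   # (224, 224, 3), zero-centred per channel
```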
Referring to FIG. 1 to FIG. 3, in the fourth embodiment of the present invention, the online query video detection module 40 includes:
a third key frame extraction submodule 41, configured to extract the third key frames of the query video with a key frame algorithm based on shot segmentation, to obtain the third key frame set of the query video;
a fourth preprocessing submodule 42, configured to apply a fourth preprocessing to each third key frame of the third key frame set, the fourth preprocessing being a size normalization of each third key frame, in which the third key frame after normalization is scaled to the first size, the third key frame scaled to the first size is converted to RGB space, and the pixel values of the third key frame converted to RGB space are normalized;
a second deep feature acquisition submodule 43, configured to pass the third key frame after the fourth preprocessing through the deep feature extraction model to obtain the m-dimensional second deep feature;
a fifth preprocessing submodule 44, configured to apply a fifth preprocessing to each third key frame, the fifth preprocessing being a size normalization of each third key frame, in which the third key frame after normalization is scaled to 64 × 64 pixels;
a second fusion feature acquisition submodule 45, configured to divide the third key frame scaled to 64 × 64 pixels into 64 second blocks of 8 × 8 pixels, apply a DCT transform to each second block, take, following the Zig-Zag order, the low-frequency coefficients at the first 4 positions of each second block to form four second one-dimensional vectors B_i of length 64 each, compute the expectation Mean_i and variance Var_i of each of the four one-dimensional vectors B_i, standardize each B_i with its Mean_i and Var_i to obtain H_i, concatenate the H_i into a 256-dimensional second one-dimensional vector H to obtain the second DCT coefficient feature, and directly concatenate the second deep feature with the second DCT coefficient feature to obtain the second fusion feature;
a judging submodule 46, configured to match each second fusion feature against each first fusion feature by computing the cosine similarity between each second fusion feature and every first fusion feature in the fusion feature database and obtaining, for each second fusion feature, the N first fusion features closest to it; the N closest first fusion features are arranged into a similarity match list according to the chronological order of the frames of the query video, the first fusion features in the similarity match list are taken as nodes, nodes that respect the chronological order of the frames and whose time interval is less than M frames are connected by an edge, the distance of each edge is set to 1, and the Floyd-Warshall algorithm is used to obtain the longest distance between any two nodes; given a distance threshold T, when the longest distance between two nodes is greater than the distance threshold T it is judged that at least one second video in the video database is the copy source of the query video and the position of the query video in the copy source is obtained, and when the longest distance between two nodes is less than the distance threshold T it is judged that the query video is not a copy video.
In this embodiment, the Floyd-Warshall algorithm is used to fuse the frame-level matching results into video-segment matching results, giving the copied segments and the localization result of the query video. Specifically, the third key frame extraction submodule 41 extracts the third key frames of the query video with the key frame algorithm based on shot segmentation, obtaining the third key frame set of the query video. The fourth preprocessing submodule 42, the second deep feature acquisition submodule 43, the fifth preprocessing submodule 44 and the second fusion feature acquisition submodule 45 then cooperate to obtain the second fusion features. For each second fusion feature of the query video, the cosine similarity with every first fusion feature in the fusion feature database is computed and the N first fusion features closest to that second fusion feature are found (N = 40 in this embodiment). They are arranged into a similarity match list according to the chronological order of the frames of the query video; the second fusion features in the similarity match list are taken as nodes, nodes that respect the chronological order of the frames and whose time interval is less than M frames are connected by an edge, and the distance of each edge is set to 1 (M = 6 in this embodiment). The Floyd-Warshall algorithm is used to obtain the longest distance between any two nodes; given a distance threshold T, when the longest distance between two nodes is greater than the distance threshold T it is judged that a video in the video database is the copy source of the query video and the position of the query video in the copy source is obtained, and when the longest distance between two nodes is less than the distance threshold T it is judged that the query video is not a copy video. In this embodiment T = 100.
In summary, the variational autoencoder neural network model of the present invention learns the mapping from a copy-attacked first image X to its original image (the image tag Y), so that the first image X after a copy attack lies as close as possible to the original image in the depth feature space, which yields depth features with higher discriminative power. Using a ResNet-101 network pre-trained on the ImageNet data set as the encoder shortens the training time of the variational autoencoder neural network model and, by exploiting the representation ability of the pre-trained ResNet-101 model, improves the representation power of the depth features. Combining the depth feature with the traditional hand-crafted DCT coefficient feature compensates for the inability of a single feature to describe the video content comprehensively, so the fusion feature is more robust; as a result, the present invention substantially improves the accuracy of the video copy detection system.
Referring to Fig. 4 and Fig. 5, the fifth embodiment of the present invention provides a video copy detection method comprising the following steps:
Step S501: obtain an original video data set containing multiple videos and divide it into a training video set and a video database, where a video in the training video set is a first video and a video in the video database is a second video; extract the first key frames of all first videos of the training video set to form an image training set, and extract the second key frames of all second videos of the video database to form an image database; each first key frame in the image training set is an image tag Y, and a copy attack is applied to every image tag Y to obtain the first image X corresponding to that image tag Y. This step is the original video data set processing step and is implemented by the original video data set processing module 10.
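For illustration, a minimal Python sketch of pairing an image tag Y with a copy-attacked first image X. The specific attacks shown (JPEG recompression, blur, brightness change, cropping) are illustrative assumptions; this section only states that a copy attack is applied to every image tag Y.

```python
# Hypothetical copy-attack generator: X = copy_attack(Y).
import io
import random
from PIL import Image, ImageEnhance, ImageFilter

def copy_attack(label_image: Image.Image) -> Image.Image:
    """Apply one randomly chosen, commonly used copy attack to the image."""
    attack = random.choice(["jpeg", "blur", "brightness", "crop"])
    img = label_image.copy()
    if attack == "jpeg":                                  # strong recompression
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=30)
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    elif attack == "blur":                                # Gaussian blur
        img = img.filter(ImageFilter.GaussianBlur(radius=2))
    elif attack == "brightness":                          # brightness shift
        img = ImageEnhance.Brightness(img).enhance(1.5)
    else:                                                 # crop 80% region, resize back
        w, h = img.size
        img = img.crop((int(0.1 * w), int(0.1 * h), int(0.9 * w), int(0.9 * h)))
        img = img.resize((w, h))
    return img
```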
Step S502: construct the training model of the depth feature extraction model, where the training model of the depth feature extraction model includes a loss-value computation model of a loss function; input every image tag Y together with its corresponding first image X into the training model of the depth feature extraction model and train it until the loss value of the loss function stops decreasing, thereby obtaining the depth feature extraction model. This step is the offline step of obtaining the depth feature extraction model and is implemented by the depth feature extraction model acquisition module 20.
Step S503: each second key frame in the image database is a second image; pass every second image through the depth feature extraction model to obtain the first depth feature of that second image; apply a DCT transform to every second image of the image database to obtain its first DCT coefficient feature; fuse the first depth feature and the first DCT coefficient feature of every second image to obtain the first fusion feature of that second image; save the first fusion features of all second images of the image database as the fusion feature database. This step is the step of obtaining the fusion feature database and is implemented by the fusion feature database acquisition module 30.
Step S504: extract the third key frames of the query video and pass each third key frame of the query video through the depth feature extraction model to obtain a second depth feature; apply a DCT transform to the third key frame to obtain its second DCT coefficient feature; fuse the second depth feature and the second DCT coefficient feature to obtain the second fusion feature of the third key frame; match every second fusion feature against every first fusion feature to judge whether a video in the video database is the copy source of the query video and, if so, to obtain at the same time the position of the query video within the copy source. This step is the online query video detection step and is implemented by the online detection query video module 40.
In this embodiment, the original video data set processing module 10 obtains the original video data set containing multiple videos, divides it into the training video set and the video database (a video in the training video set being a first video and a video in the video database being a second video), extracts the first key frames of all first videos of the training video set to form the image training set, extracts the second key frames of all second videos of the video database to form the image database, takes each first key frame in the image training set as an image tag Y, and applies a copy attack to every image tag Y to obtain the corresponding first image X. The depth feature extraction model acquisition module 20 constructs the training model of the depth feature extraction model, whose training model includes the loss function, inputs every image tag Y and its corresponding first image X into the training model, and trains it until the loss value of the loss function stops decreasing, thereby obtaining the depth feature extraction model. The fusion feature database acquisition module 30 takes each second key frame in the image database as a second image, passes every second image through the depth feature extraction model to obtain its first depth feature, applies a DCT transform to every second image of the image database to obtain its first DCT coefficient feature, fuses the first depth feature and the first DCT coefficient feature of every second image to obtain its first fusion feature, and saves the first fusion features of all second images of the image database as the fusion feature database. The online detection query video module 40 extracts the third key frames of the query video, passes each third key frame through the depth feature extraction model to obtain a second depth feature, applies a DCT transform to the third key frame to obtain its second DCT coefficient feature, fuses the second depth feature and the second DCT coefficient feature to obtain the second fusion feature of the third key frame, and matches every second fusion feature against every first fusion feature to judge whether a video in the video database is the copy source of the query video and, if so, to obtain at the same time the position of the query video within the copy source.
In the sixth embodiment of the present invention, step S502 comprises: constructing the training model of the depth feature extraction model, where the training model of the depth feature extraction model further includes a variational autoencoder neural network model; the variational autoencoder neural network model includes an encoder that uses a deep convolutional neural network as its backbone network and a decoder built from a fully connected neural network and a deconvolution neural network. This step is implemented by the training submodule 21. Specifically, the encoder extracts image features; the deep convolutional neural network of the encoder is a ResNet-101 network, and the encoder consists of an input layer, a hidden layer and an output layer. The input layer receives the first images X after the first preprocessing, the hidden layer is formed by the ResNet-101 network with its fully connected layer removed, and the output layer consists of two fully connected layers that respectively output the mean of m Gaussian distributions and the logarithm of their variance. The output layer outputs a vector Z whose dimension is m, and the vector Z obeys the normal distribution N(mean, exp(varlog)). A random noise ε obeying the normal distribution N(0, 1) is added to the vector Z, and the output of the encoder is then Z = mean + exp(varlog / 2) · ε, where mean is the mean of the Gaussian distributions output by the output layer of the encoder and varlog is the logarithm of the variance of the Gaussian distributions output by the output layer of the encoder. The decoder decodes the image features extracted by the encoder: the input of the decoder is the vector Z, and the decoder outputs a vector Ŷ.
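For illustration, a minimal PyTorch sketch of such an encoder: a ResNet-101 backbone with the final fully connected layer removed, two linear heads for the Gaussian mean and log-variance, and the reparameterization Z = mean + exp(varlog / 2) · ε. The latent size m, the layer widths and the use of torchvision's pre-trained weights are assumptions, not values stated in the disclosure.

```python
# Sketch of the variational-autoencoder encoder with a ResNet-101 backbone.
import torch
import torch.nn as nn
from torchvision import models

class VAEEncoder(nn.Module):
    def __init__(self, m: int = 512):                    # m is assumed
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop final FC
        self.fc_mean = nn.Linear(2048, m)                 # mean of the m Gaussians
        self.fc_varlog = nn.Linear(2048, m)               # log-variance of the m Gaussians

    def forward(self, x: torch.Tensor):
        h = self.features(x).flatten(1)                   # (B, 2048) pooled features
        mean, varlog = self.fc_mean(h), self.fc_varlog(h)
        eps = torch.randn_like(mean)                      # ε ~ N(0, 1)
        z = mean + torch.exp(0.5 * varlog) * eps          # Z = mean + exp(varlog/2) · ε
        return z, mean, varlog
```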
Every image tag Y and its corresponding first image X are input into the variational autoencoder neural network model, and the first preprocessing is applied to every image tag Y and its corresponding first image X. Specifically, the first preprocessing normalizes the size of every image tag Y and its corresponding first image X, scales the normalized image tag Y and first image X to a first size, converts the image tag Y and the first image X scaled to the first size into the RGB space, and normalizes the pixel values of the image tag Y and the first image X converted into the RGB space. The parameters of the encoder and the decoder are initialized and the variational autoencoder neural network model is trained until the loss value of the loss function stops decreasing, at which point the depth feature extraction model is obtained. Specifically, the loss function is the variational autoencoder objective consisting of a reconstruction term and a KL-divergence term, Loss = ||Y − Ŷ||² + ½ Σ (exp(varlog) + mean² − 1 − varlog), where mean is the mean of the Gaussian distributions output by the output layer of the encoder, varlog is the logarithm of the variance of the Gaussian distributions output by the output layer of the encoder, Y is the image tag Y, and Ŷ is the output of the decoder. This step is implemented by the data processing submodule 22.
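A short sketch of this training loss, assuming the standard variational-autoencoder form given above (mean-squared reconstruction error between the decoder output and the image tag Y plus the KL divergence between N(mean, exp(varlog)) and N(0, 1)); the reduction over the batch is an implementation assumption.

```python
# Sketch of the VAE loss: reconstruction term plus KL-divergence term.
import torch

def vae_loss(y_hat: torch.Tensor, y: torch.Tensor,
             mean: torch.Tensor, varlog: torch.Tensor) -> torch.Tensor:
    recon = torch.mean((y - y_hat) ** 2)                               # ||Y - Y_hat||^2 term
    kl = 0.5 * torch.mean(torch.sum(torch.exp(varlog) + mean ** 2
                                    - 1.0 - varlog, dim=1))            # KL(N(mean, exp(varlog)) || N(0, 1))
    return recon + kl
```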
In the seventh embodiment of the present invention, step S503 comprises: applying a second preprocessing to every second image, where the second preprocessing normalizes the size of every second image, scales the normalized second image to the first size, converts the second image scaled to the first size into the RGB space, and normalizes the pixel values of the second image converted into the RGB space. This step is implemented by the second preprocessing submodule 31.
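A minimal sketch of this size, colour-space and pixel-value preprocessing; the concrete first size (224 × 224) and the scaling of pixel values to [0, 1] are assumptions, since the disclosure does not fix these values.

```python
# Sketch of the second preprocessing: resize, RGB conversion, pixel normalisation.
import numpy as np
from PIL import Image

FIRST_SIZE = (224, 224)     # assumed value of the "first size"

def preprocess(image: Image.Image) -> np.ndarray:
    img = image.resize(FIRST_SIZE).convert("RGB")          # size normalisation + RGB space
    return np.asarray(img, dtype=np.float32) / 255.0        # pixel-value normalisation
```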
Every second image after the second preprocessing is passed through the depth feature extraction model to obtain the m-dimensional first depth feature of that second image. This step is implemented by the first depth feature acquisition submodule 32.
A third preprocessing is applied to every second image: the third preprocessing normalizes the size of every second image and scales the normalized second image to 64 × 64 pixels. This step is implemented by the third preprocessing submodule 33.
The second image scaled to 64 × 64 pixels is divided into 64 first blocks of 8 × 8 pixels; a DCT transform is applied to each first block and, in Zig-Zag order, the low-frequency coefficients at the first 4 positions of each first block are taken, forming four first one-dimensional vectors A_i of length 64 each; the expectation mean_i and variance var_i of each group of one-dimensional vectors A_i are computed and each A_i is standardized as F_i = (A_i − mean_i) / var_i; the four vectors F_i are then concatenated into a 256-dimensional first one-dimensional vector F, which constitutes the first DCT coefficient feature; the first depth feature and the first DCT coefficient feature are directly spliced to obtain the first fusion feature. This step is implemented by the first fusion feature acquisition submodule 34.
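For illustration, a Python sketch of this 256-dimensional block-DCT feature and the direct splicing with the depth feature. Using OpenCV's cv2.dct, a greyscale float input and the division by the variance (as written above, rather than the standard deviation) are implementation assumptions.

```python
# Sketch of the first DCT coefficient feature and the feature fusion by concatenation.
import cv2
import numpy as np

# First four Zig-Zag positions of an 8x8 block: (0,0), (0,1), (1,0), (2,0).
ZIGZAG_4 = [(0, 0), (0, 1), (1, 0), (2, 0)]

def dct_feature(image_64x64: np.ndarray) -> np.ndarray:
    """image_64x64: float32 greyscale image of shape (64, 64)."""
    coeffs = [[] for _ in range(4)]
    for by in range(0, 64, 8):
        for bx in range(0, 64, 8):
            block = cv2.dct(np.ascontiguousarray(image_64x64[by:by + 8, bx:bx + 8]))
            for k, (r, c) in enumerate(ZIGZAG_4):
                coeffs[k].append(block[r, c])
    parts = []
    for k in range(4):                                    # standardise each 64-d vector
        a = np.asarray(coeffs[k], dtype=np.float32)
        parts.append((a - a.mean()) / (a.var() + 1e-8))   # divide by variance, per the description
    return np.concatenate(parts)                          # 256-dimensional DCT coefficient feature

def fuse(depth_feature: np.ndarray, dct_feat: np.ndarray) -> np.ndarray:
    """Direct splicing of the depth feature and the DCT coefficient feature."""
    return np.concatenate([depth_feature, dct_feat])
```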
The first fusion features of all second images of the image database are saved as the fusion feature database. This step is implemented by the fusion feature database acquisition submodule 35.
In the eighth embodiment of the present invention, step S504 includes:
The third key frames are extracted from the query video with a shot-segmentation-based key frame algorithm, yielding the third key frame set of the query video. This step is implemented by the third key frame extraction submodule 41.
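For illustration, a hypothetical sketch of shot-segmentation-based key frame extraction: shot boundaries are detected from colour-histogram differences between adjacent frames and the middle frame of each shot is kept as a key frame. The histogram criterion and threshold are assumptions; the disclosure only states that a shot-segmentation-based key frame algorithm is used.

```python
# Sketch of key frame extraction via histogram-based shot boundary detection.
import cv2

def extract_key_frames(video_path: str, threshold: float = 0.5):
    cap = cv2.VideoCapture(video_path)
    frames, hists = [], []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        h = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hists.append(cv2.normalize(h, h).flatten())
        ok, frame = cap.read()
    cap.release()
    # a shot boundary is declared where adjacent histograms differ strongly
    cuts = [0] + [i for i in range(1, len(frames))
                  if cv2.compareHist(hists[i - 1], hists[i],
                                     cv2.HISTCMP_BHATTACHARYYA) > threshold]
    cuts.append(len(frames))
    # keep the middle frame of every shot as its key frame
    return [frames[(cuts[k] + cuts[k + 1]) // 2]
            for k in range(len(cuts) - 1) if cuts[k + 1] > cuts[k]]
```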
A fourth preprocessing is applied to every third key frame in the third key frame set: the fourth preprocessing normalizes the size of each third key frame, scales the normalized third key frame to the first size, converts the third key frame scaled to the first size into the RGB space, and normalizes the pixel values of the third key frame converted into the RGB space. This step is implemented by the fourth preprocessing submodule 42.
The third key frame after the fourth preprocessing is passed through the depth feature extraction model to obtain the m-dimensional second depth feature. This step is implemented by the second depth feature acquisition submodule 43.
A fifth preprocessing is applied to every third key frame: the fifth preprocessing normalizes the size of each third key frame and scales the normalized third key frame to 64 × 64 pixels. This step is implemented by the fifth preprocessing submodule 44.
The third key frame picture scaled to 64 × 64 pixels is divided into 64 second blocks of 8 × 8 pixels; a DCT transform is applied to each second block and, in Zig-Zag order, the low-frequency coefficients at the first 4 positions of each second block are taken, forming four second one-dimensional vectors B_i of length 64 each; the expectation Mean_i and variance Var_i of each group of one-dimensional vectors B_i are computed and each B_i is standardized as H_i = (B_i − Mean_i) / Var_i; the four vectors H_i are then concatenated into a 256-dimensional second one-dimensional vector H, which constitutes the second DCT coefficient feature; the second depth feature and the second DCT coefficient feature are directly spliced to obtain the second fusion feature. This step is implemented by the second fusion feature acquisition submodule 45.
Matching every second fusion feature against every first fusion feature means computing the cosine similarity between every second fusion feature and every first fusion feature in the fusion feature database and obtaining, for each second fusion feature, the N first fusion features closest to it; the N closest first fusion features are arranged into a similarity match list according to the frame time order of the query video. Each first fusion feature in the similarity match list is taken as a node, nodes that satisfy the frame time order and whose time interval is less than M frames are connected by an edge, and the distance of each edge is set to 1. The longest distance between any two nodes is obtained with the Floyd-Warshall algorithm and compared with a distance threshold T: when the longest distance between two nodes is greater than the distance threshold T, a video in the video database is judged to be the copy source of the query video and the position of the query video within the copy source is obtained; when the longest distance between any two nodes is less than the distance threshold T, the query video is judged not to be a copy video. This step is implemented by the judging submodule 46.
In conclusion, based on a large number of training samples consisting of copy-attacked first images X and image tags Y, the present invention uses a variational autoencoder neural network model to automatically extract feature maps with high discriminative power, and at the same time applies transfer learning by integrating a ResNet-101 network trained on the ImageNet data set into the encoder, whose output is a depth feature with strong discriminative power. Combining the depth feature with the DCT coefficient feature compensates for the inability of a single feature to describe the video content comprehensively and makes the feature more robust. In the offline training stage, the copy-attacked first images X and the image tags Y (i.e., the original images) are input into the variational autoencoder neural network model, which learns from a large number of different training samples to obtain the depth feature extraction model; the trained depth feature extraction model then extracts the first depth features of the second key frames of all videos in the video database, their first DCT coefficient features are extracted at the same time, and the two features are fused and saved as the fusion feature database. In the online detection stage, the depth feature extraction model extracts the depth features of the third key frames of the query video together with their DCT coefficient features; the fused second depth features and second DCT coefficient features are matched against the fusion feature database, and the Floyd-Warshall algorithm finally determines the copy detection result of the query video. The present invention thus uses a deep learning method to extract depth features with strong discriminative power while fusing a traditional hand-crafted feature (the DCT coefficient feature), which improves the robustness of the video copy detection system to various copy transformations and increases the copy detection accuracy.
Of course, the present invention may also have various other embodiments; those skilled in the art may make various corresponding changes and modifications according to the present invention without departing from its spirit and essence, but all such corresponding changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims (10)

1. A video copy detection method, characterized in that the method comprises the following steps:
an original video data set processing step: obtaining an original video data set containing multiple videos and dividing it into a training video set and a video database, where a video in the training video set is a first video and a video in the video database is a second video; extracting the first key frames of all first videos of the training video set to form an image training set, and extracting the second key frames of all second videos of the video database to form an image database; each first key frame in the image training set is an image tag Y, and a copy attack is applied to every image tag Y to obtain the first image X corresponding to that image tag Y;
an offline step of obtaining a depth feature extraction model: constructing a training model of the depth feature extraction model, where the training model of the depth feature extraction model includes a loss-value computation model of a loss function; inputting every image tag Y and its corresponding first image X into the training model of the depth feature extraction model and training it until the loss value of the loss function stops decreasing, thereby obtaining the depth feature extraction model;
a step of obtaining a fusion feature database: each second key frame in the image database is a second image; every second image is passed through the depth feature extraction model to obtain the first depth feature of that second image; a DCT transform is applied to every second image of the image database to obtain its first DCT coefficient feature; the first depth feature and the first DCT coefficient feature of every second image are fused to obtain the first fusion feature of that second image; the first fusion features of all second images of the image database are saved as the fusion feature database;
an online query video detection step: extracting the third key frames of the query video and passing each third key frame of the query video through the depth feature extraction model to obtain a second depth feature; applying a DCT transform to the third key frame to obtain its second DCT coefficient feature; fusing the second depth feature and the second DCT coefficient feature to obtain the second fusion feature of the third key frame; matching every second fusion feature against every first fusion feature to judge whether a video in the video database is the copy source of the query video and, if so, obtaining at the same time the position of the query video within the copy source.
2. The method according to claim 1, characterized in that the offline step of obtaining the depth feature extraction model comprises: constructing the training model of the depth feature extraction model, where the training model of the depth feature extraction model includes a variational autoencoder neural network model, the variational autoencoder neural network model including an encoder that uses a deep convolutional neural network as its backbone network and a decoder built from a fully connected neural network and a deconvolution neural network; inputting every image tag Y and its corresponding first image X into the variational autoencoder neural network model and applying a first preprocessing to every image tag Y and its corresponding first image X; initializing the parameters of the encoder and the decoder and training the variational autoencoder neural network model until the loss value of the loss function stops decreasing, thereby obtaining the depth feature extraction model.
3. The method according to claim 2, characterized in that the encoder extracts image features; the deep convolutional neural network of the encoder is a ResNet-101 network, and the encoder consists of an input layer, a hidden layer and an output layer, where the input layer receives the first images X after the first preprocessing, the hidden layer is formed by the ResNet-101 network with its fully connected layer removed, and the output layer includes two fully connected layers that respectively output the mean of m Gaussian distributions and the logarithm of their variance; the output layer outputs a vector Z whose dimension is m, the vector Z obeys the normal distribution N(mean, exp(varlog)), a random noise ε obeying the normal distribution N(0, 1) is added to the vector Z, and the output of the encoder is then Z = mean + exp(varlog / 2) · ε, where mean is the mean of the Gaussian distributions output by the output layer of the encoder and varlog is the logarithm of the variance of the Gaussian distributions output by the output layer of the encoder; the decoder decodes the image features extracted by the encoder, the input of the decoder is the vector Z, and the decoder outputs a vector Ŷ; the first preprocessing normalizes the size of every image tag Y and its corresponding first image X, scales the normalized image tag Y and first image X to a first size, converts the image tag Y and the first image X scaled to the first size into the RGB space, and normalizes the pixel values of the image tag Y and the first image X converted into the RGB space; the loss function is Loss = ||Y − Ŷ||² + ½ Σ (exp(varlog) + mean² − 1 − varlog), where mean is the mean of the Gaussian distributions output by the output layer of the encoder, varlog is the logarithm of the variance of the Gaussian distributions output by the output layer of the encoder, Y is the image tag Y, and Ŷ is the output of the decoder.
4. The method according to claim 3, characterized in that the step of obtaining the fusion feature database comprises: applying a second preprocessing to every second image, where the second preprocessing normalizes the size of every second image, scales the normalized second image to the first size, converts the second image scaled to the first size into the RGB space, and normalizes the pixel values of the second image converted into the RGB space;
passing every second image after the second preprocessing through the depth feature extraction model to obtain the m-dimensional first depth feature of that second image;
applying a third preprocessing to every second image, where the third preprocessing normalizes the size of every second image and scales the normalized second image to 64 × 64 pixels;
dividing the second image scaled to 64 × 64 pixels into 64 first blocks of 8 × 8 pixels, applying a DCT transform to each first block and taking, in Zig-Zag order, the low-frequency coefficients at the first 4 positions of each first block, forming four first one-dimensional vectors A_i of length 64 each; computing the expectation mean_i and variance var_i of each group of one-dimensional vectors A_i and standardizing each A_i as F_i = (A_i − mean_i) / var_i; concatenating the four vectors F_i into a 256-dimensional first one-dimensional vector F to obtain the first DCT coefficient feature, and directly splicing the first depth feature and the first DCT coefficient feature to obtain the first fusion feature;
saving the first fusion features of all second images of the image database as the fusion feature database.
5. The method according to claim 4, characterized in that the online query video detection step comprises:
extracting the third key frames from the query video with a shot-segmentation-based key frame algorithm to obtain the third key frame set of the query video;
applying a fourth preprocessing to every third key frame in the third key frame set, where the fourth preprocessing normalizes the size of each third key frame, scales the normalized third key frame to the first size, converts the third key frame scaled to the first size into the RGB space, and normalizes the pixel values of the third key frame converted into the RGB space;
passing the third key frame after the fourth preprocessing through the depth feature extraction model to obtain the m-dimensional second depth feature;
applying a fifth preprocessing to every third key frame, where the fifth preprocessing normalizes the size of each third key frame and scales the normalized third key frame to 64 × 64 pixels;
dividing the third key frame picture scaled to 64 × 64 pixels into 64 second blocks of 8 × 8 pixels, applying a DCT transform to each second block and taking, in Zig-Zag order, the low-frequency coefficients at the first 4 positions of each second block, forming four second one-dimensional vectors B_i of length 64 each; computing the expectation Mean_i and variance Var_i of each group of one-dimensional vectors B_i and standardizing each B_i as H_i = (B_i − Mean_i) / Var_i; concatenating the four vectors H_i into a 256-dimensional second one-dimensional vector H to obtain the second DCT coefficient feature, and directly splicing the second depth feature and the second DCT coefficient feature to obtain the second fusion feature;
matching every second fusion feature against every first fusion feature by computing the cosine similarity between every second fusion feature and every first fusion feature in the fusion feature database and obtaining, for each second fusion feature, the N first fusion features closest to it; arranging the N closest first fusion features into a similarity match list according to the frame time order of the query video; taking each first fusion feature in the similarity match list as a node, connecting nodes that satisfy the frame time order and whose time interval is less than M frames by an edge with the distance of each edge set to 1, obtaining the longest distance between any two nodes with the Floyd-Warshall algorithm, and setting a distance threshold T: when the longest distance between two nodes is greater than the distance threshold T, at least one second video in the video database is judged to be the copy source of the query video and the position of the query video within the copy source is obtained; when the longest distance between any two nodes is less than the distance threshold T, the query video is judged not to be a copy video.
6. A video copy detection system, characterized by comprising:
an original video data set processing module, configured to obtain an original video data set containing multiple videos and divide it into a training video set and a video database, where a video in the training video set is a first video and a video in the video database is a second video, to extract the first key frames of all first videos of the training video set to form an image training set, and to extract the second key frames of all second videos of the video database to form an image database; each first key frame in the image training set is an image tag Y, and a copy attack is applied to every image tag Y to obtain the first image X corresponding to that image tag Y;
a depth feature extraction model acquisition module, configured to construct a training model of the depth feature extraction model, where the training model of the depth feature extraction model includes a loss-value computation model of a loss function, and to input every image tag Y and its corresponding first image X into the training model of the depth feature extraction model and train it until the loss value of the loss function stops decreasing, thereby obtaining the depth feature extraction model;
a fusion feature database acquisition module, where each second key frame in the image database is a second image, configured to pass every second image through the depth feature extraction model to obtain the first depth feature of that second image, to apply a DCT transform to every second image of the image database to obtain its first DCT coefficient feature, to fuse the first depth feature and the first DCT coefficient feature of every second image to obtain the first fusion feature of that second image, and to save the first fusion features of all second images of the image database as the fusion feature database;
an online detection query video module, configured to extract the third key frames of the query video, to pass each third key frame of the query video through the depth feature extraction model to obtain a second depth feature, to apply a DCT transform to the third key frame to obtain its second DCT coefficient feature, to fuse the second depth feature and the second DCT coefficient feature to obtain the second fusion feature of the third key frame, and to match every second fusion feature against every first fusion feature to judge whether a video in the video database is the copy source of the query video and, if so, to obtain at the same time the position of the query video within the copy source.
7. The system according to claim 6, characterized in that the depth feature extraction model acquisition module comprises:
a training submodule, configured to construct the training model of the depth feature extraction model, where the training model of the depth feature extraction model further includes a variational autoencoder neural network model, the variational autoencoder neural network model including an encoder that uses a deep convolutional neural network as its backbone network and a decoder built from a fully connected neural network and a deconvolution neural network;
a data processing submodule, configured to input every image tag Y and its corresponding first image X into the variational autoencoder neural network model and to apply a first preprocessing to every image tag Y and its corresponding first image X; the parameters of the encoder and the decoder are initialized, and the training submodule trains the variational autoencoder neural network model until the loss value of the loss function stops decreasing, thereby obtaining the depth feature extraction model.
8. The system according to claim 7, characterized in that the encoder extracts image features; the deep convolutional neural network of the encoder is a ResNet-101 network, and the encoder consists of an input layer, a hidden layer and an output layer, where the input layer receives the first images X after the first preprocessing, the hidden layer is formed by the ResNet-101 network with its fully connected layer removed, and the output layer includes two fully connected layers that respectively output the mean of m Gaussian distributions and the logarithm of their variance; the output layer outputs a vector Z whose dimension is m, the vector Z obeys the normal distribution N(mean, exp(varlog)), a random noise ε obeying N(0, 1) is added to the vector Z, and the output of the encoder is then Z = mean + exp(varlog / 2) · ε, where mean is the mean of the Gaussian distributions output by the output layer of the encoder and varlog is the logarithm of the variance of the Gaussian distributions output by the output layer of the encoder; the decoder decodes the image features extracted by the encoder, the input of the decoder is the vector Z, and the decoder outputs a vector Ŷ; the first preprocessing normalizes the size of every image tag Y and its corresponding first image X, scales the normalized image tag Y and first image X to a first size, converts the image tag Y and the first image X scaled to the first size into the RGB space, and normalizes the pixel values of the image tag Y and the first image X converted into the RGB space; the loss function is Loss = ||Y − Ŷ||² + ½ Σ (exp(varlog) + mean² − 1 − varlog), where mean is the mean of the Gaussian distributions output by the output layer of the encoder, varlog is the logarithm of the variance of the Gaussian distributions output by the output layer of the encoder, Y is the image tag Y, and Ŷ is the output of the decoder.
9. The system according to claim 8, characterized in that the fusion feature database acquisition module comprises:
a second preprocessing submodule, configured to apply a second preprocessing to every second image, where the second preprocessing normalizes the size of every second image, scales the normalized second image to the first size, converts the second image scaled to the first size into the RGB space, and normalizes the pixel values of the second image converted into the RGB space;
a first depth feature acquisition submodule, configured to pass every second image after the second preprocessing through the depth feature extraction model to obtain the m-dimensional first depth feature of that second image;
a third preprocessing submodule, configured to apply a third preprocessing to every second image, where the third preprocessing normalizes the size of every second image and scales the normalized second image to 64 × 64 pixels;
a first fusion feature acquisition submodule, configured to divide the second image scaled to 64 × 64 pixels into 64 first blocks of 8 × 8 pixels, to apply a DCT transform to each first block and take, in Zig-Zag order, the low-frequency coefficients at the first 4 positions of each first block, forming four first one-dimensional vectors A_i of length 64 each, to compute the expectation mean_i and variance var_i of each group of one-dimensional vectors A_i and standardize each A_i as F_i = (A_i − mean_i) / var_i, to concatenate the four vectors F_i into a 256-dimensional first one-dimensional vector F to obtain the first DCT coefficient feature, and to directly splice the first depth feature and the first DCT coefficient feature to obtain the first fusion feature;
a fusion feature database acquisition submodule, configured to save the first fusion features of all second images of the image database as the fusion feature database.
10. The system according to claim 9, characterized in that the online detection query video module comprises:
a third key frame extraction submodule, configured to extract the third key frames from the query video with a shot-segmentation-based key frame algorithm to obtain the third key frame set of the query video;
a fourth preprocessing submodule, configured to apply a fourth preprocessing to every third key frame in the third key frame set, where the fourth preprocessing normalizes the size of each third key frame, scales the normalized third key frame to the first size, converts the third key frame scaled to the first size into the RGB space, and normalizes the pixel values of the third key frame converted into the RGB space;
a second depth feature acquisition submodule, configured to pass the third key frame after the fourth preprocessing through the depth feature extraction model to obtain the m-dimensional second depth feature;
a fifth preprocessing submodule, configured to apply a fifth preprocessing to every third key frame, where the fifth preprocessing normalizes the size of each third key frame and scales the normalized third key frame to 64 × 64 pixels;
a second fusion feature acquisition submodule, configured to divide the third key frame picture scaled to 64 × 64 pixels into 64 second blocks of 8 × 8 pixels, to apply a DCT transform to each second block and take, in Zig-Zag order, the low-frequency coefficients at the first 4 positions of each second block, forming four second one-dimensional vectors B_i of length 64 each, to compute the expectation Mean_i and variance Var_i of each group of one-dimensional vectors B_i and standardize each B_i as H_i = (B_i − Mean_i) / Var_i, to concatenate the four vectors H_i into a 256-dimensional second one-dimensional vector H to obtain the second DCT coefficient feature, and to directly splice the second depth feature and the second DCT coefficient feature to obtain the second fusion feature;
a judging submodule, configured to match every second fusion feature against every first fusion feature by computing the cosine similarity between every second fusion feature and every first fusion feature in the fusion feature database and obtaining, for each second fusion feature, the N first fusion features closest to it; the N closest first fusion features are arranged into a similarity match list according to the frame time order of the query video; each first fusion feature in the similarity match list is taken as a node, nodes that satisfy the frame time order and whose time interval is less than M frames are connected by an edge with the distance of each edge set to 1, the longest distance between any two nodes is obtained with the Floyd-Warshall algorithm, and a distance threshold T is set: when the longest distance between two nodes is greater than the distance threshold T, at least one second video in the video database is judged to be the copy source of the query video and the position of the query video within the copy source is obtained; when the longest distance between any two nodes is less than the distance threshold T, the query video is judged not to be a copy video.
CN201811353711.1A 2018-11-14 2018-11-14 Video copying detection method and its system Pending CN109543735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811353711.1A CN109543735A (en) 2018-11-14 2018-11-14 Video copying detection method and its system

Publications (1)

Publication Number Publication Date
CN109543735A true CN109543735A (en) 2019-03-29

Family

ID=65847341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811353711.1A Pending CN109543735A (en) 2018-11-14 2018-11-14 Video copying detection method and its system

Country Status (1)

Country Link
CN (1) CN109543735A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000465A1 (en) * 2015-07-01 2017-01-05 中国矿业大学 Method for real-time selection of key frames when mining wireless distributed video coding
CN106649663A (en) * 2016-12-14 2017-05-10 大连理工大学 Video copy detection method based on compact video representation
CN108427925A (en) * 2018-03-12 2018-08-21 中国人民解放军国防科技大学 Copy video detection method based on continuous copy frame sequence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YU-GANG JIANG; JIAJUN WANG: "Partial Copy Detection in Videos: A Benchmark and an Evaluation of Popular Methods", 《IEEE TRANSACTIONS ON BIG DATA》 *
石慧杰 et al., "基于深度哈希的多模态视频拷贝检测方法" (Multimodal video copy detection method based on deep hashing), 《广播电视信息》 (Radio and Television Information) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111866443A (en) * 2019-04-25 2020-10-30 黄河 Video stream data storage method, device, system and storage medium
WO2021007999A1 (en) * 2019-07-18 2021-01-21 腾讯音乐娱乐科技(深圳)有限公司 Video frame processing method and device
CN110503110A (en) * 2019-08-12 2019-11-26 北京影谱科技股份有限公司 Feature matching method and device
CN110930354A (en) * 2019-10-09 2020-03-27 王英婷 Video picture content analysis system for smooth transition of image big data
CN110826627A (en) * 2019-11-06 2020-02-21 广东三维家信息科技有限公司 Image similarity measuring method and device and electronic equipment
CN111325316A (en) * 2020-01-19 2020-06-23 深圳云天励飞技术有限公司 Training data generation method and device
CN111314331A (en) * 2020-02-05 2020-06-19 北京中科研究院 Unknown network attack detection method based on conditional variation self-encoder
CN111539929A (en) * 2020-04-21 2020-08-14 北京奇艺世纪科技有限公司 Copyright detection method and device and electronic equipment
CN111966859A (en) * 2020-08-27 2020-11-20 司马大大(北京)智能***有限公司 Video data processing method and device and readable storage medium
CN112633390A (en) * 2020-12-29 2021-04-09 重庆科技学院 Artemisinin purification degree analysis method based on Bayesian probability optimization
CN112633390B (en) * 2020-12-29 2022-05-20 重庆科技学院 Artemisinin purification degree analysis method based on Bayesian probability optimization
CN112926598A (en) * 2021-03-08 2021-06-08 南京信息工程大学 Image copy detection method based on residual error domain deep learning characteristics
CN113177529A (en) * 2021-05-27 2021-07-27 腾讯音乐娱乐科技(深圳)有限公司 Method, device and equipment for identifying screen splash and storage medium
CN113177529B (en) * 2021-05-27 2024-04-23 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for identifying screen

Similar Documents

Publication Publication Date Title
CN109543735A (en) Video copying detection method and its system
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN111209952B (en) Underwater target detection method based on improved SSD and migration learning
CN111445459B (en) Image defect detection method and system based on depth twin network
CN110378334B (en) Natural scene text recognition method based on two-dimensional feature attention mechanism
Payne et al. Indoor vs. outdoor scene classification in digital photographs
CN108256426A (en) A kind of facial expression recognizing method based on convolutional neural networks
CN111768388A (en) Product surface defect detection method and system based on positive sample reference
Kadam et al. [Retracted] Efficient Approach towards Detection and Identification of Copy Move and Image Splicing Forgeries Using Mask R‐CNN with MobileNet V1
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN111091109A (en) Method, system and equipment for predicting age and gender based on face image
CN110853074A (en) Video target detection network system for enhancing target by utilizing optical flow
CN108427925A (en) Copy video detection method based on continuous copy frame sequence
CN111597978B (en) Method for automatically generating pedestrian re-identification picture based on StarGAN network model
CN116012232A (en) Image processing method and device, storage medium and electronic equipment
CN111292308A (en) Convolutional neural network-based infrared defect detection method for photovoltaic solar panel
CN111401368B (en) News video title extraction method based on deep learning
CN109871790A (en) A kind of video decolorizing method based on hybrid production style
CN114743217A (en) Pedestrian identification method based on local feature perception image-text cross-modal model and model training method
Stentiford Attention-based vanishing point detection
CN115410131A (en) Method for intelligently classifying short videos
CN116189096A (en) Double-path crowd counting method of multi-scale attention mechanism
CN113724153A (en) Method for eliminating redundant images based on machine learning
CN115909408A (en) Pedestrian re-identification method and device based on Transformer network
Chen et al. Improved model for image tampering monitoring based on fast-RCNN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20230707

AD01 Patent right deemed abandoned