CN111325169A - Deep video fingerprint algorithm based on capsule network - Google Patents

Deep video fingerprint algorithm based on capsule network

Info

Publication number
CN111325169A
Authority
CN
China
Prior art keywords
video
network
deep
capsule network
capsule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010121069.5A
Other languages
Chinese (zh)
Other versions
CN111325169B (en)
Inventor
李新伟
徐良浩
杨艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN202010121069.5A priority Critical patent/CN111325169B/en
Publication of CN111325169A publication Critical patent/CN111325169A/en
Application granted granted Critical
Publication of CN111325169B publication Critical patent/CN111325169B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a deep video fingerprint algorithm based on a capsule network. A three-branch network with shared weights is adopted as the overall framework, and each branch is a deep capsule network, improved from the capsule network, that extracts compact video features. Using the idea of joint convolution, three-dimensional convolution and two-dimensional convolution are combined to extract video spatio-temporal features, which preserves video temporal information as much as possible while greatly improving the computational efficiency of the network and reducing the computational cost of the subsequent network. A fully connected layer is additionally added after the last two-dimensional convolution; it takes the video spatio-temporal features extracted by the joint convolution as input and outputs video classification features. The resulting deep capsule network is efficient, accurate and robust, and can monitor videos on network platforms such as video websites, social communities and chat tools, efficiently detect copied videos, and prevent the illegal spread of unauthorized and controlled videos.

Description

Deep video fingerprint algorithm based on capsule network
Technical Field
The invention relates to the technical field of video copyright protection and information security, and in particular to a deep video fingerprint algorithm based on a capsule network.
Background
With the rapid development of internet technology and video websites, abundant video content provides people with diverse visual experiences, but the accompanying problem of video copyright infringement has become increasingly prominent. Illegally copied videos not only harm the interests of copyright owners during network transmission but also have adverse effects on society. Faced with massive network video data, detecting copied videos by manpower alone is impractical, so a scheme for efficiently detecting video copies is needed.
A video fingerprint, also called a video hash, is a technology that compresses digital video features into a concise video summary; owing to its low storage cost and high query speed, it is widely applied to video copy detection. A video fingerprint algorithm mainly comprises three parts: feature extraction, feature quantization and fingerprint matching. The Hamming distance between paired video fingerprints is calculated, and whether a copy relation exists between two videos is judged according to a set threshold. Robustness, uniqueness and compactness are the standards for measuring the performance of video fingerprints: robustness means that after interference factors are added to an original video, the fingerprints of the original and distorted versions remain highly similar; uniqueness requires obvious differences between the fingerprints of different videos; and compactness refers to the length of the video fingerprint. However, the compactness of a video fingerprint often conflicts with its robustness and uniqueness, and how to ensure compactness while retaining good robustness and uniqueness has always been a key point of video fingerprint research.
Feature extraction is an important link in a video fingerprint algorithm and plays a decisive role in the quality of the generated video fingerprint. Existing video fingerprint algorithms are mainly classified, according to the feature extraction mode, into spatial-domain, temporal-domain and spatio-temporal-domain algorithms. Spatial-domain video fingerprint algorithms mainly extract features of video key frames and compress them into video fingerprints for copy detection; the most representative is the radial hash algorithm proposed by De Roover C, De Vleeschouwer C, Lefebvre F, et al. Such methods have some robustness to signal-processing attacks, but are less than ideal for other types of attack transformations. Temporal-domain video fingerprint algorithms extract fingerprint features mainly by capturing the temporal sequence of a video; a representative algorithm is the video sequence matching method proposed by Chen L. and Stentiford F., Video sequence matching based on temporal ordinal measurement. Although such algorithms have good robustness for long video segments, the fingerprint extraction effect is not ideal for short videos, because short videos can hardly contain enough information to be distinguished in the time domain. Therefore, combining the advantages of the spatial-domain and temporal-domain algorithms, spatio-temporal-domain video fingerprint algorithms were proposed, which, from the perspective of spatio-temporal fusion features, fuse and compress the spatio-temporal information of a video into a video fingerprint for copy detection. Typical algorithms include the centroid of gradient orientations algorithm proposed in S. Lee and C. D. Yoo, Robust video fingerprinting for video content identification, IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 7, pp. 983-988, Jul. 2008, and the structural graph model proposed in M. Li and V. Monga, Compact video fingerprinting via structural graphical models, IEEE Trans. Inf. Forensics Security, vol. 8, no. 11, pp. 1709-1721, Nov. 2013, all of which provide good solutions for the study of video fingerprint algorithms.
However, the above methods all rely on traditional hand-crafted feature extraction and describe only one feature of the video in an abstract manner, which does not help in understanding the content information of the video. Faced with the diverse and relatively complex videos circulating on the network, the performance of video fingerprints generated from a single hand-crafted feature can hardly be greatly improved.
With the development of deep learning in recent years, convolutional neural networks have become a hot topic in academia, and their powerful feature extraction capability has achieved excellent performance in fields such as target tracking, target detection and video action recognition. Video fingerprint algorithms based on convolutional neural networks have also emerged, such as the algorithm proposed in Wang L, Bao Y, Li H, et al., Compact CNN Based Video Representation for Efficient Video Copy Detection [C]// International Conference on Multimedia Modeling, Springer International Publishing, 2017, which extracts features from densely sampled video frames with a convolutional neural network and then generates video fingerprints by sparse coding, and the two schemes proposed in Kordopatis-Zilos G, Papadopoulos S, Patras I, et al., Near-Duplicate Video Retrieval with Deep Metric Learning [C]// IEEE International Conference on Computer Vision Workshops (Web-scale Vision and Social Media, VSM), 2017, which generate video fingerprints from convolutional neural network activations. Compared with traditional algorithms, robustness and uniqueness are improved, but the compactness of the video fingerprint is ignored, which is not conducive to storage and efficient computation over massive video data.
A convolutional neural network has strong feature extraction capability, but its robustness to some geometric transformations is weak. The capsule network is a variant of the convolutional neural network. On the basis of the feature maps extracted by a convolutional neural network, a capsule network based on dynamic routing forms feature vectors from sets of a fixed number of feature points and uses them as the input of a primary capsule layer; each capsule is multiplied by a pose matrix and then dynamic routing is performed. In terms of feature extraction, it has better feature fitting capability than the convolutional neural network.
Therefore, in view of the above-mentioned deficiencies of the prior art, it is necessary to provide an improved solution.
Disclosure of Invention
The invention mainly addresses the problems that three-dimensional convolution cannot fully exploit the interdependence among feature channels when extracting spatio-temporal features and that a triplet network has limited ability to learn the difference information between positive and negative samples. It provides a new method for generating video fingerprints by combining a deep network with hash learning, which can efficiently detect copied videos while monitoring videos on network platforms such as video websites, social communities and chat tools, thereby preventing the illegal spread of unauthorized and controlled videos.
In order to achieve the above purpose, the invention provides the following technical scheme:
the invention provides a capsule network-based depth video fingerprint algorithm, which comprises the following steps:
s1, constructing a deep capsule network by taking the weight-sharing three-branch network as a framework and taking the capsule network as a basis, wherein the deep capsule network specifically comprises the following components:
s11, combining the three-dimensional convolution and the two-dimensional convolution to extract the space-time characteristics of the video;
s12, compressing the video space-time characteristics into compact video fingerprints by adopting the deep capsule network;
s13, performing metric learning on the compact video fingerprint by adopting a triple loss function, wherein the triple loss function is a self-adaptive triple angle loss function with a central loss constraint, and the triple angle loss function specifically comprises:
s131, adopting normalized cosine similarity as a measurement function, converting distance operation between the space-time characteristics into angle operation, and enhancing correlation learning between the space-time characteristics, wherein the normalized cosine similarity is expressed as:
S(s1, s2) = (s1 · s2) / (||s1||2 ||s2||2)
wherein s1 and s2 denote the compact video fingerprint vectors extracted by the deep capsule network, and ||·||2 denotes the 2-norm;
s132, designing adaptive interval loss, adaptively adjusting an interval value according to the triplet sample pair, wherein the adaptive interval loss β is expressed as:
Figure BDA0002392990330000042
wherein S(v, v-) denotes the cosine similarity between the compact features of the original video and the non-copy video, and S(v, v+) denotes the cosine similarity between the compact features of the original video and the copy video;
s133, adding a central loss constraint term to the triplet sample pair in S132 after the triplet sample pair is lost, and normalizing the similarity learning between the positive sample pairs, where the central loss constraint term θ is expressed as:
θ=||1-S(v,v+)||2
the adaptive triplet angle loss function with the central loss constraint term is specifically represented as:
Figure BDA0002392990330000043
wherein vt, vt+ and vt- respectively denote the compact video features of the original video, the copy video and the non-copy video in the t-th video triplet, and m denotes the batch size;
s2, training the deep capsule network;
and S3, extracting and matching the video fingerprints of the deep capsule network after training.
According to the capsule network-based depth video fingerprint algorithm, preferably, the training of the depth capsule network by the S2 specifically includes:
s21, establishing a training video data set;
s22, preprocessing the training video data set to obtain a video triple;
s23, taking a video triple as the input of the three-branch network, and extracting the high-level semantic features and the compact video features of each video through a forward propagation algorithm;
s24, calculating a loss value generated by measuring loss through the compact video features extracted by the deep capsule network;
s25, calculating a loss value generated by classification loss through the high-level semantic features extracted by the deep capsule network;
s26, calculating the gradient of each node in the deep capsule network according to a back propagation algorithm;
s27, optimizing and updating the weight of each node in the deep capsule network by adopting an SGD random gradient descent method;
s28, repeating the S23-S27 until the loss values in S24 and S25 are not changed any more, and finishing the training of the deep capsule network.
According to the capsule network-based depth video fingerprint algorithm, preferably, the S3 performs video fingerprint extraction and matching on the trained depth capsule network, and specifically includes:
s31, selecting input videos, wherein the input videos comprise original videos and query videos, and preprocessing the input videos;
s32, taking a single-branch network in the trained deep capsule network as an extractor, and taking an original video and a query video as the input of the deep capsule network respectively to extract compact video features of the original video and the query video;
s33, binary coding is carried out on the compact video features extracted by the deep capsule network, and an original video fingerprint and an inquiry video fingerprint are respectively generated;
s34, calculating the Hamming distance between the original video fingerprint and the query fingerprint;
s35, setting a threshold value and judging whether a copy relation exists between the query video and the original video according to the calculated Hamming distance;
and if the Hamming distance is smaller than a set threshold, the query video is defined as a copy video, and if the Hamming distance is larger than the set threshold, the query video is defined as a non-copy video.
According to the capsule network-based depth video fingerprint algorithm, preferably, the extracted compact video features are compressed, and the compression process specifically comprises the following steps:
s101, performing convolution operation on compact video characteristics output by the last two-dimensional convolution layer to obtain a capsule serving as an input of a primary capsule layer;
s102, for each capsule xiRespectively processing to obtain high-grade capsule Xi
S103, multiplying each high-level capsule Xi by a probability value Si and performing a summation operation to output a predicted capsule v, wherein Si is obtained by converting a weight bi into a probability form through the Softmax function Si = exp(bi) / Σj exp(bj);
s104, adopting an activation function for the output prediction capsule v
Figure BDA0002392990330000061
Flattening to make the output vector norm of the predicted capsule v at [0, 1%]To (c) to (d);
s105, passing dynamic routing algorithm bi←bi+Xi+ v updates the weight b;
and S106, repeating the operations from S103 to S105 for 3 times, combining the features in the feature map obtained by two-dimensional convolution, and outputting the robust prediction capsule v as a compact video fingerprint for metric learning.
According to the above capsule network-based deep video fingerprint algorithm, preferably, the extraction of the high-level semantic features specifically comprises:
the output features of the last two-dimensional convolution layer, after passing through a Tanh activation function, are used as the input of a fully connected layer, and the number of output dimensions of the fully connected layer is the same as the number of classes, for classification learning.
According to the capsule network-based depth video fingerprint algorithm, preferably, the video triples are specifically:
and simultaneously extracting an original video, a non-copy video and a copy video corresponding to the original video from the training data set to form a pair of video triples, wherein the original video and the non-copy video have different contents.
According to the capsule network-based deep video fingerprint algorithm, preferably, the loss value generated by the classification loss is calculated with a cross-entropy loss function L2, which serves as the classification loss function and computes the loss value generated by the classification features output by the deep capsule network, wherein the cross-entropy loss function L2 is given by the following formula:
Figure BDA0002392990330000062
wherein xi denotes the classification feature of the i-th video output by the network, n denotes the batch size, y denotes the true video label, and σ denotes the Sigmoid activation function.
According to the capsule network-based deep video fingerprint algorithm, preferably, a loss function is set for the deep capsule network, and the gradient of each node in the deep capsule network is derived automatically according to the set loss function.
According to the capsule network-based deep video fingerprint algorithm, preferably, the original video fingerprint and the query video fingerprint are both 16-bit video fingerprints.
Compared with the closest prior art, the technical scheme provided by the invention has the following beneficial effects:
the invention provides a capsule network-based depth video fingerprint algorithm, which adopts a three-branch network with shared weights as an integral framework, is improved on the basis of a capsule network, extracts compact video characteristics by taking the depth capsule network as the three-branch network, combines three-dimensional convolution and two-dimensional convolution by utilizing the idea of joint convolution, extracts video spatio-temporal characteristics, greatly improves the calculation efficiency of the network while keeping video time information as much as possible, reduces the calculation cost of a subsequent capsule network, additionally adds a full connection layer after the two-dimensional convolution, takes the video spatio-temporal characteristics extracted by the joint convolution as input, and takes video classification characteristics as output, thereby enhancing the robustness and the characteristics of the network.
On the basis of the feature maps extracted by a convolutional neural network, the capsule network based on dynamic routing forms feature vectors from sets of a fixed number of feature points and uses them as the input of a primary capsule layer; each capsule is multiplied by a pose matrix and then dynamic routing is performed.
Drawings
FIG. 1 is a schematic diagram of an overall architecture of a deep capsule network according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a deep capsule network architecture according to an embodiment of the present invention;
FIG. 3 is a depth capsule network parameter map in an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a deep capsule network according to an embodiment of the present invention;
FIG. 5 shows a first experimental result of a deep capsule network according to an embodiment of the present invention;
FIG. 6 shows a second experimental result of the deep capsule network in the embodiment of the present invention;
FIG. 7 shows the third experimental result of the deep capsule network in the embodiment of the present invention;
FIG. 8 shows a fourth experimental result of the deep capsule network in the embodiment of the present invention;
FIG. 9 shows a fifth experimental result of the deep capsule network in an embodiment of the present invention;
fig. 10 shows a sixth experimental result of the deep capsule network in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
As shown in fig. 1, the present invention provides a capsule network-based depth video fingerprinting algorithm, which includes the following steps:
s1, in the aspect of network structure design, a three-branch network with shared weight is adopted as an overall framework, a capsule network is taken as a basis for improvement, and a deep capsule network is taken as a branch network to extract compact video characteristics, specifically:
s11, extracting the space-time characteristics of the video by combining the three-dimensional convolution and the two-dimensional convolution by utilizing the idea of the joint convolution, so that the space-time characteristics of the video greatly improve the calculation efficiency of the capsule network while keeping the video time information as much as possible, and reduce the calculation cost of the subsequent capsule network.
And S12, compressing the space-time characteristics of the video into compact video fingerprints by adopting a deep capsule network.
S13, performing metric learning with an improved triplet loss function; meanwhile, to further enhance the robustness and discriminative capability of the deep capsule network, a fully connected layer is additionally added after the two-dimensional convolution, which takes the video spatio-temporal features extracted by the joint convolution as input and outputs video classification features, and classification learning with a cross-entropy loss function is adopted to assist in training the deep capsule network.
And S2, training the deep capsule network.
And S3, extracting and matching the video fingerprints of the trained deep capsule network.
The invention aims to provide, in view of the defects of existing video fingerprint algorithms, a deep video fingerprint algorithm based on a capsule network, so that the network is efficient, accurate and robust, and can efficiently detect copied videos while monitoring videos on network platforms such as video websites, social communities and chat tools, thereby preventing the illegal spread of unauthorized and controlled videos.
Further, as shown in fig. 2 to 3, in terms of loss function design: in the general case, the triplet loss function distinguishes positive from negative samples by narrowing the distance between positive sample pairs and pushing negative samples away, but its effect on compact video feature learning is not ideal. In the embodiment of the present invention, the triplet loss function is improved, and an adaptive triplet angle loss with a central loss constraint is proposed, specifically:
s131, in the optimization process of the general triple loss function using the square of the Euclidean distance as a measurement mode, only the difference of distance emphasis values between feature elements is considered, but the whole correlation learning between the features is ignored, in the embodiment of the invention, normalized cosine similarity is used as a measurement function, the distance operation between the features is converted into angle operation, the correlation learning between the features is enhanced, and the normalized cosine similarity is expressed as:
S(s1, s2) = (s1 · s2) / (||s1||2 ||s2||2)
wherein s1 and s2 denote the compact video fingerprint vectors extracted by the deep capsule network, and ||·||2 denotes the 2-norm.
S132, a general triplet loss function sets a fixed interval value α so that positive samples are distinguished from negative sample pairs during optimization. However, assigning a uniform interval value to every triplet sample is clearly unreasonable, because the difficulty of the samples forming the triplets differs: when the interval value is set too large, the loss is difficult to reduce to 0 and may even cause the network not to converge; when it is set too small, the trained network can hardly distinguish the more difficult samples. Therefore, the interval value is adjusted adaptively according to each triplet sample pair, and the adaptive interval β is expressed as:
Figure BDA0002392990330000092
Wherein, S(v, v-) denotes the cosine similarity between the compact features of the original video and the non-copy video, and S(v, v+) denotes the cosine similarity between the compact features of the original video and the copy video.
S133, in the optimization process of a general triplet loss function, only the relative distance between the two sample pairs is considered; it is insensitive to the absolute distance within a sample pair, which easily leads to misjudgment of difficult positive samples. Therefore, a central loss constraint term θ is added after the triplet loss to normalize similarity learning between positive sample pairs:
θ=||1-S(v,v+)||2
The adaptive triplet angular loss function with the central loss constraint is specifically represented as:
Figure BDA0002392990330000093
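For reference, a minimal PyTorch sketch of the loss described in S131 to S133 is given below. The exact form of the adaptive interval β and the weighting between the triplet term and the central constraint term are not reproduced in the text above, so the margin rule and the weight lam used here are illustrative assumptions rather than the patented formulas, and the function name is likewise illustrative.

import torch
import torch.nn.functional as F

def triplet_angle_loss(anchor, positive, negative, lam=0.5):
    # anchor, positive, negative: (m, d) compact features of the original, copy and
    # non-copy videos of the m triplets in a batch.
    # Normalized cosine similarity replaces the squared Euclidean distance of the plain triplet loss.
    s_pos = F.cosine_similarity(anchor, positive, dim=1)      # S(v, v+)
    s_neg = F.cosine_similarity(anchor, negative, dim=1)      # S(v, v-)
    # Assumed adaptive interval: harder triplets (s_neg close to s_pos) receive a larger margin.
    beta = 0.5 * (1.0 + s_neg - s_pos).detach()               # treated as a constant per triplet
    # Triplet term: push S(v, v+) above S(v, v-) by the adaptive interval beta.
    triplet_term = torch.clamp(s_neg - s_pos + beta, min=0.0)
    # Central constraint theta: pull the similarity of each copy pair toward 1.
    center_term = torch.abs(1.0 - s_pos)
    return (triplet_term + lam * center_term).mean()          # averaged over the m triplets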
further, the built deep capsule network is trained, and the specific training steps are as follows:
s21, establishing a training data set, wherein the training data set is mainly divided into an original video and a copy video generated by subjecting the original video to attack transformation, and the attack transformation comprises common video attack transformation methods such as noise addition, blurring, frame dropping, logo insertion, contrast adjustment and the like.
And S22, preprocessing each training video.
And S23, taking the video triples as the input of the three-branch network, and extracting the high-level semantic features of each video through a forward propagation algorithm.
And S24, calculating a loss value generated by measuring loss through the compact video characteristics extracted by the deep capsule network.
And S25, calculating a loss value generated by classifying the loss through the high-level semantic features extracted by the deep capsule network.
And S26, calculating the gradient of each node in the deep capsule network according to a back propagation algorithm.
S27, optimizing and updating the weight of each node in the deep capsule network by adopting the SGD (stochastic gradient descent) method.
And S28, repeating steps S23 to S27 until the loss values in steps S24 and S25 no longer change; the loss values change dynamically during training and are judged according to the loss function values designed in the deep capsule network, and when they no longer change the training of the deep capsule network is completed.
In order to better understand the training process of the deep capsule network, the following describes the training process of the deep capsule network in the present invention with a set of embodiments.
S21, establishing a training data set: the original videos are 4000 videos selected from the FCVID public video data set, from each of which 100 frames are taken as a video segment, and any two segments are visually different. Six common video attack transformations (Gamma correction (0.6-1.6), Gaussian noise addition (0.01, 0.05, 0.1), median filtering (kernel size 20 × 20), frame dropping (35% random loss), Logo insertion and subtitle insertion) and two mixed attack transformations (Logo + frame dropping, and rotation + cropping with 10 degrees of rotation and a 320 × 240 center crop) are applied to each video segment, generating 32000 copy videos visually similar to the original videos. Each original video and its copy videos are classified into one class, so the total number of classes of the training samples is 4000.
S22, preprocessing each training video: by calling the resize function in the OpenCV (open source computer vision library) module, each video is uniformly adjusted to a size of 64 × 56 × 56 and converted into the YCrCb color space, separating the luminance signal from the chrominance signals and reducing the influence of luminance changes on the chrominance signals of the video (a code sketch of this preprocessing is given after step S28 below).
S23, extracting an original video, a non-copy video and a copy video corresponding to the original video from the training data set simultaneously to form a video triplet, wherein the original video and the non-copy video differ in content; 20 video triplets are taken as the input of the three-branch network, and the 4000-dimensional classification features and 16-dimensional compact video features of each video are output through a forward propagation algorithm.
S24, using the adaptive triplet angle loss function L1 with the central loss constraint as the metric loss, and calculating the loss value generated by the 16-dimensional compact video features output by the deep capsule network; the formula of L1 is as follows:
Figure BDA0002392990330000111
wherein vt, vt+ and vt- respectively denote the 16-dimensional compact video features of the original video, the copy video and the non-copy video in the t-th video triplet, and m denotes the batch size, set here to 20.
S25, calculating the loss value generated by the classification loss through the high-level semantic features extracted by the deep capsule network; in the embodiment of the invention, a cross-entropy loss function L2 is used as the classification loss function to calculate the loss value generated by the 4000-dimensional classification features output by the deep capsule network; the formula of L2 is as follows:
Figure BDA0002392990330000113
wherein xi denotes the 4000-dimensional classification feature of the i-th video output by the network, n denotes the batch size, set to 20, y denotes the true video label, and σ denotes the Sigmoid activation function, which maps the output to a probability value between [0, 1].
S26, calculating the gradient of each node in the deep capsule network according to a back propagation algorithm, wherein the gradient of each node in the deep capsule network does not need to be calculated manually.
S27, updating the weight of the corresponding node according to the gradient of each node by adopting a small-batch random gradient descent method, setting the initial learning rate to be 0.01, performing weight attenuation once every 10 periods, wherein the attenuation coefficient is 0.1, and the momentum factor is set to be 0.9.
And S28, repeating the operations of steps S23 to S27, with 800 loop iterations forming one training period of the deep capsule network, until 40 periods have been completed, finishing the training of the deep capsule network.
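The preprocessing of step S22 (and the identical step S31 below) can be sketched with OpenCV as follows; the frame-sampling and padding policy is an assumption, since the embodiment only fixes the 64 × 56 × 56 size and the YCrCb conversion, and the function name is illustrative.

import cv2
import numpy as np

def preprocess_video(path, num_frames=64, size=(56, 56)):
    # Read frames, resize them to 56 x 56 and convert BGR -> YCrCb to separate luminance from chrominance.
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb))
    cap.release()
    while frames and len(frames) < num_frames:               # pad short clips with the last frame (assumption)
        frames.append(frames[-1])
    return np.stack(frames).astype(np.float32) / 255.0       # (64, 56, 56, 3)

Steps S24 to S28 then amount to a standard PyTorch training loop, sketched below using the triplet_angle_loss sketch given earlier. Writing L2 as a per-class Sigmoid cross entropy, applying it only to the anchor branch, and reading the 0.1 attenuation every 10 periods as a learning-rate step decay are assumptions made for illustration; model and loader are placeholders for the three-branch deep capsule network and the triplet data loader.

import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

def classification_loss(logits, labels, num_classes=4000):
    # Sigmoid cross entropy over one-hot targets (assumed reading of the L2 description).
    targets = F.one_hot(labels, num_classes=num_classes).float()
    return F.binary_cross_entropy_with_logits(logits, targets)

def train(model, loader, periods=40):
    # model returns (16-dim compact feature, 4000-dim classification feature) for a video batch;
    # loader yields (anchor, positive, negative, labels) batches of 20 triplets, 800 batches per period.
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)    # 0.1 attenuation every 10 periods
    for _ in range(periods):
        for anchor, positive, negative, labels in loader:
            fp_a, cls_a = model(anchor)
            fp_p, _ = model(positive)
            fp_n, _ = model(negative)
            loss = triplet_angle_loss(fp_a, fp_p, fp_n) + classification_loss(cls_a, labels)
            optimizer.zero_grad()
            loss.backward()                                   # back propagation of the node gradients
            optimizer.step()                                  # SGD weight update
        scheduler.step()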
Further, the trained deep capsule network extracts and matches video fingerprints, and the specific process of extracting and matching the video fingerprints is as follows:
and S31, preprocessing the videos, wherein the preprocessing mode is the same as the video processing mode in S22, and each video is uniformly adjusted to be 64 × 56 × 56 video size by adopting a resize function in an Opencv module and is converted into YCrCb color space.
And S32, taking the trained single-branch deep capsule network as a feature extractor, and taking the original video and the query video as the input of the deep capsule network respectively to extract compact video features.
S33, binary coding is carried out on the compact video features output by the deep capsule network by adopting a sign (·) function, the number larger than 0 is set as 1, the number smaller than 0 is set as-1, and the original video fingerprint and the query video fingerprint are respectively generated.
S34, calculating the Hamming distance HD between the original video fingerprint and the query video fingerprint, where HD is the number of positions at which the two hash sequences differ:
HD(H', H) = Σ(k=1..L) 1(H'k ≠ Hk)
wherein H'k and Hk denote the k-th elements of the two different hash sequences, and L denotes the length of the hash sequence.
S35, setting a threshold α, defining the query video with the Hamming distance HD smaller than the threshold as a copy video, and otherwise, defining the query video as a non-copy video.
The method can effectively detect copied videos subjected to attack transformations while using only a 16-bit video fingerprint, and fundamentally improves detection efficiency and accuracy compared with existing video fingerprint algorithms.
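Steps S33 to S35 reduce to sign binarization followed by a Hamming-distance comparison; a minimal sketch is given below, where the concrete threshold value and the function names are assumptions (the embodiment only requires that a threshold α be set).

import torch

def binarize(features):
    # sign(.) coding: values greater than 0 become +1 and values less than 0 become -1
    # (exact zeros are assumed not to occur in practice).
    return torch.sign(features)

def hamming_distance(fp_query, fp_original):
    # Number of positions at which the two +/-1 fingerprints disagree.
    return int((fp_query != fp_original).sum())

def is_copy(fp_query, fp_original, alpha=0.25, length=16):
    # Copy relation: a Hamming distance below the threshold marks the query video as a copy.
    return hamming_distance(fp_query, fp_original) < alpha * length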
Further, as shown in fig. 2 and 4, the overall network architecture in the embodiment of the present invention is built with the open-source deep learning framework PyTorch and consists of three branch networks with shared weights, each branch being a deep capsule network. 3D convolution and 2D convolution are combined in a joint convolution manner: two 3D convolutions with kernels of 5 × 7 × 7 and 3 × 3 × 3 respectively extract the spatio-temporal features of the video, the features are averaged over the temporal dimension, and a convolution operation is then performed with a two-dimensional convolution kernel of size 9 × 9, which greatly improves the computational efficiency of the deep capsule network while retaining the video temporal information as much as possible. Each convolution layer uses Tanh as the activation function and performs batch normalization. After this, the deep capsule network is divided into two parts: compression into compact video features, and extraction of high-level semantic features for classification.
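The joint-convolution front end described above can be sketched as a PyTorch module as follows; the channel widths, paddings and class name are assumptions, since the text only fixes the kernel sizes (5 × 7 × 7, 3 × 3 × 3 and 9 × 9), the Tanh activation, the batch normalization and the temporal averaging.

import torch
import torch.nn as nn

class JointConvExtractor(nn.Module):
    # Input: a video tensor of shape (batch, 3, 64, 56, 56) in YCrCb.
    def __init__(self):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(3, 32, kernel_size=(5, 7, 7), padding=(2, 3, 3))
        self.bn3d_1 = nn.BatchNorm3d(32)
        self.conv3d_2 = nn.Conv3d(32, 64, kernel_size=3, padding=1)
        self.bn3d_2 = nn.BatchNorm3d(64)
        self.conv2d = nn.Conv2d(64, 128, kernel_size=9)
        self.bn2d = nn.BatchNorm2d(128)
        self.act = nn.Tanh()

    def forward(self, x):
        x = self.act(self.bn3d_1(self.conv3d_1(x)))   # 3D convolution, 5 x 7 x 7 kernel
        x = self.act(self.bn3d_2(self.conv3d_2(x)))   # 3D convolution, 3 x 3 x 3 kernel
        x = x.mean(dim=2)                             # average over the temporal dimension
        x = self.act(self.bn2d(self.conv2d(x)))       # 2D convolution, 9 x 9 kernel
        return x                                      # feature map fed to the capsule part and the FC head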
The compression into compact video features proceeds through the following steps:
s101, 8 times of convolution operation with convolution kernel size of 9 × 9 and step length of 2 is carried out on the features output by the last two-dimensional convolution layer at the same time, the number of output channels of each convolution is set to 64, so 8 groups of feature maps with size of 64 × M × N are output, wherein M × N represents the size of the output feature map, each group of feature maps are flattened into one-dimensional vectors, corresponding positions of the one-dimensional vectors are combined, and the one-dimensional vectors can be converted into 64 × M × N capsules with vector length of 8 and serve as input of a primary capsule layer.
S102, each capsule xi is multiplied by a pose matrix wi of size 8 × 16, generating C × M × N high-level capsules Xi with vector length 16.
S103, each high-level capsule Xi is multiplied by a probability value Si, and a summation operation is performed to output a predicted capsule v.
Wherein Si is obtained by converting the weight bi into a probability form through the Softmax function Si = exp(bi) / Σj exp(bj), and the initial value of bi is set to 0.
S104, the activation function v ← (||v||^2 / (1 + ||v||^2)) · (v / ||v||) is applied to the output predicted capsule v for squashing, so that the norm of the output vector of the predicted capsule v lies between [0, 1].
S105, the weight b is updated through the dynamic routing algorithm bi ← bi + Xi · v.
And S106, the operations of S103 to S105 are repeated 3 times; a large number of experiments verified that the deep capsule network of the invention achieves the best experimental effect when the dynamic routing is iterated 3 times, so the number of dynamic routing iterations is set to 3.
More critical features in the feature map obtained by the two-dimensional convolution are combined, so that a more robust predicted capsule v is output as a compact video fingerprint for metric learning.
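A sketch of steps S101 to S106 is given below. Realizing the 8 parallel 9 × 9 convolutions as one convolution with 8 × 64 output channels, sharing a single 8 × 16 pose matrix across capsules, and the exact regrouping of the feature maps into capsules are assumptions; the class and function names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(v, dim=-1, eps=1e-8):
    # Squashing activation: keeps the direction of v and maps its norm into [0, 1).
    norm_sq = (v ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * v / torch.sqrt(norm_sq + eps)

class CapsuleFingerprint(nn.Module):
    def __init__(self, in_channels=128, caps_channels=64, caps_dim=8, out_dim=16, routing_iters=3):
        super().__init__()
        # 9 x 9, stride-2 convolution producing 64 capsule channels of dimension 8 each (S101).
        self.primary = nn.Conv2d(in_channels, caps_channels * caps_dim, kernel_size=9, stride=2)
        self.caps_dim = caps_dim
        self.pose = nn.Linear(caps_dim, out_dim, bias=False)   # 8 x 16 pose matrix (shared, an assumption)
        self.routing_iters = routing_iters

    def forward(self, x):
        n = x.size(0)
        u = self.primary(x)                                 # (n, 64*8, M, N)
        u = u.view(n, -1, self.caps_dim)                    # (n, 64*M*N, 8) primary capsules x_i
        X = self.pose(u)                                    # (n, 64*M*N, 16) high-level capsules X_i (S102)
        b = torch.zeros(n, X.size(1), device=x.device)      # routing weights b_i, initialized to 0
        for _ in range(self.routing_iters):                 # 3 dynamic-routing iterations (S106)
            s = F.softmax(b, dim=1).unsqueeze(-1)           # probability values S_i = Softmax(b_i)
            v = squash((s * X).sum(dim=1))                  # weighted sum then squashing (S103, S104)
            b = b + (X * v.unsqueeze(1)).sum(dim=-1)        # b_i <- b_i + X_i . v (S105)
        return v                                            # 16-dim compact video feature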
Further, for the extraction of the high-level semantic features, in the embodiment of the invention the output features of the last two-dimensional convolutional layer, after passing through the Tanh activation function, are directly used as the input of a fully connected layer, whose number of output dimensions equals the number of classes, for classification learning.
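A sketch of this classification head follows; flattening the feature map before the fully connected layer and the class name are assumptions.

import torch.nn as nn

class ClassificationHead(nn.Module):
    # Maps the Tanh-activated output of the last 2D convolution to one logit per class (4000 in the embodiment).
    def __init__(self, in_features, num_classes=4000):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, feat):
        return self.fc(feat.flatten(start_dim=1))   # high-level semantic (classification) features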
In order to verify the performance of the deep capsule network, the embodiment of the invention evaluates the fingerprint algorithm on a test video data set containing videos different from those in the training set. To better reflect the diversity of videos circulating on the network, the test set consists of 200 videos from TRECVID and 600 videos from YouTube, all visually different from each other. The same 6 common video attack transformations (Gamma correction, Gaussian noise addition, median filtering, frame dropping, Logo insertion and subtitle insertion) and 2 mixed attack transformations (Logo + frame dropping, rotation + cropping) are applied to each video segment, generating 6400 copy videos visually similar to the 800 original videos. ROC (receiver operating characteristic) curves are adopted as the evaluation index of the video fingerprint algorithm, with the miss rate PM (miss probability) and the false alarm rate PFA (false alarm probability) defined as follows: the threshold α is swept over the range [0, 1]; PM is calculated over the 800 similar (copy) video pairs as the proportion of copy pairs whose Hamming distance exceeds the threshold, and PFA is calculated over the non-copy video pairs as the proportion of non-copy pairs whose Hamming distance falls below the threshold. The closer the curve is to the lower left corner, the smaller the miss rate and the false alarm rate, and hence the lower the error rate of the algorithm. The experimental results are shown in figs. 5 to 10; under different types of video attack transformation, the average experimental error rate is reduced to about 0.025%.
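The miss rate and false alarm rate can be computed by sweeping the threshold over the fingerprint length, as sketched below; the number of threshold samples and the function name are assumptions.

import numpy as np

def roc_points(copy_dists, noncopy_dists, length=16):
    # copy_dists: Hamming distances of the copy (visually similar) video pairs.
    # noncopy_dists: Hamming distances of the non-copy video pairs.
    points = []
    for alpha in np.linspace(0.0, 1.0, 101):                      # threshold alpha swept over [0, 1]
        thr = alpha * length
        p_m = float(np.mean(np.asarray(copy_dists) >= thr))       # P_M: copy pairs that are missed
        p_fa = float(np.mean(np.asarray(noncopy_dists) < thr))    # P_FA: non-copy pairs falsely accepted
        points.append((alpha, p_m, p_fa))
    return points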
In summary, the invention provides a deep video fingerprint algorithm based on a capsule network. A three-branch network with shared weights is adopted as the overall framework, and each branch is a deep capsule network, improved from the capsule network, that extracts compact video features. Using the idea of joint convolution, three-dimensional convolution and two-dimensional convolution are combined to extract video spatio-temporal features, which preserves video temporal information as much as possible while greatly improving the computational efficiency of the network and reducing the computational cost of the subsequent capsule network. A fully connected layer is additionally added after the two-dimensional convolution; it takes the video spatio-temporal features extracted by the joint convolution as input and outputs video classification features, thereby enhancing the robustness and discriminative capability of the network.
On the basis of the feature maps extracted by a convolutional neural network, the capsule network based on dynamic routing forms feature vectors from sets of a fixed number of feature points and uses them as the input of a primary capsule layer; each capsule is multiplied by a pose matrix and then dynamic routing is performed.
The above description is only exemplary of the invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the invention is intended to be covered by the appended claims.

Claims (9)

1. A depth video fingerprint algorithm based on a capsule network is characterized by comprising the following steps:
s1, constructing a deep capsule network by taking the weight-sharing three-branch network as a framework and taking the capsule network as a basis, wherein the deep capsule network specifically comprises the following components:
s11, combining the three-dimensional convolution and the two-dimensional convolution to extract the space-time characteristics of the video;
s12, compressing the video space-time characteristics into compact video fingerprints by adopting the deep capsule network;
s13, performing metric learning on the compact video fingerprint by adopting a triple loss function, wherein the triple loss function is a self-adaptive triple angle loss function with a central loss constraint, and the triple angle loss function specifically comprises:
s131, adopting normalized cosine similarity as a measurement function, converting distance operation between the video space-time characteristics into angle operation, and enhancing correlation learning between the video space-time characteristics, wherein the normalized cosine similarity is expressed as:
S(s1, s2) = (s1 · s2) / (||s1||2 ||s2||2)
wherein s1 and s2 denote the compact video fingerprint vectors extracted by the deep capsule network, and ||·||2 denotes the 2-norm;
s132, designing adaptive interval loss, adaptively adjusting an interval value according to the triplet sample pair, wherein the adaptive interval loss β is expressed as:
Figure FDA0002392990320000012
wherein S(v, v-) denotes the cosine similarity between the compact features of the original video and the non-copy video, and S(v, v+) denotes the cosine similarity between the compact features of the original video and the copy video;
s133, adding a central loss constraint term to the triplet sample pair in S132 after the triplet sample pair is lost, and normalizing the similarity learning between the positive sample pairs, where the central loss constraint term θ is expressed as:
θ=||1-S(v,v+)||2
the adaptive triplet angle loss function with the central loss constraint term is specifically represented as:
Figure FDA0002392990320000013
wherein vt, vt+ and vt- respectively denote the compact video features of the original video, the copy video and the non-copy video in the t-th video triplet, and m denotes the batch size;
s2, training the deep capsule network;
and S3, extracting and matching the video fingerprints of the deep capsule network after training.
2. The capsule network-based deep video fingerprint algorithm of claim 1, wherein the S2 is used for training the deep capsule network, and specifically comprises:
s21, establishing a training video data set;
s22, preprocessing the training video data set to obtain a video triple;
s23, taking a video triple as the input of the three-branch network, and extracting the high-level semantic features and the compact video features of each video through a forward propagation algorithm;
s24, calculating a loss value generated by measuring loss through the compact video features extracted by the deep capsule network;
s25, calculating a loss value generated by classification loss through the high-level semantic features extracted by the deep capsule network;
s26, calculating the gradient of each node in the deep capsule network according to a back propagation algorithm;
s27, optimizing and updating the weight of each node in the deep capsule network by adopting an SGD random gradient descent method;
s28, repeating the S23-S27 until the loss values in S24 and S25 are not changed any more, and finishing the training of the deep capsule network.
3. The capsule network-based deep video fingerprint algorithm of claim 1, wherein the S3 is configured to perform video fingerprint extraction and matching on the deep capsule network after training, and specifically includes:
s31, selecting input videos, wherein the input videos comprise original videos and query videos, and preprocessing the input videos;
s32, taking a single-branch network in the trained deep capsule network as an extractor, and taking an original video and a query video as the input of the deep capsule network respectively to extract compact video features of the original video and the query video;
s33, binary coding is carried out on the compact video features extracted by the deep capsule network, and an original video fingerprint and an inquiry video fingerprint are respectively generated;
s34, calculating the Hamming distance between the original video fingerprint and the query fingerprint;
s35, setting a threshold value and judging whether a copy relation exists between the query video and the original video according to the calculated Hamming distance;
and if the Hamming distance is smaller than a set threshold, the query video is defined as a copy video, and if the Hamming distance is larger than the set threshold, the query video is defined as a non-copy video.
4. The capsule network-based deep video fingerprint algorithm according to claim 2, wherein the extracted compact video features are compressed, and the compression process specifically comprises:
s101, performing convolution operation on compact video characteristics output by the last two-dimensional convolution layer to obtain a capsule serving as an input of a primary capsule layer;
s102, for each capsule xiRespectively processing to obtain high-grade capsule Xi
S103, multiplying each high-level capsule Xi by a probability value Si and performing a summation operation to output a predicted capsule v, wherein Si is obtained by converting a weight bi into a probability form through the Softmax function Si = exp(bi) / Σj exp(bj);
s104, adopting an activation function for the output prediction capsule v
Figure FDA0002392990320000032
Flattening to make the output vector norm of the predicted capsule v at [0, 1%]To (c) to (d);
s105, passing dynamic routing algorithm bi←bi+Xi+ v updates the weight b;
and S106, repeating the operations from S103 to S105 for 3 times, combining the features in the feature map obtained by two-dimensional convolution, and outputting the robust prediction capsule v as a compact video fingerprint for metric learning.
5. The capsule network-based deep video fingerprint algorithm according to claim 2, characterized in that the extraction of the high-level semantic features specifically comprises:
the output features of the last two-dimensional convolution layer, after passing through a Tanh activation function, are used as the input of a fully connected layer, and the number of output dimensions of the fully connected layer is the same as the number of classes, for classification learning.
6. The capsule network-based depth video fingerprint algorithm of claim 2, wherein the video triplets are specifically:
and simultaneously extracting an original video, a non-copy video and a copy video corresponding to the original video from the training data set to form a pair of video triples, wherein the original video and the non-copy video have different contents.
7. The capsule network-based deep video fingerprint algorithm of claim 2, characterized in that the loss value generated by the classification loss is calculated with a cross-entropy loss function L2, which serves as the classification loss function and computes the loss value generated by the classification features output by the deep capsule network, wherein the cross-entropy loss function L2 is given by the following formula:
Figure FDA0002392990320000041
wherein xi denotes the classification feature of the i-th video output by the network, n denotes the batch size, y denotes the true video label, and σ denotes the Sigmoid activation function.
8. The capsule network-based deep video fingerprint algorithm of claim 2, wherein a loss function is set for the deep capsule network, and the gradient of each node in the deep capsule network is derived automatically according to the set loss function.
9. The capsule network-based deep video fingerprint algorithm of claim 3, wherein the original video fingerprint and the query video fingerprint are both 16-bit video fingerprints.
CN202010121069.5A 2020-02-26 2020-02-26 Deep video fingerprint algorithm based on capsule network Active CN111325169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010121069.5A CN111325169B (en) 2020-02-26 2020-02-26 Deep video fingerprint algorithm based on capsule network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010121069.5A CN111325169B (en) 2020-02-26 2020-02-26 Deep video fingerprint algorithm based on capsule network

Publications (2)

Publication Number Publication Date
CN111325169A true CN111325169A (en) 2020-06-23
CN111325169B CN111325169B (en) 2023-04-07

Family

ID=71173154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010121069.5A Active CN111325169B (en) 2020-02-26 2020-02-26 Deep video fingerprint algorithm based on capsule network

Country Status (1)

Country Link
CN (1) CN111325169B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115998A (en) * 2020-09-11 2020-12-22 昆明理工大学 Method for overcoming catastrophic forgetting based on anti-incremental clustering dynamic routing network
CN112307258A (en) * 2020-11-25 2021-02-02 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN112633340A (en) * 2020-12-14 2021-04-09 浙江大华技术股份有限公司 Target detection model training method, target detection model training device, target detection model detection device and storage medium
CN112733701A (en) * 2021-01-07 2021-04-30 中国电子科技集团公司信息科学研究院 Robust scene recognition method and system based on capsule network
CN113763332A (en) * 2021-08-18 2021-12-07 上海建桥学院有限责任公司 Pulmonary nodule analysis method and device based on ternary capsule network algorithm and storage medium
CN113971686A (en) * 2021-10-26 2022-01-25 哈尔滨工业大学 Target tracking method based on background restoration and capsule network
CN116866089A (en) * 2023-09-05 2023-10-10 鹏城实验室 Network flow detection method and device based on twin capsule network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160267178A1 (en) * 2015-03-13 2016-09-15 TCL Research America Inc. Video retrieval based on optimized selected fingerprints
CN109840560A (en) * 2019-01-25 2019-06-04 西安电子科技大学 Based on the image classification method for incorporating cluster in capsule network
CN110569781A (en) * 2019-09-05 2019-12-13 河海大学常州校区 time sequence classification method based on improved capsule network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160267178A1 (en) * 2015-03-13 2016-09-15 TCL Research America Inc. Video retrieval based on optimized selected fingerprints
CN109840560A (en) * 2019-01-25 2019-06-04 西安电子科技大学 Based on the image classification method for incorporating cluster in capsule network
CN110569781A (en) * 2019-09-05 2019-12-13 河海大学常州校区 time sequence classification method based on improved capsule network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Dongdong et al.: "Video fingerprint algorithm based on spatio-temporal deep neural network", Laser & Optoelectronics Progress *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115998A (en) * 2020-09-11 2020-12-22 昆明理工大学 Method for overcoming catastrophic forgetting based on anti-incremental clustering dynamic routing network
CN112115998B (en) * 2020-09-11 2022-11-25 昆明理工大学 Method for overcoming catastrophic forgetting based on anti-incremental clustering dynamic routing network
CN112307258A (en) * 2020-11-25 2021-02-02 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN112307258B (en) * 2020-11-25 2021-07-20 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN112633340B (en) * 2020-12-14 2024-04-02 浙江大华技术股份有限公司 Target detection model training and detection method, device and storage medium
CN112633340A (en) * 2020-12-14 2021-04-09 浙江大华技术股份有限公司 Target detection model training method, target detection model training device, target detection model detection device and storage medium
CN112733701A (en) * 2021-01-07 2021-04-30 中国电子科技集团公司信息科学研究院 Robust scene recognition method and system based on capsule network
CN113763332A (en) * 2021-08-18 2021-12-07 上海建桥学院有限责任公司 Pulmonary nodule analysis method and device based on ternary capsule network algorithm and storage medium
CN113763332B (en) * 2021-08-18 2024-05-31 上海建桥学院有限责任公司 Pulmonary nodule analysis method and device based on ternary capsule network algorithm and storage medium
CN113971686A (en) * 2021-10-26 2022-01-25 哈尔滨工业大学 Target tracking method based on background restoration and capsule network
CN113971686B (en) * 2021-10-26 2024-05-31 哈尔滨工业大学 Target tracking method based on background restoration and capsule network
CN116866089B (en) * 2023-09-05 2024-01-30 鹏城实验室 Network flow detection method and device based on twin capsule network
CN116866089A (en) * 2023-09-05 2023-10-10 鹏城实验室 Network flow detection method and device based on twin capsule network

Also Published As

Publication number Publication date
CN111325169B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111325169B (en) Deep video fingerprint algorithm based on capsule network
Jia et al. Coarse-to-fine copy-move forgery detection for video forensics
Khan et al. Ant Colony Optimization (ACO) based Data Hiding in Image Complex Region.
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
Li et al. Steganalysis over large-scale social networks with high-order joint features and clustering ensembles
CN111709408A (en) Image authenticity detection method and device
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
Meng et al. A survey of image information hiding algorithms based on deep learning
CN110751018A (en) Group pedestrian re-identification method based on mixed attention mechanism
CN106778571B (en) Digital video feature extraction method based on deep neural network
CN110765841A (en) Group pedestrian re-identification system and terminal based on mixed attention mechanism
Ding et al. Noise-resistant network: a deep-learning method for face recognition under noise
CN111160313A (en) Face representation attack detection method based on LBP-VAE anomaly detection model
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
CN110853074A (en) Video target detection network system for enhancing target by utilizing optical flow
Wang et al. HidingGAN: High capacity information hiding with generative adversarial network
Gan et al. Video object forgery detection algorithm based on VGG-11 convolutional neural network
CN115294655A (en) Method, device and equipment for countermeasures generation pedestrian re-recognition based on multilevel module features of non-local mechanism
CN114387641A (en) False video detection method and system based on multi-scale convolutional network and ViT
Zhao et al. Detecting deepfake video by learning two-level features with two-stream convolutional neural network
Chen et al. Image splicing localization using residual image and residual-based fully convolutional network
Li et al. One-class double compression detection of advanced videos based on simple Gaussian distribution model
Xu et al. Document images forgery localization using a two‐stream network
CN112990357B (en) Black box video countermeasure sample generation method based on sparse disturbance
Dai et al. HEVC video steganalysis based on PU maps and multi-scale convolutional residual network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant