CN111325169A - Deep video fingerprint algorithm based on capsule network - Google Patents

Deep video fingerprint algorithm based on capsule network

Info

Publication number
CN111325169A
Authority
CN
China
Prior art keywords
video
network
deep
capsule network
capsule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010121069.5A
Other languages
Chinese (zh)
Other versions
CN111325169B (en)
Inventor
李新伟
徐良浩
杨艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN202010121069.5A priority Critical patent/CN111325169B/en
Publication of CN111325169A publication Critical patent/CN111325169A/en
Application granted granted Critical
Publication of CN111325169B publication Critical patent/CN111325169B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a deep video fingerprint algorithm based on a capsule network. A three-branch network with shared weights is adopted as the overall framework, and each branch is a deep capsule network, improved from the capsule network, that extracts compact video features. Using the idea of joint convolution, three-dimensional convolution and two-dimensional convolution are combined to extract video spatio-temporal features, which preserves video temporal information as much as possible while greatly improving the computational efficiency of the network and reducing the computational cost of the subsequent network. A fully connected layer is additionally added after the last two-dimensional convolution; it takes the video spatio-temporal features extracted by the joint convolution as input and outputs video classification features. The resulting deep capsule network is efficient, accurate and robust, and can monitor videos on network platforms such as video websites, social communities and chat tools, efficiently detect copied videos, and prevent the illegal spread of unauthorized and controlled videos.

Description

Deep video fingerprint algorithm based on capsule network
Technical Field
The invention relates to the technical field of video copyright protection and information security, and in particular to a deep video fingerprint algorithm based on a capsule network.
Background
With the rapid development of internet technology and video websites, abundant video content provides people with diverse visual experiences, but the accompanying problem of video copyright infringement has become increasingly prominent. Illegally copied videos not only harm the interests of copyright owners during network transmission but also have adverse effects on society. Faced with massive network video data, detecting copied videos by manpower alone is impractical, so a scheme for efficiently detecting video copies is needed.
A video fingerprint, also called a video hash, is a technology that compresses digital video features into a concise video summary; owing to its low storage cost and high query speed, it is widely applied to video copy detection. A video fingerprint algorithm mainly comprises three parts: feature extraction, feature quantization and fingerprint matching. The Hamming distance between paired video fingerprints is calculated, and whether a copy relation exists between two videos is judged according to a set threshold. Robustness, uniqueness and compactness are the standards for measuring the performance of video fingerprints: robustness means that after interference factors are added to an original video, the fingerprints of the original and distorted versions remain highly similar; uniqueness requires obvious differences between the fingerprints of different videos; and compactness refers to the length of the video fingerprint. However, the compactness of a video fingerprint often conflicts with its robustness and uniqueness, and how to ensure compactness while retaining good robustness and uniqueness has always been a key point of video fingerprint research.
Feature extraction is an important link in a video fingerprint algorithm and plays a decisive role in the quality of the generated video fingerprint. Existing video fingerprint algorithms are mainly classified, according to the feature extraction mode, into spatial-domain, temporal-domain and spatio-temporal-domain algorithms. Spatial-domain video fingerprint algorithms mainly extract features of video key frames and compress them into video fingerprints for copy detection; the most representative is the radial hash algorithm proposed by De Roover C, De Vleeschouwer C, Lefebvre F, et al. Such methods have some robustness to signal-processing attacks, but are less than ideal for other types of attack transformations. Temporal-domain video fingerprint algorithms extract fingerprint features mainly by capturing the temporal sequence of a video; a representative algorithm is the video sequence matching method proposed by Chen L. and Stentiford F., Video sequence matching based on temporal ordinal measurement. Although such algorithms have good robustness for long video segments, the fingerprint extraction effect is not ideal for short videos, because short videos can hardly contain enough information to be distinguished in the time domain. Therefore, combining the advantages of the spatial-domain and temporal-domain algorithms, spatio-temporal-domain video fingerprint algorithms were proposed, which, from the perspective of spatio-temporal fusion features, fuse and compress the spatio-temporal information of a video into a video fingerprint for copy detection. Typical algorithms include the centroid of gradient orientations algorithm proposed in S. Lee and C. D. Yoo, Robust video fingerprinting for video content identification, IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 7, pp. 983-988, Jul. 2008, and the structural graph model proposed in M. Li and V. Monga, Compact video fingerprinting via structural graphical models, IEEE Trans. Inf. Forensics Security, vol. 8, no. 11, pp. 1709-1721, Nov. 2013, all of which provide good solutions for the study of video fingerprint algorithms.
However, the above methods all rely on traditional hand-crafted feature extraction and describe only one feature of the video in an abstract manner, which does not help in understanding the content information of the video. Faced with the diverse and relatively complex videos circulating on the network, the performance of video fingerprints generated from a single hand-crafted feature can hardly be greatly improved.
With the development of deep learning in recent years, convolutional neural networks have become a hot topic in academia, and their powerful feature extraction capability has achieved excellent performance in fields such as target tracking, target detection and video action recognition. Video fingerprint algorithms based on convolutional neural networks have also emerged, such as the algorithm proposed in Wang L, Bao Y, Li H, et al., Compact CNN Based Video Representation for Efficient Video Copy Detection [C]// International Conference on Multimedia Modeling, Springer International Publishing, 2017, which extracts features from densely sampled video frames with a convolutional neural network and then generates video fingerprints by sparse coding, and the two schemes proposed in Kordopatis-Zilos G, Papadopoulos S, Patras I, et al., Near-Duplicate Video Retrieval with Deep Metric Learning [C]// IEEE International Conference on Computer Vision Workshops (Web-scale Vision and Social Media, VSM), 2017, which generate video fingerprints from convolutional neural network activations. Compared with traditional algorithms, robustness and uniqueness are improved, but the compactness of the video fingerprint is ignored, which is not conducive to storage and efficient computation over massive video data.
A convolutional neural network has strong feature extraction capability, but its robustness to some geometric transformations is weak. The capsule network is a variant of the convolutional neural network. On the basis of the feature maps extracted by a convolutional neural network, a capsule network based on dynamic routing forms feature vectors from sets of a fixed number of feature points and uses them as the input of a primary capsule layer; each capsule is multiplied by a pose matrix and then dynamic routing is performed. In terms of feature extraction, it has better feature fitting capability than the convolutional neural network.
Therefore, in view of the above-mentioned deficiencies of the prior art, it is necessary to provide an improved solution.
Disclosure of Invention
The invention mainly addresses the problems that three-dimensional convolution cannot fully exploit the interdependence among feature channels when extracting spatio-temporal features and that a triplet network has limited ability to learn the difference information between positive and negative samples. It provides a new method for generating video fingerprints by combining a deep network with hash learning, which can efficiently detect copied videos while monitoring videos on network platforms such as video websites, social communities and chat tools, thereby preventing the illegal spread of unauthorized and controlled videos.
In order to achieve the above purpose, the invention provides the following technical scheme:
the invention provides a capsule network-based depth video fingerprint algorithm, which comprises the following steps:
s1, constructing a deep capsule network by taking the weight-sharing three-branch network as a framework and taking the capsule network as a basis, wherein the deep capsule network specifically comprises the following components:
s11, combining the three-dimensional convolution and the two-dimensional convolution to extract the space-time characteristics of the video;
s12, compressing the video space-time characteristics into compact video fingerprints by adopting the deep capsule network;
s13, performing metric learning on the compact video fingerprint by adopting a triple loss function, wherein the triple loss function is a self-adaptive triple angle loss function with a central loss constraint, and the triple angle loss function specifically comprises:
s131, adopting normalized cosine similarity as a measurement function, converting distance operation between the space-time characteristics into angle operation, and enhancing correlation learning between the space-time characteristics, wherein the normalized cosine similarity is expressed as:
S(s1, s2) = (s1 · s2) / (||s1||2 ||s2||2)
wherein s1 and s2 denote the compact video fingerprint vectors extracted by the deep capsule network, and ||·||2 denotes the 2-norm;
s132, designing adaptive interval loss, adaptively adjusting an interval value according to the triplet sample pair, wherein the adaptive interval loss β is expressed as:
Figure BDA0002392990330000042
wherein S(v, v-) denotes the cosine similarity between the compact features of the original video and the non-copy video, and S(v, v+) denotes the cosine similarity between the compact features of the original video and the copy video;
s133, adding a central loss constraint term to the triplet sample pair in S132 after the triplet sample pair is lost, and normalizing the similarity learning between the positive sample pairs, where the central loss constraint term θ is expressed as:
θ=||1-S(v,v+)||2
the adaptive triplet angle loss function with the central loss constraint term is specifically represented as:
Figure BDA0002392990330000043
wherein vt, vt+ and vt- respectively denote the compact video features of the original video, the copy video and the non-copy video in the t-th video triplet, and m denotes the batch size;
s2, training the deep capsule network;
and S3, extracting and matching the video fingerprints of the deep capsule network after training.
According to the capsule network-based depth video fingerprint algorithm, preferably, the training of the depth capsule network by the S2 specifically includes:
s21, establishing a training video data set;
s22, preprocessing the training video data set to obtain a video triple;
s23, taking a video triple as the input of the three-branch network, and extracting the high-level semantic features and the compact video features of each video through a forward propagation algorithm;
s24, calculating a loss value generated by measuring loss through the compact video features extracted by the deep capsule network;
s25, calculating a loss value generated by classification loss through the high-level semantic features extracted by the deep capsule network;
s26, calculating the gradient of each node in the deep capsule network according to a back propagation algorithm;
s27, optimizing and updating the weight of each node in the deep capsule network by adopting an SGD random gradient descent method;
s28, repeating the S23-S27 until the loss values in S24 and S25 are not changed any more, and finishing the training of the deep capsule network.
According to the capsule network-based depth video fingerprint algorithm, preferably, the S3 performs video fingerprint extraction and matching on the trained depth capsule network, and specifically includes:
s31, selecting input videos, wherein the input videos comprise original videos and query videos, and preprocessing the input videos;
s32, taking a single-branch network in the trained deep capsule network as an extractor, and taking an original video and a query video as the input of the deep capsule network respectively to extract compact video features of the original video and the query video;
s33, binary coding is carried out on the compact video features extracted by the deep capsule network, and an original video fingerprint and an inquiry video fingerprint are respectively generated;
s34, calculating the Hamming distance between the original video fingerprint and the query fingerprint;
s35, setting a threshold value and judging whether a copy relation exists between the query video and the original video according to the calculated Hamming distance;
and if the Hamming distance is smaller than a set threshold, the query video is defined as a copy video, and if the Hamming distance is larger than the set threshold, the query video is defined as a non-copy video.
According to the capsule network-based depth video fingerprint algorithm, preferably, the extracted compact video features are compressed, and the compression process specifically comprises the following steps:
s101, performing convolution operation on compact video characteristics output by the last two-dimensional convolution layer to obtain a capsule serving as an input of a primary capsule layer;
s102, for each capsule xiRespectively processing to obtain high-grade capsule Xi
S103, multiplying each high-level capsule Xi by a probability value Si and performing a summation operation to output a predicted capsule v, wherein Si is obtained by converting a weight bi into a probability form through the Softmax function Si = exp(bi) / Σj exp(bj);
s104, adopting an activation function for the output prediction capsule v
Figure BDA0002392990330000061
Flattening to make the output vector norm of the predicted capsule v at [0, 1%]To (c) to (d);
s105, passing dynamic routing algorithm bi←bi+Xi+ v updates the weight b;
and S106, repeating the operations from S103 to S105 for 3 times, combining the features in the feature map obtained by two-dimensional convolution, and outputting the robust prediction capsule v as a compact video fingerprint for metric learning.
According to the above capsule network-based deep video fingerprint algorithm, preferably, the extraction of the high-level semantic features specifically comprises:
the output features of the last two-dimensional convolution layer, after passing through a Tanh activation function, are used as the input of a fully connected layer, and the number of output dimensions of the fully connected layer is the same as the number of classes, for classification learning.
According to the capsule network-based depth video fingerprint algorithm, preferably, the video triples are specifically:
and simultaneously extracting an original video, a non-copy video and a copy video corresponding to the original video from the training data set to form a pair of video triples, wherein the original video and the non-copy video have different contents.
According to the capsule network-based deep video fingerprint algorithm, preferably, the loss value generated by the classification loss is calculated with a cross-entropy loss function L2, which serves as the classification loss function and computes the loss value generated by the classification features output by the deep capsule network, wherein the cross-entropy loss function L2 is given by the following formula:
Figure BDA0002392990330000062
wherein xi denotes the classification feature of the i-th video output by the network, n denotes the batch size, y denotes the true video label, and σ denotes the Sigmoid activation function.
According to the capsule network-based deep video fingerprint algorithm, preferably, a loss function is set for the deep capsule network, and the gradient of each node in the deep capsule network is derived automatically according to the set loss function.
According to the capsule network-based deep video fingerprint algorithm, preferably, the original video fingerprint and the query video fingerprint are both 16-bit video fingerprints.
Compared with the closest prior art, the technical scheme provided by the invention has the following beneficial effects:
the invention provides a capsule network-based depth video fingerprint algorithm, which adopts a three-branch network with shared weights as an integral framework, is improved on the basis of a capsule network, extracts compact video characteristics by taking the depth capsule network as the three-branch network, combines three-dimensional convolution and two-dimensional convolution by utilizing the idea of joint convolution, extracts video spatio-temporal characteristics, greatly improves the calculation efficiency of the network while keeping video time information as much as possible, reduces the calculation cost of a subsequent capsule network, additionally adds a full connection layer after the two-dimensional convolution, takes the video spatio-temporal characteristics extracted by the joint convolution as input, and takes video classification characteristics as output, thereby enhancing the robustness and the characteristics of the network.
On the basis of the feature maps extracted by a convolutional neural network, the capsule network based on dynamic routing forms feature vectors from sets of a fixed number of feature points and uses them as the input of a primary capsule layer; each capsule is multiplied by a pose matrix and then dynamic routing is performed.
Drawings
FIG. 1 is a schematic diagram of an overall architecture of a deep capsule network according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a deep capsule network architecture according to an embodiment of the present invention;
FIG. 3 is a depth capsule network parameter map in an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a deep capsule network according to an embodiment of the present invention;
FIG. 5 shows a first experimental result of a deep capsule network according to an embodiment of the present invention;
FIG. 6 shows a second experimental result of the deep capsule network in the embodiment of the present invention;
FIG. 7 shows the third experimental result of the deep capsule network in the embodiment of the present invention;
FIG. 8 shows a fourth experimental result of the deep capsule network in the embodiment of the present invention;
FIG. 9 shows a fifth experimental result of the deep capsule network in an embodiment of the present invention;
fig. 10 shows a sixth experimental result of the deep capsule network in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
As shown in fig. 1, the present invention provides a capsule network-based depth video fingerprinting algorithm, which includes the following steps:
s1, in the aspect of network structure design, a three-branch network with shared weight is adopted as an overall framework, a capsule network is taken as a basis for improvement, and a deep capsule network is taken as a branch network to extract compact video characteristics, specifically:
s11, extracting the space-time characteristics of the video by combining the three-dimensional convolution and the two-dimensional convolution by utilizing the idea of the joint convolution, so that the space-time characteristics of the video greatly improve the calculation efficiency of the capsule network while keeping the video time information as much as possible, and reduce the calculation cost of the subsequent capsule network.
And S12, compressing the space-time characteristics of the video into compact video fingerprints by adopting a deep capsule network.
S13, performing metric learning with an improved triplet loss function; meanwhile, to further enhance the robustness and discriminative capability of the deep capsule network, a fully connected layer is additionally added after the two-dimensional convolution, which takes the video spatio-temporal features extracted by the joint convolution as input and outputs video classification features, and classification learning with a cross-entropy loss function is adopted to assist in training the deep capsule network.
And S2, training the deep capsule network.
And S3, extracting and matching the video fingerprints of the trained deep capsule network.
The invention aims to provide, in view of the defects of existing video fingerprint algorithms, a deep video fingerprint algorithm based on a capsule network, so that the network is efficient, accurate and robust, and can efficiently detect copied videos while monitoring videos on network platforms such as video websites, social communities and chat tools, thereby preventing the illegal spread of unauthorized and controlled videos.
Further, as shown in fig. 2 to 3, in terms of loss function design: in the general case, the triplet loss function distinguishes positive from negative samples by narrowing the distance between positive sample pairs and pushing negative samples away, but its effect on compact video feature learning is not ideal. In the embodiment of the present invention, the triplet loss function is improved, and an adaptive triplet angle loss with a central loss constraint is proposed, specifically:
s131, in the optimization process of the general triple loss function using the square of the Euclidean distance as a measurement mode, only the difference of distance emphasis values between feature elements is considered, but the whole correlation learning between the features is ignored, in the embodiment of the invention, normalized cosine similarity is used as a measurement function, the distance operation between the features is converted into angle operation, the correlation learning between the features is enhanced, and the normalized cosine similarity is expressed as:
S(s1, s2) = (s1 · s2) / (||s1||2 ||s2||2)
wherein s1 and s2 denote the compact video fingerprint vectors extracted by the deep capsule network, and ||·||2 denotes the 2-norm.
S132, a general triplet loss function sets a fixed interval value α so that positive samples are distinguished from negative sample pairs during optimization. However, assigning a uniform interval value to every triplet sample is clearly unreasonable, because the difficulty of the samples forming the triplets differs: when the interval value is set too large, the loss is difficult to reduce to 0 and may even cause the network not to converge; when it is set too small, the trained network can hardly distinguish the more difficult samples. Therefore, the interval value is adjusted adaptively according to each triplet sample pair, and the adaptive interval β is expressed as:
Figure BDA0002392990330000092
Wherein, S(v, v-) denotes the cosine similarity between the compact features of the original video and the non-copy video, and S(v, v+) denotes the cosine similarity between the compact features of the original video and the copy video.
S133, in the optimization process of a general triplet loss function, only the relative distance between the two sample pairs is considered; it is insensitive to the absolute distance within a sample pair, which easily leads to misjudgment of difficult positive samples. Therefore, a central loss constraint term θ is added after the triplet loss to normalize similarity learning between positive sample pairs:
θ=||1-S(v,v+)||2
The adaptive triplet angular loss function with the central loss constraint is specifically represented as:
Figure BDA0002392990330000093
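For reference, a minimal PyTorch sketch of the loss described in S131 to S133 is given below. The exact form of the adaptive interval β and the weighting between the triplet term and the central constraint term are not reproduced in the text above, so the margin rule and the weight lam used here are illustrative assumptions rather than the patented formulas, and the function name is likewise illustrative.

import torch
import torch.nn.functional as F

def triplet_angle_loss(anchor, positive, negative, lam=0.5):
    # anchor, positive, negative: (m, d) compact features of the original, copy and
    # non-copy videos of the m triplets in a batch.
    # Normalized cosine similarity replaces the squared Euclidean distance of the plain triplet loss.
    s_pos = F.cosine_similarity(anchor, positive, dim=1)      # S(v, v+)
    s_neg = F.cosine_similarity(anchor, negative, dim=1)      # S(v, v-)
    # Assumed adaptive interval: harder triplets (s_neg close to s_pos) receive a larger margin.
    beta = 0.5 * (1.0 + s_neg - s_pos).detach()               # treated as a constant per triplet
    # Triplet term: push S(v, v+) above S(v, v-) by the adaptive interval beta.
    triplet_term = torch.clamp(s_neg - s_pos + beta, min=0.0)
    # Central constraint theta: pull the similarity of each copy pair toward 1.
    center_term = torch.abs(1.0 - s_pos)
    return (triplet_term + lam * center_term).mean()          # averaged over the m triplets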
further, the built deep capsule network is trained, and the specific training steps are as follows:
s21, establishing a training data set, wherein the training data set is mainly divided into an original video and a copy video generated by subjecting the original video to attack transformation, and the attack transformation comprises common video attack transformation methods such as noise addition, blurring, frame dropping, logo insertion, contrast adjustment and the like.
And S22, preprocessing each training video.
And S23, taking the video triples as the input of the three-branch network, and extracting the high-level semantic features of each video through a forward propagation algorithm.
And S24, calculating a loss value generated by measuring loss through the compact video characteristics extracted by the deep capsule network.
And S25, calculating a loss value generated by classifying the loss through the high-level semantic features extracted by the deep capsule network.
And S26, calculating the gradient of each node in the deep capsule network according to a back propagation algorithm.
S27, optimizing and updating the weight of each node in the deep capsule network by adopting the SGD (stochastic gradient descent) method.
And S28, repeating steps S23 to S27 until the loss values in steps S24 and S25 no longer change; the loss values change dynamically during training and are judged according to the loss function values designed in the deep capsule network, and when they no longer change the training of the deep capsule network is completed.
In order to better understand the training process of the deep capsule network, the following describes the training process of the deep capsule network in the present invention with a set of embodiments.
S21, establishing a training data set: the original videos are 4000 videos selected from the FCVID public video data set, from each of which 100 frames are taken as a video segment, and any two segments are visually different. Six common video attack transformations (Gamma correction (0.6-1.6), Gaussian noise addition (0.01, 0.05, 0.1), median filtering (kernel size 20 × 20), frame dropping (35% random loss), Logo insertion and subtitle insertion) and two mixed attack transformations (Logo + frame dropping, and rotation + cropping with 10 degrees of rotation and a 320 × 240 center crop) are applied to each video segment, generating 32000 copy videos visually similar to the original videos. Each original video and its copy videos are classified into one class, so the total number of classes of the training samples is 4000.
S22, preprocessing each training video: by calling the resize function in the OpenCV (open source computer vision library) module, each video is uniformly adjusted to a size of 64 × 56 × 56 and converted into the YCrCb color space, separating the luminance signal from the chrominance signals and reducing the influence of luminance changes on the chrominance signals of the video (a code sketch of this preprocessing is given after step S28 below).
S23, extracting an original video, a non-copy video and a copy video corresponding to the original video from the training data set simultaneously to form a video triplet, wherein the original video and the non-copy video differ in content; 20 video triplets are taken as the input of the three-branch network, and the 4000-dimensional classification features and 16-dimensional compact video features of each video are output through a forward propagation algorithm.
S24, using the adaptive triplet angle loss function L1 with the central loss constraint as the metric loss, and calculating the loss value generated by the 16-dimensional compact video features output by the deep capsule network; the formula of L1 is as follows:
Figure BDA0002392990330000111
wherein vt, vt+ and vt- respectively denote the 16-dimensional compact video features of the original video, the copy video and the non-copy video in the t-th video triplet, and m denotes the batch size, set here to 20.
S25, calculating the loss value generated by the classification loss through the high-level semantic features extracted by the deep capsule network; in the embodiment of the invention, a cross-entropy loss function L2 is used as the classification loss function to calculate the loss value generated by the 4000-dimensional classification features output by the deep capsule network; the formula of L2 is as follows:
Figure BDA0002392990330000113
wherein xi denotes the 4000-dimensional classification feature of the i-th video output by the network, n denotes the batch size, set to 20, y denotes the true video label, and σ denotes the Sigmoid activation function, which maps the output to a probability value between [0, 1].
S26, calculating the gradient of each node in the deep capsule network according to a back propagation algorithm, wherein the gradient of each node in the deep capsule network does not need to be calculated manually.
S27, updating the weight of the corresponding node according to the gradient of each node by adopting a small-batch random gradient descent method, setting the initial learning rate to be 0.01, performing weight attenuation once every 10 periods, wherein the attenuation coefficient is 0.1, and the momentum factor is set to be 0.9.
And S28, repeating the operations of steps S23 to S27, with 800 loop iterations forming one training period of the deep capsule network, until 40 periods have been completed, finishing the training of the deep capsule network.
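The preprocessing of step S22 (and the identical step S31 below) can be sketched with OpenCV as follows; the frame-sampling and padding policy is an assumption, since the embodiment only fixes the 64 × 56 × 56 size and the YCrCb conversion, and the function name is illustrative.

import cv2
import numpy as np

def preprocess_video(path, num_frames=64, size=(56, 56)):
    # Read frames, resize them to 56 x 56 and convert BGR -> YCrCb to separate luminance from chrominance.
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb))
    cap.release()
    while frames and len(frames) < num_frames:               # pad short clips with the last frame (assumption)
        frames.append(frames[-1])
    return np.stack(frames).astype(np.float32) / 255.0       # (64, 56, 56, 3)

Steps S24 to S28 then amount to a standard PyTorch training loop, sketched below using the triplet_angle_loss sketch given earlier. Writing L2 as a per-class Sigmoid cross entropy, applying it only to the anchor branch, and reading the 0.1 attenuation every 10 periods as a learning-rate step decay are assumptions made for illustration; model and loader are placeholders for the three-branch deep capsule network and the triplet data loader.

import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

def classification_loss(logits, labels, num_classes=4000):
    # Sigmoid cross entropy over one-hot targets (assumed reading of the L2 description).
    targets = F.one_hot(labels, num_classes=num_classes).float()
    return F.binary_cross_entropy_with_logits(logits, targets)

def train(model, loader, periods=40):
    # model returns (16-dim compact feature, 4000-dim classification feature) for a video batch;
    # loader yields (anchor, positive, negative, labels) batches of 20 triplets, 800 batches per period.
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)    # 0.1 attenuation every 10 periods
    for _ in range(periods):
        for anchor, positive, negative, labels in loader:
            fp_a, cls_a = model(anchor)
            fp_p, _ = model(positive)
            fp_n, _ = model(negative)
            loss = triplet_angle_loss(fp_a, fp_p, fp_n) + classification_loss(cls_a, labels)
            optimizer.zero_grad()
            loss.backward()                                   # back propagation of the node gradients
            optimizer.step()                                  # SGD weight update
        scheduler.step()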
Further, the trained deep capsule network extracts and matches video fingerprints, and the specific process of extracting and matching the video fingerprints is as follows:
and S31, preprocessing the videos, wherein the preprocessing mode is the same as the video processing mode in S22, and each video is uniformly adjusted to be 64 × 56 × 56 video size by adopting a resize function in an Opencv module and is converted into YCrCb color space.
And S32, taking the trained single-branch deep capsule network as a feature extractor, and taking the original video and the query video as the input of the deep capsule network respectively to extract compact video features.
S33, binary coding is carried out on the compact video features output by the deep capsule network by adopting a sign (·) function, the number larger than 0 is set as 1, the number smaller than 0 is set as-1, and the original video fingerprint and the query video fingerprint are respectively generated.
S34, calculating the Hamming distance HD between the original video fingerprint and the query video fingerprint, where HD is the number of positions at which the two hash sequences differ:
HD(H', H) = Σ(k=1..L) 1(H'k ≠ Hk)
wherein H'k and Hk denote the k-th elements of the two different hash sequences, and L denotes the length of the hash sequence.
S35, setting a threshold α, defining the query video with the Hamming distance HD smaller than the threshold as a copy video, and otherwise, defining the query video as a non-copy video.
The method can effectively detect copied videos subjected to attack transformations while using only a 16-bit video fingerprint, and fundamentally improves detection efficiency and accuracy compared with existing video fingerprint algorithms.
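Steps S33 to S35 reduce to sign binarization followed by a Hamming-distance comparison; a minimal sketch is given below, where the concrete threshold value and the function names are assumptions (the embodiment only requires that a threshold α be set).

import torch

def binarize(features):
    # sign(.) coding: values greater than 0 become +1 and values less than 0 become -1
    # (exact zeros are assumed not to occur in practice).
    return torch.sign(features)

def hamming_distance(fp_query, fp_original):
    # Number of positions at which the two +/-1 fingerprints disagree.
    return int((fp_query != fp_original).sum())

def is_copy(fp_query, fp_original, alpha=0.25, length=16):
    # Copy relation: a Hamming distance below the threshold marks the query video as a copy.
    return hamming_distance(fp_query, fp_original) < alpha * length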
Further, as shown in fig. 2 and 4, the overall network architecture in the embodiment of the present invention is built with the open-source deep learning framework PyTorch and consists of three branch networks with shared weights, each branch being a deep capsule network. 3D convolution and 2D convolution are combined in a joint convolution manner: two 3D convolutions with kernels of 5 × 7 × 7 and 3 × 3 × 3 respectively extract the spatio-temporal features of the video, the features are averaged over the temporal dimension, and a convolution operation is then performed with a two-dimensional convolution kernel of size 9 × 9, which greatly improves the computational efficiency of the deep capsule network while retaining the video temporal information as much as possible. Each convolution layer uses Tanh as the activation function and performs batch normalization. After this, the deep capsule network is divided into two parts: compression into compact video features, and extraction of high-level semantic features for classification.
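The joint-convolution front end described above can be sketched as a PyTorch module as follows; the channel widths, paddings and class name are assumptions, since the text only fixes the kernel sizes (5 × 7 × 7, 3 × 3 × 3 and 9 × 9), the Tanh activation, the batch normalization and the temporal averaging.

import torch
import torch.nn as nn

class JointConvExtractor(nn.Module):
    # Input: a video tensor of shape (batch, 3, 64, 56, 56) in YCrCb.
    def __init__(self):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(3, 32, kernel_size=(5, 7, 7), padding=(2, 3, 3))
        self.bn3d_1 = nn.BatchNorm3d(32)
        self.conv3d_2 = nn.Conv3d(32, 64, kernel_size=3, padding=1)
        self.bn3d_2 = nn.BatchNorm3d(64)
        self.conv2d = nn.Conv2d(64, 128, kernel_size=9)
        self.bn2d = nn.BatchNorm2d(128)
        self.act = nn.Tanh()

    def forward(self, x):
        x = self.act(self.bn3d_1(self.conv3d_1(x)))   # 3D convolution, 5 x 7 x 7 kernel
        x = self.act(self.bn3d_2(self.conv3d_2(x)))   # 3D convolution, 3 x 3 x 3 kernel
        x = x.mean(dim=2)                             # average over the temporal dimension
        x = self.act(self.bn2d(self.conv2d(x)))       # 2D convolution, 9 x 9 kernel
        return x                                      # feature map fed to the capsule part and the FC head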
The compression into compact video features proceeds through the following steps:
s101, 8 times of convolution operation with convolution kernel size of 9 × 9 and step length of 2 is carried out on the features output by the last two-dimensional convolution layer at the same time, the number of output channels of each convolution is set to 64, so 8 groups of feature maps with size of 64 × M × N are output, wherein M × N represents the size of the output feature map, each group of feature maps are flattened into one-dimensional vectors, corresponding positions of the one-dimensional vectors are combined, and the one-dimensional vectors can be converted into 64 × M × N capsules with vector length of 8 and serve as input of a primary capsule layer.
S102, each capsule xi is multiplied by a pose matrix wi of size 8 × 16, generating C × M × N high-level capsules Xi with vector length 16.
S103, each high-level capsule Xi is multiplied by a probability value Si, and a summation operation is performed to output a predicted capsule v.
Wherein Si is obtained by converting the weight bi into a probability form through the Softmax function Si = exp(bi) / Σj exp(bj), and the initial value of bi is set to 0.
S104, the activation function v ← (||v||^2 / (1 + ||v||^2)) · (v / ||v||) is applied to the output predicted capsule v for squashing, so that the norm of the output vector of the predicted capsule v lies between [0, 1].
S105, the weight b is updated through the dynamic routing algorithm bi ← bi + Xi · v.
And S106, the operations of S103 to S105 are repeated 3 times; a large number of experiments verified that the deep capsule network of the invention achieves the best experimental effect when the dynamic routing is iterated 3 times, so the number of dynamic routing iterations is set to 3.
More critical features in the feature map obtained by the two-dimensional convolution are combined, so that a more robust predicted capsule v is output as a compact video fingerprint for metric learning.
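A sketch of steps S101 to S106 is given below. Realizing the 8 parallel 9 × 9 convolutions as one convolution with 8 × 64 output channels, sharing a single 8 × 16 pose matrix across capsules, and the exact regrouping of the feature maps into capsules are assumptions; the class and function names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(v, dim=-1, eps=1e-8):
    # Squashing activation: keeps the direction of v and maps its norm into [0, 1).
    norm_sq = (v ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * v / torch.sqrt(norm_sq + eps)

class CapsuleFingerprint(nn.Module):
    def __init__(self, in_channels=128, caps_channels=64, caps_dim=8, out_dim=16, routing_iters=3):
        super().__init__()
        # 9 x 9, stride-2 convolution producing 64 capsule channels of dimension 8 each (S101).
        self.primary = nn.Conv2d(in_channels, caps_channels * caps_dim, kernel_size=9, stride=2)
        self.caps_dim = caps_dim
        self.pose = nn.Linear(caps_dim, out_dim, bias=False)   # 8 x 16 pose matrix (shared, an assumption)
        self.routing_iters = routing_iters

    def forward(self, x):
        n = x.size(0)
        u = self.primary(x)                                 # (n, 64*8, M, N)
        u = u.view(n, -1, self.caps_dim)                    # (n, 64*M*N, 8) primary capsules x_i
        X = self.pose(u)                                    # (n, 64*M*N, 16) high-level capsules X_i (S102)
        b = torch.zeros(n, X.size(1), device=x.device)      # routing weights b_i, initialized to 0
        for _ in range(self.routing_iters):                 # 3 dynamic-routing iterations (S106)
            s = F.softmax(b, dim=1).unsqueeze(-1)           # probability values S_i = Softmax(b_i)
            v = squash((s * X).sum(dim=1))                  # weighted sum then squashing (S103, S104)
            b = b + (X * v.unsqueeze(1)).sum(dim=-1)        # b_i <- b_i + X_i . v (S105)
        return v                                            # 16-dim compact video feature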
Further, for the extraction of the high-level semantic features, in the embodiment of the invention the output features of the last two-dimensional convolutional layer, after passing through the Tanh activation function, are directly used as the input of a fully connected layer, whose number of output dimensions equals the number of classes, for classification learning.
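A sketch of this classification head follows; flattening the feature map before the fully connected layer and the class name are assumptions.

import torch.nn as nn

class ClassificationHead(nn.Module):
    # Maps the Tanh-activated output of the last 2D convolution to one logit per class (4000 in the embodiment).
    def __init__(self, in_features, num_classes=4000):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, feat):
        return self.fc(feat.flatten(start_dim=1))   # high-level semantic (classification) features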
In order to verify the performance of the deep capsule network, the embodiment of the invention evaluates the fingerprint algorithm on a test video data set containing videos different from those in the training set. To better reflect the diversity of videos circulating on the network, the test set consists of 200 videos from TRECVID and 600 videos from YouTube, all visually different from each other. The same 6 common video attack transformations (Gamma correction, Gaussian noise addition, median filtering, frame dropping, Logo insertion and subtitle insertion) and 2 mixed attack transformations (Logo + frame dropping, rotation + cropping) are applied to each video segment, generating 6400 copy videos visually similar to the 800 original videos. ROC (receiver operating characteristic) curves are adopted as the evaluation index of the video fingerprint algorithm, with the miss rate PM (miss probability) and the false alarm rate PFA (false alarm probability) defined as follows: the threshold α is swept over the range [0, 1]; PM is calculated over the 800 similar (copy) video pairs as the proportion of copy pairs whose Hamming distance exceeds the threshold, and PFA is calculated over the non-copy video pairs as the proportion of non-copy pairs whose Hamming distance falls below the threshold. The closer the curve is to the lower left corner, the smaller the miss rate and the false alarm rate, and hence the lower the error rate of the algorithm. The experimental results are shown in figs. 5 to 10; under different types of video attack transformation, the average experimental error rate is reduced to about 0.025%.
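The miss rate and false alarm rate can be computed by sweeping the threshold over the fingerprint length, as sketched below; the number of threshold samples and the function name are assumptions.

import numpy as np

def roc_points(copy_dists, noncopy_dists, length=16):
    # copy_dists: Hamming distances of the copy (visually similar) video pairs.
    # noncopy_dists: Hamming distances of the non-copy video pairs.
    points = []
    for alpha in np.linspace(0.0, 1.0, 101):                      # threshold alpha swept over [0, 1]
        thr = alpha * length
        p_m = float(np.mean(np.asarray(copy_dists) >= thr))       # P_M: copy pairs that are missed
        p_fa = float(np.mean(np.asarray(noncopy_dists) < thr))    # P_FA: non-copy pairs falsely accepted
        points.append((alpha, p_m, p_fa))
    return points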
In summary, the invention provides a deep video fingerprint algorithm based on a capsule network. A three-branch network with shared weights is adopted as the overall framework, and each branch is a deep capsule network, improved from the capsule network, that extracts compact video features. Using the idea of joint convolution, three-dimensional convolution and two-dimensional convolution are combined to extract video spatio-temporal features, which preserves video temporal information as much as possible while greatly improving the computational efficiency of the network and reducing the computational cost of the subsequent capsule network. A fully connected layer is additionally added after the two-dimensional convolution; it takes the video spatio-temporal features extracted by the joint convolution as input and outputs video classification features, thereby enhancing the robustness and discriminative capability of the network.
On the basis of the feature maps extracted by a convolutional neural network, the capsule network based on dynamic routing forms feature vectors from sets of a fixed number of feature points and uses them as the input of a primary capsule layer; each capsule is multiplied by a pose matrix and then dynamic routing is performed.
The above description is only exemplary of the invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the invention is intended to be covered by the appended claims.

Claims (9)

1. A depth video fingerprint algorithm based on a capsule network is characterized by comprising the following steps:
s1, constructing a deep capsule network by taking the weight-sharing three-branch network as a framework and taking the capsule network as a basis, wherein the deep capsule network specifically comprises the following components:
s11, combining the three-dimensional convolution and the two-dimensional convolution to extract the space-time characteristics of the video;
s12, compressing the video space-time characteristics into compact video fingerprints by adopting the deep capsule network;
s13, performing metric learning on the compact video fingerprint by adopting a triple loss function, wherein the triple loss function is a self-adaptive triple angle loss function with a central loss constraint, and the triple angle loss function specifically comprises:
s131, adopting normalized cosine similarity as a measurement function, converting distance operation between the video space-time characteristics into angle operation, and enhancing correlation learning between the video space-time characteristics, wherein the normalized cosine similarity is expressed as:
S(s1, s2) = (s1 · s2) / (||s1||2 ||s2||2)
wherein s1 and s2 denote the compact video fingerprint vectors extracted by the deep capsule network, and ||·||2 denotes the 2-norm;
s132, designing adaptive interval loss, adaptively adjusting an interval value according to the triplet sample pair, wherein the adaptive interval loss β is expressed as:
Figure FDA0002392990320000012
wherein S(v, v-) denotes the cosine similarity between the compact features of the original video and the non-copy video, and S(v, v+) denotes the cosine similarity between the compact features of the original video and the copy video;
s133, adding a central loss constraint term to the triplet sample pair in S132 after the triplet sample pair is lost, and normalizing the similarity learning between the positive sample pairs, where the central loss constraint term θ is expressed as:
θ=||1-S(v,v+)||2
the adaptive triplet angle loss function with the central loss constraint term is specifically represented as:
Figure FDA0002392990320000013
wherein vt, vt+ and vt- respectively denote the compact video features of the original video, the copy video and the non-copy video in the t-th video triplet, and m denotes the batch size;
s2, training the deep capsule network;
and S3, extracting and matching the video fingerprints of the deep capsule network after training.
2. The capsule network-based deep video fingerprint algorithm of claim 1, wherein the S2 is used for training the deep capsule network, and specifically comprises:
s21, establishing a training video data set;
s22, preprocessing the training video data set to obtain a video triple;
s23, taking a video triple as the input of the three-branch network, and extracting the high-level semantic features and the compact video features of each video through a forward propagation algorithm;
s24, calculating a loss value generated by measuring loss through the compact video features extracted by the deep capsule network;
s25, calculating a loss value generated by classification loss through the high-level semantic features extracted by the deep capsule network;
s26, calculating the gradient of each node in the deep capsule network according to a back propagation algorithm;
s27, optimizing and updating the weight of each node in the deep capsule network by adopting an SGD random gradient descent method;
s28, repeating the S23-S27 until the loss values in S24 and S25 are not changed any more, and finishing the training of the deep capsule network.
3. The capsule network-based deep video fingerprint algorithm of claim 1, wherein the S3 is configured to perform video fingerprint extraction and matching on the deep capsule network after training, and specifically includes:
s31, selecting input videos, wherein the input videos comprise original videos and query videos, and preprocessing the input videos;
s32, taking a single-branch network in the trained deep capsule network as an extractor, and taking an original video and a query video as the input of the deep capsule network respectively to extract compact video features of the original video and the query video;
s33, binary coding is carried out on the compact video features extracted by the deep capsule network, and an original video fingerprint and an inquiry video fingerprint are respectively generated;
s34, calculating the Hamming distance between the original video fingerprint and the query fingerprint;
s35, setting a threshold value and judging whether a copy relation exists between the query video and the original video according to the calculated Hamming distance;
and if the Hamming distance is smaller than a set threshold, the query video is defined as a copy video, and if the Hamming distance is larger than the set threshold, the query video is defined as a non-copy video.
4. The capsule network-based deep video fingerprint algorithm according to claim 2, wherein the extracted compact video features are compressed, and the compression process specifically comprises:
s101, performing convolution operation on compact video characteristics output by the last two-dimensional convolution layer to obtain a capsule serving as an input of a primary capsule layer;
s102, for each capsule xiRespectively processing to obtain high-grade capsule Xi
S103, multiplying each high-level capsule Xi by a probability value Si and performing a summation operation to output a predicted capsule v, wherein Si is obtained by converting a weight bi into a probability form through the Softmax function Si = exp(bi) / Σj exp(bj);
s104, adopting an activation function for the output prediction capsule v
Figure FDA0002392990320000032
Flattening to make the output vector norm of the predicted capsule v at [0, 1%]To (c) to (d);
s105, passing dynamic routing algorithm bi←bi+Xi+ v updates the weight b;
and S106, repeating the operations from S103 to S105 for 3 times, combining the features in the feature map obtained by two-dimensional convolution, and outputting the robust prediction capsule v as a compact video fingerprint for metric learning.
5. The capsule network-based deep video fingerprint algorithm according to claim 2, characterized in that the extraction of the high-level semantic features specifically comprises:
the output features of the last two-dimensional convolution layer, after passing through a Tanh activation function, are used as the input of a fully connected layer, and the number of output dimensions of the fully connected layer is the same as the number of classes, for classification learning.
6. The capsule network-based depth video fingerprint algorithm of claim 2, wherein the video triplets are specifically:
and simultaneously extracting an original video, a non-copy video and a copy video corresponding to the original video from the training data set to form a pair of video triples, wherein the original video and the non-copy video have different contents.
7. The capsule network-based deep video fingerprint algorithm of claim 2, characterized in that the loss value generated by the classification loss is calculated with a cross-entropy loss function L2, which serves as the classification loss function and computes the loss value generated by the classification features output by the deep capsule network, wherein the cross-entropy loss function L2 is given by the following formula:
Figure FDA0002392990320000041
wherein xi denotes the classification feature of the i-th video output by the network, n denotes the batch size, y denotes the true video label, and σ denotes the Sigmoid activation function.
8. The capsule network-based deep video fingerprint algorithm of claim 2, wherein a loss function is set for the deep capsule network, and the gradient of each node in the deep capsule network is derived automatically according to the set loss function.
9. The capsule network-based deep video fingerprint algorithm of claim 3, wherein the original video fingerprint and the query video fingerprint are both 16-bit video fingerprints.
CN202010121069.5A 2020-02-26 2020-02-26 Deep video fingerprint algorithm based on capsule network Active CN111325169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010121069.5A CN111325169B (en) 2020-02-26 2020-02-26 Deep video fingerprint algorithm based on capsule network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010121069.5A CN111325169B (en) 2020-02-26 2020-02-26 Deep video fingerprint algorithm based on capsule network

Publications (2)

Publication Number Publication Date
CN111325169A true CN111325169A (en) 2020-06-23
CN111325169B CN111325169B (en) 2023-04-07

Family

ID=71173154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010121069.5A Active CN111325169B (en) 2020-02-26 2020-02-26 Deep video fingerprint algorithm based on capsule network

Country Status (1)

Country Link
CN (1) CN111325169B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115998A (en) * 2020-09-11 2020-12-22 昆明理工大学 Method for overcoming catastrophic forgetting based on anti-incremental clustering dynamic routing network
CN112307258A (en) * 2020-11-25 2021-02-02 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN112633340A (en) * 2020-12-14 2021-04-09 浙江大华技术股份有限公司 Target detection model training method, target detection model training device, target detection model detection device and storage medium
CN112733701A (en) * 2021-01-07 2021-04-30 中国电子科技集团公司信息科学研究院 Robust scene recognition method and system based on capsule network
CN113763332A (en) * 2021-08-18 2021-12-07 上海建桥学院有限责任公司 Pulmonary nodule analysis method and device based on ternary capsule network algorithm and storage medium
CN113971686A (en) * 2021-10-26 2022-01-25 哈尔滨工业大学 Target tracking method based on background restoration and capsule network
CN116866089A (en) * 2023-09-05 2023-10-10 鹏城实验室 Network flow detection method and device based on twin capsule network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160267178A1 (en) * 2015-03-13 2016-09-15 TCL Research America Inc. Video retrieval based on optimized selected fingerprints
CN109840560A (en) * 2019-01-25 2019-06-04 西安电子科技大学 Based on the image classification method for incorporating cluster in capsule network
CN110569781A (en) * 2019-09-05 2019-12-13 河海大学常州校区 time sequence classification method based on improved capsule network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160267178A1 (en) * 2015-03-13 2016-09-15 TCL Research America Inc. Video retrieval based on optimized selected fingerprints
CN109840560A (en) * 2019-01-25 2019-06-04 西安电子科技大学 Based on the image classification method for incorporating cluster in capsule network
CN110569781A (en) * 2019-09-05 2019-12-13 河海大学常州校区 time sequence classification method based on improved capsule network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Dongdong et al.: "Video fingerprint algorithm based on spatio-temporal deep neural network", Laser & Optoelectronics Progress *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115998A (en) * 2020-09-11 2020-12-22 昆明理工大学 Method for overcoming catastrophic forgetting based on anti-incremental clustering dynamic routing network
CN112115998B (en) * 2020-09-11 2022-11-25 昆明理工大学 Method for overcoming catastrophic forgetting based on anti-incremental clustering dynamic routing network
CN112307258A (en) * 2020-11-25 2021-02-02 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN112307258B (en) * 2020-11-25 2021-07-20 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN112633340B (en) * 2020-12-14 2024-04-02 浙江大华技术股份有限公司 Target detection model training and detection method, device and storage medium
CN112633340A (en) * 2020-12-14 2021-04-09 浙江大华技术股份有限公司 Target detection model training method, target detection model training device, target detection model detection device and storage medium
CN112733701A (en) * 2021-01-07 2021-04-30 中国电子科技集团公司信息科学研究院 Robust scene recognition method and system based on capsule network
CN113763332A (en) * 2021-08-18 2021-12-07 上海建桥学院有限责任公司 Pulmonary nodule analysis method and device based on ternary capsule network algorithm and storage medium
CN113763332B (en) * 2021-08-18 2024-05-31 上海建桥学院有限责任公司 Pulmonary nodule analysis method and device based on ternary capsule network algorithm and storage medium
CN113971686A (en) * 2021-10-26 2022-01-25 哈尔滨工业大学 Target tracking method based on background restoration and capsule network
CN113971686B (en) * 2021-10-26 2024-05-31 哈尔滨工业大学 Target tracking method based on background restoration and capsule network
CN116866089B (en) * 2023-09-05 2024-01-30 鹏城实验室 Network flow detection method and device based on twin capsule network
CN116866089A (en) * 2023-09-05 2023-10-10 鹏城实验室 Network flow detection method and device based on twin capsule network

Also Published As

Publication number Publication date
CN111325169B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111325169B (en) Deep video fingerprint algorithm based on capsule network
Jia et al. Coarse-to-fine copy-move forgery detection for video forensics
Khan et al. Ant Colony Optimization (ACO) based Data Hiding in Image Complex Region.
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
Li et al. Steganalysis over large-scale social networks with high-order joint features and clustering ensembles
CN111709408A (en) Image authenticity detection method and device
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
Meng et al. A survey of image information hiding algorithms based on deep learning
CN110751018A (en) Group pedestrian re-identification method based on mixed attention mechanism
CN106778571B (en) Digital video feature extraction method based on deep neural network
CN110765841A (en) Group pedestrian re-identification system and terminal based on mixed attention mechanism
Ding et al. Noise-resistant network: a deep-learning method for face recognition under noise
CN111160313A (en) Face representation attack detection method based on LBP-VAE anomaly detection model
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
CN110853074A (en) Video target detection network system for enhancing target by utilizing optical flow
Wang et al. HidingGAN: High capacity information hiding with generative adversarial network
Gan et al. Video object forgery detection algorithm based on VGG-11 convolutional neural network
CN115294655A (en) Method, device and equipment for countermeasures generation pedestrian re-recognition based on multilevel module features of non-local mechanism
CN114387641A (en) False video detection method and system based on multi-scale convolutional network and ViT
Zhao et al. Detecting deepfake video by learning two-level features with two-stream convolutional neural network
Chen et al. Image splicing localization using residual image and residual-based fully convolutional network
Li et al. One-class double compression detection of advanced videos based on simple Gaussian distribution model
Xu et al. Document images forgery localization using a two‐stream network
CN112990357B (en) Black box video countermeasure sample generation method based on sparse disturbance
Dai et al. HEVC video steganalysis based on PU maps and multi-scale convolutional residual network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant