CN108734106B - Rapid riot and terrorist video identification method based on comparison - Google Patents

Rapid riot and terrorist video identification method based on comparison

Info

Publication number
CN108734106B
CN201810366397.4A · CN108734106B
Authority
CN
China
Prior art keywords
video
riot
terrorist
layer
key frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810366397.4A
Other languages
Chinese (zh)
Other versions
CN108734106A (en)
Inventor
李兵
胡卫明
原春锋
王博
赵永帅
刘琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201810366397.4A
Publication of CN108734106A
Application granted
Publication of CN108734106B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to the field of video classification, and provides a rapid riot and terrorist video identification method based on comparison, aiming at solving the problem that the precision and recall of riot and terrorist video identification based on visual features are relatively low due to the limited descriptive power of feature descriptors. The method comprises the following steps: performing shot segmentation on a video to be detected for riot and terrorist identification so as to select key frames of the video to be detected; performing a hash-code operation on each key frame of the video to be detected by using a pre-constructed riot and terrorist video identification model to obtain the hash code of each key frame; comparing the hash code of each key frame with the hash codes of the video frames of pre-stored riot and terrorist videos to determine the video frames similar to each key frame; and, if the number of key frames for which similar video frames are found exceeds a set threshold, determining that the video to be detected is a riot and terrorist video. The method can quickly and accurately identify riot and terrorist videos from a large number of videos.

Description

Rapid riot and terrorist video identification method based on comparison
Technical Field
The invention relates to the technical field of computer vision, in particular to the field of video classification, and specifically relates to a rapid riot and terrorist video identification method based on comparison.
Background
A riot and terrorist video is a video containing content such as propaganda for violent terrorism, religious extremism, and ethnic separatism. With the rapid development of network technology and the arrival of the mobile internet era, more and more multimedia data appear before people, and riot and terrorist videos are likewise spread and diffused in large quantities. At present, detection of riot and terrorist videos mainly relies on manual review and labeling, which consumes a great amount of financial and material resources. Therefore, facing an internet of ever-increasing data volume, a new technology is needed to automatically filter riot and terrorist video and image content and to deploy early-warning controls in important public places.
The visual features currently applied to riot and terrorist video detection fall into two main categories: static features and dynamic features. Static features describe characteristics within a video frame, including color, texture, and structure. Such features effectively reflect information such as background, environment, and the appearance of main characters; MPEG-7 is a typical set of static features, comprising visual descriptors such as CLD, CSD, SC, and EH. Dynamic features describe characteristics between video frames, including motion amplitude, direction, and frequency, and effectively reflect the motion of the main subjects in the video. Dynamic features are mostly tracked and extracted using corner detection algorithms, for example HOG, HOF, and MoSIFT. Among these, the MoSIFT algorithm detects local features, and its descriptor can only extract features in regions with sufficient motion. However, the above feature descriptors have limited descriptive power, and it is difficult for them to fully and accurately describe the content of a video image; in riot and terrorist videos in particular, detection must target specific objects, so the precision and recall of the detection work are relatively low.
Disclosure of Invention
In order to solve the above problems in the prior art, namely that, because two videos may contain multiple copied segments, the copy relationship of some edited videos cannot be accurately detected and the positions of the copied video segments cannot be accurately located, the present application provides a rapid riot and terrorist video identification method based on comparison.
The application provides a rapid riot and terrorist video identification method based on comparison, which comprises the following steps: performing shot segmentation on a video to be detected for riot and terrorist identification so as to select key frames of the video to be detected; performing a hash-code operation on each key frame of the video to be detected by using a pre-constructed riot and terrorist video identification model to obtain the hash code of each key frame, wherein the riot and terrorist video identification model is constructed based on a hash network, its input is a video frame, and its output is the hash code of the input video frame; comparing the hash code of each key frame with the hash codes of the video frames of pre-stored riot and terrorist videos to determine the video frames similar to each key frame; and counting the number of similar frames and, if the number of key frames for which similar video frames are found exceeds a set threshold, determining that the video to be detected is a riot and terrorist video.
In some examples, "shot segmentation is performed on a video to be detected for riot and terrorist identification to select a key frame of the video to be detected", including: extracting a histogram of each frame of the video to be detected, and performing difference comparison on the histograms of adjacent video frames to determine a shot boundary of the video to be detected; and selecting the starting frame and/or the ending frame of each shot of the video to be detected as a key frame according to the determined shot boundary.
In some examples, "comparing the hash code of each of the key frames with the hash codes of the video frames of the pre-stored riot video respectively to determine video frames similar to each of the key frames" includes: comparing the hash code of each key frame with the hash codes of the video frames of the violence and terrorism videos in the video library respectively; calculating the Hamming distance between the hash code of the key frame and the hash code of the video frame; and confirming the key frame and the video frame with the Hamming distance radius within the set value range as similar frames.
In some examples, the training method of the riot and terrorist video recognition model is as follows: classifying preset training sample pictures into positive sample data and negative sample data, wherein the positive sample data are pairs of riot and terrorist pictures and the negative sample data are pairs consisting of one riot and terrorist picture and one non-riot picture; adjusting the size of the training sample pictures, randomly cropping a region of set size from each adjusted picture, and performing sample-mean processing; and training on the processed pictures with the initial riot and terrorist video recognition model to obtain the hash-network-based riot and terrorist video recognition model.
In some examples, the network structure of the initial riot video recognition model includes an input layer, a convolutional layer, and a fully-connected layer, where the first layer is the input layer, the second to sixth layers are convolutional layers, and the seventh to ninth layers are fully-connected layers.
In some examples, in training the riot video recognition model, the input layer inputs the training sample picture after sample mean processing.
In some examples, the convolutional layer receives the output of the previous layer, applies its convolution processing, and outputs the result after activation by the activation function of that layer; the fully connected layer likewise receives the output of the previous layer, applies its convolution processing, and outputs the result after activation by its activation function.
In some examples, the activation functions of the second layer to the eighth layer of the network structure of the initial riot and terrorist video recognition model are:

ReLU(x) = max(0, x)

where ReLU(x) is the activation function and x is the convolved output of the layer.
In some examples, the activation function of the ninth layer of the network structure of the initial riot and terrorist video recognition model is given by a formula not reproduced in this text (original formula image GDA0002614924490000032); it is the result obtained by taking the partial derivative with respect to b_{i,j}.
In some examples, the loss function for training the above-described riot and terrorist video recognition model is:

L_r = Σ_{i=1}^{N} [ (1/2) y_i ||b_{i,1} - b_{i,2}||_2^2 + (1/2)(1 - y_i) max(m - ||b_{i,1} - b_{i,2}||_2^2, 0) + α ( || |b_{i,1}| - 1 ||_1 + || |b_{i,2}| - 1 ||_1 ) ]

s.t. b_{i,j} ∈ {-1, +1}^k, i ∈ {1, ..., N}, j ∈ {1, 2}

where y_i indicates whether the sample pair is similar, i.e., y_i = 1 indicates that the two samples are similar and y_i = 0 that they are not; ||b_{i,1} - b_{i,2}||_2^2 is the Euclidean distance between the two sample binary codes in a sample pair; || |b_{i,j}| - 1 ||_1 is the Manhattan distance between a sample binary code and the all-ones vector; L_r is the loss function; m (m > 0) is a margin threshold parameter; α is a scaling factor; b_{i,1} and b_{i,2} are the hash codes of sample 1 and sample 2; N is the total number of training sample pairs; and k is the dimension of the hash code.
According to the rapid riot and terrorist video identification method based on comparison, key frames are extracted by performing structural analysis on the video under riot and terrorist detection; next, the hash code of each key frame of the video segment is determined with a hash-network-based riot and terrorist video identification model; then, the hash codes of the key frames of the video to be detected are matched against the pre-stored hash codes of riot and terrorist video key frames to determine whether the video to be detected is a riot and terrorist video. By performing structured analysis on the video to be detected and extracting key frames, the method achieves a good balance between the precision and the speed of shot detection; by comparing the hash codes of the key frames with the pre-stored hash codes, it can quickly be judged whether the video to be detected contains riot and terrorist content; and the pre-stored hash codes occupy little space and are fast to retrieve, so the method can quickly and accurately identify riot and terrorist videos.
Drawings
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a contrast-based fast riot and terrorist video identification method according to the present application;
FIG. 3 is a schematic diagram of a network structure of a hash network model in an embodiment of a contrast-based fast riot and terrorist video identification method according to the present application;
fig. 4 is a flowchart illustrating an application example of the contrast-based fast riot and terrorist video identification method according to the present application.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture to which an embodiment of the contrast-based fast riot video identification method of the present application may be applied.
As shown in fig. 1, the system architecture may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as a web browser application, a video browsing application, a video uploading application, social platform software, etc., may be installed on the terminal device 101.
The terminal device 101 may be various electronic devices having a display screen and supporting video browsing or video uploading, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 103 may be a server that provides various services, such as a video processing server or application platform that performs riot and terrorist recognition on videos uploaded by the terminal device 101. The video processing server can analyze and process the video data uploaded by each terminal device connected to it through the network, and feed the processing result (such as the riot and terrorist identification result of the video) back to the terminal device or to a third party for use.
It should be noted that the contrast-based fast riot video identification method provided in the embodiment of the present application is generally executed by the server 103, and accordingly, an apparatus to which the method shown in the present application can be applied is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow diagram of one embodiment of a contrast-based fast riot-terrorist video identification method according to the present application is shown. The rapid riot and terrorist video identification method based on comparison comprises the following steps:
step 201, performing shot segmentation on a video to be detected for riot and terrorist identification to select a key frame of the video to be detected.
In this embodiment, an electronic device (such as the server in fig. 1) or an application platform that uses a rapid riot and terrorist video identification method based on comparison may be used to obtain a video to be detected that is to be subjected to the riot and terrorist detection. The electronic equipment or the application platform respectively performs shot segmentation on the obtained video to be detected so as to extract key frames of the video to be detected. For example, the video to be detected may be obtained from a terminal device connected to the electronic device or the application platform, for example, after a user using the terminal device connected to the server or the application platform via a network uploads the video, the server or the application platform obtains the video as the video to be detected.
Specifically, the "performing shot segmentation on a video to be detected for riot and terrorist identification to select a key frame of the video to be detected" includes: extracting a histogram of each frame of video of a video to be detected, and performing difference comparison on the histograms of adjacent video frames to determine a shot boundary of the video to be detected; and selecting the starting frame and/or the ending frame of each shot of the video to be detected as a key frame according to the determined shot boundary. The histogram for extracting each frame of video may be a gray level histogram or a color histogram. After a video to be detected is divided into a series of shots, the first frame or the last frame of each shot can be used as a key frame of the shot; the first and last frames may also be used as key frames.
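By way of illustration only, the following is a minimal Python sketch of such histogram-difference shot segmentation using OpenCV; the 64-bin gray-level histogram and the 0.5 difference threshold are illustrative assumptions, not values fixed by this disclosure.

```python
# Illustrative sketch (not part of the original disclosure): histogram-based
# shot segmentation. Assumed parameters: 64-bin gray histogram, L1-difference
# threshold 0.5; the start frame of each detected shot is taken as a key frame.
import cv2
import numpy as np

def select_key_frames(video_path, diff_threshold=0.5):
    """Return (frame_index, frame) pairs: the start frame of each detected shot."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256]).ravel()
        hist /= max(hist.sum(), 1.0)                      # normalize to a distribution
        if prev_hist is None or np.abs(hist - prev_hist).sum() > diff_threshold:
            key_frames.append((idx, frame))               # large jump => shot boundary
        prev_hist = hist
        idx += 1
    cap.release()
    return key_frames
```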
Step 202, performing hash code operation on each key frame of the video to be detected by using a pre-constructed riot and terrorist video identification model to obtain the hash code of each key frame.
In this embodiment, based on the plurality of key frames of the video to be detected selected in step 201, the electronic device or the application platform performs an operation by using a pre-established hash network model to determine the hash code of each key frame. Here, the riot-terrorist video recognition model may be a deep convolutional neural network model, for example, a Siamese network model, and the hash operation of the video keyframe to be detected is completed by adding a designed hash loss using the Siamese network model. The riot and terrorist video identification model is constructed based on a hash network, the input of the model is a video frame, and the output of the model is a hash code of the input video frame.
The riot and terrorist video identification model determines the hash code of a key frame by taking an input frame (picture) and completing the hash operation through the optimized computation of the deep convolutional neural network. The model may operate on the features of key frames: static features, which include color, texture, structure, and the like and reflect information such as background, environment, and the appearance of main characters; and dynamic features, which include motion amplitude, direction, frequency, and the like and reflect the motion of the main subjects in the video. The hash code of a key frame is determined using these features.
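As a sketch of this step, assuming the trained hash network outputs a k-dimensional near-binary vector, the code below binarizes it by thresholding at zero; the thresholding rule and the input shape are assumptions, since the ninth-layer activation formula is not reproduced in this text.

```python
# Illustrative sketch: computing a key frame's hash code with a trained model.
# Binarizing the near-binary output by thresholding at zero is an assumption.
import torch

@torch.no_grad()
def compute_hash(model, frame_tensor):
    """frame_tensor: (3, 227, 227) mean-subtracted RGB tensor; returns (k,) bools."""
    model.eval()
    out = model(frame_tensor.unsqueeze(0))  # (1, k) real-valued, near-binary output
    return (out.squeeze(0) >= 0)            # True ~ +1, False ~ -1
```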
Step 203, comparing the hash code of each key frame with the hash codes of the video frames of the pre-stored riot video respectively, and determining the video frames similar to each key frame.
In this embodiment, based on the hash code of the key frame of the video to be detected obtained by the computation using the riot and terrorist video recognition model in step 202, the electronic device or the application platform compares with the pre-stored hash code to determine whether the key frame of the video to be detected is similar to the video frame of the riot and terrorist video. The pre-stored hash code may be a hash code of a video frame of the riot video.
Here, the above-mentioned pre-stored hash codes are obtained as follows: first, riot and terrorist videos are extracted from a video library; then, key video frames of all the extracted riot and terrorist videos are extracted offline or online; finally, the extracted key video frames are input into the hash-network-based riot and terrorist video identification model for computation, and the resulting hash codes of the riot and terrorist videos are stored.
The hash code comparison may be comparing the hash code of the key frame with a hamming distance of a pre-stored hash code, and determining whether the key frame is similar to the video frame of the riot video according to the hamming distance.
In some optional implementation manners of this embodiment, comparing the hash code of each of the key frames with the hash codes of the video frames of the pre-stored riot and terrorist videos to determine video frames similar to each of the key frames includes: comparing the hash code of each key frame with the hash codes of the video frames of the riot and terrorist videos in the video library; calculating the Hamming distance between the hash code of the key frame and the hash code of the video frame; and confirming a key frame and a video frame whose Hamming distance is within a set radius as similar frames. Specifically, two pictures with a Hamming distance of 2 or less may be confirmed as similar frames.
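A minimal sketch of this comparison, assuming the hash codes are stored as equal-length boolean (or 0/1) arrays:

```python
# Illustrative sketch: Hamming-distance similarity test with radius 2,
# as in the example above.
import numpy as np

def is_similar(code_a, code_b, radius=2):
    """True if the two codes differ in at most `radius` bit positions."""
    return np.count_nonzero(code_a != code_b) <= radius
```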
And 204, counting the number of similar frames, and if the number of the video frames similar to each key frame exceeds a set threshold, determining that the video to be detected is a riot video.
In this embodiment, key frames similar to video frames of the riot and terrorist videos in the riot and terrorist video database were determined in step 203; the number of key frames of the video to be detected that are similar to such video frames is counted, and if this number is greater than a set threshold, the video to be detected can be determined to be a riot and terrorist video. Specifically, if the video to be detected has 3 or more key frames similar to video frames of riot and terrorist videos in the library, it is determined to be a riot and terrorist video.
In some optional implementation manners of this embodiment, the training method of the above hash-network-based riot and terrorist video recognition model is as follows: classifying preset training sample pictures into positive sample data and negative sample data, wherein the positive sample data are pairs of riot and terrorist pictures and the negative sample data are pairs consisting of one riot and terrorist picture and one non-riot picture; adjusting the size of the training sample pictures, randomly cropping a region of set size from each adjusted picture, and performing sample-mean processing; and training on the processed pictures with the initial riot and terrorist video recognition model to obtain the hash-network-based riot and terrorist video recognition model. Specifically, the training data may be divided into two groups: positive sample data (a pair of riot and terrorist pictures, labelled 1) and negative sample data (a pair of one riot and terrorist picture and one non-riot picture, labelled 0), so that the hash codes of riot and terrorist videos are as similar as possible while the hash codes of non-riot videos stay as far as possible from those of riot and terrorist videos.
The training sample pictures are resized to 256 × 256; a 227 × 227 region is then randomly cropped from each, and the sample mean is subtracted to form the processed sample picture, which is input directly into the initial hash network model for training. The sample mean is the mean over all pixel points of the sample pictures; subtracting it before training and testing improves training speed and testing accuracy.
A picture pair from the positive sample data (two riot and terrorist pictures) or from the negative sample data (one riot and terrorist picture and one non-riot picture) is input into the initial hash network model for training.
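A minimal sketch of this sample preparation, assuming a precomputed per-pixel mean image over all training samples (a scalar mean would be handled analogously):

```python
# Illustrative sketch: resize to 256x256, random 227x227 crop, mean subtraction,
# and pair labelling (1 = similar/positive pair, 0 = dissimilar/negative pair).
import cv2
import numpy as np

def preprocess(image, mean_image):
    """image: HxWx3 uint8; mean_image: 227x227x3 float32 (assumed precomputed)."""
    img = cv2.resize(image, (256, 256)).astype(np.float32)
    y = np.random.randint(0, 256 - 227 + 1)
    x = np.random.randint(0, 256 - 227 + 1)
    return img[y:y + 227, x:x + 227] - mean_image

def make_pair(img1, img2, label, mean_image):
    """label: 1 for two riot pictures, 0 for one riot and one non-riot picture."""
    return preprocess(img1, mean_image), preprocess(img2, mean_image), label
```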
In some optional implementations of this embodiment, the network structure of the initial riot and terrorist video recognition model includes an input layer, convolutional layers, and fully connected layers, as shown in fig. 3, a schematic diagram of the hash network structure. The first layer is the input layer, the second to sixth layers are convolutional layers, and the seventh to ninth layers are fully connected layers. The processed training sample pictures are input into the input layer as a pair of RGB three-channel pictures. The convolutional layers of the second to sixth layers are denoted conv1-conv5 in fig. 3; the fully connected layers of the seventh to ninth layers are denoted fc1-fc3 in fig. 3. The loss function (loss) on the fully connected layers is designed for two characteristics: discriminative power ("Discriming" in fig. 3) and near-binary codes ("Binary-like" in fig. 3).
The convolutional layer receives the output of the previous layer, applies its convolution processing, and outputs the result after activation by the activation function of that layer; the fully connected layer likewise receives the output of the previous layer, applies its convolution processing, and outputs the result after activation by its activation function. Specifically:

The second layer is a convolutional layer with 64 convolution kernels, each of size 11 × 11, convolution stride 4, and padding 0; the output feature map is followed by an activation layer, a downsampling layer, and a normalization layer. The activation layer uses the ReLU function. The downsampling layer uses maximum-value sampling with a 3 × 3 sampling kernel and stride 2. The normalization layer uses LRN with kernel size 5, alpha = 0.00001, and beta = 0.75, where alpha is the scaling factor and beta is the exponent term. The second layer takes the output of the first layer; after convolution the output is C_1; C_1 is input to the downsampling layer to obtain P_1; P_1 is input to the activation layer to obtain A_1; A_1 is input to the normalization layer to obtain L_1; and L_1 is finally output to the third layer.

The third layer is a convolutional layer with 256 convolution kernels, each of size 5 × 5, convolution stride 1, and padding 2; the output feature map is followed by an activation layer, a downsampling layer, and a normalization layer. The activation layer uses the ReLU function. The downsampling layer uses maximum-value sampling with a 3 × 3 sampling kernel and stride 2. The normalization layer uses LRN with kernel size 5, alpha = 0.00001, and beta = 0.75. The third layer takes the output of the second layer; after convolution the output is C_2; C_2 is input to the downsampling layer to obtain P_2; P_2 is input to the activation layer to obtain A_2; A_2 is input to the normalization layer to obtain L_2; and L_2 is finally output to the fourth layer.

The fourth layer is a convolutional layer with 256 convolution kernels, each of size 3 × 3, stride 1, and padding 1; the output feature map is followed by an activation layer. The activation layer uses the ReLU function. The fourth layer takes the output of the third layer; after convolution the output is C_3; C_3 is input to the activation layer to obtain A_3; and A_3 is finally output to the fifth layer.

The fifth layer is a convolutional layer with 256 convolution kernels, each of size 3 × 3, convolution stride 1, and padding 1; the output feature map is followed by an activation layer. The activation layer uses the ReLU function. The fifth layer takes the output of the fourth layer; after convolution the output is C_4; C_4 is input to the activation layer to obtain A_4; and A_4 is finally output to the sixth layer.

The sixth layer is a convolutional layer with 256 convolution kernels, each of size 3 × 3, convolution stride 1, and padding 1; the output feature map is followed by an activation layer and a downsampling layer. The activation layer uses the ReLU function. The downsampling layer uses maximum-value sampling with a 3 × 3 sampling kernel and stride 2. The sixth layer takes the output of the fifth layer; after convolution the output is C_5; C_5 is input to the downsampling layer to obtain P_5; P_5 is input to the activation layer to obtain A_5; and A_5 is finally output to the seventh layer.

The seventh layer is a fully connected layer with 4096 convolution kernels, each of size 1 × 1 and stride 1; the output feature map is followed by an activation layer. The activation layer uses the ReLU function. The seventh layer takes the output of the sixth layer; after convolution the output is C_6; C_6 is input to the activation layer to obtain A_6; and A_6 is finally output to the eighth layer.

The eighth layer is a fully connected layer with 4096 convolution kernels, each of size 1 × 1 and stride 1; the output feature map is followed by an activation layer. The activation layer uses the ReLU function. The eighth layer takes the output of the seventh layer; after convolution the output is C_7; C_7 is input to the activation layer to obtain A_7; and A_7 is finally output to the last layer.

The ninth layer is a fully connected layer whose number of convolution kernels is determined by the required hash code length; each convolution kernel is of size 1 × 1 with stride 1, and the output feature map is followed by the hash loss layer, which applies the hash function. The ninth layer takes the output of the eighth layer; after convolution the output is C_8, and C_8 is input to the hash loss layer, which outputs the pair of hash binary codes (b_{i,1}, b_{i,2}).
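The following PyTorch sketch mirrors the nine-layer structure just described, including the stated order convolution, downsampling, activation, normalization; implementing the 1 × 1-kernel fully connected layers as nn.Linear modules and the default hash length of 48 bits are assumptions, not values fixed by this disclosure.

```python
# Illustrative PyTorch sketch of the nine-layer hash network described above.
# Layer 1 is the input itself; layers 2-6 are convolutional; layers 7-9 are
# fully connected. hash_bits=48 is an assumed default.
import torch
import torch.nn as nn

class HashNet(nn.Module):
    def __init__(self, hash_bits=48):
        super().__init__()
        def lrn():  return nn.LocalResponseNorm(size=5, alpha=1e-5, beta=0.75)
        def pool(): return nn.MaxPool2d(kernel_size=3, stride=2)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 11, stride=4, padding=0),    # layer 2: conv1
            pool(), nn.ReLU(inplace=True), lrn(),
            nn.Conv2d(64, 256, 5, stride=1, padding=2),   # layer 3: conv2
            pool(), nn.ReLU(inplace=True), lrn(),
            nn.Conv2d(256, 256, 3, stride=1, padding=1),  # layer 4: conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=1, padding=1),  # layer 5: conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=1, padding=1),  # layer 6: conv5
            pool(), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # layer 7: fc1
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),         # layer 8: fc2
            nn.Linear(4096, hash_bits),                           # layer 9: fc3
        )

    def forward(self, x):               # x: (B, 3, 227, 227)
        return self.classifier(self.features(x).flatten(1))

# Siamese usage: the same weights encode both pictures of a pair.
# net = HashNet(); b1, b2 = net(x1), net(x2)
```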
Each of the above layers includes an activation function, wherein the activation functions of the second layer to the eighth layer are:

ReLU(x) = max(0, x)

where ReLU(x) is the activation function and x is the convolved output of the layer.
The activation function of the ninth layer of the network structure of the initial riot and terrorist video identification model is given by a formula not reproduced in this text (original formula image GDA0002614924490000092); it is the result obtained by taking the partial derivative with respect to b_{i,j}.
The loss function for training the riot and terrorist video recognition model is:

L_r = Σ_{i=1}^{N} [ (1/2) y_i ||b_{i,1} - b_{i,2}||_2^2 + (1/2)(1 - y_i) max(m - ||b_{i,1} - b_{i,2}||_2^2, 0) + α ( || |b_{i,1}| - 1 ||_1 + || |b_{i,2}| - 1 ||_1 ) ]

s.t. b_{i,j} ∈ {-1, +1}^k, i ∈ {1, ..., N}, j ∈ {1, 2}

where y_i indicates whether the sample pair is similar, i.e., y_i = 1 means that the two samples are similar, and vice versa; ||b_{i,1} - b_{i,2}||_2^2 is the Euclidean distance between the two sample binary codes in a sample pair; || |b_{i,j}| - 1 ||_1 is the Manhattan distance between a sample binary code and the all-ones vector; L_r is the loss function; m (m > 0) is a margin threshold parameter; α is a scaling factor; b_{i,1} and b_{i,2} are the hash codes of sample 1 and sample 2; N is the total number of training sample pairs; and k is the dimension of the hash code.
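A minimal sketch of this pairwise loss as reconstructed above: similar pairs (y_i = 1) are pulled together, dissimilar pairs are pushed beyond the margin m, and the α-weighted term drives each code component toward ±1. The default values of m and α and the averaging over the batch are assumptions.

```python
# Illustrative sketch of the pairwise hash loss. m=2.0 and alpha=0.01 are
# assumed defaults; b1, b2 are (N, k) network outputs, y is (N,) with 1 for
# similar pairs and 0 for dissimilar pairs.
import torch

def hash_loss(b1, b2, y, m=2.0, alpha=0.01):
    d2 = (b1 - b2).pow(2).sum(dim=1)                      # squared Euclidean distance
    contrastive = 0.5 * y * d2 + 0.5 * (1 - y) * torch.clamp(m - d2, min=0)
    reg = (b1.abs() - 1).abs().sum(dim=1) + (b2.abs() - 1).abs().sum(dim=1)
    return (contrastive + alpha * reg).mean()             # regularizer pushes codes to +-1
```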
As an example, referring to fig. 4, fig. 4 shows a schematic diagram of contrast-based fast riot and terrorist video recognition. As shown in fig. 4, on one hand, key frames of riot and terrorist videos are extracted from the video database in advance, and the hash code of each key frame is generated using the riot and terrorist video identification model. On the other hand, key frames of the video to be detected are extracted, and their hash codes are generated using the hash network model. The Hamming distance between the hash codes of the key frames of the video to be detected and those of the riot and terrorist key frames is then compared: two pictures with a Hamming distance within 2 are confirmed as similar frames. Finally, if the video to be detected has 3 or more key frames similar to key frames of riot and terrorist videos in the video library, it is considered a riot and terrorist video.
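Tying the pieces together, the following sketch implements the decision rule of fig. 4 using the helpers sketched earlier (select_key_frames, compute_hash, is_similar); frame_to_tensor is a hypothetical preprocessing helper, and the radius-2 and 3-frame thresholds follow the values stated above.

```python
# Illustrative end-to-end sketch of the fig. 4 pipeline. frame_to_tensor is a
# hypothetical helper; library_codes is an iterable of boolean hash codes
# precomputed from riot and terrorist video key frames.
import cv2
import numpy as np
import torch

def frame_to_tensor(frame_bgr, mean_image):
    """Assumed preprocessing: resize to 256x256, central 227x227 crop, mean subtraction."""
    rgb = cv2.cvtColor(cv2.resize(frame_bgr, (256, 256)),
                       cv2.COLOR_BGR2RGB).astype(np.float32)
    crop = rgb[14:241, 14:241] - mean_image               # mean_image: 227x227x3 float32
    return torch.from_numpy(crop).permute(2, 0, 1)        # (3, 227, 227)

def is_riot_video(model, video_path, library_codes, mean_image,
                  radius=2, min_similar_frames=3):
    similar = 0
    for _, frame in select_key_frames(video_path):
        code = compute_hash(model, frame_to_tensor(frame, mean_image))
        if any(is_similar(code.numpy(), lib, radius) for lib in library_codes):
            similar += 1
            if similar >= min_similar_frames:             # 3 or more similar key frames
                return True
    return False
```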
The method provided by the embodiment of the application determines the frames similar to the key frames of the video to be detected by matching the hash codes of those key frames against the hash codes of the key frames of riot and terrorist videos, and determines whether the video to be detected is a riot and terrorist video according to the number of its key frames that are similar to key frames in the video database. Extracting video key frames by shot segmentation achieves a good balance between the precision and the speed of shot detection; comparing the hash codes of the key frames with the pre-stored hash codes makes it possible to quickly judge whether the video to be detected contains riot and terrorist content; the pre-stored hash codes occupy little space and are fast to retrieve; and the hash network model obtains the hash code of a key frame accurately and quickly. Therefore, the method provided by the invention can quickly and accurately identify riot and terrorist videos.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (7)

1. A rapid riot and terrorist video identification method based on comparison is characterized by comprising the following steps:
carrying out shot segmentation on a video to be detected for carrying out riot and terrorist identification so as to select a key frame of the video to be detected;
carrying out hash code operation on each key frame of the video to be detected by utilizing a pre-constructed riot and terrorist video identification model to obtain the hash code of each key frame; the riot and terrorist video identification model is constructed based on a Hash network, the input of the model is a video frame, and the output of the model is a Hash code of the input video frame;
comparing the hash code of each key frame with the hash code of the video frame of the prestored violence-terrorist video respectively to determine the video frame similar to each key frame;
counting the number of similar frames, and if the number of video frames similar to each key frame exceeds a set threshold, determining that the video to be detected is a riot video;
the step of comparing the hash code of each key frame with the hash codes of the video frames of the pre-stored riot video respectively to determine the video frames similar to each key frame includes:
comparing the hash code of each key frame with the hash codes of the video frames of the violence and terrorism videos in the video library respectively;
calculating the Hamming distance between the hash code of the key frame and the hash code of the video frame;
confirming a key frame and a video frame whose Hamming distance is within a set radius as similar frames;
the training method of the riot and terrorist video identification model comprises the following steps:
classifying preset training sample pictures into positive sample data and negative sample data; wherein the positive sample data are pairs of riot and terrorist pictures, and the negative sample data are pairs consisting of one riot and terrorist picture and one non-riot picture;
adjusting the size of the training sample picture, randomly intercepting a region with a set size from the adjusted training sample picture, and carrying out sample mean processing;
training the processed picture by using the initial riot and terrorist video identification model to obtain a riot and terrorist video identification model based on a Hash network;
the loss function for training the riot and terrorist video recognition model is:

L_r = Σ_{i=1}^{N} [ (1/2) y_i ||b_{i,1} - b_{i,2}||_2^2 + (1/2)(1 - y_i) max(m - ||b_{i,1} - b_{i,2}||_2^2, 0) + α ( || |b_{i,1}| - 1 ||_1 + || |b_{i,2}| - 1 ||_1 ) ]

s.t. b_{i,j} ∈ {-1, +1}^k, i ∈ {1, ..., N}, j ∈ {1, 2}

wherein y_i indicates whether the i-th sample pair is similar, i.e., y_i = 1 means that the two samples in the i-th sample pair are similar, and vice versa; ||b_{i,1} - b_{i,2}||_2^2 is the Euclidean distance between the two sample binary codes in the i-th sample pair; || |b_{i,1}| - 1 ||_1 and || |b_{i,2}| - 1 ||_1 are the Manhattan distances between the two sample binary codes in the i-th sample pair and the all-ones vector; L_r is the loss function; m is a margin threshold parameter, where m > 0; α is a scaling factor; b_{i,1} and b_{i,2} are the hash codes of sample 1 and sample 2 in the i-th sample pair; N is the total number of training sample pairs; k is the dimension of the hash code; and b_{i,j} is the hash code of sample j in the i-th sample pair.
2. The contrast-based rapid riot-terrorist video identification method according to claim 1, wherein "shot segmentation is performed on a video to be detected for riot-terrorist identification to select key frames of the video to be detected" comprises:
extracting a histogram of each frame of the video to be detected, and performing difference comparison on the histograms of adjacent video frames to determine a shot boundary of the video to be detected;
and selecting a start frame and/or an end frame of each shot of the video to be detected as a key frame according to the determined shot boundary.
3. The contrast-based fast riot-terrorist video recognition method according to claim 1, wherein the network structure of the initial riot-terrorist video recognition model comprises an input layer, a convolutional layer and a fully connected layer, wherein the first layer is the input layer, the second layer to the sixth layer are convolutional layers, and the seventh layer to the ninth layer are fully connected layers.
4. The contrast-based rapid riot-terrorist video recognition method according to claim 3, wherein, in training the riot-terrorist video recognition model, the training sample pictures processed by sample-mean subtraction are input into the input layer.
5. The method as claimed in claim 3, wherein the convolutional layer receives the output of the previous layer, applies its convolution processing, and outputs the result after activation by the activation function of that layer; and the fully connected layer receives the output of the previous layer, applies its convolution processing, and outputs the result after activation by its activation function.
6. The contrast-based fast riot-terrorist video recognition method according to claim 5, wherein the activation functions of the second layer to the eighth layer of the network structure of the initial riot-terrorist video recognition model are:

ReLU(x) = max(0, x)

where ReLU(x) is the activation function and x is the convolved output of the layer.
7. The contrast-based fast riot-terrorist video recognition method of claim 5, wherein the activation function of the ninth layer of the network structure of the initial riot-terrorist video recognition model is given by a formula not reproduced in this text (original formula image FDA0002714876420000032); it is the result obtained by taking the partial derivative with respect to b_{i,j}, where b_{i,j} is the hash code of sample j in the i-th sample pair, i ∈ {1, ..., N}, j ∈ {1, 2}.
CN201810366397.4A 2018-04-23 2018-04-23 Rapid riot and terrorist video identification method based on comparison Active CN108734106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810366397.4A CN108734106B (en) 2018-04-23 2018-04-23 Rapid riot and terrorist video identification method based on comparison

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810366397.4A CN108734106B (en) 2018-04-23 2018-04-23 Rapid riot and terrorist video identification method based on comparison

Publications (2)

Publication Number Publication Date
CN108734106A CN108734106A (en) 2018-11-02
CN108734106B 2021-01-05

Family

ID=63939718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810366397.4A Active CN108734106B (en) 2018-04-23 2018-04-23 Rapid riot and terrorist video identification method based on comparison

Country Status (1)

Country Link
CN (1) CN108734106B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918537B (en) * 2019-01-18 2021-05-11 杭州电子科技大学 HBase-based rapid retrieval method for ship monitoring video content
CN109785214A (en) * 2019-03-01 2019-05-21 宝能汽车有限公司 Safety alarming method and device based on car networking
CN110796182A (en) * 2019-10-15 2020-02-14 西安网算数据科技有限公司 Bill classification method and system for small amount of samples
CN111078941B (en) * 2019-12-18 2022-10-28 福州大学 Similar video retrieval system based on frame correlation coefficient and perceptual hash
CN112395457B (en) * 2020-12-11 2021-06-22 中国搜索信息科技股份有限公司 Video to-be-retrieved positioning method applied to video copyright protection
CN112861976B (en) * 2021-02-11 2024-01-12 温州大学 Sensitive image identification method based on twin graph convolution hash network
CN114724074B (en) * 2022-06-01 2022-09-09 共道网络科技有限公司 Method and device for detecting risk video

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744973A (en) * 2014-01-11 2014-04-23 西安电子科技大学 Video copy detection method based on multi-feature Hash
CN105718861A (en) * 2016-01-15 2016-06-29 北京市博汇科技股份有限公司 Method and device for identifying video streaming data category
WO2018017566A1 (en) * 2016-07-18 2018-01-25 The Regents Of The University Of Michigan Hash-chain based sender identification scheme

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744973A (en) * 2014-01-11 2014-04-23 西安电子科技大学 Video copy detection method based on multi-feature Hash
CN105718861A (en) * 2016-01-15 2016-06-29 北京市博汇科技股份有限公司 Method and device for identifying video streaming data category
WO2018017566A1 (en) * 2016-07-18 2018-01-25 The Regents Of The University Of Michigan Hash-chain based sender identification scheme

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Image Copy Detection Based on Convolutional Neural Networks";Jing Zhang等;《CCPR 2016: Pattern Recognition》;20161022;第111-121页 *
"基于卷积神经网络的鸟类视频图像检索研究";张惠凡等;《科研信息化技术与应用》;20171031;第8卷(第5期);第50-57页 *
"基于深度卷积神经网络和二进制哈希学习的图像检索方法";彭天强等;《电子与信息学报》;20160831;第88卷(第8期);第2068-2076页 *
"有害音视频一致性检测方法的研究与实现";王媛媛等;《中国人民公安大学学报(自然科学版)》;20160930(第3期);第83-88页 *

Also Published As

Publication number Publication date
CN108734106A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
CN108734106B (en) Rapid riot and terrorist video identification method based on comparison
CN110569721B (en) Recognition model training method, image recognition method, device, equipment and medium
US9928407B2 (en) Method, system and computer program for identification and sharing of digital images with face signatures
WO2022134584A1 (en) Real estate picture verification method and apparatus, computer device and storage medium
CN108985190B (en) Target identification method and device, electronic equipment and storage medium
EP2659400A1 (en) Method, apparatus, and computer program product for image clustering
WO2021237570A1 (en) Image auditing method and apparatus, device, and storage medium
CN111651636A (en) Video similar segment searching method and device
CN112668320A (en) Model training method and device based on word embedding, electronic equipment and storage medium
CN114663871A (en) Image recognition method, training method, device, system and storage medium
CN113111880A (en) Certificate image correction method and device, electronic equipment and storage medium
CN110895811B (en) Image tampering detection method and device
CN111177450B (en) Image retrieval cloud identification method and system and computer readable storage medium
CN108446737B (en) Method and device for identifying objects
CN112686847B (en) Identification card image shooting quality evaluation method and device, computer equipment and medium
US11087121B2 (en) High accuracy and volume facial recognition on mobile platforms
CN113780239A (en) Iris recognition method, iris recognition device, electronic equipment and computer readable medium
CN112270305A (en) Card image recognition method and device and electronic equipment
CN111966851A (en) Image recognition method and system based on small number of samples
CN112036501A (en) Image similarity detection method based on convolutional neural network and related equipment thereof
CN111079704A (en) Face recognition method and device based on quantum computation
CN111985483B (en) Method and device for detecting screen shot file picture and storage medium
CN117333926B (en) Picture aggregation method and device, electronic equipment and readable storage medium
CN112565601B (en) Image processing method, image processing device, mobile terminal and storage medium
CN111611417B (en) Image de-duplication method, device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant