CN115761599A - Video anomaly detection method and system

Publication number
CN115761599A
Authority
CN
China
Prior art keywords
segment
score
video
abnormal
feature data
Prior art date
Legal status
Pending
Application number
CN202211648403.8A
Other languages
Chinese (zh)
Inventor
余烨
程勃
蔡文
孙旭
路强
Current Assignee
Intelligent Manufacturing Institute of Hefei University of Technology
Original Assignee
Intelligent Manufacturing Institute of Hefei University of Technology
Priority date: 2022-12-21
Filing date: 2022-12-21
Application filed by Intelligent Manufacturing Institute of Hefei University of Technology
Priority to CN202211648403.8A


Abstract

The invention belongs to the technical fields of computer vision, video anomaly detection and video anomaly detection model design, and relates in particular to a video anomaly detection method comprising the following steps: acquiring segment feature data of a video to be detected; and inputting the segment feature data into a video anomaly detection model to obtain a detection result, wherein the video anomaly detection model comprises an anomaly feature enhancement model, an anomaly score calculation model, a classification-guided anomaly localization model and an anomaly score optimization model. The anomaly feature enhancement model enhances the temporal information and the motion information of the feature data of the video to be detected, making the feature data better suited to the anomaly detection task; the classification-guided anomaly localization model mines the parameters produced while classifying the video as normal or abnormal and uses them to optimize the anomaly scores produced by the anomaly score calculation model, yielding a more accurate detection result. In addition, the classification-guided anomaly localization model is convenient to use and can be loaded into memory as a plug-and-play module.

Description

Video anomaly detection method and system
Technical Field
The invention belongs to the technical field of computer vision, video anomaly detection and video anomaly detection model design, and particularly relates to a video anomaly detection method and system.
Background
Surveillance cameras are ubiquitous in cities, and video surveillance, as a key part of an urban security system, records everything that happens in the city. Abnormal-event detection based on surveillance video can automatically detect and judge abnormal events in the video, helping public security and traffic police personnel enforce the law, and is therefore an important means of improving the level of public safety.
Deep-learning-based video anomaly detection algorithms use the strong learning ability of deep learning to learn the behavior patterns and scene information of normal and abnormal videos, and judge abnormal phenomena in videos on that basis. Strongly supervised video anomaly detection methods require the data set to provide segment-level labels for the training videos, and the labeling cost of such an arrangement is enormous. Weakly supervised methods, by contrast, only require video-level labels; although their detection performance is lower than that of strongly supervised anomaly detection methods, they greatly reduce the cost of video labeling. Weakly supervised anomaly detection methods therefore have stronger applicability in real scenes.
Most existing weakly supervised video anomaly detection methods are developed from multiple-instance learning: a video is divided into a number of non-overlapping video segments, each segment is treated as an instance, all instances of a video form a bag, and bags are divided into positive bags and negative bags according to whether they contain an abnormal instance (a positive bag contains at least one abnormal instance, while a negative bag contains only normal instances). The instances in each bag are then ranked by anomaly score, several of the highest-scoring instances in the positive and negative bags are selected, and a loss function is designed to enlarge the difference between the positive and negative bags.
At present, weakly supervised video anomaly detection methods still have some shortcomings:
On the one hand, existing models use a three-dimensional feature extraction model pre-trained on an action recognition data set, such as I3D or C3D, for feature extraction, and complete the anomaly detection task based on the extracted features. Although these models extract the spatio-temporal features of videos well, the extracted features are general-purpose video features; the particular characteristics required for anomaly detection are not considered, so there is room to optimize the extracted features.
On the other hand, because only video-level labels exist, there is no effective guidance at the video segment level. Moreover, in each training pass only a few segments of each video participate in the optimization of the model, while the features of the remaining large number of segments are used only for score ranking and screening, so feature utilization is very low. As a result, on a large data set the model cannot learn all abnormal behavior patterns, and the generated anomaly scores are inaccurate: when video anomaly detection is performed with such a model, the distribution of anomaly scores differs greatly from reality, and missed detections, false detections and inaccurate anomaly localization easily occur.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention aims to provide a video anomaly detection method capable of enhancing the general features extracted from a video and optimizing the anomaly detection result of the video.
To achieve the above and other related objects, the present invention provides a video anomaly detection method, including: acquiring segment characteristic data of a video to be detected; inputting the segment feature data into a video anomaly detection model to obtain a detection result, wherein the video anomaly detection model comprises: the abnormal characteristic enhancement model is used for enhancing the time sequence information and the action information of the fragment characteristic data to obtain fragment enhanced characteristic data; an abnormal score calculation model, which calculates an initial segment abnormal score according to the segment enhanced feature data; calculating to obtain a segment abnormal probability score and a segment positioning guide score according to the segment enhanced feature data based on an abnormal positioning model of classification guide; an abnormal score optimization model, which is used for calculating the final abnormal score of the video to be detected according to the initial segment abnormal score and the segment abnormal probability score; optimizing the initial segment abnormal score through the segment positioning guide score to obtain a final segment abnormal score; and the final abnormal score and the final segment abnormal score of the video to be detected are the detection result.
According to a specific embodiment of the present invention, the abnormal feature enhancement model includes a long-distance time sequence feature enhancement module and a short-distance motion feature enhancement module, and the step of enhancing the time sequence information and the motion information of the segment feature data to obtain segment enhanced feature data includes: reducing the dimension of the segment feature data through a first fully-connected neural network and a second fully-connected neural network to obtain first dimension-reduced segment feature data and second dimension-reduced segment feature data with the same dimension; the first fully-connected neural network and the second fully-connected neural network have the same model structure but different parameter settings; inputting the first dimension reduction fragment feature data and the second dimension reduction fragment feature data into a long-distance time sequence feature model to obtain fragment full-time feature data; inputting the fragment full-time feature data into a short-distance action feature model to obtain fragment action attention feature data; and integrating the segment characteristic data and the segment action attention characteristic data to obtain segment enhanced characteristic data.
According to a specific embodiment of the present invention, the step of inputting the first dimension reduction segment feature data and the second dimension reduction segment feature data into the long-distance time sequence feature model to obtain segment full-time feature data includes: inputting the first dimension reduction feature data into an LSTM neural network of a long-distance time sequence feature model to obtain fusion feature data; inputting the second dimension reduction feature data into a non-local neural network of the long-distance time sequence feature model, and performing residual error processing on an output result to obtain global feature data; the non-local neural network is formed by four one-dimensional convolutions, and the convolution characteristics of the four one-dimensional convolutions are completely consistent; splicing the fusion characteristic data and the global characteristic data to obtain segment full-time characteristic data; wherein the size of the segment full temporal feature data is consistent with the size of the segment feature data.
According to a specific embodiment of the present invention, the step of inputting the segment full-time feature data into a short-distance motion feature model to obtain segment motion attention feature data includes: carrying out average processing on the segment full-time characteristic data on a time dimension; inputting the segment full-time characteristic data subjected to mean value processing into a first fully-connected neural network of the short-distance action characteristic model for dimension reduction processing, and inputting the segment full-time characteristic data subjected to dimension reduction into a second fully-connected neural network of the short-distance action characteristic model for dimension increasing processing to obtain action attention weight; broadcast multiplying the action attention weight and the fragment full-time feature data, and adding a multiplication result and the fragment feature data to obtain the fragment action attention feature data; wherein the segment action attention feature data is consistent with a size of the segment feature data.
According to an embodiment of the present invention, the step of calculating an initial segment anomaly score according to the segment enhanced feature data includes: inputting the fragment enhancement feature data into a first fully-connected neural network of the abnormal score calculation model for first dimension reduction, and calculating the fragment enhancement feature data subjected to dimension reduction by using a Relu function and a Dropout function to obtain first intermediate data; inputting the first intermediate data into a second fully-connected neural network of the abnormal score calculation model for second dimensionality reduction, and calculating the first intermediate data subjected to dimensionality reduction by using a Relu function and a Dropout function to obtain second intermediate data; and inputting the second intermediate data into a third fully-connected neural network of the abnormal score calculation model for third dimension reduction, and calculating the second intermediate data subjected to dimension reduction by using a Sigmoid function to obtain the initial segment abnormal score with the dimension of 1.
According to an embodiment of the present invention, the step of calculating the segment anomaly probability score and the segment positioning guidance score according to the segment enhanced feature data includes: transposing the segment enhanced feature data to obtain transposed segment feature data; inputting the transposed segment feature data into a first one-dimensional convolution of the abnormal positioning model based on the classification guidance for carrying out first dimension reduction, and calculating the transposed segment feature data subjected to dimension reduction by using a Relu function to obtain third intermediate data; inputting the third intermediate data into a second one-dimensional convolution of the abnormal positioning model based on the classification guidance for second dimension reduction, and calculating the third intermediate data subjected to the dimension reduction by using a Relu function to obtain fourth intermediate data; inputting the fourth intermediate data into a third one-dimensional convolution of the abnormal positioning model based on the classification guidance for third dimension reduction to obtain a sharing parameter with the dimension of 1; calculating the sharing parameters by using a Softmax function to obtain the segment abnormal probability score; and calculating the sharing parameters by using a Sigmoid function to obtain the segment positioning guidance score.
According to a specific embodiment of the present invention, the step of calculating the final abnormality score of the video to be detected according to the initial segment abnormality score and the segment abnormality probability score includes: and calculating the product of the segment anomaly probability score and the initial segment anomaly score, and performing linear accumulation to obtain the final anomaly score of the video to be detected.
According to an embodiment of the present invention, the step of optimizing the initial segment abnormality score by the segment positioning guidance score to obtain a final segment abnormality score includes: carrying out normalization processing according to the average value of the fragment positioning guidance scores to obtain optimized scores; and carrying out Hadamard product calculation on the initial segment abnormal score and the optimized score, and processing a calculation result through a Sigmoid function to obtain the final segment abnormal score.
According to an embodiment of the present invention, the training step of the video anomaly detection model includes: acquiring a video level label of a training video data set; extracting the characteristics of each training video to obtain the segment characteristics of the training video; sending the segments into the video anomaly detection model to obtain video anomaly scores and video segment anomaly scores of the training video; calculating weak supervision multi-instance loss and time smoothness loss according to the video segment anomaly score; calculating two-classification cross entropy loss according to the video anomaly score; taking the sum of the weakly supervised multi-instance penalty, the temporal smoothness penalty, and the two-class cross-entropy penalty as a final penalty; and performing back propagation on the video anomaly detection model according to the final loss.
A video anomaly detection system, comprising: the video characteristic extraction module is used for acquiring segment characteristic data of a video to be detected; the video anomaly detection module is used for acquiring the detection result of the segment characteristic data; wherein, the video anomaly detection module comprises: the abnormal characteristic enhancing module is used for enhancing the time sequence information and the action information of the fragment characteristic data to obtain fragment enhanced characteristic data; the abnormal score calculation module is used for calculating an initial segment abnormal score according to the segment enhanced feature data; an abnormal positioning module based on classification guidance calculates a segment abnormal probability score and a segment positioning guidance score according to the segment enhanced feature data; the abnormal score optimizing module is used for calculating to obtain a final abnormal score of the video to be detected according to the initial segment abnormal score and the segment abnormal probability score; optimizing the initial segment abnormality score through the segment positioning guide score to obtain a final segment abnormality score; and the final abnormal score and the final segment abnormal score of the video to be detected are the detection result.
The technical effects of the method are as follows: the anomaly feature enhancement model enhances the temporal information and the motion information of the feature data of the video to be detected, making the data better suited to the anomaly detection task and improving the utilization of feature data in the video. Meanwhile, the classification-guided anomaly localization model exploits the localization ability of parameters generated during classification to optimize the detection result. Because the classification-guided anomaly localization model can be used together with any anomaly detection model, it improves detection performance while remaining convenient to use as a plug-and-play module loaded in memory. The video anomaly detection model of the embodiments of the present application is further trained on three loss functions, which further improves model accuracy.
Drawings
Fig. 1 is a block diagram illustrating a video anomaly detection method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart illustrating a video anomaly detection method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an embodiment of an abnormal feature enhancement model provided by the present invention;
FIG. 4 is a flowchart illustrating an embodiment of an anomaly score calculation model and an anomaly localization model based on classification guidance according to the present invention;
FIG. 5 is a flowchart illustrating an anomaly score optimization model according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart diagram illustrating a video anomaly detection system according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating an embodiment of a computer apparatus;
fig. 8 is a schematic structural diagram of another embodiment of a computer device provided in the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than being drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of each component in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.
In the following description, numerous details are set forth to provide a more thorough explanation of embodiments of the present invention, however, it will be apparent to one skilled in the art that embodiments of the present invention may be practiced without these specific details, and in other embodiments, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring embodiments of the present invention.
First, it should be noted that the present application provides a weakly supervised video anomaly detection model: the video is scored by a weakly supervised anomaly score calculation model, and two further models are added, namely an anomaly feature enhancement model and a classification-guided anomaly localization model.
The anomaly feature enhancement model processes the feature data of the video to be detected with long-distance temporal feature enhancement and short-distance action feature enhancement, so that the feature data extracted from the video integrate features over the preceding and global time range to enhance the temporal features, and integrate features of adjacent time steps to enhance the action features. The original feature data are thereby re-expressed as feature data better suited to the anomaly detection task, and the utilization of feature data in the video is improved.
The classification-guided anomaly localization model is a plug-and-play branch module applicable to any weakly supervised anomaly detection model. It optimizes the anomaly scores by obtaining an anomaly probability score and a localization guidance score for the feature data of the video to be detected. Meanwhile, a classification loss function is used to apply partially supervised training to the video anomaly detection model, providing guidance and optimizing the model.
Example 1
Referring to fig. 1-2, a video anomaly detection method includes:
and S10, acquiring segment characteristic data of the video to be detected.
A video is divided into n non-overlapping parts, each of which is called a segment. Because the position of the anomaly within the video must be located, the anomaly score of each segment needs to be known: segments with high anomaly scores are where the anomaly occurs. In weakly supervised video anomaly detection, only an anomaly label for the entire video is available, not a label for each video segment.
A pre-trained 3D convolutional neural network is used as the feature extractor, and the feature data of each segment of the video to be detected are obtained first. The extracted feature data of each segment would normally be input directly into an anomaly score calculation model to obtain the anomaly score of each segment of the video to be detected. However, because action recognition and anomaly detection are different tasks, such feature data are suboptimal for the anomaly detection task. It is therefore necessary to enhance the original feature data so that they better match the characteristics of anomaly detection.
Features of the video to be detected are extracted with the pre-trained 3D convolutional neural network to obtain a segment feature matrix f_1 of size [t, d], where t is the number of segments in the video to be detected and d is the feature dimension of each segment.
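For illustration, a minimal sketch of this feature extraction step is given below. The application names I3D and C3D as typical pre-trained 3D backbones but does not fix a specific extractor; the sketch therefore uses torchvision's r3d_18 purely as a stand-in, and the function name and preprocessing assumptions are illustrative only.

```python
import torch
from torchvision.models.video import r3d_18

def extract_segment_features(frames: torch.Tensor, t: int) -> torch.Tensor:
    """Split a video into t non-overlapping segments and pool one feature
    vector per segment with a pre-trained 3D CNN (stand-in for I3D/C3D).

    frames: [C, T, H, W] float tensor, assumed already resized/normalised
            as the chosen backbone expects.
    Returns the segment feature matrix f1 of size [t, d] (d = 512 for r3d_18).
    """
    backbone = r3d_18(weights="DEFAULT")
    backbone.fc = torch.nn.Identity()      # keep the pooled d-dim feature
    backbone.eval()

    seg_len = frames.shape[1] // t
    feats = []
    with torch.no_grad():
        for i in range(t):
            clip = frames[:, i * seg_len:(i + 1) * seg_len]   # [C, T/t, H, W]
            feats.append(backbone(clip.unsqueeze(0)).squeeze(0))
    return torch.stack(feats)              # f1: [t, d]
```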
And S20, inputting the segment feature data into a video abnormity detection model to obtain a detection result.
Wherein the video anomaly detection model comprises: the system comprises an abnormal characteristic enhancement model, an abnormal score calculation model, an abnormal positioning model based on classification guidance and an abnormal score optimization model.
Specifically, in this application the anomaly feature enhancement model is used to enhance the temporal information and the action information of the segment feature data to obtain segment enhanced feature data. As shown in FIG. 3, f_1 is first reduced in dimension by a first fully connected neural network Fc1 and a second fully connected neural network Fc2. To prevent the feature data from learning a single representation, the model parameters of Fc1 and Fc2 are set differently, yielding first dimension-reduced feature data f_2 and second dimension-reduced feature data f_3 with the same dimensions, both of size [t, d/2].
The anomaly feature enhancement model is formed by connecting a long-distance temporal feature enhancement module and a short-distance action feature enhancement module in series. The long-distance temporal feature enhancement module consists of a long short-term memory network and a non-local neural network in parallel, which enhance the temporal features of the original feature data over a long time range along the time dimension; the short-distance action feature enhancement module generates an action attention weight with two fully connected neural networks and uses this weight to enhance the action features of the original features over a short time range.
In a specific embodiment, since the video information is represented as consecutive video segments that remain continuous in time, these features can be regarded as a sequence along the time dimension: earlier elements of the sequence influence later elements, and later elements are continuations of earlier ones.
As shown in FIG. 3, f_2 is expanded along the segment dimension of size t into {x_1, x_2, ..., x_t}, and these are input into the long short-term memory network in sequence so that the data corresponding to the individual segments of the matrix become correlated; the input dimension and the output dimension of the long short-term memory network are both set to d/2. The long short-term memory network produces a set of segment data {y_1, y_2, ..., y_t} that are correlated with one another, which are integrated to obtain the fused feature data f_{2,1}.
The data in f_3 are input into the one-dimensional convolutions of a non-local neural network. Taking advantage of non-local operations, the features can capture global temporal information by computing the interaction between any two time positions, regardless of their temporal distance. The non-local neural network is formed by four one-dimensional convolutions; every one-dimensional convolution has a kernel size of 1, a stride of 1, a padding of 0 and d/2 output channels.
First, f_3 is processed by the first three one-dimensional convolutions to obtain three different tensors f_{3,1}, f_{3,2} and f_{3,3}, each of size [t, d/2]. f_{3,1} [t, d/2] is matrix-multiplied with the transposed f_{3,2} [d/2, t] to generate a global temporal attention weight f_t of size [t, t]; the global temporal attention weight f_t is then applied to f_{3,3} by matrix multiplication, giving the globally attended data f_{3,4} [t, d/2]. The fourth one-dimensional convolution post-processes f_{3,4} to obtain f_{3,5} [t, d/2]. A residual operation integrates f_3 and f_{3,5} into the global feature data f_{3,6}. Finally, f_{2,1} and f_{3,6} are concatenated along the feature dimension to obtain the segment full-temporal feature data f_4 [t, d].
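A minimal PyTorch sketch of the long-distance temporal feature enhancement described above (Fc1/Fc2 dimension reduction, the LSTM branch, the non-local branch with its residual connection, and the final concatenation) is given below; the class and variable names are assumptions, and details not stated above, such as weight initialisation, are left to framework defaults rather than taken from this application.

```python
import torch
import torch.nn as nn

class LongRangeTemporalEnhancement(nn.Module):
    """LSTM branch + non-local branch with a residual connection (sketch)."""

    def __init__(self, d: int):
        super().__init__()
        self.fc1 = nn.Linear(d, d // 2)   # produces f2
        self.fc2 = nn.Linear(d, d // 2)   # produces f3 (same structure, own weights)
        self.lstm = nn.LSTM(input_size=d // 2, hidden_size=d // 2, batch_first=True)
        # Non-local branch: four 1-D convolutions, kernel 1, stride 1, padding 0.
        self.theta = nn.Conv1d(d // 2, d // 2, kernel_size=1)
        self.phi   = nn.Conv1d(d // 2, d // 2, kernel_size=1)
        self.g     = nn.Conv1d(d // 2, d // 2, kernel_size=1)
        self.out   = nn.Conv1d(d // 2, d // 2, kernel_size=1)

    def forward(self, f1: torch.Tensor) -> torch.Tensor:   # f1: [t, d]
        f2, f3 = self.fc1(f1), self.fc2(f1)                 # both [t, d/2]
        f21, _ = self.lstm(f2.unsqueeze(0))                 # sequence over the t segments
        f21 = f21.squeeze(0)                                # fused feature data f2,1

        x = f3.t().unsqueeze(0)                             # [1, d/2, t] for Conv1d
        f31 = self.theta(x).squeeze(0).t()                  # [t, d/2]
        f32 = self.phi(x).squeeze(0).t()                    # [t, d/2]
        f33 = self.g(x).squeeze(0).t()                      # [t, d/2]
        ft = f31 @ f32.t()                                  # global temporal attention [t, t]
        f34 = ft @ f33                                      # globally attended data [t, d/2]
        f35 = self.out(f34.t().unsqueeze(0)).squeeze(0).t() # fourth one-dimensional convolution
        f36 = f3 + f35                                      # residual -> global feature data

        return torch.cat([f21, f36], dim=1)                 # f4: [t, d]
```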
Through the concatenation the data are restored to their original size, while attention is paid to the data at more scales along the time dimension. The acquired segment full-temporal feature data f_4 aggregate rich information over the long-range time dimension; however, for video data the short-range temporal features, i.e. the motion information, are also important. The motion features are embodied in the dimension d of a video segment, i.e. dimension d contains important motion information, so a motion attention weight is used to capture the key action information in the video.
In one embodiment, f_4 is averaged over the time dimension, resizing the data to [1, d]. The result is input into the first fully connected neural network Fc-1 and the second fully connected neural network Fc-2 of the short-distance action feature enhancement module: Fc-1 performs dimension reduction, reducing the processed data to [1, d/4], which is then sent into Fc-2 for dimension raising, finally generating the action attention weight f_{4,d} [1, d].
The action attention weight f_{4,d} is used to optimize the segment full-temporal feature data f_4; because their dimensions differ, the two are broadcast-multiplied, i.e. f_{4,d} is copied t times and stacked along its first dimension to form a variant f_{4,d}' [t, d] of the same size as f_4, and the two are then multiplied element by element. This step is performed automatically in PyTorch using the broadcasting properties of tensors.
In this way, the action attention weight is applied to the feature data of every segment, giving the motion-attended data f_5 [t, d].
Finally, to prevent the destruction of useful original feature information, f_1 and f_5 are integrated to obtain the segment action attention feature data f_6 [t, d].
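The short-distance action feature enhancement can be sketched in the same way. The sketch below assumes the module receives both f_4 and the original features f_1, and applies no activation to the attention weight because none is specified above; names are illustrative only.

```python
import torch
import torch.nn as nn

class ShortRangeActionAttention(nn.Module):
    """Action attention over the feature dimension d (sketch).

    Averages f4 over time, generates a [1, d] attention weight with two fully
    connected layers, broadcast-multiplies it onto f4, and adds the original
    features f1 as a residual.
    """

    def __init__(self, d: int):
        super().__init__()
        self.fc_down = nn.Linear(d, d // 4)   # Fc-1: dimension reduction
        self.fc_up = nn.Linear(d // 4, d)     # Fc-2: dimension raising

    def forward(self, f4: torch.Tensor, f1: torch.Tensor) -> torch.Tensor:
        # f4, f1: [t, d]
        pooled = f4.mean(dim=0, keepdim=True)      # [1, d] average over the time dimension
        f4d = self.fc_up(self.fc_down(pooled))     # action attention weight f4,d: [1, d]
        f5 = f4d * f4                              # broadcast multiply -> [t, d]
        f6 = f5 + f1                               # keep the original feature information
        return f6                                  # segment action attention feature data
```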
It should be noted that in conventional weakly supervised anomaly detection, the feature data extracted and processed from a video are sent to an anomaly score calculation model to compute segment-level anomaly scores. The model is then optimized with a multiple-instance loss, i.e. the anomaly scores of one or several segments with the highest scores are taken from the normal video and the abnormal video, and the model is optimized by enlarging the score difference between them. Under this arrangement, only a few video segments of each video are used to optimize the model: on the one hand feature utilization is low, and on the other hand, because the guidance information is insufficient, the generated anomaly scores leave a large room for optimization.
Therefore, an embodiment of the present application provides a classification-guided anomaly localization model for optimizing the initial segment anomaly scores calculated by the anomaly score calculation model: on the one hand it improves feature utilization by generating a video-level anomaly probability score and applying constrained optimization with a classification loss, and on the other hand it optimizes the segment scores of the video by generating a localization guidance score.
Specifically, in this application the anomaly score calculation model computes segment-level anomaly scores from the segment feature data of the video to be detected, as in a conventional weakly supervised anomaly score calculation model. As shown in FIG. 4, the anomaly score calculation model consists of three fully connected neural networks connected in series, which reduce the dimension of f_6, lowering the feature dimension of the data successively to d/4, d/16 and 1 to obtain the initial segment anomaly scores. The segment enhanced feature data are input into the first fully connected neural network of the anomaly score calculation model for the first dimension reduction, and the reduced data are processed with a ReLU function and a Dropout function with a drop rate of 0.7 to obtain first intermediate data of dimension d/4; the first intermediate data are input into the second fully connected neural network for the second dimension reduction and processed with a ReLU function and a Dropout function with a drop rate of 0.7 to obtain second intermediate data of dimension d/16; finally, the second intermediate data are input into the third fully connected neural network for the third dimension reduction to obtain a parameter p_1, and a Sigmoid function is applied to obtain the initial segment anomaly scores S_1 [t, 1] of dimension 1.
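A minimal sketch of this anomaly score calculation model (three fully connected layers reducing d to d/4, d/16 and 1, with ReLU and Dropout at drop rate 0.7 after the first two layers and a Sigmoid at the end) follows; the class name is an assumption.

```python
import torch.nn as nn

class AnomalyScoreHead(nn.Module):
    """Three fully connected layers: d -> d/4 -> d/16 -> 1 (sketch)."""

    def __init__(self, d: int, drop: float = 0.7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, d // 4), nn.ReLU(), nn.Dropout(drop),
            nn.Linear(d // 4, d // 16), nn.ReLU(), nn.Dropout(drop),
            nn.Linear(d // 16, 1), nn.Sigmoid(),
        )

    def forward(self, f6):          # f6: [t, d]
        return self.net(f6)         # initial segment anomaly scores S1: [t, 1]
```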
In the prior art, this score is regarded as the final anomaly score of the video segment. It is a value between 0 and 1 whose magnitude reflects the degree of abnormality of the video, and the video segments can be classified by combining it with a set threshold. The score is used to compute the multiple-instance loss, and through optimization the model continuously improves its ability to generate anomaly scores. The present application further processes this score through the classification-guided localization model in order to optimize it.
Specifically, in this application, as shown in fig. 4, the classification-guided anomaly localization model consists of three one-dimensional convolutions Conv1, Conv2 and Conv3 of a convolutional neural network connected in series. f_6 is transposed into f_7 [d, t] and input into the one-dimensional convolutions in sequence; each one-dimensional convolution convolves the data along the time dimension and then reduces its dimension, the feature dimension likewise being reduced successively to d/4, d/16 and 1. The kernel size of the first one-dimensional convolution is set to 3 with stride 1 and padding 1; the kernel size of the second and third convolutions is set to 1 with stride 1 and padding 0. The transposed segment feature data are input into the first one-dimensional convolution of the classification-guided anomaly localization model for the first dimension reduction, and the reduced data are processed with a ReLU function to obtain third intermediate data; the third intermediate data are input into the second one-dimensional convolution for the second dimension reduction and processed with a ReLU function to obtain fourth intermediate data; finally, the fourth intermediate data are input into the third one-dimensional convolution for the third dimension reduction to obtain a shared parameter p_2 of dimension 1. The shared parameter p_2 is then processed with a Softmax function and a Sigmoid function to obtain the segment anomaly probability scores S_2 [t, 1] and the segment localization guidance scores S_3 [t, 1]. Owing to the nature of the Softmax function, the anomaly probability scores S_2 of all segments of the video to be detected sum to 1; the process of generating the anomaly probability scores can provide guidance for anomaly localization, i.e. segments with a high anomaly probability are more likely to be abnormal segments and segments with a low anomaly probability are more likely to be normal segments. The Sigmoid function is therefore used to process the shared parameter p_2 to generate the segment localization guidance scores S_3.
The classification-guided anomaly localization model is a combination of one-dimensional convolutions. Compared with fully connected layers, the one-dimensional convolutions not only reduce the amount of computation but also jointly consider, along the time dimension, the features within a window of time steps equal to the convolution kernel size, generating scores better suited to the characteristics of the task.
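A minimal sketch of this classification-guided localization branch (three one-dimensional convolutions over the transposed features, with Softmax and Sigmoid heads sharing the parameter p_2) is given below; the class name is an assumption. Because the branch only consumes the enhanced features f_6 and returns two score vectors, it can be attached to any weakly supervised anomaly detector, which is what makes it plug-and-play.

```python
import torch
import torch.nn as nn

class ClassificationGuidedLocalization(nn.Module):
    """Plug-and-play branch: three 1-D convolutions over the time axis (sketch)."""

    def __init__(self, d: int):
        super().__init__()
        self.conv1 = nn.Conv1d(d, d // 4, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv1d(d // 4, d // 16, kernel_size=1, stride=1, padding=0)
        self.conv3 = nn.Conv1d(d // 16, 1, kernel_size=1, stride=1, padding=0)

    def forward(self, f6: torch.Tensor):
        f7 = f6.t().unsqueeze(0)                 # transpose to [1, d, t]
        h = torch.relu(self.conv1(f7))           # third intermediate data
        h = torch.relu(self.conv2(h))            # fourth intermediate data
        p2 = self.conv3(h).squeeze(0).t()        # shared parameter p2: [t, 1]
        s2 = torch.softmax(p2, dim=0)            # segment anomaly probability scores (sum to 1)
        s3 = torch.sigmoid(p2)                   # segment localization guidance scores
        return s2, s3
```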
Specifically, in the application, as shown in fig. 5, the detection result of the video to be detected is obtained through the following steps:
Because a segment's anomaly probability score represents the proportion that the segment contributes to the overall anomaly score of the video to be detected, the final anomaly score S_4 of the video to be detected is obtained by multiplying the segment anomaly probability scores by the initial segment anomaly scores and accumulating linearly:

S_4 = \sum_{i=1}^{t} s_1^i \cdot s_2^i

where s_1^i is the initial anomaly score of the i-th segment of the video to be detected, s_2^i is the anomaly probability score of the i-th segment, and i ∈ [1, t].
Further, S_3 is composed of the scores s_3^i, i.e. S_3 = {s_3^1, s_3^2, ..., s_3^t}, where s_3^i is the localization guidance score of the i-th segment of the video to be detected. To strengthen the optimization effect of S_3, each s_3^i in S_3 is normalized to obtain \hat{s}_3^i:

\hat{s}_3^i = s_3^i / \bar{S}_3, where \bar{S}_3 = \frac{1}{t}\sum_{i=1}^{t} s_3^i

and the normalized scores are combined into \hat{S}_3, i.e. the optimization score.
The principle of this formulation is to update the scores by using the average of the localization guidance scores as a threshold: through the division, scores higher than the average are assigned a value greater than 1 and scores lower than the average a value less than 1, and these serve as the final optimization scores.
Finally, the Hadamard product of the optimization score and the initial segment anomaly score is computed to enlarge the score difference between the localized abnormal segments and the other, normal segments, thereby optimizing the initial segment anomaly scores. A Sigmoid activation function limits the result to between 0 and 1, giving the final segment anomaly scores S_5 [t, 1].
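The score fusion and optimization steps above can be sketched as follows, with S_1, S_2 and S_3 as [t, 1] tensors; the function name is an assumption.

```python
import torch

def fuse_and_optimize(s1: torch.Tensor, s2: torch.Tensor, s3: torch.Tensor):
    """Combine the three score vectors (each [t, 1]) as described above (sketch).

    Returns the video-level anomaly score S4 and the final segment scores S5.
    """
    s4 = (s1 * s2).sum()                # linear accumulation of s1^i * s2^i over segments
    s3_hat = s3 / s3.mean()             # normalise by the mean: >1 above average, <1 below
    s5 = torch.sigmoid(s1 * s3_hat)     # Hadamard product, then limit to (0, 1)
    return s4, s5
```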
Further, weakly supervised data sets provide only video-level labels, whereas the goal is to perform anomaly detection on video segments. The multiple-instance loss of previous weakly supervised models enlarges the score difference between the highest-scoring segments of the normal video and of the abnormal video so that the model generates segment anomaly scores more accurately; this indirect and imprecise form of supervision considerably limits the optimization of the model. Therefore, the final anomaly score and the final segment anomaly scores of a training video are computed and combined with the corresponding video label to design an additional classification loss, so as to make full use of the video-level label and provide a direct and accurate constraint for the model.
An embodiment of the present application also provides a training method for the video anomaly detection model, which includes: obtaining the segment anomaly scores S_x of a training video and its video label y, and calculating from them the multiple-instance loss L_mil, in which s_x^j denotes the final anomaly score of the j-th segment of the training video, the operation topk(S_x) takes the final anomaly scores of the k segments with the largest final anomaly scores in the training video, and k is computed from the number of segments t of the training video and a hyper-parameter α, with α taken as 3.
To add extra supervision information, a general temporal smoothness loss L_smooth is additionally used to increase the constraints; it constrains the anomaly scores of temporally adjacent segments to change smoothly.
At the same time, the video anomaly scores S_y of all training videos and the video label y_n corresponding to each training video are obtained to compute the two-class cross-entropy loss L_cls:

L_cls = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log s_y^n + (1 - y_n) \log(1 - s_y^n) \right]

where s_y^n is the video anomaly score of the n-th training video and N is the number of training videos.
Finally, the final loss L_sum is obtained by combining the multiple-instance loss, the temporal smoothness loss and the two-class cross-entropy loss:

L_sum = L_cls + L_mil + λ · L_smooth

where the hyper-parameter λ is 0.001.
The video anomaly detection model is back-propagated according to the final loss L_sum, thereby optimizing the video anomaly detection model.
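A sketch of the training losses is given below. The exact multiple-instance and temporal smoothness formulas appear only as figures in the original text, so the sketch assumes the common weakly supervised forms: a binary cross-entropy on the mean of the top-k segment scores with k = t // α, and a sum of squared differences between adjacent segment scores. These assumptions, together with the function and variable names, are illustrative only.

```python
import torch
import torch.nn.functional as F

def training_losses(seg_scores, video_scores, labels, alpha: int = 3, lam: float = 0.001):
    """Sketch of L_mil, L_smooth and L_cls combined into L_sum.

    seg_scores:   list of [t_i, 1] final segment scores S5, one per training video
    video_scores: [N] final video scores S4, values in (0, 1)
    labels:       [N] float video-level labels y (1.0 = abnormal, 0.0 = normal)
    """
    l_mil, l_smooth = 0.0, 0.0
    for s, y in zip(seg_scores, labels):
        s = s.squeeze(1)                                   # [t]
        k = max(1, s.numel() // alpha)                     # assumed k = t // alpha
        topk_mean = s.topk(k).values.mean()
        l_mil = l_mil + F.binary_cross_entropy(topk_mean, y)     # assumed top-k BCE form
        l_smooth = l_smooth + ((s[1:] - s[:-1]) ** 2).sum()      # assumed adjacent-score smoothness
    l_mil = l_mil / len(seg_scores)
    l_smooth = l_smooth / len(seg_scores)

    l_cls = F.binary_cross_entropy(video_scores, labels)   # two-class cross entropy on video scores
    return l_cls + l_mil + lam * l_smooth                  # L_sum = L_cls + L_mil + lambda * L_smooth
```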
It should be noted that the steps of the above methods are divided for clarity of description; in implementation they may be combined into one step or some steps may be split into several, and as long as the same logical relationship is contained they fall within the scope of this patent. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process also falls within the scope of this patent.
Example 2
Referring to fig. 6, an embodiment of the present application further provides a video anomaly detection system, including:
and the video feature extraction module 10 is configured to obtain segment feature data of the video to be detected.
And the video anomaly detection module 20 is configured to obtain a detection result of the segment feature data.
Wherein, the video anomaly detection module 20 includes:
and the abnormal feature enhancing module 21 is used for enhancing the time sequence information and the action information of the segment feature data to obtain segment enhanced feature data.
And the abnormal score calculation module 22 is used for calculating an initial segment abnormal score according to the segment enhanced feature data.
And the abnormal positioning module 23 based on the classification guidance calculates a segment abnormal probability score and a segment positioning guidance score according to the segment enhanced feature data.
The abnormal score optimizing module 24 is used for calculating a final abnormal score of the video to be detected according to the initial segment abnormal score and the segment abnormal probability score; optimizing the initial segment abnormal score through the segment positioning guide score to obtain a final segment abnormal score; and the final abnormal score and the final segment abnormal score of the video to be detected are the detection results.
It should be noted that the video anomaly detection system provided in this embodiment and the video anomaly detection method provided in embodiment 1 belong to the same concept; the specific ways in which each module and unit performs its operations have been described in detail in the method embodiment and are not repeated here. In practical applications, the functions of the system may be distributed to different functional modules as needed, i.e. the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above, which is not limited here.
Example 3
Referring to fig. 7, an embodiment of the present application further provides a computer device, which may be a server. The computer device includes a processor 110, a memory 120, an internal memory 130 and a network interface 140 connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile and/or volatile storage medium. The non-volatile storage medium stores an operating system 121, a computer program 122 and a database 123. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external client through a network connection. The computer program is executed by the processor to implement the functions or steps of the video anomaly detection method.
Referring to fig. 8, another computer device, which may be a client, is provided in an embodiment of the present application. The computer device includes a processor 210, a memory 220, an internal memory 230, a network interface 240, a display screen 260 and an input device 250 connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium. The non-volatile storage medium stores an operating system 221 and a computer program 222. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external server through a network connection. The computer program is executed by the processor to implement the functions or steps of the video anomaly detection method.
In summary, the technical effects of the present invention are as follows: the anomaly feature enhancement model enhances the temporal information and the motion information of the feature data of the video to be detected, making the data better suited to the anomaly detection task and improving the utilization of feature data in the video. Meanwhile, the classification-guided anomaly localization model exploits the localization ability of parameters generated during classification to optimize the detection result. Because the classification-guided anomaly localization model can be used together with any anomaly detection model, it improves detection performance while remaining convenient to use as a plug-and-play module loaded in memory. The video anomaly detection model of the embodiments of the present application is further trained on three loss functions, which further improves model accuracy.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Those skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (10)

1. A video anomaly detection method is characterized by comprising the following steps:
acquiring segment characteristic data of a video to be detected;
inputting the segment feature data into a video anomaly detection model to obtain a detection result, wherein the video anomaly detection model comprises:
the abnormal characteristic enhancement model is used for enhancing the time sequence information and the action information of the fragment characteristic data to obtain fragment enhanced characteristic data;
an abnormal score calculation model, which calculates an initial segment abnormal score according to the segment enhanced feature data;
calculating to obtain a segment abnormal probability score and a segment positioning guide score according to the segment enhanced feature data based on an abnormal positioning model of classification guide;
an abnormal score optimization model, which is used for calculating the final abnormal score of the video to be detected according to the initial segment abnormal score and the segment abnormal probability score; optimizing the initial segment abnormality score through the segment positioning guide score to obtain a final segment abnormality score; and the final abnormal score and the final segment abnormal score of the video to be detected are the detection results.
2. The video anomaly detection method according to claim 1, wherein said anomaly feature enhancement model comprises a long-range temporal feature enhancement module and a short-range motion feature enhancement module;
the step of enhancing the time sequence information and the action information of the segment feature data to obtain segment enhanced feature data comprises the following steps:
reducing the dimension of the segment feature data through a first fully-connected neural network and a second fully-connected neural network to obtain first dimension-reduced segment feature data and second dimension-reduced segment feature data with the same dimension; the first fully-connected neural network and the second fully-connected neural network have the same model structure but different parameter settings;
inputting the first dimension reduction segment characteristic data and the second dimension reduction segment characteristic data into a long-distance time sequence characteristic model to obtain segment full-time characteristic data;
inputting the fragment full-time feature data into a short-distance action feature model to obtain fragment action attention feature data;
and integrating the segment characteristic data and the segment action attention characteristic data to obtain segment enhanced characteristic data.
3. The video anomaly detection method according to claim 2, wherein the step of inputting the first dimension reduction segment feature data and the second dimension reduction segment feature data into a long-distance time sequence feature model to obtain segment full-time feature data comprises:
inputting the first dimension reduction feature data into an LSTM neural network of a long-distance time sequence feature model to obtain fusion feature data;
inputting the second dimension reduction feature data into a non-local neural network of the long-distance time sequence feature model, and performing residual error processing on an output result to obtain global feature data; the non-local neural network is formed by four one-dimensional convolutions, and the convolution characteristics of the four one-dimensional convolutions are completely consistent;
splicing the fusion characteristic data and the global characteristic data to obtain segment full-time characteristic data; wherein the size of the segment full temporal feature data is consistent with the size of the segment feature data.
4. The video anomaly detection method according to claim 2, wherein the step of inputting the segment full-time feature data into a short-distance motion feature model to obtain segment motion attention feature data comprises:
carrying out average processing on the segment full-time characteristic data on a time dimension;
inputting the segment full-time characteristic data subjected to mean value processing into a first fully-connected neural network of the short-distance action characteristic model for dimension reduction processing, and inputting the segment full-time characteristic data subjected to dimension reduction into a second fully-connected neural network of the short-distance action characteristic model for dimension increasing processing to obtain action attention weight;
performing broadcast multiplication on the action attention weight and the fragment full-time feature data, and adding a multiplication result and the fragment feature data to obtain the fragment action attention feature data; wherein the segment action attention feature data is consistent with a size of the segment feature data.
5. The method of claim 1, wherein the step of calculating an initial segment anomaly score according to the segment enhanced feature data comprises:
inputting the fragment enhanced feature data into a first fully-connected neural network of the abnormal score calculation model for first dimensionality reduction, and calculating the fragment enhanced feature data subjected to dimensionality reduction by using a Relu function and a Dropout function to obtain first intermediate data;
inputting the first intermediate data into a second fully-connected neural network of the abnormal score calculation model for second dimensionality reduction, and calculating the first intermediate data subjected to dimensionality reduction by using a Relu function and a Dropout function to obtain second intermediate data;
and inputting the second intermediate data into a third fully-connected neural network of the abnormal score calculation model to perform third dimensionality reduction, and calculating the second intermediate data subjected to dimensionality reduction by using a Sigmoid function to obtain the initial segment abnormal score with the dimensionality of 1.
6. The video anomaly detection method of claim 1, wherein the step of calculating a segment anomaly probability score and a segment positioning guidance score according to the segment enhanced feature data comprises:
transposing the segment enhanced feature data to obtain transposed segment feature data;
inputting the transposed segment feature data into a first one-dimensional convolution of the abnormal positioning model based on classification guidance for a first dimension reduction, and processing the dimension-reduced data with a ReLU function to obtain third intermediate data;
inputting the third intermediate data into a second one-dimensional convolution of the abnormal positioning model based on classification guidance for a second dimension reduction, and processing the dimension-reduced data with a ReLU function to obtain fourth intermediate data;
inputting the fourth intermediate data into a third one-dimensional convolution of the abnormal positioning model based on classification guidance for a third dimension reduction to obtain a shared parameter with a dimension of 1;
calculating the shared parameter with a Softmax function to obtain the segment anomaly probability score;
and calculating the shared parameter with a Sigmoid function to obtain the segment positioning guidance score.
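Claim 6 can be read as a small temporal convolutional head whose single-channel output (the shared parameter) is interpreted twice, once through Softmax and once through Sigmoid. The sketch below assumes channel widths of 512 and 128 and kernel sizes of 3/3/1; the claim only fixes the final dimension of 1.

```python
import torch
import torch.nn as nn

class ClassificationGuidedLocalization(nn.Module):
    """Hypothetical sketch of claim 6: three 1-D convolutions over the transposed
    segment features yield a shared 1-D response per segment; Softmax over segments
    gives the anomaly probability score, Sigmoid gives the positioning guidance score."""
    def __init__(self, in_dim: int = 2048):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, 512, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(512, 128, kernel_size=3, padding=1)
        self.conv3 = nn.Conv1d(128, 1, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, enhanced_feat: torch.Tensor):
        # enhanced_feat: (batch, num_segments, in_dim) -> transpose to (batch, in_dim, num_segments)
        x = enhanced_feat.transpose(1, 2)
        x = self.relu(self.conv1(x))               # third intermediate data
        x = self.relu(self.conv2(x))               # fourth intermediate data
        shared = self.conv3(x).squeeze(1)          # shared parameter, shape (batch, num_segments)
        prob_score = torch.softmax(shared, dim=1)  # segment anomaly probability score
        guide_score = torch.sigmoid(shared)        # segment positioning guidance score
        return prob_score, guide_score
```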
7. The video anomaly detection method according to claim 1, wherein the step of calculating a final anomaly score of the video to be detected according to the initial segment anomaly score and the segment anomaly probability score comprises:
multiplying the segment anomaly probability score by the initial segment anomaly score for each segment, and accumulating the products over all segments to obtain the final anomaly score of the video to be detected.
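Reading the accumulation in claim 7 as a sum over segments, the video-level score reduces to a probability-weighted sum of the initial segment scores; a short sketch, with shapes assumed to be (batch, num_segments):

```python
import torch

def video_anomaly_score(prob_score: torch.Tensor, init_score: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of claim 7: weight each initial segment anomaly score by its
    anomaly probability and accumulate over segments."""
    return (prob_score * init_score).sum(dim=1)  # final anomaly score of the video, shape (batch,)
```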
8. The video anomaly detection method of claim 1, wherein the step of optimizing the initial segment anomaly score through the segment positioning guidance score to obtain a final segment anomaly score comprises:
normalizing the segment positioning guidance score by its mean value to obtain an optimized score;
and computing the Hadamard product of the initial segment anomaly score and the optimized score, and processing the result with a Sigmoid function to obtain the final segment anomaly score.
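One plausible reading of claim 8, in which normalization by the average value is taken to mean dividing the guidance score by its per-video mean; this choice and the tensor shapes are assumptions:

```python
import torch

def refine_segment_scores(init_score: torch.Tensor, guide_score: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of claim 8. init_score, guide_score: (batch, num_segments)."""
    optimized = guide_score / (guide_score.mean(dim=1, keepdim=True) + 1e-8)  # mean normalization
    return torch.sigmoid(init_score * optimized)  # Hadamard product, then Sigmoid -> final segment anomaly score
```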
9. The video anomaly detection method according to claim 1, wherein the step of training the video anomaly detection model comprises:
acquiring video-level labels of a training video data set;
extracting features of each training video to obtain segment features of the training video;
feeding the segment features into the video anomaly detection model to obtain a video anomaly score and video segment anomaly scores of the training video;
calculating a weakly supervised multi-instance loss and a temporal smoothness loss according to the video segment anomaly scores;
calculating a binary cross-entropy loss according to the video anomaly score;
taking the sum of the weakly supervised multi-instance loss, the temporal smoothness loss, and the binary cross-entropy loss as a final loss;
and back-propagating the final loss through the video anomaly detection model.
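Claim 9 names the three loss terms but does not give their exact forms. The sketch below uses a common choice for each one (a max-score multi-instance term, a squared-difference smoothness term, and a standard binary cross-entropy term), so it should be read as an illustration rather than the patented loss:

```python
import torch
import torch.nn.functional as F

def training_loss(video_score: torch.Tensor, segment_scores: torch.Tensor,
                  video_label: torch.Tensor) -> torch.Tensor:
    """Hypothetical final loss of claim 9.
    video_score: (batch,) in [0, 1]; segment_scores: (batch, num_segments) in [0, 1];
    video_label: (batch,) float tensor, 0.0 for normal and 1.0 for anomalous videos."""
    # Weakly supervised multi-instance term: the top segment score should match the video label.
    mil_loss = F.binary_cross_entropy(segment_scores.max(dim=1).values, video_label)
    # Temporal smoothness term: adjacent segment scores should vary gradually.
    smooth_loss = ((segment_scores[:, 1:] - segment_scores[:, :-1]) ** 2).mean()
    # Binary cross-entropy term on the video-level anomaly score.
    bce_loss = F.binary_cross_entropy(video_score, video_label)
    return mil_loss + smooth_loss + bce_loss
```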
10. A video anomaly detection system, comprising:
the video feature extraction module is used for acquiring segment feature data of a video to be detected;
the video anomaly detection module is used for acquiring a detection result of the segment feature data;
wherein the video anomaly detection module comprises:
the abnormal feature enhancement module is used for enhancing the time sequence information and the action information of the segment feature data to obtain segment enhanced feature data;
the abnormal score calculation module is used for calculating an initial segment anomaly score according to the segment enhanced feature data;
the abnormal positioning module based on classification guidance is used for calculating a segment anomaly probability score and a segment positioning guidance score according to the segment enhanced feature data;
the abnormal score optimization module is used for calculating a final anomaly score of the video to be detected according to the initial segment anomaly score and the segment anomaly probability score, and for optimizing the initial segment anomaly score through the segment positioning guidance score to obtain a final segment anomaly score; the final anomaly score and the final segment anomaly score of the video to be detected constitute the detection result.
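For orientation, the sketches given after claims 4 to 8 can be wired together roughly as the modules of claim 10 describe; the feature size (2048), the number of segments (32), and the reuse of the same tensor for the full-time and segment features are assumptions made only for this example.

```python
import torch

# Assumes ShortDistanceActionAttention, AnomalyScoreHead, ClassificationGuidedLocalization,
# video_anomaly_score and refine_segment_scores from the sketches above are in scope.
feats = torch.randn(1, 32, 2048)               # segment feature data of one video (assumed shape)

attention = ShortDistanceActionAttention(2048)
score_head = AnomalyScoreHead(2048)
localizer = ClassificationGuidedLocalization(2048)

enhanced = attention(feats, feats)             # abnormal feature enhancement module (simplified)
init_score = score_head(enhanced).squeeze(-1)  # initial segment anomaly scores, (1, 32)
prob_score, guide_score = localizer(enhanced)  # abnormal positioning module based on classification guidance
video_score = video_anomaly_score(prob_score, init_score)        # final anomaly score of the video
segment_score = refine_segment_scores(init_score, guide_score)   # final segment anomaly scores
```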
CN202211648403.8A 2022-12-21 2022-12-21 Video anomaly detection method and system Pending CN115761599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211648403.8A CN115761599A (en) 2022-12-21 2022-12-21 Video anomaly detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211648403.8A CN115761599A (en) 2022-12-21 2022-12-21 Video anomaly detection method and system

Publications (1)

Publication Number Publication Date
CN115761599A true CN115761599A (en) 2023-03-07

Family

ID=85348589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211648403.8A Pending CN115761599A (en) 2022-12-21 2022-12-21 Video anomaly detection method and system

Country Status (1)

Country Link
CN (1) CN115761599A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984757A (en) * 2023-03-20 2023-04-18 松立控股集团股份有限公司 Abnormal event detection method based on global local double-flow feature mutual learning
CN117671554A (en) * 2023-10-20 2024-03-08 上海盈蝶智能科技有限公司 Security monitoring method and system


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination