CN114724060A - Method and device for unsupervised video anomaly detection based on mask self-encoder - Google Patents

Method and device for unsupervised video anomaly detection based on mask self-encoder Download PDF

Info

Publication number
CN114724060A
Authority
CN
China
Prior art keywords
cube
optical flow
time
space
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210249993.0A
Other languages
Chinese (zh)
Inventor
王思齐
胡婧韬
余广
祝恩
蔡志平
朱信忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210249993.0A
Publication of CN114724060A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

The application relates to a method and a device for unsupervised video anomaly detection based on a mask self-encoder. The method comprises the following steps: performing target detection on each frame of the acquired surveillance video data, and extracting image blocks at the position of each detected foreground target from the current frame and several adjacent preceding and following frames; constructing cubes from the image blocks of the frames, and performing an interval masking operation on each cube in the time domain; predicting, with a visual network, the masked parts of each space-time cube and each optical flow cube from the cubes after the interval masking operation; calculating the pixel-level prediction loss of each space-time cube and each optical flow cube; and calculating an abnormal value score for each frame of the surveillance video data from the prediction losses of all space-time cubes and optical flow cubes, thereby obtaining the abnormal event detection result of each video frame. High-performance end-to-end detection of abnormal video events is thereby achieved.

Description

Method and device for unsupervised video anomaly detection based on mask self-encoder
Technical Field
The invention belongs to the technical field of video image processing, and relates to an unsupervised video anomaly detection method and device based on a mask self-encoder.
Background
With the introduction and promotion of concepts such as the 'safe society' and the 'smart city', security systems have become an effective safeguard against public safety incidents. The analysis of massive volumes of surveillance video, one of the most important technologies in security systems, has gradually become an important research topic in the field of social security. Video Anomaly Detection (VAD) is one of the core technologies of surveillance video analysis, and aims to intelligently identify anomalous events and suspicious behaviors in surveillance video. Video anomaly detection reduces the false detections and missed detections that can occur when massive amounts of data are screened manually, and greatly reduces the manpower and financial resources that would otherwise be consumed.
Modeling complex and high-dimensional video data is very difficult. At the same time, abnormal events have the essential characteristics of rarity (abnormal events occur far less frequently than normal events and are hard to collect), novelty (abnormal events usually deviate from familiar patterns and cannot be anticipated), ambiguity (the definition of an abnormal event is abstract and there is no clear boundary between normal and abnormal samples), and inexhaustibility (abnormal events take countless forms). The task of video anomaly detection is therefore very challenging, and several key problems remain to be solved.
At present, most video anomaly detection methods are based on a semi-supervised setting: a model of the normal video events in a surveillance video is constructed from a training set that contains only normal events, and test data that do not conform to the model's inference are judged to be anomalous events. Semi-supervised surveillance video anomaly detection still requires the time-consuming and labor-intensive collection of purely normal events to construct the training data set. In the big data era, this need for manual labeling is a major pain point of semi-supervised video anomaly detection. To meet social needs and promote technical development, a promising solution is to adopt an unsupervised setting, that is, to detect anomalies directly from surveillance video that is completely unlabeled and carries no manual annotation.
Unsupervised video anomaly detection is an emerging technology that has been explored far less than semi-supervised approaches. Currently, one class of unsupervised methods studies change detection on the video, while another class iteratively improves performance through self-trained ordinal regression built on the initial anomaly recognition results of a traditional machine learning algorithm. However, in the process of implementing the present invention, the inventors found that current unsupervised video anomaly detection technology suffers from low detection performance for abnormal events in surveillance video.
Disclosure of Invention
In view of the above problems in the conventional methods, the present invention provides an unsupervised video anomaly detection method based on a mask self-encoder, an unsupervised video anomaly detection apparatus based on a mask self-encoder, a computer device and a computer readable storage medium, which can realize high-performance end-to-end video anomaly event detection.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in one aspect, a method for unsupervised video anomaly detection based on a mask self-encoder is provided, which includes the steps of:
acquiring monitoring video data;
respectively carrying out target detection on each frame of the monitoring video data, and extracting image blocks of a detected foreground target corresponding to a current frame and adjacent frames;
constructing each cube according to the image block of each frame, and respectively performing interval mask operation on each cube in a time domain; the cube comprises a space-time cube and an optical flow cube;
respectively utilizing visual network prediction to obtain prediction blocks of masked parts in each space-time cube and each optical flow cube according to each space-time cube and each optical flow cube after the interval masking operation;
according to the prediction block, respectively calculating the prediction loss of each space-time cube and each optical flow cube at the pixel level;
calculating abnormal value scores of all frames in the monitoring video data according to the prediction losses of all the space-time cubes and all the optical flow cubes; the abnormal value score is used for indicating the abnormal event detection result of the video frame.
In another aspect, an unsupervised video anomaly detection apparatus based on a mask self-encoder is also provided, including:
the video acquisition module is used for acquiring monitoring video data;
the target detection module is used for respectively carrying out target detection on each frame of the monitoring video data and extracting image blocks of a plurality of frames before and after the current frame and the adjacent frames at the corresponding positions of the detected foreground target;
the mask operation module is used for constructing each cube according to the image blocks of each frame and respectively carrying out interval mask operation on each cube in a time domain; the cube comprises a space-time cube and an optical flow cube;
the mask prediction module is used for respectively predicting by using a visual network according to each space-time cube and each optical flow cube after the interval mask operation to obtain a prediction block of a masked part in each space-time cube and each optical flow cube;
the prediction loss module is used for respectively calculating the prediction loss of each space-time cube and each optical flow cube at the pixel level according to the prediction block;
the abnormal scoring module is used for calculating abnormal value scores of all frames in the monitoring video data according to the prediction loss of each space-time cube and each optical flow cube; the abnormal value score is used for indicating the abnormal event detection result of the video frame.
In still another aspect, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above-mentioned unsupervised video anomaly detection method based on a mask self-encoder when executing the computer program.
In still another aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the steps of the above-mentioned mask auto-encoder-based unsupervised video anomaly detection method.
One of the above technical solutions has the following advantages and beneficial effects:
according to the unsupervised video anomaly detection method and the unsupervised video anomaly detection device based on the mask self-encoder, after target detection is carried out on the obtained monitoring video data, image blocks are extracted, two cubes are constructed, masking operation is carried out on the time domain by adopting an interval masking strategy, then a masked part is predicted by utilizing a visual network according to a visible part in each cube, prediction loss between a prediction block and the masked part is calculated, and finally, an anomaly score of each frame is calculated according to the prediction loss of all cubes, so that anomaly detection of the monitoring video data is completed.
By adopting this scheme, high-quality video features and models can be learned by the temporal mask self-encoder without any labeling information, and high-performance end-to-end detection of abnormal video events is realized. Specifically, the scheme first uses target detection to locate the foreground so as to describe video events more accurately, i.e., space-time cubes are constructed; a mask is then applied in the time domain of each space-time cube to cover half of its blocks; the visual Transformer network is then trained to predict the masked portion of the space-time cube from its unmasked portion. Further, optical flow cubes are constructed using optical flow as an auxiliary motion cue, and the same operations as for the space-time cubes are applied.
Finally, because abnormal events are inherently rarer and more novel than normal events, the temporal mask makes them even harder to predict. The resulting difference in prediction error between normal and abnormal events allows the scheme to use the prediction loss directly as the anomaly score of a video event, which effectively achieves high-performance end-to-end detection of abnormal video events.
Drawings
FIG. 1 is a flow diagram illustrating a method for unsupervised video anomaly detection based on a masked self-encoder in one embodiment;
FIG. 2 is a schematic flow chart illustrating the detection of an abnormal event in an original video frame of a surveillance video according to an embodiment;
FIG. 3 is a flow diagram that illustrates the detection of exceptions based on optical flow of surveillance video in one embodiment;
FIG. 4 is a diagram illustrating the visualization of the predicted results of masking blocks of normal and abnormal video events in one embodiment;
fig. 5 is a schematic block diagram of an unsupervised video anomaly detection apparatus based on a mask self-encoder according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
During the research leading to the invention, the inventors found that although the unsupervised scenario needs no manually labeled training set and thus saves manpower, financial resources and time, existing unsupervised methods trail semi-supervised algorithms by 5-10% on common data sets, leaving a clear gap. The bottlenecks of existing methods are mainly due to two factors. (1) Suboptimal video feature representation: mainstream unsupervised algorithms rely on hand-crafted feature operators to extract feature representations; such operators must be carefully designed by experienced experts, and an operator designed for a specific scene has limited generality and is difficult to transfer to other surveillance scenes. Recently, a few unsupervised methods have adopted a pretrained Deep Neural Network (DNN) to extract features, but because of the substantial difference between the data set used to pretrain the network and the surveillance video data concerned here, the extracted features are often poorly representative and contain redundant information. The feature representation directly affects the learning quality of the downstream task. (2) Limited model learning ability: existing unsupervised methods all rely on traditional machine learning algorithms, but these classical models have no end-to-end learning capability, cannot integrate feature learning with the downstream task, and have extremely limited modeling capability.
Before beginning the description of the embodiments of the present invention, the definitions of some of the terms involved are given below:
masking: the pixel value at the corresponding position on the image is set to 0 according to the shape specified by the mask (mosaic). In general, a mask is an image that operates on a whole unit, for example, sets all pixel values of a whole frame or a whole block to 0, that is, becomes completely black. The mask is a Chinese translation to a mask operation, equivalent to a mask, invisible.
Change detection: by some measure, when the ordinal number data changes dramatically, an abnormal condition is considered to have occurred.
Pre-trained deep neural network: algorithms in the general visual field typically use deep neural networks that have been trained on the larger image datasets disclosed to extract features.
Self-supervision learning: self-supervised learning is one type of unsupervised learning that is expected to learn a common feature expression for use in downstream tasks. The predefined proxy task (pretext task) is optimized on completely unlabeled data, learning the feature representation using self information.
Feature learning (feature learning): also called representation learning (representation learning) or representation learning, generally refers to a method for automatically extracting features or representations from data by a model, and is a process for automatically learning the model.
End-to-end (end-to-end): end-to-end means that the user enters the raw material directly, obtaining usable results directly, without concern for intermediate products.
Block level (patch-level): an image block (patch) is treated as the unit of operation. By comparison, frame-level operations treat an entire frame as the unit, such as adding noise or a mask to a whole frame of a video, and pixel-level operations treat each pixel as the unit.
Transformer: a neural network architecture; there is as yet no unified Chinese translation in the field.
Outlier score: representing the probability of the object belonging to an anomaly. The larger the abnormal value is, the more likely it is to be an abnormal event, and the smaller the abnormal value is, the more likely it is to be a normal event.
The prediction loss (loss) is equivalent to the prediction error (error).
The following detailed description of embodiments of the invention will be made with reference to the accompanying drawings.
In the image field, masked autoencoders (MAEs) have enjoyed great success as self-supervised learners. The core idea of the MAE method is to cover a portion of the small blocks (patches) in an image and then train a deep neural network to recover the missing image blocks from the uncovered part of the image. Through such a proxy task, the MAE learns high-quality features for downstream tasks. Meanwhile, as a deep-neural-network-based model, the MAE has strong modeling capability. Because the MAE can learn high-quality features and has strong modeling capability, introducing this end-to-end learning paradigm into the unsupervised video anomaly detection task is expected to overcome the shortcomings of the prior art. However, since the MAE is applied in the image domain by adding a mask only along the spatial dimension, temporal information is not considered, and the MAE cannot be applied directly to the video domain. Notably, an abnormal event typically manifests itself as an anomaly in the temporal context.
To address the low detection performance for abnormal events in surveillance video that afflicts traditional unsupervised video anomaly detection technology, the application designs a novel unsupervised video anomaly detection technique based on a mask self-encoder, which realizes high-performance end-to-end detection of abnormal events in surveillance video by using a temporal mask self-encoder without any video labeling information. In this method, the same detection and image processing operations are applied both to the original video frames and to the optical flow extracted from them; the prediction loss is then used directly as the anomaly score of a video event, and normal and abnormal events in the surveillance video can be distinguished directly by this score, so that end-to-end abnormal event detection for surveillance video is realized efficiently and the detection performance is significantly improved.
Referring to fig. 1, an embodiment of the present application provides a method for unsupervised video anomaly detection based on a mask self-encoder, including the following steps S12 to S22:
and S12, acquiring the monitoring video data.
It is to be appreciated that the surveillance video data may be given data, data collected online, or unlabeled surveillance video data obtained in some other way.
And S14, respectively carrying out target detection on each frame of the monitoring video data, and extracting the image blocks of the detected foreground target corresponding to the current frame and the adjacent frames.
It can be understood that for the obtained original monitoring video data, foreground objects in the video need to be located, and redundant background information needs to be filtered. The foreground object in the video may be a person, an animal, a vehicle, or other various items.
Target detection is performed on each frame of the unlabeled surveillance video data, and the foreground targets (such as people or vehicles) are located by detection boxes (for example, each foreground target is identified by a target bounding box). Then, for each target bounding box, the image blocks corresponding to that box (i.e., representing the foreground target) are extracted from the current frame and from the adjacent preceding and following frames (D-1 adjacent frames in total, where D is a positive integer greater than or equal to 2). For each frame, after its foreground objects have been detected and located, the number of adjacent preceding and following frames (relative to the current frame in temporal order) from which blocks are extracted, in addition to the block of the current frame at the position of the foreground object, may be chosen by balancing the available computing resources, the cube size, the detection accuracy, and so on.
In an embodiment, optionally, D is 8, and in this case, the size of the cube is suitable, so that excessive computing resources are not consumed, and the detection time is not increased, thereby achieving relatively high detection performance.
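As an illustration of this extraction step, the following minimal sketch crops the same bounding-box region from the current frame and its temporal neighbours. The detector interface `detect_boxes` is a hypothetical placeholder, not part of the disclosed method, and the boundary handling is an assumption of this sketch.

```python
# Minimal sketch only: crop the image block at each detected foreground position
# from the current frame and its temporally adjacent frames.
# `detect_boxes` is a hypothetical detector wrapper returning (x1, y1, x2, y2) boxes.
def crop_event_blocks(frames, t, detect_boxes, d=8):
    half = d // 2
    blocks_per_object = []
    for (x1, y1, x2, y2) in detect_boxes(frames[t]):   # foreground boxes in frame t
        blocks = []
        for k in range(t - half, t - half + d):        # d consecutive frames around t
            k = min(max(k, 0), len(frames) - 1)        # clamp at the video boundaries
            blocks.append(frames[k][y1:y2, x1:x2])     # same spatial position each frame
        blocks_per_object.append(blocks)
    return blocks_per_object
```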
S16, constructing each cube according to the image blocks of each frame, and respectively performing interval mask operation on each cube in a time domain; the cubes include spatiotemporal cubes and optical flow cubes.
It is understood that a spatio-temporal cube (STC) serves as the description of one video event and also as a single processing unit of the visual Transformer (ViT) network in a subsequent step; it may be formed by stacking a plurality of image blocks. An optical flow cube (OFC) serves as the auxiliary motion description of one video event and likewise as a single processing unit of the visual network in a subsequent step; it may be formed by stacking the optical flow blocks corresponding to a plurality of image blocks.
In video processing, optical flow is widely used to describe the motion of objects. The temporal patterns to be learned are essentially embedded in the motion information of objects. Therefore, the optical flow extracted from the video frames is added as an auxiliary motion cue to further improve the detection performance.
In one embodiment, as shown in FIG. 2, the process of constructing a spatiotemporal cube from image blocks of frames includes:
scaling a plurality of image blocks with continuous time to a preset size and stacking the image blocks according to a time sequence to obtain a space-time cube; the spatio-temporal cube is used to describe a video event.
It can be understood that, for the construction of the spatio-temporal cube, specifically: after a plurality of (e.g., D-block) image blocks in time succession are scaled to a uniform preset size (e.g., 32 × 32 or other sizes), the image blocks are stacked in time sequence to form a spatio-temporal cube, and other spatio-temporal cubes are constructed in the same manner.
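A minimal sketch of this construction step is given below; it assumes OpenCV-style image arrays and the 32×32 preset size mentioned above.

```python
# Sketch: scale D temporally consecutive image blocks to a uniform preset size
# and stack them in temporal order to form one spatio-temporal cube (STC).
import cv2
import numpy as np

def build_stc(blocks, size=32):
    resized = [cv2.resize(b, (size, size)) for b in blocks]  # uniform preset size
    return np.stack(resized, axis=0)                         # shape: (D, size, size, C)
```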
In one embodiment, as shown in fig. 2, the process of interval masking each spatio-temporal cube separately in the time domain includes:
for each space-time cube, performing image block level masking operation in a mode of image block interval masking in a time domain to obtain the space-time cube after masking operation; half of the image blocks in the space-time cube after the masking operation are visible, and the other half of the image blocks are masked.
It is understood that for each spatio-temporal cube, a masking operation at the image block level (patch-level) is performed in the temporal domain. Each spatio-temporal cube contains D temporally consecutive blocks; in this embodiment half of them, i.e. D/2 blocks, are visible (the unmasked part), and the other D/2 blocks are masked (the unknown part).
Specifically, an interval masking strategy is adopted. For a given spatio-temporal cube C = [x1, x2, …, xD], the blocks {xi | i = 2, 4, …, D-2, D} are masked, and the remaining blocks {xi | i = 1, 3, …, D-3, D-1} serve as the visible part that is input into the visual Transformer network model in the subsequent step.
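The interval masking strategy can be sketched as follows; this is a non-authoritative illustration in which masking is shown as zeroing the pixel values of the even-indexed blocks.

```python
# Sketch of the interval-mask strategy: blocks 2, 4, ..., D (1-based) are masked,
# blocks 1, 3, ..., D-1 remain visible.
import numpy as np

def interval_mask(cube):
    d = cube.shape[0]
    visible_idx = np.arange(0, d, 2)      # 0-based indices of blocks 1, 3, ..., D-1
    masked_idx = np.arange(1, d, 2)       # 0-based indices of blocks 2, 4, ..., D
    masked_cube = cube.copy()
    masked_cube[masked_idx] = 0           # masking: set the covered blocks to 0
    return masked_cube, visible_idx, masked_idx
```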
In one embodiment, as shown in FIG. 3, the process of constructing an optical flow cube from image blocks of frames includes:
respectively carrying out optical flow extraction on a plurality of image blocks with continuous time to obtain a plurality of optical flow blocks with continuous time;
scaling a plurality of optical flow blocks with continuous time to a preset size and stacking the optical flow blocks according to time sequence to obtain an optical flow cube; the optical flow cube is used for action description as one video event.
It can be understood that, for the construction of the optical flow cube, after extracting the image blocks of each frame, the corresponding optical flows of the consecutive D blocks need to be extracted, so as to obtain the optical flow blocks corresponding to the image blocks, which are used for describing the motion information of the foreground object. After the optical flow blocks are obtained, a plurality of (for example, D-blocks) optical flow blocks which are continuous in time are scaled to a uniform preset size (for example, 32 × 32 or other sizes can be set), and then stacked in time sequence to form an optical flow cube, and other optical flow cubes are similarly constructed.
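A sketch of the optical flow cube construction is shown below. OpenCV's Farneback dense flow is used purely as a stand-in, since the disclosure does not fix a particular flow estimator, and the D consecutive blocks yield D-1 pairwise flow fields in this sketch.

```python
# Sketch: estimate optical flow between consecutive image blocks, scale each flow
# field to the preset size, and stack the flow blocks in temporal order.
import cv2
import numpy as np

def build_ofc(blocks, size=32):
    flows = []
    for prev, nxt in zip(blocks[:-1], blocks[1:]):
        g0 = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(nxt, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(cv2.resize(flow, (size, size)))
    return np.stack(flows, axis=0)        # shape: (D-1, size, size, 2)
```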
In one embodiment, as shown in FIG. 3, the process of interval masking each optical flow cube separately in the time domain includes:
for each optical flow cube, performing optical flow block level mask operation on a time domain in an optical flow block interval mask mode to obtain an optical flow cube subjected to mask operation; half of the light flow blocks in the masked light flow cube are visible and the other half of the light flow blocks are masked.
It will be appreciated that for each optical flow cube, the masking operation is performed in the time domain at the optical flow block level. Each optical flow cube contains D temporally successive optical flow blocks, half of which are covered: D/2 blocks are visible (the unmasked part) and D/2 blocks are masked (the unknown part). Specifically, this embodiment again employs the interval masking strategy: for a given optical flow cube C^OF = [x1^OF, x2^OF, …, xD^OF], the blocks {xi^OF | i = 2, 4, …, D-2, D} are masked, and the remaining blocks {xi^OF | i = 1, 3, …, D-3, D-1} serve as the visible part that is input into the visual Transformer network model in the subsequent step.
And S18, respectively obtaining prediction blocks of the masked parts in each space-time cube and each optical flow cube by using visual network prediction according to each space-time cube and each optical flow cube after the interval masking operation.
It is to be understood that, through the above step S16, the self-supervised proxy task is constructed by masking on the time domain of the spatiotemporal cube and the optical flow cube, respectively. Half of the image blocks (optical flow blocks) of each spatio-temporal cube (optical flow cube) are masked as the network prediction part, and the remaining unmasked part is input as the visible part into the visual Transformer network.
In the proxy task, the subsequent prediction loss uses the mean square error to measure the error between the original pixel values of the masked blocks in the spatio-temporal cube (or optical flow cube) and the output of the visual Transformer network, i.e., the prediction of those masked blocks.
In one embodiment, as shown in fig. 2, the process of predicting the prediction block of the masked part in each space-time cube by using the visual network according to each space-time cube after the interval masking operation includes:
for each space-time cube subjected to the interval mask operation, performing pre-operation on a visible part on the space-time cube and inputting the visible part into a visual network; the pre-operation comprises flattening a visible part into a one-dimensional vector, linearly projecting to a set low-dimensional space, and adding a learnable position code;
and projecting the output vector of the visual network to the original dimension before the pre-operation and deforming the output vector into the size before the pre-operation to obtain a prediction block of the masked part in each space-time cube.
Specifically, each block of the visible part (D/2 blocks) of the space-time cube is flattened into a one-dimensional vector (1D token) and then linearly projected into a preset low-dimensional space; to retain temporal information, a learnable position encoding is added to each projected vector. The D/2 unmasked vectors are input into the visual Transformer network, and the output vectors of the network are projected back to the original dimension and reshaped to the same size as the D/2 blocks, giving the prediction of the masked part of this spatio-temporal cube.
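The prediction step can be sketched in PyTorch as follows. A plain Transformer encoder stands in for the visual Transformer, and all layer sizes (embedding dimension, number of heads and layers) are illustrative assumptions rather than the disclosed configuration.

```python
# Sketch of the masked-part prediction: flatten visible blocks, linearly project them,
# add a learnable position encoding, run a Transformer encoder, then project back to
# pixel space and reshape to the block size as the prediction of the masked blocks.
import torch
import torch.nn as nn

class MaskedCubePredictor(nn.Module):
    def __init__(self, d=8, size=32, channels=3, dim=256):
        super().__init__()
        block_dim = size * size * channels                     # one flattened block
        self.proj_in = nn.Linear(block_dim, dim)               # linear projection
        self.pos_emb = nn.Parameter(torch.zeros(d // 2, dim))  # learnable position code
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.proj_out = nn.Linear(dim, block_dim)              # back to the original dimension
        self.size, self.channels = size, channels

    def forward(self, visible):                                # (B, D/2, size, size, C)
        b, n = visible.shape[:2]
        tokens = self.proj_in(visible.flatten(2)) + self.pos_emb
        out = self.proj_out(self.encoder(tokens))              # predict the masked blocks
        return out.view(b, n, self.size, self.size, self.channels)
```

Training would then minimize the mean squared error between this output and the original pixel values of the masked blocks, as described in the following steps.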
In one embodiment, as shown in fig. 3, the process of obtaining the prediction block of the masked portion in each optical flow cube by using the visual network prediction according to each optical flow cube after the interval masking operation includes:
for each optical flow cube after the interval mask operation, performing pre-operation on a visible part on the optical flow cube and inputting the visible part into a visual network; the pre-operation comprises flattening a visible part into a one-dimensional vector, linearly projecting to a set low-dimensional space, and adding a learnable position code;
and projecting the output vector of the visual network to the original dimension before the pre-operation and deforming the output vector into the size before the pre-operation to obtain a prediction block of the masked part in each optical flow cube.
Similarly, each block of the visible part (D/2 blocks) of the optical flow cube is flattened into a one-dimensional vector and linearly projected into a preset low-dimensional space; to retain temporal information, a learnable position encoding is added to each projected vector. The D/2 unmasked vectors are input into the visual Transformer network, and the output vectors of the network are projected back to the original dimension and reshaped to the same size as the aforementioned D/2 blocks, giving the prediction of the masked portion of this optical flow cube.
S20, based on the prediction block, the prediction loss at the pixel level of each space-time cube and each optical flow cube is calculated.
It is to be understood that the prediction penalty employs Mean Square Error (MSE) to compute the Error between the original pixel values of the masked blocks in the spatio-temporal cube (optical flow cube) and the output of ViT (prediction of the masked blocks in the spatio-temporal cube (optical flow cube)).
In one embodiment, the prediction loss of the spatio-temporal cube at the pixel level is calculated by the following formula:
MSE(C) = (1/N) · Σ (C_mask - C_pred)²
wherein C represents a space-time cube, C_mask represents the original pixel values of the masked portion of the space-time cube, C_pred represents the predicted pixel values of the masked portion of the space-time cube, the summation runs over all pixels of the masked portion, and N is the number of such pixels.
Specifically, for each spatio-temporal cube, which can be viewed as one video event, the pixel-level prediction error is the MSE(C) defined above.
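A minimal sketch of this pixel-level loss is given below, under the assumption that both arguments are arrays holding the masked blocks' pixel values.

```python
# Sketch: mean squared error between the original masked blocks and their predictions.
import numpy as np

def mse_loss(masked_original, predicted):
    diff = masked_original.astype(np.float32) - predicted.astype(np.float32)
    return float(np.mean(diff ** 2))
```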
In one embodiment, the predicted loss of the optical flow cube at the pixel level is calculated by the following formula:
MSE(C^OF) = (1/N) · Σ (C^OF_mask - C^OF_pred)²
wherein C^OF represents an optical flow cube, C^OF_mask represents the original pixel values of the masked portion of the optical flow cube, C^OF_pred represents the predicted pixel values of the masked portion of the optical flow cube, the summation runs over all pixels of the masked portion, and N is the number of such pixels.
In particular, for each optical flow cube, which can be regarded as the auxiliary motion description of a video event, the pixel-level prediction error is the MSE(C^OF) defined above.
S22, calculating abnormal value scores of each frame in the monitoring video data according to the prediction loss of each space-time cube and each optical flow cube; the abnormal value score is used for indicating the abnormal event detection result of the video frame.
It can be understood that the unusual events are difficult to predict due to their rarity and novelty, and the temporal mask increases the difficulty of predicting the unusual events. Thus, the prediction error of a normal event tends to be small, whereas the prediction error of an abnormal event tends to be large.
As shown in fig. 4, across different public data sets it can be seen from the last row that the prediction errors of abnormal events are clearly distinguishable from those of normal events. Based on this observation, the anomaly score can directly be taken as the overall prediction loss of each spatio-temporal cube, regarded as a video event, together with its auxiliary motion-cue optical flow cube.
In an embodiment, the step S22 may specifically include the following processing steps:
calculating the overall prediction loss of the monitoring video data according to the prediction losses of all the space-time cubes and the optical flow cubes;
and, for each frame of the surveillance video data, taking the maximum of the anomaly scores (given by the overall prediction loss) of all cubes on that frame as the anomaly score of the video frame.
It can be understood that the overall prediction loss is adopted directly as the anomaly score, and the maximum of the anomaly scores of all cubes on one frame is finally taken as the anomaly score of that video frame. The higher the anomaly score, the more likely the frame is to contain an abnormal video event (abnormal and normal video events can be separated directly by a preset score threshold).
In one embodiment, the overall prediction loss Score(C) is calculated by the following formula:
Score(C) = α · (MSE(C) - μ) / σ + β · (MSE(C^OF) - μ^OF) / σ^OF
wherein C represents a space-time cube, α and β are hyperparameters controlling the weights of the two parts of the prediction loss, MSE(C) represents the pixel-level prediction loss of the space-time cube, MSE(C^OF) represents the pixel-level prediction loss of the corresponding optical flow cube, μ and σ represent the mean and standard deviation of the MSE losses of all space-time cubes, and μ^OF and σ^OF represent the mean and standard deviation of the MSE losses of all optical flow cubes.
In some embodiments, optionally, α may be 1.0, and β may be 0.5, so as to achieve relatively better detection performance.
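The scoring step can be sketched as follows. This illustration assumes z-score normalization with the standard deviation as the normalizing statistic σ, and assumes that frames with no detected cube keep a score of 0; neither assumption is a statement of the disclosed implementation.

```python
# Sketch: normalize the per-cube losses over the whole video, fuse them with the
# weights alpha and beta, and take the per-frame maximum over all cubes on that frame.
import numpy as np

def frame_scores(stc_losses, ofc_losses, frame_ids, num_frames, alpha=1.0, beta=0.5):
    stc = np.asarray(stc_losses, dtype=np.float64)
    ofc = np.asarray(ofc_losses, dtype=np.float64)
    stc = (stc - stc.mean()) / (stc.std() + 1e-8)         # normalized STC losses
    ofc = (ofc - ofc.mean()) / (ofc.std() + 1e-8)         # normalized OFC losses
    cube_scores = alpha * stc + beta * ofc                # Score(C) for each video event
    scores = np.zeros(num_frames)
    for s, f in zip(cube_scores, frame_ids):              # frame each cube belongs to
        scores[f] = max(scores[f], s)                     # frame score = max over its cubes
    return scores
```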
According to the unsupervised video anomaly detection method based on the mask self-encoder, after target detection is carried out on the obtained monitoring video data, image blocks are extracted, two cubes are constructed, masking operation is carried out on the time domain by adopting an interval masking strategy, then a masked part is predicted by utilizing a visual network according to a visible part in each cube, the prediction loss between a prediction block and the masked part is calculated, and finally the abnormal value score of each frame is calculated according to the prediction losses of all cubes, so that the abnormal event detection of the monitoring video data is completed.
By adopting this scheme, high-quality video features and models can be learned by the temporal mask self-encoder without any labeling information, and high-performance end-to-end detection of abnormal video events is realized. Specifically, the scheme first uses target detection to locate the foreground so as to describe video events more accurately, i.e., space-time cubes are constructed; a mask is then applied in the time domain of each space-time cube to cover half of its blocks; the visual Transformer network is then trained to predict the masked portion of the space-time cube from its unmasked portion. Further, optical flow cubes are constructed using optical flow as an auxiliary motion cue, and the same operations as for the space-time cubes are applied.
Finally, because abnormal events are inherently rarer and more novel than normal events, the temporal mask makes them even harder to predict. The resulting difference in prediction error between normal and abnormal events allows the scheme to use the prediction loss directly as the anomaly score of a video event, which effectively achieves high-performance end-to-end detection of abnormal video events.
It should be understood that although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, these steps are not restricted to the exact order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily executed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Referring to fig. 5, in an embodiment, an unsupervised video anomaly detection apparatus 100 based on a mask self-encoder is further provided, and includes a video acquisition module 11, a target detection module 13, a mask operation module 15, a mask prediction module 17, a prediction loss module 19, and an anomaly scoring module 21. The video obtaining module 11 is configured to obtain monitoring video data. The target detection module 13 is configured to perform target detection on each frame of the monitored video data, and extract image blocks of a plurality of frames before and after the current frame and the adjacent frame at positions corresponding to the detected foreground target. The mask operation module 15 is configured to construct each cube according to the image block of each frame, and perform an interval mask operation on each cube in the time domain; the cubes include spatiotemporal cubes and optical flow cubes. The mask prediction module 17 is configured to predict, according to each space-time cube and each optical flow cube after the interval masking operation, a prediction block of a masked portion in each space-time cube and each optical flow cube by using a visual network respectively. The prediction loss module 19 is configured to calculate prediction losses of each spatio-temporal cube and each optical flow cube at a pixel level according to the prediction block. The abnormal scoring module 21 is used for calculating abnormal value scores of frames in the monitoring video data according to the prediction loss of each space-time cube and each optical flow cube; the abnormal value score is used for indicating the abnormal event detection result of the video frame.
The unsupervised video anomaly detection device 100 based on the mask self-encoder performs target detection, image block extraction and two cubes construction on the obtained monitoring video data through cooperation of all modules, performs masking operation by adopting an interval mask strategy in a time domain, predicts a masked part by using a visual network according to a visible part in a cube, further calculates the prediction loss between a prediction block and the masked part, and finally calculates the abnormal value score of each frame according to the prediction loss of all cubes to complete the detection of the abnormal event of the monitoring video data.
By adopting this scheme, high-quality video features and models can be learned by the temporal mask self-encoder without any labeling information, and high-performance end-to-end detection of abnormal video events is realized. Specifically, the scheme first uses target detection to locate the foreground so as to describe video events more accurately, i.e., space-time cubes are constructed; a mask is then applied in the time domain of each space-time cube to cover half of its blocks; the visual Transformer network is then trained to predict the masked portion of the space-time cube from its unmasked portion. Further, optical flow cubes are constructed using optical flow as an auxiliary motion cue, and the same operations as for the space-time cubes are applied.
Finally, because abnormal events are inherently rarer and more novel than normal events, the temporal mask makes them even harder to predict. The resulting difference in prediction error between normal and abnormal events allows the scheme to use the prediction loss directly as the anomaly score of a video event, which effectively achieves high-performance end-to-end detection of abnormal video events.
In one embodiment, a process for constructing a spatio-temporal cube from image blocks of frames includes:
scaling a plurality of image blocks with continuous time to a preset size and stacking the image blocks according to a time sequence to obtain a space-time cube; the spatio-temporal cube is used to describe a video event.
In one embodiment, the process of interval masking each spatio-temporal cube separately in the time domain includes:
for each space-time cube, performing image block level masking operation in a mode of image block interval masking in a time domain to obtain the space-time cube after masking operation; half of the image blocks in the space-time cube after the masking operation are visible, and the other half of the image blocks are masked.
In one embodiment, the process of constructing an optical flow cube from image blocks of frames includes:
respectively carrying out optical flow extraction on a plurality of image blocks with continuous time to obtain a plurality of optical flow blocks with continuous time;
scaling a plurality of optical flow blocks with continuous time to a preset size and stacking the optical flow blocks according to time sequence to obtain an optical flow cube; the optical flow cube is used for action description as one video event.
In one embodiment, the process of interval masking each optical flow cube separately in the time domain includes:
for each optical flow cube, performing optical flow block level mask operation on a time domain in an optical flow block interval mask mode to obtain an optical flow cube subjected to mask operation; half of the light flow blocks in the masked light flow cube are visible and the other half of the light flow blocks are masked.
In one embodiment, the process of predicting the prediction block of the masked part in each space-time cube by using the visual network according to each space-time cube after the interval masking operation includes:
for each space-time cube subjected to the interval mask operation, performing pre-operation on a visible part on the space-time cube and inputting the visible part into a visual network; the pre-operation comprises flattening a visible part into a one-dimensional vector, linearly projecting the one-dimensional vector to a set low-dimensional space, and adding a learnable position code;
and projecting the output vector of the visual network to the original dimension before the pre-operation and deforming the output vector into the size before the pre-operation to obtain a prediction block of the masked part in each space-time cube.
In one embodiment, the process of obtaining a prediction block of a masked portion in each optical flow cube by using a visual network prediction according to each optical flow cube after the interval masking operation includes:
for each optical flow cube subjected to the interval mask operation, performing pre-operation on a visible part on the optical flow cube and inputting the visible part into a visual network; the pre-operation comprises flattening a visible part into a one-dimensional vector, linearly projecting to a set low-dimensional space, and adding a learnable position code;
and projecting the output vector of the visual network to the original dimension before the pre-operation and deforming the output vector into the size before the pre-operation to obtain a prediction block of the masked part in each optical flow cube.
In one embodiment, the prediction loss of the spatio-temporal cube at the pixel level is calculated by the following formula:
MSE(C) = (1/N) · Σ (C_mask - C_pred)²
wherein C represents a space-time cube, C_mask represents the original pixel values of the masked portion of the space-time cube, C_pred represents the predicted pixel values of the masked portion of the space-time cube, the summation runs over all pixels of the masked portion, and N is the number of such pixels.
In one embodiment, the predicted loss of the optical flow cube at the pixel level is calculated by the following formula:
MSE(C^OF) = (1/N) · Σ (C^OF_mask - C^OF_pred)²
wherein C^OF represents an optical flow cube, C^OF_mask represents the original pixel values of the masked portion of the optical flow cube, C^OF_pred represents the predicted pixel values of the masked portion of the optical flow cube, the summation runs over all pixels of the masked portion, and N is the number of such pixels.
In an embodiment, the above-mentioned exception scoring module 21 may be specifically configured to implement the following processing functions:
calculating the overall prediction loss of the monitoring video data according to the prediction losses of all the space-time cubes and the optical flow cubes;
and taking the maximum value of the abnormal value scores of all cubes on the frame as the abnormal value score of the video frame according to the overall prediction loss for each frame of the monitoring video data.
In one embodiment, the overall prediction loss Score(C) is calculated by the following formula:
Score(C) = α · (MSE(C) - μ) / σ + β · (MSE(C^OF) - μ^OF) / σ^OF
wherein C represents a space-time cube, α and β are hyperparameters controlling the weights of the two parts of the prediction loss, MSE(C) represents the pixel-level prediction loss of the space-time cube, MSE(C^OF) represents the pixel-level prediction loss of the corresponding optical flow cube, μ and σ represent the mean and standard deviation of the MSE losses of all space-time cubes, and μ^OF and σ^OF represent the mean and standard deviation of the MSE losses of all optical flow cubes.
For the specific limitations of the unsupervised video anomaly detection apparatus 100 based on a mask self-encoder, reference may be made to the corresponding limitations of the unsupervised video anomaly detection method based on a mask self-encoder described above, which are not repeated here. The modules of the apparatus 100 may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded, in the form of hardware, in or independently of a device with data processing capability, or may be stored in the form of software in the memory of the device, so that a processor can invoke and execute the operations corresponding to the modules.
In still another aspect, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the following processing steps when executing the computer program: acquiring monitoring video data; respectively carrying out target detection on each frame of the monitoring video data, and extracting image blocks of a detected foreground target corresponding to a current frame and adjacent frames; constructing each cube according to the image block of each frame, and respectively performing interval mask operation on each cube in a time domain; the cube comprises a space-time cube and an optical flow cube; respectively utilizing visual network prediction to obtain prediction blocks of masked parts in each space-time cube and each optical flow cube according to each space-time cube and each optical flow cube after the interval masking operation; according to the prediction block, respectively calculating the prediction loss of each space-time cube and each optical flow cube at the pixel level; calculating abnormal value scores of all frames in the monitoring video data according to the prediction losses of all the space-time cubes and all the optical flow cubes; the abnormal value score is used for indicating the abnormal event detection result of the video frame.
In one embodiment, the processor when executing the computer program may further implement the additional steps or sub-steps in the embodiments of the unsupervised video anomaly detection method based on the masked self-encoder.
In yet another aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following processing steps: acquiring monitoring video data; respectively carrying out target detection on each frame of the monitoring video data, and extracting image blocks of a detected foreground target corresponding to a current frame and adjacent frames; constructing each cube according to the image block of each frame, and respectively performing interval mask operation on each cube in a time domain; the cube comprises a space-time cube and an optical flow cube; respectively utilizing visual network prediction to obtain prediction blocks of masked parts in each space-time cube and each optical flow cube according to each space-time cube and each optical flow cube after the interval masking operation; according to the prediction block, respectively calculating the prediction loss of each space-time cube and each optical flow cube at the pixel level; calculating abnormal value scores of all frames in the monitoring video data according to the prediction losses of all the space-time cubes and all the optical flow cubes; the abnormal value score is used for indicating the abnormal event detection result of the video frame.
In one embodiment, the computer program, when executed by the processor, may further implement the additional steps or sub-steps of the embodiments of the above-described mask auto-encoder based unsupervised video anomaly detection method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), and Direct Rambus DRAM (DRDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the spirit of the present application, and all of them fall within the scope of the present application. Therefore, the protection scope of the present patent should be subject to the appended claims.

Claims (14)

1. An unsupervised video anomaly detection method based on a mask self-encoder is characterized by comprising the following steps:
acquiring monitoring video data;
respectively carrying out target detection on each frame of the monitoring video data, and extracting image blocks of the detected foreground target corresponding to the current frame and adjacent frames;
constructing each cube according to the image block of each frame, and performing interval mask operation on each cube on a time domain; the cube comprises a spatiotemporal cube and an optical flow cube;
respectively utilizing a visual network to predict according to each space-time cube and each optical flow cube after the interval mask operation to obtain a prediction block of a masked part in each space-time cube and each optical flow cube;
according to the prediction block, respectively calculating the prediction loss of each space-time cube and each optical flow cube at the pixel level;
calculating abnormal value scores of all frames in the monitoring video data according to the prediction losses of the space-time cubes and the optical flow cubes; the abnormal value score is used for indicating the abnormal event detection result of the video frame.
2. The method of claim 1, wherein the process of constructing the space-time cube from the image blocks of each frame comprises:
scaling a plurality of temporally continuous image blocks to a preset size and stacking them in time order to obtain a space-time cube; the space-time cube is used to describe a video event.
3. The method of claim 2, wherein the step of performing the interval masking operation on each space-time cube in the time domain comprises:
for each space-time cube, performing an image-block-level masking operation in the time domain by masking image blocks at intervals, to obtain the masked space-time cube; after the masking operation, half of the image blocks in the space-time cube are visible and the other half are masked.
4. The method of any of claims 1 to 3, wherein the step of obtaining the prediction block of the masked portion of each spatiotemporal cube by using a visual network prediction according to each spatiotemporal cube after the interval masking operation comprises:
for each space-time cube after the interval masking operation, performing a pre-operation on the visible part of the space-time cube and inputting the result into the visual network; the pre-operation comprises flattening the visible part into a one-dimensional vector, linearly projecting the one-dimensional vector into a set low-dimensional space, and adding a learnable position code;
and projecting the output vector of the visual network back to the original dimension before the pre-operation and reshaping it to the size before the pre-operation, to obtain the prediction block of the masked part in each space-time cube.
5. The method of claim 4, wherein the prediction loss of the spatio-temporal cube at pixel level is calculated by the following formula:
$\mathrm{MSE}(C)=\dfrac{1}{\lvert C_{mask}\rvert}\left\lVert C_{mask}-C_{pred}\right\rVert_{2}^{2}$
wherein C represents one of the space-time cubes, $C_{mask}$ represents the original pixel values of the masked part of the space-time cube, $C_{pred}$ represents the predicted pixel values of the masked part of the space-time cube, and $\lvert C_{mask}\rvert$ denotes the number of masked pixels.
6. The method of claim 1, wherein the process of constructing the optical flow cube from the image blocks of each frame comprises:
respectively carrying out optical flow extraction on a plurality of time-continuous image blocks to obtain a plurality of time-continuous optical flow blocks;
scaling a plurality of temporally continuous optical flow blocks to a preset size and stacking them in time order to obtain an optical flow cube; the optical flow cube is used as the motion description of a video event.
7. The method of claim 6, wherein the step of performing the interval masking operation on each optical flow cube in the time domain comprises:
for each optical flow cube, performing an optical-flow-block-level masking operation in the time domain by masking optical flow blocks at intervals, to obtain the masked optical flow cube; after the masking operation, half of the optical flow blocks in the optical flow cube are visible and the other half are masked.
8. The method of claim 1, 6 or 7, wherein the process of obtaining the prediction block of the masked portion of each optical flow cube by using a visual network prediction according to each optical flow cube after the interval masking operation comprises:
for each optical flow cube after the interval masking operation, performing a pre-operation on the visible part of the optical flow cube and inputting the result into the visual network; the pre-operation comprises flattening the visible part into a one-dimensional vector, linearly projecting the one-dimensional vector into a set low-dimensional space, and adding a learnable position code;
and projecting the output vector of the visual network back to the original dimension before the pre-operation and reshaping it to the size before the pre-operation, to obtain the prediction block of the masked part in each optical flow cube.
9. The method of claim 8, wherein the prediction loss of the optical flow cube at pixel level is calculated by the following formula:
$\mathrm{MSE}(C^{o})=\dfrac{1}{\lvert C^{o}_{mask}\rvert}\left\lVert C^{o}_{mask}-C^{o}_{pred}\right\rVert_{2}^{2}$
wherein $C^{o}$ represents one of the optical flow cubes, $C^{o}_{mask}$ represents the original pixel values of the masked part of the optical flow cube, $C^{o}_{pred}$ represents the predicted pixel values of the masked part of the optical flow cube, and $\lvert C^{o}_{mask}\rvert$ denotes the number of masked pixels.
10. The method of claim 1, 5 or 9, wherein the step of calculating the abnormal value score of each frame in the monitoring video data based on the prediction losses of the space-time cubes and the optical flow cubes comprises:
calculating the overall prediction loss of the monitoring video data according to the prediction losses of all the space-time cubes and the optical flow cubes;
and for each frame of the monitoring video data, taking, according to the overall prediction loss, the maximum value of the abnormal value scores of all cubes on that frame as the abnormal value score of the video frame.
11. The method of claim 10, wherein the overall prediction loss is calculated by the following formula:
$S(C)=\alpha\cdot\dfrac{\mathrm{MSE}(C)-\mu}{\sigma}+\beta\cdot\dfrac{\mathrm{MSE}(C^{o})-\mu^{o}}{\sigma^{o}}$
wherein C represents one of the space-time cubes and $C^{o}$ the corresponding optical flow cube, $S(C)$ represents the overall prediction loss used as the abnormal value score of the cube, α and β represent hyper-parameters controlling the proportion of the two parts of the prediction loss, $\mathrm{MSE}(C)$ represents the prediction loss of the space-time cube at the pixel level, $\mathrm{MSE}(C^{o})$ represents the prediction loss of the optical flow cube at the pixel level, μ and σ respectively represent the mean and variance of the MSE loss of all space-time cubes, and $\mu^{o}$ and $\sigma^{o}$ respectively represent the mean and variance of the MSE loss of all optical flow cubes.
12. An unsupervised video anomaly detection apparatus based on a masked self-encoder, comprising:
the video acquisition module is used for acquiring monitoring video data;
the target detection module is used for respectively performing target detection on each frame of the monitoring video data, and extracting, at the corresponding position of each detected foreground target, image blocks from the current frame and a plurality of adjacent frames before and after it;
the mask operation module is used for constructing each cube according to the image blocks of each frame and respectively carrying out interval mask operation on each cube in a time domain; the cube comprises a spatiotemporal cube and an optical flow cube;
a mask prediction module, configured to predict, according to each space-time cube and each optical flow cube after an interval mask operation, a prediction block of a masked portion in each space-time cube and each optical flow cube by using a visual network respectively;
a prediction loss module, configured to calculate, according to the prediction block, prediction losses of the space-time cubes and the optical flow cubes at a pixel level, respectively;
the abnormal scoring module is used for calculating abnormal value scores of all frames in the monitoring video data according to the prediction loss of each space-time cube and each optical flow cube; the abnormal value score is used for indicating the abnormal event detection result of the video frame.
13. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method for unsupervised video anomaly detection based on a mask self-encoder according to any one of claims 1 to 11.
14. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method for unsupervised video anomaly detection based on a mask self-encoder according to any one of claims 1 to 11.
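As an illustrative aid to claims 2 and 6, the following is a minimal Python sketch of how the image blocks of temporally adjacent frames might be scaled to a preset size and stacked in time order into a space-time cube, and how an optical flow cube might be built from flows between consecutive blocks. The preset size, the detection box format, the use of OpenCV's Farneback dense flow, and all function names are assumptions of this sketch; the claims do not fix a particular optical flow extractor.

import cv2
import numpy as np

PRESET = (32, 32)   # preset block size (assumed)

def build_st_cube(frames, box):
    # Crop the same foreground region from consecutive frames, scale each block to the
    # preset size, and stack the blocks in time order into a (T, H, W, 3) space-time cube.
    x, y, w, h = box
    blocks = [cv2.resize(f[y:y + h, x:x + w], PRESET) for f in frames]
    return np.stack(blocks, axis=0)

def build_of_cube(frames, box):
    # Extract dense optical flow between consecutive scaled blocks and stack the flow
    # fields in time order into a (T - 1, H, W, 2) optical flow cube.
    x, y, w, h = box
    grays = [cv2.cvtColor(cv2.resize(f[y:y + h, x:x + w], PRESET), cv2.COLOR_BGR2GRAY)
             for f in frames]
    flows = [cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)
             for a, b in zip(grays[:-1], grays[1:])]
    return np.stack(flows, axis=0)

frames = [np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8) for _ in range(8)]
box = (100, 60, 48, 96)                       # hypothetical detection (x, y, w, h)
print(build_st_cube(frames, box).shape)       # (8, 32, 32, 3)
print(build_of_cube(frames, box).shape)       # (7, 32, 32, 2)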
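Claims 4 and 8 describe the pre-operation applied to the visible part of a cube before it enters the visual network, and the projection and reshaping applied to the network output. A minimal PyTorch sketch of that wrapping is given below; the Transformer encoder merely stands in for the otherwise unspecified visual network, the dimensions and module names are assumptions, and the way masked positions are recovered from the visible tokens is deliberately simplified to one output per visible block.

import torch
import torch.nn as nn

class MaskedBlockPredictor(nn.Module):
    # Pre-operation (flatten -> linear projection -> learnable position code), a stand-in
    # visual network, and the post-projection back to the original block size.
    def __init__(self, block_hw=(32, 32), t_visible=4, embed_dim=128):
        super().__init__()
        self.block_dim = block_hw[0] * block_hw[1]            # length of a flattened block
        self.to_embed = nn.Linear(self.block_dim, embed_dim)  # projection to a low-dimensional space
        self.pos_code = nn.Parameter(torch.zeros(t_visible, embed_dim))  # learnable position code
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)       # stand-in visual network
        self.to_pixels = nn.Linear(embed_dim, self.block_dim)            # back to the original dimension

    def forward(self, visible_blocks):
        # visible_blocks: (batch, t_visible, H, W); flatten each block into a one-dimensional vector.
        b, t, h, w = visible_blocks.shape
        x = visible_blocks.reshape(b, t, h * w)
        x = self.to_embed(x) + self.pos_code       # linear projection plus position code
        x = self.backbone(x)
        x = self.to_pixels(x)                      # project the output vector to the original dimension
        return x.reshape(b, t, h, w)               # reshape to the size before the pre-operation

pred = MaskedBlockPredictor()(torch.randn(2, 4, 32, 32))
print(pred.shape)    # torch.Size([2, 4, 32, 32]) - one predicted block per visible block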
CN202210249993.0A 2022-03-14 2022-03-14 Method and device for unsupervised video anomaly detection based on mask self-encoder Pending CN114724060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210249993.0A CN114724060A (en) 2022-03-14 2022-03-14 Method and device for unsupervised video anomaly detection based on mask self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210249993.0A CN114724060A (en) 2022-03-14 2022-03-14 Method and device for unsupervised video anomaly detection based on mask self-encoder

Publications (1)

Publication Number Publication Date
CN114724060A true CN114724060A (en) 2022-07-08

Family

ID=82236803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210249993.0A Pending CN114724060A (en) 2022-03-14 2022-03-14 Method and device for unsupervised video anomaly detection based on mask self-encoder

Country Status (1)

Country Link
CN (1) CN114724060A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882076A (en) * 2022-07-11 2022-08-09 中国人民解放军国防科技大学 Lightweight video object segmentation method based on big data memory storage
CN115410116A (en) * 2022-08-09 2022-11-29 佳源科技股份有限公司 Multitask video anomaly detection method, device, equipment and medium
CN116092577A (en) * 2023-01-09 2023-05-09 中国海洋大学 Protein function prediction method based on multisource heterogeneous information aggregation
CN116092577B (en) * 2023-01-09 2024-01-05 中国海洋大学 Protein function prediction method based on multisource heterogeneous information aggregation
CN115866247A (en) * 2023-03-02 2023-03-28 中南大学 Video coding intra-frame prediction method and system based on MAE pre-training model
CN115866247B (en) * 2023-03-02 2023-05-09 中南大学 Video coding intra-frame prediction method and system based on MAE pre-training model
CN115965899A (en) * 2023-03-16 2023-04-14 山东省凯麟环保设备股份有限公司 Unmanned sweeping robot vehicle abnormality detection method and system based on video segmentation

Similar Documents

Publication Publication Date Title
CN114724060A (en) Method and device for unsupervised video anomaly detection based on mask self-encoder
CN108256562B (en) Salient target detection method and system based on weak supervision time-space cascade neural network
CN104268594B (en) A kind of video accident detection method and device
CN110889375B (en) Hidden-double-flow cooperative learning network and method for behavior recognition
CN108846852B (en) Monitoring video abnormal event detection method based on multiple examples and time sequence
CN113378775B (en) Video shadow detection and elimination method based on deep learning
CN113128360A (en) Driver driving behavior detection and identification method based on deep learning
CN113011322A (en) Detection model training method and detection method for specific abnormal behaviors of monitoring video
Wang et al. Online detection of abnormal events in video streams
CN116343265A (en) Full-supervision video pedestrian re-identification method, system, equipment and medium
CN113807227B (en) Safety monitoring method, device, equipment and storage medium based on image recognition
Qin et al. Application of video scene semantic recognition technology in smart video
Oraibi et al. Enhancement digital forensic approach for inter-frame video forgery detection using a deep learning technique
CN113486754A (en) Event evolution prediction method and system based on video
CN117636470A (en) Event camera gait recognition method based on hypergraph model
CN117494575A (en) State sensing method and device based on artificial intelligent transducer and storage medium
CN112991239A (en) Image reverse recovery method based on deep learning
Anees et al. Deep learning framework for density estimation of crowd videos
Singh et al. Performance analysis of ELA-CNN model for image forgery detection
Ramachandra Causal inference for climate change events from satellite image time series using computer vision and deep learning
CN115965978A (en) Unsupervised training method of character recognition model and related equipment
CN115082820A (en) Method and device for detecting global abnormal behaviors of group in surveillance video
JP2024516642A (en) Behavior detection method, electronic device and computer-readable storage medium
CN114495037A (en) Video prediction method and system based on key points and Kalman filtering
CN116170638B (en) Self-attention video stream compression method and system for online action detection task

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination