CN116798118A - Abnormal behavior detection method based on TPH-yolov5 - Google Patents


Info

Publication number
CN116798118A
CN116798118A
Authority
CN
China
Prior art keywords
frame, list, tph, frames, yolov5
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310513769.2A
Other languages
Chinese (zh)
Inventor
徐雄
赵文彬
李思奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Original Assignee
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 10 Research Institute filed Critical CETC 10 Research Institute
Priority to CN202310513769.2A priority Critical patent/CN116798118A/en
Publication of CN116798118A publication Critical patent/CN116798118A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an abnormal behavior detection method based on TPH-yolov5, which comprises the following steps: inputting a test image into a trained TPH-yolov5 network model to identify abnormal behaviors, where the abnormal behaviors include smoking, illegal intrusion, illegal photographing, illegally carrying a computer, and illegally making phone calls. The invention improves the accuracy of abnormal behavior detection.

Description

Abnormal behavior detection method based on TPH-yolov5
Technical Field
The invention relates to the technical field of target detection, and in particular to a method for detecting abnormal behaviors based on TPH-yolov5.
Background
With the development of computer and artificial intelligence technology, abnormal behavior detection is increasingly important in the fields of economics, management, and public safety. Dynamic analysis of group behavior is a significant concern in video surveillance. An "abnormal event/behavior" is a sudden, irregular, attention-drawing event, as opposed to the more frequently occurring routine events. For safety reasons, there is growing interest in detecting anomalies in crowd behavior in public places such as crowded streets, shopping malls, theatres, airports, and train stations. However, because abnormal behavior targets are mixed in with normal targets, missed detections and false detections occur easily, reducing detection precision. Therefore, identifying and detecting abnormal behavior targets in public scenes with high precision remains one of the difficulties faced at present.
Disclosure of Invention
In view of the above, the present invention provides a method for detecting abnormal behavior based on TPH-yolov5 to solve the above-mentioned technical problems.
The invention discloses an abnormal behavior detection method based on TPH-yolov5, which comprises the following steps:
inputting the test image into a trained TPH-yolov5 network model to identify abnormal behaviors; the abnormal behaviors comprise smoking, illegal intrusion, illegal photographing, illegally carrying a computer, and illegally making phone calls.
Further, in the TPH-yolov5 network model:
on the basis of the yolov5 network, CSPDarknet53 is selected as the backbone, and three Transformer prediction heads are added at the end;
the neck uses a PANet-like structure; the four Transformer prediction heads output by the neck use the feature maps of the neck transformer encoder blocks, and the output feature maps serve as the TPH prediction heads, corresponding respectively to targets of different sizes;
the head is responsible for detecting the position and class of the target from the feature maps extracted by the backbone network;
the CBAM infers attention maps along two independent dimensions of the feature map, channel and spatial, then multiplies the attention maps with the input feature map to perform adaptive feature refinement on it.
Further, the method further comprises the following steps:
performing a multi-scale (ms) test strategy on a single trained TPH-yolov5 network model, namely: scaling the original image to be tested at different ratios and flipping it horizontally to obtain L images;
inputting the L images and the original image to be tested into the different trained TPH-yolov5 network models respectively, and predicting with non-maximum suppression to fuse the test results; the accuracy of detecting the abnormal behavior in the original image to be tested is obtained after weighted box fusion.
Further, non-maximum suppression, soft NMS, or weighted box fusion is adopted to integrate a plurality of different trained TPH-yolov5 network models;
when non-maximum suppression is adopted, bounding boxes whose intersection over union is above a preset threshold are considered to belong to the same object; for each object, the non-maximum suppression method keeps only the bounding box with the highest confidence and deletes the other bounding boxes;
when soft non-maximum suppression is adopted, a decay function based on the intersection over union is applied to the confidence of the neighbouring bounding boxes, so that their confidence scores are not simply set to zero and deleted;
when weighted box fusion is adopted, all boxes are merged to form the final result; the TPH-yolov5 network model outputs 4 feature maps of different scales in total, used for detecting objects of different scales.
Further, the step of merging the weighted frames comprises the following steps:
step 1: each prediction box of each trained TPH-yolov5 network model is added to a single list B; list B is sorted in descending order of the confidence C of each prediction box and then divided into box clusters and fused boxes according to the value interval of the confidence C; list L stores the box clusters and list F stores the fused boxes; each position in list L contains a group of boxes, or a single box, forming a cluster; each position in list F contains only one box, which is the fused box of the corresponding box cluster in list L;
step 2: loop over the prediction boxes in list B and search for a matching box in list F;
step 3: if a matching box is found, add the prediction box to the position pos in list L corresponding to the matching box in list F; recalculate the coordinates and confidence score of the box in F[pos] using all T boxes accumulated in the box cluster L[pos];
step 4: repeat step 2 and step 3 until all prediction boxes in list B have been processed, and rescale the confidence scores in list F according to the number of boxes in each cluster and the number N of trained TPH-yolov5 network models.
Further, the step 3 further includes:
if no matching box is found, adding the prediction box in the list B to the tail of the list L and the list F as a new entry; the next prediction box in list B is continued to be traversed.
Further, the step 4 further includes:
if the number of boxes in a cluster is small, the confidence score needs to be reduced by one of the following formulas:
C = C × min(T, N) / N
or
C = C × T / N
where C is the confidence score, N is the number of trained TPH-yolov5 network models, and T is the number of boxes in F[pos].
Further, in step 3:
all the boxes accumulated in the box cluster L[pos] are fused using the following fusion formulas:
C = (1/T) Σ_{i=1}^{T} C_i
X1,2 = (Σ_{i=1}^{T} C_i · X1,2_i) / (Σ_{i=1}^{T} C_i)
Y1,2 = (Σ_{i=1}^{T} C_i · Y1,2_i) / (Σ_{i=1}^{T} C_i)
where C is the confidence score, N is the number of TPH-yolov5 network models, T is the number of boxes in L[pos], and (X1,2) and (Y1,2) are the coordinates of the two diagonal vertices of the corresponding box in F[pos];
all the boxes accumulated in the box cluster L[pos] are fused to obtain a fused box whose coordinates are the weighted sum of the coordinates of the T boxes, the weights being the confidence scores of the corresponding boxes.
Further, the method further comprises the following steps:
inputting the boxes output by the TPH heads into the weighted-box-fusion model for processing, and performing visual bad-case analysis on the results;
after the visual bad-case analysis of the results, the trained TPH-yolov5 network model can be optimized with an additional self-training classifier;
the output of the self-training classifier is the class of the abnormal behavior and the prediction box.
Further, image patches cropped from the training set are used as the classification training set, and ResNet18 is selected as the classifier network to obtain the self-training classifier.
Due to the adoption of the above technical scheme, the invention has the following advantages: to better test the model effect, the method adds a multi-scale test (ms test) and a multi-model integration strategy to the inference process to obtain more convincing detection results; in addition, by visualizing bad cases, the proposed architecture is found to have good localization capability but poor classification capability, so a self-training classifier is provided, with image patches cropped from the training data used as the classification training set, which finally improves the accuracy of abnormal behavior detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description cover only some embodiments of the present invention, and those skilled in the art may obtain other drawings from them.
FIG. 1 is a block diagram of a CBAM according to an embodiment of the present invention;
FIG. 2 is a block diagram of a self-training classifier in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of a TPH-yolov5 network model according to an embodiment of the invention;
fig. 4 is a flow chart of a method for detecting abnormal behavior based on TPH-yolov5 according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and examples. It is apparent that the examples described are only some, but not all, of the examples of the present invention. All other embodiments obtained by those skilled in the art on this basis are intended to fall within the scope of the embodiments of the present invention.
The method relates to data acquisition and expansion, multi-model integration, a target detection module, a ms reasoning module and a self-training classifier module.
The method comprises the following steps:
1) Preparing the data set: constructing an abnormal behavior data set and dividing it into a training set and a test set at a ratio of 7:3;
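The 7:3 split above can be sketched as a simple shuffled partition (a minimal illustration; the function name and the fixed seed are assumptions, not part of the patent):

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    """Shuffle a list of samples and split it into train/test at train_ratio."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(3000)))
```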
2) On the basis of the yolov5 network, Transformer prediction heads are added to detect objects of different sizes, and the original prediction heads are then replaced with Transformer prediction heads to explore the prediction potential of the self-attention mechanism. A Convolutional Block Attention Module (CBAM) is also integrated to find attention regions in scenes with dense objects. Several useful strategies are applied as well, such as data expansion, multi-scale testing, multi-model integration, and an additional classifier to enhance the accuracy of the model. The TPH-yolov5 network model has four prediction heads for detecting tiny, small, medium, and large objects respectively. To better test the model effect, a multi-scale test (ms test) and a multi-model integration strategy are added to the inference process to obtain more convincing detection results. In addition, by visualizing bad cases, the proposed architecture is found to have good localization capability but poor classification capability, especially on some similar classes; to solve this problem, a self-training classifier (ResNet18) is provided, using image patches cropped from the training data as the classification training set.
3) The training set after data enhancement is input into the network: the backbone selects CSPDarknet53 with three Transformer prediction heads added at the end, and the neck selects a PANet-like structure to generate feature maps; the predictions are selected through NMS and WBF, input into the self-training classifier, and finally the class and prediction box are generated.
Specifically, referring to fig. 4, the method mainly includes:
1. The data acquisition and expansion module:
Data acquisition: 3000 pictures are obtained from a self-made data set and website searches, of which 2700 contain abnormal behaviors such as smoking, illegal intrusion, illegal photographing, illegally carrying a computer, and illegally making phone calls. 80% of the images in the data set are chosen as training samples and 20% as test samples. The images are annotated with the annotation software LabelImg, and the annotations are stored as XML files in the Pascal VOC format used by ImageNet.
Data expansion: data expansion effectively enlarges the data set, giving the model higher robustness to images acquired in different environments. Photometric and geometric distortions are widely used. For photometric distortion, the hue, saturation, and value of the image are adjusted. For geometric distortion, random scaling, cropping, translation, shearing, and rotation are added. Beyond these global pixel enhancement methods, there are some more distinctive data enhancement methods that combine multiple images, namely MixUp, CutMix, and Mosaic.
MixUp randomly selects two samples from the training images and performs a random weighted sum, with the labels weighted correspondingly. Assume two samples (x_i, y_i) and (x_j, y_j); then:
x̃ = λ·x_i + (1 − λ)·x_j
ỹ = λ·y_i + (1 − λ)·y_j
where x is the input vector, y is the one-hot encoding of the label, and λ ∈ [0, 1] is a probability value with λ ~ Beta(α, α), i.e., λ obeys the Beta distribution with parameter α.
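The MixUp blending just described can be sketched in a few lines of numpy (a hedged illustration; the α default and the function signature are assumptions):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """MixUp: blend two samples and their one-hot labels with lambda ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)           # lambda in [0, 1]
    x = lam * x_i + (1.0 - lam) * x_j      # weighted sum of the inputs
    y = lam * y_i + (1.0 - lam) * y_j      # same weights for the one-hot labels
    return x, y, lam
```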
Unlike occlusion methods, which normally occlude part of an image with zero-pixel "black cloth", CutMix covers the occluded region with a region of another image, i.e., it crops a random rectangular region from one picture onto another to generate a new picture. The label processing is the same as in MixUp: the proportions in the new mixed label are determined by the proportions of the two original samples in the new sample.
Mosaic is an improved version of CutMix. Mosaic stitches four images together, greatly enriching the background of the detected objects. In addition, batch normalization computes activation statistics over the 4 different images at each layer.
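CutMix as described above can be sketched as follows (a minimal numpy illustration; the Beta(1, 1) sampling and the clipping details are assumptions):

```python
import numpy as np

def cutmix(img_a, label_a, img_b, label_b, rng=None):
    """CutMix: paste a random rectangle of img_b onto img_a; mix labels by area ratio."""
    rng = rng or np.random.default_rng(0)
    h, w = img_a.shape[:2]
    lam = rng.beta(1.0, 1.0)                                   # target share of img_a
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))  # rectangle centre
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)            # actual share after clipping
    return mixed, lam_adj * label_a + (1.0 - lam_adj) * label_b
```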
In the TPH-yolov5 network model, mixUp, mosaic, and traditional methods are used to enhance data. The TPH-yolov5 network model is shown in fig. 3.
2. The multi-model integration method comprises the following steps:
Deep learning neural networks are a nonlinear method. They offer great flexibility and can scale with the amount of training data. One disadvantage of this flexibility is that they learn via stochastic training algorithms, which makes them sensitive to the specifics of the training data: each training run may find a different set of weights and hence produce different predictions, giving neural networks a high variance. One successful approach to reducing this variance is to train multiple models rather than a single model and combine their predictions. Boxes from different object detection models can be integrated in three different ways: non-maximum suppression (NMS), Soft NMS, and Weighted Box Fusion (WBF).
In the NMS method, boxes are considered to belong to the same object if their overlap, the Intersection over Union (IoU), is above a certain threshold. For each object, NMS retains only the bounding box with the highest confidence and deletes the others. The box filtering process therefore depends on the choice of this single IoU threshold, which has a large impact on model performance.
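A minimal sketch of the NMS procedure just described (numpy, boxes in [x1, y1, x2, y2] format; the 0.5 default threshold is an assumption):

```python
import numpy as np

def iou_one_many(box, boxes):
    """IoU of one box against an (N, 4) array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box, drop neighbours whose IoU exceeds iou_thr; repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou_one_many(boxes[i], boxes[rest]) <= iou_thr]
    return keep
```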
Soft NMS makes a minor change to NMS: a decay function is set for the confidence of neighbouring bounding boxes based on the IoU value, rather than setting their confidence scores entirely to zero and deleting them:
s_i = s_i, if IoU(M, b_i) < N_t
s_i = s_i · (1 − IoU(M, b_i)), if IoU(M, b_i) ≥ N_t
where s_i is the confidence score of box b_i, M is the current highest-scoring box, and N_t is a set threshold.
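Soft NMS's linear decay can be sketched as follows (a hedged numpy illustration; the default thresholds and the final score pruning are assumptions):

```python
import numpy as np

def soft_nms(boxes, scores, iou_thr=0.3, score_thr=0.001):
    """Linear Soft-NMS sketch: instead of deleting neighbours of the current
    highest-scoring box, decay their confidence by (1 - IoU)."""
    boxes = boxes.astype(float)
    scores = scores.astype(float).copy()

    def iou_one_many(b, bs):
        x1 = np.maximum(b[0], bs[:, 0]); y1 = np.maximum(b[1], bs[:, 1])
        x2 = np.minimum(b[2], bs[:, 2]); y2 = np.minimum(b[3], bs[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = (b[2] - b[0]) * (b[3] - b[1])
        areas = (bs[:, 2] - bs[:, 0]) * (bs[:, 3] - bs[:, 1])
        return inter / (area + areas - inter)

    keep, idx = [], np.arange(len(scores))
    while idx.size:
        m = idx[np.argmax(scores[idx])]
        keep.append(int(m))
        idx = idx[idx != m]
        if idx.size == 0:
            break
        ious = iou_one_many(boxes[m], boxes[idx])
        scores[idx] *= np.where(ious > iou_thr, 1.0 - ious, 1.0)  # linear decay
        idx = idx[scores[idx] > score_thr]                        # prune near-zero scores
    return keep, scores
```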
Both NMS and Soft NMS discard some boxes, while WBF merges all boxes to form the final result. The TPH-yolov5 network model outputs 4 feature maps of different scales in total, used for detecting objects of different scales. The WBF merging steps are as follows:
(1) Each prediction box of each trained TPH-yolov5 network model is added to a single list B. List B is sorted in descending order of the confidence C of each prediction box and divided into box clusters and fused boxes according to the value interval of the confidence C; list L stores the box clusters and list F stores the fused boxes; each position in list L contains a group of boxes, or a single box, forming a cluster; each position in list F contains only one box, which is the fused box of the corresponding box cluster in list L.
(2) Loop over the prediction boxes in B, attempting to find a matching box in list F. A match is defined as a box with a large overlap with the box in question (IoU > THR). The optimal threshold is near THR = 0.55.
(3) If no match is found, the box from list B is added to the end of lists L and F as a new entry; proceed to the next box in list B.
(4) If a match is found, this box is added to list L at the position pos corresponding to the matching box in list F.
(5) The coordinates and confidence score of the box in F[pos] are recalculated using all T boxes accumulated in the box cluster L[pos], with the following fusion formulas:
C = (1/T) Σ_{i=1}^{T} C_i
X1,2 = (Σ_{i=1}^{T} C_i · X1,2_i) / (Σ_{i=1}^{T} C_i)
Y1,2 = (Σ_{i=1}^{T} C_i · Y1,2_i) / (Σ_{i=1}^{T} C_i)
where C is the confidence score, N is the number of TPH-yolov5 network models, T is the number of boxes in L[pos], and (X1,2) and (Y1,2) are the coordinates of the two diagonal vertices of the corresponding box in F[pos]; the coordinates of the fused box are the weighted sum of the coordinates of the T boxes, the weights being the confidence scores of the corresponding boxes.
(6) After all boxes in B have been processed, rescale the confidence scores in the F list by the number of boxes in each cluster and the number N of trained TPH-yolov5 network models. If the number of boxes in a cluster is small, it may mean that only a few models predicted it, and the confidence score for this case needs to be reduced as follows:
C = C × min(T, N) / N
or
C = C × T / N
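The WBF merging steps above can be sketched end-to-end as follows (a simplified single-class numpy illustration; the greedy matching order and data layout are assumptions, and real implementations handle classes and ties more carefully):

```python
import numpy as np

def weighted_box_fusion(model_boxes, model_scores, iou_thr=0.55, n_models=None):
    """WBF sketch: cluster overlapping boxes across models, average coordinates
    weighted by confidence, then rescale confidences by cluster support."""
    n_models = n_models or len(model_boxes)
    all_boxes = np.concatenate(model_boxes)      # list B, flattened
    all_scores = np.concatenate(model_scores)
    order = np.argsort(all_scores)[::-1]         # descending confidence
    clusters, fused = [], []                     # lists L and F

    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    for i in order:
        box, score = all_boxes[i], all_scores[i]
        for pos, f in enumerate(fused):
            if iou(box, f[:4]) > iou_thr:                    # match found in F
                clusters[pos].append((box, score))
                bs = np.array([b for b, _ in clusters[pos]])
                cs = np.array([c for _, c in clusters[pos]])
                coords = (bs * cs[:, None]).sum(0) / cs.sum()  # confidence-weighted
                fused[pos] = np.append(coords, cs.mean())
                break
        else:                                                # no match: new entry
            clusters.append([(box, score)])
            fused.append(np.append(box, score))
    for pos, f in enumerate(fused):                          # rescale: C *= min(T, N) / N
        f[4] *= min(len(clusters[pos]), n_models) / n_models
    return np.array(fused)
```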
3. The target detection module:
(1)backbone:
Commonly used backbones include VGG, ResNet, and CSPDarknet53, which have proven to have powerful feature extraction capabilities in classification and other problems. However, the backbone structure also needs fine-tuning to make it more suitable for the specific task. The algorithm selects CSPDarknet53 as the backbone and adds three Transformer prediction heads at the end.
(2)Neck:
The neck is designed to better exploit the features extracted by the backbone. It reprocesses and reasonably uses the feature maps extracted by the backbone at different stages. The neck consists of several bottom-up paths and several top-down paths. The algorithm uses a PANet-like structure for the neck, and the four Transformer prediction heads output by the neck use the feature maps of the neck transformer encoder blocks.
(3)Head:
The head is responsible for detecting the position and class of the object from the feature maps extracted by the backbone network. The feature maps output by the four Transformer prediction heads at the neck serve as the TPH prediction heads, corresponding respectively to targets of different sizes.
(4)CBAM:
CBAM is a simple and effective attention module for convolutional neural networks; its architecture is shown in fig. 1. For any intermediate feature map of a convolutional neural network, the CBAM infers attention maps along two independent dimensions of the feature map, channel and spatial, then multiplies the attention maps with the input feature map to perform adaptive feature refinement on it.
Given an intermediate feature map F ∈ R^{C×H×W} as input, the operation of the CBAM is divided into two parts. First, global max pooling and average pooling are applied to the input along the channels; the two pooled one-dimensional vectors are fed through a fully connected layer and summed to produce the one-dimensional channel attention M_c ∈ R^{C×1×1}, which is multiplied element-wise with the input to obtain the channel-refined feature map F′. Second, global max pooling and mean pooling are applied to F′ along the spatial dimensions; the two-dimensional maps produced by pooling are concatenated and passed through a convolution to produce the two-dimensional spatial attention M_s ∈ R^{1×H×W}, which is multiplied element-wise with F′. The attention process generated by the CBAM can be described as:
F′ = M_c(F) ⊗ F
F″ = M_s(F′) ⊗ F′
where ⊗ denotes element-wise multiplication; before multiplication, the channel attention and the spatial attention are broadcast along the spatial and channel dimensions, respectively.
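As a toy numerical sketch of CBAM's two attention steps (not the real CBAM: the shared MLP is collapsed to a single matrix `w_c` and the 7×7 convolution is replaced by a scalar gain `w_s`, purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feat, w_c, w_s):
    """Toy CBAM on a (C, H, W) feature map: channel attention from pooled
    descriptors through a single linear layer, then spatial attention from
    channel-pooled maps through a scalar gain stand-in for the conv."""
    c, h, w = feat.shape
    # channel attention: max- and mean-pool over space, shared layer, sum, sigmoid
    mx = feat.reshape(c, -1).max(1)
    av = feat.reshape(c, -1).mean(1)
    m_c = sigmoid(w_c @ mx + w_c @ av).reshape(c, 1, 1)
    f1 = feat * m_c                               # broadcast over H, W
    # spatial attention: max- and mean-pool over channels, gain, sigmoid
    sp = np.stack([f1.max(0), f1.mean(0)])        # (2, H, W)
    m_s = sigmoid(sp.mean(0) * w_s).reshape(1, h, w)
    return f1 * m_s                               # broadcast over C
```

Because both attention maps pass through a sigmoid, the refined features are always attenuated copies of the input, which matches the "adaptive feature refinement" role described above.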
4. The ms test module:
five different TPH-yolov5 network models were trained according to different perspectives of model integration. In the reasoning phase, the ms test strategy is first performed on a single model. The implementation details of the ms test include the following three steps.
(1) The test image was scaled to 1.3 times.
(2) The image was reduced to 1, 0.83 and 0.67 times, respectively.
(3) The image is flipped horizontally.
Finally, the six images of different proportions are input into the trained TPH-yolov5 network model, and NMS fusion is used for prediction. The same ms test operation is performed on the different trained TPH-yolov5 network models, and the final five predictions are fused by WBF to obtain the final result.
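The scale-and-flip variants used by the ms test can be sketched with nearest-neighbour resampling (an illustration; real pipelines use proper interpolation, and exactly how the original image is counted among the inputs is left as in the text):

```python
import numpy as np

def ms_test_variants(image, scales=(1.3, 1.0, 0.83, 0.67), flip=True):
    """Multi-scale test-time augmentation sketch: nearest-neighbour rescale to
    each factor, plus a horizontal flip of the original image."""
    h, w = image.shape[:2]
    variants = []
    for s in scales:
        nh, nw = max(1, int(h * s)), max(1, int(w * s))
        rows = np.arange(nh) * h // nh            # nearest source row per target row
        cols = np.arange(nw) * w // nw
        variants.append(image[rows][:, cols])
    if flip:
        variants.append(image[:, ::-1])           # horizontal flip
    return variants
```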
5. Self-training classifier module:
The data set is trained with the TPH-yolov5 network model, and visualizing the bad-case analysis results leads to a conclusion: the TPH-yolov5 network model has good localization capability but poor classification capability. Further investigation shows that the accuracy of certain categories is very low. To address this problem, an additional self-training classifier may be used. First, a training set is constructed by cropping the ground-truth bounding boxes and resizing each image patch to 64×64. ResNet18 is then selected as the classifier network.
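Building the 64×64 classification patches from the ground-truth boxes can be sketched as follows (nearest-neighbour resizing as a stand-in for a proper resize; the function name is an assumption):

```python
import numpy as np

def crop_patches(image, boxes, size=64):
    """Crop each ground-truth box [x1, y1, x2, y2] from the image and resize it
    to size x size with nearest-neighbour sampling, for classifier training."""
    patches = []
    for x1, y1, x2, y2 in boxes:
        patch = image[y1:y2, x1:x2]
        ph, pw = patch.shape[:2]
        rows = np.arange(size) * ph // size       # nearest source row per target row
        cols = np.arange(size) * pw // size
        patches.append(patch[rows][:, cols])
    return patches
```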
Referring to fig. 2, with the help of this self-training classifier, the method improves the AP value by about 0.8% to 1.0%.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the specific embodiments of the invention without departing from its spirit and scope, and such modifications are intended to be covered by the claims.

Claims (10)

1. The abnormal behavior detection method based on TPH-yolov5 is characterized by comprising the following steps:
inputting the test image into a trained TPH-yolov5 network model to identify abnormal behaviors; the abnormal behaviors comprise smoking, illegal intrusion, illegal photographing, illegally carrying a computer, and illegally making phone calls.
2. The method of claim 1, wherein in the TPH-yolov5 network model:
on the basis of the yolov5 network, CSPDarknet53 is selected as the backbone, and three Transformer prediction heads are added at the end;
the neck uses a PANet-like structure; the four Transformer prediction heads output by the neck use the feature maps of the neck transformer encoder blocks, and the output feature maps serve as the TPH prediction heads, corresponding respectively to targets of different sizes;
the head is responsible for detecting the position and class of the target from the feature maps extracted by the backbone network;
the CBAM infers attention maps along two independent dimensions of the feature map, channel and spatial, then multiplies the attention maps with the input feature map to perform adaptive feature refinement on it.
3. The method as recited in claim 1, further comprising:
performing a multi-scale (ms) test strategy on a single trained TPH-yolov5 network model, namely: scaling the original image to be tested at different ratios and flipping it horizontally to obtain L images;
inputting the L images and the original image to be tested into the different trained TPH-yolov5 network models respectively, and predicting with non-maximum suppression to fuse the test results; the accuracy of detecting the abnormal behavior in the original image to be tested is obtained after weighted box fusion.
4. A method according to claim 3, wherein non-maximum suppression, soft NMS, or weighted box fusion is adopted to integrate a plurality of different trained TPH-yolov5 network models;
when non-maximum suppression is adopted, bounding boxes whose intersection over union is above a preset threshold are considered to belong to the same object; for each object, the non-maximum suppression method keeps only the bounding box with the highest confidence and deletes the other bounding boxes;
when soft non-maximum suppression is adopted, a decay function based on the intersection over union is applied to the confidence of the neighbouring bounding boxes, so that their confidence scores are not simply set to zero and deleted;
when weighted box fusion is adopted, all boxes are merged to form the final result; the TPH-yolov5 network model outputs 4 feature maps of different scales in total, used for detecting objects of different scales.
5. A method according to claim 3, wherein the step of combining the weighted frame fusion is:
step 1: each prediction box of each trained TPH-yolov5 network model is added to a single list B; list B is sorted in descending order of the confidence C of each prediction box and then divided into box clusters and fused boxes according to the value interval of the confidence C; list L stores the box clusters and list F stores the fused boxes; each position in list L contains a group of boxes, or a single box, forming a cluster; each position in list F contains only one box, which is the fused box of the corresponding box cluster in list L;
step 2: loop over the prediction boxes in list B and search for a matching box in list F;
step 3: if a matching box is found, add the prediction box to the position pos in list L corresponding to the matching box in list F; recalculate the coordinates and confidence score of the box in F[pos] using all T boxes accumulated in the box cluster L[pos];
step 4: repeat step 2 and step 3 until all prediction boxes in list B have been processed, and rescale the confidence scores in list F according to the number of boxes in each cluster and the number N of trained TPH-yolov5 network models.
6. The method of claim 5, wherein step 3 further comprises:
if no matching box is found, the prediction box from list B is added to the end of lists L and F as a new entry; traversal continues with the next prediction box in list B.
7. The method of claim 5, wherein step 4 further comprises:
if the number of boxes in a cluster is small, the confidence score needs to be reduced by one of the following formulas:
C = C × min(T, N) / N
or
C = C × T / N
where C is the confidence score, N is the number of trained TPH-yolov5 network models, and T is the number of boxes in F[pos].
8. The method according to claim 5, characterized in that in step 3:
all the boxes accumulated in the box cluster L[pos] are fused using the following fusion formulas:
C = (1/T) Σ_{i=1}^{T} C_i
X1,2 = (Σ_{i=1}^{T} C_i · X1,2_i) / (Σ_{i=1}^{T} C_i)
Y1,2 = (Σ_{i=1}^{T} C_i · Y1,2_i) / (Σ_{i=1}^{T} C_i)
where C is the confidence score, N is the number of TPH-yolov5 network models, T is the number of boxes in L[pos], and (X1,2) and (Y1,2) are the coordinates of the two diagonal vertices of the corresponding box in F[pos];
all the boxes accumulated in the box cluster L[pos] are fused to obtain a fused box whose coordinates are the weighted sum of the coordinates of the T boxes, the weights being the confidence scores of the corresponding boxes.
9. The method as recited in claim 1, further comprising:
inputting the boxes output by the TPH heads into the weighted-box-fusion model for processing, and performing visual bad-case analysis on the results;
after the visual bad-case analysis of the results, the trained TPH-yolov5 network model can be optimized with an additional self-training classifier;
the output of the self-training classifier is the class of the abnormal behavior and the prediction box.
10. The method of claim 9, wherein the self-training classifier is obtained by using image patches cropped from the training set as the classification training set and selecting ResNet18 as the classifier network.
CN202310513769.2A 2023-05-08 2023-05-08 Abnormal behavior detection method based on TPH-yolov5 Pending CN116798118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310513769.2A CN116798118A (en) 2023-05-08 2023-05-08 Abnormal behavior detection method based on TPH-yolov5


Publications (1)

Publication Number Publication Date
CN116798118A true CN116798118A (en) 2023-09-22

Family

ID=88047104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310513769.2A Pending CN116798118A (en) 2023-05-08 2023-05-08 Abnormal behavior detection method based on TPH-yolov5

Country Status (1)

Country Link
CN (1) CN116798118A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117456610A (en) * 2023-12-21 2024-01-26 浪潮软件科技有限公司 Climbing abnormal behavior detection method and system and electronic equipment
CN117456610B (en) * 2023-12-21 2024-04-12 浪潮软件科技有限公司 Climbing abnormal behavior detection method and system and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination