CN117437691A - Real-time multi-person abnormal behavior identification method and system based on lightweight network - Google Patents


Info

Publication number
CN117437691A
CN117437691A (application CN202311428863.4A)
Authority
CN
China
Prior art keywords
image
abnormal behavior
lightweight
real
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311428863.4A
Other languages
Chinese (zh)
Inventor
王瑞
冯晓祥
赵佳辉
曹文辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202311428863.4A priority Critical patent/CN117437691A/en
Publication of CN117437691A publication Critical patent/CN117437691A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a system for real-time multi-person abnormal behavior recognition based on a lightweight network. The method comprises the following steps: collecting video sequence data containing multiple persons in real time and converting it into an RGB image data set; performing target detection on each frame of the RGB image data set and calibrating the ROI (region of interest) in the image; preprocessing each frame of the RGB image data set based on the calibrated ROI region; extracting and fusing human skeleton keypoint features from each preprocessed frame with a pre-constructed lightweight human body posture estimation network model to obtain fused features; and calling a trained integrated multi-classifier to classify the fused features, obtaining recognition results for multiple kinds of abnormal behavior. Compared with the prior art, the invention improves recognition accuracy.

Description

Real-time multi-person abnormal behavior identification method and system based on lightweight network
Technical Field
The invention relates to the technical field of computer vision and behavior recognition, in particular to a method and a system for recognizing real-time multi-person abnormal behaviors based on a lightweight network.
Background
As an important branch of computer vision, abnormal behavior recognition and detection techniques are widely used in intelligent security, medical monitoring, traffic control, and other fields. Their goal is to identify ongoing abnormal human behavior from a video or image sequence. However, how abnormal behavior is defined and distinguished is closely tied to scene factors, so choosing feature extraction and abnormal behavior recognition methods appropriate to each application scenario is essential in practice to ensure accurate early warning.
The traditional abnormal behavior recognition pipeline comprises three steps: feature extraction, feature fusion and feature classification. With the continuous development of deep learning, convolutional neural networks have gradually become the mainstream of abnormal behavior recognition, including recurrent convolutional neural networks, long short-term memory networks, and the like. These methods differ in how they extract features from video images. Some extract features from human appearance and motion information, characterizing behavior through contour and motion features; others rely mainly on local spatio-temporal information to extract human behavior features. More recent research favors feature extraction based on two-dimensional or three-dimensional human skeleton keypoints: a pose estimation network first obtains human skeleton keypoint information from the video stream, and feature vectors describing human behavior are then constructed. The main approach adopted here extracts two-dimensional human keypoint features from a video or image sequence with a lightweight human body posture estimation network and classifies abnormal behavior with an integrated classifier, giving higher accuracy and stronger robustness to external interference.
To optimize the performance of an abnormal behavior recognition system, the extraction and behavioral characterization of human skeleton keypoint data must be strengthened, which requires further optimizing the human body posture estimation network. Most currently popular open-source human body posture estimation models have high complexity, trading multi-scale, deep network structures for higher accuracy. Such models are unfriendly to common edge terminal devices, whose limited computing resources cannot host overly complex models. Some researchers deploy anomaly recognition systems on intelligent terminals using lightweight human body posture estimation models, but recognition accuracy is then greatly reduced.
Disclosure of Invention
The invention aims to provide a real-time multi-person abnormal behavior recognition method and system based on a lightweight network that improve recognition accuracy.
The aim of the invention can be achieved by the following technical scheme:
the invention provides a real-time multi-person abnormal behavior identification method based on a lightweight network, which comprises the following steps:
collecting video sequence data containing multiple persons in real time, and converting the video sequence data into an RGB image data set;
performing target detection on each frame of image in the RGB image data set, and calibrating an ROI (region of interest) in the image;
preprocessing each frame of image in the RGB image dataset based on the calibrated ROI area;
based on each preprocessed frame of image, a pre-constructed lightweight human body posture estimation network model is adopted to extract and fuse key point characteristics of human bones, so as to obtain fusion characteristics;
and calling a trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
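The five steps above can be sketched as a processing pipeline. Every function below is a hypothetical stub standing in for the patent's components (YOLOv5 detection, the lightweight pose estimation network, the trained SVM ensemble); the names and placeholder logic are ours, not the patent's:

```python
import numpy as np

# Hypothetical stand-ins for the five stages; real implementations would wrap
# YOLOv5, the lightweight pose estimation network, and the trained SVM ensemble.
def detect_rois(frame):
    # Stage 2: target detection -> list of (x1, y1, x2, y2) person boxes.
    return [(0, 0, frame.shape[1], frame.shape[0])]

def preprocess(frame, roi):
    # Stage 3: crop the calibrated ROI and normalize it.
    x1, y1, x2, y2 = roi
    crop = frame[y1:y2, x1:x2].astype(np.float32)
    return (crop - crop.min()) / (crop.max() - crop.min() + 1e-8)

def extract_fused_features(patch):
    # Stage 4: skeleton keypoint extraction + feature fusion (stubbed).
    return patch.mean(axis=(0, 1))  # placeholder fused feature vector

def classify(features):
    # Stage 5: integrated multi-classifier (stubbed as a threshold rule).
    return "abnormal" if features.mean() > 0.5 else "normal"

def recognize(frame):
    # Stage 1 would acquire `frame` from the camera and convert it to RGB.
    results = []
    for roi in detect_rois(frame):
        patch = preprocess(frame, roi)
        results.append(classify(extract_fused_features(patch)))
    return results

frame = np.random.RandomState(0).randint(0, 256, (120, 160, 3)).astype(np.uint8)
labels = recognize(frame)
```

Each stub returns dummy data of the right shape, so the control flow of the five claimed steps can be exercised end to end before the real models are plugged in.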
Further, YOLOv5 is adopted to perform object detection on each frame of image.
Further, the specific steps of the preprocessing include:
removing the parts of the ROI region irrelevant to the human body by image cropping;
aligning the images by an image alignment method;
normalizing the cropped ROI region;
and enhancing the aligned images with a data enhancement algorithm.
Further, the image alignment method is affine transformation, whose expression is:
x′ = a·x + b·y + c, y′ = d·x + e·y + f
where x and y are the coordinates before the affine transformation, x′ and y′ are the coordinates after the affine transformation, and a, b, c, d, e and f are constraint parameters.
Further, the normalization method is max-min normalization, with the function:
norm(x_f) = (x_f − min(x)) / (max(x) − min(x))
where norm is the max-min normalization function, x_f denotes a pixel value of the image, and min(x) and max(x) denote the minimum and maximum of the input data, respectively.
Further, the specific steps of obtaining the fusion characteristic include:
inputting each preprocessed frame of image into a pre-constructed lightweight human body posture estimation network model to detect human skeleton key points;
preprocessing and optimizing the key points of the human bones;
extracting features based on the preprocessed and optimized key points of the human bones;
and fusing according to the extracted features to obtain fused features.
Further, the feature extraction method is the scale-invariant feature transform (SIFT) or the speeded-up robust features (SURF) method.
Further, the construction process of the lightweight human body posture estimation network model specifically comprises the following steps:
building a HRNet high-resolution design framework;
replacing all residual blocks of the HRNet with a Shuffle Block in the ShuffleNet;
and pruning and distilling the replaced HRNet to form a light human body posture estimation network model.
Further, the integrated multi-classifier is trained using a support vector machine algorithm.
This embodiment also provides a recognition system for the above real-time multi-person abnormal behavior recognition method based on a lightweight network, comprising:
the video real-time acquisition module is used for: the system is used for collecting video sequence data containing multiple persons in real time and converting the video sequence data into an RGB image data set;
the target detection module: the method comprises the steps of performing target detection on each frame of image in the RGB image data set, and calibrating an ROI (region of interest) in the image;
an image preprocessing module: used for preprocessing each frame of image in the RGB image data set based on the calibrated ROI region;
feature extraction and fusion module: the method comprises the steps of carrying out human skeleton key point feature extraction and fusion by adopting a pre-constructed lightweight human body posture estimation network model based on each preprocessed frame image to obtain fusion features;
abnormal behavior classification and identification module: and the method is used for calling the trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
Compared with the prior art, the invention has the following beneficial effects:
(1) According to the invention, the characteristic extraction and fusion are carried out on the key points of the human skeleton through the lightweight human body posture estimation network model, so that the key characteristics for describing and explaining the behaviors are obtained, and then the classification of various abnormal behaviors is carried out through the integrated classifier, so that the accuracy rate of identifying the abnormal behaviors of the human in real time is improved.
(2) The invention adopts the target detection algorithm to detect the ROI region, and performs preprocessing operations of cutting, image alignment, normalization and data enhancement on the image based on the ROI region, thereby being beneficial to extracting key information of the image, eliminating differences of gestures, angles and scales, eliminating the influence of factors such as illumination, contrast, color and the like, improving the quality of the image, enhancing details and contrast in the image, improving the stability and comparability of the characteristics, improving the accuracy of an abnormal behavior recognition method and enabling the recognition method to be more robust.
(2) According to the lightweight human body posture estimation network model, a lighter and efficient Shuffle Block in the ShuffleNet is used for replacing all residual blocks in an HRNet original framework, so that the performance of the model is further improved, pruning is carried out on the HRNet original model, the representation capability of the human body posture estimation network is enhanced through online knowledge distillation, and the detection precision of key points of human bones is improved.
Drawings
FIG. 1 is a flow chart of a method for identifying abnormal behaviors of multiple people in real time according to the invention;
FIG. 2 is a block diagram of a lightweight human body posture estimation network model of the present invention;
FIG. 3 is a Block diagram of a Shuffle Block of the present invention;
fig. 4 is a diagram of the real-time abnormal behavior recognition result of multiple persons according to the present invention.
FIG. 5 is a diagram showing the components of the real-time multi-person abnormal behavior recognition system according to the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
The invention provides a real-time multi-person abnormal behavior identification method based on a lightweight network, which is shown in fig. 1 and comprises the following steps:
s1, acquiring video sequence data containing multiple persons in real time, and converting the video sequence data into an RGB image data set.
And acquiring a video sequence containing a human body in real time through a camera, and converting the video sequence into an RGB image data set.
S2, performing target detection on each frame of image in the RGB image data set, and calibrating the ROI area in the image.
Target detection is performed on each frame of the data set using YOLOv5 to calibrate the ROI regions. Specifically, a pre-trained YOLOv5 model is used for human body detection: its convolutional neural network rapidly partitions the image into regions and predicts the target category and position information in each region. The human body detection results give human body coordinate information, from which the human ROI regions are determined.
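The detection-to-ROI step might look like the following sketch. The function `boxes_to_rois` and the (x1, y1, x2, y2, class) box format are our assumptions for illustration, since the raw detector output format is not specified here:

```python
import numpy as np

def boxes_to_rois(image, boxes, person_class=0):
    """Clip person-class detection boxes (x1, y1, x2, y2, cls) to the image
    bounds and return the cropped ROI patches."""
    h, w = image.shape[:2]
    rois = []
    for x1, y1, x2, y2, cls in boxes:
        if cls != person_class:
            continue  # keep only human detections
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 > x1 and y2 > y1:
            rois.append(image[y1:y2, x1:x2])
    return rois

img = np.zeros((100, 200, 3), dtype=np.uint8)
rois = boxes_to_rois(img, [(10, 20, 60, 90, 0),    # normal person box
                           (-5, 0, 50, 300, 0),    # clipped to image bounds
                           (0, 0, 10, 10, 2)])     # non-person class, skipped
```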
S3, preprocessing each frame of image in the RGB image data set based on the calibrated ROI area.
Preprocessing the image data: image cropping and data enhancement are performed according to the ROI region in the target detection result, followed by image alignment and normalization. Specifically, the cropping operation removes the parts of the ROI region irrelevant to the human body, so that the input data concentrates on the task itself. Image alignment adopts affine transformation, so that human body images in different postures can be aligned to the same posture. The affine transformation can be written as a matrix multiplication:
[x′; y′] = [[a, b], [d, e]] · [x; y] + [c; f]
In this formula, x and y are the coordinates before the affine transformation, x′ and y′ are the coordinates after the affine transformation, and a, b, c, d, e and f are constraint parameters; different settings of these parameters realize different basic affine transformations.
For example, a rotation alignment takes the form x′ = x·cosθ − y·sinθ, y′ = x·sinθ + y·cosθ, where x and y are the coordinates before the affine transformation, x′ and y′ are the coordinates after it, and θ is the angle between the horizontal direction and the line connecting the left and right center points.
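As an illustration of the rotation case of the affine alignment, a minimal numpy sketch (the function name `rotate_points` is ours, not the patent's):

```python
import numpy as np

def rotate_points(points, theta_deg):
    """Rotation as an affine transform with zero translation:
    x' = x*cos(theta) - y*sin(theta), y' = x*sin(theta) + y*cos(theta)."""
    t = np.deg2rad(theta_deg)
    A = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return points @ A.T  # apply the rotation matrix to each (x, y) row

pts = np.array([[1.0, 0.0]])
out = rotate_points(pts, 90.0)   # rotating (1, 0) by 90 degrees gives (0, 1)
```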
In the above scheme, the normalization of the ROI image specifically normalizes the human body image data using the max-min normalization method. The normalization function is:
norm(x_f) = (x_f − min(x)) / (max(x) − min(x))
where x_f denotes a pixel value of the image, and min(x) and max(x) denote the minimum and maximum of the input data, respectively.
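A minimal numpy version of the max-min normalization, mapping pixel values to [0, 1]:

```python
import numpy as np

def min_max_normalize(x):
    """norm(x_f) = (x_f - min(x)) / (max(x) - min(x))."""
    x = x.astype(np.float32)
    lo, hi = x.min(), x.max()
    # guard against a constant image, where max(x) == min(x)
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

patch = np.array([[0, 128], [64, 255]], dtype=np.uint8)
norm = min_max_normalize(patch)  # min pixel -> 0.0, max pixel -> 1.0
```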
In the above aspect, the data enhancement of the transformed image includes: random flipping, horizontal or vertical projection, random scaling, etc.
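The flip-and-scale augmentations listed above can be sketched as follows. The flip probabilities and the nearest-neighbour resize are our illustrative choices, not specified by the patent:

```python
import numpy as np

def augment(image, rng):
    """Random horizontal/vertical flips plus a random geometric rescale,
    a stand-in for the fuller augmentation pipeline."""
    h, w = image.shape[:2]
    if rng.random() < 0.5:
        image = image[:, ::-1]                  # random horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]                  # random vertical flip
    s = rng.uniform(0.75, 1.25)                 # random scale factor
    nh, nw = max(1, int(h * s)), max(1, int(w * s))
    rows = np.round(np.linspace(0, h - 1, nh)).astype(int)
    cols = np.round(np.linspace(0, w - 1, nw)).astype(int)
    return image[np.ix_(rows, cols)]            # nearest-neighbour resize

rng = np.random.default_rng(0)
img = np.full((4, 4), 0.5)
aug = augment(img, rng)
```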
S4, based on each preprocessed frame of image, extracting and fusing key point features of human bones by adopting a pre-constructed lightweight human body posture estimation network model to obtain fusion features.
A lightweight human body posture estimation network model is first constructed by combining the Shuffle Block from ShuffleNet with the high-resolution design architecture of HRNet. HRNet exhibits strong capability in location-sensitive problems such as semantic segmentation, human body pose estimation, and object detection, but it stacks many residual blocks; introducing the lightweight Shuffle Block effectively reduces the number of model parameters while still letting the model extract multi-scale context information and model the long-distance spatial dependencies of human body pose estimation. To further optimize the network, the original HRNet architecture is pruned and knowledge distillation is then applied to improve model performance. To recognize the abnormal behavior of multiple persons, the model takes multiple ROI regions of an image as input, and adjacent images and adjacent ROI regions share model weights, further improving recognition efficiency and ensuring high real-time performance and accuracy on edge devices with limited computing power.
As shown in fig. 2, the lightweight human body posture estimation network replaces all residual blocks of HRNet with Shuffle Blocks, then prunes and distills the new network. It retains HRNet's original multi-resolution, multi-scale feature and multi-stage feature fusion design, in which a high-resolution representation is maintained throughout. Starting from a high-resolution sub-network as the first stage, lower-resolution sub-networks are gradually added to form more stages, and the multi-resolution sub-networks are connected in parallel. Repeated multi-scale fusion is performed by exchanging information across the parallel multi-resolution sub-networks throughout the process, and the final keypoints are output only on the high-resolution representation of the estimation network. The Shuffle Block, shown in fig. 3, first uses group convolution to divide the input feature map into groups, each convolved independently, which reduces the parameter and computation cost of the convolution. After the group convolution, the channels of the input feature map are split into two branches: one branch is kept unchanged, while the other passes through a point-wise convolution (1x1 conv), then a depthwise separable convolution (DWConv) and another point-wise convolution. The channels of the two branches are then concatenated, and the feature map of each group undergoes a channel rearrangement operation (channel shuffle). The purpose of channel rearrangement is to let feature maps from different groups interact and merge information, increasing the nonlinear capability of the network.
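The savings from the depthwise separable and point-wise convolutions inside the Shuffle Block can be seen with a quick parameter count. The channel and kernel sizes below are illustrative, not the patent's exact layers:

```python
# Parameter counts (ignoring biases) for one 3x3 conv layer with
# C_in = C_out = 64 -- illustrative numbers only.
c_in, c_out, k = 64, 64, 3

standard = c_in * c_out * k * k        # plain 3x3 convolution: 36864 params
depthwise = c_in * k * k               # 3x3 depthwise conv (one filter per channel)
pointwise = c_in * c_out               # 1x1 point-wise convolution
separable = depthwise + pointwise      # depthwise separable total: 4672 params

ratio = standard / separable           # roughly 7.9x fewer parameters
```

The same factorization is what lets the Shuffle Block stand in for HRNet's heavier residual blocks without a proportional loss in representational capacity.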
Specifically, channel rearrangement splits the feature map output by the convolution into subsets per channel, then alternately connects the subsets so that feature maps from different groups interact. The feature map is then further processed by a point-wise convolution (1x1 conv). Finally, the input feature map and the output feature map are added element-wise, which preserves information flow while mitigating gradient vanishing. Even after the residual blocks are replaced with Shuffle Blocks, the HRNet architecture remains somewhat complex, so among the network layers of HRNet's four stages, the higher the resolution of a branch, the more heavily its layers are pruned. Because pruning reduces model performance, the original model is used as the teacher and the pruned model as the student for a knowledge distillation operation, finally yielding a lightweight human body posture estimation network with better performance.
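The channel rearrangement (channel shuffle) described above reduces to a reshape-transpose-reshape over the channel dimension; a numpy sketch:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Reshape channels to (groups, c // groups), transpose, flatten --
    so that channels from different groups interleave."""
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)   # swap the group and per-group axes
    return x.reshape(n, c, h, w)

# 4 channels tagged 0..3; with 2 groups the channel order becomes 0, 2, 1, 3
x = np.arange(4, dtype=np.float32).reshape(1, 4, 1, 1)
shuffled = channel_shuffle(x, groups=2)
```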
Human skeleton key data are extracted with the lightweight human body posture estimation network model: the skeleton keypoint detection result is obtained first, the keypoints are then preprocessed and optimized, features are extracted by combining the human skeleton keypoint information, and feature fusion is performed last. Feature extraction combining skeleton keypoint information mainly uses one of two methods: scale-invariant features (SIFT) or speeded-up robust features. SIFT is a feature detection algorithm in the field of computer vision used to detect local features in images, extracting local features invariant to position, scale and rotation; its essence is to find keypoints on different scale spaces and compute their orientations. More specifically, SIFT first constructs a scale space with a difference-of-Gaussians pyramid, then detects keypoints in the scale space by comparing each pixel value with those of its neighbors (including pixels on the layers above, below and on the same layer of the scale space) to obtain candidate keypoints. The candidate keypoints are accurately localized by interpolation, and each is assigned a principal orientation according to the gradient directions of the surrounding image. A small region centered on each keypoint is then constructed in its neighborhood, and the local feature vectors are finally combined to form the keypoint descriptors.
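The difference-of-Gaussians pyramid underlying the SIFT step can be sketched as follows; the sigma values are illustrative, and a scipy Gaussian filter stands in for the pyramid construction:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Difference-of-Gaussians stack: blur at increasing sigmas and subtract
    neighbouring levels; SIFT candidate keypoints are extrema of these maps."""
    blurred = [gaussian_filter(image, s) for s in sigmas]
    return [blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)]

img = np.zeros((32, 32), dtype=np.float32)
img[16, 16] = 1.0                # a single bright spot
dogs = dog_pyramid(img)          # its strongest DoG response sits at the spot
```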
S5, calling a trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
This step classifies the fused features, i.e. the feature description matrix, obtaining a confusion matrix of the classified data and the recognition results of multiple kinds of abnormal behavior; the confusion matrix provides the specific distribution of classification results and helps analyze model performance. Abnormal behavior classification is performed by calling the integrated multi-classifier: specifically, classifiers are trained on the training set data using a support vector machine algorithm. In the multi-SVM classifier, multiple SVM classifiers are trained, each on a different sample feature or subset. After training, a weighted-average method is adopted at inference time to fuse the trained SVM classifiers, improving their accuracy and robustness, and the fused multi-classifier output is taken as the final abnormal behavior decision. The abnormal behavior recognition results obtained in this embodiment are shown in fig. 4: the left side shows the recognition probability of each behavior of the person closest to the camera in the video, and the right image shows in real time the multi-person abnormal behavior recognition results, including the human skeleton joint shapes and behavior labels. Testing shows that the model maintains high-precision recognition of multi-person abnormal behavior in a real-time video sequence.
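The weighted-average fusion of the SVM outputs can be sketched as follows. The weights and probability matrices here are hypothetical; in the real system the probabilities would come from the trained SVM classifiers:

```python
import numpy as np

def fuse_predictions(prob_list, weights):
    """Weighted-average fusion of per-classifier class-probability matrices
    (n_samples x n_classes); the fused argmax is the ensemble decision."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                   # normalize the weights
    fused = sum(wi * p for wi, p in zip(w, prob_list))
    return fused, fused.argmax(axis=1)

# Two hypothetical classifiers scoring one sample over 3 behavior classes.
p1 = np.array([[0.6, 0.3, 0.1]])
p2 = np.array([[0.2, 0.7, 0.1]])
fused, labels = fuse_predictions([p1, p2], weights=[1.0, 3.0])
```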
Example 2
This example provides a real-time multi-person abnormal behavior recognition system based on a lightweight human body posture estimation network, shown in fig. 5, comprising a video real-time acquisition module, a target detection module, an image preprocessing module, a feature extraction and fusion module, and an abnormal behavior classification and recognition module. The video real-time acquisition module acquires complete human body activity video data with a high-definition real-time camera and converts the video sequence into an RGB image data set. The target detection module uses YOLOv5 for target detection on each frame of the data set and marks the ROI. The image preprocessing module performs image cropping and data enhancement according to the ROI region in the target detection result, followed by image alignment and normalization. The feature extraction and fusion module, unlike common multi-person detection methods, inputs the preprocessed images into the lightweight human body posture estimation network to detect important human body keypoints such as joint points and facial landmarks; it first detects the skeleton keypoints, then preprocesses and optimizes them, extracts features by combining the human skeleton keypoint information, and finally performs feature fusion. The abnormal behavior classification and recognition module calls the integrated multi-classifier for abnormal behavior classification, obtaining a confusion matrix of the classified data and the recognition results of multiple kinds of abnormal behavior.
The lightweight human body posture estimation network model in the feature extraction and fusion module is trained on the COCO dataset. The model training data and parameters are as follows:
This embodiment is based primarily on the PyTorch deep learning framework, run in an Ubuntu 18.04 and Python 3.6 environment, with the network trained on 4 NVIDIA 3090 GPUs. COCO contains more than 200K images and 250K human instances, each annotated with 17 key points. The model of the present invention was trained on the train2017 dataset (including 57K images and 150K person instances) and validated on val2017 (including 5K images) and test-dev2017 (including 20K images).
During training, each GPU uses a mini-batch size of 32. An Adam optimizer with an initial learning rate of 2e-3 is used. The human detection ROI region adopts an aspect ratio of 4:3, and the box is then cropped from the image. The COCO images are resized to 256×192. Each image is enhanced by a series of data enhancement operations, including random rotation ([-30°, 30°]), random scaling ([0.75, 1.25]) and random flipping of the dataset, as well as the additional half-body data of COCO.
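Expanding a detection box to the fixed 4:3 aspect ratio before cropping can be sketched as below. This is an illustrative assumption of how the step might work: 4:3 is interpreted here as height:width, matching the 256×192 network input, and the box is grown (never shrunk) about its center so no part of the person is cut away.

```python
def expand_box_to_aspect(x, y, w, h, target_hw=4 / 3):
    """Expand a detection box (top-left x, y, width w, height h)
    about its center so that height:width equals target_hw,
    growing whichever side is too small."""
    cx, cy = x + w / 2.0, y + h / 2.0
    if h / w < target_hw:
        h = w * target_hw      # box too short: grow the height
    else:
        w = h / target_hw      # box too tall: grow the width
    return cx - w / 2.0, cy - h / 2.0, w, h

# Hypothetical square detection: 100x100 at (50, 40)
nx, ny, nw, nh = expand_box_to_aspect(50, 40, 100, 100)
# the result keeps the 100-pixel width and grows the height to 4/3 of it
```

After this adjustment, the crop can simply be resized to 256×192 without distorting the person's proportions.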
The trained optimal model is deployed to a Jetson TX2 intelligent terminal to build the real-time multi-person abnormal behavior recognition system.
The remainder were as in example 1.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied, essentially or in the part contributing to the prior art or in part, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the invention can be realized using various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

Claims (10)

1. The real-time multi-person abnormal behavior identification method based on the lightweight network is characterized by comprising the following steps of:
collecting video sequence data containing multiple persons in real time, and converting the video sequence data into an RGB image data set;
performing target detection on each frame of image in the RGB image data set, and calibrating an ROI (region of interest) in the image;
preprocessing each frame of image in the RGB image dataset based on the calibrated ROI area;
based on each preprocessed frame of image, a pre-constructed lightweight human body posture estimation network model is adopted to extract and fuse key point characteristics of human bones, so as to obtain fusion characteristics;
and calling a trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
2. The method for identifying real-time multi-person abnormal behavior based on lightweight network according to claim 1, wherein YOLOv5 is adopted for target detection of each frame of image.
3. The method for identifying real-time multi-person abnormal behavior based on lightweight network according to claim 1, wherein the specific steps of preprocessing include:
removing the part irrelevant to the human body in the ROI by adopting an image clipping method;
aligning the images by adopting an image alignment method;
processing the cut ROI by adopting a normalization method;
and carrying out enhancement processing on the aligned images by adopting a data enhancement algorithm.
4. A method for identifying real-time multi-person abnormal behavior based on a lightweight network according to claim 3, wherein said image alignment method is an affine transformation, and said affine transformation has the expression:
x' = a·x + b·y + c
y' = d·x + e·y + f
where x and y are the coordinates before the affine transformation, x' and y' are the coordinates after the affine transformation, and a, b, c, d, e and f are the transformation parameters.
5. The method for identifying real-time multi-person abnormal behavior based on a lightweight network according to claim 3, wherein the normalization method is the maximum-minimum normalization method, and the maximum-minimum normalization function is:
norm(x_f) = (x_f - min(x)) / (max(x) - min(x))
where norm is the maximum-minimum normalization function, x_f represents an image pixel value, and min(x) and max(x) represent the minimum value and the maximum value of the input data, respectively.
6. The method for identifying real-time multi-person abnormal behavior based on lightweight network according to claim 1, wherein the specific step of obtaining the fusion feature comprises the following steps:
inputting each preprocessed frame of image into a pre-constructed lightweight human body posture estimation network model to detect human skeleton key points;
preprocessing and optimizing the key points of the human bones;
extracting features based on the preprocessed and optimized key points of the human bones;
and fusing according to the extracted features to obtain fused features.
7. The method for identifying real-time multi-person abnormal behaviors based on the lightweight network according to claim 6, wherein the method adopted for feature extraction is the scale-invariant feature transform (SIFT) method or the speeded-up robust features (SURF) method.
8. The method for identifying real-time multi-person abnormal behavior based on the lightweight network according to claim 1, wherein the construction process of the lightweight human body posture estimation network model specifically comprises the following steps:
building a HRNet high-resolution design framework;
replacing all residual blocks of the HRNet with a Shuffle Block in the ShuffleNet;
and pruning and distilling the replaced HRNet to form a light human body posture estimation network model.
9. The method for identifying real-time multi-person abnormal behavior based on a lightweight network according to claim 1, wherein the integrated multi-classifier is trained by using a support vector machine algorithm.
10. A real-time multi-person abnormal behavior identification system based on a lightweight network, comprising:
the video real-time acquisition module is used for: the system is used for collecting video sequence data containing multiple persons in real time and converting the video sequence data into an RGB image data set;
the target detection module: the method comprises the steps of performing target detection on each frame of image in the RGB image data set, and calibrating an ROI (region of interest) in the image;
an image preprocessing module: the image preprocessing module is used for preprocessing each frame of image in the RGB image dataset based on the calibrated ROI area;
feature extraction and fusion module: the method comprises the steps of carrying out human skeleton key point feature extraction and fusion by adopting a pre-constructed lightweight human body posture estimation network model based on each preprocessed frame image to obtain fusion features;
abnormal behavior classification and identification module: and the method is used for calling the trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
CN202311428863.4A 2023-10-31 2023-10-31 Real-time multi-person abnormal behavior identification method and system based on lightweight network Pending CN117437691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311428863.4A CN117437691A (en) 2023-10-31 2023-10-31 Real-time multi-person abnormal behavior identification method and system based on lightweight network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311428863.4A CN117437691A (en) 2023-10-31 2023-10-31 Real-time multi-person abnormal behavior identification method and system based on lightweight network

Publications (1)

Publication Number Publication Date
CN117437691A true CN117437691A (en) 2024-01-23

Family

ID=89551162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311428863.4A Pending CN117437691A (en) 2023-10-31 2023-10-31 Real-time multi-person abnormal behavior identification method and system based on lightweight network

Country Status (1)

Country Link
CN (1) CN117437691A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789255A (en) * 2024-02-27 2024-03-29 沈阳二一三电子科技有限公司 Pedestrian abnormal behavior video identification method based on attitude estimation


Similar Documents

Publication Publication Date Title
Tabernik et al. Deep learning for large-scale traffic-sign detection and recognition
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN110674874B (en) Fine-grained image identification method based on target fine component detection
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
US10445602B2 (en) Apparatus and method for recognizing traffic signs
CN113361495A (en) Face image similarity calculation method, device, equipment and storage medium
Travieso et al. Pollen classification based on contour features
US20230053911A1 (en) Detecting an object in an image using multiband and multidirectional filtering
CN112329771B (en) Deep learning-based building material sample identification method
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN111539320B (en) Multi-view gait recognition method and system based on mutual learning network strategy
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
Maximili et al. Hybrid salient object extraction approach with automatic estimation of visual attention scale
CN111274964A (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN114581918A (en) Text recognition model training method and device
Wu et al. A method for identifying grape stems using keypoints
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN106650629A (en) Kernel sparse representation-based fast remote sensing target detection and recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination