CN117437691A - Real-time multi-person abnormal behavior identification method and system based on lightweight network - Google Patents


Info

Publication number
CN117437691A
CN117437691A (application CN202311428863.4A)
Authority
CN
China
Prior art keywords
image
abnormal behavior
lightweight
real
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311428863.4A
Other languages
Chinese (zh)
Inventor
王瑞
冯晓祥
赵佳辉
曹文辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202311428863.4A priority Critical patent/CN117437691A/en
Publication of CN117437691A publication Critical patent/CN117437691A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a system for real-time multi-person abnormal behavior recognition based on a lightweight network. The method comprises the following steps: collecting video sequence data containing multiple persons in real time and converting it into an RGB image data set; performing target detection on each frame of the RGB image data set and calibrating the ROI (region of interest) in the image; preprocessing each frame of the RGB image data set based on the calibrated ROI region; extracting and fusing human skeleton keypoint features from each preprocessed frame with a pre-constructed lightweight human body posture estimation network model to obtain fused features; and calling a trained integrated multi-classifier to classify the fused features, obtaining recognition results for multiple kinds of abnormal behavior. Compared with the prior art, the invention improves recognition accuracy.

Description

Real-time multi-person abnormal behavior identification method and system based on lightweight network
Technical Field
The invention relates to the technical field of computer vision and behavior recognition, in particular to a method and a system for recognizing real-time multi-person abnormal behaviors based on a lightweight network.
Background
As an important branch of computer vision, abnormal behavior recognition and detection techniques are widely used in intelligent security, medical monitoring, traffic control, and other fields. Their goal is to identify ongoing abnormal human behavior from a video or image sequence. However, how abnormal behavior is defined and distinguished is closely tied to scene factors, so choosing feature extraction and abnormal behavior recognition methods appropriate to each application scenario is essential in practice to ensure accurate early warning.
The traditional abnormal behavior recognition pipeline comprises three steps: feature extraction, feature fusion and feature classification. With the continuous development of deep learning, convolutional neural networks have gradually become the mainstream of abnormal behavior recognition, including recurrent convolutional neural networks, long short-term memory networks, and the like. These methods differ in how they extract features from video images. Some extract features from human appearance and motion information, characterizing behavior through contour and motion features; others rely mainly on local spatio-temporal information to extract human behavior features. More recent research favors feature extraction based on two-dimensional or three-dimensional human skeleton keypoints: a pose estimation network first obtains human skeleton keypoint information from the video stream, and feature vectors describing human behavior are then constructed. The main approach adopted here extracts two-dimensional human keypoint features from a video or image sequence with a lightweight human body posture estimation network and classifies abnormal behavior with an integrated classifier, giving higher accuracy and stronger robustness to external interference.
To optimize the performance of an abnormal behavior recognition system, the extraction and behavioral characterization of human skeleton keypoint data must be strengthened, which requires further optimizing the human body posture estimation network. Most currently popular open-source human body posture estimation models have high complexity, trading multi-scale, deep network structures for higher accuracy. Such models are unfriendly to common edge terminal devices, whose limited computing resources cannot host overly complex models. Some researchers deploy anomaly recognition systems on intelligent terminals using lightweight human body posture estimation models, but recognition accuracy is then greatly reduced.
Disclosure of Invention
The invention aims to provide a real-time multi-person abnormal behavior recognition method and system based on a lightweight network that improve recognition accuracy.
The aim of the invention can be achieved by the following technical scheme:
the invention provides a real-time multi-person abnormal behavior identification method based on a lightweight network, which comprises the following steps:
collecting video sequence data containing multiple persons in real time, and converting the video sequence data into an RGB image data set;
performing target detection on each frame of image in the RGB image data set, and calibrating an ROI (region of interest) in the image;
preprocessing each frame of image in the RGB image dataset based on the calibrated ROI area;
based on each preprocessed frame of image, a pre-constructed lightweight human body posture estimation network model is adopted to extract and fuse key point characteristics of human bones, so as to obtain fusion characteristics;
and calling a trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
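The five steps above can be sketched as a processing pipeline. Every function below is a hypothetical stub standing in for the patent's components (YOLOv5 detection, the lightweight pose estimation network, the trained SVM ensemble); the names and placeholder logic are ours, not the patent's:

```python
import numpy as np

# Hypothetical stand-ins for the five stages; real implementations would wrap
# YOLOv5, the lightweight pose estimation network, and the trained SVM ensemble.
def detect_rois(frame):
    # Stage 2: target detection -> list of (x1, y1, x2, y2) person boxes.
    return [(0, 0, frame.shape[1], frame.shape[0])]

def preprocess(frame, roi):
    # Stage 3: crop the calibrated ROI and normalize it.
    x1, y1, x2, y2 = roi
    crop = frame[y1:y2, x1:x2].astype(np.float32)
    return (crop - crop.min()) / (crop.max() - crop.min() + 1e-8)

def extract_fused_features(patch):
    # Stage 4: skeleton keypoint extraction + feature fusion (stubbed).
    return patch.mean(axis=(0, 1))  # placeholder fused feature vector

def classify(features):
    # Stage 5: integrated multi-classifier (stubbed as a threshold rule).
    return "abnormal" if features.mean() > 0.5 else "normal"

def recognize(frame):
    # Stage 1 would acquire `frame` from the camera and convert it to RGB.
    results = []
    for roi in detect_rois(frame):
        patch = preprocess(frame, roi)
        results.append(classify(extract_fused_features(patch)))
    return results

frame = np.random.RandomState(0).randint(0, 256, (120, 160, 3)).astype(np.uint8)
labels = recognize(frame)
```

Each stub returns dummy data of the right shape, so the control flow of the five claimed steps can be exercised end to end before the real models are plugged in.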
Further, YOLOv5 is adopted to perform object detection on each frame of image.
Further, the specific steps of the preprocessing include:
removing the parts of the ROI region irrelevant to the human body by image cropping;
aligning the images by an image alignment method;
normalizing the cropped ROI region;
and enhancing the aligned images with a data enhancement algorithm.
Further, the image alignment method is affine transformation, whose expression is:
x′ = a·x + b·y + c, y′ = d·x + e·y + f
where x and y are the coordinates before the affine transformation, x′ and y′ are the coordinates after the affine transformation, and a, b, c, d, e and f are constraint parameters.
Further, the normalization method is max-min normalization, with the function:
norm(x_f) = (x_f − min(x)) / (max(x) − min(x))
where norm is the max-min normalization function, x_f denotes a pixel value of the image, and min(x) and max(x) denote the minimum and maximum of the input data, respectively.
Further, the specific steps of obtaining the fusion characteristic include:
inputting each preprocessed frame of image into a pre-constructed lightweight human body posture estimation network model to detect human skeleton key points;
preprocessing and optimizing the key points of the human bones;
extracting features based on the preprocessed and optimized key points of the human bones;
and fusing according to the extracted features to obtain fused features.
Further, the feature extraction method is the scale-invariant feature transform (SIFT) or the speeded-up robust features (SURF) method.
Further, the construction process of the lightweight human body posture estimation network model specifically comprises the following steps:
building a HRNet high-resolution design framework;
replacing all residual blocks of the HRNet with a Shuffle Block in the ShuffleNet;
and pruning and distilling the replaced HRNet to form a light human body posture estimation network model.
Further, the integrated multi-classifier is trained using a support vector machine algorithm.
This embodiment also provides a recognition system for the above real-time multi-person abnormal behavior recognition method based on a lightweight network, comprising:
the video real-time acquisition module is used for: the system is used for collecting video sequence data containing multiple persons in real time and converting the video sequence data into an RGB image data set;
the target detection module: the method comprises the steps of performing target detection on each frame of image in the RGB image data set, and calibrating an ROI (region of interest) in the image;
an image preprocessing module: used for preprocessing each frame of image in the RGB image data set based on the calibrated ROI region;
feature extraction and fusion module: the method comprises the steps of carrying out human skeleton key point feature extraction and fusion by adopting a pre-constructed lightweight human body posture estimation network model based on each preprocessed frame image to obtain fusion features;
abnormal behavior classification and identification module: and the method is used for calling the trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
Compared with the prior art, the invention has the following beneficial effects:
(1) According to the invention, the characteristic extraction and fusion are carried out on the key points of the human skeleton through the lightweight human body posture estimation network model, so that the key characteristics for describing and explaining the behaviors are obtained, and then the classification of various abnormal behaviors is carried out through the integrated classifier, so that the accuracy rate of identifying the abnormal behaviors of the human in real time is improved.
(2) The invention adopts the target detection algorithm to detect the ROI region, and performs preprocessing operations of cutting, image alignment, normalization and data enhancement on the image based on the ROI region, thereby being beneficial to extracting key information of the image, eliminating differences of gestures, angles and scales, eliminating the influence of factors such as illumination, contrast, color and the like, improving the quality of the image, enhancing details and contrast in the image, improving the stability and comparability of the characteristics, improving the accuracy of an abnormal behavior recognition method and enabling the recognition method to be more robust.
(2) According to the lightweight human body posture estimation network model, a lighter and efficient Shuffle Block in the ShuffleNet is used for replacing all residual blocks in an HRNet original framework, so that the performance of the model is further improved, pruning is carried out on the HRNet original model, the representation capability of the human body posture estimation network is enhanced through online knowledge distillation, and the detection precision of key points of human bones is improved.
Drawings
FIG. 1 is a flow chart of a method for identifying abnormal behaviors of multiple people in real time according to the invention;
FIG. 2 is a block diagram of a lightweight human body posture estimation network model of the present invention;
FIG. 3 is a Block diagram of a Shuffle Block of the present invention;
fig. 4 is a diagram of the real-time abnormal behavior recognition result of multiple persons according to the present invention.
FIG. 5 is a diagram showing the components of the real-time multi-person abnormal behavior recognition system according to the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
The invention provides a real-time multi-person abnormal behavior identification method based on a lightweight network, which is shown in fig. 1 and comprises the following steps:
s1, acquiring video sequence data containing multiple persons in real time, and converting the video sequence data into an RGB image data set.
And acquiring a video sequence containing a human body in real time through a camera, and converting the video sequence into an RGB image data set.
S2, performing target detection on each frame of image in the RGB image data set, and calibrating the ROI area in the image.
Target detection is performed on each frame of the data set using YOLOv5 to calibrate the ROI regions. Specifically, a pre-trained YOLOv5 model is used for human body detection: its convolutional neural network rapidly partitions the image into regions and predicts the target category and position information in each region. The human body detection results give human body coordinate information, from which the human ROI regions are determined.
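The detection-to-ROI step might look like the following sketch. The function `boxes_to_rois` and the (x1, y1, x2, y2, class) box format are our assumptions for illustration, since the raw detector output format is not specified here:

```python
import numpy as np

def boxes_to_rois(image, boxes, person_class=0):
    """Clip person-class detection boxes (x1, y1, x2, y2, cls) to the image
    bounds and return the cropped ROI patches."""
    h, w = image.shape[:2]
    rois = []
    for x1, y1, x2, y2, cls in boxes:
        if cls != person_class:
            continue  # keep only human detections
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 > x1 and y2 > y1:
            rois.append(image[y1:y2, x1:x2])
    return rois

img = np.zeros((100, 200, 3), dtype=np.uint8)
rois = boxes_to_rois(img, [(10, 20, 60, 90, 0),    # normal person box
                           (-5, 0, 50, 300, 0),    # clipped to image bounds
                           (0, 0, 10, 10, 2)])     # non-person class, skipped
```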
S3, preprocessing each frame of image in the RGB image data set based on the calibrated ROI area.
Preprocessing the image data: image cropping and data enhancement are performed according to the ROI region in the target detection result, followed by image alignment and normalization. Specifically, the cropping operation removes the parts of the ROI region irrelevant to the human body, so that the input data concentrates on the task itself. Image alignment adopts affine transformation, so that human body images in different postures can be aligned to the same posture. The affine transformation can be written as a matrix multiplication:
[x′; y′] = [[a, b], [d, e]] · [x; y] + [c; f]
In this formula, x and y are the coordinates before the affine transformation, x′ and y′ are the coordinates after the affine transformation, and a, b, c, d, e and f are constraint parameters; different settings of these parameters realize different basic affine transformations.
For example, a rotation alignment takes the form x′ = x·cosθ − y·sinθ, y′ = x·sinθ + y·cosθ, where x and y are the coordinates before the affine transformation, x′ and y′ are the coordinates after it, and θ is the angle between the horizontal direction and the line connecting the left and right center points.
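As an illustration of the rotation case of the affine alignment, a minimal numpy sketch (the function name `rotate_points` is ours, not the patent's):

```python
import numpy as np

def rotate_points(points, theta_deg):
    """Rotation as an affine transform with zero translation:
    x' = x*cos(theta) - y*sin(theta), y' = x*sin(theta) + y*cos(theta)."""
    t = np.deg2rad(theta_deg)
    A = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return points @ A.T  # apply the rotation matrix to each (x, y) row

pts = np.array([[1.0, 0.0]])
out = rotate_points(pts, 90.0)   # rotating (1, 0) by 90 degrees gives (0, 1)
```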
In the above scheme, the normalization of the ROI image specifically normalizes the human body image data using the max-min normalization method. The normalization function is:
norm(x_f) = (x_f − min(x)) / (max(x) − min(x))
where x_f denotes a pixel value of the image, and min(x) and max(x) denote the minimum and maximum of the input data, respectively.
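A minimal numpy version of the max-min normalization, mapping pixel values to [0, 1]:

```python
import numpy as np

def min_max_normalize(x):
    """norm(x_f) = (x_f - min(x)) / (max(x) - min(x))."""
    x = x.astype(np.float32)
    lo, hi = x.min(), x.max()
    # guard against a constant image, where max(x) == min(x)
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

patch = np.array([[0, 128], [64, 255]], dtype=np.uint8)
norm = min_max_normalize(patch)  # min pixel -> 0.0, max pixel -> 1.0
```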
In the above aspect, the data enhancement of the transformed image includes: random flipping, horizontal or vertical projection, random scaling, etc.
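The flip-and-scale augmentations listed above can be sketched as follows. The flip probabilities and the nearest-neighbour resize are our illustrative choices, not specified by the patent:

```python
import numpy as np

def augment(image, rng):
    """Random horizontal/vertical flips plus a random geometric rescale,
    a stand-in for the fuller augmentation pipeline."""
    h, w = image.shape[:2]
    if rng.random() < 0.5:
        image = image[:, ::-1]                  # random horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]                  # random vertical flip
    s = rng.uniform(0.75, 1.25)                 # random scale factor
    nh, nw = max(1, int(h * s)), max(1, int(w * s))
    rows = np.round(np.linspace(0, h - 1, nh)).astype(int)
    cols = np.round(np.linspace(0, w - 1, nw)).astype(int)
    return image[np.ix_(rows, cols)]            # nearest-neighbour resize

rng = np.random.default_rng(0)
img = np.full((4, 4), 0.5)
aug = augment(img, rng)
```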
S4, based on each preprocessed frame of image, extracting and fusing key point features of human bones by adopting a pre-constructed lightweight human body posture estimation network model to obtain fusion features.
A lightweight human body posture estimation network model is first constructed by combining the Shuffle Block from ShuffleNet with the high-resolution design architecture of HRNet. HRNet exhibits strong capability in location-sensitive problems such as semantic segmentation, human body pose estimation, and object detection, but it stacks many residual blocks; introducing the lightweight Shuffle Block effectively reduces the number of model parameters while still letting the model extract multi-scale context information and model the long-distance spatial dependencies of human body pose estimation. To further optimize the network, the original HRNet architecture is pruned and knowledge distillation is then applied to improve model performance. To recognize the abnormal behavior of multiple persons, the model takes multiple ROI regions of an image as input, and adjacent images and adjacent ROI regions share model weights, further improving recognition efficiency and ensuring high real-time performance and accuracy on edge devices with limited computing power.
As shown in fig. 2, the lightweight human body posture estimation network replaces all residual blocks of HRNet with Shuffle Blocks, then prunes and distills the new network. It retains HRNet's original multi-resolution, multi-scale feature and multi-stage feature fusion design, in which a high-resolution representation is maintained throughout. Starting from a high-resolution sub-network as the first stage, lower-resolution sub-networks are gradually added to form more stages, and the multi-resolution sub-networks are connected in parallel. Repeated multi-scale fusion is performed by exchanging information across the parallel multi-resolution sub-networks throughout the process, and the final keypoints are output only on the high-resolution representation of the estimation network. The Shuffle Block, shown in fig. 3, first uses group convolution to divide the input feature map into groups, each convolved independently, which reduces the parameter and computation cost of the convolution. After the group convolution, the channels of the input feature map are split into two branches: one branch is kept unchanged, while the other passes through a point-wise convolution (1x1 conv), then a depthwise separable convolution (DWConv) and another point-wise convolution. The channels of the two branches are then concatenated, and the feature map of each group undergoes a channel rearrangement operation (channel shuffle). The purpose of channel rearrangement is to let feature maps from different groups interact and merge information, increasing the nonlinear capability of the network.
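The savings from the depthwise separable and point-wise convolutions inside the Shuffle Block can be seen with a quick parameter count. The channel and kernel sizes below are illustrative, not the patent's exact layers:

```python
# Parameter counts (ignoring biases) for one 3x3 conv layer with
# C_in = C_out = 64 -- illustrative numbers only.
c_in, c_out, k = 64, 64, 3

standard = c_in * c_out * k * k        # plain 3x3 convolution: 36864 params
depthwise = c_in * k * k               # 3x3 depthwise conv (one filter per channel)
pointwise = c_in * c_out               # 1x1 point-wise convolution
separable = depthwise + pointwise      # depthwise separable total: 4672 params

ratio = standard / separable           # roughly 7.9x fewer parameters
```

The same factorization is what lets the Shuffle Block stand in for HRNet's heavier residual blocks without a proportional loss in representational capacity.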
Specifically, channel rearrangement splits the feature map output by the convolution into subsets per channel, then alternately connects the subsets so that feature maps from different groups interact. The feature map is then further processed by a point-wise convolution (1x1 conv). Finally, the input feature map and the output feature map are added element-wise, which preserves information flow while mitigating gradient vanishing. Even after the residual blocks are replaced with Shuffle Blocks, the HRNet architecture remains somewhat complex, so among the network layers of HRNet's four stages, the higher the resolution of a branch, the more heavily its layers are pruned. Because pruning reduces model performance, the original model is used as the teacher and the pruned model as the student for a knowledge distillation operation, finally yielding a lightweight human body posture estimation network with better performance.
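The channel rearrangement (channel shuffle) described above reduces to a reshape-transpose-reshape over the channel dimension; a numpy sketch:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Reshape channels to (groups, c // groups), transpose, flatten --
    so that channels from different groups interleave."""
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)   # swap the group and per-group axes
    return x.reshape(n, c, h, w)

# 4 channels tagged 0..3; with 2 groups the channel order becomes 0, 2, 1, 3
x = np.arange(4, dtype=np.float32).reshape(1, 4, 1, 1)
shuffled = channel_shuffle(x, groups=2)
```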
Human skeleton key data are extracted with the lightweight human body posture estimation network model: the skeleton keypoint detection result is obtained first, the keypoints are then preprocessed and optimized, features are extracted by combining the human skeleton keypoint information, and feature fusion is performed last. Feature extraction combining skeleton keypoint information mainly uses one of two methods: scale-invariant features (SIFT) or speeded-up robust features. SIFT is a feature detection algorithm in the field of computer vision used to detect local features in images, extracting local features invariant to position, scale and rotation; its essence is to find keypoints on different scale spaces and compute their orientations. More specifically, SIFT first constructs a scale space with a difference-of-Gaussians pyramid, then detects keypoints in the scale space by comparing each pixel value with those of its neighbors (including pixels on the layers above, below and on the same layer of the scale space) to obtain candidate keypoints. The candidate keypoints are accurately localized by interpolation, and each is assigned a principal orientation according to the gradient directions of the surrounding image. A small region centered on each keypoint is then constructed in its neighborhood, and the local feature vectors are finally combined to form the keypoint descriptors.
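The difference-of-Gaussians pyramid underlying the SIFT step can be sketched as follows; the sigma values are illustrative, and a scipy Gaussian filter stands in for the pyramid construction:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Difference-of-Gaussians stack: blur at increasing sigmas and subtract
    neighbouring levels; SIFT candidate keypoints are extrema of these maps."""
    blurred = [gaussian_filter(image, s) for s in sigmas]
    return [blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)]

img = np.zeros((32, 32), dtype=np.float32)
img[16, 16] = 1.0                # a single bright spot
dogs = dog_pyramid(img)          # its strongest DoG response sits at the spot
```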
S5, calling a trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
This step classifies the fused features, i.e. the feature description matrix, obtaining a confusion matrix of the classified data and the recognition results of multiple kinds of abnormal behavior; the confusion matrix provides the specific distribution of classification results and helps analyze model performance. Abnormal behavior classification is performed by calling the integrated multi-classifier: specifically, classifiers are trained on the training set data using a support vector machine algorithm. In the multi-SVM classifier, multiple SVM classifiers are trained, each on a different sample feature or subset. After training, a weighted-average method is adopted at inference time to fuse the trained SVM classifiers, improving their accuracy and robustness, and the fused multi-classifier output is taken as the final abnormal behavior decision. The abnormal behavior recognition results obtained in this embodiment are shown in fig. 4: the left side shows the recognition probability of each behavior of the person closest to the camera in the video, and the right image shows in real time the multi-person abnormal behavior recognition results, including the human skeleton joint shapes and behavior labels. Testing shows that the model maintains high-precision recognition of multi-person abnormal behavior in a real-time video sequence.
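The weighted-average fusion of the SVM outputs can be sketched as follows. The weights and probability matrices here are hypothetical; in the real system the probabilities would come from the trained SVM classifiers:

```python
import numpy as np

def fuse_predictions(prob_list, weights):
    """Weighted-average fusion of per-classifier class-probability matrices
    (n_samples x n_classes); the fused argmax is the ensemble decision."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                   # normalize the weights
    fused = sum(wi * p for wi, p in zip(w, prob_list))
    return fused, fused.argmax(axis=1)

# Two hypothetical classifiers scoring one sample over 3 behavior classes.
p1 = np.array([[0.6, 0.3, 0.1]])
p2 = np.array([[0.2, 0.7, 0.1]])
fused, labels = fuse_predictions([p1, p2], weights=[1.0, 3.0])
```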
Example 2
This example provides a real-time multi-person abnormal behavior recognition system based on a lightweight human body posture estimation network, shown in fig. 5, comprising a video real-time acquisition module, a target detection module, an image preprocessing module, a feature extraction and fusion module, and an abnormal behavior classification and recognition module. The video real-time acquisition module acquires complete human body activity video data with a high-definition real-time camera and converts the video sequence into an RGB image data set. The target detection module uses YOLOv5 for target detection on each frame of the data set and marks the ROI. The image preprocessing module performs image cropping and data enhancement according to the ROI region in the target detection result, followed by image alignment and normalization. The feature extraction and fusion module, unlike common multi-person detection methods, inputs the preprocessed images into the lightweight human body posture estimation network to detect important human body keypoints such as joint points and facial landmarks; it first detects the skeleton keypoints, then preprocesses and optimizes them, extracts features by combining the human skeleton keypoint information, and finally performs feature fusion. The abnormal behavior classification and recognition module calls the integrated multi-classifier for abnormal behavior classification, obtaining a confusion matrix of the classified data and the recognition results of multiple kinds of abnormal behavior.
The lightweight human body posture estimation network model in the feature extraction and fusion module is trained on the COCO dataset. The model training data and parameters are as follows:
This embodiment is based primarily on the PyTorch deep learning framework, run in an Ubuntu 18.04 and Python 3.6 environment, with the network trained on 4 NVIDIA 3090 GPUs. COCO contains more than 200K images and 250K human instances, each annotated with 17 key points. The model of the present invention was trained on the train2017 dataset (including 57K images and 150K person instances) and validated on val2017 (including 5K images) and test-dev2017 (including 20K images).
During training, each GPU uses a mini-batch size of 32. An Adam optimizer with an initial learning rate of 2e-3 is used. The human detection ROI region adopts an aspect ratio of 4:3, and the box is then cropped from the image. The COCO images are resized to 256×192. Each image is enhanced by a series of data enhancement operations, including random rotation ([-30°, 30°]), random scaling ([0.75, 1.25]) and random flipping of the dataset, as well as the additional half-body data of COCO.
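Expanding a detection box to the fixed 4:3 aspect ratio before cropping can be sketched as below. This is an illustrative assumption of how the step might work: 4:3 is interpreted here as height:width, matching the 256×192 network input, and the box is grown (never shrunk) about its center so no part of the person is cut away.

```python
def expand_box_to_aspect(x, y, w, h, target_hw=4 / 3):
    """Expand a detection box (top-left x, y, width w, height h)
    about its center so that height:width equals target_hw,
    growing whichever side is too small."""
    cx, cy = x + w / 2.0, y + h / 2.0
    if h / w < target_hw:
        h = w * target_hw      # box too short: grow the height
    else:
        w = h / target_hw      # box too tall: grow the width
    return cx - w / 2.0, cy - h / 2.0, w, h

# Hypothetical square detection: 100x100 at (50, 40)
nx, ny, nw, nh = expand_box_to_aspect(50, 40, 100, 100)
# the result keeps the 100-pixel width and grows the height to 4/3 of it
```

After this adjustment, the crop can simply be resized to 256×192 without distorting the person's proportions.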
The trained optimal model is deployed to a Jetson TX2 intelligent terminal to build the real-time multi-person abnormal behavior recognition system.
The remainder were as in example 1.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied, essentially or in the part contributing to the prior art or in part, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the invention can be realized using various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

Claims (10)

1. The real-time multi-person abnormal behavior identification method based on the lightweight network is characterized by comprising the following steps of:
collecting video sequence data containing multiple persons in real time, and converting the video sequence data into an RGB image data set;
performing target detection on each frame of image in the RGB image data set, and calibrating an ROI (region of interest) in the image;
preprocessing each frame of image in the RGB image dataset based on the calibrated ROI area;
based on each preprocessed frame of image, a pre-constructed lightweight human body posture estimation network model is adopted to extract and fuse key point characteristics of human bones, so as to obtain fusion characteristics;
and calling a trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
2. The method for identifying real-time multi-person abnormal behavior based on lightweight network according to claim 1, wherein YOLOv5 is adopted for target detection of each frame of image.
3. The method for identifying real-time multi-person abnormal behavior based on lightweight network according to claim 1, wherein the specific steps of preprocessing include:
removing the part irrelevant to the human body in the ROI by adopting an image clipping method;
aligning the images by adopting an image alignment method;
processing the cut ROI by adopting a normalization method;
and carrying out enhancement processing on the aligned images by adopting a data enhancement algorithm.
4. A method for identifying real-time multi-person abnormal behavior based on a lightweight network according to claim 3, wherein said image alignment method is an affine transformation, and said affine transformation has the expression:
x' = a·x + b·y + c
y' = d·x + e·y + f
where x and y are the coordinates before the affine transformation, x' and y' are the coordinates after the affine transformation, and a, b, c, d, e and f are the transformation parameters.
5. The method for identifying real-time multi-person abnormal behavior based on a lightweight network according to claim 3, wherein the normalization method is the maximum-minimum normalization method, and the maximum-minimum normalization function is:
norm(x_f) = (x_f - min(x)) / (max(x) - min(x))
where norm is the maximum-minimum normalization function, x_f represents an image pixel value, and min(x) and max(x) represent the minimum value and the maximum value of the input data, respectively.
6. The method for identifying real-time multi-person abnormal behavior based on lightweight network according to claim 1, wherein the specific step of obtaining the fusion feature comprises the following steps:
inputting each preprocessed frame of image into a pre-constructed lightweight human body posture estimation network model to detect human skeleton key points;
preprocessing and optimizing the key points of the human bones;
extracting features based on the preprocessed and optimized key points of the human bones;
and fusing according to the extracted features to obtain fused features.
7. The method for identifying real-time multi-person abnormal behaviors based on the lightweight network according to claim 6, wherein the method adopted for feature extraction is the scale-invariant feature transform (SIFT) method or the speeded-up robust features (SURF) method.
8. The method for identifying real-time multi-person abnormal behavior based on the lightweight network according to claim 1, wherein the construction process of the lightweight human body posture estimation network model specifically comprises the following steps:
building a HRNet high-resolution design framework;
replacing all residual blocks of the HRNet with a Shuffle Block in the ShuffleNet;
and pruning and distilling the replaced HRNet to form a light human body posture estimation network model.
9. The method for identifying real-time multi-person abnormal behavior based on a lightweight network according to claim 1, wherein the integrated multi-classifier is trained by using a support vector machine algorithm.
10. A real-time multi-person abnormal behavior identification system based on a lightweight network, comprising:
the video real-time acquisition module is used for: the system is used for collecting video sequence data containing multiple persons in real time and converting the video sequence data into an RGB image data set;
the target detection module: the method comprises the steps of performing target detection on each frame of image in the RGB image data set, and calibrating an ROI (region of interest) in the image;
an image preprocessing module: the image preprocessing module is used for preprocessing each frame of image in the RGB image dataset based on the calibrated ROI area;
feature extraction and fusion module: the method comprises the steps of carrying out human skeleton key point feature extraction and fusion by adopting a pre-constructed lightweight human body posture estimation network model based on each preprocessed frame image to obtain fusion features;
abnormal behavior classification and identification module: and the method is used for calling the trained integrated multi-classifier to conduct abnormal behavior classification and identification on the fusion characteristics, and obtaining identification results of various abnormal behaviors.
CN202311428863.4A 2023-10-31 2023-10-31 Real-time multi-person abnormal behavior identification method and system based on lightweight network Pending CN117437691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311428863.4A CN117437691A (en) 2023-10-31 2023-10-31 Real-time multi-person abnormal behavior identification method and system based on lightweight network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311428863.4A CN117437691A (en) 2023-10-31 2023-10-31 Real-time multi-person abnormal behavior identification method and system based on lightweight network

Publications (1)

Publication Number Publication Date
CN117437691A true CN117437691A (en) 2024-01-23

Family

ID=89551162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311428863.4A Pending CN117437691A (en) 2023-10-31 2023-10-31 Real-time multi-person abnormal behavior identification method and system based on lightweight network

Country Status (1)

Country Link
CN (1) CN117437691A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789255A (en) * 2024-02-27 2024-03-29 沈阳二一三电子科技有限公司 Pedestrian abnormal behavior video identification method based on attitude estimation


Similar Documents

Publication Publication Date Title
Tabernik et al. Deep learning for large-scale traffic-sign detection and recognition
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN110674874B (en) Fine-grained image identification method based on target fine component detection
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
US10445602B2 (en) Apparatus and method for recognizing traffic signs
CN113361495A (en) Face image similarity calculation method, device, equipment and storage medium
Travieso et al. Pollen classification based on contour features
US20230053911A1 (en) Detecting an object in an image using multiband and multidirectional filtering
CN112329771B (en) Deep learning-based building material sample identification method
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN111539320B (en) Multi-view gait recognition method and system based on mutual learning network strategy
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
Maximili et al. Hybrid salient object extraction approach with automatic estimation of visual attention scale
CN111274964A (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN114581918A (en) Text recognition model training method and device
Wu et al. A method for identifying grape stems using keypoints
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN106650629A (en) Kernel sparse representation-based fast remote sensing target detection and recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination