CN114648714A - YOLO-based workshop normative behavior monitoring method - Google Patents


Info

Publication number
CN114648714A
Authority
CN
China
Prior art keywords
feature, workshop, behavior, network, yolo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210087600.0A
Other languages
Chinese (zh)
Inventor
谭思雨
朱栗波
杨倩倩
周赞
张喆
罗堃
王力
胡麒远
卢玲
Current Assignee
Hunan Zhongnan Intelligent Equipment Co ltd
Original Assignee
Hunan Zhongnan Intelligent Equipment Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Zhongnan Intelligent Equipment Co ltd
Priority to CN202210087600.0A
Publication of CN114648714A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a YOLO-based method for monitoring workshop normative behaviors, comprising the following steps: (1) constructing a workshop behavior sample data set; (2) constructing an E-YOLO target detection network comprising an encoder, a decoder and a classification regression network, and performing behavior feature learning, wherein the encoder is based on a YOLO backbone network and the decoder is an efficient decoding network; (3) acquiring real-time monitoring images of the workshop and applying the detection model trained in step (2) to each image to be identified, thereby completing the monitoring and early warning of non-normative workshop behaviors. By fusing the backbone network, the efficient decoding network and the classification regression network into the E-YOLO target detection network, the invention achieves stronger feature characterization capability, stays fast in both training and testing, accurately localizes and distinguishes similar features, further determines the feature differences between regions, and thus ensures classification correctness.

Description

YOLO-based workshop normative behavior monitoring method
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method for monitoring workshop normative behaviors based on YOLO.
Background
Industrial production workshops are buildings used for industrial production; besides the fixed workshops used for production and development, they include auxiliary buildings such as distribution rooms, pollution-discharge facilities, and equipment and material storage. Industrial production and development involve the operation of many large machines and real-time work by technicians, which carries many potential safety hazards: small parts falling from height, a robot's motion trajectory going out of control during debugging, or large equipment on a production line running abnormally at high speed can all seriously injure technicians in the workshop. Besides having experienced managers regularly inspect workshop equipment for hidden dangers, standardizing the behavior of workshop personnel can greatly reduce accidental injuries and minimize the workshop's safety hazards: requiring helmets to be worn correctly, forbidding mobile phone use during workshop operation, forbidding clothing that exposes large areas of skin, and so on can effectively guarantee the relative safety of personnel. At present, most industrial production workshops assign safety officers to supervise employees' normative behavior, warning and correcting employees who behave irregularly during inspection rounds; this cannot achieve real-time monitoring and places a huge workload and pressure on the safety officers.
To address these problems, intelligent normative-behavior monitoring systems have been developed and deployed. Such systems usually combine voice-alarm monitoring with deep-learning target detection to monitor in real time. An intelligent detection system does not depend on manpower and can provide real-time, reliable and low-cost safety assurance for personnel in an industrial production workshop; the target detection technology can classify and identify different workshop behaviors accurately in real time and quickly judge whether a behavior is non-normative. However, the scenes in a production workshop are complex and varied, and normative and non-normative behaviors often share similar features (for example, a helmet worn with the strap unbuckled versus a helmet worn correctly), which poses difficulty and challenges for a behavior detection system based on target detection. Judging from existing behavior detection methods, most monitoring systems only target coarse safety-action detection, such as wearing a helmet versus not wearing one, whereas a helmet that is worn but not fastened likewise offers no protection in case of danger.
In view of the foregoing, it is desirable to provide a YOLO-based method for monitoring workshop normative behaviors that can quickly and accurately predict the behavior of workshop staff within the image range and warn against non-normative behaviors.
Disclosure of Invention
The invention aims to provide a YOLO-based method for monitoring workshop normative behaviors that can quickly and accurately predict the behavior of workshop staff within the image range and warn against non-normative behaviors.
The above purpose is realized by the following technical scheme: a YOLO-based method for monitoring workshop normative behaviors, comprising the following steps:
(1) constructing a workshop behavior sample data set;
(2) constructing an E-YOLO target detection network comprising an encoder, a decoder and a classification regression network, and performing behavior feature learning, wherein the encoder is based on a YOLO backbone network and the decoder is an efficient decoding network;
(2.1) inputting the workshop behavior sample data set into the backbone network encoder, slicing the input samples and reducing the picture size to form a low-level feature map; then extracting image features from the low-level feature map through feature extraction modules to form middle-level feature maps; finally forming and fusing features on several receptive fields with multi-scale pooling and learning the multi-scale features of the target to form a top-level feature map;
(2.2) the efficient decoding network receives the middle-level feature maps and the top-level feature map from step (2.1) and fuses their feature information to form a feature classification reference standard; similar-behavior detection frames undergo further refined learning, suspicious behavior regions are screened according to the main features of each image set, similar features are accurately localized and distinguished, and the feature differences between the suspicious regions are further determined, ensuring classification correctness;
(2.3) the classification regression network receives the effective prediction feature maps from the decoder, assigns weights and detection-frame position predictions to the possible behaviors in each picture set, and fine-tunes the internal parameters to obtain a trained detection model;
(3) acquiring real-time monitoring images of the workshop and applying the detection model obtained in step (2.3) to each image to be identified, completing the monitoring and early warning of non-normative workshop behaviors.
The E-YOLO target detection network is formed by fusing a backbone network, an efficient decoding network and a classification regression network on the basis of the YOLOv5 network. With this method, E-YOLO has stronger feature characterization capability, stays fast in both training and testing, accurately localizes and distinguishes similar features, further determines the feature differences between regions, ensures classification correctness, and is also highly portable.
The efficient decoding network decoder comprises a feature efficient-fusion module built from two parallel convolution branches. The module receives the middle-layer and top-layer feature maps of step (2.2). In each branch the input features first pass through a 1x1 convolution that halves the number of channels and learns cross-channel information interaction; a 7x7 convolution then captures the relations between scattered behavior features over a relatively large range, learning features with strong representation capability; finally the results of the two branches are each processed by a 1x1 convolution and spliced together.
The further technical scheme is that the efficient decoding network decoder also comprises an efficient attention module that concentrates on learning small-range differences in order to screen good features from bad. In step (2.2) the efficient attention module receives the information processed by the feature efficient-fusion module, learns specific features through global average pooling, and readjusts the input feature map through a fully connected layer and a Sigmoid function, finally achieving the effect of extracting useful channel information.
The further technical scheme is that when the input feature map is X_i ∈ R^(C×W×H), the effective channel attention map A_eSE(X_i) ∈ R^(C×1×1) is calculated as:

A_eSE(X_i) = σ(W_C(F_gap(X_i)))

where F_gap(X_i) is the global average pooling of the channel information over the spatial positions:

F_gap(X_i) = (1/(W·H)) Σ_{j,k} X_i(j,k)

W_C is the weight of the fully connected layer, σ is the Sigmoid function, and X_i(j,k) ranges over all matrix elements. The input X_i is a multi-scale feature map from the middle-level and top-level feature maps; A_eSE(X_i) is applied as channel feature attention to the multi-scale feature map X_i, making X_i more informative. The output feature map X_refine is obtained by multiplying the input feature map X_i element by element with the A_eSE(X_i) produced by the efficient attention module, so that every pixel of each input X_i is assigned a weight and the features are re-screened.
The further technical scheme is that the feature efficient-fusion module comprises a feature re-fusion module, which re-fuses the useful features screened by the efficient attention module.
The further technical scheme is that, after processing, the feature re-fusion module outputs effective prediction feature maps at three scales. In step (2.3) the classification regression network divides the three scale feature layers into grid regions with 64, 32 and 16 cells respectively, adjusts the channel number of the effective prediction feature maps by convolution, and performs classification regression to predict the position, confidence and class of each Bounding box; overlapping frames are removed by NMS to obtain the final detection result. In the training stage the total network loss is the sum of the classification loss, the confidence loss and the position regression loss, where the confidence and classification losses use binary cross-entropy loss and the position regression loss uses CIoU loss; when the loss function converges, training ends and the optimal weights are retained for behavior detection.
The further technical scheme is that in step (3) the obtained trained model detects the workshop images acquired in real time in a sliding-window manner; each window is assigned behavior weights, and the behavior prediction weights of all sliding windows are integrated to produce the detection frames and thus the workshop behavior detection result.
A further technical scheme is that after the sample data set of step (2.1) is input to the backbone network encoder it is uniformly cut into several picture sets of consistent size; the image information is gradually decomposed from high-resolution, low-dimensional images into low-resolution, multi-dimensional images, it is ensured that no information is lost across resolutions and dimensions, a large combination of features is formed, and the mainly detected features are preliminarily classified by their color information and scale information.
The further technical scheme is that the backbone network uses a 1x1 convolution followed by a group of 3x3 convolutions to form a residual block as the basic structural unit; feature extraction modules are formed by stacked residual blocks, and before each feature module a 3x3 convolution with stride 2 performs downsampling to reduce the resolution of the feature map.
A further technical scheme is that the samples in the workshop behavior sample data set comprise open-source data and real-time data. The real-time data are videos of irregular human behaviors captured in real time in real workshop scenes, which are formatted into multi-frame images. The open-source and real-time data are mixed to form a data set of JPG pictures with corresponding JSON labels, and the data set is amplified by data enhancement, including mirroring, brightness adjustment, flipping and rotation, until the required number of samples is reached.
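As a hedged sketch of the data-enhancement step just described (mirroring, brightness, flipping, rotation), the following numpy fragment shows one minimal way to amplify a data set to a required sample count. The exact transforms and parameters (e.g. the +30 brightness shift, 90-degree rotation) are assumptions, since the patent names only the augmentation types:

```python
import numpy as np

def augment(image, mode):
    """Apply one augmentation named in the text to an HxWxC uint8 image.
    A minimal numpy sketch; parameters are illustrative assumptions."""
    if mode == "mirror":        # horizontal mirror
        return image[:, ::-1]
    if mode == "flip":          # vertical flip
        return image[::-1, :]
    if mode == "brightness":    # simple additive brightness shift
        return np.clip(image.astype(np.int16) + 30, 0, 255).astype(np.uint8)
    if mode == "rotate":        # 90-degree rotation
        return np.rot90(image)
    raise ValueError(mode)

def amplify(dataset, target_count):
    """Cycle through the augmentations until the data set reaches target_count."""
    modes = ["mirror", "flip", "brightness", "rotate"]
    out = list(dataset)
    i = 0
    while len(out) < target_count:
        out.append(augment(dataset[i % len(dataset)], modes[i % len(modes)]))
        i += 1
    return out
```

In practice each augmented picture's JSON label would have to be transformed consistently (e.g. flipping box coordinates), which this sketch omits.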
Compared with the prior art, the method improves performance on the basis of YOLO. An attention module that helps distinguish similar behavior features is fused in to construct an efficient decoding network used to refine behavior classification. Image information from YOLOv5s at three different resolutions is fused to form a feature classification reference standard, so behavior classification can generally be achieved on medium- and low-resolution images; similar-behavior detection frames then undergo further refined learning, suspicious behavior regions are screened according to the proportion and composition of color pixels in the main features (helmet/mask) of each picture set, similar features are accurately localized and distinguished, and finally the feature differences between the regions are further determined, guaranteeing classification correctness.
The invention provides a highly applicable intelligent normative-behavior monitoring system for industrial production workshops. Whereas most similar patents judge non-normative behavior only by detecting whether a helmet is worn, the invention expands the types of workshop personnel behavior covered and also treats non-normative behaviors as hidden dangers that can cause workshop safety accidents; it is a comprehensive monitoring system for normative behavior in industrial production workshops and can greatly reduce the probability of danger to workshop personnel.
the invention adopts a sliding window detection combined with NMS detection mode, improves the classification precision for distinguishing the behaviors of correctly wearing the mask (safety helmet)/irregularly wearing the mask (safety helmet), and avoids the situation of difficult recognition when the main body behavior is positioned at the edge of the snapshot.
The invention constructs an irregular-behavior data set based on an industrial production workshop, overcoming the shortcoming that existing open-source networks offer only helmet detection data sets.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention.
Fig. 1 is a flowchart of the YOLO-based workshop normative behavior monitoring method according to an embodiment of the present invention.
Detailed Description
The present invention will now be described in detail with reference to the drawings, which are given by way of illustration and explanation only and should not be construed as limiting the scope of the present invention in any way. Furthermore, guided by the description in this document, a person skilled in the art may combine features of the embodiments herein and of different embodiments accordingly.
The embodiment of the invention is as follows, referring to fig. 1, a monitoring method of workshop normative behaviors based on YOLO, comprising the following steps:
(1) constructing a workshop behavior sample data set;
(2) constructing an E-YOLO target detection network comprising an encoder, a decoder and a classification regression network, and performing behavior feature learning, wherein the encoder is based on a YOLO backbone network and the decoder is an efficient decoding network;
(2.1) inputting the workshop behavior sample data set into the backbone network encoder, slicing the input samples and reducing the picture size to form a low-level feature map; then extracting image features from the low-level feature map through feature extraction modules to form middle-level feature maps; finally forming and fusing features on several receptive fields with multi-scale pooling and learning the multi-scale features of the target to form a top-level feature map;
in one embodiment, a backbone network of YOLOV5s is directly used as an encoder of the E-YOLO, the network adopts 1 × 1 convolution and then a group of 3 × 3 convolution to form a residual block as a basic structural unit, a feature extraction module is formed by stacked residual blocks, 3 × 3 convolution with the step length of 2 is adopted for downsampling before each feature module is extracted, the resolution of a feature map is reduced, and the operation speed of the network is increased. 5 times of gradual downsampling are performed, the receptive field is enlarged, rich characteristic information is extracted, and different scale characteristics are formed; firstly, slicing an input sample by using a Focus structure, reducing the size of a picture to be half of the original size, and reserving image information to the maximum extent to generate a low-level feature map C1; extracting abundant image features by 4 feature extraction modules, wherein the number of stacked residual blocks is 1, 3, 3 and 1, respectively, to form middle-layer feature maps C2, C3 and C4, wherein C2, C3 and C4 have the same structure, but the number of convolution kernels is 64, 128 and 256; in order to enhance the expression capability of the top-level features, an SPP module is added, multi-scale pooling is adopted to form features on a plurality of receptive fields, the features are fused, and multi-scale features of the target are learned to form a top-level feature map C5.
(2.2) the efficient decoding network receives the middle-level feature maps and the top-level feature map from step (2.1) and combines their feature information to form a feature classification reference standard; similar-behavior detection frames undergo further refined learning, suspicious behavior regions are screened according to the main features of each image set, similar features are accurately localized and distinguished, and the feature differences between the suspicious regions are further determined, ensuring classification correctness;
(2.3) the classification regression network receives the effective prediction feature maps from the decoder, assigns weights and detection-frame position predictions to the possible behaviors in each picture set, and fine-tunes the internal parameters to obtain a trained detection model;
(3) acquiring real-time monitoring images of the workshop and applying the detection model obtained in step (2.3) to each image to be identified, completing the monitoring and early warning of non-normative workshop behaviors.
The E-YOLO target detection network is formed by fusing a backbone network, an efficient decoding network and a classification regression network on the basis of the YOLOv5 network. With this method, E-YOLO has stronger feature characterization capability, stays fast in both training and testing, accurately localizes and distinguishes similar features, further determines the feature differences between regions, ensures classification correctness, and is also highly portable.
On the basis of the above embodiment, in another embodiment of the present invention, the efficient decoding network decoder includes a feature efficient-fusion module built from two parallel convolution branches. The module receives the middle-layer and top-layer feature maps of step (2.2). In each branch the input features first pass through a 1x1 convolution that halves the number of channels and learns cross-channel information interaction; a 7x7 convolution then captures the relations between scattered behavior features over a relatively large range, learning features with strong representation capability; finally the results of the two branches are each processed by a 1x1 convolution and spliced together.
The efficient decoding network decoder receives feature input only from backbone maps C3, C4 and C5. These mid- and low-layer features have low resolution and many dimensions, carry a large amount of feature-dimension information, and are suitable for further fusion learning. After receiving the feature input from C3, C4 and C5, the feature efficient-fusion module further extracts rich contextual feature information. Its advantage is that the two convolution branches built in parallel realize feature fusion while the branch confluence preserves the feature information of the different branches, so richer feature information can be extracted and features with strong representation capability learned; the convolution results of the upper and lower branches are spliced together.
On the basis of the above embodiment, in another embodiment of the present invention, the efficient decoding network decoder further includes an efficient attention module that concentrates on learning small-range differences in order to screen good features from bad. In step (2.2) the efficient attention module receives the information processed by the feature efficient-fusion module, learns specific features through global average pooling, and readjusts the input feature map through a fully connected layer and a Sigmoid function, finally achieving the effect of extracting useful channel information.
The efficient attention module is a highly portable deep module in the target detection field. Because target detection algorithms demand fine-grained feature recognition, the difference between a normative and a non-normative behavior may lie in some tiny local feature; the efficient attention module can concentrate on learning such small-range differences so as to screen good features from bad. Two efficient attention modules are built in parallel on the branches from C4 and C5, which contain the low-resolution, high-dimensional features of the backbone network; after global feature learning by the CBP module, these features are input to the efficient attention module for useful-feature screening, and redundant features are discarded. Squeeze-and-excitation is a representative channel attention method in neural networks that directly models the channel relations between feature maps, strengthening the network's feature learning capability.
On the basis of the above embodiment, in another embodiment of the present invention, when the input feature map is X_i ∈ R^(C×W×H), the effective channel attention map A_eSE(X_i) ∈ R^(C×1×1) is calculated as:

A_eSE(X_i) = σ(W_C(F_gap(X_i)))

where F_gap(X_i) is the global average pooling of the channel information over the spatial positions:

F_gap(X_i) = (1/(W·H)) Σ_{j,k} X_i(j,k)

W_C is the weight of the fully connected layer, σ is the Sigmoid function, and X_i(j,k) ranges over all matrix elements. The input X_i is a multi-scale feature map from the middle-level and top-level feature maps; A_eSE(X_i) is applied as channel feature attention to the multi-scale feature map X_i, making X_i more informative. The output feature map X_refine is obtained by multiplying the input feature map X_i element by element with the A_eSE(X_i) produced by the efficient attention module, so that every pixel of each input X_i is assigned a weight and the features are re-screened.
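The eSE computation above can be sketched in a few lines of numpy. This is an illustrative reconstruction of the formulas, with the fully connected weight w_c left as a free parameter; it is not the patent's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ese_attention(x, w_c):
    """Effective squeeze-excitation (eSE) channel attention following the
    formulas in the text: global average pooling F_gap over the spatial
    axes, a fully connected layer with weight w_c of shape (C, C), a
    Sigmoid, and a channel-wise reweighting of the input.
    x has shape (C, W, H); the result X_refine has the same shape."""
    f_gap = x.mean(axis=(1, 2))       # F_gap(X_i): shape (C,)
    a_ese = sigmoid(w_c @ f_gap)      # A_eSE(X_i): shape (C,), in (0, 1)
    return x * a_ese[:, None, None]   # X_refine: per-channel reweighting
```

With w_c all zeros the attention map is uniformly 0.5, so the output is the input scaled by one half; a trained w_c instead amplifies informative channels and suppresses redundant ones.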
On the basis of the above embodiment, in another embodiment of the present invention, the feature efficient-fusion module includes a feature re-fusion module that re-fuses the useful features screened by the efficient attention module. The feature re-fusion module is part of the feature efficient-fusion module and is obtained by removing the Conv-BN-ReLU module built at its tail end; the function it realizes is therefore the re-fusion of the useful features screened by the efficient attention module.
On the basis of the above embodiment, in another embodiment of the present invention, the feature re-fusion module outputs effective prediction feature maps at three scales after processing. In step (2.3) the classification regression network divides the three scale feature layers into grid regions with 64, 32 and 16 cells respectively, then adjusts the channel number of the effective prediction feature maps by convolution and performs classification regression to predict the position, confidence and class of each Bounding box; NMS removes overlapping frames to obtain the final detection result. In the training phase the total network loss is the sum of the classification loss, confidence loss and position regression loss; when the loss function converges, training ends and the optimal weights are retained for behavior detection.
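As a hedged illustration of the loss terms named for the training stage (the disclosure specifies binary cross-entropy for the confidence and classification losses and CIoU for position regression), the following numpy sketch uses the standard CIoU formulation; the patent does not print its exact equations, so the details are assumptions:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy, as used for the confidence and class losses."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def ciou_loss(box_a, box_b):
    """CIoU loss (1 - CIoU) for two boxes given as (x1, y1, x2, y2):
    IoU minus a centre-distance penalty and an aspect-ratio penalty."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * \
            max(0.0, min(ay2, by2) - max(ay1, by1))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # squared centre distance over squared enclosing-box diagonal
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term
    v = (4 / np.pi ** 2) * (np.arctan((bx2 - bx1) / (by2 - by1))
                            - np.arctan((ax2 - ax1) / (ay2 - ay1))) ** 2
    alpha = v / (1 - iou + v + 1e-7)
    return float(1 - (iou - rho2 / c2 - alpha * v))
```

For identical boxes the CIoU loss is zero, and the total training loss would be a (possibly weighted) sum of these terms over all grid cells and anchors.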
On the basis of the above embodiment, in another embodiment of the present invention, the processing in step (3) uses the obtained training model to detect the workshop image collected in real time in a sliding window detection manner, each window is given a behavior weight, and the behavior prediction weights of all sliding windows are integrated to give a detection frame, so as to obtain a workshop behavior detection result.
To improve the behavior detection efficiency on real-time workshop monitoring video, a sliding-window detection mode is adopted, reducing the size of the test picture input to the network: with a window size of 100 and an interval (stride) of 50, sliding-window slicing is performed on the images obtained from the video frames, yielding the test samples. The test samples are input to the E-YOLO network to obtain behavior detection results, and the non-maximum suppression algorithm NMS then screens out the repeated prediction frames in the overlapping sliding-window areas, giving the detection result for the workshop normative behaviors. The principle of NMS is as follows:
Assume the picture obtained from the practical application scene is 200×200, the sliding window is 100×100, and the stride is 50; detecting this picture then yields 9 candidate boxes. With a preset candidate-box confidence threshold of 0.5, the 9 candidate boxes are sorted in descending order of confidence. The candidate box with the highest confidence is output and removed from the candidate list; the IOU between this box and all remaining candidates is computed, and candidates whose IOU exceeds the threshold are deleted. The above is repeated until the candidate list is empty, and the output list is returned. Here the IOU of two regions is defined as the area of their intersection divided by the area of their union.
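The NMS procedure just described can be written as a short plain-Python sketch. Box format `(x1, y1, x2, y2)` and the function names are illustrative assumptions; the logic follows the steps above: sort by confidence, keep the best, drop overlapping candidates, repeat until the list is empty.

```python
import numpy as np

# Plain sketch of the NMS procedure described above.

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Return indices of kept boxes, highest confidence first."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        # delete candidates whose IOU with the kept box exceeds the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (150, 150, 200, 200)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the heavily overlapping second box is suppressed
```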
A model training and testing system is built, and the E-YOLO detection model is trained and tested on the PyCharm software platform. After the above steps are implemented, a high-performance workshop normative behavior detection model is generated for real-time monitoring of images collected in the workshop, iterating until an optimal model capable of classifying the 9 behavior classes is obtained; the model can directly and quickly predict the behavior of workshop staff within the image and raise an alarm for non-standard behaviors.
On the basis of the above embodiment, in another embodiment of the present invention, after the sample data set in the step (2.1) is input to the backbone network encoder, it is uniformly cut into a plurality of picture sets of the same size; the image information is gradually decomposed from high-resolution, low-dimensional images into low-resolution, multi-dimensional images, ensuring that no information is lost across resolutions and dimensions; a large combination of features is formed, and the features to be detected are preliminarily classified by color information and scale information.
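The lossless "high resolution, low dimension → low resolution, multi-dimension" decomposition above can be illustrated with a YOLOv5-style Focus slice; treating the slicing operation this way is an assumption about the patent's wording, and the names here are illustrative.

```python
import numpy as np

# Sketch of the slicing operation: every second pixel goes to one of four
# sub-images, which are concatenated on the channel axis, so H x W x C
# becomes H/2 x W/2 x 4C. Every pixel survives, matching the "no information
# is lost" property described above.

def focus_slice(x: np.ndarray) -> np.ndarray:
    """x: (H, W, C) with even H and W -> (H/2, W/2, 4C)."""
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )

img = np.arange(4 * 4 * 3).reshape(4, 4, 3)
out = focus_slice(img)
print(out.shape)  # (2, 2, 12): half the resolution, four times the channels
```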
On the basis of the above embodiment, in another embodiment of the present invention, the backbone network uses a 1×1 convolution followed by a group of 3×3 convolutions to form a residual block as the basic structural unit; the feature extraction modules are formed by stacking residual blocks, and before each feature extraction module, downsampling is performed by a 3×3 convolution with a stride of 2 to reduce the resolution of the feature map.
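The resolution bookkeeping implied by the stride-2 downsampling can be checked with the standard convolution output-size formula. The input size of 256 and padding of 1 are assumptions for illustration, not taken from the patent.

```python
# Resolution bookkeeping for the backbone described above: before each
# feature-extraction module, a 3x3 convolution with stride 2 (padding 1
# assumed) halves the feature-map resolution.

def conv_out(size: int, kernel: int = 3, stride: int = 2, pad: int = 1) -> int:
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

def backbone_resolutions(input_size: int = 256, stages: int = 4):
    sizes = [input_size]
    for _ in range(stages):
        sizes.append(conv_out(sizes[-1]))
    return sizes

print(backbone_resolutions())  # [256, 128, 64, 32, 16]
```

Under these assumptions the last three stages land on 64, 32 and 16, consistent with the three grid scales used by the classification regression network.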
On the basis of the above embodiment, in another embodiment of the present invention, the samples in the workshop behavior sample data set in the step (1) include open-source data and real-time data. The real-time data comprises videos of non-standard human behaviors captured in real time from real workshop scenes; these videos are framed into multi-frame images. The open-source data and real-time data are mixed to form a data set containing JPG pictures and corresponding JSON labels, and the data set is then augmented by data enhancement methods including mirroring, brightness adjustment, flipping and rotation until the required number of samples is reached.
The core algorithm of the invention is based on the deep learning network YOLOv5s, and a prerequisite for the high performance of deep learning is a large sample data set for training. A helmet detection data set with a data volume of about 8,000 images can currently be downloaded from an open-source website (https://github.com/njvisionpower/Safety-Helmet-Wearing-Dataset); this forms one part of the data set. The other part is based on real workshop scenes: videos of non-standard human behaviors are captured in real time through workshop surveillance, formatted into multi-frame images, and label boxes are annotated manually with the LabelImg tool. The open-source data and real-time data are mixed to form a data set containing JPG pictures and corresponding JSON labels; the data set is augmented by data enhancement methods such as mirroring, brightness adjustment, flipping and rotation, finally yielding about 50,000 samples for training.
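The four augmentation modes listed above can be sketched with numpy on JPG-style uint8 arrays. This is a minimal illustration; the brightness factor and function names are assumptions.

```python
import numpy as np

# Minimal numpy sketch of the augmentation modes listed above: mirroring,
# brightness adjustment, flipping and rotation.

def mirror(img):                 # horizontal mirror
    return img[:, ::-1]

def adjust_brightness(img, factor=1.2):
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def flip(img):                   # vertical flip
    return img[::-1]

def rotate90(img, k=1):          # rotation in 90-degree steps
    return np.rot90(img, k)

def augment(img):
    """Return the original image plus one sample per enhancement mode."""
    return [img, mirror(img), adjust_brightness(img), flip(img), rotate90(img)]

sample = np.zeros((64, 64, 3), dtype=np.uint8)
print(len(augment(sample)))  # 5 images from 1
```

Applying even this small set of transforms multiplies the sample count several-fold, which is how roughly 8,000 collected images can be grown toward the 50,000 training samples mentioned above.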
Specifically, the labeling software LabelImg is used to mark different behaviors with label boxes. The workshop behaviors are mainly divided into: (1) helmet worn correctly (Helmet_OK, dark green label box); (2) helmet worn incorrectly / chin strap unbuckled (Helmet_Warn1, dark yellow label box); (3) helmet not worn (Helmet_NotOK, dark red label box); (4) helmet worn with hair left loose (Helmet_Warn2, light yellow label box); (5) playing with a mobile phone (Phone_NotOK, pink label box); (6) non-standard clothing / large area of skin exposed (Wear_NotOK, dark red label box); and, adding samples of mask-related behavior: (7) mask worn correctly (Mask_OK, blue label box); (8) mask worn incorrectly (Mask_Warn, orange label box); (9) mask not worn (Mask_NotOK, black label box). The open-source data set is mixed with the data set collected in real scenes, a large target detection sample data set covering the 9 workshop behavior classes is constructed by data set augmentation, and it is divided proportionally into a training set and a test set.
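The nine classes and their label-box colors can be collected in a lookup table for the alarm logic. The `should_alarm` helper and its suffix convention are illustrative assumptions, not part of the patent.

```python
# The nine behavior classes listed above, keyed by their label names.
BEHAVIOR_CLASSES = {
    "Helmet_OK":    ("helmet worn correctly",             "dark green"),
    "Helmet_Warn1": ("helmet worn, chin strap unbuckled", "dark yellow"),
    "Helmet_NotOK": ("helmet not worn",                   "dark red"),
    "Helmet_Warn2": ("helmet worn, hair left loose",      "light yellow"),
    "Phone_NotOK":  ("playing with mobile phone",         "pink"),
    "Wear_NotOK":   ("non-standard clothing",             "dark red"),
    "Mask_OK":      ("mask worn correctly",               "blue"),
    "Mask_Warn":    ("mask worn incorrectly",             "orange"),
    "Mask_NotOK":   ("mask not worn",                     "black"),
}

# Assumed convention: only the non-"_OK" classes should trigger an alarm.
def should_alarm(label: str) -> bool:
    return not label.endswith("_OK")

print(len(BEHAVIOR_CLASSES))  # 9 classes
print([c for c in BEHAVIOR_CLASSES if should_alarm(c)])
```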
The output features of the last three convolutional stages of the backbone network, namely C3 ∈ R^(64×64×128), C4 ∈ R^(32×32×256) and C5 ∈ R^(16×16×256), are selected and input into the feature fusion module to further extract features. The fusion module is composed of a feature efficient fusion module and two parallel efficient attention modules. The feature efficient fusion module enhances the relationships between the different behavior features captured by the network and better learns multi-scale spatial context information. The HAE module distinguishes effective channel semantic features and can suppress noise to some extent. The three effective prediction feature maps output by the feature fusion module are convolved with 1×1 kernels so that the classification regression network predicts the class, confidence and position of the target. In the testing stage, redundant prediction boxes are screened out by NMS and the final detection result is output. In the training stage, the total loss is formed as the weighted sum of the classification loss, the confidence loss and the position regression loss; the gradient of each parameter is computed by back propagation and passed to an optimizer, which iteratively updates the model weights, and the trained weights are retained for testing.
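The channel-attention computation used by the HAE module (global average pooling, a fully connected layer, a Sigmoid, then element-wise re-weighting, as in the A_eSE formula of the claims) can be sketched in numpy. The random weight stands in for a learned parameter; shapes follow a C×H×W convention and all names are illustrative.

```python
import numpy as np

# Numpy sketch of the efficient channel-attention (eSE-style) computation:
# F_gap averages each channel, W_C is the fully connected layer, sigma is
# the Sigmoid, and the resulting per-channel weights re-scale the input.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ese_attention(x: np.ndarray, w_c: np.ndarray) -> np.ndarray:
    """x: (C, H, W) feature map, w_c: (C, C) FC weight -> refined (C, H, W)."""
    f_gap = x.mean(axis=(1, 2))      # F_gap: global average pooling, shape (C,)
    a_ese = sigmoid(w_c @ f_gap)     # A_eSE: per-channel attention in (0, 1)
    return x * a_ese[:, None, None]  # X_refine: element-wise re-weighting

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16, 16))
w_c = rng.standard_normal((128, 128)) * 0.01
x_refine = ese_attention(x, w_c)
print(x_refine.shape)  # (128, 16, 16): same shape, channels re-weighted
```

Because each attention weight lies in (0, 1), every channel of the output is a damped copy of the input, which is how noisy channels are suppressed while informative ones pass through.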
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A YOLO-based workshop normative behavior monitoring method, characterized by comprising the following steps:
(1) constructing a workshop behavior sample data set;
(2) constructing an E-YOLO target detection network comprising an encoder, a decoder and a classification regression network, and performing behavior feature learning, wherein the encoder is based on a YOLO backbone network, and the decoder constructs a high-efficiency decoding network;
(2.1) inputting the workshop behavior sample data set into the backbone network encoder, performing a slicing operation on the input samples and reducing the picture size to form a low-level feature map; then extracting image features from the low-level feature map through a feature extraction module to form a middle-level feature map; forming and fusing features over multiple receptive fields by multi-scale pooling, and learning the multi-scale features of the target to form a top-level feature map;
(2.2) the efficient decoding network receives the middle-level feature map and the top-level feature map from the step (2.1), combines the feature information to form a feature classification reference standard, performs refined learning for similar behavior detection boxes, screens behavior-suspicious regions according to the main features of each image, accurately locates and distinguishes similar features, and further determines the feature differences among the suspicious regions to ensure classification correctness;
(2.3) the classification regression network receives the effective prediction feature maps from the decoder, assigns weights and detection-box position predictions to the possible behaviors in each picture set, and fine-tunes internal parameters to obtain a trained detection model;
(3) acquiring real-time monitoring image information of the workshop, and using the detection model obtained in the step (2.3) to identify and detect the images to be identified, completing the monitoring and early warning of non-standard workshop behaviors.
2. The YOLO-based workshop normative behavior monitoring method according to claim 1, wherein the efficient decoding network decoder comprises a feature efficient fusion module comprising two convolution branches built in parallel; the feature efficient fusion module in the step (2.2) receives the middle-level feature map and the top-level feature map; in each of the two parallel branches, the input features are first convolved with a 1×1 kernel to reduce the number of channels to half of the original number, then cross-channel information interaction is learned through another 1×1 convolution, then the relationships among scattered behavior features are captured over a relatively large range through a 7×7 convolution to learn features with strong representation capability, and finally the two branches are each processed by a 1×1 convolution and concatenated.
3. The YOLO-based workshop normative behavior monitoring method according to claim 2, wherein the efficient decoding network decoder further comprises an efficient attention module for learning differences within a small range to perform quality screening of the features; the efficient attention module in the step (2.2) receives the information processed by the feature efficient fusion module, learns the specific features through global average pooling, and re-adjusts the input feature map through a fully connected layer and a Sigmoid function, finally achieving the function of extracting useful channel information.
4. The method of claim 3, wherein, denoting the input feature map by X_i ∈ R^(C×W×H), the effective channel attention map A_eSE(X_i) ∈ R^(C×1×1) is calculated as:

A_eSE(X_i) = σ(W_C(F_gap(X_i)))

F_gap(X_i) = (1/(W·H)) Σ_{i=1..W} Σ_{j=1..H} X_{i,j}

wherein F_gap(X_i) is the global average pooling of the channel information, W_C is the weight of the fully connected layer, σ is the Sigmoid function, and X_{i,j} ranges over all matrix elements; the input X_i is a multi-scale feature map drawn from the middle-level feature map and the top-level feature map. A_eSE(X_i) is applied as channel-feature attention to the multi-scale feature map X_i, so that the multi-scale feature X_i becomes more informative; the refined output X_refine is obtained by multiplying the input feature map X_i element by element with the A_eSE(X_i) produced by the efficient attention module, thereby assigning a pixel-by-pixel weight to each input X_i and realizing feature re-screening.
5. The YOLO-based workshop specification behavior monitoring method of claim 4, wherein the feature efficient fusion module comprises a feature re-fusion module that re-fuses the useful features screened by the efficient attention module.
6. The YOLO-based workshop normative behavior monitoring method according to claim 5, wherein the feature re-fusion module outputs effective prediction feature maps at three scales after processing; the classification regression network in the step (2.3) divides the three scale feature layers into grids of 64×64, 32×32 and 16×16 cells respectively, then applies a convolution to the effective prediction feature maps to adjust the channels, performs classification regression prediction on the position, confidence and category of each Bounding box, and removes overlapping boxes through NMS to obtain the final output detection result; in the training stage, the total network loss comprises the sum of the classification loss, the confidence loss and the position regression loss, wherein the confidence loss and the classification loss adopt binary cross-entropy loss and the position regression loss adopts CIOU loss; training is finished after the loss function converges, and the optimal weights are retained for behavior detection.
7. The YOLO-based workshop normative behavior monitoring method according to any one of claims 1 to 6, wherein in the step (3) the obtained trained model is used to detect workshop images collected in real time in a sliding-window manner; each window is assigned a behavior weight, and the behavior prediction weights of all sliding windows are integrated to produce detection boxes, yielding the workshop behavior detection result.
8. The YOLO-based workshop normative behavior monitoring method according to claim 7, wherein after the sample data set in the step (2.1) is input into the backbone network encoder, it is uniformly cut into a plurality of picture sets of consistent size; the image information is gradually decomposed from high-resolution, low-dimensional images into low-resolution, multi-dimensional images, ensuring that no information is lost across resolutions and dimensions; a large variety of features is formed, and the features to be detected are preliminarily classified by color information and scale information.
9. The YOLO-based plant normative behavior monitoring method of claim 8, wherein the backbone network uses 1x1 convolution followed by a set of 3 x3 convolutions to form residual blocks as basic structural units, the feature extraction modules are formed by stacked residual blocks, and downsampling is performed by using 3 x3 convolutions with a step size of 2 before each feature module is extracted, so as to reduce the resolution of the feature map.
10. The YOLO-based workshop normative behavior monitoring method according to claim 1, wherein the samples in the workshop behavior sample data set in the step (1) comprise open-source data and real-time data; the real-time data comprises videos of non-standard human behaviors captured in real time from real workshop scenes, the videos being formatted to form multi-frame images; the open-source data and the real-time data are mixed to form a data set comprising JPG pictures and corresponding JSON labels, and the data set is augmented by data enhancement modes comprising mirroring, brightness adjustment, flipping and rotation until the required number of samples is reached.
CN202210087600.0A 2022-01-25 2022-01-25 YOLO-based workshop normative behavior monitoring method Pending CN114648714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210087600.0A CN114648714A (en) 2022-01-25 2022-01-25 YOLO-based workshop normative behavior monitoring method


Publications (1)

Publication Number Publication Date
CN114648714A true CN114648714A (en) 2022-06-21

Family

ID=81992812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210087600.0A Pending CN114648714A (en) 2022-01-25 2022-01-25 YOLO-based workshop normative behavior monitoring method

Country Status (1)

Country Link
CN (1) CN114648714A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272987A (en) * 2022-07-07 2022-11-01 淮阴工学院 MSA-yolk 5-based vehicle detection method and device in severe weather
CN115272987B (en) * 2022-07-07 2023-08-22 淮阴工学院 MSA-Yolov 5-based vehicle detection method and device in severe weather
CN115223206A (en) * 2022-09-19 2022-10-21 季华实验室 Working clothes wearing condition detection method and device, electronic equipment and storage medium
CN115410012A (en) * 2022-11-02 2022-11-29 中国民航大学 Method and system for detecting infrared small target in night airport clear airspace and application
CN115410012B (en) * 2022-11-02 2023-02-28 中国民航大学 Method and system for detecting infrared small target in night airport clear airspace and application
CN115546652A (en) * 2022-11-29 2022-12-30 城云科技(中国)有限公司 Multi-time-state target detection model and construction method, device and application thereof
CN117291997A (en) * 2023-11-24 2023-12-26 无锡车联天下信息技术有限公司 Method for calibrating corner points of monitoring picture of vehicle-mounted monitoring system
CN117291997B (en) * 2023-11-24 2024-01-26 无锡车联天下信息技术有限公司 Method for calibrating corner points of monitoring picture of vehicle-mounted monitoring system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination