CN114648714A - YOLO-based workshop normative behavior monitoring method - Google Patents


Info

Publication number
CN114648714A
Authority
CN
China
Prior art keywords
feature, workshop, behavior, network, yolo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210087600.0A
Other languages
Chinese (zh)
Inventor
谭思雨
朱栗波
杨倩倩
周赞
张喆
罗堃
王力
胡麒远
卢玲
Current Assignee
Hunan Zhongnan Intelligent Equipment Co ltd
Original Assignee
Hunan Zhongnan Intelligent Equipment Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Zhongnan Intelligent Equipment Co ltd
Priority to CN202210087600.0A
Publication of CN114648714A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a YOLO-based method for monitoring workshop normative behaviors, comprising the following steps: (1) constructing a workshop behavior sample data set; (2) constructing an E-YOLO target detection network comprising an encoder, a decoder and a classification regression network, and performing behavior feature learning, wherein the encoder is based on a YOLO backbone network and the decoder is an efficient decoding network; (3) acquiring real-time monitoring images of the workshop and applying the detection model trained in step (2) to each image to be identified, thereby completing the monitoring and early warning of non-normative workshop behaviors. By fusing the backbone network, the efficient decoding network and the classification regression network into the E-YOLO target detection network, the invention achieves stronger feature characterization capability, stays fast in both training and testing, accurately localizes and distinguishes similar features, further determines the feature differences between regions, and thus ensures classification correctness.

Description

YOLO-based workshop normative behavior monitoring method
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method for monitoring workshop normative behaviors based on YOLO.
Background
Industrial production workshops are buildings used for industrial production; besides the fixed workshops used for production and development, they include auxiliary buildings such as distribution rooms, pollution-discharge facilities, and equipment and material storage. Industrial production and development involve the operation of many large machines and real-time work by technicians, which carries many potential safety hazards: small parts falling from height, a robot's motion trajectory going out of control during debugging, or large equipment on a production line running abnormally at high speed can all seriously injure technicians in the workshop. Besides having experienced managers regularly inspect workshop equipment for hidden dangers, standardizing the behavior of workshop personnel can greatly reduce accidental injuries and minimize the workshop's safety hazards: requiring helmets to be worn correctly, forbidding mobile phone use during workshop operation, forbidding clothing that exposes large areas of skin, and so on can effectively guarantee the relative safety of personnel. At present, most industrial production workshops assign safety officers to supervise employees' normative behavior, warning and correcting employees who behave irregularly during inspection rounds; this cannot achieve real-time monitoring and places a huge workload and pressure on the safety officers.
To address these problems, intelligent normative-behavior monitoring systems have been developed and deployed. Such systems usually combine voice-alarm monitoring with deep-learning target detection to monitor in real time. An intelligent detection system does not depend on manpower and can provide real-time, reliable and low-cost safety assurance for personnel in an industrial production workshop; the target detection technology can classify and identify different workshop behaviors accurately in real time and quickly judge whether a behavior is non-normative. However, the scenes in a production workshop are complex and varied, and normative and non-normative behaviors often share similar features (for example, a helmet worn with the strap unbuckled versus a helmet worn correctly), which poses difficulty and challenges for a behavior detection system based on target detection. Judging from existing behavior detection methods, most monitoring systems only target coarse safety-action detection, such as wearing a helmet versus not wearing one, whereas a helmet that is worn but not fastened likewise offers no protection in case of danger.
In view of the foregoing, it is desirable to provide a YOLO-based method for monitoring workshop normative behaviors that can quickly and accurately predict the behavior of workshop staff within the image range and warn against non-normative behaviors.
Disclosure of Invention
The invention aims to provide a YOLO-based method for monitoring workshop normative behaviors that can quickly and accurately predict the behavior of workshop staff within the image range and warn against non-normative behaviors.
The above purpose is realized by the following technical scheme: a YOLO-based method for monitoring workshop normative behaviors, comprising the following steps:
(1) constructing a workshop behavior sample data set;
(2) constructing an E-YOLO target detection network comprising an encoder, a decoder and a classification regression network, and performing behavior feature learning, wherein the encoder is based on a YOLO backbone network and the decoder is an efficient decoding network;
(2.1) inputting the workshop behavior sample data set into the backbone network encoder, slicing the input samples and reducing the picture size to form a low-level feature map; then extracting image features from the low-level feature map through feature extraction modules to form middle-level feature maps; finally forming and fusing features on several receptive fields with multi-scale pooling and learning the multi-scale features of the target to form a top-level feature map;
(2.2) the efficient decoding network receives the middle-level feature maps and the top-level feature map from step (2.1) and fuses their feature information to form a feature classification reference standard; similar-behavior detection frames undergo further refined learning, suspicious behavior regions are screened according to the main features of each image set, similar features are accurately localized and distinguished, and the feature differences between the suspicious regions are further determined, ensuring classification correctness;
(2.3) the classification regression network receives the effective prediction feature maps from the decoder, assigns weights and detection-frame position predictions to the possible behaviors in each picture set, and fine-tunes the internal parameters to obtain a trained detection model;
(3) acquiring real-time monitoring images of the workshop and applying the detection model obtained in step (2.3) to each image to be identified, completing the monitoring and early warning of non-normative workshop behaviors.
The E-YOLO target detection network is formed by fusing a backbone network, an efficient decoding network and a classification regression network on the basis of the YOLOv5 network. With this method, E-YOLO has stronger feature characterization capability, stays fast in both training and testing, accurately localizes and distinguishes similar features, further determines the feature differences between regions, ensures classification correctness, and is also highly portable.
The efficient decoding network decoder comprises a feature efficient-fusion module built from two parallel convolution branches. The module receives the middle-layer and top-layer feature maps of step (2.2). In each branch the input features first pass through a 1x1 convolution that halves the number of channels and learns cross-channel information interaction; a 7x7 convolution then captures the relations between scattered behavior features over a relatively large range, learning features with strong representation capability; finally the results of the two branches are each processed by a 1x1 convolution and spliced together.
The further technical scheme is that the efficient decoding network decoder also comprises an efficient attention module that concentrates on learning small-range differences in order to screen good features from bad. In step (2.2) the efficient attention module receives the information processed by the feature efficient-fusion module, learns specific features through global average pooling, and readjusts the input feature map through a fully connected layer and a Sigmoid function, finally achieving the effect of extracting useful channel information.
The further technical scheme is that when the input feature map is X_i ∈ R^(C×W×H), the effective channel attention map A_eSE(X_i) ∈ R^(C×1×1) is calculated as:

A_eSE(X_i) = σ(W_C(F_gap(X_i)))

where F_gap(X_i) is the global average pooling of the channel information over the spatial positions:

F_gap(X_i) = (1/(W·H)) Σ_{j,k} X_i(j,k)

W_C is the weight of the fully connected layer, σ is the Sigmoid function, and X_i(j,k) ranges over all matrix elements. The input X_i is a multi-scale feature map from the middle-level and top-level feature maps; A_eSE(X_i) is applied as channel feature attention to the multi-scale feature map X_i, making X_i more informative. The output feature map X_refine is obtained by multiplying the input feature map X_i element by element with the A_eSE(X_i) produced by the efficient attention module, so that every pixel of each input X_i is assigned a weight and the features are re-screened.
The further technical scheme is that the feature efficient-fusion module comprises a feature re-fusion module, which re-fuses the useful features screened by the efficient attention module.
The further technical scheme is that, after processing, the feature re-fusion module outputs effective prediction feature maps at three scales. In step (2.3) the classification regression network divides the three scale feature layers into grid regions with 64, 32 and 16 cells respectively, adjusts the channel number of the effective prediction feature maps by convolution, and performs classification regression to predict the position, confidence and class of each Bounding box; overlapping frames are removed by NMS to obtain the final detection result. In the training stage the total network loss is the sum of the classification loss, the confidence loss and the position regression loss, where the confidence and classification losses use binary cross-entropy loss and the position regression loss uses CIoU loss; when the loss function converges, training ends and the optimal weights are retained for behavior detection.
The further technical scheme is that in step (3) the obtained trained model detects the workshop images acquired in real time in a sliding-window manner; each window is assigned behavior weights, and the behavior prediction weights of all sliding windows are integrated to produce the detection frames and thus the workshop behavior detection result.
A further technical scheme is that after the sample data set of step (2.1) is input to the backbone network encoder it is uniformly cut into several picture sets of consistent size; the image information is gradually decomposed from high-resolution, low-dimensional images into low-resolution, multi-dimensional images, it is ensured that no information is lost across resolutions and dimensions, a large combination of features is formed, and the mainly detected features are preliminarily classified by their color information and scale information.
The further technical scheme is that the backbone network uses a 1x1 convolution followed by a group of 3x3 convolutions to form a residual block as the basic structural unit; feature extraction modules are formed by stacked residual blocks, and before each feature module a 3x3 convolution with stride 2 performs downsampling to reduce the resolution of the feature map.
A further technical scheme is that the samples in the workshop behavior sample data set comprise open-source data and real-time data. The real-time data are videos of irregular human behaviors captured in real time in real workshop scenes, which are formatted into multi-frame images. The open-source and real-time data are mixed to form a data set of JPG pictures with corresponding JSON labels, and the data set is amplified by data enhancement, including mirroring, brightness adjustment, flipping and rotation, until the required number of samples is reached.
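As a hedged sketch of the data-enhancement step just described (mirroring, brightness, flipping, rotation), the following numpy fragment shows one minimal way to amplify a data set to a required sample count. The exact transforms and parameters (e.g. the +30 brightness shift, 90-degree rotation) are assumptions, since the patent names only the augmentation types:

```python
import numpy as np

def augment(image, mode):
    """Apply one augmentation named in the text to an HxWxC uint8 image.
    A minimal numpy sketch; parameters are illustrative assumptions."""
    if mode == "mirror":        # horizontal mirror
        return image[:, ::-1]
    if mode == "flip":          # vertical flip
        return image[::-1, :]
    if mode == "brightness":    # simple additive brightness shift
        return np.clip(image.astype(np.int16) + 30, 0, 255).astype(np.uint8)
    if mode == "rotate":        # 90-degree rotation
        return np.rot90(image)
    raise ValueError(mode)

def amplify(dataset, target_count):
    """Cycle through the augmentations until the data set reaches target_count."""
    modes = ["mirror", "flip", "brightness", "rotate"]
    out = list(dataset)
    i = 0
    while len(out) < target_count:
        out.append(augment(dataset[i % len(dataset)], modes[i % len(modes)]))
        i += 1
    return out
```

In practice each augmented picture's JSON label would have to be transformed consistently (e.g. flipping box coordinates), which this sketch omits.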
Compared with the prior art, the method improves performance on the basis of YOLO. An attention module that helps distinguish similar behavior features is fused in to construct an efficient decoding network used to refine behavior classification. Image information from YOLOv5s at three different resolutions is fused to form a feature classification reference standard, so behavior classification can generally be achieved on medium- and low-resolution images; similar-behavior detection frames then undergo further refined learning, suspicious behavior regions are screened according to the proportion and composition of color pixels in the main features (helmet/mask) of each picture set, similar features are accurately localized and distinguished, and finally the feature differences between the regions are further determined, guaranteeing classification correctness.
The invention provides a highly applicable intelligent normative-behavior monitoring system for industrial production workshops. Whereas most similar patents judge non-normative behavior only by detecting whether a helmet is worn, the invention expands the types of workshop personnel behavior covered and also treats non-normative behaviors as hidden dangers that can cause workshop safety accidents; it is a comprehensive monitoring system for normative behavior in industrial production workshops and can greatly reduce the probability of danger to workshop personnel.
the invention adopts a sliding window detection combined with NMS detection mode, improves the classification precision for distinguishing the behaviors of correctly wearing the mask (safety helmet)/irregularly wearing the mask (safety helmet), and avoids the situation of difficult recognition when the main body behavior is positioned at the edge of the snapshot.
The invention constructs an irregular-behavior data set based on an industrial production workshop, overcoming the shortcoming that existing open-source networks offer only helmet detection data sets.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention.
Fig. 1 is a flowchart of the YOLO-based workshop normative behavior monitoring method according to an embodiment of the present invention.
Detailed Description
The present invention will now be described in detail with reference to the drawings, which are given by way of illustration and explanation only and should not be construed as limiting the scope of the present invention in any way. Furthermore, guided by the description in this document, a person skilled in the art may combine features of the embodiments herein and of different embodiments accordingly.
The embodiment of the invention is as follows, referring to fig. 1, a monitoring method of workshop normative behaviors based on YOLO, comprising the following steps:
(1) constructing a workshop behavior sample data set;
(2) constructing an E-YOLO target detection network comprising an encoder, a decoder and a classification regression network, and performing behavior feature learning, wherein the encoder is based on a YOLO backbone network and the decoder is an efficient decoding network;
(2.1) inputting the workshop behavior sample data set into the backbone network encoder, slicing the input samples and reducing the picture size to form a low-level feature map; then extracting image features from the low-level feature map through feature extraction modules to form middle-level feature maps; finally forming and fusing features on several receptive fields with multi-scale pooling and learning the multi-scale features of the target to form a top-level feature map;
in one embodiment, a backbone network of YOLOV5s is directly used as an encoder of the E-YOLO, the network adopts 1 × 1 convolution and then a group of 3 × 3 convolution to form a residual block as a basic structural unit, a feature extraction module is formed by stacked residual blocks, 3 × 3 convolution with the step length of 2 is adopted for downsampling before each feature module is extracted, the resolution of a feature map is reduced, and the operation speed of the network is increased. 5 times of gradual downsampling are performed, the receptive field is enlarged, rich characteristic information is extracted, and different scale characteristics are formed; firstly, slicing an input sample by using a Focus structure, reducing the size of a picture to be half of the original size, and reserving image information to the maximum extent to generate a low-level feature map C1; extracting abundant image features by 4 feature extraction modules, wherein the number of stacked residual blocks is 1, 3, 3 and 1, respectively, to form middle-layer feature maps C2, C3 and C4, wherein C2, C3 and C4 have the same structure, but the number of convolution kernels is 64, 128 and 256; in order to enhance the expression capability of the top-level features, an SPP module is added, multi-scale pooling is adopted to form features on a plurality of receptive fields, the features are fused, and multi-scale features of the target are learned to form a top-level feature map C5.
(2.2) the efficient decoding network receives the middle-level feature maps and the top-level feature map from step (2.1) and combines their feature information to form a feature classification reference standard; similar-behavior detection frames undergo further refined learning, suspicious behavior regions are screened according to the main features of each image set, similar features are accurately localized and distinguished, and the feature differences between the suspicious regions are further determined, ensuring classification correctness;
(2.3) the classification regression network receives the effective prediction feature maps from the decoder, assigns weights and detection-frame position predictions to the possible behaviors in each picture set, and fine-tunes the internal parameters to obtain a trained detection model;
(3) acquiring real-time monitoring images of the workshop and applying the detection model obtained in step (2.3) to each image to be identified, completing the monitoring and early warning of non-normative workshop behaviors.
The E-YOLO target detection network is formed by fusing a backbone network, an efficient decoding network and a classification regression network on the basis of the YOLOv5 network. With this method, E-YOLO has stronger feature characterization capability, stays fast in both training and testing, accurately localizes and distinguishes similar features, further determines the feature differences between regions, ensures classification correctness, and is also highly portable.
On the basis of the above embodiment, in another embodiment of the present invention, the efficient decoding network decoder includes a feature efficient-fusion module built from two parallel convolution branches. The module receives the middle-layer and top-layer feature maps of step (2.2). In each branch the input features first pass through a 1x1 convolution that halves the number of channels and learns cross-channel information interaction; a 7x7 convolution then captures the relations between scattered behavior features over a relatively large range, learning features with strong representation capability; finally the results of the two branches are each processed by a 1x1 convolution and spliced together.
The efficient decoding network decoder receives feature input only from backbone maps C3, C4 and C5. These mid- and low-layer features have low resolution and many dimensions, carry a large amount of feature-dimension information, and are suitable for further fusion learning. After receiving the feature input from C3, C4 and C5, the feature efficient-fusion module further extracts rich contextual feature information. Its advantage is that the two convolution branches built in parallel realize feature fusion while the branch confluence preserves the feature information of the different branches, so richer feature information can be extracted and features with strong representation capability learned; the convolution results of the upper and lower branches are spliced together.
On the basis of the above embodiment, in another embodiment of the present invention, the efficient decoding network decoder further includes an efficient attention module that concentrates on learning small-range differences in order to screen good features from bad. In step (2.2) the efficient attention module receives the information processed by the feature efficient-fusion module, learns specific features through global average pooling, and readjusts the input feature map through a fully connected layer and a Sigmoid function, finally achieving the effect of extracting useful channel information.
The efficient attention module is a highly portable deep module in the target detection field. Because target detection algorithms demand fine-grained feature recognition, the difference between a normative and a non-normative behavior may lie in some tiny local feature; the efficient attention module can concentrate on learning such small-range differences so as to screen good features from bad. Two efficient attention modules are built in parallel on the branches from C4 and C5, which contain the low-resolution, high-dimensional features of the backbone network; after global feature learning by the CBP module, these features are input to the efficient attention module for useful-feature screening, and redundant features are discarded. Squeeze-and-excitation is a representative channel attention method in neural networks that directly models the channel relations between feature maps, strengthening the network's feature learning capability.
On the basis of the above embodiment, in another embodiment of the present invention, when the input feature map is X_i ∈ R^(C×W×H), the effective channel attention map A_eSE(X_i) ∈ R^(C×1×1) is calculated as:

A_eSE(X_i) = σ(W_C(F_gap(X_i)))

where F_gap(X_i) is the global average pooling of the channel information over the spatial positions:

F_gap(X_i) = (1/(W·H)) Σ_{j,k} X_i(j,k)

W_C is the weight of the fully connected layer, σ is the Sigmoid function, and X_i(j,k) ranges over all matrix elements. The input X_i is a multi-scale feature map from the middle-level and top-level feature maps; A_eSE(X_i) is applied as channel feature attention to the multi-scale feature map X_i, making X_i more informative. The output feature map X_refine is obtained by multiplying the input feature map X_i element by element with the A_eSE(X_i) produced by the efficient attention module, so that every pixel of each input X_i is assigned a weight and the features are re-screened.
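The eSE computation above can be sketched in a few lines of numpy. This is an illustrative reconstruction of the formulas, with the fully connected weight w_c left as a free parameter; it is not the patent's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ese_attention(x, w_c):
    """Effective squeeze-excitation (eSE) channel attention following the
    formulas in the text: global average pooling F_gap over the spatial
    axes, a fully connected layer with weight w_c of shape (C, C), a
    Sigmoid, and a channel-wise reweighting of the input.
    x has shape (C, W, H); the result X_refine has the same shape."""
    f_gap = x.mean(axis=(1, 2))       # F_gap(X_i): shape (C,)
    a_ese = sigmoid(w_c @ f_gap)      # A_eSE(X_i): shape (C,), in (0, 1)
    return x * a_ese[:, None, None]   # X_refine: per-channel reweighting
```

With w_c all zeros the attention map is uniformly 0.5, so the output is the input scaled by one half; a trained w_c instead amplifies informative channels and suppresses redundant ones.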
On the basis of the above embodiment, in another embodiment of the present invention, the feature efficient-fusion module includes a feature re-fusion module that re-fuses the useful features screened by the efficient attention module. The feature re-fusion module is part of the feature efficient-fusion module and is obtained by removing the Conv-BN-ReLU module built at its tail end; the function it realizes is therefore the re-fusion of the useful features screened by the efficient attention module.
On the basis of the above embodiment, in another embodiment of the present invention, the feature re-fusion module outputs effective prediction feature maps at three scales after processing. In step (2.3) the classification regression network divides the three scale feature layers into grid regions with 64, 32 and 16 cells respectively, then adjusts the channel number of the effective prediction feature maps by convolution and performs classification regression to predict the position, confidence and class of each Bounding box; NMS removes overlapping frames to obtain the final detection result. In the training phase the total network loss is the sum of the classification loss, confidence loss and position regression loss; when the loss function converges, training ends and the optimal weights are retained for behavior detection.
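As a hedged illustration of the loss terms named for the training stage (the disclosure specifies binary cross-entropy for the confidence and classification losses and CIoU for position regression), the following numpy sketch uses the standard CIoU formulation; the patent does not print its exact equations, so the details are assumptions:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy, as used for the confidence and class losses."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def ciou_loss(box_a, box_b):
    """CIoU loss (1 - CIoU) for two boxes given as (x1, y1, x2, y2):
    IoU minus a centre-distance penalty and an aspect-ratio penalty."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * \
            max(0.0, min(ay2, by2) - max(ay1, by1))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # squared centre distance over squared enclosing-box diagonal
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term
    v = (4 / np.pi ** 2) * (np.arctan((bx2 - bx1) / (by2 - by1))
                            - np.arctan((ax2 - ax1) / (ay2 - ay1))) ** 2
    alpha = v / (1 - iou + v + 1e-7)
    return float(1 - (iou - rho2 / c2 - alpha * v))
```

For identical boxes the CIoU loss is zero, and the total training loss would be a (possibly weighted) sum of these terms over all grid cells and anchors.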
On the basis of the above embodiment, in another embodiment of the present invention, the processing in step (3) uses the obtained training model to detect the workshop image collected in real time in a sliding window detection manner, each window is given a behavior weight, and the behavior prediction weights of all sliding windows are integrated to give a detection frame, so as to obtain a workshop behavior detection result.
To improve the behavior detection efficiency on real-time workshop monitoring video, a sliding-window detection mode is adopted, reducing the size of the test picture input to the network: with a window size of 100 and an interval (stride) of 50, sliding-window slicing is performed on the images obtained from the video frames, yielding the test samples. The test samples are input to the E-YOLO network to obtain behavior detection results, and the non-maximum suppression algorithm NMS then screens out the repeated prediction frames in the overlapping sliding-window areas, giving the detection result for the workshop normative behaviors. The principle of NMS is as follows:
Assume the picture obtained from the practical application scene is 200×200, the sliding window is 100×100, and the stride is 50; detecting this picture then yields 9 candidate boxes. With a preset candidate-box confidence threshold of 0.5, the 9 candidate boxes are sorted in descending order of confidence. The candidate box with the highest confidence is output and removed from the candidate list; the IOU between this box and all remaining candidates is computed, and candidates whose IOU exceeds the threshold are deleted. The above is repeated until the candidate list is empty, and the output list is returned. Here the IOU of two regions is defined as the area of their intersection divided by the area of their union.
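The NMS procedure just described can be written as a short plain-Python sketch. Box format `(x1, y1, x2, y2)` and the function names are illustrative assumptions; the logic follows the steps above: sort by confidence, keep the best, drop overlapping candidates, repeat until the list is empty.

```python
import numpy as np

# Plain sketch of the NMS procedure described above.

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Return indices of kept boxes, highest confidence first."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        # delete candidates whose IOU with the kept box exceeds the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (150, 150, 200, 200)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the heavily overlapping second box is suppressed
```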
A model training and testing system is built, and the E-YOLO detection model is trained and tested on the PyCharm software platform. After the above steps are implemented, a high-performance workshop normative behavior detection model is generated for real-time monitoring of images collected in the workshop, iterating until an optimal model capable of classifying the 9 behavior classes is obtained; the model can directly and quickly predict the behavior of workshop staff within the image and raise an alarm for non-standard behaviors.
On the basis of the above embodiment, in another embodiment of the present invention, after the sample data set in the step (2.1) is input to the backbone network encoder, it is uniformly cut into a plurality of picture sets of the same size; the image information is gradually decomposed from high-resolution, low-dimensional images into low-resolution, multi-dimensional images, ensuring that no information is lost across resolutions and dimensions; a large combination of features is formed, and the features to be detected are preliminarily classified by color information and scale information.
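The lossless "high resolution, low dimension → low resolution, multi-dimension" decomposition above can be illustrated with a YOLOv5-style Focus slice; treating the slicing operation this way is an assumption about the patent's wording, and the names here are illustrative.

```python
import numpy as np

# Sketch of the slicing operation: every second pixel goes to one of four
# sub-images, which are concatenated on the channel axis, so H x W x C
# becomes H/2 x W/2 x 4C. Every pixel survives, matching the "no information
# is lost" property described above.

def focus_slice(x: np.ndarray) -> np.ndarray:
    """x: (H, W, C) with even H and W -> (H/2, W/2, 4C)."""
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )

img = np.arange(4 * 4 * 3).reshape(4, 4, 3)
out = focus_slice(img)
print(out.shape)  # (2, 2, 12): half the resolution, four times the channels
```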
On the basis of the above embodiment, in another embodiment of the present invention, the backbone network uses a 1×1 convolution followed by a group of 3×3 convolutions to form a residual block as the basic structural unit; the feature extraction modules are formed by stacking residual blocks, and before each feature extraction module, downsampling is performed by a 3×3 convolution with a stride of 2 to reduce the resolution of the feature map.
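The resolution bookkeeping implied by the stride-2 downsampling can be checked with the standard convolution output-size formula. The input size of 256 and padding of 1 are assumptions for illustration, not taken from the patent.

```python
# Resolution bookkeeping for the backbone described above: before each
# feature-extraction module, a 3x3 convolution with stride 2 (padding 1
# assumed) halves the feature-map resolution.

def conv_out(size: int, kernel: int = 3, stride: int = 2, pad: int = 1) -> int:
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

def backbone_resolutions(input_size: int = 256, stages: int = 4):
    sizes = [input_size]
    for _ in range(stages):
        sizes.append(conv_out(sizes[-1]))
    return sizes

print(backbone_resolutions())  # [256, 128, 64, 32, 16]
```

Under these assumptions the last three stages land on 64, 32 and 16, consistent with the three grid scales used by the classification regression network.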
On the basis of the above embodiment, in another embodiment of the present invention, the samples in the workshop behavior sample data set in the step (1) include open-source data and real-time data. The real-time data comprises videos of non-standard human behaviors captured in real time from real workshop scenes; these videos are framed into multi-frame images. The open-source data and real-time data are mixed to form a data set containing JPG pictures and corresponding JSON labels, and the data set is then augmented by data enhancement methods including mirroring, brightness adjustment, flipping and rotation until the required number of samples is reached.
The core algorithm of the invention is based on the deep learning network YOLOv5s, and a prerequisite for the high performance of deep learning is a large sample data set for training. A helmet detection data set with a data volume of about 8,000 images can currently be downloaded from an open-source website (https://github.com/njvisionpower/Safety-Helmet-Wearing-Dataset); this forms one part of the data set. The other part is based on real workshop scenes: videos of non-standard human behaviors are captured in real time through workshop surveillance, formatted into multi-frame images, and label boxes are annotated manually with the LabelImg tool. The open-source data and real-time data are mixed to form a data set containing JPG pictures and corresponding JSON labels; the data set is augmented by data enhancement methods such as mirroring, brightness adjustment, flipping and rotation, finally yielding about 50,000 samples for training.
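The four augmentation modes listed above can be sketched with numpy on JPG-style uint8 arrays. This is a minimal illustration; the brightness factor and function names are assumptions.

```python
import numpy as np

# Minimal numpy sketch of the augmentation modes listed above: mirroring,
# brightness adjustment, flipping and rotation.

def mirror(img):                 # horizontal mirror
    return img[:, ::-1]

def adjust_brightness(img, factor=1.2):
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def flip(img):                   # vertical flip
    return img[::-1]

def rotate90(img, k=1):          # rotation in 90-degree steps
    return np.rot90(img, k)

def augment(img):
    """Return the original image plus one sample per enhancement mode."""
    return [img, mirror(img), adjust_brightness(img), flip(img), rotate90(img)]

sample = np.zeros((64, 64, 3), dtype=np.uint8)
print(len(augment(sample)))  # 5 images from 1
```

Applying even this small set of transforms multiplies the sample count several-fold, which is how roughly 8,000 collected images can be grown toward the 50,000 training samples mentioned above.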
Specifically, the labeling software LabelImg is used to mark different behaviors with label boxes. The workshop behaviors are mainly divided into: (1) helmet worn correctly (Helmet_OK, dark green label box); (2) helmet worn incorrectly / chin strap unbuckled (Helmet_Warn1, dark yellow label box); (3) helmet not worn (Helmet_NotOK, dark red label box); (4) helmet worn with hair left loose (Helmet_Warn2, light yellow label box); (5) playing with a mobile phone (Phone_NotOK, pink label box); (6) non-standard clothing / large area of skin exposed (Wear_NotOK, dark red label box); and, adding samples of mask-related behavior: (7) mask worn correctly (Mask_OK, blue label box); (8) mask worn incorrectly (Mask_Warn, orange label box); (9) mask not worn (Mask_NotOK, black label box). The open-source data set is mixed with the data set collected in real scenes, a large target detection sample data set covering the 9 workshop behavior classes is constructed by data set augmentation, and it is divided proportionally into a training set and a test set.
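The nine classes and their label-box colors can be collected in a lookup table for the alarm logic. The `should_alarm` helper and its suffix convention are illustrative assumptions, not part of the patent.

```python
# The nine behavior classes listed above, keyed by their label names.
BEHAVIOR_CLASSES = {
    "Helmet_OK":    ("helmet worn correctly",             "dark green"),
    "Helmet_Warn1": ("helmet worn, chin strap unbuckled", "dark yellow"),
    "Helmet_NotOK": ("helmet not worn",                   "dark red"),
    "Helmet_Warn2": ("helmet worn, hair left loose",      "light yellow"),
    "Phone_NotOK":  ("playing with mobile phone",         "pink"),
    "Wear_NotOK":   ("non-standard clothing",             "dark red"),
    "Mask_OK":      ("mask worn correctly",               "blue"),
    "Mask_Warn":    ("mask worn incorrectly",             "orange"),
    "Mask_NotOK":   ("mask not worn",                     "black"),
}

# Assumed convention: only the non-"_OK" classes should trigger an alarm.
def should_alarm(label: str) -> bool:
    return not label.endswith("_OK")

print(len(BEHAVIOR_CLASSES))  # 9 classes
print([c for c in BEHAVIOR_CLASSES if should_alarm(c)])
```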
The output features of the last three convolutional stages of the backbone network, namely C3 ∈ R^(64×64×128), C4 ∈ R^(32×32×256) and C5 ∈ R^(16×16×256), are selected and input into the feature fusion module to further extract features. The fusion module is composed of a feature efficient fusion module and two parallel efficient attention modules. The feature efficient fusion module enhances the relationships between the different behavior features captured by the network and better learns multi-scale spatial context information. The HAE module distinguishes effective channel semantic features and can suppress noise to some extent. The three effective prediction feature maps output by the feature fusion module are convolved with 1×1 kernels so that the classification regression network predicts the class, confidence and position of the target. In the testing stage, redundant prediction boxes are screened out by NMS and the final detection result is output. In the training stage, the total loss is formed as the weighted sum of the classification loss, the confidence loss and the position regression loss; the gradient of each parameter is computed by back propagation and passed to an optimizer, which iteratively updates the model weights, and the trained weights are retained for testing.
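The channel-attention computation used by the HAE module (global average pooling, a fully connected layer, a Sigmoid, then element-wise re-weighting, as in the A_eSE formula of the claims) can be sketched in numpy. The random weight stands in for a learned parameter; shapes follow a C×H×W convention and all names are illustrative.

```python
import numpy as np

# Numpy sketch of the efficient channel-attention (eSE-style) computation:
# F_gap averages each channel, W_C is the fully connected layer, sigma is
# the Sigmoid, and the resulting per-channel weights re-scale the input.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ese_attention(x: np.ndarray, w_c: np.ndarray) -> np.ndarray:
    """x: (C, H, W) feature map, w_c: (C, C) FC weight -> refined (C, H, W)."""
    f_gap = x.mean(axis=(1, 2))      # F_gap: global average pooling, shape (C,)
    a_ese = sigmoid(w_c @ f_gap)     # A_eSE: per-channel attention in (0, 1)
    return x * a_ese[:, None, None]  # X_refine: element-wise re-weighting

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16, 16))
w_c = rng.standard_normal((128, 128)) * 0.01
x_refine = ese_attention(x, w_c)
print(x_refine.shape)  # (128, 16, 16): same shape, channels re-weighted
```

Because each attention weight lies in (0, 1), every channel of the output is a damped copy of the input, which is how noisy channels are suppressed while informative ones pass through.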
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A YOLO-based workshop normative behavior monitoring method, characterized by comprising the following steps:
(1) constructing a workshop behavior sample data set;
(2) constructing an E-YOLO target detection network comprising an encoder, a decoder and a classification regression network, and performing behavior feature learning, wherein the encoder is based on a YOLO backbone network, and the decoder constructs a high-efficiency decoding network;
(2.1) inputting the workshop behavior sample data set into the backbone network encoder, performing a slicing operation on the input samples and reducing the picture size to form a low-level feature map; then extracting image features from the low-level feature map through a feature extraction module to form a middle-level feature map; forming and fusing features over multiple receptive fields by multi-scale pooling, and learning the multi-scale features of the target to form a top-level feature map;
(2.2) the efficient decoding network receives the middle-level feature map and the top-level feature map from the step (2.1), combines the feature information to form a feature classification reference standard, performs refined learning for similar behavior detection boxes, screens behavior-suspicious regions according to the main features of each image, accurately locates and distinguishes similar features, and further determines the feature differences among the suspicious regions to ensure classification correctness;
(2.3) the classification regression network receives the effective prediction feature maps from the decoder, assigns weights and detection-box position predictions to the possible behaviors in each picture set, and fine-tunes internal parameters to obtain a trained detection model;
(3) acquiring real-time monitoring image information of the workshop, and using the detection model obtained in the step (2.3) to identify and detect the images to be identified, completing the monitoring and early warning of non-standard workshop behaviors.
2. The YOLO-based workshop normative behavior monitoring method according to claim 1, wherein the efficient decoding network decoder comprises a feature efficient fusion module comprising two convolution branches built in parallel; the feature efficient fusion module in the step (2.2) receives the middle-level feature map and the top-level feature map; in each of the two parallel branches, the input features are first convolved with a 1×1 kernel to reduce the number of channels to half of the original number, then cross-channel information interaction is learned through another 1×1 convolution, then the relationships among scattered behavior features are captured over a relatively large range through a 7×7 convolution to learn features with strong representation capability, and finally the two branches are each processed by a 1×1 convolution and concatenated.
3. The YOLO-based workshop normative behavior monitoring method according to claim 2, wherein the efficient decoding network decoder further comprises an efficient attention module for learning differences within a small range to perform quality screening of the features; the efficient attention module in the step (2.2) receives the information processed by the feature efficient fusion module, learns the specific features through global average pooling, and re-adjusts the input feature map through a fully connected layer and a Sigmoid function, finally achieving the function of extracting useful channel information.
4. The method of claim 3, wherein, denoting the input feature map by X_i ∈ R^(C×W×H), the effective channel attention map A_eSE(X_i) ∈ R^(C×1×1) is calculated as:

A_eSE(X_i) = σ(W_C(F_gap(X_i)))

F_gap(X_i) = (1/(W·H)) Σ_{i=1..W} Σ_{j=1..H} X_{i,j}

wherein F_gap(X_i) is the global average pooling of the channel information, W_C is the weight of the fully connected layer, σ is the Sigmoid function, and X_{i,j} ranges over all matrix elements; the input X_i is a multi-scale feature map drawn from the middle-level feature map and the top-level feature map. A_eSE(X_i) is applied as channel-feature attention to the multi-scale feature map X_i, so that the multi-scale feature X_i becomes more informative; the refined output X_refine is obtained by multiplying the input feature map X_i element by element with the A_eSE(X_i) produced by the efficient attention module, thereby assigning a pixel-by-pixel weight to each input X_i and realizing feature re-screening.
5. The YOLO-based workshop specification behavior monitoring method of claim 4, wherein the feature efficient fusion module comprises a feature re-fusion module that re-fuses the useful features screened by the efficient attention module.
6. The YOLO-based workshop normative behavior monitoring method according to claim 5, wherein the feature re-fusion module outputs effective prediction feature maps at three scales after processing; the classification regression network in the step (2.3) divides the three scale feature layers into grids of 64×64, 32×32 and 16×16 cells respectively, then applies a convolution to the effective prediction feature maps to adjust the channels, performs classification regression prediction on the position, confidence and category of each Bounding box, and removes overlapping boxes through NMS to obtain the final output detection result; in the training stage, the total network loss comprises the sum of the classification loss, the confidence loss and the position regression loss, wherein the confidence loss and the classification loss adopt binary cross-entropy loss and the position regression loss adopts CIOU loss; training is finished after the loss function converges, and the optimal weights are retained for behavior detection.
7. The YOLO-based workshop normative behavior monitoring method according to any one of claims 1 to 6, wherein in the step (3) the obtained trained model is used to detect workshop images collected in real time in a sliding-window manner; each window is assigned a behavior weight, and the behavior prediction weights of all sliding windows are integrated to produce detection boxes, yielding the workshop behavior detection result.
8. The YOLO-based workshop normative behavior monitoring method according to claim 7, wherein after the sample data set in the step (2.1) is input into the backbone network encoder, it is uniformly cut into a plurality of picture sets of consistent size; the image information is gradually decomposed from high-resolution, low-dimensional images into low-resolution, multi-dimensional images, ensuring that no information is lost across resolutions and dimensions; a large variety of features is formed, and the features to be detected are preliminarily classified by color information and scale information.
9. The YOLO-based plant normative behavior monitoring method of claim 8, wherein the backbone network uses 1x1 convolution followed by a set of 3 x3 convolutions to form residual blocks as basic structural units, the feature extraction modules are formed by stacked residual blocks, and downsampling is performed by using 3 x3 convolutions with a step size of 2 before each feature module is extracted, so as to reduce the resolution of the feature map.
10. The YOLO-based workshop normative behavior monitoring method according to claim 1, wherein the samples in the workshop behavior sample data set in the step (1) comprise open-source data and real-time data; the real-time data comprises videos of non-standard human behaviors captured in real time from real workshop scenes, the videos being formatted to form multi-frame images; the open-source data and the real-time data are mixed to form a data set comprising JPG pictures and corresponding JSON labels, and the data set is augmented by data enhancement modes comprising mirroring, brightness adjustment, flipping and rotation until the required number of samples is reached.
CN202210087600.0A 2022-01-25 2022-01-25 YOLO-based workshop normative behavior monitoring method Pending CN114648714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210087600.0A CN114648714A (en) 2022-01-25 2022-01-25 YOLO-based workshop normative behavior monitoring method


Publications (1)

Publication Number Publication Date
CN114648714A true CN114648714A (en) 2022-06-21

Family

ID=81992812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210087600.0A Pending CN114648714A (en) 2022-01-25 2022-01-25 YOLO-based workshop normative behavior monitoring method

Country Status (1)

Country Link
CN (1) CN114648714A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272987A (en) * 2022-07-07 2022-11-01 淮阴工学院 MSA-yolk 5-based vehicle detection method and device in severe weather
CN115272987B (en) * 2022-07-07 2023-08-22 淮阴工学院 MSA-Yolov 5-based vehicle detection method and device in severe weather
CN115223206A (en) * 2022-09-19 2022-10-21 季华实验室 Working clothes wearing condition detection method and device, electronic equipment and storage medium
CN115410012A (en) * 2022-11-02 2022-11-29 中国民航大学 Method and system for detecting infrared small target in night airport clear airspace and application
CN115410012B (en) * 2022-11-02 2023-02-28 中国民航大学 Method and system for detecting infrared small target in night airport clear airspace and application
CN115546652A (en) * 2022-11-29 2022-12-30 城云科技(中国)有限公司 Multi-time-state target detection model and construction method, device and application thereof
CN117291997A (en) * 2023-11-24 2023-12-26 无锡车联天下信息技术有限公司 Method for calibrating corner points of monitoring picture of vehicle-mounted monitoring system
CN117291997B (en) * 2023-11-24 2024-01-26 无锡车联天下信息技术有限公司 Method for calibrating corner points of monitoring picture of vehicle-mounted monitoring system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination