CN112883755A - Smoking and calling detection method based on deep learning and behavior prior - Google Patents

Smoking and calling detection method based on deep learning and behavior prior

Info

Publication number
CN112883755A
CN112883755A (application number CN201911196057.2A)
Authority
CN
China
Prior art keywords
behavior
smoking
calling
face
behaviors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911196057.2A
Other languages
Chinese (zh)
Inventor
徐望明 (Xu Wangming)
徐天赐 (Xu Tianci)
李传东 (Li Chuandong)
伍世虔 (Wu Shiqian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Science and Engineering WUSE
Wuhan University of Science and Technology WHUST
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN201911196057.2A
Publication of CN112883755A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a smoking and calling detection method based on deep learning and behavior prior, belonging to the fields of safety supervision and image processing and analysis. The method comprises an offline process and an online process. The offline process trains a multi-task object-detection deep convolutional neural network on a self-built image dataset of smoking and calling behaviors. The online process performs face detection on an input image or video frame and then runs forward inference with the trained deep network model: it first makes a preliminary prediction of the label, confidence and position of a smoking or calling behavior, while also predicting the labels, confidences and positions of the specific objects associated with these behaviors (hands, cigarettes, mobile phones, etc.), and then applies logical inference rules, established from prior knowledge of how the behaviors occur, to the predicted information to further judge whether a smoking or calling behavior occurs.

Description

Smoking and calling detection method based on deep learning and behavior prior
Technical Field
The invention belongs to the fields of safety supervision and image processing and analysis, and in particular relates to a smoking and calling detection method based on deep learning and behavior prior.
Background
Smoking and making phone calls are strictly prohibited at gas stations, in special laboratories, on factory and construction sites, and while driving, and they are among the behaviors most closely watched in safety management. Traditional video surveillance mainly relies on staff continuously watching the monitored picture, or on recording video and identifying behaviors in after-the-fact playback; because of limited manpower and low efficiency, such prohibited behaviors are difficult to monitor effectively around the clock. Intelligent analysis and detection based on machine vision has therefore become a trend, offering better real-time performance and higher efficiency than traditional manual video monitoring. Conventional methods extract hand-crafted visual features from the captured video frames or images and then discriminate behaviors with a classifier; because the feature-extraction algorithms are designed manually, their discriminative power is limited, and human-behavior detection results are unreliable in complex real scenes. In recent years, with the development of deep learning, deep convolutional neural networks have been used to learn visual features automatically from large amounts of image data to characterize behaviors and thereby achieve end-to-end behavior detection; for example, Liu Chiqi et al. proposed an abnormal-behavior detection method based on the YOLO network model (see Electronic Design Engineering, 2018, Vol. 26, No. 20, pp. 154-158). Compared with traditional behavior-detection methods, deep learning has great advantages in this field, but a deep model's performance depends on its training set. Because human behaviors such as smoking and calling vary greatly in actual appearance, a general training set can hardly cover every situation; samples are often insufficient and unevenly distributed, and behavior training sets are hard to label to a uniform standard, so end-to-end deep-learning prediction easily misses behaviors or produces false detections.
Disclosure of Invention
To overcome the above shortcomings, the invention provides a smoking and calling detection method based on deep learning and behavior prior. The method comprises an offline process and an online process. The offline process trains a multi-task object-detection deep convolutional neural network on a self-built image dataset of smoking and calling behaviors. The online process performs face detection on an input image or video frame and then runs forward inference with the trained deep network model: it first makes a preliminary prediction of the label, confidence and position of a smoking or calling behavior, while also predicting the labels, confidences and positions of the specific objects associated with these behaviors (hands, cigarettes, mobile phones, etc.), and then applies logical inference rules, established from prior knowledge of how the behaviors occur, to the predicted information to further judge whether a smoking or calling behavior occurs.
Specifically, in the smoking and calling detection method based on deep learning and behavior prior provided by the invention, the offline process comprises the following steps:
the method comprises the following steps: collecting training videos or images, and screening out video frames or images containing face information by using a face detection method to serve as effective training samples; step two: labeling the screened effective training samples, wherein the effective training samples comprise labels and corresponding bounding box information of smoking, calling or normal behaviors, and labels and corresponding bounding box information of targets related to the smoking and calling behaviors, namely human hands, cigarettes, mobile phones and the like; step three: obtaining more samples by using a data enhancement means for the marked samples, and forming a training sample set together; step four: and training by using all training samples and labeling information based on a deep learning principle to obtain a multi-task target detection deep convolutional neural network.
In this technical solution, the data acquisition method of step 1 is to record people's behaviors in different indoor and outdoor places and under different lighting conditions: videos of different people smoking or making phone calls are recorded, and some videos without smoking or calling are recorded as normal-behavior samples; in addition, images downloaded from the internet, or images photographed directly for the different behaviors, can also be used as training data. To establish the association between behaviors and people, and considering the redundancy between consecutive video frames, the data screening method samples 1 frame every few frames of a video file and processes it with a face detection algorithm (image files are processed with the face detection algorithm directly), keeping only the images in which a face can be detected as valid training samples.
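As a concrete illustration of this screening step, the following Python sketch samples 1 frame every N frames of a video and keeps only the frames in which a face is detected. The patent does not name a specific face detector, so the OpenCV Haar cascade and the helper name used here are assumptions for illustration, not the patent's own code.

    import cv2

    def collect_valid_samples(video_path, every_n=10):
        # illustrative helper: sample 1 frame every `every_n` frames and
        # keep only the frames in which a face is detected
        face_det = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        cap = cv2.VideoCapture(video_path)
        samples, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                faces = face_det.detectMultiScale(gray, 1.1, 5)
                if len(faces) > 0:  # valid training sample: a face is present
                    samples.append(frame)
            idx += 1
        cap.release()
        return samples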
In this technical solution, the valid training samples are annotated in step 2 as follows. On one hand, behavior information is annotated: a larger image region containing the face is framed as the behavior bounding box; when smoking or calling occurs, the label is set to smoking or calling respectively, otherwise the sample is regarded as normal behavior and the label is set to normal. On the other hand, the objects associated with smoking and calling are annotated: when hands, cigarettes or mobile phones appear in the image, their bounding boxes are marked and the labels are set to hand, cigarette and phone accordingly.
In this technical solution, the data augmentation methods used in step 3 include image scaling, horizontal mirror flipping, and random brightness and hue adjustment; the label information of each behavior or object is kept unchanged while the bounding-box coordinates are updated according to the corresponding geometric transformation.
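A minimal sketch of one such augmentation, horizontal mirror flipping with the bounding-box update described above, is given below; the helper name and the (label, cx, cy, w, h) center-point box convention are assumptions for illustration.

    import cv2

    def hflip_with_boxes(image, boxes):
        # boxes: list of (label, cx, cy, w, h) with center-point coordinates;
        # labels stay unchanged, only the geometry is transformed
        flipped = cv2.flip(image, 1)  # flag 1 = horizontal mirror
        img_w = image.shape[1]
        new_boxes = [(label, img_w - cx, cy, w, h)  # a mirror only moves cx
                     for (label, cx, cy, w, h) in boxes]
        return flipped, new_boxes

The same pattern applies to scaling (multiply the box coordinates by the scale factors), while brightness and hue adjustments leave the boxes untouched.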
In this technical solution, the multi-task object-detection network used in step 4 can be adapted from an existing network structure in the field, such as the Fast/Faster R-CNN, SSD or YOLO series, with a shared backbone, so that a behavior-detection classifier and a corresponding object-detection classifier are trained simultaneously: the behavior-detection classifier predicts the label, confidence and position of smoking, calling or normal behavior, and the object-detection classifier predicts the label, confidence and position of hands, cigarettes or mobile phones. Behavior detection is thus also treated as an object-detection problem, and the two tasks use the same form of loss function during training.
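The following PyTorch sketch shows the shared-backbone, two-head structure in its simplest form; the ResNet-18 backbone, the single prediction per grid cell and the head layout are assumptions for illustration, and a real SSD/YOLO-style adaptation would add anchors and multi-scale feature maps.

    import torch.nn as nn
    import torchvision

    class MultiTaskDetector(nn.Module):
        def __init__(self, n_behaviors=3, n_objects=3):
            super().__init__()
            resnet = torchvision.models.resnet18(weights=None)
            # shared backbone: all layers up to the final feature map
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])
            # per grid cell, each head predicts class scores plus
            # (confidence, x, y, w, h), hence the "+ 5"
            self.behavior_head = nn.Conv2d(512, n_behaviors + 5, kernel_size=1)
            self.object_head = nn.Conv2d(512, n_objects + 5, kernel_size=1)

        def forward(self, x):
            feats = self.backbone(x)
            return self.behavior_head(feats), self.object_head(feats)

Because both heads emit the same kind of output, the same detection loss (e.g. a YOLO-style localization plus classification loss) can be applied to each, matching the statement that the two tasks share the same loss function.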
In the smoking and calling detection method based on deep learning and behavior prior provided by the invention, the online process comprises the following steps:
the method comprises the following steps: screening out video frames or images containing face information as effective test samples by using a face detection method for input monitoring videos or single images; step two: the effective test samples are sent to a multi-task target detection network trained in an off-line process for forward reasoning, and meanwhile, behaviors, namely smoking, calling or normal behaviors, and targets related to the behaviors, namely labels, confidence degrees and position information of human hands, cigarettes, mobile phones and the like are predicted; step three: and establishing a logic inference rule between the prediction information according to the prior knowledge when the behaviors occur, and further judging whether smoking or calling behaviors occur.
In this technical solution, step 1 uses the same face detection method as the offline process; the video frames or images containing face information are fed into the deep network model as valid test samples for forward inference, and the position of the face is recorded for the logical reasoning of step 3.
In this technical solution, when step 2 performs forward inference on a valid test sample with the trained deep network model, it simultaneously predicts a behavior label L ∈ {smoking, calling, normal} with confidence p0 and position (x, y, w, h), i.e. the abscissa and ordinate of the center of the behavior detection box and its width and height, and a behavior-related object label L' ∈ {hand, cigarette, phone} with confidence p0' and position (x', y', w', h'), i.e. the abscissa and ordinate of the center of the object detection box and its width and height.
In this technical solution, the prior knowledge about smoking and calling behaviors used in step 3 includes: (1) the predicted behavior box should contain a face region; when several people appear in the image at the same time, the face contained in the behavior box determines which person the behavior is attributed to; (2) when smoking or calling actually occurs, the relative positions of the face, the hand and the article (cigarette or mobile phone) satisfy certain constraints; when the confidence of the behavior label predicted by the trained network model is low, or an actually occurring behavior is missed or falsely detected, these constraints can be used to establish logical inference rules based on the behavior prior for further judgment.
Let Dist(face, object), Dist(hand, object) and Dist(face, hand) denote, respectively, the distance between the face and the article (cigarette or mobile phone), between the hand and the article, and between the face and the hand; these distances are computed between the center points of the corresponding detection boxes. The likelihood of smoking or calling occurring in the image is associated with this distance information; since absolute pixel distances change with image scale, the side length Len(face) of the detected square face box is used as the reference distance, and the following rules are established:
(1) When Dist(face, object) ≤ a·Len(face), the confidence of a smoking or calling behavior occurring is increased by p1;
(2) When Dist(hand, object) ≤ b·Len(face), the confidence of a smoking or calling behavior occurring is increased by p2;
(3) When Dist(face, hand) ≤ c·Len(face), the confidence of a smoking or calling behavior occurring is increased by p3.
The parameters a, b and c can be determined by first performing statistical analysis on the annotation information of the training samples and then fine-tuning them empirically; the parameters p1, p2, p3 are set empirically according to how strongly each condition contributes to the occurrence of a smoking or calling behavior, with p1 ≥ p2 ≥ p3 ≥ 0 and p1 + p2 + p3 = 1 when all 3 conditions are satisfied simultaneously;
When judging whether a specific behavior (smoking or calling) occurs in an image, let L denote the behavior label and T the confidence threshold for that behavior; the behavior and related-object labels, confidences and positions predicted by the detection network are handled in three cases:
(1) When the detection result predicts the specific behavior label L with a high confidence p0, i.e. p0 > T, behavior L is directly judged to have occurred;
(2) When the detection result predicts the behavior label L with a low confidence p0, i.e. p0 ≤ T, whether behavior L occurs is judged again from the distance relations with the related objects. The rule is: compute the distance information from the predicted positions, determine which of the 3 distance conditions above are satisfied to obtain the corresponding confidence increments among p1, p2, p3, and revise the confidence of behavior L to p0 + p1 + p2 + p3; if the revised confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred;
(3) When the detection result does not predict the behavior label L at all, i.e. p0 = 0, whether behavior L occurs must again be judged from the distance relations with the behavior-related objects. The rule is: compute the distance information from the predicted positions, determine which of the 3 conditions are satisfied to obtain the confidence increments among p1, p2, p3, and compute the confidence of behavior L as p1 + p2 + p3; if this confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred.
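The three cases can be summarized in the following Python sketch of the prior-based decision rule; the function and parameter names are illustrative, the values of a, b and c stand in for values obtained from training-set statistics, and p1, p2, p3 follow the example values given later in the detailed description (0.5, 0.4, 0.1).

    import math

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def behavior_occurs(p0, T, face, hand, obj, len_face,
                        a=0.8, b=0.8, c=1.0, p1=0.5, p2=0.4, p3=0.1):
        # face/hand/obj: (x, y) detection-box centers, or None if undetected;
        # a, b, c are assumed placeholder values
        if p0 > T:  # case (1): confident direct prediction
            return True
        inc = 0.0   # cases (2) and (3): apply the 3 distance rules
        if face and obj and dist(face, obj) <= a * len_face:
            inc += p1
        if hand and obj and dist(hand, obj) <= b * len_face:
            inc += p2
        if face and hand and dist(face, hand) <= c * len_face:
            inc += p3
        return p0 + inc > T  # revised confidence against the threshold

For instance, with T = 0.6 and a missed prediction (p0 = 0), satisfying the face-object and hand-object conditions alone gives a revised confidence of 0.5 + 0.4 = 0.9 > T, so the behavior is judged to have occurred.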
The smoking and calling detection method based on deep learning and behavior prior provided by the invention has the following beneficial effects: (1) the offline process is highly practicable: on-site video or image acquisition and timely model training can be carried out for a specific application site, enabling rapid deployment, so the method is easy to popularize in practical systems; (2) a multi-task object-detection model is trained with deep learning, overcoming the weak discriminative power of the hand-crafted features used in traditional methods; at the same time, logical inference rules built on the behavior prior further analyze the deep network's preliminary predictions, which helps reduce the missed and false detections that a deep-network behavior detector used alone is prone to, improving the reliability of safety monitoring in practical behavior-monitoring applications; (3) by simply re-collecting data and retraining the model for the target application and establishing new behavior-prior inference rules, the method can be conveniently adapted and popularized to detect other human behaviors.
Drawings
FIG. 1 is a flow chart of the smoking and call detection method based on deep learning and behavior prior of the present invention
FIG. 2 is a logic inference diagram of the smoking and calling detection method based on deep learning and behavior prior of the present invention
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples, but the examples should not be construed as limiting the invention.
Referring to FIG. 1, the smoking and calling detection method based on deep learning and behavior prior provided by the invention comprises an offline process and an online process. The offline process trains a multi-task object-detection deep convolutional neural network on a self-built image dataset of smoking and calling behaviors. The online process performs face detection on an input image or video frame and then runs forward inference with the trained deep network model: it first makes a preliminary prediction of the label, confidence and position of a smoking or calling behavior, while also predicting the labels, confidences and positions of the specific objects associated with these behaviors (hands, cigarettes, mobile phones, etc.), and then applies logical inference rules, established from prior knowledge of how the behaviors occur, to further judge whether a smoking or calling behavior occurs.
Specifically, the invention provides a smoking and calling detection method based on deep learning and behavior prior, and the off-line process comprises the following steps:
the method comprises the following steps: collecting training videos or images, and screening out video frames or images containing face information by using a face detection method to serve as effective training samples; step two: labeling the screened effective training samples, wherein the effective training samples comprise labels and corresponding bounding box information of smoking, calling or normal behaviors, and labels and corresponding bounding box information of targets related to the smoking and calling behaviors, namely human hands, cigarettes, mobile phones and the like; step three: obtaining more samples by using a data enhancement means for the marked samples, and forming a training sample set together; step four: and training by using all training samples and labeling information based on a deep learning principle to obtain a multi-task target detection deep convolutional neural network.
In this technical solution, the data acquisition method of step 1 is to record people's behaviors in different indoor and outdoor places and under different lighting conditions: videos of different people smoking or making phone calls are recorded, and some videos without smoking or calling are recorded as normal-behavior samples; in addition, images downloaded from the internet, or images photographed directly for the different behaviors, can also be used as training data. To establish the association between behaviors and people, and considering the redundancy between consecutive video frames, the data screening method samples 1 frame every few frames of a video file and processes it with a face detection algorithm (image files are processed with the face detection algorithm directly), keeping only the images in which a face can be detected as valid training samples.
In this technical solution, the valid training samples are annotated in step 2 as follows. On one hand, behavior information is annotated: a larger image region containing the face is framed as the behavior bounding box; when smoking or calling occurs, the label is set to smoking or calling respectively, otherwise the sample is regarded as normal behavior and the label is set to normal. On the other hand, the objects associated with smoking and calling are annotated: when hands, cigarettes or mobile phones appear in the image, their bounding boxes are marked and the labels are set to hand, cigarette and phone accordingly.
In this technical solution, the data augmentation methods used in step 3 include image scaling, horizontal mirror flipping, and random brightness and hue adjustment; the label information of each behavior or object is kept unchanged while the bounding-box coordinates are updated according to the corresponding geometric transformation.
In this technical solution, the multi-task object-detection network used in step 4 can be adapted from an existing network structure in the field, such as the Fast/Faster R-CNN, SSD or YOLO series, with a shared backbone, so that a behavior-detection classifier and a corresponding object-detection classifier are trained simultaneously: the behavior-detection classifier predicts the label, confidence and position of smoking, calling or normal behavior, and the object-detection classifier predicts the label, confidence and position of hands, cigarettes or mobile phones. Behavior detection is thus also treated as an object-detection problem, and the two tasks use the same form of loss function during training.
In the smoking and calling detection method based on deep learning and behavior prior provided by the invention, the online process comprises the following steps:
the method comprises the following steps: screening out video frames or images containing face information as effective test samples by using a face detection method for input monitoring videos or single images; step two: the effective test samples are sent to a multi-task target detection network trained in an off-line process for forward reasoning, and meanwhile, behaviors, namely smoking, calling or normal behaviors, and targets related to the behaviors, namely labels, confidence degrees and position information of human hands, cigarettes, mobile phones and the like are predicted; step three: and establishing a logic inference rule between the prediction information according to the prior knowledge when the behaviors occur, and further judging whether smoking or calling behaviors occur.
In this technical solution, step 1 uses the same face detection method as the offline process; the video frames or images containing face information are fed into the deep network model as valid test samples for forward inference, and the position of the face is recorded for the logical reasoning of step 3.
In this technical solution, when step 2 performs forward inference on a valid test sample with the trained deep network model, it simultaneously predicts a behavior label L ∈ {smoking, calling, normal} with confidence p0 and position (x, y, w, h), i.e. the abscissa and ordinate of the center of the behavior detection box and its width and height, and a behavior-related object label L' ∈ {hand, cigarette, phone} with confidence p0' and position (x', y', w', h'), i.e. the abscissa and ordinate of the center of the object detection box and its width and height.
In this technical solution, the prior knowledge about smoking and calling behaviors used in step 3 includes: (1) the predicted behavior box should contain a face region; when several people appear in the image at the same time, the face contained in the behavior box determines which person the behavior is attributed to; (2) when smoking or calling actually occurs, the relative positions of the face, the hand and the article (cigarette or mobile phone) satisfy certain constraints; when the confidence of the behavior label predicted by the trained network model is low, or an actually occurring behavior is missed or falsely detected, these constraints can be used to establish logical inference rules based on the behavior prior for further judgment.
Let Dist(face, object), Dist(hand, object) and Dist(face, hand) denote, respectively, the distance between the face and the article (cigarette or mobile phone), between the hand and the article, and between the face and the hand; these distances are computed between the center points of the corresponding detection boxes. The likelihood of smoking or calling occurring in the image is associated with this distance information; since absolute pixel distances change with image scale, the side length Len(face) of the detected square face box is used as the reference distance, and the following rules are established:
(1) When Dist(face, object) ≤ a·Len(face), the confidence of a smoking or calling behavior occurring is increased by p1;
(2) When Dist(hand, object) ≤ b·Len(face), the confidence of a smoking or calling behavior occurring is increased by p2;
(3) When Dist(face, hand) ≤ c·Len(face), the confidence of a smoking or calling behavior occurring is increased by p3.
The parameters a, b and c can be determined by first performing statistical analysis on the annotation information of the training samples and then fine-tuning them empirically; the parameters p1, p2, p3 are set empirically according to how strongly each condition contributes to the occurrence of a smoking or calling behavior, with p1 ≥ p2 ≥ p3 ≥ 0 and p1 + p2 + p3 = 1 when all 3 conditions are satisfied simultaneously; for example, p1 = 0.5, p2 = 0.4, p3 = 0.1 may be taken.
As shown in FIG. 2, when judging whether a specific behavior (smoking or calling) occurs in an image, let L denote the behavior label and T the confidence threshold for that behavior; the behavior and related-object labels, confidences and positions predicted by the detection network are handled in three cases:
(1) When the detection result predicts the specific behavior label L with a high confidence p0, i.e. p0 > T, behavior L is directly judged to have occurred;
(2) When the detection result predicts the behavior label L with a low confidence p0, i.e. p0 ≤ T, whether behavior L occurs is judged again from the distance relations with the related objects; the rule (i.e. rule 2 in FIG. 2) is: compute the distance information from the predicted positions, determine which of the 3 distance conditions above are satisfied to obtain the corresponding confidence increments among p1, p2, p3, and revise the confidence of behavior L to p0 + p1 + p2 + p3; if the revised confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred;
(3) When the detection result does not predict the behavior label L at all, i.e. p0 = 0, whether behavior L occurs must again be judged from the distance relations with the behavior-related objects; the rule (i.e. rule 1 in FIG. 2) is: compute the distance information from the predicted positions, determine which of the 3 conditions are satisfied to obtain the confidence increments among p1, p2, p3, and compute the confidence of behavior L as p1 + p2 + p3; if this confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Details not described in this specification belong to the common knowledge of those skilled in the art.

Claims (5)

1. A smoking and calling detection method based on deep learning and behavior prior, characterized in that the method is divided into an offline process and an online process; the offline process trains a multi-task object-detection deep convolutional neural network on a self-built image dataset of smoking and calling behaviors; the online process performs face detection on an input image or video frame and then runs forward inference with the trained deep network model, first making a preliminary prediction of the label, confidence and position of a smoking or calling behavior while also predicting the labels, confidences and positions of the specific objects associated with these behaviors (hands, cigarettes, mobile phones, etc.), and then applying logical inference rules, established from prior knowledge of how the behaviors occur, to the predicted information to further judge whether a smoking or calling behavior occurs.
2. The method of claim 1, wherein the offline process comprises the following steps: step 1, acquiring training videos or images and, using a face detection method, screening out the video frames or images containing face information as valid training samples; step 2, annotating the screened valid training samples with the labels and corresponding bounding boxes of smoking, calling or normal behaviors, and with the labels and corresponding bounding boxes of the objects associated with smoking and calling (hands, cigarettes, mobile phones, etc.); step 3, applying data augmentation to the annotated samples to obtain more samples, which together form the training sample set; and step 4, using all training samples and their annotations, training a multi-task object-detection deep convolutional neural network based on deep learning principles.
3. The smoking and calling detection method based on deep learning and behavior prior of claim 1, wherein, in the offline process of claim 2, the data acquisition method of step 1 is to record people's behaviors in different indoor and outdoor places and under different lighting conditions: videos of different people smoking or making phone calls are recorded, and some videos without smoking or calling are recorded as normal-behavior samples; in addition, images downloaded from the internet, or images photographed directly for the different behaviors, can also be used as training data; to establish the association between behaviors and people, and considering the redundancy between consecutive video frames, the data screening method samples 1 frame every few frames of a video file and processes it with a face detection algorithm (image files are processed with the face detection algorithm directly), keeping only the images in which a face can be detected as valid training samples;
in the offline process, the valid training samples are annotated in step 2 as follows: on one hand, behavior information is annotated: a larger image region containing the face is framed as the behavior bounding box; when smoking or calling occurs, the label is set to smoking or calling respectively, otherwise the sample is regarded as normal behavior and the label is set to normal; on the other hand, the objects associated with smoking and calling are annotated: when hands, cigarettes or mobile phones appear in the image, their bounding boxes are marked and the labels are set to hand, cigarette and phone accordingly;
in the offline process, the data augmentation methods used in step 3 include image scaling, horizontal mirror flipping, and random brightness and hue adjustment; the label information of each behavior or object is kept unchanged while the bounding-box coordinates are updated according to the corresponding geometric transformation;
in the offline process, the multi-task object-detection network used in step 4 can be adapted from an existing network structure in the field, such as the Fast/Faster R-CNN, SSD or YOLO series, with a shared backbone, so that a behavior-detection classifier and a corresponding object-detection classifier are trained simultaneously; the behavior-detection classifier predicts the label, confidence and position of smoking, calling or normal behavior, and the object-detection classifier predicts the label, confidence and position of hands, cigarettes or mobile phones; behavior detection is thus also treated as an object-detection problem, and the two tasks use the same form of loss function during training.
4. The smoking and calling detection method based on deep learning and behavior prior of claim 1, wherein the online process comprises the following steps: step 1, for the input surveillance video or single images, using a face detection method to screen out the video frames or images containing face information as valid test samples; step 2, feeding the valid test samples into the multi-task object-detection network trained in the offline process for forward inference, simultaneously predicting the labels, confidences and positions of the behaviors (smoking, calling or normal) and of the behavior-related objects (hands, cigarettes, mobile phones, etc.); step 3, establishing logical inference rules among the predicted information according to prior knowledge of how the behaviors occur, and further judging whether a smoking or calling behavior occurs.
5. The smoking and calling detection method based on deep learning and behavior prior of claim 1, wherein, in the online process of claim 4, step 1 uses the same face detection method as the offline process, sends the video frames or images containing face information into the deep network model as valid test samples for forward inference, and records the position of the face for the logical reasoning of step 3;
in the online process, when step 2 performs forward inference on a valid test sample with the trained deep network model, it simultaneously predicts a behavior label L ∈ {smoking, calling, normal} with confidence p0 and position (x, y, w, h), i.e. the abscissa and ordinate of the center of the behavior detection box and its width and height, and a behavior-related object label L' ∈ {hand, cigarette, phone} with confidence p0' and position (x', y', w', h'), i.e. the abscissa and ordinate of the center of the object detection box and its width and height;
in the online process, the prior knowledge about smoking and calling behaviors used in step 3 includes: (1) the predicted behavior box should contain a face region; when several people appear in the image at the same time, the face contained in the behavior box determines which person the behavior is attributed to; (2) when smoking or calling actually occurs, the relative positions of the face, the hand and the article (cigarette or mobile phone) satisfy certain constraints; when the confidence of the behavior label predicted by the trained network model is low, or an actually occurring behavior is missed or falsely detected, these constraints can be used to establish logical inference rules based on the behavior prior for further judgment;
let Dist(face, object), Dist(hand, object) and Dist(face, hand) denote, respectively, the distance between the face and the article (cigarette or mobile phone), between the hand and the article, and between the face and the hand; these distances are computed between the center points of the corresponding detection boxes; the likelihood of smoking or calling occurring in the image is associated with this distance information; since absolute pixel distances change with image scale, the side length Len(face) of the detected square face box is used as the reference distance, and the following rules are established:
(1) When Dist(face, object) ≤ a·Len(face), the confidence of a smoking or calling behavior occurring is increased by p1;
(2) When Dist(hand, object) ≤ b·Len(face), the confidence of a smoking or calling behavior occurring is increased by p2;
(3) When Dist(face, hand) ≤ c·Len(face), the confidence of a smoking or calling behavior occurring is increased by p3.
The parameters a, b and c can be determined by first performing statistical analysis on the annotation information of the training samples and then fine-tuning them empirically; the parameters p1, p2, p3 are set empirically according to how strongly each condition contributes to the occurrence of a smoking or calling behavior, with p1 ≥ p2 ≥ p3 ≥ 0 and p1 + p2 + p3 = 1 when all 3 conditions are satisfied simultaneously;
When judging whether a specific behavior (smoking or calling) occurs in an image, let L denote the behavior label and T the confidence threshold for that behavior; the behavior and related-object labels, confidences and positions predicted by the detection network are handled in three cases:
(1) When the detection result predicts the specific behavior label L with a high confidence p0, i.e. p0 > T, behavior L is directly judged to have occurred;
(2) When the detection result predicts the behavior label L with a low confidence p0, i.e. p0 ≤ T, whether behavior L occurs is judged again from the distance relations with the related objects; the rule is: compute the distance information from the predicted positions, determine which of the 3 distance conditions above are satisfied to obtain the corresponding confidence increments among p1, p2, p3, and revise the confidence of behavior L to p0 + p1 + p2 + p3; if the revised confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred;
(3) When the detection result does not predict the behavior label L at all, i.e. p0 = 0, whether behavior L occurs must again be judged from the distance relations with the behavior-related objects; the rule is: compute the distance information from the predicted positions, determine which of the 3 conditions are satisfied to obtain the confidence increments among p1, p2, p3, and compute the confidence of behavior L as p1 + p2 + p3; if this confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred.
CN201911196057.2A 2019-11-29 2019-11-29 Smoking and calling detection method based on deep learning and behavior prior Pending CN112883755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911196057.2A CN112883755A (en) 2019-11-29 2019-11-29 Smoking and calling detection method based on deep learning and behavior prior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911196057.2A CN112883755A (en) 2019-11-29 2019-11-29 Smoking and calling detection method based on deep learning and behavior prior

Publications (1)

Publication Number Publication Date
CN112883755A true CN112883755A (en) 2021-06-01

Family

ID=76038846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911196057.2A Pending CN112883755A (en) 2019-11-29 2019-11-29 Smoking and calling detection method based on deep learning and behavior prior

Country Status (1)

Country Link
CN (1) CN112883755A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591662A (en) * 2021-07-24 2021-11-02 深圳市铁越电气有限公司 Method, system and storage medium for recognizing smoking calling behavior
CN114067441A (en) * 2022-01-14 2022-02-18 合肥高维数据技术有限公司 Shooting and recording behavior detection method and system
CN114067441B (en) * 2022-01-14 2022-04-08 合肥高维数据技术有限公司 Shooting and recording behavior detection method and system
CN116580456A (en) * 2023-05-11 2023-08-11 中电金信软件有限公司 Behavior detection method, behavior detection device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US20200349875A1 (en) Display screen quality detection method, apparatus, electronic device and storage medium
CN106600888B (en) Automatic forest fire detection method and system
CN106534967B (en) Video clipping method and device
CN111047818A (en) Forest fire early warning system based on video image
CN112883755A (en) Smoking and calling detection method based on deep learning and behavior prior
CN106339657B (en) Crop straw burning monitoring method based on monitor video, device
CN109935080B (en) Monitoring system and method for real-time calculation of traffic flow on traffic line
CN110059761A (en) A kind of human body behavior prediction method and device
CN111223263A (en) Full-automatic comprehensive fire early warning response system
CN111222478A (en) Construction site safety protection detection method and system
CN102236947A (en) Flame monitoring method and system based on video camera
CN104966304A (en) Kalman filtering and nonparametric background model-based multi-target detection tracking method
CN115761537B (en) Power transmission line foreign matter intrusion identification method oriented to dynamic feature supplementing mechanism
CN116761049B (en) Household intelligent security monitoring method and system
CN110909703A (en) Detection method for chef cap in bright kitchen range scene based on artificial intelligence
CN113269039A (en) On-duty personnel behavior identification method and system
CN111476160A (en) Loss function optimization method, model training method, target detection method, and medium
CN117576632B (en) Multi-mode AI large model-based power grid monitoring fire early warning system and method
CN116385758A (en) Detection method for damage to surface of conveyor belt based on YOLOv5 network
CN116416281A (en) Grain depot AI video supervision and analysis method and system
CN118014327A (en) Real-time big data driven intelligent city management platform
CN113111866B (en) Intelligent monitoring management system and method based on video analysis
CN106780544B (en) The method and apparatus that display foreground extracts
CN107729811B (en) Night flame detection method based on scene modeling
CN117315719A (en) Safety helmet wearing identification method based on edge technology and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination