CN112883755A - Smoking and calling detection method based on deep learning and behavior prior - Google Patents

Smoking and calling detection method based on deep learning and behavior prior

Info

Publication number
CN112883755A
CN112883755A (application number CN201911196057.2A)
Authority
CN
China
Prior art keywords
behavior
smoking
calling
face
behaviors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911196057.2A
Other languages
Chinese (zh)
Inventor
徐望明 (Xu Wangming)
徐天赐 (Xu Tianci)
李传东 (Li Chuandong)
伍世虔 (Wu Shiqian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Science and Engineering WUSE
Wuhan University of Science and Technology WHUST
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN201911196057.2A
Publication of CN112883755A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a smoking and calling detection method based on deep learning and behavior prior, belonging to the fields of safety supervision and image processing and analysis. The method comprises an offline process and an online process. The offline process trains a multi-task object-detection deep convolutional neural network on a self-built image dataset of smoking and calling behaviors. The online process performs face detection on an input image or video frame and then runs forward inference with the trained deep network model: it first makes a preliminary prediction of the label, confidence and position of a smoking or calling behavior, while also predicting the labels, confidences and positions of the specific objects associated with these behaviors (hands, cigarettes, mobile phones, etc.), and then applies logical inference rules, established from prior knowledge of how the behaviors occur, to the predicted information to further judge whether a smoking or calling behavior occurs.

Description

Smoking and calling detection method based on deep learning and behavior prior
Technical Field
The invention belongs to the fields of safety supervision and image processing and analysis, and in particular relates to a smoking and calling detection method based on deep learning and behavior prior.
Background
Smoking and making phone calls are strictly prohibited at gas stations, in special laboratories, on factory and construction sites, and while driving, and they are among the behaviors most closely watched in safety management. Traditional video surveillance mainly relies on staff continuously watching the monitored picture, or on recording video and identifying behaviors in after-the-fact playback; because of limited manpower and low efficiency, such prohibited behaviors are difficult to monitor effectively around the clock. Intelligent analysis and detection based on machine vision has therefore become a trend, offering better real-time performance and higher efficiency than traditional manual video monitoring. Conventional methods extract hand-crafted visual features from the captured video frames or images and then discriminate behaviors with a classifier; because the feature-extraction algorithms are designed manually, their discriminative power is limited, and human-behavior detection results are unreliable in complex real scenes. In recent years, with the development of deep learning, deep convolutional neural networks have been used to learn visual features automatically from large amounts of image data to characterize behaviors and thereby achieve end-to-end behavior detection; for example, Liu Chiqi et al. proposed an abnormal-behavior detection method based on the YOLO network model (see Electronic Design Engineering, 2018, Vol. 26, No. 20, pp. 154-158). Compared with traditional behavior-detection methods, deep learning has great advantages in this field, but a deep model's performance depends on its training set. Because human behaviors such as smoking and calling vary greatly in actual appearance, a general training set can hardly cover every situation; samples are often insufficient and unevenly distributed, and behavior training sets are hard to label to a uniform standard, so end-to-end deep-learning prediction easily misses behaviors or produces false detections.
Disclosure of Invention
To overcome the above shortcomings, the invention provides a smoking and calling detection method based on deep learning and behavior prior. The method comprises an offline process and an online process. The offline process trains a multi-task object-detection deep convolutional neural network on a self-built image dataset of smoking and calling behaviors. The online process performs face detection on an input image or video frame and then runs forward inference with the trained deep network model: it first makes a preliminary prediction of the label, confidence and position of a smoking or calling behavior, while also predicting the labels, confidences and positions of the specific objects associated with these behaviors (hands, cigarettes, mobile phones, etc.), and then applies logical inference rules, established from prior knowledge of how the behaviors occur, to the predicted information to further judge whether a smoking or calling behavior occurs.
Specifically, in the smoking and calling detection method based on deep learning and behavior prior provided by the invention, the offline process comprises the following steps:
the method comprises the following steps: collecting training videos or images, and screening out video frames or images containing face information by using a face detection method to serve as effective training samples; step two: labeling the screened effective training samples, wherein the effective training samples comprise labels and corresponding bounding box information of smoking, calling or normal behaviors, and labels and corresponding bounding box information of targets related to the smoking and calling behaviors, namely human hands, cigarettes, mobile phones and the like; step three: obtaining more samples by using a data enhancement means for the marked samples, and forming a training sample set together; step four: and training by using all training samples and labeling information based on a deep learning principle to obtain a multi-task target detection deep convolutional neural network.
In this technical solution, the data acquisition method of step 1 is to record people's behaviors in different indoor and outdoor places and under different lighting conditions: videos of different people smoking or making phone calls are recorded, and some videos without smoking or calling are recorded as normal-behavior samples; in addition, images downloaded from the internet, or images photographed directly for the different behaviors, can also be used as training data. To establish the association between behaviors and people, and considering the redundancy between consecutive video frames, the data screening method samples 1 frame every few frames of a video file and processes it with a face detection algorithm (image files are processed with the face detection algorithm directly), keeping only the images in which a face can be detected as valid training samples.
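As a concrete illustration of this screening step, the following Python sketch samples 1 frame every N frames of a video and keeps only the frames in which a face is detected. The patent does not name a specific face detector, so the OpenCV Haar cascade and the helper name used here are assumptions for illustration, not the patent's own code.

    import cv2

    def collect_valid_samples(video_path, every_n=10):
        # illustrative helper: sample 1 frame every `every_n` frames and
        # keep only the frames in which a face is detected
        face_det = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        cap = cv2.VideoCapture(video_path)
        samples, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                faces = face_det.detectMultiScale(gray, 1.1, 5)
                if len(faces) > 0:  # valid training sample: a face is present
                    samples.append(frame)
            idx += 1
        cap.release()
        return samples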
In this technical solution, the valid training samples are annotated in step 2 as follows. On one hand, behavior information is annotated: a larger image region containing the face is framed as the behavior bounding box; when smoking or calling occurs, the label is set to smoking or calling respectively, otherwise the sample is regarded as normal behavior and the label is set to normal. On the other hand, the objects associated with smoking and calling are annotated: when hands, cigarettes or mobile phones appear in the image, their bounding boxes are marked and the labels are set to hand, cigarette and phone accordingly.
In this technical solution, the data augmentation methods used in step 3 include image scaling, horizontal mirror flipping, and random brightness and hue adjustment; the label information of each behavior or object is kept unchanged while the bounding-box coordinates are updated according to the corresponding geometric transformation.
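A minimal sketch of one such augmentation, horizontal mirror flipping with the bounding-box update described above, is given below; the helper name and the (label, cx, cy, w, h) center-point box convention are assumptions for illustration.

    import cv2

    def hflip_with_boxes(image, boxes):
        # boxes: list of (label, cx, cy, w, h) with center-point coordinates;
        # labels stay unchanged, only the geometry is transformed
        flipped = cv2.flip(image, 1)  # flag 1 = horizontal mirror
        img_w = image.shape[1]
        new_boxes = [(label, img_w - cx, cy, w, h)  # a mirror only moves cx
                     for (label, cx, cy, w, h) in boxes]
        return flipped, new_boxes

The same pattern applies to scaling (multiply the box coordinates by the scale factors), while brightness and hue adjustments leave the boxes untouched.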
In this technical solution, the multi-task object-detection network used in step 4 can be adapted from an existing network structure in the field, such as the Fast/Faster R-CNN, SSD or YOLO series, with a shared backbone, so that a behavior-detection classifier and a corresponding object-detection classifier are trained simultaneously: the behavior-detection classifier predicts the label, confidence and position of smoking, calling or normal behavior, and the object-detection classifier predicts the label, confidence and position of hands, cigarettes or mobile phones. Behavior detection is thus also treated as an object-detection problem, and the two tasks use the same form of loss function during training.
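The following PyTorch sketch shows the shared-backbone, two-head structure in its simplest form; the ResNet-18 backbone, the single prediction per grid cell and the head layout are assumptions for illustration, and a real SSD/YOLO-style adaptation would add anchors and multi-scale feature maps.

    import torch.nn as nn
    import torchvision

    class MultiTaskDetector(nn.Module):
        def __init__(self, n_behaviors=3, n_objects=3):
            super().__init__()
            resnet = torchvision.models.resnet18(weights=None)
            # shared backbone: all layers up to the final feature map
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])
            # per grid cell, each head predicts class scores plus
            # (confidence, x, y, w, h), hence the "+ 5"
            self.behavior_head = nn.Conv2d(512, n_behaviors + 5, kernel_size=1)
            self.object_head = nn.Conv2d(512, n_objects + 5, kernel_size=1)

        def forward(self, x):
            feats = self.backbone(x)
            return self.behavior_head(feats), self.object_head(feats)

Because both heads emit the same kind of output, the same detection loss (e.g. a YOLO-style localization plus classification loss) can be applied to each, matching the statement that the two tasks share the same loss function.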
In the smoking and calling detection method based on deep learning and behavior prior provided by the invention, the online process comprises the following steps:
the method comprises the following steps: screening out video frames or images containing face information as effective test samples by using a face detection method for input monitoring videos or single images; step two: the effective test samples are sent to a multi-task target detection network trained in an off-line process for forward reasoning, and meanwhile, behaviors, namely smoking, calling or normal behaviors, and targets related to the behaviors, namely labels, confidence degrees and position information of human hands, cigarettes, mobile phones and the like are predicted; step three: and establishing a logic inference rule between the prediction information according to the prior knowledge when the behaviors occur, and further judging whether smoking or calling behaviors occur.
In this technical solution, step 1 uses the same face detection method as the offline process; the video frames or images containing face information are fed into the deep network model as valid test samples for forward inference, and the position of the face is recorded for the logical reasoning of step 3.
In this technical solution, when step 2 performs forward inference on a valid test sample with the trained deep network model, it simultaneously predicts a behavior label L ∈ {smoking, calling, normal} with confidence p0 and position (x, y, w, h), i.e. the abscissa and ordinate of the center of the behavior detection box and its width and height, and a behavior-related object label L' ∈ {hand, cigarette, phone} with confidence p0' and position (x', y', w', h'), i.e. the abscissa and ordinate of the center of the object detection box and its width and height.
In this technical solution, the prior knowledge about smoking and calling behaviors used in step 3 includes: (1) the predicted behavior box should contain a face region; when several people appear in the image at the same time, the face contained in the behavior box determines which person the behavior is attributed to; (2) when smoking or calling actually occurs, the relative positions of the face, the hand and the article (cigarette or mobile phone) satisfy certain constraints; when the confidence of the behavior label predicted by the trained network model is low, or an actually occurring behavior is missed or falsely detected, these constraints can be used to establish logical inference rules based on the behavior prior for further judgment.
Let Dist(face, object), Dist(hand, object) and Dist(face, hand) denote, respectively, the distance between the face and the article (cigarette or mobile phone), between the hand and the article, and between the face and the hand; these distances are computed between the center points of the corresponding detection boxes. The likelihood of smoking or calling occurring in the image is associated with this distance information; since absolute pixel distances change with image scale, the side length Len(face) of the detected square face box is used as the reference distance, and the following rules are established:
(1) When Dist(face, object) ≤ a·Len(face), the confidence of a smoking or calling behavior occurring is increased by p1;
(2) When Dist(hand, object) ≤ b·Len(face), the confidence of a smoking or calling behavior occurring is increased by p2;
(3) When Dist(face, hand) ≤ c·Len(face), the confidence of a smoking or calling behavior occurring is increased by p3.
The parameters a, b and c can be determined by first performing statistical analysis on the annotation information of the training samples and then fine-tuning them empirically; the parameters p1, p2, p3 are set empirically according to how strongly each condition contributes to the occurrence of a smoking or calling behavior, with p1 ≥ p2 ≥ p3 ≥ 0 and p1 + p2 + p3 = 1 when all 3 conditions are satisfied simultaneously;
When judging whether a specific behavior (smoking or calling) occurs in an image, let L denote the behavior label and T the confidence threshold for that behavior; the behavior and related-object labels, confidences and positions predicted by the detection network are handled in three cases:
(1) When the detection result predicts the specific behavior label L with a high confidence p0, i.e. p0 > T, behavior L is directly judged to have occurred;
(2) When the detection result predicts the behavior label L with a low confidence p0, i.e. p0 ≤ T, whether behavior L occurs is judged again from the distance relations with the related objects. The rule is: compute the distance information from the predicted positions, determine which of the 3 distance conditions above are satisfied to obtain the corresponding confidence increments among p1, p2, p3, and revise the confidence of behavior L to p0 + p1 + p2 + p3; if the revised confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred;
(3) When the detection result does not predict the behavior label L at all, i.e. p0 = 0, whether behavior L occurs must again be judged from the distance relations with the behavior-related objects. The rule is: compute the distance information from the predicted positions, determine which of the 3 conditions are satisfied to obtain the confidence increments among p1, p2, p3, and compute the confidence of behavior L as p1 + p2 + p3; if this confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred.
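The three cases can be summarized in the following Python sketch of the prior-based decision rule; the function and parameter names are illustrative, the values of a, b and c stand in for values obtained from training-set statistics, and p1, p2, p3 follow the example values given later in the detailed description (0.5, 0.4, 0.1).

    import math

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def behavior_occurs(p0, T, face, hand, obj, len_face,
                        a=0.8, b=0.8, c=1.0, p1=0.5, p2=0.4, p3=0.1):
        # face/hand/obj: (x, y) detection-box centers, or None if undetected;
        # a, b, c are assumed placeholder values
        if p0 > T:  # case (1): confident direct prediction
            return True
        inc = 0.0   # cases (2) and (3): apply the 3 distance rules
        if face and obj and dist(face, obj) <= a * len_face:
            inc += p1
        if hand and obj and dist(hand, obj) <= b * len_face:
            inc += p2
        if face and hand and dist(face, hand) <= c * len_face:
            inc += p3
        return p0 + inc > T  # revised confidence against the threshold

For instance, with T = 0.6 and a missed prediction (p0 = 0), satisfying the face-object and hand-object conditions alone gives a revised confidence of 0.5 + 0.4 = 0.9 > T, so the behavior is judged to have occurred.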
The smoking and calling detection method based on deep learning and behavior prior provided by the invention has the following beneficial effects: (1) the offline process is highly practicable: on-site video or image acquisition and timely model training can be carried out for a specific application site, enabling rapid deployment, so the method is easy to popularize in practical systems; (2) a multi-task object-detection model is trained with deep learning, overcoming the weak discriminative power of the hand-crafted features used in traditional methods; at the same time, logical inference rules built on the behavior prior further analyze the deep network's preliminary predictions, which helps reduce the missed and false detections that a deep-network behavior detector used alone is prone to, improving the reliability of safety monitoring in practical behavior-monitoring applications; (3) by simply re-collecting data and retraining the model for the target application and establishing new behavior-prior inference rules, the method can be conveniently adapted and popularized to detect other human behaviors.
Drawings
FIG. 1 is a flow chart of the smoking and call detection method based on deep learning and behavior prior of the present invention
FIG. 2 is a logic inference diagram of the smoking and calling detection method based on deep learning and behavior prior of the present invention
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples, but the examples should not be construed as limiting the invention.
Referring to FIG. 1, the smoking and calling detection method based on deep learning and behavior prior provided by the invention comprises an offline process and an online process. The offline process trains a multi-task object-detection deep convolutional neural network on a self-built image dataset of smoking and calling behaviors. The online process performs face detection on an input image or video frame and then runs forward inference with the trained deep network model: it first makes a preliminary prediction of the label, confidence and position of a smoking or calling behavior, while also predicting the labels, confidences and positions of the specific objects associated with these behaviors (hands, cigarettes, mobile phones, etc.), and then applies logical inference rules, established from prior knowledge of how the behaviors occur, to further judge whether a smoking or calling behavior occurs.
Specifically, the invention provides a smoking and calling detection method based on deep learning and behavior prior, and the off-line process comprises the following steps:
the method comprises the following steps: collecting training videos or images, and screening out video frames or images containing face information by using a face detection method to serve as effective training samples; step two: labeling the screened effective training samples, wherein the effective training samples comprise labels and corresponding bounding box information of smoking, calling or normal behaviors, and labels and corresponding bounding box information of targets related to the smoking and calling behaviors, namely human hands, cigarettes, mobile phones and the like; step three: obtaining more samples by using a data enhancement means for the marked samples, and forming a training sample set together; step four: and training by using all training samples and labeling information based on a deep learning principle to obtain a multi-task target detection deep convolutional neural network.
In this technical solution, the data acquisition method of step 1 is to record people's behaviors in different indoor and outdoor places and under different lighting conditions: videos of different people smoking or making phone calls are recorded, and some videos without smoking or calling are recorded as normal-behavior samples; in addition, images downloaded from the internet, or images photographed directly for the different behaviors, can also be used as training data. To establish the association between behaviors and people, and considering the redundancy between consecutive video frames, the data screening method samples 1 frame every few frames of a video file and processes it with a face detection algorithm (image files are processed with the face detection algorithm directly), keeping only the images in which a face can be detected as valid training samples.
In this technical solution, the valid training samples are annotated in step 2 as follows. On one hand, behavior information is annotated: a larger image region containing the face is framed as the behavior bounding box; when smoking or calling occurs, the label is set to smoking or calling respectively, otherwise the sample is regarded as normal behavior and the label is set to normal. On the other hand, the objects associated with smoking and calling are annotated: when hands, cigarettes or mobile phones appear in the image, their bounding boxes are marked and the labels are set to hand, cigarette and phone accordingly.
In this technical solution, the data augmentation methods used in step 3 include image scaling, horizontal mirror flipping, and random brightness and hue adjustment; the label information of each behavior or object is kept unchanged while the bounding-box coordinates are updated according to the corresponding geometric transformation.
In this technical solution, the multi-task object-detection network used in step 4 can be adapted from an existing network structure in the field, such as the Fast/Faster R-CNN, SSD or YOLO series, with a shared backbone, so that a behavior-detection classifier and a corresponding object-detection classifier are trained simultaneously: the behavior-detection classifier predicts the label, confidence and position of smoking, calling or normal behavior, and the object-detection classifier predicts the label, confidence and position of hands, cigarettes or mobile phones. Behavior detection is thus also treated as an object-detection problem, and the two tasks use the same form of loss function during training.
In the smoking and calling detection method based on deep learning and behavior prior provided by the invention, the online process comprises the following steps:
the method comprises the following steps: screening out video frames or images containing face information as effective test samples by using a face detection method for input monitoring videos or single images; step two: the effective test samples are sent to a multi-task target detection network trained in an off-line process for forward reasoning, and meanwhile, behaviors, namely smoking, calling or normal behaviors, and targets related to the behaviors, namely labels, confidence degrees and position information of human hands, cigarettes, mobile phones and the like are predicted; step three: and establishing a logic inference rule between the prediction information according to the prior knowledge when the behaviors occur, and further judging whether smoking or calling behaviors occur.
In this technical solution, step 1 uses the same face detection method as the offline process; the video frames or images containing face information are fed into the deep network model as valid test samples for forward inference, and the position of the face is recorded for the logical reasoning of step 3.
In this technical solution, when step 2 performs forward inference on a valid test sample with the trained deep network model, it simultaneously predicts a behavior label L ∈ {smoking, calling, normal} with confidence p0 and position (x, y, w, h), i.e. the abscissa and ordinate of the center of the behavior detection box and its width and height, and a behavior-related object label L' ∈ {hand, cigarette, phone} with confidence p0' and position (x', y', w', h'), i.e. the abscissa and ordinate of the center of the object detection box and its width and height.
In this technical solution, the prior knowledge about smoking and calling behaviors used in step 3 includes: (1) the predicted behavior box should contain a face region; when several people appear in the image at the same time, the face contained in the behavior box determines which person the behavior is attributed to; (2) when smoking or calling actually occurs, the relative positions of the face, the hand and the article (cigarette or mobile phone) satisfy certain constraints; when the confidence of the behavior label predicted by the trained network model is low, or an actually occurring behavior is missed or falsely detected, these constraints can be used to establish logical inference rules based on the behavior prior for further judgment.
Let Dist(face, object), Dist(hand, object) and Dist(face, hand) denote, respectively, the distance between the face and the article (cigarette or mobile phone), between the hand and the article, and between the face and the hand; these distances are computed between the center points of the corresponding detection boxes. The likelihood of smoking or calling occurring in the image is associated with this distance information; since absolute pixel distances change with image scale, the side length Len(face) of the detected square face box is used as the reference distance, and the following rules are established:
(1) When Dist(face, object) ≤ a·Len(face), the confidence of a smoking or calling behavior occurring is increased by p1;
(2) When Dist(hand, object) ≤ b·Len(face), the confidence of a smoking or calling behavior occurring is increased by p2;
(3) When Dist(face, hand) ≤ c·Len(face), the confidence of a smoking or calling behavior occurring is increased by p3.
The parameters a, b and c can be determined by first performing statistical analysis on the annotation information of the training samples and then fine-tuning them empirically; the parameters p1, p2, p3 are set empirically according to how strongly each condition contributes to the occurrence of a smoking or calling behavior, with p1 ≥ p2 ≥ p3 ≥ 0 and p1 + p2 + p3 = 1 when all 3 conditions are satisfied simultaneously; for example, p1 = 0.5, p2 = 0.4, p3 = 0.1 may be taken.
As shown in FIG. 2, when judging whether a specific behavior (smoking or calling) occurs in an image, let L denote the behavior label and T the confidence threshold for that behavior; the behavior and related-object labels, confidences and positions predicted by the detection network are handled in three cases:
(1) When the detection result predicts the specific behavior label L with a high confidence p0, i.e. p0 > T, behavior L is directly judged to have occurred;
(2) When the detection result predicts the behavior label L with a low confidence p0, i.e. p0 ≤ T, whether behavior L occurs is judged again from the distance relations with the related objects; the rule (i.e. rule 2 in FIG. 2) is: compute the distance information from the predicted positions, determine which of the 3 distance conditions above are satisfied to obtain the corresponding confidence increments among p1, p2, p3, and revise the confidence of behavior L to p0 + p1 + p2 + p3; if the revised confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred;
(3) When the detection result does not predict the behavior label L at all, i.e. p0 = 0, whether behavior L occurs must again be judged from the distance relations with the behavior-related objects; the rule (i.e. rule 1 in FIG. 2) is: compute the distance information from the predicted positions, determine which of the 3 conditions are satisfied to obtain the confidence increments among p1, p2, p3, and compute the confidence of behavior L as p1 + p2 + p3; if this confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Details not described in this specification belong to the common knowledge of those skilled in the art.

Claims (5)

1. A smoking and calling detection method based on deep learning and behavior prior, characterized in that the method is divided into an offline process and an online process; the offline process trains a multi-task object-detection deep convolutional neural network on a self-built image dataset of smoking and calling behaviors; the online process performs face detection on an input image or video frame and then runs forward inference with the trained deep network model, first making a preliminary prediction of the label, confidence and position of a smoking or calling behavior while also predicting the labels, confidences and positions of the specific objects associated with these behaviors (hands, cigarettes, mobile phones, etc.), and then applying logical inference rules, established from prior knowledge of how the behaviors occur, to the predicted information to further judge whether a smoking or calling behavior occurs.
2. The method of claim 1, wherein the offline process comprises the following steps: step 1, acquiring training videos or images and, using a face detection method, screening out the video frames or images containing face information as valid training samples; step 2, annotating the screened valid training samples with the labels and corresponding bounding boxes of smoking, calling or normal behaviors, and with the labels and corresponding bounding boxes of the objects associated with smoking and calling (hands, cigarettes, mobile phones, etc.); step 3, applying data augmentation to the annotated samples to obtain more samples, which together form the training sample set; and step 4, using all training samples and their annotations, training a multi-task object-detection deep convolutional neural network based on deep learning principles.
3. The smoking and calling detection method based on deep learning and behavior prior of claim 1, wherein, in the offline process of claim 2, the data acquisition method of step 1 is to record people's behaviors in different indoor and outdoor places and under different lighting conditions: videos of different people smoking or making phone calls are recorded, and some videos without smoking or calling are recorded as normal-behavior samples; in addition, images downloaded from the internet, or images photographed directly for the different behaviors, can also be used as training data; to establish the association between behaviors and people, and considering the redundancy between consecutive video frames, the data screening method samples 1 frame every few frames of a video file and processes it with a face detection algorithm (image files are processed with the face detection algorithm directly), keeping only the images in which a face can be detected as valid training samples;
in the offline process, the valid training samples are annotated in step 2 as follows: on one hand, behavior information is annotated: a larger image region containing the face is framed as the behavior bounding box; when smoking or calling occurs, the label is set to smoking or calling respectively, otherwise the sample is regarded as normal behavior and the label is set to normal; on the other hand, the objects associated with smoking and calling are annotated: when hands, cigarettes or mobile phones appear in the image, their bounding boxes are marked and the labels are set to hand, cigarette and phone accordingly;
in the offline process, the data augmentation methods used in step 3 include image scaling, horizontal mirror flipping, and random brightness and hue adjustment; the label information of each behavior or object is kept unchanged while the bounding-box coordinates are updated according to the corresponding geometric transformation;
in the offline process, the multi-task object-detection network used in step 4 can be adapted from an existing network structure in the field, such as the Fast/Faster R-CNN, SSD or YOLO series, with a shared backbone, so that a behavior-detection classifier and a corresponding object-detection classifier are trained simultaneously; the behavior-detection classifier predicts the label, confidence and position of smoking, calling or normal behavior, and the object-detection classifier predicts the label, confidence and position of hands, cigarettes or mobile phones; behavior detection is thus also treated as an object-detection problem, and the two tasks use the same form of loss function during training.
4. The smoking and calling detection method based on deep learning and behavior prior of claim 1, wherein the online process comprises the following steps: step 1, for the input surveillance video or single images, using a face detection method to screen out the video frames or images containing face information as valid test samples; step 2, feeding the valid test samples into the multi-task object-detection network trained in the offline process for forward inference, simultaneously predicting the labels, confidences and positions of the behaviors (smoking, calling or normal) and of the behavior-related objects (hands, cigarettes, mobile phones, etc.); step 3, establishing logical inference rules among the predicted information according to prior knowledge of how the behaviors occur, and further judging whether a smoking or calling behavior occurs.
5. The smoking and calling detection method based on deep learning and behavior prior of claim 1, wherein, in the online process of claim 4, step 1 uses the same face detection method as the offline process, sends the video frames or images containing face information into the deep network model as valid test samples for forward inference, and records the position of the face for the logical reasoning of step 3;
in the online process, when step 2 performs forward inference on a valid test sample with the trained deep network model, it simultaneously predicts a behavior label L ∈ {smoking, calling, normal} with confidence p0 and position (x, y, w, h), i.e. the abscissa and ordinate of the center of the behavior detection box and its width and height, and a behavior-related object label L' ∈ {hand, cigarette, phone} with confidence p0' and position (x', y', w', h'), i.e. the abscissa and ordinate of the center of the object detection box and its width and height;
in the online process, the prior knowledge about smoking and calling behaviors used in step 3 includes: (1) the predicted behavior box should contain a face region; when several people appear in the image at the same time, the face contained in the behavior box determines which person the behavior is attributed to; (2) when smoking or calling actually occurs, the relative positions of the face, the hand and the article (cigarette or mobile phone) satisfy certain constraints; when the confidence of the behavior label predicted by the trained network model is low, or an actually occurring behavior is missed or falsely detected, these constraints can be used to establish logical inference rules based on the behavior prior for further judgment;
let Dist(face, object), Dist(hand, object) and Dist(face, hand) denote, respectively, the distance between the face and the article (cigarette or mobile phone), between the hand and the article, and between the face and the hand; these distances are computed between the center points of the corresponding detection boxes; the likelihood of smoking or calling occurring in the image is associated with this distance information; since absolute pixel distances change with image scale, the side length Len(face) of the detected square face box is used as the reference distance, and the following rules are established:
(1) When Dist(face, object) ≤ a·Len(face), the confidence of a smoking or calling behavior occurring is increased by p1;
(2) When Dist(hand, object) ≤ b·Len(face), the confidence of a smoking or calling behavior occurring is increased by p2;
(3) When Dist(face, hand) ≤ c·Len(face), the confidence of a smoking or calling behavior occurring is increased by p3.
The parameters a, b and c can be determined by first performing statistical analysis on the annotation information of the training samples and then fine-tuning them empirically; the parameters p1, p2, p3 are set empirically according to how strongly each condition contributes to the occurrence of a smoking or calling behavior, with p1 ≥ p2 ≥ p3 ≥ 0 and p1 + p2 + p3 = 1 when all 3 conditions are satisfied simultaneously;
When judging whether a specific behavior (smoking or calling) occurs in an image, let L denote the behavior label and T the confidence threshold for that behavior; the behavior and related-object labels, confidences and positions predicted by the detection network are handled in three cases:
(1) When the detection result predicts the specific behavior label L with a high confidence p0, i.e. p0 > T, behavior L is directly judged to have occurred;
(2) When the detection result predicts the behavior label L with a low confidence p0, i.e. p0 ≤ T, whether behavior L occurs is judged again from the distance relations with the related objects; the rule is: compute the distance information from the predicted positions, determine which of the 3 distance conditions above are satisfied to obtain the corresponding confidence increments among p1, p2, p3, and revise the confidence of behavior L to p0 + p1 + p2 + p3; if the revised confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred;
(3) When the detection result does not predict the behavior label L at all, i.e. p0 = 0, whether behavior L occurs must again be judged from the distance relations with the behavior-related objects; the rule is: compute the distance information from the predicted positions, determine which of the 3 conditions are satisfied to obtain the confidence increments among p1, p2, p3, and compute the confidence of behavior L as p1 + p2 + p3; if this confidence exceeds the threshold T, behavior L is judged to have occurred, otherwise it is judged not to have occurred.
CN201911196057.2A 2019-11-29 2019-11-29 Smoking and calling detection method based on deep learning and behavior prior Pending CN112883755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911196057.2A CN112883755A (en) 2019-11-29 2019-11-29 Smoking and calling detection method based on deep learning and behavior prior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911196057.2A CN112883755A (en) 2019-11-29 2019-11-29 Smoking and calling detection method based on deep learning and behavior prior

Publications (1)

Publication Number Publication Date
CN112883755A true CN112883755A (en) 2021-06-01

Family

ID=76038846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911196057.2A Pending CN112883755A (en) 2019-11-29 2019-11-29 Smoking and calling detection method based on deep learning and behavior prior

Country Status (1)

Country Link
CN (1) CN112883755A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591662A (en) * 2021-07-24 2021-11-02 深圳市铁越电气有限公司 Method, system and storage medium for recognizing smoking calling behavior
CN114067441A (en) * 2022-01-14 2022-02-18 合肥高维数据技术有限公司 Shooting and recording behavior detection method and system
CN114067441B (en) * 2022-01-14 2022-04-08 合肥高维数据技术有限公司 Shooting and recording behavior detection method and system
CN116580456A (en) * 2023-05-11 2023-08-11 中电金信软件有限公司 Behavior detection method, behavior detection device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US20200349875A1 (en) Display screen quality detection method, apparatus, electronic device and storage medium
CN106600888B (en) Automatic forest fire detection method and system
CN106534967B (en) Video clipping method and device
CN111047818A (en) Forest fire early warning system based on video image
CN112883755A (en) Smoking and calling detection method based on deep learning and behavior prior
CN106339657B (en) Crop straw burning monitoring method based on monitor video, device
CN109935080B (en) Monitoring system and method for real-time calculation of traffic flow on traffic line
CN110059761A (en) A kind of human body behavior prediction method and device
CN111223263A (en) Full-automatic comprehensive fire early warning response system
CN111222478A (en) Construction site safety protection detection method and system
CN102236947A (en) Flame monitoring method and system based on video camera
CN104966304A (en) Kalman filtering and nonparametric background model-based multi-target detection tracking method
CN115761537B (en) Power transmission line foreign matter intrusion identification method oriented to dynamic feature supplementing mechanism
CN116761049B (en) Household intelligent security monitoring method and system
CN110909703A (en) Detection method for chef cap in bright kitchen range scene based on artificial intelligence
CN113269039A (en) On-duty personnel behavior identification method and system
CN111476160A (en) Loss function optimization method, model training method, target detection method, and medium
CN117576632B (en) Multi-mode AI large model-based power grid monitoring fire early warning system and method
CN116385758A (en) Detection method for damage to surface of conveyor belt based on YOLOv5 network
CN116416281A (en) Grain depot AI video supervision and analysis method and system
CN118014327A (en) Real-time big data driven intelligent city management platform
CN113111866B (en) Intelligent monitoring management system and method based on video analysis
CN106780544B (en) The method and apparatus that display foreground extracts
CN107729811B (en) Night flame detection method based on scene modeling
CN117315719A (en) Safety helmet wearing identification method based on edge technology and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination