CN112381068B - Method and system for detecting 'playing mobile phone' of person - Google Patents

Method and system for detecting 'playing mobile phone' of person

Info

Publication number
CN112381068B
CN112381068B (application CN202011563792.5A)
Authority
CN
China
Prior art keywords
mobile phone
person
model
video
relation
Prior art date
Legal status
Active
Application number
CN202011563792.5A
Other languages
Chinese (zh)
Other versions
CN112381068A (en)
Inventor
You Ren (游忍)
Shao Yanhua (邵延华)
Liu Minghua (刘明华)
Current Assignee
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd
Priority to CN202011563792.5A
Publication of CN112381068A
Application granted
Publication of CN112381068B

Classifications

    • G06V 20/40 Scenes; scene-specific elements in video content
    • G06N 20/00 Machine learning
    • G06N 3/045 Neural networks; combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Neural network learning methods
    • G06V 10/40 Extraction of image or video features
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a method for detecting a person 'playing a mobile phone', which comprises the following steps: acquiring a video signal in the current environment to obtain a video to be detected and training samples; if no person or no mobile phone is detected in the video, judging that no one is playing a mobile phone; if both persons and mobile phones are detected in the video, extracting the features of each person and each mobile phone with a feature extraction model; inputting the features of each person and each mobile phone into a feature-relation judgment model, and computing the relation features between each person and each mobile phone; inputting the features of each person and each mobile phone, together with the relation features between them, into a judgment model, and judging whether each person is playing a mobile phone at the current moment; and processing the judgment result. The method combines human joint coordinates, mobile-phone size coordinates, the action-intention and spatial relations between the mobile phone and the human body, deep learning methods, and a time-sequence model to finally judge whether a person in the environment is playing a mobile phone, greatly improving detection accuracy.

Description

Method and system for detecting 'playing mobile phone' of person
Technical Field
The invention relates to the field of computer vision and video analysis, in particular to a method and a system for detecting that a person is 'playing a mobile phone'.
Background Art
With the rapid development of information technology, mobile phones have become ubiquitous, and people's dependence on them has grown increasingly serious. In real scenes, accidents caused by 'playing a mobile phone' are frequent. For example, a driver takes his hands off the steering wheel to use a mobile phone and causes a traffic accident; a pedestrian crossing the road collides with a vehicle because he is looking at his mobile phone. Some special industries, such as railway departments, manage their employees in a quasi-military fashion and need real-time warnings of violations, including detecting through cameras whether employees are playing mobile phones. Schools likewise need to monitor classroom discipline and detect whether students are using mobile phones. In the existing literature and patents, work on detecting a person 'playing a mobile phone' is scarce. The mainstream computer-vision approaches mainly judge the regions of the mobile phone and the hand, or define hand-crafted rules. For example, patent publication No. CN 110674728A discloses a method, apparatus, server and storage medium for detecting 'playing a mobile phone' based on video image recognition: exploiting the changing relationship between the hand and the mobile phone while the phone is in use, it detects the change of the hand and the colour change of the mobile phone within a set period. Because it relies only on these simple cues, its robustness is low in the complex scenes of practical applications. The invention with publication No. CN 111191576A discloses a personnel-behaviour target detection model construction method, an intelligent analysis method and a system: for phone-playing behaviour it mainly crops the mobile-phone region, judges the brightness of the phone screen, and counts qualifying frames. It judges by hand-crafted rules and lacks human-like intelligent judgment, so its robustness is low, its applicable scenes are limited, and it is difficult to meet diverse practical requirements. With the development of technologies such as deep learning, methods such as human pose estimation, object detection, gaze estimation and time-sequence models can judge far more accurately whether a person is playing a mobile phone.
At present, prior-art methods for detecting a person 'playing a mobile phone' suffer from a scarcity of related algorithms and low detection accuracy.
Disclosure of Invention
The invention aims to overcome the above defects of the background art and provides a method and a system for detecting a person 'playing a mobile phone', which can solve the technical problem of low detection accuracy in the prior art.
In order to achieve the above technical effects, the invention adopts the following technical scheme:
A method for detecting a person 'playing a mobile phone', the method comprising the following steps:
Step S1, acquiring a video signal in the current environment to obtain a video to be detected and training samples;
Step S2, detecting all persons and mobile phones in the video;
Step S3, if no person or no mobile phone is detected in the video, judging that no one is playing a mobile phone;
Step S4, if both persons and mobile phones are detected in the video, extracting the features of each person and each mobile phone with a feature extraction model;
Step S5, inputting the features of each person and each mobile phone into a feature-relation judgment model, and computing the relation features between each person and each mobile phone;
Step S6, inputting the features of each person and each mobile phone, together with the relation features between them, into a judgment model, and judging whether each person is playing a mobile phone at the current moment;
Step S7, processing the detection result.
Further, step S2 at least comprises detecting all persons and mobile phones in the current frame with a computer vision algorithm.
Further, the features of each person and each mobile phone in step S4 at least comprise:
a. the two-dimensional and three-dimensional human joint coordinates of each person;
b. the two-dimensional and three-dimensional size coordinates of each mobile phone;
c. the visual features of each person and each mobile phone.
Further, the visual features include, but are not limited to, features extracted by traditional machine learning algorithms or by deep learning.
Further, before step S4, the method further comprises:
a. constructing a human key point model and a 3D object detection model;
b. training the human key point model and the 3D object detection model with the training samples to obtain the feature extraction model.
Further, the human key point model is an OpenPose model, used to compute the two-dimensional and three-dimensional joint coordinates of each person; the 3D object detection model is a CenterNet model, used to compute the two-dimensional and three-dimensional size coordinates of each mobile phone.
Further, the relation features between each person and each mobile phone in step S5 at least comprise:
an action-intention relation: the person holds the mobile phone, does not hold the mobile phone, looks at the mobile phone, or does not look at the mobile phone;
a spatial relation: in front of, behind, to the left of, to the right of, above, or below;
the action-intention relation and the spatial relation are combined to obtain the relation features.
Further, before step S5, the method further comprises:
constructing a deep learning model;
extracting the features of each person and each mobile phone from the training samples with the feature extraction model, and training the deep learning model with these features to obtain the final feature-relation judgment model.
Further, the deep learning model is specifically a sight-line (gaze) estimation model.
Further, in step S6, inputting the features of each person and each mobile phone, together with the relation features between them, into the judgment model and judging whether each person is playing a mobile phone at the current moment comprises:
for the video at the current moment, acquiring the features of each person and each mobile phone, and the relation features between them, in the video over a period of time before the current moment and at the current moment;
inputting all these features into the judgment model, and judging whether each person is playing a mobile phone at the current moment.
Further, before step S6, the method further comprises:
a. constructing a time-sequence model;
b. extracting the features of each person and each mobile phone, and the relation features between them, from the training samples with the feature extraction model and the feature-relation judgment model, and training the time-sequence model with them to obtain the final judgment model.
Further, the time-sequence model is an LSTM model.
Further, the processing of the result in step S7 specifically comprises, according to the application scenario, storing the detection result, storing picture or video evidence that a person is 'playing a mobile phone', sending an alarm, and the like.
Meanwhile, the invention also discloses a system for detecting a person 'playing a mobile phone', which comprises:
a video signal acquisition module, used for acquiring video signals in the current environment to obtain a video to be detected and training samples;
a person and mobile phone detection module, used for detecting all persons and mobile phones in the video;
a feature extraction module, used for training the feature extraction model and the feature-relation judgment model; if persons and mobile phones are detected in the video, the features of each person and each mobile phone are extracted with the feature extraction model, and the relation features between each person and each mobile phone are obtained with the feature-relation judgment model;
a judgment module, used for training a time-sequence model and judging whether each person is playing a mobile phone at the current moment with the features of each person and each mobile phone, and the relation features between them, in each video frame over a period of time before the current moment and at the current moment;
a feature storage module, used for storing the features and relation features of persons and mobile phones obtained during algorithm operation;
a state output module, used for outputting the state of each person: 'playing mobile phone' or 'not playing mobile phone'.
Further, the system for detecting a person 'playing a mobile phone' further comprises an alarm module: if someone is playing a mobile phone, the system sends an alarm.
Compared with the prior art, the invention has the following beneficial effects: the method combines human joint coordinates, mobile-phone size coordinates, the action-intention and spatial relations between the mobile phone and the human body, deep learning methods, and a time-sequence model to finally judge whether a person in the environment is playing a mobile phone, greatly improving detection accuracy.
Drawings
Fig. 1 is a flowchart illustrating a method for detecting a person playing a mobile phone according to an embodiment of the present invention.
Fig. 2 is a flowchart of training a feature extraction model according to a first embodiment of the present invention.
Fig. 3 is a flowchart of training a feature relation determination model according to an embodiment of the present invention.
Fig. 4 is a flowchart of a decision model training process according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a system for detecting a person playing a mobile phone according to a second embodiment of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the embodiments hereinafter.
Example one
As shown in fig. 1, a method for detecting a person playing a mobile phone specifically includes the following steps:
and step S1, acquiring the video signal in the current environment to obtain the video to be detected and the training sample.
Specifically, during model training, a large number of videos are collected through the camera. The two-dimensional and three-dimensional joint coordinates of each person are annotated; the two-dimensional and three-dimensional size coordinates of each mobile phone are annotated; the action-intention relation between each person and each mobile phone is annotated: holds the mobile phone, does not hold the mobile phone, looks at the mobile phone, or does not look at the mobile phone; the spatial relation between each person and each mobile phone is annotated: in front of, behind, to the left of, to the right of, above, or below; and whether each person is playing a mobile phone is annotated. The training samples are obtained after annotation is completed. In actual deployment, videos in the application scene are collected through the camera to obtain the video to be detected.
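For illustration only, a minimal annotation schema consistent with the labels described above might look like the following Python sketch; all class and field names are hypothetical and are not prescribed by the patent.

```python
from dataclasses import dataclass, field
from typing import List

# Hedged sketch of one annotation record per video frame. The label vocabularies
# mirror the four action-intention labels and six spatial relations in the text;
# everything else (field names, shapes) is an illustrative assumption.
ACTION_INTENTS = ["holds_phone", "not_holds_phone", "looks_at_phone", "not_looks_at_phone"]
SPATIAL_RELATIONS = ["front", "back", "left", "right", "above", "below"]

@dataclass
class PersonLabel:
    joints_2d: List[List[float]]  # [K, 2] image-plane joint coordinates
    joints_3d: List[List[float]]  # [K, 3] joint coordinates in 3D space

@dataclass
class PhoneLabel:
    box_2d: List[float]  # [x1, y1, x2, y2] two-dimensional size coordinates
    box_3d: List[float]  # e.g. [x, y, z, w, h, d] three-dimensional size coordinates

@dataclass
class PairLabel:
    person_id: int
    phone_id: int
    action_intents: List[str]  # subset of ACTION_INTENTS for this person-phone pair
    spatial_relation: str      # one of SPATIAL_RELATIONS
    playing_phone: bool        # per-person ground truth for the final judgment

@dataclass
class FrameLabel:
    persons: List[PersonLabel] = field(default_factory=list)
    phones: List[PhoneLabel] = field(default_factory=list)
    pairs: List[PairLabel] = field(default_factory=list)
```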
Step S2, detecting all persons and mobile phones in the video.
Specifically, all persons and mobile phones in the video to be detected are detected with the Faster R-CNN algorithm.
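The patent does not fix an implementation of this detection step; below is a minimal sketch using torchvision's COCO-pretrained Faster R-CNN, exploiting the fact that COCO's label set already contains both 'person' (label 1) and 'cell phone' (label 77). The score threshold is an assumption.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Hedged sketch: detect all persons and phones in one frame (step S2) and apply
# the step-S3 early exit. COCO 91-category indexing: person = 1, cell phone = 77.
PERSON, CELL_PHONE = 1, 77

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_persons_and_phones(frame, score_thresh=0.7):
    """frame: float tensor [3, H, W] with values in [0, 1]."""
    with torch.no_grad():
        out = model([frame])[0]
    keep = out["scores"] >= score_thresh
    boxes, labels = out["boxes"][keep], out["labels"][keep]
    return boxes[labels == PERSON], boxes[labels == CELL_PHONE]

def nobody_playing(frame):
    # Step S3: with no person or no phone in view, no one can be playing a phone.
    persons, phones = detect_persons_and_phones(frame)
    return len(persons) == 0 or len(phones) == 0
```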
Step S3, if no person or no mobile phone is detected in the video, judging that no one is playing a mobile phone.
Step S4, if both persons and mobile phones are detected in the video, extracting the features of each person and each mobile phone with the feature extraction model.
The features of each person and each mobile phone comprise: a. the two-dimensional and three-dimensional human joint coordinates of each person; b. the two-dimensional and three-dimensional size coordinates of each mobile phone; c. the visual features of each person and each mobile phone. The visual features include, but are not limited to, features extracted by traditional machine learning algorithms or by deep learning.
In this embodiment, the implementation is as follows: the two-dimensional and three-dimensional joint coordinates of each person are computed with an OpenPose model; the two-dimensional and three-dimensional size coordinates of each mobile phone are computed with a CenterNet model; meanwhile, the region corresponding to each person and each mobile phone is cropped from the last convolution layer of the CenterNet model to obtain the visual features of each person and each mobile phone.
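A hedged sketch of the visual-feature cropping just described, using ROI-Align to pool a fixed-size feature for each detected region from a backbone feature map (standing in for CenterNet's last convolution layer); the feature-map stride and output size are assumptions.

```python
import torch
from torchvision.ops import roi_align

# Hedged sketch: pool one visual feature vector per person/phone box from a
# backbone feature map of shape [1, C, H/stride, W/stride]. The patent does not
# give tensor shapes, so stride and output size here are illustrative.
def crop_visual_features(feature_map, boxes_xyxy, stride=4, out_size=7):
    # roi_align expects [N, 5] rois: (batch_index, x1, y1, x2, y2).
    rois = torch.cat([torch.zeros(len(boxes_xyxy), 1), boxes_xyxy.float()], dim=1)
    feats = roi_align(feature_map, rois, output_size=out_size,
                      spatial_scale=1.0 / stride, aligned=True)
    return feats.flatten(start_dim=1)  # [N, C * out_size * out_size]
```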
The feature extraction models OpenPose and CenterNet are generated in advance; as shown in fig. 2, the specific implementation and training steps are as follows:
a. constructing a human key point model and a 3D object detection model;
b. training the human key point model and the 3D object detection model with the training samples to obtain the feature extraction model.
In this embodiment, the human key point model is an OpenPose model, used to compute the two-dimensional and three-dimensional joint coordinates of each person; the 3D object detection model is a CenterNet model, used to compute the two-dimensional and three-dimensional size coordinates of each mobile phone.
In this embodiment, the OpenPose model is trained with the training samples annotated with two-dimensional and three-dimensional human joint coordinates, and the CenterNet model is trained with the data annotated with two-dimensional and three-dimensional mobile-phone size coordinates. Finally, the OpenPose and CenterNet models are combined to obtain the feature extraction model.
Step S5, inputting the features of each person and each mobile phone into the feature-relation judgment model, and computing the relation features between each person and each mobile phone.
Specifically, the features of each person and each mobile phone obtained by the feature extraction model are input into the feature-relation judgment model to obtain the relation features between each person and each mobile phone.
The relation features comprise an action-intention relation: the person holds the mobile phone, does not hold the mobile phone, looks at the mobile phone, or does not look at the mobile phone; and a spatial relation: in front of, behind, to the left of, to the right of, above, or below. The action-intention relation and the spatial relation are combined to obtain the relation features.
In this embodiment, the implementation is as follows: the OpenPose and CenterNet models are used to compute the two- and three-dimensional joint coordinates, size coordinates and visual features of each person and each mobile phone; all these features are then input into the feature-relation judgment model to obtain the action-intention relation and the spatial relation between each person and each mobile phone, which are finally combined into the relation features.
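A minimal sketch of one possible feature-relation judgment model consistent with this description: an MLP that consumes one person's features concatenated with one phone's features and scores the four action-intention labels and six spatial relations, whose concatenated outputs serve as the relation feature. All dimensions are assumptions.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Hedged sketch: (person features, phone features) -> relation features.
    Feature sizes are illustrative assumptions, not values from the patent."""
    def __init__(self, person_dim=1024, phone_dim=1024, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(person_dim + phone_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.intent = nn.Linear(hidden, 4)   # holds / not holds / looks at / not looks at
        self.spatial = nn.Linear(hidden, 6)  # front / back / left / right / above / below

    def forward(self, person_feat, phone_feat):
        h = self.trunk(torch.cat([person_feat, phone_feat], dim=-1))
        # The relation feature combines the action-intention and spatial scores.
        return torch.cat([self.intent(h), self.spatial(h)], dim=-1)  # [N, 10]
```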
The feature-relation judgment model is generated in advance; as shown in fig. 3, in this embodiment its implementation and training steps are as follows:
a. constructing a deep learning model;
b. extracting the features of each person and each mobile phone from the training samples with the feature extraction model, and training the deep-learning-based sight-line (gaze) estimation model with these features to obtain the final feature-relation judgment model.
In this embodiment, the deep learning model is specifically a sight-line (gaze) estimation model.
In this embodiment, the implementation is as follows: the features of each person and each mobile phone are extracted from the training samples with the feature extraction models OpenPose and CenterNet; the constructed sight-line estimation model is trained with these features and with the samples annotated with the action-intention and spatial relations between person and mobile phone; the sight-line estimation model is additionally trained on the MPIIGaze data set to obtain the final feature-relation judgment model.
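Once a gaze direction has been estimated, the 'looks at the mobile phone / does not look at the mobile phone' label can be approximated geometrically by testing whether the phone centre falls inside a cone around the gaze ray; a hedged sketch follows, where the 15-degree threshold is an illustrative assumption, not a value from the patent.

```python
import numpy as np

def looks_at_phone(eye_pos, gaze_dir, phone_center, max_angle_deg=15.0):
    """Hedged sketch: 'sees the phone' iff the angle between the estimated gaze
    direction and the eye-to-phone vector is small. All inputs are 3D numpy
    vectors; the angular threshold is an assumption."""
    to_phone = phone_center - eye_pos
    cos_angle = np.dot(gaze_dir, to_phone) / (
        np.linalg.norm(gaze_dir) * np.linalg.norm(to_phone) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))) <= max_angle_deg
```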
Step S6, inputting the features of each person and each mobile phone, together with the relation features between them, into the judgment model, and judging whether each person is playing a mobile phone at the current moment. The specific steps are as follows:
a. for the video at the current moment, acquiring the features of each person and each mobile phone, and the relation features between them, in the video over a period of time before the current moment and at the current moment;
b. inputting all these features into the judgment model, and judging whether each person is playing a mobile phone at the current moment.
As shown in fig. 4, in this embodiment the implementation and training steps of the judgment model are as follows:
a. constructing a time-sequence model, specifically an LSTM model;
b. extracting the features of each person and each mobile phone, and the relation features between them, from the training samples with the feature extraction model and the feature-relation judgment model, and training the LSTM model with them to obtain the final judgment model.
In this embodiment, the specific implementation is as follows: the features of each person and each mobile phone, and the relation features between them, are extracted from the training samples with the feature extraction models OpenPose and CenterNet and the sight-line estimation model, and the constructed LSTM model is trained with 10 frames as one input sample to obtain the final judgment model.
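A minimal sketch of such a temporal judgment model: an LSTM that consumes the concatenated per-frame features of one person-phone pair over a 10-frame window and outputs a 'playing mobile phone' probability, trained with binary cross-entropy. The feature dimension is an assumption standing for the concatenated person, phone and relation features.

```python
import torch
import torch.nn as nn

class PlayingPhoneLSTM(nn.Module):
    """Hedged sketch of the 10-frame judgment model; feat_dim is an assumed
    placeholder for the concatenated per-frame person/phone/relation features."""
    def __init__(self, feat_dim=2058, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):  # x: [batch, 10, feat_dim]
        _, (h_n, _) = self.lstm(x)
        return torch.sigmoid(self.head(h_n[-1]))  # [batch, 1] probability

# Training sketch: binary cross-entropy on 10-frame feature windows.
model = PlayingPhoneLSTM()
loss_fn = nn.BCELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

window = torch.randn(4, 10, 2058)             # placeholder feature batch
target = torch.randint(0, 2, (4, 1)).float()  # placeholder labels
opt.zero_grad()
loss = loss_fn(model(window), target)
loss.backward()
opt.step()
```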
Step S7, processing the result, which specifically comprises, according to the application scenario, storing the detection result, storing picture or video evidence that a person is 'playing a mobile phone', sending an alarm, and the like.
Example two
Fig. 5 is a schematic structural diagram of a system for detecting a person 'playing a mobile phone' according to the second embodiment of the present invention. The system comprises: a video signal acquisition module, a person and mobile phone detection module, a feature extraction module, a judgment module, a feature storage module, a state output module, and an alarm module.
The video signal acquisition module 201 is used for acquiring video signals in the current environment to obtain the video to be detected and the training samples.
In this embodiment, the implementation is as follows: a suitable camera is selected, and a hardware scheme for video acquisition is designed. During model training, a large number of videos are collected through the camera. The two-dimensional and three-dimensional joint coordinates of each person are annotated; the two-dimensional and three-dimensional size coordinates of each mobile phone are annotated; the action-intention relation between each person and each mobile phone is annotated: holds the mobile phone, does not hold the mobile phone, looks at the mobile phone, or does not look at the mobile phone; the spatial relation between each person and each mobile phone is annotated: in front of, behind, to the left of, to the right of, above, or below; and whether each person is playing a mobile phone is annotated. The training samples are obtained after annotation is completed. In actual deployment, videos in the application scene are collected through the camera to obtain the video to be detected.
The person and mobile phone detection module 202 is used for detecting all persons and mobile phones in the video.
In this embodiment, all persons and mobile phones in the video to be detected are detected with the Faster R-CNN algorithm.
The feature extraction module 203 is used for training the feature extraction model and the feature-relation judgment model; if persons and mobile phones are detected in the video, the features of each person and each mobile phone are extracted with the feature extraction model, and the relation features between each person and each mobile phone are obtained with the feature-relation judgment model.
In this embodiment, the implementation is as follows: the human key point model OpenPose and the 3D object detection model CenterNet are trained with the training samples to obtain the feature extraction model. The features of each person and each mobile phone are then extracted from the training samples with the feature extraction model and used to train a deep-learning-based sight-line estimation model, which is additionally trained on the MPIIGaze data set to obtain the final sight-line estimation model, i.e. the feature-relation judgment model. In actual deployment, pictures are input into the OpenPose and CenterNet models to obtain the features of persons and mobile phones, and these features are input into the sight-line estimation model to obtain the relation features between each person and each mobile phone.
The judgment module 204 is used for training a time-sequence model; for the video at the current moment, whether each person is playing a mobile phone at the current moment is judged with the features of each person and each mobile phone, and the relation features between them, in each video frame over a period of time before the current moment and at the current moment.
The specific steps are as follows:
a. for the video at the current moment, acquiring the features of each person and each mobile phone, and the relation features between them, in the video over a period of time before the current moment and at the current moment;
b. inputting all these features into the judgment model, and judging whether each person is playing a mobile phone at the current moment.
In this embodiment, the implementation is as follows: the features of each person and each mobile phone, and the relation features between them, are extracted from the training samples with the feature extraction model and the feature-relation judgment model, and an LSTM model is trained with 10 frames as input to obtain the final judgment model.
The feature storage module 205 is used for storing the features and relation features of persons and mobile phones obtained during algorithm operation.
The state output module 206 is used for outputting the state of each person: 'playing mobile phone' or 'not playing mobile phone'.
The alarm module 207 is used for sending an alarm if someone is playing a mobile phone.
In summary, the method and system for detecting a person 'playing a mobile phone' provided by the invention have the following beneficial effects: they combine human joint coordinates, mobile-phone size coordinates, the action-intention and spatial relations between the mobile phone and the human body, deep learning methods, and a time-sequence model to finally judge whether a person in the environment is playing a mobile phone, greatly improving detection accuracy.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a program instructing the related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (14)

1. A method for detecting a person 'playing a mobile phone', the method comprising the following steps:
S1, acquiring a video signal in the current environment to obtain a video to be detected and training samples;
S2, detecting all persons and mobile phones in the video;
S3, if no person or no mobile phone is detected in the video, judging that no one is playing a mobile phone;
S4, if both persons and mobile phones are detected in the video, extracting the features of each person and each mobile phone with a feature extraction model, the features at least comprising:
the two-dimensional and three-dimensional human joint coordinates of each person;
the two-dimensional and three-dimensional size coordinates of each mobile phone;
the visual features of each person and each mobile phone;
S5, inputting the features of each person and each mobile phone into a feature-relation judgment model, and computing the relation features between each person and each mobile phone;
S6, inputting the features of each person and each mobile phone, together with the relation features between them, into a judgment model, and judging whether each person is playing a mobile phone at the current moment;
S7, processing the result.
2. The method as claimed in claim 1, wherein step S2 at least comprises detecting all persons and mobile phones in the current frame with a computer vision algorithm.
3. The method as claimed in claim 1, wherein the visual features include, but are not limited to, features extracted by traditional machine learning algorithms or by deep learning.
4. The method for detecting a person 'playing a mobile phone' as claimed in claim 1, wherein before step S4 the method further comprises:
constructing a human key point model and a 3D object detection model;
training the human key point model and the 3D object detection model with the training samples to obtain the feature extraction model.
5. The method for detecting a person 'playing a mobile phone' as claimed in claim 4, wherein:
the human key point model is an OpenPose model, used to compute the two-dimensional and three-dimensional joint coordinates of each person;
the 3D object detection model is a CenterNet model, used to compute the two-dimensional and three-dimensional size coordinates of each mobile phone.
6. The method as claimed in claim 1, wherein the relation features between each person and each mobile phone in step S5 at least comprise:
an action-intention relation: the person holds the mobile phone, does not hold the mobile phone, looks at the mobile phone, or does not look at the mobile phone;
a spatial relation: in front of, behind, to the left of, to the right of, above, or below;
the action-intention relation and the spatial relation are combined to obtain the relation features.
7. The method for detecting a person 'playing a mobile phone' as claimed in claim 1, wherein before step S5 the method further comprises:
constructing a deep learning model;
extracting the features of each person and each mobile phone from the training samples with the feature extraction model, and training the deep learning model with these features to obtain the final feature-relation judgment model.
8. The method as claimed in claim 7, wherein the deep learning model is a sight-line (gaze) estimation model.
9. The method as claimed in claim 1, wherein the step S6 of inputting the features of each person and each mobile phone, together with the relation features between them, into the judgment model and judging whether each person is playing a mobile phone at the current moment comprises:
for the video at the current moment, acquiring the features of each person and each mobile phone, and the relation features between them, in the video over a period of time before the current moment and at the current moment;
inputting all these features into the judgment model, and judging whether each person is playing a mobile phone at the current moment.
10. The method for detecting a person 'playing a mobile phone' as claimed in claim 1, wherein before step S6 the method further comprises:
a. constructing a time-sequence model;
b. extracting the features of each person and each mobile phone, and the relation features between them, from the training samples with the feature extraction model and the feature-relation judgment model, and training the time-sequence model with them to obtain the final judgment model.
11. The method as claimed in claim 10, wherein the time-sequence model is an LSTM model.
12. The method as claimed in claim 1, wherein step S7 processes the result in one or more of the following ways:
storing the detection result;
storing picture or video evidence that a person is 'playing a mobile phone';
sending an alarm.
13. A system for detecting a person 'playing a mobile phone' according to the method of any one of claims 1-12, comprising:
a video signal acquisition module, used for acquiring video signals in the current environment to obtain a video to be detected and training samples;
a person and mobile phone detection module, used for detecting all persons and mobile phones in the video;
a feature extraction module, used for training the feature extraction model and the feature-relation judgment model; if persons and mobile phones are detected in the video, the features of each person and each mobile phone are extracted with the feature extraction model, and the relation features between each person and each mobile phone are obtained with the feature-relation judgment model;
a judgment module, used for training a time-sequence model and judging whether each person is playing a mobile phone at the current moment with the features of each person and each mobile phone, and the relation features between them, in each video frame over a period of time before the current moment and at the current moment;
a feature storage module, used for storing the features and relation features of persons and mobile phones obtained during algorithm operation;
a state output module, used for outputting the state of each person: 'playing mobile phone' or 'not playing mobile phone'.
14. The system for detecting a person 'playing a mobile phone' as claimed in claim 13, further comprising an alarm module for generating an alarm signal if the output state is 'playing mobile phone'.
CN202011563792.5A 2020-12-25 2020-12-25 Method and system for detecting 'playing mobile phone' of person Active CN112381068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011563792.5A CN112381068B (en) 2020-12-25 2020-12-25 Method and system for detecting 'playing mobile phone' of person


Publications (2)

Publication Number Publication Date
CN112381068A CN112381068A (en) 2021-02-19
CN112381068B (en) 2022-05-31

Family

ID=74590855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011563792.5A Active CN112381068B (en) 2020-12-25 2020-12-25 Method and system for detecting 'playing mobile phone' of person

Country Status (1)

Country Link
CN (1) CN112381068B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408379A (en) * 2021-06-04 2021-09-17 开放智能机器(上海)有限公司 Mobile phone candid behavior monitoring method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171152A (en) * 2017-12-26 2018-06-15 深圳大学 Deep learning human eye sight estimation method, equipment, system and readable storage medium storing program for executing
CN108596064A (en) * 2018-04-13 2018-09-28 长安大学 Driver based on Multi-information acquisition bows operating handset behavioral value method
CN108846332A (en) * 2018-05-30 2018-11-20 西南交通大学 A kind of railway drivers Activity recognition method based on CLSTA
CN109614939A (en) * 2018-12-13 2019-04-12 四川长虹电器股份有限公司 " playing mobile phone " behavioral value recognition methods based on human body attitude estimation
CN109871764A (en) * 2019-01-16 2019-06-11 深兰科技(上海)有限公司 A kind of abnormal behaviour recognition methods, device and storage medium
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
CN112001347A (en) * 2020-08-31 2020-11-27 重庆科技学院 Motion recognition method based on human skeleton shape and detection target

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101954192B1 (en) * 2012-11-15 2019-03-05 엘지전자 주식회사 Array camera, Moblie terminal, and method for operating the same
JP6261199B2 (en) * 2013-06-21 2018-01-17 キヤノン株式会社 Information processing apparatus, information processing method, and computer program
US20180122185A1 (en) * 2016-10-31 2018-05-03 Kenneth L. Miller Player Tracking Card Reader With Interface For Cell Phone In Place Of Player Tracking Card
CN109120791B (en) * 2018-08-31 2021-01-01 湖南人文科技学院 Method for warning and protecting cervical vertebra through smart phone
CN110287906A (en) * 2019-06-26 2019-09-27 四川长虹电器股份有限公司 Method and system based on image/video detection people " playing mobile phone "

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171152A (en) * 2017-12-26 2018-06-15 深圳大学 Deep learning human eye sight estimation method, equipment, system and readable storage medium storing program for executing
CN108596064A (en) * 2018-04-13 2018-09-28 长安大学 Driver based on Multi-information acquisition bows operating handset behavioral value method
CN108846332A (en) * 2018-05-30 2018-11-20 西南交通大学 A kind of railway drivers Activity recognition method based on CLSTA
CN109614939A (en) * 2018-12-13 2019-04-12 四川长虹电器股份有限公司 " playing mobile phone " behavioral value recognition methods based on human body attitude estimation
CN109871764A (en) * 2019-01-16 2019-06-11 深兰科技(上海)有限公司 A kind of abnormal behaviour recognition methods, device and storage medium
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
CN112001347A (en) * 2020-08-31 2020-11-27 重庆科技学院 Motion recognition method based on human skeleton shape and detection target

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Research on Student Classroom Behavior Detection Algorithm Based on Faster R-CNN"; Tan Bin et al.; Modern Computer (Professional Edition); 2018-11-30 (No. 33); pp. 45-47 *
"Surveillance Video Analysis Method Combining Object Detection and Human Pose Estimation Algorithms"; Li Bin'ai et al.; Electronic Technology & Software Engineering; 2020-04-30 (No. 7); pp. 143-147 *

Also Published As

Publication number Publication date
CN112381068A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
Martin et al. Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles
CN105260712B (en) A kind of vehicle front pedestrian detection method and system
CN103824070B (en) A kind of rapid pedestrian detection method based on computer vision
CN111860274B (en) Traffic police command gesture recognition method based on head orientation and upper half skeleton characteristics
CN109875568A (en) A kind of head pose detection method for fatigue driving detection
CN109740424A (en) Traffic violations recognition methods and Related product
CN105426827A (en) Living body verification method, device and system
CN110298300A (en) A method of detection vehicle violation crimping
CN109766755A (en) Face identification method and Related product
CN108038866A (en) A kind of moving target detecting method based on Vibe and disparity map Background difference
CN106778650A (en) Scene adaptive pedestrian detection method and system based on polymorphic type information fusion
CN110348463A (en) The method and apparatus of vehicle for identification
CN110245563A (en) Refitted car recognition methods and Related product
CN110728199A (en) Intelligent driving test car practice system and method based on MR
CN105117096A (en) Image identification based anti-tracking method and apparatus
CN110147731A (en) Vehicle type recognition method and Related product
CN111540171B (en) Fatigue driving early warning system, corresponding early warning method and construction method
CN112381068B (en) Method and system for detecting 'playing mobile phone' of person
CN112215093A (en) Method and device for evaluating vehicle driving ability level
CN106611165B (en) A kind of automotive window detection method and device based on correlation filtering and color-match
CN116935361A (en) Deep learning-based driver distraction behavior detection method
CN114529979A (en) Human body posture identification system, human body posture identification method and non-transitory computer readable storage medium
CN115797970B (en) Dense pedestrian target detection method and system based on YOLOv5 model
CN115147817B (en) Driver distraction behavior recognition method of instance perception network guided by gestures
Kumar et al. Traffic sign and drowsiness detection using open-cv

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant