CN113723165B - Method and system for detecting dangerous expressions of people to be detected based on deep learning - Google Patents

Method and system for detecting dangerous expressions of people to be detected based on deep learning Download PDF

Info

Publication number
CN113723165B
CN113723165B · CN202110319441.8A · CN202110319441A
Authority
CN
China
Prior art keywords
face
detected
expression
dangerous
person
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110319441.8A
Other languages
Chinese (zh)
Other versions
CN113723165A (en
Inventor
郑来波
刘佩
张浩
李莹
王德强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202110319441.8A priority Critical patent/CN113723165B/en
Publication of CN113723165A publication Critical patent/CN113723165A/en
Application granted granted Critical
Publication of CN113723165B publication Critical patent/CN113723165B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method and a system for detecting dangerous expressions of a person to be detected based on deep learning. A video of the person to be detected is acquired; face detection is carried out on images in the video of the person to be detected; a single-frame image containing an identifiable face is captured to obtain a face image to be detected; human-body key points are detected in the single-frame image of the video of the person to be detected; a face feature vector is extracted from the face image to be detected; the face feature vector and the face key points are fused, the fused features are input into a trained classifier, and a preliminary expression classification result is output; whether the limb key points exceed a set area is judged, and if so, a dangerous-limb result is obtained; the preliminary expression classification result is corrected by combining the limb detection result to obtain the final expression classification result of the single-frame image. The invention achieves high detection accuracy and high detection speed while avoiding erroneous judgment.

Description

Method and system for detecting dangerous expressions of people to be detected based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a method and a system for detecting dangerous expressions of a person to be detected based on deep learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In the process of implementing the invention, the inventor finds that the following technical problems exist in the prior art:
in the network era, video call technology keeps developing and the meeting form in supervision places is updated with the times: the traditional face-to-face meeting within a single supervision place is being upgraded to a remote video meeting between two supervision places across regions. In a traditional face-to-face meeting, several supervisors must be present to supervise the behavior and language of the person to be detected and the relatives during the meeting, so as to prevent the person to be detected from transmitting sensitive information, behaving in an over-excited manner and the like. Similarly, when a remote video meeting is held, the behavior of both parties must also be monitored in real time, which consumes a great deal of the supervisors' time and energy. At the present stage, real-time emotion computation in remote video meetings remains a blank.
Expression recognition technology already on the market classifies expressions into seven categories, which is not suitable for the special scenario of supervision places. Because supervisors need to handle abnormal situations quickly, a two-class detection method is needed to give rapid early warning of abnormal expressions of the person to be detected during the meeting.
In summary, the remote video meeting is currently the most convenient and fast form of meeting between the person to be detected and relatives; while it has broad application prospects, the problems of real-time supervision of the meeting content and optimization of the seven-class recognition method still need to be solved. Therefore, it is desirable to provide a method and a system that detect dangerous expressions of a person to be detected in a remote video meeting more accurately and quickly.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a method and a system for detecting dangerous expressions of a person to be detected based on deep learning;
in a first aspect, the invention provides a method for detecting dangerous expressions of a person to be detected based on deep learning;
the dangerous expression detection method of the person to be detected based on deep learning comprises the following steps:
acquiring a video of a person to be detected;
carrying out face detection on images in a video of a person to be detected; capturing a single-frame image containing an identifiable face to obtain a face image to be detected;
the method comprises the following steps of detecting key points of a human body in a single-frame image of a video of a person to be detected, wherein the key points of the human body comprise: face key points and limb key points;
extracting a face feature vector of a face image to be detected to obtain a face feature vector; carrying out feature fusion on the face feature vector and the face key points, inputting the fused features into a trained classifier, and outputting an expression preliminary classification result;
judging whether the limb key points exceed a set area, and if so, obtaining a dangerous-limb result; otherwise, obtaining a non-dangerous-limb result;
and correcting the preliminary expression classification result by combining the limb detection result to obtain the final expression classification result of the single-frame image.
In a second aspect, the invention provides a system for detecting dangerous expressions of a person to be detected based on deep learning;
The system for detecting dangerous expressions of a person to be detected based on deep learning comprises:
an acquisition module configured to: acquiring a video of a person to be detected;
a face detection module configured to: carrying out face detection on images in a video of a person to be detected; capturing a single-frame image containing an identifiable face to obtain a face image to be detected;
a keypoint detection module configured to: the method comprises the following steps of detecting key points of a human body in a single-frame image of a video of a person to be detected, wherein the key points of the human body comprise: face key points and limb key points;
a preliminary classification module configured to: extracting a face feature vector of a face image to be detected to obtain a face feature vector; carrying out feature fusion on the face feature vector and the face key points, inputting the fused features into a trained classifier, and outputting an expression preliminary classification result;
a determination module configured to: judging whether the limb key points exceed a set area, and if so, obtaining a dangerous-limb result; otherwise, obtaining a non-dangerous-limb result;
a correction module configured to: and correcting the preliminary expression classification result by combining the limb detection result to obtain the final expression classification result of the single-frame image.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
1. The two-class expression scheme of the invention is better suited to the application scenario of detecting dangerous expressions of the person to be detected during meetings in supervision places. Compared with the traditional approach of first dividing facial expressions into seven classes and then judging whether the result is a dangerous expression, direct two-class recognition displays the expression detection result clearly and unambiguously, omits the extra re-division step, and reduces memory usage and time cost. Experimental results also show that, compared with the combination of the traditional seven-class expression method and a second classification step, directly performing two-class expression recognition raises the accuracy markedly from 65% (seven-class) to 83% (two-class).
2. The invention introduces two feature extraction modes, a convolutional neural network and face key points, into the feature extraction part of the dangerous expression detection model. The convolutional neural network extracts local features of the picture, the key points extract global features of the face, and combining the two makes feature extraction in the expression recognition process more comprehensive and accurate, providing reliable facial features for subsequent classification and recognition.
3. In the invention, a support vector classification model is adopted in the classification and recognition part of the dangerous expression detection model. The model is supported by the mathematical theory of the support vector machine and is well suited to two-class problems.
4. In the aspect of facial expression classification and recognition, the invention uses the limb reference result to assist in correcting the single-frame expression recognition result. False alarms and missed alarms caused by erroneous expression recognition results are avoided, so that the dangerous expression detection result for a static picture is more credible.
5. In the aspect of dangerous expression detection for videos, a dynamic calculation module is added to link the static expression recognition results. Because facial expressions are continuous and gradual, a person cannot make facial expressions representing multiple emotions within one second; continuously computing over the facial expressions in time sequence avoids misjudgment, and even if an error occurs in static image recognition it can be compensated by the dynamic calculation, improving the accuracy and practicability of the method.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the invention and not to limit the invention.
FIG. 1 is a flowchart of an example of a method for detecting dangerous expressions of a person to be detected based on deep learning according to the present invention;
fig. 2 is a flowchart of an example of the system for detecting dangerous expressions of people to be detected based on deep learning according to the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Interpretation of professional terms:
CNN: convolutional neural network. A Convolutional Neural Network is a neural network that contains convolution calculations and has a deep structure, and is one of the representative algorithms of deep learning. It comprises an input layer, hidden layers and an output layer, has feature-learning ability, and can perform translation-invariant classification of input information according to its hierarchical structure. The neurons in each layer are arranged in three dimensions (height, width and depth), so the input to the convolutional neural network is required to be three-dimensional data.
SVC: support vector classification model. The Support Vector Classification model is an application of the support vector machine to classification problems.
OpenCV library. OpenCV is a cross-platform computer vision and machine learning software library that can run on Linux, Windows, Android and Mac OS operating systems. It is lightweight and efficient, consists of a series of C functions and a small number of C++ classes, provides interfaces for languages such as Python, Ruby and MATLAB, and implements many general algorithms in image processing and computer vision.
OpenPose library. OpenPose is an open-source library built on convolutional neural networks and supervised learning, using Caffe as its framework. It can track the facial expressions, torso, limbs and even fingers of a person, works for single or multiple persons, and has good robustness. It was the world's first real-time multi-person two-dimensional pose estimation based on deep learning, a milestone in human-computer interaction, and provides a high-quality information dimension for robots to understand people.
Dlib library. Dlib is an open-source library for machine learning that contains many machine learning algorithms. It can be used simply by including its header files, does not depend on other libraries, and is widely applied in the fields of artificial intelligence and deep learning.
Example one
The embodiment provides a dangerous expression detection method for a person to be detected based on deep learning;
as shown in fig. 1 and fig. 2, the method for detecting dangerous expressions of a person to be detected based on deep learning includes:
s101: acquiring a video of a person to be detected;
s102: carrying out face detection on images in a video of a person to be detected; capturing a single-frame image containing an identifiable face to obtain a face image to be detected;
s103: carrying out human key point detection on a single-frame image in a video of a person to be detected, wherein the human key point comprises the following steps: face key points and limb key points;
s104: extracting a face feature vector of a face image to be detected to obtain a face feature vector; carrying out feature fusion on the face feature vector and the face key points, inputting the fused features into a trained classifier, and outputting an expression preliminary classification result;
s105: judging whether the limb key points exceed a set area, and if so, obtaining a dangerous-limb result; otherwise, obtaining a non-dangerous-limb result;
s106: and correcting the preliminary expression classification result by combining the limb detection result to obtain the final expression classification result of the single-frame image.
Further, the method further comprises:
s107: and smoothing the final expression classification result of all frame images in the video of the person to be detected, and outputting the dynamic dangerous expression detection result of the video of the person to be detected.
Further, the step S101: acquiring a video of a person to be detected; the method specifically comprises the following steps:
in the process that the person to be detected carries out remote video with the family members in the supervision place, the camera is adopted to collect the video of the person to be detected.
Or, the S101: acquiring a video of a person to be detected; the content of the remote video meeting of the person to be detected is displayed on a client interface by using tools of the OpenCV (Open Source Computer Vision) library, and images are captured at a specified frame interval.
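As an illustration, frame capture at a specified interval can be sketched with OpenCV as follows; the camera index, frame interval and window name are assumptions for the example, not values fixed by this embodiment.

```python
import cv2

FRAME_INTERVAL = 25  # assumed sampling interval: roughly one frame per second at 25 fps

def capture_frames(source=0):
    """Display the remote video meeting and yield one frame every FRAME_INTERVAL frames."""
    cap = cv2.VideoCapture(source)            # camera index or video file path
    frame_id = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("remote video meeting", frame)   # client interface display
        if frame_id % FRAME_INTERVAL == 0:
            yield frame                        # hand the sampled frame to the later steps
        frame_id += 1
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
```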
Further, the S102: carrying out face detection on images in a video of a person to be detected; capturing a single-frame image containing an identifiable face to obtain a face image to be detected; the method specifically comprises the following steps:
s1021: carrying out face detection on the image of the person to be detected by using a Haar classifier;
s1022: cropping the current image frame by using the OpenCV library to obtain a face image to be detected.
Further, the S1021: carrying out face detection on the image of the person to be detected by using a Haar classifier; the method specifically comprises the following steps:
s10211: extracting Haar characteristics from the image of the person to be detected;
s10212: using an integral image to accelerate the evaluation of the Haar features and obtain the Haar feature values;
s10213: constructing a classifier;
s10214: and carrying out face region detection on the image of the person to be detected by using a Haar classifier.
It should be understood that the Haar feature detection and face cropping section includes: performing face detection using step S1021, cropping out a face region image that matches the input size of the convolutional neural network, and converting it to grayscale.
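A minimal sketch of this detection-and-cropping step is given below, assuming the frontal-face Haar cascade shipped with OpenCV; the 48 × 48 target size follows the CNN input described later in this embodiment.

```python
import cv2

# Haar cascade shipped with OpenCV (assumed choice of cascade file)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)                     # grayscale processing
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                                     # no identifiable face in this frame
    x, y, w, h = faces[0]                                               # take the first detected face region
    return cv2.resize(gray[y:y + h, x:x + w], (48, 48))                 # match the CNN input size
```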
Further, the step S103: the method comprises the following steps of detecting key points of a human body in a single-frame image of a video of a person to be detected, wherein the key points of the human body comprise: key points of the face and the limbs; the method specifically comprises the following steps:
The OpenPose library is used to detect human key points in a single-frame image of the video of the person to be detected, obtaining 18 human key points, wherein the 18 human key points include:
face key points, comprising: the nose key point nose, right eye key point right_eye, left eye key point left_eye, right ear key point right_ear, and left ear key point left_ear;
limb key points, comprising: the neck key point neck, right shoulder key point right_shoulder, right elbow key point right_elbow, right wrist key point right_wrist, left shoulder key point left_shoulder, left elbow key point left_elbow, left wrist key point left_wrist, right hip key point right_hip, right knee key point right_knee, right ankle key point right_ankle, left hip key point left_hip, left knee key point left_knee, and left ankle key point left_ankle.
Because the nose is positioned in the middle of the face, the face and the five sense organs can be found around the nose, and therefore, a square area containing the whole face is obtained by taking the coordinates (x, y) of the key point of the nose as the center and taking two times of the distance from the nose to the ears as the side length;
The coordinates (x, y) are changed into new coordinates (x', y') by using an affine matrix, 68 key points are calibrated by using the model trained by Dlib, namely "shape_predictor_68_face_landmarks.dat", and the coordinates of the 68 face key points are found, so that the face key points are expanded from 5 to 68.
Wherein the 68 face key points comprise:
17 key points (numbered 1-17) depicting the contour of the face, 10 key points (numbered 18-27) depicting the left and right eyebrows, 4 key points (numbered 28-31) depicting the position of the bridge of the nose, 5 key points (numbered 32-36) depicting the contour of the lower edge of the nose, 12 key points (numbered 37-48) depicting the left and right eyes, 12 key points (numbered 49-60) depicting the outer contour of the lips, and 8 key points (numbered 61-68) depicting the inner contour of the lips.
Those skilled in the art will appreciate that the specific location of each of the 68 individual face keypoints is well known in the art and will not be described in detail herein.
It should be understood that the face key point extraction part specifically comprises: using the "nose" key point information to locate the face and detect more comprehensive face key point information.
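A minimal sketch of expanding the 5 OpenPose face key points to 68 Dlib landmarks is given below; the OpenPose call itself is omitted, and the keypoint dictionary passed in is an assumed format rather than the library's literal output.

```python
import cv2
import dlib
import numpy as np

predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_landmarks(frame, keypoints):
    """keypoints: dict mapping 'nose', 'right_ear', 'left_ear' to (x, y) pixel coordinates."""
    nose = np.array(keypoints["nose"], dtype=float)
    ear = np.array(keypoints.get("right_ear") or keypoints["left_ear"], dtype=float)
    side = 2 * int(np.linalg.norm(nose - ear))          # side length = 2 x nose-to-ear distance
    x, y = (nose - side // 2).astype(int)               # square region centred on the nose
    rect = dlib.rectangle(int(x), int(y), int(x + side), int(y + side))
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    shape = predictor(gray, rect)                        # calibrate the 68 key points
    return [(p.x, p.y) for p in shape.parts()]
```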
Further, the S104: extracting a face feature vector of a face image to be detected to obtain a face feature vector; carrying out feature fusion on the face feature vector and the face key points, inputting the fused features into a trained classifier, and outputting an expression preliminary classification result; the method specifically comprises the following steps:
inputting a face image to be detected into a trained convolutional neural network CNN for face feature vector extraction to obtain a face feature vector;
performing feature fusion on the face feature vector and the face key points, wherein the feature fusion adopts one of serial fusion, parallel fusion or weighted fusion;
and inputting the fused features into a trained classifier, and outputting the primary expression classification result.
Further, the training of the trained convolutional neural network CNN specifically includes:
constructing a training set, wherein the training set comprises face region images with known dangerous or safe labels;
taking the face region images and the known expression classification labels (safe/dangerous) as input values of the convolutional neural network, and training the convolutional neural network to obtain the trained convolutional neural network.
Further, the training of the trained classifier specifically includes:
extracting face key points of the face region images in the training set to obtain the face key points of the face region images;
carrying out feature fusion on the face key points of the face region image and the image features extracted by the trained convolutional neural network;
and taking the fused features and the corresponding expression labels as input values of the support vector classification model, and training the untrained support vector classification model to obtain the trained support vector classification model.
And the trained convolutional neural network is connected with the trained support vector classification model in series to form a trained dangerous expression detection model.
It should be understood that training the convolutional neural network first, and then training the support vector classification model based on the features extracted from the trained convolutional neural network, can improve the precision and speed of training.
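A minimal sketch of this two-stage training is given below; extract_features stands for the trained CNN truncated at its penultimate layer, and the fusion mode (serial concatenation), function names and shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def fuse(cnn_features, keypoints):
    # serial (concatenation) fusion of the CNN feature vector and the 68 face key points
    return np.concatenate([np.ravel(cnn_features), np.ravel(keypoints)])

def train_classifier(extract_features, face_images, keypoint_sets, labels):
    """labels: 0 = safe expression, 1 = dangerous expression."""
    fused = np.stack([fuse(extract_features(img), kps)
                      for img, kps in zip(face_images, keypoint_sets)])
    clf = SVC(probability=True)      # support vector classification model
    clf.fit(fused, labels)
    return clf
```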
Illustratively, the convolutional neural network input image requires a grayscale map of 48 x 48 pixel values.
The convolutional neural network model CNN includes: the device comprises an input layer, a first convolution layer and pooling layer, a second convolution layer and pooling layer, a first full-connection layer and a second full-connection layer which are connected in sequence.
The convolutional layer extracts shallow features. A trainable convolution kernel with a fixed stride performs a convolution operation on the output of the previous layer, a trainable bias term is added, and the result is output to the next layer after nonlinear activation by the ReLU activation function. Each convolution kernel extracts one feature, so n convolution kernels extract n features, expressed in the form of feature maps.
$$x_j^{l} = \sigma\Big(\sum_{i \in M_j} x_i^{l-1} * \omega_{ij}^{l} + b_j^{l}\Big)$$
wherein $x_i^{l-1}$ represents the i-th input of layer l-1 and $x_j^{l}$ represents the j-th output of layer l; $M_j$ represents the receptive field of the input layer, $\omega$ represents the convolution kernel, and m and n represent the size of the convolution kernel; $b$ denotes the neuron bias and $\sigma(\cdot)$ denotes the ReLU activation function.
The pooling layer down-samples the image. The pooling function replaces the output of the network at a location with an overall statistic of the neighboring outputs at that location; max pooling replaces the output of a region with the maximum over all neurons at that region location, which reduces the data dimensionality while preserving the feature information of the input image.
$$x_j^{l+1} = p\Big(\beta_j^{l+1}\,\mathrm{down}\big(x_j^{l}\big) + b_j^{l+1}\Big)$$
wherein $x_j^{l}$ represents the j-th input of layer l and $x_j^{l+1}$ represents the j-th output of layer l+1; $\beta$ is the coefficient term and $b$ is the neuron bias; $p(\cdot)$ denotes the activation function $p(x) = x$; $\mathrm{down}(\cdot)$ represents the down-sampling function, which extracts the maximum of the pixels in the pooled region.
The fully connected layer integrates the features extracted by the above process.
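A minimal sketch of a network with this layout (input layer, two convolution-and-pooling stages, two fully connected layers) is given below; the filter counts, kernel sizes and the Keras framework are illustrative assumptions not fixed by this embodiment.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn():
    model = keras.Sequential([
        layers.Input(shape=(48, 48, 1)),               # 48 x 48 grayscale input
        layers.Conv2D(32, (3, 3), activation="relu"),  # first convolution layer
        layers.MaxPooling2D((2, 2)),                   # first pooling layer (max pooling)
        layers.Conv2D(64, (3, 3), activation="relu"),  # second convolution layer
        layers.MaxPooling2D((2, 2)),                   # second pooling layer
        layers.Flatten(),
        layers.Dense(128, activation="relu"),          # first fully connected layer
        layers.Dense(2, activation="softmax"),         # second fully connected layer: safe / dangerous
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```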
Performing feature fusion on the features obtained by the two feature extraction modes to obtain a higher-dimensionality human face feature; and carrying out dangerous expression detection on the fused human face features by using a support vector classification model.
The support vector classification model is an application of the support vector machine principle to classification problems. The support vector machine seeks a hyperplane that can divide a linearly separable space; the points in the space closest to the hyperplane are the support vectors, and the distance from the support vectors to the hyperplane is called the margin.
The mathematical expression for the geometric spacing is as follows:
Figure GDA0003334984180000114
the objective function of the support vector classification model is:
Figure GDA0003334984180000115
where h (x) is the discriminant function of the hyperplane, w is the weight vector, b is the offset vector, | w | is the inner product of w.
Further, constructing a training set specifically includes:
re-labeling a picture database with known seven-class expression labels as a two-class picture database.
Known databases commonly used for seven-class expression recognition include FER2013, CK+ and the like, all of which are divided according to the seven basic facial expression classes:
(1) The facial features of the Anger emotion are: eyebrows lowered and drawn together, glaring eyes, and tightly pressed lips;
(2) the facial features of the Disgust emotion are: wrinkling between the eyebrows and a raised upper lip;
(3) the facial features of the Fear emotion are: eyebrows raised and drawn together, upper eyelids raised and tightened, lips slightly open and stretched horizontally toward the ears, and the like;
(4) the facial features of the Happiness emotion are: crow's-feet wrinkles at the corners of the eyes; raised cheeks pulling the muscles around the eye sockets;
(5) the facial features of the Sadness emotion are: drooping upper eyelids, dull eyes, and both mouth corners slightly pulled down;
(6) the facial features of the Surprise emotion are: raised eyebrows, widened eyes, and a slightly open mouth;
(7) the facial features of the Neutral emotion are: no obvious expressive features.
Anger in the database is directly classified as a dangerous expression;
Happiness, Sadness, Surprise and Neutral in the database are directly classified as safe expressions.
Meanwhile, since specific images of Disgust and Fear have fuzzy boundaries, thresholds are established for the degree of facial muscle contraction and the degree of increase in the area of the five sense organs respectively; if a threshold is exceeded, the image is judged to be a dangerous expression, otherwise a safe expression.
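A minimal sketch of this re-labeling is given below for a FER2013-style dataset; the label names and the is_ambiguous_dangerous() threshold check are assumed helpers, not definitions given in this embodiment.

```python
# direct mapping of five of the seven classes to the two-class labels
SEVEN_TO_TWO = {
    "Anger": 1,                                   # dangerous expression
    "Happiness": 0, "Sadness": 0,                 # safe expressions
    "Surprise": 0, "Neutral": 0,
}

def relabel(seven_class_label, image):
    if seven_class_label in SEVEN_TO_TWO:
        return SEVEN_TO_TWO[seven_class_label]
    # Disgust and Fear: decide by thresholds on facial muscle contraction and
    # the area increase of the five sense organs (assumed helper function)
    return 1 if is_ambiguous_dangerous(image) else 0
```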
Further, the step S105: judging whether the key points of the limbs exceed a set area, and if so, obtaining dangerous limb results; otherwise, obtaining the result of the non-dangerous limb; the method specifically comprises the following steps:
A safe region conforming to the activity specification of the person to be detected is set according to the fit between the activity range of the person to be detected and the capture range of the camera, and the pixel region corresponding to the safe region on the screen display is set. When the position information contained in the limb key points falls outside the safe region, the auxiliary result for a dangerous expression is output: a dangerous limb is recognized; otherwise, the auxiliary result for a safe expression is output: a safe limb is recognized.
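A minimal sketch of this limb check is given below; the pixel bounds of the safe region are an illustrative assumption corresponding to the camera's capture range.

```python
SAFE_REGION = (100, 50, 540, 430)   # assumed (x_min, y_min, x_max, y_max) in screen pixels

def limb_is_dangerous(limb_keypoints):
    """limb_keypoints: iterable of (x, y) pixel coordinates from OpenPose."""
    x_min, y_min, x_max, y_max = SAFE_REGION
    for x, y in limb_keypoints:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            return True          # a limb key point left the safe region: dangerous limb
    return False                 # all limb key points inside the region: safe limb
```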
Further, the S106: correcting the preliminary expression classification result by combining the limb detection result to obtain a final expression classification result of the single-frame image; the method specifically comprises the following steps:
the preliminary expression classification result is corrected by combining the limb detection result, weighting the preliminary expression classification result at 80% and the limb detection result at 20%, to obtain the final expression classification result of the single-frame image.
Further, the step S106: correcting the preliminary expression classification result by combining the limb detection result to obtain a final expression classification result of the single-frame image; the method specifically comprises the following steps:
The preliminary expression classification result obtained by the support vector classification model comprises the probabilities of the two expression classes:
P(x=0) and P(x=1), with P(x=0) + P(x=1) = 1;
where x denotes the expression classification event, P(x=0) denotes the probability of being recognized as a safe expression, and P(x=1) denotes the probability of being recognized as a dangerous expression.
The limb calculation result obtained by the posture analysis model is as follows:
when a safe limb action is recognized: P(y=0) = 1 and P(y=1) = 0;
when a dangerous limb action is recognized: P(y=0) = 0 and P(y=1) = 1;
where y denotes the limb recognition event, P(y=0) denotes the probability of recognition as a safe limb action, and P(y=1) denotes the probability of recognition as a dangerous limb action.
The preliminary expression classification result is corrected according to the following calculation formula:
P(X=0)=0.8*P(x=0)+0.2*P(y=0)
P(X=1)=0.8*P(x=1)+0.2*P(y=1)
P(X=0) represents the probability that the final expression classification result is a safe expression;
P(X=1) represents the probability that the final expression classification result is a dangerous expression;
The decision rule is as follows:
$$X = \begin{cases} 0, & P(X=0) \geq P(X=1) \\ 1, & P(X=0) < P(X=1) \end{cases}$$
where X is the final expression classification result, X = 0 denotes a safe expression and X = 1 denotes a dangerous expression; 0.8 and 0.2 are the correction coefficients.
and obtaining the final expression classification result of the single-frame image.
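A minimal sketch of this correction step is given below, combining the SVC class probabilities with the limb detection result using the 0.8 / 0.2 correction coefficients above.

```python
def correct_expression(p_safe, p_dangerous, dangerous_limb):
    """p_safe, p_dangerous: P(x=0), P(x=1) from the SVC; dangerous_limb: bool from the limb check."""
    p_y0, p_y1 = (0.0, 1.0) if dangerous_limb else (1.0, 0.0)
    P_safe = 0.8 * p_safe + 0.2 * p_y0
    P_dangerous = 0.8 * p_dangerous + 0.2 * p_y1
    return 0 if P_safe >= P_dangerous else 1    # 0 = safe expression, 1 = dangerous expression
```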
Further, the step S107: smoothing the final expression classification results of all frame images in the video of the person to be detected to obtain a dynamic dangerous expression detection result of the video of the person to be detected; the method specifically comprises the following steps:
sequencing the final expression classification results of all the frame images according to a time sequence to obtain a time sequence;
setting a sliding window on the time sequence, and setting the sliding step length of the sliding window and the size N of the sliding window;
obtaining the final expression classification results of the N frames of images in the sliding window, putting the final expression classification results into a calculation sequence, and if the number of the detected dangerous expression results is greater than or equal to N/2, outputting a dangerous detection result of the video of the person to be detected and generating a dangerous alarm;
after the sliding window slides on the time sequence, acquiring the final expression classification result of the new N frames of images in the sliding window, and putting the final expression classification result into a calculation sequence, if the number of the detected dangerous expression results is more than or equal to N/2, outputting a dangerous detection result of the video of the person to be detected, and generating a dangerous alarm;
the sliding window continues to move until the video meeting ends.
Smoothing is performed on the output of the dangerous expression detection model of the person to be detected. Because the model only recognizes the expression in a single-frame picture, and facial expressions do not change abruptly within an extremely short time, an expression smoothing step is added to the system, linking the preceding and following expression detection results to reduce detection errors.
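A minimal sketch of the sliding-window smoothing over the per-frame results is given below; the window size N and the one-frame sliding step are illustrative assumptions.

```python
from collections import deque

N = 10   # assumed sliding window size

def smooth_results(final_results):
    """final_results: iterable of per-frame results (0 = safe, 1 = dangerous), in time order."""
    window = deque(maxlen=N)
    for result in final_results:
        window.append(result)
        # once the window is full, raise a danger alarm when at least N/2 results are dangerous
        if len(window) == N and sum(window) >= N / 2:
            yield "dangerous"
        else:
            yield "safe"
```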
Example two
The embodiment provides a dangerous expression detection system for a person to be detected based on deep learning;
The system for detecting dangerous expressions of a person to be detected based on deep learning comprises:
an acquisition module configured to: acquiring a video of a person to be detected;
a face detection module configured to: carrying out face detection on images in a video of a person to be detected; capturing a single-frame image containing an identifiable face to obtain a face image to be detected;
a keypoint detection module configured to: the method comprises the following steps of detecting key points of a human body in a single-frame image of a video of a person to be detected, wherein the key points of the human body comprise: face key points and limb key points;
a preliminary classification module configured to: extracting a face feature vector of a face image to be detected to obtain a face feature vector; carrying out feature fusion on the face feature vector and the face key points, inputting the fused features into a trained classifier, and outputting an expression preliminary classification result;
a determination module configured to: judging whether the limb key points exceed a set area, and if so, obtaining a dangerous-limb result; otherwise, obtaining a non-dangerous-limb result;
a correction module configured to: and correcting the preliminary expression classification result by combining the limb detection result to obtain a final expression classification result of the single-frame image.
It should be noted here that the acquiring module, the face detecting module, the key point detecting module, the preliminary classifying module, the determining module and the correcting module correspond to steps S101 to S106 in the first embodiment, and the modules are the same as the corresponding steps in implementation examples and application scenarios, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, it is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A dangerous expression detection method for a person to be detected based on deep learning is characterized by comprising the following steps:
acquiring a video of a person to be detected;
carrying out face detection on images in a video of a person to be detected; capturing a single-frame image containing an identifiable face to obtain a face image to be detected;
the method comprises the following steps of detecting key points of a human body in a single-frame image of a video of a person to be detected, wherein the key points of the human body comprise: face key points and limb key points;
extracting a face feature vector of a face image to be detected to obtain a face feature vector; carrying out feature fusion on the face feature vector and the face key points, inputting the fused features into a trained classifier, and outputting an expression primary classification result; the method specifically comprises the following steps:
inputting a face image to be detected into a trained convolutional neural network CNN for face feature vector extraction to obtain a face feature vector;
performing feature fusion on the face feature vector and the face key points, wherein the feature fusion adopts one of serial fusion, parallel fusion or weighted fusion;
inputting the fused features into a trained classifier, and outputting an expression preliminary classification result;
judging whether the limb key points exceed a set area, and if so, obtaining a dangerous-limb result; otherwise, obtaining a non-dangerous-limb result;
correcting the preliminary expression classification result by combining the limb detection result to obtain a final expression classification result of the single-frame image;
and smoothing the final expression classification result of all frame images in the video of the person to be detected, and outputting the dynamic dangerous expression detection result of the video of the person to be detected.
2. The method for detecting the dangerous expressions of the people to be detected based on the deep learning as claimed in claim 1, wherein the detection of the human key points is performed on a single frame image in the video of the people to be detected, wherein the human key points comprise: face key points and limb key points; the method specifically comprises the following steps:
using the OpenPose library to detect human key points in a single-frame image of the video of the person to be detected to obtain 18 human key points, wherein the 18 human key points include:
the face key points, comprising: a nose key point nose, a right eye key point right_eye, a left eye key point left_eye, a right ear key point right_ear, and a left ear key point left_ear;
limb key points, comprising: a neck key point neck, a right shoulder key point right_shoulder, a right elbow key point right_elbow, a right wrist key point right_wrist, a left shoulder key point left_shoulder, a left elbow key point left_elbow, a left wrist key point left_wrist, a right hip key point right_hip, a right knee key point right_knee, a right ankle key point right_ankle, a left hip key point left_hip, a left knee key point left_knee, and a left ankle key point left_ankle;
because the nose is positioned in the middle of the face, the face and the five sense organs can be found around the nose, and therefore, a square area containing the whole face is obtained by taking the coordinates (x, y) of the key point of the nose as the center and taking two times of the distance from the nose to the ears as the side length;
the coordinates (x, y) are changed into new coordinates (x', y') by using an affine matrix, 68 key points are calibrated by using the model trained by Dlib, "shape_predictor_68_face_landmarks.dat", and the coordinates of the 68 face key points are found, so that the face key points are expanded from 5 to 68.
3. The method for detecting the dangerous expressions of the people to be detected based on the deep learning as claimed in claim 1, wherein the training of the trained convolutional neural network CNN specifically comprises:
constructing a training set, wherein the training set comprises human face region images with known dangerous or non-dangerous labels;
taking the face region image and the known expression classification labels as input values of a convolutional neural network, and training the convolutional neural network; and obtaining the trained convolutional neural network.
4. The method for detecting the dangerous expressions of the people to be detected based on the deep learning as claimed in claim 1, wherein the training of the trained classifier specifically comprises the following steps:
extracting face key points of the face region images in the training set to obtain the face key points of the face region images;
carrying out feature fusion on the face key points of the face region image and the image features extracted by the trained convolutional neural network;
and taking the fused features and the corresponding expression labels as input values of the support vector classification model, and training the untrained support vector classification model to obtain the trained support vector classification model.
5. The method for detecting the dangerous expressions of the person to be detected based on the deep learning as claimed in claim 3 or 4, wherein the constructing of the training set specifically comprises:
re-marking the known expression seven-classification picture database into a second classification picture database;
directly dividing Anger in the database into dangerous expression classifications;
directly dividing Happiness, Sadness, Surprise and Neutral in the database into the safe expression classification;
meanwhile, since specific images of the Disgust and Fear expressions have fuzzy boundaries, thresholds are respectively established for the degree of facial muscle contraction and the degree of increase in the area of the five sense organs; a dangerous expression is determined if a threshold is exceeded, and a safe expression is determined otherwise.
6. The method for detecting the dangerous expressions of the person to be detected based on the deep learning as claimed in claim 1, wherein whether the key points of the limbs exceed a set area is judged, and if the key points of the limbs exceed the set area, dangerous limb results are obtained; otherwise, obtaining the result of the non-dangerous limb; the method specifically comprises the following steps:
setting a safety region which accords with the activity specification of the person to be detected according to the fit degree of the activity range of the person to be detected and the capture range of the camera, and setting a pixel value region corresponding to the safety region on a screen display; when the position information contained in the limb key points is out of the safe area, outputting the auxiliary result of the dangerous expression: identifying a dangerous limb; otherwise, outputting the auxiliary result of the safe expression: and (5) identifying the safety limbs.
7. The method for detecting the dangerous expressions of the people to be detected based on the deep learning as claimed in claim 1, wherein the preliminary expression classification result of the expressions is corrected by combining the body detection result to obtain the final expression classification result of a single-frame image; the method specifically comprises the following steps:
the preliminary expression classification result obtained by the support vector classification model comprises the probabilities of the two expression classes:
P(x=0) and P(x=1), with P(x=0) + P(x=1) = 1;
wherein x represents the expression classification event, P(x=0) represents the probability of being identified as a safe expression, and P(x=1) represents the probability of being identified as a dangerous expression;
the limb calculation result obtained by the posture analysis model is as follows:
when a safe limb action is identified: P(y=0) = 1 and P(y=1) = 0;
when a dangerous limb action is identified: P(y=0) = 0 and P(y=1) = 1;
wherein y represents the limb recognition event, P(y=0) represents the probability of recognition as a safe limb action, and P(y=1) represents the probability of recognition as a dangerous limb action;
the preliminary expression classification result is corrected according to the following calculation formula:
P(X=0)=0.8*P(x=0)+0.2*P(y=0)
P(X=1)=0.8*P(x=1)+0.2*P(y=1)
P(X=0) represents the probability that the final expression classification result is a safe expression;
P(X=1) represents the probability that the final expression classification result is a dangerous expression;
the decision rule is as follows:
$$X = \begin{cases} 0, & P(X=0) \geq P(X=1) \\ 1, & P(X=0) < P(X=1) \end{cases}$$
wherein X is the final expression classification result, X = 0 denotes a safe expression and X = 1 denotes a dangerous expression; 0.8 and 0.2 are the correction coefficients;
and obtaining the final expression classification result of the single-frame image.
8. The method for detecting the dangerous expressions of the person to be detected based on the deep learning as claimed in claim 1, wherein in the video of the person to be detected, the final expression classification results of all the frame images are smoothed to obtain the dynamic dangerous expression detection results of the video of the person to be detected; the method specifically comprises the following steps:
sequencing the final expression classification results of all the frame images according to a time sequence to obtain a time sequence;
setting a sliding window on the time sequence, and setting the sliding step length of the sliding window and the size N of the sliding window;
obtaining the final expression classification results of the N frames of images in the sliding window, putting the final expression classification results into a calculation sequence, and if the number of the detected dangerous expression results is greater than or equal to N/2, outputting a dangerous detection result of the video of the person to be detected and generating a dangerous alarm;
after the sliding window slides on the time sequence, acquiring the final expression classification result of the new N frames of images in the sliding window, and putting the final expression classification result into a calculation sequence, if the number of the detected dangerous expression results is more than or equal to N/2, outputting a dangerous detection result of the video of the person to be detected, and generating a dangerous alarm;
the sliding window continues to move until the video meeting ends.
9. A system for detecting dangerous expressions of a person to be detected based on deep learning, characterized by comprising:
an acquisition module configured to: acquiring a video of a person to be detected;
a face detection module configured to: carrying out face detection on images in a video of a person to be detected;
capturing a single-frame image containing an identifiable face to obtain a face image to be detected;
a keypoint detection module configured to: the method comprises the following steps of detecting key points of a human body in a single-frame image of a video of a person to be detected, wherein the key points of the human body comprise: face key points and limb key points;
a preliminary classification module configured to: extracting features from the face image to be detected to obtain a face feature vector; carrying out feature fusion on the face feature vector and the face key points, inputting the fused features into a trained classifier, and outputting a preliminary expression classification result; specifically comprising:
inputting the face image to be detected into a trained convolutional neural network (CNN) for face feature extraction to obtain the face feature vector;
performing feature fusion on the face feature vector and the face key points, wherein the feature fusion adopts one of serial fusion, parallel fusion or weighted fusion;
inputting the fused features into the trained classifier, and outputting the preliminary expression classification result;
a determination module configured to: judging whether the limb key points exceed a set area; if so, obtaining a dangerous limb result; otherwise, obtaining a non-dangerous limb result;
a correction module configured to: correcting the preliminary expression classification result by combining the limb detection result to obtain the final expression classification result of the single-frame image; and smoothing the final expression classification results of all frame images in the video of the person to be detected, and outputting the dynamic dangerous expression detection result of the video of the person to be detected.
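For illustration only, minimal Python sketches of two steps in the system of claim 9: the feature fusion performed by the preliminary classification module and the area check performed by the determination module. The claim does not define the parallel and weighted fusion operators or the shape of the set area, so the zero-padded variants and the rectangular area below are assumptions:

    import numpy as np

    def fuse_features(face_vec, face_keypoints, mode="serial", w=0.5):
        """Fuse a CNN face feature vector with flattened face key-point coordinates.
        serial: concatenation; parallel: stack as two rows after zero-padding;
        weighted: weighted sum after zero-padding (assumed interpretations)."""
        fv = np.asarray(face_vec, dtype=np.float32).ravel()
        kp = np.asarray(face_keypoints, dtype=np.float32).ravel()
        if mode == "serial":
            return np.concatenate([fv, kp])
        size = max(fv.size, kp.size)
        fv_p = np.pad(fv, (0, size - fv.size))
        kp_p = np.pad(kp, (0, size - kp.size))
        if mode == "parallel":
            return np.stack([fv_p, kp_p])
        if mode == "weighted":
            return w * fv_p + (1 - w) * kp_p
        raise ValueError(f"unknown fusion mode: {mode}")

    def limbs_exceed_area(limb_keypoints, area):
        """Return True (dangerous limb result) if any limb key point lies outside
        the set area, given as (x_min, y_min, x_max, y_max) in pixel coordinates."""
        x_min, y_min, x_max, y_max = area
        return any(not (x_min <= x <= x_max and y_min <= y <= y_max)
                   for x, y in limb_keypoints)

    # Example: a 128-D face feature vector fused with 68 (x, y) face key points.
    fused = fuse_features(np.random.rand(128), np.random.rand(68, 2))
    print(fused.shape)  # (264,)
    print(limbs_exceed_area([(10, 20), (300, 40)], area=(0, 0, 200, 200)))  # True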
10. An electronic device, comprising: one or more processors and one or more memories; wherein the processor is connected to the memory, one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to cause the electronic device to perform the method of any one of claims 1-8.
11. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 8.
CN202110319441.8A 2021-03-25 2021-03-25 Method and system for detecting dangerous expressions of people to be detected based on deep learning Expired - Fee Related CN113723165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110319441.8A CN113723165B (en) 2021-03-25 2021-03-25 Method and system for detecting dangerous expressions of people to be detected based on deep learning

Publications (2)

Publication Number Publication Date
CN113723165A CN113723165A (en) 2021-11-30
CN113723165B true CN113723165B (en) 2022-06-07

Family

ID=78672583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110319441.8A Expired - Fee Related CN113723165B (en) 2021-03-25 2021-03-25 Method and system for detecting dangerous expressions of people to be detected based on deep learning

Country Status (1)

Country Link
CN (1) CN113723165B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694269A (en) * 2022-02-28 2022-07-01 江西中业智能科技有限公司 Human behavior monitoring method, system and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268859A (en) * 2018-02-08 2018-07-10 南京邮电大学 A kind of facial expression recognizing method based on deep learning
CN109165685A (en) * 2018-08-21 2019-01-08 南京邮电大学 Prison prisoner potentiality risk monitoring method and system based on expression and movement
CN109522945A (en) * 2018-10-31 2019-03-26 中国科学院深圳先进技术研究院 One kind of groups emotion identification method, device, smart machine and storage medium
WO2020253349A1 (en) * 2019-06-19 2020-12-24 深圳壹账通智能科技有限公司 Image recognition-based driving behavior warning method and apparatus, and computer device
CN110287912A (en) * 2019-06-28 2019-09-27 广东工业大学 Method, apparatus and medium are determined based on the target object affective state of deep learning
CN112307920A (en) * 2020-10-22 2021-02-02 东云睿连(武汉)计算技术有限公司 High-risk work-type operator behavior early warning device and method
CN112381061A (en) * 2020-12-04 2021-02-19 中国科学院大学 Facial expression recognition method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Facial Expression Recognition System Based on LBP and SVM; Yao Lisha et al.; Journal of Guizhou Normal University (Natural Science Edition); 2020-01-15 (No. 01); full text *

Also Published As

Publication number Publication date
CN113723165A (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN106960202B (en) Smiling face identification method based on visible light and infrared image fusion
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN113272816A (en) Whole-person correlation for face screening
Rao et al. Sign Language Recognition System Simulated for Video Captured with Smart Phone Front Camera.
Miah et al. Rotation, Translation and Scale Invariant Sign Word Recognition Using Deep Learning.
Reshna et al. Spotting and recognition of hand gesture for Indian sign language recognition system with skin segmentation and SVM
CN111062328A (en) Image processing method and device and intelligent robot
Roa’a et al. Automated Cheating Detection based on Video Surveillance in the Examination Classes
Podder et al. Time efficient real time facial expression recognition with CNN and transfer learning
CN113723165B (en) Method and system for detecting dangerous expressions of people to be detected based on deep learning
Assiri et al. Face emotion recognition based on infrared thermal imagery by applying machine learning and parallelism
Lee et al. Face and facial expressions recognition system for blind people using ResNet50 architecture and CNN
Rizwan et al. Automated Facial Expression Recognition and Age Estimation Using Deep Learning.
Jindal et al. Sign Language Detection using Convolutional Neural Network (CNN)
Ansar et al. Hand gesture recognition for characters understanding using convex Hull landmarks and geometric features
Singh et al. Robust modelling of static hand gestures using deep convolutional network for sign language translation
Ariza et al. Recognition system for facial expression by processing images with deep learning neural network
Lin et al. A coarse-to-fine pattern parser for mitigating the issue of drastic imbalance in pixel distribution
Moran Classifying emotion using convolutional neural networks
Ptucha et al. Fusion of static and temporal predictors for unconstrained facial expression recognition
Orovwode et al. Development of a Sign Language Recognition System Using Machine Learning
Zhao et al. Face detection based on skin color
Frieslaar Robust south african sign language gesture recognition using hand motion and shape
Hasan et al. Facial Human Emotion Recognition by Using YOLO Faces Detection Algorithm
Aleesa et al. Dataset classification: An efficient feature extraction approach for grammatical facial expression recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220607