CN113408434B - Intelligent monitoring expression recognition method, device, equipment and storage medium - Google Patents

Intelligent monitoring expression recognition method, device, equipment and storage medium

Info

Publication number
CN113408434B
CN113408434B (granted publication of application CN202110694160.0A)
Authority
CN
China
Prior art keywords
sequence
expression
image
model
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110694160.0A
Other languages
Chinese (zh)
Other versions
CN113408434A (en)
Inventor
崔子栋
吴毳
李津轩
姜峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN202110694160.0A priority Critical patent/CN113408434B/en
Publication of CN113408434A publication Critical patent/CN113408434A/en
Application granted granted Critical
Publication of CN113408434B publication Critical patent/CN113408434B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiment of the invention provides an expression recognition method, device, equipment and storage medium for intelligent monitoring, and relates to the technical field of intelligent monitoring. The expression recognition method comprises: S3B0, acquiring an image sequence, wherein the image sequence contains a target person; S3B1, obtaining a face region in the image sequence through a face detection model; S3B2, obtaining expression information in the face region through an expression recognition model; S3B3, generating an initial expression sequence according to the time sequence of the image sequence and the expression information; and S3B4, correcting the initial expression sequence through a prediction model to obtain a facial expression sequence. In this embodiment, the face detection model first extracts the face image and expression recognition is then performed, which greatly improves recognition efficiency; the recognition result is then corrected through the prediction model, which greatly improves recognition accuracy.

Description

Intelligent monitoring expression recognition method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of intelligent monitoring, in particular to an expression recognition method, device, equipment and storage medium for intelligent monitoring.
Background
In order to detect accidents involving elderly people or children in a timely manner, cameras are often installed in the areas where they are active to capture footage in real time. The footage captured by the cameras is analysed in real time by devices such as a local server or a cloud server, and an alarm is generated to notify the relevant personnel when it is determined that the target person has had an accident.
In particular, in the prior art, the expression of the target person can be analysed to determine whether the target shows abnormal expressions such as pain or anger, and thereby whether an accident has occurred. However, expression recognition in the prior art is not sufficiently accurate; false alarms are easily triggered, causing unnecessary trouble for the relevant personnel.
Disclosure of Invention
The invention provides an intelligent monitoring expression recognition method, device, equipment and storage medium, which are used for solving the problem of inaccurate expression recognition in the related technology.
In a first aspect, an embodiment of the present invention provides an expression recognition method for intelligent monitoring, including the following steps:
S3B0, acquiring an image sequence; wherein the image sequence comprises a target person;
S3B1, obtaining a face region in the image sequence through a face detection model;
S3B2, obtaining expression information in the face region through an expression recognition model;
S3B3, generating an initial expression sequence according to the time sequence of the image sequence and the expression information;
S3B4, correcting through a prediction model according to the initial expression sequence to obtain a facial expression sequence.
Optionally, the face detection model is a YOLOv3 face recognition model; the expression recognition model is a VGG16 expression classification model;
optionally, the expression information includes x classes; wherein the x classes include neutral, serious, panic, curious, surprise, happiness and contempt;
optionally, the step S3B3 specifically includes:
S3B31, generating a time sequence T according to time information of each frame in the image sequence;
and S3B32, sorting the expression information according to the time sequence to obtain the initial expression sequence I.
Optionally, the prediction model is an LSTM model; the input length of the LSTM model is n, and the characteristics of unit length comprise x types;
optionally, the step S3B4 specifically includes:
S3B41, dividing the initial expression sequence into an input sequence with the length of n;
S3B42, inputting the input sequence into the prediction model to obtain an output sequence with the length of n;
S3B43, obtaining the facial expression sequence according to the output sequence.
Optionally, the input length n is 11 frames.
In a second aspect, an embodiment of the present invention provides an intelligent monitoring expression recognition device, including:
the sequence module is used for acquiring an image sequence; wherein the image sequence contains images of people;
the region module is used for obtaining a face region in the image sequence through a face detection model;
the expression module is used for obtaining expression information in the face area through an expression recognition model;
the initial module is used for generating an initial expression sequence according to the time sequence of the image sequence and the expression information;
and the final module is used for correcting through a prediction model according to the initial expression sequence so as to obtain a facial expression sequence.
Optionally, the face detection model is a YOLOv3 face recognition model; the expression recognition model is a VGG16 expression classification model;
optionally, the expression information includes x classes; wherein the x classes include neutral, serious, panic, curious, surprise, happiness and contempt;
Optionally, the initial module specifically includes:
a time unit, configured to generate a time sequence T according to time information of each frame in the image sequence;
and the initial unit is used for sequencing the expression information according to the time sequence so as to obtain the initial expression sequence I.
Optionally, the prediction model is an LSTM model; the input length of the LSTM model is n, and the characteristics of unit length comprise x types;
optionally, the final module specifically includes:
an input unit for dividing the initial expression sequence into input sequences with the length of n;
an output unit for inputting the input sequence to the prediction model to obtain an output sequence of length n;
and a final unit, configured to obtain the facial expression sequence according to the output sequence.
Optionally, the input length n is 11 frames.
In a third aspect, embodiments of the present invention provide an intelligent monitoring expression recognition device, including a processor, a memory, and a computer program stored in the memory; the computer program is executable by the processor to implement the intelligent monitoring expression recognition method as described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium including a stored computer program, where the computer program when executed controls a device in which the computer readable storage medium is located to execute the expression recognition method for intelligent monitoring as described in the first aspect.
By adopting the technical scheme, the invention can obtain the following technical effects:
according to the embodiment, the face detection module is used for extracting the face image and then carrying out expression recognition, so that the recognition efficiency is greatly improved, and the recognition result is corrected through the prediction model after recognition, so that the recognition accuracy is greatly improved. Has good practical significance.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a security monitoring method according to a first embodiment of the present invention.
Fig. 2 is a schematic diagram of a camera layout of a target area.
Fig. 3 is a block diagram of the LSTM model.
Fig. 4 is a block diagram of the structure of the SSD model.
Fig. 5 is a flow chart of a security monitoring method according to a first embodiment of the present invention.
Fig. 6 is a schematic diagram of a human skeletal model.
Fig. 7 is a schematic structural diagram of a security monitoring device according to a second embodiment of the present invention.
Fig. 8 is a flowchart of a security monitoring method according to a fifth embodiment of the present invention.
Fig. 9 is a schematic structural diagram of a safety monitoring device according to a sixth embodiment of the present invention.
Fig. 10 is a flowchart of a security monitoring method according to a ninth embodiment of the present invention.
Fig. 11 is a schematic structural diagram of a security monitoring device according to a tenth embodiment of the present invention.
Reference numerals in the figures: 0 - sequence module; 1 - video module; 2 - image module; 3 - coefficient module; 4 - level module; 5 - human model module; 6 - human coordinate module; 7 - human parameter module; 8 - human posture module; 9 - region module; 10 - expression module; 11 - initial module; 12 - final module; 13 - model module; 14 - detection module; 15 - classification module.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For a better understanding of the technical solution of the present invention, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, of the embodiments of the invention.
The invention is described in further detail below with reference to the attached drawings and detailed description:
Embodiment 1
Referring to fig. 1 to 6, a security monitoring method according to a first embodiment of the present invention may be performed by a security monitoring device, and in particular by one or more processors in the security monitoring device, to implement steps S1 to S4.
S1, receiving a plurality of monitoring videos of different angles of a target area.
Specifically, the safety monitoring device is electrically connected to the monitoring system of the target area and can receive and analyse the monitoring footage captured by the monitoring system. As shown in fig. 2, the monitoring system has at least three cameras installed in the target area. The cameras are installed 2.5 m above the ground so that the angle between each camera's shooting direction and the persons in the target area is no more than 45°. The at least three cameras are arranged at different angles around the target area.
It should be noted that the security monitoring device may be a cloud server, a local server, or a local computer, which is not particularly limited in the present invention.
S2, respectively acquiring image sequences of all people in the target area according to the plurality of monitoring videos.
Specifically, since the monitoring system has a plurality of cameras shooting from different angles, the monitoring videos contain video streams of each person in the target area from different angles. The image data of each person that is most suitable for subsequent analysis needs to be selected from these videos.
Based on the above embodiments, in an alternative embodiment of the present invention, step S2 specifically includes steps S21 to S23.
S21, acquiring skeleton information of each person in the target area at different angles through an OpenPose model according to the plurality of monitoring videos.
S22, acquiring image areas where all people are located according to the skeleton information. The image area is the area where the image with the largest skeleton area of each person is located.
S23, respectively extracting image sequences of all people from a plurality of monitoring videos according to the image areas.
It should be noted that, when a person enters the target area, the OpenPose model assigns the same identity to that person across the multiple monitoring video streams and continuously tracks the person. The OpenPose model can identify skeleton information of people in a video stream.
In this embodiment, the size of the area occupied by the skeleton is used as the area of the portrait captured by the camera. For each person in the target area, only the image acquired by the camera in the direction with the largest skeleton area is extracted to serve as the basis of analysis. That is, image information having the largest skeleton area of each person is extracted from a plurality of video streams based on skeleton information, and the image information is sorted into an image sequence based on the time sequence of the video streams.
It will be appreciated that the image in which a person's skeleton occupies the largest area often corresponds to a frontal view of that person. Therefore, the extracted image sequence contains the facial expression information and limb gesture information of each person.
In other embodiments, a face recognition model may additionally be combined to further ensure that the extracted image sequence shows the front face of each person in the target area, so as to ensure the validity of the information.
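For illustration, the view-selection logic described above can be sketched as follows. This is a minimal sketch, assuming OpenPose keypoints are already available per camera as (x, y, confidence) arrays; the function names and data layout are illustrative and not part of the patent.

```python
import numpy as np

def skeleton_area(keypoints, conf_thresh=0.1):
    """Approximate the image area occupied by a skeleton as the area of the
    bounding box of its confidently detected keypoints.
    keypoints: array of shape (num_joints, 3) holding (x, y, confidence)."""
    pts = keypoints[keypoints[:, 2] > conf_thresh][:, :2]
    if len(pts) == 0:
        return 0.0
    w = pts[:, 0].max() - pts[:, 0].min()
    h = pts[:, 1].max() - pts[:, 1].min()
    return float(w * h)

def select_best_view(per_camera_keypoints):
    """Given one skeleton per camera for the same tracked person, return the
    index of the camera whose skeleton covers the largest image area."""
    areas = [skeleton_area(kp) for kp in per_camera_keypoints]
    return int(np.argmax(areas))

# Example: three cameras observing the same person in one frame (placeholder data).
views = [np.random.rand(15, 3) * 100 for _ in range(3)]
best_camera = select_best_view(views)
```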
S3, acquiring the body posture, the facial expression sequence and the gesture sequence of each person according to the image sequence, and carrying out regression analysis to obtain the safety coefficient of each person.
Specifically, the state of a person can be determined from the person's body posture, facial expression and gesture sequence, i.e. whether the person is in an agitated state of anger and violent movement or in a calm and mild normal state. Regression analysis is then performed on the acquired states, arranged in a preset order, to obtain the safety coefficient for the current state.
Based on the above embodiments, in an alternative embodiment of the present invention, the step S3 specifically includes steps S3A1 to S3A4.
S3A1, acquiring joint point data according to the image sequence, and establishing a human skeleton model. The joint points include the head, neck, torso, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, and left ankle.
Specifically, the OpenPose model has already identified and tracked each joint point of each person. As shown in fig. 6, in this embodiment the above 15 joint points are selected from the joint points output by the OpenPose model, so as to build a human skeleton model that carries enough body-language information while using relatively few joint points and remaining easy to compute. In other embodiments, more or fewer joint points may be selected, and the invention is not limited in this regard.
S3A2, according to the human skeleton model, a human dynamic coordinate system is established by taking the torso joint as the origin, the direction from the torso joint to the neck as the Z axis, the direction from the left shoulder to the right shoulder as the X axis, and the facing direction of the body as the Y axis, as shown in fig. 6.
S3A3, according to a human body dynamic coordinate system, carrying out normalization processing on the coordinates of each joint according to the height, and then calculating body parameters. Wherein the parameters include height, first distance of head to x-axis, second distance of right foot to x-axis, third distance of left foot to x-axis, body inclination angle, foot angular velocity, shoulder center angular velocity, and moment information.
Specifically, in order to further analyse the body-language information of each person in the target area, a coordinate system needs to be established for each person in each image sequence so that the position information of each joint point, and hence the body-language information, can be analysed.
In order to adapt to the differences between scenes and improve the accuracy of body-language judgement, in this embodiment the coordinate information in the human dynamic coordinate system is normalized according to the height information of each person. In other embodiments, normalization may be omitted to reduce the amount of calculation, and the invention is not limited in this regard. After normalization, the body parameters of the persons in the image sequence are calculated from the coordinate information of each joint point. The normalization process is prior art and is not described in detail here.
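The construction of the dynamic body coordinate system and the height normalization of steps S3A2 and S3A3 can be sketched as follows. This is a minimal sketch assuming 3-D joint positions keyed by name; the axes are built exactly as described above but not re-orthogonalized, and the two example parameters shown are illustrative, since the patent does not spell out their precise definitions.

```python
import numpy as np

def body_frame(joints):
    """Build the dynamic body coordinate system: origin at the torso joint,
    Z from torso to neck, X from left to right shoulder, Y as the facing direction."""
    origin = joints["torso"]
    z = joints["neck"] - joints["torso"]
    x = joints["right_shoulder"] - joints["left_shoulder"]
    y = np.cross(z, x)                      # facing direction
    axes = np.stack([x, y, z])
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    return origin, axes

def to_body_coords(joints, height):
    """Express all joints in the body frame and normalise by body height."""
    origin, axes = body_frame(joints)
    return {name: axes @ (p - origin) / height for name, p in joints.items()}

def example_parameters(local):
    """Two of the body parameters listed above (illustrative subset)."""
    head = local["head"]
    inclination = np.degrees(np.arccos(abs(head[2]) / (np.linalg.norm(head) + 1e-8)))
    head_to_x_axis = np.linalg.norm(head[[1, 2]])   # distance of the head to the X axis
    return {"body_inclination_deg": inclination, "head_to_x_axis": head_to_x_axis}

# Tiny usage example with placeholder 3-D joint positions.
joints = {name: np.random.rand(3) for name in
          ["head", "neck", "torso", "left_shoulder", "right_shoulder"]}
params = example_parameters(to_body_coords(joints, height=1.7))
```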
S3A4, classifying through the SVM model according to the body parameters to obtain the body posture of the person in the image sequence.
Specifically, the above body parameters are input into an SVM model, and the body language of the person in the image sequence is obtained through the SVM model, for example a series of human action postures such as standing still, walking slowly at constant speed, pushing and retracting the arms at constant speed, swinging the arms horizontally, and swinging the arms vertically. Analysing human behaviour characteristics with an SVM model is prior art and is not repeated here.
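A minimal sketch of such an SVM posture classifier using scikit-learn. The feature dimensionality, the number of posture classes and the stand-in training data are assumptions, not values given in the patent.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical training data: each row is a body-parameter vector
# (height, head-to-X distance, foot distances, inclination, angular velocities, ...).
X_train = np.random.rand(200, 9)
y_train = np.random.randint(0, 5, size=200)   # e.g. 5 assumed posture classes

posture_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
posture_clf.fit(X_train, y_train)

# Classify the body parameters computed for one frame.
frame_params = np.random.rand(1, 9)
posture = posture_clf.predict(frame_params)
```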
Based on the above embodiments, in an alternative embodiment of the present invention, step S3 further includes steps S3B1 to S3B4.
S3B1, obtaining a face region in the image sequence through a face detection model.
Specifically, the image sequence acquired in step S2 includes front information of the person. Therefore, the facial region is detected from the image sequence by the face detection model, and then the image of the partial region is extracted, thereby further analyzing the expression information of the person.
Preferably, the face detection model is a YOLOv3 face recognition model. The YOLOv3 face recognition model has good transferability, multi-target recognition capability and small-object recognition capability, and can accurately identify the face region in the image sequence. Training a YOLOv3 face recognition model capable of recognizing faces is a conventional technical means for those skilled in the art and is not described here. In other embodiments, other face recognition models may be used, and the invention is not limited in this regard.
S3B2, obtaining expression information in the face region through an expression recognition model.
Preferably, the expression recognition model is a VGG16 expression classification model. The expression information includes x classes, where the x classes include neutral, serious, panic, curious, surprise, happiness and contempt. Training a VGG16 expression classification model capable of identifying the x classes of expression information is a conventional technical means for those skilled in the art and is not described here. Specifically, after the face region is identified by the YOLOv3 face recognition model, the face region is extracted and input into the VGG16 expression classification model to obtain the expression information of the person in the image sequence. In other embodiments, other expression recognition models may be used, and the invention is not limited in this regard.
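A hedged sketch of this crop-and-classify step, using torchvision's VGG16 with its last layer replaced for 7 expression classes. The face detector itself is abstracted away, and the (x1, y1, x2, y2) box format is an assumption about its output rather than something specified by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

NUM_EXPRESSIONS = 7  # neutral, serious, panic, curious, surprise, happiness, contempt

# VGG16 with its classifier head replaced for the 7 expression classes (untrained here).
vgg = models.vgg16(weights=None)
vgg.classifier[6] = nn.Linear(4096, NUM_EXPRESSIONS)
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def classify_face(frame: Image.Image, box):
    """Crop the face region returned by the detector (box assumed to be
    (x1, y1, x2, y2)) and classify its expression."""
    face = frame.crop(box)
    with torch.no_grad():
        logits = vgg(preprocess(face).unsqueeze(0))
        probs = torch.softmax(logits, dim=1)
    return int(probs.argmax()), probs.squeeze(0)
```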
S3B3, generating an initial expression sequence according to the time sequence of the image sequence and the expression information.
Based on the above embodiment, in an alternative embodiment of the present invention, the step S3B3 specifically includes steps S3B31 and S3B32.
S3B31, generating a time sequence T according to time information of each frame in the image sequence.
S3B32 sorts the expression information according to the time sequence to obtain an initial expression sequence I.
Specifically, the recognition results of the VGG16 expression classification model are ordered according to the time sequence to obtain the initial expression sequence I. Because the time order affects the prediction effect of the prediction model, generating the initial expression sequence in time order provides well-formed input data for the subsequent prediction model, which has good practical significance.
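A small sketch of steps S3B31 and S3B32: per-frame results are ordered by timestamp to form the time sequence T and the initial expression sequence I. The (timestamp, label) tuple format is illustrative.

```python
def build_initial_sequence(frame_results):
    """frame_results: list of (timestamp, expression_label) tuples, possibly
    out of order. Returns the time sequence T and the initial expression sequence I."""
    ordered = sorted(frame_results, key=lambda r: r[0])
    T = [t for t, _ in ordered]
    I = [e for _, e in ordered]
    return T, I

T, I = build_initial_sequence([(0.12, 1), (0.04, 1), (0.08, 3)])
# T == [0.04, 0.08, 0.12], I == [1, 1, 3]
```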
S3B4, correcting through a prediction model according to the initial expression sequence to obtain a facial expression sequence.
Specifically, expression information in the image sequence can be rapidly recognized through the face recognition model and the expression classification model. However, since each frame is recognized independently, individual frames may be recognized incorrectly. To avoid such recognition errors, in this embodiment the recognition result of the VGG16 expression classification model is corrected by an LSTM prediction model, thereby avoiding erroneous expression recognition. The LSTM prediction model is shown in fig. 3; the input length of the LSTM model is n, and the features at each position comprise x values, one for each expression class judged by the VGG16 expression classification model, each value being the probability of the corresponding class. In other embodiments, other existing prediction models may be used, and the invention is not limited in this regard.
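A minimal PyTorch sketch of this kind of sequence-to-sequence LSTM corrector: it takes n per-frame class probability vectors from the classifier and outputs n corrected class predictions. The hidden size and the use of softmax probabilities as input features are assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 7   # the x expression classes
SEQ_LEN = 11      # n, the correction window length chosen in this embodiment

class ExpressionCorrector(nn.Module):
    """LSTM that re-predicts every position of an n-frame expression window."""
    def __init__(self, num_classes=NUM_CLASSES, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(num_classes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):           # x: (batch, SEQ_LEN, num_classes)
        out, _ = self.lstm(x)
        return self.head(out)       # (batch, SEQ_LEN, num_classes) logits

model = ExpressionCorrector()
window = torch.softmax(torch.randn(1, SEQ_LEN, NUM_CLASSES), dim=-1)  # placeholder probabilities
corrected_labels = model(window).argmax(dim=-1)   # one corrected label per frame
```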
Based on the above embodiments, in an alternative embodiment of the present invention, the step S3B4 specifically includes steps S3B41 to S3B43.
S3B41, dividing the initial expression sequence into input sequences of length n. The input length n is 11 frames. In other embodiments, the input length may be another number of frames, and the invention is not limited in this regard.
In particular, changes in facial expression over time tend to be piecewise continuous rather than discrete; that is, the probability of an abrupt change of expression within a short period is low. Therefore, the initial expression sequence is divided into input sequences of length n for analysis, and each sequence is corrected according to how the detection results change within this small neighbourhood, which improves accuracy. To determine the optimal n, the inventors counted the whole sample and found that the duration (in frames) of a single expression segmented by VGG16 falls within the interval [11, 58]; the optimal n was therefore searched for between 10 and 15, and the optimal sequence length n = 11 was finally obtained. In this embodiment, 11 is selected as the correction sequence length of the LSTM, which improves the expression recognition capability.
S3B42, inputting the input sequence into a prediction model to obtain an output sequence with the length of n.
S3B43, obtaining a facial expression sequence according to the output sequence.
Specifically, the recognition results of the VGG16 expression classification model are ordered according to time, and the ordered expression sequence is recorded as I = {i1, i2, ..., in}, where ik denotes the k-th expression in the sequence. The corresponding time sequence is recorded as T = {t1, t2, ..., tn}, where tk is the time at which the k-th expression appears. I and T are input into the LSTM prediction model, which outputs the final classification result (1: neutral, 2: serious, 3: panic, 4: curious, 5: surprise, 6: happiness, 7: contempt), recorded as Y = {y_t1, y_t2, ..., y_tn}, where y_t is the classification result of the frame appearing at time t.
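A small sketch of step S3B41, splitting the initial expression sequence into windows of length n = 11. How a trailing remainder shorter than n is handled is not specified in the patent, so here it is simply kept as a shorter final window.

```python
def split_into_windows(sequence, n=11):
    """Split the initial expression sequence into consecutive windows of length n;
    the trailing remainder (if any) is kept as a final, shorter window."""
    return [sequence[i:i + n] for i in range(0, len(sequence), n)]

windows = split_into_windows(list(range(30)), n=11)
# -> window lengths 11, 11, 8
```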
Based on the above embodiment, in an alternative embodiment of the present invention, step S3 further includes steps S3C1 to S3C3:
S3C1, constructing an object detection model based on the terminal lightweight neural network model.
Specifically, the image sequence contains image information of the entire person, so directly recognizing gestures involves a large amount of computation. In this embodiment, hand images are identified and extracted from the image sequence by constructing an object detection model, which improves recognition speed and accuracy. In this embodiment, the terminal lightweight model is a MnasNet model and the object detection model is an SSD model. In other embodiments, the terminal lightweight model and the object detection model may be other existing models, and the invention is not limited in this regard.
Based on the above embodiment, in an alternative embodiment of the present invention, step S3C1 includes step S3C11.
S3C11, constructing an SSD model with the MnasNet model as the backbone network. As shown in fig. 4, the backbone network sequentially comprises: a Conv layer with a 3x3 convolution kernel, a SepConv layer with a 3x3 convolution kernel, MBConv layers with a 5x5 convolution kernel, MBConv layers with a 3x3 convolution kernel, MBConv layers with a 5x5 convolution kernel, MBConv layers with a 3x3 convolution kernel, and a one-layer pooling or one-layer FC head.
Specifically, the backbone network of the existing SSD model is replaced from the VGG16 convolutional backbone with MnasNet, which reduces the amount of computation in the target detection process and greatly increases the target detection speed, and therefore has good practical significance.
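A simplified PyTorch sketch of a MnasNet-style backbone that exposes the five multi-scale feature maps quoted below (112×112×16 down to 7×7×100) from a 224×224 input. The block order, expansion factors, kernel sizes and channel widths here are illustrative simplifications chosen only to reproduce those scales; they are not the exact MnasNet or patent configuration.

```python
import torch
import torch.nn as nn

class SepConv(nn.Module):
    """Depthwise-separable convolution block."""
    def __init__(self, cin, cout, k=3, s=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(cin, cin, k, s, k // 2, groups=cin, bias=False),
            nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
            nn.Conv2d(cin, cout, 1, bias=False), nn.BatchNorm2d(cout),
        )
    def forward(self, x): return self.block(x)

class MBConv(nn.Module):
    """Inverted-residual (mobile bottleneck) block with expansion factor t."""
    def __init__(self, cin, cout, k=3, s=1, t=3):
        super().__init__()
        mid = cin * t
        self.use_res = (s == 1 and cin == cout)
        self.block = nn.Sequential(
            nn.Conv2d(cin, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, k, s, k // 2, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout),
        )
    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y

class MnasBackbone(nn.Module):
    """Stages sized so the outputs match the five candidate feature maps
    112x112x16, 56x56x24, 28x28x40, 14x14x112 and 7x7x100 used by the SSD heads."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1, bias=False), nn.BatchNorm2d(32),
                          nn.ReLU(inplace=True), SepConv(32, 16, 3)),      # 112x112x16
            MBConv(16, 24, 3, s=2),                                        # 56x56x24
            MBConv(24, 40, 5, s=2),                                        # 28x28x40
            nn.Sequential(MBConv(40, 80, 3, s=2), MBConv(80, 112, 5)),     # 14x14x112
            MBConv(112, 100, 3, s=2),                                      # 7x7x100
        ])
    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats   # the five multi-scale feature maps fed to the SSD heads

feats = MnasBackbone()(torch.randn(1, 3, 224, 224))
# shapes: (1,16,112,112), (1,24,56,56), (1,40,28,28), (1,112,14,14), (1,100,7,7)
```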
S3C2, extracting hand images in the image sequence through the object detection model, and generating the hand image sequence according to the time of the image sequence.
Based on the above embodiments, in an alternative embodiment of the present invention, the step S3C2 includes steps S3C21 to S3C23.
S3C21, as shown in fig. 4, the image sequence is input into the backbone network frame by frame, so that the backbone network convolves the images layer by layer.
S3C22, extracting the five intermediate layers with scales of 112×112×16, 56×56×24, 28×28×40, 14×14×112 and 7×7×100 in the convolution process and performing regression analysis to obtain the region of the hand image.
S3C23, extracting hand images from the images according to the areas.
In this embodiment, the inventors select the five intermediate layers with scales of 112×112×16, 56×56×24, 28×28×40, 14×14×112 and 7×7×100 in MnasNet as candidate regions under the SSD framework, and then perform regression analysis on these candidate regions in the classical SSD manner to obtain the final positioning result.
S3C3, classifying through an image classification model according to the hand image sequence to obtain a gesture sequence.
Because the MnasNet structure is simpler than VGG16 and is not sufficient to complete positioning and classification at the same time, in order to classify gestures the final selected region is cropped out, preprocessed into an image matching the input scale of the VGG16 network, and input into the VGG16 network, which finally completes the hand-gesture classification. Compared with the SSD framework that uses the traditional classical VGG16 network as the backbone, the scheme in which the candidate-region backbone is a MnasNet network and VGG16 is used only to classify the final selected region is lighter, smaller in scale and correspondingly faster. The numbers of parameters of the two models can be compared accordingly.
Specifically, each frame of the image sequence is input into the MnasNet-based SSD network, convolution is carried out through MnasNet, the five intermediate layers with scales of 112×112×16, 56×56×24, 28×28×40, 14×14×112 and 7×7×100 in the convolution process are extracted as candidate regions for the area where the hand is located, and regression analysis is performed on the five candidate regions in the classical SSD manner. The hand region identified by the region with the highest confidence among the five is taken as the candidate region where the hand is located, its position is mapped back onto the original image, the corresponding region is cropped from the original image and fed into VGG16 for hand-gesture classification to obtain the final result.
In the present embodiment, the image classification model is a VGG16 classification model. In other embodiments, the image classification model may be other classification/recognition models, as the invention is not particularly limited in this regard.
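A hedged sketch of the crop-and-classify stage for gestures: the highest-confidence candidate box is mapped back to the original frame, cropped, resized to the VGG16 input scale and classified. The box format, the source of the confidence scores and the number of gesture classes are assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

NUM_GESTURES = 10  # assumed number of gesture classes; not specified in the patent

gesture_vgg = models.vgg16(weights=None)
gesture_vgg.classifier[6] = nn.Linear(4096, NUM_GESTURES)
gesture_vgg.eval()

to_vgg_input = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def classify_hand(frame: Image.Image, candidates):
    """candidates: list of (confidence, (x1, y1, x2, y2)) boxes regressed from the
    five candidate feature maps. The highest-confidence box is cropped from the
    original frame, resized to the VGG16 input scale and classified."""
    conf, box = max(candidates, key=lambda c: c[0])
    hand = frame.crop(box)
    with torch.no_grad():
        logits = gesture_vgg(to_vgg_input(hand).unsqueeze(0))
    return int(logits.argmax(dim=1))
```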
S4, generating corresponding safety alarm levels according to the safety coefficients.
Based on the above embodiment, in an alternative embodiment of the present invention, step S4 includes steps S41 to S43.
S41, according to each safety coefficient, calculating the number of people in the target area whose safety coefficient is smaller than a preset safety coefficient threshold, a first average value of the safety coefficients, and a second average value of the safety coefficients of the adjacent scene.
S42, arranging the number of people, the first average value and the second average value into time sequence characteristics according to time sequence.
S43, according to the time sequence characteristics, prediction is carried out through a prediction model so as to obtain a safety alarm level.
Specifically, a feature vector is first calculated for each individual in the scene. The feature vectors are calculated as in step S3: the body posture, the facial expression sequence and the gesture sequence. Regression analysis is then performed on these feature vectors to calculate the individual's safety coefficient. The safety coefficient of the individual is combined with the feature vectors of the body posture, the facial expression sequence and the gesture sequence to obtain the feature vector of each individual.
The safety coefficient is calculated as follows. First, pictures of individuals in dangerous states are obtained and judged manually, and each picture is labelled with a safety coefficient of 1 to 10. The pictures are then fed into each model of step S3 to calculate three feature vectors V1, V2 and V3 for each picture, and the three feature vectors, the final scene result and the manual score together form a sample. Linear regression analysis is performed on these samples; a scene to be scored is then evaluated with the generated regression function, that is, the feature vectors V1, V2 and V3 of the new scene are input into the function to calculate its safety coefficient.
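A minimal sketch of this safety-coefficient regression using scikit-learn. The feature dimensionality and the random stand-in data are placeholders for the manually scored samples described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training set: each sample concatenates the three feature vectors
# V1 (body posture), V2 (facial expression sequence), V3 (gesture sequence),
# paired with a manually assigned safety score in [1, 10].
V = np.random.rand(500, 3 * 16)            # assumed feature dimensionality
scores = np.random.uniform(1, 10, 500)     # stand-in for the manual annotations

reg = LinearRegression().fit(V, scores)

def safety_coefficient(v1, v2, v3):
    """Predict the safety coefficient of one person in a new scene."""
    x = np.concatenate([v1, v2, v3]).reshape(1, -1)
    return float(reg.predict(x)[0])
```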
After the feature vectors of all the people in the target area are calculated, features such as the number of people whose safety coefficient is below the safety coefficient threshold, the average value of the safety coefficients and the average value of the adjacent scene are calculated from the feature vectors. These features are arranged into time-series features according to acquisition time, the time-series features are fed into an LSTM (long short-term memory) network for safety-level assessment, and an alarm of the corresponding level is raised according to the assessed level and the specific usage scenario. The adjacent-scene threshold is the safety coefficient threshold of the area next to the target area.
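A minimal sketch of the alarm-level stage: the three per-timestamp scene features are stacked into a time series and scored by an LSTM. The hidden size, the number of alarm levels and the threshold value are assumptions.

```python
import torch
import torch.nn as nn

class AlarmLevelModel(nn.Module):
    """LSTM over per-timestamp scene features: (number of people below the
    safety-coefficient threshold, mean coefficient of the target area,
    mean coefficient of the adjacent area)."""
    def __init__(self, num_levels=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(3, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_levels)

    def forward(self, x):             # x: (batch, time, 3)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # alarm-level logits for the latest time step

def scene_features(coeffs, neighbour_coeffs, threshold=5.0):
    """Compute the three features for one timestamp."""
    below = sum(1 for c in coeffs if c < threshold)
    mean_here = sum(coeffs) / max(len(coeffs), 1)
    mean_next = sum(neighbour_coeffs) / max(len(neighbour_coeffs), 1)
    return [float(below), mean_here, mean_next]

# Stack features over two timestamps and score the sequence.
seq = torch.tensor([[scene_features([6.0, 3.5, 7.2], [5.5, 6.1]),
                     scene_features([2.9, 3.1, 7.0], [5.0, 4.8])]])
alarm_level = AlarmLevelModel()(seq).argmax(dim=1)
```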
According to the embodiment of the invention, the image sequence of each person from the angle most suitable for analysis is extracted from the monitoring videos taken at multiple angles, the body posture, the facial expression sequence and the gesture sequence of each person in the target area are then analysed, the safety coefficient of each person is obtained by regression analysis of this information, and an alarm of the corresponding level is generated according to the safety coefficients. No personnel need to be assigned to watch the monitoring in real time, and alarm situations can still be discovered in time, which has good practical significance.
A second embodiment of the present invention provides a security monitoring device, which includes:
The video module 1 is used for receiving a plurality of monitoring videos of different angles of the target area.
And the image module 2 is used for respectively acquiring the image sequences of the people in the target area according to the plurality of monitoring videos.
And the coefficient module 3 is used for acquiring the body posture, the facial expression sequence and the gesture sequence of each person according to the image sequence and carrying out regression analysis to obtain the safety coefficient of each person.
And the level module 4 is used for generating corresponding safety alarm levels according to the safety coefficients.
Optionally, the image module 2 specifically includes:
and the skeleton unit is used for acquiring skeleton information of each person in the target area at different angles through the OpenPose model according to the plurality of monitoring videos.
And the region unit is used for acquiring the image region where each person is located according to the skeleton information. The image area is the area where the image with the largest skeleton area of each person is located.
And the image unit is used for extracting image sequences of all people from the plurality of monitoring videos respectively.
Optionally, the coefficient module 3 includes:
the human body model module 5 is used for acquiring the node data according to the image sequence and establishing a human body skeleton model. The joint points include head, neck, torso, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right ankle, left knee, left ankle, left hip, left knee, and left ankle.
The human body coordinate module 6 is used for establishing a human dynamic coordinate system according to the human skeleton model by taking the torso joint as the origin, the direction from the torso joint to the neck as the Z axis, the direction from the left shoulder to the right shoulder as the X axis, and the facing direction of the body as the Y axis.
And the human body parameter module 7 is used for calculating body parameters after normalizing the coordinates of each joint according to the height according to the human body dynamic coordinate system. Wherein the parameters include height, first distance of head to x-axis, second distance of right foot to x-axis, third distance of left foot to x-axis, body inclination angle, foot angular velocity, shoulder center angular velocity, and moment information.
The human body posture module 8 is used for classifying through the SVM model according to the body parameters so as to obtain the body posture of the person in the image sequence.
Optionally, the coefficient module 3 further includes:
and the region module 9 is used for obtaining the face region in the image sequence through the face detection model.
The expression module 10 is configured to obtain expression information in the face region through the expression recognition model.
The initial module 11 is configured to generate an initial expression sequence according to the time sequence of the image sequence and the expression information.
A final module 12, configured to correct the initial expression sequence through a prediction model to obtain a facial expression sequence.
Optionally, the face detection model is a YOLOv3 face recognition model. The expression recognition model is a VGG16 expression classification model.
Optionally, the expression information includes x classes, where the x classes include neutral, serious, panic, curious, surprise, happiness and contempt.
Optionally, the initial module 11 includes:
and the time unit is used for generating a time sequence T according to the time information of each frame in the image sequence.
The initial unit is used for sequencing the expression information according to the time sequence to obtain an initial expression sequence I.
Optionally, the prediction model is an LSTM model. The LSTM model has an input length of n, and features per unit length include x classes.
Optionally, the final module 12 includes:
and the input unit is used for dividing the input sequence into input sequences with the length of n according to the initial expression sequence.
And an output unit for inputting the input sequence into the prediction model to obtain an output sequence of length n.
And the final unit is used for obtaining the facial expression sequence according to the output sequence.
Optionally, the input length n is 11 frames.
Optionally, the coefficient module 3 further includes:
the model module 13 is used for constructing an object detection model based on the terminal lightweight neural network model.
The detection module 14 is configured to extract hand images in the image sequence through the object detection model, and generate a hand image sequence according to time of the image sequence.
The classifying module 15 is configured to classify the hand image sequence according to the image classification model to obtain a gesture sequence.
Optionally, the terminal lightweight model is a MnasNet model. The object detection model is an SSD model.
Optionally, the model module 13 is specifically configured to:
and constructing an SSD model taking the MnasNet model as a backbone network. Wherein, the backbone network includes in proper order: conv with 3x3 layer convolution kernel, speConv with 3x3 layer convolution kernel, MBConv with 5x5 layer convolution kernel, MBConv with 3x3 layer convolution kernel, MBConv with 5x5 layer convolution kernel, MBConv with 3x3 layer convolution kernel, pooling with 1 layer or FC with 1 layer.
Optionally, the detection module 14 includes:
and the convolution unit is used for inputting the image sequence into the backbone network frame by frame so that the backbone network convolves the images layer by layer.
An analysis unit for extracting the five intermediate layers with scales of 112×112×16, 56×56×24, 28×28×40, 14×14×112 and 7×7×100 in the convolution process and performing regression analysis to obtain the region of the hand image.
And the extraction unit is used for extracting the hand image from the image according to the region.
Optionally, the level module 4 comprises:
And the threshold unit is used for calculating, according to each safety coefficient, the number of people in the target area whose safety coefficient is smaller than a preset safety coefficient threshold, a first average value of the safety coefficients, and a second average value of the safety coefficients of the adjacent scene.
And the time sequence unit is used for arranging the number of people, the first average value and the second average value into time sequence characteristics according to time sequence.
And the level unit is used for predicting through a prediction model according to the time sequence characteristics so as to obtain the safety alarm level.
The third embodiment provides a safety monitoring device, which comprises a processor, a memory and a computer program stored in the memory. The computer program is executable by the processor to implement the security monitoring method as described in the first aspect.
A fourth embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program is executed, controls a device in which the computer-readable storage medium is located to execute the security monitoring method according to the first aspect.
Fifth embodiment: the expression recognition method of this embodiment has the same implementation principle and technical effects as the first embodiment, so it is described only briefly. For matters not mentioned in this embodiment, reference is made to embodiment one.
Referring to fig. 8, an embodiment of the present invention provides an intelligent monitoring expression recognition method, which may be performed by an intelligent monitoring expression recognition device or a security monitoring device. In particular, by one or more processors within the expression recognition device or the security monitoring device to implement at least steps S3B0 to S3B4.
S3B0, acquiring an image sequence. Wherein the image sequence comprises a target person.
S3B1, obtaining a face region in the image sequence through a face detection model.
S3B2, obtaining expression information in the face region through an expression recognition model.
S3B3, generating an initial expression sequence according to the time sequence of the image sequence and the expression information.
S3B4, correcting through a prediction model according to the initial expression sequence to obtain a facial expression sequence.
Optionally, the face detection model is a YOLOv3 face recognition model. The expression recognition model is a VGG16 expression classification model.
Optionally, the expression information includes x classes, where the x classes include neutral, serious, panic, curious, surprise, happiness and contempt.
Optionally, step S3B3 specifically includes:
S3B31, generating a time sequence T according to time information of each frame in the image sequence.
S3B32 sorts the expression information according to the time sequence to obtain an initial expression sequence I.
Optionally, the prediction model is an LSTM model. The LSTM model has an input length of n, and features per unit length include x classes.
Optionally, step S3B4 specifically includes:
S3B41, dividing the initial expression sequence into input sequences of length n.
S3B42, inputting the input sequence into a prediction model to obtain an output sequence with the length of n.
S3B43, obtaining a facial expression sequence according to the output sequence.
Optionally, the input length n is 11 frames.
Optionally, step S3B0 specifically includes steps S1 and S2:
s1, receiving a plurality of monitoring videos of different angles of a target area.
S2, respectively acquiring image sequences of all people in the target area according to the plurality of monitoring videos.
Optionally, step S2 specifically includes:
s21, acquiring skeleton information of each person in the target area at different angles through an OpenPose model according to the plurality of monitoring videos.
S22, acquiring image areas where all people are located according to the skeleton information. The image area is the area where the image with the largest skeleton area of each person is located.
S23, respectively extracting image sequences of all people from a plurality of monitoring videos according to the image areas.
After step S3B4, further comprising:
s3, carrying out regression analysis according to the facial expression sequence to obtain the safety coefficient of each person.
S4, generating corresponding safety alarm levels according to the safety coefficients.
Based on the above embodiment, in an alternative embodiment of the present invention, step S4 includes steps S41 to S43.
S41, according to each safety coefficient, calculating the number of people in the target area whose safety coefficient is smaller than a preset safety coefficient threshold, a first average value of the safety coefficients, and a second average value of the safety coefficients of the adjacent scene.
S42, arranging the number of people, the first average value and the second average value into time sequence characteristics according to time sequence.
S43, according to the time sequence characteristics, prediction is carried out through a prediction model so as to obtain a safety alarm level.
For specific implementation, refer to embodiment one. In this embodiment, in order to save computation and increase recognition speed, the parts concerning the body posture and the gesture sequence are omitted. In other embodiments, only one of the body posture and the gesture sequence may be omitted.
Referring to fig. 9, an embodiment of the present invention provides an intelligent monitoring expression recognition device, which includes:
a sequence module 0, configured to acquire an image sequence. Wherein the image sequence contains images of the person.
And the region module 9 is used for obtaining the face region in the image sequence through the face detection model.
The expression module 10 is configured to obtain expression information in the face region through the expression recognition model.
The initial module 11 is configured to generate an initial expression sequence according to the time sequence of the image sequence and the expression information.
A final module 12, configured to correct the initial expression sequence through a prediction model to obtain a facial expression sequence.
Optionally, the face detection model is a YOLOv3 face recognition model. The expression recognition model is a VGG16 expression classification model.
Optionally, the expression information includes x classes, where the x classes include neutral, serious, panic, curious, surprise, happiness and contempt.
Optionally, the initial module 11 specifically includes:
and the time unit is used for generating a time sequence T according to the time information of each frame in the image sequence.
The initial unit is used for sequencing the expression information according to the time sequence to obtain an initial expression sequence I.
Optionally, the prediction model is an LSTM model. The LSTM model has an input length of n, and features per unit length include x classes.
Optionally, the final module 12 specifically includes:
and the input unit is used for dividing the input sequence into input sequences with the length of n according to the initial expression sequence.
And an output unit for inputting the input sequence into the prediction model to obtain an output sequence of length n.
And the final unit is used for obtaining the facial expression sequence according to the output sequence.
Optionally, the input length n is 11 frames.
Optionally, the sequence module 0 includes a video module and an image module in the first embodiment, which includes:
and the receiving unit is used for receiving a plurality of monitoring videos of different angles of the target area.
And the skeleton unit is used for acquiring skeleton information of each person in the target area at different angles through the OpenPose model according to the plurality of monitoring videos.
And the region unit is used for acquiring the image region where each person is located according to the skeleton information. The image area is the area where the image with the largest skeleton area of each person is located.
And the image unit is used for respectively extracting the image sequences of the people from the plurality of monitoring videos according to the image areas.
Optionally, the sequence module 0 includes a video module and an image module in the first embodiment, which includes:
and the video module is used for receiving a plurality of monitoring videos of different angles of the target area.
And the image module is used for respectively acquiring the image sequences of all the people in the target area according to the plurality of monitoring videos.
Optionally, the image module includes:
and the skeleton unit is used for acquiring skeleton information of each person in the target area at different angles through the OpenPose model according to the plurality of monitoring videos.
And the region unit is used for acquiring the image region where each person is located according to the skeleton information. The image area is the area where the image with the largest skeleton area of each person is located.
And the image unit is used for respectively extracting the image sequences of the people from the plurality of monitoring videos according to the image areas.
The expression recognition apparatus further includes:
and the coefficient module 3 is used for carrying out regression analysis according to the facial expression sequence so as to obtain the safety coefficient of each person.
And the level module 4 is used for generating corresponding safety alarm levels according to the safety coefficients.
Based on the above embodiments, in an alternative embodiment of the present invention, the level module 4 comprises:
And the threshold unit is used for calculating, according to each safety coefficient, the number of people in the target area whose safety coefficient is smaller than a preset safety coefficient threshold, a first average value of the safety coefficients, and a second average value of the safety coefficients of the adjacent scene.
And the time sequence unit is used for arranging the number of people, the first average value and the second average value into time sequence characteristics according to time sequence.
And the level unit is used for predicting through a prediction model according to the time sequence characteristics so as to obtain the safety alarm level.
An embodiment seven of the present invention provides an intelligent monitoring expression recognition apparatus, which includes a processor, a memory, and a computer program stored in the memory. The computer program can be executed by a processor to implement the intelligent monitoring expression recognition method as described in embodiment five.
An eighth embodiment of the present invention provides a computer-readable storage medium including a stored computer program, where the computer program when executed controls a device in which the computer-readable storage medium is located to execute the expression recognition method for intelligent monitoring as described in the fifth embodiment.
The gesture recognition method of the ninth embodiment has the same implementation principle and technical effects as the first embodiment, so it is described only briefly. For matters not mentioned in this embodiment, reference is made to embodiment one.
Referring to fig. 10, an embodiment of the present invention provides a gesture recognition method for intelligent monitoring, which may be performed by a gesture recognition device for intelligent monitoring or a security monitoring device. In particular, by one or more processors within the gesture recognition device or the security monitoring device to implement at least steps S3C0 to S3C3.
S3C0, acquiring an image sequence. Wherein the image sequence comprises a target person.
S3C1, constructing an object detection model based on the terminal lightweight neural network model.
S3C2, extracting hand images in the image sequence through the object detection model, and generating the hand image sequence according to the time of the image sequence.
S3C3, classifying through an image classification model according to the hand image sequence to obtain a gesture sequence.
Optionally, the terminal lightweight model is a MnasNet model. The object detection model is an SSD model.
Optionally, step S3C1 specifically includes:
S3C11, constructing an SSD model with the MnasNet model as the backbone network. The backbone network sequentially comprises: a Conv layer with a 3x3 convolution kernel, a SepConv layer with a 3x3 convolution kernel, MBConv layers with a 5x5 convolution kernel, MBConv layers with a 3x3 convolution kernel, MBConv layers with a 5x5 convolution kernel, MBConv layers with a 3x3 convolution kernel, and a one-layer pooling or one-layer FC head.
Optionally, step S3C2 specifically includes:
S3C21, inputting the image sequence into the backbone network frame by frame, so that the backbone network convolves each image layer by layer.
S3C22, extracting the five intermediate layers with scales of 112×112×16, 56×56×24, 28×28×40, 14×14×112 and 7×7×100 during the convolution, and performing regression analysis on them to obtain the region of the hand image.
S3C23, extracting the hand image from each frame according to the region; a sketch of this intermediate-layer extraction follows.
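Continuing the backbone sketch above, one way to realize steps S3C21 to S3C23 is to register forward hooks on the five stages so that their feature maps are collected while each frame is convolved layer by layer; the stage indices below refer to the nn.Sequential of the previous sketch, and the SSD regression heads themselves are left out.

# Hypothetical sketch: collect the five intermediate feature maps
# (112x112x16, 56x56x24, 28x28x40, 14x14x112, 7x7x100) with forward hooks.
import torch

feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output       # feature map produced by this backbone stage
    return hook

# Stages 1-5 of the backbone sketch above emit the five listed scales.
for idx in range(1, 6):
    backbone[idx].register_forward_hook(save_output(f"stage{idx}"))

frame = torch.randn(1, 3, 224, 224)       # one frame of the image sequence
with torch.no_grad():
    backbone(frame)                       # S3C21: convolve the image layer by layer

# S3C22: SSD regression/classification heads (omitted here) would take
# feature_maps["stage1"] ... feature_maps["stage5"] and regress the hand region;
# S3C23 then crops that region from the original frame.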
Optionally, step S3C0 specifically includes steps S1 and S2:
S1, receiving a plurality of monitoring videos of a target area captured from different angles.
S2, respectively acquiring the image sequence of each person in the target area according to the plurality of monitoring videos.
Optionally, step S2 specifically includes:
S21, acquiring skeleton information of each person in the target area at different angles through an OpenPose model according to the plurality of monitoring videos.
S22, acquiring the image area where each person is located according to the skeleton information; the image area is the area where the image with the largest skeleton area of that person is located.
S23, respectively extracting the image sequence of each person from the plurality of monitoring videos according to the image areas; a cropping sketch follows these steps.
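For steps S21 to S23, the OpenPose skeleton of each camera view can be reduced to a bounding box; the view whose skeleton box has the largest area is selected for each person, and that region is cropped from the corresponding video's frames. The sketch below assumes the 2D keypoints are already available as (x, y, confidence) arrays; the OpenPose call itself and cross-view person association are omitted.

# Hypothetical sketch: pick, per person, the camera view with the largest
# skeleton bounding box and crop that region from the corresponding frames.
import numpy as np
from typing import Dict, List, Tuple

def skeleton_box(keypoints: np.ndarray, min_conf: float = 0.1) -> Tuple[int, int, int, int]:
    # keypoints: (K, 3) array of (x, y, confidence) returned by OpenPose for one person.
    valid = keypoints[keypoints[:, 2] > min_conf]
    return (int(valid[:, 0].min()), int(valid[:, 1].min()),
            int(valid[:, 0].max()), int(valid[:, 1].max()))

def best_view_crops(views: Dict[str, Tuple[np.ndarray, List[np.ndarray]]]) -> List[np.ndarray]:
    # views maps a camera id to (this person's keypoints, the frames of that video).
    def box_area(kps: np.ndarray) -> int:
        x1, y1, x2, y2 = skeleton_box(kps)
        return (x2 - x1) * (y2 - y1)
    cam = max(views, key=lambda c: box_area(views[c][0]))    # S22: largest skeleton area
    kps, frames = views[cam]
    x1, y1, x2, y2 = skeleton_box(kps)
    return [f[y1:y2, x1:x2] for f in frames]                 # S23: the person's image sequence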
After step S3C3, further comprising:
S3, performing regression analysis according to the gesture sequence to obtain the safety coefficient of each person; a regression sketch follows this list of steps.
S4, generating a corresponding safety alarm level according to the safety coefficients.
Based on the above embodiment, in an alternative embodiment of the present invention, step S4 includes steps S41 to S43.
S41, calculating, according to the safety coefficients, the number of people in the target area whose safety coefficient is smaller than a preset safety coefficient threshold, a first average value of the safety coefficients, and a second average value of the safety coefficients of an adjacent scene.
S42, arranging the number of people, the first average value and the second average value into time sequence features in chronological order.
S43, making a prediction through a prediction model according to the time sequence features to obtain the safety alarm level.
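The regression analysis of step S3 (spelled out in more detail in claim 1) fits a linear regression from concatenated feature vectors to manually annotated 1-10 safety coefficients and then applies it to new scenes. In this ninth embodiment only the gesture-sequence features would remain, but the scikit-learn sketch below shows the general three-vector case with toy data; the library choice and feature dimensions are assumptions.

# Hypothetical sketch: linear regression from [V1 | V2 | V3] feature vectors to the
# manually annotated 1-10 safety coefficient, then scoring a new scene.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.random((50, 24))            # 50 annotated samples, 24-dim combined feature (toy)
y_train = rng.uniform(1, 10, 50)          # manual safety-coefficient scores, 1-10 points (toy)

reg = LinearRegression().fit(X_train, y_train)   # the regression function

# Scoring a new scene: concatenate its V1, V2 and V3 and apply the regression function.
v1, v2, v3 = rng.random(8), rng.random(8), rng.random(8)
safety_coefficient = reg.predict(np.concatenate([v1, v2, v3])[None, :])[0]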
For a specific implementation, refer to the first embodiment. In the present embodiment, in order to reduce the amount of computation and improve the recognition speed, the parts concerning the body posture and the facial expression sequence are omitted. In other embodiments, only one of the body posture and the facial expression sequence may be omitted.
Referring to fig. 11, an embodiment of the present invention provides an intelligent monitoring gesture recognition apparatus, which includes:
a sequence module 0, configured to acquire an image sequence. Wherein the image sequence comprises a target person.
The model module 13 is used for constructing an object detection model based on the terminal lightweight neural network model.
The detection module 14 is configured to extract hand images in the image sequence through the object detection model, and generate a hand image sequence according to time of the image sequence.
The classifying module 15 is configured to classify the hand image sequence according to the image classification model to obtain a gesture sequence.
Optionally, the terminal lightweight neural network model is the MnasNet model, and the object detection model is an SSD model.
Optionally, the model module 13 is specifically configured to:
construct an SSD model with the MnasNet model as its backbone network, wherein the backbone network sequentially includes: a Conv layer with a 3×3 convolution kernel, a SepConv layer with a 3×3 convolution kernel, an MBConv layer with a 5×5 convolution kernel, an MBConv layer with a 3×3 convolution kernel, an MBConv layer with a 5×5 convolution kernel, an MBConv layer with a 3×3 convolution kernel, and a single pooling layer or a single FC layer.
Optionally, the detection module 14 includes:
and the convolution unit is used for inputting the image sequence into the backbone network frame by frame so that the backbone network convolves the images layer by layer.
The analysis unit is used for extracting the five intermediate layers with scales of 112×112×16, 56×56×24, 28×28×40, 14×14×112 and 7×7×100 during the convolution and performing regression analysis on them to obtain the region of the hand image.
And the extraction unit is used for extracting the hand image from the image according to the region.
Optionally, the sequence module 0 includes the video module and the image module of the first embodiment, which comprise:
and the receiving unit is used for receiving a plurality of monitoring videos of different angles of the target area.
And the skeleton unit is used for acquiring skeleton information of each person in the target area at different angles through the OpenPose model according to the plurality of monitoring videos.
And the region unit is used for acquiring the image region where each person is located according to the skeleton information. The image area is the area where the image with the largest skeleton area of each person is located.
And the image unit is used for respectively extracting the image sequences of the people from the plurality of monitoring videos according to the image areas.
The gesture recognition apparatus further includes:
The coefficient module 3 is used for performing regression analysis according to the gesture sequence to obtain the safety coefficient of each person.
The level module 4 is used for generating a corresponding safety alarm level according to the safety coefficients.
Based on the above embodiments, in an alternative embodiment of the present invention, the level module 4 includes:
The threshold unit is used for calculating, according to the safety coefficients, the number of people in the target area whose safety coefficient is smaller than a preset safety coefficient threshold, a first average value of the safety coefficients, and a second average value of the safety coefficients of an adjacent scene.
The time sequence unit is used for arranging the number of people, the first average value and the second average value into time sequence features in chronological order.
The level unit is used for making a prediction through a prediction model according to the time sequence features to obtain the safety alarm level.
An eleventh embodiment of the present invention provides an intelligently monitored gesture recognition device, which includes a processor, a memory, and a computer program stored in the memory. The computer program can be executed by the processor to implement the gesture recognition method described in the ninth embodiment.
A twelfth embodiment of the present invention provides a computer-readable storage medium including a stored computer program, where the computer program, when executed, controls a device in which the computer-readable storage medium is located to execute the intelligently monitored gesture recognition method described in the ninth embodiment.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in its protection scope.

Claims (10)

1. The intelligent monitoring expression recognition method is characterized by comprising the following steps of:
acquiring an image sequence; wherein the image sequence comprises a target person;
obtaining a face region in the image sequence through a face detection model;
obtaining expression information in the face region through an expression recognition model;
generating an initial expression sequence according to the time sequence of the image sequence and the expression information;
correcting through a prediction model according to the initial expression sequence to obtain a facial expression sequence;
the acquiring of the image sequence, wherein the image sequence contains the target person, specifically includes:
S1, receiving a plurality of monitoring videos of a target area captured from different angles;
S21, acquiring skeleton information of each person in the target area at different angles through an OpenPose model according to the plurality of monitoring videos;
S22, acquiring an image area where each person is located according to the skeleton information; the image area is the area where the image with the largest skeleton area of that person is located;
S23, respectively extracting the image sequence of each person from the plurality of monitoring videos according to the image areas;
after the correcting through the prediction model according to the initial expression sequence to obtain the facial expression sequence, the method further comprises the following steps:
S3, performing regression analysis according to the facial expression sequence to obtain the safety coefficient of each person;
S41, calculating, according to the safety coefficients, the number of people in the target area whose safety coefficient is smaller than a preset safety coefficient threshold, a first average value of the safety coefficients, and a second average value of the safety coefficients of an adjacent scene; wherein the adjacent scene is a scene of an area beside the target area;
S42, arranging the number of people, the first average value and the second average value into time sequence features in chronological order;
S43, making a prediction through a prediction model according to the time sequence features to obtain a safety alarm level;
the calculation method of the safety coefficient comprises the following steps: firstly, obtaining individual pictures in dangerous states, manually judging the indexes of the pictures, and annotating each picture with a safety coefficient of 1 to 10 points; then, obtaining a body posture feature vector V1, a facial expression sequence feature vector V2 and a gesture sequence feature vector V3 of each picture, combining the three feature vectors, the final scene result and the manual scoring result into a sample, and performing linear regression analysis on these samples to generate a regression function; subsequently, scoring the scene to be scored according to the generated regression function, namely, inputting the feature vectors V1, V2 and V3 of the new scene into the regression function to calculate the safety coefficient.
2. The expression recognition method according to claim 1, wherein the face detection model is a YOLOv3 face recognition model; the expression recognition model is a VGG16 expression classification model;
the expression information comprises x types; wherein the x types include neutral, serious, panic, curiosity, surprise, happiness, and contempt;
the generating of the initial expression sequence according to the time sequence of the image sequence and the expression information specifically comprises:
Generating a time sequence T according to the time information of each frame in the image sequence;
and sequencing the expression information according to the time sequence to obtain the initial expression sequence I.
3. The expression recognition method according to claim 1, wherein the prediction model is an LSTM model; the input length of the LSTM model is n, and the characteristics of unit length comprise x types;
the correcting through the prediction model according to the initial expression sequence to obtain the facial expression sequence of the person specifically comprises the following steps:
dividing the initial expression sequence into input sequences with a length of n;
inputting the input sequence into the prediction model to obtain an output sequence of length n;
and obtaining the facial expression sequence according to the output sequence.
4. The expression recognition method of claim 3, wherein the input length n is 11 frames.
5. An intelligent monitoring expression recognition device, which is characterized by comprising:
the sequence module is used for acquiring an image sequence; wherein the image sequence contains images of people;
the region module is used for obtaining a face region in the image sequence through a face detection model;
The expression module is used for obtaining expression information in the face area through an expression recognition model;
the initial module is used for generating an initial expression sequence according to the time sequence of the image sequence and the expression information;
the final module is used for correcting through a prediction model according to the initial expression sequence so as to obtain a facial expression sequence;
the sequence module comprises:
the receiving unit is used for receiving a plurality of monitoring videos of different angles of the target area;
the framework unit is used for acquiring framework information of different angles of each person in the target area through an OpenPose model according to the plurality of monitoring videos;
the area unit is used for acquiring an image area where each person is located according to the skeleton information; the image area is the area where the image with the largest skeleton area of each person is located;
the image unit is used for respectively extracting image sequences of all people from a plurality of monitoring videos according to the image areas;
the expression recognition apparatus further includes:
the coefficient module is used for carrying out regression analysis according to the facial expression sequence so as to obtain the safety coefficient of each person;
the level module is used for generating corresponding safety alarm levels according to the safety coefficients;
the level module comprises:
the threshold unit is used for calculating, according to the safety coefficients, the number of people in the target area whose safety coefficient is smaller than a preset safety coefficient threshold, a first average value of the safety coefficients, and a second average value of the safety coefficients of an adjacent scene; wherein the adjacent scene is a scene of an area beside the target area;
the time sequence unit is used for arranging the number of people, the first average value and the second average value into time sequence characteristics according to time sequence;
the level unit is used for predicting through a prediction model according to the time sequence characteristics so as to obtain a safety alarm level;
the calculation method of the safety coefficient comprises the following steps: firstly, obtaining individual pictures in dangerous states, manually judging the indexes of the pictures, and annotating each picture with a safety coefficient of 1 to 10 points; then, obtaining a body posture feature vector V1, a facial expression sequence feature vector V2 and a gesture sequence feature vector V3 of each picture, combining the three feature vectors, the final scene result and the manual scoring result into a sample, and performing linear regression analysis on these samples to generate a regression function; subsequently, scoring the scene to be scored according to the generated regression function, namely, inputting the feature vectors V1, V2 and V3 of the new scene into the regression function to calculate the safety coefficient.
6. The expression recognition device of claim 5, wherein the face detection model is a YOLOv3 face recognition model; the expression recognition model is a VGG16 expression classification model;
the expression information comprises x types; wherein the x types include neutral, serious, panic, curiosity, surprise, happiness, and contempt;
the initial module specifically comprises:
a time unit, configured to generate a time sequence T according to time information of each frame in the image sequence;
and the initial unit is used for sequencing the expression information according to the time sequence so as to obtain the initial expression sequence I.
7. The expression recognition device of claim 5, wherein the predictive model is an LSTM model; the input length of the LSTM model is n, and the characteristics of unit length comprise x types;
the final module specifically comprises:
an input unit for dividing the initial expression sequence into input sequences with the length of n;
an output unit for inputting the input sequence to the prediction model to obtain an output sequence of length n;
and a final unit, configured to obtain the facial expression sequence according to the output sequence.
8. The expression recognition apparatus of claim 7, wherein the input length n is 11 frames.
9. The intelligent monitoring expression recognition device is characterized by comprising a processor, a memory and a computer program stored in the memory; the computer program being executable by the processor to implement the intelligent monitoring expression recognition method of any one of claims 1 to 4.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program when run controls a device in which the computer readable storage medium is located to perform the intelligent monitoring expression recognition method according to any one of claims 1 to 4.