CN110598587B - Expression recognition network training method, system, medium and terminal combined with weak supervision

Info

Publication number
CN110598587B
CN110598587B (application CN201910795777.4A)
Authority
CN
China
Prior art keywords
network
feature map
expression
classification
training
Prior art date
Legal status
Active
Application number
CN201910795777.4A
Other languages
Chinese (zh)
Other versions
CN110598587A
Inventor
袁德胜
游浩泉
王作辉
王海涛
姚磊
杨进参
张宏俊
吴贺丰
余明静
Current Assignee
Winner Technology Co ltd
Original Assignee
Winner Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Winner Technology Co ltd filed Critical Winner Technology Co ltd
Priority to CN201910795777.4A
Publication of CN110598587A
Application granted
Publication of CN110598587B
Legal status: Active

Classifications

    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Neural networks; combinations of networks
    • G06N 3/088 Learning methods; non-supervised learning, e.g. competitive learning
    • G06V 40/165 Human faces; detection, localisation, normalisation using facial parts and geometric relationships
    • G06V 40/168 Human faces; feature extraction, face representation
    • G06V 40/172 Human faces; classification, e.g. identification
    • G06V 40/174 Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an expression recognition network training method, system, medium and terminal combined with weak supervision. The expression recognition network comprises a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network and a classification sub-network. The method comprises the following steps: training the feature map extraction network; and training the feature map matching sub-network and the classification sub-network. According to the invention, a facial expression feature map is introduced for weakly supervised learning, so that the accuracy and robustness of facial expression recognition and classification are greatly improved; the method can adapt to expression recognition of facial images in various scenes such as different angles, distortion and occlusion, achieves multi-task learning of the expression recognition network, and improves the accuracy of facial expression recognition. By recognizing facial expressions, customer satisfaction can be analyzed, driver fatigue can be detected, or psychological treatment can be assisted.

Description

Expression recognition network training method, system, medium and terminal combined with weak supervision
Technical Field
The invention belongs to the technical field of facial expression recognition, and particularly relates to an expression recognition network training method, system, medium and terminal combined with weak supervision.
Background
Facial expression is an important means of expressing emotion and communicating information, and Facial Expression Recognition has important significance in many human-computer interaction systems, such as social robots, driver fatigue detection, customer satisfaction detection, medical treatment and the like. In 1971, the psychologists Ekman and Friesen first proposed that humans have six basic emotions, each reflecting a unique psychological activity through a unique expression, namely anger, happiness, sadness, surprise, disgust and fear. Like other computer vision problems, facial expression recognition also faces many challenges, such as head-pose deviation, occlusion, the fine granularity of the categories to which expression images belong, and the subtle differences between classes.
The existing facial expression recognition methods are mainly divided into the following methods:
1) Expression recognition based on appearance features: appearance features are pixel-based features computed from the entire face image; commonly used appearance features include Gabor textures, LBP (Local Binary Patterns) and HOG (Histogram of Oriented Gradients). Appearance features can capture texture information of the human face.
2) Expression recognition based on geometric features: geometric features generally represent changes in the face structure; facial landmark points are detected from the face region, and face features and expression features are extracted from the geometric relations between the landmark points.
3) Expression recognition based on CNNs (Convolutional Neural Networks): CNNs are highly robust to changes in face position and scale. Bargal et al. concatenate the features learned by different CNNs into a feature vector describing the input image; Zhang et al. propose a multi-model MSCNN that trains the tasks of facial expression recognition and face verification simultaneously.
The existing facial expression recognition method has the following defects:
With appearance-feature-based methods, the features of the face cannot be captured efficiently when the face angle or posture changes.
With geometric-feature-based methods, false detection of facial landmark points can reduce the accuracy with which the model recognizes facial expressions.
Compared with traditional methods, CNN-based methods are greatly improved: they have stronger descriptive power than handcrafted features and can effectively capture facial expression features. However, because the differences between expressions are small, the discriminative power of existing methods over expressions still needs to be improved, so accurate classification cannot yet be achieved.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention aims to provide an expression recognition network training method, system, medium and terminal combined with weak supervision which, by introducing a feature map obtained by extracting facial expression information, can realize accurate recognition of facial expressions, capture small local differences between expression categories, and thereby achieve accurate classification.
In order to achieve the above objects and other related objects, the present invention provides a method for training an expression recognition network in combination with weak supervision, wherein the expression recognition network comprises a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network and a classification sub-network; the method comprises the following steps: training the feature map extraction network; the training step comprises: training the input facial expression image by using the feature map extraction network to form an expression feature map of a specified expression, an expression feature map of a non-specified expression and classification prediction probabilities respectively corresponding to the specified expression and the non-specified expression; performing loss calculation according to the classification prediction probabilities of the specified expressions and the non-specified expressions to obtain the loss degree of the feature map extraction network; training the feature map matching sub-network and the classification sub-network; the training step comprises: inputting the facial expression image into the feature extraction sub-network to obtain low-level spatial features for representing image attributes; inputting the low-level spatial features and the expression feature map of the designated expression into the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network; performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network; meanwhile, the low-level spatial features and the result of pre-classifying the facial expression images are input into the classifying sub-network so as to obtain a classifying result after classification training of the classifying sub-network; and performing difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
In an embodiment of the present invention, the input facial expression images include an expression image of a specified expression and an expression image of a non-specified expression, which are manually distinguished; training the feature map extraction network, wherein the training step comprises the following steps: obtaining the classification prediction probability of the specified expression and the non-specified expression through the feature map extraction network; performing difference calculation on the classification prediction probability of the specified expression and the non-specified expression and the classification real probability of the specified expression and the non-specified expression which are manually distinguished to obtain the loss degree of the feature map extraction network; and circularly executing the steps until the loss degree of the characteristic diagram extraction network is not reduced any more, stopping training, and selecting the characteristic diagram extraction network corresponding to the minimum loss degree as the optimal characteristic diagram extraction network.
In an embodiment of the present invention, the loss degree of the feature map extraction network is calculated by using a cross entropy loss function, and the calculation formula is:
L((Q_a, Q_b), (P_a, P_b)) = -(Q_a log P_a + Q_b log P_b)
wherein L((Q_a, Q_b), (P_a, P_b)) is the loss degree, Q_a and Q_b are the classification true probabilities of the specified expression and the non-specified expression respectively, and P_a and P_b are the classification prediction probabilities of the specified expression and the non-specified expression respectively.
In an embodiment of the present invention, the feature map matching sub-network is trained, and the training step includes: inputting the low-level spatial features and the expression feature map of the specified expression into the feature map matching sub-network; learning the low-level spatial features by using the expression feature map by using the feature map matching sub-network so as to obtain a feature map subjected to matching training of the feature map matching sub-network; performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network; and circularly executing the steps until the loss degree of the feature map matching sub-network is not reduced any more, stopping training, and selecting the feature map matching sub-network corresponding to the minimum loss degree as the optimal feature map matching sub-network.
In an embodiment of the invention, the calculation formula of the loss degree of the feature map matching sub-network is:
L(Θ) = Σ_{i=1..N} ||F(X_i; Θ) - F_i||^2
wherein L(Θ) is the loss degree, F(X_i; Θ) is the feature map obtained after matching training for the i-th facial expression image, X_i is the i-th facial expression image, F_i is the corresponding expression feature map of the specified expression, i takes values from 1 to N, and N is the preset number of collected facial expression images.
In an embodiment of the present invention, the classification subnetwork is trained, and the training step includes: inputting the low-level spatial features and the result of pre-classifying the facial expression images into the classifying sub-network; learning the low-level spatial features by using the result of the pre-classification by using the classification subnetwork to obtain a classification result after classification training of the classification subnetwork; performing difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network; and circularly executing the steps until the loss degree of the classification sub-network is not reduced any more, stopping training, and selecting the classification sub-network corresponding to the minimum loss degree as the optimal classification sub-network.
In an embodiment of the present invention, before performing the difference calculation, regression processing is performed on the classification result after the classification training and the result of the pre-classification respectively, the classification result after the classification training and the result of the pre-classification are converted into a prediction probability and a true probability respectively, difference calculation is performed based on the prediction probability and the true probability, and a loss degree of the classification sub-network is obtained, where the loss degree is calculated by using a cross entropy loss function, and a calculation formula is:
L(M, N) = -M log N
where L (M, N) is the loss, M is the true probability, and N is the prediction probability.
The invention provides an expression recognition network training system combined with weak supervision, wherein the expression recognition network comprises a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network and a classification sub-network; the system comprises: a first training module and a second training module; the first training module is used for training the feature map extraction network; the training step comprises the steps of training an input facial expression image by utilizing the feature map extraction network to form an expression feature map of a specified expression, an expression feature map of a non-specified expression and classification prediction probabilities respectively corresponding to the specified expression and the non-specified expression; performing loss calculation according to the classification prediction probabilities of the specified expressions and the non-specified expressions to obtain the loss degree of the feature map extraction network; the second training module is used for training the feature map matching sub-network and the classification sub-network; the training step comprises inputting the facial expression images into the feature extraction sub-network to obtain low-level spatial features representing image attributes; inputting the low-level spatial features and the expression feature map of the designated expression into the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network; performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network; meanwhile, the low-level spatial features and the result of pre-classifying the facial expression images are input into the classifying sub-network so as to obtain a classifying result after classification training of the classifying sub-network; and carrying out difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
The present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described expression recognition network training method in conjunction with weak supervision.
The present invention provides a terminal, including: a processor and a memory; the memory is used for storing a computer program; the processor is used for executing the computer program stored in the memory so as to enable the terminal to execute the expression recognition network training method combined with weak supervision.
As described above, the expression recognition network training method, system, medium and terminal combined with weak supervision according to the present invention have the following beneficial effects:
(1) Weakly supervised learning is performed by fusing the facial expression feature map, so that features that make each type of expression discriminative and distinguishable from other types can be captured; each type of expression is expressed by a feature map, the extracted features are fine-grained, and the accuracy and robustness of facial expression recognition and classification are greatly improved.
(2) The method can adapt to expression recognition of facial images in various scenes such as different angles, distortion, occlusion and the like.
(3) The feature map is matched and trained by adopting the feature map matching sub-network, the collected facial expressions are predicted and classified by adopting the classification sub-network, and the two networks are fused together, so that the purpose of multi-task learning of the expression recognition network is achieved, and the accuracy of facial expression recognition is improved.
(4) By recognizing the facial expressions, customer satisfaction can be analyzed, fatigue detection can be performed on the driver, or psychotherapy can be performed.
Drawings
FIG. 1 is a flowchart illustrating an expression recognition network training method incorporating weak supervision according to an embodiment of the present invention.
FIG. 2 is a flow chart illustrating training of a feature extraction network according to an embodiment of the present invention.
FIG. 3 is a flow chart illustrating training a feature matching sub-network according to an embodiment of the present invention.
FIG. 4 is a flow chart illustrating training of a classification sub-network in accordance with an embodiment of the present invention.
FIG. 5 is a schematic diagram illustrating an embodiment of an expression recognition network training system with weak supervision according to the present invention.
Fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the invention.
Description of the element reference
51 first training module
52 second training module
61 processor
62 memory
S1-S2 Steps for training the expression recognition network
S201-S203 Steps for training the feature map extraction network
S301-S304 Steps for training the feature map matching sub-network
S401-S404 Steps for training the classification sub-network
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
According to the expression recognition network training method, system, medium and terminal combined with weak supervision, the facial expression feature map is fused for weakly supervised learning, so that features that make each type of expression discriminative and distinguishable from other types can be captured; each type of expression is expressed by a feature map, the extracted features are fine-grained, and the accuracy and robustness of facial expression recognition and classification are greatly improved. The method can adapt to expression recognition of facial images in various scenes such as different angles, distortion and occlusion. The feature map matching sub-network matches and trains the feature map, the classification sub-network predicts and classifies the collected facial expressions, and the two networks are fused together, so that multi-task learning of the expression recognition network is achieved and the accuracy of facial expression recognition is improved. By recognizing facial expressions, customer satisfaction can be analyzed, driver fatigue can be detected, or psychological treatment can be assisted.
Example one
The embodiment provides an expression recognition network training method combined with weak supervision, wherein the expression recognition network comprises a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network and a classification sub-network; the method comprises the following steps:
training the feature map extraction network; the training step comprises:
training the input facial expression image by using the feature map extraction network to form an expression feature map of a specified expression, an expression feature map of a non-specified expression and classification prediction probabilities respectively corresponding to the specified expression and the non-specified expression;
and performing loss calculation according to the classification prediction probabilities of the specified expressions and the non-specified expressions to obtain the loss degree of the feature map extraction network.
Training the feature map matching sub-network and the classification sub-network; the training step comprises:
inputting the facial expression image into the feature extraction sub-network to obtain low-level spatial features for representing image attributes;
inputting the low-level spatial features and the expression feature map of the designated expression into the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network;
performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network;
meanwhile, the low-level spatial features and the result of pre-classifying the facial expression images are input into the classifying sub-network so as to obtain a classifying result after classification training of the classifying sub-network;
and carrying out difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
The expression recognition network training method with weak supervision according to the present embodiment will be described in detail below with reference to fig. 1 to 4.
In this embodiment, the expression recognition network includes a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network, and a classification sub-network.
Specifically, the feature map is introduced in the process of facial expression recognition through training of the feature map extraction network and the feature map matching sub-network, and then the feature extraction sub-network and the classification sub-network are combined to realize the classification recognition of the facial expression.
In the implementation, information such as human face characteristic point detection, human face verification and the like can be added to the expression recognition network to help the learning of the expression recognition network.
Please refer to fig. 1, which is a flowchart illustrating an expression recognition network training method combined with weak supervision according to an embodiment of the present invention. As shown in fig. 1, the method for training an expression recognition network in combination with weak supervision of the present invention includes the following steps:
and S1, training the feature diagram extraction network. The training step comprises the steps of training an input facial expression image by utilizing the feature map extraction network to form an expression feature map of a specified expression, an expression feature map of a non-specified expression and classification prediction probabilities respectively corresponding to the specified expression and the non-specified expression; and performing loss calculation according to the classification prediction probabilities of the specified expressions and the non-specified expressions to obtain the loss degree of the feature map extraction network.
Further, the acquiring of the facial expression image comprises the following steps:
(11) and collecting human body pictures.
(12) And extracting a face region image from the human body image, and carrying out face correction processing on the face region image.
It should be noted that a dlib tool is adopted to extract the face region image from the human body image, and the dlib tool can detect face information in the human body image and cut out the face region.
(13) And carrying out image preprocessing on the corrected face region image to form the face expression image.
It should be noted that the image preprocessing operation includes: scaling the face region image to a preset size H x W (H represents the image length, and W represents the image width); the face region image is normalized (the mean is subtracted from the face region image and divided by the variance).
Image standardization centers the data by removing the mean; according to convex optimization theory and knowledge of data probability distributions, centered data better conform to the data distribution law, so a better generalization effect after training is more easily obtained. The most common method for normalizing an image is max-min normalization, which does not change the information stored in the image but converts the pixel values from the previous range of 0-255 to 0-1, which is very convenient for network processing. When the image is preprocessed, generally only one of the two, i.e., either standardization or normalization, is selected; the two are not used at the same time.
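As an illustrative sketch only (not part of the patent text), steps (12)-(13) could be realized with the dlib frontal face detector and OpenCV roughly as follows; the preset size H x W, the margin-free crop, and the use of the standard deviation in place of "variance" are assumptions.

```python
import cv2
import dlib
import numpy as np

H, W = 224, 224  # assumed preset size; the text only specifies H x W
detector = dlib.get_frontal_face_detector()  # HOG-based face detector shipped with dlib

def extract_face_region(image):
    """Step (12): detect the largest face in a human body picture and cut out the face region."""
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    dets = detector(rgb, 1)  # upsample once to catch smaller faces
    if len(dets) == 0:
        return None
    d = max(dets, key=lambda r: r.width() * r.height())
    top, bottom = max(d.top(), 0), min(d.bottom(), image.shape[0])
    left, right = max(d.left(), 0), min(d.right(), image.shape[1])
    return image[top:bottom, left:right]

def preprocess(face_region, mode="standardize"):
    """Step (13): scale to H x W and apply exactly one of standardization or min-max normalization."""
    img = cv2.resize(face_region, (W, H)).astype(np.float32)
    if mode == "standardize":
        # "subtract the mean and divide" -- conventional standardization divides by the standard deviation
        return (img - img.mean()) / (img.std() + 1e-8)
    return img / 255.0  # min-max: pixel values 0-255 mapped to 0-1
```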
In this implementation, the input facial expression images include an expression image of a specified expression and an expression image of a non-specified expression, which are manually distinguished.
Specifically, collected facial expression images are manually classified into two types, one type is an expression image corresponding to a specified expression, the other type is an expression image corresponding to a non-specified expression, and then the two classified expression images are input into the feature map extraction network.
It should be noted that the specified expression is any one of the expressions such as anger, happiness, sadness, surprise, disgust and fear. When training the feature map extraction network, a certain expression is taken as the specified expression and the facial expression images corresponding to it are input into the feature map extraction network; the other expressions are taken as non-specified expressions and their facial expression images are also input into the feature map extraction network, so that the feature map extraction network is trained on the specified expression and obtains the expression feature map corresponding to it. If the feature map of another expression is to be obtained, that expression is taken as the specified expression and the remaining expressions as non-specified expressions, which is not repeated herein.
Furthermore, the non-specified expression can be replaced by a non-expressive facial image, and the feature information that the specified expression is distinguished from the non-expressive expression can be obtained through the feature map extraction network, so that the interference of the face irrelevant information on the extraction of the expression features is reduced.
In this implementation, the feature map extraction network is trained.
The feature map extraction network is used for distinguishing specified expressions and non-specified expressions in the facial expression images input to the feature map extraction network so as to generate expression feature maps corresponding to the specified expressions and the non-specified expressions and classification prediction probabilities corresponding to the specified expressions and the non-specified expressions.
Referring to fig. 2, a flow chart illustrating training of a feature map extraction network according to an embodiment of the invention is shown. As shown in fig. 2, the S1 specifically includes the following steps:
s201, obtaining the classification prediction probability of the specified expression and the non-specified expression through the feature map extraction network.
It should be noted that the structure of the feature map extraction network is as follows:
VGG-16
Conv-2-1
Global Average Pooling
Conv-2-1 represents a convolution layer with 2 channels and a convolution kernel size of 1, used for outputting the expression feature map of the specified expression and the expression feature map of the non-specified expression; Global Average Pooling represents a global average pooling operation (averaging over all spatial positions), used for outputting the classification confidences corresponding to the specified and non-specified expressions. The classification confidence represents the classification prediction probability generated after the feature map extraction network is trained on the expression images of the specified expression and the expression images of the non-specified expression.
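Read literally, the structure above is a VGG-16 backbone followed by a 1x1 convolution with 2 output channels and global average pooling. A rough PyTorch sketch under that reading is given below; every choice beyond the three listed layers (weight loading, flattening) is an assumption for illustration.

```python
import torch.nn as nn
from torchvision.models import vgg16

class FeatureMapExtractionNet(nn.Module):
    """VGG-16 features -> Conv-2-1 (two expression feature maps) -> global average pooling (two class scores)."""
    def __init__(self):
        super().__init__()
        self.backbone = vgg16(weights=None).features   # convolutional part of VGG-16 (512 output channels)
        self.conv = nn.Conv2d(512, 2, kernel_size=1)   # Conv-2-1: 2 channels, kernel size 1
        self.gap = nn.AdaptiveAvgPool2d(1)             # global average pooling

    def forward(self, x):
        feats = self.backbone(x)
        maps = self.conv(feats)             # expression feature maps for specified / non-specified expression
        scores = self.gap(maps).flatten(1)  # per-class classification confidences
        return maps, scores
```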
S202, carrying out difference calculation on the classification prediction probability of the specified expression and the non-specified expression and the classification real probability of the specified expression and the non-specified expression which are manually distinguished so as to obtain the loss degree of the feature map extraction network.
It should be noted that the classification true probability is a probability generated by artificially classifying the collected facial expression images into expression images of specified expressions and expression images of non-specified expressions.
Specifically, difference calculation is carried out between the classification prediction probability and the classification true probability, and the loss degree of the feature map extraction network is obtained from the result of this difference calculation.
In this implementation, the loss degree of the feature map extraction network is calculated by using a cross entropy loss function, and the calculation formula is as follows:
L((Q_a, Q_b), (P_a, P_b)) = -(Q_a log P_a + Q_b log P_b)
wherein L((Q_a, Q_b), (P_a, P_b)) is the loss degree, Q_a and Q_b are the classification true probabilities of the specified expression and the non-specified expression respectively, and P_a and P_b are the classification prediction probabilities of the specified expression and the non-specified expression respectively.
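A small sketch of this two-class cross-entropy, taking softmax over the pooled class scores as the prediction probabilities (the softmax step is an assumption; the patent only states the loss formula):

```python
import torch.nn.functional as F

def feature_map_extraction_loss(scores, is_specified):
    """scores: (batch, 2) pooled class scores; is_specified: (batch,) tensor of 0/1 manual labels.
    Equivalent to L = -(Q_a*log P_a + Q_b*log P_b) with one-hot true probabilities Q."""
    return F.cross_entropy(scores, is_specified.long())  # applies log-softmax internally
```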
S203, the steps are executed in a circulating mode until the loss degree of the feature map extraction network is not reduced any more, training is stopped, and the feature map extraction network corresponding to the minimum loss degree value is selected as the optimal feature map extraction network.
It should be noted that the optimizer of the feature map extraction network adopts Adam; the loss degree of the feature map extraction network decreases with every gradient back-propagation, and training is stopped once the loss degree no longer decreases.
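A schematic training loop matching this description (Adam optimizer, stop when the loss no longer decreases, keep the checkpoint with minimum loss); the patience logic, learning rate and other hyper-parameters are illustrative assumptions.

```python
import copy
import torch

def train_until_converged(model, loss_fn, data_loader, lr=1e-4, patience=3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    while bad_epochs < patience:  # stop once the loss has not decreased for `patience` epochs
        epoch_loss = 0.0
        for images, labels in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()       # gradient back-propagation
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss < best_loss:
            best_loss, best_state, bad_epochs = epoch_loss, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
    model.load_state_dict(best_state)  # the network with the minimum loss degree is kept as the optimal one
    return model
```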
And performing expression feature extraction on the acquired facial expression image by using the optimal feature map extraction network to obtain an expression feature map of the designated expression corresponding to the facial expression image.
Furthermore, expression feature extraction can be performed on the preset number of collected facial expression images of the specified expression through the optimal feature map extraction network to obtain the expression feature maps of all these images; the average over these expression feature maps is taken to obtain the final feature map of the specified expression, and the calculation formula is as follows:
F_A = (1/N) Σ_{i=1..N} F_i
wherein N is the preset number of the collected facial expression images, F_i is the expression feature map corresponding to the i-th facial expression image, and F_A is the final feature map representing the specified expression.
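A one-line sketch of this averaging step, assuming the per-image expression feature maps have already been stacked into a single tensor:

```python
import torch

def final_feature_map(per_image_maps):
    """per_image_maps: (N, H', W') expression feature maps of the specified expression for N images.
    Returns F_A = (1/N) * sum_i F_i."""
    return per_image_maps.mean(dim=0)
```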
And S2, training the feature map matching sub-network and the classification sub-network. The training step comprises inputting the facial expression images into the feature extraction sub-network to obtain low-level spatial features representing image attributes; inputting the low-level spatial features and the expression feature map of the designated expression into the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network; performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network; meanwhile, the low-level spatial features and the result of pre-classifying the facial expression images are input into the classifying sub-network so as to obtain a classifying result after classification training of the classifying sub-network; and carrying out difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
Specifically, in S2 a multi-task network is used: feature map matching is performed by the feature map matching sub-network, the facial expression is predicted by the classification sub-network, and the two networks are organically merged.
It should be noted that the low-level spatial features refer to edge information, corner information, and texture information in an image.
In this implementation, the feature map matching sub-network has the following structure:
[Table: layer-by-layer structure of the feature map matching sub-network, given as an image in the original publication]
the classification sub-network adopts the following structure:
Conv-256-1
Conv-128-1
Conv-64-1
FC-2048
FC-512
FC-6
here, for the sake of a brief description of the network structure, all convolutional layers are expressed as Conv-number of channels-convolutional kernel size, all convolutional layers use a padding method to keep the size of the network input and output consistent, and the fully-connected layers are expressed as FC-number of nodes.
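Using the Conv-channels-kernel / FC-nodes notation explained above, the classification sub-network table can be read as the following rough PyTorch sketch; the input channel count of the low-level spatial features, their spatial size, the activations and the flattening step are assumptions, since they are not given in the text.

```python
import torch.nn as nn

class ClassificationSubNetwork(nn.Module):
    """Conv-256-1, Conv-128-1, Conv-64-1 followed by FC-2048, FC-512, FC-6 (six expression classes)."""
    def __init__(self, in_channels=64, feat_hw=28):  # assumed shape of the low-level spatial features
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=1), nn.ReLU(inplace=True),
        )
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * feat_hw * feat_hw, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 6),  # one output per basic expression
        )

    def forward(self, low_level_features):
        return self.fcs(self.convs(low_level_features))
```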
In this implementation, the feature map matching sub-network is trained.
The feature map matching sub-network is used for matching and learning the feature map obtained through the training of the feature map extraction network, so that the feature map generated through the training of the feature map matching sub-network is similar to the feature map obtained through the training of the feature map extraction network as much as possible, and the difference between the feature map generated through the training of the feature map extraction network and the feature map generated through the training of the feature map matching sub-network is reduced.
It should be noted that through the weak supervised learning of the feature map, the features that each type of expression has discriminability and distinguishability with other types of expressions can be captured, and each type of expression is expressed by using the feature map, so that the extracted features have finer granularity, and the accuracy and robustness of facial expression recognition and classification are greatly improved.
Weakly supervised learning refers to learning from a data set whose labels are unreliable: for a sample (x, y), the label y of x may be unreliable, where "unreliable" covers incorrect labels, multiple labels, insufficient labels, partial (local) labels, etc. Learning problems with incomplete or ambiguous supervision information are collectively referred to as weakly supervised learning.
Referring now to FIG. 3, therein is shown a flow chart of an embodiment of the present invention for training a feature map matching sub-network. As shown in fig. 3, the training step includes:
s301, inputting the low-level spatial features and the expression feature map of the specified expression into the feature map matching sub-network.
Specifically, the expression feature map is used as a piece of supervision information of the specified expression, and is input into a feature map matching sub-network together with the low-level spatial features. The expression characteristic graph is used for capturing fine-grained information, extracting high-level semantics of the expression image of the face, adding the characteristic graph as supervision information and well finishing the task of expression recognition.
S302, learning the low-level spatial features by the expression feature map by using the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network.
And S303, carrying out difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network.
It should be noted that the loss degree of the feature map matching sub-network is used to characterize the difference between the feature map after the matching training and the expression feature map, and a smaller difference indicates that the two feature maps are more similar.
In this embodiment, the calculation formula of the loss degree of the feature map matching sub-network is:
L(Θ) = Σ_{i=1..N} ||F(X_i; Θ) - F_i||^2
wherein L(Θ) is the loss degree, F(X_i; Θ) is the feature map obtained after matching training for the i-th facial expression image, X_i is the i-th facial expression image, F_i is the corresponding expression feature map, i takes values from 1 to N, and N is the preset number of collected facial expression images.
Specifically, the Euclidean distance between two points x and y in n-dimensional space is calculated as
d(x, y) = sqrt(Σ_{k=1..n} (x_k - y_k)^2)
The specific calculation method is as follows: the pixels of the expression feature map are subtracted one by one from the pixels of the matched and trained feature map, and the squared differences are added up. Thus, the loss formula can be written equivalently as
L(Θ) = Σ_{i=1..N} Σ_p (F(X_i; Θ)_p - (F_i)_p)^2
where p indexes the pixels of the feature maps.
it should be noted that the feature map matching sub-network measures the loss degree of the feature map matching sub-network by using the euclidean distance between the feature map after the matching training and the expression feature map. Euclidean distance is a commonly used definition of distance, referring to the true distance between two points in n-dimensional space, or the natural length of a vector (i.e., the distance of the point from the origin). The euclidean distance in two and three dimensions is the actual distance between two points.
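A compact sketch of this matching loss: the sum of squared pixel-wise differences between the matched-trained feature maps and the expression feature maps, summed over the batch (batch handling is an assumption):

```python
def feature_map_matching_loss(predicted_maps, target_maps):
    """predicted_maps: F(X_i; Θ) for a batch of images; target_maps: the expression feature maps F_i.
    Returns the summed squared Euclidean distance between corresponding feature maps."""
    return ((predicted_maps - target_maps) ** 2).sum()
```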
S304, the steps are executed in a circulating mode until the loss degree of the feature map matching sub-network is not reduced any more, training is stopped, and the feature map matching sub-network corresponding to the minimum loss degree is selected as the optimal feature map matching sub-network.
In this implementation, the classification subnetwork is trained.
The classification sub-network is used for recognizing and classifying the facial expression images input into it. It learns from the low-level spatial features under the result of pre-classifying the facial expression images, so as to generate a classification result after classification training; the difference between this classification result and the pre-classification result is minimized, so that the classification sub-network acquires a good ability to recognize and classify the expressions in the facial expression images.
It should be noted that in practical scene application, the facial expression image only needs to be input into the classification sub-network, and the recognition and classification of the expression in the facial expression image can be realized.
Referring now to FIG. 4, therein is shown a flow chart of an embodiment of the present invention for training a classification subnetwork network. As shown in fig. 4, the training step includes:
s401, inputting the low-level spatial features and the result of pre-classifying the facial expression images into the classifying sub-network.
Specifically, before a classification sub-network is trained, the facial expression images are classified in advance, and then the result of the classification in advance is used as another type of supervision information corresponding to each type of expression and is input into the classification sub-network together with the low-level spatial features.
It should be noted that the result of pre-classification refers to a result generated after pre-classifying the facial expression image, and may be a result generated by encoding the facial expression image and inputting a generated encoding label into a classification sub-network as a result of pre-classification; after the facial expression images are coded, the coded class labels and the facial expression images which correspond one to one are used as pre-classification results and input into a classification sub-network; of course, other classification methods capable of distinguishing the facial expression images may be adopted. Therefore, the manner of pre-classifying the facial expression image and what classification result is generated as the pre-classification result are not conditions for limiting the present invention.
S402, learning the low-level spatial features by using the result of the pre-classification by using the classification sub-network so as to obtain a classification result after classification training by using the classification sub-network.
And S403, performing difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
In the implementation, a one-hot coding mode is adopted to pre-classify the facial expression images, and the class marks generated by the one-hot coding are used as real values; inputting the real values and the low-level spatial features into a classification sub-network; learning the low-level spatial features by using the real values through a classification sub-network, and acquiring a classification result after classification training of the classification sub-network as a predicted value; and carrying out difference calculation on the true value and the predicted value to obtain the loss degree of the classification sub-network. For example, with 6 types of expressions, anger, happiness, sadness, surprise, disgust, and fear, the corresponding one-hot code may be: anger: [1,0,0,0,0,0 ]; happy: [0,1,0,0,0,0 ]; sadness: [0,0,1,0,0,0 ]; surprisingly: [0,0,0,1,0,0 ]; aversion: [0,0,0,0,1,0 ]; fear: [0,0,0,0,0,1]. The above-mentioned coding is only an example of one-hot coding, and the present invention may also adopt other coding forms.
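The one-hot coding above can be produced mechanically; a tiny sketch (the class order follows the example and is otherwise arbitrary):

```python
import torch

EXPRESSIONS = ["anger", "happiness", "sadness", "surprise", "disgust", "fear"]

def one_hot(expression_name):
    """Return the one-hot class label used as the true value, e.g. one_hot('anger') -> [1, 0, 0, 0, 0, 0]."""
    vec = torch.zeros(len(EXPRESSIONS))
    vec[EXPRESSIONS.index(expression_name)] = 1.0
    return vec
```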
In this embodiment, before performing the difference calculation, performing regression processing on the classification result after the classification training and the result of the pre-classification respectively, converting the classification result after the classification training and the result of the pre-classification into a prediction probability and a true probability respectively, performing the difference calculation based on the prediction probability and the true probability to obtain the loss degree of the classification sub-network, where the loss degree is calculated by using a cross entropy loss function, and the calculation formula is:
L(M,N)=-MlogN
where L (M, N) is the loss, M is the true probability, and N is the prediction probability.
It should be noted that, because the cross entropy loss function describes the distance between two probability distributions, and the output of the classification sub-network is not necessarily a probability distribution but may be arbitrary real values, the output needs to be converted into probabilities through a regression process so that the loss degree of the classification sub-network can be calculated with the cross entropy.
In the present embodiment, the regression processing may employ Softmax regression processing.
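A short sketch of the Softmax regression followed by the cross entropy L(M, N) = -M log N, with the one-hot labels as the true probabilities; this is a direct, unoptimized reading of the formula and averages over the batch by assumption.

```python
import torch.nn.functional as F

def classification_loss(logits, one_hot_labels):
    """logits: raw outputs of the classification sub-network; one_hot_labels: true probabilities M.
    Softmax turns the raw outputs into prediction probabilities N, then L = -sum(M * log N)."""
    log_probs = F.log_softmax(logits, dim=1)
    return -(one_hot_labels * log_probs).sum(dim=1).mean()
```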
S404, the steps are executed in a circulating mode until the loss degree of the classification sub-network is not reduced any more, the training is stopped, and the classification sub-network corresponding to the minimum loss degree value is selected as the optimal classification sub-network.
Further, after training is stopped and before the optimal classification sub-network is selected, the classification sub-network corresponding to the minimum loss degree is evaluated on newly acquired facial expression images, and the optimal classification sub-network is finally selected according to its expression recognition accuracy on these images.
In this implementation, the feature extraction sub-network can be replaced with other network structures, such as MobileNet (lightweight model), Res-Net (pre-trained model); both the feature map extraction network and the feature map matching sub-network can be changed as long as the feature map generated by the feature map matching sub-network is consistent with the feature map generated by the feature map extraction network in size, and comparison can be performed.
It should be noted that the basic idea of ResNet is that the output of each module of the network is added to its corresponding input, which ensures the propagation of information through the network and reduces the learning difficulty of the neural network. When the image obtained from the pedestrian texture map is used as the main input of the model, part of the data is of poor quality and degrades the network performance, so the network structure is adapted accordingly to reduce the influence of useless information.
Furthermore, the convolution layers in the feature map extraction network, the feature map matching sub-network and the classification sub-network can be replaced by deformable convolutions or dilated convolutions, which enlarges the receptive field of the convolutions and improves the robustness of the network structure.
Furthermore, loss functions with good performance during network training, such as the loss proposed in "A Discriminative Feature Learning Approach for Deep Face Recognition" and the contrastive loss, can be added to the network, which can improve the accuracy of the network by a small margin.
Further, before training the feature map extraction network, the feature map matching sub-network, and the classification sub-network, it is necessary to initialize these network structures.
Specifically, the initialization includes setting a preset weight and initializing a full connection layer and a convolution layer in a network structure by adopting normal distribution, wherein a standard deviation is 0.01 and an expected value is 0.
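A minimal sketch of the stated initialization (normal distribution with expected value 0 and standard deviation 0.01 for the convolution and fully connected layers); zeroing the biases is an assumption, since the text does not mention them.

```python
import torch.nn as nn

def init_weights(module):
    """Apply to a network with net.apply(init_weights)."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)  # expected value 0, standard deviation 0.01
        if module.bias is not None:
            nn.init.zeros_(module.bias)  # assumed: biases set to zero
```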
The expression recognition network training method combined with weak supervision proposes an expression recognition network combined with weak supervision, aimed at the diversity and low distinguishability of facial expressions in actual scenes, and can effectively address problems such as background noise, head angle, uneven illumination distribution and expression details that are difficult to distinguish. The feature map is used for capturing fine-grained information and extracting high-level semantics from the facial expression image; the expression recognition network adds the feature map as supervision information to better complete the expression recognition task.
The expression recognition network training method combined with weak supervision can be used as a module for a shopping mall passenger flow analysis system to call, the passenger flow analysis system inputs a customer image and returns the expression state of the customer, and can also be combined with a tracking module to synthesize the historical information of the customer and perform sampling on a time sequence, so that the precision is further improved, and the psychological behavior state of the customer is analyzed.
It should be noted that the protection scope of the expression recognition network training method combined with weak supervision according to the present invention is not limited to the execution sequence of the steps listed in this embodiment, and all the solutions implemented by adding, subtracting, and replacing steps in the prior art according to the principles of the present invention are included in the protection scope of the present invention.
The storage medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the above-described expression recognition network training method in conjunction with weak supervision. The storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer-readable storage medium. Which when executed performs steps comprising the method embodiments described above.
Example two
The embodiment provides an expression recognition network training system combined with weak supervision, wherein the expression recognition network comprises a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network and a classification sub-network; the system comprises: a first training module and a second training module;
the first training module is used for training the feature map extraction network; the training step comprises the steps of training an input facial expression image by utilizing the feature map extraction network to form an expression feature map of a specified expression, an expression feature map of a non-specified expression and classification prediction probabilities respectively corresponding to the specified expression and the non-specified expression; performing loss calculation according to the classification prediction probabilities of the specified expressions and the non-specified expressions to obtain the loss degree of the feature map extraction network;
the second training module is used for training the feature map matching sub-network and the classification sub-network; the training step comprises inputting the facial expression images into the feature extraction sub-network to obtain low-level spatial features representing image attributes; inputting the low-level spatial features and the expression feature map of the designated expression into the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network; performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network; meanwhile, the low-level spatial features and the result of pre-classifying the facial expression image are input into the classifying sub-network so as to obtain a classifying result after classifying training of the classifying sub-network; and carrying out difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
In this embodiment, the expression recognition network includes a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network, and a classification sub-network.
Please refer to fig. 5, which is a schematic structural diagram of an expression recognition network training system combined with weak supervision according to an embodiment of the present invention. As shown in fig. 5, the expression recognition network training system with weak supervision of the present invention includes: a first training module 51 and a second training module 52.
The first training module 51 is configured to train the feature map extraction network; the training step comprises the steps of training an input facial expression image by utilizing the feature map extraction network to form an expression feature map of a specified expression, an expression feature map of a non-specified expression and classification prediction probabilities respectively corresponding to the specified expression and the non-specified expression; and performing loss calculation according to the classification prediction probabilities of the specified expressions and the non-specified expressions to obtain the loss degree of the feature map extraction network.
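A minimal PyTorch-style sketch of this first training stage is given below; extraction_net, its output convention and the optimizer handling are assumptions for illustration, while the loss term mirrors the cross-entropy form -(Q_a log P_a + Q_b log P_b) described in this document.

```python
import torch

def extraction_loss(pred_probs: torch.Tensor, true_probs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy -(Q_a*log P_a + Q_b*log P_b), batched over (batch, 2) tensors."""
    eps = 1e-8                                            # avoid log(0)
    return -(true_probs * torch.log(pred_probs + eps)).sum(dim=1).mean()

def extraction_train_step(extraction_net, images, true_probs, optimizer):
    # Hypothetical output convention: (specified feature map, non-specified feature map, probabilities).
    feat_map_a, feat_map_b, pred_probs = extraction_net(images)
    loss = extraction_loss(pred_probs, true_probs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), feat_map_a, feat_map_b
```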
The second training module 52 is configured to train the feature map matching sub-network and the classification sub-network; the training step comprises inputting the facial expression images into the feature extraction sub-network to obtain low-level spatial features representing image attributes; inputting the low-level spatial features and the expression feature map of the designated expression into the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network; performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network; meanwhile, the low-level spatial features and the result of pre-classifying the facial expression images are input into the classifying sub-network so as to obtain a classifying result after classification training of the classifying sub-network; and carrying out difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
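The second training stage can be sketched in a similar, non-authoritative way; feature_net, match_net and cls_net are hypothetical module names, the matching loss is written here as a mean squared difference against the expression feature map, and the classification loss as a standard cross-entropy against the pre-classification labels.

```python
import torch
import torch.nn.functional as F

def second_stage_step(feature_net, match_net, cls_net,
                      images, expr_feature_map, pre_labels, optimizer):
    low_level = feature_net(images)                     # low-level spatial features of the image
    matched = match_net(low_level, expr_feature_map)    # feature map after matching training
    match_loss = F.mse_loss(matched, expr_feature_map)  # difference vs. the expression feature map
    logits = cls_net(low_level)                         # classification result after training
    cls_loss = F.cross_entropy(logits, pre_labels)      # difference vs. the pre-classification result
    loss = match_loss + cls_loss                        # joint multi-task objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return match_loss.item(), cls_loss.item()
```

Summing the two losses reflects the multi-task character of this stage; in practice the two terms could be weighted.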
It should be noted that the division of the modules of the above apparatus is only a logical division; in an actual implementation they may be wholly or partially integrated into one physical entity or kept physically separate. These modules may all be implemented as software invoked by a processing element, may all be implemented in hardware, or some may be implemented as software invoked by a processing element while the rest are implemented in hardware. For example, the x module may be a separately arranged processing element, may be integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code whose function is invoked and executed by a processing element of the apparatus. The other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
EXAMPLE III
The present embodiment provides a terminal, including: a processor and a memory;
the memory is used for storing a computer program;
the processor is used for executing the computer program stored in the memory so as to enable the terminal to execute the expression recognition network training method combined with weak supervision.
Please refer to fig. 6, which is a schematic structural diagram of a terminal according to an embodiment of the present invention. As shown in fig. 6, the terminal of the present invention includes a processor 61 and a memory 62.
The memory 62 is used for storing computer programs. Preferably, the memory 62 comprises: various media that can store program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
The processor 61 is connected to the memory 62 and is configured to execute the computer program stored in the memory 62, so that the terminal executes the above-mentioned facial expression recognition network training method combined with weak supervision.
Preferably, the processor 61 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It should be noted that the expression recognition network training system combined with weak supervision of the present invention can implement the expression recognition network training method combined with weak supervision of the present invention, but the implementation apparatus of the expression recognition network training method combined with weak supervision of the present invention includes, but is not limited to, the structure of the expression recognition network training system combined with weak supervision as illustrated in this embodiment, and all structural modifications and substitutions in the prior art made according to the principle of the present invention are included in the scope of the present invention.
In summary, the expression recognition network training method, system, medium and terminal combined with weak supervision of the present invention perform weakly supervised learning by fusing facial expression feature maps. They capture, for each expression class, features that are discriminative for that class and distinguish it from the other classes, and express each class with a feature map, so the extracted features have finer granularity and the accuracy and robustness of facial expression recognition and classification are greatly improved; the method can adapt to expression recognition of facial images in various scenes, such as different angles, distortion and occlusion. The feature map matching sub-network is used to match and train the feature map, the classification sub-network is used to predict and classify the collected facial expressions, and the two networks are fused, achieving multi-task learning of the expression recognition network and improving the accuracy of facial expression recognition. By recognizing facial expressions, customer satisfaction can be analyzed, driver fatigue can be detected, or psychotherapy can be assisted. The present invention therefore effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical idea disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. The expression recognition network training method combined with weak supervision is characterized in that the expression recognition network comprises a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network and a classification sub-network; the method comprises the following steps:
training the feature map extraction network; the training step comprises:
training the input facial expression image by using the feature map extraction network to form an expression feature map of a specified expression, an expression feature map of a non-specified expression and classification prediction probabilities respectively corresponding to the specified expression and the non-specified expression;
performing loss calculation according to the classification prediction probabilities of the specified expressions and the non-specified expressions to obtain the loss degree of the feature map extraction network; training the feature map matching sub-network and the classification sub-network; the training step comprises:
inputting the facial expression image into the feature extraction sub-network to obtain low-level spatial features for representing image attributes;
inputting the low-level spatial features and the expression feature map of the designated expression into the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network; performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network;
meanwhile, the low-level spatial features and the result of pre-classifying the facial expression images are input into the classifying sub-network so as to obtain a classifying result after classification training of the classifying sub-network; and carrying out difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
2. The weakly supervised expression recognition network training method as recited in claim 1, wherein the input facial expression images include an expression image of a specified expression and an expression image of a non-specified expression which are manually distinguished; training the feature map extraction network, wherein the training step comprises:
obtaining the classification prediction probability of the specified expression and the non-specified expression through the feature map extraction network;
performing difference calculation on the classification prediction probability of the specified expression and the non-specified expression and the classification real probability of the specified expression and the non-specified expression which are manually distinguished to obtain the loss degree of the feature map extraction network;
and circularly executing the steps until the loss degree of the characteristic diagram extraction network is not reduced any more, stopping training, and selecting the characteristic diagram extraction network corresponding to the minimum loss degree as the optimal characteristic diagram extraction network.
3. The expression recognition network training method combined with weak supervision according to claim 2, wherein the loss degree of the feature map extraction network is calculated by adopting a cross entropy loss function, and the calculation formula is as follows:
L((Q_a, Q_b), (P_a, P_b)) = -(Q_a log P_a + Q_b log P_b)
wherein L((Q_a, Q_b), (P_a, P_b)) is the loss degree; Q_a and Q_b are respectively the classification true probabilities of the specified expression and the non-specified expression; and P_a and P_b are respectively the classification prediction probabilities of the specified expression and the non-specified expression.
4. The method of claim 1, wherein the sub-network of feature map matching is trained, and the training comprises:
inputting the low-level spatial features and the expression feature map of the specified expression into the feature map matching sub-network;
learning the low-level spatial features by using the expression feature map by using the feature map matching sub-network so as to obtain a feature map subjected to matching training of the feature map matching sub-network;
performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network;
and circularly executing the steps until the loss degree of the feature map matching sub-network is not reduced any more, stopping training, and selecting the feature map matching sub-network corresponding to the minimum loss degree as the optimal feature map matching sub-network.
5. The expression recognition network training method combined with weak supervision according to claim 4, wherein the loss degree of the feature map matching sub-network is calculated by the following formula:
[Formula FDA0002180918690000021 — equation image not reproduced here; it computes the loss L(Θ) from the differences between F(X_i; Θ) and F_i over the N collected images]
wherein L(Θ) is the loss degree, F(X_i; Θ) is the pixel of the i-th feature map after matching training, X_i is the i-th facial expression image, F_i is the pixel of the i-th expression feature map, i takes values from 1 to N, and N is the preset number of collected facial expression images.
6. The method of claim 1, wherein the classification sub-network is trained, and the training comprises:
inputting the low-level spatial features and the result of pre-classifying the facial expression images into the classifying sub-network;
learning the low-level spatial features by using the result of the pre-classification by using the classification sub-network so as to obtain a classification result after classification training of the classification sub-network;
performing difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network;
and circularly executing the steps until the loss degree of the classification sub-network is not reduced any more, stopping training, and selecting the classification sub-network corresponding to the minimum loss degree as the optimal classification sub-network.
7. The method for facial expression recognition network training combined with weak supervision according to claim 6, wherein before the difference calculation, regression processing is performed on the classification result after the classification training and the result of the pre-classification respectively, the classification result after the classification training and the result of the pre-classification are converted into a prediction probability and a true probability respectively, so as to perform difference calculation based on the prediction probability and the true probability to obtain the loss degree of the classification sub-network, wherein the loss degree is calculated by using a cross entropy loss function, and the calculation formula is as follows:
L(M, N) = -M log N
wherein L(M, N) is the loss degree, M is the true probability, and N is the prediction probability.
8. An expression recognition network training system combined with weak supervision is characterized in that the expression recognition network comprises a feature map extraction network, a feature extraction sub-network, a feature map matching sub-network and a classification sub-network; the system comprises: a first training module and a second training module;
the first training module is used for training the feature map extraction network; the training step comprises the steps of training an input facial expression image by utilizing the feature map extraction network to form an expression feature map of a specified expression, an expression feature map of a non-specified expression and classification prediction probabilities respectively corresponding to the specified expression and the non-specified expression; performing loss calculation according to the classification prediction probabilities of the specified expressions and the non-specified expressions to obtain the loss degree of the feature map extraction network;
the second training module is used for training the feature map matching sub-network and the classification sub-network; the training step comprises inputting the facial expression images into the feature extraction sub-network to obtain low-level spatial features representing image attributes; inputting the low-level spatial features and the expression feature map of the designated expression into the feature map matching sub-network to obtain the feature map after matching training of the feature map matching sub-network; performing difference calculation on the feature map after matching training and the expression feature map to obtain the loss degree of the feature map matching sub-network; meanwhile, the low-level spatial features and the result of pre-classifying the facial expression images are input into the classifying sub-network so as to obtain a classifying result after classification training of the classifying sub-network; and carrying out difference calculation on the classification result after the classification training and the result of the pre-classification to obtain the loss degree of the classification sub-network.
9. A storage medium on which a computer program is stored, which program, when being executed by a processor, carries out the method of expression recognition network training in combination with weak supervision of any of claims 1 to 7.
10. A terminal, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the memory-stored computer program to cause the terminal to perform the method of facial expression recognition network training in conjunction with weak supervision of any of claims 1 to 7.
CN201910795777.4A 2019-08-27 2019-08-27 Expression recognition network training method, system, medium and terminal combined with weak supervision Active CN110598587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910795777.4A CN110598587B (en) 2019-08-27 2019-08-27 Expression recognition network training method, system, medium and terminal combined with weak supervision

Publications (2)

Publication Number Publication Date
CN110598587A CN110598587A (en) 2019-12-20
CN110598587B true CN110598587B (en) 2022-05-13

Family

ID=68855796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910795777.4A Active CN110598587B (en) 2019-08-27 2019-08-27 Expression recognition network training method, system, medium and terminal combined with weak supervision

Country Status (1)

Country Link
CN (1) CN110598587B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111652079B (en) * 2020-05-12 2023-04-07 五邑大学 Expression recognition method and system applied to mobile crowd and storage medium
CN111814603B (en) * 2020-06-23 2023-09-05 汇纳科技股份有限公司 Face recognition method, medium and electronic equipment
CN112287802A (en) * 2020-10-26 2021-01-29 汇纳科技股份有限公司 Face image detection method, system, storage medium and equipment
CN112580617B (en) * 2021-03-01 2021-06-18 中国科学院自动化研究所 Expression recognition method and device in natural scene
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000300A1 (en) * 2015-07-02 2017-01-05 Xiaoou Tang Methods and systems for social relation identification
CN106096557A (en) * 2016-06-15 2016-11-09 浙江大学 A kind of semi-supervised learning facial expression recognizing method based on fuzzy training sample
CN108764207A (en) * 2018-06-07 2018-11-06 厦门大学 A kind of facial expression recognizing method based on multitask convolutional neural networks
CN109117750A (en) * 2018-07-24 2019-01-01 深圳先进技术研究院 A kind of Emotion identification method, system and electronic equipment based on deep learning
CN109902660A (en) * 2019-03-18 2019-06-18 腾讯科技(深圳)有限公司 A kind of expression recognition method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CNN expression recognition based on feature graph; Xu Lin-Lin et al.; 2017 Chinese Automation Congress; 20171022; full text *
Facial expression recognition based on Gabor feature and neural network; Pang Lei et al.; 2018 International Conference on Security, Pattern Analysis, and Cybernetics; 20181217; full text *
Research on expression recognition based on convolutional neural networks; Chen Hang; China Master's Theses Full-text Database, Information Science and Technology; 20190215; full text *
Multi-view facial expression recognition based on improved convolutional neural networks; Qian Yongsheng et al.; Computer Engineering and Applications; 20181231; full text *

Also Published As

Publication number Publication date
CN110598587A (en) 2019-12-20

Similar Documents

Publication Publication Date Title
CN110598587B (en) Expression recognition network training method, system, medium and terminal combined with weak supervision
CN109409222B (en) Multi-view facial expression recognition method based on mobile terminal
Mery et al. Modern computer vision techniques for x-ray testing in baggage inspection
Zafar et al. Face recognition with Bayesian convolutional networks for robust surveillance systems
CN109948526B (en) Image processing method and device, detection equipment and storage medium
Pisharady et al. Attention based detection and recognition of hand postures against complex backgrounds
CN111191526B (en) Pedestrian attribute recognition network training method, system, medium and terminal
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Lei et al. A skin segmentation algorithm based on stacked autoencoders
WO2021063476A1 (en) Method for training a generative adversarial network, modified image generation module and system for detecting features in an image
CN113255630B (en) Moving target recognition training method, moving target recognition method and device
CN109241890B (en) Face image correction method, apparatus and storage medium
CN110309692B (en) Face recognition method, device and system, and model training method and device
CN113947702A (en) Multi-modal emotion recognition method and system based on context awareness
CN105893941B (en) A kind of facial expression recognizing method based on area image
Zahid et al. Pedestrian identification using motion-controlled deep neural network in real-time visual surveillance
Gu et al. 3-d facial expression recognition via attention-based multichannel data fusion network
Nguyen et al. Robust stereo data cost with a learning strategy
Alsaggaf et al. A smart surveillance system for uncooperative gait recognition using cycle consistent generative adversarial networks (CCGANs)
Nasir et al. Recognition of human emotion transition from video sequence using triangulation induced various centre pairs distance signatures
Li et al. Feature extraction based on deep‐convolutional neural network for face recognition
Nguyen et al. Real-time Human Detection under Omni-directional Camera based on CNN with Unified Detection and AGMM for Visual Surveillance
de Souza et al. Efficient width-extended convolutional neural network for robust face spoofing detection
Meena et al. Hybrid neural network architecture for multi-label object recognition using feature fusion
CN111553202A (en) Training method, detection method and device of neural network for detecting living body

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 201203 No. 6, Lane 55, Chuanhe Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Winner Technology Co.,Ltd.

Address before: 201505 Room 216, 333 Tingfeng Highway, Tinglin Town, Jinshan District, Shanghai

Applicant before: Winner Technology Co.,Ltd.

GR01 Patent grant