CN107292256B - Auxiliary task-based deep convolution wavelet neural network expression recognition method - Google Patents


Info

Publication number
CN107292256B
CN107292256B (application CN201710446076.0A)
Authority
CN
China
Prior art keywords
layer
convolution
expression
network
image
Prior art date
Legal status
Active
Application number
CN201710446076.0A
Other languages
Chinese (zh)
Other versions
CN107292256A (en)
Inventor
白静
陈科雯
张景森
焦李成
缑水平
张向荣
Current Assignee
Xian University of Electronic Science and Technology
Original Assignee
Xian University of Electronic Science and Technology
Priority date
Filing date
Publication date
Application filed by Xian University of Electronic Science and Technology
Priority to CN201710446076.0A
Publication of CN107292256A
Application granted
Publication of CN107292256B

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention discloses an auxiliary-task-based deep convolution wavelet neural network expression recognition method, which solves two problems of existing techniques: feature selection operators cannot learn expression features efficiently, and too few image expression classification features are extracted. The invention is realized as follows: build a deep convolution wavelet neural network; establish a facial expression image set and a corresponding expression-sensitive-region image set; input the facial expression images to the network; train the deep convolution wavelet neural network; back-propagate the network error; update each convolution kernel and bias vector of the network; input the expression-sensitive-region images to the trained network; learn the weighting proportion of the auxiliary task; obtain the network's global classification labels; and compute the recognition accuracy from the global labels. The method captures both the abstract and the detail information of expression images, strengthens the influence of the expression-sensitive regions in expression feature learning, markedly improves the accuracy of expression recognition, and can be applied to expression recognition of facial expression images.

Description

Auxiliary task-based deep convolution wavelet neural network expression recognition method
Technical Field
The invention belongs to the technical field of image processing, relates generally to computer vision recognition, and specifically to an auxiliary-task-based deep convolution wavelet neural network expression recognition method. The method can be applied to learning and classifying expression features in facial expression recognition.
Background
Facial expression recognition is a leading technology in the fields of image processing and computer vision. It is a key step from image processing to image analysis, and the quality of its result directly influences subsequent image analysis and understanding. The purpose of facial expression recognition is to study coding models of facial expressions, to learn and extract their characteristic expression patterns, and to let computers synthesize, track and recognize facial expressions automatically.
Current research on facial expression recognition focuses mainly on two aspects: feature extraction and classification algorithms. Deep-learning-based methods have been adopted by researchers in recent years; in particular, the deep convolutional neural network, which excels at processing two-dimensional images, has been applied to expression recognition. However, the deep convolutional neural network concentrates on the abstract mapping of an image from low layers to high layers in order to obtain a high-level feature expression pattern, and in doing so ignores the texture and detail information of the expression image. Moreover, the commonly used deep network is generally a single-task network, which cannot effectively highlight the main contribution of the expression-sensitive regions when learning expression features.
Existing expression recognition techniques mainly select features first and then classify, but in the feature selection step the available feature selection operators cannot learn expression features efficiently, so the subsequent classification cannot reach an ideal result. In addition, Lu Yadan et al. adopt a deep self-coding network as the classifier, but do not avoid the feature selection step, so the final classification effect improves little.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an auxiliary-task-based deep convolution wavelet neural network expression recognition method.
The auxiliary-task-based deep convolution wavelet neural network expression recognition method of the invention comprises the following steps (a code sketch of the full network follows the step list):
(1) building a deep convolution wavelet network consisting of three convolutional layers, two pooling layers, a multi-scale transform layer, a fully connected layer and a softmax output layer; the bias weight matrices of the convolutional layers are initialized to 0 matrices, and the Sigmoid function is selected as the activation function of the network;
(2) establishing a facial expression image set and an expression sensitive area image set, wherein the expression sensitive area image set is obtained by cutting eyebrow parts and mouth parts of the facial expression image set, a part of images in the facial expression image data set are used as a training image set of a network, and the rest of images are used as a testing image set;
(3) inputting a training image into a deep convolution wavelet network, wherein the size of the input image is 96 × 96;
(4) the first layer of the deep convolution wavelet network is a convolutional layer, which performs a convolution operation on each input facial expression training image; the number of convolution kernels is Q1 and the kernel size is 7 × 7:
(4a) a random initialization method configures the convolution kernel weights as near-zero numbers in [-0.5, 0.5];
(4b) each convolution kernel convolves the facial expression image, giving Q1 convolved feature maps; the feature map of each kernel is 90 × 90;
(5) the second layer of the network is a pooling layer, which takes the Q1 feature maps from the previous convolutional layer as input and performs the pooling operation:
the pooling layer selects the maximum value in non-overlapping 2 × 2 regions, yielding Q1 pooled feature maps of size 45 × 45;
(6) the third layer of the network is a convolutional layer, which takes the Q1 feature maps from the previous pooling layer as input and performs a convolution operation; the number of convolution kernels is Q2 and the kernel size is 6 × 6:
(6a) a random initialization method configures the convolution kernel weights as near-zero numbers in [-0.5, 0.5];
(6b) each convolution kernel convolves the Q1 feature maps; the Q1 convolution results, plus the bias, are passed through the activation function and averaged to obtain that kernel's feature map, of size 40 × 40;
(7) the fourth layer of the network is a pooling layer, which takes the Q2 feature maps from the previous convolutional layer as input and performs the pooling operation:
the pooling layer selects the maximum value in non-overlapping 2 × 2 regions, yielding Q2 pooled feature maps of size 20 × 20;
(8) the fifth layer of the network is a convolutional layer, which takes the Q2 feature maps from the previous pooling layer as input and performs a convolution operation; the number of convolution kernels is Q3 and the kernel size is 5 × 5:
(8a) a random initialization method configures the convolution kernel weights as near-zero numbers in [-0.5, 0.5];
(8b) each convolution kernel convolves the Q2 feature maps; the Q2 convolution results, plus the bias, are passed through the activation function and averaged to obtain that kernel's feature map, of size 16 × 16;
(9) the sixth layer of the network is a wavelet pooling layer, which takes the Q3 feature maps from the previous convolutional layer as input and performs a one-level wavelet decomposition:
the wavelet basis function is the 'haar' function; each feature map yields an 8 × 8 low-frequency subband and three 8 × 8 high-frequency subbands, and the three high-frequency subbands are fused into a new high-frequency subband by taking the maximum at corresponding positions;
(10) the seventh layer of the network is a fully connected layer, which takes the Q3 8 × 8 low-frequency subbands and Q3 8 × 8 high-frequency subbands from the sixth-layer wavelet pooling as input to form a 128-dimensional fully connected feature vector;
(11) steps (3) to (10) are repeated in units of n randomly selected facial expression images, giving the 128-dimensional feature vector of each of the n images;
(12) the eighth layer of the network is a Softmax output layer; the n 128-dimensional feature vectors are taken as input to train a Softmax classifier with a 7-class probability-distribution output, yielding classification labels;
(13) the error between the classification labels of the Softmax output layer and the true labels is computed, and the weight matrices are updated once by the BP back-propagation algorithm;
(14) training steps (3) to (13) are repeated until the weight matrices have been updated m times, giving the trained deep convolution wavelet neural network;
(15) the facial expression test image set is fed into the trained deep convolution wavelet neural network to obtain a classification label z1 at the output layer, and the expression-sensitive-region image set corresponding to the test set is fed into the same network to obtain a classification label z2; the final classification label is obtained from the two as z3 = z1 + λ·z2, where λ is the weighting proportion of the auxiliary task;
(16) the facial expression recognition accuracy is output according to the classification label z3 of the test set, completing the auxiliary-task-based deep convolution wavelet neural network facial expression recognition.
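For concreteness, the following is a minimal sketch of steps (1)-(12) in PyTorch (an assumption; the invention is not tied to any framework, and the simulations below use Matlab). Standard summed multi-channel convolutions stand in for the per-map averaging of steps (6b) and (8b), the Haar decomposition is written out by hand, and the kernel counts Q1/Q2/Q3 follow Example 1 below (4, 6, 12). The comments trace the stated feature-map sizes 96 → 90 → 45 → 40 → 20 → 16 → 8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt2(x):
    """One-level 2-D Haar decomposition of an (N, C, H, W) tensor."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency subband
    lh = (a - b + c - d) / 2   # horizontal detail
    hl = (a + b - c - d) / 2   # vertical detail
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh

class DCWNN(nn.Module):
    def __init__(self, q1=4, q2=6, q3=12, n_classes=7):
        super().__init__()
        self.conv1 = nn.Conv2d(1,  q1, 7)    # step (4): 96 -> 90
        self.conv2 = nn.Conv2d(q1, q2, 6)    # step (6): 45 -> 40
        self.conv3 = nn.Conv2d(q2, q3, 5)    # step (8): 20 -> 16
        self.fc = nn.Linear(128, n_classes)  # 64 low-freq + 64 high-freq dims

    def forward(self, x):                    # x: (N, 1, 96, 96)
        x = F.max_pool2d(torch.sigmoid(self.conv1(x)), 2)  # step (5): 90 -> 45
        x = F.max_pool2d(torch.sigmoid(self.conv2(x)), 2)  # step (7): 40 -> 20
        x = torch.sigmoid(self.conv3(x))                   # step (8): 16 x 16
        ll, lh, hl, hh = haar_dwt2(x)                      # step (9): 8 x 8 subbands
        high = torch.clamp(torch.max(torch.max(lh, hl), hh), min=0)  # Maxf(0, HH, HL, LH)
        low = torch.clamp(ll, min=0)
        feat = torch.cat([low.max(dim=1).values.flatten(1),           # step (10):
                          high.max(dim=1).values.flatten(1)], dim=1)  # (N, 128)
        return self.fc(feat)   # step (12): logits; softmax via CrossEntropyLoss
```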
The method learns expression features with the auxiliary-task deep convolution wavelet neural network and requires no separate feature selection, so it learns both the abstract and the local detail information of facial expressions well, strengthens the influence of the expression-sensitive regions on feature extraction, and markedly improves the accuracy of facial expression recognition.
Compared with the prior art, the invention has the following advantages:
firstly, the invention takes into account the special characterization capability of the expression-sensitive regions when the deep convolutional neural network learns expression features. A main-task DCNN is first trained to obtain a shared feature weight matrix; local images of the eyebrow/eye and mouth poses in the expression-sensitive regions are then merged into an auxiliary-task estimation branch, whose classification result is obtained by mapping through the shared feature weight matrix; finally, the auxiliary-task classification result optimizes the classification performance of the main task, improving the generalization ability of the deep convolutional network in expression recognition;
secondly, the invention avoids two defects of ordinary convolutional neural networks: features learned by an upper convolutional layer can be lost through the simple down-sampling of a pooling layer, and the output of the fully connected layer contains only abstract information, losing many shallow local features. By combining multi-scale wavelet transforms with the deep convolutional neural network architecture, the network both ensures that features learned by the convolutional layers are transmitted intact through the pooling stage and expands, in the fully connected layer, the local expression features obtained in shallow-layer learning, so the whole network structure describes the expression features better and the recognition result improves markedly.
Description of the figures
FIG. 1 is a portion of an image in a raw database as employed by the present invention;
FIG. 2 is a block flow diagram of the present invention;
FIG. 3 is a schematic diagram of the network structure of the present invention, wherein FIG. 3(a) is a structural diagram of the deep convolution wavelet neural network of the present invention, and FIG. 3(b) is a structural diagram of the auxiliary-task deep convolution wavelet neural network of the present invention;
fig. 4 is a portion of an expression sensitive area image of the present invention.
Detailed Description
The invention is described in detail below with reference to the attached drawing figures:
example 1
Facial expression recognition is an indispensable component of machine learning research and has broad application value in a society where human-computer interaction is increasingly widespread: it enables real-time automatic recognition of facial expressions in human-computer interfaces such as mobile terminals and personal computers, and in some settings expressions are retrieved, tracked and identified from video. Breakthroughs in facial expression recognition methods also carry great reference significance for intelligent computing and brain-inspired research.
In the existing expression recognition technology, a method of firstly selecting features and then classifying is mainly adopted, but in the feature selection step, the existing feature selection operator cannot efficiently learn expression features, so that the subsequent classification cannot obtain an ideal result. In addition, the method of adopting the deep network as the classifier does not avoid feature selection, so that the classification effect is improved to a limited extent.
In view of this situation, the invention develops research and exploration and proposes an auxiliary-task-based deep convolution wavelet neural network expression recognition method. Referring to fig. 2, the invention realizes facial expression recognition through the following steps:
(1) A deep convolution wavelet network is built, consisting of three convolutional layers, two pooling layers, a multi-scale transform layer, a fully connected layer and a softmax output layer; the bias weight matrices of the convolutional layers are initialized to 0 matrices, and the Sigmoid function is selected as the activation function of the network. From input to output, the network built by the invention comprises in sequence: an input layer, a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a multi-scale transform layer (the wavelet pooling layer), a fully connected layer and a softmax output layer, together forming the deep convolution wavelet neural network.
(2) A facial expression image set and an expression-sensitive-region image set are established, where the expression-sensitive-region image set is obtained by cropping the eyebrow parts and mouth parts of the facial expression images; part of the facial expression image data set serves as the network's training image set and the rest as the test image set. For example, the facial expression image data set in this example has 20000 samples, of which 15000 images are used as the training image set and the remaining 5000 images as the test image set; the expression-sensitive-region image set corresponds one-to-one with the facial expression image data set.
(3) A training image is input into the deep convolution wavelet network; the input image size is 96 × 96. In this embodiment the training image is input directly, without further preprocessing such as removing the influence of complex backgrounds or illumination, which simplifies the image recognition procedure.
(4) The first layer of the deep convolution wavelet network is a convolutional layer, which performs a convolution operation on each input facial expression training image; the number of convolution kernels is Q1 and the kernel size is 7 × 7. The number of convolution kernels is chosen according to the computing environment and the software and hardware conditions; in this example Q1 = 4.
(4a) A random initialization method configures the convolution kernel weights as near-zero numbers in [-0.5, 0.5]. The initial weights are near zero in the invention to accelerate the convergence of the network.
(4b) Each convolution kernel convolves the facial expression image, giving Q1 convolved feature maps; the feature map of each kernel is 90 × 90. The feature map size is determined by the convolution kernel size.
(4c) The bias weight matrix of the convolutional layer is initially set to a 0 matrix; in this example the bias weight matrix is a one-dimensional vector whose dimension equals the number of convolution kernels Q1.
(4d) The activation function of the network is the Sigmoid function:

f(x) = 1 / (1 + e^(-x))

where f(x) is the activation value of the function, x is the input of the activation function (in this network, the convolution result plus the bias weight), and e is the base of the natural logarithm.
(5) The second layer of the network is a pooling layer, which takes the Q1 feature maps from the previous (first) convolutional layer as input and performs the pooling operation:
the pooling layer selects the maximum value in non-overlapping 2 × 2 regions, yielding Q1 pooled feature maps of size 45 × 45, as the toy example below illustrates.
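A toy illustration of this non-overlapping 2 × 2 maximum (a sketch using numpy, not part of the patent):

```python
import numpy as np

fm = np.arange(16, dtype=float).reshape(4, 4)       # toy 4x4 feature map
pooled = fm.reshape(2, 2, 2, 2).max(axis=(1, 3))    # non-overlapping 2x2 maxima
print(pooled)   # [[ 5.  7.]
                #  [13. 15.]]
```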
(6) The third layer of the network is a convolutional layer, which takes the Q1 feature maps from the previous pooling layer as input and performs a convolution operation; the number of convolution kernels is Q2 and the kernel size is 6 × 6. In this example Q2 = 6.
(6a) A random initialization method configures the convolution kernel weights as near-zero numbers in [-0.5, 0.5];
(6b) each convolution kernel convolves the Q1 feature maps; the Q1 convolution results, plus the bias, are passed through the activation function and averaged to obtain that kernel's feature map, of size 40 × 40;
(6c) the bias weight matrix of the convolutional layer is initially set to a 0 matrix. In this example the bias weight matrix is a one-dimensional vector whose dimension equals the number of convolution kernels Q2.
(6d) The activation function of the network is the Sigmoid function.
(7) The fourth layer of the network is a pooling layer, which takes the Q2 feature maps from the previous convolutional layer as input and performs the pooling operation:
the pooling layer selects the maximum value in non-overlapping 2 × 2 regions, yielding Q2 pooled feature maps of size 20 × 20;
(8) the fifth layer of the network is a convolutional layer, which takes the Q2 feature maps from the previous pooling layer as input and performs a convolution operation; the number of convolution kernels is Q3 and the kernel size is 5 × 5. In this example Q3 = 12.
(8a) A random initialization method configures the convolution kernel weights as near-zero numbers in [-0.5, 0.5];
(8b) each convolution kernel convolves the Q2 feature maps; the Q2 convolution results, plus the bias, are passed through the activation function and averaged to obtain that kernel's feature map, of size 16 × 16;
(8c) the bias weight matrix of the convolutional layer is initially set to a 0 matrix;
(8d) the activation function of the network is the Sigmoid function.
(9) The sixth layer of the network is a wavelet pooling layer, which takes the Q3 feature maps from the previous convolutional layer as input and performs a one-level wavelet decomposition:
the wavelet basis function is the 'haar' function; each feature map yields an 8 × 8 low-frequency subband and three 8 × 8 high-frequency subbands, and the three high-frequency subbands are fused into a new high-frequency subband by taking the maximum at corresponding positions.
(10) The seventh layer of the network is a fully connected layer, which takes the Q3 8 × 8 low-frequency subbands and Q3 8 × 8 high-frequency subbands from the sixth-layer wavelet pooling as input to form a 128-dimensional fully connected feature vector.
(11) Steps (3) to (10) are repeated in units of n randomly selected facial expression images, giving the 128-dimensional feature vector of each of the n images.
(12) The eighth layer of the network is a Softmax output layer; the n obtained 128-dimensional feature vectors are taken as input to train a Softmax classifier with a 7-class probability-distribution output, yielding classification labels.
(13) The error between the classification labels of the Softmax output layer and the true labels is computed, and the weight matrices are updated once by the BP back-propagation algorithm. The updated weight matrices in this example include the values of the convolution kernels and of the bias weight vectors.
(14) Training steps (3) to (13) are repeated until the weight matrices have been updated m times, giving the trained deep convolution wavelet neural network. In the invention m, the number of updates, is determined by the image scale and the convergence rate of the network.
(15) The facial expression test image set is fed into the trained deep convolution wavelet neural network to obtain a classification label z1 at the output layer, and the expression-sensitive-region image set corresponding to the test set is fed into the same trained network to obtain a classification label z2; the final classification label is obtained from the two as z3 = z1 + λ·z2, where λ is the weighting proportion of the auxiliary task.
(16) And outputting the facial expression recognition accuracy according to the classification label z3 of the test set, and completing the auxiliary task-based deep convolution wavelet neural network facial expression recognition.
The method takes into account the special characterization capability of the expression-sensitive regions when the deep convolutional neural network learns expression features: it first trains a main-task DCNN to obtain a shared feature weight matrix, then merges local images of the eyebrow/eye and mouth poses in the expression-sensitive regions into an auxiliary-task estimation branch, obtains the auxiliary classification result by mapping through the shared feature weight matrix, and finally uses that result to optimize the classification performance of the main task, improving the generalization ability of the deep convolutional network in expression recognition.
Example 2
The auxiliary task-based deep convolution wavelet neural network expression recognition method is the same as that in the embodiment 1, the facial expression image set and the expression sensitive area image set are established in the step (2), and the method is carried out according to the following steps:
2.1 the facial expression image set is obtained as follows:
An appropriate number of labeled original images are randomly selected from the JAFFE expression image library; the library used by the invention contains 213 images in total, as shown in fig. 1, covering seven classes of expressions: anger, sadness, happiness, neutrality, disgust, surprise and fear. The original image size is 256 × 256. Fig. 1 lists partial images of different expressions of four persons: the first row shows angry expressions, the second disgusted, the third surprised, the fourth happy and the fifth neutral. The original images are expanded by flipping, rotating, and selecting image blocks with a sliding frame: each image is first flipped, then rotated by several small angles, and finally expression images are selected by sliding the frame up and down about the image center. The invention identifies the face region of the expanded images by combining haar features with the Adaboost algorithm and rescales the facial expression images, finally obtaining a facial expression image set of tens of thousands of samples.
2.2 expression sensitive region image sets were obtained as follows:
the expression sensitive area refers to areas of several parts sensitive to expressions in the human face area, including the eye eyebrow area and the mouth area; and (3) cutting the facial expression image set obtained in the step (2.1), obtaining two left and right eyebrow and eye image blocks and obtaining a mouth position image block by adopting a proper cutting frame, splicing the three image blocks to obtain an expression sensitive area image, and finally obtaining the expression sensitive area image set of the same tens of thousands of samples. Referring to fig. 4, the sensitive region images of seven expressions of one person are listed in fig. 4.
2.3 A label file of the facial expression image set is made from the original labels of the JAFFE expression image library. The label of a single image is a 1 × k-dimensional binary vector, where the expressions are divided into k categories (k = 2, 3, 4, 5, 6, …), the value of k being determined by the actual expression classification problem. The dimension whose value is 1 indicates the expression category the image belongs to, and all other dimensions are 0; for example, if the first dimension represents the happy category in a 5-class problem, the label vector of a happy image is [1, 0, 0, 0, 0]. In the invention the expression image data set and the sensitive-region image data set correspond to each other, so they can share the label file.
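A minimal sketch of such a label vector (the helper name is hypothetical):

```python
import numpy as np

def make_label(class_idx: int, k: int = 7) -> np.ndarray:
    """1 x k binary label vector as in step 2.3; k = 7 for the JAFFE classes."""
    v = np.zeros(k, dtype=np.int8)
    v[class_idx] = 1
    return v

print(make_label(0, k=5))   # a happy image when happiness is the first of 5 classes
                            # -> [1 0 0 0 0]
```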
Example 3
The auxiliary-task-based deep convolution wavelet neural network expression recognition method is the same as in embodiments 1-2. The wavelet pooling layer of step (9) obtains low-frequency and high-frequency subbands; referring to fig. 3(a), the conventional down-sampling pooling layer is replaced there by the wavelet pooling layer, which avoids the information loss caused by simple down-sampling, retains the high-frequency information and enhances the local information of the expression features. It proceeds as follows:
9.1 A one-level down-sampling wavelet decomposition is performed on the feature maps from the previous convolutional layer, with the Haar function as the wavelet basis; each feature map yields a low-frequency subband, a horizontal high-frequency subband, a vertical high-frequency subband, and a diagonal high-frequency subband containing both horizontal and vertical detail. The number of wavelet decomposition levels in the invention can be chosen according to the size the network requires in a practical application.
9.2 The three high-frequency subbands are fused into a new high-frequency subband according to the following formula:
xWH = Maxf(0, xHH, xHL, xLH)
where xHH, xHL and xLH are the three high-frequency subbands obtained by the wavelet decomposition, xWH is the fused high-frequency subband, and the function Maxf(A, B) is defined to take the larger value at each corresponding position of matrices A and B;
and 9.3, taking the obtained low-frequency sub-band and the fused high-frequency sub-band as the input of the next full-connection layer.
The wavelet pooling layer of the invention avoids the defect of a plain convolutional neural network, where the simple down-sampling of the pooling layer loses information. It replaces the pooling result with the low-frequency subband of the wavelet transform, which loses less information, and feeds the high-frequency subband containing the detail information into the fully connected layer as well, so the fully connected feature vector is expanded over multiple channels and its distinguishability is enhanced.
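The same decomposition and fusion can be sketched with the PyWavelets package (an assumption; the patent does not name a library):

```python
import numpy as np
import pywt

fm = np.random.rand(16, 16).astype(np.float32)   # one 16x16 conv feature map
ll, (lh, hl, hh) = pywt.dwt2(fm, 'haar')         # one-level decomposition: 8x8 each
x_wh = np.maximum.reduce([np.zeros_like(hh), hh, hl, lh])  # xWH = Maxf(0, xHH, xHL, xLH)
print(ll.shape, x_wh.shape)                      # (8, 8) (8, 8)
```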
Example 4
The auxiliary-task-based deep convolution wavelet neural network expression recognition method is the same as in embodiments 1-3; the fully connected feature vector of step (10) is formed according to the following steps:
10.1 The low-frequency subband matrix is obtained according to the following formula:
xL = Maxf(0, W1·xLL1 + W2·xLL2 + W3·xLL3 + … + Wn·xLLn)
where xL is the global low-frequency subband matrix, xLLn is the one-level wavelet-decomposed low-frequency subband of each feature map, and Wn is the superposition weight of each feature map's low-frequency subband. The superposition weights Wn can be determined from empirical values, or another learning scheme can be designed.
10.2 The high-frequency subband matrix is obtained according to the following formula:
xH = Maxf(0, xWH1, xWH2, … xWHn)
where xH is the global high-frequency subband matrix and xWHn is the new high-frequency subband formed by fusing the three high-frequency subbands of each feature map's one-level wavelet decomposition;
10.3 The global low-frequency subband xL and the global high-frequency subband xH are each stretched row-by-row into a 1 × v vector and connected end to end to obtain the fully connected feature vector of size 1 × 2v, where v is the product of the length and width of the subband matrix. In this example the fully connected feature vector is 1 × 128-dimensional: the low-frequency and high-frequency subbands each stretch into a 1 × 64-dimensional vector and are spliced end to end.
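A sketch of steps 10.1-10.3, assuming numpy and given superposition weights w:

```python
import numpy as np

def fc_vector(low, high, w):
    """Fully connected feature vector of Example 4 (sketch).
    low, high: (n, 8, 8) stacks of per-map low-frequency and fused
    high-frequency subbands; w: (n,) superposition weights (assumed given)."""
    x_l = np.maximum(0, np.einsum('q,qij->ij', w, low))   # xL = Maxf(0, sum Wn*xLLn)
    x_h = np.maximum(0, high.max(axis=0))                 # xH = Maxf(0, xWH1..xWHn)
    return np.concatenate([x_l.ravel(), x_h.ravel()])     # 1 x 128 (2 * 64)
```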
Example 5
The auxiliary-task-based deep convolution wavelet neural network expression recognition method is the same as in embodiments 1-4. Step (15) introduces the auxiliary-task weighting proportion λ; referring to fig. 3(b), the auxiliary-task correction is added to the trained deep convolution wavelet neural network, and λ is learned in the network with the sensitive-region image set, according to the following steps:
15.1 λ is initialized to 0, and M facial expression images and the corresponding sensitive-region images are randomly selected as learning samples for the weight λ;
15.2 the learning samples are fed into the trained deep convolution wavelet neural network, and classification labels are obtained according to the following formula:
z3 = z1 + λ·z2
where z1 is the output label of the network for the facial expression image, z2 is the output label for the corresponding sensitive-region image, and z3 is the network's global label;
15.3 according to the magnitude of the error between the global label z3 and the true label, the value of λ is updated as follows:
λ = λ + Δλ
where Δλ = 0.05; the λ value giving the minimum label error of each learning sample is recorded. In this example the error between the global label and the true label is determined from the numerical differences over the expression-class dimensions.
15.4 The expected value of the λ values giving the minimum label error over the M learning samples is taken as the global auxiliary-task weighting proportion λ. In this example the expected value can be obtained directly by averaging, as in the sketch below.
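A sketch of this λ search, assuming per-sample score vectors are available; the upper bound of the search range is an assumption, since the patent fixes only the increment Δλ = 0.05:

```python
import numpy as np

def learn_lambda(z1, z2, y, step=0.05, lam_max=2.0):
    """Grid-search sketch for steps 15.1-15.4; z1, z2, y are (M, 7) label
    (score) matrices. lam_max is an assumed search bound."""
    lams = np.arange(0.0, lam_max + step, step)
    best = []
    for z1_i, z2_i, y_i in zip(z1, z2, y):
        errs = [np.abs(z1_i + lam * z2_i - y_i).sum() for lam in lams]
        best.append(lams[int(np.argmin(errs))])   # per-sample minimizer
    return float(np.mean(best))                   # expected value over M samples
```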
A more detailed example is given below to further illustrate the invention.
Example 6
The auxiliary task-based expression recognition method for the deep convolution wavelet neural network is the same as the embodiment 1-5, and the specific steps are as follows with reference to the attached figure 3:
step 1: establishment of facial expression image set
200 labeled original images are randomly selected from the JAFFE expression database of 213 images; the selected originals are 256 × 256, as shown in fig. 1. The 200 originals are first expanded to 400 images (2×) by left-right flipping, then to 4000 images (10×) by rotating each image left and right by 1°, 2°, 3°, 4° and 5°. Finally a 128 × 128 rectangular frame, slid up and down by 5 pixel positions about the image center, crops the expression images; the haar-feature + Adaboost method then identifies the face region and rescales it to the 96 × 96 experimental face image. This yields a facial expression image set of 40000 samples and a corresponding label file: the label of a single image is a 1 × 7-dimensional binary vector in which the dimension with value 1 indicates the expression class the image belongs to, and the other dimensions are 0. A sketch of this expansion follows.
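A sketch of the expansion with PIL (the exact crop offsets are assumptions; only the 128 × 128 box and the 5-pixel vertical slide are stated):

```python
from PIL import Image, ImageOps

def expand(img: Image.Image):
    """Augmentation sketch for step 1 (200 -> 40000 images overall)."""
    flipped = [img, ImageOps.mirror(img)]                          # x2 by flipping
    rotated = [im.rotate(a) for im in flipped
               for a in (-5, -4, -3, -2, -1, 1, 2, 3, 4, 5)]       # x10 by rotation
    crops = []
    cx, cy = img.size[0] // 2, img.size[1] // 2                    # 256x256 originals
    for im in rotated:
        for dy in range(-5, 5):                                    # vertical slide, x10
            crops.append(im.crop((cx - 64, cy - 64 + dy,
                                  cx + 64, cy + 64 + dy)))         # 128x128 block
    return crops   # each crop is then face-detected (haar + Adaboost)
                   # and rescaled to the 96x96 network input
```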
Step 2: establishment of expression sensitive area image set
The expression-sensitive image in the invention covers the regions of the face sensitive to expression, namely the eyebrow/eye region and the mouth region, as shown in fig. 4. The face region images obtained in step 1 are cropped: a 48 × 48 cropping frame yields the two eyebrow/eye image blocks and a 48 × 96 cropping frame yields the mouth image block; the three blocks are spliced into an expression-sensitive-region image (see the sketch below). This finally gives an expression-sensitive-region image set of the same 40000 samples, whose label file is shared with the images of step 1.
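A sketch of the cropping and splicing, assuming PIL and given crop boxes; the 96 × 96 composite layout is an assumption, the patent fixes only the block sizes:

```python
from PIL import Image

def sensitive_region(face, left_eye_box, right_eye_box, mouth_box):
    """Step 2 sketch: two 48x48 eyebrow/eye blocks plus one 48x96 mouth
    block spliced into one image (box coordinates are assumed)."""
    left  = face.crop(left_eye_box)     # 48 x 48
    right = face.crop(right_eye_box)    # 48 x 48
    mouth = face.crop(mouth_box)        # 96 wide x 48 high
    canvas = Image.new(face.mode, (96, 96))
    canvas.paste(left,  (0, 0))         # eye blocks side by side on top
    canvas.paste(right, (48, 0))
    canvas.paste(mouth, (0, 48))        # mouth block underneath
    return canvas
```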
Step 3: Network training
(1) Building a depth network consisting of three convolution layers, two pooling layers, a multi-scale transformation layer, a full connection layer and a softmax output layer;
(2) inputting an expression image into a depth network, wherein the size of the input image is 96 × 96;
(3) the first layer of the network is a convolution layer, the convolution layer performs convolution operation on each expression original image, the number of selected convolution kernels is 6, and the size of each convolution kernel is 7 x 7:
(3a) adopting a random initialization method to configure the weight of a convolution kernel as a near zero number between [ -0.5,0.5 ];
(3b) performing convolution operation on the human face expression image by each convolution kernel to obtain 6 feature graphs after convolution, wherein the feature graph size of each convolution kernel is 90 x 90;
(3c) initially setting a bias weight matrix of the convolutional layer as a 0 matrix;
(3d) the activation function of the network is a Sigmoid function;
(4) the second layer of the network is a pooling layer, which takes the 6 feature maps obtained from the previous convolutional layer as input and performs pooling operations:
the pooling layer adopts a pooling method that the maximum value is selected in non-overlapping 2 x 2 areas to obtain 6 characteristic maps of the pooling layer, and the size of the characteristic maps is 45 x 45;
(5) the third layer of the network is a convolutional layer, 6 feature maps obtained by the previous pooling layer are used as input, convolution operation is carried out, the number of convolution kernels selected by the convolutional layer is 12, and the sizes of the convolution kernels are 6 x 6:
(5a) adopting a random initialization method to configure the weight of a convolution kernel as a near-zero number between [-0.5, 0.5];
(5b) each convolution kernel convolves the 6 feature maps; the 6 convolution results, plus the bias, are passed through the activation function and averaged to obtain that kernel's feature map, of size 40 × 40;
(5c) initially setting a bias weight matrix of the convolutional layer as a 0 matrix;
(5d) the activation function of the network is a Sigmoid function.
(6) The fourth layer of the network is a pooling layer which takes the 12 feature maps obtained from the previous convolutional layer as input and performs pooling operations:
the pooling layer was created by selecting the maximum in non-overlapping 2 x 2 regions to yield 12 signatures of the pooling layer with a size of 20 x 20.
(7) The fifth layer of the network is a convolutional layer; the 12 feature maps from the previous pooling layer are taken as input for the convolution operation; the number of convolution kernels is 12 and the kernel size is 5 × 5:
(7a) adopting a random initialization method to configure the weight of a convolution kernel as a near zero number between [ -0.5,0.5 ];
(7b) each convolution kernel convolves the 12 feature maps; the 12 convolution results, plus the bias, are passed through the activation function and averaged to obtain that kernel's feature map, of size 16 × 16;
(7c) initially setting a bias weight matrix of the convolutional layer as a 0 matrix;
(7d) the activation function of the network is a Sigmoid function.
(8) The sixth layer of the network is a wavelet pooling layer, which takes 12 characteristic maps obtained from the previous convolutional layer as input and performs one-layer wavelet decomposition:
the adopted wavelet basis function is a 'haar' function, for each feature map, an 8 x 8 low-frequency sub-band and three 8 x 8 high-frequency sub-bands are obtained, the corresponding positions of the three high-frequency sub-bands are maximized, and the three high-frequency sub-bands are fused into a new high-frequency sub-band.
(9) The seventh layer of the network is a fully connected layer; the twelve 8 × 8 low-frequency subbands and twelve 8 × 8 high-frequency subbands from the previous wavelet transform layer are taken as input to form a 128-dimensional fully connected feature vector. In the invention the fully connected layer first takes the maximum at corresponding positions over the twelve 8 × 8 low-frequency subbands and stretches the result row-by-row into a 1 × 64-dimensional vector; the high-frequency subbands give another 1 × 64-dimensional vector by the same operation; the two vectors are connected end to end, low-frequency vector first, giving a 1 × 128-dimensional global vector.
(10) And (4) repeating the steps (2) to (9) by taking the randomly selected 50 expression images as a unit to obtain respective 128-dimensional feature vectors of the 50 images.
(11) The eighth layer of the network is a Softmax output layer, the obtained 50 128-dimensional feature vectors are used as input, a Softmax classifier with 7-class probability distribution output is trained, and classification labels are obtained;
(12) and performing error calculation on the classification label and the real label of the Softmax output layer, and updating the value of the convolution kernel and the value of the bias weight vector of each layer according to a BP back propagation algorithm. The weight updating learning step length of the deep convolution wavelet neural network is set to be 0.05.
(13) And (5) repeating the training steps (2) to (12) until the weight matrix is updated 200 times. The setting of the updating times of the weight in the network training can be determined according to the convergence speed of the network.
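Reusing the DCWNN sketch given after the step list in the Disclosure, the training loop of steps (10)-(13) can be sketched as follows (train_x and train_y are assumed to be already-loaded tensors):

```python
import torch
import torch.nn as nn

net = DCWNN(q1=6, q2=12, q3=12)                    # Example 6 kernel counts
opt = torch.optim.SGD(net.parameters(), lr=0.05)   # learning step length 0.05 (step 12)
loss_fn = nn.CrossEntropyLoss()                    # softmax output + label error (step 11)

# train_x: (N, 1, 96, 96) image tensor, train_y: (N,) class indices
for update in range(200):                          # 200 weight updates (step 13)
    idx = torch.randint(0, train_x.shape[0], (50,))     # random batch of 50 (step 10)
    loss = loss_fn(net(train_x[idx]), train_y[idx])
    opt.zero_grad(); loss.backward(); opt.step()   # BP update of kernels and biases
```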
Step 4: Learning the auxiliary task
The facial expression test data set is fed into the trained network to obtain a classification label z1, the expression-sensitive regions corresponding to the test data set are fed into the trained network to obtain a classification label z2, and the final classification label is obtained from the two as z3 = z1 + 0.65·z2; z3 is computed for the whole test data set.
Step 5: Statistics of the recognition results
The recognition accuracy is calculated from the z3 of step 4, as sketched below.
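A sketch of steps 4 and 5, continuing the PyTorch sketches above (test_x, sens_x and test_y are assumed loaded; softmax makes z1 and z2 probability-style labels):

```python
z1 = torch.softmax(net(test_x), dim=1)          # main-task labels (test images)
z2 = torch.softmax(net(sens_x), dim=1)          # auxiliary labels (sensitive regions)
z3 = z1 + 0.65 * z2                             # step 4: fused global label
acc = (z3.argmax(dim=1) == test_y).float().mean()   # step 5: recognition accuracy
print(f'recognition accuracy: {acc.item():.4f}')
```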
The invention avoids the defects of the general convolutional neural network, in which features learned by an upper convolutional layer can be lost through the simple down-sampling of the pooling layer, and the output of the fully connected layer contains only abstract information and lacks many shallow local features. By combining the multi-scale wavelet transform with the deep convolutional neural network architecture, the network both ensures that features learned by the convolutional layers are transmitted intact through the pooling stage and expands, in the fully connected layer, the local expression features obtained in shallow-layer learning, so the whole network structure describes the expression features better and the recognition result improves markedly.
The technical effects of the invention are verified and explained by the simulation results as follows:
example 7
The auxiliary-task-based deep convolution wavelet neural network expression recognition method is the same as in embodiments 1-6; its effect is further analysed with the recognition results in Table 1.
Simulation experiment conditions
The hardware test platform of the invention is: an Intel Core i3 CPU at 3.20 GHz with 4 GB of memory; the software platform is the Windows 7 Ultimate 64-bit operating system and Matlab R2013b. The input images of the network are all 96 × 96 in TIFF format.
Emulated content
The simulation content of the invention comprises: simulation experiments and recognition result statistics of the existing facial expression recognition technology; under the condition of no additional wavelet pooling layer and no auxiliary task learning, a six-layer deep convolutional neural network is simply used for carrying out simulation experiments of facial expression recognition and recognition result statistics; the auxiliary task-based deep convolution wavelet neural network expression recognition method is completely used for experimental simulation and recognition result statistics; and comparing and analyzing the simulation results of each experiment.
Analysis of simulation results
Table 1 compares the recognition effect of the method of the present invention with existing facial expression recognition techniques. Referring to the data in Table 1: Shan C and Jabid T divide the image into several sub-regions and multiply each sub-region by a weight according to its contribution to the expression, the weight representing the expression characterization ability of the region. Taskeed et al. initialize the weights with a χ² distribution and, using a new local face descriptor, the Local Directional Pattern (LDP), in an LDP + SVM architecture, obtain an average recognition rate of 85.4%. Shishir et al. combine Gabor features with Learning Vector Quantization (LVQ), performing Gabor filtering at 34 fiducial points of the image, and obtain a recognition rate of 87.51%. Nectarios et al. propose an algorithm that obtains feature vectors by convolution with Gabor combined with Log-Gabor filters, achieving recognition rates of 86.1% and 85.72%. Lu Yadan, Feng Shiyong et al. use an FP + deep self-coding network algorithm and obtain a recognition rate of 90.47%. In addition, when the invention uses only a six-layer deep convolutional neural network to learn the expression features and train a softmax classifier, without the added wavelet pooling layer and auxiliary task learning, it obtains a recognition rate of 90.56%.
When the auxiliary task-based deep convolution wavelet neural network expression recognition method is used integrally, the obtained recognition accuracy is 92.91%.
TABLE 1 Comparison of the recognition effects of the present invention and existing facial expression recognition methods

Method | Recognition rate
LDP + SVM (Taskeed et al.) | 85.4%
Gabor + LVQ (Shishir et al.) | 87.51%
Gabor / Log-Gabor (Nectarios et al.) | 86.1% / 85.72%
FP + deep self-coding network (Lu Yadan et al.) | 90.47%
Six-layer DCNN only (no wavelet pooling, no auxiliary task) | 90.56%
The present invention (full method) | 92.91%
As can be seen from table 1, the method of the present invention can give good consideration to local and global information of the expression features of the facial expression image, and enhances the influence of the expression sensitive area on facial expression recognition through the auxiliary task, thereby improving the recognition rate of the facial expression.
In short, the auxiliary-task-based deep convolution wavelet neural network expression recognition method of the invention solves the problems that, in existing expression recognition techniques, feature selection operators cannot learn expression features efficiently and too few image expression classification features are extracted. The implementation steps are: build a deep convolution wavelet neural network; establish a facial expression image set and an expression-sensitive-region image set; input the facial expression images to the network; train the deep convolution wavelet neural network; back-propagate the network error; update the network parameter set, i.e. each convolution kernel and bias vector; input the expression-sensitive-region images to the trained network; learn the weighting proportion of the auxiliary task; obtain the network's global classification labels from that proportion; and compute the recognition accuracy from the global labels. The method captures both the abstract and the detail information of expression images, strengthens the influence of the expression-sensitive regions in expression feature learning, markedly improves the accuracy of expression recognition, and can be applied to expression recognition of facial expression images.

Claims (5)

1. An auxiliary-task-based deep convolution wavelet neural network expression recognition method, characterized by comprising the following steps:
(1) building a deep convolution wavelet network consisting of three convolutional layers, two pooling layers, a multi-scale transform layer, a fully connected layer and a softmax output layer; the bias weight matrices of the convolutional layers are initialized to 0 matrices, and the Sigmoid function is selected as the activation function of the network;
(2) establishing a facial expression image set and an expression sensitive area image set, wherein the expression sensitive area image set is obtained by cutting eyebrow parts and mouth parts of the facial expression image set, a part of images in the facial expression image data set are used as a training image set of a network, and the rest of images are used as a testing image set;
(3) inputting a training image into a deep convolution wavelet network, wherein the size of the input image is 96 × 96;
(4) the first layer of the deep convolution wavelet network is a convolution layer which performs convolution operation on each input facial expression training image and selectsSelecting the number of convolution kernels as Q1Convolution kernel size 7 × 7:
(4a) adopting a random initialization method to configure the weight of a convolution kernel as a near zero number between [ -0.5,0.5 ];
(4b) each convolution kernel performs convolution operation on the human face expression image to obtain Q1The feature map size of each convolution kernel is 90 x 90;
(5) the second layer of the network is a pooling layer of Q's obtained from the previous layer of convolutional layers1Taking the characteristic graph as input, and performing pooling operation:
the pooling layer is prepared by selecting maximum value in non-overlapping 2 x 2 region to obtain Q of the pooling layer1The size of the pooled feature map is 45 × 45;
(6) the third layer of the network is a convolution layer, and Q obtained by the previous layer of the pooling layer1Taking the characteristic graph as input, performing convolution operation, wherein the number of selected convolution kernels of the convolution layer is Q2Convolution kernel size 6 × 6:
(6a) adopting a random initialization method to configure the weight of a convolution kernel as a near zero number between [ -0.5,0.5 ];
(6b) each convolution kernel is on the Q1Performing convolution operation on the feature map, and then performing convolution on the Q1The convolution result of the characteristic graphs and the bias matrix are subjected to average evaluation after the activation function filtering to obtain the characteristic graphs of the convolution kernels, and the characteristic graph size of each convolution kernel is 40 x 40;
(7) the fourth layer of the network is a pooling layer that pools Q from the previous convolutional layer2Taking the characteristic graph as input, and performing pooling operation:
the pooling layer is prepared by selecting maximum value in non-overlapping 2 x 2 region to obtain Q of the pooling layer2The size of the pooled feature map is 20 x 20;
(8) the fifth layer of the network is a convolution layer, which takes the Q2 feature maps produced by the preceding pooling layer as input and performs the convolution operation; the number of convolution kernels of this layer is Q3 and the kernel size is 5 × 5:
(8a) initializing the convolution kernel weights to near-zero values in [-0.5, 0.5] by a random initialization method;
(8b) each convolution kernel convolves the Q2 feature maps; the convolution results over the Q2 feature maps, plus the bias matrix, are filtered by the activation function and then averaged to obtain the feature map of that convolution kernel, each of size 16 × 16;
(9) the sixth layer of the network is a wavelet pooling layer, which takes the Q3 feature maps produced by the preceding convolution layer as input and performs a one-level wavelet decomposition:
the adopted wavelet basis function is the 'haar' function; each feature map yields one 8 × 8 low-frequency subband and three 8 × 8 high-frequency subbands, and the three high-frequency subbands are fused into a new high-frequency subband by taking the maximum at each corresponding position;
(10) the seventh layer of the network is a fully connected layer, which takes the Q3 8 × 8 low-frequency subbands and the Q3 8 × 8 high-frequency subbands produced by the sixth-layer wavelet pooling as input to form a 128-dimensional fully connected feature vector;
(11) repeating steps (3) to (10) with n randomly selected facial expression images as a unit, obtaining the 128-dimensional feature vector of each of the n images;
(12) the eighth layer of the network is a Softmax output layer, which takes the n 128-dimensional feature vectors as input, trains a probability-distribution Softmax classifier with 7 output classes, and obtains the classification labels;
(13) computing the error between the classification labels of the Softmax output layer and the true labels, and updating the weight matrices once according to the BP back-propagation algorithm;
(14) repeating training steps (3) to (13) until the weight matrices have been updated m times, obtaining the trained deep convolution wavelet neural network;
(15) feeding the facial expression test image set into the trained deep convolution wavelet neural network to obtain the classification label z1 at the output layer, and feeding the expression sensitive area image set corresponding to the test set into the trained network to obtain the classification label z2 at the output layer; the final classification label is obtained from the two labels as z3 = z1 + λ × z2, where λ denotes the weighting proportion of the auxiliary task;
(16) outputting the facial expression recognition accuracy according to the classification label z3 of the test set, completing the auxiliary task-based deep convolution wavelet neural network expression recognition.
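Claim 1 above fixes the layer-by-layer feature map sizes (96 → 90 → 45 → 40 → 20 → 16 → 8). As an informal sanity check only, the following minimal NumPy sketch reproduces that shape flow with a single kernel per layer (the claim leaves Q1, Q2 and Q3 free); the helper names and random weights are illustrative assumptions, not the patented implementation.

    import numpy as np

    def conv2d_valid(img, kernel):
        # 'Valid' 2-D convolution: a k x k kernel shrinks an N x N map to N - k + 1.
        k = kernel.shape[0]
        n = img.shape[0] - k + 1
        out = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                out[i, j] = np.sum(img[i:i + k, j:j + k] * kernel)
        return out

    def maxpool2x2(fmap):
        # Non-overlapping 2 x 2 max pooling, as in steps (5) and (7).
        h, w = fmap.shape
        return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    def sigmoid(x):
        # Activation function selected in step (1).
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    x = rng.random((96, 96))                                        # step (3): 96 x 96 input
    f1 = sigmoid(conv2d_valid(x, rng.uniform(-0.5, 0.5, (7, 7))))   # step (4): 90 x 90
    p1 = maxpool2x2(f1)                                             # step (5): 45 x 45
    f2 = sigmoid(conv2d_valid(p1, rng.uniform(-0.5, 0.5, (6, 6))))  # step (6): 40 x 40
    p2 = maxpool2x2(f2)                                             # step (7): 20 x 20
    f3 = sigmoid(conv2d_valid(p2, rng.uniform(-0.5, 0.5, (5, 5))))  # step (8): 16 x 16
    print(f1.shape, p1.shape, f2.shape, p2.shape, f3.shape)         # shape sanity check

Step (9) then decomposes each 16 × 16 map into four 8 × 8 subbands (see the sketch after claim 3), and with v = 64 (an 8 × 8 subband flattened row-wise) the concatenation of claim 4 gives the 1 × 2v = 128-dimensional vector of step (10).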
2. The auxiliary task-based deep convolution wavelet neural network expression recognition method according to claim 1, wherein the facial expression image set and the expression sensitive area image set of step (2) are established according to the following steps:
2.1 the facial expression image set is obtained as follows:
randomly selecting an appropriate number of labeled original images from an expression image library; augmenting them by flipping, rotating and sliding-window selection of image blocks; detecting the face region of the augmented images by combining haar features with the Adaboost algorithm; and scaling to facial expression images of size 96 × 96, finally obtaining a facial expression image set on the order of ten thousand samples;
2.2 the expression sensitive area image set is obtained as follows:
the expression sensitive areas are the regions of the face that are sensitive to expression, namely the eyebrow-and-eye region and the mouth region; the obtained facial expression images are cropped with cutting frames into two (left and right) eyebrow-and-eye image blocks and one mouth image block, and the three image blocks are spliced into an expression sensitive area image, finally obtaining an expression sensitive area image set of the same ten-thousand-order sample size;
2.3 the label file of the facial expression image set is produced from the original labels of the expression image library; the label of a single image is a 1 × k-dimensional binary vector, where k indicates that the image expressions are divided into k classes; the dimension whose value is 1 indicates the expression class to which the image belongs, and all other dimensions are 0; the expression image data set and the sensitive area image data set can share the same label file.
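A hedged sketch of claim 2's data preparation follows. The crop coordinates are illustrative assumptions (the claim only specifies two eyebrow-and-eye blocks and one mouth block to be cropped and spliced), and one_hot_label follows step 2.3.

    import numpy as np

    def sensitive_region_image(face):
        # face: a 96 x 96 facial expression image (step 2.1).
        # Assumed cutting frames; the patent does not publish exact coordinates.
        left_eye = face[20:52, 4:48]              # 32 x 44 left eyebrow-and-eye block
        right_eye = face[20:52, 48:92]            # 32 x 44 right eyebrow-and-eye block
        mouth = face[60:92, 4:92]                 # 32 x 88 mouth block
        eyes = np.hstack([left_eye, right_eye])   # 32 x 88 spliced eye strip
        return np.vstack([eyes, mouth])           # 64 x 88 sensitive-area image

    def one_hot_label(class_index, k=7):
        # Step 2.3: a 1 x k binary vector with a 1 in the class dimension.
        label = np.zeros(k, dtype=np.int8)
        label[class_index] = 1
        return label

Both image sets can share the same label file because the sensitive-area image of a face inherits that face's expression class, exactly as step 2.3 states.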
3. The auxiliary task-based deep convolution wavelet neural network expression recognition method according to claim 1, wherein the wavelet pooling layer of step (9) obtains the low-frequency subband and the high-frequency subband according to the following steps:
9.1 performing a one-level downsampling wavelet decomposition on the feature maps produced by the preceding convolution layer, with the Haar function as the selected wavelet basis; each feature map yields one low-frequency subband and three high-frequency subbands;
9.2 fusing the three high-frequency subbands into a new high-frequency subband according to the following formula:
xWH = Maxf(0, xHH, xHL, xLH)
where xHH, xHL and xLH denote the three high-frequency subbands obtained by the one-level wavelet decomposition, xWH denotes the fused high-frequency subband, and the function Maxf(A, B) is defined as taking the larger value at each corresponding position of matrices A and B;
9.3 taking the obtained low-frequency subband and the fused high-frequency subband as the input of the subsequent fully connected layer.
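A minimal sketch of the wavelet pooling of claim 3, using plain NumPy with the orthonormal haar filters (coefficients ±1/2) instead of a wavelet library; haar_dwt2 and fuse_high are illustrative helper names, not the patented code.

    import numpy as np

    def haar_dwt2(fmap):
        # One-level 2-D haar decomposition of an even-sized map (step 9.1).
        a = fmap[0::2, 0::2]   # top-left entry of each 2 x 2 block
        b = fmap[0::2, 1::2]   # top-right
        c = fmap[1::2, 0::2]   # bottom-left
        d = fmap[1::2, 1::2]   # bottom-right
        ll = (a + b + c + d) / 2.0   # low-frequency subband
        hl = (a - b + c - d) / 2.0   # horizontal high-frequency subband
        lh = (a + b - c - d) / 2.0   # vertical high-frequency subband
        hh = (a - b - c + d) / 2.0   # diagonal high-frequency subband
        return ll, (hl, lh, hh)

    def fuse_high(hl, lh, hh):
        # Step 9.2: xWH = Maxf(0, xHH, xHL, xLH), an element-wise maximum.
        return np.maximum(0, np.maximum(hh, np.maximum(hl, lh)))

    fmap = np.random.rand(16, 16)        # one 16 x 16 feature map from step (8)
    ll, (hl, lh, hh) = haar_dwt2(fmap)   # four 8 x 8 subbands
    xwh = fuse_high(hl, lh, hh)          # fused 8 x 8 high-frequency subband (step 9.3)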
4. The auxiliary task-based deep convolution wavelet neural network expression recognition method according to claim 1, wherein the fully connected layer feature vector of step (10) is obtained according to the following steps:
10.1 solving the low-frequency subband matrix according to the following formula:
xL = Maxf(0, W1·xLL1 + W2·xLL2 + W3·xLL3 + … + Wn·xLLn)
where xL denotes the global low-frequency subband matrix, xLLn denotes the low-frequency subband of the one-level wavelet decomposition of each feature map, and Wn denotes the superposition weight of each feature map's low-frequency subband;
10.2 solving the high-frequency subband matrix according to the following formula:
xH = Maxf(0, xWH1, xWH2, …, xWHn)
where xH denotes the global high-frequency subband matrix, and xWHn denotes the new high-frequency subband formed by fusing the three high-frequency subbands of the one-level wavelet decomposition of each feature map;
10.3 stretching the global low-frequency subband xL and the global high-frequency subband xH row-wise into 1 × v vectors and concatenating them end to end to obtain the fully connected layer feature vector of size 1 × 2v.
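The two formulas of claim 4 reduce to a weighted sum with position-wise clipping for the low-frequency part and an element-wise maximum for the high-frequency part. A hedged NumPy sketch (the equal weights Wn = 1/n are an assumption; the claim leaves the superposition weights unspecified):

    import numpy as np

    def fc_feature_vector(lows, highs, weights=None):
        # lows / highs: lists of n 8 x 8 subband matrices, one pair per feature map.
        n = len(lows)
        if weights is None:
            weights = [1.0 / n] * n   # assumed superposition weights Wn
        # 10.1: xL = Maxf(0, W1*xLL1 + ... + Wn*xLLn)
        x_l = np.maximum(0, sum(w * x for w, x in zip(weights, lows)))
        # 10.2: xH = Maxf(0, xWH1, ..., xWHn), element-wise over all feature maps
        x_h = np.maximum(0, np.maximum.reduce(highs))
        # 10.3: stretch each row-wise to 1 x v and concatenate to 1 x 2v
        return np.concatenate([x_l.ravel(), x_h.ravel()])   # v = 64 gives 128 dims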
5. The auxiliary task-based deep convolution wavelet neural network expression recognition method according to claim 1, wherein the auxiliary task weighting proportion λ of step (15) is learned according to the following steps:
15.1 initializing λ to 0, and randomly selecting M facial expression images and their corresponding sensitive area images as the learning samples for the weight λ;
15.2 feeding the learning samples into the trained deep convolution wavelet neural network and obtaining the classification labels according to the following formula:
z3 = z1 + λ × z2
where z1 denotes the output label of the network for the facial expression image, z2 denotes the output label for the corresponding sensitive area image, and z3 denotes the network global label;
15.3 updating the value of λ according to the magnitude of the error between the global label z3 and the true label, and recording the λ value corresponding to the minimum label error of each learning sample;
15.4 computing the expected value of the λ values corresponding to the minimum label errors of the M learning samples; this expected value is used as the weighting proportion λ of the subsequent auxiliary task.
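A minimal sketch of claim 5's λ learning, assuming a grid search over candidate values and a 0/1 label-error measure (the claim only requires the per-sample λ with minimum label error, averaged over the M samples; the grid and the error measure are assumptions, since the update formula is not reproduced in this text).

    import numpy as np

    def learn_lambda(z1_probs, z2_probs, true_labels, grid=np.linspace(0.0, 2.0, 201)):
        # z1_probs, z2_probs: (M, 7) Softmax outputs for the face images and the
        # corresponding sensitive-area images; true_labels: (M,) class indices.
        best = []
        for z1, z2, y in zip(z1_probs, z2_probs, true_labels):
            # 0/1 error of the global label z3 = z1 + lambda * z2 for each candidate
            errors = [0.0 if np.argmax(z1 + lam * z2) == y else 1.0 for lam in grid]
            best.append(grid[int(np.argmin(errors))])  # per-sample minimizing lambda
        return float(np.mean(best))                    # expected value over M samples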
CN201710446076.0A 2017-06-14 2017-06-14 Auxiliary task-based deep convolution wavelet neural network expression recognition method Active CN107292256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710446076.0A CN107292256B (en) 2017-06-14 2017-06-14 Auxiliary task-based deep convolution wavelet neural network expression recognition method

Publications (2)

Publication Number Publication Date
CN107292256A CN107292256A (en) 2017-10-24
CN107292256B true CN107292256B (en) 2019-12-24

Family

ID=60096459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710446076.0A Active CN107292256B (en) 2017-06-14 2017-06-14 Auxiliary task-based deep convolution wavelet neural network expression recognition method

Country Status (1)

Country Link
CN (1) CN107292256B (en)

Families Citing this family (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729872A (en) * 2017-11-02 2018-02-23 北方工业大学 Facial expression recognition method and device based on deep learning
JP6345332B1 (en) * 2017-11-21 2018-06-20 国立研究開発法人理化学研究所 Classification device, classification method, program, and information recording medium
CN107977677A (en) * 2017-11-27 2018-05-01 深圳市唯特视科技有限公司 A kind of multi-tag pixel classifications method in the reconstruction applied to extensive city
CN109840459A (en) * 2017-11-29 2019-06-04 深圳Tcl新技术有限公司 A kind of facial expression classification method, apparatus and storage medium
CN108122001B (en) * 2017-12-13 2022-03-11 北京小米移动软件有限公司 Image recognition method and device
CN108229341B (en) * 2017-12-15 2021-08-06 北京市商汤科技开发有限公司 Classification method and device, electronic equipment and computer storage medium
CN108090513A (en) * 2017-12-19 2018-05-29 天津科技大学 Multi-biological characteristic blending algorithm based on particle cluster algorithm and typical correlation fractal dimension
CN109949264A (en) * 2017-12-20 2019-06-28 深圳先进技术研究院 A kind of image quality evaluating method, equipment and storage equipment
CN108038466B (en) * 2017-12-26 2021-11-16 河海大学 Multi-channel human eye closure recognition method based on convolutional neural network
CN108171176B (en) * 2017-12-29 2020-04-24 中车工业研究院有限公司 Subway driver emotion identification method and device based on deep learning
CN108062416B (en) * 2018-01-04 2019-10-29 百度在线网络技术(北京)有限公司 Method and apparatus for generating label on map
CN108021910A (en) * 2018-01-04 2018-05-11 青岛农业大学 The analysis method of Pseudocarps based on spectrum recognition and deep learning
CN108304788B (en) * 2018-01-18 2022-06-14 陕西炬云信息科技有限公司 Face recognition method based on deep neural network
CN108363969B (en) * 2018-02-02 2022-08-26 南京邮电大学 Newborn pain assessment method based on mobile terminal
CN110298212B (en) * 2018-03-21 2023-04-07 腾讯科技(深圳)有限公司 Model training method, emotion recognition method, expression display method and related equipment
CN108520213B (en) * 2018-03-28 2021-10-19 五邑大学 Face beauty prediction method based on multi-scale depth
CN108805866B (en) * 2018-05-23 2022-03-25 兰州理工大学 Image fixation point detection method based on quaternion wavelet transform depth vision perception
CN109580629A (en) * 2018-08-24 2019-04-05 绍兴文理学院 Crankshaft thrust collar intelligent detecting method and system
CN109543526B (en) * 2018-10-19 2022-11-08 谢飞 True and false facial paralysis recognition system based on depth difference characteristics
CN109657554B (en) * 2018-11-21 2022-12-20 腾讯科技(深圳)有限公司 Image identification method and device based on micro expression and related equipment
CN111222624B (en) * 2018-11-26 2022-04-29 深圳云天励飞技术股份有限公司 Parallel computing method and device
CN109635709B (en) * 2018-12-06 2022-09-23 中山大学 Facial expression recognition method based on significant expression change area assisted learning
CN109615574B (en) * 2018-12-13 2022-09-23 济南大学 Traditional Chinese medicine identification method and system based on GPU and dual-scale image feature comparison
CN109919171A (en) * 2018-12-21 2019-06-21 广东电网有限责任公司 A kind of Infrared image recognition based on wavelet neural network
CN111488764B (en) * 2019-01-26 2024-04-30 天津大学青岛海洋技术研究院 Face recognition method for ToF image sensor
CN109815924B (en) * 2019-01-29 2021-05-04 成都旷视金智科技有限公司 Expression recognition method, device and system
CN109934173B (en) * 2019-03-14 2023-11-21 腾讯科技(深圳)有限公司 Expression recognition method and device and electronic equipment
CN110333088B (en) * 2019-04-19 2020-09-29 北京化工大学 Caking detection method, system, device and medium
CN110119702B (en) * 2019-04-30 2022-12-06 西安理工大学 Facial expression recognition method based on deep learning prior
CN110174948B (en) * 2019-05-27 2020-10-27 湖南师范大学 Intelligent language auxiliary learning system and method based on wavelet neural network
CN110210380B (en) * 2019-05-30 2023-07-25 盐城工学院 Analysis method for generating character based on expression recognition and psychological test
CN110414394B (en) * 2019-07-16 2022-12-13 公安部第一研究所 Facial occlusion face image reconstruction method and model for face occlusion detection
CN110399821B (en) * 2019-07-17 2023-05-30 上海师范大学 Customer satisfaction acquisition method based on facial expression recognition
CN110427892B (en) * 2019-08-06 2022-09-09 河海大学常州校区 CNN face expression feature point positioning method based on depth-layer autocorrelation fusion
CN111401116B (en) * 2019-08-13 2022-08-26 南京邮电大学 Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
CN110717423B (en) * 2019-09-26 2023-03-17 安徽建筑大学 Training method and device for emotion recognition model of facial expression of old people
CN110889332A (en) * 2019-10-30 2020-03-17 中国科学院自动化研究所南京人工智能芯片创新研究院 Lie detection method based on micro expression in interview
CN111191704B (en) * 2019-12-24 2023-05-02 天津师范大学 Foundation cloud classification method based on task graph convolutional network
CN111144348A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111178312B (en) * 2020-01-02 2023-03-24 西北工业大学 Face expression recognition method based on multi-task feature learning network
CN111291670B (en) * 2020-01-23 2023-04-07 天津大学 Small target facial expression recognition method based on attention mechanism and network integration
CN111401147B (en) * 2020-02-26 2024-06-04 中国平安人寿保险股份有限公司 Intelligent analysis method, device and storage medium based on video behavior data
CN111382795B (en) * 2020-03-09 2023-05-05 交叉信息核心技术研究院(西安)有限公司 Image classification processing method of neural network based on frequency domain wavelet base processing
CN111126364A (en) * 2020-03-30 2020-05-08 北京建筑大学 Expression recognition method based on packet convolutional neural network
CN111652171B (en) * 2020-06-09 2022-08-05 电子科技大学 Construction method of facial expression recognition model based on double branch network
CN112132058B (en) * 2020-09-25 2022-12-27 山东大学 Head posture estimation method, implementation system thereof and storage medium
CN112380995B (en) * 2020-11-16 2023-09-12 华南理工大学 Face recognition method and system based on deep feature learning in sparse representation domain
CN117036149A (en) * 2020-12-01 2023-11-10 华为技术有限公司 Image processing method and chip
CN112699938B (en) * 2020-12-30 2024-01-05 北京邮电大学 Classification method and device based on graph convolution network model
CN113095356B (en) * 2021-03-03 2023-10-31 北京邮电大学 Light-weight neural network system and image processing method and device
CN114445899A (en) * 2022-01-30 2022-05-06 中国农业银行股份有限公司 Expression recognition method, device, equipment and storage medium
CN114743251B (en) * 2022-05-23 2024-02-27 西北大学 Drama character facial expression recognition method based on shared integrated convolutional neural network
WO2024039332A1 (en) * 2022-08-15 2024-02-22 Aselsan Elektroni̇k Sanayi̇ Ve Ti̇caret Anoni̇m Şi̇rketi̇ Partial reconstruction method based on sub-band components of jpeg2000 compressed images

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101872424A (en) * 2010-07-01 2010-10-27 重庆大学 Facial expression recognizing method based on Gabor transform optimal channel blur fusion
CN105139395A (en) * 2015-08-19 2015-12-09 西安电子科技大学 SAR image segmentation method based on wavelet pooling convolutional neural networks
CN106056088A (en) * 2016-06-03 2016-10-26 西安电子科技大学 Single-sample face recognition method based on self-adaptive virtual sample generation criterion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Recognition of facial expressions using Gabor wavelets and learning vector quantization; Shishir Bashyal; Engineering Applications of Artificial Intelligence; 2008-10-31; Vol. 21, No. 7; pp. 1056-1063 *

Similar Documents

Publication Publication Date Title
CN107292256B (en) Auxiliary task-based deep convolution wavelet neural network expression recognition method
CN106529447B (en) Method for identifying face of thumbnail
CN109685819B (en) Three-dimensional medical image segmentation method based on feature enhancement
US10891511B1 (en) Human hairstyle generation method based on multi-feature retrieval and deformation
CN109522874B (en) Human body action recognition method and device, terminal equipment and storage medium
CN112288011B (en) Image matching method based on self-attention deep neural network
CN108062543A (en) A kind of face recognition method and device
CN107977661B (en) Region-of-interest detection method based on FCN and low-rank sparse decomposition
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN114092833B (en) Remote sensing image classification method and device, computer equipment and storage medium
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN111652273B (en) Deep learning-based RGB-D image classification method
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN110503613A (en) Based on the empty convolutional neural networks of cascade towards removing rain based on single image method
CN110826462A (en) Human body behavior identification method of non-local double-current convolutional neural network model
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN112132145A (en) Image classification method and system based on model extended convolutional neural network
CN114445715A (en) Crop disease identification method based on convolutional neural network
CN113554084A (en) Vehicle re-identification model compression method and system based on pruning and light-weight convolution
CN114782979A (en) Training method and device for pedestrian re-recognition model, storage medium and terminal
CN113221660B (en) Cross-age face recognition method based on feature fusion
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN114049491A (en) Fingerprint segmentation model training method, fingerprint segmentation device, fingerprint segmentation equipment and fingerprint segmentation medium
CN117576402A (en) Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant