CN113065460B - Establishment method of pig face facial expression recognition framework based on multitask cascade

Establishment method of pig face facial expression recognition framework based on multitask cascade

Info

Publication number: CN113065460B
Authority: CN (China)
Prior art keywords: attention, pig face, pig, network, representing
Legal status: Active (granted)
Application number: CN202110350752.0A
Other languages: Chinese (zh)
Other versions: CN113065460A
Inventors: 温长吉, 张笑然, 吴建双, 于合龙, 石磊, 郭宏亮, 毕春光, 李卓识, 苏恒强, 薛明轩, 杨之音
Current Assignee: Jilin Agricultural University
Original Assignee: Jilin Agricultural University
Application filed by Jilin Agricultural University; priority to CN202110350752.0A; published as CN113065460A; application granted and published as CN113065460B.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for establishing a pig face facial expression recognition framework based on multitask cascading, and belongs to the technical field of computer image recognition and artificial intelligence. It is the first to apply a cascade framework model to the classification and recognition of time-sequenced pig facial expression images. The network model consists of three cascaded stages. First, pig face facial expression video frames are sampled at equal intervals and input into a simplified multitask cascaded convolutional neural network for pig face detection and localization. Second, the extracted pig face sequence feature maps are input into a multi-attention mechanism module that captures the salient facial regions produced by expression changes, realizing attention to fine facial variation. Then the refined feature map extracted from the video frames and the multi-attention feature maps are merged through an array-merging operation and input into a long short-term memory (LSTM) network to classify and recognize the expressions. Expression recognition of livestock enables better emotion regulation, thereby improving feed digestibility and utilization, accelerating growth, and raising yield.

Description

Establishment method of pig face facial expression recognition framework based on multitask cascade
Technical Field
The invention relates to the technical field of computer image recognition and artificial intelligence, and in particular to a method for establishing a pig face facial expression recognition framework based on multitask cascading: an end-to-end model framework for recognizing livestock facial expressions in video.
Background
Animal emotion research is one of the important research targets of animal science and enables better evaluation of livestock welfare. Good emotional states of livestock such as pigs during feeding play an important role in maximizing feed digestibility and utilization, thereby accelerating growth and increasing yield, so animal emotion research based on facial expression recognition is of great significance.

Animal facial expression recognition faces several challenges. First, compared with human facial expression recognition, changes in animal facial expression are difficult to perceive and recognize, because they depend mainly on the zygomatic muscles on both sides of the cheeks, a muscle group that is structurally simple and moves over a small range. Second, most existing work on animal facial expression recognition is based on physiological anatomy, which is costly and inefficient. Finally, collecting physiological signs from animal faces is difficult, and no large-scale standardized dataset exists for supervised or semi-supervised learning; at present only a few machine-vision methods recognize animal facial expressions in static images, and no work classifies and recognizes temporally sequenced expressions in video frames. A facial expression in a static image is only a record of expression features at a single time point, yet facial expression is inherently spatio-temporal; the very few static-image methods therefore lose, during feature extraction and representation, a large amount of the spatio-temporal logical features produced by change along the time dimension, violating the inherent regularity of how facial expressions unfold. An end-to-end model framework for recognizing livestock facial expressions in video is therefore urgently needed.
Disclosure of Invention
The invention aims to provide a method for establishing a pig face facial expression recognition framework based on multitask cascading that solves the problems in the prior art. The method classifies and recognizes pig facial expressions in video images with a multi-attention-mechanism cascaded long short-term memory (LSTM) network model. First, a simplified multitask cascaded convolutional network rapidly detects and localizes the pig face in each video frame, removing the influence of non-pig-face regions on recognition performance. The detected and localized pig face sequence feature maps are then sent to a multi-attention convolution module that attends to the salient regions produced by the various expression changes, overcoming the difficulty that the simple structure and small-amplitude movement of the domestic pig's facial expression muscles make expressions hard to perceive and recognize. Finally, the extracted global feature map and the attention feature maps are fused into refined features by an array-merging operation and fed as a sequence into the LSTM network to realize expression recognition.
The above object of the present invention is achieved by the following technical solutions:
the establishment method of the pig face facial expression recognition framework based on multitask cascade comprises the following steps:
S1, inputting pig face facial expression video segments and labelling each input segment with one of the four domestic-pig expression classes: anger, joy, fear, and calm;

S2, first stage of the cascade framework model: frame images are sampled from the pig face facial expression video at equal intervals and input into a simplified multitask cascaded convolutional neural network for pig face detection and localization; the simplified network detects and localizes the pig face region rapidly in two steps, coarse-grained and fine-grained;

S3, second stage of the cascade framework model: the extracted pig face sequence frame images are input into a multi-attention mechanism module that extracts and constructs salient-region feature maps of pig facial expression change; first, a shallow residual network extracts a global convolution feature map of the pig face; second, a channel-grouping response attention mechanism captures and generates the salient-region feature maps of facial expression change; then the attention-region feature maps and the global convolution feature map are merged to generate a pig face feature map fused with the attention mechanism;

S4, third stage of the cascade framework model: the attention-fused pig face feature maps are input in sequence into a long short-term memory (LSTM) network, and pig facial expressions are recognized and classified through a fully connected layer and a softmax classifier.
The architecture of the simplified multitask cascaded convolutional neural network in step S2 is as follows:

S21, coarse-grained detection and localization: a fully convolutional network, i.e. a proposal network, obtains candidate pig face windows and their bounding-box regression vectors, and the candidate windows are corrected according to the estimated regression vectors; finally, non-maximum suppression merges highly overlapping candidate windows;

S22, fine-grained detection and localization: all pig-face-containing candidates obtained in step S21 are passed to a refinement network, which screens out erroneous candidate windows, calibrates with the bounding-box regression vectors, performs non-maximum suppression, and finally outputs the bounding-box coordinates containing the pig face, realizing pig face detection and localization;
S23, loss optimization function: the loss of the simplified multitask cascaded convolutional neural network is composed of a pig face classification loss and a Euclidean-distance regression loss on the face-region bounding box, and the network is learned by jointly optimizing them; the joint optimization loss is:

$$L_{cd}=\min\frac{1}{N}\sum_{i=1}^{N}\sum_{j\in\{det,\,box\}}\alpha_{j}\,\beta_{i}^{j}\,L_{i}^{j}$$

$$L_{i}^{det}=-\left(y_{i}^{det}\log p_{i}+\left(1-y_{i}^{det}\right)\log\left(1-p_{i}\right)\right)$$

$$L_{i}^{box}=\left\|\hat{y}_{i}^{box}-y_{i}^{box}\right\|_{2}^{2}$$

where $L_{cd}$ denotes the optimization objective of the simplified multitask cascaded convolutional neural network for pig face detection, $N$ is the total number of samples in the training set, $i$ indexes the $i$-th sample, $j$ denotes the task type and takes the value det or box (det: pig face discrimination; box: pig face regression-box detection), $L_{i}^{j}$ is the loss of the $i$-th sample on the $j$-th task, $\alpha_{j}$ is the weight of the loss for the $j$-th task, $\beta_{i}^{j}\in\{0,1\}$ is the sample-type label of the $i$-th sample on the $j$-th task, the weights assigned to the coarse-grained and fine-grained tasks are $\alpha_{det}=1$ and $\alpha_{box}=0.5$ respectively, $y_{i}^{det}$ is the true label of the sample, $p_{i}$ is the network-output probability that sample $i$ is a pig face, $\hat{y}_{i}^{box}$ are the pig face bounding-box coordinates predicted by the network, and $y_{i}^{box}$ are the manually annotated ground-truth bounding-box coordinates; both are four-dimensional vectors in $\mathbb{R}^{4}$, namely the horizontal and vertical coordinates of the top-left corner of the regression box together with its width and height.
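To make the joint objective concrete, the following is a minimal PyTorch-style sketch of $L_{cd}$ (not part of the patent; the function name, tensor shapes, and the batched formulation are assumptions):

```python
import torch
import torch.nn.functional as F

def joint_detection_loss(p, y_det, box_pred, box_gt, beta_det, beta_box,
                         alpha_det=1.0, alpha_box=0.5):
    """Joint loss L_cd of the simplified multitask cascaded network.

    p        : (N,) predicted pig-face probabilities p_i
    y_det    : (N,) ground-truth labels y_i^det in {0, 1}
    box_pred : (N, 4) predicted boxes (x, y, w, h)
    box_gt   : (N, 4) annotated ground-truth boxes
    beta_det, beta_box : (N,) sample-type indicators beta_i^j in {0, 1}
    """
    # L_i^det: cross-entropy classification loss per sample.
    l_det = F.binary_cross_entropy(p, y_det.float(), reduction="none")
    # L_i^box: squared Euclidean distance between predicted and true boxes.
    l_box = ((box_pred - box_gt) ** 2).sum(dim=1)
    # Task-weighted sum, averaged over the N training samples.
    return (alpha_det * beta_det * l_det + alpha_box * beta_box * l_box).mean()
```

Here the indicators `beta_det` and `beta_box` switch each sample in or out of a task, so, for example, a negative (non-pig-face) sample can contribute to the classification loss while contributing nothing to the box regression.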
The method in step S3 for extracting and constructing the salient-region feature maps of pig facial expression change is as follows:

S31, the feature maps extracted in step S2 from the video frame sequence containing pig facial expressions are input into a shallow residual network to generate time-sequenced global feature maps;

S32, the global feature maps obtained in step S31 are grouped by channel response pattern: first the contribution of each feature channel to each attention region is computed, with the weight expression $d_{\upsilon}(X)=f_{\upsilon}(\mathbf{W}*X)$, where $d_{\upsilon}(X)=[d_{\upsilon}(1),\dots,d_{\upsilon}(c)]$; to generate $N$ attention regions, a set of fully connected functions $F(\cdot)=\{f_{1}(\cdot),\dots,f_{\upsilon}(\cdot),\dots,f_{N}(\cdot)\}$ is defined, where each $f_{\upsilon}(\cdot)$ takes the convolution features as input, corresponds to the $\upsilon$-th attention region, receives a $c$-dimensional feature-channel input, and produces a $c$-dimensional weight vector $d_{\upsilon}$ indicating the contribution of each feature channel to attention region $\upsilon$; $\mathbf{W}*X$ denotes the convolution features of input sample $X$, $\mathbf{W}$ denotes the parameter set of the feature extraction unit, $w$, $h$, $c$ denote the width, height, and number of feature channels of the input sample respectively, and "$*$" denotes the convolution, pooling, and activation operations of the feature extraction unit;

S33, the attention-region feature maps are computed from the weights obtained in step S32: first, based on the learned weight vector $d_{\upsilon}$, the attention mask matrix $M_{\upsilon}$ of each attention region is obtained:

$$M_{\upsilon}(X)=\operatorname{sigmoid}\left(\sum_{k=1}^{c}d_{\upsilon}(k)\,[\mathbf{W}*X]_{k}\right)$$

where $X$ denotes the input sample, the sigmoid function normalizes values to the range 0–1, and $[\cdot]_{k}$ denotes multiplying the $k$-th component of the weight vector $d_{\upsilon}$ by the corresponding elements of the $k$-th feature channel of the convolution features $\mathbf{W}*X$; then the attention-region feature map is computed:

$$P_{\upsilon}(X)=\sum_{k=1}^{c}\operatorname{pool}\left([\mathbf{W}*X]_{k}\odot M_{\upsilon}(X)\right)$$

where $P_{\upsilon}(X)$ denotes the feature map of the $\upsilon$-th attention region, computed by pooling on each channel: the mask matrix of the $\upsilon$-th attention region is multiplied element-wise with the convolution feature map and the results are accumulated;

S34: a feature-channel group-clustering optimization objective $L_{cg}$ is constructed to cluster the feature channels into attention regions. $L_{cg}$ measures the correlation between feature points of strong and weak attention regions so that coordinates within the same attention region become more concentrated, expressed by the function $Dis(\cdot)$, while coordinates of different regions stay as far apart as possible, expressed by the function $Div(\cdot)$; $\lambda$ denotes the weight assigned to this constraint. The optimization objective is:

$$\min\,L_{cg}(M_{\upsilon})=Dis(M_{\upsilon})+\lambda\,Div(M_{\upsilon})$$

$$Dis(M_{\upsilon})=\sum_{(x,y)}m_{\upsilon}(x,y)\left[\left\|x-t_{x}\right\|^{2}+\left\|y-t_{y}\right\|^{2}\right]$$

$$Div(M_{\upsilon})=\sum_{(x,y)}m_{\upsilon}(x,y)\left[\max_{k\neq\upsilon}m_{k}(x,y)-T_{mrg}\right]$$

where $(x,y)$ ranges over the attention-region coordinates, $m_{\upsilon}(x,y)$ is the response value of the region's attention mask matrix $M_{\upsilon}(X)$ at coordinate $(x,y)$, $t_{x}$ and $t_{y}$ are the peak-response coordinates of the $\upsilon$-th attention region over the training set, $k$ and $\upsilon$ are attention-region indices with $(k\neq\upsilon)\in\{1,2,\dots,N\}$, $\max_{k\neq\upsilon}m_{k}(x,y)$ expresses the strongest competing response at coordinate $(x,y)$, and $T_{mrg}$ is a preset margin threshold, a constant that prevents extreme values, making the loss insensitive to noise and the network robust.
The method in step S4 for recognizing and classifying pig facial expressions is as follows:

The long short-term memory network classifies pig facial expressions in real time: the global facial convolution feature map obtained in step S31 and the multi-attention feature maps obtained in step S33 are fused by array merging and input into the LSTM network, which outputs the probability values of the four expression classes, anger, joy, fear, and calm, realizing classification and recognition of pig facial expressions; the optimization function of the cascaded pig face facial expression recognition framework model is

$$L=\min\left(\gamma\,L_{cd}+L_{cg}(\lambda,M_{\upsilon})\right)$$

where $\gamma$ is the weight balancing the stage objectives, $L_{cd}$ denotes the optimization objective of the simplified multitask cascaded convolutional neural network for pig face detection, $\alpha_{j}$ denotes the weight of the loss for the $j$-th task in that network with $j\in\{det,box\}$ (det: pig face discrimination; box: pig face regression-box detection), $L_{i}^{j}$ is the loss value of the $i$-th sample on task $j$; $L_{cg}$ denotes the feature-channel group-clustering optimization objective, $\lambda$ the target-constraint weight within the attention regions, and $M_{\upsilon}$ the attention mask matrix of the $\upsilon$-th region.
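A minimal PyTorch-style sketch of this third stage follows (hypothetical module and dimension choices; the patent specifies only the single 128-unit hidden layer, the fully connected layer, the softmax classifier, and the four classes, and the per-frame descriptors are assumed here to be globally pooled to one vector per frame):

```python
import torch
import torch.nn as nn

class ExpressionHead(nn.Module):
    """Third cascade stage: fused feature sequence -> LSTM -> FC -> softmax."""

    def __init__(self, feat_dim=512, hidden=128, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)  # anger / joy / fear / calm

    def forward(self, seq):
        # seq: (batch, T, feat_dim) -- per-frame descriptors obtained by
        # merging the global feature map with the attention feature maps.
        out, _ = self.lstm(seq)
        logits = self.fc(out[:, -1])           # last time step summarizes the clip
        return torch.softmax(logits, dim=1)    # four expression probabilities
```

For example, `ExpressionHead()(torch.randn(2, 16, 512))` would return a (2, 4) tensor of class probabilities for two 16-frame clips.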
The invention has the beneficial effects that:

1. The invention is the first to propose a multi-attention-mechanism cascaded LSTM network framework model for classifying and recognizing pig facial expressions in video images. It differs from existing livestock expression recognition methods that study changes in animal facial muscle groups through physiological anatomy, which are costly and inefficient. It also differs from existing animal (livestock) expression recognition methods based on single static images, whose machine-vision pipelines lose the temporal information of the expression-change process during feature extraction and representation. As an end-to-end model framework for recognizing livestock facial expressions in video, it is both novel and advanced.

2. The model framework is a multitask cascaded framework with an innovative structural design. The first stage of the cascade is a simplified multitask convolutional network that detects and localizes pig faces in video frames to remove the influence of non-pig-face regions on recognition performance. The second stage is a multi-attention mechanism module: exploiting the fact that different channels of a feature map attend to different visual information and have different peak-response regions, it groups feature channels and obtains attention regions by weakly supervised clustering, attending to the salient regions produced by the various expression changes. The third stage is a long short-term memory network model: the extracted convolution features and attention feature maps are fused into refined features by an array-merging operation and fed as a sequence into the LSTM network to realize expression recognition.

3. Livestock emotion research is one of the important research targets of animal science. Understanding livestock emotion by recognizing expression changes enables better evaluation of livestock welfare and plays an important role in maximizing feed digestibility and utilization, thereby accelerating growth and increasing production benefit.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention.
FIG. 1 is a framework diagram of the process of establishing a pig face facial expression recognition model according to the present invention;
FIG. 2 is a diagram of steps for establishing a pig face facial expression recognition model according to the present invention;
FIG. 3 is a flow chart of a multi-attention convolutional network implementation of the present invention.
Detailed Description
The details of the present invention and its embodiments are further described below with reference to the accompanying drawings.
Referring to fig. 1 to 3, the method for establishing a pig face facial expression recognition framework based on multitask cascade connection comprises the following steps:
S1: pig face facial expression video segments are input; segments shot in a pig farm and containing a frontal pig face are selected, and each input segment is labelled by class according to related research results and human experience. The video-segment dataset is divided into the four live-pig expression classes of anger, joy, fear, and calm, and is used to train, validate, and test the framework model;
S2: frame images sampled at equal intervals from the pig face facial expression video are input into the simplified multitask cascaded convolutional neural network for pig face detection and localization. The simplified network detects and localizes the pig face region rapidly in two steps, coarse-grained and fine-grained. The specific steps are as follows:

S21: coarse-grained detection and localization; a fully convolutional network, i.e. a proposal network, obtains candidate pig face windows and their bounding-box regression vectors, and the candidate windows are corrected according to the estimated regression vectors. Finally, non-maximum suppression merges highly overlapping candidate windows.

S22: fine-grained detection and localization; all pig-face-containing candidates obtained in step S21 are passed to a refinement network, which screens out erroneous candidate windows, calibrates with the bounding-box regression vectors, performs non-maximum suppression, and finally outputs the bounding-box coordinates containing the pig face, realizing pig face detection and localization.
S23: and the loss optimization function is a simplified loss function of the multitask cascade convolution neural network and is respectively composed of a pig face classification loss function and an Euclidean distance regression loss function regressed by a face region boundary frame, and the network learning is realized by the joint optimization loss function. The joint optimization loss function is:
Figure BDA0003002035490000061
Figure BDA0003002035490000071
Figure BDA0003002035490000072
wherein L iscdThe optimization objective function of the simplified multi-task cascade convolution neural network for pig face detection is represented, N is the total number of samples in a training set, i represents the ith sample, j represents the task type and takes the value of det or box, det is used for representing that the task type is pig face discrimination, box is used for representing that the task type is pig face regression box detection,
Figure BDA0003002035490000073
indicating that the ith sample is in the jth taskLoss function of alphajIndicating the weight possessed by the jth task corresponding to the penalty function,
Figure BDA0003002035490000074
the value of the label of the ith sample in the jth task is 0 or 1, and the corresponding weight distribution proportion of the coarse-grained task and the fine-grained task is respectively alphadet1 and αbox=0.5,
Figure BDA0003002035490000075
Figure BDA0003002035490000076
True tags representing samples, piRepresenting the probability that the i sample network output is a pig face,
Figure BDA0003002035490000077
the coordinates of the pig face bounding box predicted for the network,
Figure BDA0003002035490000078
the coordinates of the artificially labeled real bounding box, both of which are four-dimensional vectors R4The horizontal and vertical coordinates of the upper left corner of the regression frame and the width and height of the regression frame are respectively.
S3: on the basis of the step S2, the extracted pig face facial sequence frame images are input into a multi-attention machine module for extracting and constructing a salient region feature map of the change of pig face facial expression. The specific execution steps are as follows:
S31: the video sequence containing pig facial expressions extracted in step S2 is input into a residual network of depth 24. The network comprises 8 groups of residual units; within each unit the first two layers have the structure BN-ReLU-Conv(3 × 3) and the last has the structure BN-Conv(3 × 3), with stride 1. A downsampling structure is added at each stage of the network, where the stride changes to 2, yielding a pig face convolution feature map of size 28 × 28 × 512.
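A minimal PyTorch-style sketch of one such residual unit follows (the channel widths and the 1 × 1 projection shortcut are assumptions; the patent fixes only the BN-ReLU-Conv(3 × 3), BN-ReLU-Conv(3 × 3), BN-Conv(3 × 3) layer pattern and the stride-2 downsampling):

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Pre-activation residual unit: two BN-ReLU-Conv(3x3) layers followed
    by one BN-Conv(3x3); stride 2 on the first conv opens a new stage."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
        )
        # 1x1 projection (the linear transform W_s) when shapes differ.
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, stride=stride,
                                        bias=False))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)  # y = F(x, {W_i}) + W_s x
```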
S32: grouping the global feature maps obtained in the step S31 according to the channel response mode: first, the attention area of each feature channel is calculatedThe domain contribution degree and the weight calculation expression are as follows: dυ(X)=fυ(W X X), wherein dυ(X)=[dυ(1),…,dυ(c)]To generate n attention regions, a set of fully connected functions F (·) { F ] is defined1(·),…fυ(·),…fN(. o) }, each fυ(. h) taking convolution characteristics as input, respectively corresponding to a upsilon attention areas, receiving input of c-dimensional characteristic channels, and generating a c-dimensional weight vector d upsilon for indicating the contribution degree of each characteristic channel to the attention areas upsilon, wherein W X represents the convolution characteristics of input samples X, W represents a parameter set of a characteristic extraction unit, and is respectively W, h, c represents the width, height and number of the characteristic channels of the input samples, and the' represents the convolution, pooling and activation operations of the characteristic extraction unit;
s33, calculating an attention area feature map according to the weight calculated in the step S32: first based on the learned weight vector dυObtaining an attention mask matrix M for each region of interestυ
Figure BDA0003002035490000081
Wherein X represents an input sample, a sigmoid function is taken to normalize the input sample to be 0-1, k and upsilon represent different characteristic channels, namely different attention area index values, (k ≠ upsilon) E {1, 2, …, N }, [ · c]kK-th eigen-channel weight vector d representing convolution features W XkMultiplying with corresponding elements of the corresponding feature channel; then calculating the attention area feature map
Figure BDA0003002035490000082
Wherein P isυ(X) representing a characteristic diagram of the upsilon attention areas, and calculating through pooling on each channel, wherein the operation mark is multiplied by points of a mask matrix and a convolution characteristic diagram of the upsilon attention areas and then accumulated;
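A minimal PyTorch-style sketch of this mask-and-pool computation (assumed tensor layouts; `torch.einsum` implements the weighted channel sum):

```python
import torch

def attention_features(feat, d):
    """Compute M_v(X) and P_v(X) from a convolution feature map.

    feat : (C, H, W) convolution features W*X
    d    : (R, C) channel weight vectors d_v, one row per attention region
    Returns masks of shape (R, H, W) and spatially pooled region features
    of shape (R, C).
    """
    # M_v(X) = sigmoid( sum_k d_v(k) * [W*X]_k ), normalized to (0, 1).
    masks = torch.sigmoid(torch.einsum("rc,chw->rhw", d, feat))
    # P_v(X): mask multiplied element-wise with each feature channel,
    # then accumulated over the spatial positions (pooling).
    masked = feat.unsqueeze(0) * masks.unsqueeze(1)  # (R, C, H, W)
    region_feats = masked.sum(dim=(2, 3))            # (R, C)
    return masks, region_feats
```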
S34: a feature-channel group-clustering optimization objective $L_{cg}$ is constructed to cluster the feature channels into attention regions. $L_{cg}$ measures the correlation between feature points of strong and weak attention regions so that coordinates within the same attention region become more concentrated, expressed by $Dis(\cdot)$, while coordinates of different regions stay as far apart as possible, expressed by $Div(\cdot)$; $\lambda$ denotes the target-constraint weight. The optimization objective is:

$$\min\,L_{cg}(M_{\upsilon})=Dis(M_{\upsilon})+\lambda\,Div(M_{\upsilon})$$

$$Dis(M_{\upsilon})=\sum_{(x,y)}m_{\upsilon}(x,y)\left[\left\|x-t_{x}\right\|^{2}+\left\|y-t_{y}\right\|^{2}\right]$$

$$Div(M_{\upsilon})=\sum_{(x,y)}m_{\upsilon}(x,y)\left[\max_{k\neq\upsilon}m_{k}(x,y)-T_{mrg}\right]$$

where $(x,y)$ ranges over the attention-region coordinates, $m_{\upsilon}(x,y)$ is the response value of the attention mask matrix $M_{\upsilon}(X)$ at coordinate $(x,y)$, $t_{x}$ and $t_{y}$ are the peak-response coordinates of the $\upsilon$-th attention region over the training set, $\max_{k\neq\upsilon}m_{k}(x,y)$ expresses the strongest competing response at $(x,y)$, and $T_{mrg}$ is a preset margin threshold, a constant that prevents extreme values, making the loss insensitive to noise and the network robust.
S4: the long-short term memory network classifies the pig face facial expressions in real time, the facial global convolution feature map obtained in the step S31 and the multi-attention feature map obtained in the step S33 are input into the long-short term memory network through merging array fusion, the pig face facial expressions are identified and classified through a full connection layer and a softmax classifier, and four types of expression probability values of anger, joy, fear, peace and the like are output, so that the pig face facial expression classification identification is realized. The optimization function of the cascaded pig face facial expression recognition framework model is as follows,
Figure BDA0003002035490000086
where γ is the weight of the objective function in the equilibrium stage, LcdRepresenting a simplified multitasking level for pig face detectionOptimization objective function of the network, alphajRepresenting the weight of a loss function corresponding to the jth task in the simplified multi-task convolutional network, wherein j belongs to { det, box }, det is used for representing that the task type is pig face discrimination, box is used for representing that the task type is pig face regression box detection,
Figure BDA0003002035490000094
represents the loss value of the ith sample in task j; l iscgRepresenting an optimized objective function for clustering of constrained channel groups, λ representing the assigned weight of the target constraint in the attention area, MυAn attention mask matrix representing a first v regions.
S5: optimizing an objective function and updating parameters, and carrying out model training and verification on the network structure in a training set and a verification set through multiple iterations. According to a random gradient descent method
Figure BDA0003002035490000091
we+1=we+ve+1Updating network parameters, optimizing an objective function, wherein e in the above formula represents iteration times, v represents momentum, eta represents learning rate,
Figure BDA0003002035490000092
denotes D in the e-th iterationeThe partial derivatives of the loss function L generated by the training for each batch with respect to the weight w.
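A minimal sketch of one such update step (plain Python over array-like parameters; the momentum coefficient and learning-rate values are assumptions, not values from the patent):

```python
def sgd_momentum_step(w, v, grad, m=0.9, lr=0.01):
    """One stochastic-gradient-descent-with-momentum update:
    v_{e+1} = m * v_e - eta * grad,  w_{e+1} = w_e + v_{e+1}.

    w, v, grad : parameter, velocity, and batch-averaged gradient
    m, lr      : momentum coefficient and learning rate eta
    Returns the updated (w, v) pair.
    """
    v_next = m * v - lr * grad
    return w + v_next, v_next
```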
The pig face facial expression recognition model based on the multi-attention-mechanism cascaded LSTM network is suitable not only for recognizing pig facial expressions but also for recognizing expressions in video images of other livestock.
Example:

Referring to figs. 1 to 3, for the pig face facial expression recognition framework model based on the multi-attention-mechanism cascaded LSTM network, data augmentation is first performed on a shared pig face facial expression dataset, specifically brightness changes, small-angle rotations, and left-right flips of the videos; the dataset divides expressions into four classes: anger, joy, fear, and calm. Training and validation sets are split by five-fold cross-validation. The training set is used for training: the error between the actual training output and the label value is computed, the difference is propagated from top to bottom by the backpropagation algorithm, and the weights are updated. After training, the trained neural-network model is saved, and the validation set is used to tune parameters and make a preliminary evaluation of the model's training. The method specifically comprises the following steps:

S1: pig face facial expression video segments are input; segments shot in a pig farm and containing a frontal pig face are selected, and the video-segment dataset is divided into four classes: live-pig anger, joy, fear, and the neutral expression in a calm state.

S2: the simplified multitask cascaded convolutional network localizes the pig face in each video segment and extracts the video images containing the frontal pig face, normalized to size 224 × 224 × 3.
The network first defines the learning objective as a two-class pig face/non-pig face problem; for each sample $x_{i}$ the cross-entropy loss is used:

$$L_{i}^{det}=-\left(y_{i}^{det}\log p_{i}+\left(1-y_{i}^{det}\right)\log\left(1-p_{i}\right)\right)$$

where $i$ indexes the $i$-th sample, det denotes that the task class of the simplified multitask cascaded convolutional neural network is pig face discrimination, $p_{i}$ is the network-output probability that the sample is a pig face, and $y_{i}^{det}$ is the manually annotated ground-truth label.
Secondly, the offset between the predicted bounding box generated by the network and the nearest ground-truth bounding box is computed; for each sample $x_{i}$ the loss is calculated as a Euclidean distance:

$$L_{i}^{box}=\left\|\hat{y}_{i}^{box}-y_{i}^{box}\right\|_{2}^{2}$$

where box denotes that the task class of the simplified multitask cascaded convolutional neural network is pig face regression-box detection, $\hat{y}_{i}^{box}$ are the pig face bounding-box coordinates predicted by the network, and $y_{i}^{box}$ are the manually annotated ground-truth coordinates, composed of four values, namely the horizontal and vertical coordinates of the top-left corner of the bounding box together with its height and width, so that $y_{i}^{box}\in\mathbb{R}^{4}$ is a four-dimensional vector.
The simplified multitask cascaded convolutional neural network places a weight $\alpha$ before the final loss function, with different weights for the two tasks det and box; the training optimization objective is:

$$L_{cd}=\min\frac{1}{N}\sum_{i=1}^{N}\sum_{j\in\{det,\,box\}}\alpha_{j}\,\beta_{i}^{j}\,L_{i}^{j}$$

where $L_{cd}$ denotes the optimization objective of the simplified multitask cascaded convolutional neural network for pig face detection, $N$ is the total number of samples in the training set, $i$ indexes the $i$-th sample, $j$ denotes the task type and takes the value det or box, $L_{i}^{j}$ is the loss of the $i$-th sample on the $j$-th task, $\alpha_{j}$ is the weight of the loss for the $j$-th task, $\beta_{i}^{j}\in\{0,1\}$ is the sample-type label of the $i$-th sample on the $j$-th task, and the weights assigned to the coarse-grained and fine-grained tasks are $\alpha_{det}=1$ and $\alpha_{box}=0.5$; when $\beta_{i}^{det}=0$ the sample is judged to be a non-pig face, and when $\beta_{i}^{det}=1$ the sample is judged to be a pig face.
S3: constructing a network recognition model based on the facial expression of the pig face, and extracting the global features and the multi-attention features of the pig face;
(1) firstly, extracting global convolution characteristics from the face of a pig by using a 24-layer residual error network model;
inputting a video sequence which is extracted by S2 and contains a pig face and has the size of 224 multiplied by 3 into a residual error network with the depth of 24 layers, wherein the network structure comprises 8 groups of residual error units, the first two layers of the residual error units in each group have the structure of BN-ReLu-Conv (3 multiplied by 3), the last structure of the residual error units is BN-Conv (3 multiplied by 3), the step size is 1, a downsampling structure is required to be added for realizing each stage of the network, the step size is changed into 2 at the moment to obtain a pig face convolution characteristic diagram with the size of 28 multiplied by 512, the residual error structure can effectively avoid the problems of gradient disappearance and the like, and the robustness is higher aiming at object characteristic expression. An exemplary expression for the residual network structure is:
F=W2σ(W1x)
y=F(x,{Wi})+x
where the representation σ denotes the ReLu function, x and y are the input and output vectors of the network layer, i is the ith sample, and the function F (x, { W)i}) represents the residual mapping to be learned, and it can be found from the formula that neither additional parameters are introduced nor the computational complexity is increased, and the sizes of x and F must be equal, e.g. if the sizes are different, the linear transformation W needs to be performedsTo match the size.
(2) Second, a multi-attention mechanism is introduced to generate attention features for the key regions of the pig face, as shown in fig. 3.

A video segment is input into the residual network model to extract a convolution feature map. Unfolding the feature channels of the first convolution layer shows that each feature channel has a peak-response region. Exploiting the fact that different channels of the feature map attend to different visual information and have different peak-response regions, channels with similar response regions are clustered, with the number of fully connected functions matching the number of fine-grained attention regions, and $N$ attention regions are generated by this weakly supervised clustering. A set of fully connected functions $F(\cdot)=\{f_{1}(\cdot),\dots,f_{\upsilon}(\cdot),\dots,f_{N}(\cdot)\}$ is defined, where $f_{\upsilon}(\cdot)$ corresponds to the $\upsilon$-th attention region, receives a $c$-dimensional feature-channel input, and produces a $c$-dimensional weight vector $d_{\upsilon}$; channels of the same type are added and a sigmoid function (normalizing to 0–1) gives the corresponding probability values, yielding the attention regions required by the recognition process.

The feature channels obtained from the first convolution layer of the residual network model are unfolded, and $N$ stacked fully connected layers generate the weight vectors $d_{\upsilon}$ indicating the contribution of each feature channel to attention region $\upsilon$: $d_{\upsilon}(X)=f_{\upsilon}(\mathbf{W}*X)$.

Here $X$ denotes the input image and $\mathbf{W}$ the overall parameters, of dimension $w\times h\times c$, where $w$, $h$, $c$ are the width, height, and number of feature channels of the image; $\mathbf{W}*X$ denotes the extracted depth features, "$*$" denotes the convolution, pooling, and activation operations of the feature extraction unit, $d_{\upsilon}(X)=[d_{\upsilon}(1),\dots,d_{\upsilon}(c)]$, and $f_{\upsilon}(\cdot)$ is the fully connected function corresponding to the $\upsilon$-th attention region.

Channels with similar response regions are clustered by the learned weight vectors and normalized to 0–1 by the sigmoid function to obtain the corresponding probability values, and the attention mask matrix $M_{\upsilon}(X)$, $\upsilon\in\{1,2,\dots,N\}$, of each attention region is obtained based on the learned weight vectors:

$$M_{\upsilon}(X)=\operatorname{sigmoid}\left(\sum_{k=1}^{c}d_{\upsilon}(k)\,[\mathbf{W}*X]_{k}\right)$$

where $X$ denotes the input sample, the sigmoid normalizes values to 0–1, and $[\cdot]_{k}$ denotes multiplying the $k$-th component of the weight vector $d_{\upsilon}$ by the corresponding elements of the $k$-th feature channel of the convolution features $\mathbf{W}*X$.
(3) A spatial pooling operation over the global features and the attention-region features yields the multi-attention feature maps of the pig face.

The input sequence has size 28 × 28 × 512 and each attention region has size 28 × 28 × 1; finally, the spatial pooling operation on the $\upsilon$-th attention-region mask and the convolution feature map yields the multi-attention feature $P_{\upsilon}(X)$ of the pig face attention map, of size 28 × 28 × 512:

$$P_{\upsilon}(X)=\sum_{k=1}^{c}\operatorname{pool}\left([\mathbf{W}*X]_{k}\odot M_{\upsilon}(X)\right)$$

where $P_{\upsilon}(X)$ denotes the feature map of the $\upsilon$-th attention region, computed by pooling on each channel: the mask matrix of the $\upsilon$-th attention region is multiplied element-wise with the convolution feature map and the results are accumulated.
S4: the extracted sequence images of multi-attention features and convolution features are merged into a combined array and input into the LSTM network, and pig facial expressions are recognized and classified through a fully connected layer and a softmax classifier.

Because a change in pig expression usually lasts 3–4 seconds at a frame rate of 25 frames per second, because the expression change is regarded as one continuous dynamic process, and because the opening and closing frames carry little information, feature-frame extraction trims the head and tail and keeps evenly spaced middle frames: one frame is taken every 5 frames in the middle section of the video, and if the original segment has fewer frames than the required length, the last frame is duplicated, so that each video sequence reaches the 16-frame length required by the experiment. The invention uses these 16 frames, of dimension 28 × 28 × 512, as input; inside the network they pass through the input gate $i_{t}$ and forget gate $f_{t}$, the memory cell $c_{t}$ whose candidate vector $\tilde{c}_{t}$ continuously updates the feature information, and finally the output gate $o_{t}$ to obtain the class vector of the sample; the hidden layer of the LSTM network is set to a single layer of 128 units.
S5: and defining a network model loss function, and performing model training and verification on the network structure in a training set and a verification set through multiple iterations. The loss function is specifically as follows:
Figure BDA0003002035490000122
Figure BDA0003002035490000123
Figure BDA0003002035490000124
Figure BDA0003002035490000125
Lcg(λ,Mυ)=Dis(Mυ)+λDiv(Mυ)
Figure BDA0003002035490000126
Figure BDA0003002035490000127
where γ is the weight of the objective function in the equilibrium stage, αjThe weight of the jth task in the simplified multi-task cascaded convolutional neural network corresponding to the loss function is shown, lambda represents the target constraint distribution weight in the attention area,
Figure BDA0003002035490000131
represents the loss value of the ith sample in task j, MυAn attention mask matrix, L, representing a first v regionscdRepresenting an optimized objective function, L, of a simplified multi-task cascaded convolutional neural network for pig face detectioncgAn optimization objective function for constraining the clustering of channel packets is represented. N is a radical ofIn order to train the total number of samples in the set,
Figure BDA0003002035490000132
the value of the label of the ith sample in the jth task is 0 or 1, j belongs to { det, box }, det is used for indicating that the task type is pig face discrimination, box is used for indicating that the task type is pig face regression box detection,
Figure BDA0003002035490000133
true tags representing samples, piRepresenting the probability that the network output is a pig face.
Figure BDA0003002035490000134
The coordinates of the pig face bounding box predicted for the network,
Figure BDA0003002035490000135
and (3) manually labeling the coordinates of the real boundary box, wherein the coordinates are four-dimensional vectors, namely the horizontal and vertical coordinates of the upper left corner of the regression box and the width and height of the regression box. L iscgThe method aims to judge the correlation between the characteristic points of the high attention area and the characteristic points of the weak attention area, namely, the coordinates close to the peak response area of the channel in the same attention area are more gathered and expressed by a function Dis (-), and the coordinates of the peak response areas of the channel in different attention areas are far as possible and expressed by a function Div (-), (x, y) are taken from the coordinates of the attention area mυ(x, y) attention mask matrix M corresponding to the region of interestυ(X) response value at (X, y) coordinate, txAnd tyCoordinates representing the peak response of the training set to the upsilon attention areas, k and upsilon representing different attention area index values,
Figure BDA0003002035490000136
where k, υ 1, 2, …, N is used to indicate the response value of the (x, y) coordinate position that can best represent the υ zone, TmrgThe threshold value represents a preset boundary threshold value and is a constant value used for preventing extreme values from appearing, so that the loss is not sensitive to noise, and the robustness of the network is realized.
The method is suitable not only for recognizing pig facial expressions but also for recognizing the four simple expression classes, anger, fear, joy, and calm, in facial videos of other livestock. The pig face facial expression recognition model framework based on the multi-attention-mechanism cascaded LSTM network recognizes pig expressions in video frame images, differing from existing livestock expression recognition methods based mainly on physiological anatomy and from livestock expression recognition in static images. The invention is the first to apply a cascade framework model to the classification and recognition of time-sequenced pig facial expression images. The network model consists of three cascaded stages: pig face facial expression video frames are sampled at equal intervals and input into the simplified multitask cascaded convolutional neural network for pig face detection and localization; the extracted pig face sequence feature maps are then input into the multi-attention mechanism module, which captures the salient facial regions produced by expression changes and realizes attention to fine facial variation; finally, the refined feature map extracted from the video frames and the multi-attention feature maps are merged by the array-merging operation and input into the LSTM network to classify and recognize the expressions. The proposed end-to-end cascade framework effectively addresses the difficulty that the facial expression muscles of domestic pigs are structurally simple, few in number, and produce expressions of short duration, making facial expressions hard to perceive and recognize. The proposed model framework serves livestock emotion research, one of the important research targets of animal science; it enables better evaluation of livestock welfare and better emotion regulation through livestock expression recognition, thereby improving feed digestibility and utilization, accelerating growth, and increasing yield.
The above description is only a preferred example of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement of the present invention shall be included within its protection scope.

Claims (3)

1. A method for establishing a pig face facial expression recognition framework based on multitask cascading, characterized by comprising the following steps:

S1, inputting pig face facial expression video segments and labelling each input segment with one of the four domestic-pig expression classes: anger, joy, fear, and calm;

S2, first stage of the cascade framework model: frame images are sampled from the pig face facial expression video at equal intervals and input into a simplified multitask cascaded convolutional neural network for pig face detection and localization; the simplified network detects and localizes the pig face region rapidly in two steps, coarse-grained and fine-grained; the architecture of the simplified multitask cascaded convolutional neural network is as follows:

S21, coarse-grained detection and localization: a fully convolutional network, i.e. a proposal network, obtains candidate pig face windows and their bounding-box regression vectors, and the candidate windows are corrected according to the estimated regression vectors; finally, non-maximum suppression merges highly overlapping candidate windows;

S22, fine-grained detection and localization: all pig-face-containing candidates obtained in step S21 are passed to a refinement network, which screens out erroneous candidate windows, calibrates with the bounding-box regression vectors, performs non-maximum suppression, and finally outputs the bounding-box coordinates containing the pig face, realizing pig face detection and localization;

S23, loss optimization function: the loss of the simplified multitask cascaded convolutional neural network is composed of a pig face classification loss and a Euclidean-distance regression loss on the face-region bounding box, and the network is learned by jointly optimizing them; the joint optimization loss is:

$$L_{cd}=\min\frac{1}{N}\sum_{i=1}^{N}\sum_{j\in\{det,\,box\}}\alpha_{j}\,\beta_{i}^{j}\,L_{i}^{j}$$

$$L_{i}^{det}=-\left(y_{i}^{det}\log p_{i}+\left(1-y_{i}^{det}\right)\log\left(1-p_{i}\right)\right)$$

$$L_{i}^{box}=\left\|\hat{y}_{i}^{box}-y_{i}^{box}\right\|_{2}^{2}$$

where $L_{cd}$ denotes the optimization objective of the simplified multitask cascaded convolutional neural network for pig face detection, $N$ is the total number of samples in the training set, $i$ indexes the $i$-th sample, $j$ denotes the task type and takes the value det or box (det: pig face discrimination; box: pig face regression-box detection), $L_{i}^{j}$ is the loss of the $i$-th sample on the $j$-th task, $\alpha_{j}$ is the weight of the loss for the $j$-th task, $\beta_{i}^{j}\in\{0,1\}$ is the sample-type label of the $i$-th sample on the $j$-th task, the weights assigned to the coarse-grained and fine-grained tasks are $\alpha_{det}=1$ and $\alpha_{box}=0.5$ respectively, $y_{i}^{det}$ is the true label of the sample, $p_{i}$ is the network-output probability that sample $i$ is a pig face, $\hat{y}_{i}^{box}$ are the pig face bounding-box coordinates predicted by the network, and $y_{i}^{box}$ are the manually annotated ground-truth bounding-box coordinates; both are four-dimensional vectors in $\mathbb{R}^{4}$, representing the horizontal and vertical coordinates of the top-left corner of the regression box together with its width and height;

S3, second stage of the cascade framework model: the extracted pig face sequence frame images are input into a multi-attention mechanism module that extracts and constructs salient-region feature maps of pig facial expression change; first, a shallow residual network extracts a global convolution feature map of the pig face; second, a channel-grouping response attention mechanism captures and generates the salient-region feature maps of facial expression change; then the attention-region feature maps and the global convolution feature map are merged to generate a pig face feature map fused with the attention mechanism;

S4, third stage of the cascade framework model: the attention-fused pig face feature maps are input in sequence into a long short-term memory (LSTM) network, and pig facial expressions are recognized and classified through a fully connected layer and a softmax classifier.
2. The method for establishing a pig face facial expression recognition framework based on multitask cascade as claimed in claim 1, wherein the salient-region feature maps of pig facial expression change in step S3 are extracted and constructed as follows:
s31, the sequence of video frames containing pig facial expressions extracted in step S2 is input into a shallow residual network to generate a global feature map carrying the time sequence;
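A shallow residual network of the kind step S31 describes could look like the following sketch; the stem, depth, and channel width are assumptions made for illustration.

```python
# Minimal "shallow residual network" sketch producing a global feature map
# per frame; configuration values are illustrative assumptions.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity shortcut

class ShallowResNet(nn.Module):
    def __init__(self, in_ch: int = 3, ch: int = 64, n_blocks: int = 2):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 7, stride=2, padding=3)
        self.blocks = nn.Sequential(*[BasicBlock(ch) for _ in range(n_blocks)])

    def forward(self, frames):  # frames: (B*T, 3, H, W) stacked sequence
        return self.blocks(torch.relu(self.stem(frames)))
```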
s32, the global feature maps obtained in step S31 are grouped according to channel response: first, the contribution degree of each feature channel to each attention region is calculated, with the weight computation expressed as $d_{\upsilon}(X)=f_{\upsilon}(W*X)$, where $d_{\upsilon}(X)=[d_{\upsilon}(1),\dots,d_{\upsilon}(c)]$; to generate $N$ attention regions, a set of fully connected functions $F(\cdot)=\{f_{1}(\cdot),\dots,f_{\upsilon}(\cdot),\dots,f_{N}(\cdot)\}$ is defined; each $f_{\upsilon}(\cdot)$ takes the convolution features as input, corresponds to the $\upsilon$-th attention region, receives the input of a $c$-dimensional feature channel, and generates a $c$-dimensional weight vector $d_{\upsilon}$ denoting the contribution degree of each feature channel to attention region $\upsilon$; $W*X$ denotes the convolution features of the input sample $X$, $W$ denotes the parameter set of the feature extraction unit, $w$, $h$ and $c$ denote the width, height and number of feature channels of the input sample respectively, and "$*$" denotes the convolution, pooling and activation operations of the feature extraction unit;
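A sketch of the per-region channel-weight computation $d_{\upsilon}(X)=f_{\upsilon}(W*X)$; the use of global average pooling to form the $c$-dimensional channel descriptor fed to each fully connected function is an assumption.

```python
# Sketch of channel-grouping weight computation: one fully connected
# function f_v per attention region maps a c-dim channel descriptor
# (here, global average pooling of the features) to c channel weights.
import torch
import torch.nn as nn

class ChannelWeights(nn.Module):
    def __init__(self, c: int, n_regions: int):
        super().__init__()
        self.fcs = nn.ModuleList([nn.Linear(c, c) for _ in range(n_regions)])

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, c, H, W) convolution features W*X
        desc = feat.mean(dim=(2, 3))  # (B, c) channel descriptor
        # one c-dimensional weight vector d_v per attention region
        return torch.stack([fc(desc) for fc in self.fcs], dim=1)  # (B, N, c)
```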
s33, the attention-region feature maps are calculated from the weights obtained in step S32: first, based on the learned weight vector $d_{\upsilon}$, an attention mask matrix $M_{\upsilon}$ is obtained for each attention region:

$$M_{\upsilon}(X)=\mathrm{sigmoid}\left(\sum_{k=1}^{c}d_{\upsilon}(k)\,[W*X]_{k}\right)$$

wherein $X$ denotes the input sample, the sigmoid function normalizes the response to the interval 0 to 1, $k$ and $\upsilon$ index the feature channels and the attention regions respectively, with $(k\neq\upsilon)\in\{1,2,\dots,N\}$, and $[\,\cdot\,]_{k}$ denotes the $k$-th feature channel of the convolution features $W*X$, which is multiplied element-wise by the corresponding component $d_{\upsilon}(k)$ of the weight vector; the attention-region feature map is then calculated as

$$P_{\upsilon}(X)=\sum_{k=1}^{c}[W*X]_{k}\odot M_{\upsilon}(X)$$

wherein $P_{\upsilon}(X)$ denotes the feature map of the $\upsilon$-th attention region, computed by pooling over the channels: the mask matrix is point-multiplied with each channel of the convolution feature map of the $\upsilon$-th attention region and the results are accumulated;
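Steps S32 and S33 combined in one hedged sketch: the $d_{\upsilon}$-weighted channel sum is squashed with a sigmoid to form the mask $M_{\upsilon}$, and the mask is point-multiplied with the feature channels and accumulated to form $P_{\upsilon}$. The tensor shapes and the einsum formulation are illustrative choices.

```python
# Combined sketch of S32-S33: mask M_v from weighted channel sum, region
# feature map P_v from mask times channels, accumulated over channels.
import torch

def region_feature_maps(feat: torch.Tensor, d: torch.Tensor):
    # feat: (B, c, H, W) convolution features W*X
    # d:    (B, N, c)    per-region channel weight vectors d_v
    masks = torch.sigmoid(torch.einsum("bnc,bchw->bnhw", d, feat))  # M_v
    # sum_k [W*X]_k elementwise-times M_v == M_v * (channel sum of feat)
    p = masks * feat.sum(dim=1, keepdim=True)                       # P_v
    return masks, p  # both (B, N, H, W)
```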
s34, a feature-channel grouping-and-clustering optimization objective function $L_{cg}$ is constructed to realize feature-channel clustering and obtain the attention regions; $L_{cg}$ evaluates the correlation between feature points of strong and weak attention regions, so that coordinates within the same attention region become more aggregated, expressed by the function $\mathrm{Dis}(\cdot)$, while coordinates of different regions are kept as far apart as possible, expressed by the function $\mathrm{Div}(\cdot)$; $\lambda$ denotes the weight distributing the two target constraints; the optimization objective function is:

$$\min L_{cg}(M_{\upsilon})=\mathrm{Dis}(M_{\upsilon})+\lambda\,\mathrm{Div}(M_{\upsilon})$$

$$\mathrm{Dis}(M_{\upsilon})=\sum_{(x,y)}m_{\upsilon}(x,y)\left[\left\|x-t_{x}\right\|^{2}+\left\|y-t_{y}\right\|^{2}\right]$$

$$\mathrm{Div}(M_{\upsilon})=\sum_{(x,y)}m_{\upsilon}(x,y)\left[\max_{k\neq\upsilon}m_{k}(x,y)-T_{mrg}\right]$$

wherein $(x,y)$ is taken from the attention-region coordinates; $m_{\upsilon}(x,y)$ is the response value of the attention mask matrix $M_{\upsilon}(X)$ of the $\upsilon$-th region at coordinate $(x,y)$; $t_{x}$ and $t_{y}$ denote the coordinates of the peak response of the $\upsilon$-th attention region over the training set; $\max_{k\neq\upsilon}m_{k}(x,y)$ denotes the strongest response among the other attention regions at coordinate $(x,y)$; and $T_{mrg}$ denotes a preset margin threshold, a constant that prevents extreme values so that the loss is insensitive to noise, giving the network robustness.
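A hedged sketch of $L_{cg}$ as reconstructed above; taking the per-batch mask peak as $(t_{x},t_{y})$ and clamping the margined competitor response at zero are interpretive choices made for a self-contained example, not taken from the claim text.

```python
# Hedged sketch of L_cg = Dis + lambda * Div over attention masks.
import torch

def channel_grouping_loss(masks: torch.Tensor,
                          lam: float = 0.5,
                          t_mrg: float = 0.02) -> torch.Tensor:
    # masks: (B, N, H, W) attention masks M_v with values in (0, 1)
    B, N, H, W = masks.shape
    ys = torch.arange(H, dtype=masks.dtype).view(1, 1, H, 1)
    xs = torch.arange(W, dtype=masks.dtype).view(1, 1, 1, W)
    flat = masks.flatten(2)                  # (B, N, H*W)
    idx = flat.argmax(dim=2)                 # peak response index per mask
    t_y = torch.div(idx, W, rounding_mode="floor").to(masks.dtype)
    t_x = (idx % W).to(masks.dtype)
    t_y = t_y.view(B, N, 1, 1)
    t_x = t_x.view(B, N, 1, 1)
    # Dis: pull mask responses toward the peak coordinate
    dis = (masks * ((xs - t_x) ** 2 + (ys - t_y) ** 2)).mean()
    # Div: penalize overlap with the strongest competing region
    div = masks.new_zeros(())
    for v in range(N):
        others = torch.cat([masks[:, :v], masks[:, v + 1:]], dim=1)
        competitor = others.max(dim=1).values  # max_{k != v} m_k(x, y)
        div = div + (masks[:, v] * (competitor - t_mrg).clamp(min=0)).mean()
    return dis + lam * div
```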
3. The method for establishing a pig face facial expression recognition framework based on multitask cascade as claimed in claim 1, wherein the pig facial expressions in step S4 are recognized and classified as follows:
the long short-term memory network classifies the pig facial expressions in real time: the global facial convolution feature map obtained in step S31 and the multi-attention feature maps obtained in step S33 are merged by array concatenation and input into the long short-term memory network, which outputs probability values for the four expression classes of angry, happy, fearful and calm, thereby realizing the classification and recognition of pig facial expressions; the optimization function of the cascaded pig face facial expression recognition framework model is

$$\min L=\gamma\left(\frac{1}{N}\sum_{i=1}^{N}\sum_{j\in\{det,\,box\}}\alpha_{j}\,\beta_{i}^{j}\,L_{i}^{j}\right)+\mathrm{Dis}(M_{\upsilon})+\lambda\,\mathrm{Div}(M_{\upsilon})$$

wherein $\gamma$ is the weight balancing the stage objective functions; the first term is the optimization objective $L_{cd}$ of the simplified multi-task cascaded convolutional neural network for pig face detection, in which $\alpha_{j}$ denotes the weight of the loss function corresponding to the $j$-th task, $j\in\{det,box\}$, with det indicating pig-face discrimination and box indicating pig-face regression-box detection, and $L_{i}^{j}$ denotes the loss value of the $i$-th sample in task $j$; the remaining terms form the feature-channel grouping-and-clustering objective $L_{cg}$, in which $\lambda$ denotes the target-constraint distribution weight for the attention regions and $M_{\upsilon}$ denotes the attention mask matrix of the $\upsilon$-th region.
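Tying the stages together, a one-line sketch of the cascade objective as read above, reusing the illustrative joint_detection_loss and channel_grouping_loss sketches; gamma is the stage-balancing weight.

```python
# Sketch of min L = gamma * L_cd + L_cg; l_cd and l_cg would come from the
# illustrative joint_detection_loss and channel_grouping_loss above.
import torch

def cascade_objective(l_cd: torch.Tensor, l_cg: torch.Tensor,
                      gamma: float = 1.0) -> torch.Tensor:
    # gamma balances the detection-stage and attention-stage objectives
    return gamma * l_cd + l_cg
```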
CN202110350752.0A 2021-03-31 2021-03-31 Establishment method of pig face facial expression recognition framework based on multitask cascade Active CN113065460B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110350752.0A CN113065460B (en) 2021-03-31 2021-03-31 Establishment method of pig face facial expression recognition framework based on multitask cascade

Publications (2)

Publication Number Publication Date
CN113065460A (en) 2021-07-02
CN113065460B (en) 2022-04-29

Family

ID=76564956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110350752.0A Active CN113065460B (en) 2021-03-31 2021-03-31 Establishment method of pig face facial expression recognition framework based on multitask cascade

Country Status (1)

Country Link
CN (1) CN113065460B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205085B * 2021-07-05 2021-11-19 Wuhan Huaxin Data *** Co., Ltd. Image identification method and device
CN114359958B * 2021-12-14 2024-02-20 Hefei University of Technology Pig face recognition method based on channel attention mechanism
CN114639156B * 2022-05-17 2022-07-22 Wuhan University Depression angle face recognition method and system based on axial attention weight distribution network
CN115761569B * 2022-10-20 2023-07-04 Zhejiang Lab Video emotion positioning method based on emotion classification
CN116385070B * 2023-01-18 2023-10-03 University of Science and Technology of China Multi-target prediction method, system, equipment and storage medium for short video advertisement of E-commerce
CN117908369B * 2023-04-23 2024-07-02 Chongqing Academy of Animal Sciences Pig farm cultivation environment dynamic adjustment method based on different temperature areas

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460638A * 2018-05-18 2018-08-28 Zhengzhou Waisi Creativity Culture Communication Co., Ltd. Personalized advertisement delivery system and method based on face recognition
CN110147699A * 2018-04-12 2019-08-20 Peking University Image recognition method, apparatus and related device
CN111666838A * 2020-05-22 2020-09-15 Jilin University Pig face recognition method with improved residual network
CN111783620A * 2020-06-29 2020-10-16 Beijing Baidu Netcom Science and Technology Co., Ltd. Expression recognition method, device, equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014191602A * 2013-03-27 2014-10-06 Casio Comput Co Ltd Display device, program, and display system
CN109460690B * 2017-09-01 2022-10-14 ArcSoft Corporation Limited Method and device for pattern recognition
CN108830262A * 2018-07-25 2018-11-16 Shanghai University of Electric Power Multi-angle human face expression recognition method under natural conditions
CN109522818B * 2018-10-29 2021-03-30 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Expression recognition method and device, terminal equipment and storage medium
CN109815785A * 2018-12-05 2019-05-28 Sichuan University Face emotion recognition method based on two-stream convolutional neural networks
CN110728179A * 2019-09-04 2020-01-24 Tianjin University Pig face identification method using multi-path convolutional neural network
CN110737783B * 2019-10-08 2023-01-17 Tencent Technology (Shenzhen) Co., Ltd. Method and device for recommending multimedia content and computing equipment
CN112101241A * 2020-09-17 2020-12-18 Southwest University of Science and Technology Lightweight expression recognition method based on deep learning
CN112287891B * 2020-11-23 2022-06-10 Fuzhou University Method for evaluating learning concentration through video based on expression behavior feature extraction
CN112464865A * 2020-12-08 2021-03-09 Beijing Institute of Technology Facial expression recognition method based on pixel and geometric mixed features
CN112559835B * 2021-02-23 2021-09-14 Institute of Automation, Chinese Academy of Sciences Multi-modal emotion recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant