CN112613552B

CN112613552B - Convolutional neural network emotion image classification method combined with emotion type attention loss

Info

Publication number: CN112613552B
Application number: CN202011506810.6A
Authority: CN
Inventors: 毋立芳; 邓斯诺; 张恒; 石戈; 简萌; 相叶
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2024-05-28
Anticipated expiration: 2040-12-18
Also published as: CN112613552A

Abstract

A convolutional neural network emotion image classification method combined with emotion category attention loss, relating to the technical fields of intelligent media computing and computer vision. First, category weights are calculated from the training samples to obtain an emotion category attention weight vector. Second, the last classification layer and the loss function of the convolutional neural network are modified according to the number of emotion categories and the emotion category attention loss. Next, the training samples are preprocessed and fed into the network, and the network parameters are iteratively updated through the loss function and the optimizer until the network converges, which completes training. Finally, preprocessed test images are fed into the network to obtain the emotion image classification accuracy of the resulting model and the model's predicted category for each test emotion image. When emotion images are classified with a convolutional neural network in this way, the classification results adapt to the sample distribution of the dataset, which facilitates training and using the emotion classification algorithm in different practical application scenarios.

Description

Convolutional neural network emotion image classification method combined with emotion type attention loss

Technical Field

The invention belongs to the technical field of computer vision, and relates to a convolutional neural network emotion image classification method combined with emotion type attention loss.

Background

With the development of social media, people are increasingly inclined to record and share their moods by posting images online. These images carry emotion information and often also reflect the publisher's emotional tendencies and attitudes toward a certain class of things. Learning the attitudes of user groups from the emotional tendencies expressed in massive numbers of images is important for product recommendation, public opinion analysis and social media management. How to use computer algorithms to efficiently and automatically identify and analyze large numbers of images containing emotion information is therefore a problem to be solved.

Early emotion analysis methods classified emotions with low-level hand-crafted features such as color, line and texture, or used the SentiBank detector, which extracts a mid-level representation of the image based on adjective-noun pairs. Thanks to the strong feature extraction capability of deep learning, algorithms based on convolutional neural networks have achieved steadily better results in image classification, and emotion image classification has progressed as well. For example, You et al. designed a deep convolutional neural network in 2016 to classify image emotion and added a filtering function to the feedback mechanism; the filtering function removes wrongly annotated data from the training set and thereby effectively improves image emotion classification. In 2019, She et al. designed a weakly supervised coupled convolutional network based on the deep convolutional neural network ResNet to classify image emotion; it captures the regions that evoke emotion through class activation maps and adjusts the network by error back-propagation, which further improves the accuracy of image emotion classification.

However, in a general image classification task the images of different categories are relatively easy to tell apart, whereas emotion images of different categories have no clear boundary between them. The loss function therefore needs to apply different measures to the similarity distances of sample features in order to separate the classes. In this respect, Zhang et al. proposed in 2017 a face recognition technique that combines a deep convolutional neural network with center loss: a conventional loss function and a center loss function are used together as the supervisory signal during transfer learning, so that the extracted features cluster within each class and spread apart between classes, which improves the discriminative power of the face features output by the model. Yang et al. proposed in 2018 a method that trains a convolutional neural network with triplet constraints to position images effectively at the emotion level; by considering the relationships between different emotion polarities when calculating the correlation among features, it handles emotion image retrieval and classification as a multi-task problem. Although these studies consider the intra-class distance, they design no loss function targeted at emotion categories with unbalanced sample counts, which is insufficient in a real social media environment, where the number of emotion images is not uniformly distributed across categories. As a result, a model trained in the usual way cannot attend to the individual emotion categories in a targeted manner, and classification performance suffers.

Inspired by these works, a method combining cross entropy loss with emotion category attention center loss is designed. It enlarges the distance between emotion images of different classes while converging the intra-class distances to a different degree for each category, thereby improving the accuracy of emotion image classification.

Disclosure of Invention

To solve the above problems, the invention discloses an emotion image classification algorithm that combines emotion category attention center loss with cross entropy loss, so that the features of samples from different classes lie farther apart while samples of the same class lie closer together. This reduces the chance that, for lack of intra-class constraints, some samples fall near class boundaries far from their own class center, and therefore reduces the likelihood that such samples are misclassified. In addition, a different convergence strength is applied to each class, so that the model adjusts accordingly when the class sample counts are unbalanced, and the resulting network model classifies emotion images more accurately.

The specific steps of the invention are as follows:

Step 1, establishing the emotion category weight vector of the images: the labeled emotion image dataset is organized and split, and each class of images in the training split is treated as one dimension to obtain the emotion category weight vector W;

Step 2, establishing a deep network model: a deep network model such as ResNet-101 is selected, and the last classification layer of the original model is replaced by one whose output vector dimension equals the number of categories to be classified;

Step 3, loss function design: to address the unbalanced proportions of samples across classes and the convergence of inter-class distances by using the emotion category weight information of the images, the loss function includes the calculation of inter-class and intra-class distances and takes the weight proportions of the different emotion categories into account;

Step 4, training the model: the images split in Step 1 are preprocessed by scaling, random flipping and similar operations and fed into the network model of Step 2; the model is optimized with stochastic gradient descent, and the model parameters are learned from the loss computed by the loss function of Step 3;

Step 5, obtaining the emotion category of an image to be tested: images in the dataset are preprocessed by fixed-size scaling and center cropping, then fed into the model trained in Step 4 to obtain the corresponding emotion categories.

Compared with the prior art, the invention has the following prominent substantive features and notable technical progress:

The invention provides a convolutional neural network emotion image classification method combined with emotion type attention loss. The parameters of the convolutional neural network are updated through feedback from a combination of cross entropy loss and category attention center loss, so that the features of samples from different classes lie farther apart while samples of the same class lie closer together. This reduces the chance that, for lack of intra-class constraints, some samples fall near class boundaries far from their own class center, and therefore reduces the likelihood that such samples are misclassified. In addition, a different convergence strength is applied to each class, so that the model adjusts accordingly when the class sample counts are unbalanced, and the resulting network model achieves a better emotion classification effect when tested on emotion image datasets with unbalanced classes.

Drawings

The invention is described in further detail below with reference to the attached drawings and detailed description:

Fig. 1 is a schematic diagram of training the image emotion classification convolutional neural network with the present method.

Fig. 2 is an overall flowchart of emotion image classification based on the present method.

Detailed Description

The invention provides a convolutional neural network emotion image classification method combined with emotion type attention loss. The overall structure of the invention is shown in Fig. 1. The embodiment was run in a Windows 10 and Jupyter Notebook environment and trained on the FI dataset with the method of the invention, yielding an image emotion classification model that achieves high accuracy. Once the model is obtained, a test image can be fed into it to obtain the emotion classification result for that image. The implementation flow of the invention is shown in Fig. 2, and the implementation steps are as follows:

Step 2, establishing a deep network model: ResNet-101 is selected as the backbone network of the deep network model, and the last classification layer of the original model is replaced by one whose output vector dimension equals the number of categories to be classified, which gives the deep network model to be used;

Step 4, training the model: the images split in Step 1 are preprocessed by scaling, random flipping and similar operations and fed into the network model of Step 2; the model is optimized with stochastic gradient descent, and the model parameters are learned from the loss computed by the loss function of Step 3;

In Step 1, the emotion category weight vector W of the images is established:

The method is intended for emotion classification of images in large-scale real social networks. This example therefore uses a widely used public emotion dataset collected from Flickr and Instagram (hereinafter the FI dataset). Compared with traditional emotion datasets, it is large in scale and unbalanced across emotion categories, and so better matches a real network environment.

The training set of the FI dataset is organized according to its original 8-class labels, and the weight w_i of each class is obtained through the following weight calculation:

where N is the number of classes and N_i is the number of training samples of the i-th class; in this example N = 8 for the 8 emotion classes of the FI dataset. Once the weight coefficient of each class has been obtained, the coefficients are concatenated to give the emotion category weight vector W = [w_1, w_2, …, w_N];
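The weight formula itself is not reproduced in this text, so the following Python sketch only illustrates how a per-class weight vector W could be assembled from the class counts N_i. The inverse-frequency rule w_i = (sum of all N_j) / (N * N_i), the function name `class_weight_vector` and the toy labels are assumptions made for illustration, not the weighting defined by the invention.

```python
from collections import Counter


def class_weight_vector(labels, num_classes):
    """Build W = [w_1, ..., w_N] from integer labels in 0..N-1.

    ASSUMPTION: inverse-frequency weighting, w_i = total / (N * N_i).
    The source's own formula is not available in this text.
    """
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for c in range(num_classes):
        n_c = counts.get(c, 0)
        if n_c == 0:
            raise ValueError(f"class {c} has no training samples")
        weights.append(total / (num_classes * n_c))
    return weights


# Toy, unbalanced label list with three classes (hypothetical data).
labels = [0, 0, 0, 0, 1, 1, 2, 2]
W = class_weight_vector(labels, 3)
```

Under this assumed rule, classes with fewer samples receive larger weights, which is one common way to compensate for class imbalance.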

In Step 2, the deep network model is established:

This embodiment uses ResNet-101. A ResNet-101 model pre-trained on ImageNet is loaded, and its final classification layer, with input dimension 2048 × 1024 and output dimension 1 × 1024, is removed and replaced by a classification layer whose output vector dimension is the number of categories to be classified. The new final classification layer has input dimension 2048 × 8 and output dimension 1 × 8, namely the predicted probability of each class, and the emotion category at the position of the largest entry is taken as the emotion category output for the image;
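The last part of this step, taking the position of the largest entry of the 1 × 8 output as the predicted category, can be sketched in plain Python. The softmax normalization is an assumption (the text only says the output is a prediction probability per class), and the scores and function names are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (assumed normalization)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(probabilities):
    """Return the index of the largest probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical scores for the 8 emotion classes of one image.
logits = [0.2, 1.5, -0.3, 0.0, 2.1, 0.4, -1.0, 0.9]
probs = softmax(logits)
predicted = predict_class(probs)
```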

In Step 3, the loss function is designed:

To address the unbalanced proportions of samples across classes and the convergence of inter-class distances by using the emotion category weight information of the images, the loss function includes the calculation of inter-class and intra-class distances and takes the weight proportions of the different emotion categories into account. The designed loss function has two parts, emotion category attention center loss and cross entropy loss:

Emotion category attention center loss: a center loss augmented with emotion category weights is used as a constraint, which implements attention control by emotion category. The weighted center loss is constructed to pull samples of different classes together to different degrees. To address data class imbalance, it adds a weight term that reduces the influence of unbalanced class sample counts. The specific loss function is as follows:

where m is the number of images in each batch during training, generally an integer power of 2 such as 16 or 32 chosen according to the video memory of the experimental platform, and set to 16 in this example; W is the weight vector obtained in Step 1; f_i is the feature of the i-th image output by the backbone network of Step 2 before the classification layer; and the feature center of class y_i is the feature vector formed by the per-dimension mean of the features of the images of that class in the current batch.
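As the loss formula is not reproduced in this text, the following Python sketch shows one plausible reading of the definitions above: each sample's squared distance to the batch mean of its own class, scaled by that class's weight. The factor 1/2, the summation over the batch, and the function names are assumptions borrowed from the usual form of center loss, not a confirmed transcription of the invention's formula.

```python
def batch_centers(features, labels):
    """Per-class mean feature vector computed from the current batch."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def attention_center_loss(features, labels, class_weights):
    """ASSUMED form: 0.5 * sum_i W[y_i] * ||f_i - center(y_i)||^2."""
    centers = batch_centers(features, labels)
    total = 0.0
    for f, y in zip(features, labels):
        sq_dist = sum((v - c) ** 2 for v, c in zip(f, centers[y]))
        total += class_weights[y] * sq_dist
    return 0.5 * total


# Toy batch: two samples of class 0, one of class 1 (hypothetical data).
features = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
labels = [0, 0, 1]
loss_c = attention_center_loss(features, labels, class_weights=[1.0, 2.0])
```

In the toy batch the class 0 center is [1, 0], so each class 0 sample contributes a squared distance of 1, while the single class 1 sample coincides with its own center and contributes nothing.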

Cross entropy loss: cross entropy loss is used as the basic loss measure, with the aim of preserving inter-class distances. It pushes images of different emotion categories farther apart. The specific loss function is as follows:

where m is the number of images per batch during training, the same as m in L_c above; N is the number of emotion categories of the dataset, with N = 8 for the FI dataset in this example; x_i is the feature of the i-th image in the batch output by the backbone network of Step 2 before the classification layer; w and b are the weight and bias parameters of the classification layer; and the subscripts y_i and j denote categories of the classification layer, so that, for example, b_{y_i} is the value of the bias parameter in the classification layer when the i-th image of the batch is assigned to category y_i.
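The definitions above match the standard softmax cross entropy over a linear classification layer, which can be sketched in plain Python as follows. Summing (rather than averaging) over the batch and the function name are assumptions, since the formula itself is not reproduced in this text.

```python
import math


def cross_entropy_loss(features, labels, weights, biases):
    """Softmax cross entropy over a linear layer, summed over the batch.

    weights[j] is the weight vector of class j and biases[j] its bias.
    ASSUMPTION: the loss is summed over the m samples, not averaged.
    """
    total = 0.0
    for x, y in zip(features, labels):
        logits = [
            sum(w_k * x_k for w_k, x_k in zip(w_j, x)) + b_j
            for w_j, b_j in zip(weights, biases)
        ]
        peak = max(logits)  # log-sum-exp with the maximum factored out
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[y]
    return total


# One sample, two classes, all-zero layer: both classes are equally likely.
loss_s = cross_entropy_loss([[1.0, 0.0]], [0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

With an all-zero layer both classes get probability 1/2, so the loss equals log 2.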

The total loss function is:

where α is a hyperparameter that weights the loss function, set to 0.6 in this example.
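The three formulas of this step are not reproduced in this text. One reconstruction consistent with the symbol definitions given here, offered only as a hedged sketch, is shown below. The center symbol c_{y_i}, the names L_s and L, the factor 1/2, and the placement of α on the center loss term are assumptions taken from the usual formulation of center loss combined with softmax loss.

```latex
\begin{aligned}
L_c &= \frac{1}{2}\sum_{i=1}^{m} W_{y_i}\,\lVert f_i - c_{y_i}\rVert_2^{2} \\
L_s &= -\sum_{i=1}^{m} \log \frac{e^{\,w_{y_i}^{\top} x_i + b_{y_i}}}{\sum_{j=1}^{N} e^{\,w_j^{\top} x_i + b_j}} \\
L   &= L_s + \alpha\, L_c
\end{aligned}
```

Under this reading, α = 0.6 sets how strongly the weighted intra-class term pulls features toward their class centers relative to the cross entropy term.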

In Step 4, the model is trained:

The training set data is preprocessed by scaling, random flipping and similar operations; in this example the size parameter is set to 448 × 448 and the probability of random flipping is set to 0.5. The samples are fed into the network model in batches of a fixed size, and the remaining samples, fewer than the fixed size, form the final batch. Within limits, a larger batch size slightly improves the training effect of the model, but because of experimental platform limitations it is recommended to choose 8, 16 or 32; in this example the fixed batch size is set to 16. The output of the final classification layer is automatically compared with the labels of the input training samples, and the proportion of correctly classified samples among all training samples is recorded as the training set accuracy of the round. Meanwhile, once the output vector is obtained, the loss value of the current round is computed with the loss function of Step 3, and the loss value is passed to the optimizer and back-propagated to update the parameters of the model.
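The batching rule and the per-round accuracy described above can be sketched in a few lines of Python. The helper names and the toy sample list are hypothetical; only the batch size of 16 comes from the text.

```python
def make_batches(samples, batch_size=16):
    """Split samples into fixed-size batches; the remainder forms the last batch."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def accuracy(predictions, labels):
    """Proportion of samples whose predicted class equals the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# 50 hypothetical sample indices give three full batches and one partial batch.
batches = make_batches(list(range(50)), batch_size=16)
sizes = [len(b) for b in batches]
train_acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
```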

Considering convergence speed and convergence quality, the method uses stochastic gradient descent as the optimizer. The optimizer settings mainly comprise the initial learning rate and the momentum. The initial learning rate is generally chosen from values such as 0.1, 0.01, 0.0001 and 0.00001 according to how the model converges; this example recommends 0.0001, at which convergence is more stable. The momentum in principle takes a value between 0 and 1; this example recommends the default value of 0.9 for stochastic gradient descent. Because a fixed learning rate makes it harder for the deep network to find better parameters in the latter half of training, the method adds a strategy of reducing the learning rate at fixed rounds during training. It is recommended to reduce the learning rate 1 to 2 times within rounds 14 to 20 and to train for 50 to 80 rounds in total; once the loss value oscillates without decreasing noticeably, the model can be considered converged and training can end. In this example the optimizer halves the learning rate at round 14 and at round 20, and the model parameters are trained for 60 rounds to ensure effective convergence. Too few rounds may leave the model unconverged, while too many increase training time without improving the result.
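A minimal Python sketch of this schedule follows. It assumes that the learning rate is halved once when round 14 is reached and once more when round 20 is reached, with rounds counted from 0, and it uses one common formulation of the momentum update; both choices and the function names are illustrative assumptions.

```python
def learning_rate(epoch, base_lr=0.0001, milestones=(14, 20), factor=0.5):
    """Learning rate for a given round, halved at each milestone reached.

    ASSUMPTION: rounds are counted from 0 and milestones are rounds 14 and 20.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One scalar SGD update with momentum (one common formulation)."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


schedule = [learning_rate(e) for e in (0, 13, 14, 19, 20, 59)]
param, velocity = sgd_momentum_step(1.0, grad=2.0, velocity=0.0, lr=0.0001)
```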

After each round of training, the model parameters are frozen, and the validation set of the FI dataset is scaled and cropped to a fixed size and fed into the network model; in this example the crop size is set to 448 × 448. The model output is compared with the sample labels, and the proportion of correct samples, the validation set accuracy, is computed. If the validation accuracy of the current round exceeds the highest validation accuracy so far, it is recorded as the new highest accuracy and the model trained in the current round is saved. After all rounds have finished, the saved model with the highest validation accuracy is the trained optimal model;
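The model selection rule described above, keeping the parameters of the round with the highest validation accuracy, can be sketched as follows. The accuracies and the dictionary standing in for model parameters are hypothetical placeholders.

```python
import copy


def track_best(val_accuracies, states):
    """Return (round, accuracy, state) of the round with the best validation accuracy."""
    best_acc, best_state, best_epoch = -1.0, None, None
    for epoch, (acc, state) in enumerate(zip(val_accuracies, states)):
        if acc > best_acc:  # strictly higher, so ties keep the earlier round
            best_acc = acc
            best_state = copy.deepcopy(state)  # snapshot of the parameters
            best_epoch = epoch
    return best_epoch, best_acc, best_state


# Hypothetical per-round validation accuracies and parameter snapshots.
accs = [0.61, 0.66, 0.64, 0.69, 0.68]
states = [{"round": r} for r in range(len(accs))]
best_epoch, best_acc, best_state = track_best(accs, states)
```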

In Step 5, the emotion category of the image to be tested is obtained:

The test set of the FI dataset, or any image, is preprocessed by fixed-size scaling and center cropping in the same way as the validation images in Step 4, and can then be fed into the model one image at a time or in batches of a fixed number. In this example the size for fixed-size scaling and center cropping is set to 448 × 448; to improve processing efficiency under the same experimental conditions, a batch size of 16 is recommended for the test set data, and the test images are fed to the model batch by batch. The output of the classification layer is compared with the sample labels, and the proportion of correct samples, the test set accuracy, is computed. The emotion category corresponding to the output is the image emotion category determined by the model.
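The geometry of fixed-size scaling followed by center cropping can be sketched without any image library. The rule of scaling the shorter side to 448 before cropping is an assumption (the text only gives the 448 × 448 size), and the function names and the 1024 × 768 input are hypothetical.

```python
def scale_to_cover(width, height, target=448):
    """Scale so the shorter side equals target (assumed resizing rule)."""
    ratio = target / min(width, height)
    return round(width * ratio), round(height * ratio)


def center_crop_box(width, height, target=448):
    """Return (left, top, right, bottom) of a centered target x target crop."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


# A hypothetical 1024 x 768 image is scaled first, then cropped at its center.
scaled = scale_to_cover(1024, 768)
box = center_crop_box(*scaled)
```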

In this example the model was evaluated on the test set of the FI dataset, giving an accuracy of 0.7087. This is higher than the best result among current methods of the same kind, the 0.6837 of "Adaptive Deep Metric Learning for Affective Image Retrieval and Classification", published this year in the journal IEEE Transactions on Multimedia. It is also higher than the best current multi-class result on this dataset that likewise uses only label information as reference: the accuracy of 0.7007 of "WSCNet: Weakly Supervised Coupled Networks for Visual Sentiment Classification and Detection", published this year in the high-level journal IEEE Transactions on Multimedia.

Claims

1. The convolutional neural network emotion image classification method combined with emotion type attention loss is characterized by comprising the following steps of:

Step 1, establishing emotion type weight vectors of images: taking each type of image in the dataset as one dimension to obtain an emotion type weight vector W;

Step 2, establishing a deep network model: the deep network model ResNet-101 is selected and used as the basic backbone network, and the last classification layer of the original model is replaced by one whose output vector dimension equals the number of categories to be classified, so as to generate emotion categories;

Step 4, training the model: the images are preprocessed and then fed into the network model, which is optimized with stochastic gradient descent so that the model parameters are learned;

Step 5, obtaining emotion types of the image to be detected: the images in the database are input into the model trained in the step 4 after being preprocessed in the same way as the step 4, and the corresponding emotion types are obtained;

In Step 1, the emotion category weights of the images are established as follows: images of the same class in an image training set with N emotion categories are treated as one dimension C_i, i = 1, 2, …, N; the number of samples N_i in each dimension is counted, i = 1, 2, …, N; the weight of each class is then calculated, and the weights are combined into the weight vector W = [w_1, w_2, …, w_N], where each w_i is calculated as follows:

In step 3, the loss function contains two losses, namely an emotion class attention center loss and a cross entropy loss;

4.1 emotion category attention center loss

The emotion category attention center loss adds a weight term to reduce the influence of sample class imbalance; the specific loss function is as follows:

wherein m is the number of images in each batch during training, W is the obtained weight vector, f_i is the feature of the i-th image output by the basic backbone network before the classification layer, and the feature center of class y_i is the feature vector formed by the per-dimension mean of the features of the images of that class in the current batch;

4.2 Cross entropy loss

Cross entropy loss is used as the basic loss measure, with the aim of preserving inter-class distances; the constructed cross entropy loss pushes images of different emotion categories farther apart; the specific loss function is as follows:

where m is the number of images per batch during training, N is the number of emotion categories of the dataset, x_i is the feature of the i-th image in the batch output by the ResNet-101 basic backbone network before the classification layer, w and b are the weight and bias parameter values of the classification layer, the subscripts y_i and j denote categories of the classification layer, and b_{y_i} is the value of the bias parameter in the classification layer when the i-th image of the batch is assigned to category y_i;

The total loss function is:

wherein α is a hyperparameter that weights the loss function, and is set to 0.6;

and data preprocessing is performed on the images to be trained by random center cropping and random horizontal flipping, the model is trained by stochastic gradient descent, and the trained model is saved.