WO2023098912A1 - Image processing method and apparatus, storage medium, and electronic device - Google Patents

Image processing method and apparatus, storage medium, and electronic device

Info

Publication number
WO2023098912A1
WO2023098912A1 (PCT/CN2022/136363)
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
feature
classification network
emotional
Prior art date
Application number
PCT/CN2022/136363
Other languages
French (fr)
Chinese (zh)
Inventor
赵鹤
陈奕名
Original Assignee
新东方教育科技集团有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 新东方教育科技集团有限公司 filed Critical 新东方教育科技集团有限公司
Publication of WO2023098912A1 publication Critical patent/WO2023098912A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of image processing, and in particular, to an image processing method, device, storage medium, and electronic equipment.
  • Emotion recognition is an inevitable part of any interpersonal communication. People observe the emotional changes of others to confirm whether their actions are reasonable and effective. With the continuous advancement of technology, emotion recognition can use different features to detect and recognize, such as face, voice, EEG, and even speech content. Among these features, facial expressions are usually easier to observe.
  • the present disclosure provides an image processing method, device, storage medium and electronic equipment.
  • the first aspect of the present disclosure provides an image processing method, the method comprising:
  • the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotion information based on the feature image.
  • the obtaining the emotional information based on the feature image includes:
  • the feature image is input into a Transformer encoder to obtain a feature vector corresponding to the target image, and the Transformer encoder includes a multi-head self-attention module, a multi-layer perceptron and a layer normalization module;
  • the feature vector is input into the fully connected layer to obtain the emotional information represented by the facial information in the target image.
  • the training of the emotion classification network includes:
  • a training set is acquired, where the training set includes a plurality of training images, and each training image includes facial information and an emotional label corresponding to the facial information;
  • the target training image is input into the RedNet feature extractor in the initial emotion classification network to obtain the feature image of the target training image;
  • the feature image of the target training image is input into the Transformer encoder to obtain the corresponding feature vector of the target training image;
  • the feature vector corresponding to the target training image is input into the fully connected layer, and the prediction label corresponding to the emotional information represented by the facial information in the target training image is obtained;
  • parameters in the emotion classification network are adjusted to obtain a trained emotion classification network.
  • the fully connected layer includes an attention factor, and inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image includes:
  • the feature vector corresponding to the target training image is input into the fully connected layer to obtain the prediction label corresponding to the emotional information represented by the facial information in the target training image, as well as the weight information of the target training image;
  • the parameters in the emotion classification network are adjusted based on a cross-entropy loss function and a regularization loss.
  • the method also includes:
  • a test set is acquired, where the test set includes a plurality of test images, and each test image includes facial information and a pre-marked emotional label corresponding to the facial information;
  • the target test image is input into the RedNet feature extractor in the emotion classification network after the training, to obtain the feature image of the target test image;
  • the feature image of the target test image is input to the Transformer encoder to obtain the corresponding feature vector of the target test image;
  • a second aspect of the present disclosure provides an image processing device, the device comprising:
  • An acquisition module configured to acquire a target image including facial information
  • the emotion determination module is used to input the target image into the pre-trained emotional classification network to obtain the emotional information represented by facial information in the target image;
  • the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotion information based on the feature image.
  • the emotion determination module is specifically used for:
  • the feature image is input into a Transformer encoder to obtain a feature vector corresponding to the target image, and the Transformer encoder includes a multi-head self-attention module, a multi-layer perceptron and a layer normalization module;
  • the feature vector is input into the fully connected layer to obtain the emotional information represented by the facial information in the target image.
  • the device includes:
  • the second acquisition module is used to acquire a training set, the training set includes a plurality of training images, and each training image in the plurality of training images includes facial information and a pre-marked emotional label corresponding to the facial information;
  • a feature extraction module configured to, for any target training image in the training set, input the target training image into the RedNet feature extractor in the initial emotion classification network to obtain the feature image of the target training image;
  • a feature vector determination module configured to input the feature image of the target training image into the Transformer encoder to obtain the corresponding feature vector of the target training image
  • a prediction module configured to input a feature vector corresponding to the target training image into a fully connected layer, to obtain a prediction label corresponding to emotional information represented by facial information in the target training image;
  • the adjustment module is used to adjust the parameters in the emotion classification network according to the predicted label and the emotion label pre-marked in the target training image, so as to obtain the trained emotion classification network.
  • a third aspect of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of any one of the methods described in the first aspect of the present disclosure are implemented.
  • a fourth aspect of the present disclosure provides an electronic device, including:
  • a processor configured to execute the computer program in the memory to implement the steps of any one of the methods in the first aspect of the present disclosure.
  • by using a RedNet structure composed of involution operators as the feature extractor, the image input to the emotion classification network is preliminarily processed, the local details of the image are extracted, and the resulting feature image is fed into the downstream modules of the emotion classification network, which effectively improves the final accuracy of the emotional information output by the network.
  • Fig. 1 is a flowchart of an image processing method according to an exemplary embodiment;
  • Fig. 2 is a schematic diagram of an emotion classification network in a training phase according to an exemplary embodiment;
  • Fig. 3 is a schematic diagram of an emotion classification network in a testing phase according to an exemplary embodiment;
  • Fig. 4 is a block diagram of an image processing device according to an exemplary embodiment;
  • Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment;
  • Fig. 6 is another block diagram of an electronic device according to an exemplary embodiment.
  • Emotion recognition is an inevitable part of any interpersonal communication. People observe the emotional changes of others to confirm whether their actions are reasonable and effective. With the continuous advancement of technology, emotion recognition can use different features to detect and recognize, such as face, voice, EEG, and even speech content. Among these features, facial expressions are usually easier to observe.
  • the facial expression recognition system mainly consists of three stages, namely face detection, feature extraction, and expression recognition.
  • in the face detection stage, multiple face detectors, such as the MTCNN and RetinaFace networks, are used to locate faces in complex scenes, and the detected faces can be further aligned.
  • for feature extraction, past studies have proposed various methods for capturing the facial geometry and appearance features induced by facial expressions.
  • according to feature type, these can be divided into engineered features and learning-based features.
  • engineered features can be further divided into texture-based features, geometry-based global features, and so on.
  • Fig. 1 is a flowchart of an image processing method according to an exemplary embodiment.
  • the execution subject of the method may be a terminal such as a mobile phone, a computer, or a notebook computer, or may be a server.
  • as shown in Fig. 1, the method includes:
  • S101. Acquire a target image including facial information.
  • the face information in the target image may only include the face information of one person, or may be the face information of multiple people.
  • S102 Input the target image into a pre-trained emotion classification network to obtain emotional information represented by facial information in the target image.
  • the emotion information may represent probability values of emotions such as happiness, sadness, crying, laughing, etc. corresponding to the face information of the person in the target image.
  • the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotion information based on the feature image.
  • the involution operator has channel invariance and spatial specificity; its design is the opposite of convolution, in that the kernel is shared across the channel dimension while a spatially specific kernel is used in the spatial dimension for more flexible modeling.
  • compared with convolution, which shares weights across spatial positions, the involution kernel pays different attention to different spatial positions, so it can mine diverse target features more effectively without increasing the amount of parameter computation;
  • the sharing and transfer of feature weights across different spatial positions is exactly what the spatially specific design principle pursues.
  • this redesign from convolution to involution reallocates computing power, shifting the limited computation to where it can deliver the most performance, so RedNet composed of involution operators is used as the feature extractor, achieving better results than ResNet with a smaller number of parameters.
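  • As an illustration of the channel-shared, spatially specific kernel described above, the following is a minimal sketch of an involution layer (a simplified PyTorch rendition of the involution operator used by RedNet; the kernel size, group count, and reduction ratio are illustrative assumptions, not values taken from this disclosure):

```python
import torch.nn as nn

class Involution2d(nn.Module):
    """Minimal involution layer sketch: a kernel is generated per spatial
    position and shared across the channels within each group."""
    def __init__(self, channels, kernel_size=7, stride=1, groups=16, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # Kernel-generation branch: reduce channels, then emit k*k weights per group
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size ** 2 * groups, 1)
        self.pool = nn.AvgPool2d(stride, stride) if stride > 1 else nn.Identity()
        self.unfold = nn.Unfold(kernel_size, padding=(kernel_size - 1) // 2, stride=stride)

    def forward(self, x):  # x: (B, C, H, W); H and W assumed divisible by the stride
        b, c = x.shape[:2]
        # A distinct kernel for every output position (spatial specificity)
        kernel = self.span(self.reduce(self.pool(x)))          # (B, k*k*G, H', W')
        _, _, h_out, w_out = kernel.shape
        kernel = kernel.view(b, self.g, self.k ** 2, h_out, w_out).unsqueeze(2)
        # Unfold input patches and weight them with the position-specific kernels
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k ** 2, h_out, w_out)
        out = (kernel * patches).sum(dim=3)                    # sum over the k*k window
        return out.view(b, c, h_out, w_out)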
  • the image input to the emotion classification network is preliminarily processed, the local details of the picture are extracted, and the resulting feature image is fed into the downstream modules of the emotion classification network, which effectively improves the final accuracy of the emotional information output by the network.
  • the obtaining the emotional information based on the feature image includes:
  • the feature image is input into a Transformer encoder to obtain a feature vector corresponding to the target image, and the Transformer encoder includes a multi-head self-attention module, a multi-layer perceptron and a layer normalization module;
  • the feature vector is input into the fully connected layer to obtain the emotional information represented by the facial information in the target image.
  • the feature image may include multiple feature sub-image patches
  • inputting the feature image into the Transformer encoder includes: stretching each of the multiple feature sub-image patches into a vector and inputting the vectors into the Transformer encoder respectively, as sketched below.
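  • As a sketch of this stretching step (illustrative only; the patch size of 16 is an assumption, not a value from this disclosure), the feature image can be cut into non-overlapping patches and each patch flattened into a vector before entering the encoder:

```python
import torch

def patches_to_tokens(feat: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Cut a feature image (B, C, H, W) into non-overlapping patches and
    stretch each patch into a vector, yielding (B, num_patches, C*patch*patch)."""
    b, c, h, w = feat.shape
    p = feat.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    p = p.permute(0, 2, 3, 1, 4, 5).contiguous()              # group by patch position
    return p.view(b, -1, c * patch * patch)                   # one token per patch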
  • the multi-head self-attention module linearly maps the concatenated outputs of multiple attention heads to the desired dimension; multiple attention heads can be used to learn local and global dependencies in images.
  • the multi-layer perceptron contains two layers with Gaussian Error Linear Unit (GELU) activations, and layer normalization (LN) is used to improve training time and generalization performance; residual connections are applied after each block because they allow gradients to flow directly through the network without passing through nonlinear activations.
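  • Put together, one encoder block of this kind might look like the following sketch (a generic pre-norm Transformer block; the embedding dimension, head count, MLP ratio, and dropout rate are illustrative assumptions):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Transformer encoder block: layer normalization, multi-head self-attention,
    and a two-layer GELU MLP, each wrapped in a residual connection."""
    def __init__(self, dim=512, heads=8, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim), nn.Dropout(dropout),
        )

    def forward(self, x):                      # x: (B, num_patches, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # residual around self-attention
        return x + self.mlp(self.norm2(x))     # residual around the MLP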
  • in a facial expression recognition system, key features can be extracted and learned through training on datasets.
  • many cues come from certain parts of the face, such as the mouth and eyes, while other parts, such as the background and hair, play a small role in the output; this means that an ideal model framework should focus only on the important parts of the face, be less sensitive to other facial areas, and generalize better to special cases such as occlusion and blur.
  • the present disclosure therefore adopts a Transformer-based framework for facial expression recognition which takes the above observations into account and utilizes an attention mechanism to focus on salient parts of the face; using Transformer encoding instead of deep convolutional models can achieve very high accuracy.
  • the training of the emotion classification network includes:
  • a training set is acquired, where the training set includes a plurality of training images, and each training image includes facial information and an emotional label corresponding to the facial information;
  • the target training image is input into the RedNet feature extractor in the initial emotion classification network to obtain the feature image of the target training image;
  • the feature image of the target training image is input into the Transformer encoder to obtain the corresponding feature vector of the target training image;
  • the feature vector corresponding to the target training image is input into the fully connected layer, and the prediction label corresponding to the emotional information represented by the facial information in the target training image is obtained;
  • parameters in the emotion classification network are adjusted to obtain a trained emotion classification network.
  • through the above training process, the untrained initial emotion classification network is trained into an emotion classification network that can accurately identify and classify the emotions represented by the facial information in an image.
  • the fully connected layer includes an attention factor, and inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image includes:
  • the feature vector corresponding to the target training image is input into the fully connected layer to obtain the prediction label corresponding to the emotional information represented by the facial information in the target training image, as well as the weight information of the target training image;
  • the true quality of the samples in the training set is measured by adding an attention factor to the fully connected layer: a high value indicates that a sample performs well, has high accuracy, and plays a large role during training, while a low value indicates that the sample performs poorly, has low accuracy, and contributes little during training.
  • in this way, the neural network focuses on the samples whose actual effects are better and more useful, which can effectively improve the accuracy of training, as sketched below.
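  • One plausible way to realize this sample weighting (a sketch following the logit-weighted cross-entropy scheme of the Self-Cure Network; the balancing factor gamma is an illustrative assumption) is to scale each sample's logits by its attention factor before computing the loss:

```python
import torch
import torch.nn.functional as F

def weighted_loss(logits: torch.Tensor, alpha: torch.Tensor,
                  labels: torch.Tensor, rr_loss: torch.Tensor,
                  gamma: float = 1.0) -> torch.Tensor:
    """logits: (B, classes); alpha: (B,) per-sample attention factors;
    rr_loss: the rank-regularization term, computed separately."""
    # Low-weight (unreliable) samples produce smaller logits and hence
    # smaller gradients, so training focuses on the effective samples
    ce = F.cross_entropy(alpha.unsqueeze(1) * logits, labels)
    return ce + gamma * rr_loss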
  • the training of the emotion classification network also includes inputting the training set into an SCN (Self-Cure Network) to automatically repair wrong labels in the samples.
  • the SCN includes a self-attention importance weighting module and a relabeling module.
  • the self-attention importance weighting module is used to generate a weight α_i for each sample x_i in the training set as a measure of the importance of that sample.
  • the self-attention importance weighting module is trained using RR-loss (rank regularization loss).
  • the specific calculation of RR-loss includes: sorting a batch of samples according to α_i, and dividing the samples into a high-score group and a low-score group according to the ratio β.
  • here L_RR denotes the RR-loss, α_H denotes the average weight of the high group, and α_L denotes the average weight of the low group; they satisfy the following formula:
  • L_RR = max(0, δ1 - (α_H - α_L))
  • where δ1 is a fixed or learnable margin used to separate the mean weights of the high group and the low group.
  • dividing the sample into high-scoring and low-scoring groups according to the ratio ⁇ includes:
  • the best grouping method should satisfy argmax_M distance(β_H, β_L),
  • where β_H denotes the set of high-group sample weights and β_L denotes the set of low-group sample weights.
  • the formula used for the distance can be argmax_M (min_{i∈[0,M)} α_i - max_{i∈[M,N)} α_i), i.e., the split M is chosen to maximize the gap between the smallest weight in the high group and the largest weight in the low group.
  • grouping is performed according to the actual weight of each batch of samples, which can avoid instability in training while realizing adaptive grouping.
  • an adaptive grouping method is proposed, grouping according to the actual weight of each batch of samples, which effectively improves the accuracy of the weight output by the model.
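  • The following sketch combines the ranking, the adaptive split, and the margin-based loss described above (the margin value delta1 = 0.15 is an illustrative assumption, not a value from this disclosure; a batch of at least two samples is assumed):

```python
import torch

def rank_regularization_loss(alpha: torch.Tensor, delta1: float = 0.15) -> torch.Tensor:
    """alpha: (N,) sample weights of one batch. Sorts the weights, picks the
    split M with the largest gap between the smallest high-group weight and
    the largest low-group weight, and applies the margin loss."""
    alpha, _ = torch.sort(alpha, descending=True)
    gaps = alpha[:-1] - alpha[1:]            # alpha[M-1] - alpha[M] for each split M
    m = int(torch.argmax(gaps)) + 1          # adaptive grouping: maximize the gap
    alpha_h, alpha_l = alpha[:m].mean(), alpha[m:].mean()
    return torch.clamp(delta1 - (alpha_h - alpha_l), min=0.0)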
  • the method also includes:
  • a test set is acquired, where the test set includes a plurality of test images, and each test image includes facial information and a pre-marked emotional label corresponding to the facial information;
  • the target test image is input into the RedNet feature extractor in the trained emotional classification network to obtain the feature image of the target test image;
  • the feature image of the target test image is input to the Transformer encoder to obtain the corresponding feature vector of the target test image;
  • the CNN model, the attention model, and the Transformer model are all mathematically maximum likelihood estimation models.
  • the maximum likelihood estimation model is unbiased and the weights are fixed.
  • any model weights in the real world should tend to be Gaussian rather than fixed. Therefore, the maximum likelihood estimation cannot effectively estimate the uncertainty of the data.
  • human expressions are extremely complex, for example panic mixed with surprise, or tears from laughter; these are mixtures of different expressions rather than a single expression, so using a model with fixed weights to estimate an uncertain task is itself a contradiction.
  • MC-dropout is a way of understanding dropout based on Bayesian theory, which interprets dropout as a Bayesian approximation of a Gaussian process; it gives ordinary models the ability to evaluate uncertainty in the manner of Bayesian neural networks.
  • using the MC-dropout layer only requires testing an input n times during testing to obtain a set of sampling points, from which the mean and variance are calculated; the variance is used to predict the uncertainty of the samples in the test set, and the larger the variance, the higher the uncertainty of the prediction.
  • when testing, the backbone outputs the features O_b ∈ R^{1×p}, and the fully connected weight W_fc is sampled n times; the weights obtained by sampling are denoted Ŵ_fc^(1), ..., Ŵ_fc^(n). The MC-dropout layer can then be defined by the following formulas:
  • O_i = O_b · Ŵ_fc^(i), i = 1, ..., n
  • O_mean = mean_n(O_1, ..., O_n)
  • O_var = variance_n(O_1, ..., O_n)
  • where mean_n(·) and variance_n(·) compute the mean and the sample variance over the n sampled outputs, so O_var is the sample variance around O_mean.
  • the uncertainty of the prediction results can be measured based on the maximum value of O_var: the larger the variance, the higher the uncertainty.
  • dropout can also be implemented in other layers; it is only necessary to ensure that the computation before that layer runs once, so that when the input reaches the MC-dropout layer the repeated sampling becomes a matrix operation.
  • Bayesian estimation can be used for uncertainty analysis by replacing the fully connected layer with the MC-dropout layer during the testing phase.
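  • A minimal sketch of this test-time procedure (assuming a PyTorch fully connected head; the sample count n and dropout rate p are illustrative assumptions) keeps dropout active during evaluation and repeats only the final layer:

```python
import torch

@torch.no_grad()
def mc_dropout_predict(fc: torch.nn.Module, features: torch.Tensor,
                       n: int = 30, p: float = 0.5):
    """features: backbone output O_b of shape (1, p_dim); the backbone runs
    once, and only the dropout + fully connected step is repeated n times."""
    drop = torch.nn.Dropout(p)
    drop.train()                                    # keep dropout active at test time
    samples = torch.stack([fc(drop(features)) for _ in range(n)])  # (n, 1, classes)
    o_mean = samples.mean(dim=0)                    # averaged prediction
    o_var = samples.var(dim=0)                      # larger variance => more uncertain
    return o_mean, o_var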
  • the emotion classification network 20 comprises an input module 21, a RedNet feature extractor 22, a Transformer encoder 23, a fully connected layer 24, and a classifier 25 connected in series;
  • the training of the emotion classification network 20 includes: the training set is input through the input module 21 into the RedNet feature extractor 22 to obtain multiple feature image patches of any training image in the training set; the multiple feature image patches are input into the Transformer encoder 23 to obtain the feature vector of that training image; the feature vector is input into the fully connected layer 24 to obtain the probability value of each emotion category represented by the facial information in the image; the probability values are input into the classifier 25 to obtain the emotion category with the highest probability; and, according to this emotion category and the pre-marked label information in the training set, the parameters of the emotion classification network 20 are adjusted based on the cross-entropy loss function and the regularization loss to obtain the trained emotion classification network.
  • the present disclosure also provides a schematic diagram of an emotion classification network in a test phase according to an exemplary embodiment as shown in FIG. 3 .
  • the emotion classification network 30 includes a trained input module 31, RedNet feature extractor 32, Transformer encoder 33, MC-dropout layer 34, and classifier 35.
  • the testing of the emotion classification network 30 includes: the test set is input through the input module 31 into the RedNet feature extractor 32 to obtain multiple feature image patches of any test image in the test set; the multiple feature image patches are input into the Transformer encoder 33 to obtain the feature vector of that test image; and the feature vector is input into the MC-dropout layer 34 for multiple rounds of sampling to obtain the output of each sampling.
  • in the present disclosure, RedNet and the Transformer are combined as feature extractors for the first time, together with Bayesian-based MC-dropout; in addition, in order to deal with the blurred pictures and labels contained in the training set, the training method in SCN is utilized and further improved.
  • FIG. 4 is a block diagram of an image processing device 40 according to an exemplary embodiment.
  • the device 40 can be used as a part of a terminal such as a mobile phone, or as a part of a server.
  • the device 40 includes:
  • the first obtaining module 41 is used to obtain the target image comprising facial information
  • the emotion determination module 42 is used to input the target image into the emotional classification network that has been trained in advance to obtain the emotional information represented by facial information in the target image;
  • the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotion information based on the feature image.
  • the emotion determination module 42 is specifically used for:
  • the feature image is input into a Transformer encoder to obtain a feature vector corresponding to the target image, and the Transformer encoder includes a multi-head self-attention module, a multi-layer perceptron and a layer normalization module;
  • the feature vector is input into the fully connected layer to obtain the emotional information represented by the facial information in the target image.
  • the device 40 also includes:
  • the second acquisition module is used to acquire a training set, the training set includes a plurality of training images, and each training image in the plurality of training images includes facial information and a pre-marked emotional label corresponding to the facial information;
  • the first feature extraction module is configured to, for any target training image in the training set, input the target training image into the RedNet feature extractor in the initial emotion classification network to obtain the feature image of the target training image;
  • the first feature vector determination module is used to input the feature image of the target training image into the Transformer encoder to obtain the feature vector corresponding to the target training image;
  • a prediction module configured to input a feature vector corresponding to the target training image into a fully connected layer, to obtain a prediction label corresponding to emotional information represented by facial information in the target training image;
  • the adjustment module is used to adjust the parameters in the emotion classification network according to the predicted label and the emotion label pre-marked in the target training image, so as to obtain the trained emotion classification network.
  • the fully connected layer includes an attention factor
  • the prediction module is specifically used for:
  • the feature vector corresponding to the target training image is input into the fully connected layer, and the prediction label corresponding to the emotional information represented by the facial information in the target training image is obtained, and the weight information of the target training image;
  • the adjustment module is specifically used for:
  • the parameters in the emotion classification network are adjusted based on a cross-entropy loss function and a regularization loss.
  • the device 40 also includes:
  • the third acquisition module is used to acquire a test set, the test set includes a plurality of test images, and each test image in the plurality of test images includes facial information and a pre-marked emotional label corresponding to the facial information;
  • the second feature extraction module is configured to, for any target test image in the test set, input the target test image into the RedNet feature extractor in the trained emotion classification network to obtain the feature image of the target test image;
  • the second feature vector determination module is configured to input the feature image of the target test image into the Transformer encoder to obtain the feature vector corresponding to the target test image;
  • the first determination module is used to input the feature vector corresponding to the target test image into the MC-dropout layer, and determine the uncertainty information of the target test image;
  • the second determination module is used to determine whether the uncertainty information of the plurality of test images satisfies a preset rule, and if the preset rule is satisfied, the trained emotion classification network is taken as the finally trained emotion classification network.
  • Fig. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment.
  • the electronic device 500 may include: a processor 501 and a memory 502 .
  • the electronic device 500 may also include one or more of a multimedia component 503 , an input/output (I/O) interface 504 , and a communication component 505 .
  • the processor 501 is used to control the overall operation of the electronic device 500, so as to complete all or part of the steps in the above-mentioned image processing method.
  • the memory 502 is used to store various types of data to support the operation of the electronic device 500; for example, these data may include instructions for any application or method operating on the electronic device 500 and application-related data, such as images in the training set and the test set.
  • the memory 502 can be realized by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Multimedia components 503 may include screen and audio components.
  • the screen can be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals.
  • an audio component may include a microphone for receiving external audio signals.
  • the received audio signal may be further stored in the memory 502 or sent through the communication component 505 .
  • the audio component also includes at least one speaker for outputting audio signals.
  • the I/O interface 504 provides an interface between the processor 501 and other interface modules, which may be a keyboard, a mouse, buttons, and the like. These buttons can be virtual buttons or physical buttons.
  • the communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices.
  • the communication component 505 may include: a Wi-Fi module, a Bluetooth module, an NFC module and the like.
  • the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field Programmable Gate Arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above-mentioned image processing method.
  • a computer-readable storage medium including program instructions, and when the program instructions are executed by a processor, the steps of the above-mentioned image processing method are realized.
  • the computer-readable storage medium may be the above-mentioned memory 502 including program instructions, and the above-mentioned program instructions can be executed by the processor 501 of the electronic device 500 to complete the above-mentioned image processing method.
  • Fig. 6 is a block diagram of an electronic device 600 according to an exemplary embodiment.
  • the electronic device 600 may be provided as a server.
  • the electronic device 600 includes a processor 622 , the number of which may be one or more, and a memory 632 for storing computer programs executable by the processor 622 .
  • the computer program stored in memory 632 may include one or more modules each corresponding to a set of instructions.
  • the processor 622 may be configured to execute the computer program to perform the above-mentioned image processing method.
  • the electronic device 600 may further include a power supply component 626 and a communication component 650, the power supply component 626 may be configured to perform power management of the electronic device 600, and the communication component 650 may be configured to implement communication of the electronic device 600, for example, wired or wireless communication.
  • the electronic device 600 may further include an input/output (I/O) interface 658 .
  • the electronic device 600 can operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, and so on.
  • a computer-readable storage medium including program instructions, and when the program instructions are executed by a processor, the steps of the above-mentioned image processing method are implemented.
  • the non-transitory computer-readable storage medium may be the above-mentioned memory 632 including program instructions, and the above-mentioned program instructions can be executed by the processor 622 of the electronic device 600 to implement the above-mentioned image processing method.
  • a computer program product is also provided, comprising a computer program executable by a programmable device, the computer program having code portions for performing the above-mentioned image processing method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An image processing method and apparatus, a storage medium, and an electronic device, relating to the field of image processing. The method comprises: acquiring a target image comprising facial information (S101); and inputting the target image into a pre-trained emotion classification network to obtain emotion information represented by the facial information of the target image (S102), wherein the emotion classification network comprises a RedNet feature extractor composed of involution (inner convolution) operators, and the RedNet feature extractor is used for obtaining a feature image according to the target image so as to obtain the emotion information on the basis of the feature image. A RedNet structure composed of involution operators is used as a feature extractor, the image input into the emotion classification network is preliminarily processed, local details of the image are extracted, and the obtained feature image is input into a downstream module of the emotion classification network, so that the final accuracy of the emotion information output by the emotion classification network is effectively improved.

Description

Image processing method, device, storage medium and electronic device
Cross-Reference to Related Applications
This disclosure claims priority to the Chinese patent application filed with the China Patent Office on December 02, 2021, with application number 202111473999.8 and entitled "Image Processing Method, Device, Storage Medium, and Electronic Equipment", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of image processing, and in particular, to an image processing method, device, storage medium, and electronic device.
Background
Emotion recognition is an unavoidable part of any interpersonal communication: people observe the emotional changes of others to confirm whether their own actions are reasonable and effective. With the continuous advancement of technology, emotion recognition can use different features for detection and recognition, such as the face, voice, EEG, and even speech content; among these features, facial expressions are usually the easiest to observe.
In related technologies, in recent years the application of deep learning, and especially the emergence of the ViT (Vision Transformer) model, has broken the monopoly of convolution- and pooling-dominated networks on classification tasks. However, the underlying convolution part of the ViT model is too simple: the bottom layers of the network make poor use of fine-grained image information, and the intermediate processing stages lack transformations that progressively reduce the feature map size.
Summary of the Invention
In order to solve the problems existing in the related technologies, the present disclosure provides an image processing method, device, storage medium, and electronic device.
To achieve the above object, a first aspect of the present disclosure provides an image processing method, the method comprising:
acquiring a target image including facial information;
inputting the target image into a pre-trained emotion classification network to obtain emotional information represented by the facial information in the target image;
wherein the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotional information based on the feature image.
Optionally, obtaining the emotional information based on the feature image includes:
inputting the feature image into a Transformer encoder to obtain a feature vector corresponding to the target image, the Transformer encoder including a multi-head self-attention module, a multi-layer perceptron, and a layer normalization module;
inputting the feature vector into a fully connected layer to obtain the emotional information represented by the facial information in the target image.
Optionally, the training of the emotion classification network includes:
acquiring a training set, the training set including a plurality of training images, each of which includes facial information and a pre-marked emotional label corresponding to the facial information;
for any target training image in the training set, inputting the target training image into the RedNet feature extractor in an initial emotion classification network to obtain a feature image of the target training image;
inputting the feature image of the target training image into the Transformer encoder to obtain a feature vector corresponding to the target training image;
inputting the feature vector corresponding to the target training image into the fully connected layer to obtain a prediction label corresponding to the emotional information represented by the facial information in the target training image;
adjusting parameters in the emotion classification network according to the prediction label and the emotional label pre-marked on the target training image, so as to obtain a trained emotion classification network.
Optionally, the fully connected layer includes an attention factor, and inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the prediction label corresponding to the emotional information represented by the facial information in the target training image includes:
inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the prediction label corresponding to the emotional information represented by the facial information in the target training image, as well as weight information of the target training image;
adjusting the parameters in the emotion classification network according to the prediction label and the emotional label pre-marked on the training image includes:
adjusting the parameters in the emotion classification network based on a cross-entropy loss function and a regularization loss, according to the prediction label, the emotional label pre-marked on the target training image, and the weight information of the target training image.
Optionally, the method further includes:
acquiring a test set, the test set including a plurality of test images, each of which includes facial information and a pre-marked emotional label corresponding to the facial information;
for any target test image in the test set, inputting the target test image into the RedNet feature extractor in the trained emotion classification network to obtain a feature image of the target test image;
inputting the feature image of the target test image into the Transformer encoder to obtain a feature vector corresponding to the target test image;
inputting the feature vector corresponding to the target test image into an MC-dropout layer to determine uncertainty information of the target test image;
determining whether the uncertainty information of the plurality of test images satisfies a preset rule, and if the preset rule is satisfied, taking the trained emotion classification network as the finally trained emotion classification network.
A second aspect of the present disclosure provides an image processing device, the device comprising:
an acquisition module configured to acquire a target image including facial information;
an emotion determination module configured to input the target image into a pre-trained emotion classification network to obtain emotional information represented by the facial information in the target image;
wherein the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotional information based on the feature image.
Optionally, the emotion determination module is specifically configured to:
input the feature image into a Transformer encoder to obtain a feature vector corresponding to the target image, the Transformer encoder including a multi-head self-attention module, a multi-layer perceptron, and a layer normalization module;
input the feature vector into a fully connected layer to obtain the emotional information represented by the facial information in the target image.
Optionally, the device includes:
a second acquisition module configured to acquire a training set, the training set including a plurality of training images, each of which includes facial information and a pre-marked emotional label corresponding to the facial information;
a feature extraction module configured to, for any target training image in the training set, input the target training image into the RedNet feature extractor in an initial emotion classification network to obtain a feature image of the target training image;
a feature vector determination module configured to input the feature image of the target training image into the Transformer encoder to obtain a feature vector corresponding to the target training image;
a prediction module configured to input the feature vector corresponding to the target training image into a fully connected layer to obtain a prediction label corresponding to the emotional information represented by the facial information in the target training image;
an adjustment module configured to adjust parameters in the emotion classification network according to the prediction label and the emotional label pre-marked on the target training image, so as to obtain a trained emotion classification network.
A third aspect of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of any one of the methods described in the first aspect of the present disclosure are implemented.
A fourth aspect of the present disclosure provides an electronic device, including:
a memory on which a computer program is stored;
a processor configured to execute the computer program in the memory to implement the steps of any one of the methods in the first aspect of the present disclosure.
Through the above technical solution, a RedNet structure composed of involution operators is used as a feature extractor to preliminarily process the image input to the emotion classification network, extract the local details of the image, and feed the resulting feature image into the downstream modules of the emotion classification network, which effectively improves the final accuracy of the emotional information output by the network.
Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the present disclosure and constitute a part of the specification; together with the following specific embodiments, they serve to explain the present disclosure, but do not limit it. In the drawings:
Fig. 1 is a flowchart of an image processing method according to an exemplary embodiment;
Fig. 2 is a schematic diagram of an emotion classification network in a training phase according to an exemplary embodiment;
Fig. 3 is a schematic diagram of an emotion classification network in a testing phase according to an exemplary embodiment;
Fig. 4 is a block diagram of an image processing device according to an exemplary embodiment;
Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment;
Fig. 6 is another block diagram of an electronic device according to an exemplary embodiment.
Detailed Description
Specific embodiments of the present disclosure are described in detail below in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to illustrate and explain the present disclosure, and are not intended to limit it.
Emotion recognition is an unavoidable part of any interpersonal communication: people observe the emotional changes of others to confirm whether their own actions are reasonable and effective. With the continuous advancement of technology, emotion recognition can use different features for detection and recognition, such as the face, voice, EEG, and even speech content; among these features, facial expressions are usually the easiest to observe.
Generally speaking, a facial expression recognition system mainly consists of three stages, namely face detection, feature extraction, and expression recognition. In the face detection stage, multiple face detectors, such as the MTCNN and RetinaFace networks, are used to locate faces in complex scenes, and the detected faces can be further aligned. For feature extraction, past studies have proposed various methods for capturing the facial geometry and appearance features induced by facial expressions. According to feature type, these can be divided into engineered features and learning-based features; engineered features can be further divided into texture-based features, geometry-based global features, and so on.
In recent years, the application of deep learning, and especially the emergence of the ViT (Vision Transformer) model, has broken the monopoly of convolution- and pooling-dominated networks on classification tasks. However, the underlying convolution part of the ViT model is too simple: the bottom layers of the network make poor use of fine-grained image information, and the intermediate processing stages lack transformations that progressively reduce the feature map size.
Fig. 1 is a flowchart of an image processing method according to an exemplary embodiment. The method may be executed by a terminal such as a mobile phone, computer, or laptop, or by a server. As shown in Fig. 1, the method includes:
S101. Acquire a target image including facial information.
The facial information in the target image may include the facial information of only one person, or the facial information of multiple people.
S102. Input the target image into a pre-trained emotion classification network to obtain the emotional information represented by the facial information in the target image.
It can be understood that the emotional information may represent the probability values of emotions such as happiness, sadness, crying, or laughing corresponding to the facial information of the person in the target image.
The emotion classification network includes a RedNet feature extractor composed of involution operators; the RedNet feature extractor is used to obtain a feature image from the target image, so that the emotional information is obtained based on the feature image.
Those skilled in the art will understand that the traditional ViT model cuts the image uniformly with equal strides; however, this may cause some features to be lost or misaligned when local information is split. Image processing differs from natural language processing, where text carries contextual relations: the relational continuity between pixels has a much coarser granularity.
In addition, the involution operator has channel invariance and spatial specificity; its design is the opposite of convolution, in that the kernel is shared across the channel dimension while a spatially specific kernel is used in the spatial dimension for more flexible modeling. Compared with convolution, which shares weights across the spatial dimension, the involution kernel attends differently to different spatial positions, so it can mine diverse target features more effectively; it also shares and transfers feature weights across spatial positions without increasing the parameter or computation cost, which is exactly what the design principle of spatial specificity pursues. This shift from convolution to involution reallocates computing power to where it yields the most performance. We therefore use a RedNet composed of involution operators as the feature extractor, and obtain better results than with ResNet while using fewer parameters.
In the embodiments of the present disclosure, the RedNet structure composed of involution operators is used as the feature extractor to perform preliminary processing on the image input to the emotion classification network, extracting the local details of the image and feeding the resulting feature image into the downstream modules of the network, which effectively improves the final accuracy of the emotional information output by the emotion classification network.
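As a concrete illustration of an involution layer of this kind, the following is a minimal PyTorch sketch; the module name, the kernel-generation bottleneck, and the hyperparameters (kernel size 7, 16 groups, reduction ratio 4) are illustrative assumptions based on the public RedNet design, not the exact implementation of the present disclosure:

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Minimal involution operator: the kernel is shared across channels
    (channel-agnostic) but generated separately for every spatial position
    (spatially specific), the inverse of convolution's weight sharing."""

    def __init__(self, channels, kernel_size=7, stride=1, groups=16, reduction=4):
        super().__init__()
        assert channels % groups == 0
        self.k, self.s, self.g = kernel_size, stride, groups
        # Bottleneck that generates one K*K kernel per group at each position.
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size**2 * groups, 1)
        self.down = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
        self.unfold = nn.Unfold(kernel_size, padding=(kernel_size - 1) // 2, stride=stride)

    def forward(self, x):
        b, c, h, w = x.shape
        oh, ow = h // self.s, w // self.s
        # Per-position kernels: (B, G, 1, K*K, oH, oW).
        kernel = self.span(self.reduce(self.down(x)))
        kernel = kernel.view(b, self.g, 1, self.k**2, oh, ow)
        # K*K neighborhoods of the input: (B, G, C/G, K*K, oH, oW).
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k**2, oh, ow)
        # Multiply-accumulate over the window; channels in a group share the kernel.
        return (kernel * patches).sum(dim=3).view(b, c, oh, ow)
```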
In some optional embodiments, obtaining the emotional information based on the feature image includes:
inputting the feature image into a Transformer encoder to obtain a feature vector corresponding to the target image, the Transformer encoder including a multi-head self-attention module, a multi-layer perceptron, and a layer normalization module;
inputting the feature vector into a fully connected layer to obtain the emotional information represented by the facial information in the target image.
It can be understood that the feature image may include multiple feature sub-image patches, and inputting the feature image into the Transformer encoder includes: flattening the multiple feature sub-image patches and inputting them into the Transformer encoder respectively.
The multi-head self-attention (MSA) module linearly projects the concatenated outputs of multiple attention heads to the desired dimension; multiple attention heads can be used to learn both local and global dependencies in the image. The multi-layer perceptron (MLP) contains two layers with Gaussian Error Linear Unit (GELU) activations, together with layer normalization (LN), which can be used to improve training time and generalization performance. Residual connections are applied after each sub-module, because they allow gradients to flow directly through the network without passing through the nonlinear layers.
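To make the encoder structure concrete, here is a hedged sketch of a single pre-norm encoder block in PyTorch, combining multi-head self-attention, a two-layer GELU MLP, layer normalization, and residual connections; the embedding dimension, head count, and dropout rates are illustrative assumptions:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block: LN -> MSA -> residual,
    then LN -> MLP(GELU) -> residual."""

    def __init__(self, dim=768, heads=12, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (batch, num_patches, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around MSA
        x = x + self.mlp(self.norm2(x))                    # residual around MLP
        return x
```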
Those skilled in the art will understand that convolutional neural networks (CNNs), applied to the face domain and trained on suitable datasets, can extract and learn the key features for a facial expression recognition system. It is worth noting, however, that for facial expressions many cues come from a few parts of the face, such as the mouth and eyes, while other parts, such as the background and hair, contribute little to the output. This means that, ideally, the model should focus only on the important parts of the face, be less sensitive to the other facial regions, and generalize well to special cases such as occlusion and blur. In this work, we propose a Transformer-based framework for facial expression recognition that takes these observations into account and uses the attention mechanism to focus on the salient parts of the face. Using Transformer encoding instead of a deep convolutional model achieves very high accuracy.
With the above scheme, the Transformer encoder uses the attention mechanism to focus on the salient parts of the face, which ensures the accuracy of the emotional information output by the emotion classification network.
In some optional embodiments, training the emotion classification network includes:
acquiring a training set including multiple training images, each of which includes facial information and an emotion label pre-annotated for that facial information;
for any target training image in the training set, inputting the target training image into the RedNet feature extractor of the initial emotion classification network to obtain a feature image of the target training image;
inputting the feature image of the target training image into the Transformer encoder to obtain a feature vector corresponding to the target training image;
inputting the feature vector corresponding to the target training image into the fully connected layer to obtain a predicted label corresponding to the emotional information represented by the facial information in the target training image;
adjusting the parameters of the emotion classification network according to the predicted label and the emotion label pre-annotated on the target training image, so as to obtain a trained emotion classification network.
With the above scheme, an untrained initial emotion classification network is trained on a training set of images that include facial information and pre-annotated emotion labels, yielding an emotion classification network that can accurately recognize and classify the emotions represented by facial information in an image.
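A minimal sketch of one training step over this pipeline, assuming PyTorch; the submodule names rednet, encoder, and fc, as well as the mean-pooling of encoder tokens, are assumptions for illustration rather than the disclosure's exact design:

```python
import torch.nn as nn

def train_step(model, optimizer, images, labels):
    """One optimization step: RedNet feature extractor -> Transformer
    encoder -> fully connected head -> cross-entropy loss."""
    optimizer.zero_grad()
    feats = model.rednet(images)               # feature image of each input
    tokens = feats.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
    encoded = model.encoder(tokens)            # Transformer encoder
    logits = model.fc(encoded.mean(dim=1))     # predicted label scores
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```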
In other optional embodiments, the fully connected layer includes an attention factor, and inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image includes:
inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image, as well as weight information of the target training image;
and adjusting the parameters of the emotion classification network according to the predicted label and the pre-annotated emotion label of the training image includes:
adjusting the parameters of the emotion classification network based on a cross-entropy loss function (Cross-Entropy Loss) and a regularization loss, according to the predicted label, the emotion label pre-annotated on the target training image, and the weight information of the target training image.
With the above scheme, an attention factor added to the fully connected layer measures how reliable each training sample actually is: a high value means the sample is well-behaved and accurate and plays a large role during training, whereas a low value means the sample performs poorly, is inaccurate, and contributes little. Through this factor, the neural network concentrates its attention on the samples that are actually informative, which effectively improves training accuracy.
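One plausible realization of such an attention factor, sketched in PyTorch: the head emits a per-sample weight in (0, 1) alongside the logits, the weight scales each sample's cross-entropy term, and a plain L2 penalty stands in for the regularization loss. The sigmoid branch and the coefficient l2 are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AttentiveHead(nn.Module):
    """Hypothetical classification head with a per-sample attention factor."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)
        self.attn = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feat):                   # feat: (B, dim)
        return self.fc(feat), self.attn(feat)  # logits, alpha in (0, 1)

def weighted_loss(logits, labels, alpha, model, l2=1e-4):
    # Scale each sample's cross-entropy by its attention factor, then add
    # an L2 regularization term over the model parameters.
    ce = nn.functional.cross_entropy(logits, labels, reduction="none")
    reg = sum(p.pow(2).sum() for p in model.parameters())
    return (alpha.squeeze(1) * ce).mean() + l2 * reg
```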
In still other embodiments, training the emotion classification network further includes inputting the training set into an SCN (Self-Cure Network) as a method of automatically repairing mislabeled samples. The SCN includes a self-attention importance weighting module (Self-Attention Importance Weighting) and a relabeling module.
The self-attention importance weighting module generates a weight α_i for each sample x_i in the training set, as a measure of the importance of the sample x_i. The self-attention importance weighting module is trained using the RR-loss (Rank Regularization loss).
The RR-loss is computed as follows: the samples of a batch are sorted by α_i and divided, according to a ratio β, into a high-score group and a low-score group, the high-score group containing β·N = M samples and the low-score group containing N − M samples. Then
$$L_{RR} = \max\left\{0,\ \delta_1 - (\alpha_H - \alpha_L)\right\},$$
where $L_{RR}$ denotes the RR-loss, $\alpha_H$ denotes the mean weight of the high-score group, and $\alpha_L$ denotes the mean weight of the low-score group, and $\alpha_H$ and $\alpha_L$ satisfy
$$\alpha_H = \frac{1}{M}\sum_{i=0}^{M-1}\alpha_i, \qquad \alpha_L = \frac{1}{N-M}\sum_{i=M}^{N-1}\alpha_i.$$
It can be understood that $\delta_1$ is a fixed or learnable value used to separate the mean weights of the high-score and low-score groups.
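A compact PyTorch sketch of this computation, with illustrative values for β and δ_1 and assuming 0 < β < 1 and a batch large enough that both groups are non-empty:

```python
import torch

def rank_regularization_loss(alpha, beta=0.7, delta1=0.15):
    """RR-loss sketch: sort a batch of attention factors, split it into a
    high-score group of M = beta*N samples and a low-score group of N - M
    samples, and penalize the margin between the two group means:
        L_RR = max(0, delta1 - (alpha_H - alpha_L))."""
    a, _ = alpha.view(-1).sort(descending=True)
    m = max(1, int(beta * a.numel()))
    alpha_h = a[:m].mean()   # mean weight of the high-score group
    alpha_l = a[m:].mean()   # mean weight of the low-score group
    return torch.clamp(delta1 - (alpha_h - alpha_l), min=0)
```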
In other embodiments, dividing the samples into high-score and low-score groups according to the ratio β includes:
where the distance formula $\operatorname{argmax}_M\left(\min_{i\in[0,M)}\alpha_i - \max_{i\in[M,N)}\alpha_i\right)$ does not hold, calibrating the ratio β manually; within the range where it holds, using the above distance formula for grouping.
It is understandable that grouping the training samples with a fixed hyperparameter β amounts to an assumption about the proportion of wrong labels in the data. In practice, however, the distribution of mislabeled samples in the data is usually unknown. Moreover, even if the overall proportion of wrong labels were known, the randomness of sampling makes the actual proportion differ from batch to batch, so a fixed ratio introduces a certain bias.
Once the self-attention importance weighting module has learned how to distinguish the high-score group from the low-score group, the best grouping should satisfy $\operatorname{argmax}_M \operatorname{distance}(\Omega_H, \Omega_L)$,
where $\Omega_H$ denotes the set of weights of the high-score samples and $\Omega_L$ denotes the set of weights of the low-score samples. Given the ordering of the weights, the distance can be taken as $\operatorname{argmax}_M\left(\min_{i\in[0,M)}\alpha_i - \max_{i\in[M,N)}\alpha_i\right)$.
With this scheme, grouping according to the actual weights of each batch achieves adaptive grouping while avoiding instability in training.
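Because the weights are sorted, this criterion reduces to cutting the batch at the largest gap between adjacent sorted weights; a minimal sketch, assuming PyTorch and a batch of at least two samples:

```python
import torch

def adaptive_split(alpha):
    """Adaptive grouping sketch: instead of a fixed ratio beta, choose the
    cut point M maximizing the gap between the smallest weight in the high
    group and the largest weight in the low group,
        argmax_M ( min_{i<M} alpha_i - max_{i>=M} alpha_i ),
    which for sorted weights is the largest gap between neighbors."""
    a, order = alpha.view(-1).sort(descending=True)
    gaps = a[:-1] - a[1:]               # gap between adjacent sorted weights
    m = int(gaps.argmax().item()) + 1   # cut just after the largest gap
    return order[:m], order[m:]         # indices of high / low groups
```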
In addition, since training set samples of different classes differ in complexity, the index used to judge a sample's importance is not identical across classes when computing the per-class confidences. We therefore extend the dimension of α_i from the original scalar to a vector of the output class dimension 1×c, and use the mean of α_i as the constraint when computing the RR-loss.
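A small sketch of this extension, assuming PyTorch: each sample now receives a 1×c vector of importances, and its mean is what enters the RR-loss constraint; the sigmoid branch is an illustrative assumption:

```python
import torch.nn as nn

class PerClassAttention(nn.Module):
    """Hypothetical per-class attention factor: a 1 x c importance vector
    per sample instead of a scalar; the mean feeds the RR-loss."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, num_classes), nn.Sigmoid())

    def forward(self, feat):               # feat: (B, dim)
        alpha = self.attn(feat)            # (B, c) per-class importance
        return alpha, alpha.mean(dim=1)    # vector, and mean used in RR-loss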
With the above scheme, an adaptive grouping method is proposed that groups samples according to the actual weights of each batch, effectively improving the accuracy of the weights output by the model.
In other optional embodiments, the method further includes:
acquiring a test set including multiple test images, each of which includes facial information and an emotion label pre-annotated for that facial information;
for any target test image in the test set, inputting the target test image into the RedNet feature extractor of the trained emotion classification network to obtain a feature image of the target test image;
inputting the feature image of the target test image into the Transformer encoder to obtain a feature vector corresponding to the target test image;
inputting the feature vector corresponding to the target test image into an MC-dropout layer to determine uncertainty information of the target test image;
determining whether the uncertainty information of the multiple test images satisfies a preset rule, and if so, taking the trained emotion classification network as the fully trained emotion classification network.
Those skilled in the art will understand that, in the related art, CNN models, attention models, and Transformer models are all, mathematically, maximum likelihood estimation models. Maximum likelihood estimation is unbiased and its weights are fixed; in the real world, however, model weights should tend toward a Gaussian distribution rather than a fixed value, so maximum likelihood estimation cannot effectively estimate the uncertainty of the data. Moreover, human expressions are themselves extremely complex: panic mixed with surprise, or laughing to the point of tears, are blends of different expressions rather than a single one. Using a model with fixed weights to estimate an uncertain task is therefore inherently contradictory.
Those skilled in the art will know that MC-dropout is an interpretation of dropout grounded in Bayesian theory, explaining dropout as a Bayesian approximation of a Gaussian process. It thereby gives an ordinary model the ability to estimate uncertainty in the way a Bayesian neural network does.
Specifically, with an MC-dropout layer, an input only needs to be passed through the network n times at test time to obtain a set of sample points, from which a mean and a variance are computed. The variance can then be used to evaluate the uncertainty of the predictions on the samples of the test set: the larger the variance, the higher the uncertainty of the prediction.
In some implementations, at test time the backbone outputs a feature $O_b \in \mathbb{R}^{1\times p}$. Ordinarily, $O_b$ is multiplied with the fully connected layer weight $W_{fc} \in \mathbb{R}^{p\times m}$ by the formula $O_{fc} = O_b \cdot W_{fc}$, and the resulting $O_{fc} \in \mathbb{R}^{1\times m}$ is used for further classification.
In other possible implementations, $W_{fc}$ is sampled $n$ times, the sampled weights being denoted $W_{fc}^{(1)}, W_{fc}^{(2)}, \dots, W_{fc}^{(n)}$. The MC-dropout layer can then be defined by the following formula:
$$\hat{O}_{fc} = O_b \cdot \left[W_{fc}^{(1)}, W_{fc}^{(2)}, \dots, W_{fc}^{(n)}\right] \in \mathbb{R}^{n\times m},$$
where $\hat{O}_{fc}$ adds a sampling dimension $n$. Relative to $O_{fc}$, $\hat{O}_{fc}$ is equivalent to the result of sampling $n$ times with dropout. The final classification result is obtained by taking the mean as follows:
$$O_{mean} = \operatorname{mean}_n\left(\operatorname{softmax}_m(\hat{O}_{fc})\right), \qquad result = \max(O_{mean}),$$
where $\operatorname{softmax}_m(\cdot)$ performs the softmax operation along the $m$ dimension of $\hat{O}_{fc}$, $\operatorname{mean}_n(\cdot)$ computes the mean along the $n$ dimension, and $\max(\cdot)$ takes the maximum value of a vector. The uncertainty of a sample is computed as follows:
$$O_{var} = \operatorname{variance}_n\left(\operatorname{softmax}_m(\hat{O}_{fc})\right),$$
where $\operatorname{variance}_n(\cdot)$ computes the variance along the $n$ dimension, so that $O_{var}$ is the sample variance corresponding to $O_{mean}$. The uncertainty of the prediction result can be measured by the maximum value of $O_{var}$: the larger the variance, the higher the uncertainty.
Optionally, dropout may be implemented in another layer instead; it is only necessary to ensure that the computation before that layer runs once, and that the computation becomes a matrix operation once it reaches the MC-dropout layer.
With this scheme, replacing the fully connected layer with an MC-dropout layer at the test stage allows Bayesian estimation to be used for uncertainty analysis.
To help those skilled in the art better understand the technical solution provided by the present disclosure, Fig. 2 is a schematic diagram of an emotion classification network 20 in the training stage according to an exemplary embodiment. As shown in Fig. 2, the emotion classification network 20 includes, connected in series, an input module 21, a RedNet feature extractor 22, a Transformer encoder 23, a fully connected layer 24, and a classifier 25.
Based on the emotion classification network 20 shown in Fig. 2, its training includes: feeding the training set through the input module 21 into the RedNet feature extractor 22 to obtain multiple feature image patches of each training image in the training set; inputting the multiple feature image patches into the Transformer encoder 23 to obtain the feature vector of the training image; inputting the feature vector into the fully connected layer 24 to obtain the probability values of the emotion categories represented by the facial information in the image; inputting the probability values of the emotion categories into the classifier 25 to obtain the emotion category with the highest probability; and adjusting the parameters of the emotion classification network 20 based on a cross-entropy loss function and a regularization loss, according to that emotion category and the label information pre-annotated in the training set, to obtain a trained emotion classification network.
Further, the present disclosure also provides a schematic diagram of an emotion classification network in the test stage according to an exemplary embodiment, as shown in Fig. 3. The emotion classification network 30 includes a trained input module 31, a RedNet feature extractor 32, a Transformer encoder 33, an MC-dropout layer 34, and a classifier 35.
Based on the emotion classification network 30 shown in Fig. 3, its testing includes: feeding the test set through the input module 31 into the RedNet feature extractor 32 to obtain multiple feature image patches of each test image; inputting the multiple feature image patches into the Transformer encoder 33 to obtain the feature vector of the test image; inputting the feature vector into the MC-dropout layer 34 for repeated sampling to obtain, at each sampling pass, the probability values of the emotion categories represented by the facial information in the image output by the MC-dropout layer 34; inputting the probability values of the emotion categories into the classifier 35 to obtain the emotion category with the highest probability; and determining, according to that emotion category and the label information pre-annotated for the test set, whether the emotion classification network 30 meets the preset requirements.
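A hedged sketch of this test procedure, reusing mc_dropout_predict from above; the preset rule checked here (mean uncertainty below a threshold) and the submodule names rednet, encoder, and fc are assumptions for illustration:

```python
def evaluate_uncertainty(model, test_loader, threshold=0.05):
    """Test-stage sketch of Fig. 3: run each test image through the trained
    extractor and encoder, sample the MC-dropout head, and check a
    hypothetical preset rule over the whole test set."""
    uncertainties = []
    for images, labels in test_loader:
        feats = model.rednet(images)
        tokens = feats.flatten(2).transpose(1, 2)
        feat = model.encoder(tokens).mean(dim=1)
        _, u = mc_dropout_predict(feat, model.fc)  # defined above
        uncertainties.extend(u.tolist())
    # Preset rule (illustrative): mean uncertainty must stay below threshold.
    return sum(uncertainties) / len(uncertainties) < threshold
```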
Based on the emotion classification network structures of Fig. 2 and Fig. 3, and building on SCN, RedNet and a Transformer are combined as the feature extractor for the first time, with RedNet used jointly with the Bayesian-inspired MC-dropout. In addition, to handle the ambiguous images and labels contained in the training set, the training method of SCN is adopted and further improved.
Fig. 4 is a block diagram of an image processing apparatus 40 according to an exemplary embodiment. The apparatus 40 may be part of a terminal such as a mobile phone, or part of a server. The apparatus 40 includes:
a first acquisition module 41, configured to acquire a target image including facial information;
an emotion determination module 42, configured to input the target image into a pre-trained emotion classification network to obtain the emotional information represented by the facial information in the target image;
wherein the emotion classification network includes a RedNet feature extractor composed of involution operators, the RedNet feature extractor being used to obtain a feature image from the target image, so that the emotional information is obtained based on the feature image.
Optionally, the emotion determination module 42 is specifically configured to:
input the feature image into a Transformer encoder to obtain a feature vector corresponding to the target image, the Transformer encoder including a multi-head self-attention module, a multi-layer perceptron, and a layer normalization module;
input the feature vector into a fully connected layer to obtain the emotional information represented by the facial information in the target image.
Optionally, the apparatus 40 further includes:
a second acquisition module, configured to acquire a training set including multiple training images, each of which includes facial information and an emotion label pre-annotated for that facial information;
a first feature extraction module, configured to, for any target training image in the training set, input the target training image into the RedNet feature extractor of the initial emotion classification network to obtain a feature image of the target training image;
a first feature vector determination module, configured to input the feature image of the target training image into the Transformer encoder to obtain a feature vector corresponding to the target training image;
a prediction module, configured to input the feature vector corresponding to the target training image into the fully connected layer to obtain a predicted label corresponding to the emotional information represented by the facial information in the target training image;
an adjustment module, configured to adjust the parameters of the emotion classification network according to the predicted label and the emotion label pre-annotated on the target training image, to obtain a trained emotion classification network.
Optionally, the fully connected layer includes an attention factor, and the prediction module is specifically configured to:
input the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image, as well as weight information of the target training image;
and the adjustment module is specifically configured to:
adjust the parameters of the emotion classification network based on a cross-entropy loss function and a regularization loss, according to the predicted label, the emotion label pre-annotated on the target training image, and the weight information of the target training image.
Optionally, the apparatus 40 further includes:
a third acquisition module, configured to acquire a test set including multiple test images, each of which includes facial information and an emotion label pre-annotated for that facial information;
a second feature extraction module, configured to, for any target test image in the test set, input the target test image into the RedNet feature extractor of the trained emotion classification network to obtain a feature image of the target test image;
a second feature vector determination module, configured to input the feature image of the target test image into the Transformer encoder to obtain a feature vector corresponding to the target test image;
a first determination module, configured to input the feature vector corresponding to the target test image into the MC-dropout layer to determine uncertainty information of the target test image;
a second determination module, configured to determine whether the uncertainty information of the multiple test images satisfies a preset rule and, if so, to take the trained emotion classification network as the fully trained emotion classification network.
As for the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
Fig. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment. As shown in Fig. 5, the electronic device 500 may include a processor 501 and a memory 502, and may further include one or more of a multimedia component 503, an input/output (I/O) interface 504, and a communication component 505.
The processor 501 is configured to control the overall operation of the electronic device 500 to complete all or part of the steps of the image processing method described above. The memory 502 is configured to store various types of data to support operation on the electronic device 500; such data may include, for example, instructions for any application or method operated on the electronic device 500, as well as application-related data such as the images of the training set and the test set. The memory 502 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. The multimedia component 503 may include a screen and an audio component; the screen may be, for example, a touch screen, and the audio component is configured to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may be further stored in the memory 502 or sent through the communication component 505. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules such as a keyboard, a mouse, or buttons, which may be virtual or physical. The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, or other 5G technologies, or a combination of one or more of them, which is not limited here; accordingly, the communication component 505 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the image processing method described above.
In another exemplary embodiment, a computer-readable storage medium including program instructions is further provided, the program instructions, when executed by a processor, implementing the steps of the image processing method described above. For example, the computer-readable storage medium may be the above memory 502 including program instructions, which are executable by the processor 501 of the electronic device 500 to complete the image processing method described above.
Fig. 6 is a block diagram of an electronic device 600 according to an exemplary embodiment. For example, the electronic device 600 may be provided as a server. Referring to Fig. 6, the electronic device 600 includes one or more processors 622 and a memory 632 for storing computer programs executable by the processor 622. A computer program stored in the memory 632 may include one or more modules each corresponding to a set of instructions. The processor 622 may be configured to execute the computer program to perform the image processing method described above.
In addition, the electronic device 600 may further include a power component 626 and a communication component 650; the power component 626 may be configured to perform power management of the electronic device 600, and the communication component 650 may be configured to implement communication of the electronic device 600, for example, wired or wireless communication. The electronic device 600 may further include an input/output (I/O) interface 658. The electronic device 600 may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, or Linux™.
In another exemplary embodiment, a computer-readable storage medium including program instructions is further provided, the program instructions, when executed by a processor, implementing the steps of the image processing method described above. For example, the non-transitory computer-readable storage medium may be the above memory 632 including program instructions, which are executable by the processor 622 of the electronic device 600 to complete the image processing method described above.
In another exemplary embodiment, a computer program product is further provided, the computer program product comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the image processing method described above when executed by the programmable apparatus.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, various simple modifications can be made to the technical solution of the present disclosure, and these simple modifications all fall within the protection scope of the present disclosure.
It should further be noted that the specific technical features described in the above specific embodiments may be combined in any suitable manner, provided there is no contradiction. To avoid unnecessary repetition, the present disclosure does not separately describe the various possible combinations.
In addition, the various embodiments of the present disclosure may also be combined arbitrarily, and as long as such combinations do not depart from the idea of the present disclosure, they should likewise be regarded as content disclosed by the present disclosure.

Claims (10)

  1. An image processing method, characterized in that the method comprises:
    acquiring a target image including facial information;
    inputting the target image into a pre-trained emotion classification network to obtain emotional information represented by the facial information in the target image;
    wherein the emotion classification network comprises a RedNet feature extractor composed of involution operators, the RedNet feature extractor being used to obtain a feature image from the target image, so that the emotional information is obtained based on the feature image.
  2. The method according to claim 1, characterized in that obtaining the emotional information based on the feature image comprises:
    inputting the feature image into a Transformer encoder to obtain a feature vector corresponding to the target image, the Transformer encoder comprising a multi-head self-attention module, a multi-layer perceptron, and a layer normalization module;
    inputting the feature vector into a fully connected layer to obtain the emotional information represented by the facial information in the target image.
  3. The method according to claim 2, characterized in that training the emotion classification network comprises:
    acquiring a training set comprising a plurality of training images, each of the plurality of training images including facial information and an emotion label pre-annotated for that facial information;
    for any target training image in the training set, inputting the target training image into the RedNet feature extractor of an initial emotion classification network to obtain a feature image of the target training image;
    inputting the feature image of the target training image into the Transformer encoder to obtain a feature vector corresponding to the target training image;
    inputting the feature vector corresponding to the target training image into the fully connected layer to obtain a predicted label corresponding to the emotional information represented by the facial information in the target training image;
    adjusting parameters of the emotion classification network according to the predicted label and the emotion label pre-annotated on the target training image, so as to obtain a trained emotion classification network.
  4. The method according to claim 3, characterized in that the fully connected layer includes an attention factor, and inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image comprises:
    inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image, as well as weight information of the target training image;
    and adjusting the parameters of the emotion classification network according to the predicted label and the pre-annotated emotion label of the training image comprises:
    adjusting the parameters of the emotion classification network based on a cross-entropy loss function and a regularization loss, according to the predicted label, the emotion label pre-annotated on the target training image, and the weight information of the target training image.
  5. The method according to claim 3, characterized in that the method further comprises:
    acquiring a test set comprising a plurality of test images, each of the plurality of test images including facial information and an emotion label pre-annotated for that facial information;
    for any target test image in the test set, inputting the target test image into the RedNet feature extractor of the trained emotion classification network to obtain a feature image of the target test image;
    inputting the feature image of the target test image into the Transformer encoder to obtain a feature vector corresponding to the target test image;
    inputting the feature vector corresponding to the target test image into an MC-dropout layer to determine uncertainty information of the target test image;
    determining whether the uncertainty information of the plurality of test images satisfies a preset rule, and if the preset rule is satisfied, taking the trained emotion classification network as the fully trained emotion classification network.
  6. An image processing apparatus, characterized in that the apparatus comprises:
    a first acquisition module, configured to acquire a target image including facial information;
    an emotion determination module, configured to input the target image into a pre-trained emotion classification network to obtain emotional information represented by the facial information in the target image;
    wherein the emotion classification network comprises a RedNet feature extractor composed of involution operators, the RedNet feature extractor being used to obtain a feature image from the target image, so that the emotional information is obtained based on the feature image.
  7. The apparatus according to claim 6, characterized in that the emotion determination module is specifically configured to:
    input the feature image into a Transformer encoder to obtain a feature vector corresponding to the target image, the Transformer encoder comprising a multi-head self-attention module, a multi-layer perceptron, and a layer normalization module;
    input the feature vector into a fully connected layer to obtain the emotional information represented by the facial information in the target image.
  8. The apparatus according to claim 7, characterized in that the apparatus comprises:
    a second acquisition module, configured to acquire a training set comprising a plurality of training images, each of the plurality of training images including facial information and an emotion label pre-annotated for that facial information;
    a feature extraction module, configured to, for any target training image in the training set, input the target training image into the RedNet feature extractor of an initial emotion classification network to obtain a feature image of the target training image;
    a feature vector determination module, configured to input the feature image of the target training image into the Transformer encoder to obtain a feature vector corresponding to the target training image;
    a prediction module, configured to input the feature vector corresponding to the target training image into the fully connected layer to obtain a predicted label corresponding to the emotional information represented by the facial information in the target training image;
    an adjustment module, configured to adjust parameters of the emotion classification network according to the predicted label and the emotion label pre-annotated on the target training image, to obtain a trained emotion classification network.
  9. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-5.
  10. An electronic device, characterized in that it comprises:
    a memory on which a computer program is stored;
    a processor, configured to execute the computer program in the memory to implement the steps of the method according to any one of claims 1-5.
PCT/CN2022/136363 2021-12-02 2022-12-02 Image processing method and apparatus, storage medium, and electronic device WO2023098912A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111473999.8 2021-12-02
CN202111473999.8A CN116229530A (en) 2021-12-02 2021-12-02 Image processing method, device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023098912A1 true WO2023098912A1 (en) 2023-06-08

Family

ID=86579171

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/136363 WO2023098912A1 (en) 2021-12-02 2022-12-02 Image processing method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN116229530A (en)
WO (1) WO2023098912A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194347A (en) * 2017-05-19 2017-09-22 深圳市唯特视科技有限公司 A kind of method that micro- expression detection is carried out based on Facial Action Coding System
CN107423707A (en) * 2017-07-25 2017-12-01 深圳帕罗人工智能科技有限公司 A kind of face Emotion identification method based under complex environment
CN113221639A (en) * 2021-04-01 2021-08-06 山东大学 Micro-expression recognition method for representative AU (AU) region extraction based on multitask learning
CN113591718A (en) * 2021-07-30 2021-11-02 北京百度网讯科技有限公司 Target object identification method and device, electronic equipment and storage medium
CN113705541A (en) * 2021-10-21 2021-11-26 中国科学院自动化研究所 Expression recognition method and system based on transform marker selection and combination

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI DUO, HU JIE, WANG CHANGHU, LI XIANGTAI, SHE QI, ZHU LEI, ZHANG TONG, CHEN QIFENG: "Involution: Inverting the Inherence of Convolution for Visual Recognition", ARXIV.ORG, 10 March 2021 (2021-03-10), pages 1 - 12, XP093070355 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058405A (en) * 2023-07-04 2023-11-14 首都医科大学附属北京朝阳医院 Image-based emotion recognition method, system, storage medium and terminal
CN117058405B (en) * 2023-07-04 2024-05-17 首都医科大学附属北京朝阳医院 Image-based emotion recognition method, system, storage medium and terminal
CN117079324A (en) * 2023-08-17 2023-11-17 厚德明心(北京)科技有限公司 Face emotion recognition method and device, electronic equipment and storage medium
CN117079324B (en) * 2023-08-17 2024-03-12 厚德明心(北京)科技有限公司 Face emotion recognition method and device, electronic equipment and storage medium
CN117611933A (en) * 2024-01-24 2024-02-27 卡奥斯工业智能研究院(青岛)有限公司 Image processing method, device, equipment and medium based on classified network model
CN117689998A (en) * 2024-01-31 2024-03-12 数据空间研究院 Nonparametric adaptive emotion recognition model, method, system and storage medium
CN117689998B (en) * 2024-01-31 2024-05-03 数据空间研究院 Nonparametric adaptive emotion recognition model, method, system and storage medium

Also Published As

Publication number Publication date
CN116229530A (en) 2023-06-06

Similar Documents

Publication Publication Date Title
WO2023098912A1 (en) Image processing method and apparatus, storage medium, and electronic device
TWI773189B (en) Method of detecting object based on artificial intelligence, device, equipment and computer-readable storage medium
US20230119593A1 (en) Method and apparatus for training facial feature extraction model, method and apparatus for extracting facial features, device, and storage medium
KR20190081243A (en) Method and apparatus of recognizing facial expression based on normalized expressiveness and learning method of recognizing facial expression
CN111860362A (en) Method and device for generating human face image correction model and correcting human face image
US11681923B2 (en) Multi-model structures for classification and intent determination
CN111133453A (en) Artificial neural network
WO2020238353A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
Liu et al. Real-time facial expression recognition based on cnn
CN110598638A (en) Model training method, face gender prediction method, device and storage medium
CN112712068B (en) Key point detection method and device, electronic equipment and storage medium
US20230036338A1 (en) Method and apparatus for generating image restoration model, medium and program product
WO2021217937A1 (en) Posture recognition model training method and device, and posture recognition method and device
Krishnan et al. Detection of alphabets for machine translation of sign language using deep neural net
CN115964638A (en) Multi-mode social data emotion classification method, system, terminal, equipment and application
CN112650885A (en) Video classification method, device, equipment and medium
CN113221695B (en) Method for training skin color recognition model, method for recognizing skin color and related device
CN110717407A (en) Human face recognition method, device and storage medium based on lip language password
WO2024071884A1 (en) Apparatus and method for generating image of bald head person, virtual hair styling experience apparatus comprising apparatus for generating bald head person image, and virtual hair styling method using same
Tewari et al. Real Time Sign Language Recognition Framework For Two Way Communication
RU2768797C1 (en) Method and system for determining synthetically modified face images on video
CN115116117A (en) Learning input data acquisition method based on multi-mode fusion network
CN115457365A (en) Model interpretation method and device, electronic equipment and storage medium
WO2022178833A1 (en) Target detection network training method, target detection method, and apparatus
CN112101185A (en) Method for training wrinkle detection model, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22900716

Country of ref document: EP

Kind code of ref document: A1