CN114078275A - Expression recognition method and system and computer equipment - Google Patents

Expression recognition method and system and computer equipment

Info

Publication number
CN114078275A
Authority
CN
China
Prior art keywords
expression
identity
network
neural network
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111376445.6A
Other languages
Chinese (zh)
Inventor
卫华威
韩欣彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202111376445.6A priority Critical patent/CN114078275A/en
Publication of CN114078275A publication Critical patent/CN114078275A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides an expression recognition method, an expression recognition system and computer equipment. A facial image of a target object is input into a pre-trained identity recognition network and a pre-trained expression recognition network, which perform identity recognition and expression recognition on the facial image respectively to obtain an identity characteristic and an initial multi-dimensional expression coefficient of the target object; the initial multi-dimensional expression coefficient is then processed according to the identity characteristic to obtain a final multi-dimensional expression coefficient of the target object for subsequent application. Compared with traditional expression recognition schemes that do not consider the identity characteristics of the target object, the final recognition result obtained by the embodiment expresses the facial expression of the target object more accurately, and thus achieves more accurate expression recognition of the target object.

Description

Expression recognition method and system and computer equipment
Technical Field
The application relates to the technical field of image recognition and processing based on artificial intelligence, in particular to an expression recognition method, system and computer equipment.
Background
With the continuous development of mobile internet and network communication technology, live webcasting has developed rapidly and is widely used in people's daily work and life. In some specific live scenes, in order to provide a diversified live experience, a virtual live broadcast mode based on a virtual digital image is also widely applied.
For example, with the rapid development of two-dimensional virtual live broadcast in fields such as games and singing, the audience for virtual images has gradually grown. Compared with a live broadcast mode in which the anchor appears on camera in person, virtual live broadcast does not require the anchor to appear on camera for live interaction; instead, the anchor can interact with the audience from the background by controlling the virtual digital image to simulate the anchor's behavior.
In a virtual live broadcast application scene based on a virtual digital image, expression recognition for driving the virtual digital image is an important technical branch of virtual live broadcast. However, most existing expression-recognition-based driving schemes suffer from unsatisfactory recognition accuracy, so the virtual digital image has difficulty accurately expressing the anchor's expression. Alternatively, some conventional mature expression recognition schemes can achieve accurate expression recognition through bulky helmet-type face capture equipment, but such equipment is expensive and is not conducive to the popularization of virtual live broadcast.
Disclosure of Invention
Based on the above, in a first aspect, an embodiment of the present application provides an expression recognition method, where the method includes:
acquiring a face image of a target object;
inputting the facial image into a pre-trained identity recognition network and a pre-trained expression recognition network, and respectively performing identity recognition and expression recognition on the facial image through the identity recognition network and the expression recognition network to obtain identity characteristics and an initial multi-dimensional expression coefficient of the target object; wherein the identity features comprise implicit features for implicitly characterizing at least one item of facial information of the target object;
and processing the initial multi-dimensional expression coefficient according to the identity characteristics to obtain a final multi-dimensional expression coefficient of the target object.
In a possible implementation manner of the first aspect, the method further includes:
and driving the facial expression of the virtual digital image in the live broadcast picture according to the final multi-dimensional expression coefficient.
Based on a possible implementation manner of the first aspect, the identity recognition network and the expression recognition network are cascaded and then connected to a full connection layer;
the processing the initial multi-dimensional expression coefficient according to the identity characteristics to obtain a final multi-dimensional expression coefficient of the target object comprises:
inputting, as a condition, the identity characteristics output by the identity recognition network after performing identity recognition on the facial image into the full connection layer;
inputting the initial multi-dimensional expression coefficient output by the expression recognition network into the full connection layer;
and processing the initial multi-dimensional expression coefficient according to the identity characteristics through the full connection layer to obtain the final multi-dimensional expression coefficient.
Based on a possible implementation manner of the first aspect, the method further includes a network training step for obtaining the identity recognition network, and specifically includes:
acquiring a first training data set, wherein the first training data set comprises a plurality of sample face pictures with different identity characteristics, and each sample face picture carries an identity characteristic label calibrated in advance;
sequentially inputting each sample face picture in the first training data set into a deep neural network to be trained, performing identity feature prediction on each sample face picture through the deep neural network, and outputting a predicted identity feature corresponding to each sample face picture;
calculating a loss function value of the deep neural network according to the predicted identity characteristics of each sample face picture predicted by the deep neural network and the identity characteristic labels corresponding to each sample face picture;
and performing iterative optimization on the network parameters of the deep neural network according to the loss function value until a training convergence condition is met, and obtaining the trained deep neural network as the identity recognition network.
According to a possible implementation manner of the first aspect, the network structure of the deep neural network is a resnet18 network structure, and the loss function value of the deep neural network is calculated by a cross entropy loss function.
Based on a possible implementation manner of the first aspect, the method further includes a network training step for obtaining the expression recognition network, and specifically includes:
acquiring a second training data set, wherein the second training data set can comprise a plurality of sample face pictures with pre-calibrated expression coefficient label values;
performing key point detection on each sample face picture in the second training data set, and obtaining a face main body picture corresponding to the sample face picture according to a key point detection result;
sequentially inputting the face main body pictures respectively corresponding to the sample face pictures into a convolutional neural network to be trained, and performing expression recognition on the face main body pictures through the convolutional neural network to obtain expression coefficient prediction values corresponding to the sample face pictures;
calculating a loss function value of the convolutional neural network according to the expression coefficient prediction value of the sample face picture output by the convolutional neural network and the expression coefficient label value corresponding to the sample face picture;
and performing iterative optimization on the network parameters of the convolutional neural network according to the loss function value of the convolutional neural network until a training convergence condition is met, and obtaining the trained convolutional neural network as the expression recognition network.
In a possible implementation manner of the first aspect, the loss function value of the convolutional neural network is calculated by the following formula:
L1Loss = (1/N) * Σ |x_n - y_n|, n = 1, …, N
wherein L1Loss represents the loss function value of the convolutional neural network, x_n represents the predicted expression coefficient value corresponding to the sample face picture output in the n-th iterative training process of the convolutional neural network, y_n represents the expression coefficient label value corresponding to the sample face picture used in the n-th iterative training process, and N is the number of training samples.
In a second aspect, an embodiment of the present application further provides an expression recognition system, where the expression recognition system includes:
an acquisition module for acquiring a face image of a target object;
the recognition module is used for inputting the facial image into a pre-trained identity recognition network and a pre-trained expression recognition network, and respectively carrying out identity recognition and expression recognition on the facial image through the identity recognition network and the expression recognition network to obtain the identity characteristic and the initial multi-dimensional expression coefficient of the target object; wherein the identity features comprise implicit features for implicitly characterizing at least one item of facial information of the target object;
and the processing module is used for processing the initial multi-dimensional expression coefficient according to the identity characteristics to obtain a final multi-dimensional expression coefficient of the target object.
In a possible implementation manner of the second aspect, the expression recognition system further includes a driving module and a training module, where:
the driving module is used for driving the facial expression of the virtual digital image in the live broadcast picture according to the final multi-dimensional expression coefficient;
the training module is configured to:
acquiring a first training data set, wherein the first training data set comprises a plurality of sample face pictures with different identity characteristics, and each sample face picture carries an identity characteristic label calibrated in advance;
sequentially inputting each sample face picture in the first training data set into a deep neural network to be trained, performing identity feature prediction on each sample face picture through the deep neural network, and outputting a predicted identity feature corresponding to each sample face picture;
calculating a loss function value of the deep neural network according to the predicted identity characteristics of each sample face picture predicted by the deep neural network and the identity characteristic labels corresponding to each sample face picture;
performing iterative optimization on network parameters of the deep neural network according to the loss function value until a training convergence condition is met, and obtaining a trained deep neural network as the identity recognition network;
the training module is further configured to:
acquiring a second training data set, wherein the second training data set can comprise a plurality of sample face pictures with pre-calibrated expression coefficient label values;
performing key point detection on each sample face picture in the second training data set, and obtaining a face main body picture corresponding to the sample face picture according to a key point detection result;
sequentially inputting the face main body pictures respectively corresponding to the sample face pictures into a convolutional neural network to be trained, and performing expression recognition on the face main body pictures through the convolutional neural network to obtain expression coefficient prediction values corresponding to the sample face pictures;
calculating a loss function value of the convolutional neural network according to the expression coefficient prediction value of the sample face picture output by the convolutional neural network and the expression coefficient label value corresponding to the sample face picture;
and performing iterative optimization on the network parameters of the convolutional neural network according to the loss function value of the convolutional neural network until a training convergence condition is met, and obtaining the trained convolutional neural network as the expression recognition network.
In a third aspect, this application further provides a computer device, including a machine-readable storage medium and one or more processors, where the machine-readable storage medium stores machine-executable instructions that, when executed by the one or more processors, implement the method recited in any one of claims 1-7.
In summary, the expression recognition method, system and computer device provided in the embodiments of the present application are different from the conventional expression recognition technology, and innovatively introduce an identity recognition network as a conditional network, perform identity recognition on the facial image of the target object through the conditional network, and output an identity feature, where the identity feature is used to describe or express the facial personalized features of the target object. Then, the identity feature is also used as the input of expression recognition, and further the self-adaptive final expression coefficient is output according to the facial personalized characteristics of different subjects. Therefore, compared with the traditional expression recognition scheme without considering the identity characteristics of the target object, the final expression recognition result of the embodiment can express the facial expression of the target object more accurately, and further realize more accurate expression recognition of the target object. Meanwhile, when the final multi-dimensional expression coefficient corresponding to the facial image of the target object is obtained by using the expression recognition method to drive the facial expression of the virtual digital object, the virtual digital object can express the facial expression of the target object more vividly and finely.
Further, compared with mature expression recognition technologies that rely on heavy helmet-type face capture devices, the expression of the target object (anchor) in the camera picture can be accurately recognized with a simple image acquisition device such as an ordinary camera, which greatly reduces the threshold and cost of virtual live broadcast and facilitates its popularization.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic flow chart of an expression recognition method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a live broadcast system for implementing the expression recognition method.
Fig. 3 is a schematic distribution diagram of an image capturing apparatus for performing facial image acquisition on a target object according to an embodiment of the present application.
Fig. 4 is a schematic network structure diagram of an identity recognition network and an expression recognition network in the embodiment of the present application.
Fig. 5 is a second flowchart of an expression recognition method according to an embodiment of the present application.
Fig. 6 is a schematic flowchart of network training for the identity recognition network according to the embodiment of the present application.
Fig. 7 is a schematic flowchart of network training on the expression recognition network according to the embodiment of the present application.
Fig. 8 is a schematic diagram of a computer device for implementing the expression recognition method according to an embodiment of the present application.
Fig. 9 is a functional module schematic diagram of an expression recognition system provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
In the description of the present application, it is further noted that, unless expressly stated or limited otherwise, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art.
Based on the problems mentioned in the background, the inventor has found through research and investigation that an important technical branch behind virtual live broadcast is expression recognition. The traditional mature expression recognition technology depends on heavy helmet-type face capture equipment, which is expensive and is not conducive to the popularization of virtual live broadcast. In a common virtual live broadcast scene based on a virtual digital object, a more common expression animation driving scheme is to recognize expressions for the virtual digital object based on expression bases. An expression base refers to an expression unit obtained by dividing a specific expression of the driven character, and there may generally be, for example, 51 different expression bases. Different expression bases can represent the movement of different parts, such as the eyes, mouth, eyebrows and nose, and the corresponding expression movements may be eye squeezing, mouth opening, eyebrow frowning and the like. Different expressions can be obtained by linearly combining the expression bases with different weights.
The traditional expression base driving scheme drives different expression bases through a group of expression coefficients. In a common current expression recognition scheme, expression data sets of multiple persons are generally collected; the data sets include a large number of expression pictures, one picture corresponding to one group of expression coefficients, and a neural network is then trained directly to map one picture to one group of expression coefficients. However, this training method is difficult. For example, assume that there are two pictures that are expression pictures of two objects with different identities (e.g., object A and object B). Taking the mouth as an example (the expression coefficient of the corresponding mouth expression base is the "jawOpen" coefficient), assume that the mouth of object A is larger than that of object B: the maximum mouth-opening amplitude of object A is 5 cm, while that of object B is only 3 cm. In practical applications, if the mouths of object A and object B are opened by the same amount, for example 1 cm, the jawOpen coefficient corresponding to object A is 0.2 while that of object B is 0.33. With the mouth opened to the same extent, the coefficient label value of one picture is 0.2 and that of the other is 0.33, which disturbs network training, increases training difficulty so that the network is hard to converge, or makes the output expression coefficient unable to accurately reflect the actual expression of the anchor when the converged network is used for expression recognition. Therefore, when the virtual digital object is subsequently driven, it is difficult for the virtual digital object to express the real-time expression of the anchor vividly and finely.
In view of the above problems, the novel expression recognition scheme provided by the application can accurately recognize the anchor's expression in the camera picture with only a single simple camera, which can greatly lower the threshold of virtual live broadcast and facilitate its popularization, and can also effectively solve the problem of high network training difficulty in existing schemes.
Fig. 1 is a schematic flow chart of an expression recognition method provided in this embodiment. To facilitate understanding of the present embodiment, further referring to fig. 2, fig. 2 is a schematic diagram of a live broadcast system for implementing the expression recognition method according to the present embodiment. Fig. 2 shows a scene diagram suitable for the target object to perform virtual live broadcast based on a virtual digital object.
In this embodiment, the live broadcast system includes a live broadcast providing terminal 100, a live broadcast server 200, and a live broadcast receiving terminal 300. Illustratively, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may access the live broadcast server 200 through a network to use a live broadcast service provided by the live broadcast server 200. As an example, for the live broadcast providing terminal 100, an anchor-side application (APP) may be downloaded through the live broadcast server 200, and content may be live broadcast through the live broadcast server 200 after registration through the anchor-side application. Correspondingly, the live broadcast receiving terminal 300 may also download a viewer-side application through the live broadcast server 200, and may view the live broadcast content provided by the live broadcast providing terminal 100 by accessing the live broadcast server 200 through the viewer-side application. In some possible embodiments, the anchor-side application and the viewer-side application may also be one integrated application.
For example, the live broadcast providing terminal 100 may transmit live content (e.g., a live video stream) to the live broadcast server 200, and the viewer may access the live broadcast server 200 through the live broadcast receiving terminal 300 to view the live broadcast content. The live content pushed by the live server 200 may be real-time content currently live on the live platform, or historical live content stored after the live broadcast is completed. It will be appreciated that the live broadcast system shown in fig. 2 is only an alternative example, and that in other possible embodiments the live broadcast system may comprise only some of the components shown in fig. 2 or may also comprise further components.
In addition, it should be noted that, in a specific application scenario, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may also implement role exchange. For example, a main broadcast of the live broadcast providing terminal 100 may provide a live broadcast service using the live broadcast providing terminal 100, or view live broadcast content provided by other main broadcasts as viewers. For example, the user of the live broadcast receiving terminal 300 may view live broadcast content provided by a concerned anchor using the live broadcast receiving terminal 300, or may perform live broadcast as an anchor through the live broadcast receiving terminal 300.
In this embodiment, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may be, but are not limited to, a smart phone, a personal digital assistant, a tablet computer, a personal computer, a notebook computer, a virtual reality terminal device, an augmented reality terminal device, and the like. The live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may have installed therein related applications or program components for implementing live broadcast interaction, such as an application APP, a Web page, a live broadcast applet, a live broadcast plug-in or component, but are not limited thereto. The live server 200 may be a background device providing live services, and may be, for example and without limitation, a server cluster, a cloud service center, and the like.
The steps of the expression recognition method of this embodiment are described below in an exemplary manner with reference to fig. 1 and fig. 2. As shown in fig. 1, the method may include the following steps S100 to S300. It should be noted that, in actual implementation, the order of some steps of the expression recognition method provided in this embodiment may be interchanged according to actual needs, or some steps may be omitted or deleted, which is not specifically limited in this embodiment.
In step S100, a face image of a target object is acquired.
In this embodiment, the target object may be an anchor user who provides a live broadcast service using the live broadcast providing terminal 100 in a live broadcast scene. In this embodiment, in order to facilitate acquisition of the facial image, the live broadcast providing terminal 100 may further include an image capturing device for capturing images of the anchor. The facial image may be obtained by the image capturing device capturing the facial expression of the target object. The image capturing device may be integrated in the live broadcast providing terminal 100 used by the anchor (for example, a camera of the live broadcast providing terminal 100), or may be an independent image capturing device in communication connection with the live broadcast providing terminal 100, for example, a video monitoring terminal independent of the live broadcast providing terminal 100.
The facial image may be a current video frame in a real-time video image obtained by shooting a main broadcast in real time, or may be a face picture capable of expressing the current facial expression of the target object, which is obtained by shooting the image through the image acquisition device according to a certain time period (such as 1 second, 2 seconds, 3 seconds, and the like).
In addition, in another possible example, in order to obtain the facial expression of the user in all directions, the facial image may also be a composite image obtained by image-combining two or more images obtained by respectively shooting a target object by two or more image capturing devices located at different positions and distributed around the target object (such as a main broadcast), so that facial images including expressive features of the target object at different viewing angles can be obtained from different viewing angles, which is beneficial to improving a live broadcast effect.
For example, as shown in fig. 3, in an alternative preferred embodiment, three cameras C1, C2 and C3 arranged around the face of the target object may be used to capture facial images from the frontal, left-side and right-side directions, obtaining three facial images M1, M2 and M3 that express the frontal, left-side and right-side expression details of the target object respectively; the three facial images captured by the three cameras are then combined to obtain a composite image that includes the expression details of the front face, the left side face and the right side face of the target object as the facial image. The facial image thus obtained may be understood as a multi-angle facial expression image capable of expressing multi-angle expressions of the target object.
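As an illustration of this multi-view acquisition, the following is a minimal Python sketch (assuming OpenCV and NumPy) of one possible way to combine the three captures M1, M2 and M3 into a single composite facial image; the embodiment does not prescribe a particular synthesis method, and simple side-by-side tiling is used here only as an assumed example.

    import cv2
    import numpy as np

    def synthesize_facial_image(m_front, m_left, m_right, size=(256, 256)):
        # Resize the frontal, left-face and right-face captures to a common size
        # and tile them into one composite frame. The tiling itself is an assumption;
        # the embodiment only requires that the composite image contain the
        # expression details of all three viewing angles.
        views = [cv2.resize(v, size) for v in (m_left, m_front, m_right)]
        return np.concatenate(views, axis=1)  # composite image of shape (H, 3W, 3)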
Step S200, inputting the facial image into a pre-trained identity recognition network and a pre-trained expression recognition network, and respectively performing identity recognition and expression recognition on the facial image through the identity recognition network and the expression recognition network to obtain the identity characteristic and the initial multi-dimensional expression coefficient of the target object.
In this embodiment, the identity recognition network and the expression recognition network may be obtained by performing network training on a set artificial intelligence neural network through a collected training data set. The identity feature obtained by the trained identity recognition network for performing identity feature recognition on the facial image of the target object may be a multi-dimensional implicit feature, and the implicit feature may be represented by a feature vector, for example, may be a 512-dimensional feature vector, which may include implicit features for implicitly characterizing at least one facial information of the target object, such as face width, glabellar distance, lip width, lip thickness, eye size, eyelid distance, and the like, and such features may be used to distinguish different user identities, and are therefore referred to as identity features.
Furthermore, the expression recognition network can also be obtained by performing network training on a set artificial intelligence neural network with an acquired training data set. The initial multi-dimensional expression coefficient obtained when the trained expression recognition network performs expression recognition on the facial image of the target object may be multi-dimensional information, for example a multi-dimensional matrix or multi-dimensional array including 51-dimensional expression coefficients. Taking 51-dimensional expression coefficients as an example, each of the 51 expression coefficients may correspond to one expression base. Generally, the expression bases for driving the virtual digital object may include 51 expression bases for expressing various expression contents (e.g., squinting, mouth opening, eyebrow frowning, cheek bulging, etc.) of the virtual digital object, respectively. Illustratively, each expression coefficient may be any coefficient value in the range from 0 to 1. Taking the jawOpen coefficient for the mouth-opening expression as an example, the size of the jawOpen coefficient is positively correlated with the mouth-opening amplitude. Meanwhile, in this embodiment, when the sample labels of the training samples for the expression recognition network are calibrated, the identity characteristics of the sample object (user) corresponding to a specific sample (such as its maximum mouth-opening amplitude or eye-opening amplitude) need not be specially considered; the calibration may be performed according to the same rule based only on the expression information (such as the mouth-opening amplitude) in the sample. For example, still taking samples A and B corresponding to the above-mentioned objects A and B as an example, the maximum mouth-opening amplitude of object A in sample A is 5 cm, while that of object B is only 3 cm. When the actual sample labels are calibrated, if the mouths of object A and object B are opened by the same amplitude, for example both by 1 cm, the jawOpen coefficient corresponding to object A may be calibrated to 0.2, and that of object B may also be calibrated to 0.2. Therefore, the expression recognition network is trained with the same label for the same expression information, which facilitates rapid convergence of the network.
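The calibration rule described above can be illustrated with a small Python sketch; the 5 cm reference amplitude below is only an assumed common reference taken from the example, not a value fixed by the embodiment.

    REFERENCE_JAW_OPEN_CM = 5.0  # assumed common reference amplitude

    def jaw_open_label(opening_cm: float) -> float:
        # Same absolute mouth opening -> same label, regardless of the subject's
        # own maximum opening amplitude (identity is deliberately ignored here).
        return min(opening_cm / REFERENCE_JAW_OPEN_CM, 1.0)

    # Object A (max 5 cm) and object B (max 3 cm) both opening the mouth by 1 cm
    # receive the same label value of 0.2.
    assert abs(jaw_open_label(1.0) - 0.2) < 1e-9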
The network training process of the above-mentioned identity recognition network and expression recognition network will be described in detail later.
And step S300, processing the initial multi-dimensional expression coefficient according to the identity characteristics to obtain a final multi-dimensional expression coefficient of the target object.
In detail, in this embodiment, based on step S300, when the pre-trained expression recognition network is subsequently applied to recognize expression coefficients, the expression coefficients output by the expression recognition network are adaptively processed and adjusted in combination with the identity features obtained by the identity recognition network before being output, so that the expression coefficient features of the target object are accurately recognized as the final multi-dimensional expression coefficients, which can subsequently be used to drive the virtual digital image to express the real-time facial expression of the target object vividly and finely.
In a possible implementation manner, for example, as shown in fig. 4, the identity recognition network and the expression recognition network may be connected to a Full Connection Layer (FC Layer) after being cascaded, the identity characteristics output after the identity recognition network performs identity recognition on the facial image are input to the Full Connection Layer as a condition, then the initial multidimensional expression coefficient output by the expression recognition network is input to the Full Connection Layer, and finally the final multidimensional expression coefficient is obtained after the Full Connection Layer processes the initial multidimensional expression coefficient according to the identity characteristics.
It should be understood that, in this embodiment, the full connection layer may be used as a final output layer of the expression recognition network (that is, the full connection layer is a part of the expression recognition network). In this way, the output of the identity recognition network can be used as one of the condition inputs of the expression recognition network. Alternatively, the full connection layer may be understood as a separate network layer independent of the expression recognition network, in which case both the output of the identity recognition network and the output of the expression recognition network serve as condition inputs to the full connection layer. In another possible example, the identity recognition network, the expression recognition network and the full connection layer may be understood as a single artificial intelligence model for performing expression recognition, with each of the three serving as a component of that model. In this embodiment, the attribution division of the specific network structures of the identity recognition network, the expression recognition network and the full connection layer is not limited.
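A minimal PyTorch sketch of the cascade shown in fig. 4 follows; the 512-dimensional identity feature, the 51-dimensional coefficient vector and the sigmoid output range are illustrative assumptions, and id_net and expr_net stand for the pre-trained identity recognition network and expression recognition network.

    import torch
    import torch.nn as nn

    class ConditionedExpressionHead(nn.Module):
        # Full connection layer that refines the initial expression coefficients,
        # conditioned on the identity feature of the target object.
        def __init__(self, id_dim: int = 512, expr_dim: int = 51):
            super().__init__()
            self.fc = nn.Linear(id_dim + expr_dim, expr_dim)

        def forward(self, identity_feat, initial_coeffs):
            x = torch.cat([identity_feat, initial_coeffs], dim=1)
            return torch.sigmoid(self.fc(x))  # final multi-dimensional expression coefficients in [0, 1]

    # Assumed usage with pre-trained networks (placeholders, not defined here):
    #   identity_feat  = id_net(face_image)    # (batch, 512)
    #   initial_coeffs = expr_net(face_image)  # (batch, 51)
    #   final_coeffs   = ConditionedExpressionHead()(identity_feat, initial_coeffs)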
Based on the above, please refer to fig. 5, the expression recognition method according to the embodiment of the present application may further include step S400, which is described in detail as follows.
And step S400, driving the facial expression of the virtual digital image in the live broadcast picture according to the final multi-dimensional expression coefficient.
In this embodiment, the final multi-dimensional expression coefficient is expression information that is output after adaptive adjustment of the initial multi-dimensional expression coefficient in combination with the identity of the target user, and that can more accurately express the current facial expression of the target object. For example, still taking the jawOpen coefficient of the target object as an example, suppose the maximum mouth-opening amplitude of the target object is 3 cm, and the jawOpen coefficient in the initial multi-dimensional expression coefficients output after the facial image is input into the trained expression recognition network is 0.2 (corresponding to a mouth opening of 1 cm). It may be determined from the multi-dimensional feature vector contained in the identity feature output by the identity recognition network that the maximum mouth-opening amplitude of the target object is only 3 cm, so the jawOpen coefficient of 0.2 in the initial multi-dimensional expression coefficients may be adaptively adjusted to 0.33 according to the identity feature (and included in the final multi-dimensional expression coefficient), so that the mouth-opening amplitude of the target object is accurately expressed. On this basis, adaptively processing the initial multi-dimensional expression coefficient with the identity features to obtain the final multi-dimensional expression coefficient of the target object, and driving the facial expression of the virtual digital object with it, enables the virtual digital object to express the expression of the target object vividly.
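The arithmetic behind this example can be written out as follows, on the assumption that the 0.2 output by the expression recognition network is measured against a generic 5 cm reference amplitude:

    GENERIC_MAX_CM = 5.0      # assumed reference amplitude behind the initial coefficient
    subject_max_cm = 3.0      # maximum mouth opening indicated by the identity feature
    initial_jaw_open = 0.2    # initial coefficient, i.e. 0.2 * 5.0 = 1.0 cm of opening

    opening_cm = initial_jaw_open * GENERIC_MAX_CM    # 1.0 cm
    final_jaw_open = opening_cm / subject_max_cm      # 1.0 / 3.0 ≈ 0.33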
As an example, taking the example that the final multi-dimensional expression coefficients respectively include 51-dimensional expression coefficients corresponding to 51 expression bases, the purpose of driving the facial expression of the digital virtual object may be achieved by driving the corresponding expression base through the expression coefficient of each dimension.
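As a sketch of this driving step, the following NumPy snippet applies the standard linear blendshape combination; the mesh shapes and the per-base offset formulation are assumptions, since the embodiment does not fix a particular rendering pipeline.

    import numpy as np

    def drive_avatar(neutral_mesh, expression_bases, coeffs):
        # neutral_mesh:      (V, 3) vertices of the neutral face of the virtual digital object
        # expression_bases:  (51, V, 3) target meshes, one per expression base
        # coeffs:            (51,) final multi-dimensional expression coefficients in [0, 1]
        deltas = expression_bases - neutral_mesh               # per-base vertex offsets
        return neutral_mesh + np.tensordot(coeffs, deltas, axes=1)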
Further, with respect to step S200, as shown in fig. 6, the identification network may be obtained by training through steps S610 to S640 described below, and a specific training method of the identification network is exemplarily described below.
Step S610, a first training data set is acquired.
In this embodiment, a large public identity data set including a large number of sample face pictures with different identities may be downloaded via a network as the first training data set, where each sample face picture carries an identity feature label calibrated in advance. For example, in this embodiment, the identity label may be represented by an identity number (ID). As one possible example, the identity feature label may be, but is not limited to, a feature vector that implicitly characterizes a variety of facial features such as face width, glabellar distance, lip width, lip thickness, eye size and eyelid distance of the face in the sample face picture. For example, a picture data set with 2.4 million face pictures may be downloaded over a network as the first training data set; the 2.4 million face pictures may contain 90,000 different identity labels, and each identity label may correspond to about 27 pictures.
Step S620, sequentially inputting each sample face picture in the first training data set into a deep neural network to be trained, performing identity feature prediction on each sample face picture through the deep neural network, and outputting a predicted identity feature corresponding to each sample face picture.
As a possible example, in this embodiment, the deep neural network to be trained may be implemented with a resnet18 network structure. A deep neural network based on the resnet18 network structure may include 17 convolutional layers (Conv) and 1 fully connected layer (fc). The 17 convolutional layers are cascaded in sequence and then connected to the fully connected layer. Each sample face picture is used as the input of the first convolutional layer, the output of each convolutional layer is used as the input of the next convolutional layer, the output of the last convolutional layer is used as the input of the fully connected layer, and the fully connected layer serves as the output layer of the deep neural network. In the iterative training process for each sample face picture, the sample face picture is input into the first convolutional layer, convolution operations (feature extraction) are performed on it by each convolutional layer in turn, and the predicted identity feature corresponding to the sample face picture is then output through the fully connected layer. The predicted identity feature may be a 512-dimensional implicit feature that implicitly characterizes various facial features of the face in the sample face picture.
Wherein, the fully-connected layer can also be understood as a classified fully-connected layer, and the predicted identity feature can also be understood as an identity class feature. For example, the identity class features may be used to characterize which feature classes each part of the face in the sample face picture belongs to, for example, these classes may include, but are not limited to, feature classes such as face width, mouth height, eye width, eyebrow width, and the like. Taking the classification of mouth height as an example, three categories of high, medium and low may be included, where "high" may represent an open-mouth amplitude in a first amplitude range (e.g., a range of 4.5cm to 5 cm), "medium" may represent an open-mouth amplitude in a second amplitude range (e.g., a range of 3.5cm to 4.5 cm), and "low" may represent an open-mouth amplitude in a third amplitude range (e.g., a range of 3cm to 3.5 cm). Of course, this is only a simple example provided for convenience of understanding, and in practical implementation, more detailed feature classification can be made according to actual needs.
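A minimal sketch of such a deep neural network is given below (in Python, assuming a recent torchvision); replacing resnet18's final layer with an identity mapping plus a separate classification fully connected layer is one assumed way to expose both the 512-dimensional predicted identity feature and the identity-class prediction, and the 90,000-class head simply mirrors the example first training data set above.

    import torch.nn as nn
    from torchvision.models import resnet18

    class IdentityNet(nn.Module):
        def __init__(self, num_identities: int = 90000):
            super().__init__()
            backbone = resnet18(weights=None)      # 17 convolutional layers + 1 fully connected layer
            backbone.fc = nn.Identity()            # expose the 512-dimensional feature
            self.backbone = backbone
            self.classifier = nn.Linear(512, num_identities)  # classification fully connected layer

        def forward(self, x):
            feat = self.backbone(x)                # 512-dimensional predicted identity feature
            logits = self.classifier(feat)         # identity-class prediction used during training
            return feat, logits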
Step S630, calculating a loss function value of the deep neural network according to the predicted identity feature of each sample face picture predicted by the deep neural network and the identity feature tag corresponding to each sample face picture.
In this embodiment, the loss function of the deep neural network may be a cross entropy loss function. In an actual iterative training process, the deep neural network outputs a 512-dimensional feature as the predicted identity feature for an input sample face picture, the predicted identity feature is then compared with the identity feature label of the sample face picture, and the loss function value is calculated from the difference between the two.
And step S640, performing iterative optimization on the network parameters of the deep neural network according to the loss function value until a training convergence condition is met, and obtaining the trained deep neural network as the identity recognition network.
For example, the loss function value may be calculated from the Euclidean distance between the predicted identity feature and the identity feature label, a Pearson correlation coefficient, or the like. The closer the predicted identity feature is to the identity feature label of the sample face picture, the smaller the corresponding loss function value (cross entropy); conversely, the further apart they are, the larger the corresponding loss function value. Iteratively adjusting the parameters of the deep neural network according to the loss function value enables the trained deep neural network to accurately recognize the identity features of an input face picture.
Based on this, in this embodiment, iterative training may be performed on each sample face picture in the first training data set in sequence, and when the loss function value calculated in a certain iterative training process is smaller than a preset loss function value threshold, it may be considered that the training convergence condition is satisfied; alternatively, the training convergence condition may be considered to be satisfied when the number of iterative training reaches a preset number.
As such, the training convergence condition may include that the loss function value is less than a preset loss function value threshold or that the number of iterative training reaches a preset number.
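A hedged sketch of this iterative optimization is shown below; the optimizer, learning rate, threshold, iteration limit and the train_loader over the first training data set are all assumed placeholders.

    import torch

    model = IdentityNet()                                        # from the sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # assumed optimizer and learning rate
    criterion = torch.nn.CrossEntropyLoss()
    LOSS_THRESHOLD, MAX_STEPS = 0.05, 100_000                    # assumed convergence settings

    for step, (images, id_labels) in enumerate(train_loader):    # assumed DataLoader over the first training data set
        _, logits = model(images)
        loss = criterion(logits, id_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < LOSS_THRESHOLD or step + 1 >= MAX_STEPS:
            break                                                # training convergence condition met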
Further, with respect to step S200, as shown in fig. 7, the expression recognition network may be obtained by training through steps S710 to S740 described below, and a specific training method of the expression recognition network is exemplarily described below.
Step S710, a second training data set is obtained.
In this embodiment, the second training data set may be obtained in the same or similar manner as the first training data set. The second training data set may comprise a plurality of sample face pictures with previously calibrated emoji label values. In a possible example, the second training data set may also be formed by scaling an expression coefficient value of a sample face picture in the first training data set.
Step S720, performing key point detection on each sample face picture in the second training data set, and obtaining a face subject picture corresponding to the sample face picture according to a key point detection result.
Specifically, in a possible implementation manner, a key point SDK (Software Development Kit) may be used to perform face key point detection on the sample face picture, and the face subject is then extracted according to the detection result to obtain the face main body picture corresponding to the sample face picture. The key point SDK may be any mature key point detection tool currently on the market, which is not limited in this embodiment. The face main body picture may be a face picture that includes the detected face key points; it differs from the sample face picture in that it excludes at least some of the details outside the face area, which makes it convenient to quickly identify and analyze the details of the face in the subsequent network training process.
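One possible form of this extraction step is sketched below; the landmark array is assumed to come from whichever key point SDK is used, and the margin value is an assumption.

    import numpy as np

    def crop_face_subject(image, landmarks, margin=0.1):
        # image:     (H, W, 3) sample face picture
        # landmarks: (K, 2) detected face key points (x, y), from any key point SDK
        x0, y0 = landmarks.min(axis=0)
        x1, y1 = landmarks.max(axis=0)
        pad_x, pad_y = (x1 - x0) * margin, (y1 - y0) * margin
        h, w = image.shape[:2]
        left, top = int(max(x0 - pad_x, 0)), int(max(y0 - pad_y, 0))
        right, bottom = int(min(x1 + pad_x, w)), int(min(y1 + pad_y, h))
        return image[top:bottom, left:right]      # face main body picture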
And step S730, sequentially inputting the face main body pictures respectively corresponding to the sample face pictures into a convolutional neural network to be trained, and performing expression recognition on the face main body pictures through the convolutional neural network to obtain expression coefficient prediction values corresponding to the sample face pictures.
In particular, as a possible example, the convolutional neural network may be a lightweight convolutional neural network, such as the ShuffleNet V2 convolutional neural network, and the loss function of the neural network may be the L1 loss function.
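A minimal sketch of such a convolutional neural network, using the torchvision ShuffleNet V2 model with a 51-dimensional coefficient head and the L1 loss, follows; the sigmoid output and the head replacement are assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import shufflenet_v2_x1_0

    class ExpressionNet(nn.Module):
        def __init__(self, num_coeffs: int = 51):
            super().__init__()
            self.backbone = shufflenet_v2_x1_0(weights=None)
            self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_coeffs)

        def forward(self, x):
            # initial multi-dimensional expression coefficients, one per expression base
            return torch.sigmoid(self.backbone(x))

    criterion = nn.L1Loss()   # mean absolute error between predicted and labelled coefficients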
Step S740, calculating a loss function value of the convolutional neural network according to the expression coefficient prediction value of the sample face picture output by the convolutional neural network and the expression coefficient label value corresponding to the sample face picture.
In this embodiment, the loss function value of the convolutional neural network may be calculated by the following L1 loss formula:
L1Loss = (1/N) * Σ |x_n - y_n|, n = 1, …, N
wherein L1Loss represents the loss function value, x_n represents the predicted expression coefficient value corresponding to the sample face picture output in the n-th iterative training process of the convolutional neural network, y_n represents the expression coefficient label value corresponding to the sample face picture used in the n-th iterative training process, and N is the number of training samples.
And S750, performing iterative optimization on network parameters of the convolutional neural network according to the loss function value of the convolutional neural network until a training convergence condition is met, and obtaining the trained convolutional neural network as the expression recognition network.
In this embodiment, the loss function value of the convolutional neural network may be calculated according to a difference between the expression coefficient predicted value and the expression coefficient tag value. Correspondingly, the smaller the difference is, the smaller the corresponding loss function value is, otherwise, the larger the corresponding loss function value is, and the iterative optimization of the parameters of the convolutional neural network according to the loss function value can enable the convolutional neural network obtained after training to accurately identify the expression coefficient of the input face picture.
Based on this, in this embodiment, iterative training may be performed on each sample face picture in the second training data set in sequence, and when the loss function value calculated in a certain iterative training process is smaller than a preset loss function value threshold, it may be considered that the training convergence condition is satisfied; alternatively, the training convergence condition may be considered to be satisfied when the number of iterative training reaches a preset number.
Correspondingly, the training convergence condition of the convolutional neural network may include that the loss function value is smaller than a preset loss function value threshold or the number of times of iterative training reaches a preset number of times.
Referring to fig. 8, fig. 8 is a schematic view of a computer device for implementing the expression recognition method according to an embodiment of the present application. In this embodiment, the computer device may be the live broadcast providing terminal 100 shown in fig. 2, or may be the live broadcast server 200. For example, on the premise that the live broadcast providing terminal 100 has sufficient data processing capability, the computer device is preferably the live broadcast providing terminal 100. When the data processing capability of the live broadcast providing terminal 100 is not enough to meet the data processing requirement of this embodiment, the computer device is the live broadcast server 200, the live broadcast server 200 identifies and processes the facial image of the anchor (target object) sent by the live broadcast providing terminal 100 to obtain a final multi-dimensional expression coefficient, and then performs expression driving and rendering on the virtual digital object in the live broadcast picture according to the final multi-dimensional expression coefficient to transmit the live broadcast picture to the live broadcast receiving terminal 300.
The computer device may include one or more processors 110, a machine-readable storage medium 120, and an expression recognition system 130. The processor 110 and the machine-readable storage medium 120 may be communicatively connected via a system bus. The machine-readable storage medium 120 stores machine-executable instructions, and the processor 110 implements the expression recognition method described above by reading and executing the machine-executable instructions in the machine-readable storage medium 120.
The machine-readable storage medium 120 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Read-Only Memory (EPROM), an electrically Erasable Read-Only Memory (EEPROM), and the like. The machine-readable storage medium 120 is used for storing a program, and the processor 110 executes the program after receiving an execution instruction.
The processor 110 may be an integrated circuit chip having signal processing capabilities. The Processor may be, but is not limited to, a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), and the like.
Please refer to fig. 9, which is a functional block diagram of the expression recognition system 130. In this embodiment, the expression recognition system 130 may include one or more software functional modules running on the computer device. These software functional modules may be stored in the machine-readable storage medium 120 in the form of a computer program, so that when they are called and executed by the processor 110, the expression recognition method described in this embodiment of the present application is implemented.
In detail, the expression recognition system 130 includes an acquisition module 131, a recognition module 132, and a processing module 133.
The acquiring module 131 is configured to acquire a face image of the target object.
In this embodiment, the target object may be an anchor user who provides a live broadcast service using a live broadcast device in a live broadcast scene. The facial image may be the current video frame of a real-time video captured of the anchor, or may be a face photo, capable of expressing the current facial expression of the target object, captured by the image acquisition device as a single shot at a certain time interval (such as 1 second, 2 seconds, 3 seconds, and the like).
It should be understood that the obtaining module 131 may be configured to execute the step S100 in the foregoing method embodiment, and for details of the obtaining module 131, reference may be made to the above detailed description of the step S100, which is not repeated herein.
The recognition module 132 is configured to input the facial image into an identity recognition network and an expression recognition network obtained through pre-training, and perform identity recognition and expression recognition on the facial image through the identity recognition network and the expression recognition network, respectively, to obtain an identity feature and an initial multi-dimensional expression coefficient of the target object.
In this embodiment, the identity recognition network and the expression recognition network may be obtained by performing network training on a set artificial intelligence neural network with a collected training data set. The identity feature obtained by the trained identity recognition network from the facial image of the target object may be a multi-dimensional implicit feature vector, for example a 512-dimensional feature vector, which implicitly characterizes various facial features of the target object, such as face width, inter-eyebrow distance, lip width, lip thickness, eye size, eyelid distance, and the like. Because such feature vectors can be used to distinguish different user identities, they are referred to as identity features.
Furthermore, the expression recognition network may likewise be obtained by performing network training on a set artificial intelligence neural network with a collected training data set. The initial multi-dimensional expression coefficient obtained by the trained expression recognition network from the facial image of the target object may be multi-dimensional information, for example, a multi-dimensional matrix or array containing 51-dimensional expression coefficients. Taking 51-dimensional expression coefficients as an example, each dimension may correspond to one expression base. Generally, the expression bases for driving the virtual digital object may include 51 expression bases that respectively express various expression contents of the virtual digital object (e.g., squinting, mouth opening, eyebrow frowning, cheek bulging, etc.). Illustratively, each expression coefficient may be any value in the range from 0 to 1.
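For illustration only, and under the assumption that both networks are ResNet-18-style backbones implemented in PyTorch (the application only fixes ResNet-18 for the identity recognition network; see claim 5 below), the data shapes described above might be sketched as follows:

import torch
import torchvision

# Assumed backbones: a 512-dimensional identity feature and 51 expression coefficients.
identity_net = torchvision.models.resnet18(num_classes=512)
expression_net = torchvision.models.resnet18(num_classes=51)

face_image = torch.randn(1, 3, 224, 224)                    # one RGB face image
identity_feature = identity_net(face_image)                  # shape (1, 512): implicit identity vector
initial_coeffs = torch.sigmoid(expression_net(face_image))   # shape (1, 51), each coefficient in 0..1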
It should be understood that the identification module 132 may be configured to execute the step S200 in the foregoing method embodiment, and for details of the identification module 132, reference may be made to the above detailed description of the step S200, which is not repeated herein.
The processing module 133 is configured to process the initial multi-dimensional expression coefficient according to the identity feature to obtain a final multi-dimensional expression coefficient of the target object. In this embodiment, when the pre-trained expression recognition network is subsequently applied to recognize expression coefficients, the expression coefficients output by the expression recognition network are adaptively processed and adjusted in combination with the identity feature obtained by the identity recognition network before being output, so that the expression coefficients of the target object are accurately recognized as the final multi-dimensional expression coefficients, which can subsequently be used to drive the virtual digital image to reproduce the real-time facial expressions of different target objects vividly and finely.
It should be understood that the processing module 133 may be configured to execute the step S300 in the foregoing method embodiment, and for details of the processing module 133, reference may be made to the above detailed description of the step S300, which is not described in detail herein.
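A minimal sketch of this conditional processing, assuming (as in claim 3 below) that the identity feature and the initial coefficients are fed into a full connection layer, and additionally assuming that the two inputs are simply concatenated, could be:

import torch
import torch.nn as nn

class ConditionalAdjustment(nn.Module):
    # Assumed dimensions: 512-dimensional identity feature, 51-dimensional expression coefficients.
    def __init__(self, identity_dim=512, coeff_dim=51):
        super().__init__()
        self.fc = nn.Linear(identity_dim + coeff_dim, coeff_dim)

    def forward(self, identity_feature, initial_coeffs):
        # The identity feature acts as the condition; the full connection layer
        # outputs the final multi-dimensional expression coefficients.
        combined = torch.cat([identity_feature, initial_coeffs], dim=-1)
        return torch.sigmoid(self.fc(combined))

Calling ConditionalAdjustment()(identity_feature, initial_coeffs) would then yield one adjusted coefficient per expression base.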
Further, on the basis of the above, please refer to fig. 9 again: in this embodiment, the expression recognition system 130 may further include a driving module 134 for driving the facial expression of the virtual digital image in the live broadcast picture according to the final multi-dimensional expression coefficient. In this embodiment, the final multi-dimensional expression coefficient is expression information that is output after the initial multi-dimensional expression coefficient is adaptively processed and adjusted in combination with the identity of the target user, and that can more accurately express the current facial expression of the target object. On this basis, adaptively processing the initial multi-dimensional expression coefficient through the identity feature to obtain the final multi-dimensional expression coefficient of the target object, and driving the facial expression of the virtual digital object with it, enables the virtual digital object to express the expression of the target object vividly and accurately.
It should be understood that the driving module 134 may be configured to execute the step S400 in the foregoing method embodiment, and for details of the driving module 134, reference may be made to the above detailed description of the step S400, which is not repeated herein.
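How the final coefficients drive the virtual digital image is not spelled out here; a common blendshape formulation (an assumption, not the application's stated method) weights each expression base by its coefficient:

import numpy as np

def drive_face(neutral_mesh, expression_bases, final_coeffs):
    # neutral_mesh: (V, 3) vertices of the neutral face
    # expression_bases: (51, V, 3) per-base vertex offsets (e.g., squinting, mouth opening)
    # final_coeffs: (51,) final multi-dimensional expression coefficients in 0..1
    return neutral_mesh + np.tensordot(final_coeffs, expression_bases, axes=1)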
Further, on the basis of the above content, in this embodiment, please refer to fig. 9 again, the expression recognition system 130 may further include a training module 135, where the training module 135 is specifically configured to obtain the identity recognition network and the expression recognition network through network training.
Specifically, the training module 135 trains the identity recognition network by:
acquiring a first training data set, wherein the first training data set comprises a plurality of sample face pictures with different identity characteristics, and each sample face picture carries an identity characteristic label calibrated in advance;
sequentially inputting each sample face picture in the first training data set into a deep neural network to be trained, performing identity feature prediction on each sample face picture through the deep neural network, and outputting a predicted identity feature corresponding to each sample face picture;
calculating a loss function value of the deep neural network according to the predicted identity characteristics of each sample face picture predicted by the deep neural network and the identity characteristic labels corresponding to each sample face picture;
and performing iterative optimization on the network parameters of the deep neural network according to the loss function value until a training convergence condition is met, and obtaining the trained deep neural network as the identity recognition network.
In this embodiment, iterative training may be performed on each sample face picture in the first training data set in sequence; when the loss function value calculated in a certain training iteration is smaller than a preset loss function value threshold, the training convergence condition may be considered satisfied. Alternatively, the training convergence condition may be considered satisfied when the number of training iterations reaches a preset number.
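Under the assumption that the pre-calibrated identity feature labels can be treated as identity class indices (so that the cross-entropy loss of claim 5 applies), and with placeholder data loading, a sketch of this training procedure in PyTorch might be:

import torch
import torch.nn as nn
import torchvision

num_identities = 1000                               # assumed number of distinct identities
model = torchvision.models.resnet18(num_classes=num_identities)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# first_training_loader is a placeholder yielding (sample face picture, identity label) pairs.
for iteration, (face_picture, identity_label) in enumerate(first_training_loader):
    predicted = model(face_picture)                 # predicted identity for the sample face picture
    loss = criterion(predicted, identity_label)     # compare with the pre-calibrated label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # iterative optimization of the network parameters
    if has_converged(loss.item(), iteration):       # convergence check as sketched earlier
        break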
It should be understood that the training module 135 may be configured to perform the method steps corresponding to fig. 6 in the above method embodiment, and for details of the training module 135, reference may be made to the above detailed description of each method step of fig. 6, which is not described herein again.
Further, the training module 135 trains the expression recognition network by:
acquiring a second training data set, wherein the second training data set can comprise a plurality of sample face pictures with pre-calibrated expression coefficient label values;
performing key point detection on each sample face picture in the second training data set, and obtaining a face main body picture corresponding to the sample face picture according to a key point detection result;
sequentially inputting the face main body pictures respectively corresponding to the sample face pictures into a convolutional neural network to be trained, and performing expression recognition on the face main body pictures through the convolutional neural network to obtain expression coefficient prediction values corresponding to the sample face pictures;
calculating a loss function value of the convolutional neural network according to the expression coefficient prediction value of the sample face picture output by the convolutional neural network and the expression coefficient label value corresponding to the sample face picture;
and performing iterative optimization on the network parameters of the convolutional neural network according to the loss function value of the convolutional neural network until a training convergence condition is met, and obtaining the trained convolutional neural network as the expression recognition network.
In this embodiment, the loss function value of the convolutional neural network may be calculated according to the difference between the expression coefficient predicted value and the expression coefficient label value. Correspondingly, the smaller the difference, the smaller the loss function value, and vice versa. Iteratively optimizing the parameters of the convolutional neural network according to the loss function value enables the trained convolutional neural network to accurately identify the expression coefficients of an input face picture.
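Analogously, and again as a non-authoritative sketch (the key-point based cropping and the data loader are placeholders, and using ResNet-18 as the convolutional neural network here is an assumption), the expression-network training could be written as:

import torch
import torch.nn as nn
import torchvision

expr_model = torchvision.models.resnet18(num_classes=51)
l1_loss = nn.L1Loss()
expr_optimizer = torch.optim.Adam(expr_model.parameters(), lr=1e-4)

# second_training_loader and crop_by_keypoints are placeholders for the second
# training data set and the key-point based face main body cropping, respectively.
for iteration, (sample_picture, coeff_labels) in enumerate(second_training_loader):
    face_main_body = crop_by_keypoints(sample_picture)
    predicted_coeffs = torch.sigmoid(expr_model(face_main_body))
    loss = l1_loss(predicted_coeffs, coeff_labels)   # smaller difference gives a smaller loss value
    expr_optimizer.zero_grad()
    loss.backward()
    expr_optimizer.step()
    if has_converged(loss.item(), iteration):
        break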
It should be understood that the training module 135 may be further configured to perform the method steps corresponding to fig. 7 in the above method embodiment, and for details of the training module 135, reference may also be made to the above detailed description of each method step of fig. 7, which is not described herein again.
To sum up, the expression recognition method, system, and computer device provided in the embodiments of the present application differ from conventional expression recognition technology in that they innovatively introduce an identity recognition network as a conditional network, perform identity recognition on each facial expression picture (facial image) of the target object through this conditional network, and output an identity feature that describes or expresses the personalized facial characteristics of the target object. The identity feature is then also used as a conditional input for expression recognition; with this condition, even for two photos with the same expression amplitude, adaptive expression coefficients can be output according to the personalized facial characteristics of the different objects. Therefore, compared with a traditional expression recognition scheme that does not consider the identity feature of the target object, the final expression recognition result can more accurately express the facial expression of the target object, thereby achieving more accurate expression recognition of the target object. Meanwhile, when the final multi-dimensional expression coefficient corresponding to the facial image of the target object obtained with this expression recognition method is used to drive the facial expression of the virtual digital object, the virtual digital object can express the facial expression of the target object more vividly and finely.
Further, compared with mature expression capture technology that relies on heavy helmet-mounted face capture devices, the expression of the target object (anchor) in the camera picture can be accurately recognized with only a camera or other simple image acquisition device, which greatly reduces the threshold and cost of virtual live broadcast and facilitates its popularization.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An expression recognition method, characterized in that the method comprises:
acquiring a face image of a target object;
inputting the facial image into an identity recognition network and an expression recognition network obtained through pre-training, and respectively performing identity recognition and expression recognition on the facial image through the identity recognition network and the expression recognition network to obtain an identity feature and an initial multi-dimensional expression coefficient of the target object; wherein the identity feature comprises implicit features for implicitly characterizing at least one piece of facial information of the target object;
and processing the initial multi-dimensional expression coefficient according to the identity characteristics to obtain a final multi-dimensional expression coefficient of the target object.
2. The expression recognition method according to claim 1, further comprising:
and driving the facial expression of the virtual digital image in the live broadcast picture according to the final multi-dimensional expression coefficient.
3. The expression recognition method according to claim 1 or 2, wherein the identity recognition network and the expression recognition network, after being cascaded, are connected to a full connection layer;
the processing the initial multi-dimensional expression coefficient according to the identity characteristics to obtain a final multi-dimensional expression coefficient of the target object comprises:
inputting the identity characteristics output after the identity recognition of the facial image by the identity recognition network into the full connection layer as conditions;
inputting the initial multi-dimensional expression coefficient output by the expression recognition network into the full connection layer;
and processing the initial multi-dimensional expression coefficient according to the identity characteristics through the full connection layer to obtain the final multi-dimensional expression coefficient.
4. The expression recognition method according to claim 1 or 2, further comprising a network training step for obtaining the identity recognition network, specifically comprising:
acquiring a first training data set, wherein the first training data set comprises a plurality of sample face pictures with different identity characteristics, and each sample face picture carries an identity characteristic label calibrated in advance;
sequentially inputting each sample face picture in the first training data set into a deep neural network to be trained, performing identity feature prediction on each sample face picture through the deep neural network, and outputting a predicted identity feature corresponding to each sample face picture;
calculating a loss function value of the deep neural network according to the predicted identity characteristics of each sample face picture predicted by the deep neural network and the identity characteristic labels corresponding to each sample face picture;
and performing iterative optimization on the network parameters of the deep neural network according to the loss function value until a training convergence condition is met, and obtaining the trained deep neural network as the identity recognition network.
5. The expression recognition method according to claim 4, wherein the network structure of the deep neural network is a resnet18 network structure, and the loss function value of the deep neural network is calculated by a cross entropy loss function.
6. The expression recognition method according to claim 1 or 2, further comprising a network training step for obtaining the expression recognition network, specifically comprising:
acquiring a second training data set, wherein the second training data set can comprise a plurality of sample face pictures with pre-calibrated expression coefficient label values;
performing key point detection on each sample face picture in the second training data set, and obtaining a face main body picture corresponding to the sample face picture according to a key point detection result;
sequentially inputting the face main body pictures respectively corresponding to the sample face pictures into a convolutional neural network to be trained, and performing expression recognition on the face main body pictures through the convolutional neural network to obtain expression coefficient prediction values corresponding to the sample face pictures;
calculating a loss function value of the convolutional neural network according to the expression coefficient prediction value of the sample face picture output by the convolutional neural network and the expression coefficient label value corresponding to the sample face picture;
and performing iterative optimization on the network parameters of the convolutional neural network according to the loss function value of the convolutional neural network until a training convergence condition is met, and obtaining the trained convolutional neural network as the expression recognition network.
7. The expression recognition method according to claim 6, wherein the loss function value of the convolutional neural network is calculated by the following formula:
L1Loss = |x_n - y_n|

wherein L1Loss represents the loss function value of the convolutional neural network, x_n represents the expression coefficient predicted value corresponding to the sample face picture output in the nth iterative training process of the convolutional neural network, and y_n represents the expression coefficient label value corresponding to the sample face picture used in the nth iterative training process.
8. An expression recognition system, comprising:
an acquisition module for acquiring a face image of a target object;
the recognition module is used for inputting the facial image into an identity recognition network and an expression recognition network obtained through pre-training, and respectively performing identity recognition and expression recognition on the facial image through the identity recognition network and the expression recognition network to obtain the identity feature and the initial multi-dimensional expression coefficient of the target object; wherein the identity feature comprises implicit features for implicitly characterizing at least one piece of facial information of the target object;
and the processing module is used for processing the initial multi-dimensional expression coefficient according to the identity characteristics to obtain a final multi-dimensional expression coefficient of the target object.
9. The expression recognition system of claim 8, further comprising a driver module and a training module, wherein:
the driving module is used for driving the facial expression of the virtual digital image in the live broadcast picture according to the final multi-dimensional expression coefficient;
the training module is configured to:
acquiring a first training data set, wherein the first training data set comprises a plurality of sample face pictures with different identity characteristics, and each sample face picture carries an identity characteristic label calibrated in advance;
sequentially inputting each sample face picture in the first training data set into a deep neural network to be trained, performing identity feature prediction on each sample face picture through the deep neural network, and outputting a predicted identity feature corresponding to each sample face picture;
calculating a loss function value of the deep neural network according to the predicted identity characteristics of each sample face picture predicted by the deep neural network and the identity characteristic labels corresponding to each sample face picture;
performing iterative optimization on network parameters of the deep neural network according to the loss function value until a training convergence condition is met, and obtaining a trained deep neural network as the identity recognition network;
the training module is further configured to:
acquiring a second training data set, wherein the second training data set can comprise a plurality of sample face pictures with pre-calibrated expression coefficient label values;
performing key point detection on each sample face picture in the second training data set, and obtaining a face main body picture corresponding to the sample face picture according to a key point detection result;
sequentially inputting the face main body pictures respectively corresponding to the sample face pictures into a convolutional neural network to be trained, and performing expression recognition on the face main body pictures through the convolutional neural network to obtain expression coefficient prediction values corresponding to the sample face pictures;
calculating a loss function value of the convolutional neural network according to the expression coefficient prediction value of the sample face picture output by the convolutional neural network and the expression coefficient label value corresponding to the sample face picture;
and performing iterative optimization on the network parameters of the convolutional neural network according to the loss function value of the convolutional neural network until a training convergence condition is met, and obtaining the trained convolutional neural network as the expression recognition network.
10. A computer device comprising a machine-readable storage medium and one or more processors, the machine-readable storage medium having stored thereon machine-executable instructions that, when executed by the one or more processors, implement the method of any one of claims 1-7.