CN111489424A - Virtual character expression generation method, control method, device and terminal equipment - Google Patents

Virtual character expression generation method, control method, device and terminal equipment

Info

Publication number
CN111489424A
CN111489424A
Authority
CN
China
Prior art keywords
audio data
virtual character
expression
neural network
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010283348.1A
Other languages
Chinese (zh)
Inventor
郑一星
张智勐
陈佳丽
丁彧
范长杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202010283348.1A
Publication of CN111489424A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00: Animation
    • G06T13/20: 3D [Three Dimensional] animation
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a virtual character expression generation method, a control method, a device and terminal equipment. The method comprises the following steps: determining audio data and an emotion label corresponding to the audio data, wherein the emotion label is used for representing the emotion of the virtual character when the audio data is played in a game scene; extracting voice features of the audio data; inputting the voice features and the emotion label into a neural network model and outputting mixed deformation parameters corresponding to the audio data, wherein the neural network model is trained in advance based on voice sample features labeled with emotion labels; and controlling the expression of the virtual character in the game scene when the audio data is played according to the mixed deformation parameters. Because the mixed deformation parameters are output by the neural network model, no video needs to be produced with professional actors and expensive recording equipment, which saves the time and money cost of generating virtual character expressions; moreover, the mixed deformation parameters output by the neural network take the influence of the emotion label into account, so the generated expression is more natural.

Description

Virtual character expression generation method, control method, device and terminal equipment
Technical Field
The invention relates to the technical field of machine learning, and in particular to a virtual character expression generation method, a virtual character expression control method, corresponding devices and a terminal device.
Background
With the development of the game business, the virtual characters in games tend to be more and more realistic. In particular, to improve the interactivity between the game and the player, game producers pay more and more attention to how natural the expression of a virtual character is when it speaks. To construct a virtual character, blendshape (mixed deformation) parameters need to be set, and the expression of the virtual character is controlled by these mixed deformation parameters. At present, the mixed deformation parameters are mainly generated either from a video of an actor's performance or from the audio to be played.
If the mixed deformation parameters are generated from a video of an actor's performance, the actor wears professional recording equipment and makes the appropriate expressions while his or her face is recorded on video. For each frame of the facial video, a dedicated annotator marks the positions of the key points of the actor's face. Because the key-point positions correspond one-to-one with the mixed deformation parameters, the mixed deformation parameters can be calculated from the key-point positions and used to control the expression of the virtual character. However, this method has the following disadvantages: 1. it requires professional actors and expensive recording equipment; 2. the annotators need to spend a long time marking the key-point positions of the actor's face; 3. it is difficult to customize the expression of the virtual character according to the player's needs.
Disclosure of Invention
In view of this, the present invention provides a virtual character expression generation method, a control method, a device and a terminal device, so as to save the time and money cost of generating virtual character expressions and to generate natural expressions for virtual characters.
In a first aspect, an embodiment of the present invention provides a virtual character expression generating method, including: determining audio data and emotion labels corresponding to the audio data, wherein the emotion labels are used for representing emotions of virtual characters when the audio data are played in a game scene; extracting voice characteristics of the audio data; inputting the voice characteristics and the emotion labels into a neural network model, and outputting mixed deformation parameters corresponding to the audio data; the neural network model is trained and completed based on the characteristics of the voice samples marked with emotion labels in advance; and controlling the expression of the virtual character in the game scene when the audio data is played according to the mixed deformation parameters.
In a preferred embodiment of the present invention, the step of determining the audio data and the emotion labels corresponding to the audio data includes: acquiring a game scene containing a virtual character; the game scene is also configured with at least one audio data; and each audio data is configured with an emotion label corresponding to the virtual character.
In a preferred embodiment of the present invention, the voice feature includes a Mel-frequency cepstral coefficient (MFCC) feature; the step of extracting the voice feature of the audio data includes: carrying out high-pass filtering on the audio data to obtain an audio frame optimization sequence; sampling the audio frame optimization sequence to obtain a plurality of target frame signals, wherein the duration of the sampling is greater than the interval time of the sampling; attenuating two ends of each target frame signal to obtain an optimized frame signal of each target frame; performing fast Fourier transform on each optimized frame signal to obtain a frequency domain signal corresponding to each optimized frame signal; inputting each frequency domain signal into a preset triangular filter bank, and outputting logarithmic energy corresponding to each frequency domain signal; and performing discrete cosine transform on each logarithmic energy to obtain the Mel-frequency cepstral coefficient feature corresponding to the audio data.
In a preferred embodiment of the present invention, before the step of inputting the audio data into the high-pass filter for high-pass filtering, the method further comprises: the audio data is converted into 16kHz monaural audio data.
In a preferred embodiment of the present invention, after the step of performing discrete cosine transform on each logarithmic energy to obtain mel-frequency cepstrum coefficient characteristics corresponding to the audio data, the method further includes: the pre-set average energy information is added to the mel-frequency cepstrum coefficient characteristics.
In a preferred embodiment of the present invention, the neural network model is trained by: training a neural network model based on a preset sample set; the sample set comprises a plurality of training voice features, and each training voice feature is labeled with an emotion label and a standard mixed deformation parameter corresponding to the training voice feature.
In a preferred embodiment of the present invention, the step of training the neural network model based on the preset sample set includes: determining current training speech features from a sample set; inputting the current training voice features and emotion labels labeled by the current training voice features into a neural network model, and outputting training mixed deformation parameters of the current training voice features; calculating a loss value of the training mixed deformation parameter according to the standard mixed deformation parameter and a preset loss function; and adjusting parameters of the neural network model according to the loss value, determining the next training voice characteristic from the sample set to train the neural network model until the loss value is converged, and obtaining the trained neural network model.
In a preferred embodiment of the present invention, the step of controlling the expression of the virtual character in the game scene when playing the audio data according to the mixed deformation parameter includes: acquiring a face model of a preset virtual character; and adjusting parameters of the face model according to the mixed deformation parameters so as to control the expression of the virtual character when the audio data is played in the game scene.
In a second aspect, an embodiment of the present invention further provides a method for controlling expressions of virtual characters, where the method is applied to a game client, a virtual character in a game at the game client is configured with a target expression, and the target expression is an expression generated by applying the method; the method comprises the following steps: if the current game scene contains the virtual character, acquiring audio data configured in the current game scene and a target expression corresponding to the virtual character; and in the process of playing the audio data, synchronously controlling the expression of the virtual character to be the target expression.
In a third aspect, an embodiment of the present invention further provides an apparatus for generating an expression of a virtual character, where the apparatus includes: the scene obtaining module is used for determining the audio data and emotion labels corresponding to the audio data, wherein the emotion labels are used for representing the emotion of the virtual characters when the audio data are played in the game scene; the characteristic extraction module is used for extracting voice characteristics of the audio data; the parameter output module is used for inputting the voice characteristics and the emotion labels into the neural network model and outputting mixed deformation parameters corresponding to the audio data; the neural network model is trained and completed based on the characteristics of the voice samples marked with emotion labels in advance; and the expression generation module is used for controlling the expression of the virtual character in the game scene when the audio data is played according to the mixed deformation parameters.
In a fourth aspect, an embodiment of the present invention further provides a virtual character expression control apparatus, where the apparatus is applied to a game client, a virtual character in a game of the game client is configured with a target expression, and the target expression is an expression generated by applying the foregoing method; the device comprises: the configuration acquisition module is used for acquiring audio data configured in the current game scene and a target expression corresponding to the virtual character if the current game scene contains the virtual character; and the expression control module is used for synchronously controlling the expression of the virtual character to be the target expression in the process of playing the audio data.
In a fifth aspect, an embodiment of the present invention further provides a terminal device, which includes a processor and a memory, where the memory stores computer-executable instructions that can be executed by the processor, and the processor executes the computer-executable instructions to implement the steps of the virtual character expression generation method or the virtual character expression control method.
In a sixth aspect, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to implement the virtual character expression generation method or the virtual character expression control method.
The embodiment of the invention has the following beneficial effects:
the virtual character expression generation method, control method, device and terminal equipment provided by the embodiments of the invention obtain audio data and the emotion label corresponding to the audio data; the voice features extracted from the audio data and the emotion label are input into a neural network model, mixed deformation parameters are output, and the expression of the virtual character is generated based on the mixed deformation parameters. Because the mixed deformation parameters are output by the neural network model, no video needs to be produced with professional actors and expensive recording equipment, which saves the time and money cost of generating virtual character expressions; moreover, the mixed deformation parameters output by the neural network take the influence of the emotion label into account, so the generated expression is more natural.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a virtual character expression generating method according to an embodiment of the present invention;
fig. 2 is a flowchart of another virtual character expression generation method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an MFCC feature extraction method according to an embodiment of the present invention;
fig. 4 is a flowchart of a method for controlling expressions of virtual characters according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a virtual character expression generating apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an expression control device for virtual characters according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to save the time and money cost of generating virtual character expressions and to generate natural expressions for virtual characters, embodiments of the present invention provide a virtual character expression generation method, a control method, a device and a terminal device. The virtual character expression generation method can be applied to a terminal device (e.g., a server or a terminal computer) of a virtual character producer, and the virtual character expression control method can be applied to the game client of a game producer or a game player.
To facilitate understanding of the embodiment, a detailed description is first given to a virtual character expression generation method disclosed in the embodiment of the present invention.
The embodiment provides a virtual character expression generation method, referring to a flowchart of the virtual character expression generation method shown in fig. 1, the virtual character expression generation method includes the following steps:
step S102, determining the audio data and emotion labels corresponding to the audio data; the emotion label is used for representing the emotion of the virtual character when the audio data are played in the game scene.
The virtual character refers to a character or an animal character which does not exist in reality, the virtual character is created by a virtual character creator, the virtual character in this embodiment may be a virtual character appearing in a game or a movie, and the corresponding virtual character creator is a game creator or a movie creator.
If a virtual character appears in a game scene, that scene is a game scene containing the virtual character. For example, a virtual character may appear directly in the game scene, or in an animation of the game scene; both are scenes containing a virtual character.
The audio data refers to a segment of audio; the audio contains a plurality of frames, and the length of the audio data is not limited. The emotion labels are set manually, correspond to the audio data, and are used for representing the emotion of the virtual character whose expression is to be generated when the audio data is played in the game scene. For example: if the emotion label is "happy", the expression of the virtual character should be happy when the audio data is played in the game scene.
And step S104, extracting the voice characteristics of the audio data.
The voice features of different audio data are different, and the voice features can reflect certain characteristics of the audio data. The voice features include MFCC features, PLP (Perceptual Linear Predictive) features, and the like.
Step S106, inputting the voice characteristics and the emotion labels into a neural network model, and outputting mixed deformation parameters corresponding to the audio data; the neural network model is trained and completed based on the characteristics of the voice samples marked with emotion labels in advance.
If the mixed deformation parameters are generated only according to the audio to be played, the audio data to be played can be pre-recorded, and the phonemes corresponding to the audio are determined from the audio data. Meanwhile, the game producer presets the blendshape values corresponding to each phoneme and replaces each recognized phoneme with its preset mixed deformation parameters. However, this method has the following disadvantages: 1. simply mapping each phoneme to a fixed set of mixed deformation parameters makes the facial motion stiff and lifeless, so the generated expression is unnatural; 2. it is difficult to generate emotional motion that matches human perception for different emotions, which also leads to unnatural expressions.
In step S106, the neural network model outputs mixed deformation parameters according to the input voice features and emotion label. The mixed deformation parameters are the parameters that drive the facial expression of the virtual character; they form a multi-dimensional array used to control the movement of the virtual character's face. In this embodiment, the neural network model is used as the generation tool for the mixed deformation parameters, and an efficient mixed-deformation-parameter generation network model is trained by reasonably designing the network structure and network parameters.
And S108, controlling the expression of the virtual character in the game scene when the audio data is played according to the mixed deformation parameters.
When the blendshape values of the virtual character change, the eyebrows, eyes, mouth and other parts of the virtual character move accordingly, producing the facial expression of the virtual character. By accurately adjusting the blendshape values, the virtual character can be made to adopt a speaking state that corresponds to the audio.
The mixed deformation parameters have 51 dimensions, and the value of each dimension ranges from 0 to 100; the facial expression of the virtual character moves as the value of each dimension changes. For example, when the 20th dimension (jawOpen) changes from 0 to 100, the mouth opens from fully closed to its maximum.
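As a toy numerical illustration of this parameterization (the 51 dimensions and the 0-100 value range come from the text above, while treating index 19 as the "20th dimension" and the name jawOpen are assumptions about the ordering, not values taken from the patent):

```python
import numpy as np

blendshape = np.zeros(51)      # one value per mixed deformation dimension, each in [0, 100]
JAW_OPEN = 19                  # the "20th dimension" in 0-based indexing (assumed)
blendshape[JAW_OPEN] = 100.0   # mouth fully open; leaving it at 0 keeps the mouth closed
```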
The embodiment of the invention provides a virtual character expression generation method, which comprises the steps of obtaining audio data and the emotion label corresponding to the audio data; inputting the voice features extracted from the audio data and the emotion label into a neural network model, outputting mixed deformation parameters, and generating the expression of the virtual character based on the mixed deformation parameters. Because the mixed deformation parameters are output by the neural network model, no video needs to be produced with professional actors and expensive recording equipment, which saves the time and money cost of generating virtual character expressions; moreover, the mixed deformation parameters output by the neural network take the influence of the emotion label into account, so the generated expression is more natural.
The embodiment of the invention also provides another virtual character expression generation method, which mainly describes a specific implementation of determining the audio data and the emotion labels corresponding to the audio data and of extracting the voice features of the audio data. Referring to the flowchart of another virtual character expression generation method shown in fig. 2, the virtual character expression generation method includes the following steps:
step S202, obtaining a game scene containing virtual characters; the game scene is also configured with at least one piece of audio data, and each piece of audio data is configured with an emotion tag corresponding to the virtual character; the emotion label is used for representing the emotion of the virtual character when the audio data are played in the game scene.
At least one piece of audio data needs to be configured in the game scene, and the configured audio data can be played synchronously or asynchronously. Each piece of audio data is configured with an emotion tag corresponding to the virtual character, and when the audio data is played in the game scene, the expression of the virtual character can be the expression corresponding to the emotion tag.
The emotion of the virtual character when the audio data is played in a game scene is reflected in at least two ways. In one case, the virtual character itself speaks the audio data and makes the corresponding expression; in the other, the virtual character does not speak the audio data, which is instead played as background music or spoken by another character, and the virtual character makes the corresponding expression after hearing it.
And step S204, carrying out high-pass filtering on the audio data to obtain an audio frame optimization sequence.
Referring to fig. 3, a schematic diagram of an MFCC feature extraction method is shown; in this embodiment, MFCC features are extracted as the voice features. The sampling rate of the audio may be unified before high-pass filtering the audio data, for example: the audio data is converted into 16kHz mono audio data, and the converted 16kHz audio data is the audio (16k) in FIG. 3. Unifying the audio data into 16kHz mono mainly facilitates the subsequent MFCC feature extraction and improves its efficiency.
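For illustration only, such a conversion can be done in one call with a common audio library (librosa is used here as an example; the file name is hypothetical):

```python
import librosa

# librosa.load resamples to the requested rate and downmixes to mono in one step
signal, sr = librosa.load("dialogue_line.wav", sr=16000, mono=True)
```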
As shown in fig. 3, the audio data is then pre-emphasized, i.e., filtered by a high-pass filter. Pre-emphasis is intended to boost the high-frequency part of the audio data and flatten its spectrum, so that the spectrum can be obtained with the same signal-to-noise ratio over the entire band from low to high frequencies.
Step S206, sampling the audio frame optimization sequence to obtain a plurality of target frame signals; wherein the duration of the samples is greater than the interval of the samples.
As shown in fig. 3, framing and windowing are required after pre-emphasis. Framing samples the audio frame optimization sequence, and the sampling duration is longer than the sampling interval, which ensures that adjacent target frame signals overlap and share a common part. When machine learning is performed by the neural network, this consistency between adjacent target frame signals ensures that the output mixed deformation parameters are more reasonable and the generated expression is more natural. Further, the sampling duration may be 25 milliseconds and the sampling interval may be 10 milliseconds. For example: 0-25 milliseconds form the first target frame signal, 10-35 milliseconds form the second target frame signal, 20-45 milliseconds form the third target frame signal, and so on.
And S208, attenuating two ends of each target frame signal to obtain an optimized frame signal of each target frame.
A Hamming window (the windowing shown in fig. 3) is applied to each target frame signal. The purpose of windowing is to increase the continuity at the left and right ends of the target frame signal.
Step S210, performing fast fourier transform on each optimized frame signal to obtain a frequency domain signal corresponding to each optimized frame signal.
After windowing, the optimized frame signal is a time domain signal, and therefore, the optimized frame signal needs to be subjected to fast fourier transform to obtain energy distribution on a frequency spectrum, that is, the framed and windowed optimized frame signal is subjected to fast fourier transform to obtain a frequency domain signal corresponding to the optimized frame signal.
Step S212, inputting each frequency domain signal into a preset triangular filter bank, and outputting logarithmic energy corresponding to each frequency domain signal.
The triangular filters are the Mel filters shown in fig. 3. The Mel filters smooth the frequency-domain signal, eliminate the effect of harmonics, and highlight the formants of the audio. For example, filtering with a bank of 40 Mel filters yields a spectrum that matches the auditory characteristics of the human ear, and the logarithm is taken as the output.
Step S214, discrete cosine transform is carried out on each logarithmic energy to obtain a Mel cepstrum coefficient characteristic corresponding to the audio data.
After the discrete cosine transform, the MFCC features are obtained; at this point they are 13-dimensional and reflect only the static characteristics of the audio data. To reflect both the static and the dynamic characteristics of the audio data, average energy information needs to be added for each frame, which can be done as follows: the preset average energy information is added to the Mel-frequency cepstral coefficient features. After the energy information is added, the MFCC features have 28 dimensions per frame. Combining all frames yields a 28 × T matrix, where 28 is the feature dimension and T is the number of frames of the audio.
In the method, the MFCC features can be quickly and accurately extracted as the voice features, and the extracted MFCC features reflect the static characteristics and the dynamic characteristics of the audio data at the same time.
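The steps above (pre-emphasis, overlapping framing, Hamming windowing, FFT, Mel filtering, logarithm, DCT) follow the standard MFCC recipe. The sketch below illustrates them in Python with NumPy/SciPy/librosa; the frame length, hop size, filter count and the way an energy term is appended are assumptions chosen to match the description, not values taken from the patent's implementation.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(signal, sr=16000, frame_ms=25, hop_ms=10,
                  n_fft=512, n_mels=40, n_ceps=13):
    """Illustrative MFCC extraction mirroring steps S204-S214 (parameter values are assumptions)."""
    # S204: pre-emphasis, a first-order high-pass filter
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # S206: overlapping framing (frame length > hop length); assumes the signal
    # is at least one frame long
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = emphasized[idx]

    # S208: Hamming window attenuates both ends of each target frame
    frames = frames * np.hamming(frame_len)

    # S210: fast Fourier transform -> power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # S212: triangular (Mel) filter bank followed by the logarithm
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft//2 + 1)
    log_energy = np.log(power @ mel_fb.T + 1e-10)

    # S214: discrete cosine transform keeps the first n_ceps (static) coefficients
    ceps = dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

    # Energy information appended per frame; the patent reports 28 dims per frame,
    # but its exact composition is not specified, so this term is only illustrative
    frame_energy = np.log(np.sum(power, axis=1, keepdims=True) + 1e-10)
    return np.concatenate([ceps, frame_energy], axis=1).T  # (feature_dim, T)
```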
Step S216, inputting the Mel cepstrum coefficient characteristics and the emotion labels into the neural network model, and outputting mixed deformation parameters corresponding to the audio data; the neural network model is trained and completed based on the characteristics of the voice samples marked with emotion labels in advance.
For the neural network model, training the neural network model based on a preset sample set is needed; the sample set comprises a plurality of training voice features, and each training voice feature is labeled with an emotion label and a standard mixed deformation parameter corresponding to the training voice feature.
The training voice features in the sample set can be obtained from a virtual character whose expressions have already been tuned: the voice features of that tuned virtual character serve as the training voice features, the mixed deformation parameters of the virtual character corresponding to those voice features are selected as the standard mixed deformation parameters, and the emotion labels, which are assigned manually, serve as the emotion labels annotated on the training voice features.
For example: the method comprises the steps of adjusting a virtual character A, obtaining a plurality of voice features of the virtual character A in a certain game scene as training voice features, taking mixed deformation parameters when each training voice feature is played by the virtual character A in the game scene as standard mixed deformation parameters corresponding to the training voice features, and finally manually marking emotion labels of each training voice feature.
Specifically, the neural network model may be trained by steps A1-A4:
step A1, determining current training speech features from the sample set.
Randomly selecting a training voice feature from the sample set as a current training voice feature, and determining an emotion label labeled by the current training voice feature in the sample set and a mixed deformation parameter corresponding to the current training voice feature.
Step A2, inputting the current training voice features and the emotion labels labeled by the current training voice features into the neural network model, and outputting the training mixed deformation parameters of the current training voice features.
The current training voice features and the emotion labels annotated on them are used as the input of the neural network model, and the mixed deformation parameters output by the model are the training mixed deformation parameters of the current training voice features. In addition, the neural network model can be a U-Net structure with a CNN (Convolutional Neural Network) backbone; since the receptive field of a CNN grows as the network gets deeper, the model can learn by itself how the duration of a sound and the preceding and following audio features influence the current expression, and thus generate natural expression motion.
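A minimal sketch of such a model is given below: a 1D CNN encoder-decoder with a skip connection (U-Net-style) that consumes per-frame speech features concatenated with an emotion embedding and predicts 51 blend-shape values per frame. The layer sizes, the embedding dimension and the choice to broadcast the emotion over time are illustrative assumptions, not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class BlendshapeUNet(nn.Module):
    """Illustrative U-Net-style 1D CNN: speech features + emotion label -> blend shapes."""

    def __init__(self, feat_dim=28, n_emotions=8, emb_dim=8, n_blendshapes=51):
        super().__init__()
        # feat_dim is the per-frame feature dimension (the patent reports 28)
        self.emotion_emb = nn.Embedding(n_emotions, emb_dim)
        in_ch = feat_dim + emb_dim
        self.enc1 = nn.Sequential(nn.Conv1d(in_ch, 64, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv1d(128, 128, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose1d(128, 64, 4, stride=2, padding=1)
        self.dec = nn.Sequential(nn.Conv1d(128, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv1d(64, n_blendshapes, 3, padding=1))

    def forward(self, feats, emotion):
        # feats: (batch, feat_dim, T); emotion: (batch,) integer labels
        T = feats.shape[-1]
        emb = self.emotion_emb(emotion).unsqueeze(-1).expand(-1, -1, T)
        x = torch.cat([feats, emb], dim=1)
        e1 = self.enc1(x)                          # (batch, 64, T)
        e2 = self.enc2(e1)                         # (batch, 128, ~T/2)
        u = self.up(self.bottleneck(e2))[..., :T]  # upsample and crop back to length T
        out = self.dec(torch.cat([u, e1], dim=1))  # skip connection from the encoder
        return 100.0 * torch.sigmoid(out)          # blend-shape values in [0, 100]
```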
And step A3, calculating the loss value of the training mixed deformation parameter according to the standard mixed deformation parameter and a preset loss function.
And substituting the training mixed deformation parameters and the mixed deformation parameters corresponding to the current training voice characteristics into a predetermined loss function, and calculating the loss value of the loss function.
And step A4, adjusting parameters of the neural network model according to the loss value, determining the next training voice characteristic from the sample set to train the neural network model until the loss value is converged, and obtaining the trained neural network model.
The parameters of the neural network model are then adjusted according to the loss value; the next training voice feature is then selected from the sample set, and the neural network model is trained with it to obtain the loss value corresponding to that training voice feature. The parameters of the neural network model are adjusted continuously in this way until the loss value converges, at which point training stops and the trained neural network model is obtained.
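A condensed training loop following steps A1-A4 could look like the sketch below. The mean-squared error is used purely as a placeholder, since the patent does not name a specific loss function, and `BlendshapeUNet` / `TrainingSample` refer to the illustrative definitions sketched above.

```python
import torch
from torch import nn, optim

def train_model(model, samples, epochs=100, lr=1e-3, device="cpu"):
    """Sketch of steps A1-A4: pick a sample, forward, compute loss, update, repeat."""
    model = model.to(device)
    loss_fn = nn.MSELoss()                       # placeholder for the preset loss function
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                      # stands in for "until the loss converges"
        for s in samples:                        # A1: current training voice feature
            feats = torch.tensor(s.speech_features, dtype=torch.float32,
                                 device=device).unsqueeze(0)
            emotion = torch.tensor([s.emotion_label], device=device)
            target = torch.tensor(s.target_blendshapes, dtype=torch.float32,
                                  device=device).unsqueeze(0)
            pred = model(feats, emotion)         # A2: training mixed deformation parameters
            loss = loss_fn(pred, target)         # A3: loss against the standard parameters
            optimizer.zero_grad()
            loss.backward()                      # A4: adjust the model parameters
            optimizer.step()
    return model
```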
In this method, the emotion label is added as an input of the neural network model, so that the trained neural network model can generate, according to the input, virtual character expressions customized to the corresponding emotion. The number of parameters of the neural network can be kept reasonable, which reduces the resources consumed by the neural network model while preserving its accuracy and increases its output speed. The average prediction time for each piece of audio data is less than 50 milliseconds, which meets the virtual character maker's requirement of generating expressions in real time. This provides technical support for the virtual character maker to develop new functions; for example, the maker can generate the virtual character's motion in real time according to speech spoken by the user, improving the interactivity of the game.
Step S218, controlling the expression of the virtual character in the game scene when playing the audio data according to the mixed deformation parameter.
Generating an expression based on the blended warping parameters may be performed by steps B1-B2:
and step B1, acquiring the face model of the preset virtual character.
Mixed deformation parameters can be set on the face model of the virtual character, and after different mixed deformation parameters are set, the virtual character can display different expressions.
And step B2, adjusting the parameters of the face model according to the mixed deformation parameters to control the expression of the virtual character when playing the audio data in the game scene.
The mixed deformation parameters output by the neural network model are set at the corresponding position of the face model, so that the expression of the virtual character in the game scene when playing audio data can be generated.
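As a rough sketch of step B2, one frame of mixed deformation parameters could be pushed onto a face model as follows; `face_model.set_blendshape` is a hypothetical engine call, since real engines expose different APIs for driving blend shapes.

```python
import numpy as np

def apply_blendshapes(face_model, params):
    """Apply one frame of 51 mixed deformation parameters to a face model (illustrative)."""
    params = np.clip(np.asarray(params, dtype=float), 0.0, 100.0)  # values live in [0, 100]
    assert params.shape == (51,), "expected one value per blend-shape dimension"
    for index, value in enumerate(params):
        face_model.set_blendshape(index, value)  # hypothetical engine-specific setter
```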
The method provided by the embodiment of the invention can generate blendshape values directly from the audio, saving the time spent on actor performance and data annotation. The blendshapes are generated within 50 milliseconds, which fully supports game producers in developing new real-time functions. Because the blendshapes are generated with a neural network, natural expressions can be produced by taking into account the preceding and following audio frames, the duration of the audio, and the contribution weight of the surrounding frames to the current phoneme. Emotion information can also be added conveniently, so that the virtual character performs differently for different emotions.
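Putting the sketches above together, generating and applying an expression for a single audio clip might look as follows. All names (`mfcc_features`, `BlendshapeUNet`, `apply_blendshapes`) refer to the illustrative definitions earlier in this description, and the model's `feat_dim` is assumed to match the feature dimension produced by `mfcc_features`.

```python
import librosa
import torch

def generate_expression(audio_path, emotion_label, model, face_model):
    """Illustrative end-to-end flow: audio + emotion label -> blend-shape frames -> face model."""
    # Steps S202-S214: 16 kHz mono audio -> speech features
    signal, sr = librosa.load(audio_path, sr=16000, mono=True)
    feats = mfcc_features(signal, sr=sr)                          # (feature_dim, T)

    # Step S216: the neural network model outputs mixed deformation parameters
    with torch.no_grad():
        pred = model(torch.tensor(feats, dtype=torch.float32).unsqueeze(0),
                     torch.tensor([emotion_label]))
    frames = pred.squeeze(0).T.numpy()                            # (T, 51)

    # Step S218: drive the face model frame by frame
    for frame in frames:
        apply_blendshapes(face_model, frame)
    return frames
```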
The embodiment of the invention also provides a virtual character expression control method, which is applied to a game client; the virtual character in the game of the game client is configured with a target expression, and the target expression is an expression generated by applying the virtual character expression generation method described above. That is to say, the expressions generated by the virtual character expression generation method provided in the above embodiment are taken as target expressions, and the expression of the virtual character is controlled according to the target expressions. It should be noted that the target expression is generated on the device of the virtual character maker, and the expression of the virtual character can then be controlled on the game client of the virtual character maker or of the player; that is, the virtual character can display the target expression on the game client of either the virtual character maker or the player.
Based on the above description, referring to the flowchart of a virtual character expression control method shown in fig. 4, the virtual character expression control method includes the following steps:
step S402, if the current game scene contains the virtual character, the audio data configured in the current game scene and the target expression corresponding to the virtual character are obtained.
The game client firstly determines whether the current game scene contains the virtual character, and can perform expression control on the virtual character only when the current game scene contains the virtual character. And if the current game scene contains the virtual character, acquiring audio data configured in the current game scene and a target expression corresponding to the virtual character. The audio data of the current game scene configuration can be obtained from the game client side or transmitted from the server. The target expression corresponding to the virtual character, that is, the expression corresponding to the audio data configured in the current game scene, may also be pre-stored in the game client or the server, and the game client may obtain the target expression corresponding to the audio data from the game client or the server.
Step S404, in the process of playing the audio data, the expression of the virtual character is synchronously controlled to be the target expression.
When the audio data is played in the current game scene, the game client synchronously controls the expression of the virtual character to be the target expression, thereby realizing control of the virtual character.
For example, the virtual character producer implements the virtual character expression generation method on the terminal device of the producer, obtains the target expression of the virtual character when playing audio data in a game scene, and uploads the mixed deformation parameter and the corresponding relation between the mixed deformation parameter and the audio data to the server. When a game client wants to play certain audio data, the game client firstly judges whether the current game scene playing the audio data contains a virtual character, if so, the game client acquires a mixed deformation parameter and a corresponding relation between the mixed deformation parameter and the audio data from a server, and determines the mixed deformation parameter corresponding to the audio data needing to be played according to the corresponding relation. And then, the game client synchronously controls the expression of the virtual character to be the target expression in the process of playing the audio data.
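A client-side sketch of steps S402-S404 is given below; `scene`, `client`, and their methods (`contains`, `get_scene_audio`, `get_target_expression`, `play_audio`) are hypothetical helpers, `apply_blendshapes` refers to the sketch above, and driving one blend-shape frame every 10 ms is an assumption rather than a value from the patent.

```python
import time

FRAME_HOP_SECONDS = 0.010  # assumed spacing between consecutive blend-shape frames

def play_scene_with_expression(scene, character, client):
    """Sketch of steps S402-S404: fetch audio and target expression, then play them in sync."""
    if not scene.contains(character):                 # only control characters in the scene
        return
    audio = client.get_scene_audio(scene)             # from the client cache or the server
    frames = client.get_target_expression(scene, character)  # sequence of (51,) frames

    client.play_audio(audio)                          # start playback (assumed non-blocking)
    start = time.monotonic()
    for i, frame in enumerate(frames):                # S404: keep the face in sync with audio
        apply_blendshapes(character.face_model, frame)
        # sleep until the next frame boundary so the expression stays aligned with playback
        time.sleep(max(0.0, start + (i + 1) * FRAME_HOP_SECONDS - time.monotonic()))
```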
In the method for controlling the expression of the virtual character provided by the embodiment of the invention, the game client acquires the audio data to be played in the current game scene and the target expression corresponding to the virtual character, and the expression of the virtual character is synchronously controlled to be the target expression in the process of playing the audio data by the game client, so that the expression control of the virtual character can be realized, and the expression displayed by the virtual character in the expression control process is very natural.
It should be noted that the above method embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
Corresponding to the above method embodiment, an embodiment of the present invention provides a virtual character expression generating device, as shown in fig. 5, a schematic structural diagram of the virtual character expression generating device, where the virtual character expression generating device includes:
the scene obtaining module 51 is configured to determine the audio data and emotion labels corresponding to the audio data, where the emotion labels are used to represent emotion of the virtual character when the audio data is played in the game scene;
a feature extraction module 52, configured to extract a voice feature of the audio data;
the parameter output module 53 is configured to input the voice features and the emotion labels to the neural network model, and output mixed deformation parameters corresponding to the audio data; the neural network model is trained and completed based on the characteristics of the voice samples marked with emotion labels in advance;
and the expression generating module 54 is configured to control the expression of the virtual character in the game scene when the audio data is played according to the mixed deformation parameter.
The virtual character expression generating device provided by the embodiment of the invention obtains audio data and the emotion label corresponding to the audio data; the voice features extracted from the audio data and the emotion label are input into a neural network model, mixed deformation parameters are output, and the expression of the virtual character is generated based on the mixed deformation parameters. Because the mixed deformation parameters are output by the neural network model, no video needs to be produced with professional actors and expensive recording equipment, which saves the time and money cost of generating virtual character expressions; moreover, the mixed deformation parameters output by the neural network take the influence of the emotion label into account, so the generated expression is more natural.
In some embodiments, the scene acquiring module is configured to acquire a game scene including a virtual character; the game scene is also configured with at least one audio data; and each audio data is configured with an emotion label corresponding to the virtual character.
In some embodiments, the speech features include mel-frequency cepstral coefficient features; the feature extraction module is configured to perform high-pass filtering on the audio data to obtain an audio frame optimization sequence; sampling the audio frame optimization sequence to obtain a plurality of target frame signals; wherein the duration of the sampling is greater than the interval time of the sampling; attenuating two ends of each target frame signal to obtain an optimized frame signal of each target frame; performing fast Fourier transform on each optimized frame signal to obtain a frequency domain signal corresponding to each optimized frame signal; inputting each frequency domain signal into a preset triangular filter bank, and outputting logarithmic energy corresponding to each frequency domain signal; and performing discrete cosine transform on each logarithmic energy to obtain a Mel cepstrum coefficient characteristic corresponding to the audio data.
In some embodiments, the feature extraction module is configured to convert the audio data into 16kHz monaural audio data.
In some embodiments, the feature extraction module is configured to add preset average energy information to the mel-frequency cepstrum coefficient features.
In some embodiments, the above apparatus further comprises: the neural network training module is used for training the neural network model based on a preset sample set; the sample set comprises a plurality of training voice features, and each training voice feature is labeled with an emotion label and a standard mixed deformation parameter corresponding to the training voice feature.
In some embodiments, the neural network training module is configured to determine a current training speech feature from the sample set; inputting the current training voice features and emotion labels labeled by the current training voice features into a neural network model, and outputting training mixed deformation parameters of the current training voice features; calculating a loss value of the training mixed deformation parameter according to the standard mixed deformation parameter and a preset loss function; and adjusting parameters of the neural network model according to the loss value, determining the next training voice characteristic from the sample set to train the neural network model until the loss value is converged, and obtaining the trained neural network model.
In some embodiments, the expression generation module is configured to obtain a preset face model of the virtual character; and adjusting parameters of the face model according to the mixed deformation parameters so as to control the expression of the virtual character when the audio data is played in the game scene.
Corresponding to the method embodiment, the embodiment of the invention provides a virtual character expression control device, which is applied to a game client, wherein a virtual character in a game of the game client is configured with a target expression, and the target expression is an expression generated by applying the method; fig. 6 is a schematic structural diagram of a virtual character expression control device, which includes:
the configuration acquisition module 61 is configured to acquire audio data configured in the current game scene and a target expression corresponding to the virtual character if the current game scene contains the virtual character;
and the expression control module 62 is configured to synchronously control the expression of the virtual character to be the target expression in the process of playing the audio data.
According to the virtual character expression control device provided by the embodiment of the invention, the game client side obtains the audio data to be played in the current game scene and the target expression corresponding to the virtual character, and the expression of the virtual character is synchronously controlled to be the target expression in the process of playing the audio data by the game client side, so that the expression control of the virtual character can be realized, and the expression displayed by the virtual character in the expression control process is very natural.
The embodiment of the invention also provides terminal equipment for operating the virtual character expression generation method or the virtual character expression control method; referring to fig. 7, a schematic structural diagram of a terminal device includes a memory 100 and a processor 101, where the memory 100 is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor 101 to implement the virtual character expression generation method or the virtual character expression control method.
Further, the terminal device shown in fig. 7 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
The Memory 100 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used. The bus 102 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 7, but this does not indicate only one bus or one type of bus.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 101. The Processor 101 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 100, and the processor 101 reads the information in the memory 100, and completes the steps of the method of the foregoing embodiment in combination with the hardware thereof.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to implement the virtual character expression generation method or the virtual character expression control method, which can be referred to as method embodiments for specific implementation and will not be described herein again.
The method for generating virtual character expressions or the method and apparatus for controlling virtual character expressions provided in the embodiments of the present invention, and the computer program product of the terminal device include a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the method in the foregoing method embodiments, and specific implementations may refer to the method embodiments and will not be described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and/or the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the description of the present invention, it should be noted that terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience and simplicity of description; they do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed by the present invention, any person familiar with the technical field may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention and shall all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. A virtual character expression generation method is characterized by comprising the following steps:
determining audio data and an emotion label corresponding to the audio data, wherein the emotion label is used for representing the emotion of a virtual character when the audio data is played in a game scene;
extracting speech features of the audio data;
inputting the speech features and the emotion label into a neural network model, and outputting blend shape parameters corresponding to the audio data; wherein the neural network model is trained in advance based on features of speech samples labeled with emotion labels;
and controlling, according to the blend shape parameters, the expression of the virtual character in the game scene while the audio data is played.
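For illustration only, the following Python sketch shows one way the flow of claim 1 could be realized: MFCC speech features are extracted, a one-hot emotion label is appended to every frame, and a pre-trained neural network maps the result to per-frame blend shape parameters. The emotion vocabulary, the use of librosa and PyTorch, and the 13-coefficient MFCC setting are assumptions of this sketch and do not appear in the disclosure.

import numpy as np
import librosa
import torch

EMOTIONS = ["neutral", "happy", "angry", "sad"]  # assumed label vocabulary, not from the disclosure

def generate_blendshapes(model: torch.nn.Module, audio_path: str, emotion: str) -> np.ndarray:
    """Map audio plus an emotion label to per-frame blend shape parameters (claim 1, sketch)."""
    # Determine the audio data and extract speech features (MFCCs here).
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T            # (frames, 13)
    # Append a one-hot emotion label to every feature frame.
    emo = np.zeros((mfcc.shape[0], len(EMOTIONS)), dtype=np.float32)
    emo[:, EMOTIONS.index(emotion)] = 1.0
    x = torch.from_numpy(np.hstack([mfcc, emo]).astype(np.float32))
    # The pre-trained network outputs blend shape parameters for each frame.
    with torch.no_grad():
        return model(x).numpy()                                      # (frames, n_blendshapes)

The returned array can then drive the face model frame by frame while the audio plays, in the manner of claim 8.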
2. The method of claim 1, wherein the step of determining the audio data and the emotion label corresponding to the audio data comprises:
acquiring a game scene containing a virtual character, wherein the game scene is further configured with at least one piece of audio data, and each piece of audio data is configured with an emotion label corresponding to the virtual character.
3. The method of claim 1, wherein the speech features comprise Mel-frequency cepstral coefficient (MFCC) features;
the step of extracting the speech features of the audio data comprises:
performing high-pass filtering on the audio data to obtain an optimized audio frame sequence;
sampling the optimized audio frame sequence to obtain a plurality of target frame signals, wherein the sampling duration is greater than the sampling interval, so that adjacent target frame signals overlap;
attenuating both ends of each target frame signal (windowing) to obtain an optimized frame signal for each target frame;
performing fast Fourier transform on each optimized frame signal to obtain a frequency domain signal corresponding to each optimized frame signal;
inputting each frequency domain signal into a preset triangular filter bank, and outputting logarithmic energy corresponding to each frequency domain signal;
and performing discrete cosine transform on each logarithmic energy to obtain the Mel-frequency cepstral coefficient features corresponding to the audio data.
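For illustration only, the following NumPy sketch walks through the extraction steps of claim 3 in order: pre-emphasis (high-pass filtering), overlapping framing (frame duration longer than the frame shift), windowing of both frame ends, fast Fourier transform, a triangular Mel filter bank with logarithmic energy, and a discrete cosine transform. The 25 ms / 10 ms framing, the 0.97 pre-emphasis coefficient, and the filter and coefficient counts are common defaults assumed here and are not specified in the claims; the 16 kHz mono conversion of claim 4 and the average-energy term of claim 5 are assumed to be handled outside this function.

import numpy as np
from scipy.fftpack import dct

def extract_mfcc(signal, sr=16000, frame_len=0.025, frame_shift=0.010,
                 n_fft=512, n_mels=26, n_mfcc=13):
    # High-pass (pre-emphasis) filtering of the audio data.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Overlapping framing: the 25 ms frame duration exceeds the 10 ms frame shift.
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + (len(sig) - flen) // fshift
    frames = np.stack([sig[i * fshift: i * fshift + flen] for i in range(n_frames)])
    # Attenuate both ends of each target frame signal (Hamming window).
    frames = frames * np.hamming(flen)
    # Fast Fourier transform -> power spectrum of each optimized frame signal.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular Mel filter bank followed by logarithmic energy.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # Discrete cosine transform -> Mel-frequency cepstral coefficient features.
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]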
4. The method of claim 3, wherein prior to the step of high-pass filtering the audio data, the method further comprises:
converting the audio data into 16 kHz single-channel (mono) audio data.
5. The method of claim 3, wherein after the step of performing discrete cosine transform on each logarithmic energy to obtain the Mel-frequency cepstral coefficient features corresponding to the audio data, the method further comprises:
adding preset average energy information to the Mel-frequency cepstral coefficient features.
6. The method of claim 1, wherein the neural network model is trained by:
training the neural network model based on a preset sample set, wherein the sample set comprises a plurality of training speech features, and each training speech feature is labeled with an emotion label and standard blend shape parameters corresponding to the training speech feature.
7. The method of claim 6, wherein the step of training the neural network model based on a preset sample set comprises:
determining a current training speech feature from the sample set;
inputting the current training speech feature and the emotion label with which it is labeled into the neural network model, and outputting training blend shape parameters of the current training speech feature;
calculating a loss value of the training blend shape parameters according to the standard blend shape parameters and a preset loss function;
and adjusting the parameters of the neural network model according to the loss value, and determining the next training speech feature from the sample set to continue training the neural network model, until the loss value converges, so as to obtain the trained neural network model.
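For illustration only, the training procedure of claims 6 and 7 could take roughly the following form in PyTorch. The mean squared error loss, the Adam optimizer, and the fixed epoch count standing in for "until the loss value converges" are assumptions of this sketch; the claims only require a preset loss function and parameter updates driven by the loss value.

import torch
from torch import nn

def train_model(model: nn.Module, loader, epochs: int = 50, lr: float = 1e-3) -> nn.Module:
    """Supervised training on (speech feature, emotion label, standard blend shape) samples."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                     # stands in for the "preset loss function"
    for _ in range(epochs):                    # simplification of "until the loss value converges"
        for speech_feat, emotion_onehot, standard_blendshapes in loader:
            # Input the training speech feature with its emotion label; output training blend shape parameters.
            pred = model(torch.cat([speech_feat, emotion_onehot], dim=-1))
            # Loss of the training blend shape parameters against the standard blend shape parameters.
            loss = loss_fn(pred, standard_blendshapes)
            # Adjust the parameters of the neural network model according to the loss value.
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model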
8. The method of claim 1, wherein the step of controlling, according to the blend shape parameters, the expression of the virtual character in the game scene while the audio data is played comprises:
acquiring a preset face model of the virtual character;
and adjusting the parameters of the face model according to the blend shape parameters, so as to control the expression of the virtual character while the audio data is played in the game scene.
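For illustration only, the parameter adjustment of claim 8 can be realized with the standard linear blend shape formula V = V0 + sum_k w_k * dV_k, where V0 is the neutral face mesh, dV_k the per-blend-shape vertex offsets, and w_k the parameters produced by the network. The array layout below is an assumption; a game engine would typically expose an equivalent per-blend-shape weight API instead.

import numpy as np

def apply_blendshapes(base_vertices: np.ndarray, blendshape_deltas: np.ndarray,
                      weights: np.ndarray) -> np.ndarray:
    """Deform the preset face model of the virtual character for one frame.

    base_vertices     : (V, 3) neutral face mesh
    blendshape_deltas : (K, V, 3) vertex offsets of each blend shape from the neutral mesh
    weights           : (K,) blend shape parameters output by the network for this frame
    """
    # V = V0 + sum_k w_k * dV_k
    return base_vertices + np.tensordot(weights, blendshape_deltas, axes=1)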
9. A virtual character expression control method, characterized in that the method is applied to a game client, a virtual character in a game of the game client is configured with a target expression, and the target expression is generated by the virtual character expression generation method according to any one of claims 1 to 8; the method comprises:
if the current game scene contains the virtual character, acquiring audio data configured in the current game scene and a target expression corresponding to the virtual character;
and in the process of playing the audio data, synchronously controlling the expression of the virtual character to be the target expression.
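For illustration only, one simple way a game client could keep the target expression synchronized with audio playback, as required by claim 9, is to index the pre-generated blend shape frames by the current playback position. The 10 ms frame shift below is an assumption carried over from the MFCC sketch; the disclosure does not fix a frame rate.

import numpy as np

def blendshape_frame_at(blendshape_params: np.ndarray, playback_time_s: float,
                        frame_shift: float = 0.010) -> np.ndarray:
    """Pick the blend shape frame matching the current audio playback time.

    blendshape_params : (frames, K) parameters generated in advance, one row per feature frame
    playback_time_s   : current playback position reported by the client's audio engine
    """
    idx = min(int(playback_time_s / frame_shift), len(blendshape_params) - 1)
    return blendshape_params[idx]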
10. An apparatus for generating an expression of a virtual character, comprising:
a scene obtaining module, configured to determine audio data and an emotion label corresponding to the audio data, wherein the emotion label is used for representing the emotion of a virtual character when the audio data is played in a game scene;
a feature extraction module, configured to extract speech features of the audio data;
a parameter output module, configured to input the speech features and the emotion label into a neural network model and output blend shape parameters corresponding to the audio data, wherein the neural network model is trained in advance based on features of speech samples labeled with emotion labels;
and an expression generation module, configured to control, according to the blend shape parameters, the expression of the virtual character in the game scene while the audio data is played.
11. A virtual character expression control device, applied to a game client, wherein a virtual character in a game of the game client is configured with a target expression, and the target expression is generated by the virtual character expression generation method according to any one of claims 1 to 8; the device comprises:
a configuration acquisition module, configured to acquire, if the current game scene contains the virtual character, the audio data configured in the current game scene and the target expression corresponding to the virtual character;
and an expression control module, configured to synchronously control the expression of the virtual character to be the target expression while the audio data is played.
12. A terminal device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the steps of the virtual character expression generation method of any one of claims 1 to 8 or the virtual character expression control method of claim 9.
13. A computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to perform the steps of the virtual character expression generation method of any one of claims 1 to 8 or the virtual character expression control method of claim 9.
CN202010283348.1A 2020-04-10 2020-04-10 Virtual character expression generation method, control method, device and terminal equipment Pending CN111489424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010283348.1A CN111489424A (en) 2020-04-10 2020-04-10 Virtual character expression generation method, control method, device and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010283348.1A CN111489424A (en) 2020-04-10 2020-04-10 Virtual character expression generation method, control method, device and terminal equipment

Publications (1)

Publication Number Publication Date
CN111489424A true CN111489424A (en) 2020-08-04

Family

ID=71811909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010283348.1A Pending CN111489424A (en) 2020-04-10 2020-04-10 Virtual character expression generation method, control method, device and terminal equipment

Country Status (1)

Country Link
CN (1) CN111489424A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110874557A (en) * 2018-09-03 2020-03-10 阿里巴巴集团控股有限公司 Video generation method and device for voice-driven virtual human face
CN110070879A (en) * 2019-05-13 2019-07-30 吴小军 A method of intelligent expression and phonoreception game are made based on change of voice technology
CN110446000A (en) * 2019-08-07 2019-11-12 三星电子(中国)研发中心 A kind of figural method and apparatus of generation dialogue
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446938A (en) * 2020-11-30 2021-03-05 重庆空间视创科技有限公司 Multi-mode-based virtual anchor system and method
CN112446938B (en) * 2020-11-30 2023-08-18 重庆空间视创科技有限公司 Multi-mode-based virtual anchor system and method
CN112560622A (en) * 2020-12-08 2021-03-26 中国联合网络通信集团有限公司 Virtual object motion control method and device and electronic equipment
CN112560622B (en) * 2020-12-08 2023-07-21 中国联合网络通信集团有限公司 Virtual object action control method and device and electronic equipment
CN112634684B (en) * 2020-12-11 2023-05-30 深圳市木愚科技有限公司 Intelligent teaching method and device
CN112634684A (en) * 2020-12-11 2021-04-09 深圳市木愚科技有限公司 Intelligent teaching method and device
CN112541959A (en) * 2020-12-21 2021-03-23 广州酷狗计算机科技有限公司 Virtual object display method, device, equipment and medium
CN112634466A (en) * 2020-12-25 2021-04-09 游艺星际(北京)科技有限公司 Expression display method, device, equipment and storage medium of virtual image model
CN112634466B (en) * 2020-12-25 2021-10-22 游艺星际(北京)科技有限公司 Expression display method, device, equipment and storage medium of virtual image model
CN112927712A (en) * 2021-01-25 2021-06-08 网易(杭州)网络有限公司 Video generation method and device and electronic equipment
CN112927712B (en) * 2021-01-25 2024-06-04 网易(杭州)网络有限公司 Video generation method and device and electronic equipment
CN112990283A (en) * 2021-03-03 2021-06-18 网易(杭州)网络有限公司 Image generation method and device and electronic equipment
CN113420177A (en) * 2021-06-30 2021-09-21 广州酷狗计算机科技有限公司 Audio data processing method and device, computer equipment and storage medium
CN113689532A (en) * 2021-08-05 2021-11-23 北京奇艺世纪科技有限公司 Method and device for reconstructing virtual role based on voice data
US11847726B2 (en) 2021-08-06 2023-12-19 Nanjing Silicon Intelligence Technology Co., Ltd. Method for outputting blend shape value, storage medium, and electronic device
WO2023011221A1 (en) * 2021-08-06 2023-02-09 南京硅基智能科技有限公司 Blend shape value output method, storage medium and electronic apparatus
CN113923462A (en) * 2021-09-10 2022-01-11 阿里巴巴达摩院(杭州)科技有限公司 Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
CN114222179A (en) * 2021-11-24 2022-03-22 清华大学 Virtual image video synthesis method and equipment
CN114222179B (en) * 2021-11-24 2022-08-30 清华大学 Virtual image video synthesis method and equipment
CN113963092A (en) * 2021-11-30 2022-01-21 网易(杭州)网络有限公司 Audio and video fitting correlation calculation method, device, medium and equipment
CN113963092B (en) * 2021-11-30 2024-05-03 网易(杭州)网络有限公司 Audio and video fitting associated computing method, device, medium and equipment
CN114202605A (en) * 2021-12-07 2022-03-18 北京百度网讯科技有限公司 3D video generation method, model training method, device, equipment and medium
CN114332315A (en) * 2021-12-07 2022-04-12 北京百度网讯科技有限公司 3D video generation method, model training method and device
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment
WO2023168990A1 (en) * 2022-03-10 2023-09-14 腾讯科技(深圳)有限公司 Performance recording method and apparatus in virtual scene, device, storage medium, and program product
CN116468826B (en) * 2023-06-16 2023-10-27 北京百度网讯科技有限公司 Training method of expression generation model, and method and device for expression generation
CN116468826A (en) * 2023-06-16 2023-07-21 北京百度网讯科技有限公司 Training method of expression generation model, and method and device for expression generation
CN116843798A (en) * 2023-07-03 2023-10-03 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device
CN116843798B (en) * 2023-07-03 2024-07-05 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device

Similar Documents

Publication Publication Date Title
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
US11159597B2 (en) Systems and methods for artificial dubbing
WO2020007185A1 (en) Image processing method and apparatus, storage medium and computer device
US20200166670A1 (en) Personalizing weather forecast
Llorach et al. Web-based live speech-driven lip-sync
CN112420014A (en) Virtual face construction method and device, computer equipment and computer readable medium
CN112750462B (en) Audio processing method, device and equipment
CN112185363B (en) Audio processing method and device
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
CN113077537A (en) Video generation method, storage medium and equipment
CN116564269A (en) Voice data processing method and device, electronic equipment and readable storage medium
CN116912375A (en) Facial animation generation method and device, electronic equipment and storage medium
WO2023116243A1 (en) Data conversion method and computer storage medium
CN112885318A (en) Multimedia data generation method and device, electronic equipment and computer storage medium
CN116246328A (en) Face data generation method, device, computer equipment and storage medium
CN116959464A (en) Training method of audio generation network, audio generation method and device
CN114120943B (en) Virtual concert processing method, device, equipment and storage medium
CN112235183B (en) Communication message processing method and device and instant communication client
CN112712820A (en) Tone classification method, device, equipment and medium
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
CN112750423B (en) Personalized speech synthesis model construction method, device and system and electronic equipment
CN117115318B (en) Method and device for synthesizing mouth-shaped animation and electronic equipment
CN112599114B (en) Voice recognition method and device
US20210335364A1 (en) Computer program, server, terminal, and speech signal processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200804