CN117935807A - Method, device, equipment and storage medium for driving mouth shape of digital person

Method, device, equipment and storage medium for driving mouth shape of digital person

Info

Publication number
CN117935807A
Authority
CN
China
Prior art keywords
mouth shape
determining
emotion
target
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311635344.5A
Other languages
Chinese (zh)
Inventor
吴俊蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202311635344.5A
Publication of CN117935807A
Legal status: Pending

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for driving the mouth shape of a digital person. The method comprises the following steps: acquiring target text data and determining an emotion feature vector corresponding to the target text data; determining target voice data corresponding to the target text data, and determining a voice feature vector corresponding to the target voice data; and determining a mouth shape driving parameter of the digital person based on the emotion feature vector and the voice feature vector, and driving the mouth shape of the digital person to change according to the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter. Mouth shape driving parameters synchronized with the voice are obtained in real time to drive the mouth shape of the digital person to change, so that the digital person can smoothly make the correct mouth shape when broadcasting voice in real time, improving the user experience.

Description

Method, device, equipment and storage medium for driving mouth shape of digital person
Technical Field
The present invention relates to the field of virtual animation technologies, and in particular, to a method, an apparatus, a device, and a storage medium for driving a digital person's mouth shape.
Background
A virtual digital person refers to a virtual character with a digitized appearance that exists on a display device, unlike a robot with a physical entity. A virtual digital person has the appearance of a person, generally has the ability to express language, facial expressions and limb actions, and can interact with people through input and output devices (such as a mouse, a keyboard and the like). With the rapid development of deep learning and neural networks, virtual digital persons can obtain interactive capability through artificial intelligence technology, and have been widely applied in fields such as virtual anchoring, service assistants and media performance.
Adding a virtual digital person into an office system as a question-and-answer assistant can optimize the user experience while completing self-help question-and-answer services efficiently and quickly. According to the text question input by the user, the virtual digital person can adopt different limb actions, expressions and mouth shapes to broadcast the answer to the question in real time. In the process of real-time broadcasting by a virtual digital person, synchronization between mouth shape and pronunciation is one of the most critical problems. Unlike virtual live broadcasting, when the virtual digital person broadcasts in real time in the question-and-answer interaction of a specific office system, there is no camera-based face capture input as an aid, and all mouth shapes and actions need to be matched and controlled according to the input text or voice.
Existing methods mainly handle the mouth shapes of virtual digital persons through the following solutions:
1. For a realistic (real-person style) virtual digital person, whose image is required to be very close to a real person, a large number of video frame materials are recorded by having a real person perform, and the materials are then made into a data set for training a model, so as to synthesize video frame animation with synchronized voice and mouth shapes. This method consumes a great deal of time and money to collect the performance materials of the real person, and in order to make the image of the digital person more vivid, it also places higher requirements on equipment performance.
2. For a cartoon-style virtual digital person, since the appearance is more abstract and cartoon-like, the requirements on mouth shapes are generally lower, and there are mainly two schemes: first, the mouth shape is fixed and irrelevant to pronunciation, and cannot change with the voice, so the digital person appears quite mechanical; second, a corresponding mouth shape label is added to the voice by manual labeling in advance to realize mouth shape switching, which requires manually labeling fixed template voice in advance, is not suitable for flexible scenarios and cannot be used for real-time broadcasting.
Disclosure of Invention
The invention provides a mouth shape driving method, device, equipment and storage medium for a digital person, which are used for solving the problems that mouth shape data acquisition for a traditional virtual digital person is time-consuming and labor-intensive, and that the correct mouth shape cannot be made smoothly when voice is broadcast in real time.
According to an aspect of the present invention, there is provided a mouth shape driving method of a digital person, the method comprising:
Acquiring target text data and determining emotion feature vectors corresponding to the target text data;
Determining target voice data corresponding to the target text data, and determining a voice feature vector corresponding to the target voice data;
And determining a mouth shape driving parameter of the digital person based on the emotion feature vector and the voice feature vector, and driving the mouth shape of the digital person to change according to the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter.
According to another aspect of the present invention, there is provided a mouth shape driving device for a digital person, the device comprising:
The first feature vector determining module is used for acquiring target text data and determining emotion feature vectors corresponding to the target text data;
the second feature vector determining module is used for determining target voice data corresponding to the target text data and determining voice feature vectors corresponding to the target voice data;
And the mouth shape driving module is used for determining a mouth shape driving parameter of the digital person based on the emotion feature vector and the voice feature vector and driving the mouth shape of the digital person to change based on the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter.
According to another aspect of the present invention, there is provided an electronic apparatus including:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of mouth shape driving for a digital person according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the method for mouth shape driving of a digital person according to any one of the embodiments of the present invention when executed.
According to the technical scheme of the embodiment of the invention, target text data is obtained and an emotion feature vector corresponding to the target text data is determined, so that the emotion of the digital person when broadcasting the target text data in real time can be determined; then, target voice data corresponding to the target text data is determined, and a voice feature vector corresponding to the target voice data is determined, so that the correspondence between the target voice data and the voice feature vector can be accurately established; finally, a mouth shape driving parameter of the digital person is determined based on the emotion feature vector and the voice feature vector, and the mouth shape of the digital person is driven to change based on the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter. This solves the problems that virtual digital person mouth shape data acquisition is time-consuming and labor-intensive and that the correct mouth shape cannot be made smoothly when voice is broadcast in real time: mouth shape driving parameters synchronized with the voice are obtained in real time to drive the mouth shape of the digital person to change, so that the digital person can smoothly make the correct mouth shape when broadcasting voice in real time, improving the user experience.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for driving a mouth shape of a digital person according to a first embodiment of the present invention;
Fig. 2a is a flowchart of a method for driving a mouth shape of a digital person according to a second embodiment of the present invention;
FIG. 2b is a schematic diagram of a closed mouth shape of the digital person in an alternative example of the digital human mouth shape driving method provided in accordance with a second embodiment of the present invention;
FIG. 2c is a schematic diagram of an open mouth shape of the digital person in an alternative example of the digital human mouth shape driving method provided in accordance with a second embodiment of the present invention;
fig. 3 is a schematic structural view of a digital human mouth shape driving device according to a third embodiment of the present invention;
fig. 4 is a schematic structural view of an electronic device implementing a method of driving a mouth shape of a digital person according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without making any inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a method for driving a digital person's mouth shape according to an embodiment of the present invention, where the method may be performed by a mouth shape driving device of a digital person, the mouth shape driving device of the digital person may be implemented in hardware and/or software, and the mouth shape driving device of the digital person may be configured in an electronic device. As shown in fig. 1, the method includes:
S110, acquiring target text data, and determining emotion feature vectors corresponding to the target text data.
The target text data may be understood as text data to be broadcasted by a digital person, and the target text data may include target answer text data, target prompt text data and the like. Emotion feature vectors can be understood as emotion feature values.
Specifically, target answer text data corresponding to a question input by a user in a preset mode is obtained, and the emotion feature value corresponding to the target answer text data is acquired. The preset mode includes: voice input, text input, clicking a corresponding question button, and the like.
Optionally, the determining, based on the target text data, an emotion feature vector corresponding to the target text data includes: performing word segmentation processing on the target text data to obtain a plurality of text words corresponding to the target text data; determining, for each text word, the emotion matching probability of the text word relative to each preset emotion label, and determining a target emotion label corresponding to the target text data based on the emotion matching probabilities corresponding to the plurality of text words; and performing one-hot encoding on the target emotion label to obtain the emotion feature vector corresponding to the target text data.
A preset emotion label can be understood as a preset emotion attribute label, and the preset emotion labels can include various emotion attribute labels such as happy, low, calm and confused. The target emotion label can be understood as the emotion label with the highest matching degree with the target text data.
Specifically, word segmentation processing is performed on the target text to obtain a plurality of text words, and the emotion matching probability of each text word relative to each preset emotion label is determined. For example, when the text word is "hello", the emotion matching probabilities relative to the preset emotion labels are "happy 80%", "low 0%", "calm 15%" and "confused 5%", respectively. Based on the emotion matching probabilities corresponding to the plurality of text words, the target emotion label corresponding to the whole target text data can be determined. One-hot encoding is then performed on the target emotion label to obtain the emotion feature value corresponding to the target text data.
Optionally, the determining, based on emotion matching probabilities corresponding to the text word, a target emotion tag corresponding to the target text data includes: for each preset emotion label, determining comprehensive emotion matching probability corresponding to each preset emotion label based on the emotion matching probabilities corresponding to a plurality of text segmentation words, and taking the preset emotion label with the highest comprehensive emotion matching probability as a target emotion label.
The comprehensive emotion matching probability can be understood as the sum of emotion matching probabilities of a plurality of text word segments corresponding to a single preset emotion label.
Specifically, for each preset emotion tag, determining a comprehensive emotion matching probability corresponding to each preset emotion tag based on the sum of emotion matching probabilities corresponding to a plurality of text segmentation words. And selecting a preset emotion label with highest comprehensive matching probability as a target emotion label matched with the target text data.
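The following is a minimal sketch of the aggregation step described above: per-word emotion matching probabilities are summed into a comprehensive matching probability per preset emotion label, the label with the highest total is taken as the target emotion label, and it is one-hot encoded. The label names, dictionary layout and numeric values are illustrative assumptions, not details specified in the text.

```python
# Minimal sketch: sum per-word emotion matching probabilities into a comprehensive
# probability per preset emotion label, take the argmax as the target emotion label,
# then one-hot encode it. Labels and numbers below are illustrative.
EMOTION_LABELS = ["happy", "low", "calm", "confused"]  # preset emotion labels (assumed set)

def select_target_emotion(word_probs: list[dict[str, float]]) -> tuple[str, list[float]]:
    # word_probs: one dict per text word, mapping each preset label to a matching probability
    totals = {label: 0.0 for label in EMOTION_LABELS}
    for probs in word_probs:
        for label in EMOTION_LABELS:
            totals[label] += probs.get(label, 0.0)       # comprehensive emotion matching probability
    target = max(totals, key=totals.get)                 # label with the highest total
    one_hot = [1.0 if label == target else 0.0 for label in EMOTION_LABELS]
    return target, one_hot

# Example using the probabilities for the word "hello" given in the text above.
target, emotion_vector = select_target_emotion(
    [{"happy": 0.80, "low": 0.00, "calm": 0.15, "confused": 0.05}]
)
print(target, emotion_vector)   # happy [1.0, 0.0, 0.0, 0.0]
```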
In the embodiment of the invention, word segmentation processing is performed on the target text data and the emotion matching probability of each text word is determined separately. The comprehensive emotion matching probability is then determined based on these emotion matching probabilities, and the target emotion label matching the target text data is determined based on the comprehensive emotion matching probability, so that the emotion label corresponding to the target text data can be determined accurately.
Optionally, a question-reply corpus of the system can be established, and the question-reply corpus can include preset emotion labels such as happy, low, calm and confused. Each preset emotion label may correspond to at least one piece of text data. The question-reply corpus is preprocessed for a Word2Vec (Word to Vector) neural network model and an LSTM (Long Short-Term Memory) recurrent neural network model. The preprocessing includes at least one of word segmentation, stop-word removal, encoding of ground-truth labels, padding or truncating variable-length sentences into fixed-length sequences, and data normalization. The text data is converted into word vectors by the Word2Vec model, the word vectors are embedded and passed to the hidden layer of the neural network model, a Softmax activation function is used as the classification layer, and the target emotion label corresponding to the word vectors is output.
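A rough sketch of the text-emotion classifier described above (Word2Vec word vectors fed into an LSTM with a Softmax classification layer) is shown below. The toy corpus, vocabulary handling, layer sizes and the use of gensim and tf.keras are assumptions for illustration; the original text does not fix these details.

```python
# Sketch, assuming gensim for Word2Vec and tf.keras for the LSTM classifier.
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec

corpus = [["hello", "please", "ask"], ["very", "sorry"]]        # segmented question-reply corpus (toy data)
w2v = Word2Vec(corpus, vector_size=64, min_count=1)             # Word2Vec word vectors

vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}   # index 0 reserved for padding
embedding = np.zeros((len(vocab) + 1, 64))
for w, i in vocab.items():
    embedding[i] = w2v.wv[w]

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        len(vocab) + 1, 64,
        embeddings_initializer=tf.keras.initializers.Constant(embedding),
        mask_zero=True,                                         # fixed-length padded sequences are masked
    ),
    tf.keras.layers.LSTM(128),                                  # LSTM hidden layer
    tf.keras.layers.Dense(4, activation="softmax"),             # 4 preset emotion labels
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```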
S120, determining target voice data corresponding to the target text data, and determining voice feature vectors corresponding to the target voice data.
The target voice data can be understood as voice data to be broadcasted by a digital person. A speech feature vector may be understood as a speech feature value.
Optionally, the determining the target voice data corresponding to the target text data includes:
Converting the target text data into target voice data based on a preset text-to-speech technology.
Specifically, the target text data can be converted in real time into target voice data output as natural speech through a preset text-to-speech technology. It can be appreciated that the preset text-to-speech technology in this embodiment may be chosen empirically, for example TTS (Text to Speech) technology; this embodiment is not limited thereto. The TTS technology performs text analysis on the target text data, converts it into a phoneme sequence, and marks phoneme information such as the start and end time and frequency change of each phoneme. Speech synthesis is then performed to convert the processed phoneme sequence into a speech waveform. Finally, prosody processing is carried out so that the synthesized speech is more natural and coherent.
Optionally, the determining a speech feature vector corresponding to the target speech data includes:
Determining a phoneme and a phoneme start-stop time corresponding to each character in the target voice data; constructing a phoneme sequence corresponding to the target voice data based on the phonemes corresponding to each character and the phoneme start-stop time; dividing the phoneme sequence into a plurality of the voice feature vectors based on a preset phoneme time threshold.
A phoneme can be understood as the smallest speech unit divided according to the natural attributes of speech, and phonemes fall into two major classes, vowels and consonants. For example, Chinese has 32 phonemes: 10 vowel phonemes and 22 consonant phonemes.
The phoneme and the phoneme start and end time corresponding to each character are determined, and the phonemes and the phoneme start and end times corresponding to the characters are combined and arranged to obtain the phoneme sequence corresponding to the target voice data. The phoneme sequence is then divided based on a preset phoneme time threshold to obtain a plurality of speech feature vectors. The preset phoneme time threshold may be set empirically, for example 25 ms, corresponding to a real-time broadcasting speed of 40 frames per second; the preset phoneme time threshold is taken as the minimum time unit, and one minimum time unit corresponds to one speech feature vector. It should be noted that if a certain phoneme needs two minimum time units to complete its pronunciation, the phoneme corresponds to two speech feature vectors.
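A minimal sketch of the frame-slicing step above follows: a timed phoneme sequence is cut into 25 ms frames, so a phoneme spanning several minimum time units yields several frame-level entries. The tuple layout and the helper name are illustrative assumptions.

```python
# Minimal sketch: slice a timed phoneme sequence into per-frame entries using the
# 25 ms minimum time unit mentioned above (40 frames per second). Data layout is illustrative.
FRAME_MS = 25  # preset phoneme time threshold / minimum time unit

def phoneme_frames(phoneme_seq: list[tuple[str, int, int]]) -> list[str]:
    """phoneme_seq: (phoneme, start_ms, end_ms) entries; returns one phoneme per 25 ms frame."""
    frames = []
    for phoneme, start_ms, end_ms in phoneme_seq:
        n_frames = max(1, round((end_ms - start_ms) / FRAME_MS))  # a long phoneme spans several frames
        frames.extend([phoneme] * n_frames)
    return frames

# Example: the vowel lasts 50 ms, so it occupies two 25 ms frames (two speech feature vectors).
print(phoneme_frames([("n", 0, 25), ("i", 25, 75)]))   # ['n', 'i', 'i']
```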
S130, determining a mouth shape driving parameter of the digital person based on the emotion feature vector and the voice feature vector, and driving the mouth shape of the digital person to change based on the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter.
A digital person can be understood as a three-dimensional virtual digital person, namely a virtual character with a digitized appearance, which has the form of a person, has the ability to express language, facial expressions and limb actions, and can interact with people through various control modes. The mouth shape driving parameter can be understood as the mouth shape action parameters required to drive the lower half of the digital person's face to make the mouth shape. Each mouth shape driving action parameter may take any value in the interval 0-1. The mouth shape driving action parameters may include mouth opening, mouth corner rising, mouth corner pulling, mouth corner gathering and the like. The weight corresponding to a mouth shape driving action parameter can be assigned as a number.
Specifically, the mouth shape driving parameter of the digital person is determined comprehensively, based on the emotion feature vector and the voice feature vector, from a pre-built mouth shape database of the digital person. The mouth shape database comprises mouth shape images of the digital person; the mouth shape images can be obtained by having the digital person pose preset expressions under different emotions and different phonemes, and then rendering the lower half of the face into a two-dimensional image. Each image in the mouth shape data set has a unique image label and preset mouth shape driving parameters. The mouth shape driving parameter is a combination of the products of a plurality of mouth shape driving action parameters and their corresponding weights. For example: the mouth shape driving action parameters include mouth opening 0.5, mouth corner rising 0.8 and mouth corner pulling 0.2, and the weights of these mouth shape driving action parameters are mouth opening weight 0.2, mouth corner rising weight 0.6 and mouth corner pulling weight 0.1, respectively; the mouth shape driving parameter of the digital person is then calculated as the combination of the values mouth opening 0.5 x 0.2, mouth corner rising 0.8 x 0.6 and mouth corner pulling 0.2 x 0.1.
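The weighted combination above is simple enough to show directly; the sketch below reproduces the example values from the text. The dictionary layout and parameter key names are assumptions made for illustration.

```python
# Sketch of the weighted combination described above, using the example values from the
# text (mouth opening 0.5 x 0.2, mouth corner rising 0.8 x 0.6, mouth corner pulling 0.2 x 0.1).
action_params = {"mouth_open": 0.5, "corner_rise": 0.8, "corner_pull": 0.2}
weights       = {"mouth_open": 0.2, "corner_rise": 0.6, "corner_pull": 0.1}

mouth_drive_params = {name: value * weights[name] for name, value in action_params.items()}
print(mouth_drive_params)   # {'mouth_open': 0.1, 'corner_rise': 0.48, 'corner_pull': 0.02}
```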
It is worth noting that the magnitude of the mouth corner rising or drooping varies with emotion. Chinese pronunciation phonemes are roughly divided into 15 classes of mouth shapes (including the closed mouth shape of the initial state and the no-phoneme state) according to the similarity of the mouth shapes during pronunciation. Different emotions lead to different mouth shapes for the same word: when sad, the mouth corners are pulled relatively downward and the mouth opening is smaller when the word "good" is pronounced; when happy, the mouth corners are raised relatively and the mouth opening is larger.
In the embodiment of the invention, the mouth shape database of the virtual digital person is customized, and the database comprises the mouth shape images of the lower half face of the virtual digital person corresponding to different pronunciations under different emotions, and unique labels and mouth shape driving parameters corresponding to each image. The mouth shape driving parameters of the digital person can be rapidly and accurately determined, and the instantaneity and the consistency of the mouth shape change of the digital person are improved.
According to the technical scheme of this embodiment, target text data is obtained and an emotion feature vector corresponding to the target text data is determined, so that the emotion of the digital person when broadcasting the target text data in real time can be determined; then, target voice data corresponding to the target text data is determined, and a voice feature vector corresponding to the target voice data is determined, so that the correspondence between the target voice data and the voice feature vector can be accurately established; finally, a mouth shape driving parameter of the digital person is determined based on the emotion feature vector and the voice feature vector, and the mouth shape of the digital person is driven to change based on the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter. This solves the problems that virtual digital person mouth shape data acquisition is time-consuming and labor-intensive and that the correct mouth shape cannot be made smoothly when voice is broadcast in real time: mouth shape driving parameters synchronized with the voice are obtained in real time to drive the mouth shape of the digital person to change, so that the digital person can smoothly make the correct mouth shape when broadcasting voice in real time, improving the user experience.
Example two
Fig. 2a is a flowchart of a method for driving a mouth shape of a digital person according to a second embodiment of the present invention, where the embodiment is a further optimization of how to determine mouth shape driving parameters of the digital person based on the emotion feature vector and the voice feature vector in the above embodiments. Optionally, the determining the mouth shape driving parameter of the digital person based on the emotion feature vector and the voice feature vector includes: splicing the emotion feature vector and the voice feature vector to obtain an audio feature vector; determining the mouth shape label result based on the audio feature vector and a mouth shape label determining model, wherein the mouth shape label determining model is obtained by training a pre-established deep learning model based on a sample audio feature vector and an expected output mouth shape label result corresponding to the sample audio feature vector; and determining a mouth shape driving parameter corresponding to the target voice data based on the mouth shape label result, and taking the mouth shape driving parameter corresponding to the target voice data as the mouth shape driving parameter of the digital person.
As shown in fig. 2a, the method comprises:
s210, acquiring target text data, and determining emotion feature vectors corresponding to the target text data.
S220, determining target voice data corresponding to the target text data, and determining voice feature vectors corresponding to the target voice data.
And S230, splicing the emotion feature vector and the voice feature vector to obtain an audio feature vector.
Specifically, the emotion feature vector and the voice feature vector are spliced by appending the emotion feature vector to the voice feature vector, so as to obtain the audio feature vector corresponding to the target text. The emotion feature vector can be reused and spliced repeatedly.
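A minimal sketch of this splicing step is shown below: the (reused) emotion feature vector is appended to each frame-level speech feature vector to form the audio feature vector. The dimensions (32-dim speech features, 4 emotion labels) are illustrative assumptions.

```python
# Minimal sketch: concatenate the one-hot emotion feature vector onto every frame-level
# speech feature vector to obtain the audio feature vectors. Dimensions are illustrative.
import numpy as np

emotion_vec = np.array([1.0, 0.0, 0.0, 0.0])                 # one-hot emotion feature vector
speech_vecs = np.random.rand(120, 32)                        # 120 frames of 32-dim speech features (toy)
audio_vecs = np.concatenate(
    [speech_vecs, np.tile(emotion_vec, (len(speech_vecs), 1))], axis=1
)                                                            # resulting shape: (120, 36)
```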
S240, determining the mouth shape label result based on the audio feature vector and a mouth shape label determining model, wherein the mouth shape label determining model is obtained by training a pre-established deep learning model based on a sample audio feature vector and an expected output mouth shape label result corresponding to the sample audio feature vector.
The mouth shape label result can be understood as a mouth shape label sequence result.
Specifically, based on the mapping relationship between audio feature vectors and mouth shapes, the audio feature vector is input into the mouth shape label determining model, and the output mouth shape label sequence is trained using an encoder-decoder model, wherein the encoder is composed of two convolution layers with a 3×3 convolution kernel and is used to further extract audio features, the mouth shape label determining model is used as the decoder, and a Softmax activation function is used for classification. The mouth shape label sequence result is determined based on the model output.
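The following is a rough sketch of that encoder-decoder arrangement: two 3×3 convolution layers extract further audio features, and a sequence decoder with a Softmax layer emits one mouth shape label per frame. The decoder type (an LSTM here), the fixed frame count, the layer sizes and the use of tf.keras are assumptions for illustration; the text only fixes the two 3×3 convolution layers, the Softmax classification and the 15 mouth shape classes.

```python
# Sketch under the stated assumptions; not the patent's exact network.
import tensorflow as tf

NUM_MOUTH_SHAPES = 15   # 15 mouth shape classes for Chinese phonemes (per the text)

inputs = tf.keras.Input(shape=(120, 36, 1))                        # (frames, audio feature dim, channel)
x = tf.keras.layers.Conv2D(16, (3, 3), padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)   # encoder: two 3x3 conv layers
x = tf.keras.layers.Reshape((120, 36 * 32))(x)                     # back to a per-frame sequence
x = tf.keras.layers.LSTM(128, return_sequences=True)(x)            # decoder (assumed to be an LSTM)
outputs = tf.keras.layers.Dense(NUM_MOUTH_SHAPES, activation="softmax")(x)     # per-frame mouth shape label
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```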
Optionally, before the determining the mouth shape label result based on the audio feature vector and the mouth shape label determining model, the method further includes: acquiring a sample audio feature vector, and determining an expected output mouth shape label result corresponding to the sample audio feature vector; inputting the sample audio feature vector into a pre-established initial mouth shape label result determining model to obtain an initial mouth shape label result; and determining, based on a loss function, the model loss between the initial mouth shape label result and the expected output mouth shape label result, adjusting the initial mouth shape label result determining model based on the model loss until the loss function converges, and taking the adjusted initial mouth shape label result determining model as the target mouth shape label result determining model.
Specifically, a preset number of sample audio feature vectors are obtained, and the expected output mouth shape label result corresponding to each sample audio feature vector is determined. The sample audio feature vector is input into the pre-established initial mouth shape label determining model, and the initial mouth shape label result is determined based on the model output. The model loss between the initial mouth shape label result and the expected output mouth shape label result is calculated through the cross-entropy function. While the loss function has not converged, the initial mouth shape label determining model is iteratively adjusted based on the model loss; after the loss function converges, the adjusted model is taken as the target mouth shape label determining model.
Illustratively, the model loss between the initial mouth shape label result and the expected output mouth shape label result is calculated by the cross-entropy function as follows:

L = -∑_i O_i · log(S_i)

where L represents the cross-entropy model loss, S_i represents the initial mouth shape label result, and O_i represents the expected output mouth shape label result.
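A small numeric sketch of this loss follows; the one-hot expected label and the predicted distribution are illustrative values, not data from the text.

```python
# Numeric sketch of the cross-entropy loss above: O is the one-hot expected mouth shape
# label result and S is the model's predicted distribution over mouth shape classes.
import numpy as np

O = np.array([0.0, 1.0, 0.0])          # expected output mouth shape label (one-hot, toy example)
S = np.array([0.1, 0.8, 0.1])          # predicted mouth shape label distribution (toy example)
loss = -np.sum(O * np.log(S))          # L = -sum_i O_i * log(S_i)
print(loss)                            # ~0.223
```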
S250, determining a mouth shape driving parameter corresponding to the target voice data based on the mouth shape label result, and taking the mouth shape driving parameter corresponding to the target voice data as the mouth shape driving parameter of the digital person.
Specifically, after the mouth shape label result is obtained, the corresponding mouth shape driving parameters are looked up in the pre-established mouth shape database, and the found mouth shape driving parameters are taken as the mouth shape driving parameters corresponding to the target voice data. The three-dimensional digital human model is driven by the mouth shape driving parameters to obtain a mouth shape that changes in synchronization with the voice. If the mouth shape driving parameters corresponding to the mouth shape label result are not found, the background can be reminded to set them manually, and the manually set mouth shape driving parameters are updated into the mouth shape database for subsequent use.
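A minimal sketch of this lookup step is given below. The database contents and label names are illustrative, and the missing-label case is simplified to raising an error in place of the manual-setup reminder described above.

```python
# Minimal sketch: look up mouth shape driving parameters for each predicted mouth shape
# label in the pre-built mouth shape database; flag labels that are missing.
mouth_shape_db = {
    "open_a": {"mouth_open": 0.5, "corner_rise": 0.8},   # illustrative entries
    "closed": {"mouth_open": 0.0, "corner_rise": 0.1},
}

def lookup_drive_params(label_sequence: list[str]) -> list[dict[str, float]]:
    params = []
    for label in label_sequence:
        if label not in mouth_shape_db:
            # in the described scheme the backend is reminded to set this mouth shape manually
            # and the database is then updated; here we simply raise
            raise KeyError(f"mouth shape label '{label}' missing from database; manual setup needed")
        params.append(mouth_shape_db[label])
    return params

print(lookup_drive_params(["closed", "open_a"]))
```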
S260, driving the mouth shape of the digital person to change based on the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter.
According to the technical scheme of this embodiment, the emotion feature vector and the voice feature vector are spliced to obtain the audio feature vector; splicing the emotion feature vector and the voice feature vector improves the accuracy and efficiency of subsequently determining the corresponding mouth shape sequence. Then, the mouth shape label result is determined based on the audio feature vector and the mouth shape label determining model, wherein the mouth shape label determining model is obtained by training a pre-established deep learning model based on sample audio feature vectors and the expected output mouth shape label results corresponding to the sample audio feature vectors, so that the mouth shape label sequence result is obtained accurately. Finally, the mouth shape driving parameter corresponding to the target voice data is determined based on the mouth shape label result and taken as the mouth shape driving parameter of the digital person. The mouth shape change of the three-dimensional model can thus be driven accurately in the three-dimensional engine without time-consuming calculation, which differs from the commonly used approach of fusing images of the mouth and other facial parts to obtain the changed mouth shape, and saves calculation cost.
As an optional example of the embodiment of the present invention, the method for driving a mouth shape of a digital person of the present embodiment specifically includes the steps of:
Step 1, based on the original three-dimensional model of the virtual digital person, a mouth shape database corresponding to the digital person is pre-built. The mouth shape database comprises mouth shape images of the digital person; the mouth shape images can be obtained by having the digital person pose preset expressions under different emotions and different phonemes, and then rendering the lower half of the face into a two-dimensional image. Chinese pronunciation phonemes are roughly divided into 15 classes of mouth shapes (including the closed mouth shape of the initial state and the no-phoneme state) according to the similarity of the mouth shapes during pronunciation. FIG. 2b is a schematic diagram of a closed mouth shape of the digital person in an alternative example of the digital human mouth shape driving method provided in accordance with the second embodiment of the present invention; as shown in FIG. 2b, the digital person has a closed mouth shape in the initial state and the no-phoneme state. Different emotions result in different mouth shapes for the same word; for example, when the word "good" is pronounced in sadness, the mouth corners are pulled relatively downward and the mouth opening is relatively small. FIG. 2c is a schematic diagram of an open mouth shape of the digital person in an alternative example of the digital human mouth shape driving method provided in accordance with the second embodiment of the present invention; as shown in FIG. 2c, when the word "good" is pronounced in happiness, the mouth corners are raised relatively and the mouth opening is also large.
In order to make the virtual digital person more vivid and natural when broadcasting voice, this embodiment considers the influence of emotion on the mouth shape, and fine-tunes the mouth shapes under different emotions when building the mouth shape database. Each image in the mouth shape database has a unique label and preset mouth shape driving parameters, where the mouth shape driving parameters refer to the mouth shape driving action parameters required to drive the lower half of the face to make the corresponding mouth shape; the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter, and the mouth shape driving action parameters may include mouth opening, mouth corner rising, mouth corner pulling, mouth corner gathering and the like.
Step 2, in order to realize real-time voice broadcasting by the digital person, the target text data is first taken as input and then converted into target voice data for broadcasting using the TTS technology.
In order to realize the association and matching between the voice and the mouth shape with emotion, this embodiment establishes a question-reply corpus of the system, which can include preset emotion labels such as happy, low, calm and confused. Each preset emotion label may correspond to at least one piece of text data. The question-reply corpus is preprocessed for the Word2Vec and LSTM recurrent neural network models. The preprocessing includes at least one of word segmentation, stop-word removal, encoding of ground-truth labels, padding or truncating variable-length sentences into fixed-length sequences, and data normalization. The text data is converted into word vectors by the Word2Vec model, the word vectors are embedded and passed to the hidden layer of the neural network model, a Softmax activation function is used as the classification layer, and the target emotion label corresponding to the word vectors is output.
Step 3, converting the target text data in real time into target voice data output as natural speech through a preset text-to-speech technology. It can be appreciated that the preset text-to-speech technology in this embodiment may be chosen empirically, for example TTS (Text to Speech) technology; this embodiment is not limited thereto. The TTS technology performs text analysis on the target text data, converts it into a phoneme sequence, and marks phoneme information such as the start and end time and frequency change of each phoneme. Speech synthesis is then performed to convert the processed phoneme sequence into a speech waveform. Finally, prosody processing is carried out so that the synthesized speech is more natural and coherent.
The phoneme and the phoneme start and end time corresponding to each character are determined, and the phonemes and start and end times corresponding to the characters are combined and arranged to obtain the phoneme sequence corresponding to the target voice data. The phoneme sequence is then divided based on a preset phoneme time threshold to obtain a plurality of speech feature vectors. The preset phoneme time threshold may be set empirically, for example 25 ms, corresponding to a real-time broadcasting speed of 40 frames per second; the preset phoneme time threshold is taken as the minimum time unit, and one minimum time unit corresponds to one speech feature vector. It should be noted that if a certain phoneme needs two minimum time units to complete its pronunciation, the phoneme corresponds to two speech feature vectors. The emotion feature vector and the voice feature vector are spliced, by appending the emotion feature vector to the voice feature vector, to obtain the audio feature vector. The emotion feature vector can be reused and spliced repeatedly.
Step 4, based on the mapping relationship between audio feature vectors and mouth shapes, the audio feature vector is input into the mouth shape label determining model, and the output mouth shape label sequence is trained using an encoder-decoder model, wherein the encoder is composed of two convolution layers with a 3×3 convolution kernel and is used to further extract audio features, the mouth shape label determining model is used as the decoder, and a Softmax activation function is used for classification. The mouth shape label result is determined based on the model output.
The model loss is formulated by the cross-entropy function as follows:

L = -∑_i O_i · log(S_i)

where L represents the cross-entropy model loss, S_i represents the initial mouth shape label result, and O_i represents the expected output mouth shape label result.
Step 5, determining the mouth shape driving parameter corresponding to the target voice data based on the mouth shape label result, and taking it as the mouth shape driving parameter of the digital person. The three-dimensional digital human model is driven by the mouth shape driving parameters to finally obtain a three-dimensional mouth shape that changes in synchronization with the voice.
According to the technical scheme of this embodiment, a deep learning model is trained to automatically match a segment of audio with the corresponding mouth shape sequence, so that the three-dimensional digital person can smoothly make the correct mouth shape when broadcasting voice in real time, and the pronunciation of each word does not need to be labeled manually, which saves a great deal of labor cost. In order to make the mouth shape of the digital person more natural, this embodiment takes the influence of emotion on the mouth shape into account: when a text is input, a text emotion analysis module is used to obtain the emotion classification label of the text, the label is encoded into an emotion feature vector, the emotion feature vector is spliced with the speech feature vector obtained by encoding after the text is converted into speech, and the result is input into a speech-mouth shape association module to learn the mapping relationship between speech with emotion and mouth shapes. The mouth shape database comprises preset mouth shape driving parameters, so that the mouth shape change of the three-dimensional model can be driven accurately in the three-dimensional engine without time-consuming calculation, which differs from the approach commonly used in other methods of fusing images of the mouth and other facial parts to obtain the changed mouth shape, and saves calculation cost. The acquisition efficiency of the mouth shape driving parameters is improved, so that the digital person can smoothly make the correct mouth shape when broadcasting voice in real time, improving the user experience.
Example III
Fig. 3 is a schematic structural diagram of a digital human mouth shape driving device according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes: a first feature vector determination module 310, a second feature vector determination module 320, and a die drive module 330.
The first feature vector determining module 310 is configured to obtain target text data and determine an emotion feature vector corresponding to the target text data; the second feature vector determining module 320 is configured to determine target voice data corresponding to the target text data, and determine a voice feature vector corresponding to the target voice data; the mouth shape driving module 330 is configured to determine a mouth shape driving parameter of the digital person based on the emotion feature vector and the voice feature vector, and drive the mouth shape of the digital person to change based on the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter.
According to the technical scheme of this embodiment, target text data is obtained through the first feature vector determining module and an emotion feature vector corresponding to the target text data is determined, so that the emotion of the digital person when broadcasting the target text data in real time can be determined; then, target voice data corresponding to the target text data is determined, and a voice feature vector corresponding to the target voice data is determined, so that the correspondence between the target voice data and the voice feature vector can be accurately established; finally, a mouth shape driving parameter of the digital person is determined based on the emotion feature vector and the voice feature vector, and the mouth shape of the digital person is driven to change based on the mouth shape driving parameter, wherein the mouth shape driving parameter comprises a combination of the products of at least one mouth shape driving action parameter and the weight corresponding to each mouth shape driving action parameter. This solves the problems that virtual digital person mouth shape data acquisition is time-consuming and labor-intensive and that the correct mouth shape cannot be made smoothly when voice is broadcast in real time: mouth shape driving parameters synchronized with the voice are obtained in real time to drive the mouth shape of the digital person to change, so that the digital person can smoothly make the correct mouth shape when broadcasting voice in real time, improving the user experience.
Optionally, the first feature vector determining module includes:
the word segmentation unit is used for carrying out word segmentation processing on the target text data to obtain a plurality of text word segments corresponding to the target text data;
The first emotion tag determining unit is used for determining emotion matching probability of each text word relative to each preset emotion tag according to each text word, and determining a target emotion tag corresponding to the target text data based on the emotion matching probability corresponding to a plurality of text words;
And the first feature vector determining unit is used for performing one-hot encoding on the target emotion label to obtain the emotion feature vector corresponding to the target text data.
Optionally, the first emotion tag determining unit is specifically configured to:
For each preset emotion label, determining comprehensive emotion matching probability corresponding to each preset emotion label based on the emotion matching probabilities corresponding to a plurality of text segmentation words, and taking the preset emotion label with the highest comprehensive emotion matching probability as a target emotion label.
Optionally, the second feature vector determining module includes:
A phoneme data determining unit for determining a phoneme and a phoneme start-stop time corresponding to each character in the target voice data;
A phoneme sequence constructing unit for constructing a phoneme sequence corresponding to the target voice data based on the phonemes and the phoneme start-stop times corresponding to each of the characters;
And the second feature vector determining unit is used for dividing the phoneme sequence into a plurality of voice feature vectors based on a preset phoneme time threshold value.
Optionally, the die driving module includes:
the vector splicing module is used for splicing the emotion feature vector and the voice feature vector to obtain an audio feature vector;
the mouth shape tag determining module is used for determining a mouth shape tag result based on the audio feature vector and a mouth shape tag determining model, wherein the mouth shape tag determining model is obtained by training a pre-established deep learning model based on a sample audio feature vector and an expected output mouth shape tag result corresponding to the sample audio feature vector;
and the mouth shape driving parameter determining module is used for determining mouth shape driving parameters corresponding to the target voice data based on the mouth shape label result, and taking the mouth shape driving parameters corresponding to the target voice data as the mouth shape driving parameters of the digital person.
Optionally, the device further comprises a sample vector acquisition module, an initial mouth shape label result determination module and a model adjustment module.
The sample vector acquisition module is used for acquiring a sample audio feature vector and determining an expected output mouth shape label result corresponding to the sample audio feature vector;
the initial mouth shape label result determining module is used for inputting the sample audio feature vector into a pre-established initial mouth shape label result determining model to obtain the initial mouth shape label result;
the model adjustment module is used for determining, based on a loss function, the model loss between the initial mouth shape label result and the expected output mouth shape label result, adjusting the initial mouth shape label result determining model based on the model loss until the loss function converges, and taking the adjusted initial mouth shape label result determining model as the target mouth shape label result determining model.
Optionally, the second feature vector determining module is specifically configured to:
And converting the target text data into target voice data based on a preset text conversion voice technology.
The mouth shape driving device for the digital person provided by the embodiment of the invention can execute the mouth shape driving method for the digital person provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
Example IV
Fig. 4 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM12 and the RAM13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the mouth shape driving method of a digital person.
In some embodiments, the mouth shape driving method of a digital person may be implemented as a computer program tangibly embodied on a computer readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the mouth shape driving method of a digital person described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the mouth shape driving method of a digital person in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for driving a mouth shape of a digital person, comprising:
Acquiring target text data and determining emotion feature vectors corresponding to the target text data;
Determining target voice data corresponding to the target text data, and determining a voice feature vector corresponding to the target voice data;
and determining mouth shape driving parameters of the digital person based on the emotion feature vectors and the voice feature vectors, and driving the mouth shape of the digital person to change according to the mouth shape driving parameters, wherein the mouth shape driving parameters comprise a combination of products of at least one mouth shape driving action parameter and the weight corresponding to the at least one mouth shape driving action parameter.
2. The method of claim 1, wherein the determining an emotion feature vector corresponding to the target text data based on the target text data comprises:
Performing word segmentation processing on the target text data to obtain a plurality of text word segments corresponding to the target text data;
Determining, for each text word segment, an emotion matching probability of the text word segment relative to each preset emotion label, and determining a target emotion label corresponding to the target text data based on the emotion matching probabilities corresponding to the plurality of text word segments;
and performing one-hot encoding on the target emotion label to obtain an emotion feature vector corresponding to the target text data.
3. The method of claim 2, wherein the determining a target emotion label corresponding to the target text data based on the emotion matching probabilities corresponding to a plurality of the text word segments comprises:
For each preset emotion label, determining a comprehensive emotion matching probability corresponding to the preset emotion label based on the emotion matching probabilities corresponding to the plurality of text word segments, and taking the preset emotion label with the highest comprehensive emotion matching probability as the target emotion label.
4. The method of claim 1, wherein the determining a speech feature vector corresponding to the target speech data comprises:
Determining a phoneme and a phoneme start-stop time corresponding to each character in the target voice data;
Constructing a phoneme sequence corresponding to the target voice data based on the phonemes corresponding to each character and the phoneme start-stop time;
dividing the phoneme sequence into a plurality of the voice feature vectors based on a preset phoneme time threshold.
5. The method of claim 1, wherein said determining a mouth shape driving parameter of a digital person based on said emotion feature vector and said voice feature vector comprises:
Splicing the emotion feature vector and the voice feature vector to obtain an audio feature vector;
Determining a mouth shape label result based on the audio feature vector and a mouth shape label determining model, wherein the mouth shape label determining model is obtained by training a pre-established deep learning model based on a sample audio feature vector and an expected output mouth shape label result corresponding to the sample audio feature vector;
And determining a mouth shape driving parameter corresponding to the target voice data based on the mouth shape label result, and taking the mouth shape driving parameter corresponding to the target voice data as the mouth shape driving parameter of the digital person.
6. The method of claim 5, further comprising, prior to said determining said mouth shape label result based on said audio feature vector and a mouth shape label determining model:
acquiring a sample audio feature vector, and determining an expected output mouth shape label result corresponding to the sample audio feature vector;
inputting the sample audio feature vector into a pre-established initial mouth shape label result determining model to obtain an initial mouth shape label result;
and determining a model loss between the initial mouth shape label result and the expected output mouth shape label result based on a loss function, adjusting the initial mouth shape label result determining model based on the model loss until the loss function converges, and taking the adjusted initial mouth shape label result determining model as the mouth shape label determining model.
7. The method of claim 1, wherein the determining target speech data corresponding to the target text data comprises:
And converting the target text data into the target voice data based on a preset text-to-speech technology.
8. A mouth shape driving device for a digital person, comprising:
The first feature vector determining module is used for acquiring target text data and determining emotion feature vectors corresponding to the target text data;
the second feature vector determining module is used for determining target voice data corresponding to the target text data and determining voice feature vectors corresponding to the target voice data;
And the mouth shape driving module is used for determining mouth shape driving parameters of the digital person based on the emotion feature vectors and the voice feature vectors and driving the mouth shape of the digital person to change based on the mouth shape driving parameters, wherein the mouth shape driving parameters comprise a combination of products of at least one mouth shape driving action parameter and the weight corresponding to the at least one mouth shape driving action parameter.
9. An electronic device, the electronic device comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the digital person's mouth shape driving method according to any one of claims 1-7.
10. A computer readable storage medium storing computer instructions which, when executed, cause a processor to implement the method for driving a mouth shape of a digital person according to any one of claims 1-7.
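
By way of illustration only, the following Python sketch shows one possible reading of the emotion feature extraction in claims 2 and 3: segment the target text into word segments, score each segment against the preset emotion labels, aggregate the scores into a comprehensive matching probability, and one-hot encode the winning label. The word segmentation library (jieba), the toy lexicon, and the four-label set are assumptions introduced for the example and are not specified by the claims.

    import numpy as np
    import jieba  # assumed Chinese word-segmentation library

    # Assumed preset emotion labels; the actual label set is not specified by the claims.
    EMOTION_LABELS = ["neutral", "happy", "sad", "angry"]

    # Toy lexicon standing in for a trained text-emotion classifier: each entry gives the
    # emotion matching probability of one text word segment against every preset label.
    TOY_LEXICON = {
        "高兴": np.array([0.05, 0.85, 0.05, 0.05]),
        "难过": np.array([0.05, 0.05, 0.85, 0.05]),
    }
    UNIFORM = np.full(len(EMOTION_LABELS), 1.0 / len(EMOTION_LABELS))

    def emotion_feature_vector(target_text: str) -> np.ndarray:
        tokens = list(jieba.cut(target_text))                     # word segmentation
        probs = np.stack([TOY_LEXICON.get(t, UNIFORM) for t in tokens])
        combined = probs.mean(axis=0)                             # comprehensive matching probability per label
        target_label = int(np.argmax(combined))                   # label with the highest comprehensive probability
        one_hot = np.zeros(len(EMOTION_LABELS), dtype=np.float32)
        one_hot[target_label] = 1.0                               # one-hot encoding of the target emotion label
        return one_hot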
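
The speech-side processing of claim 4 can likewise be sketched as follows, assuming that a forced-alignment step has already produced a (phoneme, start time, end time) triple for each character of the target voice data; the 0.2-second value is an arbitrary illustrative choice for the preset phoneme time threshold.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PhonemeSpan:
        phoneme: str
        start: float  # seconds
        end: float    # seconds

    def split_phoneme_sequence(spans: List[PhonemeSpan],
                               time_threshold: float = 0.2) -> List[List[PhonemeSpan]]:
        """Divide the phoneme sequence into segments whose duration does not exceed the
        preset phoneme time threshold; each segment is later encoded as one voice feature vector."""
        segments, current = [], []
        segment_start = None
        for span in sorted(spans, key=lambda s: s.start):   # phoneme sequence ordered by start time
            if segment_start is None:
                segment_start = span.start
            if current and span.end - segment_start > time_threshold:
                segments.append(current)                     # close the current segment
                current, segment_start = [], span.start
            current.append(span)
        if current:
            segments.append(current)
        return segments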
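
Claims 1 and 5, read together, splice the emotion feature vector and the voice feature vector into an audio feature vector, classify it into a mouth shape label, and map that label to a weighted combination of mouth shape driving action parameters (for example, blendshape weights). The network layout and the label-to-parameter table in the sketch below are illustrative assumptions only; the claims do not prescribe a particular architecture or parameter set.

    import torch
    import torch.nn as nn

    class MouthShapeLabelModel(nn.Module):
        """Assumed architecture for the mouth shape label determining model."""
        def __init__(self, audio_dim: int, num_labels: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(audio_dim, 128),
                nn.ReLU(),
                nn.Linear(128, num_labels),
            )

        def forward(self, audio_feature: torch.Tensor) -> torch.Tensor:
            return self.net(audio_feature)  # logits over mouth shape labels

    # Assumed mapping from each mouth shape label to driving action parameters and weights;
    # the keys must cover every label the model can output.
    LABEL_TO_DRIVING_PARAMS = {
        0: {"mouth_close": 0.9, "jaw_open": 0.1},    # closed / bilabial
        1: {"jaw_open": 0.7, "mouth_stretch": 0.3},  # open vowel
        2: {"mouth_pucker": 0.8, "jaw_open": 0.2},   # rounded vowel
    }

    def mouth_shape_driving_params(emotion_vec: torch.Tensor,
                                   speech_vec: torch.Tensor,
                                   model: MouthShapeLabelModel) -> dict:
        audio_feature = torch.cat([emotion_vec, speech_vec], dim=-1)  # splice the two feature vectors
        label = int(model(audio_feature).argmax(dim=-1))              # mouth shape label result
        return LABEL_TO_DRIVING_PARAMS[label]                         # weighted driving action parameters

Driving the digital person then amounts to applying the returned weights to the corresponding facial action parameters for each speech frame in turn.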
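
The training procedure of claim 6 is a standard supervised loop; a minimal sketch is shown below, using cross-entropy as the loss function (one possible choice, since the claim only requires "a loss function") and a fixed loss value as a crude stand-in for the convergence test.

    import torch
    import torch.nn as nn

    def train_mouth_shape_label_model(model: nn.Module,
                                      sample_features: torch.Tensor,  # (N, audio_dim) sample audio feature vectors
                                      expected_labels: torch.Tensor,  # (N,) expected output mouth shape labels
                                      epochs: int = 50,
                                      lr: float = 1e-3) -> nn.Module:
        criterion = nn.CrossEntropyLoss()                    # assumed loss function
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            optimizer.zero_grad()
            logits = model(sample_features)                  # initial mouth shape label results
            loss = criterion(logits, expected_labels)        # model loss vs. expected output labels
            loss.backward()
            optimizer.step()                                 # adjust the model based on the model loss
            if loss.item() < 1e-3:                           # crude convergence check for the sketch
                break
        return model                                         # adjusted model used as the mouth shape label determining model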
CN202311635344.5A 2023-11-30 2023-11-30 Method, device, equipment and storage medium for driving mouth shape of digital person Pending CN117935807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311635344.5A CN117935807A (en) 2023-11-30 2023-11-30 Method, device, equipment and storage medium for driving mouth shape of digital person

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311635344.5A CN117935807A (en) 2023-11-30 2023-11-30 Method, device, equipment and storage medium for driving mouth shape of digital person

Publications (1)

Publication Number Publication Date
CN117935807A 2024-04-26

Family

ID=90763775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311635344.5A Pending CN117935807A (en) 2023-11-30 2023-11-30 Method, device, equipment and storage medium for driving mouth shape of digital person

Country Status (1)

Country Link
CN (1) CN117935807A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination