CN111081270B - Real-time audio-driven virtual character mouth shape synchronous control method - Google Patents
Real-time audio-driven virtual character mouth shape synchronous control method
- Publication number
- CN111081270B (application CN201911314031.3A)
- Authority
- CN
- China
- Prior art keywords
- mouth shape
- real
- probability
- phoneme
- virtual character
- Prior art date
- 2019-12-19
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/18—Details of the transformation process
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a real-time audio-driven virtual character mouth shape synchronous control method. The method comprises the following steps: identifying viseme probabilities from a real-time voice stream; filtering the viseme probabilities; converting the sampling rate of the viseme probabilities to the same rate as the rendering frame rate of the virtual character; and converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering. The method avoids the need to transmit a synchronized phoneme sequence or mouth shape sequence alongside the audio stream, significantly reduces the complexity, coupling, and implementation difficulty of the system, and is suitable for various application scenarios in which virtual characters are rendered on a display device.
Description
Technical Field
The invention belongs to the field of virtual character posture control, and particularly relates to a real-time audio-driven virtual character mouth shape synchronous control method.
Background
Virtual character modeling and rendering techniques are widely used in industries such as animation, games, and film. Giving the virtual character natural, smooth mouth movements that stay synchronized with its speech is key to a good user experience. In a real-time system, audio acquired as a stream must be played back in step with the synchronously rendered virtual character, and the synchronization between the audio and the character's mouth shape must be maintained throughout.
The application scenarios include:
1. The real-time audio is speech generated by a speech synthesizer;
1.1 The phoneme sequence corresponding to the speech can be acquired as a synchronized stream;
1.2 The phoneme sequence corresponding to the speech cannot be obtained as a synchronized stream;
2. The real-time audio is speech uttered by a person.
In scenario 1.1, the phoneme sequence corresponding to the speech can be obtained synchronously and converted into a mouth shape sequence that drives the virtual character's mouth movements. However, synchronously acquiring the phoneme sequence requires additional communication protocol support in the application to keep the speech and the phoneme sequence time-aligned, which increases system complexity and coupling and makes implementation difficult.
In scenarios 1.2 and 2, the phoneme sequence corresponding to the speech cannot be obtained synchronously, so a control method is needed that can drive the virtual character's mouth shape from the real-time audio data alone.
Therefore, to handle the cases in which the phoneme sequence cannot be obtained synchronously, a method is needed that can identify a mouth shape sequence directly from the audio and use it to drive the virtual character's mouth shape changes in synchrony with the speech.
Disclosure of Invention
The invention provides a real-time audio-driven virtual character mouth shape synchronous control method aimed at the following problem: in a real-time audio streaming scenario, a virtual character must be displayed on the device, the character's speech is obtained from the real-time audio stream, and the character's mouth shape must be synchronized with the speech content.
A real-time audio-driven virtual character mouth shape synchronous control method comprises the following steps:
identifying viseme probabilities from the real-time voice stream, the viseme probabilities being obtained by combining the probabilities of the phonemes belonging to the same viseme class based on a preset phoneme-to-viseme mapping;
a step of filtering the viseme probabilities;
converting the sampling rate of the viseme probabilities to the same rate as the rendering frame rate of the virtual character;
and converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the viseme probabilities are obtained either by a viseme recognition method, or by recognizing phoneme probabilities from the real-time voice stream using phoneme recognition and converting them into viseme probabilities.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: each viseme probability is smoothed and filtered separately using a finite or infinite impulse response filter.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: when converting the viseme probabilities to a standard mouth shape configuration, first a standard mouth shape configuration is defined for each viseme, in the form of either a key frame or parameters describing the mouth shape; second, the viseme probabilities are converted into mixing proportions of the standard mouth shape configurations through a mapping function. In a key frame scenario, the mixing proportion is the interpolation ratio between different key frames; in a key point parameter, bone parameter, or blendshape parameter scenario, the mixing proportion is the mixing ratio of those mouth shape parameters.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: in order to keep synchronization during audio/video playing, the contents of the audio stream and the video stream are synchronized by compensating for the delay during playing of the audio stream.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the length of the delay compensation buffer is determined by the processing delays of the viseme recognition, the filtering, and the video rendering.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the phoneme recognition comprises: framing the voice stream, and extracting features; and a step of performing phoneme estimation using the features.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the phoneme is an IPA defined phoneme, or a custom phoneme.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the delay is compensated as follows: audio delay compensation = framing delay + feature splicing delay + phoneme recognition delay + filtering delay − video rendering delay.
For the case in which the phoneme sequence corresponding to the speech cannot be obtained synchronously, the invention provides a method that identifies a mouth shape sequence from the audio and uses it to synchronously drive the virtual character's mouth shape changes. The method avoids the need to transmit synchronized phoneme sequence or mouth shape sequence information alongside the audio stream, significantly reduces the complexity, coupling, and implementation difficulty of the system, and is suitable for various application scenarios in which virtual characters are rendered on a display device.
Compared with the prior art, the invention has the following advantages:
Rendering the virtual character locally on the device avoids rendering the video on the server and transmitting it over the network, saving a large amount of communication bandwidth and reducing operating costs.
Recognizing the mouth shape locally on the device avoids transmitting mouth shape information alongside the audio and avoids communication-layer synchronization of audio and mouth shape, reducing communication protocol complexity and implementation difficulty.
Using the probability output of the phoneme or viseme recognition model directly as the mixing ratio of the standard mouth shape parameters avoids converting the probabilities into phoneme or viseme category labels with a Viterbi decoding algorithm, reducing implementation difficulty.
The invention infers the mixing ratio of the mouth shape parameters directly from the audio signal without Viterbi decoding, avoiding the systematic delay introduced by decoding; compared with a decoding-based method, the system response time can be shortened by roughly one second, which greatly reduces interaction latency in real-time interaction scenarios and improves the user experience.
Drawings
FIG. 1 is a flowchart of the real-time audio-driven virtual character mouth shape synchronous control method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of the real-time audio-driven virtual character mouth shape synchronous control method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of the real-time audio-driven virtual character mouth shape synchronous control method according to a third embodiment of the present invention.
Detailed Description
Embodiments of the invention will be described below with reference to the drawings, but it should be appreciated that the invention is not limited to the embodiments described and that various modifications of the invention are possible without departing from the basic idea. The scope of the invention is therefore intended to be limited solely by the appended claims.
As shown in fig. 1, the method for controlling the mouth shape of a virtual character driven by real-time audio according to the present invention includes the following steps:
identifying viseme probabilities from the real-time voice stream;
filtering the viseme probabilities;
converting the sampling rate of the viseme probabilities to the same rate as the rendering frame rate of the virtual character;
and converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering.
As shown in fig. 2, a method for controlling the mouth shape of a real-time audio-driven avatar according to another embodiment of the present invention includes the following steps:
step 1, phoneme recognition
Step 1.1, feature extraction
The voice stream is framed and features are extracted.
In the framing process, a frame of L samples is taken every H samples from the continuous voice stream, so adjacent frames overlap by L − H samples.
Feature extraction converts each frame of data into some representation, such as a magnitude spectrum, phase spectrum, band energies, cepstral coefficients, or linear prediction coefficients.
Feature extraction may also leave the voice data unprocessed and use the raw audio samples as the feature extraction result.
After the features of each frame are obtained, differential features can additionally be computed from temporally adjacent frames and appended to the original features as the feature extraction result.
Alternatively, the features of temporally adjacent frames can be spliced together and the spliced result used as the feature extraction result.
The differencing and splicing operations may also be used together, as in the sketch below.
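As a concrete illustration, the following minimal Python sketch performs framing with frame length L and hop H, computes a simple log band-energy feature, and splices adjacent frames. The function names, the band-energy feature, and the context width are illustrative assumptions, not part of the invention; a real system might use mel filter banks, MFCCs, or LPC features instead.

```python
import numpy as np

def frame_signal(samples, frame_len, hop):
    """Take a frame of frame_len samples every hop samples from the voice stream;
    adjacent frames overlap by frame_len - hop samples (assumes len(samples) >= frame_len)."""
    n_frames = (len(samples) - frame_len) // hop + 1
    return np.stack([samples[i * hop:i * hop + frame_len] for i in range(n_frames)])

def band_energy_features(frames, n_bands=40):
    """Toy feature: log energies of n_bands frequency bands per frame."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

def splice_context(features, context=2):
    """Concatenate each frame's features with those of its temporal neighbours."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + len(features)] for i in range(2 * context + 1)], axis=1)

# Example: 1 s of 16 kHz audio, 25 ms frames (L = 400) with a 10 ms hop (H = 160).
samples = np.random.randn(16000)
feats = splice_context(band_energy_features(frame_signal(samples, 400, 160)))
```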
Step 1.2 phoneme probability estimation
Phoneme probability estimation uses a statistical machine learning model to estimate, from the input features, the probability that each frame corresponds to each phoneme.
The phonemes may be those defined by the IPA (International Phonetic Alphabet) or by other standards.
Taking Chinese as an example, a custom phoneme set that can be adopted is:
b | p | m | f | d | t | n | l |
g | h | j | q | x | z | c | s |
zh | ch | sh | ng | a | o | e | i |
ii | iii | u | v | er | sil |
where ng represents the final of "neng", i the final of "yi", ii the final of "zi", and iii the final of "zhi"; sil denotes silence.
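As an illustration of this step, the sketch below assumes an already-trained classifier object exposing a scikit-learn-style predict_proba() method; the model itself, its training, and the function name are hypothetical and outside the scope of the patent. The point is only that each frame of features is mapped to a probability distribution over the phoneme set above.

```python
import numpy as np

# The custom Chinese phoneme set listed above (30 symbols, including silence "sil").
PHONEMES = ["b", "p", "m", "f", "d", "t", "n", "l",
            "g", "h", "j", "q", "x", "z", "c", "s",
            "zh", "ch", "sh", "ng", "a", "o", "e", "i",
            "ii", "iii", "u", "v", "er", "sil"]

def phoneme_posteriors(features, model):
    """Return an (n_frames, n_phonemes) matrix of per-frame phoneme probabilities.
    `model` is any statistical machine learning classifier trained to output one
    probability per phoneme (hypothetical here); each row sums to 1 over PHONEMES."""
    probs = model.predict_proba(features)
    assert probs.shape == (len(features), len(PHONEMES))
    return probs
```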
Step 2, phoneme-to-viseme probability conversion
The viseme probability is obtained by combining the phoneme probabilities belonging to the same type of visemes based on a preset mapping relation from the phonemes to the visemes.
The predetermined mapping relationship may follow different design criteria and is not limited to the specific embodiment given in the present invention.
Taking Chinese as an example, the mapping relationship may be:

Viseme | Phonemes
---|---
b | b / p / m
d | d / t / n
z | z / c / s
zh | zh / ch / sh
j | j / q / x
k | k / h / l / g / ng
a | a
o | o
e | e / er
i | i / ii / iii
u | u / v
sil | sil
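The conversion itself is simply a sum of phoneme probabilities within each viseme class. A minimal sketch under the mapping above follows; phonemes that are absent from the recognizer's phoneme set are skipped, and the column order of VISEMES is an implementation choice.

```python
import numpy as np

# Phoneme-to-viseme mapping from the table above.
VISEME_TO_PHONEMES = {
    "b": ["b", "p", "m"],   "d": ["d", "t", "n"],
    "z": ["z", "c", "s"],   "zh": ["zh", "ch", "sh"],
    "j": ["j", "q", "x"],   "k": ["k", "h", "l", "g", "ng"],
    "a": ["a"],             "o": ["o"],
    "e": ["e", "er"],       "i": ["i", "ii", "iii"],
    "u": ["u", "v"],        "sil": ["sil"],
}
VISEMES = list(VISEME_TO_PHONEMES)

def phoneme_to_viseme_probs(phoneme_probs, phonemes):
    """Combine per-frame phoneme probabilities (n_frames x n_phonemes) into
    per-frame viseme probabilities (n_frames x n_visemes) by summing the
    probabilities of the phonemes mapped to each viseme."""
    index = {p: i for i, p in enumerate(phonemes)}
    out = np.zeros((phoneme_probs.shape[0], len(VISEMES)))
    for col, members in enumerate(VISEME_TO_PHONEMES.values()):
        cols = [index[p] for p in members if p in index]
        out[:, col] = phoneme_probs[:, cols].sum(axis=1)
    return out
```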
Step 3, carrying out smooth filtering on the obtained viseme probability
Because the probability estimates of the statistical machine learning model cannot be guaranteed to be completely accurate, the result is usually refined by combining information across multiple frames to obtain probabilities that vary smoothly over time.
The smoothing filter can be a finite impulse response filter applied separately to each viseme's probability; the order and coefficients of the filter can be adjusted according to the required system response time.
In the simplest case, a moving-average FIR filter of order 10 may be used; other filter designs may be used in practice.
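A minimal offline sketch of such a moving-average FIR filter follows. The order of 10 is just the example value from the text, and a streaming system would use a causal variant that filters past frames only; this batch version is only for illustration.

```python
import numpy as np

def smooth_viseme_probs(viseme_probs, order=10):
    """Apply a moving-average FIR filter of the given order independently to each
    viseme's probability track (columns of an n_frames x n_visemes array).
    A higher order gives smoother output at the cost of a longer response time."""
    kernel = np.ones(order) / order
    return np.stack([np.convolve(viseme_probs[:, v], kernel, mode="same")
                     for v in range(viseme_probs.shape[1])], axis=1)
```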
Step 4, resampling the voice stream according to the sampling rate of the video
Since the feature extraction process in step 1 frames the voice stream, the data frames have a sampling rate of (audio sampling rate / H) Hz.
The rate at which video is rendered is typically based on the refresh rate of the display device.
Resampling is therefore needed to convert the data-frame rate to the video frame rate.
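One simple way to do this, shown below as an assumed sketch, is to linearly interpolate each viseme probability track from the data-frame timestamps onto the video frame timestamps; other resampling schemes would work equally well.

```python
import numpy as np

def resample_to_video_rate(viseme_probs, data_frame_rate_hz, video_fps):
    """Linearly interpolate each viseme probability track from the data-frame rate
    (audio sampling rate / H) to the video rendering frame rate."""
    n_in = viseme_probs.shape[0]
    t_in = np.arange(n_in) / data_frame_rate_hz                      # data-frame timestamps
    t_out = np.arange(int(n_in * video_fps / data_frame_rate_hz)) / video_fps
    return np.stack([np.interp(t_out, t_in, viseme_probs[:, v])
                     for v in range(viseme_probs.shape[1])], axis=1)

# Example: 100 data frames per second (16 kHz audio, H = 160) resampled to 60 fps video.
```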
Step 5, converting the viseme probability to the standard mouth shape mixing proportion
The virtual character rendering system will typically define a standard mouth shape configuration for each viseme, either as key frames or as parameters describing the mouth shape.
The viseme probabilities can be converted to mixing ratios for the standard mouth shape configurations by a linear or non-linear mapping function.
In a key frame scenario, the mixing ratio may be an interpolation ratio between different key frames.
In the context of key point parameters, bone parameters, or blendshape parameters, the mixing ratio may be the blending ratio of those parameters.
Taking one frame of data as an example, suppose the viseme probabilities are:

Viseme | Viseme probability
---|---
b | 0.0
d | 0.0
z | 0.0
zh | 0.0
j | 0.0
k | 0.0
a | 0.6
o | 0.4
e | 0.0
i | 0.0
u | 0.0
sil | 0.0
Assume the mapping function from viseme probabilities to mixing ratios is linear. Taking a key point parameter scenario as an example, two-dimensional key point parameters are defined as:
a (0.2, 0.8)
o (0.7, 0.3)
The key point parameters corresponding to the above viseme probabilities are mixed as a × 0.6 + o × 0.4, so the key point parameters of the current frame are (0.4, 0.6).
Step 6, mouth shape rendering using the mixed mouth shape configuration
And the virtual character rendering system renders a virtual character image according to the mixed mouth shape configuration to obtain a video stream.
Step 7, synchronously playing audio and video
Because the voice stream passes through framing, splicing, phoneme recognition, smoothing filtering, and similar stages, each of which introduces some system delay, the contents of the audio stream and the video stream need to be synchronized by compensating for this delay when the audio stream is played.
The total delay can be calculated by accumulating the delays of the processing stages.
Since video rendering also introduces some delay, the delay of the video rendering system needs to be subtracted when calculating the audio delay.
Taking a common scenario as an example:
the audio delay compensation amount is the framing delay + feature splicing delay + phoneme recognition delay + smoothing filter delay-video rendering delay.
Fig. 3 shows a third embodiment of the present invention. It differs from the second embodiment of fig. 2 in that viseme recognition is performed directly on the speech stream, without the phoneme recognition and phoneme-to-viseme probability conversion steps.
Compared with the method of fig. 2, its viseme probability estimates are slightly less accurate, but the user's subjective impression is largely unaffected, and it has the advantage of lower implementation difficulty and computational complexity.
Since possible variations and modifications may be effected by one skilled in the art without departing from the spirit and scope of the invention, the scope of protection is to be determined by the claims appended hereto.
Claims (8)
1. A real-time audio-driven virtual character mouth shape synchronous control method comprises the following steps:
identifying a viseme probability from the real-time voice stream, wherein the viseme probability is obtained by combining the probabilities of phonemes belonging to the same viseme class based on a preset phoneme-to-viseme mapping, and the viseme probability is obtained either by a viseme recognition method or by recognizing a phoneme probability from the real-time voice stream using phoneme recognition and converting the phoneme probability into a viseme probability;
a step of filtering the viseme probability;
converting the sampling rate of the viseme probability to the same rate as the rendering frame rate of the virtual character;
converting the viseme probability into a standard mouth shape configuration and rendering the mouth shape, wherein, when converting the viseme probability into a standard mouth shape configuration: firstly, a standard mouth shape configuration is defined for each viseme, the standard mouth shape configuration being a key frame or parameters describing the mouth shape; secondly, the viseme probability is converted into a mixing proportion of the standard mouth shape configurations through a mapping function; wherein, in a key frame scenario, the mixing proportion is an interpolation proportion between different key frames, and in the scenario of key point parameters, bone parameters, or blendshape parameters, the mixing proportion is the mixing ratio of the key point parameters, bone parameters, or blendshape parameters.
2. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 1, characterized in that: each viseme probability is smoothed and filtered separately using a finite or infinite impulse response filter.
3. The method for real-time audio-driven virtual character mouth shape synchronous control as claimed in claim 1, characterized in that: in order to keep synchronization during audio/video playing, the contents of the audio stream and the video stream are synchronized by compensating for the delay during playing of the audio stream.
4. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 3, characterized in that: the length of the delay compensation buffer is determined by the processing delays of the viseme recognition, the filtering, and the video rendering.
5. The method for real-time audio-driven virtual character mouth shape synchronous control as claimed in claim 1, characterized in that: the phoneme recognition comprises: framing the voice stream, and extracting features; and a step of performing phoneme estimation using the features.
6. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 5, characterized in that: the phoneme is an IPA defined phoneme, or a custom phoneme.
7. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 6, characterized in that: the phonemes are:
b | p | m | f | d | t | n | l |
g | h | j | q | x | z | c | s |
zh | ch | sh | ng | a | o | e | i |
ii | iii | u | v | er | sil |
wherein ng represents the final of "neng", i represents the final of "yi", ii represents the final of "zi", iii represents the final of "zhi", and sil represents silence; and the phoneme-to-viseme conversion relationship is:

Viseme | Phonemes
---|---
b | b / p / m
d | d / t / n
z | z / c / s
zh | zh / ch / sh
j | j / q / x
k | k / h / l / g / ng
a | a
o | o
e | e / er
i | i / ii / iii
u | u / v
sil | sil
8. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 3, characterized in that: the method for compensating the delay comprises the following steps: the audio delay compensation amount is the framing delay + feature splicing delay + phoneme recognition delay + filtering delay-video rendering delay.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911314031.3A CN111081270B (en) | 2019-12-19 | 2019-12-19 | Real-time audio-driven virtual character mouth shape synchronous control method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911314031.3A CN111081270B (en) | 2019-12-19 | 2019-12-19 | Real-time audio-driven virtual character mouth shape synchronous control method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111081270A CN111081270A (en) | 2020-04-28 |
CN111081270B true CN111081270B (en) | 2021-06-01 |
Family
ID=70315527
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911314031.3A Active CN111081270B (en) | 2019-12-19 | 2019-12-19 | Real-time audio-driven virtual character mouth shape synchronous control method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111081270B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111627096A (en) * | 2020-05-07 | 2020-09-04 | 江苏原力数字科技股份有限公司 | Digital human driving system based on blenshape |
CN111698552A (en) * | 2020-05-15 | 2020-09-22 | 完美世界(北京)软件科技发展有限公司 | Video resource generation method and device |
CN115426553A (en) * | 2021-05-12 | 2022-12-02 | 海信集团控股股份有限公司 | Intelligent sound box and display method thereof |
CN117557692A (en) * | 2022-08-04 | 2024-02-13 | 深圳市腾讯网域计算机网络有限公司 | Method, device, equipment and medium for generating mouth-shaped animation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8694318B2 (en) * | 2006-09-19 | 2014-04-08 | At&T Intellectual Property I, L. P. | Methods, systems, and products for indexing content |
US10657972B2 (en) * | 2018-02-02 | 2020-05-19 | Max T. Hall | Method of translating and synthesizing a foreign language |
- 2019-12-19 CN: application CN201911314031.3A filed; granted as CN111081270B (status: Active)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2438691A (en) * | 2005-04-13 | 2007-12-05 | Pixel Instr Corp | Method, system, and program product for measuring audio video synchronization independent of speaker characteristics |
CN101482975A (en) * | 2008-01-07 | 2009-07-15 | 丰达软件(苏州)有限公司 | Method and apparatus for converting words into animation |
CN102342100A (en) * | 2009-03-09 | 2012-02-01 | 思科技术公司 | System and method for providing three dimensional imaging in network environment |
CN103329147A (en) * | 2010-11-04 | 2013-09-25 | 数字标记公司 | Smartphone-based methods and systems |
CN103218842A (en) * | 2013-03-12 | 2013-07-24 | 西南交通大学 | Voice synchronous-drive three-dimensional face mouth shape and face posture animation method |
CN107369440A (en) * | 2017-08-02 | 2017-11-21 | 北京灵伴未来科技有限公司 | The training method and device of a kind of Speaker Identification model for phrase sound |
CN109599113A (en) * | 2019-01-22 | 2019-04-09 | 北京百度网讯科技有限公司 | Method and apparatus for handling information |
CN109712627A (en) * | 2019-03-07 | 2019-05-03 | 深圳欧博思智能科技有限公司 | It is a kind of using speech trigger virtual actor's facial expression and the voice system of mouth shape cartoon |
Non-Patent Citations (2)
Title |
---|
Research on lip reading and viseme segmentation based on BTSM and DBN models; Lü Guoyun et al.; Computer Engineering and Applications; 2007-07-31; vol. 43, no. 14; pp. 21-24 *
A multi-input-driven 3D virtual human head for human-computer interfaces; Yu Jun et al.; Chinese Journal of Computers; 2013-12-31; vol. 36, no. 12; pp. 2525-2536 *
Also Published As
Publication number | Publication date |
---|---|
CN111081270A (en) | 2020-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111081270B (en) | Real-time audio-driven virtual character mouth shape synchronous control method | |
US6766299B1 (en) | Speech-controlled animation system | |
US5608839A (en) | Sound-synchronized video system | |
US20080259085A1 (en) | Method for Animating an Image Using Speech Data | |
CN103650002B (en) | Text-based video generation | |
EP0920691A1 (en) | Segmentation and sign language synthesis | |
EP0993197B1 (en) | A method and an apparatus for the animation, driven by an audio signal, of a synthesised model of human face | |
US5926575A (en) | Model-based coding/decoding method and system | |
US20030149569A1 (en) | Character animation | |
US6943794B2 (en) | Communication system and communication method using animation and server as well as terminal device used therefor | |
EP4195668A1 (en) | Virtual video livestreaming processing method and apparatus, storage medium, and electronic device | |
US20060079325A1 (en) | Avatar database for mobile video communications | |
JP2003529861A (en) | A method for animating a synthetic model of a human face driven by acoustic signals | |
JP2518683B2 (en) | Image combining method and apparatus thereof | |
CN113592985B (en) | Method and device for outputting mixed deformation value, storage medium and electronic device | |
CN112001992A (en) | Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning | |
JP2008500573A (en) | Method and system for changing messages | |
JPH089372A (en) | Device for increasing frame transmission rate of received video signal | |
US20050204286A1 (en) | Speech receiving device and viseme extraction method and apparatus | |
CA2162199A1 (en) | Acoustic-assisted image processing | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
CN114760425A (en) | Digital human generation method, device, computer equipment and storage medium | |
CN114339069A (en) | Video processing method and device, electronic equipment and computer storage medium | |
CN114793300A (en) | Virtual video customer service robot synthesis method and system based on generation countermeasure network | |
CN110958417A (en) | Method for removing compression noise of video call video based on voice clue |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |