CN114359450A - Method and device for simulating virtual character speaking - Google Patents

Method and device for simulating virtual character speaking

Info

Publication number
CN114359450A
CN114359450A (application CN202210050718.6A)
Authority
CN
China
Prior art keywords
phoneme
mouth shape
mouth
audio frame
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210050718.6A
Other languages
Chinese (zh)
Inventor
余国军 (Yu Guojun)
耿俊怀 (Geng Junhuai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaoduo Intelligent Technology Beijing Co ltd
Original Assignee
Xiaoduo Intelligent Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoduo Intelligent Technology Beijing Co ltd filed Critical Xiaoduo Intelligent Technology Beijing Co ltd
Priority to CN202210050718.6A priority Critical patent/CN114359450A/en
Publication of CN114359450A publication Critical patent/CN114359450A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the invention discloses a method and a device for simulating a virtual character speaking, wherein the method comprises the following steps: according to a plurality of phoneme classifications, making a mouth shape corresponding to each phoneme classification to obtain a plurality of basic mouth shapes; inputting an audio stream, extracting audio frames of the audio stream, and identifying the phonemes of each audio frame; determining the phoneme classification corresponding to the phoneme of the audio frame from the plurality of phoneme classifications, and selecting the basic mouth shape corresponding to that classification; and synthesizing the selected basic mouth shapes into the mouth shape corresponding to the audio frame. Real-person mouth shapes are classified by phoneme and organized into 14 basic mouth shapes, so that a computer can drive the mouth shape of a virtual digital human in synchronization with speech through phoneme recognition. With this scheme, speech-mouth-shape synchronization for a virtual digital human can be realized quickly and accurately. A standardized mouth-shape production scheme is formulated, which greatly improves the production efficiency and quality of virtual digital human mouth shapes. The virtual digital human is thus closer to a real person, greatly improving the user experience.

Description

Method and device for simulating virtual character speaking
Technical Field
The embodiment of the invention relates to the field of speech recognition processing, in particular to a method and a device for simulating a virtual character speaking.
Background
There are currently three main solutions on the market for the mouth shapes of virtual digital humans:
(1) Fixed mouth shape: the mouth shape stays the same regardless of what the virtual character says, so speech and mouth shape cannot be synchronized;
(2) Volume-driven mouth shape: the size of the virtual character's mouth opening is controlled by the speaking volume, which is very inaccurate and likewise cannot achieve speech-mouth-shape synchronization;
(3) Live-picture sequence-frame animation: the scheme used by iFlytek's virtual digital humans, which achieves speech-mouth-shape synchronization by recognizing speech and invoking picture-sequence frame animations.
Disclosure of Invention
Therefore, embodiments of the present invention provide a method and a device for simulating a virtual character speaking, so as to solve the problem in the prior art that volume-driven and fixed mouth shapes on the market are only suitable for cartoon characters and cannot achieve speech-mouth-shape synchronization.
In order to achieve the above object, an embodiment of the present invention provides the following:
in one aspect of an embodiment of the present invention, there is provided a method of simulating a virtual character speaking, the method comprising:
according to a plurality of phoneme classifications, making a mouth shape corresponding to each phoneme classification to obtain a plurality of basic mouth shapes;
inputting an audio stream, extracting an audio frame of the audio stream, and identifying a phoneme of the audio frame;
determining the phoneme classification corresponding to the phoneme of the audio frame from the plurality of phoneme classifications, and selecting the basic mouth shape corresponding to the phoneme classification;
synthesizing the selected base mouth shape into a corresponding mouth shape of the audio frame.
Further, the plurality of phoneme classifications includes:
(p, b, m), (f, v), (th), (t, d), (k, g), (tS, dZ, S), (s, z), (n, l), (r), (A), (e), (ih), (oh), (ou).
Further, in the audio stream, a data amount in units of 2.5 ms to 60 ms is extracted as one frame of audio.
Further, the method further comprises:
and making a virtual character model, and generating the mouth shape of the virtual character according to the corresponding mouth shape of the audio frame.
Further, the plurality of basic mouth shapes further comprises: a closed-mouth shape and a generic mouth shape.
Further, when a phoneme identified from the audio frame is not in the plurality of phoneme classifications, the generic mouth shape is selected as the basic mouth shape;
when no phoneme is recognized from the audio frame, the closed-mouth shape is selected as the basic mouth shape.
In one aspect of an embodiment of the present invention, there is also provided an apparatus for simulating a virtual character speaking, the apparatus including:
a basic mouth shape generating unit, which is used for making a mouth shape corresponding to each phoneme classification according to a plurality of phoneme classifications to obtain a plurality of basic mouth shapes;
a phoneme extracting unit, which is used for inputting an audio stream, extracting an audio frame of the audio stream and identifying phonemes of the audio frame;
a basic mouth shape determining unit, configured to determine the phoneme classification corresponding to the phoneme of the audio frame from the plurality of phoneme classifications, and select the basic mouth shape corresponding thereto;
and the mouth shape synthesizing unit is used for synthesizing the selected basic mouth shape into a corresponding mouth shape of the audio frame.
In another aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the above-mentioned method.
In another aspect of embodiments of the present invention, there is provided a computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the above method.
The embodiment of the invention has the following advantages:
the embodiment of the invention discloses a method and a device for simulating the speaking of a virtual character, which classifies the mouth shape of a real person by phonemes, arranges the mouth shape of the real person into 14 basic mouth shapes, and can drive the population shape of a virtual number to be synchronous by the phoneme recognition of a computer. Through the virtual digital population type patent, the voice mouth shape synchronization of the virtual digital people can be quickly and accurately realized. Through the fusion and classification of phonemes, the voice mouth shape synchronization of the virtual digital person is realized, and the mouth shape fault tolerance rate of the virtual digital person during speaking reaches 99.9%. A mouth shape standardized mouth shape manufacturing scheme is formulated, and the virtual digital population shape manufacturing efficiency and the mouth shape quality are greatly improved. The virtual digital person is closer to a real person, and the user experience is greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other drawings can be derived from them by those of ordinary skill in the art without inventive effort.
The structures, proportions, sizes, and the like shown in this specification are only used to match the content disclosed in the specification, so that those skilled in the art can understand and read it; they are not used to limit the conditions under which the present invention can be implemented and thus have no technical significance. Any structural modification, change in proportional relationship, or adjustment of size that does not affect the effects achievable by the present invention shall still fall within the scope covered by the technical content disclosed herein.
FIG. 1 is a flowchart illustrating a method for simulating a virtual character speaking according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an apparatus for simulating virtual character speaking according to an embodiment of the present invention.
In the figure: 102-basic mouth shape generating unit, 104-phoneme extracting unit, 106-basic mouth shape determining unit and 108-mouth shape synthesizing unit.
Detailed Description
The present invention is described below in terms of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It is to be understood that the described embodiments are merely exemplary of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In the present specification, the terms "upper", "lower", "left", "right", "middle", and the like are used for clarity of description, and are not intended to limit the scope of the present invention, and changes or modifications in the relative relationship may be made without substantial changes in the technical content.
Examples
Referring to fig. 1 and 2, an embodiment of the present invention provides a method for simulating a virtual character speaking, including the following steps:
s1: and according to the plurality of phoneme classifications, making a mouth shape corresponding to each phoneme classification to obtain a plurality of basic mouth shapes. Specifically, the phoneme: phonemes are the smallest units of speech that are divided according to the natural properties of the speech. From an acoustic property point of view, a phoneme is the smallest unit of speech divided from a psychoacoustic point of view. From the physiological point of view, a pronunciation action forms a phoneme. If [ ma ] contains [ m ] a ] two pronunciation actions, which are two phonemes. The sounds uttered by the same pronunciation action are the same phoneme, and the sounds uttered by different pronunciation actions are different phonemes. For example, in [ ma-mi ], the two [ m ] pronunciations are identical and are identical phonemes, and [ a ] i is different and is different phoneme. The analysis of phonemes is generally described in terms of pronunciation actions. The pronunciation action [ m ] is: the upper and lower lips are closed, the vocal cords vibrate, and the airflow flows out of the nasal cavity to make sound. In phonetic terms, it is the bicuspid nasal sound. For example, in the present invention, after a number of tests, the phonemes of Mandarin Chinese are arranged into 14 corresponding pronunciation mouth shapes, and the plurality of phoneme classifications includes the following 14 classifications:
(p, b, m), (f, v), (th), (t, d), (k, g), (tS, dZ, S), (S, z), (n, l), (r), (A), (e), (ih), (oh), (ou). Each classified set comprises at least one phoneme, and 14 basic mouth shapes corresponding to the 14 phoneme classified sets are made. The following is a phoneme classification table, which includes 14 phoneme classifications and corresponding pronunciation examples, in which in the example, the Pinyin pronunciation is bold and the English pronunciation is italic.
Phoneme classification    Examples (Pinyin / English word)
p, b, m                   pu, ban, man
f, v                      fan, vat
th                        xing, zan
t, d                      te, da
k, g                      call, gan
tS, dZ, S                 chair, zha, she
s, z                      se, zeal
n, l                      la, na
r                         rui
A                         ka
e                         bed
ih                        tip
oh                        tou
ou                        bu
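The 14-way grouping above amounts to a flat phoneme-to-mouth-shape lookup table. The following sketch shows one way to build it; the variable names and class indices are illustrative, not part of the patent.

```python
# The 14 phoneme classifications from the table above, each group sharing
# one basic mouth shape.
PHONEME_CLASSES = [
    ("p", "b", "m"), ("f", "v"), ("th",), ("t", "d"), ("k", "g"),
    ("tS", "dZ", "S"), ("s", "z"), ("n", "l"), ("r",),
    ("A",), ("e",), ("ih",), ("oh",), ("ou",),
]

# Invert the grouping into a flat phoneme -> class-index table so each
# recognized phoneme resolves directly to one of the 14 basic mouth shapes.
PHONEME_TO_SHAPE = {
    phoneme: idx
    for idx, group in enumerate(PHONEME_CLASSES)
    for phoneme in group
}

print(PHONEME_TO_SHAPE["m"])   # class 0 -> the (p, b, m) mouth shape
print(PHONEME_TO_SHAPE["ou"])  # class 13 -> the (ou) mouth shape
```

Because every phoneme inside a group maps to the same index, recognizing any of "p", "b", or "m" selects the same basic mouth shape.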
S2: inputting an audio stream, extracting audio frames of the audio stream, and identifying phonemes of the audio frames. The audio data is streaming, and there is no clear concept of one frame per se, and in practical applications, for the convenience of audio algorithm processing/transmission, the data amount in units of 2.5ms to 60ms is generally defined as one frame of audio. This time is called the "sampling time" and has no particular criteria for its length, which is determined by the requirements of the codec and the particular application. Specifically, after a segment of audio frame is extracted, the phonemes in the audio frame will be identified by the neural network identification model.
S3: a phoneme classification corresponding to a phoneme of the audio frame is determined from the plurality of phoneme classifications, and a basic mouth shape corresponding thereto is selected. Specifically, the phoneme in the audio frame is compared with the phoneme classification of 14, and the phoneme classification corresponding to the phoneme of the audio frame is determined. For example, after an audio frame is recognized, a plurality of phonemes are obtained, and a plurality of phoneme classifications corresponding to the plurality of phonemes are respectively identified, and a plurality of basic mouth shapes corresponding to the plurality of phoneme classifications are selected.
S4: and synthesizing the selected basic mouth shape into a corresponding mouth shape of the audio frame. Furthermore, a virtual character model is produced, and the mouth shape of the virtual character is generated according to the corresponding mouth shape of the audio frame. The technical scheme of the invention can identify the phonemes in the audio frame through real-time calling, synthesize the image frame corresponding to the audio frame and synthesize the image frame into the animation or the video in real time, and can quickly and accurately realize the voice mouth shape synchronization of the virtual digital person in the super-realistic/realistic manner.
Further, the plurality of basic mouth shapes also includes a closed-mouth shape and a generic mouth shape. When a phoneme identified from an audio frame is not in the plurality of phoneme classifications, the generic mouth shape is selected as the basic mouth shape. When no phoneme is recognized from an audio frame, the closed-mouth shape is selected as the basic mouth shape.
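The selection rule above, including the two fallback shapes, can be sketched as a small function. The string labels and the abbreviated lookup table are illustrative only.

```python
# Sketch of the mouth-shape selection rule: a recognized phoneme maps to
# one of the 14 basic shapes; an unknown phoneme falls back to the
# "generic" shape; a frame with no recognized phoneme falls back to the
# "closed" shape.

CLOSED, GENERIC = "closed", "generic"

# Abbreviated phoneme -> shape-index table (the full table has 14 classes).
PHONEME_TO_SHAPE = {"p": 0, "b": 0, "m": 0, "f": 1, "v": 1}

def select_mouth_shape(phoneme):
    if phoneme is None:                        # nothing recognized in frame
        return CLOSED
    return PHONEME_TO_SHAPE.get(phoneme, GENERIC)  # unknown -> generic

print(select_mouth_shape(None))   # closed
print(select_mouth_shape("m"))    # 0
print(select_mouth_shape("zz"))   # generic
```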
As shown in fig. 2, an embodiment of the present invention further provides an apparatus for simulating a virtual character speaking, the apparatus including: a basic mouth shape generating unit 102, a phoneme extracting unit 104, a basic mouth shape determining unit 106, and a mouth shape synthesizing unit 108.
The basic mouth shape generating unit 102 is configured to create a mouth shape corresponding to each phoneme classification according to the plurality of phoneme classifications, and obtain a plurality of basic mouth shapes. The phoneme extracting unit 104 is used for inputting an audio stream, extracting an audio frame of the audio stream, and identifying phonemes of the audio frame. The base mouth shape determining unit 106 is configured to determine a phoneme classification corresponding to a phoneme of the audio frame from among the plurality of phoneme classifications, and select a base mouth shape corresponding thereto. The mouth shape synthesizing unit 108 is configured to synthesize the selected base mouth shape into a corresponding mouth shape of the audio frame.
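The four units can be wired together as a single pipeline, sketched below. The class and method names are illustrative, and the phoneme recognizer is stubbed out where the patent would use a neural-network model (unit 104).

```python
# Sketch of the apparatus of fig. 2: frame-by-frame, recognize a phoneme
# (unit 104), map it to a basic mouth shape (units 102/106), and emit one
# mouth-shape label per audio frame for synthesis (unit 108).

class MouthShapePipeline:
    def __init__(self, phoneme_to_shape, recognize_phoneme):
        self.phoneme_to_shape = phoneme_to_shape    # unit 102's output
        self.recognize_phoneme = recognize_phoneme  # stands in for unit 104

    def run(self, audio_frames):
        shapes = []
        for frame in audio_frames:
            phoneme = self.recognize_phoneme(frame)
            if phoneme is None:
                shapes.append("closed")             # silence fallback
            else:
                shapes.append(self.phoneme_to_shape.get(phoneme, "generic"))
        return shapes                               # handed to unit 108

# Stub recognizer: pretend every non-empty frame contains an "m".
pipeline = MouthShapePipeline({"m": 0}, lambda f: "m" if f else None)
print(pipeline.run([[1, 2], [], [3]]))  # [0, 'closed', 0]
```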
The technical scheme of the invention realizes speech-mouth-shape synchronization for a virtual digital human through the fusion and classification of phonemes, so that the mouth-shape fault-tolerance rate of the virtual digital human while speaking can reach 99.9%. A standardized mouth-shape production scheme is formulated, greatly improving the production efficiency and quality of virtual digital human mouth shapes. At the same time, the virtual digital human is closer to a real person, greatly improving the user experience.
The functions of each functional module of the device in the above embodiments of the present description may be implemented through each step of the above method embodiments, and therefore, a specific working process of the device provided in one embodiment of the present description is not repeated herein.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 1.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method in conjunction with fig. 1.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a server. Of course, the processor and the storage medium may also reside as discrete components in a server.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (9)

1. A method of simulating a virtual character speaking, the method comprising:
according to a plurality of phoneme classifications, making a mouth shape corresponding to each phoneme classification to obtain a plurality of basic mouth shapes;
inputting an audio stream, extracting an audio frame of the audio stream, and identifying a phoneme of the audio frame;
determining the phoneme classification corresponding to the phoneme of the audio frame from the plurality of phoneme classifications, and selecting the basic mouth shape corresponding to the phoneme classification;
synthesizing the selected base mouth shape into a corresponding mouth shape of the audio frame.
2. The method of claim 1, wherein the plurality of phoneme classifications comprises:
(p, b, m), (f, v), (th), (t, d), (k, g), (tS, dZ, S), (s, z), (n, l), (r), (A), (e), (ih), (oh), (ou).
3. The method of claim 1, wherein
in the audio stream, a data amount in units of 2.5 ms to 60 ms is extracted as one frame of audio.
4. The method of claim 1, further comprising:
and making a virtual character model, and generating the mouth shape of the virtual character according to the corresponding mouth shape of the audio frame.
5. The method of claim 1, wherein
the plurality of basic mouth shapes further comprises: a closed-mouth shape and a generic mouth shape.
6. The method of claim 5, wherein
the generic mouth shape is selected as the basic mouth shape when a phoneme identified from the audio frame is not in the plurality of phoneme classifications; and
the closed-mouth shape is selected as the basic mouth shape when no phoneme is recognized from the audio frame.
7. An apparatus for simulating a virtual character speaking, the apparatus comprising:
a basic mouth shape generating unit (102) for creating a mouth shape corresponding to each phoneme classification according to the plurality of phoneme classifications to obtain a plurality of basic mouth shapes;
a phoneme extraction unit (104) for inputting an audio stream, extracting an audio frame of the audio stream, and identifying phonemes of the audio frame;
a basic mouth shape determining unit (106) for determining the phoneme classification corresponding to the phoneme of the audio frame from the plurality of phoneme classifications, and selecting the basic mouth shape corresponding thereto;
a mouth shape synthesis unit (108) for synthesizing the selected base mouth shape into a corresponding mouth shape of the audio frame.
8. A computer-readable storage medium, having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-6.
9. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-6.
CN202210050718.6A 2022-01-17 2022-01-17 Method and device for simulating virtual character speaking Pending CN114359450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210050718.6A CN114359450A (en) 2022-01-17 2022-01-17 Method and device for simulating virtual character speaking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210050718.6A CN114359450A (en) 2022-01-17 2022-01-17 Method and device for simulating virtual character speaking

Publications (1)

Publication Number Publication Date
CN114359450A true CN114359450A (en) 2022-04-15

Family

ID=81092194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210050718.6A Pending CN114359450A (en) 2022-01-17 2022-01-17 Method and device for simulating virtual character speaking

Country Status (1)

Country Link
CN (1) CN114359450A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050083A (en) * 2022-08-15 2022-09-13 南京硅基智能科技有限公司 Mouth shape correcting model, training of model and application method of model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447474A * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 Modeling and control method for synchronizing virtual character speech and mouth shape
CN109215631A (en) * 2017-07-05 2019-01-15 松下知识产权经营株式会社 Audio recognition method, program, speech recognition equipment and robot
CN109377540A (en) * 2018-09-30 2019-02-22 网易(杭州)网络有限公司 Synthetic method, device, storage medium, processor and the terminal of FA Facial Animation
CN111260761A (en) * 2020-01-15 2020-06-09 北京猿力未来科技有限公司 Method and device for generating mouth shape of animation character
CN111698552A (en) * 2020-05-15 2020-09-22 完美世界(北京)软件科技发展有限公司 Video resource generation method and device
CN112734889A (en) * 2021-02-19 2021-04-30 北京中科深智科技有限公司 Mouth shape animation real-time driving method and system for 2D character
CN113763518A (en) * 2021-09-09 2021-12-07 北京顺天立安科技有限公司 Multi-mode infinite expression synthesis method and device based on virtual digital human
CN113781610A (en) * 2021-06-28 2021-12-10 武汉大学 Virtual face generation method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215631A (en) * 2017-07-05 2019-01-15 松下知识产权经营株式会社 Audio recognition method, program, speech recognition equipment and robot
CN108447474A * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 Modeling and control method for synchronizing virtual character speech and mouth shape
CN109377540A (en) * 2018-09-30 2019-02-22 网易(杭州)网络有限公司 Synthetic method, device, storage medium, processor and the terminal of FA Facial Animation
CN111260761A (en) * 2020-01-15 2020-06-09 北京猿力未来科技有限公司 Method and device for generating mouth shape of animation character
CN111698552A (en) * 2020-05-15 2020-09-22 完美世界(北京)软件科技发展有限公司 Video resource generation method and device
CN112734889A (en) * 2021-02-19 2021-04-30 北京中科深智科技有限公司 Mouth shape animation real-time driving method and system for 2D character
CN113781610A (en) * 2021-06-28 2021-12-10 武汉大学 Virtual face generation method
CN113763518A (en) * 2021-09-09 2021-12-07 北京顺天立安科技有限公司 Multi-mode infinite expression synthesis method and device based on virtual digital human

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050083A (en) * 2022-08-15 2022-09-13 南京硅基智能科技有限公司 Mouth shape correcting model, training of model and application method of model
US11887403B1 (en) 2022-08-15 2024-01-30 Nanjing Silicon Intelligence Technology Co., Ltd. Mouth shape correction model, and model training and application method

Similar Documents

Publication Publication Date Title
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
US11908451B2 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
CN110136687B (en) Voice training based cloned accent and rhyme method
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
JP2008500573A (en) Method and system for changing messages
WO2022048404A1 (en) End-to-end virtual object animation generation method and apparatus, storage medium, and terminal
CN112185363B (en) Audio processing method and device
JP2020034883A (en) Voice synthesizer and program
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
Salvi et al. SynFace—speech-driven facial animation for virtual speech-reading support
CN115938352A (en) Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium
CN114359450A (en) Method and device for simulating virtual character speaking
CN115312030A (en) Display control method and device of virtual role and electronic equipment
AU2022203531B1 (en) Real-time speech-to-speech generation (rssg) apparatus, method and a system therefore
CN117275485B (en) Audio and video generation method, device, equipment and storage medium
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN116110370A (en) Speech synthesis system and related equipment based on man-machine speech interaction
JP2002229590A (en) Speech recognition system
CN114359443A (en) Method and device for simulating virtual character speaking
Verma et al. Animating expressive faces across languages

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination