CN115797515A - Voice generation and expression driving method, client and server

Info

Publication number
CN115797515A
Authority
CN
China
Prior art keywords
audio data
data
facial expression
information
facial
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211577842.4A
Other languages
Chinese (zh)
Inventor
邵岩
郑航
费元华
郭建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Weiling Times Technology Co Ltd
Original Assignee
Beijing Weiling Times Technology Co Ltd
Application filed by Beijing Weiling Times Technology Co Ltd
Priority to CN202211577842.4A
Publication of CN115797515A
Legal status: Pending (current)


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application relates to the field of data processing, and in particular to a voice generation and expression driving method, a client, and a server. The method includes: acquiring audio data; determining facial expression data of an avatar corresponding to the audio data, wherein the facial expression data of the avatar comprises a plurality of point locations; sending the audio data and the facial expression data of the avatar corresponding to the audio data to the server, so that the server binds each point location of the facial expression data with each point location of the facial skeleton of the avatar; and outputting the facial expression of the avatar while playing the audio data, based on a driving instruction of the server. The method and the device improve the user's experience.

Description

Voice generation and expression driving method, client and server
Technical Field
The present application relates to the field of data processing, and in particular, to a method for generating speech and driving expressions, a client, and a server.
Background
With the development of science and technology, computer speech synthesis has become increasingly mature. Speech synthesized by computer technology almost matches a real person's speed, tone, and pronunciation, so a voice broadcast can sound nearly as if it were spoken by a real person.
However, in the related art, when synthesized speech is combined with a picture, attention is paid only to whether their durations match, so the match between the person in the picture and the synthesized speech is not high. For example, a 10-second picture corresponds to 10 seconds of synthesized speech, but the expression of the person in the picture is monotonous and the mouth merely opens and closes, which may reduce the user's experience.
Disclosure of Invention
In order to solve at least one technical problem, embodiments of the present application provide a speech generation and expression driving method, a client, and a server.
In a first aspect, the present application provides a method for generating speech and driving expressions, which adopts the following technical solutions:
a speech generation and expression driven method, performed by a client, comprising:
acquiring audio data;
determining facial expression data of an avatar corresponding to the audio data, wherein the facial expression data of the avatar comprise a plurality of point positions;
sending the audio data and facial expression data of the virtual image corresponding to the audio data to a server, so that the server binds each point of the facial expression data with each point of a facial skeleton of the virtual image;
and outputting the facial expression of the virtual image while playing the audio data based on the driving instruction of the server.
By adopting the above technical solution, after the client determines the expression data from the audio data, it sends the audio data and the expression data to the server, so that the server binds each point location of the expression data with each point location of the facial skeleton of the avatar. After the server drives the avatar, the client can play the audio data and the corresponding facial expression of the avatar at the same time. Because the expression data is determined from the audio data, the played audio data and the facial expression of the avatar match closely; that is, while the audio data is played, the avatar synchronously displays the corresponding expression. This enriches the facial expressions of the avatar, improves the fluency with which they are played, and thus helps improve the user's experience.
In one possible implementation, the obtaining audio data includes:
acquiring text information, and performing sentence cutting processing on the text information to obtain a sentence set;
analyzing each sentence to obtain first emotion information and/or first sound attribute information corresponding to each sentence;
and generating audio data corresponding to the text information based on the first emotion information and/or the first sound attribute information corresponding to each sentence to obtain the audio data.
By adopting this technical solution, when the text information is converted into audio data, the emotion and sound attribute information of each sentence in the text information is analyzed, and the audio data is then generated according to the emotion and sound characteristics of each sentence. The generated audio is therefore closer to a real situation and sounds more like a real person's voice, which further improves the user's experience.
In a possible implementation manner, before parsing each sentence to obtain the first emotion information and/or the first sound attribute information corresponding to each sentence, the method further includes:
acquiring virtual image information and/or second emotion information input by a user;
the analyzing each sentence to obtain first emotion information and/or first sound attribute information corresponding to each sentence includes:
and analyzing each sentence based on the virtual character information and/or second emotion information input by the user to obtain first emotion information and/or first sound attribute information corresponding to each sentence.
By adopting this technical solution, the avatar information and/or the second emotion information input by the user is acquired, so that the avatar information, the context of each sentence, and/or the emotion specified by the user can be taken into account when each sentence in the text information is parsed. This improves the accuracy of parsing the sentences to obtain the first emotion information and/or the first sound attribute information corresponding to each sentence, and thus further improves the user's experience.
In one possible implementation manner, the determining facial expression data of the avatar corresponding to the audio data includes:
sentence division processing is carried out on the audio data, and third emotion information and/or second sound attribute information corresponding to each audio sentence are/is obtained;
and determining the facial expression data of the virtual image corresponding to each audio statement based on the third emotion information and/or the second sound attribute information corresponding to each audio statement.
By adopting this technical solution, when the facial expression data corresponding to each audio sentence in the audio data is determined, the facial expression of the avatar is not determined only from the pronunciation of the characters; the third emotion information and/or the second sound attribute information corresponding to each audio sentence is also considered. The facial expression data for the avatar reading each audio sentence therefore conforms to the emotion and/or sound attributes of that sentence, which enriches the expressions corresponding to each character's pronunciation, makes the facial expression data more accurate, and further improves the user's experience.
In one possible implementation manner, the determining facial expression data of the avatar corresponding to the audio data includes:
determining mouth movement information corresponding to the audio data;
determining the motion information of other parts of the face based on the mouth motion information corresponding to the audio data;
and determining each point position of facial expression data according to the mouth movement information corresponding to the audio data and the movement information of other parts of the face so as to obtain the facial expression data of the virtual image corresponding to the audio data.
By adopting this technical solution, because the relative positions of the mouth and the other facial parts are fixed, when the mouth of the avatar moves, the other parts change along with it. Once the motion information of the mouth is determined, the motion information of the other parts can also be determined, and the facial expression of the avatar is determined from the positional relationship between the mouth and the other parts. This improves the smoothness of displacement changes across the avatar's face and the accuracy of the determined facial expression data.
In a possible implementation manner, before sending the audio data and the facial expression data of the avatar corresponding to the audio data to a server, the method further includes:
generating a data stream by the audio data and the facial expression data of the virtual image corresponding to the audio data through a specific network protocol;
the sending of the audio data and the facial expression data of the avatar corresponding to the audio data to the server includes:
and sending the data stream to the server.
By adopting this technical solution, the audio data and the facial expression data are packaged into a data stream according to a specific network protocol and the data stream is transmitted to the server, which enables free data transmission between the client and the server and allows the server to drive the facial expression of the avatar.
In a second aspect, the present application further provides another speech generation and expression driving method, which adopts the following technical solution:
a speech generation and expression driving method is executed by a server and comprises the following steps:
acquiring the audio data and facial expression data of an avatar corresponding to the audio data, wherein the facial expression data of the avatar comprises a plurality of point positions;
binding each point of the facial expression data with each point of the facial skeleton of the virtual image;
and controlling the client to drive the facial expression of the virtual image while playing audio data based on the binding relationship.
By adopting the above technical solution, the expression data is bound to the facial skeleton of the avatar, and after driving, the client can play the audio data and the corresponding facial expression of the avatar at the same time. Because the expression data is determined from the audio data, the played audio data and the facial expression of the avatar match closely. Moreover, binding the expression data to each point location in the facial skeleton of the avatar enriches the avatar's facial expressions, improves the fluency with which they are played, and thus helps improve the user's experience.
In one possible implementation manner, before binding each point location of the facial expression data with each point location of the facial skeleton of the avatar, the method further includes:
determining fourth emotion information and/or third sound attribute information corresponding to each sentence in the audio data;
after binding each point location of the facial expression data with each point location in the facial skeleton of the avatar, the method further includes:
based on fourth emotion information and/or third sound attribute information corresponding to each sentence and a binding relationship, determining displacement information of each point in a facial skeleton corresponding to the virtual image;
wherein, based on the binding relationship, controlling the client to drive the facial expression of the avatar while playing audio data, comprises:
and controlling the client to drive the facial expression of the virtual image while playing audio data based on the binding relationship and the displacement information of each point in the facial skeleton.
By adopting this technical solution, the fourth emotion information and/or the third sound attribute information corresponding to the audio data is determined, the facial expression data corresponding to the audio data is determined accordingly, and the facial expression data is then bound with the facial skeleton of the avatar. This makes it convenient to determine the facial expression corresponding to the audio data and to drive the avatar with the audio data, so that when the audio data is played, the corresponding facial expression data is played at the same time. Because the facial expression data is determined from the audio data and bound to the facial skeleton of the avatar, the correspondence between sound and picture is more accurate and audio-visual synchronization is achieved.
In one possible implementation manner, before binding each point location of the facial expression data with each point location of the facial skeleton of the avatar, the method further includes:
acquiring image information containing the virtual image;
identifying each point of the facial skeleton of the virtual image according to the image information containing the virtual image;
wherein, binding each point of facial expression data with each point of facial skeleton of the virtual image comprises:
and binding each point of the facial expression data with each point of the recognized facial skeleton of the virtual image.
By adopting this technical solution, the image information containing the avatar is recognized to obtain each point location of the avatar's facial skeleton, and each point location of the facial expression data is bound with each recognized point location of the facial skeleton. The binding between the facial expression data and the facial skeleton of the avatar is therefore more accurate, so that the facial expression of the avatar is more accurate when the audio data is played, which further improves the user's experience.
In a third aspect, the present application provides a device for speech generation and expression driving from a client perspective, which adopts the following technical solution:
a speech generating and expression driven apparatus comprising:
the audio data acquisition module is used for acquiring audio data;
the facial expression data determining module is used for determining facial expression data of an avatar corresponding to the audio data, and the facial expression data of the avatar comprises a plurality of point positions;
the data sending module is used for sending the audio data and the facial expression data of the virtual image corresponding to the audio data to a server, so that the server binds each point of the facial expression data with each point of a facial skeleton of the virtual image;
and the execution driving module is used for outputting the facial expression of the virtual image while playing the audio data based on the driving instruction of the server.
In a possible implementation manner, the audio data obtaining module is specifically configured to, when obtaining the audio data:
acquiring text information, and performing sentence cutting processing on the text information to obtain a sentence set;
analyzing each statement to obtain first emotion information and/or first sound attribute information corresponding to each statement;
and generating audio data corresponding to the text information based on the first emotion information and/or the first sound attribute information corresponding to each sentence to obtain the audio data.
In one possible implementation, the apparatus further includes:
the obtaining information module is used for obtaining the virtual image information and/or second emotion information input by the user;
the audio data obtaining module is specifically configured to, when analyzing each sentence to obtain first emotion information and/or first sound attribute information corresponding to each sentence:
and analyzing each sentence based on the virtual character information and/or second emotion information input by the user to obtain first emotion information and/or first sound attribute information corresponding to each sentence.
In a possible implementation manner, the facial expression data determining module, when determining the facial expression data of the avatar corresponding to the audio data, is specifically configured to:
sentence division processing is carried out on the audio data, and third emotion information and/or second sound attribute information corresponding to each audio sentence are/is obtained;
and determining the facial expression data of the virtual image corresponding to each audio statement based on the third emotion information and/or the second sound attribute information corresponding to each audio statement.
In a possible implementation manner, when determining facial expression data of an avatar corresponding to the audio data, the facial expression data determining module is specifically configured to:
determining mouth movement information corresponding to the audio data;
determining the motion information of other parts of the face based on the mouth motion information corresponding to the audio data;
and determining each point of facial expression data according to the mouth movement information corresponding to the audio data and the movement information of other parts of the face so as to obtain the facial expression data of the virtual image corresponding to the audio data.
In one possible implementation, the apparatus further includes:
a data stream generation module, configured to generate a data stream according to the audio data and facial expression data of the avatar corresponding to the audio data through a specific network protocol;
when the data sending module sends the audio data and the facial expression data of the virtual image corresponding to the audio data to the server, the data sending module is specifically configured to:
and sending the data stream to the server.
In a fourth aspect, the present application provides a device for speech generation and expression driving from a server side perspective, and adopts the following technical solution:
a speech generating and expression driven apparatus comprising:
the data information module is used for acquiring the audio data and facial expression data of the virtual image corresponding to the audio data, and the facial expression data of the virtual image comprises a plurality of point positions;
the point location binding module is used for binding each point location of the facial expression data with each point location in the facial skeleton of the virtual image;
and the control driving module is used for controlling the client to drive the facial expression of the virtual image while playing the audio data based on the binding relationship.
In one possible implementation, the apparatus further includes:
the determining information module is used for determining fourth emotion information and/or third sound attribute information which respectively correspond to each statement in the audio data;
wherein, the device still includes:
the displacement information determining module is used for determining displacement information of each point position in the facial skeleton corresponding to the virtual image based on fourth emotion information and/or third sound attribute information and the binding relation respectively corresponding to each sentence;
the control driving module, when controlling the client to drive the facial expression of the avatar while playing audio data based on the binding relationship, is specifically configured to:
and controlling the client to drive the facial expression of the virtual image while playing audio data based on the binding relationship and the displacement information of each point in the facial skeleton.
In one possible implementation, the apparatus further includes:
the image information acquisition module is used for acquiring image information containing the virtual image;
the point identification module is used for identifying each point of the facial skeleton of the virtual image according to the image information containing the virtual image;
the point location binding module, when binding each point location of the facial expression data with each point location in the facial skeleton of the avatar, is specifically configured to:
and binding each point location of the facial expression data with each point location in the identified facial skeleton of the virtual image.
In a fifth aspect, the present application provides a client, which adopts the following technical solution:
a client, the client comprising:
at least one processor;
a memory;
at least one application, wherein the at least one application is stored in the memory and configured to be executed by the at least one processor, the at least one application being configured to perform the speech generation and expression driving method according to the first aspect.
In a sixth aspect, the present application provides a server, which adopts the following technical solution:
a server, the server comprising:
at least one processor;
a memory;
at least one application, wherein the at least one application is stored in the memory and configured to be executed by the at least one processor, the at least one application being configured to perform the speech generation and expression driving method according to the second aspect.
In a seventh aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium storing a computer program that can be loaded by a processor to perform the speech generation and expression driving method according to the first aspect and/or the second aspect.
In summary, the present application includes at least one of the following beneficial technical effects:
1. After the client determines the expression data from the audio data, it sends the audio data and the expression data to the server, so that the server binds each point location of the expression data with each point location in the facial skeleton of the avatar. After being driven by the server, the client can play the audio data and the corresponding facial expression of the avatar at the same time. Because the expression data is determined from the audio data, the played audio data and the facial expression of the avatar match closely; that is, while the audio data is played, the avatar synchronously displays the corresponding expression. This enriches the avatar's facial expressions, improves the fluency with which they are played, and thus helps improve the user's experience.
2. The expression data is bound to the facial skeleton of the avatar, and after driving, the client can play the audio data and the corresponding facial expression of the avatar at the same time. Because the expression data is determined from the audio data, the played audio data and the facial expression of the avatar match closely. Moreover, binding the expression data to each point location in the facial skeleton of the avatar enriches the avatar's facial expressions, improves the fluency with which they are played, and thus helps improve the user's experience.
Drawings
FIG. 1 is a schematic flow chart of a speech generation and expression driving method in an embodiment of the present application;
FIG. 2 is a schematic flow chart of another speech generation and expression driving method in an embodiment of the present application;
FIG. 3 is a diagram of a scenario in which a client interacts with a server in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a speech generation and expression driving apparatus on the client side in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a speech generation and expression driving apparatus on the server side in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a client in an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to figures 1-6.
After reading this specification, a person skilled in the art may make modifications to the embodiments as needed without making an inventive contribution, but such modifications are protected by patent law only within the scope of the claims of the present application.
Since the birth of computer-based speech synthesis systems, statistical machine learning synthesis represented by hidden Markov models has been the mainstream, and neural network speech synthesis has developed rapidly. Computer speech synthesis technology can now rival the voice of a real person and is at the stage of large-scale commercialization. AI voice can automatically generate natural speech matching the intonation and emotion of a human voice from custom text, supports detailed customization, and can easily adjust the speed, tone, pronunciation, and pauses of the speech to optimize the output. However, audio and expression output in this field currently lacks a corresponding hyper-realistic avatar, which limits the application scenarios of the technology.
In order to solve the above technical problems, embodiments of the present application provide a voice generation and expression driving method, which can integrate AI voice generation technology and bind it to a hyper-realistic digital character, satisfying scenarios such as film and television, advertising, live broadcast, and virtual shooting. In order to make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
In addition, the term "and/or" herein only describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship, unless otherwise specified.
The embodiment of the application provides a voice generation and expression driving method, which is executed by a client, where the client may be a terminal device, and the terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like, but is not limited thereto, and the embodiment of the application is not limited herein.
Referring to fig. 1, fig. 1 is a schematic flowchart of a speech generation and expression driving method in an embodiment of the present application, which is executed by a client, and the method may include step S110, step S120, step S130, and step S140, where:
step S110: audio data is acquired.
Specifically, the audio data is digitized sound data and includes voice information such as intonation and emotion of a human voice. The audio data may be input by a user, automatically captured from related web pages by the client using web crawler technology, obtained by the client from local storage, or obtained by converting text information into audio; the embodiment of the present application is not limited in this respect.
Step S120: and determining the facial expression data of the virtual image corresponding to the audio data, wherein the facial expression data of the virtual image comprises a plurality of point positions.
Specifically, the avatar may be an animated character, a cartoon character, or a virtual person, and the specific avatar may be selected by a user or pre-designated by the system. The facial expression data includes a plurality of point locations; for example, it may include 55 point locations, as shown in Table 1:
TABLE 1
1 Left eye blink EyeBlinkLeft
2 Left eye looks down EyeLookDownLeft
3 Left eye looks toward the nose tip EyeLookInLeft
4 Left eye looks outward EyeLookOutLeft
5 Left eye looks up EyeLookUpLeft
6 Left eye squint EyeSquintLeft
7 Left eye wide open EyeWideLeft
8 Right eye blink EyeBlinkRight
9 Right eye looks down EyeLookDownRight
10 Right eye looks toward the nose tip EyeLookInRight
11 Right eye looks outward EyeLookOutRight
12 Right eye looks up EyeLookUpRight
13 Right eye squint EyeSquintRight
14 Right eye wide open EyeWideRight
15 Jaw pushed forward with the mouth closed JawForward
16 Jaw moves to the left JawLeft
17 Jaw moves to the right JawRight
18 Jaw moves down as the mouth opens JawOpen
19 Mouth closed MouthClose
20 Mouth slightly open with lips funneled MouthFunnel
21 Lips puckered MouthPucker
22 Mouth shifts to the left MouthLeft
23 Mouth shifts to the right MouthRight
24 Left corner of the mouth smiles MouthSmileLeft
25 Right corner of the mouth smiles MouthSmileRight
26 Left corner of the mouth pulled down MouthFrownLeft
27 Right corner of the mouth pulled down MouthFrownRight
28 Left corner of the mouth pulled back MouthDimpleLeft
29 Right corner of the mouth pulled back MouthDimpleRight
30 Left corner of the mouth stretched to the left MouthStretchLeft
31 Right corner of the mouth stretched to the right MouthStretchRight
32 Lower lip rolled inward MouthRollLower
33 Upper lip rolled inward MouthRollUpper
34 Lower lip pushed downward MouthShrugLower
35 Upper lip pushed upward MouthShrugUpper
36 Lips pressed together on the left MouthPressLeft
37 Lips pressed together on the right MouthPressRight
38 Lower lip pressed down on the left MouthLowerDownLeft
39 Lower lip pressed down on the right MouthLowerDownRight
40 Upper lip raised on the left MouthUpperUpLeft
41 Upper lip raised on the right MouthUpperUpRight
42 Left eyebrow lowered BrowDownLeft
43 Right eyebrow lowered BrowDownRight
44 Inner eyebrows raised (frown) BrowInnerUp
45 Outer left eyebrow raised BrowOuterUpLeft
46 Outer right eyebrow raised BrowOuterUpRight
47 Cheeks puffed outward CheekPuff
48 Left cheek raised CheekSquintLeft
49 Right cheek raised CheekSquintRight
50 Left nostril sneer NoseSneerLeft
51 Right nostril sneer NoseSneerRight
52 Tongue out TongueOut
53 Head roll HeadRoll
54 Left eye roll LeftEyeRoll
55 Right eye roll RightEyeRoll
Each point location is represented by a value from 0 to 1, and the size of the value represents the motion amplitude of that point, for example the amplitude of a left-eye blink, of the left eye looking down, or of the left eye looking toward the nose tip. When the value of a point location is 0, there is no expression change at that point.
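For illustration only, a single frame of such facial expression data could be represented as a mapping from point-location names to amplitudes in [0, 1]; the Python sketch below is an assumption for exposition, not part of the patent text.
```python
from dataclasses import dataclass, field
from typing import Dict

# Names follow the 55 point locations listed in Table 1 (ARKit-style blendshapes
# plus head and eye roll); the list is truncated here for brevity.
POINT_NAMES = [
    "EyeBlinkLeft", "EyeLookDownLeft", "EyeLookInLeft", "EyeLookOutLeft",
    "EyeLookUpLeft", "EyeSquintLeft", "EyeWideLeft",
    # ... remaining names through "HeadRoll", "LeftEyeRoll", "RightEyeRoll"
]

@dataclass
class ExpressionFrame:
    """One frame of facial expression data: point name -> amplitude in [0, 1]."""
    timestamp_ms: int
    points: Dict[str, float] = field(default_factory=dict)

    def set_point(self, name: str, amplitude: float) -> None:
        # Clamp to the 0..1 range described above; 0 means no expression change.
        self.points[name] = max(0.0, min(1.0, amplitude))

# Example: a frame where the avatar blinks the left eye halfway.
frame = ExpressionFrame(timestamp_ms=120)
frame.set_point("EyeBlinkLeft", 0.5)
```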
Because different characters correspond to different pronunciations when the avatar broadcasts speech, the mouth changes accordingly as each character is pronounced, and because the facial bones are linked, the bones of the other parts change after the mouth bones move. Therefore, when determining the facial expression data, the mouth motion state of the avatar can be determined first from the audio data, and the motion states of the other parts can then be determined from the mouth motion state to form the facial expression data.
Step S130: and sending the audio data and the facial expression data of the virtual image corresponding to the audio data to the server, so that the server binds each point of the facial expression data with each point of the facial skeleton of the virtual image.
Specifically, the client and the server can exchange information by data transmission. When the client sends the audio data and the facial expression data to the server, it can package them according to a transmission protocol and transmit the packaged data to the server, so that the server receives the complete audio data and facial expression data. After receiving the facial expression data, the server binds the facial expression data with the facial skeleton point locations of the avatar.
Step S140: and outputting the facial expression of the virtual image while playing the audio data based on the driving instruction of the server.
Specifically, the driving instruction is used for controlling the audio data and the facial expressions of the virtual image to be played at the client side at the same time, so that when the virtual image is subjected to voice broadcasting, the audio and the facial expressions correspond to each other.
In the embodiment of the present application, after the client determines the expression data from the audio data, it sends the audio data and the expression data to the server, so that the server binds each point location of the expression data with each point location of the facial skeleton of the avatar. After being driven by the server, the client can play the audio data and the corresponding facial expression of the avatar at the same time. Because the expression data is determined from the audio data, the played audio data and the facial expression of the avatar match closely; that is, while the audio data is played, the avatar synchronously displays the corresponding expression, which enriches the avatar's facial expressions, improves the fluency with which they are played, and helps improve the user's experience.
Specifically, in the embodiment of the present application, the acquiring of the audio data in step S110 may specifically include step S1101 (not shown in the drawings), step S1102 (not shown in the drawings), and step S1103 (not shown in the drawings), where:
step S1101: and acquiring text information, and performing sentence cutting processing on the text information to obtain a sentence set.
Specifically, the text information may be content to be broadcast, such as a policy announcement or a product introduction; the specific content of the text information is not limited in the embodiments of the present application. The text information may be input by a user, retrieved from pre-stored content, or captured from a web page.
After the text information is acquired, sentence cutting is performed on it, that is, the text information is decomposed into a plurality of sentences. When cutting sentences, punctuation marks in the text information can be recognized, semantic recognition can be performed and its result used to cut the sentences, or sentence cutting can be performed with regular expressions; the specific manner of sentence cutting is not limited in the embodiments of the present application. After sentence cutting, the resulting sentences form a sentence set.
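A minimal sketch of punctuation-based sentence cutting with a regular expression (the pattern and function name are illustrative assumptions, not prescribed by this application):
```python
import re

def cut_sentences(text: str) -> list:
    """Split text into sentences on common Chinese/Western end punctuation."""
    # Keep each delimiter attached to its sentence, then drop empty fragments.
    parts = re.split(r"(?<=[。！？!?；;.])", text)
    return [p.strip() for p in parts if p.strip()]

sentences = cut_sentences("谢谢你的帮助。为什么你总是出错？")
# -> ['谢谢你的帮助。', '为什么你总是出错？']
```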
Step S1102: and analyzing each sentence to obtain first emotion information and/or first sound attribute information corresponding to each sentence.
Specifically, the first emotion information may be information capable of expressing the emotion of a real person, such as happiness or anger. The first sound attribute information includes the tone, intonation, and sound amplitude of the utterance, where the tone is the form the voice takes for a specific sentence under a certain emotion; the intonation is the manner of speaking, that is, the configuration and variation of pitch, speed, and stress within a sentence; and the sound amplitude is the loudness of the voice.
For example, when the sentence is "thank you", parsing the sentence may yield happiness as the emotion information, with a softer tone in the corresponding first sound attribute information and a sound amplitude of about 10 dB; when the sentence is "why do you always make mistakes", the first emotion information may be anger, with a harsher tone in the corresponding sound attribute information and a sound amplitude of about 100 dB.
In order to make the converted audio data better meet the user's requirements, that is, better fit the avatar that outputs the audio data and/or the emotion information required by the user, before parsing each sentence to obtain the first emotion information and/or the first sound attribute information corresponding to each sentence, the method may further include:
and acquiring the avatar information and/or second emotion information input by the user.
In the embodiment of the present application, the avatar information may be directly input by the user, selected by the user through a selection operation, or preset; for example, it may be the information of a girl avatar, a male avatar, or an animal avatar.
Furthermore, the avatar information may also include vocal information of the avatar, such as its timbre, frequency, and speaking habits, which makes it convenient to determine the avatar's character from the avatar information and to determine the audio data corresponding to the text from that character, so that the audio data better fits the avatar. The pronunciation requirement for each sentence may also be input by the user, who can constrain it by gender, age, emotion, state, and character so that the generated audio data better meets the user's needs; for example, gender may be male or female, age may be elderly, middle-aged, young, or juvenile, emotion may be happiness, anger, sadness, or joy, state may include walking or running, and character may range from gentle and soft-spoken to rough.
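For exposition only, such user-supplied constraints might be carried as a small record alongside each sentence; the field names below are hypothetical and not defined by this application.
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceRequirement:
    """User-specified constraints on pronunciation for a sentence or a whole text."""
    gender: Optional[str] = None      # e.g. "male" or "female"
    age: Optional[str] = None         # e.g. "elderly", "middle-aged", "young", "juvenile"
    emotion: Optional[str] = None     # second emotion information, e.g. "happiness"
    state: Optional[str] = None       # e.g. "walking", "running"
    character: Optional[str] = None   # e.g. "gentle", "rough"

requirement = VoiceRequirement(gender="female", age="young", emotion="happiness")
```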
Further, after the avatar information and/or the second emotion information input by the user is obtained, parsing each sentence to obtain the emotion information and/or the sound attribute information corresponding to each sentence may specifically include:
and analyzing each sentence based on the virtual character information and/or the second emotion information input by the user to obtain first emotion information and/or first sound attribute information corresponding to each sentence.
Specifically, each sentence is analyzed based on the avatar and/or the second emotion information input by the user, that is, the sentence analysis result is optimized according to the avatar and/or the second emotion information input by the user.
After a sentence obtained by sentence cutting is parsed, the first emotion information obtained may be the same as or different from the second emotion information input by the user. For example, when the sentence is "why do you always make mistakes", parsing may determine that the first emotion information corresponding to the sentence is anger, while the second emotion information input by the user is happiness, so the two differ. In this case, in order to improve the realism of the sentence after it is converted into audio, the context of the current sentence can be analyzed to determine the final emotion information, and that final emotion information is taken as the first emotion information.
Step S1103: and generating audio data corresponding to the text information based on the first emotion information and/or the first sound attribute information corresponding to each sentence to obtain the audio data.
Specifically, when generating audio data from the sentences, the sentences can be converted into audio with a text-to-speech (TTS) tool or another text-to-audio method, and the first emotion information and/or the first sound attribute information corresponding to each sentence is taken into account during conversion, so that each piece of converted audio data carries the corresponding emotion information and/or sound attribute information.
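A sketch of this per-sentence generation step, assuming a hypothetical synthesize() function standing in for whatever TTS engine is used; the function and its parameters are assumptions, not specified by this application.
```python
from typing import Iterable, Optional

def synthesize(text: str, emotion: Optional[str], tone: Optional[str],
               amplitude_db: Optional[float]) -> bytes:
    """Hypothetical TTS call for one sentence; a real implementation would
    delegate to an actual text-to-speech engine."""
    raise NotImplementedError

def generate_audio(sentences: Iterable[dict]) -> bytes:
    """Concatenate per-sentence audio, honoring each sentence's emotion and
    sound attribute information."""
    chunks = []
    for s in sentences:
        chunks.append(synthesize(s["text"], s.get("emotion"),
                                 s.get("tone"), s.get("amplitude_db")))
    return b"".join(chunks)
```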
Further, after obtaining the audio data, determining facial expression data of the avatar corresponding to the audio data may specifically include step S1201 (not shown in the drawings) and step S1202 (not shown in the drawings), where:
step S1201: and performing sentence division processing on the audio data, and acquiring third emotion information and/or second sound attribute information corresponding to each audio sentence.
Specifically, when determining each audio sentence from the audio data, speech recognition can be performed on the audio data to obtain the pause positions in the audio, the audio data is divided into sentences at those pause positions, and semantic recognition is then performed on each sentence to determine the third emotion information and/or the second sound attribute information corresponding to it. The first emotion information and the first sound attribute information are obtained by cutting the text into sentences and parsing each resulting sentence; the second emotion information is input by the user and may be the same as or different from the first emotion information. Because the audio data generated from text information is generated according to the determined first emotion information and first sound attribute information, if the audio data was converted from text information, the third emotion information is the same as the first emotion information and the second sound attribute information is the same as the first sound attribute information; if the audio data was not converted from text information, the third emotion information may differ from the first emotion information and the second sound attribute information may differ from the first sound attribute information.
In addition, if text information is converted into audio data by text-to-audio conversion, the text information is cut into sentences, and after the first emotion information and/or the first sound attribute information corresponding to each of the resulting sentences is determined, it is bound to that sentence. After the sentence is converted into audio data, the first emotion information and/or first sound attribute information corresponding to the sentence can also be bound to the corresponding audio sentence, so that when the emotion information and/or sound attribute information corresponding to each audio sentence is acquired, the third emotion information and/or second sound attribute information of that audio sentence is obtained.
Step S1202: and determining facial expression data of the virtual image corresponding to each audio statement based on the third emotion information and/or the second sound attribute information corresponding to each audio statement.
Specifically, the facial expression data is the facial expression the avatar shows when broadcasting the audio data. Because each character is pronounced differently during broadcasting, the contraction and expansion of the facial muscles differ from character to character; for example, when the character "o" is broadcast, the mouth opens and the facial muscles stretch. However, different emotion information and/or sound attribute information also leads to different facial expressions: when the third emotion information is happiness, the facial expression data includes, in addition to the opened mouth and stretched facial muscles corresponding to the character's pronunciation, relaxed facial muscles and upturned mouth corners; when the emotion information is anger, it includes contracted facial muscles and downturned mouth corners in addition to the opened mouth and stretched muscles. The facial expression data corresponding to each audio sentence is then determined from the facial expression data corresponding to each of its characters.
In the embodiment of the present application, the facial expressions corresponding to the pronunciation of each character are enriched through the third emotion information and/or the second sound attribute information of each audio sentence, rather than being determined only from the pronunciation of the characters; considering the emotion information and/or sound attribute information conveniently increases the number of facial expressions and thus enriches them.
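For illustration only, this combination could be sketched as a per-character mouth shape merged with an emotion-dependent adjustment of other point locations; all names and values below are assumptions, not taken from the patent text.
```python
# Assumed emotion adjustments applied on top of the mouth shape implied by a
# character's pronunciation.
EMOTION_OFFSETS = {
    "happiness": {"MouthSmileLeft": 0.6, "MouthSmileRight": 0.6,
                  "CheekSquintLeft": 0.3, "CheekSquintRight": 0.3},
    "anger": {"MouthFrownLeft": 0.5, "MouthFrownRight": 0.5,
              "BrowDownLeft": 0.7, "BrowDownRight": 0.7},
}

def expression_for(char_mouth: dict, emotion: str) -> dict:
    """Merge the character's mouth point locations with the emotion offsets."""
    points = dict(char_mouth)
    for name, amp in EMOTION_OFFSETS.get(emotion, {}).items():
        points[name] = min(1.0, points.get(name, 0.0) + amp)
    return points

# e.g. the character "o": jaw open, lips funneled, spoken happily
print(expression_for({"JawOpen": 0.8, "MouthFunnel": 0.5}, "happiness"))
```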
Further, since a sentence corresponds to a plurality of expressions, the state of the mouth differs as each character is spoken, and the state of the mouth affects the state of the other parts of the face. Specifically, determining the facial expression data of the avatar corresponding to the audio data may include step S1 (not shown in the drawings), step S2 (not shown in the drawings), and step S3 (not shown in the drawings), where:
step S1: and determining mouth movement information corresponding to the audio data.
Specifically, the mouth motion information may include the displacement produced by the mouth. When determining the mouth motion information corresponding to the audio data, after the characters in the audio data are recognized, the mouth motion process corresponding to each character is determined according to a pronunciation rule, where the pronunciation rule is the mouth shape corresponding to each character when it is read correctly. A mouth image before each character is voiced and a mouth image after it is voiced are recorded, the boundary points of the mouth features are determined in each image and labeled to form a pre-voicing boundary point image and a post-voicing boundary point image, both images are placed into a pre-established coordinate system, the coordinates of each boundary point in the two images are determined, and the displacement of each boundary point is computed to obtain the mouth motion information corresponding to each character.
Furthermore, the mouth motion information corresponding to the audio data can also be determined from the audio data through a network model. Other ways of determining the mouth motion information corresponding to the audio data may also be used; the embodiments of the present application are not limited in this respect.
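A minimal sketch of the boundary-point displacement computation described above, assuming the pre-voicing and post-voicing boundary points have already been extracted into a shared coordinate system (the data layout and point names are assumptions):
```python
import math

def boundary_displacements(before: dict, after: dict) -> dict:
    """Per-boundary-point displacement (dx, dy, magnitude) between the mouth
    image recorded before a character is voiced and the one recorded after."""
    moves = {}
    for name, (x0, y0) in before.items():
        x1, y1 = after.get(name, (x0, y0))  # treat a missing point as unmoved
        dx, dy = x1 - x0, y1 - y0
        moves[name] = (dx, dy, math.hypot(dx, dy))
    return moves

before = {"upper_lip_mid": (0.0, 1.0), "lower_lip_mid": (0.0, -1.0)}
after = {"upper_lip_mid": (0.0, 1.4), "lower_lip_mid": (0.0, -2.2)}
print(boundary_displacements(before, after))
```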
Step S2: and determining the motion information of other parts of the face based on the mouth motion information corresponding to the audio data.
Specifically, facial expression means that various emotional states are expressed through changes of the eye muscles, face muscles, and mouth muscles, and muscle movement follows the movement of the bones: the movement of the eye muscles corresponds to the movement of the eye bones, and the movement of the mouth muscles corresponds to the movement of the mouth bones.
When determining the motion information of the other parts, the fixed distances between the mouth and the other parts can be determined from the mouth features; when the mouth changes, the displacements of the other parts can be determined from the corresponding displacement of the mouth and those fixed distances.
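Sketching this propagation step under the assumption that each other facial part inherits a scaled copy of the mouth displacement; the part names and coupling factors below are illustrative, not taken from this application.
```python
def propagate_motion(mouth_displacement: tuple, coupling: dict) -> dict:
    """Estimate each facial part's displacement from the mouth displacement,
    scaled by an assumed per-part coupling factor."""
    dx, dy = mouth_displacement
    return {part: (dx * k, dy * k) for part, k in coupling.items()}

coupling = {"left_cheek": 0.3, "right_cheek": 0.3, "chin": 0.8, "nose_tip": 0.1}
print(propagate_motion((0.0, -1.2), coupling))
```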
And step S3: and determining each point of the facial expression data according to the mouth movement information corresponding to the audio data and the movement information of other parts of the face so as to obtain the facial expression data of the virtual image corresponding to the audio data.
Specifically, because the facial bones are linked, when the mouth bones move as a result of pronunciation, the facial bones of the other parts also change accordingly. When determining the changes of the other parts from the change of the mouth bones, the amplitude and direction of the mouth bones' motion can be recorded, and the amplitude and direction of the other bones' motion can be determined from the distances between the bones.
Each point location in the facial expression data is used to represent the facial expression of the avatar. For example, when one character in the audio data is broadcast, the corresponding facial expression may include a left-eye blink, a right-eye blink, the left lip corner pulled down, the right lip corner pulled down, and so on. Each character in the audio data corresponds to at least one facial expression, so the facial expression data of the avatar corresponding to the audio data can be determined from the facial expression data corresponding to each of the characters.
In the embodiment of the present application, because the relative positions of the mouth and the other parts are fixed, when the mouth of the avatar moves, the other parts change along with it, and the facial expression of the avatar is determined from the positional relationship between the mouth and the other parts. This improves the smoothness of displacement changes across the avatar's face and the accuracy of the determined facial expression data.
Further, after the audio data and the expression data corresponding to the audio data are obtained, that is, before the audio data and the facial expression data of the avatar corresponding to the audio data are sent to the server, step Sa (not shown in the figures) is performed, where:
step Sa: and generating a data stream by the audio data and the facial expression data of the virtual image corresponding to the audio data through a specific network protocol.
Specifically, a network protocol is a set of rules, standards, or conventions established for exchanging data over a computer network. For example, a client and a server in a network need to communicate, but because the character sets they use may differ, the server might not be able to recognize the audio data and facial expression data transmitted by the client. To enable them to communicate, the client and the server can be required to convert characters from their respective character sets into a standard character set before communicating; for example, before the client transmits the audio data and the facial expression data to the server, it simulates a specific network protocol to generate the corresponding data stream.
For example, if the data format used by the client before sending is not consistent with the data format the server can receive, say the client uses 55 point locations while the server uses 52, the data needs to be format-converted according to the specific network protocol into a data stream before being forwarded.
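A sketch of such a format conversion and packing step, assuming the server simply omits three of the 55 point locations and the stream is serialized as JSON; both assumptions are illustrative only and not taken from the patent text.
```python
import base64
import json

SERVER_UNSUPPORTED = {"HeadRoll", "LeftEyeRoll", "RightEyeRoll"}  # assumed difference

def to_data_stream(audio: bytes, frames: list) -> bytes:
    """Convert 55-point frames to the assumed 52-point server format and pack
    the audio plus expression frames into one serialized data stream."""
    converted = [
        {"timestamp_ms": f["timestamp_ms"],
         "points": {k: v for k, v in f["points"].items() if k not in SERVER_UNSUPPORTED}}
        for f in frames
    ]
    payload = {"audio_b64": base64.b64encode(audio).decode("ascii"),
               "expression_frames": converted}
    return json.dumps(payload).encode("utf-8")

stream = to_data_stream(b"\x00\x01",
                        [{"timestamp_ms": 0, "points": {"JawOpen": 0.7, "HeadRoll": 0.1}}])
```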
Further, in connection with the above embodiment, the data stream is generated from the audio data and the facial expression data of the avatar corresponding to the audio data, so sending the audio data and the facial expression data of the avatar corresponding to the audio data to the server may specifically include: transmitting the data stream to the server.
For the embodiment of the application, the audio data and the facial expression data are generated into the data stream by simulating the specific network protocol, and the data stream is transmitted to the server, so that the free data transmission between the client and the server is realized, and the server can drive the facial expression of the virtual image.
Referring to fig. 2, fig. 2 is a schematic flow chart of another speech generation and expression driving method in the embodiment of the present application, which is executed by a server and may include step S210, step S220, and step S230, where:
step S210: and acquiring audio data and facial expression data of the virtual image corresponding to the audio data, wherein the facial expression data of the virtual image comprises a plurality of point positions.
For the embodiment of the application, the server receives the data stream sent by the client, and obtains the audio data and the facial expression data of the virtual image corresponding to the audio data by analyzing the data stream.
Step S220: and binding each point of the facial expression data with each point of the facial skeleton of the virtual image.
Specifically, a face image of the avatar is obtained, skeleton point location identification is performed on it, and each point location of the facial skeleton of the avatar is determined. A facial skeleton point location of the avatar is the point on a preset avatar facial skeleton model that corresponds to a facial point of the user. For example, by establishing corresponding coordinate systems for the user's face and for the avatar facial skeleton model, the real-time coordinates of the user's facial feature points can be converted into real-time coordinates of the avatar's facial skeleton feature points, thereby obtaining the position data of the avatar's facial skeleton feature points; the present application does not limit how this position data is obtained.
Since the facial expression data comprises a plurality of point locations, binding each point location of the facial expression data with each point location of the facial skeleton of the avatar means binding each point location of the facial expression data only with the corresponding point location of the facial skeleton of the avatar.
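A minimal sketch of such a binding, assuming the expression points and skeleton points can be matched by name and that the two coordinate systems differ only by scale and origin (both assumptions for illustration):

```python
# Minimal binding sketch: each expression-data point is bound to the skeleton
# point with the same name, and coordinates are mapped between the two systems.
# Point names, scale, and origin are assumptions for illustration.

def make_binding(expression_point_names, skeleton_point_names):
    """Bind expression points to skeleton points that share the same name."""
    skeleton_index = {name: i for i, name in enumerate(skeleton_point_names)}
    return {
        expr_index: skeleton_index[name]
        for expr_index, name in enumerate(expression_point_names)
        if name in skeleton_index  # unmatched points are simply left unbound
    }


def to_skeleton_coords(point, scale=(1.0, 1.0), origin=(0.0, 0.0)):
    """Map a point from the expression-data coordinate system to the skeleton
    model's coordinate system, assumed to differ only by scale and origin."""
    (x, y), (sx, sy), (ox, oy) = point, scale, origin
    return (ox + sx * x, oy + sy * y)


if __name__ == "__main__":
    binding = make_binding(["mouth_l", "mouth_r", "brow_l"],
                           ["brow_l", "mouth_l", "mouth_r", "jaw"])
    print(binding)                 # {0: 1, 1: 2, 2: 0}
    print(to_skeleton_coords((0.5, -0.2), scale=(2.0, 2.0)))
```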
Step S230: and controlling the client to drive the facial expression of the virtual image while playing the audio data based on the binding relationship.
Specifically, the binding relationship is the correspondence between the point locations in the facial expression data and the point locations in the facial skeleton of the avatar. Because the facial expression data is determined from the audio data, driving the facial expression of the avatar while the audio data is played makes it convenient to achieve sound and picture synchronization.
For the embodiment of the application, the expression data is bound to the facial skeleton of the avatar, so the client can play the audio data and present the driven facial expression of the avatar at the same time. Binding the expression data to each point location in the facial skeleton of the avatar helps to enrich the facial expressions of the avatar, improves the smoothness with which they are played, and thereby improves the user experience.
Further, the binding of each point location of the facial expression data with each point location of the facial skeleton of the avatar further comprises step S2a (not shown in the drawings), wherein:
step S2a: and determining fourth emotion information and/or third sound attribute information corresponding to each sentence in the audio data.
For the embodiment of the present application, the fourth emotion information and/or the third sound attribute information corresponding to each sentence in the audio data may be carried in the audio data sent by the client; that is, the fourth emotion information may be the first emotion information or the third emotion information described in the above embodiments, and the third sound attribute information may be the first sound attribute information or the second sound attribute information described in the above embodiments. Alternatively, the server may obtain them by analyzing the audio data after receiving it; the manner in which the server analyzes the audio data to obtain the fourth emotion information and/or the third sound attribute information corresponding to each sentence is detailed in the above embodiments and is not repeated here.
Further, after step S220 binds each point location of the facial expression data with each point location of the facial skeleton of the avatar, step S2b (not shown in the drawings) is performed, in which:
Step S2b: determining displacement information of each point location in the facial skeleton of the avatar based on the fourth emotion information and/or the third sound attribute information respectively corresponding to each sentence and on the binding relationship.
Specifically, the fourth emotion information and/or the third sound attribute information corresponding to each sentence may affect the displacement information of each point location in the facial skeleton of the avatar. In this embodiment of the present application, after the emotion information and/or the sound attribute information corresponding to each sentence is obtained, the displacement information of each point location in the facial skeleton of the avatar may be determined through a network model, or in other manners, which is not limited in this embodiment of the present application.
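As one possible sketch of this step, the per-point displacements can be scaled by sentence-level emotion and sound-attribute information before being applied through the binding relationship; the emotion gains and the volume cap below are assumptions, since the application leaves the concrete model (for example a network model) open:

```python
# Sketch of step S2b under stated assumptions: sentence-level emotion and
# sound-attribute (volume) information scale the per-point displacements
# before they are applied through the binding relationship. The gains and the
# volume cap are hypothetical values.

EMOTION_GAIN = {"neutral": 1.0, "happy": 1.3, "angry": 1.5, "sad": 0.7}


def skeleton_displacements(frame_points, binding, emotion="neutral", volume=1.0):
    """Return {skeleton_point_index: (dx, dy)} for one expression frame."""
    gain = EMOTION_GAIN.get(emotion, 1.0) * min(volume, 2.0)  # cap loud audio
    displacements = {}
    for expr_index, (dx, dy) in enumerate(frame_points):
        if expr_index in binding:
            displacements[binding[expr_index]] = (gain * dx, gain * dy)
    return displacements
```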
Further, after the binding relationship and the displacement information of each point location in the facial skeleton of the avatar are obtained through the above embodiments, controlling the client, based on the binding relationship, to drive the facial expression of the avatar while playing the audio data may specifically include:
and controlling the client to drive the facial expression of the virtual image while playing the audio data based on the binding relationship and the displacement information of each point in the facial skeleton.
Specifically, for example, if the total length of the audio data is 30 seconds and there are 1800 frames of facial expression data corresponding to it, each frame having its corresponding point locations, then 60 frames of facial expression data are played every second. In this way the audio data and the facial expression data correspond to each other during playback, so that sound and picture synchronization can be achieved.
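A minimal sketch of this sound and picture synchronization, deriving the expression frame index from the audio playback clock at 60 frames per second; the render callback and the shortened demo duration are placeholders rather than part of the patented method:

```python
# Sketch of sound and picture synchronization: with 60 expression frames per
# second, the frame to display at any moment is derived from the audio playback
# clock, so a 30 s clip maps exactly onto 1800 frames. The render callback and
# the shortened demo duration are placeholders.
import time

FRAMES_PER_SECOND = 60


def play_synchronized(audio_duration_s, expression_frames, render_frame):
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= audio_duration_s:
            break
        index = min(int(elapsed * FRAMES_PER_SECOND), len(expression_frames) - 1)
        render_frame(expression_frames[index])  # drive the avatar's face here
        time.sleep(1.0 / FRAMES_PER_SECOND)


if __name__ == "__main__":
    frames = [f"frame {i}" for i in range(30 * FRAMES_PER_SECOND)]  # 1800 frames
    play_synchronized(0.1, frames, lambda frame: None)  # shortened demo run
```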
For the embodiment of the application, the facial expression data corresponding to the audio data is determined from the fourth emotion information and/or the third sound attribute information corresponding to the audio data, and is then bound to the facial skeleton of the avatar to determine the facial expression corresponding to the audio data. The avatar is driven by the audio data, so that the corresponding facial expression data can be played at the same time as the audio data.
Further, since different avatars may correspond to different skeleton point locations, in order to improve the binding accuracy and better match the facial expression of the avatar, binding each point location of the facial expression data with each point location of the facial skeleton of the avatar may further include: acquiring image information containing the avatar; and identifying each point location of the facial skeleton of the avatar from the image information containing the avatar.
Specifically, after the image of the avatar is obtained, the image information containing the avatar is imported into a feature recognition model, facial feature recognition is performed on the image information containing the avatar, and each skeleton point location in the face of the avatar is then determined from the recognized facial features.
Further, after each point location in the facial skeleton of the avatar is identified, binding each point location of the facial expression data with each point location in the facial skeleton of the avatar may specifically include: binding each point location of the facial expression data with each point location of the identified facial skeleton of the avatar.
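An illustrative sketch of identifying the skeleton point locations from an avatar image and binding the expression points to them; the feature recognition model is passed in as a hypothetical callable, since the application does not prescribe a specific detector:

```python
# Sketch of identifying the avatar's facial-skeleton point locations from an
# image and binding the expression points to them. The landmark model is a
# hypothetical callable; the application does not prescribe a specific detector.

def identify_skeleton_points(avatar_image, landmark_model, skeleton_point_names):
    """landmark_model(image) is assumed to return {point_name: (x, y)};
    only points that belong to the avatar's facial skeleton are kept."""
    landmarks = landmark_model(avatar_image)
    return {name: landmarks[name] for name in skeleton_point_names if name in landmarks}


def bind_to_identified_points(expression_point_names, skeleton_points):
    """Bind each expression point to the identified skeleton point of the same name."""
    return {
        expr_index: name
        for expr_index, name in enumerate(expression_point_names)
        if name in skeleton_points
    }
```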
For the embodiment of the present application, the manner of binding each point of the facial expression data with each point of the identified facial skeleton of the avatar is described in detail in the above embodiment, and details are not repeated herein.
The foregoing embodiments introduce the speech generation and expression driving method from the perspective of the method flow. This embodiment introduces it from the perspective of the client and the server, as shown in fig. 3. After the client acquires text data, it determines the audio data through an AI speech service and determines the facial expression data of the avatar corresponding to the audio data from that audio data. The client then emulates a custom protocol to generate a data stream (custom protocol data) from the audio data and the corresponding facial expression data and sends the data stream to the server. After the server obtains the audio data and the facial expression data of the avatar corresponding to the audio data, it binds each point location of the facial expression data with each point location of the facial skeleton of the avatar, and, based on the binding relationship, controls the client to drive the facial expression of the avatar while playing the audio data, thereby achieving sound and picture synchronization.
The foregoing embodiments describe a method for generating speech and driving expression from the perspective of a method flow, and the following embodiments describe a device for generating speech and driving expression from the perspective of a virtual module or a virtual unit, which will be described in detail in the following embodiments.
An embodiment of the present application provides a speech generation and expression driving apparatus, as shown in fig. 4, the apparatus may specifically include an audio data obtaining module 410, an expression data determining module 420, a data sending module 430, and an execution driving module 440, where:
an audio data acquisition module 410 for acquiring audio data;
an expression data determining module 420, configured to determine facial expression data of an avatar corresponding to the audio data, where the facial expression data of the avatar includes multiple point locations;
the data sending module 430 is configured to send the audio data and the facial expression data of the avatar corresponding to the audio data to the server, so that the server binds each point of the facial expression data to each point of the facial skeleton of the avatar;
and the execution driving module 440 is configured to output the facial expression of the avatar while playing the audio data based on the driving instruction of the server.
In one possible implementation manner, the audio data obtaining module 410 is specifically configured to, when obtaining the audio data:
acquiring text information, and performing sentence cutting processing on the text information to obtain a sentence set;
analyzing each statement to obtain first emotion information and/or first sound attribute information corresponding to each statement;
and generating audio data corresponding to the text information based on the first emotion information and/or the first sound attribute information corresponding to each sentence to obtain the audio data.
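As an illustrative sketch of this flow in the audio data obtaining module 410, the text can be split into sentences, each sentence analyzed for emotion and sound attributes, and then synthesized; the analyzer and synthesizer below are hypothetical callables standing in for an AI speech service:

```python
# Sketch of the audio-acquisition flow in module 410: split the text into
# sentences, attach per-sentence emotion / sound-attribute labels, and hand
# each sentence to a synthesizer. The analyzer and synthesizer are hypothetical
# callables standing in for an AI speech service.
import re


def split_sentences(text):
    # Split on Chinese and Western sentence-ending punctuation.
    return [s.strip() for s in re.split(r"[。！？.!?]+", text) if s.strip()]


def generate_audio(text, analyze_sentence, synthesize):
    """analyze_sentence(s) -> e.g. {'emotion': ..., 'pitch': ..., 'speed': ...};
    synthesize(s, attrs) -> audio bytes for that sentence."""
    audio = b""
    for sentence in split_sentences(text):
        attrs = analyze_sentence(sentence)
        audio += synthesize(sentence, attrs)
    return audio


if __name__ == "__main__":
    demo = generate_audio(
        "Hello there. It is a nice day!",
        analyze_sentence=lambda s: {"emotion": "neutral", "pitch": 1.0, "speed": 1.0},
        synthesize=lambda s, attrs: s.encode("utf-8"),  # stand-in for a real TTS call
    )
    print(len(demo), "bytes")
```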
In one possible implementation, the apparatus further includes:
the obtaining information module is used for obtaining virtual character information and/or second emotion information input by the user;
the audio data obtaining module is specifically configured to, when analyzing each sentence to obtain first emotion information and/or first sound attribute information corresponding to each sentence:
and analyzing each sentence based on the virtual character information and/or the second emotion information input by the user to obtain first emotion information and/or first sound attribute information corresponding to each sentence.
In a possible implementation manner, the expression data determining module 420, when determining facial expression data of an avatar corresponding to the audio data, is specifically configured to:
performing sentence division processing on the audio data, and acquiring third emotion information and/or second sound attribute information corresponding to each audio sentence;
and determining facial expression data of the virtual image corresponding to each audio statement based on the third emotion information and/or the second sound attribute information corresponding to each audio statement.
In a possible implementation manner, the expression data determining module 420, when determining facial expression data of an avatar corresponding to the audio data, is specifically configured to:
determining mouth movement information corresponding to the audio data;
determining motion information of other parts of the face based on the mouth motion information corresponding to the audio data;
and determining each point position of the facial expression data according to the mouth movement information corresponding to the audio data and the movement information of other parts of the face so as to obtain the facial expression data of the virtual image corresponding to the audio data.
In one possible implementation, the apparatus further includes:
the data stream generation module is used for generating a data stream from the audio data and the facial expression data of the avatar corresponding to the audio data through a specific network protocol;
the data sending module 430 is specifically configured to, when sending the audio data and the facial expression data of the avatar corresponding to the audio data to the server:
and transmitting the data stream to the server.
The embodiment of the present application further provides a device for speech generation and expression driving, and as shown in fig. 5, the device may specifically include a data information obtaining module 510, a point location binding module 520, and a control driving module 530, where:
a data information obtaining module 510, configured to acquire audio data and facial expression data of an avatar corresponding to the audio data, where the facial expression data of the avatar includes multiple point locations;
a point location binding module 520, configured to bind each point location of the facial expression data with each point location of the facial skeleton of the avatar;
and the control driving module 530 is configured to control the client to drive the facial expression of the avatar while playing the audio data based on the binding relationship.
In one possible implementation, the apparatus further includes:
the determining information module is used for determining fourth emotion information and/or third sound attribute information which correspond to each sentence in the audio data respectively;
wherein the apparatus further includes:
the displacement information determining module is used for determining displacement information of each point position in the facial skeleton corresponding to the virtual image based on fourth emotion information and/or third sound attribute information and the binding relation respectively corresponding to each sentence;
the control driving module 530, when controlling the client to drive the facial expression of the avatar while playing the audio data based on the binding relationship, is specifically configured to:
and controlling the client to drive the facial expression of the virtual image while playing the audio data based on the binding relationship and the displacement information of each point in the facial skeleton.
In one possible implementation, the apparatus further includes:
the image information acquisition module is used for acquiring image information containing the virtual image;
the identification point location module is used for identifying each point location of the facial skeleton of the virtual image according to the image information containing the virtual image;
the point location binding module 520 is specifically configured to, when binding each point location of the facial expression data with each point location of the facial skeleton of the virtual image:
and binding each point of the facial expression data with each point of the recognized facial skeleton of the virtual image.
In the embodiment of the present application, there is provided a client. As shown in fig. 6, the client 600 includes a processor 601 and a memory 603, where the processor 601 is coupled to the memory 603, for example via a bus 602. Optionally, the client 600 may also include a transceiver 604. It should be noted that, in practical applications, the number of transceivers 604 is not limited to one, and the structure of the client 600 does not constitute a limitation on the embodiments of the present application.
The processor 601 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 601 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 602 may include a path that transfers information between the above components. The bus 602 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 602 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The memory 603 may be a ROM (Read Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 603 is used for storing application program codes for executing the scheme of the application, and the processor 601 controls the execution. The processor 601 is configured to execute application program code stored in the memory 603 to implement the content shown in the foregoing method embodiments.
The client includes, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), fixed terminals such as digital TVs and desktop computers, and may also be a server or the like. The client shown in fig. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
In the embodiment of the present application, a server is further provided; for its structure, please refer to the above description of the client, and the specific contents of the server are not described in detail in the embodiment of the present application.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications shall also fall within the protection scope of the present application.

Claims (11)

1. A speech generation and expression driven method, performed by a client, comprising:
acquiring audio data;
determining facial expression data of an avatar corresponding to the audio data, wherein the facial expression data of the avatar comprise a plurality of point positions;
sending the audio data and facial expression data of the virtual image corresponding to the audio data to a server, so that the server binds each point of the facial expression data with each point of a facial skeleton of the virtual image;
and outputting the facial expression of the virtual image while playing the audio data based on the driving instruction of the server.
2. The speech generation and expression driving method according to claim 1, wherein the acquiring audio data comprises:
acquiring text information, and performing sentence cutting processing on the text information to obtain a sentence set;
analyzing each statement to obtain first emotion information and/or first sound attribute information corresponding to each statement;
and generating audio data corresponding to the text information based on the first emotion information and/or the first sound attribute information corresponding to each sentence to obtain the audio data.
3. The method of claim 2, wherein the parsing each sentence to obtain the first emotion information and/or the first sound attribute information corresponding to each sentence further comprises:
acquiring virtual character information and/or second emotion information input by a user;
the analyzing each sentence to obtain first emotion information and/or first sound attribute information corresponding to each sentence includes:
and analyzing each sentence based on the virtual character information and/or second emotion information input by the user to obtain first emotion information and/or first sound attribute information corresponding to each sentence.
4. The method of claim 1, wherein the determining facial expression data of the avatar corresponding to the audio data comprises:
sentence division processing is carried out on the audio data, and third emotion information and/or second sound attribute information corresponding to each audio sentence are/is obtained;
and determining facial expression data of the virtual image corresponding to each audio statement based on the third emotion information and/or the second sound attribute information corresponding to each audio statement.
5. The speech generating and expression driving method according to any one of claims 1 to 4, wherein the determining facial expression data of the avatar corresponding to the audio data comprises:
determining mouth movement information corresponding to the audio data;
determining motion information of other parts of the face based on the mouth motion information corresponding to the audio data;
and determining each point of facial expression data according to the mouth movement information corresponding to the audio data and the movement information of other parts of the face so as to obtain the facial expression data of the virtual image corresponding to the audio data.
6. The method of claim 1, wherein the sending the audio data and the facial expression data of the avatar corresponding to the audio data to the server further comprises:
generating a data stream by the audio data and the facial expression data of the virtual image corresponding to the audio data through a specific network protocol;
the sending of the audio data and the facial expression data of the avatar corresponding to the audio data to the server includes:
and sending the data stream to the server.
7. A speech generation and expression driving method, which is executed by a server, comprises the following steps:
acquiring audio data and facial expression data of an avatar corresponding to the audio data, wherein the facial expression data of the avatar comprises a plurality of point positions;
binding each point of the facial expression data with each point of the facial skeleton of the virtual image;
and controlling the client to drive the facial expression of the virtual image while playing the audio data based on the binding relationship.
8. The speech generation and expression driving method according to claim 7, wherein the binding of each point of facial expression data with each point of facial skeleton of the avatar further comprises:
determining fourth emotion information and/or third sound attribute information corresponding to each sentence in the audio data;
binding each point location of the facial expression data with each point location of the facial skeleton of the virtual image, and then further comprising:
based on fourth emotion information and/or third sound attribute information corresponding to each sentence and a binding relationship, determining displacement information of each point in a facial skeleton corresponding to the virtual image;
wherein, based on the binding relationship, controlling the client to drive the facial expression of the avatar while playing audio data, comprises:
and controlling the client to drive the facial expression of the virtual image while playing audio data based on the binding relationship and the displacement information of each point in the facial skeleton.
9. The speech generation and expression driving method according to claim 7, wherein the binding of each point of facial expression data with each point of facial skeleton of the avatar further comprises:
acquiring image information containing the virtual image;
identifying each point of the facial skeleton of the virtual image according to the image information containing the virtual image;
wherein, binding each point of facial expression data with each point of facial skeleton of the virtual image comprises:
and binding each point location of the facial expression data with each point location in the identified facial skeleton of the virtual image.
10. A client, comprising:
at least one processor;
a memory;
at least one application, wherein the at least one application is stored in the memory and configured to be executed by the at least one processor, the at least one application configured to: performing the speech generation and expression driven method of any of claims 1-6.
11. A server, comprising:
at least one processor;
a memory;
at least one application, wherein the at least one application is stored in the memory and configured to be executed by the at least one processor, the at least one application configured to: a method of performing speech generation and expression driving as claimed in any one of claims 7 to 9.
CN202211577842.4A 2022-12-09 2022-12-09 Voice generation and expression driving method, client and server Pending CN115797515A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211577842.4A CN115797515A (en) 2022-12-09 2022-12-09 Voice generation and expression driving method, client and server


Publications (1)

Publication Number Publication Date
CN115797515A true CN115797515A (en) 2023-03-14

Family

ID=85419100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211577842.4A Pending CN115797515A (en) 2022-12-09 2022-12-09 Voice generation and expression driving method, client and server

Country Status (1)

Country Link
CN (1) CN115797515A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination