CN113096223A - Image generation method, storage medium, and electronic device - Google Patents

Image generation method, storage medium, and electronic device

Info

Publication number
CN113096223A
CN113096223A
Authority
CN
China
Prior art keywords
lip
audio
determining
target
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110448734.6A
Other languages
Chinese (zh)
Inventor
冯富森
闫嵩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202110448734.6A
Publication of CN113096223A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the invention disclose an image generation method, a storage medium and an electronic device. After the phoneme label corresponding to each audio segment in a target audio is determined, the lip width and lip height corresponding to each audio segment are determined from the phoneme labels, and a lip image sequence of a target figure is generated from the lip width and lip height of each audio segment. Because the lip width and lip height that should appear during pronunciation are derived from the phoneme labels, a lip image sequence matching the target audio is generated automatically from the per-segment lip widths and heights, which effectively reduces the image acquisition cost of teaching word pronunciation in a visual manner.

Description

Image generation method, storage medium, and electronic device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an image generation method, a storage medium, and an electronic device.
Background
With the growing popularity of the internet and computer technology, online teaching activities, especially online language teaching, have become increasingly common. Language learning matters greatly to learners, and word pronunciation is its foundation, so pronunciation learning is an essential part of online language teaching. Online pronunciation teaching shows learners the lip shape changes that occur while a word is spoken, but different languages have different pronunciation patterns and the number of words is enormous, so recording the lip movements of a real person for every word is clearly impractical.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide an image generation method, a storage medium, and an electronic device that automatically generate a lip image sequence for an audio according to the phoneme label of each audio segment in the target audio, thereby effectively reducing the image acquisition cost of teaching word pronunciation in a visual manner.
According to a first aspect of embodiments of the present invention, there is provided an image generation method, the method including:
acquiring a target audio;
determining phoneme labels corresponding to the audio fragments in the target audio;
determining face feature parameters corresponding to the audio segments according to the phoneme labels, wherein the face feature parameters comprise lip width and lip height;
and determining a lip image sequence of a target figure according to each lip width and the corresponding lip height.
Preferably, the method further comprises:
determining a sequence of facial images of the target figure according to each of the lip widths and the corresponding lip heights.
Preferably, the determining the phoneme label corresponding to each audio fragment in the target audio includes:
and performing voice recognition on the target audio based on a preset voice recognition model, and determining the phoneme label corresponding to each audio fragment.
Preferably, the determining, according to each of the phoneme labels, a face feature parameter corresponding to each of the audio segments includes:
determining a feature vector corresponding to each audio segment according to each phoneme label;
and determining the lip width and the lip height corresponding to each audio segment based on a preset feature recognition model according to each feature vector.
Preferably, the feature vector is a one-hot vector of the audio piece;
the determining the feature vector corresponding to each audio segment according to each phoneme label includes:
determining a sorting position of each phoneme label in the phoneme table based on a predetermined phoneme table;
for each of the audio segments, determining the corresponding one-hot vector according to the corresponding ranking position.
Preferably, the speech recognition model is obtained based on a first sample set training, the first sample set includes a plurality of first samples, and each first sample includes a first audio segment and a phoneme identifier corresponding to the first audio segment.
Preferably, the feature recognition model is obtained based on training of a second sample set, where the second sample set includes a plurality of second samples, each of the second samples includes a second audio segment and a lip width and a lip height corresponding to each of the second audio segments.
Preferably, the determining a lip image sequence of the target figure according to each of the lip widths and the corresponding lip heights includes:
acquiring a target image corresponding to the target figure;
determining the original lip key point positions of the target figure in the target image;
determining the actual lip key point positions of the target figure based on a pre-trained key point prediction model according to the lip widths, the corresponding lip heights and the original lip key point positions;
determining the lip image sequence according to each of the actual lip keypoint locations.
Preferably, the keypoint prediction model is obtained based on a third sample set training, the third sample set comprising a plurality of third samples, each of the third samples comprising an initial lip keypoint position of a predetermined avatar, a lip height and a lip width of a third audio segment, and a target lip keypoint position of the predetermined avatar.
According to a second aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method according to any one of the first aspect.
According to a third aspect of embodiments of the present invention, there is provided an electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to any one of the first aspect.
After determining the phoneme label corresponding to each audio segment in the target audio, the lip width and lip height corresponding to each audio segment are determined according to the phoneme labels, and a lip image sequence of the target figure is generated according to the lip width and lip height of each audio segment. Because the lip width and lip height that should appear during pronunciation are derived from the phoneme labels, the lip image sequence corresponding to the target audio is generated automatically, which effectively reduces the image acquisition cost of teaching word pronunciation in a visual manner. The embodiments of the invention generate the pronunciation lip images from the phoneme labels of the audio segments, which greatly improves the generalization ability of generating pronunciation lip images from audio. Moreover, because a lip width sequence and a lip height sequence are first generated from the phoneme labels and the pronunciation lip image sequence is then generated from those sequences, the two parts of the model can be trained with the same or with different data, which improves the flexibility and practical applicability of generating lip images from phoneme labels.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of an image generation method of a first embodiment of the present invention;
fig. 2 is a flowchart of determining face feature parameters in an alternative implementation manner of the first embodiment of the present invention;
FIG. 3 is a schematic illustration of lip keypoints for an embodiment of the invention;
FIG. 4 is a flow chart of determining a sequence of lip images in an alternative implementation of the first embodiment of the invention;
fig. 5 is a schematic view of an electronic device according to a second embodiment of the present invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
In traditional word pronunciation learning, a learner usually achieves more standard pronunciation by imitating the teacher's lip movements and pronunciation. Online pronunciation teaching, as another way of learning word pronunciation, therefore also needs to present pronunciation visually. However, different languages have different pronunciation patterns and the number of words is very large, so the labor and time cost of recording real people's lip movements would be huge, and that approach is clearly impractical.
Fig. 1 is a flowchart of an image generation method of a first embodiment of the present invention. As shown in fig. 1, the method of the present embodiment includes the following steps:
step S100, a target audio is acquired.
In the present embodiment, the target audio is pronunciation audio of a predetermined word. The target audio may be obtained by recording a real person reading the predetermined word aloud, or by converting the text of the predetermined word in various existing ways, for example with Text-To-Speech (TTS) technology; this embodiment is not particularly limited in this respect.
The predetermined word corresponding to the target audio may be a word in any language, such as Chinese, English, German, Japanese, or French.
Step S200, determining phoneme labels corresponding to the audio fragments in the target audio.
A phoneme is the smallest unit of speech divided according to the natural attributes of speech; it is analyzed in terms of the articulatory actions within a syllable, and each articulatory action forms a phoneme. Taking English as an example, the English international phonetic alphabet has 48 phonemes, including 20 vowel phonemes and 28 consonant phonemes; the consonant phonemes include /p/, /b/, /ts/, and so on.
In an optional implementation, if the target audio is converted from the text of the predetermined word, the phoneme labels corresponding to the audio segments in the target audio are known, and the server may directly determine the phoneme label corresponding to each audio segment from the text of the predetermined word, for example by using a forced-alignment model.
In another optional implementation, if the target audio is a recording of a real person reading the predetermined word aloud, the server may perform speech recognition on the target audio based on a predetermined speech recognition model to determine the phoneme label corresponding to each audio segment. Specifically, the server may slide a window of predetermined length over the target audio to obtain a plurality of audio segments, and input each audio segment into the speech recognition model to determine its phoneme label.
The window length is typically greater than the sliding step. For example, for a target audio 1 second long, a window length of 20 milliseconds and a sliding step of 10 milliseconds yield 99 audio segments covering 0-20 milliseconds, 10-30 milliseconds, ..., 980-1000 milliseconds. Overlapping windows preserve the continuity of speech changes across segments, which subsequently improves the accuracy of speech recognition.
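Purely as an illustration of this sliding-window segmentation, the following minimal Python sketch slices a waveform into overlapping 20 ms windows with a 10 ms step; the function name and the 16 kHz sampling rate are assumptions, not part of the patent.

```python
import numpy as np

def slide_segments(samples: np.ndarray, sr: int = 16000,
                   win_ms: int = 20, hop_ms: int = 10) -> np.ndarray:
    """Slice a 1-D waveform into overlapping windows (assumed helper, not from the patent)."""
    win = int(sr * win_ms / 1000)   # samples per 20 ms window
    hop = int(sr * hop_ms / 1000)   # samples per 10 ms hop
    n = 1 + max(0, (len(samples) - win) // hop)
    return np.stack([samples[i * hop: i * hop + win] for i in range(n)])

# A 1-second waveform at 16 kHz yields 99 overlapping 20 ms segments.
segments = slide_segments(np.zeros(16000))
print(segments.shape)  # (99, 320)
```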
In this embodiment, the speech recognition model may be obtained by training on a first sample set. The first sample set includes a plurality of first samples, and each first sample includes a first audio segment and the phoneme identifier corresponding to that segment. Like the audio segments of the target audio, each first audio segment may be obtained by sliding a window of predetermined length over an original audio. The phoneme identifier is a phoneme label of the first audio segment annotated manually in advance. During training, the server may take each first audio segment as input and the corresponding phoneme identifier as the training target, until the loss function of the speech recognition model converges.
The speech recognition model may be any of various existing models, such as a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), or linear regression. Taking the RNN as an example: an RNN is a recursive neural network that takes sequence data as input and recurses along the evolution direction of the sequence, with all nodes (i.e., neurons) connected in a chain. It has memory, shares parameters between nodes, and is Turing complete (in computability theory, a set of rules for manipulating data is Turing complete if it can simulate a single-tape Turing machine), so it has certain advantages in learning the nonlinear features of a sequence. Common RNN variants include the Bidirectional recurrent neural network (Bi-RNN) and the Long Short-Term Memory network (LSTM). The target audio is itself sequence data, so it is well suited to an RNN, which can identify the phoneme labels more accurately.
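As a purely illustrative sketch, and not the patent's own model, the following PyTorch snippet shows an LSTM classifier that maps a sequence of per-segment acoustic features to one of 48 phoneme classes; the feature dimension, hidden size, and class count are assumptions.

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    """Illustrative LSTM phoneme classifier (assumed architecture, not specified by the patent)."""
    def __init__(self, feat_dim: int = 40, hidden: int = 128, n_phonemes: int = 48):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_segments, feat_dim) -> one logit vector per segment
        out, _ = self.lstm(x)
        return self.head(out)

model = PhonemeRecognizer()
logits = model(torch.randn(2, 99, 40))   # 2 utterances, 99 segments each
labels = logits.argmax(dim=-1)           # predicted phoneme label index per segment
```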
Step S300, determining the face feature parameters corresponding to each audio segment according to the phoneme labels.
In this embodiment, the face feature parameters are the lip width and the lip height. In the prior art, a typical mouth shape for each audio segment is usually looked up directly from a phoneme-to-mouth-shape correspondence; however, the mouth shape formed during speech changes over time, so when several consecutive phoneme labels map to the same mouth shape, the resulting lip images do not match the actual pronunciation and look less realistic. In this embodiment, a neural network model is therefore constructed to model the relation between phoneme labels and lip width and lip height, and the trained model is used to determine the lip width and lip height for each audio segment, which improves both the synchronization and the realism of the generated lip images relative to the phoneme pronunciation.
Fig. 2 is a flowchart of determining face feature parameters in an alternative implementation manner of the first embodiment of the present invention. As shown in fig. 2, in an alternative implementation, step S300 may include the following steps:
step S310, determining a feature vector corresponding to each audio segment according to each phoneme label.
In the present embodiment, the feature vector of an audio segment is its one-hot vector. The one-hot vector may be determined from a preset phoneme table, which lists a number of phonemes and the ordering position of each phoneme. Only one element of the one-hot vector has the value 1 and all other elements are 0; the position of that 1 equals the ordering position, in the phoneme table, of the phoneme label of the segment. For example, if the phoneme table contains 48 phonemes and the phoneme /e/ occupies the 3rd position, then in the one-hot vector of an audio segment labeled /e/ the 3rd element is 1 and the rest are 0, i.e. the vector is (0, 0, 1, 0, ..., 0), with 45 zeros after the 1.
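The following minimal Python sketch, provided only as an illustration, builds such a one-hot vector from an assumed phoneme table; the table contents shown are placeholders, not the 48-entry table used by the patent.

```python
import numpy as np

# Hypothetical phoneme table; in practice it would list all 48 phonemes in a fixed order.
PHONEME_TABLE = ["i:", "I", "e", "ae", "p", "b", "ts"]  # truncated placeholder

def one_hot(phoneme: str, table=PHONEME_TABLE) -> np.ndarray:
    """Return the one-hot vector whose single 1 sits at the phoneme's position in the table."""
    vec = np.zeros(len(table), dtype=np.float32)
    vec[table.index(phoneme)] = 1.0
    return vec

print(one_hot("e"))  # 1 at index 2 (the 3rd position), zeros elsewhere
```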
Step S320, determining lip width and lip height corresponding to each audio clip based on a predetermined feature recognition model according to each feature vector.
After determining the feature vector corresponding to each audio clip, the server may input the feature vector corresponding to each audio clip into a pre-trained feature recognition model to obtain the lip width and the lip height corresponding to each audio clip.
The lip height and lip width at a given moment are affected not only by the phoneme at that moment but possibly also by the adjacent preceding and following phonemes, so the feature vectors of the audio segments may be fed into the feature recognition model together as a sequence. Optionally, the feature vector of each audio segment may instead be input into the feature recognition model separately to obtain the lip height and lip width of that segment.
Similar to the speech recognition model, the feature recognition model of this embodiment may be any of various existing models, such as an RNN or a CNN. The feature recognition model is trained on a second sample set comprising a plurality of second samples, each of which contains a second audio segment and the lip width and lip height corresponding to that segment. During training, the server may take each second audio segment as input and the corresponding lip height and lip width simultaneously as training targets, until the loss function for the lip height and the loss function for the lip width both converge.
In this embodiment, the first audio clip and the second audio clip may be the same audio clip or different audio clips, and this embodiment is not limited in particular.
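Purely as an illustration of such a two-target setup, and not the patent's own architecture, the sketch below adds two regression heads, one for lip width and one for lip height, on top of an LSTM over per-segment one-hot phoneme vectors, and sums their losses; all layer sizes and the training data are assumptions.

```python
import torch
import torch.nn as nn

class LipFeatureModel(nn.Module):
    """Illustrative model: phoneme one-hot sequence -> lip width and lip height per segment."""
    def __init__(self, n_phonemes: int = 48, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_phonemes, hidden, batch_first=True)
        self.width_head = nn.Linear(hidden, 1)
        self.height_head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)                        # (batch, segments, hidden)
        return self.width_head(out), self.height_head(out)

model = LipFeatureModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(2, 99, 48)                           # stand-in for one-hot phoneme sequences
w_true, h_true = torch.rand(2, 99, 1), torch.rand(2, 99, 1)

opt.zero_grad()
w_pred, h_pred = model(x)
loss = nn.functional.mse_loss(w_pred, w_true) + nn.functional.mse_loss(h_pred, h_true)
loss.backward()                                      # both losses trained until convergence
opt.step()
```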
The lip height and lip width in each second sample can be determined by performing key point detection on real face images, which effectively improves the realism of the generated lip images. Specifically, the server may perform face detection on the image corresponding to each second audio segment, determine a number of lip key points and the coordinates of each key point in the image, and then derive the lip width and lip height from those coordinates. In an alternative implementation, the server may use Dlib for the face detection and lip key point extraction. Dlib is an open-source C++ toolkit of machine learning algorithms; it identifies facial features and contours with 68 key points, a subset of which defines the contour of the lips.
Figure 3 is a schematic illustration of lip key points in an embodiment of the invention. The lip key points shown in fig. 3 are obtained by Dlib key point detection; key points 49-68 are the lip key points in Dlib. After determining the coordinates of key points 49-68, the server may determine the lip height from key points 62, 63, 64, 66, 67 and 68, and the lip width from key points 49, 55, 61 and 65.
The server may determine the lip height and lip width corresponding to each second audio segment with various existing distance measures, such as the L1 distance (i.e., the Manhattan distance) or the L2 distance (i.e., the Euclidean distance).
When the distance calculation method is the L1 distance, the server may calculate the L1 distance between the key point 62 and the key point 68, the L1 distance between the key point 63 and the key point 67, and the L1 distance between the key point 64 and the key point 66, respectively, and calculate an average value of the L1 distances as the lip height; also, the server may calculate the L1 distance of keypoint 55 from keypoint 49 and the L1 distance of keypoint 65 from keypoint 61, respectively, and calculate the average of the above L1 distances as the lip width. The L1 distance between keypoint 62 and keypoint 68 may be calculated by the following equation:
L1 = |x62 − x68| + |y62 − y68|;
where x62 and y62 are the abscissa and ordinate of key point 62, and x68 and y68 are the abscissa and ordinate of key point 68.
When the distance calculation method is the L2 distance, the server may calculate the L2 distance between the key point 62 and the key point 68, the L2 distance between the key point 63 and the key point 67, and the L2 distance between the key point 64 and the key point 66, respectively, and calculate an average value of the L2 distances as the lip height; also, the server may calculate the L2 distance of keypoint 55 from keypoint 49 and the L2 distance of keypoint 65 from keypoint 61, respectively, and calculate the average of the above L2 distances as the lip width. The L2 distance between keypoint 62 and keypoint 68 may be calculated by the following equation:
L2 = √((x62 − x68)² + (y62 − y68)²)
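As a concrete illustration of the key point arithmetic above, the following Python sketch computes the lip height and lip width from a 68-point landmark array using the Euclidean (L2) distance; the helper itself and its 0-based array convention (key point 62 in the text is row 61 here) are assumptions, while the averaged point pairs follow the description.

```python
import numpy as np

def lip_height_width(landmarks: np.ndarray):
    """landmarks: (68, 2) array of (x, y) Dlib key points; key point N in the text is row N-1."""
    def dist(a, b):  # L2 (Euclidean) distance between two key points
        return float(np.linalg.norm(landmarks[a - 1] - landmarks[b - 1]))

    # Lip height: average of inner-lip vertical distances 62-68, 63-67 and 64-66.
    height = np.mean([dist(62, 68), dist(63, 67), dist(64, 66)])
    # Lip width: average of outer corner distance 49-55 and inner corner distance 61-65.
    width = np.mean([dist(49, 55), dist(61, 65)])
    return height, width

h, w = lip_height_width(np.random.rand(68, 2) * 100)
print(h, w)
```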
Step S400, determining the lip image sequence of the target figure according to each lip width and the corresponding lip height.
After determining the lip width and the lip height corresponding to each audio clip, the server may determine the lip image corresponding to each audio clip, and determine the lip image sequence corresponding to the target audio according to the sequence of the audio clips.
Fig. 4 is a flow chart for determining a sequence of lip images in an alternative implementation of the first embodiment of the invention. As shown in fig. 4, in an alternative implementation, step S400 may include the following steps:
Step S410, acquiring a target image corresponding to the target figure.
In the present embodiment, the target figure may be a real person, or a virtual character, an animal, or the like. The server may perform face recognition on a predetermined image in various existing ways, such as with Dlib, and determine that image as the target image corresponding to the target figure when the face of the target figure is detected in it; alternatively, the server may acquire the target image corresponding to the target figure according to a preset figure-to-image correspondence.
Step S420, determining the original lip key point positions of the target figure in the target image.
In this step, the server may perform key point detection on the target image in various existing ways, such as with Dlib, and determine the original lip key point positions of the target figure in the target image. The original lip key point positions may be the lip key point positions of the target figure with the lips in a closed state.
Step S430, determining the actual lip key point positions of the target figure based on a pre-trained key point prediction model, according to each lip width, the corresponding lip height and the original lip key point positions.
In this step, the server may input the lip width and lip height corresponding to an audio segment, together with the original lip key point positions of the target figure, into the key point prediction model to obtain the actual lip key point positions of the target figure. The actual lip key point positions are the lip key point positions that the target figure should present when pronouncing the corresponding audio segment.
Accordingly, in this embodiment, the server may input a matrix formed by the lip width and lip height of each audio segment and the original lip key point positions of the target figure into the key point prediction model to obtain the sequence of actual lip key point positions for all audio segments. Optionally, the lip width and lip height of each audio segment and the original lip key point positions of the target figure may instead be input into the key point prediction model separately to obtain the actual lip key point positions of each audio segment.
Similar to the speech recognition model, the key point prediction model of this embodiment may be any of various existing models, such as an RNN or a CNN. The key point prediction model is trained on a third sample set comprising a plurality of third samples, each of which contains the initial lip key point positions of a predetermined avatar, the lip height and lip width of a third audio segment, and the target lip key point positions of the predetermined avatar. During training, the server may take the lip width and lip height of each third audio segment together with the initial lip key point positions of the predetermined avatar as input, and the corresponding target lip key point positions as the training target, until the loss functions of the key point prediction model converge. In this embodiment, the first audio segment and the third audio segment may be the same audio segment or different audio segments; this embodiment is not particularly limited in this respect.
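To make the data flow concrete, the sketch below shows an assumed interface for such a key point prediction model: per-segment lip width and height plus the figure's original 20 lip key points in, predicted key points out. The class, layer sizes, and tensor shapes are illustrative assumptions only, not the patent's model.

```python
import torch
import torch.nn as nn

class LipKeypointPredictor(nn.Module):
    """Illustrative model: (lip width, lip height, original lip key points) -> actual key points."""
    def __init__(self, n_keypoints: int = 20, hidden: int = 128):
        super().__init__()
        in_dim = 2 + 2 * n_keypoints               # width + height + flattened (x, y) key points
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_keypoints),
        )

    def forward(self, width, height, orig_kpts):
        # width, height: (segments, 1); orig_kpts: (segments, 20, 2)
        x = torch.cat([width, height, orig_kpts.flatten(1)], dim=1)
        return self.net(x).view(-1, orig_kpts.shape[1], 2)

model = LipKeypointPredictor()
orig = torch.rand(99, 20, 2)                        # closed-lip key points repeated per segment
actual = model(torch.rand(99, 1), torch.rand(99, 1), orig)
print(actual.shape)                                 # (99, 20, 2): one set of lip key points per segment
```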
Step S440, determining lip image sequences according to the actual lip key point positions.
After determining the actual lip key point positions corresponding to the audio segments, the server may form the key point position sequence corresponding to the target audio from those positions in the order of the audio segments, and determine the lip image sequence from that key point position sequence. In this way, the lip changes during pronunciation can be displayed more realistically in a visual manner.
In this embodiment, the server may convert the key point position sequence into a sequence of lip images in various existing ways, for example by the method described in "Few-Shot Video-to-Video Synthesis" (Ting-Chun Wang et al., NVIDIA Corporation).
In this way, the server can accurately determine the phoneme label sequence of the target audio, derive the lip width and lip height sequences from it, and further determine the lip key point position sequence from which the lip image sequence is generated. The sliding-window segmentation improves the continuity between audio segments, and determining the lip width and lip height improves the continuity of the lip image changes, which together effectively improve the realism of the lip image sequence.
Optionally, the method of this embodiment may further include the following steps:
step S500, determining a face image sequence of the target image according to each lip width and the corresponding lip height.
Similar to step S440, in this step, the server may also convert the keypoint location sequence formed by the actual lip keypoint locations corresponding to each audio clip into a facial image sequence of the target avatar in various existing manners.
In this embodiment, the lip width sequence and lip height sequence are first generated from the phoneme labels, and the corresponding pronunciation lip image sequence is then generated from those sequences, so the two parts of the model can be trained with the same or with different data, which improves the flexibility and practical applicability of generating lip images from phoneme labels.
After determining the phoneme label corresponding to each audio segment in the target audio, this embodiment determines the lip width and lip height corresponding to each audio segment from the phoneme labels, and generates the lip image sequence of the target figure from them. Because the lip width and lip height that should appear during pronunciation are derived from the phoneme labels, the lip image sequence corresponding to the target audio is generated automatically, which effectively reduces the image acquisition cost of teaching word pronunciation in a visual manner.
Fig. 5 is a schematic view of an electronic device according to a second embodiment of the present invention. The electronic device shown in fig. 5 is a general-purpose data processing device, and may be specifically a first terminal, a second terminal or a server according to an embodiment of the present invention, and includes a general-purpose computer hardware structure, which includes at least a processor 51 and a memory 52. The processor 51 and the memory 52 are connected by a bus 53. The memory 52 is adapted to store instructions or programs executable by the processor 51. The processor 51 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 51 implements the processing of data and the control of other devices by executing the commands stored in the memory 52 to execute the method flows of the embodiments of the present invention as described above. The bus 53 connects the above components together, and also connects the above components to a display controller 54 and a display device and an input/output (I/O) device 55. Input/output (I/O) devices 55 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, and other devices known in the art. Typically, an input/output (I/O) device 55 is connected to the system through an input/output (I/O) controller 56.
The memory 52 may store, among other things, software components such as an operating system, communication modules, interaction modules, and application programs. Each of the modules and applications described above corresponds to a set of executable program instructions that perform one or more functions and methods described in embodiments of the invention.
The flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention described above illustrate various aspects of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Also, as will be appreciated by one skilled in the art, aspects of embodiments of the present invention may be embodied as a system, method or computer program product. Accordingly, various aspects of embodiments of the invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects, all of which may generally be referred to herein as a "circuit," "module" or "system." Further, aspects of the invention may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer-readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of embodiments of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to: electromagnetic, optical, or any suitable combination thereof. The computer readable signal medium may be any of the following computer readable media: is not a computer readable storage medium and may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, C++, PHP and Python, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer; partly on the user's computer as a stand-alone software package; partly on the user's computer and partly on a remote computer; or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. An image generation method, characterized in that the method comprises:
acquiring a target audio;
determining phoneme labels corresponding to the audio fragments in the target audio;
determining face feature parameters corresponding to the audio segments according to the phoneme labels, wherein the face feature parameters comprise lip width and lip height;
and determining a lip image sequence of a target figure according to each lip width and the corresponding lip height.
2. The method of claim 1, further comprising:
determining a sequence of facial images of the target figure according to each of the lip widths and the corresponding lip heights.
3. The method of claim 1, wherein the determining the phoneme label corresponding to each audio fragment in the target audio comprises:
and performing voice recognition on the target audio based on a preset voice recognition model, and determining the phoneme label corresponding to each audio fragment.
4. The method of claim 1, wherein the determining the face feature parameters corresponding to the audio segments according to the phoneme labels comprises:
determining a feature vector corresponding to each audio segment according to each phoneme label;
and determining the lip width and the lip height corresponding to each audio segment based on a preset feature recognition model according to each feature vector.
5. The method of claim 4, wherein the feature vector is a one-hot vector of the audio segment;
the determining the feature vector corresponding to each audio segment according to each phoneme label includes:
determining a sorting position of each phoneme label in the phoneme table based on a predetermined phoneme table;
for each of the audio segments, determining the corresponding one-hot vector according to the corresponding ranking position.
6. The method of claim 3, wherein the speech recognition model is trained based on a first sample set, the first sample set comprises a plurality of first samples, and each of the first samples comprises a first audio segment and a phoneme identification corresponding to the first audio segment.
7. The method of claim 4, wherein the feature recognition model is trained based on a second sample set, the second sample set comprising a plurality of second samples, each of the second samples comprising a second audio segment and a lip width and a lip height corresponding to each of the second audio segments.
8. The method of claim 1, wherein said determining a sequence of lip images of a target figure from each of said lip widths and corresponding said lip heights comprises:
acquiring a target image corresponding to the target figure;
determining the original lip key point positions of the target figure in the target image;
determining the actual lip key point positions of the target figure based on a pre-trained key point prediction model according to the lip widths, the corresponding lip heights and the original lip key point positions;
determining the lip image sequence according to each of the actual lip keypoint locations.
9. The method of claim 8, wherein the keypoint prediction model is trained based on a third sample set, the third sample set comprising a plurality of third samples, each of the third samples comprising an initial lip keypoint position of a predetermined avatar, a lip height and a lip width of a third audio segment, and a target lip keypoint position of the predetermined avatar.
10. A computer-readable storage medium on which computer program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1-9.
11. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-9.
CN202110448734.6A 2021-04-25 2021-04-25 Image generation method, storage medium, and electronic device Pending CN113096223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110448734.6A CN113096223A (en) 2021-04-25 2021-04-25 Image generation method, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110448734.6A CN113096223A (en) 2021-04-25 2021-04-25 Image generation method, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
CN113096223A true CN113096223A (en) 2021-07-09

Family

ID=76680431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110448734.6A Pending CN113096223A (en) 2021-04-25 2021-04-25 Image generation method, storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN113096223A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108924617A (en) * 2018-07-11 2018-11-30 北京大米科技有限公司 The method of synchronizing video data and audio data, storage medium and electronic equipment
CN110876024A (en) * 2018-08-31 2020-03-10 百度在线网络技术(北京)有限公司 Method and device for determining lip action of avatar
WO2020253051A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Lip language recognition method and apparatus
KR20200145315A (en) * 2019-06-21 2020-12-30 주식회사 머니브레인 Method of predicting lip position for synthesizing a person's speech video based on modified cnn
CN112131988A (en) * 2020-09-14 2020-12-25 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for determining virtual character lip shape
CN112185420A (en) * 2020-09-27 2021-01-05 北京乐学帮网络技术有限公司 Pronunciation detection method and device, computer equipment and storage medium
CN112668407A (en) * 2020-12-11 2021-04-16 北京大米科技有限公司 Face key point generation method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US8793118B2 (en) Adaptive multimodal communication assist system
CN106485984B (en) Intelligent teaching method and device for piano
CN109063587B (en) Data processing method, storage medium and electronic device
CN108898115B (en) Data processing method, storage medium and electronic device
Laraba et al. Dance performance evaluation using hidden Markov models
CN111738016A (en) Multi-intention recognition method and related equipment
Bull et al. Automatic segmentation of sign language into subtitle-units
CN113223123A (en) Image processing method and image processing apparatus
Adhikary et al. A vision-based system for recognition of words used in indian sign language using mediapipe
Oliveira et al. The virtualsign channel for the communication between deaf and hearing users
KR102101496B1 (en) Ar-based writing practice method and program
CN110275987A (en) Intelligent tutoring consultant generation method, system, equipment and storage medium
US20230176911A1 (en) Task performance adjustment based on video analysis
CN113870395A (en) Animation video generation method, device, equipment and storage medium
WO2020007097A1 (en) Data processing method, storage medium and electronic device
CN115205764B (en) Online learning concentration monitoring method, system and medium based on machine vision
CN115936944A (en) Virtual teaching management method and device based on artificial intelligence
CN111539207A (en) Text recognition method, text recognition device, storage medium and electronic equipment
CN110956142A (en) Intelligent interactive training system
CN112599129B (en) Speech recognition method, apparatus, device and storage medium
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
CN110782916A (en) Multi-modal complaint recognition method, device and system
CN113096223A (en) Image generation method, storage medium, and electronic device
CN113990351A (en) Sound correction method, sound correction device and non-transient storage medium
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination