CN115631268A - Virtual image generation method and device, electronic equipment and computer storage medium - Google Patents

Virtual image generation method and device, electronic equipment and computer storage medium Download PDF

Info

Publication number
CN115631268A
Authority
CN
China
Prior art keywords
text information
information
facial feature
virtual image
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211258254.4A
Other languages
Chinese (zh)
Inventor
唐旻杰
梁超
陈苏全
王开新
陈云琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Innovation Technology Co Ltd
Original Assignee
Mobvoi Innovation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Innovation Technology Co Ltd
Priority to CN202211258254.4A
Publication of CN115631268A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the invention discloses a virtual image generation method and apparatus, an electronic device, and a computer storage medium. The method comprises: acquiring text information; performing speech synthesis on the text information to generate audio features corresponding to the text information; determining facial feature points from the audio features based on a preset neural network model; and generating a target virtual image from the facial feature points. By generating the facial feature points from text information and generating the target virtual image based on those feature points, the method improves both the efficiency of virtual image generation and its generalization capability.

Description

Virtual image generation method and device, electronic equipment and computer storage medium
Technical Field
The invention relates to the field of computer technology, and in particular to a virtual image generation method and apparatus, an electronic device, and a computer storage medium.
Background
With the rapid development of digital human synthesis technology, virtual images (avatars) have become widely used. Existing avatar generation algorithms are usually implemented by training a neural network on the audio and video data of a single person. However, this approach requires a large amount of audio and video data, the network is computationally heavy, the generated avatar is limited to a single character, and the avatar is generated slowly and therefore inefficiently.
Disclosure of Invention
In view of this, embodiments of the present invention provide an avatar generation method, apparatus, electronic device and computer storage medium to improve the efficiency and generalization capability of avatar generation.
In a first aspect, an embodiment of the present invention provides an avatar generation method, where the method includes:
acquiring text information;
carrying out voice synthesis on the text information to generate audio features corresponding to the text information;
determining facial feature points according to the audio features based on a preset neural network model;
and generating a target virtual image according to the facial feature points.
Further, the generating of the audio feature corresponding to the text information includes:
preprocessing the text information, and determining morpheme information corresponding to the text information;
determining pinyin information corresponding to the morpheme information;
determining phoneme information corresponding to the pinyin information based on a preset mapping relation between pinyin and phonemes to determine phoneme information corresponding to the text information;
and determining the audio features corresponding to the text information according to the phoneme information and a preset speech synthesis model.
Further, the preset neural network model comprises an encoding module, a facial feature generation module and a decoding module, and the determining facial feature points according to the audio features based on the preset neural network model comprises:
based on the coding module, the audio features are coded, and a voice feature code corresponding to the audio features is determined;
determining facial features corresponding to the voice feature codes based on the facial feature module;
and based on the decoding module decoding the facial features, determining facial feature points corresponding to the audio features.
Further, the encoding module is determined based on a residual network, the facial feature module is determined based on a recurrent neural network, and the decoding module includes a fully-connected layer.
Further, the generating a target avatar from the facial feature points includes:
drawing a face sketch according to the facial feature points;
and inputting the face sketch and the real image picture into a preset generator network module to determine the target virtual image.
Further, the method further comprises:
and inputting the target virtual image into a preset discriminator network module to verify the authenticity of the target virtual image.
Further, the method further comprises:
and splicing according to the target virtual image and the audio information to determine the video image of the target virtual image.
In a second aspect, an embodiment of the present invention provides an avatar generation apparatus, including:
the text unit is used for acquiring text information;
the audio characteristic unit is used for carrying out voice synthesis on the text information and generating audio characteristics corresponding to the text information;
the facial feature unit is used for determining facial feature points according to the audio features based on a preset neural network model;
and an avatar unit for generating a target avatar based on the facial feature points.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is used to store one or more computer program instructions, where the one or more computer program instructions are executed by the processor to implement the method described in any one of the above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps described in any one of the above.
According to the technical solution of the embodiment of the invention, text information is obtained, speech synthesis is performed on the text information to generate the corresponding audio features, facial feature points are determined from the audio features based on a preset neural network model, and a target virtual image is generated from the facial feature points. By generating facial feature points from text information and generating the target avatar based on them, this embodiment improves both the efficiency and the generalization capability of avatar generation.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of an avatar generation method of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an avatar generation system in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of generating audio features corresponding to text information according to an embodiment of the present invention;
FIG. 4 is a flow chart of determining facial feature points according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of generating a target avatar from facial feature points in an embodiment of the present invention;
fig. 6 is a schematic view of an avatar generation apparatus of an embodiment of the present invention;
fig. 7 is another schematic view of an avatar generation apparatus according to an embodiment of the present invention;
fig. 8 is a schematic view of an electronic device of an embodiment of the invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
With the rapid development of digital human synthesis technology, virtual images are widely used. At present, avatars are often generated by training a neural network on the audio-visual data of a single person. This approach not only requires a large amount of audio and video data for the character, involves heavy network computation, yields a single avatar, and generates it slowly and inefficiently, but also means that one model can generate only one avatar and cannot produce video of an arbitrary avatar. Whenever a new avatar is needed, the network model must be retrained. Moreover, the data processing speed of existing avatar generation methods is low, making it difficult to satisfy usage scenarios with high real-time requirements.
In view of this, the embodiment of the present invention provides an avatar generation method to improve the efficiency and generalization capability of avatar generation.
Fig. 1 is a flowchart of an avatar generation method according to an embodiment of the present invention. As shown in fig. 1, the avatar generation method in the present embodiment includes the following steps.
In step S110, text information is acquired.
In the present embodiment, the text information corresponds to a character sequence made up of at least one character. The text information may be text content pre-stored in a database, or may be text content to be processed received in real time.
In step S120, the text information is speech-synthesized, and an audio feature corresponding to the text information is generated.
In this embodiment, the text information is converted into the corresponding audio features by performing speech synthesis on it. Optionally, when generating the audio features from the text information, the phoneme information corresponding to the text information is determined first, and the audio features are then derived from that phoneme information. Because the audio features are generated from text, there is no need to record a real person's voice and convert it into audio features in order to generate the target avatar; target avatars with different appearances can be generated from the same text information, which improves the generalization capability of the avatar generation method.
Optionally, in this embodiment, a natural language processing method is used to determine phoneme information corresponding to the text information, and determine an audio feature corresponding to the text information according to the phoneme information and a preset speech synthesis model, so as to improve determination efficiency of the audio feature corresponding to the text information.
In step S130, facial feature points are determined from the audio features based on a preset neural network model.
In this embodiment, the audio features corresponding to the text information are input to a preset neural network model for processing, and the facial feature points are determined from its output. The facial feature points comprise feature point coordinates and feature point action information: the coordinates represent the positions of the feature points in the avatar's face image, and the action information represents action sequences such as mouth shapes and eye blinks.
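For illustration only, the following minimal Python sketch shows one possible data layout for such a facial feature point sequence; the field names are assumptions and are not disclosed in this embodiment.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FacialFeatureFrame:
    # (x, y) position of each landmark in the avatar face image
    landmarks: List[Tuple[float, float]]
    # per-frame action tags, e.g. mouth shape or blink state
    actions: List[str]

# A generated sequence would then be one frame per audio time step:
# sequence: List[FacialFeatureFrame] = model(audio_features)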
In step S140, a target avatar is generated from the facial feature points.
In the embodiment, when the target virtual image is generated according to the facial feature points, the face sketch is drawn according to the facial feature points, and then the target virtual image is determined according to the face sketch and the real image picture. Optionally, in the present embodiment, the target avatar is determined by inputting the face sketch and the real avatar picture into a preset generator network module.
In the technical solution of this embodiment, the facial feature points are determined from audio information that is itself generated from the text information, which reduces the dependence on the audio of a specific real person and effectively avoids having to train a separate model for each avatar. After the facial feature points are determined, the target avatar is generated from them, so the generated avatars are richer and more varied, which helps improve the generalization capability of avatar generation. Moreover, by changing the generated sequence of facial feature points, the actions and micro-expressions of the target avatar can be changed, making the generated avatar more vivid and natural and further improving its visual effect.
Optionally, in order to further improve the usability of the generated target avatar, after the target avatar is generated it is input to a preset discriminator network module to verify its authenticity. Verifying the authenticity confirms that the target avatar is synthetic, which ensures that the synthesized avatar, rather than the real image used during its generation, is what enters the user's subsequent processing, and so avoids degrading the final result.
Optionally, after the target avatar is generated, it is spliced together with the audio information to determine a video image of the target avatar. Splicing the generated avatar with the audio information allows the avatar to be presented as video, which enriches the avatar's functions and improves its usability.
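As an illustrative sketch only (the patent does not specify a tool), the splicing of generated avatar frames with the synthesized audio can be done by muxing them with ffmpeg; the paths, frame rate, and codecs below are assumptions.

import subprocess

def mux_frames_with_audio(frame_pattern: str, audio_path: str, out_path: str, fps: int = 25) -> None:
    """Encode numbered PNG frames and mux them with an audio track into one video."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", frame_pattern,   # e.g. "frames/%05d.png"
        "-i", audio_path,      # e.g. "speech.wav"
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",
        out_path,
    ], check=True)

# mux_frames_with_audio("frames/%05d.png", "speech.wav", "avatar.mp4")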
According to the technical solution of the embodiment of the invention, text information is obtained, speech synthesis is performed on it to generate the corresponding audio features, facial feature points are determined from the audio features based on a preset neural network model, and a target virtual image is generated from the facial feature points. By generating facial feature points from text information and generating the target avatar based on them, this embodiment improves both the efficiency and the generalization capability of avatar generation. In addition, verifying the generated target avatar and splicing it with the audio information to determine its video image enriches the avatar's functions and improves the result.
Fig. 2 is a schematic diagram of an avatar generation system of an embodiment of the present invention. As shown in fig. 2, the avatar generation system in this embodiment includes a speech synthesis network 10, a facial feature generation network 20, and an avatar generation network 30. The speech synthesis network 10 processes the text information and generates the corresponding audio features. The facial feature generation network 20 determines the corresponding facial feature points from those audio features. The avatar generation network 30 generates the corresponding target avatar from the facial feature points and a real image picture, and further generates the corresponding video from the target avatar, as sketched below. In this way, the embodiment generates facial feature points from text information and generates the target avatar from them, which improves the generalization capability of avatar generation while also improving its efficiency.
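As a high-level sketch only, the three networks of fig. 2 can be thought of as being chained as follows; the stage functions stand in for the speech synthesis network 10, the facial feature generation network 20, and the avatar generation network 30, and their names and signatures are assumptions.

def generate_avatar_video(text, reference_image, tts_net, feature_net, avatar_net):
    audio_features, waveform = tts_net(text)           # text -> audio features (Mel) + audio
    landmark_sequence = feature_net(audio_features)    # audio features -> facial feature points
    frames = [avatar_net(points, reference_image)      # feature points + real picture -> avatar frame
              for points in landmark_sequence]
    return frames, waveform                            # frames are later muxed with the audio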
Optionally, as shown in fig. 2, the speech synthesis network 10 in this embodiment includes a morpheme layer 11, a pinyin layer 12, a phoneme layer 13, and an audio feature layer 14. When generating the audio features from the text information, the morpheme layer 11, pinyin layer 12, and phoneme layer 13 determine the phoneme information corresponding to the text information, and the audio feature layer 14 then determines the audio features corresponding to the text information from that phoneme information.
Fig. 3 is a flowchart of generating audio features corresponding to text information according to an embodiment of the present invention. As shown in fig. 3, in the present embodiment, the audio feature corresponding to the text information is generated based on the following steps.
In step S310, the text information is preprocessed to determine morpheme information corresponding to the text information.
In this embodiment, the received text information is preprocessed by the morpheme layer 11 in the speech synthesis network 10 to determine the morpheme information corresponding to the text information. A morpheme is the smallest meaningful unit in a language; it satisfies the three conditions of being minimal, having sound, and having meaning, and its main function is to serve as material for forming words. "Minimal" means a morpheme is smaller than higher-level semantic units such as words and phrases; "having sound" refers to the morpheme's phonetic form; "having meaning" refers to its lexical or grammatical meaning. Morphemes are commonly divided into three word-building types: monosyllabic morphemes written with a single character (e.g., heaven, earth, human), disyllabic morphemes written with two characters (e.g., generous, leisurely, coral), and polysyllabic morphemes written with more than two characters (e.g., brandy, petrolatum, crackle).
Optionally, the preprocessing of the text information in this embodiment includes text structure analysis. During text structure analysis, the language of the text information is analyzed and a label is added to each sentence segment in that language. The text is also split into paragraphs, sentences and words, meaningless input is deleted from the text information, and the morpheme information corresponding to the text information is determined.
Further, in this embodiment the preprocessing also includes text normalization. Text normalization analyzes the text structure and surrounding context and converts the non-standard portions of the text (anything other than ordinary words) into the corresponding words; the text obtained after structure analysis and normalization is taken as the morpheme information corresponding to the text information. Specifically, text normalization typically handles abbreviations, dates, formulas and numbers contained in the text so that they are pronounced correctly, deletes characters that have no pronunciation, and performs full-width/half-width and simplified/traditional conversion. Commonly used normalization methods are regular-expression matching and replacement, or extracting semantic information for judgment with statistical models, machine learning, or deep learning.
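A minimal, assumption-laden Python sketch of this normalization step is given below: it expands digits character by character and strips symbols that have no pronunciation; a real system would use richer rules or a learned model.

import re

_DIGITS = "零一二三四五六七八九"

def normalize_text(text: str) -> str:
    # naive per-digit expansion (not a full number reader)
    text = re.sub(r"\d", lambda m: _DIGITS[int(m.group())], text)
    # drop symbols that are neither word characters, CJK characters, nor common punctuation
    text = re.sub(r"[^\w\u4e00-\u9fff，。！？、]", "", text)
    return text

# normalize_text("2023年发布!")  ->  "二零二三年发布"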
In step S320, pinyin information corresponding to the morpheme information is determined.
In this embodiment, the pinyin layer 12 in the speech synthesis network 10 determines the pinyin information corresponding to the morpheme information. Pinyin information describes how syllables are spelled; a syllable is the basic unit of speech structure composed of one or several phonemes. In Chinese, for example, a pinyin syllable is formed by rapidly and continuously joining the initial, medial and final according to the syllable-formation rules of Mandarin and adding a tone. Optionally, in this embodiment the pinyin information corresponding to the morpheme information is determined with methods such as polyphone prediction, third-tone sandhi, tone-invariance handling, retroflex (erhua) prediction, and neutral-tone prediction.
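As a sketch of this step, the open-source pypinyin package (an assumption; the embodiment does not name a specific tool) can convert morphemes to tone-numbered pinyin, which preserves the tone information needed for the sandhi and neutral-tone handling mentioned above.

from pypinyin import lazy_pinyin, Style

def to_pinyin(morphemes: str):
    # Style.TONE3 appends the tone number to each syllable
    return lazy_pinyin(morphemes, style=Style.TONE3)

# to_pinyin("虚拟形象")  ->  ['xu1', 'ni3', 'xing2', 'xiang4']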
In step S330, the phoneme information corresponding to the pinyin information is determined based on a preset mapping relationship between pinyin and phonemes, so as to determine the phoneme information corresponding to the text information.
In this embodiment, the phoneme layer 13 in the speech synthesis network 10 determines the phoneme information corresponding to the text information. A phoneme is the smallest unit of speech divided according to the natural attributes of speech; it is analyzed from the pronunciation actions within a syllable, and each action constitutes one phoneme. For example, the phonemes corresponding to the Chinese morpheme for "speech" (hua) are "h", "u", and "a".
Optionally, the mapping between pinyin and phonemes in this embodiment is obtained from the pronunciation information recorded in a pronunciation dictionary: the phonetic notation sequence corresponding to the pinyin information is found by querying the pronunciation dictionary and is taken as the phoneme information, thereby determining the phoneme information corresponding to the text information.
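The dictionary lookup can be sketched as follows; the tiny lexicon here is purely illustrative, whereas a real system would load a full pronunciation dictionary covering every Mandarin syllable.

LEXICON = {
    "hua4": ["h", "u", "a", "4"],
    "xu1":  ["x", "v", "1"],
    "ni3":  ["n", "i", "3"],
}

def pinyin_to_phonemes(pinyin_seq):
    phonemes = []
    for syllable in pinyin_seq:
        phonemes.extend(LEXICON.get(syllable, ["<unk>"]))  # unknown syllables are marked
    return phonemes

# pinyin_to_phonemes(["hua4"])  ->  ["h", "u", "a", "4"]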
It should be understood that the text information in this embodiment may be in Chinese or another language, and the way the phoneme information is determined for text in different languages can be adjusted according to the composition and pronunciation principles of the corresponding language.
Further, to improve the efficiency of determining the phoneme information corresponding to the text information, this embodiment may use an existing natural language processing model: the text information is input to a preset, pre-trained natural language processing model, which processes it and outputs the corresponding phoneme information.
In step S340, an audio feature corresponding to the text information is determined according to the phoneme information and a preset speech synthesis model.
In this embodiment, the audio feature layer 14 in the speech synthesis network 10 determines the audio feature corresponding to the phoneme information.
In order to improve the efficiency of determining the audio features and facilitate the subsequent rapid production of the target avatar, in this embodiment, a preset speech synthesis model is set in the audio feature layer 14, and after determining the phoneme information corresponding to the text information, the audio features corresponding to the text information are determined according to the phoneme information through the preset speech synthesis model.
Optionally, this embodiment uses text-to-speech (TTS) technology to convert the text information into audio information. TTS is typically implemented with a concatenation method or a parametric method. In the parametric method, speech parameters (such as formant frequencies and other acoustic features) are generated at each time step by a statistical model, the parameters are converted into waveforms, and the corresponding target sound is synthesized from those waveforms. In the concatenation method, the speech signal is formed by splicing together basic units such as syllables or phonemes: the language, phoneme (or syllable), prosody and emotion information are generated from the input text, and the target sound is synthesized by extracting the corresponding phonemes or syllables. Speech synthesized with the concatenation method is generally of higher quality than speech synthesized with the parametric method. Therefore, in this embodiment, speech synthesis of the text information is based on the concatenation method, and the audio features corresponding to the text information are generated accordingly, which improves the quality of the audio features and in turn the quality of the generated avatar.
Further, in this embodiment the preset speech synthesis model uses a TTS acoustic model: the input phoneme information is encoded by the acoustic model into a feature code, the feature code is processed to generate the corresponding Mel spectrum, and the Mel spectrum is used as the audio feature corresponding to the text information. A Mel spectrum is formed by first converting the time-domain sound signal into the frequency domain with a fast Fourier transform (FFT), where the value at each frequency represents the strength of the signal at that frequency in the current frame, and then passing the resulting amplitude spectrum through a Mel filter bank, which maps it onto the Mel scale, on which human hearing is more sensitive.
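For illustration, the Mel spectrum described above can be computed from a waveform with librosa as follows (the library choice and the FFT, hop and filter-bank sizes are assumptions, not the configuration of this embodiment). Note that in the pipeline of fig. 2 the Mel spectrum is predicted directly from the phoneme information by the acoustic model rather than extracted from a recording; the snippet only shows what the feature itself is.

import librosa

def mel_features(wav_path: str, sr: int = 22050):
    y, sr = librosa.load(wav_path, sr=sr)      # time-domain signal
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=1024,       # FFT window: time domain -> frequency domain
        hop_length=256,   # frame shift
        n_mels=80,        # number of Mel filter-bank channels
    )
    return librosa.power_to_db(mel)            # log-Mel spectrum, shape (80, frames)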
According to the above technical solution, the text information is preprocessed to determine the corresponding morpheme information, the pinyin information corresponding to the morpheme information is determined, the phoneme information corresponding to the pinyin information is determined from the preset mapping between pinyin and phonemes, and the audio features corresponding to the text information are determined from the phoneme information with the preset speech synthesis model. This reduces the dependence of the audio features on a specific real person and makes it easy to generate different avatars from the same text information.
Alternatively, as shown in fig. 2, the facial feature generation network 20 in this embodiment adopts a preset neural network model, and determines facial feature points according to audio features based on the preset neural network model. Further, the preset neural network model in this embodiment includes an encoding module 21, a facial feature generation module 22, and a decoding module 23, and the facial feature points are determined according to the audio features corresponding to the text information by the encoding module 21, the facial feature generation module 22, and the decoding module 23.
Fig. 4 is a flowchart of determining facial feature points according to an embodiment of the present invention. As shown in fig. 4, the method for determining facial feature points based on audio features in the present embodiment includes the following steps.
In step S410, the audio feature is encoded based on the encoding module, and a speech feature encoding corresponding to the audio feature is determined.
In this embodiment, the encoding module 21 encodes the audio features in the mel-frequency spectrum form to generate the speech feature codes corresponding to the audio features. Optionally, the encoding module 21 in this embodiment determines based on a Residual network, which includes a Residual Unit (4-layer Residual Unit) 211 and a Spatial Attention Unit (Spatial Attention) 212.
In step S420, the facial feature module determines a facial feature corresponding to the speech feature code.
In this embodiment, the facial feature module 22 processes the speech feature codes to determine the corresponding facial features. Optionally, since the audio feature to be processed in this embodiment is a time-series feature, the facial feature module 22 is based on a recurrent neural network. Furthermore, the recurrent neural network adopts a GRU model; compared with other types of recurrent neural networks, the GRU trains more efficiently and computes faster, which helps improve the training efficiency of the facial feature points.
In step S430, the decoding module decodes the facial features, and determines facial feature points corresponding to the audio features.
In this embodiment, the decoding module 23 decodes the facial features to determine the facial feature points corresponding to them. Optionally, the decoding module 23 in this embodiment includes a fully-connected layer (Dense Layer) 231 and an activation layer 232: the facial features are passed through the activation function preset in the activation layer 232 and decoded by the fully-connected layer 231 to determine the facial feature points.
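The following PyTorch sketch puts the three modules together in a minimal form: a residual encoder with spatial attention, a GRU facial feature module, and a fully-connected decoder with an activation. All layer sizes, the landmark count, and the ordering of the dense and activation layers are assumptions; the embodiment does not disclose exact dimensions.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
    def forward(self, x):
        # attention map from channel-wise average and maximum
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn

class AudioToLandmarks(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_points=68):
        super().__init__()
        self.stem = nn.Conv2d(1, 64, 3, padding=1)
        self.encoder = nn.Sequential(*[ResidualBlock(64) for _ in range(4)])  # 4-layer residual unit
        self.attention = SpatialAttention()
        self.gru = nn.GRU(input_size=64 * n_mels, hidden_size=hidden, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(hidden, 2 * n_points),  # fully-connected layer
            nn.Sigmoid(),                     # activation -> normalized (x, y) coordinates
        )

    def forward(self, mel):                   # mel: (batch, frames, n_mels)
        x = mel.unsqueeze(1)                  # (batch, 1, frames, n_mels)
        x = self.attention(self.encoder(self.stem(x)))
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # per-frame speech feature codes
        x, _ = self.gru(x)                    # temporal facial features
        return self.decoder(x)                # (batch, frames, 2 * n_points)

# AudioToLandmarks()(torch.randn(2, 100, 80)).shape  ->  torch.Size([2, 100, 136])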
According to the above technical solution, the audio features are encoded by the residual network to determine the corresponding speech feature codes, the speech feature codes are input to the recurrent neural network to determine the corresponding facial features, and the facial features are decoded through the activation layer and the fully-connected layer to determine the facial feature points corresponding to the audio features, so that the target avatar can subsequently be generated from this facial feature information.
Optionally, as shown in fig. 2, the avatar generation network 30 in this embodiment includes a sampling layer (Down-sampling) 31 and a generator network module (Sequential Generator) 32, and the target avatar is generated from the facial feature points by the sampling layer 31 and the generator network module 32. Because the facial feature points are determined from the audio features corresponding to the text information, the target avatar can be generated without depending on the audio features of a specific real person; avatars of various appearances can be generated, which improves the efficiency of target avatar generation and, at the same time, the generalization capability of avatar generation.
Fig. 5 is a schematic diagram of generating a target avatar from facial feature points in an embodiment of the present invention. As shown in fig. 5, the method of generating a target avatar according to facial feature points of the present embodiment includes the following steps.
In step S510, a draft of the face is drawn from the facial feature points.
In this embodiment, the sampling layer 31 draws a face sketch from the facial feature points generated by the facial feature generation network and from the position information of those feature points within the sketch.
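A rough sketch of this drawing step (with the canvas size and dot rendering as assumptions; the sampling layer may instead draw connected facial contours) is:

import cv2
import numpy as np

def draw_face_sketch(landmarks, size=(256, 256)):
    """landmarks: iterable of (x, y) in [0, 1] -> single-channel sketch image."""
    h, w = size
    canvas = np.zeros((h, w), dtype=np.uint8)
    for x, y in landmarks:
        cx, cy = int(x * (w - 1)), int(y * (h - 1))
        cv2.circle(canvas, (cx, cy), 2, 255, -1)  # filled dot at each feature point
    return canvas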
In step S520, a target avatar is determined according to the face sketch and the real image picture.
In this embodiment, the generator network module 32 generates the target avatar corresponding to the real image from the face sketch and the real image picture. Optionally, the image features corresponding to the face sketch and the real image picture are input to the preset generator network module, which extracts multi-resolution, multi-scale features from the input facial key points at different temporal scales and generates the target avatar corresponding to the real image.
Optionally, as shown in fig. 2, the avatar generation network 30 of this embodiment further includes a conditional discriminator network module (Conditional Discriminator) 33 in addition to the sampling layer 31 and the generator network module 32. The discriminator network module 33 verifies the authenticity of the target avatar and confirms that it is synthetic, which ensures the accuracy of the target avatar while avoiding the use of the corresponding real image, so the final result is not affected.
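A compact, pix2pix-style sketch of how the generator network module 32 and the conditional discriminator 33 could be wired together is shown below; the architectures are deliberately simplified assumptions, whereas the modules of this embodiment use multi-resolution features over different temporal scales.

import torch
import torch.nn as nn

def conv_block(cin, cout, down=True):
    conv = nn.Conv2d(cin, cout, 4, 2, 1) if down else nn.ConvTranspose2d(cin, cout, 4, 2, 1)
    return nn.Sequential(conv, nn.BatchNorm2d(cout), nn.ReLU())

class Generator(nn.Module):
    """Face sketch (1 channel) + real image picture (3 channels) -> avatar frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(1 + 3, 64), conv_block(64, 128),   # encode
            conv_block(128, 64, down=False),              # decode
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )
    def forward(self, sketch, reference):
        return self.net(torch.cat([sketch, reference], dim=1))

class ConditionalDiscriminator(nn.Module):
    """Scores whether a frame is an authentic-looking rendering of the given sketch."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(1 + 3, 64), conv_block(64, 128),
            nn.Conv2d(128, 1, 4, 1, 1),                   # patch-wise realism scores
        )
    def forward(self, sketch, frame):
        return self.net(torch.cat([sketch, frame], dim=1))

# sketch, ref = torch.rand(1, 1, 256, 256), torch.rand(1, 3, 256, 256)
# frame = Generator()(sketch, ref)                        # (1, 3, 256, 256)
# score = ConditionalDiscriminator()(sketch, frame)       # patch scores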
According to the technical solution of this embodiment, the sampling layer in the image generation network draws the face sketch from the facial feature points, and the generator network module generates the target avatar corresponding to the real image from the face sketch and the real image picture. As a result, an arbitrary target avatar can be generated simply by supplying a different real image picture, without retraining each model in the avatar generation system for every real image picture. This improves the generalization capability of the avatar generation method and system, allows different avatars to be generated, and improves the efficiency, accuracy and real-time performance of avatar generation.
Fig. 6 is a schematic diagram of an avatar generation apparatus according to an embodiment of the present invention. As shown in fig. 6, the avatar generation apparatus in the present embodiment includes a text unit 1, an audio feature unit 2, a face feature unit 3, and an avatar unit 4. The text unit 1 is used for acquiring text information. The audio feature unit 2 is configured to perform speech synthesis on the text information, and generate an audio feature corresponding to the text information. The facial feature unit 3 is used for determining facial feature points according to the audio features based on a preset neural network model. The avatar unit 4 is used to generate a target avatar from the facial feature points.
Optionally, the avatar generation apparatus in this embodiment further includes an authentication unit 5. The authentication unit 5 verifies the authenticity of the target avatar to confirm that it is a generated (synthetic) image.
Further, the avatar generation apparatus in this embodiment also includes a video unit 6. The video unit 6 splices the target avatar with the audio information to determine the video image of the target avatar.
Fig. 7 is another schematic view of an avatar generation apparatus according to an embodiment of the present invention. As shown in fig. 7, the text unit 1 in this embodiment includes a morpheme determining module 11, a pinyin determining module 12, and a phoneme determining module 13. The morpheme determining module 11 preprocesses the text information and determines the morpheme information corresponding to it. The pinyin determining module 12 determines the pinyin information corresponding to the morpheme information. The phoneme determining module 13 determines the phoneme information corresponding to the pinyin information based on the preset mapping relationship between pinyin and phonemes, so as to determine the phoneme information corresponding to the text information.
Optionally, the facial feature unit 3 in this embodiment includes a feature encoding module 31, a facial feature generation module 32, and a feature decoding module 33. The feature encoding module 31 encodes the audio features and determines the corresponding speech feature codes. The facial feature generation module 32 determines the facial features corresponding to the speech feature codes. The feature decoding module 33 decodes the facial features and determines the facial feature points corresponding to the audio features.
Alternatively, the avatar unit 4 in the present embodiment includes a sketch drawing module 41 and an avatar generation module 42. The sketch drawing module 41 is configured to draw a face sketch according to the facial feature points. The image generation module 42 is used for determining a corresponding target virtual image according to the face sketch and the real image picture.
Fig. 8 is a schematic diagram of an electronic device of an embodiment of the invention. As shown in fig. 8, the electronic device has a general computer hardware structure that includes at least a processor 81 and a memory 82, which are connected by a bus 83. The memory 82 is adapted to store instructions or programs executable by the processor 81. The processor 81 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 81 processes data and controls other devices by executing the instructions stored in the memory 82, so as to carry out the method flows of the embodiments of the present invention described above. The bus 83 connects the above components together and also connects them to a display controller 84, a display device, and an input/output (I/O) device 85. The input/output (I/O) device 85 may be a mouse, keyboard, modem, network interface, touch input device, motion-sensing input device, printer, or another device known in the art. Typically, the input/output devices 85 are coupled to the system through an input/output (I/O) controller 86.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention is directed to a non-volatile storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the method for implementing the embodiments described above may be accomplished by specifying the relevant hardware through a program, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An avatar generation method, the method comprising:
acquiring text information;
carrying out voice synthesis on the text information to generate audio features corresponding to the text information;
determining facial feature points according to the audio features based on a preset neural network model;
and generating a target virtual image according to the facial feature points.
2. The method according to claim 1, wherein the generating the audio feature corresponding to the text information comprises:
preprocessing the text information, and determining morpheme information corresponding to the text information;
determining pinyin information corresponding to the morpheme information;
determining phoneme information corresponding to the pinyin information based on a preset mapping relation between pinyin and phonemes to determine phoneme information corresponding to the text information;
and determining the audio features corresponding to the text information according to the phoneme information and a preset speech synthesis model.
3. The method of claim 1, wherein the preset neural network model comprises an encoding module, a facial feature generation module and a decoding module, and wherein the determining facial feature points from the audio features based on the preset neural network model comprises:
based on the encoding module, encoding the audio features, and determining a voice feature code corresponding to the audio features;
determining facial features corresponding to the speech feature codes based on the facial feature module;
and determining the facial feature points corresponding to the audio features based on the decoding of the facial features by the decoding module.
4. The method of claim 1, wherein the encoding module is determined based on a residual network, wherein the facial feature module is determined based on a recurrent neural network, and wherein the decoding module comprises a fully-connected layer.
5. The method of claim 1, wherein the generating a target avatar from the facial feature points comprises:
drawing a face sketch according to the facial feature points;
and inputting the face sketch and the real image picture into a preset generator network module to determine the target virtual image.
6. The method of claim 1, further comprising:
and inputting the target virtual image into a preset discriminator network module to verify the authenticity of the target virtual image.
7. The method of claim 1, further comprising:
and splicing according to the target virtual image and the audio information to determine the video image of the target virtual image.
8. An avatar generation apparatus, the apparatus comprising:
the text unit is used for acquiring text information;
the audio characteristic unit is used for carrying out voice synthesis on the text information and generating audio characteristics corresponding to the text information;
the facial feature unit is used for determining facial feature points according to the audio features based on a preset neural network model;
and an avatar unit for generating a target avatar based on the facial feature points.
9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-7.
CN202211258254.4A 2022-10-13 2022-10-13 Virtual image generation method and device, electronic equipment and computer storage medium Pending CN115631268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211258254.4A CN115631268A (en) 2022-10-13 2022-10-13 Virtual image generation method and device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211258254.4A CN115631268A (en) 2022-10-13 2022-10-13 Virtual image generation method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN115631268A true CN115631268A (en) 2023-01-20

Family

ID=84904895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211258254.4A Pending CN115631268A (en) 2022-10-13 2022-10-13 Virtual image generation method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN115631268A (en)

Similar Documents

Publication Publication Date Title
KR102246943B1 (en) Method of multilingual text-to-speech synthesis
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
US12033611B2 (en) Generating expressive speech audio from text data
CN108899009B (en) Chinese speech synthesis system based on phoneme
Tran et al. Improvement to a NAM-captured whisper-to-speech system
CN115516552A (en) Speech recognition using synthesis of unexplained text and speech
US20030144842A1 (en) Text to speech
CN102203853B (en) Method and apparatus for synthesizing a speech with information
KR20230043084A (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
KR20150146373A (en) Method and apparatus for speech synthesis based on large corpus
CN115485766A (en) Speech synthesis prosody using BERT models
CN108231062A (en) A kind of voice translation method and device
CN112735371B (en) Method and device for generating speaker video based on text information
CN112735454A (en) Audio processing method and device, electronic equipment and readable storage medium
KR20210059586A (en) Method and Apparatus for Emotional Voice Conversion using Multitask Learning with Text-to-Speech
Dunbar et al. Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
CN115101046A (en) Method and device for synthesizing voice of specific speaker
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN113112575B (en) Mouth shape generating method and device, computer equipment and storage medium
CN115424604B (en) Training method of voice synthesis model based on countermeasure generation network
CN112634861B (en) Data processing method, device, electronic equipment and readable storage medium
Petrushin et al. Whispered speech prosody modeling for TTS synthesis
CN115631268A (en) Virtual image generation method and device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination