WO2022048404A1 - End-to-end virtual object animation generation method and apparatus, storage medium, and terminal - Google Patents
- Publication number
- WO2022048404A1 (application PCT/CN2021/111423)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- virtual object
- sequence
- pronunciation
- feature
- linguistic
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/80—2D [Two Dimensional] animation, e.g. using sprites
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Definitions
- performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence includes: performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain the linguistic feature of each pronunciation unit; and generating the corresponding linguistic feature sequence based on the linguistic feature of each pronunciation unit.
- performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain the linguistic feature of each pronunciation unit includes: for each pronunciation unit, analyzing the pronunciation feature of the pronunciation unit to obtain an independent linguistic feature of the pronunciation unit; and generating the linguistic feature based on the independent linguistic feature.
- performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain the linguistic feature of each pronunciation unit includes: for each pronunciation unit, analyzing the pronunciation feature of the pronunciation unit to obtain the independent linguistic feature of the pronunciation unit; analyzing the pronunciation features of the adjacent pronunciation units of the pronunciation unit to obtain the adjacent linguistic feature of the pronunciation unit; and generating the linguistic feature based on the independent linguistic feature and the adjacent linguistic feature.
- obtaining the adjacent linguistic feature of the pronunciation unit includes: counting the types of pronunciation features that the adjacent pronunciation units have and the number of pronunciation features of each type, and obtaining the adjacent linguistic feature according to the statistical results.
- inputting the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence includes: performing multi-dimensional information extraction on the linguistic feature sequence based on the preset time sequence mapping model, wherein the multiple dimensions include a time dimension and a linguistic feature dimension; and performing feature domain mapping and feature dimension transformation on the multi-dimensional information extraction result based on the preset time sequence mapping model to obtain expression parameters and/or action parameters of the virtual object, wherein the feature domain mapping refers to the mapping from the linguistic feature domain to the animation feature domain of the virtual object, and the animation feature domain of the virtual object includes the expression features and/or action features of the virtual object.
- the preset time sequence mapping model includes: a multi-layer convolutional network, used for receiving the linguistic feature sequence and performing multi-dimensional information extraction on the linguistic feature sequence; and a deep neural network, coupled with the multi-layer convolutional network and used to receive the multi-dimensional information extraction results output by the multi-layer convolutional network and to perform feature domain mapping and feature dimension transformation on them, so as to obtain the expression parameters and/or action parameters of the virtual object.
- the deep neural network includes: multiple fully connected layers connected in series; and a plurality of nonlinear transformation modules, respectively coupled between two adjacent fully connected layers except the last fully connected layer, wherein each nonlinear transformation module is configured to perform nonlinear transformation processing on the output of the preceding fully connected layer to which it is coupled and to input the result of the nonlinear transformation processing into the next fully connected layer to which it is coupled.
- an embodiment of the present invention also provides an end-to-end virtual object animation generation device, including: a receiving module for receiving input information, where the input information includes text information or audio information of the virtual object animation to be generated; a conversion module for converting the input information into a pronunciation unit sequence; a feature analysis module for performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and a mapping module for inputting the linguistic feature sequence into a preset time sequence mapping model, so as to generate a corresponding virtual object animation based on the linguistic feature sequence.
- the corresponding linguistic feature sequence in the original audio or text is extracted and used as the input of the preset time sequence mapping model. Linguistic features relate only to the semantic content of the audio and are independent of features that vary from speaker to speaker, such as timbre, pitch, and fundamental frequency (F0). Therefore, the solution in this embodiment is not limited to a specific speaker, and original audio with different audio characteristics can be applied to the preset time sequence mapping model described in this embodiment. In other words, the solution of this embodiment does not analyze the audio features in the audio information; instead, it converts the audio information into pronunciation units and analyzes the linguistic features of those units, so the neural network model can be driven to generate the virtual object animation without depending on specific audio features.
- FIG. 1 is a flowchart of an end-to-end virtual object animation generation method according to an embodiment of the present invention;
- FIG. 2 is a flowchart of a specific implementation of step S103 in FIG. 1;
- FIG. 3 is a flowchart of a specific implementation of step S104 in FIG. 1;
- FIG. 4 is a schematic structural diagram of an end-to-end virtual object animation generating apparatus according to an embodiment of the present invention.
- an embodiment of the present invention provides an end-to-end virtual object animation generation method, including: receiving input information, the input information including text information or audio information of the virtual object animation to be generated; converting the input information into a pronunciation unit sequence; performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and inputting the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence.
- the solution in this embodiment provides a more versatile end-to-end virtual object animation generation solution that can generate virtual object animation, especially 3D animation, quickly and automatically, and that accepts a wider range of inputs.
- the end-to-end virtual object animation generation method provided by the solution in this embodiment can be applied to end-to-end virtual object animation generation for any voice actor and any text. This removes the dependence of existing end-to-end automatic speech-synthesis virtual object animation technology on a specific voice actor and makes the technology truly "universal".
- a preset time sequence mapping model is constructed by training based on deep learning technology, and the preset time sequence mapping model maps the input linguistic feature sequence to the expression parameters and/or action parameters of the corresponding virtual object.
- the originally received input information may be text information or audio information, so that the solution of this embodiment can generate corresponding virtual object animations according to different input modalities.
- FIG. 1 is a flowchart of an end-to-end virtual object animation generation method according to an embodiment of the present invention.
- Arbitrary speaker can mean that there is no limit to the audio characteristics of the speaker.
- the virtual object may include a virtual person, and may also include multiple types of virtual objects such as virtual animals and virtual plants.
- Virtual objects can be three-dimensional or two-dimensional.
- End-to-end can refer to the computer operation from the input end to the output end, and there is no human (such as animator) intervention between the input end and the output end.
- the input terminal refers to the port for receiving original audio and original text, and the output terminal refers to the port for generating and outputting the virtual object animation.
- the virtual object animation output by the output terminal may include a controller for generating the virtual object animation, concretely represented as a sequence of digitized vectors.
- for example, the virtual object animation may include a lip animation; the controller of the lip animation output by the output terminal may include offset information of the lip feature points, and inputting the controller of the lip animation into the rendering engine drives the lips of the virtual object to make the corresponding actions.
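- As a rough illustration of a controller that carries lip feature point offsets, the following sketch adds one frame of offsets to neutral lip positions. The 3D point layout, the function name `apply_lip_offsets`, and the sample coordinates are assumptions made for illustration; the source does not specify the controller format beyond "offset information of the lip feature points".

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # assumed (x, y, z) layout of a lip feature point


def apply_lip_offsets(neutral: List[Point], offsets: List[Point]) -> List[Point]:
    """Add one frame of per-point offsets to the neutral lip feature points."""
    if len(neutral) != len(offsets):
        raise ValueError("neutral and offsets must contain the same number of points")
    return [
        (nx + ox, ny + oy, nz + oz)
        for (nx, ny, nz), (ox, oy, oz) in zip(neutral, offsets)
    ]


# Hypothetical data: two lip feature points (upper and lower lip) and one frame of offsets.
neutral_lips = [(0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
frame_offsets = [(0.0, 0.5, 0.0), (0.0, -0.5, 0.0)]
driven_lips = apply_lip_offsets(neutral_lips, frame_offsets)
```

- In a real system the rendering engine would consume such offsets frame by frame; here the addition only shows how an offset vector relates to the driven lip shape.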
- the controller for generating a virtual object animation may be a sequence of virtual object animation data, and the data in the sequence is arranged according to the time sequence of the input information and synchronized with the audio data obtained based on the input information.
- the facial expression movement and human posture movement of the virtual object can be driven by the virtual object animation data.
- the final virtual object animation can be obtained through the rendering engine.
- the virtual object animation data may include facial expression motion data and body motion data of the virtual object.
- the facial expression motion includes information such as expressions and the eyes, and the body motion may include human body posture information of the virtual object.
- the facial expression motion data is referred to as the expression parameters of the virtual object, and the body motion data is referred to as the action parameters of the virtual object.
- the end-to-end virtual object animation generation method described in this embodiment may include the following steps:
- Step S101: receiving input information, wherein the input information includes text information or audio information of the virtual object animation to be generated;
- Step S102: converting the input information into a pronunciation unit sequence;
- Step S103: performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence;
- Step S104: inputting the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence.
- the pronunciation unit sequence and the linguistic feature sequence are both time-aligned sequences.
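- The four-step flow (receive input, convert to pronunciation units, analyze linguistic features, map to animation) can be sketched as a small pipeline. This is a minimal sketch, not the patented implementation: the function `generate_animation` and the three toy stand-ins are hypothetical, and a real system would plug in a TTS front end, a feature analyzer, and a trained mapping model.

```python
from typing import Callable, List

Features = List[List[float]]


def generate_animation(
    input_info: str,
    to_units: Callable[[str], List[str]],
    analyze: Callable[[List[str]], Features],
    mapping_model: Callable[[Features], Features],
) -> Features:
    """Run the steps in order; input_info is the information received in S101."""
    units = to_units(input_info)       # S102: pronunciation unit sequence
    features = analyze(units)          # S103: linguistic feature sequence
    if len(features) != len(units):
        raise ValueError("feature sequence must stay aligned with the unit sequence")
    return mapping_model(features)     # S104: animation parameters per time step


# Toy stand-ins (hypothetical, for illustration only).
def toy_to_units(text: str) -> List[str]:
    return text.split()


def toy_analyze(units: List[str]) -> Features:
    return [[float(len(u))] for u in units]


def toy_model(features: Features) -> Features:
    return [[0.1 * f[0]] for f in features]


animation = generate_animation("n i h ao", toy_to_units, toy_analyze, toy_model)
```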
- the input information may be multimodal input, such as audio information expressed in the form of sound, or text information expressed in the form of text.
- the input information may be received from a client that needs to generate an animation of a virtual object.
- the input information may be audio information collected in real time based on a device such as a microphone, or text information input in real time based on a device such as a keyboard.
- the input information may be pre-collected or recorded audio information or text information, and is transmitted to the computing device executing the solution of this embodiment in a wired or wireless form when a corresponding virtual object animation needs to be generated.
- the input information can be divided into pronunciation unit sequences composed of the smallest pronunciation units, which are used as the data basis for the subsequent linguistic feature analysis.
- the step S102 may include the steps of: converting the input information into a pronunciation unit and a corresponding time code; performing a time alignment operation on the pronunciation unit according to the time code to obtain the time aligned pronunciation unit sequence.
- the time-aligned pronunciation unit sequence is simply referred to as a pronunciation unit sequence.
- each group of data includes a single pronunciation unit and a corresponding time code.
- the pronunciation units in the multiple sets of data can be aligned in time sequence, so as to obtain a time-aligned pronunciation unit sequence.
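- The time alignment operation can be pictured as ordering (pronunciation unit, time code) groups along the time axis. The sketch below assumes each time code is a start/end pair in seconds, which the source does not specify; `TimedUnit` and `align_units` are hypothetical names.

```python
from typing import List, NamedTuple


class TimedUnit(NamedTuple):
    unit: str     # a single pronunciation unit
    start: float  # assumed time code: start time in seconds
    end: float    # assumed time code: end time in seconds


def align_units(groups: List[TimedUnit]) -> List[TimedUnit]:
    """Order the groups by time code and reject overlapping units."""
    ordered = sorted(groups, key=lambda g: (g.start, g.end))
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start < prev.end:
            raise ValueError(f"units {prev.unit!r} and {cur.unit!r} overlap in time")
    return ordered


# Hypothetical groups arriving out of order.
groups = [
    TimedUnit("ao", 0.30, 0.50),
    TimedUnit("n", 0.00, 0.10),
    TimedUnit("h", 0.20, 0.30),
    TimedUnit("i", 0.10, 0.20),
]
aligned = align_units(groups)
unit_sequence = [g.unit for g in aligned]
```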
- the audio information may be converted into text information, and then the text information may be processed to obtain the pronunciation unit and the corresponding time code.
- the text information can be directly processed to obtain the pronunciation unit and the corresponding time code.
- the basic pronunciation units in the text information, together with their arrangement and duration information in the time dimension, can be extracted based on the front-end module and the alignment module of text-to-speech (TTS) technology, so as to obtain the time-aligned basic pronunciation unit sequence.
- the step S103 may include the following steps:
- Step S1031: performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain the linguistic feature of each pronunciation unit;
- the independent linguistic features can be used to characterize the pronunciation characteristics of a single pronunciation unit itself.
- the adjacent pronunciation units of the pronunciation unit may include a preset number of pronunciation units that are centered on the pronunciation unit and located before and after it in time sequence.
- the specific value of the preset number may be determined according to experiments, for example, according to the evaluation index during training of the preset time sequence mapping model.
- the statistical features on the right side of the pronunciation unit are uniformly zeroed.
- the independent linguistic features and the adjacent linguistic features of the pronunciation unit are combined to obtain the complete linguistic feature of the pronunciation unit.
- the linguistic feature of the pronunciation unit can be obtained by splicing the independent linguistic features and the adjacent linguistic features in quantized encoded form. That is, the linguistic feature of the pronunciation unit is a long array consisting of a series of quantized values.
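- The splicing of independent and adjacent linguistic features can be sketched as follows. The feature inventory (`vowel`, `nasal`, `plosive`), the tiny lookup table, and the window size of 2 are assumptions chosen for illustration; the source only states that the adjacent features come from counting the types and numbers of pronunciation features of neighbouring units.

```python
from typing import Dict, List, Set

# Hypothetical pronunciation-feature inventory and lookup table.
FEATURE_NAMES = ["vowel", "nasal", "plosive"]
PRONUNCIATION_FEATURES: Dict[str, Set[str]] = {
    "a": {"vowel"},
    "m": {"nasal"},
    "n": {"nasal"},
    "b": {"plosive"},
}


def independent_feature(unit: str) -> List[int]:
    """Binary encoding of the pronunciation features of a single unit."""
    own = PRONUNCIATION_FEATURES.get(unit, set())
    return [1 if name in own else 0 for name in FEATURE_NAMES]


def count_features(units: List[str]) -> List[int]:
    """For each feature type, count how many of the given units have it."""
    return [
        sum(name in PRONUNCIATION_FEATURES.get(u, set()) for u in units)
        for name in FEATURE_NAMES
    ]


def linguistic_features(sequence: List[str], window: int = 2) -> List[List[int]]:
    """Splice independent features with left and right neighbour statistics."""
    rows = []
    for i, unit in enumerate(sequence):
        left = sequence[max(0, i - window):i]
        right = sequence[i + 1:i + 1 + window]
        # Missing neighbours at the sequence edges simply contribute zero counts.
        rows.append(independent_feature(unit) + count_features(left) + count_features(right))
    return rows


features = linguistic_features(["m", "a", "n", "b"])
```

- Each row is a fixed-length array of quantized values, so the rows stack into the two-dimensional linguistic feature sequence consumed by the mapping model.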
- Step S1042 performing feature domain mapping and feature dimension transformation on the multi-dimensional information extraction result based on the preset time sequence mapping model to obtain the expression parameters and/or action parameters of the virtual object;
- an RNN can process the input features along the time dimension. To process the features in more dimensions and extract higher-dimensional feature information, thereby enhancing the generalization ability of the model, the input features can be processed based on a convolutional neural network (CNN) and its variants, such as dilated convolution and causal convolution.
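- To make the dilated and causal convolution variants concrete, here is a minimal single-channel sketch in plain Python. It is an illustration of the general technique, not the network in this embodiment: the kernel values and dilation rate are arbitrary assumptions, and a practical model would use a deep learning framework over multi-channel data.

```python
from typing import List


def dilated_causal_conv1d(x: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Causal: the output at time t only uses inputs at time t or earlier.
    Dilated: consecutive kernel taps are `dilation` steps apart.
    Positions before the start of the sequence are treated as zero.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
out = dilated_causal_conv1d(signal, kernel=[1.0, 1.0], dilation=2)
# out[t] = signal[t] + signal[t - 2], with out-of-range terms dropped
```

- Stacking such layers with growing dilation widens the span of time steps each output can see without adding taps per layer.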
- feature mapping models such as preset time series mapping models usually involve feature domain transformation and feature dimension transformation.
- this conversion function can be implemented based on a Fully Connected Network (FCN for short).
- the preset time sequence mapping model may be a model that can use time sequence information (such as text information and audio information aligned with time synchronization) to predict other time sequence information (such as virtual object animation).
- the training data of the preset time sequence mapping model may include text information, voice data synchronized with the text information, and virtual object animation data.
- a professional recording engineer and actor can express corresponding voice data and action data (one-to-one correspondence between voice and action) according to rich and emotional text information.
- the motion data includes facial expressions and body movements. Facial expressions and actions involve information such as expressions and eyes.
- the data of the virtual object facial expression controller is obtained.
- Body movements can be obtained by capturing high-quality posture information data of actors' performances through the performance capture platform, and body movement data and expression data have temporal correspondence.
- the corresponding virtual object animation data can be obtained by mapping based on the digitized vector sequence (ie, the linguistic feature sequence).
- the driving of body movements can also be implemented based on the controller.
- the driving of the limb movements may also be bone-driven.
- the preset time sequence mapping model may be a convolutional network-long short-term memory network-deep neural network (Convolutional LSTM Deep Neural Networks, CLDNN for short).
- the structure of the preset time sequence mapping model is not limited to this; for example, the preset time sequence mapping model may be any one of the above three networks, or a combination of any two of them.
- the preset time sequence mapping model may include: a multi-layer convolutional network, configured to receive the linguistic feature sequence and perform multi-dimensional information extraction on the linguistic feature sequence.
- the multi-layered convolutional network may include a four-layered dilated convolutional network for performing multi-dimensional information extraction on the quantized linguistic feature sequence processed in step S103.
- the linguistic feature sequence can be two-dimensional data. Assuming that each pronunciation unit is represented by a pronunciation feature of length 600 and there are 100 pronunciation units in total, the linguistic feature sequence input into the preset time sequence mapping model is a two-dimensional array of 100 × 600. The dimension of size 100 is the time dimension, and the dimension of size 600 is the linguistic feature dimension.
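- The two axes of this 100 × 600 example can be illustrated with a plain nested list. The zero values and the chosen indices are placeholders; only the shape follows the text.

```python
from typing import List, Tuple

NUM_UNITS = 100    # time dimension: one row per pronunciation unit
FEATURE_LEN = 600  # linguistic feature dimension: one column per quantized value

# Placeholder feature sequence filled with zeros.
sequence: List[List[float]] = [[0.0] * FEATURE_LEN for _ in range(NUM_UNITS)]


def shape(matrix: List[List[float]]) -> Tuple[int, int]:
    """Return (rows, columns) of a rectangular nested list."""
    return (len(matrix), len(matrix[0]) if matrix else 0)


# Fixing the time index gives the full linguistic feature of one pronunciation unit.
unit_feature = sequence[10]

# Fixing the feature index gives how one feature evolves over the whole sequence.
feature_track = [row[5] for row in sequence]
```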
- the multi-layer convolutional network performs feature operations in two dimensions, time and linguistic features.
- the preset time sequence mapping model may further include: a long short-term memory network for performing information aggregation processing on the information extraction results of the time dimension.
- the long short-term memory network may include a two-layer stacked bidirectional LSTM network, coupled with the multi-layer convolutional network to obtain the information extraction results of the linguistic feature sequence in the time dimension output by the multi-layer convolutional network. Further, the two-layer stacked bidirectional LSTM network performs high-dimensional information processing on these results, so as to further obtain feature information in the time dimension.
- the preset time sequence mapping model may further include: a deep neural network, coupled with the multi-layer convolutional network and the long short-term memory network, and used to perform feature domain mapping and feature dimension transformation on the multi-dimensional information extraction results output by the multi-layer convolutional network and the long short-term memory network, so as to obtain the expression parameters and/or action parameters of the virtual object.
- the deep neural network can receive the information extraction results of the linguistic feature dimension output by the multi-layer convolutional network, and can also receive the updated information extraction results of the time dimension output by the long short-term memory network.
- the dimension transformation may refer to dimension reduction.
- the input of the preset time series mapping model is 600 features, and the output is 100 features.
- the deep neural network may include: multiple fully connected layers connected in series, wherein the first fully connected layer is used to receive the multi-dimensional information extraction results, and the last fully connected layer outputs the virtual object expression parameters and/or action parameters.
- the number of the fully connected layers may be three.
- the deep neural network may further include: a plurality of nonlinear transformation modules, respectively coupled between two adjacent fully connected layers except the last fully connected layer; each nonlinear transformation module performs nonlinear transformation processing on the output of the preceding fully connected layer to which it is coupled and inputs the result to the next fully connected layer to which it is coupled.
- the nonlinear transformation module may be a rectified linear unit (ReLU) activation function.
- the nonlinear transformation module can improve the expression ability and generalization ability of the preset time series mapping model.
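- The arrangement of serially connected fully connected layers with a ReLU after every layer except the last can be sketched in plain Python. The layer sizes and weights below are toy assumptions (the text mentions three layers and a 600-to-100 reduction, which is far larger than this example), and `deep_network` is a hypothetical name.

```python
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def relu(v: Vector) -> Vector:
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]


def fully_connected(v: Vector, weights: List[List[float]], bias: Vector) -> Vector:
    """One fully connected layer: out[j] = sum_i weights[j][i] * v[i] + bias[j]."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def deep_network(v: Vector, layers: List[Layer]) -> Vector:
    """Apply the layers in series, with ReLU after every layer except the last."""
    for i, (weights, bias) in enumerate(layers):
        v = fully_connected(v, weights, bias)
        if i < len(layers) - 1:
            v = relu(v)
    return v


# Toy three-layer network mapping 2 inputs to 1 output.
layers: List[Layer] = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-3.0]),
    ([[2.0]], [-1.0]),
]
output = deep_network([1.0, -2.0], layers)
```

- Leaving the last layer linear lets the output take negative values, which matters when the predicted expression or action parameters are signed quantities such as offsets.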
- the multi-layer convolutional network, the long short-term memory network and the deep neural network can be connected in series in sequence: the information extraction results of the linguistic feature dimension output by the multi-layer convolutional network are transmitted to the deep neural network via the long short-term memory network, and the information extraction results of the time dimension output by the multi-layer convolutional network are processed by the long short-term memory network and then transmitted to the deep neural network.
- the solution of this embodiment can receive different types of input information, which broadens its scope of application and helps to further reduce the cost and improve the efficiency of animation production.
- the traditional end-to-end virtual object animation synthesis technology mainly generates two-dimensional animation, while the solution of this embodiment can generate high-quality three-dimensional animation, and can also generate two-dimensional animation.
- FIG. 4 is a schematic structural diagram of an end-to-end virtual object animation generating apparatus according to an embodiment of the present invention.
- The end-to-end virtual object animation generation apparatus 4 of this embodiment may be used to implement the method and technical solutions of the embodiments shown in FIG. 1 to FIG. 3.
- The end-to-end virtual object animation generation apparatus 4 of this embodiment may include: a receiving module 41, configured to receive input information, where the input information includes text information or audio information of the virtual object animation to be generated;
- a conversion module 42, configured to convert the input information into a pronunciation unit sequence;
- a feature analysis module 43, configured to perform feature analysis on the pronunciation unit sequence to obtain the corresponding linguistic feature sequence; and
- a mapping module 44, configured to input the linguistic feature sequence into a preset time sequence mapping model, so as to generate a corresponding virtual object animation based on the linguistic feature sequence.
- The end-to-end virtual object animation generation method described in this embodiment may be implemented based on an end-to-end virtual object animation generation system.
- The end-to-end virtual object animation generation system may include: a collection module for collecting the input information; and the end-to-end virtual object animation generation apparatus 4 shown in FIG. 4, wherein the receiving module 41 is coupled to the collection module to receive the input information, and the end-to-end virtual object animation generation apparatus 4 executes the end-to-end virtual object animation generation methods shown in FIG. 1 to FIG. 3 to generate a corresponding virtual object animation.
- The user provides input information at the collection module end and obtains the corresponding virtual object animation at the end of the end-to-end virtual object animation generation apparatus 4.
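The deep neural network of this embodiment (fully connected layers in series, possibly three, with a ReLU nonlinear transformation module between adjacent layers and the last layer producing the expression and/or action parameters) can be illustrated with a minimal sketch. It is written in plain Python so that it is self-contained. The function names `relu`, `fully_connected` and `dnn_forward`, the layer sizes and the weight values in `EXAMPLE_LAYERS` are all hypothetical and are not taken from the patent. The sketch also assumes one reading of the text, namely that a ReLU follows every fully connected layer except the last one.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights, bias); one weight row per output unit


def relu(values: Sequence[float]) -> Vector:
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def fully_connected(inputs: Sequence[float], weights: Matrix, bias: Vector) -> Vector:
    """One fully connected layer: output[i] = dot(weights[i], inputs) + bias[i]."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def dnn_forward(features: Sequence[float], layers: Sequence[Layer]) -> Vector:
    """Run the layers in series, with a ReLU after every layer except the last.

    Assumption: the final layer's raw output stands in for the virtual
    object's expression and/or action parameters.
    """
    out = list(features)
    for index, (weights, bias) in enumerate(layers):
        out = fully_connected(out, weights, bias)
        if index < len(layers) - 1:  # no nonlinearity after the last layer
            out = relu(out)
    return out


# Hypothetical three-layer stack with arbitrary weights (2 -> 2 -> 2 -> 1).
EXAMPLE_LAYERS: List[Layer] = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [-5.0]),
]

if __name__ == "__main__":
    # The two input values stand in for a multi-dimensional extraction result.
    print(dnn_forward([3.0, 1.0], EXAMPLE_LAYERS))
```

In a real system the input vector would be the multi-dimensional information extraction result received from the preceding networks, and the weights would be learned during training rather than fixed by hand.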
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Processing Or Creating Images (AREA)
Claims (19)
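- An end-to-end virtual object animation generation method, characterized by comprising: receiving input information, the input information comprising text information or audio information of a virtual object animation to be generated; converting the input information into a pronunciation unit sequence; performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and inputting the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence.
- The virtual object animation generation method according to claim 1, characterized in that converting the input information into a pronunciation unit sequence comprises: converting the input information into pronunciation units and corresponding time codes; and performing a time alignment operation on the pronunciation units according to the time codes to obtain the pronunciation unit sequence, wherein the pronunciation unit sequence is a time-aligned sequence.
- The virtual object animation generation method according to claim 2, characterized in that converting the input information into pronunciation units and corresponding time codes comprises: when the input information is audio information, converting the audio information into pronunciation units and corresponding time codes based on speech recognition technology and a preset pronunciation dictionary.
- The virtual object animation generation method according to claim 2, characterized in that converting the input information into pronunciation units and corresponding time codes comprises: when the input information is text information, converting the text information into pronunciation units and corresponding time codes based on speech synthesis technology.
- The virtual object animation generation method according to claim 1, characterized in that performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence comprises: performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain a linguistic feature of each pronunciation unit; and generating the corresponding linguistic feature sequence based on the linguistic feature of each pronunciation unit.
- The virtual object animation generation method according to claim 5, characterized in that performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain a linguistic feature of each pronunciation unit comprises: for each pronunciation unit, analyzing pronunciation features of the pronunciation unit to obtain an independent linguistic feature of the pronunciation unit; and generating the linguistic feature based on the independent linguistic feature.
- The virtual object animation generation method according to claim 5, characterized in that performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain a linguistic feature of each pronunciation unit comprises: for each pronunciation unit, analyzing pronunciation features of the pronunciation unit to obtain an independent linguistic feature of the pronunciation unit; analyzing pronunciation features of the adjacent pronunciation units of the pronunciation unit to obtain an adjacent linguistic feature of the pronunciation unit; and generating the linguistic feature based on the independent linguistic feature and the adjacent linguistic feature.
- The virtual object animation generation method according to claim 7, characterized in that analyzing pronunciation features of the adjacent pronunciation units of the pronunciation unit to obtain an adjacent linguistic feature of the pronunciation unit comprises: counting the types of pronunciation features that the adjacent pronunciation units have and the number of pronunciation features of the same type, and obtaining the adjacent linguistic feature from the statistical result.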
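- The virtual object animation generation method according to claim 1, characterized in that the preset time sequence mapping model is used to map the input linguistic feature sequence, in time order, to expression parameters and/or action parameters of the virtual object, so as to generate the corresponding virtual object animation.
- The virtual object animation generation method according to claim 9, characterized in that inputting the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence comprises: performing multi-dimensional information extraction on the linguistic feature sequence based on the preset time sequence mapping model, wherein the multiple dimensions comprise a time dimension and a linguistic feature dimension; and performing feature domain mapping and feature dimension transformation on the multi-dimensional information extraction results based on the preset time sequence mapping model to obtain the expression parameters and/or action parameters of the virtual object, wherein the feature domain mapping refers to mapping from a linguistic feature domain to a virtual object animation feature domain, and the virtual object animation feature domain comprises expression features and/or action features of the virtual object.
- The virtual object animation generation method according to claim 10, characterized in that the preset time sequence mapping model comprises: a multi-layer convolutional network, configured to receive the linguistic feature sequence and perform multi-dimensional information extraction on the linguistic feature sequence; and a deep neural network coupled to the multi-layer convolutional network, the deep neural network being configured to receive the multi-dimensional information extraction results output by the multi-layer convolutional network and to perform feature domain mapping and feature dimension transformation on the multi-dimensional information extraction results to obtain the expression parameters and/or action parameters of the virtual object.
- The virtual object animation generation method according to claim 11, characterized in that the deep neural network comprises: multiple fully connected layers connected in series; and a plurality of nonlinear transformation modules, each coupled between two adjacent fully connected layers other than the last fully connected layer, the nonlinear transformation modules being configured to perform nonlinear transformation processing on the output of the preceding coupled fully connected layer and to input the result of the nonlinear transformation processing to the next coupled fully connected layer.
- The virtual object animation generation method according to claim 10, characterized in that, after performing multi-dimensional information extraction on the linguistic feature sequence based on the preset time sequence mapping model and before performing feature domain mapping and feature dimension transformation on the multi-dimensional information extraction results based on the preset time sequence mapping model, the method further comprises: performing information aggregation processing on the information extraction result of the time dimension based on the preset time sequence mapping model, and updating the processing result as the information extraction result of the time dimension.
- The virtual object animation generation method according to claim 13, characterized in that the preset time sequence mapping model comprises: a long short-term memory network, configured to perform information aggregation processing on the information extraction result of the time dimension.
- The virtual object animation generation method according to claim 9, characterized in that the expression parameters of the virtual object comprise: a controller for generating lip animation.
- The virtual object animation generation method according to any one of claims 1 to 15, characterized in that the pronunciation unit sequence and the linguistic feature sequence are both time-aligned sequences.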
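- An end-to-end virtual object animation generation apparatus, characterized by comprising: a receiving module, configured to receive input information, the input information comprising text information or audio information of a virtual object animation to be generated; a conversion module, configured to convert the input information into a pronunciation unit sequence; a feature analysis module, configured to perform feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and a mapping module, configured to input the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence.
- A storage medium having a computer program stored thereon, characterized in that the computer program, when run by a processor, performs the steps of the method according to any one of claims 1 to 16.
- A terminal, comprising a memory and a processor, the memory storing a computer program capable of running on the processor, characterized in that the processor performs the steps of the method according to any one of claims 1 to 16 when running the computer program.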
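The feature analysis step of claims 5 to 8 (an independent linguistic feature per pronunciation unit, plus an adjacent linguistic feature obtained by counting the feature types of neighbouring units and how many of each type occur) can be sketched as follows. This is only an illustrative reading of the claims: the feature table `PRONUNCIATION_FEATURES`, the function names, the dictionary output format and the `window` parameter (how many neighbours on each side count as adjacent) are assumptions that do not appear in the patent.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set

# Hypothetical pronunciation-feature table; the claims do not list concrete features.
PRONUNCIATION_FEATURES: Dict[str, Set[str]] = {
    "b": {"consonant", "plosive", "bilabial"},
    "a": {"vowel", "open"},
    "m": {"consonant", "nasal", "bilabial"},
    "i": {"vowel", "close"},
}


def independent_feature(unit: str) -> List[str]:
    """Pronunciation features of the unit itself (independent linguistic feature)."""
    return sorted(PRONUNCIATION_FEATURES.get(unit, set()))


def adjacent_feature(units: Sequence[str], index: int, window: int = 1) -> Dict[str, int]:
    """Count feature types over the neighbours of units[index].

    Keys are the feature types seen among the neighbours; values are how
    many neighbours carry that type. `window` is an assumed parameter.
    """
    counts: Counter = Counter()
    start = max(0, index - window)
    stop = min(len(units), index + window + 1)
    for j in range(start, stop):
        if j != index:  # skip the unit itself, only neighbours are counted
            counts.update(PRONUNCIATION_FEATURES.get(units[j], set()))
    return dict(counts)


def linguistic_feature_sequence(units: Sequence[str], window: int = 1) -> List[dict]:
    """Combine independent and adjacent features for every unit, in order."""
    return [
        {
            "unit": unit,
            "independent": independent_feature(unit),
            "adjacent": adjacent_feature(units, i, window),
        }
        for i, unit in enumerate(units)
    ]


if __name__ == "__main__":
    for entry in linguistic_feature_sequence(["b", "a", "m"]):
        print(entry)
```

Because the output list keeps the order of the input units, a time-aligned pronunciation unit sequence yields a time-aligned linguistic feature sequence, which matches the requirement of claim 16. In practice each entry would be encoded as a numeric vector before being passed to the time sequence mapping model.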
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/023,993 US11810233B2 (en) | 2020-09-01 | 2021-08-09 | End-to-end virtual object animation generation method and apparatus, storage medium, and terminal |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010905550.3 | 2020-09-01 | ||
CN202010905550.3A CN112184859B (zh) | 2020-09-01 | 2020-09-01 | End-to-end virtual object animation generation method and apparatus, storage medium, and terminal |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022048404A1 (zh) | 2022-03-10 |
Family
ID=73925584
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/111423 WO2022048404A1 (zh) | 2020-09-01 | 2021-08-09 | End-to-end virtual object animation generation method and apparatus, storage medium, and terminal |
Country Status (3)
Country | Link |
---|---|
US (1) | US11810233B2 (zh) |
CN (1) | CN112184859B (zh) |
WO (1) | WO2022048404A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019226964A1 (en) * | 2018-05-24 | 2019-11-28 | Warner Bros. Entertainment Inc. | Matching mouth shape and movement in digital video to alternative audio |
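CN112184858B (zh) * | 2020-09-01 | 2021-12-07 | 魔珐(上海)信息科技有限公司 | Text-based virtual object animation generation method and apparatus, storage medium, and terminal |
CN112184859B (zh) * | 2020-09-01 | 2023-10-03 | 魔珐(上海)信息科技有限公司 | End-to-end virtual object animation generation method and apparatus, storage medium, and terminal |
CN117541321B (zh) * | 2024-01-08 | 2024-04-12 | 北京烽火万家科技有限公司 | Advertisement production and publishing method and *** based on a virtual digital human |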
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
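CN104361620A (zh) * | 2014-11-27 | 2015-02-18 | 韩慧健 | Mouth-shape animation synthesis method based on a comprehensive weighting algorithm |
CN106653052A (zh) * | 2016-12-29 | 2017-05-10 | Tcl集团股份有限公司 | Method and apparatus for generating virtual face animation |
CN108447474A (zh) * | 2018-03-12 | 2018-08-24 | 北京灵伴未来科技有限公司 | Modeling and control method for synchronizing virtual character speech and mouth shape |
CN109377540A (zh) * | 2018-09-30 | 2019-02-22 | 网易(杭州)网络有限公司 | Facial animation synthesis method and apparatus, storage medium, processor, and terminal |
US20190130628A1 (en) * | 2017-10-26 | 2019-05-02 | Snap Inc. | Joint audio-video facial animation system |
CN111145322A (zh) * | 2019-12-26 | 2020-05-12 | 上海浦东发展银行股份有限公司 | Method, device, and computer-readable storage medium for driving a virtual avatar |
CN112184859A (zh) * | 2020-09-01 | 2021-01-05 | 魔珐(上海)信息科技有限公司 | End-to-end virtual object animation generation method and apparatus, storage medium, and terminal |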
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
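CN107564511B (zh) * | 2017-09-25 | 2018-09-11 | 平安科技(深圳)有限公司 | Electronic device, speech synthesis method, and computer-readable storage medium |
CN110379430B (zh) * | 2019-07-26 | 2023-09-22 | 腾讯科技(深圳)有限公司 | Speech-based animation display method and apparatus, computer device, and storage medium |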
2020
- 2020-09-01 CN CN202010905550.3A patent/CN112184859B/zh active Active
2021
- 2021-08-09 WO PCT/CN2021/111423 patent/WO2022048404A1/zh active Application Filing
- 2021-08-09 US US18/023,993 patent/US11810233B2/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
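CN104361620A (zh) * | 2014-11-27 | 2015-02-18 | 韩慧健 | Mouth-shape animation synthesis method based on a comprehensive weighting algorithm |
CN106653052A (zh) * | 2016-12-29 | 2017-05-10 | Tcl集团股份有限公司 | Method and apparatus for generating virtual face animation |
US20190130628A1 (en) * | 2017-10-26 | 2019-05-02 | Snap Inc. | Joint audio-video facial animation system |
CN108447474A (zh) * | 2018-03-12 | 2018-08-24 | 北京灵伴未来科技有限公司 | Modeling and control method for synchronizing virtual character speech and mouth shape |
CN109377540A (zh) * | 2018-09-30 | 2019-02-22 | 网易(杭州)网络有限公司 | Facial animation synthesis method and apparatus, storage medium, processor, and terminal |
CN111145322A (zh) * | 2019-12-26 | 2020-05-12 | 上海浦东发展银行股份有限公司 | Method, device, and computer-readable storage medium for driving a virtual avatar |
CN112184859A (zh) * | 2020-09-01 | 2021-01-05 | 魔珐(上海)信息科技有限公司 | End-to-end virtual object animation generation method and apparatus, storage medium, and terminal |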
Also Published As
Publication number | Publication date |
---|---|
US11810233B2 (en) | 2023-11-07 |
CN112184859A (zh) | 2021-01-05 |
CN112184859B (zh) | 2023-10-03 |
US20230267665A1 (en) | 2023-08-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
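WO2022048405A1 (zh) | Text-based virtual object animation generation method and apparatus, storage medium, and terminal | |
WO2022048404A1 (zh) | End-to-end virtual object animation generation method and apparatus, storage medium, and terminal | |
CN110223705B (zh) | Voice conversion method, apparatus, device, and readable storage medium | |
Vougioukas et al. | Video-driven speech reconstruction using generative adversarial networks | |
CN113408385A (zh) | Audio-video multimodal emotion classification method and *** | |
CN110992987A (zh) | Parallel feature extraction *** and method for general specific speech in speech signals | |
CN104867489B (zh) | Method and *** for simulating real-person read-aloud pronunciation | |
WO2022116432A1 (zh) | Multi-style audio synthesis method, apparatus, device, and storage medium | |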
Padi et al. | Improved speech emotion recognition using transfer learning and spectrogram augmentation | |
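CN116863038A (zh) | Method for generating digital-human speech and facial animation from text | |
Sager et al. | Vesus: A crowd-annotated database to study emotion production and perception in spoken english. | |
CN116311456A (zh) | Personalized virtual-human expression generation method based on multimodal interaction information | |
AlBadawy et al. | Voice Conversion Using Speech-to-Speech Neuro-Style Transfer. | |
CN117251057A (zh) | Method and *** for constructing an AI digital-intelligent human based on AIGC | |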
Gasparini et al. | Sentiment recognition of Italian elderly through domain adaptation on cross-corpus speech dataset | |
Zhao et al. | Research on voice cloning with a few samples | |
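CN112700520B (zh) | Formant-based mouth-shape and expression animation generation method, apparatus, and storage medium | |
CN115910083A (zh) | Real-time voice conversion method, apparatus, electronic device, and medium | |
Preciado-Grijalva et al. | Speaker fluency level classification using machine learning techniques | |
CN115731917A (zh) | Speech data processing method, model training method, apparatus, and storage medium | |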
Liu et al. | Speech-gesture GAN: gesture generation for robots and embodied agents | |
Ghosh et al. | Automatic speech-gesture mapping and engagement evaluation in human robot interaction | |
Mansouri et al. | Human Laughter Generation using Hybrid Generative Models. | |
Guo et al. | HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS | |
TWI712032B (zh) | Method for converting speech into a virtual face image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21863471; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 21863471; Country of ref document: EP; Kind code of ref document: A1 |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05.09.2023) |
122 | Ep: pct application non-entry in european phase | Ref document number: 21863471; Country of ref document: EP; Kind code of ref document: A1 |