CN111081270B - Real-time audio-driven virtual character mouth shape synchronous control method - Google Patents

Real-time audio-driven virtual character mouth shape synchronous control method

Info

Publication number
CN111081270B
CN111081270B (application CN201911314031.3A)
Authority
CN
China
Prior art keywords
mouth shape
real
probability
phoneme
virtual character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911314031.3A
Other languages
Chinese (zh)
Other versions
CN111081270A (en)
Inventor
朱风云 (Zhu Fengyun)
陈博 (Chen Bo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Real Time Intelligent Technology Co ltd
Original Assignee
Dalian Real Time Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Real Time Intelligent Technology Co ltd filed Critical Dalian Real Time Intelligent Technology Co ltd
Priority to CN201911314031.3A
Publication of CN111081270A
Application granted
Publication of CN111081270B
Legal status: Active

Classifications

    • G10L 21/10 — Transforming speech into visible information
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 21/18 — Details of the transformation of speech into a non-audible representation
    • G10L 25/57 — Speech or voice analysis specially adapted for processing of video signals
    • H04N 21/4307 — Synchronising the rendering of multiple content streams or additional data on devices
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units
    • G10L 2021/105 — Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a real-time audio-driven virtual character mouth shape synchronization control method. The method comprises the following steps: identifying viseme probabilities from a real-time voice stream; filtering the viseme probabilities; converting the sampling rate of the viseme probabilities to match the rendering frame rate of the virtual character; and converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering. The method removes the need to transmit phoneme-sequence or mouth-shape-sequence information in synchrony with the audio stream, significantly reduces the complexity, coupling, and implementation difficulty of the system, and is suitable for a wide range of applications that render virtual characters on a display device.

Description

Real-time audio-driven virtual character mouth shape synchronous control method
Technical Field
The invention belongs to the field of virtual character posture control, and particularly relates to a real-time audio-driven virtual character mouth shape synchronous control method.
Background
Virtual character modeling and rendering techniques are widely used in the animation, game, and film industries. Giving the virtual character natural, smooth mouth movements that are synchronized with its speech is key to a good user experience. In a real-time system, audio acquired as a stream and the synchronously rendered virtual character must be played back together, and synchronization between the audio and the character's mouth shape must be maintained throughout.
The application scenarios include:
1. the real-time audio is speech generated by a speech synthesizer;
1.1, the phoneme sequence corresponding to the speech can be obtained as a synchronized stream;
1.2, the phoneme sequence corresponding to the speech cannot be obtained as a synchronized stream;
2. the real-time audio is speech uttered by a person.
In scenario 1.1, the phoneme sequence corresponding to the speech is available in synchrony with the audio, so it can be converted into a mouth shape movement sequence that drives the virtual character's mouth. However, obtaining the phoneme sequence synchronously requires additional communication protocol support in the application to guarantee time alignment between the speech and the phoneme sequence, which increases system complexity and coupling and makes implementation difficult.
In scenarios 1.2 and 2, the phoneme sequence corresponding to the speech cannot be obtained synchronously, so a control method is needed that can drive the virtual character's mouth shape from the real-time audio data alone.
Therefore, to handle the case where the phoneme sequence corresponding to the speech cannot be obtained synchronously, a method is needed that recognizes a mouth shape sequence from the audio and uses it to drive the virtual character's mouth shape changes in synchrony.
Disclosure of Invention
The invention provides a real-time audio-driven virtual character mouth shape synchronization control method, aimed at the following problem: in a real-time audio streaming scenario, a virtual character must be displayed on the device, the character's speech is obtained from the real-time audio stream, and the character's mouth shape must stay synchronized with the speech content.
A real-time audio-driven virtual character mouth shape synchronization control method comprises the following steps:
identifying viseme probabilities from the real-time voice stream; the viseme probabilities are obtained by combining the probabilities of phonemes that belong to the same viseme, based on a preset phoneme-to-viseme mapping;
filtering the viseme probabilities;
converting the sampling rate of the viseme probabilities to match the rendering frame rate of the virtual character;
converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering.
In the above method, the viseme probabilities are obtained either directly by a viseme recognition method, or by identifying phoneme probabilities from the real-time voice stream with phoneme recognition and converting them into viseme probabilities.
In the above method, each viseme probability is smoothed separately with a finite or infinite impulse response filter.
In the above method, the viseme probabilities are converted into a standard mouth shape configuration as follows: first, a standard mouth shape configuration is defined for each viseme, in the form of a key frame or of parameters describing the mouth shape; second, the viseme probabilities are converted by a mapping function into mixing ratios of the standard mouth shape configurations. In a key-frame scenario, the mixing ratio is the interpolation ratio between key frames; in a key-point, bone, or blendshape parameter scenario, the mixing ratio is the mixing ratio of the respective mouth shape parameters.
In the above method, to keep audio and video synchronized during playback, the audio stream is delayed by a compensating amount so that its content lines up with the video stream.
In the above method, the length of the buffer used for delay compensation is determined by the processing delays of viseme recognition, filtering, and video rendering.
In the above method, phoneme recognition comprises framing the voice stream and extracting features, followed by phoneme estimation using those features.
In the above method, the phonemes are IPA-defined phonemes or custom phonemes.
In the above method, the delay is compensated as follows: audio delay compensation = framing delay + feature-splicing delay + phoneme-recognition delay + filtering delay − video-rendering delay.
For the case where a phoneme sequence corresponding to the speech cannot be obtained synchronously, the invention provides a method that recognizes a mouth shape sequence from the audio and uses it to drive the virtual character's mouth shape changes in synchrony. The method removes the need to transmit phoneme-sequence or mouth-shape-sequence information along with the audio stream, significantly reduces the complexity, coupling, and implementation difficulty of the system, and is suitable for a wide range of applications that render virtual characters on a display device.
Compared with the prior art, the invention has the following advantages:
By rendering the virtual character locally on the device, the video signal does not have to be rendered on a server and transmitted over the network, which saves a large amount of communication bandwidth and reduces operating cost.
By recognizing the mouth shape locally on the device, mouth shape information does not have to be transmitted with the audio, no communication-layer synchronization between audio and mouth shape is required, and the communication protocol and implementation are simpler.
By using the probability output of the phoneme or viseme recognition model directly as the mixing ratio of the standard mouth shape parameters, there is no need to convert the probabilities into phoneme or viseme category labels with a Viterbi decoding algorithm, which reduces implementation difficulty.
Because the invention infers the mouth shape parameter mixing ratios directly from the audio signal without Viterbi decoding, it avoids the systematic delay introduced by decoding; compared with a decoding-based method, the system response time is shortened by about one second, which greatly reduces interaction latency in real-time interactive scenarios and improves the user experience.
Drawings
FIG. 1 is a flowchart illustrating a method for controlling the mouth shape synchronization of a virtual character driven by real-time audio according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a real-time audio-driven virtual character mouth shape synchronization control method according to a second embodiment of the present invention;
fig. 3 is a flowchart of a virtual character mouth shape synchronization control method driven by real-time audio according to a third embodiment of the present invention.
Detailed Description
Embodiments of the invention will be described below with reference to the drawings, but it should be appreciated that the invention is not limited to the embodiments described and that various modifications of the invention are possible without departing from the basic idea. The scope of the invention is therefore intended to be limited solely by the appended claims.
As shown in fig. 1, the real-time audio-driven virtual character mouth shape synchronization control method of the present invention comprises the following steps:
identifying viseme probabilities from the real-time voice stream;
filtering the viseme probabilities;
converting the sampling rate of the viseme probabilities to match the rendering frame rate of the virtual character;
converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering.
As shown in fig. 2, a real-time audio-driven virtual character mouth shape synchronization control method according to another embodiment of the present invention includes the following steps:
step 1, phoneme recognition
Step 1.1, feature extraction
And framing the voice stream, and extracting the features.
Framing takes one frame of length L every H sampling points from the continuous voice stream, so consecutive frames overlap by L − H samples.
Feature extraction converts each frame of data into some representation, such as a magnitude spectrum, phase spectrum, band energies, cepstral coefficients, or linear prediction coefficients.
Feature extraction may also leave the voice data unprocessed and use the raw audio samples directly as the feature.
After the feature for each frame has been obtained, differential features can additionally be computed from temporally adjacent frames and appended to the original features.
The features of temporally adjacent frames can also be spliced (concatenated), with the spliced result used as the feature.
The differencing and splicing operations may be used together.
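For illustration only, the following Python sketch implements the framing, one possible spectral feature, and neighbour-frame splicing described above. The 16 kHz sample rate, 25 ms frame length (L = 400), 10 ms hop (H = 160), Hann window, and log-magnitude spectrum are assumptions of the sketch, not values prescribed by the invention.

```python
import numpy as np

def frame_stream(samples: np.ndarray, L: int = 400, H: int = 160) -> np.ndarray:
    """Cut the waveform into frames of length L taken every H samples;
    consecutive frames overlap by L - H samples."""
    n_frames = 1 + (len(samples) - L) // H
    return np.stack([samples[i * H : i * H + L] for i in range(n_frames)])

def log_spectrum(frames: np.ndarray) -> np.ndarray:
    """One possible feature: log-magnitude spectrum of each Hann-windowed frame."""
    windowed = frames * np.hanning(frames.shape[1])
    return np.log(np.abs(np.fft.rfft(windowed, axis=1)) + 1e-8)

def splice(features: np.ndarray, context: int = 2) -> np.ndarray:
    """Concatenate each frame's features with those of its temporal neighbours
    ('context' frames on each side, edges padded by repetition)."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i : i + len(features)] for i in range(2 * context + 1)], axis=1
    )

# Example: 1 s of placeholder audio at 16 kHz, 25 ms frames, 10 ms hop.
audio = np.random.randn(16000).astype(np.float32)
feats = splice(log_spectrum(frame_stream(audio)))
print(feats.shape)   # (n_frames, (2*context + 1) * n_fft_bins)
```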
Step 1.2 phoneme probability estimation
Phoneme probability estimation uses a statistical machine learning model to estimate, from the input features, the probability that each frame corresponds to each phoneme.
The phonemes may be those defined by the IPA (International Phonetic Alphabet) or phonemes defined by other standards.
Taking Chinese as an example, a custom phoneme set that can be adopted is:
b p m f d t n l
g h j q x z c s
zh ch sh ng a o e i
ii iii u v er sil
where ng represents the final of neng, i the final of yi, ii the final of zi, and iii the final of zhi; sil denotes silence.
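The sketch below shows the shape of such a per-frame phoneme posterior estimate. The actual statistical model is not specified by the invention; here it is reduced, purely as an assumption, to a single affine layer plus softmax, with W and b standing in for a trained acoustic model.

```python
import numpy as np

# The custom Chinese phoneme set listed above (30 phonemes including silence).
PHONEMES = ["b", "p", "m", "f", "d", "t", "n", "l",
            "g", "h", "j", "q", "x", "z", "c", "s",
            "zh", "ch", "sh", "ng", "a", "o", "e", "i",
            "ii", "iii", "u", "v", "er", "sil"]

def phoneme_posteriors(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Estimate per-frame phoneme probabilities from the input features.
    W and b are placeholders for a real trained model (e.g. a neural network)."""
    logits = features @ W + b                       # (n_frames, n_phonemes)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)     # each row sums to 1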
Step 2, phoneme-to-viseme probability conversion
The viseme probabilities are obtained by summing the probabilities of all phonemes that map to the same viseme, according to a preset phoneme-to-viseme mapping.
The predetermined mapping relationship may follow different design criteria and is not limited to the specific embodiment given in the present invention.
Taking Chinese as an example, the mapping may be:
Viseme    Phonemes
b b/p/m
d d/t/n
z z/c/s
zh zh/ch/sh
j j/q/x
k k/h/l/g/ng
a a
o o
e e/er
i i/ii/iii
u u/v
sil sil
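A minimal sketch of this merging step follows, using the mapping table above. The dictionary and function names are illustrative; phonemes that the table does not list explicitly (for example "f") would also have to be assigned to some viseme, and the fallback used here is an assumption.

```python
import numpy as np

# Grouping taken from the mapping table above.
PHONEME_TO_VISEME = {
    "b": "b", "p": "b", "m": "b",
    "d": "d", "t": "d", "n": "d",
    "z": "z", "c": "z", "s": "z",
    "zh": "zh", "ch": "zh", "sh": "zh",
    "j": "j", "q": "j", "x": "j",
    "k": "k", "h": "k", "l": "k", "g": "k", "ng": "k",
    "a": "a", "o": "o", "e": "e", "er": "e",
    "i": "i", "ii": "i", "iii": "i",
    "u": "u", "v": "u",
    "sil": "sil",
}
VISEMES = ["b", "d", "z", "zh", "j", "k", "a", "o", "e", "i", "u", "sil"]

def phonemes_to_visemes(phoneme_probs: np.ndarray, phonemes: list) -> np.ndarray:
    """Sum the probabilities of all phonemes that map to the same viseme."""
    viseme_probs = np.zeros((phoneme_probs.shape[0], len(VISEMES)))
    for col, ph in enumerate(phonemes):
        viseme = PHONEME_TO_VISEME.get(ph, "sil")   # fallback assignment is an assumption
        viseme_probs[:, VISEMES.index(viseme)] += phoneme_probs[:, col]
    return viseme_probs
```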
Step 3, carrying out smooth filtering on the obtained viseme probability
Because the probability estimates of the statistical machine learning model are not perfectly accurate, the result is usually refined by combining information across multiple frames so that the probabilities change smoothly over time.
The smoothing can use a finite impulse response filter applied to each viseme probability track separately; the filter order and coefficients can be tuned to the required system response time.
In the simplest case, an order-10 moving-average FIR filter can be used; in practice, other filter designs may be chosen.
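As a sketch of that simplest case, the snippet below applies a causal order-10 moving-average FIR filter to each viseme track; the choice of a causal filter and of SciPy's lfilter are assumptions, and any other FIR or IIR design could be substituted.

```python
import numpy as np
from scipy.signal import lfilter

def smooth_viseme_probs(viseme_probs: np.ndarray, order: int = 10) -> np.ndarray:
    """Causal moving-average FIR filter applied to each viseme track separately
    (columns of viseme_probs), matching the order-10 example in the text."""
    b = np.ones(order) / order      # moving-average coefficients
    return lfilter(b, [1.0], viseme_probs, axis=0)
```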
Step 4, resampling the viseme probability stream to the video frame rate
Since the feature extraction in step 1 frames the voice stream, the data frames are produced at a rate of (audio sampling rate / H) Hz, i.e. one frame every H / (audio sampling rate) seconds.
The sampling rate at which video is rendered is typically based on the refresh rate of the display device.
It is therefore desirable to use resampling to match the sampling rate of the data frames to the video sampling rate.
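The resampling method is not prescribed by the invention; the sketch below uses simple linear interpolation on a shared time axis, and the 16 kHz / 160-sample-hop / 60 fps defaults are assumptions.

```python
import numpy as np

def resample_to_video_rate(viseme_probs: np.ndarray, audio_sr: int = 16000,
                           hop: int = 160, video_fps: float = 60.0) -> np.ndarray:
    """Resample viseme probability tracks from the analysis frame rate
    (audio_sr / hop frames per second) to the video rendering frame rate."""
    n_in = viseme_probs.shape[0]
    t_in = np.arange(n_in) * hop / audio_sr                        # analysis-frame times (s)
    t_out = np.arange(0.0, n_in * hop / audio_sr, 1.0 / video_fps)  # video-frame times (s)
    return np.stack(
        [np.interp(t_out, t_in, viseme_probs[:, k]) for k in range(viseme_probs.shape[1])],
        axis=1,
    )
```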
Step 5, converting the viseme probability to the standard mouth shape mixing proportion
The virtual character rendering system typically defines a standard mouth shape configuration for each viseme, either as a key frame or as parameters describing the mouth shape.
The viseme probabilities can be converted to a mix ratio for a standard mouth-shape configuration by a linear or non-linear mapping function.
In a key frame scenario, the blending ratio may be an interpolated ratio between different key frames.
In the context of key-point, bone, or blendshape parameters, the blending ratio may be a mixing ratio of those parameters.
Taking one frame of data as an example, suppose the viseme probabilities are:
Viseme    Probability
b 0.0
d 0.0
z 0.0
zh 0.0
j 0.0
k 0.0
a 0.6
o 0.4
e 0.0
i 0.0
u 0.0
sil 0.0
Assume the mapping function from viseme probabilities to mixing ratios is linear. Taking a key-point parameter scenario as an example, two-dimensional key-point parameters are defined as:
a (0.2, 0.8)
o (0.7, 0.3)
The key-point parameters corresponding to the viseme probabilities above are mixed as a × 0.6 + o × 0.4, so the key-point parameters of the current frame are (0.4, 0.6).
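The following sketch reproduces this worked example with a linear mapping, using the viseme probabilities as mixing weights. The dictionary of standard shapes contains only the two visemes used in the example; a real system would define one entry per viseme (or key frames, bone, or blendshape parameters instead), so the names and defaults here are assumptions.

```python
import numpy as np

# Standard mouth shape configuration per viseme (2-D key-point parameters,
# matching the worked example above).
STANDARD_SHAPES = {
    "a": np.array([0.2, 0.8]),
    "o": np.array([0.7, 0.3]),
}

def blend_mouth_shape(viseme_probs: dict) -> np.ndarray:
    """Linear mapping: mix the standard shapes using the viseme probabilities
    as weights."""
    params = np.zeros(2)
    for viseme, p in viseme_probs.items():
        params += p * STANDARD_SHAPES.get(viseme, np.zeros(2))
    return params

# Reproduces the example: 0.6 * a + 0.4 * o -> key-point parameters (0.4, 0.6)
print(blend_mouth_shape({"a": 0.6, "o": 0.4}))   # [0.4 0.6]
```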
Step 6, mouth shape rendering using the viseme probabilities
The virtual character rendering system renders the character image according to the mixed mouth shape configuration, producing a video stream.
Step 7, synchronously playing audio and video
Because the voice stream passes through framing, splicing, phoneme recognition, smoothing filtering, and other processing stages, each of which introduces some systematic delay, the audio stream must be delayed by a compensating amount during playback so that the audio and video content stay synchronized.
The delay can be calculated by accumulating the delays of the processing elements.
Since there is also some delay in video rendering, the delay of the video rendering system needs to be subtracted when calculating the audio delay.
Taking a common scenario as an example:
the audio delay compensation amount is the framing delay + feature splicing delay + phoneme recognition delay + smoothing filter delay-video rendering delay.
Fig. 3 shows a third embodiment of the present invention. It differs from the second embodiment of fig. 2 in that viseme recognition is performed directly on the speech stream, without the phoneme recognition and phoneme-to-viseme probability conversion steps.
Compared with the method of fig. 2, the accuracy of the viseme probability estimation is slightly lower, but this has essentially no effect on the user's subjective perception, and the method has the advantage of lower implementation difficulty and computational complexity.
Since possible variations and modifications may be effected by one skilled in the art without departing from the spirit and scope of the invention, the scope of protection is to be determined by the claims appended hereto.

Claims (8)

1. A real-time audio-driven virtual character mouth shape synchronous control method comprises the following steps:
identifying the viseme probability from the real-time voice stream; the viseme probability is obtained by combining the probabilities of the phonemes belonging to the same type of visemes based on a preset mapping relation from the phonemes to the visemes; the viseme probability is obtained by a viseme identification method; or identifying the phoneme probability from the real-time voice stream by using phoneme identification, and converting the phoneme probability into a viseme probability;
a step of filtering the viseme probabilities;
converting the sampling rate of the viseme probabilities to match the rendering frame rate of the virtual character;
converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering; when converting the viseme probabilities into the standard mouth shape configuration: firstly, a standard mouth shape configuration is defined for each viseme, the standard mouth shape configuration being a key frame or parameters describing the mouth shape; secondly, the viseme probabilities are converted by a mapping function into mixing ratios of the standard mouth shape configurations; wherein, in a key-frame scenario, the mixing ratio is the interpolation ratio between key frames, and in a key-point, bone, or blendshape parameter scenario, the mixing ratio is the mixing ratio of the key-point, bone, or blendshape parameters.
2. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 1, characterized in that: each viseme probability is smoothed and filtered separately using a finite or infinite impulse response filter.
3. The method for real-time audio-driven virtual character mouth shape synchronous control as claimed in claim 1, characterized in that: in order to keep synchronization during audio/video playing, the contents of the audio stream and the video stream are synchronized by compensating for the delay during playing of the audio stream.
4. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 3, characterized in that: the length of the buffer used for delay compensation is determined by the processing delays of viseme identification, filtering, and video rendering.
5. The method for real-time audio-driven virtual character mouth shape synchronous control as claimed in claim 1, characterized in that: the phoneme recognition comprises: framing the voice stream, and extracting features; and a step of performing phoneme estimation using the features.
6. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 5, characterized in that: the phoneme is an IPA defined phoneme, or a custom phoneme.
7. The method of claim 6, wherein the real-time audio-driven virtual character mouth shape synchronization control method comprises: the phonemes are:
[Table of the custom phonemes, as listed in the description: b p m f d t n l g h j q x z c s zh ch sh ng a o e i ii iii u v er sil]
wherein ng represents the finals of neng, i represents the finals of yi, ii represents the finals of zi, iii represents the finals of zhi, and sil represents silence; the phoneme and viseme conversion relationship is as follows:
Viseme    Phoneme
b         b/p/m
d         d/t/n
z         z/c/s
zh        zh/ch/sh
j         j/q/x
k         k/h/l/g/ng
a         a
o         o
e         e/er
i         i/ii/iii
u         u/v
sil       sil
8. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 3, characterized in that: the delay is compensated as follows: audio delay compensation = framing delay + feature-splicing delay + phoneme-recognition delay + filtering delay − video-rendering delay.
CN201911314031.3A 2019-12-19 2019-12-19 Real-time audio-driven virtual character mouth shape synchronous control method Active CN111081270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911314031.3A CN111081270B (en) 2019-12-19 2019-12-19 Real-time audio-driven virtual character mouth shape synchronous control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911314031.3A CN111081270B (en) 2019-12-19 2019-12-19 Real-time audio-driven virtual character mouth shape synchronous control method

Publications (2)

Publication Number Publication Date
CN111081270A CN111081270A (en) 2020-04-28
CN111081270B (en) 2021-06-01

Family

ID=70315527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911314031.3A Active CN111081270B (en) 2019-12-19 2019-12-19 Real-time audio-driven virtual character mouth shape synchronous control method

Country Status (1)

Country Link
CN (1) CN111081270B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627096A (en) * 2020-05-07 2020-09-04 江苏原力数字科技股份有限公司 Digital human driving system based on blenshape
CN111698552A (en) * 2020-05-15 2020-09-22 完美世界(北京)软件科技发展有限公司 Video resource generation method and device
CN115426553A (en) * 2021-05-12 2022-12-02 海信集团控股股份有限公司 Intelligent sound box and display method thereof
CN117557692A (en) * 2022-08-04 2024-02-13 深圳市腾讯网域计算机网络有限公司 Method, device, equipment and medium for generating mouth-shaped animation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8694318B2 (en) * 2006-09-19 2014-04-08 At&T Intellectual Property I, L. P. Methods, systems, and products for indexing content
US10657972B2 (en) * 2018-02-02 2020-05-19 Max T. Hall Method of translating and synthesizing a foreign language

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2438691A (en) * 2005-04-13 2007-12-05 Pixel Instr Corp Method, system, and program product for measuring audio video synchronization independent of speaker characteristics
CN101482975A (en) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and apparatus for converting words into animation
CN102342100A (en) * 2009-03-09 2012-02-01 思科技术公司 System and method for providing three dimensional imaging in network environment
CN103329147A (en) * 2010-11-04 2013-09-25 数字标记公司 Smartphone-based methods and systems
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN109599113A (en) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for handling information
CN109712627A (en) * 2019-03-07 2019-05-03 深圳欧博思智能科技有限公司 It is a kind of using speech trigger virtual actor's facial expression and the voice system of mouth shape cartoon

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on lip reading and viseme segmentation based on BTSM and DBN models; Lü Guoyun et al.; Computer Engineering and Applications; 2007-07-31; Vol. 43, No. 14; pp. 21-24 *
A multi-input-driven 3D virtual human head for human-computer interfaces; Yu Jun et al.; Chinese Journal of Computers; 2013-12-31; Vol. 36, No. 12; pp. 2525-2536 *

Also Published As

Publication number Publication date
CN111081270A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111081270B (en) Real-time audio-driven virtual character mouth shape synchronous control method
US6766299B1 (en) Speech-controlled animation system
US5608839A (en) Sound-synchronized video system
US20080259085A1 (en) Method for Animating an Image Using Speech Data
CN103650002B (en) Text based video generates
EP0920691A1 (en) Segmentation and sign language synthesis
EP0993197B1 (en) A method and an apparatus for the animation, driven by an audio signal, of a synthesised model of human face
US5926575A (en) Model-based coding/decoding method and system
US20030149569A1 (en) Character animation
US6943794B2 (en) Communication system and communication method using animation and server as well as terminal device used therefor
EP4195668A1 (en) Virtual video livestreaming processing method and apparatus, storage medium, and electronic device
US20060079325A1 (en) Avatar database for mobile video communications
JP2003529861A (en) A method for animating a synthetic model of a human face driven by acoustic signals
JP2518683B2 (en) Image combining method and apparatus thereof
CN113592985B (en) Method and device for outputting mixed deformation value, storage medium and electronic device
CN112001992A (en) Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
JP2008500573A (en) Method and system for changing messages
JPH089372A (en) Device for increasing frame transmission rate of received video signal
US20050204286A1 (en) Speech receiving device and viseme extraction method and apparatus
CA2162199A1 (en) Acoustic-assisted image processing
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN114760425A (en) Digital human generation method, device, computer equipment and storage medium
CN114339069A (en) Video processing method and device, electronic equipment and computer storage medium
CN114793300A (en) Virtual video customer service robot synthesis method and system based on generation countermeasure network
CN110958417A (en) Method for removing compression noise of video call video based on voice clue

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant