CN111081270B - Real-time audio-driven virtual character mouth shape synchronous control method - Google Patents
Real-time audio-driven virtual character mouth shape synchronous control method
- Publication number
- CN111081270B (application CN201911314031.3A)
- Authority
- CN
- China
- Prior art keywords
- mouth shape
- real
- probability
- phoneme
- virtual character
- Prior art date
- 2019-12-19
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/18—Details of the transformation process
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a real-time audio-driven virtual character mouth shape synchronous control method. The method comprises the following steps: identifying viseme probabilities from a real-time voice stream; filtering the viseme probabilities; converting the sampling rate of the viseme probabilities to the same rate as the rendering frame rate of the virtual character; and converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering. The method avoids the need to transmit a synchronized phoneme sequence or mouth shape sequence alongside the audio stream, significantly reduces the complexity, coupling, and implementation difficulty of the system, and is suitable for various application scenarios in which virtual characters are rendered on a display device.
Description
Technical Field
The invention belongs to the field of virtual character posture control, and particularly relates to a real-time audio-driven virtual character mouth shape synchronous control method.
Background
Virtual character modeling and rendering techniques are widely used in industries such as animation, games, and film. Giving the virtual character natural, smooth mouth movements that stay synchronized with its speech is key to a good user experience. In a real-time system, audio acquired as a stream must be played back in step with the synchronously rendered virtual character, and the synchronization between the audio and the character's mouth shape must be maintained throughout.
The application scenarios include:
1. The real-time audio is speech generated by a speech synthesizer;
1.1 The phoneme sequence corresponding to the speech can be acquired as a synchronized stream;
1.2 The phoneme sequence corresponding to the speech cannot be obtained as a synchronized stream;
2. The real-time audio is speech uttered by a person.
In scenario 1.1, the phoneme sequence corresponding to the speech can be obtained synchronously and converted into a mouth shape sequence that drives the virtual character's mouth movements. However, synchronously acquiring the phoneme sequence requires additional communication protocol support in the application to keep the speech and the phoneme sequence time-aligned, which increases system complexity and coupling and makes implementation difficult.
In scenarios 1.2 and 2, the phoneme sequence corresponding to the speech cannot be obtained synchronously, so a control method is needed that can drive the virtual character's mouth shape from the real-time audio data alone.
Therefore, to handle the cases in which the phoneme sequence cannot be obtained synchronously, a method is needed that can identify a mouth shape sequence directly from the audio and use it to drive the virtual character's mouth shape changes in synchrony with the speech.
Disclosure of Invention
The invention provides a real-time audio-driven virtual character mouth shape synchronous control method aimed at the following problem: in a real-time audio streaming scenario, a virtual character must be displayed on the device, the character's speech is obtained from the real-time audio stream, and the character's mouth shape must be synchronized with the speech content.
A real-time audio-driven virtual character mouth shape synchronous control method comprises the following steps:
identifying viseme probabilities from the real-time voice stream, the viseme probabilities being obtained by combining the probabilities of the phonemes belonging to the same viseme class based on a preset phoneme-to-viseme mapping;
a step of filtering the viseme probabilities;
converting the sampling rate of the viseme probabilities to the same rate as the rendering frame rate of the virtual character;
and converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the viseme probabilities are obtained either by a viseme recognition method, or by recognizing phoneme probabilities from the real-time voice stream using phoneme recognition and converting them into viseme probabilities.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: each viseme probability is smoothed and filtered separately using a finite or infinite impulse response filter.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: when converting the viseme probabilities to a standard mouth shape configuration, first a standard mouth shape configuration is defined for each viseme, in the form of either a key frame or parameters describing the mouth shape; second, the viseme probabilities are converted into mixing proportions of the standard mouth shape configurations through a mapping function. In a key frame scenario, the mixing proportion is the interpolation ratio between different key frames; in a key point parameter, bone parameter, or blendshape parameter scenario, the mixing proportion is the mixing ratio of those mouth shape parameters.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: in order to keep synchronization during audio/video playing, the contents of the audio stream and the video stream are synchronized by compensating for the delay during playing of the audio stream.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the length of the delay compensation buffer is determined by the processing delays of the viseme recognition, the filtering, and the video rendering.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the phoneme recognition comprises: framing the voice stream, and extracting features; and a step of performing phoneme estimation using the features.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the phoneme is an IPA defined phoneme, or a custom phoneme.
The above real-time audio-driven virtual character mouth shape synchronous control method, wherein: the delay is compensated as follows: audio delay compensation = framing delay + feature splicing delay + phoneme recognition delay + filtering delay − video rendering delay.
For the case in which the phoneme sequence corresponding to the speech cannot be obtained synchronously, the invention provides a method that identifies a mouth shape sequence from the audio and uses it to synchronously drive the virtual character's mouth shape changes. The method avoids the need to transmit synchronized phoneme sequence or mouth shape sequence information alongside the audio stream, significantly reduces the complexity, coupling, and implementation difficulty of the system, and is suitable for various application scenarios in which virtual characters are rendered on a display device.
Compared with the prior art, the invention has the following advantages:
Rendering the virtual character locally on the device avoids rendering the video on the server and transmitting it over the network, saving a large amount of communication bandwidth and reducing operating costs.
Recognizing the mouth shape locally on the device avoids transmitting mouth shape information alongside the audio and avoids communication-layer synchronization of audio and mouth shape, reducing communication protocol complexity and implementation difficulty.
Using the probability output of the phoneme or viseme recognition model directly as the mixing ratio of the standard mouth shape parameters avoids converting the probabilities into phoneme or viseme category labels with a Viterbi decoding algorithm, reducing implementation difficulty.
The invention infers the mixing ratio of the mouth shape parameters directly from the audio signal without Viterbi decoding, avoiding the systematic delay introduced by decoding; compared with a decoding-based method, the system response time can be shortened by roughly one second, which greatly reduces interaction latency in real-time interaction scenarios and improves the user experience.
Drawings
FIG. 1 is a flowchart of the real-time audio-driven virtual character mouth shape synchronous control method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of the real-time audio-driven virtual character mouth shape synchronous control method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of the real-time audio-driven virtual character mouth shape synchronous control method according to a third embodiment of the present invention.
Detailed Description
Embodiments of the invention will be described below with reference to the drawings, but it should be appreciated that the invention is not limited to the embodiments described and that various modifications of the invention are possible without departing from the basic idea. The scope of the invention is therefore intended to be limited solely by the appended claims.
As shown in fig. 1, the method for controlling the mouth shape of a virtual character driven by real-time audio according to the present invention includes the following steps:
identifying viseme probabilities from the real-time voice stream;
filtering the viseme probabilities;
converting the sampling rate of the viseme probabilities to the same rate as the rendering frame rate of the virtual character;
and converting the viseme probabilities into a standard mouth shape configuration and performing mouth shape rendering.
As shown in fig. 2, a method for controlling the mouth shape of a real-time audio-driven avatar according to another embodiment of the present invention includes the following steps:
step 1, phoneme recognition
Step 1.1, feature extraction
The voice stream is framed and features are extracted.
In the framing process, a frame of L samples is taken every H samples from the continuous voice stream, so adjacent frames overlap by L − H samples.
Feature extraction converts each frame of data into some representation, such as a magnitude spectrum, phase spectrum, band energies, cepstral coefficients, or linear prediction coefficients.
Feature extraction may also leave the voice data unprocessed and use the raw audio samples as the feature extraction result.
After the features of each frame are obtained, differential features can additionally be computed from temporally adjacent frames and appended to the original features as the feature extraction result.
Alternatively, the features of temporally adjacent frames can be spliced together and the spliced result used as the feature extraction result.
The differencing and splicing operations may also be used together, as in the sketch below.
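As a concrete illustration, the following minimal Python sketch performs framing with frame length L and hop H, computes a simple log band-energy feature, and splices adjacent frames. The function names, the band-energy feature, and the context width are illustrative assumptions, not part of the invention; a real system might use mel filter banks, MFCCs, or LPC features instead.

```python
import numpy as np

def frame_signal(samples, frame_len, hop):
    """Take a frame of frame_len samples every hop samples from the voice stream;
    adjacent frames overlap by frame_len - hop samples (assumes len(samples) >= frame_len)."""
    n_frames = (len(samples) - frame_len) // hop + 1
    return np.stack([samples[i * hop:i * hop + frame_len] for i in range(n_frames)])

def band_energy_features(frames, n_bands=40):
    """Toy feature: log energies of n_bands frequency bands per frame."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

def splice_context(features, context=2):
    """Concatenate each frame's features with those of its temporal neighbours."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + len(features)] for i in range(2 * context + 1)], axis=1)

# Example: 1 s of 16 kHz audio, 25 ms frames (L = 400) with a 10 ms hop (H = 160).
samples = np.random.randn(16000)
feats = splice_context(band_energy_features(frame_signal(samples, 400, 160)))
```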
Step 1.2 phoneme probability estimation
Phoneme probability estimation uses a statistical machine learning model to estimate, from the input features, the probability that each frame corresponds to each phoneme.
The phonemes may be those defined by the IPA (International Phonetic Alphabet) or by other standards.
Taking Chinese as an example, a custom phoneme set that can be adopted is:
b | p | m | f | d | t | n | l |
g | h | j | q | x | z | c | s |
zh | ch | sh | ng | a | o | e | i |
ii | iii | u | v | er | sil |
where ng represents the final of "neng", i the final of "yi", ii the final of "zi", and iii the final of "zhi"; sil denotes silence.
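As an illustration of this step, the sketch below assumes an already-trained classifier object exposing a scikit-learn-style predict_proba() method; the model itself, its training, and the function name are hypothetical and outside the scope of the patent. The point is only that each frame of features is mapped to a probability distribution over the phoneme set above.

```python
import numpy as np

# The custom Chinese phoneme set listed above (30 symbols, including silence "sil").
PHONEMES = ["b", "p", "m", "f", "d", "t", "n", "l",
            "g", "h", "j", "q", "x", "z", "c", "s",
            "zh", "ch", "sh", "ng", "a", "o", "e", "i",
            "ii", "iii", "u", "v", "er", "sil"]

def phoneme_posteriors(features, model):
    """Return an (n_frames, n_phonemes) matrix of per-frame phoneme probabilities.
    `model` is any statistical machine learning classifier trained to output one
    probability per phoneme (hypothetical here); each row sums to 1 over PHONEMES."""
    probs = model.predict_proba(features)
    assert probs.shape == (len(features), len(PHONEMES))
    return probs
```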
Step 2, phoneme-to-viseme probability conversion
The viseme probability is obtained by combining the phoneme probabilities belonging to the same type of visemes based on a preset mapping relation from the phonemes to the visemes.
The predetermined mapping relationship may follow different design criteria and is not limited to the specific embodiment given in the present invention.
Taking Chinese as an example, the mapping relationship may be:

Viseme | Phonemes
---|---
b | b / p / m
d | d / t / n
z | z / c / s
zh | zh / ch / sh
j | j / q / x
k | k / h / l / g / ng
a | a
o | o
e | e / er
i | i / ii / iii
u | u / v
sil | sil
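The conversion itself is simply a sum of phoneme probabilities within each viseme class. A minimal sketch under the mapping above follows; phonemes that are absent from the recognizer's phoneme set are skipped, and the column order of VISEMES is an implementation choice.

```python
import numpy as np

# Phoneme-to-viseme mapping from the table above.
VISEME_TO_PHONEMES = {
    "b": ["b", "p", "m"],   "d": ["d", "t", "n"],
    "z": ["z", "c", "s"],   "zh": ["zh", "ch", "sh"],
    "j": ["j", "q", "x"],   "k": ["k", "h", "l", "g", "ng"],
    "a": ["a"],             "o": ["o"],
    "e": ["e", "er"],       "i": ["i", "ii", "iii"],
    "u": ["u", "v"],        "sil": ["sil"],
}
VISEMES = list(VISEME_TO_PHONEMES)

def phoneme_to_viseme_probs(phoneme_probs, phonemes):
    """Combine per-frame phoneme probabilities (n_frames x n_phonemes) into
    per-frame viseme probabilities (n_frames x n_visemes) by summing the
    probabilities of the phonemes mapped to each viseme."""
    index = {p: i for i, p in enumerate(phonemes)}
    out = np.zeros((phoneme_probs.shape[0], len(VISEMES)))
    for col, members in enumerate(VISEME_TO_PHONEMES.values()):
        cols = [index[p] for p in members if p in index]
        out[:, col] = phoneme_probs[:, cols].sum(axis=1)
    return out
```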
Step 3, carrying out smooth filtering on the obtained viseme probability
Because the probability estimates of the statistical machine learning model cannot be guaranteed to be completely accurate, the result is usually refined by combining information across multiple frames to obtain probabilities that vary smoothly over time.
The smoothing filter can be a finite impulse response filter applied separately to each viseme's probability; the order and coefficients of the filter can be adjusted according to the required system response time.
In the simplest case, a moving-average FIR filter of order 10 may be used; other filter designs may be used in practice.
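A minimal offline sketch of such a moving-average FIR filter follows. The order of 10 is just the example value from the text, and a streaming system would use a causal variant that filters past frames only; this batch version is only for illustration.

```python
import numpy as np

def smooth_viseme_probs(viseme_probs, order=10):
    """Apply a moving-average FIR filter of the given order independently to each
    viseme's probability track (columns of an n_frames x n_visemes array).
    A higher order gives smoother output at the cost of a longer response time."""
    kernel = np.ones(order) / order
    return np.stack([np.convolve(viseme_probs[:, v], kernel, mode="same")
                     for v in range(viseme_probs.shape[1])], axis=1)
```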
Step 4, resampling the voice stream according to the sampling rate of the video
Since the feature extraction process in step 1 frames the voice stream, the data frames have a sampling rate of (audio sampling rate / H) Hz.
The rate at which video is rendered is typically based on the refresh rate of the display device.
Resampling is therefore needed to convert the data-frame rate to the video frame rate.
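One simple way to do this, shown below as an assumed sketch, is to linearly interpolate each viseme probability track from the data-frame timestamps onto the video frame timestamps; other resampling schemes would work equally well.

```python
import numpy as np

def resample_to_video_rate(viseme_probs, data_frame_rate_hz, video_fps):
    """Linearly interpolate each viseme probability track from the data-frame rate
    (audio sampling rate / H) to the video rendering frame rate."""
    n_in = viseme_probs.shape[0]
    t_in = np.arange(n_in) / data_frame_rate_hz                      # data-frame timestamps
    t_out = np.arange(int(n_in * video_fps / data_frame_rate_hz)) / video_fps
    return np.stack([np.interp(t_out, t_in, viseme_probs[:, v])
                     for v in range(viseme_probs.shape[1])], axis=1)

# Example: 100 data frames per second (16 kHz audio, H = 160) resampled to 60 fps video.
```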
Step 5, converting the viseme probability to the standard mouth shape mixing proportion
The virtual character rendering system will typically define a standard mouth shape configuration for each viseme, either as key frames or as parameters describing the mouth shape.
The viseme probabilities can be converted to mixing ratios for the standard mouth shape configurations by a linear or non-linear mapping function.
In a key frame scenario, the mixing ratio may be an interpolation ratio between different key frames.
In the context of key point parameters, bone parameters, or blendshape parameters, the mixing ratio may be the blending ratio of those parameters.
Taking one frame of data as an example, suppose the viseme probabilities are:

Viseme | Viseme probability
---|---
b | 0.0
d | 0.0
z | 0.0
zh | 0.0
j | 0.0
k | 0.0
a | 0.6
o | 0.4
e | 0.0
i | 0.0
u | 0.0
sil | 0.0
Assume the mapping function from viseme probabilities to mixing ratios is linear. Taking a key point parameter scenario as an example, two-dimensional key point parameters are defined as:
a (0.2, 0.8)
o (0.7, 0.3)
The key point parameters corresponding to the above viseme probabilities are mixed as a × 0.6 + o × 0.4, so the key point parameters of the current frame are (0.4, 0.6).
Step 6, mouth shape rendering using the mixed mouth shape configuration
And the virtual character rendering system renders a virtual character image according to the mixed mouth shape configuration to obtain a video stream.
Step 7, synchronously playing audio and video
Because the voice stream passes through framing, splicing, phoneme recognition, smoothing filtering, and similar stages, each of which introduces some system delay, the contents of the audio stream and the video stream need to be synchronized by compensating for this delay when the audio stream is played.
The total delay can be calculated by accumulating the delays of the processing stages.
Since video rendering also introduces some delay, the delay of the video rendering system needs to be subtracted when calculating the audio delay.
Taking a common scenario as an example:
the audio delay compensation amount is the framing delay + feature splicing delay + phoneme recognition delay + smoothing filter delay-video rendering delay.
Fig. 3 shows a third embodiment of the present invention. It differs from the second embodiment of fig. 2 in that viseme recognition is performed directly on the speech stream, without the phoneme recognition and phoneme-to-viseme probability conversion steps.
Compared with the method of fig. 2, its viseme probability estimates are slightly less accurate, but the user's subjective impression is largely unaffected, and it has the advantage of lower implementation difficulty and computational complexity.
Since possible variations and modifications may be effected by one skilled in the art without departing from the spirit and scope of the invention, the scope of protection is to be determined by the claims appended hereto.
Claims (8)
1. A real-time audio-driven virtual character mouth shape synchronous control method comprises the following steps:
identifying a viseme probability from the real-time voice stream, wherein the viseme probability is obtained by combining the probabilities of phonemes belonging to the same viseme class based on a preset phoneme-to-viseme mapping, and the viseme probability is obtained either by a viseme recognition method or by recognizing a phoneme probability from the real-time voice stream using phoneme recognition and converting the phoneme probability into a viseme probability;
a step of filtering the viseme probability;
converting the sampling rate of the viseme probability to the same rate as the rendering frame rate of the virtual character;
converting the viseme probability into a standard mouth shape configuration and rendering the mouth shape, wherein, when converting the viseme probability into a standard mouth shape configuration: firstly, a standard mouth shape configuration is defined for each viseme, the standard mouth shape configuration being a key frame or parameters describing the mouth shape; secondly, the viseme probability is converted into a mixing proportion of the standard mouth shape configurations through a mapping function; wherein, in a key frame scenario, the mixing proportion is an interpolation proportion between different key frames, and in the scenario of key point parameters, bone parameters, or blendshape parameters, the mixing proportion is the mixing ratio of the key point parameters, bone parameters, or blendshape parameters.
2. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 1, characterized in that: each viseme probability is smoothed and filtered separately using a finite or infinite impulse response filter.
3. The method for real-time audio-driven virtual character mouth shape synchronous control as claimed in claim 1, characterized in that: in order to keep synchronization during audio/video playing, the contents of the audio stream and the video stream are synchronized by compensating for the delay during playing of the audio stream.
4. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 3, characterized in that: the length of the delay compensation buffer is determined by the processing delays of the viseme recognition, the filtering, and the video rendering.
5. The method for real-time audio-driven virtual character mouth shape synchronous control as claimed in claim 1, characterized in that: the phoneme recognition comprises: framing the voice stream, and extracting features; and a step of performing phoneme estimation using the features.
6. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 5, characterized in that: the phoneme is an IPA defined phoneme, or a custom phoneme.
7. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 6, characterized in that: the phonemes are:
b | p | m | f | d | t | n | l |
g | h | j | q | x | z | c | s |
zh | ch | sh | ng | a | o | e | i |
ii | iii | u | v | er | sil |
wherein ng represents the final of "neng", i represents the final of "yi", ii represents the final of "zi", iii represents the final of "zhi", and sil represents silence; and the phoneme-to-viseme conversion relationship is:

Viseme | Phonemes
---|---
b | b / p / m
d | d / t / n
z | z / c / s
zh | zh / ch / sh
j | j / q / x
k | k / h / l / g / ng
a | a
o | o
e | e / er
i | i / ii / iii
u | u / v
sil | sil
8. The real-time audio-driven virtual character mouth shape synchronous control method as claimed in claim 3, characterized in that: the method for compensating the delay comprises the following steps: the audio delay compensation amount is the framing delay + feature splicing delay + phoneme recognition delay + filtering delay-video rendering delay.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911314031.3A CN111081270B (en) | 2019-12-19 | 2019-12-19 | Real-time audio-driven virtual character mouth shape synchronous control method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911314031.3A CN111081270B (en) | 2019-12-19 | 2019-12-19 | Real-time audio-driven virtual character mouth shape synchronous control method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111081270A CN111081270A (en) | 2020-04-28 |
CN111081270B true CN111081270B (en) | 2021-06-01 |
Family
ID=70315527
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911314031.3A Active CN111081270B (en) | 2019-12-19 | 2019-12-19 | Real-time audio-driven virtual character mouth shape synchronous control method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111081270B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111627096A (en) * | 2020-05-07 | 2020-09-04 | 江苏原力数字科技股份有限公司 | Digital human driving system based on blenshape |
CN111698552A (en) * | 2020-05-15 | 2020-09-22 | 完美世界(北京)软件科技发展有限公司 | Video resource generation method and device |
CN115426553A (en) * | 2021-05-12 | 2022-12-02 | 海信集团控股股份有限公司 | Intelligent sound box and display method thereof |
CN117557692A (en) * | 2022-08-04 | 2024-02-13 | 深圳市腾讯网域计算机网络有限公司 | Method, device, equipment and medium for generating mouth-shaped animation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8694318B2 (en) * | 2006-09-19 | 2014-04-08 | At&T Intellectual Property I, L. P. | Methods, systems, and products for indexing content |
US10657972B2 (en) * | 2018-02-02 | 2020-05-19 | Max T. Hall | Method of translating and synthesizing a foreign language |
- 2019-12-19 CN: application CN201911314031.3A filed; granted as CN111081270B (status: Active)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2438691A (en) * | 2005-04-13 | 2007-12-05 | Pixel Instr Corp | Method, system, and program product for measuring audio video synchronization independent of speaker characteristics |
CN101482975A (en) * | 2008-01-07 | 2009-07-15 | 丰达软件(苏州)有限公司 | Method and apparatus for converting words into animation |
CN102342100A (en) * | 2009-03-09 | 2012-02-01 | 思科技术公司 | System and method for providing three dimensional imaging in network environment |
CN103329147A (en) * | 2010-11-04 | 2013-09-25 | 数字标记公司 | Smartphone-based methods and systems |
CN103218842A (en) * | 2013-03-12 | 2013-07-24 | 西南交通大学 | Voice synchronous-drive three-dimensional face mouth shape and face posture animation method |
CN107369440A (en) * | 2017-08-02 | 2017-11-21 | 北京灵伴未来科技有限公司 | The training method and device of a kind of Speaker Identification model for phrase sound |
CN109599113A (en) * | 2019-01-22 | 2019-04-09 | 北京百度网讯科技有限公司 | Method and apparatus for handling information |
CN109712627A (en) * | 2019-03-07 | 2019-05-03 | 深圳欧博思智能科技有限公司 | It is a kind of using speech trigger virtual actor's facial expression and the voice system of mouth shape cartoon |
Non-Patent Citations (2)
Title |
---|
Research on lip reading and viseme segmentation based on BTSM and DBN models; Lü Guoyun et al.; Computer Engineering and Applications; 2007-07-31; vol. 43, no. 14; pp. 21-24 *
A multi-input-driven 3D virtual human head for human-computer interfaces; Yu Jun et al.; Chinese Journal of Computers; 2013-12-31; vol. 36, no. 12; pp. 2525-2536 *
Also Published As
Publication number | Publication date |
---|---|
CN111081270A (en) | 2020-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111081270B (en) | Real-time audio-driven virtual character mouth shape synchronous control method | |
US6766299B1 (en) | Speech-controlled animation system | |
US5608839A (en) | Sound-synchronized video system | |
US20080259085A1 (en) | Method for Animating an Image Using Speech Data | |
CN103650002B (en) | Text-based video generation | |
EP0920691A1 (en) | Segmentation and sign language synthesis | |
EP0993197B1 (en) | A method and an apparatus for the animation, driven by an audio signal, of a synthesised model of human face | |
US5926575A (en) | Model-based coding/decoding method and system | |
US20030149569A1 (en) | Character animation | |
US6943794B2 (en) | Communication system and communication method using animation and server as well as terminal device used therefor | |
EP4195668A1 (en) | Virtual video livestreaming processing method and apparatus, storage medium, and electronic device | |
US20060079325A1 (en) | Avatar database for mobile video communications | |
JP2003529861A (en) | A method for animating a synthetic model of a human face driven by acoustic signals | |
JP2518683B2 (en) | Image combining method and apparatus thereof | |
CN113592985B (en) | Method and device for outputting mixed deformation value, storage medium and electronic device | |
CN112001992A (en) | Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning | |
JP2008500573A (en) | Method and system for changing messages | |
JPH089372A (en) | Device for increasing frame transmission rate of received video signal | |
US20050204286A1 (en) | Speech receiving device and viseme extraction method and apparatus | |
CA2162199A1 (en) | Acoustic-assisted image processing | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
CN114760425A (en) | Digital human generation method, device, computer equipment and storage medium | |
CN114339069A (en) | Video processing method and device, electronic equipment and computer storage medium | |
CN114793300A (en) | Virtual video customer service robot synthesis method and system based on generation countermeasure network | |
CN110958417A (en) | Method for removing compression noise of video call video based on voice clue |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |