CN1320497C - Statistics and rule combination based phonetic driving human face cartoon method - Google Patents

Statistics and rule combination based phonetic driving human face cartoon method

Info

Publication number
CN1320497C
CN1320497C CNB021402868A CN02140286A
Authority
CN
China
Prior art keywords
face
video
human face
audio
people
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB021402868A
Other languages
Chinese (zh)
Other versions
CN1466104A (en)
Inventor
陈益强
高文
王兆其
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CNB021402868A priority Critical patent/CN1320497C/en
Publication of CN1466104A publication Critical patent/CN1466104A/en
Application granted granted Critical
Publication of CN1320497C publication Critical patent/CN1320497C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The present invention relates to a voice-driven human face animation method combining statistics and rules, which comprises the following steps: an audio-video synchronous segmentation method is used to obtain corresponding audio and video data streams; corresponding feature vectors are obtained through an audio-video analysis method; an audio-video synchronous mapping model is learned by applying a statistical learning method; using the statistically learned models together with rules, the face motion parameters corresponding to a user-given speech sequence are obtained and used to drive a face animation model. The present invention uses video capture, speech analysis, image processing and related methods to record the voice of a real person speaking together with the motion data of the facial feature points, and statistically learns the association patterns between the speech and the facial feature points. When new speech is given, the motion parameters of the facial feature points corresponding to that speech can be obtained with the learned models and a set of rules, and used to drive the face animation model.

Description

Voice-driven human face animation method based on the combination of statistics and rules
Technical field
The present invention relates to a voice-driven human face animation method based on the combination of statistics and rules. In particular, it uses video capture, speech analysis, image processing and related methods to record the voice and the facial feature-point motion data of a real person while speaking, and builds an initial speech-image database. From the video capture frame rate and the speech sampling rate, the shift of the speech analysis window can be computed so that the speech data stay synchronous with the video frames; from these synchronized data, a statistical learning method is used to obtain a model of the synchronous correspondence between speech and video. With this model, supplemented by rules, the face motion parameters corresponding to arbitrary speech can be obtained and used to drive a face animation model.
Background technology
Now that recovering a realistic three-dimensional face from one or a few images or from a video sequence has become practical, current research focuses on simulating realistic three-dimensional facial behavior. As with speech synthesis, obtaining large amounts of real face motion video and face synthesis units is not particularly difficult; the difficulty lies in how to edit and reuse the existing face animation data. One approach is to provide a set of convenient editing tools and to generate animation sequences by interpolating between edited key frames. This approach is the most direct, but it requires experts familiar with animation and a great deal of production time. A second approach uses control techniques: text, sound, video or other related signals, or sensors, are used to control the face animation. With text control, the output sound is synthesized speech, and synchronization is hard to manage. With video control, tracking the video images and extracting features is a difficult problem. With sensor-based schemes, the equipment cost is too high, and the variation of some detailed feature points can only be estimated. A feasible direction, which many researchers are now pursuing, is therefore voice-driven face animation. People are very sensitive to facial behavior: they easily judge whether it looks realistic, and they readily associate face motion with the corresponding speech signal. For voice-driven face animation, synthesizing the association between speech, lip motion and facial expression is crucial for the realism and believability of a character.
Cognitive scientists and psychologists have observed that a great deal of correlated information exists between speech and facial behavior. Facial information increases the observer's understanding of the content and form of speech, and it has been exploited by many speech-interface systems. Conversely, synthesizing a believable face is considered a major obstacle in generating acceptable virtual and animated humans. People are highly sensitive when interpreting human motion, and an unnatural animated face can disturb or even interrupt the understanding of speech. Current research on voice-driven animation can be divided into two classes: methods that go through speech recognition and methods that do not. The first class divides speech into linguistic units such as phonemes, visemes and syllables, maps these units directly to lip postures, and then concatenates the results. This approach is direct and easy to implement, but it ignores dynamic factors and the synchronization problem: the interaction between the underlying speech segments and the muscle model is hard to handle. To date, almost all effort on the synchronization problem has concentrated on heuristic rules and classical smoothing methods. For example, Baldy is a 3D talking-face system driven by speech primitives, which handles synchronization with a hand-designed audio-visual synchronization model approved by psychologists. The Video Rewrite method obtains new video by recombining the video segments corresponding to triphones, and the results are more natural than generated animation models; it is worth noting, however, that a triphone represents the transition between speech units rather than the motion between face frames, and the quality of such a system depends on the number of available triphone samples and on the smoothing technique. When the basic audio-visual unit is represented by discrete speech or visual primitives, much essential information is lost. In fact, the design of speech primitives only satisfies the need to distinguish pronunciations and to transmit linguistic content. Speech primitives are very effective for recognition but not optimal for synthesis, mainly because they make it hard to predict the relations between prosody and facial expression, between acoustic energy and the amplitude of gestures, and between speech segments and the synchronization of lip motion. The second class of methods bypasses speech primitives altogether, finds a mapping between the speech signal and control parameters, and drives lip motion directly. A neural network can be trained to predict the control parameters from the speech signal of several frames before and after the current frame. Manual labeling of the control parameters for the corresponding speech segments is usually adopted; this avoids the hard problem of obtaining facial feature points automatically, but it also makes it difficult for the system to describe the complex variation of the face. Some systems place 3D position trackers around the lips and on the cheeks, which yields accurate face motion data, but changes in other parts of the face, such as the eyes and eyebrows, are still not captured.
Others have proposed predicting the control signal from the correlated signal with a hidden Markov model (HMM) and applying it to voice-driven face animation, but processing complex speech data with an HMM oversimplifies the problem. Moreover, all of the above approaches are based on statistical learning: they can handle strongly correlated mappings such as speech to lip motion, but weakly correlated relations such as speech to eye blinks or head nods are hard to obtain by learning.
Summary of the invention
The purpose of this invention is to provide a method that realizes the mapping from speech to the face by combining statistics and rules.
To achieve the above object, the method provided by the invention comprises the steps of:
using an audio-video synchronous segmentation method to obtain corresponding audio and video data streams;
obtaining corresponding feature vectors through audio-video analysis;
learning an audio-video synchronous mapping model with a statistical learning method;
using the statistically learned model together with rules to obtain the face motion parameters corresponding to new speech, and driving the face animation model.
The present invention uses video capture, speech analysis, image processing and related methods to record the voice and the facial feature-point motion data of a real person while speaking, and builds an initial speech-image database. Speech analysis yields speech features, including linear prediction coefficients and prosodic parameters (energy, zero-crossing rate and fundamental frequency); from the video frames, the feature points corresponding to the MPEG-4 face animation parameters can be extracted, and the face animation parameters are obtained by computing frame-to-frame differences and relative displacements. Clustering, statistics, neural networks and related methods are used to learn the mapping from speech features to face animation parameters. After learning, when new speech arrives, its features are obtained by analysis and mapped to face animation parameters; on this basis, a face motion knowledge base is used to impose rule constraints on the result, achieving realistic animation.
Description of drawings
Fig. 1 is a schematic diagram of the learning-phase framework;
Fig. 2 is a schematic diagram of facial feature point tracking;
Fig. 3 is a schematic diagram of feature point detection and ranges of influence;
Fig. 4 shows some of the FDP/FAP corresponding points and FAPUs in MPEG-4;
Fig. 5 shows the 29 FAP patterns;
Fig. 6 is a schematic diagram of the application-stage framework;
Fig. 7 compares the statistical visual model method with the neural-network-only method (comparison of the lip height parameter);
Fig. 8 is an example of voice-driven face animation: the upper row is the real audio-video recording, and the lower row is the face motion sequence obtained from the audio according to the present invention.
Embodiment
First, unsupervised cluster analysis is applied to the video to obtain classes of face animation parameter (FAP) feature vectors. Then a face dynamic model describing which FAP classes occur synchronously with speech events (essentially a FAP class transition matrix) is estimated; we call it the statistical visual model, and in principle it is similar to the statistical language model in natural language processing. Finally, several neural networks (ANNs) are trained to map speech patterns to face animation patterns. After learning, candidate face animation pattern sequences can be computed for new speech data; the statistical visual model is used to select the best face animation parameter (FAP) sequence among them, face motion rules are then used to revise and supplement the FAP sequence, and after smoothing these FAPs can directly drive a face wireframe model. This strategy has the following distinctive features:
1) The whole process can be described with the classical Bayes rule:
argmax_L Pr(L|A) = argmax_L [Pr(A|L) · Pr(L) / Pr(A)],
where A denotes the speech signal. The maximum-likelihood term Pr(A|L) measures how accurately the speech signal is modeled, and the prior model Pr(L) encodes the background knowledge about real face motion, which we call the statistical visual model.
2) The cluster analysis is built on the classification of face postures rather than of the speech signal, which works better than classifying by speech perception as one might assume. At the same time, because the same lip shape can correspond to quite different speech features, training a neural network on the speech signals within each class improves the robustness of the prediction.
3) The statistical visual model allows us to find the face motion trajectory that is optimal over the whole utterance, and fully uses contextual information, avoiding the limitation that neural network training alone has difficulty being context sensitive.
4) The video signal only needs to be analyzed once to train the correspondence between speech and face animation parameters (FAPs); the resulting model can be used to synthesize other people's faces.
5) The introduction of face motion rules makes the animation of the parts weakly correlated with speech, such as eye blinks and head motion, more realistic.
6) The whole framework can also be used for correlation prediction, control or synthesis between other kinds of signals.
The above voice-driven human face animation method combining statistics and rules comprises the following two parts: a learning phase and an application phase.
1) The learning phase comprises the following steps (Fig. 1):
A) Synchronous audio-video recording and segmentation
Speech and video data can be recorded synchronously with a video camera to form an AVI file; however, for later analysis the audio-video signal must be split into audio and video streams on separate channels. Traditional methods usually fix the settings empirically for a particular camera; the present invention proposes an audio-video synchronous segmentation method that can be used with video captured by any camera.
Suppose the video capture rate is Videoframecount frames per millisecond, the audio sampling rate is Audiosamplecount samples per millisecond, the shift of the speech analysis window is Windowmove, the size of the speech analysis window is Windowsize, the required number of speech windows per video frame is m, and the ratio of Windowsize to Windowmove is n; then
Windowmove=Audiosamplecount/(Videoframecount*m) (1);
Windowsize=Windowmove*n (2);
Here m and n are adjustable parameters set according to the actual conditions. Synchronization parameters set by this method make the audio-video synchronization accurate to the sample level.
To cover as many distinct pronunciations as possible, the method selects the text material of the 863 Chinese speech synthesis corpus CoSS-1 as the material read by the speaker. CoSS-1 contains the pronunciations of the 1268 isolated syllables of Chinese, a large number of 2-4 character words, and 200 sentences. A synchronized audio-video database of isolated characters, words and sentences was recorded. By marking feature points, motion data of the lips, cheeks, eyelids and other regions can be obtained. The camera was set to 10 frames per second, and the captured video was converted to images and processed by a tracking program to obtain the image feature sequences. With m=6, n=2/3 and an audio sampling rate of 8040 Hz, the speech analysis window length is 8040/(10×6)=134 samples and the frame shift is 134×2/3≈89 samples.
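To make formulas (1) and (2) concrete, the following minimal Python sketch computes the synchronization parameters; the function name, the per-millisecond conversion and the rounding are illustrative assumptions, and, as in the worked example above, the quantity from formula (1) serves as the analysis window length while the quantity from formula (2) serves as the frame shift.

```python
# Minimal sketch of the audio-video synchronization parameters of formulas (1) and (2).
# Function and variable names are illustrative, not taken from the patent text.

def sync_window_params(audio_rate_hz, video_fps, m=6, n=2.0 / 3.0):
    """Return (Windowmove, Windowsize) so that m speech analysis windows
    line up with one video frame (formulas (1) and (2))."""
    audio_samples_per_msec = audio_rate_hz / 1000.0     # Audiosamplecount / msec
    video_frames_per_msec = video_fps / 1000.0          # Videoframecount / msec
    window_move = audio_samples_per_msec / (video_frames_per_msec * m)   # (1)
    window_size = window_move * n                                        # (2)
    return round(window_move), round(window_size)

# Worked example from the text: 8040 Hz audio, 10 frames/s video, m=6, n=2/3
# gives an analysis window of 134 samples and a frame shift of 89 samples.
print(sync_window_params(8040, 10))   # -> (134, 89)
```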
B) Audio and video feature extraction
For audio, the linear prediction parameters and the prosodic parameters (energy, zero-crossing rate and fundamental frequency) of the speech data in a Hamming window are extracted as the speech feature vector.
For video, the facial feature points consistent with MPEG-4 are extracted; the difference between each feature point coordinate and the standard-frame coordinate, Ve1={V1, V2, ..., Vn}, is computed, as well as the scale reference quantity of each feature point of the specific face as defined by MPEG-4, P={P1, P2, ..., Pn}; the face motion parameters are then obtained by formula (3):
Fap_i = (V_i(x|y) / P_i(x|y)) × 1024    (3)
where Fap_i denotes the face motion parameter corresponding to the i-th feature point, V_i(x|y) denotes the x or y coordinate of V_i, and P_i(x|y) denotes the scale reference quantity corresponding to V_i(x|y).
For the speech features, a conventional Hamming window is used in the speech analysis, so that each frame yields 16th-order LPC and RASTA-PLP mixed coefficients together with several prosodic parameters.
For the face motion features, an MPEG-4 based face animation representation is used. MPEG-4 specifies the face model and its animation with FDPs (face definition parameters) and FAPs (face animation parameters), and uses FAPUs (face animation parameter units) to express the displacements of the FAPs. On this basis, obtaining expression and lip motion data amounts to obtaining the corresponding FDP and FAP parameters. To obtain face motion data, a computer vision system was developed that can synchronously track the facial features of many individuals, such as the mouth corners and lip contour, the eyes and the nose. Fig. 2 shows the feature points that can be tracked. Because obtaining accurate feature-point motion data is more important for synthesis than the tracking algorithm itself, the data are obtained by marking specific colors on the face and asking the speaker to reduce head motion as much as possible; Fig. 3 shows the finally obtained feature points and their ranges of influence.
The data obtained from feature point extraction are absolute coordinates, and because of the speaker's head or body motion, the coordinates obtained by simple image processing contain considerable noise, so preprocessing is needed. We assume that the feature points not affected by any FAP do not move relative to each other, and use this invariance to convert image coordinates into the face model's relative coordinates, thereby removing the rotation and scaling caused by the speaker's motion. Among the feature points defined by MPEG-4 in Fig. 4, we chose P0 (11.2), P1 (11.3), P2 (11.1) and P3 (an added point on the nose tip) to form an orthogonal coordinate system (X axis P0P1, Y axis P2P3); with this coordinate system, the rotation angle and the scale can be computed as follows. Suppose the coordinates of these reference points are P0(x0, y0), P1(x1, y1), P2(x2, y2) and P3(x3, y3). The origin of the new coordinate system can be computed as the intersection of the two lines they define, denoted P(xnew, ynew), and the rotation angle θ of the new coordinate system with respect to the normal coordinates can also be computed. The value (x′, y′) of an arbitrary point (x, y) under the new coordinate system can then be computed according to the following formulas:
x′ = x×cos(θ) − y×sin(θ) + P(x_new)    (4)
y′ = x×sin(θ) + y×cos(θ) + P(y_new)    (5)
To avoid the effect of scaling, the added point on the nose tip is assumed not to move relative to the first frame; any other point can compute its displacement relative to this point according to formulas (6) and (7), thereby converting image coordinates into face model coordinates and obtaining accurate feature point motion data:
x_k″ = (x_k′ − x_k3) − (x_1′ − x_13)    (6)
y_k″ = (y_k′ − y_k3) − (y_1′ − y_13)    (7)
where (x_13, y_13) denotes the nose tip coordinate of frame 1, (x_1′, y_1′) the coordinates of the other feature points in frame 1, (x_k3, y_k3) the nose tip coordinate of frame k, (x_k′, y_k′) the coordinates of the other feature points in frame k, and (x_k″, y_k″) the final computed coordinates of the other feature points in frame k. After filtering, the FAP value of each feature point coordinate can be computed with reference to the face animation parameter units (FAPUs) defined in Fig. 4. For example, if the ES0 and ENS0 defined in Fig. 4 are 200 and 140 respectively, then the two FAP values corresponding to point 5.3 at (X, Y) can be computed as:
FAP39=X×1024/200 (8)
FAP41=Y×1024/140 (9)
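The following sketch illustrates how formula (3), with the FAPU scaling of formulas (8) and (9), turns feature-point displacements into FAP values; the displacement values and the function name are hypothetical, while the FAPU values (ES0=200, ENS0=140) follow the example in the text.

```python
# Sketch of FAP computation from tracked feature points, formulas (3), (8) and (9).
# Assumes the feature coordinates are already in the face model's relative
# coordinates (formulas (4)-(7)); the names and values below are illustrative.

def fap_from_displacement(v, p):
    """Formula (3): FAP_i = (V_i / P_i) * 1024, where v is the displacement of a
    feature point coordinate from the neutral frame and p its FAPU scale reference."""
    return v / p * 1024.0

# Example from the text: point 5.3 displaced by (X, Y), with ES0=200 and ENS0=140.
X, Y = 30.0, 14.0          # hypothetical displacements in model units
fap39 = fap_from_displacement(X, 200.0)   # (8): FAP39 = X*1024/200
fap41 = fap_from_displacement(Y, 140.0)   # (9): FAP41 = Y*1024/140
print(fap39, fap41)        # 153.6 102.4
```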
C) Statistical learning from audio features to video features
1. First, the audio and video are segmented synchronously into the feature sets Audio and Video as described in a) and b);
2. Unsupervised cluster analysis is applied to the Video feature set to obtain the basic face motion patterns, giving I classes;
3. A statistical method is used to obtain the transition probabilities between two or more classes, called the statistical visual model; the quality of the model is evaluated with entropy, and the clustering step is repeated until the entropy is minimized.
4. The data in the speech feature set Audio corresponding to the same basic face motion pattern are grouped into the corresponding subset Audio(i), where i indicates the class.
5. Each subset Audio(i) is trained with a neural network whose input is the speech features F(Audio(i)) in the subset and whose output is the degree of membership P(Video(i)) in that class.
1. Clustering method for the basic face motion patterns in step 2
For basic face patterns, cognitive scientists have provided some research results, but these are generally qualitative, giving six basic expressions or a few more, and the realism of synthesis based on such qualitative expressions is poor. Some researchers discover patterns by clustering real data, but at present most cluster analyses are carried out at the phoneme level, ignoring the dynamics of face motion at the sentence level. We wish to find, from a large number of real sentences, a set of patterns that effectively represent face motion. The discovered patterns may have a clear meaning, such as the 14 lip visemes defined by MPEG-4, or they may simply be basic patterns that are effective for face synthesis. Pattern discovery not only helps the convergence of neural network training, it also lays a foundation for the later explanation and understanding of the complex process of lip and face synthesis. In the clustering process, since the number of such basic patterns is unknown, unsupervised clustering is generally adopted.
Clustering algorithms involve many parameters, and the parameter settings strongly affect the clustering result. For clustering the basic lip and face motion patterns there is no labeled sample set for evaluating the error rate, and the geometric properties of the high-dimensional space cannot be observed directly, so evaluating the clustering result is difficult. Although within-class and between-class distances can be used to guide the evaluation, they cannot describe the effect that the clustering achieves in a real system, and for an animation system the quality of the final effect is what matters. We therefore directly compute the variance between the clustered data and the real data to measure whether the clustering result meets the requirement of describing the main motion patterns. Different clustering results can be obtained by adjusting the algorithm parameters, such as the desired number of clusters, the maximum number of training iterations, the minimum number of samples per class, the split parameter P and the merge parameter C; the variance of each result is computed with formula (10), and the results are shown in Table 1:
ErrorSquare(X, Y) = (X − Y)(X − Y)^T / ‖X‖    (10)
where X is the real data matrix, Y is the matrix of the real data after being mapped to its class, and ‖X‖ denotes the size of the matrix.
No. | Min. samples per class | Split parameter P / merge parameter C | Number of clusters | Variance
1 | 32 | P=0.5-1, C=1-1.5 | 18 | 3.559787
2 | 20 | P=0.5-1, C=1-1.5 | 21 | 4.813459
3 | 10 | P=0.5-1, C=1-1.5 | 23 | 2.947106
4 | 5 | P=0.5-1, C=1-1.5 | 29 | 2.916784
5 | 3 | P=0.5-1, C=1-1.5 | 33 | 2.997993
Table 1: Comparison of clustering results
The above clustering was carried out on 6200 sample data points, with the desired number of clusters set to 64 and the maximum number of training iterations set to 200; the remaining parameters were adjusted manually, where P denotes the split parameter, varying in [0.5, 1], and C the merge parameter, varying in [1, 1.5]. We find that the variance does not decrease smoothly but shows some fluctuation, mainly because different parameter choices affect steps of the clustering algorithm such as the selection of initial class centers and the deletion step. From the variance evaluation it can be seen that the clustering variances of rows 3, 4 and 5 differ little and can be considered stable, so the number of basic facial expression patterns is set to 29. Fig. 5 shows the result.
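For illustration, the clustering evaluation of formula (10) can be sketched as follows; it assumes that X holds one FAP vector per row, that Y replaces each row by its assigned cluster center, and that ‖X‖ is the number of rows, which is one plausible reading of the formula. All names and the toy data are illustrative.

```python
# Sketch of the clustering evaluation of formula (10):
# ErrorSquare(X, Y) = (X - Y)(X - Y)^T / ||X||.
import numpy as np

def error_square(X, Y):
    D = X - Y
    return np.trace(D @ D.T) / X.shape[0]   # mean squared deviation per frame

def cluster_variance(X, centers, labels):
    """Map every real frame to its assigned cluster center and apply formula (10)."""
    Y = centers[labels]
    return error_square(X, Y)

# Tiny illustrative example (not real data): 4 frames, 2 cluster centers.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
centers = np.array([[0.5, 0.5], [4.5, 4.5]])
labels = np.array([0, 0, 1, 1])
print(cluster_variance(X, centers, labels))   # 0.5
```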
2. Method for building the statistical visual model in step 3
The purpose of building the statistical visual model is to find the face motion trajectory that is optimal over the whole utterance, making full use of contextual information and thereby avoiding the limitation that a single neural network has difficulty exploiting context. The statistical visual model computes the probability of a video (face animation) sequence. Suppose F is the face animation sequence of a particular sentence,
F = f_1 f_2 … f_Q;
then P(F) can be computed by the following formula:
P(F) = P(f_1 f_2 … f_Q) = P(f_1) P(f_2 | f_1) … P(f_Q | f_1 f_2 … f_{Q-1})    (11)
However, for arbitrary sequences of face postures it is impossible to estimate all the conditional probabilities P(f_j | f_1 f_2 … f_{j-1}); in practice the N-gram approximation is generally adopted, so P(F) can be approximated as
P(F) = ∏_{i=1}^{Q} P(f_i | f_{i-1} f_{i-2} … f_{i-N+1})    (12)
The conditional probability P(f_i | f_{i-1} f_{i-2} … f_{i-N+1}) can be obtained by simple relative counting:
P(f_i | f_{i-1} f_{i-2} … f_{i-N+1}) = F(f_i, f_{i-1}, …, f_{i-N+1}) / F(f_{i-1}, …, f_{i-N+1})    (13)
where F(·) denotes the number of times the given combination of face postures occurs in the training video database. After the statistical visual model is built, perplexity is used to evaluate the quality of the whole trained model. Suppose θ_i is the center of cluster i obtained by the cluster analysis; for θ = {θ_1, θ_2, …, θ_n} we wish to find an optimized visual model. The perplexity of model θ can be defined as follows:
pp = 2^{H(S, θ)} ≈ 2^{−(1/n)·log p(S|θ)}    (14)
where S = s_1, s_2, …, s_n denotes the face animation parameter sequence of a sentence and p(S|θ) = ∏_i p(s_{i+1} | s_i … s_1) denotes the probability of the sequence S under model θ. In fact p(θ) represents our background knowledge of face motion and can be obtained with the statistical method above, for example with the bigram or trigram methods commonly used in natural language processing. Table 2 compares the perplexity of the statistical visual models obtained from different clustering results:
No. | Number of states | Bigram (PP) | Trigram (PP)
1 | 18 | 8.039958 | 2.479012
2 | 21 | 6.840446 | 2.152096
3 | 26 | 5.410093 | 1.799709
4 | 29 | 4.623306 | 1.623896
5 | 33 | 4.037879 | 1.478828
Table 2: Perplexity comparison
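As an illustration of formulas (13) and (14), the sketch below estimates a bigram statistical visual model by relative counting and computes its perplexity on a class sequence; the add-one smoothing, the base-2 logarithm and the normalization by sequence length are assumptions, and all names and toy data are illustrative.

```python
# Sketch of a bigram statistical visual model (formula (13)) and its perplexity
# (formula (14)). Assumes p(S|theta) is the product of the bigram transition
# probabilities of the FAP-class sequence.
import numpy as np

def bigram_visual_model(sequences, n_classes, smoothing=1.0):
    """P(f_i | f_{i-1}) estimated by relative counts over training class sequences."""
    counts = np.full((n_classes, n_classes), smoothing)
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def perplexity(sequence, transition):
    """Formula (14): pp = 2^(-(1/n) * log2 p(S|theta)) for one class sequence S."""
    log_p = sum(np.log2(transition[p, c]) for p, c in zip(sequence[:-1], sequence[1:]))
    return 2.0 ** (-log_p / len(sequence))

train = [[0, 1, 1, 2], [0, 1, 2, 2]]          # toy FAP-class sequences
A = bigram_visual_model(train, n_classes=3)
print(perplexity([0, 1, 1, 2, 2], A))         # lower means a better-fitting model
```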
With the statistical visual model we obtain a set of state transition probabilities; when several candidate face animation sequences are available, the Viterbi algorithm can be used to obtain the face animation sequence with the maximum probability.
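The selection of the maximum-probability face animation sequence can be sketched as follows; the emission scores stand in for the degrees of membership produced by the state networks, the transition matrix plays the role of the statistical visual model, and the code is a generic Viterbi decoder under those assumptions, not the patent's own implementation.

```python
# Sketch of selecting the maximum-probability face animation sequence with the
# Viterbi algorithm. emission[t][k] stands in for the degree of membership of
# frame t in FAP class k; the transition matrix could come from the bigram sketch.
import numpy as np

def viterbi(emission, transition):
    """Return the FAP-class sequence with maximum probability (log domain)."""
    T, K = emission.shape
    log_e, log_t = np.log(emission + 1e-12), np.log(transition + 1e-12)
    score = log_e[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_t            # score of every prev -> cur transition
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_e[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

transition = np.array([[0.2, 0.6, 0.2], [0.2, 0.3, 0.5], [0.25, 0.25, 0.5]])
emission = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]])
print(viterbi(emission, transition))   # -> [0, 1, 2]
```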
3. The neural network learning method in step 5
If the mapping from speech to FAP patterns is regarded as a pattern recognition task, many learning algorithms can be used, such as hidden Markov models (HMM), support vector machines (SVM) and neural networks. Because neural networks are efficient and robust at learning input-output mappings, we choose a neural network (a BP network) to learn from a large number of recorded sentences. Each cluster node is trained with two neural networks: one indicates the state, with value 0 or 1, and the other indicates the velocity. Both networks can be described uniformly as:
y_k = f_2( Σ_{j=0}^{n2} w_kj^{(2)} · f_1( Σ_{i=0}^{n1} w_ji^{(1)} · x_i ) )    (15)
where x ∈ Φ is the audio feature, w^{(1)} and w^{(2)} are the weights and thresholds of each layer, and f_1 and f_2 are sigmoid functions. Training is simple: given the data set, the Levenberg-Marquardt optimization algorithm is used to adjust the weights and thresholds of the network.
For each speech frame, a 16-dimensional LPC and RASTA-PLP mixed vector plus 2 prosodic parameters are computed, forming an 18-dimensional speech feature vector; the 6 frames around the current frame are combined into one input vector, so the input of each neural network is 108-dimensional. For the state network, the number of output nodes is 1, representing 0 or 1; the number of hidden nodes is 30, and the network parameters are a learning rate of 0.001 and a target error of 0.005. For the velocity network, the number of output nodes is 18, representing the 18-dimensional FAP feature vector; the number of hidden nodes is 80, and the parameters are likewise a learning rate of 0.001 and a target error of 0.005.
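The forward pass of formula (15) with the two network sizes described above can be sketched as follows; the sigmoid activations, the random placeholder weights and the class/function names are assumptions (training with the Levenberg-Marquardt algorithm is not shown).

```python
# Sketch of the two feed-forward networks of formula (15) for one face-pattern class:
# a state network (108 -> 30 -> 1) and a velocity network (108 -> 80 -> 18).
# Weights here are random placeholders; the patent trains them with Levenberg-Marquardt.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TwoLayerNet:
    """y_k = f2( sum_j w2_kj * f1( sum_i w1_ji * x_i ) ), formula (15)."""
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.w1 = rng.normal(scale=0.1, size=(n_hidden, n_in + 1))   # +1 for threshold
        self.w2 = rng.normal(scale=0.1, size=(n_out, n_hidden + 1))

    def forward(self, x):
        h = sigmoid(self.w1 @ np.append(x, 1.0))
        return sigmoid(self.w2 @ np.append(h, 1.0))

rng = np.random.default_rng(0)
state_net = TwoLayerNet(108, 30, 1, rng)      # outputs the 0/1 state (membership)
velocity_net = TwoLayerNet(108, 80, 18, rng)  # outputs the 18-dim FAP velocity

# Input: 6 neighbouring frames of 18-dim speech features = 108-dim vector.
x = rng.normal(size=108)
print(state_net.forward(x).shape, velocity_net.forward(x).shape)   # (1,) (18,)
```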
2) The application phase comprises the following steps (Fig. 6):
1) Audio recording:
Speech data can be obtained directly with a microphone or other recording equipment.
2) Audio feature extraction:
Speech features are extracted with the same audio feature extraction method as in the learning phase.
3) Mapping from audio features to video features based on the statistically learned model:
The speech features are fed as input into the neural network corresponding to each face pattern; each state network produces an output, the degree of membership in that class. After a sentence is finished, the statistical visual model and the Viterbi decoding algorithm are used to obtain the class transition route with the maximum probability; connecting the states along this route gives the face animation pattern sequence corresponding to the speech.
Although the information provided by the speech plays the major role, the Viterbi algorithm ensures that the generated sequence conforms to natural face motion. Representing each state of the sequence directly by its cluster center could already drive the face mesh, but because the basic patterns are a simplification, the animation would jitter. The classical remedy is interpolation, which removes the jitter but does not match the dynamics of face animation. In our method each state has two predicting neural networks, one of which predicts the velocity, so the final sequence obtained with the transition matrix contains enough information to generate animation consistent with natural face motion, and the whole formulation is very concise. Let T = (t_1, t_2, …, t_n) be the predicted face motion state points and V = {v_1, v_2, …, v_n} the velocities at each state point.
Y_{(t·i/m)→t+1} = Y_t + ((Y_{t+1} − Y_t)/m) · v_t · i,    if i ≤ m/2    (16)
Y_{(t·i/m)→t+1} = Y_{t+1} − ((Y_{t+1} − Y_t)/(i·m)) · v_{t+1},    if i > m/2    (17)
where Y_{(t·i/m)→t+1} denotes the i-th frame from state t to state t+1 and m denotes the number of frames that need to be inserted from state t to state t+1. Because the velocity parameter is included, the generated face animation matches the variability of face motion better than plain interpolation.
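A minimal sketch of the velocity-aware frame insertion of formulas (16) and (17), as reconstructed above, is given below; the FAP values, the velocities and the function name are illustrative.

```python
# Sketch of the velocity-aware frame insertion of formulas (16) and (17):
# m frames are inserted between state t and state t+1, the first half driven by the
# velocity v_t at state t and the second half by v_{t+1}.
import numpy as np

def insert_frames(y_t, y_t1, v_t, v_t1, m):
    """Return the m inserted FAP frames between states t and t+1."""
    frames = []
    for i in range(1, m + 1):
        if i <= m / 2:
            frame = y_t + ((y_t1 - y_t) / m) * v_t * i          # (16)
        else:
            frame = y_t1 - ((y_t1 - y_t) / (i * m)) * v_t1      # (17)
        frames.append(frame)
    return np.array(frames)

# Toy example: a 1-dimensional FAP value moving from 0 to 10 with m=4 inserted frames.
print(insert_frames(np.array([0.0]), np.array([10.0]), v_t=1.0, v_t1=1.0, m=4))
```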
4) Revision of the video feature stream based on face motion rules
After the face motion parameter sequence is obtained from the statistical model, even small prediction errors can lower the realism of the whole animation sequence, and some face motions, such as blinking and nodding, are only weakly correlated with the speech features. Therefore, on the basis of the statistical learning result, rules from the face motion knowledge base are applied to revise the sequence, improving the output and making the animation more realistic.
5) Synchronized audio-video playback
The speech and the animation file are obtained and can be played back directly on separate channels; because the underlying data are strictly synchronized, the playback is synchronized as well.
4) Comparison of experimental results
The system was evaluated both quantitatively and qualitatively. The quantitative test is based on computing the error between the predicted data and the real data, which should be done for any machine learning system. The qualitative test judges by perception whether the synthesized face motion looks real, which is very important for synthesis. In the quantitative test, the error between predicted and real data was measured on two sets: a closed set (the training data are the test data) and an open set (the test data were not used in training). Fig. 7 shows the test results of the upper-lip height parameter for two sentences compared with the single-neural-network method; in the upper two plots the test data are training data, and in the lower two plots the test data are non-training data. By testing all FAP parameters and computing the mean squared deviation between predicted and real data with formula (10), the results in Table 3 are obtained.
Test data | Mean squared deviation (VM+ANN) | Mean squared deviation (ANN)
Training data (closed set) | 2.859213 | 3.863582
Test data (open set) | 4.097657 | 5.558253
Table 3: Variance between the predicted FAP parameters and the real data
There is as yet no unified standard for evaluating multimodal systems. For a voice-driven face animation system, the face analysis data corresponding to an arbitrary user's voice are not available, so the error between predicted and real data cannot be computed, and purely quantitative results cannot represent the practical performance of the system. For speaker-independent tests, only qualitative evaluation is generally possible: in our experiments five people watched and listened to the system and assessed it in terms of intelligibility, naturalness, friendliness and the acceptability of the face motion. Because the system handles the dynamic changes of the face, uses the original recorded speech, and effectively solves the synchronization problem, it received high marks.
With the system described here, given a person's speech, the neural networks can predict in real time the FAP pattern corresponding to each frame of speech features; after smoothing, these FAPs directly drive an MPEG-4 based face mesh. Fig. 8 shows some frames of the voice-driven face animation.

Claims (5)

1. A voice-driven human face animation method based on the combination of statistics and rules, comprising the steps of:
using an audio-video synchronous segmentation method to obtain corresponding audio and video data streams;
obtaining corresponding feature vectors through audio-video analysis;
learning an audio-video synchronous mapping model with a statistical learning method;
using the statistically learned model together with rules to obtain the face motion parameters corresponding to new speech.
2. The method according to claim 1, wherein said audio-video synchronous segmentation method comprises the steps of:
a. supposing that the video capture rate is Videoframecount frames per millisecond, the audio sampling rate is Audiosamplecount samples per millisecond, the shift of the speech analysis window is Windowmove, the size of the speech analysis window is Windowsize, the required number of speech windows per video frame is m, and the ratio of Windowsize to Windowmove is n;
b. Windowmove=Audiosamplecount/(Videoframecount*m)
Windowsize=Windowmove*n
wherein m and n are adjustable parameters set according to the actual conditions.
3. The method according to claim 1, wherein said audio-video analysis and feature extraction comprises the steps of:
a. for audio, extracting the linear prediction parameters and prosodic parameters of the speech data in a Hamming window as the speech feature vector;
b. for video, extracting the facial feature points consistent with MPEG-4, then computing the difference between each feature point coordinate and the standard-frame coordinate, Vel={V1, V2, ..., Vn}, and computing the scale reference quantity of each feature point of the specific face as defined by MPEG-4, P={P1, P2, ..., Pn}; the face motion parameters can then be obtained by the following formula:
Fap_i=(V_i(x|y)/P_i(x|y))*1024
wherein Fap_i denotes the face motion parameter corresponding to the i-th feature point, V_i(x|y) denotes the x or y coordinate of V_i, and P_i(x|y) denotes the scale reference quantity corresponding to V_i(x|y).
4. The method according to claim 1, wherein said statistical learning of the audio-video synchronous mapping model comprises the steps of:
a) synchronously segmenting the feature sets Audio and Video;
b) applying unsupervised cluster analysis to the Video feature set to obtain the basic face motion patterns, giving I classes;
c) using a statistical method to obtain the transition probabilities between two or more classes, called the statistical visual model, evaluating the quality of the model with entropy, and repeating step b) until the entropy is minimized;
d) grouping the data in the speech feature set Audio corresponding to the same basic face motion pattern into the corresponding subset Audio(i), where i indicates the class;
e) training each subset Audio(i) with a neural network whose input is the speech features F(Audio(i)) in the subset and whose output is the degree of membership P(Video(i)) in that class.
5. The method according to claim 1, wherein said obtaining of the face motion parameters corresponding to the speech features comprises the steps of:
a) for given new speech, extracting the speech features;
b) feeding the speech features as input into the neural network corresponding to each face pattern, and obtaining as output the degree of membership in that class;
c) after a sentence is finished, using the statistical visual model and the Viterbi decoding algorithm to obtain the class transition route with the maximum probability; connecting the states along this route gives the face animation pattern sequence corresponding to the speech;
d) revising the predicted face animation pattern sequence with the rules in the face motion knowledge base, so that the result is more natural and realistic.
CNB021402868A 2002-07-03 2002-07-03 Statistics and rule combination based phonetic driving human face cartoon method Expired - Lifetime CN1320497C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021402868A CN1320497C (en) 2002-07-03 2002-07-03 Statistics and rule combination based phonetic driving human face cartoon method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB021402868A CN1320497C (en) 2002-07-03 2002-07-03 Statistics and rule combination based phonetic driving human face cartoon method

Publications (2)

Publication Number Publication Date
CN1466104A CN1466104A (en) 2004-01-07
CN1320497C true CN1320497C (en) 2007-06-06

Family

ID=34147542

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021402868A Expired - Lifetime CN1320497C (en) 2002-07-03 2002-07-03 Statistics and rule combination based phonetic driving human face cartoon method

Country Status (1)

Country Link
CN (1) CN1320497C (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100343874C (en) * 2005-07-11 2007-10-17 北京中星微电子有限公司 Voice-based colored human face synthesizing method and system, coloring method and apparatus
CN100369469C (en) * 2005-08-23 2008-02-13 王维国 Method for composing audio/video file by voice driving head image
CN100476877C (en) * 2006-11-10 2009-04-08 中国科学院计算技术研究所 Generating method of cartoon face driven by voice and text together
CN101482976B (en) * 2009-01-19 2010-10-27 腾讯科技(深圳)有限公司 Method for driving change of lip shape by voice, method and apparatus for acquiring lip cartoon
CN101488346B (en) * 2009-02-24 2011-11-02 深圳先进技术研究院 Speech visualization system and speech visualization method
CN102820030B (en) * 2012-07-27 2014-03-26 中国科学院自动化研究所 Vocal organ visible speech synthesis system
GB2510200B (en) * 2013-01-29 2017-05-10 Toshiba Res Europe Ltd A computer generated head
CN103279970B (en) * 2013-05-10 2016-12-28 中国科学技术大学 A kind of method of real-time voice-driven human face animation
US10586368B2 (en) * 2017-10-26 2020-03-10 Snap Inc. Joint audio-video facial animation system
CN109409307B (en) * 2018-11-02 2022-04-01 深圳龙岗智能视听研究院 Online video behavior detection method based on space-time context analysis
CN110072047B (en) * 2019-01-25 2020-10-09 北京字节跳动网络技术有限公司 Image deformation control method and device and hardware device
CN110599573B (en) * 2019-09-03 2023-04-11 电子科技大学 Method for realizing real-time human face interactive animation based on monocular camera
CN110610534B (en) * 2019-09-19 2023-04-07 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN111243626B (en) * 2019-12-30 2022-12-09 清华大学 Method and system for generating speaking video
CN115100329B (en) * 2022-06-27 2023-04-07 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
CN116912373B (en) * 2023-05-23 2024-04-16 苏州超次元网络科技有限公司 Animation processing method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0992933A2 (en) * 1998-10-09 2000-04-12 Mitsubishi Denki Kabushiki Kaisha Method for generating realistic facial animation directly from speech utilizing hidden markov models
US20020024519A1 (en) * 2000-08-20 2002-02-28 Adamsoft Corporation System and method for producing three-dimensional moving picture authoring tool supporting synthesis of motion, facial expression, lip synchronizing and lip synchronized voice of three-dimensional character
US20020054047A1 (en) * 2000-11-08 2002-05-09 Minolta Co., Ltd. Image displaying apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0992933A2 (en) * 1998-10-09 2000-04-12 Mitsubishi Denki Kabushiki Kaisha Method for generating realistic facial animation directly from speech utilizing hidden markov models
US20020024519A1 (en) * 2000-08-20 2002-02-28 Adamsoft Corporation System and method for producing three-dimensional moving picture authoring tool supporting synthesis of motion, facial expression, lip synchronizing and lip synchronized voice of three-dimensional character
US20020054047A1 (en) * 2000-11-08 2002-05-09 Minolta Co., Ltd. Image displaying apparatus

Also Published As

Publication number Publication date
CN1466104A (en) 2004-01-07

Similar Documents

Publication Publication Date Title
CN1320497C (en) Statistics and rule combination based phonetic driving human face cartoon method
Ferstl et al. Multi-objective adversarial gesture generation
CN103218842B (en) A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN108447474B (en) Modeling and control method for synchronizing virtual character voice and mouth shape
CN104361620B (en) A kind of mouth shape cartoon synthetic method based on aggregative weighted algorithm
Hong et al. Real-time speech-driven face animation with expressions using neural networks
Zeng et al. Audio-visual affect recognition
CN101101752B (en) Monosyllabic language lip-reading recognition system based on vision character
CN101187990A (en) A session robotic system
Sargin et al. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation
CN1860504A (en) System and method for audio-visual content synthesis
CN103279970A (en) Real-time human face animation driving method by voice
CN112331183A (en) Non-parallel corpus voice conversion method and system based on autoregressive network
CN116109455B (en) Language teaching auxiliary system based on artificial intelligence
CN115188074A (en) Interactive physical training evaluation method, device and system and computer equipment
CN1952850A (en) Three-dimensional face cartoon method driven by voice based on dynamic elementary access
Windle et al. The UEA Digital Humans entry to the GENEA Challenge 2023
Hofer et al. Automatic head motion prediction from speech data
Kettebekov et al. Prosody based audiovisual coanalysis for coverbal gesture recognition
Li et al. A novel speech-driven lip-sync model with CNN and LSTM
Braffort Research on computer science and sign language: Ethical aspects
Lan et al. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar
Pitsikalis et al. Data-driven sub-units and modeling structure for continuous sign language recognition with multiple cues
CN1152336C (en) Method and system for computer conversion between Chinese audio and video parameters
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term
CX01 Expiry of patent term

Granted publication date: 20070606