CN102609969A - Method for processing face and speech synchronous animation based on Chinese text drive - Google Patents


Info

Publication number
CN102609969A
CN102609969A (application CN201210037528)
Authority
CN
China
Prior art keywords
chinese
face
animation
people
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100375287A
Other languages
Chinese (zh)
Other versions
CN102609969B (en)
Inventor
赵群飞
杜鹏
樊延峰
邓杰
唐品
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN 201210037528 priority Critical patent/CN102609969B/en
Publication of CN102609969A publication Critical patent/CN102609969A/en
Application granted granted Critical
Publication of CN102609969B publication Critical patent/CN102609969B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for processing face and speech synchronous animation driven by Chinese text. The method comprises the steps of: classifying all Chinese phonemes into 16 groups of Chinese visemes according to the size of the lip action when the corresponding pinyin is pronounced, and synthesizing the corresponding key frames from an input face image; analyzing an input text to obtain the corresponding Chinese viseme sequence and the key frame sequence of the animation; interposing transition frames between every two neighboring key frames; aligning the key frame sequence with a speech stream; and finally playing the speech stream and the animation stream simultaneously to achieve face and speech synchronous animation. Given any face portrait and any text content as input, the method automatically completes the generation and output of the face animation; it is simple to operate, produces smooth results, and is suitable for applications such as visual human-computer interfaces, computer games, and the teaching of Chinese as a foreign language.

Description

A method for processing face and speech synchronous animation driven by Chinese text
Technical field
The present invention relates to the technical field of face and speech synchronous animation, and specifically to a method for processing face and speech synchronous animation driven by Chinese text.
Background technology
Text, sound and images are the main carriers of human information and knowledge today, and they are also the principal tools with which people learn and communicate. The interaction between these modalities has attracted increasing attention: integrating text, sound and image to form a direct conversion from text to visual speech, i.e. a speech-synchronized face animation system, lets a user see a synchronized talking face while listening to the computer speak, making the human-computer interface friendlier and more natural. Over the past decades, face and speech synchronous animation has advanced rapidly, from the early sequential playback of stored static images to today's real-time synthesis of three-dimensional face animation. A research team at the University of Science and Technology of China implemented a speech-synchronized animation system compatible with the MPEG-4 standard; it uses a neutral three-dimensional head model and two photographs of a real person (frontal and profile) to realize a three-dimensional "talking head", but the resulting animation is rather cartoon-like and still differs noticeably from a real speaking person. A research team at Shanghai Jiao Tong University implemented a face animation system with a neutral three-dimensional head model and a single frontal face photograph, but its insertion of transition frames and its alignment of the animation stream with the speech stream on the time axis are handled crudely, so the generated animation frequently flickers and looks unnatural.
A search of the prior art found the following. Chinese patent application No. 201010263097.7, "Real-time voice-driven face and lip synchronous animation system based on collaborative filtering", is characterized by recording speech in real time and making a character head model produce lip animation synchronized with the input speech. The system can receive the input speech signal in real time through a digital recorder and output face and lip animation synchronized with it; building its multimodal synchronization library requires no manual annotation, and it accepts arbitrary male or female voices for voice-driven lip animation. However, the system needs special multimodal acquisition equipment to record, synchronously, the speaker's voice and the motion of three-dimensional facial feature points during speech, which increases the difficulty of implementation and limits the system's applicability; moreover, because it is voice-driven, the audio must be recorded before the animation can be generated, so it cannot generate animation for an arbitrary text to be read aloud. Chinese patent application No. 200910263558.8, "Method for voice-driven lip animation", requires collecting original audio and video data from several people: each person reads words covering the initials and finals while being filmed with a DV or video camera to obtain audio and video streams. It therefore requires a large amount of data collection and cannot be fully automated.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art by providing a method for face and speech synchronous animation driven by Chinese text. The method is fully automatic: with only a computer equipped with a camera and the text to be read aloud as input, it produces a speech-synchronized animation of any face reading any Chinese text, with realistic and smooth output.
The present invention is realized through the following technical scheme:
A method for processing face and speech synchronous animation driven by Chinese text, characterized in that the method comprises the following steps:
(1) Face image acquisition: a light source illuminates the face of the subject, producing reflected or transmitted light that carries the facial features, which a CCD detector converts into the corresponding electrical signal; alternatively, a face image is read from a storage device.
(2) Face detection: the face image obtained in step (1) is preprocessed, and the face region is then detected with the AdaBoost algorithm.
(3) Facial feature extraction: within the face region detected in step (2), facial feature points are extracted with the ASM algorithm: 32 feature points for the mouth, 20 for the eyes, and 30 for the nose and the outer contour of the face.
(4) Key frame synthesis: using the 32 mouth feature points extracted in step (3), the mouth image is divided into 49 non-overlapping triangular patches; according to the classification and definition of the Chinese visemes, a free-form deformation algorithm controls the in-plane movement and shape deformation of the feature points and triangular patches to synthesize the corresponding face animation key frames.
(5) Transition frame synthesis: first, from the feature points of every two adjacent key frames of step (4), the feature points of the transition frames are computed by linear interpolation with time as the parameter; the mouth is again divided into 49 non-overlapping triangular patches according to the 32 interpolated mouth feature points, and the free-form deformation algorithm synthesizes the corresponding face animation transition frames from these patches. Then, based on the definition and classification of the 16 Chinese viseme groups, a different number of transition frames is inserted between every two adjacent key frames.
(6) Chinese text input: a Chinese text is entered or read from a storage device.
(7) Text analysis: the text content obtained in step (6) is analyzed to obtain the corresponding Chinese viseme stream.
(8) Text-to-speech conversion: the text content obtained in step (6) is converted into a speech stream.
(9) Synchronization of animation stream and speech stream: the key frames synthesized in step (4) are aligned onto the speech stream produced in step (8).
(10) Synchronized output of face and speech, displaying the synthesized face and speech synchronous animation.
Steps (1) to (5) are carried out simultaneously with steps (6) to (8).
The definition and classification of the Chinese visemes means that all Chinese pinyin units are classified into 16 Chinese viseme classes according to the lip action characteristics of their pronunciation.
The preprocessing means smoothing filtering and angle correction of the input face image.
The feature points of the transition frames are computed by the following formula:
P(k, t) = ((t_e − t)/(t_e − t_s)) × P(k, t_s) + ((t − t_s)/(t_e − t_s)) × P(k, t_e),  k = 1, 2, …, 32 and t ∈ [t_s, t_e)
where P(k, t) is the coordinate of the k-th mouth feature point at time t, t_s is the moment at which the pronunciation of a given Chinese viseme begins, and t_e is the moment at which it ends.
The number of transition frames to be inserted between every two adjacent key frames is computed by:
N_i = (W_i / W_sum) × T_w × F_v,  i = 1, 2, …, n
where N_i is the number of transition frames inserted between the i-th and the (i+1)-th Chinese viseme of a given Chinese character, n is the number of Chinese visemes corresponding to that character (n ≤ 3), W_i is the weight of the character's i-th Chinese viseme, W_sum is the sum of the weights of all the character's Chinese visemes, T_w is the duration of the character's pronunciation, and F_v is the animation playback rate in frames per second. Each Chinese viseme of a character corresponds to one key frame in the animation stream, so the i-th and (i+1)-th Chinese visemes of a character correspond to two adjacent key frames in the animation stream.
The whole process is simple to implement and easy to operate, the amount of computation is small, and the generated face and speech synchronous animation is realistic and smooth.
Description of drawings
Fig. 1 is the flow chart of the face and speech synchronous animation driven by Chinese text according to the present invention.
Fig. 2 is a schematic diagram of key frame alignment; Fa, Fb, Fc and Fd in the figure are Chinese viseme key frames.
Embodiment
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings and an embodiment, which should not be taken to limit the protection scope of the invention.
The Chinese pinyin table is divided into 16 groups of Chinese visemes according to the lip action characteristics of pronunciation (see Table 1), and each Chinese viseme is assigned a weight characterizing the size of its lip action (see Table 2). Table 1 gives the Chinese viseme grouping and Table 2 the Chinese viseme weights.
Table 1
(Table 1 is reproduced as an image in the original publication.)
Table 2
(Table 2 is reproduced as an image in the original publication.)
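Tables 1 and 2 appear only as images in this publication, so their actual contents are not recoverable here. The sketch below merely illustrates, in Python, the kind of data structures the two tables define — a pinyin-to-viseme-group mapping and a per-group lip-action weight; every concrete entry is an assumed placeholder, not the patent's real grouping or weights.

```python
# Table 1 (assumed entries): pinyin units -> one of 16 viseme group ids.
# The real grouping is given only as an image in the patent.
VISEME_GROUP = {
    "b": 1, "p": 1, "m": 1,   # bilabial closure (assumed grouping)
    "a": 2,                   # wide-open mouth (assumed)
    "o": 3, "u": 3,           # rounded lips (assumed)
    # ... remaining pinyin units omitted
}

# Table 2 (assumed entries): viseme group id -> lip-action-size weight.
VISEME_WEIGHT = {1: 1.0, 2: 3.0, 3: 2.0}

def visemes_of_syllable(initial, final):
    """Map a pinyin syllable to its viseme-id sequence (at most 3 per character)."""
    seq = []
    for unit in (initial, final):
        if unit in VISEME_GROUP:
            seq.append(VISEME_GROUP[unit])
    return seq

print(visemes_of_syllable("b", "a"))  # [1, 2]
```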
A face image is first acquired; the face region in the image is then detected in the face detection step, and the facial feature points in that region are extracted in the feature extraction step. From these feature points, the Chinese viseme key frames of the face animation are synthesized according to the definition and classification of the Chinese visemes, and transition frames are then inserted between every two adjacent key frames.
The Chinese text to be read aloud is entered or read in, analyzed to obtain the corresponding Chinese viseme sequence, and converted into a speech stream by the text-to-speech step. Finally, the Chinese viseme key frames are aligned onto the speech stream so that the animation stream and the speech stream can be output synchronously, achieving the face and speech synchronous animation.
Fig. 1 is the flow chart of the method for processing face and speech synchronous animation driven by Chinese text. As shown in the figure, the method comprises the following steps:
(1) Face image acquisition: a light source illuminates the face of the subject, producing reflected or transmitted light that carries the facial features, which a CCD detector converts into the corresponding electrical signal; alternatively, a face image is read from a storage device.
(2) Face detection: the face image obtained in step (1) is preprocessed by smoothing filtering and angle correction, and the approximate face region is then detected with the AdaBoost algorithm.
(3) Facial feature extraction: within the approximate face region detected in step (2), facial feature points are extracted with the ASM algorithm: 32 feature points for the mouth, 20 for the eyes, and 30 for the nose and the outer contour of the face.
(4) Key frame synthesis: in this embodiment, the mouth image is first divided into 49 non-overlapping triangular patches according to the 32 mouth feature points extracted in step (3); then, according to the classification and definition of the Chinese visemes in Table 1, a free-form deformation algorithm controls the in-plane movement and shape deformation of the feature points and triangular patches extracted in step (3), thereby synthesizing the corresponding face animation key frames.
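The patent does not spell out its free-form deformation algorithm in detail, but the per-triangle warping that underlies any such patch-based mouth deformation can be sketched as follows: given a triangle of source feature points and the corresponding deformed triangle, solve the six-parameter affine map between them and apply it to points inside the patch. This is a standard illustration only, and all function names are the author's, not the patent's.

```python
def affine_from_triangles(src, dst):
    """Solve the affine map (x, y) -> (a0*x + a1*y + a2, b0*x + b1*y + b2)
    carrying the three src vertices onto the three dst vertices (Cramer's rule)."""
    (x0, y0), (x1, y1), (x2, y2) = src
    det = x0 * (y1 - y2) - y0 * (x1 - x2) + (x1 * y2 - x2 * y1)
    if det == 0:
        raise ValueError("degenerate source triangle")

    def solve(c0, c1, c2):
        # Coefficients for one output coordinate given its three target values.
        a = (c0 * (y1 - y2) + c1 * (y2 - y0) + c2 * (y0 - y1)) / det
        b = (c0 * (x2 - x1) + c1 * (x0 - x2) + c2 * (x1 - x0)) / det
        c = (c0 * (x1 * y2 - x2 * y1) + c1 * (x2 * y0 - x0 * y2)
             + c2 * (x0 * y1 - x1 * y0)) / det
        return a, b, c

    (u0, v0), (u1, v1), (u2, v2) = dst
    return solve(u0, u1, u2), solve(v0, v1, v2)

def warp_point(affine, p):
    """Apply the affine map returned above to a single (x, y) point."""
    (a0, a1, a2), (b0, b1, b2) = affine
    x, y = p
    return (a0 * x + a1 * y + a2, b0 * x + b1 * y + b2)
```

Warping every pixel of each of the 49 patches this way moves the mouth from one viseme shape to another while keeping the patch boundaries consistent.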
(5) Transition frame synthesis: first, from the feature points of every two adjacent key frames of step (4), the feature points of the transition frames are computed by linear interpolation with time as the parameter; the mouth is again divided into 49 non-overlapping triangular patches according to the 32 interpolated mouth feature points, and the free-form deformation algorithm synthesizes the corresponding face animation transition frames from these patches.
The feature points of any transition frame are computed by the following formula:
P(k, t) = ((t_e − t)/(t_e − t_s)) × P(k, t_s) + ((t − t_s)/(t_e − t_s)) × P(k, t_e),  k = 1, 2, …, 32 and t ∈ [t_s, t_e)
where P(k, t) is the coordinate of the k-th mouth feature point at time t, t_s is the moment at which the pronunciation of a given Chinese viseme begins, and t_e is the moment at which it ends.
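The interpolation formula above translates directly into code. A minimal sketch, with function and argument names chosen for illustration:

```python
def transition_points(points_start, points_end, t, t_s, t_e):
    """Feature points of the transition frame at time t, per
    P(k,t) = (t_e-t)/(t_e-t_s)*P(k,t_s) + (t-t_s)/(t_e-t_s)*P(k,t_e),
    applied to each of the 32 (x, y) mouth feature points."""
    if not (t_s <= t < t_e):
        raise ValueError("t must lie in [t_s, t_e)")
    w = (t - t_s) / (t_e - t_s)  # blending weight toward the end keyframe
    return [((1 - w) * xs + w * xe, (1 - w) * ys + w * ye)
            for (xs, ys), (xe, ye) in zip(points_start, points_end)]
```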
Then, based on the definition and classification of the 16 Chinese viseme groups, a different number of transition frames is inserted between every two adjacent key frames.
The number of transition frames to insert is determined by the weight of the corresponding Chinese viseme in Table 2; the number needed between any two adjacent key frames is computed by:
N_i = (W_i / W_sum) × T_w × F_v,  i = 1, 2, …, n
where N_i is the number of transition frames to insert between the i-th and the (i+1)-th Chinese viseme of a given Chinese character, n is the number of Chinese visemes corresponding to that character (from Table 1, n ≤ 3), W_i is the weight in Table 2 of the character's i-th Chinese viseme, W_sum is the sum of the weights of all the character's Chinese visemes, T_w is the duration of the character's pronunciation, and F_v is the animation playback rate in frames per second.
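The N_i formula can be sketched as below. The patent does not say how a fractional N_i is rounded, so flooring here is an assumption, and the sample weights come from the caller rather than from the (image-only) Table 2:

```python
import math

def transition_frame_counts(weights, t_w, f_v):
    """N_i = W_i / W_sum * T_w * F_v for each viseme weight W_i of one
    Chinese character, with t_w the pronunciation duration in seconds and
    f_v the animation playback rate in frames per second.
    Flooring the fractional result is an assumption; the patent does not
    specify the rounding rule."""
    w_sum = sum(weights)
    return [math.floor(w / w_sum * t_w * f_v) for w in weights]
```

For a character lasting 1 second at 20 fps whose two visemes have weights 1 and 3, this yields 5 and 15 transition frames, so the mouth lingers longer on the larger lip action.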
(6) Chinese text input: a Chinese text is entered or read from a storage device.
(7) Text analysis: the text content obtained in step (6) is analyzed to obtain the corresponding Chinese viseme stream, i.e. the sequence of Chinese visemes.
(8) Text-to-speech conversion: the text content obtained in step (6) is converted into a speech stream.
(9) Synchronization of animation stream and speech stream: the key frames synthesized in step (4) are aligned onto the speech stream produced in step (8). The concrete method is as follows:
First, the text-to-speech engine emits an event at the start of each Chinese character to indicate that it has begun "reading" that character; the time difference between two successive events is the duration of one character's pronunciation. Then, from the viseme stream (sequence) of the character obtained in step (7), the key frame stream (sequence) of the face animation for that character's pronunciation is obtained, and these key frames are arranged over the duration of the character's pronunciation in the weight ratios given in Table 2.
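A minimal sketch of this alignment step, assuming the TTS engine's per-character onset events have already been collected into a list of times with one final end time appended; the function name and the sample weights are illustrative, not the patent's values:

```python
def align_keyframes(char_onsets, char_visemes, viseme_weight):
    """Place each character's viseme keyframes on the speech time axis.
    char_onsets:   onset time of each character plus one trailing end time,
                   as reported by the TTS engine's start-of-character events.
    char_visemes:  viseme-id sequence of each character (from text analysis).
    viseme_weight: weight of each viseme id (Table 2 in the patent).
    Each character's duration is split among its keyframes in the ratio of
    the viseme weights; returns (time, viseme_id) pairs."""
    timeline = []
    for i, visemes in enumerate(char_visemes):
        t_start, t_end = char_onsets[i], char_onsets[i + 1]
        w_sum = sum(viseme_weight[v] for v in visemes)
        t = t_start
        for v in visemes:
            timeline.append((t, v))
            # this keyframe's segment lasts in proportion to its weight
            t += (t_end - t_start) * viseme_weight[v] / w_sum
    return timeline
```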
(10) Synchronized output of face and speech, realizing the face and speech synchronous animation.

Claims (6)

1. A method for processing face and speech synchronous animation driven by Chinese text, characterized in that the method comprises the following steps:
(1) face image acquisition: a light source illuminates the face of the subject, producing reflected or transmitted light that carries the facial features, which a CCD detector converts into the corresponding electrical signal; alternatively, a face image is read from a storage device;
(2) face detection: the face image obtained in step (1) is preprocessed, and the face region is then detected with the AdaBoost algorithm;
(3) facial feature extraction: within the face region detected in step (2), facial feature points are extracted with the ASM algorithm: 32 feature points for the mouth, 20 for the eyes, and 30 for the nose and the outer contour of the face;
(4) key frame synthesis: the mouth image is divided into 49 non-overlapping triangular patches according to the 32 mouth feature points extracted in step (3); according to the classification and definition of the Chinese visemes, a free-form deformation algorithm controls the in-plane movement and shape deformation of the feature points and triangular patches to synthesize the corresponding face animation key frames;
(5) transition frame synthesis: first, from the feature points of every two adjacent key frames of step (4), the feature points of the transition frames are computed by linear interpolation with time as the parameter; the mouth is again divided into 49 non-overlapping triangular patches according to the 32 interpolated mouth feature points, and the free-form deformation algorithm synthesizes the corresponding face animation transition frames from these patches; then, based on the definition and classification of the 16 Chinese viseme groups, a different number of transition frames is inserted between every two adjacent key frames;
(6) Chinese text input: a Chinese text is entered or read from a storage device;
(7) text analysis: the text content obtained in step (6) is analyzed to obtain the corresponding Chinese viseme stream;
(8) text-to-speech conversion: the text content obtained in step (6) is converted into a speech stream;
(9) synchronization of animation stream and speech stream: the key frames synthesized in step (4) are aligned onto the speech stream produced in step (8);
(10) synchronized output of face and speech.
2. The method for processing face and speech synchronous animation according to claim 1, characterized in that steps (1) to (5) are carried out simultaneously with steps (6) to (8).
3. The method for processing face and speech synchronous animation according to claim 1 or 2, characterized in that the definition and classification of the Chinese visemes means that all Chinese pinyin units are classified into 16 Chinese viseme classes according to the lip action characteristics of Chinese pronunciation.
4. The method for processing face and speech synchronous animation according to claim 1 or 2, characterized in that the preprocessing means smoothing filtering and angle correction of the input face image.
5. The method for processing face and speech synchronous animation according to claim 1 or 2, characterized in that the feature points of the transition frames are computed by the following formula:
P(k, t) = ((t_e − t)/(t_e − t_s)) × P(k, t_s) + ((t − t_s)/(t_e − t_s)) × P(k, t_e),  k = 1, 2, …, 32 and t ∈ [t_s, t_e)
where P(k, t) is the coordinate of the k-th mouth feature point at time t, t_s is the moment at which the pronunciation of a given Chinese viseme begins, and t_e is the moment at which it ends.
6. The method for processing face and speech synchronous animation according to claim 1 or 2, characterized in that the number of transition frames to be inserted between every two adjacent key frames is computed by:
N_i = (W_i / W_sum) × T_w × F_v,  i = 1, 2, …, n
where N_i is the number of transition frames inserted between the i-th and the (i+1)-th Chinese viseme of a given Chinese character, n is the number of Chinese visemes corresponding to that character (n ≤ 3), W_i is the weight of the character's i-th Chinese viseme, W_sum is the sum of the weights of all the character's Chinese visemes, T_w is the duration of the character's pronunciation, and F_v is the animation playback rate in frames per second.
CN 201210037528 2012-02-17 2012-02-17 Method for processing face and speech synchronous animation based on Chinese text drive Expired - Fee Related CN102609969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210037528 CN102609969B (en) 2012-02-17 2012-02-17 Method for processing face and speech synchronous animation based on Chinese text drive

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210037528 CN102609969B (en) 2012-02-17 2012-02-17 Method for processing face and speech synchronous animation based on Chinese text drive

Publications (2)

Publication Number Publication Date
CN102609969A true CN102609969A (en) 2012-07-25
CN102609969B CN102609969B (en) 2013-08-07

Family

ID=46527312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210037528 Expired - Fee Related CN102609969B (en) 2012-02-17 2012-02-17 Method for processing face and speech synchronous animation based on Chinese text drive

Country Status (1)

Country Link
CN (1) CN102609969B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030163315A1 (en) * 2002-02-25 2003-08-28 Koninklijke Philips Electronics N.V. Method and system for generating caricaturized talking heads
US6665643B1 (en) * 1998-10-07 2003-12-16 Telecom Italia Lab S.P.A. Method of and apparatus for animation, driven by an audio signal, of a synthesized model of a human face
CN1971621A (en) * 2006-11-10 2007-05-30 中国科学院计算技术研究所 Generating method of cartoon face driven by voice and text together
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tu Huan, Zhou Jingye, et al.: "A cartoon face animation method jointly driven by speech and text", Journal of Chinese Computer Systems (《小型微型计算机系统》), vol. 28, no. 12, 31 December 2007, pages 2238-2241, relevant to claims 1-6 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500461A (en) * 2013-09-18 2014-01-08 珠海金山网络游戏科技有限公司 Animation generation method for reducing real-time interpolation calculated amount
CN104268526A (en) * 2014-09-25 2015-01-07 北京航空航天大学 Chinese character image matching and deformation method
CN104268526B (en) * 2014-09-25 2017-09-01 北京航空航天大学 A kind of Chinese character picture match and deformation method
CN104616338B (en) * 2015-01-26 2018-02-27 江苏如意通动漫产业有限公司 The consistent speed change interpolating method of space-time based on 2 D animation
CN104616338A (en) * 2015-01-26 2015-05-13 江苏如意通动漫产业有限公司 Two-dimensional animation-based time-space consistent variable speed interpolation method
CN104834750A (en) * 2015-05-28 2015-08-12 瞬联软件科技(北京)有限公司 Method for generating character curves
US10311133B2 (en) 2015-05-28 2019-06-04 Cienet Technologies (Beijing) Co., Ltd. Character curve generating method and device thereof
WO2016188493A1 (en) * 2015-05-28 2016-12-01 瞬联软件科技(北京)有限公司 Character curve generating method and device thereof
CN104834750B (en) * 2015-05-28 2018-03-02 瞬联软件科技(北京)有限公司 A kind of word curve generation method
CN105390133A (en) * 2015-10-09 2016-03-09 西北师范大学 Tibetan TTVS system realization method
CN105786798B (en) * 2016-02-25 2018-11-02 上海交通大学 Natural language is intended to understanding method in a kind of human-computer interaction
CN105786798A (en) * 2016-02-25 2016-07-20 上海交通大学 Natural language intention understanding method in man-machine interaction
CN107203773A (en) * 2016-03-17 2017-09-26 掌赢信息科技(上海)有限公司 The method and electronic equipment of a kind of mouth expression migration
CN106328163A (en) * 2016-08-16 2017-01-11 新疆大学 Uygur language phoneme-viseme parameter conversion method and system
CN106328163B (en) * 2016-08-16 2019-07-02 新疆大学 Uygur language phoneme-viseme parameter conversion method and system
CN109949390A (en) * 2017-12-21 2019-06-28 腾讯科技(深圳)有限公司 Image generating method, dynamic expression image generating method and device
CN108765528A (en) * 2018-04-10 2018-11-06 南京江大搏达信息科技有限公司 Data-driven 3D face animation synthesis method for game characters
CN110580336A (en) * 2018-06-08 2019-12-17 北京得意音通技术有限责任公司 Lip language word segmentation method and device, storage medium and electronic equipment
CN110853614A (en) * 2018-08-03 2020-02-28 Tcl集团股份有限公司 Virtual object mouth shape driving method and device and terminal equipment
CN110867177A (en) * 2018-08-16 2020-03-06 林其禹 Voice playing system with selectable timbre, playing method thereof and readable recording medium
CN110730389A (en) * 2019-12-19 2020-01-24 恒信东方文化股份有限公司 Method and device for automatically generating interactive question and answer for video program
CN111459452A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111460785A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN113672194A (en) * 2020-03-31 2021-11-19 北京市商汤科技开发有限公司 Method, device and equipment for acquiring acoustic feature sample and storage medium
CN112328076A (en) * 2020-11-06 2021-02-05 北京中科深智科技有限公司 Method and system for driving character gestures through voice
CN113379875A (en) * 2021-03-22 2021-09-10 平安科技(深圳)有限公司 Cartoon character animation generation method, device, equipment and storage medium
CN113379875B (en) * 2021-03-22 2023-09-29 平安科技(深圳)有限公司 Cartoon character animation generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN102609969B (en) 2013-08-07

Similar Documents

Publication Publication Date Title
CN102609969B (en) Method for processing face and speech synchronous animation based on Chinese text drive
CN108447474B (en) Modeling and control method for synchronizing virtual character voice and mouth shape
CN103218842B (en) A kind of method for voice-synchronized driving of three-dimensional mouth-shape and facial pose animation
CN109308731B (en) Speech-driven lip-sync face video synthesis algorithm based on cascaded convolutional LSTM
EP3226245B1 (en) System and method to insert visual subtitles in videos
CN100476877C (en) Method for generating cartoon faces jointly driven by voice and text
US20060012601A1 (en) Method of animating a synthesised model of a human face driven by an acoustic signal
Yargıç et al. A lip reading application on MS Kinect camera
CN107330961A (en) Text-to-audiovisual conversion method and system
CN110096966A (en) Speech recognition method fusing a depth-information Chinese multimodal corpus
JP6095381B2 (en) Data processing apparatus, data processing method, and program
CN100596186C (en) Interactive digital multimedia production method based on video and audio
CN101930619A (en) Collaborative filtering-based real-time voice-driven human face and lip synchronous animation system
CN115511994A (en) Method for quickly cloning real person into two-dimensional virtual digital person
CN104144280A (en) Method and device for synchronous control of voice and action animation in electronic greeting cards
WO2018113649A1 (en) Virtual reality language interaction system and method
Hong et al. iFACE: a 3D synthetic talking face
Petridis et al. Audiovisual laughter detection based on temporal features
Vignoli et al. A text-speech synchronization technique with applications to talking heads
WO2024113701A1 (en) Voice-based video generation method and apparatus, server, and medium
CN117315102A (en) Virtual anchor processing method, device, computing equipment and storage medium
Sui et al. A 3D audio-visual corpus for speech recognition
Karpov et al. A framework for recording audio-visual speech corpora with a microphone and a high-speed camera
Zahedi et al. Robust sign language recognition system using ToF depth cameras
CN101968894A (en) Method for automatically realizing sound and lip synchronization through Chinese characters

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130807

Termination date: 20160217