CN1866270B - Face recognition method based on video frequency - Google Patents

Face recognition method based on video frequency

Info

Publication number
CN1866270B
CN1866270B · CN2005100709199A · CN200510070919A
Authority
CN
China
Prior art keywords
video
frame
identified
video sequence
subspace
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2005100709199A
Other languages
Chinese (zh)
Other versions
CN1866270A (en)
Inventor
汤晓鸥 (Xiaoou Tang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese University of Hong Kong CUHK
Original Assignee
Chinese University of Hong Kong CUHK
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese University of Hong Kong (CUHK)
Publication of CN1866270A
Application granted
Publication of CN1866270B
Legal status: Expired - Fee Related
Anticipated expiration


Landscapes

  • Collating Specific Patterns (AREA)
  • Image Analysis (AREA)

Abstract

The provided video-to-video face recognition method comprises: synchronizing the videos in both time and space, applying multi-level subspace analysis, and processing the data to extract the target features. The invention makes full use of the information in the video sequence, overcomes processing-speed and data-scale defects, and obtains a nearly perfect classification result on the XM2VTS database.

Description

Face recognition method based on video
Technical field
The present invention relates to the field of image recognition, and more specifically to techniques for performing face recognition based on video images.
Background art
Automatic face recognition is a challenging task in pattern recognition research. In recent years a large number of techniques have been proposed, for example:
1. Local feature analysis methods, including:
1) The active appearance model (AAM) method: see T.F. Cootes, G.J. Edwards, and C.J. Taylor, "Active Appearance Models" (Reference 1), IEEE Trans. on PAMI, Vol. 23, No. 6, pp. 681-685, June 2001; and
2) The elastic graph matching (EGM) method: see L. Wiskott, J.M. Fellous, N. Krueger, and C. von der Malsburg, "Face Recognition by Elastic Bunch Graph Matching" (Reference 2), IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 775-779, 1997.
2. Appearance-based subspace methods, including:
1) The eigenface method: see M. Turk and A. Pentland, "Face recognition using eigenfaces" (Reference 3), IEEE International Conference on Computer Vision and Pattern Recognition, pp. 586-591, 1991.
2) The LDA method: see P. Belhumeur, J. Hespanha, and D. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection" (Reference 4), IEEE Trans. on PAMI, Vol. 19, No. 7, pp. 711-720, July 1997; and W. Zhao, R. Chellappa, and N. Nandhakumar, "Empirical performance analysis of linear discriminant classifiers" (Reference 5), Proceedings of CVPR, pp. 164-169, 1998.
3) The Bayesian method: see B. Moghaddam, T. Jebara, and A. Pentland, "Bayesian face recognition" (Reference 6), Pattern Recognition, Vol. 33, pp. 1771-1782, 2000.
However, all of the above methods are image-based face recognition methods that use still images as input data. The first problem with image-based face recognition is that someone may use a previously recorded face photograph to fool the camera, which mistakes the photograph for a live subject. The second problem is that, compared with other high-accuracy biometric techniques, the accuracy of image-based recognition is still too low for some practical applications. To address these problems, video-based face recognition has recently been proposed. One major advantage of video-based face recognition is that it prevents the recognition system from being cheated with a pre-stored face image: although forging a video sequence in front of a live camera remains possible, it is very difficult, so the biometric data presented at authentication time can be guaranteed to come from a real subject. Another key advantage of the video-based recognition method is that more information is available in a video sequence than in a single image. If this extra information can be extracted correctly, the recognition accuracy can be further improved.
However, compared with the large body of image-based face recognition techniques, research on video-to-video face recognition is still limited. Most research on face recognition in video concentrates mainly on face detection and tracking in the video.
Once a face is located in a video frame, existing methods usually apply traditional image-based face recognition techniques to identify single frames. For recognition that uses video data directly, see S. Satoh, "Comparative Evaluation of Face Sequence Matching for Content-Based Video Access" (Reference 8), Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 163-168, 2000. Satoh matches two video sequences by selecting the closest pair of frames in the two videos, so it remains image-to-image matching.
In addition, for methods that use video sequences to train a statistical face model used for matching, see the following documents:
V. Kruger and S. Zhou, "Exemplar-based Face Recognition from Video" (Reference 9), Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 182-187, 2002.
G. Edwards, C. Taylor, and T. Cootes, "Improving Identification Performance by Integrating Evidence from Sequences" (Reference 10), IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 486-491, 1999.
Although a model trained in this way is more stable and robust than a model trained from single images, given the same feature dimensionality the global information contained in the model is still similar to that of a single image. As with image-to-image matching, the scale of the training data also increases.
In the above document by Satoh, and in O. Yamaguchi, K. Fukui, and K. Maeda, "Face Recognition Using Temporal Image Sequence" (Reference 11), Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 318-323, 1998, a mutual subspace method is described, in which the video frames of each person are used to compute an individual eigenspace for that person. Because it cannot obtain discriminant information from the differences between different people, its recognition accuracy is lower than that of other methods.
Furthermore, although more information is available in a video sequence than in a single image, which can thus help improve recognition accuracy, the problems of large data scale, slow processing speed, and high processing complexity must be solved.
Summary of the invention
Therefore, in view of the problems of the prior art in face recognition discussed above, the object of the present invention is to provide a new video-to-video face recognition method that makes full use of the spatio-temporal information contained in video sequences to achieve high recognition accuracy, while overcoming the large data scale and slow processing speed that using video sequences for face recognition brings.
The face recognition method according to the present invention comprises: 1) determining a plurality of corresponding similar video frames in the video sequence to be identified and in a video sequence of a reference image library; 2) aligning reference points of the corresponding similar video frames in the video sequence to be identified and in the video sequence of the reference image library; 3) constructing a face data cube of the person to be identified from the plurality of reference-point-aligned video frames of the video to be identified; and 4) performing subspace analysis on the face data cube to extract the facial features of the person to be identified, and comparing them with the facial feature vectors in the reference image library.
In the present invention, the process in step 1) of determining which video frames in the video sequence to be identified are similar to images in the reference image library is called time synchronization of the video frames. Through this time synchronization, frames with similar images are found in the two video sequences. According to one scheme of the present invention, the waveform of the audio signal is used to determine the desired frames in each video. Using the audio signal contained in the video in this way is simple and effective and avoids complicated algorithms.
After time synchronization, the process of aligning reference points in each image is called spatial synchronization in the present invention. In an embodiment of the present invention, Gabor wavelet features are used to perform spatial synchronization; for Gabor wavelet features, see Reference 2. This is explained further below. Aligning the reference points is important for exploiting the shape similarity between different face images in the subspace method.
To match and identify the large time- and space-synchronized video sequences quickly, the method provided by the invention includes a multi-level subspace analysis method and a multi-classifier fusion method.
In the multi-level subspace analysis method, the feature vector of each frame of the face cube of the person to be identified is treated as a feature slice. In the first-level subspace analysis, a discriminant feature vector is extracted from each feature slice. In the second-level subspace analysis, the discriminant feature vectors extracted from the feature slices are first concatenated in order to form a new feature vector; PCA is then applied to the new feature vector to eliminate the redundant information across frames, and the features with large eigenvalues are chosen to form the final feature vector used for recognition.
In the multi-classifier fusion method according to the present invention, after the first-level subspace analysis of the above multi-level subspace analysis method, the second-level subspace analysis is not performed. Instead, the discriminant feature vectors obtained in the first-level subspace analysis are used directly to identify each frame, and fusion rules are then used to merge the results of all the frame-based classifiers to perform the final identification of the video sequence.
According to the present invention, the following beneficial effects can be obtained:
1) The processing complexity of directly recognizing and processing raw video data is avoided, so face recognition can be carried out quickly and with high accuracy.
2) For identity verification systems adopting the audio-assisted video recognition method, since the person to be identified is required to speak in real time, the security deficiency of traditional recognition based on still images (and even of traditional video recognition) is avoided, giving higher security.
Description of the drawings
Fig. 1 is a schematic diagram of the audio-assisted time synchronization of video sequence frames according to the method of the invention;
Fig. 2 is a schematic diagram of a face graph template, showing an example of the reference points selected on a face.
Embodiment
A preferred embodiment of the present invention is described below with reference to the drawings.
In the video-based recognition method according to the present invention, for video to deliver its advantage of providing more information, the individual frames in the video should differ from one another: if all the frames are similar to each other, the information contained in the video sequence is essentially the same as that of a single image. Yet for video whose frame content changes, simple frame-by-frame matching of the two video sequences (the template video sequence and the video sequence to be identified) does not help much, because a frame in one video may be matched with a frame showing a different expression in the other video, which can instead further damage face recognition performance.
Therefore, the key to improving video-based recognition performance is that the images in the two video sequences must be in the same order with respect to their individual frames; for example, a neutral (expressionless) face should be matched with a neutral face, and a smiling face with a smiling face. This shows that if video sequences are to be used for face recognition, it is important that the two video sequences have similar video frames arranged in the same order, i.e. that they are time-synchronized. In other words, the original video sequences (the template video sequence and the video sequence to be identified) need to be reordered according to the content of each frame.
To achieve this, conventional face-based expression recognition techniques could be used to match similar expressions across different videos. However, the computational cost is too high for data on the scale of video, and the accuracy of expression recognition is not very high either. Information such as expression, illumination, or pose could of course be used for video synchronization; according to a preferred embodiment of the present invention, the information in the audio signal contained in the video is used to perform the time synchronization of video sequence frames. This method is described in detail below.
Take the XM2VTS database as an example (the largest publicly available facial video database; see Reference 12: K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The Extended M2VTS Database", Second International Conference on AVBPA, March 1999). Its video data contains video sequences of 295 people. For each person, several video sequences (each 20 seconds long) were captured in four different sessions. In each session, while the video sequence was recorded, the person was asked to read two passages aloud: "0, 1, 2, ..., 9" and "5, 0, 6, 9, 2, 8, 1, 3, 7, 4". These speech signals, together with the differing expressions, can be used to locate frames.
Fig. 1 shows an example using the pronunciation of five words: "zero", "one", "two", "three", "four". In this example, the peak (maximum point) of the audio waveform of each spoken word is located, and the video frame at the moment of that waveform peak is chosen. This method is used to select video frames both for the training videos used to build the reference image library and for the test videos to be identified, thereby time-synchronizing the video frames of the two kinds of video sequences. Of course, other parameters can also be used as the reference point for choosing the corresponding video frames (for example the trough (minimum point) of the audio waveform, or the center of the audio region of each word). Usually, a person shows different expressions when reading different words. Other paragraphs or sentences can of course be used, as long as the content is the same for the training videos used for modeling and the test videos to be identified. A minimal sketch of this peak-picking scheme is given below.
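As an illustration of audio-peak frame selection, the following Python sketch locates the waveform peak of each word and picks the video frame at that instant. It is a minimal sketch under assumptions not in the patent: word boundaries are found by a simple energy threshold, and the smoothing window and threshold values are arbitrary.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def select_synchronized_frames(audio, sr, fps, n_words):
    """Pick one video frame per spoken word, at the word's audio-energy peak.

    audio   : 1-D numpy array of audio samples
    sr      : audio sample rate (Hz)
    fps     : video frame rate (frames/s)
    n_words : number of words expected (e.g. 10 digits)
    """
    # Short-time energy envelope (50 ms smoothing window).
    energy = uniform_filter1d(audio.astype(float) ** 2, size=int(0.05 * sr))

    # Crude word segmentation: samples above a fraction of the maximum energy.
    active = energy > 0.1 * energy.max()
    edges = np.flatnonzero(np.diff(active.astype(int)))
    segments = list(zip(edges[::2], edges[1::2]))[:n_words]

    frame_indices = []
    for start, end in segments:
        peak_sample = start + int(np.argmax(energy[start:end]))   # waveform peak
        frame_indices.append(int(round(peak_sample / sr * fps)))  # frame at that time
    return frame_indices
```

The same routine would be run on both the gallery and the probe videos so that the k-th selected frame of each corresponds to the same spoken word.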
Although more computationally expensive advanced speech recognition could be used to improve this result, the above method has proved very effective and efficient for synchronizing video sequences and choosing multiple distinct frames for face recognition.
The above method can also easily be extended to include more information. For example, in an identification system, the above video recognition method, which uses the audio of the speech of the person to be identified to select frames, can be integrated with verification based on the content of the speech (such as password verification) and/or on the tonal characteristics of the person to be identified, achieving more accurate and secure performance.
After the above time synchronization, the reference points of each image are aligned, because people's faces move and change when they talk. Fig. 2 shows an example of the facial reference points of such an image; in this embodiment there are 35 reference points. In this specification this step is called spatial synchronization. Aligning the reference points is very important for the subspace method to exploit the shape similarity among the faces of different people. Gabor wavelet features can be used to locate the reference points for spatial synchronization.
The concrete method is as follows: compute the Gabor wavelet feature value at each reference point of the reference image; extract Gabor wavelet feature values in the local region of the image to be identified around each reference point; then find, near the position of the corresponding reference point of the reference image (template), the point in the image to be identified whose Gabor wavelet feature value is closest, and take it as the reference point of the image to be identified near that position. A sketch of this matching step follows.
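The sketch below illustrates this local matching with Gabor jets. The `gabor_jet` helper, the filter-bank parameters, the search radius, and the similarity measure (normalized dot product of jet magnitudes, in the spirit of the elastic graph matching literature, Reference 2) are all illustrative assumptions, not the patent's exact settings.

```python
import numpy as np

def gabor_jet(image, y, x, n_scales=5, n_orient=8):
    """Hypothetical helper: magnitudes of a bank of Gabor filter responses
    at pixel (y, x). Returns the response vector (the 'jet')."""
    jet = []
    yy, xx = np.mgrid[-16:17, -16:17]
    envelope = np.exp(-(xx ** 2 + yy ** 2) / (2 * 6.0 ** 2))
    for s in range(n_scales):
        for o in range(n_orient):
            freq, theta = 0.25 / (2 ** s), np.pi * o / n_orient
            rot = xx * np.cos(theta) + yy * np.sin(theta)
            kernel = envelope * np.exp(2j * np.pi * freq * rot)
            patch = image[y - 16:y + 17, x - 16:x + 17]  # assumes interior point
            jet.append(abs(np.sum(patch * kernel)))
    return np.asarray(jet)

def align_reference_point(template, probe, ref_pt, radius=5):
    """Find the probe-image point near ref_pt whose Gabor jet best matches
    the template jet at ref_pt (similarity = normalized dot product)."""
    ty, tx = ref_pt
    target = gabor_jet(template, ty, tx)
    best, best_sim = ref_pt, -1.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cand = gabor_jet(probe, ty + dy, tx + dx)
            sim = target @ cand / (np.linalg.norm(target) * np.linalg.norm(cand) + 1e-12)
            if sim > best_sim:
                best_sim, best = sim, (ty + dy, tx + dx)
    return best
```

Running `align_reference_point` for each of the 35 template reference points yields the aligned point set of the probe frame.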
For all the video sequences used in identification, the time- and space-synchronized video frames (two-dimensional matrices) of each person constitute that person's aligned 3D face data cube (a three-dimensional matrix). On this basis, many methods could be used for video sequence matching. However, as mentioned above, traditional methods (for example the nearest-frame or mutual subspace methods) cannot exploit the discriminant information in all of the video data.
A direct method is to treat the whole data cube as a single big feature vector and perform ordinary subspace analysis to extract features. Although this feature-level fusion uses all the data in the video, it has several problems. First, the data scale is huge: for example, if each video sequence uses 21 images of size 41 × 27, the feature dimensionality is 23247. Performing direct subspace analysis on such large vectors is very costly. Second, and more seriously, because the sample size is very small relative to the large feature dimensionality of the discriminant subspace analysis algorithm, the so-called over-fitting problem arises.
To overcome these problems, a multi-level subspace analysis algorithm is adopted according to a preferred embodiment of the present invention. That is, each frame of the face data cube of a video is treated as a feature slice, unified subspace analysis is then applied to each feature slice, and discriminant features are extracted from each slice. For details of this analysis method, see Reference 13: X. Wang and X. Tang, "Unified Subspace Analysis for Face Recognition", Proceedings of the IEEE International Conference on Computer Vision, 2003.
The discriminant feature vectors extracted from the slices are then combined to form a new feature vector, and PCA (principal component analysis) is applied to the new feature vector to eliminate the redundant information between the feature slices and extract the final feature vector. The multi-level subspace analysis method of the present invention is specified below.
In the present invention, the term "class" refers to an individual (person) in the training set or the reference image library.
In the first-level subspace analysis, for each feature slice:
1-1. Project each feature slice onto the PCA subspace determined from the training set of that slice; the dimensionality of the PCA subspace is selected from the results of repeated recognition tests, so as to remove most of the noise.
1-2. In the dimension-reduced PCA subspace, use the within-class scatter matrix to determine the intrapersonal subspace.
1-3. For each of the L classes in the gallery (the reference image library, i.e. the reference template library used for recognition), compute the mean of its training data to obtain the center of the training samples of that class. Project all class centers onto the intrapersonal subspace, then normalize the projections by the within-class eigenvalues to obtain whitened feature vectors.
1-4. Perform PCA on the space formed by the whitened feature-vector centers of all L classes above to obtain the discriminant feature vectors. (A minimal sketch of this first level is given below.)
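As an illustration of steps 1-1 to 1-4, the following sketch implements one plausible reading of the first-level unified subspace analysis for a single slice: PCA reduction, intrapersonal whitening, then PCA on the whitened class centers. The subspace dimensionalities and eigen-decomposition details are assumptions for illustration, not the patent's tuned settings.

```python
import numpy as np

def first_level_subspace(slices, labels, pca_dim=50, intra_dim=30):
    """One feature slice: rows of `slices` are per-sample feature vectors for
    this frame position; `labels` gives the class (person) of each row.

    Returns projections (P, V, lam, W) and the projected class centers."""
    X = slices - slices.mean(axis=0)                      # center the data

    # Step 1-1: PCA subspace of this slice (keep pca_dim components).
    U, S, _ = np.linalg.svd(X.T @ X)
    P = U[:, :pca_dim]
    Xp = X @ P

    # Step 1-2: within-class (intrapersonal) scatter in the PCA subspace.
    Sw = np.zeros((pca_dim, pca_dim))
    centers = {}
    for c in np.unique(labels):
        Xc = Xp[labels == c]
        centers[c] = Xc.mean(axis=0)
        D = Xc - centers[c]
        Sw += D.T @ D
    evals, evecs = np.linalg.eigh(Sw)
    V = evecs[:, -intra_dim:]                             # intrapersonal subspace
    lam = evals[-intra_dim:]

    # Step 1-3: project class centers and whiten by within-class eigenvalues.
    M = np.stack([centers[c] for c in sorted(centers)])   # L x pca_dim
    Mw = (M @ V) / np.sqrt(lam + 1e-12)                   # whitened centers

    # Step 1-4: PCA on the whitened class centers -> discriminant basis.
    Mc = Mw - Mw.mean(axis=0)
    Uc, _, _ = np.linalg.svd(Mc.T @ Mc)
    W = Uc[:, :min(len(centers) - 1, intra_dim)]
    return P, V, lam, W, Mw @ W
```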
In the second-level subspace analysis, the following operations are performed:
2-1. Concatenate the discriminant feature vectors extracted from the slices in order to form a new feature vector.
2-2. Apply PCA to the new feature vector to eliminate the redundant information across the frames, and choose the first few features with large eigenvalues to form the final feature vector for identification. (A sketch follows.)
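The second level reduces to ordinary PCA over the concatenated per-slice discriminant vectors. A minimal sketch, assuming the hypothetical `first_level_subspace` above has already produced a discriminant vector for each frame:

```python
import numpy as np

def second_level_subspace(per_frame_features, final_dim=30):
    """per_frame_features: list over frames, each an (n_samples, d_k) array of
    first-level discriminant vectors. Returns the final PCA projection and
    the final recognition features."""
    # Step 2-1: concatenate slice features sample-wise into one long vector.
    F = np.concatenate(per_frame_features, axis=1)        # n_samples x sum(d_k)
    Fc = F - F.mean(axis=0)

    # Step 2-2: PCA across frames; keep only the leading eigenvectors.
    U, S, _ = np.linalg.svd(Fc.T @ Fc)
    W2 = U[:, :final_dim]
    return W2, Fc @ W2
```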
In the above first-level subspace analysis, the dimensionalities of the PCA subspace and the intrapersonal subspace are selected as follows: choose a PCA subspace dimensionality and an intrapersonal subspace dimensionality, run a recognition test, and after many tests choose the PCA and intrapersonal subspace dimensionalities that give the best recognition result.
In the second-level subspace analysis, only PCA is used rather than unified subspace analysis. This is because the within-class variation has already been reduced by the first-level whitening step, and the discriminant features have been extracted in step 1-4 of the first-level subspace analysis; repeating unified subspace analysis would not add any new information. However, a large amount of overlapping information still exists between different slices: despite the expression changes, the frames remain very similar to each other, so PCA is needed to reduce the redundant information.
The multi-level subspace analysis of the present invention does not lose much information compared with existing subspace analysis. Specifically, since the whitening step removes only within-class variation information, which is not needed anyway, it need not be considered when analyzing the information loss of the algorithm; only the two PCA steps require attention. To perform the PCA processing, an n × m sampling matrix is first generated:
$$A = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_m^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_m^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(n)} & x_2^{(n)} & \cdots & x_m^{(n)} \end{bmatrix} \qquad (1)$$

where $x_i$ is the face-data-cube feature vector of a video, n is the length of the vector, and m is the number of training samples. Decomposing each length-n feature vector into g = n/k groups of short feature vectors of length k gives

$$A = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_g \end{bmatrix} \qquad (2)$$

PCA can be performed on each of the g groups of short feature vectors $B_i$. A new feature vector is then formed from the few leading features chosen from each group, and the final feature vector is computed by performing PCA on this new feature vector.
The case of two groups of short feature vectors is taken as an example below. The feature matrix and its covariance matrix are

$$A = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} \qquad (3)$$

$$W = AA^T = \begin{bmatrix} B_1B_1^T & B_1B_2^T \\ B_2B_1^T & B_2B_2^T \end{bmatrix} = \begin{bmatrix} W_1 & W_{12} \\ W_{21} & W_2 \end{bmatrix} \qquad (4)$$

If the eigenvector matrices of the covariance matrices $W_1$ and $W_2$ are $T_1$ and $T_2$ respectively, then

$$T_1^T W_1 T_1 = \Lambda_1 \qquad (5)$$

$$T_2^T W_2 T_2 = \Lambda_2 \qquad (6)$$

where $\Lambda_1$ and $\Lambda_2$ are diagonal eigenvalue matrices. The effective rotation matrix of the first-level grouped PCA over $(B_1, B_2, \ldots, B_g)$ is

$$T = \begin{bmatrix} T_1 & 0 \\ 0 & T_2 \end{bmatrix} \qquad (7)$$

T is also an orthogonal matrix, since

$$T^T T = \begin{bmatrix} T_1^T T_1 & 0 \\ 0 & T_2^T T_2 \end{bmatrix} = I \qquad (8)$$

Therefore, after the first-level PCA over the groups $(B_1, B_2, \ldots, B_g)$, owing to the orthogonality of the rotation matrix T, the covariance matrix of the rotated feature vector is

$$W_r = T^T W T = \begin{bmatrix} \Lambda_1 & T_1^T W_{12} T_2 \\ T_2^T W_{21} T_1 & \Lambda_2 \end{bmatrix} = \begin{bmatrix} \begin{matrix} \Lambda_{1b} & 0 \\ 0 & \Lambda_{1s} \end{matrix} & \begin{matrix} C_{bb} & C_{bs} \\ C_{sb} & C_{ss} \end{matrix}^{T} \\ \begin{matrix} C_{bb} & C_{bs} \\ C_{sb} & C_{ss} \end{matrix} & \begin{matrix} \Lambda_{2b} & 0 \\ 0 & \Lambda_{2s} \end{matrix} \end{bmatrix} \qquad (9)$$
which is a similar matrix of the original feature-vector covariance matrix W. Since similar matrices have the same eigenvalues, the rightmost form of equation (9) can be used to discuss the effect on W of keeping only the first few dominant eigenvalues in each group.

In equation (9), for n = 1 or 2, $\Lambda_{nb}$ and $\Lambda_{ns}$ denote the dominant-eigenvalue section and the small-eigenvalue section of the eigenvalue matrix $\Lambda_n$, respectively, and $C_{xx}$ (where x = b or s) denotes the cross-covariance matrices of the two groups of rotated features. By keeping only the dominant eigenvalues in the second-level PCA, the new feature-vector covariance matrix becomes

$$W_d = \begin{bmatrix} \Lambda_{1b} & C_{bb}^T \\ C_{bb} & \Lambda_{2b} \end{bmatrix} \qquad (10)$$

The terms eliminated from $W_r$ are $\Lambda_{1s}$, $\Lambda_{2s}$, $C_{ss}$, $C_{bs}$ and $C_{sb}$. Since the main energy is contained in the dominant eigenvalues, the information loss from $\Lambda_{1s}$ and $\Lambda_{2s}$ is very small, so the energy $C_{ss}$ contained in the cross-covariance matrix of the two low-energy feature vectors should be even smaller.

It can be shown that neither $C_{bs}$ nor $C_{sb}$ can be large. If the two feature groups $B_1$ and $B_2$ are mutually uncorrelated, all the cross-covariance matrices $C_{xx}$ in equation (9) are small. On the other hand, if the two feature groups are highly correlated, their dominant eigen-features are very similar; therefore the cross-covariance matrix $C_{bs}$ between the large features of the second group and the small features of the first group closely resembles the cross-covariance matrix between the large features and the small features of the first group, which is zero owing to the decorrelation property of PCA.

When the two feature groups $B_1$ and $B_2$ are partially correlated, the correlated part should be the main signal, since the noise components of $B_1$ and $B_2$ are hardly correlated with each other. A key property of PCA is that nearly all of the signal energy is retained in the first few large eigenvalues. Therefore most of the signal energy in $B_2$, in particular the portion correlated with $B_1$, is retained in the large-eigenvalue section of the covariance matrix of $B_2$, while the energy discarded with the small-eigenvalue section of $B_2$ contains almost no energy correlated with $B_1$. Hence $C_{bs}$ and $C_{sb}$ should be very small, and removing them from the covariance matrix $W_r$ does not lose much information.

From the above analysis, the covariance matrix $W_d$ is an approximation of $W_r$, and $W_r$ is a similar matrix of W. Therefore the eigenvalues of $W_d$ obtained by the multi-level subspace method are in fact approximations of the eigenvalues computed from W by the standard PCA method. A numerical illustration of this approximation is sketched below.
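The claim that grouped two-level PCA approximates standard PCA can be checked numerically. The sketch below builds correlated random features, then compares the leading eigenvalues of $W_d$ (grouped, truncated as in equation (10)) with those of the full covariance W. The sizes and the shared-signal construction are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 40, 200, 20                        # vector length, samples, group size
A = rng.standard_normal((n, m))
A += 0.5 * rng.standard_normal((1, m))       # shared signal -> correlated groups

W = A @ A.T                                  # full covariance (eqs. 1, 4)
full_eigs = np.sort(np.linalg.eigvalsh(W))[::-1]

B1, B2 = A[:k], A[k:]                        # two groups (eq. 3)
keep = 10                                    # dominant eigenvectors kept per group
T1 = np.linalg.eigh(B1 @ B1.T)[1][:, ::-1][:, :keep]
T2 = np.linalg.eigh(B2 @ B2.T)[1][:, ::-1][:, :keep]

# Second-level covariance of the concatenated dominant projections (eq. 10).
Z = np.vstack([T1.T @ B1, T2.T @ B2])
Wd = Z @ Z.T
grouped_eigs = np.sort(np.linalg.eigvalsh(Wd))[::-1]

print(full_eigs[:5])                         # leading eigenvalues of W
print(grouped_eigs[:5])                      # close approximations from W_d
```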
According to another embodiment of the invention, part of the subspace analysis in the above multi-level subspace analysis method can be replaced by a multi-classifier fusion technique. That is, in the first-level analysis, each individual video frame is still processed with unified subspace analysis; fusion rules are then used to integrate all the frame-based classifiers and determine the final classification. The detailed method is presented below.
The first-level subspace analysis is identical to steps 1-1 to 1-4 of the multi-level subspace analysis described above, and is not repeated here.
In the second-level processing, the following steps are performed:
2-1'. In frame-based classifiers, identify each frame using the discriminant feature vectors obtained in step 1-4.
2-2'. Use fusion rules to combine the recognition results of the frame-based classifiers and obtain the final recognition result.
There are many methods for fusing multiple classifiers, and any of them can be used to realize the above process of the present invention. Two examples of fusing the frame-based classifiers with simple fusion rules, namely the majority voting rule and the sum rule, are given below.
Majority voting
Each frame-based classifier $C_k(x)$ assigns a class label $C_k(x) = i$ to the input face data x. This event can be expressed as a binary function

$$T_k(x \in X_i) = \begin{cases} 1, & \text{if } C_k(x) = i \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$
With majority voting, the final class is chosen as

$$\beta(x) = \arg\max_{X_i} \sum_{k=1}^{K} T_k(x \in X_i) \qquad (12)$$
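A minimal sketch of the majority vote of equation (12), assuming each frame-based classifier has already output a class label:

```python
import numpy as np

def majority_vote(frame_labels):
    """frame_labels: length-K sequence of class labels, one per frame
    classifier. Returns the label receiving the most votes (eq. 12)."""
    labels, counts = np.unique(np.asarray(frame_labels), return_counts=True)
    return labels[np.argmax(counts)]

# e.g. seven frame classifiers voting over person IDs:
print(majority_vote([3, 3, 7, 3, 12, 3, 7]))   # -> 3
```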
Sum rule
Let $P(X_i \mid C_k(x))$ denote the probability, as measured by the frame-based classifier $C_k(x)$, that x belongs to $X_i$. According to the sum rule, the class used for the final decision is selected as

$$\beta(x) = \arg\max_{X_i} \sum_{k=1}^{K} P(X_i \mid C_k(x)) \qquad (13)$$

$P(X_i \mid C_k(x))$ can be estimated from the output of the frame-based classifier. For the frame-based classifier $C_k(x)$, the center $m_i$ of class $X_i$ and the input face data x are projected onto the discriminant vector $W_k$:

$$w_k^i = W_k^T m_i \qquad (14)$$

$$w_k^x = W_k^T x \qquad (15)$$

$P(X_i \mid C_k(x))$ is then estimated as

$$\hat{P}(X_i \mid C_k(x)) = \frac{1}{2}\left(1 + \frac{(w_k^x)^T w_k^i}{\|w_k^x\| \cdot \|w_k^i\|}\right) \qquad (16)$$

whose value is normalized to [0, 1].
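A sketch of the sum rule of equations (13)-(16): per-frame normalized-correlation scores between the projected input and each projected class center are summed across frames, and the best-scoring class wins. The array shapes are assumptions for illustration.

```python
import numpy as np

def sum_rule(frame_projections, class_centers):
    """frame_projections: list over K frames of projected inputs w_k^x (eq. 15).
    class_centers: list over K frames of (L, d) arrays of projected class
    centers w_k^i (eq. 14). Returns the class index maximizing eq. (13)."""
    K = len(frame_projections)
    L = class_centers[0].shape[0]
    scores = np.zeros(L)
    for k in range(K):
        wx = frame_projections[k]
        for i in range(L):
            wi = class_centers[k][i]
            cos = wx @ wi / (np.linalg.norm(wx) * np.linalg.norm(wi) + 1e-12)
            scores[i] += (1.0 + cos) / 2.0     # eq. (16), normalized to [0, 1]
    return int(np.argmax(scores))
```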
The present invention was tested on XM2VTS, the largest standard video face database.
From XM2VTS, 294 different people were chosen, with 294 × 4 video sequences from the four sessions described above. The 294 × 3 video sequences of the first three sessions were selected as training data. The gallery set is composed of the 294 video sequences of the first session, and the probe set of video sequences to be identified is composed of the 294 video sequences of the fourth session. The people in the videos were required to read two digit sequences, "0123456789" and "5069281374".
For each video, 21 frames were selected by each of two strategies: audio-video time synchronization, and random selection without audio information. This gives two different sets of face image sequences, labeled the A-V synchronized data and the A-V unsynchronized data respectively. For the A-V synchronized data, each frame corresponds to the waveform crest of a spoken digit; the remaining frame is aligned at the midpoint between the end of the first sentence and the beginning of the second. The number of frames may differ in different experiments.
First, the recognition results of appearance-based methods that use gray-level image values directly as features were examined. The results for still images and video sequences are summarized in Table 1. The still image is either the first frame of the video sequence (A-V synchronized case) or a randomly chosen frame (A-V unsynchronized case). Direct classification of still images by Euclidean distance performs very poorly (61%). This result in fact reflects the difficulty of this database: for face recognition, results are usually poor when the test image and the gallery image come from different sessions. Using video data with the same Euclidean distance gives a significant improvement (78.3%). After applying the multi-level subspace analysis algorithm and the multi-classifier algorithm of the present invention, the video recognition rate further increases to above 98%. This clearly shows that video sequences indeed contain a large amount of extra information.
The two columns of Table 1 compare time-synchronized and unsynchronized results. The A-V time synchronization method gives a marked improvement in recognition accuracy over all the other classification methods. Note that although multi-level subspace analysis improves the video classification rate by only 1.7%, this corresponds to a reduction of the recognition error rate by more than 45%, which is significant.
Table 1. Comparison of recognition results using gray-level appearance features
Table 2 summarizes the results using local wavelet features. As expected, all results are further improved, further confirming the findings of Table 1 across the different methods. Note that the final recognition accuracy of this experiment using all three algorithms (time synchronization, spatial synchronization, and multi-level subspace analysis or multi-classifier fusion) is 99%. Considering that this is cross-session identification, this accuracy is very high.
Table 2. Comparison of recognition results using local wavelet features
Finally, Table 3 compares the video recognition method of the present invention with existing video-based face recognition methods: the nearest-frame method and the mutual subspace method. Note that the results of the existing methods in Table 3 are computed from the A-V time-synchronized video sequences, and the nearest-frame methods also use the unified subspace analysis method, so they perform better than the original methods. Table 3 clearly shows that the method of the present invention achieves an obvious improvement: its error rate is only 5% to 10% of that of the classical methods.
Table 3. Comparison with recognition results of existing video-based methods

Video-based method                                                        Recognition accuracy (%)
Mutual subspace method                                                    79.3
Nearest-frame method using Euclidean distance                             81.7
Nearest-frame method using LDA                                            90.9
Nearest-frame method using unified subspace analysis                      93.2
Multi-level subspace method of the invention (gray features)              98.0
Sum-rule multi-classifier method of the invention (gray features)         98.0
Voting-rule multi-classifier method of the invention (gray features)      98.6
Multi-level subspace method of the invention (wavelet features)           99.0
Sum-rule multi-classifier method of the invention (wavelet features)      99.0
Voting-rule multi-classifier method of the invention (wavelet features)   98.6
An audio-assisted video face recognition method has been described above. The method makes full use of all the spatio-temporal information in a video sequence. To overcome the processing-speed and data-scale problems, spatial and temporal frame synchronization algorithms, a multi-level subspace analysis algorithm, and a multi-classifier fusion algorithm have been developed. Experiments on the largest available facial video database show that all of these techniques are effective in improving recognition performance: the new algorithms obtain nearly perfect recognition results, a marked improvement over existing methods based on still images and on video. Moreover, the multi-classifier fusion technique of the present invention can also be used to integrate the appearance-based video classification with the wavelet-based video classification method, which can further improve the recognition accuracy.

Claims (7)

1. A face recognition method based on video, comprising:
determining a plurality of corresponding similar video frames in a video sequence to be identified and in a video sequence of a reference image library;
aligning reference points of the corresponding similar video frames in said video sequence to be identified and in the video sequence of the reference image library;
constructing a face data cube of the person to be identified from the plurality of reference-point-aligned video frames of said video sequence to be identified; and
performing subspace analysis on said face data cube to extract the facial features of the person to be identified, and comparing them with facial feature vectors in said reference image library,
wherein said performing subspace analysis on the face data cube comprises:
extracting discriminant feature vectors from the feature slices formed by each frame of said face data cube;
identifying each frame with said discriminant feature vectors in frame-based classifiers; and
fusing the results of said classifiers using a fusion rule to identify the video sequence to be identified.
2. The method according to claim 1, characterized in that said fusion rule comprises: a majority voting rule or a sum rule.
3. The method according to any one of claims 1-2, characterized in that the step of determining a plurality of corresponding similar video frames in the video sequence to be identified and in the video sequence of the reference image library comprises: using the waveform of an audio signal produced by predetermined sounds contained in said video sequences to select the plurality of corresponding similar frames in the video sequence to be identified and in the video sequence of the reference image library.
4. The method according to claim 3, characterized in that said video frames are chosen using, as a reference selected from the waveform of said audio signal, one of the following parameters: the peak of the audio waveform, the trough of the audio waveform, or the center point of the audio region of each word.
5. The method according to claim 4, characterized in that it further comprises verifying the content of the sounds uttered by the person to be identified during identification, and/or
identifying the tonal characteristics of the person to be identified.
6. A face recognition method based on video, comprising:
extracting discriminant feature vectors from the feature slices formed by each frame of a video sequence to be identified;
identifying each frame with said discriminant feature vectors in frame-based classifiers; and
fusing the results of said classifiers using a fusion rule to identify the video sequence to be identified,
wherein said step of extracting discriminant feature vectors from a feature slice comprises:
projecting each said feature slice onto a principal component analysis (PCA) subspace determined from the training set of that feature slice;
determining an intrapersonal (within-class) subspace from said PCA subspace;
determining the centers of the training data classes of the individuals in the reference image library, and projecting all class centers onto said intrapersonal subspace;
normalizing the projections using the within-class eigenvalues of said intrapersonal subspace to determine whitened feature vectors; and
performing principal component analysis on the space formed by the whitened feature-vector centers of all said classes to determine the discriminant feature vectors.
7. The method according to claim 6, characterized in that said fusion rule comprises: a majority voting rule or a sum rule.
CN2005100709199A 2004-05-17 2005-05-17 Face recognition method based on video frequency Expired - Fee Related CN1866270B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US57150804P 2004-05-17 2004-05-17
US60/571,508 2004-05-17

Publications (2)

Publication Number Publication Date
CN1866270A CN1866270A (en) 2006-11-22
CN1866270B true CN1866270B (en) 2010-09-08

Family

ID=37425288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2005100709199A Expired - Fee Related CN1866270B (en) 2004-05-17 2005-05-17 Face recognition method based on video frequency

Country Status (2)

Country Link
CN (1) CN1866270B (en)
HK (1) HK1095187A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8996879B2 (en) * 2010-12-23 2015-03-31 Intel Corporation User identity attestation in mobile commerce
CN103329146B (en) * 2011-01-25 2018-02-06 吉尔博斯有限责任公司 Identify the characteristic of individual using face recognition and provide a display for the individual
JP5791364B2 (en) * 2011-05-16 2015-10-07 キヤノン株式会社 Face recognition device, face recognition method, face recognition program, and recording medium recording the program
CN103093273B (en) * 2012-12-30 2016-05-11 信帧电子技术(北京)有限公司 Based on the video demographic method of body weight for humans identification
CN105025193B (en) * 2014-04-29 2020-02-07 钰立微电子股份有限公司 Portable stereo scanner and method for generating stereo scanning result of corresponding object
CN105282375B (en) * 2014-07-24 2019-12-31 钰立微电子股份有限公司 Attached stereo scanning module
CN104239858B (en) * 2014-09-05 2017-06-09 华为技术有限公司 A kind of method and apparatus of face characteristic checking
CN107004115B (en) * 2014-12-03 2019-02-15 北京市商汤科技开发有限公司 Method and system for recognition of face
TWI667621B (en) * 2018-04-09 2019-08-01 和碩聯合科技股份有限公司 Face recognition method
CN111382648A (en) * 2018-12-30 2020-07-07 广州市百果园信息技术有限公司 Method, device and equipment for detecting dynamic facial expression and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6594382B1 (en) * 1999-11-04 2003-07-15 The United States Of America As Represented By The Secretary Of The Navy Neural sensors
CN1403997A (en) * 2001-09-07 2003-03-19 昆明利普机器视觉工程有限公司 Automatic face-recognizing digital video system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaoou Tang and Zhifeng Li. Frame Synchronization and Multi-Level Subspace Analysis for Video Based Face Recognition. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. *

Also Published As

Publication number Publication date
CN1866270A (en) 2006-11-22
HK1095187A1 (en) 2007-04-27

Similar Documents

Publication Publication Date Title
CN1866270B (en) Face recognition method based on video frequency
CN100426314C (en) Feature classification based multiple classifiers combined people face recognition method
Ruiz-del-Solar et al. Recognition of faces in unconstrained environments: A comparative study
Craw et al. Face recognition by computer
KR100543707B1 (en) Face recognition method and apparatus using PCA learning per subgroup
Christlein et al. Writer identification and verification using GMM supervectors
US20070296863A1 (en) Method, medium, and system processing video data
Khan et al. Multi-shot person re-identification using part appearance mixture
Mondal et al. Secure and hassle-free EVM through deep learning based face recognition
Lu et al. Automatic gender recognition based on pixel-pattern-based texture feature
Pranoto et al. Real-time triplet loss embedding face recognition for authentication student attendance records system framework
CN100356387C (en) Face recognition method based on random sampling
Ioannidis et al. Key-frame extraction using weighted multi-view convex mixture models and spectral clustering
Yuan et al. Holistic learning-based high-order feature descriptor for smoke recognition
Duta et al. Learning the human face concept in black and white images
Sharma et al. Face photo-sketch synthesis and recognition
Kaur Review of face recognition system using MATLAB
Raghavendra et al. Multimodal person verification system using face and speech
Intan Combining of feature extraction for real-time facial authentication system
Li et al. Aging face verification in score-age space using single reference image template
Yuan et al. Face identification by a cascade of rejection classifiers
Kittler et al. Face authentication using client specific fisherfaces
Drygajlo et al. Adult face recognition in score-age-quality classification space
Zhao et al. Co-lda: A semi-supervised approach to audio-visual person recognition
Fedias et al. A New approach based in mean and standard deviation for authentication system of face

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1095187

Country of ref document: HK

C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: GR

Ref document number: 1095187

Country of ref document: HK

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100908

Termination date: 20170517
