CN107274905A - Voiceprint recognition method and system - Google Patents

Voiceprint recognition method and system

Info

Publication number
CN107274905A
Authority
CN
China
Prior art keywords
vector
training
matrixes
gmm
feature
Prior art date
Legal status
Granted
Application number
CN201610218436.7A
Other languages
Chinese (zh)
Other versions
CN107274905B (en)
Inventor
金星明
李为
郑昉劢
吴富章
朱碧磊
钱柄桦
李科
吴永坚
黄飞跃
Current Assignee
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201610218436.7A
Publication of CN107274905A
Application granted
Publication of CN107274905B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/12: Score normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the present invention disclose a voiceprint recognition method, including: separately training N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, where N is a natural number greater than 1; extracting N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices; computing a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and fusing the N corresponding scores to obtain a target score and making a decision according to the target score. With the present invention, given massive training data, the performance bottleneck of single-I-Vector-framework voiceprint recognition in the prior art can be broken through. Tests indicate that two or more I-Vector frameworks, each trained with sufficient data, can improve overall system performance by roughly 20% to 30% relative to a single-I-Vector-framework system.

Description

Voiceprint recognition method and system
Technical field
The present invention relates to the computer field, and in particular to a voiceprint recognition method and system.
Background
Through years of accumulation and evolution, voiceprint recognition (especially the text-independent field) has produced techniques such as the Gaussian Mixture Model-Universal Background Model (GMM-UBM), the Gaussian Super Vector (GSV), Joint Factor Analysis (JFA), and the identity factor (Identity Vector, I-Vector). To this day, the I-Vector method dominates the development of the whole voiceprint recognition field. The I-Vector method, also called the identity factor method, does not attempt to force the speaker space and the channel space apart; instead it directly defines a single Total Variability Space that contains all possible information in the speech data. The load factors of the total variability space are then obtained by factor analysis, and these load factors are the I-Vector, whose dimensionality is far lower than that of the Gaussian supervector. In this factor space, a simple way to discriminate between speakers is to enlarge the distance between different speakers while shrinking the distance between the noise-affected utterances of the same speaker. This is exactly the goal of Linear Discriminant Analysis (LDA): the differences between speakers are treated as the between-class matrix, the differences caused by noise are treated as the within-class matrix, and probabilistic LDA is then applied to estimate the LDA matrix of the I-Vectors; the vectors projected through this LDA matrix are the information vectors that reflect speaker identity.
Although the technology based on a single I-Vector matrix has achieved a considerable performance breakthrough relative to techniques such as JFA, system performance still hits a bottleneck as the amount of training data increases, so there remains large room for performance improvement.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a voiceprint recognition method and a voiceprint recognition system that can break through the performance bottleneck of single-I-Vector-framework voiceprint recognition in the prior art.
To solve the above technical problem, a first aspect of the embodiments of the present invention discloses a voiceprint recognition method, including:
separately training N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, where N is a natural number greater than 1;
extracting N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
computing a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and
fusing the N corresponding scores to obtain a target score, and making a decision according to the target score.
With reference to the first aspect, in a first possible implementation, separately training the N identity-factor I-Vector matrices to obtain the N I-Vector matrices includes:
separately training the N I-Vector matrices with N sets of training data to obtain the N I-Vector matrices, where the N sets of training data are mutually independent.
With reference to the first aspect, in a second possible implementation, fusing the N corresponding scores to obtain the target score includes:
averaging the N corresponding scores as the target score; or
taking the maximum of the N corresponding scores as the target score; or
taking the minimum of the N corresponding scores as the target score.
With reference to the first possible implementation of the first aspect, in a third possible implementation, separately training the N I-Vector matrices with the N sets of training data includes:
randomly selecting N sets of speech data of a first duration as training data for the background models;
performing feature extraction and feature normalization on each set of speech data of the first duration, and separately training N GMM-UBM models with the extracted features;
randomly selecting N sets of speech data of a second duration as training data for the I-Vector matrices;
performing feature extraction and feature normalization on each set of speech data of the second duration, and with the extracted features extracting N Gaussian supervectors (GSVs) through the N trained GMM-UBM models; and
separately training the N I-Vector matrices with the N GSVs.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, when the N GMM-UBM models are trained separately, M of the N sets of GMM-UBM model parameters differ from one another, where M is a natural number greater than 1 and less than or equal to N.
A second aspect of the embodiments of the present invention discloses a voiceprint recognition system, including:
a matrix training module, configured to separately train N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, where N is a natural number greater than 1;
a vector extraction module, configured to extract N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
a computing module, configured to compute a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and
a fusion and decision module, configured to fuse the N corresponding scores to obtain a target score and make a decision according to the target score.
With reference to the second aspect, in a first possible implementation, the matrix training module is specifically configured to separately train the N I-Vector matrices with N sets of training data to obtain the N I-Vector matrices, where the N sets of training data are mutually independent.
With reference to the second aspect, in a second possible implementation, the fusion and decision module includes:
a first fusion unit, configured to average the N corresponding scores as the target score; or
a second fusion unit, configured to take the maximum of the N corresponding scores as the target score; or
a third fusion unit, configured to take the minimum of the N corresponding scores as the target score.
With reference to the first possible implementation of the second aspect, in a third possible implementation, the matrix training module includes:
a first selection unit, configured to randomly select N sets of speech data of a first duration as training data for the background models;
a model training unit, configured to perform feature extraction and feature normalization on each set of speech data of the first duration, and separately train N GMM-UBM models with the extracted features;
a second selection unit, configured to randomly select N sets of speech data of a second duration as training data for the I-Vector matrices;
a GSV extraction unit, configured to perform feature extraction and feature normalization on each set of speech data of the second duration, and with the extracted features extract N Gaussian supervectors (GSVs) through the N trained GMM-UBM models; and
an I-Vector matrix training unit, configured to separately train the N I-Vector matrices with the N GSVs.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, when the model training unit trains the N GMM-UBM models separately, M of the N sets of GMM-UBM model parameters differ from one another, where M is a natural number greater than 1 and less than or equal to N.
A third aspect of the embodiments of the present invention discloses a computer storage medium. The computer storage medium stores a program which, when executed, performs all the steps of the voiceprint recognition method of the first aspect of the embodiments of the present invention, or of the first, second, third, or fourth possible implementation of the first aspect.
By implementing the embodiments of the present invention, N I-Vector matrices are separately trained; N corresponding I-Vector vectors are extracted from the test sample according to the N I-Vector matrices; scores are then computed separately to obtain N corresponding scores; finally the N corresponding scores are fused to obtain a target score, and a decision is made according to the target score. This makes it possible, given massive training data, to break through the performance bottleneck of single-I-Vector-framework voiceprint recognition in the prior art. Tests indicate that two or more I-Vector frameworks, each trained with sufficient data, can improve overall system performance by roughly 20% to 30% relative to a single-I-Vector-framework system.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required by the embodiments or the prior-art description are briefly introduced below. Apparently, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the recognition framework of the I-Vector method provided by the present invention;
Fig. 2 is a schematic flowchart of the voiceprint recognition method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the principle framework of the voiceprint recognition method provided by the present invention;
Fig. 4 is a schematic flowchart of the I-Vector matrix training method provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the voiceprint recognition system provided by the present invention;
Fig. 6 is a schematic structural diagram of the fusion and decision module provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the matrix training module provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of another embodiment of the voiceprint recognition system provided by the present invention.
Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The voiceprint recognition method provided by the embodiments of the present invention is a voiceprint recognition technique based on the I-Vector method. The I-Vector technique is a cross-channel algorithm based on a single space: it does not distinguish speaker-space information from channel-space information. Any single utterance can be decomposed into a background-model part and a part reflecting the individual speaker, and its Gaussian supervector (GSV) can be expressed as:
M_s = m_0 + T w_s
where M_s is the C*F-dimensional Gaussian supervector GSV; m_0 is the speaker-independent and channel-independent C*F-dimensional supervector, formed by concatenating the UBM mean vectors; w_s is the total variability factor, i.e. the I-Vector, of dimension N, a random vector following the standard normal distribution; and T is the total variability space matrix, of dimension CF*N.
In the training stage, the total variability space matrix T is estimated from a large development data set using a factor-analysis algorithm. Once the total variability space is obtained, the high-dimensional GSV is projected into the total variability subspace represented by the matrix T, finally giving the low-dimensional total variability factor (I-Vector).
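The relation M_s = m_0 + T w_s can be illustrated numerically. The sketch below is not from the patent: it uses toy dimensions and a plain least-squares point estimate of w_s, whereas practical systems compute a posterior estimate from Baum-Welch statistics with a standard-normal prior.

```python
import numpy as np

rng = np.random.default_rng(0)

C, F, D = 8, 4, 5             # Gaussians, feature dim, I-Vector dim (toy sizes)
CF = C * F

m0 = rng.normal(size=CF)      # UBM mean supervector (speaker/channel independent)
T = rng.normal(size=(CF, D))  # total variability matrix (trained offline by EM)

# A speaker's Gaussian supervector M_s = m0 + T @ w_s, plus small noise here
w_true = rng.normal(size=D)
Ms = m0 + T @ w_true + 0.01 * rng.normal(size=CF)

# Point estimate of the I-Vector: project the centered supervector onto the
# column space of T via least squares (the real estimator also weights by
# zero-order statistics and adds a prior term, omitted for brevity).
w_hat, *_ = np.linalg.lstsq(T, Ms - m0, rcond=None)

print(w_hat.shape)            # low-dimensional I-Vector, far smaller than CF
```

The point of the projection is visible in the shapes: a 32-dimensional supervector collapses to a 5-dimensional identity factor.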
Specifically, with reference to the recognition-framework schematic of the I-Vector method provided by the present invention shown in Fig. 1, the training stage mainly includes the training of three models: the UBM background model, the total variability matrix T, and the PLDA model.
1. UBM background model training: a high-order GMM is trained with sufficient balanced speech from hundreds of people (channel-balanced, gender-balanced) to describe the speaker-independent feature distribution.
2. Total variability space T (also called the I-Vector matrix) training: the total variability space matrix T is estimated from a large development data set using factor analysis and the Expectation Maximization (EM) algorithm.
3. Channel-compensation (Probabilistic Linear Discriminant Analysis, PLDA) model training: according to the total variability space T and the UBM, the total variability factors (i-vectors) of the training speech are extracted and grouped by speaker, and the parameters of the PLDA model are estimated using factor analysis and the EM algorithm.
Finally, in the test stage: the total variability factor (i-vector) is extracted according to the UBM model and the total variability matrix T; the i-vectors of the test data and the target speaker are fed into PLDA for scoring, and a decision is made.
The voiceprint recognition method of the embodiment of the present invention is based on the I-Vector method. Referring to Fig. 2, a schematic flowchart of the voiceprint recognition method provided by an embodiment of the present invention, the method includes:
Step S200: separately train N identity-factor (I-Vector) matrices to obtain N I-Vector matrices.
Specifically, N in each embodiment of the present invention is a natural number greater than 1. By training multiple I-Vector matrices concurrently and independently, multiple mutually disjoint I-Vector matrices can be obtained.
Step S202: extract N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices.
Step S204: compute a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores.
Step S206: fuse the N corresponding scores to obtain a target score, and make a decision according to the target score.
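The fusion stage of the steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the three subsystem scores are hypothetical numbers, and the 0.5 decision threshold is an assumption for the example only.

```python
import numpy as np

def fuse_scores(scores, mode="mean"):
    """Fuse the N per-subsystem scores into one target score (step S206)."""
    scores = np.asarray(scores, dtype=float)
    if mode == "mean":
        return scores.mean()
    if mode == "max":
        return scores.max()
    if mode == "min":
        return scores.min()
    raise ValueError(mode)

# N = 3 hypothetical subsystem scores for one test trial (steps S202/S204)
scores = [0.62, 0.55, 0.70]
target = fuse_scores(scores, "mean")
accept = target > 0.5          # decision threshold is an assumption
print(round(float(target), 4), bool(accept))
```

Any of the three fusion modes described later (mean, max, min) plugs into the same decision step.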
Specifically, a detailed description is given with reference to the principle-framework schematic of the voiceprint recognition method provided by the present invention shown in Fig. 3. In step S200, the N I-Vector matrices may be separately trained with N sets of training data to obtain the N I-Vector matrices; the N sets of training data are mutually independent, i.e. the data sets are guaranteed to be mutually disjoint.
Step S200 may specifically refer to the schematic flowchart of the I-Vector matrix training method provided by an embodiment of the present invention shown in Fig. 4, including:
Step S400: randomly select N sets of speech data of a first duration as training data for the background models.
Specifically, the first duration may be, for example, a duration of 50 hours or 60 hours; the embodiment of the present invention is not limited in this respect.
Step S402: perform feature extraction and feature normalization on each set of speech data of the first duration, and separately train N GMM-UBM models with the extracted features.
Specifically, for one set of speech data, the speech samples may first be converted into Pulse Code Modulation (PCM) files with a sample rate of 8 kHz and a bit depth of 16 bits; Mel Frequency Cepstrum Coefficient (MFCC) features are then extracted, and their first-order and second-order statistics are concatenated as the feature of the sample; energy detection, Voice Activity Detection (VAD), and normalization are then applied to the extracted MFCC feature series. With the extracted features, a GMM-UBM model is trained, generally with 512 components or more. In this way, N GMM-UBM models can be trained separately for the N data sets.
It should be noted that when the N GMM-UBM models are trained separately, M of the N sets of GMM-UBM model parameters differ from one another, where M is a natural number greater than 1 and less than or equal to N.
Step S404: randomly select N sets of speech data of a second duration as training data for the I-Vector matrices.
Specifically, the second duration may be, for example, a duration of 100 hours or 120 hours; the embodiment of the present invention is not limited in this respect. More preferably, the second duration is longer than 100 hours.
Step S406: perform feature extraction and feature normalization on each set of speech data of the second duration, and with the extracted features extract N Gaussian supervectors (GSVs) through the N trained GMM-UBM models.
Specifically, for one set of speech data, the feature extraction and feature normalization may follow the process in step S402 and are not repeated here. With the extracted features, the GMM-UBM model trained in step S402 is used to extract the Gaussian supervector (i.e. the very-high-dimensional vector formed by concatenating the mean of each Gaussian component).
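Concatenating adapted Gaussian means into a supervector can be sketched as follows. This is a minimal illustration under the usual MAP mean-adaptation scheme for GMM-UBM systems; the relevance factor of 16 and the random statistics are assumptions for the example, not values from the patent.

```python
import numpy as np

def gaussian_supervector(ubm_means, frame_posteriors, frames, relevance=16.0):
    """Build a GSV by MAP-adapting UBM means and concatenating them.

    ubm_means:        (C, F) UBM component means
    frame_posteriors: (T, C) per-frame component responsibilities
    frames:           (T, F) acoustic feature vectors
    relevance:        MAP relevance factor (a common default, assumed here)
    """
    n = frame_posteriors.sum(axis=0)                  # zero-order stats, (C,)
    f = frame_posteriors.T @ frames                   # first-order stats, (C, F)
    alpha = (n / (n + relevance))[:, None]            # per-component adaptation weight
    adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * ubm_means
    return adapted.reshape(-1)                        # concatenate means: (C*F,)

rng = np.random.default_rng(2)
C, F, n_frames = 4, 3, 50
means = rng.normal(size=(C, F))
post = rng.dirichlet(np.ones(C), size=n_frames)      # stand-in responsibilities
X = rng.normal(size=(n_frames, F))                   # stand-in features
gsv = gaussian_supervector(means, post, X)
print(gsv.shape)
```

With C = 512 components and F-dimensional MFCC features, the resulting C*F vector is exactly the "very-high-dimensional" supervector the text describes.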
Step S408: separately train the N I-Vector matrices with the N GSVs.
Next, step S202, extracting the N corresponding I-Vector vectors from the test sample according to the N I-Vector matrices, may specifically include: first performing feature extraction and feature normalization on the speech data (the process may again refer to step S402 and is not repeated here); then, with the extracted features, extracting the I-Vector vector corresponding to each sample based on the previously trained GMM-UBM models and I-Vector matrices.
There can be multiple fusion methods for fusing the N corresponding scores in the final step S206, including: averaging the N corresponding scores as the target score; or taking the maximum of the N corresponding scores as the target score; or taking the minimum of the N corresponding scores as the target score; and so on.
Specifically, when making a decision according to the target score, the cosine distance between two I-Vector vectors may be computed, and whether the two samples belong to the same person is judged according to that distance. (In the general case, after the I-Vector vectors are obtained, they are further reduced in dimensionality by methods such as PLDA; the reduced vectors often have stronger representation ability. This is not elaborated here.)
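Cosine scoring between two I-Vectors, as described above, can be sketched as follows; the enrollment and test vectors are hypothetical toy values, not extracted from speech:

```python
import numpy as np

def cosine_score(w_test, w_target):
    """Cosine similarity between two I-Vectors; higher means more likely
    the same speaker."""
    return float(w_test @ w_target /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_target)))

w_enroll = np.array([1.0, 0.5, -0.2])
w_same = np.array([0.9, 0.6, -0.1])    # hypothetical same-speaker I-Vector
w_diff = np.array([-0.5, 1.0, 0.8])    # hypothetical different-speaker I-Vector

print(cosine_score(w_enroll, w_same) > cosine_score(w_enroll, w_diff))
```

In the N-matrix scheme, this score would be computed once per subsystem and the N scores fed into the fusion of step S206.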
By implementing the embodiment of the present invention, N I-Vector matrices are separately trained; N corresponding I-Vector vectors are extracted from the test sample according to the N I-Vector matrices; scores are then computed separately to obtain N corresponding scores; finally the N corresponding scores are fused to obtain a target score, and a decision is made according to the target score. This makes it possible, given massive training data, to break through the performance bottleneck of single-I-Vector-framework voiceprint recognition in the prior art. Tests indicate that two or more I-Vector frameworks, each trained with sufficient data, can improve overall system performance by roughly 20% to 30% relative to a single-I-Vector-framework system.
To facilitate implementing the above solutions of the embodiments of the present invention, the present invention correspondingly provides a voiceprint recognition system. As shown in Fig. 5, a schematic structural diagram of the voiceprint recognition system provided by the present invention, the voiceprint recognition system 50 includes: a matrix training module 500, a vector extraction module 502, a computing module 504, and a fusion and decision module 506, where:
the matrix training module 500 is configured to separately train N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, N being a natural number greater than 1;
the vector extraction module 502 is configured to extract N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
the computing module 504 is configured to compute a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and
the fusion and decision module 506 is configured to fuse the N corresponding scores to obtain a target score and make a decision according to the target score.
Specifically, the matrix training module 500 is configured to separately train the N I-Vector matrices with N sets of training data to obtain the N I-Vector matrices; the N sets of training data are mutually independent.
Specifically, as shown in Fig. 6, a schematic structural diagram of the fusion and decision module provided by an embodiment of the present invention, the fusion and decision module 506 may include a first fusion unit 5060, a second fusion unit 5062, or a third fusion unit 5064, where:
the first fusion unit 5060 is configured to average the N corresponding scores as the target score; or
the second fusion unit 5062 is configured to take the maximum of the N corresponding scores as the target score; or
the third fusion unit 5064 is configured to take the minimum of the N corresponding scores as the target score.
Further, as shown in Fig. 7, a schematic structural diagram of the matrix training module provided by an embodiment of the present invention, the matrix training module 500 may include: a first selection unit 5000, a model training unit 5002, a second selection unit 5004, a GSV extraction unit 5006, and an I-Vector matrix training unit 5008, where:
the first selection unit 5000 is configured to randomly select N sets of speech data of a first duration as training data for the background models;
the model training unit 5002 is configured to perform feature extraction and feature normalization on each set of speech data of the first duration, and separately train N GMM-UBM models with the extracted features;
the second selection unit 5004 is configured to randomly select N sets of speech data of a second duration as training data for the I-Vector matrices;
the GSV extraction unit 5006 is configured to perform feature extraction and feature normalization on each set of speech data of the second duration, and with the extracted features extract N Gaussian supervectors (GSVs) through the N trained GMM-UBM models; and
the I-Vector matrix training unit 5008 is configured to separately train the N I-Vector matrices with the N GSVs.
Still further, when the model training unit 5002 trains the N GMM-UBM models separately, M of the N sets of GMM-UBM model parameters differ from one another, M being a natural number greater than 1 and less than or equal to N.
Referring to Fig. 8, Fig. 8 is a schematic structural diagram of another embodiment of the voiceprint recognition system provided by the present invention. As shown in Fig. 8, the voiceprint recognition system 80 may include: at least one processor 801 (such as a CPU), at least one network interface 804, a user interface 803, a memory 805, at least one communication bus 802, and a display screen 806. The communication bus 802 is configured to realize connection and communication between these components. The user interface 803 may optionally include a standard wired interface or wireless interface. The network interface 804 may optionally include a standard wired interface or wireless interface (such as a Wi-Fi interface). The memory 805 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory; optionally, the memory 805 may also be at least one storage system located away from the aforementioned processor 801. As shown in Fig. 8, as a computer storage medium, the memory 805 may include an operating system, a network communication module, a user interface module, and a voiceprint recognition program.
In the voiceprint recognition system 80 shown in Fig. 8, the processor 801 may be configured to call the voiceprint recognition program stored in the memory 805 and perform the following operations:
separately training N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, N being a natural number greater than 1;
extracting N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
computing a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and
fusing the N corresponding scores to obtain a target score, and making a decision according to the target score.
Specifically, the processor 801's separately training the N identity-factor I-Vector matrices to obtain the N I-Vector matrices may include:
separately training the N I-Vector matrices with N sets of training data to obtain the N I-Vector matrices, where the N sets of training data are mutually independent.
Further, the processor 801's fusing the N corresponding scores to obtain the target score may include:
averaging the N corresponding scores as the target score; or
taking the maximum of the N corresponding scores as the target score; or
taking the minimum of the N corresponding scores as the target score.
Further, the processor 801 training the N I-Vector matrices separately with the N sets of training data to obtain the N I-Vector matrices may include:
randomly selecting N sets of speech data of a first duration as training data for background models;
performing feature extraction and feature normalization on the speech data of the first duration respectively, and training N GMM-UBM models separately with the extracted features;
randomly selecting N sets of speech data of a second duration as training data for the I-Vector matrices;
performing feature extraction and feature normalization on the speech data of the second duration respectively, and extracting N Gaussian supervectors (GSVs) from the extracted features with the N trained GMM-UBM models;
training the N I-Vector matrices separately using the N GSVs.
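A minimal sketch of two of the steps above, training one GMM-UBM and extracting one Gaussian supervector by MAP mean adaptation, assuming scikit-learn's GaussianMixture as a stand-in UBM trainer. The component count, relevance factor, and random stand-in features are illustrative assumptions; a real system would fit on normalized spectral features (e.g. MFCCs) drawn from the first- and second-duration speech data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_ubm(background_feats, n_components=8):
    """Train a GMM-UBM on pooled, normalized background features (frames x dims)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=50, random_state=0)
    ubm.fit(background_feats)
    return ubm

def extract_gsv(ubm, utt_feats, relevance=16.0):
    """MAP-adapt the UBM means to one utterance and stack them into a GSV."""
    post = ubm.predict_proba(utt_feats)          # (T, C) frame responsibilities
    n_c = post.sum(axis=0)                       # zeroth-order statistics
    f_c = post.T @ utt_feats                     # first-order statistics, (C, D)
    alpha = (n_c / (n_c + relevance))[:, None]   # per-component adaptation weight
    mean_c = f_c / np.maximum(n_c, 1e-8)[:, None]
    adapted = alpha * mean_c + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                       # supervector of length C * D

rng = np.random.default_rng(0)
ubm = train_gmm_ubm(rng.standard_normal((500, 13)))     # stand-in background data
gsv = extract_gsv(ubm, rng.standard_normal((120, 13)))  # one utterance's GSV
```

Repeating this for each of the N independent data sets yields the N GSVs from which the N I-Vector matrices are then trained.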
Further, when the N GMM-UBM models are trained separately, M of the N sets of GMM-UBM model parameters differ from one another, where M is a natural number greater than 1 and less than or equal to N.
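One hypothetical way to realize this parameter diversity, not prescribed by the text (which only requires that M of the N parameter sets differ), is to give the systems different Gaussian component counts:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_diverse_ubms(datasets, component_counts):
    """Train one GMM-UBM per independent training set.

    Differing component counts supply the required parameter
    diversity, which helps decorrelate the N systems' errors
    before score fusion.
    """
    assert len(datasets) == len(component_counts)
    ubms = []
    for feats, c in zip(datasets, component_counts):
        gmm = GaussianMixture(n_components=c, covariance_type="diag",
                              max_iter=30, random_state=0)
        gmm.fit(feats)
        ubms.append(gmm)
    return ubms

rng = np.random.default_rng(1)
datasets = [rng.standard_normal((300, 13)) for _ in range(3)]    # N = 3 stand-ins
ubms = build_diverse_ubms(datasets, component_counts=[2, 4, 8])  # all three differ
```

Component count is only one of several GMM-UBM parameters that could be varied; covariance type or feature configuration would serve equally well.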
It should be noted that the voiceprint recognition system 50 or the voiceprint recognition system 80 in the embodiments of the present invention may be an electronic terminal such as a personal computer, a mobile intelligent terminal, or a tablet computer; the functions of the functional modules in the voiceprint recognition system 50 or 80 may be implemented according to the methods in the above method embodiments, and are not repeated here.
In summary, by implementing the embodiments of the present invention, N I-Vector matrices are trained separately; N corresponding I-Vector vectors are extracted from a test sample according to the N I-Vector matrices; scores are then calculated respectively to obtain N corresponding scores; finally, the N corresponding scores are fused to obtain a target score, and a decision is made according to the target score. Under the premise of massive training data, this overcomes the technical problem in the prior art that a single I-Vector framework hits a voiceprint recognition performance bottleneck. Tests indicate that two or more I-Vector frameworks, each trained with sufficient data, improve overall system performance by roughly 20% to 30% relative to a single I-Vector framework.
Those of ordinary skill in the art will appreciate that all or part of the flows in the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flows of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot limit the scope of the rights of the present invention; therefore, equivalent variations made according to the claims of the present invention still fall within the scope covered by the present invention.

Claims (10)

1. A voiceprint recognition method, characterized by comprising:
training N identity factor (I-Vector) matrices separately to obtain N I-Vector matrices, the N being a natural number greater than 1;
extracting N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
calculating scores according to the N corresponding I-Vector vectors respectively to obtain N corresponding scores; and
fusing the N corresponding scores to obtain a target score, and making a decision according to the target score.
2. The method according to claim 1, characterized in that training the N identity factor (I-Vector) matrices separately to obtain the N I-Vector matrices comprises:
training the N I-Vector matrices separately with N sets of training data to obtain the N I-Vector matrices, the N sets of training data being mutually independent.
3. The method according to claim 1, characterized in that fusing the N corresponding scores to obtain the target score comprises:
averaging the N corresponding scores to obtain the target score; or
taking the maximum of the N corresponding scores as the target score; or
taking the minimum of the N corresponding scores as the target score.
4. The method according to claim 2, characterized in that training the N I-Vector matrices separately with the N sets of training data to obtain the N I-Vector matrices comprises:
randomly selecting N sets of speech data of a first duration as training data for background models;
performing feature extraction and feature normalization on the speech data of the first duration respectively, and training N GMM-UBM models separately with the extracted features;
randomly selecting N sets of speech data of a second duration as training data for the I-Vector matrices;
performing feature extraction and feature normalization on the speech data of the second duration respectively, and extracting N Gaussian supervectors (GSVs) from the extracted features with the N trained GMM-UBM models; and
training the N I-Vector matrices separately using the N GSVs.
5. The method according to claim 4, characterized in that, when the N GMM-UBM models are trained separately, M of the N sets of GMM-UBM model parameters differ from one another, the M being a natural number greater than 1 and less than or equal to the N.
6. A voiceprint recognition system, characterized by comprising:
a matrix training module, configured to train N identity factor (I-Vector) matrices separately to obtain N I-Vector matrices, the N being a natural number greater than 1;
a vector extraction module, configured to extract N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
a calculation module, configured to calculate scores according to the N corresponding I-Vector vectors respectively to obtain N corresponding scores; and
a fusion and decision module, configured to fuse the N corresponding scores to obtain a target score and make a decision according to the target score.
7. The system according to claim 6, characterized in that the matrix training module is specifically configured to train the N I-Vector matrices separately with N sets of training data to obtain the N I-Vector matrices, the N sets of training data being mutually independent.
8. The system according to claim 6, characterized in that the fusion and decision module comprises:
a first fusion unit, configured to average the N corresponding scores to obtain the target score; or
a second fusion unit, configured to take the maximum of the N corresponding scores as the target score; or
a third fusion unit, configured to take the minimum of the N corresponding scores as the target score.
9. The system according to claim 7, characterized in that the matrix training module comprises:
a first selection unit, configured to randomly select N sets of speech data of a first duration as training data for background models;
a model training unit, configured to perform feature extraction and feature normalization on the speech data of the first duration respectively, and train N GMM-UBM models separately with the extracted features;
a second selection unit, configured to randomly select N sets of speech data of a second duration as training data for the I-Vector matrices;
a GSV extraction unit, configured to perform feature extraction and feature normalization on the speech data of the second duration respectively, and extract N Gaussian supervectors (GSVs) from the extracted features with the N trained GMM-UBM models; and
an I-Vector matrix training unit, configured to train the N I-Vector matrices separately using the N GSVs.
10. The system according to claim 9, characterized in that, when the model training unit trains the N GMM-UBM models separately, M of the N sets of GMM-UBM model parameters differ from one another, the M being a natural number greater than 1 and less than or equal to the N.
CN201610218436.7A 2016-04-08 2016-04-08 Voiceprint recognition method and system Active CN107274905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610218436.7A CN107274905B (en) 2016-04-08 2016-04-08 Voiceprint recognition method and system

Publications (2)

Publication Number Publication Date
CN107274905A true CN107274905A (en) 2017-10-20
CN107274905B CN107274905B (en) 2019-09-27

Family

ID=60052207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610218436.7A Active CN107274905B (en) 2016-04-08 2016-04-08 Voiceprint recognition method and system

Country Status (1)

Country Link
CN (1) CN107274905B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140222428A1 (en) * 2013-02-07 2014-08-07 Nuance Communications, Inc. Method and Apparatus for Efficient I-Vector Extraction
US20140222423A1 (en) * 2013-02-07 2014-08-07 Nuance Communications, Inc. Method and Apparatus for Efficient I-Vector Extraction
US20140244257A1 (en) * 2013-02-25 2014-08-28 Nuance Communications, Inc. Method and Apparatus for Automated Speaker Parameters Adaptation in a Deployed Speaker Verification System
CN104064189A (en) * 2014-06-26 2014-09-24 厦门天聪智能软件有限公司 Vocal print dynamic password modeling and verification method
CN105139857A * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countermeasure method against voice spoofing for automatic speaker identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Tingting et al., "Research on Text-Independent Speaker Identification Methods Based on Factor Analysis", China Master's Theses Full-text Database *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047491A * 2018-01-16 2019-07-23 中国科学院声学研究所 Random-digit-password-related speaker recognition method and device
CN108520752A * 2018-04-25 2018-09-11 西北工业大学 Voiceprint recognition method and device
CN108520752B * 2018-04-25 2021-03-12 西北工业大学 Voiceprint recognition method and device
CN108831487A * 2018-06-28 2018-11-16 深圳大学 Voiceprint recognition method, electronic device and computer-readable storage medium
CN109036437A * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accent recognition method and apparatus, computer device and computer-readable storage medium
CN110289003A * 2018-10-10 2019-09-27 腾讯科技(深圳)有限公司 Voiceprint recognition method, model training method and server
CN110289003B * 2018-10-10 2021-10-29 腾讯科技(深圳)有限公司 Voiceprint recognition method, model training method and server
CN109360573A * 2018-11-13 2019-02-19 平安科技(深圳)有限公司 Livestock voiceprint recognition method and device, terminal device and computer storage medium
CN110047504A * 2019-04-18 2019-07-23 东华大学 Speaker recognition method under identity vector (x-vector) linear transformation
CN110047504B * 2019-04-18 2021-08-20 东华大学 Speaker identification method under identity vector x-vector linear transformation
CN111161713A * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment

Also Published As

Publication number Publication date
CN107274905B (en) 2019-09-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210917

Address after: 35th floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen 518057, Guangdong Province

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.