CN107274905A - Voiceprint recognition method and system - Google Patents

Voiceprint recognition method and system

Info

Publication number
CN107274905A
Authority
CN
China
Prior art keywords
vector
training
matrixes
gmm
feature
Prior art date
Legal status
Granted
Application number
CN201610218436.7A
Other languages
Chinese (zh)
Other versions
CN107274905B (en)
Inventor
金星明
李为
郑昉劢
吴富章
朱碧磊
钱柄桦
李科
吴永坚
黄飞跃
Current Assignee
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201610218436.7A
Publication of CN107274905A
Application granted
Publication of CN107274905B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/12: Score normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the present invention disclose a voiceprint recognition method, including: separately training N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, where N is a natural number greater than 1; extracting N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices; computing a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and fusing the N corresponding scores to obtain a target score and making a decision according to the target score. With the present invention, given massive training data, the performance bottleneck of single-I-Vector-framework voiceprint recognition in the prior art can be broken through. Tests indicate that two or more I-Vector frameworks, each trained with sufficient data, can improve overall system performance by roughly 20% to 30% relative to a single-I-Vector-framework system.

Description

Voiceprint recognition method and system
Technical field
The present invention relates to the computer field, and in particular to a voiceprint recognition method and system.
Background
Through years of accumulation and evolution, voiceprint recognition (especially the text-independent field) has produced techniques such as the Gaussian Mixture Model-Universal Background Model (GMM-UBM), the Gaussian Super Vector (GSV), Joint Factor Analysis (JFA), and the identity factor (Identity Vector, I-Vector). To this day, the I-Vector method dominates the development of the whole voiceprint recognition field. The I-Vector method, also called the identity factor method, does not attempt to force the speaker space and the channel space apart; instead it directly defines a single Total Variability Space that contains all possible information in the speech data. The load factors of the total variability space are then obtained by factor analysis, and these load factors are the I-Vector, whose dimensionality is far lower than that of the Gaussian supervector. In this factor space, a simple way to discriminate between speakers is to enlarge the distance between different speakers while shrinking the distance between the noise-affected utterances of the same speaker. This is exactly the goal of Linear Discriminant Analysis (LDA): the differences between speakers are treated as the between-class matrix, the differences caused by noise are treated as the within-class matrix, and probabilistic LDA is then applied to estimate the LDA matrix of the I-Vectors; the vectors projected through this LDA matrix are the information vectors that reflect speaker identity.
Although the technology based on a single I-Vector matrix has achieved a considerable performance breakthrough relative to techniques such as JFA, system performance still hits a bottleneck as the amount of training data increases, so there remains large room for performance improvement.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a voiceprint recognition method and a voiceprint recognition system that can break through the performance bottleneck of single-I-Vector-framework voiceprint recognition in the prior art.
To solve the above technical problem, a first aspect of the embodiments of the present invention discloses a voiceprint recognition method, including:
separately training N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, where N is a natural number greater than 1;
extracting N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
computing a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and
fusing the N corresponding scores to obtain a target score, and making a decision according to the target score.
With reference to the first aspect, in a first possible implementation, separately training the N identity-factor I-Vector matrices to obtain the N I-Vector matrices includes:
separately training the N I-Vector matrices with N sets of training data to obtain the N I-Vector matrices, where the N sets of training data are mutually independent.
With reference to the first aspect, in a second possible implementation, fusing the N corresponding scores to obtain the target score includes:
averaging the N corresponding scores as the target score; or
taking the maximum of the N corresponding scores as the target score; or
taking the minimum of the N corresponding scores as the target score.
With reference to the first possible implementation of the first aspect, in a third possible implementation, separately training the N I-Vector matrices with the N sets of training data includes:
randomly selecting N sets of speech data of a first duration as training data for the background models;
performing feature extraction and feature normalization on each set of speech data of the first duration, and separately training N GMM-UBM models with the extracted features;
randomly selecting N sets of speech data of a second duration as training data for the I-Vector matrices;
performing feature extraction and feature normalization on each set of speech data of the second duration, and with the extracted features extracting N Gaussian supervectors (GSVs) through the N trained GMM-UBM models; and
separately training the N I-Vector matrices with the N GSVs.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, when the N GMM-UBM models are trained separately, M of the N sets of GMM-UBM model parameters differ from one another, where M is a natural number greater than 1 and less than or equal to N.
A second aspect of the embodiments of the present invention discloses a voiceprint recognition system, including:
a matrix training module, configured to separately train N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, where N is a natural number greater than 1;
a vector extraction module, configured to extract N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
a computing module, configured to compute a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and
a fusion and decision module, configured to fuse the N corresponding scores to obtain a target score and make a decision according to the target score.
With reference to the second aspect, in a first possible implementation, the matrix training module is specifically configured to separately train the N I-Vector matrices with N sets of training data to obtain the N I-Vector matrices, where the N sets of training data are mutually independent.
With reference to the second aspect, in a second possible implementation, the fusion and decision module includes:
a first fusion unit, configured to average the N corresponding scores as the target score; or
a second fusion unit, configured to take the maximum of the N corresponding scores as the target score; or
a third fusion unit, configured to take the minimum of the N corresponding scores as the target score.
With reference to the first possible implementation of the second aspect, in a third possible implementation, the matrix training module includes:
a first selection unit, configured to randomly select N sets of speech data of a first duration as training data for the background models;
a model training unit, configured to perform feature extraction and feature normalization on each set of speech data of the first duration, and separately train N GMM-UBM models with the extracted features;
a second selection unit, configured to randomly select N sets of speech data of a second duration as training data for the I-Vector matrices;
a GSV extraction unit, configured to perform feature extraction and feature normalization on each set of speech data of the second duration, and with the extracted features extract N Gaussian supervectors (GSVs) through the N trained GMM-UBM models; and
an I-Vector matrix training unit, configured to separately train the N I-Vector matrices with the N GSVs.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, when the model training unit trains the N GMM-UBM models separately, M of the N sets of GMM-UBM model parameters differ from one another, where M is a natural number greater than 1 and less than or equal to N.
A third aspect of the embodiments of the present invention discloses a computer storage medium. The computer storage medium stores a program which, when executed, performs all the steps of the voiceprint recognition method of the first aspect of the embodiments of the present invention, or of the first, second, third, or fourth possible implementation of the first aspect.
By implementing the embodiments of the present invention, N I-Vector matrices are separately trained; N corresponding I-Vector vectors are extracted from the test sample according to the N I-Vector matrices; scores are then computed separately to obtain N corresponding scores; finally the N corresponding scores are fused to obtain a target score, and a decision is made according to the target score. This makes it possible, given massive training data, to break through the performance bottleneck of single-I-Vector-framework voiceprint recognition in the prior art. Tests indicate that two or more I-Vector frameworks, each trained with sufficient data, can improve overall system performance by roughly 20% to 30% relative to a single-I-Vector-framework system.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required by the embodiments or the prior-art description are briefly introduced below. Apparently, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the recognition framework of the I-Vector method provided by the present invention;
Fig. 2 is a schematic flowchart of the voiceprint recognition method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the principle framework of the voiceprint recognition method provided by the present invention;
Fig. 4 is a schematic flowchart of the I-Vector matrix training method provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the voiceprint recognition system provided by the present invention;
Fig. 6 is a schematic structural diagram of the fusion and decision module provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the matrix training module provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of another embodiment of the voiceprint recognition system provided by the present invention.
Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The voiceprint recognition method provided by the embodiments of the present invention is a voiceprint recognition technique based on the I-Vector method. The I-Vector technique is a cross-channel algorithm based on a single space: it does not distinguish speaker-space information from channel-space information. Any single utterance can be decomposed into a background-model part and a part reflecting the individual speaker, and its Gaussian supervector (GSV) can be expressed as:
M_s = m_0 + T w_s
where M_s is the C*F-dimensional Gaussian supervector GSV; m_0 is the speaker-independent and channel-independent C*F-dimensional supervector, formed by concatenating the UBM mean vectors; w_s is the total variability factor, i.e. the I-Vector, of dimension N, a random vector following the standard normal distribution; and T is the total variability space matrix, of dimension CF*N.
In the training stage, the total variability space matrix T is estimated from a large development data set using a factor-analysis algorithm. Once the total variability space is obtained, the high-dimensional GSV is projected into the total variability subspace represented by the matrix T, finally giving the low-dimensional total variability factor (I-Vector).
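The relation M_s = m_0 + T w_s can be illustrated numerically. The sketch below is not from the patent: it uses toy dimensions and a plain least-squares point estimate of w_s, whereas practical systems compute a posterior estimate from Baum-Welch statistics with a standard-normal prior.

```python
import numpy as np

rng = np.random.default_rng(0)

C, F, D = 8, 4, 5             # Gaussians, feature dim, I-Vector dim (toy sizes)
CF = C * F

m0 = rng.normal(size=CF)      # UBM mean supervector (speaker/channel independent)
T = rng.normal(size=(CF, D))  # total variability matrix (trained offline by EM)

# A speaker's Gaussian supervector M_s = m0 + T @ w_s, plus small noise here
w_true = rng.normal(size=D)
Ms = m0 + T @ w_true + 0.01 * rng.normal(size=CF)

# Point estimate of the I-Vector: project the centered supervector onto the
# column space of T via least squares (the real estimator also weights by
# zero-order statistics and adds a prior term, omitted for brevity).
w_hat, *_ = np.linalg.lstsq(T, Ms - m0, rcond=None)

print(w_hat.shape)            # low-dimensional I-Vector, far smaller than CF
```

The point of the projection is visible in the shapes: a 32-dimensional supervector collapses to a 5-dimensional identity factor.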
Specifically, with reference to the recognition-framework schematic of the I-Vector method provided by the present invention shown in Fig. 1, the training stage mainly includes the training of three models: the UBM background model, the total variability matrix T, and the PLDA model.
1. UBM background model training: a high-order GMM is trained with sufficient balanced speech from hundreds of people (channel-balanced, gender-balanced) to describe the speaker-independent feature distribution.
2. Total variability space T (also called the I-Vector matrix) training: the total variability space matrix T is estimated from a large development data set using factor analysis and the Expectation Maximization (EM) algorithm.
3. Channel-compensation (Probabilistic Linear Discriminant Analysis, PLDA) model training: according to the total variability space T and the UBM, the total variability factors (i-vectors) of the training speech are extracted and grouped by speaker, and the parameters of the PLDA model are estimated using factor analysis and the EM algorithm.
Finally, in the test stage: the total variability factor (i-vector) is extracted according to the UBM model and the total variability matrix T; the i-vectors of the test data and the target speaker are fed into PLDA for scoring, and a decision is made.
The voiceprint recognition method of the embodiment of the present invention is based on the I-Vector method. Referring to Fig. 2, a schematic flowchart of the voiceprint recognition method provided by an embodiment of the present invention, the method includes:
Step S200: separately train N identity-factor (I-Vector) matrices to obtain N I-Vector matrices.
Specifically, N in each embodiment of the present invention is a natural number greater than 1. By training multiple I-Vector matrices concurrently and independently, multiple mutually disjoint I-Vector matrices can be obtained.
Step S202: extract N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices.
Step S204: compute a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores.
Step S206: fuse the N corresponding scores to obtain a target score, and make a decision according to the target score.
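The fusion stage of the steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the three subsystem scores are hypothetical numbers, and the 0.5 decision threshold is an assumption for the example only.

```python
import numpy as np

def fuse_scores(scores, mode="mean"):
    """Fuse the N per-subsystem scores into one target score (step S206)."""
    scores = np.asarray(scores, dtype=float)
    if mode == "mean":
        return scores.mean()
    if mode == "max":
        return scores.max()
    if mode == "min":
        return scores.min()
    raise ValueError(mode)

# N = 3 hypothetical subsystem scores for one test trial (steps S202/S204)
scores = [0.62, 0.55, 0.70]
target = fuse_scores(scores, "mean")
accept = target > 0.5          # decision threshold is an assumption
print(round(float(target), 4), bool(accept))
```

Any of the three fusion modes described later (mean, max, min) plugs into the same decision step.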
Specifically, a detailed description is given with reference to the principle-framework schematic of the voiceprint recognition method provided by the present invention shown in Fig. 3. In step S200, the N I-Vector matrices may be separately trained with N sets of training data to obtain the N I-Vector matrices; the N sets of training data are mutually independent, i.e. the data sets are guaranteed to be mutually disjoint.
Step S200 may specifically refer to the schematic flowchart of the I-Vector matrix training method provided by an embodiment of the present invention shown in Fig. 4, including:
Step S400: randomly select N sets of speech data of a first duration as training data for the background models.
Specifically, the first duration may be, for example, a duration of 50 hours or 60 hours; the embodiment of the present invention is not limited in this respect.
Step S402: perform feature extraction and feature normalization on each set of speech data of the first duration, and separately train N GMM-UBM models with the extracted features.
Specifically, for one set of speech data, the speech samples may first be converted into Pulse Code Modulation (PCM) files with a sample rate of 8 kHz and a bit depth of 16 bits; Mel Frequency Cepstrum Coefficient (MFCC) features are then extracted, and their first-order and second-order statistics are concatenated as the feature of the sample; energy detection, Voice Activity Detection (VAD), and normalization are then applied to the extracted MFCC feature series. With the extracted features, a GMM-UBM model is trained, generally with 512 components or more. In this way, N GMM-UBM models can be trained separately for the N data sets.
It should be noted that when the N GMM-UBM models are trained separately, M of the N sets of GMM-UBM model parameters differ from one another, where M is a natural number greater than 1 and less than or equal to N.
Step S404: randomly select N sets of speech data of a second duration as training data for the I-Vector matrices.
Specifically, the second duration may be, for example, a duration of 100 hours or 120 hours; the embodiment of the present invention is not limited in this respect. More preferably, the second duration is longer than 100 hours.
Step S406: perform feature extraction and feature normalization on each set of speech data of the second duration, and with the extracted features extract N Gaussian supervectors (GSVs) through the N trained GMM-UBM models.
Specifically, for one set of speech data, the feature extraction and feature normalization may follow the process in step S402 and are not repeated here. With the extracted features, the GMM-UBM model trained in step S402 is used to extract the Gaussian supervector (i.e. the very-high-dimensional vector formed by concatenating the mean of each Gaussian component).
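Concatenating adapted Gaussian means into a supervector can be sketched as follows. This is a minimal illustration under the usual MAP mean-adaptation scheme for GMM-UBM systems; the relevance factor of 16 and the random statistics are assumptions for the example, not values from the patent.

```python
import numpy as np

def gaussian_supervector(ubm_means, frame_posteriors, frames, relevance=16.0):
    """Build a GSV by MAP-adapting UBM means and concatenating them.

    ubm_means:        (C, F) UBM component means
    frame_posteriors: (T, C) per-frame component responsibilities
    frames:           (T, F) acoustic feature vectors
    relevance:        MAP relevance factor (a common default, assumed here)
    """
    n = frame_posteriors.sum(axis=0)                  # zero-order stats, (C,)
    f = frame_posteriors.T @ frames                   # first-order stats, (C, F)
    alpha = (n / (n + relevance))[:, None]            # per-component adaptation weight
    adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * ubm_means
    return adapted.reshape(-1)                        # concatenate means: (C*F,)

rng = np.random.default_rng(2)
C, F, n_frames = 4, 3, 50
means = rng.normal(size=(C, F))
post = rng.dirichlet(np.ones(C), size=n_frames)      # stand-in responsibilities
X = rng.normal(size=(n_frames, F))                   # stand-in features
gsv = gaussian_supervector(means, post, X)
print(gsv.shape)
```

With C = 512 components and F-dimensional MFCC features, the resulting C*F vector is exactly the "very-high-dimensional" supervector the text describes.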
Step S408: separately train the N I-Vector matrices with the N GSVs.
Next, step S202, extracting the N corresponding I-Vector vectors from the test sample according to the N I-Vector matrices, may specifically include: first performing feature extraction and feature normalization on the speech data (the process may again refer to step S402 and is not repeated here); then, with the extracted features, extracting the I-Vector vector corresponding to each sample based on the previously trained GMM-UBM models and I-Vector matrices.
There can be multiple fusion methods for fusing the N corresponding scores in the final step S206, including: averaging the N corresponding scores as the target score; or taking the maximum of the N corresponding scores as the target score; or taking the minimum of the N corresponding scores as the target score; and so on.
Specifically, when making a decision according to the target score, the cosine distance between two I-Vector vectors may be computed, and whether the two samples belong to the same person is judged according to that distance. (In the general case, after the I-Vector vectors are obtained, they are further reduced in dimensionality by methods such as PLDA; the reduced vectors often have stronger representation ability. This is not elaborated here.)
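Cosine scoring between two I-Vectors, as described above, can be sketched as follows; the enrollment and test vectors are hypothetical toy values, not extracted from speech:

```python
import numpy as np

def cosine_score(w_test, w_target):
    """Cosine similarity between two I-Vectors; higher means more likely
    the same speaker."""
    return float(w_test @ w_target /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_target)))

w_enroll = np.array([1.0, 0.5, -0.2])
w_same = np.array([0.9, 0.6, -0.1])    # hypothetical same-speaker I-Vector
w_diff = np.array([-0.5, 1.0, 0.8])    # hypothetical different-speaker I-Vector

print(cosine_score(w_enroll, w_same) > cosine_score(w_enroll, w_diff))
```

In the N-matrix scheme, this score would be computed once per subsystem and the N scores fed into the fusion of step S206.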
By implementing the embodiment of the present invention, N I-Vector matrices are separately trained; N corresponding I-Vector vectors are extracted from the test sample according to the N I-Vector matrices; scores are then computed separately to obtain N corresponding scores; finally the N corresponding scores are fused to obtain a target score, and a decision is made according to the target score. This makes it possible, given massive training data, to break through the performance bottleneck of single-I-Vector-framework voiceprint recognition in the prior art. Tests indicate that two or more I-Vector frameworks, each trained with sufficient data, can improve overall system performance by roughly 20% to 30% relative to a single-I-Vector-framework system.
To facilitate implementing the above solutions of the embodiments of the present invention, the present invention correspondingly provides a voiceprint recognition system. As shown in Fig. 5, a schematic structural diagram of the voiceprint recognition system provided by the present invention, the voiceprint recognition system 50 includes: a matrix training module 500, a vector extraction module 502, a computing module 504, and a fusion and decision module 506, where:
the matrix training module 500 is configured to separately train N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, N being a natural number greater than 1;
the vector extraction module 502 is configured to extract N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
the computing module 504 is configured to compute a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and
the fusion and decision module 506 is configured to fuse the N corresponding scores to obtain a target score and make a decision according to the target score.
Specifically, the matrix training module 500 is configured to separately train the N I-Vector matrices with N sets of training data to obtain the N I-Vector matrices; the N sets of training data are mutually independent.
Specifically, as shown in Fig. 6, a schematic structural diagram of the fusion and decision module provided by an embodiment of the present invention, the fusion and decision module 506 may include a first fusion unit 5060, a second fusion unit 5062, or a third fusion unit 5064, where:
the first fusion unit 5060 is configured to average the N corresponding scores as the target score; or
the second fusion unit 5062 is configured to take the maximum of the N corresponding scores as the target score; or
the third fusion unit 5064 is configured to take the minimum of the N corresponding scores as the target score.
Further, as shown in Fig. 7, a schematic structural diagram of the matrix training module provided by an embodiment of the present invention, the matrix training module 500 may include: a first selection unit 5000, a model training unit 5002, a second selection unit 5004, a GSV extraction unit 5006, and an I-Vector matrix training unit 5008, where:
the first selection unit 5000 is configured to randomly select N sets of speech data of a first duration as training data for the background models;
the model training unit 5002 is configured to perform feature extraction and feature normalization on each set of speech data of the first duration, and separately train N GMM-UBM models with the extracted features;
the second selection unit 5004 is configured to randomly select N sets of speech data of a second duration as training data for the I-Vector matrices;
the GSV extraction unit 5006 is configured to perform feature extraction and feature normalization on each set of speech data of the second duration, and with the extracted features extract N Gaussian supervectors (GSVs) through the N trained GMM-UBM models; and
the I-Vector matrix training unit 5008 is configured to separately train the N I-Vector matrices with the N GSVs.
Still further, when the model training unit 5002 trains the N GMM-UBM models separately, M of the N sets of GMM-UBM model parameters differ from one another, M being a natural number greater than 1 and less than or equal to N.
Referring to Fig. 8, Fig. 8 is a schematic structural diagram of another embodiment of the voiceprint recognition system provided by the present invention. As shown in Fig. 8, the voiceprint recognition system 80 may include: at least one processor 801 (such as a CPU), at least one network interface 804, a user interface 803, a memory 805, at least one communication bus 802, and a display screen 806. The communication bus 802 is configured to realize connection and communication between these components. The user interface 803 may optionally include a standard wired interface or wireless interface. The network interface 804 may optionally include a standard wired interface or wireless interface (such as a Wi-Fi interface). The memory 805 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory; optionally, the memory 805 may also be at least one storage system located away from the aforementioned processor 801. As shown in Fig. 8, as a computer storage medium, the memory 805 may include an operating system, a network communication module, a user interface module, and a voiceprint recognition program.
In the voiceprint recognition system 80 shown in Fig. 8, the processor 801 may be configured to call the voiceprint recognition program stored in the memory 805 and perform the following operations:
separately training N identity-factor (I-Vector) matrices to obtain N I-Vector matrices, N being a natural number greater than 1;
extracting N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
computing a score for each of the N corresponding I-Vector vectors to obtain N corresponding scores; and
fusing the N corresponding scores to obtain a target score, and making a decision according to the target score.
Specifically, the processor 801's separately training the N identity-factor I-Vector matrices to obtain the N I-Vector matrices may include:
separately training the N I-Vector matrices with N sets of training data to obtain the N I-Vector matrices, where the N sets of training data are mutually independent.
Further, the processor 801's fusing the N corresponding scores to obtain the target score may include:
averaging the N corresponding scores as the target score; or
taking the maximum of the N corresponding scores as the target score; or
taking the minimum of the N corresponding scores as the target score.
Further, the processor 801 training the N I-Vector matrices separately with the N sets of training data to obtain the N I-Vector matrices may include:
randomly selecting N sets of speech data of a first duration as training data for background models;
performing feature extraction and feature normalization on the speech data of the first duration respectively, and training N GMM-UBM models separately with the extracted features;
randomly selecting N sets of speech data of a second duration as training data for the I-Vector matrices;
performing feature extraction and feature normalization on the speech data of the second duration respectively, and extracting N Gaussian supervectors (GSVs) from the extracted features with the N trained GMM-UBM models;
training the N I-Vector matrices separately using the N GSVs.
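A minimal sketch of two of the steps above, training one GMM-UBM and extracting one Gaussian supervector by MAP mean adaptation, assuming scikit-learn's GaussianMixture as a stand-in UBM trainer. The component count, relevance factor, and random stand-in features are illustrative assumptions; a real system would fit on normalized spectral features (e.g. MFCCs) drawn from the first- and second-duration speech data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_ubm(background_feats, n_components=8):
    """Train a GMM-UBM on pooled, normalized background features (frames x dims)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=50, random_state=0)
    ubm.fit(background_feats)
    return ubm

def extract_gsv(ubm, utt_feats, relevance=16.0):
    """MAP-adapt the UBM means to one utterance and stack them into a GSV."""
    post = ubm.predict_proba(utt_feats)          # (T, C) frame responsibilities
    n_c = post.sum(axis=0)                       # zeroth-order statistics
    f_c = post.T @ utt_feats                     # first-order statistics, (C, D)
    alpha = (n_c / (n_c + relevance))[:, None]   # per-component adaptation weight
    mean_c = f_c / np.maximum(n_c, 1e-8)[:, None]
    adapted = alpha * mean_c + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                       # supervector of length C * D

rng = np.random.default_rng(0)
ubm = train_gmm_ubm(rng.standard_normal((500, 13)))     # stand-in background data
gsv = extract_gsv(ubm, rng.standard_normal((120, 13)))  # one utterance's GSV
```

Repeating this for each of the N independent data sets yields the N GSVs from which the N I-Vector matrices are then trained.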
Further, when the N GMM-UBM models are trained separately, M of the N sets of GMM-UBM model parameters differ from one another, where M is a natural number greater than 1 and less than or equal to N.
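One hypothetical way to realize this parameter diversity, not prescribed by the text (which only requires that M of the N parameter sets differ), is to give the systems different Gaussian component counts:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_diverse_ubms(datasets, component_counts):
    """Train one GMM-UBM per independent training set.

    Differing component counts supply the required parameter
    diversity, which helps decorrelate the N systems' errors
    before score fusion.
    """
    assert len(datasets) == len(component_counts)
    ubms = []
    for feats, c in zip(datasets, component_counts):
        gmm = GaussianMixture(n_components=c, covariance_type="diag",
                              max_iter=30, random_state=0)
        gmm.fit(feats)
        ubms.append(gmm)
    return ubms

rng = np.random.default_rng(1)
datasets = [rng.standard_normal((300, 13)) for _ in range(3)]    # N = 3 stand-ins
ubms = build_diverse_ubms(datasets, component_counts=[2, 4, 8])  # all three differ
```

Component count is only one of several GMM-UBM parameters that could be varied; covariance type or feature configuration would serve equally well.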
It should be noted that the voiceprint recognition system 50 or the voiceprint recognition system 80 in the embodiments of the present invention may be an electronic terminal such as a personal computer, a mobile intelligent terminal, or a tablet computer; the functions of the functional modules in the voiceprint recognition system 50 or 80 may be implemented according to the methods in the above method embodiments, and are not repeated here.
In summary, by implementing the embodiments of the present invention, N I-Vector matrices are trained separately; N corresponding I-Vector vectors are extracted from a test sample according to the N I-Vector matrices; scores are then calculated respectively to obtain N corresponding scores; finally, the N corresponding scores are fused to obtain a target score, and a decision is made according to the target score. Under the premise of massive training data, this overcomes the technical problem in the prior art that a single I-Vector framework hits a voiceprint recognition performance bottleneck. Tests indicate that two or more I-Vector frameworks, each trained with sufficient data, improve overall system performance by roughly 20% to 30% relative to a single I-Vector framework.
Those of ordinary skill in the art will appreciate that all or part of the flows in the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flows of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot limit the scope of the rights of the present invention; therefore, equivalent variations made according to the claims of the present invention still fall within the scope covered by the present invention.

Claims (10)

1. A voiceprint recognition method, characterized by comprising:
training N identity factor (I-Vector) matrices separately to obtain N I-Vector matrices, the N being a natural number greater than 1;
extracting N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
calculating scores according to the N corresponding I-Vector vectors respectively to obtain N corresponding scores; and
fusing the N corresponding scores to obtain a target score, and making a decision according to the target score.
2. The method according to claim 1, characterized in that training the N identity factor (I-Vector) matrices separately to obtain the N I-Vector matrices comprises:
training the N I-Vector matrices separately with N sets of training data to obtain the N I-Vector matrices, the N sets of training data being mutually independent.
3. The method according to claim 1, characterized in that fusing the N corresponding scores to obtain the target score comprises:
averaging the N corresponding scores to obtain the target score; or
taking the maximum of the N corresponding scores as the target score; or
taking the minimum of the N corresponding scores as the target score.
4. The method according to claim 2, characterized in that training the N I-Vector matrices separately with the N sets of training data to obtain the N I-Vector matrices comprises:
randomly selecting N sets of speech data of a first duration as training data for background models;
performing feature extraction and feature normalization on the speech data of the first duration respectively, and training N GMM-UBM models separately with the extracted features;
randomly selecting N sets of speech data of a second duration as training data for the I-Vector matrices;
performing feature extraction and feature normalization on the speech data of the second duration respectively, and extracting N Gaussian supervectors (GSVs) from the extracted features with the N trained GMM-UBM models; and
training the N I-Vector matrices separately using the N GSVs.
5. The method according to claim 4, characterized in that, when the N GMM-UBM models are trained separately, M of the N sets of GMM-UBM model parameters differ from one another, the M being a natural number greater than 1 and less than or equal to the N.
6. A voiceprint recognition system, characterized by comprising:
a matrix training module, configured to train N identity factor (I-Vector) matrices separately to obtain N I-Vector matrices, the N being a natural number greater than 1;
a vector extraction module, configured to extract N corresponding I-Vector vectors from a test sample according to the N I-Vector matrices;
a calculation module, configured to calculate scores according to the N corresponding I-Vector vectors respectively to obtain N corresponding scores; and
a fusion and decision module, configured to fuse the N corresponding scores to obtain a target score and make a decision according to the target score.
7. The system according to claim 6, characterized in that the matrix training module is specifically configured to train the N I-Vector matrices separately with N sets of training data to obtain the N I-Vector matrices, the N sets of training data being mutually independent.
8. The system according to claim 6, characterized in that the fusion and decision module comprises:
a first fusion unit, configured to average the N corresponding scores to obtain the target score; or
a second fusion unit, configured to take the maximum of the N corresponding scores as the target score; or
a third fusion unit, configured to take the minimum of the N corresponding scores as the target score.
9. The system according to claim 7, characterized in that the matrix training module comprises:
a first selection unit, configured to randomly select N sets of speech data of a first duration as training data for background models;
a model training unit, configured to perform feature extraction and feature normalization on the speech data of the first duration respectively, and train N GMM-UBM models separately with the extracted features;
a second selection unit, configured to randomly select N sets of speech data of a second duration as training data for the I-Vector matrices;
a GSV extraction unit, configured to perform feature extraction and feature normalization on the speech data of the second duration respectively, and extract N Gaussian supervectors (GSVs) from the extracted features with the N trained GMM-UBM models; and
an I-Vector matrix training unit, configured to train the N I-Vector matrices separately using the N GSVs.
10. The system according to claim 9, characterized in that, when the model training unit trains the N GMM-UBM models separately, M of the N sets of GMM-UBM model parameters differ from one another, the M being a natural number greater than 1 and less than or equal to the N.
CN201610218436.7A 2016-04-08 2016-04-08 Voiceprint recognition method and system Active CN107274905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610218436.7A CN107274905B (en) 2016-04-08 2016-04-08 Voiceprint recognition method and system

Publications (2)

Publication Number Publication Date
CN107274905A true CN107274905A (en) 2017-10-20
CN107274905B CN107274905B (en) 2019-09-27

Family

ID=60052207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610218436.7A Active CN107274905B (en) 2016-04-08 2016-04-08 Voiceprint recognition method and system

Country Status (1)

Country Link
CN (1) CN107274905B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140222428A1 (en) * 2013-02-07 2014-08-07 Nuance Communications, Inc. Method and Apparatus for Efficient I-Vector Extraction
US20140222423A1 (en) * 2013-02-07 2014-08-07 Nuance Communications, Inc. Method and Apparatus for Efficient I-Vector Extraction
US20140244257A1 (en) * 2013-02-25 2014-08-28 Nuance Communications, Inc. Method and Apparatus for Automated Speaker Parameters Adaptation in a Deployed Speaker Verification System
CN104064189A (en) * 2014-06-26 2014-09-24 厦门天聪智能软件有限公司 Vocal print dynamic password modeling and verification method
CN105139857A * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countermeasure method against voice spoofing for automatic speaker identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Tingting et al., "Research on Text-Independent Speaker Identification Methods Based on Factor Analysis", China Master's Theses Full-text Database *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047491A * 2018-01-16 2019-07-23 中国科学院声学研究所 Random-digit-password-related speaker recognition method and device
CN108520752A * 2018-04-25 2018-09-11 西北工业大学 Voiceprint recognition method and device
CN108520752B * 2018-04-25 2021-03-12 西北工业大学 Voiceprint recognition method and device
CN108831487A * 2018-06-28 2018-11-16 深圳大学 Voiceprint recognition method, electronic device and computer-readable storage medium
CN109036437A * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accent recognition method and apparatus, computer device and computer-readable storage medium
CN110289003A * 2018-10-10 2019-09-27 腾讯科技(深圳)有限公司 Voiceprint recognition method, model training method and server
CN110289003B * 2018-10-10 2021-10-29 腾讯科技(深圳)有限公司 Voiceprint recognition method, model training method and server
CN109360573A * 2018-11-13 2019-02-19 平安科技(深圳)有限公司 Livestock voiceprint recognition method and device, terminal device and computer storage medium
CN110047504A * 2019-04-18 2019-07-23 东华大学 Speaker recognition method under identity vector (x-vector) linear transformation
CN110047504B * 2019-04-18 2021-08-20 东华大学 Speaker identification method under identity vector x-vector linear transformation
CN111161713A * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment

Also Published As

Publication number Publication date
CN107274905B (en) 2019-09-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210917

Address after: 35th floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen 518057, Guangdong Province

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.