CN107993663A - An Android-based voiceprint recognition method - Google Patents

An Android-based voiceprint recognition method

Info

Publication number
CN107993663A
Authority
CN
China
Prior art keywords
model
speaker
voice
training
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710809811.XA
Other languages
Chinese (zh)
Inventor
陈立江
窦文韬
张旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201710809811.XA priority Critical patent/CN107993663A/en
Publication of CN107993663A publication Critical patent/CN107993663A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 - Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 - User authentication
    • G06F21/32 - User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention is an Android-based voiceprint recognition method that runs on the Android operating system. It captures training speakers' audio through the device's built-in recording hardware, applies speech enhancement, and during training builds a kd-tree of vector quantization (VQ) codewords together with a Gaussian mixture model (GMM) for each speaker. At recognition time, the VQ kd-tree is searched for the K enrolled speakers whose voiceprint features are closest to the test speaker's, and the GMMs then perform the precise identification among those K candidates. By using the VQ kd-tree, the invention avoids traversing every model in the GMM library, which increases recognition speed; the two-stage identification (VQ followed by GMM) also increases recognition accuracy. The method offers strong practicality, ease of use, and robustness.

Description

An Android-based voiceprint recognition method
(1) Technical field:
The present invention, an Android-based voiceprint recognition method, belongs to the field of computer technology.
(2) Background technology:
In an era of ubiquitous accounts and passwords, people are often frustrated by forgotten or lost passwords. Voiceprint recognition offers a more convenient and efficient alternative: voiceprint features are "carried with you," so identity authentication can be performed anytime, anywhere. The present invention, built on the Android operating system, extracts voiceprint features and identifies the speaker by constructing a kd-tree (k-dimensional tree) over a vector quantization model together with Gaussian mixture models. The invention improves search and recognition speed and precision, and offers strong practicality, ease of use, and robustness.
(3) Content of the invention:
The Android-based voiceprint recognition method of the present invention extracts Mel-frequency cepstral coefficients (MFCC), first-order difference MFCC, second-order difference MFCC, and frame energy from the speech, and uses the frame energy to screen out invalid features. It then constructs a kd-tree over the vector quantization (VQ) codewords of the training speakers' voiceprint features and builds a library of Gaussian mixture models (GMM) of those features. During testing, the kd-tree is first searched for the K models most similar to the test voiceprint features, and the GMMs then perform the precise identification among those K candidates, thereby determining the speaker's identity.
1. An Android-based voiceprint recognition method, comprising the following steps:
Step 1: Audio data is captured through the AudioRecorder interface using single-channel (mono) recording, a sampling frequency of 22,050 Hz, pulse-code modulation, and 16 quantization bits per sample. At the same time, instances of the AcousticEchoCanceler, NoiseSuppressor, and AutomaticGainControl classes are invoked to perform acoustic echo cancellation, noise suppression, and automatic gain control, achieving speech enhancement. Android's asynchronous message-handling mechanism is used so that a worker thread can update the UI and implement a timing function. As raw audio data is acquired, a WAV-format header is written for it and the file is stored in the corresponding training-speaker voice library; when recording ends, the software pops up a renaming window and the user enters the corresponding training speaker's name as the file name;
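The patent's recorder runs on Android (Java). Purely as an illustration, the WAV wrapping described in Step 1 can be sketched in Python with the standard-library wave module, using the stated parameters (mono, 22,050 Hz, 16-bit PCM); the function name is my own:

```python
import io
import struct
import wave

def write_wav(pcm_bytes: bytes, sample_rate: int = 22050) -> bytes:
    """Wrap raw 16-bit mono PCM samples in a WAV container, as Step 1
    does for each enrolled speaker's recording (parameters from the patent:
    mono, 22,050 Hz, 16 bits per sample)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)         # mono recording
        w.setsampwidth(2)         # 16-bit quantization
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
    return buf.getvalue()

# One hypothetical buffer of silence (1000 zero samples).
pcm = struct.pack("<1000h", *([0] * 1000))
wav = write_wav(pcm)
```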
Step 2: After the training-speaker voice library has been collected, all audio files in the library are preprocessed. Framing is performed first, with a frame length of 16 ms and a frame shift of 8 ms. After framing, endpoint detection is carried out with the double-threshold method. Because the speech enhancement in Step 1 is effective, the low energy threshold is set to 0.1, the high energy threshold to 1, the low zero-crossing-rate threshold to 0.01, and the high zero-crossing-rate threshold to 10. The longest allowed silence within a speech segment is 12 frames; that is, the frame energy and zero-crossing rate of a speech segment may fall below the low energy and low zero-crossing-rate thresholds simultaneously for at most 12 frames. The shortest allowed speech is 10 frames; that is, a candidate speech segment whose frame energy exceeds the low energy threshold or whose zero-crossing rate exceeds the zero-crossing-rate threshold must last at least 10 frames. After endpoint detection, the speech-segment signal is obtained and each frame is multiplied by a Hamming window. After windowing, pre-emphasis is applied to each frame of the speech segment to compensate for the loss of high-frequency components; this method sets the pre-emphasis coefficient to 0.93;
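A minimal pure-Python sketch of the framing, windowing, and pre-emphasis just described (frame and hop sizes follow Step 2's 16 ms / 8 ms at 22,050 Hz, pre-emphasis coefficient 0.93; the double-threshold endpoint detection is omitted for brevity, and all function names are my own):

```python
import math

def preemphasize(frame, alpha=0.93):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1] (alpha = 0.93 per Step 2)."""
    return [frame[0]] + [frame[n] - alpha * frame[n - 1] for n in range(1, len(frame))]

def hamming(n_samples):
    """Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_samples - 1))
            for i in range(n_samples)]

def frame_signal(signal, sample_rate=22050, frame_ms=16, hop_ms=8):
    """Split a signal into overlapping frames: 16 ms frames, 8 ms hop."""
    flen = int(sample_rate * frame_ms / 1000)   # 352 samples at 22,050 Hz
    hop = int(sample_rate * hop_ms / 1000)      # 176 samples
    return [signal[start:start + flen]
            for start in range(0, len(signal) - flen + 1, hop)]

def preprocess(signal):
    """Frame, window, then pre-emphasize, in the order Step 2 describes."""
    frames = frame_signal(signal)
    win = hamming(len(frames[0])) if frames else []
    return [preemphasize([s * w for s, w in zip(f, win)]) for f in frames]
```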
Step 3: After preprocessing, MFCC features are extracted, with the number of triangular band-pass filters set to 40 and the first 12 coefficients retained per frame. After MFCC extraction, the first 12 dimensions of the first-order and second-order difference MFCC are extracted. Once the differential features have been extracted, feature vectors whose corresponding frame energy is below 1 or above 10 are discarded, eliminating recognition errors caused by the speaker's voice being too loud or too soft;
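The frame-energy screening in Step 3, and the use of frame energy as an extra feature dimension, can be sketched as follows (illustrative helper names, not from the patent):

```python
def screen_by_energy(features, energies, low=1.0, high=10.0):
    """Keep only feature vectors whose frame energy lies in [low, high],
    discarding frames where the speaker was too quiet or too loud (Step 3)."""
    return [f for f, e in zip(features, energies) if low <= e <= high]

def append_energy(features, energies):
    """Append each frame's energy as the final feature dimension."""
    return [list(f) + [e] for f, e in zip(features, energies)]
```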
Step 4: After feature extraction, each speaker's vector quantization model is trained, generating a codebook that represents that speaker. Using a balanced kd-tree construction algorithm, a kd-tree is built from the codewords of all training speakers' codebooks, each codeword existing as one node of the tree;
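A sketch of the balanced kd-tree construction in Step 4, splitting at the median of a cycling axis so that each (codeword, speaker) pair becomes one node (a minimal illustration; the class and function names are my own):

```python
class KDNode:
    def __init__(self, point, label, left=None, right=None):
        self.point, self.label = point, label    # codeword and the speaker it came from
        self.left, self.right = left, right

def build_kd_tree(points, depth=0):
    """Build a balanced kd-tree from (codeword, speaker) pairs by sorting on
    the cycling axis and recursing on either side of the median (Step 4)."""
    if not points:
        return None
    k = len(points[0][0])
    axis = depth % k
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    point, label = points[mid]
    return KDNode(point, label,
                  build_kd_tree(points[:mid], depth + 1),
                  build_kd_tree(points[mid + 1:], depth + 1))
```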
Step 5: Meanwhile, in another thread, the program clusters the feature vectors into 16 classes with the k-means algorithm. The k-means procedure is repeated 10 times; the total within-cluster variance is computed after each run, and the run with the smallest total within-cluster variance is kept as the final result. After clustering, the Gaussian model parameters are estimated with the EM algorithm: the post-clustering means, variances, and weight coefficients serve as the initial parameters of the Gaussian mixture model, and each parameter is re-estimated with the EM re-estimation formulas. During re-estimation, the change in the log-likelihood value is computed; when the change falls below the threshold 0.01, convergence is declared and the current means, variances, and weight coefficients are recorded. These three parameters, together with the captured voice file name (i.e., the training speaker's name), are stored in a speaker-model class, and the speaker-model instance is stored in a model dynamic array. When the size of the dynamic array (the number of model-class instances) equals the number of speakers, all voice files have been trained; the program then serializes the dynamic array, converting each model-class instance into a byte sequence, and stores the sequence in the Gaussian-mixture-model database file, completing training;
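Step 5's EM re-estimation with a 0.01 log-likelihood convergence threshold can be sketched in one dimension as follows. This is an illustration only: the patent uses 16 mixtures over multi-dimensional MFCC features with k-means initialization, whereas here the initial means are simply spread over the sorted data:

```python
import math

def gmm_em_1d(data, k=2, tol=0.01, max_iter=200):
    """Minimal 1-D EM for a Gaussian mixture: spread initial means, then
    iterate E/M steps until the log-likelihood changes by less than tol
    (the patent's 0.01 threshold). Returns (weights, means, variances)."""
    s = sorted(data)
    means = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = None
    for _ in range(max_iter):
        resp, ll = [], 0.0
        for x in data:   # E-step: posterior responsibility of each component
            probs = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(probs)
            ll += math.log(total)
            resp.append([p / total for p in probs])
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break        # converged: log-likelihood change below threshold
        prev_ll = ll
        for j in range(k):   # M-step: re-estimate weight, mean, variance
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, data)) / nj, 1e-6)
    return weights, means, variances
```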
Step 6: Once the GMM database file has been created, testing can begin. After the test speaker's audio has been recorded, the test speaker's voiceprint feature vector set is extracted according to Steps 2 and 3. One feature vector is chosen from the set; the M codewords nearest to it in Euclidean distance are found in the kd-tree generated in Step 4, and the codebooks those M codewords belong to are looked up. This is repeated until the whole test feature-vector set has been traversed. The K most frequently found codebooks are then selected, where K < M, and GMM identification is performed for the K training speakers corresponding to those K codebooks: among their Gaussian mixture models, the program finds the model with the maximum posterior probability given the test features, and the speaker corresponding to that model is judged to be the test speaker.
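The shortlisting in Step 6 (nearest codewords per test vector, then voting for the K most-hit codebooks) can be sketched as follows; for brevity a linear scan stands in for the kd-tree search, and the names are my own:

```python
from collections import Counter

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def shortlist_speakers(test_vectors, codewords, m=3, k=2):
    """Step 6 shortlist: for every test vector, find its m nearest codewords
    (here by linear scan; the patent searches a kd-tree), count how often each
    speaker's codebook is hit, and keep the k most-hit speakers for GMM scoring."""
    hits = Counter()
    for v in test_vectors:
        nearest = sorted(codewords, key=lambda cw: euclid(v, cw[0]))[:m]
        for _, speaker in nearest:
            hits[speaker] += 1
    return [s for s, _ in hits.most_common(k)]
```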
2. A voiceprint recognition system corresponding to the method described in 1, comprising the following modules:
Voice acquisition module: the hardware part comprises a microphone and a sound card; the microphone collects the voice and the sound card digitizes it. The software part comprises audio data collection, acoustic echo cancellation, noise suppression, automatic gain control, and timing;
Voiceprint feature extraction module: comprises two submodules, voice preprocessing and feature extraction. Its input is the audio data recorded by the voice acquisition module; in the training stage the extracted features are sent to the model training module, and in the testing stage they are sent to the model identification module;
Model training module: trains the vector quantization model of the voice, builds the kd-tree of codewords, and trains the Gaussian mixture model of the voice;
Model update module: inserts a newly added training speaker's vector quantization model into the kd-tree and adds that speaker's Gaussian mixture model to the GMM database file;
Model identification module: its input is the output of the voiceprint feature extraction module, and its output is the test speaker's name and personal information;
Database management module: manages the training speakers' names and personal information, model parameters, and voice files. Each model corresponds to one voice file, each voice file is named after its speaker, and each speaker's name is linked to that speaker's personal information in the database.
(4) Description of the drawings:
Fig. 1: model training flowchart;
Fig. 2: model identification flowchart;
Fig. 3: asynchronous message-handling mechanism;
Fig. 4: preprocessing flowchart;
Fig. 5: voiceprint feature extraction flowchart;
Fig. 6: triangular band-pass filters;
Fig. 7: Gaussian mixture model construction flowchart.
(5) Embodiments:
The technical solutions of the invention are further elaborated below with reference to the accompanying drawings.
The software of the present invention mainly comprises the algorithmic implementation of model training and identification, and the client control interface. The algorithm part mainly covers framing and windowing of the audio data, endpoint detection, pre-emphasis, feature extraction, and model training and identification; the training and identification flows are shown in Figs. 1 and 2. The client control interface is mainly used for the user's operation of the software.
Audio capture: mainly used for collecting audio data, timing, and speech enhancement. The audio capture subroutine calls an AudioRecorder class instance to record the audio data; after recording, the program writes a WAV-format header for the data, names the file after the speaker, and stores it in the voice library. To achieve speech enhancement while capturing audio, the program invokes instances of the AcousticEchoCanceler, NoiseSuppressor, and AutomaticGainControl classes to perform acoustic echo cancellation, noise suppression, and automatic gain control. The program also implements timing through Android's asynchronous message-handling mechanism, whose flow is shown in Fig. 3;
Feature extraction: mainly comprises the preprocessing and feature-extraction algorithms. The preprocessing part is shown in Fig. 4. The audio is first framed so that it exhibits relatively stationary characteristics and can be processed piecewise. Endpoint detection is then performed on the captured audio to locate the speech segments: the zero-crossing rate and frame energy of the current frame are computed; if at least one of the two parameters exceeds its threshold and the condition lasts long enough, the current frame is marked as a start point, otherwise the next frame is examined; if both parameters fall below their thresholds and the condition lasts long enough, the frame is marked as an end point, otherwise the next frame is examined. After the start and end frames of a speech segment are determined, each frame is multiplied by a Hamming window to ensure continuity between frames, and pre-emphasis is applied to boost the high-frequency components. The feature extraction part is shown in Fig. 5. After preprocessing, the program extracts and screens the features. Once all frames have been processed, the frames whose energy lies within the normal range are retained; an FFT is applied to each remaining frame, and the spectrum is squared to obtain the energy spectrum. Meanwhile, the program designs the triangular band-pass filters. Experimental observation shows that the human ear behaves like a bank of filters that selectively attends to certain frequencies; these filters are not uniformly distributed on the frequency axis but are dense in the low-frequency region and sparse in the high-frequency region, as shown in Fig. 6. The design procedure of the triangular band-pass filter bank is:
(1) determine the lowest and highest frequencies of the speech signal and the number of Mel filters;
(2) compute the Mel frequencies corresponding to the lowest and highest frequencies;
(3) since the filter center frequencies are evenly spaced on the Mel frequency axis, compute the distance between the center frequencies of adjacent Mel filters;
(4) convert each Mel center frequency back to an actual frequency and compute the FFT bin index corresponding to each frequency point.
(5) compute the amplitude values of the triangular band-pass filter bank according to the formula:

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\[4pt] \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\[4pt] \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)\le k\le f(m+1)\\[4pt] 0, & k>f(m+1)\end{cases}$$

where $f(m)$ denotes the FFT bin index corresponding to the m-th center frequency point and $H_m$ denotes the m-th filter. Let $o(m)$, $c(m)$, and $h(m)$ be the lower-limit, center, and upper-limit frequencies of the m-th triangular filter, respectively; adjacent filters then satisfy:

$$c(m)=h(m-1)=o(m+1)$$
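Steps (1) through (4) of the filter-bank design can be sketched as follows, using the usual Mel conversion formulas (a minimal illustration; function names are my own):

```python
import math

def hz_to_mel(f):
    """Convert Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_points(f_low, f_high, n_filters, n_fft, sample_rate):
    """Place n_filters + 2 points evenly on the Mel axis between f_low and
    f_high, convert back to Hz, and map to FFT bin indices. Adjacent filters
    then share boundary points: c(m) = h(m-1) = o(m+1)."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    mels = [m_low + i * (m_high - m_low) / (n_filters + 1)
            for i in range(n_filters + 2)]
    return [int(round((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mels]
```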
After the filter bank has been designed, the energy spectrum of each frame is filtered with the filter bank; the logarithm of the result is taken and a discrete cosine transform is applied, and the output is the Mel-frequency cepstral coefficients. The first-order and second-order difference MFCC are then computed from the MFCC according to:

$$d_t=\frac{\sum_{n=1}^{N} n\,(c_{t+n}-c_{t-n})}{2\sum_{n=1}^{N} n^2}$$

where $d_t$ denotes the t-th difference coefficient and $N$ is generally taken as 2. When computing the first-order difference, $c$ is the MFCC; when computing the second-order difference, $c$ is the first-order difference MFCC, and each higher-order difference is likewise computed from the order below it. This method extracts the first- and second-order differences; since the parameter amplitudes beyond the 12th dimension are essentially zero, only the first 12 dimensions are needed. Finally, the program appends the frame energy as the last dimension of each frame's feature vector, as an additional voiceprint feature, and discards feature vectors whose frame energy lies outside the range 1 to 10.
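A sketch of the difference computation for one cepstral coefficient tracked across frames, using the standard regression form with N = 2 and repeated-edge padding (an illustrative implementation, not the patent's code):

```python
def delta(coeffs, n=2):
    """Delta (difference) coefficients:
    d_t = sum_{i=1..n} i * (c_{t+i} - c_{t-i}) / (2 * sum_{i=1..n} i^2),
    with edge frames padded by repetition. Applying it twice yields
    the second-order difference."""
    denom = 2 * sum(i * i for i in range(1, n + 1))
    padded = [coeffs[0]] * n + list(coeffs) + [coeffs[-1]] * n
    return [sum(i * (padded[t + i] - padded[t - i]) for i in range(1, n + 1)) / denom
            for t in range(n, n + len(coeffs))]
```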
Construction of the vector quantization kd-tree: after the training speakers' speech feature vector sets are obtained, each speaker's vector quantization model is trained, and the codewords of all codebooks obtained from training are assembled into a balanced kd-tree following the balanced-kd-tree construction procedure; every codeword is represented by one node of the tree.
Training of the Gaussian mixture model: the algorithm flow is shown in Fig. 7. The program classifies the captured training speaker's voiceprint feature vectors with the k-means clustering algorithm; the mean and variance of each class serve as the initial mean and variance of the Gaussian mixture model, the ratio of the number of vectors in each class to the total number of feature vectors serves as the initial weight, and the initial number of classes serves as the degree of mixing of the GMM. The EM algorithm then iterates over the model's means, variances, and weights until convergence.
Search of the vector quantization kd-tree: after the test speaker's feature vector set is obtained, one feature vector is chosen from it, and the M codewords most similar to that vector are found with the kd-tree's nearest-neighbor search algorithm; the codebooks they belong to are looked up. All feature vectors in the set are traversed and the above steps repeated, after which the K most frequently found codebooks are selected, where K < M. The training speakers corresponding to those K codebooks are then passed to the Gaussian mixture models for precise identification.
Gaussian mixture model identification: after the K training speakers closest to the test speaker have been found, their Gaussian mixture models are retrieved from the GMM library, and among the K models the parameter set $\lambda_i$ is found that maximizes the posterior probability $P(\lambda_i \mid X)$ of the test speaker's feature vector set $X$; the training speaker corresponding to $\lambda_i$ is determined to be the test speaker.
According to Bayes' formula:

$$P(\lambda_i \mid X)=\frac{P(X \mid \lambda_i)\,P(\lambda_i)}{P(X)}$$

Since $P(\lambda_i)$ is the prior probability that the i-th of the $N$ models in the library is selected,

$$P(\lambda_i)=\frac{1}{N}$$

and $P(X)$, the probability of the test speaker's features, is a fixed constant that is the same for all candidates. Therefore maximizing the posterior probability is equivalent to maximizing $P(X \mid \lambda_i)$. To reduce computational complexity, the logarithm is usually taken, as follows:

$$\log P(X \mid \lambda_i)=\sum_{t=1}^{T}\log\, p(x_t \mid \lambda_i)$$
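The log-domain comparison can be sketched as follows, with diagonal-covariance mixtures and equal priors (an illustrative model format and naming, not the patent's code):

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_likelihood(X, gmm):
    """log P(X | lambda) = sum_t log sum_j w_j N(x_t; mu_j, sigma_j^2);
    gmm is a list of (weight, mean_vector, variance_vector) components."""
    return sum(math.log(sum(w * math.exp(log_gauss(x, m, v)) for w, m, v in gmm))
               for x in X)

def identify(X, models):
    """With equal priors P(lambda_i) = 1/N, the maximum-posterior speaker is
    simply the model maximizing log P(X | lambda_i)."""
    return max(models, key=lambda name: log_likelihood(X, models[name]))
```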

Claims (2)

1. An Android-based voiceprint recognition method, comprising the following steps:
Step 1: Audio data is captured through the AudioRecorder interface using single-channel (mono) recording, a sampling frequency of 22,050 Hz, pulse-code modulation, and 16 quantization bits per sample. At the same time, instances of the AcousticEchoCanceler, NoiseSuppressor, and AutomaticGainControl classes are invoked to perform acoustic echo cancellation, noise suppression, and automatic gain control, achieving speech enhancement. Android's asynchronous message-handling mechanism is used so that a worker thread can update the UI and implement a timing function. As raw audio data is acquired, a WAV-format header is written for it and the file is stored in the corresponding training-speaker voice library; when recording ends, the software pops up a renaming window and the user enters the corresponding training speaker's name as the file name;
Step 2: After the training-speaker voice library has been collected, all audio files in the library are preprocessed. Framing is performed first, with a frame length of 16 ms and a frame shift of 8 ms. After framing, endpoint detection is carried out with the double-threshold method. Because the speech enhancement in Step 1 is effective, the low energy threshold is set to 0.1, the high energy threshold to 1, the low zero-crossing-rate threshold to 0.01, and the high zero-crossing-rate threshold to 10. The longest allowed silence within a speech segment is 12 frames; that is, the frame energy and zero-crossing rate of a speech segment may fall below the low energy and low zero-crossing-rate thresholds simultaneously for at most 12 frames. The shortest allowed speech is 10 frames; that is, a candidate speech segment whose frame energy exceeds the low energy threshold or whose zero-crossing rate exceeds the zero-crossing-rate threshold must last at least 10 frames. After endpoint detection, the speech-segment signal is obtained and each frame is multiplied by a Hamming window. After windowing, pre-emphasis is applied to each frame of the speech segment to compensate for the loss of high-frequency components; the pre-emphasis coefficient is set to 0.93;
Step 3: After preprocessing, MFCC features are extracted, with the number of triangular band-pass filters set to 40 and the first 12 coefficients retained per frame. After MFCC extraction, the first 12 dimensions of the first-order and second-order difference MFCC are extracted. Once the differential features have been extracted, feature vectors whose corresponding frame energy is below 1 or above 10 are discarded, eliminating recognition errors caused by the speaker's voice being too loud or too soft;
Step 4: After feature extraction, each speaker's vector quantization model is trained, generating a codebook that represents that speaker. Using a balanced kd-tree construction algorithm, a kd-tree is built from the codewords of all training speakers' codebooks, each codeword existing as one node of the tree;
Step 5: Meanwhile, in another thread, the program clusters the feature vectors into 16 classes with the k-means algorithm. The k-means procedure is repeated 10 times; the total within-cluster variance is computed after each run, and the run with the smallest total within-cluster variance is kept as the final result. After clustering, the Gaussian model parameters are estimated with the EM algorithm: the post-clustering means, variances, and weight coefficients serve as the initial parameters of the Gaussian mixture model, and each parameter is re-estimated with the EM re-estimation formulas. During re-estimation, the change in the log-likelihood value is computed; when the change falls below the threshold 0.01, convergence is declared and the current means, variances, and weight coefficients are recorded. These three parameters, together with the captured voice file name (i.e., the training speaker's name), are stored in a speaker-model class, and the speaker-model instance is stored in a model dynamic array. When the size of the dynamic array (the number of model-class instances) equals the number of speakers, all voice files have been trained; the program then serializes the dynamic array, converting each model-class instance into a byte sequence, and stores the sequence in the Gaussian-mixture-model database file, completing training;
Step 6: Once the GMM database file has been created, testing can begin. After the test speaker's audio has been recorded, the test speaker's voiceprint feature vector set is extracted according to Steps 2 and 3. One feature vector is chosen from the set; the M codewords nearest to it in Euclidean distance are found in the kd-tree generated in Step 4, and the codebooks those M codewords belong to are looked up. This is repeated until the whole test feature-vector set has been traversed. The K most frequently found codebooks are then selected, where K < M, and GMM identification is performed for the K training speakers corresponding to those K codebooks: among their Gaussian mixture models, the program finds the model with the maximum posterior probability given the test features, and the speaker corresponding to that model is judged to be the test speaker.
2. A voiceprint recognition system corresponding to the method of claim 1, comprising the following modules:
Voice acquisition module: the hardware part comprises a microphone and a sound card; the microphone collects the voice and the sound card digitizes it. The software part comprises audio data collection, acoustic echo cancellation, noise suppression, automatic gain control, and timing;
Voiceprint feature extraction module: comprises two submodules, voice preprocessing and feature extraction. Its input is the audio data recorded by the voice acquisition module; in the training stage the extracted features are sent to the model training module, and in the testing stage they are sent to the model identification module;
Model training module: trains the vector quantization model of the voice, builds the kd-tree of codewords, and trains the Gaussian mixture model of the voice;
Model update module: inserts a newly added training speaker's vector quantization model into the kd-tree and adds that speaker's Gaussian mixture model to the GMM database file;
Model identification module: its input is the output of the voiceprint feature extraction module, and its output is the test speaker's name and personal information;
Database management module: manages the training speakers' names and personal information, model parameters, and voice files. Each model corresponds to one voice file, each voice file is named after its speaker, and each speaker's name is linked to that speaker's personal information in the database.
CN201710809811.XA 2017-09-11 2017-09-11 An Android-based voiceprint recognition method Pending CN107993663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710809811.XA CN107993663A (en) 2017-09-11 2017-09-11 An Android-based voiceprint recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710809811.XA CN107993663A (en) 2017-09-11 2017-09-11 An Android-based voiceprint recognition method

Publications (1)

Publication Number Publication Date
CN107993663A true CN107993663A (en) 2018-05-04

Family

ID=62028944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710809811.XA Pending CN107993663A (en) An Android-based voiceprint recognition method

Country Status (1)

Country Link
CN (1) CN107993663A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004088632A2 (en) * 2003-03-26 2004-10-14 Honda Motor Co., Ltd. Speaker recognition using local models
CN102509547A (en) * 2011-12-29 2012-06-20 辽宁工业大学 Method and system for voiceprint recognition based on vector quantization based
CN104464738A (en) * 2014-10-31 2015-03-25 北京航空航天大学 Vocal print recognition method oriented to smart mobile device
CN104573652A (en) * 2015-01-04 2015-04-29 华为技术有限公司 Method, device and terminal for determining identity identification of human face in human face image
CN106682650A (en) * 2017-01-26 2017-05-17 北京中科神探科技有限公司 Mobile terminal face recognition method and system based on technology of embedded deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MARIO VARGA ET AL: "Performance Evaluation of GMM and KD-KNN Algorithms Implemented in Speaker Identification Web-Application Based on Java EE", 56th International Symposium ELMAR-2014 *
ZHANG ZHIBING: "Spatial Data Mining and Related Issues", 31 October 2011 *
ZENG XIANGYANG: "Intelligent Underwater Target Recognition", 31 March 2016 *
ZHAO LI: "Speech Signal Processing", 31 May 2009 *
LU XIAOQIAN: "Research on Real-Time Voiceprint Recognition Based on VQ and GMM", Computer Systems & Applications *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766435A (en) * 2018-05-17 2018-11-06 东莞市华睿电子科技有限公司 A kind of robot for space control method based on non-touch
CN108922543A (en) * 2018-06-11 2018-11-30 平安科技(深圳)有限公司 Model library method for building up, audio recognition method, device, equipment and medium
CN108922543B (en) * 2018-06-11 2022-08-16 平安科技(深圳)有限公司 Model base establishing method, voice recognition method, device, equipment and medium
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN113168837A (en) * 2018-11-22 2021-07-23 三星电子株式会社 Method and apparatus for processing human voice data of voice
CN109243465A (en) * 2018-12-06 2019-01-18 平安科技(深圳)有限公司 Voiceprint authentication method, device, computer equipment and storage medium
CN109801635A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of vocal print feature extracting method and device based on attention mechanism
CN110556114A (en) * 2019-07-26 2019-12-10 国家计算机网络与信息安全管理中心 Speaker identification method and device based on attention mechanism
CN110992930A (en) * 2019-12-06 2020-04-10 广州国音智能科技有限公司 Voiceprint feature extraction method and device, terminal and readable storage medium
CN111243601A (en) * 2019-12-31 2020-06-05 北京捷通华声科技股份有限公司 Voiceprint clustering method and device, electronic equipment and computer-readable storage medium
CN111640419A (en) * 2020-05-26 2020-09-08 合肥讯飞数码科技有限公司 Language identification method, system, electronic equipment and storage medium
CN111640419B (en) * 2020-05-26 2023-04-07 合肥讯飞数码科技有限公司 Language identification method, system, electronic equipment and storage medium
WO2021174883A1 (en) * 2020-09-22 2021-09-10 平安科技(深圳)有限公司 Voiceprint identity-verification model training method, apparatus, medium, and electronic device
CN112201275A (en) * 2020-10-09 2021-01-08 深圳前海微众银行股份有限公司 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN112201275B (en) * 2020-10-09 2024-05-07 深圳前海微众银行股份有限公司 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN112951245A (en) * 2021-03-09 2021-06-11 江苏开放大学(江苏城市职业学院) Dynamic voiceprint feature extraction method integrated with static component
CN114530163A (en) * 2021-12-31 2022-05-24 安徽云磬科技产业发展有限公司 Method and system for recognizing life cycle of equipment by adopting voice based on density clustering

Similar Documents

Publication Publication Date Title
CN107993663A (en) A kind of method for recognizing sound-groove based on Android
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization based
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
CN107610707A (en) A kind of method for recognizing sound-groove and device
CN110428842A (en) Speech model training method, device, equipment and computer readable storage medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN101923855A (en) Test-irrelevant voice print identifying system
CN105096955B (en) A kind of speaker&#39;s method for quickly identifying and system based on model growth cluster
CN102324232A (en) Method for recognizing sound-groove and system based on gauss hybrid models
CN102968990B (en) Speaker identifying method and system
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN109584884A (en) A kind of speech identity feature extractor, classifier training method and relevant device
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
CN103794207A (en) Dual-mode voice identity recognition method
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
CN113221673B (en) Speaker authentication method and system based on multi-scale feature aggregation
CN104887263A (en) Identity recognition algorithm based on heart sound multi-dimension feature extraction and system thereof
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN113488060B (en) Voiceprint recognition method and system based on variation information bottleneck
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN109961794A (en) A kind of layering method for distinguishing speek person of model-based clustering
CN102496366B (en) Speaker identification method irrelevant with text
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180504