WO2020181824A1 - Voiceprint recognition method, apparatus, device, and computer-readable storage medium - Google Patents

Voiceprint recognition method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2020181824A1
WO2020181824A1 PCT/CN2019/118656 CN2019118656W WO2020181824A1 WO 2020181824 A1 WO2020181824 A1 WO 2020181824A1 CN 2019118656 W CN2019118656 W CN 2019118656W WO 2020181824 A1 WO2020181824 A1 WO 2020181824A1
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
feature
voice
fusion
voiceprint feature
Prior art date
Application number
PCT/CN2019/118656
Other languages
English (en)
French (fr)
Inventor
徐凌智
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020181824A1 publication Critical patent/WO2020181824A1/zh

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 — Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 — Speech or voice analysis techniques using neural networks
    • G10L25/45 — Speech or voice analysis techniques characterised by the type of analysis window

Definitions

  • This application relates to the technical field of voiceprint recognition, in particular to voiceprint recognition methods, equipment, devices, and computer-readable storage media.
  • the voiceprint recognition system is a system that automatically recognizes the speaker's identity based on the characteristics of the human voice.
  • Voiceprint recognition technology is a type of biometric verification technology, that is, the speaker's identity is verified through voice. The technology is convenient, stable, measurable, and secure, and it is commonly used in banking, social security, public security, smart home, mobile payment, and other fields.
  • The current voiceprint recognition system is generally based on the Gaussian mixture model-universal background model (GMM-UBM) proposed in the 1990s, which is simple and flexible and has good robustness.
  • However, in recent years, with the development of technology, breakthroughs have been made in the training of neural networks; neural-network-based voiceprint verification systems have been applied in practice, and the performance of neural-network-based models on some data sets is higher than that of a single Gaussian mixture model-universal background model (GMM-UBM).
  • the main purpose of this application is to provide a voiceprint recognition method, equipment, device, and computer-readable storage medium, aiming to solve the technical problem of low voice recognition accuracy in the prior art.
  • a voiceprint recognition method includes: obtaining a verification voice to be recognized; extracting a first voiceprint feature of the verification voice using a GMM-UBM model, and extracting a second voiceprint feature of the verification voice using a neural network model; performing feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice; calculating the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database; and determining the voiceprint recognition result of the verification voice based on the similarity.
  • this application also provides a voiceprint recognition device, including:
  • the data acquisition module is set to acquire the verification voice to be recognized
  • a data processing module configured to extract the first voiceprint feature of the verification voice using a GMM-UBM model, and use a neural network model to extract the second voiceprint feature of the verification voice;
  • a data fusion module configured to perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
  • a data comparison module configured to calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database
  • the data judgment module is configured to judge the voiceprint recognition result of the verification voice based on the similarity.
  • the present application also provides a voiceprint recognition device, the voiceprint recognition device including a processor, a memory, and a voiceprint recognition program stored on the memory and executable by the processor, The steps of the voiceprint recognition method described above are realized when the voiceprint recognition program is executed by the processor.
  • the present application also provides a computer-readable storage medium having a voiceprint recognition program stored thereon; when the voiceprint recognition program is executed by a processor, the steps of the voiceprint recognition method described above are realized.
  • This application extracts the first voiceprint feature of the verification voice from the verification voice using the GMM-UBM model and the second voiceprint feature using the neural network model; fuses the first voiceprint feature and the second voiceprint feature of the verification voice to obtain the fusion voiceprint feature vector of the verification voice; calculates the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database; and determines the voiceprint recognition result of the verification voice based on the similarity.
  • In this way, the GMM-UBM model and the neural network model are combined: each of the two models extracts features from the verification voice, and both sets of features are used for verification.
  • Compared with extracting features and verifying speech with a single model, the information contained in the features extracted by the two models is more comprehensive, so the verification voice can be fully verified against the registered voice and the accuracy of voiceprint recognition is improved.
  • FIG. 1 is a schematic diagram of the hardware structure of a voiceprint recognition device involved in a solution of an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of an embodiment of the voiceprint recognition method of the present invention.
  • FIG. 3 is a schematic flowchart of another embodiment of the voiceprint recognition method of the present invention.
  • FIG. 4 is a detailed flowchart of an embodiment of step S20 in FIG. 2;
  • FIG. 5 is a schematic diagram of a detailed process of another embodiment of step S20 in FIG. 2;
  • FIG. 6 is a schematic flowchart of an embodiment of step S30 in FIG. 2;
  • FIG. 7 is a schematic diagram of functional modules of an embodiment of the voiceprint device of the present invention.
  • FIG. 1 is a schematic diagram of the hardware structure of the voiceprint recognition device involved in the solution of the embodiment of the present invention.
  • the voiceprint recognition device may include a processor 1001 (for example, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to realize the connection and communication between these components;
  • the user interface 1003 may include a display (Display), an input unit such as a keyboard (Keyboard);
  • the network interface 1004 may optionally include a standard wired interface, a wireless interface (Such as WI-FI interface);
  • the memory 1005 can be a high-speed RAM memory or a non-volatile memory, such as a disk memory.
  • the memory 1005 can optionally be a storage device independent of the aforementioned processor 1001 .
  • The hardware structure shown in FIG. 1 does not constitute a limitation on the voiceprint recognition device, which may include more or fewer components than shown in the figure, combine certain components, or use a different component arrangement.
  • the memory 1005 as a computer-readable storage medium in FIG. 1 may include an operating system, a network communication module, and a voiceprint recognition program.
  • the network communication module is mainly used to connect to the server and perform data communication with the server; and the processor 1001 can call the voiceprint recognition program stored in the memory 1005 and execute the voiceprint recognition method provided by the embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of an embodiment of a voiceprint recognition method according to the present application.
  • the voiceprint recognition method includes the following steps:
  • Step S10 Obtain the verification voice to be recognized
  • the verification voice is the voice uttered by the user after the voice registration has been performed. If the user has not performed the voice registration, the voice uttered by the user is invalid.
  • There are many ways to obtain the verification voice. For example, the voice of a user who has completed voice registration is captured by a microphone, and the microphone sends the captured voice to the processing terminal for voiceprint recognition; or the voice is captured by a smart terminal (mobile phone, tablet, etc.), and the smart terminal sends the obtained verification voice to the processing terminal of the voiceprint recognition device. Of course, the verification voice can also be obtained by other devices, which will not be listed here.
  • the verification voice to be recognized can also be screened to eliminate the verification voice to be recognized with poor quality.
  • Specifically, when obtaining the verification voice, the duration of the verification voice to be recognized and its volume can also be detected. If the duration of the verification voice to be recognized is greater than or equal to a preset voice duration, a prompt indicates that the verification voice to be recognized was obtained successfully; if the duration is less than the preset voice duration, a prompt indicates that obtaining the verification voice to be recognized failed.
  • This setting ensures the quality of the obtained verification voice to be recognized, and also ensures that the features extracted from the verification voice to be recognized are relatively obvious and clear, which is beneficial to improve the accuracy of voiceprint recognition.
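  • As an illustration only, the screening rule described above can be sketched as follows; the function name, the 2-second threshold, and the prompt strings are assumptions for illustration and are not values specified by this application.

```python
# Minimal sketch of the duration-based screening rule; the 2-second
# threshold and all names are illustrative assumptions.
def screen_verification_voice(samples, sample_rate, min_duration_s=2.0):
    """Return True if the captured verification voice is long enough to keep."""
    duration_s = len(samples) / float(sample_rate)
    if duration_s >= min_duration_s:
        print("Verification voice obtained successfully.")
        return True
    print("Failed to obtain the verification voice: recording too short.")
    return False
```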
  • Step S20 Use a GMM-UBM model to extract the first voiceprint feature of the verification voice, and use a neural network model to extract the second voiceprint feature of the verification voice;
  • The GMM-UBM model (Gaussian mixture model-universal background model) and the neural network model both extract features from the verification voice. Since they are two different models, when extracting voiceprint features from the verification voice they may extract the same voiceprint features, different voiceprint features, or partially overlapping voiceprint features; this is not specifically limited here.
  • Preferably, the GMM-UBM model and the neural network model extract different voiceprint features from the verification voice. For example, the first voiceprint feature extracted from the verification voice by the GMM-UBM model includes multiple sub-features such as timbre, frequency, amplitude, and volume, while the second voiceprint feature extracted by the neural network model includes multiple sub-features such as fundamental frequency, Mel-frequency cepstral coefficients, formants, pitch, and reflection coefficients.
  • The GMM-UBM model and the neural network model may extract voiceprint features from the same sound segment of the verification voice, from different sound segments, or from partially overlapping sound segments; this is not specifically limited here.
  • Step S30 Perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
  • In this embodiment, the fusion voiceprint feature vector of the verification voice is obtained by fusing the first voiceprint feature and the second voiceprint feature of the verification voice. There are many ways to fuse the two features: for example, the first voiceprint feature and the second voiceprint feature can be superimposed on each other to form the fusion voiceprint feature vector of the verification voice, or some of their sub-features can be superimposed to form it.
  • Of course, the first voiceprint feature and the second voiceprint feature of the verification voice can also be fused in other ways, which will not be listed here.
  • Step S40 Calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database;
  • In this embodiment, the voiceprint feature vector of a registered user is established by the voiceprint recognition device when the user performs voice registration. Each user corresponds to one registered-user voiceprint feature vector, every registered-user voiceprint feature vector is stored in the data storage module of the voiceprint recognition device, and the voiceprint feature vectors of the multiple registered users form the preset registered voiceprint database.
  • There are many ways to calculate the similarity between the fusion voiceprint feature vector of the verification voice and a registered user's voiceprint feature vector, for example the Pearson correlation coefficient, Euclidean distance, cosine similarity, or Manhattan distance; they are not listed one by one here.
  • It is worth noting that the preset registered voiceprint database generally stores a large number of registered users' voiceprint feature vectors, and during voiceprint recognition the fusion voiceprint feature vector of the verification voice needs to be compared with the voiceprint feature vector of each registered user in the database, which requires the voiceprint recognition device to perform a large amount of computation.
  • In view of this, the voiceprint feature vectors of the registered users in the preset registered voiceprint database can be associated with one another. Specifically, the similarity between the voiceprint feature vectors of any two registered users in the database can be calculated in advance, associating the registered users' voiceprint feature vectors with one another. Then, when computing the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of a certain registered user, the voiceprint feature vectors of other registered users whose similarity to that registered user's vector is low can be screened out, which reduces the amount of computation required by the voiceprint recognition device.
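  • A minimal sketch of steps S40/S50 under the cosine-similarity choice mentioned above is given below; the threshold value, variable names, and dictionary layout of the registered voiceprint database are illustrative assumptions, not values fixed by this application.

```python
import numpy as np

# Illustrative sketch: score the fusion voiceprint feature vector of the
# verification voice against each registered user's enrolled vector and
# apply a preset threshold. Cosine similarity is one of the measures
# listed above; 0.8 is an assumed threshold.
def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def recognize(fusion_vector, registered_db, threshold=0.8):
    """registered_db: dict mapping user id -> enrolled fusion voiceprint vector."""
    scores = {uid: cosine_similarity(fusion_vector, vec)
              for uid, vec in registered_db.items()}
    best_uid, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Succeed only if the best-matching registered user exceeds the preset threshold.
    if best_score >= threshold:
        return best_uid, best_score      # recognition succeeded
    return None, best_score              # recognition failed
```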
  • Step S50 Determine the voiceprint recognition result of the verification voice based on the similarity.
  • In this embodiment, the voiceprint recognition result of the verification voice is determined based on the relationship between the similarity and a preset threshold: when the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of a registered user is equal to or greater than the preset threshold, voiceprint recognition is determined to be successful; when the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of every registered user is less than the preset threshold, voiceprint recognition is determined to have failed.
  • It should be noted that if the voiceprint feature vectors of multiple registered users all have a similarity to the fusion voiceprint feature vector of the verification voice exceeding the preset threshold, the registered user whose voiceprint feature vector has the highest similarity to the fusion voiceprint feature vector of the verification voice is determined to match it.
  • This application extracts the first voiceprint feature of the verification voice from the verification voice using the GMM-UBM model and the second voiceprint feature using the neural network model; fuses the first voiceprint feature and the second voiceprint feature of the verification voice to obtain the fusion voiceprint feature vector of the verification voice; calculates the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset voiceprint database; and determines the voiceprint recognition result of the verification voice based on the similarity.
  • In this way, the GMM-UBM model and the neural network model are combined: each of the two models extracts features from the verification voice, and both sets of features are used for verification.
  • Compared with extracting features and verifying speech with a single model, the information contained in the features extracted by the two models is more comprehensive, so the verification voice can be fully verified against the registered voice and the accuracy of voiceprint recognition is improved.
  • Referring to FIG. 3, in this embodiment, the following steps are further included before step S10:
  • Step S100 Obtain the registered voice of the registered user
  • the registered voice is the voice uttered by the user who needs to register, and the method of obtaining the registered voice is the same as that of the verification voice in step S10.
  • the voiceprint recognition system will use the registered voice when the user registers as the user's verification standard.
  • the quality of the registered voice directly affects the accuracy of voiceprint recognition.
  • Step S110 Use a GMM-UBM model to extract the third voiceprint feature of the registered voice, and use a neural network model to extract the fourth voiceprint feature of the registered voice;
  • The sub-features contained in the third voiceprint feature are the same as the sub-features contained in the first voiceprint feature, and the sub-features contained in the fourth voiceprint feature are the same as the sub-features contained in the second voiceprint feature.
  • Step S120 Perform feature fusion of the third voiceprint feature and the fourth voiceprint feature of the registered voice to obtain a fused voiceprint feature vector of the registered voice;
  • Step S130 Save the fused voiceprint feature vector of the registered voice in the registered voiceprint database as the voiceprint feature vector of the registered user.
  • In this embodiment, the data storage module of the voiceprint recognition device is provided with a registered voiceprint database, and the fusion voiceprint feature vector of the registered voice is saved in that database.
  • When storing the fusion voiceprint feature vectors of registered voices, the registered voiceprint database can store them by category. For example, they can be classified and stored according to similarity, that is, the fusion voiceprint feature vectors of multiple registered voices with higher mutual similarity are stored in one subset, and multiple subsets form the registered voiceprint database.
  • Alternatively, storage can be classified by gender, that is, the fusion voiceprint feature vectors of the registered voices of male registered users and of female registered users are stored separately.
  • Of course, the fusion feature vectors of registered voices can also be stored in other ways, which will not be listed here.
  • step S20 includes:
  • Step S210 Perform pre-emphasis, framing and window preprocessing on the verification voice
  • Pre-emphasis boosts the high-frequency part of the voice signal so that its spectrum becomes flatter and the spectrum can be computed with the same signal-to-noise ratio over the whole band from low to high frequencies. Because the voice signal is only short-term stationary, after pre-emphasis it needs to be divided into frames and windowed, which makes it convenient to process with short-term analysis techniques. Under normal circumstances, the number of frames per second is about 33 to 100.
  • Framing can use either contiguous segmentation or overlapping segmentation, but the latter makes the transition between frames smooth and maintains continuity. The overlapping part of the previous frame and the next frame is called the frame shift, and the ratio of frame shift to frame length is generally taken between 0 and 1/2.
  • The voice signal is intercepted (framed) by a movable window of limited length; the commonly used window functions are the rectangular, Hamming, and Hanning windows.
  • After the voice signal is preprocessed, characteristic parameters are extracted. The selection of the characteristic parameters should satisfy several principles: first, they are easy to extract from the speech signal; second, they are not easy to imitate; third, they do not change over time and space, i.e. they are relatively stable; fourth, they can effectively distinguish different speakers.
  • Current speaker verification systems mainly rely on low-level acoustic features of speech for recognition; these features can be divided into time-domain features and transform-domain features.
  • Step S220: Extract the feature parameters of the Mel-frequency cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, the energy, the first-order difference of the energy, and the Gammatone filter cepstral coefficients from the preprocessed verification voice to obtain the first voiceprint feature of the verification voice;
  • The energy spectrum is obtained by squaring the spectrum X(k), and it is then smoothed through the Mel-frequency filter bank to eliminate harmonics, giving the corresponding Mel spectrum. The Mel-frequency filter bank is designed according to the masking effect of sound: a number of triangular band-pass filters H_m(k) (0 ≤ m ≤ M, where M is the number of filters) are placed over the frequency range of the speech, with center frequencies f(m) whose spacing widens as m increases. Taking the logarithm of the filter-bank outputs and applying a discrete cosine transform yields the Mel-frequency cepstral coefficient parameters c(n), where L is the order of the MFCC parameters.
  • For the short-time normalized energy feature, L is the number of frames of the speech segment and E_max = max E_l is the largest logarithmic energy in the speech segment.
  • For the linear prediction cepstral coefficients, A(z) is the inverse filter of the all-pole vocal-tract model; LPC analysis solves for the linear prediction coefficients a_k, and this application adopts the autocorrelation-based recursive solution (the Durbin algorithm).
  • Dynamic characteristic parameters: the specific steps for extracting the first-order difference of the Mel-frequency cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, and the first-order difference of the energy are as follows.
  • The Mel-frequency cepstral coefficients, linear prediction cepstral coefficients, and energy feature parameters introduced above only represent the instantaneous information of the speech spectrum; they are static parameters. Experiments show that the dynamic information of the speech spectrum also contains speaker-related information, which can be used to improve the recognition rate of a speaker recognition system.
  • The dynamic information of the speech cepstrum characterizes how the speech feature parameters change over time. The change of the speech cepstrum over time can be expressed by the first-order coefficient of the orthogonal polynomial, \Delta c_m(n) = \frac{\sum_{k=-K}^{K} k\,h(k)\,c_m(n+k)}{\sum_{k=-K}^{K} k^{2}\,h(k)}, where c_m denotes the m-th order cepstral coefficient, n and k are the indices of the cepstral coefficients on the time axis, and h(k) is a window function of length 2K+1 that is usually symmetric.
  • In practical applications the window function is usually a rectangular window and K is usually taken as 2, so the dynamic parameter is a linear combination of the parameters of the two frames before and the two frames after the current frame. The first-order dynamic parameters of the Mel-frequency cepstral coefficients, the linear prediction cepstral coefficients, and the energy can therefore be obtained from this formula.
  • The Gammatone filter is a standard cochlear auditory filter. The time-domain impulse response of the filter is g(t) = A t^{n-1} e^{-2\pi b_i t} \cos(2\pi f_i t + \varphi_i) U(t), t ≥ 0, 1 ≤ i ≤ N, where A is the filter gain, f_i is the center frequency of the filter, U(t) is the step function, and \varphi_i is the phase; to simplify the model, \varphi_i is set to 0, and n is the order of the filter. N is the number of filters. b_i is the decay factor of the filter, which determines how fast the impulse response decays and is related to the filter bandwidth, b_i = 1.019·ERB(f_i).
  • The center frequencies of the filter bank are equally spaced in the ERB domain, and the frequency coverage of the entire filter bank is 80 Hz-8000 Hz. In the center-frequency formula, f_H is the filter cutoff frequency and v_i is the filter overlap factor, which specifies the percentage of overlap between adjacent filters; after the center frequency of each filter is determined, the corresponding bandwidth can be obtained from the ERB formula.
  • Step (3) is Gammatone filter-bank filtering: the spectrum X(k) obtained in step (1) is squared to obtain the energy spectrum, which is then filtered with the Gammatone filter bank G_m(k). Taking the logarithm gives the logarithmic spectrum S(m), which compresses the dynamic range of the speech spectrum and converts the multiplicative noise component in the frequency domain into an additive component.
  • step S20 includes:
  • Step S210': Arrange the verification voice into a spectrogram with a predetermined number of dimensions;
  • Specifically, a feature vector of a predetermined dimension may be extracted from the verification voice at every predetermined time interval, so as to arrange the verification voice into a spectrogram with the predetermined number of dimensions.
  • The above predetermined number of dimensions, predetermined dimension, and predetermined time interval can be set according to requirements and/or system performance during specific implementation; their values are not limited in this embodiment.
  • Step S220': Recognize the spectrogram with the predetermined number of dimensions through a neural network to obtain the second voiceprint feature of the verification voice.
  • The verification voice is arranged into a spectrogram with a predetermined number of dimensions, and the spectrogram is then recognized by the neural network model to obtain the second voiceprint feature of the verification voice; extracting the second voiceprint feature through the neural network model in this way can better characterize the acoustic features in the speech and improve the accuracy of speech recognition.
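  • The following sketch illustrates steps S210'/S220' under stated assumptions: log-magnitude STFT frames are stacked into the spectrogram and a small, randomly initialized fully connected network maps it to a feature vector. The patent does not specify the feature type, the network structure, or any of the sizes used here; they are illustrative only.

```python
import numpy as np

# Sketch of steps S210'/S220': stack per-interval feature vectors into a
# "spectrogram" matrix, then map it to a second voiceprint feature with a
# neural network. The log-magnitude STFT features, the single hidden layer,
# and all sizes are illustrative assumptions.
def to_spectrogram(samples, frame_len=400, hop=160, n_frames=100):
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len, hop)][:n_frames]
    spec = np.abs(np.fft.rfft(np.array(frames) * np.hanning(frame_len), axis=1))
    return np.log(spec + 1e-8)           # shape: (n_frames, frame_len // 2 + 1)

def nn_embedding(spectrogram, rng=np.random.default_rng(0), dim=64):
    x = spectrogram.flatten()
    w1 = rng.normal(scale=0.01, size=(x.size, 128)); b1 = np.zeros(128)
    w2 = rng.normal(scale=0.01, size=(128, dim));    b2 = np.zeros(dim)
    h = np.maximum(x @ w1 + b1, 0.0)      # ReLU hidden layer
    return h @ w2 + b2                    # second voiceprint feature vector
```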
  • step S30 specifically includes:
  • A Markov chain Monte Carlo random model is used to fuse the first voiceprint feature dimensions and the second voiceprint feature dimensions to obtain the fusion voiceprint feature vector of the verification voice.
  • In this embodiment, the Markov chain Monte Carlo random model randomly obtains multiple features from the first voiceprint feature and multiple features from the second voiceprint feature, and then fuses the features obtained from the first voiceprint feature with those obtained from the second voiceprint feature to obtain the fusion voiceprint feature vector of the verification voice.
  • For example, if the Markov chain Monte Carlo random model randomly extracts 10 features from the 15 features of the first voiceprint feature and 15 features from the 20 features of the second voiceprint feature, a fusion voiceprint feature vector of the verification voice with 25 voiceprint features is obtained after fusion.
  • FIG. 6 is a detailed flowchart of an embodiment of step S30 in FIG. 2.
  • the first voiceprint feature includes a plurality of first voiceprint sub-features
  • the second voiceprint feature includes a plurality of second voiceprint sub-features
  • the foregoing step S30 includes:
  • Step S310: Set the total number of features of the fusion voiceprint feature vector of the verification voice to K;
  • Step S320: According to the total number K of features of the fusion voiceprint feature vector of the verification voice, determine the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features using a direct sampling method;
  • Step S330: According to the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features, use Gibbs sampling of MCMC to simulate sampling from a joint normal distribution, and determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together constitute the fusion voiceprint feature vector of the verification voice.
  • step 320 specifically includes:
  • Step A: Generate a random number between [0,1] as a parameter p, which represents the proportion of the first voiceprint sub-features in the fusion voiceprint features of the verification voice;
  • Step B: Initialize a counter used to record the number of iterations to k = 0;
  • Step C: Generate a random number q between [0,1] and compare it with the parameter p. When q < p, select one second voiceprint sub-feature and increase the number of second voiceprint sub-features by 1; when q > p, select one first voiceprint sub-feature and increase the number of first voiceprint sub-features by 1;
  • Step D: Increase the value of k by 1 and judge whether k ≥ K. If so, count the numbers of first and second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, record them as A and B respectively, and end the sampling process; otherwise, return to step C.
  • For example, suppose the total number of dimensions of the fusion voiceprint feature vector of the verification voice is set to K = 8 and the randomly generated parameter is p = 0.4; after 8 iterations of the above process, the number of first voiceprint sub-features to be selected is A = 3 and the number of second voiceprint sub-features is B = 5, so three first voiceprint sub-features and five second voiceprint sub-features should be selected in the subsequent specific feature-selection process.
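  • A minimal sketch of the direct-sampling procedure of steps A-D is given below, with all names illustrative; following the worked example above, K = 8 may yield A = 3 and B = 5 depending on the random draws.

```python
import random

# Direct-sampling sketch of steps A-D: draw p once, then draw q per slot to
# decide whether the slot goes to a first or second voiceprint sub-feature.
def sample_fusion_ratio(K, rng=random.Random(0)):
    p = rng.random()                # step A: proportion of first sub-features
    A = B = 0                       # step B: counters, k starts at 0
    for _ in range(K):              # steps C/D: repeat until k >= K
        q = rng.random()
        if q < p:
            B += 1                  # q < p: take one second voiceprint sub-feature
        else:
            A += 1                  # q > p: take one first voiceprint sub-feature
        # (the value of k increases by 1 on each pass through the loop)
    return A, B
```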
  • step 330 specifically includes:
  • Step E: Set the transition-count threshold to T and initialize the number of transitions t = 0;
  • Step F: Count the number of features collected for the fusion voiceprint feature vector of the verification voice, record it as M, and generate M random numbers between [0,1] as the initial state x(0) = [x_1(0), x_2(0), ..., x_M(0)];
  • Step G: Each time the number of transitions t increases by 1, for each variable x_i(t), i ∈ {1, 2, ..., M}, compute P(x_i(t+1) | x_1(t+1), x_2(t+1), ..., x_{i-1}(t+1), x_{i+1}(t), ..., x_M(t)) according to the conditional probability distribution obtained from the joint probability distribution, whose mean is X; judge whether t < T, and if so return to step G, otherwise obtain P(T) = [P(x_1(T)), P(x_2(T)), ..., P(x_i(T)), ..., P(x_M(T))];
  • Step H: According to the number A, calculated in step D, of first voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, select the A first voiceprint sub-features with the largest corresponding probabilities P(x_i(T)) as the first voiceprint sub-features of the fusion voiceprint feature vector of the verification voice;
  • Step I: Set the transition-count threshold to T and initialize the number of transitions t = 0;
  • Step J: Count the number of features collected for the fusion voiceprint feature vector of the verification voice, record it as N, and generate N random numbers between [0,1] as the initial state y(0) = [y_1(0), y_2(0), ..., y_N(0)];
  • Step K: Each time the number of transitions t increases by 1, for each variable y_j(t), j ∈ {1, 2, ..., M}, compute P(y_j(t+1) | y_1(t+1), y_2(t+1), ..., y_{j-1}(t+1), y_{j+1}(t), ..., y_N(t)) according to the conditional probability distribution obtained from the joint probability distribution, whose mean is Y; judge whether t < T, and if so execute step K again, otherwise obtain P_y(T) = [P(y_1(T)), P(y_2(T)), ..., P(y_N(T))];
  • Step L: According to the number B, calculated in step D, of second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, select the B second voiceprint sub-features with the largest corresponding probabilities P(y_j(T)) as the second voiceprint sub-features of the fusion voiceprint feature vector of the verification voice.
  • For example, if there are 5 second voiceprint sub-features in total and T = 50, P(x_i(50)) is calculated iteratively; assuming the calculation gives P(x_i(50)) = [0.6, 0.2, 0.5, 0.8, 0.9], the two features with the largest corresponding probabilities are added to the fusion voiceprint feature vector of the verification voice.
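  • The sketch below is a simplified stand-in for steps E-L: the text above describes Gibbs sampling from a joint normal distribution and keeping the sub-features with the largest final conditional probabilities, but does not spell out the covariance structure, so unit-variance conditionals centred on the mean of the other coordinates are assumed here; all names and the scoring rule are illustrative.

```python
import numpy as np

# Simplified, assumption-laden sketch of steps E-L.
def gibbs_scores(n_features, T=50, rng=np.random.default_rng(0)):
    x = rng.random(n_features)                      # step F: random initial state in [0,1]
    for _ in range(T):                              # step G: T transition rounds
        for i in range(n_features):
            others_mean = (x.sum() - x[i]) / max(n_features - 1, 1)
            x[i] = rng.normal(loc=others_mean, scale=1.0)   # assumed conditional draw
    # Use the standard-normal density of the final state as the "probability" score.
    return np.exp(-0.5 * (x - x.mean()) ** 2) / np.sqrt(2 * np.pi)

def select_subfeatures(features, count, T=50):
    """Keep the `count` sub-features with the largest Gibbs scores (steps H / L)."""
    scores = gibbs_scores(len(features), T)
    keep = np.argsort(scores)[::-1][:count]
    return [features[i] for i in sorted(keep)]

# Usage: with A = 3 first sub-features and B = 5 second sub-features (from steps A-D),
# fusion_vector = select_subfeatures(first_feats, 3) + select_subfeatures(second_feats, 5)
```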
  • this application also provides a voiceprint recognition device.
  • FIG. 7 is a functional block diagram of an embodiment of a voiceprint recognition device of the present application.
  • the voiceprint recognition device includes:
  • the data acquisition module 10 is configured to acquire the verification voice to be recognized
  • the data processing module 20 is configured to use the GMM-UBM model to extract the first voiceprint feature of the verified voice, and use the neural network model to extract the second voiceprint feature of the verified voice;
  • the data fusion module 30 is configured to perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
  • the data comparison module 40 is configured to calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset registered voiceprint database;
  • the data judgment module 50 is configured to judge the voiceprint recognition result of the verification voice based on the similarity.
  • Further, the apparatus also includes a module for acquiring the voiceprint feature vector of a registered user, which includes: a registration-voice acquisition unit, configured to obtain the registered voice of the registered user;
  • Extracting the voiceprint feature unit configured to use a GMM-UBM model to extract the third voiceprint feature of the registered voice, and use a neural network model to extract the fourth voiceprint feature of the registered voice;
  • a fusion unit configured to perform feature fusion of the third voiceprint feature and the fourth voiceprint feature of the registered voice to obtain a fused voiceprint feature vector of the registered voice
  • the saving unit is configured to save the fused voiceprint feature vector of the registered voice in the registered voiceprint database as the voiceprint feature vector of the registered user.
  • the data processing module 20 further includes:
  • the first pre-processing unit 201 is configured to perform pre-emphasis, framing, and windowing pre-processing on the verification voice;
  • the first extraction unit 202 is configured to extract the pitch period, linear prediction cepstral coefficient, first-order difference of linear prediction cepstral coefficient, energy, first-order difference of energy, and Gammatone filter from the preprocessed verification voice Feature parameters of cepstral coefficients to obtain the first voiceprint feature of the verification voice;
  • the second preprocessing unit 203 is configured to arrange the verification voice into a spectrogram with a predetermined number of dimensions;
  • the second extraction unit 202 is configured to recognize the spectrogram with the predetermined number of dimensions through a neural network to obtain the second voiceprint feature of the verification voice.
  • the data fusion module 30 includes:
  • the data fusion unit 301 is configured to use a Markov chain Monte Carlo random model to perform the fusion of the first voiceprint feature dimension and the second voiceprint feature dimension to obtain the fused voiceprint feature vector of the verification voice.
  • the data fusion unit 301 includes:
  • the setting subunit 3011 is configured to set the total number of features of the fusion voiceprint feature vector of the verification voice to K;
  • the determining subunit 3012 is configured to determine, according to the total number K of features of the fusion voiceprint feature vector of the verification voice, the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features using a direct sampling method;
  • the fusion subunit 3013 is configured to use Gibbs sampling of MCMC to simulate sampling from a joint normal distribution according to the fusion ratio of the first and second voiceprint sub-features, and to determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together constitute the fusion voiceprint feature vector of the verification voice.
  • determining subunit 3012 is configured to:
  • Step A: Generate a random number between [0,1] as a parameter p, which represents the proportion of the first voiceprint sub-features in the fusion voiceprint features of the verification voice;
  • Step B: Initialize a counter used to record the number of iterations to k = 0;
  • Step C: Generate a random number q between [0,1] and compare it with the parameter p. When q < p, select one second voiceprint sub-feature and increase the number of second voiceprint sub-features by 1; when q > p, select one first voiceprint sub-feature and increase the number of first voiceprint sub-features by 1;
  • Step D: Increase the value of k by 1 and judge whether k ≥ K. If so, count the numbers of first and second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, record them as A and B respectively, and end the sampling process; otherwise, return to step C.
  • fusion subunit 3013 is configured as:
  • Step E: Set the transition-count threshold to T and initialize the number of transitions t = 0;
  • Step F: Count the number of features collected for the fusion voiceprint feature vector of the verification voice, record it as M, and generate M random numbers between [0,1] as the initial state x(0) = [x_1(0), x_2(0), ..., x_M(0)];
  • Step G: Each time the number of transitions t increases by 1, for each variable x_i(t), i ∈ {1, 2, ..., M}, compute P(x_i(t+1) | x_1(t+1), x_2(t+1), ..., x_{i-1}(t+1), x_{i+1}(t), ..., x_M(t)) according to the conditional probability distribution obtained from the joint probability distribution, whose mean is X; judge whether t < T, and if so return to step G, otherwise obtain P(T) = [P(x_1(T)), P(x_2(T)), ..., P(x_i(T)), ..., P(x_M(T))];
  • Step H: According to the number A, calculated in step D, of first voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, select the A first voiceprint sub-features with the largest corresponding probabilities P(x_i(T)) as the first voiceprint sub-features of the fusion voiceprint feature vector of the verification voice;
  • Step I: Set the transition-count threshold to T and initialize the number of transitions t = 0;
  • Step J: Count the number of features collected for the fusion voiceprint feature vector of the verification voice, record it as N, and generate N random numbers between [0,1] as the initial state y(0) = [y_1(0), y_2(0), ..., y_N(0)];
  • Step K: Each time the number of transitions t increases by 1, for each variable y_j(t), j ∈ {1, 2, ..., M}, compute P(y_j(t+1) | y_1(t+1), y_2(t+1), ..., y_{j-1}(t+1), y_{j+1}(t), ..., y_N(t)) according to the conditional probability distribution obtained from the joint probability distribution, whose mean is Y; judge whether t < T, and if so execute step K again, otherwise obtain P_y(T) = [P(y_1(T)), P(y_2(T)), ..., P(y_N(T))];
  • Step L: According to the number B, calculated in step D, of second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, select the B second voiceprint sub-features with the largest corresponding probabilities P(y_j(T)) as the second voiceprint sub-features of the fusion voiceprint feature vector of the verification voice.
  • In addition, an embodiment of the present application also provides a voiceprint recognition device, including a processor, a memory, and a voiceprint recognition program stored in the memory and executable by the processor; when the voiceprint recognition program is executed by the processor, the steps of the voiceprint recognition method of the above embodiments are implemented.
  • The embodiments of the present application also provide a computer-readable storage medium on which a voiceprint recognition program is stored; when the voiceprint recognition program is executed by a processor, the steps of the voiceprint recognition method of the foregoing embodiments are implemented.
  • the storage medium may be a volatile storage medium, and the storage medium may also be a non-volatile storage medium.
  • The methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM) and includes several instructions for making a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voiceprint recognition method, apparatus, device, and computer-readable storage medium. The voiceprint recognition method includes: obtaining a verification voice to be recognized (S10); extracting a first voiceprint feature of the verification voice using a GMM-UBM model, and extracting a second voiceprint feature of the verification voice using a neural network model (S20); performing feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice (S30); calculating the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database (S40); and determining the voiceprint recognition result of the verification voice based on the similarity (S50). The two models each extract features from the verification voice, and both sets of features are used for verification; compared with extracting features and verifying speech with a single model, the information contained in the features extracted by the two models is more comprehensive, so the accuracy of voiceprint recognition is improved.

Description

Voiceprint recognition method, apparatus, device, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 12, 2019, with application number 201910182453.3 and the invention title "Voiceprint recognition method, apparatus, device, and computer-readable storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the technical field of voiceprint recognition, and in particular to a voiceprint recognition method, device, apparatus, and computer-readable storage medium.
Background
A voiceprint recognition system is a system that automatically identifies the speaker based on the characteristics of the human voice. Voiceprint recognition technology is a type of biometric verification technology, that is, the speaker's identity is verified through voice. The technology is convenient, stable, measurable, and secure, and is commonly used in banking, social security, public security, smart home, mobile payment, and other fields.
Current voiceprint recognition systems are generally based on the Gaussian mixture model-universal background model (GMM-UBM) proposed in the 1990s, which is simple and flexible and has good robustness. However, in recent years, with the development of technology, breakthroughs have been made in the training of neural networks; neural-network-based voiceprint verification systems have been applied in practice, and on some data sets the performance of neural-network-based models is higher than that of a single Gaussian mixture model-universal background model (GMM-UBM).
Summary
The main purpose of this application is to provide a voiceprint recognition method, device, apparatus, and computer-readable storage medium, aiming to solve the technical problem of low voice recognition accuracy in the prior art.
To achieve the above purpose, this application provides a voiceprint recognition method, including:
obtaining a verification voice to be recognized;
extracting a first voiceprint feature of the verification voice using a GMM-UBM model, and extracting a second voiceprint feature of the verification voice using a neural network model;
performing feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
calculating the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database;
determining the voiceprint recognition result of the verification voice based on the similarity.
In addition, to achieve the above purpose, this application also provides a voiceprint recognition apparatus, including:
a data acquisition module, configured to obtain a verification voice to be recognized;
a data processing module, configured to extract a first voiceprint feature of the verification voice using a GMM-UBM model, and extract a second voiceprint feature of the verification voice using a neural network model;
a data fusion module, configured to perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
a data comparison module, configured to calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database;
a data judgment module, configured to determine the voiceprint recognition result of the verification voice based on the similarity.
In addition, to achieve the above purpose, this application also provides a voiceprint recognition device, which includes a processor, a memory, and a voiceprint recognition program stored on the memory and executable by the processor; when the voiceprint recognition program is executed by the processor, the steps of the above voiceprint recognition method are implemented.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium on which a voiceprint recognition program is stored; when the voiceprint recognition program is executed by a processor, the steps of the above voiceprint recognition method are implemented.
This application extracts the first voiceprint feature of the verification voice from the verification voice using the GMM-UBM model and the second voiceprint feature using the neural network model; fuses the first voiceprint feature and the second voiceprint feature of the verification voice to obtain the fusion voiceprint feature vector of the verification voice; calculates the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset voiceprint database; and determines the voiceprint recognition result of the verification voice based on the similarity. In this way, the GMM-UBM model and the neural network model are combined: each of the two models extracts features from the verification voice, and both sets of features are used for verification. Compared with extracting features and verifying speech with a single model, the information contained in the features extracted by the two models is more comprehensive, so the verification voice can be fully verified against the registered voice and the accuracy of voiceprint recognition is improved.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the hardware structure of a voiceprint recognition device involved in the solutions of the embodiments of the present invention;
FIG. 2 is a schematic flowchart of an embodiment of the voiceprint recognition method of the present invention;
FIG. 3 is a schematic flowchart of another embodiment of the voiceprint recognition method of the present invention;
FIG. 4 is a detailed flowchart of an embodiment of step S20 in FIG. 2;
FIG. 5 is a detailed flowchart of another embodiment of step S20 in FIG. 2;
FIG. 6 is a schematic flowchart of an embodiment of step S30 in FIG. 2;
FIG. 7 is a schematic diagram of the functional modules of an embodiment of the voiceprint apparatus of the present invention.
The realization of the purpose, functional features, and advantages of this application will be further explained with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
Referring to FIG. 1, FIG. 1 is a schematic diagram of the hardware structure of the voiceprint recognition device involved in the solutions of the embodiments of the present invention. In the embodiments of the present invention, the voiceprint recognition device may include a processor 1001 (for example, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to realize connection and communication between these components; the user interface 1003 may include a display and an input unit such as a keyboard; the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface); the memory 1005 may be a high-speed RAM memory or a non-volatile memory such as a disk memory, and may optionally be a storage device independent of the aforementioned processor 1001.
Those skilled in the art can understand that the hardware structure shown in FIG. 1 does not constitute a limitation on the voiceprint recognition device, which may include more or fewer components than shown, combine certain components, or use a different component arrangement.
Still referring to FIG. 1, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, and a voiceprint recognition program.
In FIG. 1, the network communication module is mainly used to connect to a server and perform data communication with the server, and the processor 1001 can call the voiceprint recognition program stored in the memory 1005 and execute the voiceprint recognition method provided by the embodiments of the present invention.
Based on the above voiceprint recognition device, various embodiments of the voiceprint recognition method of the present invention are proposed.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of the voiceprint recognition method of the present application. In this embodiment, the voiceprint recognition method includes the following steps:
Step S10: Obtain a verification voice to be recognized;
In this embodiment, the verification voice is the voice uttered by a user who has already completed voice registration; if the user has not performed voice registration, the voice uttered by the user is invalid. There are many ways to obtain the verification voice. For example, the voice of a user who has completed voice registration is captured by a microphone, and the microphone sends the captured voice to the processing terminal for voiceprint recognition; or the voice is captured by a smart terminal (mobile phone, tablet, etc.), and the smart terminal sends the obtained verification voice to the processing terminal of the voiceprint recognition device. Of course, the verification voice can also be obtained by other devices, which will not be listed here.
It is worth noting that, when obtaining the verification voice to be recognized, the verification voice can also be screened to eliminate verification voices of poor quality. Specifically, when obtaining the verification voice, the duration and volume of the verification voice to be recognized can also be detected: if the duration of the verification voice to be recognized is greater than or equal to a preset voice duration, a prompt indicates that the verification voice was obtained successfully; if the duration is less than the preset voice duration, a prompt indicates that obtaining the verification voice failed. This ensures the quality of the obtained verification voice, and also ensures that the features extracted from it are relatively distinct and clear, which helps improve the accuracy of voiceprint recognition.
Step S20: Extract a first voiceprint feature of the verification voice using a GMM-UBM model, and extract a second voiceprint feature of the verification voice using a neural network model;
In this embodiment, the GMM-UBM model (Gaussian mixture model-universal background model) and the neural network model both extract features from the verification voice. Since the GMM-UBM model and the neural network model are two different models, when they extract voiceprint features from the verification voice they may extract the same voiceprint features, different voiceprint features, or partially overlapping voiceprint features; this is not specifically limited here. Preferably, the GMM-UBM model and the neural network model extract different voiceprint features from the verification voice; for example, the first voiceprint feature extracted by the GMM-UBM model includes multiple sub-features such as timbre, frequency, amplitude, and volume, and the second voiceprint feature extracted by the neural network model includes multiple sub-features such as fundamental frequency, Mel-frequency cepstral coefficients, formants, pitch, and reflection coefficients.
It should be noted that the GMM-UBM model and the neural network model may extract voiceprint features from the same sound segment of the verification voice, from different sound segments, or from partially overlapping sound segments; this is not specifically limited here.
Step S30: Perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
In this embodiment, the fusion voiceprint feature vector of the verification voice is obtained by fusing the first voiceprint feature and the second voiceprint feature of the verification voice. There are many ways to fuse the two features: for example, the first and second voiceprint features can be superimposed on each other to form the fusion voiceprint feature vector of the verification voice, or some of their sub-features can be superimposed to form it. Of course, the first and second voiceprint features of the verification voice can also be fused in other ways, which will not be listed here.
Step S40: Calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database;
In this embodiment, the voiceprint feature vector of a registered user is established by the voiceprint recognition device when the user performs voice registration; each user corresponds to one registered-user voiceprint feature vector, every registered-user voiceprint feature vector is stored in the data storage module of the voiceprint recognition device, and the voiceprint feature vectors of the multiple registered users form the preset registered voiceprint database.
There are many ways to calculate the similarity between the fusion voiceprint feature vector of the verification voice and a registered user's voiceprint feature vector. For example, the similarity can be calculated as a cosine similarity according to the following formula:
(equation image, not reproduced: cosine-similarity formula)
The cosine similarity between the fusion voiceprint feature vector of the verification voice and the registered user's voiceprint feature vector is computed; the larger the computed value, the smaller the similarity between the fusion voiceprint feature vector and the registered user's voiceprint feature vector, and the smaller the computed value, the larger the similarity.
Of course, the similarity between the fusion voiceprint feature vector of the verification voice and the registered user's voiceprint feature vector can also be calculated with the Pearson correlation coefficient, Euclidean distance, cosine similarity, Manhattan distance, and so on, which will not be listed one by one here.
It is worth noting that a preset registered voiceprint database generally stores a large number of registered users' voiceprint feature vectors, and during voiceprint recognition the fusion voiceprint feature vector of the verification voice needs to be compared with the voiceprint feature vector of each registered user in the database, which requires the voiceprint recognition device to perform a large amount of computation. In view of this, the voiceprint feature vectors of the registered users in the preset registered voiceprint database can be associated with one another; specifically, the similarity between the voiceprint feature vectors of any two registered users in the database can be calculated to associate the registered users' voiceprint feature vectors with one another. Then, when calculating the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of a certain registered user, the voiceprint feature vectors of other registered users whose similarity to that registered user's vector is low can be screened out, which reduces the amount of computation required by the voiceprint recognition device.
Step S50: Determine the voiceprint recognition result of the verification voice based on the similarity.
In this embodiment, the voiceprint recognition result of the verification voice is determined based on the relationship between the similarity and a preset threshold: when the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of a registered user is equal to or greater than the preset threshold, voiceprint recognition is determined to be successful; when the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of every registered user is less than the preset threshold, voiceprint recognition is determined to have failed.
It should be noted that if the voiceprint feature vectors of multiple registered users in the preset registered voiceprint database all have a similarity to the fusion voiceprint feature vector of the verification voice exceeding the preset threshold, the registered user whose voiceprint feature vector has the highest similarity to the fusion voiceprint feature vector of the verification voice is determined to match it.
This application extracts the first voiceprint feature of the verification voice from the verification voice using the GMM-UBM model and the second voiceprint feature using the neural network model; fuses the first and second voiceprint features to obtain the fusion voiceprint feature vector of the verification voice; calculates the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in the preset voiceprint database; and determines the voiceprint recognition result of the verification voice based on the similarity. In this way, the GMM-UBM model and the neural network model are combined: each of the two models extracts features from the verification voice, and both sets of features are used for verification. Compared with extracting features and verifying speech with a single model, the information contained in the features extracted by the two models is more comprehensive, so the verification voice can be fully verified against the registered voice and the accuracy of voiceprint recognition is improved.
Referring to FIG. 3, in this embodiment, the following steps are further included before step S10:
Step S100: Obtain the registered voice of a registered user;
In this embodiment, the registered voice is the voice uttered by a user who needs to register, and the way the registered voice is obtained is the same as the way the verification voice is obtained in step S10.
It is worth noting that the voiceprint recognition system uses the registered voice recorded at registration as the verification standard for that user, so the quality of the registered voice directly affects the accuracy of voiceprint recognition. To improve the accuracy of voiceprint recognition, the registered voice can also be screened when it is obtained, to eliminate registered voices of poor quality.
Step S110: Extract a third voiceprint feature of the registered voice using a GMM-UBM model, and extract a fourth voiceprint feature of the registered voice using a neural network model;
It should be noted that the sub-features contained in the third voiceprint feature are the same as the sub-features contained in the first voiceprint feature, and the sub-features contained in the fourth voiceprint feature are the same as the sub-features contained in the second voiceprint feature.
Step S120: Perform feature fusion of the third voiceprint feature and the fourth voiceprint feature of the registered voice to obtain a fusion voiceprint feature vector of the registered voice;
Step S130: Save the fusion voiceprint feature vector of the registered voice in the registered voiceprint database as the voiceprint feature vector of the registered user.
In this embodiment, the data storage module of the voiceprint recognition device is provided with a registered voiceprint database, and the fusion voiceprint feature vector of the registered voice is saved in that database. When storing the fusion voiceprint feature vectors of registered voices, the database can store them by category: for example, they can be classified and stored according to similarity, that is, the fusion voiceprint feature vectors of multiple registered voices with higher mutual similarity are stored in one subset, and multiple subsets form the registered voiceprint database; or storage can be classified by gender, that is, the fusion voiceprint feature vectors of the registered voices of male and female registered users are stored separately. Of course, the fusion feature vectors of registered voices can also be stored in other ways, which will not be listed here.
Referring to FIG. 4, in this embodiment, step S20 includes:
Step S210: Perform pre-emphasis, framing, and windowing pre-processing on the verification voice;
Pre-emphasis: because the average power spectrum of the voice signal is affected by glottal excitation and oral-nasal radiation, the higher frequencies roll off by about 6 dB per octave above roughly 800 Hz, so when computing the spectrum of the voice signal, the higher the frequency, the smaller the corresponding component and the harder the high-frequency part of the spectrum is to obtain; pre-emphasis is therefore applied. Its purpose is to boost the high-frequency part so that the signal spectrum becomes flat and the spectrum can be computed with the same signal-to-noise ratio over the whole band from low to high frequencies. Pre-emphasis is generally performed after the voice signal is digitized, and the pre-emphasis filter is first order, implemented as H(z) = 1 - u·z^{-1}, where u is generally between 0.9 and 1.
Framing and windowing: because the voice signal is only short-term stationary, after pre-processing it needs to be divided into frames and windowed, which makes it convenient to process with short-term analysis techniques. Under normal circumstances, the number of frames per second is about 33 to 100. Framing can use either contiguous segmentation or overlapping segmentation, but the latter makes the transition between frames smooth and maintains continuity. The overlapping part of the previous frame and the next frame is called the frame shift, and the ratio of frame shift to frame length is generally taken between 0 and 1/2. The voice signal is intercepted (framed) by a movable window of limited length; the commonly used window functions are the rectangular (Rectangular), Hamming, and Hanning windows.
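A minimal sketch of this pre-processing is given below, assuming u = 0.97 (within the stated range of 0.9 to 1), a 25 ms frame length, and a frame shift of half the frame length; these values and the function name are illustrative choices, not values fixed by this application.

```python
import numpy as np

# Sketch of the pre-processing described above: first-order pre-emphasis
# H(z) = 1 - u*z^-1, then overlapping frames with a Hamming window.
def preprocess(samples, sample_rate, u=0.97, frame_ms=25, shift_ratio=0.5):
    emphasized = np.append(samples[0], samples[1:] - u * samples[:-1])
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * shift_ratio)              # frame shift <= 1/2 frame length
    window = np.hamming(frame_len)
    frames = [emphasized[i:i + frame_len] * window
              for i in range(0, len(emphasized) - frame_len + 1, hop)]
    return np.array(frames)
```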
After the voice signal has been preprocessed, characteristic parameters are extracted. The selection of the characteristic parameters should satisfy several principles: first, they should be easy to extract from the speech signal; second, they should not be easy to imitate; third, they should not change over time and space, i.e. they should be relatively stable; fourth, they should effectively distinguish different speakers. Current speaker verification systems mainly rely on low-level acoustic features of speech for recognition; these features can be divided into time-domain features and transform-domain features.
Step S220: Extract the feature parameters of the Mel-frequency cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, the energy, the first-order difference of the energy, and the Gammatone filter cepstral coefficients from the preprocessed verification voice to obtain the first voiceprint feature of the verification voice;
The specific steps for extracting the Mel-frequency cepstral coefficients are as follows:
(1) Perform a short-time Fourier transform on the processed voice signal to obtain its spectrum; a fast Fourier transform (FFT) is used here for each frame of the voice signal. Each frame of the time-domain signal x(n) is first padded with zeros to form a sequence of length N, and the fast Fourier transform is then applied to obtain the linear spectrum X(k). The conversion between X(k) and x(n) is
X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-j2\pi nk/N}, \quad 0 \le k \le N-1.
(2) Square the spectrum X(k) to obtain the energy spectrum, then smooth it and eliminate harmonics through the Mel-frequency filter bank to obtain the corresponding Mel spectrum. The Mel-frequency filter bank is designed according to the masking effect of sound: a number of triangular band-pass filters H_m(k) (0 ≤ m ≤ M, where M is the number of filters) are placed over the frequency range of the speech, with center frequencies f(m) whose spacing widens as m increases.
The transfer function of the triangular band-pass filter bank can be expressed as:
H_m(k) = 0 for k < f(m-1); \;(k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); \;(f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1); \;0 for k > f(m+1).
(3) Take the logarithm of the Mel spectrum output by the Mel filter bank to obtain the logarithmic spectrum S(m), which compresses the dynamic range of the speech spectrum and converts the multiplicative noise component in the frequency domain into an additive component:
S(m) = \ln\Big(\sum_{k=0}^{N-1} |X(k)|^{2} H_m(k)\Big), \quad 0 \le m < M.
(4) Apply the discrete cosine transform (DCT) to the logarithmic spectrum S(m) to obtain the Mel-frequency cepstral coefficient (MFCC) parameters c(n):
c(n) = \sum_{m=0}^{M-1} S(m) \cos\Big(\frac{\pi n (m + 0.5)}{M}\Big), \quad n = 1, 2, \dots, L,
where L is the order of the MFCC parameters.
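The following is a minimal sketch of steps (1)-(4) above; the Mel-scale conversion used to place the centre frequencies and the parameter values (512-point FFT, M = 26 filters, L = 12 coefficients) are standard choices assumed for illustration and are not specified in the text.

```python
import numpy as np

# Minimal MFCC sketch: FFT -> triangular Mel filter bank -> log -> DCT.
def mfcc(frames, sample_rate, N=512, M=26, L=12):
    spectrum = np.fft.rfft(frames, n=N)                       # step (1): X(k)
    power = np.abs(spectrum) ** 2                             # step (2): energy spectrum
    # Triangular filters H_m(k) with centre frequencies equally spaced on the Mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0), mel(sample_rate / 2), M + 2)
    bins = np.floor((N / 2 + 1) * inv_mel(mel_points) / (sample_rate / 2)).astype(int)
    H = np.zeros((M, N // 2 + 1))
    for m in range(1, M + 1):
        H[m - 1, bins[m - 1]:bins[m]] = np.linspace(0, 1, bins[m] - bins[m - 1], endpoint=False)
        H[m - 1, bins[m]:bins[m + 1]] = np.linspace(1, 0, bins[m + 1] - bins[m], endpoint=False)
    S = np.log(power @ H.T + 1e-10)                           # step (3): log Mel spectrum S(m)
    n = np.arange(1, L + 1)[:, None]
    m_idx = np.arange(M)[None, :]
    dct = np.cos(np.pi * n * (m_idx + 0.5) / M)               # step (4): DCT basis
    return S @ dct.T                                          # c(n), shape (num_frames, L)
```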
The specific steps for extracting the short-time normalized energy feature parameter are as follows:
(1) Given a frame of length N in the speech segment, {S_i(n), n = 1, 2, ..., N}, the short-time logarithmic energy of that frame is computed as
E_i = \log \sum_{n=1}^{N} S_i(n)^{2},
where L is the number of frames in the speech segment.
(2) Because the energy of different frames in different speech segments differs considerably, normalization is required so that the energy can be used in the feature vector together with the preceding cepstral coefficients:
(equation image, not reproduced: energy normalization)
where E_max = max E_l is the largest logarithmic energy in the speech segment.
The specific steps for extracting the LPCC feature parameters are as follows:
(1) Solve for the linear prediction (LPC) coefficients. In linear prediction (LPC) analysis, the vocal-tract model is expressed as the all-pole model
H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}},
where p is the order of the LPC analysis, a_k are the linear prediction coefficients (k = 1, 2, ..., p), and A(z) is the inverse filter. LPC analysis solves for the linear prediction coefficients a_k; this application adopts the autocorrelation-based recursive solution (the Durbin algorithm).
(2) Compute the LPC cepstral coefficients (LPCC). The complex cepstrum \hat{x}(n) of the preprocessed voice signal x(n) is defined as the inverse Z-transform of the logarithm of the Z-transform of x(n), that is,
\hat{x}(n) = Z^{-1}\big[\log Z[x(n)]\big].
Considering only the magnitude of X(z) and ignoring its phase, the cepstrum c(n) of the signal is obtained as
c(n) = Z^{-1}(\log|X(z)| - j\,\arg X(z)).
The LPCC is obtained not from the input signal x(n) but from the LPC coefficients a_n. The recursive formula for the LPCC parameters C_n is
C_1 = a_1; \quad C_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n} C_k a_{n-k} \ (1 < n \le p); \quad C_n = \sum_{k=n-p}^{n-1} \frac{k}{n} C_k a_{n-k} \ (n > p).
Dynamic feature parameters: the specific steps for extracting the first-order difference of the Mel-frequency cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, and the first-order difference of the energy are as follows.
The Mel-frequency cepstral coefficients, linear prediction cepstral coefficients, and energy feature parameters introduced above only represent the instantaneous information of the speech spectrum; they are static parameters. Experiments show that the dynamic information of the speech spectrum also contains speaker-related information, which can be used to improve the recognition rate of a speaker recognition system.
(1) The dynamic information of the speech cepstrum characterizes how the speech feature parameters change over time. The change of the speech cepstrum over time can be expressed as
\Delta c_m(n) = \frac{\sum_{k=-K}^{K} k\,h(k)\,c_m(n+k)}{\sum_{k=-K}^{K} k^{2}\,h(k)},
where c_m denotes the m-th order cepstral coefficient, n and k are the indices of the cepstral coefficients on the time axis, and h(k) (k = -K, -K+1, ..., K-1, K) is a window function of length 2K+1 that is usually symmetric. The first-order coefficient \Delta c_m(n) of the orthogonal polynomial is as shown in the formula above.
(2) In practical applications the window function is usually a rectangular window and K is usually taken as 2; the dynamic parameter is then a linear combination of the parameters of the two frames before and the two frames after the current frame. The first-order dynamic parameters of the Mel-frequency cepstral coefficients, the linear prediction cepstral coefficients, and the energy can therefore be obtained from the above formula.
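A sketch of the first-order difference computation with a rectangular window and K = 2, as described above; the edge-padding at the start and end of the segment is an assumption made so the example runs on a whole matrix of static parameters.

```python
import numpy as np

# First-order dynamic (delta) parameters: a linear combination of the two
# frames before and after the current frame (rectangular window, K = 2).
# `c` is a (num_frames, num_coeffs) array of static parameters (MFCC, LPCC or energy).
def delta(c, K=2):
    padded = np.pad(c, ((K, K), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    return np.array([
        sum(k * (padded[t + K + k] - padded[t + K - k]) for k in range(1, K + 1)) / denom
        for t in range(c.shape[0])
    ])
```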
The specific steps for extracting the Gammatone filter cepstral coefficient feature parameters are as follows:
(1) Perform a short-time Fourier transform on the preprocessed voice signal to obtain its spectrum; a fast Fourier transform (FFT) is used for each frame. Each frame of the time-domain signal x(n) is padded with zeros to form a sequence of length N, and the FFT is applied to obtain the linear spectrum X(k), with
X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-j2\pi nk/N}, \quad 0 \le k \le N-1.
(2) Obtain the Gammatone filter bank. The Gammatone filter is a standard cochlear auditory filter whose time-domain impulse response is
g(t) = A\,t^{n-1} e^{-2\pi b_i t} \cos(2\pi f_i t + \varphi_i)\,U(t), \quad t \ge 0, \; 1 \le i \le N,
where A is the filter gain, f_i is the center frequency of the filter, U(t) is the step function, and \varphi_i is the phase; to simplify the model, \varphi_i is set to 0. n is the order of the filter, and experiments show that n = 4 models the filtering characteristics of the human cochlea well.
b_i is the decay factor of the filter, which determines how fast the impulse response decays and is related to the filter bandwidth, b_i = 1.019·ERB(f_i); in auditory psychology,
ERB(f_i) = 24.7 \times \Big(\frac{4.37 f_i}{1000} + 1\Big).
Here N is the number of filters, the center frequencies of the filter bank are equally spaced in the ERB domain, and the frequency coverage of the entire filter bank is 80 Hz-8000 Hz. The center frequencies are computed as
(equation image, not reproduced: center-frequency formula),
where f_H is the filter cutoff frequency and v_i is the filter overlap factor, which specifies the percentage of overlap between adjacent filters. After the center frequency of each filter is determined, the corresponding bandwidth can be obtained from the formula above.
(3) Gammatone filter-bank filtering. The spectrum X(k) obtained in step (1) is squared to obtain the energy spectrum, which is then filtered with the Gammatone filter bank G_m(k). The logarithmic spectrum S(m) is obtained, which compresses the dynamic range of the speech spectrum and converts the multiplicative noise component in the frequency domain into an additive component:
S(m) = \ln\Big(\sum_{k=0}^{N-1} |X(k)|^{2} G_m(k)\Big).
(4) Apply the discrete cosine transform (DCT) to the logarithmic spectrum S(m) to obtain the Gammatone filter cepstral coefficient feature parameters G(n), computed as
G(n) = \sum_{m=0}^{M-1} S(m) \cos\Big(\frac{\pi n (m + 0.5)}{M}\Big).
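The sketch below follows steps (1)-(4) for the Gammatone cepstral features under stated assumptions: the Glasberg-Moore ERB-rate spacing is used for the 80 Hz-8000 Hz centre frequencies because the patent's own centre-frequency formula is not reproduced above, and the fourth-order Gammatone magnitude response is approximated by a simple closed form; the filter count and coefficient count are illustrative.

```python
import numpy as np

# Assumed ERB formula from the text and ERB-rate spacing of centre frequencies.
def erb(f):
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_center_freqs(n_filters=32, f_lo=80.0, f_hi=8000.0):
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)     # Hz -> ERB-rate scale
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_filters))

def gfcc(frames, sample_rate, N=512, n_filters=32, n_ceps=12, order=4):
    power = np.abs(np.fft.rfft(frames, n=N)) ** 2               # steps (1)/(3): energy spectrum
    freqs = np.fft.rfftfreq(N, d=1.0 / sample_rate)
    fc = gammatone_center_freqs(n_filters)
    b = 1.019 * erb(fc)                                         # bandwidth b_i = 1.019 ERB(f_i)
    # Approximate order-n Gammatone magnitude response, |G_m(k)| ~ (1 + ((f-fc)/b)^2)^(-n/2)
    G = (1.0 + ((freqs[None, :] - fc[:, None]) / b[:, None]) ** 2) ** (-order / 2.0)
    S = np.log(power @ G.T + 1e-10)                             # log spectrum S(m)
    n = np.arange(1, n_ceps + 1)[:, None]
    m = np.arange(n_filters)[None, :]
    return S @ np.cos(np.pi * n * (m + 0.5) / n_filters).T      # step (4): G(n)
```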
Referring to FIG. 5, the above step S20 also includes:
Step S210': Arrange the verification voice into a spectrogram with a predetermined number of dimensions;
Specifically, a feature vector of a predetermined dimension may be extracted from the verification voice at every predetermined time interval, so as to arrange the verification voice into a spectrogram with the predetermined number of dimensions.
The above predetermined number of dimensions, predetermined dimension, and predetermined time interval can be set according to requirements and/or system performance during specific implementation; their values are not limited in this embodiment.
Step S220': Recognize the spectrogram with the predetermined number of dimensions through a neural network to obtain the second voiceprint feature of the verification voice.
The verification voice is arranged into a spectrogram with a predetermined number of dimensions, and the spectrogram is recognized by the neural network model to obtain the second voiceprint feature of the verification voice. Extracting the second voiceprint feature of the verification voice through the neural network model in this way can better characterize the acoustic features in the speech and improve the accuracy of speech recognition.
It is worth noting that the extraction of the first voiceprint feature and of the second voiceprint feature from the verification voice do not interfere with each other; that is, steps S210 and S220 are performed independently of steps S210' and S220', and there is no required order between steps S210/S220 and steps S210'/S220'.
Further, in an embodiment of the voiceprint recognition method of this application, step S30 specifically includes:
using a Markov chain Monte Carlo random model to fuse the first voiceprint feature dimensions and the second voiceprint feature dimensions to obtain the fusion voiceprint feature vector of the verification voice.
In this embodiment, the Markov chain Monte Carlo random model randomly obtains multiple features from the first voiceprint feature and multiple features from the second voiceprint feature, and then fuses the features obtained from the first voiceprint feature with those obtained from the second voiceprint feature to obtain the fusion voiceprint feature vector of the verification voice.
For example, if the Markov chain Monte Carlo random model randomly extracts 10 features from the 15 features of the first voiceprint feature and 15 features from the 20 features of the second voiceprint feature, a fusion voiceprint feature vector of the verification voice with 25 voiceprint features is obtained after fusion.
Referring to FIG. 6, FIG. 6 is a detailed flowchart of an embodiment of step S30 in FIG. 2. In this embodiment, the first voiceprint feature includes a plurality of first voiceprint sub-features, and the second voiceprint feature includes a plurality of second voiceprint sub-features.
Based on the above embodiment, in this embodiment, step S30 includes:
Step S310: Set the total number of features of the fusion voiceprint feature vector of the verification voice to K;
Step S320: According to the total number K of features of the fusion voiceprint feature vector of the verification voice, determine the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features using a direct sampling method;
Step S330: According to the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features, use Gibbs sampling of MCMC to simulate sampling from a joint normal distribution, and determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together constitute the fusion voiceprint feature vector of the verification voice.
Further, step S320 specifically includes:
Step A: Generate a random number between [0,1] as a parameter p, which represents the proportion of the first voiceprint sub-features in the fusion voiceprint features of the verification voice;
Step B: Initialize a counter used to record the number of iterations to k = 0;
Step C: Generate a random number q between [0,1] and compare it with the parameter p. When q < p, select one second voiceprint sub-feature and increase the number of second voiceprint sub-features by 1; when q > p, select one first voiceprint sub-feature and increase the number of first voiceprint sub-features by 1;
Step D: Increase the value of k by 1 and judge whether k ≥ K. If so, count the numbers of first and second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, record them as A and B respectively, and end the sampling process; otherwise, return to step C.
Suppose the total number of dimensions of the fusion voiceprint feature vector of the verification voice is set to K = 8 and the randomly generated parameter is p = 0.4; after 8 iterations of the above process, the number of first voiceprint sub-features to be selected is A = 3 and the number of second voiceprint sub-features is B = 5, so three first voiceprint sub-features and five second voiceprint sub-features are to be selected in the subsequent specific feature-selection process.
Further, step S330 specifically includes:
Step E: Set the transition-count threshold to T and initialize the number of transitions t = 0;
Step F: Count the number of features collected for the fusion voiceprint feature vector of the verification voice, record it as M, and generate M random numbers between [0,1] as the initial state
x(0) = [x_1(0), x_2(0), ..., x_M(0)];
Step G: Each time the number of transitions t increases by 1, for each variable x_i(t), i ∈ {1, 2, ..., M}, perform the following calculation according to the conditional probability distribution obtained from the joint probability distribution:
P(x_i(t+1) | x_1(t+1), x_2(t+1), ..., x_{i-1}(t+1), x_{i+1}(t), ..., x_M(t)),
where the mean of the joint probability distribution is X; judge whether t < T, and if so return to step G, otherwise obtain
P(T) = [P(x_1(T)), P(x_2(T)), ..., P(x_i(T)), ..., P(x_M(T))];
Step H: According to the number A, calculated in step D, of first voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, select the A first voiceprint sub-features with the largest corresponding probabilities P(x_i(T)) as the first voiceprint sub-features of the fusion voiceprint feature vector of the verification voice;
Step I: Set the transition-count threshold to T and initialize the number of transitions t = 0;
Step J: Count the number of features collected for the fusion voiceprint feature vector of the verification voice, record it as N, and generate N random numbers between [0,1] as the initial state
y(0) = [y_1(0), y_2(0), ..., y_N(0)];
Step K: Each time the number of transitions t increases by 1, for each variable y_j(t), j ∈ {1, 2, ..., M}, perform the following calculation according to the conditional probability distribution obtained from the joint probability distribution:
P(y_j(t+1) | y_1(t+1), y_2(t+1), ..., y_{j-1}(t+1), y_{j+1}(t), ..., y_N(t)),
where the mean of the joint probability distribution is Y;
judge whether t < T; if so, execute step K again, otherwise obtain
P_y(T) = [P(y_1(T)), P(y_2(T)), ..., P(y_N(T))];
Step L: According to the number B, calculated in step D, of second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, select the B second voiceprint sub-features with the largest corresponding probabilities P(y_j(T)) as the second voiceprint sub-features of the fusion voiceprint feature vector of the verification voice.
For example, if there are 5 second voiceprint sub-features in the fusion voiceprint feature vector of the verification voice collected in the previous step, and the initial state in this embodiment is x(0) = [0.2, 0.3, 0.4, 0.5, 0.6], then when t = 0, P(x_1(1)), P(x_2(1)), P(x_3(1)), P(x_4(1)), and P(x_5(1)) are obtained in turn according to the conditional probability formula; suppose the calculation gives P(x_i(1)) = [0.5, 0.6, 0.2, 0.8, 0.1]. This is repeated until the predetermined number of transitions is reached, T = 50 in this embodiment, giving P(x_i(50)); assuming the calculation gives P(x_i(50)) = [0.6, 0.2, 0.5, 0.8, 0.9], the two features with the largest corresponding probabilities are selected and added to the fusion voiceprint feature vector of the verification voice.
In addition, this application also provides a voiceprint recognition apparatus.
Referring to FIG. 7, FIG. 7 is a functional block diagram of an embodiment of the voiceprint recognition apparatus of this application.
In this embodiment, the voiceprint recognition apparatus includes:
a data acquisition module 10, configured to obtain a verification voice to be recognized;
a data processing module 20, configured to extract a first voiceprint feature of the verification voice using a GMM-UBM model, and extract a second voiceprint feature of the verification voice using a neural network model;
a data fusion module 30, configured to perform feature fusion of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fusion voiceprint feature vector of the verification voice;
a data comparison module 40, configured to calculate the similarity between the fusion voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database;
a data judgment module 50, configured to determine the voiceprint recognition result of the verification voice based on the similarity.
Further, the apparatus also includes a module for acquiring the voiceprint feature vector of a registered user, which includes:
a registration-voice acquisition unit, configured to obtain the registered voice of the registered user;
a voiceprint feature extraction unit, configured to extract a third voiceprint feature of the registered voice using a GMM-UBM model, and extract a fourth voiceprint feature of the registered voice using a neural network model;
a fusion unit, configured to perform feature fusion of the third voiceprint feature and the fourth voiceprint feature of the registered voice to obtain a fusion voiceprint feature vector of the registered voice;
a saving unit, configured to save the fusion voiceprint feature vector of the registered voice in the registered voiceprint database as the voiceprint feature vector of the registered user.
Further, the data processing module 20 also includes:
a first preprocessing unit 201, configured to perform pre-emphasis, framing, and windowing pre-processing on the verification voice;
a first extraction unit 202, configured to extract the feature parameters of the pitch period, the linear prediction cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, the energy, the first-order difference of the energy, and the Gammatone filter cepstral coefficients from the preprocessed verification voice to obtain the first voiceprint feature of the verification voice;
a second preprocessing unit 203, configured to arrange the verification voice into a spectrogram with a predetermined number of dimensions;
a second extraction unit 202, configured to recognize the spectrogram with the predetermined number of dimensions through a neural network to obtain the second voiceprint feature of the verification voice.
Further, the data fusion module 30 includes:
a data fusion unit 301, configured to use a Markov chain Monte Carlo random model to fuse the first voiceprint feature dimensions and the second voiceprint feature dimensions to obtain the fusion voiceprint feature vector of the verification voice.
Further, the data fusion unit 301 includes:
a setting subunit 3011, configured to set the total number of features of the fusion voiceprint feature vector of the verification voice to K;
a determining subunit 3012, configured to determine, according to the total number K of features of the fusion voiceprint feature vector of the verification voice, the fusion ratio of the first voiceprint sub-features and the second voiceprint sub-features using a direct sampling method;
a fusion subunit 3013, configured to use Gibbs sampling of MCMC to simulate sampling from a joint normal distribution according to the fusion ratio of the first and second voiceprint sub-features, and to determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together constitute the fusion voiceprint feature vector of the verification voice.
Further, the determining subunit 3012 is configured to perform:
Step A: Generate a random number between [0,1] as a parameter p, which represents the proportion of the first voiceprint sub-features in the fusion voiceprint features of the verification voice;
Step B: Initialize a counter used to record the number of iterations to k = 0;
Step C: Generate a random number q between [0,1] and compare it with the parameter p. When q < p, select one second voiceprint sub-feature and increase the number of second voiceprint sub-features by 1; when q > p, select one first voiceprint sub-feature and increase the number of first voiceprint sub-features by 1;
Step D: Increase the value of k by 1 and judge whether k ≥ K. If so, count the numbers of first and second voiceprint sub-features to be selected into the fusion voiceprint feature vector of the verification voice, record them as A and B respectively, and end the sampling process; otherwise, return to step C.
Further, the fusion subunit 3013 is configured to perform:
step E: setting the transition threshold to T and initializing the transition count t = 0;
step F: counting the number of features in the fused voiceprint feature vector collected for the verification voice, recording it as M, and generating M random numbers in [0,1] as the initial state
x(0) = [x_1(0), x_2(0), ..., x_M(0)];
step G: each time the transition count t increases by 1, for each variable x_i(t), i ∈ {1, 2, ..., M}, computing the conditional probability distribution derived from the joint probability distribution:
P(x_i(t+1) | x_1(t+1), x_2(t+1), ..., x_{i-1}(t+1), x_{i+1}(t), ..., x_M(t)),
where the mean of the joint probability distribution is X; determining whether t < T; if yes, returning to step G, otherwise obtaining
P(T) = [P(x_1(T)), P(x_2(T)), ..., P(x_i(T)), ..., P(x_M(T))];
step H: according to the number A, computed in step D, of first voiceprint sub-features to be selected into the fused voiceprint feature vector of the verification voice, selecting the A first voiceprint sub-features with the largest corresponding probabilities P(x_i(T)) as the first voiceprint sub-features included in the fused voiceprint feature vector of the verification voice;
step I: setting the transition threshold to T and initializing the transition count t = 0;
step J: counting the number of features in the fused voiceprint feature vector collected for the verification voice, recording it as N, and generating N random numbers in [0,1] as the initial state
y(0) = [y_1(0), y_2(0), ..., y_N(0)];
step K: each time the transition count t increases by 1, for each variable y_j(t), j ∈ {1, 2, ..., N}, computing the conditional probability distribution derived from the joint probability distribution:
P(y_j(t+1) | y_1(t+1), y_2(t+1), ..., y_{j-1}(t+1), y_{j+1}(t), ..., y_N(t)),
where the mean of the joint probability distribution is Y;
determining whether t < T; if yes, returning to step K, otherwise obtaining
P(T) = [P(y_1(T)), P(y_2(T)), ..., P(y_j(T)), ..., P(y_N(T))];
step L: according to the number B, computed in step D, of second voiceprint sub-features to be selected into the fused voiceprint feature vector of the verification voice, selecting the B second voiceprint sub-features with the largest corresponding probabilities P(y_j(T)) as the second voiceprint sub-features included in the fused voiceprint feature vector of the verification voice.
In addition, an embodiment of the present application further provides a voiceprint recognition device, including a processor, a memory, and a voiceprint recognition program stored in the memory and executable by the processor, where the voiceprint recognition program, when executed by the processor, implements the steps of the voiceprint recognition method of the foregoing embodiments.
In addition, an embodiment of the present application further provides a computer-readable storage medium on which a voiceprint recognition program is stored, where the voiceprint recognition program, when executed by a processor, implements the steps of the voiceprint recognition method of the foregoing embodiments. The storage medium may be a volatile storage medium or a non-volatile storage medium.
From the description of the above embodiments, a person skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific embodiments described above; the specific embodiments described above are merely illustrative rather than restrictive. Under the inspiration of the present application, a person of ordinary skill in the art may devise many further forms without departing from the spirit of the present application and the scope protected by the claims; any equivalent structural or process transformation made using the content of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, likewise falls within the protection of the present application.

Claims (20)

  1. A voiceprint recognition method, comprising:
    acquiring a verification voice to be recognized;
    extracting a first voiceprint feature of the verification voice using a GMM-UBM model, and extracting a second voiceprint feature of the verification voice using a neural network model;
    fusing the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fused voiceprint feature vector of the verification voice;
    computing the similarity between the fused voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database;
    determining a voiceprint recognition result of the verification voice based on the similarity.
  2. The voiceprint recognition method of claim 1, wherein, before the acquiring a verification voice to be recognized, the method further comprises:
    acquiring a registration voice of a registered user;
    extracting a third voiceprint feature of the registration voice using the GMM-UBM model, and extracting a fourth voiceprint feature of the registration voice using the neural network model;
    fusing the third voiceprint feature and the fourth voiceprint feature of the registration voice to obtain a fused voiceprint feature vector of the registration voice;
    saving the fused voiceprint feature vector of the registration voice into the registered voiceprint database as the voiceprint feature vector of the registered user.
  3. The voiceprint recognition method of claim 1, wherein the extracting a first voiceprint feature of the verification voice using a GMM-UBM model comprises:
    performing pre-emphasis, framing and windowing preprocessing on the verification voice;
    extracting, from the preprocessed verification voice, feature parameters including the pitch period, linear prediction cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, energy, the first-order difference of energy, and Gammatone filter cepstral coefficients, to obtain the first voiceprint feature of the verification voice;
    the extracting a second voiceprint feature of the verification voice using a neural network model comprises:
    arranging the verification voice into a spectrogram of a predetermined number of dimensions;
    recognizing the spectrogram of the predetermined number of dimensions through a neural network to obtain the second voiceprint feature of the verification voice.
  4. The voiceprint recognition method of claim 1, wherein the fusing the first voiceprint feature and the second voiceprint feature of the verification voice to obtain the fused voiceprint feature vector of the verification voice comprises:
    fusing the dimensions of the first voiceprint feature and the dimensions of the second voiceprint feature using a Markov chain Monte Carlo stochastic model to obtain the fused voiceprint feature vector of the verification voice.
  5. The voiceprint recognition method of claim 4, wherein the first voiceprint feature comprises a plurality of first voiceprint sub-features, and the second voiceprint feature comprises a plurality of second voiceprint sub-features;
    the fusing the dimensions of the first voiceprint feature and the dimensions of the second voiceprint feature using a Markov chain Monte Carlo stochastic model to obtain the fused voiceprint feature vector of the verification voice comprises:
    setting the total number of features of the fused voiceprint feature vector of the verification voice to K;
    determining, based on the total number K of features of the fused voiceprint feature of the verification voice, the fusion ratio of first voiceprint sub-features to second voiceprint sub-features by a direct sampling method;
    simulating, according to the fusion ratio of first voiceprint sub-features to second voiceprint sub-features, the sampling process of a joint normal distribution by means of MCMC Gibbs sampling, to determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together form the fused voiceprint feature vector of the verification voice.
  6. The voiceprint recognition method of claim 5, wherein the determining, based on the total number K of features of the fused voiceprint feature of the verification voice, the fusion ratio of first voiceprint sub-features to second voiceprint sub-features by the direct sampling method comprises:
    step A: generating a random number in [0,1] as a parameter p, where p represents the proportion of the first voiceprint sub-features in the fused voiceprint feature of the verification voice;
    step B: initializing a counter for recording the number of iterations to k = 0;
    step C: generating a random number q in [0,1] and comparing it with p; when q < p, selecting one first voiceprint sub-feature and incrementing the count of first voiceprint sub-features by 1; when q > p, selecting one second voiceprint sub-feature and incrementing the count of second voiceprint sub-features by 1;
    step D: incrementing k by 1 and determining whether k ≥ K; if yes, counting the numbers of first voiceprint sub-features and second voiceprint sub-features to be selected into the fused voiceprint feature vector of the verification voice, recording them as A and B respectively, and ending the sampling process; otherwise, returning to step C.
  7. The voiceprint recognition method of claim 6, wherein the simulating, according to the fusion ratio of first voiceprint sub-features to second voiceprint sub-features, the sampling process of a joint normal distribution by means of MCMC Gibbs sampling, to determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together form the fused voiceprint feature vector of the verification voice, comprises:
    step E: setting the transition threshold to T and initializing the transition count t = 0;
    step F: counting the number of features in the fused voiceprint feature vector collected for the verification voice, recording it as M, and generating M random numbers in [0,1] as the initial state
    x(0) = [x_1(0), x_2(0), ..., x_M(0)];
    step G: each time the transition count t increases by 1, for each variable x_i(t), i ∈ {1, 2, ..., M}, computing the conditional probability distribution derived from the joint probability distribution:
    P(x_i(t+1) | x_1(t+1), x_2(t+1), ..., x_{i-1}(t+1), x_{i+1}(t), ..., x_M(t)),
    where the mean of the joint probability distribution is X; determining whether t < T; if yes, returning to step G, otherwise obtaining
    P(T) = [P(x_1(T)), P(x_2(T)), ..., P(x_i(T)), ..., P(x_M(T))];
    step H: according to the number A, computed in step D, of first voiceprint sub-features to be selected into the fused voiceprint feature vector of the verification voice, selecting the A first voiceprint sub-features with the largest corresponding probabilities P(x_i(T)) as the first voiceprint sub-features included in the fused voiceprint feature vector of the verification voice;
    step I: setting the transition threshold to T and initializing the transition count t = 0;
    step J: counting the number of features in the fused voiceprint feature vector collected for the verification voice, recording it as N, and generating N random numbers in [0,1] as the initial state
    y(0) = [y_1(0), y_2(0), ..., y_N(0)];
    step K: each time the transition count t increases by 1, for each variable y_j(t), j ∈ {1, 2, ..., N}, computing the conditional probability distribution derived from the joint probability distribution:
    P(y_j(t+1) | y_1(t+1), y_2(t+1), ..., y_{j-1}(t+1), y_{j+1}(t), ..., y_N(t)),
    where the mean of the joint probability distribution is Y;
    determining whether t < T; if yes, returning to step K, otherwise obtaining
    P(T) = [P(y_1(T)), P(y_2(T)), ..., P(y_j(T)), ..., P(y_N(T))];
    step L: according to the number B, computed in step D, of second voiceprint sub-features to be selected into the fused voiceprint feature vector of the verification voice, selecting the B second voiceprint sub-features with the largest corresponding probabilities P(y_j(T)) as the second voiceprint sub-features included in the fused voiceprint feature vector of the verification voice.
  8. A voiceprint recognition apparatus, comprising:
    a data acquisition module, configured to acquire a verification voice to be recognized;
    a data processing module, configured to extract a first voiceprint feature of the verification voice using a GMM-UBM model and to extract a second voiceprint feature of the verification voice using a neural network model;
    a data fusion module, configured to fuse the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fused voiceprint feature vector of the verification voice;
    a data comparison module, configured to compute the similarity between the fused voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database;
    a data determination module, configured to determine a voiceprint recognition result of the verification voice based on the similarity.
  9. The voiceprint recognition apparatus of claim 8, further comprising a module for obtaining the voiceprint feature vector of a registered user, the module comprising:
    a registration voice acquisition unit, configured to acquire a registration voice of a registered user;
    a voiceprint feature extraction unit, configured to extract a third voiceprint feature of the registration voice using the GMM-UBM model and to extract a fourth voiceprint feature of the registration voice using the neural network model;
    a fusion unit, configured to fuse the third voiceprint feature and the fourth voiceprint feature of the registration voice to obtain a fused voiceprint feature vector of the registration voice;
    a saving unit, configured to save the fused voiceprint feature vector of the registration voice into the registered voiceprint database as the voiceprint feature vector of the registered user.
  10. The voiceprint recognition apparatus of claim 8, wherein the data processing module comprises:
    a first preprocessing unit, configured to perform pre-emphasis, framing and windowing preprocessing on the verification voice;
    a first extraction unit, configured to extract, from the preprocessed verification voice, feature parameters including the pitch period, linear prediction cepstral coefficients, the first-order difference of the linear prediction cepstral coefficients, energy, the first-order difference of energy, and Gammatone filter cepstral coefficients, to obtain the first voiceprint feature of the verification voice;
    a second preprocessing unit, configured to arrange the verification voice into a spectrogram of a predetermined number of dimensions;
    a second extraction unit, configured to recognize the spectrogram of the predetermined number of dimensions through a neural network to obtain the second voiceprint feature of the verification voice.
  11. The voiceprint recognition apparatus of claim 8, wherein the data fusion module comprises:
    a data fusion unit, configured to fuse the dimensions of the first voiceprint feature and the dimensions of the second voiceprint feature using a Markov chain Monte Carlo stochastic model to obtain the fused voiceprint feature vector of the verification voice.
  12. The voiceprint recognition apparatus of claim 11, wherein the data fusion unit comprises:
    a setting subunit, configured to set the total number of features of the fused voiceprint feature vector of the verification voice to K;
    a determining subunit, configured to determine, based on the total number K of features of the fused voiceprint feature of the verification voice, the fusion ratio of first voiceprint sub-features to second voiceprint sub-features by a direct sampling method;
    a fusion subunit, configured to simulate, according to the fusion ratio of first voiceprint sub-features to second voiceprint sub-features, the sampling process of a joint normal distribution by means of MCMC Gibbs sampling, to determine the first voiceprint sub-features selected from the first voiceprint feature and the second voiceprint sub-features selected from the second voiceprint feature, which together form the fused voiceprint feature vector of the verification voice.
  13. The voiceprint recognition apparatus of claim 12, wherein the determining subunit is configured to perform:
    step A: generating a random number in [0,1] as a parameter p, where p represents the proportion of the first voiceprint sub-features in the fused voiceprint feature of the verification voice;
    step B: initializing a counter for recording the number of iterations to k = 0;
    step C: generating a random number q in [0,1] and comparing it with p; when q < p, selecting one first voiceprint sub-feature and incrementing the count of first voiceprint sub-features by 1; when q > p, selecting one second voiceprint sub-feature and incrementing the count of second voiceprint sub-features by 1;
    step D: incrementing k by 1 and determining whether k ≥ K; if yes, counting the numbers of first voiceprint sub-features and second voiceprint sub-features to be selected into the fused voiceprint feature vector of the verification voice, recording them as A and B respectively, and ending the sampling process; otherwise, returning to step C.
  14. The voiceprint recognition apparatus of claim 13, wherein the fusion subunit is configured to perform:
    step E: setting the transition threshold to T and initializing the transition count t = 0;
    step F: counting the number of features in the fused voiceprint feature vector collected for the verification voice, recording it as M, and generating M random numbers in [0,1] as the initial state
    x(0) = [x_1(0), x_2(0), ..., x_M(0)];
    step G: each time the transition count t increases by 1, for each variable x_i(t), i ∈ {1, 2, ..., M}, computing the conditional probability distribution derived from the joint probability distribution:
    P(x_i(t+1) | x_1(t+1), x_2(t+1), ..., x_{i-1}(t+1), x_{i+1}(t), ..., x_M(t)),
    where the mean of the joint probability distribution is X; determining whether t < T; if yes, returning to step G, otherwise obtaining
    P(T) = [P(x_1(T)), P(x_2(T)), ..., P(x_i(T)), ..., P(x_M(T))];
    step H: according to the number A, computed in step D, of first voiceprint sub-features to be selected into the fused voiceprint feature vector of the verification voice, selecting the A first voiceprint sub-features with the largest corresponding probabilities P(x_i(T)) as the first voiceprint sub-features included in the fused voiceprint feature vector of the verification voice;
    step I: setting the transition threshold to T and initializing the transition count t = 0;
    step J: counting the number of features in the fused voiceprint feature vector collected for the verification voice, recording it as N, and generating N random numbers in [0,1] as the initial state
    y(0) = [y_1(0), y_2(0), ..., y_N(0)];
    step K: each time the transition count t increases by 1, for each variable y_j(t), j ∈ {1, 2, ..., N}, computing the conditional probability distribution derived from the joint probability distribution:
    P(y_j(t+1) | y_1(t+1), y_2(t+1), ..., y_{j-1}(t+1), y_{j+1}(t), ..., y_N(t)),
    where the mean of the joint probability distribution is Y;
    determining whether t < T; if yes, returning to step K, otherwise obtaining
    P(T) = [P(y_1(T)), P(y_2(T)), ..., P(y_j(T)), ..., P(y_N(T))];
    step L: according to the number B, computed in step D, of second voiceprint sub-features to be selected into the fused voiceprint feature vector of the verification voice, selecting the B second voiceprint sub-features with the largest corresponding probabilities P(y_j(T)) as the second voiceprint sub-features included in the fused voiceprint feature vector of the verification voice.
  15. A voiceprint recognition device, comprising a processor, a memory, and a voiceprint recognition program stored in the memory and executable by the processor, wherein the voiceprint recognition program, when executed by the processor, implements the following steps:
    acquiring a verification voice to be recognized;
    extracting a first voiceprint feature of the verification voice using a GMM-UBM model, and extracting a second voiceprint feature of the verification voice using a neural network model;
    fusing the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fused voiceprint feature vector of the verification voice;
    computing the similarity between the fused voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database;
    determining a voiceprint recognition result of the verification voice based on the similarity.
  16. The voiceprint recognition device of claim 15, wherein, before the acquiring a verification voice to be recognized, the voiceprint recognition program, when executed by the processor, implements the following steps:
    acquiring a registration voice of a registered user;
    extracting a third voiceprint feature of the registration voice using the GMM-UBM model, and extracting a fourth voiceprint feature of the registration voice using the neural network model;
    fusing the third voiceprint feature and the fourth voiceprint feature of the registration voice to obtain a fused voiceprint feature vector of the registration voice;
    saving the fused voiceprint feature vector of the registration voice into the registered voiceprint database as the voiceprint feature vector of the registered user.
  17. The voiceprint recognition device of claim 15, wherein, for the fusing of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain the fused voiceprint feature vector of the verification voice, the voiceprint recognition program, when executed by the processor, implements the following step:
    fusing the dimensions of the first voiceprint feature and the dimensions of the second voiceprint feature using a Markov chain Monte Carlo stochastic model to obtain the fused voiceprint feature vector of the verification voice.
  18. A computer-readable storage medium having a voiceprint recognition program stored thereon, wherein the voiceprint recognition program, when executed by a processor, implements the following steps:
    acquiring a verification voice to be recognized;
    extracting a first voiceprint feature of the verification voice using a GMM-UBM model, and extracting a second voiceprint feature of the verification voice using a neural network model;
    fusing the first voiceprint feature and the second voiceprint feature of the verification voice to obtain a fused voiceprint feature vector of the verification voice;
    computing the similarity between the fused voiceprint feature vector of the verification voice and the voiceprint feature vector of each registered user in a preset registered voiceprint database;
    determining a voiceprint recognition result of the verification voice based on the similarity.
  19. The computer-readable storage medium of claim 18, wherein, before the acquiring a verification voice to be recognized, the voiceprint recognition program, when executed by the processor, implements the following steps:
    acquiring a registration voice of a registered user;
    extracting a third voiceprint feature of the registration voice using the GMM-UBM model, and extracting a fourth voiceprint feature of the registration voice using the neural network model;
    fusing the third voiceprint feature and the fourth voiceprint feature of the registration voice to obtain a fused voiceprint feature vector of the registration voice;
    saving the fused voiceprint feature vector of the registration voice into the registered voiceprint database as the voiceprint feature vector of the registered user.
  20. The computer-readable storage medium of claim 18, wherein, for the fusing of the first voiceprint feature and the second voiceprint feature of the verification voice to obtain the fused voiceprint feature vector of the verification voice, the voiceprint recognition program, when executed by the processor, implements the following step:
    fusing the dimensions of the first voiceprint feature and the dimensions of the second voiceprint feature using a Markov chain Monte Carlo stochastic model to obtain the fused voiceprint feature vector of the verification voice.
PCT/CN2019/118656 2019-03-12 2019-11-15 Voiceprint recognition method, apparatus and device, and computer-readable storage medium WO2020181824A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910182453.3A CN110047490A (zh) 2019-03-12 2019-03-12 Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN201910182453.3 2019-03-12

Publications (1)

Publication Number Publication Date
WO2020181824A1 true WO2020181824A1 (zh) 2020-09-17

Family

ID=67274752

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118656 WO2020181824A1 (zh) 2019-03-12 2019-11-15 声纹识别方法、装置、设备以及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN110047490A (zh)
WO (1) WO2020181824A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047490A (zh) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN110517698B (zh) * 2019-09-05 2022-02-01 科大讯飞股份有限公司 Voiceprint model determination method, apparatus, device and storage medium
CN110556126B (zh) * 2019-09-16 2024-01-05 平安科技(深圳)有限公司 Speech recognition method and apparatus, and computer device
CN112687274A (zh) * 2019-10-17 2021-04-20 北京猎户星空科技有限公司 Voice information processing method, apparatus, device and medium
CN110880321B (zh) * 2019-10-18 2024-05-10 平安科技(深圳)有限公司 Voice-based intelligent braking method, apparatus, device and storage medium
CN110838294B (zh) * 2019-11-11 2022-03-04 效生软件科技(上海)有限公司 Voice verification method and apparatus, computer device and storage medium
CN111370003B (zh) * 2020-02-27 2023-05-30 杭州雄迈集成电路技术股份有限公司 Voiceprint comparison method based on a Siamese neural network
CN112185344A (zh) * 2020-09-27 2021-01-05 北京捷通华声科技股份有限公司 Voice interaction method and apparatus, computer-readable storage medium and processor
CN112614493B (zh) * 2020-12-04 2022-11-11 珠海格力电器股份有限公司 Voiceprint recognition method and system, storage medium and electronic device
CN112382300A (zh) * 2020-12-14 2021-02-19 北京远鉴信息技术有限公司 Voiceprint identification method, model training method, apparatus, device and storage medium
CN115310066A (zh) * 2021-05-07 2022-11-08 华为技术有限公司 Upgrade method, apparatus and electronic device
CN115022087B (zh) * 2022-07-20 2024-02-27 中国工商银行股份有限公司 Speech recognition verification processing method and apparatus
CN115019804B (zh) * 2022-08-03 2022-11-01 北京惠朗时代科技有限公司 Multi-check voiceprint recognition method and system for dense multi-employee check-in
CN115831152B (zh) * 2022-11-28 2023-07-04 国网山东省电力公司应急管理中心 Sound monitoring apparatus and method for real-time monitoring of the operating state of an emergency-equipment generator
CN116386647B (zh) * 2023-05-26 2023-08-22 北京瑞莱智慧科技有限公司 Audio verification method, related apparatus, storage medium and program product

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080010065A1 (en) * 2006-06-05 2008-01-10 Harry Bratt Method and apparatus for speaker recognition
CN103745002A (zh) * 2014-01-24 2014-04-23 中国科学院信息工程研究所 Method and system for identifying online "water army" accounts based on fusion of behavior features and content features
CN104900235A (zh) * 2015-05-25 2015-09-09 重庆大学 Voiceprint recognition method based on mixed feature parameters including the pitch period
CN105575394A (zh) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint recognition method based on hybrid modeling with a global variability space and deep learning
CN106887225A (zh) * 2017-03-21 2017-06-23 百度在线网络技术(北京)有限公司 Acoustic feature extraction method and apparatus based on a convolutional neural network, and terminal device
US10008209B1 (en) * 2015-09-25 2018-06-26 Educational Testing Service Computer-implemented systems and methods for speaker recognition using a neural network
CN108417217A (zh) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker recognition network model training method, speaker recognition method and system
CN109102812A (zh) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 Voiceprint recognition method and system, and electronic device
CN109147797A (zh) * 2018-10-18 2019-01-04 平安科技(深圳)有限公司 Customer service method and apparatus based on voiceprint recognition, computer device and storage medium
CN110047490A (zh) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 Voiceprint recognition method, apparatus and device, and computer-readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1308911C (zh) * 2003-07-10 2007-04-04 上海优浪信息科技有限公司 Speaker identity recognition method and system
CN103440873B (zh) * 2013-08-27 2015-10-28 大连理工大学 Similarity-based music recommendation method
CN104835498B (zh) * 2015-05-25 2018-12-18 重庆大学 Voiceprint recognition method based on multi-type combined feature parameters
CN106710589B (zh) * 2016-12-28 2019-07-30 百度在线网络技术(北京)有限公司 Artificial-intelligence-based speech feature extraction method and apparatus
CN106847309A (zh) * 2017-01-09 2017-06-13 华南理工大学 Speech emotion recognition method
CN109767790A (zh) * 2019-02-28 2019-05-17 中国传媒大学 Speech emotion recognition method and system


Also Published As

Publication number Publication date
CN110047490A (zh) 2019-07-23

Similar Documents

Publication Publication Date Title
WO2020181824A1 (zh) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN106486131B (zh) Speech denoising method and apparatus
WO2021139425A1 (zh) Voice endpoint detection method, apparatus, device and storage medium
US9940935B2 (en) Method and device for voiceprint recognition
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
EP2763134B1 (en) Method and apparatus for voice recognition
WO2018166187A1 (zh) Server, identity verification method and system, and computer-readable storage medium
WO2018149077A1 (zh) Voiceprint recognition method, apparatus, storage medium and back-end server
CN109584884B (zh) Voice identity feature extractor and classifier training method, and related devices
WO2019019256A1 (zh) Electronic apparatus, identity verification method and system, and computer-readable storage medium
US20120143608A1 (en) Audio signal source verification system
US20050143997A1 (en) Method and apparatus using spectral addition for speaker recognition
WO2018223727A1 (zh) Voiceprint recognition method, apparatus, device and medium
WO2014114049A1 (zh) Speech recognition method and apparatus
WO2021051608A1 (zh) Deep-learning-based voiceprint recognition method, apparatus and device
WO2014114116A1 (en) Method and system for voiceprint recognition
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
WO2021042537A1 (zh) Speech recognition authentication method and system
WO2019232826A1 (zh) i-vector extraction method, speaker recognition method, apparatus, device and medium
CN113823293B (zh) Speaker recognition method and system based on speech enhancement
CN110570871A (zh) TristouNet-based voiceprint recognition method, apparatus and device
CN111199742A (zh) Identity verification method, apparatus and computing device
Herrera-Camacho et al. Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE
CN113241059B (zh) Voice wake-up method, apparatus, device and storage medium
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19919458

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19919458

Country of ref document: EP

Kind code of ref document: A1