CN112951245A - Dynamic voiceprint feature extraction method integrated with static component

Info

Publication number
CN112951245A
CN112951245A (application CN202110257723.XA)
Authority
CN
China
Prior art keywords
voice data
target voice
dynamic
mfcc
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110257723.XA
Other languages
Chinese (zh)
Other versions
CN112951245B (en)
Inventor
刘涛
刘斌
黄金国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Open University of Jiangsu City Vocational College
Original Assignee
Jiangsu Open University of Jiangsu City Vocational College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Open University of Jiangsu City Vocational College filed Critical Jiangsu Open University of Jiangsu City Vocational College
Priority to CN202110257723.XA priority Critical patent/CN112951245B/en
Publication of CN112951245A publication Critical patent/CN112951245A/en
Application granted granted Critical
Publication of CN112951245B publication Critical patent/CN112951245B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a dynamic voiceprint feature extraction method integrated with a static component, which comprises: preprocessing target voice data to obtain preprocessed target voice data; processing the preprocessed target voice with a Fourier transform and a Mel filter bank to obtain the MFCC coefficients of the target voice data; and substituting the MFCC coefficients of the target voice data into a dynamic voiceprint feature extraction model fused with the static component to obtain an MFCC dynamic feature difference parameter matrix of the target voice data, which is defined as the dynamic voiceprint features of the target voice data. When extracting voiceprint features from voice data, the method ensures sound continuity, reduces the average equal error rate, and improves the recognition rate.

Description

Dynamic voiceprint feature extraction method integrated with static component
Technical Field
The invention relates to the technical field of artificial intelligence voiceprint recognition, in particular to a dynamic voiceprint feature extraction method fused with static components.
Background
At present, smart homes are ever more widely used in people's life and work. Smart homes adopt technologies such as wireless communication, image processing and voice processing; a smart home system based on voice interaction is more convenient to use, offers a wider information acquisition space, and provides a friendlier user experience.
Voiceprint recognition has developed greatly in recent years, and in some settings its recognition rate already meets people's basic security requirements; it also offers advantages such as economy and convenience, so it has a very broad application prospect. Suppressing external noise as far as possible and extracting voice features as pure as possible from the acquired signal is a precondition for putting the various voice processing techniques into practical use.
Today, with the rapid improvement in people's quality of life, the public's requirements for smart home systems are no longer limited to standard, common control functions; people expect the whole home to become more intelligent, convenient, safe and comfortable. Adding a voiceprint recognition function to the smart home system, and using speech enhancement to improve the system's stability in noisy environments, can further improve the human-computer interaction experience and the efficiency with which users operate the smart home. A permission hierarchy can also be set for smart home control and operation, providing differentiated service functions to users with different permission levels and thereby further improving the overall safety and practicability of the system. Such a system will have strong appeal in the future market, especially against the background of the currently slow development of the smart home market, and will play an increasingly important role in public life. However, the voice recognition and voice feature extraction methods in the prior art suffer from a high average equal error rate and a low recognition rate.
Therefore, in order to further reduce the average equal error rate and improve the recognition rate, the invention provides a dynamic voiceprint feature extraction method integrated with a static component.
Disclosure of Invention
The purpose of the invention is as follows: to provide a dynamic voiceprint feature extraction method with a low average equal error rate and a high recognition rate.
The technical scheme is as follows: the invention provides a dynamic voiceprint feature extraction method fused with a static component, which is used for extracting voiceprint features from target voice data and comprises the following steps:
step 1: preprocessing the target voice data to obtain preprocessed target voice data;
step 2: processing the preprocessed target voice with a Fourier transform and a Mel filter bank to obtain the MFCC coefficients of the target voice data;
step 3: substituting the MFCC coefficients of the target voice data into the dynamic voiceprint feature extraction model fused with the static component, obtaining the MFCC dynamic feature difference parameter matrix of the target voice data, and defining the matrix as the dynamic voiceprint features of the target voice data.
As a preferred aspect of the present invention, in step 1, the method for preprocessing the target voice data comprises: dividing the target voice data into T frames to acquire multiple frames of voice data;
in step 2, the method for processing the preprocessed target voice with the Fourier transform and the Mel filter bank comprises the following steps:
processing each frame of voice data with the Fourier transform to obtain the frequency spectrum of each frame of voice data;
inputting the frequency spectrum of each frame of voice data into the Mel filter bank to obtain the MFCC coefficients of each frame of voice data, namely the MFCC coefficients of the target voice data.
As a preferable aspect of the present invention, in step 3, the dynamic voiceprint feature extraction model fused with the static component is:
$$d(l,t)=\frac{1}{2}\left[C(l,t)+\frac{\sum_{k=1}^{K}k\,\bigl(C(l,t+k)-C(l,t-k)\bigr)}{2\sum_{k=1}^{K}k^{2}}\right]$$
where d(l, t) is the extraction result of the l-th order dynamic voiceprint feature of the t-th frame of voice data and constitutes the t-th element of the l-th order in the MFCC dynamic feature difference parameter matrix of the target voice data; C(l, t) is the t-th parameter of the l-th order in the MFCC coefficients, C(l, t+1) is the (t+1)-th parameter of the l-th order, C(l, t+k) is the (t+k)-th parameter of the l-th order, and C(l, t-k) is the (t-k)-th parameter of the l-th order; k is the frequency ordinal after the Fourier transform of the t-th frame of voice data, and K is the preset total step length of the Fourier transform of the t-th frame of voice data.
As a preferred aspect of the present invention, according to the following formula:
$$C(l,t)=\sqrt{\frac{2}{M}}\sum_{m=1}^{M}S(m)\cos\!\left(\frac{\pi l\,(m-0.5)}{M}\right),\qquad l=1,2,\dots,L$$
the l-th order characteristic coefficient C(l, t) of the t-th frame of voice data in the MFCC coefficients is obtained;
where L is the order of the MFCC coefficients, m is the serial number of the Mel filter, and S(m) is the logarithmic energy output by the m-th Mel filter.
As a preferred aspect of the present invention, according to the following formula:
$$S(m)=\ln\!\left(\sum_{k=1}^{N}\lvert X(k)\rvert^{2}\,H_{m}(k)\right),\qquad 1\le m\le M$$
the logarithmic energy S(m) output by the m-th Mel filter is obtained;
where M represents the total number of filters in the Mel filter bank, N represents the data length of the t-th frame of voice data, |X(k)|² represents the power corresponding to the k-th frequency, and H_m(k) represents the transfer function of the m-th Mel filter at the k-th frequency.
Beneficial effects: compared with the prior art, the dynamic voiceprint feature extraction method fused with a static component provided by the invention extracts voiceprint features based on a dynamic voiceprint feature extraction model fused with the static component, reducing the average equal error rate and improving the recognition rate while ensuring sound continuity.
Drawings
FIG. 1 is a flow chart of a dynamic voiceprint feature extraction method provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of the variation of the average equal error rate with the ratio of the dynamic feature to the static feature provided by an embodiment of the invention;
FIG. 3 is a graph illustrating the variation of the average equal error rate with the static feature coefficient according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Referring to FIG. 1, the method for extracting dynamic voiceprint features fused with a static component provided by the invention comprises the following steps:
Step 1: preprocessing the target voice data to obtain preprocessed target voice data.
The method for preprocessing the target voice data comprises: dividing the target voice data into T frames to acquire multiple frames of voice data.
Step 2: processing the preprocessed target voice with a Fourier transform and a Mel filter bank to obtain the MFCC coefficients of the target voice data.
The method for processing the preprocessed target voice with the Fourier transform and the Mel filter bank comprises the following steps:
processing each frame of voice data with the Fourier transform to obtain the frequency spectrum of each frame of voice data;
inputting the frequency spectrum of each frame of voice data into the Mel filter bank to obtain the MFCC coefficients of each frame of voice data, namely the MFCC coefficients of the target voice data.
The methods of step 1 and step 2 are specifically as follows:
Mel-frequency cepstrum coefficient (MFCC) extraction is performed on the data that has undergone voice preprocessing, and the desired feature coefficients are obtained by applying operations such as the Fourier transform and the Mel filter bank to the data.
(1) A Fourier transform is carried out on each frame of data after voice preprocessing to obtain the corresponding frequency spectrum, and the power spectrum |X(j)|² of each frame is obtained. X(j) is calculated as follows:
$$X(j)=\sum_{n=1}^{N}x(n)\,e^{-\mathrm{i}\,2\pi nj/J},\qquad 1\le j\le J$$
where N is the length of each frame, J is the fast Fourier transform length (i.e. the total number of spectral lines), j takes values from 1 to J and denotes the j-th spectral line, x(n) is the n-th sample of voice data in the frame, and i is the imaginary unit.
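A minimal sketch of this transform step, assuming the frames come from the frame_signal() helper above and using the one-sided FFT (the redundant half of the spectrum is discarded):

```python
import numpy as np

def power_spectrum(frames, nfft):
    """Per-frame power spectrum |X(j)|^2 via an FFT of length nfft
    (J in the text). Returns shape (T, nfft // 2 + 1)."""
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
```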
(2) A Mel filter bank is designed, and the power spectrum of the signal is filtered by the configured Mel filter bank; a logarithm operation is then carried out, converting the frequency scale into the Mel scale. The center frequency f(m) of the m-th filter in the filter bank satisfies the following formula:
Mel(f(m+1))-Mel(f(m))=Mel(f(m))-Mel(f(m-1))
where m is the serial number of the filter in the filter bank, and Mel(f(m)) denotes converting the frequency f(m) to the Mel frequency.
The transfer function H_m(f) of each band-pass filter in the Mel filter bank is:
$$H_{m}(f)=\begin{cases}0, & f<f(m-1)\\[4pt]\dfrac{f-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le f\le f(m)\\[4pt]\dfrac{f(m+1)-f}{f(m+1)-f(m)}, & f(m)<f\le f(m+1)\\[4pt]0, & f>f(m+1)\end{cases}$$
Wherein f is the frequency.
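The triangular Mel filter bank described above might be built as follows; the conversion Mel(f) = 2595 log10(1 + f / 700) and the FFT-bin rounding are common conventions and are assumptions here:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, nfft, sample_rate):
    """Triangular filters H_m(k) whose centre frequencies f(m) are equally
    spaced on the Mel scale, as the centre-frequency relation requires.
    Returns shape (num_filters, nfft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)          # f(0) .. f(M+1)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):                  # rising edge
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                 # falling edge
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank
```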
After the voice data is processed by the Mel filters, the logarithmic energy S(m) output by each filter is calculated:
$$S(m)=\ln\!\left(\sum_{k=1}^{N}\lvert X(k)\rvert^{2}\,H_{m}(k)\right),\qquad 1\le m\le M$$
where m is the serial number of the filter in the filter bank, M is the total number of filters in the filter bank, generally 22 to 26 (M = 24 in the invention); |X(k)|² represents the power at the k-th spectral line, and H_m(k) represents the transfer function of the m-th filter at the k-th spectral line.
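Applying the filter bank and taking the logarithm, per the S(m) formula above, might look like this (the small floor before the logarithm is an implementation assumption to avoid log(0)):

```python
import numpy as np

def log_mel_energies(pspec, fbank):
    """S(m) for every frame: log of the filter-weighted power sums.

    pspec: power spectra, shape (T, nfft // 2 + 1)
    fbank: output of mel_filterbank(); M = 24 filters in the patent
    """
    energies = pspec @ fbank.T          # sum_k |X(k)|^2 * H_m(k)
    return np.log(np.maximum(energies, 1e-12))
```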
(3) A discrete cosine transform is applied to the logarithmic Mel power spectrum of each frame to decorrelate its energy, eliminating the correlation between the signal dimensions and mapping the signal to a low-dimensional space, so as to obtain the corresponding MFCC coefficients C(l):
$$C(l)=\sqrt{\frac{2}{M}}\sum_{m=1}^{M}S(m)\cos\!\left(\frac{\pi l\,(m-0.5)}{M}\right),\qquad l=1,2,\dots,L$$
where L is the total order of the MFCC coefficients, usually 12 to 18 (the invention takes L = 15); l takes values from 1 to L and denotes the l-th order of the MFCC coefficients.
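A sketch of this DCT decorrelation step; the sqrt(2/M) scale factor is a common MFCC convention and an assumption here:

```python
import numpy as np

def mfcc_from_log_energies(S, num_ceps=15):
    """C(l) for l = 1..L via the discrete cosine transform of the log Mel
    energies S, keeping L = 15 coefficients as in the patent."""
    T, M = S.shape
    m = np.arange(1, M + 1)
    C = np.empty((T, num_ceps))
    for l in range(1, num_ceps + 1):
        basis = np.cos(np.pi * l * (m - 0.5) / M)
        C[:, l - 1] = np.sqrt(2.0 / M) * (S * basis).sum(axis=1)
    return C
```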
Step 3: substituting the MFCC coefficients of the target voice data into the dynamic voiceprint feature extraction model fused with the static component, acquiring the MFCC dynamic feature difference parameter matrix of the target voice data, and defining the matrix as the dynamic voiceprint features of the target voice data.
In step 3, the dynamic voiceprint feature extraction model fused with the static component is constructed according to the following method:
the essence of the dynamic feature extraction is the MFCC coefficient difference mode, that is, when the MFCC coefficient difference parameter of the t-th frame is calculated, the parameters of the t-1-th frame and the t + 1-th frame are used for carrying out the downsampling. Therefore, the classical dynamic feature extraction formula is as follows:
$$d(l,t)=\frac{\sum_{j=1}^{J}j\,\bigl(C(l,t+j)-C(l,t-j)\bigr)}{2\sum_{j=1}^{J}j^{2}}$$
where J is the difference step length, usually taken as 1 or 2, corresponding respectively to the first-order and second-order MFCC coefficient difference parameters, and j takes the values 1 ≤ j ≤ J; l is the order of the Mel cepstrum coefficients; t is the frame number and T is the total number of frames of a section of audio; C(l, t) is the t-th parameter of the l-th order of the Mel cepstrum coefficient matrix of the voice signal; and d(l, t) is the MFCC dynamic feature parameter.
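A sketch of this classical difference computation; padding the edge frames by repetition is an assumed convention, since the patent does not state how boundary frames are handled:

```python
import numpy as np

def delta(C, J=2):
    """Classical MFCC difference parameters:
    d(l, t) = sum_j j * (C(l, t+j) - C(l, t-j)) / (2 * sum_j j^2)."""
    T = C.shape[0]
    padded = np.pad(C, ((J, J), (0, 0)), mode="edge")  # repeat edge frames
    denom = 2.0 * sum(j * j for j in range(1, J + 1))
    d = np.zeros_like(C, dtype=float)
    for j in range(1, J + 1):
        d += j * (padded[J + j : J + j + T] - padded[J - j : J - j + T])
    return d / denom
```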
The new dynamic voiceprint feature Mel-frequency cepstrum coefficient formula proposed by the invention is:
$$\mathrm{MFCC}_{\mathrm{new}}=\alpha\cdot\mathrm{MFCC}+\beta\cdot\Delta\mathrm{MFCC}$$
the modification is as follows:
$$\mathrm{MFCC}_{\mathrm{new}}=\alpha\left(\mathrm{MFCC}+\delta\cdot\Delta\mathrm{MFCC}\right)$$
where $\mathrm{MFCC}_{\mathrm{new}}$ is the dynamic voiceprint feature proposed by the invention, MFCC is the static voiceprint feature, ΔMFCC is the classical dynamic voiceprint feature (the difference dynamic parameter), α is the static feature coefficient, β is the dynamic feature coefficient, and δ is the ratio of the dynamic feature coefficient to the static feature coefficient.
The α and δ values are determined according to the following method:
assuming that α is 1, the optimum value of the ratio δ of the dynamic coefficient to the static coefficient is determined by experiment.
The number of Gaussian mixture components in the experiment was set to 64, and the voice data of 100 persons (50 women and 50 men) was selected from the TIMIT corpus as the experimental voice data. The voice data of 60 persons was used as training data for the UBM model, combining 10 segments of each person's voice into 10 seconds of speech. The UBM model parameters were obtained and stored; then 5 segments of speech from each of the remaining 40 persons were combined into 10 seconds of voice data to train the GMM model of each specific speaker, and the resulting model parameters were stored. The remaining voice data of these 40 persons was cut into 10 segments of 5 seconds each and matched against the system. The complete test process comprises 400 speaker acceptance test trials and 15600 speaker rejection test trials, and the equal error rate is obtained as the output of one experiment.
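For reference, the equal error rate reported by each experiment can be computed from the acceptance-test and rejection-test scores roughly as follows (a simple threshold sweep over numpy score arrays; the GMM-UBM scoring itself is outside this sketch):

```python
import numpy as np

def equal_error_rate(accept_scores, reject_scores):
    """EER: the operating point where the rate of falsely rejected true
    speakers equals the rate of falsely accepted impostors."""
    thresholds = np.sort(np.concatenate([accept_scores, reject_scores]))
    eer, best_gap = 1.0, np.inf
    for th in thresholds:
        frr = np.mean(accept_scores < th)    # true speaker rejected
        far = np.mean(reject_scores >= th)   # impostor accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer
```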
For the voiceprint features obtained from the voice data, each section of test voice generates multiple frames of voice segments. The MFCC order is set to 15, so one frame of voice data generates 15 MFCC coefficients and, after calculation, 15 dynamic feature coefficients, giving 30 coefficients per frame after combination. The sampling frequency in the experiment was 16 kHz, and the frame shift was half the frame length.
According to the experimental conditions, δ takes 5 different values, and 5 experiments are carried out respectively, obtaining the average equal error rate data shown in Table 1:
TABLE 1
(Table 1: average equal error rates for the five tested values of δ; presented as an image in the original publication.)
Based on the data shown in Table 1, the curve of the average equal error rate against the ratio δ of the dynamic feature to the static feature can be obtained, as shown in FIG. 2.
As can be seen from FIG. 2, when δ is 1 the average equal error rate is the lowest, so the optimal value of the ratio δ of the dynamic feature to the static feature is 1.
Accordingly, the dynamic voiceprint feature Mel-frequency cepstrum coefficient formula proposed by the invention becomes:
$$\mathrm{MFCC}_{\mathrm{new}}=\alpha\left(\mathrm{MFCC}+\Delta\mathrm{MFCC}\right)$$
according to the experimental conditions, α takes 5 different values, and 5 experiments are performed respectively to obtain average equal error rate data as shown in table 2:
TABLE 2
(Table 2: average equal error rates for the five tested values of α; presented as an image in the original publication.)
Based on the data shown in Table 2, the curve of the average equal error rate against the static feature coefficient α can be obtained, as shown in FIG. 3.
As can be seen from FIG. 3, when α is 0.5 the average equal error rate is the lowest, so the optimal value of the static feature coefficient is 0.5.
Accordingly, the dynamic voiceprint feature Mel-frequency cepstrum coefficient formula proposed by the invention becomes:
$$\mathrm{MFCC}_{\mathrm{new}}=0.5\left(\mathrm{MFCC}+\Delta\mathrm{MFCC}\right)$$
Formula (5) gives the dynamic feature parameter, namely ΔMFCC = d(l, t); MFCC is the static feature parameter, namely MFCC = C(l, t). The two are added with a weight of 0.5 each, so as to obtain the dynamic feature extraction formula fused with the static component:
$$d(l,t)=0.5\,C(l,t)+0.5\cdot\frac{\sum_{k=1}^{K}k\,\bigl(C(l,t+k)-C(l,t-k)\bigr)}{2\sum_{k=1}^{K}k^{2}}$$
and (5) arranging to obtain a dynamic feature extraction formula fused with the static component:
$$d(l,t)=\frac{1}{2}\left[C(l,t)+\frac{\sum_{k=1}^{K}k\,\bigl(C(l,t+k)-C(l,t-k)\bigr)}{2\sum_{k=1}^{K}k^{2}}\right]$$
namely, the constructed dynamic voiceprint feature extraction model fused into the static component is as follows:
$$d(l,t)=\frac{1}{2}\left[C(l,t)+\frac{\sum_{k=1}^{K}k\,\bigl(C(l,t+k)-C(l,t-k)\bigr)}{2\sum_{k=1}^{K}k^{2}}\right]$$
where d(l, t) is the extraction result of the l-th order dynamic voiceprint feature of the t-th frame of voice data, and constitutes the t-th element of the l-th order in the MFCC dynamic feature difference parameter matrix of the target voice data, that is, d(l, t) is the t-th parameter of the l-th order of the MFCC dynamic feature difference parameter matrix; C(l, t) is the t-th parameter of the l-th order in the MFCC coefficients, C(l, t+1) is the (t+1)-th parameter of the l-th order, C(l, t+k) is the (t+k)-th parameter of the l-th order, and C(l, t-k) is the (t-k)-th parameter of the l-th order; k is the frequency ordinal after the Fourier transform of the t-th frame of voice data, and K is the preset total step length of the Fourier transform of the t-th frame of voice data.
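Since the constructed model is a weighted sum of the static coefficients and the classical difference, a sketch on top of the delta() helper given earlier is direct (K = 2 is an assumed total step length):

```python
def fused_dynamic_feature(C, K=2):
    """d(l, t) = 0.5 * C(l, t) + 0.5 * classical difference of C,
    i.e. the dynamic voiceprint feature with the static component fused in."""
    return 0.5 * C + 0.5 * delta(C, J=K)
```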
For the constructed dynamic voiceprint feature extraction model fused with the static component, the following formula is adopted:
$$C(l,t)=\sqrt{\frac{2}{M}}\sum_{m=1}^{M}S(m)\cos\!\left(\frac{\pi l\,(m-0.5)}{M}\right),\qquad l=1,2,\dots,L$$
the l-th order characteristic coefficient C(l, t) of the t-th frame of voice data in the MFCC coefficients is obtained;
where L is the order of the MFCC coefficients, m is the serial number of the Mel filter, and S(m) is the logarithmic energy output by the m-th Mel filter.
According to the following formula:
$$S(m)=\ln\!\left(\sum_{k=1}^{N}\lvert X(k)\rvert^{2}\,H_{m}(k)\right),\qquad 1\le m\le M$$
the logarithmic energy S(m) output by the m-th Mel filter is obtained;
where M represents the total number of filters in the Mel filter bank, N represents the data length of the t-th frame of voice data, |X(k)|² represents the power corresponding to the k-th frequency, and H_m(k) represents the transfer function of the m-th Mel filter at the k-th frequency.
Based on the above model and method, the static feature parameters can first be calculated from parameters such as the Mel cepstrum coefficient matrix and the audio duration, and the dynamic feature parameters fused with the static component can then be calculated for voiceprint recognition.
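Putting the sketches above together, a hypothetical end-to-end run of the feature extraction (all sizes are assumptions: 16 kHz audio, 512-sample frames with half-frame shift, M = 24 filters, L = 15 coefficients) could look like:

```python
import numpy as np

fs = 16000
x = np.random.randn(fs * 3)                      # stand-in for 3 s of speech

frames = frame_signal(x, frame_len=512, frame_shift=256)
pspec = power_spectrum(frames, nfft=512)
fbank = mel_filterbank(num_filters=24, nfft=512, sample_rate=fs)
S = log_mel_energies(pspec, fbank)
C = mfcc_from_log_energies(S, num_ceps=15)       # static MFCC, 15 per frame
d = fused_dynamic_feature(C, K=2)                # fused dynamic feature
features = np.hstack([C, d])                     # 30 coefficients per frame
```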
In voiceprint recognition algorithms, a Gaussian mixture model (GMM) and a universal background model (UBM) are commonly used to model the voiceprint features of a speaker; the main steps are inputting the training voice of the Gaussian mixture model, voice preprocessing, voiceprint feature extraction, inputting the universal background model parameters, constructing the Gaussian mixture model, and storing the Gaussian mixture model parameters. Generally, the classical dynamic feature extraction algorithm is adopted in the voiceprint feature extraction step; the invention improves this step by fusing a static component into the calculation of the dynamic feature extraction parameters, which improves the performance of the voiceprint recognition algorithm.
The above description is only a preferred embodiment of the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be considered as the protection scope of the present invention.

Claims (5)

1. A dynamic voiceprint feature extraction method fused with static components is used for carrying out voiceprint feature extraction on target voice data, and is characterized by comprising the following steps:
step 1: preprocessing the target voice data to obtain preprocessed target voice data;
step 2: processing the preprocessed target voice by using a Fourier transform and a Mel filter bank to obtain the MFCC coefficients of the target voice data;
step 3: substituting the MFCC coefficients of the target voice data into the dynamic voiceprint feature extraction model merged into the static component, acquiring the MFCC dynamic feature difference parameter matrix of the target voice data, and defining the matrix as the dynamic voiceprint features of the target voice data.
2. The method for extracting dynamic voiceprint features merged into a static component according to claim 1, wherein in step 1, the method for preprocessing the target voice data comprises: dividing the target voice data into T frames to acquire multiple frames of voice data;
in step 2, the method for processing the preprocessed target voice by using the Fourier transform and the Mel filter bank comprises the following steps:
processing each frame of voice data by using the Fourier transform to obtain the frequency spectrum of each frame of voice data;
inputting the frequency spectrum of each frame of voice data into the Mel filter bank to obtain the MFCC coefficients of each frame of voice data, namely the MFCC coefficients of the target voice data.
3. The method according to claim 2, wherein in step 3, the dynamic voiceprint feature extraction model merged into the static component is:
$$d(l,t)=\frac{1}{2}\left[C(l,t)+\frac{\sum_{k=1}^{K}k\,\bigl(C(l,t+k)-C(l,t-k)\bigr)}{2\sum_{k=1}^{K}k^{2}}\right]$$
where d(l, t) is the extraction result of the l-th order dynamic voiceprint feature of the t-th frame of voice data and constitutes the t-th element of the l-th order in the MFCC dynamic feature difference parameter matrix of the target voice data; C(l, t) is the t-th parameter of the l-th order in the MFCC coefficients, C(l, t+1) is the (t+1)-th parameter of the l-th order, C(l, t+k) is the (t+k)-th parameter of the l-th order, and C(l, t-k) is the (t-k)-th parameter of the l-th order; k is the frequency ordinal after the Fourier transform of the t-th frame of voice data, and K is the preset total step length of the Fourier transform of the t-th frame of voice data.
4. The method for extracting dynamic voiceprint features merged into a static component according to claim 3, wherein, according to the following formula:
$$C(l,t)=\sqrt{\frac{2}{M}}\sum_{m=1}^{M}S(m)\cos\!\left(\frac{\pi l\,(m-0.5)}{M}\right),\qquad l=1,2,\dots,L$$
the l-th order characteristic coefficient C(l, t) of the t-th frame of voice data in the MFCC coefficients is acquired;
where L is the order of the MFCC coefficients, m is the serial number of the Mel filter, and S(m) is the logarithmic energy output by the m-th Mel filter.
5. The method for extracting dynamic voiceprint features merged into a static component according to claim 4, wherein, according to the following formula:
$$S(m)=\ln\!\left(\sum_{k=1}^{N}\lvert X(k)\rvert^{2}\,H_{m}(k)\right),\qquad 1\le m\le M$$
the logarithmic energy S(m) output by the m-th Mel filter is obtained;
where M represents the total number of filters in the Mel filter bank, N represents the data length of the t-th frame of voice data, |X(k)|² represents the power corresponding to the k-th frequency, and H_m(k) represents the transfer function of the m-th Mel filter at the k-th frequency.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110257723.XA CN112951245B (en) 2021-03-09 2021-03-09 Dynamic voiceprint feature extraction method integrated with static component


Publications (2)

Publication Number Publication Date
CN112951245A true CN112951245A (en) 2021-06-11
CN112951245B CN112951245B (en) 2023-06-16

Family

ID=76228612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110257723.XA Active CN112951245B (en) 2021-03-09 2021-03-09 Dynamic voiceprint feature extraction method integrated with static component

Country Status (1)

Country Link
CN (1) CN112951245B (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA1246745A (en) * 1985-03-25 1988-12-13 Melvyn J. Hunt Man/machine communications system using formant based speech analysis and synthesis
CA2158847A1 (en) * 1993-03-25 1994-09-29 Mark Pawlewski A Method and Apparatus for Speaker Recognition
KR100779242B1 (en) * 2006-09-22 2007-11-26 (주)한국파워보이스 Speaker recognition methods of a speech recognition and speaker recognition integrated system
CN102290048A (en) * 2011-09-05 2011-12-21 南京大学 Robust voice recognition method based on MFCC (Mel frequency cepstral coefficient) long-distance difference
CN102982803A (en) * 2012-12-11 2013-03-20 华南师范大学 Isolated word speech recognition method based on HRSF and improved DTW algorithm
US20170365259A1 (en) * 2015-02-05 2017-12-21 Beijing D-Ear Technologies Co., Ltd. Dynamic password voice based identity authentication system and method having self-learning function
CN104835498A (en) * 2015-05-25 2015-08-12 重庆大学 Voiceprint identification method based on multi-type combination characteristic parameters
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
US20200135171A1 (en) * 2017-02-28 2020-04-30 National Institute Of Information And Communications Technology Training Apparatus, Speech Synthesis System, and Speech Synthesis Method
CN107610708A (en) * 2017-06-09 2018-01-19 平安科技(深圳)有限公司 Identify the method and apparatus of vocal print
CN107993663A (en) * 2017-09-11 2018-05-04 北京航空航天大学 A kind of method for recognizing sound-groove based on Android
CN109256138A (en) * 2018-08-13 2019-01-22 平安科技(深圳)有限公司 Auth method, terminal device and computer readable storage medium
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN110428841A (en) * 2019-07-16 2019-11-08 河海大学 A kind of vocal print dynamic feature extraction method based on random length mean value
CN111489763A (en) * 2020-04-13 2020-08-04 武汉大学 Adaptive method for speaker recognition in complex environment based on GMM model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
岳倩倩; 周萍; 景新幸: "Research on an Auditory Feature Extraction Algorithm Based on a Nonlinear Power Function", 微电子学与计算机 [Microelectronics & Computer], no. 06 *
申小虎; 万荣春; 张新野: "A Speaker Speech Recognition *** with Improved Dynamic Feature Parameters", 计算机仿真 [Computer Simulation], no. 04 *
赵青; 成谢锋; 朱冬梅: "Cough Sound Identity Recognition Based on Improved MFCC and Short-Time Energy", 计算机技术与发展 [Computer Technology and Development], no. 06 *
郭春霞: "Research on Speaker Recognition Algorithms", 西安邮电学院学报 [Journal of Xi'an University of Posts and Telecommunications], no. 05 *
魏丹芳; 李应: "Environmental Sound Classification Based on MFCC and Weighted Dynamic Feature Combination", 计算机与数字工程 [Computer & Digital Engineering], no. 02 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689863A (en) * 2021-09-24 2021-11-23 广东电网有限责任公司 Voiceprint feature extraction method, device, equipment and storage medium
CN113689863B (en) * 2021-09-24 2024-01-16 广东电网有限责任公司 Voiceprint feature extraction method, voiceprint feature extraction device, voiceprint feature extraction equipment and storage medium
CN115762529A (en) * 2022-10-17 2023-03-07 国网青海省电力公司海北供电公司 Method for preventing cable from being broken outside by using voice recognition perception algorithm



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant