CN112951245B - Dynamic voiceprint feature extraction method integrated with static component - Google Patents
- Publication number: CN112951245B
- Application number: CN202110257723.XA
- Authority
- CN
- China
- Prior art keywords: voice data, dynamic, mfcc, target voice, frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L17/02 — Speaker identification or verification techniques; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L25/24 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00, characterised by the extracted parameters being the cepstrum
Abstract
The invention discloses a dynamic voiceprint feature extraction method that incorporates a static component. Target voice data are preprocessed, and the preprocessed voice is processed with a Fourier transform and a Mel filter bank to obtain the MFCC coefficients of the target voice data. These MFCC coefficients are then substituted into a dynamic voiceprint feature extraction model that incorporates the static component, yielding the MFCC dynamic-feature differential parameter matrix of the target voice data, which is defined as the dynamic voiceprint feature of the target voice data. When extracting voiceprint features from voice data, the method preserves sound continuity while reducing the average equal error rate and improving the recognition rate.
Description
Technical Field
The invention relates to the technical field of artificial-intelligence voiceprint recognition, and in particular to a dynamic voiceprint feature extraction method that incorporates a static component.
Background
At present, smart homes are increasingly used in people's daily life and work. They employ technologies such as wireless communication, image processing, and voice processing; a smart home system based on voice interaction is more convenient to use, gathers information over a wider space, and offers a friendlier user experience.
Voiceprint recognition has developed rapidly in recent years. In some settings its recognition rate already meets basic security requirements, and its economy and convenience give it a very broad application prospect. Suppressing external noise as much as possible and extracting speech features that are as pure as possible from the collected signal is a precondition for putting any speech processing technique into practical use.
As living standards rise rapidly, the public no longer expects a smart home system merely to execute standard, common control functions, but to make the whole home more intelligent, convenient, safe, and comfortable. Adding a voiceprint recognition function to the smart home system, together with speech enhancement to stabilize it in noisy environments, further improves the human-machine interaction experience and the efficiency with which users operate the system. The system can also define permission levels for control and operation, providing different service functions to users of different authority levels and thereby further improving overall security and practicality. However, the speech recognition and speech feature extraction methods of the prior art suffer from a high average error rate and a low recognition rate.
Therefore, in order to further reduce the average error rate and improve the recognition rate, the invention provides a dynamic voiceprint feature extraction method that incorporates a static component.
Disclosure of Invention
The purpose of the invention is to provide a dynamic voiceprint feature extraction method with a low average error rate and a high recognition rate.
The technical scheme is as follows: the invention provides a dynamic voiceprint feature extraction method integrated with static components, which is used for extracting voiceprint features of target voice data and is characterized by comprising the following steps:
step 1: preprocessing target voice data to obtain preprocessed target voice data;
step 2: processing the preprocessed target voice by using Fourier transform and Mel filter bank to obtain MFCC coefficients of target voice data;
step 3: and carrying the MFCC coefficients of the target voice data into a dynamic voiceprint feature extraction model integrated with the static component, obtaining an MFCC dynamic feature differential parameter matrix of the target voice data, and defining the matrix as the dynamic voiceprint feature of the target voice data.
As a preferred embodiment of the present invention, in step 1, a method for preprocessing target voice data includes: dividing target voice data into T frames to obtain multi-frame voice data;
in step 2, the method for processing the preprocessed target speech using a fourier transform and a Mel filter bank comprises the steps of:
processing each frame of voice data by using Fourier transformation to obtain the frequency spectrum of each frame of voice data;
the frequency spectrum of each frame of voice data is input into a Mel filter bank, and the MFCC coefficient of each frame of voice data, namely the MFCC coefficient of target voice data, is obtained.
As a preferred solution of the present invention, in step 3, the dynamic voiceprint feature extraction model incorporating the static component is:

d(l,t) = 0.5·C(l,t) + 0.5·[ Σ_{k=1}^{K} k·( C(l,t+k) − C(l,t−k) ) ] / ( 2·Σ_{k=1}^{K} k² )

wherein d(l,t) is the l-th-order dynamic voiceprint feature extraction result for the t-th frame of voice data, and the d(l,t) form the t-th element of the l-th order in the MFCC dynamic-feature differential parameter matrix of the target voice data; C(l,t) is the t-th parameter of the l-th order in the MFCC coefficients, and C(l,t+k) and C(l,t−k) are the (t+k)-th and (t−k)-th parameters of the l-th order; k is the step index of the difference, and K is the preset total step length of the difference for the t-th frame of voice data.
As a preferred embodiment of the present invention, the l-th-order characteristic coefficient C(l,t) of the t-th frame of voice data in the MFCC coefficients is acquired using the following formula:

C(l,t) = Σ_{m=1}^{M} S(m)·cos( π·l·(m − 0.5)/M ), 1 ≤ l ≤ L

where L is the total order of the MFCC coefficients, m is the number of the Mel filter, and S(m) is the logarithmic energy output by the m-th Mel filter.
As a preferred embodiment of the present invention, the logarithmic energy S(m) output by the m-th Mel filter is obtained using the following formula:

S(m) = ln( Σ_{k=0}^{N−1} |X(k)|²·H_m(k) ), 1 ≤ m ≤ M

where M is the total number of filters in the filter bank, N is the data length of the t-th frame of voice data, |X(k)|² is the power corresponding to the k-th frequency, and H_m(k) is the transfer function of the m-th Mel filter at the k-th frequency.
Beneficial effects: compared with the prior art, the dynamic voiceprint feature extraction method provided by the invention extracts voiceprint features with a model that blends the static component into the dynamic features, thereby reducing the average equal error rate and improving the recognition rate while preserving sound continuity.
Drawings
FIG. 1 is a flow chart of a dynamic voiceprint feature extraction method provided in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of the equal error rate as a function of the ratio of dynamic to static features, provided according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the equal error rate as a function of the static feature coefficient, according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Referring to fig. 1, the method for extracting dynamic voiceprint features incorporated into static components provided by the present invention includes the following steps:
step 1: preprocessing target voice data to obtain preprocessed target voice data.
The method for preprocessing the target voice data comprises the following steps: dividing target voice data into T frames to obtain multi-frame voice data;
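The framing step above can be sketched as follows; this is an illustrative sketch, not the patent's implementation — the frame length here is arbitrary, and only the 1/2-frame shift is taken from the experimental settings described later:

```python
def split_frames(signal, frame_len, frame_shift):
    """Split a signal into overlapping frames; trailing partial frames are dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += frame_shift
    return frames

# 1/2-frame shift, as in the patent's experiments (frame length here is arbitrary).
frames = split_frames(list(range(10)), frame_len=4, frame_shift=2)
```

Each returned frame is then processed independently in step 2.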
step 2: and processing the preprocessed target voice by using Fourier transformation and a Mel filter bank to acquire MFCC coefficients of the target voice data.
The method for processing the preprocessed target speech using a fourier transform and a Mel-filter bank comprises the steps of:
processing each frame of voice data by using Fourier transformation to obtain the frequency spectrum of each frame of voice data;
the frequency spectrum of each frame of voice data is input into a Mel filter bank, and the MFCC coefficient of each frame of voice data, namely the MFCC coefficient of target voice data, is obtained.
The method of the step 1 and the step 2 specifically comprises the following steps:
the extraction of Mel-frequency cepstrum coefficients (MFCCs) is performed on data that has been subjected to speech preprocessing, and desired characteristic coefficients are obtained by performing operations such as fourier transform, mel (Mel) filter filtering, and the like on the data.
(1) Perform a Fourier transform on each frame of the preprocessed voice data to obtain its spectrum and the power spectrum |X_j(k)|² of each frame. X_j(k) is given by:

X_j(k) = Σ_{n=0}^{N−1} x_j(n)·e^(−i·2πkn/N), k = 0, 1, …, N−1

where N is the length of each frame, J is the total number of frames, j (1 ≤ j ≤ J) denotes the j-th frame, x_j(n) is the n-th sample of voice data in the j-th frame, and k is the frequency index.
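A minimal sketch of this step, written as a naive DFT for clarity (a practical implementation would use an FFT; the function name and the impulse test signal are illustrative):

```python
import cmath

def power_spectrum(frame):
    """Naive DFT: returns |X(k)|^2 for k = 0..N-1 (an FFT would be used in practice)."""
    N = len(frame)
    spec = []
    for k in range(N):
        X_k = sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        spec.append(abs(X_k) ** 2)
    return spec

# A unit impulse has a flat spectrum: |X(k)|^2 = 1 for every k.
ps = power_spectrum([1.0, 0.0, 0.0, 0.0])
```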
(2) Design a Mel filter bank and filter the power spectrum of the signal through the configured bank, then take the logarithm, converting the frequency scale to the Mel frequency scale. The center frequency f(m) of the m-th filter in the bank satisfies the following formula:
Mel(f(m+1))-Mel(f(m))=Mel(f(m))-Mel(f(m-1))
where m is the number of filters in the filter bank, and Mel (f (m)) is the operation of converting the frequency f (m) into a Mel frequency.
The transfer function H_m(f) of each band-pass filter in the Mel filter bank is the triangular function

H_m(f) = 0, for f < f(m−1);
H_m(f) = ( f − f(m−1) ) / ( f(m) − f(m−1) ), for f(m−1) ≤ f ≤ f(m);
H_m(f) = ( f(m+1) − f ) / ( f(m+1) − f(m) ), for f(m) ≤ f ≤ f(m+1);
H_m(f) = 0, for f > f(m+1);

where f is the frequency.
After the voice data are processed by the Mel filters, the logarithmic energy S(m) output by each filter is obtained:

S(m) = ln( Σ_{k=0}^{N−1} |X(k)|²·H_m(k) ), 1 ≤ m ≤ M

where m is the number of the filter, M is the total number of filters in the bank, generally 22–26 (the invention takes M = 24), |X(k)|² is the power at the k-th frequency, and H_m(k) is the transfer function of the m-th filter evaluated at the k-th frequency.
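The filter-bank construction and the log-energy computation can be sketched as follows. The `mel_filterbank` helper and its FFT-bin mapping are assumptions based on the standard triangular Mel filter design (equal spacing on the Mel scale, matching the center-frequency condition above), not code taken from the patent:

```python
import math

def hz_to_mel(f):
    # Mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M, N, fs):
    """Build M triangular filters H_m(k) over an N-point FFT at sample rate fs.
    Center frequencies are equally spaced on the Mel scale, satisfying
    Mel(f(m+1)) - Mel(f(m)) = Mel(f(m)) - Mel(f(m-1))."""
    mel_pts = [i * hz_to_mel(fs / 2.0) / (M + 1) for i in range(M + 2)]
    bins = [int((N + 1) * mel_to_hz(p) / fs) for p in mel_pts]
    H = [[0.0] * (N // 2 + 1) for _ in range(M)]
    for m in range(1, M + 1):
        for k in range(bins[m - 1], bins[m]):        # rising edge of the triangle
            H[m - 1][k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
        for k in range(bins[m], bins[m + 1]):        # falling edge of the triangle
            H[m - 1][k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])
    return H

def log_energy(power, H_m):
    # S(m) = ln( sum_k |X(k)|^2 * H_m(k) )
    return math.log(sum(p * h for p, h in zip(power, H_m)))

# M = 24 filters as in the invention; N and fs follow the 16 kHz experiments.
H = mel_filterbank(M=24, N=512, fs=16000)
```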
(3) Perform a discrete cosine transform on the logarithmic Mel power spectrum of each frame to decorrelate the power-spectrum energies, eliminating the correlation between the signal dimensions and mapping them to a low-dimensional space, which yields the corresponding MFCC coefficient C(l):

C(l) = Σ_{m=1}^{M} S(m)·cos( π·l·(m − 0.5)/M ), 1 ≤ l ≤ L

where L is the total order of the MFCC coefficients, typically taken as 12–18; the invention takes L = 15. l takes values from 1 to L and denotes the l-th order of the MFCC coefficients.
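This DCT decorrelation step can be sketched as follows (the function name is illustrative; M = 24 and L = 15 follow the values given above):

```python
import math

def mfcc_from_log_energies(S, L):
    """DCT decorrelation: C(l) = sum_{m=1}^{M} S(m) * cos(pi*l*(m-0.5)/M), l = 1..L."""
    M = len(S)
    return [sum(S[m - 1] * math.cos(math.pi * l * (m - 0.5) / M)
                for m in range(1, M + 1))
            for l in range(1, L + 1)]

# With constant filter energies the cosine terms cancel, so every C(l) is ~0.
C = mfcc_from_log_energies([1.0] * 24, L=15)
```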
Step 3: and carrying the MFCC coefficients of the target voice data into a dynamic voiceprint feature extraction model integrated with the static component, obtaining an MFCC dynamic feature differential parameter matrix of the target voice data, and defining the matrix as the dynamic voiceprint feature of the target voice data.
In step 3, a dynamic voiceprint feature extraction model incorporating static components is constructed according to the following method:
the dynamic feature extraction is essentially a MFCC coefficient differential mode, i.e. the parameters of the t-1 th frame and the t+1 th frame are used for downsampling when calculating the MFCC coefficient differential parameters of the t-th frame. Therefore, the classical dynamic feature extraction formula is as follows:
wherein J represents the length of the fast Fourier transform, usually 1 or 2 is taken, and represents a first-order MFCC coefficient differential parameter and a second-order MFCC coefficient differential parameter, and J is the value of J (J is more than or equal to 1 and less than or equal to J); l is the mel cepstrum coefficient order, T is the frame number, T is the total frame number of a section of audio, C (l, T) is the first order T parameter of the mel cepstrum coefficient matrix of the voice signal, and d (l, T) is the MFCC dynamic characteristic parameter.
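The classical differential computation for a single cepstral order can be sketched as follows (the helper name is an assumption; K = 2 is one of the usual choices mentioned above, and frames where t ± k falls outside the sequence are not handled):

```python
def delta(C, t, K=2):
    """Classical first-order dynamic feature for frame t of one cepstral order:
    d(t) = sum_{k=1}^{K} k * (C[t+k] - C[t-k]) / (2 * sum_{k=1}^{K} k^2)."""
    num = sum(k * (C[t + k] - C[t - k]) for k in range(1, K + 1))
    den = 2.0 * sum(k * k for k in range(1, K + 1))
    return num / den

# On a linearly increasing cepstral track the difference recovers the slope.
slope = delta([0, 1, 2, 3, 4, 5], t=2)
```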
The novel dynamic voiceprint feature Mel frequency cepstrum coefficient formula proposed by the invention is:

MFCC_new = α·MFCC + β·ΔMFCC

where MFCC_new is the dynamic voiceprint feature proposed by the invention, MFCC is the static voiceprint feature, ΔMFCC is the classical dynamic voiceprint feature (the differential dynamic parameter), α is the static feature coefficient, β is the dynamic feature coefficient, and δ = β/α is the ratio of the dynamic feature coefficient to the static feature coefficient.
The α and δ values are determined according to the following method:
assuming α=1, the optimal value of the ratio δ of dynamic coefficient to static coefficient is determined experimentally.
The number of Gaussian components in the experiment was set to 64, and speech data from 100 speakers (50 female and 50 male) in the TIMIT corpus were selected as experimental data. Speech from 60 speakers was selected as training data for the UBM: ten utterances per speaker were concatenated into 10 seconds of speech for UBM training, and the resulting UBM parameters were saved. Then, for each of the remaining 40 speakers, five utterances were concatenated into 10 seconds of speech data to train that speaker's GMM, and the obtained model parameters were saved. Finally, the remaining speech of the 40 speakers was cyclically assembled into ten 5-second segments per speaker for matching tests of the system. The complete test process comprises 400 speaker-acceptance trials and 15600 speaker-rejection trials, and the equal error rate is obtained as the output of one experiment.
For the voiceprint features obtained from the voice data, each test utterance generates multiple frames of speech. The MFCC order is set to 15, so each frame of voice data yields 15 MFCC coefficients and, after the difference calculation, 15 dynamic feature coefficients; combined, each speech frame yields 30 coefficients. The sampling frequency in the experiment is 16 kHz, and the frame shift is 1/2 of the frame length.
According to the experimental conditions, δ takes 5 different values and 5 experiments are performed; the resulting average equal error rates are shown in Table 1:
TABLE 1
From the data shown in Table 1, the curve of the average equal error rate against the different dynamic-to-static feature ratios δ is obtained, as shown in fig. 2.
As can be seen from fig. 2, the average equal error rate is lowest when δ = 1, so the optimal value of the dynamic-to-static feature ratio δ is 1.
Accordingly, the dynamic voiceprint feature Mel frequency cepstrum coefficient formula provided by the invention can be rewritten as:

MFCC_new = α·( MFCC + ΔMFCC )
according to the experimental conditions, α takes 5 different values, and 5 experiments are performed respectively, so that average error rate data are shown in table 2:
TABLE 2
From the data shown in Table 2, the curve of the average equal error rate against the different static feature coefficients α is obtained, as shown in FIG. 3.
As can be seen from fig. 3, the average equal error rate is lowest when α = 0.5, so the optimal value of the static feature coefficient is 0.5.
Accordingly, the dynamic voiceprint feature Mel frequency cepstrum coefficient formula provided by the invention becomes:

MFCC_new = 0.5·( MFCC + ΔMFCC )

Here ΔMFCC is the dynamic feature parameter given by the classical differential formula, and MFCC is the static feature parameter, i.e., C(l,t); adding the two with weight 0.5 and rearranging gives the dynamic feature extraction formula incorporating the static component:

d(l,t) = 0.5·C(l,t) + 0.5·[ Σ_{k=1}^{K} k·( C(l,t+k) − C(l,t−k) ) ] / ( 2·Σ_{k=1}^{K} k² )
the built dynamic voiceprint feature extraction model integrated with the static component is as follows:
d (l, t) is the first-order dynamic voiceprint feature extraction result of the t-th frame voice data, and d (l, t) forms the t-th element of the first-order in the MFCC dynamic feature differential parameter matrix of the target voice data, namely: d (l, t) is the first order t parameter of the MFCC dynamic characteristic differential parameter matrix; c (l, t) is the t parameter of the first order in the MFCC coefficient, C (l, t+1) is the t+1th parameter of the first order in the MFCC coefficient, C (l, t+k) is the t+kth parameter of the first order, C (l, t-K) is the t-kth parameter of the first order in the MFCC coefficient, K is the frequency ordinal number after Fourier transformation is performed on the t frame voice data, and K is the preset total step length when Fourier transformation is performed on the t frame voice data.
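The fused computation can be sketched as follows, using the α = 0.5 weight and the dynamic-to-static ratio of 1 derived in the experiments (the function name is illustrative, and edge frames where t ± k falls outside the sequence are not handled here):

```python
def fused_feature(C_row, t, K=2):
    """d(l,t) = 0.5*C(l,t) + 0.5*Delta(l,t): the classical difference with the
    static component blended in at equal weight (alpha = 0.5, ratio delta = 1)."""
    dyn = sum(k * (C_row[t + k] - C_row[t - k]) for k in range(1, K + 1)) \
        / (2.0 * sum(k * k for k in range(1, K + 1)))
    return 0.5 * C_row[t] + 0.5 * dyn

# Static component 2 and dynamic component 1 combine to 1.5 at frame t = 2.
d = fused_feature([0, 1, 2, 3, 4, 5], t=2)
```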
For the constructed dynamic voiceprint feature extraction model incorporating the static component, the l-th-order characteristic coefficient C(l,t) of the t-th frame of voice data in the MFCC coefficients is acquired according to the following formula:

C(l,t) = Σ_{m=1}^{M} S(m)·cos( π·l·(m − 0.5)/M ), 1 ≤ l ≤ L

where L is the total order of the MFCC coefficients, m is the number of the Mel filter, and S(m) is the logarithmic energy output by the m-th Mel filter.
The logarithmic energy S(m) output by the m-th Mel filter is obtained according to the following formula:

S(m) = ln( Σ_{k=0}^{N−1} |X(k)|²·H_m(k) ), 1 ≤ m ≤ M

where M is the total number of filters in the filter bank, N is the data length of the t-th frame of voice data, |X(k)|² is the power corresponding to the k-th frequency, and H_m(k) is the transfer function of the m-th Mel filter at the k-th frequency.
Based on the above model and method, given the Mel cepstrum coefficient matrix, the audio duration, and other parameters, the static feature parameters can be computed first, and the dynamic feature parameters incorporating the static component are then computed for voiceprint recognition.
In the voiceprint recognition algorithm, a Gaussian mixture model (GMM) and a universal background model (UBM) are used to model the speaker's voiceprint features; the main steps are GMM training-speech input, speech preprocessing, voiceprint feature extraction, UBM parameter input, GMM construction, and GMM parameter storage. Conventional voiceprint recognition algorithms mostly adopt the classical dynamic feature extraction algorithm in the feature extraction step; the invention improves this step by blending the static component into the dynamic feature parameters, improving the performance of the voiceprint recognition algorithm.
The foregoing is merely a preferred embodiment of the present invention. It will be apparent to those skilled in the art that modifications and variations can be made without departing from the technical principles of the present invention, and such modifications and variations also fall within the scope of the invention.
Claims (3)
1. A dynamic voiceprint feature extraction method integrated with static components is used for extracting voiceprint features of target voice data, and is characterized by comprising the following steps:
step 1: preprocessing target voice data to obtain preprocessed target voice data;
in step 1, the method for preprocessing target voice data includes: dividing target voice data into T frames to obtain multi-frame voice data;
step 2: processing the preprocessed target voice by using Fourier transform and Mel filter bank to obtain MFCC coefficients of target voice data;
in step 2, the method for processing the preprocessed target speech using a fourier transform and a Mel filter bank comprises the steps of:
processing each frame of voice data by using Fourier transformation to obtain the frequency spectrum of each frame of voice data;
inputting the frequency spectrum of each frame of voice data into a Mel filter bank, and obtaining the MFCC coefficient of each frame of voice data, namely the MFCC coefficient of target voice data;
step 3: substituting the MFCC coefficients of the target voice data into a dynamic voiceprint feature extraction model incorporating the static component, obtaining an MFCC dynamic-feature differential parameter matrix of the target voice data, and defining the matrix as the dynamic voiceprint feature of the target voice data;
in step 3, the dynamic voiceprint feature extraction model incorporating the static component is:

d(l,t) = 0.5·C(l,t) + 0.5·[ Σ_{k=1}^{K} k·( C(l,t+k) − C(l,t−k) ) ] / ( 2·Σ_{k=1}^{K} k² )

wherein d(l,t) is the l-th-order dynamic voiceprint feature extraction result for the t-th frame of voice data, and the d(l,t) form the t-th element of the l-th order in the MFCC dynamic-feature differential parameter matrix of the target voice data; C(l,t) is the t-th parameter of the l-th order in the MFCC coefficients, and C(l,t+k) and C(l,t−k) are the (t+k)-th and (t−k)-th parameters of the l-th order; k is the step index of the difference, and K is the preset total step length of the difference for the t-th frame of voice data.
2. The method for dynamic voiceprint feature extraction incorporating a static component of claim 1, wherein the l-th-order characteristic coefficient C(l,t) of the t-th frame of voice data in the MFCC coefficients is acquired based on the formula:

C(l,t) = Σ_{m=1}^{M} S(m)·cos( π·l·(m − 0.5)/M ), 1 ≤ l ≤ L

where L is the total order of the MFCC coefficients, m is the number of the Mel filter, and S(m) is the logarithmic energy output by the m-th Mel filter.
3. The method for dynamic voiceprint feature extraction incorporating a static component of claim 2, wherein the logarithmic energy S(m) output by the m-th Mel filter is obtained based on the formula:

S(m) = ln( Σ_{k=0}^{N−1} |X(k)|²·H_m(k) ), 1 ≤ m ≤ M

where M is the total number of filters in the filter bank, N is the data length of the t-th frame of voice data, |X(k)|² is the power corresponding to the k-th frequency, and H_m(k) is the transfer function of the m-th Mel filter at the k-th frequency.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110257723.XA | 2021-03-09 | 2021-03-09 | Dynamic voiceprint feature extraction method integrated with static component |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112951245A | 2021-06-11 |
| CN112951245B | 2023-06-16 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113689863B (en) * | 2021-09-24 | 2024-01-16 | 广东电网有限责任公司 | Voiceprint feature extraction method, voiceprint feature extraction device, voiceprint feature extraction equipment and storage medium |
CN115762529A (en) * | 2022-10-17 | 2023-03-07 | 国网青海省电力公司海北供电公司 | Method for preventing cable from being broken outside by using voice recognition perception algorithm |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA1246745A (en) * | 1985-03-25 | 1988-12-13 | Melvyn J. Hunt | Man/machine communications system using formant based speech analysis and synthesis |
CA2158847A1 (en) * | 1993-03-25 | 1994-09-29 | Mark Pawlewski | A Method and Apparatus for Speaker Recognition |
KR100779242B1 (en) * | 2006-09-22 | 2007-11-26 | (주)한국파워보이스 | Speaker recognition methods of a speech recognition and speaker recognition integrated system |
CN102290048A (en) * | 2011-09-05 | 2011-12-21 | 南京大学 | Robust voice recognition method based on MFCC (Mel frequency cepstral coefficient) long-distance difference |
WO2018107810A1 (en) * | 2016-12-15 | 2018-06-21 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus, and electronic device and medium |
CN109256138A (en) * | 2018-08-13 | 2019-01-22 | 平安科技(深圳)有限公司 | Auth method, terminal device and computer readable storage medium |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102982803A (en) * | 2012-12-11 | 2013-03-20 | 华南师范大学 | Isolated word speech recognition method based on HRSF and improved DTW algorithm |
CN104616655B (en) * | 2015-02-05 | 2018-01-16 | 北京得意音通技术有限责任公司 | The method and apparatus of sound-groove model automatic Reconstruction |
CN104835498B (en) * | 2015-05-25 | 2018-12-18 | 重庆大学 | Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter |
JP6860901B2 (en) * | 2017-02-28 | 2021-04-21 | 国立研究開発法人情報通信研究機構 | Learning device, speech synthesis system and speech synthesis method |
CN107610708B (en) * | 2017-06-09 | 2018-06-19 | 平安科技(深圳)有限公司 | Identify the method and apparatus of vocal print |
CN107993663A (en) * | 2017-09-11 | 2018-05-04 | 北京航空航天大学 | A kind of method for recognizing sound-groove based on Android |
CN108847244A (en) * | 2018-08-22 | 2018-11-20 | 华东计算技术研究所(中国电子科技集团公司第三十二研究所) | Voiceprint recognition method and system based on MFCC and improved BP neural network |
CN110428841B (en) * | 2019-07-16 | 2021-09-28 | 河海大学 | Voiceprint dynamic feature extraction method based on indefinite length mean value |
CN111489763B (en) * | 2020-04-13 | 2023-06-20 | 武汉大学 | GMM model-based speaker recognition self-adaption method in complex environment |
Non-Patent Citations (5)
Title |
---|
A speaker speech recognition system with improved dynamic feature parameters; 申小虎, 万荣春, 张新野; Computer Simulation (04), full text * |
Environmental sound classification based on MFCC and weighted dynamic feature combination; 魏丹芳, 李应; Computer & Digital Engineering (02), full text * |
Cough sound identity recognition based on improved MFCC and short-time energy; 赵青, 成谢锋, 朱冬梅; Computer Technology and Development (06), full text * |
Research on an auditory feature extraction algorithm based on a nonlinear power function; 岳倩倩, 周萍, 景新幸; Microelectronics & Computer (06), full text * |
Research on speaker recognition algorithms; 郭春霞; Journal of Xi'an University of Posts and Telecommunications (05), full text * |
Legal Events
| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |