CN109767756B - Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient - Google Patents

Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient

Info

Publication number
CN109767756B
CN109767756B (application number CN201910087494.4A)
Authority
CN
China
Prior art keywords
discrete cosine
cosine transform
inverse discrete
sound
persons
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910087494.4A
Other languages
Chinese (zh)
Other versions
CN109767756A (en
Inventor
左毅
马赫
李铁山
贺培超
刘君霞
艾佳琪
肖杨
于仁海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN201910087494.4A priority Critical patent/CN109767756B/en
Publication of CN109767756A publication Critical patent/CN109767756A/en
Priority to JP2019186806A priority patent/JP6783001B2/en
Application granted granted Critical
Publication of CN109767756B publication Critical patent/CN109767756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a sound feature extraction algorithm based on dynamically segmented inverse discrete cosine transform cepstrum coefficients, comprising the following steps: S1, pre-emphasis, framing and windowing preprocessing of the sound signal; S2, transformation of the preprocessed sound signal from the time domain to the frequency domain; S3, computing the similarity between the inverse discrete cosine transform cepstrum coefficients obtained in step S2 with a cluster analysis algorithm and successively merging the two adjacent classes with the greatest similarity, iterating this process until 24 classes remain; the resulting dynamically segmented inverse discrete cosine transform cepstrum coefficients are the sound features. The invention overcomes the shortcoming that the prior art does not fully exploit the dynamic characteristics of sound in the frequency-domain segmentation, so that the invention has wider applicability and achieves higher accuracy in speaker recognition.

Description

Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
Technical Field
The invention belongs to the technical field of sound feature extraction, applies an unsupervised cluster analysis algorithm to sound feature extraction, and particularly relates to a sound feature extraction algorithm based on dynamically segmented inverse discrete cosine transform cepstrum coefficients.
Background
Speaker recognition technology comprises two parts: feature extraction and recognition modeling. Feature extraction is a key step in speaker recognition and directly affects the overall performance of the recognition system. Generally, after a speech signal is framed and windowed, a large volume of high-dimensional data is produced, so when speaker features are extracted the redundant information in the original speech must be removed to reduce the data dimensionality. In existing methods, triangular filtering is used to convert the speech signal into feature vectors that satisfy the requirements on the feature parameters; such vectors approximate the auditory perception characteristics of the human ear and can, to a certain extent, enhance the speech signal and suppress non-speech signals. Commonly used feature parameters include: linear prediction coefficients (LPC), obtained by simulating the human phonation mechanism and analysing a model of the vocal tract as a cascade of short tubes; perceptual linear prediction coefficients, which apply an auditory model to spectral analysis, processing the input speech through a model of human hearing instead of the time-domain all-pole prediction polynomial used by LPC; Tandem and Bottleneck features, two types of features extracted with neural networks; filter-bank (Fbank) features, which are equivalent to MFCC with the final discrete cosine transform removed and therefore retain more of the original speech information than MFCC; linear prediction cepstrum coefficients, important feature parameters that, based on the vocal tract model, discard the excitation information of the signal generation process and represent the formant characteristics with a dozen or so cepstrum coefficients; and Mel-frequency cepstrum coefficients, whose extraction first preprocesses the speech by framing, windowing and the fast Fourier transform, then filters the energy spectrum with a bank of Mel-scale triangular filters, computes the logarithmic energy output by each filter, applies a discrete cosine transform (DCT) to obtain the Mel-scale cepstrum parameters, and finally extracts the dynamic difference parameters, giving the Mel cepstrum coefficients. In 2012, S. Al-Rawahya et al., referring to the MFCC feature extraction method, performed equal frequency-domain segmentation of the DCT cepstrum coefficients obtained after speech preprocessing and proposed the Histogram DCT cepstrum coefficient method. It is found that such equal frequency-domain segmentation of the cepstrum coefficients ignores the dynamic characteristics of the sound data; the invention therefore proposes a new sound feature extraction algorithm based on dynamically segmented inverse discrete cosine transform cepstrum coefficients, combining unsupervised learning and using hierarchical clustering to cluster the sound data according to the similarity of its dynamic characteristics, thereby extracting dynamic feature vectors that better describe the sound characteristics.
In existing research, one of the most widely used speaker recognition approaches uses MFCC as the voice feature vector and performs speaker pattern matching in combination with machine learning methods such as the Gaussian mixture model (GMM), hidden Markov model (HMM) and support vector machine (SVM). The MFCC extraction process is as follows: first, the speech is preprocessed by pre-emphasis, framing, windowing and the fast Fourier transform; the energy spectrum is then filtered with a bank of Mel-scale triangular filters; the logarithmic energy output by each filter is computed and passed through a discrete cosine transform (DCT) to obtain the Mel-scale cepstrum parameters, from which the dynamic difference parameters are extracted, yielding the Mel cepstrum coefficients (MFCC).
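For reference, the conventional MFCC pipeline described above can be sketched as follows. This is an illustrative sketch only: librosa is assumed as one possible toolkit, and the frame length, hop length and number of coefficients are assumed values rather than parameters taken from this patent.

```python
# Illustrative sketch of the conventional MFCC pipeline (prior art), assuming librosa.
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=None)       # load waveform at its native sampling rate
    return librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),                    # ~25 ms analysis window (assumed)
        hop_length=int(0.010 * sr),               # ~10 ms hop (assumed)
    )                                             # shape: (n_mfcc, n_frames)
```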
Al-Rawahya et al discovered DCT Cepstrum as a new feature in 2012, and the equal frequency domain DCT Cepstrum coefficient-based acoustic feature extraction algorithm proposed by the same is provided. And (2) converting the preprocessed sound signals into frequency domains, namely converting the preprocessed sound signals from time domain convolution into a frequency domain spectrum multiplication form, taking logarithms of the sound signals, and expressing obtained components in an addition form to obtain discrete cosine transform Cepstrum coefficients (DCT Cepstrum coefficients). The DCT cepstral coefficients record the periodicity of the frequency range in non-linear increments, dividing the frequency domain feature interval every 50Hz between 0Hz and 600Hz, and dividing the frequency domain feature interval every 100Hz between 600Hz and 1000Hz, which process can be viewed as a count of the number of frequency range cycles in a given speech signal. Compared with the MFCC feature extraction method, the method is simpler and faster.
Disclosure of Invention
The main purpose of the invention is to provide a sound feature extraction algorithm based on dynamically segmented inverse discrete cosine transform cepstrum coefficients, addressing the inaccurate segmentation frequencies of the sound feature extraction algorithm based on equal frequency-domain segmentation of the inverse discrete cosine transform cepstrum coefficients. The technical means adopted by the invention are as follows:
a sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficients comprises the following steps:
S1, preprocessing the sound signals:
pre-emphasis, framing and windowing are sequentially carried out on the sound signals;
Preprocessing eliminates the effects on sound-signal quality of aliasing, higher-harmonic distortion, high-frequency components and other factors introduced by the human vocal organs and by the equipment that captures the sound signal, so that the signal obtained in subsequent processing is more uniform and smooth, providing high-quality parameters for sound feature extraction and improving the quality of the subsequent processing.
S2, performing transformation form processing from time domain to frequency domain on the preprocessed sound signals:
The preprocessed sound signal is transformed to the frequency domain, i.e. the time-domain convolution is converted into a frequency-domain spectral multiplication; the logarithm is then taken so that the resulting components can be expressed as a sum, giving the inverse discrete cosine transform cepstrum coefficients (IDCT Cepstrum coefficients). The specific process is given by the following formula:
C(q) = IDCT(log|DCT{x(k)}|);
where DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the input sound signal, i.e. the preprocessed sound signal, and C(q) is the output, i.e. the inverse discrete cosine transform cepstrum coefficients.
The inverse discrete cosine transform cepstrum coefficients form a data matrix; because of the inherent frequency ordering of sound, all columns are of the same attribute type, so hierarchical clustering is carried out sequentially by computing the similarity of adjacent columns.
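As a minimal sketch (not part of the patent text) of the per-frame computation C(q) = IDCT(log|DCT{x(k)}|), the following assumes SciPy's orthonormal DCT-II/IDCT pair and a small epsilon added before the logarithm to avoid log(0):

```python
# Minimal per-frame IDCT cepstrum sketch; the DCT type and epsilon are assumptions.
import numpy as np
from scipy.fft import dct, idct

def idct_cepstrum(frame, eps=1e-10):
    spectrum = dct(frame, type=2, norm='ortho')   # time domain -> DCT spectrum, DCT{x(k)}
    log_mag = np.log(np.abs(spectrum) + eps)      # log|DCT{x(k)}|
    return idct(log_mag, type=2, norm='ortho')    # C(q), the IDCT cepstrum coefficients
```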
S3, calculating the similarity between the inverse discrete cosine transform cepstrum coefficients obtained in step S2 with a cluster analysis algorithm, and successively merging the two adjacent classes with the greatest similarity; this process is iterated until 24 classes remain, and the resulting dynamically segmented inverse discrete cosine transform cepstrum coefficients (DD-IDCT Cepstrum coefficients) are the sound features.
The pre-emphasis is realized by a digital filter, and the specific process is carried out by the following formula:
Y(n) = X(n) - aX(n-1);
where Y(n) is the pre-emphasized output signal, X(n) is the input sound signal, a is the pre-emphasis coefficient, and n is the time index.
The average power spectrum of the sound signal is affected by glottal excitation and oral-nasal radiation: above roughly 800 Hz the high-frequency end falls off at about 6 dB/oct (octave), so the higher the frequency the smaller the corresponding component; the high-frequency part of the sound signal is therefore boosted before analysis.
Sound analysis is throughout a "short-time analysis" technique. A sound signal is time-varying, but within a short interval (generally 10-30 ms) its characteristics remain essentially unchanged, i.e. relatively stable, so the signal can be regarded as a quasi-steady-state process; in other words, sound signals possess short-time stationarity. Any analysis and processing of a sound signal must therefore be built on "short-time" segments, i.e. a "short-time analysis" is performed: the signal is divided into segments, each called a "frame" and generally 10-30 ms long, whose characteristic parameters are analysed. For the whole sound signal, the analysis then operates on the time sequence of characteristic parameters formed by the parameters of each frame.
The framing segments the pre-emphasized output signal into frames of 20 ms each.
Windowing is then applied to the framed sound signal; its purpose can be regarded as making the signal more globally continuous, avoiding the Gibbs effect, and giving the originally aperiodic sound signal some of the properties of a periodic function. The windowing uses a Hamming window.
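A minimal preprocessing sketch consistent with the description above; the 16 kHz sampling rate, the pre-emphasis coefficient a = 0.97 (the value used in the embodiment below) and the use of non-overlapping frames are assumptions made only for illustration:

```python
# Preprocessing sketch: pre-emphasis, 20 ms framing, Hamming windowing.
import numpy as np

def preprocess(x, sr=16000, a=0.97, frame_ms=20):
    y = np.append(x[0], x[1:] - a * x[:-1])        # pre-emphasis Y(n) = X(n) - a*X(n-1)
    frame_len = int(sr * frame_ms / 1000)          # 20 ms frame length in samples
    n_frames = len(y) // frame_len
    frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.hamming(frame_len)          # Hamming window applied to every frame
```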
The transformation is in the form of a cepstral transform.
The cluster analysis algorithm is a hierarchical clustering algorithm.
The similarity calculation is the Euclidean distance.
Compared with the prior art, the invention has the following advantages:
Firstly, by analysing in depth the nature of the sound feature extraction algorithm based on equal frequency-domain segmentation of DCT Cepstrum coefficients, the invention remedies the shortcoming that the prior art does not fully exploit the dynamic characteristics of sound in the frequency-domain segmentation, so that the invention has wider applicability and achieves higher recognition accuracy in speaker recognition.
Secondly, unsupervised cluster analysis is applied to sound feature extraction, giving a process that is simple, fast and light on computing resources.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a sound feature extraction algorithm based on a dynamic segmentation inverse discrete cosine transform cepstrum coefficient in an embodiment of the present invention.
FIG. 2 is a diagram of a cluster analysis tree in accordance with an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a sound feature extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficients has the following steps:
S1, preprocessing the sound signals:
pre-emphasis, framing and windowing are sequentially carried out on the sound signals;
the pre-emphasis is realized by a digital filter, and the specific process is carried out by the following formula:
Y(n) = X(n) - aX(n-1);
where Y(n) is the pre-emphasized output signal, X(n) is the input sound signal, a is the pre-emphasis coefficient with a value of 0.97, and n is the time index.
The framing segments the pre-emphasized output signal into frames of 20 ms each.
The windowing uses a Hamming window.
S2, performing transformation form processing from time domain to frequency domain on the preprocessed sound signals:
The preprocessed sound signal is transformed to the frequency domain, i.e. the time-domain convolution is converted into a frequency-domain spectral multiplication; the logarithm is then taken so that the resulting components can be expressed as a sum, giving the inverse discrete cosine transform cepstrum coefficients (IDCT Cepstrum coefficients). The specific process is given by the following formula:
C(q) = IDCT(log|DCT{x(k)}|);
where DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the input sound signal, i.e. the preprocessed sound signal, and C(q) is the output, i.e. the inverse discrete cosine transform cepstrum coefficients; the transformation takes the form of a cepstral transform.
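For illustration only, the coefficient matrix consumed by step S3 can be assembled by stacking the per-frame cepstrum vectors; the orientation (rows as frames, columns as coefficient dimensions) and the reuse of the preprocess() and idct_cepstrum() sketches given earlier are assumptions, not part of the patent text:

```python
# Sketch: build the coefficient matrix A from a raw signal.
import numpy as np

def build_coefficient_matrix(signal, sr=16000):
    frames = preprocess(signal, sr=sr)                  # (n_frames, frame_len), see the S1 sketch
    A = np.vstack([idct_cepstrum(f) for f in frames])   # one IDCT cepstrum vector per row
    return A                                            # shape: (n_frames, n)
```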
S3, calculating the similarity between the inverse discrete cosine transform cepstrum coefficients obtained in step S2 with a cluster analysis algorithm, and successively merging the two adjacent classes with the greatest similarity; this process is iterated until 24 classes remain, and the resulting dynamically segmented inverse discrete cosine transform cepstrum coefficients are the sound features. The specific steps are as follows:
Let the matrix A denote the m x n inverse discrete cosine transform cepstrum coefficient matrix obtained in step S2; as shown in FIG. 2, each dimension of the coefficients is treated as a column vector, giving n vectors V_1, V_2, ..., V_n, and the Euclidean distance between V_i and V_j is defined as

Dis(V_i, V_j) = sqrt( sum_k ( V_i(k) - V_j(k) )^2 )
The specific steps of cluster analysis are as follows:
First clustering:

A = [V_1, V_2, ..., V_n]

l_1 = Dis(V_1, V_2)
l_2 = Dis(V_2, V_3)
...
l_{n-1} = Dis(V_{n-1}, V_n)

If i = arg min(l_1, l_2, l_3, ..., l_{n-1}), the clustering result is
(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_n), i.e.

A = [V_1, V_2, ..., (V_i + V_{i+1}), ..., V_n]

Update:
l_{i-1} = Dis(V_{i-1}, (V_i + V_{i+1}))
l_i = Dis((V_i + V_{i+1}), V_{i+2})
l_{i+1} = l_{i+2}
...
l_{n-2} = l_{n-1}
delete l_{n-1}.

Second clustering:
If j = arg min(l_1, l_2, l_3, ..., l_{n-2}), the clustering result is
(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_j + V_{j+1}), ..., (V_n), i.e.

A = [V_1, V_2, ..., (V_i + V_{i+1}), ..., (V_j + V_{j+1}), ..., V_n]

Update again:
l_{j-1} = Dis(V_{j-1}, (V_j + V_{j+1}))
l_j = Dis((V_j + V_{j+1}), V_{j+2})
l_{j+1} = l_{j+2}
...
l_{n-3} = l_{n-2}
delete l_{n-2}.
Hierarchical clustering continues in the same way until the final clustering result is 24 classes; the dynamically segmented inverse discrete cosine transform cepstrum coefficients so obtained are the sound features, which are then fed into a GMM (Gaussian mixture model) for recognition to assess the feasibility of the algorithm.
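A minimal sketch of this dynamic segmentation step, assuming NumPy and assuming that merging V_i and V_{i+1} means summing the two columns, as the notation above suggests:

```python
# Sketch: merge the most similar adjacent columns of A until 24 columns remain.
import numpy as np

def dynamic_segmentation(A, n_classes=24):
    cols = [A[:, j].astype(float) for j in range(A.shape[1])]
    while len(cols) > n_classes:
        # Euclidean distance between every pair of adjacent columns
        dists = [np.linalg.norm(cols[j] - cols[j + 1]) for j in range(len(cols) - 1)]
        i = int(np.argmin(dists))            # most similar adjacent pair
        cols[i] = cols[i] + cols[i + 1]      # merge V_i and V_{i+1}
        del cols[i + 1]                      # drop the absorbed column
    return np.column_stack(cols)             # shape: (m, 24)
```

With an input matrix of shape (m, n) the result has shape (m, 24), corresponding to the dynamically segmented features that are then passed to the GMM.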
The cluster analysis algorithm is a hierarchical clustering algorithm.
The similarity calculation is the Euclidean distance.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (3)

1. A sound feature extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficients is characterized by comprising the following steps:
S1, preprocessing the sound signals of m persons:
pre-emphasis, framing and windowing are sequentially carried out on the sound signals of m persons;
the pre-emphasis is realized by a digital filter, and the specific process is carried out by the following formula:
Y(n) = X(n) - aX(n-1);
wherein Y(n) is the pre-emphasized output signal, X(n) is the input sound signal, a is the pre-emphasis coefficient, and n is the time index; the framing segments the pre-emphasized output signal into frames of 20 ms each;
S2, transforming the preprocessed sound signals of the m persons from the time domain to the frequency domain:
the preprocessed sound signals of the m persons are transformed to the frequency domain, i.e. converted from a time-domain convolution into a frequency-domain spectral multiplication; the logarithm is then taken so that the resulting components can be expressed as a sum, giving the inverse discrete cosine transform cepstrum coefficients of the m persons; the specific process is given by the following formula
C(q) = IDCT(log|DCT{x(k)}|);
wherein DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the input sound signal, i.e. the preprocessed sound signals of the m persons, and C(q) is the output, i.e. the inverse discrete cosine transform cepstrum coefficients of the m persons;
S3, calculating the similarity between the inverse discrete cosine transform cepstrum coefficients of the m persons obtained in step S2 with a hierarchical cluster analysis algorithm, and successively merging the two adjacent columns with the greatest similarity; this process is iterated until 24 columns remain, and the resulting dynamically segmented inverse discrete cosine transform cepstrum coefficients are the sound features of the m persons; the specific steps are as follows:
Let the matrix A denote the m-person, n-dimensional inverse discrete cosine transform cepstrum coefficient matrix obtained in step S2, and treat each dimension of the coefficients as a column vector, giving n vectors V_1, V_2, ..., V_n; the Euclidean distance between V_i and V_j is defined as

Dis(V_i, V_j) = sqrt( sum_k ( V_i(k) - V_j(k) )^2 )
The specific steps of cluster analysis are as follows:
First clustering:

A = [V_1, V_2, ..., V_n]

l_1 = Dis(V_1, V_2)
l_2 = Dis(V_2, V_3)
...
l_{n-1} = Dis(V_{n-1}, V_n)

If i = arg min(l_1, l_2, l_3, ..., l_{n-1}), the clustering result is
(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_n), i.e.

A = [V_1, V_2, ..., (V_i + V_{i+1}), ..., V_n]

Update:
l_{i-1} = Dis(V_{i-1}, (V_i + V_{i+1}))
l_i = Dis((V_i + V_{i+1}), V_{i+2})
l_{i+1} = l_{i+2}
...
l_{n-2} = l_{n-1}
delete l_{n-1}.

Second clustering:
If j = arg min(l_1, l_2, l_3, ..., l_{n-2}), the clustering result is
(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_j + V_{j+1}), ..., (V_n), i.e.

A = [V_1, V_2, ..., (V_i + V_{i+1}), ..., (V_j + V_{j+1}), ..., V_n]

Update again:
l_{j-1} = Dis(V_{j-1}, (V_j + V_{j+1}))
l_j = Dis((V_j + V_{j+1}), V_{j+2})
l_{j+1} = l_{j+2}
...
l_{n-3} = l_{n-2}
delete l_{n-2}.
Hierarchical clustering continues in the same way until the final clustering result is 24 columns, and the dynamically segmented inverse discrete cosine transform cepstrum coefficients so obtained are the sound features.
2. The extraction algorithm according to claim 1, characterized in that: the windowing is Hamming-window windowing.
3. The extraction algorithm according to claim 1, characterized in that: the transformation is in the form of a cepstral transform.
CN201910087494.4A 2019-01-29 2019-01-29 Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient Active CN109767756B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910087494.4A CN109767756B (en) 2019-01-29 2019-01-29 Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
JP2019186806A JP6783001B2 (en) 2019-01-29 2019-10-10 Speech feature extraction algorithm based on dynamic division of cepstrum coefficients of inverse discrete cosine transform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910087494.4A CN109767756B (en) 2019-01-29 2019-01-29 Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient

Publications (2)

Publication Number Publication Date
CN109767756A CN109767756A (en) 2019-05-17
CN109767756B (en) 2021-07-16

Family

ID=66455625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910087494.4A Active CN109767756B (en) 2019-01-29 2019-01-29 Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient

Country Status (2)

Country Link
JP (1) JP6783001B2 (en)
CN (1) CN109767756B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197657B (en) * 2019-05-22 2022-03-11 大连海事大学 Dynamic sound feature extraction method based on cosine similarity
CN110299134B (en) * 2019-07-01 2021-10-26 中科软科技股份有限公司 Audio processing method and system
CN110488675A (en) * 2019-07-12 2019-11-22 国网上海市电力公司 A kind of substation's Abstraction of Sound Signal Characteristics based on dynamic time warpping algorithm
CN112180762B (en) * 2020-09-29 2021-10-29 瑞声新能源发展(常州)有限公司科教城分公司 Nonlinear signal system construction method, apparatus, device and medium
CN112581939A (en) * 2020-12-06 2021-03-30 中国南方电网有限责任公司 Intelligent voice analysis method applied to power dispatching normative evaluation
CN112669874B (en) * 2020-12-16 2023-08-15 西安电子科技大学 Speech feature extraction method based on quantum Fourier transform
CN113449626B (en) * 2021-06-23 2023-11-07 中国科学院上海高等研究院 Method and device for analyzing vibration signal of hidden Markov model, storage medium and terminal
CN113793614B (en) * 2021-08-24 2024-02-09 南昌大学 Speech feature fusion speaker recognition method based on independent vector analysis
CN114783462A (en) * 2022-05-11 2022-07-22 安徽理工大学 Mine hoist fault source positioning analysis method based on CS-MUSIC

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458950B (en) * 2007-12-14 2011-09-14 安凯(广州)微电子技术有限公司 Method for eliminating interference from A/D converter noise to digital recording
US9606530B2 (en) * 2013-05-17 2017-03-28 International Business Machines Corporation Decision support system for order prioritization
CN106971712A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of adaptive rapid voiceprint recognition methods and system
CN107293308B (en) * 2016-04-01 2019-06-07 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN109065071B (en) * 2018-08-31 2021-05-14 电子科技大学 Song clustering method based on iterative k-means algorithm
CN109256127B (en) * 2018-11-15 2021-02-19 江南大学 Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter

Also Published As

Publication number Publication date
CN109767756A (en) 2019-05-17
JP6783001B2 (en) 2020-11-11
JP2020140193A (en) 2020-09-03

Similar Documents

Publication Publication Date Title
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN109147796B (en) Speech recognition method, device, computer equipment and computer readable storage medium
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
CN110931023B (en) Gender identification method, system, mobile terminal and storage medium
WO2023001128A1 (en) Audio data processing method, apparatus and device
CN114495969A (en) Voice recognition method integrating voice enhancement
CN105845126A (en) Method for automatic English subtitle filling of English audio image data
Gamit et al. Isolated words recognition using mfcc lpc and neural network
Goyani et al. Performance analysis of lip synchronization using LPC, MFCC and PLP speech parameters
Nawas et al. Speaker recognition using random forest
KR100897555B1 (en) Apparatus and method of extracting speech feature vectors and speech recognition system and method employing the same
Dave et al. Speech recognition: A review
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
Luo et al. Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform.
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Chavan et al. Speech recognition in noisy environment, issues and challenges: A review
Makhijani et al. Speech enhancement using pitch detection approach for noisy environment
Akhter et al. An analysis of performance evaluation metrics for voice conversion models
Hizlisoy et al. Text independent speaker recognition based on MFCC and machine learning
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
CN114298019A (en) Emotion recognition method, emotion recognition apparatus, emotion recognition device, storage medium, and program product
Swathy et al. Review on feature extraction and classification techniques in speaker recognition
Maged et al. Improving speaker identification system using discrete wavelet transform and AWGN
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
Lalitha et al. An encapsulation of vital non-linear frequency features for various speech applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant