CN114999459A - Voice recognition method and system based on multi-scale recursive quantitative analysis - Google Patents


Info

Publication number
CN114999459A
CN114999459A (Application CN202210481126.XA)
Authority
CN
China
Prior art keywords
recursion
recursive
scale
speech recognition
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210481126.XA
Other languages
Chinese (zh)
Inventor
张晓俊
朱欣程
赵登煌
陶智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202210481126.XA priority Critical patent/CN114999459A/en
Publication of CN114999459A publication Critical patent/CN114999459A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 to G10L21/00
    • G10L25/03 characterised by the type of extracted parameters
    • G10L25/18 the extracted parameters being spectral information of each sub-band
    • G10L25/24 the extracted parameters being the cepstrum
    • G10L25/27 characterised by the analysis technique
    • G10L25/36 using chaos theory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a speech recognition method and system based on multi-scale recursive quantitative analysis. The method comprises the following steps: extracting the glottal wave signal of a speech signal; dividing the glottal wave signal into multiple frequency bands with a Gammatone filter bank to obtain glottal wave signals for a plurality of frequency channels; reconstructing a multi-scale phase space for the glottal wave signal of each frequency channel via time delay and embedding dimension, and constructing a recurrence plot from the distance between every pair of phase points in the phase space; quantifying the nonlinear dynamic recursive characteristics of the glottal wave signal in each frequency channel from the recurrence plot to obtain a set of characteristic parameters for each frequency channel; dividing the speech signals into a training set and a test set, and training a recognition model with the characteristic parameters of the training set; and performing prediction and classification on the characteristic parameters of the test set with the trained recognition model. The invention can accurately quantify the nonlinear characteristics of a speech signal and improve speech recognition accuracy.

Description

Voice recognition method and system based on multi-scale recursive quantitative analysis
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition method and system based on multi-scale recursive quantitative analysis.
Background
With the rapid development of artificial intelligence, speech recognition technology has made remarkable progress and has gradually entered fields such as household appliances, medical treatment, and automotive electronics. The speech recognition process mainly comprises feature extraction and classification with a classifier, and the extracted features largely determine recognition accuracy. Commonly used characteristic parameters include perturbation features, such as fundamental-frequency jitter and amplitude shimmer; spectral and cepstral features, such as linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and Gammatone-frequency cepstral coefficients (GFCC); and complexity measures, such as the largest Lyapunov exponent, the correlation dimension, and entropy features.
Computing the perturbation features depends on choosing a suitable window length and accurately estimating the fundamental frequency, but for aperiodic, irregular speech signals it is clearly difficult to extract the pitch period. Moreover, speech production is neither a deterministic linear process nor a purely random process, but a nonlinear one, so spectral and cepstral features cannot represent the nonlinear characteristics of the speech signal. The largest Lyapunov exponent, correlation dimension, and entropy features can only represent the low-dimensional chaotic characteristics of the speech signal, and the accuracy of existing recursive quantization measures in speech recognition is not ideal, making them difficult to apply in practical scenarios.
Disclosure of Invention
The invention aims to provide a speech recognition method and system based on multi-scale recursive quantitative analysis, which can accurately quantify the nonlinear characteristics of a speech signal and improve speech recognition accuracy.
In order to solve the above technical problem, the present invention provides a speech recognition method based on multi-scale recursive quantization analysis, comprising the following steps:
S1, extracting the glottal wave signal of the speech signal;
S2, dividing the glottal wave signal into multiple frequency bands with a Gammatone filter bank to obtain glottal wave signals for a plurality of frequency channels;
S3, reconstructing a multi-scale phase space for the glottal wave signal of each frequency channel via time delay and embedding dimension, and constructing a recurrence plot from the distance between every pair of phase points in the phase space;
S4, quantifying the nonlinear dynamic recursive characteristics of the glottal wave signal in each frequency channel from the recurrence plot to obtain a set of characteristic parameters for each frequency channel;
S5, dividing the speech signals into a training set and a test set, and training a recognition model with the characteristic parameters of the training set;
S6, performing prediction and classification on the characteristic parameters of the test set with the trained recognition model.
As a further improvement of the present invention, the time-domain impulse response of the Gammatone filter is:

g_i(t) = B^k · t^(k-1) · e^(-2πBt) · cos(2πf_i·t + φ) · u(t)

where the filter order k is set to 4 and the initial phase φ is set to 0; f_i is the center frequency of the i-th channel filter; B is a parameter related to the equivalent rectangular bandwidth; and u(t) is the unit step function.
As a further improvement of the present invention, the center frequencies are ERB-spaced between the lowest and highest filter frequencies:

f_i = −C + (f_h + C) · exp(−(i/K) · ln((f_h + C)/(f_l + C))), i = 1, …, K

where C is a constant related to the quality factor and bandwidth, f_l and f_h are the lowest and highest frequencies of the filter bank, and the number of filters K is 24. B is a parameter related to the equivalent rectangular bandwidth ERB:

B = 1.019 · ERB(f_i)

The equivalent rectangular bandwidth ERB is related to the filter center frequency as follows:

ERB(f_i) = 24.7 + 0.108·f_i
as a further improvement of the present invention, a time series { x (1), x (2),.., x (N) } of length N is set, and the phase space is reconstructed by the Takens embedding theorem:
Figure BDA0003627897560000031
where τ is the time delay, m is the embedding dimension, and the total number of points represented by the vector in the reconstructed phase space is N ═ N- (m-1) τ.
As a further improvement of the present invention, when the distance between two phase points in the phase space is smaller than a threshold, the pair of points is recursive, and the recurrence value obtained is:

R_ij = θ(ε − ||X_i − X_j||), i, j = 1, 2, …, n

where ε is the threshold, θ is the Heaviside function, and ||·|| denotes a norm.
As a further development of the invention, a series of characteristic parameters related to the recurrence values is obtained from the recurrence plot, based on an analysis of the density of recurrence points and of diagonal, vertical, or horizontal line structures.
As a further refinement of the invention, the characteristic parameters include: recursion rate, determinism, maximum diagonal length, entropy of diagonal lengths, average diagonal length, laminarity (degree of stratification), trapping time (capture time), maximum vertical line length, first recurrence time, second recurrence time, recurrence time entropy, clustering coefficient, and transitivity.
As a further improvement of the present invention, the recursion rate is the percentage of recurrence points in the recurrence plot;
the determinism is the ratio of the recurrence points forming diagonal segments in the recurrence plot to all recurrence points;
the maximum diagonal length is the length of the longest diagonal line in the recurrence plot structure;
the entropy of diagonal lengths is the Shannon entropy of the distribution of diagonal line lengths in the recurrence plot, measuring the amount of information contained in the recurrence plot structure;
the average diagonal length is highly correlated with the average prediction time of the dynamic system and the divergence of the system;
the laminarity (degree of stratification) is the ratio of the recurrence points forming vertical structures to all recurrence points in the recurrence plot, reflecting the complexity of the dynamic system;
the trapping time (capture time) is the average length of vertical lines in the recurrence plot structure, measuring the average time the system remains in a very slowly changing state;
the maximum vertical line length is the maximum length of a vertical line in the recurrence plot structure;
the first recurrence time T1(i) and second recurrence time T2(i) are the intervals T(i) = t_(i+1) − t_i, i = 1, 2, …, between successive recurrence times t_i of the first and second type, respectively;
the recurrence time entropy indicates the degree to which the time series repeats the same subsequence;
the clustering coefficient is the probability that two neighbors of any state in the recurrence plot structure are themselves also neighbors;
the transitivity quantifies the geometric properties of the phase space trajectory.
A speech recognition system based on multi-scale recursive quantization analysis adopts the speech recognition method based on multi-scale recursive quantization analysis to perform speech recognition.
As a further improvement of the invention, the recognition model classifier adopts a Bayesian network classifier.
The invention has the following beneficial effects. The proposed multi-scale recursive quantization measures do not depend on extracting the pitch period of speech and can also measure the high-dimensional chaotic characteristics of the speech signal, which helps improve speech recognition accuracy. The recursive quantization measures effectively capture changes in vocal cord vibration: starting from the vocal cord vibration mechanism, the glottal signal is extracted as the source signal; the signal is used to reconstruct a high-dimensional phase space on the Gammatone scale; a recurrence plot is drawn in combination with the characteristics of human auditory perception; and finally the nonlinear dynamic recursive characteristics of the speech signal in each frequency channel are quantified from the recurrences. The speech recognition rate of this nonlinear analysis method exceeds that of traditional linear analysis methods.
Drawings
FIG. 1 is a schematic diagram of the multi-scale recursive quantization measure extraction process of the present invention;
FIG. 2 is a schematic diagram of a speech recognition system of the present invention.
Detailed Description
The present invention is further described below in conjunction with the drawings and embodiments so that those skilled in the art can better understand and practice the invention; the embodiments, however, are not to be construed as limiting the invention.
As described in the background, with respect to the commonly used characteristic parameters:
1. Perturbation features: these describe noise in speech signals caused by irregular vocal cord vibration due to voice disorders, such as fundamental-frequency jitter and amplitude shimmer. Jitter represents short-term perturbation of the fundamental frequency, and shimmer represents short-term perturbation of the amplitude.
2. Spectral and cepstral features: MFCC and GFCC are characteristic parameters that conform to the auditory perception characteristics of the human ear and are commonly used for speech recognition. The basic principle is to map the linear spectrum onto a nonlinear Mel or Gammatone scale, based on the auditory perception characteristics of the human ear, and then transform it to the cepstrum.
3. Complexity measures: the largest Lyapunov exponent (LLE), correlation dimension (CD), and recursive quantization measures (RQMs). The largest Lyapunov exponent characterizes the average exponential divergence rate of adjacent trajectories in phase space. All three are nonlinear features based on phase space reconstruction and represent the degree of chaos of the speech signal.
Since speech signals have complex nonlinear characteristics, conventional nonlinear analysis methods have been applied to speech recognition. However, because of the non-stationary nature of the speech signal, those methods cannot accurately quantify its nonlinear characteristics, so their recognition performance falls short of linear analysis methods. The invention provides a speech recognition method based on multi-scale recursive quantitative analysis. The proposed characteristic parameters, multi-scale recursive quantization measures, do not depend on extracting the pitch period of speech and can also measure the high-dimensional chaotic characteristics of the speech signal. The recursive quantization measures effectively capture changes in vocal cord vibration. The signal is used to reconstruct a high-dimensional phase space on the Gammatone scale, and a recurrence plot is drawn in combination with the characteristics of human auditory perception. Finally, the nonlinear dynamic recursive characteristics of the speech signal in each frequency channel are quantified from the recurrences.
Referring to fig. 1, the present invention provides a speech recognition method based on multi-scale recursive quantization analysis, comprising the following steps:
S1, extracting the glottal wave signal of the speech signal;
S2, dividing the glottal wave signal into multiple frequency bands with a Gammatone filter bank to obtain glottal wave signals for a plurality of frequency channels;
S3, reconstructing a multi-scale phase space for the glottal wave signal of each frequency channel via time delay and embedding dimension, and constructing a recurrence plot from the distance between every pair of phase points in the phase space;
S4, quantifying the nonlinear dynamic recursive characteristics of the glottal wave signal in each frequency channel from the recurrence plot to obtain a set of characteristic parameters for each frequency channel;
S5, dividing the speech signals into a training set and a test set, and training a recognition model with the characteristic parameters of the training set;
S6, performing prediction and classification on the characteristic parameters of the test set with the trained recognition model.
The multi-scale recursive quantization measures proposed by the invention start from the vocal cord vibration mechanism and extract the glottal signal as the source signal. Through multi-scale analysis, the feature can decompose a non-stationary, nonlinear complex sequence into a set of frequency sub-band features. Combining the auditory perception characteristics of the human ear, the invention reconstructs the multi-scale phase space of the glottal signal by computing the time delay and embedding dimension, quantifies the nonlinear, non-stationary recursive structure to obtain the nonlinear characteristics of the speech signal, and then recognizes the speech with an artificial intelligence method. The proposed multi-scale recursive quantization measure parameters do not require extraction of the pitch period, can accurately quantify the nonlinear characteristics of the speech signal, help improve speech recognition accuracy, and outperform traditional linear analysis methods.
Specifically, the invention focuses on feature extraction and studies it from the perspective of the glottal wave. For feature extraction, a Gammatone filter bank performs multi-band division of the glottal wave signal, so that the features express the speech characteristics more finely and carry auditory perception characteristics.
The specific design of the speech recognition system in the invention mainly comprises:
1. Glottal wave extraction: the glottal wave signal is extracted from the original speech signal with a glottal inverse filtering algorithm.
2. Gammatone frequency-division processing:
A Gammatone auditory bionic filter bank is designed to divide the glottal wave signal into multiple frequency bands, yielding glottal wave signals for 24 frequency channels.
The time-domain expression of the Gammatone filter bank is: g_i(t) = B^k · t^(k-1) · e^(-2πBt) · cos(2πf_i·t + φ) · u(t). When the filter order k is set to 4, the filter characteristics of the human basilar membrane are well simulated; the initial phase φ is set to 0; f_i is the center frequency of the i-th channel filter. The center frequencies are ERB-spaced between the lowest and highest filter frequencies:

f_i = −C + (f_h + C) · exp(−(i/K) · ln((f_h + C)/(f_l + C))), i = 1, …, K

where C is a constant related to the quality factor and bandwidth, f_l and f_h are the lowest and highest frequencies of the filter bank, and the number of filters K is 24.
B is a parameter related to the equivalent rectangular bandwidth:

B = 1.019 · ERB(f_i)

The equivalent rectangular bandwidth ERB is related to the filter center frequency as follows:

ERB(f_i) = 24.7 + 0.108·f_i
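As a minimal sketch of the filter-bank design above, the following Python snippet computes 24 ERB-spaced center frequencies and the corresponding bandwidths B = 1.019·ERB(f_i). The constant c ≈ 228.8 and the band edges 50–8000 Hz are illustrative assumptions taken from common Gammatone filterbank implementations, not values fixed by the invention.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth at centre frequency f (Hz),
    using the linear approximation in the text: ERB(f) = 24.7 + 0.108 f."""
    return 24.7 + 0.108 * f

def gammatone_center_freqs(f_low, f_high, num_channels=24, c=228.83):
    """ERB-spaced centre frequencies between f_low and f_high.
    c plays the role of the constant C related to quality factor and
    bandwidth (228.83 is an assumed value from standard implementations)."""
    i = np.arange(1, num_channels + 1)
    return -c + (f_high + c) * np.exp(
        -(i / num_channels) * np.log((f_high + c) / (f_low + c)))

# assumed band edges for illustration only
cfs = gammatone_center_freqs(50.0, 8000.0)
bws = 1.019 * erb(cfs)          # per-channel bandwidth B
```

Channel 1 lies just below the upper band edge and channel 24 lands exactly on the lower edge, so the channels tile the band on the ERB scale.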
3. Nonlinear dynamics analysis:
The first step in analyzing a signal with nonlinear dynamics theory is to reconstruct the phase space. Given a time series {x(1), x(2), …, x(N)} of length N, the phase space can be reconstructed by the Takens embedding theorem:

X_i = (x(i), x(i+τ), …, x(i+(m−1)τ)), i = 1, 2, …, n

where τ is the time delay and m is the embedding dimension. The total number of phase points represented by the vectors {X_1, X_2, X_3, …, X_n} in the reconstructed phase space is n = N − (m−1)τ.
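The delay embedding above can be sketched in a few lines of Python; the toy series and the (m, τ) values are illustrative only.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Takens delay embedding: returns an (n, m) array whose i-th row is
    (x[i], x[i+tau], ..., x[i+(m-1)tau]), with n = N - (m-1)*tau."""
    x = np.asarray(x)
    n = len(x) - (m - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this (m, tau)")
    return np.stack([x[i:i + n] for i in range(0, m * tau, tau)], axis=1)

# example: N = 10, m = 3, tau = 2  ->  n = 10 - (3-1)*2 = 6 phase points
X = delay_embed(np.arange(10), m=3, tau=2)
```

The first phase point is (x(0), x(2), x(4)), matching the embedding formula term by term.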
A. Constructing the recurrence plot:
The recurrence plot is a tool for analyzing the recurrence phenomenon of a signal in a two-dimensional graph. When the distance between two phase points is smaller than the threshold, the pair of points is recursive, represented by a black point; otherwise it is not recursive, represented by a white point or blank space:

R_ij = θ(ε − ||X_i − X_j||), i, j = 1, 2, …, n

where ε is the threshold and θ is the Heaviside function.
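A minimal sketch of the thresholded recurrence matrix defined here, using the Euclidean norm; the three 2-D phase points and the threshold ε = 0.5 are toy values for illustration.

```python
import numpy as np

def recurrence_matrix(X, eps):
    """Binary recurrence matrix R[i, j] = 1 iff ||X_i - X_j|| < eps
    (Euclidean norm between phase points X_i, rows of X)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d < eps).astype(int)

# toy phase points: the first two recur, the third is far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
R = recurrence_matrix(X, eps=0.5)
```

The matrix is symmetric and its main diagonal (each point compared with itself) is always recursive.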
B. Quantifying the recursive measures:
The recursive nature of the time series is reflected in the geometric structures of the recurrence plot. Recursive quantization analysis is a method of quantifying system dynamics based on the recurrence plot. Based on an analysis of the density of recurrence points and of diagonal, vertical, or horizontal line structures, a series of statistical parameters can be derived. This work uses 13 recursive quantization measures, such as the average diagonal length, maximum diagonal length, clustering coefficient, and transitivity.
Recursion rate (RR): the percentage of recurrence points in the recurrence plot:

RR = (1/n²) · Σ_{i,j=1}^{n} R_ij

Determinism (DET): the ratio of the recurrence points forming diagonal segments in the recurrence plot to all recurrence points:

DET = Σ_{l≥l_min} l·P^ε(l) / Σ_{i,j=1}^{n} R_ij

where l is the length of a diagonal segment and l_min is its minimum value; P^ε(l) = {l_i; i = 1, …, N_l} denotes the frequency distribution of diagonal lengths, and N_l is the absolute number of diagonal lines.
Maximum diagonal length (L_max): the length of the longest diagonal line in the recurrence plot structure:

L_max = max({l_i; i = 1, …, N_l})
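The diagonal-based measures RR, DET, and L_max can be sketched as below. This is an illustrative implementation under common RQA conventions (main diagonal excluded from the line statistics, l_min = 2); the toy recurrence matrix is not from the invention.

```python
import numpy as np

def diagonal_lengths(R):
    """Lengths of all diagonal line segments of recurrence points,
    excluding the main diagonal (line of identity)."""
    n = R.shape[0]
    lengths = []
    for k in range(-(n - 1), n):
        if k == 0:
            continue
        run = 0
        for v in np.diagonal(R, offset=k):
            if v:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths

def rqa_measures(R, l_min=2):
    """Recursion rate RR, determinism DET and maximum diagonal length
    L_max of a binary recurrence matrix R."""
    n = R.shape[0]
    rr = R.sum() / n ** 2
    lens = np.array(diagonal_lengths(R))
    long_lens = lens[lens >= l_min] if lens.size else lens
    det = long_lens.sum() / R.sum() if R.sum() else 0.0
    l_max = int(lens.max()) if lens.size else 0
    return rr, det, l_max

# toy example: line of identity plus one diagonal segment of length 2
R = np.eye(4, dtype=int)
R[0, 1] = 1
R[1, 2] = 1
rr, det, l_max = rqa_measures(R)
```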
Entropy of diagonal lengths (ENTR): the Shannon entropy of the distribution of diagonal line lengths in the recurrence plot, measuring the amount of information contained in the recurrence plot structure:

ENTR = −Σ_{l≥l_min} p(l) · ln p(l), with p(l) = P^ε(l) / N_l
Average diagonal length (⟨L⟩): highly correlated with the average prediction time of the dynamic system and the divergence of the system:

⟨L⟩ = Σ_{l≥l_min} l·P^ε(l) / Σ_{l≥l_min} P^ε(l)
Laminarity (LAM), the degree of stratification: the ratio of the recurrence points forming vertical structures to all recurrence points in the recurrence plot, reflecting the complexity of the dynamic system:

LAM = Σ_{v≥v_min} v·P^ε(v) / Σ_{v=1}^{N} v·P^ε(v)

where v is the length of a vertical segment and P^ε(v) = {v_i; i = 1, …, N_v} its frequency distribution.
Trapping time (TT), the capture time: the average length of vertical lines in the recurrence plot structure, measuring the average time the system remains in a very slowly changing state:

TT = Σ_{v≥v_min} v·P^ε(v) / Σ_{v≥v_min} P^ε(v)
Maximum vertical line length (V_max): the maximum length of a vertical line in the recurrence plot structure:

V_max = max({v_i; i = 1, …, N_v})
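The vertical-line measures LAM, TT, and V_max can be sketched in the same way; this is an illustrative implementation with the common choice v_min = 2, and the toy matrix (one vertical run of three points plus an isolated point) is not from the invention.

```python
import numpy as np

def vertical_lengths(R):
    """Lengths of all vertical line segments of recurrence points."""
    lengths = []
    for col in R.T:
        run = 0
        for v in col:
            if v:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return np.array(lengths)

def laminarity_measures(R, v_min=2):
    """Laminarity LAM, trapping time TT and maximum vertical line
    length V_max of a binary recurrence matrix R."""
    v = vertical_lengths(R)
    total = v.sum()                                # all recurrence points
    long_v = v[v >= v_min]
    lam = long_v.sum() / total if total else 0.0
    tt = float(long_v.mean()) if long_v.size else 0.0
    v_max = int(v.max()) if v.size else 0
    return lam, tt, v_max

# toy example: one vertical run of length 3 and one isolated point
R = np.zeros((4, 4), dtype=int)
R[0:3, 0] = 1
R[1, 1] = 1
lam, tt, v_max = laminarity_measures(R)
```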
First recurrence time (T1) and second recurrence time (T2): the intervals between successive recurrence times t_i of the first and second type, respectively:

T1(i) = t_(i+1) − t_i, i = 1, 2, …
T2(i) = t_(i+1) − t_i, i = 1, 2, …

where the t_i in T1 are the recurrence times of the first type and the t_i in T2 are those of the second type.
Recurrence time entropy (RPDE): it has been successfully applied in biomedical testing and is advantageous for detecting subtle changes in biological time series, indicating the degree to which the time series repeats the same subsequence:

H_norm = −(1/ln T_max) · Σ_{t=1}^{T_max} P(t) · ln P(t)

For each point of the time series {x(1), x(2), …, x(N)}, the return times within the threshold neighborhood are collected into a histogram, and P(t) is the normalized histogram, where T_max is the maximum repetition period and t is the time between returns.
Clustering coefficient (Clust): in complex network theory, the probability that two neighbors of any state in the recurrence plot structure are themselves also neighbors:

Clust = (1/n) · Σ_{v=1}^{n} C_v, with C_v = Σ_{i,j=1}^{n} R_vi·R_ij·R_jv / (k_v·(k_v − 1))

where k_v = Σ_j R_vj is the number of neighbors of state v, and RR_i = k_i/n denotes the local recursion rate.
Transitivity (Trans): quantifies the geometric properties of the phase space trajectory:

Trans = Σ_{i,j,k=1}^{n} R_ij·R_jk·R_ki / Σ_{i,j,k=1; i≠k}^{n} R_ij·R_jk
4. The speech signals are divided into a training set and a test set, and a recognition model is trained with the characteristic parameters of the speech in the training set.
5. The characteristic parameters of the test set are classified by prediction with the trained model.
Examples
In this embodiment, the effect of the method of the present invention is verified by comparing the speech recognition results of different feature extraction methods:
1. Extracting the characteristic parameters MFCC:
(1) After pre-emphasis, the signal S(n) is windowed and framed with a Hamming window to obtain each frame signal x_n(m); its spectrum X_n(k) is then obtained by the short-time Fourier transform, and the square of the spectrum gives the energy spectrum P_n(k):

P_n(k) = |X_n(k)|²
(2) P_n(k) is filtered with M Mel band-pass filters; since the contributions of the components within each band are superimposed in the human ear, the energy within each filter band is summed:

S_n(m) = Σ_k H_m(k) · P_n(k), m = 1, …, M

where H_m(k) is the frequency response of the m-th Mel filter and S_n(m) is the output of each filter band.
(3) The logarithm of each filter output is taken and the discrete cosine transform is applied to obtain L MFCC coefficients:

C_n(l) = Σ_{m=1}^{M} ln S_n(m) · cos(π·l·(m − 0.5)/M), l = 1, …, L
(4) The obtained MFCC coefficients serve as the characteristic parameters of the n-th frame and reflect the static characteristics of the speech signal; better results are obtained by adding first-order difference coefficients, to which the human ear is more sensitive. The first-order difference is calculated as follows:

d_n = Σ_{l=1}^{L'} l · (c_(n+l) − c_(n−l)) / (2 · Σ_{l=1}^{L'} l²)

where L' is generally taken as 2, so the difference is a linear combination of the two frames before and after the current frame and reflects the dynamic characteristics of the speech.
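The first-order difference over a window of two frames on each side can be sketched as below. The edge-padding-by-repetition convention is an assumption (the text does not specify edge handling), and the ramp input is a toy example.

```python
import numpy as np

def delta(coeffs, L=2):
    """First-order difference of per-frame cepstral coefficients:
    d_n = sum_l l*(c_{n+l} - c_{n-l}) / (2*sum_l l^2), with edge frames
    padded by repetition. coeffs: (num_frames, num_ceps) array."""
    c = np.pad(np.asarray(coeffs, dtype=float), ((L, L), (0, 0)), mode="edge")
    n = len(coeffs)
    d = np.zeros((n, c.shape[1]))
    denom = 2 * sum(l * l for l in range(1, L + 1))
    for l in range(1, L + 1):
        d += l * (c[L + l:L + l + n] - c[L - l:L - l + n])
    return d / denom

# a linearly increasing coefficient track has delta 1 in interior frames
d = delta(np.arange(7.0).reshape(7, 1))
```

A constant coefficient track yields a delta of exactly zero, as expected for a purely static feature.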
2. Largest Lyapunov exponent and correlation dimension (LLE & D2):
(1) For a given speech signal, a smaller embedding dimension m_0 is first selected and the phase space is reconstructed:

X_i = (x(i), x(i+τ), …, x(i+(m_0−1)τ)), i = 1, 2, …, n
(2) The correlation sum C(r) is computed:

C(r) = (2/(n(n−1))) · Σ_{1≤i<j≤n} θ(r − ||X_i − X_j||)

where ||X_i − X_j|| is the distance between two phase points and θ(u) is the Heaviside function:

θ(u) = 1 for u > 0, and θ(u) = 0 for u ≤ 0

C(r) is a cumulative distribution function representing the probability that the distance between two points on the attractor in phase space is smaller than r.
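The correlation sum can be sketched directly from its definition; the three toy phase points and the radius r = 0.5 are illustrative values.

```python
import numpy as np

def correlation_sum(X, r):
    """C(r): fraction of distinct phase-point pairs whose distance is
    smaller than r, i.e. (2/(n(n-1))) * sum_{i<j} theta(r - ||X_i - X_j||)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.triu_indices(n, k=1)
    return float((d[i, j] < r).mean())

# toy phase points: only the pair (0, 1) is closer than r = 0.5
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0]])
c = correlation_sum(X, r=0.5)
```

Estimating the correlation dimension then amounts to fitting the slope of ln C(r) against ln r over a scaling region.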
(3) Taking an initial phase point x_0 as the base point, the point x_1 nearest to x_0 is selected from the point set as the end point, forming an initial vector whose Euclidean length is denoted L(t_0). After a time step (evolution time) k, the initial vector evolves forward along the trajectory into a new vector, and the Euclidean distance between the corresponding point and the end point is denoted L(t_1). The exponential growth rate of the system over this period is recorded as:

λ(t_1) = (1/(t_1 − t_0)) · ln(L(t_1)/L(t_0))

(4) This procedure is continued over all phase points, and the average of the exponential growth rates is taken as the estimate of the largest Lyapunov exponent:

λ_1 = (1/M) · Σ_{i=1}^{M} λ(t_i)

where M is the number of evolution steps.
in this embodiment, a bayesian network classifier is used to classify and identify the speech by using the Recursive Quantization Measures (RQMs), the maximum lyapunov exponent and the associated dimension (LLE & D2), the mel-frequency cepstral coefficient (MFCC), and the multi-scale recursive quantization measures, and the experimental results are shown in the following table:
[Table of recognition results for RQMs, LLE & D2, MFCC, and the multi-scale recursive quantization measures; rendered as an image in the original and not reproduced here.]
as can be seen from the table above, the multi-scale recursive quantization measure is superior to the traditional characteristic parameter Mel cepstral coefficient, the maximum Lyapunov exponent of nonlinear characteristics and the correlation and recursive quantization measures.
The accuracy of identifying the characteristic parameters of the multi-scale recursive quantitative measure in the Bayesian network classifier reaches 100%, and other evaluation indexes reach optimal values, which is superior to the traditional method. Therefore, the characteristics provided by the invention improve the recognition rate and reliability of the system.
As shown in fig. 2, the present invention also provides a speech recognition system based on multi-scale recursive quantization analysis, which recognizes speech with the speech recognition method based on multi-scale recursive quantization analysis described above (the classifier including but not limited to a Bayesian network). Its principle of solving the problem is similar to that of the method, and repeated parts are not described again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above embodiments are merely preferred embodiments given to fully illustrate the present invention, and the scope of the invention is not limited thereto. Equivalent substitutions or modifications made by those skilled in the art on the basis of the invention all fall within its protection scope, which is defined by the claims.

Claims (10)

1. A speech recognition method based on multi-scale recursive quantization analysis is characterized in that: the method comprises the following steps:
s1, extracting a glottal wave signal of the voice signal;
s2, dividing the glottal wave signals in a multiband mode by using a Gamma filter to obtain glottal wave signals of a plurality of frequency channels;
s3, reconstructing a multi-scale phase space of the glottal wave signals of each frequency channel through time delay and embedding dimension, and constructing a recursion graph according to the distance between every two phase points in the phase space;
s4, quantizing the nonlinear dynamic recursive characteristics of the glottal wave signals in each frequency channel according to the recursive graph to obtain a plurality of characteristic parameters of the glottal wave signals of each frequency channel;
s5, dividing the voice signal into a training set and a testing set, and training a recognition model by using the characteristic parameters of the training set;
and S6, carrying out prediction classification on the characteristic parameters of the test set by using the trained recognition model.
2. The speech recognition method based on multi-scale recursive quantization analysis according to claim 1, characterized in that the time-domain impulse response of the Gammatone filter is:

g_i(t) = B^k t^(k-1) e^(-2πBt) cos(2πf_i t + φ) u(t)

where the filter order k is set to 4 and the initial phase φ is set to 0; f_i is the center frequency of the i-th channel filter; B is a parameter related to the equivalent rectangular bandwidth; u(t) is the unit step function.
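As a minimal numerical sketch of this impulse response (the sampling rate, channel frequency, and duration below are illustrative choices, not values from the patent):

```python
import numpy as np

def gammatone_ir(t, f_c, B, k=4, phi=0.0):
    """g(t) = B^k t^(k-1) e^(-2*pi*B*t) cos(2*pi*f_c*t + phi) u(t)."""
    u = (t >= 0).astype(float)          # unit step function u(t)
    return (B ** k) * t ** (k - 1) * np.exp(-2 * np.pi * B * t) \
        * np.cos(2 * np.pi * f_c * t + phi) * u

# 50 ms of the k = 4, phi = 0 response of a 1 kHz channel at 16 kHz sampling,
# with B = 1.019 * ERB(f_c) as defined in claim 3
t = np.arange(0.0, 0.05, 1.0 / 16000.0)
g = gammatone_ir(t, f_c=1000.0, B=1.019 * (24.7 + 0.108 * 1000.0))
```

The response starts at zero (t^(k-1) vanishes at t = 0), rings at the channel center frequency, and decays at a rate set by the bandwidth parameter B.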
3. The speech recognition method based on multi-scale recursive quantization analysis according to claim 2, characterized in that the center frequency is:

f_i = -C + (f_h + C) exp(-(i/K) ln((f_h + C)/(f_l + C))), i = 1, 2, …, K
where C is related to the quality factor and bandwidth, f_l and f_h are the lowest and highest frequencies of the filter bank, and the number of filters K is 24; B is a parameter related to the equivalent rectangular bandwidth ERB:

B = 1.019·ERB(f_i)

and the equivalent rectangular bandwidth ERB is related to the filter center frequency as:

ERB(f_i) = 24.7 + 0.108·f_i
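A sketch of the ERB and center-frequency computation for a K = 24 channel bank. The claim does not give the constant C explicitly; the `ear_q` and `min_bw` values below follow the common Glasberg–Moore/Slaney convention (C = ear_q · min_bw) and are an assumption:

```python
import numpy as np

def erb(f):
    # Equivalent rectangular bandwidth (Hz): ERB(f) = 24.7 + 0.108 f
    return 24.7 + 0.108 * f

def center_frequencies(f_low, f_high, K=24, ear_q=9.26449, min_bw=24.7):
    # ERB-spaced center frequencies between f_low and f_high.
    # C = ear_q * min_bw; these two constants are assumed, not from the claim.
    C = ear_q * min_bw
    i = np.arange(1, K + 1)
    return -C + (f_high + C) * np.exp(-(i / K) * np.log((f_high + C) / (f_low + C)))

cfs = center_frequencies(50.0, 4000.0)    # K = 24 channels, illustrative band edges
bandwidths = 1.019 * erb(cfs)             # B = 1.019 * ERB(f_i) per channel
```

The channel spacing is geometric on the ERB-rate scale: the last center frequency lands near f_l and the first just below f_h.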
4. The speech recognition method based on multi-scale recursive quantization analysis according to claim 1, characterized in that, given a time series {x(1), x(2), …, x(N)} of length N, the phase space is reconstructed by the Takens embedding theorem:

X_i = [x(i), x(i+τ), …, x(i+(m-1)τ)], i = 1, 2, …, n

where τ is the time delay, m is the embedding dimension, and the total number of phase-space vectors in the reconstructed phase space is n = N - (m-1)τ.
5. The speech recognition method based on multi-scale recursive quantization analysis according to claim 4, characterized in that when the distance between two phase points in the phase space is less than a threshold, the pair of points is recurrent, and the recursion value obtained is:

R_ij = θ(ε - ||X_i - X_j||), i, j = 1, 2, …, n

where ε is the threshold, θ is the Heaviside step function, and ||·|| denotes a norm.
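A minimal sketch of the recursion (recurrence) matrix, using the Euclidean norm (the claim leaves the norm unspecified) and a toy 2-D trajectory:

```python
import numpy as np

def recurrence_matrix(X, eps):
    """R_ij = theta(eps - ||X_i - X_j||): 1 where two phase points are within eps."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    return (d <= eps).astype(int)       # Heaviside step applied elementwise

# tiny illustration: points 0, 1, 3 cluster together, point 2 is far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [0.05, 0.02]])
R = recurrence_matrix(X, eps=0.2)
```

The result is symmetric with an all-ones main diagonal (every point recurs with itself), which is why the recursion graph is plotted as a square black-and-white image.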
6. The speech recognition method based on multi-scale recursive quantization analysis according to claim 5, characterized in that, from the recursion graph, a series of characteristic parameters of the recursion values is obtained based on analysis of the density of recurrent points and of the diagonal, vertical, or horizontal line structures.
7. The speech recognition method based on multi-scale recursive quantization analysis according to claim 6, characterized in that the characteristic parameters comprise: recursion rate, certainty (determinism), maximum diagonal length, entropy of diagonal lengths, average diagonal length, degree of stratification (laminarity), capture time (trapping time), maximum vertical line length, first recursion time, second recursion time, recursion time entropy, clustering coefficient, and transitivity.
8. The speech recognition method based on multi-scale recursive quantization analysis of claim 7, characterized in that: the recursion rate is the percentage of recursion points in the recursion graph;
the certainty (determinism) represents the ratio of recursion points forming diagonal segments in the recursion graph to all recursion points;
the maximum diagonal length is the length of the longest diagonal in the recursive graph structure;
the entropy of diagonal lengths is the Shannon entropy of the distribution of diagonal-line lengths in the recursion graph, measuring the information content of the recursion-graph structure;
the average diagonal length is highly correlated with the average prediction time of the dynamic system and with the divergence of the system;
the degree of stratification (laminarity) is the ratio of recursion points forming vertical structures to all recursion points in the recursion graph, reflecting the complexity of the dynamic system;
the capture time (trapping time) represents the average length of vertical lines in the recursion-graph structure, measuring the average time the system stays in a very slowly varying state;
the maximum vertical line length represents the maximum length of a vertical line in the recursive graph structure;
first recursion time T1(i) and second recursion time T2(i):

T1(i) = t_(i+1) - t_i, i = 1, 2, …

where the t_i are the instants at which the trajectory returns to the neighborhood of a given state; T1 counts all recurrence points, while T2(i) is defined in the same way but excludes sojourn points, counting only the first recurrence point of each return;
the recursion time entropy indicates the degree to which the time series repeats the same subsequences;
the clustering coefficient represents the probability that two neighboring points of any state in the recursion-graph structure are also neighbors of each other;
the transitivity quantifies the geometric properties of the phase-space trajectory.
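As an illustration, two of the listed parameters, the recursion rate and the certainty (determinism), can be sketched from a recurrence matrix R. This is a simplified version that, unlike many RQA toolboxes, does not exclude the main diagonal:

```python
import numpy as np

def recurrence_rate(R):
    # Percentage of recurrent points in the recursion graph.
    return R.mean()

def diagonal_line_lengths(R):
    # Lengths of all maximal diagonal runs of 1s in the recurrence matrix.
    n = R.shape[0]
    lengths = []
    for k in range(-(n - 1), n):
        run = 0
        for v in np.diagonal(R, k):
            if v:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths

def determinism(R, lmin=2):
    # Fraction of recurrence points lying on diagonal lines of length >= lmin.
    lengths = diagonal_line_lengths(R)
    total = sum(lengths)
    return sum(l for l in lengths if l >= lmin) / total if total else 0.0

R = np.eye(5, dtype=int)        # a single diagonal line of length 5
```

For the identity matrix every recurrence point lies on one diagonal line, so the determinism is 1.0 while the recursion rate is only 5/25; adding isolated off-diagonal points lowers the determinism without much changing the rate.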
9. A speech recognition system based on multi-scale recursive quantization analysis, characterized in that speech recognition is performed using the speech recognition method based on multi-scale recursive quantization analysis according to any one of claims 1 to 8.
10. A speech recognition system based on multi-scale recursive quantization analysis according to claim 9, characterized by: the recognition model classifier adopts a Bayesian network classifier.
CN202210481126.XA 2022-05-05 2022-05-05 Voice recognition method and system based on multi-scale recursive quantitative analysis Pending CN114999459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210481126.XA CN114999459A (en) 2022-05-05 2022-05-05 Voice recognition method and system based on multi-scale recursive quantitative analysis


Publications (1)

Publication Number Publication Date
CN114999459A true CN114999459A (en) 2022-09-02

Family

ID=83024479


Country Status (1)

Country Link
CN (1) CN114999459A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN112863517A (en) * 2021-01-19 2021-05-28 苏州大学 Speech recognition method based on perceptual spectrum convergence rate


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
薛隆基 (Xue Longji): "Recursive quantification analysis and classification of pathological voices" *

Similar Documents

Publication Publication Date Title
US8140331B2 (en) Feature extraction for identification and classification of audio signals
CN109599120B (en) Abnormal mammal sound monitoring method based on large-scale farm plant
CN104887263B (en) A kind of identification algorithm and its system based on heart sound multi-dimension feature extraction
Mesgarani et al. Speech discrimination based on multiscale spectro-temporal modulations
Gómez-García et al. On the design of automatic voice condition analysis systems. Part III: Review of acoustic modelling strategies
López-Pabón et al. Cepstral analysis and Hilbert-Huang transform for automatic detection of Parkinson’s disease
CN110647656A (en) Audio retrieval method utilizing transform domain sparsification and compression dimension reduction
Hsu et al. Local wavelet acoustic pattern: A novel time–frequency descriptor for birdsong recognition
Wisniewski et al. Application of tonal index to pulmonary wheezes detection in asthma monitoring
Yarga et al. Efficient spike encoding algorithms for neuromorphic speech recognition
CN112863517A (en) Speech recognition method based on perceptual spectrum convergence rate
CN104036785A (en) Speech signal processing method, speech signal processing device and speech signal analyzing system
Manikandan et al. Quality-driven wavelet based PCG signal coding for wireless cardiac patient monitoring
Wiśniewski et al. Automatic detection of prolonged fricative phonemes with the hidden Markov models approach
CN114999459A (en) Voice recognition method and system based on multi-scale recursive quantitative analysis
Neili et al. Gammatonegram based pulmonary pathologies classification using convolutional neural networks
ABAKARIM et al. Amazigh isolated word speech recognition system using the adaptive orthogonal transform method
CN112233693A (en) Sound quality evaluation method, device and equipment
Therese et al. A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system
CN109215633A (en) The recognition methods of cleft palate speech rhinorrhea gas based on recurrence map analysis
Karam Various speech processing techniques for speech compression and recognition
Sisman et al. A new speech coding algorithm using zero cross and phoneme based SYMPES
CN118248152A (en) Speech-based identity recognition method and related equipment
CN118173102B (en) Bird voiceprint recognition method in complex scene
Feng et al. Underwater acoustic feature extraction based on restricted Boltzmann machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220902