CN113571089A - Voice recognition method based on Mel cepstrum coefficient-support vector machine architecture - Google Patents

Voice recognition method based on Mel cepstrum coefficient-support vector machine architecture

Info

Publication number
CN113571089A
CN113571089A
Authority
CN
China
Prior art keywords
voice
identified
signal
sound
characteristic data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110908188.XA
Other languages
Chinese (zh)
Inventor
吴华明
陈合谱
戴磊
张业超
肖文波
肖永生
黄丽贞
段军红
苏荃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baohang Technology Co ltd
Nanchang Hangkong University
Original Assignee
Beijing Baohang Technology Co ltd
Nanchang Hangkong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baohang Technology Co ltd, Nanchang Hangkong University filed Critical Beijing Baohang Technology Co ltd
Priority to CN202110908188.XA priority Critical patent/CN113571089A/en
Publication of CN113571089A publication Critical patent/CN113571089A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a voice recognition method and system based on a Mel cepstral coefficient-support vector machine framework. The method comprises: acquiring a sound signal to be identified; extracting sound characteristic data of the sound signal to be identified, the sound characteristic data comprising static characteristic data and dynamic characteristic data of the sound signal to be identified; and inputting the sound characteristic data of the sound signal to be identified into a sound recognition model to obtain a recognition result, wherein the sound recognition model is obtained by training a support vector machine model on historical sound signals. By training the support vector machine model on both the static characteristic data and the dynamic characteristic data of the sound signal, the method and system improve the accuracy of sound recognition.

Description

Voice recognition method based on Mel cepstrum coefficient-support vector machine architecture
Technical Field
The invention relates to the technical field of environmental monitoring, in particular to a voice recognition method and system based on a Mel cepstrum coefficient-support vector machine framework.
Background
Applied to the sensing field, optical fiber acoustic sensing systems can effectively overcome the difficulties that conventional electroacoustic sensors face in extreme field environments with strong electromagnetic interference, humidity, and corrosion, and can be widely used in important fields such as medicine, aviation, energy, and security. In an optical fiber acoustic sensing system, the accurate identification and classification of acoustic signals directly determines how widely the system can be deployed. To improve the accuracy of sound signal recognition and classification, many scholars have proposed solutions: Li Wanling and Zhang Qiuju of Xinjiang University proposed an anti-noise speech feature extraction and optimization method based on an HMM/SVM (hidden Markov model/support vector machine) architecture, and Gao Ming and Sun Rongcheng proposed a feature parameter extraction algorithm based on improved Mel cepstral coefficients (MFCC). However, these methods perform only simple feature extraction on the sound signal and use the extracted features directly for recognition; such features cannot fully support accurate recognition and classification of the sound signal.
Disclosure of Invention
The invention aims to provide a voice recognition method and a voice recognition system based on a Mel cepstrum coefficient-support vector machine framework, which can improve the recognition accuracy of voice signals.
In order to achieve the purpose, the invention provides the following scheme:
a voice recognition method based on a Mel cepstral coefficient-support vector machine architecture comprises the following steps:
acquiring a voice signal to be identified;
extracting sound characteristic data of the sound signal to be identified; the sound characteristic data comprises static characteristic data and dynamic characteristic data of the sound signal to be identified;
inputting the voice characteristic data of the voice signal to be recognized into a voice recognition model to obtain a voice recognition result; the voice recognition model is obtained by training a support vector machine model according to historical voice signals.
Optionally, before the acquiring the voice signal to be recognized, the method further includes:
acquiring a historical sound signal; the historical sound signal comprises a non-invasive sound signal frame and an invasive sound signal frame;
extracting sound characteristic data of the historical sound signal;
and training a support vector machine model by taking the sound characteristic data of the historical sound signal as input and taking whether the historical sound signal contains an invading sound signal frame as output to obtain the sound identification model.
Optionally, after the acquiring the voice signal to be recognized, the method further includes:
carrying out normalization processing on the sound signal to be recognized;
and filtering the normalized voice signal to be identified.
Optionally, the extracting the sound feature data of the sound signal to be recognized specifically includes:
framing the voice signal to be identified to obtain a plurality of voice signal frames to be identified;
windowing each voice signal frame to be identified to obtain a plurality of voice windowed signal frames to be identified;
using the formula

X_a(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N}, \quad 0 \le k \le N,

respectively carrying out Fourier transform on each voice windowed signal frame to be identified to obtain a plurality of Fourier-transformed voice windowed signal frames to be identified;
using the formula

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad n = 1, 2, \ldots, L,

respectively carrying out cosine transform processing on each Fourier-transformed voice windowed signal frame to be identified to obtain the multi-dimensional Mel cepstral coefficients of the signal to be identified;
determining the multi-dimensional Mel cepstral coefficients as the static characteristic data of the sound signal to be identified; determining the first-order difference and the second-order difference of the static characteristic data of the sound signal to be identified as the dynamic characteristic data of the sound signal to be identified;
wherein X_a(k) is the Fourier-transformed voice windowed signal frame to be identified, x(n) is the voice windowed signal frame to be identified, k is the index of the Fourier transform points, 0 ≤ k ≤ N, N is the total number of Fourier transform points, C(n) is the n-th dimensional Mel cepstral coefficient, and s(m) is the logarithmic energy output by the m-th filter bank,

s(m) = \ln\left( \sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k) \right), \quad 0 \le m \le M,

where H_m(k) is the frequency response of the m-th filter, M is the number of filters, n = 1, 2, \ldots, L, and L is the dimension of the multi-dimensional Mel cepstral coefficients.
A voice recognition system based on mel-frequency cepstral coefficient-support vector machine architecture, comprising:
the voice signal to be recognized acquisition module is used for acquiring a voice signal to be recognized;
the first sound characteristic data extraction module is used for extracting the sound characteristic data of the sound signal to be identified; the sound characteristic data comprises static characteristic data and dynamic characteristic data of the sound signal to be identified;
the voice signal identification module is used for inputting the voice characteristic data of the voice signal to be identified into a voice identification model to obtain a voice identification result; the voice recognition model is obtained by training a support vector machine model according to historical voice signals.
Optionally, the system further includes:
the historical sound signal acquisition module is used for acquiring a historical sound signal; the historical sound signal comprises a non-invasive sound signal frame and an invasive sound signal frame;
the second sound characteristic data extraction module is used for extracting sound characteristic data of the historical sound signal;
and the voice recognition model determining module is used for training a support vector machine model by taking the voice feature data of the historical voice signal as input and taking whether the historical voice signal contains an invading voice signal frame as output to obtain the voice recognition model.
Optionally, the system further includes:
the normalization module is used for performing normalization processing on the voice signal to be recognized;
and the filtering module is used for filtering the normalized sound signal to be identified.
Optionally, the first sound feature data extraction module specifically includes:
the framing processing unit is used for framing the voice signal to be identified to obtain a plurality of voice signal frames to be identified;
the system comprises a to-be-identified sound windowing signal frame determining unit, a processing unit and a processing unit, wherein the to-be-identified sound windowing signal frame determining unit is used for respectively performing windowing processing on each to-be-identified sound signal frame to obtain a plurality of to-be-identified sound windowing signal frames;
a Fourier transform unit, configured to use the formula

X_a(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N}, \quad 0 \le k \le N,

to respectively carry out Fourier transform on each sound windowed signal frame to be identified, obtaining a plurality of Fourier-transformed sound windowed signal frames to be identified;
a multi-dimensional Mel cepstral coefficient determining unit, configured to use the formula

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad n = 1, 2, \ldots, L,

to respectively carry out cosine transform processing on each Fourier-transformed sound windowed signal frame to be identified, obtaining the multi-dimensional Mel cepstral coefficients of the signal to be identified;
and a sound characteristic data extraction unit, configured to determine the multi-dimensional Mel cepstral coefficients as the static characteristic data of the sound signal to be identified, and to determine the first-order difference and the second-order difference of the static characteristic data of the sound signal to be identified as the dynamic characteristic data of the sound signal to be identified;
wherein X_a(k) is the Fourier-transformed sound windowed signal frame to be identified, x(n) is the sound windowed signal frame to be identified, k is the index of the Fourier transform points, 0 ≤ k ≤ N, N is the total number of Fourier transform points, C(n) is the n-th dimensional Mel cepstral coefficient, and s(m) is the logarithmic energy output by the m-th filter bank,

s(m) = \ln\left( \sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k) \right), \quad 0 \le m \le M,

where H_m(k) is the frequency response of the m-th filter, M is the number of filters, n = 1, 2, \ldots, L, and L is the dimension of the multi-dimensional Mel cepstral coefficients.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a voice recognition method and a system based on a Mel cepstrum coefficient-support vector machine framework, wherein the method comprises the following steps: acquiring a voice signal to be identified; extracting sound characteristic data of a sound signal to be identified; the voice characteristic data comprises static characteristic data and dynamic characteristic data of the voice signal to be recognized; inputting the voice characteristic data of the voice signal to be recognized into a voice recognition model to obtain a voice recognition result; the voice recognition model is obtained by training a support vector machine model according to historical voice signals. The method and the device can improve the accuracy of voice recognition by training the support vector machine model through the static characteristic data and the dynamic characteristic data of the voice signal to obtain the voice recognition model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a voice recognition method based on a Mel frequency cepstral coefficient-support vector machine architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a Sagnac fiber optic sensing system according to an embodiment of the present invention;
FIG. 3 is a diagram of a voice recognition framework in an embodiment of the present invention;
FIG. 4 is a diagram illustrating the result of parameter optimization for SVM in accordance with an embodiment of the present invention;
FIG. 5 is a graph of a confusion matrix for training model accuracy in an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a voice recognition system based on mel-frequency cepstral coefficient-support vector machine architecture in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a voice recognition method and a voice recognition system based on a Mel cepstrum coefficient-support vector machine framework, which can improve the recognition accuracy of voice signals.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a voice recognition method based on a mel-frequency cepstral coefficient-support vector machine architecture in an embodiment of the present invention, and as shown in fig. 1, the present invention provides a voice recognition method based on a mel-frequency cepstral coefficient-support vector machine architecture, which includes:
step 101: acquiring a voice signal to be identified;
step 102: extracting sound characteristic data of a sound signal to be identified; the voice characteristic data comprises static characteristic data and dynamic characteristic data of the voice signal to be recognized;
step 103: inputting the voice characteristic data of the voice signal to be recognized into a voice recognition model to obtain a voice recognition result; the voice recognition model is obtained by training a support vector machine model according to historical voice signals.
Before step 101, further comprising:
acquiring a historical sound signal; the historical sound signal comprises a non-invasive sound signal frame and an invasive sound signal frame;
extracting sound characteristic data of the historical sound signal;
and training the support vector machine model by taking the sound characteristic data of the historical sound signal as input and taking whether the historical sound signal contains the invading sound signal frame as output to obtain a sound identification model.
After step 101, further comprising:
carrying out normalization processing on the sound signal to be recognized;
and filtering the normalized voice signal to be identified. Specifically, a wavelet threshold denoising method is adopted to filter the normalized sound signal to be recognized.
Specifically, in the voice recognition method based on the Mel cepstral coefficient-support vector machine architecture provided by the present invention, the sound feature data of the sound signal to be recognized are extracted in the same way as the sound feature data of the historical sound signals. Taking the sound signal to be recognized as an example, the extraction specifically includes:
performing frame processing on the voice signal to be recognized to obtain a plurality of voice signal frames to be recognized;
windowing each voice signal frame to be identified to obtain a plurality of voice windowed signal frames to be identified;
using the formula

X_a(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N}, \quad 0 \le k \le N,

respectively carrying out Fourier transform on each voice windowed signal frame to be identified to obtain a plurality of Fourier-transformed voice windowed signal frames to be identified;
using the formula

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad n = 1, 2, \ldots, L,

respectively carrying out cosine transform processing on each Fourier-transformed voice windowed signal frame to be identified to obtain the multi-dimensional Mel cepstral coefficients of the signal to be identified;
determining the multi-dimensional Mel cepstral coefficients as the static characteristic data of the sound signal to be identified; determining the first-order difference and the second-order difference of the static characteristic data of the sound signal to be identified as the dynamic characteristic data of the sound signal to be identified;
wherein X_a(k) is the Fourier-transformed voice windowed signal frame to be identified, x(n) is the voice windowed signal frame to be identified, k is the index of the Fourier transform points, 0 ≤ k ≤ N, N is the total number of Fourier transform points, C(n) is the n-th dimensional Mel cepstral coefficient, and s(m) is the logarithmic energy output by the m-th filter bank,

s(m) = \ln\left( \sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k) \right), \quad 0 \le m \le M,

where H_m(k) is the frequency response of the m-th filter, M is the number of filters, n = 1, 2, \ldots, L, and L is the dimension of the multi-dimensional Mel cepstral coefficients.
Fig. 3 is a diagram of the voice recognition framework in an embodiment of the present invention. As shown in Fig. 3, the recognition method provided by the invention is a signal processing algorithm built on an MFCC_SVM (Mel cepstral coefficient-support vector machine) framework for a Sagnac optical fiber acoustic sensing system: an optical fiber sensing system is built on the Sagnac principle and collects external sound signals; the signals are normalized and filtered; MFCC feature parameters are extracted; a support vector machine whose hyperparameters are optimized by grid search is trained on these feature parameters to obtain an optimized signal recognition model; and the optimized model is used to discriminate the signals collected by the system, which greatly improves the system's ability to recognize signals in complex environments.
The method comprises the following specific steps:
step one, sound signal collection: the optical fiber sensing system is built according to the Sagnac principle to cope with a complex monitoring environment, the invasion signals are manufactured in modes of knocking, hoeing, digging, pickaxe and the like, and the optical fiber sensing system realizes the collection of the invasion signals with human interference and the non-invasion signals without human interference. The structure of the optical fiber sensing system is shown in fig. 2, wherein Laser is a light source, PD is a photoelectric detector, DAQ is a data acquisition card, and PC is a computer; 1,2, 3 each represent 3 inputs of a 3 × 3 coupler a; 4, 5, 6 denote 3 outputs of the 3 × 3 coupler a; b represents a delay fiber; 7 and 8 respectively represent two input ends of the 2 × 1 coupler c, 9 represents an output end of the 2 × 1 coupler, and 10 represents a disturbance intrusion point position; d represents a 1 × 2 coupler; 11 denotes the concatenated fibre at the output of the 1 x 2 coupler d. cw denotes the clockwise light path, the path of which is 1-a-6-b-8-c-9-10-d-11-d-10-9-c-7-4-a-3, cww denotes the counterclockwise light path, the path of which is 1-a-4-7-c-9-10-d-11-d-10-9-c-8-b-6-a-3.
Step two, signal normalization: the collected sound signals are zero-padded or cut so that all signals have the same length, which facilitates subsequent feature extraction and model training, and the signal data are mapped into the range 0 to 1 to speed up processing.
Step three, filtering: the signals processed in step two are denoised with a wavelet threshold denoising method, filtering out invalid information in the signals.
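Steps two and three can be sketched in Python with NumPy and PyWavelets as follows; the target length, wavelet family, decomposition level, and universal-threshold rule are illustrative assumptions, since the patent does not specify these values:

import numpy as np
import pywt

def preprocess(sig, target_len=8192, wavelet="db4", level=4):
    """Step two: zero-fill or cut to a fixed length and map to [0, 1];
    step three: wavelet threshold denoising with soft thresholding."""
    sig = np.asarray(sig, dtype=float)
    sig = np.pad(sig, (0, max(target_len - len(sig), 0)))[:target_len]  # zero-fill or cut
    sig = (sig - sig.min()) / (sig.max() - sig.min() + 1e-12)           # map to [0, 1]
    coeffs = pywt.wavedec(sig, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745       # noise estimate from finest detail
    thr = sigma * np.sqrt(2.0 * np.log(len(sig)))        # universal threshold
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:target_len]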
Step four, static feature extraction: the frame length of the sound signal is set to 512 ms with a frame overlap of 128 ms. Each frame is windowed with a Hamming window to reduce the influence of the Gibbs effect, suppressing oscillation of the waveform and improving the filtering. The sound signal is then Fourier transformed to convert the time-domain signal into a power spectrum, calculated as:

X_a(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N}, \quad 0 \le k \le N,

where X_a(k) is the Fourier-transformed windowed signal frame to be identified, x(n) is the windowed signal frame to be identified, k is the index of the Fourier transform points, 0 ≤ k ≤ N, and N is the total number of Fourier transform points.
The power spectrum is filtered by a bank of triangular window filters distributed linearly on the Mel frequency scale, and the logarithmic energy output by each filter bank is calculated as:

s(m) = \ln\left( \sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k) \right), \quad 0 \le m \le M,

where H_m(k) is the frequency response of the m-th filter and M is the number of filters.
Finally, a discrete cosine transform is applied to remove the correlation between the dimensions and map the signal into a low-dimensional space, yielding the MFCC. The discrete cosine transform is calculated as:

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad n = 1, 2, \ldots, L,

where C(n) is the n-th dimensional Mel cepstral coefficient, s(m) is the logarithmic energy output by the m-th filter bank, and L is the dimension of the multi-dimensional Mel cepstral coefficients.
To maximize the efficiency of model training and recognition, the first 13 dimensions of the MFCC are extracted as the static features of the acoustic signal.
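A Python sketch of step four for a single frame is given below, implementing the three formulas above; the frame is assumed to have been Hamming-windowed already, and the 26-filter mel filter bank is an illustrative choice, since the patent states only that the filters are triangular and linearly spaced on the mel scale:

import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced linearly on the mel scale, i.e. H_m(k)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_static(frame, sr, n_filters=26, n_ceps=13):
    """First 13 MFCC dimensions C(1..13) of one Hamming-windowed frame."""
    N = len(frame)
    power = np.abs(np.fft.rfft(frame, n=N)) ** 2                  # |X_a(k)|^2
    s = np.log(mel_filterbank(n_filters, N, sr) @ power + 1e-12)  # s(m); eps guards log(0)
    n = np.arange(1, n_ceps + 1)[:, None]
    m = np.arange(n_filters)[None, :]
    return (s * np.cos(np.pi * n * (m + 0.5) / n_filters)).sum(axis=1)  # C(n)

Applying mfcc_static to every frame of a signal yields the static feature matrix whose differences are taken in step five.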
Step five, dynamic feature extraction: and taking the first order difference and the second order difference of the static features as the dynamic features of the sound signals, and combining the extracted static features and the extracted dynamic features into a 39-dimensional signal feature parameter.
Specifically, the first-order and second-order difference spectra of the static features are taken as the dynamic features of the sound signal. The difference parameters are calculated with the following formula:

d_t = \begin{cases} C_{t+1} - C_t, & t < R \\ \frac{\sum_{r=1}^{R} r \,(C_{t+r} - C_{t-r})}{\sqrt{2 \sum_{r=1}^{R} r^2}}, & R \le t < Q - R \\ C_t - C_{t-1}, & t \ge Q - R \end{cases}

where d_t denotes the t-th first-order difference, C_t denotes the t-th cepstral coefficient, C_{t+1} denotes the (t+1)-th cepstral coefficient, Q denotes the order of the cepstral coefficients, and R denotes the time span of the first derivative, taken as 1 or 2.
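A first-order difference sketch following the piecewise formula above is shown below; note that some implementations use 2·Σr² without the square root in the denominator, and the sketch assumes the sequence is longer than R frames:

import numpy as np

def delta(ceps, R=2):
    """First-order difference d_t of a cepstral sequence (frames x dims)."""
    T = len(ceps)
    d = np.zeros_like(ceps)
    denom = np.sqrt(2.0 * sum(r * r for r in range(1, R + 1)))
    for t in range(T):
        if t < R:                        # leading boundary frames
            d[t] = ceps[t + 1] - ceps[t]
        elif t >= T - R:                 # trailing boundary frames
            d[t] = ceps[t] - ceps[t - 1]
        else:                            # interior frames
            d[t] = sum(r * (ceps[t + r] - ceps[t - r]) for r in range(1, R + 1)) / denom
    return d

# 39-dimensional parameters: 13 static + 13 first-order + 13 second-order differences
# feats = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])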
Step six, SVM kernel function selection: the construction of a support vector machine depends mainly on the choice of kernel function, which maps the feature data into a high-dimensional space so that they become as linearly separable as possible. In the present invention, based on comparative experiments, a Gaussian kernel function is selected as the classification kernel of the SVM.
Step seven, model hyperparameter optimization: to improve the recognition performance of the model, a grid search method is used to try candidate pairs of the two hyperparameters, the penalty coefficient and the kernel function radius; K-fold cross-validation is then performed, and the penalty coefficient c and kernel function radius g that give the SVM model the highest validation accuracy are selected as the optimal parameters. The result of the hyperparameter optimization is shown in Fig. 4, where GridSearchMethod indicates that grid search is used for optimization and CVAccuracy indicates the cross-validation accuracy.
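A minimal sklearn sketch of this step follows; the exponential grid ranges and the choice of 5-fold cross-validation are illustrative assumptions, as the patent specifies grid search over c and g with K-fold cross-validation but not the exact grid or K:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),       # candidate penalty coefficients c
    "gamma": 2.0 ** np.arange(-15, 4, 2),   # candidate kernel radii g (Gaussian/RBF width)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)   # X_train: (n_samples, 39) features; y_train: class labels
# print(search.best_params_, search.best_score_)   # optimal c, g and their CV accuracy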
Step eight, model training and signal recognition: with the optimized hyperparameters, the SVM model reaches its best training and recognition speed and accuracy. The feature data obtained in steps four and five are input into the optimized SVM model; according to the set parameters, the SVM finds a hyperplane that separates the input data, yielding the trained reference model. To verify the accuracy of the model, signals collected by the Sagnac fiber sensing system are processed to extract feature parameters and matched against the reference model, thereby recognizing and classifying the signals. The classification accuracy is represented by the confusion matrix shown in Fig. 5, where A denotes the non-intrusion signal class, B denotes the intrusion signal class, and each cell gives the proportion of samples of the row class that are predicted as the column class.
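The training-and-evaluation loop of step eight can be sketched as follows; the random arrays are stand-in data that only make the snippet self-contained, and the C and gamma values are illustrative placeholders for the grid-search result, not values from the patent:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 39))        # stand-in 39-dimensional feature vectors
y_train = rng.integers(1, 3, size=100)      # labels: 1 = non-intrusion, 2 = intrusion
X_test = rng.normal(size=(20, 39))
y_test = rng.integers(1, 3, size=20)

model = SVC(kernel="rbf", C=8.0, gamma=0.03)    # penalty c and radius g from grid search
model.fit(X_train, y_train)                     # find the separating hyperplane
cm = confusion_matrix(y_test, model.predict(X_test), normalize="true")
print(cm)   # row = true class, column = predicted class; diagonal = per-class accuracy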
Partial verification results, as shown in table 1:
TABLE 1 comparison of model predicted results with actual results
[Table 1 appears only as an image in the original publication.]
Here traininggood1 to traininggood5 denote the five collected non-intrusion signals, trainingbad1 to trainingbad5 denote the five collected intrusion signals, 1 denotes a non-intrusion signal, and 2 denotes an intrusion signal.
Compared with the prior art, the invention has the following advantages: static and dynamic features of the signal are extracted with the MFCC method as parameters and combined with a support vector machine, and the hyperparameters of the model are optimized by grid search, forming an optical fiber acoustic sensing recognition system based on the MFCC-SVM algorithm framework. As shown in Fig. 5, the recognition accuracy reaches more than 91%.
Fig. 6 is a schematic structural diagram of a voice recognition system based on a mel-frequency cepstral coefficient-support vector machine architecture in an embodiment of the present invention, and as shown in fig. 6, the present invention further provides a voice recognition system based on a mel-frequency cepstral coefficient-support vector machine architecture, which includes:
a to-be-recognized sound signal acquisition module 601, configured to acquire a to-be-recognized sound signal;
a first sound feature data extraction module 602, configured to extract sound feature data of a sound signal to be identified; the voice characteristic data comprises static characteristic data and dynamic characteristic data of the voice signal to be recognized;
the voice signal recognition module 603 is configured to input voice feature data of the voice signal to be recognized into the voice recognition model, so as to obtain a voice recognition result; the voice recognition model is obtained by training a support vector machine model according to historical voice signals.
In addition, the system further comprises:
the historical sound signal acquisition module is used for acquiring a historical sound signal; the historical sound signal comprises a non-invasive sound signal frame and an invasive sound signal frame;
the second sound characteristic data extraction module is used for extracting sound characteristic data of the historical sound signal;
and the voice recognition model determining module is used for training the support vector machine model by taking the voice characteristic data of the historical voice signal as input and taking whether the historical voice signal contains the invading voice signal frame as output so as to obtain the voice recognition model.
The invention provides a voice recognition system based on a Mel cepstrum coefficient-support vector machine framework, which also comprises:
the normalization module is used for performing normalization processing on the sound signal to be recognized;
and the filtering module is used for filtering the normalized sound signal to be identified.
The first sound feature data extraction module 602 specifically includes:
the framing processing unit is used for framing the voice signal to be recognized to obtain a plurality of voice signal frames to be recognized;
the system comprises a to-be-identified sound windowing signal frame determining unit, a processing unit and a processing unit, wherein the to-be-identified sound windowing signal frame determining unit is used for respectively performing windowing processing on each to-be-identified sound signal frame to obtain a plurality of to-be-identified sound windowing signal frames;
a Fourier transform unit, configured to use the formula

X_a(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N}, \quad 0 \le k \le N,

to respectively carry out Fourier transform on each sound windowed signal frame to be identified, obtaining a plurality of Fourier-transformed sound windowed signal frames to be identified;
a multi-dimensional Mel cepstral coefficient determining unit, configured to use the formula

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad n = 1, 2, \ldots, L,

to respectively carry out cosine transform processing on each Fourier-transformed sound windowed signal frame to be identified, obtaining the multi-dimensional Mel cepstral coefficients of the signal to be identified;
and a sound characteristic data extraction unit, configured to determine the multi-dimensional Mel cepstral coefficients as the static characteristic data of the sound signal to be identified, and to determine the first-order difference and the second-order difference of the static characteristic data of the sound signal to be identified as the dynamic characteristic data of the sound signal to be identified;
wherein X_a(k) is the Fourier-transformed sound windowed signal frame to be identified, x(n) is the sound windowed signal frame to be identified, k is the index of the Fourier transform points, 0 ≤ k ≤ N, N is the total number of Fourier transform points, C(n) is the n-th dimensional Mel cepstral coefficient, and s(m) is the logarithmic energy output by the m-th filter bank,

s(m) = \ln\left( \sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k) \right), \quad 0 \le m \le M,

where H_m(k) is the frequency response of the m-th filter, M is the number of filters, n = 1, 2, \ldots, L, and L is the dimension of the multi-dimensional Mel cepstral coefficients.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A voice recognition method based on a Mel cepstral coefficient-support vector machine architecture, the method comprising:
acquiring a voice signal to be identified;
extracting sound characteristic data of the sound signal to be identified; the sound characteristic data comprises static characteristic data and dynamic characteristic data of the sound signal to be identified;
inputting the voice characteristic data of the voice signal to be recognized into a voice recognition model to obtain a voice recognition result; the voice recognition model is obtained by training a support vector machine model according to historical voice signals.
2. The method for recognizing voice based on mel frequency cepstral coefficient-support vector machine architecture as claimed in claim 1, further comprising, before said obtaining the voice signal to be recognized:
acquiring a historical sound signal; the historical sound signal comprises a non-invasive sound signal frame and an invasive sound signal frame;
extracting sound characteristic data of the historical sound signal;
and training a support vector machine model by taking the sound characteristic data of the historical sound signal as input and taking whether the historical sound signal contains an invading sound signal frame as output to obtain the sound identification model.
3. The method for recognizing voice based on mel frequency cepstral coefficient-support vector machine architecture as claimed in claim 1, further comprising, after said obtaining the voice signal to be recognized:
carrying out normalization processing on the sound signal to be recognized;
and filtering the normalized voice signal to be identified.
4. The method as claimed in claim 1, wherein the extracting the voice feature data of the voice signal to be recognized specifically comprises:
framing the voice signal to be identified to obtain a plurality of voice signal frames to be identified;
windowing each voice signal frame to be identified to obtain a plurality of voice windowed signal frames to be identified;
using the formula

X_a(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N}, \quad 0 \le k \le N,

respectively carrying out Fourier transform on each voice windowed signal frame to be identified to obtain a plurality of Fourier-transformed voice windowed signal frames to be identified;
using the formula

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad n = 1, 2, \ldots, L,

respectively carrying out cosine transform processing on each Fourier-transformed voice windowed signal frame to be identified to obtain the multi-dimensional Mel cepstral coefficients of the signal to be identified;
determining the multi-dimensional Mel cepstral coefficients as the static characteristic data of the sound signal to be identified; determining the first-order difference and the second-order difference of the static characteristic data of the sound signal to be identified as the dynamic characteristic data of the sound signal to be identified;
wherein X_a(k) is the Fourier-transformed voice windowed signal frame to be identified, x(n) is the voice windowed signal frame to be identified, k is the index of the Fourier transform points, 0 ≤ k ≤ N, N is the total number of Fourier transform points, C(n) is the n-th dimensional Mel cepstral coefficient, and s(m) is the logarithmic energy output by the m-th filter bank,

s(m) = \ln\left( \sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k) \right), \quad 0 \le m \le M,

where H_m(k) is the frequency response of the m-th filter, M is the number of filters, n = 1, 2, \ldots, L, and L is the dimension of the multi-dimensional Mel cepstral coefficients.
5. A voice recognition system based on mel-frequency cepstral coefficient-support vector machine architecture, the system comprising:
the voice signal to be recognized acquisition module is used for acquiring a voice signal to be recognized;
the first sound characteristic data extraction module is used for extracting the sound characteristic data of the sound signal to be identified; the sound characteristic data comprises static characteristic data and dynamic characteristic data of the sound signal to be identified;
the voice signal identification module is used for inputting the voice characteristic data of the voice signal to be identified into a voice identification model to obtain a voice identification result; the voice recognition model is obtained by training a support vector machine model according to historical voice signals.
6. The system of claim 5, wherein the system further comprises:
the historical sound signal acquisition module is used for acquiring a historical sound signal; the historical sound signal comprises a non-invasive sound signal frame and an invasive sound signal frame;
the second sound characteristic data extraction module is used for extracting sound characteristic data of the historical sound signal;
and the voice recognition model determining module is used for training a support vector machine model by taking the voice feature data of the historical voice signal as input and taking whether the historical voice signal contains an invading voice signal frame as output to obtain the voice recognition model.
7. The system of claim 5, wherein the system further comprises:
the normalization module is used for performing normalization processing on the voice signal to be recognized;
and the filtering module is used for filtering the normalized sound signal to be identified.
8. The system according to claim 5, wherein the first voice feature data extracting module specifically includes:
the framing processing unit is used for framing the voice signal to be identified to obtain a plurality of voice signal frames to be identified;
the system comprises a to-be-identified sound windowing signal frame determining unit, a processing unit and a processing unit, wherein the to-be-identified sound windowing signal frame determining unit is used for respectively performing windowing processing on each to-be-identified sound signal frame to obtain a plurality of to-be-identified sound windowing signal frames;
a Fourier transform unit, configured to use the formula

X_a(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N}, \quad 0 \le k \le N,

to respectively carry out Fourier transform on each sound windowed signal frame to be identified, obtaining a plurality of Fourier-transformed sound windowed signal frames to be identified;
a multi-dimensional Mel cepstral coefficient determining unit, configured to use the formula

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad n = 1, 2, \ldots, L,

to respectively carry out cosine transform processing on each Fourier-transformed sound windowed signal frame to be identified, obtaining the multi-dimensional Mel cepstral coefficients of the signal to be identified;
and a sound characteristic data extraction unit, configured to determine the multi-dimensional Mel cepstral coefficients as the static characteristic data of the sound signal to be identified, and to determine the first-order difference and the second-order difference of the static characteristic data of the sound signal to be identified as the dynamic characteristic data of the sound signal to be identified;
wherein X_a(k) is the Fourier-transformed sound windowed signal frame to be identified, x(n) is the sound windowed signal frame to be identified, k is the index of the Fourier transform points, 0 ≤ k ≤ N, N is the total number of Fourier transform points, C(n) is the n-th dimensional Mel cepstral coefficient, and s(m) is the logarithmic energy output by the m-th filter bank,

s(m) = \ln\left( \sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k) \right), \quad 0 \le m \le M,

where H_m(k) is the frequency response of the m-th filter, M is the number of filters, n = 1, 2, \ldots, L, and L is the dimension of the multi-dimensional Mel cepstral coefficients.
CN202110908188.XA 2021-08-09 2021-08-09 Voice recognition method based on Mel cepstrum coefficient-support vector machine architecture Pending CN113571089A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110908188.XA CN113571089A (en) 2021-08-09 2021-08-09 Voice recognition method based on Mel cepstrum coefficient-support vector machine architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110908188.XA CN113571089A (en) 2021-08-09 2021-08-09 Voice recognition method based on Mel cepstrum coefficient-support vector machine architecture

Publications (1)

Publication Number Publication Date
CN113571089A true CN113571089A (en) 2021-10-29

Family

ID=78170921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110908188.XA Pending CN113571089A (en) 2021-08-09 2021-08-09 Voice recognition method based on Mel cepstrum coefficient-support vector machine architecture

Country Status (1)

Country Link
CN (1) CN113571089A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930870A (en) * 2012-09-27 2013-02-13 福州大学 Bird voice recognition method using anti-noise power normalization cepstrum coefficients (APNCC)
CN107424625A (en) * 2017-06-27 2017-12-01 南京邮电大学 A kind of multicenter voice activity detection approach based on vectorial machine frame
US10403303B1 (en) * 2017-11-02 2019-09-03 Gopro, Inc. Systems and methods for identifying speech based on cepstral coefficients and support vector machines
CN110155064A (en) * 2019-04-22 2019-08-23 江苏大学 Special vehicle traveling lane identification based on voice signal with from vehicle lane change decision system and method
CN110265035A (en) * 2019-04-25 2019-09-20 武汉大晟极科技有限公司 A kind of method for distinguishing speek person based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张冠华: "Research on whale call classification based on convolutional neural networks" (基于卷积神经网络的鲸鱼叫声分类研究) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114199594A (en) * 2021-12-14 2022-03-18 奇瑞汽车股份有限公司 Vehicle steering abnormal sound identification method and system
CN114199594B (en) * 2021-12-14 2022-10-21 奇瑞汽车股份有限公司 Method and system for identifying abnormal steering sound of vehicle
CN116801456A (en) * 2023-08-22 2023-09-22 深圳市创洺盛光电科技有限公司 Intelligent control method of LED lamp

Similar Documents

Publication Publication Date Title
CN112257521B (en) CNN underwater acoustic signal target identification method based on data enhancement and time-frequency separation
CN113571089A (en) Voice recognition method based on Mel cepstrum coefficient-support vector machine architecture
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN103413113A (en) Intelligent emotional interaction method for service robot
CN103503060A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN111261189B (en) Vehicle sound signal feature extraction method
CN113488073B (en) Fake voice detection method and device based on multi-feature fusion
Gunasekaran et al. Content-based classification and retrieval of wild animal sounds using feature selection algorithm
Shekofteh et al. Feature extraction based on speech attractors in the reconstructed phase space for automatic speech recognition systems
US20110218802A1 (en) Continuous Speech Recognition
Shan-shan et al. Research on bird songs recognition based on MFCC-HMM
CN112397090B (en) Real-time sound classification method and system based on FPGA
CN112052880A (en) Underwater sound target identification method based on weight updating support vector machine
CN102141812A (en) Robot
CN112418173A (en) Abnormal sound identification method and device and electronic equipment
Zhang et al. Environmental sound recognition using double-level energy detection
Dhakal et al. Detection and identification of background sounds to improvise voice interface in critical environments
Prasad et al. Gender based emotion recognition system for telugu rural dialects using hidden markov models
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
Makropoulos et al. Convolutional recurrent neural networks for the classification of cetacean bioacoustic patterns
Wang et al. Multi-Scale Permutation Entropy for Audio Deepfake Detection
Shekofteh et al. Using phase space based processing to extract proper features for ASR systems
Xie et al. MDF-Net: A multi-view dual-attention fusion network for efficient bird sound classification
Heriyanto et al. Comparison of Mel Frequency Cepstral Coefficient (MFCC) Feature Extraction, With and Without Framing Feature Selection, to Test the Shahada Recitation
Bencharif et al. Parallel implementation of distributed acoustic sensor acquired signals: detection, processing, and classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211029