CN108766419B - Abnormal voice distinguishing method based on deep learning - Google Patents
- Publication number
- CN108766419B · Application CN201810417478.2A
- Authority
- CN
- China
- Prior art keywords
- layer
- voice
- frame
- convolution
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses an abnormal voice distinguishing method based on deep learning, which comprises the following steps: obtaining input voice and performing resampling, pre-emphasis, and framing-and-windowing preprocessing on it to obtain preprocessed voice; extracting mel-frequency cepstrum coefficient feature vectors from the preprocessed voice; normalizing voice sections with different frame numbers to a fixed frame number and obtaining a corresponding mel-frequency cepstrum coefficient feature matrix for each voice section; establishing a convolution depth confidence network; inputting the mel-frequency cepstrum coefficient feature matrix into the convolution depth confidence network, training it, and classifying the state of the input voice; and calling a hidden Markov model to perform template matching according to the classification result to obtain a voice recognition result. The invention uses the multiple nonlinear transformation layers of the convolution depth confidence network to map the input MFCC features to a higher-dimensional space and uses hidden Markov models to model the voices in different states separately, thereby improving the accuracy of voice recognition.
Description
Technical Field
The invention relates to the field of intelligent voice processing research, and in particular to an abnormal voice distinguishing method based on deep learning.
Background
Speech is one of the important ways humans interact with machines. After decades of research, speech recognition technology has developed greatly and penetrated our daily lives. However, existing speech recognition research still faces the following problems:
In real life, a speaker's poor health or other factors can cause the input speech to shift from normal speech to abnormal speech, and noise interference may increase. Abnormal speech generally refers to speech with complex background noise, speech produced with a deliberately changed speaking style or habit, speech affected by lesions of the vocal organs, and the like.
Another problem is that conventional speech recognition systems often use linear predictive cepstral coefficients (LPCC) and mel-frequency cepstral coefficients (MFCC). The main information carried by these low-level acoustic features is the pronunciation (text) content, and speaker information is easily interfered with by that content as well as by channel and noise information, so the recognition performance of the system degrades.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art by providing an abnormal speech distinguishing method based on deep learning. The method uses the nonlinear transformation capability of a deep neural network to map low-dimensional MFCC and LPCC parameters into a high-dimensional space, better representing the high-level abstract information of the speech signal; it models normal speech and abnormal speech separately and distinguishes them effectively.
The purpose of the invention is realized by the following technical scheme:
a abnormal speech distinguishing method based on deep learning comprises the following steps:
s1, acquiring input voice, and performing resampling, pre-emphasis and frame-dividing and windowing pre-processing on the input voice to obtain pre-processed voice;
s2, extracting a Mel frequency cepstrum coefficient feature vector for each frame of voice of the preprocessed voice by utilizing a Mel frequency filter bank and Fourier transform;
s3, regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
s4, establishing a convolution depth confidence network;
s5, inputting the Mel frequency cepstrum coefficient feature matrix into a convolution depth confidence network, training, and classifying the state of input voice;
and S6, calling the hidden Markov model to carry out template matching according to the classification result to obtain a voice recognition result.
In step S1, the sampling frequency of the resampling is 22.05kHz, and the encoding mode is wav format;
the pre-emphasis uses a first order FIR high pass filter with a transfer function of:
H(z) = 1 - a·z^(-1),
wherein a is the high-pass filter coefficient and takes the value 0.93; the pre-emphasized speech signal is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, …, Length-1,
wherein y(n) is the pre-emphasized speech signal, sp(n) is the speech signal before pre-emphasis, sp(n-1) is its time shift, and Length is the speech signal length;
the framing and windowing specifically comprises slicing the speech: an audio signal of fixed length is intercepted from the input speech at fixed time intervals as one frame, using a Hamming window with a frame length of 25 ms and a frame shift of 10 ms.
In step S2, the specific procedure is as follows:
V1, designing a bank of L mel-frequency filters with a triangular shape; let W_l(k), l = 1, 2, …, L denote the frequency response of the l-th mel-frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of speech, Q also being the number of points of the Fourier transform; f_l and f_h are respectively the lower and upper cutoff frequencies of the speech signal. Performing a Q-point fast Fourier transform on a frame of speech with frame length Q yields Q frequency components; o(l), c(l) and h(l) are respectively the subscript values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter, and they satisfy:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the side-lobe attenuation position of the adjacent filter;
at the same time, o(l)|_{l=1} corresponds to f_l and h(l)|_{l=L} corresponds to f_h; the subscript value of the center frequency of the l-th mel-frequency filter among the Q frequency components is therefore expressed as:
c(l) = round( (Q/Fs) · Mel^(-1)( Mel(f_l) + l·(Mel(f_h) - Mel(f_l))/(L+1) ) ), l = 1, 2, …, L,
wherein Mel(f_1) = 2595·lg(1 + f_1/700) maps the actual frequency to the mel frequency, Mel^(-1)(f_2) = 700·(10^(f_2/2595) - 1) is the inverse function of Mel(f_1), f_1 is the actual frequency and f_2 is the mel frequency;
the frequency response of the l-th mel-frequency filter is:
W_l(k) = (k - o(l)) / (c(l) - o(l)) for o(l) ≤ k ≤ c(l),
W_l(k) = (h(l) - k) / (h(l) - c(l)) for c(l) ≤ k ≤ h(l),
W_l(k) = 0 otherwise,
wherein k is the subscript value of a frequency component among the Q frequency components;
V2, performing a Q-point fast Fourier transform on one frame of the speech signal x(n), n = 0, 1, …, Q-1 (Q < Length), after resampling, pre-emphasis and framing/windowing, to obtain the frequency spectrum X(k) and the amplitude spectrum |X(k)|:
X(k) = Σ_{n=0}^{Q-1} x(n)·e^(-j2πnk/Q), k = 0, 1, …, Q-1;
V3, passing the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter:
m(l) = Σ_{k=o(l)}^{h(l)} W_l(k)·|X(k)|, l = 1, 2, …, L;
V4, taking the logarithm of the output amplitude spectra of all filters and then applying a discrete cosine transform to obtain the mel-frequency cepstrum coefficients:
c_mfcc(i) = Σ_{l=1}^{L} lg(m(l))·cos( π·i·(2l-1) / (2L) ), i = 1, 2, …, L;
taking the 2nd to (M+1)-th of the L coefficients forms the M-dimensional mel-frequency cepstrum coefficient feature vector of each frame: C = {c_mfcc(2), c_mfcc(3), …, c_mfcc(M+1)}.
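The V1–V4 pipeline above can be condensed into a short sketch. The uniform mel spacing and the rounding of the filter edges follow the formulas above, but numpy/scipy, the default L = 24 filters, and the helper names are assumptions for illustration, not a verbatim transcription of the patent.

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_inv(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, L=24, M=12, f_l=0.0, f_h=None):
    Q = len(frame)                               # frame length = FFT points
    f_h = f_h if f_h is not None else fs / 2.0
    X = np.abs(np.fft.rfft(frame, Q))            # amplitude spectrum |X(k)|
    # L+2 boundary points equally spaced on the mel scale -> o(l), c(l), h(l).
    pts = mel_inv(np.linspace(mel(f_l), mel(f_h), L + 2))
    bins = np.floor(Q * pts / fs).astype(int)
    logm = np.empty(L)
    for l in range(L):                           # triangular filter W_l(k)
        o, c, h = bins[l], bins[l + 1], bins[l + 2]
        w = np.zeros(len(X))
        w[o:c] = (np.arange(o, c) - o) / max(c - o, 1)
        w[c:h] = (h - np.arange(c, h)) / max(h - c, 1)
        logm[l] = np.log10(max(w @ X, 1e-10))    # log output amplitude m(l)
    cc = dct(logm, type=2, norm='ortho')         # discrete cosine transform
    return cc[1:M + 1]                           # coefficients 2..M+1
```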
In step S2, the dimension of the mel-frequency cepstrum coefficient feature matrix is N × M, where N is the fixed frame number after normalization, i.e. the target frame number of time warping, and M is the order of the mel-frequency cepstrum coefficients of each frame of speech; the mel-frequency cepstrum coefficients are 12-dimensional, i.e. M = 12.
In step S3, because the durations of different speech segments differ, the number of frames they contain may be inconsistent; since the input of the neural network must have a fixed dimension, the speech signal must be time-warped so that the network input has a fixed size. Time warping is performed with a feature point sequence method (a code sketch follows the procedure below); the warping specifically comprises:
Y1, building an (n-N+1)-layer time warping network, wherein n is the frame number of the speech segment before warping and N is the target frame number after time warping; the frame number of the first layer is n and the frame number of the last layer is N;
Y2, the i-th layer of the time warping network has n-(i-1) frames, each frame corresponding to a feature vector, forming a group of n-(i-1) feature vectors:
C_i = { C_i^1, C_i^2, …, C_i^(n-(i-1)) },
wherein i = 1, 2, …, n-N+1, k = 1, 2, …, n-(i-1), and C_i^k represents the mel-frequency cepstrum coefficient feature vector of the k-th frame of speech of the i-th layer of the network;
in particular, the feature vector group of the first layer of the network is the feature vector group input to the network:
C_1 = { C_1^1, C_1^2, …, C_1^n };
let w_i^k represent the weight of the speech frame represented by the vector C_i^k; when i = 1, w_1^k = 1 for every k;
Y3, finding the two adjacent frames with the smallest distance, merging them into one frame, and reducing the frame number by one, namely:
j = argmin_j d_i^j,
C_(i+1)^j = ( w_i^j·C_i^j + w_i^(j+1)·C_i^(j+1) ) / ( w_i^j + w_i^(j+1) ),
w_(i+1)^j = w_i^j + w_i^(j+1),
C_(i+1)^k = C_i^k and w_(i+1)^k = w_i^k for k < j; C_(i+1)^k = C_i^(k+1) and w_(i+1)^k = w_i^(k+1) for k > j,
wherein C_(i+1)^j represents the mel-frequency cepstrum coefficient feature vector of the j-th frame of speech of the (i+1)-th layer of the network; C_i^(j+1) represents that of the (j+1)-th frame of the i-th layer; w_i^j and w_i^(j+1) represent the weights of the j-th and (j+1)-th frames of the i-th layer; w_(i+1)^j represents the weight of the j-th frame of the (i+1)-th layer; d_i^j represents the distance between the mel-frequency cepstrum coefficient feature vectors of the j-th and (j+1)-th frames of the i-th layer;
the procedure is repeated from i = 1 to i = n-N, reducing the frame number by one at each pass, finally normalizing the n-frame speech signal to a fixed N frames.
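A compact sketch of the Y1–Y3 merging procedure, assuming Euclidean distance between MFCC vectors and the weighted-average merge described above; the function name is illustrative.

```python
import numpy as np

def time_warp(feats, N):
    """feats: (n, M) MFCC matrix, n >= N. Returns an (N, M) matrix."""
    vecs = [v.astype(float) for v in feats]
    wts = [1.0] * len(vecs)                 # frame weights, all 1 at layer 1
    while len(vecs) > N:
        # Distance d_i^j between every pair of adjacent frames.
        d = [np.linalg.norm(vecs[j] - vecs[j + 1]) for j in range(len(vecs) - 1)]
        j = int(np.argmin(d))
        # Merge frames j and j+1 into one weight-averaged frame.
        w = wts[j] + wts[j + 1]
        vecs[j] = (wts[j] * vecs[j] + wts[j + 1] * vecs[j + 1]) / w
        wts[j] = w
        del vecs[j + 1], wts[j + 1]         # frame count decreases by one
    return np.stack(vecs)
```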
In step S4, the convolution depth confidence network is formed by stacking a plurality of convolution restricted Boltzmann machines from bottom to top, and the output layer adopts a Softmax classifier. Each convolution restricted Boltzmann machine consists of one input layer V and one convolution layer H; a pooling layer is added after the convolution layer H of each convolution restricted Boltzmann machine to perform a pooling operation with pooling size E′ × F′ and pooling stride s3 × s4, where s3 = s4 = 2, so that the pooling layer of each convolution restricted Boltzmann machine serves as the input layer of the next one.
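A minimal sketch of the pooling operation just described (pool size 2 × 2, stride 2 × 2); max pooling is an assumed implementation choice, since the text specifies only the geometry.

```python
import numpy as np

def pool2x2(h):
    """h: (O, N', M') convolution-layer output; returns (O, N'//2, M'//2)."""
    O, n, m = h.shape
    h = h[:, :n - n % 2, :m - m % 2]          # crop to even dimensions
    return h.reshape(O, n // 2, 2, m // 2, 2).max(axis=(2, 4))
```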
In step S5, the convolution restricted Boltzmann machine at the bottom is trained first, and then the ones above it, specifically:
Z1, let the number of input channels of the convolution restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, namely:
V = {v_1, v_2, …, v_I}, v_i ∈ R^(y×s), i = 1, 2, …, I,
wherein V is the input layer of the convolution restricted Boltzmann machine and v_i ∈ R^(y×s) is the i-th channel of the input layer;
in particular, in the bottom convolution restricted Boltzmann machine, I = 1, y = N and s = M; that is, the bottom machine has a single input channel, corresponding to the two-dimensional mel-frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, N being the target frame number of time warping and M the order of the mel-frequency cepstrum coefficients of each frame of speech;
Z2, the convolution process uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, namely the kernel set W = {w_1, w_2, …, w_O}, w_j ∈ R^(I×E×F), j = 1, 2, …, O, where w_j is the j-th convolution kernel; the number of output channels of the convolution restricted Boltzmann machine is therefore O, each output channel corresponding to a certain local feature of the input, namely:
the convolution layer H = {h_1, h_2, …, h_O}, h_j ∈ R^(N′×M′), j = 1, 2, …, O, where h_j, the j-th channel of the convolution layer, corresponds to a feature-mapped two-dimensional feature matrix of size N′ × M′;
Z3, setting the convolution stride to s1 × s2; all neurons in the same channel of the input layer share a bias a_i, i = 1, 2, …, I, and neurons in the same group of the convolution layer share a bias b_j, j = 1, 2, …, O; the parameters to be trained are θ = {W, a, b};
Z4, the convolution restricted Boltzmann machine is an energy-based model whose energy function is defined as:
E(V, H) = - Σ_j Σ_{m,k} h_{m,k}^j · ( (w_j ∗ V)_{m,k} + b_j ) - Σ_i a_i Σ_{p,q} v_{p,q}^i,
wherein (w_j ∗ V)_{m,k} denotes the convolution of the kernel w_j with the input at hidden position (m, k) under stride s1 × s2;
the joint probability distribution of the input layer V and the convolution layer H, i.e. the joint distribution of all neuron values {v_{p,q}^i} of the input layer V and all neuron values {h_{m,k}^j} of the convolution layer H, is obtained from the energy function:
P(V, H) = exp(-E(V, H)) / Z(θ),
wherein Z(θ) = Σ_{V,H} exp(-E(V, H)) is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′;
the marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values {v_{p,q}^i} of the input layer V, is:
P(V) = Σ_H exp(-E(V, H)) / Z(θ);
if the training sample set TS has T samples, the log-likelihood function on the input layer V is:
ℓ(θ) = Σ_{t=1}^{T} ln P(V^(t));
Z5, the log-likelihood function is maximized with a gradient ascent algorithm combined with the contrastive divergence algorithm to obtain the parameters θ, namely
θ* = argmax_θ ℓ(θ).
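The Z4–Z5 training step can be illustrated with a single contrastive-divergence (CD-1) update for a single-channel convolutional RBM. Binary units, unit stride, and the simplified Gibbs step are simplifying assumptions for illustration, not the patent's exact procedure.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.01):
    """v0: (y, s) input; W: (O, E, F) kernels; a: float visible bias;
    b: (O,) hidden biases. Returns updated (W, a, b)."""
    O = W.shape[0]
    # Positive phase: hidden activation probabilities per output channel.
    h0 = np.stack([sigmoid(correlate2d(v0, W[j], mode='valid') + b[j])
                   for j in range(O)])
    h_samp = (np.random.rand(*h0.shape) < h0).astype(float)
    # Negative phase: reconstruct the visible layer, then re-infer hiddens.
    v1 = sigmoid(sum(convolve2d(h_samp[j], W[j], mode='full')
                     for j in range(O)) + a)
    h1 = np.stack([sigmoid(correlate2d(v1, W[j], mode='valid') + b[j])
                   for j in range(O)])
    # Gradient ascent on the log-likelihood (CD approximation).
    for j in range(O):
        W[j] += lr * (correlate2d(v0, h0[j], mode='valid')
                      - correlate2d(v1, h1[j], mode='valid'))
    a += lr * (v0.mean() - v1.mean())
    b += lr * (h0.mean(axis=(1, 2)) - h1.mean(axis=(1, 2)))
    return W, a, b
```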
Z6, a Softmax output layer is adopted as the output layer, with 2 neurons whose outputs respectively represent the probability that the sample is normal speech and abnormal speech; in actual identification, the category with the larger probability is taken as the final classification result; the output layer is fully connected to the pooling layer of the topmost convolution restricted Boltzmann machine, and the parameters to be trained are the biases c_1, c_2 of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
WE_i = {we_i^1, we_i^2, …, we_i^r}, i = 1, 2,
wherein r is the number of neurons of the topmost pooling layer; when i = 1, WE_1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and we_1^k (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, WE_2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and we_2^k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer;
let the output of the topmost pooling layer be G = {g_1, g_2, …, g_r}, g_i being the output of the i-th neuron of the topmost pooling layer; the input of the output layer is then f_i = WE_i · G^T + c_i, where f_1 (i = 1) is the input of the first output-layer neuron and f_2 (i = 2) the input of the second;
the output values of the output layer are:
y_i = exp(f_i) / (exp(f_1) + exp(f_2)), i = 1, 2,
wherein y_1 (i = 1), the output of the 1st neuron of the Softmax output layer, is the probability that the input speech is normal speech, and y_2 (i = 2), the output of the 2nd neuron, is the probability that the input speech is abnormal speech;
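A minimal sketch of the Z6 output computation, f_i = WE_i·G^T + c_i followed by Softmax normalization; the max-subtraction for numerical stability is an added implementation detail.

```python
import numpy as np

def softmax_output(G, WE, c):
    """G: (r,) flattened topmost pooling-layer output; WE: (2, r); c: (2,)."""
    f = WE @ G + c              # inputs of the two output-layer neurons
    e = np.exp(f - f.max())     # subtract max for numerical stability
    y = e / e.sum()
    return y                    # y[0]: P(normal speech), y[1]: P(abnormal)
```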
Z7, when the Softmax classifier is trained, training proceeds in Mini-batch mode: T training samples are grabbed at a time for training, and the parameters are updated once, adopting the loss function:
J = Σ_{j=1}^{T} Σ_{i=1}^{2} (d_i^(j) - y_i^(j))²,
wherein J is the sum of squared errors during training and d_i^(j) is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample is normal speech, (d_1^(j), d_2^(j)) = (1, 0), and otherwise (0, 1); y_i^(j) is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
F1 = 2·P·R / (P + R), with P = TP / (TP + FP) and R = TP / (TP + FN),
wherein the precision P is the proportion of samples correctly identified as abnormal speech among all samples identified as abnormal speech, and the recall R is the proportion of samples correctly identified as abnormal speech among all abnormal-speech samples; among the T samples, TP is the number of abnormal-speech samples correctly identified as abnormal speech, FP the number of normal-speech samples incorrectly identified as abnormal speech, and FN the number of abnormal-speech samples incorrectly identified as normal speech;
during training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum, at which point training ends.
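For reference, the precision/recall/F1 quantities defined above can be computed as below, treating abnormal speech as the positive class; this is an illustrative helper, not part of the patent.

```python
def f1_score(labels, preds):
    """labels/preds: iterables of 0 (normal) / 1 (abnormal)."""
    TP = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    P = TP / (TP + FP) if TP + FP else 0.0   # precision
    R = TP / (TP + FN) if TP + FN else 0.0   # recall
    return 2 * P * R / (P + R) if P + R else 0.0
```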
In step S6, the specific process is as follows: in the online recognition part, the deep neural network loads the network structure and weight coefficients obtained by training, the hidden Markov model loads the pre-learned models, and the real-time input voice is preprocessed and its features are extracted; HMM templates of normal and abnormal voice are established respectively, i.e. the HMM template of normal voice is trained with normal voice and the HMM template of abnormal voice is trained with abnormal voice; during online recognition, different HMM templates are called according to the decision output by the convolution depth confidence network, and template matching is performed on the mel-frequency cepstrum coefficients of the input voice signal to obtain the final voice recognition result.
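A sketch of the step S6 decision logic: the convolution depth confidence network selects the speech state, and the corresponding HMM templates score the MFCC sequence. hmmlearn-style model objects with a `score` (log-likelihood) method are an assumption for illustration.

```python
def recognize(mfcc_seq, cdbn_classify, normal_templates, abnormal_templates):
    """mfcc_seq: (N, M) feature matrix of one utterance;
    normal/abnormal_templates: dicts mapping word -> trained HMM."""
    state = cdbn_classify(mfcc_seq)          # 0: normal speech, 1: abnormal
    templates = normal_templates if state == 0 else abnormal_templates
    # Template matching: pick the word model with the highest log-likelihood.
    scores = {word: model.score(mfcc_seq) for word, model in templates.items()}
    return max(scores, key=scores.get)
```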
Compared with the prior art, the invention has the following advantages and beneficial effects:
the method combines the convolution deep learning confidence network and the hidden Markov model, utilizes the multilayer nonlinear transformation layers of the convolution deep confidence network to map the input MFCC characteristics to a higher dimensional space, more comprehensively represents voice information, and effectively distinguishes abnormal voice and normal voice; and then, the time series modeling capability of the hidden Markov model is utilized to respectively model the voices in different states, so that the recognition accuracy of the voices is greatly improved.
Drawings
FIG. 1 is a flowchart of an off-line training method for abnormal speech discrimination based on deep learning according to the present invention;
FIG. 2 is a flow chart of the online recognition of the abnormal speech distinguishing method based on deep learning according to the present invention;
FIG. 3 is a schematic diagram of a convolution deep belief network recognition speech state of an abnormal speech discrimination method based on deep learning according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
An abnormal speech distinguishing method based on deep learning comprises the following steps:
the first step is as follows: acquiring input voice, and carrying out preprocessing such as resampling, pre-emphasis, framing and windowing on the input voice to obtain preprocessed voice;
the resampling specifically comprises: the input voice has different sampling frequencies and coding modes, so that the original input voice signal is resampled to facilitate the processing and analysis of data, and the sampling frequencies and the coding modes are unified; the sampling frequency is 22.05kHz, and the coding mode is wav format.
The pre-emphasis is specifically as follows: the power spectrum of the sound signal is reduced along with the increase of the frequency, most energy is concentrated in a low-frequency range, in order to improve the high-frequency part of the original sound signal, the original input sound signal is subjected to pre-emphasis processing, a first-order FIR high-pass filter is adopted, and the transmission function of the FIR high-pass filter is as follows:
H(z)=1-az-1,
wherein a is a high-pass filter coefficient and takes a value of 0.93; the pre-emphasized speech signal is:
y(n)=sp(n)-sp(n-1),n=0,1,...,Length-1
wherein, y (n) is the pre-emphasis voice signal, sp (n) is the voice signal before pre-emphasis, sp (n-1) is the time shift of the voice signal, and Length is the voice signal Length;
the frame windowing specifically comprises: the method comprises the steps of slicing the voice, intercepting an audio signal with a fixed length in the input voice at fixed time intervals into a frame, and performing frame division windowing by adopting a Hamming window with the frame length of 25ms and the frame shift of 10 ms.
The second step: extracting mel-frequency cepstrum coefficient features for each frame of the preprocessed voice;
the specific process is as follows:
V1, designing a bank of L mel-frequency filters with a triangular shape; let W_l(k), l = 1, 2, …, L denote the frequency response of the l-th mel-frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of speech, Q also being the number of points of the Fourier transform; f_l and f_h are respectively the lower and upper cutoff frequencies of the speech signal. Performing a Q-point fast Fourier transform on a frame of speech with frame length Q yields Q frequency components; o(l), c(l) and h(l) are respectively the subscript values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter, and they satisfy:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the side-lobe attenuation position of the adjacent filter;
at the same time, o(l)|_{l=1} corresponds to f_l and h(l)|_{l=L} corresponds to f_h; the subscript value of the center frequency of the l-th mel-frequency filter among the Q frequency components is therefore expressed as:
c(l) = round( (Q/Fs) · Mel^(-1)( Mel(f_l) + l·(Mel(f_h) - Mel(f_l))/(L+1) ) ), l = 1, 2, …, L,
wherein Mel(f_1) = 2595·lg(1 + f_1/700) maps the actual frequency to the mel frequency, Mel^(-1)(f_2) = 700·(10^(f_2/2595) - 1) is the inverse function of Mel(f_1), f_1 is the actual frequency and f_2 is the mel frequency;
the frequency response of the l-th mel-frequency filter is:
W_l(k) = (k - o(l)) / (c(l) - o(l)) for o(l) ≤ k ≤ c(l),
W_l(k) = (h(l) - k) / (h(l) - c(l)) for c(l) ≤ k ≤ h(l),
W_l(k) = 0 otherwise,
wherein k is the subscript value of a frequency component among the Q frequency components;
V2, performing a Q-point fast Fourier transform on one frame of the speech signal x(n), n = 0, 1, …, Q-1 (Q < Length), after resampling, pre-emphasis and framing/windowing, to obtain the frequency spectrum X(k) and the amplitude spectrum |X(k)|:
X(k) = Σ_{n=0}^{Q-1} x(n)·e^(-j2πnk/Q), k = 0, 1, …, Q-1;
V3, passing the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter:
m(l) = Σ_{k=o(l)}^{h(l)} W_l(k)·|X(k)|, l = 1, 2, …, L;
V4, taking the logarithm of the output amplitude spectra of all filters and then applying a discrete cosine transform to obtain the mel-frequency cepstrum coefficients:
c_mfcc(i) = Σ_{l=1}^{L} lg(m(l))·cos( π·i·(2l-1) / (2L) ), i = 1, 2, …, L;
taking the 2nd to (M+1)-th of the L coefficients forms the M-dimensional mel-frequency cepstrum coefficient feature vector of each frame: C = {c_mfcc(2), c_mfcc(3), …, c_mfcc(M+1)}.
The third step: regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
Because the durations of different speech segments differ, the number of frames contained in different speech segments may be inconsistent; since the input of the neural network must have a fixed dimension, the speech signal must be time-warped so that the network input has a fixed size. Time warping is performed with a feature point sequence method, and the warping process specifically comprises:
Y1, building an (n-N+1)-layer time warping network, wherein n is the frame number of the speech segment before warping and N is the target frame number after time warping; the frame number of the first layer is n and the frame number of the last layer is N;
Y2, the i-th layer of the time warping network has n-(i-1) frames, each frame corresponding to a feature vector, forming a group of n-(i-1) feature vectors:
C_i = { C_i^1, C_i^2, …, C_i^(n-(i-1)) },
wherein i = 1, 2, …, n-N+1, k = 1, 2, …, n-(i-1), and C_i^k represents the mel-frequency cepstrum coefficient feature vector of the k-th frame of speech of the i-th layer of the network;
in particular, the feature vector group of the first layer of the network is the feature vector group input to the network:
C_1 = { C_1^1, C_1^2, …, C_1^n };
let w_i^k represent the weight of the speech frame represented by the vector C_i^k; when i = 1, w_1^k = 1 for every k;
Y3, finding the two adjacent frames with the smallest distance, merging them into one frame, and reducing the frame number by one, namely:
j = argmin_j d_i^j,
C_(i+1)^j = ( w_i^j·C_i^j + w_i^(j+1)·C_i^(j+1) ) / ( w_i^j + w_i^(j+1) ),
w_(i+1)^j = w_i^j + w_i^(j+1),
C_(i+1)^k = C_i^k and w_(i+1)^k = w_i^k for k < j; C_(i+1)^k = C_i^(k+1) and w_(i+1)^k = w_i^(k+1) for k > j,
wherein C_(i+1)^j represents the mel-frequency cepstrum coefficient feature vector of the j-th frame of speech of the (i+1)-th layer of the network; C_i^(j+1) represents that of the (j+1)-th frame of the i-th layer; w_i^j and w_i^(j+1) represent the weights of the j-th and (j+1)-th frames of the i-th layer; w_(i+1)^j represents the weight of the j-th frame of the (i+1)-th layer; d_i^j represents the distance between the mel-frequency cepstrum coefficient feature vectors of the j-th and (j+1)-th frames of the i-th layer;
the procedure is repeated from i = 1 to i = n-N, reducing the frame number by one at each pass, finally normalizing the n-frame speech signal to a fixed N frames;
after time warping, each speech segment corresponds to a mel-frequency cepstrum coefficient feature matrix of dimension N × M, where N is the target frame number of the time warping and M is the order of the mel-frequency cepstrum coefficients of each frame; the mel-frequency cepstrum coefficients are 12-dimensional, i.e. M = 12.
The fourth step: establishing the convolution depth confidence network. The network is formed by stacking two convolution restricted Boltzmann machines and comprises 2 convolution layers, 2 pooling layers, 1 visible layer and 1 output layer. For the first convolution restricted Boltzmann machine, the number of visible-layer neurons is N × M = 200 × 12, the number of convolution kernels is 10, the kernel size is 2 × 2, the convolution stride is 2 × 2, and the kernels are initialized with Gaussian random values of mean 0 and variance 0.01; the initial bias of the visible layer is 0 and the initial bias of the convolution layer is -0.1. One iteration takes 100 samples, and the number of iterations is 100. The pooling size of the first pooling layer is 2 × 2 and the pooling stride is 2 × 2. For the second convolution restricted Boltzmann machine, the number of convolution kernels is 10, the kernel size is 10 × 2 × 2, the convolution stride is 2 × 2, the kernels are initialized with Gaussian random values of mean 0 and variance 0.01, and the initial bias of the second convolution layer is -0.1. One iteration takes 100 samples, and the number of iterations is 100. The pooling size of the second pooling layer is 2 × 2 and the pooling stride is 2 × 2. All convolution layers use the Sigmoid activation function. The output layer has 2 neurons and outputs the posterior probabilities of normal speech and abnormal speech; for the output layer, the convergence value of the loss function is set to 0.004 and the maximum number of iterations to 1000. These hyper-parameters are collected in the configuration sketch below.
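For reference, the Example 1 hyper-parameters gathered into a hypothetical configuration dictionary (the dictionary itself is illustrative, not part of the patent):

```python
CDBN_CONFIG = {
    "input_shape": (200, 12),        # N x M = 200 x 12 MFCC matrix
    "crbm1": {"kernels": 10, "kernel_size": (2, 2), "stride": (2, 2),
              "kernel_init": ("gaussian", 0.0, 0.01),   # mean, variance
              "visible_bias": 0.0, "conv_bias": -0.1,
              "pool_size": (2, 2), "pool_stride": (2, 2)},
    "crbm2": {"kernels": 10, "kernel_size": (10, 2, 2), "stride": (2, 2),
              "kernel_init": ("gaussian", 0.0, 0.01),
              "conv_bias": -0.1,
              "pool_size": (2, 2), "pool_stride": (2, 2)},
    "batch_size": 100, "iterations": 100,
    "activation": "sigmoid",
    "output": {"neurons": 2, "loss_convergence": 0.004, "max_iters": 1000},
}
```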
The fifth step: inputting the mel frequency cepstrum coefficient feature matrix into a convolution depth confidence network, training, and classifying states of input voice, wherein fig. 1 is a training flow chart of an off-line state;
Firstly the bottom convolution restricted Boltzmann machine is trained, and then the top one, specifically:
Z1, let the number of input channels of the convolution restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, namely:
V = {v_1, v_2, …, v_I}, v_i ∈ R^(y×s), i = 1, 2, …, I,
wherein V is the input layer of the convolution restricted Boltzmann machine and v_i ∈ R^(y×s) is the i-th channel of the input layer;
in particular, in the bottom convolution restricted Boltzmann machine, I = 1, y = N and s = M; that is, the bottom machine has a single input channel, corresponding to the two-dimensional mel-frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, N being the target frame number of time warping and M the order of the mel-frequency cepstrum coefficients of each frame of speech;
Z2, the convolution process uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, namely the kernel set W = {w_1, w_2, …, w_O}, w_j ∈ R^(I×E×F), j = 1, 2, …, O, where w_j is the j-th convolution kernel; the number of output channels of the convolution restricted Boltzmann machine is therefore O, each output channel corresponding to a certain local feature of the input, namely:
the convolution layer H = {h_1, h_2, …, h_O}, h_j ∈ R^(N′×M′), j = 1, 2, …, O, where h_j, the j-th channel of the convolution layer, corresponds to a feature-mapped two-dimensional feature matrix of size N′ × M′;
Z3, setting the convolution stride to s1 × s2; all neurons in the same channel of the input layer share a bias a_i, i = 1, 2, …, I, and neurons in the same group of the convolution layer share a bias b_j, j = 1, 2, …, O; the parameters to be trained are θ = {W, a, b};
Z4, the convolution restricted Boltzmann machine is an energy-based model whose energy function is defined as:
E(V, H) = - Σ_j Σ_{m,k} h_{m,k}^j · ( (w_j ∗ V)_{m,k} + b_j ) - Σ_i a_i Σ_{p,q} v_{p,q}^i,
wherein (w_j ∗ V)_{m,k} denotes the convolution of the kernel w_j with the input at hidden position (m, k) under stride s1 × s2;
the joint probability distribution of the input layer V and the convolution layer H, i.e. the joint distribution of all neuron values {v_{p,q}^i} of the input layer V and all neuron values {h_{m,k}^j} of the convolution layer H, is obtained from the energy function:
P(V, H) = exp(-E(V, H)) / Z(θ),
wherein Z(θ) = Σ_{V,H} exp(-E(V, H)) is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′;
the marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values {v_{p,q}^i} of the input layer V, is:
P(V) = Σ_H exp(-E(V, H)) / Z(θ);
if the training sample set TS has T samples, the log-likelihood function on the input layer V is:
ℓ(θ) = Σ_{t=1}^{T} ln P(V^(t));
Z5, the log-likelihood function is maximized with a gradient ascent algorithm combined with the contrastive divergence algorithm to obtain the parameters θ, namely
θ* = argmax_θ ℓ(θ).
Z6, a Softmax output layer is adopted as the output layer, with 2 neurons whose outputs respectively represent the probability that the sample is normal speech and abnormal speech; in actual identification, the category with the larger probability is taken as the final classification result; the output layer is fully connected to the pooling layer of the topmost convolution restricted Boltzmann machine, and the parameters to be trained are the biases c_1, c_2 of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
WE_i = {we_i^1, we_i^2, …, we_i^r}, i = 1, 2,
wherein r is the number of neurons of the topmost pooling layer; when i = 1, WE_1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and we_1^k (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, WE_2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and we_2^k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer;
let the output of the topmost pooling layer be G = {g_1, g_2, …, g_r}, g_i being the output of the i-th neuron of the topmost pooling layer; the input of the output layer is then f_i = WE_i · G^T + c_i, where f_1 (i = 1) is the input of the first output-layer neuron and f_2 (i = 2) the input of the second;
the output values of the output layer are:
y_i = exp(f_i) / (exp(f_1) + exp(f_2)), i = 1, 2,
wherein y_1 (i = 1), the output of the 1st neuron of the Softmax output layer, is the probability that the input speech is normal speech, and y_2 (i = 2), the output of the 2nd neuron, is the probability that the input speech is abnormal speech;
Z7, when the Softmax classifier is trained, training proceeds in Mini-batch mode: T training samples are grabbed at a time for training, and the parameters are updated once, adopting the loss function:
J = Σ_{j=1}^{T} Σ_{i=1}^{2} (d_i^(j) - y_i^(j))²,
wherein J is the sum of squared errors during training and d_i^(j) is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample is normal speech, (d_1^(j), d_2^(j)) = (1, 0), and otherwise (0, 1); y_i^(j) is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
F1 = 2·P·R / (P + R), with P = TP / (TP + FP) and R = TP / (TP + FN),
wherein the precision P is the proportion of samples correctly identified as abnormal speech among all samples identified as abnormal speech, and the recall R is the proportion of samples correctly identified as abnormal speech among all abnormal-speech samples; among the T samples, TP is the number of abnormal-speech samples correctly identified as abnormal speech, FP the number of normal-speech samples incorrectly identified as abnormal speech, and FN the number of abnormal-speech samples incorrectly identified as normal speech;
during training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum, at which point training ends.
The sixth step: calling the hidden Markov model to perform template matching according to the classification result to obtain the voice recognition result.
In the online recognition part, as shown in fig. 2, the deep neural network loads the network structure and weight coefficients obtained by training, the hidden Markov model loads the pre-learned models, and the real-time input voice is preprocessed and its features are extracted. HMM templates of normal and abnormal voice are established respectively: the HMM template of normal voice is trained with normal voice, and the HMM template of abnormal voice is trained with abnormal voice. During online recognition, different HMM templates are called according to the decision output by the convolution depth confidence network, and template matching is performed on the mel-frequency cepstrum coefficients of the input voice signal to obtain the final voice recognition result; the flow is shown schematically in fig. 3.
The above embodiment is a preferred embodiment of the present invention, but the present invention is not limited thereto; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention should be regarded as an equivalent replacement and is included in the protection scope of the present invention.
Claims (8)
1. An abnormal speech distinguishing method based on deep learning is characterized by comprising the following steps:
s1, acquiring input voice, and preprocessing the input voice to obtain preprocessed voice, wherein the preprocessing comprises resampling, pre-emphasis, framing and windowing;
s2, extracting a Mel frequency cepstrum coefficient feature vector for each frame of voice of the preprocessed voice by utilizing a Mel frequency filter bank and Fourier transform;
s3, regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
s4, establishing a convolution depth confidence network; the convolution depth confidence network is formed by stacking more than one convolution restricted Boltzmann machine from bottom to top, and the output layer adopts a Softmax classifier; each convolution restricted Boltzmann machine consists of one input layer V and one convolution layer H;
s5, inputting the mel-frequency cepstrum coefficient feature matrix into the convolution depth confidence network, training, and classifying the state of the input voice; the process is as follows: the bottom convolution restricted Boltzmann machine is trained first, and then the top one, specifically:
Z1, let the number of input channels of the convolution restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, namely:
V = {v_1, v_2, …, v_I}, v_i ∈ R^(y×s), i = 1, 2, …, I,
wherein V is the input layer of the convolution restricted Boltzmann machine and v_i ∈ R^(y×s) is the i-th channel of the input layer;
in the bottom convolution restricted Boltzmann machine, I = 1, y = N and s = M; that is, the bottom machine has a single input channel, corresponding to the two-dimensional mel-frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, N being the target frame number of time warping and M the order of the mel-frequency cepstrum coefficients of each frame of speech;
Z2, the convolution process uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, namely the kernel set W = {w_1, w_2, …, w_O}, w_j ∈ R^(I×E×F), j = 1, 2, …, O, where w_j is the j-th convolution kernel; the number of output channels of the convolution restricted Boltzmann machine is therefore O, each output channel corresponding to a certain local feature of the input, namely:
the convolution layer H = {h_1, h_2, …, h_O}, h_j ∈ R^(N′×M′), j = 1, 2, …, O, where h_j, the j-th channel of the convolution layer, corresponds to a feature-mapped two-dimensional feature matrix of size N′ × M′;
Z3, setting the convolution stride to s1 × s2; all neurons in the same channel of the input layer share a bias a_i, i = 1, 2, …, I, and neurons in the same group of the convolution layer share a bias b_j, j = 1, 2, …, O; the parameters to be trained are θ = {W, a, b};
Z4, the convolution restricted Boltzmann machine is an energy-based model whose energy function is defined as:
E(V, H) = - Σ_j Σ_{m,k} h_{m,k}^j · ( (w_j ∗ V)_{m,k} + b_j ) - Σ_i a_i Σ_{p,q} v_{p,q}^i,
wherein (w_j ∗ V)_{m,k} denotes the convolution of the kernel w_j with the input at hidden position (m, k) under stride s1 × s2;
the joint probability distribution of the input layer V and the convolution layer H, i.e. the joint distribution of all neuron values {v_{p,q}^i} of the input layer V and all neuron values {h_{m,k}^j} of the convolution layer H, is obtained from the energy function:
P(V, H) = exp(-E(V, H)) / Z(θ),
wherein Z(θ) = Σ_{V,H} exp(-E(V, H)) is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′;
the marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values {v_{p,q}^i} of the input layer V, is:
P(V) = Σ_H exp(-E(V, H)) / Z(θ);
if the training sample set TS has T samples, the log-likelihood function on the input layer V is:
ℓ(θ) = Σ_{t=1}^{T} ln P(V^(t));
Z5, the log-likelihood function is maximized with a gradient ascent algorithm combined with the contrastive divergence algorithm to obtain the parameters θ, namely
θ* = argmax_θ ℓ(θ);
Z6, a Softmax output layer is adopted as the output layer, with 2 neurons whose outputs respectively represent the probability that the sample is normal speech and abnormal speech; in actual identification, the category with the larger probability is taken as the final classification result; the output layer is fully connected to the pooling layer of the topmost convolution restricted Boltzmann machine, and the parameters to be trained are the biases c_1, c_2 of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
WE_i = {we_i^1, we_i^2, …, we_i^r}, i = 1, 2,
wherein r is the number of neurons of the topmost pooling layer; when i = 1, WE_1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and we_1^k (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, WE_2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and we_2^k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer;
let the output of the topmost pooling layer be G = {g_1, g_2, …, g_r}, g_i being the output of the i-th neuron of the topmost pooling layer; the input of the output layer is then f_i = WE_i · G^T + c_i, where f_1 (i = 1) is the input of the first output-layer neuron and f_2 (i = 2) the input of the second;
the output values of the output layer are y_i = exp(f_i) / (exp(f_1) + exp(f_2)), i = 1, 2, wherein y_1 (i = 1), the output of the 1st neuron of the Softmax output layer, is the probability that the input speech is normal speech, and y_2 (i = 2), the output of the 2nd neuron, is the probability that the input speech is abnormal speech;
Z7, when the Softmax classifier is trained, training proceeds in Mini-batch mode: T training samples are grabbed at a time for training, and the parameters are updated once, adopting the loss function:
J = Σ_{j=1}^{T} Σ_{i=1}^{2} (d_i^(j) - y_i^(j))²,
wherein J is the sum of squared errors during training and d_i^(j) is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample is normal speech, (d_1^(j), d_2^(j)) = (1, 0), and otherwise (0, 1); y_i^(j) is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
F1 = 2·P·R / (P + R), with P = TP / (TP + FP) and R = TP / (TP + FN),
wherein the precision P is the proportion of samples correctly identified as abnormal speech among all samples identified as abnormal speech, and the recall R is the proportion of samples correctly identified as abnormal speech among all abnormal-speech samples; among the T samples, TP is the number of abnormal-speech samples correctly identified as abnormal speech, FP the number of normal-speech samples incorrectly identified as abnormal speech, and FN the number of abnormal-speech samples incorrectly identified as normal speech;
during training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum, at which point training ends;
and S6, calling the hidden Markov model to carry out template matching according to the classification result to obtain a voice recognition result.
2. The abnormal speech distinguishing method based on deep learning of claim 1, wherein in step S1,
the sampling frequency of the resampling is 22.05kHz, and the encoding mode is wav format;
the pre-emphasis adopts a first-order FIR high-pass filter, whose transfer function is:
H(z) = 1 - a·z^(-1),
wherein a is the high-pass filter coefficient and takes the value 0.93; the pre-emphasized speech signal is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, …, Length-1;
wherein y(n) is the pre-emphasized speech signal, sp(n) is the speech signal before pre-emphasis, sp(n-1) is its time shift, and Length is the speech signal length;
the framing and windowing specifically comprises slicing the speech: an audio signal of fixed length is intercepted from the input speech at fixed time intervals as one frame, using a Hamming window with a frame length of 25 ms and a frame shift of 10 ms.
3. The abnormal speech distinguishing method based on deep learning of claim 1, wherein the step S2 specifically includes:
V1, designing a bank of L mel-frequency filters with a triangular shape; let W_l(k), l = 1, 2, …, L denote the frequency response of the l-th mel-frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of speech, Q also being the number of points of the Fourier transform; f_l and f_h are respectively the lower and upper cutoff frequencies of the speech signal. Performing a Q-point fast Fourier transform on a frame of speech with frame length Q yields Q frequency components; o(l), c(l) and h(l) are respectively the subscript values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter, and they satisfy:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the side-lobe attenuation position of the adjacent filter;
at the same time, o(l)|_{l=1} corresponds to f_l and h(l)|_{l=L} corresponds to f_h; the subscript value of the center frequency of the l-th mel-frequency filter among the Q frequency components is therefore expressed as:
c(l) = round( (Q/Fs) · Mel^(-1)( Mel(f_l) + l·(Mel(f_h) - Mel(f_l))/(L+1) ) ), l = 1, 2, …, L,
wherein Mel(f_1) = 2595·lg(1 + f_1/700) maps the actual frequency to the mel frequency, Mel^(-1)(f_2) = 700·(10^(f_2/2595) - 1) is the inverse function of Mel(f_1), f_1 is the actual frequency and f_2 is the mel frequency;
the frequency response of the l-th mel-frequency filter is:
W_l(k) = (k - o(l)) / (c(l) - o(l)) for o(l) ≤ k ≤ c(l),
W_l(k) = (h(l) - k) / (h(l) - c(l)) for c(l) ≤ k ≤ h(l),
W_l(k) = 0 otherwise,
wherein k is the subscript value of a frequency component among the Q frequency components;
V2, performing a Q-point fast Fourier transform on one frame of the speech signal x(n), n = 0, 1, …, Q-1 (Q < Length), after resampling, pre-emphasis and framing/windowing, to obtain the frequency spectrum X(k) and the amplitude spectrum |X(k)|:
X(k) = Σ_{n=0}^{Q-1} x(n)·e^(-j2πnk/Q), k = 0, 1, …, Q-1;
V3, passing the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter:
m(l) = Σ_{k=o(l)}^{h(l)} W_l(k)·|X(k)|, l = 1, 2, …, L;
V4, taking the logarithm of the output amplitude spectra of all filters and then applying a discrete cosine transform to obtain the mel-frequency cepstrum coefficients:
c_mfcc(i) = Σ_{l=1}^{L} lg(m(l))·cos( π·i·(2l-1) / (2L) ), i = 1, 2, …, L;
taking the 2nd to (M+1)-th of the L coefficients forms the M-dimensional mel-frequency cepstrum coefficient feature vector of each frame: C = {c_mfcc(2), c_mfcc(3), …, c_mfcc(M+1)}.
4. The method for distinguishing abnormal speech according to claim 1, wherein in step S2 the dimension of the mel-frequency cepstrum coefficient feature matrix is N × M, N being the target frame number of the time warping of the speech and M the order of the mel-frequency cepstrum coefficients of each frame of speech; the mel-frequency cepstrum coefficients are 12-dimensional, i.e. M = 12.
5. The method for distinguishing abnormal speech according to claim 1, wherein in step S3, the normalization specifically comprises:
y1, building an N-N + 1-layer time warping network, wherein N is the target frame number of time warping, the frame number of the first layer is N, and the frame number of the last layer is N;
the ith layer of the time warping network is provided with n- (i-1) frames, each frame corresponds to a feature vector, and n- (i-1) feature vector groups are formed:
wherein the content of the first and second substances,a Mel frequency cepstrum coefficient feature vector of the ith frame of voice of the ith layer of the network;
in particular, the set of feature vectors of the first layer of the network, i.e. the set of feature vectors of the input network:
to be provided withRepresenting vectorsThe weight of the represented speech frame, when i equals 1, has:
Y3, merging two frames with the nearest distance, and subtracting the frame number by one, namely:
wherein the content of the first and second substances,representing the Mel frequency cepstrum coefficient characteristic vector of the j frame voice of the i +1 th layer of the network;representing the Mel frequency cepstrum coefficient characteristic vector of the j +1 th frame voice of the ith layer of the network;a Mel frequency cepstrum coefficient feature vector representing the kth frame voice of the i +1 layer of the network; to representThe Mel frequency cepstrum coefficient feature vector of the (k + 1) th frame of voice of the ith layer of the network;representing the weight of the jth frame voice of the ith layer of the network;representing the weight of j +1 th frame voice of the ith layer of the network;representing the weight of the j frame voice of the (i + 1) th layer of the network;representing the distance between the Mel frequency cepstrum coefficient feature vectors of the jth frame voice and the jth +1 th frame voice of the ith layer of the network;
Y3 is repeated for i = 1 to i = n − N, the frame number decreasing by one at each step, so that the speech signal of n frames is finally regulated to the fixed N frames.
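A minimal sketch of the Y1–Y3 warping loop, assuming numpy and assuming Euclidean distance between adjacent feature vectors (the patent does not fix the distance measure in this claim):

```python
import numpy as np

def time_warp(features, N):
    """Regulate an (n x M) MFCC matrix to a fixed (N x M) matrix by
    repeatedly merging the two nearest adjacent frames (weighted mean)."""
    vecs = [row.astype(float) for row in features]   # layer 1: n frames
    weights = [1.0] * len(vecs)                      # Y2: all weights start at 1
    while len(vecs) > N:                             # one merge per layer
        # Y3: find the adjacent pair with the smallest feature-vector distance
        d = [np.linalg.norm(vecs[j] - vecs[j + 1]) for j in range(len(vecs) - 1)]
        j = int(np.argmin(d))
        w = weights[j] + weights[j + 1]
        merged = (weights[j] * vecs[j] + weights[j + 1] * vecs[j + 1]) / w
        vecs[j:j + 2] = [merged]                     # frame number drops by one
        weights[j:j + 2] = [w]
    return np.vstack(vecs)
```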
6. The method for distinguishing abnormal speech based on deep learning according to claim 1, wherein a pooling layer is added after the convolution layer H of each convolutional restricted Boltzmann machine to perform a pooling operation, with pooling size E′ × F′ and pooling step s3 × s4, s3 = s4 = 2, so that the pooling layer of one convolutional restricted Boltzmann machine serves as the input layer of the next convolutional restricted Boltzmann machine.
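A small numpy sketch of the pooling step in claim 6, assuming max pooling over non-overlapping windows (the claim fixes the 2 × 2 size and stride but does not name the pooling function):

```python
import numpy as np

def pool_2x2(feature_map):
    """Downsample an (H x W) convolutional feature map with a 2x2
    window and stride s3 = s4 = 2."""
    H, W = feature_map.shape
    H2, W2 = H // 2, W // 2
    trimmed = feature_map[:H2 * 2, :W2 * 2]   # drop odd edge rows/columns
    blocks = trimmed.reshape(H2, 2, W2, 2)    # group into 2x2 windows
    return blocks.max(axis=(1, 3))            # max over each window
```

The (H2 x W2) output then serves as the input layer of the next convolutional restricted Boltzmann machine in the stack.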
7. The method for distinguishing abnormal speech according to claim 1, wherein step S6 comprises the following steps: in the online recognition part, the deep neural network loads the network structure and weight coefficients obtained by training, the hidden Markov model loads the pre-learned model, and the real-time input speech is preprocessed and its features extracted; HMM templates are established for normal speech and abnormal speech respectively, the corresponding HMM template is called during online recognition according to the judgment result output by the convolutional deep belief network, and the Mel frequency cepstrum coefficients of the speech signal are input for template matching to obtain the final speech recognition result.
8. The method for distinguishing abnormal speech based on deep learning according to claim 7, wherein the HMM template of normal speech is obtained by training an HMM with normal speech, and the HMM template of abnormal speech is obtained by training an HMM with abnormal speech.
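A minimal sketch of the claim 7/8 dispatch logic, assuming the third-party hmmlearn package for the HMM templates and a pretrained classifier callable; all names here (classify_cdbn, normal_hmm, abnormal_hmm) are illustrative, not from the patent:

```python
from hmmlearn import hmm  # assumed available: pip install hmmlearn

# Claim 8: one HMM template per class, each trained on its own speech data.
normal_hmm = hmm.GaussianHMM(n_components=5, covariance_type='diag')
abnormal_hmm = hmm.GaussianHMM(n_components=5, covariance_type='diag')
# normal_hmm.fit(normal_mfcc, lengths=normal_lengths)       # offline training
# abnormal_hmm.fit(abnormal_mfcc, lengths=abnormal_lengths)

def recognize(mfcc_matrix, classify_cdbn):
    """Claim 7: route the frame-level MFCCs to the HMM template selected
    by the convolutional deep belief network's normal/abnormal judgment."""
    is_abnormal = classify_cdbn(mfcc_matrix)        # CDBN decision: 0 or 1
    template = abnormal_hmm if is_abnormal else normal_hmm
    return template.score(mfcc_matrix)              # log-likelihood of the match
```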
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810417478.2A CN108766419B (en) | 2018-05-04 | 2018-05-04 | Abnormal voice distinguishing method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108766419A CN108766419A (en) | 2018-11-06 |
CN108766419B true CN108766419B (en) | 2020-10-27 |
Family
ID=64009048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810417478.2A Expired - Fee Related CN108766419B (en) | 2018-05-04 | 2018-05-04 | Abnormal voice distinguishing method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108766419B (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111354373B (en) * | 2018-12-21 | 2023-05-12 | 中国科学院声学研究所 | Audio signal classification method based on neural network intermediate layer characteristic filtering |
CN112955954B (en) | 2018-12-21 | 2024-04-12 | 华为技术有限公司 | Audio processing device and method for audio scene classification |
CN110058689A (en) * | 2019-04-08 | 2019-07-26 | 深圳大学 | A kind of smart machine input method based on face's vibration |
CN110322887B (en) * | 2019-04-28 | 2021-10-15 | 武汉大晟极科技有限公司 | Multi-type audio signal energy feature extraction method |
EP3745412A1 (en) * | 2019-05-28 | 2020-12-02 | Corti ApS | An intelligent computer aided decision support system |
CN110444202B (en) * | 2019-07-04 | 2023-05-26 | 平安科技(深圳)有限公司 | Composite voice recognition method, device, equipment and computer readable storage medium |
CN110390929A (en) * | 2019-08-05 | 2019-10-29 | 中国民航大学 | Chinese and English civil aviation land-air call acoustic model construction method based on CDNN-HMM |
CN110706720B (en) * | 2019-08-16 | 2022-04-22 | 广东省智能制造研究所 | Acoustic anomaly detection method for end-to-end unsupervised deep support network |
CN110600015B (en) * | 2019-09-18 | 2020-12-15 | 北京声智科技有限公司 | Voice dense classification method and related device |
CN110782901B (en) * | 2019-11-05 | 2021-12-24 | 深圳大学 | Method, storage medium and device for identifying voice of network telephone |
CN111044285A (en) * | 2019-11-22 | 2020-04-21 | 军事科学院***工程研究院军用标准研究中心 | Method for diagnosing faults of mechanical equipment under complex conditions |
CN111027675B (en) * | 2019-11-22 | 2023-03-07 | 南京大学 | Automatic adjusting method and system for multimedia playing setting |
CN110931046A (en) * | 2019-11-29 | 2020-03-27 | 福州大学 | Audio high-level semantic feature extraction method and system for overlapped sound event detection |
CN111128227B (en) * | 2019-12-30 | 2022-06-17 | 云知声智能科技股份有限公司 | Sound detection method and device |
CN111724770B (en) * | 2020-05-19 | 2022-04-01 | 中国电子科技网络信息安全有限公司 | Audio keyword identification method for generating confrontation network based on deep convolution |
CN111508501B (en) * | 2020-07-02 | 2020-09-29 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN112750428A (en) * | 2020-12-29 | 2021-05-04 | 平安普惠企业管理有限公司 | Voice interaction method and device and computer equipment |
CN113240083B (en) * | 2021-05-11 | 2024-06-11 | 北京搜狗科技发展有限公司 | Data processing method and device, electronic equipment and readable medium |
CN113361647A (en) * | 2021-07-06 | 2021-09-07 | 青岛洞听智能科技有限公司 | Method for identifying type of missed call |
CN113959071B (en) * | 2021-07-21 | 2023-05-26 | 北京金茂绿建科技有限公司 | Centralized water chilling unit air conditioning system operation control optimization method based on machine learning assistance |
CN113689633B (en) * | 2021-08-26 | 2023-03-17 | 浙江力石科技股份有限公司 | Scenic spot human-computer interaction method, device and system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102129860B (en) * | 2011-04-07 | 2012-07-04 | 南京邮电大学 | Text-related speaker recognition method based on infinite-state hidden Markov model |
CN104157290B (en) * | 2014-08-19 | 2017-10-24 | 大连理工大学 | A kind of method for distinguishing speek person based on deep learning |
CN105206270B (en) * | 2015-08-20 | 2019-04-02 | 长安大学 | A kind of isolated digit speech recognition categorizing system and method combining PCA and RBM |
US10373073B2 (en) * | 2016-01-11 | 2019-08-06 | International Business Machines Corporation | Creating deep learning models using feature augmentation |
CN106941005A (en) * | 2017-02-24 | 2017-07-11 | 华南理工大学 | A kind of vocal cords method for detecting abnormality based on speech acoustics feature |
- 2018-05-04 CN CN201810417478.2A patent/CN108766419B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech emotion recognition method based on long short-term memory network and convolutional neural networks |
CN107464568A (en) * | 2017-09-25 | 2017-12-12 | 四川长虹电器股份有限公司 | Text-independent speaker recognition method and system based on three-dimensional convolutional neural network |
Non-Patent Citations (1)
Title |
---|
A deep architecture for audio-visual voice activity detection in the presence of transients; Ido Ariav et al.; Signal Processing; 2017-07-12; pp. 64-67 *
Also Published As
Publication number | Publication date |
---|---|
CN108766419A (en) | 2018-11-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108766419B (en) | Abnormal voice distinguishing method based on deep learning | |
Sainath et al. | Learning filter banks within a deep neural network framework | |
KR100908121B1 (en) | Speech feature vector conversion method and apparatus | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
Das et al. | Recognition of isolated words using features based on LPC, MFCC, ZCR and STE, with neural network classifiers | |
Bhattacharjee | A comparative study of LPCC and MFCC features for the recognition of Assamese phonemes | |
KR20080078466A (en) | Multi-stage speech recognition apparatus and method | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
WO2023070874A1 (en) | Voiceprint recognition method | |
Pardede et al. | Convolutional neural network and feature transformation for distant speech recognition | |
CN113763965A (en) | Speaker identification method with multiple attention characteristics fused | |
Cai et al. | The DKU system for the speaker recognition task of the 2019 VOiCES from a distance challenge | |
CN111785262A (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Li et al. | A Convolutional Neural Network with Non-Local Module for Speech Enhancement. | |
Матиченко et al. | The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space | |
Aggarwal et al. | Performance evaluation of artificial neural networks for isolated Hindi digit recognition with LPC and MFCC | |
Hizlisoy et al. | Text independent speaker recognition based on MFCC and machine learning | |
Sunny et al. | Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam | |
Raju et al. | AUTOMATIC SPEECH RECOGNITION SYSTEM USING MFCC-BASED LPC APPROACH WITH BACK PROPAGATED ARTIFICIAL NEURAL NETWORKS. | |
Qadir et al. | Isolated spoken word recognition using one-dimensional convolutional neural network | |
CN108573698B (en) | Voice noise reduction method based on gender fusion information | |
Alex et al. | Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition | |
Venkateswarlu et al. | The performance evaluation of speech recognition by comparative approach | |
Srinivasarao | Speech signal analysis and enhancement using combined wavelet Fourier transform with stacked deep learning architecture | |
Nijhawan et al. | A comparative study of two different neural models for speaker recognition systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||
CF01 | Termination of patent right due to non-payment of annual fee || Granted publication date: 20201027 |