CN114242116A - Comprehensive judgment method for voice and non-voice of voice - Google Patents
Comprehensive judgment method for voice and non-voice of voice
- Publication number: CN114242116A
- Application number: CN202210006259.1A
- Authority: CN (China)
- Prior art keywords: voice, speech, frame, voice data, energy
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G10L25/78 — Detection of presence or absence of voice signals
- G10L21/0208 — Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
- G10L25/18 — Speech or voice analysis characterised by the extracted parameters being spectral information of each sub-band
- G10L25/21 — Speech or voice analysis characterised by the extracted parameters being power information
- G10L25/24 — Speech or voice analysis characterised by the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
- G10L25/87 — Detection of discrete points within a voice signal
Abstract
The invention relates to a comprehensive decision method for the voice and non-voice segments of speech, comprising the following steps: framing the input speech data to obtain first framed speech data and second framed speech data; preprocessing the first framed speech data by obtaining each frame of speech data and computing its time-frequency parameters and cepstral coefficients, feeding the preprocessed data into a speech recognition network, and determining the proportion of voice segments within the whole speech segment; when the voice-signal proportion is larger than a preset value, performing speech noise reduction by combining a short-time autocorrelation method with spectral subtraction; detecting speech endpoints by combining the short-time autocorrelation method with an energy-entropy-ratio method, marking the detected voice segments as voice and all other segments as non-voice, and finally outputting the speech data. The invention improves the applicability of the voice decision and widens the range of voice/non-voice decisions that can be made under complex conditions.
Description
Technical Field
The invention relates to the technical field of speech processing, and in particular to a comprehensive decision method for the voice and non-voice segments of speech.
Background
For the voice/non-voice decision, prior-art methods can be roughly divided into three types: threshold-based decision methods, classifier-based decision methods, and model-based decision methods. Threshold-based methods, i.e. voice endpoint detection, extract time-domain and frequency-domain features of the speech, such as short-time energy, short-time zero-crossing rate, and cepstral coefficients, and set a reasonable threshold to separate voice from non-voice. Classifier-based methods treat the voice decision as a voice/non-voice classification problem and train a classifier using neural-network or machine-learning methods. Model-based methods use a complete acoustic model and make the decision through decoding, with the help of global information.
However, the existing voice/non-voice decision methods assume that the segment to be judged contains a single type of noise with an unchanging signal-to-noise ratio. To achieve a good noise-reduction effect, the initial frames of the segment are assumed to be non-voice (i.e. noise) frames during denoising, and these initial non-voice frames are taken as the background noise for noise reduction and voice decision.

The existing classifier- and model-based decision methods must judge every frame of the signal as voice or non-voice, then apply additional methods to remove the resulting decision bias, and they require large amounts of diverse training data to train the network or build the model before the voice signal can be judged accurately, so the preparatory work is complex.

In short, the prior art assumes idealized conditions for the speech to be judged: it cannot adaptively determine the noise frames when the initial segment is voice, or when the background noise is complex and multiple signal-to-noise ratios coexist, and it needs speech data to train a network or build a model before making the decision.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides a comprehensive decision method for the voice and non-voice segments of speech, and solves the problems described above.

The purpose of the invention is achieved by the following technical scheme: a comprehensive decision method for the voice and non-voice segments of speech, the method comprising:
framing the input speech data to obtain first framed speech data and second framed speech data;

the processing of the first framed speech data comprises:

preprocessing the first framed speech data by obtaining each frame of speech data and computing its time-frequency parameters and cepstral coefficients, feeding the preprocessed data into a speech recognition network, and determining the proportion of voice segments within the whole speech segment;

when the voice-signal proportion is larger than a preset value, performing speech noise reduction by combining a short-time autocorrelation method with spectral subtraction;

detecting speech endpoints by combining the short-time autocorrelation method with an energy-entropy-ratio method, marking the detected voice segments as voice and all other segments as non-voice, and finally outputting the speech data;

the processing of the second framed speech data comprises:

performing speech noise reduction on the second framed speech data by combining the short-time autocorrelation method with spectral subtraction;

and detecting speech endpoints by combining the short-time autocorrelation method with the energy-entropy-ratio method, marking the detected voice segments as voice and all other segments as non-voice, and finally outputting the speech data.
Preprocessing the first framed speech data by computing the time-frequency parameters and cepstral coefficients of each frame comprises:

computing the time-frequency parameters F(f, t) of the first framed speech data via the short-time Fourier transform, where F(f, t) represents the relative energy of the speech signal at time t and frequency f;

performing MFCC feature extraction on each frame of speech data to obtain the MFCC values, first-order MFCC differences, and second-order MFCC differences of each frame;

pre-emphasizing the speech signal, windowing the pre-emphasized signal, and converting the windowed signal to the frequency domain to obtain the frequency-domain representation of the speech signal;

computing the energy spectrum obtained by passing each frame's spectral-line energy through the Mel filter bank, and taking the logarithm of the filter-bank output;

applying the discrete cosine transform to the log Mel-filter-bank energies to obtain the MFCC features, and applying a first-order difference to the MFCC features to obtain the first-order MFCC features;

and applying the difference operation to the first-order MFCC features to obtain the second-order MFCC features.
The voice noise reduction processing comprises:
performing short-time autocorrelation processing on each frame of speech data x_n to obtain the autocorrelation value R_n of the current frame;

taking each frame's autocorrelation value as a new autocorrelation sequence and smoothing it with a mean filter of set window length and window shift to obtain the filtered autocorrelation sequence R'_n;

taking the mean of the filtered autocorrelation sequence as the threshold η; frame segments whose autocorrelation value is less than or equal to η are treated as non-voice segments, and frame segments above η as voice segments;

using the determined non-voice and voice segments as input, denoising the original speech data x_n by spectral subtraction to obtain the noise-reduced speech data x'_n.

Using the determined non-voice and voice segments as input and denoising the original speech data x_n by spectral subtraction to obtain the noise-reduced speech data x'_n comprises:

performing a fast Fourier transform on each original frame of the speech signal x_n to obtain the transformed speech signal X_n(k);

from the amplitude |X_n(k)| and phase angle of X_n(k) and the frame count NIS of the non-voice segment, computing the average power spectrum D(k) of the non-voice segment;

computing the average value Y_n(k) of the fast-Fourier-transformed speech signal X_n(k), and obtaining the spectrally subtracted amplitude through the spectral-subtraction formula;

and from the spectrally subtracted amplitude and the phase angle, obtaining the noise-reduced speech data x'_n via the inverse fast Fourier transform.
Detecting the speech endpoints by combining the short-time autocorrelation method with the energy-entropy-ratio method comprises the following steps:

computing the short-time energy to obtain the energy E_n of each frame signal x_n, and computing the fast-Fourier-transformed value X'_n of each noise-reduced frame signal x'_n;

computing the short-time energy E'_n of each noise-reduced speech frame in the frequency domain, and the energy spectrum S_n(k) of the k-th spectral line;

computing the normalized spectral probability density function p_n(k) of each frequency component of each frame, the spectral entropy H_n of each frame, and the energy-entropy ratio Ef_n of each frame signal;

computing the energy-entropy ratio of the non-voice-segment signal by replacing each noise-reduced frame signal x'_n with the noise-reduced non-voice frames, obtaining the energy-entropy ratio Ef'_n of the noise-reduced non-voice frames;

setting decision thresholds T_1 and T_2, and computing the intersection points N_2, N_3 of the energy-entropy ratio Ef_n of the speech signal with the threshold T_2; the start and end points of the voice segment lie outside the time interval of N_2, N_3;

and searching leftward from the start point N_2 and rightward from the end point N_3 to find the intersection points N_1, N_4 of the energy-entropy ratio Ef_n with the threshold T_1; N_1 is the start point of the voice segment and N_4 is its end point.
The speech recognition network comprises three convolutional layers, three pooling layers, and three fully connected layers. First convolutional layer: kernel size 3 × 3, 32 kernels in total, kernel stride 1; during convolution, regions where the boundary is insufficient are zero-padded. First pooling layer: 2 × 2 max pooling, with zero-padding where the boundary is insufficient. Second convolutional layer: kernel size 3 × 3, 64 kernels in total; all other settings are the same as the first convolutional layer. The second pooling layer is configured the same as the first. Third convolutional layer: kernel size 3 × 3, 1024 kernels in total; all other settings are the same as the first convolutional layer. The third pooling layer is configured the same as the first. The outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, the number of required classes. After each convolution, the convolved values are activated with the ReLU activation function. During training, the network parameters are updated with the Adam stochastic gradient descent method.
The first framed speech data comprises speech data framed with a frame length of 1 s and an overlap of 0.7 s between adjacent frames; the second framed speech data comprises speech data framed with a frame length of 0.025 s and an overlap of 0.01 s.
The invention has the following advantages: the comprehensive decision method first uses a neural network to identify the speech segments that contain voice, then uses spectral subtraction combined with autocorrelation to find the non-voice segments and perform noise reduction, and finally applies the energy-entropy-ratio method to the noise-reduced speech to judge the voice segments more accurately, thereby improving the applicability of the voice decision.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic flow chart showing the details of the method of the present invention;
FIG. 3 is a schematic diagram of an MFCC feature extraction process;
FIG. 4 is a schematic diagram of recognition network data preprocessing;
FIG. 5 is a schematic diagram of a convolutional neural recognition network structure;
FIG. 6 is a schematic diagram of a training and usage flow of a recognition network;
FIG. 7 is a schematic diagram of a speech noise reduction process;
FIG. 8 is a schematic view of an endpoint detection process;
FIG. 9 is a diagram illustrating the recognition results of the recognition network on the test set;
FIG. 10 is a schematic diagram of the effect of a speech denoising method based on the combination of short-term autocorrelation and spectral subtraction;
FIG. 11 is a schematic diagram of endpoint detection based on a combination of short-term autocorrelation and energy-to-entropy ratio.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1 and fig. 2, to address the situations that the prior art cannot handle, namely that the initial segment of the speech may be voice, that the background noise may be complex with multiple signal-to-noise ratios coexisting, and that neural networks and models require a large amount of data for training and construction, the present invention provides a comprehensive decision method for the voice and non-voice segments of speech, which specifically comprises the following steps.
Step 1: the speech signal data that has undergone AD conversion (analog-to-digital conversion), with sampling rate fs Hz, is used as the input speech data, and the input is framed in two ways. The first framing method processes the speech data with a frame length of 1 s (fs samples) and an overlap of 0.7 s (0.7 × fs samples) between adjacent frames. The second framing method uses a frame length of 0.025 s (0.025 × fs samples) and an overlap of 0.01 s (0.01 × fs samples).
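The two framing schemes above can be sketched as follows (a minimal illustration; the sampling rate fs = 8000 Hz and the random placeholder signal are assumptions for the example, not values fixed by the patent):

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames; a trailing partial
    frame that would run past the end of the signal is dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

fs = 8000                                  # assumed sampling rate in Hz
x = np.random.randn(10 * fs)               # 10 s of placeholder audio

# First framing scheme: 1 s frames, 0.7 s overlap -> 0.3 s hop
frames_a = frame_signal(x, frame_len=fs, hop_len=int(0.3 * fs))
# Second framing scheme: 0.025 s frames, 0.01 s overlap -> 0.015 s hop
frames_b = frame_signal(x, frame_len=int(0.025 * fs), hop_len=int(0.015 * fs))

print(frames_a.shape)   # (31, 8000)
print(frames_b.shape)   # (666, 200)
```

Note how the 10 s segment used for statistics in step 4 yields 31 long frames under the first scheme but 666 short frames under the second, which is why the two framings serve different stages of the method.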
Step 2: the framed speech data with frame length fs and frame overlap 0.7 × fs is preprocessed, and time-frequency conversion and MFCC (Mel-frequency cepstral coefficient) extraction are performed on each frame of speech data.
Further, as shown in fig. 3, the method specifically includes:
Step 2.1: obtain the time-frequency parameters F(f, t) of the data from the speech data via the STFT, representing the relative energy of the speech at time t and frequency f, where x_n is the speech signal and ω(n) is the Hamming window function. In formula (2-2), n denotes the n-th point of the Hamming window, N denotes the window length, and the selected length is N = 256.

F(f, t) = STFT(x_n, ω(n))  (2-1)

ω(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1  (2-2)
Step 2.2: acquire the MFCC features of each frame of speech data, mainly the MFCC values, the first-order MFCC differences, and the second-order MFCC differences of each frame.
Step 2.2.1: pre-emphasis of the speech signal, where x_n(i) is the amplitude of the i-th speech sample.

x'_n(i) = x_n(i) − 0.97 × x_n(i − 1)  (2-3)
Step 2.2.2: window the pre-emphasized signal and convert the windowed signal to the frequency domain. A Hamming window of length 256 is selected as the window function ω(l), and N is the total frame length. This yields the frequency-domain representation of the speech signal, where X_n(k) is the amplitude spectrum of the speech signal at the k-th spectral line and P_n(k) is the power spectrum at the k-th spectral line.

Step 2.2.3: compute the energy spectrum obtained by passing each frame's spectral-line energy through the Mel filter bank; the number of filters is 22. In formula (2-6), M_i(k) denotes the i-th filter, k is the k-th spectral line input to the filter, and f(i) is the center frequency of the i-th filter. In formula (2-7), f_l is the lowest frequency of the filter, f_h is the highest frequency, and f_s is the sampling frequency.
Step 2.2.4: take the logarithm of the Mel-filter-bank energies and apply the discrete cosine transform to obtain the MFCC features.

Step 2.2.5: apply a first-order difference to the MFCC features to obtain the first-order MFCC features; in the difference formula, d(j) denotes the j-th first-order difference, c(j + l) denotes the (j + l)-th cepstral coefficient, and z denotes the difference-frame interval.

Step 2.2.6: apply the same difference operation as formula (2-9) to the first-order MFCC features to obtain the second-order MFCC features.
After preprocessing, the time-frequency parameters of the speech signal have size 129 × 64 and the MFCC feature parameters have size 99 × 64; the two parts are concatenated as matrices, with the concatenation method shown in fig. 4, and the final preprocessed output data has size 228 × 64.
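The MFCC chain of steps 2.2.1–2.2.6 can be sketched as follows (a simplified NumPy-only illustration: the FFT size of 256 and filter count of 22 follow the text, while the 13 cepstral coefficients, the standard mel-scale formulas, and the edge-padded delta are common conventions assumed here, not values taken from the patent):

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular Mel filters in the spirit of formulas (2-6)/(2-7)."""
    pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc(frame, fs, n_filters=22, n_ceps=13):
    emph = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis (2-3)
    win = emph * np.hamming(len(emph))                          # Hamming window
    power = np.abs(np.fft.rfft(win, 256)) ** 2                  # power spectrum
    mel_energy = mel_filterbank(n_filters, 256, fs) @ power     # Mel filter bank
    log_e = np.log(mel_energy + 1e-10)                          # take logarithm
    M = n_filters                                               # DCT-II basis
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * np.arange(M) + 1) / (2 * M))
    return dct @ log_e                                          # MFCC features

def delta(ceps_seq, z=1):
    """First-order difference across frames: d(j) = c(j+z) - c(j-z)."""
    padded = np.pad(ceps_seq, ((z, z), (0, 0)), mode='edge')
    return padded[2 * z:] - padded[:-2 * z]

fs = 8000
frames = np.random.randn(5, 256)                   # 5 placeholder frames
ceps = np.stack([mfcc(f, fs) for f in frames])     # per-frame MFCCs
d1, d2 = delta(ceps), delta(delta(ceps))           # first- and second-order deltas
print(ceps.shape, d1.shape, d2.shape)              # (5, 13) (5, 13) (5, 13)
```

In the patent's pipeline the MFCCs and their two delta orders would then be stacked alongside the STFT parameters to form the 228 × 64 network input.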
Step 3: as shown in fig. 5 and fig. 6, each frame of preprocessed speech data is used as input to the CNN, and the label of each frame of input data is voice or non-voice. The network uses three convolutional layers, three pooling layers, and three fully connected layers. First convolutional layer: kernel size 3 × 3, 32 kernels in total, kernel stride 1; during convolution, regions where the boundary is insufficient are zero-padded. First pooling layer: 2 × 2 max pooling, with zero-padding where the boundary is insufficient. Second convolutional layer: kernel size 3 × 3, 64 kernels in total; all other settings are the same as the first convolutional layer. The second pooling layer is configured the same as the first. Third convolutional layer: kernel size 3 × 3, 1024 kernels in total; all other settings are the same as the first convolutional layer. The third pooling layer is configured the same as the first. The outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, the number of required classes. After each convolution, the convolved values are activated with the ReLU activation function. During training, the network parameters are updated with the Adam stochastic gradient descent method.
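Assuming that "filling the insufficient boundary with 0" implies same-padded convolutions and ceil-mode pooling, the feature-map sizes through the three conv/pool stages can be traced as follows (a sketch; the single input channel and flatten-then-FC layout are assumptions, not stated in the patent):

```python
import math

def conv_same(h, w, c_out):
    # 3x3 kernel, stride 1, zero-padding -> spatial size unchanged ("same")
    return h, w, c_out

def maxpool2(h, w, c):
    # 2x2 max pooling; insufficient borders are zero-padded -> ceil division
    return math.ceil(h / 2), math.ceil(w / 2), c

h, w, c = 228, 64, 1          # preprocessed input: 228 x 64, one channel assumed
for c_out in (32, 64, 1024):  # the three conv/pool stages
    h, w, c = conv_same(h, w, c_out)
    h, w, c = maxpool2(h, w, c)
    print((h, w, c))          # (114, 32, 32) -> (57, 16, 64) -> (29, 8, 1024)

fc = [h * w * c, 1024, 1024, 2]   # flatten -> FC(1024) -> FC(1024) -> FC(2)
print(fc)                         # [237568, 1024, 1024, 2]
```

Tracing shapes like this is a quick sanity check that the stated layer configuration is internally consistent before building the network in a deep-learning framework.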
Step 4: after the speech data is judged by the neural network, the proportion of the segments identified as voice within the whole speech segment is counted. The length of each counted speech segment is 10 s.

Step 5: when the proportion of voice signals in the speech segment is less than 5%, the segment is judged to be a non-voice segment and is marked as non-voice.
Step 6: speech noise reduction is performed by combining the short-time autocorrelation method with spectral subtraction.
As shown in fig. 7, further, step 6 specifically includes the following:
Step 6.1: perform short-time autocorrelation processing on each frame of speech data x_n to obtain the autocorrelation value R_n of the current frame, where k denotes the k-th frame of speech data and N is the number of samples in the frame.
Step 6.2: taking the obtained autocorrelation value of each frame as a new autocorrelation sequence R_n, perform smoothing with a mean filter of window length 10 and window shift 1 to obtain the filtered autocorrelation sequence R'_n, where n is the n-th sampled value and K is the total number of samples.

R'_n = mean(R_n + … + R_(n+9)), 1 ≤ n ≤ K − 9  (6-2)
Step 6.3: take the mean of the filtered sequence as the threshold η; frame segments whose autocorrelation value is less than or equal to η are treated as non-voice segments, and frame segments above η as voice segments. The non-voice segment g_n determined by autocorrelation is then used as input, and the original speech data x_n is denoised by spectral subtraction to obtain the noise-reduced speech data x'_n.
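Steps 6.1–6.3 can be sketched as follows (an illustrative NumPy version; the lag-1 autocorrelation and the synthetic test signal are assumptions made for the example, not choices stated in the patent):

```python
import numpy as np

def frame_autocorr(frame, lag=1):
    """Short-time autocorrelation of one frame at a single lag
    (the patent's R_n; the lag choice here is illustrative)."""
    n = len(frame)
    return float(np.dot(frame[:n - lag], frame[lag:]))

def smooth(r, win=10):
    """Mean filter, window length 10 and shift 1, as in formula (6-2)."""
    return np.array([r[i:i + win].mean() for i in range(len(r) - win + 1)])

rng = np.random.default_rng(0)
# Placeholder frames: low-level noise with a high-amplitude middle section
frames = [rng.normal(0, 0.1, 200) for _ in range(30)]
for i in range(10, 20):
    frames[i] = frames[i] + np.sin(0.3 * np.arange(200))  # "voiced" frames

r = np.array([frame_autocorr(f) for f in frames])
r_s = smooth(r)                        # K = 30 frames -> K - 9 = 21 values
eta = r_s.mean()                       # threshold = mean of smoothed sequence
is_voice = r_s > eta                   # frames above eta -> voice segment
print(is_voice.astype(int))
```

White noise is nearly uncorrelated at nonzero lags while periodic voiced content is strongly self-similar, which is why the autocorrelation value separates the two classes here.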
Step 6.4: perform a fast Fourier transform (FFT) on each original frame of the speech signal x_n, where N is the total frame length and X_n(k) is the spectral value of the k-th spectral line of the n-th frame. The amplitude of X_n(k) is |X_n(k)| and its phase angle is also obtained.

Step 6.5: the number of non-voice frames is NIS. Apply the FFT to the non-voice segment g_n to obtain d_n(k), the spectral value of the k-th spectral line of the n-th non-voice frame, and compute the average power spectrum D(k) of the non-voice segment.

Step 6.6: compute the average value Y_n(k) of the FFT-transformed X_n(k).

Step 6.7: obtain the spectrally subtracted amplitude through the spectral-subtraction formula, where a = 4 is the over-subtraction factor and b = 0.001 is the gain compensation factor.

Step 6.8: from the spectrally subtracted amplitude and the original phase angle, obtain the noise-reduced speech x'_n via the inverse fast Fourier transform (IFFT).
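Steps 6.4–6.8 can be sketched as follows (a minimal spectral-subtraction implementation under assumptions: the exact placement of the over-subtraction factor a = 4 and gain-compensation factor b = 0.001 in the patent's formula is not reproduced verbatim, so a standard power-domain form is used):

```python
import numpy as np

def spectral_subtract(frames, nonspeech_mask, a=4.0, b=0.001):
    """Basic over-subtraction spectral subtraction, a sketch of
    steps 6.4-6.8 with an assumed power-domain formula."""
    spec = np.fft.fft(frames, axis=1)            # X_n(k), per frame
    mag, phase = np.abs(spec), np.angle(spec)
    D = (mag[nonspeech_mask] ** 2).mean(axis=0)  # avg noise power spectrum D(k)
    power = mag ** 2 - a * D                     # over-subtraction
    power = np.where(power > 0, power, b * D)    # floor with gain compensation
    # Recombine subtracted amplitude with the original phase, then IFFT
    clean = np.fft.ifft(np.sqrt(power) * np.exp(1j * phase), axis=1).real
    return clean

rng = np.random.default_rng(1)
n_frames, n = 8, 256
noise = rng.normal(0, 0.05, (n_frames, n))
frames = noise.copy()
frames[4:] += np.sin(2 * np.pi * 0.05 * np.arange(n))   # speech in later frames
mask = np.zeros(n_frames, bool); mask[:4] = True        # first 4 frames = noise
out = spectral_subtract(frames, mask)
print(out.shape)                                        # (8, 256)
```

Reusing the original phase angle is the standard choice here, matching step 6.8: only the amplitude spectrum is modified by the subtraction.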
Step 7: speech endpoint detection uses the short-time autocorrelation method together with the energy-entropy-ratio method. The autocorrelation stage is processed in the same way as in the noise-reduction stage; see formulas (6-1) and (6-2). The voice and non-voice segments of the noise-reduced speech are obtained by the autocorrelation method, and these segments are then used as input to determine the start and end positions of the voice and non-voice segments by the energy-entropy-ratio method.
As shown in fig. 8, step 7 specifically includes the following:
Step 7.1 the energy-entropy ratio is the ratio of each frame signal's energy to its spectral entropy; the energy En of each frame signal xn is obtained by calculating the short-time energy, where N is the number of sampling points of each frame signal.
The spectral entropy of the speech signal is obtained as follows.
Step 7.2 for each noise-reduced frame signal x'n, the FFT-transformed value X'n is calculated, where k denotes the kth spectral line.
Step 7.3 the short-time energy E'n of each noise-reduced speech frame is calculated in the frequency domain, where N is the FFT length, only the positive-frequency part is taken, and X*n(k) is the conjugate of Xn(k).
Step 7.4 the energy spectrum Sn(k) of the kth spectral line is calculated.
Step 7.5 the normalized spectral probability density function pn(k) of each frequency component of each frame is calculated.
Step 7.6 the spectral entropy Hn of each frame is calculated.
Step 7.7 the energy-entropy ratio Efn of each frame signal is calculated.
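Steps 7.1-7.7 can be sketched as below. Since formulas (7-1) to (7-7) are not reproduced in this text, the common definitions are assumed: Sn(k) as the squared magnitude over the positive frequencies, pn(k) as its normalization, Hn as the Shannon entropy of pn(k), and Efn = sqrt(1 + |En/Hn|) as one widely used form of the energy-entropy ratio; all names are illustrative.

```python
import numpy as np

def energy_entropy_ratio(frames):
    """Per-frame energy-to-spectral-entropy ratio Efn (sketch of steps 7.1-7.7,
    under the assumed standard definitions stated above)."""
    X = np.fft.fft(frames, axis=1)
    half = frames.shape[1] // 2                      # keep the positive frequencies
    S = np.abs(X[:, :half]) ** 2                     # energy spectrum Sn(k)
    E = S.sum(axis=1)                                # short-time energy En
    p = S / (S.sum(axis=1, keepdims=True) + 1e-12)   # normalized density pn(k)
    H = -np.sum(p * np.log(p + 1e-12), axis=1)       # spectral entropy Hn
    return np.sqrt(1.0 + np.abs(E / (H + 1e-12)))    # energy-entropy ratio Efn
```

A tonal frame concentrates its energy in few spectral lines (high energy, low entropy) and so produces a much larger Efn than a low-level noise frame, which is what the endpoint decision exploits.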
Step 7.8 the energy-entropy ratio Ef'n of the non-speech segment signal is calculated. The calculation process is the same as in formulas (7-2) to (7-7), with each noise-reduced frame signal x'n replaced by the noise-reduced non-speech frames, yielding the energy-entropy ratio Ef'n of the noise-reduced non-speech segment.
Step 7.9 decision thresholds T1 and T2 are set, where Me is the maximum of the energy-entropy ratio over all frame signals and δ is the adaptive parameter of the decision thresholds.
Me=max(Efn) (7-8)
δ=Me-mean(Ef′n) (7-9)
T1=0.05×δ+mean(Ef′n) (7-10)
T2=0.1×δ+mean(Ef′n) (7-11)
Step 7.10 for the initial threshold decision, the intersections N2 and N3 of the speech signal's energy-entropy ratio Efn with the threshold T2 are calculated; the start and end points of the speech segment lie outside the interval [N2, N3].
Step 7.11 starting from the initial decision points, a search is made leftward from N2 and rightward from N3 to find the intersections N1 and N4 of the speech signal's energy-entropy ratio Efn with the threshold T1; N1 is the starting point of the speech segment and N4 is its end point.
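Steps 7.8-7.11 combine the threshold formulas (7-8) to (7-11) with a double-threshold search; a sketch follows. The function name is illustrative, and the simple frame-by-frame expansion loop assumes a single speech segment whose Efn exceeds T2 somewhere.

```python
import numpy as np

def detect_endpoints(ef_speech, ef_nonspeech):
    """Double-threshold endpoint decision (sketch of steps 7.8-7.11).
    ef_speech: Efn of every denoised frame; ef_nonspeech: Ef'n of the
    frames pre-classified as non-speech by the autocorrelation stage."""
    Me = ef_speech.max()                          # (7-8)
    delta = Me - ef_nonspeech.mean()              # (7-9)
    T1 = 0.05 * delta + ef_nonspeech.mean()       # (7-10)
    T2 = 0.10 * delta + ef_nonspeech.mean()       # (7-11)
    # Initial decision with T2 (assumes at least one frame exceeds T2).
    above = np.flatnonzero(ef_speech > T2)
    N2, N3 = above[0], above[-1]
    # Expand leftward from N2 and rightward from N3 down to threshold T1.
    N1 = N2
    while N1 > 0 and ef_speech[N1 - 1] > T1:
        N1 -= 1
    N4 = N3
    while N4 < len(ef_speech) - 1 and ef_speech[N4 + 1] > T1:
        N4 += 1
    return N1, N4                                 # speech start and end frames
```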
Step 8: the speech data after endpoint detection are marked: the detected speech segments are marked as speech, and the remaining segments are marked as non-speech.
Step 9: the speech data are output: the segments marked as speech are spliced in time order into a new speech segment, which is stored in wav file format at a sampling rate of fs Hz.
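Steps 8 and 9 amount to masking and splicing. A minimal sketch using the standard-library wave module is shown below; the function name, the per-sample boolean mask, and the 16-bit PCM encoding are illustrative assumptions.

```python
import wave
import numpy as np

def save_speech_segments(samples, speech_mask, fs, path):
    """Splice the samples marked as speech, in time order, and store them as a
    16-bit mono wav at fs Hz (sketch of steps 8-9; names are illustrative)."""
    spliced = samples[speech_mask]               # keep speech samples, in order
    pcm = np.clip(spliced, -1.0, 1.0)
    pcm = (pcm * 32767).astype(np.int16)         # float [-1, 1] -> 16-bit PCM
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)                       # mono
        wf.setsampwidth(2)                       # 2 bytes per sample
        wf.setframerate(fs)                      # fs Hz sampling rate
        wf.writeframes(pcm.tobytes())
    return spliced
```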
In view of the fact that, in actual speech signals, non-speech occupies most of the time and the type and energy of the non-speech signal are complicated and changeable, the invention designs a convolutional neural network that recognizes the speech signal by combining its time-frequency parameters with its cepstrum characteristic parameters. The network takes the normalized time-frequency parameters and cepstrum characteristic parameter information as input and makes its decision through three convolutional layers, three pooling layers and three fully connected layers, quickly producing a rough judgment on whether the current speech segment contains speech information. If the segment contains speech, subsequent endpoint detection is performed; if not, no further processing is applied, which speeds up the speech decision and reduces the decision time.
The recognition results of the network are shown in fig. 9: the neural network combining the time-frequency parameters and cepstrum characteristic parameters of the speech signal recognizes the speech-segment and non-speech-segment signals more accurately.
In view of the fact that, in practice, the initial frames of a speech segment to be denoised are not necessarily non-speech frames, the invention designs a new speech denoising method that combines the advantages of short-time autocorrelation and spectral subtraction. The short-time autocorrelation of each speech frame is first calculated, and the non-speech segments are screened out against a threshold. The screened non-speech segments are then used as the noise estimate in spectral subtraction to denoise the original speech signal. By adaptively determining the non-speech segments, the method removes the need to determine them manually, as existing methods require, and improves both the intelligence and the effect of speech denoising.
As shown in fig. 10, even when the initial speech is a speech segment, the method adaptively determines the non-speech segments of the speech signal, so the signal can be denoised accurately and the denoising effect is excellent.
In view of the fact that, in practice, the initial frames of a speech segment requiring endpoint detection are not necessarily non-speech frames, the invention designs a new voice endpoint detection method that combines the advantages of short-time autocorrelation and the energy-entropy ratio method. The short-time autocorrelation of each speech frame is first calculated, and the non-speech segments are roughly screened out against a threshold. The ratio of the energy of the screened non-speech segments to their spectral entropy is then calculated to determine the thresholds for subsequent detection. Finally, the energy-entropy ratio of the speech signal is calculated frame by frame, and endpoint detection is performed against the determined thresholds. By adaptively determining the non-speech segments, the method overcomes the need to determine them manually, as existing methods require, and improves both the intelligence and the accuracy of voice endpoint detection.
As shown in fig. 11, even when the initial speech is a speech segment, the endpoint detection method provided by the invention adaptively determines the non-speech segments of the speech signal, performs the decision and endpoint detection on the signal, and the decision result is relatively accurate.
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise form disclosed herein and that various other combinations, modifications, and environments may be resorted to, falling within the scope of the concept as disclosed herein, either as described above or as apparent to those skilled in the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
1. A comprehensive speech and non-speech decision method, characterized in that the method comprises the following steps:
performing framing processing on input voice data to obtain first framing voice data and second framing voice data;
the processing method of the first frame voice data comprises the following steps:
preprocessing the first framed voice data, acquiring each frame of voice data, performing time-frequency conversion and cepstrum coefficient extraction, inputting the preprocessed data into a voice recognition network, and judging the proportion of speech segments in the whole voice segment;
when the voice signal proportion is larger than a preset value, carrying out voice noise reduction processing by combining a short-time autocorrelation method and a spectral subtraction method;
detecting the voice endpoint by combining the short-time autocorrelation method with the energy-entropy ratio method, marking the speech segments in the detected voice data as voice and the other segments as non-voice, and finally outputting the voice data;
the processing method of the second sub-frame voice data comprises the following steps:
performing voice noise reduction processing on the second sub-frame voice data by combining a short-time autocorrelation method and a spectral subtraction method;
and detecting the voice endpoint by combining the short-time autocorrelation method with the energy-entropy ratio method, marking the speech segments in the detected voice data as voice and the other segments as non-voice, and finally outputting the voice data.
2. A method for integrated decision of speech and non-speech according to claim 1, characterized by: preprocessing the first framed voice data and acquiring each frame of voice data for time-frequency conversion and cepstrum coefficient extraction comprises:
the time-frequency parameter F(f, t) of the voice data, obtained from the first framed voice data through a short-time Fourier transform, represents the relative energy value of the speech signal at time t and frequency f;
performing MFCC feature extraction on each frame of voice data to obtain an MFCC value, a first-order MFCC difference and a second-order MFCC difference of each frame of voice data;
carrying out pre-emphasis processing on a voice signal, windowing the pre-emphasized signal and carrying out frequency domain conversion on the windowed signal to obtain the representation of the voice signal on a frequency domain;
calculating an energy spectrum of each frame of spectral line energy after passing through a Mel filter bank, and carrying out logarithm taking processing on the energy spectrum after passing through the Mel filter bank;
taking logarithm of the energy passing through the Mel filter bank, performing discrete cosine transform to obtain MFCC characteristics, and performing first-order difference processing on the MFCC characteristics to obtain first-order MFCC characteristics;
and performing differential operation on the first-order MFCC features to obtain second-order MFCC features.
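The first- and second-order MFCC differences of claim 2 can be sketched as below; the simple adjacent-frame difference (with the first frame repeated so the shape is preserved) is an assumption, since regression-based delta windows are also common, and the function name is illustrative.

```python
import numpy as np

def mfcc_deltas(mfcc):
    """First- and second-order MFCC differences (sketch of the claim-2 steps).
    mfcc: 2-D array, one frame per row, one cepstral coefficient per column."""
    # First-order delta: adjacent-frame difference, first frame repeated.
    d1 = np.diff(mfcc, n=1, axis=0, prepend=mfcc[:1])
    # Second-order delta: the same difference applied to the first-order delta.
    d2 = np.diff(d1, n=1, axis=0, prepend=d1[:1])
    return d1, d2
```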
3. A method for integrated decision of speech and non-speech according to claim 1, characterized by: the voice noise reduction processing comprises:
for each frame of voice data xn, performing short-time autocorrelation processing to obtain the autocorrelation value Rn of the current frame;
taking each frame's autocorrelation value as a new autocorrelation sequence, and performing smoothing with a mean filter of set window length and window shift to obtain a filtered autocorrelation value sequence R'n;
taking the average of the autocorrelation value sequence as the threshold η, a frame whose autocorrelation value is less than or equal to the threshold η being taken as a non-speech segment and a frame greater than the threshold η as a speech segment;
using the determined non-speech segment and speech segment as input, denoising the original speech data xn by spectral subtraction to obtain the denoised voice data x'n.
4. A method for integrated decision of speech and non-speech according to claim 3, characterized by: using the determined non-speech segment and speech segment as input and denoising the original speech data xn by spectral subtraction to obtain the denoised voice data x'n comprises:
performing a fast Fourier transform on each original frame of the speech signal xn to obtain the transformed speech signal Xn(k);
calculating the average power spectrum value D(k) of the non-speech segment from the magnitude |Xn(k)| and phase angle ∠Xn(k) of Xn(k) and the non-speech segment gn;
calculating the average value Yn(k) of the fast-Fourier-transformed speech signal Xn(k), and obtaining the spectrally subtracted magnitude through the spectral subtraction formula.
5. A method for integrated decision of speech and non-speech according to claim 1, characterized by: detecting the voice endpoint by combining the short-time autocorrelation method with the energy-entropy ratio method comprises:
calculating the short-time energy to obtain the energy En of each frame signal xn, and calculating the fast-Fourier-transformed value X'n of each noise-reduced frame signal x'n;
calculating in the frequency domain the short-time energy E'n of each noise-reduced speech frame and the energy spectrum Sn(k) of the kth spectral line;
calculating the normalized spectral probability density function pn(k) of each frequency component of each frame, the spectral entropy Hn of each frame, and the energy-entropy ratio Efn of each frame signal;
calculating the energy-entropy ratio Ef'n of the non-speech segment signal by replacing each noise-reduced frame signal x'n with the noise-reduced non-speech frames, obtaining the energy-entropy ratio Ef'n of the noise-reduced non-speech segment;
setting decision thresholds T1 and T2, and calculating the intersections N2 and N3 of the speech signal's energy-entropy ratio Efn with the threshold T2, the start and end points of the speech segment lying outside the interval [N2, N3];
searching leftward from the starting point N2 and rightward from the end point N3 to find the intersections N1 and N4 of the energy-entropy ratio Efn with the threshold T1, N1 being the starting point of the speech segment and N4 its end point.
6. A method for integrated decision of speech and non-speech according to claim 1, characterized by: the voice recognition network comprises three convolutional layers, three pooling layers and three fully connected layers; first convolutional layer: 3×3 convolution kernels, 32 kernels in total, with a stride of 1, boundary regions of insufficient size being zero-padded during convolution; first pooling layer: 2×2 max pooling, boundary regions of insufficient size being zero-padded; second convolutional layer: 3×3 convolution kernels, 64 kernels in total, the remaining settings being the same as the first convolutional layer; the second pooling layer is configured the same as the first pooling layer; third convolutional layer: 3×3 convolution kernels, 1024 kernels in total, the remaining settings being the same as the first convolutional layer; the third pooling layer is configured the same as the first pooling layer; the outputs of the first and second fully connected layers are 1024, and the output of the third fully connected layer is 2, representing the number of required classes; after each convolution, the convolved values are activated with the ReLU activation function; during training, the parameters of the network are updated by the Adam stochastic gradient descent method.
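The layer geometry of the claim-6 network can be checked with a small shape-propagation sketch (pure arithmetic, no weights). The 32×32 input size is an illustrative assumption; the claim does not fix the input dimensions.

```python
import math

def cnn_output_shape(h, w):
    """Propagate an h x w input through the claim-6 stack: three 3x3 'same'
    convolutions (32, 64, 1024 kernels, stride 1), each followed by 2x2 max
    pooling, then flatten for the 1024/1024/2 fully connected layers.
    A sketch of the layer geometry only."""
    channels = 1
    for n_kernels in (32, 64, 1024):
        # 3x3 convolution, stride 1, zero-padded: spatial size is unchanged.
        channels = n_kernels
        # 2x2 max pooling, zero-padded on odd boundaries: size halves (ceil).
        h, w = math.ceil(h / 2), math.ceil(w / 2)
    flat = h * w * channels          # input length of the first dense layer
    return (h, w, channels), flat

shape, flat = cnn_output_shape(32, 32)   # e.g. an assumed 32x32 feature map
```

With a 32×32 input this gives a 4×4×1024 tensor (16384 values) entering the first fully connected layer, which shows how quickly the 1024-kernel third layer dominates the parameter count.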
7. A method for integrated decision of speech and non-speech according to any of claims 1-6, characterized by: the first framed voice data comprises voice data processed with a frame length of 1 s and an inter-frame overlap of 0.7 s; the second framed voice data comprises voice data processed with a frame length of 0.025 s and an inter-frame overlap of 0.01 s.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210006259.1A CN114242116A (en) | 2022-01-05 | 2022-01-05 | Comprehensive judgment method for voice and non-voice of voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114242116A true CN114242116A (en) | 2022-03-25 |
Family
ID=80745796
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210006259.1A Pending CN114242116A (en) | 2022-01-05 | 2022-01-05 | Comprehensive judgment method for voice and non-voice of voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114242116A (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6459394A (en) * | 1987-08-31 | 1989-03-07 | Ricoh Kk | Digital voice extractor |
JP2008134565A (en) * | 2006-11-29 | 2008-06-12 | Nippon Telegr & Teleph Corp <Ntt> | Voice/non-voice determination compensation device, voice/non-voice determination compensation method, voice/non-voice determination compensation program and its recording medium, and voice mixing device, voice mixing method, voice mixing program and its recording medium |
CN101308653A (en) * | 2008-07-17 | 2008-11-19 | 安徽科大讯飞信息科技股份有限公司 | End-point detecting method applied to speech identification system |
JP2009020460A (en) * | 2007-07-13 | 2009-01-29 | Yamaha Corp | Voice processing device and program |
JP2009063700A (en) * | 2007-09-05 | 2009-03-26 | Nippon Telegr & Teleph Corp <Ntt> | Device, method and program for estimating voice signal section, and storage medium recording the program |
US20110026722A1 (en) * | 2007-05-25 | 2011-02-03 | Zhinian Jing | Vibration Sensor and Acoustic Voice Activity Detection System (VADS) for use with Electronic Systems |
US20130304464A1 (en) * | 2010-12-24 | 2013-11-14 | Huawei Technologies Co., Ltd. | Method and apparatus for adaptively detecting a voice activity in an input audio signal |
CN109545188A (en) * | 2018-12-07 | 2019-03-29 | 深圳市友杰智新科技有限公司 | A kind of real-time voice end-point detecting method and device |
JP6539829B1 (en) * | 2018-05-15 | 2019-07-10 | 角元 純一 | How to detect voice and non-voice level |
CN110335593A (en) * | 2019-06-17 | 2019-10-15 | 平安科技(深圳)有限公司 | Sound end detecting method, device, equipment and storage medium |
CN111292758A (en) * | 2019-03-12 | 2020-06-16 | 展讯通信(上海)有限公司 | Voice activity detection method and device and readable storage medium |
Non-Patent Citations (2)
Title |
---|
MAN-WAI MAK: "A study of voice activity detection techniques for NIST speaker recognition evaluations", 《COMPUTER SPEECH & LANGUAGE》, 30 January 2014 (2014-01-30) * |
HE RONGRONG: "Endpoint detection algorithm for speech signals in low-SNR environments", China Master's Theses Full-text Database (Information Science and Technology), 15 January 2019 (2019-01-15) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103854662B (en) | Adaptive voice detection method based on multiple domain Combined estimator | |
CN110232933B (en) | Audio detection method and device, storage medium and electronic equipment | |
CN103646649A (en) | High-efficiency voice detecting method | |
JP2003517624A (en) | Noise suppression for low bit rate speech coder | |
CN110599987A (en) | Piano note recognition algorithm based on convolutional neural network | |
JP3105465B2 (en) | Voice section detection method | |
Jiao et al. | Convex weighting criteria for speaking rate estimation | |
Archana et al. | Gender identification and performance analysis of speech signals | |
CN112541533A (en) | Modified vehicle identification method based on neural network and feature fusion | |
CN111540368B (en) | Stable bird sound extraction method and device and computer readable storage medium | |
CN101625858A (en) | Method for extracting short-time energy frequency value in voice endpoint detection | |
CN115346561A (en) | Method and system for estimating and predicting depression mood based on voice characteristics | |
US20050049863A1 (en) | Noise-resistant utterance detector | |
CN112151066A (en) | Voice feature recognition-based language conflict monitoring method, medium and equipment | |
CN114242116A (en) | Comprehensive judgment method for voice and non-voice of voice | |
Ziólko et al. | Phoneme segmentation of speech | |
CN112908344B (en) | Intelligent bird song recognition method, device, equipment and medium | |
Sangeetha et al. | Robust automatic continuous speech segmentation for indian languages to improve speech to speech translation | |
CN1971707A (en) | Method and apparatus for estimating fundamental tone period and adjudging unvoiced/voiced classification | |
CN113744725A (en) | Training method of voice endpoint detection model and voice noise reduction method | |
Stadtschnitzer et al. | Reliable voice activity detection algorithms under adverse environments | |
TW202143215A (en) | Speech enhancement system based on deep learning | |
CN111091816A (en) | Data processing system and method based on voice evaluation | |
CN109346106B (en) | Cepstrum domain pitch period estimation method based on sub-band signal-to-noise ratio weighting | |
Soon et al. | Evaluating the effect of multiple filters in automatic language identification without lexical knowledge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||