CN114242116A - Comprehensive judgment method for voice and non-voice of voice - Google Patents

Comprehensive judgment method for voice and non-voice of voice

Info

Publication number
CN114242116A
CN114242116A (application CN202210006259.1A)
Authority
CN
China
Prior art keywords: voice, speech, frame, voice data, energy
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210006259.1A
Other languages
Chinese (zh)
Inventor
代策宇
张义林
徐杨辉
傅松
段绍楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Jinjiang Electronic System Engineering Co Ltd
Original Assignee
Chengdu Jinjiang Electronic System Engineering Co Ltd
Application filed by Chengdu Jinjiang Electronic System Engineering Co Ltd filed Critical Chengdu Jinjiang Electronic System Engineering Co Ltd
Priority to CN202210006259.1A priority Critical patent/CN114242116A/en
Publication of CN114242116A publication Critical patent/CN114242116A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/0208 - Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L25/18 - Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/21 - Speech or voice analysis techniques characterised by the extracted parameters being power information
    • G10L25/24 - Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a method for comprehensively judging the speech and non-speech portions of voice data, which comprises the following steps: performing framing processing on input voice data to obtain first framing voice data and second framing voice data; preprocessing the first framing voice data, performing time-frequency conversion and extracting cepstrum coefficients for each frame, inputting the preprocessed data into a voice recognition network, and judging the proportion of speech segments within the whole voice segment; when the speech-signal proportion is larger than a preset value, carrying out voice noise reduction by combining a short-time autocorrelation method and a spectral subtraction method; detecting voice endpoints by combining the short-time autocorrelation method and an energy-entropy ratio method, marking the speech segments in the detected voice data as speech and the other segments as non-speech, and finally outputting the voice data. The invention improves the applicability of speech judgment and enlarges the range of conditions under which speech and non-speech can be judged in complex situations.

Description

Comprehensive judgment method for voice and non-voice of voice
Technical Field
The invention relates to the technical field of voice processing, in particular to a comprehensive judgment method for voice and non-voice of voice.
Background
For the decision between speech and non-speech, prior-art methods can be roughly divided into three types: threshold-based decision methods, classifier-based decision methods and model-based decision methods. The threshold-based decision method, i.e. voice endpoint detection, extracts time-domain and frequency-domain features of the voice, such as short-term energy, short-term zero-crossing rate and cepstrum coefficients, and sets a reasonable threshold to distinguish speech from non-speech. The classifier-based decision method treats the speech judgment as a speech/non-speech classification problem and trains a classifier using neural network and machine learning methods. The model-based method uses a complete acoustic model and makes the decision based on decoding and global information.
However, the existing speech and non-speech decision methods assume that the segment to be judged contains a single type of noise with an unchanged signal-to-noise ratio. To achieve a good noise-reduction effect, the initial frames of the segment are assumed to be non-speech frames, i.e. noise frames, during denoising, and these initial non-speech frames are taken as the background noise of the voice for noise reduction and speech judgment.
The existing classifier-based and model-based decision methods need to judge speech and non-speech for every frame of the signal and then require additional methods to eliminate the deviation introduced by that decision; to judge the speech signal accurately, they also need large amounts of varied training data to train the network or build the model, making the preliminary work complex.
Therefore, the prior art assumes idealized conditions for the speech to be judged: it cannot adaptively determine noise frames when the initial segment is speech or when the background noise is complex and multiple signal-to-noise ratios coexist, and it requires speech data to train a network and build a model for speech judgment.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides a comprehensive judgment method for voice and non-voice of voice, and solves the problems in the prior art.
The purpose of the invention is realized by the following technical scheme: a comprehensive decision method for voice and non-voice of speech, the comprehensive decision method comprises:
performing framing processing on input voice data to obtain first framing voice data and second framing voice data;
the processing method of the first frame voice data comprises the following steps:
preprocessing the first framing voice data, performing time-frequency conversion and extracting cepstrum coefficients for each frame of voice data, inputting the preprocessed data into a voice recognition network, and judging the proportion of speech segments within the whole voice segment;
when the voice signal proportion is larger than a preset value, carrying out voice noise reduction processing by combining a short-time autocorrelation method and a spectral subtraction method;
detecting a voice endpoint by combining a short-time correlation method and an energy-entropy ratio method, marking voice speech segments in the detected voice data as voice, marking other speech segments as non-voice, and finally outputting the voice data;
the processing method of the second sub-frame voice data comprises the following steps:
performing voice noise reduction processing on the second sub-frame voice data by combining a short-time autocorrelation method and a spectral subtraction method;
and detecting the voice endpoint by combining a short-time correlation method and an energy-entropy ratio method, marking voice speech segments in the detected voice data as voice, marking other speech segments as non-voice, and finally outputting the voice data.
The preprocessing of the first framing voice data, in which each frame of voice data undergoes time-frequency conversion and cepstrum coefficient extraction, comprises:
obtaining the time-frequency parameters F(f, t) of the voice data by short-time Fourier transform of the first framing voice data, representing the relative energy of the voice signal at time t and frequency f;
performing MFCC feature extraction on each frame of voice data to obtain an MFCC value, a first-order MFCC difference and a second-order MFCC difference of each frame of voice data;
carrying out pre-emphasis processing on a voice signal, windowing the pre-emphasized signal and carrying out frequency domain conversion on the windowed signal to obtain the representation of the voice signal on a frequency domain;
calculating an energy spectrum of each frame of spectral line energy after passing through a Mel filter bank, and carrying out logarithm taking processing on the energy spectrum after passing through the Mel filter bank;
taking logarithm of the energy passing through the Mel filter bank, performing discrete cosine transform to obtain MFCC characteristics, and performing first-order difference processing on the MFCC characteristics to obtain first-order MFCC characteristics;
and performing differential operation on the first-order MFCC features to obtain second-order MFCC features.
The voice noise reduction processing comprises:
performing short-time autocorrelation processing on each frame of voice data x_n to obtain the autocorrelation value R_n of the current frame;
taking the per-frame autocorrelation values as a new autocorrelation sequence and performing smoothing with an average filter of set window length and window shift to obtain the filtered autocorrelation sequence R'_n;
taking the mean of the filtered autocorrelation sequence as the threshold η, frame segments whose autocorrelation value is less than or equal to the threshold η being taken as non-speech segments and frame segments greater than the threshold η being taken as speech segments;
using the determined non-speech segments and speech segments as input, denoising the original speech data x_n by spectral subtraction to obtain the noise-reduced voice data x'_n.
Using the determined non-speech segments and speech segments as input and denoising the original speech data x_n by spectral subtraction to obtain the noise-reduced voice data x'_n comprises:
performing a fast Fourier transform on each original frame of the speech signal x_n to obtain the transformed speech signal X_n(k);
obtaining the amplitude |X_n(k)| and the phase angle ∠X_n(k) of X_n(k);
calculating the number of frames NIS of the non-speech segment to obtain the average power spectrum D(k) of the non-speech segment;
calculating the average value Y_n(k) of the fast-Fourier-transformed speech signal X_n(k) and obtaining the spectrally subtracted amplitude |X̂_n(k)| through the spectral subtraction formula;
obtaining the noise-reduced voice data x'_n from the spectrally subtracted amplitude |X̂_n(k)| and the phase angle ∠X_n(k) by inverse fast Fourier transform.
The method for detecting voice endpoints by combining the short-time autocorrelation method with the energy-entropy ratio method comprises:
calculating the short-time energy to obtain the energy E_n of each frame signal x_n, and computing the fast Fourier transform X'_n of each noise-reduced frame signal x'_n;
calculating the short-time energy E'_n of each noise-reduced frame in the frequency domain and the energy spectrum S_n(k) of the k-th spectral line;
calculating the normalized spectral probability density function p_n(k) of each frequency component of each frame, the spectral entropy H_n of each frame, and the energy-entropy ratio Ef_n of each frame signal;
calculating the energy-entropy ratio Ef'_n of the non-speech-segment signal by applying the same computation to the noise-reduced non-speech frames in place of the noise-reduced per-frame signal x'_n;
setting decision thresholds T1 and T2, and calculating the intersection points N2, N3 of the energy-entropy ratio Ef_n of the speech signal with the threshold T2, the start and end points of the speech segment lying outside the interval (N2, N3);
searching leftwards from the starting point N2 and rightwards from the end point N3 to find the intersection points N1, N4 of the energy-entropy ratio Ef_n with the threshold T1, N1 being the starting point of the speech segment and N4 its end point.
The voice recognition network comprises three convolutional layers, three pooling layers and three fully connected layers; first convolutional layer: the convolution kernel size is 3 × 3, with 32 kernels in total and a kernel stride of 1, parts with insufficient boundary being filled with 0 values during convolution; first pooling layer: 2 × 2 max pooling, with the insufficient boundary part filled with 0 values; second convolutional layer: kernel size 3 × 3 with 64 kernels in total, the remaining settings being the same as the first convolutional layer; the second pooling layer is set the same as the first pooling layer; third convolutional layer: kernel size 3 × 3 with 1024 kernels in total, the remaining settings being the same as the first convolutional layer; the third pooling layer is set the same as the first pooling layer; the outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, representing the number of required classes; after each convolution, the convolved values are activated with the ReLU activation function; during training, the network parameters are updated with the Adam stochastic gradient descent method.
The first frame voice data comprises voice data which is processed according to the time of each frame being 1s and the time of overlapping between each frame being 0.7 s; the second frame-divided voice data includes voice data processed by each frame with a time of 0.025s and an overlapping time of 0.01 s.
The invention has the following advantages: the comprehensive decision method uses a neural network to identify voice containing speech segments, then combines autocorrelation with spectral subtraction to find the non-speech segments and perform noise reduction, and finally uses the noise-reduced voice with the energy-entropy ratio method to judge the speech segments more accurately, thereby improving the applicability of speech judgment.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic flow chart showing the details of the method of the present invention;
FIG. 3 is a schematic diagram of an MFCC feature extraction process;
FIG. 4 is a schematic diagram of recognition network data preprocessing;
FIG. 5 is a schematic diagram of a convolutional neural recognition network structure;
FIG. 6 is a schematic diagram of a training and usage flow of a recognition network;
FIG. 7 is a schematic diagram of a speech noise reduction process;
FIG. 8 is a schematic view of an endpoint detection process;
FIG. 9 is a diagram illustrating the recognition results of the recognition network on the test set;
FIG. 10 is a schematic diagram of the effect of a speech denoising method based on the combination of short-term autocorrelation and spectral subtraction;
FIG. 11 is a schematic diagram of endpoint detection based on a combination of short-term autocorrelation and energy-to-entropy ratio.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1 and fig. 2, the present invention provides, on the basis of the prior art, a comprehensive decision method for speech and non-speech in voice data, addressing the situations the prior art cannot handle: the initial segment of the voice being speech, complex background noise, multiple coexisting signal-to-noise ratios, and the large amounts of data required by neural networks and models for training and construction. The method specifically comprises the following contents.
Step 1: the speech signal data (sampling rate fs Hz) subjected to AD conversion (analog signal-digital signal conversion) is used as input speech data, and the input speech data is subjected to framing processing, and two framing modes are provided in total. The first framing method is to process voice data in a time of 1s (fs length) per frame and in an overlapping time of 0.7s (0.7 × fs length) between frames. The second framing method is to process the voice data in a time of 0.025s (0.025 × fs) per frame and in a time of 0.01s (0.01 × fs) overlapping each frame.
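For illustration, a minimal NumPy sketch of these two framing modes (the sampling rate value below and the literal reading of the overlap, i.e. hop length = frame length minus overlap, are assumptions not fixed by this step):

    import numpy as np

    def frame_signal(x, frame_len, hop_len):
        """Split a 1-D signal into overlapping frames of frame_len samples,
        advancing by hop_len samples (the trailing remainder is dropped)."""
        n_frames = 1 + (len(x) - frame_len) // hop_len
        idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
        return x[idx]

    fs = 8000                                   # assumed sampling rate in Hz
    x = np.random.randn(10 * fs)                # placeholder for AD-converted speech data

    # First framing mode: 1 s frames (fs samples) with 0.7 s overlap, i.e. a 0.3 s hop
    frames_long = frame_signal(x, frame_len=fs, hop_len=int(0.3 * fs))
    # Second framing mode: 0.025 s frames with 0.01 s overlap between frames
    frames_short = frame_signal(x, frame_len=int(0.025 * fs), hop_len=int(0.015 * fs))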
Step 2: the voice data which is subjected to framing processing and has the frame length of fs and the frame overlapping length of 0.7 xfs is preprocessed, and time-frequency conversion and MFCC (Mel frequency cepstrum coefficient) acquisition are carried out on each frame of voice data.
Further, as shown in fig. 3, the method specifically includes:
Step 2.1: obtain the time-frequency parameters F(f, t) of the voice data through the STFT, representing the relative energy of the voice at time t and frequency f, where x_n is the speech signal and ω(n) is the Hamming window function. In formula (2-2), n denotes the n-th point of the Hamming window and N the window length; the selected length is N = 256.
F(f, t) = STFT(x_n, ω(n)) (2-1)
ω(n) = 0.54 - 0.46 × cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1 (2-2)
Step 2.2: acquire the MFCC features of each frame of voice data, mainly the MFCC value, the first-order MFCC difference and the second-order MFCC difference of each frame.
Step 2.2.1: pre-emphasis processing of the speech signal, where x_n(i) is the i-th sample amplitude of the speech signal.
x'_n(i) = x_n(i) - 0.97 × x_n(i - 1) (2-3)
Step 2.2.2: window the pre-emphasized signal and convert the windowed signal to the frequency domain, where a Hamming window of length 256 is selected as the window function ω(l) and N is the frame length, obtaining the representation of the speech signal in the frequency domain; X_n(k) is the amplitude spectrum of the speech signal at the k-th spectral line and P_n(k) is the power spectrum of the k-th spectral line.
X_n(k) = Σ_{l=0}^{N-1} x'_n(l) × ω(l) × e^(-j2πkl/N), 0 ≤ k ≤ N - 1 (2-4)
P_n(k) = |X_n(k)|² (2-5)
Step 2.2.3: calculate the energy of each frame's spectral lines after passing through the Mel filter bank, with 22 filters. In formula (2-6), M_i(k) denotes the i-th filter, k is the k-th spectral line input to the filter, and f(i) is the center frequency of the i-th filter. In formula (2-7), f_l is the lowest frequency of the filter bank, f_h is the highest frequency, and f_s is the sampling frequency.
M_i(k) = (k - f(i-1)) / (f(i) - f(i-1)) for f(i-1) ≤ k ≤ f(i); M_i(k) = (f(i+1) - k) / (f(i+1) - f(i)) for f(i) < k ≤ f(i+1); M_i(k) = 0 otherwise (2-6)
f(i) = (N / f_s) × Mel⁻¹( Mel(f_l) + i × (Mel(f_h) - Mel(f_l)) / (M + 1) ), where Mel(f) = 2595 × lg(1 + f / 700) and M is the number of filters (2-7)
S_n(i) = Σ_{k=0}^{N-1} P_n(k) × M_i(k), 1 ≤ i ≤ 22 (2-8)
Step 2.2.4: take the logarithm of the energy passing through the Mel filter bank and then perform a discrete cosine transform to obtain the MFCC features.
Step 2.2.5: perform first-order difference processing on the MFCC features to obtain the first-order MFCC features. In the difference formula below, d(j) is the j-th first-order difference, c(j + l) is the (j + l)-th cepstrum coefficient, and z is the difference frame interval.
d(j) = ( Σ_{l=1}^{z} l × (c(j + l) - c(j - l)) ) / ( 2 × Σ_{l=1}^{z} l² ) (2-9)
Step 2.2.6: perform the same difference operation of formula (2-9) on the first-order MFCC features to obtain the second-order MFCC features.
After preprocessing, the size of the time-frequency parameter of the obtained voice signal is 129 × 64, the size of the MFCC characteristic parameter is 99 × 64, the two parts of parameters are spliced by a matrix, and the splicing method is as shown in fig. 4, and finally the size of the obtained preprocessed output data is 228 × 64.
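For illustration, a sketch of this per-frame preprocessing using the librosa audio library (the FFT size, hop length and coefficient counts below are assumptions; the description does not fix the exact values that yield the 129 × 64, 99 × 64 and 228 × 64 shapes):

    import numpy as np
    import librosa

    def preprocess_frame(y, sr, n_fft=256, hop_length=128):
        """Build the joint time-frequency / cepstral feature matrix for one
        1-second analysis frame: |STFT| stacked with MFCC, first-order and
        second-order MFCC differences, then globally normalised."""
        spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))   # (1 + n_fft/2, T)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=22,
                                    n_fft=n_fft, hop_length=hop_length, n_mels=22)
        d1 = librosa.feature.delta(mfcc, order=1)                            # first-order MFCC difference
        d2 = librosa.feature.delta(mfcc, order=2)                            # second-order MFCC difference
        feat = np.vstack([spec, mfcc, d1, d2])                               # splice along the feature axis
        return (feat - feat.mean()) / (feat.std() + 1e-8)                    # normalise before the network

    sr = 8000                                       # assumed sampling rate
    y = np.random.randn(sr).astype(np.float32)      # placeholder for one 1 s frame of voice data
    features = preprocess_frame(y, sr)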
Step 3: as shown in fig. 5 and 6, each frame of preprocessed voice data is used as the input of the CNN, and the label of each frame of input data is speech or non-speech. The network adopts three convolutional layers, three pooling layers and three fully connected layers. First convolutional layer: the convolution kernel size is 3 × 3, with 32 kernels in total and a kernel stride of 1; during convolution, parts with insufficient boundary are filled with 0 values. First pooling layer: 2 × 2 max pooling is used, with the insufficient boundary part filled with 0 values. Second convolutional layer: kernel size 3 × 3 with 64 kernels in total; the remaining settings are the same as those of the first convolutional layer. The second pooling layer is set the same as the first pooling layer. Third convolutional layer: kernel size 3 × 3 with 1024 kernels in total; the remaining settings are the same as those of the first convolutional layer. The third pooling layer is set the same as the first pooling layer. The outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, representing the number of required classes. After each convolution, the convolved values are activated with the ReLU activation function. During training, the network parameters are updated with the Adam stochastic gradient descent method.
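A minimal PyTorch sketch of the described layer layout (a single input channel, boundary handling via ceil-mode pooling and an nn.LazyLinear first fully connected layer are assumptions; note that the 1024-kernel third convolution makes this first fully connected layer very large for a 228 × 64 input):

    import torch
    import torch.nn as nn

    class SpeechCNN(nn.Module):
        """Three 3x3 convolutional layers (32 / 64 / 1024 kernels, stride 1, zero
        padding), each followed by ReLU and 2x2 max pooling, then three fully
        connected layers (1024, 1024, 2)."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2, ceil_mode=True),          # keep the partial boundary window
                nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2, ceil_mode=True),
                nn.Conv2d(64, 1024, 3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2, ceil_mode=True),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.LazyLinear(1024), nn.ReLU(),           # in_features inferred on first forward
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 2),                       # two classes: speech / non-speech
            )

        def forward(self, x):                             # x: (batch, 1, 228, 64)
            return self.classifier(self.features(x))

    model = SpeechCNN()
    logits = model(torch.randn(2, 1, 228, 64))                  # batch of preprocessed 228 x 64 frames
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam updates, as described
    loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))  # 0 = non-speech, 1 = speech
    loss.backward()
    optimizer.step()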
Step 4: after the voice data has been judged by the neural network, the proportion of segments identified as speech within the whole voice segment is counted. The length of the whole voice segment counted each time is 10 s.
Step 5: when the proportion of speech signals in the segment is less than 5%, the segment is judged to be a non-speech segment and is marked as non-speech.
Step 6: voice noise reduction is carried out by combining the short-time autocorrelation method and the spectral subtraction method.
As shown in fig. 7, further, step 6 specifically includes the following:
Step 6.1: perform short-time autocorrelation processing on each frame of voice data x_n to obtain the autocorrelation value R_n of the current frame:
R_n(k) = Σ_{m=0}^{N-1-k} x_n(m) × x_n(m + k) (6-1)
where k is the k-th lag within the frame of voice data and N is the number of samples of the frame.
Step 6.2: take the obtained per-frame autocorrelation values as a new autocorrelation sequence R_n and smooth it with an average filter of window length 10 and window shift 1 to obtain the filtered autocorrelation sequence R'_n:
R'_n = mean(R_n + … + R_{n+9}), 1 ≤ n ≤ K - 9 (6-2)
where n indexes the autocorrelation sequence and K is its total length.
Step 6.3: take the mean of the filtered autocorrelation sequence R'_n as the threshold η; frame segments whose autocorrelation value is less than or equal to the threshold η are taken as non-speech segments, and frame segments greater than η are taken as speech segments.
The non-speech segment g_n determined by autocorrelation is then used as input, and the original speech data x_n is denoised by spectral subtraction to obtain the noise-reduced voice data x'_n.
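A NumPy sketch of this screening step (how the per-frame autocorrelation is reduced to the single value R_n is not spelled out above, so summing over the positive lags is an assumption):

    import numpy as np

    def autocorr_screen(frames, win=10):
        """Screen frames into speech / non-speech: one short-time autocorrelation
        value per frame, smoothed by a mean filter of length `win` and shift 1,
        then thresholded at the mean of the smoothed sequence (eta)."""
        frames = np.asarray(frames, dtype=float)
        # Per-frame autocorrelation value R_n (sum over the positive lags; an assumption)
        R = np.array([np.correlate(f, f, mode="full")[len(f):].sum() for f in frames])
        R_smooth = np.convolve(R, np.ones(win) / win, mode="valid")   # R'_n, length K - win + 1
        eta = R_smooth.mean()                                         # threshold eta
        speech_mask = R_smooth > eta                                  # True = speech, False = non-speech
        return speech_mask, eta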
Step 6.4: perform a fast Fourier transform (FFT) on each original frame of the speech signal x_n, where N is the frame length and X_n(k) is the spectral value of the k-th spectral line of the n-th frame:
X_n(k) = Σ_{m=0}^{N-1} x_n(m) × e^(-j2πkm/N), 0 ≤ k ≤ N - 1
The amplitude of X_n(k) is |X_n(k)| and its phase angle is ∠X_n(k) = arctan( Im[X_n(k)] / Re[X_n(k)] ).
Step 6.5: let the number of frames of the non-speech segment be NIS; perform FFT processing on the non-speech segment g_n to obtain d_n(k), the spectral value of the k-th spectral line of the n-th frame of the non-speech signal, and compute the average power spectrum D(k) of the non-speech segment:
D(k) = (1 / NIS) × Σ_{n=1}^{NIS} |d_n(k)|²
Step 6.6: from the fast-Fourier-transformed spectrum X_n(k), calculate its average value Y_n(k).
Step 6.7: obtain the spectrally subtracted amplitude |X̂_n(k)| through the spectral subtraction formula, where a = 4 is the over-subtraction factor and b = 0.001 is the gain compensation factor:
|X̂_n(k)|² = |Y_n(k)|² - a × D(k), if |Y_n(k)|² ≥ a × D(k); |X̂_n(k)|² = b × D(k), otherwise
Step 6.8: from the spectrally subtracted amplitude |X̂_n(k)| and the original phase angle ∠X_n(k), obtain the noise-reduced speech x'_n by the inverse fast Fourier transform (IFFT).
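A NumPy sketch of steps 6.4 to 6.8 (noise_mask marks the non-speech frames found by the autocorrelation screening; the factors a = 4 and b = 0.001 follow the values above, while windowing, overlap-add and the frame averaging of step 6.6 are simplified):

    import numpy as np

    def spectral_subtraction(frames, noise_mask, a=4.0, b=0.001):
        """Spectral subtraction using an average noise power spectrum D(k)
        estimated from the frames flagged as non-speech."""
        frames = np.asarray(frames, dtype=float)
        X = np.fft.rfft(frames, axis=1)                     # per-frame FFT
        mag, phase = np.abs(X), np.angle(X)                 # amplitude and phase angle
        D = np.mean(mag[noise_mask] ** 2, axis=0)           # average noise power spectrum D(k)
        sub = mag ** 2 - a * D                              # over-subtraction with factor a
        sub = np.where(sub > 0.0, sub, b * D)               # gain-compensation floor b * D(k)
        X_hat = np.sqrt(sub) * np.exp(1j * phase)           # subtracted amplitude, original phase
        return np.fft.irfft(X_hat, n=frames.shape[1], axis=1)   # noise-reduced frames via inverse FFT

    # Usage sketch (hypothetical variables):
    #   denoised_frames = spectral_subtraction(frames, noise_mask=~speech_mask)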
Step 7: voice endpoint detection uses the short-time autocorrelation method together with the energy-entropy ratio method. The autocorrelation stage is processed in the same way as in the noise-reduction stage (see formulas (6-1) and (6-2)), yielding the speech and non-speech segments of the noise-reduced speech. These segments are then used as input, and the start and end positions of the speech and non-speech segments are determined by the energy-entropy ratio method.
As shown in fig. 8, further, step 7 specifically includes the following:
Step 7.1: the energy-entropy ratio is the ratio of each frame's energy to its spectral entropy. The short-time energy E_n of each frame signal x_n is
E_n = Σ_{m=0}^{N-1} x_n(m)² (7-1)
where N is the number of sampling points of each frame signal.
The spectral entropy of the speech signal is obtained as follows.
Step 7.2: for each noise-reduced frame signal x'_n, compute its FFT X'_n(k), where k denotes the k-th spectral line:
X'_n(k) = Σ_{m=0}^{N-1} x'_n(m) × e^(-j2πkm/N) (7-2)
Step 7.3: calculate the short-time energy E'_n of each noise-reduced frame in the frequency domain, where N is the FFT length, only the positive-frequency part is taken, and X'*_n(k) is the conjugate of X'_n(k):
E'_n = Σ_{k=0}^{N/2} X'_n(k) × X'*_n(k) (7-3)
Step 7.4: calculate the energy spectrum of the k-th spectral line:
S_n(k) = |X'_n(k)|² (7-4)
Step 7.5: calculate the normalized spectral probability density function of each frequency component of each frame:
p_n(k) = S_n(k) / Σ_{l=0}^{N/2} S_n(l) (7-5)
Step 7.6: calculate the spectral entropy of each frame:
H_n = -Σ_{k=0}^{N/2} p_n(k) × log p_n(k) (7-6)
Step 7.7: calculate the energy-entropy ratio of each frame signal:
Ef_n = sqrt(1 + |E'_n / H_n|) (7-7)
Step 7.8: calculate the energy-entropy ratio Ef'_n of the non-speech-segment signal. The calculation follows formulas (7-2) to (7-7), with the noise-reduced per-frame signal x'_n replaced by the noise-reduced non-speech frames, giving the energy-entropy ratio Ef'_n of the noise-reduced non-speech frames.
Step 7.9: set the decision thresholds T1 and T2, where Me is the maximum of the per-frame energy-entropy ratio and δ is the adaptive parameter of the decision threshold:
Me = max(Ef_n) (7-8)
δ = Me - mean(Ef'_n) (7-9)
T1 = 0.05 × δ + mean(Ef'_n) (7-10)
T2 = 0.1 × δ + mean(Ef'_n) (7-11)
Step 7.10: for the initial threshold decision, calculate the intersection points N2 and N3 of the energy-entropy ratio Ef_n of the speech signal with the threshold T2; the start and end points of the speech segment lie outside the interval (N2, N3).
Step 7.11: starting from the initial decision, search leftwards from N2 and rightwards from N3 to find the intersection points N1 and N4 of the energy-entropy ratio Ef_n with the threshold T1; N1 is the starting point of the speech segment and N4 is its end point.
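A NumPy sketch of steps 7.1 to 7.11 (the sqrt(1 + |E/H|) form of the ratio and the assumption of a single speech region are simplifications; the thresholds follow formulas (7-8) to (7-11)):

    import numpy as np

    def energy_entropy_ratio(frames, eps=1e-10):
        """Per-frame energy-to-spectral-entropy ratio Ef_n of (denoised) frames."""
        frames = np.asarray(frames, dtype=float)
        X = np.fft.rfft(frames, axis=1)                     # positive-frequency spectrum per frame
        S = np.abs(X) ** 2                                  # energy spectrum S_n(k)
        E = S.sum(axis=1)                                   # short-time energy E'_n in the frequency domain
        p = S / (S.sum(axis=1, keepdims=True) + eps)        # normalised spectral probability density p_n(k)
        H = -(p * np.log(p + eps)).sum(axis=1)              # spectral entropy H_n
        return np.sqrt(1.0 + np.abs(E / (H + eps)))         # energy-entropy ratio Ef_n (assumed form)

    def detect_endpoints(Ef, Ef_nonspeech):
        """Double-threshold search: T2 gives the coarse region (N2, N3);
        expanding outwards to the T1 crossings gives the endpoints (N1, N4)."""
        Me = Ef.max()
        delta = Me - Ef_nonspeech.mean()                    # adaptive threshold parameter
        T1 = 0.05 * delta + Ef_nonspeech.mean()
        T2 = 0.10 * delta + Ef_nonspeech.mean()
        above = np.where(Ef > T2)[0]
        if len(above) == 0:
            return None                                     # no frame exceeds T2: no speech found
        N1, N4 = above[0], above[-1]                        # coarse region N2..N3
        while N1 > 0 and Ef[N1 - 1] > T1:                   # search left of N2 for the T1 crossing
            N1 -= 1
        while N4 < len(Ef) - 1 and Ef[N4 + 1] > T1:         # search right of N3 for the T1 crossing
            N4 += 1
        return N1, N4                                       # start and end frame of the speech segment

    # Usage sketch (hypothetical variables):
    #   Ef = energy_entropy_ratio(denoised_frames)
    #   Ef_ns = energy_entropy_ratio(denoised_frames[~speech_mask])
    #   endpoints = detect_endpoints(Ef, Ef_ns)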
Step 8: mark the voice data after endpoint detection, with the speech segments marked as speech and the remaining segments marked as non-speech.
Step 9: output the voice data, i.e. splice the segments marked as speech into one new voice segment in time order and store it in wav file format at the fs Hz sampling rate.
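A brief sketch of this output step using SciPy's wav writer (the 16-bit PCM format and the [-1, 1] float scaling are assumptions; the description only requires wav output at the fs Hz sampling rate):

    import numpy as np
    from scipy.io import wavfile

    def write_speech_segments(segments, fs, path="speech_out.wav"):
        """Concatenate the segments marked as speech in time order and store
        them as one wav file at the original sampling rate fs."""
        joined = np.concatenate(segments)
        wavfile.write(path, fs, (joined * 32767.0).astype(np.int16))   # 16-bit PCM wav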
Considering that in real voice signals the non-speech portion occupies most of the time and its type and energy are complicated and changeable, the invention designs a convolutional neural network that recognizes the voice signal by combining its time-frequency parameters and cepstrum characteristic parameters. The network input is the normalized time-frequency and cepstrum characteristic parameter information of the voice signal; the current segment is judged through three convolutional layers, three pooling layers and three fully connected layers, so that whether it contains speech information can be judged roughly and quickly. If the current segment contains speech, subsequent voice endpoint detection is carried out; if not, no further processing is performed, which increases the speed of voice judgment and reduces the judgment time.
The recognition result of the network is shown in fig. 9, and the neural network combining the time-frequency parameters and the cepstrum characteristic parameters of the speech signal can more accurately recognize the speech segment and non-speech segment signals in the speech signal.
Considering that in practice the initial frames of a speech segment requiring noise reduction are not necessarily non-speech frames, the invention designs a new voice denoising method that combines the advantages of short-time autocorrelation and spectral subtraction. The method first calculates the short-time autocorrelation of each frame of the voice signal and screens out the non-speech segments with a threshold; the screened non-speech segments are then used as the noise segments in spectral subtraction to denoise the original voice signal. By adaptively determining the non-speech segments of the voice, the method removes the need of existing methods to determine the non-speech segments manually and improves the intelligence and the denoising effect of voice denoising.
As a result, as shown in fig. 10, when the initial speech is a speech segment, the method adaptively determines the unvoiced segment of the speech signal, so that the speech signal can be accurately denoised, and the denoising effect is excellent.
Considering that in practice the initial frames of a speech segment requiring endpoint detection are not necessarily non-speech frames, the invention designs a new voice endpoint detection method that combines the advantages of short-time autocorrelation and the energy-entropy ratio method. The method first calculates the short-time autocorrelation of each frame of the voice signal and roughly screens out the non-speech segments with a threshold; the ratio of energy to spectral entropy of the screened non-speech segments is then calculated to determine the thresholds for the subsequent detection. The energy-entropy ratio of the speech signal is computed frame by frame, and endpoint detection is carried out with the determined thresholds. By adaptively determining the non-speech segments of the voice, the method removes the need of existing methods to determine the non-speech segments manually and improves the intelligence and accuracy of voice endpoint detection.
As shown in fig. 11, when the initial speech is a speech segment, the endpoint detection method provided by the present invention can adaptively determine the non-speech segment of the speech signal, perform decision and endpoint detection on the speech signal, and the decision result is relatively accurate.
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise form disclosed herein and that various other combinations, modifications, and environments may be resorted to, falling within the scope of the concept as disclosed herein, either as described above or as apparent to those skilled in the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A speech and non-speech integrated decision method is characterized in that: the comprehensive judgment method comprises the following steps:
performing framing processing on input voice data to obtain first framing voice data and second framing voice data;
the processing method of the first frame voice data comprises the following steps:
preprocessing the first framing voice data, performing time-frequency conversion and extracting cepstrum coefficients for each frame of voice data, inputting the preprocessed data into a voice recognition network, and judging the proportion of speech segments within the whole voice segment;
when the voice signal proportion is larger than a preset value, carrying out voice noise reduction processing by combining a short-time autocorrelation method and a spectral subtraction method;
detecting a voice endpoint by combining a short-time correlation method and an energy-entropy ratio method, marking voice speech segments in the detected voice data as voice, marking other speech segments as non-voice, and finally outputting the voice data;
the processing method of the second sub-frame voice data comprises the following steps:
performing voice noise reduction processing on the second sub-frame voice data by combining a short-time autocorrelation method and a spectral subtraction method;
and detecting the voice endpoint by combining a short-time correlation method and an energy-entropy ratio method, marking voice speech segments in the detected voice data as voice, marking other speech segments as non-voice, and finally outputting the voice data.
2. A method for integrated decision of speech and non-speech according to claim 1, characterized by: the preprocessing of the first framing voice data, in which each frame of voice data undergoes time-frequency conversion and cepstrum coefficient extraction, comprises:
obtaining the time-frequency parameters F(f, t) of the voice data by short-time Fourier transform of the first framing voice data, representing the relative energy of the voice signal at time t and frequency f;
performing MFCC feature extraction on each frame of voice data to obtain an MFCC value, a first-order MFCC difference and a second-order MFCC difference of each frame of voice data;
carrying out pre-emphasis processing on a voice signal, windowing the pre-emphasized signal and carrying out frequency domain conversion on the windowed signal to obtain the representation of the voice signal on a frequency domain;
calculating an energy spectrum of each frame of spectral line energy after passing through a Mel filter bank, and carrying out logarithm taking processing on the energy spectrum after passing through the Mel filter bank;
taking logarithm of the energy passing through the Mel filter bank, performing discrete cosine transform to obtain MFCC characteristics, and performing first-order difference processing on the MFCC characteristics to obtain first-order MFCC characteristics;
and performing differential operation on the first-order MFCC features to obtain second-order MFCC features.
3. A method for integrated decision of speech and non-speech according to claim 1, characterized by: the voice noise reduction processing comprises:
performing short-time autocorrelation processing on each frame of voice data x_n to obtain the autocorrelation value R_n of the current frame;
taking the per-frame autocorrelation values as a new autocorrelation sequence and performing smoothing with an average filter of set window length and window shift to obtain the filtered autocorrelation sequence R'_n;
taking the mean of the filtered autocorrelation sequence as the threshold η, frame segments whose autocorrelation value is less than or equal to the threshold η being taken as non-speech segments and frame segments greater than the threshold η being taken as speech segments;
using the determined non-speech segments and speech segments as input, denoising the original speech data x_n by spectral subtraction to obtain the noise-reduced voice data x'_n.
4. A method for integrated decision of speech and non-speech according to claim 3, characterized by: using the determined non-speech segments and speech segments as input and denoising the original speech data x_n by spectral subtraction to obtain the noise-reduced voice data x'_n comprises:
performing a fast Fourier transform on each original frame of the speech signal x_n to obtain the transformed speech signal X_n(k);
obtaining the amplitude |X_n(k)| and the phase angle ∠X_n(k) of X_n(k);
calculating the number of frames NIS of the non-speech segment to obtain the average power spectrum value D(k) of the non-speech segment;
calculating the average value Y_n(k) of the fast-Fourier-transformed speech signal X_n(k) and obtaining the spectrally subtracted amplitude |X̂_n(k)| through the spectral subtraction formula;
obtaining the noise-reduced voice data x'_n from the spectrally subtracted amplitude |X̂_n(k)| and the phase angle ∠X_n(k) by inverse fast Fourier transform.
5. A method for integrated decision of speech and non-speech according to claim 1, characterized by: the method for detecting the voice endpoint by combining the short-time correlation method with the energy-entropy ratio comprises the following steps:
calculating the short-time energy to obtain the energy E_n of each frame signal x_n, and computing the fast Fourier transform X'_n of each noise-reduced frame signal x'_n;
calculating the short-time energy E'_n of each noise-reduced frame in the frequency domain and the energy spectrum S_n(k) of the k-th spectral line;
calculating the normalized spectral probability density function p_n(k) of each frequency component of each frame, the spectral entropy H_n of each frame, and the energy-entropy ratio Ef_n of each frame signal;
calculating the energy-entropy ratio Ef'_n of the non-speech-segment signal by applying the same computation to the noise-reduced non-speech frames in place of the noise-reduced per-frame signal x'_n;
setting decision thresholds T1 and T2, and calculating the intersection points N2, N3 of the energy-entropy ratio Ef_n of the speech signal with the threshold T2, the start and end points of the speech segment lying outside the interval (N2, N3);
searching leftwards from the starting point N2 and rightwards from the end point N3 to find the intersection points N1, N4 of the energy-entropy ratio Ef_n with the threshold T1, N1 being the starting point of the speech segment and N4 its end point.
6. A method for integrated decision of speech and non-speech according to claim 1, characterized by: the voice recognition network comprises three convolutional layers, three pooling layers and three fully connected layers; first convolutional layer: the convolution kernel size is 3 × 3, with 32 kernels in total and a kernel stride of 1, parts with insufficient boundary being filled with 0 values during convolution; first pooling layer: 2 × 2 max pooling, with the insufficient boundary part filled with 0 values; second convolutional layer: kernel size 3 × 3 with 64 kernels in total, the remaining settings being the same as the first convolutional layer; the second pooling layer is set the same as the first pooling layer; third convolutional layer: kernel size 3 × 3 with 1024 kernels in total, the remaining settings being the same as the first convolutional layer; the third pooling layer is set the same as the first pooling layer; the outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, representing the number of required classes; after each convolution, the convolved values are activated with the ReLU activation function; during training of the network, the network parameters are updated with the Adam stochastic gradient descent method.
7. A method for integrated decision of speech and non-speech according to any of claims 1-6, characterized by: the first frame voice data comprises voice data which is processed according to the time of each frame being 1s and the time of overlapping between each frame being 0.7 s; the second frame-divided voice data includes voice data processed by each frame with a time of 0.025s and an overlapping time of 0.01 s.
CN202210006259.1A 2022-01-05 2022-01-05 Comprehensive judgment method for voice and non-voice of voice Pending CN114242116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210006259.1A CN114242116A (en) 2022-01-05 2022-01-05 Comprehensive judgment method for voice and non-voice of voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210006259.1A CN114242116A (en) 2022-01-05 2022-01-05 Comprehensive judgment method for voice and non-voice of voice

Publications (1)

Publication Number Publication Date
CN114242116A true CN114242116A (en) 2022-03-25

Family

ID=80745796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210006259.1A Pending CN114242116A (en) 2022-01-05 2022-01-05 Comprehensive judgment method for voice and non-voice of voice

Country Status (1)

Country Link
CN (1) CN114242116A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6459394A (en) * 1987-08-31 1989-03-07 Ricoh Kk Digital voice extractor
JP2008134565A (en) * 2006-11-29 2008-06-12 Nippon Telegr & Teleph Corp <Ntt> Voice/non-voice determination compensation device, voice/non-voice determination compensation method, voice/non-voice determination compensation program and its recording medium, and voice mixing device, voice mixing method, voice mixing program and its recording medium
US20110026722A1 (en) * 2007-05-25 2011-02-03 Zhinian Jing Vibration Sensor and Acoustic Voice Activity Detection System (VADS) for use with Electronic Systems
JP2009020460A (en) * 2007-07-13 2009-01-29 Yamaha Corp Voice processing device and program
JP2009063700A (en) * 2007-09-05 2009-03-26 Nippon Telegr & Teleph Corp <Ntt> Device, method and program for estimating voice signal section, and storage medium recording the program
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
US20130304464A1 (en) * 2010-12-24 2013-11-14 Huawei Technologies Co., Ltd. Method and apparatus for adaptively detecting a voice activity in an input audio signal
JP6539829B1 (en) * 2018-05-15 2019-07-10 角元 純一 How to detect voice and non-voice level
CN109545188A (en) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 A kind of real-time voice end-point detecting method and device
CN111292758A (en) * 2019-03-12 2020-06-16 展讯通信(上海)有限公司 Voice activity detection method and device and readable storage medium
CN110335593A (en) * 2019-06-17 2019-10-15 平安科技(深圳)有限公司 Sound end detecting method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MAN-WAI MAK: "A study of voice activity detection techniques for NIST speaker recognition evaluations", 《COMPUTER SPEECH & LANGUAGE》, 30 January 2014 (2014-01-30) *
何蓉蓉: "Endpoint detection algorithm for speech signals in low signal-to-noise-ratio environments", China Master's Theses Full-text Database (Information Science and Technology), 15 January 2019 (2019-01-15) *

Similar Documents

Publication Publication Date Title
CN103854662B (en) Adaptive voice detection method based on multiple domain Combined estimator
CN110232933B (en) Audio detection method and device, storage medium and electronic equipment
CN103646649A (en) High-efficiency voice detecting method
JP2003517624A (en) Noise suppression for low bit rate speech coder
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
JP3105465B2 (en) Voice section detection method
Jiao et al. Convex weighting criteria for speaking rate estimation
Archana et al. Gender identification and performance analysis of speech signals
CN112541533A (en) Modified vehicle identification method based on neural network and feature fusion
CN111540368B (en) Stable bird sound extraction method and device and computer readable storage medium
CN101625858A (en) Method for extracting short-time energy frequency value in voice endpoint detection
CN115346561A (en) Method and system for estimating and predicting depression mood based on voice characteristics
US20050049863A1 (en) Noise-resistant utterance detector
CN112151066A (en) Voice feature recognition-based language conflict monitoring method, medium and equipment
CN114242116A (en) Comprehensive judgment method for voice and non-voice of voice
Ziólko et al. Phoneme segmentation of speech
CN112908344B (en) Intelligent bird song recognition method, device, equipment and medium
Sangeetha et al. Robust automatic continuous speech segmentation for indian languages to improve speech to speech translation
CN1971707A (en) Method and apparatus for estimating fundamental tone period and adjudging unvoiced/voiced classification
CN113744725A (en) Training method of voice endpoint detection model and voice noise reduction method
Stadtschnitzer et al. Reliable voice activity detection algorithms under adverse environments
TW202143215A (en) Speech enhancement system based on deep learning
CN111091816A (en) Data processing system and method based on voice evaluation
CN109346106B (en) Cepstrum domain pitch period estimation method based on sub-band signal-to-noise ratio weighting
Soon et al. Evaluating the effect of multiple filters in automatic language identification without lexical knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination