CN106448659A

CN106448659A - Speech endpoint detection method based on short-time energy and fractal dimensions

Info

Publication number: CN106448659A
Application number: CN201611178115.5A
Authority: CN
Inventors: 魏啸天; 鲍鸿
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2016-12-19
Filing date: 2016-12-19
Publication date: 2017-02-22
Anticipated expiration: 2036-12-19
Also published as: CN106448659B

Abstract

The invention discloses a speech endpoint detection method based on short-time energy and fractal dimension. The method comprises: preprocessing a source speech signal to obtain its individual frames; calculating the fractal dimension value of each frame using fractal dimension theory, calculating the short-time energy value of each frame, and obtaining the ratio of the short-time energy value to the fractal dimension value; judging whether the ratio of each frame is greater than or equal to a first threshold and, if so, taking that frame as a speech frame; and extracting the starting endpoint and the ending endpoint of the source speech signal by searching in both directions from the speech frame. Because fractal dimension theory is applied to endpoint detection, and speech frames are selected by comparing each frame's ratio of short-time energy to fractal dimension with the first threshold, the method can effectively extract endpoints from speech signals with a low signal-to-noise ratio.

Description

A speech endpoint detection method based on short-time energy and fractal dimension

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a speech endpoint detection method based on short-time energy and fractal dimension.

Background

Endpoint detection is a very important task in speech recognition. Endpoint detection means determining the start position and the end position of speech within a segment of speech signal. It not only reduces the amount of data collected in speech recognition and saves processing time, but also excludes interference from silent or noise segments and improves the performance of the speech recognition system. In speech coding it can also reduce the bit rate spent on noise and silent segments and improve coding efficiency. Finding the endpoints of a speech signal accurately is difficult, especially in a low signal-to-noise-ratio environment, where noise energy may drown out the speaker's speech and thereby significantly affect subsequent training and recognition. A more robust method is therefore particularly important.

The traditional endpoint detection method judges endpoints with a double-threshold, two-parameter method based on short-time energy and zero-crossing rate. First, a high threshold is chosen on the short-time energy envelope of the speech for a coarse judgment: whatever lies above this threshold is identified as speech, and the start and end positions of the speech should lie outside the points where the envelope crosses the high threshold. Then a low threshold is determined from the zero-crossing rate, and the search continues to the left from the start point and to the right from the end point of the coarse judgment to find the true start and end positions of the speech segment. The zero-crossing rate is used for this second judgment because a Chinese syllable consists of two parts: a final (the vowel part) with relatively high average short-time energy, and a consonant initial with higher frequency, that is, a higher zero-crossing rate.
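The zero-crossing rate that the traditional method relies on can be illustrated with a minimal sketch. The function name `zero_crossing_rate` and the two synthetic tones are illustrative assumptions and do not come from the patent; they only show that a higher-frequency, consonant-like signal yields a higher zero-crossing rate than a lower-frequency, vowel-like one.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(np.asarray(frame, dtype=float))
    signs[signs == 0] = 1  # assumption: treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

# Hypothetical test tones: one 200-sample frame at an assumed 8000 Hz rate.
t = np.arange(200) / 8000.0
vowel_like = np.sin(2 * np.pi * 200 * t)       # low frequency
consonant_like = np.sin(2 * np.pi * 3000 * t)  # high frequency

zcr_vowel = zero_crossing_rate(vowel_like)
zcr_consonant = zero_crossing_rate(consonant_like)
```

Here `zcr_consonant` comes out larger than `zcr_vowel`, which is the property the second-stage judgment exploits.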

Without noise interference, or at a high signal-to-noise ratio, the above endpoint detection method can accurately find the start and end positions of the speaker's speech segment. When the noise is severe, however, for example when the signal-to-noise ratio drops to 10 dB, the method can no longer detect the endpoint positions accurately.

It can be seen that how to accurately detect the endpoint positions of a speech signal when the signal-to-noise ratio is low is a problem that those skilled in the art urgently need to solve.

Summary of the invention

The object of the present invention is to provide a speech endpoint detection method based on short-time energy and fractal dimension, for accurately detecting the endpoint positions of a speech signal when the signal-to-noise ratio is low.

To solve the above technical problem, the present invention provides a speech endpoint detection method based on short-time energy and fractal dimension, comprising:

preprocessing a source speech signal to obtain its individual frames;

calculating the fractal dimension value of each frame using fractal dimension theory, and calculating the short-time energy value of each frame, to obtain the ratio of the short-time energy value to the fractal dimension value;

judging whether the ratio of each frame is greater than or equal to a first threshold and, if so, treating that frame as a speech frame;

extracting, in both directions from the speech frame, the starting endpoint and the ending endpoint contained in the source speech signal.

Preferably, extracting, in both directions from the speech frame, the starting endpoint and the ending endpoint contained in the source speech signal specifically comprises:

judging in turn whether the ratio of each frame to the left of the speech frame is less than a second threshold and, if not, continuing to the next frame until a frame whose ratio is less than the second threshold is found, and taking that frame as the starting endpoint;

judging in turn whether the ratio of each frame to the right of the speech frame is less than the second threshold and, if not, continuing to the next frame until a frame whose ratio is less than the second threshold is found, and taking that frame as the ending endpoint;

wherein the second threshold is less than the first threshold.

Preferably, the first threshold is 1.6.

Preferably, the Second Threshold is 1.16.

Preferably, the preprocessing includes pre-emphasis, framing and windowing.

Preferably, a Hamming window function is used in the windowing.

Preferably, the fractal dimension is the correlation dimension, in which case the corresponding fractal dimension value is the correlation dimension value.

The speech endpoint detection method based on short-time energy and fractal dimension provided by the present invention comprises: preprocessing a source speech signal to obtain its individual frames; calculating the fractal dimension value of each frame using fractal dimension theory, and calculating the short-time energy value of each frame, to obtain the ratio of the short-time energy value to the fractal dimension value; judging whether the ratio of each frame is greater than or equal to a first threshold and, if so, treating that frame as a speech frame; and extracting, in both directions from the speech frame, the starting endpoint and the ending endpoint contained in the source speech signal. The method applies fractal dimension theory to endpoint detection: the ratio of each frame's short-time energy value to its fractal dimension value is compared with the first threshold to select the speech frames, and the starting and ending endpoints are then extracted in both directions from each speech frame. The method can therefore effectively extract endpoints from speech signals with a low signal-to-noise ratio.

Description of the drawings

To illustrate the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a speech endpoint detection method based on short-time energy and fractal dimension provided by the present invention;

Fig. 2 is a schematic diagram of endpoint detection on a clean speech waveform after babble noise is added, provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of endpoint detection on a clean speech waveform after pink noise is added, provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.

The core of the present invention is to provide a speech endpoint detection method based on short-time energy and fractal dimension.

To help those skilled in the art better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flow chart of a speech endpoint detection method based on short-time energy and fractal dimension provided by the present invention. As shown in Fig. 1, the method includes:

S10: Preprocess the source speech signal to obtain its individual frames.

The duration of the source speech signal is not restricted here. Before any calculation is performed on the source speech signal, it must first be preprocessed. The preprocessing here mainly divides the source speech signal into frames, but it is not limited to framing. In a specific implementation, the preprocessing may include pre-emphasis, framing and windowing. Pre-emphasis boosts the high-frequency part and removes the effect of lip radiation, while also attenuating the low-frequency part. Pre-emphasis also reduces the interference of the fundamental frequency with formant detection, which helps in detecting formants. Windowing provides a smooth transition between the signals obtained by framing; as a preferred embodiment, a Hamming window function is used in the windowing. It should be noted that other windowing approaches may also be used; the Hamming window is not the only option.

In this embodiment, the duration of each frame may be set to 10 ms to 30 ms, but is not limited to this range.
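Step S10 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess`, the first-order pre-emphasis filter and its coefficient `alpha = 0.97` are assumptions (the text gives no coefficient), and the default frame length and frame shift of 200 and 100 samples are taken from the experiment reported in this document. A trailing partial frame is simply dropped, which is also an assumption.

```python
import numpy as np

def preprocess(signal, frame_len=200, frame_shift=100, alpha=0.97):
    """Pre-emphasize, split into overlapping frames, apply a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Pre-emphasis (assumed form): y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing: frame k covers samples [k*shift, k*shift + frame_len)
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(num_frames)[:, None])
    frames = emphasized[idx]

    # Windowing: multiply every frame by a Hamming window
    return frames * np.hamming(frame_len)

# Usage with one second of synthetic noise at an assumed 8000 Hz rate
rng = np.random.default_rng(0)
frames = preprocess(rng.standard_normal(8000))
```

The result is a two-dimensional array with one windowed frame per row, which is a convenient input for the per-frame calculations in step S11.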

S11: Calculate the fractal dimension value of each frame using fractal dimension theory, and calculate the short-time energy value of each frame, to obtain the ratio of the short-time energy value to the fractal dimension value.

Fractal theory describes objective things using fractal dimension and mathematical methods. Its descriptions come closer to the true attributes and states of complex systems and better match the diversity and complexity of objective things. Because speech signals have fractal properties, fractal theory can be applied to endpoint detection. In a specific implementation, there are several kinds of fractal dimension, for example the box-counting dimension, the information dimension and the correlation dimension. The different dimensions are calculated differently. Because the correlation dimension reflects how the points in a set are distributed, its calculated results fluctuate relatively little. As a preferred embodiment, therefore, the fractal dimension is the correlation dimension, and the corresponding fractal dimension value is the correlation dimension value.

The formula for calculating the correlation dimension value is as follows:

where p is the probability that N_i points fall into a box of size δ, i is a positive integer, m is the index of a sampling point, N is the total number of frames, 1 ≤ m ≤ l, 1 ≤ i ≤ N, l is the length of each frame, x_i and x_j are the i-th and j-th reconstructed phase-space vectors, and H(d_{i,j}, p) is the Heaviside step function.
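A correlation dimension is commonly estimated with the Grassberger-Procaccia correlation sum, C(δ) = 2 / (n(n-1)) · Σ_{i<j} H(δ - ||x_i - x_j||), where n is the number of reconstructed phase-space vectors and H is the Heaviside step function; the dimension is then taken as the slope of ln C(δ) against ln δ over small δ. The sketch below follows that standard approach and is not necessarily the exact formula used in the patent. The function name `correlation_dimension`, the embedding dimension, the delay and the percentile-based choice of radii are all illustrative assumptions.

```python
import numpy as np

def correlation_dimension(frame, embed_dim=3, delay=1, num_radii=10):
    """Grassberger-Procaccia style slope estimate for a single frame."""
    frame = np.asarray(frame, dtype=float)
    n_vectors = len(frame) - (embed_dim - 1) * delay
    if n_vectors < 2:
        raise ValueError("frame too short for this embedding")

    # Phase-space reconstruction: row i is
    # (frame[i], frame[i+delay], ..., frame[i+(embed_dim-1)*delay])
    vectors = np.stack(
        [frame[k * delay: k * delay + n_vectors] for k in range(embed_dim)],
        axis=1,
    )

    # Euclidean distances between all distinct pairs of vectors
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    pair_dists = dists[np.triu_indices(n_vectors, k=1)]
    pair_dists = pair_dists[pair_dists > 0]
    if pair_dists.size == 0:
        raise ValueError("frame is constant")

    # Assumed heuristic: log-spaced radii between the 5th and 50th percentile
    r_lo, r_hi = np.percentile(pair_dists, [5, 50])
    if r_lo >= r_hi:
        raise ValueError("distance range too narrow to fit a slope")
    radii = np.logspace(np.log10(r_lo), np.log10(r_hi), num_radii)

    # Correlation sum: fraction of pairs closer than each radius
    corr_sum = np.array([np.mean(pair_dists <= r) for r in radii])

    # Slope of ln C(r) versus ln r by least squares
    slope, _ = np.polyfit(np.log(radii), np.log(corr_sum), 1)
    return float(slope)

# Usage: a pure tone traces a curve, noise fills the embedding space
t = np.arange(200) / 8000.0
d_tone = correlation_dimension(np.sin(2 * np.pi * 200 * t))
d_noise = correlation_dimension(np.random.default_rng(0).standard_normal(200))
```

With these inputs the noise frame yields a larger estimate than the tone, which matches the intuition that noise is less structured than a periodic signal.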

The formula for calculating the short-time energy is:

where y_i(m) is the energy value of the m-th sampling point in the i-th frame.
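Short-time energy is conventionally defined as the sum of the squared sample values of a frame, E_i = Σ_{m=1}^{l} y_i(m)^2. The sketch below uses that conventional definition and treats y_i(m) as the sample amplitude, which is an assumption about the patent's formula. The function names are hypothetical, and the text does not say whether the energy or the dimension is normalized before the ratio is compared with the thresholds.

```python
import numpy as np

def short_time_energy(frames):
    """Sum of squared samples for each frame (one frame per row)."""
    frames = np.asarray(frames, dtype=float)
    return np.sum(frames ** 2, axis=1)

def energy_to_dimension_ratio(energies, dimensions):
    """Per-frame ratio of short-time energy to fractal dimension."""
    energies = np.asarray(energies, dtype=float)
    dimensions = np.asarray(dimensions, dtype=float)
    return energies / dimensions

# Usage with two tiny hypothetical frames and made-up dimension values
energies = short_time_energy([[1.0, 2.0], [0.0, 3.0]])
ratios = energy_to_dimension_ratio(energies, [2.5, 3.0])
```

The ratio array has one entry per frame and is the quantity that steps S12 and S13 compare against the two thresholds.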

S12: Judge whether the ratio of each frame is greater than or equal to the first threshold. If so, proceed to step S13. A frame whose ratio is greater than or equal to the first threshold is a speech frame.

S13: Extract, in both directions from the speech frame, the starting endpoint and the ending endpoint contained in the source speech signal.

It can be understood that a ratio is obtained for every frame, and step S12 selects the frames whose ratio is greater than or equal to the first threshold. There may be several such frames, so the starting endpoint and ending endpoint corresponding to each of them must then be determined.

The two directions in this embodiment are the left side and the right side of a speech frame: the left side is the direction of the frames before the current speech frame, and the right side is the direction of the frames after it. Suppose, for example, that a speech signal has 10 frames, the first frame through the tenth frame. If the third frame and the eighth frame are speech frames, then the frame to the left of the third frame is the second frame and the frame to its right is the fourth frame.

The speech endpoint detection method based on short-time energy and fractal dimension provided by this embodiment comprises: preprocessing a source speech signal to obtain its individual frames; calculating the fractal dimension value of each frame using fractal dimension theory, and calculating the short-time energy value of each frame, to obtain the ratio of the short-time energy value to the fractal dimension value; judging whether the ratio of each frame is greater than or equal to a first threshold and, if so, treating that frame as a speech frame; and extracting, in both directions from the speech frame, the starting endpoint and the ending endpoint contained in the source speech signal. The method applies fractal dimension theory to endpoint detection: the ratio of each frame's short-time energy value to its fractal dimension value is compared with the first threshold to select the speech frames, and the starting and ending endpoints are then extracted in both directions from each speech frame. The method can therefore effectively extract endpoints from speech signals with a low signal-to-noise ratio.

As a preferred embodiment, extracting, in both directions from the speech frame, the starting endpoint and the ending endpoint contained in the source speech signal specifically comprises:

judging in turn whether the ratio of each frame to the left of the speech frame is less than a second threshold and, if not, continuing to the next frame until a frame whose ratio is less than the second threshold is found, and taking that frame as the starting endpoint;

judging in turn whether the ratio of each frame to the right of the speech frame is less than the second threshold and, if not, continuing to the next frame until a frame whose ratio is less than the second threshold is found, and taking that frame as the ending endpoint;

wherein the second threshold is less than the first threshold.

Taking the 10 frames above as an example again, suppose the third frame and the eighth frame are speech frames. For the third frame, the frames to its left are judged in turn. Because they are judged in turn, the first frame to the left of the third frame, namely the second frame, is judged first: if the ratio of the second frame is less than the second threshold, the second frame is the starting endpoint; otherwise the first frame is judged next. On the right side of the third frame, the ratio of the fourth frame is judged first; if it is not less than the second threshold, the ratio of the fifth frame is judged, and so on until a frame whose ratio is less than the second threshold is found. It can be understood that if no frame with a ratio less than the second threshold is found, the speech frame itself is the starting endpoint.

The eighth frame is handled in the same way as the third frame, which is not repeated in this embodiment.
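The threshold test of step S12 and the two-sided search of step S13 can be sketched as below. The function name `find_endpoints`, the 0-based frame indices and the merging of duplicate results are illustrative assumptions. The text states the fallback to the speech frame itself only for the starting endpoint; applying the same fallback to the ending endpoint is an assumption. The ratio values are made up so that the third and eighth frames are the speech frames, as in the 10-frame example.

```python
def find_endpoints(ratios, first_threshold=1.6, second_threshold=1.16):
    """Return sorted (start, end) frame-index pairs, 0-based."""
    segments = set()
    n = len(ratios)
    for k, ratio in enumerate(ratios):
        if ratio < first_threshold:
            continue  # S12: not a speech frame

        # S13, left side: first frame whose ratio drops below the second threshold
        start = k  # fallback stated in the text: the speech frame itself
        for j in range(k - 1, -1, -1):
            if ratios[j] < second_threshold:
                start = j
                break

        # S13, right side: same search in the other direction
        end = k  # assumed symmetric fallback
        for j in range(k + 1, n):
            if ratios[j] < second_threshold:
                end = j
                break

        segments.add((start, end))
    return sorted(segments)

# Hypothetical ratios for 10 frames; indices 2 and 7 are the third and eighth frames
ratios = [0.5, 1.2, 1.7, 1.3, 1.0, 0.9, 1.2, 1.8, 1.3, 0.8]
endpoints = find_endpoints(ratios)  # [(0, 4), (5, 9)]
```

For the third frame the search passes over the second frame (ratio 1.2, not below 1.16) and stops at the first frame, and on the right it stops at the fifth frame; the eighth frame is handled the same way.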

As a preferred embodiment, the first threshold is 1.6. As a preferred embodiment, the second threshold is 1.16.

It can be understood that the first threshold and the second threshold need to be set according to the specific situation; the values 1.6 and 1.16 chosen here are only one specific embodiment.

To verify the reliability of the endpoint detection method provided by the present invention, a simulation experiment was carried out. The experiment ran on a 32-bit PC with Windows 7, and the software was MATLAB R2013a. Other fractal dimensions, such as the box-counting dimension or the information dimension, could of course also be used. Because different fractal dimensions give slightly different running times and detection results, the experiment used the correlation dimension. Under the above environment the running time was about 5 to 10 minutes, slightly longer than that of traditional endpoint detection, but the noise robustness was good. The sampling frequency in the experiment was set to 8000 Hz, the frame length was 200 samples, the frame shift was 100 samples, and the window function was a Hamming window.
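The experimental settings can be converted into durations with a little arithmetic. The variable names below are illustrative; the three input values are the ones reported for the experiment.

```python
sample_rate_hz = 8000  # sampling frequency from the experiment
frame_len = 200        # frame length in samples
frame_shift = 100      # frame shift in samples

frame_ms = 1000 * frame_len / sample_rate_hz    # 25.0 ms per frame
shift_ms = 1000 * frame_shift / sample_rate_hz  # 12.5 ms between frame starts
overlap = 1 - frame_shift / frame_len           # 0.5, adjacent frames share half their samples
```

A 25 ms frame lies inside the 10 ms to 30 ms range given for the frame duration in this embodiment.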

Fig. 2 is a schematic diagram of endpoint detection on a clean speech waveform after babble noise is added, provided by an embodiment of the present invention. In Fig. 2, the solid line marks the starting endpoint and the dashed line marks the ending endpoint. As Fig. 2 shows, under the same conditions the endpoint detection method provided by the present invention can detect the endpoints in noise, and they coincide with the endpoint positions in the clean speech waveform, which indicates that the method is highly reliable.

The speech endpoint detection method based on short-time energy and fractal dimension provided by the present invention has been described in detail above. The embodiments in the specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method. It should be noted that those of ordinary skill in the art can make several improvements and modifications to the present invention without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims

1. A speech endpoint detection method based on short-time energy and fractal dimension, characterized by comprising:

calculating the fractal dimension value of each frame of the speech signal using fractal dimension theory, and calculating the short-time energy value of each frame, to obtain the ratio of the short-time energy value to the fractal dimension value;

judging whether the ratio of each frame is greater than or equal to a first threshold and, if so, treating the frame whose ratio is greater than or equal to the first threshold as a speech frame;

2. The speech endpoint detection method according to claim 1, characterized in that extracting, in both directions from the speech frame, the starting endpoint and the ending endpoint contained in the source speech signal specifically comprises:

judging in turn whether the ratio of each frame to the left of the speech frame is less than a second threshold and, if not, continuing to the next frame until a frame whose ratio is less than the second threshold is found, and taking that frame as the starting endpoint;

judging in turn whether the ratio of each frame to the right of the speech frame is less than the second threshold and, if not, continuing to the next frame until a frame whose ratio is less than the second threshold is found, and taking that frame as the ending endpoint;

wherein the second threshold is less than the first threshold.

3. The speech endpoint detection method according to claim 2, characterized in that the first threshold is 1.6.

4. The speech endpoint detection method according to claim 3, characterized in that the second threshold is 1.16.

5. The speech endpoint detection method according to claim 1, characterized in that the preprocessing includes pre-emphasis, framing and windowing.

6. The speech endpoint detection method according to claim 5, characterized in that a Hamming window function is used in the windowing.

7. The speech endpoint detection method according to claim 1, characterized in that the fractal dimension is the correlation dimension, in which case the corresponding fractal dimension value is the correlation dimension value.