CN112102851B - Voice endpoint detection method, device, equipment and computer readable storage medium - Google Patents

Voice endpoint detection method, device, equipment and computer readable storage medium

Info

Publication number
CN112102851B
CN112102851B (application number CN202011282116.0A)
Authority
CN
China
Prior art keywords
short
voice
frame
signal
time
Prior art date
Legal status
Active
Application number
CN202011282116.0A
Other languages
Chinese (zh)
Other versions
CN112102851A (en)
Inventor
赵沁
徐国强
Current Assignee
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202011282116.0A priority Critical patent/CN112102851B/en
Publication of CN112102851A publication Critical patent/CN112102851A/en
Application granted granted Critical
Publication of CN112102851B publication Critical patent/CN112102851B/en
Priority to PCT/CN2021/127184 priority patent/WO2022105570A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L25/78 — Detection of presence or absence of voice signals (G PHYSICS; G10 Musical instruments, acoustics; G10L Speech analysis or synthesis, speech recognition, speech or voice processing, speech or audio coding or decoding)
    • G10L25/87 — Detection of discrete points within a voice signal
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 — Extracted parameters being spectral information of each sub-band
    • G10L2025/783 — Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of voice signal processing, and discloses a voice endpoint detection method, device, equipment and computer readable storage medium. The method comprises the following steps: extracting the time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal; sequentially traversing each frequency domain spectrum signal, determining the current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal; detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal; and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer. The invention improves the accuracy of voice endpoint detection.

Description

Voice endpoint detection method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for detecting a speech endpoint.
Background
Voice endpoint detection is a front-end processing step that must require little computation and be able to output voice paragraphs in real time. Existing methods fall mainly into two categories: methods based on the statistical properties of the signal, and methods based on deep networks. The former have few parameters and high interpretability; the latter can, to some extent, handle speech segment detection under non-stationary noise, but their performance depends heavily on the training set, they require large amounts of training data, and they generalize poorly. Most real-time systems therefore adopt statistical methods, which mainly rely on the sub-band energy, zero-crossing rate, spectral characteristics and the like of the signal. However, parameters such as the detection threshold must be set in advance, while the voice signal in a real environment changes dynamically, so a fixed threshold performs poorly, is prone to a high false alarm rate, and cannot accurately detect the voice endpoints of the voice signal.
Disclosure of Invention
The invention mainly aims to provide a voice endpoint detection method, device, equipment and computer readable storage medium, with the aim of improving the accuracy of voice endpoint detection.
In order to achieve the above object, the present invention provides a voice endpoint detection method, including:
extracting time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal;
sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal;
and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer.
Optionally, the step of determining a speech segment endpoint of the speech signal according to all data frames in the speech frame buffer includes:
detecting whether the number of all data frames in the voice frame buffer is equal to a first preset value or not;
and if the value is equal to the first preset value, taking the first data frame in the voice frame buffer as a voice paragraph endpoint of the voice signal.
Optionally, the step of calculating the short-time energy-to-entropy ratio of the current data frame according to the current frequency-domain spectrum signal includes:
calculating density functions of all frequency components of the current data frame according to the current frequency domain spectrum signal, and calculating short-time spectrum entropy of the current data frame according to a preset short-time spectrum entropy calculation formula and each density function;
and acquiring the short-time energy of the current data frame, and calculating the short-time energy entropy ratio of the current data frame according to the short-time energy and the short-time spectrum entropy.
Optionally, the step of obtaining the short-time energy of the current data frame includes:
and determining the frame shift and the frame length of the current data frame according to the time domain signal of the current data frame, and calculating the short-time energy of the current data frame according to the frame shift and the frame length.
Optionally, the step of detecting whether the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal is preceded by:
acquiring the number of preset mute frames, calculating short-time energy entropy ratios corresponding to all the mute frames based on the number of the mute frames, and calculating a mean value of the short-time energy entropy ratios according to the short-time energy entropy ratios corresponding to all the mute frames;
and calculating an initial detection threshold value of the voice signal according to the short-time energy-entropy ratio mean value and a preset adjusting factor.
Optionally, the step of calculating an initial detection threshold of the speech signal according to the short-time energy-entropy ratio mean and a preset adjustment factor includes:
determining the maximum short-time energy entropy ratio in the short-time energy entropy ratios corresponding to the mute frames, and calculating the difference between the maximum short-time energy entropy ratio and the average value of the short-time energy entropy ratios;
and calculating a product between the short-time energy-entropy ratio mean value and a preset adjusting factor, and taking a sum value between the product and the difference value as an initial detection threshold value of the voice signal.
Optionally, the step of detecting whether the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal is followed by:
and if the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold value of the voice signal, moving the current data frame to a preset noise frame buffer, and determining a voice section endpoint of the voice signal according to all data frames in the noise frame buffer.
In addition, to achieve the above object, the present invention further provides a voice endpoint detection apparatus, including:
the extraction module is used for extracting time domain signals of all data frames in the voice signals collected in real time and converting the time domain signals into frequency domain spectrum signals;
the calculation module is used for sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to a traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
a detection module, configured to detect whether the short-time energy-entropy ratio is greater than an initial detection threshold of the speech signal;
and the determining module is used for moving the current data frame to a preset voice frame buffer and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal.
In addition, in order to achieve the above object, the present invention also provides a voice endpoint detection device;
the voice endpoint detection apparatus includes: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein:
the computer program, when executed by the processor, implements the steps of the voice endpoint detection method as described above.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium;
the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the voice endpoint detection method as described above.
The method comprises the steps of extracting the time domain signals of all data frames in a voice signal collected in real time, and converting each time domain signal into a frequency domain spectrum signal; sequentially traversing each frequency domain spectrum signal, determining the current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal; detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal; and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer. Because the time domain signals of all data frames in the voice signal collected in real time are converted into frequency domain spectrum signals, the short-time energy-entropy ratio is calculated from the current frequency domain spectrum signal of the traversed current data frame, and the current data frame is moved to the voice frame buffer to determine the voice paragraph endpoints when the short-time energy-entropy ratio is larger than the initial detection threshold, the situation in the prior art in which a dynamically changing voice signal in a real environment cannot be detected accurately is avoided, and the accuracy of voice endpoint detection is improved.
Drawings
FIG. 1 is a schematic diagram of a voice endpoint detection device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a voice endpoint detection method according to a first embodiment of the present invention;
fig. 3 is a functional block diagram of the voice endpoint detection apparatus according to the present invention.
The objects, features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a voice endpoint detection device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the voice endpoint detection apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the voice endpoint detection device may further include a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like. Such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display screen according to the brightness of ambient light. Of course, the voice endpoint detection device may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and so on, which are not described herein again.
Those skilled in the art will appreciate that the voice endpoint detection apparatus configuration shown in FIG. 1 does not constitute a limitation of voice endpoint detection apparatus and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a voice endpoint detection program.
In the voice endpoint detection apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the voice endpoint detection program stored in the memory 1005 and execute the voice endpoint detection method provided by the embodiment of the present invention.
Referring to fig. 2, the present invention provides a voice endpoint detection method. In an embodiment, the voice endpoint detection method includes the following steps:
step S10, extracting time domain signals of all data frames in the voice signals collected in real time, and converting each time domain signal into a frequency domain spectrum signal;
in this embodiment, the detection threshold is determined by calculating the short-term entropy ratio of the speech signal in real time, and the start-stop position of the speech paragraph is accurately detected according to the detection threshold, so as to facilitate the subsequent speech segmentation and speech recognition tasks. Therefore, the voice signal can be collected in real time through a voice collecting device, such as a microphone, and the time domain signal x (n) is extracted frame by frame from the voice signal collected in real time, that is, the time domain signal of each data frame is collected. And for smoothing the signal, the frame shift is set to be smaller than the frame length so as to calculate the short-time energy E of each data frameiWhere i represents time domain data of the ith frame. After the time domain signals of all the data frames are obtained, the time domain signals of all the data frames are subjected to short-time discrete Fourier transform to obtain a frequency domain spectrum signal Y of each data framei. The short-time discrete fourier transform of the time domain signal is performed in a conventional manner, and is not described here.
Step S20, sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
after the frequency domain spectrum signals of each data frame of the speech signal are acquired, the same operation needs to be performed on all the data frames to determine the speech paragraphs. Therefore, each data frame can be traversed in sequence, and the traversed data frame, that is, the frequency domain spectrum signal corresponding to the current data frame is determined and is used as the current frequency domain spectrum signal. And calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal. Since the spectral energy is calculated band by band for one frame of data, the normalized spectral probability density function of the kth frequency component of the ith frame of data is calculated as Pi(k)=Yi(k)/ΣmYi(m), wherein m =1,2FFTA/2, wherein NFFTIs the fast fourier transform length.
Therefore, when calculating the short-time energy-entropy ratio of the current data frame, the normalized spectral probability density function Pi(k) = Yi(k) / Σ_m Yi(m) can be used to obtain the density functions of all frequency components of the current data frame, and the short-time spectral entropy of the current data frame is computed from these density functions as Hi = −Σ_k Pi(k)·log(Pi(k)). The short-time energy-entropy ratio of the current data frame is then calculated from the short-time energy Ei and the short-time spectral entropy Hi according to the short-time energy-entropy ratio calculation formula (given as an equation image in the original publication), in which α is an adjustment factor.
In this scheme, after the frequency domain spectrum signal of each data frame of the speech signal is acquired, an empirical constraint can be added: the speech spectrum mainly lies in the [100 Hz, 3500 Hz] interval, whereas the noise spectrum spans the full frequency band. Therefore, to better distinguish speech from noise, only the data in the [100 Hz, 3500 Hz] interval need be considered when calculating the spectral energy band by band, i.e., only the components between 100 Hz and 3500 Hz of each data frame are used.
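A sketch of the per-frame spectral probability density, short-time spectral entropy, and band-limited energy-entropy ratio described above. Since the patent gives its exact energy-entropy ratio formula only as an equation image, the sqrt(1 + |E/H|) combination used below is an assumed stand-in for illustration, and the function name and parameters are likewise hypothetical.

```python
import numpy as np

def short_time_energy_entropy_ratio(Y, E, sample_rate=16000, n_fft=512):
    """Y: magnitude spectrum of one frame (length n_fft // 2 + 1); E: its short-time energy."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    band = (freqs >= 100.0) & (freqs <= 3500.0)    # empirical speech band [100 Hz, 3500 Hz]
    Yb = Y[band]
    P = Yb / (np.sum(Yb) + 1e-10)                  # normalized spectral probability density P_i(k)
    H = -np.sum(P * np.log(P + 1e-10))             # short-time spectral entropy H_i
    # The patent gives its exact E_i / H_i combination only as an equation image;
    # sqrt(1 + |E / H|) is a commonly used energy-entropy ratio form, used here as a stand-in.
    return np.sqrt(1.0 + np.abs(E / (H + 1e-10)))
```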
The short-time energy may be obtained as follows: denote the time domain signal as x(n), apply a window function w(n), and obtain the i-th frame speech signal after framing as yi(n), where yi(n) satisfies:
yi(n)=w(n)*x((i-1)*inc+n),
1≤n≤L,1≤i≤fn
where w(n) is a window function, typically a rectangular or Hamming window, yi(n) is the i-th frame of speech, n = 1, 2, …, L, i = 1, 2, …, fn, L is the frame length, inc is the frame shift, and fn is the total number of frames after framing. The short-time energy of the i-th frame speech signal yi(n) is:
Ei = Σ_{n=1}^{L} yi(n)²
After the frame shift and the frame length are obtained, the short-time energy of the current data frame can be calculated according to the short-time energy formula.
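A sketch of the windowed framing and short-time energy computation Ei = Σ_n yi(n)² just described; the Hamming window default and the helper name are assumptions, since the patent only notes that a rectangular or Hamming window is typical.

```python
import numpy as np

def short_time_energy(x, frame_len, frame_shift, window=None):
    """Compute E_i = sum_{n=1..L} y_i(n)^2 with y_i(n) = w(n) * x((i-1)*inc + n)."""
    if window is None:
        window = np.hamming(frame_len)             # w(n); a rectangular window also works
    n_frames = max(0, (len(x) - frame_len) // frame_shift + 1)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        yi = window * x[i * frame_shift : i * frame_shift + frame_len]
        energies[i] = np.sum(yi ** 2)              # short-time energy E_i of frame i
    return energies
```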
Step S30, detecting whether the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal;
after the short-time energy-entropy ratio of the current data frame is obtained through calculation, the current frame can be judged according to the short-time energy-entropy ratio, and according to signal characteristics, the short-time energy-entropy ratio of voice is larger than the energy-entropy ratio of noise, so that whether the current frame belongs to noise or voice is determined. Therefore, it is also necessary to obtain an initial detection threshold of the speech signal, so as to detect whether the short-time energy-entropy ratio of the current data frame is greater than the initial detection value, and perform different operations according to different detection results. Therefore, it is necessary to set an initial detection threshold of the voice signal in advance, and the initial detection threshold may be T0The equation of = α × μ + Φ, where α is an adjustment factor, μ is a mean short-time entropy ratio of a mute frame, and Φ = max (EH)1-N)—μ,T0For the initial detection threshold, N is the number of silent frames.
The initial detection threshold of the speech signal may be obtained in advance as follows: obtain the preset number of mute frames, calculate the short-time energy-entropy ratio of each mute frame according to the mute frame data and the short-time energy-entropy ratio calculation formula, and average these ratios to obtain the mean short-time energy-entropy ratio.
At the same time, the short-time energy-entropy ratios of the mute frames are compared to obtain the largest one, i.e., the maximum short-time energy-entropy ratio, and the difference between this maximum and the mean short-time energy-entropy ratio is calculated. Then, the product of the mean short-time energy-entropy ratio and the adjustment factor set in advance by the user is computed, and the sum of this product and the difference is taken as the initial detection threshold of the speech signal, i.e., T0 = α × μ + Φ, where α is the adjustment factor, μ is the mean short-time energy-entropy ratio of the mute frames, and Φ = max(EH_1, …, EH_N) − μ.
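A sketch of this threshold initialization. It assumes the energy-entropy ratios of the N leading mute (silence) frames have already been computed, and the default value of α is only illustrative, since the adjustment factor is left to the user in the patent.

```python
def initial_detection_threshold(silence_ratios, alpha=0.5):
    """silence_ratios: short-time energy-entropy ratios EH_1..EH_N of the N mute frames."""
    mu = sum(silence_ratios) / len(silence_ratios)   # mean energy-entropy ratio of mute frames
    phi = max(silence_ratios) - mu                   # phi = max(EH_1..N) - mu
    return alpha * mu + phi                          # T0 = alpha * mu + phi
```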
When the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold of the speech signal, the current data frame can be moved to a noise frame buffer set in advance, and the same detection is performed on the next data frame to decide whether to move it into the noise frame buffer, until every data frame of the speech signal has been examined. When the number of data frames in the noise frame buffer is greater than a predetermined value (i.e., the second preset value), such as L0, it can be determined that noise is present in the speech signal and that the speech paragraph in the signal has ended and entered a noise segment; the end point of the speech paragraph, i.e., the beginning of the noise segment, is the first frame in the noise frame buffer.
Step S40, if the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal, moving the current data frame to a preset speech frame buffer, and determining a speech paragraph endpoint of the speech signal according to all data frames in the speech frame buffer.
When the short-time energy-entropy ratio is larger than the initial detection threshold of the speech signal, the current data frame can be moved to a speech frame buffer set in advance, and the same detection is performed on the next data frame to decide whether to move it into the speech frame buffer, until every data frame of the speech signal has been examined. In this embodiment, in order to eliminate burst noise in a real environment, two buffers may be provided, a speech frame buffer and a noise frame buffer; the length of the speech frame buffer is set to L1 and the length of the noise frame buffer to L0, where L0 > N and N is the number of mute frames.
When the number of data frames in the speech frame buffer is greater than a predetermined value (i.e., the first preset value), such as L1, it can be determined that speech is present in the speech signal and that the start position of the speech paragraph is the first frame in the speech frame buffer. If the short-time energy-entropy ratio is less than or equal to the initial detection threshold, the current data frame is moved to the preset noise frame buffer, and when the noise frame buffer is full the end point of the speech paragraph is set to the first frame of the noise frame buffer. In other words, the speech paragraph endpoints of the speech signal are the first frame (first data frame) in the speech frame buffer and the first frame (first data frame) in the noise frame buffer.
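A sketch of the double-buffer endpoint decision just described. The buffer lengths L1 and L0 used as defaults are illustrative, and the way each buffer is cleared when the opposite frame type arrives is an assumption the patent does not spell out.

```python
def detect_endpoints(ratios, t0, l1=10, l0=15):
    """ratios: per-frame short-time energy-entropy ratios; t0: initial detection threshold.
    Returns (start_frame, end_frame) pairs marking detected speech paragraphs."""
    speech_buf, noise_buf, endpoints, start = [], [], [], None
    for idx, eh in enumerate(ratios):
        if eh > t0:                                  # speech-like frame
            speech_buf.append(idx)
            noise_buf.clear()
            if start is None and len(speech_buf) == l1:
                start = speech_buf[0]                # paragraph start: first frame in speech buffer
        else:                                        # noise-like frame
            noise_buf.append(idx)
            speech_buf.clear()
            if start is not None and len(noise_buf) == l0:
                endpoints.append((start, noise_buf[0]))  # paragraph end: first frame in noise buffer
                start = None
    return endpoints
```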
In this embodiment, the time domain signals of all data frames in a speech signal collected in real time are extracted, and each time domain signal is converted into a frequency domain spectrum signal; each frequency domain spectrum signal is traversed in sequence, the current frequency domain spectrum signal corresponding to the traversed current data frame is determined, and the short-time energy-entropy ratio of the current data frame is calculated according to the current frequency domain spectrum signal; whether the short-time energy-entropy ratio is larger than the initial detection threshold of the speech signal is detected; and if the short-time energy-entropy ratio is larger than the initial detection threshold, the current data frame is moved to a preset speech frame buffer, and a speech paragraph endpoint of the speech signal is determined according to all data frames in the speech frame buffer. Because the time domain signals of all data frames in the speech signal collected in real time are converted into frequency domain spectrum signals, the short-time energy-entropy ratio is calculated from the current frequency domain spectrum signal of the traversed current data frame, and the current data frame is moved to the speech frame buffer to determine the speech paragraph endpoints when the short-time energy-entropy ratio is larger than the initial detection threshold, the situation in the prior art in which a dynamically changing speech signal in a real environment cannot be detected accurately is avoided, and the accuracy of voice endpoint detection is improved.
Further, on the basis of the first embodiment of the present invention, a second embodiment of the voice endpoint detection method of the present invention is provided. This embodiment refines step S40 of the first embodiment, in which the step of determining the voice paragraph endpoint of the voice signal according to all the data frames in the voice frame buffer includes:
m, detecting whether the number of all data frames in the voice frame buffer is equal to a first preset value or not;
in this embodiment, after the current data frame is moved to the speech frame buffer, the existing data frame in the speech buffer is detected, i.e. the speech frame buffer is detectedAnd whether the number of all data frames in the buffer is equal to a first preset value set in advance or not, and executing different operations according to different detection results. Wherein, the first preset value can be the length of the voice frame buffer set in advance
Figure 878194DEST_PATH_IMAGE003
And n, if the value is equal to the first preset value, taking the first data frame in the voice frame buffer as a voice paragraph endpoint of the voice signal.
When the number of all the data frames in the speech frame buffer is detected to be equal to the first preset value, the first data frame in the speech frame buffer can be used as a speech paragraph endpoint of the speech signal, namely the start position of the speech paragraph. If the number of all the data frames in the speech frame buffer is smaller than the first preset value, the next data frame continues to be traversed.
In this embodiment, when the number of all data frames in the speech frame buffer is equal to the first preset value, the first data frame in the speech frame buffer is used as the speech paragraph endpoint of the speech signal, so that the accuracy of the calculated speech paragraph endpoint is ensured.
Further, the step of calculating the short-time energy-entropy ratio of the current data frame according to the current frequency-domain spectrum signal includes:
step a, calculating density functions of all frequency components of the current data frame according to the current frequency domain spectrum signal, and calculating short-time spectrum entropy of the current data frame according to a preset short-time spectrum entropy calculation formula and each density function;
in this embodiment, when calculating the short-term energy-entropy ratio of the current data frame, the density function of all frequency components in the current data frame may be calculated according to the current frequency-domain spectrum signal, for example, the normalized spectrum probability density function (i.e. density function) of the kth frequency component of the ith data frame is calculated as Pi(k)=Yi(k)/ΣmYi(m), wherein m =1,2FFT/2 wherein N isFFTIs a fast fourier transformThe length of the transform. And after the density functions of all frequency components are obtained through calculation, the short-time spectral entropy of the current data frame can be calculated according to a preset short-time spectral entropy calculation formula and each density function. Wherein the preset short-time spectrum entropy calculation formula can be Hi = —ΣkPi(k)log(Pi(k))。
And b, acquiring the short-time energy of the current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the short-time energy and the short-time spectrum entropy.
After the short-time spectral entropy is obtained, the short-time energy Ei of the current data frame is acquired, and the short-time energy Ei and the short-time spectral entropy Hi are substituted into the short-time energy-entropy ratio calculation formula (given as an equation image in the original publication), in which α is an adjustment factor, to calculate the short-time energy-entropy ratio of the current data frame.
In this embodiment, all density functions are determined according to the current frequency domain spectrum signal, the short-time spectrum entropy of the current data frame is calculated according to the preset short-time spectrum entropy calculation formula and each density function, and the short-time energy entropy ratio of the current data frame is calculated according to the short-time spectrum entropy and the short-time energy, so that the accuracy of the calculated short-time energy entropy ratio is guaranteed.
Specifically, the step of obtaining the short-time energy of the current data frame includes:
and c, determining the frame shift and the frame length of the current data frame according to the time domain signal of the current data frame, and calculating the short-time energy of the current data frame according to the frame shift and the frame length.
In this embodiment, the short-time energy may be obtained as follows: denote the time domain signal as x(n), apply a window function w(n), and obtain the i-th frame speech signal after framing as yi(n), where yi(n) satisfies:
yi(n)=w(n)*x((i-1)*inc+n),
1≤n≤L,1≤i≤fn
where w(n) is a window function, typically a rectangular or Hamming window, yi(n) is the i-th frame of speech, n = 1, 2, …, L, i = 1, 2, …, fn, L is the frame length, inc is the frame shift, and fn is the total number of frames after framing. The short-time energy of the i-th frame speech signal yi(n) is:
Ei = Σ_{n=1}^{L} yi(n)²
After the frame shift and the frame length are obtained, the short-time energy of the current data frame can be calculated according to the short-time energy formula.
In the embodiment, the frame shift and the frame length of the current data frame are determined according to the time domain signal of the current data frame, so that the short-time energy of the current data frame is calculated, and the accuracy of the calculated short-time energy is guaranteed.
Further, the step of detecting whether the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal is preceded by:
step d, acquiring the number of preset mute frames, calculating the short-time energy entropy ratios corresponding to all the mute frames based on the number of the mute frames, and calculating the average value of the short-time energy entropy ratios according to the short-time energy entropy ratios corresponding to all the mute frames;
in this embodiment, the number of the mute frames set in advance, that is, the number of the mute frames, needs to be obtained, the short-time energy entropy ratios corresponding to all the mute frames are calculated according to the mute frame data and the short-time energy entropy ratio calculation formula, and the average value of each short-time energy entropy ratio is calculated to obtain the average value of the short-time energy entropy ratios.
And e, calculating an initial detection threshold value of the voice signal according to the short-time energy-entropy ratio mean value and a preset adjusting factor.
At the same time, the short-time energy-entropy ratios of the mute frames are compared to obtain the maximum short-time energy-entropy ratio, and the difference between this maximum and the mean short-time energy-entropy ratio is calculated. Then, the product of the mean short-time energy-entropy ratio and the adjustment factor set in advance by the user is computed, and the sum of this product and the difference is used as the initial detection threshold of the speech signal, i.e., T0 = α × μ + Φ, where α is the adjustment factor, μ is the mean short-time energy-entropy ratio of the mute frames, and Φ = max(EH_1, …, EH_N) − μ.
In this embodiment, the mean short-time energy-entropy ratio is calculated from the preset number of mute frames, and the initial detection threshold of the speech signal is then calculated according to the preset adjustment factor and this mean, so that the accuracy of the calculated initial detection threshold is ensured.
Further, the step of calculating the initial detection threshold of the speech signal according to the short-time energy-entropy ratio mean value and a preset adjustment factor includes:
step f, determining the maximum short-time energy entropy ratio in the short-time energy entropy ratios corresponding to the mute frames, and calculating the difference between the maximum short-time energy entropy ratio and the average value of the short-time energy entropy ratios;
in this embodiment, when calculating the initial detection threshold, the short-time energy-entropy ratios corresponding to the mute frames need to be sequentially compared, the maximum short-time energy-entropy ratio with the largest value is obtained according to the comparison result, and then the average value of the short-time energy-entropy ratios is subtracted from the maximum short-time energy-entropy ratio to obtain the difference value between the maximum short-time energy-entropy ratio and the average value of the short-time energy-entropy ratios.
And g, calculating a product between the short-time energy-entropy ratio mean value and a preset adjusting factor, and taking a sum value between the product and the difference value as an initial detection threshold value of the voice signal.
The adjustment factor set by the user in advance is acquired, the mean short-time energy-entropy ratio is multiplied by the adjustment factor to obtain their product, the difference between the previously obtained maximum short-time energy-entropy ratio and the mean short-time energy-entropy ratio is taken, and the product and the difference are added; the resulting sum is used as the initial detection threshold of the speech signal, i.e., T0 = α × μ + Φ, where α is the adjustment factor, μ is the mean short-time energy-entropy ratio of the mute frames, and Φ = max(EH_1, …, EH_N) − μ.
In this embodiment, the difference between the maximum short-time energy-entropy ratio and the mean value of the short-time energy-entropy ratio corresponding to each mute frame is calculated, the product of the mean value of the short-time energy-entropy ratio and the adjustment factor is calculated, and the sum of the product and the difference value is used as the initial detection threshold, so that the accuracy of the initial detection threshold obtained by calculation is guaranteed.
Further, the step of detecting whether the short-time energy-entropy ratio is larger than the initial detection threshold of the speech signal is followed by:
and h, if the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold of the voice signal, moving the current data frame to a preset noise frame buffer, and determining a voice section endpoint of the voice signal according to all data frames in the dry frame buffer.
When the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold of the speech signal, the current data frame can be moved to the noise frame buffer set in advance, and the same detection is performed on the next data frame to decide whether to move it into the noise frame buffer, until every data frame of the speech signal has been examined. When the number of data frames in the noise frame buffer is greater than a predetermined value (i.e., the second preset value), such as L0, it can be determined that noise is present in the speech signal and that the speech paragraph in the signal has ended and entered a noise segment; the end point of the speech paragraph, i.e., the beginning of the noise segment, is the first frame in the noise frame buffer.
In this embodiment, when it is determined that the short-time energy-entropy ratio is less than or equal to the initial detection threshold, the current data frame is moved to the noise frame buffer, and the end point of the speech paragraph is determined according to all the data frames in the noise frame buffer, thereby ensuring the accuracy of the calculated speech paragraph endpoint.
In addition, referring to fig. 3, an embodiment of the present invention further provides a voice endpoint detection apparatus, where the voice endpoint detection apparatus includes:
the extraction module A10 is used for extracting time domain signals of all data frames in the voice signals collected in real time and converting the time domain signals into frequency domain spectrum signals;
a calculating module A20, configured to sequentially traverse each frequency domain spectrum signal, determine the current frequency domain spectrum signal corresponding to the traversed current data frame, and calculate the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
a detecting module A30, configured to detect whether the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal;
a determining module A40, configured to, if the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal, move the current data frame to a preset speech frame buffer, and determine a speech paragraph endpoint of the speech signal according to all data frames in the speech frame buffer.
Optionally, the determining module a40 is further configured to:
detecting whether the number of all data frames in the voice frame buffer is equal to a first preset value or not;
and if the value is equal to the first preset value, taking the first data frame in the voice frame buffer as a voice paragraph endpoint of the voice signal.
Optionally, a calculating module a20, configured to:
calculating density functions of all frequency components of the current data frame according to the current frequency domain spectrum signal, and calculating short-time spectrum entropy of the current data frame according to a preset short-time spectrum entropy calculation formula and each density function;
and acquiring the short-time energy of the current data frame, and calculating the short-time energy entropy ratio of the current data frame according to the short-time energy and the short-time spectrum entropy.
Optionally, a calculating module a20, configured to:
and determining the frame shift and the frame length of the current data frame according to the time domain signal of the current data frame, and calculating the short-time energy of the current data frame according to the frame shift and the frame length.
Optionally, the detection module a30 is further configured to:
acquiring the number of preset mute frames, calculating short-time energy entropy ratios corresponding to all the mute frames based on the number of the mute frames, and calculating a mean value of the short-time energy entropy ratios according to the short-time energy entropy ratios corresponding to all the mute frames;
and calculating an initial detection threshold value of the voice signal according to the short-time energy-entropy ratio mean value and a preset adjusting factor.
Optionally, the detection module a30 is further configured to:
determining the maximum short-time energy entropy ratio in the short-time energy entropy ratios corresponding to the mute frames, and calculating the difference between the maximum short-time energy entropy ratio and the average value of the short-time energy entropy ratios;
and calculating a product between the short-time energy-entropy ratio mean value and a preset adjusting factor, and taking a sum value between the product and the difference value as an initial detection threshold value of the voice signal.
Optionally, the detection module a30 is further configured to:
and if the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold value of the voice signal, moving the current data frame to a preset noise frame buffer, and determining a voice section endpoint of the voice signal according to all data frames in the noise frame buffer.
The steps implemented by each functional module of the voice endpoint detection apparatus may refer to each embodiment of the voice endpoint detection method of the present invention, and are not described herein again.
The present invention also provides a voice endpoint detection apparatus, including: a memory, a processor, and a voice endpoint detection program stored on the memory; the processor is configured to execute the voice endpoint detection program to implement the following steps:
extracting time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal;
sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal;
and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer.
The present invention also provides a computer readable storage medium storing one or more programs, the one or more programs being further executable by one or more processors for performing the steps of:
extracting time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal;
sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal;
and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the voice endpoint detection method described above, and is not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (7)

1. A voice endpoint detection method is characterized by comprising the following steps:
extracting time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal;
sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to a traversed current data frame, and calculating a short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal, wherein the short-time energy in the [100 Hz, 3500 Hz] interval of the current data frame is calculated, and the short-time energy-entropy ratio of the current data frame is calculated according to the short-time energy and the current frequency domain spectrum signal, wherein yi(n) = w(n)·x((i−1)×inc+n), 1 ≤ n ≤ L, 1 ≤ i ≤ fn, w(n) is a window function, L is the frame length, inc is the frame shift, fn is the total number of frames after framing, yi(n) is the i-th frame speech signal, and x(n) is the time domain signal; the short-time energy of the i-th frame speech signal yi(n) is:
Ei = Σ_{n=1}^{L} yi(n)²;
detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal, wherein the initial detection threshold is calculated according to T0 = α × μ + Φ, where α is an adjustment factor, μ is the mean value of the short-time energy-entropy ratios of the mute frames, Φ = max(EH_1, …, EH_N) − μ, T0 is the initial detection threshold, N is the number of mute frames, and max(EH_1, …, EH_N) is the maximum short-time energy-entropy ratio;
if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph end point of the voice signal according to all data frames in the voice frame buffer, wherein if the number of data frames in the voice frame buffer is equal to a first preset value, determining that voice exists in the voice signal, and determining that the starting position of a voice paragraph in the voice signal is a first frame in the voice frame buffer;
and if the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold, moving the current data frame to a preset noise frame buffer, and when the number of the data frames of all the data frames in the noise frame buffer is larger than a second preset value in the noise frame buffer, determining that noise exists in the voice signal, and determining that a first frame in the noise frame buffer is the position of the termination point of the voice paragraph.
2. The voice endpoint detection method according to claim 1, wherein the step of calculating the short-time energy-entropy ratio of the current data frame from the current frequency-domain spectral signal comprises:
calculating density functions of all frequency components of the current data frame according to the current frequency domain spectrum signal, and calculating short-time spectrum entropy of the current data frame according to a preset short-time spectrum entropy calculation formula and each density function;
and acquiring the short-time energy of the current data frame, and calculating the short-time energy entropy ratio of the current data frame according to the short-time energy and the short-time spectrum entropy.
3. The voice endpoint detection method of claim 2, wherein the step of obtaining the short-time energy of the current data frame comprises:
and determining the frame shift and the frame length of the current data frame according to the time domain signal of the current data frame, and calculating the short-time energy of the current data frame according to the frame shift and the frame length.
4. The voice endpoint detection method of claim 1, wherein the step of detecting whether the short-time energy-to-entropy ratio is greater than an initial detection threshold for the voice signal is preceded by:
acquiring the number of preset mute frames, calculating the short-time energy entropy ratios corresponding to all the mute frames based on the number of the mute frames, and calculating the average value of the short-time energy entropy ratios according to the short-time energy entropy ratios corresponding to all the mute frames.
5. A voice endpoint detection apparatus, the voice endpoint detection apparatus comprising:
the extraction module is used for extracting time domain signals of all data frames in the voice signals collected in real time and converting the time domain signals into frequency domain spectrum signals;
a calculating module, configured to sequentially traverse each frequency domain spectrum signal, determine a current frequency domain spectrum signal corresponding to a traversed current data frame, and calculate a short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal, where the short-time energy of a [100HZ,3500HZ ] interval in the current data frame is calculated, and the short-time energy-entropy ratio of the current data frame is calculated according to the short-time energy and the current frequency spectrum signal, where yi (n) = w (n) x ((i-1) × inc + n), n is greater than or equal to 1 and less than or equal to L, i is greater than or equal to 1 and less than or equal to fn, w (n) is a window function, L is a frame length, inc is a frame shift, fn is a total frame number after framing, yi (n) is an i-th frame speech signal, and x (n) is a time domain signal; the short-time energy of the ith frame speech signal yi (n) is:
E(i) = Σ_{n=1}^{L} y_i²(n), 1 ≤ i ≤ fn;
a detection module, configured to detect whether the short-time energy-entropy ratio is greater than an initial detection threshold of the voice signal, wherein the initial detection threshold is calculated as T0 = α × μ + Φ, where α is an adjustment factor, μ is the mean value of the short-time energy-entropy ratios of the mute frames, Φ = max(EH_{1..N}) − μ, T0 is the initial detection threshold, N is the number of mute frames, and max(EH_{1..N}) is the maximum short-time energy-entropy ratio of the mute frames;
a determining module, configured to: if the short-time energy-entropy ratio is greater than the initial detection threshold of the voice signal, move the current data frame into a preset voice frame buffer and determine a voice segment endpoint of the voice signal according to all data frames in the voice frame buffer, wherein if the number of data frames in the voice frame buffer is equal to a first preset value, it is determined that voice exists in the voice signal and that the start position of a voice segment in the voice signal is the first frame in the voice frame buffer;
and if the short-time energy-entropy ratio is less than or equal to the initial detection threshold, move the current data frame into a preset noise frame buffer, and when the number of data frames in the noise frame buffer is greater than a second preset value, determine that noise exists in the voice signal and that the first frame in the noise frame buffer is the termination point of the voice segment.
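Claim 5 restricts the short-time energy to the [100 Hz, 3500 Hz] interval of the current frame. One way this band-limited energy might be computed is sketched below; the 16 kHz sampling rate, the use of a real FFT, and the inclusive band edges are assumptions rather than details taken from the claim.

import numpy as np

def band_energy(frame, sample_rate=16000, f_lo=100.0, f_hi=3500.0):
    """Sketch: short-time energy of the [100 Hz, 3500 Hz] band of one frame."""
    spectrum = np.fft.rfft(frame)                              # frequency-domain spectrum signal
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)   # frequency of each component
    band = (freqs >= f_lo) & (freqs <= f_hi)                   # components inside the band
    return np.sum(np.abs(spectrum[band]) ** 2)                 # energy of the band-limited spectrum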
6. A voice endpoint detection device, comprising: a memory, a processor, and a voice endpoint detection program stored in the memory and executable on the processor, wherein the voice endpoint detection program, when executed by the processor, implements the steps of the voice endpoint detection method according to any one of claims 1 to 4.
7. A computer-readable storage medium having a voice endpoint detection program stored thereon, wherein the voice endpoint detection program, when executed by a processor, implements the steps of the voice endpoint detection method according to any one of claims 1 to 4.
CN202011282116.0A 2020-11-17 2020-11-17 Voice endpoint detection method, device, equipment and computer readable storage medium Active CN112102851B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011282116.0A CN112102851B (en) 2020-11-17 2020-11-17 Voice endpoint detection method, device, equipment and computer readable storage medium
PCT/CN2021/127184 WO2022105570A1 (en) 2020-11-17 2021-10-28 Speech endpoint detection method, apparatus and device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011282116.0A CN112102851B (en) 2020-11-17 2020-11-17 Voice endpoint detection method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112102851A (en) 2020-12-18
CN112102851B (en) 2021-04-13

Family

ID=73785690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011282116.0A Active CN112102851B (en) 2020-11-17 2020-11-17 Voice endpoint detection method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112102851B (en)
WO (1) WO2022105570A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102851B (en) * 2020-11-17 2021-04-13 深圳壹账通智能科技有限公司 Voice endpoint detection method, device, equipment and computer readable storage medium
CN113613159B (en) * 2021-08-20 2023-07-21 贝壳找房(北京)科技有限公司 Microphone blowing signal detection method, device and system
CN114582354A (en) * 2022-05-06 2022-06-03 深圳市长丰影像器材有限公司 Voice control method, device and equipment based on voiceprint recognition and storage medium
CN116665717B (en) * 2023-08-02 2023-09-29 广东技术师范大学 Cross-subband spectral entropy weighted likelihood ratio voice detection method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012215600A (en) * 2011-03-31 2012-11-08 Oki Electric Ind Co Ltd Voice section determination device, voice section determination method, and program
CN106653062A (en) * 2017-02-17 2017-05-10 重庆邮电大学 Spectrum-entropy improvement based speech endpoint detection method in low signal-to-noise ratio environment
CN106816157A (en) * 2015-11-30 2017-06-09 展讯通信(上海)有限公司 Audio recognition method and device
CN107910017A (en) * 2017-12-19 2018-04-13 河海大学 A kind of method that threshold value is set in noisy speech end-point detection
CN109412763A (en) * 2018-11-15 2019-03-01 电子科技大学 A kind of digital signal Detection of Existence method based on signal energy entropy ratio
CN111179975A (en) * 2020-04-14 2020-05-19 深圳壹账通智能科技有限公司 Voice endpoint detection method for emotion recognition, electronic device and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107731223B (en) * 2017-11-22 2022-07-26 腾讯科技(深圳)有限公司 Voice activity detection method, related device and equipment
CN108428456A (en) * 2018-03-29 2018-08-21 浙江凯池电子科技有限公司 Voice de-noising algorithm
US11062727B2 (en) * 2018-06-13 2021-07-13 Ceva D.S.P Ltd. System and method for voice activity detection
CN108962285B (en) * 2018-07-20 2023-04-14 浙江万里学院 Voice endpoint detection method for dividing sub-bands based on human ear masking effect
CN109087632B (en) * 2018-08-17 2023-06-06 平安科技(深圳)有限公司 Speech processing method, device, computer equipment and storage medium
CN111755028A (en) * 2020-07-03 2020-10-09 四川长虹电器股份有限公司 Near-field remote controller voice endpoint detection method and system based on fundamental tone characteristics
CN112102851B (en) * 2020-11-17 2021-04-13 深圳壹账通智能科技有限公司 Voice endpoint detection method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112102851A (en) 2020-12-18
WO2022105570A1 (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN112102851B (en) Voice endpoint detection method, device, equipment and computer readable storage medium
US20210327448A1 (en) Speech noise reduction method and apparatus, computing device, and computer-readable storage medium
CN107004409B (en) Neural network voice activity detection using run range normalization
WO2019101123A1 (en) Voice activity detection method, related device, and apparatus
US10014005B2 (en) Harmonicity estimation, audio classification, pitch determination and noise estimation
CN113766073B (en) Howling detection in conference systems
US8520861B2 (en) Signal processing system for tonal noise robustness
CN110853664B (en) Method and device for evaluating performance of speech enhancement algorithm and electronic equipment
JP2012128411A (en) Voice determination device and voice determination method
JP2016524724A (en) Method and system for controlling a home electrical appliance by identifying a position associated with a voice command in a home environment
CN111968662A (en) Audio signal processing method and device and storage medium
WO2012175054A1 (en) Method and device for detecting fundamental tone
CN111028845A (en) Multi-audio recognition method, device, equipment and readable storage medium
CN111292758B (en) Voice activity detection method and device and readable storage medium
CN110970051A (en) Voice data acquisition method, terminal and readable storage medium
US20150019222A1 (en) Method for using voiceprint identification to operate voice recognition and electronic device thereof
US20210264940A1 (en) Position detection method, apparatus, electronic device and computer readable storage medium
CN110556128B (en) Voice activity detection method and device and computer readable storage medium
JP6314475B2 (en) Audio signal processing apparatus and program
CN115995234A (en) Audio noise reduction method and device, electronic equipment and readable storage medium
JP2016080767A (en) Frequency component extraction device, frequency component extraction method and frequency component extraction program
TWI756817B (en) Voice activity detection device and method
CN116364106A (en) Voice detection method, device, terminal equipment and storage medium
US20230253010A1 (en) Voice activity detection (vad) based on multiple indicia
US20240194220A1 (en) Position detection method, apparatus, electronic device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant