CN112102851B - Voice endpoint detection method, device, equipment and computer readable storage medium - Google Patents

Voice endpoint detection method, device, equipment and computer readable storage medium

Info

Publication number
CN112102851B
CN112102851B (application number CN202011282116.0A)
Authority
CN
China
Prior art keywords
short
voice
frame
signal
time
Prior art date
Legal status
Active
Application number
CN202011282116.0A
Other languages
Chinese (zh)
Other versions
CN112102851A (en)
Inventor
赵沁
徐国强
Current Assignee
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202011282116.0A priority Critical patent/CN112102851B/en
Publication of CN112102851A publication Critical patent/CN112102851A/en
Application granted granted Critical
Publication of CN112102851B publication Critical patent/CN112102851B/en
Priority to PCT/CN2021/127184 priority patent/WO2022105570A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L25/78 — Detection of presence or absence of voice signals (G PHYSICS; G10 Musical instruments, acoustics; G10L Speech analysis or synthesis, speech recognition, speech or voice processing, speech or audio coding or decoding)
    • G10L25/87 — Detection of discrete points within a voice signal
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 — Extracted parameters being spectral information of each sub-band
    • G10L2025/783 — Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of voice signal processing, and discloses a voice endpoint detection method, device, equipment and computer readable storage medium. The method comprises the following steps: extracting the time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal; sequentially traversing each frequency domain spectrum signal, determining the current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal; detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal; and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer. The invention improves the accuracy of voice endpoint detection.

Description

Voice endpoint detection method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for detecting a speech endpoint.
Background
Voice endpoint detection is a front-end processing step that must require little computation and be able to output voice paragraphs in real time. Existing methods fall mainly into two categories: methods based on the statistical properties of the signal, and methods based on deep networks. The former have few parameters and high interpretability; the latter can, to some extent, handle speech segment detection under non-stationary noise, but their performance depends heavily on the training set, they require large amounts of training data, and they generalize poorly. Most real-time systems therefore adopt statistical methods, which mainly rely on the sub-band energy, zero-crossing rate, spectral characteristics and the like of the signal. However, parameters such as the detection threshold must be set in advance, while the voice signal in a real environment changes dynamically, so a fixed threshold performs poorly, is prone to a high false alarm rate, and cannot accurately detect the voice endpoints of the voice signal.
Disclosure of Invention
The invention mainly aims to provide a voice endpoint detection method, device, equipment and computer readable storage medium, with the aim of improving the accuracy of voice endpoint detection.
In order to achieve the above object, the present invention provides a voice endpoint detection method, including:
extracting time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal;
sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal;
and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer.
Optionally, the step of determining a speech segment endpoint of the speech signal according to all data frames in the speech frame buffer includes:
detecting whether the number of all data frames in the voice frame buffer is equal to a first preset value or not;
and if the value is equal to the first preset value, taking the first data frame in the voice frame buffer as a voice paragraph endpoint of the voice signal.
Optionally, the step of calculating the short-time energy-to-entropy ratio of the current data frame according to the current frequency-domain spectrum signal includes:
calculating density functions of all frequency components of the current data frame according to the current frequency domain spectrum signal, and calculating short-time spectrum entropy of the current data frame according to a preset short-time spectrum entropy calculation formula and each density function;
and acquiring the short-time energy of the current data frame, and calculating the short-time energy entropy ratio of the current data frame according to the short-time energy and the short-time spectrum entropy.
Optionally, the step of obtaining the short-time energy of the current data frame includes:
and determining the frame shift and the frame length of the current data frame according to the time domain signal of the current data frame, and calculating the short-time energy of the current data frame according to the frame shift and the frame length.
Optionally, the step of detecting whether the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal is preceded by:
acquiring the number of preset mute frames, calculating short-time energy entropy ratios corresponding to all the mute frames based on the number of the mute frames, and calculating a mean value of the short-time energy entropy ratios according to the short-time energy entropy ratios corresponding to all the mute frames;
and calculating an initial detection threshold value of the voice signal according to the short-time energy-entropy ratio mean value and a preset adjusting factor.
Optionally, the step of calculating an initial detection threshold of the speech signal according to the short-time energy-entropy ratio mean and a preset adjustment factor includes:
determining the maximum short-time energy entropy ratio in the short-time energy entropy ratios corresponding to the mute frames, and calculating the difference between the maximum short-time energy entropy ratio and the average value of the short-time energy entropy ratios;
and calculating a product between the short-time energy-entropy ratio mean value and a preset adjusting factor, and taking a sum value between the product and the difference value as an initial detection threshold value of the voice signal.
Optionally, the step of detecting whether the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal is followed by:
and if the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold value of the voice signal, moving the current data frame to a preset noise frame buffer, and determining a voice section endpoint of the voice signal according to all data frames in the noise frame buffer.
In addition, to achieve the above object, the present invention further provides a voice endpoint detection apparatus, including:
the extraction module is used for extracting time domain signals of all data frames in the voice signals collected in real time and converting the time domain signals into frequency domain spectrum signals;
the calculation module is used for sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to a traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
a detection module, configured to detect whether the short-time energy-entropy ratio is greater than an initial detection threshold of the speech signal;
and the determining module is used for moving the current data frame to a preset voice frame buffer and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal.
In addition, in order to achieve the above object, the present invention also provides a voice endpoint detection device;
the voice endpoint detection apparatus includes: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein:
the computer program, when executed by the processor, implements the steps of the voice endpoint detection method as described above.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium;
the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the voice endpoint detection method as described above.
The method comprises the steps of extracting the time domain signals of all data frames in a voice signal collected in real time, and converting each time domain signal into a frequency domain spectrum signal; sequentially traversing each frequency domain spectrum signal, determining the current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal; detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal; and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer. Because the time domain signals of all data frames in the voice signal collected in real time are converted into frequency domain spectrum signals, the short-time energy-entropy ratio is calculated from the current frequency domain spectrum signal of the traversed current data frame, and the current data frame is moved to the voice frame buffer to determine the voice paragraph endpoints when the short-time energy-entropy ratio is larger than the initial detection threshold, the situation in the prior art in which a dynamically changing voice signal in a real environment cannot be detected accurately is avoided, and the accuracy of voice endpoint detection is improved.
Drawings
FIG. 1 is a schematic diagram of a voice endpoint detection device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a voice endpoint detection method according to a first embodiment of the present invention;
fig. 3 is a functional block diagram of the voice endpoint detection apparatus according to the present invention.
The objects, features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a voice endpoint detection device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the voice endpoint detection apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the voice endpoint detection device may further include a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like. Such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display screen according to the brightness of ambient light. Of course, the voice endpoint detection device may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and so on, which are not described herein again.
Those skilled in the art will appreciate that the voice endpoint detection apparatus configuration shown in FIG. 1 does not constitute a limitation of voice endpoint detection apparatus and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a voice endpoint detection program.
In the voice endpoint detection apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the voice endpoint detection program stored in the memory 1005 and execute the voice endpoint detection method provided by the embodiment of the present invention.
Referring to fig. 2, the present invention provides a voice endpoint detection method. In an embodiment, the voice endpoint detection method includes the following steps:
step S10, extracting time domain signals of all data frames in the voice signals collected in real time, and converting each time domain signal into a frequency domain spectrum signal;
in this embodiment, the detection threshold is determined by calculating the short-term entropy ratio of the speech signal in real time, and the start-stop position of the speech paragraph is accurately detected according to the detection threshold, so as to facilitate the subsequent speech segmentation and speech recognition tasks. Therefore, the voice signal can be collected in real time through a voice collecting device, such as a microphone, and the time domain signal x (n) is extracted frame by frame from the voice signal collected in real time, that is, the time domain signal of each data frame is collected. And for smoothing the signal, the frame shift is set to be smaller than the frame length so as to calculate the short-time energy E of each data frameiWhere i represents time domain data of the ith frame. After the time domain signals of all the data frames are obtained, the time domain signals of all the data frames are subjected to short-time discrete Fourier transform to obtain a frequency domain spectrum signal Y of each data framei. The short-time discrete fourier transform of the time domain signal is performed in a conventional manner, and is not described here.
Step S20, sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
after the frequency domain spectrum signals of each data frame of the speech signal are acquired, the same operation needs to be performed on all the data frames to determine the speech paragraphs. Therefore, each data frame can be traversed in sequence, and the traversed data frame, that is, the frequency domain spectrum signal corresponding to the current data frame is determined and is used as the current frequency domain spectrum signal. And calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal. Since the spectral energy is calculated band by band for one frame of data, the normalized spectral probability density function of the kth frequency component of the ith frame of data is calculated as Pi(k)=Yi(k)/ΣmYi(m), wherein m =1,2FFTA/2, wherein NFFTIs the fast fourier transform length.
Therefore, when calculating the short-time energy-entropy ratio of the current data frame, the normalized spectral probability density function Pi(k) = Yi(k) / Σ_m Yi(m) can be used to obtain the density functions of all frequency components of the current data frame, and the short-time spectral entropy of the current data frame is computed from these density functions as Hi = −Σ_k Pi(k)·log(Pi(k)). The short-time energy-entropy ratio of the current data frame is then calculated from the short-time energy Ei and the short-time spectral entropy Hi according to the short-time energy-entropy ratio calculation formula (given as an equation image in the original publication), in which α is an adjustment factor.
In this scheme, after the frequency domain spectrum signal of each data frame of the speech signal is acquired, an empirical constraint can be added: the speech spectrum mainly lies in the [100 Hz, 3500 Hz] interval, whereas the noise spectrum spans the full frequency band. Therefore, to better distinguish speech from noise, only the data in the [100 Hz, 3500 Hz] interval need be considered when calculating the spectral energy band by band, i.e., only the components between 100 Hz and 3500 Hz of each data frame are used.
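A sketch of the per-frame spectral probability density, short-time spectral entropy, and band-limited energy-entropy ratio described above. Since the patent gives its exact energy-entropy ratio formula only as an equation image, the sqrt(1 + |E/H|) combination used below is an assumed stand-in for illustration, and the function name and parameters are likewise hypothetical.

```python
import numpy as np

def short_time_energy_entropy_ratio(Y, E, sample_rate=16000, n_fft=512):
    """Y: magnitude spectrum of one frame (length n_fft // 2 + 1); E: its short-time energy."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    band = (freqs >= 100.0) & (freqs <= 3500.0)    # empirical speech band [100 Hz, 3500 Hz]
    Yb = Y[band]
    P = Yb / (np.sum(Yb) + 1e-10)                  # normalized spectral probability density P_i(k)
    H = -np.sum(P * np.log(P + 1e-10))             # short-time spectral entropy H_i
    # The patent gives its exact E_i / H_i combination only as an equation image;
    # sqrt(1 + |E / H|) is a commonly used energy-entropy ratio form, used here as a stand-in.
    return np.sqrt(1.0 + np.abs(E / (H + 1e-10)))
```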
The short-time energy may be obtained as follows: denote the time domain signal as x(n), apply a window function w(n), and obtain the i-th frame speech signal after framing as yi(n), where yi(n) satisfies:
yi(n)=w(n)*x((i-1)*inc+n),
1≤n≤L,1≤i≤fn
where w(n) is a window function, typically a rectangular or Hamming window, yi(n) is the i-th frame of speech, n = 1, 2, …, L, i = 1, 2, …, fn, L is the frame length, inc is the frame shift, and fn is the total number of frames after framing. The short-time energy of the i-th frame speech signal yi(n) is:
Ei = Σ_{n=1}^{L} yi(n)²
After the frame shift and the frame length are obtained, the short-time energy of the current data frame can be calculated according to the short-time energy formula.
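A sketch of the windowed framing and short-time energy computation Ei = Σ_n yi(n)² just described; the Hamming window default and the helper name are assumptions, since the patent only notes that a rectangular or Hamming window is typical.

```python
import numpy as np

def short_time_energy(x, frame_len, frame_shift, window=None):
    """Compute E_i = sum_{n=1..L} y_i(n)^2 with y_i(n) = w(n) * x((i-1)*inc + n)."""
    if window is None:
        window = np.hamming(frame_len)             # w(n); a rectangular window also works
    n_frames = max(0, (len(x) - frame_len) // frame_shift + 1)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        yi = window * x[i * frame_shift : i * frame_shift + frame_len]
        energies[i] = np.sum(yi ** 2)              # short-time energy E_i of frame i
    return energies
```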
Step S30, detecting whether the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal;
after the short-time energy-entropy ratio of the current data frame is obtained through calculation, the current frame can be judged according to the short-time energy-entropy ratio, and according to signal characteristics, the short-time energy-entropy ratio of voice is larger than the energy-entropy ratio of noise, so that whether the current frame belongs to noise or voice is determined. Therefore, it is also necessary to obtain an initial detection threshold of the speech signal, so as to detect whether the short-time energy-entropy ratio of the current data frame is greater than the initial detection value, and perform different operations according to different detection results. Therefore, it is necessary to set an initial detection threshold of the voice signal in advance, and the initial detection threshold may be T0The equation of = α × μ + Φ, where α is an adjustment factor, μ is a mean short-time entropy ratio of a mute frame, and Φ = max (EH)1-N)—μ,T0For the initial detection threshold, N is the number of silent frames.
The initial detection threshold of the speech signal may be obtained in advance as follows: obtain the preset number of mute frames, calculate the short-time energy-entropy ratio of each mute frame according to the mute frame data and the short-time energy-entropy ratio calculation formula, and average these ratios to obtain the mean short-time energy-entropy ratio.
At the same time, the short-time energy-entropy ratios of the mute frames are compared to obtain the largest one, i.e., the maximum short-time energy-entropy ratio, and the difference between this maximum and the mean short-time energy-entropy ratio is calculated. Then, the product of the mean short-time energy-entropy ratio and the adjustment factor set in advance by the user is computed, and the sum of this product and the difference is taken as the initial detection threshold of the speech signal, i.e., T0 = α × μ + Φ, where α is the adjustment factor, μ is the mean short-time energy-entropy ratio of the mute frames, and Φ = max(EH_1, …, EH_N) − μ.
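A sketch of this threshold initialization. It assumes the energy-entropy ratios of the N leading mute (silence) frames have already been computed, and the default value of α is only illustrative, since the adjustment factor is left to the user in the patent.

```python
def initial_detection_threshold(silence_ratios, alpha=0.5):
    """silence_ratios: short-time energy-entropy ratios EH_1..EH_N of the N mute frames."""
    mu = sum(silence_ratios) / len(silence_ratios)   # mean energy-entropy ratio of mute frames
    phi = max(silence_ratios) - mu                   # phi = max(EH_1..N) - mu
    return alpha * mu + phi                          # T0 = alpha * mu + phi
```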
When the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold of the speech signal, the current data frame can be moved to a noise frame buffer set in advance, and the same detection is performed on the next data frame to decide whether to move it into the noise frame buffer, until every data frame of the speech signal has been examined. When the number of data frames in the noise frame buffer is greater than a predetermined value (i.e., the second preset value), such as L0, it can be determined that noise is present in the speech signal and that the speech paragraph in the signal has ended and entered a noise segment; the end point of the speech paragraph, i.e., the beginning of the noise segment, is the first frame in the noise frame buffer.
Step S40, if the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal, moving the current data frame to a preset speech frame buffer, and determining a speech paragraph endpoint of the speech signal according to all data frames in the speech frame buffer.
When the short-time energy-entropy ratio is larger than the initial detection threshold of the speech signal, the current data frame can be moved to a speech frame buffer set in advance, and the same detection is performed on the next data frame to decide whether to move it into the speech frame buffer, until every data frame of the speech signal has been examined. In this embodiment, in order to eliminate burst noise in a real environment, two buffers may be provided, a speech frame buffer and a noise frame buffer; the length of the speech frame buffer is set to L1 and the length of the noise frame buffer to L0, where L0 > N and N is the number of mute frames.
When the number of data frames in the speech frame buffer is greater than a predetermined value (i.e., the first preset value), such as L1, it can be determined that speech is present in the speech signal and that the start position of the speech paragraph is the first frame in the speech frame buffer. If the short-time energy-entropy ratio is less than or equal to the initial detection threshold, the current data frame is moved to the preset noise frame buffer, and when the noise frame buffer is full the end point of the speech paragraph is set to the first frame of the noise frame buffer. In other words, the speech paragraph endpoints of the speech signal are the first frame (first data frame) in the speech frame buffer and the first frame (first data frame) in the noise frame buffer.
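A sketch of the double-buffer endpoint decision just described. The buffer lengths L1 and L0 used as defaults are illustrative, and the way each buffer is cleared when the opposite frame type arrives is an assumption the patent does not spell out.

```python
def detect_endpoints(ratios, t0, l1=10, l0=15):
    """ratios: per-frame short-time energy-entropy ratios; t0: initial detection threshold.
    Returns (start_frame, end_frame) pairs marking detected speech paragraphs."""
    speech_buf, noise_buf, endpoints, start = [], [], [], None
    for idx, eh in enumerate(ratios):
        if eh > t0:                                  # speech-like frame
            speech_buf.append(idx)
            noise_buf.clear()
            if start is None and len(speech_buf) == l1:
                start = speech_buf[0]                # paragraph start: first frame in speech buffer
        else:                                        # noise-like frame
            noise_buf.append(idx)
            speech_buf.clear()
            if start is not None and len(noise_buf) == l0:
                endpoints.append((start, noise_buf[0]))  # paragraph end: first frame in noise buffer
                start = None
    return endpoints
```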
In this embodiment, the time domain signals of all data frames in a speech signal collected in real time are extracted, and each time domain signal is converted into a frequency domain spectrum signal; each frequency domain spectrum signal is traversed in sequence, the current frequency domain spectrum signal corresponding to the traversed current data frame is determined, and the short-time energy-entropy ratio of the current data frame is calculated according to the current frequency domain spectrum signal; whether the short-time energy-entropy ratio is larger than the initial detection threshold of the speech signal is detected; and if the short-time energy-entropy ratio is larger than the initial detection threshold, the current data frame is moved to a preset speech frame buffer, and a speech paragraph endpoint of the speech signal is determined according to all data frames in the speech frame buffer. Because the time domain signals of all data frames in the speech signal collected in real time are converted into frequency domain spectrum signals, the short-time energy-entropy ratio is calculated from the current frequency domain spectrum signal of the traversed current data frame, and the current data frame is moved to the speech frame buffer to determine the speech paragraph endpoints when the short-time energy-entropy ratio is larger than the initial detection threshold, the situation in the prior art in which a dynamically changing speech signal in a real environment cannot be detected accurately is avoided, and the accuracy of voice endpoint detection is improved.
Further, on the basis of the first embodiment of the present invention, a second embodiment of the voice endpoint detection method of the present invention is provided. This embodiment refines step S40 of the first embodiment, in which the step of determining the voice paragraph endpoint of the voice signal according to all the data frames in the voice frame buffer includes:
m, detecting whether the number of all data frames in the voice frame buffer is equal to a first preset value or not;
in this embodiment, after the current data frame is moved to the speech frame buffer, the existing data frame in the speech buffer is detected, i.e. the speech frame buffer is detectedAnd whether the number of all data frames in the buffer is equal to a first preset value set in advance or not, and executing different operations according to different detection results. Wherein, the first preset value can be the length of the voice frame buffer set in advance
Figure 878194DEST_PATH_IMAGE003
And n, if the value is equal to the first preset value, taking the first data frame in the voice frame buffer as a voice paragraph endpoint of the voice signal.
When the number of all the data frames in the speech frame buffer is detected to be equal to the first preset value, the first data frame in the speech frame buffer can be used as a speech paragraph endpoint of the speech signal, namely the start position of the speech paragraph. If the number of all the data frames in the speech frame buffer is smaller than the first preset value, the next data frame continues to be traversed.
In this embodiment, when the number of all data frames in the speech frame buffer is equal to the first preset value, the first data frame in the speech frame buffer is used as the speech paragraph endpoint of the speech signal, so that the accuracy of the calculated speech paragraph endpoint is ensured.
Further, the step of calculating the short-time energy-entropy ratio of the current data frame according to the current frequency-domain spectrum signal includes:
step a, calculating density functions of all frequency components of the current data frame according to the current frequency domain spectrum signal, and calculating short-time spectrum entropy of the current data frame according to a preset short-time spectrum entropy calculation formula and each density function;
in this embodiment, when calculating the short-term energy-entropy ratio of the current data frame, the density function of all frequency components in the current data frame may be calculated according to the current frequency-domain spectrum signal, for example, the normalized spectrum probability density function (i.e. density function) of the kth frequency component of the ith data frame is calculated as Pi(k)=Yi(k)/ΣmYi(m), wherein m =1,2FFT/2 wherein N isFFTIs a fast fourier transformThe length of the transform. And after the density functions of all frequency components are obtained through calculation, the short-time spectral entropy of the current data frame can be calculated according to a preset short-time spectral entropy calculation formula and each density function. Wherein the preset short-time spectrum entropy calculation formula can be Hi = —ΣkPi(k)log(Pi(k))。
And b, acquiring the short-time energy of the current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the short-time energy and the short-time spectrum entropy.
After the short-time spectral entropy is obtained, the short-time energy Ei of the current data frame is acquired, and the short-time energy Ei and the short-time spectral entropy Hi are substituted into the short-time energy-entropy ratio calculation formula (given as an equation image in the original publication), in which α is an adjustment factor, to calculate the short-time energy-entropy ratio of the current data frame.
In this embodiment, all density functions are determined according to the current frequency domain spectrum signal, the short-time spectrum entropy of the current data frame is calculated according to the preset short-time spectrum entropy calculation formula and each density function, and the short-time energy entropy ratio of the current data frame is calculated according to the short-time spectrum entropy and the short-time energy, so that the accuracy of the calculated short-time energy entropy ratio is guaranteed.
Specifically, the step of obtaining the short-time energy of the current data frame includes:
and c, determining the frame shift and the frame length of the current data frame according to the time domain signal of the current data frame, and calculating the short-time energy of the current data frame according to the frame shift and the frame length.
In this embodiment, the short-time energy may be obtained as follows: denote the time domain signal as x(n), apply a window function w(n), and obtain the i-th frame speech signal after framing as yi(n), where yi(n) satisfies:
yi(n)=w(n)*x((i-1)*inc+n),
1≤n≤L,1≤i≤fn
where w(n) is a window function, typically a rectangular or Hamming window, yi(n) is the i-th frame of speech, n = 1, 2, …, L, i = 1, 2, …, fn, L is the frame length, inc is the frame shift, and fn is the total number of frames after framing. The short-time energy of the i-th frame speech signal yi(n) is:
Ei = Σ_{n=1}^{L} yi(n)²
After the frame shift and the frame length are obtained, the short-time energy of the current data frame can be calculated according to the short-time energy formula.
In the embodiment, the frame shift and the frame length of the current data frame are determined according to the time domain signal of the current data frame, so that the short-time energy of the current data frame is calculated, and the accuracy of the calculated short-time energy is guaranteed.
Further, the step of detecting whether the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal is preceded by:
step d, acquiring the number of preset mute frames, calculating the short-time energy entropy ratios corresponding to all the mute frames based on the number of the mute frames, and calculating the average value of the short-time energy entropy ratios according to the short-time energy entropy ratios corresponding to all the mute frames;
in this embodiment, the number of the mute frames set in advance, that is, the number of the mute frames, needs to be obtained, the short-time energy entropy ratios corresponding to all the mute frames are calculated according to the mute frame data and the short-time energy entropy ratio calculation formula, and the average value of each short-time energy entropy ratio is calculated to obtain the average value of the short-time energy entropy ratios.
And e, calculating an initial detection threshold value of the voice signal according to the short-time energy-entropy ratio mean value and a preset adjusting factor.
At the same time, the short-time energy-entropy ratios of the mute frames are compared to obtain the maximum short-time energy-entropy ratio, and the difference between this maximum and the mean short-time energy-entropy ratio is calculated. Then, the product of the mean short-time energy-entropy ratio and the adjustment factor set in advance by the user is computed, and the sum of this product and the difference is used as the initial detection threshold of the speech signal, i.e., T0 = α × μ + Φ, where α is the adjustment factor, μ is the mean short-time energy-entropy ratio of the mute frames, and Φ = max(EH_1, …, EH_N) − μ.
In this embodiment, the mean short-time energy-entropy ratio is calculated from the preset number of mute frames, and the initial detection threshold of the speech signal is then calculated according to the preset adjustment factor and this mean, so that the accuracy of the calculated initial detection threshold is ensured.
Further, the step of calculating the initial detection threshold of the speech signal according to the short-time energy-entropy ratio mean value and a preset adjustment factor includes:
step f, determining the maximum short-time energy entropy ratio in the short-time energy entropy ratios corresponding to the mute frames, and calculating the difference between the maximum short-time energy entropy ratio and the average value of the short-time energy entropy ratios;
in this embodiment, when calculating the initial detection threshold, the short-time energy-entropy ratios corresponding to the mute frames need to be sequentially compared, the maximum short-time energy-entropy ratio with the largest value is obtained according to the comparison result, and then the average value of the short-time energy-entropy ratios is subtracted from the maximum short-time energy-entropy ratio to obtain the difference value between the maximum short-time energy-entropy ratio and the average value of the short-time energy-entropy ratios.
And g, calculating a product between the short-time energy-entropy ratio mean value and a preset adjusting factor, and taking a sum value between the product and the difference value as an initial detection threshold value of the voice signal.
The adjustment factor set by the user in advance is acquired, the mean short-time energy-entropy ratio is multiplied by the adjustment factor to obtain their product, the difference between the previously obtained maximum short-time energy-entropy ratio and the mean short-time energy-entropy ratio is taken, and the product and the difference are added; the resulting sum is used as the initial detection threshold of the speech signal, i.e., T0 = α × μ + Φ, where α is the adjustment factor, μ is the mean short-time energy-entropy ratio of the mute frames, and Φ = max(EH_1, …, EH_N) − μ.
In this embodiment, the difference between the maximum short-time energy-entropy ratio and the mean value of the short-time energy-entropy ratio corresponding to each mute frame is calculated, the product of the mean value of the short-time energy-entropy ratio and the adjustment factor is calculated, and the sum of the product and the difference value is used as the initial detection threshold, so that the accuracy of the initial detection threshold obtained by calculation is guaranteed.
Further, the step of detecting whether the short-time energy-entropy ratio is larger than the initial detection threshold of the speech signal is followed by:
and h, if the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold of the voice signal, moving the current data frame to a preset noise frame buffer, and determining a voice section endpoint of the voice signal according to all data frames in the dry frame buffer.
When the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold of the speech signal, the current data frame can be moved to the noise frame buffer set in advance, and the same detection is performed on the next data frame to decide whether to move it into the noise frame buffer, until every data frame of the speech signal has been examined. When the number of data frames in the noise frame buffer is greater than a predetermined value (i.e., the second preset value), such as L0, it can be determined that noise is present in the speech signal and that the speech paragraph in the signal has ended and entered a noise segment; the end point of the speech paragraph, i.e., the beginning of the noise segment, is the first frame in the noise frame buffer.
In this embodiment, when it is determined that the short-time energy-entropy ratio is less than or equal to the initial detection threshold, the current data frame is moved to the noise frame buffer, and the end point of the speech paragraph is determined according to all the data frames in the noise frame buffer, thereby ensuring the accuracy of the calculated speech paragraph endpoint.
In addition, referring to fig. 3, an embodiment of the present invention further provides a voice endpoint detection apparatus, where the voice endpoint detection apparatus includes:
the extraction module A10 is used for extracting time domain signals of all data frames in the voice signals collected in real time and converting the time domain signals into frequency domain spectrum signals;
a calculating module A20, configured to sequentially traverse each frequency domain spectrum signal, determine the current frequency domain spectrum signal corresponding to the traversed current data frame, and calculate the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
a detecting module A30, configured to detect whether the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal;
a determining module A40, configured to, if the short-time energy-entropy ratio is greater than the initial detection threshold of the speech signal, move the current data frame to a preset speech frame buffer, and determine a speech paragraph endpoint of the speech signal according to all data frames in the speech frame buffer.
Optionally, the determining module a40 is further configured to:
detecting whether the number of all data frames in the voice frame buffer is equal to a first preset value or not;
and if the value is equal to the first preset value, taking the first data frame in the voice frame buffer as a voice paragraph endpoint of the voice signal.
Optionally, a calculating module a20, configured to:
calculating density functions of all frequency components of the current data frame according to the current frequency domain spectrum signal, and calculating short-time spectrum entropy of the current data frame according to a preset short-time spectrum entropy calculation formula and each density function;
and acquiring the short-time energy of the current data frame, and calculating the short-time energy entropy ratio of the current data frame according to the short-time energy and the short-time spectrum entropy.
Optionally, a calculating module a20, configured to:
and determining the frame shift and the frame length of the current data frame according to the time domain signal of the current data frame, and calculating the short-time energy of the current data frame according to the frame shift and the frame length.
Optionally, the detection module a30 is further configured to:
acquiring the number of preset mute frames, calculating short-time energy entropy ratios corresponding to all the mute frames based on the number of the mute frames, and calculating a mean value of the short-time energy entropy ratios according to the short-time energy entropy ratios corresponding to all the mute frames;
and calculating an initial detection threshold value of the voice signal according to the short-time energy-entropy ratio mean value and a preset adjusting factor.
Optionally, the detection module a30 is further configured to:
determining the maximum short-time energy entropy ratio in the short-time energy entropy ratios corresponding to the mute frames, and calculating the difference between the maximum short-time energy entropy ratio and the average value of the short-time energy entropy ratios;
and calculating a product between the short-time energy-entropy ratio mean value and a preset adjusting factor, and taking a sum value between the product and the difference value as an initial detection threshold value of the voice signal.
Optionally, the detection module a30 is further configured to:
and if the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold value of the voice signal, moving the current data frame to a preset noise frame buffer, and determining a voice section endpoint of the voice signal according to all data frames in the noise frame buffer.
The steps implemented by each functional module of the voice endpoint detection apparatus may refer to each embodiment of the voice endpoint detection method of the present invention, and are not described herein again.
The present invention also provides a voice endpoint detection apparatus, including: a memory, a processor, and a voice endpoint detection program stored on the memory; the processor is configured to execute the voice endpoint detection program to implement the following steps:
extracting time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal;
sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal;
and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer.
The present invention also provides a computer readable storage medium storing one or more programs, the one or more programs being further executable by one or more processors for performing the steps of:
extracting time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal;
sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to the traversed current data frame, and calculating the short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal;
detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal;
and if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph endpoint of the voice signal according to all data frames in the voice frame buffer.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the voice endpoint detection method described above, and is not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (7)

1. A voice endpoint detection method is characterized by comprising the following steps:
extracting time domain signals of all data frames in a voice signal acquired in real time, and converting each time domain signal into a frequency domain spectrum signal;
sequentially traversing each frequency domain spectrum signal, determining a current frequency domain spectrum signal corresponding to a traversed current data frame, and calculating a short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal, wherein the short-time energy in the [100 Hz, 3500 Hz] interval of the current data frame is calculated, and the short-time energy-entropy ratio of the current data frame is calculated according to the short-time energy and the current frequency domain spectrum signal, wherein yi(n) = w(n)·x((i−1)×inc+n), 1 ≤ n ≤ L, 1 ≤ i ≤ fn, w(n) is a window function, L is the frame length, inc is the frame shift, fn is the total number of frames after framing, yi(n) is the i-th frame speech signal, and x(n) is the time domain signal; the short-time energy of the i-th frame speech signal yi(n) is:
Ei = Σ_{n=1}^{L} yi(n)²;
detecting whether the short-time energy-entropy ratio is larger than an initial detection threshold of the voice signal, wherein the initial detection threshold is calculated according to T0 = α × μ + Φ, where α is an adjustment factor, μ is the mean value of the short-time energy-entropy ratios of the mute frames, Φ = max(EH_1, …, EH_N) − μ, T0 is the initial detection threshold, N is the number of mute frames, and max(EH_1, …, EH_N) is the maximum short-time energy-entropy ratio;
if the short-time energy-entropy ratio is larger than the initial detection threshold of the voice signal, moving the current data frame to a preset voice frame buffer, and determining a voice paragraph end point of the voice signal according to all data frames in the voice frame buffer, wherein if the number of data frames in the voice frame buffer is equal to a first preset value, determining that voice exists in the voice signal, and determining that the starting position of a voice paragraph in the voice signal is a first frame in the voice frame buffer;
and if the short-time energy-entropy ratio is smaller than or equal to the initial detection threshold, moving the current data frame to a preset noise frame buffer, and when the number of the data frames of all the data frames in the noise frame buffer is larger than a second preset value in the noise frame buffer, determining that noise exists in the voice signal, and determining that a first frame in the noise frame buffer is the position of the termination point of the voice paragraph.
2. The voice endpoint detection method according to claim 1, wherein the step of calculating the short-time energy-entropy ratio of the current data frame from the current frequency-domain spectral signal comprises:
calculating density functions of all frequency components of the current data frame according to the current frequency domain spectrum signal, and calculating short-time spectrum entropy of the current data frame according to a preset short-time spectrum entropy calculation formula and each density function;
and acquiring the short-time energy of the current data frame, and calculating the short-time energy entropy ratio of the current data frame according to the short-time energy and the short-time spectrum entropy.
3. The voice endpoint detection method of claim 2, wherein the step of obtaining the short-time energy of the current data frame comprises:
and determining the frame shift and the frame length of the current data frame according to the time domain signal of the current data frame, and calculating the short-time energy of the current data frame according to the frame shift and the frame length.
4. The voice endpoint detection method of claim 1, wherein the step of detecting whether the short-time energy-to-entropy ratio is greater than an initial detection threshold for the voice signal is preceded by:
acquiring the number of preset mute frames, calculating the short-time energy entropy ratios corresponding to all the mute frames based on the number of the mute frames, and calculating the average value of the short-time energy entropy ratios according to the short-time energy entropy ratios corresponding to all the mute frames.
5. A voice endpoint detection apparatus, the voice endpoint detection apparatus comprising:
the extraction module is used for extracting time domain signals of all data frames in the voice signals collected in real time and converting the time domain signals into frequency domain spectrum signals;
a calculating module, configured to sequentially traverse each frequency domain spectrum signal, determine a current frequency domain spectrum signal corresponding to a traversed current data frame, and calculate a short-time energy-entropy ratio of the current data frame according to the current frequency domain spectrum signal, where the short-time energy of a [100HZ,3500HZ ] interval in the current data frame is calculated, and the short-time energy-entropy ratio of the current data frame is calculated according to the short-time energy and the current frequency spectrum signal, where yi (n) = w (n) x ((i-1) × inc + n), n is greater than or equal to 1 and less than or equal to L, i is greater than or equal to 1 and less than or equal to fn, w (n) is a window function, L is a frame length, inc is a frame shift, fn is a total frame number after framing, yi (n) is an i-th frame speech signal, and x (n) is a time domain signal; the short-time energy of the ith frame speech signal yi (n) is:
E(i) = Σ_{n=1}^{L} y_i²(n), 1 ≤ i ≤ fn;
a detection module, configured to detect whether the short-time energy-entropy ratio is greater than an initial detection threshold of the voice signal, wherein the initial detection threshold is calculated as T0 = α × μ + Φ, where α is an adjustment factor, μ is the mean value of the short-time energy-entropy ratios of the mute frames, Φ = max(EH_{1..N}) − μ, T0 is the initial detection threshold, N is the number of mute frames, and max(EH_{1..N}) is the maximum short-time energy-entropy ratio of the mute frames;
a determining module, configured to: if the short-time energy-entropy ratio is greater than the initial detection threshold of the voice signal, move the current data frame into a preset voice frame buffer and determine a voice segment endpoint of the voice signal according to all data frames in the voice frame buffer, wherein if the number of data frames in the voice frame buffer is equal to a first preset value, it is determined that voice exists in the voice signal and that the start position of a voice segment in the voice signal is the first frame in the voice frame buffer;
and if the short-time energy-entropy ratio is less than or equal to the initial detection threshold, move the current data frame into a preset noise frame buffer, and when the number of data frames in the noise frame buffer is greater than a second preset value, determine that noise exists in the voice signal and that the first frame in the noise frame buffer is the termination point of the voice segment.
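Claim 5 restricts the short-time energy to the [100 Hz, 3500 Hz] interval of the current frame. One way this band-limited energy might be computed is sketched below; the 16 kHz sampling rate, the use of a real FFT, and the inclusive band edges are assumptions rather than details taken from the claim.

import numpy as np

def band_energy(frame, sample_rate=16000, f_lo=100.0, f_hi=3500.0):
    """Sketch: short-time energy of the [100 Hz, 3500 Hz] band of one frame."""
    spectrum = np.fft.rfft(frame)                              # frequency-domain spectrum signal
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)   # frequency of each component
    band = (freqs >= f_lo) & (freqs <= f_hi)                   # components inside the band
    return np.sum(np.abs(spectrum[band]) ** 2)                 # energy of the band-limited spectrum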
6. A voice endpoint detection device, comprising: a memory, a processor, and a voice endpoint detection program stored in the memory and executable on the processor, wherein the voice endpoint detection program, when executed by the processor, implements the steps of the voice endpoint detection method according to any one of claims 1 to 4.
7. A computer-readable storage medium having a voice endpoint detection program stored thereon, wherein the voice endpoint detection program, when executed by a processor, implements the steps of the voice endpoint detection method according to any one of claims 1 to 4.
CN202011282116.0A 2020-11-17 2020-11-17 Voice endpoint detection method, device, equipment and computer readable storage medium Active CN112102851B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011282116.0A CN112102851B (en) 2020-11-17 2020-11-17 Voice endpoint detection method, device, equipment and computer readable storage medium
PCT/CN2021/127184 WO2022105570A1 (en) 2020-11-17 2021-10-28 Speech endpoint detection method, apparatus and device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011282116.0A CN112102851B (en) 2020-11-17 2020-11-17 Voice endpoint detection method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112102851A (en) 2020-12-18
CN112102851B (en) 2021-04-13

Family

ID=73785690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011282116.0A Active CN112102851B (en) 2020-11-17 2020-11-17 Voice endpoint detection method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112102851B (en)
WO (1) WO2022105570A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102851B (en) * 2020-11-17 2021-04-13 深圳壹账通智能科技有限公司 Voice endpoint detection method, device, equipment and computer readable storage medium
CN113613159B (en) * 2021-08-20 2023-07-21 贝壳找房(北京)科技有限公司 Microphone blowing signal detection method, device and system
CN114582354A (en) * 2022-05-06 2022-06-03 深圳市长丰影像器材有限公司 Voice control method, device and equipment based on voiceprint recognition and storage medium
CN116665717B (en) * 2023-08-02 2023-09-29 广东技术师范大学 Cross-subband spectral entropy weighted likelihood ratio voice detection method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012215600A (en) * 2011-03-31 2012-11-08 Oki Electric Ind Co Ltd Voice section determination device, voice section determination method, and program
CN106653062A (en) * 2017-02-17 2017-05-10 重庆邮电大学 Spectrum-entropy improvement based speech endpoint detection method in low signal-to-noise ratio environment
CN106816157A (en) * 2015-11-30 2017-06-09 展讯通信(上海)有限公司 Audio recognition method and device
CN107910017A (en) * 2017-12-19 2018-04-13 河海大学 A kind of method that threshold value is set in noisy speech end-point detection
CN109412763A (en) * 2018-11-15 2019-03-01 电子科技大学 A kind of digital signal Detection of Existence method based on signal energy entropy ratio
CN111179975A (en) * 2020-04-14 2020-05-19 深圳壹账通智能科技有限公司 Voice endpoint detection method for emotion recognition, electronic device and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107731223B (en) * 2017-11-22 2022-07-26 腾讯科技(深圳)有限公司 Voice activity detection method, related device and equipment
CN108428456A (en) * 2018-03-29 2018-08-21 浙江凯池电子科技有限公司 Voice de-noising algorithm
US11062727B2 (en) * 2018-06-13 2021-07-13 Ceva D.S.P Ltd. System and method for voice activity detection
CN108962285B (en) * 2018-07-20 2023-04-14 浙江万里学院 Voice endpoint detection method for dividing sub-bands based on human ear masking effect
CN109087632B (en) * 2018-08-17 2023-06-06 平安科技(深圳)有限公司 Speech processing method, device, computer equipment and storage medium
CN111755028A (en) * 2020-07-03 2020-10-09 四川长虹电器股份有限公司 Near-field remote controller voice endpoint detection method and system based on fundamental tone characteristics
CN112102851B (en) * 2020-11-17 2021-04-13 深圳壹账通智能科技有限公司 Voice endpoint detection method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112102851A (en) 2020-12-18
WO2022105570A1 (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN112102851B (en) Voice endpoint detection method, device, equipment and computer readable storage medium
US20210327448A1 (en) Speech noise reduction method and apparatus, computing device, and computer-readable storage medium
CN107004409B (en) Neural network voice activity detection using run range normalization
WO2019101123A1 (en) Voice activity detection method, related device, and apparatus
US10014005B2 (en) Harmonicity estimation, audio classification, pitch determination and noise estimation
CN113766073B (en) Howling detection in conference systems
US8520861B2 (en) Signal processing system for tonal noise robustness
CN110853664B (en) Method and device for evaluating performance of speech enhancement algorithm and electronic equipment
JP2012128411A (en) Voice determination device and voice determination method
JP2016524724A (en) Method and system for controlling a home electrical appliance by identifying a position associated with a voice command in a home environment
CN111968662A (en) Audio signal processing method and device and storage medium
WO2012175054A1 (en) Method and device for detecting fundamental tone
CN111028845A (en) Multi-audio recognition method, device, equipment and readable storage medium
CN111292758B (en) Voice activity detection method and device and readable storage medium
CN110970051A (en) Voice data acquisition method, terminal and readable storage medium
US20150019222A1 (en) Method for using voiceprint identification to operate voice recognition and electronic device thereof
US20210264940A1 (en) Position detection method, apparatus, electronic device and computer readable storage medium
CN110556128B (en) Voice activity detection method and device and computer readable storage medium
JP6314475B2 (en) Audio signal processing apparatus and program
CN115995234A (en) Audio noise reduction method and device, electronic equipment and readable storage medium
JP2016080767A (en) Frequency component extraction device, frequency component extraction method and frequency component extraction program
TWI756817B (en) Voice activity detection device and method
CN116364106A (en) Voice detection method, device, terminal equipment and storage medium
US20230253010A1 (en) Voice activity detection (vad) based on multiple indicia
US20240194220A1 (en) Position detection method, apparatus, electronic device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant