CN108346425B - Voice activity detection method and device and voice recognition method and device


Info

Publication number
CN108346425B
Authority
CN
China
Prior art keywords
signal
voice
current frame
type
frame input
Legal status
Active
Application number
CN201710056814.0A
Other languages
Chinese (zh)
Other versions
CN108346425A (en)
Inventor
李洋
欧阳宏宇
陈伟
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201710056814.0A
Publication of CN108346425A
Application granted
Publication of CN108346425B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention provides a voice activity detection method and device and a voice recognition method and device. The voice activity detection method comprises the following steps: acquiring signal characteristic parameters of a current frame input signal; determining a first signal type of the current frame input signal by adopting the signal characteristic parameters, and determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model; and determining the signal type of the current frame input signal according to the first signal type and the second signal type. In the embodiment of the invention, the speech signal and the non-speech signal in the input signal are determined at both the signal level and the model level, so that the recognition accuracy of speech and non-speech signals is improved, and the robustness and continuity of speech recognition in noisy environments are enhanced.

Description

Voice activity detection method and device and voice recognition method and device
Technical Field
The invention relates to the technical field of input methods, and in particular to a voice activity detection method and device and a voice recognition method and device.
Background
At present, the rapid development of the mobile internet has driven the wide popularization of mobile intelligent devices such as mobile phones, tablet computers, and wearable devices. As one of the most convenient and natural ways of human-machine interaction on mobile devices, voice input has gradually been accepted by the vast majority of users.
The process of voice input is a typical data-input and data-output process. Specifically, it comprises collecting a recording signal, identifying the voice signals and non-voice signals in the recording signal, processing the voice signals, recognizing them, and finally obtaining the recognition result of the voice signal.
In existing speech recognition methods, because decoder resources are limited, a long recording signal needs to be cut into effective segments that match those limited resources. This cutting mainly relies on the pause intervals in human speech, and such pause intervals can generally be regarded as silence or noise, i.e., non-speech signals. At present, speech signals and non-speech signals are usually detected by Voice Activity Detection (VAD), and traditional VAD methods are mainly based either on the signal or on a model. The signal-based method can rapidly identify speech and non-speech signals under stationary noise, but in non-stationary noise environments, such as fuzzy noise and transient noise, its recognition result is inaccurate, causing false alarms or missed detections. The model-based method can accurately recognize speech and non-speech signals under stationary or non-stationary noise, but it tends to eliminate the speech signal when a person speaks little, making speech signal recognition discontinuous.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a method and an apparatus for voice activity detection, and a method and an apparatus for voice recognition, so as to improve the accuracy and continuity of detection and recognition of voice signals and non-voice signals.
In order to solve the above problem, the present invention discloses a voice activity detection method, which comprises:
acquiring signal characteristic parameters of a current frame input signal;
determining a first signal type of the current frame input signal by adopting the signal characteristic parameters, and determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model;
and determining the signal type of the current frame input signal according to the first signal type and the second signal type.
Preferably, the step of determining the first signal type of the current frame input signal by using the signal characteristic parameters and determining the second signal type of the current frame input signal by using the signal characteristic parameters and a preset deep neural network model includes:
determining a first signal type of the current frame input signal by adopting the signal energy value and the voice signal-to-noise ratio;
and determining a second signal type of the current frame input signal in a preset deep neural network model by adopting the perceptual linear prediction parameters and the fundamental frequency.
Preferably, the step of determining the first signal type of the current frame input signal by using the signal energy value and the speech signal-to-noise ratio comprises:
determining a first pre-judging signal type of the current frame input signal by using the signal energy value;
determining a second pre-judging signal type of the current frame input signal by adopting the voice signal-to-noise ratio;
when the first pre-judging signal type and the second pre-judging signal type are both voice signals, determining that the first signal type is a voice signal;
and when either of the first pre-judging signal type and the second pre-judging signal type is a non-voice signal, determining that the first signal type is a non-voice signal.
Preferably, the step of determining the first predicted signal type of the current frame input signal using the signal energy value includes:
judging whether the signal energy value is larger than a preset energy threshold value or not;
if so, determining the type of the first pre-judging signal as a voice signal;
if not, determining that the first pre-judging signal type is a non-voice signal.
Preferably, the step of determining the second pre-determined signal type of the current frame input signal by using the speech signal-to-noise ratio comprises:
calculating the voice existence probability of the current frame input signal by adopting the voice signal-to-noise ratio;
judging whether the voice existence probability is larger than a preset voice existence probability threshold value or not;
if so, determining the type of the second pre-judging signal as a voice signal;
if not, determining that the second pre-judging signal type is a non-voice signal.
Preferably, the step of determining the second signal type of the current frame input signal in a preset deep neural network model by using the perceptual linear prediction parameter and the fundamental frequency includes:
generating input parameters by adopting the perception linear prediction parameters and the fundamental frequency;
calculating the confidence probability that the current frame input signal is a voice signal in a preset deep neural network model by adopting the input parameters;
obtaining the confidence coefficient that the input signal of the previous frame of the current frame of input signal is a voice signal;
calculating the confidence coefficient of the current frame input signal by adopting the confidence coefficient that the previous frame input signal is a voice signal and the confidence probability that the current frame input signal is a voice signal;
judging whether the confidence coefficient of the current frame input signal is greater than a preset confidence coefficient threshold value;
if so, determining that the second signal type is a voice signal;
if not, determining that the second signal type is a non-voice signal.
Preferably, the step of determining the signal type of the current frame input signal according to the first signal type and the second signal type includes:
when the first signal type and the second signal type are both voice signals, determining that the signal type of the current frame input signal is a voice signal;
and when either of the first signal type and the second signal type is a non-speech signal, determining that the signal type of the current frame input signal is a non-speech signal.
The embodiment of the invention discloses a voice recognition method, which comprises the following steps:
determining the signal type of the current frame input signal by adopting a voice activity detection method;
and when the signal type of the current frame input signal is determined to be a voice signal, sending the current frame input signal to a decoder for decoding to obtain text information corresponding to the current frame input signal.
Preferably, the method further comprises the following steps:
when the signal type of the current frame input signal is determined to be a non-voice signal, calculating the duration of the non-voice signal;
resetting the decoder when the duration is greater than a preset time threshold.
Preferably, before the step of sending the current frame input signal to a decoder for decoding, the method further comprises:
pre-processing the input signal, the pre-processing comprising: low frequency de-noising, and/or signal enhancement.
The embodiment of the invention also discloses a voice activity detection device, which comprises:
the characteristic parameter acquisition module is used for acquiring the signal characteristic parameters of the current frame input signal;
the signal type acquisition module is used for determining a first signal type of the current frame input signal by adopting the signal characteristic parameters; determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model;
and the signal type determining module is used for determining the signal type of the current frame input signal according to the first signal type and the second signal type.
Preferably, the signal characteristic parameters include a signal energy value, a speech signal-to-noise ratio, a perceptual linear prediction parameter, and a fundamental frequency, and the signal type obtaining module includes:
a first signal type determining submodule, configured to determine a first signal type of the current frame input signal by using the signal energy value and the speech signal-to-noise ratio;
the second signal type determining submodule is used for determining a second signal type of the current frame input signal in a preset deep neural network model by adopting the perceptual linear prediction parameter and the fundamental frequency;
a voice signal determining sub-module, configured to determine that the signal type of the current frame input signal is a voice signal when the first signal type and the second signal type are both voice signals;
a non-speech signal determination sub-module, configured to determine that the signal type of the current frame input signal is a non-speech signal when a non-speech signal exists in the first signal type and the second signal type.
Preferably, the first signal type determination submodule includes:
a first pre-judging signal type unit, configured to determine a first pre-judging signal type of the current frame input signal by using the signal energy value;
a second pre-judging signal type unit, configured to determine a second pre-judging signal type of the current frame input signal by using the speech signal-to-noise ratio;
a first voice signal determination unit, configured to determine that the first signal type is a voice signal when the first pre-determined signal type and the second pre-determined signal type are both voice signals;
a first non-speech signal determination unit, configured to determine that the first signal type is a non-speech signal when a non-speech signal exists in the first pre-determined signal type and the second pre-determined signal type.
Preferably, the first pre-judging signal type unit includes:
the signal energy value judging subunit is used for judging whether the signal energy value is greater than a preset energy threshold value;
a first voice signal determining subunit, configured to determine that the first pre-determined signal type is a voice signal;
and the first non-voice signal determining subunit is used for determining that the type of the first pre-judging signal is a non-voice signal.
Preferably, the second pre-judging signal type unit includes:
a voice existence probability calculating subunit, configured to calculate a voice existence probability of the current frame input signal by using the voice signal-to-noise ratio;
the voice existence probability judging subunit is used for judging whether the voice existence probability is greater than a preset voice existence probability threshold value or not;
a second voice signal determining subunit, configured to determine that the second pre-determined signal type is a voice signal;
and the second non-voice signal determining subunit is used for determining that the type of the second pre-judging signal is a non-voice signal.
Preferably, the second signal type determination submodule includes:
an input parameter generating unit for generating input parameters using the perceptual linear prediction parameters and the fundamental frequency;
the confidence probability calculation unit is used for calculating the confidence probability that the current frame input signal is a voice signal in a preset deep neural network model by adopting the input parameters;
the confidence coefficient acquisition unit is used for acquiring the confidence coefficient that the input signal of the previous frame of the input signal of the current frame is a voice signal;
the confidence coefficient calculation unit is used for calculating the confidence coefficient of the current frame input signal by adopting the confidence coefficient that the previous frame input signal is a voice signal and the confidence probability that the current frame input signal is the voice signal;
the confidence coefficient judging unit is used for judging whether the confidence coefficient of the current frame input signal is greater than a preset confidence coefficient threshold value;
a second voice signal determination unit, configured to determine that the second signal type is a voice signal;
a second non-speech signal determination unit for determining that the second signal type is a non-speech signal.
The embodiment of the invention discloses a voice recognition device, which comprises:
the voice activity detection module is used for determining the signal type of the current frame input signal by adopting a voice activity detection device;
and the voice recognition module is used for sending the current frame input signal to a decoder for decoding when the signal type of the current frame input signal is determined to be a voice signal, so as to obtain text information corresponding to the current frame input signal.
Preferably, the apparatus further comprises:
the duration calculation module is used for calculating the duration of the non-voice signal when the signal type of the current frame input signal is determined to be a non-voice signal;
and the decoder resetting module is used for resetting the decoder when the duration is greater than a preset time threshold.
Preferably, the apparatus further comprises:
a pre-processing module for pre-processing the input signal, the pre-processing comprising: low frequency de-noising, and/or signal enhancement.
An embodiment of the present invention discloses a voice activity detection apparatus, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by one or more processors include instructions for:
acquiring signal characteristic parameters of a current frame input signal;
determining the signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model;
and when the signal type of the current frame input signal is determined to be a voice signal, sending the current frame input signal to a decoder for decoding to obtain text information corresponding to the current frame input signal.
The embodiment of the invention discloses a voice recognition device, which comprises a memory and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs are configured to be executed by one or more processors and comprise instructions for:
determining the signal type of the current frame input signal by using a voice activity detection apparatus;
and when the signal type of the current frame input signal is determined to be a voice signal, sending the current frame input signal to a decoder for decoding to obtain text information corresponding to the current frame input signal.
Compared with the background art, the embodiment of the invention has the following advantages:
in the embodiment of the invention, the signal characteristic parameters of the current frame input signal are acquired in a voice input mode, the first signal type of the current frame input signal is determined by adopting the signal characteristic parameters, the second signal type of the current frame input signal is determined by adopting the signal characteristic parameters and a preset deep neural network model, and the signal type of the current frame input signal is determined according to the first signal type and the second signal type. The embodiment of the invention determines the first signal type of the input signal on the signal level based on the signal characteristic parameter, determines the second signal type of the input signal on the model level based on the signal characteristic parameter and the preset deep neural network model, and then determines the voice signal and the non-voice signal in the input signal by integrating the first signal type and the second signal type, thereby improving the recognition accuracy of the voice signal and the non-voice signal and enhancing the robustness and the continuity of the voice recognition to the noise environment.
Drawings
FIG. 1 is a flow chart of the steps of embodiment 1 of a voice activity detection method of the present invention;
FIG. 2 is a flow chart of the steps of embodiment 2 of a speech recognition method of the present invention;
FIG. 3 is a block diagram of a voice activity detection apparatus according to embodiment 3 of the present invention;
FIG. 4 is a block diagram of a speech recognition apparatus according to embodiment 4 of the present invention;
FIG. 5 is a block diagram of a voice activity detection apparatus of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of embodiment 1 of a voice activity detection method according to the present invention is shown, which may specifically include the following steps:
step 101, obtaining signal characteristic parameters of the input signal of the current frame.
In the embodiment of the invention, the input signal may be an analog input signal obtained by collecting the user's voice through a voice collecting device in voice input mode, or a digital signal obtained by performing analog-to-digital conversion on that analog signal in the voice collecting device. The input signal in voice input mode includes voice signals and non-voice signals: a voice signal may refer to a signal converted from the user's voice, while a non-voice signal may be a signal generated when the user stops speaking, or a signal generated by environmental noise or by noise from the voice collecting device.
The input signal converted from a human speaker's voice is a quasi-stationary signal, so it is usually framed during processing. Framing may be performed by the voice collecting device, which then transmits the frames so that the decoder receives the input signal frame by frame, or framing may be performed at the front end of the decoder. Since speech recognition is a continuous data processing process, the signal characteristic parameters of a frame can be obtained as soon as that frame of the input signal is received. In the embodiment of the present invention, the signal characteristic parameters of the input signal may include a signal energy value, a speech signal-to-noise ratio, perceptual linear prediction parameters, and the fundamental frequency; of course, other signal characteristic parameters may also be included, such as the short-term zero-crossing rate and the short-term autocorrelation function, which is not limited in the embodiment of the present invention.
The signal energy value represents the strength of a signal and may be the product of the signal's power and time over a period; the energy value of a normal speech signal is greater than that of a non-speech signal.
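For illustration only, a minimal sketch of this energy computation; the frame length and sample rate below are assumptions, not values from the patent:

```python
import numpy as np

def frame_energy(frame: np.ndarray) -> float:
    """Energy of one frame: the sum of squared sample amplitudes."""
    return float(np.sum(frame.astype(np.float64) ** 2))

# Example: 10 ms frames at a 16 kHz sample rate (160 samples) -- assumed values.
rng = np.random.default_rng(0)
speech_like = rng.normal(0.0, 0.3, 160)    # louder, speech-like frame
silence_like = rng.normal(0.0, 0.01, 160)  # near-silent frame
print(frame_energy(speech_like) > frame_energy(silence_like))  # True
```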
The signal-to-noise ratio (SNR) refers to the ratio of the effective power of the signal to the effective power of the noise in an electronic device or electronic system. In the embodiment of the present invention, the signal refers to the input signal collected by the voice collecting device, and the noise refers to the irregular extra signal (or information), not present in the original signal, that is generated after passing through the voice collecting device.
The perceptual linear prediction (PLP) parameters are characteristic parameters based on an auditory model and form a set of coefficients of an all-pole model prediction polynomial. The input signal is subjected to spectrum analysis, critical-band analysis, equal-loudness pre-emphasis, intensity-loudness conversion, and inverse Fourier transform, after which the Levinson-Durbin algorithm computes a 12th-order all-pole model, yielding 16th-order cepstrum coefficients; these are the perceptual linear prediction parameters.
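The Levinson-Durbin recursion at the core of that pipeline can be sketched as below; this is the generic textbook form, not the patent's exact PLP procedure, and the final conversion to cepstrum coefficients is omitted:

```python
import numpy as np

def levinson_durbin(r: np.ndarray, order: int) -> np.ndarray:
    """Solve the all-pole model normal equations from autocorrelation r
    via the Levinson-Durbin recursion; a[0] == 1 by convention."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                  # reflection coefficient
        a[1:i] += k * a[1:i][::-1]      # update lower-order coefficients
        a[i] = k
        err *= (1.0 - k * k)            # prediction error shrinks each order
    return a

# Example: a 12th-order all-pole model from a frame's autocorrelation.
x = np.random.default_rng(1).normal(size=400)
r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(13)])
a = levinson_durbin(r, 12)
```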
The fundamental frequency, i.e., the frequency of the fundamental tone, determines the pitch of human speech; in the embodiment of the present invention, it may refer to the fundamental frequency of the speech or non-speech signal present in a frame of the input signal. The fundamental frequency can be extracted by the autocorrelation method, the cepstrum method, the average magnitude difference function method, the linear prediction method, the wavelet-autocorrelation method, the spectral subtraction-autocorrelation method, and the like.
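As a rough illustration of the autocorrelation method named above (the search range and sample rate are assumptions, and real extractors add voicing checks this sketch omits):

```python
import numpy as np

def estimate_f0_autocorr(frame: np.ndarray, fs: int,
                         fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Crude F0 estimate: peak of the autocorrelation within the pitch range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return fs / lag

# Example: a 200 Hz sine should yield roughly 200 Hz (assumed 16 kHz rate).
fs = 16000
t = np.arange(int(0.03 * fs)) / fs
print(round(estimate_f0_autocorr(np.sin(2 * np.pi * 200 * t), fs)))  # 200
```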
And step 102, determining a first signal type of the current frame input signal by using the signal characteristic parameters, and determining a second signal type of the current frame input signal by using the signal characteristic parameters and a preset deep neural network model.
After signal characteristic parameters such as a signal energy value, a voice signal-to-noise ratio, a perceptual linear prediction parameter, a fundamental frequency and the like of the current frame input signal are obtained, the signal energy value, the voice signal-to-noise ratio, the perceptual linear prediction parameter, the fundamental frequency and a preset deep neural network model can be adopted to determine whether the signal type of the current frame input signal is a voice signal or a non-voice signal.
In a preferred embodiment of the present invention, determining the signal type of the current frame input signal by using the signal characteristic parameter and a preset deep neural network model includes the following steps:
and a sub-step S11 of determining a first signal type of the input signal of the current frame by using the signal energy value and the speech signal-to-noise ratio.
In a preferred example of the embodiment of the present invention, step S11 may include the following sub-steps:
and a substep S111, determining a first pre-determined signal type of the current frame input signal by using the signal energy value.
In practical applications, the speech signal is non-stationary as a whole but may be considered stationary over a short time; for example, a frame with a length of 10 ms may be regarded as a stationary signal. Each frame of the input signal has a signal energy value, and whether the input signal is a speech signal or a non-speech signal can be determined from that value. Specifically, step S111 may include the following sub-steps:
substep S111-1, judging whether the signal energy value is greater than a preset energy threshold value, if so, executing substep S111-2, and if not, executing substep S111-3;
substep S111-2, determining the type of the first pre-judging signal as a voice signal;
and a substep S111-3 of determining that the first predetermined signal type is a non-speech signal.
In the embodiment of the invention, the input signal is generated by collecting a person's voice through the voice collecting device. The energy value of the input signal generated when a person speaks is generally greater than that of the input signal generated during pause intervals or by background noise; therefore, whether the current frame input signal is a speech signal or a non-speech signal can be determined from the signal energy value. Specifically, the signal energy of the background noise, measured while the user is not speaking in a low-noise environment, can be used as the preset energy threshold. When the signal energy value of the received current frame input signal is greater than this preset energy threshold, the first pre-judging signal type of the current frame input signal may be determined to be a speech signal; otherwise it is determined to be a non-speech signal. In this way, silence and low-noise portions of the input signal can be identified quickly from the signal energy value.
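A minimal sketch of this first pre-judgment, assuming the threshold is calibrated from background-noise frames as described (the measurements below are hypothetical):

```python
def first_prejudging_type(energy: float, energy_threshold: float) -> str:
    """Signal-level pre-judgment: energy above the background-noise
    threshold is taken as speech, otherwise as non-speech."""
    return "speech" if energy > energy_threshold else "non-speech"

# Threshold calibrated from frames recorded while the user is silent in a
# low-noise environment (hypothetical energy measurements).
background_energies = [0.8, 1.1, 0.9]
energy_threshold = max(background_energies)
print(first_prejudging_type(5.0, energy_threshold))  # speech
print(first_prejudging_type(0.5, energy_threshold))  # non-speech
```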
And a sub-step S112, determining a second pre-determined signal type of the current frame input signal by using the speech signal-to-noise ratio.
In a preferred example of the embodiment of the present invention, step S112 may include the following sub-steps:
substep S112-1, adopting the speech signal-to-noise ratio to calculate the speech existence probability of the current frame input signal;
a substep S112-2 of judging whether the voice existence probability is greater than a preset voice existence probability threshold value, if so, executing the substep S112-3, and if not, executing the substep S112-4;
substep S112-3, determining the second pre-determined signal type as a speech signal;
and a substep S112-4 of determining the second predetermined signal type to be a non-speech signal.
In the embodiment of the present invention, the speech signal-to-noise ratio of the current frame input signal can be used to calculate its speech existence probability. When the speech existence probability is greater than the preset speech existence probability threshold, the second pre-judging signal type of the current frame input signal is determined to be a speech signal; otherwise it is determined to be a non-speech signal. Specifically, the signal-to-noise ratio of the signal generated when the person is not speaking can be obtained, the existence probability of the non-speech signal can be calculated from this signal-to-noise ratio, and subtracting that probability from 1 yields the speech existence probability threshold. Of course, the speech signal-to-noise ratio can also be used directly to decide whether the second pre-judging signal type of the current frame input signal is a speech signal or a non-speech signal; for example, when the signal-to-noise ratio is greater than a preset signal-to-noise ratio threshold, it is determined to be a speech signal, and otherwise a non-speech signal.
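A sketch of this second pre-judgment; the patent does not specify how the speech existence probability is derived from the signal-to-noise ratio, so the logistic mapping below is purely a placeholder assumption:

```python
import math

def speech_presence_probability(snr_db: float) -> float:
    """Map SNR (dB) to a speech existence probability. The exact mapping is
    not given in the text; a logistic curve stands in as a placeholder."""
    return 1.0 / (1.0 + math.exp(-0.5 * (snr_db - 5.0)))

def second_prejudging_type(snr_db: float, prob_threshold: float) -> str:
    return ("speech" if speech_presence_probability(snr_db) > prob_threshold
            else "non-speech")

print(second_prejudging_type(15.0, 0.6))  # speech
print(second_prejudging_type(0.0, 0.6))   # non-speech
```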
Substep S113, determining that the first signal type is a speech signal when the first pre-determined signal type and the second pre-determined signal type are both speech signals.
After the first pre-judging signal type and the second pre-judging signal type of the current frame input signal are determined, the first signal type of the current frame input signal can be determined from them: when the first pre-judging signal type and the second pre-judging signal type are both voice signals, the first signal type is determined to be a voice signal.
And a substep S114, determining that the first signal type is a non-speech signal when there is a non-speech signal in the first pre-determined signal type and the second pre-determined signal type.
In the embodiment of the present invention, the determination of the first pre-judging signal type and the determination of the second pre-judging signal type may be performed simultaneously, or one may be performed first. For example, the step of determining the first pre-judging signal type may be performed first, and whether to perform the step of determining the second pre-judging signal type is then decided according to the result: when the first pre-judging signal type is a speech signal, the step of determining the second pre-judging signal type is performed; otherwise it is not. When either of the first pre-judging signal type and the second pre-judging signal type is a non-speech signal, the first signal type is determined to be a non-speech signal.
In the embodiment of the invention, the first pre-judging signal type of the current frame input signal is determined based on the signal energy value of the current frame input signal, the second pre-judging signal type of the current frame input signal is determined based on the voice signal-to-noise ratio of the current frame input signal, when the first pre-judging signal type and the second pre-judging signal type are both voice signals, the first signal type of the current frame input signal is determined to be a voice signal, otherwise, the first signal type is a non-voice signal, and the signal type is determined by adopting two stages, so that the accuracy of the first signal type judgment is improved.
According to the embodiment of the invention, the signal type of the input signal is determined through the signal energy value and the signal-to-noise ratio of the signal, the voice signal and the non-voice signal in the input signal can be rapidly identified, and the identification efficiency is improved.
And a substep S12, determining a second signal type of the current frame input signal in a preset deep neural network model by using the perceptual linear prediction parameter and the fundamental frequency.
In the embodiment of the invention, large-scale speech data and non-speech data are collected, the non-speech data including silence data and noise data, and a classification model is trained with a deep neural network. The perceptual linear prediction parameters and the fundamental frequency of the current frame input signal are input into this classification model, and the confidence probability that the current frame input signal is a speech signal or a non-speech signal can be calculated.
In a preferred embodiment of the present invention, step S12 may include the following sub-steps:
and a substep S121 of generating input parameters using the perceptual linear prediction parameters and the fundamental frequency.
In the embodiment of the present invention, the deep neural network model is a multidimensional analysis model with a plurality of layers, each layer having corresponding input and output interfaces. At the input layer, input parameters are required for calculation, so the perceptual linear prediction parameters and the fundamental frequency of the current frame input signal can be combined to generate a multidimensional input parameter for the operations in the deep neural network model. For example, the input parameter may be a 123-dimensional vector, of which 120 dimensions belong to the perceptual linear prediction parameters and 3 dimensions belong to the fundamental frequency.
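A minimal sketch of assembling that 123-dimensional input vector; how the 120 PLP dimensions are composed internally is not specified in the text, so placeholders stand in for real features:

```python
import numpy as np

# 123-dimensional input: 120 dims of perceptual linear prediction features
# plus 3 dims of fundamental-frequency features, per the description above.
plp_part = np.random.default_rng(2).normal(size=120)  # placeholder PLP features
f0_part = np.array([200.0, 198.5, 201.2])             # hypothetical F0 features (Hz)
dnn_input = np.concatenate([plp_part, f0_part])
assert dnn_input.shape == (123,)
```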
And a substep S122 of calculating the confidence probability that the current frame input signal is a voice signal in a preset deep neural network model by using the input parameters.
In the embodiment of the present invention, the obtained perceptual linear prediction parameter of the current frame input signal and the fundamental frequency may be synthesized to generate an input parameter, the input parameter is input into a preset deep neural network model, and the confidence probability that the current frame input signal is a speech signal is calculated, or of course, the confidence probability that the current frame input signal is a non-speech signal may be calculated.
And a substep S123 of obtaining a confidence that the input signal of the previous frame of the current frame of the input signal is a speech signal.
When voice is input, the input signal generated by human speech is a data stream; each frame of the input signal is not independent, and the current frame input signal has a certain correlation with the previous frame input signal. Therefore, to evaluate the reliability of the confidence probability of the current frame input signal, i.e., its confidence, the calculation can incorporate the confidence of the previous frame input signal; accordingly, the confidence that the previous frame input signal is a speech signal can be obtained.
And a substep S124, calculating the confidence level of the current frame input signal by using the confidence level that the previous frame input signal is a speech signal and the confidence probability that the current frame input signal is a speech signal.
In one example of the embodiment of the present invention, the confidence of the current frame input signal may be calculated by the following formula:
S(t2)=α×S(t1)+(1-α)×P(t2)
where S(t1) is the confidence of the previous frame input signal, S(t2) is the confidence of the current frame input signal, and P(t2) is the confidence probability that the current frame input signal is a speech signal; α is a smoothing coefficient with 0 ≤ α ≤ 1. In the formula above, the closer the smoothing coefficient α is to 1, the stronger the association between the confidence of the current frame input signal and that of the previous frame. In practical applications, the confidence is calculated for every frame: when the previous frame is a speech signal with high confidence, α may be set closer to 1 when calculating the confidence of the next adjacent frame; otherwise α may be set closer to 0.
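A direct transcription of this smoothing formula, with an illustrative example (values assumed) showing how a confident speech history damps one noisy frame:

```python
def smoothed_confidence(prev_confidence: float,
                        frame_speech_prob: float,
                        alpha: float) -> float:
    """S(t2) = alpha * S(t1) + (1 - alpha) * P(t2), with 0 <= alpha <= 1."""
    assert 0.0 <= alpha <= 1.0
    return alpha * prev_confidence + (1.0 - alpha) * frame_speech_prob

# A confident speech history damps a single low-probability frame.
print(smoothed_confidence(prev_confidence=0.9, frame_speech_prob=0.2, alpha=0.8))
# 0.76 -- still above a 0.5 threshold despite the outlier frame
```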
And a substep S125, determining whether the confidence of the current frame input signal is greater than a preset confidence threshold, if so, executing a substep S126, and if not, executing a substep S127.
In the embodiment of the present invention, a confidence threshold may be preset, for example a confidence threshold for deciding whether an input signal is a speech signal or a non-speech signal, so that the input signal can be judged against it. When the confidence of the current frame input signal is greater than the preset confidence threshold, the second signal type of the current frame input signal is determined to be a speech signal; otherwise it is determined to be a non-speech signal.
Substep S126, determining the second signal type to be a speech signal;
and a substep S127 of determining the second signal type to be a non-speech signal.
In the embodiment of the invention, the second signal type of the current frame input signal is determined with reference to the confidence of the previous frame input signal, which improves the accuracy of identifying speech signals and non-speech signals and avoids the false alarms or missed detections that can occur when speech and non-speech signals are determined from the confidence probability alone.
Step 103, determining the signal type of the current frame input signal according to the first signal type and the second signal type.
In an embodiment of the present invention, step 103 may include the following sub-steps:
and a substep S21 of determining the signal type of the current frame input signal as a speech signal when the first signal type and the second signal type are both speech signals.
In the embodiment of the present invention, the determination of the first signal type and that of the second signal type may be performed simultaneously, or the first signal type may be determined first and its result used to decide whether to perform the step of determining the second signal type; for example, when the first signal type is a speech signal, the step of determining the second signal type is performed, and otherwise it is not. When the first signal type and the second signal type are both speech signals, the signal type of the current frame input signal is determined to be a speech signal.
And a sub-step S22 of determining the signal type of the current frame input signal as a non-speech signal when the non-speech signal exists in the first signal type and the second signal type.
When the first signal type is a non-speech signal, the current frame input signal may be determined to be a non-speech signal directly; alternatively, when the first signal type is a speech signal and, after the step of determining the second signal type is performed, the second signal type is a non-speech signal, the current frame input signal is also determined to be a non-speech signal. That is, when either of the first signal type and the second signal type is a non-speech signal, the signal type of the current frame input signal is determined to be a non-speech signal.
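A minimal sketch of this combination rule:

```python
def final_signal_type(first_type: str, second_type: str) -> str:
    """The current frame is speech only when both the signal-level decision
    and the model-level decision say speech; otherwise it is non-speech."""
    if first_type == "speech" and second_type == "speech":
        return "speech"
    return "non-speech"

# As described above, the model-level check may be skipped entirely when
# the signal-level decision already says non-speech.
print(final_signal_type("speech", "speech"))      # speech
print(final_signal_type("speech", "non-speech"))  # non-speech
```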
In the embodiment of the present invention, through step 102, the signal energy value and the speech signal-to-noise ratio of the input signal are used to determine the signal type at the signal level, while the perceptual linear prediction parameters and the fundamental frequency are used in the preset deep neural network model to determine the signal type at the model level.
The method comprises the steps of obtaining signal characteristic parameters of input signals of a current frame in a voice input mode, determining a first signal type of the input signals of the current frame by adopting the signal characteristic parameters, determining a second signal type of the input signals of the current frame by adopting the signal characteristic parameters and a preset deep neural network model, and determining the signal type of the input signals of the current frame according to the first signal type and the second signal type. The embodiment of the invention determines the first signal type of the input signal on the signal level based on the signal characteristic parameter, determines the second signal type of the input signal on the model level based on the signal characteristic parameter and the preset deep neural network model, and then determines the voice signal and the non-voice signal in the input signal by integrating the first signal type and the second signal type, thereby improving the recognition accuracy of the voice signal and the non-voice signal and enhancing the robustness and the continuity of the voice recognition to the noise environment.
Referring to fig. 2, a flowchart of the steps of an embodiment 2 of the speech recognition method according to the present invention is shown.
The voice recognition method of the embodiment of the invention determines the type of the input signal by using the voice activity detection method of embodiment 1, and may specifically include the following steps:
step 201, determining the signal type of the current frame input signal by using a voice activity detection method.
In the embodiment of the present invention, the voice activity detection method as described in embodiment 1 may be used to determine the signal type of the input signal of the current frame, which is not described herein again.
Step 202, when the signal type of the current frame input signal is determined to be a voice signal, sending the current frame input signal to a decoder for decoding, and obtaining text information corresponding to the current frame input signal.
After the signal type of the current frame input signal is determined to be a voice signal, the current frame input signal is sent to a decoder for decoding to obtain the text information corresponding to the current frame input signal. The decoding process for the speech signal may be understood as its recognition process, which obtains one or more pieces of text information for the received speech signal under the guidance of an Acoustic Model (AM) and a Language Model (LM).
An Acoustic Model (AM) is the bottommost part of an automatic speech recognition system and also its most critical component; the quality of acoustic modeling directly and fundamentally affects the recognition performance and robustness of the system. The acoustic model uses probability statistics to model the basic speech units that carry acoustic information and to describe their statistical characteristics. Through acoustic modeling, the similarity between the feature vector sequence of the speech and each pronunciation template can be effectively measured, and the acoustic content of the speech can be judged. A speaker's speech content is composed of basic speech units, which may be sentences, phrases, words, syllables, sub-syllables, or phonemes.
Due to the time-varying nature of speech signals, noise, and other instability factors, high recognition accuracy cannot be achieved with acoustic models alone. In human language, the words of a sentence are directly and closely related; word-level information can reduce the search range over the acoustic model and effectively improve recognition accuracy. A language model is necessary for this task, providing context and semantic information between words. The Language Model (LM) may specifically include the N-gram model, the Markov N-gram, exponential models, decision tree models, and so forth. The N-gram model is the most commonly used statistical language model, in particular the bigram and the trigram.
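As a minimal illustration of such statistical language models (not part of the patent), a bigram model estimates each word's probability from its predecessor by relative counts:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) estimated by relative frequency (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 2/3
```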
In a preferred embodiment of the present invention, before sending the input signal to the decoder for decoding, the method further comprises:
pre-processing an input signal, the pre-processing comprising: low frequency de-noising, and/or signal enhancement.
Various noises may exist in the user's voice input environment: for example, the noise of an air conditioner in an office, the low-frequency noise of a car engine when the user performs voice input on the road, or the signal noise generated when a voice collecting device such as a microphone processes the signal. If such an input signal were sent directly into the decoder, the accuracy of the decoding result could be affected. Therefore, before the input signal enters the decoder, low-frequency denoising is first applied to eliminate the various low-frequency noises. Meanwhile, the input signal may be weak because the user speaks quietly or because of environmental factors or the hardware performance of the voice collecting device; its strength can be enhanced by an amplitude enhancement technique. After the input signal is preprocessed in this way, its noise immunity is improved, and the recognition accuracy during decoding can be improved.
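A hedged sketch of such preprocessing, using a high-pass Butterworth filter for low-frequency denoising and a simple gain for signal enhancement; the cutoff frequency and gain are illustrative assumptions, not values from the patent:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def preprocess(signal: np.ndarray, fs: int,
               cutoff_hz: float = 100.0, gain: float = 2.0) -> np.ndarray:
    """Low-frequency denoising with a high-pass Butterworth filter,
    followed by simple amplitude enhancement (assumed parameters)."""
    sos = butter(4, cutoff_hz, btype="highpass", fs=fs, output="sos")
    filtered = sosfilt(sos, signal)
    return np.clip(gain * filtered, -1.0, 1.0)  # guard against clipping
```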
In a preferred embodiment of the present invention, the method of speech recognition further comprises:
when the signal type of the current frame input signal is determined to be a non-speech signal, calculating the duration of the non-speech signal.
When a user speaks, pauses occur, and the input signal generated during a pause is a non-speech signal. A pause may fall within a single frame of the input signal or span several consecutive frames; for example, within a frame whose length is 1 second, 0.2 second may be non-speech time, or, when the frame length is 10 milliseconds, several consecutive frames may all be non-speech signals. Therefore, when a non-speech signal is determined, the duration of the non-speech signal is calculated.
Resetting the decoder when the duration is greater than a preset time threshold.
During operation, the decoder continuously receives the input signal, decodes it, and outputs the decoding result; it does not decode non-speech signals. When the duration of a non-speech signal exceeds the preset time threshold, the decoder can be reset and its data released, for example by emptying the cache, which addresses the problem that decoder resources are limited and cannot support uninterrupted long-time decoding with real-time output. For example, when the user says "Hello everyone, my name is Li, very glad to meet you", if the pause after "Hello everyone" is determined to be a non-speech signal and its duration exceeds the pause threshold, the speech segment is judged to have ended, decoding is completed, and the decoder is reset.
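A minimal sketch of tracking the non-speech duration and deciding when to reset the decoder; the frame length and time threshold below are assumptions:

```python
class PauseTracker:
    """Track the duration of consecutive non-speech frames and signal when
    the decoder should be reset (assumed frame length and threshold)."""

    def __init__(self, frame_ms: float = 10.0, threshold_ms: float = 500.0):
        self.frame_ms = frame_ms
        self.threshold_ms = threshold_ms
        self.silence_ms = 0.0

    def update(self, signal_type: str) -> bool:
        """Return True when the decoder should be reset (cache released)."""
        if signal_type == "speech":
            self.silence_ms = 0.0
            return False
        self.silence_ms += self.frame_ms
        return self.silence_ms > self.threshold_ms

tracker = PauseTracker()
# 60 consecutive non-speech frames = 600 ms > 500 ms threshold -> reset
print(any(tracker.update("non-speech") for _ in range(60)))  # True
```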
The user can set the time threshold according to the actual situation, for example, according to the speaking speed, speaking habit, language type, etc. of the user.
The preferred embodiments of the present invention described above are optional; whether they need to be executed may be decided as the case requires.
In the embodiment of the invention, when the input signal is determined to be the non-voice signal, the duration time of the non-voice signal is calculated, and when the duration time is greater than the preset time threshold, the decoder is reset, so that the limitation of decoder resources is avoided in the voice recognition process, and long-time voice recognition can be realized.
In the embodiment of the invention, the first signal type of the input signal is determined on the signal level based on the signal characteristic parameter, the second signal type of the input signal is determined on the model level based on the signal characteristic parameter and the preset deep neural network model, and then the voice signal and the non-voice signal in the input signal are determined by integrating the first signal type and the second signal type, so that the recognition accuracy of the voice signal and the non-voice signal is improved, and the robustness and the continuity of the voice recognition to the noise environment are enhanced.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 3, a block diagram of a voice activity detection apparatus in embodiment 3 of the present invention is shown, which may specifically include the following modules:
a characteristic parameter obtaining module 301, configured to obtain a signal characteristic parameter of the current frame input signal;
a signal type obtaining module 302, configured to determine a first signal type of the current frame input signal by using the signal characteristic parameter; determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model;
a signal type determining module 303, configured to determine a signal type of the current frame input signal according to the first signal type and the second signal type.
In a preferred embodiment of the present invention, the signal type obtaining module 302 includes:
a first signal type determining submodule, configured to determine a first signal type of the current frame input signal by using the signal energy value and the speech signal-to-noise ratio;
the second signal type determining submodule is used for determining a second signal type of the current frame input signal in a preset deep neural network model by adopting the perceptual linear prediction parameter and the fundamental frequency;
a voice signal determining sub-module, configured to determine that the signal type of the current frame input signal is a voice signal when the first signal type and the second signal type are both voice signals;
a non-speech signal determination sub-module, configured to determine that the signal type of the current frame input signal is a non-speech signal when a non-speech signal exists in the first signal type and the second signal type.
In a preferred embodiment of the present invention, the first signal type determination submodule includes:
a first pre-judging signal type unit, configured to determine a first pre-judging signal type of the current frame input signal by using the signal energy value;
a second pre-judging signal type unit, configured to determine a second pre-judging signal type of the current frame input signal by using the speech signal-to-noise ratio;
a first voice signal determination unit, configured to determine that the first signal type is a voice signal when the first pre-determined signal type and the second pre-determined signal type are both voice signals;
a first non-speech signal determination unit, configured to determine that the first signal type is a non-speech signal when a non-speech signal exists in the first pre-determined signal type and the second pre-determined signal type.
In a preferred embodiment of the present invention, the first pre-judging signal type unit includes:
the signal energy value judging subunit is used for judging whether the signal energy value is greater than a preset energy threshold value;
a first voice signal determining subunit, configured to determine that the first pre-determined signal type is a voice signal;
and the first non-voice signal determining subunit is used for determining that the type of the first pre-judging signal is a non-voice signal.
The second pre-judging signal type unit includes:
a voice existence probability calculating subunit, configured to calculate a voice existence probability of the current frame input signal by using the voice signal-to-noise ratio;
the voice existence probability judging subunit is used for judging whether the voice existence probability is greater than a preset voice existence probability threshold value or not;
a second voice signal determining subunit, configured to determine that the second pre-determined signal type is a voice signal;
and the second non-voice signal determining subunit is used for determining that the type of the second pre-judging signal is a non-voice signal.
In a preferred embodiment of the present invention, the second signal type determination submodule includes:
an input parameter generating unit for generating input parameters using the perceptual linear prediction parameters and the fundamental frequency;
a confidence probability calculation unit, configured to calculate, in a preset deep neural network model, the confidence probability that the current frame input signal is a voice signal by adopting the input parameters;
a confidence coefficient acquisition unit, configured to acquire the confidence coefficient that the previous frame input signal of the current frame input signal is a voice signal;
a confidence coefficient calculation unit, configured to calculate the confidence coefficient of the current frame input signal by adopting the confidence coefficient that the previous frame input signal is a voice signal and the confidence probability that the current frame input signal is a voice signal;
a confidence coefficient judging unit, configured to judge whether the confidence coefficient of the current frame input signal is greater than a preset confidence coefficient threshold value;
a second voice signal determination unit, configured to determine, if so, that the second signal type is a voice signal;
a second non-voice signal determination unit, configured to determine, if not, that the second signal type is a non-voice signal.
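For illustration, a sketch of this model-level branch follows. The patent requires only that the previous frame's confidence coefficient and the current frame's confidence probability be combined; the model object, its predict_proba method, the exponential-averaging rule, and all constants are assumptions.

    import numpy as np

    CONFIDENCE_THRESHOLD = 0.6  # stands in for the "preset confidence coefficient threshold"
    ALPHA = 0.8                 # assumed smoothing weight on the previous frame's confidence

    def second_signal_type_is_speech(dnn, plp: np.ndarray, f0: float, prev_confidence: float):
        """Model-level decision: DNN posterior smoothed with the previous frame's confidence."""
        # Concatenate the perceptual linear prediction coefficients and the
        # fundamental frequency into one input vector for the network.
        inputs = np.append(plp, f0)
        confidence_probability = float(dnn.predict_proba(inputs))  # hypothetical model API
        # Combine with the previous frame's confidence; exponential averaging
        # is one plausible combination rule.
        confidence = ALPHA * prev_confidence + (1.0 - ALPHA) * confidence_probability
        return confidence > CONFIDENCE_THRESHOLD, confidence

Carrying the smoothed confidence across frames keeps the per-frame decision from flickering on short bursts of noise.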
Referring to fig. 4, a block diagram of embodiment 4 of a speech recognition apparatus of the present invention is shown, which may specifically include the following modules:
a voice activity detection module 401, configured to determine a signal type of the current frame input signal by using a voice activity detection apparatus;
and the speech recognition module 402 is configured to, when it is determined that the signal type of the current frame input signal is a speech signal, send the current frame input signal to a decoder for decoding, so as to obtain text information corresponding to the current frame input signal.
In a preferred embodiment of the present invention, the apparatus further comprises:
a duration calculation module, configured to calculate the duration of the non-voice signal when the signal type of the current frame input signal is determined to be a non-voice signal;
a decoder reset module, configured to reset the decoder when the duration is greater than a preset time threshold.
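A minimal sketch of this reset behaviour; the frame length, the threshold, and the decoder.reset() call are all assumptions made for illustration:

    FRAME_MS = 10          # assumed frame duration
    RESET_AFTER_MS = 500   # stands in for the "preset time threshold"

    class DecoderGate:
        """Accumulate consecutive non-voice time and reset the decoder on a long pause."""

        def __init__(self, decoder):
            self.decoder = decoder
            self.nonspeech_ms = 0

        def on_frame(self, is_speech: bool) -> None:
            if is_speech:
                self.nonspeech_ms = 0  # any voice frame ends the pause
                return
            self.nonspeech_ms += FRAME_MS
            if self.nonspeech_ms > RESET_AFTER_MS:
                self.decoder.reset()   # hypothetical decoder API
                self.nonspeech_ms = 0

Resetting after a long pause discards stale decoder state between utterances, so recognition of the next utterance starts clean.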
In a preferred embodiment of the present invention, the speech recognition apparatus further comprises:
a pre-processing module for pre-processing the input signal, the pre-processing comprising: low frequency de-noising, and/or signal enhancement.
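For illustration, one way the pre-processing module could be realized; the filter order, cutoff, and gain are assumptions, since the patent names only low frequency de-noising and signal enhancement:

    import numpy as np
    from scipy.signal import butter, lfilter

    def preprocess(signal: np.ndarray, fs: int = 16000,
                   cutoff_hz: float = 80.0, gain: float = 1.5) -> np.ndarray:
        """Low frequency de-noising via a Butterworth high-pass, then a simple gain."""
        # A 4th-order high-pass suppresses low-frequency rumble below cutoff_hz.
        b, a = butter(4, cutoff_hz, btype="highpass", fs=fs)
        filtered = lfilter(b, a, signal)
        return gain * filtered  # crude amplitude boost standing in for "signal enhancement"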
Since the device embodiments are substantially similar to the method embodiments, they are described more briefly; for the relevant details, refer to the corresponding parts of the method embodiment description.
Fig. 5 is a block diagram illustrating a voice activity detection apparatus 500 according to an example embodiment. For example, the apparatus 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 500 may include one or more of the following components: processing component 502, memory 504, power component 506, multimedia component 508, audio component 510, input/output (I/O) interface 512, sensor component 514, and communication component 516.
The processing component 502 generally controls overall operation of the device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operations at the apparatus 500. Examples of such data include instructions for any application or method operating on device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 506 provides power to the various components of the device 500. The power components 506 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 500.
The multimedia component 508 includes a screen that provides an output interface between the device 500 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 508 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 500 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zooming capabilities.
The audio component 510 is configured to output and/or input audio signals. For example, audio component 510 includes a Microphone (MIC) configured to receive external audio signals when apparatus 500 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 further includes a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 514 includes one or more sensors for providing various aspects of status assessment for the device 500. For example, the sensor assembly 514 may detect an open/closed state of the device 500 and the relative positioning of components, such as the display and keypad of the device 500. The sensor assembly 514 may also detect a change in the position of the device 500 or a component of the device 500, the presence or absence of user contact with the device 500, the orientation or acceleration/deceleration of the device 500, and a change in the temperature of the device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate communication between the device 500 and other devices in a wired or wireless manner. The device 500 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, executable by the processor 520 of the apparatus 500 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a voice activity detection method, the method comprising:
acquiring signal characteristic parameters of a current frame input signal;
determining a first signal type of the current frame input signal by adopting the signal characteristic parameters, and determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model;
and determining the signal type of the current frame input signal according to the first signal type and the second signal type.
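For illustration, a sketch of how the feature-acquisition step above might look. The librosa calls are used for framing and fundamental frequency; the PLP extraction is delegated to an assumed helper compute_plp, since the patent does not specify an extraction routine, and the frame sizes are assumptions:

    import numpy as np
    import librosa

    def acquire_features(y: np.ndarray, sr: int = 16000,
                         frame_len: int = 400, hop: int = 160):
        """Split the signal into frames and gather per-frame characteristic parameters."""
        frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop).T
        energies = np.mean(frames.astype(np.float64) ** 2, axis=1)  # signal energy values
        # Fundamental frequency per frame; NaN marks unvoiced frames and may be
        # replaced by 0.0 before feeding the network.
        f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr,
                                frame_length=frame_len, hop_length=hop, center=False)
        # plp = compute_plp(frames, sr)  # assumed helper; PLP extraction is not specified here
        return frames, energies, f0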
Optionally, the signal characteristic parameters include a signal energy value, a speech signal-to-noise ratio, a perceptual linear prediction parameter and a fundamental frequency, and the determining the first signal type of the current frame input signal by adopting the signal characteristic parameters, and determining the second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model includes:
determining a first signal type of the current frame input signal by adopting the signal energy value and the voice signal-to-noise ratio;
and determining a second signal type of the current frame input signal in a preset deep neural network model by adopting the perceptual linear prediction parameters and the fundamental frequency.
Optionally, the determining the first signal type of the current frame input signal by using the signal energy value and the speech signal-to-noise ratio includes:
determining a first pre-judging signal type of the current frame input signal by using the signal energy value;
determining a second pre-judging signal type of the current frame input signal by adopting the voice signal-to-noise ratio;
when the first pre-judging signal type and the second pre-judging signal type are both voice signals, determining that the first signal type is a voice signal;
and when either the first pre-judging signal type or the second pre-judging signal type is a non-voice signal, determining that the first signal type is a non-voice signal.
Optionally, the determining a first pre-judging signal type of the current frame input signal by using the signal energy value includes:
judging whether the signal energy value is greater than a preset energy threshold value;
if so, determining the type of the first pre-judging signal as a voice signal;
if not, determining that the first pre-judging signal type is a non-voice signal.
Optionally, the determining the second pre-judging signal type of the current frame input signal by using the speech signal-to-noise ratio includes:
calculating the voice existence probability of the current frame input signal by adopting the voice signal-to-noise ratio;
judging whether the voice existence probability is greater than a preset voice existence probability threshold value;
if so, determining the type of the second pre-judging signal as a voice signal;
if not, determining that the second pre-judging signal type is a non-voice signal.
Optionally, the determining, in a preset deep neural network model, a second signal type of the current frame input signal by using the perceptual linear prediction parameter and the fundamental frequency includes:
generating input parameters by adopting the perceptual linear prediction parameters and the fundamental frequency;
calculating the confidence probability that the current frame input signal is a voice signal in a preset deep neural network model by adopting the input parameters;
obtaining the confidence coefficient that the previous frame input signal of the current frame input signal is a voice signal;
calculating the confidence coefficient of the current frame input signal by adopting the confidence coefficient that the previous frame input signal is a voice signal and the confidence probability that the current frame input signal is a voice signal;
judging whether the confidence coefficient of the current frame input signal is greater than a preset confidence coefficient threshold value;
if so, determining that the second signal type is a voice signal;
if not, determining that the second signal type is a non-voice signal.
Optionally, the determining the signal type of the current frame input signal according to the first signal type and the second signal type includes:
when the first signal type and the second signal type are both voice signals, determining that the signal type of the current frame input signal is a voice signal;
and when either the first signal type or the second signal type is a non-speech signal, determining that the signal type of the current frame input signal is a non-speech signal.
In another exemplary embodiment, the instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a speech recognition method, the method comprising:
determining the signal type of the current frame input signal by using the voice activity detection apparatus described above,
and when the signal type of the current frame input signal is determined to be a voice signal, sending the current frame input signal to a decoder for decoding to obtain text information corresponding to the current frame input signal.
Optionally, the method further comprises:
when the signal type of the current frame input signal is determined to be a non-voice signal, calculating the duration of the non-voice signal;
resetting the decoder when the duration is greater than a preset time threshold.
Optionally, before the step of sending the current frame input signal to a decoder for decoding, the method further includes:
pre-processing the input signal, the pre-processing comprising: low frequency de-noising, and/or signal enhancement.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (20)

1. A method of voice activity detection, comprising:
acquiring signal characteristic parameters of a current frame input signal;
determining a first signal type of the current frame input signal by adopting the signal characteristic parameters, and determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model; the first signal type is a voice signal or a non-voice signal, and the second signal type is a voice signal or a non-voice signal;
determining the signal type of the current frame input signal according to the first signal type and the second signal type;
wherein the signal characteristic parameters comprise a signal energy value, a speech signal-to-noise ratio, a perceptual linear prediction parameter and a fundamental frequency, and the determining a first signal type of the current frame input signal by adopting the signal characteristic parameters, and determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model comprises:
determining a first signal type of the current frame input signal by adopting the signal energy value and the voice signal-to-noise ratio;
generating input parameters by adopting the perceptual linear prediction parameters and the fundamental frequency;
calculating the confidence probability that the current frame input signal is a voice signal in a preset deep neural network model by adopting the input parameters;
obtaining the confidence coefficient that the previous frame input signal of the current frame input signal is a voice signal;
calculating the confidence coefficient of the current frame input signal by adopting the confidence coefficient that the previous frame input signal is a voice signal and the confidence probability that the current frame input signal is a voice signal;
judging whether the confidence coefficient of the current frame input signal is greater than a preset confidence coefficient threshold value;
if so, determining that the second signal type is a voice signal;
if not, determining that the second signal type is a non-voice signal.
2. The method of claim 1, wherein said step of using said signal energy value and said speech signal-to-noise ratio to determine a first signal type of said current frame input signal comprises:
determining a first pre-judging signal type of the current frame input signal by using the signal energy value;
determining a second pre-judging signal type of the current frame input signal by adopting the voice signal-to-noise ratio;
when the first pre-judging signal type and the second pre-judging signal type are both voice signals, determining that the first signal type is a voice signal;
and when either the first pre-judging signal type or the second pre-judging signal type is a non-voice signal, determining that the first signal type is a non-voice signal.
3. The method of claim 2, wherein the step of determining the first pre-judging signal type of the current frame input signal using the signal energy value comprises:
judging whether the signal energy value is greater than a preset energy threshold value;
if so, determining the type of the first pre-judging signal as a voice signal;
if not, determining that the first pre-judging signal type is a non-voice signal.
4. The method of claim 2, wherein the step of determining the second pre-judging signal type of the current frame input signal using the speech signal-to-noise ratio comprises:
calculating the voice existence probability of the current frame input signal by adopting the voice signal-to-noise ratio;
judging whether the voice existence probability is greater than a preset voice existence probability threshold value;
if so, determining the type of the second pre-judging signal as a voice signal;
if not, determining that the second pre-judging signal type is a non-voice signal.
5. The method according to any of claims 1-4, wherein said step of determining a signal type of said current frame input signal based on said first signal type and said second signal type comprises:
when the first signal type and the second signal type are both voice signals, determining that the signal type of the current frame input signal is a voice signal;
and when either the first signal type or the second signal type is a non-speech signal, determining that the signal type of the current frame input signal is a non-speech signal.
6. A method of speech recognition, comprising:
determining the signal type of the input signal of the current frame by using the method of voice activity detection as claimed in any one of claims 1-5;
and when the signal type of the current frame input signal is determined to be a voice signal, sending the current frame input signal to a decoder for decoding to obtain text information corresponding to the current frame input signal.
7. The method of claim 6, further comprising:
when the signal type of the current frame input signal is determined to be a non-voice signal, calculating the duration of the non-voice signal;
resetting the decoder when the duration is greater than a preset time threshold.
8. The method of claim 6 or 7, further comprising, before the step of sending the current frame input signal to a decoder for decoding:
pre-processing the input signal, the pre-processing comprising: low frequency de-noising, and/or signal enhancement.
9. An apparatus for voice activity detection, comprising:
the characteristic parameter acquisition module is used for acquiring the signal characteristic parameters of the current frame input signal;
the signal type acquisition module is used for determining a first signal type of the current frame input signal by adopting the signal characteristic parameters; determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model; the first signal type is a voice signal or a non-voice signal, and the second signal type is a voice signal or a non-voice signal;
a signal type determining module, configured to determine a signal type of the current frame input signal according to the first signal type and the second signal type;
wherein the signal characteristic parameters comprise a signal energy value, a speech signal-to-noise ratio, a perceptual linear prediction parameter and a fundamental frequency, and the signal type obtaining module comprises:
a first signal type determining submodule, configured to determine a first signal type of the current frame input signal by using the signal energy value and the speech signal-to-noise ratio;
a second signal type determination submodule for generating input parameters using the perceptual linear prediction parameters and the fundamental frequency; calculating the confidence probability that the current frame input signal is a voice signal in a preset deep neural network model by adopting the input parameters; obtaining the confidence coefficient that the input signal of the previous frame of the current frame of input signal is a voice signal; calculating the confidence coefficient of the current frame input signal by adopting the confidence coefficient that the previous frame input signal is a voice signal and the confidence probability that the current frame input signal is a voice signal; judging whether the confidence coefficient of the current frame input signal is greater than a preset confidence coefficient threshold value; if so, determining that the second signal type is a voice signal; if not, determining that the second signal type is a non-voice signal.
10. The apparatus of claim 9, wherein the first signal type determination submodule comprises:
a first pre-judging signal type unit, configured to determine a first pre-judging signal type of the current frame input signal by using the signal energy value;
a second pre-judging signal type unit, configured to determine a second pre-judging signal type of the current frame input signal by using the speech signal-to-noise ratio;
a first voice signal determination unit, configured to determine that the first signal type is a voice signal when the first pre-judging signal type and the second pre-judging signal type are both voice signals;
a first non-speech signal determination unit, configured to determine that the first signal type is a non-speech signal when either the first pre-judging signal type or the second pre-judging signal type is a non-speech signal.
11. The apparatus of claim 10, wherein the first pre-determined signal type unit comprises:
a signal energy value judging subunit, configured to judge whether the signal energy value is greater than a preset energy threshold value;
a first voice signal determining subunit, configured to determine, if so, that the first pre-judging signal type is a voice signal;
a first non-voice signal determining subunit, configured to determine, if not, that the first pre-judging signal type is a non-voice signal.
12. The apparatus of claim 10, wherein the second pre-determined signal type unit comprises:
a voice existence probability calculating subunit, configured to calculate a voice existence probability of the current frame input signal by using the voice signal-to-noise ratio;
a voice existence probability judging subunit, configured to judge whether the voice existence probability is greater than a preset voice existence probability threshold value;
a second voice signal determining subunit, configured to determine, if so, that the second pre-judging signal type is a voice signal;
a second non-voice signal determining subunit, configured to determine, if not, that the second pre-judging signal type is a non-voice signal.
13. The apparatus of any of claims 9-12, wherein the signal type determination module comprises:
a voice signal determining sub-module, configured to determine that the signal type of the current frame input signal is a voice signal when the first signal type and the second signal type are both voice signals;
a non-speech signal determination sub-module, configured to determine that the signal type of the current frame input signal is a non-speech signal when either the first signal type or the second signal type is a non-speech signal.
14. A speech recognition apparatus, comprising:
a voice activity detection module for determining a signal type of a current frame input signal using the voice activity detection apparatus as claimed in any one of claims 9-13;
and the voice recognition module is used for sending the current frame input signal to a decoder for decoding when the signal type of the current frame input signal is determined to be a voice signal, so as to obtain text information corresponding to the current frame input signal.
15. The apparatus of claim 14, further comprising:
a duration calculation module, configured to calculate the duration of the non-voice signal when the signal type of the current frame input signal is determined to be a non-voice signal;
a decoder reset module, configured to reset the decoder when the duration is greater than a preset time threshold.
16. The apparatus of claim 14 or 15, further comprising:
a pre-processing module for pre-processing the input signal, the pre-processing comprising: low frequency de-noising, and/or signal enhancement.
17. A voice activity detection device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
acquiring signal characteristic parameters of a current frame input signal;
determining a first signal type of the current frame input signal by adopting the signal characteristic parameters, and determining a second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model; the first signal type is a voice signal or a non-voice signal, and the second signal type is a voice signal or a non-voice signal;
determining the signal type of the current frame input signal according to the first signal type and the second signal type;
wherein the signal characteristic parameters comprise a signal energy value, a speech signal-to-noise ratio, a perceptual linear prediction parameter and a fundamental frequency, and the determining the first signal type of the current frame input signal by adopting the signal characteristic parameters, and determining the second signal type of the current frame input signal by adopting the signal characteristic parameters and a preset deep neural network model comprises:
determining a first signal type of the current frame input signal by adopting the signal energy value and the voice signal-to-noise ratio;
generating input parameters by adopting the perceptual linear prediction parameters and the fundamental frequency;
calculating the confidence probability that the current frame input signal is a voice signal in a preset deep neural network model by adopting the input parameters;
obtaining the confidence coefficient that the previous frame input signal of the current frame input signal is a voice signal;
calculating the confidence coefficient of the current frame input signal by adopting the confidence coefficient that the previous frame input signal is a voice signal and the confidence probability that the current frame input signal is a voice signal;
judging whether the confidence coefficient of the current frame input signal is greater than a preset confidence coefficient threshold value;
if so, determining that the second signal type is a voice signal;
if not, determining that the second signal type is a non-voice signal.
18. A speech recognition device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
determining the signal type of the current frame input signal by using the voice activity detection device as claimed in claim 17,
and when the signal type of the current frame input signal is determined to be a voice signal, sending the current frame input signal to a decoder for decoding to obtain text information corresponding to the current frame input signal.
19. A storage medium, characterized in that instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a voice activity detection method according to one or more of method claims 1-5.
20. A storage medium, characterized in that instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a method of speech recognition according to one or more of method claims 6-8.
CN201710056814.0A 2017-01-25 2017-01-25 Voice activity detection method and device and voice recognition method and device Active CN108346425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710056814.0A CN108346425B (en) 2017-01-25 2017-01-25 Voice activity detection method and device and voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710056814.0A CN108346425B (en) 2017-01-25 2017-01-25 Voice activity detection method and device and voice recognition method and device

Publications (2)

Publication Number Publication Date
CN108346425A CN108346425A (en) 2018-07-31
CN108346425B true CN108346425B (en) 2021-05-25

Family

ID=62961869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710056814.0A Active CN108346425B (en) 2017-01-25 2017-01-25 Voice activity detection method and device and voice recognition method and device

Country Status (1)

Country Link
CN (1) CN108346425B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074997A1 (en) * 2018-08-31 2020-03-05 CloudMinds Technology, Inc. Method and system for detecting voice activity in noisy conditions
CN109712610A (en) * 2019-03-12 2019-05-03 百度在线网络技术(北京)有限公司 The method and apparatus of voice for identification
CN110335593A (en) * 2019-06-17 2019-10-15 平安科技(深圳)有限公司 Sound end detecting method, device, equipment and storage medium
CN110706717B (en) * 2019-09-06 2021-11-09 西安合谱声学科技有限公司 Microphone array panel-based human voice detection orientation method
CN110689905B (en) * 2019-09-06 2021-12-21 西安合谱声学科技有限公司 Voice activity detection system for video conference system
CN110556128B (en) * 2019-10-15 2021-02-09 出门问问信息科技有限公司 Voice activity detection method and device and computer readable storage medium
CN113129902B (en) * 2019-12-30 2023-10-24 北京猎户星空科技有限公司 Voice processing method and device, electronic equipment and storage medium
CN111554314A (en) * 2020-05-15 2020-08-18 腾讯科技(深圳)有限公司 Noise detection method, device, terminal and storage medium
CN111986686B (en) * 2020-07-09 2023-01-03 厦门快商通科技股份有限公司 Short-time speech signal-to-noise ratio estimation method, device, equipment and storage medium
CN112614514B (en) * 2020-12-15 2024-02-13 中国科学技术大学 Effective voice fragment detection method, related equipment and readable storage medium
CN113470621B (en) * 2021-08-23 2023-10-24 杭州网易智企科技有限公司 Voice detection method, device, medium and electronic equipment
CN114582365B (en) * 2022-05-05 2022-09-06 阿里巴巴(中国)有限公司 Audio processing method and device, storage medium and electronic equipment
CN115801949B (en) * 2022-11-15 2023-08-04 广西优泰科技有限公司 5G mobile phone capable of enhancing mobile phone signals

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN105679310A (en) * 2015-11-17 2016-06-15 乐视致新电子科技(天津)有限公司 Method and system for speech recognition
CN105469785A (en) * 2015-11-25 2016-04-06 南京师范大学 Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof

Also Published As

Publication number Publication date
CN108346425A (en) 2018-07-31

Similar Documents

Publication Publication Date Title
CN108346425B (en) Voice activity detection method and device and voice recognition method and device
US11423904B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
US11942084B2 (en) Post-speech recognition request surplus detection and prevention
US11875820B1 (en) Context driven device arbitration
US11138977B1 (en) Determining device groups
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
US9293133B2 (en) Improving voice communication over a network
JP6171617B2 (en) Response target speech determination apparatus, response target speech determination method, and response target speech determination program
WO2019046026A1 (en) Context-based device arbitration
CN113362812B (en) Voice recognition method and device and electronic equipment
US10229701B2 (en) Server-side ASR adaptation to speaker, device and noise condition via non-ASR audio transmission
US11848029B2 (en) Method and device for detecting audio signal, and storage medium
CN110648656A (en) Voice endpoint detection method and device, electronic equipment and storage medium
US20160012819A1 (en) Server-Side ASR Adaptation to Speaker, Device and Noise Condition via Non-ASR Audio Transmission
CN113362813B (en) Voice recognition method and device and electronic equipment
CN110930978A (en) Language identification method and device and language identification device
JP5988077B2 (en) Utterance section detection apparatus and computer program for detecting an utterance section
KR20200082137A (en) Electronic apparatus and controlling method thereof
KR20210042520A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
KR20230118165A (en) Adapting Automated Speech Recognition Parameters Based on Hotword Attributes
CN113707130B (en) Voice recognition method and device for voice recognition
CN110782898B (en) End-to-end voice awakening method and device and computer equipment
US20200321022A1 (en) Method and apparatus for detecting an end of an utterance
CN114049873A (en) Voice cloning method, training method, device and medium
CN115394297A (en) Voice recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant