CN115132231B - Voice activity detection method, device, equipment and readable storage medium

Voice activity detection method, device, equipment and readable storage medium

Info

Publication number
CN115132231B
CN115132231B
Authority
CN
China
Prior art keywords
voice
convolution
layer
signal frame
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211051500.9A
Other languages
Chinese (zh)
Other versions
CN115132231A (en)
Inventor
胡今朝
李威
李永超
马志强
周传福
潘志兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Xunfei Huanyu Technology Co ltd
Original Assignee
Anhui Xunfei Huanyu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Xunfei Huanyu Technology Co ltd
Priority to CN202211051500.9A
Publication of CN115132231A
Application granted
Publication of CN115132231B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The application discloses a voice activity detection method, apparatus, device, and readable storage medium. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result from that frame and the historical frames before it, without using any future frames after it, which reduces the waiting delay generated during the model's forward propagation at the inference stage.

Description

Voice activity detection method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for detecting speech activity.
Background
A Voice Activity Detection (VAD) system determines which frames of an input speech signal are speech frames and which are non-speech frames, and forwards the detected speech frames to subsequent speech processing steps. VAD is a crucial front-end step in many voice-related applications (e.g., voice wake-up, speech enhancement, speech coding, speech recognition, speaker recognition), and many of these scenarios, such as video conferencing, demand high real-time performance. The VAD system therefore needs to deliver valid speech frames to the subsequent processing steps as quickly as possible.
At present, most voice activity detection systems use an ordinary Convolutional Neural Network (CNN) model to classify the speech frames and non-speech frames of an input speech signal. To keep the number of frames in the time dimension unchanged before and after a convolution operation, an ordinary CNN pads symmetrically and therefore consumes future frames, which causes the model to incur a waiting delay during forward propagation at the inference stage.
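To make this delay concrete, the following minimal sketch (illustrative only; the frame count, channel width, and kernel size are assumptions rather than values from this application) contrasts an ordinary convolution with symmetric zero padding, whose output at frame t depends on frame t+1, with a causal convolution that pads zeros on the left only, so that the output at frame t depends only on frames up to t:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, C = 8, 1                       # 8 frames, 1 channel (illustrative)
x = torch.randn(1, C, T)          # (batch, channels, time)

# Ordinary CNN: symmetric zero padding keeps the frame count unchanged,
# but the output at frame t mixes in frame t+1, so inference must wait
# for a future frame before emitting a result.
same_conv = nn.Conv1d(C, C, kernel_size=3, padding=1)

# Causal CNN: pad (kernel_size - 1) zeros on the left only, so the
# output at frame t depends only on frames t-2, t-1, and t.
causal_conv = nn.Conv1d(C, C, kernel_size=3, padding=0)
y_causal = causal_conv(F.pad(x, (2, 0)))   # (left_pad, right_pad)

print(same_conv(x).shape, y_causal.shape)  # both keep T = 8 frames
```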
Therefore, how to design a voice activity detection system that reduces the waiting delay generated by the model's forward propagation at the inference stage is an urgent technical problem for those skilled in the art.
Disclosure of Invention
In view of the above problems, the present application provides a voice activity detection method, apparatus, device and readable storage medium. The specific scheme is as follows:
a voice activity detection method, the method comprising:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active speech segment corresponding to the speech signal based on the speech activity detection result of each signal frame.
Optionally, the obtaining of the voice feature of each signal frame corresponding to the voice signal to be detected includes:
performing frame windowing on the voice signal to obtain a plurality of signal frames;
and for each signal frame, performing feature extraction on the signal frame to obtain the voice feature of the signal frame.
Optionally, the determining an active speech segment corresponding to the speech signal based on the detection result of speech activity of each signal frame includes:
performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
determining the non-noise speech segment as an active speech segment corresponding to the speech signal.
Optionally, the determining a noise speech segment and a non-noise speech segment from each initial active speech segment includes:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
Optionally, the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolution neural network, a second convolution layer, and a full-connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
Optionally, the causal convolutional neural network comprises: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame splicing layer and performing convolution processing on the output of the frame splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
Optionally, the second convolution module includes a preset number of convolution units, and each convolution unit includes a pre-filling layer, a first convolution sublayer, a second convolution sublayer, and a residual connecting layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
A voice activity detection apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the voice characteristics of each signal frame corresponding to the voice signal to be detected;
the detection unit is used for inputting the voice characteristics of each signal frame into the voice activity detection model, the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and the determining unit is used for determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Optionally, the obtaining unit includes:
a framing and windowing unit, configured to perform framing and windowing processing on the voice signal to obtain a plurality of signal frames;
and the feature extraction unit is used for performing feature extraction on each signal frame to obtain the voice features of the signal frame.
Optionally, the determining unit includes:
a smoothing operation unit, configured to perform smoothing operation on a voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
a noise voice segment and non-noise voice segment determining unit, for determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
an active speech segment determining unit, configured to determine the non-noise speech segment as an active speech segment corresponding to the speech signal.
Optionally, the noise speech segment and non-noise speech segment determining unit is specifically configured to:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
Optionally, the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolution neural network, a second convolution layer, and a full-connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, performing convolution processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolution processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
Optionally, the causal convolutional neural network comprises: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame splicing layer and performing convolution processing on the output of the frame splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
Optionally, the second convolution module includes a preset number of convolution units, and each convolution unit includes a pre-filling layer, a first convolution sublayer, a second convolution sublayer, and a residual connecting layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
A voice activity detection device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the voice activity detection method as described above.
A readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the voice activity detection method as described above.
By means of the above technical scheme, the application discloses a voice activity detection method, apparatus, device, and readable storage medium. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result from that frame and the historical frames before it, without using any future frames after it, which reduces the waiting delay generated during the model's forward propagation at the inference stage.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flow chart of a method for detecting speech activity disclosed in an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for determining an active speech segment corresponding to a speech signal based on a result of speech activity detection of each signal frame according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a voice activity detection model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a causal convolutional neural network in a voice activity detection model disclosed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a speech activity detection apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a hardware structure of a speech activity detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Next, the voice activity detection method provided by the present application will be described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of a voice activity detection method disclosed in an embodiment of the present application, where the method may include:
step S101: and acquiring the voice characteristics of each signal frame corresponding to the voice signal to be detected.
In the present application, the voice signal to be detected may be a voice signal input in real time. For such a real-time input, the voice signal may be subjected to framing and windowing to obtain a plurality of signal frames; then, for each signal frame, feature extraction is performed to obtain the voice features of that signal frame.
It should be noted that, in the present application, the speech signal may be divided into frames and windowed based on a preset frame length, frame shift, and window function to obtain the plurality of signal frames. The speech features may be common speech features such as PLP (Perceptual Linear Prediction) coefficients, MFCC (Mel-Frequency Cepstral Coefficients), or Filter Bank features. Because Filter Bank features retain more of the raw acoustic information than MFCC, as one implementable manner, Filter Bank features may be selected as the speech features of the signal frames; for example, 40-dimensional Filter Bank features may be used.
The human ear perceives different frequencies with different sensitivity: the higher the frequency, the lower the sensitivity, so the ear's frequency-domain perception is nonlinear. The Mel Scale describes exactly this rule; it reflects the relationship between the Mel Frequency, which the human ear perceives linearly, and ordinary frequency. Taking the logarithm of the Mel-spectrum energy values yields the Filter Bank features.
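As an illustrative sketch of this feature extraction (the sample rate, frame length, and frame shift below are assumptions, not values fixed by this application; only the 40 Mel dimensions come from the text above), 40-dimensional log-Mel Filter Bank features can be computed with torchaudio as follows:

```python
import torch
import torchaudio

SAMPLE_RATE = 16000          # assumed; not specified by the application
FRAME_LEN   = 400            # 25 ms window at 16 kHz (assumed)
FRAME_SHIFT = 160            # 10 ms frame shift (assumed)

# Windowed framing plus a 40-dimensional Mel filter bank.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=FRAME_LEN,
    win_length=FRAME_LEN,
    hop_length=FRAME_SHIFT,
    n_mels=40,
    window_fn=torch.hamming_window,
)

waveform = torch.randn(1, SAMPLE_RATE)   # stand-in for 1 s of audio
fbank = torch.log(mel(waveform) + 1e-6)  # log of Mel-spectrum energies
print(fbank.shape)                       # (1, 40, num_frames)
```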
Step S102: inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame.
In the present application, the voice features of the signal frames may be input to the voice activity detection model in batches, and the model may be implemented based on a causal convolutional neural network. Compared with prior-art voice activity detection models implemented with ordinary convolutional neural networks, when detecting voice activity for a signal frame, the model in the present application obtains the detection result from that frame and a preset number of historical frames before it, and does not use any future frames after it.
It should be noted that the specific structure and function implementation of the voice activity detection model will be described in detail by the following embodiments, and will not be described herein.
Step S103: and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
In the application, based on the voice activity detection results, it can be determined whether each signal frame is a voice frame or a non-voice frame, and the active voice segments can be determined accordingly.
To ensure the accuracy of the determined active speech segments, the temporal correlation between signal frames and the noise characteristics of the signal frames may also be considered; that is, the active speech segments corresponding to the speech signal are determined based on the per-frame voice activity detection results together with the temporal correlation between frames and their noise characteristics.
This embodiment discloses a voice activity detection method. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result from that frame and the historical frames before it, without using any future frames after it, which reduces the waiting delay generated during the model's forward propagation at the inference stage.
In another embodiment of the present application, a specific implementation manner of determining, in step S103, an active speech segment corresponding to a speech signal based on a speech activity detection result of each signal frame is described.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for determining an active speech segment corresponding to a speech signal based on a result of detecting speech activity of each signal frame according to an embodiment of the present application, where the method may include:
step S201: and performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal.
A speech signal is a time series, so adjacent signal frames are correlated: if the current frame is a speech frame, the next frame is very likely also a speech frame. When each frame is judged independently, however, isolated non-speech frames may appear inside a run of speech frames. A segment-level smoothing operation based on manually defined rules is therefore applied to the per-frame voice activity detection results to reduce frequent jumps between speech and non-speech decisions. In this application, the voice activity detection results of the signal frames are smoothed to obtain the initial active voice segments corresponding to the voice signal.
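A minimal sketch of such rule-based, segment-level smoothing over the per-frame decisions is shown below; the minimum-segment-length and maximum-gap thresholds are illustrative assumptions, since the application does not fix the smoothing rules:

```python
def smooth_vad(frame_labels, min_speech=10, max_gap=5):
    """Merge per-frame 0/1 VAD decisions into initial active segments.

    Short non-speech gaps inside speech (<= max_gap frames) are bridged,
    and speech runs shorter than min_speech frames are dropped, reducing
    frequent jumps between speech and non-speech decisions.
    """
    # Collect raw speech runs as (start, end) frame indices, end exclusive.
    segments, start = [], None
    for i, lab in enumerate(frame_labels):
        if lab == 1 and start is None:
            start = i
        elif lab == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(frame_labels)))

    # Bridge short non-speech gaps between adjacent runs.
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # Discard runs too short to be real speech.
    return [s for s in merged if s[1] - s[0] >= min_speech]

print(smooth_vad([0,1,1,1,0,0,1,1,1,1,1,1,1,1,0],
                 min_speech=4, max_gap=2))   # -> [(1, 14)]
```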
Step S202: from each of the initial active speech segments, a noisy speech segment and a non-noisy speech segment are determined.
In the present application, because the voice activity detection model obtains each frame's result only from that frame and the historical frames before it, the model is more prone to detecting background human voice as speech frames. To address this, the initial active voice segments may be further processed: noise voice segments and non-noise voice segments are determined among the initial active voice segments, the noise voice segments are discarded, and the non-noise voice segments are kept as the active voice segments.
As an implementation, the determining a noise speech segment and a non-noise speech segment from each initial active speech segment includes: calculating the posterior probability mean square error corresponding to each initial active voice segment; if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment; and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
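The following sketch illustrates this filtering under one plausible reading of the posterior probability mean square error, taken here as the mean squared deviation of a segment's per-frame speech posteriors from their segment mean; this reading and the threshold value are assumptions, not definitions from the application. The intuition is that a noise segment tends to produce flat, mid-range posteriors and hence a low score, while genuine speech produces posteriors that swing more widely across the segment:

```python
import numpy as np

POSTERIOR_MSE_THRESHOLD = 0.04   # illustrative assumption

def split_noise_segments(posteriors, segments, thr=POSTERIOR_MSE_THRESHOLD):
    """Classify initial active segments as noise vs. non-noise.

    posteriors[t] is the model's speech posterior for frame t. As an
    assumed reading of the "posterior probability mean square error",
    each segment is scored by the mean squared deviation of its
    posteriors from their segment mean; low-variability segments are
    treated as noise and can be discarded.
    """
    noise, speech = [], []
    for start, end in segments:
        seg = np.asarray(posteriors[start:end])
        mse = float(np.mean((seg - seg.mean()) ** 2))
        (noise if mse < thr else speech).append((start, end))
    return noise, speech
```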
Step S203: and determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
In another embodiment of the present application, the structural and functional implementation of the voice activity detection model is described.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a voice activity detection model disclosed in an embodiment of the present application, where the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolutional neural network, a second convolution layer, and a full-connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process; as an implementation, the first convolution layer may use a convolution kernel of 3 × 3. In the present application, the padding parameters of the first convolution layer may be set, thereby implementing front and back zero padding.
The regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and splicing the frame of the output of the pooling layer; it should be noted that the frame splicing process can reduce the computational complexity of the subsequent model structure, and further reduce the computational delay of the model.
The causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process; in the application, the filling parameters of the causal convolutional neural network can be set, and then the pre-zero padding is realized.
The second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process; as an implementation, the second convolutional layer may employ a convolution kernel of 1 × 5. In the present application, the filling parameters of the second convolution layer may be set, thereby implementing front and back zero padding.
And the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
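The following PyTorch sketch assembles these layers in the stated order. Only the layer ordering and the 3 × 3 and 1 × 5 kernels come from the description above; the channel count, pooling configuration, frame-splicing factor, and input feature size are illustrative assumptions, and the causal convolutional neural network is left as a placeholder (a sketch of it follows in the next embodiment):

```python
import torch
import torch.nn as nn

class VADModel(nn.Module):
    """Sketch of the described model; values marked 'assumed' are not
    fixed by the text. Input: (batch, frames, n_mels) voice features."""
    def __init__(self, n_mels=40, channels=32, splice=2, n_classes=2):
        super().__init__()
        # First convolution layer: 3x3 kernel; symmetric ("front and back")
        # zero padding keeps the output frame count equal to the input's.
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)     # regularization layer
        self.act = nn.ReLU()                     # activation function layer
        # Pooling layer: pools the feature axis only, keeping frames (assumed).
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))
        self.splice = splice                     # frame splicing factor (assumed)
        feat = channels * self.splice * (n_mels // 2)
        # Placeholder for the causal convolutional neural network; see the
        # CausalConvNet sketch in the next embodiment.
        self.causal = nn.Identity()
        # Second convolution layer: 1x5 kernel with symmetric zero padding.
        self.conv2 = nn.Conv1d(feat, feat, kernel_size=5, padding=2)
        self.fc = nn.Linear(feat, n_classes)     # full-connection layer

    def forward(self, x):
        h = self.pool(self.act(self.norm(self.conv1(x.unsqueeze(1)))))
        b, c, t, f = h.shape
        t = t - t % self.splice
        # Frame splicing layer: stack `splice` adjacent frames into one,
        # reducing the time resolution and the later computation.
        h = h[:, :, :t, :].reshape(b, c, t // self.splice, self.splice, f)
        h = h.permute(0, 1, 3, 4, 2).reshape(b, c * self.splice * f, -1)
        h = self.conv2(self.causal(h))
        return self.fc(h.transpose(1, 2))        # per-frame logits

logits = VADModel()(torch.randn(2, 100, 40))     # -> torch.Size([2, 50, 2])
```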
In another embodiment of the present application, the structure of a causal convolutional neural network in a voice activity detection model is described.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a causal convolutional neural network in a voice activity detection model disclosed in an embodiment of the present application, where the causal convolutional neural network includes: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the framing layer and performing convolution processing on the output of the framing layer; as an implementation, the first convolution module may employ a convolution kernel of 1 × 1.
Each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
The second convolution module comprises a preset number of convolution units, and each convolution unit comprises a pre-filling layer, a first convolution sublayer, a second convolution sublayer, and a residual connecting layer;
the pre-padding layer is used for receiving the output of the first convolution module and performing pre-padding processing on the output of the first convolution module based on a pre-padding parameter, wherein the pre-padding parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer; as an implementation, the first convolution sublayer may employ a convolution kernel of 1 × 3.
The second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer; as an implementation, the second convolution sublayer may employ a 1 × 1 convolution kernel.
And the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
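Under the same caveats, the following sketch illustrates this causal convolutional neural network; the channel count, the number of parallel branches and convolution units, the dilation (expansion) coefficients, and the use of summation as the fusion operation are all illustrative assumptions, while the 1 × 1, 1 × 3, and 1 × 1 kernels and the pre-filling/residual wiring follow the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Module):
    """One convolution unit: pre-filling layer -> first convolution sublayer
    (1x3, dilated) -> second convolution sublayer (1x1) -> residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        # Pre-filling parameter derived from the dilation (expansion)
        # coefficient: (kernel_size - 1) * dilation zeros, left side only.
        self.pad = (3 - 1) * dilation
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        h = F.pad(x, (self.pad, 0))   # pre-filling: causal left padding
        # Residual connection: the pre-filled output trimmed to the original
        # length equals x, so the residual can be added as x directly.
        return x + self.conv2(self.conv1(h))

class CausalConvNet(nn.Module):
    """First convolution module -> parallel second convolution modules
    -> fusion module."""
    def __init__(self, channels, branches=2, units_per_branch=3):
        super().__init__()
        self.first = nn.Conv1d(channels, channels, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Sequential(*[ConvUnit(channels, dilation=2 ** i)  # 1, 2, 4 (assumed)
                            for i in range(units_per_branch)])
            for _ in range(branches)
        )

    def forward(self, x):
        h = self.first(x)
        # Fusion module: element-wise sum of the branch outputs (assumed;
        # concatenation with a projection would also fit the text).
        return torch.stack([branch(h) for branch in self.branches]).sum(dim=0)

out = CausalConvNet(channels=16)(torch.randn(2, 16, 50))  # -> (2, 16, 50)
```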
It should be noted that the structure of the voice activity detection model proposed in the embodiment of the present application is merely exemplary, and other similar structures obtained on this basis should also be within the scope of the present application.
The following describes a voice activity detection device disclosed in an embodiment of the present application, and the voice activity detection device described below and the voice activity detection method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice activity detection apparatus according to an embodiment of the present application. As shown in fig. 5, the voice activity detecting apparatus may include:
the acquiring unit 11 is configured to acquire a voice feature of each signal frame corresponding to a voice signal to be detected;
a detecting unit 12, configured to input the speech characteristics of each signal frame into a speech activity detection model, where the speech activity detection model outputs a speech activity detection result of each signal frame, and the speech activity detection result of each signal frame is used to indicate whether the signal frame is a speech frame or a non-speech frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
a determining unit 13, configured to determine an active speech segment corresponding to the speech signal based on the detection result of speech activity of each signal frame.
As an implementation, the obtaining unit includes:
a framing and windowing unit, configured to perform framing and windowing processing on the voice signal to obtain a plurality of signal frames;
and the feature extraction unit is used for performing feature extraction on each signal frame to obtain the voice features of the signal frame.
As an implementable manner, the determining unit includes:
a smoothing operation unit, configured to perform smoothing operation on a voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
a noise voice segment and non-noise voice segment determining unit, for determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
and the active voice segment determining unit is used for determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
As an implementation manner, the noise speech segment and non-noise speech segment determining unit is specifically configured to:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
As an implementation manner, the voice activity detection model comprises a first convolutional layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolutional neural network, a second convolutional layer and a full-connection layer which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
As one possible implementation, the causal convolutional neural network includes: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame splicing layer and performing convolution processing on the output of the frame splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and performing fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
As an implementation manner, the second convolution module includes a preset number of convolution units, and each convolution unit includes a pre-filling layer, a first convolution sublayer, a second convolution sublayer and a residual connecting layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
Referring to fig. 6, fig. 6 is a block diagram of a hardware structure of a speech activity detection device according to an embodiment of the present application, and referring to fig. 6, the hardware structure of the speech activity detection device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for voice activity detection, the method comprising:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame; the voice activity detection model comprises a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolution neural network, a second convolution layer and a full-connection layer which are sequentially connected; the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
2. The method according to claim 1, wherein the obtaining the speech features of the signal frames corresponding to the speech signal to be detected comprises:
performing frame windowing on the voice signal to obtain a plurality of signal frames;
and for each signal frame, performing feature extraction on the signal frame to obtain the voice feature of the signal frame.
3. The method according to claim 1, wherein the determining the active speech segment corresponding to the speech signal based on the detection result of the speech activity of each signal frame comprises:
performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
and determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
4. The method of claim 3, wherein said determining a noisy speech segment and a non-noisy speech segment from each initial active speech segment comprises:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
5. The method of claim 1, wherein the regularization layer is configured to receive an output of the first convolutional layer and to regularize the output of the first convolutional layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and pooling the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
6. The method according to claim 5, wherein the causal convolutional neural network comprises: a first convolution module, a plurality of parallel second convolution modules each connected to the first convolution module, and a fusion module connected to the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame splicing layer and performing convolution processing on the output of the frame splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and performing fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolutional neural network.
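Sketched below, this branch-and-fuse structure runs the first convolution module's output through several parallel branches and merges the results. The branch count and the element-wise sum used for "fusion processing" are assumptions; SecondConvModule is the convolution-unit stack sketched after claim 7.

```python
import torch
import torch.nn as nn

class CausalCNN(nn.Module):
    """Claim 6 structure: first conv module -> parallel branches -> fusion."""
    def __init__(self, in_ch, n_branches=3):
        super().__init__()
        self.first = nn.Conv1d(in_ch, in_ch, kernel_size=1)  # first convolution module
        self.branches = nn.ModuleList(                        # parallel second modules
            [SecondConvModule(in_ch) for _ in range(n_branches)])

    def forward(self, x):
        x = self.first(x)
        # fusion module: element-wise sum over branch outputs (assumed fusion op)
        return torch.stack([b(x) for b in self.branches]).sum(dim=0)
```

With channel counts chosen to match the earlier sketch, `VADModel(causal_cnn=CausalCNN(3 * 64))` would wire the two pieces together.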
7. The method according to claim 6, wherein each second convolution module comprises a preset number of convolution units, each convolution unit comprising a pre-filling layer, a first convolution sublayer, a second convolution sublayer and a residual connection layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, the pre-filling parameter being determined based on a dilation coefficient of the causal convolutional neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connection layer is used for performing residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
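A sketch of claim 7's convolution unit and its stacking into a second convolution module: left-only ("pre-filling") zero padding of width (kernel − 1) × dilation keeps the convolution causal while preserving the frame count, and the residual path adds the unit's input back onto the second sublayer's output. The kernel size, the 1×1 second sublayer, the per-unit dilation doubling, and taking the residual from the unpadded input rather than the padded tensor are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Module):
    """One convolution unit: pre-fill -> two conv sublayers -> residual."""
    def __init__(self, ch, kernel=3, dilation=1):
        super().__init__()
        self.pre_fill = (kernel - 1) * dilation  # pre-filling parameter from dilation
        self.conv_a = nn.Conv1d(ch, ch, kernel, dilation=dilation)
        self.conv_b = nn.Conv1d(ch, ch, kernel_size=1)

    def forward(self, x):
        y = F.pad(x, (self.pre_fill, 0))   # pre-filling layer: left zeros only
        y = self.conv_b(self.conv_a(y))    # first and second convolution sublayers
        return x + y                       # residual connection layer

class SecondConvModule(nn.Module):
    """A preset number of convolution units, dilation doubling per unit (assumed)."""
    def __init__(self, ch, n_units=3, kernel=3):
        super().__init__()
        self.units = nn.Sequential(
            *[ConvUnit(ch, kernel, dilation=2 ** i) for i in range(n_units)])

    def forward(self, x):
        return self.units(x)
```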
8. A voice activity detection apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the voice characteristics of each signal frame corresponding to the voice signal to be detected;
the detection unit is used for inputting the voice characteristics of each signal frame into a voice activity detection model, the voice activity detection model outputting the voice activity detection result of each signal frame, the voice activity detection result of each signal frame being used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains the voice activity detection result of the signal frame based on the signal frame and the historical signal frames before the signal frame; the voice activity detection model comprises a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolutional neural network, a second convolution layer and a full-connection layer which are connected in sequence; the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, the number of output signal frames being kept consistent with the number of received signal frames through front and back zero padding during the convolution processing;
and the determining unit is used for determining the active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
9. A voice activity detection device comprising a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to implement the steps of the voice activity detection method according to any one of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the voice activity detection method according to any one of claims 1 to 7.
CN202211051500.9A 2022-08-31 2022-08-31 Voice activity detection method, device, equipment and readable storage medium Active CN115132231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211051500.9A CN115132231B (en) 2022-08-31 2022-08-31 Voice activity detection method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN115132231A CN115132231A (en) 2022-09-30
CN115132231B true CN115132231B (en) 2022-12-13

Family

ID=83387721

Country Status (1)

Country Link
CN (1) CN115132231B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276125B (en) * 2020-02-11 2023-04-07 华南师范大学 Lightweight speech keyword recognition method facing edge calculation
KR102167808B1 (en) * 2020-03-31 2020-10-20 한밭대학교 산학협력단 Semantic segmentation method and system applicable to AR
CN113288183B (en) * 2021-05-20 2022-04-19 中国科学技术大学 Silent voice recognition method based on facial neck surface myoelectricity

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1391212A (en) * 2001-06-11 2003-01-15 阿尔卡塔尔公司 Method for detecting phonetic activity in signals and phonetic signal encoder including device thereof
WO2016188553A1 (en) * 2015-05-22 2016-12-01 Huawei Technologies Co., Ltd. Methods and nodes in a wireless communication network
CN106601229A (en) * 2016-11-15 2017-04-26 华南理工大学 Voice awakening method based on soc chip
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 Sensitivity-adjustable speech emotion recognition method and system
CN111312218A (en) * 2019-12-30 2020-06-19 苏州思必驰信息科技有限公司 Neural network training and voice endpoint detection method and device
WO2022036801A1 (en) * 2020-08-18 2022-02-24 深圳大学 Method and system for achieving coexistence of heterogeneous networks
CN111816216A (en) * 2020-08-25 2020-10-23 苏州思必驰信息科技有限公司 Voice activity detection method and device
CN113470652A (en) * 2021-06-30 2021-10-01 山东恒远智能科技有限公司 Voice recognition and processing method based on industrial Internet
CN114155839A (en) * 2021-12-15 2022-03-08 科大讯飞股份有限公司 Voice endpoint detection method, device, equipment and storage medium
CN114566179A (en) * 2022-03-16 2022-05-31 北京声加科技有限公司 Time delay controllable voice noise reduction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection; S. Y. Chang et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing; 20180420; pp. 5549-5552 *
Research on Tibetan speech recognition based on CNN multi-feature fusion (基于CNN多特征融合的藏语语音识别的研究); 侯苗苗; China Masters' Theses Full-text Database (中国优秀硕士学位论文全文数据库); 20211215; pp. 27-30 *

Similar Documents

Publication Publication Date Title
US11508366B2 (en) Whispering voice recovery method, apparatus and device, and readable storage medium
CN108428447B (en) Voice intention recognition method and device
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN110415699B (en) Voice wake-up judgment method and device and electronic equipment
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN109448746B (en) Voice noise reduction method and device
CN112967738B (en) Human voice detection method and device, electronic equipment and computer readable storage medium
CN112652306A (en) Voice wake-up method and device, computer equipment and storage medium
CN111048118B (en) Voice signal processing method and device and terminal
CN115132231B (en) Voice activity detection method, device, equipment and readable storage medium
WO2024017110A1 (en) Voice noise reduction method, model training method, apparatus, device, medium, and product
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN112289311A (en) Voice wake-up method and device, electronic equipment and storage medium
CN116312616A (en) Processing recovery method and control system for noisy speech signals
CN114333912B (en) Voice activation detection method, device, electronic equipment and storage medium
CN114360572A (en) Voice denoising method and device, electronic equipment and storage medium
JP6106618B2 (en) Speech section detection device, speech recognition device, method thereof, and program
CN113436640A (en) Audio noise reduction method, device and system and computer readable storage medium
JP3006496B2 (en) Voice recognition device
CN111048096A (en) Voice signal processing method and device and terminal
CN116110393B (en) Voice similarity-based refusing method, device, computer and medium
CN113689886B (en) Voice data emotion detection method and device, electronic equipment and storage medium
US20240170003A1 (en) Audio Signal Enhancement with Recursive Restoration Employing Deterministic Degradation
CN112447169B (en) Word boundary estimation method and device and electronic equipment
CN113393858A (en) Voice separation method and system, electronic device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant