CN115132231B - Voice activity detection method, device, equipment and readable storage medium

Voice activity detection method, device, equipment and readable storage medium

Info

Publication number
CN115132231B
CN115132231B
Authority
CN
China
Prior art keywords
voice
convolution
layer
signal frame
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211051500.9A
Other languages
Chinese (zh)
Other versions
CN115132231A (en)
Inventor
胡今朝
李威
李永超
马志强
周传福
潘志兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Xunfei Huanyu Technology Co ltd
Original Assignee
Anhui Xunfei Huanyu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Xunfei Huanyu Technology Co ltd
Priority to CN202211051500.9A
Publication of CN115132231A
Application granted
Publication of CN115132231B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The application discloses a voice activity detection method, apparatus, device, and readable storage medium. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result from that frame and the historical frames before it, without using any future frames after it, which reduces the waiting delay generated during the model's forward propagation at the inference stage.

Description

Voice activity detection method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for detecting speech activity.
Background
A Voice Activity Detection (VAD) system determines which frames of an input speech signal are speech frames and which are non-speech frames, and forwards the detected speech frames to subsequent speech processing steps. VAD is a crucial front-end step in many voice-related applications (e.g., voice wake-up, speech enhancement, speech coding, speech recognition, speaker recognition), and many of these scenarios, such as video conferencing, demand high real-time performance. The VAD system therefore needs to deliver valid speech frames to the subsequent processing steps as quickly as possible.
At present, most voice activity detection systems use an ordinary Convolutional Neural Network (CNN) model to classify the speech frames and non-speech frames of an input speech signal. To keep the number of frames in the time dimension unchanged before and after a convolution operation, an ordinary CNN pads symmetrically and therefore consumes future frames, which causes the model to incur a waiting delay during forward propagation at the inference stage.
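To make this delay concrete, the following minimal sketch (illustrative only; the frame count, channel width, and kernel size are assumptions rather than values from this application) contrasts an ordinary convolution with symmetric zero padding, whose output at frame t depends on frame t+1, with a causal convolution that pads zeros on the left only, so that the output at frame t depends only on frames up to t:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, C = 8, 1                       # 8 frames, 1 channel (illustrative)
x = torch.randn(1, C, T)          # (batch, channels, time)

# Ordinary CNN: symmetric zero padding keeps the frame count unchanged,
# but the output at frame t mixes in frame t+1, so inference must wait
# for a future frame before emitting a result.
same_conv = nn.Conv1d(C, C, kernel_size=3, padding=1)

# Causal CNN: pad (kernel_size - 1) zeros on the left only, so the
# output at frame t depends only on frames t-2, t-1, and t.
causal_conv = nn.Conv1d(C, C, kernel_size=3, padding=0)
y_causal = causal_conv(F.pad(x, (2, 0)))   # (left_pad, right_pad)

print(same_conv(x).shape, y_causal.shape)  # both keep T = 8 frames
```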
Therefore, how to design a voice activity detection system that reduces the waiting delay generated by the model's forward propagation at the inference stage is an urgent technical problem for those skilled in the art.
Disclosure of Invention
In view of the above problems, the present application provides a voice activity detection method, apparatus, device and readable storage medium. The specific scheme is as follows:
a voice activity detection method, the method comprising:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active speech segment corresponding to the speech signal based on the speech activity detection result of each signal frame.
Optionally, the obtaining of the voice feature of each signal frame corresponding to the voice signal to be detected includes:
performing frame windowing on the voice signal to obtain a plurality of signal frames;
and for each signal frame, performing feature extraction on the signal frame to obtain the voice feature of the signal frame.
Optionally, the determining an active speech segment corresponding to the speech signal based on the detection result of speech activity of each signal frame includes:
performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
determining the non-noise speech segment as an active speech segment corresponding to the speech signal.
Optionally, the determining a noise speech segment and a non-noise speech segment from each initial active speech segment includes:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
Optionally, the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolution neural network, a second convolution layer, and a full-connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
Optionally, the causal convolutional neural network comprises: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame splicing layer and performing convolution processing on the output of the frame splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
Optionally, the second convolution module includes a preset number of convolution units, and each convolution unit includes a pre-filling layer, a first convolution sublayer, a second convolution sublayer, and a residual connecting layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
A voice activity detection apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the voice characteristics of each signal frame corresponding to the voice signal to be detected;
the detection unit is used for inputting the voice characteristics of each signal frame into the voice activity detection model, the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and the determining unit is used for determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Optionally, the obtaining unit includes:
a framing and windowing unit, configured to perform framing and windowing processing on the voice signal to obtain a plurality of signal frames;
and the feature extraction unit is used for performing feature extraction on each signal frame to obtain the voice features of the signal frame.
Optionally, the determining unit includes:
a smoothing operation unit, configured to perform smoothing operation on a voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
a noise voice segment and non-noise voice segment determining unit, for determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
an active speech segment determining unit, configured to determine the non-noise speech segment as an active speech segment corresponding to the speech signal.
Optionally, the noise speech segment and non-noise speech segment determining unit is specifically configured to:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
Optionally, the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolution neural network, a second convolution layer, and a full-connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, performing convolution processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolution processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
Optionally, the causal convolutional neural network comprises: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame splicing layer and performing convolution processing on the output of the frame splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
Optionally, the second convolution module includes a preset number of convolution units, and each convolution unit includes a pre-filling layer, a first convolution sublayer, a second convolution sublayer, and a residual connecting layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
A voice activity detection device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the voice activity detection method as described above.
A readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the voice activity detection method as described above.
By means of the above technical scheme, the application discloses a voice activity detection method, apparatus, device, and readable storage medium. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result from that frame and the historical frames before it, without using any future frames after it, which reduces the waiting delay generated during the model's forward propagation at the inference stage.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flow chart of a method for detecting speech activity disclosed in an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for determining an active speech segment corresponding to a speech signal based on a result of speech activity detection of each signal frame according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a voice activity detection model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a causal convolutional neural network in a voice activity detection model disclosed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a speech activity detection apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a hardware structure of a speech activity detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Next, the voice activity detection method provided by the present application will be described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of a voice activity detection method disclosed in an embodiment of the present application, where the method may include:
step S101: and acquiring the voice characteristics of each signal frame corresponding to the voice signal to be detected.
In the present application, the voice signal to be detected may be a voice signal input in real time. For such a real-time input, the voice signal may be subjected to framing and windowing to obtain a plurality of signal frames; then, for each signal frame, feature extraction is performed to obtain the voice features of that signal frame.
It should be noted that, in the present application, the speech signal may be divided into frames and windowed based on a preset frame length, frame shift, and window function to obtain the plurality of signal frames. The speech features may be common speech features such as PLP (Perceptual Linear Prediction) coefficients, MFCC (Mel-Frequency Cepstral Coefficients), or Filter Bank features. Because Filter Bank features retain more of the raw acoustic information than MFCC, as one implementable manner, Filter Bank features may be selected as the speech features of the signal frames; for example, 40-dimensional Filter Bank features may be used.
The human ear perceives different frequencies with different sensitivity: the higher the frequency, the lower the sensitivity, so the ear's frequency-domain perception is nonlinear. The Mel Scale describes exactly this rule; it reflects the relationship between the Mel Frequency, which the human ear perceives linearly, and ordinary frequency. Taking the logarithm of the Mel-spectrum energy values yields the Filter Bank features.
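As an illustrative sketch of this feature extraction (the sample rate, frame length, and frame shift below are assumptions, not values fixed by this application; only the 40 Mel dimensions come from the text above), 40-dimensional log-Mel Filter Bank features can be computed with torchaudio as follows:

```python
import torch
import torchaudio

SAMPLE_RATE = 16000          # assumed; not specified by the application
FRAME_LEN   = 400            # 25 ms window at 16 kHz (assumed)
FRAME_SHIFT = 160            # 10 ms frame shift (assumed)

# Windowed framing plus a 40-dimensional Mel filter bank.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=FRAME_LEN,
    win_length=FRAME_LEN,
    hop_length=FRAME_SHIFT,
    n_mels=40,
    window_fn=torch.hamming_window,
)

waveform = torch.randn(1, SAMPLE_RATE)   # stand-in for 1 s of audio
fbank = torch.log(mel(waveform) + 1e-6)  # log of Mel-spectrum energies
print(fbank.shape)                       # (1, 40, num_frames)
```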
Step S102: inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame.
In the present application, the voice features of the signal frames may be input to the voice activity detection model in batches, and the model may be implemented based on a causal convolutional neural network. Compared with prior-art voice activity detection models implemented with ordinary convolutional neural networks, when detecting voice activity for a signal frame, the model in the present application obtains the detection result from that frame and a preset number of historical frames before it, and does not use any future frames after it.
It should be noted that the specific structure and function implementation of the voice activity detection model will be described in detail by the following embodiments, and will not be described herein.
Step S103: and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
In the application, based on the voice activity detection results, it can be determined whether each signal frame is a voice frame or a non-voice frame, and the active voice segments can be determined accordingly.
To ensure the accuracy of the determined active speech segments, the temporal correlation between signal frames and the noise characteristics of the signal frames may also be considered; that is, the active speech segments corresponding to the speech signal are determined based on the per-frame voice activity detection results together with the temporal correlation between frames and their noise characteristics.
This embodiment discloses a voice activity detection method. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result from that frame and the historical frames before it, without using any future frames after it, which reduces the waiting delay generated during the model's forward propagation at the inference stage.
In another embodiment of the present application, a specific implementation manner of determining, in step S103, an active speech segment corresponding to a speech signal based on a speech activity detection result of each signal frame is described.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for determining an active speech segment corresponding to a speech signal based on a result of detecting speech activity of each signal frame according to an embodiment of the present application, where the method may include:
step S201: and performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal.
A speech signal is a time series, so adjacent signal frames are correlated: if the current frame is a speech frame, the next frame is very likely also a speech frame. When each frame is judged independently, however, isolated non-speech frames may appear inside a run of speech frames. A segment-level smoothing operation based on manually defined rules is therefore applied to the per-frame voice activity detection results to reduce frequent jumps between speech and non-speech decisions. In this application, the voice activity detection results of the signal frames are smoothed to obtain the initial active voice segments corresponding to the voice signal.
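A minimal sketch of such rule-based, segment-level smoothing over the per-frame decisions is shown below; the minimum-segment-length and maximum-gap thresholds are illustrative assumptions, since the application does not fix the smoothing rules:

```python
def smooth_vad(frame_labels, min_speech=10, max_gap=5):
    """Merge per-frame 0/1 VAD decisions into initial active segments.

    Short non-speech gaps inside speech (<= max_gap frames) are bridged,
    and speech runs shorter than min_speech frames are dropped, reducing
    frequent jumps between speech and non-speech decisions.
    """
    # Collect raw speech runs as (start, end) frame indices, end exclusive.
    segments, start = [], None
    for i, lab in enumerate(frame_labels):
        if lab == 1 and start is None:
            start = i
        elif lab == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(frame_labels)))

    # Bridge short non-speech gaps between adjacent runs.
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # Discard runs too short to be real speech.
    return [s for s in merged if s[1] - s[0] >= min_speech]

print(smooth_vad([0,1,1,1,0,0,1,1,1,1,1,1,1,1,0],
                 min_speech=4, max_gap=2))   # -> [(1, 14)]
```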
Step S202: from each of the initial active speech segments, a noisy speech segment and a non-noisy speech segment are determined.
In the present application, because the voice activity detection model obtains each frame's result only from that frame and the historical frames before it, the model is more prone to detecting background human voice as speech frames. To address this, the initial active voice segments may be further processed: noise voice segments and non-noise voice segments are determined among the initial active voice segments, the noise voice segments are discarded, and the non-noise voice segments are kept as the active voice segments.
As an implementation, the determining a noise speech segment and a non-noise speech segment from each initial active speech segment includes: calculating the posterior probability mean square error corresponding to each initial active voice segment; if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment; and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
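The following sketch illustrates this filtering under one plausible reading of the posterior probability mean square error, taken here as the mean squared deviation of a segment's per-frame speech posteriors from their segment mean; this reading and the threshold value are assumptions, not definitions from the application. The intuition is that a noise segment tends to produce flat, mid-range posteriors and hence a low score, while genuine speech produces posteriors that swing more widely across the segment:

```python
import numpy as np

POSTERIOR_MSE_THRESHOLD = 0.04   # illustrative assumption

def split_noise_segments(posteriors, segments, thr=POSTERIOR_MSE_THRESHOLD):
    """Classify initial active segments as noise vs. non-noise.

    posteriors[t] is the model's speech posterior for frame t. As an
    assumed reading of the "posterior probability mean square error",
    each segment is scored by the mean squared deviation of its
    posteriors from their segment mean; low-variability segments are
    treated as noise and can be discarded.
    """
    noise, speech = [], []
    for start, end in segments:
        seg = np.asarray(posteriors[start:end])
        mse = float(np.mean((seg - seg.mean()) ** 2))
        (noise if mse < thr else speech).append((start, end))
    return noise, speech
```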
Step S203: and determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
In another embodiment of the present application, the structural and functional implementation of the voice activity detection model is described.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a voice activity detection model disclosed in an embodiment of the present application, where the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolutional neural network, a second convolution layer, and a full-connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process; as an implementation, the first convolution layer may use a convolution kernel of 3 × 3. In the present application, the padding parameters of the first convolution layer may be set, thereby implementing front and back zero padding.
The regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and splicing the frame of the output of the pooling layer; it should be noted that the frame splicing process can reduce the computational complexity of the subsequent model structure, and further reduce the computational delay of the model.
The causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process; in the application, the filling parameters of the causal convolutional neural network can be set, and then the pre-zero padding is realized.
The second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process; as an implementation, the second convolutional layer may employ a convolution kernel of 1 × 5. In the present application, the filling parameters of the second convolution layer may be set, thereby implementing front and back zero padding.
And the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
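The following PyTorch sketch assembles these layers in the stated order. Only the layer ordering and the 3 × 3 and 1 × 5 kernels come from the description above; the channel count, pooling configuration, frame-splicing factor, and input feature size are illustrative assumptions, and the causal convolutional neural network is left as a placeholder (a sketch of it follows in the next embodiment):

```python
import torch
import torch.nn as nn

class VADModel(nn.Module):
    """Sketch of the described model; values marked 'assumed' are not
    fixed by the text. Input: (batch, frames, n_mels) voice features."""
    def __init__(self, n_mels=40, channels=32, splice=2, n_classes=2):
        super().__init__()
        # First convolution layer: 3x3 kernel; symmetric ("front and back")
        # zero padding keeps the output frame count equal to the input's.
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)     # regularization layer
        self.act = nn.ReLU()                     # activation function layer
        # Pooling layer: pools the feature axis only, keeping frames (assumed).
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))
        self.splice = splice                     # frame splicing factor (assumed)
        feat = channels * self.splice * (n_mels // 2)
        # Placeholder for the causal convolutional neural network; see the
        # CausalConvNet sketch in the next embodiment.
        self.causal = nn.Identity()
        # Second convolution layer: 1x5 kernel with symmetric zero padding.
        self.conv2 = nn.Conv1d(feat, feat, kernel_size=5, padding=2)
        self.fc = nn.Linear(feat, n_classes)     # full-connection layer

    def forward(self, x):
        h = self.pool(self.act(self.norm(self.conv1(x.unsqueeze(1)))))
        b, c, t, f = h.shape
        t = t - t % self.splice
        # Frame splicing layer: stack `splice` adjacent frames into one,
        # reducing the time resolution and the later computation.
        h = h[:, :, :t, :].reshape(b, c, t // self.splice, self.splice, f)
        h = h.permute(0, 1, 3, 4, 2).reshape(b, c * self.splice * f, -1)
        h = self.conv2(self.causal(h))
        return self.fc(h.transpose(1, 2))        # per-frame logits

logits = VADModel()(torch.randn(2, 100, 40))     # -> torch.Size([2, 50, 2])
```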
In another embodiment of the present application, the structure of a causal convolutional neural network in a voice activity detection model is described.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a causal convolutional neural network in a voice activity detection model disclosed in an embodiment of the present application, where the causal convolutional neural network includes: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the framing layer and performing convolution processing on the output of the framing layer; as an implementation, the first convolution module may employ a convolution kernel of 1 × 1.
Each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
The second convolution module comprises a preset number of convolution units, and each convolution unit comprises a pre-filling layer, a first convolution sublayer, a second convolution sublayer, and a residual connecting layer;
the pre-padding layer is used for receiving the output of the first convolution module and performing pre-padding processing on the output of the first convolution module based on a pre-padding parameter, wherein the pre-padding parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer; as an implementation, the first convolution sublayer may employ a convolution kernel of 1 × 3.
The second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer; as an implementation, the second convolution sublayer may employ a 1 × 1 convolution kernel.
And the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
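Under the same caveats, the following sketch illustrates this causal convolutional neural network; the channel count, the number of parallel branches and convolution units, the dilation (expansion) coefficients, and the use of summation as the fusion operation are all illustrative assumptions, while the 1 × 1, 1 × 3, and 1 × 1 kernels and the pre-filling/residual wiring follow the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Module):
    """One convolution unit: pre-filling layer -> first convolution sublayer
    (1x3, dilated) -> second convolution sublayer (1x1) -> residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        # Pre-filling parameter derived from the dilation (expansion)
        # coefficient: (kernel_size - 1) * dilation zeros, left side only.
        self.pad = (3 - 1) * dilation
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        h = F.pad(x, (self.pad, 0))   # pre-filling: causal left padding
        # Residual connection: the pre-filled output trimmed to the original
        # length equals x, so the residual can be added as x directly.
        return x + self.conv2(self.conv1(h))

class CausalConvNet(nn.Module):
    """First convolution module -> parallel second convolution modules
    -> fusion module."""
    def __init__(self, channels, branches=2, units_per_branch=3):
        super().__init__()
        self.first = nn.Conv1d(channels, channels, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Sequential(*[ConvUnit(channels, dilation=2 ** i)  # 1, 2, 4 (assumed)
                            for i in range(units_per_branch)])
            for _ in range(branches)
        )

    def forward(self, x):
        h = self.first(x)
        # Fusion module: element-wise sum of the branch outputs (assumed;
        # concatenation with a projection would also fit the text).
        return torch.stack([branch(h) for branch in self.branches]).sum(dim=0)

out = CausalConvNet(channels=16)(torch.randn(2, 16, 50))  # -> (2, 16, 50)
```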
It should be noted that the structure of the voice activity detection model proposed in the embodiment of the present application is merely exemplary, and other similar structures obtained on this basis should also be within the scope of the present application.
The following describes a voice activity detection device disclosed in an embodiment of the present application, and the voice activity detection device described below and the voice activity detection method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice activity detection apparatus according to an embodiment of the present application. As shown in fig. 5, the voice activity detecting apparatus may include:
the acquiring unit 11 is configured to acquire a voice feature of each signal frame corresponding to a voice signal to be detected;
a detecting unit 12, configured to input the speech characteristics of each signal frame into a speech activity detection model, where the speech activity detection model outputs a speech activity detection result of each signal frame, and the speech activity detection result of each signal frame is used to indicate whether the signal frame is a speech frame or a non-speech frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
a determining unit 13, configured to determine an active speech segment corresponding to the speech signal based on the detection result of speech activity of each signal frame.
As an implementation, the obtaining unit includes:
a framing and windowing unit, configured to perform framing and windowing processing on the voice signal to obtain a plurality of signal frames;
and the feature extraction unit is used for performing feature extraction on each signal frame to obtain the voice features of the signal frame.
As an implementable manner, the determining unit includes:
a smoothing operation unit, configured to perform smoothing operation on a voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
a noise voice segment and non-noise voice segment determining unit, for determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
and the active voice segment determining unit is used for determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
As an implementation manner, the noise speech segment and non-noise speech segment determining unit is specifically configured to:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
As an implementation manner, the voice activity detection model comprises a first convolutional layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolutional neural network, a second convolutional layer and a full-connection layer which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
As one possible implementation, the causal convolutional neural network includes: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame splicing layer and performing convolution processing on the output of the frame splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and performing fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
As an implementation manner, the second convolution module includes a preset number of convolution units, and each convolution unit includes a pre-filling layer, a first convolution sublayer, a second convolution sublayer and a residual connecting layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
Referring to fig. 6, fig. 6 is a block diagram of a hardware structure of a speech activity detection device according to an embodiment of the present application, and referring to fig. 6, the hardware structure of the speech activity detection device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for voice activity detection, the method comprising:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame; the voice activity detection model comprises a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolution neural network, a second convolution layer and a full-connection layer which are sequentially connected; the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
2. The method according to claim 1, wherein the obtaining the speech features of the signal frames corresponding to the speech signal to be detected comprises:
performing frame windowing on the voice signal to obtain a plurality of signal frames;
and for each signal frame, performing feature extraction on the signal frame to obtain the voice feature of the signal frame.
3. The method according to claim 1, wherein the determining the active speech segment corresponding to the speech signal based on the detection result of the speech activity of each signal frame comprises:
performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
and determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
4. The method of claim 3, wherein said determining a noisy speech segment and a non-noisy speech segment from each initial active speech segment comprises:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
5. The method of claim 1, wherein the regularization layer is configured to receive an output of the first convolutional layer and to regularize the output of the first convolutional layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and pooling the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
6. The method according to claim 5, wherein the causal convolutional neural network comprises: a first convolution module, a plurality of parallel second convolution modules each connected to the first convolution module, and a fusion module connected to the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame splicing layer and performing convolution processing on the output of the frame splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and performing fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolutional neural network.
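Sketched below, this branch-and-fuse structure runs the first convolution module's output through several parallel branches and merges the results. The branch count and the element-wise sum used for "fusion processing" are assumptions; SecondConvModule is the convolution-unit stack sketched after claim 7.

```python
import torch
import torch.nn as nn

class CausalCNN(nn.Module):
    """Claim 6 structure: first conv module -> parallel branches -> fusion."""
    def __init__(self, in_ch, n_branches=3):
        super().__init__()
        self.first = nn.Conv1d(in_ch, in_ch, kernel_size=1)  # first convolution module
        self.branches = nn.ModuleList(                        # parallel second modules
            [SecondConvModule(in_ch) for _ in range(n_branches)])

    def forward(self, x):
        x = self.first(x)
        # fusion module: element-wise sum over branch outputs (assumed fusion op)
        return torch.stack([b(x) for b in self.branches]).sum(dim=0)
```

With channel counts chosen to match the earlier sketch, `VADModel(causal_cnn=CausalCNN(3 * 64))` would wire the two pieces together.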
7. The method according to claim 6, wherein each second convolution module comprises a preset number of convolution units, each convolution unit comprising a pre-filling layer, a first convolution sublayer, a second convolution sublayer and a residual connection layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, the pre-filling parameter being determined based on a dilation coefficient of the causal convolutional neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connection layer is used for performing residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
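A sketch of claim 7's convolution unit and its stacking into a second convolution module: left-only ("pre-filling") zero padding of width (kernel − 1) × dilation keeps the convolution causal while preserving the frame count, and the residual path adds the unit's input back onto the second sublayer's output. The kernel size, the 1×1 second sublayer, the per-unit dilation doubling, and taking the residual from the unpadded input rather than the padded tensor are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Module):
    """One convolution unit: pre-fill -> two conv sublayers -> residual."""
    def __init__(self, ch, kernel=3, dilation=1):
        super().__init__()
        self.pre_fill = (kernel - 1) * dilation  # pre-filling parameter from dilation
        self.conv_a = nn.Conv1d(ch, ch, kernel, dilation=dilation)
        self.conv_b = nn.Conv1d(ch, ch, kernel_size=1)

    def forward(self, x):
        y = F.pad(x, (self.pre_fill, 0))   # pre-filling layer: left zeros only
        y = self.conv_b(self.conv_a(y))    # first and second convolution sublayers
        return x + y                       # residual connection layer

class SecondConvModule(nn.Module):
    """A preset number of convolution units, dilation doubling per unit (assumed)."""
    def __init__(self, ch, n_units=3, kernel=3):
        super().__init__()
        self.units = nn.Sequential(
            *[ConvUnit(ch, kernel, dilation=2 ** i) for i in range(n_units)])

    def forward(self, x):
        return self.units(x)
```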
8. A voice activity detection apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the voice characteristics of each signal frame corresponding to the voice signal to be detected;
the detection unit is used for inputting the voice characteristics of each signal frame into a voice activity detection model, the voice activity detection model outputting the voice activity detection result of each signal frame, the voice activity detection result of each signal frame being used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains the voice activity detection result of the signal frame based on the signal frame and the historical signal frames before the signal frame; the voice activity detection model comprises a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a frame splicing layer, a causal convolutional neural network, a second convolution layer and a full-connection layer which are connected in sequence; the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, the number of output signal frames being kept consistent with the number of received signal frames through front and back zero padding during the convolution processing;
and the determining unit is used for determining the active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
9. A voice activity detection device comprising a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to implement the steps of the voice activity detection method according to any one of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the voice activity detection method according to any one of claims 1 to 7.
CN202211051500.9A 2022-08-31 2022-08-31 Voice activity detection method, device, equipment and readable storage medium Active CN115132231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211051500.9A CN115132231B (en) 2022-08-31 2022-08-31 Voice activity detection method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN115132231A CN115132231A (en) 2022-09-30
CN115132231B true CN115132231B (en) 2022-12-13

Family

ID=83387721

Country Status (1)

Country Link
CN (1) CN115132231B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276125B (en) * 2020-02-11 2023-04-07 华南师范大学 Lightweight speech keyword recognition method facing edge calculation
KR102167808B1 (en) * 2020-03-31 2020-10-20 한밭대학교 산학협력단 Semantic segmentation method and system applicable to AR
CN113288183B (en) * 2021-05-20 2022-04-19 中国科学技术大学 Silent voice recognition method based on facial neck surface myoelectricity

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1391212A (en) * 2001-06-11 2003-01-15 阿尔卡塔尔公司 Method for detecting phonetic activity in signals and phonetic signal encoder including device thereof
WO2016188553A1 (en) * 2015-05-22 2016-12-01 Huawei Technologies Co., Ltd. Methods and nodes in a wireless communication network
CN106601229A (en) * 2016-11-15 2017-04-26 华南理工大学 Voice awakening method based on soc chip
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 Sensitivity-adjustable speech emotion recognition method and system
CN111312218A (en) * 2019-12-30 2020-06-19 苏州思必驰信息科技有限公司 Neural network training and voice endpoint detection method and device
WO2022036801A1 (en) * 2020-08-18 2022-02-24 深圳大学 Method and system for achieving coexistence of heterogeneous networks
CN111816216A (en) * 2020-08-25 2020-10-23 苏州思必驰信息科技有限公司 Voice activity detection method and device
CN113470652A (en) * 2021-06-30 2021-10-01 山东恒远智能科技有限公司 Voice recognition and processing method based on industrial Internet
CN114155839A (en) * 2021-12-15 2022-03-08 科大讯飞股份有限公司 Voice endpoint detection method, device, equipment and storage medium
CN114566179A (en) * 2022-03-16 2022-05-31 北京声加科技有限公司 Time delay controllable voice noise reduction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection; S. Y. Chang et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing; 20180420; pp. 5549-5552 *
Research on Tibetan speech recognition based on CNN multi-feature fusion (基于CNN多特征融合的藏语语音识别的研究); 侯苗苗; China Masters' Theses Full-text Database (中国优秀硕士学位论文全文数据库); 20211215; pp. 27-30 *

Similar Documents

Publication Publication Date Title
US11508366B2 (en) Whispering voice recovery method, apparatus and device, and readable storage medium
CN108428447B (en) Voice intention recognition method and device
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN110415699B (en) Voice wake-up judgment method and device and electronic equipment
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN109448746B (en) Voice noise reduction method and device
CN112967738B (en) Human voice detection method and device, electronic equipment and computer readable storage medium
CN112652306A (en) Voice wake-up method and device, computer equipment and storage medium
CN111048118B (en) Voice signal processing method and device and terminal
CN115132231B (en) Voice activity detection method, device, equipment and readable storage medium
WO2024017110A1 (en) Voice noise reduction method, model training method, apparatus, device, medium, and product
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN112289311A (en) Voice wake-up method and device, electronic equipment and storage medium
CN116312616A (en) Processing recovery method and control system for noisy speech signals
CN114333912B (en) Voice activation detection method, device, electronic equipment and storage medium
CN114360572A (en) Voice denoising method and device, electronic equipment and storage medium
JP6106618B2 (en) Speech section detection device, speech recognition device, method thereof, and program
CN113436640A (en) Audio noise reduction method, device and system and computer readable storage medium
JP3006496B2 (en) Voice recognition device
CN111048096A (en) Voice signal processing method and device and terminal
CN116110393B (en) Voice similarity-based refusing method, device, computer and medium
CN113689886B (en) Voice data emotion detection method and device, electronic equipment and storage medium
US20240170003A1 (en) Audio Signal Enhancement with Recursive Restoration Employing Deterministic Degradation
CN112447169B (en) Word boundary estimation method and device and electronic equipment
CN113393858A (en) Voice separation method and system, electronic device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant