CN113327589A - Voice activity detection method based on attitude sensor - Google Patents

Voice activity detection method based on attitude sensor Download PDF

Info

Publication number
CN113327589A
Authority
CN
China
Prior art keywords
data
characteristic data
neural network
attitude
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110646290.7A
Other languages
Chinese (zh)
Other versions
CN113327589B (en)
Inventor
王蒙
胡奎
姜黎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Ccvui Intelligent Technology Co ltd
Original Assignee
Hangzhou Ccvui Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Ccvui Intelligent Technology Co ltd filed Critical Hangzhou Ccvui Intelligent Technology Co ltd
Priority to CN202110646290.7A priority Critical patent/CN113327589B/en
Publication of CN113327589A publication Critical patent/CN113327589A/en
Application granted granted Critical
Publication of CN113327589B publication Critical patent/CN113327589B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a voice activity detection method based on an attitude sensor, and relates to the technical field of human-computer interaction. In the method, attitude feature data and sound feature data are feature-spliced to obtain mixed feature data, and the neural network model is trained on the mixed feature data, so that voice activity can be detected accurately under different postures, solving the problem of user posture affecting detection accuracy. The weights of the trained neural network are quantized and compressed with a three-value (ternary) quantization method: 32-bit floating-point weights are quantized into 2-bit fixed-point weights, which further reduces the memory occupied by the weights and greatly reduces the consumption of computation space and time. A recurrent neural network model is used to model the relation between preceding and following frames, improving the model's effect; and because the recurrent neural network model has few parameters, the occupied memory is further reduced.

Description

Voice activity detection method based on attitude sensor
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to a voice activity detection method based on an attitude sensor.
Background
Voice Activity Detection (VAD) is the classic problem of detecting speech segments and non-speech segments in a noisy speech signal. It has become an indispensable component of speech signal processing systems such as speech coding, speech enhancement and automatic speech recognition, and with the continuous development of digital devices, voice activity detection is used on such devices ever more widely.
Embedded headsets, a current hot product, are also being innovated constantly. An embedded headset is usually connected to a smart device; besides playing audio, it can interact with the smart device by collecting the wearer's voice, posture information and the like. Compared with a traditional headset it is smarter and richer in functions, and has quickly won users' favor.
As an interactive device for smart equipment, the embedded headset has high requirements on its data acquisition capability. For example, when a smartphone is voice-controlled through an embedded headset, clear speech must be collected. Although the smartphone usually performs noise reduction, separation and similar operations on the collected audio, if the embedded headset cannot guarantee the clarity and accuracy of the audio data it provides, even the most powerful audio processing software on the smartphone will not help.
The working environments of embedded headsets are complex and varied: the various postures of the user affect the collection and recognition of sound and degrade the quality of the collected audio data, so relevant measures are needed to improve the audio data.
To this end, the invention application with application number CN201911174434.2 discloses a headset wearer voice activity detection system based on microphone technology, comprising a microphone array, a first estimation module, a second estimation module and a joint control module. The microphone array receives acoustic signals; the first estimation module determines a first speech presence probability of the wearer according to the direction of arrival of the sound source; the second estimation module determines a second speech presence probability of the wearer according to the direct-to-reverberant ratio of the sound source; and the joint control module determines a third speech presence probability from the first and second speech presence probabilities and performs voice activity detection for the wearer. Using microphone array technology, the wearer's voice activity can be detected even in complex acoustic scenes such as low signal-to-noise ratio, high reverberation and multi-speaker interference, providing an important basis for subsequent speech enhancement and speech recognition.
However, that application does not address the changes in audio data collection caused by the user's posture, so a voice activity detection method that eliminates the influence of user posture is needed to solve the above problems.
Disclosure of Invention
In order to solve the above technical problem, the voice activity detection method based on an attitude sensor is applied to an audio acquisition device equipped with an attitude sensor. Quantized neural network training is performed on mixed feature data that combines attitude feature data and sound feature data, and an optimal solution of the neural network model used for voice activity detection is obtained, wherein the mixed feature data are constructed through the following steps:
acquiring the attitude change of the audio acquisition device through an attitude sensor and recording the attitude change as attitude characteristic data;
collecting external sound changes through an audio collection device and using the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature splicing on the preprocessed attitude feature data and the preprocessed sound feature data to obtain mixed feature data;
and taking the mixed feature data as training data for the subsequent quantized neural network model training.
As a more specific solution, the voice feature data is MFCC feature data, and MFCC voice feature data extraction and voice feature data preprocessing operations are performed by the following steps:
pre-emphasis is carried out on the sound characteristic data through a high-pass filter;
performing framing operation on the pre-emphasis data through a framing function;
carrying out windowing operation by substituting each sub-frame into a window function;
performing fast Fourier transform on each windowed sub-frame signal to obtain an energy spectrum of each sub-frame;
passing the energy spectrum through a Mel filter bank and performing a discrete cosine transform on the log Mel energies to obtain the MFCC coefficients;
extracting first-order difference parameters from the Mel spectrogram;
and splicing the MFCC coefficient and the first-order difference parameter to obtain MFCC characteristic data.
As a more specific solution, the attitude feature data preprocessing operation converts time-domain attitude feature data into frequency-domain attitude feature data; the attitude feature data comprise X-axis, Y-axis and Z-axis components, and the preprocessing operation is performed by the following steps:
performing framing operation on the attitude characteristic data, wherein each frame of the attitude characteristic data corresponds to each frame of the sound characteristic data one by one;
calculating the per-frame displacement (speed) and acceleration from the attitude feature data, where the calculation formulas are:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
wherein s(n) represents the speed of the nth frame, as(n) represents the acceleration of the nth frame, and f(n) represents the data position label of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain attitude characteristic data.
As a more specific solution, the preprocessed attitude feature data and voice feature data are feature-spliced through the following steps:
aligning the collected sound feature data and attitude feature data point by point according to their real-time corresponding positions;
marking the start position and end position of the sound feature data on the attitude feature data from the attitude sensor;
mixing random noise data with the marked sound feature data at a random SNR according to the signal-to-noise ratio requirement, while ensuring that the mixed data remain in one-to-one correspondence with the start and end positions of the sound feature data;
aligning the mixed data with the marked attitude feature data, thereby obtaining feature-spliced training data;
and performing feature splicing on all attitude feature data and sound feature data to obtain the feature-spliced training data set.
As a more specific solution, the neural network model is a recurrent neural network model, and the recurrent neural network model collects information of adjacent frames and adjusts a weight matrix for detecting voice activity of a current frame according to the information of the adjacent frames.
As a more specific solution, the weights of the trained neural network are quantized and compressed; through quantization compression, 32-bit floating-point weights are quantized into 2-bit fixed-point weights. The quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-valued weights;
multiplying the input X by α to form a new input, and combining the new input with the three-valued weights through additions instead of the original multiplications during forward propagation;
performing iterative training with back-propagation using the SGD algorithm.
As a more specific solution, the original weight matrix W is approximately expressed as the product of a three-valued weight matrix W^t and a scaling coefficient α, where the three-valued weights W^t are:
W^t_i = +1 if W_i > Δ; W^t_i = 0 if |W_i| ≤ Δ; W^t_i = -1 if W_i < -Δ;
wherein the threshold Δ is generated from the original weight matrix W as:
Δ = (0.7/n)·Σ|W_i|, summed over i = 1, …, n;
wherein i denotes the index of a weight term and n the total number of weight terms;
the scaling factor α is:
α = (1/|I_Δ|)·Σ|W_i|, summed over i ∈ I_Δ;
wherein I_Δ = { i | 1 ≤ i ≤ n, |W_i| > Δ } and |I_Δ| denotes the number of elements in I_Δ.
As a more specific solution, the windowing operation is performed with a Hamming window function, where the Hamming window function is:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
wherein n denotes the sample index within the intercepted frame, N - 1 denotes the truncation length of the Hamming window, and a0 denotes the Hamming window constant with a value of 25/46;
the emphasis coefficient of the pre-emphasis is 0.97, and the Mel filter function of the Mel filter is:
mel(f) = 2595·log10(1 + f/700);
wherein f denotes the original frequency to be filtered.
As a more specific solution, voice activity detection is performed by the trained neural network model. The neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the posterior probability of speech/non-speech is computed from the model's output by a softmax function. The posterior probability lies between 0 and 1: if it exceeds the decision threshold the frame is considered speech, otherwise it is considered non-speech.
Compared with the related art, the voice activity detection method based on the attitude sensor has the following beneficial effects:
1. According to the method, attitude feature data and sound feature data are feature-spliced to obtain mixed feature data, and the neural network model is trained on the mixed feature data, so that voice activity can be detected accurately under different postures, solving the problem of user posture affecting the detection accuracy;
2. In the invention, the weights of the trained neural network are quantized and compressed with the three-value quantization method among quantization compression methods, and 32-bit floating-point weights are quantized into 2-bit fixed-point weights, which further reduces the memory occupied by the weights and greatly reduces the consumption of computation space and time;
3. The invention takes into account the influence of adjacent-frame information on the VAD decision for the current frame and uses a recurrent neural network model to model the relation between the preceding and following frames, improving the model's effect; moreover, the recurrent neural network model has few parameters, which further reduces the occupied memory.
Drawings
Fig. 1 is a system diagram of a voice activity detection method based on an attitude sensor according to a preferred embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and embodiments.
As shown in fig. 1, the voice activity detection method based on the attitude sensor of the present invention is applied to an audio acquisition device with an attitude sensor.
Specifically, conventional voice activity detection methods are difficult to adapt to the usage scenarios of devices such as headsets: different user postures keep changing the voice activity detection scene, so detection accuracy is hard to guarantee, and the influence of user posture is difficult to eliminate through simple algorithmic improvements.
This embodiment provides a method for eliminating the influence of posture and increasing system robustness by combining an attitude sensor with an audio acquisition device. The attitude sensor, usually a sensor with three or more axes, is mounted together with the audio acquisition device, so that the attitude information of the audio acquisition device can be acquired in real time. Features are extracted from the acquired attitude information and sound information, quantized neural network training is performed on mixed feature data that combines attitude features and sound features, and an optimal solution of the neural network model is obtained. A neural network model trained in this way can perform real-time voice activity detection on the sound information in combination with the attitude information, thereby improving the accuracy and robustness of voice activity detection.
Specifically, the neural network model is used for voice activity detection, and the mixed feature data is constructed through the following steps:
acquiring the attitude change of the audio acquisition device through an attitude sensor and recording the attitude change as attitude characteristic data;
collecting external sound changes through an audio collection device and using the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature splicing on the preprocessed attitude feature data and the preprocessed sound feature data to obtain mixed feature data;
and taking the mixed feature data as training data for the subsequent quantized neural network model training.
It should be noted that: the mixed feature data take both the sound features and the attitude features into account; when used for subsequent model training, they enhance the model's adaptability and robustness for voice activity detection under different postures.
As a more specific solution, the voice feature data is MFCC feature data, and MFCC voice feature data extraction and voice feature data preprocessing operations are performed by the following steps:
pre-emphasis is carried out on the sound characteristic data through a high-pass filter;
performing framing operation on the pre-emphasis data through a framing function;
carrying out windowing operation by substituting each sub-frame into a window function;
performing fast Fourier transform on each windowed sub-frame signal to obtain an energy spectrum of each sub-frame;
passing the energy spectrum through a Mel filter bank and performing a discrete cosine transform on the log Mel energies to obtain the MFCC coefficients;
extracting first-order difference parameters from the Mel spectrogram;
and splicing the MFCC coefficient and the first-order difference parameter to obtain MFCC characteristic data.
It should be noted that: in the detection of voice activity, the present embodiment employs Mel-scale Frequency Cepstral Coefficients (MFCC). MFCCs are set according to human auditory mechanisms to have different auditory sensitivities to sound waves of different frequencies. Speech signals from 200Hz to 5000Hz have a large impact on the intelligibility of speech. When two sounds with different loudness act on human ears, the presence of frequency components with higher loudness affects the perception of frequency components with lower loudness, making them less noticeable, which is called masking effect. Since lower frequency sounds travel a greater distance up the cochlear inner basilar membrane than higher frequency sounds, generally bass sounds tend to mask treble sounds, while treble sounds mask bass sounds more difficult. The critical bandwidth of sound masking at low frequencies is smaller than at higher frequencies. Therefore, a group of band-pass filters is arranged according to the size of a critical bandwidth in a frequency band from low frequency to high frequency to filter the input signal. The signal energy output by each band-pass filter is used as the basic characteristic of the signal, and the characteristic can be used as the input characteristic of voice after being further processed. Since the characteristics do not depend on the properties of the signals, no assumptions and restrictions are made on the input signals, and the research results of the auditory model are utilized. Therefore, the parameter has better robustness than the LPCC based on the vocal tract model, better conforms to the auditory characteristics of human ears, and still has better recognition performance when the signal-to-noise ratio is reduced.
As a more specific solution, the attitude feature data preprocessing operation converts time-domain attitude feature data into frequency-domain attitude feature data; the attitude feature data comprise X-axis, Y-axis and Z-axis components, and the preprocessing operation is performed by the following steps:
performing framing operation on the attitude characteristic data, wherein each frame of the attitude characteristic data corresponds to each frame of the sound characteristic data one by one;
calculating the per-frame displacement (speed) and acceleration from the attitude feature data, where the calculation formulas are:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
wherein s(n) represents the speed of the nth frame, as(n) represents the acceleration of the nth frame, and f(n) represents the data position label of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain attitude characteristic data.
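As a rough illustration of the preprocessing above, the sketch below computes the per-frame speed s(n), acceleration as(n) and their logarithmic transform for the three axes, assuming one attitude sample f(n) per audio frame; the sign-preserving log1p transform is an assumption, since the application only states that a logarithmic transformation is applied.

import numpy as np

def attitude_features(positions):
    # positions: array of shape (n_frames, 3) holding the per-frame X/Y/Z position labels f(n)
    speed = np.vstack([np.zeros((1, 3)), np.diff(positions, axis=0)])   # s(n) = f(n) - f(n-1)
    accel = np.vstack([np.zeros((1, 3)), np.diff(speed, axis=0)])       # as(n) = s(n) - s(n-1)
    # Logarithmic transformation of speed and acceleration (sign preserved, assumed form)
    log_speed = np.sign(speed) * np.log1p(np.abs(speed))
    log_accel = np.sign(accel) * np.log1p(np.abs(accel))
    # Splice speed and acceleration into one attitude feature vector per frame
    return np.hstack([log_speed, log_accel])                            # shape (n_frames, 6)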
As a more specific solution, the preprocessed attitude feature data and voice feature data are feature-spliced through the following steps:
aligning the collected sound feature data and attitude feature data point by point according to their real-time corresponding positions;
marking the start position and end position of the sound feature data on the attitude feature data from the attitude sensor;
mixing random noise data with the marked sound feature data at a random SNR according to the signal-to-noise ratio requirement, while ensuring that the mixed data remain in one-to-one correspondence with the start and end positions of the sound feature data;
aligning the mixed data with the marked attitude feature data, thereby obtaining feature-spliced training data;
and performing feature splicing on all attitude feature data and sound feature data to obtain the feature-spliced training data set.
It should be noted that: the attitude feature data and the sound feature data are aligned and marked under the premise of strict real-time correspondence; only when this step is handled correctly can a good training effect be obtained.
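A minimal sketch of this alignment and mixing step is given below; the SNR range, the waveform-domain mixing and the frame-level labelling are assumptions used only for illustration.

import numpy as np

def mix_at_random_snr(clean, noise, snr_range=(0, 20)):
    # Mix a marked clean recording with random noise at a randomly drawn SNR (in dB)
    snr_db = np.random.uniform(*snr_range)
    noise = noise[:len(clean)]
    clean_pow = np.mean(clean ** 2) + 1e-10
    noise_pow = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + scale * noise

def splice_features(sound_feats, attitude_feats, labels):
    # sound_feats:    (n_frames, d_sound)  features of the noise-mixed audio
    # attitude_feats: (n_frames, d_att)    preprocessed attitude features
    # labels:         (n_frames,)          1 inside the marked speech start/end span, 0 otherwise
    assert len(sound_feats) == len(attitude_feats) == len(labels)
    mixed = np.hstack([sound_feats, attitude_feats])    # one mixed feature vector per frame
    return mixed, labels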
As a more specific solution, the neural network model is a recurrent neural network model, and the recurrent neural network model collects information of adjacent frames and adjusts a weight matrix for detecting voice activity of a current frame according to the information of the adjacent frames.
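The application does not fix a particular recurrent architecture; the following PyTorch sketch, using a single GRU layer whose hidden state carries adjacent-frame information into the per-frame decision, is therefore only one possible instantiation, and the layer sizes are assumptions.

import torch
import torch.nn as nn

class RecurrentVAD(nn.Module):
    def __init__(self, input_dim, hidden_dim=32):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)             # speech / non-speech logits

    def forward(self, x):
        # x: (batch, n_frames, input_dim) sequences of mixed feature vectors
        h, _ = self.rnn(x)                              # hidden state carries adjacent-frame context
        return self.out(h)                              # (batch, n_frames, 2)

# Training would use SGD back-propagation, as in the quantized training described below, e.g.:
# model = RecurrentVAD(input_dim=32)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)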
As a more specific solution, the weights of the trained neural network are quantized and compressed; through quantization compression, 32-bit floating-point weights are quantized into 2-bit fixed-point weights. The quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-valued weights;
multiplying the input X by α to form a new input, and combining the new input with the three-valued weights through additions instead of the original multiplications during forward propagation;
performing iterative training with back-propagation using the SGD algorithm.
It should be noted that: artificial neural networks have enabled computers to achieve an unprecedented level of performance in processing speech recognition tasks. However, the high complexity of the model brings high storage space and computational resource consumption, so that the model is difficult to implement on each hardware platform.
To address these issues, models are compressed to minimize the consumption of computational space and time by the model. Currently, the mainstream network, such as VGG16, has a parameter amount of 1 hundred and 3 million or more, occupies more than 500 MB of space, and needs more than 300 hundred million floating point operations to complete one recognition task.
An artificial neural network contains a large number of redundant nodes, and only a small portion (5-10%) of the weights participates in the main computation; in other words, training only a small portion of the weight parameters can achieve performance similar to that of the original network. The trained neural network model therefore needs to be compressed, and compression methods for neural network models include tensor decomposition, model pruning and model quantization.
Tensor decomposition treats the network weights as a full-rank matrix and approximates it with several low-rank matrices. The method is suitable for model compression but is not easy to implement: it involves computationally expensive decomposition operations and requires a large amount of retraining to reach convergence.
Model pruning removes the relatively unimportant weights from the weight matrix and then fine-tunes (finetune) the network. However, pruning makes the network connections irregular, so memory occupation has to be reduced through sparse representations, and forward propagation then needs a large number of conditional checks and extra space to mark zero and non-zero parameter positions. The method is therefore unsuitable for parallel computing, and unstructured sparsity requires special software libraries or hardware.
Therefore, quantization compression is performed along the model quantization direction. Generally, the weights of a neural network model are all represented by 32-bit floating-point numbers. In many cases such high precision is unnecessary, and the weights can be expressed, for example, with 8 bits through quantization: the space required for each weight is reduced by sacrificing precision. SGD training only requires a precision of 6-8 bits, so reasonable quantization can reduce the storage size of the model while the accuracy is still guaranteed. Depending on the quantization method, binary quantization, ternary quantization or multi-valued quantization can be used. This embodiment selects ternary quantization, which, compared with binary quantization, adds a 0 value to the values 1 and -1 without increasing the amount of computation.
Iterative training is carried out with back-propagation using the SGD algorithm, and the computed gradients are used to adjust the weights of the neural network. The SGD algorithm is a form of gradient descent; as it adjusts the weights, the neural network produces outputs closer to the desired ones, and the overall error of the network should decrease with training.
As a more specific solution, the original weight matrix W is approximately expressed as the product of a three-valued weight matrix W^t and a scaling coefficient α, where the three-valued weights W^t are:
W^t_i = +1 if W_i > Δ; W^t_i = 0 if |W_i| ≤ Δ; W^t_i = -1 if W_i < -Δ;
wherein the threshold Δ is generated from the original weight matrix W as:
Δ = (0.7/n)·Σ|W_i|, summed over i = 1, …, n;
wherein i denotes the index of a weight term and n the total number of weight terms;
the scaling factor α is:
α = (1/|I_Δ|)·Σ|W_i|, summed over i ∈ I_Δ;
wherein I_Δ = { i | 1 ≤ i ≤ n, |W_i| > Δ } and |I_Δ| denotes the number of elements in I_Δ.
As a more specific solution, the windowing operation is performed with a Hamming window function, where the Hamming window function is:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
wherein n denotes the sample index within the intercepted frame, N - 1 denotes the truncation length of the Hamming window, and a0 denotes the Hamming window constant with a value of 25/46;
the emphasis coefficient of the pre-emphasis is 0.97, and the Mel filter function of the Mel filter is:
mel(f) = 2595·log10(1 + f/700);
wherein f denotes the original frequency to be filtered.
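For reference, the two helper functions above can be written directly from these formulas; the window length N is left as a parameter.

import numpy as np

def hamming_window(N, a0=25 / 46):
    # w(n) = a0 - (1 - a0) * cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
    n = np.arange(N)
    return a0 - (1 - a0) * np.cos(2 * np.pi * n / (N - 1))

def hz_to_mel(f):
    # mel(f) = 2595 * log10(1 + f / 700)
    return 2595 * np.log10(1 + f / 700)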
As a more specific solution, voice activity detection is performed by the trained neural network model. The neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the posterior probability of speech/non-speech is computed from the model's output by a softmax function. The posterior probability lies between 0 and 1: if it exceeds the decision threshold the frame is considered speech, otherwise it is considered non-speech.
It should be noted that the neural network model obtained by training on the mixed feature data adapts well to voice activity detection under various postures. The softmax function is mainly used to normalize the model's output: it "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) whose elements all lie in the range (0, 1) and sum to 1, so that speech and non-speech can be classified accurately.
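The decision step can be illustrated by the short sketch below, which applies a softmax to the per-frame network outputs and compares the speech posterior with a decision threshold; the threshold value 0.5 is an illustrative assumption, as the application does not specify it.

import numpy as np

def vad_decisions(logits, threshold=0.5):
    # logits: (n_frames, 2) raw network outputs for [non-speech, speech]
    z = logits - logits.max(axis=1, keepdims=True)      # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    speech_posterior = probs[:, 1]                      # posterior probability of speech, in (0, 1)
    return speech_posterior > threshold                 # True marks a speech frame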
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A voice activity detection method based on an attitude sensor is applied to an audio acquisition device with the attitude sensor, and is characterized in that a neural network quantization training is carried out by constructing mixed characteristic data which takes attitude characteristic data and sound characteristic data into consideration, and an optimal solution of a neural network model is obtained, wherein the neural network model is used for voice activity detection, and the mixed characteristic data is constructed through the following steps:
acquiring the attitude change of the audio acquisition device through an attitude sensor and recording the attitude change as attitude characteristic data;
collecting external sound changes through an audio collection device and using the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature splicing on the preprocessed attitude feature data and the preprocessed sound feature data to obtain mixed feature data;
and taking the mixed characteristic data as training data for subsequent quantized neural network training.
2. An attitude sensor based voice activity detection method according to claim 1, wherein the voice feature data is MFCC feature data, and MFCC voice feature data extraction and voice feature data preprocessing operations are performed by:
pre-emphasis is carried out on the sound characteristic data through a high-pass filter;
performing framing operation on the pre-emphasis data through a framing function;
carrying out windowing operation by substituting each sub-frame into a window function;
performing fast Fourier transform on each windowed sub-frame signal to obtain an energy spectrum of each sub-frame;
performing a discrete cosine transform on the energy spectrum to obtain an MFCC coefficient;
extracting a first order difference parameter from the Mel frequency spectrogram;
and splicing the MFCC coefficient and the first-order difference parameter to obtain MFCC characteristic data.
3. The method of claim 1, wherein the preprocessing operation on the gesture feature data is an operation of converting time domain gesture feature data into frequency domain gesture feature data, the gesture feature data is gesture feature data comprising an X axis, a Y axis and a Z axis, and the preprocessing operation on the gesture feature data is performed by:
performing framing operation on the attitude characteristic data, wherein each frame of the attitude characteristic data corresponds to each frame of the sound characteristic data one by one;
calculating the displacement of each frame according to the attitude characteristic data, wherein the calculation formula is as follows:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
wherein, s (n) represents the speed of the nth frame, as (n) represents the acceleration of the nth frame, and f (n) represents the data position label of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain attitude characteristic data.
4. The method of claim 1, wherein the preprocessed gesture feature data and the voice feature data are feature-spliced by:
aligning the collected sound characteristic data and attitude characteristic data point by point according to their real-time corresponding positions;
marking the start position and end position of the sound characteristic data on the attitude characteristic data from the attitude sensor;
mixing random noise data with the marked sound characteristic data at a random SNR according to the signal-to-noise ratio requirement, while ensuring that the mixed data remain in one-to-one correspondence with the start and end positions of the sound characteristic data;
aligning the mixed data with the marked attitude characteristic data, thereby obtaining feature-spliced training data;
and performing feature splicing on all attitude characteristic data and sound characteristic data to obtain the feature-spliced training data set.
5. The method as claimed in claim 1, wherein the neural network model is a recurrent neural network model, and the recurrent neural network model collects information of adjacent frames and adjusts a weight matrix for detecting voice activity of a current frame according to the information of the adjacent frames.
6. The method of claim 1, wherein the trained neural network is quantized and compressed, and a 32-bit floating point weight is quantized into a 2-bit fixed point weight by quantization and compression; the quantization compression steps are as follows:
calculating threshold value delta and scaling factor alpha from original matrix
Converting the original weight into a three-valued weight;
multiplying the input X by alpha to serve as a new input, and then carrying out addition calculation on the new input and the three-valued weight to replace the original multiplication calculation for forward propagation;
iterative training is performed using the SGD algorithm backpropagation.
7. The attitude sensor based voice activity detection method according to claim 6, characterized in that the original weight matrix W is approximately expressed as the product of a three-valued weight matrix W^t and a scaling coefficient α, where the three-valued weights W^t are:
W^t_i = +1 if W_i > Δ; W^t_i = 0 if |W_i| ≤ Δ; W^t_i = -1 if W_i < -Δ;
wherein the threshold Δ is generated from the original weight matrix W as:
Δ = (0.7/n)·Σ|W_i|, summed over i = 1, …, n;
wherein i denotes the index of a weight term and n the total number of weight terms;
the scaling factor α is:
α = (1/|I_Δ|)·Σ|W_i|, summed over i ∈ I_Δ;
wherein I_Δ = { i | 1 ≤ i ≤ n, |W_i| > Δ } and |I_Δ| denotes the number of elements in I_Δ.
8. An attitude sensor based voice activity detection method according to claim 2, characterized in that the windowing is performed by a hamming window function, which is:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
wherein n represents the sample index within the intercepted signal, a0 represents the Hamming window constant with a value of 25/46, and N - 1 represents the truncation length of the Hamming window;
the emphasis coefficient of the pre-emphasis is 0.97, and the Mel filter function of the Mel filter is:
mel(f) = 2595·log10(1 + f/700);
wherein f represents the original frequency to be filtered.
9. An attitude sensor-based voice activity detection method according to claim 2, characterized in that voice activity detection is performed by a trained neural network model; the neural network model is a deep neural network model, the deep neural network model carries out frame-by-frame feature data processing on an audio signal needing voice activity detection, and the posterior probability of voice/non-voice is calculated according to the calculation result of the deep neural network model through a softmax function; the posterior probability value is between 0 and 1, if the posterior probability value exceeds the judgment threshold value, the voice is considered to be voice, and if the posterior probability value does not exceed the judgment threshold value, the non-voice is considered to be non-voice.
CN202110646290.7A 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor Active CN113327589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110646290.7A CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110646290.7A CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Publications (2)

Publication Number Publication Date
CN113327589A true CN113327589A (en) 2021-08-31
CN113327589B CN113327589B (en) 2023-04-25

Family

ID=77420338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110646290.7A Active CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Country Status (1)

Country Link
CN (1) CN113327589B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818773A (en) * 2022-03-12 2022-07-29 西北工业大学 Low-rank matrix sparsity compensation method for improving reverberation suppression robustness

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708857A (en) * 2011-03-02 2012-10-03 微软公司 Motion-based voice activity detection
CN106531186A (en) * 2016-10-28 2017-03-22 中国科学院计算技术研究所 Footstep detecting method according to acceleration and audio information
CN109872728A (en) * 2019-02-27 2019-06-11 南京邮电大学 Voice and posture bimodal emotion recognition method based on kernel canonical correlation analysis
US10692485B1 (en) * 2016-12-23 2020-06-23 Amazon Technologies, Inc. Non-speech input to speech processing system
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708857A (en) * 2011-03-02 2012-10-03 微软公司 Motion-based voice activity detection
CN106531186A (en) * 2016-10-28 2017-03-22 中国科学院计算技术研究所 Footstep detecting method according to acceleration and audio information
US10692485B1 (en) * 2016-12-23 2020-06-23 Amazon Technologies, Inc. Non-speech input to speech processing system
CN109872728A (en) * 2019-02-27 2019-06-11 南京邮电大学 Voice and posture bimodal emotion recognition method based on kernel canonical correlation analysis
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KALIN STEFANOV et al.: "Spatial Bias in Vision-Based Voice Activity Detection", 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818773A (en) * 2022-03-12 2022-07-29 西北工业大学 Low-rank matrix sparsity compensation method for improving reverberation suppression robustness
CN114818773B (en) * 2022-03-12 2024-04-16 西北工业大学 Low-rank matrix sparsity compensation method for improving reverberation suppression robustness

Also Published As

Publication number Publication date
CN113327589B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
CN111833896B (en) Voice enhancement method, system, device and storage medium for fusing feedback signals
CN106486131B (en) A kind of method and device of speech de-noising
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN113889138B (en) Target voice extraction method based on double microphone arrays
Wang et al. Deep learning assisted time-frequency processing for speech enhancement on drones
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN111798875A (en) VAD implementation method based on three-value quantization compression
CN111027675B (en) Automatic adjusting method and system for multimedia playing setting
CN113327589B (en) Voice activity detection method based on attitude sensor
Girirajan et al. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
CN112397090B (en) Real-time sound classification method and system based on FPGA
CN110197657B (en) Dynamic sound feature extraction method based on cosine similarity
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation
CN114566179A (en) Time delay controllable voice noise reduction method
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
CN114464188A (en) Voiceprint awakening algorithm based on distributed edge calculation
Skariah et al. Review of speech enhancement methods using generative adversarial networks
CN112992157A (en) Neural network noisy line identification method based on residual error and batch normalization
Pan et al. Application of hidden Markov models in speech command recognition
Srinivasarao An efficient recurrent Rats function network (Rrfn) based speech enhancement through noise reduction
Chen et al. Analysis of Embedded AI Speech Recognition Technology Based on MFCC

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant