CN113327589A - Voice activity detection method based on attitude sensor - Google Patents

Voice activity detection method based on attitude sensor Download PDF

Info

Publication number
CN113327589A
Authority
CN
China
Prior art keywords
data
characteristic data
neural network
attitude
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110646290.7A
Other languages
Chinese (zh)
Other versions
CN113327589B (en)
Inventor
王蒙
胡奎
姜黎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Ccvui Intelligent Technology Co ltd
Original Assignee
Hangzhou Ccvui Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Ccvui Intelligent Technology Co ltd filed Critical Hangzhou Ccvui Intelligent Technology Co ltd
Priority to CN202110646290.7A priority Critical patent/CN113327589B/en
Publication of CN113327589A publication Critical patent/CN113327589A/en
Application granted granted Critical
Publication of CN113327589B publication Critical patent/CN113327589B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a voice activity detection method based on an attitude sensor, and relates to the technical field of human-computer interaction. In the method, attitude feature data and sound feature data are feature-spliced to obtain mixed feature data, and the neural network model is trained on the mixed feature data, so that voice activity can be detected accurately under different postures, solving the problem of user posture affecting detection accuracy. The weights of the trained neural network are quantized and compressed with a three-value (ternary) quantization method: 32-bit floating-point weights are quantized into 2-bit fixed-point weights, which further reduces the memory occupied by the weights and greatly reduces the consumption of computation space and time. A recurrent neural network model is used to model the relation between preceding and following frames, improving the model's effect; and because the recurrent neural network model has few parameters, the occupied memory is further reduced.

Description

Voice activity detection method based on attitude sensor
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to a voice activity detection method based on an attitude sensor.
Background
Voice Activity Detection (VAD) is the classic problem of detecting speech segments and non-speech segments in a noisy speech signal. It has become an indispensable component of speech signal processing systems such as speech coding, speech enhancement and automatic speech recognition, and with the continuous development of digital devices, voice activity detection is used on such devices ever more widely.
Embedded headsets, a current hot product, are also being innovated constantly. An embedded headset is usually connected to a smart device; besides playing audio, it can interact with the smart device by collecting the wearer's voice, posture information and the like. Compared with a traditional headset it is smarter and richer in functions, and has quickly won users' favor.
As an interactive device for smart equipment, the embedded headset has high requirements on its data acquisition capability. For example, when a smartphone is voice-controlled through an embedded headset, clear speech must be collected. Although the smartphone usually performs noise reduction, separation and similar operations on the collected audio, if the embedded headset cannot guarantee the clarity and accuracy of the audio data it provides, even the most powerful audio processing software on the smartphone will not help.
The working environments of embedded headsets are complex and varied: the various postures of the user affect the collection and recognition of sound and degrade the quality of the collected audio data, so relevant measures are needed to improve the audio data.
To this end, the invention application with application number CN201911174434.2 discloses a headset wearer voice activity detection system based on microphone technology, comprising a microphone array, a first estimation module, a second estimation module and a joint control module. The microphone array receives acoustic signals; the first estimation module determines a first speech presence probability of the wearer according to the direction of arrival of the sound source; the second estimation module determines a second speech presence probability of the wearer according to the direct-to-reverberant ratio of the sound source; and the joint control module determines a third speech presence probability from the first and second speech presence probabilities and performs voice activity detection for the wearer. Using microphone array technology, the wearer's voice activity can be detected even in complex acoustic scenes such as low signal-to-noise ratio, high reverberation and multi-speaker interference, providing an important basis for subsequent speech enhancement and speech recognition.
However, that application does not address the changes in audio data collection caused by the user's posture, so a voice activity detection method that eliminates the influence of user posture is needed to solve the above problems.
Disclosure of Invention
In order to solve the above technical problem, the voice activity detection method based on an attitude sensor is applied to an audio acquisition device equipped with an attitude sensor. Quantized neural network training is performed on mixed feature data that combines attitude feature data and sound feature data, and an optimal solution of the neural network model used for voice activity detection is obtained, wherein the mixed feature data are constructed through the following steps:
acquiring the attitude change of the audio acquisition device through an attitude sensor and recording the attitude change as attitude characteristic data;
collecting external sound changes through an audio collection device and using the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature splicing on the preprocessed attitude feature data and the preprocessed sound feature data to obtain mixed feature data;
and taking the mixed feature data as training data for the subsequent quantized neural network model training.
As a more specific solution, the voice feature data is MFCC feature data, and MFCC voice feature data extraction and voice feature data preprocessing operations are performed by the following steps:
pre-emphasis is carried out on the sound characteristic data through a high-pass filter;
performing framing operation on the pre-emphasis data through a framing function;
carrying out windowing operation by substituting each sub-frame into a window function;
performing fast Fourier transform on each windowed sub-frame signal to obtain an energy spectrum of each sub-frame;
passing the energy spectrum through a Mel filter bank and performing a discrete cosine transform on the log Mel energies to obtain the MFCC coefficients;
extracting first-order difference parameters from the Mel spectrogram;
and splicing the MFCC coefficient and the first-order difference parameter to obtain MFCC characteristic data.
As a more specific solution, the attitude feature data preprocessing operation converts time-domain attitude feature data into frequency-domain attitude feature data; the attitude feature data comprise X-axis, Y-axis and Z-axis components, and the preprocessing operation is performed by the following steps:
performing framing operation on the attitude characteristic data, wherein each frame of the attitude characteristic data corresponds to each frame of the sound characteristic data one by one;
calculating the per-frame displacement (speed) and acceleration from the attitude feature data, where the calculation formulas are:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
wherein s(n) represents the speed of the nth frame, as(n) represents the acceleration of the nth frame, and f(n) represents the data position label of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain attitude characteristic data.
As a more specific solution, the preprocessed attitude feature data and voice feature data are feature-spliced through the following steps:
aligning the collected sound feature data and attitude feature data point by point according to their real-time corresponding positions;
marking the start position and end position of the sound feature data on the attitude feature data from the attitude sensor;
mixing random noise data with the marked sound feature data at a random SNR according to the signal-to-noise ratio requirement, while ensuring that the mixed data remain in one-to-one correspondence with the start and end positions of the sound feature data;
aligning the mixed data with the marked attitude feature data, thereby obtaining feature-spliced training data;
and performing feature splicing on all attitude feature data and sound feature data to obtain the feature-spliced training data set.
As a more specific solution, the neural network model is a recurrent neural network model, and the recurrent neural network model collects information of adjacent frames and adjusts a weight matrix for detecting voice activity of a current frame according to the information of the adjacent frames.
As a more specific solution, the weights of the trained neural network are quantized and compressed; through quantization compression, 32-bit floating-point weights are quantized into 2-bit fixed-point weights. The quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-valued weights;
multiplying the input X by α to form a new input, and combining the new input with the three-valued weights through additions instead of the original multiplications during forward propagation;
performing iterative training with back-propagation using the SGD algorithm.
As a more specific solution, the original weight matrix W is approximately expressed as the product of a three-valued weight matrix W^t and a scaling coefficient α, where the three-valued weights W^t are:
W^t_i = +1 if W_i > Δ; W^t_i = 0 if |W_i| ≤ Δ; W^t_i = -1 if W_i < -Δ;
wherein the threshold Δ is generated from the original weight matrix W as:
Δ = (0.7/n)·Σ|W_i|, summed over i = 1, …, n;
wherein i denotes the index of a weight term and n the total number of weight terms;
the scaling factor α is:
α = (1/|I_Δ|)·Σ|W_i|, summed over i ∈ I_Δ;
wherein I_Δ = { i | 1 ≤ i ≤ n, |W_i| > Δ } and |I_Δ| denotes the number of elements in I_Δ.
As a more specific solution, the windowing operation is performed with a Hamming window function, where the Hamming window function is:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
wherein n denotes the sample index within the intercepted frame, N - 1 denotes the truncation length of the Hamming window, and a0 denotes the Hamming window constant with a value of 25/46;
the emphasis coefficient of the pre-emphasis is 0.97, and the Mel filter function of the Mel filter is:
mel(f) = 2595·log10(1 + f/700);
wherein f denotes the original frequency to be filtered.
As a more specific solution, voice activity detection is performed by the trained neural network model. The neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the posterior probability of speech/non-speech is computed from the model's output by a softmax function. The posterior probability lies between 0 and 1: if it exceeds the decision threshold the frame is considered speech, otherwise it is considered non-speech.
Compared with the related art, the voice activity detection method based on the attitude sensor has the following beneficial effects:
1. According to the method, attitude feature data and sound feature data are feature-spliced to obtain mixed feature data, and the neural network model is trained on the mixed feature data, so that voice activity can be detected accurately under different postures, solving the problem of user posture affecting the detection accuracy;
2. In the invention, the weights of the trained neural network are quantized and compressed with the three-value quantization method among quantization compression methods, and 32-bit floating-point weights are quantized into 2-bit fixed-point weights, which further reduces the memory occupied by the weights and greatly reduces the consumption of computation space and time;
3. The invention takes into account the influence of adjacent-frame information on the VAD decision for the current frame and uses a recurrent neural network model to model the relation between the preceding and following frames, improving the model's effect; moreover, the recurrent neural network model has few parameters, which further reduces the occupied memory.
Drawings
Fig. 1 is a system diagram of a voice activity detection method based on an attitude sensor according to a preferred embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and embodiments.
As shown in fig. 1, the voice activity detection method based on the attitude sensor of the present invention is applied to an audio acquisition device with an attitude sensor.
Specifically, conventional voice activity detection methods are difficult to adapt to the usage scenarios of devices such as headsets: different user postures keep changing the voice activity detection scene, so detection accuracy is hard to guarantee, and the influence of user posture is difficult to eliminate through simple algorithmic improvements.
This embodiment provides a method for eliminating the influence of posture and increasing system robustness by combining an attitude sensor with an audio acquisition device. The attitude sensor, usually a sensor with three or more axes, is mounted together with the audio acquisition device, so that the attitude information of the audio acquisition device can be acquired in real time. Features are extracted from the acquired attitude information and sound information, quantized neural network training is performed on mixed feature data that combines attitude features and sound features, and an optimal solution of the neural network model is obtained. A neural network model trained in this way can perform real-time voice activity detection on the sound information in combination with the attitude information, thereby improving the accuracy and robustness of voice activity detection.
Specifically, the neural network model is used for voice activity detection, and the mixed feature data is constructed through the following steps:
acquiring the attitude change of the audio acquisition device through an attitude sensor and recording the attitude change as attitude characteristic data;
collecting external sound changes through an audio collection device and using the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature splicing on the preprocessed attitude feature data and the preprocessed sound feature data to obtain mixed feature data;
and taking the mixed feature data as training data for the subsequent quantized neural network model training.
It should be noted that: the mixed feature data take both the sound features and the attitude features into account; when used for subsequent model training, they enhance the model's adaptability and robustness for voice activity detection under different postures.
As a more specific solution, the voice feature data is MFCC feature data, and MFCC voice feature data extraction and voice feature data preprocessing operations are performed by the following steps:
pre-emphasis is carried out on the sound characteristic data through a high-pass filter;
performing framing operation on the pre-emphasis data through a framing function;
carrying out windowing operation by substituting each sub-frame into a window function;
performing fast Fourier transform on each windowed sub-frame signal to obtain an energy spectrum of each sub-frame;
passing the energy spectrum through a Mel filter bank and performing a discrete cosine transform on the log Mel energies to obtain the MFCC coefficients;
extracting first-order difference parameters from the Mel spectrogram;
and splicing the MFCC coefficient and the first-order difference parameter to obtain MFCC characteristic data.
It should be noted that: in the detection of voice activity, the present embodiment employs Mel-scale Frequency Cepstral Coefficients (MFCC). MFCCs are set according to human auditory mechanisms to have different auditory sensitivities to sound waves of different frequencies. Speech signals from 200Hz to 5000Hz have a large impact on the intelligibility of speech. When two sounds with different loudness act on human ears, the presence of frequency components with higher loudness affects the perception of frequency components with lower loudness, making them less noticeable, which is called masking effect. Since lower frequency sounds travel a greater distance up the cochlear inner basilar membrane than higher frequency sounds, generally bass sounds tend to mask treble sounds, while treble sounds mask bass sounds more difficult. The critical bandwidth of sound masking at low frequencies is smaller than at higher frequencies. Therefore, a group of band-pass filters is arranged according to the size of a critical bandwidth in a frequency band from low frequency to high frequency to filter the input signal. The signal energy output by each band-pass filter is used as the basic characteristic of the signal, and the characteristic can be used as the input characteristic of voice after being further processed. Since the characteristics do not depend on the properties of the signals, no assumptions and restrictions are made on the input signals, and the research results of the auditory model are utilized. Therefore, the parameter has better robustness than the LPCC based on the vocal tract model, better conforms to the auditory characteristics of human ears, and still has better recognition performance when the signal-to-noise ratio is reduced.
As a more specific solution, the attitude feature data preprocessing operation converts time-domain attitude feature data into frequency-domain attitude feature data; the attitude feature data comprise X-axis, Y-axis and Z-axis components, and the preprocessing operation is performed by the following steps:
performing framing operation on the attitude characteristic data, wherein each frame of the attitude characteristic data corresponds to each frame of the sound characteristic data one by one;
calculating the per-frame displacement (speed) and acceleration from the attitude feature data, where the calculation formulas are:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
wherein s(n) represents the speed of the nth frame, as(n) represents the acceleration of the nth frame, and f(n) represents the data position label of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain attitude characteristic data.
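As a rough illustration of the preprocessing above, the sketch below computes the per-frame speed s(n), acceleration as(n) and their logarithmic transform for the three axes, assuming one attitude sample f(n) per audio frame; the sign-preserving log1p transform is an assumption, since the application only states that a logarithmic transformation is applied.

import numpy as np

def attitude_features(positions):
    # positions: array of shape (n_frames, 3) holding the per-frame X/Y/Z position labels f(n)
    speed = np.vstack([np.zeros((1, 3)), np.diff(positions, axis=0)])   # s(n) = f(n) - f(n-1)
    accel = np.vstack([np.zeros((1, 3)), np.diff(speed, axis=0)])       # as(n) = s(n) - s(n-1)
    # Logarithmic transformation of speed and acceleration (sign preserved, assumed form)
    log_speed = np.sign(speed) * np.log1p(np.abs(speed))
    log_accel = np.sign(accel) * np.log1p(np.abs(accel))
    # Splice speed and acceleration into one attitude feature vector per frame
    return np.hstack([log_speed, log_accel])                            # shape (n_frames, 6)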
As a more specific solution, the preprocessed attitude feature data and voice feature data are feature-spliced through the following steps:
aligning the collected sound feature data and attitude feature data point by point according to their real-time corresponding positions;
marking the start position and end position of the sound feature data on the attitude feature data from the attitude sensor;
mixing random noise data with the marked sound feature data at a random SNR according to the signal-to-noise ratio requirement, while ensuring that the mixed data remain in one-to-one correspondence with the start and end positions of the sound feature data;
aligning the mixed data with the marked attitude feature data, thereby obtaining feature-spliced training data;
and performing feature splicing on all attitude feature data and sound feature data to obtain the feature-spliced training data set.
It should be noted that: the attitude feature data and the sound feature data are aligned and marked under the premise of strict real-time correspondence; only when this step is handled correctly can a good training effect be obtained.
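A minimal sketch of this alignment and mixing step is given below; the SNR range, the waveform-domain mixing and the frame-level labelling are assumptions used only for illustration.

import numpy as np

def mix_at_random_snr(clean, noise, snr_range=(0, 20)):
    # Mix a marked clean recording with random noise at a randomly drawn SNR (in dB)
    snr_db = np.random.uniform(*snr_range)
    noise = noise[:len(clean)]
    clean_pow = np.mean(clean ** 2) + 1e-10
    noise_pow = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + scale * noise

def splice_features(sound_feats, attitude_feats, labels):
    # sound_feats:    (n_frames, d_sound)  features of the noise-mixed audio
    # attitude_feats: (n_frames, d_att)    preprocessed attitude features
    # labels:         (n_frames,)          1 inside the marked speech start/end span, 0 otherwise
    assert len(sound_feats) == len(attitude_feats) == len(labels)
    mixed = np.hstack([sound_feats, attitude_feats])    # one mixed feature vector per frame
    return mixed, labels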
As a more specific solution, the neural network model is a recurrent neural network model, and the recurrent neural network model collects information of adjacent frames and adjusts a weight matrix for detecting voice activity of a current frame according to the information of the adjacent frames.
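The application does not fix a particular recurrent architecture; the following PyTorch sketch, using a single GRU layer whose hidden state carries adjacent-frame information into the per-frame decision, is therefore only one possible instantiation, and the layer sizes are assumptions.

import torch
import torch.nn as nn

class RecurrentVAD(nn.Module):
    def __init__(self, input_dim, hidden_dim=32):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)             # speech / non-speech logits

    def forward(self, x):
        # x: (batch, n_frames, input_dim) sequences of mixed feature vectors
        h, _ = self.rnn(x)                              # hidden state carries adjacent-frame context
        return self.out(h)                              # (batch, n_frames, 2)

# Training would use SGD back-propagation, as in the quantized training described below, e.g.:
# model = RecurrentVAD(input_dim=32)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)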
As a more specific solution, the weights of the trained neural network are quantized and compressed; through quantization compression, 32-bit floating-point weights are quantized into 2-bit fixed-point weights. The quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-valued weights;
multiplying the input X by α to form a new input, and combining the new input with the three-valued weights through additions instead of the original multiplications during forward propagation;
performing iterative training with back-propagation using the SGD algorithm.
It should be noted that: artificial neural networks have enabled computers to achieve an unprecedented level of performance in processing speech recognition tasks. However, the high complexity of the model brings high storage space and computational resource consumption, so that the model is difficult to implement on each hardware platform.
To address these issues, models are compressed to minimize the consumption of computational space and time by the model. Currently, the mainstream network, such as VGG16, has a parameter amount of 1 hundred and 3 million or more, occupies more than 500 MB of space, and needs more than 300 hundred million floating point operations to complete one recognition task.
An artificial neural network contains a large number of redundant nodes, and only a small portion (5-10%) of the weights participates in the main computation; in other words, training only a small portion of the weight parameters can achieve performance similar to that of the original network. The trained neural network model therefore needs to be compressed, and compression methods for neural network models include tensor decomposition, model pruning and model quantization.
Tensor decomposition treats the network weights as a full-rank matrix and approximates it with several low-rank matrices. The method is suitable for model compression but is not easy to implement: it involves computationally expensive decomposition operations and requires a large amount of retraining to reach convergence.
Model pruning removes the relatively unimportant weights from the weight matrix and then fine-tunes (finetune) the network. However, pruning makes the network connections irregular, so memory occupation has to be reduced through sparse representations, and forward propagation then needs a large number of conditional checks and extra space to mark zero and non-zero parameter positions. The method is therefore unsuitable for parallel computing, and unstructured sparsity requires special software libraries or hardware.
Therefore, quantization compression is performed along the model quantization direction. Generally, the weights of a neural network model are all represented by 32-bit floating-point numbers. In many cases such high precision is unnecessary, and the weights can be expressed, for example, with 8 bits through quantization: the space required for each weight is reduced by sacrificing precision. SGD training only requires a precision of 6-8 bits, so reasonable quantization can reduce the storage size of the model while the accuracy is still guaranteed. Depending on the quantization method, binary quantization, ternary quantization or multi-valued quantization can be used. This embodiment selects ternary quantization, which, compared with binary quantization, adds a 0 value to the values 1 and -1 without increasing the amount of computation.
Iterative training is carried out with back-propagation using the SGD algorithm, and the computed gradients are used to adjust the weights of the neural network. The SGD algorithm is a form of gradient descent; as it adjusts the weights, the neural network produces outputs closer to the desired ones, and the overall error of the network should decrease with training.
As a more specific solution, the original weight matrix W is approximately expressed as the product of a three-valued weight matrix W^t and a scaling coefficient α, where the three-valued weights W^t are:
W^t_i = +1 if W_i > Δ; W^t_i = 0 if |W_i| ≤ Δ; W^t_i = -1 if W_i < -Δ;
wherein the threshold Δ is generated from the original weight matrix W as:
Δ = (0.7/n)·Σ|W_i|, summed over i = 1, …, n;
wherein i denotes the index of a weight term and n the total number of weight terms;
the scaling factor α is:
α = (1/|I_Δ|)·Σ|W_i|, summed over i ∈ I_Δ;
wherein I_Δ = { i | 1 ≤ i ≤ n, |W_i| > Δ } and |I_Δ| denotes the number of elements in I_Δ.
As a more specific solution, the windowing operation is performed with a Hamming window function, where the Hamming window function is:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
wherein n denotes the sample index within the intercepted frame, N - 1 denotes the truncation length of the Hamming window, and a0 denotes the Hamming window constant with a value of 25/46;
the emphasis coefficient of the pre-emphasis is 0.97, and the Mel filter function of the Mel filter is:
mel(f) = 2595·log10(1 + f/700);
wherein f denotes the original frequency to be filtered.
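For reference, the two helper functions above can be written directly from these formulas; the window length N is left as a parameter.

import numpy as np

def hamming_window(N, a0=25 / 46):
    # w(n) = a0 - (1 - a0) * cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
    n = np.arange(N)
    return a0 - (1 - a0) * np.cos(2 * np.pi * n / (N - 1))

def hz_to_mel(f):
    # mel(f) = 2595 * log10(1 + f / 700)
    return 2595 * np.log10(1 + f / 700)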
As a more specific solution, voice activity detection is performed by the trained neural network model. The neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the posterior probability of speech/non-speech is computed from the model's output by a softmax function. The posterior probability lies between 0 and 1: if it exceeds the decision threshold the frame is considered speech, otherwise it is considered non-speech.
It should be noted that the neural network model obtained by training on the mixed feature data adapts well to voice activity detection under various postures. The softmax function is mainly used to normalize the model's output: it "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) whose elements all lie in the range (0, 1) and sum to 1, so that speech and non-speech can be classified accurately.
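The decision step can be illustrated by the short sketch below, which applies a softmax to the per-frame network outputs and compares the speech posterior with a decision threshold; the threshold value 0.5 is an illustrative assumption, as the application does not specify it.

import numpy as np

def vad_decisions(logits, threshold=0.5):
    # logits: (n_frames, 2) raw network outputs for [non-speech, speech]
    z = logits - logits.max(axis=1, keepdims=True)      # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    speech_posterior = probs[:, 1]                      # posterior probability of speech, in (0, 1)
    return speech_posterior > threshold                 # True marks a speech frame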
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A voice activity detection method based on an attitude sensor is applied to an audio acquisition device with the attitude sensor, and is characterized in that a neural network quantization training is carried out by constructing mixed characteristic data which takes attitude characteristic data and sound characteristic data into consideration, and an optimal solution of a neural network model is obtained, wherein the neural network model is used for voice activity detection, and the mixed characteristic data is constructed through the following steps:
acquiring the attitude change of the audio acquisition device through an attitude sensor and recording the attitude change as attitude characteristic data;
collecting external sound changes through an audio collection device and using the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature splicing on the preprocessed attitude feature data and the preprocessed sound feature data to obtain mixed feature data;
and taking the mixed characteristic data as training data for subsequent quantized neural network training.
2. An attitude sensor based voice activity detection method according to claim 1, wherein the voice feature data is MFCC feature data, and MFCC voice feature data extraction and voice feature data preprocessing operations are performed by:
pre-emphasis is carried out on the sound characteristic data through a high-pass filter;
performing framing operation on the pre-emphasis data through a framing function;
carrying out windowing operation by substituting each sub-frame into a window function;
performing fast Fourier transform on each windowed sub-frame signal to obtain an energy spectrum of each sub-frame;
performing a discrete cosine transform on the energy spectrum to obtain an MFCC coefficient;
extracting a first order difference parameter from the Mel frequency spectrogram;
and splicing the MFCC coefficient and the first-order difference parameter to obtain MFCC characteristic data.
3. The method of claim 1, wherein the preprocessing operation on the gesture feature data is an operation of converting time domain gesture feature data into frequency domain gesture feature data, the gesture feature data is gesture feature data comprising an X axis, a Y axis and a Z axis, and the preprocessing operation on the gesture feature data is performed by:
performing framing operation on the attitude characteristic data, wherein each frame of the attitude characteristic data corresponds to each frame of the sound characteristic data one by one;
calculating the displacement of each frame according to the attitude characteristic data, wherein the calculation formula is as follows:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
wherein, s (n) represents the speed of the nth frame, as (n) represents the acceleration of the nth frame, and f (n) represents the data position label of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain attitude characteristic data.
4. The method of claim 1, wherein the preprocessed gesture feature data and the voice feature data are feature-spliced by:
aligning the collected sound characteristic data and attitude characteristic data point by point according to their real-time corresponding positions;
marking the start position and end position of the sound characteristic data on the attitude characteristic data from the attitude sensor;
mixing random noise data with the marked sound characteristic data at a random SNR according to the signal-to-noise ratio requirement, while ensuring that the mixed data remain in one-to-one correspondence with the start and end positions of the sound characteristic data;
aligning the mixed data with the marked attitude characteristic data, thereby obtaining feature-spliced training data;
and performing feature splicing on all attitude characteristic data and sound characteristic data to obtain the feature-spliced training data set.
5. The method as claimed in claim 1, wherein the neural network model is a recurrent neural network model, and the recurrent neural network model collects information of adjacent frames and adjusts a weight matrix for detecting voice activity of a current frame according to the information of the adjacent frames.
6. The method of claim 1, wherein the trained neural network is quantized and compressed, and a 32-bit floating point weight is quantized into a 2-bit fixed point weight by quantization and compression; the quantization compression steps are as follows:
calculating threshold value delta and scaling factor alpha from original matrix
Converting the original weight into a three-valued weight;
multiplying the input X by alpha to serve as a new input, and then carrying out addition calculation on the new input and the three-valued weight to replace the original multiplication calculation for forward propagation;
iterative training is performed using the SGD algorithm backpropagation.
7. The attitude sensor based voice activity detection method according to claim 6, characterized in that the original weight matrix W is approximately expressed as the product of a three-valued weight matrix W^t and a scaling coefficient α, where the three-valued weights W^t are:
W^t_i = +1 if W_i > Δ; W^t_i = 0 if |W_i| ≤ Δ; W^t_i = -1 if W_i < -Δ;
wherein the threshold Δ is generated from the original weight matrix W as:
Δ = (0.7/n)·Σ|W_i|, summed over i = 1, …, n;
wherein i denotes the index of a weight term and n the total number of weight terms;
the scaling factor α is:
α = (1/|I_Δ|)·Σ|W_i|, summed over i ∈ I_Δ;
wherein I_Δ = { i | 1 ≤ i ≤ n, |W_i| > Δ } and |I_Δ| denotes the number of elements in I_Δ.
8. An attitude sensor based voice activity detection method according to claim 2, characterized in that the windowing is performed by a hamming window function, which is:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
wherein n represents the sample index within the intercepted signal, a0 represents the Hamming window constant with a value of 25/46, and N - 1 represents the truncation length of the Hamming window;
the emphasis coefficient of the pre-emphasis is 0.97, and the Mel filter function of the Mel filter is:
mel(f) = 2595·log10(1 + f/700);
wherein f represents the original frequency to be filtered.
9. An attitude sensor-based voice activity detection method according to claim 2, characterized in that voice activity detection is performed by a trained neural network model; the neural network model is a deep neural network model, the deep neural network model carries out frame-by-frame feature data processing on an audio signal needing voice activity detection, and the posterior probability of voice/non-voice is calculated according to the calculation result of the deep neural network model through a softmax function; the posterior probability value is between 0 and 1, if the posterior probability value exceeds the judgment threshold value, the voice is considered to be voice, and if the posterior probability value does not exceed the judgment threshold value, the non-voice is considered to be non-voice.
CN202110646290.7A 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor Active CN113327589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110646290.7A CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110646290.7A CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Publications (2)

Publication Number Publication Date
CN113327589A true CN113327589A (en) 2021-08-31
CN113327589B CN113327589B (en) 2023-04-25

Family

ID=77420338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110646290.7A Active CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Country Status (1)

Country Link
CN (1) CN113327589B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818773A (en) * 2022-03-12 2022-07-29 西北工业大学 Low-rank matrix sparsity compensation method for improving reverberation suppression robustness

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708857A (en) * 2011-03-02 2012-10-03 微软公司 Motion-based voice activity detection
CN106531186A (en) * 2016-10-28 2017-03-22 中国科学院计算技术研究所 Footstep detecting method according to acceleration and audio information
CN109872728A (en) * 2019-02-27 2019-06-11 南京邮电大学 Voice and posture bimodal emotion recognition method based on kernel canonical correlation analysis
US10692485B1 (en) * 2016-12-23 2020-06-23 Amazon Technologies, Inc. Non-speech input to speech processing system
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708857A (en) * 2011-03-02 2012-10-03 微软公司 Motion-based voice activity detection
CN106531186A (en) * 2016-10-28 2017-03-22 中国科学院计算技术研究所 Footstep detecting method according to acceleration and audio information
US10692485B1 (en) * 2016-12-23 2020-06-23 Amazon Technologies, Inc. Non-speech input to speech processing system
CN109872728A (en) * 2019-02-27 2019-06-11 南京邮电大学 Voice and posture bimodal emotion recognition method based on kernel canonical correlation analysis
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KALIN STEFANOV et al.: "Spatial Bias in Vision-Based Voice Activity Detection", 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818773A (en) * 2022-03-12 2022-07-29 西北工业大学 Low-rank matrix sparsity compensation method for improving reverberation suppression robustness
CN114818773B (en) * 2022-03-12 2024-04-16 西北工业大学 Low-rank matrix sparsity compensation method for improving reverberation suppression robustness

Also Published As

Publication number Publication date
CN113327589B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
CN111833896B (en) Voice enhancement method, system, device and storage medium for fusing feedback signals
CN106486131B (en) A kind of method and device of speech de-noising
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN113889138B (en) Target voice extraction method based on double microphone arrays
Wang et al. Deep learning assisted time-frequency processing for speech enhancement on drones
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN111798875A (en) VAD implementation method based on three-value quantization compression
CN111027675B (en) Automatic adjusting method and system for multimedia playing setting
CN113327589B (en) Voice activity detection method based on attitude sensor
Girirajan et al. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
CN112397090B (en) Real-time sound classification method and system based on FPGA
CN110197657B (en) Dynamic sound feature extraction method based on cosine similarity
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation
CN114566179A (en) Time delay controllable voice noise reduction method
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
CN114464188A (en) Voiceprint awakening algorithm based on distributed edge calculation
Skariah et al. Review of speech enhancement methods using generative adversarial networks
CN112992157A (en) Neural network noisy line identification method based on residual error and batch normalization
Pan et al. Application of hidden Markov models in speech command recognition
Srinivasarao An efficient recurrent Rats function network (Rrfn) based speech enhancement through noise reduction
Chen et al. Analysis of Embedded AI Speech Recognition Technology Based on MFCC

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant