CN114822510B - Voice awakening method and system based on binary convolutional neural network


Info

Publication number: CN114822510B
Application number: CN202210737439.7A
Authority: CN (China)
Filing/priority date: 2022-06-28
Publication date: 2022-10-04
Other versions: CN114822510A (application publication, in Chinese)
Inventors: 王啸, 李郡, 付冠宇, 尚德龙, 周玉梅
Applicant and current assignee: Zhongke Nanjing Intelligent Technology Research Institute
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L2015/223: Execution procedure of a spoken command


Abstract

The invention relates to a voice wake-up method and system based on a binary convolutional neural network, in the field of voice recognition. The method comprises the following steps: performing MFCC feature extraction on each voice sample of a voice data set to obtain the continuous MFCC feature frames corresponding to each voice sample; training a teacher network that takes the continuous MFCC feature frames as input and the labels corresponding to the voice samples as output, to obtain a trained teacher network; guiding the training of a student network with the trained teacher network based on a knowledge distillation method, and taking the trained student network as the classifier of a voice wake-up system, the student network being a binary convolutional neural network; and performing MFCC feature extraction on a voice signal to be recognized, inputting the extracted continuous MFCC feature frames into the classifier, and inputting the classifier's output into the voice wake-up system. The invention reduces the amount of computation and the power consumption of voice recognition.

Description

Voice awakening method and system based on binary convolutional neural network
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice awakening method and system based on a binary convolutional neural network.
Background
A voice wake-up system usually runs on a mobile device, and mobile devices have small memories and limited computing power, so the voice wake-up system should simultaneously achieve high accuracy, a small runtime memory footprint and a small amount of computation. However, high-performance deep neural network models are highly complex, computation-heavy and memory-hungry, which makes them difficult to deploy on mobile terminals with smaller memories. The network therefore needs to be compressed into a lightweight model that is more convenient to deploy on mobile terminal devices.
Disclosure of Invention
The invention aims to provide a voice wake-up method and system based on a binary convolutional neural network that reduce the amount of computation and the power consumption of voice recognition.
In order to achieve the purpose, the invention provides the following scheme:
a voice awakening method based on a binary convolutional neural network comprises the following steps:
performing MFCC feature extraction on each voice sample of a voice data set to obtain the continuous MFCC feature frames corresponding to each voice sample, where the label of each voice sample is either a keyword or a non-keyword;
training a teacher network that takes the continuous MFCC feature frames as input and the labels corresponding to the voice samples as output, to obtain a trained teacher network;
based on a knowledge distillation method, using the trained teacher network to guide the training of a student network, and taking the trained student network as the voice wake-up system classifier, where the student network is a binary convolutional neural network;
and performing MFCC feature extraction on a voice signal to be recognized, inputting the extracted continuous MFCC feature frames into the voice wake-up system classifier, and inputting the output of the voice wake-up system classifier into the voice wake-up system.
Optionally, the loss function adopted in training the student network is a KD loss function, expressed as:

L_KD(W_student) = a·T^2·CrossEntropy(Q_s^T, Q_t^T) + (1 - a)·CrossEntropy(Q_s, y_true);

where L_KD(W_student) denotes the KD loss, CrossEntropy(·) denotes the cross-entropy loss function, Q_s^T denotes the probability output of the student network at temperature T, Q_t^T denotes the probability output of the teacher network at temperature T, T and a are set parameters, Q_s is the probability output of the student network, and y_true is the label obtained from the voice data set.
Optionally, the teacher network is Resnet152.
Optionally, the binary convolutional neural network includes a convolutional layer, a batch normalization layer, a ReLU activation function, 3 blocks, a maximum pooling layer, and a fully-connected layer, which are connected in sequence, where each Block includes a binarized convolutional layer, a batch normalization layer, and a ReLU activation function, which are connected in sequence.
Optionally, the voice data set is the Google voice command set.
The invention also discloses a voice awakening system based on the binary convolution neural network, which comprises the following components:
the MFCC feature extraction module is used for performing MFCC feature extraction on each voice sample of the voice data set to obtain the continuous MFCC feature frames corresponding to each voice sample; the label of each voice sample is either a keyword or a non-keyword;
the teacher network training module is used for training the teacher network by taking the continuous MFCC characteristic frames as the input of the teacher network and taking the labels corresponding to the voice samples as the output to obtain the trained teacher network;
the student network training module is used for adopting a trained teacher network to conduct guide training on the student network based on a knowledge distillation method, and taking the trained student network as a voice awakening system classifier; the student network is a binary convolution neural network;
and the to-be-recognized voice signal classification module is used for performing MFCC feature extraction on the to-be-recognized voice signal, inputting the extracted continuous MFCC feature frames into the voice awakening system classifier, and inputting the output of the voice awakening system classifier into the voice awakening system.
Optionally, the loss function adopted in training the student network is a KD loss function, expressed as:

L_KD(W_student) = a·T^2·CrossEntropy(Q_s^T, Q_t^T) + (1 - a)·CrossEntropy(Q_s, y_true);

where L_KD(W_student) denotes the KD loss, CrossEntropy(·) denotes the cross-entropy loss function, Q_s^T denotes the probability output of the student network at temperature T, Q_t^T denotes the probability output of the teacher network at temperature T, T and a are set parameters, Q_s is the probability output of the student network, and y_true is the label obtained from the voice data set.
Optionally, the teacher network is Resnet152.
Optionally, the binary convolutional neural network includes a convolutional layer, a batch normalization layer, a ReLU activation function, 3 blocks, a maximum pooling layer, and a fully-connected layer, which are connected in sequence, where each Block includes a binary convolutional layer, a batch normalization layer, and a ReLU activation function, which are connected in sequence.
Optionally, the voice data set is the Google voice command set.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
The invention discloses a voice wake-up method and system based on a binary convolutional neural network, which reduce the amount of computation and the power consumption of voice recognition and make the model easier to deploy on mobile terminal devices.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flow chart of a voice wake-up method based on a binary convolutional neural network according to the present invention;
fig. 2 is a schematic structural diagram of a voice wake-up system based on a binary convolutional neural network according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The invention aims to provide a voice wake-up method and system based on a binary convolutional neural network that reduce the amount of computation and the power consumption of voice recognition.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a voice wake-up method based on a binary convolutional neural network according to the present invention, and as shown in fig. 1, a voice wake-up method based on a binary convolutional neural network includes:
step 101: performing MFCC feature extraction on each voice sample of the voice data set to obtain a continuous MFCC feature frame corresponding to each voice sample; the labels of each speech sample include keywords and non-keywords.
The voice data set is the Google voice command set (GSCD).
In step 101, the continuous MFCC feature frames, i.e. the Mel cepstral coefficient feature matrices, are used as the input for pre-training the teacher network.
The keyword labels include "on", "off" and "zero"; the non-keyword label is the "silent state".
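For illustration only (this sketch is not part of the patent text), step 101's MFCC extraction could look as follows in Python with the librosa library; the concrete frame parameters (40 coefficients, 25 ms window, 10 ms hop) are assumptions, since the patent does not fix them:

```python
import librosa
import numpy as np

def extract_mfcc_frames(wav_path, sr=16000, n_mfcc=40):
    """Return the continuous MFCC feature frames (Mel cepstral coefficient
    feature matrix) of one voice sample, shaped (num_frames, n_mfcc)."""
    signal, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms analysis window (assumed)
        hop_length=int(0.010 * sr))  # 10 ms frame shift (assumed)
    return mfcc.T.astype(np.float32)  # time-major: one row per frame
```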
Step 102: and taking the continuous MFCC characteristic frames as the input of the teacher network, and taking the labels corresponding to the voice samples as the output to train the teacher network, thereby obtaining the trained teacher network.
The teacher network is a residual network, specifically Resnet152.
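As a hedged sketch of step 102 (not code given by the patent), the teacher could be built from the torchvision Resnet152 and pre-trained with ordinary cross entropy; the single-channel input stem and the four-class head ("silent state", "on", "off", "zero") are assumptions drawn from the embodiment:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152

def build_teacher(num_classes=4):
    """Resnet152 adapted to MFCC feature matrices: one input channel
    instead of RGB, and a head for the 4 keyword/non-keyword classes."""
    net = resnet152(weights=None)
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

teacher = build_teacher()
criterion = nn.CrossEntropyLoss()           # standard supervised pre-training
optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-3)
```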
Step 103: based on a knowledge distillation method, using the trained teacher network to guide the training of a student network, and taking the trained student network as the voice wake-up system classifier; the student network is a Binary Convolutional Neural Network (BCNN).
Network training usually minimizes the error between the output and the label through a loss function; classification tasks such as voice wake-up generally adopt the cross-entropy loss, i.e. Loss = CrossEntropy(output, target), where output is the output of the neural network and target is the label obtained from the data set. The final network output is a probability output. In this embodiment, the non-keyword "silent state" and the three keywords "on", "off" and "zero" are selected, which is equivalent to a four-way classification task, so the final output consists of four corresponding probability outputs. In the traditional training process, only the maximum probability output and the label enter the loss-function calculation, which wastes the information contained in the remaining probability outputs. The student network therefore computes the loss function not only between its own probability output and the label, but also between its probability output and the teacher network's probability output. This reduces the information loss during student-network training; information loss is precisely the weakness of a binarized network. Specifically, this is realized by modifying the loss function used in the student network's training process.
In the invention, the loss function adopted in training the student network is the KD loss function (KD loss), expressed as:

L_KD(W_student) = a·T^2·CrossEntropy(Q_s^T, Q_t^T) + (1 - a)·CrossEntropy(Q_s, y_true);

where L_KD(W_student) denotes the KD loss, CrossEntropy(·) denotes the cross-entropy loss function, Q_s^T denotes the probability output of the student network at temperature T, Q_t^T denotes the probability output of the teacher network at temperature T, T and a are set parameters, Q_s is the probability output of the student network, and y_true is the label obtained from the voice data set. y_true takes the value 0 or 1: the label of the correct class is 1 and the other labels are 0; for example, the labels of the four classes may be 0, 0, 1, 0.
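A minimal PyTorch rendering of this KD loss is sketched below. It assumes, as in standard knowledge distillation, that both networks produce raw logits and that the temperature-T probability outputs are temperature-softened softmax distributions; the values of T and a here are illustrative, not prescribed by the patent:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, y_true, T=4.0, a=0.9):
    """L_KD = a*T^2 * CrossEntropy(Q_s^T, Q_t^T) + (1-a) * CrossEntropy(Q_s, y_true)."""
    # Soft term: cross entropy between the temperature-softened student
    # and teacher probability outputs.
    log_q_s_T = F.log_softmax(student_logits / T, dim=1)
    q_t_T = F.softmax(teacher_logits / T, dim=1)
    soft_term = -(q_t_T * log_q_s_T).sum(dim=1).mean()
    # Hard term: ordinary cross entropy against the dataset label
    # (y_true as class indices; the one-hot form 0,0,1,0 is equivalent).
    hard_term = F.cross_entropy(student_logits, y_true)
    return a * T * T * soft_term + (1.0 - a) * hard_term
```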
The binary convolutional neural network comprises a convolutional layer, a batch normalization layer, a ReLU activation function, 3 Blocks, a maximum pooling layer and a fully-connected layer which are connected in sequence, where each Block comprises a binarized convolutional layer, a batch normalization layer and a ReLU activation function connected in sequence. Before the convolution operation, the binarized convolutional layer quantizes the input activation values and weight values to 1 and -1, which reduces the number of parameters and converts complex floating-point convolution operations into simple shift operations.
The binary quantization is specifically:

a_b = Sign(a_r) = +1 if a_r >= 0, -1 otherwise;
w_b = Sign(w_r) = +1 if w_r >= 0, -1 otherwise;

where a_r denotes a full-precision input activation value and w_r denotes a full-precision input weight value; a_b denotes the binarized activation value and w_b denotes the binarized weight value.
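Below is a hedged sketch of the binarized convolution and the network skeleton described above. The forward sign quantization follows the formula; the straight-through estimator in the backward pass and the channel widths are common binary-network choices assumed here, not specified by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Forward: quantize to +1/-1 by the Sign rule above.
    Backward: straight-through estimator, gradient passed where |x| <= 1."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    """Binarizes the input activations (a_b) and weights (w_b) before convolving."""
    def forward(self, x):
        a_b = BinarizeSTE.apply(x)
        w_b = BinarizeSTE.apply(self.weight)
        return F.conv2d(a_b, w_b, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def block(c_in, c_out):
    # Block = binarized convolutional layer -> batch normalization -> ReLU
    return nn.Sequential(BinaryConv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU())

class BCNN(nn.Module):
    """Convolution -> BN -> ReLU -> 3 Blocks -> max pooling -> fully connected."""
    def __init__(self, num_classes=4, c=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            block(c, c), block(c, c), block(c, c),
            nn.AdaptiveMaxPool2d(1))  # max pooling over the feature map (assumed global)
        self.fc = nn.Linear(c, num_classes)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```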
Step 104: performing MFCC feature extraction on the voice signal to be recognized, inputting the extracted continuous MFCC feature frames into the voice wake-up system classifier, and inputting the output of the voice wake-up system classifier into the voice wake-up system.
Step 104 specifically includes: acquiring the audio file to be recognized to obtain the voice signal to be recognized; performing MFCC feature extraction on the voice signal to obtain a Mel cepstral coefficient feature matrix; inputting the feature matrix into the voice wake-up system classifier, which outputs the probabilities of the keywords and the non-keyword; taking the output with the maximum probability as the final output; and inputting the final output into the voice wake-up system.
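Putting step 104 together as a usage sketch, reusing the illustrative helpers extract_mfcc_frames and BCNN from the sketches above; the label ordering is an assumption:

```python
import torch

LABELS = ["silent state", "on", "off", "zero"]  # assumed ordering

def wake_word_decision(wav_path, classifier):
    """Extract MFCC frames, classify, and return the maximum-probability
    output, which is what gets passed on to the voice wake-up system."""
    feats = extract_mfcc_frames(wav_path)            # (frames, n_mfcc)
    x = torch.from_numpy(feats)[None, None]          # (1, 1, frames, n_mfcc)
    with torch.no_grad():
        probs = torch.softmax(classifier(x), dim=1)[0]
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[idx])

# Example: label, confidence = wake_word_decision("clip.wav", BCNN().eval())
```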
Compared with a traditional binarized neural network, the voice wake-up method based on the binary convolutional neural network improves recognition accuracy and greatly improves the feasibility of applying a binarized network to a voice wake-up system.
The method uses knowledge-distillation training: a pre-trained teacher network guides the training of the student network, and the loss function used in student-network training is optimized. Compared with the traditional cross-entropy loss function, the proposed KD loss increases the knowledge (amount of information) acquired during student-network training, which alleviates the large information loss of a binarized network and improves the recognition accuracy of the network.
Compared with a traditional neural-network voice wake-up system, the binary convolutional neural network trades a certain amount of accuracy for a smaller data-storage footprint, greatly reduces the computation and power consumption of the voice wake-up system, and lowers the difficulty of hardware implementation. This advantage comes from the binarization of the network's inputs and weights, which greatly reduces the amount of stored data and the amount of computation, further reduces power consumption, and yields a lightweight voice wake-up implementation convenient for mobile terminals.
Fig. 2 is a schematic structural diagram of a voice wake-up system based on a binary convolutional neural network according to the present invention, and as shown in fig. 2, a voice wake-up system based on a binary convolutional neural network includes:
the MFCC feature extraction module 201 is configured to perform MFCC feature extraction on each voice sample of the voice data set to obtain a continuous MFCC feature frame corresponding to each voice sample; the labels of each speech sample include keywords and non-keywords.
And the teacher network training module 202 is used for training the teacher network by taking the continuous MFCC characteristic frames as input of the teacher network and taking the labels corresponding to the voice samples as output, so as to obtain the trained teacher network.
The student network training module 203 is configured to guide the training of a student network with the trained teacher network based on a knowledge distillation method, and to take the trained student network as the voice wake-up system classifier; the student network is a binary convolutional neural network.
And the to-be-recognized voice signal classification module 204 is configured to perform MFCC feature extraction on the to-be-recognized voice signal, input the extracted continuous MFCC feature frames into the voice wake-up system classifier, and input the output of the voice wake-up system classifier into the voice wake-up system.
The loss function adopted in training the student network is a KD loss function, expressed as:

L_KD(W_student) = a·T^2·CrossEntropy(Q_s^T, Q_t^T) + (1 - a)·CrossEntropy(Q_s, y_true);

where L_KD(W_student) denotes the KD loss, CrossEntropy(·) denotes the cross-entropy loss function, Q_s^T denotes the probability output of the student network at temperature T, Q_t^T denotes the probability output of the teacher network at temperature T, T and a are set parameters, Q_s is the probability output of the student network, and y_true is the label obtained from the voice data set.
The teacher network is Resnet152.
The binary convolutional neural network comprises a convolutional layer, a batch normalization layer, a ReLU activation function, 3 blocks, a maximum pooling layer and a full connection layer which are sequentially connected, wherein each Block comprises a binary convolutional layer, a batch normalization layer and a ReLU activation function which are sequentially connected.
The voice data set is the Google voice command set.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (6)

1. A voice wake-up method based on a binary convolutional neural network is characterized by comprising the following steps:
performing MFCC feature extraction on each voice sample of a voice data set to obtain continuous MFCC feature frames corresponding to each voice sample; the label of each voice sample is either a keyword or a non-keyword;
training the teacher network by taking the continuous MFCC characteristic frames as input of the teacher network and taking the labels corresponding to the voice samples as output, and obtaining the trained teacher network;
based on a knowledge distillation method, using the trained teacher network to guide the training of a student network, and taking the trained student network as a voice wake-up system classifier; the student network is a binary convolutional neural network; the binary convolutional neural network comprises a convolutional layer, a batch normalization layer, a ReLU activation function, 3 Blocks, a maximum pooling layer and a fully-connected layer which are connected in sequence, wherein each Block comprises a binarized convolutional layer, a batch normalization layer and a ReLU activation function which are connected in sequence; the binarized convolutional layer is used for quantizing the input activation values and weight values to 1 and -1 before the convolution operation is carried out, and for converting floating-point convolution operations into shift operations;
performing MFCC feature extraction on a voice signal to be recognized, inputting an extracted continuous MFCC feature frame into the voice awakening system classifier, and inputting the output of the voice awakening system classifier into a voice awakening system;
the loss function adopted during the student network training is a KD loss function, and the KD loss function is expressed as:
L_KD(W_student) = a·T^2·CrossEntropy(Q_s^T, Q_t^T) + (1 - a)·CrossEntropy(Q_s, y_true);

wherein L_KD(W_student) denotes the KD loss function, CrossEntropy(·) denotes the cross-entropy loss function, Q_s^T denotes the probability output of the student network at temperature T, Q_t^T denotes the probability output of the teacher network at temperature T, T is a first set parameter, a is a second set parameter, Q_s is the probability output of the student network, and y_true is the label obtained from the voice data set.
2. The binary convolutional neural network-based voice wakeup method according to claim 1, wherein the teacher network is Resnet152.
3. The binary convolutional neural network-based voice wakeup method according to claim 1, wherein the voice data set is a Google voice command set.
4. A voice wake-up system based on a binary convolutional neural network, comprising:
the MFCC feature extraction module is used for performing MFCC feature extraction on each voice sample of a voice data set to obtain continuous MFCC feature frames corresponding to each voice sample; the label of each voice sample is either a keyword or a non-keyword;
the teacher network training module is used for training the teacher network by taking the continuous MFCC characteristic frames as the input of the teacher network and taking the labels corresponding to the voice samples as the output to obtain the trained teacher network;
the student network training module is used for guiding the training of a student network with a trained teacher network based on a knowledge distillation method, and for taking the trained student network as a voice wake-up system classifier; the student network is a binary convolutional neural network; the binary convolutional neural network comprises a convolutional layer, a batch normalization layer, a ReLU activation function, 3 Blocks, a maximum pooling layer and a fully-connected layer which are connected in sequence, wherein each Block comprises a binarized convolutional layer, a batch normalization layer and a ReLU activation function which are connected in sequence; the binarized convolutional layer is used for quantizing the input activation values and weight values to 1 and -1 before the convolution operation is carried out, and for converting floating-point convolution operations into shift operations;
the voice signal classification module to be recognized is used for performing MFCC feature extraction on a voice signal to be recognized, inputting extracted continuous MFCC feature frames into the voice awakening system classifier, and inputting the output of the voice awakening system classifier into a voice awakening system;
the loss function adopted during the student network training is a KD loss function, and the KD loss function is expressed as:
L_KD(W_student) = a·T^2·CrossEntropy(Q_s^T, Q_t^T) + (1 - a)·CrossEntropy(Q_s, y_true);

wherein L_KD(W_student) denotes the KD loss function, CrossEntropy(·) denotes the cross-entropy loss function, Q_s^T denotes the probability output of the student network at temperature T, Q_t^T denotes the probability output of the teacher network at temperature T, T and a are set parameters, Q_s is the probability output of the student network, and y_true is the label obtained from the voice data set.
5. The binary convolutional neural network-based voice wake-up system of claim 4, wherein the teacher network is Resnet152.
6. The binary convolutional neural network-based voice wake-up system of claim 4, wherein the voice data set is a Google Voice Command set.
Application CN202210737439.7A (filed 2022-06-28, priority 2022-06-28): Voice awakening method and system based on binary convolutional neural network. Active; granted as CN114822510B (en).

Priority Applications (1)

CN202210737439.7A (priority date and filing date 2022-06-28): Voice awakening method and system based on binary convolutional neural network; granted as CN114822510B (en).

Publications (2)

CN114822510A (en), published 2022-07-29
CN114822510B (en), published 2022-10-04

Family

ID: 82522967

Family Applications (1): CN202210737439.7A (Active): Voice awakening method and system based on binary convolutional neural network; granted as CN114822510B.

Country Status (1): CN - CN114822510B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
US 11410029 B2 (priority 2018-01-02, published 2022-08-09), International Business Machines Corporation: Soft label generation for knowledge distillation *

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091819A (en) * 2018-10-08 2020-05-01 蔚来汽车有限公司 Voice recognition device and method, voice interaction system and method
CN110265002A (en) * 2019-06-04 2019-09-20 北京清微智能科技有限公司 Audio recognition method, device, computer equipment and computer readable storage medium
CN111583940A (en) * 2020-04-20 2020-08-25 东南大学 Very low power consumption keyword awakening neural network circuit
WO2022016556A1 (en) * 2020-07-24 2022-01-27 华为技术有限公司 Neural network distillation method and apparatus
CN112233675A (en) * 2020-10-22 2021-01-15 中科院微电子研究所南京智能技术研究院 Voice awakening method and system based on separation convolutional neural network
CN112365885A (en) * 2021-01-18 2021-02-12 深圳市友杰智新科技有限公司 Training method and device of wake-up model and computer equipment
CN113191489A (en) * 2021-04-30 2021-07-30 华为技术有限公司 Training method of binary neural network model, image processing method and device
CN113409773A (en) * 2021-08-18 2021-09-17 中科南京智能技术研究院 Binaryzation neural network voice awakening method and system
CN113782009A (en) * 2021-11-10 2021-12-10 中科南京智能技术研究院 Voice awakening system based on Savitzky-Golay filter smoothing method
CN114358206A (en) * 2022-01-12 2022-04-15 合肥工业大学 Binary neural network model training method and system, and image processing method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
知识蒸馏(Knowledge Distillation)简述(一) [A brief introduction to knowledge distillation, part 1]; Ivan Yan; 《百度》 (Baidu); 2019-11-25; full web page *

Also Published As

CN114822510A, published 2022-07-29


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant