CN107578775A - Multitask speech classification method based on a deep neural network - Google Patents

Multitask speech classification method based on a deep neural network

Info

Publication number
CN107578775A
Authority
CN
China
Prior art keywords
model
classification
speech
network
multitask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710801016.6A
Other languages
Chinese (zh)
Other versions
CN107578775B (en)
Inventor
毛华
彭德中
章毅
曾煜妮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201710801016.6A
Publication of CN107578775A
Application granted
Publication of CN107578775B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a multitask speech classification method based on deep learning, relating to the field of speech processing technology, and comprising the following steps: S1: perform a time-frequency analysis on the speech data to obtain the corresponding spectrogram. S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features. S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model. S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model. S5: use the trained model to predict unlabeled speech data, obtain the classification probabilities, and select the class with the higher probability as the classification result. The present invention solves the problem that existing audio classification methods treat each task in isolation and ignore the correlations between semantically related tasks, which leads to low classification performance.

Description

Multitask speech classification method based on a deep neural network
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a multitask speech classification method based on a deep neural network.
Background art
Sound provides us with a wealth of information about sound sources and the surrounding environment. The human auditory system can separate and recognize complex sounds; a machine that could perform similar functions (audio classification and recognition) would be highly useful, for example for speech recognition in noise. Audio classification is a key area of pattern recognition and has been successfully applied in many fields, such as education and entertainment. In recent years, different categories of audio classification, such as accent recognition, speaker recognition, and speech emotion recognition, have seen many successful applications.
However, most audio classification methods treat each task in isolation and ignore the correlations between tasks. For example, accent recognition and speaker recognition are usually regarded as two separate single-task classification problems. In fact, for the same speech data the two tasks are correlated: the accent of an utterance constrains who the speaker can be, and vice versa. We therefore want to use this relationship to improve the classification performance of both tasks at the same time.
Deep learning has driven the recent boom in artificial intelligence. Owing to the powerful abstraction ability of deep neural networks over data, deep learning methods have been successfully applied in many fields, including speech signal processing. In our work, convolutional neural networks are used to learn speech features, improving the accuracy on multiple classification tasks.
The spectrogram is a detailed and accurate speech representation that contains both time and frequency information. A spectrogram generally has three dimensions: time, frequency, and amplitude represented by color.
Summary of the invention
The object of the present invention is to solve the problem that existing audio classification methods treat each task in isolation and ignore the correlations between speech tasks, which leads to low classification performance.
The technical solution of the present invention is as follows:
A multitask speech classification method based on deep learning comprises the following steps:
S1: perform a time-frequency analysis on the speech data to obtain the corresponding spectrogram.
S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features.
S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model.
S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model.
S5: use the trained model to predict unlabeled speech data, obtain the classification probabilities, and select the class with the higher probability as the classification result.
Further, in S2, the basic operations of the convolutional neural network include the convolution operation and the pooling operation. The convolution operation can be expressed by the following formula:

a_{ij}^l = f\left( \sum_{n=1}^{N} \sum_{m=1}^{M} k_{nm}^l \, a_{(i+n)(j+m)}^{l-1} + b^l \right)  (1)

where M and N define the size of the convolution kernel; i and j denote row and column indices defining the position of a pixel; f is the convolution kernel function applied to the convolution result; l ∈ (1, L) denotes the layer index of the convolutional neural network; a_{ij}^l denotes the feature at row i, column j of layer l; k_{nm}^l denotes the entry at row n, column m of the layer-l convolution kernel; and b^l is the corresponding bias.

The meaning of formula (1) is: a new feature map is obtained from the products of the convolution kernel with different parts of the input feature map, under the kernel function. The formula ensures that feature extraction is independent of position, i.e., the statistical properties of one part of the input feature map are the same as those of the other parts.

The pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f\left( \beta^l \, \mathrm{down}(a^{l-1}) + b^l \right)  (2)

where a^{l-1} is the input of the layer, down(·) denotes the down-sampling operation, and β^l is the corresponding parameter. The meaning of formula (2) is that a pooling operation is applied to the input feature map, i.e., the features at different positions of the image are aggregated, thereby reducing the number of parameters in the network.
In S2, the basic residual block of the residual network can be expressed by the following formula:

y = F(x, W) + x  (3)

where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of a residual block, and y is the output of the basic residual block.

The meaning of formula (3) is: an input x passes through a two-layer feed-forward convolutional network to obtain an output F(x, W), which then passes through a shortcut connection to obtain the output y.
The formula of the basic framework model used in S2 is expressed as:

y = F_1(x, W_1) * F_2(x, W_2) + x  (4)

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.

The meaning of formula (4) is: an input x is processed by the two convolutional networks separately to obtain outputs F_1(x, W_1) and F_2(x, W_2); the two outputs are multiplied element-wise and then pass through a shortcut connection to obtain the output y.
Specifically, S4 comprises the following steps:
S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model.
S41: perform a time-frequency analysis on each speech sample to extract its spectrogram, and quantize the multiple labels of the speech sample corresponding to the multiple tasks.
S42: on the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification tasks to obtain the trained multi-task classification model.
S43: use the trained multi-task classification model for multi-task classification of speech data, output for every utterance the probability of each class of each task, and select the class with the higher probability as the classification result.
With the above scheme, the beneficial effects of the present invention are:
(1) Feature extraction from speech data is a key preprocessing operation. The neural network extracts features from the speech spectrogram; in the concrete implementation, the spectrogram is converted into a 200-dimensional shared feature.
(2) In the classification process, the neural network is expected to learn the essential features of the speech so that every classification task is predicted correctly; we therefore propose our own neural network structure to obtain a better speech representation. Specifically, compared with models that perform the same multi-class classification, such as SVM and classical neural network structures, our model performs better; and for single-task classification models, the accuracy of each of the two tasks performed separately on the same architecture is below that of the multi-task classification model.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification and the auxiliary task is the classification of sentences versus songs.
Model              Accuracy rate
SVM                48.01%
Single-task model  56.33%
Multi-task model   62.39%
Table 1 compares the accuracy of the single-task model and the multi-task model on the main task. SVM is a classical machine learning classification method; the single-task model is our proposed model applied to a single classification task, achieving an emotion classification accuracy of 56.33%; the multi-task model, while solving both tasks simultaneously, increases the emotion recognition accuracy by 6.06%.
Network structure             Emotion recognition accuracy  Speech/song classification accuracy
Convolutional neural network  53.73%                        92.24%
Residual network              57.21%                        94.62%
Gated residual network        62.39%                        93.13%
Table 2 compares the accuracy of multi-task models based on different neural network structures on speech emotion recognition over sentences and songs. The gated residual network is the model proposed in this patent.
The above experimental results show that:
1) compared with models that perform the same multi-class classification, such as SVM and classical neural network structures, our model performs better;
2) for single-task classification models, the accuracy of each of the two tasks performed separately on the same architecture is below that of the multi-task classification model.
(3) Compared with non-neural-network models, feature extraction by the deep neural network method initializes the multi-task classification model well, increases the robustness of the model, and improves the recognition performance of each task. This is because the audio signal itself may be affected by noise, and neural network methods generalize well to noise. In addition, a single-task model such as audio emotion classification is very sensitive to new speakers, whereas the multi-task classification, which also learns speaker features, is affected less.
Brief description of the drawings
Fig. 1 is the multi-task model diagram of the present invention;
Fig. 2 is the spectrogram of a speech segment containing the emotion 'angry';
Fig. 3 is the spectrogram of a speech segment containing the emotion 'happy';
Fig. 4 is the basic structure diagram of the residual network of the present invention;
Fig. 5 is the basic structure diagram of the neural network of the present invention.
Embodiment
The technical solutions in this embodiment are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the core of the multi-task speech classification based on a deep neural network is a multi-task classification model, which is used to classify two categories of tasks.
The multitask speech classification method based on deep learning comprises the following steps:
S1: perform a time-frequency analysis on the speech data to obtain the corresponding spectrogram.
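By way of illustration, the time-frequency analysis of S1 can be sketched as a short-time Fourier transform whose log magnitude gives the spectrogram. The following minimal sketch is our own illustration, not code from the patent; the file name, window length, and hop length are assumed values:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

# Load one speech sample (the file name is a placeholder).
fs, x = wavfile.read("speech_sample.wav")
if x.ndim > 1:
    x = x.mean(axis=1)          # mix stereo down to mono
x = x.astype(np.float32)

# Short-time Fourier transform: 25 ms windows with a 10 ms hop (assumed values).
nperseg = int(0.025 * fs)
noverlap = nperseg - int(0.010 * fs)
f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)

# Log-magnitude spectrogram: frequency along one axis, time along the other,
# amplitude as the value (rendered as color when plotted, cf. Figs. 2 and 3).
spectrogram = 20 * np.log10(np.abs(Z) + 1e-10)
print(spectrogram.shape)        # (frequency bins, time frames)
```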
S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features. In this step, features shared by multiple tasks are extracted by building a network structure for two classification tasks. The multi-task setting of the present invention targets two categories of task pairs: first, simultaneously recognizing the emotion contained in the speech and whether the speech is a song or a sentence; second, simultaneously recognizing the speaker and the speaker's accent.
The basic operations of the convolutional neural network include the convolution operation and the pooling operation. The convolution operation can be expressed by the following formula:

a_{ij}^l = f\left( \sum_{n=1}^{N} \sum_{m=1}^{M} k_{nm}^l \, a_{(i+n)(j+m)}^{l-1} + b^l \right)  (1)

where M and N define the size of the convolution kernel; i and j denote row and column indices defining the position of a pixel; f is the convolution kernel function applied to the convolution result; l ∈ (1, L) denotes the layer index of the convolutional neural network; a_{ij}^l denotes the feature at row i, column j of layer l; k_{nm}^l denotes the entry at row n, column m of the layer-l convolution kernel; and b^l is the corresponding bias.

The meaning of formula (1) is: a new feature map is obtained from the products of the convolution kernel with different parts of the input feature map, under the kernel function. The formula ensures that feature extraction is independent of position, i.e., the statistical properties of one part of the input feature map are the same as those of the other parts. The pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f\left( \beta^l \, \mathrm{down}(a^{l-1}) + b^l \right)  (2)

where down(·) denotes the down-sampling operation and β^l is the corresponding parameter.

The meaning of formula (2) is that a pooling operation is applied to the input feature map, i.e., the features at different positions of the image are aggregated, thereby reducing the number of parameters in the network.
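For concreteness, the two operations can be sketched in PyTorch as follows; this is our illustration, with assumed channel counts, kernel size, and ReLU as the nonlinearity f. The convolution corresponds to formula (1) and the scaled average pooling to formula (2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 64, 64)   # one single-channel spectrogram patch

# Formula (1): convolution with an M x N kernel k plus bias b, followed by the
# nonlinearity f (a ReLU here, an assumed choice).
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
a = F.relu(conv(x))

# Formula (2): down-sampling down(.) with a scale beta and bias b.
beta = nn.Parameter(torch.ones(1))
b = nn.Parameter(torch.zeros(1))
pooled = F.relu(beta * F.avg_pool2d(a, kernel_size=2) + b)
print(a.shape, pooled.shape)    # feature map before and after pooling
```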
As shown in Fig. 4, the basic residual block of the residual network in S2 can be expressed by the following formula:

y = F(x, W) + x  (3)

where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of a residual block, and y denotes the output of the basic residual block.

The meaning of formula (3) is: an input x passes through a two-layer feed-forward convolutional network to obtain an output F(x, W), which then passes through a shortcut connection to obtain the output y.
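A minimal PyTorch module realizing formula (3) might look as follows (our sketch; the channel count and kernel size are assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x, W) + x, with F a two-layer convolutional network (formula (3))."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(F.relu(self.conv1(x)))   # F(x, W)
        return out + x                            # shortcut connection

block = BasicResidualBlock()
print(block(torch.randn(1, 8, 32, 32)).shape)
```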
As shown in Fig. 5, the formula of the basic framework model of the deep neural network used in S2 is expressed as:

y = F_1(x, W_1) * F_2(x, W_2) + x  (4)

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.

The meaning of formula (4) is: an input x is processed by the two convolutional networks separately to obtain outputs F_1(x, W_1) and F_2(x, W_2); the two outputs are multiplied element-wise and then pass through a shortcut connection to obtain the output y.
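Formula (4), the building block of the gated residual network of Table 2, can be sketched the same way (again our illustration with assumed sizes; the element-wise product of the two branches and the shortcut follow the formula literally):

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """y = F1(x, W1) * F2(x, W2) + x, the gated residual block of formula (4)."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.f2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of the two branches, then the shortcut.
        return self.f1(x) * self.f2(x) + x

block = GatedResidualBlock()
print(block(torch.randn(1, 8, 32, 32)).shape)
```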
S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model.
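As a sketch of S3, a shared trunk can produce the 200-dimensional feature mentioned above and feed one softmax classifier per task. The trunk layout and the class counts below are hypothetical placeholders of ours, not values from the patent:

```python
import torch
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    """Shared trunk -> 200-dim feature -> one classifier head per task."""
    def __init__(self, num_emotions: int = 4, num_styles: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((5, 5)), nn.Flatten(),
            nn.Linear(8 * 5 * 5, 200), nn.ReLU(),     # 200-dim shared feature
        )
        self.head_emotion = nn.Linear(200, num_emotions)  # task 1: emotion
        self.head_style = nn.Linear(200, num_styles)      # task 2: song vs. sentence

    def forward(self, spec: torch.Tensor):
        h = self.trunk(spec)                    # features shared by all tasks
        return self.head_emotion(h), self.head_style(h)   # per-task logits

model = MultiTaskClassifier()
logit_emotion, logit_style = model(torch.randn(2, 1, 64, 64))
print(logit_emotion.softmax(-1).shape, logit_style.softmax(-1).shape)
```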
S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set. S4 comprises the following steps:
S41: perform a time-frequency analysis on each speech sample to extract its spectrogram, and quantize the multiple labels of the speech sample corresponding to the multiple tasks;
S42: on the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification tasks to obtain the trained multi-task classification model;
S43: use the trained multi-task classification model for multi-task classification of speech data, output for every utterance the probability of each class of each task, and select the class with the higher probability as the classification result.
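One plausible reading of the training step is to minimize the sum of the per-task cross-entropy losses over the quantized samples, as in the following sketch (it reuses the illustrative MultiTaskClassifier above; the toy data, learning rate, and epoch count are assumptions of ours):

```python
import torch
import torch.nn as nn

# Assumed toy data set: quantized spectrograms with one label per task
# (shapes, class counts, and values are placeholders, not from the patent).
specs = torch.randn(16, 1, 64, 64)
y_emotion = torch.randint(0, 4, (16,))    # task-1 labels: emotion class
y_style = torch.randint(0, 2, (16,))      # task-2 labels: song vs. sentence

model = MultiTaskClassifier()             # the illustrative model sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):                   # assumed number of epochs
    optimizer.zero_grad()
    logit_emotion, logit_style = model(specs)
    # Joint objective: the sum of the two tasks' cross-entropy losses.
    loss = criterion(logit_emotion, y_emotion) + criterion(logit_style, y_style)
    loss.backward()
    optimizer.step()
```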
S5: use the trained model to predict unlabeled speech data to obtain the classification probabilities, and select the class with the higher probability as the classification result.
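Prediction in S5 then amounts to taking, for each task, the class with the highest softmax probability (again building on the illustrative model above):

```python
import torch

model.eval()                              # the model trained above
with torch.no_grad():
    logit_emotion, logit_style = model(torch.randn(1, 1, 64, 64))  # unlabeled sample
    p_emotion = logit_emotion.softmax(-1)       # probability of each emotion class
    p_style = logit_style.softmax(-1)           # probability of song vs. sentence
    # Select the class with the higher probability for each task.
    print(p_emotion.argmax(-1).item(), p_style.argmax(-1).item())
```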
Figs. 2 and 3 show the spectrograms of speech containing the two emotions 'angry' and 'happy'; it can be seen that in the range of 10 kHz to 15 kHz the amplitude difference between the spectrograms is obvious.
Figs. 4 and 5 illustrate the neural network method proposed by the present invention, which specifically includes:
(1) The basic structure of the two models in Figs. 4 and 5 is a convolutional neural network, which involves two operations. The first is the convolution operation of the convolutional neural network, which can be expressed by the following formula:

a_{ij}^l = f\left( \sum_{n=1}^{N} \sum_{m=1}^{M} k_{nm}^l \, a_{(i+n)(j+m)}^{l-1} + b^l \right)

where M and N define the size of the convolution kernel; i and j denote row and column indices defining the position of a pixel; f is the convolution kernel function; l ∈ (1, L) denotes the layer index of the convolutional neural network; a_{ij}^l denotes the feature at row i, column j of layer l; k defines the parameters of the convolution kernel; and b is the corresponding bias.
The other operation is the pooling operation of the convolutional neural network, which can be expressed by the following formula:

a^l = f\left( \beta^l \, \mathrm{down}(a^{l-1}) + b^l \right)

where down(·) denotes the down-sampling operation and β is the corresponding parameter.
(2) Fig. 4 shows the basic residual block of the residual network, which can also be expressed by the following formula:

y = F(x, W) + x

where F is the convolutional-layer function, x is the input of a residual block, and W is the parameter.
(3) Fig. 5 shows the basic framework of the neural network we use, which can also be expressed by the following formula:

y = F_1(x, W_1) * F_2(x, W_2) + x

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.
Existing audio classification mostly targets a single label per sample; that is, the trained model performs only a single classification task. Speech emotion classification, for example, as a single task can only decide which emotion an audio clip belongs to. However, since different speakers understand emotions differently, the expression of the same emotion varies across speakers. Multi-task classification instead solves several different tasks at the same time: while completing the speech emotion classification task, this work also completes the speaker classification problem. That is, given one trained model and one input utterance, two results are obtained: who spoke the utterance and which emotion it contains. In other words, the model learns emotion features and speaker features simultaneously during training.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification and the auxiliary task is the classification of sentences versus songs.

Model              Accuracy rate
SVM                48.01%
Single-task model  56.33%
Multi-task model   62.39%

Table 1 compares the accuracy of the single-task model and the multi-task model on the main task. SVM is a classical machine learning classification method; the single-task model is our proposed model applied to a single classification task, achieving an emotion classification accuracy of 56.33%; the multi-task model, while solving both tasks simultaneously, increases the emotion recognition accuracy by 6.06%.
Network structure             Emotion recognition accuracy  Speech/song classification accuracy
Convolutional neural network  53.73%                        92.24%
Residual network              57.21%                        94.62%
Gated residual network        62.39%                        93.13%
Table 2 compares the accuracy of multi-task models based on different neural network structures on speech emotion recognition over sentences and songs. The gated residual network is the model proposed in this patent.
The above experimental results show that:
(1) compared with models that perform the same multi-class classification, such as SVM and classical neural network structures, our model performs better;
(2) for single-task classification models, the accuracy of each of the two tasks performed separately on the same architecture is below that of the multi-task classification model.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from the spirit or essential characteristics of the present invention. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalency of the claims are therefore intended to be embraced in the present invention. No reference sign in the claims should be construed as limiting the claim involved.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of narration is only for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (3)

  1. A multitask speech classification method based on deep learning, characterized in that it comprises the following steps:
    S1: perform a time-frequency analysis on the speech data to obtain the corresponding spectrogram;
    S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features;
    S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model;
    S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model;
    S5: use the trained model to predict unlabeled speech data to obtain the classification probabilities, and select the class with the higher probability as the classification result.
  2. The multitask speech classification method based on deep learning according to claim 1, characterized in that in S2 the basic operations of the convolutional neural network include the convolution operation and the pooling operation, and the convolution operation can be expressed by the following formula:
    a_{ij}^l = f\left( \sum_{n=1}^{N} \sum_{m=1}^{M} k_{nm}^l \, a_{(i+n)(j+m)}^{l-1} + b^l \right)  (1)
    where M and N define the size of the convolution kernel; i and j denote row and column indices defining the position of a pixel; f is the convolution kernel function; l ∈ (1, L) denotes the layer index of the convolutional neural network; a_{ij}^l denotes the feature at row i, column j of layer l; k_{nm}^l denotes the entry at row n, column m of the layer-l convolution kernel; and b^l is the bias function of layer l;
    the pooling operation of the convolutional neural network can be expressed by the following formula:
    a^l = f\left( \beta^l \, \mathrm{down}(a^{l-1}) + b^l \right)  (2)
    where a^{l-1} is the input of the layer, f is the pooling-layer function, down(·) denotes the down-sampling operation, and β^l is the corresponding parameter;
    the basic residual block of the residual network in S2 can be expressed by the following formula:
    y = F(x, W) + x  (3)
    where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of a residual block, and y denotes the output of the basic residual block;
    the formula of the basic framework model used in S2 is expressed as:
    y = F_1(x, W_1) * F_2(x, W_2) + x  (4)
    where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, W_1 and W_2 are the parameters of the two convolutional layers, and y denotes the output.
  3. The multitask speech classification method based on deep learning according to claim 1, characterized in that S4 comprises the following steps:
    S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model;
    S41: perform a time-frequency analysis on each speech sample to extract its spectrogram, and quantize the multiple labels of the speech sample corresponding to the multiple tasks;
    S42: on the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification task to obtain the trained multi-task classification model;
    S43: use the trained multi-task classification model for multi-task classification of speech data, output for every utterance the probability of each class of each task, and select the class with the higher probability as the classification result.
CN201710801016.6A 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network Expired - Fee Related CN107578775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710801016.6A CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710801016.6A CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Publications (2)

Publication Number Publication Date
CN107578775A true CN107578775A (en) 2018-01-12
CN107578775B CN107578775B (en) 2021-02-12

Family

ID=61031600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710801016.6A Expired - Fee Related CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Country Status (1)

Country Link
CN (1) CN107578775B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1300831A1 (en) * 2001-10-05 2003-04-09 Sony International (Europe) GmbH Method for detecting emotions involving subspace specialists
US20160027452A1 * 2014-07-28 2016-01-28 Sony Computer Entertainment Inc. Emotional speech processing
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754357A (en) * 2018-01-26 2019-05-14 京东方科技集团股份有限公司 Image processing method, processing unit and processing equipment
CN109754357B (en) * 2018-01-26 2021-09-21 京东方科技集团股份有限公司 Image processing method, processing device and processing equipment
CN110503968A (en) * 2018-05-18 2019-11-26 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN110503968B (en) * 2018-05-18 2024-06-04 北京搜狗科技发展有限公司 Audio processing method, device, equipment and readable storage medium
CN109243424A (en) * 2018-08-28 2019-01-18 合肥星空物联信息科技有限公司 One key voiced translation terminal of one kind and interpretation method
CN109490822A (en) * 2018-10-16 2019-03-19 南京信息工程大学 Voice DOA estimation method based on ResNet
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN109523994A (en) * 2018-11-13 2019-03-26 四川大学 A kind of multitask method of speech classification based on capsule neural network
CN109493881A (en) * 2018-11-22 2019-03-19 北京奇虎科技有限公司 A kind of labeling processing method of audio, device and calculate equipment
CN109493881B (en) * 2018-11-22 2023-12-05 北京奇虎科技有限公司 Method and device for labeling audio and computing equipment
CN111354372A (en) * 2018-12-21 2020-06-30 中国科学院声学研究所 Audio scene classification method and system based on front-end and back-end joint training
CN109684995A (en) * 2018-12-22 2019-04-26 中国人民解放军战略支援部队信息工程大学 Specific Emitter Identification method and device based on depth residual error network
CN109754822A (en) * 2019-01-22 2019-05-14 平安科技(深圳)有限公司 The method and apparatus for establishing Alzheimer's disease detection model
CN109919047A (en) * 2019-02-18 2019-06-21 山东科技大学 A kind of mood detection method based on multitask, the residual error neural network of multi-tag
CN110189769A (en) * 2019-05-23 2019-08-30 复钧智能科技(苏州)有限公司 Abnormal sound detection method based on multiple convolutional neural networks models couplings
CN110532424A (en) * 2019-09-26 2019-12-03 西南科技大学 A kind of lungs sound tagsort system and method based on deep learning and cloud platform
CN110992987B (en) * 2019-10-23 2022-05-06 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN110808069A (en) * 2019-11-11 2020-02-18 上海瑞美锦鑫健康管理有限公司 Evaluation system and method for singing songs
CN111128131A (en) * 2019-12-17 2020-05-08 北京声智科技有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN111128131B (en) * 2019-12-17 2022-07-01 北京声智科技有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN111429947A (en) * 2020-03-26 2020-07-17 重庆邮电大学 Speech emotion recognition method based on multi-stage residual convolutional neural network
CN111429947B (en) * 2020-03-26 2022-06-10 重庆邮电大学 Speech emotion recognition method based on multi-stage residual convolutional neural network
CN111460157A (en) * 2020-04-01 2020-07-28 哈尔滨理工大学 Cyclic convolution multitask learning method for multi-field text classification
CN111460157B (en) * 2020-04-01 2023-03-28 哈尔滨理工大学 Cyclic convolution multitask learning method for multi-field text classification
CN111933179A (en) * 2020-06-04 2020-11-13 华南师范大学 Environmental sound identification method and device based on hybrid multi-task learning
CN111833856A (en) * 2020-07-15 2020-10-27 厦门熙重电子科技有限公司 Voice key information calibration method based on deep learning
CN111833856B (en) * 2020-07-15 2023-10-24 厦门熙重电子科技有限公司 Voice key information calibration method based on deep learning
CN111599382A (en) * 2020-07-27 2020-08-28 深圳市声扬科技有限公司 Voice analysis method, device, computer equipment and storage medium
CN111599382B (en) * 2020-07-27 2020-10-27 深圳市声扬科技有限公司 Voice analysis method, device, computer equipment and storage medium
CN112331187A (en) * 2020-11-24 2021-02-05 苏州思必驰信息科技有限公司 Multi-task speech recognition model training method and multi-task speech recognition method
CN113823271A (en) * 2020-12-18 2021-12-21 京东科技控股股份有限公司 Training method and device of voice classification model, computer equipment and storage medium
CN112506667A (en) * 2020-12-22 2021-03-16 北京航空航天大学杭州创新研究院 Deep neural network training method based on multi-task optimization
CN112992119A (en) * 2021-01-14 2021-06-18 安徽大学 Deep neural network-based accent classification method and model thereof
CN112992119B (en) * 2021-01-14 2024-05-03 安徽大学 Accent classification method based on deep neural network and model thereof
CN112992157A (en) * 2021-02-08 2021-06-18 贵州师范大学 Neural network noisy line identification method based on residual error and batch normalization
CN114882884A (en) * 2022-07-06 2022-08-09 深圳比特微电子科技有限公司 Multitask implementation method and device based on deep learning model

Also Published As

Publication number Publication date
CN107578775B (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN107578775A (en) A kind of multitask method of speech classification based on deep neural network
Sun et al. Speech emotion recognition based on DNN-decision tree SVM model
Wang et al. Speech emotion recognition with dual-sequence LSTM architecture
Espi et al. Exploiting spectro-temporal locality in deep learning based acoustic event detection
CN108563653B (en) Method and system for constructing knowledge acquisition model in knowledge graph
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN106503805B (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis method
Daneshfar et al. Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN109460737A (en) A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
Vrysis et al. 1D/2D deep CNNs vs. temporal feature integration for general audio classification
CN111753549A (en) Multi-mode emotion feature learning and recognition method based on attention mechanism
CN104978587B (en) A kind of Entity recognition cooperative learning algorithm based on Doctype
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN111126218A (en) Human behavior recognition method based on zero sample learning
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN106570106A (en) Method and device for converting voice information into expression in input process
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
US11735190B2 (en) Attentive adversarial domain-invariant training
CN108170848B (en) Chinese mobile intelligent customer service-oriented conversation scene classification method
Muthusamy et al. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals
CN108986798B (en) Processing method, device and the equipment of voice data
CN107491729A (en) The Handwritten Digit Recognition method of convolutional neural networks based on cosine similarity activation
Vrysis et al. Extending temporal feature integration for semantic audio analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210212

Termination date: 20210907