CN109523994A - Multi-task speech classification method based on a capsule neural network - Google Patents

Multi-task speech classification method based on a capsule neural network

Info

Publication number
CN109523994A
CN109523994A (application number CN201811346110.8A)
Authority
CN
China
Prior art keywords
capsule
neural network
multitask
voice
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811346110.8A
Other languages
Chinese (zh)
Inventor
陈盈科
毛华
吴雨
何涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201811346110.8A priority Critical patent/CN109523994A/en
Publication of CN109523994A publication Critical patent/CN109523994A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-task speech classification method based on a capsule neural network, relating to the technical fields of speech signal analysis and artificial intelligence, and addressing the multi-task classification problem in speech recognition. The method mainly comprises: extracting feature representations of speech, including low-level speech features drawn from multiple views such as the frequency domain and the time domain; using a convolutional neural network and a capsule neural network to further abstract and learn deep speech features on the basis of the preprocessed low-level features; and designing multiple classifiers according to the multi-task requirements on top of the high-level features, fusing the loss functions of the classifiers, and training a unified multi-task speech classification model, so that classification accuracy is improved on multiple tasks simultaneously.

Description

Multi-task speech classification method based on a capsule neural network
Technical field
The multi-task speech classification method based on a capsule neural network relates to the technical fields of speech signal analysis and processing and artificial intelligence, and addresses the multi-task speech recognition problem.
Background technique
Sound is one of the most convenient means of daily human communication and carries rich information. Speech, as an important form of big data, is an indispensable part of it and has enormous research prospects in the current era of artificial intelligence. Human-computer interaction emphasizes a comfortable and natural user experience, and speech, as the most natural mode of interaction, is of self-evident importance. Intelligent speech products such as music recommendation, simultaneous speech translation, and voice communication have greatly facilitated everyday life. Research on intelligent speech technology now touches many areas, including speech recognition, speech classification, and semantic analysis, among which speech classification is the foundation for studying speech data. Speech classification of different categories, such as accent recognition, speaker identification, and speech emotion recognition, already has many successful applications. A computer's ability to classify and recognize speech is an important component of computer speech processing and a key precondition for natural human-computer interaction interfaces, and it therefore has great research and application value.
Speech classification tasks are usually treated independently, but in practice a single utterance conveys much information, such as gender, spoken content, and emotion, and the different tasks are clearly interrelated. For example, accent recognition and speaker identification are generally regarded as two separate classification tasks, yet for the same speech data, once the speaker is confirmed, the accent is largely determined as well. This work therefore considers the practical setting and aims to extract richer information from speech audio, classifying multiple semantically different tasks under a unified model.
Current artificial intelligence technology spans several broad directions: traditional deep neural networks, generative adversarial networks, reinforcement learning, and capsule networks. This work focuses on studying capsule networks for the multi-task speech classification problem, so that recognition performance is ultimately improved across multiple tasks.
Summary of the invention
The present invention provides a multi-task speech classification method based on a capsule neural network, which analyzes the correlation between multiple speech tasks, solves the multi-task speech classification problem, realizes abstract learning of speech features, and obtains more accurate speech classification results across multiple tasks.
To achieve the above objects, the technical scheme adopted by the invention is as follows:
A multi-task speech classification method based on a capsule neural network, characterized in that a deep convolutional neural network and a capsule neural network are used to learn more abstract, higher-level speech features, comprising the following steps:
(1) preprocess the original speech signal and extract low-level speech feature representations using speech feature extraction algorithms;
(2) extract mid-level feature representations of the speech signal using a deep convolutional neural network;
(3) further extract higher-level, more abstract feature representations of the speech using a capsule neural network;
(4) design multiple different classifiers and loss functions to realize end-to-end training of the whole multi-task speech classification model.
Further, step (1) comprises the following sub-steps:
(11) the raw feature representation of speech is a one-dimensional, high-dimensional feature; in the speech preprocessing model, different traditional feature extraction algorithms are applied to the original audio to extract time-domain and frequency-domain features, and the various features are finally fused into one representation and fed into the deep neural network model;
(12) the time-domain speech feature extraction algorithm uses linear predictive coding coefficients (LPCC), while the frequency-domain algorithm uses Mel-frequency cepstral coefficients (MFCC), a homomorphic signal processing method that analyzes the speech signal via the Fourier transform; by fusing low-level speech features of these different characteristics, the final input of the deep neural network model is formed.
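As an illustration of this preprocessing stage, the following is a minimal sketch of MFCC and LPC extraction followed by feature fusion, assuming the librosa library is available; the coefficient orders, frame settings, and the simple concatenation-based fusion are illustrative assumptions rather than values specified in the patent.

```python
# Minimal sketch of step (1): extract frequency-domain (MFCC) and time-domain
# (LPC-based) features and fuse them into one input matrix.
import numpy as np
import librosa

def extract_fused_features(wav_path, sr=16000, n_mfcc=13, lpc_order=12,
                           frame_len=512, hop=160):
    y, sr = librosa.load(wav_path, sr=sr)

    # Frequency-domain view: Mel-frequency cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop)    # (n_mfcc, T1)

    # Time-domain view: linear prediction coefficients per frame
    # (silent frames may need guarding in practice).
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    lpc = np.stack([librosa.lpc(frames[:, i], order=lpc_order)[1:]
                    for i in range(frames.shape[1])], axis=1)       # (lpc_order, T2)

    # Fuse both views along the feature axis, truncated to a common length.
    n = min(mfcc.shape[1], lpc.shape[1])
    return np.concatenate([mfcc[:, :n], lpc[:, :n]], axis=0)        # (n_mfcc + lpc_order, n)
```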
Further, step (2) comprises the following sub-steps:
(21) in step (2), higher-level features of the input are extracted using the convolution operation of the deep convolutional neural network, which can be expressed by the following formula:
$$h = f(W \ast x)$$
where $x$ denotes the input of the convolutional layer, $W$ denotes the learnable weights of the convolution kernel, $\ast$ is the convolution operation, and $f$ acts as a nonlinear mapping function;
(22) in step (2), higher-level features of the input are also extracted using the pooling operation of the deep convolutional neural network, which can be expressed by the following formula:
$$h = g(x)$$
where $x$ denotes the input of the pooling layer; since the pooling layer has no learnable parameters, there are no weights here, and the common pooling functions $g$ take the maximum, the minimum, or the average value.
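A minimal PyTorch sketch of this convolution-and-pooling front end is shown below; the channel counts, kernel size, and the choice of ReLU and max pooling are illustrative assumptions.

```python
# Minimal sketch of step (2): convolution h = f(W * x) followed by pooling g(.)
# over the fused feature map. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    def __init__(self, in_channels=25, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, hidden, kernel_size=9, padding=4)  # W * x
        self.act = nn.ReLU()                                                  # nonlinear map f
        self.pool = nn.MaxPool1d(kernel_size=2)                               # pooling g = max

    def forward(self, x):              # x: (batch, in_channels, n_frames)
        return self.pool(self.act(self.conv(x)))
```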
Further, step (3) comprises the following sub-steps:
(31) the capsule neural network differs from conventional deep neural networks in that its minimal computational unit is a group of neurons, and the capsule network contains weights with two different roles, used respectively for making predictions and for weighting those predictions;
(32) first, in the prediction stage of the capsule network, the computation is similar to a traditional feed-forward layer: the prediction result is obtained by matrix multiplication between the input capsule and the prediction weights, calculated by the following formula:
$$\hat{u}_{j|i} = W_{ij}\, u_i$$
where $u_i$ is a low-level capsule and $\hat{u}_{j|i}$ denotes the prediction result; note that both $u_i$ and $\hat{u}_{j|i}$ are representations of a group of neurons;
(33) unlike a traditional convolutional neural network, when learning the high-level feature representations predicted by the lower layer, the capsule neural network additionally learns the weight of each low-level part for the same prediction, calculated by the following formula:
$$s_j = \sum_i c_{ij}\, \hat{u}_{j|i}$$
where $\hat{u}_{j|i}$ denotes the prediction of low-level capsule $i$ for high-level capsule $j$, $c_{ij}$ denotes the prediction weight, and the final high-level capsule obtains its net input as the weighted sum over all predictions; it is worth noting that, unlike parameter updates in traditional neural networks, which use gradient descent, $c_{ij}$ here is updated by the dynamic routing algorithm;
(34) finally, the summed prediction must pass through a nonlinear mapping; since the minimal computational unit in a capsule neural network is a group of neurons, the activation function is modified accordingly, mainly expressed as follows:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}$$
where the activated prediction $v_j$ carries meaning in two respects: its direction describes the attributes of the class, and its magnitude represents the probability that the class exists.
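The following is a minimal sketch of the capsule computation described in sub-steps (32) to (34): prediction by matrix multiplication, routing-weighted summation, and the squash nonlinearity, with the coupling weights updated by dynamic routing rather than gradient descent. Capsule dimensions and the number of routing iterations are illustrative assumptions.

```python
# Minimal sketch of step (3): u_hat_{j|i} = W_ij u_i, s_j = sum_i c_ij u_hat_{j|i},
# and v_j = squash(s_j); c_ij is obtained by iterative dynamic routing.
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # v = (|s|^2 / (1 + |s|^2)) * s / |s|
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def capsule_layer(u, W, n_iter=3):
    # u: (batch, n_in, d_in) low-level capsules; W: (n_in, n_out, d_out, d_in)
    u_hat = torch.einsum('iokd,bid->biok', W, u)          # predictions u_hat_{j|i}
    b = torch.zeros(u.size(0), u.size(1), W.size(1), device=u.device)
    for _ in range(n_iter):                               # dynamic routing, not SGD
        c = F.softmax(b, dim=2)                           # coupling weights c_ij
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)          # s_j = sum_i c_ij u_hat_{j|i}
        v = squash(s)                                     # output capsules v_j
        b = b + torch.einsum('biok,bok->bio', u_hat, v)   # agreement update
    return v
```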
Further, step (4) comprises the following sub-steps:
(41) determine the speech multi-task classification content and digitize the corresponding multi-task labels;
(42) define a corresponding number of classifiers according to the classification content of the different task types;
(43) design a corresponding loss function for each classifier; the specific function is designed as follows:
$$L_t = -\frac{1}{N}\sum_{n=1}^{N} y^{(n)} \log \hat{y}^{(n)}$$
where $y^{(n)}$ is the true sample label of the corresponding speech class, $\hat{y}^{(n)}$ is the probability value after the classifier's softmax, and $N$ denotes the total number of samples; by accumulating the $N$ samples in the loss function of a given task, the average loss over all samples of that task is obtained;
the above only designs the loss function for a single classification result within the multi-task setting; for the multi-task speech classification problem, the final loss function is defined as follows:
$$L = \sum_{t=1}^{T} L_t$$
where $L_t$ denotes the above single-task loss function over the whole sample set, $T$ denotes the number of tasks in practice, and the total loss function $L$ of the final multi-task speech recognition problem is expressed as the sum of all the single-task loss functions;
(44) with the network structure designed above and the steps of constructing the data set, designing the loss functions, and so on, the entire end-to-end capsule neural network is finally trained using the back-propagation algorithm.
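A minimal sketch of the multi-task classifier heads and the summed loss of sub-step (43) is given below; the task names, class counts, and shared feature size are illustrative assumptions.

```python
# Minimal sketch of step (4): one softmax classifier head per task and a total
# loss L = sum_t L_t, where each L_t is the average cross-entropy over the batch.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, feat_dim, n_classes_per_task):   # e.g. {'accent': 5, 'speaker': 100}
        super().__init__()
        self.heads = nn.ModuleDict({t: nn.Linear(feat_dim, k)
                                    for t, k in n_classes_per_task.items()})

    def forward(self, shared_feat):
        return {t: head(shared_feat) for t, head in self.heads.items()}

def total_loss(logits, labels):
    # logits/labels: dicts keyed by task name; cross_entropy averages over samples.
    ce = nn.functional.cross_entropy
    return sum(ce(logits[t], labels[t]) for t in logits)
```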
Compared with the prior art, the advantages of the present invention are as follows:
One, the preprocessing stage skillfully fuses the various raw speech features; compared with the original speech data this reduces the data dimension, and compared with a single low-level feature representation it enriches the speech input information;
Two, on the basis of the deep convolutional neural network, a state-of-the-art capsule neural network is further designed to learn higher-level feature representations of speech;
Three, the correlation between tasks is learned through the multi-task loss function, so that the network is trained better.
Brief description of the drawings
Fig. 1 is the model diagram of the multi-task speech classification based on a capsule neural network in the present invention;
Fig. 2 is the flow chart of the multi-task speech classification based on a capsule neural network in the present invention;
Fig. 3 is the topology diagram of a capsule in the present invention.
Specific embodiment
The present invention is further illustrated with reference to the accompanying drawings and examples.
Referring to Fig. 1, the core model of the multi-task speech recognition method based on a capsule neural network is a capsule neural network model. The model receives as input the combination of different low-level speech features, uses the basic structure of convolution to perform feature learning on the input, then uses the deeper structure of the capsule network to further extract features from the low-level features, and at the same time considers the multi-task learning objective and designs a new loss function, thereby effectively improving recognition accuracy across multiple tasks.
Referring to Fig. 2, the overall data flow of the multi-task speech classification method based on a capsule neural network proceeds in the following specific steps:
(1) Audio preprocessing: speech feature extraction involves several classic algorithms; the Mel coefficients in MFCC are computed from the mapping
$$\mathrm{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$
where $f$ denotes the actual speech frequency; the formula describes the relationship between Mel frequency and actual frequency used in the algorithm, and the growth of the Mel scale is consistent with the frequency sensitivity of the human ear.
The LPCC features involved mainly compute the linear prediction coefficients of the speech, whose calculation is based on the $p$-th order linear predictor
$$\hat{s}(n) = \sum_{i=1}^{p} a_i\, s(n-i)$$
where $\hat{s}(n)$ is the $p$-th order linear prediction of the signal $s(n)$ and $a_i$ are the prediction coefficients. The final model input features are obtained by mixing the above low-level speech features.
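As a quick numeric check of the Mel mapping above (reconstructed here in its common form with the constants 2595 and 700, which the patent text does not spell out), the following snippet evaluates it at a few frequencies; 1000 Hz maps to roughly 1000 mel, and the scale grows more slowly at higher frequencies, matching the description of the ear's sensitivity.

```python
# Direct implementation of the assumed Mel mapping Mel(f) = 2595 * log10(1 + f/700).
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for f in (100, 500, 1000, 4000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```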
(2) Convolution and pooling: higher-level features of the input are extracted using the convolution operation of the deep convolutional neural network, which can be expressed by the following formula:
$$h = f(W \ast x)$$
where $x$ denotes the input of the convolutional layer, $W$ the learnable weights of the convolution kernel, $\ast$ the convolution operation, and $f$ a nonlinear mapping function.
Higher-level features of the input are also extracted using the pooling operation of the deep convolutional neural network, which can be expressed by the following formula:
$$h = g(x)$$
where $x$ denotes the input of the pooling layer, and the common pooling functions $g$ take the maximum, the minimum, or the average value.
(3) Capsule neural network: the basic computational element of the capsule network is a group of neurons, with each vector representing one group of neurons, and the computation between two capsule layers proceeds in two steps: prediction and prediction summation. The intermediate prediction result is obtained by matrix multiplication between the input capsule and the prediction weights, calculated by the following formula:
$$\hat{u}_{j|i} = W_{ij}\, u_i$$
where $u_i$ is a low-level capsule and $\hat{u}_{j|i}$ denotes the prediction result.
When learning the high-level feature representations predicted by the lower layer, the capsule neural network additionally learns the weight of each low-level part for the same prediction, calculated by the following formula:
$$s_j = \sum_i c_{ij}\, \hat{u}_{j|i}$$
where $\hat{u}_{j|i}$ denotes the prediction of low-level capsule $i$ for high-level capsule $j$, $c_{ij}$ denotes the prediction weight, and the final high-level capsule obtains its net input as the weighted sum over all predictions.
Finally, the summed prediction must pass through a nonlinear mapping; since the minimal computational unit in a capsule neural network is a group of neurons, the activation function is modified accordingly, mainly expressed as follows:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}$$
where $v_j$ denotes the final capsule output.
(4) Total loss function: the content of the tasks is determined first, and multiple classifiers are designed, corresponding to the multi-task learning objectives. For the learning objective of a given single task, the corresponding loss function is designed as follows:
$$L_t = -\frac{1}{N}\sum_{n=1}^{N} y^{(n)} \log \hat{y}^{(n)}$$
where $y^{(n)}$ is the true sample label of the corresponding speech class, $\hat{y}^{(n)}$ is the probability value after the classifier's softmax, and $N$ denotes the total number of samples; by accumulating the $N$ samples in the loss function of a given task, the average loss over all samples of that task is obtained.
Since this model addresses the multi-task speech classification problem, a rule is needed to merge the individual loss functions; the total loss function of the multi-task speech classification model is therefore given as:
$$L = \sum_{t=1}^{T} L_t$$
where $L_t$ denotes the above single-task loss function over the whole sample set, $T$ denotes the number of tasks in practice, and the total loss function $L$ of the final multi-task speech recognition problem is expressed as the sum of all the single-task loss functions.
Referring to Fig. 3, which shows the computation topology between any two layers of the capsule neural network: $u_i$ denotes the feature representation learned by a low-level capsule, and a prediction of one high-level representation is made from the input through the prediction weights $W_{ij}$; the prediction results $\hat{u}_{j|i}$ and the routing weights $c_{ij}$ of the prediction stage are hidden in the diagram, and the high-level capsule representation of the next layer is finally obtained, specifically expressed as:
$$v_j = \mathrm{squash}\!\left(\sum_i c_{ij}\, W_{ij}\, u_i\right)$$
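Putting the pieces together, the following end-to-end sketch mirrors Fig. 1 and Fig. 2: fused features pass through the convolutional front end, a capsule layer with dynamic routing, and one classifier head per task, and the summed cross-entropy loss is minimized by back-propagation. It reuses the ConvFrontEnd, squash/capsule_layer, MultiTaskHeads, and total_loss sketches given earlier; all layer sizes, task names, and optimizer settings are illustrative assumptions.

```python
# End-to-end sketch: fused features -> convolution and pooling -> capsule layer
# (dynamic routing) -> one classifier per task -> summed cross-entropy loss,
# trained by back-propagation.
import torch
import torch.nn as nn

class MultiTaskCapsNet(nn.Module):
    def __init__(self, tasks, in_channels=25, caps_in=64, d_in=8, caps_out=16, d_out=16):
        super().__init__()
        self.frontend = ConvFrontEnd(in_channels, hidden=caps_in * d_in)
        self.caps_in, self.d_in = caps_in, d_in
        # Prediction weights W_ij between low-level and high-level capsules.
        self.W = nn.Parameter(0.01 * torch.randn(caps_in, caps_out, d_out, d_in))
        self.heads = MultiTaskHeads(caps_out * d_out, tasks)

    def forward(self, x):                         # x: (batch, in_channels, n_frames)
        h = self.frontend(x)                      # (batch, caps_in * d_in, T')
        h = h.mean(dim=-1)                        # pool over time
        u = h.view(-1, self.caps_in, self.d_in)   # primary (low-level) capsules
        v = capsule_layer(u, self.W)              # high-level capsules via routing
        return self.heads(v.flatten(1))           # one softmax head per task

model = MultiTaskCapsNet(tasks={'accent': 5, 'speaker': 100})
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 25, 200)                               # dummy fused features
labels = {'accent': torch.randint(0, 5, (4,)),
          'speaker': torch.randint(0, 100, (4,))}
loss = total_loss(model(x), labels)                       # L = sum_t L_t
loss.backward()
optim.step()
```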

Claims (5)

1. A multi-task speech classification method based on a capsule neural network, characterized in that a capsule neural network is used to extract higher-level, more abstract features of speech while multiple classifiers are used to complete the multi-task classification of speech, comprising the following steps:
(1) preprocess the original speech signal and extract low-level speech feature representations using speech feature extraction algorithms;
(2) extract mid-level feature representations of the speech signal using a deep convolutional neural network;
(3) further extract higher-level, more abstract feature representations of the speech using a capsule neural network;
(4) design multiple different classifiers and loss functions to realize end-to-end training of the whole multi-task speech classification model.
2. The multi-task speech classification method based on a capsule neural network according to claim 1, wherein step (1) comprises the following sub-steps:
(11) the raw feature representation of speech is a one-dimensional, high-dimensional feature; in the speech preprocessing model, different traditional feature extraction algorithms are applied to the original audio to extract time-domain and frequency-domain features, and the various features are finally fused into one representation and fed into the deep neural network model;
(12) the time-domain speech feature extraction algorithm uses linear predictive coding coefficients (LPCC), while the frequency-domain algorithm uses Mel-frequency cepstral coefficients (MFCC), a homomorphic signal processing method that analyzes the speech signal via the Fourier transform; by fusing low-level speech features of these different characteristics, the final input of the deep neural network model is formed.
3. The multi-task speech classification method based on a capsule neural network according to claim 1, wherein step (2) comprises the following sub-steps:
(21) in step (2), higher-level features of the input are extracted using the convolution operation of the deep convolutional neural network, which can be expressed by the following formula:
$$h = f(W \ast x)$$
where $x$ denotes the input of the convolutional layer, $W$ denotes the learnable weights of the convolution kernel, $\ast$ is the convolution operation, and $f$ acts as a nonlinear mapping function;
(22) in step (2), higher-level features of the input are also extracted using the pooling operation of the deep convolutional neural network, which can be expressed by the following formula:
$$h = g(x)$$
where $x$ denotes the input of the pooling layer; since the pooling layer has no learnable parameters, there are no weights here, and the common pooling functions $g$ take the maximum, the minimum, or the average value.
4. The multi-task speech classification method based on a capsule neural network according to claim 1, wherein step (3) comprises the following sub-steps:
(31) the capsule neural network differs from conventional deep neural networks in that its minimal computational unit is a group of neurons, and the capsule network contains weights with two different roles, used respectively for making predictions and for weighting those predictions;
(32) first, in the prediction stage of the capsule network, the computation is similar to a traditional feed-forward layer: the prediction result is obtained by matrix multiplication between the input capsule and the prediction weights, calculated by the following formula:
$$\hat{u}_{j|i} = W_{ij}\, u_i$$
where $u_i$ is a low-level capsule and $\hat{u}_{j|i}$ denotes the prediction result; note that both $u_i$ and $\hat{u}_{j|i}$ are representations of a group of neurons;
(33) unlike a traditional convolutional neural network, when learning the high-level feature representations predicted by the lower layer, the capsule neural network additionally learns the weight of each low-level part for the same prediction, calculated by the following formula:
$$s_j = \sum_i c_{ij}\, \hat{u}_{j|i}$$
where $\hat{u}_{j|i}$ denotes the prediction of low-level capsule $i$ for high-level capsule $j$, $c_{ij}$ denotes the prediction weight, and the final high-level capsule obtains its net input as the weighted sum over all predictions; it is worth noting that, unlike parameter updates in traditional neural networks, which use gradient descent, $c_{ij}$ here is updated by the dynamic routing algorithm;
(34) finally, the summed prediction must pass through a nonlinear mapping; since the minimal computational unit in a capsule neural network is a group of neurons, the activation function is modified accordingly, mainly expressed as follows:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}$$
where the activated prediction $v_j$ carries meaning in two respects: its direction describes the attributes of the class, and its magnitude represents the probability that the class exists.
5. The multi-task speech classification method based on a capsule neural network according to claim 1, wherein step (4) comprises the following sub-steps:
(41) determine the speech multi-task classification content and digitize the corresponding multi-task labels;
(42) define a corresponding number of classifiers according to the classification content of the different task types;
(43) design a corresponding loss function for each classifier; the specific function is designed as follows:
$$L_t = -\frac{1}{N}\sum_{n=1}^{N} y^{(n)} \log \hat{y}^{(n)}$$
where $y^{(n)}$ is the true sample label of the corresponding speech class, $\hat{y}^{(n)}$ is the probability value after the classifier's softmax, and $N$ denotes the total number of samples; by accumulating the $N$ samples in the loss function of a given task, the average loss over all samples of that task is obtained;
the above only designs the loss function for a single classification result within the multi-task setting; for the multi-task speech classification problem, the final loss function is defined as follows:
$$L = \sum_{t=1}^{T} L_t$$
where $L_t$ denotes the above single-task loss function over the whole sample set, $T$ denotes the number of tasks in practice, and the total loss function $L$ of the final multi-task speech recognition problem is expressed as the sum of all the single-task loss functions;
(44) with the network structure designed above and the steps of constructing the data set, designing the loss functions, and so on, the entire end-to-end capsule neural network is finally trained using the back-propagation algorithm.
CN201811346110.8A 2018-11-13 2018-11-13 Multi-task speech classification method based on a capsule neural network Pending CN109523994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811346110.8A CN109523994A (en) 2018-11-13 2018-11-13 Multi-task speech classification method based on a capsule neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811346110.8A CN109523994A (en) 2018-11-13 2018-11-13 Multi-task speech classification method based on a capsule neural network

Publications (1)

Publication Number Publication Date
CN109523994A true CN109523994A (en) 2019-03-26

Family

ID=65776175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811346110.8A Pending CN109523994A (en) 2018-11-13 2018-11-13 Multi-task speech classification method based on a capsule neural network

Country Status (1)

Country Link
CN (1) CN109523994A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06295196A (en) * 1993-04-08 1994-10-21 Casio Comput Co Ltd Speech recognition device and signal recognition device
WO2005059811A1 (en) * 2003-12-16 2005-06-30 Canon Kabushiki Kaisha Pattern identification method, apparatus, and program
US20160284346A1 (en) * 2015-03-27 2016-09-29 Qualcomm Incorporated Deep neural net based filter prediction for audio event classification and extraction
US20170148431A1 (en) * 2015-11-25 2017-05-25 Baidu Usa Llc End-to-end speech recognition
US20180068675A1 (en) * 2016-09-07 2018-03-08 Google Inc. Enhanced multi-channel acoustic models
CN106601235A (en) * 2016-12-02 2017-04-26 厦门理工学院 Semi-supervision multitask characteristic selecting speech recognition method
CN107578775A (en) * 2017-09-07 2018-01-12 四川大学 A kind of multitask method of speech classification based on deep neural network
CN107610692A (en) * 2017-09-22 2018-01-19 杭州电子科技大学 The sound identification method of self-encoding encoder multiple features fusion is stacked based on neutral net
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
GB201807225D0 (en) * 2018-03-14 2018-06-13 Papercup Tech Limited A speech processing system and a method of processing a speech signal
CN108766461A (en) * 2018-07-17 2018-11-06 厦门美图之家科技有限公司 Audio feature extraction methods and device

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LE, D. et al.: "Discretized continuous speech emotion recognition with multi-task deep recurrent neural network", 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017) *
NAM KYUN KIM et al.: "Speech emotion recognition based on multi-task learning using a convolutional neural network", 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) *
余成波 et al.: "Research on finger vein recognition based on capsule networks" (基于胶囊网络的指静脉识别研究), Application of Electronic Technique (电子技术应用) *
朱应钊 et al.: "Research on capsule network technology and its development trends" (胶囊网络技术及发展趋势研究), Guangdong Communication Technology (广东通信技术) *
胡文凭: "Spoken pronunciation detection and error analysis based on deep neural networks" (基于深层神经网络的口语发音检测与错误分析), China Doctoral Dissertations Full-text Database, Information Science and Technology *
郭俊文: "Research on wearable ECG acquisition and arrhythmia detection *** based on CapsNet" (基于CAPSNET的可穿戴心电采集和心律失常检测***研究), China Masters' Theses Full-text Database, Medicine and Health Sciences *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method
CN110428843B (en) * 2019-03-11 2021-09-07 杭州巨峰科技有限公司 Voice gender recognition deep learning method
CN110120224A (en) * 2019-05-10 2019-08-13 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of bird sound identification model
CN110120224B (en) * 2019-05-10 2023-01-20 平安科技(深圳)有限公司 Method and device for constructing bird sound recognition model, computer equipment and storage medium
CN110968729A (en) * 2019-11-21 2020-04-07 浙江树人学院(浙江树人大学) Family activity sound event classification method based on additive interval capsule network
CN110968729B (en) * 2019-11-21 2022-05-17 浙江树人学院(浙江树人大学) Family activity sound event classification method based on additive interval capsule network
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
WO2021127982A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech emotion recognition method, smart device, and computer-readable storage medium
CN111357051A (en) * 2019-12-24 2020-06-30 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111357051B (en) * 2019-12-24 2024-02-02 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111179961A (en) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN111584010A (en) * 2020-04-01 2020-08-25 昆明理工大学 Key protein identification method based on capsule neural network and ensemble learning
CN111584010B (en) * 2020-04-01 2022-05-27 昆明理工大学 Key protein identification method based on capsule neural network and ensemble learning
US11735168B2 (en) 2020-07-20 2023-08-22 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recognizing voice
CN111862949A (en) * 2020-07-30 2020-10-30 北京小米松果电子有限公司 Natural language processing method and device, electronic equipment and storage medium
CN111862949B (en) * 2020-07-30 2024-04-02 北京小米松果电子有限公司 Natural language processing method and device, electronic equipment and storage medium
CN112599134A (en) * 2020-12-02 2021-04-02 国网安徽省电力有限公司 Transformer sound event detection method based on voiceprint recognition
CN112562725A (en) * 2020-12-09 2021-03-26 山西财经大学 Mixed voice emotion classification method based on spectrogram and capsule network
CN112992191B (en) * 2021-05-12 2021-11-05 北京世纪好未来教育科技有限公司 Voice endpoint detection method and device, electronic equipment and readable storage medium
CN112992191A (en) * 2021-05-12 2021-06-18 北京世纪好未来教育科技有限公司 Voice endpoint detection method and device, electronic equipment and readable storage medium
CN113362857A (en) * 2021-06-15 2021-09-07 厦门大学 Real-time speech emotion recognition method based on CapcNN and application device
CN113343924A (en) * 2021-07-01 2021-09-03 齐鲁工业大学 Modulation signal identification method based on multi-scale cyclic spectrum feature and self-attention generation countermeasure network
CN113378984A (en) * 2021-07-05 2021-09-10 国药(武汉)医学实验室有限公司 Medical image classification method, system, terminal and storage medium
CN113378984B (en) * 2021-07-05 2023-05-02 国药(武汉)医学实验室有限公司 Medical image classification method, system, terminal and storage medium
CN113314119B (en) * 2021-07-27 2021-12-03 深圳百昱达科技有限公司 Voice recognition intelligent household control method and device
CN113314119A (en) * 2021-07-27 2021-08-27 深圳百昱达科技有限公司 Voice recognition intelligent household control method and device
CN113782000A (en) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 Language identification method based on multiple tasks
WO2023222088A1 (en) * 2022-05-20 2023-11-23 青岛海尔电冰箱有限公司 Voice recognition and classification method and apparatus
CN115376518A (en) * 2022-10-26 2022-11-22 广州声博士声学技术有限公司 Voiceprint recognition method, system, device and medium for real-time noise big data
CN117275461A (en) * 2023-11-23 2023-12-22 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment
CN117275461B (en) * 2023-11-23 2024-03-15 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN109523994A (en) Multi-task speech classification method based on a capsule neural network
Kim et al. Towards speech emotion recognition "in the wild" using aggregated corpora and deep multi-task learning
CN106228977B (en) Multi-mode fusion song emotion recognition method based on deep learning
CN110534132A Speech emotion recognition method based on a parallel convolutional recurrent neural network with spectrogram features
CN108597539A Speech emotion recognition method based on parameter transfer and spectrograms
CN110289003A Voiceprint recognition method, model training method, and server
CN106503805A Bimodal person-to-person dialogue sentiment analysis system and method based on machine learning
CN109241524A (en) Semantic analysis method and device, computer readable storage medium, electronic equipment
CN109285562A (en) Speech-emotion recognition method based on attention mechanism
CN108763326A Sentiment analysis model construction method based on feature-diversified convolutional neural networks
CN104538027B Emotion propagation quantification method and system for voice social media
Shahriar et al. Classifying maqams of Qur’anic recitations using deep learning
Chen et al. Distilled binary neural network for monaural speech separation
CN109767789A A new feature extraction method for speech emotion recognition
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
Cardona et al. Online phoneme recognition using multi-layer perceptron networks combined with recurrent non-linear autoregressive neural networks with exogenous inputs
Cao et al. Speaker-independent speech emotion recognition based on random forest feature selection algorithm
CN113077823A Subdomain-adaptive cross-corpus speech emotion recognition method based on a deep autoencoder
Bergler et al. Deep representation learning for orca call type classification
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN112487237A (en) Music classification method based on self-adaptive CNN and semi-supervised self-training model
Chen et al. Construction of affective education in mobile learning: The study based on learner’s interest and emotion recognition
CN110532380A Text sentiment classification method based on memory networks
Singh et al. Speaker Recognition Assessment in a Continuous System for Speaker Identification
Tashakori et al. Designing the Intelligent System Detecting a Sense of Wonder in English Speech Signal Using Fuzzy-Nervous Inference-Adaptive system (ANFIS)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190326