CN107578775A - Multitask speech classification method based on a deep neural network - Google Patents

Multitask speech classification method based on a deep neural network

Info

Publication number
CN107578775A
Authority
CN
China
Prior art keywords
model
classification
speech
network
multitask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710801016.6A
Other languages
Chinese (zh)
Other versions
CN107578775B (en)
Inventor
毛华
彭德中
章毅
曾煜妮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201710801016.6A
Publication of CN107578775A
Application granted
Publication of CN107578775B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a multitask speech classification method based on deep learning, relating to the field of speech processing technology, and comprising the following steps: S1: perform a time-frequency analysis on the speech data to obtain the corresponding spectrogram. S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features. S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model. S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model. S5: use the trained model to predict unlabeled speech data, obtain the classification probabilities, and select the class with the higher probability as the classification result. The present invention solves the problem that existing audio classification methods treat each task in isolation and ignore the correlations between semantically related tasks, which leads to low classification performance.

Description

Multitask speech classification method based on a deep neural network
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a multitask speech classification method based on a deep neural network.
Background art
Sound provides us with a wealth of information about sound sources and the surrounding environment. The human auditory system can separate and recognize complex sounds; a machine that could perform similar functions (audio classification and recognition) would be highly useful, for example for speech recognition in noise. Audio classification is a key area of pattern recognition and has been successfully applied in many fields, such as education and entertainment. In recent years, different categories of audio classification, such as accent recognition, speaker recognition, and speech emotion recognition, have seen many successful applications.
However, most audio classification methods treat each task in isolation and ignore the correlations between tasks. For example, accent recognition and speaker recognition are usually regarded as two separate single-task classification problems. In fact, for the same speech data the two tasks are correlated: the accent of an utterance constrains who the speaker can be, and vice versa. We therefore want to use this relationship to improve the classification performance of both tasks at the same time.
Deep learning has driven the recent boom in artificial intelligence. Owing to the powerful abstraction ability of deep neural networks over data, deep learning methods have been successfully applied in many fields, including speech signal processing. In our work, convolutional neural networks are used to learn speech features, improving the accuracy on multiple classification tasks.
The spectrogram is a detailed and accurate speech representation that contains both time and frequency information. A spectrogram generally has three dimensions: time, frequency, and amplitude represented by color.
Summary of the invention
The object of the present invention is to solve the problem that existing audio classification methods treat each task in isolation and ignore the correlations between speech tasks, which leads to low classification performance.
The technical solution of the present invention is as follows:
A multitask speech classification method based on deep learning comprises the following steps:
S1: perform a time-frequency analysis on the speech data to obtain the corresponding spectrogram.
S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features.
S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model.
S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model.
S5: use the trained model to predict unlabeled speech data, obtain the classification probabilities, and select the class with the higher probability as the classification result.
Further, in S2, the basic operations of the convolutional neural network include the convolution operation and the pooling operation. The convolution operation can be expressed by the following formula:

a_{ij}^l = f\left( \sum_{n=1}^{N} \sum_{m=1}^{M} k_{nm}^l \, a_{(i+n)(j+m)}^{l-1} + b^l \right)  (1)

where M and N define the size of the convolution kernel; i and j denote row and column indices defining the position of a pixel; f is the convolution kernel function applied to the convolution result; l ∈ (1, L) denotes the layer index of the convolutional neural network; a_{ij}^l denotes the feature at row i, column j of layer l; k_{nm}^l denotes the entry at row n, column m of the layer-l convolution kernel; and b^l is the corresponding bias.

The meaning of formula (1) is: a new feature map is obtained from the products of the convolution kernel with different parts of the input feature map, under the kernel function. The formula ensures that feature extraction is independent of position, i.e., the statistical properties of one part of the input feature map are the same as those of the other parts.

The pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f\left( \beta^l \, \mathrm{down}(a^{l-1}) + b^l \right)  (2)

where a^{l-1} is the input of the layer, down(·) denotes the down-sampling operation, and β^l is the corresponding parameter. The meaning of formula (2) is that a pooling operation is applied to the input feature map, i.e., the features at different positions of the image are aggregated, thereby reducing the number of parameters in the network.
In S2, the basic residual block of the residual network can be expressed by the following formula:

y = F(x, W) + x  (3)

where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of a residual block, and y is the output of the basic residual block.

The meaning of formula (3) is: an input x passes through a two-layer feed-forward convolutional network to obtain an output F(x, W), which then passes through a shortcut connection to obtain the output y.
The formula of the basic framework model used in S2 is expressed as:

y = F_1(x, W_1) * F_2(x, W_2) + x  (4)

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.

The meaning of formula (4) is: an input x is processed by the two convolutional networks separately to obtain outputs F_1(x, W_1) and F_2(x, W_2); the two outputs are multiplied element-wise and then pass through a shortcut connection to obtain the output y.
Specifically, S4 comprises the following steps:
S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model.
S41: perform a time-frequency analysis on each speech sample to extract its spectrogram, and quantize the multiple labels of the speech sample corresponding to the multiple tasks.
S42: on the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification tasks to obtain the trained multi-task classification model.
S43: use the trained multi-task classification model for multi-task classification of speech data, output for every utterance the probability of each class of each task, and select the class with the higher probability as the classification result.
With the above scheme, the beneficial effects of the present invention are:
(1) Feature extraction from speech data is a key preprocessing operation. The neural network extracts features from the speech spectrogram; in the concrete implementation, the spectrogram is converted into a 200-dimensional shared feature.
(2) In the classification process, the neural network is expected to learn the essential features of the speech so that every classification task is predicted correctly; we therefore propose our own neural network structure to obtain a better speech representation. Specifically, compared with models that perform the same multi-class classification, such as SVM and classical neural network structures, our model performs better; and for single-task classification models, the accuracy of each of the two tasks performed separately on the same architecture is below that of the multi-task classification model.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification and the auxiliary task is the classification of sentences versus songs.
Model              Accuracy rate
SVM                48.01%
Single-task model  56.33%
Multi-task model   62.39%
Table 1 compares the accuracy of the single-task model and the multi-task model on the main task. SVM is a classical machine learning classification method; the single-task model is our proposed model applied to a single classification task, achieving an emotion classification accuracy of 56.33%; the multi-task model, while solving both tasks simultaneously, increases the emotion recognition accuracy by 6.06%.
Network structure             Emotion recognition accuracy  Speech/song classification accuracy
Convolutional neural network  53.73%                        92.24%
Residual network              57.21%                        94.62%
Gated residual network        62.39%                        93.13%
Table 2 compares the accuracy of multi-task models based on different neural network structures on speech emotion recognition over sentences and songs. The gated residual network is the model proposed in this patent.
The above experimental results show that:
1) compared with models that perform the same multi-class classification, such as SVM and classical neural network structures, our model performs better;
2) for single-task classification models, the accuracy of each of the two tasks performed separately on the same architecture is below that of the multi-task classification model.
(3) Compared with non-neural-network models, feature extraction by the deep neural network method initializes the multi-task classification model well, increases the robustness of the model, and improves the recognition performance of each task. This is because the audio signal itself may be affected by noise, and neural network methods generalize well to noise. In addition, a single-task model such as audio emotion classification is very sensitive to new speakers, whereas the multi-task classification, which also learns speaker features, is affected less.
Brief description of the drawings
Fig. 1 is the multi-task model diagram of the present invention;
Fig. 2 is the spectrogram of a speech segment containing the emotion 'angry';
Fig. 3 is the spectrogram of a speech segment containing the emotion 'happy';
Fig. 4 is the basic structure diagram of the residual network of the present invention;
Fig. 5 is the basic structure diagram of the neural network of the present invention.
Embodiment
The technical solutions in this embodiment are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the core of the multi-task speech classification based on a deep neural network is a multi-task classification model, which is used to classify two categories of tasks.
The multitask speech classification method based on deep learning comprises the following steps:
S1: perform a time-frequency analysis on the speech data to obtain the corresponding spectrogram.
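By way of illustration, the time-frequency analysis of S1 can be sketched as a short-time Fourier transform whose log magnitude gives the spectrogram. The following minimal sketch is our own illustration, not code from the patent; the file name, window length, and hop length are assumed values:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

# Load one speech sample (the file name is a placeholder).
fs, x = wavfile.read("speech_sample.wav")
if x.ndim > 1:
    x = x.mean(axis=1)          # mix stereo down to mono
x = x.astype(np.float32)

# Short-time Fourier transform: 25 ms windows with a 10 ms hop (assumed values).
nperseg = int(0.025 * fs)
noverlap = nperseg - int(0.010 * fs)
f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)

# Log-magnitude spectrogram: frequency along one axis, time along the other,
# amplitude as the value (rendered as color when plotted, cf. Figs. 2 and 3).
spectrogram = 20 * np.log10(np.abs(Z) + 1e-10)
print(spectrogram.shape)        # (frequency bins, time frames)
```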
S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features. In this step, features shared by multiple tasks are extracted by building a network structure for two classification tasks. The multi-task setting of the present invention targets two categories of task pairs: first, simultaneously recognizing the emotion contained in the speech and whether the speech is a song or a sentence; second, simultaneously recognizing the speaker and the speaker's accent.
The basic operations of the convolutional neural network include the convolution operation and the pooling operation. The convolution operation can be expressed by the following formula:

a_{ij}^l = f\left( \sum_{n=1}^{N} \sum_{m=1}^{M} k_{nm}^l \, a_{(i+n)(j+m)}^{l-1} + b^l \right)  (1)

where M and N define the size of the convolution kernel; i and j denote row and column indices defining the position of a pixel; f is the convolution kernel function applied to the convolution result; l ∈ (1, L) denotes the layer index of the convolutional neural network; a_{ij}^l denotes the feature at row i, column j of layer l; k_{nm}^l denotes the entry at row n, column m of the layer-l convolution kernel; and b^l is the corresponding bias.

The meaning of formula (1) is: a new feature map is obtained from the products of the convolution kernel with different parts of the input feature map, under the kernel function. The formula ensures that feature extraction is independent of position, i.e., the statistical properties of one part of the input feature map are the same as those of the other parts. The pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f\left( \beta^l \, \mathrm{down}(a^{l-1}) + b^l \right)  (2)

where down(·) denotes the down-sampling operation and β^l is the corresponding parameter.

The meaning of formula (2) is that a pooling operation is applied to the input feature map, i.e., the features at different positions of the image are aggregated, thereby reducing the number of parameters in the network.
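For concreteness, the two operations can be sketched in PyTorch as follows; this is our illustration, with assumed channel counts, kernel size, and ReLU as the nonlinearity f. The convolution corresponds to formula (1) and the scaled average pooling to formula (2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 64, 64)   # one single-channel spectrogram patch

# Formula (1): convolution with an M x N kernel k plus bias b, followed by the
# nonlinearity f (a ReLU here, an assumed choice).
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
a = F.relu(conv(x))

# Formula (2): down-sampling down(.) with a scale beta and bias b.
beta = nn.Parameter(torch.ones(1))
b = nn.Parameter(torch.zeros(1))
pooled = F.relu(beta * F.avg_pool2d(a, kernel_size=2) + b)
print(a.shape, pooled.shape)    # feature map before and after pooling
```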
As shown in Fig. 4, the basic residual block of the residual network in S2 can be expressed by the following formula:

y = F(x, W) + x  (3)

where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of a residual block, and y denotes the output of the basic residual block.

The meaning of formula (3) is: an input x passes through a two-layer feed-forward convolutional network to obtain an output F(x, W), which then passes through a shortcut connection to obtain the output y.
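A minimal PyTorch module realizing formula (3) might look as follows (our sketch; the channel count and kernel size are assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x, W) + x, with F a two-layer convolutional network (formula (3))."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(F.relu(self.conv1(x)))   # F(x, W)
        return out + x                            # shortcut connection

block = BasicResidualBlock()
print(block(torch.randn(1, 8, 32, 32)).shape)
```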
As shown in Fig. 5, the formula of the basic framework model of the deep neural network used in S2 is expressed as:

y = F_1(x, W_1) * F_2(x, W_2) + x  (4)

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.

The meaning of formula (4) is: an input x is processed by the two convolutional networks separately to obtain outputs F_1(x, W_1) and F_2(x, W_2); the two outputs are multiplied element-wise and then pass through a shortcut connection to obtain the output y.
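Formula (4), the building block of the gated residual network of Table 2, can be sketched the same way (again our illustration with assumed sizes; the element-wise product of the two branches and the shortcut follow the formula literally):

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """y = F1(x, W1) * F2(x, W2) + x, the gated residual block of formula (4)."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.f2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of the two branches, then the shortcut.
        return self.f1(x) * self.f2(x) + x

block = GatedResidualBlock()
print(block(torch.randn(1, 8, 32, 32)).shape)
```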
S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model.
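As a sketch of S3, a shared trunk can produce the 200-dimensional feature mentioned above and feed one softmax classifier per task. The trunk layout and the class counts below are hypothetical placeholders of ours, not values from the patent:

```python
import torch
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    """Shared trunk -> 200-dim feature -> one classifier head per task."""
    def __init__(self, num_emotions: int = 4, num_styles: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((5, 5)), nn.Flatten(),
            nn.Linear(8 * 5 * 5, 200), nn.ReLU(),     # 200-dim shared feature
        )
        self.head_emotion = nn.Linear(200, num_emotions)  # task 1: emotion
        self.head_style = nn.Linear(200, num_styles)      # task 2: song vs. sentence

    def forward(self, spec: torch.Tensor):
        h = self.trunk(spec)                    # features shared by all tasks
        return self.head_emotion(h), self.head_style(h)   # per-task logits

model = MultiTaskClassifier()
logit_emotion, logit_style = model(torch.randn(2, 1, 64, 64))
print(logit_emotion.softmax(-1).shape, logit_style.softmax(-1).shape)
```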
S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set. S4 comprises the following steps:
S41: perform a time-frequency analysis on each speech sample to extract its spectrogram, and quantize the multiple labels of the speech sample corresponding to the multiple tasks;
S42: on the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification tasks to obtain the trained multi-task classification model;
S43: use the trained multi-task classification model for multi-task classification of speech data, output for every utterance the probability of each class of each task, and select the class with the higher probability as the classification result.
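One plausible reading of the training step is to minimize the sum of the per-task cross-entropy losses over the quantized samples, as in the following sketch (it reuses the illustrative MultiTaskClassifier above; the toy data, learning rate, and epoch count are assumptions of ours):

```python
import torch
import torch.nn as nn

# Assumed toy data set: quantized spectrograms with one label per task
# (shapes, class counts, and values are placeholders, not from the patent).
specs = torch.randn(16, 1, 64, 64)
y_emotion = torch.randint(0, 4, (16,))    # task-1 labels: emotion class
y_style = torch.randint(0, 2, (16,))      # task-2 labels: song vs. sentence

model = MultiTaskClassifier()             # the illustrative model sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):                   # assumed number of epochs
    optimizer.zero_grad()
    logit_emotion, logit_style = model(specs)
    # Joint objective: the sum of the two tasks' cross-entropy losses.
    loss = criterion(logit_emotion, y_emotion) + criterion(logit_style, y_style)
    loss.backward()
    optimizer.step()
```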
S5: use the trained model to predict unlabeled speech data to obtain the classification probabilities, and select the class with the higher probability as the classification result.
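Prediction in S5 then amounts to taking, for each task, the class with the highest softmax probability (again building on the illustrative model above):

```python
import torch

model.eval()                              # the model trained above
with torch.no_grad():
    logit_emotion, logit_style = model(torch.randn(1, 1, 64, 64))  # unlabeled sample
    p_emotion = logit_emotion.softmax(-1)       # probability of each emotion class
    p_style = logit_style.softmax(-1)           # probability of song vs. sentence
    # Select the class with the higher probability for each task.
    print(p_emotion.argmax(-1).item(), p_style.argmax(-1).item())
```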
Figs. 2 and 3 show the spectrograms of speech containing the two emotions 'angry' and 'happy'; it can be seen that in the range of 10 kHz to 15 kHz the amplitude difference between the spectrograms is obvious.
Figs. 4 and 5 illustrate the neural network method proposed by the present invention, which specifically includes:
(1) The basic structure of the two models in Figs. 4 and 5 is a convolutional neural network, which involves two operations. The first is the convolution operation of the convolutional neural network, which can be expressed by the following formula:

a_{ij}^l = f\left( \sum_{n=1}^{N} \sum_{m=1}^{M} k_{nm}^l \, a_{(i+n)(j+m)}^{l-1} + b^l \right)

where M and N define the size of the convolution kernel; i and j denote row and column indices defining the position of a pixel; f is the convolution kernel function; l ∈ (1, L) denotes the layer index of the convolutional neural network; a_{ij}^l denotes the feature at row i, column j of layer l; k defines the parameters of the convolution kernel; and b is the corresponding bias.
The other operation is the pooling operation of the convolutional neural network, which can be expressed by the following formula:

a^l = f\left( \beta^l \, \mathrm{down}(a^{l-1}) + b^l \right)

where down(·) denotes the down-sampling operation and β is the corresponding parameter.
(2) Fig. 4 shows the basic residual block of the residual network, which can also be expressed by the following formula:

y = F(x, W) + x

where F is the convolutional-layer function, x is the input of a residual block, and W is the parameter.
(3) Fig. 5 shows the basic framework of the neural network we use, which can also be expressed by the following formula:

y = F_1(x, W_1) * F_2(x, W_2) + x

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.
Existing audio classification mostly targets a single label per sample; that is, the trained model performs only a single classification task. Speech emotion classification, for example, as a single task can only decide which emotion an audio clip belongs to. However, since different speakers understand emotions differently, the expression of the same emotion varies across speakers. Multi-task classification instead solves several different tasks at the same time: while completing the speech emotion classification task, this work also completes the speaker classification problem. That is, given one trained model and one input utterance, two results are obtained: who spoke the utterance and which emotion it contains. In other words, the model learns emotion features and speaker features simultaneously during training.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification and the auxiliary task is the classification of sentences versus songs.

Model              Accuracy rate
SVM                48.01%
Single-task model  56.33%
Multi-task model   62.39%

Table 1 compares the accuracy of the single-task model and the multi-task model on the main task. SVM is a classical machine learning classification method; the single-task model is our proposed model applied to a single classification task, achieving an emotion classification accuracy of 56.33%; the multi-task model, while solving both tasks simultaneously, increases the emotion recognition accuracy by 6.06%.
Network structure             Emotion recognition accuracy  Speech/song classification accuracy
Convolutional neural network  53.73%                        92.24%
Residual network              57.21%                        94.62%
Gated residual network        62.39%                        93.13%
Table 2 compares the accuracy of multi-task models based on different neural network structures on speech emotion recognition over sentences and songs. The gated residual network is the model proposed in this patent.
The above experimental results show that:
(1) compared with models that perform the same multi-class classification, such as SVM and classical neural network structures, our model performs better;
(2) for single-task classification models, the accuracy of each of the two tasks performed separately on the same architecture is below that of the multi-task classification model.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from the spirit or essential characteristics of the present invention. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalency of the claims are therefore intended to be embraced in the present invention. No reference sign in the claims should be construed as limiting the claim involved.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of narration is only for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (3)

  1. A multitask speech classification method based on deep learning, characterized in that it comprises the following steps:
    S1: perform a time-frequency analysis on the speech data to obtain the corresponding spectrogram;
    S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features;
    S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model;
    S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model;
    S5: use the trained model to predict unlabeled speech data to obtain the classification probabilities, and select the class with the higher probability as the classification result.
  2. The multitask speech classification method based on deep learning according to claim 1, characterized in that in S2 the basic operations of the convolutional neural network include the convolution operation and the pooling operation, and the convolution operation can be expressed by the following formula:
    a_{ij}^l = f\left( \sum_{n=1}^{N} \sum_{m=1}^{M} k_{nm}^l \, a_{(i+n)(j+m)}^{l-1} + b^l \right)  (1)
    where M and N define the size of the convolution kernel; i and j denote row and column indices defining the position of a pixel; f is the convolution kernel function; l ∈ (1, L) denotes the layer index of the convolutional neural network; a_{ij}^l denotes the feature at row i, column j of layer l; k_{nm}^l denotes the entry at row n, column m of the layer-l convolution kernel; and b^l is the bias function of layer l;
    the pooling operation of the convolutional neural network can be expressed by the following formula:
    a^l = f\left( \beta^l \, \mathrm{down}(a^{l-1}) + b^l \right)  (2)
    where a^{l-1} is the input of the layer, f is the pooling-layer function, down(·) denotes the down-sampling operation, and β^l is the corresponding parameter;
    the basic residual block of the residual network in S2 can be expressed by the following formula:
    y = F(x, W) + x  (3)
    where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of a residual block, and y denotes the output of the basic residual block;
    the formula of the basic framework model used in S2 is expressed as:
    y = F_1(x, W_1) * F_2(x, W_2) + x  (4)
    where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, W_1 and W_2 are the parameters of the two convolutional layers, and y denotes the output.
  3. The multitask speech classification method based on deep learning according to claim 1, characterized in that S4 comprises the following steps:
    S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain the trained network model;
    S41: perform a time-frequency analysis on each speech sample to extract its spectrogram, and quantize the multiple labels of the speech sample corresponding to the multiple tasks;
    S42: on the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification task to obtain the trained multi-task classification model;
    S43: use the trained multi-task classification model for multi-task classification of speech data, output for every utterance the probability of each class of each task, and select the class with the higher probability as the classification result.
CN201710801016.6A 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network Expired - Fee Related CN107578775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710801016.6A CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710801016.6A CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Publications (2)

Publication Number Publication Date
CN107578775A true CN107578775A (en) 2018-01-12
CN107578775B CN107578775B (en) 2021-02-12

Family

ID=61031600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710801016.6A Expired - Fee Related CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Country Status (1)

Country Link
CN (1) CN107578775B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1300831A1 (en) * 2001-10-05 2003-04-09 Sony International (Europe) GmbH Method for detecting emotions involving subspace specialists
US20160027452A1 * 2014-07-28 2016-01-28 Sony Computer Entertainment Inc. Emotional speech processing
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754357A (en) * 2018-01-26 2019-05-14 京东方科技集团股份有限公司 Image processing method, processing unit and processing equipment
CN109754357B (en) * 2018-01-26 2021-09-21 京东方科技集团股份有限公司 Image processing method, processing device and processing equipment
CN110503968A (en) * 2018-05-18 2019-11-26 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN110503968B (en) * 2018-05-18 2024-06-04 北京搜狗科技发展有限公司 Audio processing method, device, equipment and readable storage medium
CN109243424A (en) * 2018-08-28 2019-01-18 合肥星空物联信息科技有限公司 One key voiced translation terminal of one kind and interpretation method
CN109490822A (en) * 2018-10-16 2019-03-19 南京信息工程大学 Voice DOA estimation method based on ResNet
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN109523994A (en) * 2018-11-13 2019-03-26 四川大学 A kind of multitask method of speech classification based on capsule neural network
CN109493881A (en) * 2018-11-22 2019-03-19 北京奇虎科技有限公司 A kind of labeling processing method of audio, device and calculate equipment
CN109493881B (en) * 2018-11-22 2023-12-05 北京奇虎科技有限公司 Method and device for labeling audio and computing equipment
CN111354372A (en) * 2018-12-21 2020-06-30 中国科学院声学研究所 Audio scene classification method and system based on front-end and back-end joint training
CN109684995A (en) * 2018-12-22 2019-04-26 中国人民解放军战略支援部队信息工程大学 Specific Emitter Identification method and device based on depth residual error network
CN109754822A (en) * 2019-01-22 2019-05-14 平安科技(深圳)有限公司 The method and apparatus for establishing Alzheimer's disease detection model
CN109919047A (en) * 2019-02-18 2019-06-21 山东科技大学 A kind of mood detection method based on multitask, the residual error neural network of multi-tag
CN110189769A (en) * 2019-05-23 2019-08-30 复钧智能科技(苏州)有限公司 Abnormal sound detection method based on multiple convolutional neural networks models couplings
CN110532424A (en) * 2019-09-26 2019-12-03 西南科技大学 A kind of lungs sound tagsort system and method based on deep learning and cloud platform
CN110992987B (en) * 2019-10-23 2022-05-06 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN110808069A (en) * 2019-11-11 2020-02-18 上海瑞美锦鑫健康管理有限公司 Evaluation system and method for singing songs
CN111128131A (en) * 2019-12-17 2020-05-08 北京声智科技有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN111128131B (en) * 2019-12-17 2022-07-01 北京声智科技有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN111429947A (en) * 2020-03-26 2020-07-17 重庆邮电大学 Speech emotion recognition method based on multi-stage residual convolutional neural network
CN111429947B (en) * 2020-03-26 2022-06-10 重庆邮电大学 Speech emotion recognition method based on multi-stage residual convolutional neural network
CN111460157A (en) * 2020-04-01 2020-07-28 哈尔滨理工大学 Cyclic convolution multitask learning method for multi-field text classification
CN111460157B (en) * 2020-04-01 2023-03-28 哈尔滨理工大学 Cyclic convolution multitask learning method for multi-field text classification
CN111933179A (en) * 2020-06-04 2020-11-13 华南师范大学 Environmental sound identification method and device based on hybrid multi-task learning
CN111833856A (en) * 2020-07-15 2020-10-27 厦门熙重电子科技有限公司 Voice key information calibration method based on deep learning
CN111833856B (en) * 2020-07-15 2023-10-24 厦门熙重电子科技有限公司 Voice key information calibration method based on deep learning
CN111599382A (en) * 2020-07-27 2020-08-28 深圳市声扬科技有限公司 Voice analysis method, device, computer equipment and storage medium
CN111599382B (en) * 2020-07-27 2020-10-27 深圳市声扬科技有限公司 Voice analysis method, device, computer equipment and storage medium
CN112331187A (en) * 2020-11-24 2021-02-05 苏州思必驰信息科技有限公司 Multi-task speech recognition model training method and multi-task speech recognition method
CN113823271A (en) * 2020-12-18 2021-12-21 京东科技控股股份有限公司 Training method and device of voice classification model, computer equipment and storage medium
CN112506667A (en) * 2020-12-22 2021-03-16 北京航空航天大学杭州创新研究院 Deep neural network training method based on multi-task optimization
CN112992119A (en) * 2021-01-14 2021-06-18 安徽大学 Deep neural network-based accent classification method and model thereof
CN112992119B (en) * 2021-01-14 2024-05-03 安徽大学 Accent classification method based on deep neural network and model thereof
CN112992157A (en) * 2021-02-08 2021-06-18 贵州师范大学 Neural network noisy line identification method based on residual error and batch normalization
CN114882884A (en) * 2022-07-06 2022-08-09 深圳比特微电子科技有限公司 Multitask implementation method and device based on deep learning model

Also Published As

Publication number Publication date
CN107578775B (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN107578775A (en) A kind of multitask method of speech classification based on deep neural network
Sun et al. Speech emotion recognition based on DNN-decision tree SVM model
Wang et al. Speech emotion recognition with dual-sequence LSTM architecture
Espi et al. Exploiting spectro-temporal locality in deep learning based acoustic event detection
CN108563653B (en) Method and system for constructing knowledge acquisition model in knowledge graph
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN106503805B (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis method
Daneshfar et al. Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN109460737A (en) A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
Vrysis et al. 1D/2D deep CNNs vs. temporal feature integration for general audio classification
CN111753549A (en) Multi-mode emotion feature learning and recognition method based on attention mechanism
CN104978587B (en) A kind of Entity recognition cooperative learning algorithm based on Doctype
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN111126218A (en) Human behavior recognition method based on zero sample learning
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN106570106A (en) Method and device for converting voice information into expression in input process
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
US11735190B2 (en) Attentive adversarial domain-invariant training
CN108170848B (en) Chinese mobile intelligent customer service-oriented conversation scene classification method
Muthusamy et al. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals
CN108986798B (en) Processing method, device and the equipment of voice data
CN107491729A (en) The Handwritten Digit Recognition method of convolutional neural networks based on cosine similarity activation
Vrysis et al. Extending temporal feature integration for semantic audio analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210212

Termination date: 20210907