CN107578775A - A multi-task speech classification method based on a deep neural network - Google Patents
- Publication numbers: CN107578775A, CN107578775B (application CN201710801016.6A)
- Authority: CN (China)
- Prior art keywords: model, classification, speech, network, multitask
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The present invention discloses a multi-task speech classification method based on deep learning, relating to the technical field of speech processing. The method comprises the following steps. S1: perform time-frequency analysis on the speech data to obtain the corresponding spectrogram. S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features. S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model. S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain a trained network model. S5: use the trained model to predict unlabeled speech data, obtain the classification probability for each class, and select the class with the highest probability as the classification result. The present invention solves the problem that existing audio classification methods treat each task in isolation and ignore the correlations between tasks, which leads to low classification performance.
Description
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a multi-task speech classification method based on deep neural networks.
Background technology
Sound provides us with a great deal of information about sound sources and the surrounding environment. The human auditory system can separate and identify complex sounds; a machine that could perform a similar function (audio classification and recognition) would be highly useful, for example for speech recognition in noise. Audio classification is a key area of pattern recognition and has been successfully applied in many fields, such as education and entertainment. In recent years, different categories of audio classification, such as accent recognition, speaker recognition, and speech emotion recognition, have seen many successful applications.
However, most audio classification methods treat each task in isolation and ignore the correlations between tasks. For example, accent recognition and speaker recognition are usually treated as two separate classification tasks. In fact, for the same speech data the accent and the speaker are strongly correlated: identifying one constrains the other. We therefore aim to exploit this relation to improve the classification performance of both tasks simultaneously.
Deep learning has driven the recent surge in artificial intelligence. Owing to the powerful abstraction capability of deep neural networks over data, deep learning methods have been successfully applied in fields such as speech signal processing. In our work, a convolutional neural network is used to learn speech features and improve accuracy on multiple classification tasks.
The spectrogram is a detailed and accurate representation of speech that contains both time and frequency information. A spectrogram typically has three dimensions: time, frequency, and amplitude represented by color.
Summary of the invention
The object of the present invention is to solve the problem that existing audio classification methods treat each task in isolation, ignore the correlations between speech tasks, and therefore achieve low classification performance.
The technical scheme is as follows:
A multi-task speech classification method based on deep learning comprises the following steps:
S1: Perform time-frequency analysis on the speech data to obtain the corresponding spectrogram.
S2: Build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features.
S3: Feed the extracted features into multiple different softmax classifiers to obtain an initialized model.
S4: Quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain a trained network model.
S5: Use the trained model to predict unlabeled speech data, obtain the classification probability for each class, and select the class with the highest probability as the classification result.
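The shared-feature-plus-multiple-softmax design of steps S3 and S5 can be sketched as follows. This is an illustrative pure-Python sketch, not the patent's actual implementation; the feature vector, head weights, and task names are hypothetical.

```python
import math

def softmax(z):
    """Convert raw scores into a probability distribution."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def predict_multitask(features, heads):
    """Steps S3/S5 sketch: each task has its own softmax head over the
    shared features; the class with the highest probability wins."""
    results = {}
    for task, weights in heads.items():
        # scores[c] = dot(features, weights[c]) for each class c of this task
        scores = [sum(f * w for f, w in zip(features, row)) for row in weights]
        probs = softmax(scores)
        results[task] = probs.index(max(probs))
    return results

# Hypothetical 3-dimensional shared feature and two task heads
feats = [1.0, -0.5, 2.0]
heads = {
    "emotion": [[0.2, 0.1, 0.9], [0.5, -0.3, 0.1]],        # 2 emotion classes
    "song_or_sentence": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
}
print(predict_multitask(feats, heads))  # {'emotion': 0, 'song_or_sentence': 1}
```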
Further, in S2, the basic operations of the convolutional neural network are the convolution operation and the pooling operation. The convolution operation can be expressed by the following formula:

x^l_{ij} = f( Σ_{n=1}^{M} Σ_{m=1}^{N} k^l_{nm} · x^{l-1}_{(i+n)(j+m)} + b^l )  (1)

where M and N define the size of the convolution kernel; i and j denote the row and column indices, defining the position of a pixel; f is the kernel (activation) function; l ∈ (1, L) denotes the layer index of the convolutional neural network; x^l_{ij} denotes the feature at row i, column j of layer l; k^l_{nm} denotes the parameter at row n, column m of the convolution kernel of layer l; and b is the corresponding bias.

The meaning of formula (1) is that a new feature map is obtained from the products of the different parts of the input feature map with the convolution kernel under the kernel function. The formula ensures that feature extraction is independent of position, that is, the statistical properties of one part of the input feature map are the same as those of the other parts.
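The convolution of formula (1) can be sketched in pure Python as follows. This is a minimal single-channel "valid" convolution for illustration; the example input, kernel, and identity activation are hypothetical.

```python
import math

def conv2d_valid(x, k, b=0.0, f=math.tanh):
    """Formula (1) sketch: x^l_ij = f( sum_n sum_m k^l_nm * x^{l-1}_{(i+n)(j+m)} + b ).
    x and k are 2-D lists; a single kernel slides over all valid positions."""
    M, N = len(k), len(k[0])
    H, W = len(x), len(x[0])
    out = []
    for i in range(H - M + 1):
        row = []
        for j in range(W - N + 1):
            s = sum(k[n][m] * x[i + n][j + m] for n in range(M) for m in range(N))
            row.append(f(s + b))
        out.append(row)
    return out

# With an identity activation, a 2x2 all-ones kernel sums each 2x2 patch
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k = [[1, 1], [1, 1]]
print(conv2d_valid(x, k, f=lambda t: t))  # [[12, 16], [24, 28]]
```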
The pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f(β^l · down(a^{l-1}) + b^l)  (2)

where a^{l-1} is the input of the layer and a^l its output, down(·) denotes the down-sampling operation, and β^l is the corresponding parameter. The meaning of formula (2) is that a pooling operation is applied to the input feature map, i.e. the features at different positions of the image are aggregated, thereby reducing the number of parameters in the network.
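Formula (2) can be sketched as follows, assuming non-overlapping 2×2 mean pooling as the down(·) operation; the chosen window size and identity activation are illustrative, not the patent's settings.

```python
def down(a):
    """One possible down() of formula (2): non-overlapping 2x2 mean pooling."""
    return [[(a[i][j] + a[i][j + 1] + a[i + 1][j] + a[i + 1][j + 1]) / 4.0
             for j in range(0, len(a[0]) - 1, 2)]
            for i in range(0, len(a) - 1, 2)]

def pool_layer(a_prev, beta=1.0, b=0.0, f=lambda t: t):
    """Formula (2) sketch: a^l = f(beta^l * down(a^{l-1}) + b^l)."""
    return [[f(beta * v + b) for v in row] for row in down(a_prev)]

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(pool_layer(a))  # [[3.5, 5.5], [11.5, 13.5]]
```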
In S2, the basic residual block of the residual network can be expressed by the following formula:

y = F(x, W) + x  (3)

where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of the residual block, and y is the output of the basic residual block.

The meaning of formula (3) is that an input x passes through a two-layer feed-forward convolutional network to obtain an output F(x, W), which is then combined with x through a shortcut connection to obtain the output y.
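Formula (3) can be sketched as follows; toy element-wise maps stand in for the two convolutional layers, and the example layers are hypothetical.

```python
def residual_block(x, layer1, layer2):
    """Formula (3) sketch: y = F(x, W) + x, where F is a two-layer network
    and '+ x' is the shortcut connection."""
    fx = layer2(layer1(x))                 # F(x, W): two stacked layers
    return [xi + fi for xi, fi in zip(x, fx)]  # shortcut adds the input back

# Hypothetical toy 'layers': element-wise maps standing in for convolutions
l1 = lambda v: [2 * t for t in v]
l2 = lambda v: [t + 1 for t in v]
print(residual_block([1.0, 2.0, 3.0], l1, l2))  # [4.0, 7.0, 10.0]
```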
The formula of the basic framework model used in S2 is expressed as:

y = F1(x, W1) * F2(x, W2) + x  (4)

where * denotes element-wise multiplication, F1 and F2 are two convolutional layers, x is the input of this basic structure, and W1 and W2 are the parameters of the two convolutional layers.

The meaning of formula (4) is that an input x is processed by two convolutional networks in parallel to obtain the outputs F1(x, W1) and F2(x, W2); the two are multiplied element-wise and then combined with x through a shortcut connection to obtain the output y.
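Formula (4) can be sketched as follows; the two branches are hypothetical toy maps, with the second acting as a gate on the first, and '*' taken element-wise.

```python
def gated_residual_block(x, F1, F2):
    """Formula (4) sketch: y = F1(x, W1) * F2(x, W2) + x,
    where '*' is element-wise multiplication and '+ x' the shortcut."""
    y1, y2 = F1(x), F2(x)                       # two parallel branches
    return [a * b + xi for a, b, xi in zip(y1, y2, x)]

# Hypothetical toy branches: F2 acts as a gate in [0, 1] on F1's output
F1 = lambda v: [2 * t for t in v]
F2 = lambda v: [0.5 for _ in v]
print(gated_residual_block([1.0, 2.0, 3.0], F1, F2))  # [2.0, 4.0, 6.0]
```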
Specifically, S4 comprises the following steps:
S4: Quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain a trained network model.
S41: Perform time-frequency analysis on each speech sample, extract its spectrogram, and quantize the labels of the speech sample corresponding to each of its multiple tasks.
S42: On the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification tasks to obtain a trained multi-task classification model.
S43: Use the trained multi-task classification model for the multi-task classification of speech data, output for each utterance the probability of each class in each task, and select the class with the highest probability as the classification result.
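Steps S41–S42 can be sketched as follows: a minimal sketch of label quantization and a joint cross-entropy objective. The class counts and probabilities are hypothetical, and the actual network and optimizer are omitted.

```python
import math

def one_hot(index, n_classes):
    """Step S41 sketch: 'quantize' a categorical mark into a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]

def multitask_loss(task_probs, task_labels):
    """Step S42 sketch: the joint training objective sums the per-task
    cross-entropies computed over the shared features."""
    return sum(-math.log(probs[label])
               for probs, label in zip(task_probs, task_labels))

# Hypothetical sample with two marks: emotion class 0, song/sentence class 1
marks = [one_hot(0, 4), one_hot(1, 2)]
loss = multitask_loss([[0.5, 0.2, 0.2, 0.1], [0.1, 0.9]], [0, 1])
print(round(loss, 4))  # 0.7985
```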
With the above scheme, the beneficial effects of the present invention are:
(1) Feature extraction from speech data is a crucial preprocessing operation. Features are extracted from the speech spectrogram by the neural network; in the concrete implementation, the spectrogram is converted into a 200-dimensional shared feature vector.
(2) In the classification process, we want the neural network to learn the essential characteristics of speech so that each classification category can be predicted correctly, so we propose our own neural network structure to obtain a better speech representation. Specifically, compared with models that perform the same multi-class classification, such as SVMs and classical neural network structures, our model performs better; and compared with single-task classification models, implementing the two tasks separately on the same model yields lower accuracy than the multi-task classification model.
Take speech emotion recognition on sentences and songs as an example: the main task is speech emotion classification, and the auxiliary task is the classification of sentences versus songs.
Model | Accuracy |
---|---|
SVM | 48.01% |
Single-task model | 56.33% |
Multi-task model | 62.39% |
Table 1 mainly compares the accuracy of the single-task model and the multi-task model on the main task. SVM is a classical machine learning classification method; the single-task model is our proposed model applied to single-task classification, with an emotion classification accuracy of 56.33%; the multi-task model realizes both tasks simultaneously, and its emotion recognition accuracy increases by 6.06%.
Network structure | Emotion recognition accuracy | Speech and song classification accuracy |
---|---|---|
Convolutional neural network | 53.73% | 92.24% |
Residual network | 57.21% | 94.62% |
Gate-based residual network | 62.39% | 93.13% |
Table 2 mainly compares the accuracy of multi-task models based on different neural network structures on speech emotion recognition over sentences and songs. The gate-based residual network is the model proposed in this patent.
The above results show that:
1) Compared with models performing the same multi-class classification, such as SVMs and classical neural network structures, our model performs better.
2) Compared with single-task classification models, implementing the two tasks separately on the same model yields lower accuracy than the multi-task classification model.
(3) Compared with models based on non-neural-network methods, feature extraction from speech by a deep neural network can initialize the multi-task classification model well, increase the robustness of the model, and improve the recognition performance on each task. This is because the audio signal itself may be affected by noise, and neural network methods generalize well in the presence of such noise. In addition, a single-task model such as an audio emotion classifier is very sensitive to new speakers, whereas multi-task classification also learns speaker characteristics, so this effect is relatively smaller.
Brief description of the drawings
Fig. 1 is the multi-task model diagram of the present invention;
Fig. 2 is the spectrogram of a speech sample containing the emotion "angry";
Fig. 3 is the spectrogram of a speech sample containing the emotion "happy";
Fig. 4 is the basic structure diagram of the residual network of the present invention;
Fig. 5 is the basic structure diagram of the neural network of the present invention.
Embodiment
The technical scheme in this embodiment will be described clearly and completely below in conjunction with the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work belong to the protection scope of the present invention.
Referring to Fig. 1, the core model of the multi-task speech classification based on deep neural networks is a multi-task classification model, which is used to classify two groups of tasks.
The multi-task speech classification method based on deep learning comprises the following steps:
S1: Perform time-frequency analysis on the speech data to obtain the corresponding spectrogram.
S2: Build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features. In this step, features shared by multiple tasks are extracted by building a network structure for two classification tasks. The multi-task setting of the present invention targets two groups of classification tasks: the first is to simultaneously distinguish the emotion contained in the speech and whether the speech belongs to a song or a sentence; the second is to simultaneously distinguish the speaker and the speaker's accent.
As shown in Fig. 3, the basic operations of the convolutional neural network are the convolution operation and the pooling operation. The convolution operation can be expressed by the following formula:

x^l_{ij} = f( Σ_{n=1}^{M} Σ_{m=1}^{N} k^l_{nm} · x^{l-1}_{(i+n)(j+m)} + b^l )  (1)

where M and N define the size of the convolution kernel; i and j denote the row and column indices, defining the position of a pixel; f is the kernel (activation) function; l ∈ (1, L) denotes the layer index of the convolutional neural network; x^l_{ij} denotes the feature at row i, column j of layer l; k^l_{nm} denotes the parameter at row n, column m of the convolution kernel of layer l; and b is the corresponding bias.

The meaning of formula (1) is that a new feature map is obtained from the products of the different parts of the input feature map with the convolution kernel under the kernel function. The above formula ensures that feature extraction is independent of position, that is, the statistical properties of one part of the input feature map are the same as those of the other parts. The pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f(β^l · down(a^{l-1}) + b^l)  (2)

where down(·) denotes the down-sampling operation and β^l is the corresponding parameter.

The meaning of formula (2) is that a pooling operation is applied to the input feature map, i.e. the features at different positions of the image are aggregated, thereby reducing the number of parameters in the network.
As shown in Fig. 4, the basic residual block of the residual network in S2 can be expressed by the following formula:

y = F(x, W) + x  (3)

where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of the residual block, and y is the output of the basic residual block.

The meaning of formula (3) is that an input x passes through a two-layer feed-forward convolutional network to obtain an output F(x, W), which is then combined with x through a shortcut connection to obtain the output y.
As shown in Fig. 5, the formula of the basic framework model of the deep neural network used in S2 is expressed as:

y = F1(x, W1) * F2(x, W2) + x  (4)

where * denotes element-wise multiplication, F1 and F2 are two convolutional layers, x is the input of this basic structure, and W1 and W2 are the parameters of the two convolutional layers.

The meaning of formula (4) is that an input x is processed by two convolutional networks in parallel to obtain the outputs F1(x, W1) and F2(x, W2); the two are multiplied element-wise and then combined with x through a shortcut connection to obtain the output y.
S3: Feed the extracted features into multiple different softmax classifiers to obtain an initialized model.
S4: Quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set. S4 comprises the following steps:
S4: Use the speech data and the corresponding multiple labels to train the initialized model and obtain a trained network model;
S41: Perform time-frequency analysis on each speech sample, extract its spectrogram, and quantize the labels of the speech sample corresponding to each of its multiple tasks;
S42: On the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification tasks to obtain a trained multi-task classification model;
S43: Use the trained multi-task classification model for the multi-task classification of speech data, output for each utterance the probability of each class in each task, and select the class with the highest probability as the classification result.
S5: Use the trained model to predict unlabeled speech data, obtain the classification probability for each class, and select the class with the highest probability as the classification result.
Fig. 2 and Fig. 3 show the spectrograms of speech containing the two emotions "angry" and "happy". We can see that in the range from 10 kHz to 15 kHz, the difference in spectrogram amplitude is obvious.
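The time-frequency analysis of step S1 that produces such spectrograms can be sketched as follows: a minimal framed-DFT sketch with a Hann window. The frame length and hop size are illustrative, not the patent's values.

```python
import cmath, math

def spectrogram(signal, frame_len=8, hop=4):
    """Step S1 sketch: magnitude spectrogram via a framed DFT."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window reduces spectral leakage at the frame edges
        win = [s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_len - 1)))
               for n, s in enumerate(frame)]
        # Keep only the non-negative frequency bins of the DFT
        frames.append([abs(sum(w * cmath.exp(-2j * math.pi * k * n / frame_len)
                               for n, w in enumerate(win)))
                       for k in range(frame_len // 2 + 1)])
    return frames

# A pure tone concentrates its energy in a single frequency bin
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(16)]
sg = spectrogram(tone)
print(len(sg), len(sg[0]))  # 3 frames x 5 frequency bins
```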
Fig. 4 and Fig. 5 illustrate the neural network method proposed by the present invention, which specifically includes:

(1) The basic structure of the two models in Fig. 4 and Fig. 5 is a convolutional neural network, which includes two kinds of operations. The first is the convolution operation of the convolutional neural network, which can be expressed by the following formula:

x^l_{pq} = f( Σ_{n=1}^{M} Σ_{m=1}^{N} k_{nm} · x^{l-1}_{(p+n)(q+m)} + b )

where M and N define the size of the convolution kernel; p and q denote the row and column indices, defining the position of a pixel; f is the kernel (activation) function; l ∈ (1, L) denotes the layer index of the convolutional neural network; x^l_{pq} denotes the feature at row p, column q of layer l; k defines the parameters of the convolution kernel; and b is the corresponding bias.

The other operation is the pooling operation of the convolutional neural network, which can be expressed by the following formula:

a^l = f(β^l · down(a^{l-1}) + b^l)

where down(·) denotes the down-sampling operation and β is the corresponding parameter.

(2) Fig. 4 represents the basic residual block of the residual network, which can also be expressed by the following formula:

y = F(x, W) + x

where F is the convolutional layer function, x is the input of the residual block, and W is its parameter.

(3) Fig. 5 represents the basic framework of the neural network we use, which can also be expressed by the following formula:

y = F1(x, W1) * F2(x, W2) + x

where * denotes element-wise multiplication, F1 and F2 are each a convolutional layer, x is the input of this basic structure, and W1 and W2 are the parameters of the two convolutional layers.
Existing audio classification methods mainly target a single label per sample; that is, the trained model can only perform a single classification task. Speech emotion classification, for example, is single-task classification: it can only determine which emotion an audio clip belongs to. However, because different speakers understand emotion differently, the same emotion is expressed differently by different speakers. Multi-task classification, in contrast, realizes multiple different tasks at the same time; for example, this project also completes the speaker classification problem while completing the speech emotion classification task. That is, given a trained model and an input utterance, two results are obtained: one is the speaker of the utterance, and the other is the emotion it contains. In other words, during training this project learns emotion features and speaker features simultaneously.
Take speech emotion recognition on sentences and songs as an example: the main task is speech emotion classification, and the auxiliary task is the classification of sentences versus songs.

Table 1 mainly compares the accuracy of the single-task model and the multi-task model on the main task. SVM is a classical machine learning classification method; the single-task model is our proposed model applied to single-task classification, with an emotion classification accuracy of 56.33%; the multi-task model realizes both tasks simultaneously, and its emotion recognition accuracy increases by 6.06%.
Network structure | Emotion recognition accuracy | Speech and song classification accuracy |
---|---|---|
Convolutional neural network | 53.73% | 92.24% |
Residual network | 57.21% | 94.62% |
Gate-based residual network | 62.39% | 93.13% |
Table 2 mainly compares the accuracy of multi-task models based on different neural network structures on speech emotion recognition over sentences and songs. The gate-based residual network is the model proposed in this patent.

The above results show that:
(1) Compared with models performing the same multi-class classification, such as SVMs and classical neural network structures, our model performs better.
(2) Compared with single-task classification models, implementing the two tasks separately on the same model yields lower accuracy than the multi-task classification model.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from the spirit or essential attributes of the present invention. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalency of the claims are included in the present invention. Any reference sign in a claim should not be construed as limiting the claim concerned.

Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical scheme. This manner of narration is adopted only for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.
Claims (3)
- 1. A multi-task speech classification method based on deep learning, characterized by comprising the following steps: S1: perform time-frequency analysis on the speech data to obtain the corresponding spectrogram; S2: build a neural network model based on convolutional neural networks and residual networks, take the spectrogram as the network input, and extract features; S3: feed the extracted features into multiple different softmax classifiers to obtain an initialized model; S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain a trained network model; S5: use the trained model to predict unlabeled speech data, obtain the classification probability for each class, and select the class with the highest probability as the classification result.
- 2. The multi-task speech classification method based on deep learning according to claim 1, characterized in that in said S2, the basic operations of the convolutional neural network are the convolution operation and the pooling operation, and the convolution operation can be expressed by the following formula: x^l_{ij} = f( Σ_{n=1}^{M} Σ_{m=1}^{N} k^l_{nm} · x^{l-1}_{(i+n)(j+m)} + b^l )  (1), where M and N define the size of the convolution kernel; i and j denote the row and column indices, defining the position of a pixel; f is the kernel function; l ∈ (1, L) denotes the layer index of the convolutional neural network; x^l_{ij} denotes the feature at row i, column j of layer l; k^l_{nm} denotes the parameter at row n, column m of the convolution kernel of layer l; and b^l is the bias function of layer l. The pooling operation of the convolutional neural network can be expressed by the following formula: a^l = f(β^l · down(a^{l-1}) + b^l)  (2), where a^{l-1} is the input of the layer, f is the pooling layer function, down(·) denotes the down-sampling operation, and β^l is the corresponding parameter. In said S2, the basic residual block of the residual network can be expressed by the following formula: y = F(x, W) + x  (3), where F denotes a two-layer convolutional network, W is the parameter of the convolutional network, x is the input of the residual block, and y denotes the output of the basic residual block. The formula of the basic framework model used in S2 is expressed as: y = F1(x, W1) * F2(x, W2) + x  (4), where * denotes element-wise multiplication, F1 and F2 are two convolutional layers, x is the input of this basic structure, W1 and W2 are the parameters of the two convolutional layers, and y denotes the output.
- 3. The multi-task speech classification method based on deep learning according to claim 1, characterized in that said S4 comprises the following steps: S4: quantize the speech samples and their corresponding multiple labels, and train the initialized model on this data set to obtain a trained network model; S41: perform time-frequency analysis on each speech sample, extract its spectrogram, and quantize the labels of the speech sample corresponding to each of its multiple tasks; S42: on the basis of the initialized multi-task classification model obtained in step S3, learn the current speech classification tasks to obtain a trained multi-task classification model; S43: use the trained multi-task classification model for the multi-task classification of speech data, output for each utterance the probability of each class in each task, and select the class with the highest probability as the classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710801016.6A CN107578775B (en) | 2017-09-07 | 2017-09-07 | Multi-classification voice method based on deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107578775A true CN107578775A (en) | 2018-01-12 |
CN107578775B CN107578775B (en) | 2021-02-12 |
Family
ID=61031600
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710801016.6A Expired - Fee Related CN107578775B (en) | 2017-09-07 | 2017-09-07 | Multi-classification voice method based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107578775B (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109243424A (en) * | 2018-08-28 | 2019-01-18 | 合肥星空物联信息科技有限公司 | One key voiced translation terminal of one kind and interpretation method |
CN109490822A (en) * | 2018-10-16 | 2019-03-19 | 南京信息工程大学 | Voice DOA estimation method based on ResNet |
CN109493881A (en) * | 2018-11-22 | 2019-03-19 | 北京奇虎科技有限公司 | A kind of labeling processing method of audio, device and calculate equipment |
CN109523994A (en) * | 2018-11-13 | 2019-03-26 | 四川大学 | A kind of multitask method of speech classification based on capsule neural network |
CN109523993A (en) * | 2018-11-02 | 2019-03-26 | 成都三零凯天通信实业有限公司 | A kind of voice languages classification method merging deep neural network with GRU based on CNN |
CN109684995A (en) * | 2018-12-22 | 2019-04-26 | 中国人民解放军战略支援部队信息工程大学 | Specific Emitter Identification method and device based on depth residual error network |
CN109754357A (en) * | 2018-01-26 | 2019-05-14 | 京东方科技集团股份有限公司 | Image processing method, processing unit and processing equipment |
CN109754822A (en) * | 2019-01-22 | 2019-05-14 | 平安科技(深圳)有限公司 | The method and apparatus for establishing Alzheimer's disease detection model |
CN109919047A (en) * | 2019-02-18 | 2019-06-21 | 山东科技大学 | A kind of mood detection method based on multitask, the residual error neural network of multi-tag |
CN110189769A (en) * | 2019-05-23 | 2019-08-30 | 复钧智能科技(苏州)有限公司 | Abnormal sound detection method based on multiple convolutional neural networks models couplings |
CN110503968A (en) * | 2018-05-18 | 2019-11-26 | 北京搜狗科技发展有限公司 | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing |
CN110532424A (en) * | 2019-09-26 | 2019-12-03 | 西南科技大学 | A kind of lungs sound tagsort system and method based on deep learning and cloud platform |
CN110808069A (en) * | 2019-11-11 | 2020-02-18 | 上海瑞美锦鑫健康管理有限公司 | Evaluation system and method for singing songs |
CN110992987A (en) * | 2019-10-23 | 2020-04-10 | 大连东软信息学院 | Parallel feature extraction system and method for general specific voice in voice signal |
CN111128131A (en) * | 2019-12-17 | 2020-05-08 | 北京声智科技有限公司 | Voice recognition method and device, electronic equipment and computer readable storage medium |
CN111354372A (en) * | 2018-12-21 | 2020-06-30 | 中国科学院声学研究所 | Audio scene classification method and system based on front-end and back-end joint training |
CN111429947A (en) * | 2020-03-26 | 2020-07-17 | 重庆邮电大学 | Speech emotion recognition method based on multi-stage residual convolutional neural network |
CN111460157A (en) * | 2020-04-01 | 2020-07-28 | 哈尔滨理工大学 | Cyclic convolution multitask learning method for multi-field text classification |
CN111599382A (en) * | 2020-07-27 | 2020-08-28 | 深圳市声扬科技有限公司 | Voice analysis method, device, computer equipment and storage medium |
CN111833856A (en) * | 2020-07-15 | 2020-10-27 | 厦门熙重电子科技有限公司 | Voice key information calibration method based on deep learning |
CN111933179A (en) * | 2020-06-04 | 2020-11-13 | 华南师范大学 | Environmental sound identification method and device based on hybrid multi-task learning |
CN112331187A (en) * | 2020-11-24 | 2021-02-05 | 苏州思必驰信息科技有限公司 | Multi-task speech recognition model training method and multi-task speech recognition method |
CN112506667A (en) * | 2020-12-22 | 2021-03-16 | 北京航空航天大学杭州创新研究院 | Deep neural network training method based on multi-task optimization |
CN112992119A (en) * | 2021-01-14 | 2021-06-18 | 安徽大学 | Deep neural network-based accent classification method and model thereof |
CN112992157A (en) * | 2021-02-08 | 2021-06-18 | 贵州师范大学 | Neural network noisy line identification method based on residual error and batch normalization |
CN113823271A (en) * | 2020-12-18 | 2021-12-21 | 京东科技控股股份有限公司 | Training method and device of voice classification model, computer equipment and storage medium |
CN114882884A (en) * | 2022-07-06 | 2022-08-09 | 深圳比特微电子科技有限公司 | Multitask implementation method and device based on deep learning model |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1300831A1 (en) * | 2001-10-05 | 2003-04-09 | Sony International (Europe) GmbH | Method for detecting emotions involving subspace specialists |
US20160027452A1 (en) * | 2014-07-28 | 2016-01-28 | Sony Computer Entertainment Inc. | Emotional speech processing |
CN106847309A (en) * | 2017-01-09 | 2017-06-13 | 华南理工大学 | A kind of speech-emotion recognition method |
CN106875007A (en) * | 2017-01-25 | 2017-06-20 | 上海交通大学 | End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection |
CN106952649A (en) * | 2017-05-14 | 2017-07-14 | 北京工业大学 | Method for distinguishing speek person based on convolutional neural networks and spectrogram |
2017-09-07: Application CN201710801016.6A filed; granted as patent CN107578775B (en); status: not active, Expired - Fee Related
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109754357A (en) * | 2018-01-26 | 2019-05-14 | 京东方科技集团股份有限公司 | Image processing method, processing unit and processing equipment |
CN109754357B (en) * | 2018-01-26 | 2021-09-21 | 京东方科技集团股份有限公司 | Image processing method, processing device and processing equipment |
CN110503968A (en) * | 2018-05-18 | 2019-11-26 | 北京搜狗科技发展有限公司 | An audio processing method, device, equipment and readable storage medium
CN110503968B (en) * | 2018-05-18 | 2024-06-04 | 北京搜狗科技发展有限公司 | Audio processing method, device, equipment and readable storage medium |
CN109243424A (en) * | 2018-08-28 | 2019-01-18 | 合肥星空物联信息科技有限公司 | A one-key voice translation terminal and translation method
CN109490822A (en) * | 2018-10-16 | 2019-03-19 | 南京信息工程大学 | Voice DOA estimation method based on ResNet |
CN109523993A (en) * | 2018-11-02 | 2019-03-26 | 成都三零凯天通信实业有限公司 | A speech language classification method fusing a CNN-based deep neural network with a GRU
CN109523994A (en) * | 2018-11-13 | 2019-03-26 | 四川大学 | A multi-task speech classification method based on capsule neural networks
CN109493881A (en) * | 2018-11-22 | 2019-03-19 | 北京奇虎科技有限公司 | An audio labeling method, device and computing equipment
CN109493881B (en) * | 2018-11-22 | 2023-12-05 | 北京奇虎科技有限公司 | Method and device for labeling audio and computing equipment |
CN111354372A (en) * | 2018-12-21 | 2020-06-30 | 中国科学院声学研究所 | Audio scene classification method and system based on front-end and back-end joint training |
CN109684995A (en) * | 2018-12-22 | 2019-04-26 | 中国人民解放军战略支援部队信息工程大学 | Specific emitter identification method and device based on a deep residual network
CN109754822A (en) * | 2019-01-22 | 2019-05-14 | 平安科技(深圳)有限公司 | Method and apparatus for establishing an Alzheimer's disease detection model
CN109919047A (en) * | 2019-02-18 | 2019-06-21 | 山东科技大学 | An emotion detection method based on a multi-task, multi-label residual neural network
CN110189769A (en) * | 2019-05-23 | 2019-08-30 | 复钧智能科技(苏州)有限公司 | Abnormal sound detection method based on the coupling of multiple convolutional neural network models
CN110532424A (en) * | 2019-09-26 | 2019-12-03 | 西南科技大学 | A lung sound feature classification system and method based on deep learning and a cloud platform
CN110992987B (en) * | 2019-10-23 | 2022-05-06 | 大连东软信息学院 | Parallel feature extraction system and method for general specific voice in voice signal |
CN110992987A (en) * | 2019-10-23 | 2020-04-10 | 大连东软信息学院 | Parallel feature extraction system and method for general specific voice in voice signal |
CN110808069A (en) * | 2019-11-11 | 2020-02-18 | 上海瑞美锦鑫健康管理有限公司 | Evaluation system and method for singing songs |
CN111128131A (en) * | 2019-12-17 | 2020-05-08 | 北京声智科技有限公司 | Voice recognition method and device, electronic equipment and computer readable storage medium |
CN111128131B (en) * | 2019-12-17 | 2022-07-01 | 北京声智科技有限公司 | Voice recognition method and device, electronic equipment and computer readable storage medium |
CN111429947A (en) * | 2020-03-26 | 2020-07-17 | 重庆邮电大学 | Speech emotion recognition method based on multi-stage residual convolutional neural network |
CN111429947B (en) * | 2020-03-26 | 2022-06-10 | 重庆邮电大学 | Speech emotion recognition method based on multi-stage residual convolutional neural network |
CN111460157A (en) * | 2020-04-01 | 2020-07-28 | 哈尔滨理工大学 | Cyclic convolution multitask learning method for multi-field text classification |
CN111460157B (en) * | 2020-04-01 | 2023-03-28 | 哈尔滨理工大学 | Cyclic convolution multitask learning method for multi-field text classification |
CN111933179A (en) * | 2020-06-04 | 2020-11-13 | 华南师范大学 | Environmental sound identification method and device based on hybrid multi-task learning |
CN111833856A (en) * | 2020-07-15 | 2020-10-27 | 厦门熙重电子科技有限公司 | Voice key information calibration method based on deep learning |
CN111833856B (en) * | 2020-07-15 | 2023-10-24 | 厦门熙重电子科技有限公司 | Voice key information calibration method based on deep learning |
CN111599382A (en) * | 2020-07-27 | 2020-08-28 | 深圳市声扬科技有限公司 | Voice analysis method, device, computer equipment and storage medium |
CN111599382B (en) * | 2020-07-27 | 2020-10-27 | 深圳市声扬科技有限公司 | Voice analysis method, device, computer equipment and storage medium |
CN112331187A (en) * | 2020-11-24 | 2021-02-05 | 苏州思必驰信息科技有限公司 | Multi-task speech recognition model training method and multi-task speech recognition method |
CN113823271A (en) * | 2020-12-18 | 2021-12-21 | 京东科技控股股份有限公司 | Training method and device of voice classification model, computer equipment and storage medium |
CN112506667A (en) * | 2020-12-22 | 2021-03-16 | 北京航空航天大学杭州创新研究院 | Deep neural network training method based on multi-task optimization |
CN112992119A (en) * | 2021-01-14 | 2021-06-18 | 安徽大学 | Deep neural network-based accent classification method and model thereof |
CN112992119B (en) * | 2021-01-14 | 2024-05-03 | 安徽大学 | Accent classification method based on deep neural network and model thereof |
CN112992157A (en) * | 2021-02-08 | 2021-06-18 | 贵州师范大学 | Neural network noisy line identification method based on residual error and batch normalization |
CN114882884A (en) * | 2022-07-06 | 2022-08-09 | 深圳比特微电子科技有限公司 | Multitask implementation method and device based on deep learning model |
Also Published As
Publication number | Publication date |
---|---|
CN107578775B (en) | 2021-02-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107578775A (en) | A kind of multitask method of speech classification based on deep neural network | |
Sun et al. | Speech emotion recognition based on DNN-decision tree SVM model | |
Wang et al. | Speech emotion recognition with dual-sequence LSTM architecture | |
Espi et al. | Exploiting spectro-temporal locality in deep learning based acoustic event detection | |
CN108563653B (en) | Method and system for constructing knowledge acquisition model in knowledge graph | |
CN108717856B (en) | Speech emotion recognition method based on multi-scale deep convolution cyclic neural network | |
CN106503805B (en) | A bimodal sentiment analysis method for human-human dialogue based on machine learning | |
Daneshfar et al. | Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm | |
CN113255755B (en) | Multi-modal emotion classification method based on heterogeneous fusion network | |
CN109460737A (en) | A multi-modal speech emotion recognition method based on an enhanced residual neural network | |
Vrysis et al. | 1D/2D deep CNNs vs. temporal feature integration for general audio classification | |
CN111753549A (en) | Multi-mode emotion feature learning and recognition method based on attention mechanism | |
CN104978587B (en) | An entity recognition cooperative learning algorithm based on document type | |
CN108305616A (en) | An audio scene recognition method and device based on long- and short-term feature extraction | |
CN111126218A (en) | Human behavior recognition method based on zero sample learning | |
CN107220235A (en) | Speech recognition error correction method, device and storage medium based on artificial intelligence | |
CN106570106A (en) | Method and device for converting voice information into expression in input process | |
CN109887484A (en) | A speech recognition and speech synthesis method and device based on dual learning | |
CN108711421A (en) | A speech recognition acoustic model construction method, device and electronic equipment | |
US11735190B2 (en) | Attentive adversarial domain-invariant training | |
CN108170848B (en) | Chinese mobile intelligent customer service-oriented conversation scene classification method | |
Muthusamy et al. | Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals | |
CN108986798B (en) | Method, device and equipment for processing voice data | |
CN107491729A (en) | Handwritten digit recognition method using convolutional neural networks based on cosine similarity activation | |
Vrysis et al. | Extending temporal feature integration for semantic audio analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210212; Termination date: 20210907 |