CN109036467A - CFFD extraction method, speech emotion recognition method and system based on TF-LSTM - Google Patents

CFFD extraction method, speech emotion recognition method and system based on TF-LSTM

Info

Publication number
CN109036467A
CN109036467A
Authority
CN
China
Prior art keywords
layer
lstm
convolutional
dimensions
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811258369.7A
Other languages
Chinese (zh)
Other versions
CN109036467B (en)
Inventor
卫伟
李晓飞
吴聪
柴磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201811258369.7A
Publication of CN109036467A
Application granted
Publication of CN109036467B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - for comparison or discrimination
    • G10L25/63 - for estimating an emotional state
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/18 - the extracted parameters being spectral information of each sub-band
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a CFFD extraction method and a speech emotion recognition method and system based on TF-LSTM. The speech emotion recognition system based on TF-LSTM includes: a CFTD generation module for generating CFTD from the time-domain contextual information of a pre-extracted speech signal; an interactive deep neural network model construction module for constructing the interactive deep neural network model; a CFFD extraction module for feeding the pre-extracted 256x256-dimensional frequency-domain features into the constructed interactive deep neural network model for extraction; and a classifier training module for fusing the two kinds of features, CFTD and CFFD, and training a linear SVM classifier to obtain the final speech emotion recognition result. The present invention fuses two kinds of deep feature information, time-domain features and frequency-domain features, thereby improving the accuracy of speech emotion recognition; it extracts low-level time-domain features with a one-dimensional convolutional neural network and learns speech emotion information through multiple LSTM modules, so that the contextual features of the time-domain emotion information are better captured.

Description

CFFD extraction method, speech emotion recognition method and system based on TF-LSTM
Technical field
The present invention relates to a CFFD extraction method and a speech emotion recognition method and system based on TF-LSTM, and belongs to the technical field of speech emotion recognition.
Background art
" affection computation " this concept was having become a research hotspot in recent years, caused domestic and international many emotions The concern of assayer.Emotion information abundant has been usually contained in the voice signal of speaker, him is helped preferably to transmit letter Breath.The same person with different emotional expressions with a word when, transmitting information it is less identical.In order to make computer preferably Understand the emotion of people, must just improve the accuracy rate of speech emotion recognition.Speech emotion recognition is in artificial customer service, long-distance education, Medicine auxiliary and the field of human-computer interaction such as car steering using more and more extensive.
At present, traditional speech emotion recognition at home and abroad has developed considerably in the introduction of emotion description models, the construction of emotional speech databases, and the analysis of emotional features. The accuracy of speech emotion recognition depends heavily on the extraction of speech emotion features, because traditional speech emotion recognition techniques are built on acoustic emotion features. In recent years deep neural networks have achieved important breakthroughs in the field of speech emotion recognition, and on large-vocabulary continuous speech recognition (LVCSR) tasks they obtain better results than Gaussian mixture model / hidden Markov model (GMM/HMM) systems. Although the convolutional neural network (Convolutional Neural Network, CNN) performs excellently in image recognition, it has not yet obtained good results in speech emotion recognition, and the frequency-domain CFFD feature extraction methods of the prior art give unsatisfactory results, leading to low speech emotion recognition efficiency. On the other hand, with the development of science and technology, voice information is growing explosively and massive amounts of data must be processed; training a speech emotion recognition system with high efficiency and a high recognition rate has therefore become a practical problem that those skilled in the art need to solve.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the deficiencies of the prior art by providing a CFFD extraction method and a speech emotion recognition method and system based on TF-LSTM. CFFD is extracted by constructing an interactive deep neural network model, which improves recognition efficiency; and the time-domain contextual features (Contextual Features in Time Domain, CFTD) are fused with the frequency-domain contextual features (Contextual Features in Frequency Domain, CFFD), so that the two kinds of deep features complement each other well in speech and the precision of recognition is improved.
To solve the above technical problems, the present invention first provides a CFFD extraction method based on TF-LSTM, comprising the following steps:
constructing an interactive deep neural network model;
feeding the pre-extracted 256x256-dimensional frequency-domain features into the constructed interactive deep neural network model to extract CFFD.
The interactive deep neural network model comprises: one input layer, five convolutional layers, three max-pooling layers, two fully connected layers and one LSTM module.
The first convolutional layer C1 is followed by the first max-pooling layer; the second convolutional layer C2 is followed by the second max-pooling layer; then come the third, fourth and fifth convolutional layers, followed by the third max-pooling layer; after the fifth convolutional layer C5 there are two fully connected layers of 4096 dimensions each.
An LSTM module is spliced after the fully connected layers. The LSTM module has one hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional; the output of the LSTM serves as the output of the network, so the output of the whole network is 4096-dimensional, yielding the CFFD.
In another aspect, the present invention provides a speech emotion recognition method based on TF-LSTM, comprising the following steps:
generating CFTD from the time-domain contextual information of the pre-extracted speech signal;
constructing an interactive deep neural network model;
feeding the pre-extracted 256x256-dimensional frequency-domain features into the constructed interactive deep neural network model to extract CFFD.
The interactive deep neural network model comprises: one input layer, five convolutional layers, three max-pooling layers, two fully connected layers and one LSTM module.
The first convolutional layer C1 is followed by the first max-pooling layer; the second convolutional layer C2 is followed by the second max-pooling layer; then come the third, fourth and fifth convolutional layers, followed by the third max-pooling layer; after the fifth convolutional layer C5 there are two fully connected layers of 4096 dimensions each.
An LSTM module is spliced after the fully connected layers. The LSTM module has one hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional; the output of the LSTM serves as the output of the network, so the output of the whole network is 4096-dimensional, yielding the CFFD.
Finally, the two kinds of features, CFTD and CFFD, are fused and a linear SVM classifier is trained to obtain the final speech emotion recognition result.
In a third aspect, the present invention provides a speech emotion recognition system based on TF-LSTM, comprising:
a CFTD generation module for generating CFTD from the time-domain contextual information of the pre-extracted speech signal;
an interactive deep neural network model construction module for constructing the interactive deep neural network model;
a CFFD extraction module for feeding the pre-extracted 256x256-dimensional frequency-domain features into the constructed interactive deep neural network model for extraction;
and a classifier training module for fusing the two kinds of features, CFTD and CFFD, and training a linear SVM classifier to obtain the final speech emotion recognition result.
Advantageous effects of the invention:
1) the present invention fuses two kinds of deep feature information, time-domain features and frequency-domain features, thereby improving the accuracy of speech emotion recognition;
2) the present invention extracts low-level time-domain features with a one-dimensional convolutional neural network and learns speech emotion information through multiple LSTM modules, so that the contextual features of the time-domain emotion information are better captured;
3) the present invention designs a hybrid deep learning network structure composed of a 2-D convolutional neural network and an LSTM, which extracts speech emotion information from the contextual information of the frequency domain.
Description of the drawings
Fig. 1 is a flow chart of the speech emotion recognition method according to a specific embodiment of the invention.
Specific embodiments
The invention is further described below in conjunction with the accompanying drawings. The following embodiments are only used to illustrate the technical solution of the present invention more clearly, and are not intended to limit the protection scope of the present invention.
Fig. 1 is the overall flow chart of the method of the present invention. The speech emotion recognition method based on TF-LSTM includes three major steps: feature extraction, feature fusion and SVM classification.
The detailed process is as follows:
Step A: extract the time-domain contextual information of the speech emotion signal and generate the time-domain contextual features (Contextual Features in Time Domain, CFTD).
In a particular embodiment, a preferred detailed CFTD extraction process is as follows:
A.1, the raw speech signals of the Berlin Speech Emotion Database (EMO-DB) are input into a one-dimensional convolutional neural network and preprocessed, and low-level features are extracted;
A.2, the structure of the one-dimensional preprocessing convolutional neural network is: one input, 13 convolutional layers, 4 max-pooling layers and 2 fully connected layers, where the input is the EMO-DB speech signal;
A.3, the first 19 layers are: 2 convolutional layers, 1 pooling layer, 2 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers and 2 fully connected layers. The kernel size of all convolutional layers is 3 × 3 and the convolution stride is 1; the pooling layers use a 2 × 2 kernel with stride 2; the input of each convolutional layer is spatially padded so that the resolution remains unchanged after convolution; and the output feature dimension of the last fully connected layer is 3072;
A.4, two LSTM modules are spliced after the 13th and the 17th convolutional layer respectively. Each LSTM module has one hidden layer whose input is 512-dimensional and whose output is also 512-dimensional. The outputs of the two LSTMs are directly concatenated with the output of the last fully connected layer to form the output of the network; the output of the whole network is 4096-dimensional (3072 + 512 + 512), giving the CFTD;
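For illustration only, the following PyTorch sketch renders the network of steps A.2-A.4 under stated assumptions: the patent fixes the layer counts, the size-3 kernels with stride 1 and resolution-preserving padding, the 2/2 pooling, the 3072-dimensional last fully connected layer and the 512-dimensional LSTM hidden layers, but it does not fix the per-layer channel widths, so VGG-style widths are assumed here; the 3 × 3 kernels of the text are interpreted as size-3 one-dimensional kernels, and the class name CFTDNet and the example input length are hypothetical.

```python
import torch
import torch.nn as nn

def conv3(in_ch, out_ch):
    # size-3 kernel, stride 1, padding 1: the temporal resolution is unchanged
    return nn.Sequential(nn.Conv1d(in_ch, out_ch, 3, stride=1, padding=1), nn.ReLU())

class CFTDNet(nn.Module):                        # hypothetical name
    def __init__(self):
        super().__init__()
        # layers 1-10: conv,conv,pool / conv,conv,pool / conv,conv,conv,pool
        self.front = nn.Sequential(
            conv3(1, 64), conv3(64, 64), nn.MaxPool1d(2, 2),
            conv3(64, 128), conv3(128, 128), nn.MaxPool1d(2, 2),
            conv3(128, 256), conv3(256, 256), conv3(256, 256), nn.MaxPool1d(2, 2))
        # layers 11-13: the 13th network layer is the last conv of this stage
        self.stage4 = nn.Sequential(conv3(256, 512), conv3(512, 512), conv3(512, 512))
        self.pool4 = nn.MaxPool1d(2, 2)          # layer 14
        # layers 15-17: the 17th network layer is the last conv of this stage
        self.stage5 = nn.Sequential(conv3(512, 512), conv3(512, 512), conv3(512, 512))
        # one 512-dim LSTM branch after layer 13 and one after layer 17
        self.lstm13 = nn.LSTM(512, 512, batch_first=True)
        self.lstm17 = nn.LSTM(512, 512, batch_first=True)
        # layers 18-19: two fully connected layers, the last one 3072-dim
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(3072), nn.ReLU(),
                                nn.Linear(3072, 3072))

    def forward(self, x):                        # x: (batch, 1, n_samples)
        h13 = self.stage4(self.front(x))         # (batch, 512, t13)
        h17 = self.stage5(self.pool4(h13))       # (batch, 512, t17)
        _, (z13, _) = self.lstm13(h13.transpose(1, 2))   # channels as LSTM features
        _, (z17, _) = self.lstm17(h17.transpose(1, 2))
        # concatenate the 3072-dim FC output with the two 512-dim LSTM states
        return torch.cat([self.fc(h17), z13[-1], z17[-1]], dim=1)  # 3072+512+512 = 4096

cftd = CFTDNet()(torch.randn(2, 1, 4096))        # -> torch.Size([2, 4096])
```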
Step B: extract the contextual information of the speech emotion signal in the frequency domain and generate the frequency-domain contextual features (Contextual Features in Frequency Domain, CFFD).
If the preprocessing of the speech signal has already been carried out in a particular embodiment, this step can be omitted; if not, the CFFD preprocessing is preferably carried out by the following method, whose detailed process is as follows:
B.1, the EMO-DB signal is used again, with a sampling frequency of 16 kHz;
B.2, the speech signal is divided into frames; to guarantee a smooth transition between frames, the speech signal is framed with overlap, the frame length being 512 samples and the frame shift 256 samples; a Hamming window is applied to obtain the single-frame short-time signal x(n);
B.3, a Fast Fourier Transform (FFT) is applied to each frame to obtain the frequency-domain data X(i, k);
B.4, 65-dimensional frequency-domain features are computed, namely: 1-dim smoothed fundamental frequency, 1-dim voicing probability, 1-dim zero-crossing rate, 14-dim MFCC, 1-dim energy, 28-dim auditory spectrum filtering, 15-dim spectral energies, 1-dim local frequency jitter, 1-dim inter-frame frequency jitter, 1-dim local amplitude shimmer and 1-dim harmonics-to-noise ratio. Introducing these traditional frequency-domain features works better than feeding raw frequency-domain information alone into the neural network;
B.5, the directly obtained FFT result is concatenated with the extracted frequency-domain features, and the dimensions are adjusted to 256x256 to obtain the preprocessed frequency-domain features.
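Below is a minimal numpy sketch of steps B.1-B.5, under assumptions where the patent is silent: the 65 handcrafted descriptors of step B.4 are stubbed with zeros (in practice they would come from a feature toolkit such as openSMILE), the adjustment to 256x256 is done by cropping or zero-padding since the exact resizing rule is not specified, and all function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """B.2: overlapping frames (512-sample frames, 256-sample shift), Hamming-windowed."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def frequency_features(x):
    frames = frame_signal(x)                         # (n_frames, 512)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))   # B.3: per-frame FFT magnitudes
    handcrafted = np.zeros((len(frames), 65))        # B.4: stub for the 65-dim descriptors
    return np.concatenate([spectrum, handcrafted], axis=1)   # B.5: splice FFT + features

def to_input_matrix(feats, size=256):
    """B.5: crop or zero-pad to the fixed 256x256 network input (assumed rule)."""
    out = np.zeros((size, size), dtype=np.float32)
    r, c = min(size, feats.shape[0]), min(size, feats.shape[1])
    out[:r, :c] = feats[:r, :c]
    return out

x = np.random.randn(16000 * 3)                 # stand-in for a 3 s utterance at 16 kHz
X = to_input_matrix(frequency_features(x))     # (256, 256) network input
```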
The detailed process of constructing the interactive deep neural network is as follows:
B.6, the interactive deep neural network model consists of one input, 5 convolutional layers, 3 max-pooling layers, 2 fully connected layers and one LSTM module, where the input is the 256x256 frequency-domain feature matrix obtained in step B.5;
B.7, convolutional layer C1 has 96 (15 × 3) kernels with stride 3 × 1 and is followed by a (3 × 1) max-pooling layer with stride 2; convolutional layer C2 has 256 (9 × 3) kernels with stride 1 and is followed by a (3 × 1) max-pooling layer with stride 1; convolutional layer C3 has 384 (7 × 3) kernels and convolutional layer C4 has 384 (7 × 1) kernels; convolutional layer C5 has 256 (7 × 1) kernels; the two fully connected layers have 4096 dimensions each;
B.8, an LSTM module is spliced after the fully connected layers; the LSTM can learn the contextual features of speech emotion very well. The LSTM module has one hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional.
The pre-extracted 256x256-dimensional frequency-domain features are input into the constructed interactive deep neural network model for extraction; the output of the LSTM in the interactive deep neural network model serves as the output of the network, so the output of the whole network is 4096-dimensional, giving the CFFD.
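The following PyTorch sketch assembles the B.6-B.8 network as literally as the text allows. The paddings, the stride of the third pooling layer, and the way the single 4096-dimensional fully connected output is presented to the LSTM (here as a length-1 sequence) are not fixed by the patent and are assumptions, as is the class name CFFDNet.

```python
import torch
import torch.nn as nn

class CFFDNet(nn.Module):                          # hypothetical name
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, (15, 3), stride=(3, 1)), nn.ReLU(),   # C1
            nn.MaxPool2d((3, 1), stride=2),                        # pool 1
            nn.Conv2d(96, 256, (9, 3), stride=1), nn.ReLU(),       # C2
            nn.MaxPool2d((3, 1), stride=1),                        # pool 2
            nn.Conv2d(256, 384, (7, 3)), nn.ReLU(),                # C3
            nn.Conv2d(384, 384, (7, 1)), nn.ReLU(),                # C4
            nn.Conv2d(384, 256, (7, 1)), nn.ReLU(),                # C5
            nn.MaxPool2d((3, 1), stride=2))                        # pool 3, stride assumed
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(4096), nn.ReLU(),
                                nn.Linear(4096, 4096), nn.ReLU())  # two 4096-dim FC layers
        self.lstm = nn.LSTM(4096, 4096, batch_first=True)          # 4096-dim hidden layer

    def forward(self, x):                   # x: (batch, 1, 256, 256)
        h = self.fc(self.features(x))
        out, _ = self.lstm(h.unsqueeze(1))  # fed as a length-1 sequence (assumption)
        return out.squeeze(1)               # 4096-dim CFFD

cffd = CFFDNet()(torch.randn(2, 1, 256, 256))      # -> torch.Size([2, 4096])
```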
Step C: the two kinds of features, CFTD and CFFD, are fused, and a linear SVM (support vector machine) classifier is trained to obtain the final speech emotion classification and recognition result.
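A minimal scikit-learn sketch of Step C on stand-in data follows: the 4096-dimensional CFTD and CFFD vectors of each utterance are concatenated into 8192-dimensional fused features on which a linear SVM is trained; the random features and the seven-class labels (standing in for the seven EMO-DB emotion categories) are placeholders only.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
cftd = rng.normal(size=(100, 4096))    # stand-in CFTD features, one row per utterance
cffd = rng.normal(size=(100, 4096))    # stand-in CFFD features
labels = rng.integers(0, 7, size=100)  # stand-in labels for the 7 EMO-DB emotions

fused = np.concatenate([cftd, cffd], axis=1)   # feature fusion: 8192-dim vectors
clf = LinearSVC().fit(fused, labels)           # train the linear SVM classifier
predictions = clf.predict(fused)               # speech emotion recognition results
```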
Another specific embodiment is a speech emotion recognition system based on TF-LSTM, comprising:
a CFTD generation module for generating CFTD from the time-domain contextual information of the pre-extracted speech signal, the method it executes corresponding to step A of the previous embodiment;
an interactive deep neural network model construction module for constructing the interactive deep neural network model, the method it executes corresponding to steps B.6-B.8 of the previous embodiment;
a CFFD extraction module for feeding the pre-extracted 256x256-dimensional frequency-domain features into the constructed interactive deep neural network model for extraction, the method it executes corresponding to step B of the previous embodiment;
and a classifier training module for fusing the two kinds of features, CFTD and CFFD, and training a linear SVM classifier to obtain the final speech emotion recognition result, the method it executes corresponding to step C of the previous embodiment.
Preferably, the CFTD generation module includes a one-dimensional convolutional neural network construction module for preprocessing the input raw speech signal, the method it executes corresponding to steps A.2-A.4 of the previous embodiment.
Preferably, the system further includes a frequency-domain feature extraction module for extracting the frequency-domain features of the speech signal and adjusting them to 256x256 dimensions, the method it executes corresponding to steps B.1-B.5 of the previous embodiment.
The method of the present invention targets robot speech emotion analysis and improves the recognition rate and robustness of the speech emotion algorithm.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technology can easily conceive of variations or substitutions within the technical scope disclosed by the invention, and these should all be covered by the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope defined by the claims.

Claims (10)

1. A CFFD extraction method based on TF-LSTM, characterized by comprising the following steps:
constructing an interactive deep neural network model;
feeding the pre-extracted 256x256-dimensional frequency-domain features into the constructed interactive deep neural network model to extract CFFD;
the interactive deep neural network model comprising: one input layer, five convolutional layers, three max-pooling layers, two fully connected layers and one LSTM module;
wherein the first convolutional layer C1 is followed by the first max-pooling layer; the second convolutional layer C2 is followed by the second max-pooling layer; then come the third, fourth and fifth convolutional layers, followed by the third max-pooling layer; and after the fifth convolutional layer C5 there are two fully connected layers of 4096 dimensions each;
and wherein an LSTM module is spliced after the fully connected layers, the LSTM module having one hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional, the output of the LSTM serving as the output of the network, the output of the whole network being 4096-dimensional, yielding the CFFD.
2. The CFFD extraction method based on TF-LSTM according to claim 1, characterized in that:
the first convolutional layer C1 of the interactive deep neural network uses 96 kernels of size 15 × 3 with the stride set to 3 × 1, and is followed by a max-pooling layer of size 3 × 1 with stride 2; the second convolutional layer C2 has 256 kernels of size 9 × 3 with stride 1, and is followed by a max-pooling layer of size 3 × 1 with stride 1;
the third convolutional layer C3 has 384 kernels of size 7 × 3, and C4 has 384 kernels of size 7 × 1;
the last convolutional layer C5 contains 256 kernels of size 7 × 1 and is followed by a max-pooling layer of size 3 × 1; after convolutional layer C5 there are two fully connected layers of 4096 dimensions each.
3. A speech emotion recognition method based on TF-LSTM, characterized by comprising the following steps:
generating CFTD from the time-domain contextual information of the pre-extracted speech signal;
constructing an interactive deep neural network model;
feeding the pre-extracted 256x256-dimensional frequency-domain features into the constructed interactive deep neural network model to extract CFFD;
the interactive deep neural network model comprising: one input layer, five convolutional layers, three max-pooling layers, two fully connected layers and one LSTM module;
wherein the first convolutional layer C1 is followed by the first max-pooling layer; the second convolutional layer C2 is followed by the second max-pooling layer; then come the third, fourth and fifth convolutional layers, followed by the third max-pooling layer; and after the fifth convolutional layer C5 there are two fully connected layers of 4096 dimensions each;
wherein an LSTM module is spliced after the fully connected layers, the LSTM module having one hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional, the output of the LSTM serving as the output of the network, the output of the whole network being 4096-dimensional, yielding the CFFD;
and fusing the two kinds of features, CFTD and CFFD, and training a linear SVM classifier to obtain the final speech emotion recognition result.
4. The speech emotion recognition method based on TF-LSTM according to claim 3, characterized in that extracting the time-domain contextual information of the speech signal and generating the CFTD comprises the following steps:
inputting the raw speech signal into a one-dimensional convolutional neural network for preprocessing;
the structure of the one-dimensional convolutional neural network being one input, 13 convolutional layers, 4 max-pooling layers and 2 fully connected layers, its input being the one-dimensional raw speech signal;
the first 19 layers being: 2 convolutional layers, 1 pooling layer, 2 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers and 2 fully connected layers;
and two LSTM modules being spliced after the 13th and the 17th convolutional layer respectively, each LSTM module having one hidden layer whose input is 512-dimensional and whose output is also 512-dimensional, the outputs of the two LSTMs being directly concatenated with the output of the last fully connected layer as the output of the network, the output of the whole network being 4096-dimensional, giving the CFTD.
5. The speech emotion recognition method based on TF-LSTM according to claim 4, characterized in that the kernel size of all convolutional layers of the one-dimensional convolutional neural network is 3 × 3 and the convolution stride is 1; the pooling layers use a 2 × 2 kernel with stride 2; the input of each convolutional layer is spatially padded so that the resolution remains unchanged after convolution; and the output feature dimension of the last fully connected layer is 3072.
6. The speech emotion recognition method based on TF-LSTM according to claim 3, characterized in that:
the first convolutional layer C1 of the interactive deep neural network uses 96 kernels of size 15 × 3 with the stride set to 3 × 1, and is followed by a max-pooling layer of size 3 × 1 with stride 2; the second convolutional layer C2 has 256 kernels of size 9 × 3 with stride 1, and is followed by a max-pooling layer of size 3 × 1 with stride 1;
the third convolutional layer C3 has 384 kernels of size 7 × 3, and C4 has 384 kernels of size 7 × 1;
the last convolutional layer C5 contains 256 kernels of size 7 × 1 and is followed by a max-pooling layer of size 3 × 1; after convolutional layer C5 there are two fully connected layers of 4096 dimensions each.
7. The speech emotion recognition method based on TF-LSTM according to claim 3, characterized in that the frequency-domain features are extracted as follows:
step B.1), the speech signal is used with a sampling frequency of 16 kHz;
step B.2), the speech signal is divided into frames; to guarantee a smooth transition between frames, the speech signal is framed with overlap, the frame length being 512 samples and the frame shift 256 samples; a Hamming window is applied to obtain the single-frame short-time signal x(n);
step B.3), a Fast Fourier Transform is applied to each frame to obtain the frequency-domain data X(i, k);
step B.4), 65-dimensional frequency-domain features are computed, namely: 1-dim smoothed fundamental frequency, 1-dim voicing probability, 1-dim zero-crossing rate, 14-dim MFCC, 1-dim energy, 28-dim auditory spectrum filtering, 15-dim spectral energies, 1-dim local frequency jitter, 1-dim inter-frame frequency jitter, 1-dim local amplitude shimmer and 1-dim harmonics-to-noise ratio;
step B.5), the features are adjusted to 256x256 dimensions.
8. A speech emotion recognition system based on TF-LSTM, characterized by comprising:
a CFTD generation module for generating CFTD from the time-domain contextual information of the pre-extracted speech signal;
an interactive deep neural network model construction module for constructing the interactive deep neural network model;
a CFFD extraction module for feeding the pre-extracted 256x256-dimensional frequency-domain features into the constructed interactive deep neural network model for extraction;
and a classifier training module for fusing the two kinds of features, CFTD and CFFD, and training a linear SVM classifier to obtain the final speech emotion recognition result.
9. The speech emotion recognition system based on TF-LSTM according to claim 8, characterized in that the CFTD generation module includes a one-dimensional convolutional neural network construction module for preprocessing the input raw speech signal.
10. The speech emotion recognition system based on TF-LSTM according to claim 8, characterized in that it further includes a frequency-domain feature extraction module for extracting the frequency-domain features of the speech signal and adjusting them to 256x256 dimensions.
CN201811258369.7A 2018-10-26 2018-10-26 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system Active CN109036467B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811258369.7A CN109036467B (en) 2018-10-26 2018-10-26 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system

Publications (2)

Publication Number Publication Date
CN109036467A true CN109036467A (en) 2018-12-18
CN109036467B CN109036467B (en) 2021-04-16

Family

ID=64614086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811258369.7A Active CN109036467B (en) 2018-10-26 2018-10-26 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN109036467B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160099010A1 (en) * 2014-10-03 2016-04-07 Google Inc. Convolutional, long short-term memory, fully connected deep neural networks
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
US20180047389A1 (en) * 2016-08-12 2018-02-15 Electronics And Telecommunications Research Institute Apparatus and method for recognizing speech using attention-based context-dependent acoustic model
CN106782602A * 2016-12-01 2017-05-31 南京邮电大学 Speech emotion recognition method based on a long short-term memory network and convolutional neural networks
CN107863111A * 2017-11-17 2018-03-30 合肥工业大学 Interactive speech corpus processing method and device
CN108154879A * 2017-12-26 2018-06-12 广西师范大学 Speaker-independent speech emotion recognition method based on cepstrum-separated signals
CN108597539A * 2018-02-09 2018-09-28 桂林电子科技大学 Speech emotion recognition method based on parameter transfer and spectrograms
CN108447490A * 2018-02-12 2018-08-24 阿里巴巴集团控股有限公司 Voiceprint recognition method and device based on memorability bottleneck features

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANDREA CIMINO et al., "Tandem LSTM-SVM Approach for Sentiment Analysis", Proceedings of the Final Workshop, 7 December 2016 *
SAIKAT BASU et al., "Emotion Recognition from Speech using Convolutional Neural Network with Recurrent Neural Network Architecture", 2017 2nd International Conference on Communication and Electronics Systems (ICCES) *
姚增伟 et al., "Speaker-independent speech emotion recognition algorithm based on convolutional neural networks and long short-term memory neural networks", New Industrialization (《新型工业化》) *
徐聪, "Research on multi-granularity analysis and processing of time-series signals based on convolutional long short-term memory neural networks", China Master's Theses Full-text Database, Medicine & Health Sciences *
金碧程, "Research on speech emotion recognition based on deep learning", China Master's Theses Full-text Database, Information Science & Technology *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020192009A1 (en) * 2019-03-25 2020-10-01 平安科技(深圳)有限公司 Silence detection method based on neural network, and terminal device and medium
WO2020211820A1 (en) * 2019-04-16 2020-10-22 华为技术有限公司 Method and device for speech emotion recognition
US11900959B2 (en) 2019-04-16 2024-02-13 Huawei Technologies Co., Ltd. Speech emotion recognition method and apparatus
CN113853161A (en) * 2019-05-16 2021-12-28 托尼有限责任公司 System and method for identifying and measuring emotional states
CN110222748B (en) * 2019-05-27 2022-12-20 西南交通大学 OFDM radar signal identification method based on 1D-CNN multi-domain feature fusion
CN110222748A * 2019-05-27 2019-09-10 西南交通大学 OFDM radar signal recognition method based on 1D-CNN multi-domain feature fusion
CN110490892A * 2019-07-03 2019-11-22 中山大学 Automatic thyroid ultrasound image nodule localization and recognition method based on US Faster R-CNN
CN112447187A (en) * 2019-09-02 2021-03-05 富士通株式会社 Device and method for recognizing sound event
CN113449569A (en) * 2020-03-27 2021-09-28 威海北洋电气集团股份有限公司 Mechanical signal health state classification method and system based on distributed deep learning
CN113449569B (en) * 2020-03-27 2023-04-25 威海北洋电气集团股份有限公司 Mechanical signal health state classification method and system based on distributed deep learning
CN113314151A (en) * 2021-05-26 2021-08-27 中国工商银行股份有限公司 Voice information processing method and device, electronic equipment and storage medium
CN114387977A (en) * 2021-12-24 2022-04-22 深圳大学 Voice cutting trace positioning method based on double-domain depth features and attention mechanism
CN114387977B (en) * 2021-12-24 2024-06-11 深圳大学 Voice cutting trace positioning method based on double-domain depth feature and attention mechanism
CN114882906A (en) * 2022-06-30 2022-08-09 广州伏羲智能科技有限公司 Novel environmental noise identification method and system

Also Published As

Publication number Publication date
CN109036467B (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN109036467A (en) CFFD extracting method, speech-emotion recognition method and system based on TF-LSTM
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
Wang et al. Speech emotion recognition with dual-sequence LSTM architecture
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN108847249A Sound conversion optimization method and system
CN108777140A Voice conversion method based on VAE under non-parallel corpus training
CN109979429A TTS method and system
CN108806667A Method for synchronous recognition of speech and emotion based on neural networks
CN110675891B (en) Voice separation method and module based on multilayer attention mechanism
Yousaf et al. A Novel Technique for Speech Recognition and Visualization Based Mobile Application to Support Two‐Way Communication between Deaf‐Mute and Normal Peoples
CN106297773A Neural network acoustic model training method
CN108986798B Speech data processing method, device and equipment
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
CN110070855A Speech recognition system and method based on a transfer neural network acoustic model
CN109671423A Non-parallel text voice conversion method under limited training data
CN113539232B Speech synthesis method based on a MOOC speech data set
Chen et al. Speechformer++: A hierarchical efficient framework for paralinguistic speech processing
Xue et al. Cross-modal information fusion for voice spoofing detection
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
CN111090726A NLP-based text customer service interaction method for the electric power industry
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN115249479A (en) BRNN-based power grid dispatching complex speech recognition method, system and terminal
Radha et al. Speech and speaker recognition using raw waveform modeling for adult and children’s speech: A comprehensive review
Kethireddy et al. Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant