CN109036467B - TF-LSTM-based CFFD extraction method, voice emotion recognition method and system - Google Patents


Info

Publication number
CN109036467B
CN109036467B (application CN201811258369.7A)
Authority
CN
China
Prior art keywords
layer
dimensional
convolutional
lstm
dimensions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811258369.7A
Other languages
Chinese (zh)
Other versions
CN109036467A (en)
Inventor
卫伟
李晓飞
吴聪
柴磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201811258369.7A
Publication of CN109036467A
Application granted
Publication of CN109036467B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques for estimating an emotional state
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques where the extracted parameters are spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a TF-LSTM-based CFFD extraction method, a TF-LSTM-based speech emotion recognition method and a TF-LSTM-based speech emotion recognition system. The system comprises a CFTD generation module for generating the CFTD from pre-extracted time-domain context information of a speech signal; a hybrid deep neural network model construction module for constructing a hybrid deep neural network model; a CFFD extraction module for inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract the CFFD; and a classifier training module for fusing the CFTD and CFFD features and training a linear SVM classifier to obtain the final speech emotion recognition result. The method fuses two kinds of deep feature information, time-domain features and frequency-domain features, to improve the accuracy of speech emotion recognition; a one-dimensional convolutional neural network extracts the low-level time-domain features, and several LSTM modules learn the speech emotion information, so that the context features of the time-domain emotion information are captured well.

Description

TF-LSTM-based CFFD extraction method, voice emotion recognition method and system
Technical Field
The invention relates to a TF-LSTM-based CFFD extraction method, a voice emotion recognition method and a system, and belongs to the technical field of voice emotion recognition.
Background
The concept of affective computing has become a research hotspot in recent years and has attracted the attention of many emotion analysis experts at home and abroad. A speaker's voice signal often contains rich emotional information, which helps the speaker convey information more effectively. When the same person expresses the same sentence with different emotions, the information conveyed differs. For a computer to better understand human emotion, the accuracy of speech emotion recognition must be improved. Speech emotion recognition is used increasingly in human-computer interaction fields such as customer service, distance education, medical assistance, and automobile driving.
At present, traditional speech emotion recognition at home and abroad has developed considerably in the introduction of emotion description models, the construction of emotional speech corpora, emotional feature analysis, and related areas. The accuracy of speech emotion recognition depends heavily on the extraction of speech emotion features, because traditional speech emotion recognition technology is built on emotional acoustic features. In recent years deep neural networks have made major breakthroughs in speech processing, achieving better results than Gaussian mixture model / hidden Markov model (GMM/HMM) systems on the large-vocabulary continuous speech recognition (LVCSR) task. Although the convolutional neural network (CNN) excels at image recognition and can also achieve good results in speech emotion recognition, the prior art suffers from low speech emotion recognition efficiency because its methods for extracting the frequency-domain CFFD feature are unsatisfactory. Moreover, as science and technology advance, voice data is growing explosively and massive amounts of data must be processed, so training an efficient speech emotion recognition system with a high recognition rate has become a practical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a TF-LSTM-based CFFD extraction method together with a speech emotion recognition method and system. A hybrid deep neural network model is constructed to extract the CFFD, which improves recognition efficiency, and the time-domain context features (CFTD) are fused with the frequency-domain context features (CFFD); the two deep features complement each other well in speech, which improves recognition accuracy.
In order to solve the technical problems, the invention firstly provides a CFFD extraction method based on TF-LSTM, which comprises the following steps:
constructing a hybrid deep neural network model;
inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract the CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers follow, with the third max pooling layer after them; the fifth convolutional layer C5 is followed by two fully connected layers, each of 4096 dimensions;
and an LSTM module is spliced after the fully connected layers, the LSTM module having one hidden layer whose input and output are both 4096-dimensional; the output of the LSTM serves as the output of the network, the output of the whole network is 4096-dimensional, and the CFFD is obtained.
In another aspect, the present invention provides a speech emotion recognition method based on TF-LSTM, comprising the steps of:
generating a CFTD according to pre-extracted time domain context information of the voice signal;
constructing a hybrid deep neural network model;
inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract the CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers follow, with the third max pooling layer after them; the fifth convolutional layer C5 is followed by two fully connected layers, each of 4096 dimensions;
an LSTM module is spliced after the fully connected layers, the LSTM module having one hidden layer whose input and output are both 4096-dimensional; the output of the LSTM serves as the output of the network, the output of the whole network is 4096-dimensional, and the CFFD is obtained;
and fusing the CFTD and CFFD features, training a linear SVM classifier, and obtaining the final speech emotion recognition result.
In a third aspect, the present invention provides a TF-LSTM based speech emotion recognition system, comprising:
the CFTD generating module is used for generating CFTD according to the pre-extracted time domain context information of the voice signal;
the hybrid deep neural network model construction module is used for constructing a hybrid deep neural network model;
the CFFD extraction module is used for inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract the CFFD;
and a classifier training module, used for fusing the CFTD and CFFD features and training the linear SVM classifier to obtain the final speech emotion recognition result.
The invention achieves the following beneficial effects:
1) the method fuses two kinds of deep feature information, time-domain features and frequency-domain features, to improve the accuracy of speech emotion recognition;
2) the method adopts a one-dimensional convolutional neural network to extract the low-level time-domain features and learns the speech emotion information through several LSTM modules, capturing the context features of the time-domain emotion information well;
3) the invention designs a hybrid deep learning network structure consisting of a 2-dimensional convolutional neural network and an LSTM (long short-term memory) network to extract the context feature information of speech emotion in the frequency domain.
Drawings
FIG. 1 is a flowchart of a speech emotion recognition method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are intended only to illustrate the technical solutions of the invention more clearly and do not limit its scope of protection.
FIG. 1 shows the overall flow of the method. The TF-LSTM-based speech emotion recognition method comprises three stages: feature extraction, feature fusion, and SVM classification.
The detailed process is as follows:
Step A: extract the time-domain context information of the speech emotion signal and generate the time-domain context features (CFTD).
In a specific embodiment, the CFTD is preferably extracted as follows:
A.1, input a native speech signal from the Berlin Speech Emotion Database into a one-dimensional convolutional neural network for preprocessing and extraction of low-level features;
A.2, the structure of the one-dimensional preprocessing convolutional neural network is: one input layer, 13 convolutional layers, 4 max pooling layers and 2 fully connected layers, where the input is an EMO-DB speech signal;
A.3, the 19 layers are, in order: 2 convolutional layers, 1 pooling layer, 2 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers and 2 fully connected layers. The kernel size of all convolutional layers is 3×3 and the convolution stride is 1; the pooling layers use 2×2 kernels with a stride of 2. The convolutional layer inputs are spatially padded so that the resolution remains unchanged after convolution, and the output of the last fully connected layer is 3072-dimensional;
A.4, splice two LSTM modules after the convolutional layers at layers 13 and 17 respectively. Each LSTM module has one hidden layer whose input and output are both 512-dimensional. The outputs of the two LSTMs and the output of the last fully connected layer are concatenated directly as the output of the network, which is 4096-dimensional, namely the CFTD. A minimal sketch of this network is given below.
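For illustration only, the following is a hypothetical PyTorch sketch of the time-domain network of steps A.2 to A.4. The patent fixes the layer counts, the tap points at layers 13 and 17, the 512-dimensional LSTM hidden layers, the 3072-dimensional fully connected output and the 4096-dimensional concatenation; the channel widths, the use of each LSTM's last hidden state, and the fixed-length input are editorial assumptions.

    import torch
    import torch.nn as nn

    class CFTDNet(nn.Module):
        """Sketch of the 1-D CNN with two LSTM taps (steps A.2-A.4)."""
        def __init__(self):
            super().__init__()
            def block(c_in, c_out, n_convs):
                # n_convs conv layers, kernel 3, stride 1, padded to keep resolution
                layers = []
                for i in range(n_convs):
                    layers += [nn.Conv1d(c_in if i == 0 else c_out, c_out,
                                         kernel_size=3, stride=1, padding=1),
                               nn.ReLU(inplace=True)]
                return nn.Sequential(*layers)
            # layers 1-10: conv blocks interleaved with the first 3 pooling layers
            self.stage1 = nn.Sequential(block(1, 64, 2), nn.MaxPool1d(2, 2),
                                        block(64, 128, 2), nn.MaxPool1d(2, 2),
                                        block(128, 256, 3), nn.MaxPool1d(2, 2))
            self.block4 = block(256, 512, 3)   # layers 11-13 (first LSTM tap)
            self.pool4 = nn.MaxPool1d(2, 2)    # layer 14
            self.block5 = block(512, 512, 3)   # layers 15-17 (second LSTM tap)
            # one 512-dim hidden layer per LSTM module
            self.lstm13 = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)
            self.lstm17 = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)
            # layers 18-19: two fully connected layers, 3072-dim output
            self.fc = nn.Sequential(nn.LazyLinear(3072), nn.ReLU(inplace=True),
                                    nn.Linear(3072, 3072))

        def forward(self, x):                      # x: (batch, 1, n_samples)
            h13 = self.block4(self.stage1(x))      # (batch, 512, T1)
            h17 = self.block5(self.pool4(h13))     # (batch, 512, T2)
            _, (s13, _) = self.lstm13(h13.transpose(1, 2))  # last hidden state
            _, (s17, _) = self.lstm17(h17.transpose(1, 2))
            fc_out = self.fc(h17.flatten(1))       # (batch, 3072)
            # 512 + 512 + 3072 = 4096-dimensional CFTD
            return torch.cat([s13[-1], s17[-1], fc_out], dim=1)

    # e.g. CFTDNet()(torch.randn(2, 1, 16000)).shape -> torch.Size([2, 4096])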
Step B: extract the context information of the speech emotion signal in the frequency domain and generate the frequency-domain context features (CFFD).
In this embodiment, if the speech signal has already been preprocessed, this step may be omitted; otherwise the frequency-domain preprocessing for the CFFD is preferably performed as follows:
B.1, resample the EMO-DB signal at a sampling frequency of 16 kHz;
B.2, frame the speech signal with overlapping frames to ensure smooth transitions between frames: the frame length is 512 samples and the frame overlap is 256 samples, and a Hamming window is applied to obtain the short-time signal x(n) of a single frame;
B.3, perform a fast Fourier transform (FFT) on each frame of the signal to obtain the frequency-domain data X(i, k);
B.4, compute 65-dimensional frequency-domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCC, 1-dimensional mean-square energy, 28-dimensional acoustic spectral filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio. Introducing these traditional frequency-domain features works better than simply feeding the raw frequency-domain information into the neural network;
B.5, concatenate the directly obtained FFT result with the extracted frequency-domain features, then adjust the dimensions to 256x256 to obtain the preprocessed frequency-domain features. A minimal sketch of steps B.2 and B.3 follows.
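The framing and FFT front end of steps B.2 and B.3 can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions: the 65 hand-crafted descriptors of step B.4 would normally come from an acoustic feature toolkit and are only stubbed here, and the pad-or-crop used to reach 256x256 in step B.5 is an assumption.

    import numpy as np

    def stft_frames(signal, frame_len=512, hop=256):
        """512-point frames with 256-point overlap, Hamming-windowed, per-frame FFT."""
        window = np.hamming(frame_len)
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                           for i in range(n_frames)])  # short-time signals x(n)
        return np.fft.rfft(frames, axis=1)             # frequency-domain data X(i, k)

    fs = 16000                         # step B.1: 16 kHz sampling rate
    speech = np.random.randn(2 * fs)   # stand-in for one resampled EMO-DB utterance
    X = stft_frames(speech)            # (n_frames, 257) complex spectra
    mag = np.abs(X)
    # Step B.4 would append the 65 hand-crafted descriptors per frame here
    # (pitch, MFCC, jitter, shimmer, HNR, ...). For step B.5, one plausible
    # reading is to pad/crop the stacked feature matrix to 256x256:
    feat = np.zeros((256, 256), dtype=np.float32)
    rows, cols = min(256, mag.shape[0]), min(256, mag.shape[1])
    feat[:rows, :cols] = mag[:rows, :cols]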
The detailed process of constructing the hybrid deep neural network and extracting the CFFD is as follows:
B.6, the hybrid deep neural network model comprises an input layer, 5 convolutional layers, 3 max pooling layers, 2 fully connected layers and an LSTM module, where the input is the 256x256 frequency-domain feature obtained in step B.5;
B.7, convolutional layer C1: 96 convolution kernels of size 15×3 with stride 3×1, followed by a max pooling layer of size 3×1 with stride 2; convolutional layer C2: 256 kernels of size 9×3 with stride 1, followed by a max pooling layer of size 3×1 with stride 1; convolutional layer C3: 384 kernels of size 7×3; convolutional layer C4: 384 kernels of size 7×1; convolutional layer C5: 256 kernels of size 7×1, followed by a max pooling layer of size 3×1; then two fully connected layers of 4096 dimensions each;
B.8, splice an LSTM module after the fully connected layers; the LSTM can learn the context features of speech emotion well. The LSTM module has one hidden layer whose input and output are both 4096-dimensional.
The pre-extracted 256x256-dimensional frequency-domain features are input into the constructed hybrid deep neural network model to extract the CFFD: the output of the LSTM serves as the output of the network, and the 4096-dimensional output of the whole network is the CFFD. A minimal sketch of this model follows.
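A hypothetical PyTorch sketch of the hybrid 2-D CNN + LSTM of steps B.6 to B.8 is given below. Kernel sizes, strides and pooling shapes follow the text; the activation functions, the absence of padding, the flattened size in front of the first fully connected layer (absorbed by nn.LazyLinear), and the treatment of the 4096-dim vector as a length-1 sequence for the LSTM are assumptions.

    import torch
    import torch.nn as nn

    class CFFDNet(nn.Module):
        """Sketch of the hybrid 2-D CNN + LSTM of steps B.6-B.8."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 96, kernel_size=(15, 3), stride=(3, 1)),   # C1
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=(3, 1), stride=2),             # pool 1
                nn.Conv2d(96, 256, kernel_size=(9, 3), stride=1),       # C2
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=(3, 1), stride=1),             # pool 2
                nn.Conv2d(256, 384, kernel_size=(7, 3)), nn.ReLU(inplace=True),  # C3
                nn.Conv2d(384, 384, kernel_size=(7, 1)), nn.ReLU(inplace=True),  # C4
                nn.Conv2d(384, 256, kernel_size=(7, 1)), nn.ReLU(inplace=True),  # C5
                nn.MaxPool2d(kernel_size=(3, 1), stride=2),             # pool 3
            )
            # two fully connected layers, 4096 dimensions each
            self.fc = nn.Sequential(nn.LazyLinear(4096), nn.ReLU(inplace=True),
                                    nn.Linear(4096, 4096))
            # single-hidden-layer LSTM, 4096-dim in and out; its output is the CFFD
            self.lstm = nn.LSTM(input_size=4096, hidden_size=4096, batch_first=True)

        def forward(self, x):                    # x: (batch, 1, 256, 256)
            h = self.fc(self.features(x).flatten(1))
            out, _ = self.lstm(h.unsqueeze(1))   # length-1 sequence through the LSTM
            return out.squeeze(1)                # (batch, 4096) CFFD

    # e.g. CFFDNet()(torch.randn(2, 1, 256, 256)).shape -> torch.Size([2, 4096])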
Step C: fuse the CFTD and CFFD features and train a linear SVM (support vector machine) classifier to obtain the final speech emotion classification result.
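Under the assumption that "fusing" the two features means concatenating the two 4096-dimensional vectors per utterance, which the text does not state explicitly, step C reduces to a few lines of scikit-learn; the arrays below are placeholders for the extracted deep features.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    n = 100
    cftd = np.random.randn(n, 4096)       # stand-in for the time-domain features
    cffd = np.random.randn(n, 4096)       # stand-in for the frequency-domain features
    labels = np.random.randint(0, 7, n)   # EMO-DB covers 7 emotion classes

    fused = np.concatenate([cftd, cffd], axis=1)        # 8192-dim fused feature
    clf = make_pipeline(StandardScaler(), LinearSVC())  # linear SVM classifier
    clf.fit(fused, labels)
    predictions = clf.predict(fused)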
another embodiment is a TF-LSTM based speech emotion recognition system comprising:
a CFTD generation module, configured to generate the CFTD from the pre-extracted time-domain context information of the speech signal; its implementation corresponds to step A of the previous embodiment;
a hybrid deep neural network model construction module for constructing the hybrid deep neural network model; its implementation corresponds to steps B.6 to B.8 of the previous embodiment;
a CFFD extraction module for inputting the pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract the CFFD; its implementation corresponds to step B of the previous embodiment;
and a classifier training module for fusing the CFTD and CFFD features and training a linear SVM classifier to obtain the final speech emotion recognition result; its implementation corresponds to step C of the previous embodiment.
Preferably, the CFTD generation module includes a one-dimensional convolutional neural network construction module for preprocessing the input native speech signal; its implementation corresponds to steps A.2 to A.4 of the previous embodiment.
Preferably, the system further includes a frequency-domain feature extraction module for extracting the frequency-domain features of the speech signal and adjusting them to 256x256 dimensions; its implementation corresponds to steps B.1 to B.5 of the previous embodiment.
With robot speech emotion analysis as the target, the invention improves the recognition rate and robustness of the speech emotion algorithm.
The above description is only an embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention falls within the scope of protection of the present invention, which should therefore be determined by the claims.

Claims (8)

1. The CFFD extraction method based on the TF-LSTM is characterized by comprising the following steps:
constructing a hybrid deep neural network model;
inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract the CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers follow, with the third max pooling layer after them; the fifth convolutional layer C5 is followed by two fully connected layers, each of 4096 dimensions;
an LSTM module is spliced after the fully connected layers, the LSTM module having one hidden layer whose input and output are both 4096-dimensional; the output of the LSTM serves as the output of the network, the output of the whole network is 4096-dimensional, and the CFFD is obtained;
the method for extracting the frequency-domain features comprises the following steps:
step B.1), resampling the speech signal at a sampling frequency of 16 kHz;
step B.2), framing the speech signal with overlapping frames to ensure smooth transitions between frames, the frame length being 512 samples and the frame overlap 256 samples, and applying a Hamming window to obtain the short-time signal x(n) of a single frame;
step B.3), performing a fast Fourier transform on each frame of the signal to obtain the frequency-domain data X(i, k);
step B.4), obtaining 65-dimensional frequency-domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCC, 1-dimensional mean-square energy, 28-dimensional acoustic spectral filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio;
and step B.5), adjusting the frequency-domain features to 256x256 dimensions.
2. The TF-LSTM based CFFD extraction method of claim 1,
the first convolutional layer C1 of the hybrid deep neural network employs 96 convolution kernels of size 15×3 with the stride set to 3×1, followed by a max pooling layer of size 3×1 with stride 2; the second convolutional layer C2 has 256 convolution kernels of size 9×3 with stride 1; the second convolutional layer C2 is followed by a max pooling layer of size 3×1 with stride 1;
the third convolutional layer C3 has 384 convolution kernels of size 7×3, and C4 has 384 kernels of size 7×1;
the last convolutional layer C5 contains 256 convolution kernels of size 7×1 and is followed by a max pooling layer of size 3×1; convolutional layer C5 is then followed by two fully connected layers, each of 4096 dimensions.
3. The voice emotion recognition method based on TF-LSTM is characterized by comprising the following steps:
generating a CFTD according to pre-extracted time domain context information of the voice signal;
constructing a hybrid deep neural network model;
inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract the CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers follow, with the third max pooling layer after them; the fifth convolutional layer C5 is followed by two fully connected layers, each of 4096 dimensions;
an LSTM module is spliced after the fully connected layers, the LSTM module having one hidden layer whose input and output are both 4096-dimensional; the output of the LSTM serves as the output of the network, the output of the whole network is 4096-dimensional, and the CFFD is obtained;
fusing the CFTD and CFFD features, training a linear SVM classifier, and obtaining the final speech emotion recognition result;
the method for extracting the frequency-domain features comprises the following steps:
step B.1), resampling the speech signal at a sampling frequency of 16 kHz;
step B.2), framing the speech signal with overlapping frames to ensure smooth transitions between frames, the frame length being 512 samples and the frame overlap 256 samples, and applying a Hamming window to obtain the short-time signal x(n) of a single frame;
step B.3), performing a fast Fourier transform on each frame of the signal to obtain the frequency-domain data X(i, k);
step B.4), obtaining 65-dimensional frequency-domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCC, 1-dimensional mean-square energy, 28-dimensional acoustic spectral filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio;
and step B.5), adjusting the frequency-domain features to 256x256 dimensions.
4. The TF-LSTM based speech emotion recognition method of claim 3, wherein extracting the temporal context information of the speech signal and generating CFTD includes the steps of:
inputting a native speech signal into a one-dimensional convolutional neural network for preprocessing;
the structure of the one-dimensional convolutional neural network is one input layer, 13 convolutional layers, 4 max pooling layers and 2 fully connected layers; the input is a one-dimensional native speech signal;
the 19 layers are, in order: 2 convolutional layers, 1 pooling layer, 2 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers and 2 fully connected layers;
and two LSTM modules are spliced after the convolutional layers at layers 13 and 17 respectively, each LSTM module having one hidden layer whose input and output are both 512-dimensional; the outputs of the two LSTMs and the output of the last fully connected layer are concatenated directly as the output of the network, the output of the whole network is 4096-dimensional, and the CFTD is obtained.
5. The TF-LSTM based speech emotion recognition method of claim 4, wherein the kernel size of all convolutional layers of the one-dimensional convolutional neural network is 3×3 and the convolution stride is 1; the pooling layers use 2×2 kernels with a stride of 2; the convolutional layer inputs are spatially padded so that the resolution remains unchanged after convolution, and the output of the last fully connected layer is 3072-dimensional.
6. The TF-LSTM based speech emotion recognition method of claim 3,
the first convolutional layer C1 of the hybrid deep neural network employs 96 convolution kernels of size 15×3 with the stride set to 3×1, followed by a max pooling layer of size 3×1 with stride 2; the second convolutional layer C2 has 256 convolution kernels of size 9×3 with stride 1; the second convolutional layer C2 is followed by a max pooling layer of size 3×1 with stride 1;
the third convolutional layer C3 has 384 convolution kernels of size 7×3, and C4 has 384 kernels of size 7×1;
the last convolutional layer C5 contains 256 convolution kernels of size 7×1 and is followed by a max pooling layer of size 3×1; convolutional layer C5 is then followed by two fully connected layers, each of 4096 dimensions.
7. The voice emotion recognition system based on TF-LSTM, characterized by comprising:
the CFTD generating module is used for generating CFTD according to the pre-extracted time domain context information of the voice signal;
the hybrid deep neural network model construction module is used for constructing a hybrid deep neural network model;
the CFFD extraction module is used for inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract the CFFD;
a classifier training module, used for fusing the CFTD and CFFD features and training a linear SVM classifier to obtain the final speech emotion recognition result;
the system further comprises a frequency-domain feature extraction module for extracting the frequency-domain features of the speech signal and adjusting them to 256x256 dimensions, where the specific method for extracting the frequency-domain features comprises the following steps:
step B.1), resampling the speech signal at a sampling frequency of 16 kHz;
step B.2), framing the speech signal with overlapping frames to ensure smooth transitions between frames, the frame length being 512 samples and the frame overlap 256 samples, and applying a Hamming window to obtain the short-time signal x(n) of a single frame;
step B.3), performing a fast Fourier transform on each frame of the signal to obtain the frequency-domain data X(i, k);
step B.4), obtaining 65-dimensional frequency-domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCC, 1-dimensional mean-square energy, 28-dimensional acoustic spectral filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio;
and step B.5), adjusting the frequency-domain features to 256x256 dimensions.
8. The TF-LSTM based speech emotion recognition system of claim 7, wherein:
the CFTD generating module comprises a one-dimensional convolution neural network constructing module which is used for preprocessing the input native voice signal.
CN201811258369.7A 2018-10-26 2018-10-26 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system Active CN109036467B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811258369.7A CN109036467B (en) 2018-10-26 2018-10-26 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811258369.7A CN109036467B (en) 2018-10-26 2018-10-26 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system

Publications (2)

Publication Number Publication Date
CN109036467A (en) 2018-12-18
CN109036467B (en) 2021-04-16

Family

ID=64614086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811258369.7A Active CN109036467B (en) 2018-10-26 2018-10-26 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN109036467B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110010153A (en) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 A kind of mute detection method neural network based, terminal device and medium
RU2720359C1 (en) 2019-04-16 2020-04-29 Хуавэй Текнолоджиз Ко., Лтд. Method and equipment for recognizing emotions in speech
WO2020229572A1 (en) * 2019-05-16 2020-11-19 Tawny Gmbh System and method for recognising and measuring affective states
CN110222748B (en) * 2019-05-27 2022-12-20 西南交通大学 OFDM radar signal identification method based on 1D-CNN multi-domain feature fusion
CN110490892A (en) * 2019-07-03 2019-11-22 中山大学 A kind of Thyroid ultrasound image tubercle automatic positioning recognition methods based on USFaster R-CNN
CN112447187A (en) * 2019-09-02 2021-03-05 富士通株式会社 Device and method for recognizing sound event
CN113449569B (en) * 2020-03-27 2023-04-25 威海北洋电气集团股份有限公司 Mechanical signal health state classification method and system based on distributed deep learning
CN113314151A (en) * 2021-05-26 2021-08-27 中国工商银行股份有限公司 Voice information processing method and device, electronic equipment and storage medium
CN114387977B (en) * 2021-12-24 2024-06-11 深圳大学 Voice cutting trace positioning method based on double-domain depth feature and attention mechanism
CN114882906A (en) * 2022-06-30 2022-08-09 广州伏羲智能科技有限公司 Novel environmental noise identification method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10783900B2 (en) * 2014-10-03 2020-09-22 Google Llc Convolutional, long short-term memory, fully connected deep neural networks
CN105469065B (en) * 2015-12-07 2019-04-23 中国科学院自动化研究所 A kind of discrete emotion identification method based on recurrent neural network
KR102033411B1 (en) * 2016-08-12 2019-10-17 한국전자통신연구원 Apparatus and Method for Recognizing speech By Using Attention-based Context-Dependent Acoustic Model
CN106782602B (en) * 2016-12-01 2020-03-17 南京邮电大学 Speech emotion recognition method based on deep neural network
CN107863111A (en) * 2017-11-17 2018-03-30 合肥工业大学 The voice language material processing method and processing device of interaction
CN108154879B (en) * 2017-12-26 2021-04-09 广西师范大学 Non-specific human voice emotion recognition method based on cepstrum separation signal
CN108597539B (en) * 2018-02-09 2021-09-03 桂林电子科技大学 Speech emotion recognition method based on parameter migration and spectrogram
CN108447490B (en) * 2018-02-12 2020-08-18 阿里巴巴集团控股有限公司 Voiceprint recognition method and device based on memorability bottleneck characteristics

Also Published As

Publication number Publication date
CN109036467A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN109036467B (en) TF-LSTM-based CFFD extraction method, voice emotion recognition method and system
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
US11488586B1 (en) System for speech recognition text enhancement fusing multi-modal semantic invariance
CN110033758B (en) Voice wake-up implementation method based on small training set optimization decoding network
Vashisht et al. Speech recognition using machine learning
CN108777140A (en) Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
CN112101045B (en) Multi-mode semantic integrity recognition method and device and electronic equipment
CN107945791B (en) Voice recognition method based on deep learning target detection
CN110853656B (en) Audio tampering identification method based on improved neural network
CN116110405B (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
CN114566189A (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Xue et al. Cross-modal information fusion for voice spoofing detection
CN111862956A (en) Data processing method, device, equipment and storage medium
CN114842835A (en) Voice interaction system based on deep learning model
Stanek et al. Algorithms for vowel recognition in fluent speech based on formant positions
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
Li et al. Emotion recognition from speech with StarGAN and Dense‐DCNN
Hu et al. Speech emotion recognition based on attention mcnn combined with gender information
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
Aggarwal et al. Application of genetically optimized neural networks for hindi speech recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant