CN110148408A - Chinese speech recognition method based on a deep residual network - Google Patents

Chinese speech recognition method based on a deep residual network Download PDF

Info

Publication number
CN110148408A
Authority
CN
China
Prior art keywords
layer
residual
feature parameter
speech recognition
deep residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910458947.XA
Other languages
Chinese (zh)
Inventor
袁三男
刘虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Electric Power
University of Shanghai for Science and Technology
Original Assignee
Shanghai University of Electric Power
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Electric Power
Priority to CN201910458947.XA
Publication of CN110148408A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a Chinese speech recognition method based on a deep residual network. The method comprises the following steps: 1) acquiring raw data containing speech information; 2) extracting MFCC feature parameters from the raw data, and computing the first-order and second-order differences of the MFCC feature parameters; 3) concatenating each frame with its first-order and second-order differences to obtain the final feature parameters, and converting the two-dimensional array of these feature parameters into a three-dimensional array; 4) feeding all the final three-dimensional feature parameters of step 3) into the convolutional neural network, and training the convolutional neural network repeatedly until a satisfactory recognition rate is obtained; 5) testing the trained convolutional neural network model and outputting the recognized text. Compared with the prior art, the present invention has the advantages of faster model training and a higher speech recognition rate.

Description

Chinese speech recognition method based on a deep residual network
Technical field
The present invention relates to the field of speech signal processing and recognition, and more particularly to a Chinese speech recognition method based on a deep residual network.
Background technique
As the most convenient and natural form of communication, speech carries both information and emotional expression. With the progress of speech recognition technology, more and more people wish to communicate with machines by voice, so speech recognition has attracted growing attention. The most widely used structure in speech recognition at present is the long short-term memory (LSTM) network, which can model the long-term temporal correlations of speech and thereby improve recognition accuracy. A bidirectional LSTM network can achieve even better performance, but it suffers from high training complexity and high decoding latency.
Summary of the invention
The object of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a Chinese speech recognition method based on a deep residual network.
The object of the present invention is achieved through the following technical solution:
A Chinese speech recognition method based on a deep residual network comprises the following steps:
Step (1): acquire raw data containing speech information.
Step (2): extract MFCC feature parameters from the raw data, and compute the first-order and second-order differences of the MFCC feature parameters.
Extracting the MFCC feature parameters specifically comprises the following steps (an illustrative sketch follows the list):
21) pre-process the speech by pre-emphasis, framing and windowing;
22) for each short-time analysis window, obtain the corresponding spectrum by FFT;
23) pass the spectrum obtained in step 22) through a Mel filter bank to obtain the Mel spectrum, which converts the linear natural spectrum into a Mel spectrum reflecting human auditory perception;
24) perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are used as the speech features.
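By way of illustration only, steps 21) to 24) can be sketched in Python with the librosa library; the patent prescribes no library, and the file name, sampling rate and frame parameters below are assumptions:

```python
# Illustrative sketch of MFCC extraction (steps 21-24); not the patent's own code.
import librosa
import numpy as np

# Assumed inputs: any speech file and parameters; not fixed by the patent.
signal, sr = librosa.load("speech.wav", sr=16000)

# Pre-emphasis (step 21). Framing/windowing and FFT (step 22), Mel filtering
# (step 23) and the log+DCT cepstral analysis (step 24) are all performed
# inside librosa.feature.mfcc.
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160, win_length=400)
mfcc = mfcc.T  # shape (num_frames, 13): one 13-dim MFCC vector per frame
```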
The first-order difference of the MFCC feature parameters is the difference between two consecutive adjacent frames of the discrete sequence, expressed as:
Y(k) = X(k+1) - X(k)
where k is the frame index, X(k) is the MFCC feature parameter of the k-th frame, and X(k+1) is the MFCC feature parameter of the (k+1)-th frame.
The second-order difference expresses the relationship between the first-order difference of the (k+1)-th frame and that of the k-th frame, expressed as:
Z(k) = Y(k+1) - Y(k) = X(k+2) - 2X(k+1) + X(k)
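A minimal numpy sketch of these two difference formulas (the array shapes are illustrative only):

```python
import numpy as np

def first_difference(X):
    # Y(k) = X(k+1) - X(k); one fewer frame than the input.
    return X[1:] - X[:-1]

def second_difference(X):
    # Z(k) = Y(k+1) - Y(k) = X(k+2) - 2X(k+1) + X(k); two fewer frames.
    return X[2:] - 2 * X[1:-1] + X[:-2]

# X: (num_frames, 13) MFCC matrix; Y and Z are its dynamic features.
X = np.random.randn(500, 13)
Y = first_difference(X)   # shape (499, 13)
Z = second_difference(X)  # shape (498, 13)
```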
Step (3): concatenate the current frame with its first-order and second-order differences to obtain the final feature parameters, and add a channel dimension to the two-dimensional array of these feature parameters to obtain the final three-dimensional feature array.
A residual block comprises two convolutional layers and one dropout layer; the output of the dropout layer is added directly to the input taken after one convolutional layer, giving the final target mapping. The deep residual network consists of several convolutional layers, four residual blocks, two pooling layers, two fully connected layers and a softmax layer. The first fully connected layer has 512 neurons and the second fully connected layer has 1422 neurons. The kernel size of all convolutional layers is 3x3; the first and second convolutional layers and the first residual block have 32 kernels each; the stride of the first pooling layer is 2x2; the third convolutional layer and the second residual block have 64 kernels; the fourth convolutional layer and the third residual block have 128 kernels; the fifth convolutional layer and the fourth residual block have 256 kernels; the stride of the second pooling layer is 1x2; and the last convolutional layer has 512 kernels.
Preferably, the kernel size in the residual block is 3x3, the dropout rate of the dropout layer is set to 0.2, and the dropout layer responds to the input selectively.
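As a non-authoritative sketch, such a residual block could be written in PyTorch as follows; the patent fixes only the 3x3 kernels and the 0.2 dropout rate, so the framework, activation function and channel count are assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutional layers plus one dropout layer; the dropout
    output is added to the input taken after the first convolution."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(p=0.2)  # the "random deactivating layer"
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.conv1(x))       # input after one convolution
        out = self.dropout(self.conv2(h))  # residual branch: conv + dropout
        return self.relu(h + out)          # residual addition -> target mapping

# Assumed usage: a first-stage block with 32 kernels on a (frames, features) map.
block = ResidualBlock(32)
y = block(torch.randn(1, 32, 500, 39))  # (batch, channels, frames, features)
```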
Step (4): feed all the final three-dimensional feature parameters of step (3) into the deep residual network, and train the deep residual network repeatedly until a satisfactory recognition rate is obtained, where the recognition rate is measured by the phoneme error rate of speech recognition.
Preferably, if the trained model reaches a phoneme error rate of 15.42%, the result of model training is judged to have reached the satisfactory recognition rate.
Step (5): test the trained deep residual network model and output the recognized text.
To test the trained model, the speech to be tested undergoes feature extraction by the same method used during training; the extracted feature parameters are fed into the trained model, and the output of the model is the recognized text.
Compared with the prior art, the present invention has the following advantages:
1) The method uses a deep residual network, applying the residual block structure in a convolutional neural network. A convolutional neural network generally comprises convolutional layers, pooling layers and fully connected layers. The input of a convolutional layer is the feature parameters; its kernels slide with a set stride and learn different local features of the feature map, so the more convolutional layers there are, the more features are extracted. The pooling layers mainly compress the feature parameters by taking the average or maximum value of each region, which reduces the feature dimensionality and the number of network nodes in the model. The fully connected layers act as a classifier: they map the learned feature parameters to the sample label space, perform classification and matching, and predict the class of the input signal. Because a convolutional neural network shares weights, it greatly reduces the number of model parameters and speeds up model training, thereby solving the problem of high decoding latency;
2) The present invention applies the residual structure in the convolutional neural network. A plain convolutional neural network directly learns the target mapping from the input data to the output labels, and once the network is deepened its training accuracy may stop rising and even fall. This phenomenon is not caused by overfitting; simply deepening the network makes the network itself hard to train. A residual network instead learns the residual between the target mapping and the original input, and adds this residual to the original input to obtain the final target mapping. This learning mechanism effectively solves the network degradation problem and, while deepening the network, alleviates overfitting and improves the speech recognition rate.
Detailed description of the invention
Fig. 1 is a schematic diagram of the residual block structure of the present invention;
Fig. 2 is the flow diagram of the method of the present invention;
Fig. 3 is the overall flow diagram of MFCC feature extraction;
Fig. 4 is a schematic diagram of the overall structure of the deep residual network.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. Obviously, the described embodiment is only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
The present invention relates to a Chinese speech recognition method based on a deep residual network, comprising the following steps:
Step 1: acquire the raw data containing speech information.
Step 2: extract the MFCC feature parameters from the raw data.
The MFCC (Mel Frequency Cepstral Coefficients) feature parameters, i.e. the Mel-frequency cepstral coefficients of shape (500, 13), are extracted from the speech through a set of Mel filters. MFCC feature extraction mainly comprises the following steps (a sketch of the cepstral-analysis step follows the list):
1) First pre-process the speech by pre-emphasis, framing and windowing, to enhance the performance of the speech signal (signal-to-noise ratio, processing accuracy, etc.).
2) For each short-time analysis window, obtain the corresponding spectrum by FFT, i.e. the spectrum of each time window along the time axis.
3) Pass the above spectrum through a Mel filter bank to obtain the Mel spectrum; this converts the linear natural spectrum into a Mel spectrum that reflects human auditory perception.
4) Perform cepstral analysis on the Mel spectrum (take the logarithm and apply an inverse transform; in practice the inverse transform is realized by the DCT, the discrete cosine transform, and the 2nd to 13th DCT coefficients are taken as the MFCC coefficients) to obtain the Mel-frequency cepstral coefficients (MFCC), which constitute the features of this frame of speech.
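The log-plus-DCT cepstral analysis of step 4) can be sketched explicitly as follows; mel_spectrum is an assumed input array and scipy is an assumed library choice:

```python
import numpy as np
from scipy.fftpack import dct

# Assumed input: (num_frames, n_mels) Mel filter-bank energies.
mel_spectrum = np.abs(np.random.randn(500, 26)) + 1e-10

log_mel = np.log(mel_spectrum)                         # take the logarithm
cepstrum = dct(log_mel, type=2, axis=1, norm='ortho')  # "inverse transform" via DCT
# The text keeps the 2nd-13th coefficients, i.e. indices 1..12 (12 values);
# the (500, 13) shape quoted above keeps 13 coefficients instead.
mfcc = cepstrum[:, 1:13]  # shape (500, 12)
```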
At this point the speech can be described by a series of cepstral vectors, each vector being the MFCC feature vector of one frame. Once the MFCC feature vectors are obtained, a speech classifier can be trained and used for recognition with these cepstral vectors.
However, MFCC is a static feature of speech; to extract the dynamic features, the first-order and second-order differences are computed as well. The first-order difference is the difference between two consecutive adjacent frames of the discrete sequence, defined by the following formula:
Y(k) = X(k+1) - X(k)
where k is the frame index, X(k) is the MFCC feature parameter of the k-th frame, and X(k+1) is the MFCC feature parameter of the (k+1)-th frame.
The second-order difference expresses the relationship between the first-order difference of the (k+1)-th frame and that of the k-th frame, defined by the following formula:
Z(k) = Y(k+1) - Y(k) = X(k+2) - 2X(k+1) + X(k)
Step 3: concatenate each frame with its first-order and second-order differences; the final feature parameters have shape (500, 39). A channel dimension is added to this two-dimensional array, converting it into a three-dimensional array of shape (500, 39, 1).
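A short numpy sketch of this splicing and channel expansion (the input arrays are assumed already padded to a common 500 frames):

```python
import numpy as np

# Assumed inputs, each padded/truncated to 500 frames x 13 coefficients.
mfcc   = np.random.randn(500, 13)
delta1 = np.random.randn(500, 13)
delta2 = np.random.randn(500, 13)

features = np.concatenate([mfcc, delta1, delta2], axis=1)  # shape (500, 39)
features = features[..., np.newaxis]                       # shape (500, 39, 1)
```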
Step 4: feed all the computed feature parameters into the deep residual network and train it repeatedly, reducing the loss of the neural network by backpropagation, until a good recognition rate is obtained.
The residual block consists of two convolutional layers and one dropout layer. The output of the dropout layer is added directly to the input taken after one convolutional layer, giving the final target mapping. The kernel size in the residual block is 3x3, the dropout rate is set to 0.2, and the dropout layer responds to the input selectively, which can improve the learning accuracy.
The deep residual network of the present invention consists of several convolutional layers, four residual blocks, two pooling layers, two fully connected layers and a softmax layer. The first fully connected layer has 512 neurons and the second fully connected layer has 1422 neurons. The kernel size of all convolutional layers is 3x3; the first and second convolutional layers and the first residual block have 32 kernels each; the stride of the first pooling layer is 2x2; the third convolutional layer and the second residual block have 64 kernels; the fourth convolutional layer and the third residual block have 128 kernels; the fifth convolutional layer and the fourth residual block have 256 kernels; the stride of the second pooling layer is 1x2; and the last convolutional layer has 512 kernels. The neural network transforms the input feature sequence (x1, x2, ..., xT), through the convolutional layers, pooling layers, fully connected layers and softmax layer, into an output sequence (y1, y2, ..., yT). CTC (Connectionist Temporal Classification) computes from (y1, y2, ..., yT) the posterior probability p(l1, l2, ..., lm | x1, x2, ..., xT) of the actual sequence. Training the neural network means adjusting its parameters, given the input and the actual phoneme sequence, so that p(l1, l2, ..., lm | x1, x2, ..., xT) over the training set is maximized; CTC decoding then finds the sequence with the maximum posterior probability given the input, i.e. l* = argmax_l p(l | x1, x2, ..., xT). Here l1, l2, ..., lm is the label sequence, T is the number of frames, and m is the number of labels.
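A minimal PyTorch sketch of the CTC objective described above; the tensor shapes and label length are assumptions, and torch.nn.CTCLoss computes -log p(l1, ..., lm | x1, ..., xT):

```python
import torch
import torch.nn as nn

T, num_classes = 500, 1422  # frames; size of the softmax layer per the text

# Stand-in for the network output (y1..yT): log-probabilities of shape (T, N, C).
log_probs = torch.randn(T, 1, num_classes, requires_grad=True).log_softmax(2)

targets = torch.randint(1, num_classes, (1, 30), dtype=torch.long)  # labels l1..lm
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([30])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # training adjusts parameters to maximize p(l | x)

# Greedy CTC decoding sketch: argmax per frame, collapse repeats, drop blanks.
best = log_probs.argmax(2).squeeze(1)
decoded = [int(c) for i, c in enumerate(best)
           if c != 0 and (i == 0 or c != best[i - 1])]
```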
The recognition rate is measured by the phoneme error rate of speech recognition. After many experiments, when the loss of the deep residual network hardly decreases any further, i.e. when the model reaches a phoneme error rate of 15.42%, the training result is judged to have reached the required recognition rate.
This embodiment carries out actual experiments on the THCHS30 Chinese data set. Compared with the traditional BLSTM (bidirectional long short-term memory) framework in speech recognition, training with the method of the present invention converges 3 times faster than the BLSTM network, and the speech recognition rate is improved by 3%.
Step 5: test the trained model. The speech to be tested undergoes feature extraction by the same method used during training; the extracted feature parameters are fed into the trained model, and the output of the model is the recognized text.
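An illustrative test-time sketch (model, extract_features and id2label are hypothetical names, not part of the patent):

```python
# Hypothetical inference sketch: feature extraction must match training.
import torch

def transcribe(wav_path, model, extract_features, id2label):
    feats = extract_features(wav_path)   # (500, 39, 1), as in training
    x = torch.from_numpy(feats).float()
    x = x.permute(2, 0, 1).unsqueeze(0)  # (1, 1, 500, 39) for Conv2d input
    with torch.no_grad():
        log_probs = model(x).log_softmax(-1)  # assumed output (T', 1, classes)
    best = log_probs.argmax(-1).squeeze(1)
    out, prev = [], 0
    for c in best.tolist():              # collapse repeats, drop blank index 0
        if c != 0 and c != prev:
            out.append(id2label[c])
        prev = c
    return "".join(out)
```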
The above description is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A Chinese speech recognition method based on a deep residual network, characterized in that the method comprises the following steps:
1) acquiring raw data containing speech information;
2) extracting MFCC feature parameters from the raw data, and computing the first-order and second-order differences of the MFCC feature parameters;
3) concatenating the current frame with its first-order and second-order differences to obtain the final feature parameters, and adding a channel dimension to the two-dimensional array of these feature parameters to obtain the final three-dimensional feature array;
4) feeding all the final three-dimensional feature parameters of step 3) into the deep residual network, and training the deep residual network repeatedly until a satisfactory recognition rate is obtained;
5) testing the trained deep residual network model and outputting the recognized text.
2. The Chinese speech recognition method based on a deep residual network according to claim 1, characterized in that in step 2), extracting the MFCC feature parameters specifically comprises the following steps:
21) pre-processing the speech by pre-emphasis, framing and windowing;
22) for each short-time analysis window, obtaining the corresponding spectrum by FFT;
23) passing the spectrum obtained in step 22) through a Mel filter bank to obtain the Mel spectrum, which converts the linear natural spectrum into a Mel spectrum reflecting human auditory perception;
24) performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are used as the speech features.
3. The Chinese speech recognition method based on a deep residual network according to claim 2, characterized in that in step 2), the first-order difference of the MFCC feature parameters is the difference between two consecutive adjacent frames of the discrete sequence, expressed as:
Y(k) = X(k+1) - X(k)
where k is the frame index, X(k) is the MFCC feature parameter of the k-th frame, and X(k+1) is the MFCC feature parameter of the (k+1)-th frame.
4. The Chinese speech recognition method based on a deep residual network according to claim 3, characterized in that in step 2), the second-order difference expresses the relationship between the first-order difference of the (k+1)-th frame and that of the k-th frame, expressed as:
Z(k) = Y(k+1) - Y(k) = X(k+2) - 2X(k+1) + X(k).
5. The Chinese speech recognition method based on a deep residual network according to claim 1, characterized in that in step 3), the deep residual network consists of several convolutional layers, four residual blocks, two pooling layers, two fully connected layers and a softmax layer; the first fully connected layer has 512 neurons and the second fully connected layer has 1422 neurons; the kernel size of all convolutional layers is 3x3; the first and second convolutional layers and the first residual block have 32 kernels each; the stride of the first pooling layer is 2x2; the third convolutional layer and the second residual block have 64 kernels; the fourth convolutional layer and the third residual block have 128 kernels; the fifth convolutional layer and the fourth residual block have 256 kernels; the stride of the second pooling layer is 1x2; and the last convolutional layer has 512 kernels.
6. The Chinese speech recognition method based on a deep residual network according to claim 5, characterized in that in step 3), the residual block comprises two convolutional layers and one dropout layer, and the output of the dropout layer is added directly to the input taken after one convolutional layer to obtain the final target mapping.
7. The Chinese speech recognition method based on a deep residual network according to claim 6, characterized in that the kernel size in the residual block is 3x3, the dropout rate of the dropout layer is set to 0.2, and the dropout layer responds to the input selectively.
8. The Chinese speech recognition method based on a deep residual network according to claim 1, characterized in that the recognition rate is measured by the phoneme error rate of speech recognition, and if the trained model reaches a phoneme error rate of 15.42%, the result of model training is judged to have reached the satisfactory recognition rate.
CN201910458947.XA 2019-05-29 2019-05-29 Chinese speech recognition method based on a deep residual network Pending CN110148408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910458947.XA CN110148408A (en) 2019-05-29 2019-05-29 Chinese speech recognition method based on a deep residual network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910458947.XA CN110148408A (en) 2019-05-29 2019-05-29 Chinese speech recognition method based on a deep residual network

Publications (1)

Publication Number Publication Date
CN110148408A (en) 2019-08-20

Family

ID=67592187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910458947.XA Pending CN110148408A (en) Chinese speech recognition method based on a deep residual network

Country Status (1)

Country Link
CN (1) CN110148408A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN108847223A (en) * 2018-06-20 2018-11-20 陕西科技大学 A kind of audio recognition method based on depth residual error neural network
CN109460774A (en) * 2018-09-18 2019-03-12 华中科技大学 A kind of birds recognition methods based on improved convolutional neural networks
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Audio recognition method based on convolutional neural networks
CN109272988A (en) * 2018-09-30 2019-01-25 江南大学 Audio recognition method based on multichannel convolutional neural networks
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIAN GUO: "Depth dropout: efficient training of residual convolutional neural networks", International Conference on Digital Image Computing: Techniques & Applications *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614483A (en) * 2019-09-18 2021-04-06 珠海格力电器股份有限公司 Modeling method based on residual convolutional network, voice recognition method and electronic equipment
CN110909601A (en) * 2019-10-18 2020-03-24 武汉虹识技术有限公司 Beautiful pupil identification method and system based on deep learning
CN110909601B (en) * 2019-10-18 2022-12-09 武汉虹识技术有限公司 Beautiful pupil identification method and system based on deep learning
CN112951277A (en) * 2019-11-26 2021-06-11 新东方教育科技集团有限公司 Method and device for evaluating speech
CN112951277B (en) * 2019-11-26 2023-01-13 新东方教育科技集团有限公司 Method and device for evaluating speech
CN111276125A (en) * 2020-02-11 2020-06-12 华南师范大学 Lightweight speech keyword recognition method facing edge calculation
CN111276125B (en) * 2020-02-11 2023-04-07 华南师范大学 Lightweight speech keyword recognition method facing edge calculation
CN111402901A (en) * 2020-03-27 2020-07-10 广东外语外贸大学 CNN voiceprint recognition method and system based on RGB mapping characteristics of color image
CN111402901B (en) * 2020-03-27 2023-04-18 广东外语外贸大学 CNN voiceprint recognition method and system based on RGB mapping characteristics of color image
CN111401530A (en) * 2020-04-22 2020-07-10 上海依图网络科技有限公司 Recurrent neural network and training method thereof
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof
WO2022237053A1 (en) * 2021-05-11 2022-11-17 Huawei Technologies Co.,Ltd. Methods and systems for computing output of neural network layer
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call

Similar Documents

Publication Publication Date Title
CN110148408A (en) Chinese speech recognition method based on a deep residual network
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN107818164A (en) A kind of intelligent answer method and its system
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
CN106952643A (en) A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering
CN108766419A (en) A kind of abnormal speech detection method based on deep learning
CN109119072A (en) Civil aviaton's land sky call acoustic model construction method based on DNN-HMM
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN109272988A (en) Audio recognition method based on multichannel convolutional neural networks
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN108986798B (en) Processing method, device and the equipment of voice data
CN113539232B (en) Voice synthesis method based on lesson-admiring voice data set
CN113111786B (en) Underwater target identification method based on small sample training diagram convolutional network
CN109192192A (en) A kind of Language Identification, device, translator, medium and equipment
CN109671423A (en) Non-parallel text compressing method under the limited situation of training data
CN107293290A (en) The method and apparatus for setting up Speech acoustics model
CN110473571A (en) Emotion identification method and device based on short video speech
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111341294A (en) Method for converting text into voice with specified style
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190820)