CN109712609A - Method for solving imbalance of keyword recognition samples - Google Patents
- Publication number: CN109712609A (application CN201910014005.2A)
- Authority: CN (China)
- Prior art keywords: keyword, voice, training, different, keywords
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Electrically Operated Instructional Devices (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for solving sample imbalance in keyword recognition, comprising: 1) changing the fundamental frequency (pitch) of speech while keeping its semantics unchanged: speech containing a keyword is converted using voice-conversion techniques to obtain multiple speech samples of the same semantics from speakers of different genders and ages; 2) applying adaptive weighting to the loss function of the neural network model: when weighted cross entropy is used, the accuracies on keyword material and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of round k is automatically adjusted according to their difference; 3) adaptive frame count: when a DNN is used as the training model, different keywords are given different detection frame counts L_i according to keyword length. The present invention effectively alleviates poor training results, or failure to train at all, caused by imbalanced or insufficient data, while also accelerating training and improving training results to some extent.
Description
Technical field
The present invention relates to the field of continuous-speech keyword recognition, and in particular to a method for solving sample imbalance in keyword recognition.
Background technique
Among the most important technological advances of recent years, speech recognition undoubtedly ranks first, and it also has a very large market in practical applications. Keyword recognition, also commonly called keyword spotting or word verification, has in recent years become a key and highly regarded area of speech recognition research. Put simply, it is the task of recognizing sensitive or given keywords in a speaker's continuous speech.
Because the vocabulary of any language is extremely large, it is almost impossible to design a speech recognition system that covers all words; yet for some applications it is important only to recognize a small number of keywords in the speech signal, and this has great application prospects and a large market. Continuous speech recognition, by contrast, converts a continuous speech stream into a series of continuous text, namely exactly what the speaker said. Keyword recognition is more flexible: it is accurate, computationally light, and fairly elastic. It does not require the entire speech stream to be recognized; the speaker may talk freely, and the system only has to detect certain keywords while ignoring other words, generally independently of the speaker. Even when the speaker is uncooperative or works in a noisy environment, a keyword spotting system can still achieve good results, whereas continuous speech recognition places certain demands on the speaker's attitude. Keyword spotting therefore has very important application prospects in surveillance.
Countries around the world pay increasing attention to information security, especially to the monitoring of the internet and communication devices; driven by counter-terrorism needs, the United States represents the world's highest level in voice surveillance. The task is to recognize given keywords, typically sensitive words that may occur repeatedly, in conversations between two or more people. Because of the randomness of speakers' timing and the sheer amount of recorded data, traditional manual monitoring is clearly inadequate, and replacing manual labor with machines is necessary.
Summary of the invention
In order to overcome the sample-imbalance problem present in neural-network-based voice keyword recognition in the prior art, the present invention provides a method for solving sample imbalance in keyword recognition.
The present invention adopts the following technical scheme:
A method for solving sample imbalance in keyword recognition, comprising:
1) Change the fundamental frequency of the speech while keeping its semantics unchanged: the speech containing a keyword is converted using voice-conversion techniques to obtain multiple speech samples of the same semantics from speakers of different genders and ages;
2) Apply adaptive weighting to the loss function in the neural network model: when weighted cross entropy is used, the accuracies on keyword material and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of round k is automatically adjusted according to their difference;
3) Adaptive frame count: when a DNN is used as the training model, different keywords are given different detection frame counts L_i according to keyword length.
Further, in 1), during speech augmentation of the speech containing keywords, the semantic information of the speech is kept unchanged while other information in the speech, such as gender and age, is changed to increase its diversity. Gender, age and similar information in speech is mainly related to the fundamental frequency. The soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies. The average fundamental frequency of normal speech lies in the range (136, 332) Hz; this range is divided into N segments, the segment containing the current audio's average fundamental frequency is computed, and voice conversion is used to convert the speech into the other fundamental-frequency segments, thereby obtaining N utterances with mutually different average fundamental frequencies. This changes non-semantic information such as gender and age in the speech and increases its diversity.
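The N-way split of the (136, 332) Hz range described above can be sketched as follows; the function names and the equal-width layout of the segments are illustrative assumptions (the description does not fix how the segments are spaced within the range):

```python
def f0_segment(avg_f0, f0_min=136.0, f0_max=332.0, n_segments=10):
    """Index (0-based) of the segment of the normal-voice average F0
    range (136, 332) Hz that a given average fundamental frequency
    falls into; out-of-range values are clamped to the range edges."""
    avg_f0 = min(max(avg_f0, f0_min), f0_max)
    width = (f0_max - f0_min) / n_segments
    return min(int((avg_f0 - f0_min) / width), n_segments - 1)

def segment_center(idx, f0_min=136.0, f0_max=332.0, n_segments=10):
    """Midpoint of segment `idx`, usable as a target average F0 when
    converting an utterance into that segment."""
    width = (f0_max - f0_min) / n_segments
    return f0_min + (idx + 0.5) * width
```

Converting one utterance into every other segment then yields the N same-semantics, different-F0 samples the augmentation aims for.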
SoundTouch is an open-source audio processing library written in C++. It can change the tempo, pitch (Pitch) and playback rate (Playback Rates) of an audio file or a real-time audio stream, and it also supports estimating a track's stable beat rate (BPM rate). Its three effects are mutually independent and can also be used together; they are implemented by combining sample-rate conversion and time stretching. SoundTouch processes PCM (Pulse Code Modulation) data, which is the main format in .wav files, so SoundTouch examples all process wav audio; formats such as mp3 are compressed and must first be converted to PCM before SoundTouch can process them. When the soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies, the track's stable beat rate must be estimated, and the audio's tempo and pitch are adjusted according to it so as to change the fundamental frequency and related information of the voice, thereby generating synonymous speech carrying different gender and age information.
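As a hedged sketch of driving such a conversion through SoundTouch's command-line front end `soundstretch` (whose `-pitch` switch shifts pitch in semitones while leaving tempo, and hence duration and semantics, unchanged): the file names, the 220 Hz source F0 and the segment-midpoint targets are illustrative assumptions, and the commands are only constructed here, not executed.

```python
import math

def pitch_shift_semitones(src_f0, dst_f0):
    """Semitone shift that moves an average F0 of src_f0 to dst_f0."""
    return 12.0 * math.log2(dst_f0 / src_f0)

def soundstretch_cmd(in_wav, out_wav, semitones):
    """Command line for SoundTouch's soundstretch tool; -pitch shifts
    the pitch in semitones without changing the tempo."""
    return ["soundstretch", in_wav, out_wav, f"-pitch={semitones:.2f}"]

# One command per target segment midpoint of a 10-way split of 136-332 Hz.
targets = [136.0 + (i + 0.5) * (332.0 - 136.0) / 10 for i in range(10)]
cmds = [soundstretch_cmd("kw.wav", f"kw_seg{i}.wav",
                         pitch_shift_semitones(220.0, t))
        for i, t in enumerate(targets)]
```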
Further, in 2), the deep neural network model is a standard feed-forward fully connected neural network with k hidden layers of n hidden nodes each; each node computes a nonlinear function of the weighted sum of the previous layer's outputs. The last layer has a softmax which outputs a posterior estimate for each output label. The hidden layers use the ReLU activation function, which gives better results on the development set while reducing computation. The size of the network also depends on the number of output labels; our labels can be entire keywords or sub-word units of keywords.
Assume p_ij is the posterior probability in the neural network of the i-th label for the j-th frame x_j, where i takes values in 0, 1, ..., n-1, n is the total number of labels, and 0 is the non-keyword label. The weights and biases θ of the deep neural network are estimated by maximizing the cross-entropy training criterion over the labeled training data {x_j, i_j}_j.
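The per-frame posterior p_ij and the cross-entropy criterion can be written out in a minimal pure-Python sketch (illustrative only; the actual training runs in TensorFlow):

```python
import math

def softmax(logits):
    """Posterior estimate over the n labels for one frame (label 0 is
    the non-keyword filler label, 1..n-1 the keyword labels)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(frames_logits, frame_labels):
    """Average negative log posterior of the correct label; training
    maximizes the log posterior, i.e. minimizes this quantity."""
    total = 0.0
    for logits, label in zip(frames_logits, frame_labels):
        total -= math.log(softmax(logits)[label])
    return total / len(frame_labels)
```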
Optimization is performed with TensorFlow, a software framework supporting distributed computation of deep neural networks across multiple CPUs, using asynchronous stochastic gradient descent with exponential decay of the learning rate to maximize training speed. Transfer learning refers to initializing (part of) the network parameters from an existing network rather than training from scratch. Here, a deep neural network is used for speech recognition, and the hidden layers of the network are initialized with a suitable topology; all layers are updated during training. Transfer learning has a potential advantage: by exploiting a larger amount of data and avoiding poor local optima, the hidden layers can learn better, richer feature representations. That is the case in this experiment.
Adaptive weighting in the neural network model: when weighted cross entropy (weight cross entropy) is used, the initial weighting coefficient W_c is initialized according to the ratio of keywords to non-keywords in the corpus. In each subsequent training round, the accuracies A_1 and A_2 on the keyword material and non-keyword material are computed separately, and the weighting coefficient W_k for round k is automatically adjusted according to their difference C = A_1 - A_2, where α is the adjustment coefficient:
W_k = W_c * (1 + C * α)
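A minimal sketch of this per-round update; the sample numbers in the usage note below are hypothetical:

```python
def adaptive_weight(w_c, a1, a2, alpha=0.5):
    """Per-round weighting coefficient for the weighted cross entropy:
    W_k = W_c * (1 + C * alpha), where C = A1 - A2 is the gap between
    the keyword-corpus accuracy A1 and the non-keyword accuracy A2."""
    return w_c * (1.0 + (a1 - a2) * alpha)
```

For instance, with a hypothetical initial W_c = 5 (non-keyword frames five times as frequent as keyword frames), accuracies A_1 = 0.6 and A_2 = 0.8 give C = -0.2 and W_k = 4.5 at α = 0.5.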
Further, in 3), a simple but effective method is proposed: the DNN produces frame-level label posteriors, which are combined into a keyword confidence; if the confidence exceeds some predefined threshold, a detection decision can be made. Confidence computation is first described for a single keyword; it can easily be modified to detect multiple keywords simultaneously. When a DNN is used as the training model, keywords differ in length and therefore in speech duration, so the minimum detection frame count L_i of the i-th keyword can be determined from the minimum duration of that keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
Beneficial effects of the present invention:
Compared with conventional deep neural network methods, the present invention adapts well to the data and uses it at a higher rate, effectively alleviating poor training results, or failure to train, caused by imbalanced or insufficient data, while also accelerating training and improving training results to some extent.
Brief description of the drawings
Fig. 1 is a flow chart of the weighting applied to the loss function in the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the examples and drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Figure 1, the present invention uses the AiShell-1 corpus as the training set and real speech recorded by students of this school as the test set; keyword spotting is performed on four keywords: China, company, reporter, investment. The present invention provides a method for solving sample imbalance in keyword recognition, specifically as follows:
1) Change the fundamental frequency of the speech while keeping its semantics unchanged: the speech containing a keyword is converted using voice-conversion techniques to obtain multiple speech samples of the same semantics from speakers of different genders and ages;
2) Apply adaptive weighting to the loss function in the neural network model: when weighted cross entropy is used, the accuracies on keyword material and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of round k is automatically adjusted according to their difference;
3) Adaptive frame count: when a DNN is used as the training model, different keywords are given different detection frame counts L_i according to keyword length.
Further, in 1), during speech augmentation of the speech containing keywords, the semantic information of the speech is kept unchanged while other information in the speech, such as gender and age, is changed to increase its diversity. Such information is mainly related to the fundamental frequency of the speech. The soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies. The average fundamental frequency of normal speech lies in the range (136, 332) Hz; this range is divided into N segments, the segment containing the current audio's fundamental frequency is computed, and voice conversion is used to convert the speech into the other fundamental-frequency segments, thereby obtaining N utterances with mutually different fundamental frequencies. This changes non-semantic information such as gender and age in the speech and increases its diversity. In this experiment the average fundamental frequency range is divided into 10 segments.
SoundTouch is an open-source audio processing library written in C++. It can change the tempo (Tempo), pitch (Pitch) and playback rate (Playback Rates) of an audio file or a real-time audio stream, and it also supports estimating a track's stable beat rate (BPM rate). Its three effects are mutually independent and can also be used together; they are implemented by combining sample-rate conversion and time stretching. SoundTouch processes PCM (Pulse Code Modulation) data, which is the main format in wav files, so SoundTouch examples all process wav audio; formats such as mp3 are compressed and must first be converted to PCM before SoundTouch can process them. When the soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies, the track's stable beat rate must be estimated, and the audio's tempo and pitch are adjusted according to it so as to change the fundamental frequency and related information of the voice, thereby generating synonymous speech carrying different gender and age information.
In the present invention, 10 utterances with mutually different fundamental frequencies are generated within the normal-voice fundamental-frequency range; SoundTouch's C++ source code is used and rewritten so that speech in different fundamental-frequency segments can be mass-produced.
Further, in 2), the deep neural network model is a standard feed-forward fully connected neural network with 3 hidden layers of 128 hidden nodes each; each node computes a nonlinear function of the weighted sum of the previous layer's outputs. The last layer has a softmax which outputs a posterior estimate for each output label. The hidden layers use the ReLU activation function, which gives better results on the development set while reducing computation. The size of the network also depends on the number of output labels; our labels can be entire keywords or sub-word units of keywords.
Assume p_ij is the posterior probability in the neural network of the i-th label for the j-th frame x_j, where i takes values in 0, 1, ..., n-1, n is the total number of labels, and 0 is the non-keyword label. The weights and biases θ of the deep neural network are estimated by maximizing the cross-entropy training criterion over the labeled training data {x_j, i_j}_j.
Optimization is performed with TensorFlow, a software framework supporting distributed computation of deep neural networks across multiple CPUs, using asynchronous stochastic gradient descent with exponential decay of the learning rate to maximize training speed. Transfer learning refers to initializing (part of) the network parameters from an existing network rather than training from scratch. The present invention uses a deep neural network for speech recognition, and the hidden layers of the network are initialized with a suitable topology; all layers are updated during training. Transfer learning has a potential advantage: by exploiting a larger amount of data and avoiding poor local optima, the hidden layers can learn better, richer feature representations. That is the case in this experiment. In the network of this embodiment, we use a DNN with 3 hidden layers and 128 hidden nodes per layer, and the loss function uses weighted cross entropy (weight cross entropy) to suit the problem of imbalanced positive and negative samples. Labels use entire keywords; for example, the speech frames of the entire word "market" form one label.
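A small sketch of this whole-word labeling scheme, assuming label 0 is the filler class as in the description and ids 1-4 cover the embodiment's four keywords (the exact id assignment is an illustrative choice):

```python
# Label inventory for whole-keyword labels: 0 is the non-keyword
# (filler) label; ids 1..4 cover the four keywords of the embodiment.
KEYWORDS = ["China", "company", "reporter", "investment"]
LABELS = {"<filler>": 0, **{kw: i + 1 for i, kw in enumerate(KEYWORDS)}}

def frame_labels(word, n_frames):
    """All frames of one whole-word utterance share the word's label;
    words outside the keyword set fall back to the filler label 0."""
    return [LABELS.get(word, 0)] * n_frames
```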
Adaptive weighting in the neural network model: when weighted cross entropy (weight cross entropy) is used, the initial weighting coefficient W_c is initialized according to the ratio of keywords to non-keywords in the corpus. In each subsequent training round, the accuracies A_1 and A_2 on the keyword material and non-keyword material are computed separately, and the weighting coefficient W_k for round k is automatically adjusted according to their difference C = A_1 - A_2, where α is the adjustment coefficient, taken as 0.5 in this experiment:
W_k = W_c * (1 + C * α)
Further, in 3), a simple but effective method is proposed: the DNN produces frame-level label posteriors, which are combined into a keyword confidence; if the confidence exceeds some predefined threshold, a detection decision can be made. Confidence computation is first described for a single keyword; it can easily be modified to detect multiple keywords simultaneously. When a DNN is used as the training model, keywords differ in length and therefore in speech duration, so the minimum detection frame count L_i of the i-th keyword can be determined from the minimum duration of that keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
The experiments of the present invention use the AiShell-1 corpus as the training set and real speech recorded by students of this school as the test set, with the keywords: China, company, reporter, investment. Without the method of the present invention, the average recall of the four keywords under the DNN neural network model is 78.6%; after applying the present method for solving keyword-recognition sample imbalance, the average recall on the same dataset is 81.92%, an improvement of 3.32 percentage points.
Specific experimental results are as follows (keyword indices: 0: China, 1: company, 2: reporter, 3: investment):

| | 0 | 1 | 2 | 3 | Average false detection rate |
|---|---|---|---|---|---|
| Normal DNN false detection rate (%) | 7.56 | 6.92 | 8.33 | 8.78 | 7.89 |
| Present method false detection rate (%) | 5.31 | 4.97 | 6.12 | 5.16 | 5.39 |

| | 0 | 1 | 2 | 3 | Average miss rate |
|---|---|---|---|---|---|
| Normal DNN miss rate (%) | 21.83 | 22.32 | 19.64 | 21.81 | 21.40 |
| Present method miss rate (%) | 18.84 | 17.12 | 16.62 | 19.75 | 18.08 |
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any other changes, modifications, substitutions, combinations and simplifications made without departing from the spirit and principles of the present invention shall be equivalent substitutions and are included within the protection scope of the present invention.
Claims (4)
1. A method for solving sample imbalance in keyword recognition, characterized by comprising:
changing the fundamental frequency of speech containing a keyword while keeping its semantics unchanged, converting the speech containing the keyword using voice-conversion techniques, and obtaining multiple speech samples of the same semantics from speakers of different genders and ages;
performing adaptive weighting of the loss function in the neural network model according to the multiple speech samples: when weighted cross entropy is used, computing separately in each training round the accuracy on keyword material and non-keyword material, and automatically adjusting the weighting coefficient W_k of round k according to their difference;
adaptive frame count: when a DNN is used as the training model, using different detection frame counts L_i for different keywords according to keyword length.
2. The method according to claim 1, characterized in that the change of fundamental frequency uses the soundtouch library to perform voice conversion on speech H_m of different fundamental frequencies; the average fundamental frequency range of normal speech is (136, 332) Hz, this range is divided into N segments, the segment containing the current audio's average fundamental frequency is computed, and voice conversion is used to convert the speech into the other fundamental-frequency segments, thereby obtaining N utterances with mutually different average fundamental frequencies.
3. The method according to claim 1, characterized in that adaptive weighting of the neural network loss function is performed as follows: when weighted cross entropy is used, the initial weighting coefficient W_c is initialized according to the ratio of keywords to non-keywords in the corpus; in each subsequent training round, the recognition accuracies A_1 and A_2 of keyword material and non-keyword material are computed separately, and the weighting coefficient W_k of round k is automatically adjusted during training according to their difference C = A_1 - A_2 by the following formula, where α is the adjustment coefficient:
W_k = W_c * (1 + C * α).
4. The method according to claim 1, characterized in that when a DNN is used as the training model, keywords differ in length and hence in speech duration; the minimum detection frame count L_i of a keyword can be determined from the minimum duration of the i-th keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910014005.2A CN109712609B (en) | 2019-01-08 | 2019-01-08 | Method for solving imbalance of keyword recognition samples |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109712609A true CN109712609A (en) | 2019-05-03 |
CN109712609B CN109712609B (en) | 2021-03-30 |
Family
ID=66260895
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910014005.2A Active CN109712609B (en) | 2019-01-08 | 2019-01-08 | Method for solving imbalance of keyword recognition samples |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109712609B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188200A (en) * | 2019-05-27 | 2019-08-30 | 哈尔滨工程大学 | A kind of depth microblog emotional analysis method using social context feature |
CN110827791A (en) * | 2019-09-09 | 2020-02-21 | 西北大学 | Edge-device-oriented speech recognition-synthesis combined modeling method |
CN111508475A (en) * | 2020-04-16 | 2020-08-07 | 五邑大学 | Robot awakening voice keyword recognition method and device and storage medium |
CN111554273A (en) * | 2020-04-28 | 2020-08-18 | 华南理工大学 | Method for selecting amplified corpora in voice keyword recognition |
CN113345426A (en) * | 2021-06-02 | 2021-09-03 | 云知声智能科技股份有限公司 | Voice intention recognition method and device and readable storage medium |
CN113823326A (en) * | 2021-08-16 | 2021-12-21 | 华南理工大学 | Method for using training sample of efficient voice keyword detector |
CN114818685A (en) * | 2022-04-21 | 2022-07-29 | 平安科技(深圳)有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060026147A1 (en) * | 2004-07-30 | 2006-02-02 | Cone Julian M | Adaptive search engine |
US9646634B2 (en) * | 2014-09-30 | 2017-05-09 | Google Inc. | Low-rank hidden input layer for speech recognition neural network |
CN108538285A (en) * | 2018-03-05 | 2018-09-14 | 清华大学 | A kind of various keyword detection method based on multitask neural network |
CN108735202A (en) * | 2017-03-13 | 2018-11-02 | 百度(美国)有限责任公司 | Convolution recurrent neural network for small occupancy resource keyword retrieval |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |