CN109712609A - Method for solving imbalance of keyword recognition samples - Google Patents

Method for solving imbalance of keyword recognition samples Download PDF

Info

Publication number
CN109712609A
CN109712609A
Authority
CN
China
Prior art keywords
keyword
voice
training
different
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910014005.2A
Other languages
Chinese (zh)
Other versions
CN109712609B (en)
Inventor
贺前华
汪星
严海康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910014005.2A priority Critical patent/CN109712609B/en
Publication of CN109712609A publication Critical patent/CN109712609A/en
Application granted granted Critical
Publication of CN109712609B publication Critical patent/CN109712609B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for solving sample imbalance in keyword recognition, comprising: 1) changing the fundamental frequency of speech while keeping its semantics unchanged — speech containing keywords is transformed with voice-conversion techniques to obtain multiple speech samples of the same semantic content from speakers of different genders and ages; 2) applying adaptive weighting to the loss function of the neural network model — when weighted cross-entropy is used, the recognition accuracies of keyword and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of the k-th round is adjusted automatically according to their difference; 3) using an adaptive frame count — when a DNN is used as the training model, each keyword is assigned its own detection frame count L_i according to its length. The invention effectively alleviates the poor training results, or outright failure to train, caused by imbalanced or scarce data, while also accelerating training and improving training results to a certain extent.

Description

Method for solving imbalance of keyword recognition samples
Technical field
The present invention relates to the field of continuous-speech keyword recognition, and in particular to a method for solving sample imbalance in keyword recognition.
Background technique
Among the most important technological advances of recent years, speech recognition undoubtedly ranks first, and it also has a very large market in practical applications. Keyword recognition, also commonly called keyword spotting or word verification, has in recent years become a key and highly regarded area of speech recognition research. Put simply, it is the task of detecting sensitive or pre-specified keywords in a speaker's continuous speech.
Because the vocabulary of any language is extremely large, it is practically impossible to design a speech recognition system covering all of it, whereas recognizing a small number of keywords in the speech signal is critical for some applications and has great prospects and market potential. The difference is that continuous speech recognition converts a continuous speech stream into a series of continuous text — the content uttered by the speaker — whereas keyword recognition is more flexible: it is accurate, computationally light, and elastic. Keyword recognition does not require the entire speech stream to be recognized; the speaker may talk freely, and the system only needs to detect certain keywords while ignoring all other words, generally independently of the speaker. Even when the speaker is uncooperative or works in a noisy environment, a keyword spotting system can still perform well, whereas continuous speech recognition places certain demands on the speaker's behavior. Keyword spotting systems therefore have very important application prospects in monitoring.
Information security receives growing attention worldwide, especially the monitoring of the Internet and of communication equipment; driven by counter-terrorism needs, the United States represents the world's highest level in speech monitoring. The task is to detect given keywords — typically sensitive vocabulary that may occur repeatedly — in conversations between two or more people. Because speakers talk at unpredictable times and the volume of recorded data is large, traditional manual monitoring is clearly inadequate, and replacing manual labor with machines is necessary.
Summary of the invention
To overcome the sample-imbalance problem in prior-art neural-network-based speech keyword recognition, the present invention provides a method for solving sample imbalance in keyword recognition.
The present invention adopts the following technical scheme:
A method for solving sample imbalance in keyword recognition, comprising:
1) changing the fundamental frequency of speech while keeping its semantics unchanged: speech containing keywords is transformed using voice-conversion techniques to obtain multiple speech samples of the same semantic content from speakers of different genders and ages;
2) applying adaptive weighting to the loss function of the neural network model: when weighted cross-entropy is used, the recognition accuracies of keyword and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of the k-th round is adjusted automatically according to their difference;
3) using an adaptive frame count: when a DNN is used as the training model, each keyword is assigned its own detection frame count L_i according to its length.
Further, in 1), during the speech enhancement of keyword-bearing speech, the semantic information of the speech is kept unchanged while its diversity — gender, age, and so on — is increased by altering other information in the speech. Gender and age information in speech is mainly related to the fundamental frequency, so the soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies. The average fundamental-frequency range of normal speech is (136, 332) Hz; this range is divided into N segments, the segment containing the current audio's average fundamental frequency is determined, and voice conversion is used to map the speech into the other fundamental-frequency segments, thereby obtaining N speech versions whose average fundamental frequencies all differ. This changes the non-semantic information in the speech — gender, age, and the like — and increases its diversity. SoundTouch is an open-source audio-processing library written in C++; it can change the tempo (Tempo), pitch (Pitch), and playback rate (Playback Rates) of an audio file or a real-time audio stream, and it can also estimate a track's steady beat rate (BPM). Its three effects are mutually independent and can also be combined; they are implemented through a combination of sample-rate conversion and time stretching. SoundTouch operates on PCM (Pulse Code Modulation) data, which is chiefly found in .wav files, so SoundTouch examples all process wav audio; compressed formats such as mp3 have been compressed and must first be decoded to PCM before SoundTouch can process them. When the soundtouch library is used to convert speech H_m of different fundamental frequencies, the track's steady beat rate is estimated, and the tempo and pitch of the audio are adjusted according to it so as to change the fundamental frequency and related information of the speech, generating synonymous speech carrying different gender and age information.
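The fundamental-frequency segmentation described above can be sketched as follows (a minimal illustration, not the patent's implementation — the function names, the clamping rule, and the choice of segment centres as conversion targets are assumptions):

```python
# Divide the average-F0 range (136, 332) Hz into N equal segments,
# find which segment a clip's mean F0 falls in, and return the centre
# frequency of every other segment as a voice-conversion target.
F0_LOW, F0_HIGH = 136.0, 332.0  # average F0 range of normal speech (Hz)

def f0_segments(n):
    """Centre frequency of each of the N equal-width segments."""
    width = (F0_HIGH - F0_LOW) / n
    return [F0_LOW + (i + 0.5) * width for i in range(n)]

def segment_index(mean_f0, n):
    """Index of the segment containing the clip's mean F0 (clamped)."""
    width = (F0_HIGH - F0_LOW) / n
    idx = int((mean_f0 - F0_LOW) / width)
    return max(0, min(n - 1, idx))

def conversion_targets(mean_f0, n):
    """Target centre frequencies for the N-1 other segments."""
    own = segment_index(mean_f0, n)
    return [f for i, f in enumerate(f0_segments(n)) if i != own]

if __name__ == "__main__":
    # a clip with mean F0 of 200 Hz, range split into N = 10 segments
    print(segment_index(200.0, 10))
    print(conversion_targets(200.0, 10))
```

Converting one clip into each target segment yields the N versions of the same utterance with mutually different average fundamental frequencies.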
Further, in 2), the deep neural network model is a standard fully connected feedforward network with k hidden layers of n hidden nodes each, where each node computes a nonlinear function of the weighted sum of the previous layer's outputs. The last layer has a softmax that outputs a posterior estimate for each output label. The hidden layers use the ReLU activation function, which gives better results on the development set while reducing computation. The size of the network also depends on the number of output labels; the labels may be whole keywords or sub-word units of keywords.
Assume p_ij is the posterior probability of the i-th label for the j-th frame x_j in the neural network, where i takes values in 0, 1, ..., n-1, n is the total number of labels, and 0 is the non-keyword label. The weights and biases θ of the deep neural network are estimated by maximizing the cross-entropy training criterion on the labeled training data {x_j, i_j}_j.
Optimization is performed with TensorFlow, a software framework supporting distributed computation of deep neural networks across multiple CPUs. Asynchronous stochastic gradient descent with exponential decay of the learning rate is used to maximize training speed. Transfer learning refers to initializing (part of) the network parameters from an existing network rather than training from scratch. Here, speech recognition is performed with a deep neural network whose hidden layers are initialized with a suitable topology; all layers are updated during training. Transfer learning has the potential advantage that the hidden layers, by exploiting a larger amount of data and avoiding poor local optima, can learn better, richer feature representations — which proved to be the case in this experiment.
Adaptive weighting of the neural network model: when weighted cross-entropy (weight cross entropy) is used, an initial weighting coefficient W_c is initialized according to the ratio of keyword to non-keyword samples in the corpus; in each subsequent training round, the recognition accuracies A_1 and A_2 of the keyword and non-keyword material are computed separately, and the weighting coefficient W_k of the k-th training round is adjusted automatically according to their difference C = A_1 - A_2, where α is an adjustment coefficient:
W_k = W_c * (1 + C * α)
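A minimal sketch of this per-round adjustment (the initialisation of W_c from the class-count ratio is one plausible choice, not specified exactly in the text; names are illustrative):

```python
# After each training round the keyword/non-keyword accuracies A1, A2
# are compared and the next round's weighting coefficient is
# W_k = W_c * (1 + C * alpha), with C = A1 - A2.

def initial_weight(n_keyword, n_nonkeyword):
    """Initialise W_c from the class-count ratio (one plausible choice)."""
    return n_nonkeyword / max(n_keyword, 1)

def adjusted_weight(w_c, acc_keyword, acc_nonkeyword, alpha=0.5):
    """W_k = W_c * (1 + C * alpha), C = A1 - A2 (the formula above)."""
    c = acc_keyword - acc_nonkeyword
    return w_c * (1.0 + c * alpha)

if __name__ == "__main__":
    w_c = initial_weight(1000, 20000)  # 20:1 class imbalance
    # keyword accuracy exceeds non-keyword accuracy, so C > 0
    # and the coefficient grows for the next round:
    print(adjusted_weight(w_c, 0.9, 0.6, alpha=0.5))
```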
Further, in 3), a simple but effective method is proposed: the DNN produces frame-level label posteriors, which are combined into a keyword confidence score; whenever the confidence exceeds a predefined threshold, a detection decision can be made. The confidence computation is first described for a single keyword and can easily be modified to detect multiple keywords simultaneously. When a DNN is used as the training model, keywords differ in length and therefore in speech duration, so the minimum detection frame count L_i of each keyword can be determined from the minimum duration of the i-th keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
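A sketch of the per-keyword frame-count computation (the division-and-rounding form of L_i is inferred from the surrounding text, since the original formula image is not reproduced; the durations and names are illustrative):

```python
# L_i = T_i_min / S: shortest training-set duration of keyword i
# divided by the MFCC frame shift, rounded to a whole frame count.

def min_frames(min_duration_s, frame_shift_s=0.010):
    """Minimum detection frame count L_i for one keyword."""
    return int(round(min_duration_s / frame_shift_s))

def per_keyword_frames(min_durations, frame_shift_s=0.010):
    """L_i for every keyword, keyed by keyword."""
    return {kw: min_frames(t, frame_shift_s)
            for kw, t in min_durations.items()}

if __name__ == "__main__":
    # illustrative minimum durations, not measured from AiShell-1
    durations = {"China": 0.38, "company": 0.45,
                 "reporter": 0.52, "investment": 0.41}
    print(per_keyword_frames(durations))
```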
Beneficial effects of the present invention:
Compared with conventional deep-neural-network methods, the present invention adapts well to the data and uses it at a higher rate, effectively alleviating the poor training results, or outright failure to train, caused by imbalanced or insufficient data, while also accelerating training and improving training results to a certain extent.
Description of the drawings
Fig. 1 is a flowchart of the weighted processing of the loss function in the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the examples and drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, the present invention uses the AiShell-1 corpus as the training set and real speech recorded by students of this university as the test set; keyword spotting is performed on four keywords: China, company, reporter, and investment. The present invention provides a method for solving sample imbalance in keyword recognition, as follows:
1) changing the fundamental frequency of speech while keeping its semantics unchanged: speech containing keywords is transformed using voice-conversion techniques to obtain multiple speech samples of the same semantic content from speakers of different genders and ages;
2) applying adaptive weighting to the loss function of the neural network model: when weighted cross-entropy is used, the recognition accuracies of keyword and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of the k-th round is adjusted automatically according to their difference;
3) using an adaptive frame count: when a DNN is used as the training model, each keyword is assigned its own detection frame count L_i according to its length.
Further, in 1), during the speech enhancement of keyword-bearing speech, the semantic information of the speech is kept unchanged while its diversity — gender, age, and so on — is increased by altering other information in the speech. Gender and age information in speech is mainly related to the fundamental frequency, so the soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies. The average fundamental-frequency range of normal speech is (136, 332) Hz; this range is divided into N segments, the segment containing the current audio's fundamental frequency is determined, and voice conversion is used to map the speech into the other fundamental-frequency segments, thereby obtaining N speech versions whose fundamental frequencies all differ. This changes the non-semantic information — gender, age, and the like — and increases the diversity of the speech; in this experiment the range is divided into 10 equal segments.
SoundTouch is an open-source audio-processing library written in C++; it can change the tempo (Tempo), pitch (Pitch), and playback rate (Playback Rates) of an audio file or a real-time audio stream, and it can also estimate a track's steady beat rate (BPM). Its three effects are mutually independent and can also be combined; they are implemented through a combination of sample-rate conversion and time stretching. SoundTouch operates on PCM (Pulse Code Modulation) data, which is chiefly found in wav files, so SoundTouch examples all process wav audio; compressed formats such as mp3 must first be decoded to PCM before SoundTouch can process them. When the soundtouch library is used to convert speech H_m of different fundamental frequencies, the track's steady beat rate is estimated, and the tempo and pitch of the audio are adjusted according to it so as to change the fundamental frequency and related information of the speech, generating synonymous speech carrying different gender and age information.
In the present invention, 10 speech versions with mutually different fundamental frequencies are generated within the normal fundamental-frequency range of speech; the C++ source code of SoundTouch is used and adapted so that speech in the different fundamental-frequency segments can be mass-produced.
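One plausible way to drive such batch generation is via SoundTouch's `soundstretch` tool, which takes a pitch shift in semitones; each target centre frequency is converted with 12·log2(f_target/f_current). This is an illustrative sketch — the file names and helper functions are assumptions, and the exact flags of a given soundstretch build should be checked before relying on it:

```python
import math

def semitone_shift(f_current, f_target):
    """Semitones needed to move the mean F0 from f_current to f_target."""
    return 12.0 * math.log2(f_target / f_current)

def soundstretch_cmds(wav_in, f_current, targets):
    """One (hypothetical) soundstretch invocation per target segment."""
    cmds = []
    for i, f_t in enumerate(targets):
        shift = semitone_shift(f_current, f_t)
        cmds.append(f"soundstretch {wav_in} out_{i}.wav -pitch={shift:+.2f}")
    return cmds

if __name__ == "__main__":
    # shift a 200 Hz clip down and up one octave, as an example
    for cmd in soundstretch_cmds("sample.wav", 200.0, [100.0, 400.0]):
        print(cmd)
```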
Further, in 2), the deep neural network model is a standard fully connected feedforward network with 3 hidden layers of 128 hidden nodes each, where each node computes a nonlinear function of the weighted sum of the previous layer's outputs. The last layer has a softmax that outputs a posterior estimate for each output label. The hidden layers use the ReLU activation function, which gives better results on the development set while reducing computation. The size of the network also depends on the number of output labels; the labels may be whole keywords or sub-word units of keywords.
Assume p_ij is the posterior probability of the i-th label for the j-th frame x_j in the neural network, where i takes values in 0, 1, ..., n-1, n is the total number of labels, and 0 is the non-keyword label. The weights and biases θ of the deep neural network are estimated by maximizing the cross-entropy training criterion on the labeled training data {x_j, i_j}_j.
Optimization is performed with TensorFlow, a software framework supporting distributed computation of deep neural networks across multiple CPUs. Asynchronous stochastic gradient descent with exponential decay of the learning rate is used to maximize training speed. Transfer learning refers to initializing (part of) the network parameters from an existing network rather than training from scratch. The present invention performs speech recognition with a deep neural network whose hidden layers are initialized with a suitable topology; all layers are updated during training. Transfer learning has the potential advantage that the hidden layers, by exploiting a larger amount of data and avoiding poor local optima, can learn better, richer feature representations — which proved to be the case in this experiment. In the network of this embodiment, a DNN with 3 hidden layers of 128 hidden nodes each is used, and the loss function is weighted cross-entropy (weight cross entropy), suited to the positive/negative sample-imbalance problem. The labels are whole keywords; for example, for the keyword "market", the speech frames of the entire word "market" share one label.
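The weighted cross-entropy idea can be sketched in miniature (pure Python with illustrative names; the training described above would apply such a weighted loss to the DNN's softmax outputs in TensorFlow):

```python
import math

def weighted_cross_entropy(posteriors, labels, w_k):
    """Mean weighted cross-entropy over frames.

    posteriors: list of per-frame probability vectors (softmax outputs)
    labels:     per-frame integer label, 0 = non-keyword
    w_k:        weighting coefficient applied to keyword frames
    """
    total = 0.0
    for p, y in zip(posteriors, labels):
        # keyword frames (label > 0) are scaled by w_k, others by 1
        weight = w_k if y > 0 else 1.0
        total += -weight * math.log(p[y])
    return total / len(labels)

if __name__ == "__main__":
    probs = [[0.8, 0.1, 0.1],   # frame with non-keyword label 0
             [0.2, 0.7, 0.1]]   # frame with keyword label 1
    print(weighted_cross_entropy(probs, [0, 1], w_k=2.0))
```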
Adaptive weighting of the neural network model: when weighted cross-entropy (weight cross entropy) is used, an initial weighting coefficient W_c is initialized according to the ratio of keyword to non-keyword samples in the corpus; in each subsequent training round, the recognition accuracies A_1 and A_2 of the keyword and non-keyword material are computed separately, and the weighting coefficient W_k of the k-th training round is adjusted automatically according to their difference C = A_1 - A_2, where α is an adjustment coefficient, taken as 0.5 in this experiment:
W_k = W_c * (1 + C * α)
Further, in 3), a simple but effective method is proposed: the DNN produces frame-level label posteriors, which are combined into a keyword confidence score; whenever the confidence exceeds a predefined threshold, a detection decision can be made. The confidence computation is first described for a single keyword and can easily be modified to detect multiple keywords simultaneously. When a DNN is used as the training model, keywords differ in length and therefore in speech duration, so the minimum detection frame count L_i of each keyword can be determined from the minimum duration of the i-th keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
The present invention was tested with the AiShell-1 corpus as the training set and real speech recorded by students of this university as the test set, with the keywords: China, company, reporter, and investment. For the four keywords, the average recall rate of a plain DNN neural network model without the present invention is 78.6%; after applying the present method for solving sample imbalance in keyword recognition, the average recall rate on the same data set is 81.92%, an improvement of 3.32 percentage points.
The specific experimental results are as follows (keywords: 0 = China, 1 = company, 2 = reporter, 3 = investment):

False-detection rate (%):
                         0      1      2      3   Average
  Plain DNN           7.56   6.92   8.33   8.78      7.89
  Present method      5.31   4.97   6.12   5.16      5.39

Miss rate (%):
                         0      1      2      3   Average
  Plain DNN          21.83  22.32  19.64  21.81     21.40
  Present method     18.84  17.12  16.62  19.75     18.08
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by it; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention is an equivalent replacement and is included within the protection scope of the present invention.

Claims (4)

1. A method for solving sample imbalance in keyword recognition, characterized by comprising:
changing the fundamental frequency of keyword-bearing speech while keeping its semantics unchanged: speech containing keywords is transformed with voice-conversion techniques to obtain multiple speech samples of the same semantic content from speakers of different genders and ages;
applying adaptive weighting to the loss function of the neural network model according to the multiple speech samples: when weighted cross-entropy is used, the recognition accuracies of keyword and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of the k-th round is adjusted automatically according to their difference;
using an adaptive frame count: when a DNN is used as the training model, each keyword is assigned its own detection frame count L_i according to its length.
2. The method according to claim 1, characterized in that the change of fundamental frequency uses the soundtouch library to perform voice conversion on speech H_m of different fundamental frequencies; the average fundamental-frequency range of normal speech is (136, 332) Hz, this range is divided into N segments, the segment containing the current audio's average fundamental frequency is determined, and voice conversion is used to map the speech into the other fundamental-frequency segments, thereby obtaining N speech versions whose average fundamental frequencies all differ.
3. The method according to claim 1, characterized in that adaptive weighting is applied to the loss function of the neural network model: when weighted cross-entropy is used, an initial weighting coefficient W_c is initialized according to the ratio of keyword to non-keyword samples in the corpus; in each subsequent training round the recognition accuracies A_1 and A_2 of the keyword and non-keyword material are computed separately, and the weighting coefficient W_k of the k-th training round is adjusted automatically according to their difference C = A_1 - A_2 by the following formula, where α is an adjustment coefficient:
W_k = W_c * (1 + C * α).
4. The method according to claim 1, characterized in that when a DNN is used as the training model, keywords differ in length and therefore in speech duration, and the minimum detection frame count L_i of a keyword can be determined from the minimum duration of the i-th keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
CN201910014005.2A 2019-01-08 2019-01-08 Method for solving imbalance of keyword recognition samples Active CN109712609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910014005.2A CN109712609B (en) 2019-01-08 2019-01-08 Method for solving imbalance of keyword recognition samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910014005.2A CN109712609B (en) 2019-01-08 2019-01-08 Method for solving imbalance of keyword recognition samples

Publications (2)

Publication Number Publication Date
CN109712609A true CN109712609A (en) 2019-05-03
CN109712609B CN109712609B (en) 2021-03-30

Family

ID=66260895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910014005.2A Active CN109712609B (en) 2019-01-08 2019-01-08 Method for solving imbalance of keyword recognition samples

Country Status (1)

Country Link
CN (1) CN109712609B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060026147A1 (en) * 2004-07-30 2006-02-02 Cone Julian M Adaptive search engine
US9646634B2 (en) * 2014-09-30 2017-05-09 Google Inc. Low-rank hidden input layer for speech recognition neural network
CN108538285A * 2018-03-05 2018-09-14 Tsinghua University Multi-keyword detection method based on a multitask neural network
CN108735202A * 2017-03-13 2018-11-02 Baidu USA LLC Convolutional recurrent neural network for small-footprint keyword spotting


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188200A * 2019-05-27 2019-08-30 Harbin Engineering University Deep microblog sentiment analysis method using social context features
CN110827791B (en) * 2019-09-09 2022-07-01 西北大学 Edge-device-oriented speech recognition-synthesis combined modeling method
CN110827791A (en) * 2019-09-09 2020-02-21 西北大学 Edge-device-oriented speech recognition-synthesis combined modeling method
CN111508475A (en) * 2020-04-16 2020-08-07 五邑大学 Robot awakening voice keyword recognition method and device and storage medium
CN111508475B (en) * 2020-04-16 2022-08-09 五邑大学 Robot awakening voice keyword recognition method and device and storage medium
CN111554273A (en) * 2020-04-28 2020-08-18 华南理工大学 Method for selecting amplified corpora in voice keyword recognition
CN111554273B (en) * 2020-04-28 2023-02-10 华南理工大学 Method for selecting amplified corpora in voice keyword recognition
CN113345426A (en) * 2021-06-02 2021-09-03 云知声智能科技股份有限公司 Voice intention recognition method and device and readable storage medium
CN113345426B (en) * 2021-06-02 2023-02-28 云知声智能科技股份有限公司 Voice intention recognition method and device and readable storage medium
CN113823326A (en) * 2021-08-16 2021-12-21 华南理工大学 Method for using training sample of efficient voice keyword detector
CN113823326B (en) * 2021-08-16 2023-09-19 华南理工大学 Method for using training sample of high-efficiency voice keyword detector
CN114818685A (en) * 2022-04-21 2022-07-29 平安科技(深圳)有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN114818685B (en) * 2022-04-21 2023-06-20 平安科技(深圳)有限公司 Keyword extraction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109712609B (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN109712609A (en) A method of it solving keyword and identifies imbalanced training sets
Zhang Music style classification algorithm based on music feature extraction and deep neural network
Fan et al. Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking
CN109616105A Noisy speech recognition method based on transfer learning
CN110265063B (en) Lie detection method based on fixed duration speech emotion recognition sequence analysis
CN110349597A Speech detection method and device
CN116665669A (en) Voice interaction method and system based on artificial intelligence
CN112559797A (en) Deep learning-based audio multi-label classification method
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
CN112509560A (en) Voice recognition self-adaption method and system based on cache language model
CN116229932A (en) Voice cloning method and system based on cross-domain consistency loss
Wang et al. A study on acoustic modeling for child speech based on multi-task learning
Sertsi et al. Robust voice activity detection based on LSTM recurrent neural networks and modulation spectrum
Nwe et al. Speaker clustering and cluster purification methods for RT07 and RT09 evaluation meeting data
Bastanfard et al. A singing voice separation method from Persian music based on pitch detection methods
WO2018163279A1 (en) Voice processing device, voice processing method and voice processing program
Rabiee et al. Persian accents identification using an adaptive neural network
Arumugam et al. An efficient approach for segmentation, feature extraction and classification of audio signals
CN116189671B (en) Data mining method and system for language teaching
CN111179914B (en) Voice sample screening method based on improved dynamic time warping algorithm
Euler et al. Statistical segmentation and word modeling techniques in isolated word recognition
Li et al. Instructional video content analysis using audio information
Yu Research on multimodal music emotion recognition method based on image sequence
CN114927144A (en) Voice emotion recognition method based on attention mechanism and multi-task learning
Zhou et al. Speech recognition using double data augmentation strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant