CN109545189A - Spoken-language pronunciation error detection and correction system based on machine learning - Google Patents

Spoken-language pronunciation error detection and correction system based on machine learning

Info

Publication number
CN109545189A
CN109545189A (application CN201811534792.5A)
Authority
CN
China
Prior art keywords
pronunciation
error detection
spoken language
phoneme
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811534792.5A
Other languages
Chinese (zh)
Inventor
吴怡之
董权
张俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
National Dong Hwa University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN201811534792.5A
Publication of CN109545189A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/04 — Segmentation; Word boundary detection
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — ...characterised by the type of extracted parameters
    • G10L25/18 — ...the extracted parameters being spectral information of each sub-band
    • G10L25/24 — ...the extracted parameters being the cepstrum
    • G10L25/27 — ...characterised by the analysis technique
    • G10L25/48 — ...specially adapted for particular use
    • G10L25/51 — ...for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention relates to a machine-learning-based spoken pronunciation error detection and correction system, comprising: a spoken-pronunciation sample collection module for collecting correctly pronounced phonemes and different types of mispronounced phonemes from whole-sentence or whole-paragraph speech; a pronunciation error detection model building module for extracting acoustic features from the collected phonemes and labelling them by type to form the training sample set, from which the error detection model is trained by a machine learning algorithm; and an online error detection and correction module, which uses the trained model to score the learner's whole-sentence or whole-paragraph reading and to perform phoneme-level error detection and pronunciation correction. The invention can evaluate spoken pronunciation online, locate pronunciation errors, and offer correction suggestions.

Description

Spoken-language pronunciation error detection and correction system based on machine learning
Technical field
The present invention relates to the field of online spoken-language learning, and more particularly to a machine-learning-based spoken pronunciation error detection and correction system.
Background technique
During language learning, limited access to qualified teachers and suitable environments, insufficient in-class oral training time, lack of feedback on after-class speaking practice, and the non-standard pronunciation of many spoken-language teachers all make pronunciation a major difficulty for foreign-language learners. Many learners are willing to pay high tuition fees for foreign teachers to correct their pronunciation. The rise of mobile online language learning has therefore spurred the development of automatic pronunciation error detection systems.
Existing pronunciation error detection methods fall roughly into two categories. The first uses phonetic knowledge to find distinctive features; for example, an English learner whose mother tongue is Japanese may replace the pronunciation of "rice" with "lice" and be unable to adjust articulation to correct the error, because the phoneme /r/ does not exist in Japanese. For such typical error types, discriminative acoustic features such as formants can be extracted to detect and diagnose pronunciation errors. The second category identifies errors from the similarity between the speaker's pronunciation of a given text and a standard pronunciation derived from an acoustic model of the speaker's native language, as in the system of Selina Parveen et al. ("Bangla pronunciation error detection system"); its similarity index is based on the confidence score of automatic speech recognition (ASR). In general, the first method can detect the articulatory movement causing the error, e.g. judging front/back and high/low deviations of tongue position from formants, but it is largely limited to finding vowel errors in read-aloud words, and its fault tolerance is low: formant extraction must be accurate, yet noisy environments easily cause extraction errors. The second method cannot locate the articulatory cause of an error; it mainly targets substitutions (misreadings), omissions and insertions of phonemes, and so cannot give targeted pronunciation correction and improvement schemes.
On the application side, only a few products currently address pronunciation problems, and most offer a single function: playing back audio-visual learning material while the student records and repeats. Only a handful of applications provide feedback on spoken pronunciation, and they have two defects. First, the feedback is not enough to solve the learner's root problem: for example, one product's oral-training function can only point out, after the learner repeats a prompt, that the pronunciation is not good enough; the learner cannot find out where the pronunciation is wrong or how to improve it, receives no actionable correction information, and often cannot improve. Second, the detected error types usually focus on typical errors such as omitted, misread and inserted phonemes, lacking any judgment of the articulatory cause of the error or feedback on a correction scheme.
Summary of the invention
The technical problem to be solved by the present invention is to provide a machine-learning-based spoken pronunciation error detection and correction system that can evaluate spoken pronunciation online, locate pronunciation errors, and offer correction suggestions.
The technical solution adopted by the present invention to solve the technical problem is to provide a machine-learning-based spoken pronunciation error detection and correction system, comprising: a spoken-pronunciation sample collection module for collecting correctly pronounced phonemes and different types of mispronounced phonemes from whole-sentence or whole-paragraph speech; a pronunciation error detection model building module for extracting acoustic features from the collected phonemes and labelling them by type to form the training sample set, from which the error detection model is trained by a machine learning algorithm; and an online error detection and correction module, which uses the trained model to score the learner's whole-sentence or whole-paragraph reading and to perform phoneme-level error detection and pronunciation correction.
The spoken-pronunciation sample collection module divides the utterance, according to whether sound is being produced, into voiced segments, unvoiced segments and silent segments. Specifically, the prediction residual energy of the speech signal S(n) over a frame of length N and the first reflection coefficient of the frame are computed, and the frames are segmented by the following rules: if the first reflection coefficient is greater than 0.2 and the prediction residual energy is greater than twice the system threshold θ, the current speech frame is defined as voiced; if the first reflection coefficient is greater than 0.3, the prediction residual energy is greater than θ, and the frame preceding the current frame is a voiced frame, the current frame is also defined as voiced; if neither rule is satisfied, the current frame is defined as silent.
The spoken-pronunciation sample collection module obtains pronunciation phonemes by forced alignment, specifically: the text file is split at punctuation marks; the audio file is converted to mono and passed through endpoint detection; a word-to-phone conversion is applied to the text, and, using a trained acoustic model, the text is expanded into a search space composed of hidden Markov model (HMM) state sequences; acoustic features are extracted from the speech signal and, frame by frame from front to back, aligned against the search space composed of the corresponding HMM state sequence. Each frame is aligned by dynamic-programming Viterbi decoding: Q(t, s) = max_{s'} { p(x_t, s | s') · Q(t−1, s') }, where Q(t, s) is the best score of a path reaching HMM state s of the search space at time t, p(x_t, s | s') is the probability of transitioning to state s and emitting observation x_t given that the previous frame's state is s', x_t is the observation at frame t, and s' is a predecessor state of s. At time t, when some path reaches the active state s_we — the end-state node of the current sentence whose optimal end time τ is to be estimated — the number of path hypotheses on each active state s_i is counted, N(t, s_i) = Σ_k δ(path k is on s_i at time t), where δ(·) is the indicator function, and all path hypotheses are ranked by score. The hypotheses on s_we are counted; letting R_k(t, s_we) be the rank of hypothesis Q_k(t, s_we) among all N(t) hypotheses, the expected rank of the hypotheses on s_we over the N(t) paths is R̄(t, s_we) = (1 / N(t, s_we)) Σ_k R_k(t, s_we). The state activity A(t, s_we) is defined from these quantities, and the time at which A(t, s_we) is maximized is the maximum-likelihood alignment time t; from it the speech-to-text alignment times of the sentence are output. According to the alignment times, the phone tier of the alignment table is read, the start and end times of each phoneme on the phone tier are obtained, and phoneme cutting is performed to obtain the pronunciation phonemes.
In feature extraction, the pronunciation error detection model building module first divides the data into a training set and a test set, then extracts the MFCC features and formant features of each phoneme pronunciation from the phonemes obtained: the original speech signal is processed to obtain the time-domain signal of each speech frame; the time-domain signal is zero-padded to a sequence of length N, and a discrete Fourier transform yields the linear spectrum; the linear spectrum is passed through a Mel-frequency filter bank to obtain the Mel spectrum; the logarithmic energy of the Mel spectrum gives the log spectrum S(m); and a discrete cosine transform of S(m) into the cepstral domain yields the Mel-frequency cepstral coefficients c(n).
During training, the pronunciation error detection model building module distinguishes 7 classes, labelled −1, 1, A, B, C, D and E, where −1 denotes the error type, 1 denotes the correct type, and A, B, C, D, E respectively denote the error subclasses tongue position too far forward, too far back, tongue position too high, too low, and phoneme lengthened or shortened. When extracting the training sets, the samples of each class in turn are taken as one class and all remaining samples as the other, yielding 7 classifiers. When a support vector machine is used as the training classifier algorithm, the phoneme acoustic-feature vectors are divided into training and test sets in a 4:1 ratio, the training set serves as the input vectors of the SVM, and the radial basis function (RBF) kernel is selected.
In modelling, the pronunciation error detection model building module extracts features by unsupervised training of a deep belief network, with a support vector machine as the top layer. A suitable deep-belief-network model is determined from the number and dimensionality of the training samples in the phoneme acoustic-feature data set; the number of hidden-layer nodes is determined by adjusting the model parameters and comparing the optimized outputs, and after the first hidden layer's node count is fixed, the optimal number of hidden layers and nodes is determined step by step by testing, after which the optimal model is obtained by tuning the remaining parameters.
Beneficial effect
By adopting the above technical solution, the present invention has the following advantages and positive effects compared with the prior art: the pronunciation error detection and correction model built by a machine learning algorithm enables rapid detection and diagnosis of phoneme pronunciation errors in whole sentences or paragraphs, filling the current market gap in real-time scoring and error detection feedback correction for online assisted spoken-language learning. It can quickly and effectively judge where in a read-aloud sentence which types of errors occur, and assist in correcting them.
Detailed description of the invention
Fig. 1 is the offline machine-learning classification error detection model training framework of the present invention;
Fig. 2 is the flow chart of online pronunciation error detection and interactive correction;
Fig. 3 is the flow chart of the forced-alignment phoneme separation algorithm.
Specific embodiment
The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention, not to limit its scope. Moreover, after reading the teachings of the present invention, those skilled in the art may make various changes or modifications to it, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.
Embodiments of the present invention relate to a machine-learning-based spoken pronunciation error detection and correction system, comprising: a spoken-pronunciation sample collection module for collecting correctly pronounced phonemes and different types of mispronounced phonemes from whole-sentence or whole-paragraph speech; a pronunciation error detection model building module for extracting acoustic features from the collected phonemes and labelling them by type to form the training sample set, from which the error detection model is trained by a machine learning algorithm; and an online error detection and correction module, which uses the trained model to score the learner's whole-sentence or whole-paragraph reading and to perform phoneme-level error detection and pronunciation correction.
The present invention first obtains atypical pronunciation-error sample data from whole-sentence or whole-paragraph speech and trains an error detection classification model with a machine-learning classification and recognition algorithm, identifying to which class a learner's specific error belongs: a front/back or high/low deviation of tongue position, or a deviation in phoneme duration. For the identified error type, correct interactive feedback and correction are then given, mainly covering adjustment of the learner's mouth shape and tongue position and of the duration of the pronunciation.
Fig. 1 is the offline machine-learning classification error detection model training framework of the present invention, and Fig. 2 is the flow chart of online pronunciation error detection and interactive correction. With reference to Figs. 1 and 2, the steps of the invention are as follows:
One, off-line model training part:
Step 1: obtain whole-sentence standard speech data. Read-aloud recordings of 615 speakers whose mother tongue is English, selected by country/region, gender and age, are taken as the standard pronunciation data.
Step 2: obtain whole-sentence non-standard speech data. Different categories of mispronunciation data are obtained from English learners with different mother tongues; the error categories are the following 6 kinds: tongue position too far forward, too far back, tongue position too high, too low, and phoneme lengthened and shortened. Each category has 200 samples.
Step 3: forced alignment and phoneme separation.
1. According to whether sound is being produced, a section of the speech signal is divided into silent segments (S, Silent), unvoiced segments (U, Unvoiced) and voiced segments (V, Voiced). The prediction residual energy of the speech signal S(n) over a frame of length N and the first reflection coefficient of the frame are computed, and the V/U/S segmentation rules are as follows:
(1) if the first reflection coefficient is greater than 0.2 and the prediction residual energy is greater than twice the threshold θ, the current speech frame is defined as V;
(2) if the first reflection coefficient is greater than 0.3, the prediction residual energy is greater than θ, and the frame preceding the current frame is a voiced frame, the current frame is defined as V;
(3) if neither of the two rules above is satisfied, the speech frame is defined as U.
2. Forced alignment, as shown in Fig. 3.
(1) The text file is split at special punctuation marks, the English strings are segmented, and the result is saved in UTF-8 format.
(2) The audio file is converted to a mono, 16000 Hz format and passed through endpoint detection, whose purpose is to detect accurately the start and end points of speech within the signal.
(3) A word-to-phone conversion is applied to the text and, using a trained acoustic model, the text is expanded into a search space composed of hidden Markov model (HMM) state sequences.
(4) Acoustic features are extracted from the speech signal in the audio file and, frame by frame from front to back, aligned against the search space composed of the corresponding HMM state sequence. Each frame is aligned by dynamic-programming Viterbi decoding:
Q(t, s) = max_{s'} { p(x_t, s | s') · Q(t−1, s') }
where Q(t, s) is the best score of a path reaching HMM state s of the search space at time t, p(x_t, s | s') is the probability of transitioning to state s and emitting observation x_t given that the previous frame's state is s', x_t is the observation at frame t, and s' is a predecessor state of s; s_we is the end-state node of the current sentence whose optimal end time τ is to be estimated.
At time t, when some path reaches the active state s_we, the number of path hypotheses on each active state s_i is counted, N(t, s_i) = Σ_k δ(path k is on s_i at time t), where δ(·) is the indicator function, and all path hypotheses are ranked by score; the hypotheses on s_we are counted. Letting R_k(t, s_we) be the rank of hypothesis Q_k(t, s_we) among all N(t) hypotheses, the expected rank of the hypotheses on s_we over the N(t) paths is R̄(t, s_we) = (1 / N(t, s_we)) Σ_k R_k(t, s_we). The state activity A(t, s_we) is defined from these quantities; the time at which A(t, s_we) is maximized is the maximum-likelihood alignment time t, from which the speech-to-text alignment times of the sentence are output.
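The recurrence Q(t, s) = max_{s'} p(x_t, s | s') · Q(t−1, s') can be sketched as a plain Viterbi pass over a toy left-to-right HMM. The state count, transition probabilities and per-frame observation likelihoods below are illustrative assumptions, not the patent's trained acoustic model.

```python
def viterbi_align(obs_likelihood, trans, n_states):
    """obs_likelihood[t][s]: likelihood of frame t under state s.
    trans[sp][s]: transition probability from state sp to s.
    Returns the best state index per frame (the alignment)."""
    T = len(obs_likelihood)
    Q = [[0.0] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    Q[0] = list(obs_likelihood[0])  # toy: any state may start
    for t in range(1, T):
        for s in range(n_states):
            # Q(t,s) = max over predecessors s' of p * Q(t-1, s')
            best_prev = max(range(n_states), key=lambda sp: Q[t - 1][sp] * trans[sp][s])
            back[t][s] = best_prev
            Q[t][s] = obs_likelihood[t][s] * Q[t - 1][best_prev] * trans[best_prev][s]
    # Backtrack from the best final state to recover the aligned path.
    path = [max(range(n_states), key=lambda s: Q[T - 1][s])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

In a real aligner the scores are kept in log space to avoid underflow; raw products are used here only for readability.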
In the forced-alignment process, the whole-sentence pronunciation is aligned down to word level and phone level so that the acoustic features of different pronunciations can be extracted at phone level in subsequent steps. After forced alignment, according to the output speech-to-text alignment times, the phone tier of the alignment table (TextGrid) is read; the start and end times of each phoneme on the phone tier are read, and phoneme cutting is performed to obtain the pronunciation phonemes.
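Once alignment yields per-phoneme start and end times, phoneme cutting reduces to sample-index slicing. The function below is a minimal sketch that assumes the times have already been read from the TextGrid phone tier.

```python
def cut_phoneme(samples, sample_rate, start_s, end_s):
    """Return the audio samples between start_s and end_s (in seconds)."""
    i0 = int(round(start_s * sample_rate))
    i1 = int(round(end_s * sample_rate))
    return samples[i0:i1]
```

For example, with 16000 Hz audio a phone aligned to [0.25 s, 0.40 s] maps to samples 4000 through 6399.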
Step 4: data normalization. Normalization limits the phoneme acoustic-feature data to a fixed range in order to reduce their dispersion: the variability of the data is reduced and their fluctuation smoothed without affecting their original distribution. The present embodiment uses min-max normalization.
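A minimal sketch of the min-max normalization described in step 4, applied to one feature dimension at a time:

```python
def min_max_normalize(column):
    """Rescale one feature dimension to [0, 1] without changing its distribution shape."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant column: no spread to normalize
    return [(x - lo) / (hi - lo) for x in column]
```

Applying it per dimension keeps features with different natural ranges (e.g. cepstral coefficients vs. formant frequencies) on a common scale before training.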
Step 5: feature extraction. The data are first divided into a training set and a test set; then, for the pronunciation phonemes obtained in step 4, the MFCC features and formant features of each phoneme pronunciation are extracted. The original speech signal S(n) undergoes pre-emphasis, framing, windowing, endpoint detection and similar processing to yield the time-domain signal x(n) of each speech frame.
The time-domain signal x(n) is zero-padded to a sequence of length N (N = 512 in the present embodiment), and a discrete Fourier transform (DFT, or FFT) yields the linear spectrum X(k).
The linear spectrum is passed through a Mel-frequency filter bank to obtain the Mel spectrum. For better robustness against noise and spectral-estimation error, the logarithmic energy of the Mel spectrum obtained from the filter bank is taken, giving S(m).
The log spectrum S(m) is transformed into the cepstral domain by a discrete cosine transform (DCT), yielding the Mel-frequency cepstral coefficients (MFCC parameters) c(n).
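The framing-to-cepstrum pipeline of step 5 can be sketched per frame with NumPy. Only N = 512 is stated in the text; the filter count, cepstral order, sample rate and padding behaviour below are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the Mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / (c - l)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / (r - c)   # falling edge
    return fb

def mfcc_frame(frame, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    frame = np.pad(frame, (0, n_fft - len(frame)))    # zero-pad to length N
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2     # linear power spectrum X(k)
    mel = mel_filterbank(n_filters, n_fft, sr) @ spec  # Mel spectrum
    logmel = np.log(mel + 1e-10)                       # log spectrum S(m)
    # Unscaled DCT-II of the log-Mel energies gives the cepstral coefficients c(n)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ logmel
```

Pre-emphasis, windowing and endpoint detection are omitted here; a full extractor would apply them before `mfcc_frame`.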
Since the speech signal is continuous in time, the features extracted from a single frame reflect only that frame's characteristics. To let the features capture temporal continuity, dimensions of inter-frame information are added: the 13 cepstral coefficients are augmented with 13 first-order difference (delta) coefficients and 13 acceleration coefficients, plus a four-dimensional formant parameter, for 43 coefficients in total, forming a 43-dimensional feature vector.
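A sketch of the first-order difference (delta) coefficients; the regression window is not specified in the text, so a symmetric one-frame difference with edge replication is assumed. Applying the same function to the deltas yields the acceleration coefficients.

```python
def delta(features):
    """features: list of per-frame vectors; returns per-frame delta vectors
    using a symmetric one-frame difference (edges replicated)."""
    out = []
    for t in range(len(features)):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, len(features) - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out
```

Concatenating static, delta and delta-delta vectors with the four formant values per frame gives the 43-dimensional vector described above.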
Step 6: training of the pronunciation error detection model. A multi-class problem is involved, while both the support vector machine and the decision tree are binary classifiers, so a multi-class classifier is constructed by combining several binary classifiers. There are various combination schemes, including one-versus-one and one-versus-rest; the present embodiment uses one-versus-rest (OVR). Its idea is: during training, the samples of each class in turn are taken as one class and all remaining samples as the other, so that k classes yield k classifiers; during classification, an unknown sample is assigned to the class with the largest classification-function value. The specific steps are as follows:
The present embodiment has 7 classes to separate (i.e. 7 labels): −1, 1, A, B, C, D and E. −1 denotes the error type, 1 the correct type, and A, B, C, D, E respectively denote the error subclasses tongue position too far forward, too far back, tongue position too high, too low, and phoneme lengthened or shortened.
When extracting the training sets, the following are extracted in turn:
(1) the vectors labelled 1 as the positive set, all other vectors as the negative set;
(2) the vectors labelled −1 as the positive set, all other vectors as the negative set;
(3) the vectors labelled A as the positive set, all other vectors as the negative set;
(4) the vectors labelled B as the positive set, all other vectors as the negative set;
(5) the vectors labelled C as the positive set, all other vectors as the negative set;
(6) the vectors labelled D as the positive set, all other vectors as the negative set;
(7) the vectors labelled E as the positive set, all other vectors as the negative set.
When the support vector machine is used as the training classifier algorithm, the phoneme acoustic-feature vectors are divided into training and test sets in a 4:1 ratio; the training set serves as the input vectors of the SVM, and the radial basis function (RBF) kernel is selected. The range of the SVM penalty factor c is set to [0, 100], and the range of the kernel parameter g to [0, 1000].
Training is carried out separately on these 7 training sets, yielding 7 training-result files. At test time, the corresponding test vectors are evaluated against each of the seven trained classifiers, so that each test yields results f1(x), f2(x), f3(x), f4(x), f5(x), f6(x) and f7(x); the final classification result is the class whose value is the largest.
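The final one-versus-rest decision, taking the largest of f1(x)…f7(x), is an argmax over the seven decision values. The score dictionary in the test is a toy stand-in for the outputs of the trained SVMs.

```python
# The seven labels used in this embodiment.
LABELS = ["1", "-1", "A", "B", "C", "D", "E"]

def ovr_predict(scores_by_label):
    """scores_by_label: dict mapping each label to its classifier's decision
    value f_k(x). Returns the label with the largest decision value."""
    return max(scores_by_label, key=scores_by_label.get)
```

With calibrated decision values this is exactly the "assign to the class with the largest classification-function value" rule described above.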
In practical read-aloud errors, the distribution of error types is often uneven: in a given region, most learners' pronunciation errors may concentrate in one or two classes, making those error types common while others are rare, which leads to class imbalance when training on the error types. To explore a better modelling method, this work adopts the idea of transfer learning: features are extracted by unsupervised training of a deep belief network (DBN), with a support vector machine (SVM) as the top layer, i.e. the classification model is built as DBN+SVM.
A suitable DBN model is determined from the number and dimensionality of the training samples in the phoneme acoustic-feature data set: the number of hidden layers, the per-layer learning rate and the number of iterations are determined, and the Boltzmann machines are trained with these parameters. The 200 groups of phoneme data per class are divided into a training set (160 groups) and a test set (40 groups); the training set is used for modelling and the test set to evaluate the resulting phoneme error detection classification model. Eight hidden-layer node counts (100, 200, 300, 400, 500, 600, 700 and 800) are used to model the features with a single hidden layer. By adjusting the model parameters and comparing the optimized outputs, it is found that the best result is obtained with 400 hidden nodes. After the first hidden layer's node count is fixed, the optimal number of hidden layers and nodes is determined step by step by testing, and the optimal model is then obtained by tuning the remaining parameters. In the finally established phoneme pronunciation error detection classification model, the number of hidden layers is 5, the RBM iteration count is 50, the DBN network iteration count is 1000, the batch size is 64, and the weight learning rate is 0.000001. The test experiments are repeated several times and the mean is taken as the final result.
Error detection can produce four types of result: 1) correct acceptance (CA), the number of correct pronunciations judged correct; 2) correct rejection (CR), the number of incorrect pronunciations judged incorrect; 3) false acceptance (FA), the number of incorrect pronunciations judged correct; and 4) false rejection (FR), the number of correct pronunciations judged incorrect. From these four counts, the correct acceptance rate (CAR) and the correct rejection rate (CRR) are calculated. This embodiment uses CAR and CRR as the measures of recognition accuracy: CAR for correct pronunciations and CRR for incorrect pronunciations.
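The text does not spell out the formulas for CAR and CRR; the minimal sketch below uses the standard definitions, CAR = CA/(CA+FR) over the correct pronunciations and CRR = CR/(CR+FA) over the incorrect ones — an assumption, not a quotation from the patent.

```python
def car_crr(ca, cr, fa, fr):
    """Correct acceptance rate and correct rejection rate.

    Assumed standard definitions (the source does not spell them out):
      CAR = CA / (CA + FR)  -- fraction of correct pronunciations accepted
      CRR = CR / (CR + FA)  -- fraction of incorrect pronunciations rejected
    """
    car = ca / (ca + fr) if (ca + fr) else 0.0
    crr = cr / (cr + fa) if (cr + fa) else 0.0
    return car, crr

# Example: 90 correct pronunciations accepted, 10 wrongly rejected;
# 80 mispronunciations rejected, 20 wrongly accepted.
car, crr = car_crr(ca=90, cr=80, fa=20, fr=10)
# car == 0.9, crr == 0.8
```

Under these definitions CAR and CRR behave like per-class recall on the correct and incorrect pronunciations respectively, which is why both are reported.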
Table 2. Comparison of test-set CAR and CRR for the two classifier algorithms and the deep-learning model (DBN+SVM)
As Table 2 shows, in phoneme pronunciation error detection the SVM-based and decision-tree-based classifiers differ little in recognition accuracy; both recognize the error types stably at around 80%, which is a good result. For practical reasons, some pronunciation errors are very common, so their error samples are plentiful and their classification works better, while other pronunciation error types are rare, and insufficient training data is likely the main reason their classification accuracy is lower. Table 2 also shows that DBN+SVM improves recognition accuracy by about 2 percentage points on average over the two baselines. The model trained with the deep-belief-network feature extractor and the support-vector-machine classifier is therefore the best.
Step 7: result evaluation and correction. Based on the classification result produced by the error-detection model obtained with the machine-learning algorithm, the system indicates which error class the test sample belongs to; the class of the pronunciation error is predicted by the model. The location and type of the learner's pronunciation error are fed back to the learner, and a correction scheme is proposed.
Part II: online pronunciation error detection and interactive correction
Step 1: obtaining pronunciation data
Fig. 2 is the workflow diagram of the pronunciation error-detection and interactive correction system. After logging in, the learner selects the sentence to practice and reads the whole sentence aloud according to the displayed text; the system records the learner's pronunciation.
Step 2: data processing and pronunciation error detection
The learner's pronunciation data obtained in Step 1 is preprocessed, including forced alignment to separate the phonemes and feature extraction; the processing steps are the same as in the offline model-training part. The processed data is fed into the trained pronunciation error-detection model, which outputs the judgment of the learner's pronunciation.
Step 3: interactive correction. Based on the judgment given by the system, the learner is told what is wrong with the pronunciation, and a corresponding articulation correction is given for each detected error type; the learner is prompted to read aloud again. The correction loop continues until the phoneme is pronounced to standard.
In summary, the present invention is based on machine-learning algorithms and exploits the fact that different phonemes have different acoustic features. Foreign-language speech segments read aloud by different learners are collected and processed to obtain a 39+4-dimensional frequency-domain acoustic feature vector, which serves as the input for model training; by means of supervised or unsupervised learning networks, the extracted acoustic feature vectors are used to train and generate an acoustic error-detection model. The classification performance of the model is verified on a test set; experiments show that the classification accuracy is high and adequate for analyzing typical learners' pronunciation error types, and pronunciation evaluation and correction schemes are given for the test-set results. The invention not only points out the learner's pronunciation errors, but further identifies what exactly is wrong and feeds the improvement method back to the learner, which can practically improve the learner's articulation ability.

Claims (6)

1. A machine-learning-based spoken pronunciation error detection and correction system, characterized by comprising: a spoken pronunciation sample acquisition module, for collecting correct pronunciation phonemes and different types of mispronounced phonemes from whole-sentence or whole-passage speech; a pronunciation error-detection model building module, for extracting acoustic features from the collected phonemes and labeling their types to form the pronunciation error-type training sample set, and for training a pronunciation error-detection model with a machine-learning algorithm; and an online error-detection and correction module, which uses the generated pronunciation error-detection model to score the whole-sentence or whole-passage speech read aloud by the learner and to perform phoneme-level error detection and pronunciation correction.
2. The machine-learning-based spoken pronunciation error detection and correction system according to claim 1, characterized in that the spoken pronunciation sample acquisition module divides the speech signal into voiced segments, pronunciation segments and silent segments, specifically: the prediction residual energy of the speech signal S(n) is computed over each frame, where N is the frame length, together with the first reflection coefficient; segmentation then follows these rules: if the first reflection coefficient is greater than 0.2 and the prediction residual energy is greater than twice the system threshold θ, the current speech frame is defined as a voiced frame; if the first reflection coefficient is greater than 0.3, the prediction residual energy is greater than the system threshold θ, and the previous frame is a pronunciation frame, the current speech frame is defined as a pronunciation frame; if neither rule is satisfied, the current speech frame is defined as a silent frame.
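The defining formulas in this claim are given as images and not reproduced above. The sketch below therefore assumes the standard first-order linear-prediction quantities (k1 = r1/r0 from the frame autocorrelations, residual energy E = r0·(1 − k1²)) and then applies the three segmentation rules as stated; treating a voiced previous frame as satisfying the "previous frame is a pronunciation frame" condition is also an assumption.

```python
def frame_features(frame):
    """First reflection coefficient k1 and first-order prediction
    residual energy of one frame (standard LPC definitions, assumed
    here because the claim's formula images are not reproduced)."""
    r0 = sum(x * x for x in frame)                     # autocorrelation, lag 0
    r1 = sum(a * b for a, b in zip(frame, frame[1:]))  # autocorrelation, lag 1
    k1 = r1 / r0 if r0 else 0.0
    energy = r0 * (1.0 - k1 * k1)
    return k1, energy

def classify_frames(frames, theta):
    """Apply the claim's rules frame by frame:
    voiced        if k1 > 0.2 and E > 2*theta
    pronunciation if k1 > 0.3 and E > theta and the previous frame sounded
    silent        otherwise."""
    labels = []
    for frame in frames:
        k1, e = frame_features(frame)
        prev_sounding = bool(labels) and labels[-1] != "silent"
        if k1 > 0.2 and e > 2 * theta:
            labels.append("voiced")
        elif k1 > 0.3 and e > theta and prev_sounding:
            labels.append("pronunciation")
        else:
            labels.append("silent")
    return labels

labels = classify_frames([[0, 0, 0, 0], [2, 2, 2, 2], [1, 1, 1, 1], [0, 0, 0, 0]],
                         theta=1.0)
# → ['silent', 'voiced', 'pronunciation', 'silent']
```

The middle two frames illustrate the two thresholds: the louder frame clears 2θ and is voiced, while the quieter correlated frame only clears θ and is kept as a pronunciation frame because its predecessor was sounding.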
3. The machine-learning-based spoken pronunciation error detection and correction system according to claim 1, characterized in that the spoken pronunciation sample acquisition module obtains pronunciation phonemes by means of forced alignment, specifically: the text file is processed to remove punctuation marks; the audio file is converted to mono and passed through endpoint detection; the text file is converted word-to-sound and, according to the trained acoustic model, expanded into a search space composed of hidden Markov model state sequences; feature extraction is performed on the speech signal in the audio file, and the speech features are aligned frame by frame, from front to back, with the search space composed of the corresponding hidden Markov model state sequences; dynamic-time-warping Viterbi alignment is applied to each frame of data, giving: Q(t,s) = max_{s'}{ p(x_t, s | s') · Q(t−1, s') }, where Q(t,s) is the best score at time t on a specific hidden Markov model state s in the search space, p(x_t, s | s') is the probability of the current frame transitioning to state s and emitting the hidden sequence x_t given that the previous frame's state is s', x_t is the hidden Markov state-transition sequence, and s' is the state of the frame preceding s. At time t, when a path reaches the active state s_we, where s_we is the suffix state node of the current sentence whose optimal end time τ is to be estimated, the number of path hypotheses on each active state s_i at that time is counted, with δ(·) the indicator function, and all path hypotheses are ranked by their scores; all paths on s_we are counted. Denoting by R_k(t, s_we) the rank of path hypothesis Q_k(t, s_we) among all N(t) paths, the expected rank of the path hypotheses on s_we among the N(t) paths defines the state activity A(t, s_we); the moment at which A(t, s_we) is maximized is the maximum-likelihood alignment time t. According to the maximum-likelihood alignment time t, the speech and text alignment time information of the sentence is output; the phoneme separation level of the read text is derived from the speech-text alignment time information, the start time and end time of each phoneme at the phoneme separation level are read, and phoneme cutting is performed to obtain the pronunciation phonemes.
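The recursion Q(t,s) = max_{s'} p(x_t, s | s')·Q(t−1, s') in this claim is an ordinary Viterbi pass over a left-to-right state chain. A minimal sketch follows, with the emission and transition probabilities folded into a single hypothetical `score(t, s, s_prev)` log-score function rather than the claim's full HMM search space, and with the path assumed to end in the last state:

```python
def viterbi_align(n_frames, n_states, score):
    """Align frames 0..n_frames-1 to a left-to-right chain of states
    0..n_states-1.  score(t, s, s_prev) plays the role of
    p(x_t, s | s') in the recursion; Q is kept in the log domain."""
    NEG = float("-inf")
    Q = [[NEG] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    Q[0][0] = score(0, 0, 0)
    for t in range(1, n_frames):
        for s in range(n_states):
            # left-to-right topology: predecessor is s (self-loop) or s-1
            for sp in ([s] if s == 0 else [s, s - 1]):
                if Q[t - 1][sp] == NEG:
                    continue
                cand = Q[t - 1][sp] + score(t, s, sp)
                if cand > Q[t][s]:
                    Q[t][s] = cand
                    back[t][s] = sp
    # backtrack from the final state; state changes mark phoneme boundaries
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical log-score: frame t prefers state t//2 (purely illustrative).
def toy_score(t, s, sp):
    return 0.0 if s == t // 2 else -1.0

path = viterbi_align(6, 3, toy_score)
# → [0, 0, 1, 1, 2, 2]
```

The per-frame state sequence returned by the backtrack is what yields the start and end times of each phoneme: a boundary is placed wherever the state index changes.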
4. The machine-learning-based spoken pronunciation error detection and correction system according to claim 1, characterized in that, in feature extraction, the pronunciation error-detection model building module first divides the data into a training data set and a test data set, then extracts the MFCC features and formant features of each phoneme pronunciation from the acquired phonemes: the original speech signal is processed to obtain the time-domain signal of each speech frame; the time-domain signal is zero-padded to a sequence of length N, and a discrete Fourier transform yields the linear spectrum; the linear spectrum is passed through a Mel-frequency filter bank to obtain the Mel spectrum, and the logarithmic energy of the Mel spectrum gives the log spectrum S(m); applying a discrete cosine transform to S(m) takes it to the cepstral domain, yielding the Mel-frequency cepstral coefficients c(n).
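The MFCC pipeline of this claim (zero-pad, DFT, Mel filter bank, log energy, DCT) can be sketched for a single frame as follows; the naive O(N²) DFT, the triangular filter-bank construction and all parameter values are illustrative choices, not the patent's.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate, n_fft=256, n_filters=12, n_ceps=12):
    """One frame -> MFCCs, following the claim's pipeline: zero-pad,
    DFT (naive here), Mel filter bank, log energy, DCT to cepstrum."""
    # zero-pad the time-domain frame to length n_fft
    x = list(frame) + [0.0] * (n_fft - len(frame))
    # power spectrum of the first n_fft//2 + 1 bins (naive DFT)
    half = n_fft // 2 + 1
    power = []
    for k in range(half):
        re = sum(x[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft))
        im = -sum(x[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft))
        power.append((re * re + im * im) / n_fft)
    # triangular Mel filters spaced evenly in mel between 0 Hz and Nyquist
    mels = [i * hz_to_mel(sample_rate / 2) / (n_filters + 1)
            for i in range(n_filters + 2)]
    bins = [int(round(mel_to_hz(m) * n_fft / sample_rate)) for m in mels]
    log_mel = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        e = 0.0
        for k in range(lo, min(hi + 1, half)):
            w = (k - lo) / max(mid - lo, 1) if k < mid else (hi - k) / max(hi - mid, 1)
            e += w * power[k]
        log_mel.append(math.log(e + 1e-12))      # log spectrum S(m)
    # DCT-II of S(m) -> Mel-frequency cepstral coefficients c(n)
    return [sum(log_mel[m] * math.cos(math.pi * n * (m + 0.5) / n_filters)
                for m in range(n_filters))
            for n in range(n_ceps)]

# Illustrative input: 200 samples of a 440 Hz sine at 8 kHz.
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(200)]
ceps = mfcc_frame(frame, sample_rate=8000)
```

Production systems would use an FFT and typically 20-plus filters with 12-13 retained coefficients (matching the 39-dimensional feature vector mentioned in the description once deltas are appended); this sketch keeps everything small enough to read in one pass.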
5. The machine-learning-based spoken pronunciation error detection and correction system according to claim 1, characterized in that, during training, the pronunciation error-detection model building module uses 7 categories, namely −1, 1, A, B, C, D and E, where −1 denotes an error type, 1 denotes a correct type, and A, B, C, D and E respectively denote, within the error class, tongue position too far forward, too far back, too high, too low, and phoneme lengthening/shortening; when building the training sets, the samples of one category are taken in turn as one class and all remaining samples as the other class, yielding 7 classifiers; when a support vector machine is used as the classifier algorithm, the phoneme acoustic feature vector data is divided 4:1 into a training set and a test set, the training set serves as the input vectors of the support vector machine, and the kernel function of the support vector machine is a radial basis function kernel.
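The one-vs-rest construction of this claim (each of the 7 categories in turn against all the rest, after a 4:1 train/test split) can be sketched independently of any particular SVM library; the feature vectors and data below are placeholders.

```python
CATEGORIES = [-1, 1, "A", "B", "C", "D", "E"]

def split_4_to_1(samples):
    """4:1 train/test split, as specified in the claim."""
    cut = len(samples) * 4 // 5
    return samples[:cut], samples[cut:]

def one_vs_rest_sets(samples):
    """samples: list of (feature_vector, category) pairs.
    For each of the 7 categories, relabel that category's samples as
    the positive class (1) and everything else as negative (0),
    giving one binary training problem per category -- 7 classifiers."""
    return {cat: [(x, 1 if y == cat else 0) for x, y in samples]
            for cat in CATEGORIES}

# Placeholder data: 10 samples cycled over the categories.
data = [([0.0], CATEGORIES[i % 7]) for i in range(10)]
train_set, test_set = split_4_to_1(data)
problems = one_vs_rest_sets(train_set)
# problems[-1] is the binary problem for the error type "-1".
```

Each entry of `problems` would then be handed to an RBF-kernel SVM trainer; at test time the sample is assigned to whichever of the 7 binary classifiers responds positively (or most strongly).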
6. The machine-learning-based spoken pronunciation error detection and correction system according to claim 1, characterized in that, in modeling, the pronunciation error-detection model building module extracts features by unsupervised training of a deep belief network, with a support vector machine as the top layer; the preferred deep belief network model is determined from the number and dimensionality of the training samples in the phoneme acoustic feature data set; the number of hidden layers is determined by adjusting the model parameters and comparing the tuned outputs; after the size of the first hidden layer is fixed, the optimal number of hidden layers and their node counts are determined step by step by experiment, and the optimal model is then obtained by tuning the remaining parameters.
CN201811534792.5A 2018-12-14 2018-12-14 A kind of spoken language pronunciation error detection and correcting system based on machine learning Pending CN109545189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811534792.5A CN109545189A (en) 2018-12-14 2018-12-14 A kind of spoken language pronunciation error detection and correcting system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811534792.5A CN109545189A (en) 2018-12-14 2018-12-14 A kind of spoken language pronunciation error detection and correcting system based on machine learning

Publications (1)

Publication Number Publication Date
CN109545189A true CN109545189A (en) 2019-03-29

Family

ID=65856297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811534792.5A Pending CN109545189A (en) 2018-12-14 2018-12-14 A kind of spoken language pronunciation error detection and correcting system based on machine learning

Country Status (1)

Country Link
CN (1) CN109545189A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415679A (en) * 2019-07-25 2019-11-05 北京百度网讯科技有限公司 Voice error correction method, device, equipment and storage medium
CN110457670A (en) * 2019-07-25 2019-11-15 天津大学 A method of it reducing the space of a whole page before printing based on machine learning and handles error rate
CN110488675A (en) * 2019-07-12 2019-11-22 国网上海市电力公司 A kind of substation's Abstraction of Sound Signal Characteristics based on dynamic time warpping algorithm
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system
CN110598208A (en) * 2019-08-14 2019-12-20 清华大学深圳研究生院 AI/ML enhanced pronunciation course design and personalized exercise planning method
CN111292769A (en) * 2020-03-04 2020-06-16 苏州驰声信息科技有限公司 Method, system, device and storage medium for correcting pronunciation of spoken language
CN111833859A (en) * 2020-07-22 2020-10-27 科大讯飞股份有限公司 Pronunciation error detection method and device, electronic equipment and storage medium
CN112215018A (en) * 2020-08-28 2021-01-12 北京中科凡语科技有限公司 Automatic positioning method and device for correction term pair, electronic equipment and storage medium
CN112863486A (en) * 2021-04-23 2021-05-28 北京一起教育科技有限责任公司 Voice-based spoken language evaluation method and device and electronic equipment
CN112967538A (en) * 2021-03-01 2021-06-15 郑州铁路职业技术学院 English pronunciation information acquisition system
TWI767532B (en) * 2021-01-22 2022-06-11 賽微科技股份有限公司 A wake word recognition training system and training method thereof
CN114758647A (en) * 2021-07-20 2022-07-15 无锡柠檬科技服务有限公司 Language training method and system based on deep learning
CN114783412A (en) * 2022-04-21 2022-07-22 山东青年政治学院 Spanish spoken language pronunciation training correction method and system
WO2022168102A1 (en) * 2021-02-08 2022-08-11 Rambam Med-Tech Ltd. Machine-learning-based speech production correction
CN115148225A (en) * 2021-03-30 2022-10-04 北京猿力未来科技有限公司 Intonation scoring method, intonation scoring system, computing device and storage medium
CN116340489A (en) * 2023-03-27 2023-06-27 齐齐哈尔大学 Japanese teaching interaction method and device based on big data
CN116805495A (en) * 2023-08-17 2023-09-26 北京语言大学 Pronunciation deviation detection and action feedback method and system based on large language model
CN116894442A (en) * 2023-09-11 2023-10-17 临沂大学 Language translation method and system for correcting guide pronunciation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060136225A1 (en) * 2004-12-17 2006-06-22 Chih-Chung Kuo Pronunciation assessment method and system based on distinctive feature analysis
CN101651788A (en) * 2008-12-26 2010-02-17 中国科学院声学研究所 Alignment system of on-line speech text and method thereof
CN103366759A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 Speech data evaluation method and speech data evaluation device
CN103383845A (en) * 2013-07-08 2013-11-06 上海昭鸣投资管理有限责任公司 Multi-dimensional dysarthria measuring system and method based on real-time vocal tract shape correction
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN107203777A (en) * 2017-04-19 2017-09-26 北京协同创新研究院 audio scene classification method and device

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110488675A (en) * 2019-07-12 2019-11-22 国网上海市电力公司 A kind of substation's Abstraction of Sound Signal Characteristics based on dynamic time warpping algorithm
CN110415679A (en) * 2019-07-25 2019-11-05 北京百度网讯科技有限公司 Voice error correction method, device, equipment and storage medium
CN110457670A (en) * 2019-07-25 2019-11-15 天津大学 A method of it reducing the space of a whole page before printing based on machine learning and handles error rate
US11328708B2 (en) 2019-07-25 2022-05-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Speech error-correction method, device and storage medium
CN110415679B (en) * 2019-07-25 2021-12-17 北京百度网讯科技有限公司 Voice error correction method, device, equipment and storage medium
CN110598208A (en) * 2019-08-14 2019-12-20 清华大学深圳研究生院 AI/ML enhanced pronunciation course design and personalized exercise planning method
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system
CN110556093B (en) * 2019-09-17 2021-12-10 浙江同花顺智富软件有限公司 Voice marking method and system
CN111292769A (en) * 2020-03-04 2020-06-16 苏州驰声信息科技有限公司 Method, system, device and storage medium for correcting pronunciation of spoken language
CN111833859B (en) * 2020-07-22 2024-02-13 科大讯飞股份有限公司 Pronunciation error detection method and device, electronic equipment and storage medium
CN111833859A (en) * 2020-07-22 2020-10-27 科大讯飞股份有限公司 Pronunciation error detection method and device, electronic equipment and storage medium
CN112215018A (en) * 2020-08-28 2021-01-12 北京中科凡语科技有限公司 Automatic positioning method and device for correction term pair, electronic equipment and storage medium
CN112215018B (en) * 2020-08-28 2021-08-13 北京中科凡语科技有限公司 Automatic positioning method and device for correction term pair, electronic equipment and storage medium
TWI767532B (en) * 2021-01-22 2022-06-11 賽微科技股份有限公司 A wake word recognition training system and training method thereof
WO2022168102A1 (en) * 2021-02-08 2022-08-11 Rambam Med-Tech Ltd. Machine-learning-based speech production correction
CN112967538B (en) * 2021-03-01 2023-09-15 郑州铁路职业技术学院 English pronunciation information acquisition system
CN112967538A (en) * 2021-03-01 2021-06-15 郑州铁路职业技术学院 English pronunciation information acquisition system
CN115148225A (en) * 2021-03-30 2022-10-04 北京猿力未来科技有限公司 Intonation scoring method, intonation scoring system, computing device and storage medium
CN112863486A (en) * 2021-04-23 2021-05-28 北京一起教育科技有限责任公司 Voice-based spoken language evaluation method and device and electronic equipment
CN114758647A (en) * 2021-07-20 2022-07-15 无锡柠檬科技服务有限公司 Language training method and system based on deep learning
CN114783412A (en) * 2022-04-21 2022-07-22 山东青年政治学院 Spanish spoken language pronunciation training correction method and system
CN114783412B (en) * 2022-04-21 2022-11-15 山东青年政治学院 Spanish spoken language pronunciation training correction method and system
CN116340489A (en) * 2023-03-27 2023-06-27 齐齐哈尔大学 Japanese teaching interaction method and device based on big data
CN116340489B (en) * 2023-03-27 2023-08-22 齐齐哈尔大学 Japanese teaching interaction method and device based on big data
CN116805495A (en) * 2023-08-17 2023-09-26 北京语言大学 Pronunciation deviation detection and action feedback method and system based on large language model
CN116805495B (en) * 2023-08-17 2023-11-21 北京语言大学 Pronunciation deviation detection and action feedback method and system based on large language model
CN116894442A (en) * 2023-09-11 2023-10-17 临沂大学 Language translation method and system for correcting guide pronunciation
CN116894442B (en) * 2023-09-11 2023-12-05 临沂大学 Language translation method and system for correcting guide pronunciation

Similar Documents

Publication Publication Date Title
CN109545189A (en) A kind of spoken language pronunciation error detection and correcting system based on machine learning
Safavi et al. Automatic speaker, age-group and gender identification from children’s speech
CN107221318B (en) English spoken language pronunciation scoring method and system
Franco et al. EduSpeak®: A speech recognition and pronunciation scoring toolkit for computer-aided language learning applications
Shobaki et al. The OGI kids’ speech corpus and recognizers
Friedland et al. Prosodic and other long-term features for speaker diarization
TWI275072B (en) Pronunciation assessment method and system based on distinctive feature analysis
Das et al. Bengali speech corpus for continuous auutomatic speech recognition system
CN106782603B (en) Intelligent voice evaluation method and system
US20100004931A1 (en) Apparatus and method for speech utterance verification
Maier et al. Automatic detection of articulation disorders in children with cleft lip and palate
KR20080059180A (en) Pronunciation diagnosis device, pronunciation diagnosis method, recording medium, and pronunciation diagnosis program
CN106205603B (en) A kind of tone appraisal procedure
Cole et al. Speaker-independent recognition of spoken English letters
Ahsiah et al. Tajweed checking system to support recitation
Arafa et al. A dataset for speech recognition to support Arabic phoneme pronunciation
Burgos Gammatone and MFCC features in speaker recognition
Mathad et al. The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation.
CN114220419A (en) Voice evaluation method, device, medium and equipment
CN113571088A (en) Difficult airway assessment method and device based on deep learning voiceprint recognition
Sadeghian et al. Towards an automated screening tool for pediatric speech delay
Shafie et al. Dynamic time warping features extraction design for quranic syllable-based harakaat assessment
Abdou et al. Enhancing the confidence measure for an Arabic pronunciation verification system
Barczewska et al. Detection of disfluencies in speech signal
Wang et al. Putonghua proficiency test and evaluation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20190329