CN109545189A

CN109545189A - A kind of spoken language pronunciation error detection and correcting system based on machine learning

Info

Publication number: CN109545189A
Application number: CN201811534792.5A
Authority: CN
Inventors: 吴怡之; 董权; 张俊杰
Original assignee: Donghua University
Current assignee: Donghua University; National Dong Hwa University
Priority date: 2018-12-14
Filing date: 2018-12-14
Publication date: 2019-03-29

Abstract

The spoken language pronunciation error detection and correcting system that the present invention relates to a kind of based on machine learning, comprising: spoken language pronunciation sample collection module, for acquiring orthoepy phoneme and different types of incorrect pronunciations phoneme from whole sentence or whole section of spoken language pronunciation；Pronounce error detection model building module, for extracting acoustic feature to pronunciation phonemes collected and carrying out type mark as pronunciation error detection model training sample set, generates pronunciation error detection model by machine learning algorithm training；Module is corrected in online error detection, and the whole sentence or whole section of spoken language pronunciation read aloud using the pronunciation error detection model of generation to learner score and phoneme error detection and pronunciation correction.The present invention can on-line evaluation spoken language pronunciation achievement, check pronunciation mistake and forward suggestion for correction.

Description

A kind of spoken language pronunciation error detection and correcting system based on machine learning

Technical field

The present invention relates to online verbal learning technical fields, examine more particularly to a kind of spoken language pronunciation based on machine learning Wrong and correcting system.

Background technique

During language learning, due to the limitation of qualified teachers and environment, the Oral Training time of attending class is insufficient, after class mouth Language practice cannot be fed back, and the factors such as nonstandard of pronouncing of most of teacher of spoken language cause when learner learns foreign languages One difficulty is ready to pay high tuition fee in order to correct one's pronunciation, foreign teacher is asked to correct the pronunciation of oneself there are many people.With shifting The rise for moving online language learning has expedited the emergence of automatic pronunciation error-detection system.

On pronunciation error-detecting method, currently used method can be roughly divided into two types: the first kind is by voice Xue Zhi The method of knowledge finds some distinctive features, for example may be replaced with " lice " using Japanese by the English learner of mother tongue The pronunciation of " rice ", learner possibly can not adjust its articulation to correct this mistake because phoneme/r/ in Japanese not In the presence of.For these typical type of errors, some acoustic feature such as formants for specifically having distinction can be usually extracted To be used as the detection and diagnosis of pronunciation mistake.Second class is based on speaker to the pronunciation of given text and based on speaker's mother tongue Similitude between the standard pronunciation of acoustic model identifies pronunciation mistake, if SelinaParveen et al. is in paper (BangIa Pronunciation error detection syatem) use this method.Its similarity indices is based on automatic speech Identify the confidence of (ASR).Generally speaking, first method can detecte out the movement for leading to the vocal organs of mistake, Such as the deviation of tongue position height front and back is judged by formant, but be limited primarily to the discovery that word reads aloud medial vowel mistake.And Algorithm fault-tolerance is lower, and the correctness that formant extracts is very crucial, but in noise circumstance, it is easy to formant be caused to extract Mistake.Second method can not judge vocal organs errors present, (accidentally mainly for the replacement in phonation there are phoneme Read), the mistakes such as skip and insertion, therefore targetedly pronunciation correction and raising scheme can not be given.

In application aspect, only exist has product for articulation problems at present on a small quantity, and but most of functions it is all relatively more single One, only simple playback audio-video learning materials, student record with reading, system plays.Software only few in number, has mouth The detection of language articulation problems is fed back, but existing first defect is the root cause problems that feedback function is not enough to solve learner, If sound wishes the Oral Training function of science and technology, this function can only point out that the pronunciation of learner is not good enough after learner is with reading pronunciation, But learner can not understand oneself pronunciation mistake where, and how should improve pronunciation, be unable to get learner most Information is corrected for valuable feedback.The oracy of learner can not often be improved.Second defect is the mistake for detection Accidentally type the mistake in these typical mistakes such as often focuses on the skip of pronunciation phonemes, misreads and be inserted into, lacking and move to pronunciation Make the judgement of error reason and the feedback of correction scheme.

Summary of the invention

The spoken language pronunciation error detection and correct system that technical problem to be solved by the invention is to provide a kind of based on machine learning System, can on-line evaluation spoken language pronunciation achievement, check pronunciation mistake and forward suggestion for correction.

The technical solution adopted by the present invention to solve the technical problems is: providing a kind of spoken language pronunciation based on machine learning Error detection and correcting system, comprising: spoken language pronunciation sample collection module, for acquiring correct hair from whole sentence or whole section of spoken language pronunciation Sound phoneme and different types of incorrect pronunciations phoneme；Pronounce error detection model building module, for mentioning to pronunciation phonemes collected It takes acoustic feature and carries out type mark as pronunciation error detection model training sample set, hair is generated by machine learning algorithm training Sound error detection model；Module, the whole sentence or whole section of mouth read aloud using the pronunciation error detection model of generation learner are corrected in online error detection Language pronunciation carries out scoring and phoneme error detection and pronunciation correction.

The case where spoken language pronunciation sample collection module issues sound is divided into voiced segments, mute section and mute section, Specifically: by the prediction residual energy of voice signal S (n) is defined as:Wherein, N is frame It is long, the first reflection coefficient is defined as:It is segmented according to following rule: if the first reflection coefficient is greater than 0.2, And prediction residual energy is greater than 2 times of system threshold value θ, is voiced segments by current speech frame definition；If the first reflection system Number is greater than 0.3, and prediction residual energy is greater than system threshold value θ, and the former frame of current speech frame is pronunciation frame, then currently Speech frame is defined as voiced segments；It is mute section for current speech frame definition if being unsatisfactory for above-mentioned two rule.

The spoken language pronunciation sample collection module is realized to obtain pronunciation factor by the way of forcing alignment, specifically: it will Text file is handled by punctuation mark；Audio file is converted into monophonic, is done by endpoint detection processing；By text file The conversion of word to sound is carried out, according to trained acoustic model, text file is extended to by implying Markov model Search space composed by status switch；Feature extraction is carried out to the voice signal in audio file, according to from front to back frame by frame Sequence phonetic feature is aligned with search space composed by corresponding implicit Markov Model state sequence；To each frame Data are aligned using dynamic time warpping Viterbi, are obtained: Q (t, s)=max_s′{p(x_t,s|s′)*Q_v(t-1, s') }, wherein Q (t, It s) is that moment t falls in the specific best score implied on Markov Model state s of some in search space, p (x_t,s|s') It is that latter frame state shifts hiding sequence x in the case of known previous frame state is s'_tProbability, x_tIt is implicit Markov state Metastasis sequence, s' are the former frame status of s；In t moment, when there is path to reach active state s_weWhen, wherein s_weIt is the phase It hopes the suffix state node for estimating the current sentence of its optimal end time τ, counts all active state s at this time_iUpper road Diameter assumes numberWherein, δ () is indicator function, All paths are assumed according to its score sequencing statistical；Count s_weUpper all paths；Remember that Q is assumed in path_k(t,s_we) institute Having ranking in a path N (t) is R_k(t,s_we), then s_weOn path assume in a path N (t) in ranking sample expectationDefinition status active degree isA(t,s_we) at the time of be maximized It is to be aligned maximum likelihood time t, according to alignment maximum likelihood time t, exports the voice and text justification temporal information of sentence； Level is separated according to the phoneme that voice and text justification temporal information read text table, reads some sound of phoneme separation level At the beginning of element and end time, progress phoneme cutting obtain pronunciation phonemes.

The pronunciation error detection model building module first divides the data into training data and and test data in feature extraction Collect, then the pronunciation phonemes that will acquire extract the MFCC feature and formant feature of each phoneme pronunciation, to primary speech signal The time-domain signal of each speech frame is obtained by processing；Time-domain signal trailing zero is formed to the sequence of a length of N, then by from Scattered Fourier transformation obtains linear spectral；Linear spectral obtains Mel frequency spectrum by Mel frequency filter group, obtained Mel frequency spectrum It takes logarithmic energy to obtain log spectrum S (m), log spectrum S (m) is obtained into cepstral domains by discrete cosine transform, can be obtained To Mel frequency cepstral coefficient c (n).

The pronunciation error detection model building module is divided into 7 classifications, respectively -1,1, A, B, C, D when being trained And E, wherein -1 indicate type of error, 1 indicate right type, A, B, C, D, E respectively indicate mistake classification in tongue position it is to the front, It is to the rear, tongue position is excessively high, too low and phoneme elongates and shortens class；When extracting training set, successively by the sample of some classification It is classified as one kind, other remaining samples are classified as another kind of, obtain 7 classifiers in this way；In support vector machines as training classification When device algorithm, phoneme of speech sound acoustic feature vector data is divided according to training set and test set 4:1, training set is as branch The input vector number of vector machine is held, support vector machines kernel function selects Radial basis kernel function.

The pronunciation error detection model building module extracts feature using the unsupervised training of deepness belief network in modeling, most Upper one layer uses support vector machines, is determined preferably according to the number and dimension of training set in pronunciation phonemes acoustic feature data set Deepness belief network model；Mode by adjusting parameter model and the output result for comparing each self-optimizing determines hidden layer Number gradually determines the optimal hidden layer number of plies and number of nodes by test method(s), and then pass through after fixing first layer hidden layer number of nodes Optimal models are obtained to the adjusting of other parameters.

Beneficial effect

Due to the adoption of the above technical solution, compared with prior art, the present invention having the following advantages that and actively imitating Fruit: the present invention by machine learning algorithm establish pronunciation error detection correct model in whole sentence or whole section phoneme pronunciation mistake it is quick Detection and diagnosis compensates for current market and assists spoken learning areas in terms of real-time scoring and error detection feedback correction to online Vacancy.It can quickly, effectively judge that there are which type of mistakes in which place in the entire sentence that learner is read aloud It misses type and is assisted correcting.

Detailed description of the invention

Fig. 1 is offline machine learning classification error detection model training frame diagram in the present invention；

Fig. 2 is online pronunciation error detection and interactive work for correction flow chart；

Fig. 3 is to force alignment separation phoneme algorithm flow chart.

Specific embodiment

Present invention will be further explained below with reference to specific examples.It should be understood that these embodiments are merely to illustrate the present invention Rather than it limits the scope of the invention.In addition, it should also be understood that, after reading the content taught by the present invention, those skilled in the art Member can make various changes or modifications the present invention, and such equivalent forms equally fall within the application the appended claims and limited Range.

Embodiments of the present invention are related to a kind of spoken language pronunciation error detection based on machine learning and correcting system, comprising: mouth Language pronunciation sample collection module, for acquiring orthoepy phoneme and different types of mistake from whole sentence or whole section of spoken language pronunciation Pronunciation phonemes；Pronounce error detection model building module, for extracting acoustic feature to pronunciation phonemes collected and carrying out type mark Note generates pronunciation error detection model as pronunciation error detection model training sample set, by machine learning algorithm training；Online error detection is entangled Positive module, the whole sentence or whole section of spoken language pronunciation read aloud using the pronunciation error detection model of generation to learner are scored and phoneme is examined Wrong and pronunciation correction.

The present invention obtains atypical pronunciation error sample data in whole sentence or whole section of pronunciation first, utilizes machine learning Classification and identification algorithm trains error detection disaggregated model, identifies that the specific mistake of learner's pronunciation belongs to the tongue of phoneme pronunciation Position height front and back deviation or pronunciation tone length deviation caused by pronunciation mistake which kind of.Further for identification Pronunciation type of error out is given correct interactive feedback and is corrected, and main includes shape of the mouth as one speaks tongue position when learner pronounces and hair The Adjusted Option of the tone length of sound.

Fig. 1 is offline machine learning classification error detection model training frame diagram in the present invention.Fig. 2 is online pronunciation error detection and hands over Mutual formula work for correction flow chart.Realize that step of the invention is as follows referring to Figures 1 and 2:

One, off-line model training part:

Step 1: obtaining the standard voice data of whole sentence.615 are obtained according to countries and regions, gender, age

Position is standard pronunciation data by the voice of reading aloud of the speaker of mother tongue of English.

Step 2: obtaining the non-standard voice data of whole sentence.It is obtained according to the English learner of different mother tongues different classes of Incorrect pronunciations data, error category is divided into following 6 kinds: tongue position is to the front, to the rear, tongue position is excessively high, too low and phoneme elongates and contracting Short class.The sample number of each classification is 200.

Step 3: alignment is forced to separate with phoneme.

1, the pronunciation of one section of voice signal is divided into mute section (S, Silent) according to the sending situation of sound, and mute section (U, Unvoiced), voiced segments (V, Voiced).The prediction residual energy of voice signal S (n) is defined as:

Wherein, N is frame length, the first reflection coefficient is defined as:V/U/S chopping rule is as follows:

(1) if the first reflection coefficient is greater than 0.2, and prediction residual energy is greater than 2 times of threshold value θ, by current language Sound frame definition is V sections；

(2) if the first reflection coefficient is greater than 0.3, and prediction residual energy is greater than threshold value θ, and before present frame One frame is pronunciation frame, is V sections by current speech frame definition；

(3) if being unsatisfactory for two rule of front, speech frame is defined as U sections.

2, alignment is forced, as shown in Figure 3

(1) text file is handled by special punctuation mark, and English string segmentation processing finally saves as UTF-8 format.

(2) audio file is converted to monophonic, sample rate 16000hz format, does by endpoint detection processing, end-point detection Purpose is exactly the starting point and end point that accurate detection goes out voice from comprising voice signal.

(3) text is extended to by implying by the conversion that text carries out word to sound according to trained acoustic model Search space composed by Markov model (HMM) status switch.

(4) feature extraction is carried out to the voice signal in audio file, according to sequence frame by frame from front to back, by voice spy Sign is aligned with search space composed by corresponding implicit Markov Model state sequence；Each frame data are advised using dynamic Whole Viterbi alignment, obtains:

Q (t, s)=max_s′{p(x_t,s|s')*Q_v(t-1,s')}

Wherein, Q (t, s) is that moment t is fallen on the specific implicit Markov Model state s of some in search space most Good score, p (x_t, s | s') it is that latter frame state shifts hiding sequence x in the case of known previous frame state is s'_tProbability, x_tIt is Implicit Markov state metastasis sequence, s' is the former frame status of s；s_weIt is its optimal end time τ of expectation estimation The suffix state node of current sentence.

In t moment, when there is path to reach active state s_weWhen, count all active state s at this time_iUpper path assume NumberWherein, δ () is indicator function,To own Path assume according to its score sequencing statistical；Count s_weUpper all paths；Remember that Q is assumed in path_k(t,s_we) in all N (t) Ranking is R in a path_k(t,s_we), then s_weOn path assume in a path N (t) in ranking sample expectationDefinition status active degree isA(t,s_we) be maximized when Quarter is alignment maximum likelihood time t, according to alignment maximum likelihood time t, exports the voice and text justification time letter of sentence Breath.

In forcing alignment procedure, whole sentence pronunciation is snapped into word level and phone-level, in order in subsequent step The acoustic feature that different pronunciations are extracted from phone-level is believed after forcing alignment according to the voice of output and text justification time Breath reads the level of the phoneme separation of text table (textgrid), at the beginning of reading some phoneme of phoneme level and ties The beam time carries out phoneme cutting.Obtain pronunciation phonemes.

Step 4: data normalization is handled.Data normalization processing is to be limited to the acoustic feature data of phoneme pronunciation Within the scope of some, reduces the otherness of data its purpose is to reduce the dispersion degree of phoneme pronunciation acoustic feature data, allow The fluctuation of data is smaller, has no effect on the original distribution of data, uses in present embodiment and be most worth method for normalizing.

Step 5: feature extraction, first divide the data into training data and and test data set, then will be obtained in step 4 Pronunciation phonemes extract the MFCC feature and formant feature of each phoneme pronunciation, primary speech signal S (n) by preemphasis, The processing such as framing, adding window, end-point detection, obtains the time-domain signal x (n) of each speech frame.

Time-domain signal x (n) is mended to several 0 to form the sequence of a length of N (N=512 in the present embodiment) afterwards, then passed through It crosses discrete Fourier transform DFT or (FFT) obtains linear spectral X (k)；

Above-mentioned linear spectral obtains Mel frequency spectrum by Mel frequency filter group.In order to miss result to noise and Power estimation Difference has better robustness, takes logarithmic energy to obtain S (m) the above-mentioned Mel frequency spectrum obtained by Mel filter group；

Above-mentioned log spectrum S (m) is obtained into cepstral domains by discrete cosine transform (DCT) transformation, Mel frequency can be obtained Rate cepstrum coefficient (MFCC parameter) c (n).

Since voice signal is that time domain is continuous, the characteristic information that framing is extracted only has reacted the characteristic of this frame voice, is The continuity of time domain can be embodied by making feature more, can before and after characteristic dimension increase frame information dimension, using first differential Coefficient and 13 accelerator coefficients, in addition four-dimensional formant parameter, 43 coefficients altogether.Form 43 dimensional feature vectors.

Step 6: the training of pronunciation error detection model.Due to being related to more classification problems, support vector machines and decision Tree algorithms are all It is two-value classifier.The construction of multi-categorizer can be realized using by combining multiple two classifiers.Combination multiplicity, has One-to-one method (one-versus-one) and one-to-many method (one-versus-rest).Present embodiment uses one-to-many method, letter Claim OVR.Its thought are as follows: training when the sample of some classification is successively classified as one kind, other remaining samples be classified as it is another kind of, this The sample of k classification of sample has just constructed k classifier.Unknown sample is classified as with maximum classification function value when classification That class.The specific steps are under:

Present embodiment has 7 classifications to need to divide (namely 7 label), is -1,1, A, B, C, D, E respectively.- 1 table Show type of error, 1 indicates right type, A, B, C, D, E respectively indicate the tongue position in mistake classification is to the front, to the rear, tongue position is excessively high, Too low and phoneme elongates and shortens class.

When extracting training set, extract respectively

Vector corresponding to (1) 1 is as positive collection, and the vector corresponding to other is as negative collection；

Vector corresponding to (2) -1 is as positive collection, and the vector corresponding to other is as negative collection；

(3) vector corresponding to A is as positive collection, and the vector corresponding to other is as negative collection；

(4) vector corresponding to B is as positive collection, and the vector corresponding to other is as negative collection；

(5) vector corresponding to C is as positive collection, and the vector corresponding to other is as negative collection；

(6) vector corresponding to D is as positive collection, and the vector corresponding to other is as negative collection；

(7) vector corresponding to E is as positive collection, and the vector corresponding to other is as negative collection.

When support vector machines is as training classifier algorithm, to phoneme of speech sound acoustic feature vector data according to training set It is divided with test set 4:1, input vector number of the training set as support vector machines, support vector machines kernel function selects diameter Xiang Ji (RBF) kernel function.

The range of support vector machines penalty factor c is set as [0,100], and the range of kernel functional parameter g is set as [0,1000].

It is trained respectively using this 7 training sets, then obtains 7 training result files.

When test, corresponding test vector is utilized respectively this four training set results and is tested, it is last every A test has a result f1 (x), f2 (x), f3 (x), f4 (x), f5 (x), f6 (x), f7 (x).Then final classification knot Fruit be in these values maximum one be used as classification results.

In practical bright read error, often the distribution of type of error is simultaneously uneven, such as in a certain area, most of study The pronunciation mistake of person may concentrate in a certain or two classes, and this type of error is commonplace, and some type of errors are more rare See, therefore lead to imbalance problem in sample in type of error training, further to explore optimal modeling method, this literary grace With the thought of transfer learning, extract feature using in the unsupervised training of deepness belief network (DBN), most upper one layer using support to Amount machine (SVM), i.e. DBN+SVM carry out disaggregated model foundation.

Preferably DBN model is determined according to the number and dimension of training set in pronunciation phonemes acoustic feature data set.It determines Hidden layer number, tuning level-learning rate and the number of iterations.Determine parameter training Boltzmann machine.We are by 200 groups of sounds of every class Prime number evidence is divided into training set (160 groups) and test set (40 groups) and is trained modeling using training set, using test set to foundation Pronunciation phonemes error detection disaggregated model tested.It is hidden using 100,200,300,400,500,600,700 and 800 etc. eight Layer neuron node number models feature in single hidden layer.By adjusting parameter model and compare each self-optimizing Output as a result, it has been found that, when taking hidden layer number of nodes [400], available optimal result.After fixed first layer hidden layer number of nodes, The optimal hidden layer number of plies and number of nodes are gradually determined by test method(s), and then optimal mould is obtained by the adjusting to other parameters Type.In the phoneme pronunciation error detection disaggregated model finally established, the iteration that hidden layer number is 5, RBM is 50, DBN network iteration time Number is 1000 times, Batch_size 64, and weight learning rate is 0.000001.It is last knot that test experiments, which are repeated several times, and take mean value Fruit.

Error detection can produce the result of four seed types: 1) correctly receiving (CA), that is, be judged as correct orthoepy Number；2) correct rejection (CR) is judged as incorrect incorrect pronunciations number；3) mistake receives (FA), that is, is judged and is positive True incorrect pronunciations number；And 4) False Rejects (FR), that is, it is judged as the quantity of incorrect orthoepy.Using this four It is a as a result, calculating correct receptance (CAR), correct rejection ratio (CRR).Present embodiment is accurate as identification using CAR and CRR The parameter of measurement of degree, CAR are used for orthoepy, and CRR is used for incorrect pronunciations.

2 two kinds of algorithms of different classifiers of table and the test set of deep learning (DBN+SVM) identification CAR and CRR compare

It can be seen from 2 result of table in phoneme of speech sound pronunciation error detection effect, classification based on support vector machines and Being sorted on recognition accuracy based on decision tree is not much different, and the two is stable to the discrimination of type of error on 80% left side It is right.Recognition effect is preferable.Based on practical reasons, part is pronounced, and mistake is commonplace, and error sample is more, and classifying quality is more preferable, and Part pronunciation type of error is more rare, and training sample amount deficiency may be to lead to the main reason for causing classification accuracy relatively low Table 2 is crossed it can be seen that can averagely improve 2 percentage points on the basis of the above two recognition accuracy using (DBN+SVM).Cause This adds the model of the classification algorithm training of support vector machines to be optimal based on deepness belief network.

Step 7: evaluation of result and error correction, point that the error detection disaggregated model obtained according to the machine learning algorithm provides Class result, it is indicated which kind of mistake is test sample belong to；Classification belonging to mistake of pronouncing is gone out by model prediction.Learner is pronounced wrong Accidentally place and error type feed back to learner and propose correction scheme for it.

Two, pronunciation error detection and interactive mode correct part online

Step 1: obtaining pronunciation data

Fig. 2 is pronunciation error detection and interactive correcting system work flow diagram, after learner's login system, what selection to be learnt Pronounce sentence, according to display text, reads aloud whole sentence, system obtains learner's pronunciation.

Step 2: data processing and pronunciation error detection

The learner's pronunciation data obtained in step 1 is pre-processed, including alignment separation phoneme and feature is forced to mention It takes, processing step and off-line model training department split-phase are same.By treated, data are sent into trained pronunciation error detection model, by mould Type exports learner's pronunciation result.

Step 3: interactive correct, judged according to the pronunciation that system provides, tell the problems of learner's pronunciation, And the correction of articulation is given for the different incorrect pronunciations judged, prompt learner to read aloud again.Persistently correct one's pronunciation. Until phoneme standard.

It is not difficult to find that the present invention is based on machine learning algorithm, using different pronunciation phonemes have different acoustic features this One feature, acquisition and processing to the read aloud foreign language voice segments signal data of different learners obtain its 39+4 under frequency domain Acoustic feature vector is tieed up, it is right by means of having the learning network of supervision or unsupervised learning network as the input of training pattern The acoustic feature vector of extraction carries out learning training and generates acoustics error detection model.Using test set to the mistake of acoustics error detection model Classifying quality is verified, experiments verify that classification accuracy is higher, meets normal learner's pronunciation type of error analysis.And needle Pronunciation evaluation and correction scheme are given to test set verification result.The present invention not only point out learner pronunciation to mistake, Further identified on this basis learner's pronunciation be it is wrong where, and will how improved method feeds back to study Person, the articulation ability of raising learner that in this way can be practical.

Claims

1. a kind of spoken language pronunciation error detection and correcting system based on machine learning characterized by comprising spoken language pronunciation sample is adopted Collect module, for acquiring orthoepy phoneme and different types of incorrect pronunciations phoneme from whole sentence or whole section of spoken language pronunciation；Hair Sound error detection model building module, for extracting acoustic feature to pronunciation phonemes collected and carrying out type mark as pronunciation inspection Mismatch type training sample set generates pronunciation error detection model by machine learning algorithm training；Module is corrected in online error detection, using life At pronunciation error detection model whole sentence that learner is read aloud or whole section of spoken language pronunciation score and phoneme error detection and pronunciation correction.

2. the spoken language pronunciation error detection and correcting system according to claim 1 based on machine learning, which is characterized in that described The case where spoken language pronunciation sample collection module issues sound is divided into voiced segments, mute section and mute section, specifically: by voice The prediction residual energy of signal S (n) is defined as:Wherein, N is frame length, the first reflection coefficient Is defined as:It is segmented according to following rule: if the first reflection coefficient is greater than 0.2, and predicting error energy Amount is greater than 2 times of system threshold value θ, is voiced segments by current speech frame definition；If the first reflection coefficient be greater than 0.3, and Prediction residual energy is greater than system threshold value θ, and the former frame of current speech frame is pronunciation frame, then current speech frame definition is hair Segment；It is mute section for current speech frame definition if being unsatisfactory for above-mentioned two rule.

3. the spoken language pronunciation error detection and correcting system according to claim 1 based on machine learning, which is characterized in that described Spoken language pronunciation sample collection module is realized to obtain pronunciation factor by the way of forcing alignment, specifically: text file is passed through Punctuation mark processing；Audio file is converted into monophonic, is done by endpoint detection processing；Text file is subjected to word to sound Text file is extended to by implying Markov Model state sequence institute group by conversion according to trained acoustic model At search space；Feature extraction is carried out to the voice signal in audio file, according to sequence from front to back frame by frame by voice Feature is aligned with search space composed by corresponding implicit Markov Model state sequence；Dynamic is used to each frame data Regular Viterbi alignment, obtains: Q (t, s)=max_s'{p(x_t,s|s')*Q_v(t-1, s') }, wherein Q (t, s) is that moment t is fallen Best score in search space on some specific implicit Markov Model state s, p (x_t, s | s') it is known previous Latter frame state, which shifts, in the case of frame state is s' hides sequence x_tProbability, x_tIt is implicit Markov state metastasis sequence, s' It is the former frame status of s；In t moment, when there is path to reach active state s_weWhen, wherein s_weIt is that its is optimal for expectation estimation The suffix state node of the current sentence of end time τ counts all active state s at this time_iUpper path assume numberWherein, δ () is indicator function,By all roads Diameter is assumed according to its score sequencing statistical；Count s_weUpper all paths；Remember that Q is assumed in path_k(t,s_we) on a road of all N (t) Ranking is R in diameter_k(t,s_we), then s_weOn path assume in a path N (t) in ranking sample expectationDefinition status active degree isA(t,s_we) be maximized when Quarter is alignment maximum likelihood time t, according to alignment maximum likelihood time t, exports the voice and text justification time letter of sentence Breath；Level is separated according to the phoneme that voice and text justification temporal information read text table, reads phoneme separation level certain At the beginning of a phoneme and end time, progress phoneme cutting obtain pronunciation phonemes.

4. the spoken language pronunciation error detection and correcting system according to claim 1 based on machine learning, which is characterized in that described Error detection model building module pronounce in feature extraction, first divide the data into training data and and test data set, then will acquire Pronunciation phonemes extract the MFCC feature and formant feature of each phoneme pronunciation, primary speech signal is obtained by processing The time-domain signal of each speech frame；Time-domain signal trailing zero is formed to the sequence of a length of N, then passes through discrete Fourier transform Obtain linear spectral；Linear spectral obtains Mel frequency spectrum by Mel frequency filter group, and obtained Mel frequency spectrum takes logarithmic energy to obtain To log spectrum S (m), log spectrum S (m) is obtained into cepstral domains by discrete cosine transform, Mel frequency cepstral can be obtained Coefficient c (n).

5. the spoken language pronunciation error detection and correcting system according to claim 1 based on machine learning, which is characterized in that described Error detection model building module pronounce when being trained, is divided into 7 classifications, respectively -1,1, A, B, C, D and E, wherein -1 Indicate type of error, 1 indicates right type, A, B, C, D, E respectively indicate in mistake classification tongue position is to the front, to the rear, tongue position mistake High, too low and phoneme elongates and shortens class；When extracting training set, the sample of some classification is successively classified as one kind, His remaining sample is classified as another kind of, obtains 7 classifiers in this way；It is right when support vector machines is as training classifier algorithm Phoneme of speech sound acoustic feature vector data is divided according to training set and test set 4:1, and training set is as support vector machines Input vector number, support vector machines kernel function select Radial basis kernel function.

6. the spoken language pronunciation error detection and correcting system according to claim 1 based on machine learning, which is characterized in that described The error detection model building module that pronounces extracts feature using the unsupervised training of deepness belief network in modeling, and most upper one layer using branch Vector machine is held, preferably deepness belief network is determined according to the number and dimension of training set in pronunciation phonemes acoustic feature data set Model；Hidden layer number, fixed first layer are determined by adjusting the mode of parameter model and the output result for comparing each self-optimizing After hidden layer number of nodes, the optimal hidden layer number of plies and number of nodes are gradually determined by test method(s), and then by other parameters It adjusts and obtains optimal models.