CN114783412B - Spanish spoken language pronunciation training correction method and system - Google Patents


Info

Publication number: CN114783412B (application CN202210422182.6A)
Authority: CN (China)
Prior art keywords: corrected, voice, corpus, pronunciation, training
Prior art date
Legal status (assumed, not a legal conclusion): Active
Application number: CN202210422182.6A
Other languages: Chinese (zh)
Other versions: CN114783412A (en)
Inventor: 孙晓萌 (Sun Xiaomeng)
Current Assignee: SHANDONG YOUTH UNIVERSITY OF POLITICAL SCIENCE
Original Assignee
SHANDONG YOUTH UNIVERSITY OF POLITICAL SCIENCE
Priority date
Filing date
Publication date
Application filed by SHANDONG YOUTH UNIVERSITY OF POLITICAL SCIENCE
Priority to CN202210422182.6A
Publication of CN114783412A
Application granted
Publication of CN114783412B

Classifications

    • G PHYSICS / G10 MUSICAL INSTRUMENTS; ACOUSTICS / G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition / G10L15/005 Language recognition
    • G10L15/00 Speech recognition / G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/00 Speech recognition / G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice / G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 / G10L25/03 characterised by the type of extracted parameters / G10L25/24 the extracted parameters being the cepstrum
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 / G10L25/48 specially adapted for particular use


Abstract

The invention belongs to the field of computer-aided teaching and provides a Spanish spoken-language pronunciation training and correction method and system. The method comprises: obtaining a speech corpus to be corrected and extracting its feature parameters; performing pronunciation-error recognition on the corpus according to those feature parameters to obtain a pronunciation-error recognition result; and scoring each index of the corpus separately, pointing out the errors in each Spanish phoneme of the learner's pronunciation, presenting the pronunciation rule for that phoneme, and providing targeted training data and reinforcement training. The feature parameters of the corpus include an MFCC-OVOT mixed feature vector, which combines Mel-frequency cepstral coefficients with an optimized voice onset time; the optimized voice onset time is the interval between the release of the oral occlusion and whichever occurs earlier, the onset of vocal-fold vibration or the end of the corresponding phoneme.

Description

Spanish spoken language pronunciation training and correcting method and system
Technical Field
The invention belongs to the field of computer-aided teaching, and particularly relates to a Spanish spoken language pronunciation training and correcting method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Spanish is a Romance language descended from Latin. The number of people who speak Spanish as their mother tongue is second only to Chinese, ranking second in the world. In Spanish education, spoken-language learning accounts for a very large proportion of the work. However, the articulation of some Spanish phonemes differs greatly from Chinese, and Spanish learners whose mother tongue is Chinese often face many pronunciation difficulties at the beginner stage:
First, regarding consonants: voiceless and voiced consonants are distinguished by whether the vocal folds vibrate, and Spanish distinguishes word meaning by this voicing contrast. The consonants of northern Chinese dialects (including Mandarin), however, do not distinguish voicing at all; they distinguish only aspiration. As a result, most of the time our auditory system cannot perceive the voiced/voiceless difference, and correct pronunciation naturally cannot be achieved. In long-term teaching and surveys at several northern universities, the inventor found that when Chinese learners of Spanish produce unaspirated consonants, they commonly replace unaspirated voiceless consonants with aspirated voiceless ones (for example, producing the unaspirated voiceless consonant p as its aspirated counterpart) and replace voiced consonants with voiceless ones (for example, producing the voiced consonant b as a voiceless consonant). For most Chinese learners of Spanish, the first foreign language acquired is English, in which voiceless stops are mostly aspirated and voiced stops mostly unaspirated; native English listeners can therefore distinguish word meaning by aspiration even when a speaker's voicing is off, so this rarely causes communication problems, and most Chinese learners of English never confront the voicing issue.
In Spanish, stop consonants are unaspirated whether voiceless or voiced, and word meaning is distinguished only by whether the vocal folds vibrate (for example, buda means "Buddha", with voiced unaspirated b and d, while puta means "courtesan", with voiceless unaspirated p and t). Substituting the aspiration contrast for the voicing contrast as described above therefore causes great confusion for native Spanish listeners.
In summary, for Chinese learners, especially northern ones, the voicing contrast of Spanish is harder to hear and to produce correctly than English. For lack of practice and correction, some learners still cannot distinguish voiceless from voiced consonants three or four years after starting, so this distinction has long been a difficulty for Spanish learners whose native language is Chinese, especially a northern dialect.
Second, stress placement is also a frequent source of errors for Chinese learners of Spanish. Because Chinese has no stressed syllables, beginners often place stress at random, obstructing communication. Stress is very important in Spanish: words with identical phonemes change meaning with a change of stress. For example, papá means "dad" while papa means "potato"; tómate is an imperative verb form meaning "eat it", while tomate is a noun meaning "tomato". Compared with Chinese, the stress rules of Spanish are relatively complex: words ending in a vowel, s, or n are stressed on the penultimate syllable, words ending in other consonants on the last syllable, and cases involving diphthongs or two adjacent strong vowels are more complex still, which causes difficulty in daily teaching.
Finally, Mandarin Chinese has no taps or trills, while Spanish has both a single tap and a multiple trill, and they change meaning: pero (single tap) means "but", perro (multiple trill) means "dog", and pelo (no tap or trill) means "hair". Chinese learners of Spanish tend to substitute a non-trilled phoneme where a tap or trill is required, or to produce a trill where a tap belongs and a tap where a trill belongs, often causing semantic confusion.
In view of the above problems, traditional spoken-language correction relies mainly on teachers correcting students one by one in class and on students repeatedly reading along with standard recordings.
The traditional pronunciation training and correction method has the following drawbacks. First, in classroom practice a teacher can correct only one student's pronunciation at a time, which is time-consuming, labor-intensive, and inefficient, and outside class the teacher cannot correct pronunciation errors immediately. Second, because adults lose auditory sensitivity to non-native speech sounds, and this desensitization is compounded by a limited language environment and by interference from a first foreign language such as English, it is difficult for learners to correct their own errors merely by listening to standard recordings.
Computer-assisted language learning (CALL) systems can be designed to help solve the above problems.
In the course of implementing the invention, the inventor found that known computer-assisted language teaching technology has the following shortcomings for Spanish teaching:
(1) It can only give a single overall pronunciation score and cannot evaluate the specific difficulties of Spanish learners;
(2) It cannot automatically prescribe corrective training for the specific pronunciation problem;
(3) It cannot keep the teacher informed of students' progress, nor feed that progress back to the spoken-language teacher accurately and promptly;
(4) It cannot help the teacher correct problematic pronunciations one by one in a timely manner.
Disclosure of Invention
To solve the technical problems in the background art, the invention provides a Spanish spoken-language pronunciation training and correction method and system that evaluate a trainee's spoken input on all relevant aspects of pronunciation quality, point out the specific problems in the pronunciation, provide a targeted training method, and supervise and feed back the correction result, repeating these steps until the learner forms a conditioned reflex. The learning record is also uploaded, and the teacher decides whether to give feedback and demonstrate in person. This speeds up accurate pronunciation correction in foreign-language learning, automates part of the teacher's work, and markedly lightens the spoken-language teacher's load.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a Spanish spoken language pronunciation training correction method.
A Spanish spoken language pronunciation training correction method comprises the following steps:
acquiring a speech corpus to be corrected and extracting feature parameters of the speech corpus to be corrected;
performing pronunciation-error recognition on the speech corpus to be corrected according to its feature parameters to obtain a pronunciation-error recognition result;
scoring each index of the speech corpus to be corrected separately, pointing out the errors in each Spanish phoneme of the pronunciation, presenting the pronunciation rule for that phoneme, and providing targeted training data and reinforcement training;
wherein the feature parameters of the speech corpus to be corrected include a Mel-Frequency Cepstral Coefficients-Optimized Voice Onset Time (hereinafter MFCC-OVOT) mixed feature vector, which combines Mel-frequency cepstral coefficients with an optimized voice onset time; the optimized voice onset time is the interval between the release of the oral occlusion and whichever occurs earlier, the onset of vocal-fold vibration or the end of the corresponding phoneme.
A second aspect of the invention provides a Spanish spoken-language pronunciation training correction system.
A Spanish spoken-language pronunciation training correction system, comprising:
an acquisition module configured to: acquire a speech corpus to be corrected and extract feature parameters of the speech corpus to be corrected;
a recognition module configured to: perform pronunciation-error recognition on the speech corpus to be corrected according to its feature parameters to obtain a pronunciation-error recognition result;
an output training module configured to: score each index of the speech corpus to be corrected separately, point out the errors in each Spanish phoneme of the pronunciation, present the pronunciation rule for that phoneme, and provide targeted training data and reinforcement training;
wherein the feature parameters of the speech corpus to be corrected include an MFCC-OVOT mixed feature vector combining Mel-frequency cepstral coefficients with an optimized voice onset time; the optimized voice onset time is the interval between the release of the oral occlusion and whichever occurs earlier, the onset of vocal-fold vibration or the end of the corresponding phoneme.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium on which a computer program is stored, which program, when executed by a processor, implements the steps of the Spanish spoken-language pronunciation training correction method according to the first aspect.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the Spanish spoken-language pronunciation training correction method according to the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, pronunciation is corrected by giving a judging method frequently used by a teacher when the teacher is in class, the reasonability and the accuracy of pronunciation training are improved, total scores are obtained by weighting the scores of the items according to the corresponding weight coefficients, the learning progress can be vividly displayed by displaying historical scores, the learning enthusiasm is improved, and the teacher can set the weight coefficients of various indexes for weighting according to different questions, so that the scoring method is more flexible; by feeding back pronunciation error information, the trainee can more clearly know the pronunciation problem of the trainee. And a teacher can conveniently and rapidly master the learning condition, and the quality of teaching work is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain, not limit, the invention.
FIG. 1 is a flow chart of the Spanish spoken-language pronunciation training correction method of the present invention;
FIG. 2 is a block diagram of the Spanish spoken-language pronunciation training correction system of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
It should be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the logical function specified in the various embodiments. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example one
As shown in fig. 1, this embodiment provides a Spanish spoken-language pronunciation training correction method. The embodiment illustrates the method as applied to a server; it will be understood that the method may also be applied to a terminal, or to a system comprising a terminal and a server that interact with each other. The server may be an independent physical server, a server cluster or distributed system of multiple physical servers, or a cloud server providing basic cloud-computing services such as cloud databases, cloud functions, cloud storage, network servers, cloud communication, middleware services, domain-name services, security services, a CDN (content delivery network), and big-data and artificial-intelligence platforms. The terminal may be, but is not limited to, a smartphone, tablet computer, laptop, desktop computer, smart speaker, or smart watch. The terminal and the server may be connected directly or indirectly by wired or wireless communication, which the application does not limit. In this embodiment, the method includes the following steps:
acquiring a speech corpus to be corrected and extracting feature parameters of the speech corpus to be corrected;
performing pronunciation-error recognition on the speech corpus to be corrected according to its feature parameters to obtain a pronunciation-error recognition result;
scoring each index of the speech corpus to be corrected separately, pointing out the errors in each Spanish phoneme of the pronunciation, presenting the pronunciation rule for that phoneme, and providing targeted training data and reinforcement training;
wherein the feature parameters of the speech corpus to be corrected include an MFCC-OVOT mixed feature vector combining Mel-frequency cepstral coefficients with an optimized voice onset time; the optimized voice onset time is the interval between the release of the oral occlusion and whichever occurs earlier, the onset of vocal-fold vibration or the end of the corresponding phoneme.
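As an illustration of how such a mixed vector can be assembled, the sketch below computes per-frame MFCCs with a minimal mel filterbank and DCT and appends the OVOT value as one extra dimension. The sizes used (26 filters, 13 cepstral coefficients) are common defaults, not values fixed by the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    # power spectrum -> mel filterbank energies -> log -> DCT-II
    spec = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, len(frame), sr) @ spec
    log_e = np.log(energies + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_e

def mfcc_ovot_vector(frame, sr, ovot_ms):
    """Concatenate the per-frame MFCCs with the scalar OVOT feature."""
    return np.concatenate([mfcc_frame(frame, sr), [ovot_ms]])
```

Appending OVOT as one extra dimension is the simplest way to realize a "mixed" vector; the patent does not specify the exact fusion scheme.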
Specifically, the present embodiment can be implemented by the following steps:
giving a reference text library and marking key information of key words;
recording standard reference text library voice and establishing a standard correct voice corpus model;
recording different standard error voices according to the reference text and establishing a standard error model;
giving a reference text and recording the voice to be corrected read according to the reference text;
preprocessing the recorded speech to obtain the speech corpus to be corrected, and extracting feature parameters of the speech corpus to be corrected;
segmenting the speech corpus to be corrected according to the reference text;
scoring each scoring item of the speech corpus to be corrected based on the segmentation;
if the score of a scoring item falls below a threshold, marking that item as needing correction, and matching the marked speech corpus against the standard error models to obtain the error type of the speech to be corrected;
matching the error type against pre-established correction methods, and presenting the correction method together with mnemonic pronunciation rules where applicable;
recording the learning progress, displaying it visually to the trainee, and repeating the training and correction until each item's score reaches a preset value;
and transmitting the training and correction results to the teacher's end, where the teacher decides whether to give feedback and teach in person.
In an optional implementation, marking key information of keywords means marking the points where Chinese native speakers are prone to error, such as the voicing of the keyword's consonants, aspiration, stress position, and taps and trills (single tap and multiple trill).
In an alternative embodiment, extracting the feature parameters of the standard correct speech corpus comprises: extracting MFCC-OVOT mixed feature vectors in the low-noise environment of conventional speech training.
In an alternative embodiment, the extracting the feature parameter model of the standard erroneous speech corpus includes:
the method comprises the steps of extracting characteristic parameters of standard error speech corpora by adopting a Gaussian mixture Model-Universal Background Model (hereinafter referred to as GMM-UBM Model), and using MFCC-OVOT mixed characteristic vectors as the characteristic parameters of the standard error speech corpora in a low-noise environment of conventional speech training. Under the condition of low sample size of standard errors, taking the feature vector of the correct pronunciation corresponding to each standard error as an input vector of the GMM-UBM model, and training by an EM (expectation maximization) algorithm to obtain the UBM model;
and performing adaptive transformation (MAP) on the feature vector of each standard error through a UBM model to obtain a GMM model of each standard error as each error model of the standard errors. The problem that the number of standard error samples is not enough is solved through the GMM-UBM model, the characteristic information of the standard error is better described, and the accuracy of standard error identification is improved.
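A minimal numpy sketch of the GMM-UBM step described above: a diagonal-covariance GMM is fitted with EM as the UBM, and Reynolds-style mean-only MAP adaptation derives a per-error GMM from scarce standard-error samples. The relevance factor of 16 and the component count are conventional choices assumed here, not values from the patent:

```python
import numpy as np

def _responsibilities(x, weights, means, variances):
    # per-frame, per-component posterior occupancies for a diagonal GMM
    logp = (np.log(weights)
            - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
            - 0.5 * (((x[:, None, :] - means) ** 2) / variances).sum(axis=2))
    logp -= logp.max(axis=1, keepdims=True)
    resp = np.exp(logp)
    return resp / resp.sum(axis=1, keepdims=True)

def gmm_em_diag(x, k, iters=30, seed=0):
    """Fit a diagonal-covariance GMM with EM; returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    means = x[rng.choice(n, k, replace=False)]
    variances = np.tile(x.var(axis=0) + 1e-3, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        resp = _responsibilities(x, weights, means, variances)   # E-step
        nk = resp.sum(axis=0) + 1e-10                            # M-step
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        variances = (resp.T @ (x ** 2)) / nk[:, None] - means ** 2 + 1e-4
    return weights, means, variances

def map_adapt_means(ubm, x, relevance=16.0):
    """Mean-only MAP adaptation: shift each UBM mean toward the standard-error
    data in proportion to its soft occupancy count."""
    weights, means, variances = ubm
    resp = _responsibilities(x, weights, means, variances)
    nk = resp.sum(axis=0)
    ex = (resp.T @ x) / np.maximum(nk[:, None], 1e-10)
    alpha = (nk / (nk + relevance))[:, None]
    return alpha * ex + (1.0 - alpha) * means
```

Components with little adaptation data keep their UBM means (alpha near 0), which is exactly how GMM-UBM copes with the low sample size the text mentions.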
Note that Voice Onset Time (VOT) is defined as the interval between the release of the oral occlusion and the onset of vocal-fold vibration. A positive VOT (release before voicing) indicates a voiceless consonant; if the VOT reaches roughly 20 ms or more, the consonant can be considered aspirated; a zero or negative VOT indicates a voiced consonant. The Optimized Voice Onset Time (OVOT) is the interval between the release and whichever occurs earlier, the onset of vocal-fold vibration or the end of the corresponding phoneme; capping at the phoneme end effectively avoids miscounting when two Spanish consonants are adjacent. For the consonant errors corrected in this embodiment, the release time may simply be taken as the phoneme start time after the corpus has been segmented into phonemes. Optionally, the voicing time is found by exploiting the periodicity of voiced speech: the correlation coefficient of the time-domain waveform is computed, and voicing is taken to begin where the coefficient exceeds a preset value. OVOT thus marks whether a consonant is aspirated; because Spanish has no aspirated consonants, using the MFCC-OVOT mixed feature vector instead of the common plain MFCC vector distinguishes aspiration errors in Spanish pronunciation more effectively.
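The OVOT computation just described can be sketched as follows: the release time is taken as the phoneme start, the voicing onset is the first frame whose normalized autocorrelation in the plausible pitch-lag range exceeds a preset value, and the onset is capped at the phoneme end. The thresholds and pitch range are illustrative assumptions:

```python
import numpy as np

def voicing_onset(frames, sr, corr_thresh=0.5, fmin=75, fmax=400):
    """Index of the first frame judged voiced: the normalized autocorrelation
    within the pitch-lag range exceeds corr_thresh, indicating periodic
    vocal-fold vibration. Returns None if no frame is voiced."""
    lag_lo, lag_hi = sr // fmax, sr // fmin
    for i, frame in enumerate(frames):
        f = frame - frame.mean()
        denom = float(np.dot(f, f))
        if denom < 1e-12:          # silent frame, skip
            continue
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]
        if ac[lag_lo:lag_hi].max() / denom > corr_thresh:
            return i
    return None

def ovot(release_time, phoneme_end, voicing_time):
    """OVOT = (earlier of voicing onset and phoneme end) - occlusion release."""
    return min(voicing_time, phoneme_end) - release_time
```

A large positive OVOT flags an aspirated realization, which is exactly the error class the MFCC-OVOT vector is meant to expose.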
In an optional embodiment, preprocessing the recorded speech to be corrected comprises pre-emphasis, framing, windowing, and endpoint detection. The high-frequency part of speech attenuates relatively strongly in propagation; pre-emphasis boosts it, restoring the high band and flattening the spectrum. An audio signal can be treated as stationary over a short interval, so framing the speech signal yields short, quasi-stationary segments for further processing. Because adjacent frames are correlated, overlapping framing is used; optionally, a 30% overlap between frames preserves the correlation between speech signals and ensures smooth transitions between frames of the corpus. To reduce the spectral leakage caused by the fast Fourier transform of each frame, each frame is windowed; optionally, a Hamming window function is applied to each frame of the speech corpus to be corrected.
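The preprocessing chain up to windowing can be sketched as below; the frame length of 400 samples (25 ms at 16 kHz) and the pre-emphasis coefficient 0.97 are common choices assumed for illustration:

```python
import numpy as np

def preprocess(signal, frame_len=400, overlap=0.3, alpha=0.97):
    """Pre-emphasis boosts the attenuated high band, then the signal is cut
    into 30%-overlapping frames and each frame is Hamming-windowed to limit
    spectral leakage in the later FFT."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    hop = int(frame_len * (1.0 - overlap))        # 30% overlap between frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([emphasized[i * hop: i * hop + frame_len] * window
                     for i in range(n_frames)])
```

The windowed frames feed directly into endpoint detection and MFCC extraction.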
Optionally, endpoint detection on the speech corpus to be corrected uses a double-threshold method based on time-domain short-time energy and short-time zero-crossing rate. The double-threshold method is simple and effective, avoids the influence of small noises, and is sufficient for this method and system; the endpoint-detected corpus serves as input to the next processing stage.
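A sketch of such a double-threshold detector: a segment is seeded where short-time energy clears a high threshold and then extended while energy stays above a low threshold or the zero-crossing rate stays high (keeping low-energy unvoiced fricatives). The threshold values are assumptions, not from the patent:

```python
import numpy as np

def zero_crossing_rate(frame):
    return float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)

def short_time_energy(frame):
    return float(np.sum(frame ** 2))

def endpoint_detect(frames, e_hi, e_lo, zcr_thresh):
    """Double-threshold VAD: returns (start_frame, end_frame) or None."""
    e = np.array([short_time_energy(f) for f in frames])
    z = np.array([zero_crossing_rate(f) for f in frames])
    active = e > e_hi
    if not active.any():
        return None
    start = int(np.argmax(active))
    end = len(frames) - 1 - int(np.argmax(active[::-1]))
    # extend the segment with the lower thresholds
    while start > 0 and (e[start - 1] > e_lo or z[start - 1] > zcr_thresh):
        start -= 1
    while end < len(frames) - 1 and (e[end + 1] > e_lo or z[end + 1] > zcr_thresh):
        end += 1
    return start, end
```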
The preprocessing of the voice corpus to be corrected can be realized by pre-emphasis, framing, windowing and end point detection of the voice corpus, so that the characteristic parameters can be better extracted in the following steps.
In an optional implementation, segmenting the speech corpus to be corrected according to its feature parameters comprises:
adjusting the HMM boundaries with the Viterbi algorithm to maximize likelihood under a hidden Markov model built from the given standard speech corpus. Given the reference text, the speech corpus to be corrected is segmented into phonemes and words, and the start position of each word and phoneme is determined.
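The boundary adjustment can be illustrated with a toy Viterbi forced alignment over a left-to-right chain in which each phoneme is a single one-dimensional Gaussian state, a deliberate simplification of the HMM in the text; self-loop and advance transitions both cost log 0.5:

```python
import numpy as np

def log_gauss(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def force_align(obs, means, variances):
    """Most likely state (phoneme) index per frame; phoneme boundaries are the
    frames where the state index advances."""
    T, S = len(obs), len(means)
    logp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    logp[0, 0] = log_gauss(obs[0], means[0], variances[0])
    trans = np.log(0.5)
    for t in range(1, T):
        for s in range(S):
            stay = logp[t - 1, s]
            move = logp[t - 1, s - 1] if s > 0 else -np.inf
            if stay >= move:
                logp[t, s], back[t, s] = stay + trans, s
            else:
                logp[t, s], back[t, s] = move + trans, s - 1
            logp[t, s] += log_gauss(obs[t], means[s], variances[s])
    # backtrace from the final state so every phoneme is consumed
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [int(s) for s in path[::-1]]
```

A real system aligns MFCC vectors against triphone HMMs, but the dynamic-programming boundary search is the same.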
In an optional implementation, recognizing and scoring voicing errors in the speech corpus to be corrected comprises: extracting the feature parameters of the corpus; computing, with the GMM, the probability of those feature parameters against the feature parameters of the standard corpus; and scoring the corpus according to that probability.
In an alternative embodiment, matching the speech corpus marked as needing correction against the standard error models comprises: using the extracted feature parameters of the speech to be corrected, and matching them against speech models pre-built from the feature parameters of the standard-error speech to obtain a matching result. Concretely, the GMM computes the probability of the corpus's feature parameters under the standard-error model; the ratio of this probability to the probability computed under the standard correct model is calculated, and if the ratio exceeds a preset value, the speech to be corrected is judged to have the corresponding standard-error problem. Combining the corpus's correlation with the standard correct corpus and with the standard-error corpora improves detection accuracy.
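In log form, that probability-ratio test reads as follows; the models are diagonal GMMs given as (weights, means, variances) tuples, and the threshold of 0 (ratio 1 in the linear domain) is an assumed default:

```python
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Mean per-frame log-likelihood of feature frames x under a diagonal GMM."""
    comp = (np.log(weights)
            - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
            - 0.5 * (((x[:, None, :] - means) ** 2) / variances).sum(axis=2))
    m = comp.max(axis=1, keepdims=True)           # log-sum-exp over components
    return float(np.mean(m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))))

def flag_standard_errors(x, correct_model, error_models, log_ratio_thresh=0.0):
    """Indices of standard-error models whose average log-likelihood exceeds
    that of the standard-correct model by more than the threshold."""
    base = diag_gmm_loglik(x, *correct_model)
    return [i for i, m in enumerate(error_models)
            if diag_gmm_loglik(x, *m) - base > log_ratio_thresh]
```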
Optionally, the method for providing correction training includes:
giving a mouth shape cartoon with standard correct pronunciation;
giving a spectrogram of standard correct pronunciation;
giving a spectrogram of the input voice to be trained;
giving correction skills and methods for the pronunciation error, pre-recorded by a spoken-language teacher;
and giving the next training content for the pronunciation error.
In an alternative embodiment, the recording of the progress of learning and the visual presentation to the trainee includes recording the historical performance and presenting it in an intuitive manner. Preferably, the historical results are displayed in a histogram in chronological order.
Preferably, transmitting the training correction situation to the teacher end may be implemented by transmitting it to a cloud server and pushing a training briefing to the teacher; the teacher then obtains detailed information from the server according to the briefing (via an app on a mobile phone, an executable program on a personal PC, or a web client) and feeds it back to the trainee end as needed.
Example two
Stressed phonemes exhibit higher energy than the other phonemes within the word, and a longer pronunciation length than when unstressed. Therefore, whether the stress position is wrong can be determined by comparing the speech corpus to be trained with the standard correct speech corpus and calculating the average energy and pronunciation length of the phoneme at the stress position.
The embodiment provides a Spanish spoken language pronunciation training and correcting method.
This embodiment, in addition to all the contents of the first embodiment, further defines the specific steps of scoring the stress position error and performing error matching as follows:
extracting the average short-time energy of each phoneme of the standard correct speech corpus;
extracting the duration of each phoneme of the standard correct speech corpus;
calculating the ratio of the average short-time energy of each accent phoneme of the standard correct speech corpus to the average value of the average short-time energy of other phonemes in the same word, namely the relative strength E;
calculating the ratio of the average duration of each stressed phoneme of the standard correct speech corpus to the average duration of other phonemes in the same word, namely the relative duration T;
calculating the stress weighted value of the standard correct speech corpus: W = E × c + T × (1 − c), where E is the relative strength, c ∈ (0, 1) is a predetermined constant, and T is the relative duration;
obtaining the average short-time energy E of each phoneme of the phonetic corpus to be corrected t
Extracting the duration T of each phoneme of the speech corpus to be corrected t
Extracting the accent weighted value of each phoneme in the phonetic corpus to be corrected: w t =E t ×c+T t ×(1-c);
The weighted value of the stress is the maximum value, namely the actual stress phoneme of the phonetic corpus to be corrected;
matching the actual stress phoneme position with the known stress in the standard correct phonetic corpusComparing phoneme positions, if not, prompting an accent position error and giving a correct accent phoneme and an accent rule (if applicable), and if so, calculating the ratio R = W of the accent weighted value of the to-be-corrected phonetic corpus to the accent weighted value of the standard correct phonetic corpus t When the value of normal R is about 1 divided by W, and the weight is too heavy or too light if the value deviates from a preset range.
Scoring the stress of the voice corpus to be corrected according to the stress position error and the ratio R;
preferably, the weighted score of the stress position in the speech corpus to be corrected can be expressed as:

S = (∑_correct R) / N_total

where ∑_correct indicates that the summation ranges over the stresses at the correct position, and N_total represents the total number of stresses in the reference text.
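The stress-scoring steps of this embodiment — relative strength E, relative duration T, weight W = E × c + T × (1 − c), argmax comparison, ratio R, and score S = (∑_correct R) / N_total — can be sketched as follows; the data layout (per-word energy/duration arrays plus a labeled stress index) is an assumption made for illustration.

```python
import numpy as np

def stress_weight(energy, dur, i, c=0.5):
    """W = E*c + T*(1-c) for phoneme i: relative strength E and relative
    duration T against the mean of the other phonemes in the same word."""
    E = energy[i] / np.delete(energy, i).mean()
    T = dur[i] / np.delete(dur, i).mean()
    return E * c + T * (1 - c)

def stress_score(words, c=0.5):
    """Weighted stress score S = (sum of R over correctly placed stresses)
    / N_total, where R = W_t / W compares the trainee's stress weight with
    the reference weight and is normally about 1."""
    acc = 0.0
    for w in words:
        e = np.asarray(w["energy"], float)
        d = np.asarray(w["dur"], float)
        W_t = np.array([stress_weight(e, d, i, c) for i in range(len(e))])
        actual = int(W_t.argmax())
        if actual == w["stress"]:            # stress placed on the right phoneme
            W_ref = stress_weight(np.asarray(w["ref_energy"], float),
                                  np.asarray(w["ref_dur"], float),
                                  w["stress"], c)
            acc += W_t[actual] / W_ref       # the ratio R
    return acc / len(words)
```

For a word whose stressed phoneme matches the reference exactly, R is 1 and the word contributes its full share to S.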
EXAMPLE III
Trills in Spanish are produced by tongue tremor at the gums (alveolar trills) and are classified into single-click trills and multi-click trills. A single-click trill means the tongue tip strikes the palate once during pronunciation; a multi-click trill means the tongue tip strikes the palate three or more times, each vibration lasting approximately as long as a single click. On a time-domain diagram, the duration of a single-click trill is generally about 30-40 ms, with the energy concentrated at about 0.6 of the total phoneme duration; the duration of a multi-click trill is generally three or more times that of a single-click trill and appears as several single clicks spliced together, with the energy concentrated at about 0.6 of the duration of each click. Single-click and multi-click trills can therefore easily be distinguished from the periodicity of the phoneme duration and the short-time energy density.
The embodiment provides a spanish spoken language pronunciation training and correcting method, which comprises all the contents of the first embodiment, and further defines the specific steps of scoring a trill error and performing error matching as follows:
extracting the duration of the trill phoneme of the to-be-corrected speech corpus;
acquiring the average short-time energy of the trill phonemes of the to-be-corrected speech corpus;
the first time the short-time energy exceeds a certain threshold is regarded as a percussion; if the energy then stays below the threshold for a continuous period of time, this is regarded as a percussion gap. The number of percussions is counted: three or more indicate a multi-click trill, and one indicates a single-click trill;
comparing the calculated single/multi-click trill result with the pre-labeled correct pronunciation, and giving a prompt if the result is wrong;
and giving a trill score according to the number of correct trills divided by the total number of trills in the text.
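The threshold-based tap counting described above can be sketched as follows; the frame-gap parameter and the "unclear" label for exactly two taps are illustrative choices not fixed by the text.

```python
def count_taps(energy, threshold, min_gap=2):
    """Count tongue-tip taps in a per-frame short-time-energy contour:
    a tap begins when energy rises above `threshold`, and ends once the
    energy stays below the threshold for `min_gap` consecutive frames
    (the percussion gap)."""
    taps, in_tap, below = 0, False, 0
    for e in energy:
        if e > threshold:
            if not in_tap:
                taps += 1
                in_tap = True
            below = 0
        elif in_tap:
            below += 1
            if below >= min_gap:
                in_tap = False
    return taps

def classify_trill(energy, threshold):
    """Three or more taps -> multi-click trill; one tap -> single-click
    trill; two taps are left undecided in this sketch."""
    n = count_taps(energy, threshold)
    if n >= 3:
        return "multiple"
    return "single" if n == 1 else "unclear"
```

A contour with three energy bursts separated by below-threshold gaps is classified as a multi-click trill, a single burst as a single-click trill.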
Example four
As shown in fig. 2, the present embodiment provides a spanish spoken language pronunciation training correction system.
A spanish spoken utterance training correction system, comprising:
an acquisition module configured to: acquiring a voice corpus to be corrected, and extracting characteristic parameters of the voice corpus to be corrected;
an identification module configured to: performing pronunciation error recognition on the voice to be corrected according to the characteristic parameters of the voice corpus to be corrected to obtain a pronunciation error recognition result of the voice to be corrected;
an output training module configured to: scoring each index of the speech to be corrected respectively, indicating the errors in each Spanish phoneme pronunciation of the speech to be corrected, and giving the pronunciation rule of the phoneme together with targeted training data and reinforcement training;
the characteristic parameters of the speech corpus to be corrected comprise an MFCC-OVOT mixed feature vector; the MFCC-OVOT mixed feature vector comprises Mel-frequency cepstral coefficients and an optimized voice onset time, where the optimized voice onset time refers to the difference between the earlier of the vocal-cord vibration onset and the end of the corresponding phoneme, and the time at which the oral occlusion is released.
Specifically, the acquiring module in this embodiment includes a reference corpus module, a standard correct speech corpus module corresponding to the reference corpus, a standard incorrect speech corpus module corresponding to the reference corpus, a question generating module, a recording and processing module for the speech prediction to be corrected, and a speech corpus segmentation module. The recognition module comprises a scoring module and an error matching and correcting module of each scoring item of the voice corpus to be corrected in the following modules. The output training module comprises a training progress recording and displaying module and a teacher interaction module in the following modules. It should be noted that the present embodiment is not limited to the above three modules. The specific technical solution of this embodiment can be implemented by referring to the following modules:
providing a reference text library module for recording text contents used for training, keywords of the text and key information;
a standard correct speech corpus module corresponding to the reference text library, which is used for establishing a standard correct speech corpus model corresponding to the text;
a standard error speech corpus module corresponding to the reference text library, which is used for establishing a standard error speech corpus model corresponding to the text;
the question setting module is used for taking out a text to be trained from the reference text library according to the training history of the trainer;
the recording and processing module is used for recording and preprocessing the voice to be corrected and extracting the characteristic parameters of the voice corpus to be corrected;
the voice corpus segmentation module is used for segmenting the voice corpus;
the scoring module of each scoring item of the voice corpus to be corrected is used for scoring the score of each scoring item of the voice corpus to be corrected;
the error matching and correcting module is used for matching main errors of the voice corpus to be corrected and providing a training improvement method;
the training progress recording and displaying module is used for recording the scores of the previous training and visually displaying the learning progress to the trainees;
and the teacher interaction module is used for transmitting the training correction condition to the teacher end and determining whether to feed back the training correction condition or not by the teacher.
In an alternative embodiment, the reference corpus module contains the trained text content together with keywords of the text and key information, the keyword key information comprising: the pronunciation and aspiration of the unvoiced and voiced consonants of the keywords, the stress positions, the trills (single-click and multi-click trills), and other places where native Chinese speakers easily make mistakes.
In an optional embodiment, the module of the standard correct speech corpus of the speech corresponding to the reference text library includes pre-emphasizing, framing, windowing, detecting an end point, and extracting feature parameters of the standard correct speech corresponding to the reference text library.
Preferably, the pre-emphasis boosts the high-frequency part to restore the high-frequency content of the signal and flatten the spectrum. The audio signal can be regarded as stationary over a short period, so framing segments the speech signal to obtain short-time signals that are stable enough for further processing. Because the speech between adjacent frames has a certain correlation, overlapped framing is adopted; optionally, each frame overlaps its neighbor by 30%, preserving the correlation between speech signals and ensuring a smooth transition between frames of the speech corpus. Windowing is applied to each frame to reduce the spectral leakage caused by the fast Fourier transform; as an option, a Hamming window function is used to window each frame of the speech corpus to be corrected.
Endpoint detection is performed on the speech corpus to be corrected by a double-threshold method based on short-time energy and short-time zero-crossing rate. The double-threshold method is simple and effective, avoids the influence of small noises, and is accurate enough for this method/system; the speech corpus after endpoint detection serves as the input of the next processing stage. Pre-emphasis, framing, windowing and endpoint detection together complete the preprocessing of the speech corpus, so that characteristic parameters can be better extracted in the following steps.
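The preprocessing chain (pre-emphasis, 30%-overlap framing, Hamming windowing, double-threshold endpoint detection) can be sketched in numpy as below; the concrete frame length, pre-emphasis coefficient, and the omission of the zero-crossing-rate branch are simplifications of this sketch, not values fixed by the patent.

```python
import numpy as np

def preprocess(x, frame_len=240, overlap=0.3, alpha=0.97):
    """Pre-emphasis, framing with 30% overlap, and Hamming windowing.
    Assumes len(x) >= frame_len; returns an (n_frames, frame_len) array."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])        # pre-emphasis filter
    hop = int(frame_len * (1 - overlap))               # 30% overlap -> 70% hop
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)              # windowing

def endpoint_detect(frames, e_hi, e_lo):
    """Double-threshold endpoint detection on short-time energy: speech must
    exceed the high threshold e_hi somewhere, and the segment is extended
    outward while energy stays above the low threshold e_lo. (The
    zero-crossing-rate check of the full method is omitted here.)"""
    energy = (frames ** 2).sum(axis=1)
    above = np.where(energy > e_hi)[0]
    if len(above) == 0:
        return None                                    # no speech found
    start, end = above[0], above[-1]
    while start > 0 and energy[start - 1] > e_lo:
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > e_lo:
        end += 1
    return start, end
```

The low threshold captures the quieter onset and tail of an utterance that the high threshold alone would clip, which is the point of the double-threshold scheme.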
Preferably, extracting the feature parameter model of the standard correct speech corpus includes: extracting the feature parameters of the standard correct speech corpus using MFCC-OVOT (Mel-frequency cepstral coefficient plus optimized voice onset time) mixed feature vectors under the low-noise environment of conventional speech training. Using MFCC-OVOT mixed feature vectors, rather than conventional MFCC feature vectors alone, can more effectively detect the Spanish pronunciation error in which unvoiced consonants are voiced.
In an alternative embodiment, the standard incorrect corpus module of the reference corpus is configured to build a standard erroneous speech corpus model corresponding to the text in the corpus; the same steps as in the standard correct speech corpus module may be adopted, except that the object being modeled is the standard erroneous speech corpus and a GMM-UBM model is used to extract the feature parameters. The GMM-UBM model alleviates the problem of insufficient standard-error samples, better describes the characteristic information of the standard errors, and improves the accuracy of standard-error recognition.
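The GMM-UBM idea — compensating for scarce standard-error samples by adapting a universal background model toward them — is commonly realized by MAP mean adaptation. The sketch below shows mean-only MAP adaptation with a relevance factor; it is the standard recipe, not necessarily the exact procedure used in the patent.

```python
import numpy as np

def map_adapt_means(X, means, variances, weights, r=16.0):
    """MAP-adapt the UBM component means toward scarce error data X.
    means/variances: (K, D) diagonal-Gaussian parameters; weights: (K,).
    Components that explain little of X keep their UBM means (small soft
    count n -> small adaptation weight a = n / (n + r))."""
    # per-frame, per-component log-likelihood under diagonal Gaussians
    ll = -0.5 * (((X[:, None, :] - means) ** 2 / variances).sum(-1)
                 + np.log(2 * np.pi * variances).sum(-1))
    logw = np.log(weights) + ll
    logw -= logw.max(axis=1, keepdims=True)
    gamma = np.exp(logw)
    gamma /= gamma.sum(axis=1, keepdims=True)          # responsibilities
    n = gamma.sum(axis=0)                              # soft counts per component
    Ex = ((gamma[:, :, None] * X[:, None, :]).sum(axis=0)
          / np.maximum(n, 1e-8)[:, None])              # data means per component
    a = (n / (n + r))[:, None]                         # adaptation weight
    return a * Ex + (1 - a) * means                    # adapted means
```

A component near the error data moves most of the way toward the data mean, while a distant component stays essentially at its background value, which is how the model remains robust when error samples are few.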
In an optional implementation, the question setting module is used to take a text to be trained from the reference text library according to the training history of the trainee: if the trainee is training for the first time, a text is given at random for the trainee to read; otherwise, according to the results of previous trainings, texts weighted toward the items with lower scores are selected for intensive training.
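The question-setting policy (random text for a first-time trainee, otherwise texts stressing the lowest-scoring item) might look like the sketch below; the history and text data structures are hypothetical, introduced only for illustration.

```python
import random

def pick_text(history, texts, rng=random):
    """Choose the next training text. `history` maps scoring item -> latest
    score (empty for a first-time trainee); each text lists the items it
    emphasizes. A first-time trainee gets a random text; otherwise texts
    covering the trainee's weakest item are preferred."""
    if not history:
        return rng.choice(texts)                 # first training: random
    weakest = min(history, key=history.get)      # lowest-scoring item
    candidates = [t for t in texts if weakest in t["items"]]
    return rng.choice(candidates) if candidates else rng.choice(texts)
```

With a history of {"trill": 60, "stress": 90}, the module prefers a text that exercises trills, matching the "train the weak items intensively" behavior described above.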
In an optional implementation, the recording and processing module for the speech to be corrected is configured to record the speech corpus to be corrected, perform preprocessing, and extract the characteristic parameters. In an alternative embodiment, the recording, preprocessing and feature extraction may use the same method steps as the standard correct speech corpus module for the reference corpus, except that the object processed is the trainee's recorded speech to be corrected.
In an optional implementation, the speech corpus segmentation module is configured to segment the speech corpus, and may adjust the HMM boundaries according to the Viterbi algorithm to maximize the likelihood under a hidden Markov model (HMM) of the given speech corpus. Given the reference text, the speech corpus to be corrected is segmented into phonemes and words, and the starting position of each word and phoneme is determined. Under a normal low-noise environment, the segmentation can reach sufficient accuracy.
In an optional embodiment, the scoring module for each scoring item of the speech corpus to be corrected comprises an unvoiced consonant pronunciation scoring module, an accent scoring module and a vibrato (single-click vibrato, multi-click vibrato) scoring module.
Preferably, the unvoiced and voiced consonant pronunciation scoring module calculates probability through GMM according to the characteristic parameters of the to-be-corrected voice corpus and the characteristic parameters of the standard voice corpus, and scores the to-be-corrected voice corpus according to the probability.
In an optional implementation, the error matching and correction training module includes an error matching unit and a training correction method unit for the speech corpus to be corrected. Preferably, the error matching unit: matches the MFCC-OVOT characteristic parameters of the speech corpus to be corrected against the pre-established characteristic parameters of the corresponding standard erroneous speech to obtain a matching result; a probability is calculated through a GMM from the characteristic parameters of the speech corpus to be corrected and those of the standard erroneous speech corpus, the ratio of this probability to the probability calculated from the speech corpus to be corrected and the standard correct speech corpus is computed, and if the ratio is larger than a certain value, the speech to be corrected exhibits the corresponding standard-error problem. Combining the correlation of the speech corpus to be corrected with the standard correct corpus and with the standard erroneous corpus improves detection accuracy. Preferably, the training correction method unit gives methods of correcting training pronunciation, including: giving a mouth-shape animation of the standard pronunciation; giving a spectrogram of the standard pronunciation; giving a spectrogram of the input speech to be trained; giving correction skills and methods for the pronunciation error recorded in advance by a spoken-language teacher; and giving the next training content for the pronunciation error.
In an alternative embodiment, the training progress recording and displaying module is used for recording the historical performances and displaying the historical performances in an intuitive mode. The historical performance may preferably be represented in a histogram in chronological order.
In an optional implementation, the teacher interaction module is used to transmit the training correction situation to the teacher end, and the teacher decides how to give feedback on the pronunciation training. It comprises three parts: the trainee end, the cloud, and the teacher end. The trainee end uploads training data to the cloud server; the teacher end periodically queries the cloud data to learn the trainee's situation in time, and the teacher uploads feedback in text or audio form to the cloud server as needed, which is then pushed to the trainee end, completing the closed loop of pronunciation correction training. The teacher end may be implemented as an app on a mobile phone, an executable program on a personal PC, or a web client.
EXAMPLE five
This embodiment provides another embodiment of the Spanish spoken language pronunciation training and correcting system, which includes all of the Spanish spoken language pronunciation training and correcting system described in the fourth embodiment, and further defines that the stress position error scoring unit and the error matching unit in the error scoring module and the error matching and correcting module specifically include:
an accent energy and duration extraction unit, configured to extract an average short-time energy of each phoneme of the standard correct speech corpus and the to-be-corrected speech corpus and a duration of each phoneme;
a relative strength calculating unit, for calculating the ratios of the average short-time energy of each phoneme of the standard correct speech corpus and of the speech corpus to be corrected to the average of the average short-time energies of the other phonemes in the same word, namely the relative strengths E and E_t;
a relative duration calculating unit, for calculating the ratios of the average duration of the labeled stressed phoneme of the standard correct speech corpus, and of each phoneme of the speech corpus to be corrected, to the average of the average durations of the other phonemes in the same word, namely the relative durations T and T_t;
a stress weighted value calculating unit, for calculating the weighted values of the labeled stressed phoneme of the standard correct speech corpus and of each phoneme of the speech corpus to be corrected: W = E × c + T × (1 − c), W_t = E_t × c + T_t × (1 − c), where E and E_t are the relative strengths of the standard correct speech corpus and the speech corpus to be corrected, c ∈ (0, 1) is a predetermined constant, and T and T_t are the corresponding relative durations;
a stressed phoneme discrimination unit for the speech corpus to be corrected, which extracts the actual stressed phoneme of the speech corpus to be corrected, the phoneme with the maximum stress weighted value being the actual stressed phoneme;
a stress position comparison unit for the speech corpus to be corrected: comparing the actual stressed phoneme position with the known stressed phoneme position in the standard correct speech corpus; if they do not match, an accent position error is prompted and the correct stressed phoneme and accent rule (if applicable) are given; if they match, the ratio R = W_t / W of the stress weighted value of the speech corpus to be corrected to that of the standard correct speech corpus is calculated; R is normally about 1, and deviation from a preset range indicates that the stress is too heavy or too light;
a stress position scoring unit for the speech corpus to be corrected: scoring the stress of the speech corpus to be corrected according to the number of stress position errors and the ratio R; preferably, the weighted score of the stress position in the speech corpus to be corrected can be expressed as:

S = (∑_correct R) / N_total

where ∑_correct indicates that the summation ranges over the stresses at the correct position, and N_total represents the total number of stresses in the reference text.
Example six
The embodiment further provides another embodiment of the spanish spoken language pronunciation training and correcting system, which includes all of the spanish spoken language pronunciation training and correcting system described in the fourth embodiment, and further defines that the scoring unit and the error matching unit for vibrato errors in the error scoring module and the error matching and correcting module specifically include:
a time length and energy extraction unit for the trill phoneme of the to-be-corrected speech corpus, which is used for extracting the time length of the trill phoneme of the to-be-corrected speech corpus and the average short-time energy of the trill phoneme of the to-be-corrected speech corpus;
a single/multi trill discriminator for the trill phoneme of the speech to be corrected, which judges whether the trill is single or multiple: the first time the short-time energy exceeds a certain threshold is regarded as a percussion, and if the energy then stays below the threshold for a continuous period this is regarded as a percussion gap; the number of percussions is counted, three or more indicating a multi-click trill and one indicating a single-click trill;
and a trill phoneme error scoring unit for the speech corpus to be corrected, which compares the calculated single/multi-click trill result with the pre-labeled correct trill pronunciation and gives a trill score according to the ratio of the number of correct trills to the total number of trills in the text.
EXAMPLE seven
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the spanish spoken language pronunciation training correction method according to the first embodiment or the second embodiment or the third embodiment.
Example eight
This embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method for correcting the pronunciation of spanish spoken language according to the first embodiment or the second embodiment or the third embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A Spanish spoken language pronunciation training correction method is characterized by comprising the following steps:
acquiring a voice corpus to be corrected, and extracting characteristic parameters of the voice corpus to be corrected;
performing pronunciation error recognition on the voice to be corrected according to the characteristic parameters of the voice corpus to be corrected to obtain a pronunciation error recognition result of the voice to be corrected;
scoring each index of the speech to be corrected respectively, indicating the errors in each Spanish phoneme pronunciation of the speech to be corrected, and giving the pronunciation rule of the phoneme together with targeted training data and reinforcement training;
the characteristic parameters of the speech corpus to be corrected comprise an MFCC-OVOT mixed feature vector, the MFCC-OVOT mixed feature vector comprises Mel-frequency cepstral coefficients and an optimized voice onset time, and the optimized voice onset time refers to the difference between the earlier of the vocal-cord vibration onset and the end of the corresponding phoneme, and the time at which the oral occlusion is released;
wherein, the scoring each index of the voice to be corrected respectively specifically comprises:
taking the feature vector of each standard error as an input vector of a Gaussian mixture model-general background model, and training the Gaussian mixture model of the input vector by adopting an expectation-maximization algorithm to obtain a model of each standard error; and calculating the probability average score of each standard error model for the segmented speech to be corrected.
2. The method as claimed in claim 1, wherein the oral occlusion release time is the phoneme start time obtained after the speech corpus is segmented into phonemes.
3. The Spanish spoken language pronunciation training correction method as claimed in claim 1, wherein in the process of performing pronunciation error recognition on the speech to be corrected according to the feature parameters of the corpus of the speech to be corrected, a standard error model is used to perform pronunciation error recognition on the speech to be corrected;
the standard error model construction process comprises the following steps:
preprocessing the recorded standard error voice; obtaining a speech corpus with a standard error;
extracting characteristic parameters of the speech corpus with standard errors; the feature parameters of the standard wrong speech corpus comprise MFCC-OVOT mixed feature vectors, and the average MFCC-OVOT mixed feature vectors are calculated for each standard wrong speech corpus of all keywords.
4. The method according to claim 1, wherein the scoring each index of the speech to be corrected to indicate the errors in each Spanish phoneme pronunciation of the speech to be corrected specifically comprises: if the scoring result of a certain index of the speech corpus to be corrected is lower than a threshold, marking that index as needing correction.
5. The spanish spoken language pronunciation training correction method according to claim 1, wherein the scoring each index of the speech to be corrected specifically includes: determining the scoring of the stress position of the to-be-corrected voice corpus, wherein the scoring of the stress position of the to-be-corrected voice corpus comprises the following steps:
weighting and scoring the stress position of the voice corpus to be corrected according to the stress position error and the ratio:
S = (∑_correct R) / N_total

wherein S represents the weighted score of the stress position of the speech corpus to be corrected, ∑_correct indicates that the summation ranges over the stresses at the correct position, and N_total represents the total number of stresses in the reference text; the ratio is the ratio of the stress weighted value of the speech corpus to be corrected to the stress weighted value of the standard correct speech corpus.
6. The Spanish spoken language pronunciation training correction method according to claim 1, wherein the giving of the pronunciation rules of the speech together with targeted training data and reinforcement training specifically comprises: giving a spectrogram of the standard pronunciation, giving a spectrogram of the standard error, giving a spectrogram of the speech to be trained, giving a tongue-position oral animation of the standard pronunciation, giving a tongue-position oral animation of the standard error, giving pre-recorded standard training materials, giving past scores for a certain type of error, and giving pronunciation notes, skills and pronunciation demonstrations pre-recorded by a spoken-language teacher.
7. A Spanish spoken language pronunciation training correction system, characterized by comprising:
an acquisition module configured to: acquire a speech corpus to be corrected, and extract characteristic parameters of the speech corpus to be corrected;
an identification module configured to: perform pronunciation-error recognition on the speech to be corrected according to the characteristic parameters of the speech corpus to be corrected, to obtain a pronunciation-error recognition result for the speech to be corrected;
an output training module configured to: score each index of the speech to be corrected, indicate the errors in each Spanish pronunciation of the speech to be corrected, and provide targeted training data and reinforcement training together with the pronunciation rules of the speech;
wherein the characteristic parameters of the speech corpus to be corrected comprise an MFCC-OVOT mixed feature vector, the MFCC-OVOT mixed feature vector comprising Mel-frequency cepstral coefficients and an optimized voice onset time, the optimized voice onset time being the difference between the earlier of the onset of vocal-cord vibration and the end of the corresponding phoneme, and the moment at which the oral-cavity occlusion is released;
wherein the scoring each index of the speech to be corrected specifically includes:
taking the feature vector of each standard error as an input vector of a Gaussian mixture model-universal background model, and training the Gaussian mixture model on the input vectors with the expectation-maximization algorithm to obtain a model of each standard error; and calculating, for the segmented speech to be corrected, the average probability score under each standard-error model.
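The GMM scoring step of claim 7 can be sketched as below: fit a small diagonal-covariance Gaussian mixture by expectation-maximization to the feature vectors of one "standard error", then score a speech segment by its average log-likelihood under that model. The dimensions and synthetic data are assumptions; a real system would feed MFCC-OVOT vectors and adapt the mixture from a universal background model rather than train it from scratch.

```python
import numpy as np

# Sketch of claim 7's scoring step: EM training of a diagonal-covariance
# GMM on one standard error's feature vectors, then average log-likelihood
# of a speech segment under the trained model. Feature dimensions and data
# are synthetic assumptions, not values from the patent.

def fit_gmm(X, k, iters=50, seed=0):
    """Fit a k-component diagonal-covariance GMM to rows of X by EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]        # init means from data points
    var = np.tile(X.var(axis=0) + 1e-6, (k, 1))    # init per-dim variances
    w = np.full(k, 1.0 / k)                        # uniform mixture weights
    for _ in range(iters):
        # E-step: responsibilities from per-component log densities
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)    # stabilize before exp
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def avg_log_likelihood(X, w, mu, var):
    """Average per-frame log-likelihood of X under the diagonal GMM."""
    logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(-1)
            + np.log(w))
    m = logp.max(axis=1, keepdims=True)            # log-sum-exp per frame
    return float((m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).mean())

rng = np.random.default_rng(1)
error_feats = rng.normal(0.0, 1.0, size=(200, 4))  # one standard error's features
segment = rng.normal(0.0, 1.0, size=(30, 4))       # a segmented utterance to score
w, mu, var = fit_gmm(error_feats, k=2)
print(avg_log_likelihood(segment, w, mu, var))
```

Segments drawn from the same distribution as the error model score higher than mismatched segments, which is what lets the per-error scores be compared against a threshold as in claim 4.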
8. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the Spanish spoken language pronunciation training correction method according to any one of claims 1 to 6.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the Spanish spoken language pronunciation training correction method according to any one of claims 1 to 6.
CN202210422182.6A 2022-04-21 2022-04-21 Spanish spoken language pronunciation training correction method and system Active CN114783412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422182.6A CN114783412B (en) 2022-04-21 2022-04-21 Spanish spoken language pronunciation training correction method and system


Publications (2)

Publication Number Publication Date
CN114783412A CN114783412A (en) 2022-07-22
CN114783412B true CN114783412B (en) 2022-11-15

Family

ID=82430141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422182.6A Active CN114783412B (en) 2022-04-21 2022-04-21 Spanish spoken language pronunciation training correction method and system

Country Status (1)

Country Link
CN (1) CN114783412B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104183236A (en) * 2014-09-04 2014-12-03 北京语言大学 Method and system for measuring discrimination of perception parameters
CN109545189A (en) * 2018-12-14 2019-03-29 东华大学 A kind of spoken language pronunciation error detection and correcting system based on machine learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI502583B (en) * 2013-04-11 2015-10-01 Wistron Corp Apparatus and method for voice processing


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mandarin plosive recognition based on multi-feature combination; Feng Pei et al.; Modern Electronics Technique; 2019-02-21 (No. 08); pp. 159-163 and Fig. 1 *
Phoneme mispronunciation detection combining Gaussian mixture models and VOT features; Liu Minghui et al.; Science Technology and Engineering; 2013-03-08 (No. 07); pp. 1789-1793 *

Also Published As

Publication number Publication date
CN114783412A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN101551947A (en) Computer system for assisting spoken language learning
Tsubota et al. Practical use of English pronunciation system for Japanese students in the CALL classroom
Hincks Technology and learning pronunciation
Bolanos et al. Automatic assessment of expressive oral reading
Daniels et al. The suitability of cloud-based speech recognition engines for language learning.
WO2021074721A2 (en) System for automatic assessment of fluency in spoken language and a method thereof
Duan et al. A Preliminary study on ASR-based detection of Chinese mispronunciation by Japanese learners
WO2019075828A1 (en) Voice evaluation method and apparatus
Demenko et al. The use of speech technology in foreign language pronunciation training
Peabody et al. Towards automatic tone correction in non-native mandarin
Tsubota et al. An English pronunciation learning system for Japanese students based on diagnosis of critical pronunciation errors
CN111508522A (en) Statement analysis processing method and system
Dai [Retracted] An Automatic Pronunciation Error Detection and Correction Mechanism in English Teaching Based on an Improved Random Forest Model
CN114783412B (en) Spanish spoken language pronunciation training correction method and system
Utami et al. Improving students’ English pronunciation competence by using shadowing technique
Kantor et al. Reading companion: The technical and social design of an automated reading tutor
US20210304628A1 (en) Systems and Methods for Automatic Video to Curriculum Generation
Pascual et al. Experiments and pilot study evaluating the performance of reading miscue detector and automated reading tutor for filipino: A children's speech technology for improving literacy
Strik et al. Speech technology for language tutoring
Li et al. English sentence pronunciation evaluation using rhythm and intonation
van Doremalen Developing automatic speech recognition-enabled language learning applications: from theory to practice
Yu A Model for Evaluating the Quality of English Reading and Pronunciation Based on Computer Speech Recognition
CN111508523A (en) Voice training prompting method and system
Vidal et al. Phone-Level Pronunciation Scoring for Spanish Speakers Learning English Using a GOP-DNN System.
Tsubota et al. Practical use of autonomous English pronunciation learning system for Japanese students

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant