Background Art
Nearly 70% of the world's languages are tone languages (also called tonal languages), for example Chinese, the languages of Southeast Asia, Japanese, Swedish and Norwegian. In these languages the syllable is the smallest unit of pronunciation, and each syllable is composed of a consonant, a vowel and a tone. The phoneme is the smallest phonetic unit and is obtained by analyzing the syllable. Tone refers to the rise and fall of pitch over the course of a syllable's pronunciation; that is, tone uses pitch to convey the meaning of words and phrases, so syllables with the same consonant and vowel but different tones can have different meanings.
For example, Chinese, as a typical tone language, has four tones (five if the neutral tone is counted): the high level tone (first tone), the rising tone (second tone), the falling-rising tone (third tone) and the falling tone (fourth tone). Syllables formed from the same initial (consonant) and final (vowel) have entirely different meanings and correspond to different Chinese characters when their tones differ; that is, tone plays an important role in distinguishing word meaning in Mandarin. Furthermore, Mandarin tone is carried only by the final, so the final is also called the "tone phoneme" and the initial is called the "non-tone phoneme".
Therefore, a learning system for a tone language additionally needs to recognize tone information so that it can automatically judge and score the learner's pronunciation. Tone can be represented by the fundamental frequency, i.e. the pattern of the fundamental frequency over time; Fig. 1 is a schematic diagram of the fundamental-frequency representations of the four Chinese tones. In tone recognition, the conventional method uses the fundamental-frequency contour: the fundamental frequency F0 of each speech frame is extracted, and the tone is discriminated according to the F0 trajectory of each tone. Because tone is highly complex, each tone has many variants; Figs. 2A to 2D show samples of real speech for the first to fourth tones of Chinese. This makes tone recognition challenging. In a tone-language learning system in particular, where the speaker's tone must be discriminated automatically, the reliability of tone recognition becomes all the more important.
Tone can also be represented by the time-domain pitch. The time-domain pitch period is the reciprocal of the fundamental frequency F0; it is the smallest repeating unit of a periodic time-domain signal, so a single pitch period fully describes the periodic signal, and tone information can therefore be obtained by pitch detection. However, owing to the complexity of the speech signal itself, and in particular the difficulty of distinguishing unvoiced from voiced speech, pitch periods are frequently misidentified, which leads to erroneous tone recognition. Because Mandarin tone appears only on the voiced segment of the tone phoneme, an error in voiced/unvoiced discrimination causes tone recognition to fail. In the prior art, voiced/unvoiced discrimination is generally based on the characteristics of the two kinds of sound: quasi-periodic voiced signals have relatively high energy, while aperiodic unvoiced signals have relatively low energy. However, because existing speech-processing techniques cannot discriminate voiced from unvoiced speech reliably, a pitch is sometimes detected in the speech of a non-tone phoneme segment, so that the tone is misrecognized.
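The prior-art approach described above can be illustrated with a minimal sketch. It assumes a simple mean-energy threshold for the voiced/unvoiced decision and a plain autocorrelation peak search for the pitch period; the function names, the threshold value and the 60 to 400 Hz search range are illustrative assumptions, not values taken from this description.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def is_voiced(frame, energy_threshold=0.01):
    """Energy-based voiced/unvoiced decision (threshold is an assumed value)."""
    return short_time_energy(frame) > energy_threshold

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 as the reciprocal of the pitch period.

    The pitch period is taken as the lag with the largest autocorrelation
    inside the assumed search range [f0_min, f0_max].
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        return None  # no positive autocorrelation peak found
    return sample_rate / best_lag  # F0 = 1 / pitch period
```

Such an energy threshold is exactly the kind of decision that can fail on real speech, which is the weakness the remainder of this description addresses.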
Summary of the Invention
The object of the present invention is to provide a tone recognition method and system that reduce the tone misrecognition occurring in prior-art tone recognition, accurately recognize the tones of a tone language, and improve the reliability of tone recognition.
To achieve the above object, some embodiments of the present invention provide a tone recognition method comprising the following steps:
receiving a speech signal;
performing spectrum analysis on the speech signal, and generating, according to a reference text, a speech sequence carrying time alignment information;
extracting a tone phoneme from the received speech signal according to the speech sequence;
determining the tone of the speech signal according to the tone phoneme.
To achieve the above object, other embodiments of the present invention provide a tone recognition system comprising:
a grammar database, configured to store a reference text;
a speech recognition module, configured to receive a speech signal, perform spectrum analysis on the speech signal, and generate, according to the reference text, a speech sequence carrying time alignment information;
a tone recognition module, configured to receive the speech signal and extract a tone phoneme from the speech signal according to the speech sequence;
a tone classification module, configured to determine the tone of the speech signal according to the tone phoneme.
Based on the above technical solution, embodiments of the present invention use the speech sequence carrying time alignment information to extract the tone phoneme accurately and determine the tone of the input speech signal. This effectively reduces tone misrecognition, allows the tones of a tone language to be recognized accurately, and improves the reliability of tone recognition.
Detailed Description of the Embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
In a learning system for a tone language, recognizing a speaker's speech involves recognizing not only the structure of each syllable but also its tone. Fig. 3 is a schematic flowchart of a first embodiment of the tone recognition method of the present invention. This embodiment comprises the following steps:
Step 101: receive a speech signal;
Step 102: perform spectrum analysis on the speech signal, and generate, according to a reference text, a speech sequence carrying time alignment information;
Step 103: extract a tone phoneme from the received speech signal according to the speech sequence;
Step 104: determine the tone of the speech signal according to the tone phoneme.
In this embodiment, the speech sequence carrying time alignment information is used to extract the tone phoneme of the input speech signal accurately and thereby determine the tone of the input speech signal. This reduces tone misrecognition, allows the tones of a tone language to be recognized accurately, and thus improves the reliability of tone recognition.
Fig. 4 is a schematic flowchart of a second embodiment of the tone recognition method of the present invention. This embodiment comprises the following steps:
Step 201: receive a speech signal.
The audio signal of an input spoken syllable of the tone language is received;
Step 202: perform spectrum analysis on the speech signal and extract speech feature parameters.
The feature parameters are extracted on a per-frame basis. Because a speech signal is short-time stationary, it can be divided into a number of frames for processing; each frame is about 10 to 30 ms long, and one speech feature is extracted from each frame. Frames may be formed by contiguous segmentation, but to reflect the correlation between adjacent frames and to keep the transition between frames smooth and continuous, overlapping segmentation is generally used: the tail of each frame overlaps the head of the next, and the frame shift is usually 1/2 of the frame length.
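The overlapping segmentation described above can be sketched as follows. This is only an illustration: the function name, the 20 ms default frame length and the choice to drop a trailing partial frame are assumptions; the description itself only states a 10 to 30 ms frame length and a typical frame shift of half a frame.

```python
def split_into_frames(samples, sample_rate, frame_ms=20.0, shift_ratio=0.5):
    """Split a signal into overlapping frames.

    frame_ms    -- frame length in milliseconds (assumed default: 20 ms)
    shift_ratio -- frame shift as a fraction of the frame length (0.5 = half)
    A trailing partial frame is discarded (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = max(1, int(round(frame_len * shift_ratio)))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

With a half-frame shift, the second half of every frame is identical to the first half of the next one, which is what keeps adjacent features correlated.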
The choice of speech feature parameters must balance storage limits against recognition performance. For example, Mel-frequency cepstral coefficients (MFCC) may be used. To reduce the truncation effect of framing and lower the gradient at the two ends of a frame, so that the ends do not change abruptly but taper smoothly to 0, each speech frame is multiplied by a window function. Because a speech signal varies quickly and unstably in the time domain, it is usually transformed to the frequency domain for observation, where its spectrum changes slowly over time. Each windowed frame is passed through a fast Fourier transform (FFT) to obtain the spectral parameters of the frame. The spectral parameters of each frame are then passed through a Mel-frequency filter bank consisting of N triangular band-pass filters (N is typically 20 to 30), and the logarithm of the output of each band is taken to obtain the log energy E_k of each output, k = 1, 2, ..., N. Finally, a cosine transform is applied to these N parameters to obtain the L-th order Mel-scale cepstrum parameters.
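The chain of window, spectrum, Mel filter bank, logarithm and cosine transform can be sketched for a single frame as below. Several details are assumptions not fixed by this description: a Hamming window, the common 2595·log10(1 + f/700) Mel formula, N = 24 filters, L = 12 coefficients, and a naive DFT in place of an FFT to keep the example dependency-free.

```python
import cmath
import math

def hamming(n):
    """Hamming window (an assumed choice of window function)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    """Power spectrum via a naive DFT (an FFT would be used in practice)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        weights = [0.0] * (n_fft // 2 + 1)
        for b in range(left, center):       # rising edge of the triangle
            weights[b] = (b - left) / (center - left)
        for b in range(center, right):      # falling edge of the triangle
            weights[b] = (right - b) / (right - center)
        bank.append(weights)
    return bank

def mfcc(frame, sample_rate, num_filters=24, num_ceps=12):
    """Return num_ceps Mel cepstral coefficients for one frame."""
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    spec = power_spectrum(windowed)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log energy E_k of each filter output; floored to avoid log(0).
    log_e = [math.log(max(sum(w * s for w, s in zip(weights, spec)), 1e-10))
             for weights in bank]
    # Cosine transform of the N log energies.
    return [sum(log_e[k] * math.cos(i * (k + 0.5) * math.pi / num_filters)
                for k in range(num_filters))
            for i in range(1, num_ceps + 1)]
```

For example, `mfcc(frame, 16000)` on a 256-sample frame returns a list of 12 coefficients.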
The speech feature parameters may also be a 39-dimensional feature vector comprising 13-dimensional MFCCs, 13-dimensional first-order difference MFCCs and 13-dimensional second-order difference MFCCs;
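A 39-dimensional vector of this kind can be assembled as in the sketch below. The difference formula is an assumption: this description does not say how the first- and second-order differences are computed, so a simple central difference with the edge frames repeated is used here, and the function name is hypothetical.

```python
def add_deltas(mfcc_frames):
    """Append first- and second-order differences to each MFCC vector.

    mfcc_frames is a list of equal-length vectors, one per frame. The
    difference is a central difference, (next - previous) / 2, with the
    first and last frames repeated at the edges (assumed formula).
    """
    def diff(seq):
        out = []
        last = len(seq) - 1
        for t in range(len(seq)):
            prev_vec = seq[max(t - 1, 0)]
            next_vec = seq[min(t + 1, last)]
            out.append([(n - p) / 2.0 for n, p in zip(next_vec, prev_vec)])
        return out

    delta = diff(mfcc_frames)    # first-order difference
    delta2 = diff(delta)         # second-order difference
    return [c + d + dd for c, d, dd in zip(mfcc_frames, delta, delta2)]
```

Given 13 coefficients per frame, each output vector has 13 + 13 + 13 = 39 dimensions.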
Step 203: search the speech model according to the reference text to match a speech sequence to the speech feature parameters, the speech sequence carrying time alignment information.
The speech model may be a hidden Markov model (HMM). An HMM is a discrete-time finite-state automaton; "hidden" means that the internal state of the Markov model is invisible to the outside, which can observe only the output value at each time instant. For a speech recognition system, the output value is usually the acoustic feature (speech feature) computed from each frame. Modeling a speech signal with an HMM requires two assumptions: first, that a transition of the internal state depends only on the previous state; second, that the output value depends only on the current state (or the current state transition). These two assumptions greatly reduce the complexity of the model. The algorithms for HMM scoring, decoding and training are the forward algorithm, the Viterbi algorithm and the forward-backward algorithm, respectively.
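As a minimal sketch of HMM scoring, the forward algorithm below computes the probability of an observation sequence under a small discrete HMM. The discrete output symbols and the function name are simplifying assumptions; in a speech recognizer the outputs would be per-frame acoustic feature vectors with continuous output densities.

```python
def forward_likelihood(init, trans, emit, observations):
    """Probability of an observation sequence under a discrete HMM.

    init[s]      -- probability of starting in state s
    trans[p][s]  -- probability of moving from state p to state s
    emit[s][o]   -- probability of emitting symbol o in state s
    Both HMM assumptions appear directly: the transition term depends only
    on the previous state, and the emission term only on the current state.
    """
    num_states = len(init)
    alpha = [init[s] * emit[s][observations[0]] for s in range(num_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(num_states)) * emit[s][obs]
            for s in range(num_states)
        ]
    return sum(alpha)
```

For long sequences the products underflow, so practical implementations work with scaled or logarithmic probabilities.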
In speech recognition, the recognition primitives are usually modeled with HMMs having a unidirectional, left-to-right topology with self-loops and skips. A phoneme is an HMM of three to five states; a word is the HMM formed by concatenating the HMMs of the phonemes that make up the word; and the whole model for continuous speech recognition is an HMM combining words and silence.
So that the model describes speech more accurately, context-dependent modeling of coarticulation can be considered when building the HMM. Coarticulation means that a sound changes under the influence of its neighboring sounds. In terms of the production mechanism, the characteristics of the human vocal organs can only change gradually when moving from one sound to the next, so that the spectrum of the later sound differs from its spectrum under other conditions. A model that considers only the influence of the preceding sound is called a biphone; a model that considers the influence of both the preceding and the following sound is called a triphone.
The search operation finds a sequence of speech models that describes the input speech signal, thereby yielding the decoded speech sequence (the speech sequence). In practice, a high weight is often applied to the language model, and a long-word penalty score is set empirically.
The Viterbi algorithm is based on dynamic programming. For each state at each time point, it computes the probability of the decoded state sequence given the observation sequence, keeps the path with the highest probability, and records the corresponding state information at each node so that the decoded speech sequence can be recovered by backtracking at the end. Without losing the optimal solution, the Viterbi algorithm simultaneously solves the non-linear time alignment between the HMM state sequence and the acoustic observation sequence, word boundary detection, and word recognition in continuous speech recognition, which has made it the basic search strategy in speech recognition.
This step reliably provides a speech sequence carrying time alignment information, from which the start and end times of the non-tone phoneme (initial) and the tone phoneme (final) of the recognized input speech signal can be known;
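The dynamic-programming search and the backtracking that yields the time alignment can be sketched as follows. The sketch works with log probabilities and takes precomputed per-frame emission scores; the function names and the conversion of the state path into (state, start frame, end frame) segments are illustrative assumptions.

```python
NEG_INF = float("-inf")  # log probability of an impossible event

def viterbi(log_init, log_trans, log_emit):
    """Return the most probable state path.

    log_init[s]     -- log probability of starting in state s
    log_trans[p][s] -- log probability of the transition p -> s
    log_emit[t][s]  -- log probability of frame t's features in state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(num_states):
            # Keep only the best-scoring path into state s at time t.
            best_prev = max(range(num_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Backtrack from the best final state using the recorded pointers.
    state = max(range(num_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def path_to_segments(path):
    """Collapse a state path into (state, start_frame, end_frame) runs."""
    segments = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((path[start], start, t))
            start = t
    return segments
```

The segment boundaries are the time alignment: multiplying the frame indices by the frame shift gives the start and end time of each unit.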
Step 204: extract the tone phoneme from the received speech signal according to the speech sequence.
According to the speech sequence and the alignment times provided in the previous step, the parts that are not tone phonemes are cut away. For Chinese, this means cutting away the parts that are not finals;
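The cutting step can be sketched as below. The alignment format, a list of (label, start time, end time, tone-phoneme flag) tuples in seconds, is a hypothetical representation chosen for the example; this description does not specify how the speech sequence and its time alignment information are encoded.

```python
def extract_tone_phonemes(samples, sample_rate, alignment):
    """Keep only the samples belonging to tone phonemes.

    alignment is a list of (label, start_sec, end_sec, is_tone_phoneme)
    tuples (a hypothetical format). Segments whose flag is False, such as
    Mandarin initials, are discarded.
    """
    segments = []
    for label, start_sec, end_sec, is_tone_phoneme in alignment:
        if not is_tone_phoneme:
            continue
        start = int(round(start_sec * sample_rate))
        end = int(round(end_sec * sample_rate))
        segments.append((label, samples[start:end]))
    return segments
```

For the syllable "ma", for example, only the samples aligned to the final "a" would be passed on to tone classification, so no pitch is ever estimated on the initial "m".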
Step 205: match the tone of the speech signal in a tone model according to the tone phoneme.
Alternatively, step 205 may be:
using a support vector machine (SVM) algorithm to find a suitable set of hyperplanes that classify the tone phoneme by tone.
In this embodiment, a dynamic-programming Viterbi search of the HMM finds the speech sequence that matches the feature parameters of the input speech signal, and the speech sequence carrying time alignment information is used to extract the tone phoneme of the input speech signal accurately. The tone of the input speech signal is then determined either by the tone model or by a suitable set of hyperplanes found with the SVM algorithm. This reduces tone misrecognition, allows the tones of a tone language to be recognized accurately, and thus improves the reliability of tone recognition.
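How a set of hyperplanes assigns a tone can be sketched as follows. Only the decision step is shown; training is omitted. The one-hyperplane-per-tone (one-versus-rest) arrangement, the two-dimensional feature vector and all weight values are invented for illustration and are not taken from this description or from any trained model.

```python
def decision_value(weights, bias, features):
    """Signed score of a feature vector relative to one hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def classify_tone(hyperplanes, features):
    """Pick the tone whose hyperplane gives the largest decision value.

    hyperplanes maps a tone label to a (weights, bias) pair; this
    one-versus-rest layout is an assumption of the sketch.
    """
    return max(hyperplanes,
               key=lambda tone: decision_value(*hyperplanes[tone], features))

# Made-up, untrained hyperplanes over two toy features:
# (normalized mean F0, F0 slope).
EXAMPLE_HYPERPLANES = {
    1: ([1.0, 0.0], 0.0),    # high and level
    2: ([0.0, 1.0], 0.0),    # rising
    4: ([0.0, -1.0], 0.0),   # falling
}
```

With these toy values, a tone phoneme whose F0 rises steeply, such as `[0.2, 0.9]`, is assigned tone 2.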
Fig. 5 is a schematic structural diagram of a first embodiment of the tone recognition system of the present invention. This embodiment comprises: a grammar database 10, configured to store a reference text; a speech recognition module 20, configured to receive a speech signal, perform spectrum analysis on the speech signal, and generate, according to the reference text, a speech sequence carrying time alignment information; a tone recognition module 30, configured to receive the speech signal and extract a tone phoneme from the speech signal according to the speech sequence; and a tone classification module 40, configured to determine the tone of the speech signal according to the tone phoneme.
In this embodiment, because the system is intended for language learning, the material to be read aloud can be entered into the grammar database 10 in advance as the reference text. The speech recognition module 20 provides the speech sequence and the time alignment information; the tone recognition module 30 uses them to extract the tone phoneme accurately from the speech signal; and the tone classification module 40 determines the tone of the speech signal. This reduces tone misrecognition and allows the tones of a tone language to be recognized accurately.
Fig. 6 is a schematic structural diagram of a second embodiment of the tone recognition system of the present invention. Compared with the previous embodiment, the speech recognition module 20 in this embodiment comprises: a feature extraction unit 21, configured to receive the speech signal, perform spectrum analysis on it and extract speech feature parameters; a speech model unit 22, configured to store a speech model; and a speech search unit 23, configured to match a speech sequence in the speech model according to the speech feature parameters and the reference text, the speech sequence carrying time alignment information.
In this embodiment, the speech feature parameters extracted by the feature extraction unit 21 may be Mel-frequency cepstral coefficients, or may be Mel-frequency cepstral coefficients together with their first-order and second-order differences. The speech model stored in the speech model unit 22 is a hidden Markov model.
Compared with the previous embodiment, the tone classification module 40 in this embodiment comprises: a tone model unit 41, configured to store a tone model; and a tone classification unit 42, configured to match the tone of the speech signal in the tone model according to the tone phoneme.
In this embodiment, the tone model stored in the tone model unit 41 may use tone features such as the envelope of the fundamental frequency F0 trajectory and the envelope of the log-energy sequence.
In this embodiment, the speech recognition module 20 provides the speech sequence and the time alignment information; the tone recognition module 30 uses them to extract the tone phoneme accurately from the speech signal; and the tone classification module 40 determines the tone of the speech signal. This reduces tone misrecognition, improves the reliability of tone recognition, and allows the tones of a tone language to be recognized accurately.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium capable of storing program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.