CN107293306A - Output-based objective speech quality assessment method

Output-based objective speech quality assessment method

Info

Publication number
CN107293306A
CN107293306A (application CN201710475912.8A)
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710475912.8A
Other languages
Chinese (zh)
Other versions
CN107293306B (en)
Inventor
李庆先
刘良江
王晋威
朱宪宇
熊婕
李彦博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Hunan Institute of Metrology and Test
Original Assignee
HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE filed Critical HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Priority to CN201710475912.8A priority Critical patent/CN107293306B/en
Publication of CN107293306A publication Critical patent/CN107293306A/en
Application granted granted Critical
Publication of CN107293306B publication Critical patent/CN107293306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The present invention provides an output-based objective speech quality assessment method comprising the following steps: calculate the Mel-frequency cepstral coefficients (MFCCs) of the distorted speech after transmission through the system; obtain a reference model that reflects the auditory characteristics of the human ear; compute a consistency measure between the MFCCs of the distorted speech and the reference model; insert a sequence into the original speech and calculate the bit error rate of that sequence as extracted from the distorted speech after transmission through the system; from the consistency measure and the bit error rate, establish a mapping to subjective MOS scores, yielding an objective prediction model for the MOS score of the speech under evaluation, with which the objective evaluation of speech quality is carried out. The method of the invention is simple in its steps, convenient to use, and evaluates speech quality objectively and effectively, without relying on subjective assessment.

Description

Output-based objective speech quality assessment method
Technical field
The present invention relates to the field of speech processing, and in particular to an output-based objective speech quality assessment method.
Background art
Objective speech quality assessment refers to the automatic, machine-based judgment of speech quality. According to whether the input speech is required, such methods fall into two classes: objective evaluation based on the input-output mode, and objective evaluation based on the output mode alone.
Many fields, such as wireless mobile communication, aerospace navigation, and modern military applications, demand evaluation methods with high flexibility, real-time capability, and generality, and require that speech quality can be assessed even when the original input speech signal is unavailable. Input-output methods, which often cannot obtain the corresponding original speech and incur higher speech-storage costs, therefore suffer clear drawbacks in these scenarios.
The general procedure of an output-based objective speech quality assessment method is to compute certain characteristic parameters of the speech under evaluation, measure their consistency against the characteristic parameters of reference speech as summarised by a trained model, and finally map the result to an estimated subjective MOS score. In this process the choice of characteristic parameters, training model, and MOS mapping method is crucial, as it determines the performance of the assessment system. Because the ear's perception of sound follows the Bark critical bands, the conversion between linear frequency and a warped (Mel) frequency scale must be performed during feature extraction. Moreover, in applications such as radio communication, external factors such as channel quality must be considered in addition to the analysis of the speech itself.
It is therefore of great significance to design an assessment method that can objectively evaluate the quality of speech after coding or channel transmission.
Summary of the invention
It is an object of the invention to provide an output-based objective speech quality assessment method. Taking into account the ear's auditory perception of frequency as well as the cepstral analysis of the speech signal, the method describes speech features with Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC). An objective speech distortion value is obtained by combining the MFCCs with a trained GMM-HMM model; at the same time, the bit error rate is introduced as an objective measure of channel effects. A mapping is then established between subjective MOS scores and the objective measures, yielding a prediction model for subjective MOS, so that the quality of speech after coding or channel transmission can be evaluated objectively. The details are as follows:
An output-based objective speech quality assessment method comprises the following steps:
Calculate the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system; obtain a reference model that reflects the auditory characteristics of the human ear;
Compute a consistency measure between the Mel-frequency cepstral coefficients of the distorted speech and the reference model; insert a sequence into the original speech, and calculate the bit error rate of that sequence as extracted from the distorted speech after transmission through the system;
From the consistency measure and the bit error rate, establish a mapping to subjective MOS scores, yielding an objective prediction model for the MOS score of the speech under evaluation; the objective evaluation of speech quality is carried out with this objective prediction model.
Preferably, the calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT, Mel-frequency filtering, and discrete cosine transform.
Preferably, the pre-processing comprises the following steps:
Step 1.1, pre-emphasis: pre-emphasis is realised with a digital filter that boosts the high-frequency characteristics by 6 dB/octave; its transfer function is expression 1):
$H(z) = 1 - \mu z^{-1}$   1);
where μ is the pre-emphasis coefficient, with a value of 0.9-1.0;
Step 1.2, endpoint detection: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate. Let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
$E = \sum_{m=0}^{N-1} x^{2}(m)$   2);
Its short-time zero-crossing rate Z is calculated by expression 3):
$Z = \frac{1}{2}\sum_{m=0}^{N-1} \big|\,\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]\,\big|$   3);
where sgn[·] is the sign function, taking the value 1 for x ≥ 0 and -1 for x < 0;
Step 1.3, framing and windowing: framing divides the speech into consecutive frames, each 10-30 ms long; windowing applies a Hamming window to each frame signal.
Preferably, the windowing proceeds as follows: let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is given by expression 4):
y(n) = x(n)·w(n), 0 ≤ n ≤ N-1   4);
where N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1.
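For illustration, the pre-processing chain of expressions 1) to 4) can be sketched in NumPy as below; the sampling rate, frame length, and frame shift are assumed values for the sketch, not taken from the patent:

```python
import numpy as np

def pre_emphasis(x, mu=0.95):
    """Expression 1): y[n] = x[n] - mu * x[n-1], i.e. H(z) = 1 - mu * z^-1."""
    return np.append(x[0], x[1:] - mu * x[:-1])

def short_time_energy(frame):
    """Expression 2): sum of squared samples over the frame."""
    return np.sum(frame ** 2)

def zero_crossing_rate(frame):
    """Expression 3): half the summed |sgn differences| counts one per crossing."""
    signs = np.where(frame >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(signs)))

def frame_and_window(x, fs=16000, frame_ms=25, shift_ms=10):
    """Split into overlapping frames and apply the Hamming window of expression 4)."""
    n, step = int(fs * frame_ms / 1000), int(fs * shift_ms / 1000)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n) / (n - 1))
    starts = range(0, len(x) - n + 1, step)
    return np.array([x[s:s + n] * w for s in starts])
```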
Preferably, the Mel-frequency filtering proceeds as follows: the discrete spectrum produced by the FFT is filtered by a bank of triangular filters, giving a set of coefficients m₁, m₂, …; the number p of filters in the bank is determined by the cut-off frequency of the signal, and together the filters cover the range from 0 Hz up to the Nyquist frequency, i.e. half the sampling rate. Each $m_i$ is calculated by expression 5):
$m_i = \ln\!\left(\sum_{k=0}^{N-1} |X(k)| \cdot H_i(k)\right), \quad i = 1, 2, \ldots, p$   5);
where $H_i(k)$ is the frequency response of the i-th triangular filter and f[i] is its centre frequency, satisfying Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]); X(k) is the discrete spectrum of the frame signal x(n) after the FFT.
Preferably, the discrete cosine transform proceeds as follows: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, giving the Mel-frequency cepstral coefficients, calculated by expression 6):
$\mathrm{MFCC}(i) = \sqrt{\frac{2}{N}}\,\sum_{j=1}^{P} m_j \cos\!\left[(j - 0.5)\frac{\pi i}{P}\right]$   6);
where MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
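A compact sketch of expressions 5) and 6) follows. The Mel-scale formula Mel(f) = 2595·log10(1 + f/700) and the filter count are standard assumptions not spelled out in the patent, and the DCT is normalised here by the filter count P, a common convention (the patent normalises by the frame length N):

```python
import numpy as np

def mel(f):                       # assumed standard Mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filterbank_log_energies(magnitude_spectrum, fs=16000, n_filters=24):
    """Expression 5): m_i = ln(sum_k |X(k)| * H_i(k)) over triangular filters
    whose centre frequencies are equally spaced on the Mel scale."""
    n_fft = (len(magnitude_spectrum) - 1) * 2
    edges = mel_inv(np.linspace(0.0, mel(fs / 2), n_filters + 2))  # f[i-1..i+1]
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    m = np.zeros(n_filters)
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        h = np.zeros(len(magnitude_spectrum))                      # H_i(k)
        h[left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        h[centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
        m[i - 1] = np.log(np.dot(magnitude_spectrum, h) + 1e-12)
    return m

def mfcc(m, n_ceps=12):
    """Expression 6): DCT of the log filterbank energies m_1..m_P."""
    P = len(m)
    i = np.arange(1, n_ceps + 1)[:, None]
    j = np.arange(1, P + 1)[None, :]
    return np.sqrt(2.0 / P) * np.sum(m * np.cos((j - 0.5) * np.pi * i / P), axis=1)
```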
Preferably, the reference model reflecting the auditory characteristics of the human ear is obtained as follows:
Let the observed feature vector sequence be O = o₁, o₂, …, o_T and the corresponding state sequence S = s₁, s₂, …, s_N; the HMM of the sequence is then expressed as expression 7):
λ = (π, A, B)   7);
where π = {π_i = P(s₁ = i), i = 1, 2, …, N} is the initial-state probability vector; A = {a_ij} is the state transition probability matrix, a_ij being the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
For a continuous HMM the observation sequence is a continuous signal, and the signal space associated with state j is represented by a mixture of M Gaussian density functions, as in expressions 8) and 9):
$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t, \mu_{jk}, C_{jk}), \quad 1 \le j \le N$   8);
$N(o_t, \mu_{jk}, C_{jk}) = (2\pi)^{-D/2}\, |C_{jk}|^{-1/2} \exp\!\left(-\tfrac{1}{2}(o_t - \mu_{jk})^{T} C_{jk}^{-1}(o_t - \mu_{jk})\right)$   9);
where c_jk is the weight of the k-th Gaussian mixture density of state j, μ_jk is the mean vector of the Gaussian density, C_jk is the covariance matrix, and D is the dimension of the observation sequence O. The HMM parameters are estimated from the observation sequence O = o₁, o₂, …, o_T; the goal of estimation is to maximise the likelihood P(O | λ) of the model given the training data, i.e. $\bar{\lambda} = \arg\max_{\lambda} P(O \mid \lambda)$.
The forward computation of the likelihood P(O | λ) is given by expression 10):
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$   10);
where $\alpha_1(i) = \pi_i\, b_i(o_1)$, 1 ≤ i ≤ N, and
$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N$;
The backward computation of the likelihood P(O | λ) is given by expression 11):
$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$   11);
where $\beta_T(i) = 1$, 1 ≤ i ≤ N, and
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N, \; t = T-1, T-2, \ldots, 1$;
For a given observation sequence O = o₁, o₂, …, o_T the updated λ is obtained by re-estimation. Define ξ_t(i, j) as the probability of being in state s_i at time t and in state s_j at time t+1, given by expression 12):
$\xi_t(i,j) = P(s_t = s_i,\, s_{t+1} = s_j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$   12);
Given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
$\gamma_t(i) = P(s_t = i \mid O, \lambda) = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} = \sum_{j=1}^{N} \xi_t(i,j)$   13);
The re-estimates of the HMM parameters λ are then:
$\bar{\pi}_i = \gamma_1(i)$;
$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$;
and the parameters c_jk, μ_jk, and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15), and 16):
$\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{k=1}^{M} \gamma_t(j,k)}$   14);
$\bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$   15);
$\bar{C}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \mu_{jk})(o_t - \mu_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}$   16);
where γ_t(j, k), the probability of the k-th Gaussian mixture component of state j at time t, is obtained from:
$\gamma_t(j,k) = \dfrac{\alpha_t(j)\, \beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} \cdot \dfrac{c_{jk}\, N(o_t, \mu_{jk}, C_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(o_t, \mu_{jm}, C_{jm})}$.
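As an illustration of the forward recursion in expression 10) (training via expressions 12) to 16) is the standard Baum-Welch re-estimation and is omitted from this sketch), a numerically stable log-domain forward pass might look like:

```python
import numpy as np

def forward_log_likelihood(pi, A, log_b):
    """log P(O | lambda) via the forward recursion of expression 10).
    pi: (N,) initial state probabilities; A: (N, N) transition matrix;
    log_b: (T, N) with log_b[t, j] = log b_j(o_t)."""
    log_alpha = np.log(pi) + log_b[0]              # alpha_1(i) = pi_i * b_i(o_1)
    for t in range(1, log_b.shape[0]):
        m = log_alpha.max()                        # log-sum-exp for stability
        log_alpha = m + np.log(np.exp(log_alpha - m) @ A) + log_b[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())  # sum_i alpha_T(i)
```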
Preferably, the consistency measure is computed with expression 17):
$C(X_1, \ldots, X_N) = \dfrac{1}{N} \sum_{j=1}^{N} \log\!\big(P(X_j \mid \lambda)\big)$   17);
where X₁, …, X_N are the Mel-frequency cepstral coefficient vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
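Expression 17) is simply the average per-vector log-likelihood of the distorted speech's MFCC vectors under the trained reference model. A minimal sketch, assuming a `score(x)` callable (for example built on the forward pass above) that returns log P(X_j | λ):

```python
import numpy as np

def consistency_measure(mfcc_vectors, score):
    """C(X_1..X_N) = (1/N) * sum_j log P(X_j | lambda)  -- expression 17)."""
    return float(np.mean([score(x) for x in mfcc_vectors]))
```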
Preferably, the bit error rate is calculated as follows:
Step A, generate a PN sequence and multiply it by a chaotic sequence. The chaotic sequence is generated by the logistic map, defined as:
$x_{k+1} = \mu x_k (1 - x_k)$
where 0 ≤ μ ≤ 4 is called the branch parameter and x_k ∈ (0, 1). When 3.5699456… < μ ≤ 4 the logistic map operates in the chaotic regime, i.e. the sequence {x_k; k = 0, 1, 2, 3, …} produced from an initial condition under the logistic map is aperiodic, non-convergent, and highly sensitive to the initial value. The monitoring sequence is generated as follows:
Step a1, first produce the real-valued sequence, and take from some starting position in it a segment whose length equals the size of the monitoring sequence;
Step a2, convert the real-valued sequence into a binary sequence by defining a threshold Γ:
$\Gamma(x) = \begin{cases} -1, & x < \Gamma \\ 1, & \Gamma \le x \end{cases}$
The binary chaotic sequence is then {Γ(x_k); k = 0, 1, 2, 3, …};
Step a3, multiply the binary chaotic sequence by a PN sequence to obtain the monitoring sequence.
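A sketch of steps a1-a3, where μ, x₀, the threshold Γ, and the PN generator are illustrative choices (a real system would typically use an LFSR-based PN sequence):

```python
import numpy as np

def logistic_map(x0=0.3, mu=3.9, n=256, skip=100):
    """Iterate x_{k+1} = mu * x_k * (1 - x_k); drop a transient of `skip` steps."""
    x, out = x0, np.empty(n)
    for _ in range(skip):
        x = mu * x * (1.0 - x)
    for k in range(n):
        x = mu * x * (1.0 - x)
        out[k] = x
    return out

def monitoring_sequence(length=256, threshold=0.5, seed=7):
    real_valued = logistic_map(n=length)                   # step a1
    chaotic = np.where(real_valued < threshold, -1, 1)     # step a2: Gamma(x)
    pn = np.random.default_rng(seed).choice([-1, 1], size=length)  # stand-in PN
    return chaotic * pn                                    # step a3
```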
Step B, insert a synchronisation code for the monitoring sequence, so that the embedded monitoring sequence below can be extracted frame by frame;
Step C, embed the monitoring sequence, with its synchronisation code, into the speech signal in the wavelet domain, as follows:
Step c1, select the Daubechies-10 wavelet as the wavelet function;
Step c2, divide the speech signal into frames of 1152 samples each, and apply a 3-level wavelet transform to every frame;
Step c3, quantise the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the monitoring sequence into the speech signal. Let f be the coefficient to be quantised, w the bit of the monitoring sequence to be embedded, and Δ the quantisation step; after quantisation the coefficient carrying the monitoring-sequence information is f′. First take the absolute value of f and round down: for f > 0, let $m = \lfloor f/\Delta \rfloor$ and n = m mod 2; for f < 0, let $m = \lfloor |f|/\Delta \rfloor$ and n = m mod 2, setting n = w. The quantised coefficient f′ is then formed so that the parity of its quantisation index matches the embedded bit w (see the sketch after step c4). According to these formulas the monitoring sequence is embedded into the speech signal bit by bit;
Step c4, transform the signal with the embedded monitoring sequence back into the time domain;
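The exact quantisation branches of step c3 appear only as images in the published document; the sketch below implements a standard odd-even quantisation consistent with the description, where the parity of the quantisation index ⌊|f|/Δ⌋ encodes the bit w. It also covers the inverse operation of step d3:

```python
import numpy as np

def embed_bit(f, w, delta):
    """Force the parity of the quantisation index of |f| to equal the bit w."""
    m = int(np.floor(abs(f) / delta))
    if m % 2 != w:
        m += 1                       # shift to the next index with the right parity
    return float(np.sign(f)) * (m * delta + delta / 2.0)

def extract_bit(f_prime, delta):
    """Inverse operation (step d3): recover w from the index parity."""
    return int(np.floor(abs(f_prime) / delta)) % 2
```

For example, with Δ = 1, embedding w = 0 into f = 3.7 yields f′ = 4.5, and extract_bit(4.5, 1) returns 0.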
Step D, extract the embedded monitoring sequence from the received speech and calculate the bit error rate. The extraction proceeds as follows:
Step d1, search for the synchronisation code in the speech signal: if the length of the signal to be searched is L, then L should exceed the combined length of two synchronisation codes plus one complete monitoring sequence. Let the search start at I = 1; if consecutive sample values of the signal lie in the range 900-1100, a possible synchronisation code has been found and is compared against the preset synchronisation code. If it is judged to be the synchronisation code, then point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2, starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3, apply to the coefficients f after wavelet decomposition the operation inverse to the one used during embedding, i.e.: for f > 0, let $m = \lfloor f/\Delta \rfloor$ and w = m mod 2; for f < 0, let $m = \lfloor |f|/\Delta \rfloor$ and w = m mod 2;
thereby extracting the binary monitoring sequence;
Step d4, compare the extracted monitoring sequence with the inserted one, and calculate the bit error rate by expression 18):
$\mathrm{BER} = \dfrac{\mathrm{HammingWeight}\big(Seq_{send} \oplus Seq_{receive}\big)}{Seq_{length}}$   18);
where Seq_send, Seq_receive, and Seq_length denote the transmitted monitoring sequence, the received monitoring sequence, and the sequence length, respectively; HammingWeight(·) returns the Hamming weight of a sequence, and ⊕ denotes the XOR operation.
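Expression 18) reduces to counting disagreeing positions; a minimal sketch over equal-length sequences:

```python
import numpy as np

def bit_error_rate(seq_send, seq_receive):
    """Expression 18): HammingWeight(send XOR receive) / length."""
    seq_send, seq_receive = np.asarray(seq_send), np.asarray(seq_receive)
    return np.count_nonzero(seq_send != seq_receive) / seq_send.size
```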
Preferably, the mapping is obtained by expression 19):
$\widehat{\mathrm{MOS}} = f(c_1, \ldots, c_N)$   19);
where f(·) is a multivariate nonlinear regression model; c_i is the consistency measure of the i-th parameter; N is the number of speech characteristic parameters; and $\widehat{\mathrm{MOS}}$ is the objective MOS score predicted from c₁, …, c_N by f(·).
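As a hedged sketch of expression 19), a second-order polynomial regression, one possible multivariate nonlinear model (the patent does not fix the form of f), fitted with scikit-learn on clearly synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
consistency = rng.uniform(-60.0, -20.0, size=50)   # c_i: consistency measures (synthetic)
ber = rng.uniform(0.0, 0.3, size=50)               # bit error rates (synthetic)
mos = 4.5 + 0.03 * (consistency + 40.0) - 8.0 * ber + rng.normal(0, 0.1, 50)

X = np.column_stack([consistency, ber])
f = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
f.fit(X, mos)                                       # learn the mapping f(.)
predicted_mos = f.predict(X)                        # objective MOS estimates
```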
The technical scheme of the present invention has the following effects:
1. MFCCs approximate the Mel frequency scale, stretching the low-frequency information of speech and compressing the high-frequency information; they are suitable for robust speech analysis and speech recognition, suppress speaker-dependent features, and retain the linguistic content of the speech segments.
2. The present invention establishes a mapping between subjective MOS scores, the objective measures, and the channel quality, yielding a prediction model for subjective MOS whose scores are closer to the subjective quality.
3. The method of the invention is simple in its steps, convenient to use, and evaluates speech quality objectively and effectively, without relying on subjective assessment.
In addition to the objects, features, and advantages described above, the present invention has further objects, features, and advantages, which are explained in detail below with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which form part of this application, provide a further understanding of the present invention; the schematic embodiments of the invention and their descriptions explain the invention and do not limit it improperly. In the drawings:
Fig. 1 is a schematic diagram of the principle of the output-based objective speech quality assessment method of Embodiment 1.
Embodiment
Embodiments of the invention are described in detail below with reference to the drawings, but the invention can be implemented in many different ways as defined and covered by the claims.
Embodiment 1:
An output-based objective speech quality assessment method, referring to Fig. 1, specifically comprises: calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system (the original speech becomes the distorted speech after system transmission; computing the MFCCs is the MFCC parameter extraction step); obtaining a reference model that reflects the auditory characteristics of the human ear (the MFCC parameters of the reference speech are extracted first, and the GMM-HMM model is then obtained); computing a consistency measure between the MFCCs of the distorted speech and the reference model (the consistency calculation); inserting a sequence into the original speech and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system; and, from the consistency measure and the bit error rate, establishing the mapping to subjective MOS scores (the MOS mapping in Fig. 1), yielding an objective prediction model for the MOS score of the speech under evaluation, with which the objective evaluation of speech quality is carried out (here the correlation and the bias error between the mapped MOS scores and the subjective MOS serve as the evaluation criteria). The evaluation speech comes from the ITU speech database (the International Telecommunication Union corpus). The details are as follows:
The calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT (fast Fourier transform), Mel-frequency filtering, and discrete cosine transform, specifically:
The pre-processing comprises the following steps:
Step 1.1, pre-emphasis: pre-emphasis is realised with a digital filter that boosts the high-frequency characteristics by 6 dB/octave; its transfer function is expression 1):
$H(z) = 1 - \mu z^{-1}$   1);
where μ is the pre-emphasis coefficient, with a value of 0.9-1.0 (0.95 is taken here);
Step 1.2, endpoint detection: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate. Let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
$E = \sum_{m=0}^{N-1} x^{2}(m)$   2);
Its short-time zero-crossing rate Z is calculated by expression 3):
$Z = \frac{1}{2}\sum_{m=0}^{N-1} \big|\,\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]\,\big|$   3);
where sgn[·] is the sign function, taking the value 1 for x ≥ 0 and -1 for x < 0;
Step 1.3, framing and windowing: in order to apply the analysis methods of stationary processes, the speech is divided into consecutive frames, each 10-30 ms long; meanwhile, in order to reduce the truncation effect of the speech frames, a Hamming window is applied to each frame signal, specifically:
Let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is given by expression 4):
y(n) = x(n)·w(n), 0 ≤ n ≤ N-1   4);
where N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1.
The Mel-frequency filtering proceeds as follows: the discrete spectrum produced by the FFT is filtered by a bank of triangular filters, giving a set of coefficients m₁, m₂, …; the number p of filters in the bank is determined by the cut-off frequency of the signal, and together the filters cover the range from 0 Hz up to the Nyquist frequency, i.e. half the sampling rate. Each $m_i$ is calculated by expression 5):
$m_i = \ln\!\left(\sum_{k=0}^{N-1} |X(k)| \cdot H_i(k)\right), \quad i = 1, 2, \ldots, p$   5);
where $H_i(k)$ is the frequency response of the i-th triangular filter and f[i] is its centre frequency, satisfying Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]).
Because the Mel spectral coefficients are all real numbers, they can be transformed to the time domain by the discrete cosine transform. The discrete cosine transform: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, giving the Mel-frequency cepstral coefficients, calculated by expression 6):
$\mathrm{MFCC}(i) = \sqrt{\frac{2}{N}}\,\sum_{j=1}^{P} m_j \cos\!\left[(j - 0.5)\frac{\pi i}{P}\right]$   6);
where MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
The reference model reflecting the auditory characteristics of the human ear is obtained as follows:
Speech modelling and training are based on the GMM-HMM. Let the observed feature vector sequence be O = o₁, o₂, …, o_T and the corresponding state sequence S = s₁, s₂, …, s_N; the HMM (hidden Markov model) of the sequence is then expressed as expression 7):
λ = (π, A, B)   7);
where π = {π_i = P(s₁ = i), i = 1, 2, …, N} is the initial-state probability vector; A = {a_ij} is the state transition probability matrix, a_ij being the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
For a continuous HMM the observation sequence is a continuous signal, and the signal space associated with state j is represented by a mixture of M Gaussian density functions, as in expressions 8) and 9):
$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t, \mu_{jk}, C_{jk}), \quad 1 \le j \le N$   8);
$N(o_t, \mu_{jk}, C_{jk}) = (2\pi)^{-D/2}\, |C_{jk}|^{-1/2} \exp\!\left(-\tfrac{1}{2}(o_t - \mu_{jk})^{T} C_{jk}^{-1}(o_t - \mu_{jk})\right)$   9);
where c_jk is the weight of the k-th Gaussian mixture density of state j, μ_jk is the mean vector of the Gaussian density, C_jk is the covariance matrix, and D is the dimension of the observation sequence O. The HMM parameters are estimated from the observation sequence O = o₁, o₂, …, o_T; the goal of estimation is to maximise the likelihood P(O | λ) of the model given the training data, i.e. $\bar{\lambda} = \arg\max_{\lambda} P(O \mid \lambda)$. This is realised with the EM algorithm (expectation-maximisation algorithm), which comprises two parts, the forward and backward probability computations and the re-estimation of the HMM parameters and Gaussian mixture parameters, as follows:
The forward computation of the likelihood P(O | λ) is given by expression 10):
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$   10);
where $\alpha_1(i) = \pi_i\, b_i(o_1)$, 1 ≤ i ≤ N, and
$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N$;
The backward computation of the likelihood P(O | λ) is given by expression 11):
$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$   11);
where $\beta_T(i) = 1$, 1 ≤ i ≤ N, and
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N, \; t = T-1, T-2, \ldots, 1$;
For a given observation sequence O = o₁, o₂, …, o_T the updated λ is obtained by re-estimation. Define ξ_t(i, j) as the probability of being in state s_i at time t and in state s_j at time t+1, given by expression 12):
$\xi_t(i,j) = P(s_t = s_i,\, s_{t+1} = s_j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$   12);
Given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
$\gamma_t(i) = P(s_t = i \mid O, \lambda) = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} = \sum_{j=1}^{N} \xi_t(i,j)$   13);
The re-estimates of the HMM parameters λ are then:
$\bar{\pi}_i = \gamma_1(i)$;
$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$;
and the parameters c_jk, μ_jk, and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15), and 16):
$\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{k=1}^{M} \gamma_t(j,k)}$   14);
$\bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$   15);
$\bar{C}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \mu_{jk})(o_t - \mu_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}$   16);
where γ_t(j, k), the probability of the k-th Gaussian mixture component of state j at time t, is obtained from:
$\gamma_t(j,k) = \dfrac{\alpha_t(j)\, \beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} \cdot \dfrac{c_{jk}\, N(o_t, \mu_{jk}, C_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(o_t, \mu_{jm}, C_{jm})}$.
The consistency measure is computed as follows: after the modelling, the consistency between the Mel-frequency cepstral coefficients of the distorted speech and the reference model is measured with expression 17):
$C(X_1, \ldots, X_N) = \dfrac{1}{N} \sum_{j=1}^{N} \log\!\big(P(X_j \mid \lambda)\big)$   17);
where X₁, …, X_N are the Mel-frequency cepstral coefficient (MFCC) vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
The bit error rate is calculated as follows:
Step A, generate a PN sequence and multiply it by a chaotic sequence. The chaotic sequence is generated by the logistic map, defined as:
$x_{k+1} = \mu x_k (1 - x_k)$
where 0 ≤ μ ≤ 4 is called the branch parameter and x_k ∈ (0, 1). When 3.5699456… < μ ≤ 4 the logistic map operates in the chaotic regime, i.e. the sequence {x_k; k = 0, 1, 2, 3, …} produced from an initial condition under the logistic map is aperiodic, non-convergent, and highly sensitive to the initial value. The monitoring sequence is generated as follows:
Step a1, first produce the real-valued sequence, and take from some starting position in it a segment whose length equals the size of the monitoring sequence;
Step a2, convert the real-valued sequence into a binary sequence by defining a threshold Γ:
$\Gamma(x) = \begin{cases} -1, & x < \Gamma \\ 1, & \Gamma \le x \end{cases}$
The binary chaotic sequence is then {Γ(x_k); k = 0, 1, 2, 3, …};
Step a3, multiply the binary chaotic sequence by a PN sequence (pseudo-noise sequence) to obtain the monitoring sequence;
Step B, insert a synchronisation code for the monitoring sequence. The purpose of the synchronisation code is to keep the receiving end able to extract the monitoring sequence after the audio has been attenuated by the channel. The synchronisation code used here is 16 bits long. In order to locate it exactly, it is embedded in the time domain of the speech signal: concretely, the amplitudes of the 16 samples preceding the monitoring sequence are set to 1000. During extraction at the receiving end, if the starting point is out of sync, a run of 16 consecutive samples with values between 900 and 1100 can be used to quickly locate the starting sample position of the watermark by searching for the synchronisation code; in this way the embedded monitoring sequence below can be extracted frame by frame;
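A sketch of this synchronisation search (16 consecutive samples with values in 900-1100 mark the start of a monitoring sequence); the comparison against the preset code that follows a candidate hit is omitted here:

```python
import numpy as np

def find_sync_start(signal, sync_len=16, lo=900.0, hi=1100.0):
    """Return the index of the first run of sync_len in-band samples, or -1."""
    in_band = (signal >= lo) & (signal <= hi)
    run = 0
    for i, ok in enumerate(in_band):
        run = run + 1 if ok else 0
        if run == sync_len:
            return i - sync_len + 1          # start of the sync code
    return -1
```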
Step C, embed the monitoring sequence, with its synchronisation code, into the speech signal in the wavelet domain. The wavelet domain is chosen for embedding because a monitoring sequence embedded in a transform domain is better concealed and does not cause an audible difference from the original speech. The embedding of the sequence into the speech in the wavelet domain proceeds as follows:
Step c1, since analysing the same problem with different wavelet bases produces different results, a suitable wavelet basis must be selected for the problem at hand; here the Daubechies-10 wavelet is chosen as the wavelet function;
Step c2, divide the speech signal into frames of 1152 samples each and apply a 3-level wavelet transform to every frame; in view of the auditory characteristics of the human ear, the sequence is embedded in the high-frequency band;
Step c3, quantise the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the monitoring sequence into the speech signal. Let f be the coefficient to be quantised, w the bit of the monitoring sequence to be embedded, and Δ the quantisation step; after quantisation the coefficient carrying the monitoring-sequence information is f′. First take the absolute value of f and round down: for f > 0, let $m = \lfloor f/\Delta \rfloor$ and n = m mod 2; for f < 0, let $m = \lfloor |f|/\Delta \rfloor$ and n = m mod 2, setting n = w. The quantised coefficient f′ is then formed so that the parity of its quantisation index matches the embedded bit w. According to these formulas the monitoring sequence can be embedded into the speech signal bit by bit.
Step c4, transform the signal with the embedded monitoring sequence back into the time domain;
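A per-frame sketch of steps c1-c4 using PyWavelets (an assumed dependency), reusing the embed_bit quantiser sketched earlier; the quantisation step Δ is illustrative, and bits are taken as 0/1 (a ±1 monitoring sequence maps via (s + 1) // 2):

```python
import numpy as np
import pywt  # PyWavelets

def embed_in_frame(frame, bits, delta=0.05):
    """3-level db10 decomposition (steps c1-c2); embed bits into the
    highest-frequency detail band (step c3); reconstruct (step c4)."""
    coeffs = pywt.wavedec(frame, 'db10', level=3)
    detail = coeffs[-1]                          # finest detail coefficients
    for i, w in enumerate(bits[:len(detail)]):
        detail[i] = embed_bit(detail[i], int(w), delta)
    return pywt.waverec(coeffs, 'db10')[:len(frame)]
```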
Step D, extract the embedded monitoring sequence from the received speech and calculate the bit error rate. The extraction of the monitoring sequence is the inverse of the embedding, so the wavelet function and the number of decomposition levels are kept unchanged. The extraction proceeds as follows:
Step d1, search for the synchronisation code in the speech signal: if the length of the signal to be searched is L, then L should exceed the combined length of two synchronisation codes plus one complete monitoring sequence. Let the search start at I = 1; if 16 consecutive sample values of the signal lie in the range 900-1100, a possible synchronisation code has been found and is compared against the preset synchronisation code. If it is judged to be the synchronisation code, then point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2, starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3, apply to the coefficients f after wavelet decomposition the operation inverse to the one used during embedding, i.e.:
for f > 0, let $m = \lfloor f/\Delta \rfloor$ and w = m mod 2;
for f < 0, let $m = \lfloor |f|/\Delta \rfloor$ and w = m mod 2;
thereby extracting the binary monitoring sequence;
Step d4, compare the extracted monitoring sequence with the inserted one, and calculate the bit error rate by expression 18) (the bit error rate serves as one of the objective measures of speech quality):
$\mathrm{BER} = \dfrac{\mathrm{HammingWeight}\big(Seq_{send} \oplus Seq_{receive}\big)}{Seq_{length}}$   18);
where Seq_send, Seq_receive, and Seq_length denote the transmitted monitoring sequence, the received monitoring sequence, and the sequence length, respectively; HammingWeight(·) returns the Hamming weight of a sequence, and ⊕ denotes the XOR operation.
After the parameter consistency measures of the speech under the various distortion conditions have been computed, a functional mapping can be used to represent the relation between the parameter consistency measures and the objective $\widehat{\mathrm{MOS}}$, i.e. the mapping is obtained by expression 19):
$\widehat{\mathrm{MOS}} = f(c_1, \ldots, c_N)$   19);
where f(·) is the prediction function (it may be a linear or nonlinear regression relation or a polynomial fit; in this embodiment, in order to predict the MOS value more accurately, a multivariate nonlinear regression model is preferred); c_i is the consistency measure of the i-th parameter; N is the number of speech characteristic parameters; and $\widehat{\mathrm{MOS}}$ is the objective MOS score predicted from c₁, …, c_N by f(·). The larger the bit error rate, the stronger the interference in the channel and, correspondingly, the greater the speech damage incurred during transmission; the predicted $\widehat{\mathrm{MOS}}$ value is then smaller and the speech quality poorer.
The performance of the speech quality assessment algorithm is measured below in terms of correlation and bias error. The correlation mainly reflects whether the mapping by which the algorithm obtains the predicted MOS score from the distortion measures is reasonable; the correlation and the bias error between the MOS scores mapped by the algorithm and the known subjective MOS values generally serve as the evaluation criteria.
The correlation coefficient ρ and the standard estimation bias σ are obtained by expressions 20) and 21):
$\rho = \dfrac{\sum_{i=1}^{N}\big(\mathrm{MOS}_o(i) - \overline{\mathrm{MOS}_o}\big)\big(\mathrm{MOS}_s(i) - \overline{\mathrm{MOS}_s}\big)}{\sqrt{\sum_{i=1}^{N}\big(\mathrm{MOS}_o(i) - \overline{\mathrm{MOS}_o}\big)^{2}\,\sum_{i=1}^{N}\big(\mathrm{MOS}_s(i) - \overline{\mathrm{MOS}_s}\big)^{2}}}$   20);
$\sigma = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\big(\mathrm{MOS}_o(i) - \mathrm{MOS}_s(i)\big)^{2}}$   21);
where MOS_o(i) is the predicted MOS value of the i-th speech item, MOS_s(i) is its known MOS score, N is the total number of speech pairs, $\overline{\mathrm{MOS}_o}$ is the mean of the predicted MOS values, and $\overline{\mathrm{MOS}_s}$ is the mean of the MOS scores.
The closer the correlation coefficient ρ is to 1, the closer the predicted MOS values are to the true MOS values; the smaller the bias error σ, the smaller the prediction error and the better the algorithm performs.
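Expressions 20) and 21) are the Pearson correlation and the root-mean-square bias; a minimal sketch:

```python
import numpy as np

def evaluate(mos_pred, mos_true):
    """rho: expression 20); sigma: expression 21)."""
    mos_pred, mos_true = np.asarray(mos_pred), np.asarray(mos_true)
    rho = np.corrcoef(mos_pred, mos_true)[0, 1]
    sigma = np.sqrt(np.mean((mos_pred - mos_true) ** 2))
    return rho, sigma
```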
The performance comparison between the assessment method of Embodiment 1 and the ITU-T P.563 objective evaluation method of the International Telecommunication Union is detailed in Table 1.
As Table 1 shows, the method of the invention (Embodiment 1) improves on the ITU-T P.563 algorithm to a certain degree: the average correlation ρ with subjective MOS is higher and the estimation bias σ is lower; the method of the invention is therefore valid and feasible.
Table 1. Performance comparison between the method of the invention (Embodiment 1) and ITU-T P.563 on the processed speech
The foregoing are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (10)

1. An output-based objective speech quality assessment method, characterised by comprising the following steps:
calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system; obtaining a reference model that reflects the auditory characteristics of the human ear;
computing a consistency measure between the Mel-frequency cepstral coefficients of the distorted speech and the reference model; inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system;
establishing, from the consistency measure and the bit error rate, a mapping to subjective MOS scores, thereby obtaining an objective prediction model for the MOS score of the speech under evaluation; the objective evaluation of speech quality is carried out with this objective prediction model.
2. The output-based objective speech quality assessment method according to claim 1, characterised in that: the calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT, Mel-frequency filtering, and discrete cosine transform.
3. The output-based objective speech quality assessment method according to claim 2, characterised in that:
the pre-processing specifically comprises the following steps:
Step 1.1, pre-emphasis: pre-emphasis is realised with a digital filter that boosts the high-frequency characteristics by 6 dB/octave; its transfer function is expression 1):
$H(z) = 1 - \mu z^{-1}$   1);
where μ is the pre-emphasis coefficient, with a value of 0.9-1.0;
Step 1.2, endpoint detection: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate. Let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
$E = \sum_{m=0}^{N-1} x^{2}(m)$   2);
Its short-time zero-crossing rate Z is calculated by expression 3):
$Z = \frac{1}{2}\sum_{m=0}^{N-1} \big|\,\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]\,\big|$   3);
where sgn[·] is the sign function;
Step 1.3, framing and windowing: framing divides the speech into consecutive frames, each 10-30 ms long; windowing applies a Hamming window to each frame signal.
4. The output-based objective speech quality assessment method according to claim 3, characterised in that: the windowing proceeds as follows: let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is given by expression 4):
y(n) = x(n)·w(n), 0 ≤ n ≤ N-1   4);
where N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1.
5. The output-based objective speech quality assessment method according to claim 2, characterised in that: the Mel-frequency filtering proceeds as follows: the discrete spectrum produced by the FFT is filtered by a bank of triangular filters, giving a set of coefficients m₁, m₂, …; the number p of filters in the bank is determined by the cut-off frequency of the signal, and together the filters cover the range from 0 Hz up to the Nyquist frequency, i.e. half the sampling rate; each $m_i$ is calculated by expression 5):
$m_i = \ln\!\left(\sum_{k=0}^{N-1} |X(k)| \cdot H_i(k)\right), \quad i = 1, 2, \ldots, p$   5);
where $H_i(k)$ is the frequency response of the i-th triangular filter, f[i] is its centre frequency, satisfying Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]), and X(k) is the discrete spectrum of the frame signal x(n) after the FFT.
6. The output-based objective speech quality assessment method according to claim 2, characterised in that: the discrete cosine transform proceeds as follows: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, giving the Mel-frequency cepstral coefficients, calculated by expression 6):
$\mathrm{MFCC}(i) = \sqrt{\frac{2}{N}}\,\sum_{j=1}^{P} m_j \cos\!\left[(j - 0.5)\frac{\pi i}{P}\right]$   6);
where MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
7. The output-based objective speech quality assessment method according to claim 1, characterised in that: the reference model reflecting the auditory characteristics of the human ear is obtained as follows:
Let the observed feature vector sequence be O = o₁, o₂, …, o_T and the corresponding state sequence S = s₁, s₂, …, s_N; the HMM of the sequence is then expressed as expression 7):
λ = (π, A, B)   7);
where π = {π_i = P(s₁ = i), i = 1, 2, …, N} is the initial-state probability vector; A = {a_ij} is the state transition probability matrix, a_ij being the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
For a continuous HMM the observation sequence is a continuous signal, and the signal space associated with state j is represented by a mixture of M Gaussian density functions, as in expressions 8) and 9):
$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t, \mu_{jk}, C_{jk}), \quad 1 \le j \le N$   8);
$N(o_t, \mu_{jk}, C_{jk}) = (2\pi)^{-D/2}\, |C_{jk}|^{-1/2} \exp\!\left(-\tfrac{1}{2}(o_t - \mu_{jk})^{T} C_{jk}^{-1}(o_t - \mu_{jk})\right)$   9);
where c_jk is the weight of the k-th Gaussian mixture density of state j; μ_jk is the mean vector of the Gaussian density; C_jk is the covariance matrix; and D is the dimension of the observation sequence O; the HMM parameters are estimated from the observation sequence O = o₁, o₂, …, o_T, the goal of estimation being to maximise the likelihood P(O | λ) of the model given the training data, i.e. $\bar{\lambda} = \arg\max_{\lambda} P(O \mid \lambda)$;
The forward computation of the likelihood P(O | λ) is given by expression 10):
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$   10);
where $\alpha_1(i) = \pi_i\, b_i(o_1)$, 1 ≤ i ≤ N, and
$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N$;
The backward computation of the likelihood P(O | λ) is given by expression 11):
$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$   11);
where $\beta_T(i) = 1$, 1 ≤ i ≤ N, and
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N, \; t = T-1, T-2, \ldots, 1$;
For a given observation sequence O = o₁, o₂, …, o_T the updated λ is obtained by re-estimation; define ξ_t(i, j) as the probability of being in state s_i at time t and in state s_j at time t+1, given by expression 12):
$\xi_t(i,j) = P(s_t = s_i,\, s_{t+1} = s_j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$   12);
Given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
$\gamma_t(i) = P(s_t = i \mid O, \lambda) = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} = \sum_{j=1}^{N} \xi_t(i,j)$   13);
The re-estimates of the HMM parameters λ are then:
$\bar{\pi}_i = \gamma_1(i)$;
$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$;
and the parameters c_jk, μ_jk, and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15), and 16):
$\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{k=1}^{M} \gamma_t(j,k)}$   14);
$\bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$   15);
$\bar{C}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \mu_{jk})(o_t - \mu_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}$   16);
where γ_t(j, k), the probability of the k-th Gaussian mixture component of state j at time t, is obtained from:
$\gamma_t(j,k) = \dfrac{\alpha_t(j)\, \beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} \cdot \dfrac{c_{jk}\, N(o_t, \mu_{jk}, C_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(o_t, \mu_{jm}, C_{jm})}$.
8. The output-based objective speech quality assessment method according to claim 1, characterized in that: the consistency measure is computed using expression 17):
$$C(X_1,\ldots,X_N)=\frac{1}{N}\sum_{j=1}^{N}\log\left(P(X_j\mid\lambda)\right)\quad(17);$$
where $X_1,\ldots,X_N$ are the Mel-frequency cepstral coefficient vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
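A minimal sketch of expression 17): given the distorted speech's MFCC vectors and a scoring function for the trained reference model λ, the consistency measure is simply the mean per-vector log-likelihood. The callable log_likelihood_fn is an assumed interface, not part of the patent.

```python
import numpy as np

def consistency_measure(mfcc_vectors, log_likelihood_fn):
    """Expression 17): mean log-likelihood of the distorted speech's MFCC
    vectors X_1..X_N under the reference model lambda.

    log_likelihood_fn is an assumed interface: a callable that returns
    log P(X_j | lambda) for a single MFCC vector.
    """
    return float(np.mean([log_likelihood_fn(x) for x in mfcc_vectors]))
```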
9. The output-based objective speech quality assessment method according to claim 1, characterized in that: the bit error rate is calculated as follows:
Step A: generate a PN sequence and multiply it by a chaotic sequence. The chaotic sequence is produced by the logistic map, defined as:

$$x_{k+1}=\mu x_k(1-x_k)$$

where $0\le\mu\le 4$ is called the bifurcation parameter and $x_k\in(0,1)$. When $3.5699456\ldots<\mu\le 4$, the logistic map operates in the chaotic regime: the sequence $\{x_k;\,k=0,1,2,3,\ldots\}$ produced from an initial condition is aperiodic and non-convergent, and is extremely sensitive to the initial value. The monitoring sequence is generated by the following steps (a code sketch follows step a3):
Step a1: first generate a real-valued sequence, then select a segment, beginning at some position in the sequence, whose length equals the monitoring-sequence size;
Step a2: convert the real-valued sequence into a binary sequence by defining a threshold Γ and applying it to the real-valued sequence:

$$\Gamma(x)=\begin{cases}-1, & x<\Gamma\\ 1, & \Gamma\le x\end{cases};$$

the binary chaotic sequence is $\{\Gamma(x_k);\,k=0,1,2,3,\ldots\}$;
Step a3: multiply the binary chaotic sequence by a PN sequence to obtain the monitoring sequence;
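The following is a minimal sketch of steps a1-a3, assuming a ±1-valued PN sequence; the values of x0, mu, threshold, and skip are illustrative choices, not values fixed by the claim:

```python
import numpy as np

def chaotic_monitoring_sequence(pn_seq, x0=0.3, mu=3.9, threshold=0.5, skip=100):
    """Steps a1-a3: iterate the logistic map, binarize against a threshold,
    and multiply by a +/-1 PN sequence.

    mu must lie in the chaotic regime (3.5699456... < mu <= 4) and x0 in
    (0, 1); `skip` discards transient iterates before taking the working
    segment (the "starting position" of step a1).
    """
    n = len(pn_seq)
    x = x0
    for _ in range(skip):                 # discard the transient start of the orbit
        x = mu * x * (1.0 - x)
    reals = np.empty(n)
    for i in range(n):                    # real-valued chaotic segment (a1)
        x = mu * x * (1.0 - x)
        reals[i] = x
    binary = np.where(reals < threshold, -1, 1)   # threshold Gamma (a2)
    return binary * np.asarray(pn_seq)            # multiply by the PN sequence (a3)

# Example: a +/-1 PN sequence of length 16 (illustrative only)
pn = np.sign(np.random.default_rng(0).standard_normal(16)).astype(int)
watermark = chaotic_monitoring_sequence(pn)
```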
Step B: insert a synchronization code into the monitoring sequence, so that the embedded monitoring sequence can later be extracted frame by frame;
Step C: embed the monitoring sequence carrying the synchronization code into the speech signal in the wavelet domain. The detailed process is as follows:
Step c1: select the Daubechies-10 wavelet as the wavelet function;
Step c2: divide the speech signal into frames of 1152 samples each, and apply a 3-level wavelet transform to each frame;
Step c3: quantize the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the sequence into the speech signal. Let f denote the coefficient to be quantized, w the monitoring-sequence bit to embed, and Δ the quantization step; after quantization, the coefficient carrying the monitoring information is f′. The specific steps are (a code sketch follows step c4):
Take the modulus and floor of f. When f > 0, let m = ⌊f/Δ⌋ and n = m % 2, and re-quantize f to f′ so that the index parity n equals the embedded bit w; when f < 0, let m = ⌊|f|/Δ⌋ and n = m % 2, and f′ is likewise chosen so that n = w. The monitoring sequence is embedded into the speech signal bit by bit according to this rule;
Step c4: convert the signal carrying the embedded monitoring sequence back into a time-domain signal;
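A minimal sketch of steps c1-c4 for one frame, using PyWavelets. Because the patent gives only the parity condition and not the exact re-quantization formulas, the rule below (force the parity of m = ⌊|f|/Δ⌋ to equal the bit w, then place f′ at the centre of that quantization bin) is an assumption chosen to be consistent with the extraction rule w = m % 2 in step d3; delta is an illustrative step size, and embedding into the approximation coefficients is likewise an assumption:

```python
import numpy as np
import pywt

def embed_frame(frame, bits, delta=0.05):
    """Steps c1-c4 for one 1152-sample frame: 3-level db10 decomposition,
    parity-based re-quantization of coefficients, reconstruction.

    ASSUMED re-quantization rule: adjust m = floor(|f|/delta) so its parity
    equals the bit w, then set f' to the centre of that bin; delta is an
    illustrative quantization step.
    """
    coeffs = pywt.wavedec(frame, 'db10', level=3)     # c1, c2: db10, 3 levels
    approx = coeffs[0]
    for i, w in enumerate(bits):                      # c3: one bit per coefficient
        f = approx[i]
        m = int(np.floor(abs(f) / delta))
        if m % 2 != w:                                # adjust index parity to w
            m += 1
        approx[i] = np.sign(f) * (m + 0.5) * delta    # centre of the chosen bin
    return pywt.waverec(coeffs, 'db10')               # c4: back to the time domain
```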
Step D: extract the embedded monitoring sequence from the received speech and calculate the bit error rate. The extraction process comprises the following steps:
Step d1: search for the synchronization code in the speech signal. Specifically, let L be the length of the signal to be searched; L must exceed the combined length of two synchronization codes plus one complete monitoring sequence. Starting from the initial search point I = 1, if the computed sample value of the signal lies in the range 900-1100, a possible synchronization code is considered found and is compared against the preset synchronization code. If it is confirmed as the synchronization code, point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2: starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3: apply the inverse of the embedding operation to each coefficient f after wavelet decomposition, i.e.: when f > 0, let m = ⌊f/Δ⌋ and recover w = m % 2; when f < 0, let m = ⌊|f|/Δ⌋ and recover w = m % 2;
this extracts the binary monitoring sequence;
Step d4: compare the extracted monitoring sequence with the embedded one, and compute the bit error rate via expression 18):
$$BER=\frac{\mathrm{HammingWeight}\left(Seq_{send}\ \mathrm{XOR}\ Seq_{receive}\right)}{Seq_{length}}\quad(18);$$
where $Seq_{send}$, $Seq_{receive}$, and $Seq_{length}$ denote the transmitted monitoring sequence, the received monitoring sequence, and the sequence length, respectively; HammingWeight(·) returns the Hamming weight of a sequence, and XOR denotes the exclusive-or operation.
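A minimal sketch of steps d3-d4, again under the parity-based rule assumed above (delta and the decomposition level must match the embedding side, and bit sequences are assumed mapped to 0/1 values); the synchronization-code search of step d1 is omitted:

```python
import numpy as np
import pywt

def extract_bits(frame, n_bits, delta=0.05):
    """Step d3: invert the embedding rule by recovering w = m % 2 from the
    quantization index of each coefficient. delta must match embedding."""
    approx = pywt.wavedec(frame, 'db10', level=3)[0]
    m = np.floor(np.abs(approx[:n_bits]) / delta).astype(int)
    return m % 2

def bit_error_rate(seq_send, seq_receive):
    """Expression 18): BER = HammingWeight(send XOR receive) / length,
    for 0/1-valued sequences of equal length."""
    send = np.asarray(seq_send, dtype=int)
    recv = np.asarray(seq_receive, dtype=int)
    return np.count_nonzero(send ^ recv) / len(send)
```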
10. The output-based objective speech quality assessment method according to claim 1, characterized in that: the mapping relation is obtained through expression 19):

$$MOS_{obj}=f(C_1,\ldots,C_N)\quad(19);$$

where f(·) is the multivariate nonlinear regression model, $C_i$ is the consistency measure of the i-th parameter, N is the number of speech feature parameters, and $MOS_{obj}$ is the objective MOS score predicted by f(·) from $C_1,\ldots,C_N$.
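As an illustration of expression 19), the sketch below fits a simple polynomial regression from consistency measures to subjective MOS scores and returns a predictor. The patent does not fix the functional form of f(·) here, so the degree-2 polynomial without cross terms is purely an assumption:

```python
import numpy as np

def fit_mos_mapping(C_train, mos_train, degree=2):
    """Fit an assumed polynomial form of the multivariate nonlinear
    regression f(.) in expression 19), mapping consistency measures
    C_1..C_N to an objective MOS score.

    C_train:   (samples, N) array of consistency measures
    mos_train: (samples,) array of subjective MOS scores
    Returns a predictor callable for new consistency-measure vectors.
    """
    C = np.asarray(C_train, dtype=float)
    # Design matrix: [C, C^2, ..., C^degree, 1] (no cross terms, by assumption)
    X = np.hstack([C ** d for d in range(1, degree + 1)] + [np.ones((len(C), 1))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(mos_train, dtype=float), rcond=None)

    def predict(c):
        c = np.atleast_2d(np.asarray(c, dtype=float))
        x = np.hstack([c ** d for d in range(1, degree + 1)] + [np.ones((len(c), 1))])
        return x @ coef

    return predict
```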