CN107293306B - Output-based objective speech quality assessment method - Google Patents

Output-based objective speech quality assessment method

Info

Publication number
CN107293306B
CN107293306B CN201710475912.8A
Authority
CN
China
Prior art keywords
sequence
monitoring data
signal
mel
expression formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710475912.8A
Other languages
Chinese (zh)
Other versions
CN107293306A (en)
Inventor
李庆先
刘良江
王晋威
朱宪宇
熊婕
李彦博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Original Assignee
HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE filed Critical HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Priority to CN201710475912.8A priority Critical patent/CN107293306B/en
Publication of CN107293306A publication Critical patent/CN107293306A/en
Application granted granted Critical
Publication of CN107293306B publication Critical patent/CN107293306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides an output-based objective speech quality assessment method, comprising the following steps: calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system; obtaining a reference model that conforms to the auditory characteristics of the human ear; performing a consistency measure calculation between the Mel-frequency cepstral coefficients of the distorted speech and the reference model; inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system; and establishing, from the consistency measure and the bit error rate, a mapping between the subjective MOS and the consistency measure, thereby obtaining an objective prediction model for the MOS score of the speech to be assessed, with which the objective assessment of speech quality is carried out. The method of the present invention has simple steps, is easy to use, can effectively and objectively assess speech quality, and does not depend on subjective assessment.

Description

Output-based objective speech quality assessment method
Technical field
The present invention relates to the field of speech processing technology, and in particular to an output-based objective speech quality assessment method.
Background technology
Objective speech quality assessment refers to the automatic judgment of speech quality by a machine. According to whether the input (reference) speech is required, such methods fall into two classes: objective assessment based on the input-output mode and objective assessment based on the output mode.
In many fields, such as wireless mobile communication, aerospace navigation and modern military applications, assessment methods are often required to offer high flexibility, real-time operation and broad applicability, and to assess speech quality even when the original input speech signal cannot be obtained. Objective assessment based on the input-output mode often cannot obtain the corresponding original speech and incurs higher costs for speech storage and the like, so it has clear drawbacks in these application scenarios.
The general procedure of an output-based objective speech quality assessment method is to compute certain characteristic parameters of the speech under evaluation, perform a consistency calculation against the characteristic parameters of reference speech as summarized by a trained model, and finally map the result to an estimate of the subjective MOS score. In this process, the choice of characteristic parameters, training model and MOS mapping method is crucial, as it determines the performance of the assessment system. Since the human ear's perception of sound follows the Bark critical bands, a conversion between linear frequency and a warped frequency scale must be performed during feature extraction. Meanwhile, in applications such as wireless communication, besides analyzing the speech itself, the influence of external factors such as channel quality on speech quality must also be considered.
Therefore, designing an assessment method capable of objectively evaluating the quality of speech after coding or channel transmission is of great significance.
Summary of the invention
The object of the present invention is to provide an output-based objective speech quality assessment method. In view of the auditory characteristics of the human ear with respect to frequency, and taking the cepstral analysis of the speech signal into account, Mel-frequency cepstral coefficients (MFCC) are used to describe the speech features. An objective speech distortion value is obtained by combining the Mel-frequency cepstral coefficients with a trained GMM-HMM model; at the same time, to capture channel effects, a bit-error-rate indicator is introduced into the objective measure. A mapping between the subjective MOS and the objective measures is then established, yielding a prediction model for the subjective MOS that can be used to objectively assess the quality of speech after coding or channel transmission. The details are as follows:
An output-based objective speech quality assessment method comprises the following steps:
calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system; obtaining a reference model that conforms to the auditory characteristics of the human ear;
performing a consistency measure calculation between the Mel-frequency cepstral coefficients of the distorted speech and the reference model that conforms to the auditory characteristics of the human ear; inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system;
establishing, from the consistency measure and the bit error rate, a mapping between the subjective MOS and the consistency measure, thereby obtaining an objective prediction model for the MOS score of the speech to be assessed; the objective assessment of speech quality is carried out with this objective prediction model.
Preferably, in the above technical solution, the calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT, Mel-frequency filtering and discrete cosine transform.
Preferably, in the above technical solution, the pre-processing specifically comprises the following steps:
Step 1.1, pre-emphasis, specifically: pre-emphasis is implemented with a digital filter that boosts the high-frequency components by 6 dB/octave; its transfer function is expression 1):
H(z) = 1 - μ·z^(-1) 1);
wherein μ is the pre-emphasis coefficient, with a value of 0.9-1.0;
Step 1.2, endpoint detection, specifically: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate; let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
E = Σ_{m=0}^{N-1} x(m)² 2);
its short-time zero-crossing rate Z is calculated by expression 3):
Z = (1/2) · Σ_{m=1}^{N-1} |sgn[x(m)] - sgn[x(m-1)]| 3);
wherein sgn[·] is the sign function, i.e.: sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0;
Step 1.3, framing and windowing, specifically: framing divides the speech into successive frames, each 10-30 ms long; windowing applies a Hamming window to each frame signal.
Preferably, in the above technical solution, the detailed windowing procedure is: let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is then given by expression 4):
y(n) = x(n) · w(n), 0 ≤ n ≤ N-1 4); wherein N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1.
Preferably, in the above technical solution, the Mel-frequency filtering is specifically: the discrete spectrum obtained by the FFT is filtered with a bank of triangular filters, yielding a set of coefficients m_1, m_2, …; the number p of filters in the bank is determined by the cutoff frequency of the signal, and together the filters cover the range from 0 Hz to the Nyquist frequency, i.e. half the sampling rate; each m_i is calculated by expression 5):
m_i = Σ_k H_i(k) · |X(k)| 5);
wherein H_i(k) is the triangular weighting of the i-th filter; f[i] is the centre frequency of the i-th triangular filter, which satisfies: Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]); and X(k) is the discrete spectrum of the frame signal x(n) after the FFT.
Preferably, in the above technical solution, the discrete cosine transform is specifically: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, yielding the Mel-frequency cepstral coefficients, calculated by expression 6):
MFCC(i) = sqrt(2/P) · Σ_{j=1}^{P} ln(m_j) · cos[π·i·(j - 1/2)/P] 6);
wherein MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
Preferably, in the above technical solution, the detailed procedure for obtaining the reference model that conforms to the auditory characteristics of the human ear is as follows:
Let the observed feature vector sequence be O = o1, o2, …, oT and the corresponding state sequence be S = s1, s2, …, sN; the HMM of the sequence is then expressed as expression 7):
λ = (π, A, B) 7);
wherein π = {π_i = P(s1 = i), i = 1, 2, …, N} is the initial state probability vector; A = {a_ij} is the matrix of transition probabilities between states, with a_ij the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
for a continuous HMM, the observation sequence is a continuous signal, and the signal space associated with state j is represented by a sum of M Gaussian mixture density functions, as in expressions 8) and 9):
b_j(o_t) = Σ_{k=1}^{M} c_jk · N(o_t; μ_jk, C_jk) 8);
N(o_t; μ_jk, C_jk) = (2π)^(-D/2) · |C_jk|^(-1/2) · exp[-(o_t - μ_jk)^T · C_jk^(-1) · (o_t - μ_jk) / 2] 9);
wherein c_jk is the weight of the k-th Gaussian mixture density function of state j; μ_jk is the mean vector of the Gaussian density function; C_jk is the covariance matrix; and D is the dimension of the observation sequence O. The HMM parameters are estimated from the observation sequence O = o1, o2, …, oT; the goal of the estimation is to maximize the likelihood function P(O|λ) of the model and the training data, i.e. to find λ* = argmax_λ P(O|λ).
The forward probability of the likelihood function p(O|λ) is calculated as expression 10):
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) · a_ij] · b_j(o_{t+1}), 1 ≤ t ≤ T-1 10);
wherein α_1(i) = π_i · b_i(o1), 1 ≤ i ≤ N;
the backward probability of the likelihood function p(O|λ) is calculated as expression 11):
β_t(i) = Σ_{j=1}^{N} a_ij · b_j(o_{t+1}) · β_{t+1}(j), t = T-1, T-2, …, 1 11);
wherein β_T(i) = 1, 1 ≤ i ≤ N;
For the given observation sequence O = o1, o2, …, oT, the updated λ is obtained by re-estimation. Define ξ_t(i, j) as the probability that the state is s_i at time t and s_j at time t+1, obtained by expression 12):
ξ_t(i, j) = α_t(i) · a_ij · b_j(o_{t+1}) · β_{t+1}(j) / P(O|λ) 12);
given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
γ_t(i) = α_t(i) · β_t(i) / Σ_{j=1}^{N} α_t(j) · β_t(j) 13);
the re-estimation of the HMM parameters λ is then as follows: the parameters c_jk, μ_jk and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15) and 16):
c_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k) 14);
μ_jk = Σ_{t=1}^{T} γ_t(j, k) · o_t / Σ_{t=1}^{T} γ_t(j, k) 15);
C_jk = Σ_{t=1}^{T} γ_t(j, k) · (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k) 16);
wherein γ_t(j, k) denotes the probability of occupying the k-th Gaussian mixture component of state j at time t, obtained by:
γ_t(j, k) = γ_t(j) · c_jk · N(o_t; μ_jk, C_jk) / Σ_{m=1}^{M} c_jm · N(o_t; μ_jm, C_jm).
Preferably, in the above technical solution, the consistency measure is computed specifically using expression 17):
C = (1/N) · log P(X_1, …, X_N | λ) 17);
wherein X_1, …, X_N are the Mel-frequency cepstral coefficient vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
Preferably, in the above technical solution, the calculation of the bit error rate proceeds as follows:
Step A: generate a PN sequence and multiply it by a chaotic sequence; the chaotic sequence is produced by the logistic map, which is defined as follows:
x_{k+1} = μ · x_k · (1 - x_k)
wherein 0 ≤ μ ≤ 4 is called the bifurcation parameter and x_k ∈ (0, 1); when 3.5699456… < μ ≤ 4 the logistic map operates in the chaotic regime, i.e. the sequence {x_k; k = 0, 1, 2, 3, …} generated from an initial condition under the logistic map is aperiodic, non-convergent and highly sensitive to the initial value; the monitoring sequence is generated as follows:
Step a1: first generate the real-valued sequence, and select from it a segment, starting at some position, whose length equals the size of the monitoring sequence;
Step a2: convert the real-valued sequence into a binary sequence by defining a threshold Γ and applying it to the real-valued sequence:
Γ(x_k) = 0 if x_k ≤ Γ, and Γ(x_k) = 1 if x_k > Γ;
the binary chaotic sequence is {Γ(x_k); k = 0, 1, 2, 3, …};
Step a3: multiply the binary chaotic sequence by a PN sequence to obtain the monitoring sequence;
Step B: insert a synchronization code in front of the monitoring sequence, so that the embedded monitoring sequence can later be extracted frame by frame;
Step C: embed the monitoring sequence carrying the synchronization code into the speech signal in the wavelet domain; the detailed procedure is as follows:
Step c1: choose the Daubechies-10 wavelet as the wavelet function;
Step c2: divide the speech signal into frames of 1152 samples each, and apply a 3-level wavelet transform to every frame;
Step c3: quantize the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the monitoring sequence in the speech signal; let f be the coefficient to be quantized, w the bit of the monitoring sequence to be embedded, Δ the quantization step, and f′ the quantized coefficient carrying the monitoring-sequence information; the specific steps are:
take the magnitude of f and apply the floor operation; when f > 0, let m = ⌊f/Δ⌋ and n = m mod 2; then:
f′ = (m + 1/2)·Δ if n = w, and f′ = (m + 3/2)·Δ otherwise;
when f < 0, let m = ⌊|f|/Δ⌋ and n = m mod 2; then:
f′ = -(m + 1/2)·Δ if n = w, and f′ = -(m + 3/2)·Δ otherwise;
the monitoring sequence is embedded into the speech signal bit by bit according to the above formulas;
Step c4: transform the signal carrying the embedded monitoring sequence back into the time domain;
Step D: extract the embedded monitoring sequence from the received speech and calculate the bit error rate; the extraction procedure comprises the following steps:
Step d1: search for the synchronization code in the speech signal, specifically: let L be the length of the signal to be searched; L should exceed the combined length of two synchronization codes plus one complete monitoring sequence; let the initial search point of the signal be I = 1; if the 16 consecutive sample values of the signal starting at point I all lie in the range 900-1100, a possible synchronization code is considered to have been found and is compared with the preset synchronization code; if it is confirmed as the synchronization code, point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2: starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3: apply to the coefficients f of the wavelet decomposition the operation inverse to that used during embedding, i.e.: when f > 0, let m = ⌊f/Δ⌋ and w = m mod 2; when f < 0, let m = ⌊|f|/Δ⌋ and w = m mod 2;
the binary monitoring sequence is thereby extracted;
Step d4: compare the extracted monitoring sequence with the inserted monitoring sequence, and calculate the bit error rate by expression 18):
BER = HammingWeight(Seq_send XOR Seq_receive) / Seq_length 18);
wherein Seq_send, Seq_receive and Seq_length denote the transmitted monitoring sequence, the received monitoring sequence and the sequence length respectively; HammingWeight(·) denotes the Hamming weight of a sequence, and XOR denotes the exclusive-or operation.
Preferably, in the above technical solution, the mapping relationship is obtained by expression 19):
MOS_objective = f(c_1, …, c_N) 19);
wherein f(·) is a multivariate nonlinear regression model; c_i is the consistency measure of the i-th parameter; N is the number of speech characteristic parameters; and MOS_objective is the objective MOS score predicted by f(·) from c_1, …, c_N.
Applying the technical solution of the present invention has the following effects:
1. MFCC approximates the Mel frequency scale, stretching the low-frequency information of the speech and compressing its high-frequency information; it can be used for robust speech analysis and speech recognition, suppresses speaker-dependent characteristics, and retains the linguistic content of the speech segments.
2. The present invention establishes a mapping between the subjective MOS on the one hand and the objective measure and the channel quality on the other, yielding a prediction model for the subjective MOS so that the predicted score is closer to the subjective quality.
3. The method of the present invention has simple steps, is easy to use, can effectively and objectively assess speech quality, and does not depend on subjective assessment.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages. The present invention is described in further detail below with reference to the accompanying drawings.
Description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention; the illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic diagram of the principle of the output-based objective speech quality assessment method in Embodiment 1.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the present invention can be implemented in the many different ways defined and covered by the claims.
Embodiment 1:
An output-based objective speech quality assessment method, referring to Fig. 1, specifically comprises: calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system (the distorted speech is the original speech after transmission through the system; calculating the Mel-frequency cepstral coefficients is the MFCC parameter extraction process); obtaining a reference model that conforms to the auditory characteristics of the human ear (the MFCC parameters of the reference speech are extracted first, and the GMM-HMM model is then obtained); performing a consistency measure calculation between the Mel-frequency cepstral coefficients of the distorted speech and the reference model that conforms to the auditory characteristics of the human ear (i.e. the consistency calculation); inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system; establishing, from the consistency measure and the bit error rate, a mapping between the subjective MOS and the consistency measure (i.e. the MOS mapping in Fig. 1), and obtaining an objective prediction model for the MOS score of the speech to be assessed, with which the objective assessment of speech quality is carried out (here the correlation and the bias between the mapped MOS scores and the subjective MOS serve as the evaluation criteria). The speech used for evaluation is the ITU speech corpus (the speech database of the International Telecommunication Union). The details are as follows:
The calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT (fast Fourier transform), Mel-frequency filtering and discrete cosine transform, specifically:
The pre-processing specifically comprises the following steps:
Step 1.1, pre-emphasis, specifically: pre-emphasis is implemented with a digital filter that boosts the high-frequency components by 6 dB/octave; its transfer function is expression 1):
H(z) = 1 - μ·z^(-1) 1);
wherein μ is the pre-emphasis coefficient, with a value of 0.9-1.0 (0.95 is taken here);
Step 1.2, endpoint detection, specifically: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate; let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
E = Σ_{m=0}^{N-1} x(m)² 2);
its short-time zero-crossing rate Z is calculated by expression 3):
Z = (1/2) · Σ_{m=1}^{N-1} |sgn[x(m)] - sgn[x(m-1)]| 3);
wherein sgn[·] is the sign function, i.e.:
sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0;
Step 1.3, framing and windowing, specifically: in order to apply the analysis methods of stationary processes, the speech is divided into successive frames, each 10-30 ms long; meanwhile, to reduce the truncation effect on the speech frames, a Hamming window is applied to each frame signal, specifically:
let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is then given by expression 4):
y(n) = x(n) · w(n), 0 ≤ n ≤ N-1 4); wherein N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1. A sketch of this pre-processing chain is given below.
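For illustration, a minimal numpy sketch of the pre-processing chain described above (pre-emphasis per expression 1), short-time energy and zero-crossing rate per expressions 2) and 3), and Hamming-windowed framing per expression 4)) might look as follows; μ = 0.95 follows the value given above, while the function names and the hop size are illustrative assumptions rather than anything taken from the patent:

```python
import numpy as np

def preemphasis(x, mu=0.95):
    # expression 1): H(z) = 1 - mu * z^-1
    return np.append(x[0], x[1:] - mu * x[:-1])

def short_time_energy(frame):
    # expression 2): E = sum of x(m)^2 over the frame
    return np.sum(frame.astype(float) ** 2)

def zero_crossing_rate(frame):
    # expression 3): Z = 0.5 * sum(|sgn[x(m)] - sgn[x(m-1)]|)
    s = np.where(frame >= 0, 1, -1)          # sgn[.] as defined above
    return 0.5 * np.sum(np.abs(np.diff(s)))

def frame_and_window(x, frame_len, hop):
    # expression 4): y(n) = x(n) * w(n) with a Hamming window
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)    # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
```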
The Mel-frequency filtering is specifically as follows: the discrete spectrum obtained by the FFT is filtered with a bank of triangular filters, yielding a set of coefficients m_1, m_2, …; the number p of filters in the bank is determined by the cutoff frequency of the signal, and together the filters cover the range from 0 Hz to the Nyquist frequency, i.e. half the sampling rate; each m_i is calculated by expression 5):
m_i = Σ_k H_i(k) · |X(k)| 5);
wherein H_i(k) is the triangular weighting of the i-th filter and f[i] is the centre frequency of the i-th triangular filter, which satisfies:
Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]).
Since the Mel spectral coefficients are all real numbers, they can be transformed to the time domain by a discrete cosine transform. The discrete cosine transform is specifically as follows: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, yielding the Mel-frequency cepstral coefficients, calculated by expression 6):
MFCC(i) = sqrt(2/P) · Σ_{j=1}^{P} ln(m_j) · cos[π·i·(j - 1/2)/P] 6);
wherein MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
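The filterbank-and-DCT stage can be sketched in the same spirit; the triangular-filter construction and the orthonormal DCT used here are standard textbook MFCC choices and only an assumed reading of expressions 5) and 6), not code from the patent:

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(p, nfft, fs):
    # p triangular filters with centre frequencies equally spaced on the Mel
    # scale, covering 0 Hz to fs/2 as required above
    pts = imel(np.linspace(0.0, mel(fs / 2.0), p + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fb = np.zeros((p, nfft // 2 + 1))
    for i in range(1, p + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return fb

def mfcc(frame, fb, n_ceps=13):
    X = np.abs(np.fft.rfft(frame, n=2 * (fb.shape[1] - 1)))  # discrete spectrum X(k)
    m = np.log(fb @ X + 1e-10)     # expression 5): triangular-filter outputs, logged
    return dct(m, type=2, norm='ortho')[:n_ceps]  # expression 6): DCT to the cepstrum
```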
The detailed procedure for obtaining the reference model that conforms to the auditory characteristics of the human ear is as follows:
The pronunciation model is built and trained on the basis of a GMM-HMM. Let the observed feature vector sequence be O = o1, o2, …, oT and the corresponding state sequence be S = s1, s2, …, sN; the HMM (hidden Markov model) of the sequence is then expressed as expression 7):
λ = (π, A, B) 7);
wherein π = {π_i = P(s1 = i), i = 1, 2, …, N} is the initial state probability vector; A = {a_ij} is the matrix of transition probabilities between states, with a_ij the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
for a continuous HMM, the observation sequence is a continuous signal, and the signal space associated with state j is represented by a sum of M Gaussian mixture density functions, as in expressions 8) and 9):
b_j(o_t) = Σ_{k=1}^{M} c_jk · N(o_t; μ_jk, C_jk) 8);
N(o_t; μ_jk, C_jk) = (2π)^(-D/2) · |C_jk|^(-1/2) · exp[-(o_t - μ_jk)^T · C_jk^(-1) · (o_t - μ_jk) / 2] 9);
wherein c_jk is the weight of the k-th Gaussian mixture density function of state j; μ_jk is the mean vector of the Gaussian density function; C_jk is the covariance matrix; and D is the dimension of the observation sequence O. The HMM parameters are estimated from the observation sequence O = o1, o2, …, oT; the goal of the estimation is to maximize the likelihood function P(O|λ) of the model and the training data, i.e. to find λ* = argmax_λ P(O|λ). This can be realized with the EM (expectation-maximization) algorithm, which comprises two parts: the forward and backward probability calculations, and the re-estimation of the HMM parameters and the Gaussian mixture parameters, as follows:
The forward probability of the likelihood function p(O|λ) is calculated as expression 10):
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) · a_ij] · b_j(o_{t+1}), 1 ≤ t ≤ T-1 10);
wherein α_1(i) = π_i · b_i(o1), 1 ≤ i ≤ N;
the backward probability of the likelihood function p(O|λ) is calculated as expression 11):
β_t(i) = Σ_{j=1}^{N} a_ij · b_j(o_{t+1}) · β_{t+1}(j), t = T-1, T-2, …, 1 11);
wherein β_T(i) = 1, 1 ≤ i ≤ N;
For the given observation sequence O = o1, o2, …, oT, the updated λ is obtained by re-estimation. Define ξ_t(i, j) as the probability that the state is s_i at time t and s_j at time t+1, obtained by expression 12):
ξ_t(i, j) = α_t(i) · a_ij · b_j(o_{t+1}) · β_{t+1}(j) / P(O|λ) 12);
given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
γ_t(i) = α_t(i) · β_t(i) / Σ_{j=1}^{N} α_t(j) · β_t(j) 13);
the re-estimation of the HMM parameters λ is then as follows: the parameters c_jk, μ_jk and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15) and 16):
c_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k) 14);
μ_jk = Σ_{t=1}^{T} γ_t(j, k) · o_t / Σ_{t=1}^{T} γ_t(j, k) 15);
C_jk = Σ_{t=1}^{T} γ_t(j, k) · (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k) 16);
wherein γ_t(j, k) denotes the probability of occupying the k-th Gaussian mixture component of state j at time t, obtained by:
γ_t(j, k) = γ_t(j) · c_jk · N(o_t; μ_jk, C_jk) / Σ_{m=1}^{M} c_jm · N(o_t; μ_jm, C_jm).
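Training a GMM-HMM reference model by the Baum-Welch re-estimation of expressions 10) to 16) is exactly what off-the-shelf libraries implement; a hedged sketch using hmmlearn's GMMHMM follows, where the state count, mixture count and iteration count are illustrative assumptions (the patent does not state them):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_reference_model(reference_mfccs):
    # reference_mfccs: list of (n_frames_i, n_ceps) MFCC arrays from reference speech
    X = np.vstack(reference_mfccs)
    lengths = [len(u) for u in reference_mfccs]
    model = GMMHMM(n_components=5, n_mix=4,       # assumed sizes, not from the patent
                   covariance_type="diag", n_iter=20)
    model.fit(X, lengths)   # EM: forward-backward (10, 11) plus re-estimation (14-16)
    return model
```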
The consistency measure is computed specifically as follows: after the modeling, the consistency measure between the Mel-frequency cepstral coefficients of the distorted speech and the reference model is calculated using expression 17):
C = (1/N) · log P(X_1, …, X_N | λ) 17);
wherein X_1, …, X_N are the Mel-frequency cepstral coefficient (MFCC) vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
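Under the reading of expression 17) as a length-normalized log-likelihood (an assumption stated above), the consistency measure reduces to a one-liner on the trained model; hmmlearn's score returns log P(X | λ):

```python
def consistency_measure(model, distorted_mfcc):
    # assumed reading of expression 17): C = (1/N) * log P(X_1..X_N | lambda)
    return model.score(distorted_mfcc) / len(distorted_mfcc)
```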
The calculation of the bit error rate proceeds as follows:
Step A: generate a PN sequence and multiply it by a chaotic sequence; the chaotic sequence is produced by the logistic map, which is defined as follows:
x_{k+1} = μ · x_k · (1 - x_k)
wherein 0 ≤ μ ≤ 4 is called the bifurcation parameter and x_k ∈ (0, 1); when 3.5699456… < μ ≤ 4 the logistic map operates in the chaotic regime, i.e. the sequence {x_k; k = 0, 1, 2, 3, …} generated from an initial condition under the logistic map is aperiodic, non-convergent and highly sensitive to the initial value; the monitoring sequence is generated as follows:
Step a1: first generate the real-valued sequence, and select from it a segment, starting at some position, whose length equals the size of the monitoring sequence;
Step a2: convert the real-valued sequence into a binary sequence by defining a threshold Γ and applying it to the real-valued sequence:
Γ(x_k) = 0 if x_k ≤ Γ, and Γ(x_k) = 1 if x_k > Γ;
the binary chaotic sequence is {Γ(x_k); k = 0, 1, 2, 3, …};
Step a3: multiply the binary chaotic sequence by a PN sequence (pseudo-noise sequence) to obtain the monitoring sequence, as in the sketch below;
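A minimal sketch of steps A and a1-a3 (logistic-map sequence, thresholding at Γ, combination with a PN sequence) might look as follows; the seed value, Γ = 0.5 and the use of a random 0/1 stand-in for the PN sequence are assumptions, and the product of ±1-valued sequences is expressed equivalently as the XOR of 0/1 bits:

```python
import numpy as np

def logistic_sequence(n, x0=0.6, mu=3.9, skip=100):
    # x_{k+1} = mu * x_k * (1 - x_k), run in the chaotic regime (mu > 3.5699456...)
    x, out = x0, []
    for k in range(n + skip):
        x = mu * x * (1.0 - x)
        if k >= skip:
            out.append(x)       # step a1: take a segment starting at some position
    return np.array(out)

def monitoring_sequence(length, gamma=0.5, seed=1):
    chaos_bits = (logistic_sequence(length) > gamma).astype(int)  # step a2: threshold
    pn = np.random.default_rng(seed).integers(0, 2, length)      # stand-in PN sequence
    return np.bitwise_xor(chaos_bits, pn)   # step a3: +-1 product == XOR of 0/1 bits
```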
Step B: insert a synchronization code for the monitoring sequence, specifically: the synchronization code is inserted so that, after the audio has been impaired by the channel, the receiving end does not fail to locate and extract the monitoring sequence. A 16-bit synchronization code is used; to locate it accurately, it is embedded in the time domain of the speech signal by setting the amplitudes of the 16 samples preceding the monitoring sequence to 1000. In this way, if the starting point is out of synchronization while the receiving end extracts the monitoring sequence, the starting sample position of the watermark can be found quickly by searching for 16 consecutive sample values in the range 900-1100, and the embedded monitoring sequence can then be extracted frame by frame, as in the sketch below;
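Only the constants 16, 1000 and 900-1100 in this sketch come from the text above; the function names and the search strategy details are illustrative:

```python
import numpy as np

SYNC_LEN, SYNC_AMP = 16, 1000

def insert_sync(speech, start):
    marked = np.asarray(speech, dtype=float).copy()
    marked[start:start + SYNC_LEN] = SYNC_AMP    # 16 samples forced to amplitude 1000
    return marked

def find_sync(speech, search_hop):
    i = 0
    while i + SYNC_LEN <= len(speech):
        seg = speech[i:i + SYNC_LEN]
        if np.all((seg > 900) & (seg < 1100)):   # candidate synchronization code
            return i
        i += search_hop                          # otherwise I = I + L, as in step d1
    return -1
```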
Step C: embed the monitoring sequence carrying the synchronization code into the speech signal in the wavelet domain. The wavelet domain is chosen for embedding because the monitoring sequence is concealed better in a transform domain and does not audibly affect the original speech. The detailed procedure for embedding the sequence into the speech in the wavelet domain is as follows:
Step c1: since analyzing the same problem with different wavelet bases can produce different results, a suitable wavelet basis must be selected for the problem at hand; here the Daubechies-10 wavelet is chosen as the wavelet function;
Step c2: divide the speech signal into frames of 1152 samples each, and apply a 3-level wavelet transform to every frame; in view of the auditory properties of the human ear, the sequence is embedded in the high-frequency band;
Step c3: quantize the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the monitoring sequence in the speech signal; let f be the coefficient to be quantized, w the bit of the monitoring sequence to be embedded, Δ the quantization step, and f′ the quantized coefficient carrying the monitoring-sequence information; the specific steps are: first take the magnitude of f and apply the floor operation; when f > 0, let m = ⌊f/Δ⌋ and n = m mod 2; then:
f′ = (m + 1/2)·Δ if n = w, and f′ = (m + 3/2)·Δ otherwise;
when f < 0, let m = ⌊|f|/Δ⌋ and n = m mod 2; then:
f′ = -(m + 1/2)·Δ if n = w, and f′ = -(m + 3/2)·Δ otherwise;
the monitoring sequence can thus be embedded into the speech signal bit by bit according to the above formulas;
Step c4: transform the signal carrying the embedded monitoring sequence back into the time domain; a sketch of the embedding follows.
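A hedged sketch of steps c1-c4 using pywt follows; the parity-based quantization (the parity of the quantization index of f′ equals the embedded bit w) is a reconstruction consistent with n = m mod 2 above, and Δ = 0.1 is an assumed step size:

```python
import numpy as np
import pywt

FRAME_LEN, DELTA = 1152, 0.1   # frame size from step c2; DELTA is an assumed step

def qim_embed(f, w, delta=DELTA):
    # step c3 reconstruction: force the parity of the quantization index to w
    m = int(np.floor(abs(f) / delta))
    if m % 2 != w:
        m += 1                  # move to the adjacent cell so the parity encodes w
    fq = (m + 0.5) * delta      # centre of the chosen cell
    return fq if f >= 0 else -fq

def embed_frame(frame, bits):
    coeffs = pywt.wavedec(frame, "db10", level=3)   # steps c1/c2: db10, 3 levels
    band = coeffs[-1]                               # high-frequency detail band
    for i, w in enumerate(bits[:len(band)]):
        band[i] = qim_embed(band[i], int(w))
    coeffs[-1] = band
    return pywt.waverec(coeffs, "db10")[:len(frame)]  # step c4: back to time domain
```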
Step D: extract the embedded monitoring sequence from the received speech and calculate the bit error rate. The details are: extraction of the monitoring sequence is the inverse of embedding, so the wavelet function and the number of wavelet-decomposition levels remain unchanged; the extraction procedure comprises the following steps:
Step d1: search for the synchronization code in the speech signal, specifically: let L be the length of the signal to be searched; L should exceed the combined length of two synchronization codes plus one complete monitoring sequence; let the initial search point of the signal be I = 1; if the 16 consecutive sample values of the signal starting at point I all lie in the range 900-1100, a possible synchronization code is considered to have been found and is compared with the preset synchronization code; if it is confirmed as the synchronization code, point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2: starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3: apply to the coefficients f of the wavelet decomposition the operation inverse to that used during embedding, i.e.:
when f > 0, let m = ⌊f/Δ⌋ and w = m mod 2;
when f < 0, let m = ⌊|f|/Δ⌋ and w = m mod 2;
the binary monitoring sequence is thereby extracted;
Step d4: compare the extracted monitoring sequence with the inserted monitoring sequence, and calculate the bit error rate by expression 18) (the bit error rate serves as one of the objective measures for speech quality assessment):
BER = HammingWeight(Seq_send XOR Seq_receive) / Seq_length 18);
wherein Seq_send, Seq_receive and Seq_length denote the transmitted monitoring sequence, the received monitoring sequence and the sequence length respectively; HammingWeight(·) denotes the Hamming weight of a sequence, and XOR denotes the exclusive-or operation. A sketch of the extraction and the bit-error-rate computation follows.
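The extraction and the bit-error-rate computation of expression 18) can be sketched as the inverse operations, reusing the same wavelet ("db10") and decomposition depth and the assumed Δ from the embedding sketch above:

```python
import numpy as np
import pywt

DELTA = 0.1   # must match the (assumed) embedding step size

def extract_frame(received_frame, n_bits, delta=DELTA):
    coeffs = pywt.wavedec(received_frame, "db10", level=3)  # same wavelet and depth
    band = coeffs[-1]
    # step d3: w = floor(|f| / delta) mod 2, the inverse of the embedding rule
    return np.array([int(np.floor(abs(f) / delta)) % 2 for f in band[:n_bits]])

def bit_error_rate(seq_send, seq_receive):
    # expression 18): BER = HammingWeight(send XOR receive) / length
    diff = np.bitwise_xor(np.asarray(seq_send, int), np.asarray(seq_receive, int))
    return np.count_nonzero(diff) / len(diff)
```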
After the parameter consistency measures of the speech have been computed under the various distortion conditions, a functional mapping can be used to express the relationship between the parameter consistency measures and the objective MOS estimate; that is, the mapping relationship is obtained by expression 19):
MOS_objective = f(c_1, …, c_N) 19);
wherein f(·) is the prediction function (it may be a linear or nonlinear regression relation or a polynomial fit; in this embodiment a multivariate nonlinear regression model is preferred in order to predict the MOS value more accurately); c_i is the consistency measure of the i-th parameter; N is the number of speech characteristic parameters; and MOS_objective is the objective MOS score predicted by f(·) from c_1, …, c_N. The larger the bit error rate, the stronger the interference in the channel and the greater the corresponding speech impairment introduced during transmission, so the MOS_objective value is smaller and the speech quality is poorer.
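Since the patent only states that f(·) is a multivariate nonlinear regression model, the sketch below fits one plausible nonlinear form with scipy's curve_fit; the logistic shape and the choice of inputs (consistency measure and bit error rate) are assumptions for illustration, not the patent's fitted model:

```python
import numpy as np
from scipy.optimize import curve_fit

def mos_map(c, a, b, g, d):
    # assumed nonlinear form: logistic in the consistency measure plus a BER term
    return a / (1.0 + np.exp(-b * (c[0] - g))) + d * c[1]

# c_train: array of shape (2, n), rows = [consistency measure, bit error rate]
# mos_train: the known subjective MOS scores for the same n utterances
# params, _ = curve_fit(mos_map, c_train, mos_train, p0=[4.0, 1.0, 0.0, -1.0])
# mos_objective = mos_map(c_test, *params)
```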
The performance of the speech quality assessment algorithm is measured below in terms of correlation and prediction bias. The correlation mainly reflects whether the distortion-to-MOS mapping obtained by the algorithm is reasonable; the correlation and the bias between the MOS scores mapped by the algorithm and the known subjective MOS values are generally used as the evaluation criteria.
The correlation coefficient ρ and the standard estimation bias σ are obtained by expressions 20) and 21):
ρ = Σ_i (MOS_o(i) - mean(MOS_o)) · (MOS_s(i) - mean(MOS_s)) / sqrt( Σ_i (MOS_o(i) - mean(MOS_o))² · Σ_i (MOS_s(i) - mean(MOS_s))² ) 20);
σ = sqrt( Σ_i (MOS_o(i) - MOS_s(i))² / N ) 21);
wherein MOS_o(i) is the predicted MOS value of the i-th speech sample, MOS_s(i) is its known subjective MOS score, N is the total number of speech pairs, mean(MOS_o) denotes the mean of the predicted MOS values and mean(MOS_s) the mean of the subjective MOS scores.
The closer the correlation coefficient ρ is to 1, the closer the predicted MOS values are to the true MOS values; the smaller the bias σ, the smaller the prediction error and the better the performance of the algorithm.
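Under the reconstruction of expressions 20) and 21) above (Pearson correlation and RMS bias), the evaluation metrics reduce to a few lines:

```python
import numpy as np

def evaluate(mos_pred, mos_true):
    mos_pred = np.asarray(mos_pred, float)
    mos_true = np.asarray(mos_true, float)
    rho = np.corrcoef(mos_pred, mos_true)[0, 1]           # expression 20)
    sigma = np.sqrt(np.mean((mos_pred - mos_true) ** 2))  # expression 21)
    return rho, sigma
```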
The performance of the assessment method of this Embodiment 1 was compared with that of the objective assessment method ITU-T P.563 of the International Telecommunication Union; the results are detailed in Table 1.
As can be seen from Table 1, the method of the present invention (Embodiment 1) achieves a certain improvement in performance over the ITU-T P.563 algorithm: the average correlation ρ with the subjective MOS is higher and the estimation bias σ is lower. The method of the present invention is therefore effective and feasible.
Table 1: performance comparison between the method of the present invention (Embodiment 1) and ITU-T P.563 for each speech condition processed
The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (9)

1. An output-based objective speech quality assessment method, characterized by comprising the following steps:
calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system; obtaining a reference model that conforms to the auditory characteristics of the human ear;
performing a consistency measure calculation between the Mel-frequency cepstral coefficients of the distorted speech and the reference model that conforms to the auditory characteristics of the human ear; inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system;
establishing, from the consistency measure and the bit error rate, a mapping between the subjective MOS and the consistency measure, thereby obtaining an objective prediction model for the MOS score of the speech to be assessed, with which the objective assessment of speech quality is carried out;
wherein the detailed procedure for obtaining the reference model that conforms to the auditory characteristics of the human ear is as follows:
let the observed feature vector sequence be O = o1, o2, …, oT and the corresponding state sequence be S = s1, s2, …, sN; the HMM of the sequence is then expressed as expression 7):
λ = (π, A, B) 7);
wherein π = {π_i = P(s1 = i), i = 1, 2, …, N} is the initial state probability vector; A = {a_ij} is the matrix of transition probabilities between states, with a_ij the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
for a continuous HMM, the observation sequence is a continuous signal, and the signal space associated with state j is represented by a sum of M Gaussian mixture density functions, as in expressions 8) and 9):
b_j(o_t) = Σ_{k=1}^{M} c_jk · N(o_t; μ_jk, C_jk) 8);
N(o_t; μ_jk, C_jk) = (2π)^(-D/2) · |C_jk|^(-1/2) · exp[-(o_t - μ_jk)^T · C_jk^(-1) · (o_t - μ_jk) / 2] 9);
wherein c_jk is the weight of the k-th Gaussian mixture density function of state j; μ_jk is the mean vector of the Gaussian density function; C_jk is the covariance matrix; and D is the dimension of the observation sequence O; the HMM parameters are estimated from the observation sequence O = o1, o2, …, oT, the goal of the estimation being to maximize the likelihood function P(O|λ) of the model and the training data, i.e. to find λ* = argmax_λ P(O|λ);
the forward probability of the likelihood function p(O|λ) is calculated as expression 10):
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) · a_ij] · b_j(o_{t+1}), 1 ≤ t ≤ T-1 10);
wherein α_1(i) = π_i · b_i(o1), 1 ≤ i ≤ N;
the backward probability of the likelihood function p(O|λ) is calculated as expression 11):
β_t(i) = Σ_{j=1}^{N} a_ij · b_j(o_{t+1}) · β_{t+1}(j), t = T-1, T-2, …, 1 11);
wherein β_T(i) = 1, 1 ≤ i ≤ N;
for the given observation sequence O = o1, o2, …, oT, the updated λ is obtained by re-estimation; define ξ_t(i, j) as the probability that the state is s_i at time t and s_j at time t+1, obtained by expression 12):
ξ_t(i, j) = α_t(i) · a_ij · b_j(o_{t+1}) · β_{t+1}(j) / P(O|λ) 12);
given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
γ_t(i) = α_t(i) · β_t(i) / Σ_{j=1}^{N} α_t(j) · β_t(j) 13);
the re-estimation of the HMM parameters λ is then as follows: the parameters c_jk, μ_jk and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15) and 16):
c_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k) 14);
μ_jk = Σ_{t=1}^{T} γ_t(j, k) · o_t / Σ_{t=1}^{T} γ_t(j, k) 15);
C_jk = Σ_{t=1}^{T} γ_t(j, k) · (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k) 16);
wherein γ_t(j, k) denotes the probability of occupying the k-th Gaussian mixture component of state j at time t, obtained by:
γ_t(j, k) = γ_t(j) · c_jk · N(o_t; μ_jk, C_jk) / Σ_{m=1}^{M} c_jm · N(o_t; μ_jm, C_jm).
2. The output-based objective speech quality assessment method according to claim 1, characterized in that: the calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT, Mel-frequency filtering and discrete cosine transform.
3. The output-based objective speech quality assessment method according to claim 2, characterized in that:
the pre-processing specifically comprises the following steps:
Step 1.1, pre-emphasis, specifically: pre-emphasis is implemented with a digital filter that boosts the high-frequency components by 6 dB/octave; its transfer function is expression 1):
H(z) = 1 - μ·z^(-1) 1);
wherein μ is the pre-emphasis coefficient, with a value of 0.9-1.0;
Step 1.2, endpoint detection, specifically: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate; let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
E = Σ_{m=0}^{N-1} x(m)² 2);
its short-time zero-crossing rate Z is calculated by expression 3):
Z = (1/2) · Σ_{m=1}^{N-1} |sgn[x(m)] - sgn[x(m-1)]| 3);
wherein sgn[·] is the sign function, i.e.: sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0;
Step 1.3, framing and windowing, specifically: framing divides the speech into successive frames, each 10-30 ms long; windowing applies a Hamming window to each frame signal.
4. The output-based objective speech quality assessment method according to claim 3, characterized in that: the detailed windowing procedure is: let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is then given by expression 4):
y(n) = x(n) · w(n), 0 ≤ n ≤ N-1 4); wherein N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1.
5. The output-based objective speech quality assessment method according to claim 2, characterized in that: the Mel-frequency filtering is specifically: the discrete spectrum obtained by the FFT is filtered with a bank of triangular filters, yielding a set of coefficients m_1, m_2, …; the number p of filters in the bank is determined by the cutoff frequency of the signal, and together the filters cover the range from 0 Hz to the Nyquist frequency, i.e. half the sampling rate; each m_i is calculated by expression 5):
m_i = Σ_k H_i(k) · |X(k)| 5);
wherein H_i(k) is the triangular weighting of the i-th filter; f[i] is the centre frequency of the i-th triangular filter, which satisfies: Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]); and X(k) is the discrete spectrum of the frame signal x(n) after the FFT.
6. The output-based objective speech quality assessment method according to claim 2, characterized in that: the discrete cosine transform is specifically: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, yielding the Mel-frequency cepstral coefficients, calculated by expression 6):
MFCC(i) = sqrt(2/P) · Σ_{j=1}^{P} ln(m_j) · cos[π·i·(j - 1/2)/P] 6);
wherein MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
7. The output-based objective speech quality assessment method according to claim 1, characterized in that: the consistency measure is computed specifically using expression 17):
C = (1/N) · log P(X_1, …, X_N | λ) 17);
wherein X_1, …, X_N are the Mel-frequency cepstral coefficient vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
8. The output-based objective speech quality assessment method according to claim 1, characterized in that: the calculation of the bit error rate proceeds as follows:
Step A: generate a PN sequence and multiply it by a chaotic sequence; the chaotic sequence is produced by the logistic map, which is defined as follows:
x_{k+1} = μ · x_k · (1 - x_k)
wherein 0 ≤ μ ≤ 4 is called the bifurcation parameter and x_k ∈ (0, 1); when 3.5699456… < μ ≤ 4 the logistic map operates in the chaotic regime, i.e. the sequence {x_k; k = 0, 1, 2, 3, …} generated from an initial condition under the logistic map is aperiodic, non-convergent and highly sensitive to the initial value; the monitoring sequence is generated as follows:
Step a1: first generate the real-valued sequence, and select from it a segment, starting at some position, whose length equals the size of the monitoring sequence;
Step a2: convert the real-valued sequence into a binary sequence by defining a threshold Γ and applying it to the real-valued sequence:
Γ(x_k) = 0 if x_k ≤ Γ, and Γ(x_k) = 1 if x_k > Γ;
the binary chaotic sequence is {Γ(x_k); k = 0, 1, 2, 3, …};
Step a3: multiply the binary chaotic sequence by a PN sequence to obtain the monitoring sequence;
Step B: insert a synchronization code in front of the monitoring sequence, so that the embedded monitoring sequence can later be extracted frame by frame;
Step C: embed the monitoring sequence carrying the synchronization code into the speech signal in the wavelet domain; the detailed procedure is as follows:
Step c1: choose the Daubechies-10 wavelet as the wavelet function;
Step c2: divide the speech signal into frames of 1152 samples each, and apply a 3-level wavelet transform to every frame;
Step c3: quantize the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the monitoring sequence in the speech signal; let f be the coefficient to be quantized, w the bit of the monitoring sequence to be embedded, Δ the quantization step, and f′ the quantized coefficient carrying the monitoring-sequence information; the specific steps are:
take the magnitude of f and apply the floor operation; when f > 0, let m = ⌊f/Δ⌋ and n = m mod 2; then:
f′ = (m + 1/2)·Δ if n = w, and f′ = (m + 3/2)·Δ otherwise;
when f < 0, let m = ⌊|f|/Δ⌋ and n = m mod 2; then:
f′ = -(m + 1/2)·Δ if n = w, and f′ = -(m + 3/2)·Δ otherwise;
the monitoring sequence is embedded into the speech signal bit by bit according to the above formulas;
Step c4: transform the signal carrying the embedded monitoring sequence back into the time domain;
Step D: extract the embedded monitoring sequence from the received speech and calculate the bit error rate; the extraction procedure comprises the following steps:
Step d1: search for the synchronization code in the speech signal, specifically: let L be the length of the signal to be searched; L should exceed the combined length of two synchronization codes plus one complete monitoring sequence; let the initial search point of the signal be I = 1; if the 16 consecutive sample values of the signal starting at point I all lie in the range 900-1100, a possible synchronization code is considered to have been found and is compared with the preset synchronization code; if it is confirmed as the synchronization code, point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2: starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3: apply to the coefficients f of the wavelet decomposition the operation inverse to that used during embedding, i.e.: when f > 0, let m = ⌊f/Δ⌋ and w = m mod 2; when f < 0, let m = ⌊|f|/Δ⌋ and w = m mod 2;
the binary monitoring sequence is thereby extracted;
Step d4: compare the extracted monitoring sequence with the inserted monitoring sequence, and calculate the bit error rate by expression 18):
BER = HammingWeight(Seq_send XOR Seq_receive) / Seq_length 18);
wherein Seq_send, Seq_receive and Seq_length denote the transmitted monitoring sequence, the received monitoring sequence and the sequence length respectively; HammingWeight(·) denotes the Hamming weight of a sequence, and XOR denotes the exclusive-or operation.
9. The output-based objective speech quality assessment method according to claim 1, characterized in that: the mapping relationship is obtained by expression 19):
MOS_objective = f(c_1, …, c_N) 19);
wherein f(·) is a multivariate nonlinear regression model; c_i is the consistency measure of the i-th parameter; N is the number of speech characteristic parameters; and MOS_objective is the objective MOS score predicted by f(·) from c_1, …, c_N.
CN201710475912.8A 2017-06-21 2017-06-21 Output-based objective speech quality assessment method Active CN107293306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710475912.8A CN107293306B (en) 2017-06-21 2017-06-21 Output-based objective speech quality assessment method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710475912.8A CN107293306B (en) 2017-06-21 2017-06-21 Output-based objective speech quality assessment method

Publications (2)

Publication Number Publication Date
CN107293306A CN107293306A (en) 2017-10-24
CN107293306B true CN107293306B (en) 2018-06-15

Family

ID=60096759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710475912.8A Active CN107293306B (en) 2017-06-21 2017-06-21 Output-based objective speech quality assessment method

Country Status (1)

Country Link
CN (1) CN107293306B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818797B (en) * 2017-12-07 2021-07-06 苏州科达科技股份有限公司 Voice quality evaluation method, device and system
CN108364661B (en) * 2017-12-15 2020-11-24 海尔优家智能科技(北京)有限公司 Visual voice performance evaluation method and device, computer equipment and storage medium
CN110289014B (en) * 2019-05-21 2021-11-19 华为技术有限公司 Voice quality detection method and electronic equipment
CN110211566A * 2019-06-08 2019-09-06 安徽中医药大学 A compressed-sensing-based classification method for dysfluent speech in hepatolenticular degeneration
CN111091816B (en) * 2020-03-19 2020-08-04 北京五岳鑫信息技术股份有限公司 Data processing system and method based on voice evaluation
CN111968677B (en) * 2020-08-21 2021-09-07 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1713273A * 2005-07-21 2005-12-28 复旦大学 Locally robust digital audio watermarking algorithm resistant to time-scale modification
CN101847409A (en) * 2010-03-25 2010-09-29 北京邮电大学 Voice integrity protection method based on digital fingerprint
CN102044248A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluating method for audio quality of streaming media
CN102881289A (en) * 2012-09-11 2013-01-16 重庆大学 Hearing perception characteristic-based objective voice quality evaluation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7327985B2 (en) * 2003-01-21 2008-02-05 Telefonaktiebolaget Lm Ericsson (Publ) Mapping objective voice quality metrics to a MOS domain for field measurements

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1713273A * 2005-07-21 2005-12-28 复旦大学 Locally robust digital audio watermarking algorithm resistant to time-scale modification
CN102044248A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluating method for audio quality of streaming media
CN101847409A (en) * 2010-03-25 2010-09-29 北京邮电大学 Voice integrity protection method based on digital fingerprint
CN102881289A (en) * 2012-09-11 2013-01-16 重庆大学 Hearing perception characteristic-based objective voice quality evaluation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Speech Quality Evaluation Method Based on Auditory Characteristic; Qingxian Li et al.; Proceedings of the 2016 International Conference on Intelligent Control and Computer Application (ICCA 2016); 2016-01-31; pp. 320-323 *

Also Published As

Publication number Publication date
CN107293306A (en) 2017-10-24

Similar Documents

Publication Publication Date Title
CN107293306B (en) Output-based objective speech quality assessment method
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Hossan et al. A novel approach for MFCC feature extraction
Tiwari MFCC and its applications in speaker recognition
EP2352145B1 (en) Transient speech signal encoding method and device, decoding method and device, processing system and computer-readable storage medium
CN1121681C (en) Speech processing
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
CN111696580B (en) Voice detection method and device, electronic equipment and storage medium
Karbasi et al. Twin-HMM-based non-intrusive speech intelligibility prediction
CN101577116A (en) Extracting method of MFCC coefficients of voice signal, device and Mel filtering method
Aparna et al. Role of windowing techniques in speech signal processing for enhanced signal cryptography
Maganti et al. Auditory processing-based features for improving speech recognition in adverse acoustic conditions
Lim et al. Classification of underwater transient signals using mfcc feature vector
CN105741853A (en) Digital speech perception hash method based on formant frequency
Gandhiraj et al. Auditory-based wavelet packet filterbank for speech recognition using neural network
Varela et al. Combining pulse-based features for rejecting far-field speech in a HMM-based voice activity detector
Makhijani et al. Speech enhancement using pitch detection approach for noisy environment
Jawarkar et al. Effect of nonlinear compression function on the performance of the speaker identification system under noisy conditions
KR102427874B1 (en) Method and Apparatus for Artificial Band Conversion Based on Learning Model
Tomchuk Spectral masking in MFCC calculation for noisy speech
Rahdari et al. A two-level multi-gene genetic programming model for speech quality prediction in Voice over Internet Protocol systems
KR100474969B1 (en) Vector quantization method of line spectral coefficients for coding voice singals and method for calculating masking critical valule therefor
CN110689875A (en) Language identification method and device and readable storage medium
Maurya et al. Speaker recognition for noisy speech in telephonic channel
Lei et al. Speaker Recognition on Mobile Phone: Using Wavelet, Cepstral Coefficients and Probabilisitc Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant