CN107293306B - Output-based objective speech quality assessment method - Google Patents

Output-based objective speech quality assessment method

Info

Publication number
CN107293306B
CN107293306B CN201710475912.8A
Authority
CN
China
Prior art keywords
sequence
monitoring data
signal
mel
expression formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710475912.8A
Other languages
Chinese (zh)
Other versions
CN107293306A (en)
Inventor
李庆先
刘良江
王晋威
朱宪宇
熊婕
李彦博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Original Assignee
HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE filed Critical HUNAN MEASUREMENT INSPECTION RESEARCH INSTITUTE
Priority to CN201710475912.8A priority Critical patent/CN107293306B/en
Publication of CN107293306A publication Critical patent/CN107293306A/en
Application granted granted Critical
Publication of CN107293306B publication Critical patent/CN107293306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides an output-based objective speech quality assessment method, comprising the following steps: calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system; obtaining a reference model that conforms to the auditory characteristics of the human ear; performing a consistency measure calculation between the Mel-frequency cepstral coefficients of the distorted speech and the reference model; inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system; and establishing, from the consistency measure and the bit error rate, a mapping between the subjective MOS and the consistency measure, thereby obtaining an objective prediction model for the MOS score of the speech to be assessed, with which the objective assessment of speech quality is carried out. The method of the present invention has simple steps, is easy to use, can effectively and objectively assess speech quality, and does not depend on subjective assessment.

Description

Output-based objective speech quality assessment method
Technical field
The present invention relates to the field of speech processing technology, and in particular to an output-based objective speech quality assessment method.
Background technology
Objective speech quality assessment refers to the automatic judgment of speech quality by a machine. According to whether the input (reference) speech is required, such methods fall into two classes: objective assessment based on the input-output mode and objective assessment based on the output mode.
In many fields, such as wireless mobile communication, aerospace navigation and modern military applications, assessment methods are often required to offer high flexibility, real-time operation and broad applicability, and to assess speech quality even when the original input speech signal cannot be obtained. Objective assessment based on the input-output mode often cannot obtain the corresponding original speech and incurs higher costs for speech storage and the like, so it has clear drawbacks in these application scenarios.
The general procedure of an output-based objective speech quality assessment method is to compute certain characteristic parameters of the speech under evaluation, perform a consistency calculation against the characteristic parameters of reference speech as summarized by a trained model, and finally map the result to an estimate of the subjective MOS score. In this process, the choice of characteristic parameters, training model and MOS mapping method is crucial, as it determines the performance of the assessment system. Since the human ear's perception of sound follows the Bark critical bands, a conversion between linear frequency and a warped frequency scale must be performed during feature extraction. Meanwhile, in applications such as wireless communication, besides analyzing the speech itself, the influence of external factors such as channel quality on speech quality must also be considered.
Therefore, designing an assessment method capable of objectively evaluating the quality of speech after coding or channel transmission is of great significance.
Summary of the invention
The object of the present invention is to provide an output-based objective speech quality assessment method. In view of the auditory characteristics of the human ear with respect to frequency, and taking the cepstral analysis of the speech signal into account, Mel-frequency cepstral coefficients (MFCC) are used to describe the speech features. An objective speech distortion value is obtained by combining the Mel-frequency cepstral coefficients with a trained GMM-HMM model; at the same time, to capture channel effects, a bit-error-rate indicator is introduced into the objective measure. A mapping between the subjective MOS and the objective measures is then established, yielding a prediction model for the subjective MOS that can be used to objectively assess the quality of speech after coding or channel transmission. The details are as follows:
An output-based objective speech quality assessment method comprises the following steps:
calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system; obtaining a reference model that conforms to the auditory characteristics of the human ear;
performing a consistency measure calculation between the Mel-frequency cepstral coefficients of the distorted speech and the reference model that conforms to the auditory characteristics of the human ear; inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system;
establishing, from the consistency measure and the bit error rate, a mapping between the subjective MOS and the consistency measure, thereby obtaining an objective prediction model for the MOS score of the speech to be assessed; the objective assessment of speech quality is carried out with this objective prediction model.
Preferably, in the above technical solution, the calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT, Mel-frequency filtering and discrete cosine transform.
Preferably, in the above technical solution, the pre-processing specifically comprises the following steps:
Step 1.1, pre-emphasis, specifically: pre-emphasis is implemented with a digital filter that boosts the high-frequency components by 6 dB/octave; its transfer function is expression 1):
H(z) = 1 - μ·z^(-1) 1);
wherein μ is the pre-emphasis coefficient, with a value of 0.9-1.0;
Step 1.2, endpoint detection, specifically: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate; let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
E = Σ_{m=0}^{N-1} x(m)² 2);
its short-time zero-crossing rate Z is calculated by expression 3):
Z = (1/2) · Σ_{m=1}^{N-1} |sgn[x(m)] - sgn[x(m-1)]| 3);
wherein sgn[·] is the sign function, i.e.: sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0;
Step 1.3, framing and windowing, specifically: framing divides the speech into successive frames, each 10-30 ms long; windowing applies a Hamming window to each frame signal.
Preferably, in the above technical solution, the detailed windowing procedure is: let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is then given by expression 4):
y(n) = x(n) · w(n), 0 ≤ n ≤ N-1 4); wherein N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1.
Preferably, in the above technical solution, the Mel-frequency filtering is specifically: the discrete spectrum obtained by the FFT is filtered with a bank of triangular filters, yielding a set of coefficients m_1, m_2, …; the number p of filters in the bank is determined by the cutoff frequency of the signal, and together the filters cover the range from 0 Hz to the Nyquist frequency, i.e. half the sampling rate; each m_i is calculated by expression 5):
m_i = Σ_k H_i(k) · |X(k)| 5);
wherein H_i(k) is the triangular weighting of the i-th filter; f[i] is the centre frequency of the i-th triangular filter, which satisfies: Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]); and X(k) is the discrete spectrum of the frame signal x(n) after the FFT.
Preferably, in the above technical solution, the discrete cosine transform is specifically: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, yielding the Mel-frequency cepstral coefficients, calculated by expression 6):
MFCC(i) = sqrt(2/P) · Σ_{j=1}^{P} ln(m_j) · cos[π·i·(j - 1/2)/P] 6);
wherein MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
Preferably, in the above technical solution, the detailed procedure for obtaining the reference model that conforms to the auditory characteristics of the human ear is as follows:
Let the observed feature vector sequence be O = o1, o2, …, oT and the corresponding state sequence be S = s1, s2, …, sN; the HMM of the sequence is then expressed as expression 7):
λ = (π, A, B) 7);
wherein π = {π_i = P(s1 = i), i = 1, 2, …, N} is the initial state probability vector; A = {a_ij} is the matrix of transition probabilities between states, with a_ij the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
for a continuous HMM, the observation sequence is a continuous signal, and the signal space associated with state j is represented by a sum of M Gaussian mixture density functions, as in expressions 8) and 9):
b_j(o_t) = Σ_{k=1}^{M} c_jk · N(o_t; μ_jk, C_jk) 8);
N(o_t; μ_jk, C_jk) = (2π)^(-D/2) · |C_jk|^(-1/2) · exp[-(o_t - μ_jk)^T · C_jk^(-1) · (o_t - μ_jk) / 2] 9);
wherein c_jk is the weight of the k-th Gaussian mixture density function of state j; μ_jk is the mean vector of the Gaussian density function; C_jk is the covariance matrix; and D is the dimension of the observation sequence O. The HMM parameters are estimated from the observation sequence O = o1, o2, …, oT; the goal of the estimation is to maximize the likelihood function P(O|λ) of the model and the training data, i.e. to find λ* = argmax_λ P(O|λ).
The forward probability of the likelihood function p(O|λ) is calculated as expression 10):
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) · a_ij] · b_j(o_{t+1}), 1 ≤ t ≤ T-1 10);
wherein α_1(i) = π_i · b_i(o1), 1 ≤ i ≤ N;
the backward probability of the likelihood function p(O|λ) is calculated as expression 11):
β_t(i) = Σ_{j=1}^{N} a_ij · b_j(o_{t+1}) · β_{t+1}(j), t = T-1, T-2, …, 1 11);
wherein β_T(i) = 1, 1 ≤ i ≤ N;
For the given observation sequence O = o1, o2, …, oT, the updated λ is obtained by re-estimation. Define ξ_t(i, j) as the probability that the state is s_i at time t and s_j at time t+1, obtained by expression 12):
ξ_t(i, j) = α_t(i) · a_ij · b_j(o_{t+1}) · β_{t+1}(j) / P(O|λ) 12);
given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
γ_t(i) = α_t(i) · β_t(i) / Σ_{j=1}^{N} α_t(j) · β_t(j) 13);
the re-estimation of the HMM parameters λ is then as follows: the parameters c_jk, μ_jk and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15) and 16):
c_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k) 14);
μ_jk = Σ_{t=1}^{T} γ_t(j, k) · o_t / Σ_{t=1}^{T} γ_t(j, k) 15);
C_jk = Σ_{t=1}^{T} γ_t(j, k) · (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k) 16);
wherein γ_t(j, k) denotes the probability of occupying the k-th Gaussian mixture component of state j at time t, obtained by:
γ_t(j, k) = γ_t(j) · c_jk · N(o_t; μ_jk, C_jk) / Σ_{m=1}^{M} c_jm · N(o_t; μ_jm, C_jm).
Preferably, in the above technical solution, the consistency measure is computed specifically using expression 17):
C = (1/N) · log P(X_1, …, X_N | λ) 17);
wherein X_1, …, X_N are the Mel-frequency cepstral coefficient vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
Preferably, in the above technical solution, the calculation of the bit error rate proceeds as follows:
Step A: generate a PN sequence and multiply it by a chaotic sequence; the chaotic sequence is produced by the logistic map, which is defined as follows:
x_{k+1} = μ · x_k · (1 - x_k)
wherein 0 ≤ μ ≤ 4 is called the bifurcation parameter and x_k ∈ (0, 1); when 3.5699456… < μ ≤ 4 the logistic map operates in the chaotic regime, i.e. the sequence {x_k; k = 0, 1, 2, 3, …} generated from an initial condition under the logistic map is aperiodic, non-convergent and highly sensitive to the initial value; the monitoring sequence is generated as follows:
Step a1: first generate the real-valued sequence, and select from it a segment, starting at some position, whose length equals the size of the monitoring sequence;
Step a2: convert the real-valued sequence into a binary sequence by defining a threshold Γ and applying it to the real-valued sequence:
Γ(x_k) = 0 if x_k ≤ Γ, and Γ(x_k) = 1 if x_k > Γ;
the binary chaotic sequence is {Γ(x_k); k = 0, 1, 2, 3, …};
Step a3: multiply the binary chaotic sequence by a PN sequence to obtain the monitoring sequence;
Step B: insert a synchronization code in front of the monitoring sequence, so that the embedded monitoring sequence can later be extracted frame by frame;
Step C: embed the monitoring sequence carrying the synchronization code into the speech signal in the wavelet domain; the detailed procedure is as follows:
Step c1: choose the Daubechies-10 wavelet as the wavelet function;
Step c2: divide the speech signal into frames of 1152 samples each, and apply a 3-level wavelet transform to every frame;
Step c3: quantize the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the monitoring sequence in the speech signal; let f be the coefficient to be quantized, w the bit of the monitoring sequence to be embedded, Δ the quantization step, and f′ the quantized coefficient carrying the monitoring-sequence information; the specific steps are:
take the magnitude of f and apply the floor operation; when f > 0, let m = ⌊f/Δ⌋ and n = m mod 2; then:
f′ = (m + 1/2)·Δ if n = w, and f′ = (m + 3/2)·Δ otherwise;
when f < 0, let m = ⌊|f|/Δ⌋ and n = m mod 2; then:
f′ = -(m + 1/2)·Δ if n = w, and f′ = -(m + 3/2)·Δ otherwise;
the monitoring sequence is embedded into the speech signal bit by bit according to the above formulas;
Step c4: transform the signal carrying the embedded monitoring sequence back into the time domain;
Step D: extract the embedded monitoring sequence from the received speech and calculate the bit error rate; the extraction procedure comprises the following steps:
Step d1: search for the synchronization code in the speech signal, specifically: let L be the length of the signal to be searched; L should exceed the combined length of two synchronization codes plus one complete monitoring sequence; let the initial search point of the signal be I = 1; if the 16 consecutive sample values of the signal starting at point I all lie in the range 900-1100, a possible synchronization code is considered to have been found and is compared with the preset synchronization code; if it is confirmed as the synchronization code, point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2: starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3: apply to the coefficients f of the wavelet decomposition the operation inverse to that used during embedding, i.e.: when f > 0, let m = ⌊f/Δ⌋ and w = m mod 2; when f < 0, let m = ⌊|f|/Δ⌋ and w = m mod 2;
the binary monitoring sequence is thereby extracted;
Step d4: compare the extracted monitoring sequence with the inserted monitoring sequence, and calculate the bit error rate by expression 18):
BER = HammingWeight(Seq_send XOR Seq_receive) / Seq_length 18);
wherein Seq_send, Seq_receive and Seq_length denote the transmitted monitoring sequence, the received monitoring sequence and the sequence length respectively; HammingWeight(·) denotes the Hamming weight of a sequence, and XOR denotes the exclusive-or operation.
Preferably, in the above technical solution, the mapping relationship is obtained by expression 19):
MOS_objective = f(c_1, …, c_N) 19);
wherein f(·) is a multivariate nonlinear regression model; c_i is the consistency measure of the i-th parameter; N is the number of speech characteristic parameters; and MOS_objective is the objective MOS score predicted by f(·) from c_1, …, c_N.
Applying the technical solution of the present invention has the following effects:
1. MFCC approximates the Mel frequency scale, stretching the low-frequency information of the speech and compressing its high-frequency information; it can be used for robust speech analysis and speech recognition, suppresses speaker-dependent characteristics, and retains the linguistic content of the speech segments.
2. The present invention establishes a mapping between the subjective MOS on the one hand and the objective measure and the channel quality on the other, yielding a prediction model for the subjective MOS so that the predicted score is closer to the subjective quality.
3. The method of the present invention has simple steps, is easy to use, can effectively and objectively assess speech quality, and does not depend on subjective assessment.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages. The present invention is described in further detail below with reference to the accompanying drawings.
Description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention; the illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic diagram of the principle of the output-based objective speech quality assessment method in Embodiment 1.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the present invention can be implemented in the many different ways defined and covered by the claims.
Embodiment 1:
An output-based objective speech quality assessment method, referring to Fig. 1, specifically comprises: calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system (the distorted speech is the original speech after transmission through the system; calculating the Mel-frequency cepstral coefficients is the MFCC parameter extraction process); obtaining a reference model that conforms to the auditory characteristics of the human ear (the MFCC parameters of the reference speech are extracted first, and the GMM-HMM model is then obtained); performing a consistency measure calculation between the Mel-frequency cepstral coefficients of the distorted speech and the reference model that conforms to the auditory characteristics of the human ear (i.e. the consistency calculation); inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system; establishing, from the consistency measure and the bit error rate, a mapping between the subjective MOS and the consistency measure (i.e. the MOS mapping in Fig. 1), and obtaining an objective prediction model for the MOS score of the speech to be assessed, with which the objective assessment of speech quality is carried out (here the correlation and the bias between the mapped MOS scores and the subjective MOS serve as the evaluation criteria). The speech used for evaluation is the ITU speech corpus (the speech database of the International Telecommunication Union). The details are as follows:
The calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT (fast Fourier transform), Mel-frequency filtering and discrete cosine transform, specifically:
The pre-processing specifically comprises the following steps:
Step 1.1, pre-emphasis, specifically: pre-emphasis is implemented with a digital filter that boosts the high-frequency components by 6 dB/octave; its transfer function is expression 1):
H(z) = 1 - μ·z^(-1) 1);
wherein μ is the pre-emphasis coefficient, with a value of 0.9-1.0 (0.95 is taken here);
Step 1.2, endpoint detection, specifically: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate; let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
E = Σ_{m=0}^{N-1} x(m)² 2);
its short-time zero-crossing rate Z is calculated by expression 3):
Z = (1/2) · Σ_{m=1}^{N-1} |sgn[x(m)] - sgn[x(m-1)]| 3);
wherein sgn[·] is the sign function, i.e.:
sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0;
Step 1.3, framing and windowing, specifically: in order to apply the analysis methods of stationary processes, the speech is divided into successive frames, each 10-30 ms long; meanwhile, to reduce the truncation effect on the speech frames, a Hamming window is applied to each frame signal, specifically:
let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is then given by expression 4):
y(n) = x(n) · w(n), 0 ≤ n ≤ N-1 4); wherein N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1. A sketch of this pre-processing chain is given below.
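For illustration, a minimal numpy sketch of the pre-processing chain described above (pre-emphasis per expression 1), short-time energy and zero-crossing rate per expressions 2) and 3), and Hamming-windowed framing per expression 4)) might look as follows; μ = 0.95 follows the value given above, while the function names and the hop size are illustrative assumptions rather than anything taken from the patent:

```python
import numpy as np

def preemphasis(x, mu=0.95):
    # expression 1): H(z) = 1 - mu * z^-1
    return np.append(x[0], x[1:] - mu * x[:-1])

def short_time_energy(frame):
    # expression 2): E = sum of x(m)^2 over the frame
    return np.sum(frame.astype(float) ** 2)

def zero_crossing_rate(frame):
    # expression 3): Z = 0.5 * sum(|sgn[x(m)] - sgn[x(m-1)]|)
    s = np.where(frame >= 0, 1, -1)          # sgn[.] as defined above
    return 0.5 * np.sum(np.abs(np.diff(s)))

def frame_and_window(x, frame_len, hop):
    # expression 4): y(n) = x(n) * w(n) with a Hamming window
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)    # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
```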
The Mel-frequency filtering is specifically as follows: the discrete spectrum obtained by the FFT is filtered with a bank of triangular filters, yielding a set of coefficients m_1, m_2, …; the number p of filters in the bank is determined by the cutoff frequency of the signal, and together the filters cover the range from 0 Hz to the Nyquist frequency, i.e. half the sampling rate; each m_i is calculated by expression 5):
m_i = Σ_k H_i(k) · |X(k)| 5);
wherein H_i(k) is the triangular weighting of the i-th filter and f[i] is the centre frequency of the i-th triangular filter, which satisfies:
Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]).
Since the Mel spectral coefficients are all real numbers, they can be transformed to the time domain by a discrete cosine transform. The discrete cosine transform is specifically as follows: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, yielding the Mel-frequency cepstral coefficients, calculated by expression 6):
MFCC(i) = sqrt(2/P) · Σ_{j=1}^{P} ln(m_j) · cos[π·i·(j - 1/2)/P] 6);
wherein MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
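The filterbank-and-DCT stage can be sketched in the same spirit; the triangular-filter construction and the orthonormal DCT used here are standard textbook MFCC choices and only an assumed reading of expressions 5) and 6), not code from the patent:

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(p, nfft, fs):
    # p triangular filters with centre frequencies equally spaced on the Mel
    # scale, covering 0 Hz to fs/2 as required above
    pts = imel(np.linspace(0.0, mel(fs / 2.0), p + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fb = np.zeros((p, nfft // 2 + 1))
    for i in range(1, p + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return fb

def mfcc(frame, fb, n_ceps=13):
    X = np.abs(np.fft.rfft(frame, n=2 * (fb.shape[1] - 1)))  # discrete spectrum X(k)
    m = np.log(fb @ X + 1e-10)     # expression 5): triangular-filter outputs, logged
    return dct(m, type=2, norm='ortho')[:n_ceps]  # expression 6): DCT to the cepstrum
```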
The detailed procedure for obtaining the reference model that conforms to the auditory characteristics of the human ear is as follows:
The pronunciation model is built and trained on the basis of a GMM-HMM. Let the observed feature vector sequence be O = o1, o2, …, oT and the corresponding state sequence be S = s1, s2, …, sN; the HMM (hidden Markov model) of the sequence is then expressed as expression 7):
λ = (π, A, B) 7);
wherein π = {π_i = P(s1 = i), i = 1, 2, …, N} is the initial state probability vector; A = {a_ij} is the matrix of transition probabilities between states, with a_ij the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
for a continuous HMM, the observation sequence is a continuous signal, and the signal space associated with state j is represented by a sum of M Gaussian mixture density functions, as in expressions 8) and 9):
b_j(o_t) = Σ_{k=1}^{M} c_jk · N(o_t; μ_jk, C_jk) 8);
N(o_t; μ_jk, C_jk) = (2π)^(-D/2) · |C_jk|^(-1/2) · exp[-(o_t - μ_jk)^T · C_jk^(-1) · (o_t - μ_jk) / 2] 9);
wherein c_jk is the weight of the k-th Gaussian mixture density function of state j; μ_jk is the mean vector of the Gaussian density function; C_jk is the covariance matrix; and D is the dimension of the observation sequence O. The HMM parameters are estimated from the observation sequence O = o1, o2, …, oT; the goal of the estimation is to maximize the likelihood function P(O|λ) of the model and the training data, i.e. to find λ* = argmax_λ P(O|λ). This can be realized with the EM (expectation-maximization) algorithm, which comprises two parts: the forward and backward probability calculations, and the re-estimation of the HMM parameters and the Gaussian mixture parameters, as follows:
The forward probability of the likelihood function p(O|λ) is calculated as expression 10):
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) · a_ij] · b_j(o_{t+1}), 1 ≤ t ≤ T-1 10);
wherein α_1(i) = π_i · b_i(o1), 1 ≤ i ≤ N;
the backward probability of the likelihood function p(O|λ) is calculated as expression 11):
β_t(i) = Σ_{j=1}^{N} a_ij · b_j(o_{t+1}) · β_{t+1}(j), t = T-1, T-2, …, 1 11);
wherein β_T(i) = 1, 1 ≤ i ≤ N;
For the given observation sequence O = o1, o2, …, oT, the updated λ is obtained by re-estimation. Define ξ_t(i, j) as the probability that the state is s_i at time t and s_j at time t+1, obtained by expression 12):
ξ_t(i, j) = α_t(i) · a_ij · b_j(o_{t+1}) · β_{t+1}(j) / P(O|λ) 12);
given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
γ_t(i) = α_t(i) · β_t(i) / Σ_{j=1}^{N} α_t(j) · β_t(j) 13);
the re-estimation of the HMM parameters λ is then as follows: the parameters c_jk, μ_jk and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15) and 16):
c_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k) 14);
μ_jk = Σ_{t=1}^{T} γ_t(j, k) · o_t / Σ_{t=1}^{T} γ_t(j, k) 15);
C_jk = Σ_{t=1}^{T} γ_t(j, k) · (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k) 16);
wherein γ_t(j, k) denotes the probability of occupying the k-th Gaussian mixture component of state j at time t, obtained by:
γ_t(j, k) = γ_t(j) · c_jk · N(o_t; μ_jk, C_jk) / Σ_{m=1}^{M} c_jm · N(o_t; μ_jm, C_jm).
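Training a GMM-HMM reference model by the Baum-Welch re-estimation of expressions 10) to 16) is exactly what off-the-shelf libraries implement; a hedged sketch using hmmlearn's GMMHMM follows, where the state count, mixture count and iteration count are illustrative assumptions (the patent does not state them):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_reference_model(reference_mfccs):
    # reference_mfccs: list of (n_frames_i, n_ceps) MFCC arrays from reference speech
    X = np.vstack(reference_mfccs)
    lengths = [len(u) for u in reference_mfccs]
    model = GMMHMM(n_components=5, n_mix=4,       # assumed sizes, not from the patent
                   covariance_type="diag", n_iter=20)
    model.fit(X, lengths)   # EM: forward-backward (10, 11) plus re-estimation (14-16)
    return model
```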
The consistency measure is computed specifically as follows: after the modeling, the consistency measure between the Mel-frequency cepstral coefficients of the distorted speech and the reference model is calculated using expression 17):
C = (1/N) · log P(X_1, …, X_N | λ) 17);
wherein X_1, …, X_N are the Mel-frequency cepstral coefficient (MFCC) vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
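Under the reading of expression 17) as a length-normalized log-likelihood (an assumption stated above), the consistency measure reduces to a one-liner on the trained model; hmmlearn's score returns log P(X | λ):

```python
def consistency_measure(model, distorted_mfcc):
    # assumed reading of expression 17): C = (1/N) * log P(X_1..X_N | lambda)
    return model.score(distorted_mfcc) / len(distorted_mfcc)
```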
The calculation of the bit error rate proceeds as follows:
Step A: generate a PN sequence and multiply it by a chaotic sequence; the chaotic sequence is produced by the logistic map, which is defined as follows:
x_{k+1} = μ · x_k · (1 - x_k)
wherein 0 ≤ μ ≤ 4 is called the bifurcation parameter and x_k ∈ (0, 1); when 3.5699456… < μ ≤ 4 the logistic map operates in the chaotic regime, i.e. the sequence {x_k; k = 0, 1, 2, 3, …} generated from an initial condition under the logistic map is aperiodic, non-convergent and highly sensitive to the initial value; the monitoring sequence is generated as follows:
Step a1: first generate the real-valued sequence, and select from it a segment, starting at some position, whose length equals the size of the monitoring sequence;
Step a2: convert the real-valued sequence into a binary sequence by defining a threshold Γ and applying it to the real-valued sequence:
Γ(x_k) = 0 if x_k ≤ Γ, and Γ(x_k) = 1 if x_k > Γ;
the binary chaotic sequence is {Γ(x_k); k = 0, 1, 2, 3, …};
Step a3: multiply the binary chaotic sequence by a PN sequence (pseudo-noise sequence) to obtain the monitoring sequence, as in the sketch below;
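A minimal sketch of steps A and a1-a3 (logistic-map sequence, thresholding at Γ, combination with a PN sequence) might look as follows; the seed value, Γ = 0.5 and the use of a random 0/1 stand-in for the PN sequence are assumptions, and the product of ±1-valued sequences is expressed equivalently as the XOR of 0/1 bits:

```python
import numpy as np

def logistic_sequence(n, x0=0.6, mu=3.9, skip=100):
    # x_{k+1} = mu * x_k * (1 - x_k), run in the chaotic regime (mu > 3.5699456...)
    x, out = x0, []
    for k in range(n + skip):
        x = mu * x * (1.0 - x)
        if k >= skip:
            out.append(x)       # step a1: take a segment starting at some position
    return np.array(out)

def monitoring_sequence(length, gamma=0.5, seed=1):
    chaos_bits = (logistic_sequence(length) > gamma).astype(int)  # step a2: threshold
    pn = np.random.default_rng(seed).integers(0, 2, length)      # stand-in PN sequence
    return np.bitwise_xor(chaos_bits, pn)   # step a3: +-1 product == XOR of 0/1 bits
```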
Step B: insert a synchronization code for the monitoring sequence, specifically: the synchronization code is inserted so that, after the audio has been impaired by the channel, the receiving end does not fail to locate and extract the monitoring sequence. A 16-bit synchronization code is used; to locate it accurately, it is embedded in the time domain of the speech signal by setting the amplitudes of the 16 samples preceding the monitoring sequence to 1000. In this way, if the starting point is out of synchronization while the receiving end extracts the monitoring sequence, the starting sample position of the watermark can be found quickly by searching for 16 consecutive sample values in the range 900-1100, and the embedded monitoring sequence can then be extracted frame by frame, as in the sketch below;
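Only the constants 16, 1000 and 900-1100 in this sketch come from the text above; the function names and the search strategy details are illustrative:

```python
import numpy as np

SYNC_LEN, SYNC_AMP = 16, 1000

def insert_sync(speech, start):
    marked = np.asarray(speech, dtype=float).copy()
    marked[start:start + SYNC_LEN] = SYNC_AMP    # 16 samples forced to amplitude 1000
    return marked

def find_sync(speech, search_hop):
    i = 0
    while i + SYNC_LEN <= len(speech):
        seg = speech[i:i + SYNC_LEN]
        if np.all((seg > 900) & (seg < 1100)):   # candidate synchronization code
            return i
        i += search_hop                          # otherwise I = I + L, as in step d1
    return -1
```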
Step C: embed the monitoring sequence carrying the synchronization code into the speech signal in the wavelet domain. The wavelet domain is chosen for embedding because the monitoring sequence is concealed better in a transform domain and does not audibly affect the original speech. The detailed procedure for embedding the sequence into the speech in the wavelet domain is as follows:
Step c1: since analyzing the same problem with different wavelet bases can produce different results, a suitable wavelet basis must be selected for the problem at hand; here the Daubechies-10 wavelet is chosen as the wavelet function;
Step c2: divide the speech signal into frames of 1152 samples each, and apply a 3-level wavelet transform to every frame; in view of the auditory properties of the human ear, the sequence is embedded in the high-frequency band;
Step c3: quantize the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the monitoring sequence in the speech signal; let f be the coefficient to be quantized, w the bit of the monitoring sequence to be embedded, Δ the quantization step, and f′ the quantized coefficient carrying the monitoring-sequence information; the specific steps are: first take the magnitude of f and apply the floor operation; when f > 0, let m = ⌊f/Δ⌋ and n = m mod 2; then:
f′ = (m + 1/2)·Δ if n = w, and f′ = (m + 3/2)·Δ otherwise;
when f < 0, let m = ⌊|f|/Δ⌋ and n = m mod 2; then:
f′ = -(m + 1/2)·Δ if n = w, and f′ = -(m + 3/2)·Δ otherwise;
the monitoring sequence can thus be embedded into the speech signal bit by bit according to the above formulas;
Step c4: transform the signal carrying the embedded monitoring sequence back into the time domain; a sketch of the embedding follows.
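A hedged sketch of steps c1-c4 using pywt follows; the parity-based quantization (the parity of the quantization index of f′ equals the embedded bit w) is a reconstruction consistent with n = m mod 2 above, and Δ = 0.1 is an assumed step size:

```python
import numpy as np
import pywt

FRAME_LEN, DELTA = 1152, 0.1   # frame size from step c2; DELTA is an assumed step

def qim_embed(f, w, delta=DELTA):
    # step c3 reconstruction: force the parity of the quantization index to w
    m = int(np.floor(abs(f) / delta))
    if m % 2 != w:
        m += 1                  # move to the adjacent cell so the parity encodes w
    fq = (m + 0.5) * delta      # centre of the chosen cell
    return fq if f >= 0 else -fq

def embed_frame(frame, bits):
    coeffs = pywt.wavedec(frame, "db10", level=3)   # steps c1/c2: db10, 3 levels
    band = coeffs[-1]                               # high-frequency detail band
    for i, w in enumerate(bits[:len(band)]):
        band[i] = qim_embed(band[i], int(w))
    coeffs[-1] = band
    return pywt.waverec(coeffs, "db10")[:len(frame)]  # step c4: back to time domain
```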
Step D: extract the embedded monitoring sequence from the received speech and calculate the bit error rate. The details are: extraction of the monitoring sequence is the inverse of embedding, so the wavelet function and the number of wavelet-decomposition levels remain unchanged; the extraction procedure comprises the following steps:
Step d1: search for the synchronization code in the speech signal, specifically: let L be the length of the signal to be searched; L should exceed the combined length of two synchronization codes plus one complete monitoring sequence; let the initial search point of the signal be I = 1; if the 16 consecutive sample values of the signal starting at point I all lie in the range 900-1100, a possible synchronization code is considered to have been found and is compared with the preset synchronization code; if it is confirmed as the synchronization code, point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2: starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3: apply to the coefficients f of the wavelet decomposition the operation inverse to that used during embedding, i.e.:
when f > 0, let m = ⌊f/Δ⌋ and w = m mod 2;
when f < 0, let m = ⌊|f|/Δ⌋ and w = m mod 2;
the binary monitoring sequence is thereby extracted;
Step d4: compare the extracted monitoring sequence with the inserted monitoring sequence, and calculate the bit error rate by expression 18) (the bit error rate serves as one of the objective measures for speech quality assessment):
BER = HammingWeight(Seq_send XOR Seq_receive) / Seq_length 18);
wherein Seq_send, Seq_receive and Seq_length denote the transmitted monitoring sequence, the received monitoring sequence and the sequence length respectively; HammingWeight(·) denotes the Hamming weight of a sequence, and XOR denotes the exclusive-or operation. A sketch of the extraction and the bit-error-rate computation follows.
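The extraction and the bit-error-rate computation of expression 18) can be sketched as the inverse operations, reusing the same wavelet ("db10") and decomposition depth and the assumed Δ from the embedding sketch above:

```python
import numpy as np
import pywt

DELTA = 0.1   # must match the (assumed) embedding step size

def extract_frame(received_frame, n_bits, delta=DELTA):
    coeffs = pywt.wavedec(received_frame, "db10", level=3)  # same wavelet and depth
    band = coeffs[-1]
    # step d3: w = floor(|f| / delta) mod 2, the inverse of the embedding rule
    return np.array([int(np.floor(abs(f) / delta)) % 2 for f in band[:n_bits]])

def bit_error_rate(seq_send, seq_receive):
    # expression 18): BER = HammingWeight(send XOR receive) / length
    diff = np.bitwise_xor(np.asarray(seq_send, int), np.asarray(seq_receive, int))
    return np.count_nonzero(diff) / len(diff)
```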
After the parameter consistency measures of the speech have been computed under the various distortion conditions, a functional mapping can be used to express the relationship between the parameter consistency measures and the objective MOS estimate; that is, the mapping relationship is obtained by expression 19):
MOS_objective = f(c_1, …, c_N) 19);
wherein f(·) is the prediction function (it may be a linear or nonlinear regression relation or a polynomial fit; in this embodiment a multivariate nonlinear regression model is preferred in order to predict the MOS value more accurately); c_i is the consistency measure of the i-th parameter; N is the number of speech characteristic parameters; and MOS_objective is the objective MOS score predicted by f(·) from c_1, …, c_N. The larger the bit error rate, the stronger the interference in the channel and the greater the corresponding speech impairment introduced during transmission, so the MOS_objective value is smaller and the speech quality is poorer.
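Since the patent only states that f(·) is a multivariate nonlinear regression model, the sketch below fits one plausible nonlinear form with scipy's curve_fit; the logistic shape and the choice of inputs (consistency measure and bit error rate) are assumptions for illustration, not the patent's fitted model:

```python
import numpy as np
from scipy.optimize import curve_fit

def mos_map(c, a, b, g, d):
    # assumed nonlinear form: logistic in the consistency measure plus a BER term
    return a / (1.0 + np.exp(-b * (c[0] - g))) + d * c[1]

# c_train: array of shape (2, n), rows = [consistency measure, bit error rate]
# mos_train: the known subjective MOS scores for the same n utterances
# params, _ = curve_fit(mos_map, c_train, mos_train, p0=[4.0, 1.0, 0.0, -1.0])
# mos_objective = mos_map(c_test, *params)
```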
The performance of the speech quality assessment algorithm is measured below in terms of correlation and prediction bias. The correlation mainly reflects whether the distortion-to-MOS mapping obtained by the algorithm is reasonable; the correlation and the bias between the MOS scores mapped by the algorithm and the known subjective MOS values are generally used as the evaluation criteria.
The correlation coefficient ρ and the standard estimation bias σ are obtained by expressions 20) and 21):
ρ = Σ_i (MOS_o(i) - mean(MOS_o)) · (MOS_s(i) - mean(MOS_s)) / sqrt( Σ_i (MOS_o(i) - mean(MOS_o))² · Σ_i (MOS_s(i) - mean(MOS_s))² ) 20);
σ = sqrt( Σ_i (MOS_o(i) - MOS_s(i))² / N ) 21);
wherein MOS_o(i) is the predicted MOS value of the i-th speech sample, MOS_s(i) is its known subjective MOS score, N is the total number of speech pairs, mean(MOS_o) denotes the mean of the predicted MOS values and mean(MOS_s) the mean of the subjective MOS scores.
The closer the correlation coefficient ρ is to 1, the closer the predicted MOS values are to the true MOS values; the smaller the bias σ, the smaller the prediction error and the better the performance of the algorithm.
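Under the reconstruction of expressions 20) and 21) above (Pearson correlation and RMS bias), the evaluation metrics reduce to a few lines:

```python
import numpy as np

def evaluate(mos_pred, mos_true):
    mos_pred = np.asarray(mos_pred, float)
    mos_true = np.asarray(mos_true, float)
    rho = np.corrcoef(mos_pred, mos_true)[0, 1]           # expression 20)
    sigma = np.sqrt(np.mean((mos_pred - mos_true) ** 2))  # expression 21)
    return rho, sigma
```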
The performance of the assessment method of this Embodiment 1 was compared with that of the objective assessment method ITU-T P.563 of the International Telecommunication Union; the results are detailed in Table 1.
As can be seen from Table 1, the method of the present invention (Embodiment 1) achieves a certain improvement in performance over the ITU-T P.563 algorithm: the average correlation ρ with the subjective MOS is higher and the estimation bias σ is lower. The method of the present invention is therefore effective and feasible.
Table 1: performance comparison between the method of the present invention (Embodiment 1) and ITU-T P.563 for each speech condition processed
The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (9)

1. An output-based objective speech quality assessment method, characterized by comprising the following steps:
calculating the Mel-frequency cepstral coefficients of the distorted speech after transmission through the system; obtaining a reference model that conforms to the auditory characteristics of the human ear;
performing a consistency measure calculation between the Mel-frequency cepstral coefficients of the distorted speech and the reference model that conforms to the auditory characteristics of the human ear; inserting a sequence into the original speech, and calculating the bit error rate of that sequence as extracted from the distorted speech after transmission through the system;
establishing, from the consistency measure and the bit error rate, a mapping between the subjective MOS and the consistency measure, thereby obtaining an objective prediction model for the MOS score of the speech to be assessed, with which the objective assessment of speech quality is carried out;
wherein the detailed procedure for obtaining the reference model that conforms to the auditory characteristics of the human ear is as follows:
let the observed feature vector sequence be O = o1, o2, …, oT and the corresponding state sequence be S = s1, s2, …, sN; the HMM of the sequence is then expressed as expression 7):
λ = (π, A, B) 7);
wherein π = {π_i = P(s1 = i), i = 1, 2, …, N} is the initial state probability vector; A = {a_ij} is the matrix of transition probabilities between states, with a_ij the probability of jumping from state i to state j; and B = {b_i(o_t) = P(o_t | s_t = i), 2 ≤ i ≤ N-1} is the set of state output probability distributions;
for a continuous HMM, the observation sequence is a continuous signal, and the signal space associated with state j is represented by a sum of M Gaussian mixture density functions, as in expressions 8) and 9):
b_j(o_t) = Σ_{k=1}^{M} c_jk · N(o_t; μ_jk, C_jk) 8);
N(o_t; μ_jk, C_jk) = (2π)^(-D/2) · |C_jk|^(-1/2) · exp[-(o_t - μ_jk)^T · C_jk^(-1) · (o_t - μ_jk) / 2] 9);
wherein c_jk is the weight of the k-th Gaussian mixture density function of state j; μ_jk is the mean vector of the Gaussian density function; C_jk is the covariance matrix; and D is the dimension of the observation sequence O; the HMM parameters are estimated from the observation sequence O = o1, o2, …, oT, the goal of the estimation being to maximize the likelihood function P(O|λ) of the model and the training data, i.e. to find λ* = argmax_λ P(O|λ);
the forward probability of the likelihood function p(O|λ) is calculated as expression 10):
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) · a_ij] · b_j(o_{t+1}), 1 ≤ t ≤ T-1 10);
wherein α_1(i) = π_i · b_i(o1), 1 ≤ i ≤ N;
the backward probability of the likelihood function p(O|λ) is calculated as expression 11):
β_t(i) = Σ_{j=1}^{N} a_ij · b_j(o_{t+1}) · β_{t+1}(j), t = T-1, T-2, …, 1 11);
wherein β_T(i) = 1, 1 ≤ i ≤ N;
for the given observation sequence O = o1, o2, …, oT, the updated λ is obtained by re-estimation; define ξ_t(i, j) as the probability that the state is s_i at time t and s_j at time t+1, obtained by expression 12):
ξ_t(i, j) = α_t(i) · a_ij · b_j(o_{t+1}) · β_{t+1}(j) / P(O|λ) 12);
given the model λ and the observation sequence O, the posterior probability of state s_i at time t is expression 13):
γ_t(i) = α_t(i) · β_t(i) / Σ_{j=1}^{N} α_t(j) · β_t(j) 13);
the re-estimation of the HMM parameters λ is then as follows: the parameters c_jk, μ_jk and C_jk of the k-th Gaussian mixture component of state j at time t are re-estimated by expressions 14), 15) and 16):
c_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k) 14);
μ_jk = Σ_{t=1}^{T} γ_t(j, k) · o_t / Σ_{t=1}^{T} γ_t(j, k) 15);
C_jk = Σ_{t=1}^{T} γ_t(j, k) · (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k) 16);
wherein γ_t(j, k) denotes the probability of occupying the k-th Gaussian mixture component of state j at time t, obtained by:
γ_t(j, k) = γ_t(j) · c_jk · N(o_t; μ_jk, C_jk) / Σ_{m=1}^{M} c_jm · N(o_t; μ_jm, C_jm).
2. The output-based objective speech quality assessment method according to claim 1, characterized in that: the calculation of the Mel-frequency cepstral coefficients comprises four steps: pre-processing, FFT, Mel-frequency filtering and discrete cosine transform.
3. The output-based objective speech quality assessment method according to claim 2, characterized in that:
the pre-processing specifically comprises the following steps:
Step 1.1, pre-emphasis, specifically: pre-emphasis is implemented with a digital filter that boosts the high-frequency components by 6 dB/octave; its transfer function is expression 1):
H(z) = 1 - μ·z^(-1) 1);
wherein μ is the pre-emphasis coefficient, with a value of 0.9-1.0;
Step 1.2, endpoint detection, specifically: performed by setting thresholds on the short-time energy and the short-time zero-crossing rate; let x(m) be a short-time speech signal of length N; its short-time energy E is calculated by expression 2):
E = Σ_{m=0}^{N-1} x(m)² 2);
its short-time zero-crossing rate Z is calculated by expression 3):
Z = (1/2) · Σ_{m=1}^{N-1} |sgn[x(m)] - sgn[x(m-1)]| 3);
wherein sgn[·] is the sign function, i.e.: sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0;
Step 1.3, framing and windowing, specifically: framing divides the speech into successive frames, each 10-30 ms long; windowing applies a Hamming window to each frame signal.
4. The output-based objective speech quality assessment method according to claim 3, characterized in that: the detailed windowing procedure is: let x(n) be a frame signal and w(n) the window function; the windowed signal y(n) is then given by expression 4):
y(n) = x(n) · w(n), 0 ≤ n ≤ N-1 4); wherein N is the number of samples per frame and w(n) = 0.54 - 0.46·cos[2πn/(N-1)], 0 ≤ n ≤ N-1.
5. The output-based objective speech quality assessment method according to claim 2, characterized in that: the Mel-frequency filtering is specifically: the discrete spectrum obtained by the FFT is filtered with a bank of triangular filters, yielding a set of coefficients m_1, m_2, …; the number p of filters in the bank is determined by the cutoff frequency of the signal, and together the filters cover the range from 0 Hz to the Nyquist frequency, i.e. half the sampling rate; each m_i is calculated by expression 5):
m_i = Σ_k H_i(k) · |X(k)| 5);
wherein H_i(k) is the triangular weighting of the i-th filter; f[i] is the centre frequency of the i-th triangular filter, which satisfies: Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]); and X(k) is the discrete spectrum of the frame signal x(n) after the FFT.
6. The output-based objective speech quality assessment method according to claim 2, characterized in that: the discrete cosine transform is specifically: the Mel spectrum produced by the Mel-frequency filtering is transformed to the time domain, yielding the Mel-frequency cepstral coefficients, calculated by expression 6):
MFCC(i) = sqrt(2/P) · Σ_{j=1}^{P} ln(m_j) · cos[π·i·(j - 1/2)/P] 6);
wherein MFCC(i) is the i-th Mel-frequency cepstral coefficient, N is the number of samples per frame, and P is the number of filters in the bank.
7. The output-based objective speech quality assessment method according to claim 1, characterized in that: the consistency measure is computed specifically using expression 17):
C = (1/N) · log P(X_1, …, X_N | λ) 17);
wherein X_1, …, X_N are the Mel-frequency cepstral coefficient vectors of the distorted speech, N is the number of vectors, and C is the consistency measure between the distorted speech and the model.
8. The output-based objective speech quality assessment method according to claim 1, characterized in that: the calculation of the bit error rate proceeds as follows:
Step A: generate a PN sequence and multiply it by a chaotic sequence; the chaotic sequence is produced by the logistic map, which is defined as follows:
x_{k+1} = μ · x_k · (1 - x_k)
wherein 0 ≤ μ ≤ 4 is called the bifurcation parameter and x_k ∈ (0, 1); when 3.5699456… < μ ≤ 4 the logistic map operates in the chaotic regime, i.e. the sequence {x_k; k = 0, 1, 2, 3, …} generated from an initial condition under the logistic map is aperiodic, non-convergent and highly sensitive to the initial value; the monitoring sequence is generated as follows:
Step a1: first generate the real-valued sequence, and select from it a segment, starting at some position, whose length equals the size of the monitoring sequence;
Step a2: convert the real-valued sequence into a binary sequence by defining a threshold Γ and applying it to the real-valued sequence:
Γ(x_k) = 0 if x_k ≤ Γ, and Γ(x_k) = 1 if x_k > Γ;
the binary chaotic sequence is {Γ(x_k); k = 0, 1, 2, 3, …};
Step a3: multiply the binary chaotic sequence by a PN sequence to obtain the monitoring sequence;
Step B: insert a synchronization code in front of the monitoring sequence, so that the embedded monitoring sequence can later be extracted frame by frame;
Step C: embed the monitoring sequence carrying the synchronization code into the speech signal in the wavelet domain; the detailed procedure is as follows:
Step c1: choose the Daubechies-10 wavelet as the wavelet function;
Step c2: divide the speech signal into frames of 1152 samples each, and apply a 3-level wavelet transform to every frame;
Step c3: quantize the wavelet coefficients and modulate the monitoring sequence onto them, thereby embedding the monitoring sequence in the speech signal; let f be the coefficient to be quantized, w the bit of the monitoring sequence to be embedded, Δ the quantization step, and f′ the quantized coefficient carrying the monitoring-sequence information; the specific steps are:
take the magnitude of f and apply the floor operation; when f > 0, let m = ⌊f/Δ⌋ and n = m mod 2; then:
f′ = (m + 1/2)·Δ if n = w, and f′ = (m + 3/2)·Δ otherwise;
when f < 0, let m = ⌊|f|/Δ⌋ and n = m mod 2; then:
f′ = -(m + 1/2)·Δ if n = w, and f′ = -(m + 3/2)·Δ otherwise;
the monitoring sequence is embedded into the speech signal bit by bit according to the above formulas;
Step c4: transform the signal carrying the embedded monitoring sequence back into the time domain;
Step D: extract the embedded monitoring sequence from the received speech and calculate the bit error rate; the extraction procedure comprises the following steps:
Step d1: search for the synchronization code in the speech signal, specifically: let L be the length of the signal to be searched; L should exceed the combined length of two synchronization codes plus one complete monitoring sequence; let the initial search point of the signal be I = 1; if the 16 consecutive sample values of the signal starting at point I all lie in the range 900-1100, a possible synchronization code is considered to have been found and is compared with the preset synchronization code; if it is confirmed as the synchronization code, point I is the starting position of the monitoring sequence; otherwise set I = I + L;
Step d2: starting from the located starting point, apply the discrete wavelet transform to the speech signal;
Step d3: apply to the coefficients f of the wavelet decomposition the operation inverse to that used during embedding, i.e.: when f > 0, let m = ⌊f/Δ⌋ and w = m mod 2; when f < 0, let m = ⌊|f|/Δ⌋ and w = m mod 2;
the binary monitoring sequence is thereby extracted;
Step d4: compare the extracted monitoring sequence with the inserted monitoring sequence, and calculate the bit error rate by expression 18):
BER = HammingWeight(Seq_send XOR Seq_receive) / Seq_length 18);
wherein Seq_send, Seq_receive and Seq_length denote the transmitted monitoring sequence, the received monitoring sequence and the sequence length respectively; HammingWeight(·) denotes the Hamming weight of a sequence, and XOR denotes the exclusive-or operation.
9. The output-based objective speech quality assessment method according to claim 1, characterized in that: the mapping relationship is obtained by expression 19):
MOS_objective = f(c_1, …, c_N) 19);
wherein f(·) is a multivariate nonlinear regression model; c_i is the consistency measure of the i-th parameter; N is the number of speech characteristic parameters; and MOS_objective is the objective MOS score predicted by f(·) from c_1, …, c_N.
CN201710475912.8A 2017-06-21 2017-06-21 Output-based objective speech quality assessment method Active CN107293306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710475912.8A CN107293306B (en) 2017-06-21 2017-06-21 Output-based objective speech quality assessment method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710475912.8A CN107293306B (en) 2017-06-21 2017-06-21 Output-based objective speech quality assessment method

Publications (2)

Publication Number Publication Date
CN107293306A CN107293306A (en) 2017-10-24
CN107293306B true CN107293306B (en) 2018-06-15

Family

ID=60096759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710475912.8A Active CN107293306B (en) 2017-06-21 2017-06-21 Output-based objective speech quality assessment method

Country Status (1)

Country Link
CN (1) CN107293306B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818797B (en) * 2017-12-07 2021-07-06 苏州科达科技股份有限公司 Voice quality evaluation method, device and system
CN108364661B (en) * 2017-12-15 2020-11-24 海尔优家智能科技(北京)有限公司 Visual voice performance evaluation method and device, computer equipment and storage medium
CN110289014B (en) * 2019-05-21 2021-11-19 华为技术有限公司 Voice quality detection method and electronic equipment
CN110211566A * 2019-06-08 2019-09-06 安徽中医药大学 A compressed-sensing-based classification method for dysfluent speech in hepatolenticular degeneration
CN111091816B (en) * 2020-03-19 2020-08-04 北京五岳鑫信息技术股份有限公司 Data processing system and method based on voice evaluation
CN111968677B (en) * 2020-08-21 2021-09-07 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1713273A * 2005-07-21 2005-12-28 复旦大学 Locally robust digital audio watermarking algorithm resistant to time-scale modification
CN101847409A (en) * 2010-03-25 2010-09-29 北京邮电大学 Voice integrity protection method based on digital fingerprint
CN102044248A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluating method for audio quality of streaming media
CN102881289A (en) * 2012-09-11 2013-01-16 重庆大学 Hearing perception characteristic-based objective voice quality evaluation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7327985B2 (en) * 2003-01-21 2008-02-05 Telefonaktiebolaget Lm Ericsson (Publ) Mapping objective voice quality metrics to a MOS domain for field measurements

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1713273A * 2005-07-21 2005-12-28 复旦大学 Locally robust digital audio watermarking algorithm resistant to time-scale modification
CN102044248A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluating method for audio quality of streaming media
CN101847409A (en) * 2010-03-25 2010-09-29 北京邮电大学 Voice integrity protection method based on digital fingerprint
CN102881289A (en) * 2012-09-11 2013-01-16 重庆大学 Hearing perception characteristic-based objective voice quality evaluation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Speech Quality Evaluation Method Based on Auditory Characteristic; Qingxian Li et al.; Proceedings of the 2016 International Conference on Intelligent Control and Computer Application (ICCA 2016); 2016-01-31; pp. 320-323 *

Also Published As

Publication number Publication date
CN107293306A (en) 2017-10-24

Similar Documents

Publication Publication Date Title
CN107293306B (en) Output-based objective speech quality assessment method
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Hossan et al. A novel approach for MFCC feature extraction
Tiwari MFCC and its applications in speaker recognition
EP2352145B1 (en) Transient speech signal encoding method and device, decoding method and device, processing system and computer-readable storage medium
CN1121681C (en) Speech processing
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
CN111696580B (en) Voice detection method and device, electronic equipment and storage medium
Karbasi et al. Twin-HMM-based non-intrusive speech intelligibility prediction
CN101577116A (en) Extracting method of MFCC coefficients of voice signal, device and Mel filtering method
Aparna et al. Role of windowing techniques in speech signal processing for enhanced signal cryptography
Maganti et al. Auditory processing-based features for improving speech recognition in adverse acoustic conditions
Lim et al. Classification of underwater transient signals using mfcc feature vector
CN105741853A (en) Digital speech perception hash method based on formant frequency
Gandhiraj et al. Auditory-based wavelet packet filterbank for speech recognition using neural network
Varela et al. Combining pulse-based features for rejecting far-field speech in a HMM-based voice activity detector
Makhijani et al. Speech enhancement using pitch detection approach for noisy environment
Jawarkar et al. Effect of nonlinear compression function on the performance of the speaker identification system under noisy conditions
KR102427874B1 (en) Method and Apparatus for Artificial Band Conversion Based on Learning Model
Tomchuk Spectral masking in MFCC calculation for noisy speech
Rahdari et al. A two-level multi-gene genetic programming model for speech quality prediction in Voice over Internet Protocol systems
KR100474969B1 (en) Vector quantization method of line spectral coefficients for coding voice singals and method for calculating masking critical valule therefor
CN110689875A (en) Language identification method and device and readable storage medium
Maurya et al. Speaker recognition for noisy speech in telephonic channel
Lei et al. Speaker Recognition on Mobile Phone: Using Wavelet, Cepstral Coefficients and Probabilisitc Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant