CN1614686A - Super frame track parameter vector quantizing method - Google Patents

Super frame track parameter vector quantizing method

Info

Publication number
CN1614686A
Authority
CN
China
Prior art keywords
parameter
superframe
line spectrum
spectrum pairs
current
Prior art date
Legal status
Granted
Application number
CNA2004100906614A
Other languages
Chinese (zh)
Other versions
CN1284137C (en)
Inventor
崔慧娟
唐昆
赵永刚
李军林
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB2004100906614A priority Critical patent/CN1284137C/en
Publication of CN1614686A publication Critical patent/CN1614686A/en
Application granted granted Critical
Publication of CN1284137C publication Critical patent/CN1284137C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A vector quantization method comprising: framing the input speech signal samples to form superframes; extracting vocal-tract A parameters and voiced/unvoiced parameters for quantization; converting the A parameters into line spectrum pair parameters; subtracting the DC component; determining line spectrum pair predicted values; and subtracting the predicted values from the DC-removed line spectrum pair parameters before vector quantization.

Description

A superframe vocal-tract parameter vector quantization method
Technical field
The invention belongs to the field of speech coding technology, and in particular to low-bit-rate parametric speech coding with multi-frame joint (superframe) processing.
Background technology
Speech coding is widely used in communication systems, speech storage and playback, and consumer products with speech functions. In recent years the International Telecommunication Union, regional organizations, and individual countries have issued a series of speech compression coding standards that achieve satisfactory speech quality at rates from 1.2 kb/s to 16 kb/s. Research at home and abroad currently concentrates on high-quality compression coding below 1.2 kb/s, mainly for radio communication, secure communication, and high-capacity voice storage and playback. Because the bit rate is so low, parametric speech coding with multi-frame joint (superframe) processing must be adopted. The most critical issue is how to quantize the vocal-tract parameters: they require the largest number of bits, and the quality of their quantization determines speech intelligibility.
Quantizing the vocal-tract A parameters directly gives poor results, so the A parameters are first converted into line spectrum pair (LSP) parameters and then quantized. As shown in Fig. 1, the prior-art method comprises the following steps:
(1) Frame the input speech signal samples in chronological order and group several consecutive frames into a superframe.
(2) Process the superframes in chronological order, extracting the vocal-tract A parameters for each frame in the current superframe.
(3) Convert the A parameters extracted for each frame of the current superframe into line spectrum pair (LSP) parameters.
(4) Subtract the corresponding DC component from the LSP parameters; the DC component is a fixed value obtained from statistics over a large number of speech samples.
(5) For each frame's LSP parameters in the current superframe, compute a predicted value with a fixed predictor from the processed LSP parameters of the previous superframe, and subtract this predicted value from the DC-removed LSP parameters to obtain the residual LSP parameters of the current superframe.
(6) Vector-quantize the prediction residual parameters; add the quantized residual to the corresponding DC component and predicted value to obtain the quantized LSP parameters. At the same time, feed the quantized residual into a delay unit, where it is delayed one superframe for use in predicting the next superframe.
(7) Finally, convert the quantized LSP parameters back into vocal-tract A parameters, obtaining the quantized A parameters.
The prior art above applies vector quantization to the superframe LSP parameters, removing the DC component before quantization and predicting the current superframe's LSP parameters from the processed LSP parameters of the previous superframe. It does not, however, fully exploit the short-term correlation properties of speech or the correlation between excitation parameters and vocal-tract parameters. The DC component is usually taken as the mean over the whole training speech, yet the LSP parameters of different speech segments in fact have different DC components. For parametric coding at 1.2 kb/s and above, enough bits are available for LSP quantization that this simple DC removal still gives acceptable results. For lower-rate parametric coding, a more efficient DC-removal method must be adopted to obtain acceptable quantization and, in turn, acceptable speech quality.
As shown in Fig. 1, for LSP prediction the original technique uses a fixed or essentially fixed predictor, which does not make full use of the correlation between superframes or between excitation parameters and vocal-tract parameters. For low-rate parametric speech coding with superframe processing, such prediction is very inefficient and is one of the main reasons for the poor quantization of the vocal-tract parameters.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by proposing a superframe vocal-tract parameter vector quantization method that makes fuller use of the correlation between superframes and between excitation parameters and vocal-tract parameters, achieving higher prediction efficiency.
The superframe vocal-tract parameter quantization method proposed by the invention comprises the following steps:
(1) First, frame the input speech signal samples in chronological order and group several consecutive frames into a superframe (superframes are processed in chronological order).
(2) Extract the vocal-tract A parameters and the voiced/unvoiced parameters for each frame in the current superframe.
(3) Convert the A parameters extracted for each frame of the current superframe into line spectrum pair (LSP) parameters.
(4) Vector-quantize the voiced/unvoiced parameters of the current superframe and determine the mode of the current superframe from the quantized value.
(5) Determine the DC component of each LSP parameter in the current superframe according to the current superframe mode, and subtract the corresponding DC component from each LSP parameter (i.e., adaptive DC removal).
(6) Then determine the predicted LSP values according to the modes of the previous and current superframes (i.e., adaptive prediction).
(7) Subtract these predicted values from the DC-removed LSP parameters to obtain the residual LSP parameters of the current superframe.
(8) Vector-quantize the residual LSP parameters of the current superframe; add the quantized residual to the DC component from step (5) and the predicted value from step (6) to obtain the quantized LSP parameters of the current superframe (at the same time the quantized LSP residual of this superframe is fed into the superframe delay unit).
(9) Convert the quantized LSP parameters of the current superframe back into vocal-tract A parameters, obtaining the quantized A parameters of the current superframe.
Features and technical effects of the invention
The invention is characterized by adaptive removal of the LSP DC component and adaptive LSP prediction. The traditional speech production model treats the excitation parameters and the vocal-tract parameters as independent and processes them separately; this matters little at higher bit rates but limits performance at very low rates. Through statistics over a large number of speech samples, the invention finds a definite correlation between the excitation parameters (here, the voiced/unvoiced parameters) and the vocal-tract parameters, and exploiting this correlation improves low-rate speech coding. The voiced/unvoiced parameters are used to classify speech frames and superframes by character, so that the vocal-tract parameters can be quantized better. The LSP DC component is determined from the voiced/unvoiced character of the superframe currently being processed, so the result of excitation processing guides the vocal-tract processing and a more accurate DC estimate is reached. Besides the excitation parameters of the current superframe, those of the previous superframe jointly decide the prediction mode of the current superframe's LSP parameters, giving a finer mode classification and better prediction. Exploiting this correlation between excitation and vocal-tract parameters raises the precision of vocal-tract parameter quantization and thus the quality of low-rate speech coding.
The method improves the quantization precision of the vocal-tract parameters, giving the synthesized speech higher intelligibility. It is best suited to low-rate parametric speech coding at 600 to 800 b/s.
Description of drawings
Fig. 1 is a flow block diagram of the prior-art superframe vocal-tract parameter quantization method.
Fig. 2 is a flow block diagram of the superframe vocal-tract parameter quantization method proposed by the invention.
Embodiment
The superframe vocal-tract parameter vector quantization method proposed by the invention is further described below with reference to the drawings and an embodiment.
As shown in Fig. 2, the method of the invention comprises the following steps:
(1) Frame the input speech signal samples in chronological order and group several consecutive frames into a superframe.
(2) Extract the vocal-tract A parameters for each frame in the current superframe.
(3) Convert each frame's A parameters in the current superframe into line spectrum pair (LSP) parameters.
(4) Extract the voiced/unvoiced parameters for each frame in the current superframe.
(5) Vector-quantize the voiced/unvoiced parameters of the current superframe to obtain their quantized value.
(6) Determine the current superframe mode from the quantized voiced/unvoiced value.
(7) Determine the DC component of each LSP parameter in the current superframe according to the current superframe mode, and subtract the corresponding DC component from each LSP parameter.
(8) Determine a set of LSP prediction coefficients according to the current and previous superframe modes.
(9) Compute the predicted values from this set of prediction coefficients and the quantized, DC-removed LSP parameters of the last frame of the previous superframe; subtract the corresponding predicted value from each DC-removed LSP parameter of the current superframe to obtain the residual LSP parameters of the current superframe.
(10) Vector-quantize the residual LSP parameters of the current superframe to obtain the quantized residual LSP parameters.
(11) Add the DC component from step (7) and the predicted value from step (9) to the quantized residual to obtain the quantized LSP parameters; at the same time feed the quantized LSP residual into the superframe delay unit.
(12) Convert the quantized LSP parameters of the current superframe back into vocal-tract A parameters, obtaining the quantized A parameters.
The specific embodiment of each step of the above method is described in detail as follows:
In step (1), the speech is sampled at 8 kHz and high-pass filtered to remove power-line interference. Every 20 ms, i.e., 160 speech samples, constitutes a frame, and 6 consecutive frames form a superframe. (The number of frames per superframe is chosen according to the bit rate: for example, 3 frames at 1200 b/s, 6 frames at 600 b/s.)
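The framing of step (1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name is ours, high-pass filtering is omitted, and trailing samples that do not fill a whole superframe are simply dropped.

```python
# Sketch of step (1): split 8 kHz speech samples into 20 ms frames
# (160 samples) and group consecutive frames into superframes.
# frames_per_superframe is chosen by bit rate as the text describes
# (e.g. 3 at 1200 b/s, 6 at 600 b/s).

def make_superframes(samples, frame_len=160, frames_per_superframe=6):
    """Return a list of superframes; each superframe is a list of frames."""
    n_frames = len(samples) // frame_len
    frames = [samples[i * frame_len:(i + 1) * frame_len]
              for i in range(n_frames)]
    n_super = len(frames) // frames_per_superframe
    return [frames[i * frames_per_superframe:(i + 1) * frames_per_superframe]
            for i in range(n_super)]

# 1 second at 8 kHz -> 50 frames -> 8 superframes of 6 frames (2 frames dropped)
speech = [0.0] * 8000
supers = make_superframes(speech)
```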
In step (2), each frame of the current superframe is analyzed with the method described in the U.S. 2400 b/s mixed excitation linear prediction (MELP) speech coding standard to extract the 10th-order vocal-tract A parameters $a^n = [a_1^n, a_2^n, \ldots, a_{10}^n]$, $n = 0, 1, \ldots, 5$.
In step (3), the A parameters of each frame in the current superframe are converted into LSP parameters $f^n = [f_1^n, f_2^n, \ldots, f_{10}^n]$, $n = 0, 1, \ldots, 5$, again following the MELP standard.
In step (4), the 5-subband voiced/unvoiced parameters of each frame in the current superframe are extracted with the MELP method; an unvoiced subband is denoted "0" and a voiced subband "1". The 6 frames of a superframe thus give 30 subband voiced/unvoiced parameters, forming a 30-dimensional vector F whose components are "0" or "1":

$$F = [B^{(0)}, B^{(1)}, \ldots, B^{(5)}] = [b_1^{(0)}, b_2^{(0)}, \ldots, b_5^{(0)}, \ldots, b_1^{(5)}, b_2^{(5)}, \ldots, b_5^{(5)}]$$
In step (5), the voiced/unvoiced vector F is quantized with a 4-bit vector quantizer, giving 16 possible quantized values, each corresponding to one superframe mode; the codebook of this embodiment and the corresponding modes are given in Table 1. The quantization distortion uses the weighted Euclidean distance criterion, i.e., the codeword minimizing the distortion

$$D = (F - F^{(i)}) \, W \, (F - F^{(i)})^T$$

is selected, where $F^{(i)}$ ($i = 0, 1, \ldots, 15$) is a codeword of the vector quantization codebook and the weighting matrix W is diagonal, its values expressing the relative importance of the subbands. The low band is usually the most important, and importance decreases as frequency rises, so the 5 subbands of each frame are given different weights. In this embodiment the 5 subband weights are in the ratio 16:8:4:2:1, reflecting that low-frequency subbands matter more than high-frequency ones; W is therefore the 30 x 30 diagonal matrix

$$W = \mathrm{diag}(16, 8, 4, 2, 1, \; 16, 8, 4, 2, 1, \; \ldots, \; 16, 8, 4, 2, 1)$$
Each codeword represents one superframe mode; that is, quantizing the superframe voiced/unvoiced parameters determines the superframe mode. The voiced/unvoiced codewords are the 16 most frequently occurring superframe voiced/unvoiced vectors found by statistics over the training speech samples.
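The weighted-distance classification of steps (5) and (6) can be sketched as follows. Only three of the 16 Table 1 codewords are included for brevity; the helper names are ours, and a real implementation would use the full codebook.

```python
# Sketch of steps (5)-(6): quantize the 30-dimensional voiced/unvoiced
# vector F against a codebook under the weighted Euclidean distance
# D = (F - F_i) W (F - F_i)^T, where W is diagonal with per-subband
# weights 16:8:4:2:1 repeated for each of the 6 frames; the winning
# codeword's index is the superframe mode.

WEIGHTS = [16, 8, 4, 2, 1] * 6          # diagonal of the 30x30 matrix W

def bits(s):
    """Parse a '10000 10000 ...' string into a list of 0/1 ints."""
    return [int(c) for c in s.replace(" ", "")]

CODEBOOK = {                             # mode -> codeword (subset of Table 1)
    1: bits("10000 10000 10000 10000 00000 00000"),
    13: bits("00000 00000 00000 00000 00000 00000"),
    16: bits("11111 11111 11111 11111 11111 11111"),
}

def weighted_dist(f, c):
    return sum(w * (a - b) ** 2 for w, a, b in zip(WEIGHTS, f, c))

def classify(f):
    """Return the superframe mode whose codeword minimizes D."""
    return min(CODEBOOK, key=lambda m: weighted_dist(f, CODEBOOK[m]))

f = bits("10000 10000 10000 00000 00000 00000")   # one bit off codeword 1
mode = classify(f)                                 # -> 1
```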
In step (6), the result of quantizing the voiced/unvoiced vector F in step (5) is looked up in Table 1 to determine the current superframe mode, denoted $F_0$, where the subscript "0" indicates the current superframe.
Table 1. Superframe voiced/unvoiced vector quantization codebook and corresponding superframe modes

Superframe voiced/unvoiced codeword        Superframe mode
10000 10000 10000 10000 00000 00000        1
11111 11111 11111 11000 10000 00000        2
00000 00000 11000 11111 11111 11111        3
00000 00000 00000 00000 10000 11100        4
10000 10000 00000 00000 00000 00000        5
11111 11111 11100 10000 00000 00000        6
00000 00000 00000 11100 11111 11111        7
11000 10000 10000 00000 10000 11100        8
10000 10000 10000 10000 10000 10000        9
11111 11111 11000 10000 10000 10000        10
10000 11000 11111 11111 11111 11111        11
11000 10000 10000 11000 11111 11111        12
00000 00000 00000 00000 00000 00000        13
00000 11000 11111 11111 11111 11111        14
11111 11111 11111 11111 11000 10000        15
11111 11111 11111 11111 11111 11111        16
In step (7), the current superframe mode $F_0$ obtained in step (5) determines the DC component vector of each frame's LSP parameters, $d^n(F_0) = (d_1^n, d_2^n, \ldots, d_{10}^n)$, $n = 0, 1, \ldots, 5$, which is subtracted from the corresponding LSP parameters to give the DC-removed LSP parameters $l^n = [l_1^n, l_2^n, \ldots, l_{10}^n]$:

$$l^n = f^n - d^n(F_0), \quad n = 0, 1, \ldots, 5$$

The DC components $d^n(F_0)$ are obtained from training speech. In this embodiment, the training speech is divided into 16 subsets by superframe mode, and the LSP parameters of each subset are averaged to give that mode's LSP DC component.
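The adaptive DC removal of step (7) can be sketched as follows. The helper names and the numeric values are illustrative, not trained statistics; dimensions are reduced to 2 for readability (the patent uses 10-dimensional LSP vectors).

```python
# Sketch of step (7): each superframe mode has its own DC (mean) vector,
# estimated offline as the per-mode mean of training LSP vectors; online,
# the current mode's DC vector is subtracted from every frame's LSP
# parameters, l^n = f^n - d^n(F0).

def mean_lsp(vectors):
    """Component-wise mean of a list of equal-length LSP vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def remove_dc(lsp_frames, dc):
    """l^n = f^n - d for every frame n of the superframe."""
    return [[f_k - d_k for f_k, d_k in zip(frame, dc)] for frame in lsp_frames]

# Offline: DC vector for one mode from (toy) training LSPs of that mode
dc_mode1 = mean_lsp([[0.25, 0.5], [0.75, 1.0]])      # -> [0.5, 0.75]
# Online: subtract the current mode's DC from each frame
l = remove_dc([[1.0, 1.0], [0.5, 0.75]], dc_mode1)   # -> [[0.5, 0.25], [0.0, 0.0]]
```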
In step (8), the prediction coefficients are obtained by partitioning the training speech by superframe transfer mode, determined by the previous and current superframe modes, and deriving one set of coefficients for each partition by the minimum mean-square-error principle. Specifically, the transfer mode $(F_{-1}, F_0)$ formed by the previous mode $F_{-1}$ and the current mode $F_0$ selects a set of LSP prediction coefficient matrices $\alpha^n(F_{-1}, F_0)$, $n = 0, 1, \ldots, 5$, each a 10 x 10 matrix.
In step (9), the prediction coefficient matrices $\alpha^n(F_{-1}, F_0)$ from step (8) and the quantized, DC-removed LSP vector of the last frame of the previous superframe, $\hat{l}_{-1}^5 = (\hat{l}_{-1,1}^5, \hat{l}_{-1,2}^5, \ldots, \hat{l}_{-1,10}^5)$, are used to compute the predicted values, which are subtracted from the DC-removed LSP parameters $l^n = [l_1^n, l_2^n, \ldots, l_{10}^n]$ of the current superframe from step (7) to give the residual LSP parameters $r^n = (r_1^n, r_2^n, \ldots, r_{10}^n)$:

$$r^n = l^n - \alpha^n(F_{-1}, F_0) \cdot (\hat{l}_{-1}^5)^T, \quad n = 0, 1, \ldots, 5$$

where T denotes transposition. The matrices $\alpha^n(F_{-1}, F_0)$ are obtained from training speech. In this embodiment, the training data is partitioned by transfer mode $(F_{-1}, F_0)$, and for each partition the $\alpha^n$ are chosen to minimize

$$E\left( \sum_{n=0}^{5} \left( l^n - \alpha^n(F_{-1}, F_0) \cdot (\hat{l}_{-1}^5)^T \right)^2 \right)$$

where E denotes the average (expectation).
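The residual computation of step (9) can be sketched as follows. Dimensions are reduced to 2 and the matrix values are toy numbers (trained matrices come from the least-squares fit described above); the function names are ours.

```python
# Sketch of step (9): r^n = l^n - alpha^n · l_hat, where l_hat is the
# quantized, DC-removed LSP vector of the last frame of the previous
# superframe and alpha^n is the prediction matrix selected by the mode
# pair (F_{-1}, F_0).

def predict(alpha, l_prev):
    """Matrix-vector product alpha · l_prev."""
    return [sum(a * x for a, x in zip(row, l_prev)) for row in alpha]

def residuals(l_frames, alphas, l_prev):
    """r^n = l^n - alpha^n · l_prev for each frame n of the superframe."""
    out = []
    for l_n, alpha_n in zip(l_frames, alphas):
        p = predict(alpha_n, l_prev)
        out.append([a - b for a, b in zip(l_n, p)])
    return out

alpha = [[0.5, 0.0], [0.0, 0.5]]        # same toy matrix for both frames
l_prev = [2.0, 4.0]                      # predicted value = [1.0, 2.0]
r = residuals([[3.0, 5.0], [1.0, 2.0]], [alpha, alpha], l_prev)
# r == [[2.0, 3.0], [0.0, 0.0]]
```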
In step (10), the 6 residual LSP vectors of the 6 frames of the current superframe are first rearranged into two 30-dimensional vectors:

$$R_1 = (r_1^0, r_2^0, r_3^0, r_4^0, r_5^0, \; r_1^1, r_2^1, r_3^1, r_4^1, r_5^1, \; \ldots, \; r_1^5, r_2^5, r_3^5, r_4^5, r_5^5)$$
$$R_2 = (r_6^0, r_7^0, r_8^0, r_9^0, r_{10}^0, \; r_6^1, r_7^1, r_8^1, r_9^1, r_{10}^1, \; \ldots, \; r_6^5, r_7^5, r_8^5, r_9^5, r_{10}^5)$$

Each vector is then quantized with multi-stage vector quantization (MSVQ): $R_1$ with a three-stage, 26-bit quantizer (9, 9, and 8 bits per stage), and $R_2$ with a three-stage, 21-bit quantizer (8, 7, and 6 bits per stage). The vector quantization codebooks are trained on the training speech with a simulated annealing procedure. The quantized $R_1$ and $R_2$ are then split back, by original element membership, into six 10-dimensional vectors, giving the six quantized 10-dimensional residual LSP vectors.
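The MSVQ of step (10) can be sketched with a greedy (stage-by-stage) search. This is a simplification: the patent's codebooks are trained with simulated annealing and practical MSVQ searches often keep several candidates per stage (M-best); the codebooks below are toy values and the function names are ours.

```python
# Sketch of step (10): multi-stage vector quantization. Each stage
# quantizes the remaining error against its own codebook; the
# reconstruction is the sum of the chosen codewords.

def nearest(codebook, target):
    """Index of the codeword with minimum squared Euclidean error."""
    def err(c):
        return sum((t - x) ** 2 for t, x in zip(target, c))
    return min(range(len(codebook)), key=lambda i: err(codebook[i]))

def msvq_encode(vec, stages):
    """Greedy MSVQ: returns (stage indices, quantized vector)."""
    remainder = list(vec)
    quantized = [0.0] * len(vec)
    indices = []
    for codebook in stages:
        i = nearest(codebook, remainder)
        indices.append(i)
        cw = codebook[i]
        quantized = [q + c for q, c in zip(quantized, cw)]
        remainder = [r - c for r, c in zip(remainder, cw)]
    return indices, quantized

stages = [
    [[0.0, 0.0], [1.0, 1.0]],          # stage 1 (1 bit, toy)
    [[0.0, 0.0], [0.25, -0.25]],       # stage 2 (1 bit, toy)
]
idx, q = msvq_encode([1.2, 0.8], stages)
# idx == [1, 1]; q == [1.25, 0.75]
```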
In step (11), the corresponding DC component and predicted value are added to the quantized residual LSP parameters of the current superframe to obtain the quantized LSP parameters:

$$\hat{f}^n = \hat{r}^n + \alpha^n(F_{-1}, F_0) \cdot (\hat{l}_{-1}^5)^T + d^n(F_0), \quad n = 0, 1, \ldots, 5$$

where $\hat{f}^n$ is the quantized LSP vector of frame n of the current superframe and $\hat{r}^n$ is the quantized residual LSP vector of frame n.
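The reconstruction of step (11) can be sketched as a simple component-wise sum. All numbers are toy values; in the codec the three terms come from steps (10), (9), and (7) respectively.

```python
# Sketch of step (11): rebuild the quantized LSP vector by adding back
# the predicted value and the mode-dependent DC component,
# f_hat^n = r_hat^n + alpha^n · l_hat + d^n(F0).

def reconstruct(r_hat, predicted, dc):
    """Component-wise sum of quantized residual, prediction, and DC."""
    return [r + p + d for r, p, d in zip(r_hat, predicted, dc)]

f_hat = reconstruct([0.25, -0.25], [0.5, 0.5], [1.0, 2.0])
# f_hat == [1.75, 2.25]
```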
In step (12), the quantized LSP parameters $\hat{f}^n$ are converted back into vocal-tract A parameters with the method described in the U.S. 2400 b/s MELP speech coding standard, obtaining the quantized A parameters.

Claims (5)

1. A superframe vocal-tract parameter vector quantization method, characterized in that the method comprises the following steps:
(1) framing the input speech signal samples in chronological order and grouping several consecutive frames into a superframe;
(2) extracting vocal-tract A parameters for each frame in the current superframe;
(3) converting each frame's A parameters in the current superframe into line spectrum pair (LSP) parameters;
(4) extracting voiced/unvoiced parameters for each frame in the current superframe;
(5) vector-quantizing the voiced/unvoiced parameters of the current superframe to obtain their quantized value;
(6) determining the current superframe mode from the quantized voiced/unvoiced value;
(7) determining the DC component of each LSP parameter in the current superframe according to the current superframe mode, and subtracting the corresponding DC component from each LSP parameter;
(8) determining a set of LSP prediction coefficients according to the current and previous superframe modes;
(9) computing predicted values from this set of prediction coefficients and the quantized residual LSP parameters of the last frame of the previous superframe, and subtracting the corresponding predicted value from each DC-removed LSP parameter of the current superframe to obtain the residual LSP parameters of the current superframe;
(10) vector-quantizing the residual LSP parameters of the current superframe to obtain the quantized residual LSP parameters;
(11) adding the DC component obtained in step (7) and the predicted value obtained in step (9) to the quantized residual to obtain the quantized LSP parameters;
(12) converting the quantized LSP parameters of the current superframe back into vocal-tract A parameters, obtaining the quantized A parameters.
2. The method of claim 1, characterized in that in said step (1) each superframe comprises 6 frames and each frame comprises 160 speech samples.
3. The method of claim 2, characterized in that in said step (5) the 5-subband voiced/unvoiced parameters are extracted for each frame, giving 30 voiced/unvoiced parameters in total, which are vector-quantized with 4 bits, giving 16 possible superframe voiced/unvoiced quantized values; the quantization uses the weighted Euclidean distance criterion, with different weights assigned to the 5 subbands of each frame in the ratio 16:8:4:2:1; the quantized values are the 16 most frequently occurring superframe voiced/unvoiced vectors found by statistics over the training speech samples, each corresponding to one superframe mode.
4. The method of claim 1, characterized in that the DC components of the LSP parameters in said step (7) are obtained by dividing the training speech into 16 subsets by superframe mode and averaging the LSP parameters of each subset; each mean is the LSP DC component for its superframe mode.
5. The method of claim 1, characterized in that the prediction coefficients of said step (8) are obtained by partitioning the training speech by superframe transfer mode, determined by the previous and current superframe modes, and deriving one set of prediction coefficients for each partition by the minimum mean-square-error principle.
CNB2004100906614A 2004-11-12 2004-11-12 Super frame track parameter vector quantizing method Expired - Fee Related CN1284137C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100906614A CN1284137C (en) 2004-11-12 2004-11-12 Super frame track parameter vector quantizing method


Publications (2)

Publication Number Publication Date
CN1614686A true CN1614686A (en) 2005-05-11
CN1284137C CN1284137C (en) 2006-11-08

Family

ID=34766241

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100906614A Expired - Fee Related CN1284137C (en) 2004-11-12 2004-11-12 Super frame track parameter vector quantizing method

Country Status (1)

Country Link
CN (1) CN1284137C (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101009096B (en) * 2006-12-15 2011-01-26 清华大学 Fuzzy judgment method for sub-band surd and sonant
CN101295507B (en) * 2008-04-25 2011-04-06 清华大学 Superframe acoustic channel parameter multilevel vector quantization method with interstage estimation
CN103325375A (en) * 2013-06-05 2013-09-25 上海交通大学 Coding and decoding device and method of ultralow-bit-rate speech
CN103325375B (en) * 2013-06-05 2016-05-04 上海交通大学 One extremely low code check encoding and decoding speech equipment and decoding method
CN109448739A (en) * 2018-12-13 2019-03-08 山东省计算中心(国家超级计算济南中心) Vocoder line spectral frequency parameters quantization method based on hierarchical cluster
CN109448739B (en) * 2018-12-13 2019-08-23 山东省计算中心(国家超级计算济南中心) Vocoder line spectral frequency parameters quantization method based on hierarchical cluster

Also Published As

Publication number Publication date
CN1284137C (en) 2006-11-08


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20061108

Termination date: 20141112

EXPY Termination of patent right or utility model