Brief summary of the invention
According to the present invention, techniques for processing information are provided. In particular, the invention provides a method and apparatus for converting frames from one CELP-based standard to another CELP-based standard, and/or between different modes within a single standard. Further description of the invention is given throughout this specification, and more particularly below.
In a specific embodiment, the invention provides an apparatus for converting frames from one CELP-based standard to another CELP-based standard, and/or between different modes within a single standard. The apparatus has a bitstream unpacking module for obtaining one or more CELP parameters from a source codec. The apparatus also has an interpolator module coupled to the bitstream unpacking module. The interpolator module is adapted to interpolate between the different frame sizes, subframe sizes, and/or sampling rates of the source codec and the destination codec. A mapping module is coupled to the interpolator module. The mapping module is adapted to map the one or more CELP parameters of the source codec to one or more CELP parameters of the destination codec. The apparatus has a destination bitstream packing module coupled to the mapping module. The destination bitstream packing module is adapted to construct at least one destination output CELP frame based on at least the one or more CELP parameters of the destination codec. A controller is coupled to at least the destination bitstream packing module, the mapping module, the interpolator module, and the bitstream unpacking module. Preferably, the controller is adapted to manage the operation of one or more of the modules and to receive instructions from one or more external applications. The controller is also adapted to provide status information to the one or more external applications.
In an alternative specific embodiment, the invention provides a method for transcoding a CELP-based compressed voice bitstream from a source codec to a destination codec. The method includes processing a source codec input CELP bitstream to unpack at least one or more CELP parameters from the input CELP bitstream. If there is a difference between one or more of a plurality of destination codec parameters (including the frame size, subframe size, and/or sampling rate of the destination codec format) and one or more of a plurality of source codec parameters (including the frame size, subframe size, and/or sampling rate of the source codec format), the method interpolates one or more of the unpacked CELP parameters from the source codec format to the destination codec format. The method includes encoding the one or more CELP parameters for the destination codec, and processing a destination CELP bitstream by at least packing the one or more CELP parameters for the destination codec.
In yet another specific embodiment, the invention provides a method for processing a CELP-based compressed voice bitstream from a source codec format to a destination codec format. The method includes transferring a control signal, from among a plurality of control signals, from an application process, and selecting one CELP mapping strategy from a plurality of different CELP mapping strategies based at least upon the control signal from the application. The method also includes performing a mapping process using the selected CELP mapping strategy to map one or more CELP parameters from the source codec format to one or more CELP parameters of the destination codec format.
Still further, the invention provides a system for processing a CELP-based compressed voice bitstream from a source codec format to a destination codec format. The system includes one or more memories. The memories may include one or more codes for receiving a control signal, from among a plurality of control signals, from an application process. One or more codes are also included for selecting one CELP mapping strategy from a plurality of different CELP mapping strategies based at least upon the control signal from the application. The one or more memories also include one or more codes for performing a mapping process using the selected CELP mapping strategy to map one or more CELP parameters from the source codec format to one or more CELP parameters of the destination codec format. Depending upon the embodiment, there may also be other computer code for carrying out the functionality described herein, as well as functionality outside of this specification that can be combined with the present invention.
Numerous benefits are achieved using the present invention. Depending upon the embodiment, one or more of these benefits may be achieved.
● Reduce the computational complexity of the transcoding process.
● Reduce the delay through the transcoding process.
● Reduce the amount of memory required for transcoding.
● Introduce dynamic rate control.
● Support silence frames through an embedded voice activity detector.
● Provide a framework in which various parameter mapping strategies can be applied.
● Provide a generic transcoding infrastructure that accommodates diverse current and future CELP-based codecs.
The transcoding invention achieves one or more of these benefits. In a specific embodiment, the transcoding apparatus includes:
● a source CELP parameter unpacking module, which obtains the CELP parameters from the incoming encoded CELP bitstream;
● a CELP parameter interpolator, which converts the input source CELP parameters into destination CELP parameters that account for the subframe size difference between the source and destination codecs; parameter interpolation is used if the subframe sizes of the source and destination codecs differ;
● a destination CELP parameter mapping and tuning engine, which converts the CELP parameters from the interpolator module into destination CELP codec parameters;
● a destination CELP code packer, which packs the mapped CELP parameters into destination CELP code frames;
● an advanced feature manager, which manages optional features and functions of the CELP-to-CELP transcoding;
● a controller, which manages the overall transcoding process;
● a status reporting function, which provides the status of the transcoding process.
The source CELP parameter unpacking module is a simplified CELP decoder without a formant filter and a post-filter.
The CELP parameter interpolator comprises a set of interpolators related to one or more of the CELP parameters.
The destination CELP parameter mapping and tuning module includes a parameter mapping strategy switching module and one or more of the following parameter mapping strategies: a module for CELP parameter direct space mapping, a module for analysis in the excitation space domain, and a module for analysis in the filtered excitation space domain.
The present invention performs transcoding on a subframe-by-subframe basis. That is, as a frame (of source compressed information) is received by the transcoding system, the transcoder can begin operating on it and producing output subframes. Once a sufficient number of subframes has been produced, a frame (of compressed information in the destination format) can be generated and, if the purpose is communication, sent to the communication channel. If the purpose is storage, the generated frames can be stored as desired. If the frame durations defined by the source and destination format standards are the same, a single input frame produces a single output frame; otherwise, additional input frames must be buffered or multiple output frames generated. If the subframe durations differ, interpolation between subframe parameters is required. The transcoding operation therefore comprises four operations: (1) bitstream unpacking, (2) subframe buffering and interpolation of source CELP parameters, (3) mapping and tuning to destination CELP parameters, and (4) code packing to produce output frames.
Thus, when a frame is received, the transcoder unpacks the bitstream to produce the CELP parameters of each subframe contained in the frame (Figure 10, block (1)). The parameters of interest are the LPC coefficients, the excitation (produced from the adaptive and fixed codewords), and the pitch lag. Note that, for a low-complexity solution that yields good quality, only the excitation needs to be decoded, not a full synthesis of the speech waveform. If subframe interpolation is required, it is performed at this point by the smart interpolation engine.
The subframes are now in a form that can be processed by the destination parameter mapping and tuning module (Figure 10, block (5)). The short-term LPC filter coefficients are mapped independently of the excitation CELP parameters. A simple linear mapping in the LSP pseudo-frequency domain can be used to produce the LSP coefficients for the destination codec. The excitation CELP parameters can be mapped in a number of ways, which give correspondingly better output quality at the cost of greater computational complexity. Three such mapping strategies are described in this document and form part of the mapping and tuning strategy module (Figure 10, block (4)):
● CELP parameter direct space mapping (DSM);
● analysis in the excitation space domain;
● analysis in the filtered excitation space domain.
The mapping and tuning strategy is selected by the mapping and tuning strategy switching module (Figure 10, block (3)).
Because the three methods trade quality against computational load, they can be used to provide graceful degradation in quality when the apparatus is overloaded by a large number of simultaneous channels. The performance of the transcoder can thus adapt to the available resources. Alternatively, a transcoding system can be built using only the one strategy that yields the desired quality and performance. In that case, the mapping and tuning strategy switching module (Figure 10, block (3)) is not incorporated.
If applicable to the destination standard, a voice activity detector (operating in the parameter space) can also be employed at this point to reduce the output bandwidth. The mapped parameters can then be packed into a frame of the destination bitstream format (Figure 10, block (7)) and generated for transmission or storage.
The invention includes algorithms and methods for performing smart transcoding between CELP-based speech coding standards. The invention also covers transcoding within a single standard in order to perform rate control (by transcoding to a lower-rate mode, or by introducing silence frames through an embedded voice activity detector).
The entire transcoding process is managed by a control module (Figure 10, block (8)), which issues commands according to the state of the transcoding and external commands.
To accommodate different transcoding requirements, the apparatus of the invention provides the capability of adding optional features and functions (Figure 10, block (6)).
Other features and advantages of the invention will become more apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters identify corresponding elements throughout.
Detailed description of the present invention
According to the present invention, techniques for processing information are provided. In particular, the invention provides a method and apparatus for converting CELP frames from one CELP-based standard to another CELP-based standard, and/or between different modes within a single standard. Further detailed description of the invention is given throughout this specification, and more particularly below.
The invention includes algorithms and methods for transcoding between CELP (Code Excited Linear Prediction) based coding methods and standards. Of most interest are the CELP coding methods standardized by bodies such as the International Telecommunication Union (ITU) or the European Telecommunications Standards Institute (ETSI). The invention also covers transcoding within a single standard in order to perform rate control (by transcoding to a lower-rate mode, or by introducing silence frames through an embedded voice activity detector).
Speech coding techniques can generally be classified into waveform coders (for example, the G.711, G.726, and G.722 standards from the ITU) and analysis-by-synthesis (AbS) coders (for example, the G.723.1 and G.729 standards from the ITU, the GSM-AMR standard from ETSI, and the Enhanced Variable Rate Codec (EVRC) and Selectable Mode Vocoder (SMV) standards from the Telecommunications Industry Association (TIA)). Waveform coders operate in the time domain; they are based on a sample-by-sample approach that exploits the correlation between speech samples. Analysis-by-synthesis coders attempt to imitate the human speech production system with a simplified model of a source (the glottis) and a filter (the vocal tract) that shapes the output speech spectrum on a frame basis (frame sizes of 10 to 30 milliseconds are typical).
Analysis-by-synthesis coders were introduced to provide high-quality speech at low bit rates, at the cost of increased computational requirements. Compression is an effective way of saving resources on the communication interface.
Mathematically, all speech codecs start with a one-dimensional analog speech signal $x_0(t)$, which is uniformly sampled and quantized to obtain a digital-domain representation, $x(n) = Q(x_0(nT))$. The sampling rate $f = 1/T$ of the speech signal is typically 8 kHz or 16 kHz, and the sampled signal is typically quantized to at most 16 bits.
A CELP-based codec can then be regarded as an algorithm that uses a model of speech production to map between the sampled speech $x(n)$ and some parameter space $\theta$; that is, it encodes and decodes the digital speech. All CELP-based algorithms operate on speech frames (a frame may be further divided into several subframes). In some codecs the speech frames overlap. A speech frame can be defined as the vector of speech samples beginning at some time $n$, that is,
$$\mathbf{x}_i = [\,x(n),\ x(n+1),\ \ldots,\ x(n+L-1)\,]^T$$
where $L$ is the length of the speech frame (in samples). Note that the frame index $i$ is linearly related to the first frame sample $n$:
$$n = \begin{cases} iL & \text{for non-overlapping frames} \\ i(L-K) & \text{for overlapping frames} \end{cases}$$
where $K$ is the number of overlapping samples between frames.
Now, the compression (lossy coding) process is a function that maps a speech frame $\mathbf{x}_i$ to parameters $\theta_i$, and the decoding process maps from the parameters $\theta_i$ back to an approximation $\hat{\mathbf{x}}_i$ of the original speech frame. The speech frame produced by the decoder is not equal to the originally encoded speech frame. A codec is designed to produce output speech that is perceptually as similar as possible to the input speech; that is, the encoder must produce parameters that maximize some perceptual measure between the input speech frame and the speech frame produced by the decoder when it processes those parameters.
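The relation between the frame index and the first sample of a frame can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names `frame_start` and `split_frames` are hypothetical, and the signal is assumed to be a plain Python list of samples.

```python
def frame_start(i, L, K=0):
    """First sample index n of frame i.

    L is the frame length in samples and K is the overlap in samples
    (K = 0 gives the non-overlapping case n = i*L).
    """
    if not 0 <= K < L:
        raise ValueError("overlap K must satisfy 0 <= K < L")
    return i * (L - K)


def split_frames(x, L, K=0):
    """Return every complete frame of length L contained in the sample list x."""
    frames = []
    i = 0
    while frame_start(i, L, K) + L <= len(x):
        n = frame_start(i, L, K)
        frames.append(x[n:n + L])
        i += 1
    return frames
```

For example, a 20 ms frame at 8 kHz has L = 160, so frame 3 of a non-overlapping codec starts at sample 480.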
In general, the mappings from input to parameters and from parameters to output require knowledge of all previous inputs or parameters. This can be achieved by maintaining a state S inside the codec, for example in the form of the adaptive codebook used by CELP-based methods. The encoder state and the decoder state must be kept synchronized. This is accomplished by updating the state only from data that both sides (encoder and decoder) possess, namely the parameters. Fig. 3 illustrates a general model of the encoder, channel, and decoder.
The frame parameters $\theta_i$ used in the CELP-based model consist of the linear prediction coefficients (LPC) used for short-term prediction of the speech signal (physically related to the vocal tract, the mouth and nasal cavities, and the lips) and an excitation signal composed of adaptive and fixed codes. The adaptive code is used to model the long-term pitch information in the speech. The codes (adaptive and fixed) have associated codebooks that are predefined for a specific CELP codec. Fig. 1 shows a typical CELP decoder, in which the adaptive and fixed codebook vectors are scaled independently by gain factors, then combined and filtered to produce synthesized speech. This speech is usually passed through a post-filter to remove artifacts introduced by the model.
The CELP encoding (analysis) process, shown in Fig. 2, includes pre-processing the speech signal to remove unwanted frequency components and applying a window function, followed by extraction of the short-term LPC parameters. This is typically done with the Levinson-Durbin algorithm. The LPC parameters are converted into Line Spectral Pairs (LSP) to facilitate quantization and subframe interpolation. The speech is then inverse-filtered by the short-term LPC filter to produce a residual excitation signal. This residual is perceptually weighted to improve quality and analyzed to find an estimate of the pitch of the speech. A closed-loop analysis-by-synthesis method is used to determine the optimal pitch. Once the pitch is found, the adaptive codebook contribution to the excitation is subtracted from the residual and the optimal fixed codeword is found. The internal memory of the encoder is then updated to reflect changes in the codec state (such as the adaptive codebook).
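As background for the short-term analysis step mentioned above, the following is a minimal sketch of the Levinson-Durbin recursion. It is not taken from any particular codec; it assumes that the autocorrelation values r[0..order] of the windowed speech are already available, that r[0] is positive, and that the prediction polynomial is written as A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p. The function name is hypothetical.

```python
def levinson_durbin(r, order):
    """Compute LPC coefficients from autocorrelation values r[0..order].

    Returns (a, err) where a[0] == 1.0, a[1..order] are the predictor
    coefficients of A(z) = 1 + sum_k a[k] z^-k, and err is the final
    prediction error energy. Assumes r[0] > 0.
    """
    a = [1.0] + [0.0] * order
    err = float(r[0])
    for m in range(1, order + 1):
        # Correlation of the current predictor with lag m.
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err                      # reflection coefficient
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + k_m * a[m - k]  # order-update of the predictor
        new_a[m] = k_m
        a = new_a
        err *= (1.0 - k_m * k_m)              # updated prediction error
    return a, err
```

In a transcoder of the kind described here this analysis is performed by the source encoder, not by the transcoder itself, which is one reason the parameter-domain approach saves computation.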
The simplest method of transcoding is a brute-force approach known as cascade (tandem) transcoding; see Fig. 4. This method fully decodes the incoming compressed bits to produce synthesized speech. The synthesized speech is then encoded for the target standard. The method suffers from the large amount of computation needed to re-encode the signal, from quality degradation introduced by the pre- and post-filtering of the speech waveform, and from potential delay introduced by the look-ahead requirements of the encoder.
Methods for "smart" transcoding similar to that illustrated in Fig. 5 have appeared in the literature. However, these methods still essentially reconstruct the speech signal and then perform significant work to extract the various CELP parameters, such as LPC and pitch. That is, these methods still operate in the speech signal space. In particular, the excitation signal, which has already been optimally matched to the original speech by the far-end encoder (the encoder that produced the compressed speech according to a compression format), is used only to generate the synthesized speech. The synthesized speech is then used to compute a new optimal excitation. Because of the impulse response filtering operations required in the closed-loop search, this becomes a very computationally intensive operation.
Fig. 6 illustrates US 6,260, the method that 009 B1 uses.Quantize the resonance peak filter coefficient from input stimulus parameter and output and produce the signal of structure again that passes through searcher as the echo signal use.Because the difference between the resonance peak filter coefficient of quantification in the source and destination codec, this causes degrading in the searcher echo signal, and is last, reduces widely from the output voice quality of code conversion.See Fig. 6.In whole instructions, especially below, can find other restrictions.
Fig. 7 illustrates another "smart" transcoding method, published as US 2002/0077812 A1. This method performs transcoding by mapping each CELP parameter directly, ignoring the interactions between the CELP parameters. The method applies only in special cases that require very restrictive conditions between the source and destination CELP codecs. For example, it requires Algebraic CELP (ACELP) and the same subframe size in both the source and destination codecs. For most CELP-based transcoding it does not produce good-quality speech. The method is suited to only one of the GSM-AMR modes and does not cover all modes in GSM-AMR.
A method and apparatus of the invention are discussed in detail below. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. The case of GSM-AMR and G.723.1 is used for purposes of illustration and example. The methods described herein are general and apply to transcoding between any pair of CELP codecs. Those skilled in the relevant art will recognize that other steps, configurations, and arrangements can be used without departing from the spirit and scope of the invention.
The invention includes algorithms and methods for performing smart transcoding between CELP-based speech coding standards. The invention also covers transcoding within a single standard in order to perform rate control (by transcoding to a lower-rate mode, or by introducing silence frames through an embedded voice activity detector). The following sections discuss the details of the invention.
The invention performs transcoding on a subframe-by-subframe basis. That is, as a frame is received by the transcoding system, the transcoder can begin operating on its subframes and producing output subframes. Once a sufficient number of subframes has been produced, a frame can be generated. If the frame durations defined by the source and destination standards are the same, one input frame produces one output frame; otherwise, input frames must be buffered or multiple output frames generated. If the subframes have different durations, interpolation between the subframe parameters is required. The transcoding operation therefore comprises four operations: (1) bitstream unpacking, (2) subframe buffering and interpolation of source CELP parameters, (3) mapping and tuning to destination CELP parameters, and (4) code packing to produce output frames (see Fig. 8).
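The buffering requirement described above can be made concrete with a small calculation. The sketch below, whose function name `frames_to_align` is hypothetical, finds the smallest numbers of source and destination frames (or subframes) that span the same duration; it is only an arithmetic illustration, not part of the described apparatus.

```python
from fractions import Fraction


def frames_to_align(src_ms, dst_ms):
    """Smallest (n_src, n_dst) such that n_src * src_ms == n_dst * dst_ms.

    Durations may be ints or Fractions (milliseconds).
    """
    # n_dst / n_src = src_ms / dst_ms, reduced to lowest terms by Fraction.
    ratio = Fraction(src_ms) / Fraction(dst_ms)
    return ratio.denominator, ratio.numerator
```

With G.723.1 frames of 30 ms and GSM-AMR frames of 20 ms this gives (2, 3): two source frames must be buffered before three destination frames line up exactly. The same ratio holds for the 7.5 ms and 5 ms subframes.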
Figure 10 is a block diagram illustrating the principles of a CELP-based codec transcoding apparatus according to the invention. The diagram includes a source bitstream unpacking module, a smart interpolation engine, a parameter mapping and tuning module, an optional advanced features module, a control module, and a destination bitstream packing module.
The parameter mapping and tuning module includes a mapping and tuning strategy switching module and a parameter mapping and tuning strategy module.
The transcoding operation is managed by the control module.
When a frame is received, the transcoder unpacks the bitstream to produce the CELP parameters of each subframe contained in the frame. The parameters of interest are the LPC coefficients, the excitation (produced from the adaptive and fixed codewords), and the pitch lag.
Note that only the excitation needs to be decoded, not a full synthesis of the speech waveform. This greatly reduces the complexity of unpacking the source codec bitstream. For the CELP parameter direct space mapping (DSM) transcoding strategy, the codebook gains and fixed codewords are also of interest. If subframe interpolation is required, it is performed at this point.
The subframes are now in a form that can be processed by the destination parameter mapping and tuning module shown in Figure 14. The short-term LPC filter coefficients are mapped independently of the excitation CELP parameters. A simple linear mapping in the LSP pseudo-frequency domain can be used to produce the LSP coefficients for the destination codec. More sophisticated non-linear interpolation can also be used. The excitation CELP parameters can be mapped in a number of ways, which give correspondingly better output quality at the cost of greater computational complexity. Three such mapping strategies are described in this document and form part of the parameter mapping and tuning strategy module (Figure 10, block (4)):
● CELP parameter direct space mapping (DSM);
● analysis in the excitation space domain;
● analysis in the filtered excitation space domain.
The mapping and tuning strategy is selected by the mapping and tuning strategy switching module (Figure 10, block (3)).
These three methods are discussed in detail in the following sections. Because the three methods trade quality against computational load, they can be used to provide graceful degradation in quality when the apparatus is overloaded by a large number of simultaneous channels. The performance of the transcoder can thus adapt to the available resources. Alternatively, a transcoding system can be built using only the one strategy that yields the desired quality and performance. In that case, the mapping and tuning strategy switching module (Figure 10, block (3)) is not incorporated.
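One way the strategy switching could be driven by available resources is sketched below. The load metric, the threshold values, and all names (`Strategy`, `select_strategy`) are hypothetical and chosen only for illustration; the specification does not prescribe how the switching decision is made.

```python
from enum import Enum


class Strategy(Enum):
    DSM = 1                   # CELP parameter direct space mapping (cheapest)
    EXCITATION = 2            # analysis in the excitation space domain
    FILTERED_EXCITATION = 3   # analysis in the filtered excitation space domain (costliest)


def select_strategy(load, high=0.9, medium=0.7):
    """Pick a mapping strategy from the current processor load (0.0 to 1.0).

    The thresholds are illustrative assumptions: heavier load selects a
    cheaper strategy so that quality degrades gracefully.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0.0 and 1.0")
    if load >= high:
        return Strategy.DSM
    if load >= medium:
        return Strategy.EXCITATION
    return Strategy.FILTERED_EXCITATION
```

A system built around a single strategy would simply omit this selection step, as noted above.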
If applicable to the destination standard, a voice activity detector (operating in the parameter space) can also be employed at this point to reduce the output bandwidth.
The output of the parameter mapping and tuning module is the set of destination CELP codec codes. They are packed into destination bitstream frames according to the codec's CELP frame format. The packing process is needed to put the output bits into a form that the destination CELP decoder can understand. If the application is storage, the destination CELP parameters can be packed, or stored in an application-specific format. If the frames are to be transported according to a multimedia protocol, the packing process can also be adapted, for example by implementing bit scrambling within the packing process.
In addition, the apparatus of the invention provides the capability of adding future optional signal processing functions or modules.
Subframe interpolation
Subframe interpolation may be needed when the subframes of different standards represent different time durations in the signal domain, or when different sampling rates are used. For example, G.723.1 uses frames of 30 milliseconds duration (7.5 milliseconds per subframe), whereas GSM-AMR uses frames of 20 milliseconds duration (5 milliseconds per subframe). This is illustrated graphically in Fig. 9. Subframe interpolation is performed on two different types of parameters: (1) sample-by-sample parameters (such as excitation and codeword vectors), and (2) subframe parameters (such as LSP coefficients and pitch lag estimates). Sample-by-sample parameters are mapped by considering their discrete time index and copying them to the correct position in the target subframe. If different CELP standards use different sampling rates, up-sampling or down-sampling may be required. Subframe parameters are interpolated by some interpolation function to produce a smooth estimate of the parameters in the target subframe. A smart interpolation algorithm can improve the speech transcoding not only in computational performance but, more importantly, in speech quality. A simple interpolation function is linear interpolation.
As an example, Fig. 9 shows that three GSM-AMR frames are needed to describe the same duration of speech signal as two G.723.1 frames. Likewise, three GSM-AMR subframes are needed for every two G.723.1 subframes. As noted above, there are two classes of parameters: full-subframe parameters (for example, LSP coefficients) and sample-by-sample parameters (for example, adaptive and fixed codewords). Subframe parameters, denoted θ, are converted linearly by computing a weighted sum of the overlapping subframes, and sample-by-sample parameters, denoted v[·], are formed by copying the appropriate samples. For interpolation from G.723.1 subframes to GSM-AMR subframes, the analytic formula is as follows:
where i=0 is the first subframe of the first GSM-AMR frame, i=4 is the first subframe of the second GSM-AMR frame, and so on. Figure 12 depicts this process.
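The weighted-sum and sample-copying rules described above can be sketched as follows. The analytic formula itself is not reproduced in this text, so the sketch makes an explicit assumption: each destination subframe parameter is the average of the overlapping source subframe parameters, weighted by the fraction of the destination subframe that each source subframe covers. Subframe lengths are given in samples (at 8 kHz, 60 for G.723.1 and 40 for GSM-AMR), and all function names are hypothetical.

```python
def overlap_weights(src_len, dst_len, j):
    """Weights of the source subframes that overlap destination subframe j.

    Returns {source_subframe_index: weight}; the weights sum to 1.
    Assumes overlap-proportional (linear) weighting.
    """
    start, end = j * dst_len, (j + 1) * dst_len
    weights = {}
    for s in range(start // src_len, (end - 1) // src_len + 1):
        shared = min(end, (s + 1) * src_len) - max(start, s * src_len)
        weights[s] = shared / dst_len
    return weights


def interpolate_subframe_param(theta_src, src_len, dst_len, j):
    """Full-subframe parameter (e.g. pitch lag) for destination subframe j."""
    w = overlap_weights(src_len, dst_len, j)
    return sum(wt * theta_src[s] for s, wt in w.items())


def copy_samples(v_src, dst_len, j):
    """Sample-by-sample parameter (e.g. excitation) for destination subframe j.

    v_src is a list of per-subframe sample lists from the source codec.
    """
    flat = [x for sub in v_src for x in sub]
    return flat[j * dst_len:(j + 1) * dst_len]
```

Under this assumption, the three GSM-AMR subframes spanning two G.723.1 subframes take weights (1, 0), (0.5, 0.5), and (0, 1) respectively, and the pattern repeats for every further pair of source subframes.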
The LSP parameters (which are full-subframe parameters) should be interpolated in the pseudo-frequency domain, i.e., $f = \cos^{-1}(q)$. This results in better-quality output. The other subframe parameters do not need to be transformed before interpolation.
Note that the analytic formula above is derived from simple linear interpolation. Any suitable interpolation scheme (such as spline, sinusoidal, and so on) can be substituted for it. Furthermore, each CELP parameter (LSP coefficients, lag, pitch gain, codeword gain, and so on) can use a different interpolation scheme to obtain the best perceptual quality.
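Interpolation of LSP values in the pseudo-frequency domain can be sketched as below. The sketch assumes the LSP values q are held in the cosine domain (each in the range -1 to 1) and uses plain linear weighting between two source vectors; the function name `interpolate_lsp` and the choice to return the result in the cosine domain are illustrative assumptions.

```python
import math


def interpolate_lsp(q_a, q_b, w):
    """Linearly interpolate two LSP vectors in the pseudo-frequency domain.

    q_a, q_b: LSP vectors in the cosine domain (values in [-1, 1]).
    w: weight given to q_b (0.0 returns q_a, 1.0 returns q_b).
    The interpolation is carried out on f = arccos(q) and the result is
    converted back to the cosine domain.
    """
    out = []
    for qa, qb in zip(q_a, q_b):
        f = (1.0 - w) * math.acos(qa) + w * math.acos(qb)
        out.append(math.cos(f))
    return out
```

Any of the alternative schemes mentioned above (spline, sinusoidal) could replace the linear weighting inside the loop without changing the domain conversion.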
LSP parameter mapping and excitation vector correction by LSP coefficients
Although nearly all CELP-based audio codecs use the same method to obtain the LPC coefficients, there are some minor differences. These arise from different window sizes and shapes, different LPC interpolation for each subframe, different subframe sizes, different LPC quantization schemes, and different look-up tables.
To further improve the quality of the transcoded audio produced by the subframe interpolation method described above, the excitation vector used as the target signal in transcoding is corrected using the LPC data from the source and destination codecs.
The following two methods can be used to improve perceived quality.
Method 1: Linear transformation of LSP coefficients
The general method of converting between LSP coefficients is a linear transformation,
$$q' = A q + b$$
where $q'$ is the destination LSP vector (in the pseudo-frequency domain), $q$ is the source (original) LSP vector, $A$ is a linear transformation matrix, and $b$ is a bias term. In the simplest case, $A$ reduces to the identity matrix and $b$ reduces to zero. For the G.723.1 to GSM-AMR transcoder embodiment, the DC bias term used in the GSM-AMR codec differs from the one used by the G.723.1 codec, and the $b$ term in the formula above is used to compensate for this difference.
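A minimal sketch of this linear mapping follows. The function name `map_lsp` is hypothetical, and the numeric values of the matrix and bias in the usage example are placeholders only; the actual bias difference between G.723.1 and GSM-AMR is not given in this text.

```python
def map_lsp(q, A=None, b=None):
    """Map a source LSP vector q to the destination domain: q' = A q + b.

    A defaults to the identity matrix and b to the zero vector, which is
    the simplest case described in the text.
    """
    n = len(q)
    if A is None:
        A = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    if b is None:
        b = [0.0] * n
    return [sum(A[r][c] * q[c] for c in range(n)) + b[r] for r in range(n)]
```

For example, `map_lsp([1.0, 2.0], b=[0.5, -0.5])` applies only a bias shift and leaves the matrix as the identity.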
Method 2: Excitation vector correction by LSP coefficients
In each subframe, the decoded source excitation vector is synthesized using the source LSP coefficients to convert it to the speech domain, and is then filtered using the quantized LP parameters of the destination codec to form the target signal for transcoding. This correction is optional; it can significantly improve perceptual speech quality when the LSP parameters differ markedly. Figure 13 depicts the excitation correction method.
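The correction can be sketched as two filtering passes. The sketch rests on several assumptions that are not stated in the text: the LSP coefficients have already been converted to LPC polynomials of the form A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p, the second pass is an inverse (analysis) filter that returns the signal to the excitation domain, and the filter memories start at zero, whereas a real transcoder would carry the filter state across subframes. All function names are hypothetical.

```python
def synthesize(exc, a):
    """All-pole synthesis filter 1/A(z) with zero initial state (a[0] == 1)."""
    s = []
    for n, e in enumerate(exc):
        acc = e
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * s[n - k]
        s.append(acc)
    return s


def inverse_filter(s, a):
    """All-zero analysis filter A(z) with zero initial state (a[0] == 1)."""
    out = []
    for n in range(len(s)):
        acc = s[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc += a[k] * s[n - k]
        out.append(acc)
    return out


def correct_excitation(exc, a_src, a_dst):
    """Source excitation -> speech domain (source LPC) -> target (destination LPC)."""
    return inverse_filter(synthesize(exc, a_src), a_dst)
```

When the source and destination LPC polynomials are identical, the two passes cancel and the excitation is returned unchanged, which is consistent with the correction mattering only when the LSP parameters differ markedly.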
Parameter mapping and tuning module
This section discusses three strategies for mapping the CELP excitation parameters. They are presented in order of increasing computational complexity and output quality. The core of the invention is the fact that the excitation can be mapped directly without reconstructing the speech signal. This means that a large amount of computation is saved during the closed-loop codebook search, because the signal does not need to be filtered by the short-term impulse response as conventional techniques require. This mapping works because the incoming bitstream already contains the excitation that is optimal according to the source CELP codec that produced it. The invention uses this fact to perform a fast search in the excitation domain instead of the speech domain.
As mentioned above, having three methods of mapping the excitation, each with successively better performance, allows the transcoder to adapt to the available computational resources.
Direct-space mapping of CELP parameters
This strategy is the simplest transcoding scheme. The mapping is based on the similarity in physical meaning between the source and destination parameters, and the transcoding is performed directly with analytical formulas, without any iteration or search. The advantage of this scheme is that it needs little memory and consumes almost zero MIPS, yet it still produces intelligible speech, albeit at reduced quality. Note that the direct-space CELP parameter mapping method of the present invention differs from the prior-art apparatus shown in Fig. 7. The method is general and applies to all types of CELP-based transcoding, regardless of differences in frame or subframe size.
Analysis in the excitation-space domain
This strategy is more advanced than the previous one in that both the adaptive and fixed codebooks are searched, and the gains are estimated in the usual way defined by the destination CELP standard, except that these operations are carried out in the excitation domain rather than the speech domain. First, the pitch contribution is determined by a local search that uses the pitch from the input CELP subframe as the initial estimate. Once found, the pitch contribution is subtracted from the excitation, and the fixed codebook is determined by optimally matching the residual. The advantage of this cascaded method is that the autocorrelation method from the CELP standard is not needed to compute the open-loop pitch estimate; instead, the estimate can be taken from the pitch lag of the decoded CELP subframe. The search is also performed in the excitation domain rather than the speech domain, so no impulse response filter is needed during the pitch and codebook searches. This saves a large amount of computation without compromising output quality.
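The cascaded search described above can be sketched as follows. This is only an illustration under stated assumptions: a single-tap pitch predictor, an integer-lag search within a small radius of the decoded lag, a normalized-correlation criterion, and periodic repetition of the history for lags shorter than the subframe. The function name search_pitch and the radius parameter are hypothetical.

```python
def search_pitch(target, past, lag_hint, radius=2):
    """Return (lag, gain, residual) for the best integer lag near lag_hint.

    target : excitation of the current subframe (the search target)
    past   : previous excitation samples, most recent last
    Lags shorter than the subframe reuse the last `lag` samples periodically
    (an assumption of this sketch).
    """
    n = len(target)
    best_lag, best_gain, best_score = lag_hint, 0.0, float("-inf")
    best_cand = [0.0] * n
    for lag in range(max(1, lag_hint - radius), lag_hint + radius + 1):
        if lag > len(past):
            break
        start = len(past) - lag
        cand = [past[start + (i % lag)] for i in range(n)]
        energy = sum(c * c for c in cand)
        if energy == 0.0:
            continue
        corr = sum(t * c for t, c in zip(target, cand))
        score = corr * abs(corr) / energy  # favours positive correlation
        if score > best_score:
            best_lag, best_gain, best_score, best_cand = lag, corr / energy, score, cand
    # Remove the pitch contribution; the remainder is the fixed-codebook target.
    residual = [t - best_gain * c for t, c in zip(target, best_cand)]
    return best_lag, best_gain, residual
```

No impulse response filtering appears anywhere in the loop, which is where the computational saving of the excitation-domain search comes from.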
Analysis in the filtered excitation-space domain
In this case, the LP parameters are still mapped directly from the source codec to the destination codec, and the decoded pitch lag is used as the open-loop pitch estimate for the destination codec. The closed-loop pitch search is still performed in the excitation domain. The fixed codebook search, however, is performed in a filtered excitation-space domain. The choice of filter type, and whether the target vector of one or both searches is converted to this domain, depends on the desired quality and complexity requirements.
Various filters can be used, including a low-pass filter that smooths irregularities, a filter that compensates for differences between the excitation characteristics of the source and destination codecs, and a filter that enhances perceptually important signal features. The advantage is that, whereas the target signal in standard encoding is computed with the weighted LP synthesis filter, here the parameters of the filter (order, frequency emphasis/de-emphasis, phase) are all tunable. This strategy therefore allows the transcoding quality between a specific pair of codecs to be tuned and improved, and allows quality to be traded off against reduced complexity.
Silence frame transcoding and generation
Some CELP-based standards implement a voice activity detector (VAD), which enables discontinuous transmission (DTX) and comfort noise generation (CNG) during periods of no speech. Using VAD gives a significant bit-rate advantage. These frames need to be transcoded, and where the source codec does not produce silence frames, silence frames need to be generated for the destination codec. Such a frame generally contains a few parameters used to generate suitable comfort noise at the decoder. These parameters can be transcoded using simple algebraic methods.
Example embodiments of the invention
The following section presents embodiments of the invention for the G.723.1 and GSM-AMR speech coding standards. The invention is not limited to these standards; it covers all CELP-based audio coding standards. Those skilled in the art will appreciate how to apply these methods to transcoding between other CELP-based coding standards. Before the preferred embodiments are described, a brief description of the GSM-AMR and G.723.1 codecs is given.
The GSM-AMR codec
The GSM-AMR codec uses eight source codecs with bit rates of 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15 and 4.75 kbit/s.
The codec is based on the code-excited linear prediction (CELP) coding model. A 10th-order linear prediction (LP), or short-term, synthesis filter is used. The long-term, or pitch, synthesis filter is implemented using the so-called adaptive codebook approach.
In the CELP speech synthesis model, the excitation signal at the input of the short-term LP synthesis filter is constructed by adding two excitation vectors, one from the adaptive codebook and one from the fixed (innovation) codebook. Speech is synthesized by feeding the two properly chosen vectors from these codebooks through the short-term synthesis filter. The optimum excitation sequence in a codebook is chosen using an analysis-by-synthesis search procedure, in which the error between the original and synthesized speech is minimized according to a perceptually weighted distortion measure. The perceptual weighting filter used in the analysis-by-synthesis search uses the unquantized LP parameters.
The codec operates on speech frames of 20 ms (corresponding to 160 samples at a sampling frequency of 8000 samples per second). For each block of 160 speech samples, the speech signal is analyzed to extract the parameters of the CELP model (LP filter coefficients, adaptive and fixed codebook indices, and gains). These parameters are encoded and transmitted. At the decoder, the parameters are decoded, and speech is synthesized by filtering the reconstructed excitation signal through the LP synthesis filter.
LP analysis is performed twice per frame for the 12.2 kbit/s mode and once per frame for the other modes. For the 12.2 kbit/s mode, the two sets of LP parameters are converted to line spectrum pairs (LSP) and jointly quantized using split matrix quantization (SMQ) with 38 bits. For the other modes, the single set of LP parameters is converted to line spectrum pairs (LSP) and quantized using split vector quantization (SVQ).
The speech frame is divided into four subframes of 5 ms (40 samples) each. The adaptive and fixed codebook parameters are transmitted every subframe. Depending on the subframe, the quantized and unquantized LP parameters, or their interpolated versions, are used. The open-loop pitch lag is estimated every other subframe from the perceptually weighted speech signal (except in the 5.15 and 4.75 kbit/s modes, where it is estimated once per frame).
Then, the following operations are repeated for each subframe:
● The target signal is computed by filtering the LP residual through the weighted synthesis filter, with the initial states of the filters updated by filtering the error between the LP residual and the excitation (this is equivalent to the common approach of subtracting the zero-input response of the weighted synthesis filter from the weighted speech signal).
● The impulse response of the weighted synthesis filter is computed.
● Closed-loop pitch analysis is then performed (to find the pitch lag and gain), using the target and the impulse response, by searching around the open-loop pitch lag. Fractional pitch with a resolution of 1/6 or 1/3 of a sample (depending on the mode) is used.
● The target signal is updated by removing the adaptive codebook contribution (the filtered adaptive code vector), and this new target is used in the fixed algebraic codebook search (to find the optimum innovation codeword).
● The gains of the adaptive and fixed codebooks are scalar quantized with 4 and 5 bits respectively, or vector quantized with 6 to 7 bits (with moving average (MA) prediction applied to the fixed codebook gain).
● Finally, the filter memories are updated (using the determined excitation signal) for finding the target signal in the next subframe.
In each 20 ms speech frame, a bit allocation of 95, 103, 118, 134, 148, 159, 204 or 244 bits is produced, corresponding to a bit rate of 4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2 or 12.2 kbit/s.
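The relationship between the per-frame bit allocation and the bit rate is simple arithmetic, and can be checked with a short Python sketch. The constant and function names are chosen for this illustration only.

```python
FRAME_MS = 20            # GSM-AMR frame duration in milliseconds
SAMPLE_RATE_HZ = 8000    # sampling frequency in samples per second

# Bits per 20 ms frame for the eight modes, lowest rate first.
BITS_PER_FRAME = [95, 103, 118, 134, 148, 159, 204, 244]


def bit_rate_bps(bits_per_frame, frame_ms=FRAME_MS):
    """Bit rate in bits per second for a given frame size and duration."""
    return bits_per_frame * 1000 // frame_ms


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of speech samples covered by one frame."""
    return sample_rate_hz * frame_ms // 1000


RATES_BPS = [bit_rate_bps(b) for b in BITS_PER_FRAME]
```

For example, 95 bits every 20 ms gives 4750 bit/s (4.75 kbit/s), and 244 bits gives 12200 bit/s (12.2 kbit/s).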
The G.723.1 codec
The G.723.1 codec has two bit rates associated with it, 5.3 and 6.3 kbps. Both rates are a mandatory part of the encoder and decoder. It is possible to switch between the two rates at any 30 ms frame boundary.
The codec is based on the principles of linear prediction analysis-by-synthesis coding and attempts to minimize a perceptually weighted error signal. The encoder operates on blocks (frames) of 240 samples each. At the 8 kHz sampling rate, this equals 30 ms. Each block is first high-pass filtered to remove the DC component and then divided into four subframes of 60 samples each. For every subframe, a 10th-order linear predictive coding (LPC) filter is computed using the unprocessed input signal. The LP filter for the last subframe is quantized using a predictive split vector quantizer (PSVQ). The unquantized LPC coefficients are used to construct the short-term perceptual weighting filter, which is used to filter the entire frame and obtain the perceptually weighted speech signal.
For every two subframes (120 samples), the open-loop pitch period $L_{OL}$ is computed using the weighted speech signal. This pitch estimation is performed on blocks of 120 samples. The pitch period is searched in the range from 18 to 142 samples.
From this point on, the speech is processed on a basis of 60 samples per subframe.
Using the estimated pitch period computed previously, a harmonic noise shaping filter is constructed. The combination of the LPC synthesis filter, the formant perceptual weighting filter, and the harmonic noise shaping filter is used to create an impulse response. The impulse response is then used for further computations.
Using the pitch period estimate $L_{OL}$ and the impulse response, a closed-loop pitch predictor is computed. A fifth-order pitch predictor is used. The pitch period is computed as a small differential value around the open-loop pitch estimate. The contribution of the pitch predictor is then subtracted from the initial target vector. Both the pitch period and the differential value are transmitted to the decoder.
Finally, the non-periodic component of the excitation is approximated. For the high bit rate, multi-pulse maximum likelihood quantization (MP-MLQ) excitation is used, and for the low bit rate, an algebraic codebook excitation is used.
First embodiment: GSM-AMR to G.723.1
Figure 17 is a block diagram of a first embodiment of the present invention, showing a transcoder from GSM-AMR to G.723.1. The GSM-AMR bitstream consists of frames ranging in length from 244 bits (31 bytes) for the highest-rate mode, 12.2 kbps, to 95 bits (12 bytes) for the lowest-rate mode, 4.75 kbps. There are eight modes in total, and each of the eight GSM-AMR operating modes produces a different bitstream. Because a G.723.1 frame, with a duration of 30 ms, spans one and a half GSM-AMR frames, two GSM-AMR frames are needed to produce a single G.723.1 frame. The next G.723.1 frame can then be produced when the third GSM-AMR frame arrives. Thus, two G.723.1 frames are produced for every three GSM-AMR frames processed.
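The three-in, two-out frame cadence can be illustrated by counting buffered samples. The sketch below is a scheduling illustration only, not the transcoder itself; the function names and the sample-counting model are assumptions.

```python
def frames_ready(num_src_frames, src_len=160, dst_len=240):
    """For each arriving source frame, return how many destination frames
    can be completed at that moment.

    src_len: samples per source frame (160 for 20 ms GSM-AMR at 8 kHz)
    dst_len: samples per destination frame (240 for 30 ms G.723.1 at 8 kHz)
    """
    produced = []
    buffered = 0
    for _ in range(num_src_frames):
        buffered += src_len
        count = buffered // dst_len
        buffered -= count * dst_len
        produced.append(count)
    return produced


def bytes_needed(bits):
    """Smallest whole number of bytes that holds the given number of bits."""
    return (bits + 7) // 8
```

Calling frames_ready(3) yields [0, 1, 1]: nothing can be emitted after the first GSM-AMR frame, and one G.723.1 frame becomes available after each of the second and third. The helper bytes_needed reproduces the byte counts quoted above (95 bits fit in 12 bytes, 244 bits in 31 bytes).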
The 10 LSP parameters used by the short-term filter in the GSM-AMR speech production model are encoded using the same technique in all operating modes, but in different bitstream formats. The algorithm for reconstructing the LSP parameters is given in the GSM-AMR standard documentation.
Once the short-term filter parameters of each subframe have been produced, the excitation vector needs to be formed by combining the adaptive codeword and the fixed (algebraic) codeword. The adaptive codeword is constructed using a 60-tap interpolation filter, based on the pitch lag parameter at 1/6 or 1/3 resolution. The fixed codeword is then constructed as defined by the standard, and the excitation is formed as
$x[n] = \hat{g}_p\, v[n] + \hat{g}_c\, c[n]$
where x is the excitation, v is the interpolated adaptive codeword, c is the fixed code vector, and $\hat{g}_p$ and $\hat{g}_c$ are the adaptive and fixed code gains, respectively. This excitation is then used to update the memory state of the GSM-AMR unpacker, and is mapped by the G.723.1 bitstream packer.
The adaptive codeword for each subframe is found by forming a linear combination of excitation vectors and finding the best match to the target excitation signal x[] constructed by the GSM-AMR unpacker. The combination is a weighted sum of the previous excitation at five successive lags. This is best illustrated by the formula
$v[n] = \sum_{j=-2}^{2} \beta_j\, u[n - L + j], \quad n = 0, \ldots, 59$
where v[] is the reconstructed adaptive codeword, u[] is the previous excitation buffer, L is the (integer) pitch lag, between 18 and 143 inclusive (determined from the GSM-AMR unpacking module), and $\beta_j$ are the lag weights, which determine the gain and the lag phase. The table of $\beta_j$ vectors is searched to optimize the match between the adaptive codeword v[] and the excitation vector x[].
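The five-tap weighted sum and the table search over the lag weights can be sketched as below. Several details are assumptions of this sketch rather than statements from the text: the five taps are centered on the lag L, samples that are not yet available are obtained by wrapping back by one lag period, the match criterion is squared error, and the weight table is supplied by the caller. The function names are hypothetical.

```python
def past_sample(past, k, lag):
    """Return u[k], with k counted from the start of the current subframe.

    Negative k reads stored history (past[-1] is the most recent sample);
    k >= 0 is wrapped back by whole lag periods (an assumption of this sketch).
    """
    while k >= 0:
        k -= lag
    return past[k]


def adaptive_codeword(past, lag, beta, n_samples=60):
    """Weighted sum of the previous excitation at five successive lags."""
    return [
        sum(beta[j + 2] * past_sample(past, n - lag + j, lag) for j in range(-2, 3))
        for n in range(n_samples)
    ]


def search_beta(target, past, lag, beta_table):
    """Return (index, codeword) of the weight vector that best matches target."""
    best_index, best_err, best_v = -1, float("inf"), None
    for index, beta in enumerate(beta_table):
        v = adaptive_codeword(past, lag, beta, len(target))
        err = sum((t - s) ** 2 for t, s in zip(target, v))
        if err < best_err:
            best_index, best_err, best_v = index, err, v
    return best_index, best_v
```

The history buffer must hold at least lag + 2 samples for the earliest tap to be readable.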
Once the adaptive codeword component of the excitation has been found, it is subtracted from the excitation, leaving a residual to be encoded by the fixed codebook. The residual signal for each subframe is computed as
$x_2[n] = x[n] - v[n], \quad n = 0, \ldots, 59$
where $x_2$[] is the target for the fixed codebook search, x[] is the excitation derived from GSM-AMR unpacking, and v[] is the (interpolated and scaled) adaptive codeword.
The fixed codebook differs between the high-rate and low-rate modes of the G.723.1 codec. The high-rate mode uses an MP-MLQ codebook, which allows six pulses per subframe for even subframes and five pulses per subframe for odd subframes, in any position. The low-rate mode uses an algebraic codebook (ACELP), which allows four pulses per subframe in restricted positions. Both codebooks use a grid flag to indicate whether the codeword should be shifted by one position. These codebooks are searched by the methods defined in the standard, except that no impulse response filter is used, because the search is performed in the excitation domain rather than the speech domain.
When the processing of each subframe is complete, the (persistent) memory of the codec needs to be updated. This is done by first shifting the previous excitation buffer u[] by 60 samples (that is, one subframe), so that the oldest samples are discarded, and then copying the excitation from the current subframe into the top 60 samples of the buffer. In this update, the index n is taken relative to the first sample of the current subframe, and the other parameters are as defined previously.
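The memory update amounts to a shift-and-append on the excitation history. The sketch below assumes that the "top" of the buffer is its most recent end and that the buffer is a Python list; the function name is hypothetical.

```python
def update_excitation_buffer(u, x):
    """Discard the oldest len(x) samples of u and append the new subframe x.

    u : previous excitation buffer, oldest sample first
    x : excitation of the subframe just processed (60 samples for G.723.1)
    The buffer length is preserved.
    """
    if len(x) > len(u):
        raise ValueError("subframe is longer than the excitation buffer")
    return u[len(x):] + list(x)
```

The same routine applies to a 40-sample GSM-AMR subframe, since the shift is taken from the length of x.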
All the mapped parameters are encoded into the outgoing G.723.1 bitstream, and the system is then ready to process the next frame.
Second embodiment: G.723.1 to GSM-AMR
Figure 18 is a block diagram of a second embodiment of the present invention, showing a transcoder from G.723.1 to GSM-AMR. The G.723.1 bitstream consists of frames of 192 bits (24 bytes) for the high-rate (6.3 kbps) codec, or frames of 160 bits (20 bytes) for the low-rate (5.3 kbps) codec. These frames have a very similar structure and differ only in the representation of the fixed codebook parameters.
The 10 LSP parameters used to model the short-term vocal tract filter are encoded in the same way for the high and low rates, and can be extracted from bits 2 to 25 of the G.723.1 frame. Only the LSPs of the fourth subframe are encoded, and interpolation between frames is used to regenerate the LSPs of the other three subframes. The encoding uses three lookup tables, and the LSP vector is reconstructed by joining the three sub-vectors obtained from these tables. Each table has 256 vector entries; the first two tables have 3-element sub-vectors, and the last table has 4-element sub-vectors. Combined, these give the 10-element LSP vector.
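The table-based LSP reconstruction can be sketched as follows. The sketch assumes that the 24-bit field (bits 2 to 25) splits into three 8-bit table indices in most-significant-first order, which the text does not specify, and it uses placeholder tables instead of the real codebook values. The predictive part of the quantizer and the inter-frame interpolation are omitted. All names are hypothetical.

```python
def split_lsp_index(field24):
    """Split a 24-bit LSP field into three 8-bit table indices (assumed order)."""
    return (field24 >> 16) & 0xFF, (field24 >> 8) & 0xFF, field24 & 0xFF


def build_lsp(indices, tables):
    """Concatenate one sub-vector per table into a single LSP vector."""
    vec = []
    for idx, table in zip(indices, tables):
        vec.extend(table[idx])
    return vec


def make_placeholder_tables():
    """Three 256-entry tables with 3-, 3- and 4-element sub-vectors.

    The values are dummies for illustration, not the standard's codebooks.
    """
    return [[[float(i)] * dim for i in range(256)] for dim in (3, 3, 4)]
```

Three 8-bit indices account for the 24 bits of the field, and the 3 + 3 + 4 sub-vector sizes give the 10-element LSP vector.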
The adaptive codeword for each subframe is constructed by combining previous excitation vectors. The combination is a weighted sum of the previous excitation at five successive lags. This is best described by the formula
$v[n] = \sum_{j=-2}^{2} \beta_j\, u[n - L + j], \quad n = 0, \ldots, 59$
where v[] is the reconstructed adaptive codeword, u[] is the previous excitation buffer, L is the (integer) pitch lag, between 18 and 143 inclusive, and $\beta_j$ are the lag weights determined by the pitch gain parameter.
The lag parameter L is obtained directly from the bitstream. The first and third subframes use the full dynamic range of the lag, while the lags of the second and fourth subframes are encoded as offsets from the previous subframe. The lag weighting parameters $\beta_j$ are determined by table lookup. As a result of unpacking the adaptive codeword, approximations of the fractional pitch lag and the associated gain can be determined by calculation.
The fixed codebook differs between the high-rate and low-rate modes of the G.723.1 codec. The high-rate mode uses an MP-MLQ codebook, which allows six pulses per subframe for even subframes and five pulses per subframe for odd subframes, in any position. The low-rate mode uses an algebraic codebook (ACELP), which allows four pulses per subframe in restricted positions. Both codebooks use a grid flag to indicate whether the codeword should be shifted by one position. The algorithm for generating the codeword from the encoded bitstream is given in the G.723.1 standard documentation.
When the processing of each subframe is complete, the (persistent) memory of the codec needs to be updated. This is done by first shifting the previous excitation buffer u[] by 60 samples (that is, one subframe), so that the oldest samples are discarded, and then copying the excitation from the current subframe into the top 60 samples of the buffer. In this update, the index n is taken relative to the first sample of the current subframe, and the other parameters are as defined previously.
The GSM-AMR parameter mapping part of the transcoder takes the interpolated CELP parameters described above and uses them as a basis for searching the GSM-AMR parameter space. The LSP parameters are simply encoded as received, while the other parameters, namely the excitation and the pitch lag, are used as estimates for a local search in the GSM-AMR space. The following description (figure) shows the main operations that must take place on each subframe to complete the transcoding.
The adaptive codeword is formed by searching the previous excitation vector, up to a maximum lag of 143, for the best match with the target excitation. The target excitation is determined from the interpolated subframe. The previous excitation can be interpolated at intervals of 1/6 or 1/3, depending on the mode. The optimum lag is found by searching a small region around the pitch lag (determined from the G.723.1 unpacking module). This region is searched to find the optimum integer lag, and then the fractional part of the lag is determined. The process uses a 24-tap interpolation filter to perform the fractional search. The first and third subframes are treated differently from the second and fourth subframes. The interpolated adaptive codeword v[] is then formed from the previous excitation buffer u[], the (integer) pitch lag L, the fractional pitch lag t at 1/6 resolution, and the 60-tap interpolation filter $b_{60}$.
The pitch gain is computed and quantized so that it can be encoded and sent to the decoder, and so that it can be used to compute the target vector for the fixed codebook. All modes compute the pitch gain in the same way for each subframe,
$g_p = \frac{\sum_{n=0}^{39} x[n]\, v[n]}{\sum_{n=0}^{39} v[n]\, v[n]}$
where $g_p$ is the unquantized pitch gain, x is the target for the adaptive codebook search, and v is the (interpolated) adaptive codeword vector. The 12.2 kbps and 7.95 kbps modes quantize the adaptive and fixed codebook gains independently, whereas the other modes use joint quantization of the fixed and adaptive gains.
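A minimal sketch of the unquantized pitch gain computation follows, assuming the usual least-squares ratio of the cross-correlation between the target and the adaptive codeword to the energy of the adaptive codeword. Any gain bounds or quantization applied by the standard are omitted, and the function name is hypothetical.

```python
def pitch_gain(x, v):
    """Least-squares gain g that minimises sum((x[n] - g * v[n]) ** 2).

    x : target for the adaptive codebook search (one subframe)
    v : (interpolated) adaptive codeword for the same subframe
    Returns 0.0 when the codeword has no energy.
    """
    energy = sum(s * s for s in v)
    if energy == 0.0:
        return 0.0
    return sum(t * s for t, s in zip(x, v)) / energy
```

Because both vectors are in the excitation domain, no impulse response filtering is needed before the correlation is taken.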
Once the adaptive codebook component of the excitation has been found, it is subtracted from the excitation, leaving a residual to be encoded by the fixed codebook. The residual signal for each subframe is computed as
$x_2[n] = x[n] - \hat{g}_p\, v[n], \quad n = 0, \ldots, 39$
where $x_2$[] is the target for the fixed codebook search, x[] is the target for the adaptive codebook search, $\hat{g}_p$ is the quantized pitch gain, and v[] is the (interpolated) adaptive codeword.
The fixed codebook search is designed to find the best match to the residual signal after the adaptive codebook component has been removed. This is important for unvoiced speech and for priming the adaptive codebook. Because a great deal of analysis of the original speech has already taken place, the codebook search used in transcoding can be simpler than the one used in the codec. Moreover, the signal on which the codebook search is performed is the reconstructed excitation signal rather than synthesized speech, and it therefore already has a structure that is more amenable to fixed codebook coding.
The fixed codebook gain is quantized using moving average prediction based on the energy of the previous four subframes. The correction factor between the actual and predicted gain is quantized (by table lookup) and sent to the decoder. The exact details are given in the GSM-AMR standard documentation.
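The idea of moving average gain prediction can be sketched as below. This is an illustration under assumptions, not the standard's procedure: the prediction is taken in the log-energy (dB) domain over the last four subframes, the coefficient values and the mean-energy parameter are placeholders chosen for the sketch, and the table-based quantization of the correction factor is left out. The GSM-AMR standard documentation remains the reference for the exact values.

```python
# Assumed fourth-order MA prediction coefficients, most recent subframe first.
# These are illustrative values for the sketch, not taken from the text.
MA_COEFFS = (0.68, 0.58, 0.34, 0.19)


def predicted_gain(past_errors_db, code_energy_db, mean_energy_db):
    """Predict the fixed codebook gain from the last four prediction errors.

    past_errors_db : quantized prediction errors of the previous four
                     subframes in dB, most recent first
    code_energy_db : energy of the current fixed codeword in dB
    mean_energy_db : assumed mean innovation energy in dB
    """
    predicted_db = sum(b * r for b, r in zip(MA_COEFFS, past_errors_db))
    return 10.0 ** (0.05 * (predicted_db + mean_energy_db - code_energy_db))


def correction_factor(actual_gain, predicted):
    """Ratio between the actual and predicted gain; this is what gets quantized."""
    return actual_gain / predicted
```

With an all-zero error history and a codeword energy equal to the mean energy, the predicted gain is 1.0, so the correction factor equals the actual gain.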
When the processing of each subframe is complete, the (persistent) memory of the codec needs to be updated. This is done by first shifting the previous excitation buffer u[] by 40 samples (that is, one subframe), so that the oldest samples are discarded, and then copying the excitation from the current subframe into the top 40 samples of the buffer. In this update, the index n is taken relative to the first sample of the current subframe, and the other parameters are as defined previously.
While what are presently considered to be example embodiments of the present invention have been illustrated and described, those skilled in the art will understand that various other modifications may be made, and equivalents may be substituted, without departing from the true scope of the invention. In addition, many modifications may be made to adapt a particular situation to the teachings of the present invention without departing from the central inventive concept described herein.