WO2013062392A1 - Method for encoding voice signal, method for decoding voice signal, and apparatus using same - Google Patents

Method for encoding voice signal, method for decoding voice signal, and apparatus using same

Info

Publication number
WO2013062392A1
WO2013062392A1 (PCT/KR2012/008947)
Authority
WO
WIPO (PCT)
Prior art keywords
bit allocation
signal
current frame
echo
echo zone
Prior art date
Application number
PCT/KR2012/008947
Other languages
English (en)
Korean (ko)
Inventor
이영한
정규혁
강인규
전혜정
김락용
Original Assignee
LG Electronics Inc.
Priority date
Filing date
Publication date
Application filed by LG Electronics Inc.
Priority to US 14/353,981 (US9672840B2)
Priority to JP 2014-538722 (JP6039678B2)
Priority to CN 201280063395.9 (CN104025189B)
Priority to EP 12843449.5 (EP2772909B1)
Priority to KR 10-2014-7010211 (KR20140085453A)
Publication of WO2013062392A1



Classifications

    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L19/002: Dynamic bit allocation
    • G10L19/005: Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L19/022: Blocking, i.e. grouping of samples in time; choice of analysis windows; overlap factoring
    • G10L19/025: Detection of transients or attacks for time/frequency resolution switching
    • G10L19/18: Vocoders using multiple modes
    • G10L19/24: Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding

Definitions

  • the present invention relates to a technique for processing a speech signal, and more particularly, to a method and apparatus for variably performing bit allocation in encoding a speech signal in order to solve a pre-echo problem.
  • Expansion of the communication band means that almost all sound signals, including not only voice but also music and mixed content, are included as encoding targets.
  • CELP: Code-Excited Linear Prediction
  • An object of the present invention is to provide a method and apparatus for solving a pre-echo problem that may occur due to transform-based encoding (transform encoding).
  • An object of the present invention is to provide a method and apparatus for adaptively performing bit allocation by dividing a fixed frame into sections in which pre-echo may occur and other sections.
  • An object of the present invention is to provide a method and apparatus for improving coding efficiency by dividing a frame into predetermined sections and changing the bit allocation according to the characteristics of the signal in each section when the bit rate to be transmitted from the encoder is fixed.
  • An embodiment of the present invention provides a speech signal encoding method comprising: determining an echo zone in a current frame; allocating bits for the current frame based on the location of the echo zone; and encoding the current frame using the allocated bits. In the bit allocation step, more bits may be allocated to a section in which the echo zone is located than to a section in which the echo zone is not located in the current frame.
  • the current frame may be divided into a predetermined number of sections, and more bits may be allocated to a section in which the echo zone exists than in a section in which the echo zone does not exist.
  • in the determining of the echo zone, when the current frame is divided into sections, it may be determined that an echo zone exists in the current frame when the energy level of the voice signal is not uniform across the sections. In this case, the echo zone may be determined to be located in the section where the transition in energy magnitude exists.
  • when the normalized energy for the current subframe changes past a threshold relative to the normalized energy for the previous subframe, it may be determined that the echo zone is located in the current subframe.
  • the normalized energy may be normalized based on the largest energy value among energy values for each subframe of the current frame.
  • the subframes of the current frame may be searched in order, and it may be determined that the echo zone is located in the first subframe whose normalized energy for the subframe exceeds a threshold.
  • the subframes of the current frame may be searched in order, and it may be determined that the echo zone is located in the first subframe in which the normalized energy for the subframe is smaller than a threshold value.
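  • as a rough illustration of the echo-zone search described above, the sketch below normalizes per-subframe energies by the frame's largest subframe energy and returns the first subframe whose normalized energy passes a threshold; the squared-sum energy form, the threshold value, and all names are assumptions for illustration, not values taken from this disclosure.

```python
import numpy as np

def find_echo_zone(frame: np.ndarray, n_sub: int = 16, thresh: float = 0.5):
    """Search subframes in order and return the index of the first subframe
    whose normalized energy passes `thresh`; None if no transition exists.
    `frame` length must be divisible by `n_sub`."""
    e = np.sum(frame.reshape(n_sub, -1) ** 2, axis=1)  # energy per subframe
    r = e / max(float(e.max()), 1e-12)                 # normalize by largest energy
    for i in range(1, n_sub):
        if r[i - 1] <= thresh < r[i]:                  # energy passes the threshold
            return i
    return None
```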
  • the current frame may be divided into a predetermined number of sections, and a bit amount may be allocated to each section based on a weight that depends on whether the echo zone is located in the section and on the energy magnitude of the section.
  • the current frame may be divided into a predetermined number of sections, and bit allocation may be performed by applying a mode corresponding to an echo zone position in the current frame among predetermined bit allocation modes.
  • information indicating the applied bit allocation mode may be transmitted to the decoder.
  • Another embodiment of the present invention is a speech signal decoding method comprising: obtaining bit allocation information for a current frame; and decoding a speech signal based on the bit allocation information, wherein the bit allocation information may be bit allocation information for each section of the current frame.
  • the bit allocation information may indicate a bit allocation mode applied to the current frame on a table in which a predetermined bit allocation mode is defined.
  • the bit allocation information may indicate that bit allocation is differentially performed in a section in which a transition component is located and a section in which a transition component is not located in the current frame.
  • According to the present invention, it is possible to provide improved sound quality by preventing or attenuating noise caused by pre-echo while maintaining the same overall bit rate.
  • Since bit allocation can be changed in consideration of the magnitude of the energy component, more efficient encoding can be performed according to energy.
  • Various additional services can be created by implementing a high-quality voice and audio communication service.
  • Since the generation of pre-echo can be prevented or reduced even when transform-based speech encoding is applied, transform-based speech encoding can be utilized more effectively.
  • FIGS. 1 and 2 schematically show examples of a configuration of an encoder.
  • FIGS. 3 and 4 are diagrams schematically showing examples of a decoder corresponding to an encoder.
  • FIGS. 5 and 6 are diagrams schematically illustrating the pre-echo.
  • FIG. 7 is a diagram schematically illustrating a block switching method.
  • FIG. 8 is a diagram schematically illustrating an example of a window type in a case where a basic frame is set to 20 ms and a larger frame size, 40 ms or 80 ms, is applied according to signal characteristics.
  • FIG. 9 is a diagram schematically illustrating a relationship between a position of a pre echo and bit allocation.
  • FIG. 10 is a diagram schematically illustrating a method of performing bit allocation according to the present invention.
  • FIG. 11 is a flowchart schematically illustrating a method in which an encoder variably allocates a bit amount according to the present invention.
  • FIG. 12 schematically illustrates an example configuration of a speech encoder having an extended structure, to which the present invention is applied.
  • FIG. 13 is a diagram schematically illustrating a configuration of a pre echo reduction unit.
  • FIG. 14 is a flowchart schematically illustrating a method for encoding a speech signal by variably performing bit allocation in accordance with the present invention.
  • FIG. 15 is a diagram schematically illustrating a method of decoding an encoded speech signal when bit allocation is variably performed in encoding the speech signal according to the present invention.
  • when a first component is described as being 'connected' or 'coupled' to a second component, the first component may be directly connected or coupled to the second component, or may be connected or coupled to the second component via a third component.
  • first and second may be used to distinguish one technical configuration from another.
  • a component that has been named as a first component within the scope of the technical idea of the present invention may be referred to as a second component to perform the same function.
  • for convenience of description, CELP (Code Excited Linear Prediction)-based encoding and decoding are referred to as 'CELP encoding' and 'CELP decoding', and transform-based encoding and decoding are referred to as 'transform encoding' and 'transform decoding'.
  • FIG. 1 schematically illustrates an example of a configuration of an encoder.
  • TCX: Transform Coded Excitation; ACELP: Algebraic Code-Excited Linear Prediction
  • the speech encoder 100 may include a bandwidth checker 105, a sampling converter 125, a preprocessor 130, a band divider 110, linear prediction analyzers 115 and 135, a mode selector 185, a band predictor 190, and a compensation gain predictor 195.
  • the bandwidth checking unit 105 may determine bandwidth information of an input voice signal.
  • voice signals may be classified according to bandwidth: a narrowband signal has a bandwidth of about 4 kHz and is widely used in public switched telephone networks (PSTNs); a wideband signal has a bandwidth of about 7 kHz and is used for high-quality speech or AM radio rather than narrowband voice; and an ultra-wideband signal has a bandwidth of about 14 kHz and is widely used in fields where sound quality is important, such as music and digital broadcasting.
  • the bandwidth checking unit 105 may convert the input voice signal into the frequency domain to determine whether the bandwidth of the current voice signal is a narrow band signal, a wide band signal, or an ultra wide band signal.
  • the bandwidth checking unit 105 may convert the input voice signal into the frequency domain to examine the presence and/or energy of the upper-band bins of the spectrum.
  • the bandwidth checking unit 105 may not be separately provided when the bandwidth of the input voice signal is fixed according to an implementation.
  • the bandwidth checking unit 105 may transmit the ultra wideband signal to the band splitter 110 and the narrowband signal or the wideband signal to the sampling converter 125 according to the bandwidth of the input voice signal.
  • the band dividing unit 110 may convert the sampling rate of an input signal and divide it into an upper band and a lower band. For example, a 32 kHz voice signal may be converted to a sampling frequency of 25.6 kHz and then split into an upper band and a lower band of 12.8 kHz each.
  • the band divider 110 transmits a lower band signal of the divided bands to the preprocessor 130, and transmits an upper band signal to the linear prediction analyzer 115.
  • the sampling converter 125 may receive the input narrowband or wideband signal and convert it to a fixed internal sampling rate. For example, if the sampling rate of the input narrowband speech signal is 8 kHz, a lower-band signal may be generated by upsampling to 12.8 kHz; if the input wideband speech signal is sampled at 16 kHz, a lower-band signal may be generated by downsampling to 12.8 kHz.
  • the sampling converter 125 outputs the sampling-converted lower band signal.
  • the internal sampling frequency may have a sampling frequency other than 12.8 kHz.
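  • as a minimal sketch of these conversions, the helper below uses a polyphase resampler; the use of scipy and the function name are illustrative choices, not the codec's actual filters.

```python
import numpy as np
from scipy.signal import resample_poly

def to_internal_rate(x: np.ndarray, fs_in: int) -> np.ndarray:
    """Convert an 8 kHz NB or 16 kHz WB signal to the 12.8 kHz internal rate."""
    if fs_in == 8000:           # NB: upsample by 8/5 -> 12.8 kHz
        return resample_poly(x, 8, 5)
    if fs_in == 16000:          # WB: downsample by 4/5 -> 12.8 kHz
        return resample_poly(x, 4, 5)
    raise ValueError("expected an 8 kHz NB or 16 kHz WB input")
```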
  • the preprocessor 130 performs preprocessing on the lower band signals output from the sampling converter 125 and the band divider 110.
  • the preprocessor 130 filters the input signal so that the speech parameter can be efficiently extracted.
  • high-pass filtering of very low frequencies, a band in which relatively unimportant information is concentrated, allows the analysis to concentrate on the critical band required for parameter extraction.
  • pre-emphasis filtering can be used to boost the high-frequency band of the input signal, balancing the energy of the low- and high-frequency regions, so that the resolution of the linear prediction analysis can be increased.
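  • a minimal pre-emphasis/de-emphasis sketch follows; the first-order filter form y[n] = x[n] - a*x[n-1] is standard, while the coefficient value 0.68 is an assumed illustrative choice, not a value taken from this disclosure.

```python
import numpy as np
from scipy.signal import lfilter

def pre_emphasis(x: np.ndarray, a: float = 0.68) -> np.ndarray:
    """Boost the high-frequency band before LPC analysis: H(z) = 1 - a*z^-1."""
    return lfilter([1.0, -a], [1.0], x)

def de_emphasis(y: np.ndarray, a: float = 0.68) -> np.ndarray:
    """Inverse filter applied at the decoder side: H(z) = 1 / (1 - a*z^-1)."""
    return lfilter([1.0], [1.0, -a], y)
```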
  • the linear prediction analyzer 115 and 135 may calculate an LPC (Linear Prediction Coefficient).
  • the linear prediction analyzer 115 and 135 may model a formant representing the overall shape of the frequency spectrum of the speech signal.
  • the LPC values may be calculated so as to minimize the mean square error (MSE) of the prediction error, i.e., the difference between the original speech signal and the speech signal predicted using the linear prediction coefficients calculated by the linear prediction analyzer 135.
  • Various methods may be used to calculate the LPC, such as an autocorrelation method or a covariance method.
  • for the upper-band signal, the linear prediction analyzer 115 may extract a lower-order LPC than the linear prediction analyzer 135 does for the lower-band signal.
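  • as a sketch of the autocorrelation method mentioned above, the Levinson-Durbin recursion below computes LPC coefficients minimizing the MSE of the forward prediction error; the Hamming analysis window and the prediction order are illustrative assumptions.

```python
import numpy as np

def lpc_autocorrelation(frame: np.ndarray, order: int = 16) -> np.ndarray:
    """Return [1, a1, ..., a_order] for A(z) = 1 + a1*z^-1 + ...
    (autocorrelation method, solved with the Levinson-Durbin recursion)."""
    w = frame * np.hamming(len(frame))                 # analysis window (assumed)
    r = np.correlate(w, w, mode="full")[len(w) - 1:]   # autocorrelation r[0..]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                                 # guard against silence
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])     # prediction of r[i]
        k = -acc / err                                 # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]            # update a[1..i-1]
        a[i] = k
        err *= (1.0 - k * k)                           # residual energy update
    return a
```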
  • the linear prediction quantizers 120 and 140 transform the extracted LPCs to generate transform coefficients in a frequency-domain representation such as the line spectral pair (LSP) or line spectral frequency (LSF), and quantize the generated frequency-domain transform coefficients.
  • the linear prediction quantizers 120 and 140 may dequantize the quantized LPCs, transform them back into the time domain, and use them to generate a linear prediction residual signal.
  • the linear prediction residual signal is a signal in which the predicted formant component is excluded from the speech signal and may include pitch information and a random signal.
  • the linear prediction quantizer 120 uses the quantized LPC to generate a linear prediction residual signal by filtering the original upper-band signal.
  • the generated linear prediction residual signal is transmitted to the compensation gain prediction unit 195 to obtain a compensation gain with the higher band prediction excitation signal.
  • the linear prediction quantization unit 140 uses the quantized LPC to generate a linear prediction residual signal through filtering with the original lower band signal.
  • the generated linear prediction residual signal is input to the transformer 145 and the pitch detector 160.
  • the transform unit 145, the quantization unit 150, and the inverse transform unit 155 may operate as a TCX mode execution unit that performs TCX (Transform Coded Excitation) mode.
  • the pitch detector 160, the adaptive codebook search unit 165, and the fixed codebook search unit 170 may operate as a CELP mode execution unit that performs a CELP (Code Excited Linear Prediction) mode.
  • the transform unit 145 may convert the input linear prediction residual signal into the frequency domain based on a transform function such as a Discrete Fourier Transform (DFT) or a Fast Fourier Transform (FFT).
  • the transform unit 145 may transmit the transform coefficient information to the quantization unit 150.
  • the quantization unit 150 may perform quantization on the transform coefficients generated by the transformer 145.
  • the quantization unit 150 may perform quantization in various ways.
  • the quantization unit 150 may selectively perform quantization according to the frequency band, and may also calculate an optimal frequency combination using analysis by synthesis (ABS).
  • the inverse transform unit 155 may generate a reconstructed excitation signal of the linear prediction residual signal in the time domain by performing inverse transformation based on the quantized information.
  • the inverse-transformed linear prediction residual signal, that is, the reconstructed excitation signal, is used to restore the speech signal, and the restored speech signal is transmitted to the mode selector 185.
  • the speech signal reconstructed in the TCX mode may be compared with the speech signal quantized and reconstructed in the CELP mode to be described later.
  • the pitch detector 160 may calculate a pitch for the linear prediction residual signal by using an open-loop method such as an autocorrelation method. For example, the pitch detector 160 may calculate a pitch period and a peak value by comparing the synthesized speech signal with the actual speech signal. In this case, an Abs (Analysis by Synthesis) method may be used.
  • the adaptive codebook search unit 165 extracts an adaptive codebook index and a gain based on the pitch information calculated by the pitch detector.
  • the adaptive codebook search unit 165 may calculate a pitch structure from the linear prediction residual signal based on the adaptive codebook index and the gain information using AbS or the like.
  • the adaptive codebook search unit 165 transmits to the fixed codebook search unit 170 a linear prediction residual signal from which the contribution of the adaptive codebook, for example, information on the pitch structure, is excluded.
  • the fixed codebook search unit 170 may extract and encode a fixed codebook index and a gain based on the linear prediction residual signal received from the adaptive codebook search unit 165.
  • the linear prediction residual signal used by the fixed codebook search unit 170 to extract the fixed codebook index and the gain may be a linear prediction residual signal from which information on the pitch structure is excluded.
  • the quantization unit 175 may quantize parameters including the pitch information output from the pitch detector 160, the adaptive codebook index and gain output from the adaptive codebook search unit 165, and the fixed codebook index and gain output from the fixed codebook search unit 170.
  • the inverse transformer 180 may generate an excitation signal, which is a reconstructed linear prediction residual signal, by using the information quantized by the quantization unit 175. Based on the excitation signal, the speech signal may be reconstructed through the inverse process of linear prediction.
  • the inverse transformer 180 transmits the speech signal restored to the CELP mode to the mode selector 185.
  • the mode selector 185 may select a signal more similar to the original linear prediction residual signal by comparing the TCX excitation signal reconstructed through the TCX mode and the CELP excitation signal reconstructed through the CELP mode.
  • the mode selector 185 may also encode information on which mode the selected excitation signal is restored.
  • the mode selector 185 may transmit selection information regarding the selection of the reconstructed speech signal and the excitation signal to the band predictor 190.
  • the band predictor 190 may generate the predictive excitation signal of the upper band by using the selection information transmitted from the mode selector 185 and the restored excitation signal.
  • the compensation gain predictor 195 may compensate for the spectral gain by comparing the higher band predicted excitation signal transmitted from the band predictor 190 and the higher band predicted residual signal transmitted from the linear prediction quantization unit 120.
  • each component may operate as a separate module, or a plurality of components may operate by forming one module.
  • the quantization units 120, 140, 150, and 175 may perform their operations together as one module, or each may be provided as a separate module at the necessary position in the process.
  • FIG. 2 schematically illustrates another example of a configuration of an encoder.
  • in the following, an example will be described in which the excitation signal is transformed to the frequency axis through a Modified Discrete Cosine Transform (MDCT) and quantized using adaptive vector quantization (AVQ), band selective-shape gain coding (BS-SGC), factorial pulse coding (FPC), or the like.
  • the bandwidth checking unit 205 may determine whether an input signal (voice signal) is a narrow band (NB) signal, a wide band (WB) signal, or a super wide band (SWB) signal.
  • the NB signal may have a sampling rate of 8 kHz
  • the WB signal may have a sampling rate of 16 kHz
  • the SWB signal may have a sampling rate of 32 kHz.
  • the bandwidth checking unit 205 may convert the input signal into the frequency domain to determine the components and extent of the upper-band bins of the spectrum.
  • the encoder 200 may not include the bandwidth checking unit 205 when the input signal is fixed, for example, when the input signal is fixed to NB.
  • the bandwidth checking unit 205 determines the input signal and outputs the NB or WB signal to the sampling converter 210, and the SWB signal to the sampling converter 210 or the MDCT converter 215.
  • the sampling converter 210 performs the sampling conversion that turns the input signal into the signal input to the core encoder 220: when the input signal is an NB signal, it upsamples the signal to a sampling rate of 12.8 kHz, and when the input signal is a WB signal, it downsamples the signal to 12.8 kHz, producing the 12.8 kHz lower-band signal that is input to the core encoder 220.
  • the preprocessor 225 may filter low frequency components among the lower band signals input to the core encoder 220 and transmit only a signal of a desired band to the linear prediction analyzer.
  • the linear prediction analyzer 230 may extract a linear prediction coefficient (LPC) from the signal processed by the preprocessor 225.
  • the linear prediction analyzer 230 may extract 16th-order linear prediction coefficients from the input signal and transfer them to the quantization unit 235.
  • the quantization unit 235 quantizes the linear prediction coefficients transmitted from the linear prediction analyzer 230.
  • the linear prediction residual signal is generated by filtering the original lower band signal using the quantized linear prediction coefficients in the lower band.
  • the linear prediction residual signal generated by the quantization unit 235 is input to the CELP mode performing unit 240.
  • the CELP mode performing unit 240 detects the pitch of the input linear prediction residual signal using an autocorrelation function.
  • for example, an open-loop pitch search, a closed-loop pitch search, and analysis by synthesis (AbS) may be used.
  • the CELP mode performing unit 240 may extract the adaptive codebook index and the gain information based on the detected pitch information.
  • the CELP mode performing unit 240 may extract the index and gain of the fixed codebook based on the component remaining after the contribution of the adaptive codebook is removed from the linear prediction residual signal.
  • the CELP mode performing unit 240 quantizes the parameters related to the linear prediction residual signal (pitch, adaptive codebook index and gain, fixed codebook index and gain) extracted through the pitch search, the adaptive codebook search, and the fixed codebook search, and passes them on.
  • the quantizer 245 quantizes the parameters transmitted from the CELP mode performer 240.
  • Parameters related to the quantized linear prediction residual signal in the quantization unit 245 may be output as a bit stream and transmitted to the decoder.
  • the parameters related to the quantized linear prediction residual signal may be transferred to the inverse quantizer 250.
  • the inverse quantization unit 250 generates an excitation signal reconstructed using the extracted and quantized parameters through the CELP mode.
  • the generated excitation signal is transmitted to the synthesis and post processor 255.
  • the synthesis and post-processing unit 255 synthesizes the reconstructed excitation signal and the quantized linear prediction coefficient, generates a synthesized signal of 12.8 kHz, and restores the 16 kHz WB signal through upsampling.
  • the difference signal between the signal (12.8 kHz) output from the synthesis post-processing unit 255 and the lower band signal sampled at the sampling rate of 12.8 kHz by the sampling converter 210 is input to the MDCT converter 260.
  • the MDCT converter 260 converts a difference signal between the signal output from the sampling converter 210 and the signal output from the synthesis post-processor 255 using a modified discrete cosine transform (MDCT) method.
  • the quantization unit 265 may quantize the MDCT-converted signal using AVQ, BS-SGC, or FPC, and output the bitstream corresponding to narrowband or wideband.
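  • for illustration, a minimal direct-form MDCT is sketched below; production codecs use fast FFT-based implementations, and the sine window shown is only one common analysis window, not necessarily the one used here.

```python
import numpy as np

def sine_window(two_n: int) -> np.ndarray:
    """Sine analysis window commonly paired with the MDCT."""
    return np.sin(np.pi / two_n * (np.arange(two_n) + 0.5))

def mdct(frame: np.ndarray) -> np.ndarray:
    """Direct MDCT of a windowed 2N-sample frame -> N coefficients (50% overlap):
    X[k] = sum_n x[n] * cos(pi/N * (n + 1/2 + N/2) * (k + 1/2))."""
    N = len(frame) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ frame
```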
  • the inverse quantization unit 270 inversely quantizes the quantized signal and transfers the lower band enhancement layer MDCT coefficients to the important MDCT coefficient extraction unit 280.
  • the important MDCT coefficient extractor 280 extracts transform coefficients to be quantized using MDCT transform coefficients input from the MDCT transform unit 275 and the inverse quantization unit 270.
  • the quantization unit 285 quantizes the extracted MDCT coefficients and outputs them as a bitstream of the ultra-wideband signal.
  • FIG. 3 is a diagram schematically illustrating an example of a decoder corresponding to the speech encoder of FIG. 1.
  • the speech decoder 300 may include inverse quantizers 305 and 310, a band predictor 320, a gain compensator 325, an inverse transform unit 315, linear prediction synthesizers 330 and 335, a sampling converter 340, a band synthesizer 350, and post-processing filters 345 and 355.
  • the inverse quantizers 305 and 310 receive the quantized parameter information from the speech encoder and dequantize it.
  • the inverse transform unit 315 may restore the excitation signal by inversely transforming the TCX coded or CELP coded speech information.
  • the inverse transform unit 315 may generate the reconstructed excitation signal based on the parameter received from the encoder. In this case, the inverse transform unit 315 may perform inverse transform only on some bands selected by the speech encoder.
  • the inverse transformer 315 may transmit the reconstructed excitation signal to the linear prediction synthesizer 335 and the band predictor 320.
  • the linear prediction synthesizer 335 may reconstruct the lower band signal using the excitation signal transmitted from the inverse transformer 315 and the linear prediction coefficient transmitted from the speech encoder.
  • the linear prediction synthesizer 335 may transmit the reconstructed lower band signal to the sampling converter 340 and the band combiner 350.
  • the band predictor 320 may generate the predicted excitation signal of the upper band based on the restored excitation signal value received from the inverse transformer 315.
  • the gain compensator 325 may compensate the gain on the spectrum of the ultra wideband speech signal based on the higher band predicted excitation signal received from the band predictor 320 and the compensation gain value transmitted from the encoder.
  • the linear prediction synthesizer 330 receives the compensated upper-band prediction excitation signal from the gain compensator 325 and can restore the upper-band signal based on that excitation signal and the linear prediction coefficient values received from the speech encoder.
  • the band combiner 350 receives the reconstructed lower-band signal from the linear prediction synthesizer 335 and the reconstructed upper-band signal from the upper-band linear prediction synthesizer, and may perform band synthesis on the received upper-band and lower-band signals.
  • the sampling converter 340 may convert the internal sampling frequency value back to the original sampling frequency value.
  • the post processing units 345 and 355 may perform post processing necessary for signal recovery.
  • the post-processors 345 and 355 may include a de-emphasis filter capable of reverse filtering the pre-emphasis filter in the pre-processor.
  • the post-processing units 345 and 355 may perform various post-processing operations, such as minimizing quantization errors, emphasizing harmonic peaks of the spectrum, and suppressing valleys.
  • the post processor 345 may output the restored narrowband or wideband signal, and the postprocessor 355 may output the restored ultra wideband signal.
  • FIG. 4 is a diagram schematically illustrating an example of a decoder configuration corresponding to the speech encoder of FIG. 2.
  • a bitstream including an NB signal or a WB signal transmitted from an encoder is input to an inverse transformer 420 and a linear prediction synthesizer 430.
  • the inverse transform unit 420 may inversely transform the CELP-coded speech information and restore the excitation signal based on a parameter received from the encoder.
  • the inverse transform unit 420 may transmit the reconstructed excitation signal to the linear prediction synthesis unit 430.
  • the linear prediction synthesizer 430 may reconstruct a lower band signal (NB signal, WB signal, etc.) using the excitation signal transmitted from the inverse transformer 420 and the linear prediction coefficient transmitted from the encoder.
  • the lower band signal (12.8 kHz) reconstructed by the linear prediction synthesis unit 430 may be downsampled to NB or upsampled to WB.
  • the WB signal is output to the post-processing / sampling converter 450 or to the MDCT converter 440.
  • the restored lower band signal (12.8 kHz) is output to the MDCT converter 440.
  • the post-processing/sampling converter 450 may apply filtering to the reconstructed signal. Filtering allows post-processing such as reducing quantization errors, emphasizing peaks, and suppressing valleys.
  • the MDCT converter 440 performs MDCT conversion on the restored lower band signal (12.8 kHz) and the upsampled WB signal (16 kHz) and transmits the same to the upper MDCT coefficient generator 470.
  • the inverse transform unit 495 receives the NB / WB enhancement layer bitstream and restores the MDCT coefficients of the enhancement layer.
  • the MDCT coefficients restored by the inverse transformer 495 are added to the output signal of the MDCT transformer 440 and input to the upper MDCT coefficient generator 470.
  • the dequantizer 460 receives the SWB signal and the parameter quantized through the bitstream from the encoder and dequantizes the received information.
  • the dequantized SWB signal and the parameter are transmitted to the upper MDCT coefficient generator 470.
  • the upper MDCT coefficient generation unit 470 receives the MDCT coefficients for the 12.8 kHz signal or the synthesized WB signal from the core decoder 410, receives the necessary parameters for the SWB signal from the bitstream, dequantizes them, and generates MDCT coefficients for the SWB signal.
  • the higher MDCT coefficient generator 470 may apply the generic mode or the sine wave mode according to whether the signal is tonal, and may apply an additional sine wave to the signal of the enhancement layer.
  • the MDCT inverse transform unit 480 restores a signal through an inverse transform on the generated MDCT coefficients.
  • the post-processing filter 490 may apply filtering to the restored signal. Filtering allows post-processing such as reducing quantization errors, emphasizing peaks, and suppressing valleys.
  • the SWB signal may be restored by synthesizing the signal restored by the post-processing filter 490 and the signal restored by the post-processing converter 450.
  • the transform coding/decoding technique has high compression efficiency for stationary signals, so that high-quality speech and audio signals can be provided when there is bit-rate headroom.
  • pre-echo noise may occur unlike encoding performed in the time domain.
  • Pre-echo refers to a case in which noise is generated by a transformation for encoding in an area where no sound is included in an original signal. Pre-echo occurs in transform encoding because encoding is performed in units of frames having a constant size in order to transform into a frequency domain.
  • FIG. 5 is a diagram schematically illustrating a pre echo.
  • FIG. 5(a) shows an original signal, and FIG. 5(b) shows the signal obtained by encoding it with a transform encoding method and then decoding it; that is, FIG. 5(b) is a signal to which transform coding has been applied.
  • FIG. 6 is another diagram schematically illustrating the pre-echo.
  • FIG. 6(a) shows an original signal, and FIG. 6(b) shows a signal reconstructed through transform encoding and decoding.
  • the original signal of FIG. 6A does not have a signal corresponding to voice at the beginning of the frame, and the signal is concentrated in the second half of the frame.
  • the quantization noise may be concealed in the original signal when the original signal exists along the time axis in the time domain so that the noise may not be heard. However, when there is no original signal as shown in the beginning of the frame of FIG. 6 (a), noise, that is, the pre echo distortion 600 is not concealed.
  • in the frequency domain, the quantization noise may be concealed by the corresponding signal component; in the time domain, however, the quantization noise is present throughout the frame, so the noise is exposed in the silent section on the time axis.
  • Quantization noise due to the transformation that is, pre-echo (quantization) noise, may cause deterioration of sound quality, and thus it is necessary to perform a process for minimizing it.
  • pre-echo appears on the time axis when the quantization noise on the frequency axis is inverse-transformed and then overlap-added; quantization noise spreads uniformly throughout the synthesis window during the inverse transformation.
  • Quantization noise depends on the average energy of the frame, resulting in quantization noise on the time axis throughout the synthesis window.
  • in low-energy regions, the signal-to-noise ratio is so small that the quantization noise becomes audible to the human ear.
  • attenuating the signal at the point in the synthesis window where the energy increases rapidly can reduce the influence of quantization noise, that is, pre-echo.
  • an area where the energy is small, that is, an area where pre-echo may appear, within a frame in which the energy changes rapidly, is referred to as an echo zone.
  • to reduce pre-echo, block switching or temporal noise shaping (TNS) can be applied.
  • in block switching, the frame length is variably adjusted to prevent pre-echo.
  • in TNS, pre-echo is prevented based on the time/frequency duality of linear prediction coding (LPC) analysis.
  • FIG. 7 is a diagram schematically illustrating a block switching method.
  • in block switching, the length of a frame is variably adjusted.
  • the windows consist of a long window and a short window: a long window is applied to increase the length of the frame to be transformed and encoded, while a short window is applied to reduce the length of the frame to be transformed.
  • in conventional LPC analysis on the time axis, the LPC coefficients correspond to envelope information on the frequency axis and the excitation signal is sampled on the time axis. When LPC analysis is applied on the frequency axis, by duality, the LPC coefficients represent envelope information on the time axis and the excitation signal corresponds to components sampled on the frequency axis.
  • noise generated in the excitation signal due to quantization error is finally restored in proportion to the envelope information on the time axis. For example, in a silent section where the envelope information is close to zero, noise finally occurs close to zero.
  • although relatively large noise is generated in sections where voice and audio signals are present, that noise is at a level that can be concealed by the signal.
  • as a result, the noise is removed in silent sections and concealed in sections where the signal is present (voice and audio sections), thereby providing psychoacoustically improved sound quality.
  • the total delay, including channel delay and codec delay, must not exceed a predetermined limit in two-way communication, for example 200 ms. Because the frame length in the block switching method is variable, the total delay can approach or exceed 200 ms, so the method is not suitable for two-way communication.
  • a method of reducing pre-echo may be considered by adjusting the magnitude of the signal decoded by the transform.
  • specifically, the magnitude of the transform-decoded signal is kept relatively small in the frame region where noise due to pre-echo occurs, and relatively large in regions where noise due to pre-echo does not occur.
  • pre-echo in transform encoding occurs in a section in which the energy of the signal increases rapidly. Therefore, by attenuating the signal ahead of the portion where the energy increases rapidly in the synthesis window, the noise due to pre-echo can be reduced.
  • the echo zone is determined to reduce the noise caused by the pre echo. To this end, two signals that overlap in the inverse transform are used.
  • specifically, m(n), the front half of the current window, may be used, where n = 0, ..., 639.
  • the generated d_conc_32_SWB(n) is divided into 32 subframes of 40 samples each, and the time-axis envelope E(i) is calculated using the energy of each subframe. From E(i), the subframe with the maximum energy can be found.
  • the normalization of Equation 2 is performed using the maximum energy value and the time-axis envelope.
  • a region is determined to be the echo zone when the normalized energy r_E(i) is greater than or equal to a predetermined reference value, for example when r_E(i) > 8, and the attenuation function g_pre(n) is applied to the echo zone. When applying the attenuation function to the time-domain signal, 0.2 is applied as g_pre(n) when r_E(i) > 16, 1 is applied as g_pre(n) when r_E(i) < 8, and 0.5 is applied as g_pre(n) otherwise, to produce the final synthesized signal.
  • a first-order Infinite Impulse Response (IIR) filter may be applied to smooth the decay function of the previous frame and the decay function of the current frame.
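  • a minimal sketch of this attenuation, assuming that the normalized energy r_E(i) of Equation 2 is the ratio of the maximum subframe energy to the subframe energy, and assuming a first-order IIR smoothing coefficient of 0.85 for illustration (neither value is reproduced in this text):

```python
import numpy as np

SUBFRAMES, SUB_LEN = 32, 40          # 32 subframes of 40 samples, as in the text

def attenuation_gains(d: np.ndarray, g_prev: float = 1.0) -> np.ndarray:
    """Return per-sample gains g_pre(n) for the concatenated target signal d."""
    e = np.sum(d[:SUBFRAMES * SUB_LEN].reshape(SUBFRAMES, SUB_LEN) ** 2, axis=1)
    r = e.max() / np.maximum(e, 1e-12)                 # r_E(i), assumed form of Eq. 2
    g_sub = np.where(r > 16, 0.2, np.where(r < 8, 1.0, 0.5))  # thresholds from the text
    g = np.repeat(g_sub, SUB_LEN)                      # sample-wise decay function
    out = np.empty_like(g)
    prev, alpha = g_prev, 0.85                         # alpha is an assumption
    for n in range(len(g)):                            # first-order IIR smoothing
        prev = alpha * prev + (1.0 - alpha) * g[n]
        out[n] = prev
    return out                                         # multiply the signal by out
```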
  • encoding may be performed by applying a multi-frame unit according to a signal characteristic instead of a fixed frame.
  • a frame of 20 ms, a frame of 40 ms, or a frame of 80 ms may be applied according to a signal characteristic.
  • the basic frame may be applied with a small size of 20 ms, but the frame may be applied with a large size of 40 ms or 80 ms for a stationary signal. Assuming an internal sampling rate of 12.8kHz, 20ms would be equivalent to 256 samples.
  • FIG. 8 is a diagram schematically illustrating an example of a window type in a case where a basic frame is set to 20 ms and a larger frame size, 40 ms or 80 ms, is applied according to signal characteristics.
  • FIG. 8(a) shows a window for the 20 ms basic frame, FIG. 8(b) shows a window for a 40 ms frame, and FIG. 8(c) shows a window for an 80 ms frame.
  • there are three window lengths, and each length can take four shapes depending on the overlap with the previous frame; therefore, a total of 12 windows can be applied according to the characteristics of the signal.
  • the size of the signal is adjusted based on the signal reconstructed from the bitstream. That is, the echo zone is determined and the signal is attenuated using the signal reconstructed by the decoder using the bits allocated by the encoder.
  • the bit allocation in the encoder is performed by allocating a fixed number of bits per frame.
  • This method may be referred to as an approach to control pre-echo in a concept similar to a post-processing filter.
  • the bits allocated to the 20 ms frame depend on the overall bit rate and are transmitted at a fixed value.
  • the procedure for controlling the pre-echo is performed on the decoder side rather than the encoder based on the information transmitted from the encoder.
  • in the block switching approach, the encoder selects the window size according to the characteristics of the signal, so pre-echo can be reduced efficiently, but the approach is difficult to use as a communication codec. For example, assuming bidirectional communication in which 20 ms can be sent in one packet, setting a large frame size such as 80 ms incurs a delay corresponding to four times the basic packet.
  • bit allocation may be performed in consideration of an area where a pre-echo may occur.
  • in the region where pre-echo may occur, more bits are allocated by increasing the bit rate.
  • the bit rate for the subframe in which the pre-echo exists can be adjusted higher.
  • in the following, the M subframes that serve as bit allocation units are referred to as bit allocation intervals, to distinguish subframes as signal processing units from subframes as bit allocation units.
  • FIG. 9 is a diagram schematically illustrating a relationship between a position of a pre echo and bit allocation.
  • in FIG. 9(a), the voice signal is distributed uniformly over the frame, and bits corresponding to 1/2 of the total bit amount are allocated to each of the first bit allocation interval 910 and the second bit allocation interval 920.
  • the pre echo is positioned in the second bit allocation interval 940.
  • the conventional method uses bits corresponding to 1/2 of the total bit rate.
  • the pre echo is positioned in the first bit allocation interval 950.
  • since the second bit allocation interval 960 corresponds to a stationary signal, it could be encoded with relatively few bits, yet bits corresponding to 1/2 of the total bit rate are still used.
  • when bit allocation is performed irrespective of the characteristics of the voice signal, for example the position of the echo zone or of the section in which the energy increases suddenly, bit efficiency is reduced.
  • the bit amount allocated to each bit allocation period is different depending on whether an echo zone exists.
  • the energy information of the speech signal and the position information of the transient component where noise due to pre-echo may occur are used.
  • a transition component of the speech signal is a component of a region in which the energy changes rapidly, for example, a speech signal component at a position where the signal transitions from unvoiced to voiced sound, or from voiced to unvoiced sound.
  • FIG. 10 is a diagram schematically illustrating a method of performing bit allocation according to the present invention.
  • bit allocation may be variably performed based on energy information of a voice signal and position information of a transition component.
  • in FIG. 10(a), the energy of the voice signal in the first bit allocation interval 1010 is less than the energy of the voice signal in the second bit allocation interval 1020.
  • in the low-energy bit allocation interval (e.g., a silent interval or an interval containing unvoiced sound) no transition component exists, while a transition component may exist in the higher-energy interval.
  • therefore, bit allocation for the bit allocation interval in which no transition component exists is reduced, and the saved bits are additionally allocated to the bit allocation interval in which the transition component exists: the bit allocation for the first bit allocation interval 1010, an unvoiced interval, is reduced, and the saved bits are additionally allocated to the second bit allocation interval 1020, where the transition component of the speech signal is located.
  • a transition component exists in the first bit allocation interval 1030 and a stationary signal exists in the second bit allocation interval 1040.
  • the energy for the second bit allocation interval 1040 where the normal signal is present is greater than the energy for the first bit allocation interval 1030.
  • even so, a transition component may exist, and more bits may be allocated to the bit allocation interval in which the transition component exists. For example, in FIG. 10(b), the bit allocation for the second bit allocation interval 1040, the stationary signal interval, is reduced, and the saved bits are additionally allocated to the first bit allocation interval 1030, where the transition component of the voice signal is located.
  • FIG. 11 is a flowchart schematically illustrating a method in which an encoder variably allocates a bit amount according to the present invention.
  • the encoder determines whether a transient is detected in the current frame (S1110).
  • the encoder may determine whether energy is uniform for each interval, and if it is not uniform, may determine that a transition exists.
  • the encoder may, for example, set a threshold offset and determine that a transition exists in the current frame if the energy difference between intervals falls outside the threshold offset.
  • the encoder may select an encoding method according to whether or not a transition exists. If there is a transition, the encoder may divide the frame into bit allocation intervals (S1120).
  • the encoder may use the entire frame without dividing into bit allocation intervals (S1130).
  • when using the entire frame, the encoder performs bit allocation for the entire frame (S1140) and can encode the speech signal for the entire frame using the allocated bits.
  • here, bit allocation is performed after the step of determining that the entire frame is used when there is no transition, but the present invention is not limited thereto; for example, when there is no transition, bit allocation may be performed for the entire frame without separately going through the step of deciding to use the entire frame.
  • when the frame is divided into bit allocation intervals, the encoder may determine in which bit allocation interval the transition exists (S1150), and may perform bit allocation differently for the bit allocation interval in which the transition exists and the intervals in which it does not.
  • for example, when the current frame is divided into two bit allocation intervals and the transition is located in the first bit allocation interval, more bits may be allocated to the first bit allocation interval than to the second bit allocation interval (S1160).
  • when BA_1st denotes the number of bits allocated to the first bit allocation interval, BA_2nd denotes the number of bits allocated to the second bit allocation interval, and BitBudget denotes the total number of bits allocated to the current frame, the relation of Equation 3 is established: BitBudget = BA_1st + BA_2nd.
  • the number of bits allocated to each bit allocation interval may be determined as shown in Equation 4, considering in which of the two bit allocation intervals the transition exists and the magnitude of the speech signal energy in each interval.
  • in Equation 4, Energy_n-th denotes the energy of the speech signal in the n-th bit allocation interval, and Transient_n-th is a weighting constant for the n-th bit allocation interval whose value depends on whether the transition is located in that interval. Equation 5 shows an example of a method of determining the Transient_n-th value.
  • Equation 5 shows an example in which the weighting constant Transient is set to 1 or 0.5 according to the position of the transition, but the present invention is not limited thereto, and the weighting constant may be set to other values through experiment.
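  • since Equations 4 and 5 are not reproduced in this text, the sketch below assumes an energy-proportional weighting consistent with the description: bits are split in proportion to Energy x Transient, with Transient = 1 for the interval containing the transition and 0.5 for the others, while the total respects the bit budget of Equation 3.

```python
import numpy as np

def allocate_bits(bit_budget: int, energies: list, transient_idx: int) -> list:
    """Split bit_budget across bit allocation intervals in proportion to
    Energy * Transient (assumed form of Eq. 4; weights per the Eq. 5 example)."""
    w = np.array([e * (1.0 if i == transient_idx else 0.5)
                  for i, e in enumerate(energies)])
    bits = np.floor(bit_budget * w / w.sum()).astype(int)
    bits[transient_idx] += bit_budget - bits.sum()     # keep BitBudget exact (Eq. 3)
    return bits.tolist()

# e.g., two intervals with the transition in the first:
# allocate_bits(560, [4.0, 1.0], 0) -> [498, 62]
```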
  • a method of variably allocating and encoding bits according to the position of the transition may be applied to bidirectional communication.
  • suppose the size of one frame used for two-way communication is A ms and the transmission bit rate of the encoder is B kbps. For a transform encoder, the size of the applied analysis/synthesis window is 2A ms, and the amount of bits the encoder transmits per frame is B x A bits. For example, if the size of one frame is 20 ms, the size of the synthesis window is 40 ms and the amount of bits transmitted per frame is B/50 kbit.
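  • as a quick check of this arithmetic (names are illustrative):

```python
def bits_per_frame(bitrate_kbps: float, frame_ms: float) -> float:
    """B kbps over an A ms frame yields B*A bits (kbit/s * ms = bits)."""
    return bitrate_kbps * frame_ms

assert bits_per_frame(24, 20) == 480   # 24 kbps, 20 ms -> 480 bits = 24/50 kbit
```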
  • a form of extension structure may be applied in which a narrowband (NB)/wideband (WB) core is applied to the low band and the coded information is used by an upper codec that handles the ultra-wideband.
  • FIG. 12 schematically illustrates an example configuration of a speech encoder having an extended structure, to which the present invention is applied.
  • an encoder having an extended structure includes a narrowband encoder 1215, a wideband encoder 1235, and an ultra-wideband encoder 1260.
  • the sampling converter 1205 receives a narrowband signal, a wideband signal, or an ultra wideband signal.
  • the sampling converter 1205 converts the input signal to an internal sampling rate of 12.8 kHz and outputs the converted signal.
  • the output of the sampling converter 1205 is transferred to the encoder corresponding to the band of the output signal by the switching unit.
  • when a narrowband or wideband signal is input, the sampling converter 1210 upsamples it, generates a 25.6 kHz signal, and outputs the upsampled signal together with the generated 25.6 kHz signal; when an ultra-wideband signal is input, it is downsampled to 25.6 kHz and output together with the ultra-wideband signal.
  • the low-band encoder 1215 encodes a narrowband signal and includes a linear predictor 1220 and a CELP unit 1225; after linear prediction is performed by the linear predictor 1220, the residual signal is encoded by the CELP unit 1225 on the basis of CELP.
  • the linear predictor 1220 and the CELP unit 1225 of the low-band encoder 1215 correspond to the configurations in FIGS. 1 and 2 that encode the low band on the basis of linear prediction and on the basis of CELP.
  • the compatible core portion 1230 corresponds to the core configuration of FIG. 1.
  • the signal reconstructed by the compatible core unit 1230 may be used for encoding in an encoder that processes an ultra-wideband signal.
  • the compatible core unit 1230 may enable the low band signal to be processed by, for example, compatible encoding such as AMR-WB, and may allow the high band signal to be processed in the ultra wide band signal unit 1260.
  • the wideband encoder 1235 encodes a wideband signal, and includes a linear predictor 1240, a CELP unit 1250, and an enhancement layer unit 1255.
  • the linear predictor 1240 and the CELP unit 1250 correspond to the corresponding configurations in FIGS. 1 and 2, similarly to those of the low-band encoder 1215.
  • the enhancement layer unit 1255 may encode higher quality when the bit rate is increased by processing the additional layer.
  • the output of the wideband encoder 1235 may be inversely restored and used for encoding in the ultra wideband encoder 1260.
  • the ultra-wideband encoder 1260 encodes the ultra-wideband signal, and converts input signals to process transform coefficients.
  • as shown, the ultra-wideband signal is encoded by the generic mode unit 1275 and the sine mode unit 1280, and the core switching unit 1265 switches which of the generic mode unit 1275 and the sine mode unit 1280 processes the signal.
  • the pre echo reduction unit 1270 reduces the pre echo using the method described above in the present invention.
  • the pre-echo reduction block 1270 may determine an echo zone using the input time-domain signal and the transform coefficient, and perform variable bit allocation based on this.
  • the enhancement layer unit 1285 processes a signal of an extension layer (eg, layer 7 or layer 8) added in addition to the base layer.
  • the pre-echo reduction unit 1270 operates after switching the core between the generic mode unit 1275 and the sine mode unit 1280 in the ultra wideband encoder 1260, the present invention is not limited thereto.
  • core switching between the generic mode unit 1275 and the sine mode unit 1280 may be performed after the pre-echo reduction operation in the pre-echo reduction unit 1270 is performed.
  • the pre-echo reduction unit 1270 of FIG. 12 may determine where in the voice signal frame the bit allocation interval containing the transition is located, and may allocate a different bit amount per bit allocation interval, as described with reference to FIG. 11.
  • the pre-echo reduction unit may apply a method of performing pre-echo reduction by determining the position of the echo zone in units of subframes based on the amount of energy for each subframe in the frame.
  • FIG. 13 is a diagram schematically illustrating a configuration when the pre-echo reduction unit introduced in FIG. 12 determines an echo zone based on energy for each subframe to perform pre-echo reduction.
  • the pre echo reducer 1270 includes an echo zone determiner 1310 and a bit allocation adjuster 1360.
  • the echo zone determiner 1310 includes a target signal generator and frame divider 1320, an energy calculator 1330, an envelope peak calculator 1340, and an echo zone determiner 1350.
  • the current frame and the past frame are concatenated and transformed after analysis windowing.
  • assume that the frame size is 20 ms, that is, that the signal is input and processed in units of 20 ms.
  • the second 10 ms half of the past frame together with the first 10 ms half of the current frame, and the first 10 ms half of the current frame together with the second 10 ms half of the current frame, are each windowed with an analysis window (e.g., a symmetric window such as a sine window or a Hamming window).
  • alternatively, the current frame and the future frame may be concatenated and processed after analysis windowing.
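  • as a concrete illustration of the windowing just described, the following is a minimal Python sketch. It assumes a 32 kHz ultra-wideband sampling rate (so a 20 ms frame is 640 samples) and a sine window; the actual sampling rate, window shape, and block layout are design choices of the codec and are not fixed by this description.

    import numpy as np

    FS = 32000              # assumed ultra-wideband sampling rate
    FRAME = FS // 50        # 20 ms frame -> 640 samples
    HALF = FRAME // 2       # 10 ms half  -> 320 samples

    def analysis_window(length):
        # symmetric sine window (a Hamming window would also fit the text)
        n = np.arange(length)
        return np.sin(np.pi * (n + 0.5) / length)

    def windowed_blocks(past_frame, current_frame):
        # block 1: second half of the past frame + first half of the current frame
        # block 2: first half + second half of the current frame
        win = analysis_window(2 * HALF)
        block1 = np.concatenate([past_frame[HALF:], current_frame[:HALF]]) * win
        block2 = np.concatenate([current_frame[:HALF], current_frame[HALF:]]) * win
        return block1, block2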
  • the target signal generation and frame dividing unit 1320 generates the target signal based on the input voice signal and divides the frame into subframes.
  • the signal input to the ultra-wideband encoder includes (1) the ultra-wideband signal among the original signals, (2) a signal decoded again after narrowband encoding or wideband encoding, and (3) a difference signal between the wideband signal among the original signals and the decoded signal.
  • the time-domain signals (1), (2), and (3) may be input in frame units (units of 20 ms), and transform coefficients are generated from them through transformation.
  • the generated transform coefficients are processed in a signal processing module including a pre echo reduction unit in the ultra wideband encoder.
  • the target signal generation and frame dividing unit 1320 generates a target signal for determining the existence of the echo zone based on signals (1) and (2), which have ultra-wideband components.
  • the target signal d^conc_32_SWB(n) may be determined as shown in Equation 6.
  • n indicates a sampling position.
  • the scaling for signal (2) is an upsampling that converts the sampling rate of signal (2) to the sampling rate of the ultra-wideband signal.
  • the target signal generation and frame dividing unit 1320 divides the voice signal frame into a predetermined number of subframes (e.g., N subframes, where N is an integer) to determine the echo zone.
  • the subframe may be a unit of sampling and / or voice signal processing.
  • a subframe is a processing unit for calculating the envelope of the speech signal; if the computational load is not taken into account, the frame can be divided into more subframes to obtain a more accurate envelope value. For example, if one sample is processed per subframe, N is 640 when the frame for the ultra-wideband signal is 20 ms.
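  • as a worked example of the figures above (assuming a 32 kHz ultra-wideband sampling rate, which is consistent with N = 640 for one sample per subframe): a 20 ms frame contains 640 samples, so with N = 16 subframes each subframe covers 640 / 16 = 40 samples, i.e., 1.25 ms.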
  • the subframe may be used as an energy calculation unit for determining the echo zone.
  • the target signal d^conc_32_SWB(n) of Equation 6 may be used to calculate the speech signal energy in units of subframes.
  • the energy calculator 1330 calculates the voice signal energy of each subframe using the target signal.
  • the case where the number N of subframes per frame is 16 will be described as an example for convenience of description.
  • the energy of each subframe may be obtained as shown in Equation 7 by using the target signal d^conc_32_SWB(n).
  • in Equation 7, i is an index indicating a subframe, and n is a sample number (sample position).
  • E(i) corresponds to the envelope of the time domain (time axis).
  • the envelope peak calculator 1340 determines the peak Max_E of the time domain (time axis) envelope using E(i), as shown in Equation 8.
  • the envelope peak calculator 1340 finds out which subframe has the largest energy among the N subframes in the frame.
  • the echo zone determiner 1350 normalizes the energy of the N subframes in the frame and compares the energy with a reference value to determine the echo zone.
  • the energy of the subframes may be normalized as shown in Equation 9, using the envelope peak value determined by the envelope peak calculator 1340, that is, the largest energy among the energies of the subframes.
  • Normal_E(i) represents the normalized energy of the i-th subframe.
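  • the following minimal Python sketch follows the description of Equations 7 to 9 above: per-subframe energy E(i), its peak Max_E, and the peak-normalized energy Normal_E(i). The exact equations are not reproduced in this text, so the sum-of-squares form of E(i) is an assumption based on the surrounding description.

    import numpy as np

    def normalized_subframe_energy(target, n_subframes=16):
        # split one frame of the target signal into subframes
        sub = np.array_split(np.asarray(target, dtype=float), n_subframes)
        energy = np.array([np.sum(s * s) for s in sub])  # E(i), assumed form of Eq. 7
        max_e = energy.max()                             # Max_E, Eq. 8
        return energy / max_e if max_e > 0 else energy   # Normal_E(i), Eq. 9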
  • the echo zone determiner 1350 determines the echo zone by comparing the normalized energy of each subframe with a predetermined reference value (threshold value).
  • the echo zone determiner 1350 compares the normalized energy of each subframe with the predetermined reference value, in order from the first subframe to the last subframe in the frame. When the normalized energy of the first subframe is smaller than the reference value, the echo zone determiner 1350 may determine that the echo zone exists in the first subframe found to have a normalized energy above the reference value. When the normalized energy of the first subframe is greater than the reference value, the echo zone determiner 1350 may determine that the echo zone exists in the first subframe found to have a normalized energy below the reference value.
  • the echo zone determiner 1350 may also compare the predetermined reference value with the normalized energy of the subframes in the reverse order, from the last subframe to the first subframe in the frame. When the normalized energy of the last subframe is smaller than the reference value, the echo zone determiner 1350 may determine that the echo zone exists in the first subframe found to have a normalized energy above the reference value. When the normalized energy of the last subframe is greater than the reference value, the echo zone determiner 1350 may determine that the echo zone exists in the first subframe found to have a normalized energy below the reference value.
  • the reference value, that is, the threshold, may be determined experimentally. For example, suppose that the threshold is 0.128 and the search starts from the first subframe. If the normalized energy of the first subframe is less than 0.128, the normalized energies are searched in order, and it may be determined that the echo zone exists in the first subframe whose normalized energy is greater than 0.128.
  • if the echo zone determiner 1350 does not find a subframe satisfying the above condition, that is, a subframe at which the normalized energy crosses the reference value (from below the reference value to above it, or from above the reference value to below it), it may determine that no echo zone exists in the current frame.
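  • a minimal sketch of the search just described, using the example threshold 0.128 from the text; reverse=True scans from the last subframe instead. The function reports the subframe at which the normalized energy first crosses the threshold, or None when no crossing (and hence no echo zone) is found.

    def find_echo_zone(normal_e, threshold=0.128, reverse=False):
        idx = list(range(len(normal_e)))
        if reverse:
            idx.reverse()
        start_below = normal_e[idx[0]] < threshold
        for i in idx[1:]:
            crossed = normal_e[i] >= threshold if start_below else normal_e[i] <= threshold
            if crossed:
                return i      # echo zone determined to lie in subframe i
        return None           # no crossing: no echo zone in the current frame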
  • the bit allocation adjuster 1360 may differentially allocate bit amounts to the area in which the echo zone exists and to the other areas.
  • when no echo zone exists, the bit allocation adjuster 1360 may bypass additional bit allocation adjustment and perform bit allocation uniformly in units of the current frame; otherwise, the bit allocation adjustment may be performed as described with reference to FIG. 11.
  • the normalized time-domain envelope information, that is, Normal_E(i), may be transmitted to the bit allocation adjuster 1360.
  • the bit allocation adjuster 1360 allocates a bit amount to each bit allocation section based on the normalized time-domain envelope information. For example, the bit allocation adjuster 1360 distributes the total amount of bits allocated to the current frame differentially between the bit allocation section in which the echo zone exists and the bit allocation section in which it does not.
  • for example, the bit allocation adjuster 1360 may allocate C/3 kbps in the first bit allocation interval and the larger amount of 2C/3 kbps in the second bit allocation interval. While the total amount of bits allocated to the current frame remains C kbps, more bits are thus allocated to the second bit allocation interval, in which the echo zone exists.
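  • as a minimal sketch of this differential split (the 1:2 ratio is the example from the text; the per-frame budget C and its granularity are assumptions):

    def split_budget(total_bits_c, echo_in_second_interval=True):
        # keep the frame total fixed at C while giving the echo-zone
        # interval the larger share (C/3 vs. 2C/3 in the example above)
        small = total_bits_c // 3
        large = total_bits_c - small
        return (small, large) if echo_in_second_interval else (large, small)

    # e.g. split_budget(960) -> (320, 640) for an assumed budget of 960 bits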
  • the encoder / decoder may configure a table in which the bit allocation mode is defined and transmit / receive bit allocation information using the table.
  • an index indicating on which bit allocation information table to use may be transmitted to the decoder.
  • the decoder can decode the encoded speech information according to the bit allocation mode indicated by the index received from the encoder on the bit allocation information table.
  • Table 1 shows an example of a bit allocation information table used to transmit bit allocation information.
  • Table 1:
      Bit allocation mode index value | First bit allocation interval | Second bit allocation interval
      0                               | C/2                           | C/2
      1                               | C/3                           | 2C/3
      2                               | C/4                           | 3C/4
      3                               | C/5                           | 4C/5
  • in Table 1, the case where the number of bit allocation intervals is two and the number of bits fixedly allocated to the frame is C is described as an example.
  • when the encoder transmits 0 as the bit allocation mode index, it indicates that the same bit amount is allocated to the two bit allocation intervals.
  • when the value of the bit allocation mode index is 0, it may mean that no echo zone exists.
  • when the value of the bit allocation mode index is 1 to 3, different bit amounts are allocated to the two bit allocation intervals, which may mean that an echo zone exists in the current frame.
  • the bit allocation information table may also be configured in consideration of both the case where the echo zone is in the first bit allocation interval and the case where the echo zone is in the second bit allocation interval, as in Table 2 below.
  • Table 2:
      Bit allocation mode index value | First bit allocation interval | Second bit allocation interval
      0                               | C/3                           | 2C/3
      1                               | 2C/3                          | C/3
      2                               | C/4                           | 3C/4
      3                               | 3C/4                          | C/4
  • in Table 2, indexes 0 and 2 indicate bit allocation modes for the case where the echo zone exists in the second bit allocation interval, and indexes 1 and 3 indicate bit allocation modes for the case where the echo zone exists in the first bit allocation interval.
  • the bit allocation mode index value may also not be transmitted. If the bit allocation mode index is not transmitted, the decoder may determine that the fixed number of C bits has been allocated with the entire interval of the current frame as a single bit allocation unit, and perform decoding accordingly.
  • the decoder may perform decoding on the current frame based on the bit allocation mode indicated by the corresponding index value in the bit allocation information table of Table 2.
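  • a minimal sketch of this signaling, using the mode definitions of Table 2 as fractions of the fixed per-frame budget C. Both encoder and decoder are assumed to hold the same table, and only the 2-bit index is transmitted; the rounding is an illustrative assumption.

    TABLE_2 = {
        0: (1 / 3, 2 / 3),   # echo zone in the second interval
        1: (2 / 3, 1 / 3),   # echo zone in the first interval
        2: (1 / 4, 3 / 4),
        3: (3 / 4, 1 / 4),
    }

    def decode_bit_allocation(index, total_bits_c):
        # decoder side: map the received index to per-interval bit amounts
        f1, f2 = TABLE_2[index]
        return round(total_bits_c * f1), round(total_bits_c * f2)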
  • Tables 1 and 2 describe the case of transmitting the bit allocation information index using 2 bits as an example. When the bit allocation information index is transmitted using 2 bits, information about four modes can be transmitted, as shown in Tables 1 and 2.
  • bit allocation may be performed using more than four bit allocation modes, and information about the bit allocation mode may be transmitted using more than two bits.
  • bit allocation may also be performed using fewer than four bit allocation modes, and information about the bit allocation mode may then be transmitted using fewer than two transmission bits (for example, one bit).
  • the encoder determines the location of the echo zone, selects a mode that allocates more bits to the bit allocation interval in which the echo zone exists, as described above, and can transmit an index indicating the selected mode.
  • FIG. 14 is a flowchart schematically illustrating a method for encoding a speech signal by variably performing bit allocation in accordance with the present invention.
  • the encoder determines an echo zone in the current frame (S1410).
  • the encoder divides the current frame into M bit allocation intervals and determines whether an echo zone exists in each bit allocation interval.
  • the encoder determines whether the voice signal energy of each bit allocation interval is uniform within a predetermined range, and if there is an energy difference out of the predetermined range between the bit allocation intervals, it may determine that an echo zone exists in the current frame. In this case, the encoder may determine that the echo zone exists in the bit allocation interval in which the transition component exists.
  • the encoder divides the current frame into N subframes, calculates the normalized energy of each subframe, and determines that an echo zone exists in the subframe at which the normalized energy crosses a threshold value.
  • the encoder may determine that no echo zone exists in the current frame when the voice signal energy is uniform within the predetermined range or when there is no subframe at which the normalized energy crosses the threshold.
  • the encoder may perform allocation of encoding bits for the current frame in consideration of the existence of an echo zone (S1420).
  • the encoder allocates the total number of bits allocated to the current frame to each bit allocation interval.
  • the encoder can prevent or attenuate noise due to pre-echo by allocating more bits in the bit allocation interval in which the echo zone exists.
  • the total number of bits allocated to the current frame may be the number of bits allocated fixedly.
  • the encoder may also use the total number of bits in frame units, without dividing the current frame into bit allocation intervals and allocating bit amounts differentially.
  • the encoder performs encoding using the allocated bits (S1430). If there is an echo zone, the encoder may perform transform encoding while preventing or attenuating noise due to pre-echo using differentially assigned bits.
  • the encoder may transmit the information about the bit allocation mode used for encoding together with the encoded speech information to the decoder.
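  • a minimal end-to-end sketch of the encoding flow of FIG. 14 (S1410 to S1430), wiring together the hypothetical helpers sketched above; transform_encode stands in for the actual transform coding, and the mapping from echo-zone subframe to bit allocation interval is an illustrative assumption.

    def encode_frame(target, frame, total_bits_c, transform_encode):
        normal_e = normalized_subframe_energy(target)   # S1410: locate the echo zone
        echo_sf = find_echo_zone(normal_e)
        if echo_sf is None:
            bits = (total_bits_c,)                      # uniform, frame-unit budget
            mode_index = None                           # index may be omitted
        else:
            # S1420: pick the Table 2 mode that favors the echo-zone interval
            mode_index = 0 if echo_sf >= len(normal_e) // 2 else 1
            bits = decode_bit_allocation(mode_index, total_bits_c)
        payload = transform_encode(frame, bits)         # S1430: encode with those bits
        return payload, mode_index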
  • FIG. 15 is a diagram schematically illustrating a method of decoding an encoded speech signal when bit allocation has been performed variably in encoding the speech signal according to the present invention.
  • the decoder receives bit allocation information together with the encoded speech information from the encoder (S1510).
  • the encoded speech information and the information about the bits allocated when the speech information is encoded may be transmitted through a bit stream.
  • the bit allocation information may indicate whether there is a differential bit allocation for each section in the current frame.
  • the bit allocation information may indicate in what ratio the bit amount is allocated when there is differential bit allocation.
  • the bit allocation information may be index information, and the received index may indicate a bit allocation mode (bit allocation ratio or bit amount allocated for each bit allocation interval) applied to the current frame on the bit allocation information table.
  • the decoder may perform decoding on the current frame based on the bit allocation information (S1520). If there is differential bit allocation in the current frame, the decoder may decode the voice information by reflecting the bit allocation mode.
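  • a minimal sketch of the decoding flow of FIG. 15 under the same assumptions, with transform_decode as a placeholder for the actual transform decoding:

    def decode_frame(payload, mode_index, total_bits_c, transform_decode):
        if mode_index is None:                  # no index: one frame-wide budget of C bits
            bits = (total_bits_c,)
        else:                                   # S1510: index -> bit allocation mode
            bits = decode_bit_allocation(mode_index, total_bits_c)
        return transform_decode(payload, bits)  # S1520: decode with that allocation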
  • in the above description, specific variable values or set values have been used as examples to aid understanding of the present invention, but the present invention is not limited thereto.
  • although the number N of subframes has been described as 24 or 32, the present invention is not limited thereto.
  • the number M of bit allocation intervals has also been described as an example for convenience of description, but the present invention is not limited thereto.
  • the threshold value compared with the normalized energy level for determining the echo zone may be set to an arbitrary value by the user or may be determined experimentally.
  • the case of performing the transform once in each of two bit allocation intervals within a fixed 20 ms frame has been described as an example.
  • the methods are described as a series of steps or blocks based on flowcharts, but the present invention is not limited to the order of the steps; some steps may occur in a different order from, or simultaneously with, other steps as described above.
  • the above-described embodiments include examples of various aspects.
  • the above-described embodiments may be implemented in combination with each other, which also belongs to the embodiments according to the present invention.
  • the invention includes various modifications and changes in accordance with the spirit of the invention within the scope of the claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention relates to a method for encoding a voice signal, a method for decoding a voice signal, and an apparatus using the same. The method for encoding a voice signal according to the present invention comprises the steps of: determining an echo zone in a current frame; allocating bits for the current frame based on the location of the echo zone; and encoding the current frame using the allocated bits, wherein the bit allocation step allocates more bits to the section in which the echo zone is located than to the section in which the echo zone is not located.
PCT/KR2012/008947 2011-10-27 2012-10-29 Procédé de codage d'un signal vocal, procédé de décodage d'un signal vocal et appareil utilisant ceux-ci WO2013062392A1 (fr)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US14/353,981 US9672840B2 (en) 2011-10-27 2012-10-29 Method for encoding voice signal, method for decoding voice signal, and apparatus using same
JP2014538722A JP6039678B2 (ja) 2011-10-27 2012-10-29 音声信号符号化方法及び復号化方法とこれを利用する装置
CN201280063395.9A CN104025189B (zh) 2011-10-27 2012-10-29 编码语音信号的方法、解码语音信号的方法,及使用其的装置
EP12843449.5A EP2772909B1 (fr) 2011-10-27 2012-10-29 Procédé de codage d'un signal vocal
KR1020147010211A KR20140085453A (ko) 2011-10-27 2012-10-29 음성 신호 부호화 방법 및 복호화 방법과 이를 이용하는 장치

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201161552446P 2011-10-27 2011-10-27
US61/552,446 2011-10-27
US201261709965P 2012-10-04 2012-10-04
US61/709,965 2012-10-04

Publications (1)

Publication Number Publication Date
WO2013062392A1 true WO2013062392A1 (fr) 2013-05-02

Family

ID=48168121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2012/008947 WO2013062392A1 (fr) 2011-10-27 2012-10-29 Procédé de codage d'un signal vocal, procédé de décodage d'un signal vocal et appareil utilisant ceux-ci

Country Status (6)

Country Link
US (1) US9672840B2 (fr)
EP (1) EP2772909B1 (fr)
JP (1) JP6039678B2 (fr)
KR (1) KR20140085453A (fr)
CN (1) CN104025189B (fr)
WO (1) WO2013062392A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015122752A1 (fr) * 2014-02-17 2015-08-20 삼성전자 주식회사 Procédé et appareil de codage de signal, et procédé et appareil de décodage de signal
WO2015162500A3 (fr) * 2014-03-24 2016-01-28 삼성전자 주식회사 Procédé et dispositif de codage de bande haute et procédé et dispositif de décodage de bande haute
CN106233112A (zh) * 2014-02-17 2016-12-14 三星电子株式会社 信号编码方法和设备以及信号解码方法和设备
US11676614B2 (en) 2014-03-03 2023-06-13 Samsung Electronics Co., Ltd. Method and apparatus for high frequency decoding for bandwidth extension

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2992766A1 (fr) * 2012-06-29 2014-01-03 France Telecom Attenuation efficace de pre-echos dans un signal audionumerique
WO2015037969A1 (fr) * 2013-09-16 2015-03-19 삼성전자 주식회사 Procédé et dispositif de codage de signal et procédé et dispositif de décodage de signal
CN105745703B (zh) 2013-09-16 2019-12-10 三星电子株式会社 信号编码方法和装置以及信号解码方法和装置
FR3024581A1 (fr) * 2014-07-29 2016-02-05 Orange Determination d'un budget de codage d'une trame de transition lpd/fd
US20170085597A1 (en) * 2015-09-18 2017-03-23 Samsung Electronics Co., Ltd. Device and method for merging circuit switched calls and packet switched calls in user equipment
EP3483880A1 (fr) * 2017-11-10 2019-05-15 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Mise en forme de bruit temporel
EP3483879A1 (fr) 2017-11-10 2019-05-15 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Fonction de fenêtrage d'analyse/de synthèse pour une transformation chevauchante modulée
CN113302688A (zh) * 2019-01-13 2021-08-24 华为技术有限公司 高分辨率音频编解码
WO2020253941A1 (fr) * 2019-06-17 2020-12-24 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Codeur audio avec un nombre dépendant du signal et une commande de précision, décodeur audio, et procédés et programmes informatiques associés
CN112767953B (zh) * 2020-06-24 2024-01-23 腾讯科技(深圳)有限公司 语音编码方法、装置、计算机设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR950009412B1 (en) * 1992-11-20 1995-08-22 Daewoo Electronics Co Ltd Method and system of adaptive beit allocation according to frame variation
JPH08204575A (ja) * 1995-01-20 1996-08-09 Daewoo Electron Co Ltd 適応的符号化システム及びビット割当方法
KR20080103088A (ko) * 2006-02-20 2008-11-26 프랑스 텔레콤 디코더 및 대응 디바이스에서 디지털 신호의 반향들의 안전한 구별과 감쇠를 위한 방법
KR20100115215A (ko) * 2009-04-17 2010-10-27 삼성전자주식회사 가변 비트율 오디오 부호화 및 복호화 장치 및 방법

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5921039B2 (ja) * 1981-11-04 1984-05-17 日本電信電話株式会社 適応予測符号化方式
US4568234A (en) 1983-05-23 1986-02-04 Asq Boats, Inc. Wafer transfer apparatus
GB8421498D0 (en) 1984-08-24 1984-09-26 British Telecomm Frequency domain speech coding
FR2674710B1 (fr) * 1991-03-27 1994-11-04 France Telecom Procede et systeme de traitement des preechos d'un signal audio-numerique code par transformee frequentielle.
JP3134338B2 (ja) * 1991-03-30 2001-02-13 ソニー株式会社 ディジタル音声信号符号化方法
US6240379B1 (en) * 1998-12-24 2001-05-29 Sony Corporation System and method for preventing artifacts in an audio data encoder device
JP3660599B2 (ja) 2001-03-09 2005-06-15 日本電信電話株式会社 音響信号の立ち上がり・立ち下がり検出方法及び装置並びにプログラム及び記録媒体
JP4399185B2 (ja) 2002-04-11 2010-01-13 パナソニック株式会社 符号化装置および復号化装置
AU2003278013A1 (en) * 2002-10-11 2004-05-04 Voiceage Corporation Methods and devices for source controlled variable bit-rate wideband speech coding
US7653542B2 (en) 2004-05-26 2010-01-26 Verizon Business Global Llc Method and system for providing synthesized speech
JP2006224862A (ja) 2005-02-18 2006-08-31 Alps Electric Co Ltd ステアリングスイッチ装置
KR100979624B1 (ko) * 2005-09-05 2010-09-01 후지쯔 가부시끼가이샤 오디오 부호화 장치 및 오디오 부호화 방법
US7966175B2 (en) * 2006-10-18 2011-06-21 Polycom, Inc. Fast lattice vector quantization
CN101751926B (zh) * 2008-12-10 2012-07-04 华为技术有限公司 信号编码、解码方法及装置、编解码***

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR950009412B1 (en) * 1992-11-20 1995-08-22 Daewoo Electronics Co Ltd Method and system of adaptive beit allocation according to frame variation
JPH08204575A (ja) * 1995-01-20 1996-08-09 Daewoo Electron Co Ltd 適応的符号化システム及びビット割当方法
KR20080103088A (ko) * 2006-02-20 2008-11-26 프랑스 텔레콤 디코더 및 대응 디바이스에서 디지털 신호의 반향들의 안전한 구별과 감쇠를 위한 방법
KR20100115215A (ko) * 2009-04-17 2010-10-27 삼성전자주식회사 가변 비트율 오디오 부호화 및 복호화 장치 및 방법

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2772909A4 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015122752A1 (fr) * 2014-02-17 2015-08-20 삼성전자 주식회사 Procédé et appareil de codage de signal, et procédé et appareil de décodage de signal
CN106233112A (zh) * 2014-02-17 2016-12-14 三星电子株式会社 信号编码方法和设备以及信号解码方法和设备
US10395663B2 (en) 2014-02-17 2019-08-27 Samsung Electronics Co., Ltd. Signal encoding method and apparatus, and signal decoding method and apparatus
CN110176241A (zh) * 2014-02-17 2019-08-27 三星电子株式会社 信号编码方法和设备以及信号解码方法和设备
US10657976B2 (en) 2014-02-17 2020-05-19 Samsung Electronics Co., Ltd. Signal encoding method and apparatus, and signal decoding method and apparatus
US10902860B2 (en) 2014-02-17 2021-01-26 Samsung Electronics Co., Ltd. Signal encoding method and apparatus, and signal decoding method and apparatus
CN110176241B (zh) * 2014-02-17 2023-10-31 三星电子株式会社 信号编码方法和设备以及信号解码方法和设备
US11676614B2 (en) 2014-03-03 2023-06-13 Samsung Electronics Co., Ltd. Method and apparatus for high frequency decoding for bandwidth extension
WO2015162500A3 (fr) * 2014-03-24 2016-01-28 삼성전자 주식회사 Procédé et dispositif de codage de bande haute et procédé et dispositif de décodage de bande haute
US10468035B2 (en) 2014-03-24 2019-11-05 Samsung Electronics Co., Ltd. High-band encoding method and device, and high-band decoding method and device
US10909993B2 (en) 2014-03-24 2021-02-02 Samsung Electronics Co., Ltd. High-band encoding method and device, and high-band decoding method and device
US11688406B2 (en) 2014-03-24 2023-06-27 Samsung Electronics Co., Ltd. High-band encoding method and device, and high-band decoding method and device

Also Published As

Publication number Publication date
CN104025189A (zh) 2014-09-03
JP6039678B2 (ja) 2016-12-07
CN104025189B (zh) 2016-10-12
EP2772909B1 (fr) 2018-02-21
EP2772909A4 (fr) 2015-06-10
US20140303965A1 (en) 2014-10-09
JP2014531064A (ja) 2014-11-20
US9672840B2 (en) 2017-06-06
KR20140085453A (ko) 2014-07-07
EP2772909A1 (fr) 2014-09-03

Similar Documents

Publication Publication Date Title
WO2013062392A1 (fr) Procédé de codage d'un signal vocal, procédé de décodage d'un signal vocal et appareil utilisant ceux-ci
US10885926B2 (en) Classification between time-domain coding and frequency domain coding for high bit rates
JP4861196B2 (ja) Acelp/tcxに基づくオーディオ圧縮中の低周波数強調の方法およびデバイス
US8942988B2 (en) Efficient temporal envelope coding approach by prediction between low band signal and high band signal
CN101622661B (zh) 一种数字语音信号的改进编解码方法
US8532983B2 (en) Adaptive frequency prediction for encoding or decoding an audio signal
KR101147878B1 (ko) 코딩 및 디코딩 방법 및 장치
EP2128857B1 (fr) Dispositif de codage et procédé de codage
EP1899962B1 (fr) Post-filtre audio pour un codeur audio
JP5978218B2 (ja) 低ビットレート低遅延の一般オーディオ信号の符号化
US8396707B2 (en) Method and device for efficient quantization of transform information in an embedded speech and audio codec
KR102105305B1 (ko) 계층형 정현파 코딩을 이용한 오디오 신호의 인코딩 및 디코딩 방법 및 장치
CN101281749A (zh) 可分级的语音和乐音联合编码装置和解码装置
US9390722B2 (en) Method and device for quantizing voice signals in a band-selective manner
US20230133513A1 (en) Audio decoder, audio encoder, and related methods using joint coding of scale parameters for channels of a multi-channel audio signal
Jia et al. A novel super-wideband embedded speech and audio codec based on ITU-T Recommendation G. 729.1

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12843449

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20147010211

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 14353981

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2014538722

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2012843449

Country of ref document: EP