JPH08123493A

JPH08123493A - Code excited linear predictive speech encoding device

Info

Publication number: JPH08123493A
Application number: JP6264235A
Authority: JP
Inventors: Sachiko Hosaka; 祥子保坂; Akitoshi Kataoka; 章俊片岡; Takehiro Moriya; 健弘守谷; Shinji Hayashi; 伸二林
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1994-10-27
Filing date: 1994-10-27
Publication date: 1996-05-17

Abstract

PURPOSE: To provide a code-excited linear predictive speech coding apparatus that yields natural decoded sound not only for speech alone but also when background noise, music, or the like is added to the speech. CONSTITUTION: The apparatus has a shape excitation source codebook made up of a plurality of sub-codebooks. Sub-codebooks 106a and 106b consist of speech parts 106a-1 and 106b-1, made up of source vectors trained on noise-free speech, and non-speech parts 106a-2 and 106b-2, made up of source vectors trained on non-speech signals and of untrained random source vectors. Quality is consequently higher than when the sub-codebooks consist only of the vectors of speech parts 106a-1 and 106b-1 or only of the vectors of non-speech parts 106a-2 and 106b-2.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a code-excited linear predictive speech coding apparatus. It applies to speech coding schemes in which a speech synthesis filter is driven by an excitation signal source, such as code-excited linear prediction (CELP) and residual-excited linear prediction (RELP) speech coding used at bit rates of about 2 kbit/s to 16 kbit/s, and it yields natural decoded sound not only for speech alone but also when background noise, music, or the like is added to the speech.

[0002]

[Prior Art] In recent years, in technical fields such as digital mobile communication, various high-efficiency coding schemes have been used for purposes such as making effective use of radio spectrum. One high-efficiency scheme that codes speech at a rate of about 8 kbit/s is code-excited linear prediction (CELP) coding. CELP is a well-known technique that combines vector quantization and analysis-by-synthesis (AbS) with linear predictive analysis-synthesis, which models the human speech production mechanism. Among CELP schemes there is a structure called conjugate-structure CELP (CS-CELP), in which the shape excitation source codebook is made up of two sub-codebooks (conjugate-structure CELP is described in detail in Japanese Patent Application No. Hei-70534, "Speech Coding Method", filed by the present applicant).

[0003] An example of a CELP coding scheme with a conjugate structure is described here with reference to FIG. 3. FIG. 3 is a block diagram showing a configuration example of a code-excited linear predictive speech coding apparatus based on conjugate-structure CELP. In the figure, reference numeral 1 denotes an input terminal, which receives digital speech data generated by sampling an analog speech signal at a sampling frequency of 8 kHz. The input signal from input terminal 1 is fed to LPC (linear prediction) analyzer 2. LPC analyzer 2 performs linear prediction analysis on the input signal by well-known methods such as windowing, computation of the autocorrelation function, and solution of the resulting system of simultaneous equations, and obtains the prediction coefficients of speech synthesis filter 4 and of speech analysis filter 3. The prediction coefficients (linear prediction coefficients) of speech synthesis filter 4 are quantized once, converted into a form suitable for transmission, then decoded again and set in speech synthesis filter 4. The prediction coefficients of speech analysis filter 3 are set in the same way.
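The analysis chain named above (windowing, autocorrelation, solving the simultaneous equations) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the Hamming window, the Levinson-Durbin recursion, the order of 10, and all function names are assumptions, and the quantization and decoding of the coefficients are omitted.

```python
import math


def hamming_window(n):
    """Hamming window of length n (assumed window type; the source only says 'windowing')."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(x, order):
    """Autocorrelation values r[0..order] of the windowed frame."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k)) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations.

    Returns (a, err) where A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
    and err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        err *= 1.0 - reflection * reflection
    return a, err


def lpc_analysis(frame, order=10):
    """Window the frame, compute autocorrelations, and solve for the predictor."""
    window = hamming_window(len(frame))
    windowed = [s * w for s, w in zip(frame, window)]
    return levinson_durbin(autocorrelation(windowed, order), order)
```

For example, `a, err = lpc_analysis(frame)` on a frame of speech samples gives the coefficients that would be set in both the analysis filter and the synthesis filter after quantization.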

[0004] Speech analysis filter 3 performs the inverse of the synthesis operation of speech synthesis filter 4. It obtains the linear prediction residual signal by analyzing the output signal of speech synthesis filter 4 (the decoded speech) using the prediction coefficients set by LPC analyzer 2. Based on the linear prediction residual signal, the contents of adaptive excitation source codebook 5, which represents the fundamental period (pitch) component of the speech, are set. Adaptive excitation source codebook 5 stores the pitch period and the pitch pulse amplitude as pitch period vectors, and changes adaptively to follow changes in the pitch of the speech.
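The inverse relationship between the analysis filter and the synthesis filter can be illustrated with a short sketch. The function names and the coefficient convention A(z) = 1 + a[1] z^-1 + ... are assumptions made for this example, and both filters start from zero state.

```python
def analysis_filter(x, a):
    """FIR filter A(z): residual e[n] = x[n] + sum_{k>=1} a[k] * x[n-k]."""
    order = len(a) - 1
    return [
        x[n] + sum(a[k] * x[n - k] for k in range(1, order + 1) if n - k >= 0)
        for n in range(len(x))
    ]


def synthesis_filter(e, a):
    """All-pole filter 1/A(z): y[n] = e[n] - sum_{k>=1} a[k] * y[n-k]."""
    order = len(a) - 1
    y = []
    for n in range(len(e)):
        y.append(e[n] - sum(a[k] * y[n - k] for k in range(1, order + 1) if n - k >= 0))
    return y
```

Passing a signal through `synthesis_filter` and then `analysis_filter` with the same coefficients returns the original excitation, which is why the residual of the decoded speech can be used to update the adaptive codebook.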

[0005] Reference numerals 6a and 6b denote the sub-codebooks that make up the shape excitation source codebook. They store, as excitation vectors, the shape excitation component, which is the waveform that remains after the pitch component is removed from the residual signal. The shape excitation source codebook formed by sub-codebooks 6a and 6b corresponds to aperiodic excitation and does not change over time.

[0006] The pitch period vector and the excitation vectors selected by minimum distortion calculation unit 8, described later, are read out from adaptive excitation source codebook 5 and sub-codebooks 6a and 6b. The pitch period vector read from adaptive excitation source codebook 5 is multiplied by the pitch gain in gain unit 10. The excitation vectors read from sub-codebooks 6a and 6b are added together to form excitation vector 30, which is multiplied in gain unit 11 by the predicted gain set by gain adapter 9 and then in gain unit 13 by the shape gain.

[0007] The output vectors of gain unit 10 and gain unit 13 are added together and supplied to speech synthesis filter 4, which performs synthesis using the linear prediction coefficients set by LPC analyzer 2 as described above. The pitch gain in gain unit 10 is the amplitude gain for the pitch-period excitation and is set according to the vector input to speech synthesis filter 4. The predicted gain in gain unit 11 is the amplitude gain for the shape excitation and is set in gain adapter 9 by linear prediction analysis based on the power of past excitation vectors 30a.
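The signal path through gain units 10, 11, and 13 can be written out as a small sketch. The function and parameter names are hypothetical, and the gains are taken as given inputs; how gain adapter 9 predicts its gain is not modeled here.

```python
def build_excitation(pitch_vec, c_a, c_b, pitch_gain, predicted_gain, shape_gain):
    """Combine the adaptive and shape contributions into the synthesis filter input.

    pitch_vec: vector read from the adaptive excitation source codebook
    c_a, c_b:  vectors read from the two shape sub-codebooks
    """
    # Sum of the two sub-codebook vectors (excitation vector 30 in FIG. 3).
    shape = [x + y for x, y in zip(c_a, c_b)]
    # Gain unit 11 (predicted gain) followed by gain unit 13 (shape gain).
    scaled_shape = [predicted_gain * shape_gain * s for s in shape]
    # Gain unit 10 (pitch gain), then the adder in front of the synthesis filter.
    return [pitch_gain * p + s for p, s in zip(pitch_vec, scaled_shape)]
```

The two sub-codebook vectors share the same pair of gains, so only their sum, not each vector separately, is scaled.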

[0008] The output vector of speech synthesis filter 4 is subtracted from the input signal vector supplied from input terminal 1 to obtain distortion data. The distortion data is weighted in perceptual weighting filter 7 by coefficients that correspond to the characteristics of human hearing and is then input to minimum distortion calculation unit 8. Minimum distortion calculation unit 8 computes the power of the distortion data output from perceptual weighting filter 7, and the pitch period vector and the excitation vectors are selected from adaptive excitation source codebook 5 and sub-codebooks 6a and 6b, respectively, so that this power is minimized. Perceptual weighting filter 7 is usually an autoregressive moving-average filter of about 10th order, configured to slightly emphasize the formant peaks, and minimum distortion calculation unit 8 is configured to perform a least-squares error calculation.
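The selection rule described here is an analysis-by-synthesis search: synthesize each candidate, weight the difference from the input, and keep the candidate with the smallest power. The sketch below shows this for the two shape sub-codebooks only. It assumes unit gains, no adaptive codebook contribution, and an exhaustive joint search; the `weight` argument is a hypothetical stand-in for perceptual weighting filter 7 and defaults to no weighting.

```python
def synthesis_filter(e, a):
    """All-pole filter 1/A(z) with A(z) = 1 + a[1] z^-1 + ... (zero initial state)."""
    order = len(a) - 1
    y = []
    for n in range(len(e)):
        y.append(e[n] - sum(a[k] * y[n - k] for k in range(1, order + 1) if n - k >= 0))
    return y


def search_shape_codebooks(target, a, book_a, book_b, weight=lambda v: v):
    """Return (power, i, j) for the sub-codebook pair with minimum weighted error power."""
    best = None
    for i, c_a in enumerate(book_a):
        for j, c_b in enumerate(book_b):
            excitation = [x + y for x, y in zip(c_a, c_b)]
            synthesized = synthesis_filter(excitation, a)
            # Distortion data: input minus synthesized output, then weighting.
            distortion = weight([t - s for t, s in zip(target, synthesized)])
            power = sum(d * d for d in distortion)
            if best is None or power < best[0]:
                best = (power, i, j)
    return best
```

With two sub-codebooks of 128 vectors each, the joint search evaluates 128 × 128 pairs per subframe, so a practical coder would likely prune the candidates first; that step is outside what the text describes.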

[0009] In code output unit 12, the codes selected for the prediction coefficients, the pitch period vector, and the excitation vectors, together with the gains and so on, are converted into a bit-sequence code, an error correction code is added as needed, and the result is output from code output unit 12 to the transmission path. In other words, during coding, the combination of excitation sources that minimizes the perceptually weighted squared error of the synthesized waveform (the output vector of speech synthesis filter 4) with respect to the input signal waveform is selected from adaptive excitation source codebook 5 and the shape excitation source codebook formed by sub-codebooks 6a and 6b.

[0010] As described above, in conjugate-structure CELP (CS-CELP) the shape excitation source codebook is split into two sub-codebooks 6a and 6b. Sub-codebooks 6a and 6b complement each other: even if a code error on the transmission path causes the decoder (not shown) to select the wrong shape excitation vector from one sub-codebook, that vector is summed with the vector from the other sub-codebook, so the resulting error is not large.
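The robustness argument can be checked numerically with a toy experiment. The sketch below is an illustration under assumptions that are not in the source: codebook vectors are independent Gaussian samples, a single codebook and the sum of two sub-codebook vectors have the same average energy, and a channel error replaces exactly one index with a different one. All names are hypothetical.

```python
import random


def index_error_energy(trials=2000, dim=40, size=64, seed=0):
    """Average squared excitation error when one transmitted index is corrupted.

    Returns (single_codebook_error, conjugate_structure_error).
    """
    rng = random.Random(seed)

    def make_book(std):
        return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(size)]

    single = make_book(1.0)
    # Each sub-codebook carries half the energy so the sum matches the single book.
    sub_a = make_book(0.5 ** 0.5)
    sub_b = make_book(0.5 ** 0.5)

    err_single = 0.0
    err_conj = 0.0
    for _ in range(trials):
        i, wrong = rng.sample(range(size), 2)  # sent index and corrupted index
        j = rng.randrange(size)                # index of the other sub-codebook, received intact
        err_single += sum((x - y) ** 2 for x, y in zip(single[i], single[wrong]))
        sent = [x + y for x, y in zip(sub_a[i], sub_b[j])]
        received = [x + y for x, y in zip(sub_a[wrong], sub_b[j])]
        err_conj += sum((x - y) ** 2 for x, y in zip(sent, received))
    return err_single / trials, err_conj / trials
```

Under these assumptions the contribution of the intact sub-codebook cancels in the difference, so the conjugate-structure error energy comes out at roughly half that of the single codebook.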

[0011] Vector quantization is also applied when obtaining the shape excitation components stored in the shape excitation source codebook formed by sub-codebooks 6a and 6b, as it is when setting adaptive excitation source codebook 5 described above. However, the variance of the shape excitation component is too large for the shape excitation source codebook to be adapted to changes in the speech, so a number of plausible waveforms are prepared in advance as vectors and stored in the codebook. At transmission time, only the vector indices are actually transmitted.

[0012] One method of constructing such a plausible codebook is training. For the shape excitation source codebook shown in FIG. 3, training is performed in advance on a speech database, starting from a set of vectors representing noise as the initial state. Training determines the vectors of each sub-codebook of the shape excitation source codebook so as to minimize the sum, over all excitation sources in the codebook, of the error between the target speech signal and the synthesized waveform obtained by convolving each excitation source with the synthesis filter and multiplying by the amplitude (gain).
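One way to write the training criterion described above as a formula is given below. The notation is an assumption introduced for illustration and does not appear in the source: $x_n$ is the target speech vector of training frame $n$, $H_n$ is the matrix that convolves an excitation with the impulse response of that frame's synthesis filter, $g_n$ is the gain, and $c^{a}_{i}$, $c^{b}_{j}$ are the vectors of the two sub-codebooks.

```latex
D \;=\; \sum_{n} \, \min_{i,\,j} \,
\bigl\| \, x_n - g_n H_n \bigl( c^{a}_{i} + c^{b}_{j} \bigr) \bigr\|^{2}
```

Training then adjusts the vectors $c^{a}_{i}$ and $c^{b}_{j}$, starting from noise vectors, so that $D$ decreases over the training database.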

[0013] The structure of the excitation vectors set in the sub-codebooks by conventional training of the shape excitation source codebook is described here with reference to FIG. 4. An ordinary speech database contains no background noise, background music, or the like. When such a database is used, the two sub-codebooks 6a-1 and 6b-1 that make up the trained shape excitation source codebook both consist only of a speech part, made up of source vectors for noise-free speech data, as shown in FIG. 4(a). When no training is performed (for example, a codebook of Gaussian noise), or when the codebook is trained on non-speech signals such as background sound, the two sub-codebooks 6a-2 and 6b-2 both consist only of a non-speech part, made up of untrained random source vectors or of source vectors trained on non-speech signals, as shown in FIG. 4(b).

[0014]

[Problems to Be Solved by the Invention] Excitation vector 30-1 (see FIG. 4(a)), synthesized from shape excitation vector signals trained on a speech database with no background noise or background music, that is, from the outputs of sub-codebooks 6a-1 and 6b-1, is optimal for noise-free input speech. A coding apparatus with a codebook trained in this way therefore gives high quality for noise-free input, but suffers very obvious quality degradation when noise, music, or the like is added to the background of the speech.

[0015] Excitation vector 30-2 (see FIG. 4(b)), synthesized from an untrained codebook, for example one of Gaussian noise, or from a codebook trained on non-speech signals such as background sound, is optimal for input speech in the presence of noise. Such a codebook suffers little quality degradation from added noise, but has the drawback that quality does not improve much for speech in a noise-free environment.

[0016] The present invention was made in view of the above problems. Its object is to provide a code-excited linear predictive speech coding apparatus that yields natural decoded sound not only for speech alone but also when background noise, music, or the like is added to the speech.

[0017]

[Means for Solving the Problems] The invention according to claim 1 is a code-excited linear predictive speech coding apparatus having a shape excitation source codebook, characterized in that the shape excitation source codebook is made up of a plurality of sub-codebooks, and each sub-codebook consists of a speech part, made up of source vectors trained either on noise-free speech or on both noise-free speech and noisy speech, and a non-speech part, made up of source vectors trained on non-speech signals or of untrained random source vectors.

[0018] The invention according to claim 2 is a code-excited linear predictive speech coding apparatus having a shape excitation source codebook, characterized in that the shape excitation source codebook is made up of first and second sub-codebooks, the first sub-codebook consists only of source vectors trained either on noise-free speech or on both noise-free speech and noisy speech, and the second sub-codebook consists only of source vectors trained on non-speech signals or only of untrained random source vectors.

[0019]

[Operation] With the above configuration, the vectors of the speech part and of the non-speech part of each sub-codebook have different distributions. By building each sub-codebook from vectors chosen from both so that speech and non-speech are well covered, an optimal vector is selected both for noise-free speech input and for speech input containing background music. Overall, quality is higher than when the sub-codebooks consist only of speech-part vectors or only of non-speech-part vectors.

[0020]

[Embodiments] An embodiment of the present invention is described below with reference to the drawings. The invention concerns the configuration of the shape excitation source codebook provided in a code-excited linear predictive coding scheme such as the one described with reference to FIG. 3, and is characterized by how its initial values are set. The rest of the code-excited linear predictive speech coding apparatus is the same as in the conventional apparatus, so the following description covers the configuration of the shape excitation source codebook in detail.

[0021] FIG. 1 shows the configuration of the shape excitation source codebook of a code-excited linear predictive speech coding apparatus according to the invention, for the case of two sub-codebooks as in FIG. 3. Sub-codebooks 106a and 106b in FIG. 1 correspond to sub-codebooks 6a and 6b in FIG. 3. In an actual configuration each sub-codebook contains, for example, 128 vectors of 40 dimensions, but in the figure each is simplified to eight vectors.

[0022] Before sub-codebooks 106a and 106b are trained, two codebooks are prepared: one obtained by training on noise-free speech, or on both noise-free speech and noisy speech, and one obtained by training on non-speech signals such as additive noise or background music.

[0023] Next, taking the speech input environment into account (for example, inside a car), suitable vectors are chosen from the speech-part and non-speech-part vectors of the codebooks prepared in advance so that (number of speech-part vectors) + (number of non-speech-part vectors) = 128. Speech-part vectors and non-speech-part vectors are stored in each of sub-codebooks 106a and 106b, so that each becomes a composite codebook with a speech part (106a-1 or 106b-1) and a non-speech part (106a-2 or 106b-2).

[0024] Of the 128 vectors in each sub-codebook, for example, 64 can be assigned to the speech part and the remaining 64 to the non-speech part. The ratio is not limited to 1:1 and can be set freely in anticipation of the nature of the input speech under the intended usage conditions.
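Assembling a composite sub-codebook from the two pre-trained codebooks can be sketched as below. The text does not say how the "suitable" vectors are chosen, so this example simply takes the first vectors of each source codebook as a placeholder; the function names and that selection rule are assumptions.

```python
def build_composite_subcodebook(speech_book, nonspeech_book, n_speech, total=128):
    """Return (codebook, n_speech) with the speech part first, then the non-speech part."""
    if not 0 <= n_speech <= total:
        raise ValueError("n_speech must lie between 0 and total")
    n_nonspeech = total - n_speech
    if n_speech > len(speech_book) or n_nonspeech > len(nonspeech_book):
        raise ValueError("source codebooks are too small for the requested split")
    # Placeholder selection: the first vectors of each pre-trained codebook.
    speech_part = [list(v) for v in speech_book[:n_speech]]
    nonspeech_part = [list(v) for v in nonspeech_book[:n_nonspeech]]
    return speech_part + nonspeech_part, n_speech


def part_of(index, n_speech):
    """Tell which part of the composite sub-codebook an index falls in."""
    return "speech" if index < n_speech else "non-speech"
```

Calling `build_composite_subcodebook(speech_book, nonspeech_book, 64)` gives the 1:1 split mentioned above; a different `n_speech` shifts the balance, for instance toward the non-speech part for a noisy environment such as a car.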

[0025] In actual operation of the code-excited linear predictive speech coding apparatus, as described above, the vector selected from each of sub-codebooks 106a and 106b is the one that minimizes the perceptually weighted distortion. For noise-free input speech, it is therefore likely that both vectors are chosen from the speech parts (106a-1 and 106b-1) to form excitation vector 130, whereas for input speech containing background sound it is likely that both are chosen from the non-speech parts (106a-2 and 106b-2). Consequently, the excitation vector 130 that is selected (corresponding to excitation vector 30 in FIG. 3) matches the characteristics of the input signal better than in the conventional cases. Those cases are, first, coding an input signal whose background contains noise or music with sub-codebooks trained on a speech database that has no background noise or music and, second, coding speech that has little degradation from added noise, or speech in a noise-free environment, with an untrained codebook or a codebook trained on non-speech signals such as background sound. Speech quality can thus be improved.

[0026] Another embodiment of the invention is described next with reference to FIG. 2. In this figure, the shape excitation source codebook consists of sub-codebook 206a, made up only of a speech part, and sub-codebook 206b, made up only of a non-speech part, in place of sub-codebooks 106a and 106b of FIG. 1. The non-speech part forming sub-codebook 206b consists of 128 vectors that are either time series of noise waveforms or vectors obtained using a database of noise and non-speech signals. The speech part forming sub-codebook 206a consists of 128 excitation vectors obtained by training, with similar noise vectors as initial values, on a database of noise-free speech or of both noise-free speech and noisy speech.

[0027] When sub-codebooks 206a and 206b are configured as shown in this figure, each consists of 128 excitation vectors that are all speech-part vectors or all non-speech-part vectors. Compared with the configuration of FIG. 1, sub-codebook 206a offers a wider range of suitable excitation vectors for speech with little degradation from added noise or speech in a noise-free environment, and sub-codebook 206b offers a wider range of suitable excitation vectors for input signals whose background contains noise or music. As in the embodiment of FIG. 1, excitation vector 230 (corresponding to excitation vector 30 in FIG. 3), which is the sum of the excitation vectors selected from the two sub-codebooks, therefore gives better speech quality than the conventional configurations.

[0028] The embodiments above use CS-CELP, in which the shape excitation source codebook is split into two, but the composite codebook construction can also be applied to configurations with more codebooks or with only a single codebook.

[0029] To reduce codebook memory, a method that uses one or a few vectors cyclically can also be combined with this scheme without conflict. To reduce computation, sparse excitation vectors, in which the unimportant components of each codebook vector are set to 0 and only a few representative pulses are kept, can also be used together with this scheme without impairing its effect. In that case, sparse excitation vectors can be applied only to the speech-part codebook while the non-speech-part codebook keeps a value at every sample, which is effective in practice because it reduces computation and memory while maintaining quality.
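Sparsifying only the speech part of a composite codebook can be sketched as follows. The source does not define which components count as unimportant, so this example keeps the samples with the largest magnitude; that criterion, the function names, and the layout with the speech part stored first are all assumptions.

```python
def sparsify(vec, n_pulses):
    """Keep the n_pulses largest-magnitude samples and set the rest to 0."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:n_pulses])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def sparsify_speech_part(codebook, n_speech, n_pulses):
    """Sparsify the first n_speech vectors; leave the non-speech vectors dense."""
    return [
        sparsify(vec, n_pulses) if index < n_speech else list(vec)
        for index, vec in enumerate(codebook)
    ]
```

A sparse vector needs only its pulse positions and amplitudes to be stored, and filtering it costs work proportional to the number of pulses rather than the vector length, which is where the memory and computation savings come from.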

[0030] It is also clear that the scheme is effective in combination with algebraic code-excited CELP (A-CELP), in which the search of the noise excitation part places several excitation pulses in turn at predetermined positions to find a pulse train with low distortion.

[0031] The effects of applying the invention are compared below with the prior art (a shape excitation source codebook made up entirely of speech-part vectors, and one made up only of non-speech-part vectors).

○ Prior art
(Speech-part vectors only) A scheme in which all excitation sources are trained gives good quality for noise-free input speech, but for speech with noise in the background, the background noise is distorted into an unnatural, disturbing sound and the perceived degradation is severe.
(Non-speech-part vectors only) Quality does not differ significantly between noise-free input speech and speech containing noise, music, or the like. However, in a comparison at the same bit rate with an embodiment of the invention, which uses trained excitation sources, the signal-to-noise ratio of the decoded speech is about 1.0 to 2.0 dB lower, and experiments showed that at about 8 kbit/s a rough, noisy quality remains audible.

[0032] ○ Present invention
When there is no noise in the background, optimal speech-part vectors are selected, so quality comparable to a speech-only codebook is obtained and the noisy quality is avoided. When background noise is present, optimal non-speech-part vectors are selected, so the noise is not distorted into an unnatural sound. High-quality speech coding is therefore possible over a wide range of input speech conditions.
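The comparison above is stated in terms of the signal-to-noise ratio of the decoded speech. A minimal way to compute such a figure is sketched below; the source does not say whether a whole-signal or a segmental SNR was measured, so the whole-signal form and the function name are assumptions.

```python
import math


def snr_db(reference, decoded):
    """Whole-signal SNR in dB between the input speech and the decoded speech."""
    signal_energy = sum(x * x for x in reference)
    noise_energy = sum((x - y) ** 2 for x, y in zip(reference, decoded))
    if noise_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_energy / noise_energy)
```

On this scale, a difference of 1.0 to 2.0 dB corresponds to the coding error energy being about 1.26 to 1.58 times larger.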

[0033]

[Effects of the Invention] According to the invention of claim 1, if the input signal is noise-free speech, vectors from the speech part are selected as in the conventional configuration, so there is no degradation relative to the prior art. If the input signal contains additive noise, background music, or the like, vectors from the non-speech-part codebook are selected, so quality is improved over the prior art. That is, because the vectors of the speech part and of the non-speech part have different distributions, building each sub-codebook from vectors chosen from both so that speech and non-speech are well covered means that an optimal vector is selected both for noise-free speech input and for speech input containing background music or the like. Overall, quality is higher than when the sub-codebooks consist only of speech-part vectors or only of non-speech-part vectors.

【００３４】また、請求項２記載の発明によれば、雑音
を含まない音声入力に対しては、雑音を含まない音声又
は雑音を含まない音声と雑音を含む音声の両方を用いて
学習した信号源ベクトルのみから構成された第１のサブ
符号帳から最適なベクトルが選択され、背景音楽等を含
む音声入力に対しては、音声以外の非音声を利用して学
習した信号源ベクトルのみ又は学習を行わないランダム
信号源ベクトルのみで構成された第２のサブ符号帳から
最適なベクトルが選択されるので、総合的に考えると２
つのサブ符号帳を音声部のみあるいは非音声部のみのベ
クトルで構成するよりも品質を向上させることができ
る。According to the invention of claim 2, for noise-free speech input the optimum vector is selected from the first sub-codebook, which consists only of signal source vectors trained on noise-free speech or on both noise-free speech and noisy speech. For speech input containing background music or the like, the optimum vector is selected from the second sub-codebook, which consists only of signal source vectors trained on non-speech signals or only of untrained random signal source vectors.
Taken as a whole, quality is therefore higher than when the two sub-codebooks consist of only speech-part vectors or only non-speech-part vectors.
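The two-sub-codebook arrangement can be sketched as follows, under stated assumptions. The second sub-codebook is filled with untrained Gaussian random vectors, one of the two options named above. Each sub-codebook is searched independently with a plain correlation criterion, without the synthesis and weighting filters of a real encoder. How the selected vectors are combined into the excitation is not stated in this passage and is not modelled. All names (`random_sub_codebook`, `best_match`, `search_both`) and all numeric values are hypothetical.

```python
import random


def random_sub_codebook(size, dim, seed=0):
    """Untrained sub-codebook: `size` Gaussian random vectors of length `dim`.

    A fixed seed is used here so that the same table can be regenerated;
    this is an assumption of the sketch, not a detail from the specification."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]


def best_match(sub_codebook, target):
    """Index and score of the entry maximizing <t, v>^2 / <v, v>."""
    best_index, best_score = -1, -1.0
    for i, v in enumerate(sub_codebook):
        energy = sum(x * x for x in v)
        if energy == 0.0:
            continue  # skip all-zero entries
        corr = sum(t * x for t, x in zip(target, v))
        score = corr * corr / energy
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


def search_both(first, second, target):
    """Search the speech-only and the non-speech-only sub-codebook separately."""
    return {"first": best_match(first, target),
            "second": best_match(second, target)}


# First sub-codebook: stand-ins for vectors trained on speech (assumed values).
first = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
# Second sub-codebook: untrained random vectors.
second = random_sub_codebook(size=8, dim=4, seed=1)

print(search_both(first, second, [0.0, 2.0, 0.0, 0.0]))  # pulse-like target
print(search_both(first, second, list(second[3])))       # noise-like target
```

In this toy setting the noise-like target is matched far better by the second sub-codebook than by the first, which is the behaviour the paragraph attributes to keeping a non-speech-only sub-codebook alongside the speech-only one.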

【図面の簡単な説明】[Brief description of drawings]

【図１】本発明の一実施例による形状励振源符号帳の構
成を示す構成図である。FIG. 1 is a diagram showing the configuration of a shape excitation source codebook according to one embodiment of the present invention.

【図２】本発明の他の実施例による形状励振源符号帳の
構成を示す構成図である。FIG. 2 is a diagram showing the configuration of a shape excitation source codebook according to another embodiment of the present invention.

【図３】従来の符号励振線形予測音声符号化装置の構成
を示すブロック図である。FIG. 3 is a block diagram showing the configuration of a conventional code-excited linear predictive speech coding apparatus.

【図４】従来の形状励振源符号帳の構成を示す構成図で
ある。FIG. 4 is a diagram showing the configuration of a conventional shape excitation source codebook.

【符号の説明】[Explanation of symbols]

１０６ａ，１０６ｂ，２０６ａ，２０６ｂ サブ符号帳
１０６ａ−１，１０６ｂ−１ 音声部ベクトル
１０６ａ−２，１０６ｂ−２ 非音声部ベクトル

106a, 106b, 206a, 206b: sub-codebooks
106a-1, 106b-1: speech-part vectors
106a-2, 106b-2: non-speech-part vectors

フロントページの続き (72)発明者 林 伸二 東京都千代田区内幸町１丁目１番６号 日本電信電話株式会社内
Continuation of front page: (72) Inventor: Shinji Hayashi, c/o Nippon Telegraph and Telephone Corporation, 1-1-6 Uchisaiwaicho, Chiyoda-ku, Tokyo

Claims

【特許請求の範囲】[Claims]

【請求項１】形状励振源符号帳を備える符号励振線形
予測音声符号化装置において、前記形状励振源符号帳は、複数のサブ符号帳から構成さ
れ、前記各サブ符号帳は、雑音を含まない音声又は雑音を含
まない音声と雑音を含む音声の両方を用いて学習した信
号源ベクトルから構成される音声部と、音声以外の非音
声を利用して学習した信号源ベクトル又は学習を行わな
いランダム信号源ベクトルから構成される非音声部とか
らなることを特徴とする符号励振線形予測音声符号化装
置。1. A code-excited linear predictive speech coding apparatus comprising a shape excitation source codebook, wherein the shape excitation source codebook is composed of a plurality of sub-codebooks, and each of the sub-codebooks comprises: a speech part composed of signal source vectors trained using noise-free speech, or using both noise-free speech and noisy speech; and a non-speech part composed of signal source vectors trained using non-speech signals other than speech, or of untrained random signal source vectors.

【請求項２】形状励振源符号帳を備える符号励振線形
予測音声符号化装置において、前記形状励振源符号帳は、第１および第２のサブ符号帳
から構成され、前記第１のサブ符号帳は、雑音を含まない音声又は雑音
を含まない音声と雑音を含む音声の両方を用いて学習し
た信号源ベクトルのみから構成され、前記第２のサブ符
号帳は、音声以外の非音声を利用して学習した信号源ベ
クトルのみ又は学習を行わないランダム信号源ベクトル
のみで構成されることを特徴とする符号励振線形予測音
声符号化装置。2. A code-excited linear predictive speech coding apparatus comprising a shape excitation source codebook, wherein the shape excitation source codebook is composed of a first sub-codebook and a second sub-codebook, the first sub-codebook is composed only of signal source vectors trained using noise-free speech, or using both noise-free speech and noisy speech, and the second sub-codebook is composed only of signal source vectors trained using non-speech signals other than speech, or only of untrained random signal source vectors.