JPH10232698A - Speech speed changing device

Info

Publication number: JPH10232698A
Application number: JP9053863A
Authority: JP
Inventors: Masanori Nishimoto; 正則西本
Original assignee: Toyo Communication Equipment Co Ltd
Current assignee: Toyo Communication Equipment Co Ltd
Priority date: 1997-02-21
Filing date: 1997-02-21
Publication date: 1998-09-02

Abstract

PROBLEM TO BE SOLVED: To reduce the amount of computation needed to extract the pitch component when changing speech speed, and to suppress discontinuities at the connection points of speech segments inserted to lengthen the speech. SOLUTION: In a speech speed changing device configured as the decoder corresponding to an encoder of a speech compression coding system, an expansion means consisting of a pitch expander 7 and a data storage unit 8 is provided in the stage preceding the short-period prediction synthesis filter 6 of the decoder, which operates on the parameters giving the optimum excitation and on the synthesis filter parameters transmitted from the encoder. A speech segment one pitch period long is extracted from the excitation signal, to which no formant component has yet been added, and the required number of copies of this segment is inserted into the excitation signal.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech speed conversion device capable of producing high-quality speech output with a small amount of arithmetic processing.

[0002]

[Prior Art] It has long been pointed out that elderly people suffer not only from age-related hearing loss but also from a decline in language processing ability, so that even when they can hear a voice they cannot understand what is being said if it is spoken quickly, a problem that a hearing aid alone cannot solve. To address this, speech speed conversion, which converts rapidly spoken speech into slowly spoken speech, has generally been performed as follows: of the voiced sounds, unvoiced sounds, and silent portions that make up speech, the duration of the vowel portions, which are mainly voiced (for example, the /a/ in "ka" /Ka/), is changed in units of one pitch period.

FIG. 3 is a functional block diagram of an example of a prior-art speech speed conversion device. In FIG. 3, the speech uttered by a speaker is divided into voiced, unvoiced, and silent intervals by speech division unit 21 of the analysis section. For the voiced intervals, which have high power, pitch extraction unit 22 extracts the pitch period by an autocorrelation method using a pitch detection window, and pitch division unit 23 then divides the interval into speech segments of one pitch period each. Next, in pitch expansion unit 24 of the synthesis section, the waveform of the voiced interval is interpolated pitch period by pitch period at a ratio corresponding to the expansion ratio supplied by speech speed setting unit 25, changing the duration while keeping the original pitch period. If necessary, the duration of the silent intervals is also changed in silence expansion unit 26. The voiced, unvoiced, and silent intervals whose durations have been changed in this way are combined in speech synthesis unit 27, yielding speech in which only the speech speed has been converted while the pitch period of the original voiced sound, that is, the pitch of the voice, is maintained.
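The autocorrelation-based pitch extraction attributed to pitch extraction unit 22 can be illustrated with a minimal sketch. This is not the circuit of FIG. 3: the function name `estimate_pitch_period`, the lag limits, and the synthetic pulse-train input are all assumptions chosen for illustration. The nested loops show why this step costs on the order of (window length) x (number of candidate lags) multiplications.

```python
import math

def estimate_pitch_period(window, min_lag, max_lag):
    """Return the lag (in samples) with the highest normalized autocorrelation.

    `window` plays the role of the pitch detection window; `min_lag` (>= 1)
    and `max_lag` bound the candidate pitch periods (hypothetical parameters).
    """
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        a = window[:-lag]          # signal
        b = window[lag:]           # signal shifted by `lag` samples
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0
        if score > best_score:     # keep the first (shortest) best lag
            best_lag, best_score = lag, score
    return best_lag

# Synthetic voiced-like input: a pulse train with a 50-sample period.
signal = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
period = estimate_pitch_period(signal, min_lag=20, max_lag=80)
print(period)  # 50
```

On real speech the correlation peak is perturbed by the formant structure of the waveform, which is the accuracy problem the description raises next.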

[0003]

[Problems to Be Solved by the Invention] For speech speed conversion, however, a large amount of arithmetic processing is needed to determine the pitch of the speech segment to be converted and the head position of the speech segment at which the segment is to be inserted and connected. Moreover, when the pitch component is extracted, the formant components of the speech cause the measured pitch-synchronous correlation values to fluctuate and introduce errors, which degrades the extraction accuracy. In addition, a speech waveform generally has a strong correlation between the portions before and after a speech segment, so when a speech segment is inserted in the device described above, the output speech may become discontinuous at the connection point and sound unnatural. The present invention has been made to solve these problems, and its object is to provide a speech speed conversion device that obtains good-quality speech with a small amount of arithmetic processing.

[0004]

[Means for Solving the Problems] To solve the above problems, the speech speed conversion device according to the present invention has a configuration in which speech expansion means comprising a pitch expander and a data storage unit is provided in the stage preceding the short-period prediction synthesis filter that constitutes a decoder of a speech compression coding system, the decoder operating on the basis of parameters of the original speech extracted by an encoder of that speech compression coding system.

[0005]

[Embodiments of the Invention] The speech speed conversion device according to the present invention is based on speech transmission system technology that uses a speech compression coding system such as VSELP. Before the invention is described, therefore, the speech compression coding system on which it is premised is explained, taking the VSELP system as an example. In general, a speech transmission system using VSELP speech compression coding consists of an encoder and a decoder. The encoder and the decoder each consist mainly of a driving excitation source, a long-period prediction synthesis filter, and a short-period prediction synthesis filter. The driving excitation source is an excitation codebook (noise codebook) containing a plurality of noise signal sources, and the required driving excitation is obtained by combining these sources. The long-period prediction synthesis filter is a codebook (adaptive codebook) that outputs the input excitation signal with different delay times.

FIG. 4(a) is a functional block diagram showing an example of a VSELP encoder. In the figure, the driving excitation signal output from noise codebook 11 is input to adder 15 via variable gain amplifier 12. The output of adder 15 is input to adaptive codebook 13, where it becomes an excitation signal having a predetermined pitch period; after variable gain amplifier 14 gives it the required amplitude, it is input to adder 15. There it is added to the driving excitation signal, and the sum is input to short-period prediction synthesis filter 16. Meanwhile, the input speech signal undergoes linear prediction analysis in analysis unit 17 to extract its formant components. The transfer characteristic of short-period prediction synthesis filter 16 is determined from the parameters of the formant components extracted by this analysis. The excitation signal, given formant components by the filter 16 so configured, is compared as synthesized speech with the input speech signal in comparison unit 18. Based on the comparison, control circuit 19 selects the parameters of noise codebook 11, adaptive codebook 13, and variable gain amplifiers 12 and 14 that minimize the error power, which yields the optimum excitation. The parameters giving this optimum excitation, together with the short-period prediction synthesis filter parameters, are transmitted over transmission path 20 to the decoder described below.
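The selection performed by control circuit 19, choosing the codebook entry and gain whose synthesized output has the least error power against the input speech, can be sketched as an exhaustive search. This is a generic analysis-by-synthesis illustration under simplifying assumptions, not the VSELP search itself: the adaptive codebook is omitted, and the names `synthesize` and `search_codebook`, the toy codebook, the gain set, and the one-coefficient filter are all hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter: s[n] = e[n] + sum_k lpc[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out

def search_codebook(target, codebook, gains, lpc):
    """Return (error_power, index, gain) with the smallest error power."""
    best = None
    for index, vector in enumerate(codebook):
        for gain in gains:
            synth = synthesize([gain * v for v in vector], lpc)
            err = sum((t - s) ** 2 for t, s in zip(target, synth))
            if best is None or err < best[0]:
                best = (err, index, gain)
    return best

# Toy data (assumed): three candidate excitation vectors and three gains.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
gains = [0.5, 1.0, 2.0]
lpc = [0.5]

# Build a target that entry 1 with gain 2.0 reproduces exactly.
target = synthesize([2.0 * v for v in codebook[1]], lpc)
err, index, gain = search_codebook(target, codebook, gains, lpc)
print(err, index, gain)  # 0.0 1 2.0
```

Only the winning index and gain, together with the filter coefficients, would need to be sent to the decoder, which is what keeps the transmitted data small.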

[0006] FIG. 4(b) is a functional block diagram showing an example of a VSELP decoder. In the figure, the optimum excitation parameters and the synthesis filter parameters transmitted from the encoder over transmission path 20 are supplied to noise codebook 1, adaptive codebook 3, variable gain amplifiers 2 and 4, and short-period prediction synthesis filter 6 of the decoder, setting the characteristics of each. The output signal of noise codebook 1, which serves as the driving excitation source in the decoder, is input to adder 5 via variable gain amplifier 2. The output of adder 5 is given a pitch component by adaptive codebook 3, passes through variable gain amplifier 4, and is input to adder 5, which outputs the result as the excitation signal. Short-period prediction synthesis filter 6 adds formant components to this excitation signal, and the result is output as synthesized speech. As described above, only the parameters characterizing the speech are transmitted between the encoder and the decoder, which has the advantage of greatly reducing the amount of data to be transmitted. The speech speed conversion device according to the present invention was devised by noting that speech synthesis in the decoder of such a speech compression coding system can output synthesized speech by referring to the parameters alone.
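The decoder signal path described for FIG. 4(b), noise codebook output plus delayed past excitation, followed by the short-period synthesis filter, can be sketched as below. The function `decode_frame` and its parameter names are hypothetical, and the all-pole filter form and the toy values are assumptions; a real VSELP decoder differs in detail.

```python
def decode_frame(noise_vec, g_noise, lag, g_adapt, past_excitation, lpc, past_speech):
    """Decode one frame from parameters only.

    Requires len(past_excitation) >= lag and len(past_speech) >= len(lpc).
    Returns (excitation, speech) for the new frame.
    """
    # Adder: scaled noise-codebook sample + scaled excitation from `lag`
    # samples earlier (the adaptive codebook supplies the pitch component).
    history = list(past_excitation)
    for x in noise_vec:
        history.append(g_noise * x + g_adapt * history[-lag])
    excitation = history[len(past_excitation):]

    # Short-period prediction synthesis filter (all-pole) adds the formants.
    speech = list(past_speech)
    for e in excitation:
        acc = e
        for k, a in enumerate(lpc):
            acc += a * speech[-1 - k]
        speech.append(acc)
    return excitation, speech[len(past_speech):]

# Toy parameters (assumed): a single past pulse, repeated every 4 samples.
excitation, speech = decode_frame(
    noise_vec=[0.0] * 8, g_noise=1.0, lag=4, g_adapt=0.5,
    past_excitation=[1.0, 0.0, 0.0, 0.0], lpc=[0.5], past_speech=[0.0],
)
print(excitation)  # [0.5, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0]
```

The `excitation` list corresponds to the output of adder 5, the point at which the invention taps the signal before any formant component has been added.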

[0007] The present invention is described below on the basis of the embodiment shown in the drawings. The invention is realized by combining the encoder and the decoder of the above speech compression coding system in a single device and adding a pitch expander and a data storage unit in the stage preceding the short-period prediction synthesis filter of the decoder. FIG. 1 is a functional block diagram showing one embodiment of the speech speed conversion device according to the present invention; the portion corresponding to the encoder of the VSELP speech compression coding system is omitted from the figure. In FIG. 1, the difference from the VSELP decoder described above is that the output of adder 5 is input to pitch expander 7 and is also temporarily stored in data storage unit 8. The characteristics of noise codebook 1, adaptive codebook 3, variable gain amplifiers 2 and 4, and short-period prediction synthesis filter 6 are set by the optimum excitation parameters and the synthesis filter parameters supplied from the encoder (not shown).

[0008] The speech speed conversion operation of the device is now described. To expand the speech (slow it down), a one-pitch component (speech segment) of the speech to be expanded is taken from the data stored in data storage unit 8, and pitch expander 7 inserts this speech segment into the input speech according to an expansion ratio set in advance by speech speed setting means (not shown). For example, to expand the speech by N pitch periods, this operation is repeated N times. In general, the VSELP speech compression coding system encodes and decodes speech one frame (40 samples) at a time; data storage unit 8, for example, always holds the most recent four frames of speech data.
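A buffer that always holds the most recent four 40-sample frames, as described for data storage unit 8, can be sketched with a bounded deque. The class name `ExcitationStore` and its methods are hypothetical; only the frame size and the number of retained frames come from the text.

```python
from collections import deque

FRAME_SIZE = 40     # samples per frame, as stated in the description
STORED_FRAMES = 4   # frames retained by the data storage unit

class ExcitationStore:
    """Keeps only the latest STORED_FRAMES frames of the excitation signal."""

    def __init__(self):
        # A bounded deque silently drops the oldest samples on overflow.
        self._samples = deque(maxlen=FRAME_SIZE * STORED_FRAMES)

    def push_frame(self, frame):
        if len(frame) != FRAME_SIZE:
            raise ValueError("expected a frame of %d samples" % FRAME_SIZE)
        self._samples.extend(frame)

    def snapshot(self):
        """Return the stored samples, oldest first."""
        return list(self._samples)

store = ExcitationStore()
for frame_index in range(6):
    # Each dummy frame is filled with its own index so frames are identifiable.
    store.push_frame([float(frame_index)] * FRAME_SIZE)

stored = store.snapshot()
print(len(stored), stored[0], stored[-1])  # 160 2.0 5.0
```

After six frames have been pushed, frames 0 and 1 have been discarded and frames 2 to 5 remain, which matches the "latest four frames" behaviour.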

[0009] FIG. 2 illustrates the signal processing for speech speed conversion in pitch expander 7 and data storage unit 8. FIG. 2(a) shows the excitation signal input to pitch expander 7 of FIG. 1, with speech A, B, C, and D input in succession. Data storage unit 8 stores four frames of the excitation signal, as shown in FIG. 2(b). To expand speech B in FIG. 2(a), one pitch period of speech, namely speech segment b (the portion marked bP in the figure), is cut out of the stored data at point g of speech B, as shown in FIG. 2(b). Next, as shown in FIG. 2(c), pitch expander 7 inserts this speech segment bP at point g, the point from which bP was cut out. To expand by N pitch periods, this operation is repeated N times. FIG. 2(c) shows the case of expansion by two pitch periods.
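The cut-and-insert operation of FIG. 2 can be sketched as a list operation on the stored excitation signal. The function name `expand_at` and its parameters are hypothetical. The pitch period is taken as a given input here; a decoder could plausibly derive it from the adaptive codebook delay, but that mapping is an assumption and is not spelled out in the text.

```python
def expand_at(excitation, g, pitch_period, n_repeats):
    """Insert `n_repeats` copies of the one-pitch segment that starts at
    index `g` back into the signal at index `g`."""
    segment = excitation[g:g + pitch_period]   # segment bP of FIG. 2(b)
    if len(segment) < pitch_period:
        raise ValueError("not enough stored samples after point g")
    # Original samples before g, the inserted copies, then the original
    # signal from g onward (which begins with bP itself).
    return excitation[:g] + segment * n_repeats + excitation[g:]

# Toy excitation (assumed): sample values equal to their own index.
excitation = [float(n) for n in range(12)]
expanded = expand_at(excitation, g=4, pitch_period=3, n_repeats=2)
print(expanded[4:13])  # [4.0, 5.0, 6.0, 4.0, 5.0, 6.0, 4.0, 5.0, 6.0]
print(len(expanded))   # 18
```

With `n_repeats=2` the output is two pitch periods longer than the input, corresponding to the two-pitch expansion case of FIG. 2(c). The expanded excitation would then be passed to the short-period prediction synthesis filter.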

[0010] Because no speech formant component has been added to the input signal of pitch expander 7, the correlation within the signal data is weak. Therefore, if a signal whose length is an integer multiple of one pitch period is inserted at point g of the excitation signal sequence in FIG. 2, the discontinuity at the connection points before and after the inserted speech segment bP can be kept small. The excitation signal that has passed through pitch expander 7 is then input to short-period prediction synthesis filter 6, which adds the formant components, and is output as synthesized speech. Although the present invention has been described above on the basis of the VSELP system, it is not limited to this system and may of course be applied to other speech compression coding systems.

[0011]

[Effects of the Invention] As described above, speech speed conversion has conventionally required a large amount of arithmetic processing to determine the pitch length and the head position of the speech segment to be converted. The speech speed conversion device according to the present invention, being based on an encoder, performs its signal processing using the parameters transmitted from the encoder, so the amount of arithmetic processing for speech speed conversion is extremely small. Furthermore, because pitch component extraction and pitch expansion are performed in the stage preceding the short-period prediction synthesis filter, which adds formant components to the speech, the unnaturalness in the output speech caused by discontinuities before and after an inserted speech segment does not arise. The speech speed conversion device according to the present invention therefore has the effect of achieving good-quality speech speed conversion with a small amount of computation.

[Brief Description of the Drawings]

FIG. 1 is a functional block diagram showing one embodiment of the speech speed conversion device according to the present invention.

FIGS. 2(a) to 2(c) are explanatory diagrams of the pitch expansion processing in the pitch expander of the speech speed conversion device according to the present invention.

FIG. 3 is a functional block diagram of a conventional speech speed conversion device.

FIG. 4 shows functional block diagrams of the VSELP speech compression coding system: (a) is a functional block diagram of the encoder, and (b) is a functional block diagram of the decoder.

[Explanation of Symbols]

1: Noise codebook of the decoder
2: Variable gain amplifier of the decoder
3: Adaptive codebook of the decoder
4: Variable gain amplifier of the decoder
5: Adder of the decoder
6: Short-period prediction synthesis filter of the decoder
7: Pitch expander added according to the present invention
8: Data storage unit added according to the present invention
11: Noise codebook of the encoder
12: Variable gain amplifier of the encoder
13: Adaptive codebook of the encoder
14: Variable gain amplifier of the encoder
15: Adder of the encoder
16: Short-period prediction synthesis filter of the encoder
17: Analysis unit of the encoder
18: Comparison unit of the encoder
19: Control circuit of the encoder
20: Transmission path
21: Speech division unit
22: Pitch extraction unit
23: Pitch division unit
24: Pitch expansion unit
25: Speech speed setting unit
26: Silence expansion unit
27: Speech synthesis unit

Claims

[Claims]

[Claim 1] A speech speed conversion device characterized in that speech expansion means comprising a pitch expander and a data storage unit is provided in the stage preceding a short-period prediction synthesis filter constituting a decoder of a speech compression coding system, the decoder operating on the basis of parameters of original speech extracted by an encoder of the speech compression coding system.