JP2010156740A

JP2010156740A - Speech synthesizer and speech processing system

Info

Publication number: JP2010156740A
Application number: JP2008333607A
Authority: JP
Inventors: Takaya Kakisaki; 貴也柿▲さき▼; Shinya Sakurada; 信弥櫻田; Takuro Sone; 卓朗曽根
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2008-12-26
Filing date: 2008-12-26
Publication date: 2010-07-15
Anticipated expiration: 2028-12-26
Also published as: JP5446256B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer that can superimpose a modulation signal causing no audible discomfort, without restricting the bandwidth of the audio.
SOLUTION: In a sound source 13, the waveform memory 131N that represents a noise timbre stores its waveform as a pseudo-noise code sequence (a pseudo-noise waveform) with high autocorrelation, such as a pseudo-noise (PN) code. A control section 14 carries out data communication by controlling the polarity of the pseudo-noise code sequence in the waveform memory 131N. When a bit of the transmission data 141 is "1", the PN code is output as is. When the bit is "0", the PN code is output with its polarity inverted (anti-phase).
COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a speech synthesizer that synthesizes sound, and to a speech processing system that uses the speech synthesizer.

Conventionally, modulation schemes such as OFDM have been used for data communication (see Patent Document 1). Such a scheme could also be used to carry data by superimposing a modulation signal on an audio signal, but in bands close to the audible range the modulation signal is heard as noise. Possible workarounds are to mix the modulation signal with a musical sound signal so that it is harder to hear (audio watermarking), or to pass it through a high-pass filter so that it is superimposed only in the inaudible band.

Patent Document 1: JP 2000-59329 A

However, audio watermarking cannot always mask the noise, and listeners may still notice something unnatural. Superimposing the modulation signal in the inaudible band, for its part, requires band-limiting the audio so that the modulation signal can be separated from the ordinary sound, which degrades sound quality.

An object of the present invention is therefore to provide a speech synthesizer that can superimpose a modulation signal causing no audible discomfort, without any need to band-limit the audio.

The speech synthesizer of the present invention synthesizes sound from a plurality of sample signals based on input parameters, and is characterized by comprising a storage unit that stores a part of the sample signals as a pseudo-noise signal.

Examples of sample signals include the tones of a MIDI sound source, the synthesis filter bank of an audio-compression decoder, and the codebook of a CELP decoder. Taking MIDI as an example, a MIDI sound source always includes some tones that resemble white noise. In the present invention, the part of the MIDI sound source that represents such a noise tone is stored as pseudo-noise (a PN code). The pseudo-noise can be made perceptually equivalent to the noise tone originally included in the MIDI sound source. The speech synthesizer of the present invention can therefore superimpose a modulation signal that causes no audible discomfort, without band-limiting the audio.

A control unit can transmit various data by controlling the polarity of the pseudo-noise signal. On the demodulation side, the data can be demodulated by correlating the input sound with the same pseudo-noise signal as the one that was superimposed.

According to the present invention, a modulation signal that causes no audible discomfort can be superimposed without any need to band-limit the audio.

Embodiments of the speech synthesizer and speech processing system of the present invention will now be described. FIG. 1(A) shows an automatic performance device as the speech synthesizer of this embodiment, and FIG. 1(B) shows a demodulation device. A/D and D/A converters are omitted in this embodiment, and all processing is digital unless otherwise noted. Ellipses in the drawings denote information content.

The automatic performance device 1 includes a MIDI interface (I/F) 11, a sequencer 12, a sound source 13, a control unit 14, and an audio output I/F 15. MIDI parameters (data conforming to the MIDI standard) are input to the MIDI interface 11.

The sequencer 12 reads the waveforms stored in waveform memories 131A to 131N of the sound source 13 in accordance with the MIDI parameters (note-on, note-off, and so on) input from the MIDI interface 11, and synthesizes sound. The waveform memories 131A to 131N of the sound source 13 correspond to the sample signals of the present invention. The MIDI parameters include data that selects a timbre (program change), including data that selects a noise timbre (for example, one resembling a hi-hat cymbal).

In the automatic performance device 1 of this embodiment, the waveform memory 131N of the sound source 13, which represents the noise timbre, is stored as a pseudo-noise code sequence (a pseudo-noise signal waveform) with high autocorrelation, such as an M-sequence or a Gold sequence. Data communication is performed by having the control unit 14 control the polarity of the pseudo-noise code sequence in the waveform memory 131N. Specifically, when a bit of the transmission data 141 is "1", the PN code is output with its polarity unchanged; when the bit is "0", the PN code is output with its polarity inverted (anti-phase). The receiving side can demodulate the superimposed "1" and "0" bits by detecting the phase of the calculated correlation value. Alternatively, two waveform memories of opposite polarity may be provided, with the control unit 14 selecting which of them is read.
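The polarity keying described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the 10-stage shift register with feedback taps at stages 3 and 10 (which yields a 1023-chip M-sequence), the all-ones seed, and the names `m_sequence` and `modulate` are chosen for the example and do not appear in the specification.

```python
def m_sequence(taps=(3, 10), length=10):
    """Return one period of a maximal-length sequence as +1/-1 chips.

    Assumed generator: a 10-stage LFSR with feedback from stages 3 and 10.
    """
    state = [1] * length            # any non-zero seed works
    chips = []
    for _ in range(2 ** length - 1):
        chips.append(1 if state[-1] else -1)   # output of the last stage
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]        # shift by one stage
    return chips


def modulate(bits, pn):
    """Emit the PN code as is for a 1 bit and polarity-inverted for a 0 bit."""
    out = []
    for b in bits:
        sign = 1 if b == 1 else -1
        out.extend(sign * c for c in pn)
    return out


pn = m_sequence()                   # stands in for waveform memory 131N
signal = modulate([1, 0, 1], pn)    # three bits of transmission data
```

Storing a second, pre-inverted copy of `pn` and selecting between the two copies would correspond to the two-memory alternative mentioned in the text.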

The audio signal synthesized by the sequencer 12, which includes the pseudo-noise, is output via the output I/F 15, amplified by an amplifier or the like, and emitted as sound. The emitted sound is picked up by a microphone or the like and input to the demodulation device 2 of FIG. 1(B).

In the demodulation device 2 of FIG. 1(B), the audio signal input from the input I/F 21 is fed to a matched filter 22. The matched filter 22 is a correlation calculation unit that computes the correlation between the input audio signal and the pseudo-noise. It is implemented as an FIR filter whose coefficients are set to the pseudo-noise code sequence held in the waveform memory 131N on the transmitting side. Because a PN code has very high autocorrelation, the matched filter 22 outputs a correlation peak (a correlation value at or above a predetermined level) when the input sound contains the PN code. The peak is positive when the phase is non-inverted and negative when the phase is inverted.
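A minimal sketch of such a matched filter is given below, assuming a noise-free input and a deliberately short 7-chip M-sequence so that the numbers stay readable. The code `PN`, the helper names, and the detection level of 80% of the code length are illustrative assumptions, not values from the specification.

```python
PN = [1, 1, 1, -1, 1, -1, -1]   # assumed 7-chip M-sequence (illustration only)


def matched_filter(x, pn):
    """FIR filter whose coefficients are the time-reversed PN code."""
    taps = pn[::-1]
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:              # samples before the start count as zero
                acc += h * x[n - k]
        y.append(acc)
    return y


def find_peaks(y, level):
    """Return (index, value) for every output at or above the given level."""
    return [(n, v) for n, v in enumerate(y) if abs(v) >= level]


# Silence, one non-inverted code (bit 1), one inverted code (bit 0), silence.
received = [0] * 5 + PN + [-c for c in PN] + [0] * 3
corr = matched_filter(received, PN)
peaks = find_peaks(corr, level=0.8 * len(PN))
```

Under these assumptions `peaks` holds one positive peak at the sample where the non-inverted code ends and one negative peak where the inverted code ends. A real input would also contain music and room noise, which is why the text defines a peak relative to a predetermined level.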

The demodulation unit 23 demodulates data from the output of the matched filter 22. That is, it demodulates a "1" bit when a positive correlation peak is input from the matched filter 22 and a "0" bit when a negative correlation peak is input. The output period of the pseudo-noise is determined in advance, and once a correlation peak is input the demodulation unit 23 holds the bit output for the length of that period. For example, if the pseudo-noise period is 1023 samples, a positive correlation peak causes "1" to be output for 1023 consecutive samples.
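The bit decision and hold behaviour can be sketched as follows. The function name `demodulate`, the use of `None` for samples where no bit is being output, and the toy correlation values are assumptions made for this example.

```python
def demodulate(corr, level, period):
    """Map correlation peaks to bits and hold each bit for one PN period.

    A positive peak gives 1, a negative peak gives 0. Samples before the
    first peak are marked None (an assumption of this sketch).
    """
    out = [None] * len(corr)
    n = 0
    while n < len(corr):
        if abs(corr[n]) >= level:
            bit = 1 if corr[n] > 0 else 0
            end = min(n + period, len(corr))
            for i in range(n, end):
                out[i] = bit            # hold the bit for the whole period
            n = end
        else:
            n += 1
    return out


# Toy matched-filter output with a period of 4 samples instead of 1023.
bits = demodulate([0, 0, 7, 1, -1, 0, -7, 0, 0, 3], level=5, period=4)
```

With a 1023-sample code, `period=1023` reproduces the example in the text, where a positive peak yields 1023 consecutive "1" outputs.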

The demodulation unit 23 thus recovers the transmission data 231. In terms of frequency characteristics the pseudo-noise is simply a noise sound such as white noise, and it can be made perceptually equivalent to the noise tone originally included in the MIDI sound source. Unlike conventional audio watermarking, in which a modulation signal is superimposed on ordinary audio (a musical sound signal), here the musical sound signal itself contains the modulation signal. The transmitting side therefore need not band-limit anything, and the modulated signal it outputs causes no audible discomfort.

When the synthesis device and the demodulation device form a closed loop, the following additional information can also be extracted. FIG. 2 shows the configuration of a karaoke device 100 as an example of a device that forms a closed loop and incorporates both the automatic performance device 1 (the synthesis device) and the demodulation device 2.

A microphone 101 and a speaker 102 are connected to the karaoke device 100. Karaoke song data (MIDI data) is input to the MIDI interface 11 of the automatic performance device 1 from a storage unit (not shown) of the karaoke device 100. The automatic performance device 1 generates the karaoke performance sound from this song data, and the generated sound includes the pseudo-noise. The karaoke performance sound containing the pseudo-noise is input to a signal processing device 7, subjected to signal processing such as equalization and amplification, and emitted from the speaker 102.

The karaoke performance sound emitted from the speaker 102 is picked up by the microphone 101 together with the singing voice. The sound input from the microphone 101 is fed to the demodulation device 2 and to the signal processing device 7. A closed loop is formed in this way.

In the example of FIG. 2, the demodulation unit 23 of the demodulation device 2 extracts (estimates) a closed-loop delay 232 and a loop gain 233 from the pseudo-noise contained in the karaoke performance sound that has fed back.

The estimation of the delay 232 and the loop gain 233 in the demodulation unit 23 is described below. FIG. 3 schematically shows the correlation as a function of time.

When a correlation peak is first input after the time at which the automatic performance device 1 output the pseudo-noise, the demodulation unit 23 regards the correlation values in that first time window as the direct-wave component of the fed-back karaoke performance sound, and finds the peak of the direct wave. For this purpose, information indicating when the pseudo-noise was output (for example, a note-on message) is assumed to be supplied from the automatic performance device 1 to the demodulation unit 23.

When a correlation peak is input, the demodulation unit 23 temporarily stores the correlation values over the following predetermined time window t1 in a memory (not shown), extracts the highest-level correlation value within t1, and takes it as the peak value a0. The predetermined level used to detect a correlation peak is set according to the level of stationary noise. The window t1 is set according to factors such as the accuracy of the correlation calculation (for example, the code length of the pseudo-noise code).

The demodulation unit 23 estimates the closed-loop delay 232 as the time difference between the time at which the automatic performance device 1 output the pseudo-noise and the time at which the direct-wave peak value a0 is obtained. The closed-loop delay 232 corresponds to the distance from the speaker to the microphone.
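The direct-wave peak search and the delay estimate can be sketched as below. The function names, the sample-index representation of time, the 48 kHz sample rate, and the speed of sound of 343 m/s are assumptions for illustration. The sketch also ignores any fixed latency of the matched filter and of the signal processing device, which a real system would need to subtract before converting the delay to a distance.

```python
SPEED_OF_SOUND = 343.0   # m/s, assumed value for air at about 20 degrees C


def direct_wave_peak(corr, level, window):
    """Return (index, value) of the largest correlation in the window t1
    that starts at the first sample reaching the detection level."""
    for n, v in enumerate(corr):
        if abs(v) >= level:
            seg = corr[n:n + window]
            k = max(range(len(seg)), key=lambda i: abs(seg[i]))
            return n + k, seg[k]
    return None   # no peak found


def loop_delay(emit_index, peak_index, sample_rate):
    """Closed-loop delay in seconds between PN output and the peak a0."""
    return (peak_index - emit_index) / sample_rate


index, a0 = direct_wave_peak([0, 1, 6, 9, 4, 0], level=5, window=3)
delay = loop_delay(emit_index=0, peak_index=480, sample_rate=48000)
distance = delay * SPEED_OF_SOUND   # rough speaker-to-microphone distance in m
```

Here `emit_index` plays the role of the note-on timing that the automatic performance device supplies to the demodulation unit.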

If a correlation peak is input again after the window t1 has elapsed since the first correlation peak, the demodulation unit 23 regards that correlation value as a reflected wave and finds the peak of the reflected wave. The output period of the pseudo-noise from the automatic performance device 1 is assumed to be sufficiently longer than the reverberation time of the room.

As before, when this correlation peak is input, the demodulation unit 23 temporarily stores the correlation values over the following window t1 in the memory, extracts the highest-level correlation value, and takes it as the peak value a1. Reflected-wave peak values (a1, a2, ...) are extracted in the same way over a predetermined time length t2.

The demodulation unit 23 then takes the absolute values of the extracted direct-wave and reflected-wave peak values (|a0|, |a1|, |a2|, ...) and estimates the loop gain 233 from their sum.
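The windowed peak extraction and the summation can be sketched as follows. The specification only says that the loop gain is estimated from the sum of the absolute peak values; dividing that sum by a `reference` value (the correlation peak expected at unity loop gain) is an assumption of this sketch, as are the function names and the toy data.

```python
def extract_peaks(corr, level, window, span):
    """Collect one peak per window t1 (`window`), starting at each sample
    that reaches the detection level, for a total length t2 (`span`)
    counted from the first detection."""
    peaks = []
    limit = None
    n = 0
    while n < len(corr) and (limit is None or n < limit):
        if abs(corr[n]) >= level:
            if limit is None:
                limit = n + span               # t2 starts at the direct wave
            seg = corr[n:n + window]
            peaks.append(max(seg, key=abs))    # a0, then a1, a2, ...
            n += window
        else:
            n += 1
    return peaks


def loop_gain(peaks, reference):
    """Sum of absolute peak values, normalised by an assumed reference."""
    return sum(abs(p) for p in peaks) / reference


corr = [0, 8, 10, 3, 0, 0, -5, -4, 0, 0, 2, 0]
peaks = extract_peaks(corr, level=4, window=3, span=10)
gain = loop_gain(peaks, reference=20)
```

Taking absolute values matters because a reflected copy of an inverted code produces a negative peak, as in the second window of the toy data.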

When the loop gain 233 approaches a predetermined threshold, the demodulation device 2 judges that howling (acoustic feedback) is likely and instructs the signal processing device 7 to reduce its gain. The demodulation device 2 may also issue a warning when the loop gain 233 approaches the threshold, for example by lighting an LED on the karaoke device or showing a warning on the karaoke display. Either gain reduction or the warning may be performed alone, or the warning may be issued while the gain of the audio signal is being reduced. The warning may also be issued first, with gain reduction following.

The threshold may be any value. For example, the loop-gain estimate at a moment when the karaoke device 100 actually detected howling can be stored in memory, and the threshold can be set to that stored estimate adjusted by a certain margin.
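One way to read the threshold and warning logic is sketched below. The 6 dB margin, the interpretation of "approaching the threshold" as reaching a value set below the stored howling gain, the action names, and the warn-first policy are all assumptions for illustration; the text allows other combinations.

```python
def howling_threshold(gain_at_howling, margin_db=6.0):
    """Threshold set below the stored loop gain at which howling occurred.

    The 6 dB default margin is an assumed value.
    """
    return gain_at_howling * 10 ** (-margin_db / 20)


def react(loop_gain, threshold, already_warned):
    """Warn first, then ask for gain reduction if the gain stays high."""
    if loop_gain < threshold:
        return "none"
    return "suppress_gain" if already_warned else "warn"


threshold = howling_threshold(1.0)          # about 0.5 for a 6 dB margin
first = react(0.6, threshold, already_warned=False)
second = react(0.6, threshold, already_warned=True)
```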

The method of estimating the closed-loop delay and the loop gain is not limited to the one above. For example, the peak value of the direct wave alone may be used as the loop-gain estimate.

The embodiment above uses a MIDI-based automatic performance device as the example of a speech synthesizer, but the speech synthesizer of the present invention can also be realized in forms such as the following modifications.

(Modification 1)
FIG. 4 is a simplified diagram of the main components of a CELP decoder, as an example of the speech synthesizer according to Modification 1. The CELP decoder 3 includes an input I/F 31, an adaptive codebook 32, a selector 33, a codebook 34, an adder 35, an output I/F 36, and a control unit 37. In practice a synthesis filter such as a linear prediction filter follows the adder 35, but it is omitted from the figure and the description here.

CELP coded data is input to the input I/F 31 and passed to the adaptive codebook 32 and the selector 33.

The adaptive codebook 32 reuses the signal previously output by the adder 35, based on pitch information contained in the input coded data, and outputs it.

The selector 33 reads a sample signal (audio signal) from the codebook 34 according to the codebook index contained in the input coded data and outputs it to the adder 35. The adder 35 adds the signal output from the adaptive codebook 32 to the signal output from the selector 33 and outputs the sum to the output I/F 36.

The codebook 34 contains fixed codebooks 341A to 341N, which store various audio signals (waveform data). The fixed codebook 341N stores the waveform data of a noise sound. In the CELP decoder 3 of this embodiment, the fixed codebook 341N serves as the codebook that stores the pseudo-noise code sequence. The control unit 37 controls the polarity of the pseudo-noise code in the fixed codebook 341N to superimpose transmission data 371.
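The signal path of this modification can be sketched in a heavily simplified form. The sketch omits the codebook gains and the linear prediction synthesis filter of a real CELP decoder, assumes the pitch lag is at least one frame long, and uses hypothetical names (`decode_frame`, `pn_index`) and toy codebook contents.

```python
def decode_frame(past, pitch_lag, codebook, index, pn_index, bit):
    """Add the adaptive-codebook output to the selected fixed-codebook entry.

    past      : previously output samples (input to the adaptive codebook)
    pitch_lag : lag in samples, assumed to be >= the frame length
    pn_index  : index of the entry that holds the PN code (hypothetical)
    bit       : transmission bit; 0 inverts the polarity of the PN entry
    """
    fixed = codebook[index]
    start = len(past) - pitch_lag
    adaptive = [past[start + i] for i in range(len(fixed))]   # reuse past output
    if index == pn_index:
        sign = 1 if bit == 1 else -1
        fixed = [sign * v for v in fixed]                     # polarity keying
    return [a + f for a, f in zip(adaptive, fixed)]


past = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
codebook = [[0.5, 0.5, 0.5, 0.5], [1, -1, 1, 1]]   # entry 1 stands in for 341N
frame_bit0 = decode_frame(past, 4, codebook, index=1, pn_index=1, bit=0)
frame_bit1 = decode_frame(past, 4, codebook, index=1, pn_index=1, bit=1)
```

Entries other than the PN entry pass through unchanged, so only frames that select the noise codebook carry data.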

The output audio signal is input to the demodulation device shown in FIG. 1(B), which demodulates the transmission data and estimates the delay, the loop gain, and so on.

Here too, the pseudo-noise is, in terms of frequency characteristics, simply a noise sound such as white noise, and it can be made perceptually equivalent to the noise sound originally included in the codebook of the CELP decoder. The speech synthesizer of the present invention can thus also be implemented in a CELP decoder.

(Modification 2)
FIG. 5 is a simplified diagram of the main components of an audio-compression decoder, as an example of the speech synthesizer according to Modification 2. The decoder 4 includes an input I/F 41, a filter bank 42, an output I/F 43, and a control unit 44. In practice, functional units that perform processing such as Huffman decoding and inverse quantization precede the filter bank 42, but they are omitted from the figure and the description here.

Coded data (a bitstream) of an audio compression format is input to the input I/F 41. The bitstream undergoes Huffman decoding and inverse quantization and is then input to the filter bank 42, where it is decoded into an audio signal (audio data) that is output from the output I/F 43.

The synthesis filter banks 421A to 421N of the filter bank 42 are filters that each turn their input value (filter bank value) into a waveform; the synthesis filter bank 421N synthesizes the waveform of a noise sound. The output waveforms of the synthesis filter banks are combined by a synthesis unit 423 and output to the output I/F 43 as an audio signal (audio data). In the decoder 4 of this embodiment, the synthesis filter bank 421N serves as the filter bank that synthesizes the pseudo-noise code sequence. The control unit 44 controls the polarity of the pseudo-noise code of the synthesis filter bank 421N to superimpose transmission data 441.

Here too, the pseudo-noise is, in terms of frequency characteristics, simply a noise sound such as white noise, and it can be made perceptually equivalent to the noise sound originally output from the synthesis filter bank of the decoder. The speech synthesizer of the present invention can thus also be implemented in an audio-compression decoder.

Needless to say, the present invention is not limited to Modifications 1 and 2 and can be applied to any speech synthesizer whose sample signals include a noise sound.

A block diagram showing the configuration of the speech synthesizer.
A diagram showing the time-axis characteristic of the correlation.
A diagram showing the configuration of a karaoke device, as an example of a device that forms a closed loop and incorporates the automatic performance device and the demodulation device.
A diagram showing a CELP decoder as the speech synthesizer.
A diagram showing an audio-compression decoder as the speech synthesizer.

Explanation of symbols

1: automatic performance device
2: demodulation device

Claims

A speech synthesizer that synthesizes sound from a plurality of sample signals based on input parameters, the speech synthesizer comprising a storage unit that stores a part of the sample signals as a pseudo-noise signal.

The speech synthesizer according to claim 1, further comprising a control unit that controls the polarity of the pseudo-noise signal in the storage unit.

A speech processing system comprising a demodulation device that receives the sound output by the speech synthesizer according to claim 1 or claim 2, the demodulation device comprising:
a correlation calculation unit that computes the correlation between the received sound and the pseudo-noise signal; and
a demodulation unit that performs demodulation based on the correlation value calculated by the correlation calculation unit.