JP2007047313A

JP2007047313A - Speech speed conversion apparatus

Info

Publication number: JP2007047313A
Application number: JP2005229901A
Authority: JP
Inventors: Kazuki Sakai; 和樹酒井
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2005-08-08
Filing date: 2005-08-08
Publication date: 2007-02-22

Abstract

PROBLEM TO BE SOLVED: To provide a speech speed conversion apparatus capable of converting speech into natural-sounding speech while maintaining clarity, including in transitions between vowels (co-articulation). SOLUTION: An expansion processing section 12 applies weighting so that the expansion rate for a co-articulation section between vowels, in the sound data output from an analysis buffer section 6a following analysis by a sound analysis section 6, is smaller than the expansion rate for the vowels in that sound data, and performs expansion processing on the sound data at each expansion rate. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech speed conversion apparatus that converts the playback speed of an input speech signal to a speed slower than that of the original speech signal and plays it back.

Techniques for expanding speech along the time axis without changing its pitch, while preserving its phonemic quality and speaker individuality (speech speed conversion techniques), have already been developed. Because expanded speech plays back more slowly, these techniques are effective for hearing assistance for the elderly and for language learning. However, speech consists of vowels and consonants whose acoustic characteristics differ greatly, and expanding it uniformly reduces its clarity.

Patent Document 1 below classifies speech into vowels, consonants, vowel-to-consonant transitions, and noise, and sets an expansion rate for each class. Specifically, the input speech signal is classified into these classes based on the similarity between neighboring waveforms of the signal. In a vowel section, the speech waveform is periodic, so the similarity between neighboring waveforms is high. In a consonant section, the speech waveform is close to white noise, so the similarity between neighboring waveforms is low. In the transition period from a consonant to a vowel and from a vowel to a consonant, the similarity lies between that of a vowel section and that of a consonant section. Therefore, when the similarity is intermediate, the input speech signal is judged to be a transition section.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 09-152889 (JP 09-152889 A)

However, Patent Document 1 gives no consideration to transitions between vowels. Most speech consists of vowels, and speech contains a great many transitions between vowels (co-articulation). Speech in which these co-articulation portions are expanded uniformly with the vowels, which are stationary portions, differs greatly from speech uttered slowly by a person, and sounds very unnatural and unclear.

The present invention has been made in view of the above circumstances, and its object is to provide a speech speed conversion apparatus capable of converting speech that includes transitions between vowels (co-articulation) into natural-sounding speech while maintaining its clarity.

To solve the above problem, a speech speed conversion apparatus according to the present invention, which converts the playback speed of an input speech signal to a speed slower than that of the original speech signal and plays it back, comprises: a speech input unit to which an analog speech signal is input; an A/D conversion unit that converts the analog speech signal input by the speech input unit into digital speech data; an input buffer unit that stores the digital speech data from the A/D conversion unit; a transfer control unit that controls the transfer of the digital speech data stored in the input buffer unit to the subsequent stage; a speech analysis unit that stores the digital speech data supplied under the transfer control of the transfer control unit in an analysis buffer unit, in the amount needed for analysis, analyzes whether sound is present to discriminate sound data from silent data, further analyzes the sound data into consonants, vowels, and co-articulation portions between vowels, and outputs the data from the analysis buffer unit; a data processing unit that makes the expansion rate for the co-articulation portions between vowels in the sound data output from the analysis buffer unit smaller than the expansion rate for the vowels in the sound data, expands the sound data at each expansion rate, and deletes the silent data output from the analysis buffer unit; a control unit that controls the analysis of the digital speech data by the speech analysis unit and controls the expansion processing and the deletion processing by the data processing unit; an output buffer unit that stores the sound data expanded by the data processing unit; a D/A conversion unit that converts the speech data read from the output buffer unit into an analog speech signal; and a speech output unit that outputs the analog speech signal from the D/A conversion unit.

The speech analysis unit discriminates whether the input speech is sound or silence, and further detects, from the non-stationary portions within the vowel portions of the sound, the co-articulation portions that are transitions from one vowel to another. When converting the speech speed, the data processing unit applies different conversion ratios (expansion rates) to the co-articulation portions and the vowels, thereby obtaining more natural speed-converted speech.

According to the speech speed conversion apparatus of the present invention, speech that includes transitions between vowels (co-articulation) can be converted into natural-sounding speech while maintaining its clarity.

The best mode for carrying out the present invention will be described below with reference to the drawings.

This embodiment is a speech speed conversion apparatus that, by real-time processing, converts the playback speed of an input speech signal to a speed slower than that of the original speech signal and plays it back. It is used to support learners of a non-native language, to assist the hearing of elderly or disabled people whose listening ability has declined, and to aid mutual understanding in two-way conversation between speakers of different languages.

FIG. 1 is a block diagram of the speech speed conversion apparatus. The speech speed conversion apparatus 1 includes a speech input unit 2, an A/D conversion unit 3, an input buffer unit 4, a transfer control unit 5, a speech analysis unit 6, a speech data analysis and processing control unit 7, a data processing unit 8, an output buffer unit 9, a D/A conversion unit 10, and a speech output unit 11.

The speech input unit 2 receives an analog speech signal. The A/D conversion unit 3 converts the analog speech signal input by the speech input unit 2 into digital speech data. The input buffer unit 4 stores the digital speech data from the A/D conversion unit 3. The transfer control unit 5 controls the transfer of the digital speech data stored in the input buffer unit 4 to the subsequent stage.

The speech analysis unit 6 stores the digital speech data supplied under the transfer control of the transfer control unit 5 in an analysis buffer unit 6a, in the amount needed for analysis, analyzes whether sound is present to discriminate sound data from silent data, further analyzes the sound data into consonants, vowels, and co-articulation portions between vowels, and outputs the data from the analysis buffer unit 6a.

The data processing unit 8 applies expansion processing to each piece of sound data output from the analysis buffer unit 6a following analysis by the speech analysis unit 6, at an expansion rate determined by the processing described later, and applies deletion processing to the silent data. For this purpose, the data processing unit 8 includes an expansion processing unit 12 and a deletion processing unit 13.

The expansion processing unit 12 weights the sound data output from the analysis buffer unit 6a following analysis by the speech analysis unit 6 so that the expansion rate for co-articulation portions between vowels is smaller than the expansion rate for vowels, and expands the sound data at each resulting rate. Specifically, the expansion processing unit 12 expands the sound data at the rate 1.0 + (R - 1.0) × W, where R is a preset expansion rate (R ≥ 1.0) and W is a weighting coefficient that depends on the stationarity of the sound data. In this embodiment, waveform power and formant trajectory are used as the analysis parameters for the stationarity of the sound data.

The deletion processing unit 13 deletes the silent data output from the analysis buffer unit 6a. Specifically, it deletes silent data as needed so that the time lag between the playback timing of the original speech and that of the expanded speech does not grow large.
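The source does not say how much silence is deleted or when. The following is only a sketch of one possible policy, assuming that the accumulated lag is tracked in seconds and that each silent segment is shortened by at most the current lag. The function name `schedule`, the segment representation, and the `rate_for` callback are hypothetical.

```python
def schedule(segments, rate_for):
    """Return output durations and the remaining lag in seconds.

    segments: list of (label, duration_sec) pairs; label "silence" marks silent data.
    rate_for: callable mapping a sound label to its expansion rate (assumed >= 1.0).
    """
    lag = 0.0  # how far the output currently runs behind the original
    out = []
    for label, duration in segments:
        if label == "silence":
            # Assumed policy: drop as much silence as needed to cancel the lag.
            removed = min(duration, lag)
            lag -= removed
            out.append((label, duration - removed))
        else:
            stretched = duration * rate_for(label)
            lag += stretched - duration
            out.append((label, stretched))
    return out, lag
```

With one second of vowel expanded at 1.2 followed by half a second of silence, this policy shortens the silence to about 0.3 seconds and brings the lag back to zero.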

The speech data analysis and processing control unit 7 controls the analysis of the digital speech data by the speech analysis unit 6, and also controls the expansion processing in the expansion processing unit 12 and the deletion processing in the deletion processing unit 13 of the data processing unit 8.

The following describes, with reference to FIG. 2, how the speech analysis unit 6, under the control of the speech data analysis and processing control unit 7, detects co-articulation portions and the remaining portions.

FIG. 2(a) is a characteristic diagram showing how the waveform power, computed for each fixed number of samples (frame), changes with time [sec]. FIG. 2(b) is a characteristic diagram of the formant trajectory corresponding to the waveform power diagram.

In FIG. 2(a), it can be determined whether the waveform power is larger than each of the two thresholds t0 and t1. In FIG. 2(b), the clarity and stability of the formant trajectory can be judged.
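The source does not define waveform power precisely. As a minimal sketch, assuming it means the mean squared sample value over non-overlapping frames, the per-frame power plotted in FIG. 2(a) could be computed as follows. The function name `frame_powers` is hypothetical.

```python
def frame_powers(samples, frame_len):
    """Mean-square power of each complete, non-overlapping frame (assumed definition)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers
```

Each resulting value would then be compared against the thresholds t0 and t1.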

The clarity and stability of the formant trajectory are explained here.

First, formant clarity. For each predetermined section (for example, 1024 points), a power spectrum analysis is performed on the characteristic obtained by removing the influence of the vocal cords from the speech signal (the vocal tract characteristic), and the time window is shifted little by little so that the result can be displayed as a so-called spectrogram. Several methods for this, such as the cepstrum and linear prediction, are known and widely used.

In the power spectrum of the vocal tract characteristic for a given section, the intensity and sharpness of the spectrum are evaluated for the first formant, or for the second and third formants as well, and the clarity of the formants is judged from the result.

The intensity of the spectrum at a formant may be judged by whether its level is at or above a predetermined reference value, or alternatively by evaluating the signal power of that speech section.

The sharpness of the spectrum at a formant may be evaluated using, for example, the -3 dB bandwidth around the frequency of the spectral peak as an index.
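As a rough illustration of the -3 dB bandwidth index, the sketch below measures the width of the region around a spectral peak that stays within 3 dB of the peak level. It assumes the spectrum is already available in dB at evenly spaced bins and works at bin resolution without interpolation. The function name `bandwidth_3db` and its parameters are hypothetical.

```python
def bandwidth_3db(spectrum_db, peak_index, bin_hz):
    """Width in Hz of the contiguous region within 3 dB of the peak (bin resolution)."""
    limit = spectrum_db[peak_index] - 3.0
    lo = peak_index
    while lo > 0 and spectrum_db[lo - 1] >= limit:
        lo -= 1
    hi = peak_index
    while hi < len(spectrum_db) - 1 and spectrum_db[hi + 1] >= limit:
        hi += 1
    return (hi - lo + 1) * bin_hz
```

A narrower bandwidth corresponds to a sharper, and therefore clearer, formant peak.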

Next, formant stability. Using the same technique as above, a spectrogram of the vocal tract characteristic is obtained, and the change over time of the formant frequency is measured and evaluated for the first formant, or for the second and third formants as well, to determine the stability of the formants.

More specifically, the formant frequency is observed every unit time and the variance of its frequency distribution is calculated. When the variance (degree of spread) is small, the formant trajectory is judged to be stable; when the variance is larger than a predetermined value, the formant trajectory is judged not to be stable.
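The variance test described above can be sketched as follows, assuming a formant track is given as a list of frequencies in Hz, one per unit time. The function name `formant_is_stable` and the threshold value used in the usage note are hypothetical; the source gives no concrete threshold.

```python
from statistics import pvariance

def formant_is_stable(track_hz, max_variance):
    """True when the variance of the observed formant frequencies is within the limit."""
    return pvariance(track_hz) <= max_variance
```

For example, with an illustrative limit of 100.0 Hz², the track [700, 705, 698, 702] would count as stable, while [700, 900, 1100, 1300] would not.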

As another technique, the cepstrum, a speech analysis tool, can also be used. The cepstrum is the inverse Fourier transform of the logarithm of the power spectrum obtained by a Fourier transform.

According to cepstrum analysis, the fine structure of the spectrum (the fundamental frequency component, which for speech depends on the vocal cords) appears as a peak at high quefrency, while the spectral envelope (which for speech depends on the vocal tract, that is, the position and shape of the tongue, jaw, and lips) is concentrated at low quefrency. Extracting the peak in the high-quefrency part can therefore be used to evaluate the periodicity of the speech.

A portion where this periodicity is distinct is judged to be a vowel portion. Extracting the low-quefrency part gives the envelope of the power spectrum, and evaluating its variation over time by statistical methods determines the stability of the formants. Methods such as linear predictive analysis can also be used for formant extraction.
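The cepstrum-based periodicity check can be illustrated with a small sketch. It uses a naive O(n²) discrete Fourier transform from the standard library so that it stays self-contained; a real implementation would use an FFT. The function names, the `floor` constant that avoids taking the logarithm of zero, and the quefrency search range are assumptions for illustration, not details from the source.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame, floor=1e-12):
    """Inverse transform of the log power spectrum of one frame."""
    log_power = [math.log(abs(s) ** 2 + floor) for s in dft(frame)]
    return [c.real for c in dft(log_power, inverse=True)]

def pitch_period(frame, min_q, max_q):
    """Quefrency (in samples) of the largest cepstral peak in [min_q, max_q)."""
    ceps = real_cepstrum(frame)
    return max(range(min_q, max_q), key=lambda q: ceps[q])
```

For a 256-sample pulse train with one pulse every 32 samples, searching quefrencies 20 to 49 finds the peak at 32, matching the pulse period. The low-quefrency coefficients of the same cepstrum would be the part used for the spectral envelope.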

Multiple formants are extracted from the speech waveform by frequency analysis, and stationary and non-stationary sections within vowel sections are distinguished by observing the intensity of each formant and the variation over time of each formant's frequency. Linear predictive analysis or cepstrum analysis can be used for formant extraction.

Accordingly, in the present invention, the formant clarity judgment and the formant stability judgment (discrimination between stationary and non-stationary sections) described above are used to identify the transition portions between vowels in speech.

Based on the characteristic diagrams of FIG. 2(a) and FIG. 2(b), the speech analysis unit 6 detects stationary and non-stationary sections as follows, and thereby detects the co-articulation portions and the remaining portions.

First, the speech analysis unit 6 judges a portion whose waveform power is smaller than the threshold t0 to be silence. In FIG. 2(b), this is the non-stationary section x.

Next, the speech analysis unit 6 judges a portion whose waveform power is larger than the threshold t0 and whose formant trajectory, as determined via the data processing unit 8, is unclear to be a consonant. In FIG. 2(b), this is the non-stationary section y.

Next, the speech analysis unit 6 judges a portion whose waveform power is larger than the threshold t1 and whose formant trajectory, as determined via the data processing unit 8, is clear and stable to be a vowel. In FIG. 2(b), these are the stationary sections a and b.

Next, the speech analysis unit 6 judges a portion whose waveform power is larger than the threshold t1 and whose formant trajectory, as determined via the data processing unit 8, is clear but varying to be a vowel transition. In FIG. 2(b), this is the non-stationary section z.
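The four rules above can be condensed into a small decision function. This is a sketch only: it assumes t0 < t1, takes formant clarity and stability as precomputed booleans, and the function name `classify_frame` is hypothetical. The source does not say how to treat a frame with clear formants whose power lies between t0 and t1, so the fallback in the last line is an assumption.

```python
def classify_frame(power, formant_clear, formant_stable, t0, t1):
    """Classify one frame as 'silence', 'consonant', 'vowel', or 'transition'."""
    if power < t0:
        return "silence"
    if not formant_clear:
        return "consonant"
    if power > t1:
        return "vowel" if formant_stable else "transition"
    # Assumption: this case is not specified in the source; treated as a consonant here.
    return "consonant"
```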

Of the results analyzed as described above, for the sound data (everything except silence), the expansion processing unit 12 of the data processing unit 8 sets a weighting W for each class relative to the expansion rate R, here taken to be 1.2 as an example. That is, W = 0.5 for consonants, W = 1.0 for vowels, and W = 0.8 for vowel transitions (co-articulation portions). The expansion rate R and the weighting coefficient W are calculated by the speech data analysis and processing control unit 7 and supplied to the expansion processing unit 12.

As described above, the expansion processing unit 12 expands each piece of sound data at the rate 1.0 + (R - 1.0) × W. For a consonant, the rate is 1.0 + 0.2 × 0.5 = 1.1. For a vowel, the rate is 1.0 + 0.2 × 1.0 = 1.2. For a co-articulation portion (vowel transition), the rate is 1.0 + 0.2 × 0.8 = 1.16.
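The arithmetic above can be checked with a few lines of code. The weights and the base rate R = 1.2 are the example values from the description; the names `WEIGHTS` and `effective_rate` are hypothetical.

```python
# Example weighting coefficients W from the description.
WEIGHTS = {"consonant": 0.5, "vowel": 1.0, "transition": 0.8}

def effective_rate(base_rate, weight):
    """Per-class expansion rate 1.0 + (R - 1.0) * W."""
    return 1.0 + (base_rate - 1.0) * weight

for label, w in WEIGHTS.items():
    print(label, round(effective_rate(1.2, w), 2))
```

Because W never exceeds 1.0 in these examples, every class is expanded by at most the base rate R, and the vowel transition is always stretched less than the stationary vowel.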

In this way, the expansion processing unit 12 expands each piece of sound data at the rate 1.0 + (R - 1.0) × W, obtained by subtracting 1.0 from the preset expansion rate R (R ≥ 1.0), multiplying the result by the weighting coefficient W, and adding 1.0. Silent data not deleted by the deletion processing unit 13 and sound data expanded by the expansion processing unit 12 are sent to the output buffer unit 9. The data in the output buffer unit 9 is read out sequentially at a constant rate, D/A-converted by the D/A conversion unit 10, and then sent to the speech output unit 11.

As described above, the speech speed conversion apparatus 1 applies weighting according to stationarity within vowel portions as well, so that strongly stationary vowel portions are given a larger expansion rate than weakly stationary vowel transitions. As a result, vowel transitions are not stretched unnaturally, and natural, highly clear speed-converted speech, like that of a person speaking slowly, can be obtained.

FIG. 1 is a block diagram of a speech speed conversion apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the detection of stationary and non-stationary sections in the speech analysis unit.

Explanation of symbols

1 speech speed conversion apparatus, 2 speech input unit, 3 A/D conversion unit, 4 input buffer unit, 5 transfer control unit, 6 speech analysis unit, 6a analysis buffer unit, 7 speech data analysis and processing control unit, 8 data processing unit, 9 output buffer unit, 10 D/A conversion unit, 11 speech output unit, 12 expansion processing unit, 13 deletion processing unit

Claims

A speech speed conversion apparatus that converts the playback speed of an input speech signal to a speed slower than that of the original speech signal and plays it back, comprising:
a speech input unit to which an analog speech signal is input;
an A/D conversion unit that converts the analog speech signal input by the speech input unit into digital speech data;
an input buffer unit that stores the digital speech data from the A/D conversion unit;
a transfer control unit that controls the transfer of the digital speech data stored in the input buffer unit to the subsequent stage;
a speech analysis unit that stores the digital speech data supplied under the transfer control of the transfer control unit in an analysis buffer unit, in the amount needed for analysis, analyzes whether sound is present to discriminate sound data from silent data, further analyzes the sound data into consonants, vowels, and co-articulation portions between vowels, and outputs the data from the analysis buffer unit;
a data processing unit that makes the expansion rate for the co-articulation portions between vowels in the sound data output from the analysis buffer unit following analysis by the speech analysis unit smaller than the expansion rate for the vowels in the sound data, expands the sound data at each expansion rate, and deletes the silent data output from the analysis buffer unit;
a control unit that controls the analysis of the digital speech data by the speech analysis unit, and controls the expansion processing and the deletion processing by the data processing unit;
an output buffer unit that stores the sound data expanded by the data processing unit;
a D/A conversion unit that converts the speech data read from the output buffer unit into an analog speech signal; and
a speech output unit that outputs the analog speech signal from the D/A conversion unit.

The speech speed conversion apparatus according to claim 1, wherein the speech analysis unit detects, under the control of the control unit, whether the input speech signal is the silent data or, within the sound data, the consonant, the vowel, or the co-articulation portion.

The speech speed conversion apparatus according to claim 2, wherein the speech analysis unit discriminates sound data from silent data based on waveform power, and further detects the co-articulation portion within the sound data based on a formant trajectory.

The speech speed conversion apparatus according to claim 3, wherein, when detecting the co-articulation portion based on the formant trajectory in the sound data discriminated based on the waveform power, the speech analysis unit judges a portion in which the formant trajectory is clear but varying to be the co-articulation portion.

The speech speed conversion apparatus according to claim 1, wherein, when applying the respective expansion processing to the co-articulation portion and the vowel detected by the speech analysis unit under the control of the control unit, the data processing unit uses values obtained by multiplying a predetermined expansion rate by mutually different weighting coefficients.