JP3219892B2

JP3219892B2 - Real-time speech speed converter

Info

Publication number: JP3219892B2
Application number: JP07809893A
Authority: JP
Inventors: 龍池沢; 章中村; 栄一宮坂
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1993-04-05
Filing date: 1993-04-05
Publication date: 2001-10-15
Anticipated expiration: 2016-10-15
Also published as: JPH06289895A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a real-time speech speed conversion device that takes in original speech and converts it into slower speech suited to listeners such as hearing-impaired and elderly people.

[0002] [Summary of the Invention] The present invention relates to a real-time speech speed conversion device that takes in original speech and converts it into slower speech suited to listeners such as hearing-impaired and elderly people. In real-time processing, the device detects changes in the intonation (pitch frequency) of the speech and, based on the detection result, varies the speech speed according to the rule that speech is slowed where the intonation is high and sped up where it is low. The original speech is thereby converted into speech that is easy to listen to while its utterance time is preserved.

[0003]

[Prior Art] In general, when a listener's listening ability, such as the critical speed for speech discrimination (the maximum speech speed at which speech can be accurately discriminated), declines through aging or some impairment, the listener's ability to discriminate speech spoken at normal speed or spoken quickly drops sharply.

[0004] Conventionally, hearing aids have been the only hearing assistance available to people with such hearing impairment.

[0005]

[Problems to Be Solved by the Invention] The hearing aid described above, however, is a device that compensates only for the transmission characteristics of the outer ear and middle ear of the auditory system, simply by improving frequency characteristics or controlling gain. It therefore cannot compensate for the decline in speech discrimination ability associated with deterioration of the auditory center.

[0006] As a way of solving this problem, techniques have been developed that convert the speech speed while maintaining the quality of the original speech.

[0007] With this speech speed conversion technique, uniformly slowing only the speech speed makes the speech far easier to listen to, particularly for elderly and hearing-impaired listeners, but the operation inevitably lengthens the utterance time. In broadcasting and similar settings, however, the speech before expansion is delivered so as to fit within a fixed time, so expanding it in this way may cause it to overrun that time limit. Moreover, where audio and video are presented in synchronization, as in television, expanding only the audio produces a temporal "shift" relative to the video, which may impair comprehension.

[0008] For this reason, speech speed conversion techniques that take this temporal "shift" into account have been developed for offline processing, but none that can do so in real-time processing has yet been developed.

[0009] In view of the above circumstances, and to solve the problems associated with the temporal "shift" described above, an object of the present invention is to provide a real-time speech speed conversion device that, in real-time processing, moderately slows the portions of the uttered speech considered semantically important and conversely speeds up the remaining portions, thereby converting the speech into speech that is, as a whole, slow and easy to listen to without substantially extending the utterance time.

[0010]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 is characterized by comprising: pause section determination means that, when the speed at which listened speech is uttered (speech speed) is to be slowed, detects a silent section lasting at least a predetermined time and determines that silent section to be a pause section, meaning a breathing interval in the utterance; maximum pitch frequency detection means that treats the span between one pause section and the next as a phrase section and detects the maximum pitch frequency in a predetermined number of voiced sections from the start point of the phrase section; and speech speed conversion means that converts the speech speed over a fixed time from the start point of the phrase section based on a predetermined expansion ratio and a predetermined decreasing function and, after the fixed time has elapsed, performs speech speed conversion that takes the maximum pitch frequency into account. According to claim 2, in the real-time speech speed conversion device of claim 1, the speech speed conversion means, in determining the speech speed after the fixed time has elapsed, obtains the average pitch frequency of the voiced section being processed in the phrase section and determines the expansion ratio of that voiced section according to the magnitude relationship between this average pitch frequency and the value obtained by multiplying the maximum pitch frequency by a predetermined threshold.

[0011]

[Operation] With the above configuration, when the speed at which listened speech is uttered (speech speed) is to be slowed, the pause section determination means detects a silent section lasting at least a predetermined time and determines that silent section to be a pause section, meaning a breathing interval in the utterance. The maximum pitch frequency detection means treats the span between one pause section and the next as a phrase section and detects the maximum pitch frequency in a predetermined number of voiced sections from the start point of the phrase section. The speech speed conversion means then converts the speech speed over a fixed time from the start point of the phrase section based on a predetermined expansion ratio and a predetermined decreasing function and, after the fixed time has elapsed, performs speech speed conversion that takes the maximum pitch frequency into account. In this way, while the problems associated with the temporal "shift" are solved, the semantically important portions of the uttered speech are moderately slowed and the remaining portions are conversely sped up, so that the speech is converted in real time into speech that is, as a whole, slow and easy to listen to without substantially extending the utterance time.

[0012]

[Embodiment] <<Configuration of the Embodiment>> FIG. 1 is a block diagram showing one embodiment of a real-time speech speed conversion device according to the present invention.

[0013] The real-time speech speed conversion device shown in this figure comprises an audio input circuit 1, a CPU circuit 2, a PROM circuit 3, an input buffer circuit 4, a processing buffer circuit 5, a file circuit 6, an audio output circuit 7, and a bus 8. The audio input circuit 1 takes in the speech to be converted (the original speech). In real-time processing, the device detects changes in the intonation (pitch frequency) of the original speech and, based on the detection result, varies the speech speed according to the rule that speech is slowed where the intonation is high and sped up where it is low. The original speech is thereby converted into speech that is easy to listen to while its utterance time is preserved.

[0014] The audio input circuit 1 is a circuit of conventional configuration for inputting the original speech. It comprises, for example, a microphone, a tone control circuit, an analog-to-digital converter, an audio storage and playback (recording) circuit, an audio storage medium (for example, IC memory, a hard disk, a floppy disk, or a VTR), and an interface circuit. It takes in the speech to be converted, converts it into a digital audio signal, and supplies that signal to the input buffer circuit 4 frame by frame in accordance with instructions from the CPU circuit 2.

[0015] The input buffer circuit 4 consists of RAM or the like of the required capacity and serves as a work area for the CPU circuit 2. It takes in and stores the audio signal output from the audio input circuit 1 and, in accordance with instructions from the CPU circuit 2, transfers the stored audio signal to the processing buffer circuit 5.

[0016] The processing buffer circuit 5 consists of RAM or the like of the required capacity and serves as a work area for the CPU circuit 2. It takes in and stores the audio signal output from the input buffer circuit 4 and, in accordance with instructions from the CPU circuit 2, transfers the stored audio signal to the file circuit 6 and elsewhere.

[0017] The file circuit 6 consists of RAM and, in addition, an audio storage medium such as IC memory or a floppy disk. It is a memory that stores the audio signals whose voiced sections have been expanded and the signals whose silent sections have been shortened in accordance with the present invention. When a processed audio signal is output from the processing buffer circuit 5, the file circuit 6 takes it in and stores it, and thereafter supplies the stored audio signal to the audio output circuit 7 in accordance with instructions from the CPU circuit 2.

[0018] The audio output circuit 7 is a circuit of conventional configuration for outputting the audio signal held in the file circuit 6 to the outside. It comprises, for example, an interface circuit, a digital-to-analog converter, a loudspeaker, and a recording device (or broadcasting equipment). When an audio signal is output from the file circuit 6, the audio output circuit 7 takes it in, converts it into sound, and outputs it to the outside.

[0019] The CPU circuit 2 consists of a one-chip microcomputer or the like and controls the entire device and performs various data processing based on the program stored in the PROM circuit 3.

[0020] The PROM circuit 3 serves as the storage location for the program that defines the operation of the CPU circuit 2 and for constant data used in the various processes. In response to a read command from the CPU circuit 2, it reads out the stored program or constant data and supplies it to the CPU circuit 2.

[0021] <<Operation of the Embodiment>> Next, the operation of this embodiment is described with reference to the block diagram of FIG. 1, the flowcharts of FIGS. 2 and 3, and the timing chart of FIG. 4.

[0022] First, the CPU circuit 2 cuts the audio signal that has been input to and processed by the audio input circuit 1 into segments of fixed length called frames, for example every 3.3 ms, and has each frame transferred to and stored in the input buffer circuit 4 (step ST1).

[0023] The CPU circuit 2 then processes the audio signal stored in the input buffer circuit 4 frame by frame, using a method such as the autocorrelation method or the zero-crossing method, and classifies each frame as voiced, unvoiced, or silent. Input sounds other than human voiced and unvoiced sounds (for example, low-level noise and background sound) are, as a rule, treated as silent (step ST2).
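The three-way frame decision of step ST2 can be sketched as follows. This is only an illustrative sketch: it uses frame energy plus the zero-crossing rate, whereas the text also names the autocorrelation method, and the function names and both thresholds (`silence_rms`, `zcr_unvoiced`) are assumptions, not values from the patent.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / max(len(frame) - 1, 1)


def classify_frame(frame, silence_rms=0.01, zcr_unvoiced=0.3):
    """Label one frame of float samples as 'silent', 'unvoiced', or 'voiced'.

    Both thresholds are hypothetical tuning values chosen for illustration.
    """
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms < silence_rms:
        # Low-level noise and background sound are treated as silence.
        return "silent"
    if zero_crossing_rate(frame) > zcr_unvoiced:
        # Noise-like frames cross zero often, so they are taken as unvoiced.
        return "unvoiced"
    return "voiced"
```

A practical implementation would likely combine this with a periodicity check (for example, an autocorrelation peak) before declaring a frame voiced.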

[0024] Next, the CPU circuit 2 determines whether the voiced/unvoiced/silent result for the current frame is the same as that for the previous frame (step ST3). If the two are of the same type, it returns to the frame extraction step and repeats the same processing. If they differ, for example if the previous frame was in a voiced section and the current frame is in an unvoiced section, it has the audio signal that has so far been judged to belong to the same type of section transferred to and stored in the processing buffer circuit 5 (step ST4).

[0025] As a result, as shown in FIG. 4, the speech taken in by the audio input circuit 1 is divided into voiced sections, unvoiced sections, and silent sections and stored in the processing buffer circuit 5.
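The grouping of steps ST3 and ST4, in which consecutive frames with the same label are collected into one section, amounts to run-length grouping. A minimal sketch follows; the function name and the `(label, start_frame, frame_count)` tuple layout are assumptions made for illustration.

```python
from itertools import groupby


def group_sections(labels):
    """Collapse per-frame labels into (label, start_frame, frame_count) runs."""
    sections = []
    start = 0
    for label, run in groupby(labels):
        count = len(list(run))
        sections.append((label, start, count))
        start += count
    return sections
```

Multiplying `frame_count` by the frame length (3.3 ms in the example above) gives each section's duration.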

[0026] Next, among the sections of the audio signal stored in the processing buffer circuit 5 that were judged to be silent, the CPU circuit 2 determines those with a length of 250 ms or more to be pause sections (breathing intervals in the utterance) and determines each span lying between pause sections to be a phrase section (a span uttered in one breath) (step ST5).
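The pause and phrase decision of step ST5 can be sketched as below, assuming each section is given as a `(label, duration_ms)` pair. The function name and data layout are hypothetical; only the 250 ms threshold comes from the text.

```python
PAUSE_MS = 250.0  # minimum silent-section length treated as a pause


def split_phrases(sections, pause_ms=PAUSE_MS):
    """Split (label, duration_ms) sections into phrases separated by pauses.

    Silent sections shorter than pause_ms stay inside the current phrase.
    """
    phrases, current = [], []
    for label, duration in sections:
        if label == "silent" and duration >= pause_ms:
            # A pause closes the phrase collected so far.
            if current:
                phrases.append(current)
                current = []
        else:
            current.append((label, duration))
    if current:
        phrases.append(current)
    return phrases
```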

[0027] The CPU circuit 2 then applies the voiced section processing 10 shown in FIG. 3 to each section of each phrase section that was judged to be voiced (step ST6).

[0028] In this voiced section processing, the CPU circuit 2 first determines whether the voiced section being processed is the one immediately following a pause section (step ST15). If it is, the CPU circuit 2 extracts three voiced sections (the first, second, and third voiced sections) from the start point of the phrase section (Ph_st), takes the highest of the pitch frequencies of these three sections as the maximum pitch frequency Pitch_max, and sets the expansion ratio of the speech speed at the start point V_st of the first voiced section to "rs" (step ST16).

[0029] The CPU circuit 2 then determines whether a preset time T (2000 ms in this embodiment) has elapsed between the start point V_st of the first voiced section and the audio signal being processed (step ST17). If the time T has not elapsed, it varies the expansion ratio of the speech speed from "rs" to "re" using a preset, suitable decreasing function, for example the cosine function f(t) given by the following equation (step ST18).

[0030]

f(t) = re + (1/2)·(rs − re)·{cos(π·(t − V_st)/T) + 1.0}   …(1)

where t ranges over t = V_st to V_st + T.

Within this range, no processing is applied to silent sections or unvoiced sections.
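Equation (1) can be checked numerically with a short sketch. The function name is hypothetical; the default T = 2000 ms is the value given for this embodiment, and the defaults rs = 1.25 and re = 0.9 are the values quoted in the experimental example.

```python
import math


def expansion_ratio(t, v_st, T=2000.0, rs=1.25, re=0.9):
    """Evaluate equation (1) at time t (ms), for v_st <= t <= v_st + T.

    The ratio falls smoothly from rs at t = v_st to re at t = v_st + T.
    """
    if not v_st <= t <= v_st + T:
        raise ValueError("t must lie within [v_st, v_st + T]")
    return re + 0.5 * (rs - re) * (math.cos(math.pi * (t - v_st) / T) + 1.0)
```

At the start point the cosine term is 1, giving rs; at the end of the period it is -1, giving re; at the midpoint the ratio is the mean of rs and re.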

[0031] If the preset time T has elapsed between the start point V_st of the first voiced section and the audio signal being processed (step ST17), the CPU circuit 2 determines whether the average pitch frequency Pitch(n) of the section containing the audio signal being processed (the n-th voice section, where n ≥ k) satisfies the following expression (step ST19).

[0032]

Pitch(n) > Pitch_max × Th2   …(2)

where Th2 is a threshold; in this embodiment, Th2 = 0.7.

[0033] If the n-th voice section satisfies expression (2), the CPU circuit 2 sets the start point of this n-th voice section to "V2_st" (step ST20) and sets the expansion ratio to "rs − Th3". It then performs the within-period check and the voiced-section expansion using the decreasing function f(t) described above, varying the expansion ratio of the speech speed from "rs − Th3" to "re" over the period T starting at V2_st (step ST18).

[0034] In this embodiment, the value of the threshold Th3 is set to 0.1.

[0035] If expression (2) is not satisfied (step ST19), the CPU circuit 2 leaves the expansion ratio of the voiced section at re, that is, it keeps the speech speed at its fastest (step ST21).
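The branch taken once the time T has elapsed (steps ST19 to ST21) can be sketched as a small decision function. The function name and the returned `(start_ratio, end_ratio)` pair are assumptions for illustration; Th2 = 0.7 and Th3 = 0.1 are the values stated for this embodiment, and rs = 1.25 and re = 0.9 are taken from the experimental example.

```python
def ratio_after_fixed_time(pitch_n, pitch_max, rs=1.25, re=0.9, th2=0.7, th3=0.1):
    """Return (start_ratio, end_ratio) for a voiced section reached after time T.

    If the section's average pitch exceeds pitch_max * th2 (expression (2)),
    the ratio restarts at rs - th3 and decays to re; otherwise it stays at re.
    """
    if pitch_n > pitch_max * th2:
        return (rs - th3, re)
    return (re, re)
```

For example, with a maximum pitch of 200 Hz the comparison value is 140 Hz, so a section averaging 180 Hz is slowed again while one averaging 100 Hz stays at re.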

[0036] Thereafter, until the next pause section, the CPU circuit 2 repeats the above processing on the audio signal in each voiced section as it is detected.

[0037] When this processing is complete, the CPU circuit 2 has the speed-converted audio signal in the processing buffer transferred to and stored in the file circuit 6, and clears the processing buffer circuit 5 (step ST7).

[0038] If, in the section identification processing described above (step ST5), the section being processed is judged to be an unvoiced section, the CPU circuit 2 has the audio signal of this section transferred from the processing buffer circuit 5 to the file circuit 6 and stored there, and then clears the processing buffer circuit 5 (step ST8).

[0039] If, in the section identification processing described above (step ST5), the section being processed is judged to be a silent section, the CPU circuit 2 determines whether this section is a pause section (step ST9). If it is, the CPU circuit 2 judges it to be a break between sentences (a full stop) and, in order to shorten the text as far as possible without it sounding unnatural, shortens the silent section using a preset shortening algorithm (step ST10).

[0040] If, in the section identification processing described above (step ST5), the section is judged to be silent but is not a pause section (step ST9), the CPU circuit 2 skips this shortening processing.

[0041] The CPU circuit 2 then has the processed silent-section signal in the processing buffer circuit 5 transferred to and stored in the file circuit 6, and clears the processing buffer circuit 5 (step ST11).

[0042] The CPU circuit 2 repeats the above processing until no audio signal remains to be processed (step ST12).

[0043] In parallel with the above processing, the CPU circuit 2 has the processed audio signal stored in the file circuit 6 transferred to the audio output circuit 7 and output as sound.

[0044] <<Experimental Example>> When speech including the actual news speech shown in Table 1 was processed by the real-time speech speed conversion device described above, each break between items (1)-1 to (1)-6 in the text was recognized as a pause, and the speech speed could be slowed at the start point of each phrase. The speech speed could also be slowed in the latter parts (underlined) of (1)-1 and (1)-4.

[0045]

[Table 1]

In this experiment, when the expansion ratios of the speech speed were set to rs = 1.25 and re = 0.9, original speech 136 seconds long could be converted into speech 136 seconds long, and, compared with the case where the speech speed was uniformly slowed with rs = re = 1.3, the absorption rate α reached 100% (unabsorbed time 0.0 seconds).

[0046] Here, the absorption rate α is the value given by the following equation.

[0047]

α = {(T1 − T2)/T1}·100   …(3)

where T1 is the extension time when the speech speed is uniform and T2 is the extension time when the speech speed is varied.

By using this embodiment, therefore, the silent sections between sentences are effectively shortened, so that the original speech can be converted into slow speech in real time without extending the overall duration.
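Equation (3) can be illustrated with the figures from the experiment. The sketch below rests on two assumptions not stated in the text: that "extension time" means the duration added to the original, and that the uniform ratio of 1.3 is applied to the whole 136 seconds. The function name is hypothetical.

```python
def absorption_rate(t1, t2):
    """Equation (3): percentage of the uniform-case extension that is absorbed."""
    if t1 <= 0:
        raise ValueError("t1 must be positive")
    return (t1 - t2) / t1 * 100.0


original_s = 136.0
# Assumed reading: uniform slowing at 1.3 adds 136 * 1.3 - 136 seconds.
t1 = original_s * 1.3 - original_s
# The varied-speed output is also 136 s long, so nothing is added.
t2 = 136.0 - original_s
alpha = absorption_rate(t1, t2)
```

Under these assumptions T1 is about 40.8 seconds and T2 is 0, so α evaluates to 100, which agrees with the reported absorption rate.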

[0048] Thus, in this embodiment, the audio input circuit 1 takes in the speech to be converted (the original speech), and, in real-time processing, changes in the intonation (pitch frequency) of the original speech are detected. Based on the detection result, the speech speed is varied according to the rule that speech is slowed where the intonation is high and sped up where it is low, so that the original speech is converted into speech that is easy to listen to while its utterance time is preserved. The semantically important portions of the uttered speech can therefore be moderately slowed and the remaining portions conversely sped up, which allows the speech to be converted into speech that is, as a whole, slow and easy to listen to without substantially extending the utterance time.

[0049]

[Effects of the Invention] As described above, according to the present invention, to solve the problems associated with the temporal "shift", real-time processing moderately slows the portions of the uttered speech considered semantically important and conversely speeds up the remaining portions, so that the speech can be converted into speech that is, as a whole, slow and easy to listen to without substantially extending the utterance time.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing one embodiment of a real-time speech speed conversion device according to the present invention.

FIG. 2 is a main flowchart showing an example of the operation of the real-time speech speed conversion device shown in FIG. 1.

FIG. 3 is a flowchart showing an example of the voiced section processing routine in the operation of the real-time speech speed conversion device shown in FIG. 1.

FIG. 4 is a timing chart showing an example of the operation of the real-time speech speed conversion device shown in FIG. 1.

[Explanation of Reference Numerals]

1: audio input circuit
2: CPU circuit
3: PROM circuit
4: input buffer circuit
5: processing buffer circuit
6: file circuit
7: audio output circuit
8: bus

Continued from the front page: (58) Fields searched (Int. Cl.7, DB name): G10L 21/04

Claims

(57) [Claims]

[Claim 1] A real-time speech speed conversion device comprising:
pause section determination means that, when the speed at which listened speech is uttered (speech speed) is to be slowed, detects a silent section lasting at least a predetermined time and determines that silent section to be a pause section, meaning a breathing interval in the utterance;
maximum pitch frequency detection means that treats the span between one pause section and the next as a phrase section and detects the maximum pitch frequency in a predetermined number of voiced sections from the start point of the phrase section; and
speech speed conversion means that converts the speech speed over a fixed time from the start point of the phrase section based on a predetermined expansion ratio and a predetermined decreasing function and, after the fixed time has elapsed, performs speech speed conversion that takes the maximum pitch frequency into account.

[Claim 2] The real-time speech speed conversion device according to claim 1, wherein the speech speed conversion means, in determining the speech speed after the fixed time has elapsed, obtains the average pitch frequency of the voiced section being processed in the phrase section and determines the expansion ratio of that voiced section according to the magnitude relationship between this average pitch frequency and the value obtained by multiplying the maximum pitch frequency by a predetermined threshold.