JP6032832B2 - Speech synthesizer

Info

Publication number: JP6032832B2
Application number: JP2012053799A
Authority: JP
Inventors: 稔木幡
Original assignee: Chiba Institute of Technology
Current assignee: Chiba Institute of Technology
Priority date: 2012-03-09
Filing date: 2012-03-09
Publication date: 2016-11-30
Anticipated expiration: 2032-03-09
Also published as: JP2013186428A

Description

The present invention relates to a speech synthesis method that allows speech to be heard clearly in a variety of environments.

In public facilities such as tunnels and halls, amplified announcements must be made during emergencies and disasters. Such announcements must be clear enough to be understood reliably. In practice, however, reverberation often makes the speech unclear and very hard to understand, because the reverberation of preceding speech masks the phonemes that follow. To address this problem, there are techniques that process the announcement speech in advance to make it less susceptible to reverberation.

Non-Patent Document 1 discloses the following technique. It focuses on the fact that steady-state portions of speech, such as vowels, have high power and therefore readily cause reverberation. FIG. 13 is a schematic diagram explaining the vowel/consonant determination algorithm of Non-Patent Document 1. As shown in the figure, windows are placed on the time axis, and the power in the preceding and following windows is measured to decide whether the speech is a vowel or a consonant. In FIG. 13, when W1 < W2, W1 is judged to be a vowel and no processing is performed; when W1 > W2, W1 is judged to be a consonant and consonant enhancement is applied. This allows speech to be conveyed without the consonants being buried in the reverberation of the vowels.
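The window comparison described for Non-Patent Document 1 can be sketched roughly as follows. This is a minimal illustration, not the cited method itself: it takes the stated inequality literally, assumes W2 is the window that immediately follows W1 (the text does not say which comes first), and uses a hypothetical fixed `gain` for the enhancement step.

```python
def window_power(samples):
    """Mean squared amplitude of one analysis window."""
    return sum(x * x for x in samples) / len(samples)


def classify_windows(signal, win_len):
    """Label each window W1 by comparing its power with the next window W2.

    Rule as stated in the text: W1 < W2 -> "vowel" (left alone),
    W1 > W2 -> "consonant" (to be enhanced). Ties are labelled "vowel",
    which is an assumption; the text does not cover that case.
    """
    n_windows = len(signal) // win_len
    powers = [window_power(signal[i * win_len:(i + 1) * win_len])
              for i in range(n_windows)]
    return ["consonant" if w1 > w2 else "vowel"
            for w1, w2 in zip(powers, powers[1:])]


def enhance_consonants(signal, win_len, gain=2.0):
    """Multiply every window labelled "consonant" by a fixed (assumed) gain."""
    out = list(signal)
    for i, label in enumerate(classify_windows(signal, win_len)):
        if label == "consonant":
            for j in range(i * win_len, (i + 1) * win_len):
                out[j] *= gain
    return out
```

The last window has no following window to compare against, so this sketch leaves it unlabelled and unprocessed.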

Non-Patent Document 2 discloses the following technique. The input text is decomposed into parts of speech by morphological analysis. Rules are then created for whether adjacent parts of speech are joined or kept separate, and the parts of speech are joined according to these rules to form phrases. Pauses are then inserted between the generated phrases, with the pause length determined by the number of morae in the phrase. The text with pauses added between phrases is then uttered by speech synthesis. This makes it possible to generate speech that is easy to understand even under reverberation.
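As a rough illustration of mora-dependent pause insertion, the sketch below places a pause after each phrase except the last. The text only says that the pause length depends on the mora count; the linear rule, the `per_mora_ms` and `max_ms` values, and the `<pause …>` marker format are all hypothetical choices made for this example.

```python
def pause_ms(mora_count, per_mora_ms=40, max_ms=600):
    """Assumed rule: pause grows linearly with mora count, up to a cap."""
    return min(mora_count * per_mora_ms, max_ms)


def insert_pauses(phrases):
    """phrases: list of (text, mora_count) pairs, already segmented.

    Returns the phrase texts with a pause marker between consecutive phrases.
    The pause after a phrase is derived from that phrase's mora count.
    """
    out = []
    for i, (text, morae) in enumerate(phrases):
        out.append(text)
        if i < len(phrases) - 1:
            out.append(f"<pause {pause_ms(morae)}ms>")
    return out
```

Morphological analysis and phrase formation are outside the scope of this sketch; the phrases are supplied pre-segmented.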

Non-Patent Document 1: IEICE Technical Report HIP2005-94 (2005-12)
Non-Patent Document 2: Proceedings of the Acoustical Society of Japan, September 2011, 1-R-33

However, because these methods process the announcement speech in advance, they cannot adapt when the reverberant environment changes. Every time the environment changes, the amount of consonant enhancement, the pause duration between phrases, and so on must be readjusted. Moreover, although the reverberation varies with the content of the announcement, these methods cannot take the content into account when controlling for the effect of reverberation. The problem, therefore, is to develop an announcement clarification device and method that can adapt flexibly to changes in the environment and to the announcement content.

To solve the above problems, the present invention first provides the following announcement clarification device. By employing a speech synthesizer, the device can control parameters such as speech rate, pitch frequency, pause insertion, and power. To control these parameters, an impulse response waveform is acquired in advance and convolved with the synthesized speech to predict the effect of reverberation, and the prediction result is fed back to adjust the parameters.

Specifically, it is a speech synthesizer comprising: an announcement data acquisition unit that acquires announcement data including text-format data indicating the announcement content; a speech synthesis unit that synthesizes announcement speech from the acquired announcement data; an impulse response waveform acquisition unit that acquires an impulse response waveform, which is a reverberation characteristic value of the space in which the speaker is placed; a reverberant speech generation unit that uses the acquired impulse response waveform to generate the reverberant speech that the synthesized announcement speech would produce; a comparison unit that compares the generated reverberant speech with the synthesized announcement speech; a determination unit that determines whether the result of comparing the synthesized announcement speech with the generated reverberant speech falls within a predetermined range; and a control unit that, when the determination result does not fall within the predetermined range, controls the speech synthesis unit according to the comparison result of the comparison unit.

Second, building on the first speech synthesizer, the invention provides a speech synthesizer capable of controlling the spectrum of the speech. Specifically, it is the speech synthesizer according to claim 1, wherein the speech synthesis unit has speech spectrum control means.

Third, building on the first or second speech synthesizer, the invention provides a speech synthesizer that can output the adjusted announcement speech through a speaker. Specifically, it is the speech synthesizer according to claim 1 or 2, having an output unit that outputs the synthesized announcement speech to the speaker when the determination result falls within the predetermined range.

Fourth, building on the first to third speech synthesizers, the invention provides a speech synthesizer that accepts spoken input, recognizes it, converts it into text format, and processes it. Specifically, it is the speech synthesizer according to any one of claims 1 to 3, further having a data conversion output unit that converts a live voice into text-format data and outputs it to the announcement data acquisition unit.

Fifth, building on the first to fourth speech synthesizers, the invention provides a speech synthesizer that acquires phonological information about the input announcement data and uses it to control speech synthesis. Specifically, it is the speech synthesizer according to any one of claims 1 to 4, wherein the announcement data further includes parameters for speech synthesis.

With the first aspect of the invention configured as above, the various speech synthesis parameters (pause, speech rate, pitch, power, etc.) can be adjusted by feedback from an analysis of the predicted reverberation signal. This makes it possible to generate speech whose reverberation resistance adapts to the environment.

The second aspect of the invention makes it possible to apply a filter that adaptively shapes the speech spectrum structure, so that the filter characteristics can be controlled in a way that adapts more closely to the environment.

The third aspect of the invention makes it possible to generate synthesized speech with high reverberation resistance in a variety of environments.

The fourth aspect of the invention allows the input to be given as a live voice, which is recognized and then synthesized, enabling more immediate announcements.

The fifth aspect of the invention allows speech to be synthesized based on the phonological information of the announcement data, so that speech can be generated from parameters adapted to the content of the text. Starting from that speech, the reverberant speech is then evaluated and the result fed back, so that the announcement also adapts to the environment.

FIG. 1 is a diagram for explaining an example of the processing of the announcement clarification device of Example 1.
FIG. 2 is a diagram showing an example of the functional blocks of the announcement clarification device of Example 1.
FIG. 3 is a flowchart showing an example of the processing flow in the announcement clarification device of Example 1.
FIG. 4 is a schematic diagram showing an example of the hardware configuration of the announcement clarification device of Example 1.
FIG. 5 is a diagram for explaining an outline of the processing by the announcement clarification device of Example 2.
FIG. 6 is a diagram showing an example of the functional blocks of the announcement clarification device of Example 2.
FIG. 7 is a flowchart showing an example of the processing flow in the announcement clarification device of Example 2.
FIG. 8 is a schematic diagram showing an example of the hardware configuration of the announcement clarification device of Example 2.
FIG. 9 is a diagram for explaining an outline of the processing by the announcement clarification device of Example 3.
FIG. 10 is a diagram showing an example of the functional blocks of the announcement clarification device of Example 3.
FIG. 11 is a flowchart showing an example of the processing flow in the announcement clarification device of Example 3.
FIG. 12 is a schematic diagram showing an example of the hardware configuration of the announcement clarification device of Example 3.
FIG. 13 is a diagram for explaining an outline of processing according to the prior art.

Embodiments of the present invention are described below with reference to the drawings. The present invention is in no way limited to these embodiments and can be implemented in various forms without departing from its gist. Example 1 mainly describes claims 1, 2, 5, 6 and claims 8, 9, 10, 13, 14. Example 2 mainly describes claims 3 and 11. Example 3 mainly describes claims 4, 7, 12, and 15.

≪Example 1≫
<Overview>
Example 1 describes the following announcement clarification device. First, announcement data is accepted as input, and a predicted reverberation signal is generated by convolving the synthesized speech with an impulse response waveform. The predicted reverberation signal is compared with the original synthesized speech in terms of reverberation level and frequency characteristics. If the comparison result is judged to be within a predetermined range, the speech is output from a device such as a speaker. If it is not, the result is fed back, the parameters that control speech synthesis (pitch, pause, speech rate, power, etc.) are adjusted, and the speech is synthesized and evaluated again.

FIG. 1 illustrates an example outline of the processing in this example. It shows the waveforms of two consecutive sounds in a given environment. In the upper waveform (FIG. 1(A)), the preceding sound overlaps the following sound and masks it. To avoid this, a pause of a certain duration is inserted, as in the lower waveform (FIG. 1(B)). By using speech synthesis, the announcement clarification device of this example can perform this processing adaptively for the environment in which the speech is produced.

<Functional configuration>
FIG. 2 is a diagram showing an example of the functional blocks of the announcement clarification device of this example. As shown in the figure, the "announcement clarification device" (0200) of this example has an "announcement data acquisition unit" (0201), a "speech synthesis unit" (0202), an "impulse response waveform acquisition unit" (0203), a "reverberant speech generation unit" (0204), a "comparison unit" (0205), a "determination unit" (0206), and a "control unit" (0207). The device may also have a "data conversion output unit" (0208) and an "output unit" (0209). The "speech synthesis unit" (0202) may have "speech rate, pause, pitch, and power control means" (0210).

The "announcement data acquisition unit" (0201) has a function of acquiring announcement data that includes text-format data indicating the announcement content. Specifically, it reads text-format input into memory and outputs it to the speech synthesis unit (0202). "Text format" means any form from which text data can ultimately be obtained and is not limited to plain-text data.

The announcement data acquisition unit (0201) may obtain the text data from the "data conversion output unit" (0208). The "data conversion output unit" (0208) has a function of converting a live voice into text-format data and outputting it to the announcement data acquisition unit. Specifically, it converts speech into text data by speech recognition and outputs the result to the announcement data acquisition unit. Various speech recognition techniques exist; any technique that can output text data is applicable.

The "speech synthesis unit" (0202) has a function of synthesizing announcement speech from the acquired text-format data. Specifically, it obtains the text-format data from the announcement data acquisition unit, converts it into speech data, and outputs the result to the reverberant speech generation unit (0204). Various speech synthesis techniques can be used; the unit may include speech rate, pause, pitch, and power control means (0210) that allow at least one of pitch, pause, speech rate, power, and the like to be manipulated through parameters.

The "impulse response waveform acquisition unit" (0203) has a function of acquiring an impulse response waveform, which is a reverberation characteristic value of the space in which the speaker is placed. Specifically, the impulse response waveform is acquired in advance in the environment where the announcement will be made and is made available to the reverberant speech generation unit (0204). Various acquisition methods are conceivable; any method is applicable as long as the result is in a form that the reverberant speech generation unit (0204) can use.

The "reverberant speech generation unit" (0204) has a function of using the acquired impulse response waveform to generate the reverberant speech that the synthesized announcement speech would produce. Specifically, it convolves the impulse response waveform acquired by the impulse response waveform acquisition unit (0203) with the waveform of the announcement speech obtained from the speech synthesis unit (0202), thereby adding the reverberation and obtaining the reverberant speech. Applicable convolution techniques include FIR filtering and sampling reverb.
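The convolution step can be sketched as a direct-form FIR filter whose taps are the impulse response samples. This is a minimal standard-library illustration under the assumption that both waveforms are plain lists of samples at the same sampling rate; a practical implementation would more likely use an FFT-based convolution for long room responses.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution: y[n] = sum_k h[k] * x[n - k].

    The output is len(signal) + len(impulse_response) - 1 samples long,
    so the reverberation tail after the last input sample is preserved.
    """
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


# A single unit sample simply reproduces the impulse response.
dry = [1.0, 0.0, 0.0]
room = [1.0, 0.5, 0.25]  # hypothetical decaying impulse response
wet = convolve(dry, room)  # [1.0, 0.5, 0.25, 0.0, 0.0]
```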

The "comparison unit" (0205) has a function of comparing the generated reverberant speech with the synthesized announcement speech. Specifically, it compares the synthesized announcement speech without reverberation and the reverberant speech in terms of, for example, distortion on the time axis and distortion on the frequency axis, and obtains an error that reflects the effect of the reverberation on intelligibility. The error is output to the determination unit as the comparison result.
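One way such a comparison could be computed is sketched below. The text does not define the distortion measures, so the two normalized squared-error terms, their equal default weighting, and the function names are assumptions made for illustration; the spectrum is obtained with a naive DFT to keep the example free of dependencies.

```python
import cmath


def _pad(x, n):
    """Zero-pad a sample list to length n."""
    return list(x) + [0.0] * (n - len(x))


def _normalized_sq_error(ref, test):
    energy = sum(a * a for a in ref)
    err = sum((a - b) ** 2 for a, b in zip(ref, test))
    return err / energy if energy > 0 else 0.0


def time_distortion(dry, wet):
    """Squared sample-wise difference, relative to the dry signal energy."""
    n = max(len(dry), len(wet))
    return _normalized_sq_error(_pad(dry, n), _pad(wet, n))


def magnitude_spectrum(x):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) transform)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectral_distortion(dry, wet):
    """Squared difference of magnitude spectra, relative to the dry spectrum."""
    n = max(len(dry), len(wet))
    return _normalized_sq_error(magnitude_spectrum(_pad(dry, n)),
                                magnitude_spectrum(_pad(wet, n)))


def comparison_error(dry, wet, spectral_weight=1.0):
    """Single error value combining both distortions (assumed weighting)."""
    return time_distortion(dry, wet) + spectral_weight * spectral_distortion(dry, wet)
```

An error of zero means the reverberant speech is identical to the dry speech; larger values indicate stronger reverberation effects under these assumed measures.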

Besides the distortion measures above, the evaluation index for the synthesized announcement speech may be based on measures of speech intelligibility computed for a reverberant space, such as the MTF (Modulation Transfer Function) or the STI (Speech Transmission Index).

Various other evaluation functions are conceivable for computing the error. For pause length, for example, a longer pause is always better from the standpoint of reverberation resistance, but it makes the announcement sound less natural. A function that adds a penalty as the pause grows longer is therefore conceivable. With such a function, the error becomes a downward-convex function with a single minimum.
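A penalized pause cost of this kind can be sketched as follows. The functional forms are assumptions, not taken from the text: the masking term is modeled as an exponentially decaying reverberation tail with time constant `decay_s`, and the naturalness penalty as a quadratic in the pause length weighted by `penalty_weight`. Both terms are convex, so their sum has a single minimum.

```python
import math


def pause_cost(pause_s, decay_s=0.3, penalty_weight=2.0):
    """Error as a function of pause length in seconds (assumed model).

    masking: residual reverberation still present after the pause;
             it decreases as the pause gets longer.
    penalty: loss of naturalness; it increases as the pause gets longer.
    """
    masking = math.exp(-pause_s / decay_s)
    penalty = penalty_weight * pause_s ** 2
    return masking + penalty


def best_pause(candidates, **kwargs):
    """Pick the candidate pause length with the lowest cost."""
    return min(candidates, key=lambda p: pause_cost(p, **kwargs))


# Search pauses from 0 s to 1 s in 50 ms steps.
grid = [i * 0.05 for i in range(21)]
chosen = best_pause(grid)
```

With these assumed constants the minimum on the 50 ms grid falls at about 0.3 s; a zero-length pause is penalized by masking and a one-second pause by the naturalness term.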

The "determination unit" (0206) has a function of determining whether the result of comparing the synthesized announcement speech with the generated reverberant speech falls within a predetermined range. Specifically, it obtains the comparison result from the comparison unit (0205) and makes the determination by comparing that result with a predetermined threshold. If the value is smaller than the threshold, a signal is sent to the output unit, which outputs the synthesized announcement speech to the speaker. If it is larger than the threshold, the determination result is output to the control unit (0207).

The "control unit" (0207) has a function of controlling the speech synthesis unit according to the comparison result of the comparison unit when the determination result does not fall within the predetermined range. Specifically, it obtains the determination result from the determination unit (0206) and, if the result is outside the predetermined range, sends the speech synthesis unit (0202) a control signal for adjusting its parameters.

The "output unit" (0209) has a function of outputting the synthesized announcement speech to the speaker, based on the signal from the determination unit (0206), when the determination result falls within the predetermined range. The announcement speech is then emitted by the speaker.

The "speech rate, pause, pitch, and power control means" (0210) has a function of controlling the speech rate, pause, pitch, and power among the synthesis parameters of the speech synthesis unit (0202). It receives a control signal from the control unit (0207) and adjusts each parameter accordingly.

<Process flow>
FIG. 3 is a flowchart showing an example of the processing flow in the announcement clarification device of this example. First, the text is acquired (step S0302). If the input is a live voice, it is converted into text-format data before the text is acquired (step S0301). Next, speech synthesis is performed (step S0303). The impulse response waveform must have been acquired separately by this point (step S0304). Reverberant speech is then generated (step S0305), and the reverberant speech is compared with the synthesized announcement speech (step S0306). It is then determined whether the comparison result is within the predetermined range (step S0307). If it is not, the speech synthesis step is controlled according to the comparison result (step S0308). If it is, the announcement speech can be output to the speaker (step S0309).
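The loop formed by steps S0303 to S0309 can be sketched end to end with a toy model. Everything concrete here is an assumption made for illustration: the "synthesizer" emits two unit clicks separated by a pause, the only controlled parameter is the pause length, the error is the reverberation energy falling on the active speech samples relative to the dry energy, and `max_iter` bounds the loop (the text does not mention an iteration limit).

```python
def synthesize(pause_len):
    """Toy stand-in for speech synthesis (S0303): two clicks and a pause."""
    return [1.0] + [0.0] * pause_len + [1.0]


def convolve(signal, impulse_response):
    """Reverberant speech generation (S0305) by direct convolution."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def compare(dry, wet):
    """Comparison (S0306): reverberation energy on the active dry samples."""
    err = sum((w - d) ** 2 for d, w in zip(dry, wet) if d != 0.0)
    return err / sum(d * d for d in dry)


def clarify(impulse_response, threshold=0.01, max_iter=20):
    """Adjust the pause until the error is within range, then return the speech."""
    pause_len = 0
    for _ in range(max_iter):
        dry = synthesize(pause_len)
        wet = convolve(dry, impulse_response)
        error = compare(dry, wet)
        if error <= threshold:            # determination (S0307)
            return dry, pause_len, error  # ready for output (S0309)
        pause_len += 1                    # control of synthesis (S0308)
    raise RuntimeError("no parameter setting met the threshold")


room = [0.5 ** k for k in range(8)]  # hypothetical decaying impulse response
speech, pause_len, error = clarify(room)
```

With this impulse response the loop stops at a pause of two samples, where the tail of the first click has decayed enough that the error on the second click drops below the assumed threshold.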

<Hardware configuration>
FIG. 4 is a schematic diagram showing an example of the configuration of the announcement clarification device when the functional components above are realized as hardware. The role of each hardware component in the processing of the present invention is described with reference to this figure. As shown in the figure, the announcement clarification device of this example has a "CPU (central processing unit)" (0401) that performs various arithmetic operations, a "volatile memory" (0402), a "nonvolatile memory" (0403), a "D/A converter" (0404), and an "A/D converter" (0405). A "speaker" (0406) is connected to the D/A converter, and a "microphone" (0407) is connected to the A/D converter. These components are interconnected by a data communication path such as a "system bus" (0408) and exchange and process information.

The "volatile memory" (0402) holds the programs read from the "nonvolatile memory" (0403) so that the "CPU" (0401) can execute them, and also provides the work area used by those programs.

When the device starts up, it first acquires the impulse response waveform. A sound source held in the volatile memory (0402) or the nonvolatile memory (0403) is output from the speaker (0406) via the D/A converter (0404). The sound is picked up by the microphone (0407) and sent to the CPU (0401) via the A/D converter (0405). The CPU performs processing such as noise removal, and the result is stored as the impulse response waveform in the volatile memory (0402) or the nonvolatile memory (0403).

Next, the speech synthesis program is loaded from the volatile memory (0402) into the CPU (0401). The program takes as input the text data held in the volatile memory (0402) or the nonvolatile memory (0403) and performs speech synthesis. Speech synthesis may use speech segments stored in the nonvolatile memory.

If the input is a live voice rather than text data, the speech recognition program in the volatile memory (0402) is loaded into the CPU (0401). The live voice is then input through the microphone (0407) and converted into digital data by the "A/D converter" (0405). When this data is fed to the speech recognition program, recognition is performed and the result is written as text data to the volatile memory (0402) or the nonvolatile memory (0403). The speech synthesis program then reads that data.

The speech waveform produced by the speech synthesis program is convolved with the impulse response waveform by the CPU (0401), yielding a reverberant speech waveform. The CPU (0401) compares this reverberant speech waveform with the waveform output by the speech synthesis program. If the resulting error X is at or below the threshold A, the synthesized waveform is played back by the CPU (0401) and output from the speaker (0406) via the D/A converter (0404). Otherwise, the parameters of the speech synthesis program (pitch, pause, speech rate, power, etc.) are adjusted, speech synthesis is run again on the CPU (0401), and the comparison is repeated.

One example of the comparison is to evaluate the reverberation level falling on the speech portions relative to the original sound, and the spectral distortion relative to the speech before reverberation was added, and then to adjust the parameters of the speech synthesis program so that the error X becomes no greater than the threshold A.

<Brief description of effects>
As described above, the announcement clarification device of this example can generate reverberant speech by convolving the impulse response waveform with the synthesized speech waveform. By comparing the reverberant speech with the original synthesized speech, it can adjust the various speech synthesis parameters. This makes it possible to deliver announcements that adapt to the environment in which they are made.

≪Example 2≫
<Overview>
FIG. 5 is a conceptual diagram for explaining an example of the processing of the announcement clarification device of this example. FIG. 5(A) shows the frequency spectrum of speech in a given environment. In FIG. 5(B), part of the frequency spectrum has been adjusted to suppress the frequency band that would mask the speech that follows. In this way, the announcement clarification device of this example can reduce the effect of reverberation by suppressing or emphasizing a given frequency band of the spectrum. By applying this processing to the reverberant speech waveform and optimizing it through feedback, reverberation can be reduced adaptively for the environment in which the announcement is made.

<Functional configuration>
FIG. 6 is a diagram showing an example of the functional blocks of the announcement clarification device of this embodiment. As shown in the figure, the "announcement clarification device" (0600) of this embodiment has an "announcement data acquisition unit" (0601), a "speech synthesis unit" (0602), an "impulse response waveform acquisition unit" (0603), a "reverberant speech generation unit" (0604), a "comparison unit" (0605), a "determination unit" (0606), and a "control unit" (0607). The device may also have a "data conversion output unit" (0608) and an "output unit" (0609). Although not shown, the "speech synthesis unit" (0602) may have "speech rate, pause, pitch, and power control means". The distinguishing feature of this embodiment is that the "speech synthesis unit" (0602) additionally has "speech spectrum control means" (0611).

The "speech spectrum control means" (0611) has the function of controlling the speech spectrum. Specifically, it receives a signal from the control unit and controls the parameters of an adaptive filter in the speech synthesis unit. For example, it can emphasize or suppress only the components in a certain frequency range. This enables clear voice announcements that are less affected by reverberation.

The filter can be adjusted in various ways. For example, a post filter of the kind used in speech coding can be employed. This enables processing such as formant emphasis or flattening, and emphasis of the harmonic structure of voiced sounds.
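As a rough illustration of emphasizing or suppressing one frequency band, the sketch below scales the DFT bins that fall inside a chosen band and transforms back. It is a stand-in under stated assumptions, not the adaptive filter or speech-coding post filter mentioned above: the function names, the naive O(n²) DFT, and the test tones are all hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for short illustrative signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def apply_band_gain(x, sample_rate, low_hz, high_hz, gain):
    """Multiply all components between low_hz and high_hz by gain."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bin k and its mirror n - k represent the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if low_hz <= freq <= high_hz:
            spectrum[k] *= gain
    return idft(spectrum)


# Hypothetical signal: a 100 Hz tone plus a weaker 300 Hz tone, sampled at 800 Hz.
sample_rate = 800
n = 64
tone_100 = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(n)]
mixed = [tone_100[t] + 0.5 * math.sin(2 * math.pi * 300 * t / sample_rate)
         for t in range(n)]
# Suppress the 250-350 Hz band entirely (gain 0); a gain above 1 would emphasize it.
filtered = apply_band_gain(mixed, sample_rate, 250.0, 350.0, 0.0)
```

Because both tones fall exactly on DFT bins in this example, removing the 250-350 Hz band leaves only the 100 Hz tone.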

<Process flow>
FIG. 7 is a flowchart showing an example of the processing flow in the announcement clarification device of this embodiment. First, text is acquired (step S0702). If live speech is input, it is converted into text-format data (step S0701) before the text is acquired. Next, speech synthesis is executed (step S0703). The impulse response waveform is acquired separately by this point (step S0704). Next, reverberant speech is generated (step S0705). The reverberant speech is then compared with the synthesized announcement speech (step S0706), and it is determined whether the comparison result is within a predetermined range (step S0707). If it is not within the range, the speech synthesis step is controlled according to the comparison result (step S0708), and the speech spectrum is also controlled according to the comparison result (step S0709). If the comparison result is within the predetermined range, the announcement speech can be output to the speaker (step S0710).
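The feedback loop in this flow can be sketched as below. The loop structure follows the steps in the text, but everything plugged into it is a toy assumption: the three-part "speech" signal, the impulse response, the masking-based error, the rule of lengthening the pause by one sample per iteration, and all function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Linear convolution used to add reverberation."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def optimize(synthesize, add_reverb, compare, adjust, params, threshold_a,
             max_iterations=20):
    """Repeat synthesis until the error X is at or below threshold A."""
    for _ in range(max_iterations):
        speech = synthesize(params)               # speech synthesis (S0703)
        reverberant = add_reverb(speech)          # reverberant speech (S0705)
        error_x = compare(speech, reverberant)    # comparison (S0706)
        if error_x <= threshold_a:                # determination (S0707)
            return speech, params, error_x        # ready for output (S0710)
        params = adjust(params, error_x)          # control (S0708, S0709)
    raise RuntimeError("no parameter set met the threshold")


# Toy stand-ins (assumptions): a loud vowel sample, a pause, a weak consonant sample.
IMPULSE_RESPONSE = [1.0, 0.5, 0.25, 0.125]


def toy_synthesize(params):
    return [1.0] + [0.0] * params["pause"] + [0.2]


def toy_add_reverb(speech):
    return convolve(speech, IMPULSE_RESPONSE)


def toy_compare(speech, reverberant):
    # Vowel reverberation landing on the consonant, relative to the consonant level.
    last = len(speech) - 1
    return abs(reverberant[last] - speech[last]) / abs(speech[last])


def toy_adjust(params, error_x):
    # Lengthen the pause by one sample; a real controller could scale by error_x.
    return {**params, "pause": params["pause"] + 1}


speech, params, error_x = optimize(toy_synthesize, toy_add_reverb, toy_compare,
                                   toy_adjust, {"pause": 0}, threshold_a=1.0)
```

In this toy setting the loop stops once the pause is long enough for the vowel's reverberation to decay below the consonant level; the same skeleton would accept a spectrum adjustment in place of `toy_adjust`.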

<Hardware configuration>
FIG. 8 is a schematic diagram showing an example of the configuration of the announcement clarification device when the functional components described above are implemented in hardware. The role of each hardware component in the processing of the present invention is described with reference to this figure. As shown in the figure, the announcement clarification device of this embodiment has a "CPU (central processing unit)" (0801) that performs various arithmetic operations, a "volatile memory" (0802), a "nonvolatile memory" (0803), a "D/A converter" (0804), and an "A/D converter" (0805). A "speaker" (0806) is connected to the D/A converter, and a "microphone" (0807) is connected to the A/D converter. These components are connected to one another by a data communication path such as a "system bus" (0808), over which they exchange and process information.

When the device starts up, the impulse response waveform is acquired and speech synthesis is performed. The hardware operation for this processing has already been described in the preceding embodiment and is therefore omitted here.

The CPU (0801) convolves the impulse response waveform with the speech waveform synthesized by the speech synthesis program, producing a reverberant speech waveform. The CPU (0801) then compares this reverberant speech waveform with the speech waveform output by the speech synthesis program. If the resulting error X is equal to or less than the threshold A, the speech waveform output by speech synthesis is played back by the CPU (0801) and output from the speaker (0806) via the D/A converter (0804). If the error X is equal to or greater than the threshold A, the filter parameters of the speech synthesis program (such as the power of each frequency band) are adjusted. After the adjustment, the speech synthesis program performs speech synthesis again on the CPU (0801), and the comparison is repeated.

<Brief description of effect>
In this way, the announcement clarification device of this embodiment can adaptively control the speech spectrum. This makes it possible to generate announcement speech that is less affected by reverberation and similar effects, matched to the environment in which the announcement is made.

<< Example 3 >>
<Overview>
FIG. 9 is a diagram showing an example of the processing in the announcement clarification device of this embodiment. As shown in the figure, when speech synthesis is performed, segments are selected from a speech segment database and speech is generated from them. Each segment has label data attached, in which various conditions can be described. For the segment "ta", for example, the label can specify that the steady-state portion "a" be suppressed, specify a pause, add values by which pitch and speech rate are increased or decreased, or state whether control should depend on the segment that follows. By adjusting the speech synthesis parameters for each speech segment based on this data, clear speech can be generated.

<Functional configuration>
FIG. 10 is a diagram showing an example of the functional blocks of the announcement clarification device of this embodiment. As shown in the figure, the "announcement clarification device" (1000) of this embodiment has an "announcement data acquisition unit" (1001), a "speech synthesis unit" (1002), an "impulse response waveform acquisition unit" (1003), a "reverberant speech generation unit" (1004), a "comparison unit" (1005), a "determination unit" (1006), and a "control unit" (1007). The device may also have a "data conversion output unit" (1008) and an "output unit" (1009). Although not shown, the "speech synthesis unit" (1002) may have "speech rate, pause, pitch, and power control means" and "speech spectrum control means". The distinguishing features of this embodiment are that the announcement data acquired by the "control unit" (1007) further includes parameters for speech synthesis, and that the "speech synthesis unit" (1002) has "segment parameter control means" (1012).

The "control unit" (1007) of this embodiment additionally has the function of controlling the speech synthesis unit using the announcement data acquired by the announcement data acquisition unit. The "announcement data" includes parameters for speech synthesis and the like. For example, the speech segments used for speech synthesis are also included in the announcement data. The speech synthesis unit is controlled based on the phonological information contained in this announcement data. Specifically, the control unit obtains the announcement data from the announcement data acquisition unit, obtains the label data attached to the speech segments, and uses that label data to control the speech synthesis parameters.

Here, "label data" can describe whether the segment is a vowel or a consonant, the degree to which the phoneme is suppressed as described above, and the frequency band to which the suppression applies. It can also describe the pause length, the amount of change in pitch, and the amount of change in speech rate. In addition, it can state following or preceding conditions, so that the processing changes depending on which other phoneme the segment is connected to.

The "segment parameter control means" (1012) has the function of controlling speech synthesis based on the parameters attached to the speech segments. Specifically, it receives a control signal generated from the label data attached to the speech segments, which is contained in the announcement data acquired by the "control unit" (1007). Based on this signal, it suppresses the power in a specific frequency band or changes the pause length, pitch, or speech rate.
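One way to picture the label data and the control signal derived from it is the sketch below. The text does not define a data format, so every field name, unit, and value here is a hypothetical choice, and the context rule covers only a single following-phoneme condition.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SegmentLabel:
    """Hypothetical label attached to one speech segment."""
    phoneme: str
    is_vowel: bool
    suppress_db: float = 0.0                         # how strongly to suppress
    band_hz: Optional[Tuple[float, float]] = None    # band the suppression applies to
    pause_ms: float = 0.0                            # pause to insert
    pitch_delta: float = 0.0                         # change in pitch
    rate_delta: float = 0.0                          # change in speech rate
    only_if_followed_by: Optional[str] = None        # following-phoneme condition


def control_signal(label, next_phoneme):
    """Return the adjustments that apply to this segment in its current context."""
    if (label.only_if_followed_by is not None
            and label.only_if_followed_by != next_phoneme):
        return {}  # the following condition is not met, so nothing is adjusted
    return {
        "suppress_db": label.suppress_db,
        "band_hz": label.band_hz,
        "pause_ms": label.pause_ms,
        "pitch_delta": label.pitch_delta,
        "rate_delta": label.rate_delta,
    }


# Hypothetical label for the segment "ta": suppress the steady-state "a" by 6 dB
# in the 200-800 Hz band and add a 50 ms pause, but only when "k" follows.
ta_label = SegmentLabel(phoneme="ta", is_vowel=False, suppress_db=6.0,
                        band_hz=(200.0, 800.0), pause_ms=50.0,
                        only_if_followed_by="k")
signal_before_k = control_signal(ta_label, "k")
signal_before_s = control_signal(ta_label, "s")
```

Keeping the context condition in the label, rather than in the synthesizer, matches the idea that the processing changes depending on which phoneme the segment connects to.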

<Process flow>
FIG. 11 is a flowchart showing an example of the processing flow in the announcement clarification device of this embodiment. First, text is acquired (step S1102). If live speech is input, it is converted into text-format data (step S1101) before the text is acquired. Next, speech synthesis is executed (step S1103). The impulse response waveform is acquired separately by this point (step S1104). Next, reverberant speech is generated (step S1105). The reverberant speech is then compared with the synthesized announcement speech (step S1106), and it is determined whether the comparison result is within a predetermined range (step S1107). If it is not within the range, the speech synthesis step is controlled according to the comparison result (step S1108), and the speech spectrum is controlled according to the comparison result (step S1109). The speech synthesis step is then also controlled using the announcement data (step S1110). If the comparison result is within the predetermined range, the announcement speech can be output to the speaker (step S1111).

<Hardware configuration>
FIG. 12 is a schematic diagram showing an example of the configuration of the announcement clarification device when the functional components described above are implemented in hardware. The role of each hardware component in the processing of the present invention is described with reference to this figure. As shown in the figure, the announcement clarification device of this embodiment has a "CPU (central processing unit)" (1201) that performs various arithmetic operations, a "volatile memory" (1202), a "nonvolatile memory" (1203), a "D/A converter" (1204), and an "A/D converter" (1205). A "speaker" (1206) is connected to the D/A converter, and a "microphone" (1207) is connected to the A/D converter. These components are connected to one another by a data communication path such as a "system bus" (1208), over which they exchange and process information.

The CPU (1201) convolves the impulse response waveform with the speech waveform synthesized by the speech synthesis program, producing a reverberant speech waveform. The CPU (1201) then compares this reverberant speech waveform with the speech waveform output by the speech synthesis program. If the resulting error X is equal to or less than the threshold A, the speech waveform output by speech synthesis is played back by the CPU (1201) and output from the speaker (1206) via the D/A converter (1204). If the error X is equal to or greater than the threshold A, the filter parameters of the speech synthesis program (such as the power of each frequency band) are adjusted.

At this time, the speech segments corresponding to the input text are read from the nonvolatile memory (1203) into the volatile memory (1202). The speech synthesis program accesses the label data attached to the speech segment data and obtains its information. In addition to information such as the vowel/consonant distinction, this information includes default values for pause, pitch, and speech rate, the adjustment amounts to use when tuning parameters, and the default values and adjustment amounts for the adaptive filter that shapes the speech spectrum. The parameters of the speech synthesis program are adjusted on this basis. After the adjustment, the speech synthesis program performs speech synthesis again on the CPU (1201), and the comparison is repeated.

<Brief description of effect>
In this way, when the announcement clarification device of this embodiment acquires announcement data, it can access the label data attached to the corresponding speech segments. This makes it possible to adjust the parameters of the speech synthesis unit so that they best suit both the content of the announcement being generated and the environment in which it is generated.

0200 Announcement clarification device
0201 Announcement data acquisition unit
0202 Speech synthesis unit
0203 Impulse response waveform acquisition unit
0204 Reverberant speech generation unit
0205 Comparison unit
0206 Determination unit
0207 Control unit

Claims

A speech synthesizer comprising:
an announcement data acquisition unit that acquires announcement data including text-format data indicating the announcement content;
a speech synthesis unit that synthesizes announcement speech from the acquired announcement data;
an impulse response waveform acquisition unit that acquires an impulse response waveform, which is a reverberation characteristic value of the space in which a speaker is disposed;
a reverberant speech generation unit that uses the acquired impulse response waveform to generate the reverberant speech produced by the synthesized announcement speech;
a comparison unit that compares the generated reverberant speech with the synthesized announcement speech;
a determination unit that determines whether the result of comparing the synthesized announcement speech with the generated reverberant speech falls within a predetermined range; and
a control unit that controls the speech synthesis unit according to the comparison result of the comparison unit when the determination result does not fall within the predetermined range.

The speech synthesizer according to claim 1, wherein the speech synthesis unit has speech rate, pause, pitch, and power control means.

The speech synthesizer according to claim 1 or 2, wherein the speech synthesis unit has speech spectrum control means.

The speech synthesizer according to any one of claims 1 to 3, wherein the speech synthesis unit has segment parameter control means that controls speech synthesis based on parameters attached to the speech segments.

The speech synthesizer according to any one of claims 1 to 4, further comprising an output unit that outputs the synthesized announcement speech to the speaker when the determination result falls within the predetermined range.

The speech synthesizer according to any one of claims 1 to 5, further comprising a data conversion output unit that converts live speech into text-format data and outputs it to the announcement data acquisition unit.

The speech synthesizer according to any one of claims 1 to 6, wherein the announcement data further includes parameters for speech synthesis.

A speech synthesis program for causing a speech synthesizer to execute:
an announcement data acquisition step of acquiring announcement data including text-format data indicating the announcement content;
a speech synthesis step of synthesizing announcement speech from the acquired announcement data;
an impulse response waveform acquisition step of acquiring an impulse response waveform, which is a reverberation characteristic value of the space in which a speaker is disposed;
a reverberant speech generation step of using the acquired impulse response waveform to generate the reverberant speech produced by the synthesized announcement speech;
a comparison step of comparing the generated reverberant speech with the synthesized announcement speech;
a determination step of determining whether the result of comparing the synthesized announcement speech with the generated reverberant speech falls within a predetermined range; and
a control step of controlling the speech synthesis step according to the comparison result of the comparison step when the determination result does not fall within the predetermined range.

A speech synthesis method comprising:
an announcement data acquisition step of acquiring announcement data including text-format data indicating the announcement content;
a speech synthesis step of synthesizing announcement speech from the acquired announcement data;
an impulse response waveform acquisition step of acquiring an impulse response waveform, which is a reverberation characteristic value of the space in which a speaker is disposed;
a reverberant speech generation step of using the acquired impulse response waveform to generate the reverberant speech produced by the synthesized announcement speech;
a comparison step of comparing the generated reverberant speech with the synthesized announcement speech;
a determination step of determining whether the result of comparing the synthesized announcement speech with the generated reverberant speech falls within a predetermined range; and
a control step of controlling the speech synthesis step according to the comparison result of the comparison step when the determination result does not fall within the predetermined range.

The speech synthesis method according to claim 9, wherein the speech synthesis step includes a speech rate, pause, pitch, and power control substep.

The speech synthesis method according to claim 9 or 10, wherein the speech synthesis step includes a speech spectrum control substep.

The speech synthesis method according to any one of claims 9 to 11, wherein the speech synthesis step includes a segment parameter control substep of controlling speech synthesis based on parameters attached to the speech segments.

The speech synthesis method according to any one of claims 9 to 12, further comprising an output step of outputting the synthesized announcement speech to the speaker when the determination result falls within the predetermined range.

The speech synthesis method according to any one of claims 9 to 13, further comprising a data conversion output step of converting live speech into text-format data and outputting it to the announcement data acquisition step.

The speech synthesis method according to any one of claims 9 to 14, wherein the announcement data further includes parameters for speech synthesis, and the control step includes a text control substep of controlling the speech synthesis step using those parameters.