WO2016088241A1 - Speech processing system and speech processing method

Speech processing system and speech processing method

Info

Publication number
WO2016088241A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
speech
unit
phonemes
created
Prior art date
Application number
PCT/JP2014/082198
Other languages
French (fr)
Japanese (ja)
Inventor
亮 岩宮
Original Assignee
三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority to PCT/JP2014/082198
Publication of WO2016088241A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00 — Speech recognition
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • The present invention relates to a speech processing system and a speech processing method.
  • In-vehicle devices that use content information (text data carried in radio broadcast waves or distributed by mobile communication devices) for speech recognition and speech synthesis have been proposed. Such a device can, for example, switch channels by performing speech recognition on a broadcast station name contained in a satellite radio broadcast wave, or read the recognized station name aloud.
  • Phonemes, which correspond to the phonetic symbols of a language, are used for the speech recognition and speech synthesis described above. For example, a speech recognition dictionary is generated from phonemes, or speech is synthesized from them.
  • Such phonemes fall into two categories: phonemes created in advance, offline and outside the vehicle-mounted device, and delivered within the content information together with the text data (hereinafter "offline-created phonemes"); and phonemes generated on the vehicle-mounted device (online) from the text data contained in the delivered content information (hereinafter "online-generated phonemes"). Offline-created phonemes are disclosed in, for example, Patent Document 1, and online-generated phonemes in, for example, Patent Document 2.
  • The phoneme formats and inventories supported by speech recognition and speech synthesis engines differ from engine to engine. For example, even for engines of the same language, phoneme formats may differ between manufacturers. Furthermore, even among engines of the same manufacturer, some phonemes are supported by both the English and French engines, while others are supported only by the English engine or only by the French engine.
  • An advantage of offline-created phonemes is that a person can tune and create them based on knowledge of the correct reading of the text data, so they can be more accurate than online-generated phonemes, which can only be produced mechanically. A disadvantage is that they cannot be used unless they were stored in the content information in advance, so they are less widely available than online-generated phonemes.
  • The present invention was made in view of the above problems, and its object is to provide a technology capable of suppressing the respective disadvantages of offline-created phonemes and online-generated phonemes.
  • A speech processing system according to the invention includes: an information acquisition unit that acquires, from an external source, content information containing text data and offline-created phonemes corresponding to the reading of the text data; an extraction unit that extracts the text data and the offline-created phonemes from the acquired content information; and a phoneme generation unit that generates online-generated phonemes based on the extracted text data. The system further includes a phoneme selection unit that determines whether to use the extracted offline-created phonemes; if so, it selects them, and otherwise it causes the phoneme generation unit to generate online-generated phonemes and selects those.
  • A speech processing method according to the invention acquires, from an external source, content information containing text data and offline-created phonemes corresponding to the reading of the text data; extracts the text data and the offline-created phonemes from the acquired content information; and determines whether to use the extracted offline-created phonemes. If they are to be used, the offline-created phonemes are selected; otherwise, online-generated phonemes are generated based on the extracted text data and selected.
  • FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 1.
  • FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 1.
  • FIG. 3 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 2.
  • FIG. 4 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 2.
  • FIG. 5 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 3.
  • FIG. 6 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 3.
  • FIG. 7 is a block diagram showing the configuration of the speech processing apparatus according to a modification.
  • FIG. 1 is a block diagram showing the configuration of a speech processing apparatus 1 according to Embodiment 1 of the present invention. The speech processing apparatus 1 of FIG. 1 includes a content information acquisition unit (information acquisition unit) 11, a text and phoneme extraction unit (extraction unit) 12, a phoneme generation unit 13, and a used phoneme selection unit (phoneme selection unit) 14.
  • The content information acquisition unit 11 is, for example, a communication device capable of communicating with the outside 81 of the speech processing apparatus 1 (such as a satellite radio broadcast station), or an input device connectable to such a communication device.
  • The text and phoneme extraction unit 12, the phoneme generation unit 13, and the used phoneme selection unit 14 are realized as functions of a CPU (Central Processing Unit, not shown) of the speech processing apparatus 1, which executes programs stored in a storage device (not shown) such as an HDD (Hard Disk Drive) or semiconductor memory.
  • The content information acquisition unit 11 acquires, from the outside 81, content information 82 including text data 82a and an offline-created phoneme 82b corresponding to the reading of the text data 82a.
  • The offline-created phoneme 82b is created, for example, by a person tuning it based on knowledge of the correct reading of the text data 82a.
  • The text and phoneme extraction unit 12 extracts the text data 82a and the offline-created phoneme 82b from the content information 82 acquired by the content information acquisition unit 11.
  • The phoneme generation unit 13 generates an online-generated phoneme based on the text data 82a extracted by the text and phoneme extraction unit 12.
  • The online-generated phoneme is produced mechanically from the text data 82a. Note that, as described next, whether the phoneme generation unit 13 generates the online-generated phoneme depends on the determination made by the used phoneme selection unit 14.
  • The used phoneme selection unit 14 determines whether to use the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12. If it determines to use it, it does not cause the phoneme generation unit 13 to generate an online-generated phoneme, and it selects the offline-created phoneme 82b.
  • Otherwise, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate an online-generated phoneme and selects that online-generated phoneme.
  • FIG. 2 is a flowchart showing the operation of the speech processing apparatus 1 according to the first embodiment.
  • In step S1, the content information acquisition unit 11 acquires the content information 82 from the outside 81.
  • In step S2, the text and phoneme extraction unit 12 extracts the text data 82a and the offline-created phoneme 82b from the content information 82 acquired by the content information acquisition unit 11.
  • In step S3, the used phoneme selection unit 14 determines whether to use the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12. It proceeds to step S4 when it determines that the offline-created phoneme 82b is to be used, and to step S5 otherwise.
  • In step S4, the used phoneme selection unit 14 selects the offline-created phoneme 82b. Thereafter, the operation of FIG. 2 ends.
  • In step S5, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate an online-generated phoneme based on the extracted text data 82a.
  • In step S6, the used phoneme selection unit 14 selects the generated online-generated phoneme. Thereafter, the operation of FIG. 2 ends.
  • <Summary of Embodiment 1> According to the speech processing apparatus 1 of Embodiment 1 described above, the offline-created phoneme 82b is selected when it is determined to be usable, and an online-generated phoneme is selected otherwise. Because the phoneme to be used can be selected dynamically in this way, the disadvantages of both offline-created and online-generated phonemes can be suppressed. As a result, both the accuracy of the phonemes used and the likelihood that usable phonemes are available can be increased.
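The selection flow of FIG. 2 (steps S1 to S6) can be sketched in a few lines of Python. This is an illustrative sketch only: the function and parameter names (`select_phoneme`, `use_offline`, `generate_online_phoneme`) are assumptions and do not appear in the patent.

```python
def select_phoneme(content_info, use_offline, generate_online_phoneme):
    """Sketch of FIG. 2: dynamically choose between the offline-created
    phoneme delivered inside the content information and an
    online-generated phoneme produced from the text data.

    content_info            -- mapping with 'text' and 'offline_phoneme' (S1/S2)
    use_offline             -- predicate standing in for the decision of S3
    generate_online_phoneme -- mechanical grapheme-to-phoneme fallback (S5)
    """
    # S2: extract the text data and the offline-created phoneme.
    text = content_info["text"]
    offline = content_info.get("offline_phoneme")

    # S3/S4: use the offline-created phoneme when the selector approves it.
    if offline is not None and use_offline(offline):
        return offline

    # S5/S6: otherwise generate and select an online-generated phoneme.
    return generate_online_phoneme(text)
```

For instance, calling `select_phoneme` with `"offline_phoneme": None` always takes the online path, mirroring the case where no offline-created phoneme was delivered in the content information.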
  • FIG. 3 is a block diagram showing the configuration of the speech processing apparatus 1 according to Embodiment 2 of the present invention.
  • Components that are the same as or similar to those described above are denoted by the same reference numerals, and the description below focuses on the differences.
  • In Embodiment 2, the speech processing apparatus 1 functions as a speech recognition apparatus and, in addition to the configuration of FIG. 1, includes a speech input unit 21, a speech recognition dictionary generation unit (dictionary generation unit) 22, a speech recognition dictionary storage unit 23, and a speech recognition unit 24.
  • The speech input unit 21 is a speech input device such as a microphone.
  • The speech recognition dictionary storage unit 23 is a storage device such as an HDD or semiconductor memory.
  • The speech recognition dictionary generation unit 22 and the speech recognition unit 24 are realized, for example, as functions of a CPU (not shown) of the speech processing apparatus 1.
  • The speech input unit 21 receives speech from the outside (for example, from a user).
  • The speech recognition dictionary generation unit 22 generates a speech recognition dictionary based on the phonemes selected by the used phoneme selection unit 14.
  • The speech recognition dictionary generated by the speech recognition dictionary generation unit 22 is stored in the speech recognition dictionary storage unit 23.
  • The used phoneme selection unit 14 determines to use the offline-created phoneme 82b when the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12 contains only predetermined phonemes that may be used for generating the speech recognition dictionary. Conversely, it determines not to use the offline-created phoneme 82b when the extracted phoneme contains any phoneme other than those predetermined phonemes.
  • The speech recognition unit 24 performs speech recognition of the speech to be recognized (the speech received by the speech input unit 21) using the speech recognition dictionary generated by the speech recognition dictionary generation unit 22 and stored in the speech recognition dictionary storage unit 23.
  • FIG. 4 is a flowchart showing the operation of the speech processing apparatus 1 according to the second embodiment.
  • In steps S11 and S12, operations similar to steps S1 and S2 of FIG. 2 are performed.
  • In step S13, in order to generate the speech recognition dictionary, the used phoneme selection unit 14 determines whether the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12 contains only the predetermined phonemes that may be used for generating the dictionary.
  • In step S14, if the used phoneme selection unit 14 has determined that the extracted offline-created phoneme 82b contains only the predetermined phonemes, it decides to use the offline-created phoneme 82b and proceeds to step S15; otherwise, it decides not to use it and proceeds to step S16.
  • In step S15, the used phoneme selection unit 14 selects the offline-created phoneme 82b, and the speech recognition dictionary generation unit 22 generates the speech recognition dictionary based on the offline-created phoneme 82b (the phoneme selected by the used phoneme selection unit 14). Thereafter, the process proceeds to step S19.
  • In step S16, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate an online-generated phoneme based on the extracted text data 82a.
  • In step S17, the used phoneme selection unit 14 selects the generated online-generated phoneme, and the speech recognition dictionary generation unit 22 generates the speech recognition dictionary based on the online-generated phoneme (the phoneme selected by the used phoneme selection unit 14). Thereafter, the process proceeds to step S19.
  • In step S18, which runs in parallel with steps S11 to S17, the speech input unit 21 receives speech from the outside. Thereafter, the process proceeds to step S19.
  • In step S19, the speech recognition unit 24 performs speech recognition of the speech to be recognized (the speech received by the speech input unit 21) using the speech recognition dictionary generated by the speech recognition dictionary generation unit 22. Thereafter, the operation of FIG. 4 ends.
  • <Summary of Embodiment 2> According to the speech processing apparatus 1 of Embodiment 2 described above, the disadvantages of offline-created phonemes and of online-generated phonemes can be suppressed in generating the speech recognition dictionary, as in Embodiment 1. As a result, both the accuracy of the phonemes used and the likelihood that usable phonemes are available can be increased when generating the speech recognition dictionary. A speech recognition dictionary can therefore be generated appropriately even in a region, such as Europe, where multiple languages with different phonemes are used.
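Embodiment 2's acceptance test — use the offline-created phoneme only when every phoneme in it belongs to the inventory the recognition engine supports — might look like the following sketch. The phoneme inventory, the entry format, and all helper names here are illustrative assumptions, not part of the patent.

```python
def offline_phoneme_usable(offline_phonemes, supported):
    """S13/S14: accept the offline-created phoneme only if it contains
    no phoneme outside the engine's supported inventory."""
    return all(p in supported for p in offline_phonemes)


def build_recognition_dictionary(entries, supported, generate_online):
    """Sketch of steps S13 to S17 applied per entry, following the remark
    that selection may be made for each text/phoneme pair."""
    dictionary = []
    for text, offline in entries:
        if offline is not None and offline_phoneme_usable(offline, supported):
            dictionary.append((text, offline))               # S15: offline-created
        else:
            dictionary.append((text, generate_online(text))) # S16/S17: online
    return dictionary
```

With a hypothetical English-engine inventory such as `{"b", "iy", "s"}`, an entry whose offline phonemes include a phoneme supported only by a French engine would fall back to the online generator, while fully supported entries keep their hand-tuned phonemes.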
  • The used phoneme selection unit 14 may perform the above selection for each external source 81. For example, the offline-created phonemes 82b may be selected (used) for some external sources 81, while online-generated phonemes are selected (used) for the remaining external sources 81. The selection may also be performed for each pair of text data 82a and offline-created phoneme 82b.
  • FIG. 5 is a block diagram showing the configuration of the speech processing apparatus 1 according to Embodiment 3 of the present invention.
  • Components that are the same as or similar to those described above are denoted by the same reference numerals, and the description below focuses on the differences.
  • In Embodiment 3, the speech processing apparatus 1 functions as a speech synthesizer and includes a speech synthesis unit 31 and a speech output unit 32 in addition to the configuration of FIG. 1.
  • The speech synthesis unit 31 is realized, for example, as a function of a CPU (not shown) of the speech processing apparatus 1.
  • The speech output unit 32 is, for example, a speech output device such as a speaker.
  • The speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 using the phonemes selected by the used phoneme selection unit 14.
  • The used phoneme selection unit 14 determines to use the offline-created phoneme 82b when the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12 contains only predetermined phonemes that may be used for synthesizing the speech to be output from the speech output unit 32. Conversely, it determines not to use the offline-created phoneme 82b when the extracted phoneme contains any phoneme other than those predetermined phonemes.
  • The speech output unit 32 outputs the speech synthesized by the speech synthesis unit 31 to the outside.
  • FIG. 6 is a flowchart showing the operation of the speech processing apparatus 1 according to the third embodiment.
  • In steps S21 and S22, operations similar to steps S1 and S2 of FIG. 2 are performed.
  • In step S23, in order to synthesize the speech to be output from the speech output unit 32, the used phoneme selection unit 14 determines whether the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12 contains only the predetermined phonemes that may be used for the synthesis.
  • In step S24, if the used phoneme selection unit 14 has determined that the extracted offline-created phoneme 82b contains only the predetermined phonemes, it decides to use the offline-created phoneme 82b and proceeds to step S25; otherwise, it decides not to use it and proceeds to step S26.
  • In step S25, the used phoneme selection unit 14 selects the offline-created phoneme 82b, and the speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 based on the offline-created phoneme 82b (the phoneme selected by the used phoneme selection unit 14). Thereafter, the process proceeds to step S28.
  • In step S26, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate an online-generated phoneme based on the extracted text data 82a.
  • In step S27, the used phoneme selection unit 14 selects the generated online-generated phoneme, and the speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 based on the online-generated phoneme (the phoneme selected by the used phoneme selection unit 14). Thereafter, the process proceeds to step S28.
  • In step S28, the speech output unit 32 outputs the speech synthesized by the speech synthesis unit 31 to the outside. Thereafter, the operation of FIG. 6 ends.
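Embodiment 3 applies the same gate before synthesis rather than dictionary generation. Assuming a `synthesize` callback standing in for the speech synthesis unit 31 (all parameter names here are illustrative, not from the patent), the flow of FIG. 6 reduces to roughly:

```python
def speak(text, offline_phonemes, supported, generate_online, synthesize):
    """Sketch of FIG. 6 (S23 to S28): synthesize from the offline-created
    phoneme when all of its phonemes are supported by the synthesis
    engine, otherwise from an online-generated phoneme."""
    if offline_phonemes is not None and all(p in supported for p in offline_phonemes):
        chosen = offline_phonemes          # S25: offline-created phoneme
    else:
        chosen = generate_online(text)     # S26/S27: online-generated phoneme
    return synthesize(chosen)              # S28: output the synthesized speech
```

The design mirrors Embodiment 2: only the consumer of the selected phoneme changes (synthesis engine instead of dictionary generator), which is why the modification of FIG. 7 can share one selection unit between both paths.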
  • As a modification, the speech processing apparatus 1 may combine the configuration of Embodiment 2 with the configuration of Embodiment 3.
  • FIG. 7 is a block diagram showing the configuration of the speech processing apparatus 1 according to this modification.
  • In generating the speech recognition dictionary, the used phoneme selection unit 14 determines to use the offline-created phoneme 82b when the extracted offline-created phoneme 82b contains only the predetermined phonemes that may be used for generating the dictionary, and determines not to use it when the extracted phoneme contains any other phoneme. Likewise, in synthesizing the speech to be output from the speech output unit 32, the used phoneme selection unit 14 determines to use the offline-created phoneme 82b when the extracted offline-created phoneme 82b contains only the predetermined phonemes that may be used for the synthesis, and determines not to use it otherwise.
  • The speech processing apparatus 1 described above is applicable not only to an installed navigation device that can be mounted on a vehicle, but also to a portable navigation device, a communication terminal (for example, a portable terminal such as a mobile phone, smartphone, or tablet), the functions of applications installed on these devices, and servers. The present invention can also be applied to a speech processing system constructed by appropriately combining these as a system. In that case, the functions and components of the speech processing apparatus 1 described above may be distributed among the devices that constitute the system, or concentrated in any one of the devices.
  • 1 speech processing apparatus, 11 content information acquisition unit, 12 text and phoneme extraction unit, 13 phoneme generation unit, 14 used phoneme selection unit, 22 speech recognition dictionary generation unit, 24 speech recognition unit, 31 speech synthesis unit, 81 outside (external source), 82 content information, 82a text data, 82b offline-created phoneme.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The purpose of the present invention is to provide a technique for minimizing each disadvantage of an offline prepared phoneme and an online generated phoneme. A speech processing device 1 includes a text and phoneme extraction unit 12 for extracting text data 82a and an offline prepared phoneme 82b from content information 82, a phoneme generation unit 13, and a working phoneme selection unit 14. The working phoneme selection unit 14 determines whether or not to use the extracted offline prepared phoneme 82b, and selects either the offline prepared phoneme 82b if it is determined that the offline prepared phoneme 82b is to be used or an online generated phoneme if it is determined that the offline prepared phoneme 82b is not to be used, said online generated phoneme being generated by the phoneme generation unit 13.

Description

Speech processing system and speech processing method

The present invention relates to a speech processing system and a speech processing method.
In recent years, content information including text data has been carried in radio broadcast waves or distributed by mobile communication devices. In-vehicle devices that use such content information for speech recognition and speech synthesis have therefore been proposed. Such a device can, for example, switch channels by performing speech recognition on a broadcast station name contained in a satellite radio broadcast wave, or read the recognized station name aloud.

Phonemes, which correspond to the phonetic symbols of a language, are used for the speech recognition and speech synthesis described above. For example, a speech recognition dictionary is generated from phonemes, or speech is synthesized from them. Such phonemes fall into two categories: phonemes created in advance, offline and outside the vehicle-mounted device, and delivered within the content information together with the text data (hereinafter "offline-created phonemes"); and phonemes generated on the vehicle-mounted device (online) from the text data contained in the delivered content information (hereinafter "online-generated phonemes"). Offline-created phonemes are disclosed in, for example, Patent Document 1, and online-generated phonemes in, for example, Patent Document 2.
Here, the phoneme formats and inventories supported by speech recognition and speech synthesis engines differ from engine to engine. For example, even for engines of the same language, phoneme formats may differ between manufacturers. Furthermore, even among engines of the same manufacturer, some phonemes are supported by both the English and French engines, while others are supported only by the English engine or only by the French engine.

An advantage of offline-created phonemes is that a person can tune and create them based on knowledge of the correct reading of the text data, so phonemes more accurate than online-generated phonemes, which can only be produced mechanically, can be used. A disadvantage, however, is that they cannot be used unless they were stored in the content information in advance, so they are less widely available than online-generated phonemes.

Conversely, a disadvantage of online-generated phonemes is that the phonemes used are less accurate than offline-created phonemes. An advantage, however, is that a phoneme can always be generated for any text data the engine supports, so online-generated phonemes are more widely available than offline-created phonemes.
Patent Document 1: International Publication No. 2007/069512. Patent Document 2: JP 2011-033874 A.
A conventional device fixedly uses only one of offline-created phonemes and online-generated phonemes. Whichever type is used, the disadvantage of that type cannot be avoided.

The present invention was made in view of the above problem, and its object is to provide a technology capable of suppressing the respective disadvantages of offline-created phonemes and online-generated phonemes.
A speech processing system according to the present invention includes: an information acquisition unit that acquires, from an external source, content information containing text data and offline-created phonemes corresponding to the reading of the text data; an extraction unit that extracts the text data and the offline-created phonemes from the acquired content information; and a phoneme generation unit that generates online-generated phonemes based on the extracted text data. The system further includes a phoneme selection unit that determines whether to use the extracted offline-created phonemes, selects them if so, and otherwise causes the phoneme generation unit to generate online-generated phonemes and selects those.

A speech processing method according to the present invention acquires, from an external source, content information containing text data and offline-created phonemes corresponding to the reading of the text data; extracts the text data and the offline-created phonemes from the acquired content information; and determines whether to use the extracted offline-created phonemes. If they are to be used, the offline-created phonemes are selected; otherwise, online-generated phonemes are generated based on the extracted text data and selected.
According to the present invention, the respective disadvantages of offline-created phonemes and online-generated phonemes can be suppressed.

The objects, features, aspects, and advantages of the present invention will become more apparent from the following detailed description and the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 1. FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 1. FIG. 3 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 2. FIG. 4 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 2. FIG. 5 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 3. FIG. 6 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 3. FIG. 7 is a block diagram showing the configuration of the speech processing apparatus according to a modification.
<Embodiment 1>
In the following description, a case where the speech processing system according to the present invention is applied to a stand-alone speech processing apparatus will be described as an example.
 FIG. 1 is a block diagram showing the configuration of a speech processing apparatus 1 according to Embodiment 1 of the present invention. The speech processing apparatus 1 of FIG. 1 includes a content information acquisition unit (information acquisition unit) 11, a text and phoneme extraction unit (extraction unit) 12, a phoneme generation unit 13, and a used phoneme selection unit (phoneme selection unit) 14.
 First, an example of the hardware applied to the components of the speech processing apparatus 1 according to Embodiment 1 will be described. The content information acquisition unit 11 is composed of, for example, a communication device capable of communicating with an external source 81 (such as a satellite radio broadcast station) outside the speech processing apparatus 1, or an input device connectable to such a communication device. The text and phoneme extraction unit 12, the phoneme generation unit 13, and the used phoneme selection unit 14 are realized as functions of a CPU (Central Processing Unit, not shown) of the speech processing apparatus 1 by the CPU executing a program stored in a storage device, not shown, such as an HDD (Hard Disk Drive) or a semiconductor memory.
 Next, each component of the speech processing apparatus 1 according to Embodiment 1 will be described in detail.
 The content information acquisition unit 11 acquires, from the external source 81, content information 82 including text data 82a and offline-created phonemes 82b corresponding to the reading of the text data 82a. The offline-created phonemes 82b are created, for example, by a person tuning them based on knowledge of the correct reading of the text data 82a.
 The text and phoneme extraction unit 12 extracts the text data 82a and the offline-created phonemes 82b from the content information 82 acquired by the content information acquisition unit 11.
 The phoneme generation unit 13 generates online-generated phonemes based on the text data 82a extracted by the text and phoneme extraction unit 12. The online-generated phonemes are generated mechanically, for example, from the text data 82a. As described next, whether the phoneme generation unit 13 generates online-generated phonemes is determined based on the determination result of the used phoneme selection unit 14.
 The used phoneme selection unit 14 determines whether to use the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12. When it determines to use them, it selects the offline-created phonemes 82b without causing the phoneme generation unit 13 to generate online-generated phonemes. On the other hand, when it determines not to use the extracted offline-created phonemes 82b, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate online-generated phonemes and selects those online-generated phonemes.
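 The selection behavior described above can be sketched as follows. This is a minimal illustrative sketch, not the apparatus itself: the function names, the trivial per-character phoneme mapping, and the caller-supplied predicate are all assumptions introduced for illustration.

```python
def generate_online_phonemes(text):
    # Stand-in for the phoneme generation unit 13: mechanically derive
    # phonemes from the text data (here, a trivial per-character mapping).
    return ["ph_" + ch for ch in text if not ch.isspace()]

def select_phonemes(text_data, offline_phonemes, should_use_offline):
    """Stand-in for the used phoneme selection unit 14.

    `should_use_offline` is a caller-supplied predicate; the concrete
    criterion depends on the embodiment (see Embodiments 2 and 3 below).
    """
    if offline_phonemes is not None and should_use_offline(offline_phonemes):
        # Steps S3 -> S4: use the offline-created phonemes as-is,
        # without invoking the phoneme generation unit.
        return offline_phonemes
    # Steps S3 -> S5 -> S6: fall back to online-generated phonemes.
    return generate_online_phonemes(text_data)
```

 For example, `select_phonemes("AB", ["a", "b"], lambda p: True)` returns the offline-created phonemes unchanged, while a predicate returning `False` triggers online generation from the text data.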
 <Operation>
 FIG. 2 is a flowchart showing the operation of the speech processing apparatus 1 according to Embodiment 1.
 First, in step S1, the content information acquisition unit 11 acquires the content information 82 from the external source 81.
 In step S2, the text and phoneme extraction unit 12 extracts the text data 82a and the offline-created phonemes 82b from the content information 82 acquired by the content information acquisition unit 11.
 In step S3, the used phoneme selection unit 14 determines whether to use the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12. When it determines to use the offline-created phonemes 82b, the process proceeds to step S4; otherwise, it proceeds to step S5.
 In step S4, the used phoneme selection unit 14 selects the offline-created phonemes 82b. The operation of FIG. 2 then ends.
 In step S5, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate online-generated phonemes based on the extracted text data 82a.
 Then, in step S6, the used phoneme selection unit 14 selects the generated online-generated phonemes. The operation of FIG. 2 then ends.
 <Summary of Embodiment 1>
 According to the speech processing apparatus 1 of Embodiment 1 described above, the offline-created phonemes 82b are selected when it is determined that they are to be used, and online-generated phonemes are selected when it is determined that the offline-created phonemes 82b are not to be used. Since the phonemes to be used can thus be selected dynamically, the disadvantages of offline-created phonemes and the disadvantages of online-generated phonemes can both be suppressed. As a result, the accuracy of the phonemes used, and the likelihood that usable phonemes are available, can both be increased.
 <Embodiment 2>
 FIG. 3 is a block diagram showing the configuration of the speech processing apparatus 1 according to Embodiment 2 of the present invention. In the speech processing apparatus 1 according to Embodiment 2, components identical or similar to those described above are given the same reference signs, and the differences are mainly described.
 The speech processing apparatus 1 according to Embodiment 2 has the functions of a speech recognition apparatus and includes, in addition to the configuration of FIG. 1, a speech input unit 21, a speech recognition dictionary generation unit (dictionary generation unit) 22, a speech recognition dictionary storage unit 23, and a speech recognition unit 24.
 First, an example of the hardware applied to the components added to the speech processing apparatus 1 according to Embodiment 2 will be described. The speech input unit 21 is composed of, for example, a speech input device such as a microphone, and the speech recognition dictionary storage unit 23 is composed of, for example, a storage device such as an HDD or a semiconductor memory. The speech recognition dictionary generation unit 22 and the speech recognition unit 24 are realized, for example, as functions of the CPU (not shown) of the speech processing apparatus 1.
 Next, each component of the speech processing apparatus 1 according to Embodiment 2 will be described in detail.
 The speech input unit 21 receives speech from outside (for example, from a user).
 The speech recognition dictionary generation unit 22 generates a speech recognition dictionary based on the phonemes selected by the used phoneme selection unit 14. The speech recognition dictionary generated by the speech recognition dictionary generation unit 22 is stored in the speech recognition dictionary storage unit 23.
 Here, the used phoneme selection unit 14 according to Embodiment 2 determines to use the offline-created phonemes 82b when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only predetermined phonemes to be used for generating the speech recognition dictionary. On the other hand, when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 also contain phonemes other than the predetermined phonemes, the used phoneme selection unit 14 determines not to use the offline-created phonemes 82b.
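 A minimal sketch of this "contains only predetermined phonemes" criterion, assuming the predetermined phonemes are available as a set; the concrete inventory below is a made-up example, since the actual set would be fixed by the recognition engine, not by this sketch:

```python
# Hypothetical inventory of phonemes the dictionary engine supports; the
# actual predetermined set would come from the speech recognition engine.
SUPPORTED_PHONEMES = {"a", "i", "u", "e", "o", "k", "s", "t", "n"}

def use_offline_phonemes(offline_phonemes, supported=SUPPORTED_PHONEMES):
    # Use the offline-created phonemes only if every one of them is in the
    # predetermined set; a single unsupported phoneme forces online generation.
    return set(offline_phonemes).issubset(supported)
```

 For example, `use_offline_phonemes(["k", "a"])` evaluates to `True`, whereas an input containing a phoneme outside the set, such as `["k", "x"]`, evaluates to `False` and would route the apparatus to online generation.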
 The speech recognition unit 24 performs speech recognition of the speech to be recognized (the speech received by the speech input unit 21) using the speech recognition dictionary generated by the speech recognition dictionary generation unit 22 (that is, the speech recognition dictionary stored in the speech recognition dictionary storage unit 23).
 <Operation>
 FIG. 4 is a flowchart showing the operation of the speech processing apparatus 1 according to Embodiment 2.
 First, in steps S11 and S12, the same operations as in steps S1 and S2 of FIG. 2 are performed.
 In step S13, the used phoneme selection unit 14 makes a trial attempt to generate the speech recognition dictionary. Through this trial, the used phoneme selection unit 14 determines whether the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only the predetermined phonemes to be used for generating the speech recognition dictionary.
 In step S14, when the used phoneme selection unit 14 determines that the extracted offline-created phonemes 82b contain only the predetermined phonemes, it determines to use the offline-created phonemes 82b and the process proceeds to step S15; otherwise, it determines not to use the offline-created phonemes 82b and the process proceeds to step S16.
 In step S15, the used phoneme selection unit 14 selects the offline-created phonemes 82b, and the speech recognition dictionary generation unit 22 generates the speech recognition dictionary based on the offline-created phonemes 82b (the phonemes selected by the used phoneme selection unit 14). The process then proceeds to step S19.
 In step S16, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate online-generated phonemes based on the extracted text data 82a.
 Then, in step S17, the used phoneme selection unit 14 selects the generated online-generated phonemes, and the speech recognition dictionary generation unit 22 generates the speech recognition dictionary based on those online-generated phonemes (the phonemes selected by the used phoneme selection unit 14). The process then proceeds to step S19.
 In step S18, which runs in parallel with steps S11 to S17, the speech input unit 21 receives speech from outside. The process then proceeds to step S19.
 In step S19, the speech recognition unit 24 performs speech recognition of the speech to be recognized (the speech received by the speech input unit 21) using the speech recognition dictionary generated by the speech recognition dictionary generation unit 22. The operation of FIG. 4 then ends.
 <Summary of Embodiment 2>
 According to the speech processing apparatus 1 of Embodiment 2 described above, the disadvantages of offline-created phonemes and the disadvantages of online-generated phonemes are suppressed in the generation of the speech recognition dictionary, as in Embodiment 1. As a result, the accuracy of the phonemes used, and the likelihood that usable phonemes are available, can be increased when generating the speech recognition dictionary. A speech recognition dictionary can therefore be generated appropriately even in regions, such as Europe, where multiple languages with different phoneme inventories are used.
 In the configuration described above, when the content information acquisition unit 11 acquires content information 82 from each of a plurality of external sources 81, the used phoneme selection unit 14 may perform the above selection for each external source 81. As a result, the offline-created phonemes 82b may be selected (used) for some external sources 81 while online-generated phonemes are selected (used) for the remaining external sources 81. Furthermore, when a single item of content information 82 contains multiple pieces of text data 82a and multiple sets of offline-created phonemes 82b, the above selection may be performed for each pair of text data 82a and offline-created phonemes 82b.
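 This per-source behavior can be sketched as follows; the source identifiers, the predicate, and the generator passed in are hypothetical stand-ins, since the patent does not prescribe a particular data layout:

```python
def select_per_source(contents, should_use_offline, generate_online):
    """For each external source, independently pick offline or online phonemes.

    `contents` maps a source identifier to (text_data, offline_phonemes).
    """
    selected = {}
    for source_id, (text_data, offline_phonemes) in contents.items():
        if should_use_offline(offline_phonemes):
            # Keep the human-tuned offline phonemes for this source.
            selected[source_id] = offline_phonemes
        else:
            # Fall back to mechanical online generation for this source only.
            selected[source_id] = generate_online(text_data)
    return selected
```

 With this shape, one broadcast station's offline phonemes can be accepted while another's are rejected in the same pass, matching the mixed selection described above.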
 <Embodiment 3>
 FIG. 5 is a block diagram showing the configuration of the speech processing apparatus 1 according to Embodiment 3 of the present invention. In the speech processing apparatus 1 according to Embodiment 3, components identical or similar to those described above are given the same reference signs, and the differences are mainly described.
 The speech processing apparatus 1 according to Embodiment 3 has the functions of a speech synthesis apparatus and includes, in addition to the configuration of FIG. 1, a speech synthesis unit 31 and a speech output unit 32.
 First, an example of the hardware applied to the components added to the speech processing apparatus 1 according to Embodiment 3 will be described. The speech synthesis unit 31 is realized, for example, as a function of the CPU (not shown) of the speech processing apparatus 1. The speech output unit 32 is composed of, for example, a speech output device such as a speaker.
 Next, each component of the speech processing apparatus 1 according to Embodiment 3 will be described in detail.
 The speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 using the phonemes selected by the used phoneme selection unit 14.
 Here, the used phoneme selection unit 14 according to Embodiment 3 determines to use the offline-created phonemes 82b when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only predetermined phonemes to be used for synthesizing the speech output from the speech output unit 32. On the other hand, when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 also contain phonemes other than the predetermined phonemes, the used phoneme selection unit 14 determines not to use the offline-created phonemes 82b.
 The speech output unit 32 outputs the speech synthesized by the speech synthesis unit 31 to the outside.
 <Operation>
 FIG. 6 is a flowchart showing the operation of the speech processing apparatus 1 according to Embodiment 3.
 First, in steps S21 and S22, the same operations as in steps S1 and S2 of FIG. 2 are performed.
 In step S23, the used phoneme selection unit 14 makes a trial attempt to synthesize the speech to be output from the speech output unit 32. Through this trial, the used phoneme selection unit 14 determines whether the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only the predetermined phonemes to be used for synthesizing that speech.
 In step S24, when the used phoneme selection unit 14 determines that the extracted offline-created phonemes 82b contain only the predetermined phonemes, it determines to use the offline-created phonemes 82b and the process proceeds to step S25; otherwise, it determines not to use the offline-created phonemes 82b and the process proceeds to step S26.
 In step S25, the used phoneme selection unit 14 selects the offline-created phonemes 82b, and the speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 based on the offline-created phonemes 82b (the phonemes selected by the used phoneme selection unit 14). The process then proceeds to step S28.
 In step S26, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate online-generated phonemes based on the extracted text data 82a.
 Then, in step S27, the used phoneme selection unit 14 selects the generated online-generated phonemes, and the speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 based on those online-generated phonemes (the phonemes selected by the used phoneme selection unit 14). The process then proceeds to step S28.
 In step S28, the speech output unit 32 outputs the speech synthesized by the speech synthesis unit 31 to the outside. The operation of FIG. 6 then ends.
 <Summary of Embodiment 3>
 According to the speech processing apparatus 1 of Embodiment 3 described above, the disadvantages of offline-created phonemes and the disadvantages of online-generated phonemes are suppressed in speech synthesis, as in Embodiment 1. As a result, the accuracy of the phonemes used, and the likelihood that usable phonemes are available, can be increased in speech synthesis. The speech to be output can therefore be synthesized appropriately even in regions, such as Europe, where multiple languages with different phoneme inventories are used.
 <Modification>
 The speech processing apparatus 1 may combine the configuration of Embodiment 2 with the configuration of Embodiment 3. FIG. 7 is a block diagram showing the configuration of the speech processing apparatus 1 according to this modification.
 The used phoneme selection unit 14 according to this modification determines to use the offline-created phonemes 82b when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only predetermined phonemes to be used for generating the speech recognition dictionary. On the other hand, when the extracted offline-created phonemes 82b also contain phonemes other than the predetermined phonemes, the used phoneme selection unit 14 determines not to use the offline-created phonemes 82b.
 Likewise, the used phoneme selection unit 14 according to this modification determines to use the offline-created phonemes 82b when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only predetermined phonemes to be used for synthesizing the speech output from the speech output unit 32. On the other hand, when the extracted offline-created phonemes 82b also contain phonemes other than the predetermined phonemes, the used phoneme selection unit 14 determines not to use the offline-created phonemes 82b.
 According to the speech processing apparatus 1 of this modification described above, both the effects of Embodiment 2 and the effects of Embodiment 3 can be obtained.
 <Other Modifications>
 The speech processing apparatus 1 described above can also be applied to a speech processing system constructed by appropriately combining a navigation device installed in a vehicle, a Portable Navigation Device, a communication terminal (for example, a mobile terminal such as a mobile phone, smartphone, or tablet), the functions of applications installed on these devices, a server, and the like. In this case, the functions and components of the speech processing apparatus 1 described above may be distributed among the devices constituting the system, or may be concentrated in any one of those devices.
 Within the scope of the invention, the embodiments may be freely combined, and each embodiment may be modified or omitted as appropriate.
 Although the present invention has been described in detail, the above description is in all aspects illustrative, and the present invention is not limited thereto. It is understood that countless variations not illustrated here can be envisaged without departing from the scope of the present invention.
 1 speech processing apparatus, 11 content information acquisition unit, 12 text and phoneme extraction unit, 13 phoneme generation unit, 14 used phoneme selection unit, 22 speech recognition dictionary generation unit, 24 speech recognition unit, 31 speech synthesis unit, 81 external source, 82 content information, 82a text data, 82b offline-created phonemes.

Claims (7)

  1.  A speech processing system comprising:
      an information acquisition unit that acquires, from an external source, content information including text data and offline-created phonemes corresponding to the reading of the text data;
      an extraction unit that extracts the text data and the offline-created phonemes from the content information acquired by the information acquisition unit;
      a phoneme generation unit that generates online-generated phonemes based on the text data extracted by the extraction unit; and
      a phoneme selection unit that determines whether to use the offline-created phonemes extracted by the extraction unit, selects the offline-created phonemes when it determines to use them, and, when it determines not to use them, causes the phoneme generation unit to generate the online-generated phonemes and selects those online-generated phonemes.
  2.  The speech processing system according to claim 1, further comprising:
      a dictionary generation unit that generates a speech recognition dictionary based on the phonemes selected by the phoneme selection unit; and
      a speech recognition unit that performs speech recognition of speech to be recognized using the speech recognition dictionary generated by the dictionary generation unit.
  3.  The speech processing system according to claim 2, wherein
      the phoneme selection unit determines to use the offline-created phonemes when the offline-created phonemes extracted by the extraction unit contain only predetermined phonemes to be used for generating the speech recognition dictionary, and determines not to use the offline-created phonemes when the offline-created phonemes extracted by the extraction unit also contain phonemes other than the predetermined phonemes.
  4.  The speech processing system according to claim 1, further comprising a speech synthesis unit that synthesizes speech to be output from a speech output unit by using the phonemes selected by the phoneme selection unit.
  5.  The speech processing system according to claim 4, wherein
      the phoneme selection unit determines to use the offline-created phonemes when the offline-created phonemes extracted by the extraction unit contain only predetermined phonemes to be used for synthesizing the speech output from the speech output unit, and determines not to use the offline-created phonemes when the offline-created phonemes extracted by the extraction unit also contain phonemes other than the predetermined phonemes.
  6.  The speech processing system according to claim 1, further comprising:
      a dictionary generation unit that generates a speech recognition dictionary based on the phonemes selected by the phoneme selection unit;
      a speech recognition unit that performs speech recognition of speech to be recognized using the speech recognition dictionary generated by the dictionary generation unit; and
      a speech synthesis unit that synthesizes speech to be output from a speech output unit by using the phonemes selected by the phoneme selection unit.
  7.  A speech processing method comprising:
      acquiring, from an external source, content information including text data and offline-created phonemes corresponding to the reading of the text data;
      extracting the text data and the offline-created phonemes from the acquired content information;
      determining whether to use the extracted offline-created phonemes;
      selecting the offline-created phonemes when it is determined to use them; and
      when it is determined not to use them, generating online-generated phonemes based on the extracted text data and selecting the online-generated phonemes.
PCT/JP2014/082198 2014-12-05 2014-12-05 Speech processing system and speech processing method WO2016088241A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/082198 WO2016088241A1 (en) 2014-12-05 2014-12-05 Speech processing system and speech processing method

Publications (1)

Publication Number Publication Date
WO2016088241A1 true WO2016088241A1 (en) 2016-06-09


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005227545A (en) * 2004-02-13 2005-08-25 Matsushita Electric Ind Co Ltd Dictionary creation system, program guide system and dictionary creation method
WO2007069512A1 (en) * 2005-12-15 2007-06-21 Sharp Kabushiki Kaisha Information processing device, and program
JP2011033874A (en) * 2009-08-03 2011-02-17 Alpine Electronics Inc Device for multilingual voice recognition, multilingual voice recognition dictionary creation method
JP2012058311A (en) * 2010-09-06 2012-03-22 Alpine Electronics Inc Method and apparatus for generating dynamic voice recognition dictionary
WO2012172596A1 (en) * 2011-06-14 2012-12-20 三菱電機株式会社 Pronunciation information generating device, in-vehicle information device, and database generating method


Similar Documents

Publication Publication Date Title
US11250859B2 (en) Accessing multiple virtual personal assistants (VPA) from a single device
WO2015098306A1 (en) Response control device and control program
JP6844472B2 (en) Information processing device
US11120785B2 (en) Voice synthesis device
US20180211668A1 (en) Reduced latency speech recognition system using multiple recognizers
JP2018054790A (en) Voice interaction system and voice interaction method
US20170243588A1 (en) Speech recognition method, electronic device and speech recognition system
US20130332171A1 (en) Bandwidth Extension via Constrained Synthesis
CN104282301A (en) Voice command processing method and system
US20190303443A1 (en) Speech translation apparatus, speech translation method, and recording medium storing the speech translation method
JP5606951B2 (en) Speech recognition system and search system using the same
US11367457B2 (en) Method for detecting ambient noise to change the playing voice frequency and sound playing device thereof
JP6109451B2 (en) Speech recognition apparatus and speech recognition method
US7181397B2 (en) Speech dialog method and system
KR20140028336A (en) Voice conversion apparatus and method for converting voice thereof
JP6559417B2 (en) Information processing apparatus, information processing method, dialogue system, and control program
US10964307B2 (en) Method for adjusting voice frequency and sound playing device thereof
JP2019113636A (en) Voice recognition system
WO2016088241A1 (en) Speech processing system and speech processing method
KR101945190B1 (en) Voice recognition operating system and method
JP2011180416A (en) Voice synthesis device, voice synthesis method and car navigation system
JP2009210868A (en) Speech processing device, speech processing method and the like
JP2019211966A (en) Control device, dialogue device, control method, and program
JP2019035894A (en) Voice processing device and voice processing method
US20230377594A1 (en) Mobile terminal capable of processing voice and operation method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 14907512
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 14907512
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: JP