WO2016088241A1 - Speech processing system and speech processing method

Speech processing system and speech processing method

Info

Publication number
WO2016088241A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
speech
unit
phonemes
created
Prior art date
Application number
PCT/JP2014/082198
Other languages
French (fr)
Japanese (ja)
Inventor
亮 岩宮
Original Assignee
三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority to PCT/JP2014/082198
Publication of WO2016088241A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00 — Speech recognition
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • The present invention relates to a speech processing system and a speech processing method.
  • In-vehicle devices that use content information (text data carried in radio broadcast waves or distributed by mobile communication devices) for speech recognition and speech synthesis have been proposed. Such a device can, for example, switch channels by performing speech recognition on a broadcast station name contained in a satellite radio broadcast wave, or read the recognized station name aloud.
  • Phonemes, which correspond to the phonetic symbols of a language, are used for the speech recognition and speech synthesis described above. For example, a speech recognition dictionary is generated from phonemes, or speech is synthesized from them.
  • Such phonemes fall into two categories: phonemes created in advance, offline and outside the vehicle-mounted device, and delivered within the content information together with the text data (hereinafter "offline-created phonemes"); and phonemes generated on the vehicle-mounted device (online) from the text data contained in the delivered content information (hereinafter "online-generated phonemes"). Offline-created phonemes are disclosed in, for example, Patent Document 1, and online-generated phonemes in, for example, Patent Document 2.
  • The phoneme formats and inventories supported by speech recognition and speech synthesis engines differ from engine to engine. For example, even for engines of the same language, phoneme formats may differ between manufacturers. Furthermore, even among engines of the same manufacturer, some phonemes are supported by both the English and French engines, while others are supported only by the English engine or only by the French engine.
  • An advantage of offline-created phonemes is that a person can tune and create them based on knowledge of the correct reading of the text data, so they can be more accurate than online-generated phonemes, which can only be produced mechanically. A disadvantage is that they cannot be used unless they were stored in the content information in advance, so they are less widely available than online-generated phonemes.
  • The present invention was made in view of the above problems, and its object is to provide a technology capable of suppressing the respective disadvantages of offline-created phonemes and online-generated phonemes.
  • A speech processing system according to the invention includes: an information acquisition unit that acquires, from an external source, content information containing text data and offline-created phonemes corresponding to the reading of the text data; an extraction unit that extracts the text data and the offline-created phonemes from the acquired content information; and a phoneme generation unit that generates online-generated phonemes based on the extracted text data. The system further includes a phoneme selection unit that determines whether to use the extracted offline-created phonemes; if so, it selects them, and otherwise it causes the phoneme generation unit to generate online-generated phonemes and selects those.
  • A speech processing method according to the invention acquires, from an external source, content information containing text data and offline-created phonemes corresponding to the reading of the text data; extracts the text data and the offline-created phonemes from the acquired content information; and determines whether to use the extracted offline-created phonemes. If they are to be used, the offline-created phonemes are selected; otherwise, online-generated phonemes are generated based on the extracted text data and selected.
  • FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 1.
  • FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 1.
  • FIG. 3 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 2.
  • FIG. 4 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 2.
  • FIG. 5 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 3.
  • FIG. 6 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 3.
  • FIG. 7 is a block diagram showing the configuration of the speech processing apparatus according to a modification.
  • FIG. 1 is a block diagram showing the configuration of a speech processing apparatus 1 according to Embodiment 1 of the present invention. The speech processing apparatus 1 of FIG. 1 includes a content information acquisition unit (information acquisition unit) 11, a text and phoneme extraction unit (extraction unit) 12, a phoneme generation unit 13, and a used phoneme selection unit (phoneme selection unit) 14.
  • The content information acquisition unit 11 is, for example, a communication device capable of communicating with the outside 81 of the speech processing apparatus 1 (such as a satellite radio broadcast station), or an input device connectable to such a communication device.
  • The text and phoneme extraction unit 12, the phoneme generation unit 13, and the used phoneme selection unit 14 are realized as functions of a CPU (Central Processing Unit, not shown) of the speech processing apparatus 1, which executes programs stored in a storage device (not shown) such as an HDD (Hard Disk Drive) or semiconductor memory.
  • The content information acquisition unit 11 acquires, from the outside 81, content information 82 including text data 82a and an offline-created phoneme 82b corresponding to the reading of the text data 82a.
  • The offline-created phoneme 82b is created, for example, by a person tuning it based on knowledge of the correct reading of the text data 82a.
  • The text and phoneme extraction unit 12 extracts the text data 82a and the offline-created phoneme 82b from the content information 82 acquired by the content information acquisition unit 11.
  • The phoneme generation unit 13 generates an online-generated phoneme based on the text data 82a extracted by the text and phoneme extraction unit 12.
  • The online-generated phoneme is produced mechanically from the text data 82a. Note that, as described next, whether the phoneme generation unit 13 generates the online-generated phoneme depends on the determination made by the used phoneme selection unit 14.
  • The used phoneme selection unit 14 determines whether to use the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12. If it determines to use it, it does not cause the phoneme generation unit 13 to generate an online-generated phoneme, and it selects the offline-created phoneme 82b.
  • Otherwise, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate an online-generated phoneme and selects that online-generated phoneme.
  • FIG. 2 is a flowchart showing the operation of the speech processing apparatus 1 according to the first embodiment.
  • In step S1, the content information acquisition unit 11 acquires the content information 82 from the outside 81.
  • In step S2, the text and phoneme extraction unit 12 extracts the text data 82a and the offline-created phoneme 82b from the content information 82 acquired by the content information acquisition unit 11.
  • In step S3, the used phoneme selection unit 14 determines whether to use the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12. It proceeds to step S4 when it determines that the offline-created phoneme 82b is to be used, and to step S5 otherwise.
  • In step S4, the used phoneme selection unit 14 selects the offline-created phoneme 82b. Thereafter, the operation of FIG. 2 ends.
  • In step S5, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate an online-generated phoneme based on the extracted text data 82a.
  • In step S6, the used phoneme selection unit 14 selects the generated online-generated phoneme. Thereafter, the operation of FIG. 2 ends.
  • <Summary of Embodiment 1> According to the speech processing apparatus 1 of Embodiment 1 described above, the offline-created phoneme 82b is selected when it is determined to be usable, and an online-generated phoneme is selected otherwise. Because the phoneme to be used can be selected dynamically in this way, the disadvantages of both offline-created and online-generated phonemes can be suppressed. As a result, both the accuracy of the phonemes used and the likelihood that usable phonemes are available can be increased.
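The selection flow of FIG. 2 (steps S1 to S6) can be sketched in a few lines of Python. This is an illustrative sketch only: the function and parameter names (`select_phoneme`, `use_offline`, `generate_online_phoneme`) are assumptions and do not appear in the patent.

```python
def select_phoneme(content_info, use_offline, generate_online_phoneme):
    """Sketch of FIG. 2: dynamically choose between the offline-created
    phoneme delivered inside the content information and an
    online-generated phoneme produced from the text data.

    content_info            -- mapping with 'text' and 'offline_phoneme' (S1/S2)
    use_offline             -- predicate standing in for the decision of S3
    generate_online_phoneme -- mechanical grapheme-to-phoneme fallback (S5)
    """
    # S2: extract the text data and the offline-created phoneme.
    text = content_info["text"]
    offline = content_info.get("offline_phoneme")

    # S3/S4: use the offline-created phoneme when the selector approves it.
    if offline is not None and use_offline(offline):
        return offline

    # S5/S6: otherwise generate and select an online-generated phoneme.
    return generate_online_phoneme(text)
```

For instance, calling `select_phoneme` with `"offline_phoneme": None` always takes the online path, mirroring the case where no offline-created phoneme was delivered in the content information.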
  • FIG. 3 is a block diagram showing the configuration of the speech processing apparatus 1 according to Embodiment 2 of the present invention.
  • Components that are the same as or similar to those described above are denoted by the same reference numerals, and the description below focuses on the differences.
  • In Embodiment 2, the speech processing apparatus 1 functions as a speech recognition apparatus and, in addition to the configuration of FIG. 1, includes a speech input unit 21, a speech recognition dictionary generation unit (dictionary generation unit) 22, a speech recognition dictionary storage unit 23, and a speech recognition unit 24.
  • The speech input unit 21 is a speech input device such as a microphone.
  • The speech recognition dictionary storage unit 23 is a storage device such as an HDD or semiconductor memory.
  • The speech recognition dictionary generation unit 22 and the speech recognition unit 24 are realized, for example, as functions of a CPU (not shown) of the speech processing apparatus 1.
  • The speech input unit 21 receives speech from the outside (for example, from a user).
  • The speech recognition dictionary generation unit 22 generates a speech recognition dictionary based on the phonemes selected by the used phoneme selection unit 14.
  • The speech recognition dictionary generated by the speech recognition dictionary generation unit 22 is stored in the speech recognition dictionary storage unit 23.
  • The used phoneme selection unit 14 determines to use the offline-created phoneme 82b when the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12 contains only predetermined phonemes that may be used for generating the speech recognition dictionary. Conversely, it determines not to use the offline-created phoneme 82b when the extracted phoneme contains any phoneme other than those predetermined phonemes.
  • The speech recognition unit 24 performs speech recognition of the speech to be recognized (the speech received by the speech input unit 21) using the speech recognition dictionary generated by the speech recognition dictionary generation unit 22 and stored in the speech recognition dictionary storage unit 23.
  • FIG. 4 is a flowchart showing the operation of the speech processing apparatus 1 according to the second embodiment.
  • In steps S11 and S12, operations similar to steps S1 and S2 of FIG. 2 are performed.
  • In step S13, in order to generate the speech recognition dictionary, the used phoneme selection unit 14 determines whether the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12 contains only the predetermined phonemes that may be used for generating the dictionary.
  • In step S14, if the used phoneme selection unit 14 has determined that the extracted offline-created phoneme 82b contains only the predetermined phonemes, it decides to use the offline-created phoneme 82b and proceeds to step S15; otherwise, it decides not to use it and proceeds to step S16.
  • In step S15, the used phoneme selection unit 14 selects the offline-created phoneme 82b, and the speech recognition dictionary generation unit 22 generates the speech recognition dictionary based on the offline-created phoneme 82b (the phoneme selected by the used phoneme selection unit 14). Thereafter, the process proceeds to step S19.
  • In step S16, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate an online-generated phoneme based on the extracted text data 82a.
  • In step S17, the used phoneme selection unit 14 selects the generated online-generated phoneme, and the speech recognition dictionary generation unit 22 generates the speech recognition dictionary based on the online-generated phoneme (the phoneme selected by the used phoneme selection unit 14). Thereafter, the process proceeds to step S19.
  • In step S18, which runs in parallel with steps S11 to S17, the speech input unit 21 receives speech from the outside. Thereafter, the process proceeds to step S19.
  • In step S19, the speech recognition unit 24 performs speech recognition of the speech to be recognized (the speech received by the speech input unit 21) using the speech recognition dictionary generated by the speech recognition dictionary generation unit 22. Thereafter, the operation of FIG. 4 ends.
  • <Summary of Embodiment 2> According to the speech processing apparatus 1 of Embodiment 2 described above, the disadvantages of offline-created phonemes and of online-generated phonemes can be suppressed in generating the speech recognition dictionary, as in Embodiment 1. As a result, both the accuracy of the phonemes used and the likelihood that usable phonemes are available can be increased when generating the speech recognition dictionary. A speech recognition dictionary can therefore be generated appropriately even in a region, such as Europe, where multiple languages with different phonemes are used.
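Embodiment 2's acceptance test — use the offline-created phoneme only when every phoneme in it belongs to the inventory the recognition engine supports — might look like the following sketch. The phoneme inventory, the entry format, and all helper names here are illustrative assumptions, not part of the patent.

```python
def offline_phoneme_usable(offline_phonemes, supported):
    """S13/S14: accept the offline-created phoneme only if it contains
    no phoneme outside the engine's supported inventory."""
    return all(p in supported for p in offline_phonemes)


def build_recognition_dictionary(entries, supported, generate_online):
    """Sketch of steps S13 to S17 applied per entry, following the remark
    that selection may be made for each text/phoneme pair."""
    dictionary = []
    for text, offline in entries:
        if offline is not None and offline_phoneme_usable(offline, supported):
            dictionary.append((text, offline))               # S15: offline-created
        else:
            dictionary.append((text, generate_online(text))) # S16/S17: online
    return dictionary
```

With a hypothetical English-engine inventory such as `{"b", "iy", "s"}`, an entry whose offline phonemes include a phoneme supported only by a French engine would fall back to the online generator, while fully supported entries keep their hand-tuned phonemes.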
  • The used phoneme selection unit 14 may perform the above selection for each external source 81. For example, the offline-created phonemes 82b may be selected (used) for some external sources 81, while online-generated phonemes are selected (used) for the remaining external sources 81. The selection may also be performed for each pair of text data 82a and offline-created phoneme 82b.
  • FIG. 5 is a block diagram showing the configuration of the speech processing apparatus 1 according to Embodiment 3 of the present invention.
  • Components that are the same as or similar to those described above are denoted by the same reference numerals, and the description below focuses on the differences.
  • In Embodiment 3, the speech processing apparatus 1 functions as a speech synthesizer and includes a speech synthesis unit 31 and a speech output unit 32 in addition to the configuration of FIG. 1.
  • The speech synthesis unit 31 is realized, for example, as a function of a CPU (not shown) of the speech processing apparatus 1.
  • The speech output unit 32 is, for example, a speech output device such as a speaker.
  • The speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 using the phonemes selected by the used phoneme selection unit 14.
  • The used phoneme selection unit 14 determines to use the offline-created phoneme 82b when the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12 contains only predetermined phonemes that may be used for synthesizing the speech to be output from the speech output unit 32. Conversely, it determines not to use the offline-created phoneme 82b when the extracted phoneme contains any phoneme other than those predetermined phonemes.
  • The speech output unit 32 outputs the speech synthesized by the speech synthesis unit 31 to the outside.
  • FIG. 6 is a flowchart showing the operation of the speech processing apparatus 1 according to the third embodiment.
  • In steps S21 and S22, operations similar to steps S1 and S2 of FIG. 2 are performed.
  • In step S23, in order to synthesize the speech to be output from the speech output unit 32, the used phoneme selection unit 14 determines whether the offline-created phoneme 82b extracted by the text and phoneme extraction unit 12 contains only the predetermined phonemes that may be used for the synthesis.
  • In step S24, if the used phoneme selection unit 14 has determined that the extracted offline-created phoneme 82b contains only the predetermined phonemes, it decides to use the offline-created phoneme 82b and proceeds to step S25; otherwise, it decides not to use it and proceeds to step S26.
  • In step S25, the used phoneme selection unit 14 selects the offline-created phoneme 82b, and the speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 based on the offline-created phoneme 82b (the phoneme selected by the used phoneme selection unit 14). Thereafter, the process proceeds to step S28.
  • In step S26, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate an online-generated phoneme based on the extracted text data 82a.
  • In step S27, the used phoneme selection unit 14 selects the generated online-generated phoneme, and the speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 based on the online-generated phoneme (the phoneme selected by the used phoneme selection unit 14). Thereafter, the process proceeds to step S28.
  • In step S28, the speech output unit 32 outputs the speech synthesized by the speech synthesis unit 31 to the outside. Thereafter, the operation of FIG. 6 ends.
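Embodiment 3 applies the same gate before synthesis rather than dictionary generation. Assuming a `synthesize` callback standing in for the speech synthesis unit 31 (all parameter names here are illustrative, not from the patent), the flow of FIG. 6 reduces to roughly:

```python
def speak(text, offline_phonemes, supported, generate_online, synthesize):
    """Sketch of FIG. 6 (S23 to S28): synthesize from the offline-created
    phoneme when all of its phonemes are supported by the synthesis
    engine, otherwise from an online-generated phoneme."""
    if offline_phonemes is not None and all(p in supported for p in offline_phonemes):
        chosen = offline_phonemes          # S25: offline-created phoneme
    else:
        chosen = generate_online(text)     # S26/S27: online-generated phoneme
    return synthesize(chosen)              # S28: output the synthesized speech
```

The design mirrors Embodiment 2: only the consumer of the selected phoneme changes (synthesis engine instead of dictionary generator), which is why the modification of FIG. 7 can share one selection unit between both paths.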
  • As a modification, the speech processing apparatus 1 may combine the configuration of Embodiment 2 with the configuration of Embodiment 3.
  • FIG. 7 is a block diagram showing the configuration of the speech processing apparatus 1 according to this modification.
  • In generating the speech recognition dictionary, the used phoneme selection unit 14 determines to use the offline-created phoneme 82b when the extracted offline-created phoneme 82b contains only the predetermined phonemes that may be used for generating the dictionary, and determines not to use it when the extracted phoneme contains any other phoneme. Likewise, in synthesizing the speech to be output from the speech output unit 32, the used phoneme selection unit 14 determines to use the offline-created phoneme 82b when the extracted offline-created phoneme 82b contains only the predetermined phonemes that may be used for the synthesis, and determines not to use it otherwise.
  • The speech processing apparatus 1 described above is applicable not only to an installed navigation device that can be mounted on a vehicle, but also to a portable navigation device, a communication terminal (for example, a portable terminal such as a mobile phone, smartphone, or tablet), the functions of applications installed on these devices, and servers. The present invention can also be applied to a speech processing system constructed by appropriately combining these as a system. In that case, the functions and components of the speech processing apparatus 1 described above may be distributed among the devices that constitute the system, or concentrated in any one of the devices.
  • 1 speech processing apparatus, 11 content information acquisition unit, 12 text and phoneme extraction unit, 13 phoneme generation unit, 14 used phoneme selection unit, 22 speech recognition dictionary generation unit, 24 speech recognition unit, 31 speech synthesis unit, 81 outside (external source), 82 content information, 82a text data, 82b offline-created phoneme.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The purpose of the present invention is to provide a technique for minimizing each disadvantage of an offline prepared phoneme and an online generated phoneme. A speech processing device 1 includes a text and phoneme extraction unit 12 for extracting text data 82a and an offline prepared phoneme 82b from content information 82, a phoneme generation unit 13, and a working phoneme selection unit 14. The working phoneme selection unit 14 determines whether or not to use the extracted offline prepared phoneme 82b, and selects either the offline prepared phoneme 82b if it is determined that the offline prepared phoneme 82b is to be used or an online generated phoneme if it is determined that the offline prepared phoneme 82b is not to be used, said online generated phoneme being generated by the phoneme generation unit 13.

Description

Speech processing system and speech processing method

The present invention relates to a speech processing system and a speech processing method.
In recent years, content information including text data has been carried in radio broadcast waves or distributed by mobile communication devices. In-vehicle devices that use such content information for speech recognition and speech synthesis have therefore been proposed. Such a device can, for example, switch channels by performing speech recognition on a broadcast station name contained in a satellite radio broadcast wave, or read the recognized station name aloud.

Phonemes, which correspond to the phonetic symbols of a language, are used for the speech recognition and speech synthesis described above. For example, a speech recognition dictionary is generated from phonemes, or speech is synthesized from them. Such phonemes fall into two categories: phonemes created in advance, offline and outside the vehicle-mounted device, and delivered within the content information together with the text data (hereinafter "offline-created phonemes"); and phonemes generated on the vehicle-mounted device (online) from the text data contained in the delivered content information (hereinafter "online-generated phonemes"). Offline-created phonemes are disclosed in, for example, Patent Document 1, and online-generated phonemes in, for example, Patent Document 2.
Here, the phoneme formats and inventories supported by speech recognition and speech synthesis engines differ from engine to engine. For example, even for engines of the same language, phoneme formats may differ between manufacturers. Furthermore, even among engines of the same manufacturer, some phonemes are supported by both the English and French engines, while others are supported only by the English engine or only by the French engine.

An advantage of offline-created phonemes is that a person can tune and create them based on knowledge of the correct reading of the text data, so phonemes more accurate than online-generated phonemes, which can only be produced mechanically, can be used. A disadvantage, however, is that they cannot be used unless they were stored in the content information in advance, so they are less widely available than online-generated phonemes.

Conversely, a disadvantage of online-generated phonemes is that the phonemes used are less accurate than offline-created phonemes. An advantage, however, is that a phoneme can always be generated for any text data the engine supports, so online-generated phonemes are more widely available than offline-created phonemes.
Patent Document 1: International Publication No. 2007/069512. Patent Document 2: JP 2011-033874 A.
A conventional device fixedly uses only one of offline-created phonemes and online-generated phonemes. Whichever type is used, the disadvantage of that type cannot be avoided.

The present invention was made in view of the above problem, and its object is to provide a technology capable of suppressing the respective disadvantages of offline-created phonemes and online-generated phonemes.
A speech processing system according to the present invention includes: an information acquisition unit that acquires, from an external source, content information containing text data and offline-created phonemes corresponding to the reading of the text data; an extraction unit that extracts the text data and the offline-created phonemes from the acquired content information; and a phoneme generation unit that generates online-generated phonemes based on the extracted text data. The system further includes a phoneme selection unit that determines whether to use the extracted offline-created phonemes, selects them if so, and otherwise causes the phoneme generation unit to generate online-generated phonemes and selects those.

A speech processing method according to the present invention acquires, from an external source, content information containing text data and offline-created phonemes corresponding to the reading of the text data; extracts the text data and the offline-created phonemes from the acquired content information; and determines whether to use the extracted offline-created phonemes. If they are to be used, the offline-created phonemes are selected; otherwise, online-generated phonemes are generated based on the extracted text data and selected.
According to the present invention, the respective disadvantages of offline-created phonemes and online-generated phonemes can be suppressed.

The objects, features, aspects, and advantages of the present invention will become more apparent from the following detailed description and the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 1. FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 1. FIG. 3 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 2. FIG. 4 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 2. FIG. 5 is a block diagram showing the configuration of the speech processing apparatus according to Embodiment 3. FIG. 6 is a flowchart showing the operation of the speech processing apparatus according to Embodiment 3. FIG. 7 is a block diagram showing the configuration of the speech processing apparatus according to a modification.
<Embodiment 1>
In the following description, a case where the speech processing system according to the present invention is applied to a stand-alone speech processing apparatus will be described as an example.
 FIG. 1 is a block diagram showing the configuration of a speech processing apparatus 1 according to Embodiment 1 of the present invention. The speech processing apparatus 1 of FIG. 1 includes a content information acquisition unit (information acquisition unit) 11, a text and phoneme extraction unit (extraction unit) 12, a phoneme generation unit 13, and a used phoneme selection unit (phoneme selection unit) 14.
 First, an example of the hardware applied to the components of the speech processing apparatus 1 according to Embodiment 1 will be described. The content information acquisition unit 11 is composed of, for example, a communication device capable of communicating with an external source 81 (such as a satellite radio broadcast station) outside the speech processing apparatus 1, or an input device connectable to such a communication device. The text and phoneme extraction unit 12, the phoneme generation unit 13, and the used phoneme selection unit 14 are realized as functions of a CPU (Central Processing Unit, not shown) of the speech processing apparatus 1 by the CPU executing a program stored in a storage device, not shown, such as an HDD (Hard Disk Drive) or a semiconductor memory.
 Next, each component of the speech processing apparatus 1 according to Embodiment 1 will be described in detail.
 The content information acquisition unit 11 acquires, from the external source 81, content information 82 including text data 82a and offline-created phonemes 82b corresponding to the reading of the text data 82a. The offline-created phonemes 82b are created, for example, by a person tuning them based on knowledge of the correct reading of the text data 82a.
 The text and phoneme extraction unit 12 extracts the text data 82a and the offline-created phonemes 82b from the content information 82 acquired by the content information acquisition unit 11.
 The phoneme generation unit 13 generates online-generated phonemes based on the text data 82a extracted by the text and phoneme extraction unit 12. The online-generated phonemes are generated mechanically, for example, from the text data 82a. As described next, whether the phoneme generation unit 13 generates online-generated phonemes is determined based on the determination result of the used phoneme selection unit 14.
 The used phoneme selection unit 14 determines whether to use the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12. When it determines to use them, it selects the offline-created phonemes 82b without causing the phoneme generation unit 13 to generate online-generated phonemes. On the other hand, when it determines not to use the extracted offline-created phonemes 82b, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate online-generated phonemes and selects those online-generated phonemes.
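 The selection behavior described above can be sketched as follows. This is a minimal illustrative sketch, not the apparatus itself: the function names, the trivial per-character phoneme mapping, and the caller-supplied predicate are all assumptions introduced for illustration.

```python
def generate_online_phonemes(text):
    # Stand-in for the phoneme generation unit 13: mechanically derive
    # phonemes from the text data (here, a trivial per-character mapping).
    return ["ph_" + ch for ch in text if not ch.isspace()]

def select_phonemes(text_data, offline_phonemes, should_use_offline):
    """Stand-in for the used phoneme selection unit 14.

    `should_use_offline` is a caller-supplied predicate; the concrete
    criterion depends on the embodiment (see Embodiments 2 and 3 below).
    """
    if offline_phonemes is not None and should_use_offline(offline_phonemes):
        # Steps S3 -> S4: use the offline-created phonemes as-is,
        # without invoking the phoneme generation unit.
        return offline_phonemes
    # Steps S3 -> S5 -> S6: fall back to online-generated phonemes.
    return generate_online_phonemes(text_data)
```

 For example, `select_phonemes("AB", ["a", "b"], lambda p: True)` returns the offline-created phonemes unchanged, while a predicate returning `False` triggers online generation from the text data.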
 <Operation>
 FIG. 2 is a flowchart showing the operation of the speech processing apparatus 1 according to Embodiment 1.
 First, in step S1, the content information acquisition unit 11 acquires the content information 82 from the external source 81.
 In step S2, the text and phoneme extraction unit 12 extracts the text data 82a and the offline-created phonemes 82b from the content information 82 acquired by the content information acquisition unit 11.
 In step S3, the used phoneme selection unit 14 determines whether to use the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12. When it determines to use the offline-created phonemes 82b, the process proceeds to step S4; otherwise, it proceeds to step S5.
 In step S4, the used phoneme selection unit 14 selects the offline-created phonemes 82b. The operation of FIG. 2 then ends.
 In step S5, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate online-generated phonemes based on the extracted text data 82a.
 Then, in step S6, the used phoneme selection unit 14 selects the generated online-generated phonemes. The operation of FIG. 2 then ends.
 <Summary of Embodiment 1>
 According to the speech processing apparatus 1 of Embodiment 1 described above, the offline-created phonemes 82b are selected when it is determined that they are to be used, and online-generated phonemes are selected when it is determined that the offline-created phonemes 82b are not to be used. Since the phonemes to be used can thus be selected dynamically, the disadvantages of offline-created phonemes and the disadvantages of online-generated phonemes can both be suppressed. As a result, the accuracy of the phonemes used, and the likelihood that usable phonemes are available, can both be increased.
 <Embodiment 2>
 FIG. 3 is a block diagram showing the configuration of the speech processing apparatus 1 according to Embodiment 2 of the present invention. In the speech processing apparatus 1 according to Embodiment 2, components identical or similar to those described above are given the same reference signs, and the differences are mainly described.
 The speech processing apparatus 1 according to Embodiment 2 has the functions of a speech recognition apparatus and includes, in addition to the configuration of FIG. 1, a speech input unit 21, a speech recognition dictionary generation unit (dictionary generation unit) 22, a speech recognition dictionary storage unit 23, and a speech recognition unit 24.
 First, an example of the hardware applied to the components added to the speech processing apparatus 1 according to Embodiment 2 will be described. The speech input unit 21 is composed of, for example, a speech input device such as a microphone, and the speech recognition dictionary storage unit 23 is composed of, for example, a storage device such as an HDD or a semiconductor memory. The speech recognition dictionary generation unit 22 and the speech recognition unit 24 are realized, for example, as functions of the CPU (not shown) of the speech processing apparatus 1.
 Next, each component of the speech processing apparatus 1 according to Embodiment 2 will be described in detail.
 The speech input unit 21 receives speech from outside (for example, from a user).
 The speech recognition dictionary generation unit 22 generates a speech recognition dictionary based on the phonemes selected by the used phoneme selection unit 14. The speech recognition dictionary generated by the speech recognition dictionary generation unit 22 is stored in the speech recognition dictionary storage unit 23.
 Here, the used phoneme selection unit 14 according to Embodiment 2 determines to use the offline-created phonemes 82b when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only predetermined phonemes to be used for generating the speech recognition dictionary. On the other hand, when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 also contain phonemes other than the predetermined phonemes, the used phoneme selection unit 14 determines not to use the offline-created phonemes 82b.
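 A minimal sketch of this "contains only predetermined phonemes" criterion, assuming the predetermined phonemes are available as a set; the concrete inventory below is a made-up example, since the actual set would be fixed by the recognition engine, not by this sketch:

```python
# Hypothetical inventory of phonemes the dictionary engine supports; the
# actual predetermined set would come from the speech recognition engine.
SUPPORTED_PHONEMES = {"a", "i", "u", "e", "o", "k", "s", "t", "n"}

def use_offline_phonemes(offline_phonemes, supported=SUPPORTED_PHONEMES):
    # Use the offline-created phonemes only if every one of them is in the
    # predetermined set; a single unsupported phoneme forces online generation.
    return set(offline_phonemes).issubset(supported)
```

 For example, `use_offline_phonemes(["k", "a"])` evaluates to `True`, whereas an input containing a phoneme outside the set, such as `["k", "x"]`, evaluates to `False` and would route the apparatus to online generation.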
 The speech recognition unit 24 performs speech recognition of the speech to be recognized (the speech received by the speech input unit 21) using the speech recognition dictionary generated by the speech recognition dictionary generation unit 22 (that is, the speech recognition dictionary stored in the speech recognition dictionary storage unit 23).
 <Operation>
 FIG. 4 is a flowchart showing the operation of the speech processing apparatus 1 according to Embodiment 2.
 First, in steps S11 and S12, the same operations as in steps S1 and S2 of FIG. 2 are performed.
 In step S13, the used phoneme selection unit 14 makes a trial attempt to generate the speech recognition dictionary. Through this trial, the used phoneme selection unit 14 determines whether the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only the predetermined phonemes to be used for generating the speech recognition dictionary.
 In step S14, when the used phoneme selection unit 14 determines that the extracted offline-created phonemes 82b contain only the predetermined phonemes, it determines to use the offline-created phonemes 82b and the process proceeds to step S15; otherwise, it determines not to use the offline-created phonemes 82b and the process proceeds to step S16.
 In step S15, the used phoneme selection unit 14 selects the offline-created phonemes 82b, and the speech recognition dictionary generation unit 22 generates the speech recognition dictionary based on the offline-created phonemes 82b (the phonemes selected by the used phoneme selection unit 14). The process then proceeds to step S19.
 In step S16, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate online-generated phonemes based on the extracted text data 82a.
 Then, in step S17, the used phoneme selection unit 14 selects the generated online-generated phonemes, and the speech recognition dictionary generation unit 22 generates the speech recognition dictionary based on those online-generated phonemes (the phonemes selected by the used phoneme selection unit 14). The process then proceeds to step S19.
 In step S18, which runs in parallel with steps S11 to S17, the speech input unit 21 receives speech from outside. The process then proceeds to step S19.
 In step S19, the speech recognition unit 24 performs speech recognition of the speech to be recognized (the speech received by the speech input unit 21) using the speech recognition dictionary generated by the speech recognition dictionary generation unit 22. The operation of FIG. 4 then ends.
 <Summary of Embodiment 2>
 According to the speech processing apparatus 1 of Embodiment 2 described above, the disadvantages of offline-created phonemes and the disadvantages of online-generated phonemes are suppressed in the generation of the speech recognition dictionary, as in Embodiment 1. As a result, the accuracy of the phonemes used, and the likelihood that usable phonemes are available, can be increased when generating the speech recognition dictionary. A speech recognition dictionary can therefore be generated appropriately even in regions, such as Europe, where multiple languages with different phoneme inventories are used.
 In the configuration described above, when the content information acquisition unit 11 acquires content information 82 from each of a plurality of external sources 81, the used phoneme selection unit 14 may perform the above selection for each external source 81. As a result, the offline-created phonemes 82b may be selected (used) for some external sources 81 while online-generated phonemes are selected (used) for the remaining external sources 81. Furthermore, when a single item of content information 82 contains multiple pieces of text data 82a and multiple sets of offline-created phonemes 82b, the above selection may be performed for each pair of text data 82a and offline-created phonemes 82b.
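 This per-source behavior can be sketched as follows; the source identifiers, the predicate, and the generator passed in are hypothetical stand-ins, since the patent does not prescribe a particular data layout:

```python
def select_per_source(contents, should_use_offline, generate_online):
    """For each external source, independently pick offline or online phonemes.

    `contents` maps a source identifier to (text_data, offline_phonemes).
    """
    selected = {}
    for source_id, (text_data, offline_phonemes) in contents.items():
        if should_use_offline(offline_phonemes):
            # Keep the human-tuned offline phonemes for this source.
            selected[source_id] = offline_phonemes
        else:
            # Fall back to mechanical online generation for this source only.
            selected[source_id] = generate_online(text_data)
    return selected
```

 With this shape, one broadcast station's offline phonemes can be accepted while another's are rejected in the same pass, matching the mixed selection described above.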
 <Embodiment 3>
 FIG. 5 is a block diagram showing the configuration of the speech processing apparatus 1 according to Embodiment 3 of the present invention. In the speech processing apparatus 1 according to Embodiment 3, components identical or similar to those described above are given the same reference signs, and the differences are mainly described.
 The speech processing apparatus 1 according to Embodiment 3 has the functions of a speech synthesis apparatus and includes, in addition to the configuration of FIG. 1, a speech synthesis unit 31 and a speech output unit 32.
 First, an example of the hardware applied to the components added to the speech processing apparatus 1 according to Embodiment 3 will be described. The speech synthesis unit 31 is realized, for example, as a function of the CPU (not shown) of the speech processing apparatus 1. The speech output unit 32 is composed of, for example, a speech output device such as a speaker.
 Next, each component of the speech processing apparatus 1 according to Embodiment 3 will be described in detail.
 The speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 using the phonemes selected by the used phoneme selection unit 14.
 Here, the used phoneme selection unit 14 according to Embodiment 3 determines to use the offline-created phonemes 82b when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only predetermined phonemes to be used for synthesizing the speech output from the speech output unit 32. On the other hand, when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 also contain phonemes other than the predetermined phonemes, the used phoneme selection unit 14 determines not to use the offline-created phonemes 82b.
 The speech output unit 32 outputs the speech synthesized by the speech synthesis unit 31 to the outside.
 <Operation>
 FIG. 6 is a flowchart showing the operation of the speech processing apparatus 1 according to Embodiment 3.
 First, in steps S21 and S22, the same operations as in steps S1 and S2 of FIG. 2 are performed.
 In step S23, the used phoneme selection unit 14 makes a trial attempt to synthesize the speech to be output from the speech output unit 32. Through this trial, the used phoneme selection unit 14 determines whether the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only the predetermined phonemes to be used for synthesizing that speech.
 In step S24, when the used phoneme selection unit 14 determines that the extracted offline-created phonemes 82b contain only the predetermined phonemes, it determines to use the offline-created phonemes 82b and the process proceeds to step S25; otherwise, it determines not to use the offline-created phonemes 82b and the process proceeds to step S26.
 In step S25, the used phoneme selection unit 14 selects the offline-created phonemes 82b, and the speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 based on the offline-created phonemes 82b (the phonemes selected by the used phoneme selection unit 14). The process then proceeds to step S28.
 In step S26, the used phoneme selection unit 14 causes the phoneme generation unit 13 to generate online-generated phonemes based on the extracted text data 82a.
 Then, in step S27, the used phoneme selection unit 14 selects the generated online-generated phonemes, and the speech synthesis unit 31 synthesizes the speech to be output from the speech output unit 32 based on those online-generated phonemes (the phonemes selected by the used phoneme selection unit 14). The process then proceeds to step S28.
 In step S28, the speech output unit 32 outputs the speech synthesized by the speech synthesis unit 31 to the outside. The operation of FIG. 6 then ends.
 <Summary of Embodiment 3>
 According to the speech processing apparatus 1 of Embodiment 3 described above, the disadvantages of offline-created phonemes and the disadvantages of online-generated phonemes are suppressed in speech synthesis, as in Embodiment 1. As a result, the accuracy of the phonemes used, and the likelihood that usable phonemes are available, can be increased in speech synthesis. The speech to be output can therefore be synthesized appropriately even in regions, such as Europe, where multiple languages with different phoneme inventories are used.
 <Modification>
 The speech processing apparatus 1 may combine the configuration of Embodiment 2 with the configuration of Embodiment 3. FIG. 7 is a block diagram showing the configuration of the speech processing apparatus 1 according to this modification.
 The used phoneme selection unit 14 according to this modification determines to use the offline-created phonemes 82b when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only predetermined phonemes to be used for generating the speech recognition dictionary. On the other hand, when the extracted offline-created phonemes 82b also contain phonemes other than the predetermined phonemes, the used phoneme selection unit 14 determines not to use the offline-created phonemes 82b.
 Likewise, the used phoneme selection unit 14 according to this modification determines to use the offline-created phonemes 82b when the offline-created phonemes 82b extracted by the text and phoneme extraction unit 12 contain only predetermined phonemes to be used for synthesizing the speech output from the speech output unit 32. On the other hand, when the extracted offline-created phonemes 82b also contain phonemes other than the predetermined phonemes, the used phoneme selection unit 14 determines not to use the offline-created phonemes 82b.
 According to the speech processing apparatus 1 of this modification described above, both the effects of Embodiment 2 and the effects of Embodiment 3 can be obtained.
 <Other Modifications>
 The speech processing apparatus 1 described above can also be applied to a speech processing system constructed by appropriately combining a navigation device installed in a vehicle, a Portable Navigation Device, a communication terminal (for example, a mobile terminal such as a mobile phone, smartphone, or tablet), the functions of applications installed on these devices, a server, and the like. In this case, the functions and components of the speech processing apparatus 1 described above may be distributed among the devices constituting the system, or may be concentrated in any one of those devices.
 Within the scope of the invention, the embodiments may be freely combined, and each embodiment may be modified or omitted as appropriate.
 Although the present invention has been described in detail, the above description is in all aspects illustrative, and the present invention is not limited thereto. It is understood that countless variations not illustrated here can be envisaged without departing from the scope of the present invention.
 1 speech processing apparatus, 11 content information acquisition unit, 12 text and phoneme extraction unit, 13 phoneme generation unit, 14 used phoneme selection unit, 22 speech recognition dictionary generation unit, 24 speech recognition unit, 31 speech synthesis unit, 81 external source, 82 content information, 82a text data, 82b offline-created phonemes.

Claims (7)

  1.  A speech processing system comprising:
      an information acquisition unit that acquires, from an external source, content information including text data and offline-created phonemes corresponding to the reading of the text data;
      an extraction unit that extracts the text data and the offline-created phonemes from the content information acquired by the information acquisition unit;
      a phoneme generation unit that generates online-generated phonemes based on the text data extracted by the extraction unit; and
      a phoneme selection unit that determines whether to use the offline-created phonemes extracted by the extraction unit, selects the offline-created phonemes when it determines to use them, and, when it determines not to use them, causes the phoneme generation unit to generate the online-generated phonemes and selects those online-generated phonemes.
  2.  The speech processing system according to claim 1, further comprising:
      a dictionary generation unit that generates a speech recognition dictionary based on the phonemes selected by the phoneme selection unit; and
      a speech recognition unit that performs speech recognition of speech to be recognized using the speech recognition dictionary generated by the dictionary generation unit.
  3.  The speech processing system according to claim 2, wherein
      the phoneme selection unit determines to use the offline-created phonemes when the offline-created phonemes extracted by the extraction unit contain only predetermined phonemes to be used for generating the speech recognition dictionary, and determines not to use the offline-created phonemes when the offline-created phonemes extracted by the extraction unit also contain phonemes other than the predetermined phonemes.
  4.  The speech processing system according to claim 1, further comprising a speech synthesis unit that synthesizes speech to be output from a speech output unit by using the phonemes selected by the phoneme selection unit.
  5.  The speech processing system according to claim 4, wherein
      the phoneme selection unit determines to use the offline-created phonemes when the offline-created phonemes extracted by the extraction unit contain only predetermined phonemes to be used for synthesizing the speech output from the speech output unit, and determines not to use the offline-created phonemes when the offline-created phonemes extracted by the extraction unit also contain phonemes other than the predetermined phonemes.
  6.  The speech processing system according to claim 1, further comprising:
      a dictionary generation unit that generates a speech recognition dictionary based on the phonemes selected by the phoneme selection unit;
      a speech recognition unit that performs speech recognition of speech to be recognized using the speech recognition dictionary generated by the dictionary generation unit; and
      a speech synthesis unit that synthesizes speech to be output from a speech output unit by using the phonemes selected by the phoneme selection unit.
  7.  A speech processing method comprising:
      acquiring, from an external source, content information including text data and offline-created phonemes corresponding to the reading of the text data;
      extracting the text data and the offline-created phonemes from the acquired content information;
      determining whether to use the extracted offline-created phonemes;
      selecting the offline-created phonemes when it is determined to use them; and
      when it is determined not to use them, generating online-generated phonemes based on the extracted text data and selecting the online-generated phonemes.
PCT/JP2014/082198 2014-12-05 2014-12-05 Speech processing system and speech processing method WO2016088241A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/082198 WO2016088241A1 (en) 2014-12-05 2014-12-05 Speech processing system and speech processing method

Publications (1)

Publication Number Publication Date
WO2016088241A1 true WO2016088241A1 (en) 2016-06-09


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005227545A (en) * 2004-02-13 2005-08-25 Matsushita Electric Ind Co Ltd Dictionary creation system, program guide system and dictionary creation method
WO2007069512A1 (en) * 2005-12-15 2007-06-21 Sharp Kabushiki Kaisha Information processing device, and program
JP2011033874A (en) * 2009-08-03 2011-02-17 Alpine Electronics Inc Device for multilingual voice recognition, multilingual voice recognition dictionary creation method
JP2012058311A (en) * 2010-09-06 2012-03-22 Alpine Electronics Inc Method and apparatus for generating dynamic voice recognition dictionary
WO2012172596A1 (en) * 2011-06-14 2012-12-20 三菱電機株式会社 Pronunciation information generating device, in-vehicle information device, and database generating method


Similar Documents

Publication Publication Date Title
US11250859B2 (en) Accessing multiple virtual personal assistants (VPA) from a single device
WO2015098306A1 (en) Response control device and control program
JP6844472B2 (en) Information processing device
US11120785B2 (en) Voice synthesis device
US20180211668A1 (en) Reduced latency speech recognition system using multiple recognizers
JP2018054790A (en) Voice interaction system and voice interaction method
US20170243588A1 (en) Speech recognition method, electronic device and speech recognition system
US20130332171A1 (en) Bandwidth Extension via Constrained Synthesis
CN104282301A (en) Voice command processing method and system
US20190303443A1 (en) Speech translation apparatus, speech translation method, and recording medium storing the speech translation method
JP5606951B2 (en) Speech recognition system and search system using the same
US11367457B2 (en) Method for detecting ambient noise to change the playing voice frequency and sound playing device thereof
JP6109451B2 (en) Speech recognition apparatus and speech recognition method
US7181397B2 (en) Speech dialog method and system
KR20140028336A (en) Voice conversion apparatus and method for converting voice thereof
JP6559417B2 (en) Information processing apparatus, information processing method, dialogue system, and control program
US10964307B2 (en) Method for adjusting voice frequency and sound playing device thereof
JP2019113636A (en) Voice recognition system
WO2016088241A1 (en) Speech processing system and speech processing method
KR101945190B1 (en) Voice recognition operating system and method
JP2011180416A (en) Voice synthesis device, voice synthesis method and car navigation system
JP2009210868A (en) Speech processing device, speech processing method and the like
JP2019211966A (en) Control device, dialogue device, control method, and program
JP2019035894A (en) Voice processing device and voice processing method
US20230377594A1 (en) Mobile terminal capable of processing voice and operation method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 14907512
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 14907512
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: JP