
JP2019219456A - Voice recognition system and voice recognition device

Info

Publication number: JP2019219456A
Application number: JP2018115243A
Authority: JP
Inventors: Atsushi Kikuta (菊田 敦); Takahiro Koshida (越田 高広)
Original assignee: Ryoyo Electro Corp
Current assignee: Ryoyo Electro Corp
Priority date: 2018-06-18
Filing date: 2018-06-18
Publication date: 2019-12-26
Anticipated expiration: 2038-06-18
Also published as: JP6462936B1; CN110914897B; CN110914897A; WO2019244385A1

Abstract

To provide a voice recognition system and a voice recognition device that can improve recognition accuracy. SOLUTION: A voice recognition device comprises: acquisition means for acquiring at least one piece of voice data; extraction means for extracting a start silent section and an end silent section included in the voice data, and extracting, as recognition target data, the arrangement of phonemes and pause sections sandwiched between the start silent section and the end silent section; detection means for referring to a character string database, selecting phoneme information corresponding to the arrangement in the recognition target data, and detecting, as candidate data, a plurality of pieces of character string information and class IDs associated with the selected phoneme information; calculation means for referring to a grammar database, generating a sentence that combines the plurality of candidate data on the basis of grammar information, and calculating a reliability for each piece of candidate data included in the sentence; selection means; and generation means. SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech recognition system and a speech recognition device.

Conventionally, technologies related to speech recognition have been proposed, such as the cognitive function evaluation device of Patent Document 1 and the utterance content grasping system of Patent Document 2.

In the cognitive function evaluation device of Patent Document 1, a formant analysis unit receives target data representing the time variation, over a target period, of the instantaneous sound pressure of a specific phoneme contained in the subject's voice. The formant analysis unit divides the target period into a plurality of frames and obtains the frequency of a specific formant for each of two or more target frames. A feature analysis unit obtains a feature amount for the specific formant frequencies obtained for each target frame. An evaluation unit evaluates the subject's cognitive function based on the feature amount.

Patent Document 2 discloses an utterance content grasping system based on extracting core words from recorded voice data, together with an indexing method and an utterance content grasping method that use the system. The system performs phoneme-based speech recognition on the recorded voice data, stores the indexed data, and uses that data to grasp the utterance content based on core words, so that the utterance content is grasped accurately, easily, and quickly.

Patent Document 1: JP 2018-50847 A
Patent Document 2: JP 2015-539364 A

Speech recognition technology is expected to be applied in a wide range of fields, but improving recognition accuracy remains a challenge. Methods that use phonemes have attracted attention as a way to improve recognition accuracy. However, because of factors such as variation in the phoneme arrangement obtained from voice data, improving recognition accuracy is still an open problem.

In this regard, Patent Document 1 aims to improve accuracy by obtaining a feature amount for a specific formant frequency based on the subject's voice and evaluating the subject's cognitive function based on that feature amount. However, the technology disclosed in Patent Document 1 cannot recognize the content of what the subject says.

Patent Document 2 discloses a technique that grasps utterance content based on core words. However, with the technology disclosed in Patent Document 2, recognition accuracy may deteriorate when the utterance contains core words with similar phonemes. Under these circumstances, a speech recognition technology that can improve recognition accuracy is desired.

The present invention has been devised in view of the problems described above, and its object is to provide a speech recognition system and a speech recognition device that can improve recognition accuracy.

A speech recognition system according to a first invention comprises: acquisition means for acquiring at least one piece of voice data; extraction means for extracting a start silent section and an end silent section included in the voice data, and extracting, as recognition target data, an arrangement of phonemes and pause sections sandwiched between the start silent section and the end silent section; a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information; detection means for referring to the character string database, selecting the phoneme information corresponding to the arrangement in the recognition target data, and detecting, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information; a grammar database storing grammar information, acquired in advance, that indicates an arrangement order of the class IDs; calculation means for referring to the grammar database, generating a sentence that combines a plurality of the candidate data based on the grammar information, and calculating a reliability for each piece of the candidate data included in the sentence; selection means for selecting evaluation data from the plurality of candidate data based on the reliability; and generation means for generating recognition information based on the evaluation data.

In a speech recognition system according to a second invention, in the first invention, the extraction means extracts a plurality of pieces of the recognition target data from one piece of the voice data, and the pieces of recognition target data each have a different arrangement of the phonemes and the pause sections.

In a speech recognition system according to a third invention, in the first or second invention, the calculation means generates a plurality of the sentences, and the sentences differ from one another in at least one of the type and the combination of the candidate data.

A speech recognition system according to a fourth invention, in any one of the first to third inventions, further comprises a reference database storing the character string information acquired in advance, reference sentences that combine the character string information, and a threshold assigned to each piece of the character string information. The generation means includes: designation means for referring to the reference database and designating, from among the reference sentences, a first reference sentence corresponding to the evaluation data; and comparison means for comparing the reliability corresponding to the evaluation data with a first threshold assigned to first character string information included in the first reference sentence. The generation means generates the recognition information based on the comparison result of the comparison means.

A speech recognition system according to a fifth invention, in the fourth invention, further comprises updating means for updating the threshold stored in the reference database based on a plurality of the candidate data and a plurality of the reliabilities.

A speech recognition system according to a sixth invention, in the fourth or fifth invention, further comprises reflection means for acquiring an evaluation result from a user who has evaluated the recognition information and reflecting the evaluation result in the threshold of the reference database.

In a speech recognition system according to a seventh invention, in any one of the first to sixth inventions, the acquisition means acquires condition information indicating the conditions under which the voice data was generated.

In a speech recognition system according to an eighth invention, in the seventh invention, the detection means selects the contents of the character string database to be referred to based on the condition information.

A speech recognition system according to a ninth invention, in any one of the first to eighth inventions, further comprises output means for outputting the recognition information, and the recognition information includes information for controlling the traveling speed of a vehicle.

In a speech recognition system according to a tenth invention, in any one of the first to ninth inventions, the pause section includes at least one of a breath sound and lip noise.

In a speech recognition system according to an eleventh invention, in any one of the first to tenth inventions, the character string information includes the languages of two or more countries.

A speech recognition device according to a twelfth invention comprises: an acquisition unit that acquires at least one piece of voice data; an extraction unit that extracts a start silent section and an end silent section included in the voice data, and extracts, as recognition target data, an arrangement of phonemes and pause sections sandwiched between the start silent section and the end silent section; a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information; a detection unit that refers to the character string database, selects the phoneme information corresponding to the arrangement in the recognition target data, and detects, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information; a grammar database storing grammar information, acquired in advance, that indicates an arrangement order of the class IDs; a calculation unit that refers to the grammar database, generates a sentence that combines a plurality of the candidate data based on the grammar information, and calculates a reliability for each piece of the candidate data included in the sentence; a selection unit that selects evaluation data from the plurality of candidate data based on the reliability; and a generation unit that generates recognition information based on the evaluation data.

According to the first to eleventh inventions, the extraction means extracts the arrangement of phonemes and pause sections as recognition target data, and the detection means selects the phoneme information corresponding to the arrangement in the recognition target data and detects candidate data. Compared with detecting candidate data for an arrangement that considers only the phonemes in the recognition target data, this reduces erroneous recognition and thereby improves recognition accuracy.

Further, according to the first to eleventh inventions, the character string database stores phoneme information corresponding to the arrangement of phonemes and pause sections, and character string information associated with the phoneme information. Compared with the data that would have to be stored to perform pattern matching against phonemes as a whole, this reduces the data volume and simplifies data accumulation.

In particular, according to the second invention, the extraction means extracts a plurality of pieces of recognition target data from one piece of voice data. A drop in recognition accuracy can therefore be suppressed even when the acquired voice data yields a varying arrangement of phonemes and pause sections. This further improves recognition accuracy.

In particular, according to the third invention, the calculation means generates a plurality of sentences. That is, even when there are multiple patterns for combining the candidate data, sentences corresponding to all of the patterns can be generated. Erroneous recognition can therefore be reduced compared with, for example, a pattern matching search method. This further improves recognition accuracy.

In particular, according to the fourth invention, the comparison means compares the reliability with the first threshold. Because the evaluation data, which is selected relative to the other candidate data, is additionally judged against a threshold, erroneous recognition can be reduced further. This further improves recognition accuracy.
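As a non-limiting sketch of the two-stage check described here, the Python below first selects the evaluation data as the candidate with the highest reliability and then compares that reliability with a per-string threshold. The dictionary layout, the function names `select_evaluation_data` and `passes_threshold`, the numeric values, and the use of `>=` as the pass condition are all assumptions made for illustration. The lookup of the first reference sentence is omitted for brevity.

```python
def select_evaluation_data(candidates):
    """Relative selection: return the candidate with the highest reliability."""
    return max(candidates, key=lambda c: c["reliability"])


def passes_threshold(evaluation, thresholds):
    """Absolute check: compare the reliability with the threshold for that string.

    The '>=' pass condition is an assumption; the text only says the two
    values are compared.
    """
    return evaluation["reliability"] >= thresholds[evaluation["text"]]


# Hypothetical candidate data and thresholds (values are made up).
candidates = [
    {"text": "weather", "reliability": 0.82},
    {"text": "battery", "reliability": 0.41},
]
thresholds = {"weather": 0.70, "battery": 0.70}

best = select_evaluation_data(candidates)
accepted = passes_threshold(best, thresholds)
```

In this sketch a candidate can win the relative selection and still be rejected, which is the extra safeguard the threshold comparison adds.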

In particular, according to the fifth invention, the updating means updates the threshold based on the candidate data and the reliability. Compared with always using a preset threshold, this allows recognition information to be generated in accordance with the quality of the acquired voice data. This widens the range of environments in which the system can be used.

In particular, according to the sixth invention, the reflection means reflects the evaluation result in the threshold. When the recognition information diverges from what the user intended, the system can therefore be corrected easily. This enables a sustained improvement in recognition accuracy.

In particular, according to the seventh invention, the acquisition means acquires condition information. That is, the acquisition means acquires, as condition information, various conditions such as the surrounding environment when the voice data is acquired, the noise contained in the voice data, and the type of sound collection device used to capture the voice. Each means and each database can therefore be configured according to the condition information. This makes it possible to improve recognition accuracy regardless of the environment in which the system is used.

In particular, according to the eighth invention, the detection means selects the contents of the character string database to be referred to based on the condition information. By storing different character string information and the like for each piece of condition information in the character string database, candidate data suited to each piece of condition information can be detected. This makes it possible to improve recognition accuracy for each piece of condition information.
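One way to picture this selection is the following Python sketch, which narrows a character string database to the entries tagged for the current condition before any matching takes place. The `conditions` field, the condition labels `"in_vehicle"` and `"office"`, the sample entries, and the function name `select_entries` are hypothetical and are not taken from the embodiment.

```python
def select_entries(string_db, condition):
    """Keep only the entries registered for the given condition label."""
    return [entry for entry in string_db if condition in entry["conditions"]]


# Hypothetical database entries tagged with the conditions they apply to.
string_db = [
    {"text": "slow down", "conditions": {"in_vehicle"}},
    {"text": "open file", "conditions": {"office"}},
    {"text": "stop", "conditions": {"in_vehicle", "office"}},
]

in_vehicle_entries = select_entries(string_db, "in_vehicle")
```

Restricting the entries up front means that strings irrelevant to the current condition cannot be detected as candidate data.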

In particular, according to the ninth invention, the output means outputs the recognition information. With the improved recognition accuracy, the system can therefore be used for purposes such as assisting the user's driving. This enables application to a wide range of uses.

In particular, according to the tenth invention, the pause section includes at least one of a breath sound and lip noise. Differences in voice data that are hard to judge from phonemes alone can therefore be judged easily, and the recognition target data can be extracted accordingly. This further improves recognition accuracy.

According to the twelfth invention, the extraction unit extracts the arrangement of phonemes and pause sections as recognition target data, and the detection unit selects the phoneme information corresponding to the arrangement in the recognition target data and detects candidate data. Compared with detecting candidate data for an arrangement that considers only the phonemes in the recognition target data, this reduces erroneous recognition and thereby improves recognition accuracy.

Further, according to the twelfth invention, the character string database stores phoneme information corresponding to the arrangement of phonemes and pause sections, and character string information associated with the phoneme information. Compared with the data that would have to be stored to perform pattern matching against phonemes as a whole, this reduces the data volume and simplifies data accumulation.

FIG. 1 is a schematic diagram illustrating an example of the configuration of the speech recognition system according to the present embodiment.
FIG. 2A is a schematic diagram illustrating an example of the configuration of the speech recognition device according to the present embodiment, FIG. 2B is a schematic diagram illustrating an example of the functions of the speech recognition device, and FIG. 2C is a schematic diagram illustrating an example of the generation unit.
FIG. 3 is a schematic diagram illustrating an example of each function of the speech recognition device according to the present embodiment.
FIG. 4 is a schematic diagram illustrating an example of the character string database, the grammar database, and the reference database.
FIG. 5A is a flowchart illustrating an example of the operation of the speech recognition system according to the present embodiment, FIG. 5B is a flowchart illustrating an example of the generation means, and FIG. 5C is a flowchart illustrating an example of the reflection means.
FIG. 6 is a schematic diagram illustrating an example of the updating means.
FIG. 7A is a flowchart illustrating an example of the updating means, and FIG. 7B is a flowchart illustrating an example of the setting means.
FIG. 8 is a schematic diagram illustrating an example of the condition information.
FIG. 9 is a schematic diagram illustrating a modification of the reference database.

Hereinafter, an example of a speech recognition system and a speech recognition device according to an embodiment of the present invention will be described with reference to the drawings.

(Configuration of Speech Recognition System 100)
An example of the configuration of the speech recognition system 100 according to the present embodiment will be described with reference to FIGS. 1 to 4. FIG. 1 is a schematic diagram illustrating the overall configuration of the speech recognition system 100 according to the present embodiment.

The speech recognition system 100 refers to a character string database and a grammar database built for the user's intended application, and generates recognition information corresponding to the user's voice. The character string database stores character strings that the user is expected to utter (character string information) and the phonemes corresponding to those character strings (phoneme information). By accumulating such character strings and phonemes, recognition information suited to the application can be generated, and the system can be deployed for a variety of uses.

In particular, the inventors found that classifying the phoneme arrangements (phoneme information) stored in the character string database in light of the pause sections contained in speech can dramatically improve the accuracy of the recognition information for that speech.

The grammar database stores the grammar information needed to generate sentences that combine character string information. The grammar information includes a plurality of pieces of information, each indicating an arrangement order of the class IDs associated with the individual pieces of character string information. By referring to the grammar database, the pieces of character string information can be combined easily once they have been detected based on the phoneme arrangements classified in light of the pause sections. Recognition information that takes the grammar of the speech into account can thus be generated. As a result, speech recognition that reflects the content of what the user or another speaker says can be achieved with high accuracy.
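As a non-limiting sketch of how the two databases might cooperate, the Python below stores character string information together with phoneme information and a class ID, stores permitted class ID orders as grammar information, and keeps only the candidate combinations whose class ID order appears in the grammar. Every entry, the romanized phoneme labels, the `"sp"` pause label, the exact-match lookup, and the function names `detect_candidates` and `build_sentences` are assumptions made for illustration. The calculation of reliability is not shown.

```python
from itertools import product

# Hypothetical character string database: character string information,
# phoneme information (phonemes plus an assumed "sp" pause label), class ID.
STRING_DB = [
    {"text": "tomorrow", "phonemes": ("a", "sh", "i", "t", "a"), "class_id": 1},
    {"text": "weather", "phonemes": ("t", "e", "N", "sp", "k", "i"), "class_id": 2},
    {"text": "battery", "phonemes": ("d", "e", "N", "sp", "ch", "i"), "class_id": 2},
]

# Hypothetical grammar database: each tuple is one permitted class ID order.
GRAMMAR_DB = [(1, 2)]


def detect_candidates(arrangement):
    """Return entries whose phoneme information equals the given arrangement.

    Exact equality is a simplifying assumption about what "corresponding" means.
    """
    return [e for e in STRING_DB if e["phonemes"] == tuple(arrangement)]


def build_sentences(candidates_per_slot):
    """Combine one candidate per slot; keep orders listed in the grammar."""
    sentences = []
    for combo in product(*candidates_per_slot):
        if tuple(e["class_id"] for e in combo) in GRAMMAR_DB:
            sentences.append([e["text"] for e in combo])
    return sentences


first = detect_candidates(["a", "sh", "i", "t", "a"])
second = [e for e in STRING_DB if e["class_id"] == 2]  # two similar-sounding candidates
sentences = build_sentences([first, second])
```

Here both similar-sounding candidates survive the grammar check, so a later step such as the reliability calculation would be needed to choose between them.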

As shown in FIG. 1, the speech recognition system 100 includes a speech recognition device 1. In the speech recognition system 100, the voice of a user or the like is collected using, for example, a sound collection device 2, and the speech recognition device 1 generates recognition information corresponding to the voice. The recognition information includes text data obtained by converting the voice into a character string, as well as, for example, information for controlling a control device 3 or the like and voice information for replying to the user.

In the speech recognition system 100, the sound collection device 2 and the control device 3 may be connected directly to the speech recognition device 1, or may be connected via, for example, a public communication network 4. A server 5 or a user terminal 6 owned by a user or the like may also be connected to the speech recognition device 1 via the public communication network 4.

<Speech recognition device 1>
FIG. 2A is a schematic diagram illustrating an example of the configuration of the speech recognition device 1. A single-board computer such as a Raspberry Pi (registered trademark) may be used as the speech recognition device 1, or an electronic device such as a personal computer (PC) may be used. The speech recognition device 1 includes a housing 10, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, a storage unit 104, and I/Fs 105 to 107. The components 101 to 107 are connected by an internal bus 110.

The CPU 101 controls the entire speech recognition device 1. The ROM 102 stores the operation code of the CPU 101. The RAM 103 is a work area used when the CPU 101 operates. The storage unit 104 stores various information such as the character string database. An SD memory card, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like may be used as the storage unit 104.

The I/F 105 is an interface for transmitting and receiving various information to and from the sound collection device 2, the control device 3, the public communication network 4, and the like. The I/F 106 is an interface for transmitting and receiving various information to and from an input part 108 that is connected depending on the application. A keyboard, for example, is used as the input part 108, and a user or the like who manages the speech recognition system 100 inputs or selects various information, control commands for the speech recognition device 1, and the like via the input part 108. The I/F 107 is an interface for transmitting and receiving various information to and from an output part 109 that is connected depending on the application. The output part 109 outputs the various information stored in the storage unit 104, the recognition information, the processing status of the speech recognition device 1, and the like. A display is used as the output part 109 and may be, for example, of a touch panel type, in which case the output part 109 may include the input part 108. The I/Fs 105 to 107 may be one and the same interface.

FIG. 2B is a schematic diagram illustrating an example of the functions of the speech recognition device 1. The speech recognition device 1 includes an acquisition unit 11, an extraction unit 12, a storage unit 13, a detection unit 14, a calculation unit 15, a selection unit 16, a generation unit 17, and an output unit 18. The speech recognition device 1 may also include, for example, a reflection unit 19. Each function shown in FIG. 2B is realized by the CPU 101 executing a program stored in the storage unit 104 or the like, using the RAM 103 as a work area. Some of the functions may be implemented using a known speech recognition engine such as Julius or a known general-purpose programming language such as Python to perform processing such as extracting and generating various data. Some of the functions may also be controlled by artificial intelligence. Here, "artificial intelligence" may be based on any well-known artificial intelligence technology.

<Acquisition unit 11>
The acquisition unit 11 acquires at least one piece of audio data. For example, the acquisition unit 11 acquires, as audio data, pulse-modulated data such as PCM (pulse code modulation) data obtained from an audio signal picked up by the sound collection device 2 or the like. Depending on the type of the sound collection device 2, the acquisition unit 11 may acquire a plurality of pieces of audio data at once.

The acquisition unit 11 may, for example, acquire a plurality of pieces of audio data simultaneously. In this case, either a plurality of sound collection devices 2 or a single sound collection device 2 capable of picking up several voices at once may be connected to the speech recognition device 1. In addition to audio data, the acquisition unit 11 acquires various information (data) from the sound collection device 2 and the like via, for example, the I/F 105 and the I/F 106.

<Extraction unit 12>
The extraction unit 12 extracts the start silent section and the end silent section contained in the audio data. The extraction unit 12 also extracts, as recognition target data, the array of phonemes and pause sections sandwiched between the start silent section and the end silent section.

The extraction unit 12 extracts, for example, a non-speech state (silent section) lasting at least 100 milliseconds and at most 1 second as a start silent section or an end silent section. It then assigns phonemes and pause sections to the section (speech section) sandwiched between the start silent section and the end silent section, and extracts the resulting array of phonemes and pause sections as recognition target data.
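The 100 millisecond to 1 second criterion above can be illustrated with a minimal Python sketch. The source does not say how a non-speech state is detected, so the amplitude test, the `amp_threshold` value, and the function name `find_silent_sections` are all assumptions made for illustration only.

```python
from typing import List, Tuple


def find_silent_sections(samples: List[int], rate: int, amp_threshold: int = 500,
                         min_s: float = 0.1, max_s: float = 1.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of quiet runs lasting min_s..max_s seconds.

    Assumption: a sample counts as "quiet" when its absolute amplitude is below
    amp_threshold. The patent text does not specify the detection method.
    """
    sections = []
    run_start = None
    # A loud sentinel sample is appended so that a trailing quiet run is closed.
    for i, s in enumerate(list(samples) + [amp_threshold]):
        quiet = abs(s) < amp_threshold
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            duration = (i - run_start) / rate
            if min_s <= duration <= max_s:
                sections.append((run_start, i))
            run_start = None
    return sections
```

Quiet runs shorter than `min_s` are skipped here; in the terms of the description they would be candidates for pause sections rather than start or end silent sections.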

The phonemes are known phonemes, including vowels and consonants. A pause section is shorter than the start and end silent sections and is, for example, about as long as a phoneme section. The extraction unit 12 may, for example, first determine the length of each phoneme or of the recognition target data as a whole, set the length of the pause section accordingly, and then extract the array to which phonemes and pause sections have been assigned as recognition target data. In other words, the extraction unit 12 may set the length of the pause section according to the phoneme length or the overall length of the recognition target data.

As shown in FIG. 3, for example, the extraction unit 12 extracts the start silent section "silB" and the end silent section "silE", and extracts the array "a/k/a/r/i/*/w/o/*/ts/u/k/e/t/e" in the speech section (where * denotes a pause section) as recognition target data. The extraction unit 12 may extract several pieces of recognition target data with mutually different arrays from a single piece of audio data. Speech recognition can then take into account the variation that arises when the extraction unit 12 assigns phonemes and pause sections. For example, by extracting between one and five pieces of recognition target data, the extraction unit 12 can raise recognition accuracy while keeping processing time down. The extraction unit 12 may also extract, as recognition target data, an array that includes at least one of the start silent section and the end silent section.
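As a small illustration of the array notation above, the following Python sketch splits a "/"-separated array into the phoneme runs delimited by pause sections. The function name `split_on_pauses` and the use of a plain string as the array representation are assumptions, not part of the described device.

```python
def split_on_pauses(array: str, pause: str = "*"):
    """Split a '/'-separated phoneme array into runs delimited by pause sections."""
    segments, current = [], []
    for symbol in array.split("/"):
        if symbol == pause:
            # A pause closes the current run of phonemes, if any.
            if current:
                segments.append(current)
            current = []
        else:
            current.append(symbol)
    if current:
        segments.append(current)
    return segments


segments = split_on_pauses("a/k/a/r/i/*/w/o/*/ts/u/k/e/t/e")
```

Each resulting run can then be matched against phoneme information, although the description also allows phoneme information that itself contains a pause section.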

A pause section may contain, for example, at least one of a breathing sound and lip noise. That is, the extraction unit 12 may extract at least one of the breathing sound and lip noise contained in a pause section as recognition target data. In this case, including at least one of breathing sounds and lip noise in the phoneme information stored in the character string database described later makes it possible to generate more accurate recognition information.

<Storage unit 13, databases>
The storage unit 13 writes various data to the storage unit 104 and reads various data from the storage unit 104. The storage unit 13 reads out the various databases stored in the storage unit 104 as needed.

As shown in FIG. 4, for example, the storage unit 104 stores a character string database and a grammar database, and may also store a reference database.

The character string database stores character string information acquired in advance, phoneme information linked to the character string information, and class IDs assigned to the character string information. The character string database is used when the detection unit 14 detects candidate data.

The phoneme information includes a plurality of phoneme arrays that a user is expected to utter (for example, the first phoneme information "a/k/a/r/i"). A phoneme array may correspond to a section delimited by pause sections, or may itself contain a pause section, as in "h/i/*/i/t/e"; it can be set freely according to the conditions of use. The phoneme information may also include at least one of the start silent section and the end silent section.

The character string information includes the character string linked to each phoneme array (for example, the first character string information "明かり", akari, meaning "light"). Expression elements that carry meaning, such as words and morphemes, are used as character string information, but character strings without meaning may also be used. Besides Japanese, the character string information may include, for example, two or more languages, as well as numerals and character strings such as abbreviations used at the place of use. Different phoneme arrays may be linked to the same character string information.

A class ID is linked to character string information and indicates the position in the arrangement (for example, the first class ID "1") at which the word or the like in the character string information is expected to be used grammatically. For example, if the grammar (sentence) of an utterance can be expressed as "target" + "particle" + "action", the class ID "1" is used for character string information that serves as the "target", "2" for character string information that serves as the "particle", and "3" for character string information that serves as the "action".

The grammar database stores grammar information, acquired in advance, that indicates the order in which a plurality of class IDs are arranged. The grammar database is used when the calculation unit 15 calculates reliability. When, for example, the first grammar information "1, 2, 3" is used as the grammar information, a sentence of the form "target" + "particle" + "action" can be generated as a candidate for the utterance. The grammar information contains a plurality of class ID orderings, for example the first grammar information "1, 2, 3", the second grammar information "4, 5, 6", and the third grammar information "2, 1, 3".

The reference database stores character string information acquired in advance, reference sentences that combine the character strings, and a threshold assigned to each piece of character string information; it may also store, for example, phoneme information linked to the character string information. The reference database is used as needed when the generation unit 17 generates recognition information. Making the character string information and phoneme information stored in the reference database identical to those stored in the character string database can reduce the data volume.

<Detection unit 14>
The detection unit 14 refers to the character string database and selects the phoneme information that corresponds to the phoneme array of the recognition target data. The detection unit 14 then detects, as candidate data, a plurality of pieces of character string information and class IDs linked to the selected phoneme information.

As shown in FIG. 3, for example, the detection unit 14 selects the phoneme information "a/k/a/r/i", "w/o", and "ts/u/k/e/t/e" corresponding to the recognition target data, and detects the character string information and class IDs linked to each piece of phoneme information, "明かり/1" (akari, "light"), "を/2" (wo, an object particle), and "つけて/3" (tsukete, "turn on"), as candidate data. The number of pieces of candidate data increases with the number of pieces of recognition target data. The phoneme arrays may be divided and classified in advance at each pause section, or may be classified on the basis of phoneme information that contains both phonemes and pause sections.
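The lookup performed by the detection unit 14 can be sketched as a dictionary keyed by phoneme information. This is only an illustrative sketch: the dictionary layout, the name `STRING_DB`, and the function `detect_candidates` are assumptions, and the three entries are the examples given in the description rather than an actual database.

```python
# Hypothetical in-memory character string database:
# phoneme information -> list of (character string information, class ID).
STRING_DB = {
    "a/k/a/r/i": [("明かり", 1)],
    "w/o": [("を", 2)],
    "ts/u/k/e/t/e": [("つけて", 3)],
}


def detect_candidates(phoneme_arrays, string_db=STRING_DB):
    """Collect the candidate data linked to each phoneme array that has an entry."""
    candidates = []
    for array in phoneme_arrays:
        # Arrays without an entry simply contribute no candidate data.
        candidates.extend(string_db.get(array, []))
    return candidates


candidates = detect_candidates(["a/k/a/r/i", "w/o", "ts/u/k/e/t/e"])
```

Because each key maps to a list, the same phoneme array can yield several pieces of candidate data, and the same character string can appear under several keys.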

<Calculation unit 15>
The calculation unit 15 refers to the grammar database and generates sentences by combining a plurality of pieces of candidate data according to the grammar information. The calculation unit 15 also calculates a reliability for each piece of candidate data contained in a sentence.

As shown in FIG. 3, for example, the calculation unit 15 matches the class ID of each piece of candidate data, "明かり/1" ("light"), "を/2" (object particle), and "つけて/3" ("turn on"), to each class ID in the first grammar information "1, 2, 3", and generates the sentence "明かり/1" "を/2" "つけて/3". If the grammar information were "3, 1, 2" instead, the sentence "つけて/3" "明かり/1" "を/2" would be generated.
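The way grammar information orders candidate data into sentences can be sketched in Python as follows. The function name `build_sentences` and the choice to enumerate every combination with `itertools.product` are assumptions; the description only states that candidates are combined according to the class ID order.

```python
from itertools import product


def build_sentences(candidates, grammar):
    """Combine candidate data into sentences that follow one class-ID ordering.

    candidates: list of (character string, class ID)
    grammar:    sequence of class IDs, e.g. (1, 2, 3)
    """
    by_class = {}
    for text, class_id in candidates:
        by_class.setdefault(class_id, []).append((text, class_id))
    # One slot per class ID in the grammar; an empty slot yields no sentence.
    slots = [by_class.get(class_id, []) for class_id in grammar]
    return [list(combo) for combo in product(*slots)]


candidates = [("明かり", 1), ("を", 2), ("つけて", 3), ("弾いて", 3)]
first = build_sentences(candidates, (1, 2, 3))
reordered = build_sentences(candidates, (3, 1, 2))
```

With two candidates for class ID 3, the first grammar information yields two sentences, which is the situation in which per-sentence reliabilities become useful.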

The calculation unit 15 calculates the reliabilities "0.982", "1.000", and "0.990" for the candidate data "明かり/1" ("light"), "を/2" (object particle), and "つけて/3" ("turn on") contained in the sentence. The reliability of each piece of candidate data is calculated in the range from 0.000 to 1.000. The calculation unit 15 may also assign each sentence a rank indicating its priority (ranks 1 to 5 in FIG. 3). Assigning ranks allows sentences ranked below a chosen cutoff (for example, rank 6 and below) to be excluded from evaluation. This reduces the number of pieces of candidate data that can be selected as the evaluation data described later, and so improves processing speed.

When the same candidate data appears in sentences with different content, the calculation unit 15 may calculate a different reliability for it in each sentence. For example, suppose the reliabilities "0.982", "1.000", and "0.990" are calculated for the candidate data "明かり/1" ("light"), "を/2" (object particle), and "つけて/3" ("turn on") in a first sentence, while the reliabilities "0.942", "1.000", and "0.023" are calculated for the candidate data "明かり/1", "を/2", and "弾いて/3" (hiite, "play") in a second sentence. The same candidate data "明かり" thus receives different reliabilities depending on the content of the sentence and the order of the combination.

A preset value may be used as the reliability, or a relative value that depends on the types and number of pieces of candidate data detected by the detection unit 14 may be used instead. For example, the reliability can be made lower as the number of types of candidate data for a single class ID grows.
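The description gives no formula for such a relative value. Purely as a hypothetical illustration of a reliability that falls as the number of candidate types per class ID grows, the sketch below assigns each candidate the value 1/n, where n is the number of distinct candidates sharing its class ID. Both the formula and the name `relative_reliability` are assumptions and should not be read as the method of the described device.

```python
from collections import Counter


def relative_reliability(candidates):
    """Assign each distinct (string, class ID) pair the value 1 / n.

    n is the number of distinct candidates with the same class ID. This 1/n
    rule is a hypothetical stand-in; the source does not define the formula.
    """
    distinct = set(candidates)
    per_class = Counter(class_id for _, class_id in distinct)
    return {(text, class_id): 1.0 / per_class[class_id] for text, class_id in distinct}


scores = relative_reliability([("明かり", 1), ("を", 2), ("つけて", 3), ("弾いて", 3)])
```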

<Selection unit 16>
The selection unit 16 selects evaluation data from the plurality of pieces of candidate data on the basis of reliability. For example, the selection unit 16 selects, for each class ID, the candidate data with the highest calculated reliability as evaluation data. Of the candidate data "つけて/3/0.990" ("turn on") and "弾いて/3/0.023" ("play"), which share the class ID "3", the selection unit 16 selects "つけて/3/0.990", which has the highest reliability, as evaluation data. The selection unit 16 may also select several pieces of candidate data for one class ID as evaluation data, in which case the generation unit 17 described later may choose one of them.
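The per-class selection described above can be sketched in a few lines of Python. The tuple layout and the function name `select_evaluation_data` are assumptions; the reliabilities are the example values from the description.

```python
def select_evaluation_data(scored):
    """Keep, for each class ID, the candidate with the highest reliability.

    scored: iterable of (character string, class ID, reliability)
    """
    best = {}
    for text, class_id, reliability in scored:
        # Replace the stored candidate only when a strictly higher value appears.
        if class_id not in best or reliability > best[class_id][2]:
            best[class_id] = (text, class_id, reliability)
    # Returned in ascending class-ID order for readability (an arbitrary choice).
    return [best[class_id] for class_id in sorted(best)]


evaluation = select_evaluation_data([
    ("明かり", 1, 0.982), ("を", 2, 1.000), ("つけて", 3, 0.990),
    ("明かり", 1, 0.942), ("を", 2, 1.000), ("弾いて", 3, 0.023),
])
```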

<Generation unit 17>
The generation unit 17 generates recognition information on the basis of the evaluation data. The generation unit 17 may, for example, convert the evaluation data into text format to produce the recognition information, or convert it into an audio data format or a control data format for controlling the control device 3. That is, the recognition information includes information for controlling the control device 3 (for example, information for controlling the traveling speed of a vehicle). Known techniques can be used to convert the evaluation data into a text, audio data, or control data format, and a database or the like that stores each data format may be used as needed.

The generation unit 17 may include, for example, a designation unit 17a and a comparison unit 17b. The designation unit 17a refers to the reference database and designates, from among the reference sentences, the first reference sentence that corresponds to the evaluation data. When, for example, "明かり/1" ("light"), "を/2" (object particle), and "つけて/3" ("turn on") are selected as evaluation data, the designation unit 17a designates the first reference sentence shown in FIG. 4. In this case, each piece of character string information contained in the first reference sentence (first character string information) is a character string identical to the candidate data contained in the evaluation data.

The comparison unit 17b compares the reliability of the evaluation data with the threshold (first threshold) assigned to the first character string information. For example, the comparison unit 17b checks whether the reliabilities "0.982", "1.000", and "0.990" of the evaluation data "明かり", "を", and "つけて" are at or above the first thresholds "0.800", "0.900", and "0.880" of the first character string information "明かり", "を", and "つけて", respectively. The generation unit 17 then generates recognition information on the basis of the comparison result. For example, the generation unit 17 may generate recognition information when the reliability is at or above the first threshold, or may generate different recognition information depending on whether the reliability is at or above the first threshold or below it.
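The comparison against the first thresholds can be sketched as follows. The function name `meets_thresholds` and the dictionary representation are assumptions, as is the rule that every item must clear its own threshold; the description states the per-item comparison but not how the individual results are combined.

```python
def meets_thresholds(evaluation, thresholds):
    """Return True when every item's reliability is at or above its first threshold.

    evaluation: list of (character string, reliability)
    thresholds: dict mapping character string -> first threshold
    Assumption: all items must pass; the source does not fix the combination rule.
    """
    return all(reliability >= thresholds[text] for text, reliability in evaluation)


FIRST_THRESHOLDS = {"明かり": 0.800, "を": 0.900, "つけて": 0.880}
accepted = meets_thresholds(
    [("明かり", 0.982), ("を", 1.000), ("つけて", 0.990)], FIRST_THRESHOLDS
)
```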

<Output unit 18>
The output unit 18 outputs the recognition information. The output unit 18 outputs the recognition information to the control device 3 and the like via the I/F 105, and may also output it to the output part 109 via, for example, the I/F 107. In addition to the recognition information, the output unit 18 outputs various information (data) to the control device 3 and the like via, for example, the I/F 105 and the I/F 107.

<Reflection unit 19>
The reflection unit 19 acquires the results of evaluations of the recognition information made by users and the like, and reflects them in the thresholds of the reference database. For example, when the evaluation of the recognition information is poor (that is, when the recognition information obtained for the audio data departs from what the user wanted), the reflection unit 19 changes the thresholds in order to improve the recognition information. The evaluation results may be reflected in the thresholds using, for example, a known machine learning method.

<Sound collection device 2>
The sound collection device 2 may include, in addition to a known microphone, a DSP (digital signal processor), for example. When the sound collection device 2 has a DSP, it generates pulse-modulated data such as PCM data from the audio signal picked up by the microphone and transmits the data to the speech recognition device 1.
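To make "pulse-modulated data such as PCM" concrete, the sketch below quantizes floating-point samples into signed 16-bit little-endian PCM bytes using Python's standard `struct` module. The bit depth, byte order, and the function name `to_pcm16` are assumptions; the description does not specify the PCM parameters used by the sound collection device 2.

```python
import struct


def to_pcm16(samples):
    """Quantize floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.

    Assumption: 16-bit depth and little-endian order, chosen for illustration.
    """
    # Clip out-of-range input, then scale to the 16-bit integer range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


pcm = to_pcm16([0.0, 1.0, -1.0])
```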

The sound collection device 2 may be connected to the speech recognition device 1 directly or, for example, via the public communication network 4. When the sound collection device 2 has only a microphone, the speech recognition device 1 may generate the pulse-modulated data.

<Control device 3>
The control device 3 is a device that can be controlled on the basis of recognition information received from the speech recognition device 1. Examples of the control device 3 include a lighting device such as an LED, an in-vehicle device (for example, a device connected directly to the brake system to control the traveling speed of the vehicle), a vending machine whose display language can be changed, a locking device, audio equipment, and a massage machine. The control device 3 may be connected to the speech recognition device 1 directly or, for example, via the public communication network 4.

<Public communication network 4>
The public communication network 4 is, for example, the Internet, to which the speech recognition device 1 is connected via a communication circuit. The public communication network 4 may be a so-called optical fiber communication network. It is not limited to a wired network and may be realized by any known communication network, such as a wireless network.

<Server 5>
The server 5 stores the various information described above. The server 5 accumulates, for example, various information sent to it via the public communication network 4. The server 5 may store the same information as the storage unit 104 and may exchange various information with the speech recognition device 1 via the public communication network 4; that is, the speech recognition device 1 may use the server 5 in place of the storage unit 104. In particular, having the server 5 update the databases described above minimizes the update functions and the data volume that the speech recognition device 1 itself has to hold. The speech recognition device 1 can therefore be used without a permanent connection to the public communication network 4 and connected only when an update is needed, which greatly widens the range of settings in which it can be used.

<User terminal 6>
The user terminal 6 is, for example, a terminal owned by a user of the speech recognition system 100. A mobile phone (mobile terminal) is mainly used as the user terminal 6, but a smartphone, tablet terminal, wearable terminal, personal computer, IoT (Internet of Things) device, or any other electronic device may be used. The user terminal 6 may be connected to the speech recognition device 1 via the public communication network 4 or directly. A user may, for example, obtain recognition information from the speech recognition device 1 through the user terminal 6, or use the user terminal 6 in place of the sound collection device 2 to pick up speech.

(Example of operation of the speech recognition system 100)
Next, an example of the operation of the speech recognition system 100 according to the present embodiment will be described. FIG. 5A is a flowchart showing an example of that operation.

<Acquisition means S110>
First, at least one piece of audio data is acquired (acquisition means S110). The acquisition unit 11 acquires audio data from the sound collection device 2 or the like and stores it in the storage unit 104 via, for example, the storage unit 13.

<Extraction means S120>
Next, recognition target data is extracted (extraction means S120). The extraction unit 12 retrieves the audio data from the storage unit 104 via, for example, the storage unit 13, and extracts the start silent section and the end silent section contained in the audio data. The extraction unit 12 also extracts, as recognition target data, the array of phonemes and pause sections sandwiched between the start silent section and the end silent section, and stores the recognition target data in the storage unit 104 via, for example, the storage unit 13. The extraction unit 12 may acquire a plurality of pieces of audio data at once.

The extraction unit 12 extracts, for example, a plurality of pieces of recognition target data from one piece of audio data. These pieces of recognition target data each have a different array of phonemes and pause sections (for example, arrays A to C in FIG. 3). The extraction unit 12 may extract them by setting different conditions for each, or within the range of variation that arises when identical conditions are set.

When, for example, a pause section contains at least one of a breathing sound and lip noise, the extraction unit 12 may extract an array containing at least one of the breathing sound and lip noise as recognition target data.

<Detection means S130>
Next, candidate data is detected on the basis of the recognition target data (detection means S130). The detection unit 14 retrieves the recognition target data from the storage unit 104 via, for example, the storage unit 13. The detection unit 14 refers to the character string database and selects the phoneme information corresponding to the array of the recognition target data, then detects, as candidate data, a plurality of pieces of character string information and class IDs linked to the selected phoneme information. The detection unit 14 stores the candidate data in the storage unit 104 via, for example, the storage unit 13. Here, the array of the recognition target data refers, for example, to the phoneme array between a pair of pause sections, and another pause section may lie between that pair.

<Calculation means S140>
Next, the reliability of each piece of candidate data is calculated (calculation means S140). The calculation unit 15 retrieves the candidate data from the storage unit 104 via, for example, the storage unit 13. The calculation unit 15 refers to the grammar database, generates sentences by combining a plurality of pieces of candidate data according to the grammar information, and calculates a reliability for each piece of candidate data contained in a sentence. The calculation unit 15 stores each piece of candidate data and its reliability in the storage unit 104 via, for example, the storage unit 13. Sentence generation and reliability calculation may be realized by using a known speech recognition engine such as Julius as the calculation unit 15.

The calculation unit 15 can generate a plurality of sentences according to the types of grammar information in the grammar database. By selecting the type of grammar information, the calculation unit 15 can also perform speech recognition suited to the situation with high accuracy.

<Selection means S150>
Next, evaluation data is selected on the basis of reliability (selection means S150). The selection unit 16 retrieves the candidate data and reliabilities from the storage unit 104 via, for example, the storage unit 13. For example, the selection unit 16 selects, for each class ID, the candidate data with the highest calculated reliability as evaluation data. The selection unit 16 stores the evaluation data in the storage unit 104 via, for example, the storage unit 13.

<Generation means S160>
Next, recognition information is generated based on the evaluation data (generation means S160). The generation unit 17 retrieves the evaluation data from the storage unit 104 via the storage unit 13, for example. The generation unit 17 converts the evaluation data into arbitrary data using, for example, the known technology described above, and generates the result as the recognition information.

The generation means S160 may include a designation means S161 and a comparison means S162, as shown in FIG. 5(b), for example.

The designation means S161 designates a first reference sentence corresponding to the evaluation data. The designation unit 17a refers to the reference database and designates, from among the reference sentences, the first reference sentence corresponding to the evaluation data.

The comparison means S162 compares the reliability corresponding to the evaluation data with a first threshold assigned to the first character string information included in the first reference sentence. As shown in FIG. 3, for example, the comparison unit 17b may judge the recognition to be correct when the reliability of the evaluation data is equal to or greater than the first threshold. The recognition information is then generated based on the judgment (comparison result) of the comparison unit 17b. If the reliability of the evaluation data is below the first threshold and the comparison unit 17b judges the recognition to be incorrect, the processing may simply end or may be repeated from the extraction means S120; alternatively, recognition information that prompts the user or the like to speak again may be generated.
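The threshold comparison above can be sketched as follows. The function names and the retry prompt text are hypothetical, and the sketch implements only one of the fallbacks the description allows (prompting the user to speak again).

```python
def is_recognition_correct(reliability: float, first_threshold: float) -> bool:
    """Recognition is judged correct when the reliability is at or above the threshold."""
    return reliability >= first_threshold


def generate_recognition_info(text: str, reliability: float,
                              first_threshold: float) -> str:
    """Return the recognized text, or a retry prompt when the reliability is too low."""
    if is_recognition_correct(reliability, first_threshold):
        return text
    # Hypothetical prompt; ending the processing or repeating the
    # extraction step are the other options named in the description.
    return "Please say that again."
```

Because the test is "equal to or greater than", a reliability exactly equal to the first threshold is accepted.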

<Output means S170>
Thereafter, the recognition information is output as needed (output means S170). The output unit 18 displays the recognition information on the output part 109 via the I/F 107, and may also, for example, output recognition information for controlling the control device 3 or the like via the I/F 105.

<Reflection means S180>
Note that, for example, an evaluation result from a user or the like who has evaluated the recognition information may be acquired and reflected in the thresholds of the reference database (reflection means S180). In this case, the reflection unit 19 acquires the evaluation result created by the user or the like via the acquisition unit 11. Based on an evaluation value or the like included in the evaluation result, the reflection unit 19 changes the threshold so that the result of the comparison in the comparison means S162 improves (that is, so that the recognition accuracy improves).

In addition to the reference database, the reflection unit 19 may reflect the evaluation result in at least one of the character string database and the grammar database, for example. The calculation unit 15 may also reflect the evaluation result in the calculation of the reliability.

This completes the operation of the speech recognition system 100 according to the present embodiment.

According to the speech recognition system 100 of the present embodiment, the extraction means S120 extracts the arrangement of phonemes and pause sections as recognition target data. The detection means S130 selects the phoneme information corresponding to the arrangement of the recognition target data and detects candidate data. Erroneous recognition can therefore be reduced compared with a case where candidate data is detected for an arrangement that considers only the phonemes in the recognition target data. This makes it possible to improve recognition accuracy.

Because the recognition accuracy can be improved, there is also no need for the prior voice input that is otherwise used to improve accuracy. Here, prior voice input refers to a voice uttered to start speech recognition before the voice data is acquired. Prior voice input can improve recognition accuracy, but there is a concern that it reduces convenience. In this respect, the speech recognition system 100 of the present embodiment can improve convenience by not requiring prior voice input.

Note that, in the speech recognition system 100 of the present embodiment, prior voice input may still be performed as needed. This makes it possible to further improve the recognition accuracy.

According to the speech recognition system 100 of the present embodiment, the character string database stores phoneme information corresponding to arrangements of phonemes and pause sections, together with the character string information linked to the phoneme information. Compared with the data that would have to be stored to perform pattern matching on entire phoneme sequences, this reduces the data capacity and simplifies data accumulation.

In particular, by narrowing down the character string information stored in the character string database in light of the environment in which the speech recognition system 100 is used, the data capacity can be reduced; for example, the system need not be connected to the public communication network 4, which broadens the range of use. The time from acquiring the voice data to generating the recognition information can also be shortened significantly.

According to the speech recognition system 100 of the present embodiment, the extraction means S120 extracts a plurality of recognition target data from one piece of voice data. Therefore, even when voice data is acquired in which the arrangement of phonemes and pause sections varies, a decrease in recognition accuracy can be suppressed. This makes it possible to further improve the recognition accuracy.

According to the speech recognition system 100 of the present embodiment, the calculation means S140 generates a plurality of sentences. That is, even when there are multiple patterns for combining the candidate data, sentences corresponding to all of the patterns can be generated. Erroneous recognition can therefore be reduced compared with, for example, a search method based on pattern matching. This makes it possible to further improve the recognition accuracy.
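Generating a sentence for every combination pattern can be illustrated with a small sketch. It assumes that one piece of grammar information is an ordered list of class IDs and that the candidates are grouped by class ID; this data layout and the function name are assumptions for illustration and are not how Julius or the described system necessarily represents a grammar.

```python
from itertools import product


def generate_sentences(grammar: list[int],
                       candidates_by_class: dict[int, list[str]]) -> list[tuple[str, ...]]:
    """Enumerate every sentence that fills each class-ID slot with one candidate.

    `grammar` is the arrangement order of class IDs (assumed representation).
    If any slot has no candidate, no sentence can be formed and the result is empty.
    """
    slots = [candidates_by_class.get(class_id, []) for class_id in grammar]
    return list(product(*slots))


# Illustrative data: class 1 has two candidates, class 2 has one.
sentences = generate_sentences([1, 2], {1: ["turn on", "turn off"], 2: ["the light"]})
```

The number of sentences grows as the product of the candidate counts per slot, so a practical engine would prune or score combinations rather than enumerate them naively.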

According to the speech recognition system 100 of the present embodiment, the comparison means S162 compares the reliability with the first threshold. Because the evaluation data, which is selected relatively from among the plurality of candidate data, is additionally judged against a threshold, erroneous recognition can be reduced further. This makes it possible to further improve the recognition accuracy.

According to the speech recognition system 100 of the present embodiment, the reflection means S180 reflects the evaluation result in the threshold. Therefore, when the recognition information diverges from what the user perceives, the system can easily be corrected. This makes it possible to improve the recognition accuracy continuously.

According to the speech recognition system 100 of the present embodiment, the output means S170 outputs the recognition information. As described above, the speech recognition system 100 of the present embodiment can generate recognition information with higher accuracy than conventional systems. Therefore, when the control device 3 or the like is controlled based on the recognition information, malfunctions of the control device 3 or the like can be greatly suppressed. For example, even when the speech recognition system 100 is used to control the brakes of a vehicle, an accuracy that does not interfere with normal driving can be achieved. That is, with the improved recognition accuracy, the system can be used, for example, as driving assistance for the user. This enables application to a wide range of uses.

According to the speech recognition system 100 of the present embodiment, the pause section includes at least one of a breathing sound and lip noise. Differences in voice data that are difficult to judge from phonemes alone can therefore be judged easily, and the recognition target data can be extracted. This makes it possible to further improve the recognition accuracy.

According to the speech recognition device 1 of the present embodiment, the extraction unit 12 extracts the arrangement of phonemes and pause sections as recognition target data. The detection unit 14 selects the phoneme information corresponding to the arrangement of the recognition target data and detects candidate data. Erroneous recognition can therefore be reduced compared with a case where candidate data is detected for an arrangement that considers only the phonemes in the recognition target data. This makes it possible to improve recognition accuracy.

According to the speech recognition device 1 of the present embodiment, the character string database stores phoneme information corresponding to arrangements of phonemes and pause sections, together with the character string information linked to the phoneme information. Compared with the data that would have to be stored to perform pattern matching on entire phoneme sequences, this reduces the data capacity and simplifies data accumulation.

(First Modification of Configuration of Speech Recognition System 100)
Next, a first modification of the speech recognition system 100 according to the present embodiment is described. The first modification differs from the embodiment described above in that the generation unit 17 includes an update unit 17c. Description of configurations that are the same as those described above is omitted.

The update unit 17c of the generation unit 17 updates the threshold stored in the reference database based on the candidate data and the reliabilities, as shown in FIG. 6, for example. That is, the threshold can be updated to a value that reflects the contents of the candidate data and the reliabilities.

The update unit 17c calculates, for example, the average of the plurality of reliabilities linked to each class ID, and updates the threshold based on the calculated average.

When the threshold is updated, the calculated average may be used as the threshold as is, or a value obtained by multiplying the average by a preset coefficient may be used as the updated threshold. Alternatively, the updated threshold may be the result of applying one of the four arithmetic operations to the threshold before the update and the value obtained by multiplying the average by the coefficient.

Updating the threshold based on the contents of the candidate data and the reliabilities makes it possible to set a threshold that matches the quality of the voice data, even when the voice data tends to contain noise or the like. In addition, even when many pieces of character string information linked to one class ID are detected and the reliability of each is low, it is possible to prevent all of the reliabilities from falling below the threshold.

The update unit 17c may, for example, calculate the average of the plurality of reliabilities linked to each class ID excluding the lowest reliability. In this case, the updated threshold tends to be higher than the threshold before the update. This makes it possible to reduce erroneous recognition.

The update unit 17c may, for example, calculate the average of the plurality of reliabilities linked to each class ID excluding both the lowest and the highest reliability. In this case, the updated threshold tends to be lower than the threshold before the update. This can improve the recognition rate and also suppresses fluctuation of the threshold across updates.
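The three averaging variants (plain average, average without the lowest value, average without the lowest and highest values) can be sketched with one helper. The function name, its flags, and the sample reliabilities are assumptions made for illustration.

```python
def trimmed_average(reliabilities: list[float],
                    drop_lowest: bool = False,
                    drop_highest: bool = False) -> float:
    """Average the reliabilities for one class ID, optionally excluding extremes."""
    ordered = sorted(reliabilities)
    if drop_lowest:
        ordered = ordered[1:]    # exclude the single lowest reliability
    if drop_highest:
        ordered = ordered[:-1]   # exclude the single highest reliability
    if not ordered:
        raise ValueError("no reliabilities left to average")
    return sum(ordered) / len(ordered)


# Made-up reliabilities for one class ID.
values = [0.9, 0.8, 0.4, 0.7]
plain = trimmed_average(values)
without_lowest = trimmed_average(values, drop_lowest=True)
without_extremes = trimmed_average(values, drop_lowest=True, drop_highest=True)
```

With these sample values, dropping the lowest reliability raises the average, and additionally dropping the highest pulls it back down.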

(First Modification of Operation of Speech Recognition System 100)
Next, a first modification of the speech recognition system 100 according to the present embodiment is described. FIG. 7(a) is a flowchart showing an example of the update means S163 in the first modification.

As shown in FIG. 7(a), after the selection means S150 described above is performed, the threshold stored in the reference database is updated based on the plurality of candidate data and the plurality of reliabilities (update means S163). The update unit 17c retrieves the candidate data, the reliabilities, and the reference database from the storage unit 104 via the storage unit 13, for example.

As shown in FIG. 6, for example, the update unit 17c calculates the average value "0.940" of the reliabilities "0.982", "0.942", and "0.897" that are linked to class ID "1" in ranks 1, 2, and 4. The update unit 17c then uses, for example, the value "0.846", obtained by multiplying the calculated average by a coefficient (for example, 0.9), as the updated threshold.
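The arithmetic in this example can be checked with a short sketch. The function name is hypothetical; the reliabilities and the coefficient 0.9 are the values quoted from FIG. 6 in the text.

```python
def update_threshold(reliabilities: list[float], coefficient: float = 0.9) -> float:
    """Updated threshold = average reliability for the class ID x preset coefficient."""
    average = sum(reliabilities) / len(reliabilities)
    return average * coefficient


class_1_reliabilities = [0.982, 0.942, 0.897]  # ranks 1, 2, and 4 in FIG. 6
average = sum(class_1_reliabilities) / len(class_1_reliabilities)

print(f"{average:.3f}")                                  # 0.940
print(f"{update_threshold(class_1_reliabilities):.3f}")  # 0.846
```

The exact average is 2.821 / 3 ≈ 0.9403, and 0.9403 × 0.9 ≈ 0.8463, which match the quoted "0.940" and "0.846" to three decimal places.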

Thereafter, the designation means S161 and the subsequent steps described above are performed, and the operation of the speech recognition system 100 according to the present embodiment ends.

According to this modification, the update unit 17c in the update means S163 updates the threshold based on the candidate data and the reliabilities. Compared with always using a preset threshold, recognition information that matches the quality of the acquired voice data can therefore be generated. This makes it possible to broaden the range of environments in which the system can be used.

(Second Modification of Operation of Speech Recognition System 100)
Next, a second modification of the speech recognition system 100 according to the present embodiment is described. The second modification differs from the embodiment described above in that a setting means S190 is provided. Description of configurations that are the same as those described above is omitted.

The setting means S190 is performed after the generation means S160, as shown in FIG. 7(b), for example. The setting means S190 narrows down the contents of each database to be referenced based on the recognition information. After the setting means S190 is performed, the acquisition means S110 is performed.

For example, when "music mode" is set in the setting means S190, the detection unit 14, in the subsequent detection means S130, narrows down and references only the phoneme information, character string information, and class IDs in the character string database that are specific to "music mode". Compared with the case where the setting means S190 is not performed, the phoneme information and the like can thus be limited to specific content. This makes it possible to improve the recognition accuracy dramatically.
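Narrowing the character string database to one mode can be sketched as a simple filter. The `Entry` layout, the `mode` tag on each entry, and the sample rows are assumptions for illustration; the description does not specify how mode-specific entries are marked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical row of the character string database."""
    phonemes: str   # phoneme information (phonemes and pause sections)
    text: str       # character string information
    class_id: int   # class ID
    mode: str       # assumed tag naming the mode the entry belongs to


def entries_for_mode(database: list[Entry], mode: str) -> list[Entry]:
    """Return only the entries the detection step should reference in this mode."""
    return [entry for entry in database if entry.mode == mode]


# Illustrative database with entries for two modes.
database = [
    Entry("p l e i", "play", 1, "music"),
    Entry("s t o p", "stop", 1, "music"),
    Entry("b r e i k", "brake", 1, "driving"),
]
music_entries = entries_for_mode(database, "music")
```

Restricting the lookup this way shrinks the set of phoneme arrangements that can match, which is the mechanism the text credits for the accuracy gain.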

(Modification of Acquisition Means S110)
Next, a modification of the acquisition means S110 according to the present embodiment is described. This modification differs from the embodiment described above in that the acquisition unit 11 acquires condition information. Description of configurations that are the same as those described above is omitted.

In the acquisition means S110, the acquisition unit 11 acquires condition information indicating the conditions under which the voice data was generated. As shown in FIG. 8, for example, the condition information includes environment information, noise information, sound collection device information, user information, and sound characteristic information. As with the setting means S190 described above, the detection unit 14 may, for example, narrow down the contents of at least one of the character string database and the grammar database to be referenced based on the condition information. The reflection unit 19 may also, for example, use the condition information when updating the thresholds of the reference database.

The condition information may be generated by the sound collection device 2, for example, or may be generated in advance by a user or the like. The acquisition unit 11 may also acquire part of the voice data as the condition information, for example.

The environment information includes information on the environment in which the sound collection device 2 is installed and indicates, for example, whether it is outdoors or the size of the indoor space. Using the environment information makes it possible to take into account, for example, how sound is reflected indoors, which improves the accuracy of the extracted recognition target data and the like.

The noise information includes information on noise that the sound collection device 2 may pick up and indicates, for example, voices other than those of the user or the like, air-conditioning noise, and so on. Using the noise information makes it possible to remove unnecessary data from the voice data in advance, which improves the accuracy of the extracted recognition target data and the like.

The sound collection device information includes information on the type, performance, and the like of the sound collection device 2, such as the number and types of microphones. Using the sound collection device information makes it possible, for example, to select a database that matches the situation in which the voice data was generated, which improves the accuracy of speech recognition.

The user information includes information on the number, nationality, gender, and the like of the users. The sound characteristic information includes information on the volume, sound pressure, speech habits, articulation, and the like of the voice. Using the user information allows the features of the voice data to be narrowed down in advance, which improves the accuracy of speech recognition.

According to this modification, the acquisition means S110 acquires the condition information. That is, the acquisition means S110 acquires, as condition information, various conditions such as the surrounding environment when the voice data is acquired, the noise contained in the voice data, and the type of the sound collection device 2 that collects the voice. Each means and each database can therefore be configured according to the condition information. This makes it possible to improve the recognition accuracy regardless of the environment in which the system is used.

According to this modification, the detection means S130 also narrows down the contents of the character string database to be referenced based on the condition information. By storing different character string information and the like for each piece of condition information in the character string database, candidate data suited to each piece of condition information can be detected. This makes it possible to improve the recognition accuracy for each piece of condition information.

(Modification of Reference Database)
Next, a modification of the reference database according to the present embodiment is described. This modification differs from the embodiment described above in the content of the information stored in the reference database. Description of configurations that are the same as those described above is omitted.

As shown in FIG. 9, for example, the reference database stores past evaluation data acquired in advance, reference sentences linked to the past evaluation data, and degrees of association between the past evaluation data and the reference sentences.

The generation unit 17 refers to the reference database, for example, and selects, from among the past evaluation data, first evaluation data corresponding to the evaluation data (the broken-line frame within "past evaluation data" in FIG. 9). The generation unit 17 then acquires, from among the reference sentences, a first reference sentence corresponding to the first evaluation data (the broken-line frame within "reference sentence" in FIG. 9). The generation unit 17 also acquires, from among the degrees of association, a first degree of association between the first evaluation data and the first reference sentence ("65%" and the like in FIG. 9). The first evaluation data and the first reference sentence may each include a plurality of items.

The generation unit 17 generates the recognition information based on the value of the first degree of association. For example, the generation unit 17 compares the first degree of association with a threshold acquired in advance and generates the recognition information with reference to the first reference sentence linked to a first degree of association that exceeds the threshold.
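The comparison of the first degree of association with a threshold can be sketched as follows. The mapping from reference sentence to degree of association, the function name, and the sample values are assumptions for illustration; only the "exceeds the threshold" rule comes from the text.

```python
def select_reference_sentences(associations: dict[str, float],
                               threshold: float) -> list[str]:
    """Return reference sentences whose degree of association exceeds the threshold,
    ordered from the highest degree of association to the lowest."""
    above = [(sentence, degree) for sentence, degree in associations.items()
             if degree > threshold]
    above.sort(key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in above]


# Made-up degrees of association (as fractions) for one piece of first evaluation data.
associations = {"play music": 0.65, "stop music": 0.30, "pause music": 0.80}
selected = select_reference_sentences(associations, threshold=0.5)
```

Sorting by the degree of association puts the most likely reference sentence first when more than one exceeds the threshold.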

As the past evaluation data, information that partially or completely matches the evaluation data is selected; similar information (including information expressing the same concept) may also be used, for example. When the evaluation data and the past evaluation data are expressed as combinations of a plurality of character strings, any of the combinations noun-verb, noun-adjective, adjective-verb, and noun-noun is used, for example.

The degree of association (first degree of association) is expressed in three or more levels, for example as a percentage. When the reference database is configured as a neural network, for example, the first degree of association represents a weight variable linked to the selected past evaluation target information.

A feature of using the reference database described above is that speech recognition can be realized based on degrees of association set in three or more levels. The degree of association can be described as a numerical value from 0 to 100%, for example, but is not limited to this; any number of levels may be used as long as the value can be described in three or more levels.

Based on such degrees of association, the first reference sentences chosen as candidates for the recognition information for the evaluation data can be selected in descending or ascending order of the degree of association. Selecting in order of the degree of association in this way makes it possible to preferentially select the first reference sentence that is most likely to fit the situation. At the same time, first reference sentences that are less likely to fit the situation need not be excluded; they can still be selected as candidates for the recognition information rather than discarded.

In addition, even a very low rating, such as a degree of association of 1%, can be selected without being overlooked. That is, even an extremely low degree of association indicates a connection, however slight, which makes it possible to suppress excessive discarding of candidates and misidentification.

While several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

1: Speech recognition device
2: Sound collection device
3: Control device
4: Public communication network
5: Server
6: User terminal
10: Housing
11: Acquisition unit
12: Extraction unit
13: Storage unit
14: Detection unit
15: Calculation unit
16: Selection unit
17: Generation unit
17a: Designation unit
17b: Comparison unit
17c: Update unit
18: Output unit
19: Reflection unit
100: Speech recognition system
101: CPU
102: ROM
103: RAM
104: Storage unit
105: I/F
106: I/F
107: I/F
108: Input part
109: Output part
110: Internal bus
S110: Acquisition means
S120: Extraction means
S130: Detection means
S140: Calculation means
S150: Selection means
S160: Generation means
S161: Designation means
S162: Comparison means
S163: Update means
S170: Output means
S180: Reflection means
S190: Setting means

The speech recognition system according to the first invention comprises: acquisition means for acquiring at least one piece of voice data; extraction means for extracting, by phoneme recognition, a start silent section and an end silent section included in the voice data, and for extracting, by the phoneme recognition, the arrangement of phonemes and pause sections sandwiched between the start silent section and the end silent section as recognition target data; a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information; detection means for referring to the character string database, selecting the phoneme information corresponding to the arrangement in the recognition target data, and detecting, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information; a grammar database storing grammar information, acquired in advance, that indicates the arrangement order of the class IDs; calculation means for referring to the grammar database, generating a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculating, using the grammar database, a reliability for the character string information of each piece of candidate data included in the sentence; selection means for selecting evaluation data from the plurality of pieces of candidate data based on the reliability; and generation means for generating recognition information based on the evaluation data.
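The extraction step can be illustrated with a minimal Python sketch. It assumes that a phoneme recognizer has already turned the voice data into a flat list of labels; the labels `sil` (silent section) and `pau` (pause section) and the function name `extract_recognition_targets` are hypothetical and do not appear in the specification.

```python
from typing import List, Optional

SILENCE = "sil"  # hypothetical label for a start/end silent section
PAUSE = "pau"    # hypothetical label for a pause section (kept in the output)


def extract_recognition_targets(labels: List[str]) -> List[List[str]]:
    """Return every phoneme/pause arrangement enclosed by two silent sections."""
    targets: List[List[str]] = []
    current: Optional[List[str]] = None  # None until a start silence is seen
    for label in labels:
        if label == SILENCE:
            if current:
                # A non-empty span has just been closed by an end silence.
                targets.append(current)
            # Assumption: the same silence may also open the next span.
            current = []
        elif current is not None:
            # Phonemes and pauses inside a span are kept in order.
            current.append(label)
    # A span that is never closed by an end silence is discarded.
    return targets


print(extract_recognition_targets(["sil", "a", "k", "a", "pau", "t", "o", "sil"]))
```

Labels that occur before the first silent section, or after the last one, are ignored in this sketch because they are not sandwiched between a start and an end silent section.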

The speech recognition device according to the twelfth invention comprises: an acquisition unit that acquires at least one piece of voice data; an extraction unit that extracts, by phoneme recognition, a start silent section and an end silent section included in the voice data, and extracts, by the phoneme recognition, the arrangement of phonemes and pause sections sandwiched between the start silent section and the end silent section as recognition target data; a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information; a detection unit that refers to the character string database, selects the phoneme information corresponding to the arrangement in the recognition target data, and detects, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information; a grammar database storing grammar information, acquired in advance, that indicates the arrangement order of the class IDs; a calculation unit that refers to the grammar database, generates a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculates, using the grammar database, a reliability for the character string information of each piece of candidate data included in the sentence; a selection unit that selects evaluation data from the plurality of pieces of candidate data based on the reliability; and a generation unit that generates recognition information based on the evaluation data.
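The flow from candidate detection to selection of evaluation data can be sketched as follows. This is an illustration under stated assumptions, not the method of the specification: the database contents, the `pau` label, the use of `difflib.SequenceMatcher` as the phoneme similarity, and the use of that similarity as the reliability of candidates that fit a grammar rule are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical character string database: (text, phoneme information, class ID).
STRING_DB = [
    ("light", ("l", "a", "i", "t"), "DEVICE"),
    ("right", ("r", "a", "i", "t"), "DIRECTION"),
    ("on", ("o", "n"), "ACTION"),
    ("off", ("o", "f"), "ACTION"),
]
# Hypothetical grammar database: permitted arrangement orders of class IDs.
GRAMMAR_DB = [("DEVICE", "ACTION")]
PAUSE = "pau"  # hypothetical pause label


def split_on_pause(target):
    """Split recognition target data into phoneme segments at pause sections."""
    segments, current = [], []
    for label in target:
        if label == PAUSE:
            if current:
                segments.append(tuple(current))
            current = []
        else:
            current.append(label)
    if current:
        segments.append(tuple(current))
    return segments


def detect_candidates(segment, min_score=0.5):
    """Detect every database entry whose phonemes resemble the segment."""
    found = []
    for text, phonemes, class_id in STRING_DB:
        score = SequenceMatcher(None, segment, phonemes).ratio()
        if score >= min_score:
            found.append({"text": text, "class_id": class_id, "score": score})
    return found


def best_sentence(target):
    """Combine candidates by grammar and return (text, reliability) pairs."""
    per_segment = [detect_candidates(s) for s in split_on_pause(target)]
    best = None
    for combo in product(*per_segment):
        # Keep only sentences whose class-ID order is listed in the grammar.
        if tuple(c["class_id"] for c in combo) not in GRAMMAR_DB:
            continue
        total = sum(c["score"] for c in combo)
        if best is None or total > best[0]:
            best = (total, combo)
    if best is None:
        return []
    # Assumption: the similarity score stands in for the reliability.
    return [(c["text"], round(c["score"], 2)) for c in best[1]]


print(best_sentence(["r", "a", "i", "t", "pau", "o", "n"]))
```

For the input above, "right" matches the first segment more closely than "light", but the only grammar rule in this example admits a DEVICE followed by an ACTION, so the sketch returns "light" and "on". This is the kind of correction that combining candidates through class-ID order is meant to enable.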

Claims

A speech recognition system comprising:
acquisition means for acquiring at least one piece of voice data;
extraction means for extracting a start silent section and an end silent section included in the voice data, and extracting, as recognition target data, the arrangement of phonemes and pause sections sandwiched between the start silent section and the end silent section;
a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information;
detection means for referring to the character string database, selecting the phoneme information corresponding to the arrangement in the recognition target data, and detecting, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information;
a grammar database storing grammar information, acquired in advance, that indicates the arrangement order of the class IDs;
calculation means for referring to the grammar database, generating a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculating a reliability corresponding to each piece of the candidate data included in the sentence;
selection means for selecting evaluation data from the plurality of pieces of candidate data based on the reliability; and
generation means for generating recognition information based on the evaluation data.

The speech recognition system according to claim 1, wherein
the extraction means extracts a plurality of pieces of the recognition target data from one piece of the voice data, and
the plurality of pieces of recognition target data have mutually different arrangements of the phonemes and the pause sections.

The speech recognition system according to claim 1 or 2, wherein
the calculation means generates a plurality of the sentences, and
the plurality of sentences differ from one another in at least one of the type and the combination of the candidate data.

The speech recognition system according to any one of claims 1 to 3, further comprising
a reference database storing the character string information acquired in advance, reference sentences each combining the character string information, and a threshold assigned to each piece of the character string information, wherein
the generation means includes:
designation means for referring to the reference database and designating, from among the reference sentences, a first reference sentence corresponding to the evaluation data; and
comparison means for comparing the reliability corresponding to the evaluation data with a first threshold assigned to first character string information included in the first reference sentence, and
the generation means generates the recognition information based on a comparison result of the comparison means.
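The designation and comparison steps of this claim can be illustrated with a short Python sketch. The reference sentences, the threshold values, the rule that a reliability equal to or above the threshold passes, and the plain-text form of the recognition information are all assumptions made for the example; the claim states only that recognition information is generated based on the comparison result.

```python
# Hypothetical reference database contents.
REFERENCE_SENTENCES = [("light", "on"), ("light", "off")]
THRESHOLDS = {"light": 0.7, "on": 0.8, "off": 0.8}


def generate_recognition_info(evaluation_data):
    """evaluation_data is a list of (character string, reliability) pairs."""
    texts = tuple(text for text, _ in evaluation_data)
    # Designation: find the reference sentence matching the evaluation data.
    if texts not in REFERENCE_SENTENCES:
        return None
    # Comparison: every reliability must reach the threshold for its string
    # (assumed pass rule: reliability >= threshold).
    for text, reliability in evaluation_data:
        if reliability < THRESHOLDS[text]:
            return None
    # Assumption: the recognition information is simply the joined text.
    return " ".join(texts)


print(generate_recognition_info([("light", 0.75), ("on", 1.0)]))
print(generate_recognition_info([("light", 0.6), ("on", 1.0)]))
```

Assigning a threshold per character string, rather than one global threshold, lets strings that are easily confused require a higher reliability than strings that are rarely misrecognized.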

The speech recognition system according to claim 4, further comprising update means for updating the threshold stored in the reference database based on the plurality of pieces of candidate data and the plurality of reliabilities.

The speech recognition system according to claim 4 or 5, further comprising reflection means for acquiring an evaluation result from a user who has evaluated the recognition information, and reflecting the evaluation result in the threshold in the reference database.

The speech recognition system according to any one of claims 1 to 6, wherein the acquisition means acquires condition information indicating the conditions under which the voice data was generated.

The speech recognition system according to claim 7, wherein the detection means selects, based on the condition information, the contents of the character string database to be referred to.
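One way to picture this selection is a filter applied to the character string database before candidate detection. In the sketch below, the `condition` tag on each entry, the tag values, and the function name `select_entries` are hypothetical; the claim does not state how condition information is represented or matched.

```python
# Hypothetical character string database entries tagged with a condition.
STRING_DB = [
    {"text": "accelerate", "class_id": "ACTION", "condition": "vehicle"},
    {"text": "brake", "class_id": "ACTION", "condition": "vehicle"},
    {"text": "light", "class_id": "DEVICE", "condition": "home"},
]


def select_entries(db, condition):
    """Keep only the entries registered for the given condition."""
    return [entry for entry in db if entry["condition"] == condition]


vehicle_entries = select_entries(STRING_DB, "vehicle")
print([entry["text"] for entry in vehicle_entries])
```

Narrowing the database in this way reduces the number of candidates that the later grammar and reliability steps have to consider.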

The speech recognition system according to any one of claims 1 to 8, further comprising output means for outputting the recognition information, wherein
the recognition information includes information for controlling the traveling speed of a vehicle.

The speech recognition system according to any one of claims 1 to 9, wherein the pause section includes at least one of a breath sound and lip noise.

The speech recognition system according to any one of claims 1 to 10, wherein the character string information includes the languages of two or more countries.

A speech recognition device comprising:
an acquisition unit that acquires at least one piece of voice data;
an extraction unit that extracts a start silent section and an end silent section included in the voice data, and extracts, as recognition target data, the arrangement of phonemes and pause sections sandwiched between the start silent section and the end silent section;
a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information;
a detection unit that refers to the character string database, selects the phoneme information corresponding to the arrangement in the recognition target data, and detects, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information;
a grammar database storing grammar information, acquired in advance, that indicates the arrangement order of the class IDs;
a calculation unit that refers to the grammar database, generates a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculates a reliability corresponding to each piece of the candidate data included in the sentence;
a selection unit that selects evaluation data from the plurality of pieces of candidate data based on the reliability; and
a generation unit that generates recognition information based on the evaluation data.