JP6462936B1

JP6462936B1 - Speech recognition system and speech recognition device

Info

Publication number: JP6462936B1
Application number: JP2018115243A
Authority: JP
Inventors: 敦菊田; 高広越田
Original assignee: Ryoyo Electro Corp
Current assignee: Ryoyo Electro Corp
Priority date: 2018-06-18
Filing date: 2018-06-18
Publication date: 2019-01-30
Anticipated expiration: 2038-06-18
Also published as: CN110914897A; WO2019244385A1; CN110914897B; JP2019219456A

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition system and a speech recognition device that can improve recognition accuracy.
SOLUTION: The system comprises: acquisition means for acquiring at least one piece of speech data; extraction means for extracting a start silence section and an end silence section included in the speech data, and for extracting, as recognition target data, the array of phonemes and pause sections sandwiched between the start silence section and the end silence section; detection means for referring to a character string database, selecting the phoneme information corresponding to the array of the recognition target data, and detecting, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information; calculation means for referring to a grammar database, generating a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculating a reliability for each piece of candidate data included in the sentence; selection means; and generation means.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech recognition system and a speech recognition device.

Conventionally, technologies related to speech recognition have been proposed, such as the cognitive function evaluation device of Patent Document 1 and the utterance content grasping system of Patent Document 2.

In the cognitive function evaluation device of Patent Document 1, a formant analysis unit receives target data representing, over a target period, the temporal variation in instantaneous sound pressure of a specific phoneme included in a subject's speech. The formant analysis unit divides the target period into a plurality of frames and obtains the frequency of a specific formant for each of two or more target frames. A feature analysis unit obtains a feature amount for the specific formant frequencies obtained for the target frames. An evaluation unit evaluates the subject's cognitive function based on the feature amount.

Patent Document 2 discloses a system for grasping utterance content based on the extraction of core words from recorded speech data. The system performs phoneme-based speech recognition on the recorded speech data, stores the indexed data, and uses that data to grasp the utterance content based on core words, so that the content is grasped accurately, easily, and quickly. Patent Document 2 also discloses an indexing method and a method for grasping utterance content that use this system.

Patent Document 1: Japanese Patent Laid-Open No. 2018-50847
Patent Document 2: JP2015-539364A

While speech recognition technology is expected to find application in many fields, improving recognition accuracy remains a challenge. Methods that use phonemes have attracted attention as a way to improve recognition accuracy, but accuracy is still limited by factors such as variation in the phoneme arrays obtained from speech data.

In this regard, Patent Document 1 seeks to improve accuracy by obtaining a feature amount for a specific formant frequency based on the subject's speech and evaluating the subject's cognitive function based on that feature amount. However, the technique disclosed in Patent Document 1 cannot recognize the content of what the subject says.

Patent Document 2 discloses a technique for grasping utterance content based on core words. However, with the technique disclosed in Patent Document 2, recognition accuracy may deteriorate when the utterance contains core words whose phonemes are similar. Under these circumstances, a speech recognition technology that can improve recognition accuracy is desired.

The present invention has been devised in view of the problems described above, and its object is to provide a speech recognition system and a speech recognition device that can improve recognition accuracy.

The speech recognition system according to the first invention comprises: acquisition means for acquiring at least one piece of speech data; extraction means for extracting, by phoneme recognition, a start silence section and an end silence section included in the speech data, and for extracting, by the phoneme recognition, an array of phonemes and pause sections sandwiched between the start silence section and the end silence section as recognition target data; a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information; detection means for referring to the character string database, selecting the phoneme information corresponding to the array of the recognition target data, and detecting, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information; a grammar database storing grammar information, acquired in advance, that indicates an arrangement order of the class IDs; calculation means for referring to the grammar database, generating a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculating, using the grammar database, a reliability for the character string information of each piece of candidate data included in the sentence; selection means for selecting evaluation data from the plurality of pieces of candidate data based on the reliability; and generation means for generating recognition information based on the evaluation data.

The speech recognition system according to the second invention is the system of the first invention, wherein the extraction means extracts a plurality of pieces of the recognition target data from one piece of the speech data, and the plurality of pieces of recognition target data each have a different array of the phonemes and the pause sections.

The speech recognition system according to the third invention is the system of the first or second invention, wherein the calculation means generates a plurality of the sentences, and the plurality of sentences differ from one another in at least one of the type and the combination of the candidate data.

The speech recognition system according to the fourth invention is the system of any one of the first to third inventions, further comprising a reference database storing the character string information acquired in advance, reference sentences in which the character string information is combined, and a threshold assigned to each piece of character string information, wherein the generation means has: designation means for referring to the reference database and designating, among the reference sentences, a first reference sentence corresponding to the evaluation data; and comparison means for comparing the reliability corresponding to the evaluation data with a first threshold assigned to first character string information included in the first reference sentence, and wherein the generation means generates the recognition information based on the comparison result of the comparison means.

The speech recognition system according to the fifth invention is the system of the fourth invention, further comprising update means for updating the threshold stored in the reference database based on a plurality of pieces of the candidate data and a plurality of the reliabilities.

The speech recognition system according to the sixth invention is the system of the fourth or fifth invention, further comprising reflection means for acquiring an evaluation result from a user who has evaluated the recognition information and reflecting the evaluation result in the threshold of the reference database.

The speech recognition system according to the seventh invention is the system of any one of the first to sixth inventions, wherein the acquisition means acquires condition information indicating the conditions under which the speech data was generated.

The speech recognition system according to the eighth invention is the system of the seventh invention, wherein the detection means selects, based on the condition information, the contents of the character string database to be referred to.

The speech recognition system according to the ninth invention is the system of any one of the first to eighth inventions, further comprising output means for outputting the recognition information, wherein the recognition information includes information for controlling the traveling speed of a vehicle.

The speech recognition system according to the tenth invention is the system of any one of the first to ninth inventions, wherein the pause section includes at least one of a breathing sound and lip noise.

The speech recognition system according to the eleventh invention is the system of any one of the first to tenth inventions, wherein the character string information includes the languages of two or more countries.

The speech recognition device according to the twelfth invention comprises: an acquisition unit that acquires at least one piece of speech data; an extraction unit that extracts, by phoneme recognition, a start silence section and an end silence section included in the speech data, and that extracts, by the phoneme recognition, an array of phonemes and pause sections sandwiched between the start silence section and the end silence section as recognition target data; a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information; a detection unit that refers to the character string database, selects the phoneme information corresponding to the array of the recognition target data, and detects, as candidate data, a plurality of pieces of the character string information and the class IDs associated with the selected phoneme information; a grammar database storing grammar information, acquired in advance, that indicates an arrangement order of the class IDs; a calculation unit that refers to the grammar database, generates a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculates, using the grammar database, a reliability for the character string information of each piece of candidate data included in the sentence; a selection unit that selects evaluation data from the plurality of pieces of candidate data based on the reliability; and a generation unit that generates recognition information based on the evaluation data.

According to the first to eleventh inventions, the extraction means extracts the array of phonemes and pause sections as recognition target data. The detection means selects the phoneme information corresponding to the array of the recognition target data and detects candidate data. Misrecognition can therefore be reduced compared with detecting candidate data for an array that considers only the phonemes in the recognition target data. This makes it possible to improve recognition accuracy.

Also according to the first to eleventh inventions, the character string database stores phoneme information corresponding to the array of phonemes and pause sections, and character string information associated with the phoneme information. Compared with the data that would have to be stored for pattern matching against the phonemes as a whole, this reduces data volume and simplifies data accumulation.

In particular, according to the second invention, the extraction means extracts a plurality of pieces of recognition target data from one piece of speech data. A drop in recognition accuracy can therefore be suppressed even when the acquired speech data produces variation in the array of phonemes and pause sections. This makes it possible to further improve recognition accuracy.

In particular, according to the third invention, the calculation means generates a plurality of sentences. That is, even when there are multiple patterns for combining the candidate data, sentences corresponding to all of the patterns can be generated. Misrecognition can therefore be reduced compared with, for example, a pattern-matching search method. This makes it possible to further improve recognition accuracy.

In particular, according to the fourth invention, the comparison means compares the reliability with the first threshold. Because the evaluation data, which is selected relative to the other candidate data, is also judged against a threshold, misrecognition can be further reduced. This makes it possible to further improve recognition accuracy.
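The threshold check of the fourth invention can be sketched as follows. This is a minimal illustration, not the patented implementation: the layout of `REFERENCE_DB`, the function name `generate_recognition`, the sample strings and numbers, and the rule that a reliability at or above its threshold counts as accepted are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical reference database: reference sentence -> threshold per string.
REFERENCE_DB: Dict[Tuple[str, ...], Dict[str, float]] = {
    ("speed", "up"): {"speed": 0.80, "up": 0.70},
}

def generate_recognition(
    evaluation: List[Tuple[str, float]],
    reference_db: Dict[Tuple[str, ...], Dict[str, float]] = REFERENCE_DB,
) -> Optional[str]:
    """evaluation is a list of (character string, reliability) pairs."""
    sentence = tuple(text for text, _ in evaluation)
    # Designate the reference sentence that corresponds to the evaluation data.
    thresholds = reference_db.get(sentence)
    if thresholds is None:
        return None
    # Compare each reliability with the threshold assigned to its string.
    # Assumption: a reliability below the threshold rejects the whole sentence.
    for text, reliability in evaluation:
        if reliability < thresholds[text]:
            return None
    return " ".join(sentence)

accepted = generate_recognition([("speed", 0.91), ("up", 0.75)])
rejected = generate_recognition([("speed", 0.91), ("up", 0.40)])
```

Here `accepted` holds the recognized sentence, while `rejected` is `None` because one string falls below its threshold even though it may have been the best-scoring candidate.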

In particular, according to the fifth invention, the update means updates the threshold based on the candidate data and the reliability. Compared with always using a preset threshold, recognition information suited to the quality of the acquired speech data can therefore be generated. This widens the range of environments in which the system can be used.

In particular, according to the sixth invention, the reflection means reflects the evaluation result in the threshold. When the recognition information diverges from what the user perceives, it can therefore be corrected easily. This makes sustained improvement in recognition accuracy possible.

In particular, according to the seventh invention, the acquisition means acquires condition information. That is, the acquisition means acquires, as condition information, various conditions such as the surrounding environment when the speech data is acquired, the noise included in the speech data, and the type of sound collection device that collects the speech. Each means and each database can therefore be configured according to the condition information. This makes it possible to improve recognition accuracy regardless of the environment in which the system is used.

In particular, according to the eighth invention, the detection means selects, based on the condition information, the contents of the character string database to be referred to. By storing different character string information and the like in the character string database for each piece of condition information, candidate data suited to each condition can therefore be detected. This makes it possible to improve recognition accuracy for each condition.
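One way to picture this condition-based selection is the sketch below. The entry layout, the `condition` tag values, and the convention that `None` marks an entry usable under any condition are assumptions for illustration only; the source does not specify how the database is partitioned.

```python
from typing import Dict, List, Optional

# Hypothetical character string entries tagged with the condition they suit.
STRING_DB: List[Dict[str, Optional[str]]] = [
    {"text": "speed up", "condition": "vehicle"},
    {"text": "lights on", "condition": "home"},
    {"text": "stop", "condition": None},  # assumed: usable under any condition
]

def select_entries(db: List[Dict[str, Optional[str]]],
                   condition: str) -> List[Dict[str, Optional[str]]]:
    """Keep only the entries that apply under the given condition information."""
    return [entry for entry in db if entry["condition"] in (None, condition)]

vehicle_entries = select_entries(STRING_DB, "vehicle")
vehicle_texts = [entry["text"] for entry in vehicle_entries]
```

Narrowing the database before detection means that strings irrelevant to the current environment never become candidate data.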

In particular, according to the ninth invention, the output means outputs the recognition information. That is, with the improved recognition accuracy, the system can be used, for example, to assist the user's driving. This enables application to a wide range of uses.

In particular, according to the tenth invention, the pause section includes at least one of a breathing sound and lip noise. Differences in speech data that are hard to judge from phonemes alone can therefore be judged easily, and the recognition target data can be extracted. This makes it possible to further improve recognition accuracy.

According to the twelfth invention, the extraction unit extracts the array of phonemes and pause sections as recognition target data. The detection unit selects the phoneme information corresponding to the array of the recognition target data and detects candidate data. Misrecognition can therefore be reduced compared with detecting candidate data for an array that considers only the phonemes in the recognition target data. This makes it possible to improve recognition accuracy.

Also according to the twelfth invention, the character string database stores phoneme information corresponding to the array of phonemes and pause sections, and character string information associated with the phoneme information. Compared with the data that would have to be stored for pattern matching against the phonemes as a whole, this reduces data volume and simplifies data accumulation.

FIG. 1 is a schematic diagram showing an example of the configuration of the speech recognition system in the present embodiment.
FIG. 2(a) is a schematic diagram showing an example of the configuration of the speech recognition device in the present embodiment, FIG. 2(b) is a schematic diagram showing an example of the functions of the speech recognition device in the present embodiment, and FIG. 2(c) is a schematic diagram showing an example of the generation unit in the present embodiment.
FIG. 3 is a schematic diagram showing an example of each function of the speech recognition device in the present embodiment.
FIG. 4 is a schematic diagram showing an example of the character string database, the grammar database, and the reference database.
FIG. 5(a) is a flowchart showing an example of the operation of the speech recognition system in the present embodiment, FIG. 5(b) is a flowchart showing an example of the generation means, and FIG. 5(c) is a flowchart showing an example of the reflection means.
FIG. 6 is a schematic diagram showing an example of the update means.
FIG. 7(a) is a flowchart showing an example of the update means, and FIG. 7(b) is a flowchart showing an example of the setting means.
FIG. 8 is a schematic diagram showing an example of the condition information.
FIG. 9 is a schematic diagram showing a modification of the reference database.

Hereinafter, an example of a speech recognition system and a speech recognition device according to an embodiment of the present invention will be described with reference to the drawings.

(Configuration of speech recognition system 100)
An example of the configuration of the speech recognition system 100 in the present embodiment will be described with reference to FIGS. 1 to 4. FIG. 1 is a schematic diagram showing the overall configuration of the speech recognition system 100 in the present embodiment.

The speech recognition system 100 refers to a character string database and a grammar database built for the user's application, and generates recognition information corresponding to the user's speech. The character string database stores the character strings the user is expected to utter (character string information) and the phonemes corresponding to each character string (phoneme information). Accumulating such character strings and phonemes therefore allows recognition information suited to the application to be generated, so the system can be deployed in a variety of applications.
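The character string database and the candidate lookup can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Entry` layout, the pause symbol `"sp"`, the sample strings and phoneme spellings, and the simple exact sub-array match in `detect_candidates` are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

PAUSE = "sp"  # hypothetical symbol standing for a pause section

@dataclass(frozen=True)
class Entry:
    text: str                  # character string information
    phonemes: Tuple[str, ...]  # phoneme information (pause sections included)
    class_id: int              # class ID assigned to the character string

# Hypothetical database contents for a vehicle-control application.
STRING_DB: List[Entry] = [
    Entry("speed", ("s", "p", "i", "d", PAUSE), 1),
    Entry("up", ("a", "p"), 2),
    Entry("down", ("d", "a", "u", "n"), 2),
]

def detect_candidates(target: Sequence[str],
                      db: List[Entry]) -> List[Tuple[int, Entry]]:
    """Return (start index, entry) for every entry whose phoneme
    information matches a contiguous part of the target array."""
    target = tuple(target)
    hits = []
    for start in range(len(target)):
        for entry in db:
            end = start + len(entry.phonemes)
            if target[start:end] == entry.phonemes:
                hits.append((start, entry))
    return hits

# Recognition target data: an array of phonemes and pause sections.
target = ["s", "p", "i", "d", PAUSE, "a", "p"]
candidates = detect_candidates(target, STRING_DB)
```

Because the pause symbol is part of the stored phoneme information, an entry only matches when the pause occurs in the same position in the target array, which is the point the description makes about pause-aware matching.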

In particular, the inventors found that classifying the phoneme arrays (phoneme information) stored in the character string database with the pause sections contained in speech taken into account can dramatically improve the accuracy of the recognition information for that speech.

The grammar database stores the grammar information needed to generate sentences that combine character string information. The grammar information includes a plurality of pieces of information, each indicating an arrangement order of the class IDs associated with the respective pieces of character string information. By referring to the grammar database, the pieces of character string information can be combined easily once they have been detected from the phoneme arrays classified with the pause sections taken into account. Recognition information that takes the grammar of the speech into account can thus be generated. As a result, speech recognition that reflects the content of what the user says can be achieved with high accuracy.
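Sentence generation from class-ID orders can be sketched as follows. This is only an illustration: the contents of `GRAMMAR_DB`, the candidate tuples, and the use of `itertools.product` to enumerate every combination are assumptions. The reliability calculation is omitted because this part of the description does not give its formula.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical grammar information: each tuple is one allowed order of class IDs.
GRAMMAR_DB: List[Tuple[int, ...]] = [(1, 2), (2, 1)]

# Hypothetical candidate data: (character string, class ID).
candidates: List[Tuple[str, int]] = [("speed", 1), ("seed", 1), ("up", 2)]

def generate_sentences(cands: List[Tuple[str, int]],
                       grammar_db: List[Tuple[int, ...]]) -> List[Tuple[str, ...]]:
    """Combine candidates into every sentence allowed by the grammar."""
    by_class: Dict[int, List[str]] = {}
    for text, class_id in cands:
        by_class.setdefault(class_id, []).append(text)
    sentences = []
    for order in grammar_db:
        # One pool of candidate strings per slot in the class-ID order.
        pools = [by_class.get(class_id, []) for class_id in order]
        # product() yields nothing if any pool is empty, so orders that
        # cannot be filled produce no sentence.
        sentences.extend(product(*pools))
    return sentences

sentences = generate_sentences(candidates, GRAMMAR_DB)
```

Enumerating every combination keeps competing candidates such as "speed" and "seed" alive until a later step scores them, rather than committing to one match early.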

As shown in FIG. 1, the speech recognition system 100 includes a speech recognition device 1. In the speech recognition system 100, the speech of a user or the like is collected using, for example, a sound collection device 2, and recognition information corresponding to the speech is generated using the speech recognition device 1. In addition to text data obtained by converting the speech into a character string, the recognition information includes, for example, information for controlling a control device 3 or the like and audio information for replying to the user.

In the speech recognition system 100, the sound collection device 2 and the control device 3 may be connected to the speech recognition device 1 directly or, for example, via a public communication network 4. A server 5 and a user terminal 6 held by a user or the like may also be connected to the speech recognition device 1, for example via the public communication network 4.

<Speech recognition device 1>
FIG. 2(a) is a schematic diagram showing an example of the configuration of the speech recognition device 1. A single-board computer such as a Raspberry Pi (registered trademark) is used as the speech recognition device 1; alternatively, an electronic device such as a personal computer (PC) may be used. The speech recognition device 1 includes a housing 10, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, a storage unit 104, and I/Fs 105 to 107. The components 101 to 107 are connected by an internal bus 110.

The CPU 101 controls the entire speech recognition device 1. The ROM 102 stores the operation code of the CPU 101. The RAM 103 is a work area used while the CPU 101 operates. The storage unit 104 stores various information such as the character string database. For example, an SD memory card, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like is used as the storage unit 104.

The I/F 105 is an interface for exchanging various information with the sound collection device 2, the control device 3, the public communication network 4, and the like. The I/F 106 is an interface for exchanging various information with an input part 108 that is connected depending on the application. A keyboard, for example, is used as the input part 108, and a user who manages the speech recognition system 100 inputs or selects various information, control commands for the speech recognition device 1, and the like via the input part 108. The I/F 107 is an interface for exchanging various information with an output part 109 that is connected depending on the application. The output part 109 outputs the various information stored in the storage unit 104, the recognition information, the processing status of the speech recognition device 1, and the like. A display is used as the output part 109 and may be, for example, a touch panel. In that case, the output part 109 may include the input part 108. The I/Fs 105 to 107 may, for example, be one and the same interface.

FIG. 2(b) is a schematic diagram showing an example of the functions of the speech recognition device 1. The speech recognition device 1 includes an acquisition unit 11, an extraction unit 12, a storage unit 13, a detection unit 14, a calculation unit 15, a selection unit 16, a generation unit 17, and an output unit 18. The speech recognition device 1 may also include, for example, a reflection unit 19. Each function shown in FIG. 2(b) is realized by the CPU 101 executing a program stored in the storage unit 104 or the like, using the RAM 103 as a work area. Some of the functions may be realized using a known speech recognition engine such as Julius or a known general-purpose programming language such as Python to perform processing such as extracting and generating various data. Some of the functions may also be controlled by artificial intelligence. Here, "artificial intelligence" may be based on any well-known artificial intelligence technology.

<Acquisition unit 11>
The acquisition unit 11 acquires at least one piece of audio data. For example, the acquisition unit 11 acquires, as audio data, pulse-modulated data such as PCM (pulse code modulation) data obtained from an audio signal picked up by the sound collection device 2 or the like. Depending on the type of the sound collection device 2, the acquisition unit 11 may, for example, acquire a plurality of pieces of audio data at once.

The acquisition unit 11 may, for example, acquire a plurality of pieces of audio data simultaneously. In this case, a plurality of sound collection devices 2 may be connected to the speech recognition device 1, or a sound collection device 2 capable of picking up a plurality of voices simultaneously may be connected. In addition to audio data, the acquisition unit 11 acquires various information (data) from the sound collection device 2 and the like via, for example, the I/F 105 and the I/F 106.

<Extraction unit 12>
The extraction unit 12 extracts the start silence section and the end silence section contained in the audio data. The extraction unit 12 also extracts, as recognition target data, the array of phonemes and pause sections sandwiched between the start silence section and the end silence section.

The extraction unit 12 extracts, for example, a non-speech state (silence section) lasting from 100 milliseconds to 1 second as the start silence section or the end silence section. The extraction unit 12 assigns phonemes and pause sections to the section sandwiched between the start silence section and the end silence section (the speech section), and extracts the resulting array of phonemes and pause sections as the recognition target data.

The phonemes are the known ones, comprising vowels and consonants. A pause section is shorter than the start and end silence sections and is, for example, about as long as a phoneme section. The extraction unit 12 may, for example, determine the length of each phoneme or of the recognition target data as a whole, set the length of the pause section accordingly, and then extract the array to which phonemes and pause sections have been assigned as the recognition target data. That is, the extraction unit 12 may set the length of the pause section according to the length of the phonemes or of the recognition target data as a whole.

As shown in FIG. 3, for example, the extraction unit 12 extracts the start silence section "silB" and the end silence section "silE", and extracts the array "a/k/a/r/i/*/w/o/*/ts/u/k/e/t/e" in the speech section (where * denotes a pause section) as the recognition target data. The extraction unit 12 may, for example, extract from one piece of audio data a plurality of pieces of recognition target data, each with a different array. Speech recognition can then take into account the variation that arises when the extraction unit 12 assigns phonemes and pause sections. For example, by extracting from one to five pieces of recognition target data, the extraction unit 12 can raise recognition accuracy while keeping processing time down. The extraction unit 12 may also extract, as recognition target data, an array that includes at least one of the start silence section and the end silence section.
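
The extraction step can be illustrated with a minimal sketch. It assumes the audio has already been segmented into labeled stretches with durations; how that segmentation is performed is not specified here. The names `Segment`, `is_boundary_silence`, and `extract_recognition_target`, and the label "sil" for a non-speech stretch, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str        # phoneme symbol, or "sil" for a non-speech stretch (assumed label)
    duration_ms: int


# Bounds for a start/end silence section: 100 ms to 1 s, as in the text.
SILENCE_MIN_MS, SILENCE_MAX_MS = 100, 1000


def is_boundary_silence(seg):
    """True if the segment is long enough to count as a start/end silence section."""
    return seg.label == "sil" and SILENCE_MIN_MS <= seg.duration_ms <= SILENCE_MAX_MS


def extract_recognition_target(segments):
    """Return the phoneme/pause array between the first and last boundary silence."""
    idx = [i for i, s in enumerate(segments) if is_boundary_silence(s)]
    if len(idx) < 2:
        return None  # no start/end silence pair found
    start, end = idx[0], idx[-1]
    # Non-speech stretches inside the speech section are written as pauses "*".
    return ["*" if s.label == "sil" else s.label for s in segments[start + 1:end]]
```

Joining the returned list with "/" gives an array in the notation used above, with silB and silE themselves left out.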

A pause section may include, for example, at least one of a breathing sound and lip noise. That is, the extraction unit 12 may extract at least one of a breathing sound and lip noise contained in a pause section as recognition target data. In this case, including at least one of breathing sounds and lip noise in the phoneme information stored in the character string database described later makes it possible to generate more accurate recognition information.

<Storage unit 13, databases>
The storage unit 13 writes various data to the storage unit 104 and reads various data from the storage unit 104. The storage unit 13 retrieves the various databases stored in the storage unit 104 as needed.

As shown in FIG. 4, for example, the storage unit 104 stores a character string database and a grammar database, and may also store, for example, a reference database.

The character string database stores character string information acquired in advance, phoneme information associated with the character string information, and class IDs assigned to the character string information. The character string database is used when the detection unit 14 detects candidate data.

The phoneme information includes a plurality of phoneme arrays that a user is expected to utter (for example, first phoneme information "a/k/a/r/i"). A phoneme array may correspond to a section delimited by pause sections, or may itself contain a pause section, as in "h/i/*/i/t/e"; it is set as desired according to the conditions of use. The phoneme information may also include, for example, at least one of the start silence section and the end silence section.

The character string information includes the character string associated with each phoneme array (for example, first character string information "明かり" (akari, "light")). The character string information may therefore consist of meaningful units of expression such as words and morphemes, or of character strings that carry no meaning. In addition to Japanese, the character string information may cover, for example, two or more languages, and may include character strings such as numerals and abbreviations used at the place of use. Different phoneme arrays may also be associated with the same character string information.

The class ID is associated with the character string information and indicates the position in an array at which the word or other unit of the character string information is expected to be used grammatically (for example, first class ID "1"). For example, when the grammar (sentence) of an utterance can be expressed as "target" + "particle" + "action", the class ID "1" is used for character string information serving as the "target", "2" for character string information serving as the "particle", and "3" for character string information serving as the "action".

The grammar database stores grammar information, acquired in advance, that indicates the order in which a plurality of class IDs are arranged. The grammar database is used when the calculation unit 15 calculates the reliability. When, for example, the first grammar information "1, 2, 3" is used, a sentence of the form "target" + "particle" + "action" can be generated as a candidate for the utterance. The grammar information includes a plurality of class ID orderings, such as first grammar information "1, 2, 3", second grammar information "4, 5, 6", and third grammar information "2, 1, 3".

The reference database stores character string information acquired in advance, reference sentences that combine the character strings, and a threshold assigned to each piece of character string information; it may also store, for example, phoneme information associated with the character string information. The reference database is used as needed when the generation unit 17 generates recognition information. The data volume can be reduced by, for example, making the character string information and phoneme information stored in the reference database identical to those stored in the character string database.

<Detection unit 14>
The detection unit 14 refers to the character string database and selects the phoneme information corresponding to the phoneme array of the recognition target data. The detection unit 14 also detects, as candidate data, a plurality of pairs of character string information and class ID associated with the selected phoneme information.

As shown in FIG. 3, for example, the detection unit 14 selects the phoneme information "a/k/a/r/i", "w/o", and "ts/u/k/e/t/e" corresponding to the recognition target data, and detects the character string information and class IDs associated with them, "明かり/1" (akari, "light"), "を/2" (the particle wo), and "つけて/3" (tsukete, "turn on"), as candidate data. The number of pieces of candidate data increases with the number of pieces of recognition target data. Each phoneme array may be split in advance at each pause section and classified, or may be classified on the basis of phoneme information that includes both phonemes and pause sections.
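
The lookup described for the detection unit 14 can be sketched as follows. This is only an illustration: the dictionary layout `STRING_DB`, the function name `detect_candidates`, and the use of exact string matching are assumptions, since the text does not state how a "corresponding" phoneme array is determined.

```python
# Hypothetical character string database:
# phoneme array -> list of (character string, class ID) pairs.
STRING_DB = {
    "a/k/a/r/i": [("明かり", 1)],
    "w/o": [("を", 2)],
    "ts/u/k/e/t/e": [("つけて", 3)],
}


def detect_candidates(target, db=STRING_DB):
    """Split the recognition target data at pauses ("*") and look up each phoneme run."""
    candidates = []
    for run in target.split("/*/"):
        # Runs with no entry in the database simply yield no candidate.
        candidates.extend(db.get(run, []))
    return candidates


print(detect_candidates("a/k/a/r/i/*/w/o/*/ts/u/k/e/t/e"))
# [('明かり', 1), ('を', 2), ('つけて', 3)]
```

Splitting at pauses corresponds to the first of the two classification options above; phoneme information that itself contains a pause, such as "h/i/*/i/t/e", would need a different matching strategy.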

<Calculation unit 15>
The calculation unit 15 refers to the grammar database and generates sentences that combine a plurality of pieces of candidate data according to the grammar information. The calculation unit 15 also calculates a reliability for each piece of candidate data contained in a sentence.

As shown in FIG. 3, for example, the calculation unit 15 matches the class ID of each piece of candidate data, "明かり/1", "を/2", and "つけて/3", to the class IDs in the first grammar information "1, 2, 3" and generates the sentence "明かり/1" "を/2" "つけて/3" ("turn on the light"). If the grammar information were instead "3, 1, 2", the sentence generated would be "つけて/3" "明かり/1" "を/2".
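
As an illustration of how candidate data might be arranged according to grammar information, the sketch below fills each class ID slot of one grammar with the matching candidates and enumerates every combination. The function name `generate_sentences` and the exhaustive enumeration are assumptions; an engine such as Julius would do this internally and differently.

```python
from itertools import product


def generate_sentences(candidates, grammar):
    """candidates: list of (text, class_id); grammar: tuple of class IDs in order."""
    by_class = {}
    for text, cid in candidates:
        by_class.setdefault(cid, []).append((text, cid))
    # One slot per class ID in the grammar; an empty slot yields no sentence.
    slots = [by_class.get(cid, []) for cid in grammar]
    return [list(combo) for combo in product(*slots)]


candidates = [("明かり", 1), ("を", 2), ("つけて", 3), ("弾いて", 3)]
for sentence in generate_sentences(candidates, (1, 2, 3)):
    print(sentence)
```

With two candidates for class ID 3, the grammar "1, 2, 3" yields two sentences, and the grammar "3, 1, 2" yields the same candidates in the reordered form.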

The calculation unit 15 calculates the reliabilities "0.982", "1.000", and "0.990" for the candidate data "明かり/1", "を/2", and "つけて/3" contained in the sentence. The calculation unit 15 calculates a reliability in the range 0.000 to 1.000 for each piece of candidate data. The calculation unit 15 may also, for example, assign each sentence a rank indicating its priority (rank 1 to rank 5 in FIG. 3). Assigning ranks allows sentences ranked below a chosen rank (for example, rank 6 and below) to be excluded from evaluation. This reduces the number of pieces of candidate data selected as the evaluation data described later and thus improves processing speed.

When, for example, sentences with different content contain the same candidate data, the calculation unit 15 may calculate a different reliability for that candidate data in each sentence. For example, if the reliabilities "0.982", "1.000", and "0.990" are calculated for the candidate data "明かり/1", "を/2", and "つけて/3" in a first sentence, the reliabilities "0.942", "1.000", and "0.023" may be calculated for the candidate data "明かり/1", "を/2", and "弾いて/3" (hiite, "play") in a second sentence. That is, even the same candidate data "明かり" may receive different reliabilities depending on the content of the sentence and the order of combination.

The reliability may be a preset value or, for example, a relative value that depends on the type and number of pieces of candidate data detected by the detection unit 14. For example, a lower reliability can be calculated as the number of types of candidate data for one class ID increases.
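
The text gives no formula for such a relative value. Purely as an illustration of "more candidate types per class ID, lower reliability", the sketch below uses 1/n, where n is the number of distinct candidates sharing a class ID. Both the formula and the name `relative_reliability` are assumptions, not the method of the embodiment.

```python
from collections import Counter


def relative_reliability(candidates):
    """candidates: iterable of (text, class_id). Returns {(text, class_id): reliability}."""
    distinct = set(candidates)
    per_class = Counter(cid for _, cid in distinct)
    # Illustrative choice only: reliability falls as 1/n with n candidates per class ID.
    return {(text, cid): round(1.0 / per_class[cid], 3) for text, cid in distinct}


scores = relative_reliability([("明かり", 1), ("つけて", 3), ("弾いて", 3)])
print(scores[("明かり", 1)], scores[("つけて", 3)])  # 1.0 0.5
```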

<Selection unit 16>
The selection unit 16 selects evaluation data from the plurality of pieces of candidate data on the basis of the reliability. For example, the selection unit 16 selects, as evaluation data, the candidate data with the highest calculated reliability for each class ID. Of the candidate data "つけて/3/0.990" and "弾いて/3/0.023", which share the class ID "3", for example, the selection unit 16 selects "つけて/3/0.990", which has the highest reliability, as evaluation data. The selection unit 16 may also select a plurality of pieces of candidate data as evaluation data for one class ID; in that case, the generation unit 17 described later may choose one of them.
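
The per-class selection can be sketched in a few lines. The tuple layout (text, class ID, reliability) and the function name `select_evaluation_data` are assumptions made for the example; the reliabilities are the ones quoted in the text.

```python
def select_evaluation_data(scored):
    """scored: iterable of (text, class_id, reliability). Keep the best entry per class ID."""
    best = {}
    for text, cid, rel in scored:
        if cid not in best or rel > best[cid][2]:
            best[cid] = (text, cid, rel)
    # Return the winners ordered by class ID.
    return [best[cid] for cid in sorted(best)]


scored = [
    ("明かり", 1, 0.982), ("を", 2, 1.000), ("つけて", 3, 0.990),
    ("明かり", 1, 0.942), ("を", 2, 1.000), ("弾いて", 3, 0.023),
]
print(select_evaluation_data(scored))
# [('明かり', 1, 0.982), ('を', 2, 1.0), ('つけて', 3, 0.99)]
```

Because the same candidate may carry different reliabilities in different sentences, the sketch simply keeps the highest value seen for each class ID.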

<Generation unit 17>
The generation unit 17 generates recognition information on the basis of the evaluation data. The generation unit 17 may, for example, convert the evaluation data into text form to produce the recognition information, or convert it into an audio data format or a control data format for controlling the control device 3. That is, the recognition information includes information for controlling the control device 3 (for example, information for controlling the travelling speed of a vehicle). Known techniques can be used to convert the evaluation data into text, audio data, or control data format, and a database or the like that stores each data format may be used as needed.

The generation unit 17 may include, for example, a designation unit 17a and a comparison unit 17b. The designation unit 17a refers to the reference database and designates, from among the reference sentences, the first reference sentence corresponding to the evaluation data. When, for example, "明かり/1", "を/2", and "つけて/3" are selected as the evaluation data, the designation unit 17a designates the first reference sentence shown in FIG. 4. In this case, character strings equal to the candidate data contained in the evaluation data are designated as the character string information (first character string information) contained in the first reference sentence.

The comparison unit 17b compares the reliability corresponding to the evaluation data with the threshold (first threshold) assigned to the first character string information. For example, the comparison unit 17b checks whether the reliabilities "0.982", "1.000", and "0.990" of the evaluation data "明かり", "を", and "つけて" are equal to or greater than the first thresholds "0.800", "0.900", and "0.880" of the first character string information "明かり", "を", and "つけて". The generation unit 17 then generates the recognition information on the basis of the comparison result. For example, the generation unit 17 may generate the recognition information when the reliability is equal to or greater than the first threshold, or may generate different recognition information depending on whether the reliability is equal to or greater than the first threshold or below it.
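
The comparison against the first thresholds amounts to a per-string check, sketched below with the values quoted above. The function name `passes_thresholds` and the dictionary layout are assumptions; requiring every string to pass is also an assumption, since the text does not say how the individual comparisons are combined.

```python
def passes_thresholds(evaluation, thresholds):
    """evaluation: list of (text, reliability); thresholds: dict of text -> first threshold.

    Returns True only if every reliability is at or above its threshold.
    A text missing from `thresholds` raises KeyError.
    """
    return all(rel >= thresholds[text] for text, rel in evaluation)


first_thresholds = {"明かり": 0.800, "を": 0.900, "つけて": 0.880}
evaluation = [("明かり", 0.982), ("を", 1.000), ("つけて", 0.990)]
print(passes_thresholds(evaluation, first_thresholds))  # True
```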

<Output unit 18>
The output unit 18 outputs the recognition information. The output unit 18 outputs the recognition information to the control device 3 and the like via the I/F 105, and may, for example, output it to the output portion 109 via the I/F 107. In addition to the recognition information, the output unit 18 outputs various information (data) to the control device 3 and the like via, for example, the I/F 105 and the I/F 107.

<Reflection unit 19>
The reflection unit 19 acquires the result of an evaluation of the recognition information by a user or the like and reflects it in the thresholds of the reference database. When, for example, the recognition information is evaluated poorly (that is, when the recognition information obtained for the audio data diverges from what the user or the like wanted), the reflection unit 19 changes the thresholds so as to improve the recognition information. The evaluation result may be reflected in the thresholds using, for example, a known machine learning method.

<Sound collection device 2>
The sound collection device 2 may include, in addition to a known microphone, a DSP (digital signal processor), for example. When it includes a DSP, the sound collection device 2 generates pulse-modulated data such as PCM data from the audio signal picked up by the microphone and transmits the data to the speech recognition device 1.

The sound collection device 2 may be connected to the speech recognition device 1 directly or, for example, via the public communication network 4. When the sound collection device 2 has only a microphone, the speech recognition device 1 may generate the pulse-modulated data.

<Control device 3>
The control device 3 is a device that can be controlled on the basis of recognition information received from the speech recognition device 1. Examples of the control device 3 include a lighting device such as an LED, an in-vehicle device (for example, a device directly connected to the brake system to control the travelling speed of the vehicle), a vending machine whose display language can be changed, a locking device, audio equipment, and a massage machine. The control device 3 may be connected to the speech recognition device 1 directly or, for example, via the public communication network 4.

<Public communication network 4>
The public communication network 4 is the Internet or a similar network to which the speech recognition device 1 is connected via a communication circuit. The public communication network 4 may be a so-called optical fiber communication network. The public communication network 4 is not limited to a wired network and may be realized by a known communication network such as a wireless network.

<Server 5>
The server 5 stores the various information described above. The server 5 accumulates, for example, various information sent via the public communication network 4. The server 5 may, for example, store the same information as the storage unit 104 and exchange various information with the speech recognition device 1 via the public communication network 4. That is, the speech recognition device 1 may use the server 5 in place of the storage unit 104. In particular, having the server 5 update the databases described above keeps the update functions and stored data volume of the speech recognition device 1 to a minimum. The speech recognition device 1 can therefore be used without a permanent connection to the public communication network 4, connecting only when an update is needed. This greatly widens the range of places in which the speech recognition device 1 can be used.

<User terminal 6>
The user terminal 6 is, for example, a terminal owned by a user of the speech recognition system 100. A mobile phone (mobile terminal) is mainly used as the user terminal 6; alternatively, it may be an electronic device such as a smartphone, tablet terminal, wearable terminal, personal computer, or IoT (Internet of Things) device, or may be embodied in any other electronic device. The user terminal 6 may be connected to the speech recognition device 1 via the public communication network 4 or, for example, directly. A user may, for example, obtain recognition information from the speech recognition device 1 via the user terminal 6, or use the user terminal 6 in place of the sound collection device 2 to pick up speech.

(Example of operation of the speech recognition system 100)
Next, an example of the operation of the speech recognition system 100 in this embodiment will be described. FIG. 5A is a flowchart showing an example of the operation of the speech recognition system 100 in this embodiment.

<Acquisition means S110>
First, at least one piece of audio data is acquired (acquisition means S110). The acquisition unit 11 acquires audio data from the sound collection device 2 or the like and, for example, saves it in the storage unit 104 via the storage unit 13.

<Extraction means S120>
Next, recognition target data is extracted (extraction means S120). The extraction unit 12, for example, retrieves the audio data from the storage unit 104 via the storage unit 13 and extracts the start silence section and the end silence section contained in it. The extraction unit 12 also extracts, as recognition target data, the array of phonemes and pause sections sandwiched between the start silence section and the end silence section, and, for example, saves the recognition target data in the storage unit 104 via the storage unit 13. The extraction unit 12 may acquire a plurality of pieces of audio data at once.

The extraction unit 12 extracts, for example, a plurality of pieces of recognition target data from one piece of audio data. Each of these has a different array of phonemes and pause sections (for example, array A to array C in FIG. 3). The extraction unit 12 may extract the plurality of pieces of recognition target data by, for example, setting different conditions for each, or from within the range of variation that arises under identical conditions.

When, for example, a pause section includes at least one of a breathing sound and lip noise, the extraction unit 12 may extract an array that includes at least one of the breathing sound and the lip noise as recognition target data.

<Detection means S130>
Next, candidate data is detected on the basis of the recognition target data (detection means S130). The detection unit 14, for example, retrieves the recognition target data from the storage unit 104 via the storage unit 13. The detection unit 14 refers to the character string database and selects the phoneme information corresponding to the array of the recognition target data. The detection unit 14 also detects, as candidate data, a plurality of pairs of character string information and class ID associated with the selected phoneme information, and, for example, saves the candidate data in the storage unit 104 via the storage unit 13. The array of the recognition target data here refers to, for example, the array of phonemes between a pair of pause sections, and another pause section may lie between that pair.

<Calculation means S140>
Next, the reliability corresponding to each piece of candidate data is calculated (calculation means S140). The calculation unit 15, for example, retrieves the candidate data from the storage unit 104 via the storage unit 13. The calculation unit 15 refers to the grammar database and generates sentences that combine a plurality of pieces of candidate data according to the grammar information. The calculation unit 15 also calculates a reliability for each piece of candidate data contained in a sentence and, for example, saves the candidate data and reliabilities in the storage unit 104 via the storage unit 13. A known speech recognition engine such as Julius may, for example, be used as the calculation unit 15 to generate the sentences and calculate the reliabilities.

The calculation unit 15 can generate a plurality of sentences according to the types of grammar information in the grammar database. By selecting the type of grammar information, the calculation unit 15 can also perform speech recognition suited to the situation with high accuracy.

<Selection means S150>
Next, evaluation data is selected on the basis of the reliability (selection means S150). The selection unit 16, for example, retrieves the candidate data and reliabilities from the storage unit 104 via the storage unit 13. The selection unit 16 selects as evaluation data, for example, the candidate data with the highest calculated reliability for each class ID, and, for example, saves the evaluation data in the storage unit 104 via the storage unit 13.

<Generation means S160>
Next, recognition information is generated on the basis of the evaluation data (generation means S160). The generation unit 17, for example, retrieves the evaluation data from the storage unit 104 via the storage unit 13. The generation unit 17 converts the evaluation data into the desired data form using, for example, the known techniques mentioned above, and generates it as recognition information.

生成手段Ｓ１６０は、例えば図５（ｂ）に示すように、指定手段Ｓ１６１と、比較手段Ｓ１６２とを有してもよい。 The generation unit S160 may include a designation unit S161 and a comparison unit S162 as shown in FIG. 5B, for example.

指定手段Ｓ１６１は、評価データに対応する第１参照センテンスを指定する。指定部１７ａは、参照データベースを参照し、参照センテンスのうち、評価データに対応する第１参照センテンスを指定する。 The designation unit S161 designates the first reference sentence corresponding to the evaluation data. The designation unit 17a refers to the reference database and designates the first reference sentence corresponding to the evaluation data among the reference sentences.

比較手段Ｓ１６２は、評価データに対応する信頼度と、第１参照センテンスに含まれる第１文字列情報に付与された第１閾値とを比較する。比較部１７ｂは、例えば図３に示すように、評価データの信頼度が第１閾値以上の場合に、認識が正しいと判断してもよい。この後、比較部１７ｂの判断（比較結果）に基づき、認識情報が生成される。なお、比較部１７ｂにおいて評価データの信頼度が第１閾値未満となり、認識が誤っていると判断した場合、そのまま終了するか、抽出手段Ｓ１２０から再度実施するほか、例えば利用者等に再度音声を発するように促す認識情報を生成してもよい。 The comparison unit S162 compares the reliability corresponding to the evaluation data with the first threshold given to the first character string information included in the first reference sentence. For example, as illustrated in FIG. 3, the comparison unit 17 b may determine that the recognition is correct when the reliability of the evaluation data is equal to or higher than the first threshold value. Thereafter, recognition information is generated based on the determination (comparison result) of the comparison unit 17b. When the comparison unit 17b determines that the reliability of the evaluation data is less than the first threshold value and the recognition is incorrect, the comparison unit 17b terminates the process as it is or performs the process again from the extraction unit S120. Recognition information that prompts the user to emit may be generated.
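The comparison in the comparison means S162 can be sketched as follows. The dictionary layout of the returned recognition information and the "retry" outcome (standing in for the prompt to speak again) are assumptions made for illustration; only the rule "correct when the reliability is equal to or higher than the first threshold" comes from the description.

```python
def is_recognition_correct(reliability, first_threshold):
    """Recognition is judged correct when reliability >= first threshold."""
    return reliability >= first_threshold


def generate_recognition_info(text, reliability, first_threshold):
    """Build hypothetical recognition information from the comparison result."""
    if is_recognition_correct(reliability, first_threshold):
        return {"status": "accepted", "text": text}
    # One of the fallbacks mentioned in the description: ask the user
    # to speak again instead of acting on a doubtful result.
    return {"status": "retry", "text": None}


accepted = generate_recognition_info("turn on", 0.982, 0.9)
rejected = generate_recognition_info("right", 0.655, 0.9)
```

A reliability exactly equal to the threshold is accepted, matching the "equal to or higher than" wording.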

<Output means S170>
Thereafter, the recognition information is output as necessary (output means S170). The output unit 18 displays the recognition information on the output portion 109 via the I/F 107, and also outputs recognition information for controlling the control device 3 or the like via the I/F 105, for example.

<Reflection means S180>
Note that, for example, an evaluation result from a user or the like who has evaluated the recognition information may be acquired and reflected in the threshold of the reference database (reflection means S180). In this case, the reflection unit 19 acquires, via the acquisition unit 11, the evaluation result created by the user or the like. Based on an evaluation value or the like included in the evaluation result, the reflection unit 19 changes the threshold so that the result of the comparison in the comparison means S162 improves (that is, so that recognition accuracy improves).

Note that the reflection unit 19 may reflect the evaluation result in at least one of the character string database and the grammar database in addition to the reference database, for example. The calculation unit 15 may also reflect the evaluation result in the calculation of the reliability.

This completes the operation of the speech recognition system 100 in the present embodiment.

According to the speech recognition system 100 of the present embodiment, the extraction means S120 extracts the arrangement of phonemes and pause sections as recognition target data. The detection means S130 selects the phoneme information corresponding to the arrangement in the recognition target data and detects candidate data. Misrecognition can therefore be reduced compared with the case where candidate data is detected for an arrangement that considers only the phonemes in the recognition target data. This makes it possible to improve recognition accuracy.

In addition, because recognition accuracy can be improved, there is no need to perform the prior voice input that is otherwise used to improve accuracy. Here, prior voice input refers to speech uttered to start speech recognition before the voice data is acquired. While using prior voice input can improve recognition accuracy, there is a concern that it reduces convenience. In this respect, the speech recognition system 100 of the present embodiment can improve convenience by omitting prior voice input.

Note that, in the speech recognition system 100 of the present embodiment, prior voice input may still be performed as necessary. This makes it possible to further improve recognition accuracy.

Further, according to the speech recognition system 100 of the present embodiment, the character string database stores phoneme information corresponding to the arrangement of phonemes and pause sections, and character string information associated with the phoneme information. Compared with the data that would be stored to perform pattern matching against entire phonemes, this reduces the data volume and simplifies data accumulation.

In particular, by selecting the character string information stored in the character string database in light of the environment in which the speech recognition system 100 is used, the data volume can be reduced; for example, the system need not be connected to the public communication network 4, which broadens the range of use. The time from acquiring the voice data to generating the recognition information can also be greatly shortened.

Further, according to the speech recognition system 100 of the present embodiment, the extraction means S120 extracts a plurality of pieces of recognition target data from one piece of voice data. A reduction in recognition accuracy can therefore be suppressed even when the acquired voice data shows variation in the arrangement of phonemes and pause sections. This makes it possible to further improve recognition accuracy.

Further, according to the speech recognition system 100 of the present embodiment, the calculation means S140 generates a plurality of sentences. That is, even when there are a plurality of patterns for combining the candidate data, sentences corresponding to all of the patterns can be generated. Misrecognition can therefore be reduced compared with, for example, a pattern-matching search method. This makes it possible to further improve recognition accuracy.

Further, according to the speech recognition system 100 of the present embodiment, the comparison means S162 compares the reliability with the first threshold. Because the evaluation data, which is selected relatively from among the plurality of pieces of candidate data, is also judged against a threshold, misrecognition can be further reduced. This makes it possible to further improve recognition accuracy.

Further, according to the speech recognition system 100 of the present embodiment, the reflection means S180 reflects the evaluation result in the threshold. Therefore, when the recognition information deviates from what the user perceives, improvements can be made easily. This enables sustained improvement in recognition accuracy.

Further, according to the speech recognition system 100 of the present embodiment, the output means S170 outputs the recognition information. As described above, the speech recognition system 100 of the present embodiment can generate recognition information with higher accuracy than a conventional system. Therefore, when the control device 3 or the like is controlled based on the recognition information, malfunction of the control device 3 or the like can be greatly suppressed. For example, even when the speech recognition system 100 is used to control the brakes of a vehicle, accuracy sufficient not to interfere with normal driving can be achieved. That is, with the improved recognition accuracy, the system can be used for purposes such as driving assistance for the user. This allows application to a wide range of uses.

Further, according to the speech recognition system 100 of the present embodiment, the pause section includes at least one of a breathing sound and lip noise. Differences in voice data that are difficult to judge from phonemes alone can therefore be judged easily, and the recognition target data can be extracted. This makes it possible to further improve recognition accuracy.

According to the speech recognition device 1 of the present embodiment, the extraction unit 12 extracts the arrangement of phonemes and pause sections as recognition target data. The detection unit 14 selects the phoneme information corresponding to the arrangement in the recognition target data and detects candidate data. Misrecognition can therefore be reduced compared with the case where candidate data is detected for an arrangement that considers only the phonemes in the recognition target data. This makes it possible to improve recognition accuracy.

Further, according to the speech recognition device 1 of the present embodiment, the character string database stores phoneme information corresponding to the arrangement of phonemes and pause sections, and character string information associated with the phoneme information. Compared with the data that would be stored to perform pattern matching against entire phonemes, this reduces the data volume and simplifies data accumulation.

(First Modification of Configuration of Speech Recognition System 100)
Next, a first modification of the speech recognition system 100 in the present embodiment will be described. The difference between the above-described embodiment and the first modification is that the generation unit 17 includes an update unit 17c. Description of configurations that are the same as those described above is omitted.

The update unit 17c of the generation unit 17 updates the threshold stored in the reference database based on the candidate data and the reliability, as shown in FIG. 6, for example. That is, the threshold can be updated to a value that reflects the content of the candidate data and the reliability.

The update unit 17c calculates, for example, the average of the plurality of reliabilities associated with each class ID, and updates the threshold based on the calculated average.

When the threshold is updated, the calculated average may be used as the threshold as is, or a value obtained by multiplying the average by a preset coefficient may be used as the updated threshold. Alternatively, the result of an arithmetic operation (addition, subtraction, multiplication, or division) between the pre-update threshold and the value obtained by multiplying the average by the coefficient may be used as the updated threshold.

By updating the threshold based on the content of the candidate data and the reliability, a threshold suited to the quality of the voice data can be set even when, for example, the voice data is likely to contain noise or the like. In addition, even when many pieces of character string information associated with one class ID are detected and the reliability of each piece of character string information is low, it is possible to prevent all of the reliabilities from falling below the threshold.

The update unit 17c may calculate, for example, the average of the plurality of reliabilities associated with each class ID excluding the lowest reliability. In this case, the updated threshold tends to be higher than the pre-update threshold. This makes it possible to reduce misrecognition.

The update unit 17c may calculate, for example, the average of the plurality of reliabilities associated with each class ID excluding both the lowest and the highest reliability. In this case, the updated threshold tends to be lower than the pre-update threshold. This can improve the recognition rate. It can also suppress fluctuation of the threshold before and after the update.
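The three averaging variants described for the update unit 17c (the plain average, the average excluding the lowest reliability, and the average excluding both the lowest and the highest reliability) can be sketched as follows. The function name `mean_reliability` and its keyword arguments are hypothetical, and the sample values are used only to show the arithmetic.

```python
def mean_reliability(values, drop_lowest=False, drop_highest=False):
    """Average reliabilities, optionally excluding the extreme values."""
    ordered = sorted(values)
    if drop_lowest:
        ordered = ordered[1:]   # exclude the lowest reliability
    if drop_highest:
        ordered = ordered[:-1]  # exclude the highest reliability
    if not ordered:
        raise ValueError("no reliability values left to average")
    return sum(ordered) / len(ordered)


# Hypothetical reliabilities associated with one class ID.
values = [0.982, 0.942, 0.897]
plain = mean_reliability(values)
without_lowest = mean_reliability(values, drop_lowest=True)
without_extremes = mean_reliability(values, drop_lowest=True, drop_highest=True)
```

For these sample values, excluding the lowest reliability raises the average relative to the plain average, while excluding both extremes leaves only the middle value.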

(First Modification of Operation of Speech Recognition System 100)
Next, a first modification of the operation of the speech recognition system 100 in the present embodiment will be described. FIG. 7A is a flowchart showing an example of the update means S163 in the first modification.

As shown in FIG. 7A, after the above-described selection means S150 is performed, the threshold stored in the reference database is updated based on the plurality of pieces of candidate data and the plurality of reliabilities (update means S163). The update unit 17c retrieves the candidate data, the reliabilities, and the reference database from the storage unit 104 via the storage unit 13, for example.

For example, as shown in FIG. 6, the update unit 17c calculates the average "0.940" of the plurality of reliabilities "0.982", "0.942", and "0.897" associated with the class ID "1" included in ranks 1, 2, and 4. The update unit 17c then uses, for example, the value "0.846" obtained by multiplying the calculated average by a coefficient (for example, 0.9) as the updated threshold.
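The numerical example above can be reproduced with a few lines of code. This is a sketch only: the function name `update_threshold` is hypothetical, and the rounding to three decimal places is assumed from the figures quoted in the text.

```python
def update_threshold(reliabilities, coefficient=0.9):
    """Return the mean reliability multiplied by a preset coefficient."""
    average = sum(reliabilities) / len(reliabilities)
    return average * coefficient


# Reliabilities associated with class ID "1" in FIG. 6 (ranks 1, 2, and 4).
class_1_reliabilities = [0.982, 0.942, 0.897]
average = sum(class_1_reliabilities) / len(class_1_reliabilities)
updated = update_threshold(class_1_reliabilities, coefficient=0.9)

print(f"average={average:.3f}, updated threshold={updated:.3f}")
# prints "average=0.940, updated threshold=0.846"
```

The sum of the three reliabilities is 2.821, so the average is about 0.9403, and multiplying by 0.9 gives about 0.8463, in agreement with the values "0.940" and "0.846" in the description.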

Thereafter, the above-described designation means S161 and the like are performed, and the operation of the speech recognition system 100 in the present embodiment ends.

According to this modification, the update unit 17c in the update means S163 updates the threshold based on the candidate data and the reliability. Therefore, compared with the case where a preset threshold is always used, recognition information suited to the quality of the acquired voice data can be generated. This makes it possible to broaden the range of environments in which the system can be used.

(Second Modification of Operation of Speech Recognition System 100)
Next, a second modification of the operation of the speech recognition system 100 in the present embodiment will be described. The difference between the above-described embodiment and the second modification is that a setting means S190 is provided. Description of configurations that are the same as those described above is omitted.

The setting means S190 is performed after the generation means S160, as shown in FIG. 7B, for example. The setting means S190 selects the content of each database to be referred to based on the recognition information. After the setting means S190 is performed, the acquisition means S110 is performed.

For example, when "music mode" is generated in the setting means S190, the detection unit 14 in the subsequent detection means S130 selects and refers to the phoneme information, character string information, and class IDs in the character string database that are specific to "music mode". Compared with the case where the setting means S190 is not performed, the phoneme information and the like can therefore be limited to those for specific content. This makes it possible to dramatically improve recognition accuracy.
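Restricting the character string database to the entries for a given mode can be sketched as follows. The row layout, the `mode` tag, the `<pause>` marker for a pause section, and the sample phoneme strings are all hypothetical; the description does not specify how mode-specific entries are stored.

```python
# Hypothetical character string database rows: phoneme information,
# character string information, class ID, and an assumed mode tag.
STRING_DATABASE = [
    {"phonemes": "p l e y <pause>", "text": "play", "class_id": 1, "mode": "music"},
    {"phonemes": "s t o p <pause>", "text": "stop", "class_id": 1, "mode": "music"},
    {"phonemes": "b r e y k <pause>", "text": "brake", "class_id": 1, "mode": "driving"},
]


def entries_for_mode(database, mode):
    """Return only the rows registered for the given mode."""
    return [row for row in database if row["mode"] == mode]


music_entries = entries_for_mode(STRING_DATABASE, "music")
```

In this sketch, the detection step would then search only `music_entries`, so strings that belong to other modes cannot be returned as candidates.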

(Modification of Acquisition Means S110)
Next, a modification of the acquisition means S110 in the present embodiment will be described. The difference between the above-described embodiment and this modification is that the acquisition unit 11 acquires condition information. Description of configurations that are the same as those described above is omitted.

In the acquisition means S110, the acquisition unit 11 acquires condition information indicating the conditions under which the voice data was generated. As shown in FIG. 8, for example, the condition information includes environment information, noise information, sound collection device information, user information, and sound characteristic information. As with the setting means S190 described above, the detection unit 14, for example, may select the content of at least one of the character string database and the grammar database to be referred to based on the condition information. The reflection unit 19, for example, may also use the condition information when updating the threshold of the reference database.

The condition information may be generated by the sound collection device 2, for example, or may be generated in advance by a user or the like. The acquisition unit 11 may also acquire part of the voice data as the condition information, for example.

The environment information includes information on the environment in which the sound collection device 2 is installed, and indicates, for example, whether the device is outdoors or the size of the indoor space. By using the environment information, conditions such as the reflection of sound indoors can be taken into account, and the accuracy of the extracted recognition target data and the like can be increased.

The noise information includes information on noise that the sound collection device 2 may pick up, and indicates, for example, voices other than that of the user or the like, air-conditioning sound, and the like. By using the noise information, unnecessary data included in the voice data can be removed in advance, and the accuracy of the extracted recognition target data and the like can be increased.

The sound collection device information includes information on the type, performance, and the like of the sound collection device 2, and also includes, for example, the number of microphones and the type of microphone. By using the sound collection device information, a database corresponding to the situation in which the voice data was generated can be selected, for example, and the accuracy of speech recognition can be increased.

The user information includes information on the number, nationality, gender, and the like of the users. The sound characteristic information includes information on the loudness of the voice, sound pressure, speech habits, articulation, and the like. By using the user information, the characteristics of the voice data can be narrowed down in advance, and the accuracy of speech recognition can be increased.

According to this modification, the acquisition means S110 acquires the condition information. That is, the acquisition means S110 acquires, as the condition information, various conditions such as the surrounding environment when the voice data is acquired, the noise included in the voice data, and the type of the sound collection device 2 that collects the voice. Each means and each database can therefore be configured according to the condition information. This makes it possible to improve recognition accuracy regardless of the environment of use.

Further, according to this modification, the detection means S130 selects the content of the character string database to be referred to based on the condition information. Therefore, by storing different character string information and the like for each piece of condition information in the character string database, candidate data suited to each piece of condition information can be detected. This makes it possible to improve recognition accuracy for each piece of condition information.
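One possible way to represent the condition information and use it to choose database content is sketched below. The five fields follow the kinds of information listed for FIG. 8, but the field types, the sample values, the `DATABASES` mapping keyed by environment, and the fallback set are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionInfo:
    """Illustrative container for the five kinds of condition information."""
    environment: str            # e.g. "indoor", "outdoor", "vehicle"
    noise: str                  # e.g. "air_conditioner"
    device: str                 # e.g. "single_microphone"
    user: str                   # e.g. "one_adult"
    sound_characteristic: str   # e.g. "quiet_voice"


# Hypothetical per-condition character string information, keyed by environment.
DATABASES = {
    "indoor": ["turn on the light", "close the curtain"],
    "vehicle": ["apply the brake", "open the window"],
}
DEFAULT_DATABASE = ["yes", "no"]


def select_database(condition):
    """Pick the entries registered for the environment, else a default set."""
    return DATABASES.get(condition.environment, DEFAULT_DATABASE)


condition = ConditionInfo("vehicle", "engine", "array_microphone", "one_adult", "loud_voice")
selected = select_database(condition)
```

Only the environment field is used for the lookup here; a fuller configuration could also key on the noise or device fields, which this sketch leaves unused.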

(Modification of Reference Database)
Next, a modification of the reference database in the present embodiment will be described. The difference between the above-described embodiment and this modification is the content of the information stored in the reference database. Description of configurations that are the same as those described above is omitted.

As shown in FIG. 9, for example, the reference database stores past evaluation data acquired in advance, reference sentences associated with the past evaluation data, and degrees of association between the past evaluation data and the reference sentences.

The generation unit 17 refers to the reference database, for example, and selects, from among the past evaluation data, first evaluation data corresponding to the evaluation data (the broken-line frame in "past evaluation data" in FIG. 9). The generation unit 17 then acquires, from among the reference sentences, a first reference sentence corresponding to the first evaluation data (the broken-line frame in "reference sentence" in FIG. 9). The generation unit 17 also acquires, from among the degrees of association, a first degree of association between the first evaluation data and the first reference sentence (such as "65%" in FIG. 9). The first evaluation data and the first reference sentence may each include a plurality of pieces of data.

The generation unit 17 generates the recognition information based on the value of the first degree of association. For example, the generation unit 17 compares the first degree of association with a threshold acquired in advance, and generates the recognition information with reference to the first reference sentence associated with a first degree of association that exceeds the threshold.

As the past evaluation data, information that partially or completely matches the evaluation data is selected; in addition, similar information (including information expressing the same concept) may be used, for example. When the evaluation data and the past evaluation data are expressed as combinations of a plurality of character strings, any of the combinations noun-verb, noun-adjective, adjective-verb, and noun-noun is used, for example.

The degree of association (first degree of association) is expressed in three or more levels, for example as a percentage. When the reference database is configured as a neural network, for example, the first degree of association represents a weight variable associated with the selected past evaluation target information.

Using the above-described reference database is characterized in that speech recognition can be realized based on degrees of association set in three or more levels. The degree of association can be described by a numerical value from 0 to 100%, for example, but is not limited to this and may be configured with any number of levels as long as it can be described by numerical values in three or more levels.

Based on such degrees of association, the first reference sentences selected as candidates for the recognition information for the evaluation data can be selected in descending or ascending order of the degree of association. By selecting in order of the degree of association in this way, a first reference sentence that is highly likely to fit the situation can be selected preferentially. On the other hand, a first reference sentence that is less likely to fit the situation can also be selected rather than excluded, so it can be kept as a candidate for the recognition information instead of being discarded.

In addition to the above, even an extremely low evaluation, such as a degree of association of 1%, can be selected without being overlooked. That is, even an extremely low degree of association indicates a connection as a slight sign, so excessive selection of discard targets and misidentification can be suppressed.
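Ordering first reference sentences by degree of association, with an optional threshold, can be sketched as follows. The pair layout (reference sentence, degree of association in percent), the function name `rank_reference_sentences`, and all sample values other than the "65%" quoted from FIG. 9 are hypothetical.

```python
def rank_reference_sentences(links, threshold=None, descending=True):
    """Sort (reference_sentence, association_degree) pairs by degree.

    When `threshold` is given, keep only pairs whose degree exceeds it.
    """
    if threshold is not None:
        links = [link for link in links if link[1] > threshold]
    return sorted(links, key=lambda link: link[1], reverse=descending)


# Hypothetical first reference sentences with degrees of association (%).
links = [
    ("turn on the radio", 20),
    ("turn on the light", 65),
    ("turn off the light", 1),
]
ranked = rank_reference_sentences(links)
above_threshold = rank_reference_sentences(links, threshold=50)
```

Without a threshold, the 1% entry stays in the ranked list as a low-priority candidate, which mirrors the point that weakly associated sentences need not be discarded.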

Although several embodiments of the present invention have been described, these embodiments are presented by way of example and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are included in the invention described in the claims and its equivalents.

1: Speech recognition device
2: Sound collection device
3: Control device
4: Public communication network
5: Server
6: User terminal
10: Housing
11: Acquisition unit
12: Extraction unit
13: Storage unit
14: Detection unit
15: Calculation unit
16: Selection unit
17: Generation unit
17a: Designation unit
17b: Comparison unit
17c: Update unit
18: Output unit
19: Reflection unit
100: Speech recognition system
101: CPU
102: ROM
103: RAM
104: Storage unit
105: I/F
106: I/F
107: I/F
108: Input portion
109: Output portion
110: Internal bus
S110: Acquisition means
S120: Extraction means
S130: Detection means
S140: Calculation means
S150: Selection means
S160: Generation means
S161: Designation means
S162: Comparison means
S163: Update means
S170: Output means
S180: Reflection means
S190: Setting means

Claims

A speech recognition system comprising:
acquisition means for acquiring at least one piece of speech data;
extraction means for extracting, by phoneme recognition, a start silence section and an end silence section included in the speech data, and for extracting, by the phoneme recognition, an array of phonemes and pause sections sandwiched between the start silence section and the end silence section as recognition target data;
a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and a class ID assigned to the character string information;
detection means for referring to the character string database, selecting the phoneme information corresponding to the array of the recognition target data, and detecting, as a plurality of pieces of candidate data, the character string information and the class ID associated with the selected phoneme information;
a grammar database storing grammar information, acquired in advance, that indicates an arrangement order of the class IDs;
calculation means for referring to the grammar database, generating a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculating, using the grammar database, a reliability for the character string information of each piece of the candidate data included in the sentence;
selection means for selecting evaluation data from the plurality of pieces of candidate data based on the reliability; and
generation means for generating recognition information based on the evaluation data.
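By way of a non-limiting illustration, the flow recited above (extraction, detection, calculation, selection) may be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the database contents, the `sil` and `pau` markers, the function names, and the use of a single grammar weight as the reliability of every candidate in a sentence are all hypothetical.

```python
from itertools import product

# Assumed markers for silence sections and pause sections.
SILENCE, PAUSE = "sil", "pau"

# Hypothetical character string database: phoneme run -> [(text, class ID)].
STRING_DB = {
    ("a", "k", "a"): [("red", "COLOR"), ("dirt", "NOUN")],
    ("k", "u", "r", "u", "m", "a"): [("car", "NOUN")],
}

# Hypothetical grammar database: permitted class-ID order -> assumed weight.
GRAMMAR_DB = {("COLOR", "NOUN"): 0.9, ("NOUN", "NOUN"): 0.2}


def extract(units):
    """Return the array between the start and end silence sections."""
    start = units.index(SILENCE)
    end = len(units) - 1 - units[::-1].index(SILENCE)
    return units[start + 1:end]


def detect(target):
    """Split the array at pause sections and look up candidates per phoneme run."""
    runs, current = [], []
    for unit in target:
        if unit == PAUSE:
            if current:
                runs.append(tuple(current))
            current = []
        else:
            current.append(unit)
    if current:
        runs.append(tuple(current))
    return [STRING_DB.get(run, []) for run in runs]


def score_sentences(candidates):
    """Combine candidates into sentences permitted by the grammar and score them."""
    scored = []
    for combo in product(*candidates):
        class_ids = tuple(class_id for _, class_id in combo)
        weight = GRAMMAR_DB.get(class_ids)
        if weight is not None:
            # Simplification: every candidate in the sentence shares this weight.
            scored.append(([text for text, _ in combo], weight))
    return scored


def recognize(units):
    """Select the highest-scoring sentence, or None if the grammar permits none."""
    scored = score_sentences(detect(extract(units)))
    if not scored:
        return None
    return max(scored, key=lambda item: item[1])
```

In this sketch, the input `["sil", "a", "k", "a", "pau", "k", "u", "r", "u", "m", "a", "sil"]` yields two candidate sentences, and the one whose class-ID order carries the larger assumed weight is selected.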

The speech recognition system according to claim 1, wherein
the extraction means extracts a plurality of pieces of the recognition target data from one piece of the speech data, and
the plurality of pieces of recognition target data each have a different array of the phonemes and the pause sections.

The speech recognition system according to claim 1 or 2, wherein
the calculation means generates a plurality of the sentences, and
the plurality of sentences differ from one another in at least one of the type and the combination of the candidate data.

The speech recognition system according to any one of claims 1 to 3, further comprising a reference database storing the character string information acquired in advance, reference sentences each combining the character string information, and a threshold assigned to each piece of the character string information, wherein
the generation means includes:
designation means for referring to the reference database and designating, among the reference sentences, a first reference sentence corresponding to the evaluation data; and
comparison means for comparing the reliability corresponding to the evaluation data with a first threshold assigned to first character string information included in the first reference sentence, and
the generation means generates the recognition information based on a comparison result of the comparison means.
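By way of a non-limiting illustration, the designation and comparison against a reference database may be sketched as follows. This is a minimal sketch under stated assumptions: the layout of `REFERENCE_DB`, the function names, and the rule that recognition information is generated only when every reliability is at or above its threshold are hypothetical, since the direction of the comparison is not specified in the text above.

```python
# Hypothetical reference database:
# reference sentence -> threshold assigned to each character string.
REFERENCE_DB = {
    ("red", "car"): {"red": 0.6, "car": 0.7},
    ("blue", "car"): {"blue": 0.6, "car": 0.7},
}


def designate(evaluation):
    """Return the reference sentence matching the evaluation data, if any."""
    key = tuple(text for text, _ in evaluation)
    return key if key in REFERENCE_DB else None


def generate(evaluation):
    """Generate recognition information when all reliabilities clear their thresholds.

    `evaluation` is a list of (character string, reliability) pairs.
    """
    sentence = designate(evaluation)
    if sentence is None:
        return None
    thresholds = REFERENCE_DB[sentence]
    # Assumed rule: a reliability at or above the threshold passes.
    if all(reliability >= thresholds[text] for text, reliability in evaluation):
        return " ".join(sentence)
    return None
```

For example, `generate([("red", 0.8), ("car", 0.75)])` passes both comparisons in this sketch, whereas lowering the reliability of `"car"` to 0.5 causes no recognition information to be generated.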

The speech recognition system according to claim 4, further comprising update means for updating the threshold stored in the reference database based on a plurality of pieces of the candidate data and a plurality of the reliabilities.

The speech recognition system according to claim 4 or 5, further comprising reflection means for acquiring an evaluation result of a user who has evaluated the recognition information and reflecting the evaluation result in the threshold of the reference database.

The speech recognition system according to any one of claims 1 to 6, wherein the acquisition means acquires condition information indicating a condition under which the speech data was generated.

The speech recognition system according to claim 7, wherein the detection means selects, based on the condition information, the contents of the character string database to be referred to.
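By way of a non-limiting illustration, selecting the contents of the character string database based on condition information may be sketched as follows. This is a minimal sketch under stated assumptions: the `conditions` field attached to each entry, the example entries, and the function name are hypothetical.

```python
# Hypothetical character string database; the "conditions" field is an assumption.
STRING_DB = [
    {"text": "accelerate", "class_id": "VERB", "conditions": {"vehicle"}},
    {"text": "stop", "class_id": "VERB", "conditions": {"vehicle", "factory"}},
    {"text": "weld", "class_id": "VERB", "conditions": {"factory"}},
]


def select_entries(db, condition):
    """Keep only the entries tagged with the given condition information."""
    return [entry for entry in db if condition in entry["conditions"]]
```

Narrowing the database in this way before detection reduces the number of candidates that can be matched against the recognition target data.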

The speech recognition system according to any one of claims 1 to 8, further comprising output means for outputting the recognition information, wherein
the recognition information includes information for controlling a traveling speed of a vehicle.

The speech recognition system according to any one of claims 1 to 9, wherein the pause section includes at least one of a breathing sound and lip noise.

The speech recognition system according to any one of claims 1 to 10, wherein the character string information includes languages of two or more countries.

A speech recognition device comprising:
an acquisition unit that acquires at least one piece of speech data;
an extraction unit that extracts, by phoneme recognition, a start silence section and an end silence section included in the speech data, and that extracts, by the phoneme recognition, an array of phonemes and pause sections sandwiched between the start silence section and the end silence section as recognition target data;
a character string database storing character string information acquired in advance, phoneme information associated with the character string information, and a class ID assigned to the character string information;
a detection unit that refers to the character string database, selects the phoneme information corresponding to the array of the recognition target data, and detects, as a plurality of pieces of candidate data, the character string information and the class ID associated with the selected phoneme information;
a grammar database storing grammar information, acquired in advance, that indicates an arrangement order of the class IDs;
a calculation unit that refers to the grammar database, generates a sentence in which a plurality of pieces of the candidate data are combined based on the grammar information, and calculates, using the grammar database, a reliability for the character string information of each piece of the candidate data included in the sentence;
a selection unit that selects evaluation data from the plurality of pieces of candidate data based on the reliability; and
a generation unit that generates recognition information based on the evaluation data.