JP6058807B2 - Method and system for speech recognition processing using search query information

Info

Publication number: JP6058807B2
Application number: JP2015537758A
Authority: JP
Inventors: メンギバー，ペドロ・ジェイ・モレノ; ソレンセン，ジェフリー・スコット; ウェインステイン，ユージーン
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2012-10-18
Filing date: 2013-10-14
Publication date: 2017-01-11
Anticipated expiration: 2033-10-14
Also published as: JP2016500843A; KR20150048252A; US20140114661A1; US8768698B2; WO2014062545A1; CN104854654B; CN106847265A; EP2909832B1; CN104854654A; KR101585185B1; CN106847265B; EP2909832A1; US8589164B1

Description

Cross-Reference to Related Applications
This application claims priority to U.S. Patent Application Serial No. 13/832,136, filed March 15, 2013 and entitled "Methods and Systems for Speech Recognition Processing Using Search Query Information", which is the non-provisional of U.S. Provisional Patent Application Serial No. 61/715,365, filed October 18, 2012 and entitled "Methods and Systems for Speech Recognition Processing Using Search Query Information". All of these applications are incorporated herein by reference as if fully set forth in this description.

Background
Automatic speech recognition (ASR) technology can be used to map audio utterances to textual representations of those utterances. Some ASR systems use "training", in which an individual speaker reads sections of text into the speech recognition system. These systems analyze the person's specific voice and use it to fine-tune the recognition of that person's speech, yielding more accurate transcription. Systems that do not use training may be referred to as "speaker-independent" systems. Systems that use training may be referred to as "speaker-dependent" systems.

Summary
This application discloses systems and methods for speech recognition processing. In one aspect, a method is described. The method may include receiving, at a computing device, information indicative of a frequency of submission of a search query to a search engine. The search query may include a sequence of words. The method may also include, based on the frequency of submission of the search query exceeding a threshold, determining, for the sequence of words of the search query, groupings of one or more words of the search query based on the order in which the one or more words occur in the sequence of words. The method may further include providing information indicative of the groupings to a speech recognition system to update a corpus of given sequences of words. The speech recognition system may be configured to convert a given spoken utterance into a given sequence of words based on the corpus of given sequences of words.
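
The method described above can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: it assumes that a "grouping" is any contiguous run of words taken in query order, and the names `word_groupings`, `groupings_for_popular_queries`, `query_counts`, and `threshold` are hypothetical.

```python
from typing import Dict, List, Tuple


def word_groupings(query: str) -> List[Tuple[str, ...]]:
    """Return every contiguous, order-preserving grouping of the query's words.

    Assumption: a grouping is a contiguous run of words; the source only
    says groupings follow the order in which words occur in the query.
    """
    words = query.split()
    groupings = []
    for size in range(1, len(words) + 1):
        for start in range(len(words) - size + 1):
            groupings.append(tuple(words[start:start + size]))
    return groupings


def groupings_for_popular_queries(
    query_counts: Dict[str, int], threshold: int
) -> Dict[str, List[Tuple[str, ...]]]:
    """Keep only queries submitted more often than the threshold."""
    return {
        query: word_groupings(query)
        for query, count in query_counts.items()
        if count > threshold
    }
```

The resulting groupings are the information that would then be handed to the speech recognition system to update its corpus.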

In another aspect, a computer-readable medium is described that stores instructions which, when executed by a computing device, cause the computing device to perform functions. The functions may include receiving information indicative of a frequency of submission of a search query to a search engine. The search query may include a sequence of words. The functions may also include, based on the frequency of submission of the search query exceeding a threshold, determining, for the sequence of words of the search query, groupings of one or more words of the search query based on the order in which the one or more words occur in the sequence of words. The functions may further include providing information indicative of the groupings to a speech recognition system to update a corpus of given sequences of words. The speech recognition system may be configured to convert a given spoken utterance into a given sequence of words based on the corpus of given sequences of words.

In yet another aspect, a device is described. The device may include at least one processor. The device may also include data storage and program instructions in the data storage that, upon execution by the at least one processor, cause the device to receive information indicative of a frequency of submission of a search query to a search engine. The search query may include a sequence of words. The program instructions, upon execution by the at least one processor, further cause the device to determine, based on the frequency of submission of the search query exceeding a threshold, groupings of one or more words of the search query for the sequence of words of the search query, based on the order in which the one or more words occur in the sequence of words. The program instructions, upon execution by the at least one processor, further cause the device to provide information indicative of the groupings to a speech recognition system to update a corpus of given sequences of words. The speech recognition system may be configured to convert a given spoken utterance into a given sequence of words based on the corpus of given sequences of words. The speech recognition system may further include probabilities of occurrence for the given sequences of words of the corpus.

The foregoing summary is illustrative only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.

The drawings include the following figures:
A diagram of an example automatic speech recognition (ASR) system, in accordance with an example embodiment.
A diagram of aspects of an example acoustic model, in accordance with an embodiment.
A diagram of an example search graph of an ASR system, in accordance with an embodiment.
A flowchart of an example method for speech recognition processing, in accordance with an embodiment.
A diagram of an example automaton representation for example search queries, in accordance with an embodiment.
A diagram of an example bi-gram language model for example search queries, in accordance with an embodiment.
A diagram of an example factor graph for example search queries, in accordance with an embodiment.
A diagram of an example distributed computing architecture, in accordance with an example embodiment.
A block diagram of an example computing device, in accordance with an example embodiment.
A diagram of a cloud-based server system, in accordance with an example embodiment.
A schematic diagram of a conceptual partial view of an example computer program product that includes a computer program for executing a computer process on a computing device, arranged according to at least some embodiments presented herein.

Detailed Description
The following detailed description describes various features and functions of the disclosed systems and methods with reference to the accompanying figures. In the figures, similar symbols identify similar components, unless context dictates otherwise. The illustrative system and method embodiments described herein are not meant to be limiting. It may be readily understood that certain aspects of the disclosed systems and methods can be arranged and combined in a wide variety of different configurations, all of which are contemplated herein.

As computing power continues to increase, automatic speech recognition (ASR) systems and devices may be deployed in various environments to provide speech-based user interfaces. Some of these environments include residences, businesses, and vehicles.

In residences and businesses, for example, ASR may provide voice control over devices such as large appliances (e.g., ovens, refrigerators, dishwashers, washers, and dryers), small appliances (e.g., toasters, thermostats, coffee makers, microwave ovens), and media devices (e.g., stereos, televisions, digital video recorders, digital video players), as well as over doors, lights, curtains, and so on. In vehicles, ASR may provide hands-free use of communication technologies (e.g., cellular phones), media devices (e.g., radios and video players), mapping technologies (e.g., navigation systems), environmental controls (e.g., heating and air conditioning), and so on. In one example, ASR can be used to convert a voice search query into a text string that can be sent to a search engine to obtain search results. The potential uses for voice control are many, and these examples should not be viewed as limiting.

In one example, ASR may be performed at the device that receives utterances from a speaker. For this device-based ASR, each user device may be configured with an ASR module. In another example, speech recognition may be performed at a remote network server (e.g., a server or cluster of servers on the Internet). In this example, ASR may not be incorporated into the user device, but the user device may still be configured to have a communication path with the remote ASR system (e.g., through Internet access).

In yet another example, speech recognition may be performed by use of a local ASR system that offloads performance of at least some aspects of ASR to remote devices. The local ASR system may be one or more dedicated devices that perform ASR, or software configured to operate, for instance, on a general-purpose computing platform. This local ASR system may be physically located in a residence, business, vehicle, or the like, and may operate even if the user devices do not have Internet access.

In some examples, a user device may receive an utterance from a speaker and transmit a representation of the utterance to the local ASR system. The local ASR system may transcribe the representation of the utterance into a textual representation of the utterance and transmit this textual representation to the user device. Alternatively, the local ASR system may instead transmit a command based on a transcription of the utterance to the user device. This command may be based on the transcribed textual representation of the utterance, or may be derived more directly from the representation of the utterance. The command may also be of a command set or command language supported by the user device. In one example, the utterance may represent a voice search query, and the local ASR system may be configured to transmit the transcription of the voice search query to a search engine to obtain respective search results that can be communicated to the user device.

FIG. 1 illustrates an example ASR system, in accordance with an embodiment. At run time, input to the ASR system may include an utterance 100, and the output may include one or more text strings and possibly associated confidence levels 101. The components of the ASR system may include a feature analysis module 102, which may be configured to produce feature vectors 104, a pattern classification module 106, an acoustic model 108, a dictionary 110, and a language model 112. The pattern classification module 106 may incorporate various aspects of the acoustic model 108, the dictionary 110, and the language model 112.

The example ASR system depicted in FIG. 1 is for illustration only. Other ASR system arrangements, including different components, different relationships between the components, and/or different processing, may be possible.

The feature analysis module 102 may be configured to receive the utterance 100. The utterance 100 may include an analog or digital representation of human speech, and may possibly contain background noise as well. The feature analysis module 102 may be configured to convert the utterance 100 into a sequence of one or more feature vectors 104. Each of the feature vectors 104 may include temporal and/or spectral representations of the acoustic features of at least a portion of the utterance 100. For instance, a feature vector may include mel-frequency cepstrum coefficients of such a portion.

The mel-frequency cepstrum coefficients may represent the short-term power spectrum of a portion of the utterance 100. They may be based on, for example, a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. (The mel scale may be a scale of pitches that listeners subjectively perceive to be about equally distant from one another, even though the actual frequencies of these pitches are not equally distant from one another.)
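
To make the nonlinearity of the mel scale concrete, the sketch below uses one widely used analytic approximation of the scale. The specific formula (with constants 2595 and 700) is an assumption brought in for illustration; the description above does not specify which mel mapping is used.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Map a frequency in Hz to mels using a common approximation (assumed)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

Under this mapping, a fixed step in Hz corresponds to a smaller step in mels at high frequencies than at low frequencies, which mirrors the perceptual spacing described above.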

To derive these coefficients, the feature analysis module 102 may be configured to sample and quantize the utterance 100, divide the utterance 100 into overlapping or non-overlapping frames of 15 milliseconds, and perform spectral analysis on the frames to derive the spectral components of each frame. The feature analysis module 102 may further be configured to perform noise removal, convert the standard spectral coefficients to mel-frequency cepstrum coefficients, and calculate first-order and second-order cepstral derivatives of the mel-frequency cepstrum coefficients.
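
The framing step can be sketched as below. This is an illustrative sketch only: the function name `split_into_frames` and the `hop_ms` parameter are hypothetical, and trailing samples that do not fill a whole frame are simply dropped, which is one of several reasonable choices.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float],
    sample_rate_hz: int,
    frame_ms: float = 15.0,
    hop_ms: float = 15.0,
) -> List[List[float]]:
    """Split sampled audio into fixed-length frames.

    hop_ms == frame_ms gives non-overlapping frames; hop_ms < frame_ms
    gives overlapping frames. Incomplete trailing frames are dropped.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000.0))
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop durations must cover at least one sample")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(list(samples[start:start + frame_len]))
    return frames
```

At a 16 kHz sampling rate, for example, a 15 ms frame spans 240 samples; spectral analysis would then be applied to each returned frame.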

The first-order cepstral coefficient derivatives may be calculated based on the slopes of linear regressions performed over windows of two or more consecutive frames. The second-order cepstral coefficient derivatives may be calculated based on the slopes of linear regressions performed over windows of two or more consecutive sets of first-order cepstral coefficient derivatives. However, there may be other ways of calculating the first-order and second-order cepstral coefficient derivatives.
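
One way to compute such regression slopes is sketched below. It assumes a symmetric window of `half_window` frames on each side and repeats the first and last frames at the edges; both choices, and the name `regression_deltas`, are assumptions for illustration. For a symmetric window, the expression used is the least-squares slope of each coefficient against frame index.

```python
from typing import List


def regression_deltas(frames: List[List[float]], half_window: int = 2) -> List[List[float]]:
    """Per-frame slope of each coefficient over a symmetric window of frames.

    Edge frames are handled by repeating the first/last frame (assumed).
    """
    if half_window < 1:
        raise ValueError("half_window must be at least 1")
    if not frames:
        return []
    denominator = 2.0 * sum(k * k for k in range(1, half_window + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for dim in range(len(frames[0])):
            numerator = 0.0
            for k in range(1, half_window + 1):
                ahead = frames[min(t + k, last)][dim]
                behind = frames[max(t - k, 0)][dim]
                numerator += k * (ahead - behind)
            row.append(numerator / denominator)
        deltas.append(row)
    return deltas
```

Applying the function once to the cepstral coefficients gives first-order derivatives; applying it again to that output gives second-order derivatives.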

In some examples, one or more frames of the utterance 100 may be represented by a feature vector of mel-frequency cepstrum coefficients, first-order cepstral coefficient derivatives, and second-order cepstral coefficient derivatives. For example, the feature vector may contain 13 coefficients, 13 first-order derivatives, and 13 second-order derivatives, and therefore have a length of 39. However, feature vectors may use different combinations of features in other possible examples.

The pattern classification module 106 may be configured to receive a sequence of feature vectors 104 from the feature analysis module 102 and produce, as output, one or more text string transcriptions 101 of the utterance 100. Each transcription 101 may be accompanied by a respective confidence level indicating an estimated likelihood that the transcription is correct (e.g., 80% confidence, 90% confidence, etc.).

To produce the text string transcriptions 101, the pattern classification module 106 may be configured to include, or incorporate aspects of, the acoustic model 108, the dictionary 110, and/or the language model 112. In some examples, the pattern classification module 106 may also be configured to use a search graph that represents sequences of word or sub-word acoustic features that appear in spoken utterances.

The acoustic model 108 may be configured to determine probabilities that the feature vectors 104 may have been derived from a particular sequence of spoken words and/or sub-word sounds. This may involve mapping sequences of the feature vectors 104 to one or more phonemes, and then mapping sequences of phonemes to one or more words.

A phoneme may be considered to be the smallest segment of an utterance that encompasses a meaningful contrast with other segments of utterances. Thus, a word typically includes one or more phonemes. For example, phonemes may be thought of as utterances of letters, although some phonemes may represent multiple letters. An example phonemic spelling for the American English pronunciation of the word "cat" may be /k/ /ae/ /t/, consisting of the phonemes /k/, /ae/, and /t/. Another example phonemic spelling, for the word "dog", may be /d/ /aw/ /g/, consisting of the phonemes /d/, /aw/, and /g/.

Different phonemic alphabets exist, and these alphabets may have different textual representations for the various phonemes therein. For example, the letter "a" may be represented by the phoneme /ae/ for the sound in "cat", by the phoneme /ey/ for the sound in "ate", and by the phoneme /ah/ for the sound in "beta". Other phonemic representations are possible.

Common phonemic alphabets for American English contain about 40 distinct phonemes. Each of these phonemes may be associated with a different distribution of feature vector values. The acoustic model 108 may be configured to estimate the phoneme or phonemes in a feature vector by comparing the feature vector to the distributions for each of the 40 phonemes and finding the one or more phonemes most likely to be represented by the feature vector.
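
The comparison of a feature vector against per-phoneme distributions can be sketched as follows. The sketch assumes each phoneme is modeled by a single diagonal-covariance Gaussian, which is a simplification chosen for illustration; the names `log_likelihood`, `rank_phonemes`, and the toy model values used in any call are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Each phoneme model is (means, variances), one entry per feature dimension.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def log_likelihood(x: Sequence[float], means: Sequence[float], variances: Sequence[float]) -> float:
    """Log density of x under a diagonal-covariance Gaussian."""
    total = 0.0
    for value, mean, variance in zip(x, means, variances):
        total += -0.5 * (math.log(2.0 * math.pi * variance) + (value - mean) ** 2 / variance)
    return total


def rank_phonemes(x: Sequence[float], models: Dict[str, Gaussian]) -> List[Tuple[str, float]]:
    """Return phonemes sorted from most to least likely for feature vector x."""
    scored = [(phoneme, log_likelihood(x, means, variances))
              for phoneme, (means, variances) in models.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Taking the first few entries of the ranked list corresponds to finding the one or more phonemes most likely to be represented by the feature vector.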

In one example, the acoustic model 108 may include a hidden Markov model (HMM). An HMM may model a system as a Markov process with unobserved (i.e., hidden) states. Each HMM state may be represented as a multivariate Gaussian distribution that characterizes the statistical behavior of the state. Additionally, each state may also be associated with one or more state transitions that specify the probability of making a transition from the current state to another state.
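
As an illustration of how hidden states can be recovered from observations, the sketch below runs the Viterbi algorithm over log probabilities. It is a generic textbook technique rather than the specific decoder of this disclosure, and it assumes the per-frame emission log-likelihoods (for example, from Gaussian state models) have already been computed; all names are hypothetical.

```python
from typing import Dict, List

NEG_INF = float("-inf")


def viterbi(
    emission_logp: List[Dict[str, float]],
    transition_logp: Dict[str, Dict[str, float]],
    initial_logp: Dict[str, float],
) -> List[str]:
    """Most likely hidden-state path given per-frame emission log-likelihoods.

    Missing transitions or emissions are treated as impossible (-inf).
    """
    if not emission_logp:
        return []
    states = list(initial_logp)
    best = [{s: initial_logp[s] + emission_logp[0].get(s, NEG_INF) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(emission_logp)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + transition_logp.get(p, {}).get(s, NEG_INF)) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + emission_logp[t].get(s, NEG_INF)
            back[t][s] = prev
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(emission_logp) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]
```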

When applied to an ASR system, the combination of the multivariate Gaussian distribution and the state transitions for each state may define a time sequence of feature vectors over the duration of one or more phonemes. Alternatively or additionally, the HMM may model the sequences of phonemes that define words. Thus, some HMM-based acoustic models may also take phoneme context into account when mapping a sequence of feature vectors to one or more words.

FIG. 2 illustrates aspects of an example acoustic model 200, in accordance with an embodiment. The acoustic model 200 defines a sequence of phonemes that make up the word "cat". Each phoneme is represented by a three-state HMM with an initial state, a middle state, and an end state representing the statistical characteristics at the beginning of the phoneme, the middle of the phoneme, and the end of the phoneme, respectively. Each state (e.g., state /k/1, state /k/2, etc.) may represent a phoneme and may include one or more transitions.

The acoustic model 200 may represent a word by concatenating the respective three-state HMMs for each phoneme in the word, with appropriate transitions. These concatenations may be performed based on information in the dictionary 110. In some implementations, more or fewer states per phoneme may be used in the acoustic model 200.
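
The concatenation of per-phoneme HMMs into a word model can be sketched structurally as follows. The state naming follows the /k/1, /k/2 convention used above; the left-to-right topology with self-loops, and the names `LEXICON`, `word_state_chain`, and `left_to_right_transitions`, are assumptions for illustration. Transition probabilities are omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon using the phonemic spellings given in the description.
LEXICON: Dict[str, List[str]] = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "aw", "g"],
}


def word_state_chain(word: str, lexicon: Dict[str, List[str]], states_per_phoneme: int = 3) -> List[str]:
    """Concatenate the per-phoneme HMM states that make up a word."""
    return [
        f"/{phoneme}/{index}"
        for phoneme in lexicon[word]
        for index in range(1, states_per_phoneme + 1)
    ]


def left_to_right_transitions(states: List[str]) -> List[Tuple[str, str]]:
    """Self-loop on every state plus an arc to the next state (assumed topology)."""
    arcs = []
    for i, state in enumerate(states):
        arcs.append((state, state))
        if i + 1 < len(states):
            arcs.append((state, states[i + 1]))
    return arcs
```

Changing `states_per_phoneme` corresponds to using more or fewer states per phoneme.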

The acoustic model 200 may be trained using recordings of each phoneme in numerous contexts (e.g., various words and sentences) so that a representation for each of the phoneme's states can be obtained. These representations may encompass the multivariate Gaussian distributions discussed above.

To train the acoustic model 200, a possibly large number of utterances containing spoken phonemes may each be associated with transcriptions. These utterances may be words, sentences, and so on, and may be obtained from recordings of everyday speech or some other source. The transcriptions may be automatic or manual (human-made) text strings of the utterances.

The utterances may be segmented according to their respective transcriptions. For instance, training of the acoustic model 200 may involve segmenting the spoken strings into units (e.g., using a Baum-Welch and/or Viterbi alignment method), and then using the segmented utterances to build statistical distributions for each phoneme state.

Consequently, the more data (utterances and their associated transcriptions) that is used for training, the more accurate the resulting acoustic model may be. However, even a well-trained acoustic model may have limited accuracy when used for ASR in a domain for which it was not trained. For instance, if a given acoustic model is trained on utterances from many speakers of American English, it may perform well when used for ASR of American English, but may be less accurate when used for ASR of, for example, British English.

Also, if the acoustic model 200 is trained using utterances from many speakers, it will likely end up representing each phoneme as a statistical average of the pronunciation of that phoneme across all of the speakers. Thus, the acoustic model 200, when trained in this fashion, may represent the pronunciation and usage of a hypothetical average speaker, rather than any particular speaker.

Referring back to FIG. 1, the dictionary 110 may define a pre-established mapping between phonemes and words. This mapping may include a list of, for example, tens or hundreds of thousands of phoneme-pattern-to-word mappings. Thus, in some examples, the dictionary 110 may include a lookup table, such as Table 1 shown below. Table 1 illustrates how the dictionary 110 may list the phonemic sequences that the pattern classification module 106 may be configured to identify for the corresponding words that the ASR system is attempting to recognize. Therefore, the dictionary 110 may be used when developing the phonemic state representations of words that are illustrated by the acoustic model 200.
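
A lookup table of this kind can be sketched as a simple mapping from phoneme patterns to words. The two entries below are taken from the phonemic spellings given earlier in the description ("cat" and "dog"); the names `PRONUNCIATIONS` and `lookup_word` are hypothetical, and a real dictionary would hold far more entries.

```python
from typing import Dict, Optional, Sequence, Tuple

# Phoneme-pattern-to-word mapping (illustrative entries only).
PRONUNCIATIONS: Dict[Tuple[str, ...], str] = {
    ("k", "ae", "t"): "cat",
    ("d", "aw", "g"): "dog",
}


def lookup_word(
    phonemes: Sequence[str],
    table: Dict[Tuple[str, ...], str] = PRONUNCIATIONS,
) -> Optional[str]:
    """Return the word for a phoneme sequence, or None if it is not listed."""
    return table.get(tuple(phonemes))
```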

The language model 112 may be configured to assign probabilities to sequences of phonemes or words, based on the likelihood of that sequence of phonemes or words occurring in an input utterance to the ASR system. Thus, for example, the language model 112 may define the conditional probability of $w_n$ (the $n$th word in a phrase transcribed from an utterance), given the values of the pattern of $n-1$ previous words in the phrase. An example conditional probability may be expressed as:
$$P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$

In general, a language model may operate on n-grams, which may be, for example, sequences of n phonemes or words that are represented in the pattern classification module 106. Language models with values of n greater than 5 may require a large amount of memory or storage space; therefore, smaller n-grams (e.g., 3-grams, which are also referred to as tri-grams) may be used to yield acceptable results efficiently. Tri-grams are used herein for purposes of illustration. Nonetheless, any value of n may be used with the examples herein.
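
Extracting n-grams from a word sequence can be sketched in a few lines. The function name `ngrams` is hypothetical, and the sketch treats each sequence in isolation, without sentence-boundary padding.

```python
from typing import List, Sequence, Tuple


def ngrams(words: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-word sequences, in order of occurrence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `ngrams("cat and dog chase".split(), 3)` yields the two tri-grams ("cat", "and", "dog") and ("and", "dog", "chase").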

Language models may be trained through analysis of a corpus of text strings or sequences of words. This corpus may contain a large number of words, e.g., hundreds, thousands, millions, or more. These words may be derived from utterances spoken by users of an ASR system and/or from written documents. For instance, the language model 112 may be determined or developed based on word patterns occurring in human speech, written text (e.g., emails, web pages, reports, academic papers, word processing documents, etc.), search queries, and so on.

From such a corpus, tri-gram probabilities can be estimated based on their respective numbers of appearances in the corpus. In other words, if $C(w_1, w_2, w_3)$ is the number of occurrences of the sequence of words $w_1, w_2, w_3$ in the corpus, then the probability of occurrence for the sequence of words can be expressed as:
$$P(w_3 \mid w_1, w_2) \approx \frac{C(w_1, w_2, w_3)}{C(w_1, w_2)}$$
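
Estimating a tri-gram probability from corpus counts can be sketched as follows. The sketch divides the count of the three-word sequence by the count of its two-word prefix, with no smoothing; the function name `trigram_probability` and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable


def trigram_probability(corpus_sentences: Iterable[str], w1: str, w2: str, w3: str) -> float:
    """Estimate P(w3 | w1, w2) as C(w1, w2, w3) / C(w1, w2), without smoothing."""
    trigram_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for sentence in corpus_sentences:
        words = sentence.split()
        for i in range(len(words) - 1):
            bigram_counts[(words[i], words[i + 1])] += 1
        for i in range(len(words) - 2):
            trigram_counts[(words[i], words[i + 1], words[i + 2])] += 1
    prefix_count = bigram_counts[(w1, w2)]
    if prefix_count == 0:
        return 0.0  # Unseen prefix: no estimate available without smoothing.
    return trigram_counts[(w1, w2, w3)] / prefix_count
```

A production language model would typically add smoothing or back-off so that unseen sequences do not receive zero probability.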

Thus, the language model 112 can be represented as a table of conditional probabilities. Table 2 illustrates an example of such a table that could form the basis of the language model 112. In particular, Table 2 contains tri-gram conditional probabilities.

For the 2-gram prefix "cat and", Table 2 indicates that, based on the observed occurrences in the corpus, 50% of the time the next 1-gram is "dog". Likewise, 35% of the time the next 1-gram is "mouse", 14% of the time the next 1-gram is "bird", and 1% of the time the next 1-gram is "fiddle". In a fully trained ASR system, the language model 112 would contain many more entries, and these entries may include more than just one 2-gram prefix.
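As an illustration of how such a table of conditional probabilities could be derived from counts, the following is a minimal Python sketch. The function name `trigram_table` and the toy corpus are hypothetical; the corpus is constructed only so that the proportions match those quoted for the prefix "cat and".

```python
from collections import Counter, defaultdict

def trigram_table(sentences):
    """Return {(w1, w2): {w3: P(w3 | w1, w2)}} estimated from raw counts."""
    trigram_counts = Counter()
    prefix_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            trigram_counts[(w1, w2, w3)] += 1
            prefix_counts[(w1, w2)] += 1  # prefix occurrences that have a next word
    table = defaultdict(dict)
    for (w1, w2, w3), count in trigram_counts.items():
        table[(w1, w2)][w3] = count / prefix_counts[(w1, w2)]
    return dict(table)

# Hypothetical toy corpus built to match the proportions quoted for "cat and".
corpus = (["cat and dog"] * 50 + ["cat and mouse"] * 35
          + ["cat and bird"] * 14 + ["cat and fiddle"] * 1)
table = trigram_table(corpus)
print(table[("cat", "and")])
```

A real corpus would contain many prefixes, so the outer dictionary would hold one row of next-word probabilities per 2-gram prefix.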

Once the acoustic model 108 and the language model 112 are appropriately trained, the feature analysis model 102 and the pattern classification module 106 may be configured to perform ASR. Provided with an input utterance 100, the ASR system can search the space of valid word sequences from the language model 112 to find the word sequence with the maximum likelihood of having been spoken in the utterance 100. However, the size of the search space can be quite large, and methods that reduce the search space can make such a search more computationally efficient. As an example, heuristic techniques can be used to reduce the complexity of the search, potentially by orders of magnitude. Other methods of limiting the search space are possible. For example, the search space can be constrained to phrases that are popular in a given period of time.

A finite state transducer (FST) can be used to compactly represent multiple phoneme patterns that map to a single word. Some words, such as "data", "either", "tomato", and "potato", have multiple pronunciations. The phoneme sequences for these pronunciations can be represented in a single FST per word.

This process of creating efficient phoneme-level FSTs can be carried out for each word in the dictionary 110, and the resulting word FSTs can be combined into sentence FSTs using the language model 112. Ultimately, a network of states for phonemes, words, and sequences of words can be developed and represented in a compact search graph.

FIG. 3 illustrates an example search graph 300 of an ASR system, in accordance with an embodiment. This example search graph 300 is smaller and less complex than a search graph that may be used in an ASR system, and is used for illustration. In particular, the search graph 300 was trained with five input utterances: "catapult", "cat and mouse", "cat and dog", "cat", and "cap".

Each circle in the search graph 300 may represent a state associated with the processing of an input utterance that has been mapped to phonemes. For simplicity, each phoneme in the search graph 300 is represented with a single state rather than multiple states. Also, self-transitions are omitted from the search graph 300 in order to simplify FIG. 3.

The states in the search graph 300 are named based on the current phoneme context of the input utterance, using the format "x[y]z" to indicate that the current phoneme being considered, y, has a left context of the phoneme x and a right context of the phoneme z. In other words, the state "x[y]z" indicates a point in processing an utterance at which the current phoneme being considered is y, the previously considered phoneme in the utterance is x, and the next phoneme to be considered in the utterance is z. The beginning of an utterance and the end of an utterance are represented by the "#" character, and may also be referred to as null phonemes.

Terminal states may be represented by a recognized word or phrase in quotes. The search graph 300 includes five terminal states, "catapult", "cat and mouse", "cat and dog", "cat", and "cap", each representing recognition of a word or a sequence of words (i.e., a phrase).

Transitions from one state to another may represent an observed ordering of phonemes in the corpus. For instance, the state "#[k]ae" represents the recognition of a "k" phoneme with a left context of a null phoneme and a right context of an "ae" phoneme. There are two transitions from the state "#[k]ae": one for which the next phoneme (the phoneme after the "ae") is a "t", and another for which the next phoneme is a "p".

Based on the acoustic model 108, the dictionary 110, and the language model 112, costs may be assigned to one or more of the states and/or transitions. For example, if a particular phoneme pattern is rare, a transition to a state representing that phoneme pattern may have a higher cost than a transition to a state representing a more common phoneme pattern. Similarly, the conditional probabilities from the language model (see Table 2 for examples) may also be used to assign costs to states and/or transitions. For instance, in Table 2, given a phrase with the words "cat and", the conditional probability of the next word in the phrase being "dog" is 0.5, while the conditional probability of the next word in the phrase being "mouse" is 0.35. Therefore, the transition from state "ae[n]d" to state "n[d]m" may have a higher cost than the transition from state "ae[n]d" to state "n[d]d".
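The relationship between conditional probabilities and costs can be sketched as follows. The mapping cost = -log(p) is an assumption chosen for illustration (it is a common convention, but the text does not state which mapping is used), and `transition_cost` is a hypothetical name.

```python
import math

def transition_cost(probability):
    """Map a conditional probability to a non-negative cost (assumed: -log p)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)

cost_dog = transition_cost(0.5)     # "cat and" followed by "dog"
cost_mouse = transition_cost(0.35)  # "cat and" followed by "mouse"
print(cost_dog < cost_mouse)  # the more probable continuation is cheaper
```

With this choice, multiplying probabilities along a path corresponds to adding costs, which is convenient for path search.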

The search graph 300, including any states, transitions between states, and associated costs therein, may be used to estimate text string transcriptions for new input utterances. For example, the pattern classification module 106 may determine a sequence of one or more words that match an input utterance based on the search graph 300. The pattern classification module 106 may be configured to attempt to determine the following:

where a is a stream of feature vectors derived from the input utterance, P(a|w) represents the probability of those feature vectors being produced by a word sequence w, and P(w) is the probability assigned to w by the language model 112. For example, P(w) may be based on n-gram conditional probabilities as discussed above, as well as other factors. The function argmax_w may return the value of w that maximizes P(a|w)P(w).
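The decision rule described above can be written compactly as shown below. This is a reconstruction from the definitions of a, P(a|w), P(w), and argmax_w given in the text, not a verbatim copy of the patent's expression; the symbol \hat{w} for the selected word sequence is an assumption.

```latex
\hat{w} = \operatorname*{argmax}_{w} \; P(a \mid w)\, P(w)
```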

Referring back to FIG. 1, to find text strings that may match the utterance 100, the pattern classification module 106 may be configured to attempt to find paths from an initial state in the search graph 300 to a terminal state in the search graph 300 based on the feature vectors 104. This process may involve the pattern classification module 106 performing a breadth-first search, an A-star (A*) search, a beam search, or some other type of search on the search graph 300. The pattern classification module 106 may be configured to assign a total cost to one or more paths through the search graph 300 based on the costs associated with the states and/or transitions of each path. Some of these costs may be based on, for instance, a confidence level that a particular segment of the utterance maps to a particular sequence of phonemes in the path.

As an example, the utterance 100 may include the phrase "cat and dog", and the pattern classification module 106 may be configured to step through the search graph 300 phoneme by phoneme to find the path beginning with the initial state "#[k]ae" and ending in the terminal state "cat and dog". The pattern classification module 106 may also be configured to find one or more additional paths through the search graph 300. For example, the pattern classification module 106 may also be configured to associate the utterance 100 with the path having the initial state "#[k]ae" and ending in the terminal state "cat and mouse", and with the path having the initial state "#[k]ae" and ending in the terminal state "catapult". Nonetheless, the pattern classification module 106 may be configured to assign a lower cost (or a higher probability of occurrence) to the path with the terminal state "cat and dog" than to the other paths. Consequently, the path with the terminal state "cat and dog" may be selected as the most likely transcription for the input utterance 100.
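The idea of selecting the lowest-cost path can be sketched with a small uniform-cost (Dijkstra-style) search, which is one possible instance of the "other type of search" mentioned above rather than the specific search used by the ASR system. The graph below is a hypothetical word-level simplification of search graph 300, the costs are invented, and "cat" is treated as an intermediate state only so that the toy example reaches the longer phrases.

```python
import heapq

def cheapest_path(graph, start, terminals):
    """Uniform-cost search: return (total_cost, path) to the cheapest terminal."""
    frontier = [(0.0, [start])]
    settled = set()
    while frontier:
        cost, path = heapq.heappop(frontier)
        state = path[-1]
        if state in terminals:
            return cost, path
        if state in settled:
            continue
        settled.add(state)
        for next_state, step_cost in graph.get(state, []):
            heapq.heappush(frontier, (cost + step_cost, path + [next_state]))
    return None  # no terminal state is reachable

# Hypothetical costs; state names are simplified stand-ins for search graph 300.
graph = {
    "#[k]ae": [("cat", 1.0), ("cap", 3.0), ("catapult", 4.0)],
    "cat": [("cat and", 0.5)],
    "cat and": [("cat and dog", 0.75), ("cat and mouse", 1.25)],
}
terminals = {"cat and dog", "cat and mouse", "catapult", "cap"}
print(cheapest_path(graph, "#[k]ae", terminals))
```

In a real recognizer the step costs would also depend on how well each segment of the utterance matches the phonemes along the path, so the cheapest path changes with the input.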

An ASR system can be operated in many different ways. The examples described above are presented for purposes of illustration and may not be the only way in which an ASR system operates.

As described above, the search space in a speech recognition database of an ASR system can be quite large. In some examples, the ASR system, to create the speech recognition database, may be configured to mine document sources such as typed queries, news articles, and other materials to generate a statistical language model. The language model may assign, for example, a certain probability to every possible word sequence. In examples, the language model may allow word sequences that do not occur in the document sources; that is, the language model may allow permutations and combinations of the words of phrases whether or not those phrases occur in the document sources. Generalizing to sequences that do not occur in the document sources may be referred to as smoothing.

Smoothing can be useful because a user may utter unique or new phrases that may not exist in the document sources. However, allowing permutations and combinations of words may produce nonsensical word sequences. For example, if a source phrase is "show me football results", a nonsensical word sequence may be "show results football me".

A reduction in the search space may make the ASR system more computationally efficient. Generally, users of ASR systems may produce utterances with a high degree of repeatability. In some examples, repetitions of utterances may be based on trends of indefinite duration (e.g., results for a seasonal sport). In other examples, repetitions of utterances may be predictable based on the popularity of topics with which the utterances may be associated (e.g., utterances associated with a current event of a given duration, such as the Olympics). In examples, an ASR system may be configured to utilize such predictable repetitions to generate computationally efficient language models.

In an example, for the ASR system to be computationally efficient, the ASR system may be configured to generate sequences of words based on popular phrases. Further, instead of the language model allowing every sequence of the words of a popular phrase to be hypothesized regardless of the order of the words in the popular phrase, the ASR system may be configured to determine a set of groupings or subsequences of the words of the popular phrase such that the groupings or subsequences include words in the same order in which the words occur in the popular phrase.

As an example for illustration, a popular phrase may include five words in a given sequence, "word1 word2 word3 word4 word5". A given language model may allow the subsequence or grouping "word2 word3 word5", whereas a more efficient language model may not, because this subsequence does not occur in the source popular phrase. In this manner, the search space for the ASR system may be limited or reduced, allowing for higher accuracy and computational efficiency.

FIG. 4 is a flowchart of an example method for efficient speech recognition, in accordance with an embodiment.

The method 400 may include one or more operations, functions, or actions as illustrated by one or more of blocks 402-406. Although the blocks are illustrated in a sequential order, these blocks may in some instances be performed in parallel and/or in a different order than described herein. Also, the various blocks may be combined into fewer blocks, divided into additional blocks, and/or removed based upon the desired implementation.

In addition, for the method 400 and other processes and methods disclosed herein, the flowchart shows the functionality and operation of one possible implementation of the present examples. In this regard, each block may represent a module, a segment, or a portion of program code, which includes one or more instructions executable by a processor for implementing specific logical functions or steps in the process. The program code may be stored on any type of computer readable medium or memory, for example, a storage device including a disk or hard drive. The computer readable medium may include a non-transitory computer readable medium or memory, for example, computer readable media that store data for short periods of time, such as register memory, processor cache, and random access memory (RAM). The computer readable medium may also include non-transitory media or memory, such as secondary or persistent long-term storage, for example, read-only memory (ROM), optical or magnetic disks, and compact-disc read-only memory (CD-ROM). The computer readable medium may also be any other volatile or non-volatile storage system. The computer readable medium may be considered, for example, a computer readable storage medium, a tangible storage device, or another article of manufacture.

In addition, for the method 400 and other processes and methods disclosed herein, each block in FIG. 4 may represent circuitry that is wired to perform the specific logical functions in the process.

At block 402, the method 400 includes receiving, at a computing device, information indicative of a frequency of submission of a search query to a search engine, where the search query may comprise a sequence of words. The computing device can be, for example, a mobile telephone, a personal digital assistant (PDA), a laptop, notebook, or netbook computer, a tablet computing device, a wearable computing device, a server in a cloud-based computing system, etc.

In an example, a sudden increase in search query activity, commonly referred to as spiking, can result from a number of sources. Spiking can result from regular or popular occurrences such as holidays or sporting events, or from irregular events such as high-profile news items. In one example, the computing device (e.g., a server) may be configured to receive information associated with tracking the frequency of submission of a search query (or a plurality of search queries) to a search engine in a given period of time, in order to identify popular or spiking queries. A given search query may be, for example, a text string (phrase) or a voice search query uttered by a user of a given device (e.g., a mobile telephone). In examples, popular or spiking queries can be identified or extracted every day, every week, or within any other unit of time.

Referring back to FIG. 4, at block 404, the method 400 includes, based on the frequency of submission of the search query exceeding a threshold, determining, for the sequence of words of the search query, groupings of one or more words of the search query based on the order in which the one or more words occur in the sequence of words of the search query. In an example, based on the information indicative of the frequency of submission of the search query to the search engine, the computing device may be configured to determine a metric indicative of the popularity of the search query and to identify whether the search query is popular in a given period of time. For instance, the computing device may be configured to determine, based on historical use of the search engine, a threshold such that if the frequency of submission of the search query exceeds the threshold, the search query can be designated as a popular or spiking search query.
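One way a threshold could be derived from historical use is sketched below. The rule used here (mean of past daily counts plus k standard deviations) is purely an assumption for illustration; the text does not specify how the threshold is computed, and the names `spike_threshold` and `is_spiking` and the sample counts are hypothetical.

```python
import statistics

def spike_threshold(daily_counts, k=3.0):
    """Assumed rule: mean of past daily counts plus k population standard deviations."""
    return statistics.mean(daily_counts) + k * statistics.pstdev(daily_counts)

def is_spiking(todays_count, daily_counts, k=3.0):
    """True if today's submission count exceeds the history-based threshold."""
    return todays_count > spike_threshold(daily_counts, k)

history = [100, 110, 90, 105, 95]  # hypothetical submissions per day
print(is_spiking(400, history), is_spiking(108, history))
```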

In an example, the computing device may be configured to determine the metric based on a time series analysis of submissions of the query over time, and to compare the metric to a threshold to determine the popularity of the query. The metric may, for example, be related to query acceleration or velocity. Query velocity can be calculated, for example, as the inverse of the difference in time between the instant query request and the most recent query request. The difference in time can be calculated as dt = (time of this query instance - last time the query was seen), and the query velocity can be determined as 1/dt. Query acceleration for a given query can be determined as the difference between the current query velocity (or average query velocity) and a previously calculated query velocity (or previously calculated average query velocity) determined at an earlier time, multiplied by the instantaneous query velocity. The metric can be a function of the query velocity, the query acceleration, or other parameters determined based on the time series analysis of the submissions of the query over time. Other parameters, or other methods of calculating these parameters, are possible.
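The velocity and acceleration calculations described above can be sketched as follows. The function names and timestamps are hypothetical, and the "instantaneous query velocity" used as the multiplier is assumed here to be the current velocity, which is one reading of the description.

```python
def query_velocity(current_time, previous_time):
    """Velocity = 1 / dt, where dt is the time since the query was last seen."""
    dt = current_time - previous_time
    if dt <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return 1.0 / dt

def query_acceleration(current_velocity, previous_velocity):
    """(current velocity - previous velocity), multiplied by the current velocity."""
    return (current_velocity - previous_velocity) * current_velocity

# Hypothetical arrival times (in seconds) of the same query.
times = [0.0, 8.0, 12.0, 14.0]
velocities = [query_velocity(t1, t0) for t0, t1 in zip(times, times[1:])]
accelerations = [query_acceleration(v1, v0)
                 for v0, v1 in zip(velocities, velocities[1:])]
print(velocities, accelerations)
```

Shrinking gaps between submissions yield increasing velocity and positive acceleration, which is the pattern a spiking query would show.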

As an example to illustrate the method 400, the computing device may be configured to determine two popular phrases, "hello world I am here" and "world war two", based on the respective frequencies of submission of the two queries to the search engine. Each of these two search queries comprises a sequence of words in a given order. For example, the search query "world war two" includes the three words "world", "war", and "two" in that sequence.

The computing device may be configured to determine groupings of words based on the order in which the words occur in the corresponding search query. For example, with respect to the search query "world war two", the computing device may be configured to determine the following groupings of words:
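Because the groupings preserve word order and, as the later discussion of the factor graph makes explicit, do not skip words, they correspond to the contiguous runs of words in the query. A minimal sketch that enumerates them is given below; `contiguous_groupings` is a hypothetical name, and the listing order is chosen only for readability.

```python
def contiguous_groupings(query):
    """List every run of adjacent words, shortest first, in query order."""
    words = query.split()
    n = len(words)
    return [" ".join(words[i:i + length])
            for length in range(1, n + 1)
            for i in range(n - length + 1)]

print(contiguous_groupings("world war two"))
```

A query of n words has n(n + 1)/2 such groupings, which is why explicit enumeration grows quadratically with query length.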

These groupings may also be referred to as factors. For a large set of sequences, the number of groupings may be quadratic in the size of a given search query, so enumerating all of the groupings of words may be prohibitive. To determine the factors or groupings more efficiently, the computing device may be configured to determine or generate an automaton representation and a factor graph for the search queries. The factor graph may represent the groupings more compactly and may allow for more efficient searching.

FIG. 5A illustrates an example automaton representation 500 for example search queries, in accordance with an embodiment. The automaton representation 500 represents both search queries, "hello world I am here" and "world war two". The automaton representation 500 includes automaton states such as an initial state 502A, an intermediate state 502B, and a terminal state 502C. The automaton representation 500 also includes automaton arcs such as automaton arcs 504A and 504B, and each automaton arc may correspond to a word from the sequence of words of a given search query.

FIG. 5B illustrates an example bi-gram language model 506 for the example search queries, in accordance with an embodiment. The language model 506 allows all possible groupings of the words of the search queries regardless of the order of the words in the corresponding search query. As shown in FIG. 5B, the language model 506 is complex and may allow nonsensical groupings of words such as "hello world here I".

In contrast, a factor graph that compactly represents the possible groupings of the words of a given search query, based on the order of the words in the search query, can be generated based on the automaton representation 500. FIG. 5C illustrates an example factor graph 508 for the example search queries, in accordance with an embodiment. The factor graph 508 is less complex than the language model 506 depicted in FIG. 5B and allows groupings of words based on the order of the words in the corresponding search query.

As an example, to determine a given grouping of words, the computing device may be configured to select a word represented by an automaton arc (e.g., arc 510A) connected to a first automaton state (e.g., initial state 512A), continue to a second automaton state (e.g., state 512B) adjacent to the first automaton state, and select a word represented by a second arc (e.g., arc 510B) to determine, for example, the grouping "I am". A given grouping may be allowed to begin at any given state in the factor graph 508. The factor graph 508 may allow a grouping such as "hello world I", but does not allow "hello I". In other words, the factor graph 508 does not allow skipping words or deviating from the order of the words in the source search query. In this manner, the factor graph 508 can be considered a compact and efficient way of representing groupings of the words of a given search query based on the order in which the words occur in that search query.
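The acceptance behavior described above can be approximated with a trie that holds every suffix of every query and treats every node as accepting. This is a simple, non-minimized stand-in written for illustration; it is not claimed to have the same states and arcs as the factor graph 508, and the names `build_factor_trie` and `accepts` are hypothetical.

```python
def build_factor_trie(queries):
    """Insert every suffix of every query into a trie; each node is accepting."""
    root = {}
    for query in queries:
        words = query.split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.setdefault(word, {})
    return root

def accepts(trie, phrase):
    """True if phrase is a non-empty, contiguous, in-order run of words of a query."""
    words = phrase.split()
    node = trie
    for word in words:
        if word not in node:
            return False
        node = node[word]
    return bool(words)

trie = build_factor_trie(["hello world I am here", "world war two"])
print(accepts(trie, "hello world I"), accepts(trie, "hello I"))
```

Starting a match at any suffix mirrors the statement that a grouping may begin at any state, while walking the trie word by word rules out skipped or reordered words.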

The computing device may be configured to frequently (e.g., every day) identify popular or spiking queries (as described above at block 402) and to build or generate a factor graph, such as the factor graph 508, for the queries. Generating a factor graph such as the factor graph 508 may be more efficient than building a full language model such as the language model 506. Furthermore, the factor graph 508 may provide more flexibility than allowing only verbatim spiking queries, because the factor graph 508 efficiently allows subsequences. For example, if "Albert Einstein Relativity" is identified as a popular or spiking query, the corresponding factor graph may allow the groupings "Einstein Relativity" and "Albert Einstein", which, when submitted to a search engine, may yield search results similar to those for the popular query "Albert Einstein Relativity".

Referring back to FIG. 4, at block 406, the method 400 includes providing information indicative of the groupings to a speech recognition system to update a corpus of given sequences of words, where the speech recognition system is configured to convert a given spoken utterance into a given sequence of words based on the corpus of given sequences of words. The computing device (e.g., a server) may be coupled to, or in communication with, a speech recognition system such as the ASR system depicted in FIG. 1. In one example, the computing device may include the speech recognition system.

In examples, the speech recognition system may include a speech recognition database that includes a corpus of given sequences of words that may have been produced by a language model such as the language model 112 in FIG. 1. The speech recognition system may be configured to receive a given spoken utterance and to match the given spoken utterance to a sequence of words from the corpus of given sequences of words, as described in FIGS. 1-3, for example. The computing device may be configured to generate a factor graph depicting the groupings of the words of the spiking search query, and to provide the factor graph and/or the groupings to the speech recognition system so that the groupings are included in the corpus (e.g., to augment the corpus).

In some examples, after the corpus is updated with the groupings corresponding to a popular search query, the computing device may be configured to cause the search space in the corpus to be constrained. For example, the search space may be constrained to at least the groupings represented by the search graph. In another example, the speech recognition system may be configured to attempt to match a given spoken utterance to one of the groupings before attempting to match it to other word sequences in the corpus.

In still another example, the speech recognition system may be configured to generate a search graph, such as the search graph 300, corresponding to the factor graph generated by the computing device for the popular query. The search graph corresponding to the factor graph may, for example, be integrated into a larger search graph for other sequences of words. To find a text string that may match a given utterance, the speech recognition system may be configured to attempt to find a path from an initial state of the search graph to a terminal state of the search graph, and to assign a total cost to one or more paths through the search graph based on costs associated with the states and/or transitions of each path. Paths corresponding to the groupings of words of the factor graph may, for example, be assigned a lower cost (i.e., a higher probability) than other paths.
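
As a rough illustration of path costs, the following Python sketch finds the lowest-cost path through a toy search graph. It is a sketch under stated assumptions, not the search procedure of the described system: the graph layout, the state names, the cost values, and the function `cheapest_path` are all hypothetical, costs are attached to transitions only, and a plain uniform-cost (Dijkstra-style) search stands in for whatever decoder a real recognizer would use.

```python
import heapq


def cheapest_path(graph, start, goal):
    """Return (total_cost, words) for the lowest-cost path, or None.

    graph maps a state to a list of (next_state, word, transition_cost).
    Lower cost corresponds to higher probability.
    """
    queue = [(0.0, start, [])]
    settled = {}
    while queue:
        cost, state, words = heapq.heappop(queue)
        if state == goal:
            return cost, words
        if state in settled and settled[state] <= cost:
            continue  # already reached this state at least as cheaply
        settled[state] = cost
        for next_state, word, step_cost in graph.get(state, []):
            heapq.heappush(queue, (cost + step_cost, next_state, words + [word]))
    return None


# Hypothetical graph: the path for the popular grouping "albert einstein"
# carries lower transition costs than the competing path.
toy_graph = {
    "start": [("a", "albert", 1.0), ("b", "all", 1.5)],
    "a": [("end", "einstein", 0.5)],
    "b": [("end", "burt", 2.0)],
}

if __name__ == "__main__":
    print(cheapest_path(toy_graph, "start", "end"))
```

Lowering the transition costs along a grouping's path is one simple way to make that path win whenever the acoustic evidence is otherwise ambiguous.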

In examples, the speech recognition system may receive a spoken utterance that is unrelated to, and does not match, any of the groupings of words of the popular or spiking search queries. To handle this possibility, in one example, the speech recognition system may be configured to constrain the search space to the factor graph, i.e., to attempt to trace paths of the factor graph to identify a high-confidence match; if that attempt fails, the speech recognition system may be configured to use the full language model or the rest of the corpus to identify a match. In another example, the speech recognition system may be configured to trace the factor graph and the full language model in parallel and to terminate the search when a match is identified in either the factor graph or the full language model. Other search strategies that combine the search graph and the full language model are possible.
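
The first strategy above (groupings first, full corpus as a fallback) can be sketched in Python as follows. This is only an illustration under loud assumptions: the utterance is represented as already-decoded text, `difflib.SequenceMatcher` string similarity stands in for a real acoustic and language-model score, and the names `recognize`, `best_match`, and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in for a recognizer confidence score (assumption, not a real ASR score).
    return SequenceMatcher(None, a, b).ratio()


def best_match(utterance: str, candidates: list[str]) -> tuple[str, float]:
    best = max(candidates, key=lambda c: similarity(utterance, c))
    return best, similarity(utterance, best)


def recognize(utterance, groupings, corpus, threshold=0.9):
    """Try the popular-query groupings first; fall back to the full corpus."""
    candidate, score = best_match(utterance, groupings)
    if score >= threshold:
        return candidate, "groupings"
    candidate, _ = best_match(utterance, corpus)
    return candidate, "corpus"


if __name__ == "__main__":
    popular = ["albert einstein", "einstein relativity"]
    full_corpus = ["weather tomorrow", "pizza near me"]
    print(recognize("albert einstein", popular, full_corpus))
    print(recognize("weather tomorrow", popular, full_corpus))
```

The parallel strategy would instead score both candidate sets concurrently and stop as soon as either yields a match.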

Further, as described above with respect to the language model 112, the given sequences of words in the corpus may be assigned probabilities of occurrence that may be estimated from their respective numbers of appearances in the corpus. Thus, in addition to providing the groupings to the speech recognition system, the computing device may be configured to update the probabilities of occurrence based on the groupings. For example, the computing device may be configured to assign to the groupings respective probabilities of occurrence that are higher than the given probabilities of occurrence of other sequences of words in the corpus. The assigned probabilities may be based on how popular the search query is, for example, on the information indicating the frequency of submission of the search query to the search engine.
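
One simple way to picture such an update is to treat query submissions as additional observations of each grouping and renormalize. The Python sketch below does exactly that; it is an illustrative assumption rather than the estimation method of the described system, and the function name `update_probabilities` and all counts are hypothetical.

```python
def update_probabilities(corpus_counts, grouping_counts):
    """Merge grouping counts into corpus counts and return relative frequencies.

    corpus_counts:   word sequence -> number of appearances in the corpus
    grouping_counts: grouping -> count derived from query submission frequency
    """
    merged = dict(corpus_counts)
    for grouping, count in grouping_counts.items():
        merged[grouping] = merged.get(grouping, 0) + count
    total = sum(merged.values())
    return {sequence: count / total for sequence, count in merged.items()}


if __name__ == "__main__":
    corpus = {"weather tomorrow": 50, "albert einstein": 10}
    spiking = {"albert einstein": 90, "einstein relativity": 50}
    for sequence, p in update_probabilities(corpus, spiking).items():
        print(f"{sequence}: {p:.2f}")
```

With these made-up counts, the grouping "albert einstein" ends up more probable than the unrelated sequence "weather tomorrow", reflecting the popularity of the query.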

In some examples, the probabilities of occurrence for the groupings may vary over time. In some cases, the popularity of a given search query may decrease over time. As an illustrative example, a search query about Olympic results may be popular during the Olympic Games and perhaps for a given period after the Games; however, the popularity of such a search query may decrease over time. Thus, in this example, the computing device may be configured to cause the probabilities of occurrence of the groupings to decay. In other examples, the computing device may be configured to continuously evaluate how popular the search query is and to update or modify the probabilities accordingly, based on updated changes in the popularity of the search query.
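
The text does not specify a decay schedule. As one hedged possibility, the Python sketch below decays the boosted part of a grouping's probability exponentially toward a baseline; the half-life of seven days, the baseline, and the function name `decayed_probability` are assumptions made purely for illustration.

```python
def decayed_probability(boosted, baseline, days_elapsed, half_life_days=7.0):
    """Decay the boost above the baseline by half every half_life_days."""
    excess = max(boosted - baseline, 0.0)
    return baseline + excess * 0.5 ** (days_elapsed / half_life_days)


if __name__ == "__main__":
    for day in (0, 7, 14, 28):
        print(day, round(decayed_probability(0.5, 0.1, day), 4))
```

The alternative described in the text, continuously re-evaluating popularity, would replace the fixed schedule with probabilities recomputed from fresh submission frequencies.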

FIG. 6 illustrates an example distributed computing architecture, in accordance with an example embodiment. FIG. 6 shows server devices 602 and 604 configured to communicate, via a network 606, with programmable devices 608a, 608b, and 608c. The network 606 may correspond to a LAN, a wide area network (WAN), a corporate intranet, the public Internet, or any other type of network configured to provide a communication path between networked computing devices. The network 606 may also correspond to a combination of one or more LANs, WANs, corporate intranets, and/or the public Internet.

Although FIG. 6 shows three programmable devices, distributed application architectures may serve tens, hundreds, or thousands of programmable devices. Moreover, the programmable devices 608a, 608b, and 608c (or any additional programmable devices) may be any sort of computing device, such as an ordinary laptop computer, a desktop computer, a network terminal, or a wireless communication device (e.g., a tablet, a cell phone or smartphone, or a wearable computing device). In some examples, the programmable devices 608a, 608b, and 608c may be dedicated to the design and use of software applications. In other examples, the programmable devices 608a, 608b, and 608c may be general-purpose computers that are configured to perform a number of tasks and need not be dedicated to software development tools.

The server devices 602 and 604 may be configured to perform one or more services as requested by the programmable devices 608a, 608b, and/or 608c. For example, the server devices 602 and/or 604 may provide content to the programmable devices 608a-608c. The content may include, but is not limited to, web pages, hypertext, scripts, binary data such as compiled software, images, audio, and/or video. The content may include compressed and/or uncompressed content. The content may be encrypted and/or unencrypted. Other types of content are possible as well.

As another example, the server devices 602 and/or 604 may provide the programmable devices 608a-608c with access to software for database, search, computation, graphical, audio (e.g., speech recognition), video, World Wide Web/Internet utilization, and/or other functions. Many other examples of server devices are possible as well.

The server devices 602 and/or 604 may be cloud-based devices that store program logic and/or data of cloud-based applications and/or services. In some examples, the server devices 602 and/or 604 may be a single computing device residing in a single computing center. In other examples, the server devices 602 and/or 604 may include multiple computing devices in a single computing center, or multiple computing devices located in multiple computing centers in diverse geographic locations. For example, FIG. 6 depicts each of the server devices 602 and 604 residing in a different physical location.

In some examples, data and services at the server devices 602 and/or 604 may be encoded as computer-readable information stored in non-transitory, tangible computer-readable media (or computer-readable storage media) and accessible by the programmable devices 608a, 608b, and 608c and/or other computing devices. In some examples, data at the server devices 602 and/or 604 may be stored on a single disk drive or other tangible storage media, or may be implemented on multiple disk drives or other tangible storage media located at one or more diverse geographic locations.

FIG. 7A is a block diagram of a computing device (e.g., a system) in accordance with an example embodiment. In particular, the computing device 700 shown in FIG. 7A may be configured to perform one or more functions of one or more of the server devices 602 and 604, the network 606, and/or the programmable devices 608a, 608b, and 608c. The computing device 700 may include a user interface module 702, a network communication interface module 704, one or more processors 706, and data storage 708, all of which may be linked together via a system bus, network, or other connection mechanism 710.

The user interface module 702 may be operable to send data to and/or receive data from external user input/output devices. For example, the user interface module 702 may be configured to send data to and/or receive data from user input devices such as a keyboard, a keypad, a touch screen, a computer mouse, a trackball, a joystick, a camera, a voice recognition/synthesis module, and/or other similar devices. The user interface module 702 may also be configured to provide output to user display devices, whether now known or later developed, such as one or more cathode ray tubes (CRTs), liquid crystal displays (LCDs), light emitting diodes (LEDs), displays using digital light processing (DLP) technology, printers, light bulbs, and/or other similar devices. The user interface module 702 may further be configured to generate recognized speech or audible output, and may include a speaker, a speaker jack, an audio output port, an audio output device, earphones, and/or other similar devices.

The network communication interface module 704 may include one or more wireless interfaces 712 and/or one or more wired interfaces 714 that are configurable to communicate via a network, such as the network 606 shown in FIG. 6. The wireless interfaces 712 may include one or more wireless transmitters, receivers, and/or transceivers, such as a Bluetooth transceiver, a Zigbee transceiver, a Wi-Fi transceiver, an LTE transceiver, and/or another similar type of wireless transceiver configurable to communicate via a wireless network. The wired interfaces 714 may include one or more wired transmitters, receivers, and/or transceivers, such as an Ethernet transceiver, a Universal Serial Bus (USB) transceiver, or a similar transceiver configurable to communicate via a twisted pair wire, a coaxial cable, a fiber-optic link, or a similar physical connection to a wired network.

In some examples, the network communication interface module 704 may be configured to provide reliable, secured, and/or authenticated communications. For each communication described herein, information for ensuring reliable communications (i.e., guaranteed message delivery) may be provided, perhaps as part of a message header and/or footer (e.g., packet/message sequencing information, encapsulation headers and/or footers, size/time information, and transmission verification information such as CRC and/or parity check values). Communications may be made secure (e.g., encoded or encrypted) and/or decrypted/decoded using one or more cryptographic protocols and/or algorithms, such as, but not limited to, DES, AES, RSA, Diffie-Hellman, and/or DSA. Other cryptographic protocols and/or algorithms may be used as well, or in addition to those listed herein, to secure (and then decrypt/decode) communications.
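
To make the header and footer fields concrete, the following Python sketch frames a message with a sequence number and size in the header and a CRC-32 value in the footer. The frame layout, field widths, and the names `frame` and `unframe` are hypothetical and chosen only for illustration; the text does not define a wire format, and a CRC detects accidental corruption only, so it does not provide security.

```python
import struct
import zlib

HEADER = struct.Struct("!II")  # sequence number, payload length (assumed layout)
FOOTER = struct.Struct("!I")   # CRC-32 over header and payload


def frame(sequence: int, payload: bytes) -> bytes:
    header = HEADER.pack(sequence, len(payload))
    checksum = zlib.crc32(header + payload)
    return header + payload + FOOTER.pack(checksum)


def unframe(data: bytes) -> tuple[int, bytes]:
    if len(data) < HEADER.size + FOOTER.size:
        raise ValueError("frame too short")
    header = data[:HEADER.size]
    payload = data[HEADER.size:-FOOTER.size]
    (checksum,) = FOOTER.unpack(data[-FOOTER.size:])
    sequence, length = HEADER.unpack(header)
    if length != len(payload) or zlib.crc32(header + payload) != checksum:
        raise ValueError("corrupted frame")
    return sequence, payload


if __name__ == "__main__":
    packet = frame(7, b"einstein relativity")
    print(unframe(packet))
```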

The processors 706 may include one or more general-purpose processors and/or one or more special-purpose processors (e.g., digital signal processors, application-specific integrated circuits, etc.). The processors 706 may be configured to execute computer-readable program instructions 715 contained in the data storage 708 and/or other instructions as described herein (e.g., the method 400).

The data storage 708 may include one or more computer-readable storage media that can be read and/or accessed by at least one of the processors 706. The one or more computer-readable storage media may include volatile and/or non-volatile storage components, such as optical, magnetic, organic, or other memory or disk storage, which may be integrated in whole or in part with at least one of the processors 706. In some examples, the data storage 708 may be implemented using a single physical device (e.g., one optical, magnetic, organic, or other memory or disk storage unit), while in other examples, the data storage 708 may be implemented using two or more physical devices.

The data storage 708 may include the computer-readable program instructions 715 and perhaps additional data, such as, but not limited to, data used by one or more processes and/or threads of a software application. In some examples, the data storage 708 may additionally include storage required to perform at least part of the methods (e.g., the method 400) and techniques described herein and/or at least part of the functionality of the devices and networks described herein.

FIG. 7B depicts a cloud-based server system, in accordance with an example embodiment. In FIG. 7B, the functions of the server devices 602 and/or 604 may be distributed among three computing clusters 716a, 716b, and 716c. The computing cluster 716a may include one or more computing devices 718a, cluster storage arrays 720a, and cluster routers 722a connected by a local cluster network 724a. Similarly, the computing cluster 716b may include one or more computing devices 718b, cluster storage arrays 720b, and cluster routers 722b connected by a local cluster network 724b. Likewise, the computing cluster 716c may include one or more computing devices 718c, cluster storage arrays 720c, and cluster routers 722c connected by a local cluster network 724c.

In some examples, each of the computing clusters 716a, 716b, and 716c may have an equal number of computing devices, an equal number of cluster storage arrays, and an equal number of cluster routers. In other examples, however, each computing cluster may have different numbers of computing devices, cluster storage arrays, and cluster routers. The number of computing devices, cluster storage arrays, and cluster routers in each computing cluster may depend on the computing task or tasks assigned to that computing cluster.

In the computing cluster 716a, for example, the computing devices 718a may be configured to perform various computing tasks of the server device 602. In one example, the various functionalities of the server device 602 may be distributed among one or more of the computing devices 718a, 718b, and 718c. The computing devices 718b and 718c in the computing clusters 716b and 716c may be configured similarly to the computing devices 718a in the computing cluster 716a. On the other hand, in some examples, the computing devices 718a, 718b, and 718c may be configured to perform different functions.

In some examples, computing tasks and stored data associated with the server devices 602 and/or 604 may be distributed across the computing devices 718a, 718b, and 718c based at least in part on the processing requirements of the server devices 602 and/or 604, the processing capabilities of the computing devices 718a, 718b, and 718c, the latency of the network links between the computing devices in each computing cluster and between the computing clusters themselves, and/or other factors that may contribute to the cost, speed, fault tolerance, resiliency, efficiency, and/or other design goals of the overall system architecture.

The cluster storage arrays 720a, 720b, and 720c of the computing clusters 716a, 716b, and 716c may be data storage arrays that include disk array controllers configured to manage read and write access to groups of hard disk drives. The disk array controllers, alone or in conjunction with their respective computing devices, may also be configured to manage backup or redundant copies of the data stored in the cluster storage arrays, to protect against disk drive or other cluster storage array failures and/or network failures that prevent one or more computing devices from accessing one or more cluster storage arrays.

Similar to the manner in which the functions of the server devices 602 and/or 604 may be distributed across the computing devices 718a, 718b, and 718c of the computing clusters 716a, 716b, and 716c, various active portions and/or backup portions of these components may be distributed across the cluster storage arrays 720a, 720b, and 720c. For example, some cluster storage arrays may be configured to store the data of the server device 602, while other cluster storage arrays may store the data of the server device 604. Additionally, some cluster storage arrays may be configured to store backup versions of data stored in other cluster storage arrays.

The cluster routers 722a, 722b, and 722c in the computing clusters 716a, 716b, and 716c may include networking equipment configured to provide internal and external communications for the computing clusters. For example, the cluster routers 722a in the computing cluster 716a may include one or more Internet switching and routing devices configured to provide (i) local area network communications between the computing devices 718a and the cluster storage arrays 720a via the local cluster network 724a, and (ii) wide area network communications between the computing cluster 716a and the computing clusters 716b and 716c via the wide area network connection 726a to the network 606. The cluster routers 722b and 722c may include network equipment similar to that of the cluster routers 722a, and may perform for the computing clusters 716b and 716c networking functions similar to those that the cluster routers 722a perform for the computing cluster 716a.

In some examples, the configuration of the cluster routers 722a, 722b, and 722c may be based at least in part on the data communication requirements of the computing devices and cluster storage arrays, the data communication capabilities of the network equipment in the cluster routers 722a, 722b, and 722c, the latency and throughput of the local networks 724a, 724b, and 724c, the latency, throughput, and cost of the wide area network links 726a, 726b, and 726c, and/or other factors that may contribute to the cost, speed, fault tolerance, resiliency, efficiency, and/or other design goals of the moderation system architecture.

In examples, the configurations illustrated in FIGS. 6 and 7A-7B may be used for the implementations described with respect to the method 400. For example, the computing device implementing the method 400 may be a cloud-based device (e.g., the server devices 602 and/or 604). In this example, the computing device may be configured to receive information associated with search queries submitted by the programmable devices 608a-c in FIG. 6 or the computing devices 718a-c of FIG. 7B, to determine spiking queries, and to generate corresponding factor graphs. The factor graphs may then be provided to a speech recognition system that may also be implemented in cloud-based devices such as the server devices 602 and/or 604.

In some examples, the disclosed methods (e.g., the method 400) may be implemented as computer program instructions encoded in a machine-readable format on a non-transitory computer-readable storage medium, or on other non-transitory media or articles of manufacture. FIG. 8 is a schematic illustrating a conceptual partial view of an example computer program product that includes a computer program for executing a computer process on a computing device, arranged according to at least some embodiments presented herein.

In one embodiment, the example computer program product 800 is provided using a signal bearing medium 801. The signal bearing medium 801 may include one or more program instructions 802 that, when executed by one or more processors, may provide the functionality, or portions of the functionality, described above with respect to FIGS. 1-7. In some examples, the signal bearing medium 801 may include a computer-readable medium 803, such as, but not limited to, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), a digital tape, or memory. In some implementations, the signal bearing medium 801 may include a computer-recordable medium 804, such as, but not limited to, memory, read/write (R/W) CDs, or R/W DVDs. In some implementations, the signal bearing medium 801 may include a communications medium 805, such as, but not limited to, a digital and/or analog communication medium (e.g., a fiber-optic cable, a waveguide, a wired communication link, or a wireless communication link). Thus, for example, the signal bearing medium 801 may be conveyed by a wireless form of the communications medium 805 (e.g., a wireless communications medium conforming to the IEEE 802.11 standard or another transmission protocol).

The one or more program instructions 802 may be, for example, computer-executable and/or logic-implemented instructions. In some examples, a computing device such as the programmable devices 608a-c in FIG. 6 or the computing devices 718a-c of FIG. 7B may be configured to provide various operations, functions, or actions in response to the program instructions 802 conveyed to the programmable devices 608a-c or the computing devices 718a-c by one or more of the computer-readable medium 803, the computer-recordable medium 804, and/or the communications medium 805.

It should be understood that the arrangements described herein are for purposes of example only. As such, those skilled in the art will appreciate that other arrangements and other elements (e.g., machines, interfaces, functions, orders, and groupings of functions) can be used instead, and that some elements may be omitted altogether according to the desired results. Further, many of the elements described are functional entities that may be implemented as discrete or distributed components, or in conjunction with other components, in any suitable combination and location.

While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope being indicated by the appended claims, along with the full scope of equivalents to which such claims are entitled. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.

Claims

A method comprising:
receiving, at a computing device, information indicative of a frequency of submission to a search engine of a search query for a phrase comprising a sequence of words;
based on the frequency of submission of the search query exceeding a threshold, determining or generating an automaton representation for the search query and, using the automaton representation, determining, for the sequence of words of the search query, groupings of one or more words of the search query based on the order in which the one or more words occur in the sequence of words of the search query; and
providing information indicative of the groupings to a speech recognition system to update a corpus of given sequences of words, wherein the speech recognition system is configured to convert a given spoken utterance into a given sequence of words based on the corpus of given sequences of words,
wherein the speech recognition system further includes probabilities of occurrence for the given sequences of words of the corpus, and wherein providing the groupings to the speech recognition system comprises updating the probabilities of occurrence based on the groupings and the information indicative of the frequency of submission of the search query to the search engine.
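
As a rough illustration of the steps recited above, the following Python sketch filters queries by a submission-count threshold, derives order-preserving word groupings, and folds them into corpus probabilities. The function names, the count-based corpus, and the frequency-weighted update are assumptions made for illustration only; the claim does not prescribe any particular data structure or update formula.

```python
from collections import Counter


def popular_queries(query_counts, threshold):
    """Keep only queries whose submission count exceeds the threshold."""
    return {q: c for q, c in query_counts.items() if c > threshold}


def groupings(query):
    """Return every contiguous run of words, in the order they occur in the query."""
    words = query.split()
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]


def update_corpus(corpus_counts, query_counts, threshold):
    """Add groupings to the corpus, weighted by query frequency (an assumed
    weighting), and return normalized probabilities of occurrence."""
    counts = Counter(corpus_counts)
    for query, freq in popular_queries(query_counts, threshold).items():
        for grouping in groupings(query):
            counts[grouping] += freq
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}
```

In this sketch a query that falls at or below the threshold contributes nothing to the corpus, while each grouping of a popular query gains probability mass in proportion to how often the query was submitted.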

The method of claim 1, wherein the search query includes one or more of a text string and a voice search query.

The method of claim 1, wherein providing the groupings to the speech recognition system comprises:
updating the corpus of given sequences of words to include the groupings; and
assigning to the groupings respective probabilities of occurrence that are higher than given probabilities of occurrence of other sequences of words in the corpus.
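
One possible way to give the groupings higher probabilities than the other sequences in the corpus is sketched below. The `margin` factor and the renormalization step are hypothetical choices, not taken from the claims; dividing every entry by the same total preserves the ordering, so the groupings remain above all other sequences.

```python
def boost_groupings(probabilities, new_groupings, margin=1.5):
    """Give each grouping a probability above every other sequence in the corpus.

    `margin` is an assumed multiplier applied to the largest probability
    among the sequences that are not groupings.
    """
    others = [p for seq, p in probabilities.items() if seq not in new_groupings]
    ceiling = max(others, default=1.0)
    boosted = dict(probabilities)
    for grouping in new_groupings:
        boosted[grouping] = ceiling * margin
    # Renormalize so the values again sum to 1; relative order is unchanged.
    total = sum(boosted.values())
    return {seq: p / total for seq, p in boosted.items()}
```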

The method of claim 1, further comprising:
updating the corpus of given sequences of words to include the groupings; and
constraining a search space in the corpus to at least the groupings for the speech recognition system to transcribe the given spoken utterance.

The method of claim 1, further comprising causing the speech recognition system to attempt to match the given spoken utterance to one of the groupings before attempting to match the given spoken utterance to other word sequences in the corpus.
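
The groupings-first matching order can be sketched as a two-pass lookup. In the sketch below a text hypothesis and the string similarity from Python's `difflib` stand in for acoustic matching, which is a deliberate simplification; `min_score` and the function names are likewise assumptions for illustration.

```python
from difflib import SequenceMatcher


def best_match(hypothesis, candidates, min_score):
    """Return the candidate most similar to the hypothesis, or None if no
    candidate reaches min_score. String similarity is a stand-in for an
    acoustic match score."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, hypothesis, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= min_score else None


def transcribe(hypothesis, groupings, other_sequences, min_score=0.8):
    """Try the groupings first; fall back to the rest of the corpus only
    when no grouping matches well enough."""
    match = best_match(hypothesis, groupings, min_score)
    if match is not None:
        return match
    return best_match(hypothesis, other_sequences, min_score)
```

Because the groupings are consulted first, a sufficiently good grouping wins even if some other corpus sequence would score slightly higher, which is the prioritization the claim describes.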

The method of claim 1, further comprising assigning respective probabilities of occurrence to the groupings based on the information indicative of the frequency of submission of the search query, wherein the respective probabilities of occurrence are time-varying.
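
The claim does not state how the probabilities vary over time. As one hypothetical scheme, the sketch below discounts each observed query frequency by an exponential decay with an assumed half-life, so that groupings from older popular queries gradually lose weight.

```python
def decayed_weight(frequency, age_days, half_life_days=7.0):
    """Halve the weight of a frequency observation every half_life_days
    (an assumed decay schedule)."""
    return frequency * 0.5 ** (age_days / half_life_days)


def time_varying_probabilities(observations, now_day, half_life_days=7.0):
    """observations: iterable of (grouping, frequency, day_observed) tuples.

    Returns probabilities of occurrence as of now_day.
    """
    weights = {}
    for grouping, frequency, day in observations:
        weight = decayed_weight(frequency, now_day - day, half_life_days)
        weights[grouping] = weights.get(grouping, 0.0) + weight
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()} if total else {}
```

With a seven-day half-life, two groupings observed with equal frequency one week apart end up with a 1:2 probability ratio in favor of the more recent one.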

The method of claim 1, wherein the computing device includes the speech recognition system.

A computer-readable storage medium having stored thereon instructions that, when executed by a computing device, cause the computing device to perform functions comprising:
receiving information indicative of a frequency of submission to a search engine of a search query for a phrase comprising a sequence of words;
based on the frequency of submission of the search query exceeding a threshold, determining or generating an automaton representation for the search query and, using the automaton representation, determining, for the sequence of words of the search query, groupings of one or more words of the search query based on the order in which the one or more words occur in the sequence of words of the search query; and
providing information indicative of the groupings to a speech recognition system to update a corpus of given sequences of words, wherein the speech recognition system is configured to convert a given spoken utterance into a given sequence of words based on the corpus of given sequences of words,
wherein the speech recognition system further includes probabilities of occurrence for the given sequences of words of the corpus, and wherein providing the groupings to the speech recognition system comprises updating the probabilities of occurrence based on the groupings and the information indicative of the frequency of submission of the search query to the search engine.

The computer-readable storage medium of claim 8, wherein the function of determining the groupings comprises generating a factor graph that includes automaton states and automaton arcs, each of the automaton arcs corresponding to a word of the sequence of words of the search query.

The computer-readable storage medium of claim 9, wherein the function of determining the groupings comprises:
selecting a first word represented by a first automaton arc connected to a first automaton state;
continuing to a second automaton state adjacent to the first automaton state; and
selecting a second word represented by a second arc connected to the second automaton state, wherein the second word is adjacent to the first word in the sequence of words of the search query.

The computer-readable storage medium of claim 9, wherein a given grouping of the groupings is indicated to begin at any given automaton state in the factor graph.
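
The factor graph traversal recited in the preceding claims can be pictured as a linear automaton in which state i is joined to state i + 1 by an arc labeled with the i-th word of the query. The following sketch is one minimal reading of that structure; the dictionary encoding and the function names are assumptions, not the patent's implementation.

```python
def build_automaton(query):
    """Build a linear automaton for a query.

    Returns (arcs, final_state), where arcs[state] = (word, next_state) and
    each arc corresponds to one word of the query.
    """
    words = query.split()
    arcs = {i: (word, i + 1) for i, word in enumerate(words)}
    return arcs, len(words)


def enumerate_groupings(arcs, final_state):
    """Start at any state and follow adjacent arcs, emitting the words
    collected so far as a grouping after each step."""
    result = []
    for start in range(final_state):
        state, path = start, []
        while state in arcs:
            word, state = arcs[state]
            path.append(word)
            result.append(" ".join(path))
    return result
```

For a hypothetical three-word query such as "weather in paris", the traversal yields six groupings, including "in paris" (starting at the second state) but never "weather paris", because words are only joined along adjacent arcs.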

A device comprising:
at least one processor;
data storage; and
program instructions in the data storage that, upon execution by the at least one processor, cause the device to:
receive information indicative of a frequency of submission to a search engine, in a given period of time, of a search query for a phrase comprising a sequence of words;
based on the information indicative of the frequency of submission of the search query to the search engine in the given period of time, determine or generate an automaton representation for the search query and, using the automaton representation, determine, for the sequence of words of the search query, groupings of one or more words of the search query based on the order in which the one or more words occur in the sequence of words of the search query; and
provide information indicative of the groupings to a speech recognition system to update a corpus of given sequences of words, wherein the speech recognition system is configured to convert a given spoken utterance into a given sequence of words based on the corpus of given sequences of words,
wherein the speech recognition system includes probabilities of occurrence for the given sequences of words of the corpus, and wherein, to provide the groupings to the speech recognition system, the program instructions in the data storage, upon execution by the at least one processor, cause the device to update the probabilities of occurrence based on the groupings and the information indicative of the frequency of submission of the search query.

The device of claim 12, wherein the program instructions, upon execution by the at least one processor, further cause the device to determine a metric based on the information indicative of the frequency of submission of the search query to the search engine in the given period of time, and
wherein the program instructions, upon execution by the at least one processor, cause the device to determine the groupings based on a comparison of the metric to a threshold.

The device of claim 12, wherein the speech recognition system includes probabilities of occurrence for the given sequences of words of the corpus, and wherein, to provide the groupings to the speech recognition system, the program instructions in the data storage, upon execution by the at least one processor, cause the device to:
augment the corpus of given sequences of words with the groupings; and
assign to the groupings respective probabilities of occurrence that are higher than given probabilities of occurrence of other sequences of words in the corpus.

The device of claim 12, wherein, to determine the groupings, the program instructions in the data storage, upon execution by the at least one processor, cause the device to generate a factor graph that includes automaton states and automaton arcs, each of the automaton arcs corresponding to a word of the sequence of words of the search query.

The device of claim 15, wherein, to determine a given grouping of the groupings, the program instructions in the data storage, upon execution by the at least one processor, cause the device to:
select a first word represented by a first automaton arc connected to a first automaton state;
continue to a second automaton state adjacent to the first automaton state; and
select a second word represented by a second arc connected to the second automaton state, wherein the second word is adjacent to the first word in the sequence of words of the search query, and wherein the given grouping is allowed to begin at any given automaton state in the factor graph.