JP2015087540A

JP2015087540A - Voice recognition device, voice recognition system, and voice recognition program

Info

Publication number: JP2015087540A
Application number: JP2013225770A
Authority: JP
Inventors: Hiroshi Kirita (桐田 洋); Takashi Nakagiri (中桐 隆)
Original assignee: Koto Co Ltd
Current assignee: Koto Co Ltd
Priority date: 2013-10-30
Filing date: 2013-10-30
Publication date: 2015-05-07

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device, a voice recognition system, and a voice recognition program that increase the speed at which pronounced voice is recognized and converted into a phoneme string. SOLUTION: A voice recognition device 1 comprises: a storage section 3 having standby phoneme storage means 11, which stores, in association with a word, a standby phoneme group including a correct phoneme group constituting the correct pronunciation of the word and wrong phonemes that are incorrect pronunciations of correct phonemes in that group, and phoneme feature storage means 13, which stores a phoneme feature defined for each phoneme of the standby phoneme group; an input section 5 having voice information input means 29, which generates voice information from a voice signal of the pronounced word and inputs that voice information; a processing section 7 having phoneme detection means 33, which compares the phoneme features with the voice information to detect, from the standby phoneme group, a pronounced phoneme string, i.e., the string of phonemes actually pronounced; and an output section 9, which outputs the pronounced phoneme string.

Description

The present invention relates to a voice recognition device, a voice recognition system, and a voice recognition program.

Patent Document 1 describes a pronunciation evaluation device that evaluates a speaker's pronunciation. This device includes a storage unit that stores correct and incorrect phoneme strings of words, a conversion circuit that performs A/D conversion of a voice signal into a digital signal, acoustic feature extraction means that extracts acoustic physical information from the digital signal and parameterizes the digital signal, and phoneme string conversion means that converts the parameterized signal into a phoneme string using a speaker model.

This pronunciation evaluation device converts a voice signal into a phoneme string by assigning a phoneme code to each frame on the basis of frame-by-frame acoustic feature parameters. However, there are about 1,800 candidate phoneme codes, so the assignment process imposes a heavy computational load, and depending on the operating frequency of the central processing unit, conversion into a phoneme string takes time.

Patent Document 1: Japanese Patent Laid-Open No. 06-110494

An object of the present invention is to provide a voice recognition device, a voice recognition system, and a voice recognition program that increase the speed at which pronounced voice is recognized and converted into a phoneme string.

The voice recognition device of the present invention comprises: a storage unit having standby phoneme storage means that stores, in association with a word, a standby phoneme group including a correct phoneme group constituting the correct pronunciation of the word and wrong phonemes that are incorrect pronunciations of correct phonemes included in the correct phoneme group, and phoneme feature storage means that stores a phoneme feature defined for each phoneme of the standby phoneme group; an input unit having voice information input means that generates voice information from a voice signal of the pronounced word and inputs the voice information; a processing unit having phoneme detection means that compares the phoneme features with the voice information and detects, from the standby phoneme group, a pronounced phoneme string, which is the string of phonemes that was pronounced; and an output unit that outputs the pronounced phoneme string.

Further, in the voice recognition device of the present invention, the standby phoneme storage means has correct phoneme string storage means that stores, in association with the word, a correct phoneme string in which the correct phoneme group is arranged in order of pronunciation, and corresponding phoneme storage means that stores corresponding phonemes formed by associating the correct phonemes and the wrong phonemes with each other; and the processing unit has standby phoneme generation means that generates the standby phoneme group by combining the correct phoneme string of the word with the corresponding phonemes.

Furthermore, in the voice recognition device of the present invention, the wrong phoneme is a substitution phoneme that is pronounced in place of the correct phoneme, and the substitution phoneme is determined on the basis of statistics obtained from a plurality of speakers pronouncing the word.

Further, in the voice recognition device of the present invention, the wrong phoneme is an additional phoneme that is pronounced in addition to the correct phoneme, and the additional phoneme is determined on the basis of statistics obtained from a plurality of speakers pronouncing the word.

Still further, in the voice recognition device of the present invention, the storage unit has exceptional phoneme string storage means that stores an exceptional phoneme string, which is a wrong phoneme string that occurs only for the word; the phoneme feature storage means further stores a phoneme feature defined for each phoneme of the exceptional phoneme string; and the phoneme detection means compares the phoneme features with the voice information and detects the phonemes contained in the voice information from the standby phoneme group or the exceptional phoneme string.

The voice recognition system of the present invention comprises the above voice recognition device and a server capable of communicating with the voice recognition device via a network. The processing unit of the voice recognition device notifies the server of the pronounced phoneme string. The server includes a processing unit that newly detects a wrong phoneme on the basis of statistics of the pronounced phoneme strings it has acquired and notifies the voice recognition device of that wrong phoneme. The phoneme detection means detects the pronounced phoneme from a standby phoneme group that includes the wrong phoneme acquired from the server.

The voice recognition program of the present invention causes a computer to function as: standby phoneme storage means that stores, in association with a word, a standby phoneme group including a correct phoneme group constituting the correct pronunciation of the word and wrong phonemes that are incorrect pronunciations of correct phonemes included in the correct phoneme group; phoneme feature storage means that stores a phoneme feature defined for each phoneme of the standby phoneme group; voice information input means that generates voice information from a voice signal of the pronounced word and inputs the voice information; and phoneme detection means that compares the phoneme features with the voice information and detects, from the standby phoneme group, a pronounced phoneme string, which is the string of phonemes that was pronounced.

In the voice recognition device, voice recognition system, and voice recognition program of the present invention, the phoneme detection means detects pronounced phonemes from the standby phoneme group. In other words, phoneme features are compared with the voice information one standby phoneme group at a time. The computational load is therefore lower than when the voice information is compared with the phoneme features of every phoneme in the language, and the pronounced phoneme string can be detected faster. Even a central processing unit with low computing power can detect the pronounced phoneme string quickly, so fast detection is achievable on inexpensive hardware.

FIG. 1 is a block diagram of the voice recognition device of Embodiment 1.
FIG. 2(a) shows an example of the standby phoneme storage means, and FIG. 2(b) shows another example of the standby phoneme storage means.
FIG. 3 is a flowchart of voice recognition by the voice recognition device.
FIG. 4 is a flowchart of the phoneme detection step.
FIG. 5 is a block diagram of the voice recognition device of Embodiment 2.
FIG. 6(a) shows the correct phoneme string storage means, FIG. 6(b) shows an example of the corresponding phoneme storage means, and FIG. 6(c) shows another example of the corresponding phoneme storage means.
FIG. 7 is a flowchart showing an example of the standby phoneme generation step.
FIG. 8 is a flowchart showing another example of the standby phoneme generation step.
FIG. 9 is a block diagram of the voice recognition device of Embodiment 3.
FIG. 10 shows the exceptional phoneme string storage means.
FIG. 11 is a block diagram showing the overall configuration of the voice recognition system of Embodiment 4.
FIG. 12 is a block diagram showing the voice recognition device of Embodiment 4.
FIG. 13 shows a set of standby phoneme strings.

The voice recognition device 1 of the present invention will be described with reference to the drawings. In this specification, the same reference numerals used across the drawings denote the same or similar elements.

The voice recognition device 1 of the present invention is a device that recognizes the voice of a pronounced word. As shown in FIG. 1, the voice recognition device 1 has a storage unit 3, an input unit 5, a processing unit 7, and an output unit 9, and is a computer such as a personal computer, a mobile terminal, a tablet terminal, or a dedicated voice recognition device.

The storage unit 3 stores the programs needed for voice recognition and can store, hold, and retrieve data related to voice recognition. Typically it is an auxiliary storage device provided in the computer, such as a hard disk, flash memory, or dynamic random access memory, that stores a program causing the computer to function as the means described below. The storage unit 3 has standby phoneme storage means 11 and phoneme feature storage means 13.

As shown in FIG. 2(a), the standby phoneme storage means 11 stores a standby phoneme group 21 in association with a word. The standby phoneme group 21 includes a correct phoneme group 15, which constitutes the correct pronunciation of the word, and wrong phonemes 19, which are incorrect pronunciations of correct phonemes 17 included in the correct phoneme group 15.

For example, the standby phoneme group 21 is a standby phoneme set family 25 made up of a plurality of standby phoneme sets 23. Each standby phoneme set 23 has standby phonemes 27 as its elements. A standby phoneme 27 is either a correct phoneme 17 or a wrong phoneme 19, and is a phoneme to be detected by the phoneme detection means 33 described later. The number of wrong phonemes 19 in one standby phoneme set 23 is not particularly limited. A wrong phoneme 19 is a phoneme that speakers mistakenly pronounce for the word to be recognized; typically it is a substitution phoneme 19a that is pronounced in place of a correct phoneme 17. The substitution phonemes 19a are determined in advance on the basis of statistics of pronunciation information obtained from the pronunciations of a plurality of speakers. These speakers are typically speakers who meet a predetermined condition, for example speakers whose native language is not the language of the word to be recognized and/or speakers who are native speakers of that language but do not speak its standard form (so-called dialect speakers). The substitution phonemes 19a selected are, for example, those that occur most frequently among the substitution phonemes found in the collected pronunciation information. The substitution phonemes 19a may also be selected from statistics gathered for each region. The standby phoneme set family 25 shown in FIG. 2(a) is a table provided for each word. The table shown is the one for the word "apple", and each standby phoneme set 23 contains a correct phoneme 17 of "apple". As FIG. 2(a) shows, the number of wrong phonemes 19 stored for one correct phoneme 17 is not limited. In the present application, a phoneme is not limited to the phoneme symbols shown in the drawings and also includes any information defined as representing a phoneme.
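The per-word table of standby phoneme sets could be represented roughly as follows. This is only an illustrative sketch: the class name `StandbyPhonemeSet`, the helper `frequent_substitutions`, the phoneme symbols, and the observation list are all hypothetical and are not the contents of FIG. 2(a). The helper shows one way the "most frequent substitution" selection might be done, assuming the statistics are a simple list of observed substitutions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StandbyPhonemeSet:
    """One standby phoneme set 23: a correct phoneme plus its wrong phonemes."""
    correct: str                                     # correct phoneme 17
    wrong: list[str] = field(default_factory=list)   # wrong phonemes 19

    def candidates(self) -> list[str]:
        """All standby phonemes 27 in this set, correct phoneme first."""
        return [self.correct, *self.wrong]


def frequent_substitutions(observed: list[str], top_n: int) -> list[str]:
    """Pick the top_n substitutions that occur most often in the observations."""
    return [phoneme for phoneme, _ in Counter(observed).most_common(top_n)]


# Hypothetical substitutions observed for the first phoneme of "apple".
observed_for_ae = ["a", "e", "a", "o", "a", "e"]

# Word -> ordered list of standby phoneme sets (the standby phoneme set family 25).
standby_phoneme_store = {
    "apple": [
        StandbyPhonemeSet("ae", frequent_substitutions(observed_for_ae, 2)),
        StandbyPhonemeSet("p"),           # no wrong phonemes recorded
        StandbyPhonemeSet("l", ["r"]),
    ],
}
```

Keeping the sets in pronunciation order means the list index doubles as the phoneme position within the word.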

The phoneme feature storage means 13 stores, for each phoneme included in the standby phoneme group 21 (each standby phoneme 27), the feature defined for that phoneme. The phoneme feature is typically an acoustic model representing, for example, the frequency components of the phoneme. The phoneme feature storage means 13 may store the features not only of the phonemes in the standby phoneme group 21 but of all phonemes of the language to which the word belongs.

The input unit 5 receives the word designated by the speaker and the voice of that word, and inputs this information to the processing unit 7. As shown in FIG. 1, the input unit 5 has voice information input means 29 and word input means 31.

The voice information input means 29 generates voice information from the voice signal of the pronounced word and inputs it to the processing unit 7. The voice information is a series of digital data obtained by sampling and quantizing the analog voice signal. The voice information input means 29 typically consists of a microphone that receives the voice and an analog-to-digital converter electrically connected to the microphone. An amplifier and/or automatic gain control may be provided between the microphone and the analog-to-digital converter.

The word input means 31 inputs the word to be recognized to the processing unit 7. It is, for example, a character input device such as a keyboard for typing the word, or a coordinate input device such as a mouse, touch panel, or digitizer for selecting a word shown on display means (not shown). The word is not limited to English and may be a word in another language. To recognize English pronunciations that differ by country or region, the word may also be input after a country or region has been selected.

The processing unit 7 detects the string of phonemes that was pronounced (hereinafter, the pronounced phoneme string) on the basis of the word and voice information acquired from the input unit 5 and the standby phoneme group 21 and phoneme features stored in the storage unit 3. The processing unit 7 is typically the central processing unit of the computer. As shown in FIG. 1, it is communicably connected to the input unit 5, the storage unit 3, and the output unit 9, and has phoneme detection means 33.

The phoneme detection means 33 compares the phoneme features with the voice information and detects, from the standby phoneme group 21, the pronounced phoneme string contained in the voice information. The pronounced phoneme string consists of the phonemes that were pronounced (hereinafter, pronounced phonemes) arranged in the order in which they were pronounced. A pronounced phoneme is the phoneme corresponding to a phoneme feature that has a predetermined commonality with a partial voice feature. A partial voice feature is a feature calculated from part of the data in the voice information, for example a feature or frequency component obtained by frequency analysis of that part of the data. The phoneme features compared with the partial voice features are those of the phonemes (standby phonemes 27) in the standby phoneme group 21 corresponding to the word to be recognized. Pronounced phonemes are detected in the order of the partial voice features in the voice information, and each detected pronounced phoneme is stored in a variable in turn. The pronounced phoneme string is detected in this way.

The output unit 9 outputs the pronounced phoneme string detected by the processing unit 7. The output unit 9 is, for example, display means such as a display, which lets the speaker see how he or she actually pronounced the word. The correct phoneme string, in which the correct phoneme group 15 is arranged in order of pronunciation, may be displayed together with the pronounced phoneme string so that the speaker can easily see where the pronunciation went wrong. The output unit 9 may further include voice reproduction means that reproduces and outputs voice. The voice reproduction means is, for example, a voice guide that outputs the voice of the correct phoneme string from a speaker, giving the user a model of the correct pronunciation of the word. The voice reproduction means may also output the acquired voice information from the speaker, so that the user can compare the voice guide with his or her own voice. The voice recognition device 1 thus lets speakers recognize their errors both visually and aurally.

A voice recognition method using the voice recognition device 1 of the present invention will be described with reference to FIG. 3. The method includes a word input step (s10), a standby phoneme extraction step (s20), a voice information input step (s30), and a phoneme detection step (s40).

The word input step (s10) is a step in which the word input means 31 inputs the word to be recognized to the processing unit 7. For example, the word "apple" is typed on the keyboard, or the word "apple" is selected with the mouse from a list of words shown on the display means.

The standby phoneme extraction step (s20) is a step in which the phoneme detection means 33 extracts, from the standby phoneme storage means 11, the standby phoneme group 21 (FIG. 2(a)) corresponding to the word "apple" acquired from the word input means 31. If the standby phoneme group 21 cannot be extracted, that is, if no standby phoneme group 21 is stored for the word, the process may return to the word input step (s10) and request input of a word again.

The voice information input step (s30) is a step in which the voice information input means 29 generates voice information from the voice signal of the pronounced word and inputs this voice information to the processing unit 7. For example, the microphone receives the pronounced voice of "apple" and converts it into an electrical signal, and the analog-to-digital converter samples and quantizes the electrical signal to generate the voice information.
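As a rough illustration of the sampling and quantization performed in step s30, the sketch below converts a continuous signal into integer samples. The function name `sample_and_quantize`, the 16 kHz sample rate, the 16-bit depth, and the 440 Hz test tone are assumptions chosen for the example; the source does not specify any of them.

```python
import math


def sample_and_quantize(signal, n_samples, sample_rate_hz=16000, bits=16):
    """Sample signal(t) at a fixed rate and quantize each value to a signed integer.

    signal is assumed to return amplitudes in the range [-1.0, 1.0];
    values outside that range are clipped before quantization.
    """
    max_level = 2 ** (bits - 1) - 1
    samples = []
    for i in range(n_samples):
        value = signal(i / sample_rate_hz)        # sampling
        value = max(-1.0, min(1.0, value))        # clip to the valid range
        samples.append(round(value * max_level))  # quantization
    return samples


def tone(t):
    """Stand-in for the microphone signal: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)


# 160 samples correspond to 10 ms of audio at 16 kHz.
voice_information = sample_and_quantize(tone, n_samples=160)
```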

The phoneme detection step (s40) is a step in which the phoneme detection means 33 detects, from the standby phoneme group 21, the pronounced phoneme string contained in the voice information. For example, a plurality of partial voice features contained in the voice information obtained in the voice information input step (s30) are first detected. The method of detecting each partial voice feature is not particularly limited; for example, the voice information is divided into data sets of a predetermined frame length to produce a plurality of pieces of divided voice information, and frequency analysis or the like is performed on each piece. If consecutive partial voice features are the same, they may be merged into a single partial voice feature. Next, the phoneme features corresponding to the phonemes of the standby phoneme group 21 (FIG. 2(a)) obtained in the standby phoneme extraction step (s20) are extracted from the phoneme feature storage means 13. Each partial voice feature is then compared with each of these phoneme features, and the phoneme corresponding to a phoneme feature that has the predetermined commonality is detected and stored in a variable. Through these steps, the pronounced word is recognized. If no phoneme in the standby phoneme group 21 has the predetermined commonality, the partial voice feature may be compared with the phoneme features of phonemes outside the standby phoneme group 21.
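The detection step can be sketched as below. The source leaves the "predetermined commonality" unspecified, so this sketch assumes it means a Euclidean distance within a threshold between small feature vectors; the function names, the two-dimensional features, and the threshold value are all hypothetical. The fallback to phonemes outside the standby phoneme group corresponds to the optional behavior described in the last sentence of the paragraph above.

```python
import math
from itertools import groupby


def merge_consecutive(features):
    """Collapse runs of identical consecutive partial voice features into one."""
    return [feature for feature, _ in groupby(features)]


def match_phoneme(feature, candidates, phoneme_features, threshold):
    """Return the closest candidate within the threshold, or None if none qualifies."""
    best, best_distance = None, threshold
    for phoneme in candidates:
        distance = math.dist(feature, phoneme_features[phoneme])
        if distance <= best_distance:
            best, best_distance = phoneme, distance
    return best


def detect_phoneme_string(partial_features, standby_group, phoneme_features, threshold=0.5):
    """Detect pronounced phonemes, trying the standby phoneme group first."""
    detected = []
    for feature in merge_consecutive(partial_features):
        phoneme = match_phoneme(feature, standby_group, phoneme_features, threshold)
        if phoneme is None:
            # Optional fallback: compare with phonemes outside the standby group.
            others = [p for p in phoneme_features if p not in standby_group]
            phoneme = match_phoneme(feature, others, phoneme_features, threshold)
        if phoneme is not None:
            detected.append(phoneme)
    return detected


# Hypothetical two-dimensional phoneme features; "z" is outside the standby group.
phoneme_features = {"ae": (1.0, 0.0), "a": (0.0, 1.0), "p": (1.0, 1.0),
                    "l": (0.0, 0.0), "z": (5.0, 5.0)}
standby_group = ["ae", "a", "p", "l"]
frames = [(0.9, 0.1), (0.9, 0.1), (1.0, 0.9), (0.1, 0.0), (5.0, 4.9)]
result = detect_phoneme_string(frames, standby_group, phoneme_features)
```

In this toy input the first two frames are identical and are merged, and the last frame matches nothing in the standby group, so it is resolved by the fallback comparison.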

The method may further include a pronounced phoneme string output step in which the output unit 9 outputs the pronounced phoneme string obtained in the phoneme detection step (s40). This step lets the speaker see how he or she actually pronounced the word.

In the voice recognition device 1 of the present invention, the phoneme detection means 33 detects pronounced phonemes from the standby phoneme group 21, which consists of phonemes that are frequently pronounced for the word to be recognized. In other words, the comparison between phoneme features and voice information gives priority to the standby phoneme group 21. The computational load is therefore lower than when the phoneme features of every phoneme in the language are compared, and the pronounced phoneme string can be detected faster. Even a central processing unit with low computing power can detect the pronounced phoneme string quickly, so fast detection is achievable on inexpensive hardware.

The phoneme detection means 33 of the voice recognition device 1 of this embodiment may detect pronounced phonemes one standby phoneme set 23 at a time (FIG. 2(a)). FIG. 4 shows an example of such a phoneme detection step (s40). First, a plurality of partial voice features contained in the voice information are detected (s401). Next, the phoneme features corresponding to the standby phonemes 27 of one standby phoneme set 23 are extracted from the phoneme feature storage means 13 (s402). One partial voice feature is then compared with each of the extracted phoneme features (s403), and the standby phoneme 27 corresponding to the phoneme feature that has the predetermined commonality is stored in a variable (s404). One pronounced phoneme is thereby detected. These steps are repeated for all standby phoneme sets 23 (s405), with each pronounced phoneme appended to the variable, so that the pronounced phoneme string is detected. Because the phoneme detection means 33 compares phoneme features with the voice information one standby phoneme set 23 at a time, this voice recognition device 1 can significantly increase the speed of detecting pronounced phonemes.
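Steps s402 to s405 might look like the following sketch. It assumes that the i-th partial voice feature is compared only with the i-th standby phoneme set (the source does not say how features and sets are aligned) and, as a stand-in for the unspecified "predetermined commonality", that a match means a Euclidean distance within a threshold. All names and feature values are hypothetical.

```python
import math


def detect_by_set(partial_features, standby_sets, phoneme_features, threshold=0.5):
    """Detect one pronounced phoneme per standby phoneme set, in order."""
    detected = []  # the variable that accumulates the pronounced phoneme string
    for feature, phoneme_set in zip(partial_features, standby_sets):  # s405 loop
        # s402: fetch features only for the phonemes in this set.
        templates = {p: phoneme_features[p] for p in phoneme_set}
        # s403: compare the partial voice feature with each extracted feature.
        best = min(templates, key=lambda p: math.dist(feature, templates[p]))
        # s404: store the phoneme if it is close enough.
        if math.dist(feature, templates[best]) <= threshold:
            detected.append(best)
    return detected


# Hypothetical features and sets for a three-phoneme word.
phoneme_features = {"ae": (1.0, 0.0), "a": (0.0, 1.0), "p": (1.0, 1.0),
                    "l": (0.0, 0.0), "r": (2.0, 2.0)}
standby_sets = [["ae", "a"], ["p"], ["l", "r"]]
partial_features = [(0.1, 0.9), (1.0, 1.0), (1.9, 2.0)]

pronounced = detect_by_set(partial_features, standby_sets, phoneme_features)
comparisons = sum(len(s) for s in standby_sets)
```

Here only five feature comparisons are needed in total, one per standby phoneme, which is the kind of reduction the paragraph above attributes to set-by-set detection.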

In this embodiment, the wrong phonemes 19 of the standby phoneme group 21 are not limited to substitution phonemes 19a. A wrong phoneme 19 may be, for example, an additional phoneme 19b (FIG. 2(b)) that speakers mistakenly pronounce immediately after a correct phoneme 17. As shown in FIG. 2(b), the wrong phonemes 19 may also include both substitution phonemes 19a and additional phonemes 19b. The additional phonemes 19b are determined in advance on the basis of statistics of pronunciation information obtained from the pronunciations of a plurality of speakers. For example, the additional phonemes 19b that occur most frequently among those found in the speakers' pronunciations are selected.

In this embodiment, the word input means 31 is not an essential component when, for example, the word to be recognized is designated in advance.

In the voice recognition device 10 of this embodiment (Embodiment 2), as shown in FIG. 5, the standby phoneme storage means 11 includes correct phoneme string storage means 37 and corresponding phoneme storage means 39, and the processing unit 7 includes standby phoneme generation means 41.

As shown in FIG. 6(a), the correct phoneme string storage means 37 stores each word in association with its correct phoneme string, in which the correct phoneme group 15 of the word is arranged in order of pronunciation.

As shown in FIG. 6(b), the corresponding phoneme storage means 39 stores a corresponding phoneme group 43 that includes the correct phonemes of words to be recognized and wrong phonemes that are mispronunciations of those correct phonemes. The corresponding phoneme group 43 is, for example, a corresponding phoneme set family 47 made up of a plurality of corresponding phoneme sets 45. Each corresponding phoneme set 45 has a plurality of corresponding phonemes 49 as its elements. These corresponding phonemes 49 are typically phonemes that speakers mistakenly substitute for one another: a correct phoneme of the word to be recognized and the substitution phonemes for that correct phoneme. Which corresponding phoneme 49 is the correct phoneme and which are substitution phonemes depends on the word to be recognized. Taking the first phoneme 51 of the word "apple" (FIG. 6(a)) as an example, among the corresponding phonemes 49a to 49d in the corresponding phoneme set 45a, the corresponding phoneme 49a is the correct phoneme and the other corresponding phonemes 49b to 49d are substitution phonemes. Taking the second phoneme 53 of the word "box" (FIG. 6(a)) as an example, among the same corresponding phonemes 49a to 49d in the corresponding phoneme set 45a, the corresponding phoneme 49b is the correct phoneme and the other corresponding phonemes 49a, 49c, and 49d are substitution phonemes. The corresponding phonemes are determined in advance on the basis of statistics of pronunciation information obtained from the pronunciations of a plurality of speakers. For example, substitution phonemes 19a that occur frequently among the substitution phonemes found in the collected pronunciation information are selected.

The standby phoneme generation means 41 generates the standby phoneme group 21 (FIG. 2(a)) by combining the corresponding phoneme sets 45 that contain the correct phonemes making up the correct phoneme string of a word. To generate the standby phoneme group 21, the voice recognition method of this embodiment includes a standby phoneme generation step in place of the standby phoneme extraction step (s20) described above. In the standby phoneme generation step, as shown for example in FIG. 7, the correct phoneme string corresponding to the word acquired in the word input step (s10) (for example, "apple") is first extracted from the correct phoneme string storage means 37 (s201). Next, one correct phoneme of the extracted correct phoneme string (for example, the first phoneme 51) is extracted (s202), and the corresponding phoneme storage means 39 is searched for a corresponding phoneme set 45 containing that correct phoneme (for example, the corresponding phoneme set 45a) (s203). If such a corresponding phoneme set 45 exists, it is extracted and stored in a variable (s204), which yields a standby phoneme set 23. If no corresponding phoneme set 45 is extracted, that is, if no corresponding phoneme set 45 containing the correct phoneme is stored in the corresponding phoneme storage means 39, only the correct phoneme itself is stored in the variable (s205), which yields a standby phoneme set 23 whose sole standby phoneme 27 is that correct phoneme. Performing this processing for every correct phoneme (s206) produces the standby phoneme group 21 (standby phoneme set family 25). If no correct phoneme string corresponding to the acquired word can be extracted, that is, if no corresponding correct phoneme string is stored, processing may return to the word input step (s10) and request input of a word again.
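The standby phoneme generation step (s201 to s206) can be illustrated with a minimal Python sketch. The dictionary `CORRECT_STRINGS`, the list `CORRESPONDING_SETS`, the function name, and all phoneme symbols are hypothetical stand-ins for the correct phoneme string storage means 37 and the corresponding phoneme storage means 39; they are not taken from FIG. 6 or FIG. 7.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical contents of the correct phoneme string storage means 37.
CORRECT_STRINGS: Dict[str, List[str]] = {
    "apple": ["ae", "p", "l"],
    "box": ["b", "aa", "k", "s"],
}
# Hypothetical contents of the corresponding phoneme storage means 39:
# each set groups phonemes that speakers tend to substitute for one another.
CORRESPONDING_SETS: List[FrozenSet[str]] = [
    frozenset({"ae", "aa", "ah", "eh"}),
    frozenset({"l", "r"}),
]


def generate_standby_group(word: str) -> Optional[List[FrozenSet[str]]]:
    """Return one standby phoneme set per correct phoneme, in pronunciation order."""
    correct_string = CORRECT_STRINGS.get(word)  # s201
    if correct_string is None:
        return None  # the caller may return to the word input step (s10)
    standby_group: List[FrozenSet[str]] = []
    for phoneme in correct_string:  # s202, repeated for every phoneme (s206)
        # s203: look for a corresponding phoneme set containing this phoneme
        match = next((s for s in CORRESPONDING_SETS if phoneme in s), None)
        if match is not None:
            standby_group.append(match)  # s204
        else:
            standby_group.append(frozenset({phoneme}))  # s205
    return standby_group
```

Under these assumed tables, "apple" yields a four-element set for its first phoneme, a single-element set for "p" (no corresponding set is stored), and the set {"l", "r"} for its last phoneme.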

The voice recognition device 10 of this embodiment does not need to store a standby phoneme group 21 for each word. It can therefore speed up detection of the pronunciation phoneme string while saving storage capacity in the storage section 3.

The plurality of corresponding phonemes 49 making up a corresponding phoneme set 45 of this embodiment may instead be, as shown in FIG. 6(c), a correct phoneme of a word to be recognized and an additional phoneme that speakers mistakenly add to that correct phoneme. In the standby phoneme generation step that generates a standby phoneme group 21 containing such additional phonemes (FIG. 2(b)), as shown in FIG. 8, the correct phoneme string associated with the word acquired in the word input step (s10) is first extracted from the correct phoneme string storage means 37 (s211). Next, one correct phoneme of the extracted correct phoneme string is extracted (s212), and the additional phoneme corresponding to that correct phoneme is extracted from the corresponding phoneme storage means 39 (s213). It is then determined whether the correct phoneme that follows is a vowel (s214). If it is not a vowel, a standby phoneme set 23 whose element is the additional phoneme is added after the standby phoneme set 23 containing the one correct phoneme (s215). If no additional phoneme can be extracted, or if the following correct phoneme is a vowel, no additional phoneme is added. Performing this processing for every correct phoneme (s216) produces a standby phoneme group 21 containing additional phonemes (for example, FIG. 2(b)).
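Steps s211 to s216 can be sketched as follows. The vowel inventory `VOWELS`, the table `ADDITIONAL`, and the function name are hypothetical. The text does not say how the last phoneme of a word is handled, so this sketch assumes that "no following phoneme" counts as "not a vowel".

```python
from typing import Dict, FrozenSet, List

# Illustrative vowel inventory (an assumption, not from the source).
VOWELS: FrozenSet[str] = frozenset({"a", "i", "u", "e", "o", "ae", "aa"})
# Hypothetical table: correct phoneme -> phoneme that speakers tend to append.
ADDITIONAL: Dict[str, str] = {"k": "u", "s": "u", "t": "o"}


def add_additional_phonemes(correct_string: List[str]) -> List[FrozenSet[str]]:
    """Build a standby phoneme group with additional-phoneme sets inserted."""
    standby_group: List[FrozenSet[str]] = []
    for i, phoneme in enumerate(correct_string):  # s212, repeated (s216)
        standby_group.append(frozenset({phoneme}))
        extra = ADDITIONAL.get(phoneme)  # s213
        following = correct_string[i + 1] if i + 1 < len(correct_string) else None
        # s214: assumption - the end of the word is treated as "not a vowel"
        next_is_vowel = following in VOWELS
        if extra is not None and not next_is_vowel:
            standby_group.append(frozenset({extra}))  # s215
    return standby_group
```

With these assumed tables, the string b, aa, k, s gains an additional set after "k" and after "s", while "t" followed by the vowel "a" gains none.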

As shown in FIG. 9, the voice recognition device 20 of this embodiment further includes exceptional phoneme string storage means 55 in the storage section 3. As shown in FIG. 10, the exceptional phoneme string storage means 55 stores words and exceptional phoneme strings in association with each other. An exceptional phoneme string is a phoneme string containing an exceptional phoneme, that is, a mispronunciation that occurs only for a specific word. Specific examples include a phoneme string containing an exceptionally added wrong phoneme (for example, the fourth phoneme of "text"), a phoneme string containing a wrong phoneme that exceptionally replaces a correct phoneme (for example, the second phoneme of "quick"), and a phoneme string from which a correct phoneme is exceptionally missing (for example, the third phoneme of "post"). Exceptional phoneme strings are determined in advance from statistics of the pronunciation information of a plurality of speakers. These speakers are typically speakers who meet a predetermined condition, for example speakers whose native language is not that of the words to be recognized, and/or speakers who are native speakers of that language but do not speak its standard variety (so-called dialect speakers). Exceptional phoneme strings are selected, for example, as those that occur frequently among the exceptional phoneme strings found in the pronunciation information. Statistics may also be collected for each region.

The standby phoneme generation means 41 of this embodiment generates the one standby phoneme group 21 described above by combining the correct phoneme string with the corresponding phoneme group 43, then extracts the exceptional phoneme strings for the word to be recognized and combines them with the standby phoneme group 21 to generate another standby phoneme group.

The voice recognition device 20 of this embodiment detects the pronunciation phoneme string from among the standby phoneme group 21 and frequently pronounced exceptional phoneme strings. Detection accuracy can therefore be improved.

As shown in FIG. 11, the voice recognition system 100 of the present invention includes a server 57 and a plurality of voice recognition devices 30 that are communicably connected to one another via a network 85. The numbers of servers 57 and voice recognition devices 30 are not particularly limited. A server 57 may be provided for each region.

The server 57 of this embodiment includes a transmission/reception section 59, a storage section 61, and a processing section 63.

The transmission/reception section 59 exchanges information with the voice recognition devices 30 via the network 85 and consists of, for example, a transmitter, a receiver, and their protocols.

The storage section 61 stores, holds, and retrieves data relating to voice recognition and is, for example, an auxiliary storage device provided in the terminal. The storage section 61 includes updated correct phoneme string storage means 65 and updated wrong phoneme storage means.

The updated correct phoneme string storage means 65 stores updated correct phoneme strings. The updated correct phoneme strings include the correct phoneme strings stored in advance in the correct phoneme string storage means 37 of the voice recognition device 30 (hereinafter, "standard correct phoneme strings") and newly added additional correct phoneme strings, and are stored in association with updated words. The updated words include the standard words stored in advance in the correct phoneme string storage means 37 of the voice recognition device 30 and newly added additional words. The additional correct phoneme strings and additional words are transmitted from, for example, an administrator terminal 87 communicably connected to the server 57 via the network 85, and are stored by the processing section 63 described later.

The updated wrong phoneme storage means stores updated wrong phonemes and is the updated corresponding phoneme storage means 67 and/or the updated exceptional phoneme string storage means 69. The updated wrong phonemes are the updated corresponding phonemes and the updated exceptional phoneme strings. The updated corresponding phonemes include the corresponding phonemes stored in advance in the voice recognition device 30 (hereinafter, "standard corresponding phonemes") and newly added additional corresponding phonemes, and are stored in the updated corresponding phoneme storage means 67. The additional corresponding phonemes are generated by the processing section 63 described later. The updated exceptional phoneme strings include the exceptional phoneme strings stored in advance in the voice recognition device 30 (hereinafter, "standard exceptional phoneme strings") and newly added additional exceptional phoneme strings, and are stored in the updated exceptional phoneme string storage means 69 in association with updated words. These updated words include the standard words stored in advance in the exceptional phoneme string storage means 55 of the voice recognition device 30 and newly added additional words. The additional exceptional phoneme strings and additional words are generated by the processing section 63 described later.

The processing section 63 acquires the additional words and additional correct phoneme strings transmitted from the administrator terminal 87 and stores them in the updated correct phoneme string storage means 65. The processing section 63 also generates additional corresponding phonemes and additional exceptional phoneme strings from the pronunciation phoneme strings acquired from the voice recognition devices 30.

An additional corresponding phoneme consists of a wrong phoneme 19 newly detected from the plurality of pronunciation phoneme strings and words acquired from the voice recognition devices 30, together with the correct phoneme 17 to which it corresponds. The wrong phoneme 19 is obtained by detecting the phonemes that differ between an acquired pronunciation phoneme string and the updated correct phoneme string corresponding to the acquired word. Among the detected wrong phonemes 19, those that occur frequently are associated with their correct phonemes 17 and stored in the updated corresponding phoneme storage means 67 as additional corresponding phonemes. Updated corresponding phonemes that are already stored are not stored again.
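The server-side detection of additional corresponding phonemes might look like the following sketch. The text only says that frequently occurring wrong phonemes are selected, so the threshold `min_count` is a hypothetical parameter. The sketch also assumes that only pronunciation phoneme strings of the same length as the correct phoneme string are compared position by position; added or missing phonemes would need an alignment step that is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (correct phoneme, wrong phoneme)


def detect_additional_pairs(
    reports: Iterable[Tuple[str, List[str]]],
    correct_strings: Dict[str, List[str]],
    known_pairs: Set[Pair],
    min_count: int = 2,  # hypothetical frequency threshold
) -> Set[Pair]:
    """Count differing phonemes and keep frequent pairs not yet stored."""
    counts: Counter = Counter()
    for word, pronounced in reports:
        correct = correct_strings.get(word)
        # Assumption: skip unknown words and strings of different length.
        if correct is None or len(correct) != len(pronounced):
            continue
        for c, p in zip(correct, pronounced):
            if c != p:  # a differing phoneme is a candidate wrong phoneme
                counts[(c, p)] += 1
    # Pairs that are already stored are not stored again.
    return {
        pair
        for pair, n in counts.items()
        if n >= min_count and pair not in known_pairs
    }
```

The counter plays the role of the optional detection storage means that the server may keep for computing occurrence frequencies.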

An additional exceptional phoneme string is an exceptional phoneme string newly detected from the plurality of pronunciation phoneme strings and words acquired from the voice recognition devices 30. It is obtained by detecting the difference between an acquired pronunciation phoneme string and the updated correct phoneme string corresponding to the acquired word. Among the detected exceptional phoneme strings, those that occur frequently are associated with their words and stored in the updated exceptional phoneme string storage means 69 as additional exceptional phoneme strings. Exceptional phoneme strings that are already stored are not stored again.

To obtain these occurrence frequencies, the server 57 may include detection wrong phoneme storage means that stores each detected wrong phoneme 19 in association with its correct phoneme 17, and detection exceptional phoneme string storage means that stores each detected exceptional phoneme string in association with its word.

The processing section 63 also transmits notification phonemes to each voice recognition device 30 via the transmission/reception section 59. Notification phonemes are transmitted, for example, by extracting them from the storage section and sending them to the voice recognition device 30 in response to a request from the update means 77 of the voice recognition device 30 described later. The notification phonemes are updated phonemes, additional phonemes, or untransmitted phonemes.

The updated phonemes are the updated corresponding phonemes, updated exceptional phoneme strings, and updated correct phoneme strings stored in the storage section. The updated exceptional phoneme strings and updated correct phoneme strings are transmitted together with their corresponding updated words.

The additional phonemes are the additional correct phoneme strings, additional corresponding phonemes, and additional exceptional phoneme strings. They are obtained as the difference between the updated phonemes stored in the server 57 and the standard corresponding phonemes, standard exceptional phoneme strings, and standard correct phoneme strings stored in advance in the voice recognition device 30 (hereinafter collectively, "standard phonemes"). The additional correct phoneme strings and additional exceptional phoneme strings are transmitted together with their corresponding additional words.

The untransmitted phonemes are additional phonemes that have not yet been transmitted to the voice recognition device 30. They are obtained as the difference between the updated phonemes stored in the server 57 and the standard phonemes and additional phonemes stored in the voice recognition device 30.
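The relationship between updated, additional, and untransmitted phonemes reduces to set differences. The sketch below represents each stored entry as a hashable tuple, which is an assumption about the data layout; the function names are likewise hypothetical.

```python
from typing import Hashable, Set


def additional_phonemes(updated: Set[Hashable], standard: Set[Hashable]) -> Set[Hashable]:
    """Entries held by the server that the device did not ship with."""
    return updated - standard


def untransmitted_phonemes(
    updated: Set[Hashable],
    standard: Set[Hashable],
    already_on_device: Set[Hashable],
) -> Set[Hashable]:
    """Additional entries that have not yet reached the device."""
    return updated - (standard | already_on_device)


# Usage with (correct phoneme, wrong phoneme) pairs as entries.
standard = {("l", "r")}
updated = {("l", "r"), ("th", "s"), ("v", "b")}
sent = {("th", "s")}
extra = additional_phonemes(updated, standard)
pending = untransmitted_phonemes(updated, standard, sent)
```

Sending only the untransmitted entries keeps each notification small, whereas sending the full updated set lets a device rebuild its tables from scratch.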

The processing section 63 may also delete some of the updated phonemes stored in the storage section.

As shown in FIG. 12, the voice recognition device 30 of this embodiment includes a transmission/reception section 71 and update means 77.

The transmission/reception section 71 consists of a transmitter, a receiver, and their protocols for exchanging information with the server 57 via the network 85 (FIG. 11).

The update means 77 transmits pronunciation phoneme strings to the server 57 so that the server 57 can generate additional corresponding phonemes and additional exceptional phoneme strings. A pronunciation phoneme string is transmitted, for example, when voice recognition ends, in association with the word that was the recognition target.

The update means 77 also updates the standard phonemes stored in the storage section 3. The standard phonemes are updated, for example, by the update means 77 requesting the server 57 to transmit notification phonemes and storing the received notification phonemes in the storage section 3. The request for notification phonemes is made after a pronunciation phoneme string has been transmitted. It may also be made when the voice recognition device 30 connects to the network 85. Received notification phonemes are stored in the corresponding storage means 37, 39, and 55. That is, updated, additional, and untransmitted correct phoneme strings are stored in the correct phoneme string storage means 37 together with their corresponding words; updated, additional, and untransmitted corresponding phonemes are stored in the corresponding phoneme storage means 39; and updated, additional, and untransmitted exceptional phoneme strings are stored in the exceptional phoneme string storage means 55.

The standby phoneme generation means 41 of the voice recognition device 30 generates a standby phoneme group by combining the standard phonemes and the notification phonemes. That is, it extracts the standard corresponding phonemes and additional corresponding phonemes that correspond to the standard correct phoneme strings and additional correct phoneme strings, and combines them to generate one standby phoneme group. It may further extract the standard exceptional phoneme strings and additional exceptional phoneme strings and combine them with the one standby phoneme group to generate another standby phoneme group. The phoneme detection means 33 can thereby detect the pronounced phonemes from a standby phoneme group that includes the wrong phonemes acquired from the server.

The storage section 3 of the voice recognition device 30 may store the standard phonemes and the additional phonemes separately. That is, the correct phoneme string storage means 37 may consist of standard correct phoneme string storage means 74a and additional correct phoneme string storage means 74b, the corresponding phoneme storage means 39 may consist of standard corresponding phoneme storage means 73a and additional corresponding phoneme storage means 73b, and the exceptional phoneme string storage means 55 may consist of standard exceptional phoneme string storage means 75a and additional exceptional phoneme string storage means 75b. The additional correct phoneme string storage means 74 stores the additional correct phoneme strings and additional words transmitted from the server 57 in association with each other. The additional corresponding phoneme storage means 73 stores the additional corresponding phonemes transmitted from the server 57. The additional exceptional phoneme string storage means 75 stores the additional exceptional phoneme strings and additional words transmitted from the server 57 in association with each other.

The voice recognition system 100 of the present invention can identify the mispronunciation tendencies of a plurality of users from the pronunciation phoneme strings collected by the server 57 and, based on those tendencies, add corresponding phonemes and exceptional phoneme strings to the voice recognition devices 30. More accurate voice recognition can thereby be achieved.

The voice recognition system 100 can also expand the set of words to be recognized by means of additional words and additional correct phoneme strings.

A server 57 may also be provided for each region. This makes it possible to identify mispronunciation tendencies region by region, and hence to perform voice recognition that reflects the mispronunciation tendencies arising from dialects.

The voice recognition program of the present invention causes a computer to function as the standby phoneme storage means 11, phoneme feature storage means 13, correct phoneme string storage means 37, corresponding phoneme storage means 39, exceptional phoneme string storage means 55, additional correct phoneme string storage means 74, additional corresponding phoneme storage means 73, additional exceptional phoneme string storage means 75, voice information input means 29, word input means 31, phoneme detection means 33, standby phoneme generation means 41, update means 77, output section 9, and transmission/reception section 71 described above.

The voice recognition device, voice recognition system, and voice recognition program of the present invention have been described above. The present invention may, however, be carried out in forms incorporating various improvements, corrections, and modifications based on the knowledge of those skilled in the art without departing from its spirit, and all such forms fall within the scope of the present invention.

For example, as shown in FIG. 13, the phoneme detection means 33 of the present invention may generate a standby phoneme string set 79 by combining the phonemes of the standby phoneme group 21 (FIG. 2(a)) and compare the voice information with the phoneme features for each standby phoneme string 81. The standby phoneme string set 79 has the standby phoneme strings 81 as its elements. Each standby phoneme string 81 is either the correct phoneme string 81a or a wrong phoneme string 81b containing a wrong phoneme. The standby phoneme string set 79 can be generated by combining the phonemes that are the elements of the respective standby phoneme sets 23 (FIG. 2(a)). To compare the voice information with the phoneme features, for example, the phoneme feature of each phoneme in one standby phoneme string 81 is compared with a partial voice feature of the voice information. This comparison is performed for every standby phoneme string 81, and the standby phoneme string 81 with the most matches is detected as the pronunciation phoneme string.
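This variant can be sketched in Python as a Cartesian product over the standby phoneme sets followed by a match count. The function names are hypothetical. The `matches` callback stands in for the comparison of a phoneme feature with a partial voice feature, which the text does not specify, and the sketch assumes the voice information has already been split into one segment per phoneme position.

```python
from itertools import product
from typing import Callable, FrozenSet, List, Sequence, Tuple


def build_standby_strings(
    standby_group: Sequence[FrozenSet[str]],
) -> List[Tuple[str, ...]]:
    """Enumerate every standby phoneme string (one phoneme from each set)."""
    # Sorting each set only makes the enumeration order deterministic.
    return list(product(*(sorted(s) for s in standby_group)))


def detect_pronounced_string(
    standby_group: Sequence[FrozenSet[str]],
    segments: Sequence[object],
    matches: Callable[[str, object], bool],  # hypothetical feature comparison
) -> Tuple[str, ...]:
    """Return the standby phoneme string with the most matching positions."""

    def score(candidate: Tuple[str, ...]) -> int:
        return sum(1 for ph, seg in zip(candidate, segments) if matches(ph, seg))

    # On a tie, max() keeps the first candidate in enumeration order.
    return max(build_standby_strings(standby_group), key=score)
```

The number of candidate strings is the product of the set sizes, so this exhaustive comparison is practical only while the standby phoneme sets stay small.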

1, 10, 20, 30 … Voice recognition device
3 … Storage section
5 … Input section
7 … Processing section
9 … Output section
11 … Standby phoneme storage means
13 … Phoneme feature storage means
15 … Correct phoneme group
17 … Correct phoneme
19 … Wrong phoneme
19a … Substitution phoneme
19b … Additional phoneme
21 … Standby phoneme group
23 … Standby phoneme set
25 … Standby phoneme set family
27 … Standby phoneme
29 … Voice information input means
31 … Word input means
33 … Phoneme detection means
37 … Correct phoneme string storage means
39 … Corresponding phoneme storage means
41 … Standby phoneme generation means
43 … Corresponding phoneme group
45 … Corresponding phoneme set
47 … Corresponding phoneme set family
49 … Corresponding phoneme
51 … First phoneme
53 … Second phoneme
55 … Exceptional phoneme string storage means
57 … Server
59 … Transmission/reception section
61 … Storage section
63 … Processing section
65 … Updated correct phoneme string storage means
67 … Updated corresponding phoneme storage means
69 … Updated exceptional phoneme string storage means
71 … Transmission/reception section
73b … Additional corresponding phoneme storage means
74b … Additional correct phoneme string storage means
75b … Additional exceptional phoneme string storage means
77 … Update means
79 … Standby phoneme string set
81 … Standby phoneme string
85 … Network
100 … Voice recognition system

Claims

A voice recognition device comprising:
a storage section having standby phoneme storage means for storing, in association with a word, a standby phoneme group including a correct phoneme group that constitutes the correct pronunciation of the word and wrong phonemes that are wrong pronunciations of correct phonemes included in the correct phoneme group, and phoneme feature storage means for storing phoneme features determined for each phoneme of the standby phoneme group;
an input section having voice information input means for generating voice information from a voice signal of the pronounced word and inputting the voice information;
a processing section having phoneme detection means for comparing the phoneme features with the voice information and detecting, from the standby phoneme group, a pronunciation phoneme string that is the phoneme string that was pronounced; and
an output section that outputs the pronunciation phoneme string.

The voice recognition device according to claim 1, wherein
the standby phoneme storage means has
correct phoneme string storage means for storing, in association with the word, a correct phoneme string in which the correct phoneme group is arranged in pronunciation order, and
corresponding phoneme storage means for storing corresponding phonemes in which the correct phonemes and the wrong phonemes are associated with each other, and
the processing section has
standby phoneme generation means for generating the standby phoneme group by combining the correct phoneme string of the word with the corresponding phonemes.

The voice recognition device according to claim 1 or claim 2, wherein
the wrong phoneme is a substitution phoneme that is pronounced in place of the correct phoneme, and
the substitution phoneme is determined based on statistics obtained from a plurality of speakers pronouncing the word.

The voice recognition device according to claim 1 or claim 2, wherein
the wrong phoneme is an additional phoneme that is pronounced in addition to the correct phoneme, and
the additional phoneme is determined based on statistics obtained from a plurality of speakers pronouncing the word.

The voice recognition device according to any one of claims 1 to 4, wherein
the storage section has exceptional phoneme string storage means for storing an exceptional phoneme string that is a wrong phoneme string occurring only for the word,
the phoneme feature storage means further stores phoneme features determined for each phoneme of the exceptional phoneme string, and
the phoneme detection means compares the phoneme features with the voice information and detects the phonemes contained in the voice information from the standby phoneme group or the exceptional phoneme string.

A voice recognition system comprising the voice recognition device according to any one of claims 1 to 5 and a server capable of communicating with the voice recognition device via a network, wherein
the processing section of the voice recognition device notifies the server of the pronunciation phoneme string,
the server includes a processing section that newly detects the wrong phoneme based on statistics of the acquired pronunciation phoneme strings and notifies the voice recognition device of the detected wrong phoneme, and
the phoneme detection means detects the pronounced phonemes from a standby phoneme group that includes the wrong phoneme acquired from the server.

A voice recognition program for causing a computer to function as:
standby phoneme storage means for storing, in association with a word, a standby phoneme group including a correct phoneme group that constitutes the correct pronunciation of the word and wrong phonemes that are wrong pronunciations of correct phonemes included in the correct phoneme group;
phoneme feature storage means for storing phoneme features determined for each phoneme of the standby phoneme group;
voice information input means for generating voice information from a voice signal of the pronounced word and inputting the voice information; and
phoneme detection means for comparing the phoneme features with the voice information and detecting, from the standby phoneme group, a pronunciation phoneme string that is the phoneme string that was pronounced.