JP2007199552A - Device and method for speech recognition

Info

Publication number: JP2007199552A
Application number: JP2006020162A
Authority: JP
Inventors: Ryo Murakami; 涼村上
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2006-01-30
Filing date: 2006-01-30
Publication date: 2007-08-09

Abstract

PROBLEM TO BE SOLVED: To provide a technique that makes it possible to accurately recognize, in a short time, a natural sentence spoken by a talker.

SOLUTION: A speech recognition device is equipped with a speech input means of inputting a speech and converting it into sound data, an imaging means of repeatedly photographing the talker and relating the photographed image data to time, a time detecting means of detecting the start time and end time of speech input based upon the sound data, a sentence data generating means of generating sentence data from the sound data from the start time to the end time of the speech input, a speaking state recognizing means of recognizing the speaking state of the talker from the image data from the start time to the end time of the speech input, a speaking section judging means of judging, from the speaking state of the talker, whether the period from the start time to the end time of the speech input is a proper speaking section, and a sentence data output means of outputting the sentence data when it is judged that the period from the start time to the end time of the speech input is a proper speaking section.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to an apparatus and a method for recognizing speech spoken by an interlocutor as a sentence.

There are techniques that let a person control the operation of a device by voice, without operating an interface such as a keyboard or a lever. In such techniques, the content of the spoken words is recognized from the speech input through voice input means such as a microphone, and control is performed according to the recognized content.

In addition to the voice uttered by the interlocutor, ambient noise may be mixed into the speech input from the voice input means. If recognition is performed on speech mixed with ambient noise, misrecognition occurs, which can cause the device to malfunction. In speech recognition technology, how to remove the influence of noise is therefore important.

To eliminate the influence of noise, techniques have been developed that photograph the interlocutor's face and, based on the captured images, distinguish the period during which the interlocutor is speaking (the utterance section) from the period during which the interlocutor is not speaking. By identifying the utterance section from the captured face images and performing speech recognition only on that section, the influence of noise can be eliminated and misrecognition prevented.
Patent Document 1 discloses a technique that recognizes speech, detects an utterance section from movement near the interlocutor's lips, and extracts only the speech recognition results for speech uttered by the interlocutor in the detected utterance section.
Patent Document 2 discloses a technique that outputs voice operation phrase data as a speech recognition result only when the utterance section based on the voice uttered by the interlocutor and the utterance section obtained from mouth image data captured of the interlocutor's mouth substantially coincide.
Patent Document 3 discloses a technique that determines the presence or absence of utterance from the direction of the interlocutor's face, the movement of the lips, and the direction of the line of sight, and performs speech recognition processing when the interlocutor is determined to be speaking.
Patent Document 4 discloses a technique that extracts a cumulative variation function from the movement of the interlocutor's lips, obtains a time series that divides the cumulative variation function equally, and performs speech recognition using that time series as a reference.
Patent Document 5 discloses a technique that extracts a speech section from the movement of the interlocutor's lips, cuts out the speech waveform in that section, and performs speech recognition using the cut-out waveform.
JP 2004-24863 A
JP 2002-91466 A
JP H11-352987 A
JP H8-76792 A
JP H6-301393 A

In general, the computational load of image recognition processing is heavier than that of speech recognition processing, so image recognition takes longer than speech recognition. When, as in the prior art, an utterance section is identified from the result of image recognition processing and speech recognition processing is then performed on the identified section, the speech recognition rate improves, but a long time elapses between the moment the speaker talks and the completion of speech recognition. A technique that can perform speech recognition in a shorter time is desired.

When speech recognition processing is applied to a robot that guides visitors around a facility or event venue, interlocutors want to be guided even when they speak in natural sentences. Recognizing input speech as a sentence takes longer than recognizing only words, which is what the prior art handles. If image recognition processing is applied in such a case as in the prior art, a long time is required from when the interlocutor speaks until the sentence recognition processing is complete, and the interlocutor becomes dissatisfied with the long wait.

There is a need for a technique that can accurately recognize natural sentences spoken by an interlocutor while keeping the processing time short.

The present invention solves the above problem. It provides a technique capable of accurately recognizing, in a short time, a natural sentence spoken by an interlocutor.

The present invention is embodied as an apparatus for recognizing speech spoken by an interlocutor as a sentence. The apparatus comprises: voice input means for inputting speech and converting it into sound data; imaging means for repeatedly photographing the interlocutor and associating the captured image data with time; time detection means for detecting a voice input start time and a voice input end time based on the sound data; sentence data creation means for creating sentence data from the sound data from the voice input start time to the voice input end time; utterance state recognition means for recognizing the interlocutor's utterance state from the image data from the voice input start time to the voice input end time; utterance section determination means for determining, from the interlocutor's utterance state, whether the period from the voice input start time to the voice input end time is an appropriate utterance section; and sentence data output means for outputting the sentence data when that period is determined to be an appropriate utterance section.

In the above speech recognition apparatus, the time detection means detects the interlocutor's utterance section based on the sound data obtained by the voice input means. The utterance section is detected by detecting the voice input start time and the voice input end time. For the detected utterance section, the apparatus both creates sentence data and recognizes the interlocutor's utterance state. The sentence data creation means creates the sentence data for the utterance section, and the utterance state recognition means recognizes the interlocutor's utterance state in that section. The two means can operate independently of each other and can execute their processing in parallel. From the interlocutor's utterance state, which the utterance state recognition means recognizes from the image data, the apparatus evaluates whether the utterance section is appropriate, that is, whether the speech input to the voice input means was uttered by the interlocutor. When the utterance section is determined to be appropriate, the sentence data output means outputs the sentence data created by the sentence data creation means.

In the above speech recognition apparatus, whether the input speech was uttered by the interlocutor is evaluated based on the captured image data. This makes it possible to exclude the influence of ambient noise and the like and to output sentence data created from speech uttered by the interlocutor.

In the above speech recognition apparatus, the processing in the sentence data creation means (recognizing speech as a sentence) and the processing in the utterance state recognition means (recognizing the utterance state from images) can be performed in parallel. With this configuration, as soon as the recognition processing in the utterance state recognition means finishes and the utterance section is determined to be appropriate from the recognized utterance state, the sentence data can be output immediately. The processing from when the interlocutor speaks until the sentence data is output can thus be completed in a short time.
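The parallel arrangement described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the function names, the fixed return values, and the 0.5 threshold are hypothetical stand-ins and are not taken from this document.

```python
from concurrent.futures import ThreadPoolExecutor


def create_sentence_data(sound_data):
    # Hypothetical stand-in for the sentence data creation means.
    return "where is the reception desk"


def recognize_utterance_state(image_data):
    # Hypothetical stand-in for the utterance state recognition means.
    # Returns a likelihood in [0, 1] that the interlocutor is addressing the device.
    return 0.9


def is_appropriate_section(utterance_likelihood, threshold=0.5):
    # Hypothetical decision rule; the threshold value is an assumption.
    return utterance_likelihood >= threshold


def recognize(sound_data, image_data):
    """Run both recognizers concurrently and gate the sentence output."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sentence_future = pool.submit(create_sentence_data, sound_data)
        state_future = pool.submit(recognize_utterance_state, image_data)
        sentence = sentence_future.result()
        likelihood = state_future.result()
    # Sentence data is output only when the section is judged appropriate.
    return sentence if is_appropriate_section(likelihood) else None
```

Because the sentence data is already available when the image-based judgment finishes, the output step itself adds no further recognition delay.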

In the above speech recognition apparatus, it is preferable that the utterance state recognition means recognizes at least two types of utterance state of the interlocutor, and that the utterance section determination means determines, from those at least two types of utterance state, whether the period from the voice input start time to the voice input end time is an appropriate utterance section.
With this configuration, whether the utterance section is appropriate can be evaluated from multiple viewpoints. The validity of the utterance section can be evaluated accurately, and misrecognition can be prevented.

In the above speech recognition apparatus, it is preferable that the utterance state is selected from the group including the distance to the interlocutor, the direction of the interlocutor's face, the direction of the interlocutor's line of sight, and the movement of the interlocutor's lips.
Whether the interlocutor is speaking to the apparatus can be judged from various viewpoints. Judging from the distance to the interlocutor, the direction of the face, the direction of the line of sight, the movement of the lips, and the like, as described above, makes it possible to evaluate the validity of the utterance section accurately and to prevent misrecognition.

In the above speech recognition apparatus, it is preferable that the sentence data creation means includes sentence data group storage means for storing a group of candidate sentence data and likelihood calculation means for calculating, based on the sound data, a likelihood for each item of sentence data in the candidate group, and that it creates the sentence data by identifying the item with the highest likelihood in the candidate group.
With this configuration, by preparing sentences that are naturally used in conversation between people as the candidate sentence data group, the sentence data output as the speech recognition result also corresponds to sentences naturally used in conversation between people.
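The selection of the most likely candidate can be sketched as follows. This is a toy illustration, not the scoring used by the apparatus: `toy_likelihood` is a hypothetical scorer that counts word overlap, the candidate sentences are invented examples, and a real system would score candidates against acoustic data.

```python
def select_sentence(candidates, likelihood_fn, sound_data):
    """Return the candidate sentence with the highest likelihood given the sound data."""
    return max(candidates, key=lambda sentence: likelihood_fn(sentence, sound_data))


def toy_likelihood(sentence, sound_data):
    # Hypothetical scorer: fraction of observed words that appear in the candidate.
    words = set(sentence.split())
    return sum(1 for w in sound_data if w in words) / len(sound_data)


# Hypothetical candidate sentence data group.
candidates = [
    "where is the restroom",
    "what time does the show start",
    "where is the exit",
]
observed = ["where", "is", "exit"]  # stand-in for acoustic evidence
best = select_sentence(candidates, toy_likelihood, observed)
print(best)  # where is the exit
```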

In the above speech recognition apparatus, it is preferable that the sentence data output means does not output the sentence data when the period from the voice input start time to the voice input end time is not determined to be an appropriate utterance section.
When the period is not determined to be an appropriate utterance section, the sound data obtained by the voice input means is attributable to ambient noise or the like, and sentence data created from such sound data is of no use. The above speech recognition apparatus can prevent such sentence data from being output.

The present invention can also be embodied as a method. The method of the present invention recognizes speech spoken by an interlocutor as a sentence. The method comprises: a voice input step of inputting speech and converting it into sound data; an imaging step of repeatedly photographing the interlocutor and associating the captured image data with time; a time detection step of detecting a voice input start time and a voice input end time based on the sound data; a sentence data creation step of creating sentence data from the sound data from the voice input start time to the voice input end time; an utterance state recognition step of recognizing the interlocutor's utterance state from the image data from the voice input start time to the voice input end time; an utterance section determination step of determining, from the interlocutor's utterance state, whether the period from the voice input start time to the voice input end time is an appropriate utterance section; and a sentence data output step of outputting the sentence data when that period is determined to be an appropriate utterance section.

According to the sentence recognition apparatus and the sentence recognition method of the present invention, a natural sentence spoken by an interlocutor can be recognized accurately in a short time.

The best modes for carrying out the invention are listed below.
(Mode 1) The sentence data creation means creates sentence data as a time series of phonemes from the sound data using a hidden Markov model (HMM).
(Mode 2) The apparatus further includes photographing direction adjustment means for adjusting the direction of the field of view of the photographing means, and adjusts that direction so that the interlocutor is photographed near the center of the field of view.

In this embodiment, an example is described in which the speech recognition apparatus 100 illustrated in FIG. 1 recognizes speech spoken by an interlocutor V as a sentence. The speech recognition apparatus 100 is a guide robot placed in, for example, a showroom or an event venue, and recognizes sentences spoken by a visitor (interlocutor) V who approaches it for guidance.

The speech recognition apparatus 100 includes a right camera 104 and a left camera 106 arranged side by side at the front of a head 102, an actuator 110 that rotates the head 102 left and right relative to a body portion 108, a microphone 112 provided at the front of the body portion 108, and a controller 114 that controls the operation of the right camera 104, the left camera 106, the actuator 110, and the microphone 112.

The right camera 104 and the left camera 106 are ordinary CCD cameras. They capture images simultaneously at predetermined time intervals and output the captured image data to the controller 114 in association with the capture time.

The microphone 112 detects the sound pressure applied to its diaphragm by the input sound, A/D-converts the voltage value corresponding to the detected sound pressure, and outputs the discretized voltage values to the controller 114 in association with the time of input. Hereinafter, the data output from the microphone 112 is referred to as sound data.

The actuator 110 is, for example, an ordinary motor. Driving the actuator 110 adjusts the rotation angle of the head 102 relative to the body portion 108. Rotating the head 102 relative to the body portion 108 adjusts the field of view captured by the right camera 104 and the left camera 106.

FIG. 2 is a block diagram showing the configuration of the controller 114. The controller 114 is a computer device composed of a processing unit (CPU), a storage device (an optical storage medium, a magnetic storage medium, or semiconductor memory such as RAM or ROM), an input/output device, an arithmetic unit, and the like. Functionally, the controller 114 includes an utterance state recognition unit 202, a speech analysis unit 208, and an output unit 220.

The utterance state recognition unit 202 calculates the utterance likelihood of the interlocutor V based on the image data output from the right camera 104 and the left camera 106. Here, the utterance likelihood of the interlocutor V is a numerical value expressing how likely it is that the interlocutor V is speaking to the speech recognition apparatus 100, and takes a value in the range from 0 to 1. The closer the value is to 1, the higher the likelihood. As described in detail below, the utterance state recognition unit 202 of this embodiment calculates utterance likelihoods from several viewpoints.

To calculate the utterance likelihood, the utterance state recognition unit 202 recognizes, based on the image data output from the right camera 104 and the left camera 106, the position of the interlocutor V, the direction of the interlocutor V's face, the direction of the interlocutor V's line of sight, and the movement of the interlocutor V's lips. The utterance state recognition unit 202 calculates a first utterance likelihood from the position of the interlocutor V, a second utterance likelihood from the direction of the face, a third utterance likelihood from the direction of the line of sight, and a fourth utterance likelihood from the movement of the lips.

The position of the interlocutor V can be calculated by extracting the contour of the interlocutor V in the image data of each of the right camera 104 and the left camera 106 and calculating, by the principle of stereo vision, the relative positional relationship between the interlocutor V and the speech recognition apparatus 100. The rotation angle of the head 102 relative to the body portion 108 of the speech recognition apparatus 100 is taken into account in this calculation. Once the position of the interlocutor V has been calculated, the utterance state recognition unit 202 calculates the first utterance likelihood. When the interlocutor V is close to the speech recognition apparatus 100, the likelihood that the interlocutor V is speaking to the speech recognition apparatus 100 is high. Conversely, when the interlocutor V is far from the speech recognition apparatus 100, that likelihood is low. In this embodiment, the utterance state recognition unit 202 calculates the distance between the interlocutor V and the speech recognition apparatus 100 and specifies the first utterance likelihood according to the calculated distance. The utterance state recognition unit 202 stores in advance a correspondence table indicating the relationship between this distance and the first utterance likelihood, and specifies the first utterance likelihood using the calculated distance and the correspondence table.
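The distance-based lookup can be sketched as below. The table values, the names, and the pinhole stereo relation Z = f * B / d (which assumes two parallel, rectified cameras) are assumptions made for illustration; the document does not give the table contents or the exact stereo computation.

```python
import bisect

# Hypothetical correspondence table:
# upper distance bounds in metres and the first utterance likelihood for each band.
DISTANCE_BOUNDS_M = [1.0, 2.0, 3.0]
FIRST_LIKELIHOODS = [1.0, 0.7, 0.3, 0.0]  # last entry applies beyond the final bound


def stereo_depth(focal_length_px, baseline_m, disparity_px):
    """Depth of a point seen by two parallel cameras: Z = f * B / d (assumed model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


def first_utterance_likelihood(distance_m):
    """Look up the likelihood for a distance using the hypothetical table above."""
    index = bisect.bisect_left(DISTANCE_BOUNDS_M, distance_m)
    return FIRST_LIKELIHOODS[index]
```

For example, with an assumed focal length of 800 pixels, a baseline of 0.125 m, and a disparity of 50 pixels, `stereo_depth` gives 2.0 m, which falls in the second band of the table.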

In this embodiment, once the position of the interlocutor V has been calculated, the utterance state recognition unit 202 drives the actuator 110 to rotate the head 102 so that the interlocutor V is photographed near the center of the field of view of the right camera 104 and the left camera 106. Rotating the head 102 according to the position of the interlocutor V in this way prevents the interlocutor V from leaving the field of view of the right camera 104 and the left camera 106 even when the interlocutor V moves while speaking.

The direction of the interlocutor V's face and the direction of the line of sight can be calculated by extracting feature points in the image data of each of the right camera 104 and the left camera 106, calculating the positions of the extracted feature points by the principle of stereo vision, and using the calculated positions. The rotation angle of the head 102 relative to the body portion 108 of the speech recognition apparatus 100 is taken into account when calculating the positions of the feature points.

To calculate the direction of the interlocutor V's face, the inner and outer corners of the left and right eyes and the corners of the mouth are first extracted from the image data as feature points. From the positions of these feature points in the image data of the right camera 104 and the left camera 106, their actual positions can be calculated by the principle of stereo vision. Because these feature points lie on the surface of the interlocutor V's face, the direction of the face can be calculated from their positions. Once the direction of the face has been calculated, the utterance state recognition unit 202 calculates the second utterance likelihood. When the interlocutor V's face is turned toward the speech recognition apparatus 100, the likelihood that the interlocutor V is speaking to the speech recognition apparatus 100 is high. Conversely, when the face is turned in a direction different from that of the speech recognition apparatus 100, that likelihood is low. In this embodiment, the utterance state recognition unit 202 calculates, based on the position of the interlocutor V, the direction of the utterance state recognition unit 202 as seen from the interlocutor V, and then calculates the deviation angle between that direction and the direction of the interlocutor V's face. The utterance state recognition unit 202 specifies the second utterance likelihood according to the calculated deviation angle. It stores in advance a correspondence table indicating the relationship between the deviation angle and the second utterance likelihood, and specifies the second utterance likelihood using the calculated deviation angle and the correspondence table.
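The deviation-angle step can be sketched as follows. The vector representation, the function names, and the table values are assumptions for illustration only; the document does not specify how the angle is computed or what the table contains.

```python
import math


def deviation_angle_deg(direction_to_device, face_direction):
    """Angle in degrees between two 3-D direction vectors."""
    dot = sum(a * b for a, b in zip(direction_to_device, face_direction))
    norm = math.hypot(*direction_to_device) * math.hypot(*face_direction)
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


# Hypothetical correspondence table:
# (upper bound of deviation angle in degrees, second utterance likelihood)
ANGLE_TABLE = [(15.0, 1.0), (45.0, 0.5), (180.0, 0.0)]


def second_utterance_likelihood(angle_deg):
    """Return the likelihood of the first band whose upper bound covers the angle."""
    for upper_bound, likelihood in ANGLE_TABLE:
        if angle_deg <= upper_bound:
            return likelihood
    return 0.0
```

The same pattern would apply to the line-of-sight direction with its own table.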

When calculating the direction of the line of sight of the conversation person V, the inner corners, outer corners, and iris centers of the left and right eyes on the face of the conversation person V are first extracted from the image data as feature points. The actual positions of these feature points can be calculated by the principle of stereo vision. The direction of the line of sight of the conversation person V can then be calculated from the position of each iris center relative to the inner and outer corners of the eye, together with the face direction of the conversation person V. Once the direction of the line of sight has been calculated, the utterance state recognition unit 202 calculates the third utterance likelihood. When the line of sight of the conversation person V is directed toward the speech recognition apparatus 100, the likelihood that the conversation person V is speaking to the speech recognition apparatus 100 is high. Conversely, when the line of sight is directed away from the speech recognition apparatus 100, that likelihood is low. In this embodiment, the utterance state recognition unit 202 calculates, based on the position of the conversation person V, the direction of the utterance state recognition unit 202 as seen from the conversation person V, and then calculates the deviation angle between that direction and the direction of the line of sight of the conversation person V. The utterance state recognition unit 202 specifies the third utterance likelihood according to the calculated deviation angle. A correspondence table relating the deviation angle to the third utterance likelihood is stored in advance in the utterance state recognition unit 202, which specifies the third utterance likelihood from the calculated deviation angle and this table.
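The deviation-angle lookup described above can be sketched as follows. This is only an illustrative sketch: the table values, the function names, and the use of 3-D vectors for positions and directions are assumptions, since the text states only that a correspondence table between deviation angle and likelihood is stored in advance.

```python
import math

# Hypothetical correspondence table: (upper bound of deviation angle in degrees, likelihood).
# The values are invented for illustration; the patent does not disclose them.
GAZE_TABLE = [(10.0, 1.0), (30.0, 0.7), (60.0, 0.3), (180.0, 0.0)]


def deviation_angle_deg(direction_to_device, gaze_direction):
    """Angle in degrees between two 3-D direction vectors."""
    dot = sum(a * b for a, b in zip(direction_to_device, gaze_direction))
    norm = math.hypot(*direction_to_device) * math.hypot(*gaze_direction)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def third_utterance_likelihood(person_pos, device_pos, gaze_direction, table=GAZE_TABLE):
    """Look up the likelihood for the angle between the gaze and the direction to the device."""
    direction_to_device = [d - p for d, p in zip(device_pos, person_pos)]
    angle = deviation_angle_deg(direction_to_device, gaze_direction)
    for upper_bound, likelihood in table:
        if angle <= upper_bound:
            return likelihood
    return 0.0
```

The same lookup would serve for the second utterance likelihood by passing the face direction instead of the gaze direction, with its own table.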

The movement of the lips of the conversation person V is evaluated from the change over time of the image near the lips, extracted from the image data of either the right camera 104 or the left camera 106.
FIG. 3 shows the change over time of the image near the lips R of the conversation person V, extracted from image data captured by either the right camera 104 or the left camera 106. In the illustrated example, the lips R are closed at time t1, closed at the immediately following time t2, open at the immediately following time t3, open at the immediately following time t4, and closed at the immediately following time t5. In this case, the state of the lips R at time t2 is the same as at the immediately preceding time t1, so the utterance state recognition unit 202 evaluates that the lips R are not moving at time t2. The state of the lips R at time t3 differs from that at the immediately preceding time t2, so the utterance state recognition unit 202 evaluates that the lips R are moving at time t3. The state of the lips R at time t4 is the same as at the immediately preceding time t3, so the utterance state recognition unit 202 evaluates that the lips R are not moving at time t4. In this way, the movement of the lips is evaluated from the change over time of the image near the lips. Once the lip movement of the conversation person V has been evaluated, the utterance state recognition unit 202 calculates the fourth utterance likelihood. When the conversation person V is moving the lips, the likelihood that the conversation person V is speaking to the speech recognition apparatus 100 is high. Conversely, when the conversation person V is not moving the lips, that likelihood is low. The utterance state recognition unit 202 of this embodiment sets the fourth utterance likelihood to 1 when the conversation person V is evaluated to be moving the lips, and to zero when the conversation person V is evaluated not to be moving the lips.
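The frame-to-frame comparison in the FIG. 3 example can be sketched as below. The string labels for the lip state and the function name are assumptions for illustration; how the open or closed state is classified from the image is not covered here.

```python
def lip_movement_likelihoods(lip_states):
    """Return the fourth utterance likelihood for every frame after the first.

    lip_states is a list of per-frame lip states (for example "open" or "closed").
    A frame scores 1.0 when its state differs from the previous frame, else 0.0.
    """
    return [1.0 if current != previous else 0.0
            for previous, current in zip(lip_states, lip_states[1:])]


# States at t1..t5 from the FIG. 3 example; the result covers t2..t5.
fig3_states = ["closed", "closed", "open", "open", "closed"]
print(lip_movement_likelihoods(fig3_states))  # [0.0, 1.0, 0.0, 1.0]
```

The values for t2, t3, and t4 match the evaluation given in the text; the value for t5 follows from applying the same rule.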

The utterance state recognition unit 202 in FIG. 2 executes the series of processes described above each time the right camera 104 and the left camera 106 capture images, and calculates the first, second, third, and fourth utterance likelihoods from the image data captured at the same time by the two cameras. The utterance state recognition unit 202 then outputs the calculated first, second, third, and fourth utterance likelihoods to the utterance section determination unit 204 of the output unit 220, in association with the time at which the image data used for the likelihood calculation was captured.

Based on the sound data input from the microphone 112, the voice analysis unit 208 detects the utterance start time, detects the utterance end time, and creates sentence data covering the period from the utterance start time to the utterance end time. The voice analysis unit 208 includes a time detection unit 210, a sentence data creation unit 212, a phoneme database (hereinafter, DB) 214, a word DB 216, and a sentence DB 218.

The time detection unit 210 detects the utterance start time and the utterance end time from the sound data input from the microphone 112.
FIG. 4 shows the waveform of the sound data 402 input from the microphone 112. While no utterance start has been detected, the time detection unit 210 monitors whether the sound pressure in the sound data 402 exceeds a predetermined threshold ΔP. When the sound pressure of the sound data 402 exceeds the threshold ΔP, the time detection unit 210 determines that an utterance has started. On detecting the start of the utterance, the time detection unit 210 specifies the utterance start time TS and notifies the sentence data creation unit 212 of it.
After the start of the utterance has been detected, the time detection unit 210 counts the number of times the waveform of the sound data 402 crosses the zero sound pressure line 404 per unit time ΔT, and monitors whether the count reaches a predetermined threshold. When the count per unit time ΔT falls below the predetermined threshold, the time detection unit 210 determines that the utterance has ended. On detecting the end of the utterance, the time detection unit 210 specifies the utterance end time TE and notifies the sentence data creation unit 212 of it. The time detection unit 210 then outputs the utterance start time TS and the utterance end time TE to the utterance section determination unit 204 of the output unit 220.
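The start and end detection described above can be sketched as follows. This is a minimal sketch under stated assumptions: sound pressure is taken as the absolute sample value, the unit time ΔT is evaluated in consecutive non-overlapping windows, and the end time is taken as the start of the first window whose zero-crossing count falls below the threshold. None of these details are fixed by the text, and all names are hypothetical.

```python
def detect_utterance(samples, sample_rate, start_threshold, window_s, min_crossings):
    """Return (start_index, end_index) of an utterance, or None if none starts.

    end_index is None when the utterance has not ended within the samples.
    """
    # Start: first sample whose magnitude exceeds the threshold (delta-P in the text).
    start = next((i for i, s in enumerate(samples) if abs(s) > start_threshold), None)
    if start is None:
        return None

    # End: first window (delta-T in the text) with too few zero crossings.
    window = max(1, int(window_s * sample_rate))
    for w_start in range(start, len(samples) - window + 1, window):
        chunk = samples[w_start:w_start + window]
        crossings = sum(1 for a, b in zip(chunk, chunk[1:]) if (a < 0) != (b < 0))
        if crossings < min_crossings:
            return start, w_start
    return start, None
```

Dividing the returned indices by the sample rate gives the times corresponding to TS and TE.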

The sentence data creation unit 212 identifies, as sentence data, the sentence spoken by the conversation person V from the sound data input from the microphone 112. When notified of the utterance start time by the time detection unit 210, the sentence data creation unit 212 begins identifying sentence data based on the sound data from the utterance start time onward. It continues to identify sentence data in real time, based on the sound data input from the microphone 112 between the utterance start time and the current time. When notified of the utterance end time by the time detection unit 210, the sentence data creation unit 212 outputs to the utterance section determination unit 204 the sentence data identified from the sound data between the utterance start time and the utterance end time.

The identification of sentence data performed by the sentence data creation unit 212 is described in detail below. The sentence data creation unit 212 of this embodiment uses a hidden Markov model (HMM) to identify a sentence, as a time series of phonemes, from the input sound data. Here, a phoneme means an element of the speech sounds produced when a person speaks a word. For example, the speech produced when a person says the word "budou" ("grape") consists of the four phonemes "b", "u", "d", and "o:". When an HMM is used to identify a time series of phonemes, each phoneme is assumed to consist of a plurality of states, and each state is characterized by a transition probability of moving to the next state and a stay probability of remaining in the same state without moving on. Below, a state that makes up a phoneme is called a phoneme state. This embodiment describes an example in which one phoneme consists of three phoneme states. For example, the phoneme "b" consists of the phoneme states b1, b2, and b3. The phoneme "b" is realized by a transition from some phoneme state to the phoneme state b1, then from b1 to b2, and then from b2 to b3. The phoneme state b1 may transition to the next phoneme state b2, or may stay in b1. The same applies to the phoneme states b2 and b3. In the sentence data creation unit 212 of this embodiment, a phoneme is identified as a time series of phoneme states, a word as a time series of phonemes, and a sentence as a time series of words. The sentence data creation unit 212 identifies the most likely time series of phoneme states, and determines that the sentence corresponding to that time series is the sentence the conversation person V is speaking.

More specifically, the sentence data creation unit 212 performs framing processing on the sound data input from the microphone 112, specifies the frequency spectrum of the sound data in each frame, evaluates the likelihood of each phoneme state for that frame from the specified frequency spectrum, evaluates the likelihood of each phoneme from the likelihoods of the phoneme states, evaluates the likelihood of each word from the likelihoods of the phonemes, and evaluates the likelihood of each sentence from the likelihoods of the words. The phonemes subject to likelihood evaluation are stored in advance in the phoneme DB 214 in association with the phoneme states that make them up. The words subject to likelihood evaluation are stored in advance in the word DB 216 in association with the phonemes that make them up. The sentence data subject to likelihood evaluation is stored in advance in the sentence DB 218 in association with the words that make up each sentence. After evaluating the likelihood of each sentence, the sentence data creation unit 212 determines that the sentence with the highest likelihood is the sentence the conversation person V is speaking.

First, the sentence data creation unit 212 performs framing processing on the input sound data and specifies the frequency spectrum of the sound data corresponding to each frame. FIG. 5 shows the framing of the sound data and the specification of the frequency spectrum for each frame. In this embodiment, the frame length is 20 ms and the frame interval is 10 ms. As shown in FIG. 5, frames F1, F2, F3, ... are defined for the sound data 402. The sentence data creation unit 212 specifies the frequency spectra f1, f2, f3, ... of the sound data 402 in the frames F1, F2, F3, ..., respectively. A frequency spectrum is given as a distribution of amplitude over frequency. The frequency spectrum can be specified using, for example, a fast Fourier transform.
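The framing step (20 ms frames at a 10 ms interval, so consecutive frames overlap by half) can be sketched as follows. The function names are hypothetical, and a plain discrete Fourier transform is used here in place of the fast Fourier transform mentioned in the text, purely to keep the sketch dependency-free; it yields the same amplitudes but is slower.

```python
import cmath


def split_into_frames(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Cut the samples into overlapping frames F1, F2, F3, ..."""
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def amplitude_spectrum(frame):
    """Amplitude at each non-negative frequency bin, via a naive O(n^2) DFT."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


# 50 ms of audio at an assumed 1 kHz sample rate gives four 20-sample frames.
frames = split_into_frames([0.0] * 50, sample_rate=1000)
spectra = [amplitude_spectrum(f) for f in frames]
```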

Next, the sentence data creation unit 212 evaluates, from the frequency spectrum specified for each frame, the likelihood of each phoneme state for that frame. Each phoneme state has a probability distribution over the frequency spectrum that is observed as speech when that phoneme state is realized. This probability distribution can be obtained in advance, for example by experiment. From this probability distribution and the frequency spectrum specified for a frame, the likelihood of the phoneme state for that frame can be calculated. In this embodiment, the phoneme DB 214 stores in advance, for each phoneme state of each phoneme subject to likelihood evaluation, a function that calculates the likelihood from the observed frequency spectrum. The sentence data creation unit 212 calculates, for each of the frequency spectra f1, f2, f3, ..., the likelihood of each phoneme state of each phoneme. For example, the likelihoods of the phoneme states b1, b2, and b3 of the phoneme "b" for the frame F1 are calculated from the frequency spectrum f1 of the frame F1. The likelihoods for the frame F1 of the phoneme states of the other phonemes are calculated in the same way, as are the likelihoods of each phoneme state of each phoneme for the other frames.

Once the likelihood of each phoneme state for each frame has been calculated, the sentence data creation unit 212 evaluates the likelihood of each phoneme and the likelihood of each word. These evaluations are described with reference to FIG. 6, which takes as an example the evaluation of the likelihood of the word "budou" ("grape"). The left-hand column of FIG. 6 shows that the word "budou" is configured as the sequence of phonemes "b", "u", "d", "o:", that the phoneme "b" is configured as the sequence of phoneme states b1, b2, b3, the phoneme "u" as the sequence u1, u2, u3, the phoneme "d" as the sequence d1, d2, d3, and the phoneme "o:" as the sequence o:1, o:2, o:3. In FIG. 6, the state in which the phoneme state b1 is realized in the frame F1 is represented by a point 602, and the states in which the phoneme states b1, b2, b3, ... are realized in the subsequent frames F2, F3, ... Fn are represented by points 604, 606, 608, 610, 612, .... From each of the points 602, 604, 606, ... extend a path that transitions to the next phoneme state in the next frame and a path that stays in the same phoneme state without transitioning. For example, from the point 602, which indicates that the phoneme state b1 is realized in the frame F1, extend a branch 614 that transitions to the next phoneme state b2 in the next frame F2 and a branch 616 that stays in the phoneme state b1 without transitioning to b2. The branch 614 extends to the point 604, which indicates that the phoneme state b2 is realized in the frame F2. The branch 616 extends to the point 606, which indicates that the phoneme state b1 is realized in the frame F2.

The likelihood of each of the points 602, 604, 606, ... in FIG. 6 can be calculated as the likelihood of the corresponding phoneme state for the corresponding frame. The likelihood of each of the branches 614, 616, ... can be calculated from the transition probability and the stay probability of each phoneme state. For example, the likelihood of the branch 614 can be calculated from the transition probability from the phoneme state b1 to the phoneme state b2, and the likelihood of the branch 616 from the stay probability of the phoneme state b1. The transition probability and stay probability of each phoneme state of each phoneme making up a word are obtained in advance, for example by experiment, and stored in the phoneme DB 214 and the word DB 216.

Based on the likelihoods of the points 602, 604, 606, ..., calculated as the likelihoods of each phoneme state for each frame, and the likelihoods of the branches 614, 616, ... stored in the phoneme DB 214 and the word DB 216, the sentence data creation unit 212 calculates the likelihood of every path that is possible at that point in time and identifies the path with the highest likelihood. Here, the likelihood of a path means the likelihood that events progressed along that path. It can be calculated from the likelihoods of the points and the likelihoods of the branches included in the path. Once the path with the highest likelihood has been identified for a word, the sentence data creation unit 212 takes the likelihood of that path as the likelihood of the word.
In the example shown in FIG. 6, at the point where processing has progressed through the frames F1, F2, ... Fn, the path 618 has been identified as the path with the highest likelihood for the word "budou". In this case, the likelihood that events progressed along the path 618 is taken as the likelihood of the word "budou". This likelihood is calculated from the likelihoods of the points 602, 604, 610, ... and the likelihoods of the branches 614, ... included in the path 618.
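The search for the most likely path through the lattice of FIG. 6 can be sketched with a Viterbi-style dynamic program. This is an illustrative sketch, not the patent's stated procedure: it assumes that the path likelihood is the product of the point and branch likelihoods on the path, that every path starts in the first phoneme state (as point 602 does), and that the best path may end in any state. All names are hypothetical. A practical implementation would typically work with log-likelihoods to avoid numerical underflow.

```python
def best_path(point_likelihood, stay_prob, next_prob):
    """Find the most likely left-to-right state path.

    point_likelihood[t][s]: likelihood of phoneme state s for frame t (the points).
    stay_prob[s]: likelihood of the branch that stays in state s.
    next_prob[s]: likelihood of the branch from state s to state s + 1.
    Returns (likelihood, path) where path lists the state index per frame.
    """
    n_states = len(stay_prob)
    # First frame: only the first state is reachable.
    best = [(point_likelihood[0][0], [0])] + [(0.0, [])] * (n_states - 1)
    for t in range(1, len(point_likelihood)):
        new_best = []
        for s in range(n_states):
            score, path = best[s]
            candidates = [(score * stay_prob[s], path)]          # stay branch
            if s > 0:
                prev_score, prev_path = best[s - 1]
                candidates.append((prev_score * next_prob[s - 1], prev_path))  # transition branch
            score, path = max(candidates, key=lambda c: c[0])
            new_best.append((score * point_likelihood[t][s], path + [s]))
        best = new_best
    return max(best, key=lambda c: c[0])
```

For a word such as "budou", the states would be b1, b2, b3, u1, ..., o:3 in order, and the returned likelihood would play the role of the word likelihood.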

Although FIG. 6 illustrates the evaluation of the likelihood of the word "budou", the sentence data creation unit 212 performs the likelihood evaluation described above for every word stored in the word DB 216. The likelihood of every word stored in the word DB 216 is thereby evaluated.

Once the likelihood of each word has been evaluated, the sentence data creation unit 212 evaluates the likelihood of each item of sentence data. It does so for all the sentence data stored in the sentence DB 218. The sentence DB 218 stores the sentence data subject to likelihood evaluation in association with the sequence of words making up each sentence.
FIG. 8 shows how the likelihood of a sentence is evaluated. In the example shown in FIG. 8, the word sequence "Prius" (registered trademark), "no", "nenpi" (fuel economy), "wa", "ikura" (how much), "desu ka" makes up one sentence, meaning "What is the fuel economy of the Prius?". The word sequence "Prius", "no", "nenpi", "o", "oshiete" (tell me), "kudasai" (please) also makes up one sentence, meaning "Please tell me the fuel economy of the Prius." These sentences and the word sequences that make them up are stored in advance in the sentence DB 218.

The sentence data creation unit 212 calculates the likelihood of a sentence from the likelihoods of the words included in the sentence and the likelihoods of the word-to-word connections in the sentence. The likelihood of a word-to-word connection is specified using the word connection table 700 shown in FIG. 7. The word connection table 700 lists the probability (labeled "appearance rate" in the figure) that a connection from a given word (labeled "preceding word" in the figure) to the word that follows it (labeled "following word" in the figure) appears. The probability of such a word-to-word connection can be obtained, for example, by experiment. The word connection table 700 is stored in advance in the sentence DB 218, and the sentence data creation unit 212 reads it from the sentence DB 218 as needed. The sentence data creation unit 212 calculates the likelihood of each item of sentence data stored in the sentence DB 218, and identifies the sentence data with the highest likelihood as the sentence spoken by the conversation person V.
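The sentence scoring described above can be sketched as follows. The text says only that the sentence likelihood is calculated "from" the word likelihoods and the connection likelihoods; combining them as a product is an assumption made for this sketch, and the function names and all numeric values are invented for illustration.

```python
def sentence_likelihood(words, word_likelihood, connection_prob):
    """Product of the word likelihoods and the word-to-word connection probabilities."""
    score = 1.0
    for word in words:
        score *= word_likelihood[word]
    for previous, following in zip(words, words[1:]):
        # A connection missing from the table is treated as impossible.
        score *= connection_prob.get((previous, following), 0.0)
    return score


def most_likely_sentence(sentences, word_likelihood, connection_prob):
    """Pick the stored sentence (a tuple of words) with the highest likelihood."""
    return max(sentences,
               key=lambda ws: sentence_likelihood(ws, word_likelihood, connection_prob))


# Invented values standing in for word likelihoods and for the word connection table 700.
word_likelihood = {"nenpi": 0.9, "wa": 0.6, "o": 0.3}
connection_prob = {("nenpi", "wa"): 0.5, ("nenpi", "o"): 0.4}
sentences = [("nenpi", "wa"), ("nenpi", "o")]
best = most_likely_sentence(sentences, word_likelihood, connection_prob)
```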

The sentence data creation unit 212 repeats the series of processes described above, from the framing processing to the identification of sentence data, until the time detection unit 210 notifies it of the utterance end time. When notified of the utterance end time, the sentence data creation unit 212 outputs the sentence data identified from the sound data up to that utterance end time to the utterance section determination unit 204 of the output unit 220.

The output unit 220 includes an utterance section determination unit 204 and a sentence data output unit 206.
The utterance section determination unit 204 determines the validity of the utterance section based on the time series of the first, second, third, and fourth utterance likelihoods input from the utterance state recognition unit 202 and on the utterance start time and end time input from the time detection unit 210.

The validity of the utterance section can be determined by various methods. For example, when erroneous recognition is to be avoided as far as possible, the first, second, third, and fourth utterance likelihoods are each compared with a predetermined threshold (for example, 0.9), and the utterance section is determined to be valid only when all four exceed the threshold.
Alternatively, the sum of the first, second, third, and fourth utterance likelihoods, each multiplied by a weighting coefficient, may be calculated as a total utterance likelihood, and the utterance section may be determined to be valid only when the total utterance likelihood exceeds a predetermined threshold (for example, 0.9).
Alternatively, the validity of the utterance section may be determined based on values obtained by differentiating the first, second, third, and fourth utterance likelihoods with respect to time, or by integrating them with respect to time. That is, where the first utterance likelihood is f₁(t), the second utterance likelihood is f₂(t), the third utterance likelihood is f₃(t), and the fourth utterance likelihood is f₄(t), the validity of the utterance section may be determined by whether the value L calculated by the following formula exceeds a predetermined threshold. Here, a_{k,l} (k = 1 to 4, l = 1 to 3) are weighting coefficients that can be chosen arbitrarily.
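The first two determination methods can be sketched as below. The function names and the example weights are assumptions; the 0.9 threshold is the example value given in the text. How the per-frame likelihood time series is reduced to four values for the utterance section is not specified, so the sketch simply takes four values as input. The third method, based on derivatives and integrals, is not sketched.

```python
def valid_by_all_thresholds(likelihoods, threshold=0.9):
    """Valid only when every one of the four utterance likelihoods exceeds the threshold."""
    return all(value > threshold for value in likelihoods)


def valid_by_weighted_sum(likelihoods, weights, threshold=0.9):
    """Valid only when the weighted sum (total utterance likelihood) exceeds the threshold."""
    total = sum(w * value for w, value in zip(weights, likelihoods))
    return total > threshold


# First to fourth utterance likelihoods for one utterance section (invented values).
likelihoods = [0.95, 0.5, 0.97, 1.0]
strict = valid_by_all_thresholds(likelihoods)                          # False: the second is too low
weighted = valid_by_weighted_sum(likelihoods, [0.1, 0.1, 0.3, 0.5])    # True: total is 0.936
```

As the example values suggest, the weighted sum lets a strong cue such as lip movement compensate for a weak one, whereas the all-thresholds rule rejects the section as soon as any single cue is weak.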

When the utterance section is determined to be valid, the utterance section determination unit 204 outputs the sentence data input from the sentence data creation unit 212 to the sentence data output unit 206 as the sentence spoken by the conversation person V.

When the utterance section is determined not to be valid, the utterance section determination unit 204 outputs nothing to the sentence data output unit 206.

The sentence data output unit 206 outputs the sentence data input from the utterance section determination unit 204. The sentence data output from the sentence data output unit 206 can be used for various purposes. For example, appropriate answers to sentences spoken by the conversation person V may be stored in advance in a database or the like, and a separate response device may be provided that answers according to the recognized sentence data. Inputting the sentence data output from the sentence data output unit 206 into that response device makes it possible to respond appropriately to the conversation person V.

FIG. 9 is a flowchart explaining the processing performed by the controller 114 of this embodiment. Sound data is sequentially input to the controller 114 from the microphone 112. Image data captured at predetermined time intervals is also sequentially input to the controller 114 from the right camera 104 and the left camera 106.
The voice analysis unit 208 and the utterance state recognition unit 202 of the controller 114 execute their processing in parallel with each other. The voice analysis unit 208 performs the processing of steps S902 to S916, and the utterance state recognition unit 202 performs the processing of steps S918 to S930.

The processing of the voice analysis unit 208 will be described.
In step S902, the process waits until the time detection unit 210 detects the start of an utterance from the sound data. When the start of an utterance is detected, the processing of the voice analysis unit 208 proceeds to step S904.
In step S904, the sentence data creation unit 212 divides the sound data into frames.
In step S906, the sentence data creation unit 212 identifies the frequency spectrum of the sound data in each frame.
In step S908, the sentence data creation unit 212 calculates the likelihood of each phoneme state from the identified frequency spectrum.
In step S910, the sentence data creation unit 212 calculates the likelihood of each word.
In step S912, the sentence data creation unit 212 calculates the likelihood of each sentence.
In step S914, it is determined whether the time detection unit 210 has detected the end of the utterance from the sound data. When the end of the utterance has been detected (YES in step S914), the processing of the voice analysis unit 208 proceeds to step S916. When the end of the utterance has not been detected (NO in step S914), the processing of the voice analysis unit 208 returns to step S904, and the processing up to step S914 is repeated.
In step S916, the time detection unit 210 outputs the utterance start time and the utterance end time to the utterance section determination unit 204, and the sentence data creation unit 212 outputs the identified sentence data to the utterance section determination unit 204. The voice analysis unit 208 then ends its processing, and the output unit 220 performs the processing from step S932 onward.
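The framing and spectrum steps (S904 and S906) can be illustrated with a minimal Python sketch. The frame length, the hop size, the use of a plain discrete Fourier transform, and the function names `frame_signal` and `magnitude_spectrum` are all assumptions made for illustration; they are not taken from the embodiment.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Step S904 (sketch): split the sound data into overlapping frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def magnitude_spectrum(frame):
    """Step S906 (sketch): magnitude spectrum of one frame via a plain DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(math.hypot(re, im))
    return spectrum

# Hypothetical input: a tone that completes two cycles per 8-sample frame.
samples = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
frames = frame_signal(samples, frame_len=8, hop=4)
spectra = [magnitude_spectrum(f) for f in frames]
```

In this sketch each frame's spectrum peaks at the bin matching the tone, which is the kind of per-frame feature from which phoneme-state likelihoods would then be computed in step S908.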

The processing of the utterance state recognition unit 202 will be described.
In step S918, the process waits until newly captured image data is input from the right camera 104 and the left camera 106.
In step S920, the feature points of the conversation person V are detected from the captured image data.
In steps S922 to S928, the first utterance likelihood, the second utterance likelihood, the third utterance likelihood, and the fourth utterance likelihood are calculated, respectively.
In step S930, the calculated first, second, third, and fourth utterance likelihoods are output to the utterance section determination unit 204 in association with the time at which the image data was captured.
Thereafter, the processing of the utterance state recognition unit 202 returns to step S918, and the processing up to step S930 is repeated.
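The loop of steps S918 to S930 can be sketched as a generator that emits one time-stamped record per captured image. The names `LikelihoodRecord` and `recognize_utterance_state`, and the idea of passing the feature detector and the four likelihood estimators in as callables, are hypothetical and serve only to show the data flow.

```python
from dataclasses import dataclass

@dataclass
class LikelihoodRecord:
    time: float          # capture time of the image data
    likelihoods: tuple   # (first, second, third, fourth) utterance likelihoods

def recognize_utterance_state(timed_images, detect_features, estimators):
    """Yield one time-stamped record per captured image (steps S918 to S930)."""
    for capture_time, image in timed_images:
        features = detect_features(image)                     # step S920
        values = tuple(est(features) for est in estimators)   # steps S922 to S928
        yield LikelihoodRecord(capture_time, values)          # step S930
```

Keeping the capture time with each record is what later allows the likelihoods to be matched against the utterance start and end times.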

The processing of the output unit 220 will be described.
In step S932, the validity of the utterance section is determined based on the time series of the first, second, third, and fourth utterance likelihoods input from the utterance state recognition unit 202 and on the utterance start time and end time input from the time detection unit 210. When the utterance section is determined to be valid (YES in step S932), the process proceeds to step S934, the sentence data output unit 206 outputs the sentence data input from the sentence data creation unit 212, and the process ends. When the utterance section is determined not to be valid (NO in step S932), the process ends without outputting the sentence data.
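The gating performed in steps S932 and S934 can be sketched as follows. The decision rule used here (average each of the four likelihoods over the utterance section and require every average to reach a threshold) and the threshold value are assumptions; this passage does not state how the four time series are combined.

```python
def is_valid_section(records, start, end, threshold=0.5):
    """Step S932 (sketch). records: list of (time, (l1, l2, l3, l4)) tuples."""
    in_section = [ls for t, ls in records if start <= t <= end]
    if not in_section:
        return False  # no image evidence inside the utterance section
    # Assumed rule: every likelihood, averaged over the section, must reach the threshold.
    averages = [sum(column) / len(in_section) for column in zip(*in_section)]
    return all(a >= threshold for a in averages)

def output_sentence(sentence, records, start, end):
    """Step S934 (sketch): return the sentence only for a valid utterance section."""
    return sentence if is_valid_section(records, start, end) else None
```

A caller would pass the sentence data together with the start and end times from the voice analysis side, and receive `None` when the image evidence does not support the section.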

In the speech recognition apparatus 100 of the present embodiment, the sentence recognition processing, which recognizes the speech in an utterance section as a sentence, and the image recognition processing, which is used to determine the validity of the utterance section, are performed in parallel. As a result, the sentence recognition result is available as soon as the validity of the utterance section is confirmed, and the processing time from when the conversation person speaks until the sentence data is output can be shortened.
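The parallel arrangement can be sketched with two concurrent tasks whose results meet at the validity check. The function `recognize_in_parallel` and its callable parameters are hypothetical stand-ins for the units of the controller 114, and the use of Python threads is an assumption for illustration rather than the mechanism of the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_in_parallel(recognize_sentence, score_images, judge, sound, images):
    """Run sentence recognition and image recognition concurrently, then gate the output."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sentence_job = pool.submit(recognize_sentence, sound)   # steps S902 to S916
        image_job = pool.submit(score_images, images)           # steps S918 to S930
        sentence, start, end = sentence_job.result()
        records = image_job.result()
    # Step S932: the sentence is already available when validity is judged.
    return sentence if judge(records, start, end) else None     # step S934
```

Because neither task waits for the other, the only work left after the utterance ends is the validity judgment itself.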

In the speech recognition apparatus 100 of the present embodiment, the validity of the utterance section is evaluated using the first, second, third, and fourth utterance likelihoods, each of which is calculated from a different viewpoint. This makes it possible to evaluate the validity of the utterance section accurately and to prevent erroneous recognition of sentences.

In the speech recognition apparatus 100 of the present embodiment, when a sentence in an utterance section is recognized, the likelihood of each candidate as a sentence is evaluated and the sentence with the highest likelihood is identified. This prevents a situation in which a meaningless sentence is recognized because a word has been misrecognized.
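Selecting the most likely sentence can be sketched as an argmax over candidate sentences. Scoring a sentence as the sum of the log-likelihoods of its words is an assumption made to keep the example short, and the candidate sentences and word likelihood values are invented for illustration.

```python
import math

def sentence_log_likelihood(words, word_likelihood):
    """Sum of word log-likelihoods; a word with no likelihood scores minus infinity."""
    return sum(
        math.log(word_likelihood[w]) if word_likelihood.get(w, 0) > 0 else -math.inf
        for w in words
    )

def best_sentence(candidates, word_likelihood):
    """Return the candidate sentence with the highest likelihood."""
    return max(candidates, key=lambda ws: sentence_log_likelihood(ws, word_likelihood))

# Hypothetical candidates and per-word likelihoods.
candidates = [("bring", "grapes"), ("bring", "drapes")]
word_likelihood = {"bring": 0.9, "grapes": 0.6, "drapes": 0.2}
chosen = best_sentence(candidates, word_likelihood)
```

Restricting the search to stored candidate sentences is what keeps a single misheard word from producing a sentence that no candidate supports.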

Specific examples of the present invention have been described in detail above, but these are merely illustrative and do not limit the scope of the claims. The technology described in the claims includes various modifications and alterations of the specific examples illustrated above.
The technical elements described in this specification or the drawings exhibit technical usefulness either alone or in various combinations, and are not limited to the combinations recited in the claims at the time of filing. Furthermore, the technology illustrated in this specification or the drawings achieves a plurality of objects at the same time, and achieving any one of those objects has technical usefulness in itself.

FIG. 1 is a view showing the appearance of the speech recognition apparatus 100.
FIG. 2 is a diagram schematically showing the configuration of the controller 114.
FIG. 3 is a diagram for explaining the calculation of the fourth utterance likelihood from the movement of the lips R.
FIG. 4 is a diagram for explaining detection of an utterance start time TS and end time TE.
FIG. 5 is a diagram for explaining the framing processing of sound data 402 and the identification of a frequency spectrum.
FIG. 6 is a diagram for explaining the likelihood evaluation of the word "grape".
FIG. 7 is a diagram illustrating a word connection table 700.
FIG. 8 is a diagram for explaining the likelihood evaluation of a sentence.
FIG. 9 is a flowchart for explaining the processing of the controller 114.

Explanation of symbols

100: speech recognition apparatus
102: head
104: right camera
106: left camera
108: body
110: actuator
112: microphone
114: controller
202: utterance state recognition unit
204: utterance section determination unit
206: sentence data output unit
208: voice analysis unit
210: time detection unit
212: sentence data creation unit
214: phoneme database
216: word database
218: sentence database
220: output unit
402: sound data
404: line of zero sound pressure
602, 604, 606, 608, 610, 612: points
614, 616: branches
618: path
700: word connection table

Claims

A speech recognition apparatus that recognizes speech spoken by a conversation person as a sentence, comprising:
voice input means for inputting speech and converting it into sound data;
imaging means for repeatedly photographing the conversation person and associating the captured image data with time;
time detection means for detecting a voice input start time and a voice input end time based on the sound data;
sentence data creation means for creating sentence data from the sound data from the voice input start time to the voice input end time;
utterance state recognition means for recognizing an utterance state of the conversation person from the image data from the voice input start time to the voice input end time;
utterance section determination means for determining, from the utterance state of the conversation person, whether the period from the voice input start time to the voice input end time is an appropriate utterance section; and
sentence data output means for outputting the sentence data when the period from the voice input start time to the voice input end time is determined to be an appropriate utterance section.

The speech recognition apparatus according to claim 1, wherein the utterance state recognition means recognizes at least two types of utterance states of the conversation person, and
the utterance section determination means determines, from the at least two types of utterance states, whether the period from the voice input start time to the voice input end time is an appropriate utterance section.

The speech recognition apparatus according to claim 2, wherein the utterance states are selected from a group including the distance to the conversation person, the direction of the conversation person's face, the direction of the conversation person's line of sight, and the movement of the conversation person's lips.

The speech recognition apparatus according to claim 1, wherein the sentence data creation means comprises:
sentence data group storage means for storing a group of candidate sentence data; and
likelihood calculation means for calculating, based on the sound data, a likelihood for each item of sentence data in the group of candidate sentence data,
and creates the sentence data by identifying, from the group of candidate sentence data, the sentence data with the highest likelihood.

The speech recognition apparatus according to claim 1, wherein the sentence data output means does not output the sentence data when the period from the voice input start time to the voice input end time is not determined to be an appropriate utterance section.

A speech recognition method for recognizing speech spoken by a conversation person as a sentence, comprising:
a voice input step of inputting speech and converting it into sound data;
an imaging step of repeatedly photographing the conversation person and associating the captured image data with time;
time detection means for detecting a voice input start time and a voice input end time based on the sound data;
a sentence data creation step of creating sentence data from the sound data from the voice input start time to the voice input end time;
an utterance state recognition step of recognizing an utterance state of the conversation person from the image data from the voice input start time to the voice input end time;
an utterance section determination step of determining, from the utterance state of the conversation person, whether the period from the voice input start time to the voice input end time is an appropriate utterance section; and
a sentence data output step of outputting the sentence data when the period from the voice input start time to the voice input end time is determined to be an appropriate utterance section.