JP2020155944A

JP2020155944A - Speaker detection system, speaker detection method, and program

Info

Publication number: JP2020155944A
Application number: JP2019052911A
Authority: JP
Inventors: Hiroyuki Nagano; Masaki Nose; Yuto Goto
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2020-09-24
Anticipated expiration: 2039-03-20
Also published as: JP7259447B2

Abstract

To detect a speaker even when the speaker is difficult to detect from lip movement. SOLUTION: A speaker detection system according to the present invention comprises: an imaging unit that captures images of a plurality of participants, including the speaker, to generate image data; a first detection unit that detects the speaker by detecting lip movement based on the image data; and a second detection unit that, when the first detection unit cannot detect the speaker, detects the speaker by detecting the movement or appearance of the participants based on the image data. SELECTED DRAWING: Figure 1

Description

本発明は、発話者検出システム、発話者検出方法及びプログラムに関する。 The present invention relates to a speaker detection system, a speaker detection method and a program.

近年、会議等において、言葉を発している人物（以下「発話者」という。）を検出する方法が知られている。 In recent years, methods have become known for detecting the person who is speaking (hereinafter referred to as the "speaker") in a conference or the like.

例えば、テレビ会議システムが、まず、会議室内にいる、発話者を含む参加者をカメラ等で撮像する。そして、撮像された画像データに基づいて、テレビ会議システムは、それぞれの参加者の顔画像を抽出する。次に、テレビ会議システムは、顔画像における***部分の動作を検出することで発話者を特定する。このようにして、発話者を特定して選択的に画像を撮像する方法が知られている（例えば、特許文献１参照）。 For example, the video conferencing system first captures a participant including a speaker in the conference room with a camera or the like. Then, based on the captured image data, the video conferencing system extracts the facial image of each participant. Next, the video conferencing system identifies the speaker by detecting the movement of the lip portion in the facial image. In this way, a method of identifying a speaker and selectively capturing an image is known (see, for example, Patent Document 1).

しかしながら、従来の方法では、画像データに***部分が写っていないと、発話者の検出が難しい場合がある。例えば、人物がマスクを装着していたり、***部分を手で隠す癖等があったりすると、***部分が遮蔽され、画像データに、***部分が写らない場合がある。このような場合には、***の動作を検出して発話者を検出するのが難しい場合がある。 However, with the conventional method, it may be difficult to detect the speaker if the lip portion is not shown in the image data. For example, if a person wears a mask or has a habit of hiding the lip portion by hand, the lip portion may be shielded and the lip portion may not be shown in the image data. In such cases, it may be difficult to detect the movement of the lips and detect the speaker.

本発明の一態様は、***の動作で発話者を検出するのが難しい場合であっても、発話者を検出することを目的とする。 One aspect of the present invention is intended to detect a speaker even when it is difficult to detect the speaker by the movement of the lips.

本発明の一実施形態による、発話者を検出する発話者検出システムは、
前記発話者を含む複数の参加者を撮像して画像データを生成する撮像部と、
前記画像データに基づいて、***の動作を検出して前記発話者を検出する第１検出部と、
前記第１検出部で前記発話者が検出できない場合に、前記画像データに基づいて、前記参加者の動き又は外観を検出して前記発話者を検出する第２検出部と
を備えることを特徴とする。 A speaker detection system for detecting a speaker according to an embodiment of the present invention comprises:
an imaging unit that captures images of a plurality of participants, including the speaker, and generates image data;
a first detection unit that detects the speaker by detecting lip movement based on the image data; and
a second detection unit that, when the first detection unit cannot detect the speaker, detects the speaker by detecting the movement or appearance of the participants based on the image data.

本発明の実施形態によって、***の動作で発話者を検出するのが難しい場合であっても、発話者を検出できる。 According to the embodiment of the present invention, the speaker can be detected even when it is difficult to detect the speaker by the movement of the lips.

発話者検出システムの全体構成例及び使用例を示す概略図である。 A schematic diagram showing an overall configuration example and a usage example of the speaker detection system.
電子黒板のハードウェア構成例を示す図である。 A diagram showing a hardware configuration example of the electronic blackboard.
全体処理例を示すフローチャートである。 A flowchart showing an example of the overall processing.
第１人物の検出例を示す図である。 A diagram showing a detection example of the first person.
第２人物の検出例を示す図である。 A diagram showing a detection example of the second person.
第３人物の検出例を示す図である。 A diagram showing a detection example of the third person.
第１実施形態における発話者検出システムの機能構成例を示す機能ブロック図である。 A functional block diagram showing a functional configuration example of the speaker detection system in the first embodiment.
第２実施形態における発話者検出システムの機能構成例を示す機能ブロック図である。 A functional block diagram showing a functional configuration example of the speaker detection system in the second embodiment.
第３実施形態における全体処理例を示すフローチャートである。 A flowchart showing an example of the overall processing in the third embodiment.

以下、発明を実施するための最適な形態について、図面を参照して説明する。 Hereinafter, the optimum mode for carrying out the invention will be described with reference to the drawings.

＜第１実施形態＞
＜発話者検出システムの全体構成例及び使用例＞
発話者検出システムは、例えば、複数の参加者が会議室等に集まって話し合い等をする場面等において、以下のように設置して使用される。なお、設置場所は、会議室に限られず、他の部屋等でもよい。 <First Embodiment>
<Overall configuration example and usage example of speaker detection system>
The speaker detection system is installed and used as follows, for example, in a situation where a plurality of participants gather in a conference room or the like for discussion. The installation location is not limited to the conference room, but may be another room or the like.

図１は、発話者検出システムの全体構成例及び使用例を示す概略図である。例えば、発話者検出システム１０は、図示するように、撮像装置の例であるカメラ１と、カメラ１と有線又は無線で接続する情報処理装置の例である電子黒板２とを有する構成である。 FIG. 1 is a schematic diagram showing an overall configuration example and a usage example of the speaker detection system. For example, as illustrated, the speaker detection system 10 includes a camera 1, which is an example of an imaging device, and an electronic blackboard 2, which is an example of an information processing device connected to the camera 1 by wire or wirelessly.

カメラ１は、参加者である第１参加者ＭＡ、第２参加者ＭＢ及び第３参加者ＭＣが撮像できる画角及び設置位置であるのが望ましい。例えば、カメラ１は、３６０ °の範囲が撮像できる画角を有する。このように、カメラ１は、１８０ °以上の広角な範囲を撮像できる光学系であるのが望ましい。このような広角な範囲を撮像できる撮像装置であると、画像データに参加者が漏れなく撮像できる確率を高くできる。 It is desirable that the camera 1 have an angle of view and an installation position that allow it to capture the participants, namely the first participant MA, the second participant MB, and the third participant MC. For example, the camera 1 has an angle of view that covers a 360° range. More generally, the camera 1 desirably has an optical system capable of capturing a wide-angle range of 180° or more. An imaging device with such a wide-angle range increases the probability that every participant appears in the image data.

なお、撮像装置は、複数でもよい。また、撮像装置は、図示するような会議室の真ん中となる配置でなくともよい。すなわち、撮像装置は、部屋の端等に設置され、全体を撮像できるように調整されてもよい。 A plurality of imaging devices may be used. Further, the imaging device need not be placed in the middle of the conference room as illustrated. That is, the imaging device may be installed at the edge of the room or the like and adjusted so that it can capture the whole room.

カメラ１は、静止画像又は動画像である画像データを電子黒板２に送信する。そして、電子黒板２は、例えば、画像データに基づいて、会議の様子等を表示する等の処理を行う。なお、電子黒板２は、画像データをクラウド上又は記憶装置等に保存してもよい。 The camera 1 transmits image data, which is a still image or a moving image, to the electronic blackboard 2. Then, the electronic blackboard 2 performs processing such as displaying the state of the meeting or the like based on the image data, for example. The electronic blackboard 2 may store image data on the cloud or in a storage device or the like.

＜電子黒板の例＞
図２は、電子黒板のハードウェア構成例を示す図である。図示するように、電子黒板２は、ＣＰＵ２０１、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）２０２、ＲＡＭ(ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ)２０３、ＳＳＤ２０４、ネットワークＩ／Ｆ２０５、及び、外部機器接続Ｉ／Ｆ２０６を備える。 <Example of electronic blackboard>
FIG. 2 is a diagram showing a hardware configuration example of the electronic blackboard. As illustrated, the electronic blackboard 2 includes a CPU 201, a ROM (Read Only Memory) 202, a RAM (Random Access Memory) 203, an SSD 204, a network I/F 205, and an external device connection I/F 206.

これらのうち、ＣＰＵ２０１は、電子黒板２全体の動作を制御する。ＲＯＭ２０２は、ＣＰＵ２０１やＩＰＬ（ＩｎｉｔｉａｌＰｒｏｇｒａｍＬｏａｄｅｒ）等のＣＰＵ２０１の駆動に用いられるプログラムを記憶する。ＲＡＭ２０３は、ＣＰＵ２０１のワークエリアとして使用される。ＳＳＤ２０４は、電子黒板用のプログラム等の各種データを記憶する。ネットワークＩ／Ｆ２０５は、通信ネットワークで外部機器と通信を行うためのインターフェースである。ネットワークコントローラは、通信ネットワークとの通信を制御する。外部機器接続Ｉ／Ｆ２０６は、各種の外部機器を接続するためのインターフェースである。この場合の外部機器は、例えば、ＵＳＢ（ＵｎｉｖｅｒｓａｌＳｅｒｉａｌＢｕｓ）メモリ２３０、外付け機器（マイク２４０、スピーカ２５０、カメラ１）である。 Of these, the CPU 201 controls the operation of the entire electronic blackboard 2. The ROM 202 stores programs used for driving the CPU 201, such as an IPL (Initial Program Loader). The RAM 203 is used as a work area for the CPU 201. The SSD 204 stores various data such as a program for the electronic blackboard. The network I/F 205 is an interface for communicating with external devices over a communication network. The network controller controls communication with the communication network. The external device connection I/F 206 is an interface for connecting various external devices. The external devices in this case are, for example, a USB (Universal Serial Bus) memory 230 and external equipment (a microphone 240, a speaker 250, and the camera 1).

また、電子黒板２は、キャプチャデバイス２１１、ＧＰＵ２１２、ディスプレイコントローラ２１３、接触センサ２１４、センサコントローラ２１５、電子ペンコントローラ２１６、近距離通信回路２１９、及び近距離通信回路２１９のアンテナ２１９ａ、電源スイッチ２２２及び選択スイッチ類２２３を備える。 Further, the electronic blackboard 2 includes a capture device 211, a GPU 212, a display controller 213, a contact sensor 214, a sensor controller 215, an electronic pen controller 216, a short-range communication circuit 219, an antenna 219a for the short-range communication circuit 219, a power switch 222, and selection switches 223.

これらのうち、キャプチャデバイス２１１は、外付けのＰＣ（ＰｅｒｓｏｎａｌＣｏｍｐｕｔｅｒ）２７０のディスプレイに対して映像情報を静止画又は動画として表示させる。ＧＰＵ（ＧｒａｐｈｉｃｓＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）２１２は、グラフィクスを専門に扱う半導体チップである。ディスプレイコントローラ２１３は、ＧＰＵ２１２からの出力画像をディスプレイ２８０等へ出力するために画面表示の制御及び管理を行う。接触センサ２１４は、ディスプレイ２８０上に電子ペン２９０やユーザの手Ｈ等が接触したことを検知する。センサコントローラ２１５は、接触センサ２１４の処理を制御する。接触センサ２１４は、赤外線遮断方式による座標の入力及び座標の検出を行う。この座標の入力及び座標の検出する方法は、ディスプレイ２８０の上側両端部に設置された２つ受発光装置が、ディスプレイ２８０に平行して複数の赤外線を放射し、ディスプレイ２８０の周囲に設けられた反射部材によって反射されて、受光素子が放射した光の光路と同一の光路上を戻って来る光を受光する方法である。接触センサ２１４は、物体によって遮断された２つの受発光装置が放射した赤外線のＩＤをセンサコントローラ２１５に出力し、センサコントローラ２１５が、物体の接触位置である座標位置を特定する。電子ペンコントローラ２１６は、電子ペン２９０と通信することで、ディスプレイ２８０へのペン先のタッチやペン尻のタッチの有無を判断する。近距離通信回路２１９は、ＮＦＣ又はＢｌｕｅｔｏｏｔｈ（登録商標）等の通信回路である。電源スイッチ２２２は、電子黒板２の電源のＯＮ／ＯＦＦを切り換えるためのスイッチである。選択スイッチ類２２３は、例えば、ディスプレイ２８０の表示の明暗や色合い等を調整するためのスイッチ群である。 Of these, the capture device 211 causes the display of the external PC (Personal Computer) 270 to display the video information as a still image or a moving image. The GPU (Graphics Processing Unit) 212 is a semiconductor chip that specializes in graphics. The display controller 213 controls and manages the screen display in order to output the output image from the GPU 212 to the display 280 or the like. The contact sensor 214 detects that the electronic pen 290, the user's hand H, or the like has come into contact with the display 280. The sensor controller 215 controls the processing of the contact sensor 214. The contact sensor 214 inputs and detects the coordinates by the infrared blocking method. In this method of inputting coordinates and detecting coordinates, two light receiving and emitting devices installed at both upper ends of the display 280 emit a plurality of infrared rays in parallel with the display 280 and are provided around the display 280. This is a method of receiving light that is reflected by a reflecting member and returns on the same optical path as the light path emitted by the light receiving element. 
The contact sensor 214 outputs to the sensor controller 215 the IDs of the infrared rays, emitted by the two light emitting and receiving devices, that were blocked by an object, and the sensor controller 215 identifies the coordinate position at which the object made contact. The electronic pen controller 216 communicates with the electronic pen 290 to determine whether the pen tip or the pen end has touched the display 280. The short-range communication circuit 219 is a communication circuit such as NFC or Bluetooth (registered trademark). The power switch 222 is a switch for turning the power of the electronic blackboard 2 ON and OFF. The selection switches 223 are, for example, a group of switches for adjusting the brightness, hue, and the like of the display 280.
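As an illustration only, the coordinate identification performed by the sensor controller 215 can be sketched as a triangulation. The sketch assumes that each blocked ray ID has already been mapped to an angle measured from the top edge of the display, that the two devices sit at the top-left and top-right corners, and that y increases downward. The function name `touch_position` and these conventions are hypothetical and are not taken from the description.

```python
import math


def touch_position(width, angle_left_deg, angle_right_deg):
    """Intersect the two blocked rays to estimate the contact point.

    width           -- distance between the two corner devices (assumed)
    angle_left_deg  -- angle of the blocked ray from the top-left device,
                       measured downward from the top edge (assumed mapping)
    angle_right_deg -- same for the top-right device
    Returns (x, y) with the origin at the top-left device, y pointing down.
    """
    for angle in (angle_left_deg, angle_right_deg):
        if not 0.0 < angle < 90.0:
            raise ValueError("angles must lie strictly between 0 and 90 degrees")
    tan_left = math.tan(math.radians(angle_left_deg))
    tan_right = math.tan(math.radians(angle_right_deg))
    # Left ray:  y = x * tan_left
    # Right ray: y = (width - x) * tan_right
    x = width * tan_right / (tan_left + tan_right)
    y = x * tan_left
    return x, y
```

For example, with both rays blocked at 45 degrees, the contact point lies midway between the two devices, at a depth equal to half the width.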

更に、電子黒板２は、バスライン２１０を備えている。バスライン２１０は、図２に示されているＣＰＵ２０１等の各構成要素を電気的に接続するためのアドレスバスやデータバス等である。 Further, the electronic blackboard 2 includes a bus line 210. The bus line 210 is an address bus, a data bus, or the like for electrically connecting each component such as the CPU 201 shown in FIG.

なお、接触センサ２１４は、赤外線遮断方式に限らず、静電容量の変化を検知することにより接触位置を特定する静電容量方式のタッチパネル、対向する２つの抵抗膜の電圧変化によって接触位置を特定する抵抗膜方式のタッチパネル、接触物体が表示部に接触することによって生じる電磁誘導を検知して接触位置を特定する電磁誘導方式のタッチパネル等の種々の検出手段を用いてもよい。また、電子ペンコントローラ２１６が、電子ペン２９０のペン先及びペン尻だけでなく、電子ペン２９０のユーザが握る部分、又は、その他の電子ペンの部分のタッチの有無を判断するようにしてもよい。 The contact sensor 214 is not limited to the infrared blocking method; various detection means may be used, such as a capacitive touch panel that identifies the contact position by detecting a change in capacitance, a resistive-film touch panel that identifies the contact position from the voltage change between two opposing resistive films, or an electromagnetic-induction touch panel that identifies the contact position by detecting the electromagnetic induction produced when an object touches the display unit. Further, the electronic pen controller 216 may determine the presence or absence of a touch not only by the pen tip and pen end of the electronic pen 290 but also by the portion of the electronic pen 290 that the user grips or by other portions of the electronic pen.

なお、情報処理装置は、電子黒板でなくともよい。例えば、情報処理装置は、ＰＣ（ＰｅｒｓｏｎａｌＣｏｍｐｕｔｅｒ）又はサーバ等でもよい。 The information processing device does not have to be an electronic blackboard. For example, the information processing device may be a PC (Personal Computer), a server, or the like.

＜全体処理例＞
図３は、全体処理例を示すフローチャートである。例えば、会議が開始されると、発話者検出システムは、例えば、以下のような処理を行う。 <Overall processing example>
FIG. 3 is a flowchart showing an example of overall processing. For example, when the conference is started, the speaker detection system performs the following processing, for example.

＜画像データの生成例＞（ステップＳ１Ｓ）
発話者検出システムは、参加者を撮像して画像データを生成する。なお、ステップＳ１Ｓによって撮像する画像を１フレームとすると、発話者検出システムは、ステップＳ１Ｅにより、繰り返し画像データを生成し、複数のフレームを生成する。また、ステップＳ２Ｓ以降の処理は、それぞれのフレームに対して行われる。 <Example of image data generation> (step S1S)
The speaker detection system images the participants and generates image data. Assuming that the image captured in step S1S is one frame, the speaker detection system repeatedly generates image data in step S1E to generate a plurality of frames. Further, the processing after step S2S is performed for each frame.

＜人物の検出例＞（ステップＳ２Ｓ）
発話者検出システムは、画像データに基づいて、人物を検出する。すなわち、発話者検出システムは、画像データに対して、顔認証等の処理を行うと、参加者を検出することができる。なお、人物の検出方法は、顔認証に限られず、他の認識処理等で実現してもよい。 <Example of detecting a person> (step S2S)
The speaker detection system detects persons based on the image data. That is, by applying processing such as face recognition to the image data, the speaker detection system can detect the participants. The method for detecting a person is not limited to face recognition and may be realized by other recognition processing or the like.

また、発話者検出システムは、ステップＳ２Ｅにより、それぞれのフレームに対して、繰り返し人物を検出する処理を行う。以下、ステップＳ３乃至ステップＳ６は、ステップＳ２Ｓで検出される人物ごとに繰り返し行われる。 Further, through step S2E, the speaker detection system repeats the person detection processing for each frame. Steps S3 to S6 below are repeated for each person detected in step S2S.

＜人物の動きの検出例＞（ステップＳ３）
発話者検出システムは、人物の動きを検出する。例えば、発話者検出システムは、オプティカルフロー等の処理で人物の動きを検出する。なお、動きの検出方法は、他の認識処理等で実現してもよい。 <Example of detecting the movement of a person> (step S3)
The speaker detection system detects the movement of a person. For example, the speaker detection system detects the movement of a person by processing such as optical flow. The motion detection method may be realized by other recognition processing or the like.
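As a rough illustration of step S3, the sketch below scores how much a person moved between two frames. It deliberately uses simple frame differencing rather than optical flow, which the description names as one option, so that the example stays self-contained. The frame representation (nested lists of grayscale values), the bounding-box format, and the function name `motion_score` are assumptions made for this sketch.

```python
def motion_score(prev_frame, next_frame, box):
    """Mean absolute grayscale change inside a person's bounding box.

    prev_frame, next_frame -- 2D lists of grayscale pixel values (assumed format)
    box -- (top, left, bottom, right), with bottom and right exclusive
    Frame differencing is a simpler stand-in for optical flow here.
    """
    top, left, bottom, right = box
    total = 0
    count = 0
    for row in range(top, bottom):
        for col in range(left, right):
            total += abs(next_frame[row][col] - prev_frame[row][col])
            count += 1
    # An empty box yields no evidence of motion.
    return total / count if count else 0.0
```

A score computed per detected person in each pair of consecutive frames could then be accumulated and compared across participants in the later steps.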

＜視線の検出例＞（ステップＳ４）
発話者検出システムは、視線を検出する。例えば、発話者検出システムは、参加者の目を示す画像等から視線を検出する。なお、視線の検出方法は、他の認識処理等で実現してもよい。このように、発話者検出システムは、視線を検出して、参加者のうち、どの人物に最も視線が集まっているかを検出するのが望ましい。すなわち、それぞれの参加者の視線を検出することで、発話者検出システムは、視線の先となる回数が最も多い人物を特定する処理を行うのが望ましい。 <Example of line-of-sight detection> (step S4)
The speaker detection system detects the line of sight. For example, the speaker detection system detects the line of sight from an image or the like showing the participants' eyes. The line-of-sight detection method may be realized by other recognition processing or the like. In this way, the speaker detection system desirably detects lines of sight to determine which participant the most gazes are directed at. That is, by detecting the line of sight of each participant, the speaker detection system desirably identifies the person who is most often the target of the other participants' gaze.
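The counting described here can be sketched as follows. The sketch assumes that an upstream gaze estimator, not shown, has already produced (observer, target) pairs across one or more frames, with a target of `None` when no target could be resolved. The function name `most_looked_at` and the rule of returning `None` on a tie are assumptions of this sketch.

```python
from collections import Counter


def most_looked_at(gaze_pairs):
    """Return the participant who is most often the target of others' gaze.

    gaze_pairs -- iterable of (observer, target) pairs; target may be None
    Returns None when nobody is looked at or when the top count is tied.
    """
    counts = Counter(
        target
        for observer, target in gaze_pairs
        if target is not None and target != observer
    )
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no single participant stands out
    return ranked[0][0]
```

With the participants of FIG. 1, the pairs `[("MA", "MB"), ("MB", "MC"), ("MC", "MB")]` would select the second participant MB.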

＜***の検出例＞（ステップＳ５）
発話者検出システムは、***を検出する。例えば、発話者検出システムは、顔認証等の処理を行うと、顔の主な部位である、目、鼻、***及び耳等を検出できる。 <Example of lip detection> (step S5)
The speaker detection system detects the lips. For example, the speaker detection system can detect the eyes, nose, lips, ears, etc., which are the main parts of the face, by performing a process such as face recognition.

＜***の動作の検出例＞（ステップＳ６）
発話者検出システムは、***の動作を検出する。例えば、発話者検出システムは、ステップＳ５で検出する***をフレーム間で追跡していくと、***の動作を検出できる。 <Example of detecting lip movement> (step S6)
The speaker detection system detects the movement of the lips. For example, the speaker detection system can detect the movement of the lips by tracking the lips detected in step S5 between frames.
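One way to picture the frame-to-frame tracking in step S6 is to track a single number per frame, such as the gap between the upper and lower lip, and to check how much it varies. The sketch below assumes a hypothetical landmark detector that supplies this gap per frame, or `None` when the lips are not visible. The threshold value and the name `lips_moving` are likewise assumptions.

```python
def lips_moving(openings, threshold=2.0):
    """Judge lip movement from per-frame mouth openings.

    openings  -- mouth opening per frame (e.g. in pixels), or None when the
                 lips were not detected in that frame (assumed input)
    threshold -- minimum variation treated as movement (assumed value)
    Returns True or False when a judgment is possible, or None when the
    lips were visible in fewer than two frames (corresponding to NO in S7).
    """
    visible = [value for value in openings if value is not None]
    if len(visible) < 2:
        return None
    return max(visible) - min(visible) > threshold
```

Returning `None` separately from `False` keeps "lips not visible" distinct from "lips visible but still", which is the distinction the later steps rely on.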

＜***の動作で発話者が検出できたか否かの判断例＞（ステップＳ７）
発話者検出システムは、***の動作で発話者が検出できたか否かを判断する。具体的には、ステップＳ５及びステップＳ６で***が検出できない場合等には、発話者検出システムは、***の動作で発話者が検出できないと判断する（ステップＳ７でＮＯ）。 <Example of determining whether or not the speaker could be detected by the movement of the lips> (step S7)
The speaker detection system determines whether or not the speaker can be detected by the movement of the lips. Specifically, when the lips cannot be detected in steps S5 and S6, the speaker detection system determines that the speaker cannot be detected by the movement of the lips (NO in step S7).

次に、***の動作で発話者が検出できないと判断すると（ステップＳ７でＮＯ）、発話者検出システムは、ステップＳ９に進む。一方で、***の動作で発話者が検出できると判断すると（ステップＳ７でＹＥＳ）、発話者検出システムは、ステップＳ８に進む。 Next, if it is determined that the speaker cannot be detected by the movement of the lips (NO in step S7), the speaker detection system proceeds to step S9. On the other hand, if it is determined that the speaker can be detected by the movement of the lips (YES in step S7), the speaker detection system proceeds to step S8.

＜***の動作が検出できた人物を発話者と検出する例＞（ステップＳ８）
発話者検出システムは、***の動作が検出できた人物を発話者と検出する。すなわち、発話者検出システムは、***が動いている人物を発話者と検出する。 <Example of detecting a person whose lip movement can be detected as a speaker> (step S8)
The speaker detection system detects, as the speaker, the person whose lip movement was detected. That is, the speaker detection system detects a person whose lips are moving as the speaker.

＜***が遮蔽されている外観の第１人物を検出できたか否かの判断例＞（ステップＳ９）
発話者検出システムは、***が遮蔽されている外観の人物（以下「第１人物」という場合がある。）を検出できたか否かを判断する。具体的には、以下のような人物が検出される。 <Example of determining whether or not the first person with an appearance in which the lips are shielded can be detected> (Step S9)
The speaker detection system determines whether or not a person having an appearance whose lips are shielded (hereinafter, may be referred to as a "first person") can be detected. Specifically, the following persons are detected.

図４は、第１人物の検出例を示す図である。以下、図示するような画像データＩＭＧが生成された例で説明する。すなわち、この例は、図１に示すように、３人の参加者ＭＥＭである、第１参加者ＭＡ、第２参加者ＭＢ及び第３参加者ＭＣがいる会議を撮像した例である。また、図示するように、参加者ＭＥＭのうち、第２参加者ＭＢは、マスクＭＳＫを装着しているとする。 FIG. 4 is a diagram showing a detection example of the first person. The following description uses an example in which image data IMG as illustrated has been generated. That is, as shown in FIG. 1, this example captures a conference attended by three participants MEM: the first participant MA, the second participant MB, and the third participant MC. Further, as illustrated, it is assumed that, among the participants MEM, the second participant MB is wearing a mask MSK.

第２参加者ＭＢのように、マスクＭＳＫ等によって、***が遮蔽されている外観の人物は、第１人物と検出される。一方で、第１参加者ＭＡ及び第３参加者ＭＣは、***がステップＳ５で検出される人物の例である。 A person whose lips are shielded by a mask MSK or the like, such as the second participant MB, is detected as the first person. On the other hand, the first participant MA and the third participant MC are examples of persons whose lips are detected in step S5.

なお、第１人物と検出する場合は、マスクＭＳＫの装着に限られない。例えば、発話者検出システムは、マスクＭＳＫ以外の物体で***が遮蔽されている人物を第１人物と検出してもよい。具体的には、第１人物は、手で鼻等を触る動きで***が遮蔽されていてもよい。ほかにも、第１人物は、撮像装置との位置関係等により、***が検出できない方向を向いている人物又は逆光等で画像では顔が分かりにくい人物等でもよい。 The case of detecting the first person is not limited to wearing the mask MSK. For example, the speaker detection system may detect a person whose lips are shielded by an object other than the mask MSK as the first person. Specifically, the first person may have his / her lips shielded by the movement of touching the nose or the like with his / her hand. In addition, the first person may be a person whose lips are facing in a direction that cannot be detected due to the positional relationship with the imaging device, or a person whose face is difficult to see in the image due to backlight or the like.

第１参加者ＭＡ及び第３参加者ＭＣが発話者である場合は、***の動作が検出できるため、発話者検出システムは、***の動作で発話者が検出できる（ステップＳ７でＹＥＳ）。 When the first participant MA and the third participant MC are speakers, the movement of the lips can be detected, so that the speaker detection system can detect the speaker by the movement of the lips (YES in step S7).

一方で、第２参加者ＭＢが発話者である場合は、発話者検出システムは、***の動作では、発話者が検出できない（ステップＳ７でＮＯ）。 On the other hand, when the second participant MB is the speaker, the speaker detection system cannot detect the speaker by the movement of the lips (NO in step S7).

そこで、第２参加者ＭＢのような第１人物が検出でき、かつ、第１人物以外の人物が発話者でないと検出されると、発話者検出システムは、第２参加者ＭＢを発話者と推定する。なお、このような方法は、第１人物と検出する人物が１人である場合に用いられるのが望ましい。 Therefore, when a first person such as the second participant MB is detected and every person other than the first person is detected as not being the speaker, the speaker detection system estimates that the second participant MB is the speaker. This method is desirably used when only one person is detected as the first person.

次に、第１人物を検出できたと判断すると（ステップＳ９でＹＥＳ）、発話者検出システムは、ステップＳ１０に進む。一方で、第１人物を検出できないと判断すると（ステップＳ９でＮＯ）、発話者検出システムは、ステップＳ１１に進む。 Next, if it is determined that the first person can be detected (YES in step S9), the speaker detection system proceeds to step S10. On the other hand, if it is determined that the first person cannot be detected (NO in step S9), the speaker detection system proceeds to step S11.

＜第１人物を発話者と検出する例＞（ステップＳ１０）
発話者検出システムは、第１人物を発話者と検出する。すなわち、発話者検出システムは、***が遮蔽されている人物を発話者と検出する。 <Example of detecting the first person as the speaker> (step S10)
The speaker detection system detects the first person as the speaker. That is, the speaker detection system detects a person whose lips are shielded as a speaker.

＜参加者のうち、最も動きのある第２人物を検出できたか否かの判断例＞（ステップＳ１１）
発話者検出システムは、参加者のうち、最も動きのある人物（以下「第２人物」という。）を検出できたか否かの判断する。すなわち、発話者検出システムは、参加者の中で最も動きのある人物を発話者と検出する。 <Example of determining whether or not the second person with the most movement among the participants could be detected> (step S11)
The speaker detection system determines whether or not the most moving person (hereinafter referred to as "second person") among the participants can be detected. That is, the speaker detection system detects the most moving person among the participants as the speaker.

例えば、第１人物と検出される人物が以下のように複数検出された場合等に、ステップＳ９が行われるのが望ましい。 For example, it is desirable that step S9 is performed when a plurality of persons detected as the first person are detected as follows.

図５は、第２人物の検出例を示す図である。この例では、まず、第２参加者ＭＢは、図４と同様に、マスクの装着によって、***が遮蔽されているため、第１人物と判断される。 FIG. 5 is a diagram showing a detection example of the second person. In this example, first, the second participant MB is determined to be the first person because the lips are shielded by wearing the mask as in FIG. 4.

そして、この例では、第１参加者ＭＡは、電子黒板２がある方向を見て発話しているとする。すなわち、カメラ１に対して、第１参加者ＭＡは、背を向けた姿勢等である。 In this example, the first participant MA is assumed to be speaking while looking toward the electronic blackboard 2. That is, the first participant MA has his or her back turned to the camera 1.

そのため、画像データは、第１参加者ＭＡの***が遮蔽されているのと同様に、第１参加者ＭＡの***が写っていない状態である。したがって、第１参加者ＭＡ及び第２参加者ＭＢがどちらも第１人物となり、発話者が１人に特定できない場合である。 Therefore, the image data does not show the lips of the first participant MA, just as if they were shielded. Consequently, both the first participant MA and the second participant MB are first persons, and the speaker cannot be narrowed down to one person.

このような場合等には、発話者検出システムは、参加者のうち、最も動きのある人物を検出する。この例では、第１参加者ＭＡが第２人物と検出される例である。 In such a case, the speaker detection system detects the most moving person among the participants. In this example, the first participant MA is detected as the second person.

発話者は、話を聞いている者より、ジェスチャが多い可能性が高い。すなわち、発話者は、身振り手振りを行いながら発話する場合が多い。したがって、発話者検出システムは、参加者のうち、最も動きのある人物を第２人物と検出するのが望ましい。 Speakers are more likely to have more gestures than those who are listening. That is, the speaker often speaks while gesturing. Therefore, it is desirable for the speaker detection system to detect the most moving person among the participants as the second person.

なお、動きの検出を行うのにおいて、対象となる部位は、手Ｈの部位であるのが望ましい。会議等では、参加者の上半身が撮像される対象となる場合が多い。また、発言中は、手Ｈの部位が最もよく動く部位となる場合が多い。そこで、手Ｈを対象とすると、発話者検出システムは、発話者を精度よく検出できる。なお、手Ｈの部位には、腕等が含まれてもよい。 In addition, in detecting the movement, it is desirable that the target portion is the portion of the hand H. At meetings and the like, the upper body of the participants is often photographed. Also, during speech, the part of the hand H is often the part that moves the most. Therefore, when the hand H is targeted, the speaker detection system can accurately detect the speaker. The portion of the hand H may include an arm or the like.
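Selecting the second person from hand-region motion can be sketched as below. The sketch assumes that a per-participant motion score for the hand region has already been computed. It also assumes, as a design choice not spelled out in the description, that the top score must exceed the runner-up by a ratio before anyone is selected, so that near-equal movement counts as "second person not detected". The names `pick_second_person` and `min_ratio` are hypothetical.

```python
def pick_second_person(hand_motion, min_ratio=1.5):
    """Return the participant whose hands move clearly more than the others.

    hand_motion -- dict mapping participant name to a hand-region motion score
    min_ratio   -- how many times larger the top score must be than the
                   runner-up (assumed threshold)
    Returns None when nobody moves or when the top two scores are too close.
    """
    if not hand_motion:
        return None
    ranked = sorted(hand_motion.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0]
    if best_score <= 0:
        return None
    if len(ranked) > 1 and best_score < min_ratio * ranked[1][1]:
        return None  # movements differ too little to single anyone out
    return best_name
```

In the situation of FIG. 5, scores such as `{"MA": 9.0, "MB": 2.0, "MC": 1.0}` would select the first participant MA.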

＜第２人物を発話者と検出する例＞（ステップＳ１２）
発話者検出システムは、第２人物を発話者と検出する。すなわち、発話者検出システムは、最も動きのある人物を発話者と検出する。 <Example of detecting the second person as the speaker> (step S12)
The speaker detection system detects the second person as the speaker. That is, the speaker detection system detects the most moving person as the speaker.

＜参加者の視線が最も集まる第３人物を発話者と検出する例＞（ステップＳ１３）
発話者検出システムは、参加者のうち、参加者の視線が最も集まる人物（以下「第３人物」という。）を検出して発話者と検出する。例えば、以下のように、第１人物と検出される人物が複数検出され、かつ、参加者の動きにあまり差がなく、第２人物が特定しにくい場合等に、ステップＳ１３が行われるのが望ましい。 <Example of detecting the third person who attracts the most eyes of the participants as the speaker> (step S13)
The speaker detection system detects, among the participants, the person on whom the participants' gazes are most concentrated (hereinafter referred to as the "third person") and detects that person as the speaker. For example, step S13 is desirably performed when, as described below, a plurality of persons are detected as first persons and the participants' movements differ so little that the second person is difficult to identify.

図６は、第３人物の検出例を示す図である。この例では、まず、第１参加者ＭＡ、第２参加者ＭＢ及び第３参加者ＭＣのいずれもが、図４と同様に、マスクの装着によって、***が遮蔽されているため、第１人物と判断される。 FIG. 6 is a diagram showing a detection example of the third person. In this example, the first participant MA, the second participant MB, and the third participant MC are all wearing masks that shield their lips, as in FIG. 4, so each of them is determined to be a first person.

さらに、図示する例では、第１参加者ＭＡ、第２参加者ＭＢ及び第３参加者ＭＣのいずれもが、あまり動かない場合であるとする。したがって、参加者の動きにあまり差がなく、第２人物が特定されない状態である。 Further, in the illustrated example, it is assumed that none of the first participant MA, the second participant MB, and the third participant MC moves very much. Therefore, there is not much difference in the movements of the participants, and the second person is not specified.

発話者は、話を聞いている者が視線を向けることが多いため、発話者に視線が最も集まる場合が多い。そこで、発話者検出システムは、参加者のうち、参加者の視線が最も集まる人物を特定し、第３人物と検出する。そして、発話者検出システムは、第３人物を発話者と検出する。 Listeners tend to look at the speaker, so the speaker is often the person on whom the most gazes are concentrated. The speaker detection system therefore identifies the participant who attracts the most gazes and detects that participant as the third person. The speaker detection system then detects the third person as the speaker.

このようにすると、図示するように、第１人物が複数検出され、かつ、第２人物が検出できない場合等であっても、発話者検出システムは、発話者を検出できる。 In this way, as shown in the figure, the speaker detection system can detect the speaker even when a plurality of first persons are detected and the second person cannot be detected.

なお、ステップＳ９、ステップＳ１１及びステップＳ１３は、図示するような順序でなくともよい。具体的には、第２人物の検出処理と第３人物の検出処理は、順序が逆でもよい。 Note that steps S9, S11 and S13 do not have to be in the order shown in the figure. Specifically, the order of the second person detection process and the third person detection process may be reversed.
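The decision flow of steps S7 to S13 can be summarized in a minimal sketch. It assumes that the per-person lip results and the second-person and third-person candidates have already been computed by earlier steps. The class `Person`, the function `detect_speaker`, and the returned step labels are names chosen for this sketch. The rule of accepting a first person only when exactly one is found follows the stated preference for that case.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    lips_visible: bool         # False corresponds to a "first person"
    lips_moving: bool = False  # meaningful only when lips_visible is True


def detect_speaker(people, second_person=None, third_person=None):
    """Return (speaker_name, step) following the order shown in FIG. 3.

    second_person -- name of the most-moving participant, or None
    third_person  -- name of the participant attracting the most gaze, or None
    """
    # S7/S8: lip movement is the first detection.
    for person in people:
        if person.lips_visible and person.lips_moving:
            return person.name, "S8"
    # S9/S10: a single participant with shielded lips is taken as the speaker.
    occluded = [person for person in people if not person.lips_visible]
    if len(occluded) == 1:
        return occluded[0].name, "S10"
    # S11/S12: otherwise fall back to the most-moving participant.
    if second_person is not None:
        return second_person, "S12"
    # S13: finally fall back to the participant attracting the most gaze.
    return third_person, "S13"
```

Applied to FIG. 4 (only MB masked), this returns MB at S10. Applied to FIG. 5 (MA turned away, MB masked, MA gesturing), it returns MA at S12. Applied to FIG. 6 (everyone masked and nearly still), it returns whichever participant the gaze step selected, at S13. Swapping the last two branches gives the reversed order that the description also permits.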

＜機能構成例＞
図７は、第１実施形態における発話者検出システムの機能構成例を示す機能ブロック図である。例えば、図示するように、発話者検出システム１０は、撮像部１０Ｆ１、第１検出部１０Ｆ２及び第２検出部１０Ｆ３を備える機能構成である。 <Function configuration example>
FIG. 7 is a functional block diagram showing a functional configuration example of the speaker detection system according to the first embodiment. For example, as shown in the figure, the speaker detection system 10 has a functional configuration including an imaging unit 10F1, a first detection unit 10F2, and a second detection unit 10F3.

撮像部１０Ｆ１は、発話者を含む複数の参加者を撮像して画像データＩＭＧを生成する撮像手順を行う。例えば、撮像部１０Ｆ１は、カメラ１等によって実現する。 The image capturing unit 10F1 performs an imaging procedure of capturing a plurality of participants including the speaker to generate image data IMG. For example, the imaging unit 10F1 is realized by a camera 1 or the like.

第１検出部１０Ｆ２は、画像データＩＭＧに基づいて、***の動作を検出して発話者を検出する第１検出手順を行う。例えば、第１検出部１０Ｆ２は、ＣＰＵ２０１等で実現する。 The first detection unit 10F2 performs the first detection procedure of detecting the movement of the lips and detecting the speaker based on the image data IMG. For example, the first detection unit 10F2 is realized by the CPU 201 or the like.

第２検出部１０Ｆ３は、第１検出部１０Ｆ２で発話者が検出できない場合に、画像データＩＭＧに基づいて、参加者の動き又は外観を検出して発話者を検出する第２検出手順を行う。例えば、第２検出部１０Ｆ３は、ＣＰＵ２０１等で実現する。 The second detection unit 10F3 performs a second detection procedure of detecting the speaker by detecting the movement or appearance of the participant based on the image data IMG when the speaker cannot be detected by the first detection unit 10F2. For example, the second detection unit 10F3 is realized by the CPU 201 or the like.

***の動作を検出して発話者を検出する方法であると、例えば、人物がマスクを装着していたり、***部分を手で隠す癖等があったりすると、***部分が遮蔽され、画像データに***部分が写らない場合等がある。このような場合には、***の動作に基づいて、発話者を検出するのが難しい場合が多い。一方で、本実施形態のような構成であると、発話者検出システム１０は、***の動作を用いる第１検出では難しい場合でも、参加者の動き又は外観に基づく第２検出によって発話者を検出できる。 With a method that detects the speaker by detecting lip movement, the lip portion may be shielded and may not appear in the image data when, for example, a person wears a mask or has a habit of covering the lips with a hand. In such cases, it is often difficult to detect the speaker based on lip movement. With the configuration of the present embodiment, on the other hand, the speaker detection system 10 can detect the speaker by the second detection, which is based on the movement or appearance of the participants, even when the first detection using lip movement is difficult.

また、このように、画像データを用いる構成は、例えば、マイクアレイ等で発話者を検出する方法等と比較すると、精度よく発話者を検出できる。具体的には、マイクアレイ等の方法では、発話者がいると推定する角度が検出されるに過ぎず、同じ角度に複数の参加者がいるような場合には、区別ができない場合が多い。一方で、本実施形態のような構成であると、発話者検出システム１０は、発話者を検出できる。 Further, as described above, the configuration using the image data can accurately detect the speaker as compared with, for example, a method of detecting the speaker with a microphone array or the like. Specifically, a method such as a microphone array only detects an angle at which a speaker is presumed to be present, and when there are a plurality of participants at the same angle, it is often impossible to distinguish between them. On the other hand, with the configuration as in this embodiment, the speaker detection system 10 can detect the speaker.

<Second Embodiment>
The speaker detection system may perform processing that uses the detection result, for example, as follows.

FIG. 8 is a block diagram showing an example of voice recognition processing or the like that uses a plurality of recognition models. The second embodiment differs from the first embodiment in that processing is performed in a subsequent stage using the speaker detection result. Specifically, the speaker detection system 10 performs, for example, voice recognition processing in the subsequent stage. Accordingly, compared with the first embodiment, the speaker detection system 10 of the second embodiment has a functional configuration that further includes a voice input unit 10F21, a switching unit 10F22, a voice recognition unit 10F23, and the like. The following description focuses on the differences from the first embodiment and omits duplicate description.

The voice input unit 10F21 performs a voice input procedure of receiving voice and generating voice data. For example, the voice input unit 10F21 is realized by the microphone 240 or the like.

The switching unit 10F22 performs a switching procedure of switching the recognition model so that it matches the characteristics of the speaker. For example, the switching unit 10F22 is realized by the CPU 201 or the like.

The voice recognition unit 10F23 performs a voice recognition procedure of recognizing voice based on the recognition model. For example, the voice recognition unit 10F23 is realized by the CPU 201 or the like.

Through the voice recognition processing, the speaker detection system 10 converts, for example, voice into text data TX. For voice recognition, it is desirable to apply the recognition models ML1, ML2, and ML3, which are generated for each category, in accordance with the characteristics of the speaker, that is, the age or gender of the speaker. By switching the recognition model according to the characteristics of the speaker in this way, the speaker detection system 10 can improve the recognition rate of the voice recognition.

The recognition models ML1, ML2, and ML3 are, for example, data prepared in advance. The recognition models ML1, ML2, and ML3 may be of various types, such as acoustic models or language models. Further, the recognition models ML1, ML2, and ML3 may be prepared individually. That is, a recognition model, such as the recognition models ML1, ML2, and ML3, may be prepared for each participant based on past utterances or the like.
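The switching of the recognition model can be sketched as a simple table lookup. This is an illustration under stated assumptions: the assignment of ML1, ML2, and ML3 to particular age and gender categories, the default model, and the `recognize` placeholder are all hypothetical, since the description does not specify them.

```python
from typing import Dict, Tuple

# Hypothetical category -> recognition model table. The description names the
# models ML1, ML2 and ML3 but does not say which category each one covers.
MODEL_TABLE: Dict[Tuple[str, str], str] = {
    ("adult", "male"): "ML1",
    ("adult", "female"): "ML2",
    ("child", "male"): "ML3",
    ("child", "female"): "ML3",
}
DEFAULT_MODEL = "ML1"  # assumed fallback when no category matches


class SwitchingUnit:
    """Keeps track of the recognition model that is currently active."""

    def __init__(self) -> None:
        self.active_model = DEFAULT_MODEL

    def switch(self, age_group: str, gender: str) -> str:
        """Select the model for the detected speaker's characteristics."""
        self.active_model = MODEL_TABLE.get((age_group, gender), DEFAULT_MODEL)
        return self.active_model


def recognize(voice_data: bytes, model: str) -> str:
    """Placeholder for the voice recognition unit; a real recognizer would go here."""
    return f"[{model}] <{len(voice_data)} bytes of voice data>"


unit = SwitchingUnit()
model = unit.switch("adult", "female")
print(recognize(b"\x00" * 4, model))  # [ML2] <4 bytes of voice data>
```

When a recognition model is prepared for each participant instead, the same lookup could be keyed by a participant identifier rather than by category.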

The processing performed in the subsequent stage is not limited to voice recognition. For example, the speaker detection system 10 may perform processing that tags the input voice for each participant.
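The tagging of input voice for each participant can be sketched as follows. The time-interval representation and the rule of tagging each voice segment with the speaker whose detection interval overlaps it the longest are assumptions made for illustration; the description does not specify how the tagging is carried out.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]          # (start, end) of a voice segment, in seconds
Detection = Tuple[float, float, str]   # (start, end, detected speaker)


def tag_segments(segments: List[Segment],
                 detections: List[Detection]) -> List[Tuple[float, float, Optional[str]]]:
    """Tag each voice segment with the speaker who overlaps it the longest."""
    tagged = []
    for seg_start, seg_end in segments:
        best_name, best_overlap = None, 0.0
        for det_start, det_end, name in detections:
            overlap = min(seg_end, det_end) - max(seg_start, det_start)
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        # A segment with no overlapping detection stays untagged (None).
        tagged.append((seg_start, seg_end, best_name))
    return tagged


detections = [(0.0, 2.5, "MA"), (2.5, 5.0, "MB")]
segments = [(0.0, 2.0), (2.0, 5.0), (6.0, 7.0)]
for start, end, name in tag_segments(segments, detections):
    print(start, end, name)
```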

<Third Embodiment>
The overall processing may be as follows.

FIG. 9 is a flowchart showing an example of the overall processing in the third embodiment. The third embodiment differs from the first embodiment in that step S31 is added. The following description focuses on the differences.

<Example of determining whether a plurality of speakers have been detected> (Step S31)
The speaker detection system determines whether a plurality of speakers have been detected. For example, when detection is based on the movement of the lips, the speaker detection system may detect a person as a speaker if that person moves the lips through an action such as chewing, even though the person is not speaking. Therefore, when the detection result indicates a plurality of speakers, it is desirable to narrow the result down to one person by performing processing such as step S11 and step S13 on the plurality of persons detected as speakers.

That is, from among the plurality of speaker candidates, the speaker detection system detects, as the speaker, the person who moves the most, the person on whom the most lines of sight are focused, or the like.

With such a configuration, the speaker detection system can accurately detect the speaker even when a person performs an action such as chewing.
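The narrowing performed after step S31 can be sketched as follows. The inputs are hypothetical: `motion` holds a movement score per person, and `gaze_target` maps each participant to the person he or she is looking at. Ranking by the number of lines of sight first and by movement second is an assumption; the description only states that either criterion may be used.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def narrow_to_one(candidates: Sequence[str],
                  motion: Dict[str, float],
                  gaze_target: Dict[str, str]) -> Optional[str]:
    """Reduce the persons detected as speakers to a single speaker."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Count how many other participants look at each candidate.
    looks = Counter(target for observer, target in gaze_target.items()
                    if target in candidates and observer != target)
    # Assumed ranking: most lines of sight first, then most movement.
    return max(candidates, key=lambda name: (looks[name], motion.get(name, 0.0)))


# MA is speaking; MB moves the lips only because of chewing.
candidates = ["MA", "MB"]
motion = {"MA": 0.3, "MB": 0.6}
gaze_target = {"MB": "MA", "MC": "MA", "MA": "MC"}
print(narrow_to_one(candidates, motion, gaze_target))  # MA
```

In this example both MA and MB are detected from lip movement, and the lines of sight of MB and MC identify MA as the single speaker.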

<Other Embodiments>
Each of the imaging device and the information processing device may be a plurality of devices. That is, in an embodiment according to the present invention, the speaker detection system may execute each process and the storage of data in a redundant, distributed, parallel, or virtualized manner, or in a combination of these.

Further, the imaging device and the information processing device are not limited to the above examples. For example, the imaging device and the information processing device may further have an arithmetic unit, a control device, or a storage device, either externally or internally. Conversely, the imaging device and the information processing device may have a hardware configuration with fewer components than the above examples.

All or part of each process according to the present invention may be realized by a program that is written in a low-level language or a high-level language and that causes a computer to execute the speaker detection method. That is, the program is a computer program for causing a computer, such as the speaker detection system, to execute each process.

Therefore, when the speaker detection method is executed based on the program, the arithmetic unit and the control device of the computer perform calculation and control based on the program in order to execute each process. In addition, the storage device of the computer stores, based on the program, the data used in the processing in order to execute each process.

The program can be recorded on a computer-readable recording medium and distributed. The recording medium is a medium such as a magnetic tape, a flash memory, an optical disk, a magneto-optical disk, or a magnetic disk. The program can also be distributed through a telecommunication line.

AI (Artificial Intelligence) or the like may be used for each process. That is, the speaker detection system performs machine learning or the like using past data as learning data. For example, based on the learning result, the speaker detection system may perform processing such as inferring the detection of the speaker or the recognition of voice.

Although examples of the embodiments have been described above, the present invention is not limited to the above embodiments. That is, various modifications and improvements are possible within the scope of the present invention.

1 Camera
2 Electronic blackboard
10 Speaker detection system
10F1 Imaging unit
10F2 First detection unit
10F21 Voice input unit
10F22 Switching unit
10F23 Voice recognition unit
10F3 Second detection unit
205 Network I/F
210 Bus line
211 Capture device
213 Display controller
214 Contact sensor
215 Sensor controller
216 Electronic pen controller
219 Short-range communication circuit
219a Antenna
222 Power switch
223 Selection switches
230 USB memory
240 Microphone
250 Speaker
280 Display
290 Electronic pen
H Hand
IMG Image data
MA First participant
MB Second participant
MC Third participant
MEM Participant
ML1 Recognition model
ML2 Recognition model
ML3 Recognition model
MSK Mask
TX Text data

Japanese Unexamined Patent Publication No. 2004-118314

Claims

A speaker detection system for detecting a speaker, comprising:
an imaging unit that captures images of a plurality of participants including the speaker and generates image data;
a first detection unit that detects the speaker by detecting the movement of the lips based on the image data; and
a second detection unit that, when the first detection unit cannot detect the speaker, detects the speaker by detecting the movement or appearance of the participants based on the image data.

The speaker detection system according to claim 1, wherein
the second detection unit detects, among the participants, a first person whose appearance is such that the lips are shielded, and, when every person other than the first person is detected as not being the speaker, detects the first person as the speaker.

The speaker detection system according to claim 1 or 2, wherein
upon detecting a second person who moves the most among the participants, the second detection unit detects the second person as the speaker.

The speaker detection system according to claim 3, wherein
the second detection unit detects, as the second person, the person whose hand moves the most among the participants.

The speaker detection system according to any one of claims 1 to 4, wherein
the second detection unit detects, as the speaker, a third person on whom the lines of sight of the participants are most focused.

The speaker detection system according to any one of claims 1 to 5, further comprising:
a switching unit that, when the speaker is detected, switches to a recognition model that matches the characteristics of the speaker; and
a voice recognition unit that performs voice recognition based on the recognition model.

A speaker detection method performed by a speaker detection system that detects a speaker, the method comprising:
an imaging procedure in which the speaker detection system captures images of a plurality of participants including the speaker and generates image data;
a first detection procedure in which the speaker detection system detects the speaker by detecting the movement of the lips based on the image data; and
a second detection procedure in which, when the speaker cannot be detected in the first detection procedure, the speaker detection system detects the speaker by detecting the movement or appearance of the participants based on the image data.

A program for causing a computer to execute the speaker detection method according to claim 7.