JP6103765B2

JP6103765B2 - Action recognition device, method and program, and recognizer construction device

Info

Publication number: JP6103765B2
Application number: JP2013136461A
Authority: JP
Inventors: 山田　健太郎; 松尾　賢治; 内藤　整
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2013-06-28
Filing date: 2013-06-28
Publication date: 2017-03-29
Anticipated expiration: 2033-06-28
Also published as: JP2015011526A

Description

The present invention relates to recognizing a person's actions from video, and in particular to an action recognition device, method, and program that recognize a person's actions from first-person-viewpoint video captured by a camera worn by that person, and to a recognizer construction device that constructs the recognizer that makes such action recognition possible.

Non-Patent Document 1 proposes a method for recognizing a person's actions from video. In particular, the user wears a small camera that captures video from a first-person viewpoint, and the aim is to estimate about 18 kinds of daily-life activities such as eating, cooking, and watching television.

Specifically, the following two-step procedure is proposed: (1) recognize the individual objects contained in the image, and (2) estimate the user's action from the object likelihood information of those objects. For example, as steps (1) and (2) are shown conceptually in FIG. 6, if the objects recognized with high confidence in the field-of-view video are a television, a table, and a remote control, the user action "watching television" is estimated by applying a classifier trained in advance to the object likelihood information, which represents the object likelihood of each object as a histogram.

In step (2), the general relationship between object likelihood information and user actions, which serves as the criterion for the estimate, is learned in advance from a large number of training samples (pairs of object likelihood information and a keyword describing the user action) using a machine learning method such as a support vector machine (SVM).

The recognition of objects in the image and the acquisition of object likelihood information in step (1) can be implemented on a computer by applying the general object recognition method described in Non-Patent Document 2.

Non-Patent Document 1 also proposes that, when objects are recognized in step (1), each object be recognized as either an active object or a passive object, and that this distinction also be used when estimating the user's action from the object likelihood information. Non-Patent Document 1 reports that treating active and passive objects separately greatly improves the accuracy of user action estimation. According to Non-Patent Document 1, an active object is one that is being directly manipulated by a human hand, for example a mug held in the hand or a refrigerator opened by hand. A passive object is any object that is not an active object.

Even the same object should be recognized as an active object in some scenes and as a passive object in others. For example, a mug is recognized as an "active mug" in a scene where it is held in the hand, and as a "passive mug" in a scene where it sits on a table. Because active objects manipulated by hand are important cues for estimating human actions, distinguishing active from passive objects in user action estimation improves action recognition accuracy.

[Non-Patent Document 1] H. Pirsiavash and D. Ramanan, "Detecting activities of daily living in first-person camera views," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012.
[Non-Patent Document 2] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, "Object Detection with Discriminatively Trained Part Based Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, pp. 1627-1645, Sep. 2010.

As described above, the conventional method (Non-Patent Document 1) proposes improving action recognition accuracy by treating active and passive objects separately. However, it has the following problems.

First, in first-person-viewpoint video, an object being manipulated by hand (that is, in the active state) is not always in the frame. For example, during the action "watching television" there are scenes in which the remote control is operated by hand, but there are also many scenes in which the hand touches no object or does not appear in the first-person video at all. Consequently, many scenes gain no accuracy benefit from the active/passive distinction.

Second, the active/passive distinction can be applied only to objects that are directly manipulated by hand, and these are few relative to the full set of object types. According to the dataset published by the authors of Non-Patent Document 1, of the 21 object types recognized, only 5 could be treated with the active/passive distinction.

For example, a television is operated through a remote control but never directly by hand, so it is always a passive object. To improve accuracy further, a distinction other than active/passive is needed that increases the number of object types whose importance can be distinguished. More precisely, it is desirable to be able to distinguish, frame by frame and for every object type, whether an object is important.

In view of these problems with the prior art, an object of the present invention is to provide an action recognition device, method, and program that avoid the constraints of the active/passive approach by using a different technique, distinguish whether an object is important regardless of the scene or the object type, and thereby recognize actions with high accuracy. A corresponding object of the present invention is to provide a recognizer construction device that constructs the recognizer that makes this action recognition possible.

To achieve the above object, the present invention is an action recognition device that recognizes, from first-person-viewpoint video, the actions of the first person, comprising: an object recognition unit that recognizes, from each frame of the video, the objects in the frame together with their likelihoods; an attention estimation unit that estimates, from each frame of the video, an attention map as the distribution of the first person's degree of attention within the frame; an attended-object determination unit that determines whether each recognized object is an attended object by referring to the estimated attention map within the region occupied by that object; an object likelihood information generation unit that generates object likelihood information with attention information for each recognized object by adding to its likelihood, as attention information, whether the object was determined to be an attended object; and an action recognition unit that recognizes the action of the first person in each frame by inputting, for each frame of the video, the object likelihood information with attention information of all recognized objects to a predetermined recognizer.

The present invention is also an action recognition method comprising steps corresponding to the respective units of the action recognition device, and an action recognition program that causes a computer to execute each step of the action recognition method.

Furthermore, the present invention is a recognizer construction device that constructs the recognizer used by the action recognition unit of the action recognition device, comprising: a training data storage unit that stores, as training data, sets of a plurality of images and an action label assigned to each image; an object recognition unit that recognizes, from each image of the training data, the objects in the image together with their likelihoods; an attention estimation unit that estimates, from each image of the training data, an attention map as the distribution of the degree of attention within the image; an attended-object determination unit that determines whether each recognized object is an attended object by referring to the estimated attention map within the region occupied by that object; an object likelihood information generation unit that generates object likelihood information with attention information for each recognized object by adding to its likelihood, as attention information, whether the object was determined to be an attended object; and a recognizer construction unit that constructs the recognizer by learning, over the training data, the pairing of the object likelihood information with attention information of all objects recognized in each image with the action label assigned to that image.

According to the action recognition device, method, or program of the present invention, attention information indicating whether each object is being attended to is obtained from the estimated attention map, and the recognizer receives as input the recognized objects, distinguished by this attention information, together with their likelihoods. This makes highly accurate action recognition possible.

According to the recognizer construction device of the present invention, the recognizer required for the above action recognition can be constructed.

FIG. 1 is a functional block diagram of an action recognition device according to one embodiment.
FIG. 2 schematically shows the processing in the attended-object determination unit.
FIG. 3 is a functional block diagram of a recognizer construction device according to one embodiment.
FIG. 4 is a flowchart of action recognition processing by the action recognition device according to one embodiment.
FIG. 5 is a flowchart of recognizer construction processing by the recognizer construction device according to one embodiment.
FIG. 6 conceptually shows the action recognition procedure of Non-Patent Document 1.

First, the gist of the present invention is described below.

In the present invention, every object recognized in each frame image of the first-person-viewpoint video is classified as either an attended object or a non-attended object, and user action estimation treats attended and non-attended objects separately even when they are of the same type, thereby improving action recognition accuracy. An attended object is an object to which the user capturing the first-person video is paying attention; a non-attended object is any object that is not an attended object.

Human attention can be estimated in two ways. One, as in Non-Patent Documents 3 and 4 below, uses information obtained from the video to estimate human attention for each frame image in the form of a map. The other, as in Patent Document 1 below, uses a head-mounted gaze direction detection device to detect the gaze direction with high accuracy and estimates human attention for each frame image in the form of gaze position coordinates. Attended objects are detected from the attention estimated by either method, and every object not detected as an attended object is treated as a non-attended object, so that each object is classified as attended or non-attended.

[Patent Document 1] Japanese Patent No. 3038375
[Non-Patent Document 3] Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 20(11), 1254-1259 (1998)
[Non-Patent Document 4] K. Yamada, Y. Sugano, T. Okabe, Y. Sato, A. Sugimoto, and K. Hiraki, "Attention prediction in egocentric video using motion and visual saliency," in Proc. 5th Pacific-Rim Symposium on Image and Video Technology (PSIVT) 2011, vol.1, pp. 277-288, Nov. 2011.

In the former case, where attention is estimated as an attention map, the attention degree of each object in each frame image is the mean of the attention map values at the pixels inside the object's region, and attended objects are detected from this value. For example, an object can be treated as an attended object when its attention degree exceeds a threshold, or the object with the highest attention degree in each frame image can be treated as the attended object.

In the latter case, where attention is estimated as gaze position coordinates, the object whose recognized region contains the gaze position coordinates can be treated as the attended object. Alternatively, an attention map shaped like a two-dimensional Gaussian distribution centered on the gaze position coordinates can be generated, and attended objects can then be detected in the same way as in the former case (detection from an attention map).
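The two gaze-based options can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names, the rectangular regions given as (x0, y0, x1, y1) with exclusive upper bounds, and the spread parameter sigma are all assumptions made for illustration.

```python
import math

def gaussian_attention_map(width, height, gaze_x, gaze_y, sigma):
    """Build a height x width map with a 2-D Gaussian centered on the gaze point.

    sigma (in pixels) is a hypothetical parameter; the text does not fix it.
    """
    two_sigma_sq = 2.0 * sigma * sigma
    return [
        [math.exp(-((x - gaze_x) ** 2 + (y - gaze_y) ** 2) / two_sigma_sq)
         for x in range(width)]
        for y in range(height)
    ]

def objects_containing_gaze(regions, gaze_x, gaze_y):
    """Return names of objects whose box (x0, y0, x1, y1) contains the gaze point.

    Rectangular boxes with exclusive upper bounds are an assumption.
    """
    return [
        name for name, (x0, y0, x1, y1) in regions.items()
        if x0 <= gaze_x < x1 and y0 <= gaze_y < y1
    ]
```

The map returned by `gaussian_attention_map` can be passed to any attention-map-based detection step in place of a saliency-derived map.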

Once each object has been classified as attended or non-attended, the two classes are treated as different object types to generate object likelihood information with attention information; a classifier is built by learning its pairing with user actions; and user actions are recognized in the same way as in Non-Patent Document 1. Unlike the active/passive distinction, the attended/non-attended distinction can tell whether an object is important regardless of the scene or the object type, so it can be expected to improve action recognition accuracy over the conventional method (Non-Patent Document 1), which uses the active/passive distinction.

The gist of the present invention has been described above. The best mode for carrying out the invention is described in detail below with reference to the drawings.

The components in the following embodiments may be replaced with existing components as appropriate, and many variations are possible, including combinations with other existing components. The description of the following embodiments therefore does not limit the invention set out in the claims.

FIG. 1 is a block diagram of an action recognition device 1 according to an embodiment of the present invention. The action recognition device 1 applies general object recognition and attention estimation to camera video, distinguishes attended from non-attended objects in addition to detecting and recognizing objects, and recognizes actions by feeding the result to an action recognizer built in advance by learning.

The action recognition device 1 comprises a camera unit 10, an object recognition unit 20, an attention estimation unit 30, an attended-object determination unit 40, an object likelihood information generation unit 50, and an action recognition unit 60. Each unit is described in detail below.

The camera unit 10 is worn by the person whose actions are to be recognized and continuously captures images within the camera's field of view. The captured images (each of which is hereinafter called a "frame image") are input in time-series order to the object recognition unit 20 and the attention estimation unit 30.

The object recognition unit 20 detects objects in the input frame image and identifies their types. This detection and identification can be implemented on a computer by applying, for example, the known general object recognition method described in Non-Patent Document 2 cited above. The same method also computes the likelihood of each object during detection and identification.

The attention estimation unit 30 estimates, from the input frame image, the attention distribution of the person wearing the camera and generates an attention map corresponding to the input image. The attention map is a two-dimensional map whose value at each pixel of the input image is an estimate of how strongly that pixel attracts visual attention. It can be estimated on a computer by applying an attention estimation method based on saliency and/or self-motion, such as those described in Non-Patent Document 3 (which estimates saliency) and Non-Patent Document 4 (which estimates saliency and self-motion).

For an attention map based on saliency, a saliency map can be generated by analyzing only the frame images or video with any of the various known methods disclosed in Non-Patent Documents 3 and 4 and elsewhere, and this saliency map can be used as the attention map.

An attention map based on self-motion can be obtained as follows. With the known method disclosed in Non-Patent Document 4, the video (a sequence of frame images) is first analyzed to estimate the self-motion (the change in camera pose, that is, the change in position and orientation of the viewpoint of the person wearing the camera); a visual attention map is then generated from the estimated self-motion and used as the attention map. The self-motion is estimated by extracting and matching feature points between frames (for example, the previous frame and the current frame) and using the geometric correspondences of the matched points. Camera parameters given in advance are used for this estimation. Alternatively, instead of being given in advance, the camera parameters may be obtained together with the self-motion estimate by analyzing the video (two images from different frames) with the method of the following document.
山田健人, 金澤靖, 金谷健一, 菅谷保之, "2画像からの3次元復元の最新アルゴリズム" (The latest algorithm for 3D reconstruction from two images), 情報処理学会研究報告 (IPSJ SIG Technical Report), 2009-CVIM-168-15 (2009-8), pp. 1-8.

Alternatively, instead of estimating self-motion by video analysis as in Non-Patent Document 4, the following approach may be used. If self-motion information is available directly from the camera unit 10 (for example, because the camera unit 10 includes an acceleration sensor), a visual attention map may be generated from that directly obtained self-motion (in the same way as in Non-Patent Document 4) and used as the attention map.

When the attention map is generated from both the saliency-based map and the self-motion-based map, a map combining the two can be used as the attention map, as disclosed in Non-Patent Document 4. To combine them, the values of each map are normalized in advance and the two maps are then, for example, summed at each position.
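The combination step can be sketched as follows. This is an illustrative sketch only: the text says the maps are normalized and then summed per position but does not fix the normalization, so min-max scaling to [0, 1] is an assumption here, as are the function names and the list-of-lists map representation.

```python
def normalize_map(attention_map):
    """Min-max scale a 2-D map (list of rows) to the range [0, 1].

    Min-max scaling is an assumed choice of normalization.
    A constant map is mapped to all zeros to avoid division by zero.
    """
    lo = min(min(row) for row in attention_map)
    hi = max(max(row) for row in attention_map)
    if hi == lo:
        return [[0.0 for _ in row] for row in attention_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in attention_map]

def combine_maps(saliency_map, self_motion_map):
    """Normalize both maps, then add them position by position."""
    s = normalize_map(saliency_map)
    m = normalize_map(self_motion_map)
    return [[a + b for a, b in zip(row_s, row_m)] for row_s, row_m in zip(s, m)]
```

Normalizing first keeps one map from dominating the sum merely because its raw values are on a larger scale.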

Using the information on the objects recognized by the object recognition unit 20 and the attention map estimated by the attention estimation unit 30, the attended-object determination unit 40 determines, for each object recognized by the object recognition unit 20, whether it is an attended object.

FIG. 2 schematically illustrates the process of determining whether an object is an attended object. In FIG. 2, for the frame image f(x, y, t) at each time t of the video shown in (1) (f is the pixel value, x the horizontal position, and y the vertical position; because the figure is conceptual, the actual image content is not shown), the object recognition unit 20 recognizes objects Oi (i = 1, 2, ..., n) as shown in (2) and obtains their regions R(Oi), and the attention estimation unit 30 obtains the attention map c(x, y, t) shown in (3) (c is the value of the attention map at each position). The recognized objects Oi and their regions R(Oi) generally also vary with time t, but the dependence on t is not written explicitly here for simplicity.

Referring to the results of (2) and (3), the attended-object determination unit 40 computes, as shown in (4), for each object Oi (i = 1, 2, ..., n) recognized by the object recognition unit 20, the mean value mean(Oi) of the attention map values c(x, y, t) at the pixels inside the object's region R(Oi) in the frame image, takes this mean as the attention degree C(Oi) of the object Oi, and detects attended objects on that basis.

Alternatively, instead of the mean value mean(Oi), another statistic of the attention map values c(x, y, t) within the region R(Oi) of each object Oi, such as the maximum, median, or mode, may be used as the attention degree C(Oi). The attention map value c(x_g, y_g, t) at a representative position of the region R(Oi), for example its centroid (x_g, y_g), may also be used as the attention degree C(Oi).
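The alternative definitions of the attention degree C(Oi) can be sketched as follows. The sketch assumes, for illustration only, that the region R(Oi) is a rectangle given as (x0, y0, x1, y1) with exclusive upper bounds and that the attention map for one frame is a list of rows; the function names and the `stat` parameter are hypothetical.

```python
import statistics

def region_values(attention_map, box):
    """Collect attention map values c(x, y) inside a rectangular region.

    box = (x0, y0, x1, y1) with exclusive upper bounds (an assumption).
    """
    x0, y0, x1, y1 = box
    return [attention_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]

def attention_degree(attention_map, box, stat="mean"):
    """Compute the attention degree of one object from its region."""
    values = region_values(attention_map, box)
    if stat == "mean":
        return statistics.mean(values)
    if stat == "max":
        return max(values)
    if stat == "median":
        return statistics.median(values)
    if stat == "centroid":
        # Value at the (integer) center of the box, as a representative position.
        x0, y0, x1, y1 = box
        return attention_map[(y0 + y1 - 1) // 2][(x0 + x1 - 1) // 2]
    raise ValueError("unknown statistic: " + stat)
```

The mean is less sensitive to a single high-valued pixel than the maximum, while the centroid value is the cheapest to compute because it reads one pixel.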

From the computed attention degree C(Oi) of each object Oi (i = 1, 2, ..., n), the attended-object determination unit 40 determines, in the frame image at each time t, whether each object Oi is an attended object, as follows. In one example, an object Oi whose attention degree C(Oi) exceeds a predetermined threshold Th is determined to be an attended object, and an object Oi whose attention degree does not exceed it is determined not to be. In another example, the object Oi with the highest attention degree in the frame image at each time t is determined to be the attended object, and all remaining objects are determined not to be attended objects. Similarly, only a predetermined number of objects with the highest attention degrees may be determined to be attended objects and the rest determined not to be.
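The three determination rules can be sketched in a few lines of Python. The function names and the dictionary mapping object names to attention degrees are assumptions made for illustration; the threshold Th and the count k are free parameters whose values the text does not specify.

```python
def attended_by_threshold(degrees, th):
    """Objects whose attention degree C(Oi) strictly exceeds the threshold Th."""
    return {name for name, c in degrees.items() if c > th}

def attended_top_k(degrees, k=1):
    """The k objects with the highest attention degree.

    k=1 corresponds to selecting only the single most attended object.
    """
    ranked = sorted(degrees, key=degrees.get, reverse=True)
    return set(ranked[:k])
```

With the threshold rule a frame may contain no attended object at all, whereas the top-k rule always marks k objects (or all of them, if fewer than k were recognized) as attended.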

The object likelihood information generation unit 50 combines the determination made by the attended-object determination unit 40 of whether each object is an attended object with the object information (object likelihood) recognized by the object recognition unit 20 to generate object likelihood information with attention information. Object likelihood information is a histogram in which each object (that is, each identified object type) is one bin and the object likelihood of that object is the bin value.

The object likelihood information with attention information is generated so that objects determined by the attended-object determination unit 40 to be attended objects are kept distinct from those determined not to be. That is, for each object type recognized by the object recognition unit 20, the object likelihood information has two bins: one for the attended object and one for the non-attended object.

For example, if an object recognized as "television" by the object recognition unit 20 is determined to be an attended object, its object likelihood becomes the value of the "attended television" bin; if it is not, its object likelihood becomes the value of the "non-attended television" bin.

Each recognized object Oi, such as the "television", is thus assigned to either the "attended television" bin or the "unattended television" bin as described above. In one embodiment, the value placed in the assigned bin may simply be the object likelihood L(Oi) calculated when the object recognition unit 20 recognized the object as a "television" or the like.
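A minimal sketch of such a two-bins-per-type histogram is shown below. The vocabulary `OBJECT_TYPES`, the bin naming scheme, and the rule of keeping the maximum when several objects fall into one bin are assumptions for illustration; the text does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical vocabulary of object types the recognizer can output.
OBJECT_TYPES = ["tv", "table", "remote"]


def build_histogram(detections: List[Tuple[str, float, bool]]) -> Dict[str, float]:
    """Build object likelihood information with attention information.

    Each detection is (object type, likelihood L(Oi), attended?).
    Every object type gets an "attended_" bin and an "unattended_" bin.
    """
    hist = {f"{state}_{t}": 0.0
            for t in OBJECT_TYPES
            for state in ("attended", "unattended")}
    for obj_type, likelihood, attended in detections:
        key = ("attended_" if attended else "unattended_") + obj_type
        # Assumption: if several objects share a bin, keep the largest likelihood.
        hist[key] = max(hist[key], likelihood)
    return hist


hist = build_histogram([("tv", 0.9, True), ("table", 0.7, False)])
```

Here the television's likelihood lands in `attended_tv` and the table's in `unattended_table`, while all other bins stay at zero.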

In another embodiment, the value placed in the assigned bin may be determined not as the L(Oi) calculated by the object recognition unit 20 but as follows. That is, it may be obtained as a value in which L(Oi) is updated by additionally using the attention degree C(Oi) of the object Oi estimated by the attended-object determination unit 40, as described below.

First, in the image at time t, the attention degree C(Oi) of each object Oi is normalized so that the attention degree C(Oi_max) of the object Oi_max with the highest attention degree becomes 1.0 and the attention degree C(Oi_min) of the object Oi_min with the lowest attention degree becomes 0, yielding the normalized attention degree C(Oi)_[norm]. Further, the value obtained by subtracting the normalized attention degree C(Oi)_[norm] from 1 is defined as the normalized non-attention degree C(Oi)_[norm/non] (= 1 - C(Oi)_[norm]).

Using these two normalized values, the object likelihood of each object Oi that has been determined to be an attended or unattended object, and has accordingly been assigned an attended-object bin or an unattended-object bin, may be updated from its initial value L(Oi) as follows.

For an object determined to be an attended object, the product L(Oi)*C(Oi)_[norm] of the initial object likelihood L(Oi) of the object Oi and its normalized attention degree C(Oi)_[norm] is used as the value of the bin of that attended object Oi. For an object determined to be an unattended object, the product L(Oi)*C(Oi)_[norm/non] of the initial object likelihood L(Oi) of the object Oi and the normalized non-attention degree C(Oi)_[norm/non] is used as the value of the bin of that unattended object Oi. The object likelihood information with attention information can be generated in this way.

With this method, the initial object likelihood L(Oi) is updated so that, for an object Oi determined to be an attended object, the higher its initial attention degree C(Oi) (the attention degree before normalization), the relatively higher its object likelihood becomes; conversely, for an object Oi determined to be an unattended object, the lower its initial attention degree C(Oi), the relatively higher its object likelihood becomes. The object likelihood information with attention information is generated accordingly. The update may also be performed by any other method that yields the same tendency.
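The normalization and update can be sketched as below. The function names and sample values are illustrative, and the handling of the degenerate case where all attention degrees in a frame are equal is an assumption, since the text does not address it.

```python
from typing import Dict, Set


def normalize_attention(attention: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale C(Oi) within one frame: highest becomes 1.0, lowest 0.0."""
    lo, hi = min(attention.values()), max(attention.values())
    if hi == lo:
        # Assumption: the degenerate case is not covered by the text; use 1.0.
        return {oid: 1.0 for oid in attention}
    return {oid: (c - lo) / (hi - lo) for oid, c in attention.items()}


def update_likelihoods(likelihood: Dict[str, float],
                       attention: Dict[str, float],
                       attended: Set[str]) -> Dict[str, float]:
    """Attended objects get L * C_norm; unattended ones get L * (1 - C_norm)."""
    c_norm = normalize_attention(attention)
    updated = {}
    for oid, l_value in likelihood.items():
        weight = c_norm[oid] if oid in attended else 1.0 - c_norm[oid]
        updated[oid] = l_value * weight
    return updated


# Made-up values for one frame; only "tv" was judged to be attended.
likelihood = {"tv": 0.9, "table": 0.8, "remote": 0.5}
attention = {"tv": 0.8, "table": 0.2, "remote": 0.5}
updated = update_likelihoods(likelihood, attention, attended={"tv"})
```

In this example the attended television keeps its full likelihood, the least-attended table also keeps its full likelihood in its unattended bin, and the remote control, sitting midway, is roughly halved.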

The action recognition unit 60 performs action recognition based on the object likelihood information with attention information input from the object likelihood information generation unit 50. Specifically, the object likelihood information with attention information for all objects Oi (i = 1, 2, ..., n) recognized in the frame at time t is input to a recognizer that has already been trained by statistical machine learning such as an SVM, or by other means, and the action recognition result for that frame is obtained. That is, a recognition result is obtained as to which action the frame at time t in the video corresponds to.

FIG. 3 is a functional block diagram of one embodiment of the recognizer construction device 100 for constructing the recognizer used in the action recognition unit 60. The recognizer construction device 100 includes a learning data storage unit 110, an object recognition unit 120, an attention estimation unit 130, an attended-object determination unit 140, an object likelihood information generation unit 150, and a learning unit 160.

The learning data storage unit 110 stores, as learning data, sets each consisting of image data and a keyword label expressing the user action in that image data (hereinafter referred to as an action label), for a plurality of image data.

The object recognition unit 120 has the same function as the object recognition unit 20, the attention estimation unit 130 as the attention estimation unit 30, the attended-object determination unit 140 as the attended-object determination unit 40, and the object likelihood information generation unit 150 as the object likelihood information generation unit 50.

Accordingly, each of the plurality of image data in the learning data stored in the learning data storage unit 110 (excluding the action label information) corresponds to the frame image at each time t of the video input to the action recognition device 1 of FIG. 1. As in FIG. 1, in the configuration of FIG. 3 each image is input to and processed by the object recognition unit 120 and the attention estimation unit 130, and is then processed by the attended-object determination unit 140 and the object likelihood information generation unit 150. The result is object likelihood information with attention information generated for all objects recognized in that image, which is output to the learning unit 160.

The learning unit 160 learns, over the entire learning data and by a machine learning method typified by an SVM, the correspondence between the object likelihood information with attention information generated for all objects recognized in each image, input from the object likelihood information generation unit 150 as described above, and the action label corresponding to each image, input from the learning data storage unit 110. It thereby constructs the recognizer used by the action recognition unit 60 of FIG. 1. Using the recognizer constructed in advance in this way enables the action recognition unit 60 to perform action recognition.
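The train-then-recognize relationship can be illustrated with a small sketch. Note that this deliberately swaps in a nearest-centroid classifier as a dependency-free stand-in; the text names an SVM as the typical learner, and this is not an SVM. The feature order, labels, and numbers are all invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def train_centroids(samples: List[Tuple[Sequence[float], str]]) -> Dict[str, List[float]]:
    """Average the feature vectors (histograms) of each action label."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}


def recognize(centroids: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Return the action label whose centroid is closest to the input histogram."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))


# Assumed bin order: [attended_tv, unattended_tv, attended_stove, unattended_stove]
training = [
    ([0.9, 0.0, 0.0, 0.2], "watching_tv"),
    ([0.8, 0.0, 0.0, 0.1], "watching_tv"),
    ([0.0, 0.3, 0.9, 0.0], "cooking"),
    ([0.0, 0.4, 0.7, 0.0], "cooking"),
]
recognizer = train_centroids(training)
first = recognize(recognizer, [0.7, 0.0, 0.0, 0.3])
second = recognize(recognizer, [0.0, 0.5, 0.8, 0.0])
```

The point of the sketch is the data flow: histograms with attention information paired with action labels go in during construction, and a histogram alone goes in during recognition.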

FIG. 4 is a flowchart of one embodiment of the action recognition process performed by the action recognition device 1 of FIG. 1. Its steps (S1 to S6) are as follows.

In step S1, the camera unit 10 continuously captures the camera's field of view and acquires frame images.

In step S2, the object recognition unit 20 detects objects in each frame image input from the camera unit 10 and identifies the type of each object. The likelihood of each identified object is also obtained at this time.

In step S3, the attention estimation unit 30 performs attention estimation on each frame image input from the camera unit 10 and generates an attention map.

In step S4, the attended-object determination unit 40 uses the information on the objects recognized by the object recognition unit 20 and the attention map estimated by the attention estimation unit 30 to determine, for each recognized object, whether it is an attended object.

In step S5, the object likelihood information generation unit 50 generates object likelihood information with attention information based on the attended/unattended determination results input from the attended-object determination unit 40 and the object likelihoods input from the object recognition unit 20.

In step S6, the action recognition unit 60 performs action recognition based on the object likelihood information with attention information input from the object likelihood information generation unit 50.
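Steps S2 to S6 for a single frame can be sketched as a short pipeline. Everything here is a placeholder: the callables passed in stand for the respective units, and the stub lambdas in the usage example are invented so that the flow can be run end to end.

```python
from typing import Any, Callable, Dict, List, Tuple

Detection = Tuple[str, float, Any]  # (object type, likelihood, region)


def recognize_frame(frame: Any,
                    detect: Callable[[Any], List[Detection]],
                    estimate_attention: Callable[[Any], Any],
                    is_attended: Callable[[Any, Any], bool],
                    classify: Callable[[Dict[str, float]], str]) -> str:
    """Run steps S2 to S6 on one frame acquired in step S1."""
    detections = detect(frame)                 # S2: objects with likelihoods
    attention_map = estimate_attention(frame)  # S3: does not use the S2 result
    features: Dict[str, float] = {}
    for obj_type, likelihood, region in detections:
        attended = is_attended(region, attention_map)  # S4
        key = ("attended_" if attended else "unattended_") + obj_type
        features[key] = likelihood                     # S5
    return classify(features)                          # S6


# Stub units with made-up outputs, only to exercise the data flow.
result = recognize_frame(
    frame="frame-0",
    detect=lambda f: [("tv", 0.9, "r1"), ("table", 0.6, "r2")],
    estimate_attention=lambda f: {"r1": 0.8, "r2": 0.1},
    is_attended=lambda region, amap: amap[region] > 0.5,
    classify=lambda feats: "watching_tv" if "attended_tv" in feats else "unknown",
)
```

Because `detect` and `estimate_attention` each take only the frame, the two calls can be reordered or run in parallel without changing the result.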

FIG. 5 is a flowchart of one embodiment of the recognizer construction process performed by the recognizer construction device 100 of FIG. 3. Its steps (S11 to S17) are as follows.

In step S11, the learning data storage unit 110 inputs image data, as learning data, to the object recognition unit 120 and the attention estimation unit 130.

In step S12, the object recognition unit 120 detects objects in the image data input from the learning data storage unit 110 and identifies the type of each object. It also obtains the likelihood of each identified object.

In step S13, the attention estimation unit 130 performs attention estimation on the image data input from the learning data storage unit 110 and generates an attention map.

In step S14, the attended-object determination unit 140 uses the information on the objects recognized by the object recognition unit 120 and the attention map estimated by the attention estimation unit 130 to determine whether each object is an attended object.

In step S15, the object likelihood information generation unit 150 generates object likelihood information with attention information based on the attended/unattended determination results input from the attended-object determination unit 140 and the object likelihoods input from the object recognition unit 120.

In step S16, the learning data storage unit 110 inputs the action labels, as learning data, to the learning unit 160.

In step S17, the learning unit 160 pairs the object likelihood information with attention information generated from the image data, input from the object likelihood information generation unit 150, with the action labels corresponding to the image data, input from the learning data storage unit 110, learns the correspondence in these pairs over all of the learning data, and constructs the recognizer.

The embodiments of the present invention have been described in detail above with reference to the drawings, but the specific configuration is not limited to these embodiments and includes designs and the like within a scope that does not depart from the gist of the invention. Supplementary matters are described below as (1) to (4).

(1) Steps S2 and S3 in the flow of FIG. 4, and steps S12 and S13 in the flow of FIG. 5, may each be performed in the reverse order or in parallel.

(2) The camera unit 10 has been described as being worn by a person, but it need not be worn as long as a first-person video of some person, or a video equivalent or comparable to one, can be obtained. For example, the person may shoot while continuing to hold the camera, or the video may be a first-person video obtained artificially by editing or the like.

(3) In the above embodiment, an attention estimation method based on estimating saliency and/or self-motion is applied in the attention estimation unit 30 and the attention estimation unit 130. Alternatively, as in Patent Document 1 mentioned above, a gaze direction detection device worn on the head may detect the gaze direction with high accuracy, so that the person's attention to each frame image is estimated in the form of gaze position coordinates. When that estimation result is used, each frame may therefore be regarded as carrying, in addition to the time-series image information of an ordinary video, information on the person's viewpoint position coordinates.

Three examples in which attention is estimated in the form of gaze position coordinates are described below.

In the first example, the attention estimation unit 30 and the attention estimation unit 130 do not compute anything corresponding to the attention map of the above embodiment. Instead, the attended-object determination unit 40 and the attended-object determination unit 140 determine as the attended object either the object whose object region contains the gaze position coordinates or the object whose object region is closest to the viewpoint position coordinates, and determine the remaining objects to be unattended objects.
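One way to sketch this determination is shown below. The axis-aligned bounding-box format for object regions is an assumption, and the sketch merges the two alternatives into a single rule by treating a box that contains the gaze point as being at distance zero.

```python
from typing import Dict, Tuple

# Assumed region format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def distance_to_box(point: Tuple[float, float], box: Box) -> float:
    """Euclidean distance from a point to a box; 0.0 if the point is inside."""
    x, y = point
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return (dx * dx + dy * dy) ** 0.5


def attended_by_gaze(gaze: Tuple[float, float], boxes: Dict[str, Box]) -> str:
    """Return the id of the object whose region contains, or is nearest to, the gaze point."""
    return min(boxes, key=lambda oid: distance_to_box(gaze, boxes[oid]))


boxes = {"tv": (0.0, 0.0, 10.0, 10.0), "remote": (20.0, 20.0, 30.0, 30.0)}
inside = attended_by_gaze((5.0, 5.0), boxes)
nearest = attended_by_gaze((18.0, 18.0), boxes)
```

If several regions contain the gaze point, this sketch simply returns the first one in dictionary order; the text does not say how such ties should be resolved.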

In the second example, the attention estimation unit 30 and the attention estimation unit 130 generate an attention map in the shape of a two-dimensional Gaussian distribution centered (peaking) at the gaze position coordinates, after which the attended-object determination unit 40 and the attended-object determination unit 140 determine the attended objects by the same method as in the above embodiment. The parameters of the two-dimensional Gaussian distribution other than the peak position may be set in advance.
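A minimal sketch of such a map, together with a per-object attention degree taken as the mean map value over the object region (the statistic named in the claims), follows. The isotropic Gaussian with a single `sigma`, the peak height of 1.0, and the pixel-box region format are assumptions made for the example.

```python
import math
from typing import List, Tuple


def gaussian_attention_map(width: int, height: int,
                           gaze: Tuple[float, float],
                           sigma: float) -> List[List[float]]:
    """Attention map peaking (value 1.0) at the gaze position; indexed as amap[y][x]."""
    gx, gy = gaze
    return [[math.exp(-((x - gx) ** 2 + (y - gy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def mean_attention(amap: List[List[float]],
                   box: Tuple[int, int, int, int]) -> float:
    """Attention degree of an object: mean map value over its (inclusive) pixel box."""
    x0, y0, x1, y1 = box
    values = [amap[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    return sum(values) / len(values)


amap = gaussian_attention_map(20, 20, gaze=(5, 5), sigma=3.0)
near = mean_attention(amap, (3, 3, 7, 7))      # region around the gaze point
far = mean_attention(amap, (12, 12, 16, 16))   # region away from the gaze point
```

The resulting attention degrees can then be fed to whichever determination rule is in use, such as a threshold or a top-ranked selection.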

However, when the first example detects as the attended object the object whose object region contains the gaze position coordinates, or the object whose object region is closest to the viewpoint position coordinates, the attention degree C(Oi) of each object Oi is not defined, which imposes the following restriction compared with the above embodiment. Namely, the object likelihood information generation unit 50 and the object likelihood information generation unit 150 cannot generate object likelihood information in the form in which the initial likelihood L(Oi) is updated using the attention degree C(Oi). (Generation is limited to object likelihood information that uses the attended/unattended determination result and adopts the initial likelihood L(Oi) as is.)

Alternatively, to formally avoid this restriction, the first example may be treated as if an attention map had been generated that has a predetermined value A (A > 0) only at the point of the viewpoint position coordinates and is zero everywhere else, so that the attended-object determination unit 40 and the attended-object determination unit 140 determine the attended objects in the same way as in the above embodiment.

In the third example, more generally, the attention estimation units 30 and 130 generate the attention map as a predetermined two-dimensional distribution that peaks at the viewpoint position coordinates, and the attended-object determination units 40 and 140 operate in the same way as in the above embodiment. The first example can be regarded as the special case of the third example in which a delta-function-like distribution is adopted as the predetermined two-dimensional distribution, and the second example as the special case in which a Gaussian distribution is adopted.

(4) The present invention can also be provided as an action recognition program that is read by a computer and causes the computer to execute the steps of FIG. 4 or to function as the units of FIG. 1. Similarly, the present invention can be provided as a recognizer construction program that is read by a computer and causes the computer to execute the steps of FIG. 5 or to function as the units of FIG. 3.

1: action recognition device, 10: camera unit, 20: object recognition unit, 30: attention estimation unit, 40: attended-object determination unit, 50: object likelihood information generation unit, 60: action recognition unit, 100: recognizer construction device, 110: learning data storage unit, 120: object recognition unit, 130: attention estimation unit, 140: attended-object determination unit, 150: object likelihood information generation unit, 160: learning unit

Claims

An action recognition device that recognizes, from a first-person viewpoint video, the action of the first-person individual, comprising:
an object recognition unit that recognizes, from each frame of the video, objects in the frame together with their likelihoods;
an attention estimation unit that estimates, from each frame of the video, an attention map as a distribution of the degree of attention of the first-person individual within the frame;
an attended-object determination unit that determines whether each recognized object is an attended object by referring to the estimated attention map in the region occupied by that object;
an object likelihood information generation unit that generates object likelihood information with attention information for each recognized object by adding to its likelihood, as attention information, information on whether the object was determined to be an attended object; and
an action recognition unit that recognizes the action of the first-person individual in each frame of the video by inputting the object likelihood information with attention information for all recognized objects in the frame to a predetermined recognizer,
wherein the object recognition unit recognizes the objects in the frame, together with likelihoods relating to their object types, without referring to the attention map estimated by the attention estimation unit, and
wherein the attention estimation unit estimates the attention map without referring to the information on the objects in the frame that the object recognition unit recognizes together with their likelihoods.

The action recognition device according to claim 1, wherein the attention estimation unit estimates the attention map as:
a saliency map obtained from each frame of the video;
a visual attention map based on self-motion of the first-person individual's viewpoint, the self-motion being estimated from each frame of the video; or
a map combining the saliency map and the visual attention map.

The action recognition device according to claim 1, wherein each frame of the video is accompanied by information on the viewpoint position coordinates of the first-person individual, and
the attention estimation unit estimates, from each frame of the video, the attention map as a predetermined distribution that peaks at the viewpoint position coordinates.

The action recognition device according to any one of claims 1 to 3, wherein the attended-object determination unit obtains, in the region occupied by each recognized object, a predetermined statistic of the values of the estimated attention map as the attention degree of that object, and determines whether each object is an attended object based on the attention degree.

The action recognition device according to claim 4, wherein the predetermined statistic is the mean.

The action recognition device according to claim 4 or 5, wherein the object likelihood information generation unit generates the object likelihood information with attention information by updating the likelihood of each object recognized by the object recognition unit based on the attention degree of that object obtained by the attended-object determination unit.

The action recognition device according to claim 6, wherein, when updating the likelihood of an object, the object likelihood information generation unit makes the updated likelihood higher as the attention degree is higher for an object determined to be an attended object, and makes the updated likelihood higher as the attention degree is lower for an object not determined to be an attended object.

An action recognition method for recognizing, from a first-person viewpoint video, the action of the first-person individual, comprising:
an object recognition step and an attention estimation step of recognizing, from each frame of the video, objects in the frame together with their likelihoods, and estimating, from each frame of the video, an attention map as a distribution of the degree of attention of the first-person individual within the frame;
an attended-object determination step of determining whether each recognized object is an attended object by referring to the estimated attention map in the region occupied by that object;
an object likelihood information generation step of generating object likelihood information with attention information for each recognized object by adding to its likelihood, as attention information, information on whether the object was determined to be an attended object; and
an action recognition step of recognizing the action of the first-person individual in each frame of the video by inputting the object likelihood information with attention information for all recognized objects in the frame to a predetermined recognizer,
wherein, in the object recognition step, the objects in the frame are recognized, together with likelihoods relating to their object types, without referring to the attention map estimated in the attention estimation step, and
wherein, in the attention estimation step, the attention map is estimated without referring to the information on the objects in the frame recognized together with their likelihoods in the object recognition step.

An action recognition program for recognizing, from a first-person viewpoint video, the action of the first-person individual, the program causing a computer to execute each step according to claim 8.

A recognizer construction device that constructs the recognizer used by the action recognition unit in the action recognition device according to any one of claims 1 to 7, comprising:
a learning data storage unit that stores, as learning data, sets of a plurality of images and action labels assigned to each of the images;
an object recognition unit that recognizes, from each image of the learning data, objects in the image together with their likelihoods;
an attention estimation unit that estimates, from each image of the learning data, an attention map as a distribution of the degree of attention within the image;
an attended-object determination unit that determines whether each recognized object is an attended object by referring to the estimated attention map in the region occupied by that object;
an object likelihood information generation unit that generates object likelihood information with attention information for each recognized object by adding to its likelihood, as attention information, information on whether the object was determined to be an attended object; and
a recognizer construction unit that constructs the recognizer by learning, in the learning data, the combination of the object likelihood information with attention information for all objects recognized in each image and the action label assigned to that image,
wherein the object recognition unit recognizes the objects in the image, together with likelihoods relating to their object types, without referring to the attention map estimated by the attention estimation unit, and
wherein the attention estimation unit estimates the attention map without referring to the information on the objects in the image that the object recognition unit recognizes together with their likelihoods.