JP2024067990A - GENERATION PROGRAM, GENERATION METHOD, AND INFORMATION PROCESSING APPARATUS

Info

Publication number: JP2024067990A
Application number: JP2022178464A
Authority: JP
Inventors: 源太鈴木
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2022-11-07
Filing date: 2022-11-07
Publication date: 2024-05-17

Abstract

【課題】現場での行動検知の精度を向上させることを課題とする。【解決手段】情報処理装置は、人物の行動の要素を示すルールを特定する。情報処理装置は、特定したルールに合致した姿勢を示す人物のモデルを生成する。情報処理装置は、カメラパラメータを用いて、画像データの中に人物のモデルが配置された合成データを生成する。防犯やリテール、製造、業務効率化など、様々なシーンに適した映像分析ソリューションを提供できる。【選択図】図２７[Problem] The problem is to improve the accuracy of on-site behavior detection. [Solution] An information processing device identifies rules that indicate elements of a person's behavior. The information processing device generates a model of a person that exhibits a posture that matches the identified rules. The information processing device uses camera parameters to generate synthetic data in which the person model is placed within image data. It is possible to provide video analysis solutions suitable for various scenes, such as crime prevention, retail, manufacturing, and business efficiency. [Selected Figure] Fig. 27

Description

本発明は、生成プログラム、生成方法および情報処理装置に関する。 The present invention relates to a generation program, a generation method, and an information processing device.

映像データを用いた行動認識として、映像データ内の人物検出、属性検出、姿勢推定などの技術が知られている。例えば、店舗内や工場内を撮像した映像データを機械学習モデルに入力し、機械学習モデルの出力結果を取得する。そして、機械学習モデルの出力結果を用いて、不審者の検出、体調不良者の検出、不審な行動の検出などが行われる。 Known techniques for behavior recognition using video data include detecting people in video data, detecting attributes, and estimating postures. For example, video data captured inside a store or factory is input into a machine learning model, and the output of the machine learning model is obtained. The output of the machine learning model is then used to detect suspicious individuals, people who are unwell, suspicious behavior, etc.

特開２０１２－１７３９０３号公報JP 2012-173903 A 特開２０１３－５０９４５号公報JP 2013-50945 A

しかしながら、上記技術では、行動認識を行う現場環境における特殊な服装、カメラ歪み、設置位置などの様々な条件により、行動認識の精度が落ちることがある。 However, with the above technology, the accuracy of behavior recognition can be reduced due to various conditions in the field environment where behavior recognition is performed, such as special clothing, camera distortion, and installation position.

例えば、訓練データに含まれていない、もしくは少量の訓練データにしか含まれていない服装、カメラに対する向き、姿勢などについては、行動認識の精度が低下する。また、現場環境では物体と物体、物体と人物、人物と人物などのように、様々な重なり（オクルージョン）が発生し、行動認識の精度が低下する。 For example, the accuracy of behavior recognition decreases for clothing, orientations relative to the camera, and postures that are absent from the training data or that appear in only a small part of it. In addition, various overlaps (occlusions) occur in the field environment, such as between objects, between an object and a person, and between people, which further reduces the accuracy of behavior recognition.

一つの側面では、現場での行動検知の精度を向上させることができる生成プログラム、生成方法および情報処理装置を提供することを目的とする。 In one aspect, the objective is to provide a generation program, a generation method, and an information processing device that can improve the accuracy of on-site behavior detection.

第１の案では、生成プログラムは、コンピュータに、人物の行動の要素を示すルールを特定し、特定した前記ルールに合致した姿勢を示す人物のモデルを生成し、カメラパラメータを用いて、画像データの中に人物のモデルが配置された合成データを生成する、処理を実行させることを特徴とする。 In the first proposal, the generation program is characterized in that it causes a computer to execute processes that identify rules that indicate elements of a person's behavior, generate a model of a person exhibiting a posture that matches the identified rules, and generate synthetic data in which the model of the person is placed within image data using camera parameters.

一実施形態によれば、現場での行動検知の精度を向上させることができる。 According to one embodiment, the accuracy of on-site behavior detection can be improved.

図１は、実施例１にかかる情報処理装置を説明する図である。 FIG. 1 is a diagram illustrating an information processing apparatus according to a first embodiment.
図２は、実施例１にかかる情報処理装置の機能構成を示す機能ブロック図である。 FIG. 2 is a functional block diagram of the information processing apparatus according to the first embodiment.
図３は、３Ｄ生成モデルの訓練を説明する図である。 FIG. 3 is a diagram illustrating training of a 3D generative model.
図４は、領域抽出モデルの訓練を説明する図である。 FIG. 4 is a diagram illustrating training of a region extraction model.
図５は、骨格推定モデルの訓練を説明する図である。 FIG. 5 is a diagram illustrating training of a skeleton estimation model.
図６は、３Ｄアバターの生成を説明する図である。 FIG. 6 is a diagram illustrating the generation of a 3D avatar.
図７は、歩行の動作判定を説明するための図である。 FIG. 7 is a diagram for explaining the walking motion determination.
図８は、フレームから生成される３Ｄアバターの一例を示す図である。 FIG. 8 is a diagram showing an example of a 3D avatar generated from a frame.
図９は、３Ｄアバターの歩行姿勢を匿名化する処理を説明するための図である。 FIG. 9 is a diagram for explaining the process of anonymizing the walking posture of a 3D avatar.
図１０は、セマンティックセグメンテーションによる注目領域の検出を説明する図である。 FIG. 10 is a diagram illustrating detection of a region of interest by semantic segmentation.
図１１は、骨格推定モデルを用いた動作解析を説明する図である。 FIG. 11 is a diagram for explaining motion analysis using a skeleton estimation model.
図１２は、トラッキングによる基準方向の設定を説明する図である。 FIG. 12 is a diagram for explaining setting of the reference direction by tracking.
図１３は、クラスタリングを説明する図である。 FIG. 13 is a diagram illustrating clustering.
図１４は、クラスタの抽出を説明する図である。 FIG. 14 is a diagram for explaining the extraction of clusters.
図１５は、注目領域の抽出を説明する図である。 FIG. 15 is a diagram for explaining extraction of a region of interest.
図１６は、セマンティックセグメンテーションの実行結果への基準線の設定を説明する図である。 FIG. 16 is a diagram for explaining setting of a reference line on the results of semantic segmentation.
図１７は、基準線に基づくクラスタリングを説明する図である。 FIG. 17 is a diagram illustrating clustering based on a reference line.
図１８は、ラベル修正を説明する図である。 FIG. 18 is a diagram for explaining label correction.
図１９は、商品棚エリアの設定を説明する図である。 FIG. 19 is a diagram for explaining the setting of a product shelf area.
図２０は、合成データの生成を説明する図である。 FIG. 20 is a diagram for explaining the generation of synthetic data.
図２１は、合成データの生成を説明する図である。 FIG. 21 is a diagram for explaining the generation of synthetic data.
図２２は、カメラパラメータの推定を説明する図（１）である。 FIG. 22 is a diagram (1) for explaining the estimation of camera parameters.
図２３は、カメラパラメータの推定を説明する図（２）である。 FIG. 23 is a diagram (2) for explaining the estimation of camera parameters.
図２４は、カメラパラメータの推定を説明する図（３）である。 FIG. 24 is a diagram (3) for explaining the estimation of camera parameters.
図２５は、各種モデルの訓練への適用を説明する図である。 FIG. 25 is a diagram illustrating the application to training of various models.
図２６は、実施例１にかかる合成データの生成処理の流れを示すフローチャートである。 FIG. 26 is a flowchart of the synthetic data generation process according to the first embodiment.
図２７は、実施例２にかかる情報処理装置を説明する図である。 FIG. 27 is a diagram illustrating an information processing apparatus according to the second embodiment.
図２８は、実施例２にかかる情報処理装置の機能構成を示す機能ブロック図である。 FIG. 28 is a functional block diagram of the information processing apparatus according to the second embodiment.
図２９は、行動ルールＤＢを説明する図である。 FIG. 29 is a diagram for explaining the behavior rule DB.
図３０は、行動ルールに基づく合成データの生成を説明する図である。 FIG. 30 is a diagram for explaining generation of synthetic data based on behavior rules.
図３１は、実施例２にかかる合成データを用いた機械学習モデルの評価処理の流れを示すフローチャートである。 FIG. 31 is a flowchart illustrating the flow of the evaluation process of the machine learning model using synthetic data according to the second embodiment.
図３２は、３Ｄアバターの配置例を説明する図である。 FIG. 32 is a diagram illustrating an example of the arrangement of 3D avatars.
図３３は、ハードウェア構成例を説明する図である。 FIG. 33 is a diagram illustrating an example of a hardware configuration.

以下に、本願の開示する生成プログラム、生成方法および情報処理装置の実施例を図面に基づいて詳細に説明する。なお、この実施例によりこの発明が限定されるものではない。また、各実施例は、矛盾のない範囲内で適宜組み合わせることができる。 Below, examples of the generation program, generation method, and information processing device disclosed in the present application will be described in detail with reference to the drawings. Note that the present invention is not limited to these examples. Furthermore, the respective examples can be appropriately combined within a range that does not cause inconsistencies.

＜情報処理装置の説明＞ <Description of Information Processing Device>
図１は、実施例１にかかる情報処理装置１０を説明する図である。図１に示す情報処理装置１０は、人物を検出する人物検出モデル、性別やユニフォームの着衣有無などを推定する属性推定モデル、人物の姿勢を検出する姿勢推定モデルなどの各種機械学習モデルの訓練に使用可能なデータを生成するコンピュータの一例である。 FIG. 1 is a diagram illustrating an information processing device 10 according to Example 1. The information processing device 10 illustrated in FIG. 1 is an example of a computer that generates data that can be used for training various machine learning models, such as a person detection model that detects a person, an attribute estimation model that estimates gender, whether or not a person is wearing a uniform, and the like, and a posture estimation model that detects the posture of a person.

近年、不審者の検出、体調不良者の検出、不審な行動の検出などの行動認識を行う機械学習モデルが利用されている。しかし、現場環境と訓練環境との違い、訓練データの充実度、現場で発生するオクルージョンなどにより、行動認識の精度が低下する。 In recent years, machine learning models have been used to recognize behaviors such as detecting suspicious individuals, people who are unwell, and suspicious behavior. However, the accuracy of behavior recognition decreases due to differences between the on-site environment and the training environment, the quality of the training data, occlusions that occur in the field, and other factors.

例えば、店内の画像データに対して店員か否かの識別処理を行う場合、あらかじめエプロン着用の店員の画像データを学習した機械学習モデルを用いた識別処理が行われる。しかし、現場となる売り場によってエプロンが異なり、訓練データに少量しか含まれない使用頻度の少ないエプロンや、人物姿勢（後ろ向き、横向き、しゃがみなど）によっては、店員と判別できず接客者検知での虚報が多い。 For example, when determining whether a person in in-store image data is a store clerk, the identification is performed using a machine learning model trained in advance on image data of store clerks wearing aprons. However, aprons differ from one sales floor to another, and a clerk often cannot be identified when wearing an infrequently used apron that appears only rarely in the training data, or when in certain postures (facing away, facing sideways, crouching, etc.), which results in many false alarms in customer-service detection.

また、工場の画像データに対して作業員の検出処理を行う場合、様々な行動を学習した機械学習モデルを用いた検出処理が行われる。しかし、検出したい行動が一般的な訓練データセットに含まれていない姿勢だと人物検知や姿勢推定に失敗する。 In addition, when detecting workers from factory image data, the detection process is carried out using a machine learning model that has learned various behaviors. However, if the behavior to be detected is a pose that is not included in the general training dataset, human detection and pose estimation will fail.

一方で、全ての現場環境や全てのオクルージョンを想定した教師有りの訓練データを用意することは現実的ではない。 On the other hand, it is not realistic to prepare supervised training data that takes into account all field environments and all occlusions.

そこで、実施例１にかかる情報処理装置１０は、様々な環境で撮像された画像データを用いて、現場環境で想定される姿勢や属性の人物を含む合成データを生成することで、行動検知の精度を向上させる。 Therefore, the information processing device 10 according to the first embodiment improves the accuracy of behavior detection by using image data captured in various environments to generate synthetic data including people with postures and attributes expected in the field environment.

具体的には、図１に示すように、情報処理装置１０は、人物を撮影した第一の画像データと、所定の場所を撮影した第二の画像データとを取得する。情報処理装置１０は、取得した第一の画像データを第一の機械学習モデルに入力することで、人物の形状と形状に対して付与されるテクスチャから成る３次元の人物モデルを生成する。また、情報処理装置１０は、取得した第二の画像データに対して機械学習モデルによる推定や画像解析を行うことで、人物が行動する領域を特定する。その後、情報処理装置１０は、カメラパラメータを用いて、特定された領域に３次元の人物モデルが配置された合成データを生成する。 Specifically, as shown in FIG. 1, the information processing device 10 acquires first image data of a person and second image data of a predetermined location. The information processing device 10 inputs the acquired first image data into a first machine learning model to generate a three-dimensional human model consisting of the shape of the person and a texture assigned to the shape. The information processing device 10 also performs estimation and image analysis using the machine learning model on the acquired second image data to identify the area in which the person is active. The information processing device 10 then uses the camera parameters to generate composite data in which a three-dimensional human model is placed in the identified area.
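The last step above, placing a three-dimensional person model into image data using camera parameters, can be illustrated with a small projection sketch. The source only states that camera parameters are used; the pinhole camera model, the parameter names (`fx`, `fy`, `cx`, `cy`, `R`, `t`), and the function `project_points` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def project_points(
    points_world: Sequence[Point3],
    R: Sequence[Sequence[float]],  # 3x3 rotation, world -> camera (assumed)
    t: Sequence[float],            # translation, world -> camera (assumed)
    fx: float, fy: float,          # focal lengths in pixels (assumed)
    cx: float, cy: float,          # principal point in pixels (assumed)
) -> List[Point2]:
    """Project 3D points to pixel coordinates with a pinhole model."""
    pixels: List[Point2] = []
    for p in points_world:
        # Transform the point into the camera coordinate system.
        xc = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        if xc[2] <= 0:
            continue  # points behind the camera are not visible
        # Perspective division followed by the intrinsic mapping.
        pixels.append((fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy))
    return pixels


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    avatar_vertices = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
    print(project_points(avatar_vertices, identity, [0.0, 0.0, 0.0],
                         100.0, 100.0, 320.0, 240.0))
```

A full compositing step would also render the textured mesh and handle occlusion and lens distortion, none of which this sketch attempts.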

情報処理装置１０は、このようにして生成された合成データに基づき、２次元画像データおよびラベル（正解情報）が付加された訓練データを用いて、各種機械学習モデルの訓練を行うことで、現場での行動検知の精度を向上させることができる。なお、実施例では、画像データを単に「画像」と表記することがある。 The information processing device 10 can improve the accuracy of on-site behavior detection by training various machine learning models using training data to which 2D image data and labels (correct answer information) have been added based on the composite data generated in this manner. Note that in the embodiments, image data may be referred to simply as "image."

＜情報処理装置の機能構成＞ <Functional Configuration of Information Processing Device>
図２は、実施例１にかかる情報処理装置１０の機能構成を示す機能ブロック図である。図２に示すように、情報処理装置１０は、通信部１１、記憶部２０、制御部３０を有する。 FIG. 2 is a functional block diagram showing a functional configuration of the information processing device 10 according to the first embodiment. As shown in FIG. 2, the information processing device 10 includes a communication unit 11, a storage unit 20, and a control unit 30.

通信部１１は、他の装置との間の通信を制御する処理部であり、例えば通信インタフェースなどにより実現される。例えば、通信部１１は、人物が写っている画像データである人物画像データを受信し、店舗内に設置された複数のカメラから店舗における所定の場所が写っている画像データである場所画像データを受信する。 The communication unit 11 is a processing unit that controls communication with other devices, and is realized, for example, by a communication interface. For example, the communication unit 11 receives person image data, which is image data showing a person, and receives location image data, which is image data showing a specific location in the store, from multiple cameras installed in the store.

なお、人物画像データは、店舗で撮像された画像データに限ったものではなく、想定される各人物が写っていればどこで撮像されかを問わない。また、場所画像データは、実施例１では店舗内の画像データを例にして説明するが、これに限定されるものではなく、想定される現場の画像データであればよい。 Note that the person image data is not limited to image data taken in a store, and it does not matter where the image was taken as long as each anticipated person is captured. Also, while the location image data is described in Example 1 using image data inside a store as an example, it is not limited to this, and may be image data of an anticipated location.

記憶部２０は、各種データや制御部３０が実行するプログラムなどを記憶する処理部の一例であり、例えばメモリやハードディスクなどにより実現される。この記憶部２０は、訓練データＤＢ２１、人物画像データＤＢ２２、場所画像データＤＢ２３、３Ｄ生成モデル２４、領域抽出モデル２５、骨格推定モデル２６を記憶する。 The storage unit 20 is an example of a processing unit that stores various data and programs executed by the control unit 30, and is realized by, for example, a memory or a hard disk. The storage unit 20 stores a training data DB 21, a person image data DB 22, a location image data DB 23, a 3D generation model 24, an area extraction model 25, and a skeleton estimation model 26.

訓練データＤＢ２１は、３Ｄ生成モデル２４、領域抽出モデル２５、骨格推定モデル２６の訓練に使用する訓練データを記憶するデータベースである。各訓練データは、説明変数である画像データと、目的変数であるラベル（正解情報）とが対応付けられたデータである。 The training data DB 21 is a database that stores training data used to train the 3D generation model 24, the area extraction model 25, and the skeleton estimation model 26. Each training data is data in which image data, which is an explanatory variable, is associated with a label (correct answer information), which is an objective variable.

人物画像データＤＢ２２は、人物が撮像された画像データである人物画像データを記憶するデータベースである。例えば、人物画像データＤＢ２２が記憶する人物画像データには、想定される現場で撮像された人物画像データと、様々な場所で撮像された人物画像データとが含まれてよい。 The person image data DB22 is a database that stores person image data, which is image data of a person. For example, the person image data stored in the person image data DB22 may include person image data captured at an expected location and person image data captured at various locations.

場所画像データＤＢ２３は、人物検出、属性推定、骨格検知などを行う各現場を撮像した画像データである場所画像データを記憶するデータベースである。例えば、場所画像データＤＢ２３が記憶する場所画像データには、同じ現場の場所画像データと、様々な場所の場所画像データとが含まれてよい。なお、現場とは、例えば店舗の場合、人物が商品を取る商品棚を含む場所、人物が歩く通路の場所、人物が商品を購入する会計機を含む場所などが該当する。 The location image data DB23 is a database that stores location image data, which is image data captured at each location where person detection, attribute estimation, skeletal detection, etc. are performed. For example, the location image data stored in the location image data DB23 may include location image data of the same location and location image data of various locations. Note that in the case of a store, for example, a location may be a location including a product shelf where a person picks up products, an aisle where a person walks, or a location including a cash register where a person purchases products.

３Ｄ生成モデル２４は、人物の画像データを基にして、３Ｄアバターを生成する機械学習モデルである。たとえば、３Ｄ生成モデル２４は、「Mesh Graphormer」と、「Texformer」との機能を有する。Mesh Graphormerは、１つのフレームの人物の領域の画像データを基にして、人物の姿勢や、人物の各頂点を推定し、３次元の人物のメッシュモデルを生成する。Texformerは、１つのフレームの人物の領域の画像データを基にして、人物のテクスチャを生成する。３Ｄ生成モデル２４は、Mesh Graphormerによって推定された３次元の人物のメッシュモデルに、Texformerによって生成されたテクスチャを設定することで、３Ｄアバターを生成する。 The 3D generative model 24 is a machine learning model that generates a 3D avatar based on image data of a person. For example, the 3D generative model 24 has the functions of "Mesh Graphormer" and "Texformer." Mesh Graphormer estimates a person's posture and each vertex of the person based on image data of the person's area in one frame, and generates a three-dimensional mesh model of the person. Texformer generates a texture for the person based on image data of the person's area in one frame. The 3D generative model 24 generates a 3D avatar by setting the texture generated by Texformer to the three-dimensional mesh model of the person estimated by Mesh Graphormer.

領域抽出モデル２５は、セマンティックセグメンテーションを実行することで、人物が行動する領域を含む各種領域の抽出を実行する機械学習モデルである。具体的には、領域抽出モデル２５は、ＲＧＢの画像データの入力に応じて、セグメンテーション結果を出力する。セグメンテーション結果には、画像データ内の各領域に対して、識別されたラベルが設定される。例えば、領域抽出モデル２５には、convolutional encoder-decoderなどを採用することができる。 The area extraction model 25 is a machine learning model that performs semantic segmentation to extract various areas, including areas where people are active. Specifically, the area extraction model 25 outputs a segmentation result in response to input RGB image data. In the segmentation result, an identified label is set for each area in the image data. For example, a convolutional encoder-decoder or the like can be adopted for the area extraction model 25.
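To make the shape of a segmentation result concrete, the sketch below treats it as a per-pixel grid of labels and collects the pixels that carry one label. The label names (`"aisle"`, `"shelf"`) and the helper `region_pixels` are hypothetical; the source does not specify the output encoding of the area extraction model 25.

```python
from typing import List, Sequence, Tuple


def region_pixels(
    label_map: Sequence[Sequence[str]], target_label: str
) -> List[Tuple[int, int]]:
    """Return (x, y) coordinates of every pixel assigned target_label."""
    return [
        (x, y)
        for y, row in enumerate(label_map)
        for x, label in enumerate(row)
        if label == target_label
    ]


if __name__ == "__main__":
    # A toy 3x2 segmentation result with hypothetical labels.
    segmentation = [
        ["shelf", "aisle", "aisle"],
        ["shelf", "shelf", "aisle"],
    ]
    print(region_pixels(segmentation, "aisle"))
```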

骨格推定モデル２６は、画像フレームの人物の領域（全身画像）を入力すると、該当する人物の骨格データを出力する機械学習モデルである。骨格推定モデル２６は、OpenPose等の機械学習モデルで実現することができる。 The skeleton estimation model 26 is a machine learning model that outputs skeleton data of a person when a region of the person in an image frame (a whole-body image) is input. The skeleton estimation model 26 can be realized by a machine learning model such as OpenPose.

また、骨格推定モデル２６には、動作解析を実行する機械学習モデルを用いることもできる。具体的には、骨格推定モデル２６には、人物の２次元画像データに対して、頭、手首、腰、足首などの２次元の関節位置（骨格座標）を推定し、基本となる動作の認識やユーザが定義したルールの認識を行う、訓練済みの深層学習器を採用することができる。この骨格推定モデル２６を用いることで、人物の基本動作を認識することができ、足首の位置、顏の向き、身体の向きを取得することができる。 A machine learning model that performs motion analysis can also be used as the skeleton estimation model 26. Specifically, the skeleton estimation model 26 can be a trained deep learning model that estimates two-dimensional joint positions (skeletal coordinates) of the head, wrists, waist, ankles, etc. from two-dimensional image data of a person, and that recognizes basic movements and user-defined rules. Using this skeleton estimation model 26 makes it possible to recognize the basic movements of a person and to obtain the position of the ankles, the direction of the face, and the direction of the body.

制御部３０は、情報処理装置１０全体を司る処理部であり、例えばプロセッサなどによる実現される。この制御部３０は、事前学習部３１、取得部３２、人物モデル生成部３３、領域特定部３４、合成データ生成部３５、機械学習部３６を有する。なお、事前学習部３１、取得部３２、人物モデル生成部３３、領域特定部３４、合成データ生成部３５、機械学習部３６は、プロセッサが有する電子回路やプロセッサが実行するプロセスなどにより実現される。 The control unit 30 is a processing unit that controls the entire information processing device 10, and is realized by, for example, a processor. This control unit 30 has a pre-learning unit 31, an acquisition unit 32, a person model generation unit 33, an area identification unit 34, a synthetic data generation unit 35, and a machine learning unit 36. Note that the pre-learning unit 31, the acquisition unit 32, the person model generation unit 33, the area identification unit 34, the synthetic data generation unit 35, and the machine learning unit 36 are realized by electronic circuits possessed by the processor, processes executed by the processor, etc.

（事前学習） (Pre-learning)
事前学習部３１は、訓練データＤＢ２１に記憶される各訓練データを用いて、３Ｄ生成モデル２４、領域抽出モデル２５、骨格推定モデル２６を生成する処理部である。なお、ここでは、情報処理装置１０が上記各機械学習モデルを生成する例で説明するが、これに限定されるものではなく、他の装置で生成された訓練済みの各機械学習モデルを用いることもできる。 The pre-learning unit 31 is a processing unit that generates a 3D generation model 24, a region extraction model 25, and a skeleton estimation model 26 by using each training data stored in the training data DB 21. Note that, although an example in which the information processing device 10 generates each of the above machine learning models will be described here, the present invention is not limited to this, and each trained machine learning model generated by another device can also be used.

図３は、３Ｄ生成モデルの訓練を説明する図である。図３に示すように、事前学習部３１は、説明変数である人物が写っている画像データと、目的変数である３Ｄアバターを含む訓練データを３Ｄ生成モデル２４に入力する。そして、事前学習部３１は、３Ｄ生成モデル２４が「Mesh Graphormer」から出力された３次元の人物のメッシュモデルと、「Texformer」から出力された人物のテクスチャとを合成することで生成した３Ｄアバターを取得する。その後、事前学習部３１は、目的変数である３Ｄモデルと、３Ｄ生成モデル２４の出力結果である３Ｄモデルとの誤差が最小化するように、３Ｄ生成モデル２４のパラメータ更新を行うことで、３Ｄ生成モデル２４の訓練を実行する。 FIG. 3 is a diagram illustrating the training of a 3D generative model. As shown in FIG. 3, the pre-learning unit 31 inputs training data including image data of a person, which is an explanatory variable, and a 3D avatar, which is an objective variable, to the 3D generative model 24. The pre-learning unit 31 then obtains a 3D avatar generated by the 3D generative model 24 by synthesizing a three-dimensional mesh model of a person output from "Mesh Graphormer" and a texture of the person output from "Texformer". The pre-learning unit 31 then performs training of the 3D generative model 24 by updating the parameters of the 3D generative model 24 so as to minimize the error between the 3D model, which is the objective variable, and the 3D model, which is the output result of the 3D generative model 24.

図４は、領域抽出モデル２５の訓練を説明する図である。図４に示すように、事前学習部３１は、説明変数であるＲＧＢの画像データと、目的変数であるセグメンテーション結果とを含む訓練データを領域抽出モデル２５に入力し、出力結果（セグメンテーション結果）を取得する。そして、事前学習部３１は、訓練データの目的変数と出力結果との誤差が最小化するように、領域抽出モデル２５のパラメータ更新を行うことで、領域抽出モデル２５の訓練を実行する。 Figure 4 is a diagram explaining the training of the area extraction model 25. As shown in Figure 4, the pre-learning unit 31 inputs training data including RGB image data, which is an explanatory variable, and a segmentation result, which is an objective variable, to the area extraction model 25, and obtains an output result (segmentation result). Then, the pre-learning unit 31 performs training of the area extraction model 25 by updating the parameters of the area extraction model 25 so as to minimize the error between the objective variable of the training data and the output result.

図５は、骨格推定モデル２６の訓練を説明する図である。図５に示すように、事前学習部３１は、説明変数である画像データと、目的変数である骨格データとを含む訓練データを骨格推定モデル２６に入力し、出力結果（骨格認識結果）を取得する。そして、事前学習部３１は、訓練データの目的変数と出力結果との誤差が最小化するように、骨格推定モデル２６のパラメータ更新を行うことで、骨格推定モデル２６の訓練を実行する。なお、骨格データには、２次元の関節位置（骨格座標）、足首の位置、顏の向き、身体の向きや動作などを含めることができる。 Figure 5 is a diagram illustrating the training of the skeleton estimation model 26. As shown in Figure 5, the pre-learning unit 31 inputs training data including image data, which is an explanatory variable, and skeleton data, which is an objective variable, to the skeleton estimation model 26 and obtains an output result (skeleton recognition result). The pre-learning unit 31 then performs training of the skeleton estimation model 26 by updating the parameters of the skeleton estimation model 26 so as to minimize the error between the objective variable of the training data and the output result. Note that the skeleton data can include two-dimensional joint positions (skeletal coordinates), ankle positions, face orientation, body orientation and movement, etc.
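The three training procedures above share one pattern: feed the explanatory variable to the model, compare the output with the objective variable, and update the parameters so that the error shrinks. The sketch below shows that pattern on a deliberately tiny stand-in, a one-parameter linear model trained with gradient descent on mean squared error. The model, the loss, the learning rate, and the function name `train` are all assumptions; the source does not state which loss or optimizer is used.

```python
from typing import Sequence


def train(
    inputs: Sequence[float],
    targets: Sequence[float],
    learning_rate: float = 0.05,
    steps: int = 100,
) -> float:
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # single model parameter, initialized arbitrarily
    n = len(inputs)
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(inputs, targets)) / n
        w -= learning_rate * grad  # parameter update that reduces the error
    return w


if __name__ == "__main__":
    # Explanatory variables and objective variables of a toy dataset (y = 2x).
    print(train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Real models such as a segmentation network have many parameters and use automatic differentiation, but the update step has the same form.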

（データ取得） (Data Acquisition)
取得部３２は、人物を撮影した人物画像データと、所定の場所を撮影した場所画像データとを取得する処理部である。例えば、取得部３２は、人物検出、属性推定、骨格検知などを行う各現場で撮像された人物画像データや場所画像データを取得して各ＤＢに格納する。また、取得部３２は、各画像データをインターネット等から取得してもよい。 The acquisition unit 32 is a processing unit that acquires person image data obtained by photographing a person and location image data obtained by photographing a predetermined location. For example, the acquisition unit 32 acquires person image data and location image data captured at each site where person detection, attribute estimation, skeletal detection, etc. are performed, and stores the data in each DB. The acquisition unit 32 may also acquire each image data from the Internet, etc.

（３Ｄアバターの生成） (3D Avatar Generation)
人物モデル生成部３３は、人物画像データを３Ｄ生成モデル２４に入力することで、人物の形状と形状に対して付与されるテクスチャから成る３次元の人物モデルを生成する処理部である。具体的には、人物モデル生成部３３は、人物画像データＤＢ２２に記憶される各人物画像データに対して機械学習モデルや画像解析を行うことで、各人物画像データに写っている人物から３次元の人物モデルの一例である３Ｄアバターを生成し、生成された３Ｄアバターを記憶部２０に格納する。 The person model generation unit 33 is a processing unit that generates a three-dimensional person model consisting of a person's shape and a texture assigned to the shape by inputting person image data to the 3D generation model 24. Specifically, the person model generation unit 33 applies a machine learning model or image analysis to each piece of person image data stored in the person image data DB 22 to generate a 3D avatar, which is an example of a three-dimensional person model, from the person depicted in that image data, and stores the generated 3D avatar in the storage unit 20.

図６は、３Ｄアバターの生成を説明する図である。人物モデル生成部３３は、映像データに含まれるフレーム５０を取得し、フレーム５０の人物の領域５０ａを特定する。人物の領域は、たとえば、Bounding Boxに対応する領域となる。人物モデル生成部３３は、人物の領域の画像を基にして、人物の「骨格情報」および「属性情報」を推定する。骨格情報は、人物の各関節の位置が設定された情報である。属性情報には、人物の年代、性別、体形、髪型、服装等が含まれる。 Figure 6 is a diagram explaining the generation of a 3D avatar. The person model generation unit 33 acquires a frame 50 included in the video data, and identifies a person's area 50a in the frame 50. The person's area is, for example, an area corresponding to a bounding box. The person model generation unit 33 estimates the person's "skeletal information" and "attribute information" based on the image of the person's area. The skeletal information is information in which the position of each of the person's joints is set. The attribute information includes the person's age, sex, body shape, hairstyle, clothing, etc.
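Extracting the person's area from a frame can be sketched as a bounding-box crop. The frame representation (a list of pixel rows), the box convention `(x_min, y_min, x_max, y_max)` with exclusive upper bounds, and the function `crop_person_region` are assumptions for illustration; the source only says that the area corresponds to a bounding box.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def crop_person_region(frame: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Return the sub-image of frame covered by box, clamped to the frame."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    x_min, y_min, x_max, y_max = box
    # Clamp the box so that a detection partly outside the frame still works.
    x0, x1 = max(0, x_min), min(width, x_max)
    y0, y1 = max(0, y_min), min(height, y_max)
    return [list(row[x0:x1]) for row in frame[y0:y1]]


if __name__ == "__main__":
    frame = [
        [0, 1, 2],
        [3, 4, 5],
        [6, 7, 8],
    ]
    print(crop_person_region(frame, (1, 0, 3, 2)))
```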

人物モデル生成部３３は、人物の領域５０ａの画像を、３Ｄ生成モデル２４に入力することで、３Ｄアバターａｖ１を生成する。人物モデル生成部３３は、３Ｄアバターａｖ１の頭部の部位、手の部位を低解像度化する。また、人物モデル生成部３３は、骨格情報を基にして、人物の動作が歩行であると判定した場合には、３Ｄアバターａｖ１の所定の部位を移動させることで、３Ｄアバターａｖ１の歩行姿勢を匿名化する。人物モデル生成部３３が、上記の処理を実行することで、３Ｄアバターａｖ２が生成される。 The person model generation unit 33 generates a 3D avatar av1 by inputting an image of the person's area 50a into the 3D generation model 24. The person model generation unit 33 reduces the resolution of the head and hand parts of the 3D avatar av1. Furthermore, when the person model generation unit 33 determines that the person's movement is walking based on the skeletal information, it anonymizes the walking posture of the 3D avatar av1 by moving a specific part of the 3D avatar av1. The person model generation unit 33 executes the above process to generate a 3D avatar av2.

ここで、３Ｄアバターの生成処理を具体的に説明する。具体的には、人物モデル生成部３３は、歩行動作の判定処理、３Ｄアバターを生成する処理、３Ｄアバターの特徴量を変換する処理、３Ｄアバターの歩行姿勢を匿名化する処理、変換映像データを生成する処理を実行する。 Here, the process of generating a 3D avatar will be described in detail. Specifically, the person model generation unit 33 executes a process of determining a walking motion, a process of generating a 3D avatar, a process of converting the feature quantities of the 3D avatar, a process of anonymizing the walking posture of the 3D avatar, and a process of generating converted video data.

まず、歩行動作の判定処理について説明する。図７は、歩行の動作判定を説明するための図である。たとえば、人物モデル生成部３３は、図７に示す骨格情報を、人物の姿勢を判定する訓練済みの姿勢判定モデルなどに入力することで、人物の動作を判定する。人物モデル生成部３３は、骨格推定モデル２６など用いて推定された骨格情報をそのまま利用してもよいし、図７の骨格情報ＳＫ１０のように、一部の関節位置を抽出して、利用してもよい。骨格情報ＳＫ１０には、関節ｐ１，ｐ２，ｐ３，ｐ４，ｐ５，ｐ６、ｐ７，ｐ８が含まれる。 First, the process of determining the walking movement will be described. FIG. 7 is a diagram for explaining the determination of the walking movement. For example, the person model generation unit 33 determines the movement of the person by inputting the skeletal information shown in FIG. 7 into a trained posture determination model that determines the posture of the person. The person model generation unit 33 may use the skeletal information estimated with the skeleton estimation model 26 or the like as is, or may extract and use only some of the joint positions, as in the skeletal information SK10 in FIG. 7. The skeletal information SK10 includes joints p1, p2, p3, p4, p5, p6, p7, and p8.

関節ｐ１は、左肩の関節である。関節ｐ２は、右肩の関節である。関節ｐ３は、左腰の関節である。関節ｐ４は、右腰の関節である。関節ｐ５は、左膝の関節である。関節ｐ６は、右膝の関節である。関節ｐ７は、左足首の関節である。関節ｐ８は、右足首の関節である。たとえば、姿勢判定モデルＭ４は、骨格情報ＳＫ１０が入力されると、関節ｐ３，ｐ４，ｐ５，ｐ６の角度のパターンに応じて、姿勢を、立つ、歩く、しゃがむ、座る、寝る等の何れかに分類する。 Joint p1 is the left shoulder joint. Joint p2 is the right shoulder joint. Joint p3 is the left hip joint. Joint p4 is the right hip joint. Joint p5 is the left knee joint. Joint p6 is the right knee joint. Joint p7 is the left ankle joint. Joint p8 is the right ankle joint. For example, when skeletal information SK10 is input, posture determination model M4 classifies the posture into one of standing, walking, crouching, sitting, lying down, etc., depending on the angle pattern of joints p3, p4, p5, and p6.
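The idea of classifying posture from the angles at the hip joints (p3, p4) and knee joints (p5, p6) can be sketched as follows. In the source this classification is done by a trained posture determination model; the threshold rules below are a hand-written stand-in, and the thresholds, the reduced label set, and the function names are hypothetical.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def angle_at(vertex: Point, a: Point, b: Point) -> float:
    """Angle in degrees at vertex between the segments vertex-a and vertex-b."""
    v1 = (a[0] - vertex[0], a[1] - vertex[1])
    v2 = (b[0] - vertex[0], b[1] - vertex[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # degenerate skeleton: joints coincide
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def hip_knee_angles(sk: Dict[str, Point]) -> Dict[str, float]:
    """Angles at p3/p4 (hips) and p5/p6 (knees) from joints p1..p8."""
    return {
        "p3": angle_at(sk["p3"], sk["p1"], sk["p5"]),  # left hip
        "p4": angle_at(sk["p4"], sk["p2"], sk["p6"]),  # right hip
        "p5": angle_at(sk["p5"], sk["p3"], sk["p7"]),  # left knee
        "p6": angle_at(sk["p6"], sk["p4"], sk["p8"]),  # right knee
    }


def classify_posture(sk: Dict[str, Point]) -> str:
    """Toy rule-based classifier; thresholds are illustrative only."""
    a = hip_knee_angles(sk)
    knees = min(a["p5"], a["p6"])
    hips = min(a["p3"], a["p4"])
    if knees > 160.0 and hips > 160.0:
        return "stand"
    if knees < 100.0 and hips < 100.0:
        return "crouch"
    return "other"


if __name__ == "__main__":
    upright = {
        "p1": (0.0, 0.0), "p2": (1.0, 0.0),  # shoulders
        "p3": (0.0, 2.0), "p4": (1.0, 2.0),  # hips
        "p5": (0.0, 3.0), "p6": (1.0, 3.0),  # knees
        "p7": (0.0, 4.0), "p8": (1.0, 4.0),  # ankles
    }
    print(classify_posture(upright))
```

Distinguishing walking from standing generally needs more than one static angle pattern, which is one reason a trained model is the more realistic choice here.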

次に、人物モデル生成部３３が実行する３Ｄアバターを生成する処理について説明する。人物モデル生成部３３は、映像データのフレームに含まれる人物の領域を特定し、特定した人物の領域の画像を、３Ｄ生成モデル２４に入力することで、３Ｄアバターを生成する。たとえば、人物モデル生成部３３は、人物の領域の画像を、Mesh Graphormerに入力し、３次元の人物のメッシュモデルを生成する。人物モデル生成部３３は、人物の領域の画像を、Texformerに入力して、人物のテクスチャを生成する。人物モデル生成部３３は、Mesh Graphormerによって推定された３次元の人物のメッシュモデルに、Texformerによって生成されたテクスチャを設定することで、３Ｄアバターを生成する。人物モデル生成部３３は、１つのフレームから、かかるフレームに含まれる人物の３Ｄアバターを生成することが可能である。 Next, the process of generating a 3D avatar executed by the person model generation unit 33 will be described. The person model generation unit 33 identifies a person's area included in a frame of video data, and generates a 3D avatar by inputting an image of the identified person's area into the 3D generation model 24. For example, the person model generation unit 33 inputs the image of the person's area into Mesh Graphormer to generate a three-dimensional mesh model of the person. The person model generation unit 33 inputs the image of the person's area into Texformer to generate a texture for the person. The person model generation unit 33 generates a 3D avatar by setting the texture generated by Texformer to the three-dimensional mesh model of the person estimated by Mesh Graphormer. The person model generation unit 33 is capable of generating a 3D avatar of a person included in one frame from the frame.

図８は、フレームから生成される３Ｄアバターの一例を示す図である。たとえば、人物モデル生成部３３が、フレーム６０の人物の領域の画像を、３Ｄ生成モデル２４に入力することで、３Ｄアバター６０ａが生成される。人物モデル生成部３３が、フレーム６１の人物の領域の画像を、３Ｄ生成モデル２４に入力することで、３Ｄアバター６１ａが生成される。人物モデル生成部３３が、フレーム６２の人物の領域の画像を、３Ｄ生成モデル２４に入力することで、３Ｄアバター６２ａが生成される。人物モデル生成部３３が、フレーム６３の人物の領域の画像を、３Ｄ生成モデル２４に入力することで、３Ｄアバター６３ａが生成される。 Figure 8 is a diagram showing an example of a 3D avatar generated from a frame. For example, the person model generation unit 33 inputs an image of the person's area in frame 60 to the 3D generation model 24, generating 3D avatar 60a. The person model generation unit 33 inputs an image of the person's area in frame 61 to the 3D generation model 24, generating 3D avatar 61a. The person model generation unit 33 inputs an image of the person's area in frame 62 to the 3D generation model 24, generating 3D avatar 62a. The person model generation unit 33 inputs an image of the person's area in frame 63 to the 3D generation model 24, generating 3D avatar 63a.

続いて、人物モデル生成部３３が実行する３Ｄアバターの特徴量を変換する処理について説明する。人物モデル生成部３３は、フレーム番号ｎのフレームから推定された骨格情報と、フレーム番号ｎのフレームから生成した３Ｄアバターとを基にして、３Ｄアバターを構成する複数の部位のうち、特徴量を変換する部位を特定する。たとえば、人物モデル生成部３３は、骨格情報と、３Ｄアバターとを重ねて配置し、骨格情報の頭、手首の関節位置を基準として、３Ｄアバターの頭（顔、耳介を含む）の部位と、手の部位を特定する。 Next, the process of converting the features of a 3D avatar executed by the person model generation unit 33 will be described. Based on the skeletal information estimated from the frame with frame number n and the 3D avatar generated from the frame with frame number n, the person model generation unit 33 identifies the parts of the 3D avatar whose features are to be converted, out of the multiple parts that make up the 3D avatar. For example, the person model generation unit 33 arranges the skeletal information and the 3D avatar so that they are superimposed, and identifies the head (including the face and ears) and hand parts of the 3D avatar based on the joint positions of the head and wrist in the skeletal information.
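One simple way to pick out the avatar parts around a joint is to select the mesh vertices that lie within some distance of that joint once the skeleton and the avatar are overlaid. This is only a sketch of that idea: the radius-based selection, the radius value, and the function `vertices_near_joint` are assumptions, since the source does not describe how the parts are delimited.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def vertices_near_joint(
    vertices: Sequence[Point3], joint: Point3, radius: float
) -> List[int]:
    """Indices of mesh vertices within radius of the given joint position."""
    return [i for i, v in enumerate(vertices) if math.dist(v, joint) <= radius]


if __name__ == "__main__":
    mesh = [(0.0, 1.7, 0.0), (0.0, 1.6, 0.1), (0.0, 1.0, 0.0), (0.4, 1.0, 0.0)]
    head_joint = (0.0, 1.65, 0.0)
    # Hypothetical radius of 0.2 (in the same units as the mesh).
    print(vertices_near_joint(mesh, head_joint, 0.2))
```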

続いて、人物モデル生成部３３は、３Ｄアバターの手の部位、頭の部位を低解像度化する（ぼかす）。また、人物モデル生成部３３は、３Ｄアバターの頭部の部位の位置を、所定の方向へ所定の距離ずらす。所定の方向、所定の距離は、予め設定される。人物モデル生成部３３が、かかる処理を実行することで、３Ｄアバターの特徴量を変換する。 Next, the person model generation unit 33 reduces the resolution of (blurs) the hand and head parts of the 3D avatar. The person model generation unit 33 also shifts the position of the head part of the 3D avatar by a predetermined distance in a predetermined direction. The predetermined direction and the predetermined distance are set in advance. By performing this processing, the person model generation unit 33 converts the features of the 3D avatar.

なお、人物モデル生成部３３は、アバターの頭の部位の特徴と類似する頭部のパーツを、予め記憶する情報から選択し、選択した頭部のパーツによって、アバターの頭部の部位を置き換えることで、３Ｄアバターの特徴量を変換してもよい。人物モデル生成部３３は、画像データの識別番号と対応付けて、特徴量を変換した３Ｄアバターの情報を、記憶部２０に登録する。 The person model generation unit 33 may convert the features of the 3D avatar by selecting a head part similar to the features of the avatar's head part from pre-stored information and replacing the avatar's head part with the selected head part. The person model generation unit 33 registers the information of the 3D avatar with the converted features in the storage unit 20 in association with the identification number of the image data.

次に、人物モデル生成部３３が、３Ｄアバターの歩行姿勢を匿名化する処理について説明する。人物モデル生成部３３は、歩行検出情報に設定されるフレーム番号のフレームから生成した３Ｄアバターを選択し、選択した３Ｄアバターの歩行姿勢を匿名化する。なお、歩行検出情報とは、時系列の骨格情報から、一つの骨格情報を取得し、骨格情報を姿勢判定モデルに入力することで、人物の動作が「歩行」であると判定されたフレームを特定する情報であり、人物モデル生成部３３によって生成される。 Next, the process by which the person model generation unit 33 anonymizes the walking posture of a 3D avatar will be described. The person model generation unit 33 selects a 3D avatar generated from the frame with the frame number set in the walking detection information, and anonymizes the walking posture of the selected 3D avatar. Note that the walking detection information is information that specifies a frame in which the movement of a person is determined to be "walking" by acquiring one piece of skeletal information from the time series of skeletal information and inputting the skeletal information into a posture determination model, and is generated by the person model generation unit 33.

図９は、３Ｄアバターの歩行姿勢を匿名化する処理を説明するための図である。たとえば、人物モデル生成部３３は、歩行姿勢であると判定された３Ｄアバターに対応する骨格情報を、骨格情報ＳＫ２０とする。骨格情報ＳＫ２０には、関節ｐ１～ｐ１３が含まれる。関節ｐ１～ｐ８の関節の説明は、図７と同様である、関節ｐ９は、左肘の関節である。関節ｐ１０は、右肘の関節である。関節ｐ１１は、左手首の関節である。関節ｐ１２は、右手首の関節である。関節ｐ１３は、頭部の各関節に対応する。 Figure 9 is a diagram for explaining the process of anonymizing the walking posture of a 3D avatar. For example, the person model generation unit 33 sets the skeletal information corresponding to a 3D avatar determined to be in a walking posture as skeletal information SK20. The skeletal information SK20 includes joints p1 to p13. The explanation of joints p1 to p8 is the same as in Figure 7, and joint p9 is the left elbow joint. Joint p10 is the right elbow joint. Joint p11 is the left wrist joint. Joint p12 is the right wrist joint. Joint p13 corresponds to each joint of the head.

人物モデル生成部３３は、骨格情報ＳＫ２０の関節ｐ３，ｐ５，ｐ７のｘ座標の値が同じ値となるように、関節ｐ５，ｐ７を移動させる。人物モデル生成部３３は、骨格情報ＳＫ２０の関節ｐ４，ｐ６，ｐ８のｘ座標の値が同じ値となるように、関節ｐ６，ｐ８を移動させる。人物モデル生成部３３は、骨格情報ＳＫ２０の関節ｐ９，ｐ１１のｘ座標の値が同じ値となるように、関節ｐ１１を移動させる。人物モデル生成部３３は、骨格情報ＳＫ２０の関節ｐ１０，ｐ１２のｘ座標の値が同じ値となるように、関節ｐ１２を移動させる。上記のように、どの関節の組のｘ座標を同じにするかに関する情報は、設定情報として、予め記憶部２０に登録される。人物モデル生成部３３は、設定情報を基にして、上記の処理を実行する。 The person model generation unit 33 moves joints p5 and p7 so that the x-coordinates of joints p3, p5, and p7 in the skeletal information SK20 have the same value. The person model generation unit 33 moves joints p6 and p8 so that the x-coordinates of joints p4, p6, and p8 in the skeletal information SK20 have the same value. The person model generation unit 33 moves joint p11 so that the x-coordinates of joints p9 and p11 in the skeletal information SK20 have the same value. The person model generation unit 33 moves joint p12 so that the x-coordinates of joints p10 and p12 in the skeletal information SK20 have the same value. Information on which sets of joints are to share the same x-coordinate, as described above, is registered in advance in the storage unit 20 as setting information. The person model generation unit 33 executes the above processing based on the setting information.

人物モデル生成部３３が、上記処理を実行することで、骨格情報ＳＫ２０が、骨格情報ＳＫ２０ａとなる。人物モデル生成部３３は、３Ｄアバターの姿勢を、骨格情報ＳＫ２０ａに合わせて調整する。たとえば、人物モデル生成部３３は、３Ｄアバターの各部位のうち、歩行に関連する関節ｐ３～１２に対応する部位を特定し、特定した部位を、骨格情報ＳＫ２０ａの関節ｐ３～１２に位置に合わせて移動させることで、３Ｄアバターの歩行姿勢を匿名化する。 When the person model generation unit 33 executes the above processing, the skeletal information SK20 becomes skeletal information SK20a. The person model generation unit 33 adjusts the posture of the 3D avatar to match the skeletal information SK20a. For example, the person model generation unit 33 identifies, among the parts of the 3D avatar, the parts that correspond to the walking-related joints p3 to p12, and anonymizes the walking posture of the 3D avatar by moving the identified parts to the positions of joints p3 to p12 in the skeletal information SK20a.
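The x-coordinate alignment that turns SK20 into SK20a can be sketched as follows. This is a minimal illustration only: the dictionary layout of the skeleton, the `ALIGNMENT_GROUPS` constant standing in for the setting information, and the function name are all assumptions made for this example, not details given in the description.

```python
# Hypothetical stand-in for the setting information in the storage unit:
# each entry is (anchor joint, joints whose x-coordinate is moved to the anchor's).
ALIGNMENT_GROUPS = [
    ("p3", ("p5", "p7")),
    ("p4", ("p6", "p8")),
    ("p9", ("p11",)),
    ("p10", ("p12",)),
]


def anonymize_walking_pose(skeleton, groups=ALIGNMENT_GROUPS):
    """Return a copy of the skeleton with each group's x-coordinates aligned.

    `skeleton` maps joint names to coordinate tuples whose first element is x.
    Joints that appear in no group (for example the head joint p13) are
    returned unchanged.
    """
    result = dict(skeleton)
    for anchor, moved_joints in groups:
        anchor_x = skeleton[anchor][0]
        for joint in moved_joints:
            # Replace only x; keep the remaining coordinates as they were.
            result[joint] = (anchor_x,) + tuple(skeleton[joint][1:])
    return result


sk20 = {
    "p3": (1.0, 5.0), "p5": (1.4, 3.0), "p7": (0.7, 1.0),
    "p4": (2.0, 5.0), "p6": (2.2, 3.0), "p8": (1.8, 1.0),
    "p9": (0.5, 7.0), "p11": (0.2, 6.0),
    "p10": (2.5, 7.0), "p12": (2.9, 6.0),
    "p13": (1.5, 9.0),
}
sk20a = anonymize_walking_pose(sk20)
```

Because the function returns a new dictionary, the original skeletal information stays available for other processing.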

なお、人物モデル生成部３３は、頭部の各関節ｐ１３をそのままとすることで、人物が向いていた方向をユーザが確認できるように、３Ｄアバターの顔の向きの情報をそのままとする。たとえば、人物モデル生成部３３は、人物が商品に手を伸ばす等の購買行動（物体を探索する動作）が検出された場合、３Ｄアバターの各部位のうち、頭部の各関節ｐ１３に対応する部位をそのままとすることで、購買行動に関する人の動きを反映させる。 The person model generation unit 33 leaves the information about the facial direction of the 3D avatar unchanged by leaving each joint p13 of the head unchanged so that the user can confirm the direction the person was facing. For example, when a purchasing behavior (movement of searching for an object) such as a person reaching for a product is detected, the person model generation unit 33 reflects the person's movement related to the purchasing behavior by leaving the parts of the 3D avatar that correspond to each joint p13 of the head unchanged.

このように、人物モデル生成部３３は、歩行検出情報に設定されたフレーム番号に対応する３Ｄアバターについて、上記処理をそれぞれ実行することで、３Ｄアバターの歩行姿勢を匿名化する。このようにして、人物モデル生成部３３は、複数の人物画像データから、様々な姿勢、種別（性別や年齢）、服装の３Ｄアバター（３Ｄモデル）を生成する。 In this way, the person model generation unit 33 performs the above process on each 3D avatar corresponding to the frame number set in the walking detection information, thereby anonymizing the walking posture of the 3D avatar. In this way, the person model generation unit 33 generates 3D avatars (3D models) with various postures, types (gender and age), and clothing from multiple person image data.

（領域特定） (Area identification)
図２に戻り、領域特定部３４は、場所画像データの中から、人物が行動する領域を特定する処理部である。具体的には、領域特定部３４は、場所画像データＤＢ２３に記憶される各場所画像データに対して機械学習モデルや画像解析を行うことで、各場所画像データ内で人物が行動する領域を特定する。そして、領域特定部３４は、各場所画像データと各特定結果とを対応付けて記憶部２０に格納する。 Returning to Figure 2, the area identification unit 34 is a processing unit that identifies, from the place image data, an area in which a person acts. Specifically, the area identification unit 34 identifies the area in which a person acts within each piece of place image data by applying a machine learning model or image analysis to each piece of place image data stored in the place image data DB 23. The area identification unit 34 then stores each piece of place image data in the storage unit 20 in association with its identification result.

例えば、領域特定部３４は、領域抽出モデル２５を用いたセマンティックセグメンテーションにより、人物が行動する領域（注目領域）を特定することができる。図１０は、セマンティックセグメンテーションによる注目領域の検出を説明する図である。図１０に示すように、領域特定部３４は、画像データを領域抽出モデル２５（convolutional encoder-decoder）に入力し、画像データの各領域にラベルが設定された出力結果（セグメンテーション結果）を取得する。一例を挙げると、領域特定部３４は、ラベル「通路」が設定された領域を注目領域と特定したり、ラベル「通路」が設定された領域内でラベル「商品棚」と隣接する領域を注目領域と特定したりする。 For example, the region identification unit 34 can identify a region where a person is moving (region of interest) by semantic segmentation using the region extraction model 25. FIG. 10 is a diagram illustrating detection of a region of interest by semantic segmentation. As shown in FIG. 10, the region identification unit 34 inputs image data to the region extraction model 25 (convolutional encoder-decoder) and obtains an output result (segmentation result) in which a label is set for each region of the image data. As an example, the region identification unit 34 identifies a region with the label "aisle" as a region of interest, or identifies a region adjacent to the label "product shelf" within the region with the label "aisle" as a region of interest.
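The second selection rule, aisle pixels that border a product shelf, can be illustrated on a small label grid. The sketch below assumes that the segmentation result is available as a 2D list of label strings and that "adjacent" means sharing an edge (4-neighborhood); both are assumptions for this example, as are the label strings and the function name.

```python
def aisle_pixels_next_to_shelf(labels, aisle="aisle", shelf="product shelf"):
    """Return the (row, col) positions labeled `aisle` that touch a `shelf` pixel.

    `labels` is a rectangular 2D list of label strings. Adjacency is taken as
    the 4-neighborhood (up, down, left, right), which is an assumption.
    """
    rows, cols = len(labels), len(labels[0])
    result = set()
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != aisle:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] == shelf:
                    result.add((r, c))
                    break
    return result


segmentation = [
    ["wall", "product shelf", "product shelf"],
    ["aisle", "aisle", "aisle"],
    ["aisle", "aisle", "aisle"],
]
attention = aisle_pixels_next_to_shelf(segmentation)
```

In practice the band of aisle pixels kept next to a shelf would likely be wider than one pixel; widening it is a matter of repeating the neighborhood test.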

なお、領域特定部３４は、カメラの映像データから人の作業位置を抽出し、作業位置のクラスタリングによってＲＯＩ（Region Of Interest）を注目領域と特定することもできる。 In addition, the area identification unit 34 can also extract the work positions of people from the camera's video data and identify the ROI (Region of Interest) as the area of interest by clustering the work positions.

また、領域特定部３４は、領域抽出モデル２５によるセマンティックセグメンテーションの実行結果を適切な情報を用いて修正することで、人物が行動する領域をより正確に特定することもできる。具体的には、領域特定部３４は、店舗での購買行動は移動と商品を選び取る行動が主として発生し、選び取る際には通路方向に対して身体の向きにバラつきが生じることを用いて、商品の選び取りが発生する注目領域を抽出し、セグメンテーション結果を修正することで、行動分析の対象となる注目領域を正確に設定する。 The area identification unit 34 can also more accurately identify the area in which a person behaves by correcting the results of semantic segmentation performed by the area extraction model 25 using appropriate information. Specifically, the area identification unit 34 uses the fact that purchasing behavior in a store mainly involves movement and the action of selecting a product, and that when selecting a product, there is variation in the orientation of the body relative to the aisle direction, to extract the area of interest where product selection occurs, and corrects the segmentation results to accurately set the area of interest that is the subject of behavioral analysis.

図１１は、骨格推定モデル２６を用いた動作解析を説明する図である。図１１に示すように、領域特定部３４は、ＲＧＢの画像データを骨格推定モデル２６に入力し、画像データに写っている人物の２次元骨格座標を取得する。そして、領域特定部３４は、２次元骨格座標にしたがって、人物の足首の位置、顔の向き、身体の向きを特定する。 Figure 11 is a diagram explaining motion analysis using the skeleton estimation model 26. As shown in Figure 11, the area identification unit 34 inputs RGB image data into the skeleton estimation model 26 and obtains the two-dimensional skeleton coordinates of the person depicted in the image data. Then, the area identification unit 34 identifies the position of the person's ankles, the direction of the face, and the direction of the body according to the two-dimensional skeleton coordinates.

すなわち、領域特定部３４は、所定時間間隔で取得された各映像データに含まれる各画像データ（例えば１００フレーム）それぞれを骨格推定モデル２６に入力し、各画像データに写っている人物の足首の位置、顔の向き、身体の向きを測定することで、映像データ内における人物の足首の位置の遷移、顔の向きの遷移、身体の向きの遷移を特定することができる。 In other words, the region identification unit 34 inputs each image data (e.g., 100 frames) included in each video data acquired at a predetermined time interval into the skeleton estimation model 26, and measures the ankle position, face direction, and body direction of the person appearing in each image data, thereby being able to identify the transition of the person's ankle position, face direction, and body direction within the video data.

次に、領域特定部３４は、トラッキング情報から人の移動経路を抽出し、基準線となる通路方向を設定する。具体的には、領域特定部３４は、映像データ内から画像データを取得（選択）し、ある人物の移動経路を用いて、画像データ上にユーザが歩く方向である基準方向を設定する。そして、領域特定部３４は、設定された基準方向を、移動経路を示す基準線として抽出する。 Next, the area identification unit 34 extracts the person's movement path from the tracking information and sets the passage direction that serves as the reference line. Specifically, the area identification unit 34 acquires (selects) image data from within the video data, and uses the movement path of a certain person to set a reference direction, which is the direction in which the user walks, on the image data. Then, the area identification unit 34 extracts the set reference direction as a reference line that indicates the movement path.

図１２は、トラッキングによる基準方向の設定を説明する図である。図１２に示すように、領域特定部３４は、画像データ上に、トラッキング結果である移動経路Ａ１と移動経路Ａ２とを設定する。このとき、領域特定部３４は、設定された移動経路を含む領域を通路の領域と設定することができる。なお、領域特定部３４は、画像データに対してセマンティックセグメンテーションを実行した結果により、画像データ上に通路の領域を設定することができる。 Figure 12 is a diagram explaining the setting of a reference direction by tracking. As shown in Figure 12, the area identification unit 34 sets movement routes A1 and A2, which are the tracking results, on the image data. At this time, the area identification unit 34 can set an area including the set movement routes as the passage area. Note that the area identification unit 34 can set the passage area on the image data based on the results of performing semantic segmentation on the image data.

次に、領域特定部３４は、トラッキング結果により、移動経路Ａ１から移動経路Ａ２への遷移を特定し、その遷移にしたがって、通路の領域上に基準方向Ｂ１、Ｂ２、Ｂ３のそれぞれを設定する。そして、領域特定部３４は、この基準方向Ｂ１、Ｂ２、Ｂ３それぞれを、基準線に設定する。なお、移動経路や移動経路の遷移は、一方方向に限らず、多方向が特定されることもあるが、この場合であっても、方向を除外して同じ移動軌跡であれば、１つの通路方向であり、１つの基準線として抽出される。例えば、領域特定部３４は、ユーザが歩く複数の移動経路から通路方向となる近似直線を算出し、その近似曲線を基準線として設定する。 Next, the area identification unit 34 identifies the transition from movement path A1 to movement path A2 based on the tracking results and, following that transition, sets each of the reference directions B1, B2, and B3 on the passage area. The area identification unit 34 then sets each of the reference directions B1, B2, and B3 as a reference line. Note that movement paths and their transitions are not limited to a single direction, and multiple directions may be identified. Even in that case, movement trajectories that coincide when their direction is ignored are treated as one passage direction and extracted as one reference line. For example, the area identification unit 34 calculates, from the multiple movement paths walked by users, an approximate line representing the passage direction, and sets that approximate line as the reference line.

次に、領域特定部３４は、各人物の移動軌跡を抽出し、各基準線と各人物の移動軌跡との距離に基づくクラスタリングにより、複数のクラスタを生成する。つまり、領域特定部３４は、各移動軌跡がどの基準線に近いかをクラスタリングする。 Next, the area identification unit 34 extracts the movement trajectory of each person and generates multiple clusters by clustering based on the distance between each reference line and the movement trajectory of each person. In other words, the area identification unit 34 clusters each movement trajectory based on which reference line it is closest to.

図１３は、クラスタリングを説明する図である。図１３に示すように、領域特定部３４は、各画像データに写っている人物の足首の位置を取得し、基準線Ｂ１、Ｂ２、Ｂ３が設定された画像データにプロットする。そして、領域特定部３４は、各基準線と各人物の移動軌跡との距離に基づくクラスタリングにより、複数のクラスタを生成する。 Figure 13 is a diagram explaining clustering. As shown in Figure 13, the area identification unit 34 acquires the position of the ankles of the person appearing in each image data, and plots them on the image data in which reference lines B1, B2, and B3 are set. The area identification unit 34 then generates multiple clusters by clustering based on the distance between each reference line and the movement trajectory of each person.

例えば、領域特定部３４は、各移動軌跡から各基準線への垂線を引き、その垂線の長さを基にしたクラスタリングを実行することにより、各移動軌跡をいずれかの基準線にクラスタリングする。なお、ベースとなる距離は、垂線の長さに限らず、ユークリッド距離などを用いることもできる。 For example, the region identification unit 34 clusters each movement trajectory to one of the reference lines by drawing a perpendicular line from each movement trajectory to each reference line and performing clustering based on the length of the perpendicular line. Note that the base distance is not limited to the length of the perpendicular line, and Euclidean distance, etc., can also be used.

この結果、領域特定部３４は、基準線Ｂ１に最も近い移動軌跡の点群を含むクラスタＣ１と、基準線Ｂ２に最も近い移動軌跡の点群を含むクラスタＣ２と、基準線Ｂ３に最も近い移動軌跡の点群を含むクラスタＣ３と、を生成する。 As a result, the region identification unit 34 generates a cluster C1 including the point cloud of the movement trajectory closest to the reference line B1, a cluster C2 including the point cloud of the movement trajectory closest to the reference line B2, and a cluster C3 including the point cloud of the movement trajectory closest to the reference line B3.
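The nearest-reference-line assignment described above can be sketched with the perpendicular-distance formula for a point and a line. This is an illustrative simplification: it assigns each plotted ankle position individually, treats every reference line as an infinite line through two points, and uses hypothetical names and data layout.

```python
import math


def point_to_line_distance(point, line):
    """Perpendicular distance from `point` to the infinite line through line[0] and line[1]."""
    px, py = point
    (x1, y1), (x2, y2) = line
    dx, dy = x2 - x1, y2 - y1
    return abs(dx * (py - y1) - dy * (px - x1)) / math.hypot(dx, dy)


def assign_to_nearest_line(points, reference_lines):
    """Group points by the name of the closest reference line.

    `reference_lines` maps a name such as "B1" to a pair of (x, y) points.
    """
    clusters = {name: [] for name in reference_lines}
    for point in points:
        nearest = min(
            reference_lines,
            key=lambda name: point_to_line_distance(point, reference_lines[name]),
        )
        clusters[nearest].append(point)
    return clusters


lines = {"B1": ((0, 0), (10, 0)), "B2": ((0, 5), (10, 5))}
ankles = [(1, 1), (2, 4), (3, 6)]
clusters = assign_to_nearest_line(ankles, lines)
```

Swapping `point_to_line_distance` for another distance function, such as the Euclidean distance to the nearest sampled point of the line, changes the metric without touching the assignment logic.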

次に、領域特定部３４は、クラスタリングされた各移動軌跡について、各基準線に対する身体の向きのなす角を算出する。具体的には、領域特定部３４は、各画像データに写っている人物の身体の向きを取得し、画像データ内の移動軌跡に、該当する身体の向きを対応付ける。そして、領域特定部３４は、クラスタリング結果を用いて、各移動軌跡が属するクラスタの基準線を特定する。その後、領域特定部３４は、各移動軌跡に対して、公知の手法を用いて、属するクラスタの基準線と身体の向きとのなす角度を算出する。なお、領域特定部３４は、身体の向きに限らず、顔の向きを用いることもできる。 Next, the region identification unit 34 calculates the angle of the body orientation with respect to each reference line for each clustered movement trajectory. Specifically, the region identification unit 34 obtains the body orientation of the person appearing in each image data, and associates the movement trajectory in the image data with the corresponding body orientation. The region identification unit 34 then uses the clustering results to identify the reference line of the cluster to which each movement trajectory belongs. After that, the region identification unit 34 calculates, for each movement trajectory, the angle between the reference line of the cluster to which the movement trajectory belongs and the body orientation using a known method. Note that the region identification unit 34 is not limited to the body orientation, but can also use the face orientation.
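The description leaves the angle computation to a known method. One common choice, shown here as an assumption rather than as the method actually used, takes the angle between the body-orientation vector and the direction vector of the reference line and folds it into the range 0 to 90 degrees, since a reference line has no preferred direction.

```python
import math


def angle_to_reference_line(body_direction, line_direction):
    """Angle in degrees (0 to 90) between a body-orientation vector and an undirected line.

    Both arguments are 2D (x, y) vectors. Taking absolute values of the cross
    and dot products makes the result independent of which way the line points.
    """
    bx, by = body_direction
    lx, ly = line_direction
    cross = bx * ly - by * lx
    dot = bx * lx + by * ly
    return math.degrees(math.atan2(abs(cross), abs(dot)))


facing_shelf = angle_to_reference_line((0.0, 1.0), (1.0, 0.0))
walking_along = angle_to_reference_line((-1.0, 0.0), (1.0, 0.0))
diagonal = angle_to_reference_line((1.0, 1.0), (1.0, 0.0))
```

With this convention, a person walking along the aisle in either direction yields an angle near 0, and a person turned toward a shelf yields an angle near 90.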

続いて、領域特定部３４は、複数のクラスタそれぞれについて、クラスタに属する各移動軌跡と基準線とのなす角度に基づく評価値が閾値以上であるクラスタを含む領域を注目領域に抽出する。具体的には、領域特定部３４は、各基準線に対する身体の向きのなす角のうち、大きい角度を多く含む基準線を抽出し、このような基準線が属する領域を注目領域として抽出する。 Then, for each of the multiple clusters, the region identifying unit 34 extracts a region including a cluster whose evaluation value based on the angle between each movement trajectory belonging to the cluster and the reference line is equal to or greater than a threshold value as a region of interest. Specifically, the region identifying unit 34 extracts reference lines that include many large angles among the angles formed by the body orientation with respect to each reference line, and extracts the region to which such a reference line belongs as a region of interest.

図１４は、クラスタの抽出を説明する図である。図１４に示すように、領域特定部３４は、各移動軌跡がプロットされた画像データに対して、各移動軌跡に対応する身体の向きをプロットする。また、領域特定部３４は、各移動軌跡に対して算出された角度も対応付ける。 Figure 14 is a diagram explaining the extraction of clusters. As shown in Figure 14, the region identification unit 34 plots the body orientation corresponding to each movement trajectory for the image data in which each movement trajectory is plotted. The region identification unit 34 also associates the calculated angle with each movement trajectory.

そして、領域特定部３４は、各クラスタについて、属する移動軌跡の角度を集計する。例えば、図１４に示すように、領域特定部３４は、クラスタＣ１に属する各移動軌跡の角度とその角度に該当する移動軌跡の数、クラスタＣ２に属する各移動軌跡の角度とその角度に該当する移動軌跡の数、クラスタＣ３に属する各移動軌跡の角度とその角度に該当する移動軌跡の数を集計する。 Then, the area identification unit 34 counts the angles of the movement trajectories that belong to each cluster. For example, as shown in FIG. 14, the area identification unit 34 counts the angles of each movement trajectory that belongs to cluster C1 and the number of movement trajectories that correspond to that angle, the angles of each movement trajectory that belongs to cluster C2 and the number of movement trajectories that correspond to that angle, and the angles of each movement trajectory that belongs to cluster C3 and the number of movement trajectories that correspond to that angle.

その後、領域特定部３４は、大きい角度を多く有するクラスタを抽出する。例えば、領域特定部３４は、クラスタごとに、角度の中央値、角度の平均値、６０度以上の角度の数の割合などを評価値として算出する。そして、領域特定部３４は、評価値が閾値以上であるクラスタＣ２とクラスタＣ３を抽出する。 Then, the region identification unit 34 extracts clusters that have many large angles. For example, the region identification unit 34 calculates, for each cluster, an evaluation value such as the median angle, the average angle, or the percentage of angles that are 60 degrees or greater. Then, the region identification unit 34 extracts clusters C2 and C3 whose evaluation values are equal to or greater than a threshold value.
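The three evaluation values named above (median, mean, and share of angles of 60 degrees or more) can be computed as in the following sketch. The choice of metric, the threshold of 0.5, and the sample angles are hypothetical values chosen for illustration; the description does not specify them.

```python
from statistics import mean, median


def evaluate_cluster(angles, large_angle=60.0):
    """Return the candidate evaluation values for one cluster's angles (in degrees)."""
    return {
        "median": median(angles),
        "mean": mean(angles),
        "large_ratio": sum(a >= large_angle for a in angles) / len(angles),
    }


def select_clusters(cluster_angles, metric="large_ratio", threshold=0.5):
    """Return the names of clusters whose chosen evaluation value meets the threshold."""
    return [
        name
        for name, angles in cluster_angles.items()
        if angles and evaluate_cluster(angles)[metric] >= threshold
    ]


cluster_angles = {
    "C1": [5, 10, 15, 70],
    "C2": [60, 80, 85, 20],
    "C3": [90, 75, 65],
}
selected = select_clusters(cluster_angles)
```

The median is less sensitive than the mean to a few people who briefly turn around in a pure walking aisle, which may make it the more robust choice when trajectories are sparse.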

続いて、領域特定部３４は、抽出したクラスタＣ２とクラスタＣ３について、注目領域として、クラスタに属する各移動軌跡を囲む多角形を生成する。図１５は、注目領域の抽出を説明する図である。図１５に示すように、領域特定部３４は、クラスタＣ２について、クラスタＣ２に属する各移動軌跡を含む最大の多角形Ｃ２´を生成して、注目領域として抽出する。同様に、領域特定部３４は、クラスタＣ３について、クラスタＣ３に属する各移動軌跡を含む最大の多角形Ｃ３´を生成して、注目領域として抽出する。 Next, for the extracted clusters C2 and C3, the region identification unit 34 generates polygons surrounding each of the movement trajectories belonging to the clusters as a region of interest. FIG. 15 is a diagram for explaining the extraction of a region of interest. As shown in FIG. 15, for cluster C2, the region identification unit 34 generates the largest polygon C2' that includes each of the movement trajectories belonging to cluster C2, and extracts it as a region of interest. Similarly, for cluster C3, the region identification unit 34 generates the largest polygon C3' that includes each of the movement trajectories belonging to cluster C3, and extracts it as a region of interest.
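The description does not say how the enclosing polygon is constructed. As one possible reading, the sketch below uses the convex hull of the trajectory points (Andrew's monotone chain algorithm), which is the smallest convex polygon containing all of them. Treating the polygon as a convex hull is an assumption of this example, not a statement of the method used.

```python
def convex_hull(points):
    """Return the convex hull of 2D points as a counter-clockwise list of vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


trajectory_points = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
polygon = convex_hull(trajectory_points)
```

Interior points such as (2, 1) and (1, 2) do not appear in the result, so the polygon is described only by its outer vertices.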

その後、領域特定部３４は、上記抽出結果を用いて、セマンティックセグメンテーションにより得られた各エリアのラベルを修正（変更）する。例えば、領域特定部３４は、上述した処理により得られた基準線に関する情報、注目領域の抽出結果、ＲＯＩに関する情報、足首の位置や身体の向きや顔の向きなどの行動認識結果などを取得する。 Then, the region identification unit 34 uses the above extraction results to correct (change) the label of each area obtained by semantic segmentation. For example, the region identification unit 34 acquires the information on the reference lines obtained by the processing described above, the extraction results for the regions of interest, information on the ROI, and behavior recognition results such as ankle position, body orientation, and face orientation.

また、領域特定部３４は、上記注目領域の抽出に使用された画像データなど、映像データに含まれる画像データを、領域抽出モデル２５に入力し、セマンティックセグメンテーションの実行結果を取得する。なお、セグメンテーションの実行結果には、画像データに含まれる複数の領域それぞれについて、識別された結果を示すラベルが付与されている。例えば、セマンティックセグメンテーションの実行結果には、「棚」、「通路」、「壁」などのラベルが付与される。 The region identification unit 34 also inputs image data contained in the video data, such as the image data used to extract the region of interest, into the region extraction model 25 to obtain the results of semantic segmentation. The segmentation results are provided with labels indicating the identification results for each of the multiple regions contained in the image data. For example, the results of semantic segmentation are provided with labels such as "shelf," "aisle," and "wall."

次に、領域特定部３４は、セグメンテーション結果に基準線を設定する。図１６は、セマンティックセグメンテーションの実行結果への基準線の設定を説明する図である。図１６に示すように、領域特定部３４は、セグメンテーション結果と基準線に関する情報を取得し、セグメンテーション結果に対して、基準線Ｂ１、Ｂ２、Ｂ３をプロットする。 Next, the region identification unit 34 sets a reference line on the segmentation result. FIG. 16 is a diagram illustrating the setting of a reference line on the result of executing semantic segmentation. As shown in FIG. 16, the region identification unit 34 obtains information on the segmentation result and the reference line, and plots reference lines B1, B2, and B3 on the segmentation result.

次に、領域特定部３４は、基準線が設定されたセグメンテーション結果に対して、基準線に基づくクラスタリングを実行する。図１７は、基準線に基づくクラスタリングを説明する図である。図１７に示すように、領域特定部３４は、セグメンテーション結果に設定（識別）された各ラベルのうち、「通路」のラベルが設定されたエリアを特定する。そして、領域特定部３４は、特定した「通路」のエリアに属する各画素と各基準線（Ｂ１、Ｂ２、Ｂ３）との距離を算出し、最も距離が近い基準線に属するように、各画素をクラスタリングする。なお、距離には、各画素から各基準線に対する垂線の長さや、画素と基準線とのユークリッド距離などを用いることができる。そして、領域特定部３４は、基準線Ｂ１に属するクラスタＬ１、基準線Ｂ２に属するクラスタＬ２、基準線Ｂ３に属するクラスタＬ３を特定する。 Next, the area identification unit 34 performs reference-line-based clustering on the segmentation result on which the reference lines have been set. FIG. 17 is a diagram for explaining clustering based on the reference lines. As shown in FIG. 17, the area identification unit 34 identifies, among the labels set (identified) in the segmentation result, the area labeled "aisle". The area identification unit 34 then calculates the distance between each pixel belonging to the identified "aisle" area and each reference line (B1, B2, B3), and clusters the pixels so that each pixel belongs to its nearest reference line. The distance can be, for example, the length of the perpendicular from the pixel to the reference line or the Euclidean distance between the pixel and the reference line. The area identification unit 34 thereby identifies cluster L1 belonging to reference line B1, cluster L2 belonging to reference line B2, and cluster L3 belonging to reference line B3.

次に、領域特定部３４は、上記抽出結果に基づき、セグメンテーション結果のラベルを修正する。具体的には、領域特定部３４は、複数のクラスタのうち注目領域に対応する注目クラスタを特定し、注目クラスタの領域を、対応する注目領域を含む領域に修正し、修正された領域に対して設定されたラベルを、注目領域に該当するラベルに変更する。すなわち、領域特定部３４は、クラスタリング結果と抽出された注目領域とを含む領域が最大を取るように各クラスタの領域を修正し、その修正された領域を注目領域としてラベリングする。 Next, the region identification unit 34 modifies the label of the segmentation result based on the extraction result. Specifically, the region identification unit 34 identifies a cluster of interest that corresponds to the region of interest from among the multiple clusters, modifies the region of the cluster of interest to a region that includes the corresponding region of interest, and changes the label set for the modified region to a label that corresponds to the region of interest. In other words, the region identification unit 34 modifies the region of each cluster so that the region that includes the clustering result and the extracted region of interest is maximized, and labels the modified region as the region of interest.

図１８は、ラベル修正を説明する図である。図１８に示すように、領域特定部３４は、注目領域（Ｃ２´とＣ３´）に関する各多角形の座標を取得し、クラスタリングされたセグメンテーション結果（画像データ）にマッピングする。そして、領域特定部３４は、注目領域Ｃ２´が属するクラスタＬ２と、注目領域Ｃ３´が属するクラスタＬ３とを特定する。 Figure 18 is a diagram explaining label correction. As shown in Figure 18, the area identification unit 34 acquires the coordinates of each polygon related to the areas of interest (C2' and C3') and maps them to the clustered segmentation result (image data). Then, the area identification unit 34 identifies the cluster L2 to which the area of interest C2' belongs and the cluster L3 to which the area of interest C3' belongs.

その後、領域特定部３４は、注目領域Ｃ２´が含まれるように、クラスタＬ２の領域を拡張した領域Ｌ２´を生成する。そして、領域特定部３４は、領域Ｌ２´に設定されているラベル「通路」を、ラベル「注目領域」に修正（変更）する。 Then, the area identification unit 34 generates area L2' by expanding the area of cluster L2 so that it includes the area of interest C2'. The area identification unit 34 then corrects (changes) the label "aisle" set for area L2' to the label "area of interest."

同様に、領域特定部３４は、注目領域Ｃ３´が含まれるように、クラスタＬ３の領域を拡張した領域Ｌ３´を生成する。そして、領域特定部３４は、領域Ｌ３´に設定されているラベル「通路」を、ラベル「注目領域」に修正する。なお、領域特定部３４は、注目領域の方がクラスタの領域よりも大きい場合、注目領域のラベル「通路」を、ラベル「注目領域」に修正（変更）する。 Similarly, the area identification unit 34 generates area L3' by expanding the area of cluster L3 so that it includes the area of interest C3'. The area identification unit 34 then corrects the label "aisle" set for area L3' to the label "area of interest." Note that when the area of interest is larger than the area of the cluster, the area identification unit 34 corrects (changes) the label "aisle" of the area of interest to the label "area of interest."
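The expand-and-relabel step can be sketched on pixel sets. The example assumes that the cluster (L2 or L3) and the rasterized polygon of the area of interest (C2' or C3') are both available as sets of (row, col) positions and that the segmentation result is a 2D list of label strings; these representations, the label strings, and the function name are assumptions made for illustration.

```python
def relabel_expanded_region(labels, cluster_pixels, interest_pixels,
                            new_label="area of interest"):
    """Expand a cluster to cover the area of interest and relabel the result.

    Returns (corrected_labels, expanded_pixels). The input grid is not modified.
    """
    # The expanded region (L2', L3') is the union of the cluster and the polygon,
    # which also covers the case where the polygon is larger than the cluster.
    expanded = set(cluster_pixels) | set(interest_pixels)
    corrected = [row[:] for row in labels]
    for r, c in expanded:
        corrected[r][c] = new_label
    return corrected, expanded


grid = [
    ["shelf", "shelf", "shelf"],
    ["aisle", "aisle", "aisle"],
    ["aisle", "aisle", "aisle"],
]
cluster_l2 = {(1, 0), (1, 1)}
interest_c2 = {(1, 1), (1, 2)}
corrected, region_l2 = relabel_expanded_region(grid, cluster_l2, interest_c2)
```

Only pixels inside the union are relabeled, so aisle pixels that belong to neither set keep their original label.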

次に、領域特定部３４は、顔の向きまたは身体の向きに基づき、店舗を構成する複数の領域のうちラベル「注目領域」と隣接する、人物に関連する物体が収納される領域を設定する。具体的には、領域特定部３４は、画像データに対して、ピッキング動作の対象となる商品が置いてある商品棚エリアを特定する。すなわち、領域特定部３４は、領域Ｌ２´や領域Ｌ３´と隣接するエリアについて、セグメンテーション結果により設定済みであるラベルを、ラベル「商品棚」に変更する。 Next, based on the direction of the face or body, the area identification unit 34 sets an area that contains objects related to the person and is adjacent to the label "area of interest" among the multiple areas that make up the store. Specifically, the area identification unit 34 identifies, from the image data, a product shelf area in which the product that is the target of the picking operation is placed. That is, for areas adjacent to area L2' and area L3', the area identification unit 34 changes the label that has already been set based on the segmentation result to the label "product shelf."

図１９は、商品棚エリアの設定を説明する図である。図１９に示すように、領域特定部３４は、ラベル「注目領域」が設定された領域Ｌ２´と領域Ｌ３´のそれぞれについて、各領域に属する各移動軌跡および顔の向きをプロットする。 Figure 19 is a diagram explaining the setting of the product shelf area. As shown in Figure 19, the area identification unit 34 plots each movement trajectory and face direction belonging to each area for each of areas L2' and L3', which are set with the label "area of interest."

そして、領域特定部３４は、顔の向きのベクトルの数が閾値以上である方向を特定し、その方向にある領域のうち、領域Ｌ２´と接する領域もしく領域Ｌ２´と隣接する領域として、領域Ｅ１と領域Ｅ２を特定する。この結果、領域特定部３４は、セグメンテーション結果において、領域Ｅ１と領域Ｅ２のラベルを「商品棚エリア」と設定する。 Then, the area identification unit 34 identifies a direction in which the number of face orientation vectors is equal to or greater than a threshold value, and among the areas in that direction, identifies areas E1 and E2 as areas that are in contact with area L2' or adjacent to area L2'. As a result, the area identification unit 34 sets the labels of areas E1 and E2 as "product shelf area" in the segmentation results.

同様に、領域特定部３４は、顔の向きのベクトルの数が閾値以上である方向を特定し、その方向にある領域のうち、領域Ｌ３´と接する領域もしく領域Ｌ３´と隣接する領域として、領域Ｅ３と領域Ｅ４を特定する。この結果、領域特定部３４は、セグメンテーション結果において、領域Ｅ３と領域Ｅ４のラベルを「商品棚エリア」と設定する。 Similarly, the area identification unit 34 identifies a direction in which the number of face orientation vectors is equal to or greater than a threshold value, and among the areas in that direction, identifies areas E3 and E4 as areas that are in contact with area L3' or adjacent to area L3'. As a result, the area identification unit 34 sets the labels of areas E3 and E4 as "product shelf area" in the segmentation results.
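Finding the directions in which enough face-orientation vectors point can be sketched by binning the vectors by angle and keeping the bins that reach the threshold. Dividing the circle into eight equal bins, with bin 0 centered on the positive x axis, is an assumption of this example; the description only states that the number of vectors in a direction is compared with a threshold.

```python
import math
from collections import Counter


def dominant_directions(face_vectors, threshold, bins=8):
    """Return the sorted indices of angular bins holding at least `threshold` vectors.

    Bin k is centered on the angle k * (360 / bins) degrees, measured
    counter-clockwise from the positive x axis.
    """
    width = 2 * math.pi / bins
    counts = Counter()
    for dx, dy in face_vectors:
        angle = math.atan2(dy, dx) % (2 * math.pi)
        # Shift by half a bin so that each bin is centered on its nominal angle.
        index = int(((angle + width / 2) % (2 * math.pi)) // width)
        counts[index] += 1
    return sorted(index for index, n in counts.items() if n >= threshold)


faces = [
    (1, 0), (1, 0.1), (1, -0.1),      # facing roughly +x
    (0, 1),                           # a single glance toward +y
    (-1, 0), (-1, 0.05), (-1, -0.05), # facing roughly -x
]
directions = dominant_directions(faces, threshold=3)
```

The areas lying in each returned direction and bordering the area of interest would then be the candidates for the product shelf label.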

そして、領域特定部３４は、領域Ｅ１、領域Ｅ２、領域Ｅ３、領域Ｅ４の座標や、領域Ｅ１からＥ４それぞれを設定した画像データなどの情報を記憶部２０に格納する。なお、領域特定部３４は、セグメンテーション結果ではなく、セグメンテーション結果の元となった画像データに対して、「商品棚エリア」に領域を設定することもできる。 Then, the area identification unit 34 stores information such as the coordinates of areas E1, E2, E3, and E4, and the image data on which areas E1 to E4 have been set, in the storage unit 20. Note that the area identification unit 34 can also set the "product shelf area" on the image data from which the segmentation result was derived, rather than on the segmentation result itself.

（合成データの生成） (Generation of synthetic data)
図２に戻り、合成データ生成部３５は、カメラパラメータを用いて、領域特定部３４により特定された領域に、人物モデル生成部３３によって生成された３次元の人物モデルが配置された合成データを生成する処理部である。 Returning to Figure 2, the synthetic data generation unit 35 is a processing unit that uses camera parameters to generate synthetic data in which the three-dimensional person model generated by the person model generation unit 33 is placed in the area identified by the area identification unit 34.

具体的には、合成データ生成部３５は、領域特定部３４により特定された場所画像データ内の注目領域（行動を行う領域）に、特定の行動を行う３Ｄアバターを配置した合成データを生成する。例えば、合成データ生成部３５は、注目領域に、「商品棚エリア」に対して商品を取る行動を行う３Ｄアバターを配置した合成データを生成する。 Specifically, the synthetic data generation unit 35 generates synthetic data in which a 3D avatar performing a specific action is placed in a region of interest (region in which an action is performed) in the location image data identified by the region identification unit 34. For example, the synthetic data generation unit 35 generates synthetic data in which a 3D avatar performing an action of picking up a product from a "product shelf area" is placed in the region of interest.

ここで、合成データ生成部３５は、場所画像データを撮像したカメラの位置を推定して、適切な大きさの３Ｄアバターを配置する。例えば、合成データ生成部３５は、単眼デプス推定を行う訓練済みの機械学習モデル（推定モデル）を用いて、カメラから注目領域までの距離を推定し、推定された距離にしたがって３Ｄアバターを適切に配置することができる。 Here, the synthetic data generation unit 35 estimates the position of the camera that captured the location image data, and places a 3D avatar of an appropriate size. For example, the synthetic data generation unit 35 can estimate the distance from the camera to the area of interest using a trained machine learning model (estimation model) that performs monocular depth estimation, and appropriately place the 3D avatar according to the estimated distance.

図２０は、合成データの生成を説明する図である。図２０に示すように、合成データ生成部３５は、場所画像データに対する領域特定部３４の領域特定処理により、場所画像データ内の床のうち注目領域を３Ｄアバターの配置位置に決定する。 Figure 20 is a diagram explaining the generation of synthetic data. As shown in Figure 20, the synthetic data generation unit 35 determines the attention area of the floor in the location image data as the placement position of the 3D avatar by the area identification process of the area identification unit 34 for the location image data.

一方で、合成データ生成部３５は、推定モデルに場所画像データを入力し、明るいほどカメラからの距離が遠い位置を表すデプス画像を取得する。そして、合成データ生成部３５は、デプス画像および公知の単眼カメラ距離計測技術などを用いて、カメラから配置位置までの距離を算出する。続いて、合成データ生成部３５は、カメラと配置位置までの距離を用いて俯角を算出し、俯角および距離を用いて３Ｄアバターの高さを決定する。例えば、合成データ生成部３５は、俯角および距離を用いた三平方の定理により配置位置から天井や商品棚までの高さを算出し、その高さより低い３Ｄアバターを配置位置に配置した合成データを生成する。 Meanwhile, the synthetic data generation unit 35 inputs the location image data into the estimation model and obtains a depth image in which brighter pixels represent positions farther from the camera. The synthetic data generation unit 35 then calculates the distance from the camera to the placement position using the depth image and a known monocular camera distance measurement technique or the like. Next, the synthetic data generation unit 35 calculates the depression angle using the distance from the camera to the placement position, and determines the height of the 3D avatar using the depression angle and the distance. For example, the synthetic data generation unit 35 calculates the height from the placement position to the ceiling or a product shelf by the Pythagorean theorem using the depression angle and the distance, and generates synthetic data in which a 3D avatar shorter than that height is placed at the placement position.
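The geometry behind this step can be sketched with a right triangle whose hypotenuse is the line of sight from the camera to the placement position. The example assumes that the camera is mounted at the height of interest (for example at ceiling level) and that the line-of-sight distance and depression angle are already known; the function names, the safety margin of 0.9, and the sample numbers are hypothetical.

```python
import math


def camera_geometry(line_of_sight, depression_deg):
    """Return (height, ground_distance) for a camera looking down at a floor point.

    With line-of-sight distance d and depression angle theta, the vertical leg is
    d * sin(theta) and the horizontal leg is d * cos(theta), so that
    height**2 + ground_distance**2 == d**2 (Pythagorean theorem).
    """
    theta = math.radians(depression_deg)
    return line_of_sight * math.sin(theta), line_of_sight * math.cos(theta)


def avatar_scale(avatar_height, clearance, margin=0.9):
    """Scale factor (at most 1.0) that keeps the avatar below margin * clearance."""
    return min(1.0, margin * clearance / avatar_height)


height, ground = camera_geometry(line_of_sight=5.0, depression_deg=30.0)
scale_for_adult = avatar_scale(avatar_height=1.7, clearance=height)
scale_for_oversized = avatar_scale(avatar_height=3.0, clearance=height)
```

An avatar that already fits under the clearance keeps a scale of 1.0, while an oversized one is shrunk so that it stays below the computed height.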

ここで、３Ｄアバターは、条件により様々な姿勢に変更することができる。このため、合成データ生成部３５は、目的とする行動（学習対象とする行動）が指定されることで、目的とする３Ｄアバターを含む合成データを生成することができる。上記領域特定部３４による処理結果を用いた合成データの具体的な生成例を説明する。 Here, the 3D avatar can be changed to various postures depending on the conditions. Therefore, the synthetic data generation unit 35 can generate synthetic data including the desired 3D avatar by specifying the desired behavior (behavior to be learned). A specific example of generating synthetic data using the processing results by the area identification unit 34 will be described below.

図２１は、合成データの生成を説明する図である。図２１に示すように、合成データ生成部３５は、複数のうち場所画像データのうち、商品棚を含む画像や飲み物売り場などユーザが指定した条件等の指示に合致した場所画像データを選定する。そして、合成データ生成部３５は、選択された場所画像データに対して領域特定部３４が特定した場所画像データ内の領域を選択する。例えば、合成データ生成部３５は、ユーザの指示に応じて、特定された領域であるクラスタＣ２の領域（移動軌跡の領域）、クラスタＬ２の領域（基準線を含む領域）、拡張された領域Ｌ２´（移動軌跡と基準線を含む領域）、クラスタＣ３の領域、クラスタＬ３の領域、拡張された領域Ｌ３´の中から、領域Ｌ３´を選択する。 FIG. 21 is a diagram for explaining the generation of synthetic data. As shown in FIG. 21, the synthetic data generation unit 35 selects, from among the multiple pieces of location image data, location image data that matches the user's specified conditions, such as an image including a product shelf or a drink section. The synthetic data generation unit 35 then selects an area in the location image data identified by the area identification unit 34 for the selected location image data. For example, the synthetic data generation unit 35 selects area L3' from the identified areas of cluster C2 (area of the movement trajectory), cluster L2 (area including the reference line), expanded area L2' (area including the movement trajectory and reference line), cluster C3, cluster L3, and expanded area L3' in response to the user's instructions.

続いて、合成データ生成部３５は、人物モデル生成部３３により生成された複数の３Ｄアバター７０のうち、商品を選ぶ女性や商品を陳列する男性従業員などユーザが指定した条件等の指示に合致した３Ｄアバター７０を選定する。そして、合成データ生成部３５は、ユーザが指定した条件に合致するように、選定した３Ｄアバター７０を上下方向、左右方向に回転させるとともに、手を上げたり、しゃがんだりさせて、該当する姿勢に生成する。 The synthetic data generation unit 35 then selects a 3D avatar 70 that matches the user's specified conditions, such as a woman selecting a product or a male employee displaying products, from among the multiple 3D avatars 70 generated by the person model generation unit 33. The synthetic data generation unit 35 then rotates the selected 3D avatar 70 vertically and horizontally, and raises its hands or crouches to generate it in a corresponding pose so as to match the user's specified conditions.

その後、合成データ生成部３５は、場所画像データ内の領域Ｌ３´に、商品棚Ｅ４の方を向いて、商品を物色する行動を行う３Ｄアバター７０を配置した合成データを生成する。ここで、合成データ生成部３５は、３Ｄアバター７０を配置する際に、図２０を用いた手法に限らず、カメラの位置やレンズのパラメータなどを含むカメラパラメータを用いて適切に配置することができる。なお、カメラパラメータは、すべての場所画像データについて既知の情報とは限らないので、場所画像データに合わせて適切に推定することで、精度を向上させることができる。 Then, the synthetic data generation unit 35 generates synthetic data in which a 3D avatar 70 is placed in area L3' within the location image data, facing the product shelf E4 and browsing for products. Here, when placing the 3D avatar 70, the synthetic data generation unit 35 can appropriately place it using camera parameters including the camera position and lens parameters, rather than being limited to the method using FIG. 20. Note that the camera parameters are not necessarily known information for all location image data, so accuracy can be improved by appropriately estimating them in accordance with the location image data.

（カメラパラメータの推定） (Estimation of camera parameters)
ここでは、３Ｄアバターの身長を推定するカメラパラメータの推定について説明する。図２２、図２３、図２４は、カメラパラメータの推定を説明する図である。図２２について説明する。図２２では、映像データに含まれる複数の画像フレーム（場所画像データ）のうち、画像フレームＦ２１を用いて説明を行う。 Here, the estimation of camera parameters for estimating the height of a 3D avatar will be described. Fig. 22, Fig. 23, and Fig. 24 are diagrams for explaining the estimation of camera parameters. Fig. 22 will be described. In Fig. 22, the explanation will be given using image frame F21 out of multiple image frames (location image data) included in the video data.

画像フレームＦ２１の座標系は、画像座標系（ｘ，ｙ）となる。人物が現実に存在する座標系は、世界座標系（Ｘ，Ｙ，Ｚ）となる。以下の説明では、画像フレームＦ２１に映った画像座標系の人物を人物２１－１ａと表記し、世界座標系の人物を人物２１－２ａと表記する。 The coordinate system of image frame F21 is the image coordinate system (x, y). The coordinate system in which the person actually exists is the world coordinate system (X, Y, Z). In the following explanation, the person in the image coordinate system shown in image frame F21 is referred to as person 21-1a, and the person in the world coordinate system is referred to as person 21-2a.

カメラ１００のカメラパラメータには、カメラ１００の高さｃと、カメラ１００の角度θと、カメラ１００の焦点距離ｆとが含まれる。カメラ１００のカメラパラメータを未知とし、合成データ生成部３５は、カメラパラメータに予め所定の初期値を設定しておく。なお、カメラパラメータは、これらに限らず、幾何学関係を定義可能なその他のパラメータでも良い。たとえば、その他のパラメータとして、光軸と画像の交点（光軸中心座標）やカメラの回転角などが含まれる。 The camera parameters of the camera 100 include the height c of the camera 100, the angle θ of the camera 100, and the focal length f of the camera 100. The camera parameters of the camera 100 are unknown, and the composite data generating unit 35 sets predetermined initial values to the camera parameters in advance. Note that the camera parameters are not limited to these, and may be other parameters that can define a geometric relationship. For example, other parameters include the intersection of the optical axis and the image (optical axis center coordinates) and the rotation angle of the camera.

合成データ生成部３５は、画像フレームＦ２１を解析することで、人物２１－１ａの骨格データを特定する。たとえば、合成データ生成部３５は、画像フレームＦ２１を、機械学習済みの学習モデル（例えば骨格推定モデル２６）に入力することで、人物２１－１ａの骨格データを特定する。 The composite data generation unit 35 identifies skeletal data of the person 21-1a by analyzing the image frame F21. For example, the composite data generation unit 35 identifies skeletal data of the person 21-1a by inputting the image frame F21 into a machine-learned learning model (e.g., the skeletal estimation model 26).

骨格データには、人物の複数の関節に関する情報が含まれ、各関節は、画像フレーム上の座標に対応付けられる。たとえば、画像フレームＦ２１に対応する骨格データには、人物２１－１ａの頭部の座標（ｘ_ｈ１，ｙ_ｈ１）、足部の座標（ｘ_ｆ１，ｙ_ｆ１）等が含まれる。 The skeletal data includes information about a plurality of joints of a person, and each joint is associated with a coordinate on an image frame. For example, the skeletal data corresponding to the image frame F21 includes the coordinates (x_h1, y_h1) of the head of the person 21-1a, the coordinates (x_f1, y_f1) of the feet, etc.

合成データ生成部３５は、属性テーブルを有しており、かかる属性テーブルには、各国の領土の範囲と、該当する国に住む人物の平均身長とが対応付けられる。合成データ生成部３５は、カメラ１００から受信する位置データと、属性テーブルとを基にして、カメラ１００が設置された国の人物の平均身長（画像フレームＦ２１に映った人物の平均身長）を特定する。 The composite data generation unit 35 has an attribute table in which the territorial extent of each country is associated with the average height of people living in that country. Based on the position data received from the camera 100 and the attribute table, the composite data generation unit 35 determines the average height of people in the country in which the camera 100 is installed (the average height of people captured in the image frame F21).

合成データ生成部３５は、カメラ１００のカメラパラメータを基にして、画像座標系の足部の座標（ｘ_ｆ１，ｙ_ｆ１）を世界座標系の座標に投影する。たとえば、合成データ生成部３５は、カメラ１００と、人物２１－１ａの足部の座標（ｘ_ｆ１，ｙ_ｆ１）とを通る線分ｌ５と、世界座標系のＸＺ平面との交点（Ｘ_ｆ１，Ｙ_ｆ１，Ｚ_ｆ１）を、世界座標系の人物２１－２ａの足部の座標として算出する。 The synthetic data generation unit 35 projects the coordinates (x_f1, y_f1) of the foot in the image coordinate system onto the coordinates of the world coordinate system based on the camera parameters of the camera 100. For example, the synthetic data generation unit 35 calculates the intersection (X_f1, Y_f1, Z_f1) of the line segment l5 passing through the camera 100 and the coordinates (x_f1, y_f1) of the foot of the person 21-1a with the XZ plane of the world coordinate system as the coordinates of the foot of the person 21-2a in the world coordinate system.

合成データ生成部３５は、位置データおよび属性情報を基にして特定した身長（平均身長）Ｌを、世界座標軸系の人物２１－２ａに割り当てる。合成データ生成部３５は、人物２１－２ａの足部の座標と、身長Ｌとを基にして、人物２１－２ａの頭部の座標（Ｘ_ｈ１，Ｙ_ｈ１，Ｚ_ｈ１）を算出する。 The composite data generation unit 35 assigns the height (average height) L determined based on the position data and attribute information to the person 21-2a in the world coordinate axis system. The composite data generation unit 35 calculates the coordinates (X_h1, Y_h1, Z_h1) of the head of the person 21-2a based on the coordinates of the feet of the person 21-2a and the height L.

合成データ生成部３５は、カメラ１００のカメラパラメータを基にして、世界座標系の頭部の座標（Ｘ_ｈ１，Ｙ_ｈ１，Ｚ_ｈ１）を、画像座標系の座標に逆投影する。たとえば、合成データ生成部３５は、カメラ１００と、人物２１－２ａの頭部の座標（Ｘ_ｈ１，Ｙ_ｈ１，Ｚ_ｈ１）とを通る線分ｌ６と、画像座標系の平面との交点の座標（ｘ´_ｆ１，ｙ´_ｆ１）を算出し、画像座標系の人物２１－１ａの頭部の座標とする。 The synthetic data generation unit 35 back-projects the coordinates (X_h1, Y_h1, Z_h1) of the head in the world coordinate system into the coordinates of the image coordinate system based on the camera parameters of the camera 100. For example, the synthetic data generation unit 35 calculates the coordinates (x'_f1, y'_f1) of the intersection of the plane of the image coordinate system with a line segment l6 passing through the camera 100 and the coordinates (X_h1, Y_h1, Z_h1) of the head of the person 21-2a, and sets this as the coordinates of the head of the person 21-1a in the image coordinate system.

合成データ生成部３５は、画像座標系の足部の座標（ｘ_ｆ１，ｙ_ｆ１）から、座標（ｘ´_ｆ１，ｙ´_ｆ１）までの距離を、「第一の特徴量」として設定する。第一の特徴量は、割り当てた身長Ｌと、カメラ１００のカメラパラメータに基づいて推定される人物２１－１ａの身長に対応する。 The synthetic data generating unit 35 sets the distance from the coordinates (x_f1, y_f1) of the foot to the coordinates (x'_f1, y'_f1) in the image coordinate system as a "first feature amount". The first feature amount corresponds to the height of the person 21-1a estimated based on the assigned height L and the camera parameters of the camera 100.

合成データ生成部３５は、画像座標系の足部の座標（ｘ_ｆ１，ｙ_ｆ１）から、頭部の座標（ｘ_ｈ１，ｙ_ｈ１）までの距離を、「第二の特徴量」として設定する。第二の特徴量は、骨格データに基づいて推定される人物２１－１ａの身長に対応する。 The synthetic data generating unit 35 sets the distance from the foot coordinate (x_f1, y_f1) to the head coordinate (x_h1, y_h1) in the image coordinate system as a "second feature amount". The second feature amount corresponds to the height of the person 21-1a estimated based on the skeletal data.
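
The two feature amounts described above can be illustrated with a small Python sketch. It assumes a simple pinhole camera placed at height c above the world origin, tilted downward by θ about the X axis, with the floor as the XZ plane (Y = 0) and the focal length f expressed in pixels. The class `Camera`, the principal point (`cx`, `cy`), and the function names are hypothetical and are not taken from this description.

```python
import math
from dataclasses import dataclass

@dataclass
class Camera:
    height: float      # c: camera height above the floor
    tilt: float        # theta: downward tilt in radians
    focal: float       # f: focal length in pixels
    cx: float = 0.0    # assumed principal point (image centre)
    cy: float = 0.0

    def _axes(self):
        s, co = math.sin(self.tilt), math.cos(self.tilt)
        right = (1.0, 0.0, 0.0)
        down = (0.0, -co, -s)      # image y axis expressed in world coordinates
        forward = (0.0, -s, co)    # optical axis
        return right, down, forward

    def project(self, point):
        """World point (X, Y, Z) -> image point (x, y)."""
        right, down, forward = self._axes()
        v = (point[0], point[1] - self.height, point[2])
        dot = lambda a, b: sum(p * q for p, q in zip(a, b))
        depth = dot(v, forward)
        if depth <= 0:
            raise ValueError("point is behind the camera")
        return (self.cx + self.focal * dot(v, right) / depth,
                self.cy + self.focal * dot(v, down) / depth)

    def pixel_to_floor(self, pixel):
        """Intersect the viewing ray through an image point with the XZ plane."""
        right, down, forward = self._axes()
        u = (pixel[0] - self.cx) / self.focal
        w = (pixel[1] - self.cy) / self.focal
        d = tuple(u * r + w * dn + fw for r, dn, fw in zip(right, down, forward))
        if d[1] >= 0:
            raise ValueError("viewing ray does not reach the floor")
        t = -self.height / d[1]
        return (t * d[0], 0.0, t * d[2])

def feature_amounts(camera, foot_px, head_px, height_l):
    """Return (first feature amount, second feature amount) in pixels."""
    foot_world = camera.pixel_to_floor(foot_px)              # foot on the floor
    head_world = (foot_world[0], height_l, foot_world[2])    # head at height L above the foot
    head_reprojected = camera.project(head_world)            # head predicted from L and the camera
    first = math.dist(foot_px, head_reprojected)
    second = math.dist(foot_px, head_px)                     # head observed in the skeletal data
    return first, second
```

When the camera parameters and the assigned height L both match reality, the two values coincide; a mismatch in either one shows up as a difference between them.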

図２３の説明に移行する。合成データ生成部３５は、画像フレームＦ２１に含まれる他の人物２２－１ａ，２３－１ａ，２４－１ａ，２５－１ａについても、人物２１－１ａと同様にして、身長Ｌを割り振り、各人物２２－１ａ～２５－１ａの第一の特徴量、第二の特徴量をそれぞれ設定する。初回に、各人物２１－１ａ～２５－１ａに割り振られる身長Ｌは、同じ身長（平均身長）となる。 Now, let us move on to the explanation of Figure 23. The composite data generation unit 35 assigns height L to the other persons 22-1a, 23-1a, 24-1a, and 25-1a included in the image frame F21 in the same manner as for person 21-1a, and sets the first characteristic amount and the second characteristic amount for each of the persons 22-1a to 25-1a. The height L assigned to each of the persons 21-1a to 25-1a at the initial time is the same height (average height).

人物２２－１ａの骨格データから得られる、画像座標系の足部の座標を（ｘ_ｆ２，ｙ_ｆ２）とし、頭部の座標を（ｘ_ｈ２，ｙ_ｈ２）とする。カメラパラメータと身長Ｌとを用いて得られる頭部の座標を（ｘ´_ｆ２，ｙ´_ｆ２）とする。人物２２－１ａの第一の特徴量は、座標（ｘ_ｆ２，ｙ_ｆ２）から、座標（ｘ´_ｆ２，ｙ´_ｆ２）までの距離となる。人物２２－１ａの第二の特徴量は、座標（ｘ_ｆ２，ｙ_ｆ２）から、頭部の座標（ｘ_ｈ２，ｙ_ｈ２）までの距離となる。 The coordinates of the feet in the image coordinate system obtained from the skeletal data of the person 22-1a are (x_f2, y_f2) and the coordinates of the head are (x_h2, y_h2). The coordinates of the head obtained using the camera parameters and height L are (x'_f2, y'_f2). The first feature amount of the person 22-1a is the distance from the coordinates (x_f2, y_f2) to the coordinates (x'_f2, y'_f2). The second feature amount of the person 22-1a is the distance from the coordinates (x_f2, y_f2) to the coordinates (x_h2, y_h2) of the head.

人物２３－１ａの骨格データから得られる、画像座標系の足部の座標を（ｘ_ｆ３，ｙ_ｆ３）とし、頭部の座標を（ｘ_ｈ３，ｙ_ｈ３）とする。カメラパラメータと身長Ｌとを用いて得られる頭部の座標を（ｘ´_ｆ３，ｙ´_ｆ３）とする。人物２３－１ａの第一の特徴量は、座標（ｘ_ｆ３，ｙ_ｆ３）から、座標（ｘ´_ｆ３，ｙ´_ｆ３）までの距離となる。人物２３－１ａの第二の特徴量は、座標（ｘ_ｆ３，ｙ_ｆ３）から、頭部の座標（ｘ_ｈ３，ｙ_ｈ３）までの距離となる。 The coordinates of the feet in the image coordinate system obtained from the skeletal data of the person 23-1a are (x_f3, y_f3) and the coordinates of the head are (x_h3, y_h3). The coordinates of the head obtained using the camera parameters and height L are (x'_f3, y'_f3). The first feature amount of the person 23-1a is the distance from the coordinates (x_f3, y_f3) to the coordinates (x'_f3, y'_f3). The second feature amount of the person 23-1a is the distance from the coordinates (x_f3, y_f3) to the coordinates (x_h3, y_h3) of the head.

人物２４－１ａの骨格データから得られる、画像座標系の足部の座標を（ｘ_ｆ４，ｙ_ｆ４）とし、頭部の座標を（ｘ_ｈ４，ｙ_ｈ４）とする。カメラパラメータと身長Ｌとを用いて得られる頭部の座標を（ｘ´_ｆ４，ｙ´_ｆ４）とする。人物２４－１ａの第一の特徴量は、座標（ｘ_ｆ４，ｙ_ｆ４）から、座標（ｘ´_ｆ４，ｙ´_ｆ４）までの距離となる。人物２４－１ａの第二の特徴量は、座標（ｘ_ｆ４，ｙ_ｆ４）から、頭部の座標（ｘ_ｈ４，ｙ_ｈ４）までの距離となる。 The coordinates of the feet in the image coordinate system obtained from the skeletal data of the person 24-1a are (x_f4, y_f4) and the coordinates of the head are (x_h4, y_h4). The coordinates of the head obtained using the camera parameters and height L are (x'_f4, y'_f4). The first feature amount of the person 24-1a is the distance from the coordinates (x_f4, y_f4) to the coordinates (x'_f4, y'_f4). The second feature amount of the person 24-1a is the distance from the coordinates (x_f4, y_f4) to the coordinates (x_h4, y_h4) of the head.

人物２５－１ａの骨格データから得られる、画像座標系の足部の座標を（ｘ_ｆ５，ｙ_ｆ５）とし、頭部の座標を（ｘ_ｈ５，ｙ_ｈ５）とする。カメラパラメータと身長Ｌとを用いて得られる頭部の座標を（ｘ´_ｆ５，ｙ´_ｆ５）とする。人物２５－１ａの第一の特徴量は、座標（ｘ_ｆ５，ｙ_ｆ５）から、座標（ｘ´_ｆ５，ｙ´_ｆ５）までの距離となる。人物２５－１ａの第二の特徴量は、座標（ｘ_ｆ５，ｙ_ｆ５）から、頭部の座標（ｘ_ｈ５，ｙ_ｈ５）までの距離となる。 The coordinates of the feet in the image coordinate system obtained from the skeletal data of the person 25-1a are (x_f5, y_f5) and the coordinates of the head are (x_h5, y_h5). The coordinates of the head obtained using the camera parameters and height L are (x'_f5, y'_f5). The first feature amount of the person 25-1a is the distance from the coordinates (x_f5, y_f5) to the coordinates (x'_f5, y'_f5). The second feature amount of the person 25-1a is the distance from the coordinates (x_f5, y_f5) to the coordinates (x_h5, y_h5) of the head.

合成データ生成部３５は、人物２１－１ａ～２５－１ａの身長Ｌを固定した状態で、それぞれの人物２１－１ａ～２５－１ａについて、第一の特徴量と、第二の特徴量との差が小さくなるように、カメラ１００のカメラパラメータを最適化する。 The composite data generation unit 35 optimizes the camera parameters of the camera 100 so that the difference between the first feature amount and the second feature amount for each of the persons 21-1a to 25-1a is small while the height L of the persons 21-1a to 25-1a is fixed.
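
The description does not specify which optimization method is used. As one possible illustration, the Python sketch below fits (c, θ, f) with a simple coordinate pattern search that minimizes the summed squared difference between the first and second feature amounts while the height L stays fixed. The pinhole model (principal point at the image origin, floor as the XZ plane), the step sizes, and all function names are assumptions made for this sketch.

```python
import math

def project(params, point):
    """World point -> image point for camera params (c, theta, f); None if behind the camera."""
    c, theta, f = params
    s, co = math.sin(theta), math.cos(theta)
    x, y, z = point[0], point[1] - c, point[2]
    depth = -s * y + co * z
    if depth <= 0:
        return None
    return (f * x / depth, f * (-co * y - s * z) / depth)

def foot_to_floor(params, pixel):
    """Intersect the viewing ray through a foot pixel with the floor; None if it misses."""
    c, theta, f = params
    s, co = math.sin(theta), math.cos(theta)
    u, v = pixel[0] / f, pixel[1] / f
    dx, dy, dz = u, -v * co - s, -v * s + co
    if dy >= 0:
        return None
    t = -c / dy
    return (t * dx, 0.0, t * dz)

def cost(params, people, height_l):
    """Sum of squared (first - second) feature differences over (foot_px, head_px) pairs."""
    if not 0.0 < params[1] < math.pi / 2:
        return math.inf
    total = 0.0
    for foot_px, head_px in people:
        floor = foot_to_floor(params, foot_px)
        if floor is None:
            return math.inf
        head = project(params, (floor[0], height_l, floor[2]))
        if head is None:
            return math.inf
        total += (math.dist(foot_px, head) - math.dist(foot_px, head_px)) ** 2
    return total

def optimize_camera(start, people, height_l, steps=(0.5, 0.1, 100.0), iterations=200):
    """Pattern search: try +/- one step per parameter, halve the steps when nothing improves."""
    params, steps = list(start), list(steps)
    best = cost(params, people, height_l)
    for _ in range(iterations):
        improved = False
        for i in range(3):
            for sign in (1.0, -1.0):
                trial = params[:]
                trial[i] += sign * steps[i]
                if trial[i] <= 0:
                    continue
                value = cost(trial, people, height_l)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            steps = [step / 2 for step in steps]
    return tuple(params), best
```

A gradient-based or least-squares solver could be substituted for the pattern search. With only a handful of people in one frame the problem can be poorly conditioned, so this search is not guaranteed to recover the true parameters, only to reduce the difference between the two feature amounts.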

図２４の説明に移行する。合成データ生成部３５は、上記処理によって最適化したカメラパラメータを基にして、人物２１－１ａ～２５－１ａの身長をそれぞれ算出する。合成データ生成部３５は、人物２１－１ａ～２５－１ａのうち、算出した身長が、所定の範囲に含まれない人物を特定する。たとえば、所定の範囲を、「初期値（平均身長）±４」とする。合成データ生成部３５は、２回目以降のカメラパラメータの最適化を行う場合、算出した身長が、所定の範囲に含まれない人物の情報を用いる。 We now move on to the explanation of Figure 24. The synthetic data generation unit 35 calculates the height of each of the persons 21-1a to 25-1a based on the camera parameters optimized by the above process. The synthetic data generation unit 35 identifies persons 21-1a to 25-1a whose calculated heights do not fall within a predetermined range. For example, the predetermined range is set to "initial value (average height) ±4". When optimizing the camera parameters from the second time onwards, the synthetic data generation unit 35 uses information on persons whose calculated heights do not fall within the predetermined range.

図２４に示した例では、最適化したカメラパラメータを基にして算出した人物２１－１ａ～２５－１ａの身長をそれぞれ「１７３」、「１６９」、「１６７」、「１７７」、「１７０」とする。初期値を１７２とすると、所定の範囲は「１６８～１７６」となる。そうすると、合成データ生成部３５は、身長が、所定の範囲に含まれない人物として、身長「１６７」の人物２３－１ａと身長「１７７」の人物２４－１ａとを特定する。 In the example shown in FIG. 24, the heights of persons 21-1a to 25-1a calculated based on the optimized camera parameters are "173", "169", "167", "177", and "170", respectively. If the initial value is 172, the predetermined range is "168 to 176". The composite data generation unit 35 then identifies person 23-1a with a height of "167" and person 24-1a with a height of "177" as persons whose heights do not fall within the predetermined range.

合成データ生成部３５は、特定した人物の身長が、初期値以上場合には、人物の身長に所定値を加算し、加算した身長を２回目の初期値として設定する。合成データ生成部３５は、特定した人物の身長が、初期値未満の場合には、人物の身長に所定値を減算し、減算した身長を２回目の初期値として設定する。所定値を１とする。 If the height of the identified person is equal to or greater than the initial value, the composite data generation unit 35 adds a predetermined value to the person's height and sets the added height as the second initial value. If the height of the identified person is less than the initial value, the composite data generation unit 35 subtracts a predetermined value from the person's height and sets the subtracted height as the second initial value. The predetermined value is 1.

たとえば、人物２３－１ａの身長が「１６７」であり、初期値未満である。このため、合成データ生成部３５は、人物２３－１ａの身長Ｌに、２回目の初期値として「１６６」を設定する。人物２４－１ａの身長が「１７７」であり、初期値以上である。このため、合成データ生成部３５は、人物２４－１ａの身長Ｌに、２回目の初期値として「１７８」を設定する。 For example, the height of person 23-1a is "167", which is less than the initial value. Therefore, the composite data generation unit 35 sets the height L of person 23-1a to "166" as the second initial value. The height of person 24-1a is "177", which is greater than or equal to the initial value. Therefore, the composite data generation unit 35 sets the height L of person 24-1a to "178" as the second initial value.
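
The re-initialization rule above can be written as a short Python sketch. The function name and the dictionary layout are hypothetical; the numbers are the ones given in the example of FIG. 24 (initial value 172, range ±4, predetermined value 1).

```python
def reassign_heights(estimated, initial, tolerance=4, step=1):
    """Return new initial heights for persons whose estimated height is outside the range.

    estimated: mapping of person id -> height calculated with the optimized camera parameters.
    Persons inside [initial - tolerance, initial + tolerance] are not returned.
    """
    low, high = initial - tolerance, initial + tolerance
    updated = {}
    for person, height in estimated.items():
        if low <= height <= high:
            continue
        # At or above the initial value: add the step; below it: subtract the step.
        updated[person] = height + step if height >= initial else height - step
    return updated

heights = {"21-1a": 173, "22-1a": 169, "23-1a": 167, "24-1a": 177, "25-1a": 170}
second_round = reassign_heights(heights, initial=172)
# second_round == {"23-1a": 166, "24-1a": 178}
```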

合成データ生成部３５は、人物２３－１ａ，２４－１ａの身長Ｌを固定した状態で、それぞれの人物２３－１ａ，２４－１ａについて、第一の特徴量と、第二の特徴量との差が小さくなるように、カメラ１００のカメラパラメータを最適化する。２回目のカメラパラメータの初期値は、１回目のカメラパラメータの推定結果とする。 The composite data generation unit 35 optimizes the camera parameters of the camera 100 so as to reduce the difference between the first feature amount and the second feature amount for each of the persons 23-1a and 24-1a while fixing the height L of the persons 23-1a and 24-1a. The initial values of the camera parameters for the second estimation are set to the results of the first estimation of the camera parameters.

上記のように合成データ生成部３５は、１回目の処理で、各人物に仮の平均身長を割り当て、カメラパラメータを推定する。合成データ生成部３５は、カメラパラメータの推定結果から特定される各人物の身長が、平均身長を基準とする所定範囲に含まれない人物を抽出する。合成データ生成部３５は、抽出した人物の身長を用いて、２回目以降のカメラパラメータを再計算することで、カメラパラメータを決定する。これによって、人物の身長を精度よく算出するためのカメラパラメータをカメラ１００に設定することができる。 As described above, in the first processing, the synthetic data generation unit 35 assigns a tentative average height to each person and estimates the camera parameters. The synthetic data generation unit 35 extracts people whose heights, identified from the camera parameter estimation results, do not fall within a predetermined range based on the average height. The synthetic data generation unit 35 determines the camera parameters by recalculating the camera parameters from the second time onwards using the heights of the extracted people. This makes it possible to set camera parameters in the camera 100 for calculating a person's height with high accuracy.

合成データ生成部３５は、平均身長を基準とする所定範囲に含まれない人物を特定して、かかる人物の身長を再設定し、カメラパラメータの再計算を行い、統計的な平均値から外れる人物が存在する場合でも、カメラパラメータを収束させることができる。 The synthetic data generation unit 35 identifies people who are not within a specified range based on the average height, resets the height of such people, and recalculates the camera parameters, so that the camera parameters can converge even if there are people who deviate from the statistical average value.

このように、合成データ生成部３５は、画像フレームに含まれる全ての人物の身長が未知であっても、それぞれの人物の身長を推定することができる。したがって、合成データ生成部３５は、推定された身長となるように、３Ｄアバターの身長を変更して該当領域に配置することで、状況を正確に表した合成データを生成することができる。 In this way, the synthetic data generation unit 35 can estimate the height of each person even if the height of all people included in the image frame is unknown. Therefore, the synthetic data generation unit 35 can generate synthetic data that accurately represents the situation by changing the height of the 3D avatar and placing it in the relevant area so that it matches the estimated height.

（機械学習へ適用） (Applied to machine learning)
図２に戻り、機械学習部３６は、合成データから２次元の人物モデルを含む画像データを生成し、画像データを訓練データとして、画像データの入力に応じて人物を識別する機械学習モデルを生成する処理部である。 Returning to Figure 2, the machine learning unit 36 is a processing unit that generates image data including a two-dimensional person model from the synthetic data, and uses the image data as training data to generate a machine learning model that identifies a person in response to input image data.

具体的には、機械学習部３６は、合成データ内の３Ｄアバターを公知の手法で２次元化した２次元の画像データを生成する。ここで、上述したように、合成データは、ユーザの指定条件にしたがって生成されていることから、合成データで使用された場所画像データは、撮像された場所が既知であり、合成データ内の３Ｄアバターは、種別、行動、姿勢等が既知である。 Specifically, the machine learning unit 36 generates two-dimensional image data by two-dimensionalizing the 3D avatar in the composite data using a known method. Here, as described above, the composite data is generated according to the conditions specified by the user, so the location image data used in the composite data is known in terms of the location where it was captured, and the type, behavior, posture, etc. of the 3D avatar in the composite data are known.

そのため、機械学習部３６は、合成データに基づく２次元の画像データに、学習対象の機械学習モデルに応じたラベルを付加することで、学習内容に応じた訓練データを生成する。なお、機械学習部３６は、訓練として、目的変数と各機械学習モデルの出力結果との誤差が最小化するように、各機械学習モデルのパラメータ更新を実行する。 Therefore, the machine learning unit 36 generates training data according to the learning content by adding a label according to the machine learning model to be learned to the two-dimensional image data based on the synthetic data. Note that, as training, the machine learning unit 36 executes parameter updates for each machine learning model so as to minimize the error between the objective variable and the output result of each machine learning model.

図２５は、各種モデルの訓練への適用を説明する図である。図２５に示すように、機械学習部３６は、人物検出モデルの訓練を行う場合、２次元の画像データに、ラベルとして「人物の領域（バウンティングボックス）」を付加した訓練データを生成する。そして、機械学習部３６は、画像データを説明変数、ラベルを目的変数とする訓練データを用いて、画像データから画像データに写っている物の領域を検出する人物検出モデルの訓練を実行する。 Figure 25 is a diagram explaining the application of the synthetic data to the training of various models. As shown in Figure 25, when training a person detection model, the machine learning unit 36 generates training data in which a "person area (bounding box)" is added as a label to two-dimensional image data. Then, the machine learning unit 36 uses the training data, in which the image data is the explanatory variable and the label is the objective variable, to train a person detection model that detects the area of an object appearing in the image data.

同様に、機械学習部３６は、属性推定モデルの訓練を行う場合、２次元の画像データに、ラベルとして「服装」、「年齢」や「性別」などを付加した訓練データを生成する。そして、機械学習部３６は、画像データを説明変数、ラベルを目的変数とする訓練データを用いて、画像データから画像データに写っている人物の属性を推定する属性推定モデルの訓練を実行する。 Similarly, when training an attribute estimation model, the machine learning unit 36 generates training data in which labels such as "clothing," "age," and "gender" are added to two-dimensional image data. Then, the machine learning unit 36 uses the training data, which has the image data as an explanatory variable and the labels as a target variable, to train an attribute estimation model that estimates the attributes of a person depicted in the image data from the image data.

同様に、機械学習部３６は、骨格推定モデルの訓練を行う場合、２次元の画像データに、ラベルとして「骨格情報（例えば１８関節の情報）」などを付加した訓練データを生成する。そして、機械学習部３６は、画像データを説明変数、ラベルを目的変数とする訓練データを用いて、画像データから画像データに写っている人物の骨格を推定する骨格推定モデルの訓練を実行する。 Similarly, when training a skeletal estimation model, the machine learning unit 36 generates training data in which "skeletal information (e.g., information on 18 joints)" is added as a label to two-dimensional image data. Then, the machine learning unit 36 uses the training data, which has the image data as an explanatory variable and the label as a target variable, to train a skeletal estimation model that estimates the skeleton of a person depicted in the image data from the image data.

同様に、機械学習部３６は、行動検知モデルの訓練を行う場合、２次元の画像データに、ラベルとして「歩く」、「座る」や「物を取る」などの「行動」を付加した訓練データを生成する。そして、機械学習部３６は、画像データを説明変数、ラベルを目的変数とする訓練データを用いて、画像データから画像データに写っている人物の行動を検知する行動検知モデルの訓練を実行する。 Similarly, when training a behavior detection model, the machine learning unit 36 generates training data in which "behaviors" such as "walking," "sitting," and "picking up an object" are added as labels to two-dimensional image data. Then, the machine learning unit 36 uses the training data, which has the image data as an explanatory variable and the labels as a target variable, to train a behavior detection model that detects the behavior of a person appearing in the image data from the image data.
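
Because the type, behavior, and posture of the placed 3D avatar are already known, labels for each model can be taken directly from the metadata of the synthetic data. The following Python sketch illustrates this idea; the dictionary keys, the model names, and the `make_training_sample` helper are hypothetical and only mirror the label types listed above.

```python
# Label fields per model type, following the four models described above.
LABEL_FIELDS = {
    "person_detection": ("bounding_box",),
    "attribute_estimation": ("clothing", "age", "gender"),
    "skeleton_estimation": ("joints",),
    "behavior_detection": ("behavior",),
}

def make_training_sample(model_name, image, metadata):
    """Pair a 2D image (explanatory variable) with the labels (objective variable) for one model."""
    fields = LABEL_FIELDS[model_name]
    label = {field: metadata[field] for field in fields}
    return {"x": image, "y": label}

# Metadata known from how the synthetic data was generated (values are made up).
metadata = {
    "bounding_box": (120, 40, 260, 400),
    "clothing": "uniform",
    "age": "adult",
    "gender": "male",
    "joints": [(190, 60)] * 18,
    "behavior": "picking up an object",
}
sample = make_training_sample("behavior_detection", "frame_0001.png", metadata)
```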

＜処理の流れ＞ <Processing flow>
図２６は、実施例１にかかる合成データの生成処理の流れを示すフローチャートである。図２６に示すように、情報処理装置１０は、処理開始が指示されると（Ｓ１０１：Ｙｅｓ）、合成処理の開始が指示されるまで（Ｓ１０３：Ｎｏ）、人物画像データと場所画像データを蓄積する（Ｓ１０２）。 Fig. 26 is a flowchart showing the flow of a process for generating composite data according to Example 1. As shown in Fig. 26, when an instruction to start processing is given (S101: Yes), the information processing device 10 accumulates person image data and place image data (S102) until an instruction to start a composite process is given (S103: No).

その後、情報処理装置１０は、合成処理の開始が指示されると（Ｓ１０３：Ｙｅｓ）、各人物画像データから各３Ｄアバターを生成する（Ｓ１０４）。続いて、情報処理装置１０は、各場所画像データに対して行動領域の特定を行い、各場所画像データにおいて人物が行動する領域を特定する（Ｓ１０５）。 Thereafter, when the information processing device 10 is instructed to start the synthesis process (S103: Yes), the information processing device 10 generates each 3D avatar from each person image data (S104). Next, the information processing device 10 identifies an action area for each place image data, and identifies an area in each place image data where the person is acting (S105).

そして、情報処理装置１０は、カメラパラメータの推定を実行し（Ｓ１０６）、推定されたカメラパラメータを用いて、任意の３Ｄアバターと任意の場所画像データとを用いて、行動領域に３Ｄアバターが配置された合成データを生成する（Ｓ１０７）。 Then, the information processing device 10 performs estimation of camera parameters (S106), and uses the estimated camera parameters to generate composite data in which the 3D avatar is placed in the action area using an arbitrary 3D avatar and arbitrary location image data (S107).

ここで、他の合成データを生成する場合（Ｓ１０８：Ｎｏ）、情報処理装置１０は、Ｓ１０７以降を繰り返し、他の合成データを生成しない場合（Ｓ１０８：Ｙｅｓ）、処理を終了する。 If other composite data is to be generated (S108: No), the information processing device 10 repeats S107 and subsequent steps, and if other composite data is not to be generated (S108: Yes), the processing ends.

＜効果＞ <Effects>
上述したように、実施例１にかかる情報処理装置１０は、すでに設置されているカメラにより撮像された人物画像データや場所画像データを用いて、合成データを生成することができる。また、情報処理装置１０は、合成データから訓練データを生成して、各種機械学習モデルを訓練することができる。 As described above, the information processing device 10 according to the first embodiment can generate synthetic data using person image data and location image data captured by a camera that has already been installed. In addition, the information processing device 10 can generate training data from the synthetic data and train various machine learning models.

したがって、情報処理装置１０は、各種機械学習モデルを用いた検知を行う現場に適した訓練データを用いて、現場に適した機械学習モデルを生成することができる。この結果、情報処理装置１０は、現場での行動検知の精度を向上させることができる。 Therefore, the information processing device 10 can generate a machine learning model suitable for the site by using training data suitable for the site where detection using various machine learning models is performed. As a result, the information processing device 10 can improve the accuracy of behavior detection at the site.

また、情報処理装置１０は、各現場で教師ありの訓練データを収集する場合に比べて、高速に訓練データを生成することができるので、機械学習モデルの訓練にかかるコストを削減することができる。また、情報処理装置１０は、教師ありの訓練データを高速かつ正確に生成することができるので、現場に適した機械学習モデルの生成時間を短縮することができる。さらに、情報処理装置１０は、現場における精度高い行動検知の高速に実現することができる。 In addition, the information processing device 10 can generate training data at high speed compared to collecting supervised training data at each site, thereby reducing the cost of training the machine learning model. In addition, the information processing device 10 can generate supervised training data quickly and accurately, thereby shortening the time required to generate a machine learning model suitable for the site. Furthermore, the information processing device 10 can quickly achieve highly accurate behavior detection in the field.

ところで、人物が行う行動には様々な行動が含まれるが、商品を手にとる行動や商品を物色する行動のように、種別が類似する行動には同じような動作（骨格情報の変化）が行われる。したがって、行動をルール化しておき、ルールに基づいて人物の３次元モデルのポーズと配置を自動で決定することができる。 A person's actions include a wide variety of behaviors, but similar types of actions, such as picking up a product or browsing for products, involve similar movements (changes in skeletal information). Therefore, it is possible to define rules for actions and automatically determine the pose and position of a three-dimensional human model based on the rules.

そこで、実施例２では、情報処理装置１０が、行動を決めるルールに基づいて、人物の３次元モデルのポーズと配置を決定して、自動で合成データを生成する例を説明する。また、実施例２にかかる情報処理装置１０は、生成された合成データを各種機械学習モデルに入力し、正しく認識されなかった機械学習モデルの訓練を実行する。 Accordingly, Example 2 describes an example in which the information processing device 10 determines the pose and placement of a three-dimensional model of a person based on rules that define behavior, and automatically generates synthetic data. In addition, the information processing device 10 according to Example 2 inputs the generated synthetic data into various machine learning models and trains any machine learning model that failed to recognize the data correctly.

図２７は、実施例２にかかる情報処理装置１０を説明する図である。図２７に示すように、情報処理装置１０は、予め定義したルールＡ、ルールＢ、ルールＣなどの各種ルールを記憶する。情報処理装置１０は、実施例１の手法で生成された各３Ｄアバターと行動領域が特定された各場所画像データとを用いて、各ルールに基づく姿勢で３Ｄアバターを配置した合成データＡ１、合成データＢ１、合成データＣ１を生成する。 Fig. 27 is a diagram illustrating an information processing device 10 according to Example 2. As shown in Fig. 27, the information processing device 10 stores various rules, such as predefined rules A, B, and C. Using each 3D avatar generated by the method of Example 1 and each location image data in which the action area is identified, the information processing device 10 generates composite data A1, B1, and C1 in which the 3D avatar is positioned in a posture based on each rule.

その後、情報処理装置１０は、人物検出モデル、属性推定モデル、骨格推定モデルなどの各種機械学習モデルに、各合成データを入力して、各モデルの出力結果を取得する。そして、情報処理装置１０は、推定が失敗した機械学習モデルを訓練対象と特定する。例えば、合成データＡ１が入力されたときの人物検出モデルの人物検出結果が正しい検出結果ではない場合、情報処理装置１０は、合成データＡ１を用いて人物検出モデルの訓練を実行する。また、合成データＣ１が入力されたときの属性推定モデルの属性推定結果が正しい推定結果ではない場合、情報処理装置１０は、合成データＣ１を用いて属性推定モデルの訓練を実行する。 Then, the information processing device 10 inputs each synthetic data into various machine learning models such as a person detection model, an attribute estimation model, and a skeleton estimation model, and obtains the output results of each model. Then, the information processing device 10 identifies the machine learning model in which the estimation has failed as the training target. For example, if the person detection result of the person detection model when synthetic data A1 is input is not a correct detection result, the information processing device 10 trains the person detection model using the synthetic data A1. Also, if the attribute estimation result of the attribute estimation model when synthetic data C1 is input is not a correct estimation result, the information processing device 10 trains the attribute estimation model using the synthetic data C1.
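
The selection of training targets described above can be sketched in Python as follows. Each model is represented by a plain callable and a prediction counts as a failure when it differs from the known label; real detectors would need a tolerance-based comparison (for example, bounding-box overlap), so the exact-match test, the data layout, and the function name are simplifying assumptions.

```python
def select_retraining_targets(models, samples):
    """Return {model name: [ids of synthetic data the model got wrong]}.

    models: mapping of model name -> callable taking the input and returning a prediction.
    samples: list of dicts with "id", "input" and "expected" (labels keyed by model name).
    """
    targets = {}
    for name, predict in models.items():
        failed = [
            sample["id"]
            for sample in samples
            if name in sample["expected"]
            and predict(sample["input"]) != sample["expected"][name]
        ]
        if failed:
            targets[name] = failed
    return targets

# Stub models and labels, for illustration only.
models = {
    "person_detection": lambda data: data["has_person"],
    "attribute_estimation": lambda data: "adult",
}
samples = [
    {"id": "A1", "input": {"has_person": False},
     "expected": {"person_detection": True, "attribute_estimation": "adult"}},
    {"id": "C1", "input": {"has_person": True},
     "expected": {"person_detection": True, "attribute_estimation": "child"}},
]
targets = select_retraining_targets(models, samples)
```

In this stub setup the person detection model would be retrained with synthetic data A1 and the attribute estimation model with synthetic data C1, matching the example in the text.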

図２８は、実施例２にかかる情報処理装置１０の機能構成を示す機能ブロック図である。図２８に示す各処理部や各ＤＢのうち、実施例１と異なる点は、行動ルールＤＢ２７であるので、ここでは、行動ルールＤＢ２７について説明する。 Figure 28 is a functional block diagram showing the functional configuration of the information processing device 10 according to Example 2. Among the processing units and DBs shown in Figure 28, the difference from Example 1 is the behavior rule DB 27, so here, the behavior rule DB 27 will be described.

行動ルールＤＢ２７は、人物の行動の要素を示すルールを記憶するデータベースである。具体的には、行動ルールＤＢ２７は、行動ごとに、３Ｄアバターの配置位置や３Ｄアバターの姿勢やポーズ等を記憶する。 The behavior rule DB27 is a database that stores rules that indicate the elements of a person's behavior. Specifically, the behavior rule DB27 stores the placement position of the 3D avatar, the posture and pose of the 3D avatar, etc. for each behavior.

図２９は、行動ルールＤＢ２７を説明する図である。図２９に示すように、行動ルールＤＢ２７は、「行動名、姿勢、身体の向き、立ち位置、手の位置」を対応付けて記憶する。ここで記憶される「行動名」は、行動を一意に識別する情報である。「姿勢」は、行動を行うときに姿勢を示す情報である。「身体の向き」は、行動を行うときに身体の向きを示す情報である。「立ち位置」は、行動を行うときに立ち位置を示す情報である。「手の位置」は、行動を行うときに手の位置を示す情報である。 Figure 29 is a diagram explaining the behavior rule DB27. As shown in Figure 29, the behavior rule DB27 stores "behavior name, posture, body direction, standing position, hand position" in association with each other. The "behavior name" stored here is information that uniquely identifies the behavior. "Posture" is information that indicates the posture when performing the behavior. "Body direction" is information that indicates the body direction when performing the behavior. "Standing position" is information that indicates the standing position when performing the behavior. "Hand position" is information that indicates the hand position when performing the behavior.

図２９の例では、「倒れる行動」の場合、人物は、「床」で「寝る」姿勢を取り、「床にお尻や体が付く」状態であることが定義されている。なお、行動の要素として「姿勢」や「身体の向き」を例示したが、「手の位置」を省略し、「背中の向き」を追加するなど、任意に変更することができる。 In the example of Figure 29, in the case of "falling action," the person is defined as "lying down" on the "floor," with "buttocks and body touching the floor." Note that while "posture" and "body direction" are given as examples of elements of the action, they can be changed as desired, for example, omitting "hand position" and adding "back direction."

合成データ生成部３５は、行動ルールに基づいて、合成データを自動で生成する。例えば、合成データ生成部３５は、ユーザが指定した行動に対応するルール（行動の要素）を行動ルールＤＢ２７から特定し、特定したルールにしたがって合成データを生成する。また、合成データ生成部３５は、行動ルールＤＢ２７に記憶される行動ごとに、行動に対応付けられるルールに一致する３Ｄアバターと場所画像データとを選択して、合成データを生成することもできる。 The synthetic data generation unit 35 automatically generates synthetic data based on the behavior rules. For example, the synthetic data generation unit 35 identifies a rule (element of behavior) corresponding to a behavior specified by the user from the behavior rule DB 27, and generates synthetic data according to the identified rule. In addition, the synthetic data generation unit 35 can also generate synthetic data by selecting a 3D avatar and location image data that match the rule associated with the behavior for each behavior stored in the behavior rule DB 27.

図３０は、行動ルールに基づく合成データの生成を説明する図である。図３０に示すように、合成データ生成部３５は、ユーザにより「商品を取る」行動が指定された場合、行動ルールＤＢ２７から行動の要素を取得する。すなわち、「商品を取る行動」の場合、人物は、「領域（ROI：Region of Interest）に両足首が入った」状態で、「立つ」姿勢を取り、身体が「商品棚に向いて」、手を「商品棚に入れる」状態であることが定義されている。 Figure 30 is a diagram explaining the generation of synthetic data based on behavior rules. As shown in Figure 30, when the behavior of "picking up a product" is specified by the user, the synthetic data generation unit 35 acquires behavior elements from the behavior rule DB 27. That is, in the case of the "behavior of picking up a product," it is defined that the person is in a "standing" posture with "both ankles in the region of interest (ROI)," with the body "facing the product shelf," and with the hands "placed in the product shelf."
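The rule lookup described above can be sketched as a small in-memory table. This is only an illustrative sketch: the class name `BehaviorRule`, the dictionary `BEHAVIOR_RULE_DB`, the function `find_rule`, and the English strings used as field values are assumptions made for the example and do not appear in the specification. Only the five columns and the "picking up a product" values are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BehaviorRule:
    """One row of the behavior rule DB (columns follow Figure 29)."""
    behavior_name: str       # uniquely identifies the behavior
    posture: str             # posture taken when performing the behavior
    body_direction: str      # direction the body faces
    standing_position: str   # where the person stands
    hand_position: str       # where the hands are


# Hypothetical in-memory stand-in for the behavior rule DB 27.
BEHAVIOR_RULE_DB: Dict[str, BehaviorRule] = {
    "pick up product": BehaviorRule(
        behavior_name="pick up product",
        posture="standing",
        body_direction="facing product shelf",
        standing_position="both ankles inside ROI",
        hand_position="put into product shelf",
    ),
}


def find_rule(behavior_name: str) -> BehaviorRule:
    """Return the rule for the behavior specified by the user.

    Raises KeyError when no rule is registered for that behavior.
    """
    return BEHAVIOR_RULE_DB[behavior_name]
```

Keeping each behavior as one immutable record means elements can be added or dropped (for example, replacing the hand position with a back direction) by changing the record type alone.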

合成データ生成部３５は、特定した要素のうち、姿勢「立つ」を特定する。そして、合成データ生成部３５は、該当する３Ｄアバターの姿勢を立った姿勢に変更する。 The synthetic data generation unit 35 identifies the posture "standing" from among the identified elements. The synthetic data generation unit 35 then changes the posture of the corresponding 3D avatar to a standing posture.

続いて、合成データ生成部３５は、商品棚を含む場所画像データを選択し、実施例１の手法で特定された行動領域と商品棚領域のそれぞれを立ち位置ＲＯＩと手の位置ＲＯＩに設定する。そして、合成データ生成部３５は、立ち位置ＲＯＩに、立ち姿勢の３Ｄアバターを配置する。ここで、合成データ生成部３５は、「商品を取る行動」に対応付けられる要素のうち、身体の向きが「商品棚に向いている」かつ「手の位置」が「商品棚に入れる」となっていることから、立ち位置ＲＯＩから手の位置ＲＯＩに対して、手を伸ばしている３Ｄアバターを配置する。 Next, the synthetic data generation unit 35 selects location image data that includes a product shelf, and sets the action area and the product shelf area identified by the method of Example 1 as the standing position ROI and the hand position ROI, respectively. The synthetic data generation unit 35 then places a 3D avatar in a standing posture in the standing position ROI. Here, because the elements associated with the "action of picking up a product" specify that the body direction is "facing the product shelf" and the hand position is "put into the product shelf," the synthetic data generation unit 35 places a 3D avatar that is reaching from the standing position ROI toward the hand position ROI.
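As a rough illustration of the placement step, the sketch below derives where the avatar stands, which way it faces, and where its hand reaches from the two ROIs. It is a simplification under stated assumptions: the ROIs are treated as axis-aligned boxes in 2D image coordinates, the box centers are used as anchor points, and the names `Box`, `AvatarPlacement`, and `plan_pick_up_placement` are hypothetical. The specification itself places a 3D avatar using camera parameters, which this sketch does not model.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned ROI in image coordinates (assumed representation)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


@dataclass(frozen=True)
class AvatarPlacement:
    posture: str
    foot_xy: Tuple[float, float]         # where the avatar stands
    facing_deg: float                    # direction from feet toward the shelf
    hand_target_xy: Tuple[float, float]  # point the hand reaches toward


def plan_pick_up_placement(standing_roi: Box, hand_roi: Box) -> AvatarPlacement:
    """Plan a standing avatar that reaches from the standing ROI to the hand ROI."""
    foot = standing_roi.center()
    hand = hand_roi.center()
    # Angle of the vector from the standing point to the hand target.
    facing = math.degrees(math.atan2(hand[1] - foot[1], hand[0] - foot[0]))
    return AvatarPlacement("standing", foot, facing, hand)
```

For example, `plan_pick_up_placement(Box(0, 0, 2, 2), Box(4, 0, 6, 2))` yields a standing avatar at `(1.0, 1.0)` facing `0.0` degrees with its hand target at `(5.0, 1.0)`.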

このように行動ごとにルールを対応付けておくことで、合成データ生成部３５は、行動に合致した合成データを正確に生成することができるので、合成データの生成時間を短縮することができ、人為的なミスによる不正確な合成データの生成を抑制することができる。 By associating a rule with each behavior in this way, the synthetic data generation unit 35 can accurately generate synthetic data that matches the behavior, thereby shortening the time it takes to generate synthetic data and preventing the generation of inaccurate synthetic data due to human error.

次に、合成データを用いた各種機械学習モデルの評価の具体例を説明する。ここで例示する機械学習モデルは例示であり、数、評価の順番、機械学習モデルの種別などを限定するものではない。 Next, we will explain specific examples of evaluating various machine learning models using synthetic data. The machine learning models shown here are merely examples, and are not intended to limit the number, order of evaluation, or type of machine learning models.

図３１は、実施例２にかかる合成データを用いた機械学習モデルの評価処理の流れを示すフローチャートである。図３１に示すように、機械学習部３６は、処理開始が指示されると（Ｓ２０１：Ｙｅｓ）、生成されて記憶部２０等に格納される合成データを取得する（Ｓ２０２）。 Fig. 31 is a flowchart showing the flow of the evaluation process of the machine learning model using synthetic data according to Example 2. As shown in Fig. 31, when an instruction to start the process is received (S201: Yes), the machine learning unit 36 acquires the synthetic data that is generated and stored in the storage unit 20 or the like (S202).

続いて、機械学習部３６は、実施例１による手法や合成データの生成に使用されたルールから、合成データにラベル（正解情報）を設定する（Ｓ２０３）。例えば、機械学習部３６は、人物の領域、属性、骨格情報などを各ラベルとして設定する。なお、機械学習部３６は、合成データから生成した上記２次元モデルの画像データを用いてもよい。 Next, the machine learning unit 36 assigns labels (correct answer information) to the synthetic data based on the method of Example 1 and the rules used to generate the synthetic data (S203). For example, the machine learning unit 36 sets the person's region, attributes, skeleton information, and the like as the respective labels. Note that the machine learning unit 36 may instead use the image data of the above two-dimensional model generated from the synthetic data.

その後、機械学習部３６は、人物検出モデルを用いて合成データから人物検出を実行する（Ｓ２０４）。すなわち、機械学習部３６は、合成データを人物検出モデルに入力して、人物検出モデルによる人物検出結果を取得する。 Then, the machine learning unit 36 performs person detection from the synthetic data using the person detection model (S204). That is, the machine learning unit 36 inputs the synthetic data into the person detection model and obtains a person detection result by the person detection model.

そして、機械学習部３６は、人物検出モデルによる人物検出が成功した場合（Ｓ２０５：Ｙｅｓ）、Ｓ２０６を実行せずにＳ２０７を実行し、人物検出モデルによる人物検出が失敗した場合（Ｓ２０５：Ｎｏ）、合成データを用いて人物検出モデルを訓練する（Ｓ２０６）。例えば、機械学習部３６は、人物検出モデルにより合成データから人物が検出されなかった場合、合成データを説明変数、ラベルを目的変数として人物検出モデルの訓練を実行する。 If person detection using the person detection model is successful (S205: Yes), the machine learning unit 36 executes S207 without executing S206, and if person detection using the person detection model is unsuccessful (S205: No), the machine learning unit 36 trains the person detection model using the synthetic data (S206). For example, if a person is not detected from the synthetic data using the person detection model, the machine learning unit 36 trains the person detection model using the synthetic data as an explanatory variable and the label as a target variable.

その後、機械学習部３６は、属性推定モデルを用いて合成データから人物の属性推定を実行する（Ｓ２０７）。すなわち、機械学習部３６は、合成データを属性推定モデルに入力して、属性推定モデルによる属性推定結果を取得する。 Then, the machine learning unit 36 estimates the attributes of the person from the synthetic data using the attribute estimation model (S207). That is, the machine learning unit 36 inputs the synthetic data into the attribute estimation model and obtains the attribute estimation result from the attribute estimation model.

そして、機械学習部３６は、属性推定モデルによる属性推定が成功した場合（Ｓ２０８：Ｙｅｓ）、Ｓ２０９を実行せずにＳ２１０を実行し、属性推定モデルによる属性推定が失敗した場合（Ｓ２０８：Ｎｏ）、合成データを用いて属性推定モデルを訓練する（Ｓ２０９）。例えば、機械学習部３６は、属性推定モデルにより合成データから属性が推定されなかった場合や属性が間違って推定された場合、合成データを説明変数、ラベルを目的変数として属性推定モデルの訓練を実行する。 If attribute estimation using the attribute estimation model is successful (S208: Yes), the machine learning unit 36 executes S210 without executing S209, and if attribute estimation using the attribute estimation model is unsuccessful (S208: No), the machine learning unit 36 trains the attribute estimation model using the synthetic data (S209). For example, if an attribute is not estimated from the synthetic data by the attribute estimation model or an attribute is incorrectly estimated, the machine learning unit 36 trains the attribute estimation model using the synthetic data as an explanatory variable and the label as a target variable.

その後、機械学習部３６は、骨格推定モデルを用いて合成データから人物の骨格推定を実行する（Ｓ２１０）。すなわち、機械学習部３６は、合成データを骨格推定モデルに入力して、骨格推定モデルによる骨格推定結果を取得する。 Then, the machine learning unit 36 estimates the skeleton of the person from the synthetic data using the skeleton estimation model (S210). That is, the machine learning unit 36 inputs the synthetic data into the skeleton estimation model and obtains the skeleton estimation result from the skeleton estimation model.

そして、機械学習部３６は、骨格推定モデルによる骨格推定が成功した場合（Ｓ２１１：Ｙｅｓ）、処理を終了する。一方、機械学習部３６は、骨格推定モデルによる骨格推定が失敗した場合（Ｓ２１１：Ｎｏ）、合成データを用いて骨格推定モデルを訓練する（Ｓ２１２）。例えば、機械学習部３６は、骨格推定モデルにより合成データから骨格が推定されなかった場合や骨格が間違って推定された場合、合成データを説明変数、ラベルを目的変数として骨格推定モデルの訓練を実行する。 Then, if skeleton estimation using the skeleton estimation model is successful (S211: Yes), the machine learning unit 36 ends the process. On the other hand, if skeleton estimation using the skeleton estimation model is unsuccessful (S211: No), the machine learning unit 36 trains the skeleton estimation model using the synthetic data (S212). For example, if the skeleton is not estimated from the synthetic data by the skeleton estimation model or if the skeleton is incorrectly estimated, the machine learning unit 36 trains the skeleton estimation model using the synthetic data as an explanatory variable and the label as a target variable.

このように、機械学習部３６は、推定精度が悪い機械学習モデルのみを特定し、その機械学習モデルに対してのみ訓練および再訓練を行うことができる。この結果、情報処理装置１０は、現場で複数の機械学習モデルを使用する場合でも、精度劣化の検出精度を向上させることができ、精度劣化の是正処理の短縮を実現することができる。 In this way, the machine learning unit 36 can identify only machine learning models with poor estimation accuracy and perform training and retraining only on those machine learning models. As a result, the information processing device 10 can improve the accuracy of detecting accuracy degradation even when multiple machine learning models are used on-site, and can shorten the process of correcting accuracy degradation.
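The flow of S204 to S212 can be sketched as a loop that runs each model on the synthetic data and trains only the models whose output does not match the label. This is a minimal sketch under assumptions: the names `StubModel`, `EVALUATION_ORDER`, and `evaluate_and_train` are hypothetical, the `predict`/`train` interface is invented for the example, and "success" is reduced to exact equality with the label, whereas a real evaluation would likely use a tolerance or overlap measure.

```python
from typing import Any, Dict, List, Sequence


class StubModel:
    """Toy stand-in for a machine learning model (hypothetical interface)."""

    def __init__(self, output: Any) -> None:
        self.output = output
        self.train_calls = 0

    def predict(self, data: Any) -> Any:
        return self.output

    def train(self, data: Any, label: Any) -> None:
        # Pretend that training makes the model reproduce the label.
        self.output = label
        self.train_calls += 1


# Order used in the flowchart: S204 -> S207 -> S210.
EVALUATION_ORDER = ("person_detection", "attribute_estimation", "skeleton_estimation")


def evaluate_and_train(
    synthetic_data: Any,
    labels: Dict[str, Any],
    models: Dict[str, StubModel],
    order: Sequence[str] = EVALUATION_ORDER,
) -> List[str]:
    """Run each model on the synthetic data and train only those that fail.

    Returns the names of the models that were trained.
    """
    trained: List[str] = []
    for name in order:
        model = models[name]
        if model.predict(synthetic_data) != labels[name]:
            # Synthetic data is the explanatory variable, the label the target.
            model.train(synthetic_data, labels[name])
            trained.append(name)
    return trained
```

Because each model is checked independently, a model that already handles the synthetic data correctly is skipped, which matches the point that only the poorly performing models are trained.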

さて、これまで本発明の実施例について説明したが、本発明は上述した実施例以外にも、種々の異なる形態にて実施されてよいものである。 So far, we have explained the embodiments of the present invention, but the present invention may be implemented in various different forms other than the above-mentioned embodiments.

（数値等） (Numerical values, etc.)
上記実施例で用いたデータ例、行動、訓練データ、ラベル、機械学習モデルの種別や数、３Ｄアバター、３Ｄアバターの姿勢、現場等は、あくまで一例であり、任意に変更することができる。また、各フローチャートで説明した処理の流れも矛盾のない範囲内で適宜変更することができる。 The data examples, actions, training data, labels, types and numbers of machine learning models, 3D avatars, 3D avatar postures, sites, etc. used in the above embodiments are merely examples and can be changed as desired. In addition, the process flow described in each flowchart can be changed as appropriate within a range that does not cause inconsistencies.

（モデルの形態） (Model form)
上記実施例では、各機械学習モデルとしては、多値判定モデル（多値分類モデル）や２値分類モデルなどを用いることもできる。 In the above embodiments, each machine learning model may be a multi-value judgment model (multi-value classification model), a binary classification model, or the like.

（想定場所） (Assumed location)
上記実施例では、店舗を例にして説明したが、これに限定されるものではない。例えば倉庫、工場、教室、電車の車内や飛行機の客室などにも流用することができる。これらの場合、人物に関連する物体が収納された領域の一例として説明した商品棚の領域に代わりに、物を置く領域や荷物をしまう領域が検出、設定対象となる。また、上記情報処理装置１０は、姿勢や属性に限らず、ユニフォームの着用有無、エプロンの着用有無など、現場に応じた格好の３Ｄアバターを生成することができる。 In the above embodiment, a store is used as an example, but the present invention is not limited thereto. For example, the present invention can be used in warehouses, factories, classrooms, train cars, airplane cabins, and the like. In these cases, instead of the product shelf area described as an example of an area in which objects related to a person are stored, an area in which items are placed or luggage is stored is detected and set. Furthermore, the information processing device 10 can generate a 3D avatar suited to the scene, not only in terms of posture and attributes, but also in terms of whether or not the person is wearing a uniform or an apron.

また、上記実施例では、人物の足首の位置を用いる例を説明したが、これに限定されるものではなく、例えば足の位置、靴の位置などを用いることもできる。また、上記実施例では、顔の向きの方向にあるエリアを商品棚エリアと特定する例を説明したが、身体の向きの方向にあるエリアを商品棚エリアと特定することもできる。また、各機械学習モデルは、ニューラルネットワークなどを用いることができる。 In the above embodiment, an example using the position of a person's ankles was described, but the present invention is not limited to this; for example, the position of the feet or the position of the shoes can also be used. Likewise, the above embodiment described an example in which the area in the direction the face is pointing is identified as the product shelf area, but the area in the direction the body is facing can also be identified as the product shelf area. In addition, each machine learning model can use a neural network or the like.

（合成データ生成の別例） (Another example of synthetic data generation)
例えば、情報処理装置１０は、上記カメラパラメータ等を用いて商品棚等の奥行を推定し、行動領域（注目領域）に含まれる物体の後ろに３次元アバターが配置された合成データを生成することができる。 For example, the information processing device 10 can estimate the depth of a product shelf or the like using the above camera parameters, etc., and generate synthetic data in which a three-dimensional avatar is placed behind an object included in the behavior area (area of interest).

図３２は、３Ｄアバターの配置例を説明する図である。図３２に示すように、情報処理装置１０は、人物画像データから３Ｄアバターを生成する。また、情報処理装置１０は、場所画像データに対して環境認識を実行する。例えば、情報処理装置１０は、場所画像データからデプス画像やセマンティックセグメンテーション結果などを生成し、カメラパラメータの推定やキャリブレーションを実行する。そして、情報処理装置１０は、場所画像データ内の注目領域と注目領域までの距離を特定するとともに、セマンティックセグメンテーション結果によりラベル「商品棚」が設定された領域およびその領域までの距離を特定する。 Figure 32 is a diagram illustrating an example of the arrangement of 3D avatars. As shown in Figure 32, the information processing device 10 generates a 3D avatar from person image data. The information processing device 10 also performs environment recognition on the location image data. For example, the information processing device 10 generates a depth image, a semantic segmentation result, and the like from the location image data, and performs camera parameter estimation and calibration. The information processing device 10 then identifies the attention area in the location image data and the distance to that attention area, and also identifies the area labeled "product shelf" in the semantic segmentation result and the distance to that area.

ここで、情報処理装置１０は、商品棚と注目領域の位置関係を正確に再現した合成データを生成する。例えば、情報処理装置１０は、注目領域の中に商品棚が含まれるとともに、注目領域が商品棚よりもカメラに近い場合には、図３２の（ａ）に示すように、商品棚の前に３Ｄアバター７０を配置する。一方、情報処理装置１０は、注目領域の中に商品棚が含まれるとともに、商品棚が注目領域よりもカメラに近い場合には、図３２の（ｂ）に示すように、商品棚の後ろに３Ｄアバター７０を配置する。 Here, the information processing device 10 generates synthetic data that accurately reproduces the positional relationship between the product shelf and the attention area. For example, when the attention area includes a product shelf and the attention area is closer to the camera than the product shelf, the information processing device 10 places a 3D avatar 70 in front of the product shelf as shown in FIG. 32(a). On the other hand, when the attention area includes a product shelf and the product shelf is closer to the camera than the attention area, the information processing device 10 places a 3D avatar 70 behind the product shelf as shown in FIG. 32(b).
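The front/behind decision described above can be sketched as a distance comparison followed by a masked overlay. The sketch is illustrative only: `avatar_layer` and `composite` are hypothetical names, images and masks are plain nested lists rather than real image buffers, and treating equal distances as "behind" is an assumption, since the text does not say how ties are handled.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image; masks use 0/1 values


def avatar_layer(attention_distance: float, shelf_distance: float) -> str:
    """Decide whether the avatar is drawn in front of or behind the shelf.

    'front'  : the attention area is closer to the camera than the shelf.
    'behind' : the shelf is closer to the camera (ties also map here,
               which is an assumption of this sketch).
    """
    return "front" if attention_distance < shelf_distance else "behind"


def composite(background: Image, avatar: Image, avatar_mask: Image,
              shelf_mask: Image, layer: str) -> Image:
    """Overlay avatar pixels, hiding those covered by the shelf when behind it."""
    out = [row[:] for row in background]
    for y, mask_row in enumerate(avatar_mask):
        for x, inside in enumerate(mask_row):
            occluded = layer == "behind" and shelf_mask[y][x]
            if inside and not occluded:
                out[y][x] = avatar[y][x]
    return out
```

With a one-row background `[[0, 0, 0]]`, avatar `[[9, 9, 9]]`, avatar mask `[[1, 1, 0]]`, and shelf mask `[[0, 1, 1]]`, the "front" layer keeps both avatar pixels while the "behind" layer keeps only the pixel not covered by the shelf.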

このように、情報処理装置１０は、状況を正確に再現した合成データを生成することができるので、合成データを用いた機械学習モデルの訓練の精度を向上させることができ、結果として、機械学習モデルの訓練時間を短縮することができる。 In this way, the information processing device 10 can generate synthetic data that accurately reproduces a situation, thereby improving the accuracy of training a machine learning model using the synthetic data, and as a result, shortening the training time for the machine learning model.

（システム） (System)
上記文書中や図面中で示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。 The information including the processing procedures, control procedures, specific names, various data and parameters shown in the above documents and drawings can be changed arbitrarily unless otherwise specified.

また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散や統合の具体的形態は図示のものに限られない。つまり、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。 In addition, each component of each device shown in the figures is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to that shown in the figures. All or part of each device can be functionally or physically distributed or integrated in any unit depending on various loads, usage conditions, and the like.

さらに、各装置にて行なわれる各処理機能は、その全部または任意の一部が、ＣＰＵおよび当該ＣＰＵにて解析実行されるプログラムにて実現され、あるいは、ワイヤードロジックによるハードウェアとして実現され得る。このシステムにより、防犯やリテール、製造、業務効率化など、様々なシーンに適した映像分析ソリューションを提供できる。 Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program that is analyzed and executed by the CPU, or can be realized as hardware using wired logic. This system can provide video analysis solutions suitable for a variety of scenarios, including crime prevention, retail, manufacturing, and business efficiency.

［ハードウェア］ [Hardware]
図３３は、ハードウェア構成例を説明する図である。図３３に示すように、情報処理装置１０は、通信装置１０ａ、ＨＤＤ（Hard Disk Drive）１０ｂ、メモリ１０ｃ、プロセッサ１０ｄを有する。また、図３３に示した各部は、バス等で相互に接続される。 Fig. 33 is a diagram illustrating an example of a hardware configuration. As shown in Fig. 33, the information processing device 10 includes a communication device 10a, a hard disk drive (HDD) 10b, a memory 10c, and a processor 10d. The components shown in Fig. 33 are connected to each other via a bus or the like.

通信装置１０ａは、ネットワークインタフェースカードなどであり、他の装置との通信を行う。ＨＤＤ１０ｂは、図２に示した機能を動作させるプログラムやＤＢを記憶する。 The communication device 10a is a network interface card or the like, and communicates with other devices. The HDD 10b stores the programs and DBs that operate the functions shown in FIG. 2.

プロセッサ１０ｄは、図２に示した各処理部と同様の処理を実行するプログラムをＨＤＤ１０ｂ等から読み出してメモリ１０ｃに展開することで、図２等で説明した各機能を実行するプロセスを動作させる。例えば、このプロセスは、情報処理装置１０が有する各処理部と同様の機能を実行する。具体的には、プロセッサ１０ｄは、事前学習部３１、取得部３２、人物モデル生成部３３、領域特定部３４、合成データ生成部３５、機械学習部３６等と同様の機能を有するプログラムをＨＤＤ１０ｂ等から読み出す。そして、プロセッサ１０ｄは、事前学習部３１、取得部３２、人物モデル生成部３３、領域特定部３４、合成データ生成部３５、機械学習部３６等と同様の処理を実行するプロセスを実行する。 The processor 10d reads out a program that executes the same processes as the processing units shown in FIG. 2 from the HDD 10b, etc., and expands it into the memory 10c, thereby operating a process that executes each function described in FIG. 2, etc. For example, this process executes a function similar to that of each processing unit possessed by the information processing device 10. Specifically, the processor 10d reads out a program having functions similar to those of the pre-learning unit 31, the acquisition unit 32, the person model generation unit 33, the area identification unit 34, the synthetic data generation unit 35, the machine learning unit 36, etc., from the HDD 10b, etc. Then, the processor 10d executes a process that executes the same processes as those of the pre-learning unit 31, the acquisition unit 32, the person model generation unit 33, the area identification unit 34, the synthetic data generation unit 35, the machine learning unit 36, etc.

このように、情報処理装置１０は、プログラムを読み出して実行することで情報処理方法を実行する情報処理装置として動作する。また、情報処理装置１０は、媒体読取装置によって記録媒体から上記プログラムを読み出し、読み出された上記プログラムを実行することで上記した実施例と同様の機能を実現することもできる。なお、この他の実施例でいうプログラムは、情報処理装置１０によって実行されることに限定されるものではない。例えば、他のコンピュータまたはサーバがプログラムを実行する場合や、これらが協働してプログラムを実行するような場合にも、上記実施例が同様に適用されてもよい。 In this way, the information processing device 10 operates as an information processing device that executes an information processing method by reading and executing a program. The information processing device 10 can also realize functions similar to those of the above-mentioned embodiment by reading the program from a recording medium using a media reading device and executing the read program. Note that the program in this other embodiment is not limited to being executed by the information processing device 10. For example, the above embodiment may also be similarly applied to cases where another computer or server executes a program, or where these cooperate to execute a program.

このプログラムは、インターネットなどのネットワークを介して配布されてもよい。また、このプログラムは、ハードディスク、フレキシブルディスク（ＦＤ）、ＣＤ－ＲＯＭ、ＭＯ（Magneto－Optical disk）、ＤＶＤ（Digital Versatile Disc）などのコンピュータで読み取り可能な記録媒体に記録され、コンピュータによって記録媒体から読み出されることによって実行されてもよい。 This program may be distributed via a network such as the Internet. In addition, this program may be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO (Magneto-Optical disk), or a DVD (Digital Versatile Disc), and may be executed by being read from the recording medium by a computer.

１０ 情報処理装置
１１ 通信部
２０ 記憶部
２１ 訓練データＤＢ
２２ 人物画像データＤＢ
２３ 場所画像データＤＢ
２４ ３Ｄ生成モデル
２５ 領域抽出モデル
２６ 骨格推定モデル
３０ 制御部
３１ 事前学習部
３２ 取得部
３３ 人物モデル生成部
３４ 領域特定部
３５ 合成データ生成部
３６ 機械学習部
10 Information processing device
11 Communication unit
20 Storage unit
21 Training data DB
22 Person image data DB
23 Location image data DB
24 3D generation model
25 Area extraction model
26 Skeleton estimation model
30 Control unit
31 Pre-learning unit
32 Acquisition unit
33 Person model generation unit
34 Area identification unit
35 Synthetic data generation unit
36 Machine learning unit

Claims

コンピュータに、
人物の行動の要素を示すルールを特定し、
特定した前記ルールに合致した姿勢を示す人物のモデルを生成し、
カメラパラメータを用いて、画像データの中に人物のモデルが配置された合成データを生成する、
処理を実行させることを特徴とする生成プログラム。
A generation program that causes a computer to execute a process comprising:
identifying a rule that indicates elements of a person's behavior;
generating a model of a person exhibiting a posture that matches the identified rule; and
generating, using camera parameters, synthetic data in which the model of the person is placed in image data.

生成された前記合成データを機械学習モデルに入力することで、前記ルールに合致する条件を前記機械学習モデルが示すか否かを判定する、
処理を前記コンピュータに実行させることを特徴とする請求項１に記載の生成プログラム。
The generation program according to claim 1, further causing the computer to execute a process of determining, by inputting the generated synthetic data into a machine learning model, whether the machine learning model indicates a condition that matches the rule.

前記判定する処理は、
前記合成データを機械学習モデルに入力して得られた前記機械学習モデルの出力結果が、前記ルールに合致する条件を示さない場合に、前記合成データを用いて、前記機械学習モデルの訓練を実行する、ことを特徴とする請求項２に記載の生成プログラム。
The generation program according to claim 2, wherein the determining process includes training the machine learning model using the synthetic data when the output result of the machine learning model, obtained by inputting the synthetic data into the machine learning model, does not indicate a condition that matches the rule.

前記判定する処理は、
画像データ内の人物を検出する人物検出モデル、画像データ内の人物の属性を推定する属性推定モデル、画像データ内の人物の骨格情報を推定する骨格推定モデルのそれぞれに、前記合成データを入力し、
前記合成データの入力に基づく出力結果に基づき、訓練対象のモデルを決定する、ことを特徴とする請求項３に記載の生成プログラム。
The generation program according to claim 3, wherein the determining process includes:
inputting the synthetic data into each of a person detection model that detects a person in image data, an attribute estimation model that estimates attributes of the person in the image data, and a skeleton estimation model that estimates skeleton information of the person in the image data; and
determining a model to be trained based on the output results obtained from the input of the synthetic data.

前記特定する処理は、
行動毎に人物が行動する領域と人物が行動するときの姿勢とを規定した複数のルールから、指定された条件に応じたルールを特定し、
前記モデルを生成する処理は、
前記指定された条件に応じたルールに規定される姿勢を示す前記人物のモデルを生成し、
前記合成データを生成する処理は、
前記画像データ内の前記指定された条件に応じたルールに規定される領域に、前記カメラパラメータを用いて前記姿勢を示す前記人物のモデルを配置した前記合成データを生成する、ことを特徴とする請求項４に記載の生成プログラム。
The generation program according to claim 4, wherein
the identifying process includes identifying, from a plurality of rules that define, for each behavior, an area in which a person acts and a posture of the person when acting, a rule corresponding to a specified condition,
the process of generating the model includes generating the model of the person exhibiting the posture defined in the rule corresponding to the specified condition, and
the process of generating the synthetic data includes generating the synthetic data in which the model of the person exhibiting the posture is placed, using the camera parameters, in the area of the image data defined in the rule corresponding to the specified condition.

前記合成データを生成する処理は、
異なる各場所を撮影した複数の画像データそれぞれから、各画像データにおいて人物が行動する領域を特定し、
前記複数の画像データから、前記指定された条件に応じたルールに該当する画像データおよび前記領域を選定し、
選定された前記画像データ内の前記指定された条件に応じたルールに規定される領域に、前記カメラパラメータを用いて前記姿勢を示す前記人物のモデルを配置した前記合成データを生成する、ことを特徴とする請求項５に記載の生成プログラム。
The generation program according to claim 5, wherein the process of generating the synthetic data includes:
identifying, from each of a plurality of pieces of image data captured at different locations, an area in which a person acts in that image data;
selecting, from the plurality of pieces of image data, image data and an area that satisfy the rule corresponding to the specified condition; and
generating the synthetic data in which the model of the person exhibiting the posture is placed, using the camera parameters, in the area of the selected image data defined in the rule corresponding to the specified condition.

前記合成データは、前記画像データの中に設定されるＲＯＩ（Region Of Interest）に、前記ルールに基づく特定の行動を行う３Ｄアバターを配置した画像である、
ことを特徴とする請求項１に記載の生成プログラム。
The generation program according to claim 1, wherein the synthetic data is an image in which a 3D avatar performing a specific behavior based on the rule is placed in an ROI (Region Of Interest) set in the image data.

コンピュータが、
人物の行動の要素を示すルールを特定し、
特定した前記ルールに合致した姿勢を示す人物のモデルを生成し、
カメラパラメータを用いて、画像データの中に人物のモデルが配置された合成データを生成する、
処理を実行することを特徴とする生成方法。
A generation method in which a computer executes a process comprising:
identifying a rule that indicates elements of a person's behavior;
generating a model of a person exhibiting a posture that matches the identified rule; and
generating, using camera parameters, synthetic data in which the model of the person is placed in image data.

人物の行動の要素を示すルールを特定し、
特定した前記ルールに合致した姿勢を示す人物のモデルを生成し、
カメラパラメータを用いて、画像データの中に人物のモデルが配置された合成データを生成する、
制御部を有することを特徴とする情報処理装置。
An information processing device comprising a control unit that:
identifies a rule that indicates elements of a person's behavior;
generates a model of a person exhibiting a posture that matches the identified rule; and
generates, using camera parameters, synthetic data in which the model of the person is placed in image data.