JP7318814B2

JP7318814B2 - DATA GENERATION METHOD, DATA GENERATION PROGRAM AND INFORMATION PROCESSING DEVICE

Info

Publication number: JP7318814B2
Application number: JP2022533003A
Authority: JP
Inventors: 創輔山尾
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2020-07-03
Filing date: 2020-07-03
Publication date: 2023-08-01
Anticipated expiration: 2040-07-03
Also published as: JPWO2022003963A1; WO2022003963A1

Description

The present invention relates to a data generation method, a data generation program, and an information processing apparatus.

Skeleton recognition, which detects three-dimensional human motion in various sports from two-dimensional images such as color images, is in use. For example, one technique computes representative three-dimensional joint coordinates by triangulation from a plurality of two-dimensional joint coordinates. In recent years, to improve the accuracy of skeleton recognition, techniques have also become known that estimate three-dimensional joint coordinates from multi-view two-dimensional joint coordinates using an estimation model generated by machine learning.

K. Iskakov et al., "Learnable Triangulation of Human Pose", ICCV 2019
C. Ionescu et al., "Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments", TPAMI 2014
G. Varol et al., "Learning from Synthetic Humans", CVPR 2017

The estimation model described above is generated by machine learning using learning data that includes two-dimensional images and three-dimensional skeleton positions. To apply the model to skeleton recognition of complex movements such as gymnastics, the estimation accuracy must be improved by using a very large amount of learning data. However, such learning data is generally generated manually, which is both inaccurate and inefficient, so three-dimensional skeleton recognition using machine learning suffers from degraded accuracy and increased cost.

It is also conceivable to generate a learning data set by using a human-body CG (Computer Graphics) model that takes the same posture as three-dimensional skeleton information acquired by motion capture or the like, and compositing CG images in many variations while changing textures and rendering conditions. However, when athletes perform in uniforms or use multiple pieces of apparatus, as in gymnastics, it is difficult to simulate these and composite CG images at a quality close to that of actual camera images.

In one aspect, an object is to provide a data generation method, a data generation program, and an information processing apparatus capable of automatically generating a learning data set that includes two-dimensional images and three-dimensional skeleton positions.

In a first aspect, in the data generation method, a computer executes a process of acquiring the two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion. The computer executes a process of identifying, from three-dimensional series data including three-dimensional skeleton data on a plurality of joint positions of the subject performing the predetermined motion, a plurality of pieces of three-dimensional skeleton data corresponding respectively to the plurality of frames. The computer executes a process of optimizing, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, an adjustment amount for time synchronization between the moving image data and the three-dimensional series data, and projection parameters used when projecting the three-dimensional series data onto the moving image data. The computer executes a process of generating, using the optimized adjustment amount and projection parameters, data in which the moving image data and the three-dimensional series data are associated with each other.

According to one embodiment, a learning data set including two-dimensional images and three-dimensional skeleton positions can be generated automatically.

FIG. 1 is a diagram illustrating an example of the overall configuration of a system using skeleton recognition.
FIG. 2 is a diagram illustrating manual generation of learning data.
FIG. 3 is a functional block diagram illustrating the functional configuration of the generation device according to the first embodiment.
FIG. 4 is a diagram showing an example of moving image data.
FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data.
FIG. 6 is a diagram illustrating the acquisition of two-dimensional joint coordinates.
FIG. 7 is a diagram illustrating camera calibration.
FIG. 8 is a diagram illustrating time synchronization and resampling.
FIG. 9 is a diagram illustrating the estimation of initial parameters.
FIG. 10 is a diagram illustrating parameter optimization.
FIG. 11 is a diagram illustrating the result of optimization.
FIG. 12 is a diagram showing an example of generated learning data.
FIG. 13 is a flowchart showing the flow of the learning data generation process.
FIG. 14 is a flowchart showing the flow of processing from initial value estimation to optimization.
FIG. 15 is a diagram illustrating an example of skeleton recognition processing.
FIG. 16 is a diagram illustrating an example of skeleton recognition processing.
FIG. 17 is a diagram illustrating an example of skeleton recognition processing.
FIG. 18 is a diagram illustrating an example of a hardware configuration.

Embodiments of a data generation method, a data generation program, and an information processing apparatus according to the present invention are described in detail below with reference to the drawings. The invention is not limited by these embodiments. The embodiments can be combined as appropriate to the extent that no contradiction arises.

[System configuration]
FIG. 1 is a diagram illustrating an example of the overall configuration of a system using skeleton recognition. As shown in FIG. 1, this system includes a 3D (three-dimensional) laser sensor 5, a generation device 10, a learning device 40, a recognition device 50, and a scoring device 90. It captures three-dimensional data of a performer 1, who is the subject, recognizes the skeleton and the like, and scores techniques accurately. This embodiment is described using gymnastics as an example, but it is not limited to this; it can also be applied to other competitions in which an athlete performs a series of techniques and judges score them, and to various human actions and movements. In this embodiment, two dimensions may be written as 2D and three dimensions as 3D.

In general, scoring in gymnastics is currently performed visually by a plurality of scorers, but as techniques become more advanced, cases in which visual scoring is difficult are increasing. In recent years, automatic scoring systems and scoring support systems for judged competitions that use the 3D laser sensor 5 have become known. In these systems, for example, the 3D laser sensor 5 acquires a distance image, which is three-dimensional data of the athlete, and the skeleton, such as the orientation and angle of each joint of the athlete, is recognized from the distance image. The scoring support system displays the skeleton recognition result as a 3D model, which helps scorers score more accurately by, for example, checking the details of the performer's state. The automatic scoring system recognizes the performed techniques from the skeleton recognition result and scores them against the scoring rules.

Scoring support systems and automatic scoring systems are required to provide scoring support or automatic scoring in a timely manner for performances as they take place. Techniques that recognize a performer's three-dimensional skeleton from distance images or color images usually suffer from longer processing times, due to insufficient memory and the like, and from reduced skeleton recognition accuracy.

For example, in a configuration where the result of automatic scoring by an automatic scoring system is provided to a scorer, who compares it with his or her own scoring result, the provision of information to the scorer is delayed when conventional technology is used. Furthermore, reduced skeleton recognition accuracy can cause the subsequent technique recognition to be wrong, and as a result the score determined from the technique is also wrong. Similarly, when a scoring support system displays the angles and positions of the performer's joints using a 3D model, the display may be delayed or the displayed angles and the like may be incorrect. In that case, scoring by a scorer who uses the scoring support system may be erroneous.

As described above, if skeleton recognition in an automatic scoring system or a scoring support system is inaccurate or slow, scoring errors occur and scoring takes longer. For this reason, machine learning models generated by machine learning are used to achieve highly accurate skeleton recognition and to suppress recognition errors and prolonged scoring. For three-dimensional human motion detection (skeleton recognition), 3D sensing technology that extracts three-dimensional joint coordinates with high accuracy from a plurality of 3D laser sensors 5 is becoming established, and its application to other sports and other fields is expected.

The devices constituting the system in FIG. 1 are described here. The 3D laser sensor 5 is an example of a sensor device that measures (senses) the distance to an object for each pixel using an infrared laser or the like. The distance image contains the distance measured at each pixel. In other words, the distance image is a depth image representing the depth of the subject as seen from the 3D laser sensor (depth sensor) 5.

The learning device 40 is an example of a computer device that trains a machine learning model for skeleton recognition. Specifically, the learning device 40 generates a machine learning model by executing machine learning such as deep learning, using two-dimensional skeleton position information, three-dimensional skeleton position information, and the like as a learning data set.

The recognition device 50 is an example of a computer device that recognizes the skeleton, that is, the orientation, position, and the like of each joint of the performer 1, using the distance image measured by the 3D laser sensor 5. Specifically, the recognition device 50 inputs the distance image measured by the 3D laser sensor 5 into the trained machine learning model produced by the learning device 40 and recognizes the skeleton based on the model's output. The recognition device 50 then outputs the recognized skeleton to the scoring device 90. The information obtained as a result of skeleton recognition is, for example, information on the three-dimensional position of each joint.

The scoring device 90 is an example of a computer device that uses the recognition result information input from the recognition device 50 to identify the transition of movement obtained from the position and orientation of each joint of the performer, and then identifies and scores the technique that the performer performed.

To perform three-dimensional skeleton recognition of complex postures such as those in gymnastics with the skeleton recognition technology using the machine learning model described above, a large amount of learning data on complex postures must be newly prepared and used for training.

In recent years, three-dimensional skeleton information and the like have been newly collected using laser-based or image-based methods in order to generate a large amount of learning data. For example, the laser-based method using a 3D laser sensor emits a laser about two million times per second and obtains depth information for each irradiated point, including the target person, based on the travel time (time of flight, ToF) of each laser pulse. The laser-based method can acquire highly accurate depth data, but it requires complex configurations and processing such as laser scanning and ToF measurement, and the hardware is complex and expensive, so it is difficult to use for general purposes.
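As a rough illustration of the time-of-flight principle mentioned above, the depth of an irradiated point can be computed from the round-trip travel time of a laser pulse as d = c * t / 2. The sketch below is not taken from the patent; the function name and the sample travel time are hypothetical.

```python
# Minimal sketch of the time-of-flight (ToF) depth relation d = c * t / 2.
# The function name and the sample travel time are illustrative assumptions.

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum


def tof_depth_m(round_trip_time_s: float) -> float:
    """Return the distance to the reflecting point for one laser pulse.

    The pulse travels to the target and back, so the one-way distance
    is half of the total path length.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


if __name__ == "__main__":
    # A pulse that returns after 40 ns corresponds to roughly 6 m.
    print(f"{tof_depth_m(40e-9):.3f} m")
```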

The image-based method, which acquires RGB data for each pixel with a CMOS (Complementary Metal Oxide Semiconductor) imager, can use an inexpensive RGB camera, and the accuracy of three-dimensional skeleton recognition is improving thanks to recent advances in deep learning. However, machine learning on existing data sets such as Total Capture, which contain only ordinary postures, cannot produce a machine learning model that can recognize complex postures such as those in gymnastics.

Because the commonly used methods cannot produce a machine learning model capable of recognizing complex postures, learning data for learning complex postures is generated manually.

FIG. 2 is a diagram illustrating manual generation of learning data. As shown in FIG. 2, moving image data capturing a series of gymnastics performances by the performer 1 and three-dimensional skeleton series data, which includes three-dimensional skeleton data from a past performance by the performer 1, are time-synchronized by eye, and a frame in which the moving image data and the three-dimensional skeleton data are synchronized is searched for (S1). Next, the three-dimensional skeleton data is projected onto the image using approximate camera parameters defined as prior knowledge (S2), and the camera parameter values are adjusted manually so that the silhouette of the person in the moving image data and the three-dimensional skeleton data overlap (S3). After that, the three-dimensional skeleton series data is resampled at the frame rate of the moving image data and projected onto the moving image data to create learning data covering the entire moving image data (S4).

When learning data is created manually in this way, the spatial and temporal alignment of the moving image data and the three-dimensional skeleton series data is performed visually and by hand, so the accuracy is insufficient for learning data and an enormous amount of work time is required.

Therefore, the first embodiment provides a technique that uses moving image data acquired by a camera and three-dimensional skeleton series data acquired by 3D sensing or the like for the same motion, performs spatially and temporally optimal automatic alignment, and generates high-quality learning data efficiently.

Specifically, the generation device 10 acquires the two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in two-dimensional moving image data capturing the performer 1 during a performance. Next, the generation device 10 identifies, from three-dimensional skeleton series data on a plurality of joint positions (three-dimensional skeleton data) of the performer 1 during the performance, a plurality of pieces of three-dimensional skeleton data corresponding respectively to the plurality of frames. The generation device 10 then uses the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data to optimize an adjustment amount for time synchronization between the moving image data and the three-dimensional skeleton series data, and the camera parameters used when projecting the three-dimensional skeleton series data onto the moving image data. After that, the generation device 10 uses the optimized adjustment amount and camera parameters to generate data in which the moving image data and the three-dimensional skeleton series data are associated with each other.

In this way, the generation device 10 performs spatially and temporally optimal automatic alignment by simultaneously carrying out camera calibration and time synchronization based on nonlinear optimization, so that the geometric consistency between the three-dimensional skeleton series data and the two-dimensional joint coordinates is maximized. As a result, the generation device 10 can automatically generate a learning data set, including two-dimensional images and three-dimensional skeleton positions, that can be used to train a machine learning model that estimates three-dimensional skeleton positions accurately.
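One way to write the joint optimization described above is as a reprojection-error minimization. This is only a hedged sketch under stated assumptions: maximizing geometric consistency is read here as minimizing squared reprojection error, and the time-synchronization adjustment amount is assumed to be a single offset \Delta t (the text calls it only an "adjustment amount"). The symbols are illustrative and not the patent's notation: x_{i,j} is the annotated two-dimensional coordinate of joint j in selected frame i, t_{v,i} is the time of that frame, X_j(\cdot) is the three-dimensional coordinate of joint j interpolated from the skeleton series at a given time, and \pi(\theta; \cdot) is the projection with camera parameters \theta.

```latex
\[
(\hat{\theta}, \widehat{\Delta t})
  = \arg\min_{\theta,\ \Delta t}
    \sum_{i} \sum_{j}
    \bigl\| x_{i,j} - \pi\bigl(\theta;\ X_j(t_{v,i} + \Delta t)\bigr) \bigr\|^{2}
\]
```

Under this reading, camera calibration and time synchronization share a single cost, so an error in one cannot be hidden by adjusting the other separately.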

[Functional configuration]
FIG. 3 is a functional block diagram illustrating the functional configuration of the generation device 10 according to the first embodiment. As shown in FIG. 3, the generation device 10 includes a communication unit 11, a storage unit 12, and a control unit 20.

The communication unit 11 is a processing unit that controls communication with other devices and is realized by, for example, a communication interface. For example, the communication unit 11 receives moving image data of the performer 1 captured with a camera or the like, and receives the three-dimensional skeleton series of the performer 1 captured with the 3D laser sensor 5.

The storage unit 12 stores various data, programs executed by the control unit 20, and the like, and is realized by, for example, a memory or a hard disk. The storage unit 12 stores moving image data 13, three-dimensional skeleton series data 14, and a learning data set 15.

The moving image data 13 is a series of moving image data captured by a camera or the like while the performer 1 performs, and consists of a plurality of frames. FIG. 4 is a diagram showing an example of the moving image data 13. As an example, FIG. 4 shows one frame of the moving image data 13 captured during a pommel horse performance. For the moving image data 13, the coordinate system is based on the position, orientation, and resolution of the camera, and the time system is based on the camera's own time stamps and sampling rate.

The three-dimensional skeleton series data 14 is series data including three-dimensional skeleton data that represents the three-dimensional joint coordinates of a plurality of joint positions. Specifically, the three-dimensional skeleton series data 14 is a series of three-dimensional skeleton data captured by a 3D laser sensor or the like while the performer 1 performs. The three-dimensional skeleton data includes information on the three-dimensional skeleton position (skeleton information) of each joint during the performance of the performer 1.

FIG. 5 is a diagram illustrating an example of the three-dimensional skeleton series data 14. FIG. 5 shows one piece of three-dimensional skeleton data in the three-dimensional skeleton series data 14 generated from a performance by the performer 1. As shown in FIG. 5, the three-dimensional skeleton series data 14 is acquired by 3D sensing technology and contains the three-dimensional coordinates of each joint. The joints are, for example, 18 joints designated in advance, such as the right shoulder, left shoulder, and right ankle, or a plurality of joints set arbitrarily by the user. For the three-dimensional skeleton series data 14, the coordinate system is based on the position and orientation of the sensor, and the time system is based on the sensor's own time stamps and sampling rate.

The learning data set 15 is a database generated by the control unit 20, described later, that contains a plurality of pieces of learning data used to generate a machine learning model. For example, the learning data set 15 is information in which three-dimensional skeleton data and camera parameters are associated with the moving image data 13.

The control unit 20 is a processing unit that controls the entire generation device 10 and is realized by, for example, a processor. The control unit 20 includes a data acquisition unit 21, a coordinate acquisition unit 22, and a learning data generation unit 23. The data acquisition unit 21, the coordinate acquisition unit 22, and the learning data generation unit 23 are realized by electronic circuits of the processor, processes executed by the processor, or the like.

The data acquisition unit 21 is a processing unit that acquires the moving image data 13 and the three-dimensional skeleton series data 14 and stores them in the storage unit 12. For example, the data acquisition unit 21 can acquire the moving image data 13 from a camera, or can read moving image data 13 previously captured by a known method from the location where it is stored and store it in the storage unit 12. Similarly, the data acquisition unit 21 can acquire the three-dimensional skeleton series data 14 from the 3D laser sensor, or can read three-dimensional skeleton series data 14 previously captured by a known method from the location where it is stored and store it in the storage unit 12.

The coordinate acquisition unit 22 is a processing unit that acquires two-dimensional coordinates, that is, the two-dimensional joint coordinates of each of a plurality of joints, from each of a plurality of frames included in the moving image data 13. Specifically, the coordinate acquisition unit 22 selects several suitable frames (for example, about 10) from the moving image data and acquires the two-dimensional joint positions of the target person in each frame automatically or manually.

FIG. 6 is a diagram illustrating the acquisition of two-dimensional joint coordinates. As shown in FIG. 6, the coordinate acquisition unit 22 selects a frame from the moving image data 13 and sets eight joints designated in advance as annotation targets in the selected frame. The coordinate acquisition unit 22 then uses an existing model or the like to acquire two-dimensional coordinates indicating the positions of the annotation target joints: (1) right elbow, (2) right wrist, (3) left elbow, (4) left wrist, (5) right knee, (6) right ankle, (7) left knee, and (8) left ankle.

A subset of the joints in the three-dimensional skeleton series data can also be used as annotation targets. The two-dimensional joint coordinates can be acquired by automatic annotation using an existing two-dimensional skeleton recognition technique, or by visual or manual annotation.

The learning data generation unit 23 includes an initial setting unit 24, an optimization unit 25, and an output unit 26, and is a processing unit that generates the learning data set 15 in which three-dimensional skeleton data is associated with the moving image data 13. Specifically, because the learning data generation unit 23 holds both the three-dimensional skeleton series data 14 and the moving image data 13 for the same performance, it can project the three-dimensional skeleton series data 14 onto the moving image data 13 and thereby generate, without acquiring new data, a large number of images annotated with two-dimensional or three-dimensional joint coordinates of complex postures.

However, as described with reference to FIGS. 4 and 5, the three-dimensional skeleton series data 14 and the moving image data 13 are based on different coordinate systems and time systems. Therefore, to project the three-dimensional skeleton series data 14 onto the moving image data 13, the learning data generation unit 23 performs camera calibration, which corresponds to spatial alignment, and time synchronization and resampling, which correspond to temporal alignment.

(Camera calibration)
Camera calibration is described here. FIG. 7 is a diagram illustrating camera calibration. As shown in FIG. 7, obtaining the projected point of a 3D point on an image requires two sets of parameters for the camera that captured the image: camera-specific parameters such as focal length and resolution (camera intrinsic parameters), and the position and orientation of the camera in the coordinate system that serves as the reference for the 3D point, that is, the world coordinate system (camera extrinsic parameters). The process of obtaining these parameters (camera parameters) is called camera calibration.

In the example of FIG. 7, the perspective projection of a 3D point onto an image can be expressed as w[x, y, 1]^T = K[R | t][X, Y, Z, 1]^T, where w is a scale factor. Here, [X, Y, Z] are the coordinates of the 3D point being projected, and [x, y] are the coordinates of its projection point on the image. K is the 3×3 intrinsic matrix holding the camera intrinsic parameters. R is a 3×3 rotation matrix and t is a 3×1 translation vector; together they form the camera extrinsic parameters. Of these, R and t are the targets of camera calibration.
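The projection formula above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `project_point` and the numeric values of K, R, and t are assumptions made for the example and do not come from the source.

```python
import numpy as np

def project_point(K, R, t, X):
    """Project a 3D world point X (shape (3,)) to pixel coordinates (x, y)."""
    X_cam = R @ X + t          # world -> camera coordinates (extrinsic parameters)
    uvw = K @ X_cam            # camera -> homogeneous image coordinates (intrinsic parameters)
    return uvw[:2] / uvw[2]    # divide by the scale factor w

# Hypothetical example values (not taken from the source document).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # camera axes aligned with the world axes
t = np.array([0.0, 0.0, 5.0])      # world origin lies 5 units in front of the camera
xy = project_point(K, R, t, np.array([0.5, -0.25, 0.0]))
print(xy)
```

With these assumed values, calibration in the sense of the text would mean estimating R and t while K is treated as known.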

(Time synchronization and resampling)
Next, time synchronization and resampling are described. FIG. 8 is a diagram for explaining time synchronization and resampling. As shown in FIG. 8, when the moving image data 13 and the three-dimensional skeleton sequence data 14 are acquired in different time systems, three-dimensional skeleton data synchronized with the moving image data 13 can be obtained by first synchronizing the two data sets as a whole and then resampling the three-dimensional skeleton sequence data 14 at the time of each frame of the moving image data 13.

Here, time synchronization means defining the conversion between the time systems of the moving image data 13 and the three-dimensional skeleton sequence data 14, and resampling means interpolating the three-dimensional skeleton data at the time of each frame of the moving image data 13.

In the case of FIG. 8, the real-world time system is t, the time system of the three-dimensional skeleton sequence data 14 is t_s, and the time system of the moving image data 13 is t_v. The three-dimensional skeleton sequence data 14 is sampled with a sampling period T_s, yielding three-dimensional skeleton data at times t_{s,0}, t_{s,1}, t_{s,2}, and so on. The moving image data 13 is sampled with a sampling period T_v, yielding frames at times t_{v,0}, t_{v,1}, t_{v,2}, and so on.

The difference between time t_{s,0}, the start of the three-dimensional skeleton sequence data 14, and time t_{v,0}, the start of the moving image data 13, is the time shift amount T_{v,s}. Time synchronization is obtained by defining the conversion between the two time systems, for example as τ(t_{v,j}) = t_{s,0} + T_{v,s} + jT_v. Resampling is performed by using this conversion to look up the three-dimensional skeleton data within a predetermined time range around t_{v,j} and interpolating the three-dimensional skeleton data corresponding to time t_{v,j}, for example by the bilinear method. For example, the frame at time t_{v,0} of the moving image data 13 is associated with the three-dimensional skeleton data at the synchronized time τ(t_{v,0}), which lies between times t_{s,2} and t_{s,3} of the three-dimensional skeleton sequence data 14, and this yields the resampled three-dimensional skeleton data.
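The time conversion and resampling described above can be illustrated as follows. This is a minimal sketch under stated assumptions: it uses plain linear interpolation between the two neighboring skeleton samples rather than the bilinear method named in the text, and the function names, sampling periods, and joint trajectory are hypothetical.

```python
import numpy as np

def video_to_skeleton_time(j, t_s0, shift, T_v):
    """tau(t_v,j) = t_s,0 + T_v,s + j * T_v: time of video frame j on the skeleton clock."""
    return t_s0 + shift + j * T_v

def resample_skeleton(ts, joints, tau):
    """Linearly interpolate joints (N, J, 3), sampled at times ts (N,), to time tau."""
    k = np.searchsorted(ts, tau, side="right") - 1   # index of the sample at or before tau
    k = int(np.clip(k, 0, len(ts) - 2))              # keep k and k + 1 inside the sequence
    a = (tau - ts[k]) / (ts[k + 1] - ts[k])          # interpolation weight in [0, 1]
    return (1.0 - a) * joints[k] + a * joints[k + 1]

# Hypothetical data: one joint moving linearly, skeleton sampled every 0.1 s.
ts = np.arange(5) * 0.1
joints = np.stack([ts, 2.0 * ts, np.zeros_like(ts)], axis=1)[:, None, :]

# Frame 0 of the video, with an assumed time shift of 0.25 s, falls between samples 2 and 3.
tau = video_to_skeleton_time(j=0, t_s0=ts[0], shift=0.25, T_v=1.0 / 30.0)
pose = resample_skeleton(ts, joints, tau)
print(tau, pose)
```

The example mirrors the case in the text where the first video frame is matched to a time between t_{s,2} and t_{s,3}.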

As described above, the learning data generation unit 23 performs camera calibration as well as time synchronization and resampling in order to project the three-dimensional skeleton sequence data 14 onto the moving image data 13. In doing so, the learning data generation unit 23 defines a cost function that performs camera calibration and time synchronization simultaneously, based on nonlinear optimization, so that the geometric consistency between the three-dimensional skeleton sequence data 14 and the moving image data 13 is maximized. The learning data generation unit 23 then obtains the optimal time synchronization and camera calibration by optimizing the cost function.

Returning to FIG. 3, the initial setting unit 24 is a processing unit that sets the initial parameters of the cost function. Specifically, for each synchronization pattern obtained by varying the time synchronization, the initial setting unit 24 uses the frames and the three-dimensional skeleton data associated with them by resampling to calculate camera parameters, by solving an estimation problem for the position and orientation of the camera when the three-dimensional skeleton sequence data 14 is projected onto the moving image data 13. The initial setting unit 24 then uses the camera parameters calculated for each synchronization pattern to calculate a likelihood representing the validity of that pattern's time synchronization and camera parameters. Finally, the initial setting unit 24 sets, as initial values, the frames and the three-dimensional skeleton data identified for the synchronization pattern with the highest likelihood.

FIG. 9 is a diagram for explaining the estimation of the initial parameters. As shown in FIG. 9, while varying the time synchronization, the initial setting unit 24 obtains, by the resampling shown in FIG. 8, the three-dimensional skeleton data corresponding to each frame of the moving image data 13 for which two-dimensional joint coordinates have been acquired. The initial setting unit 24 then estimates the camera parameters by solving a PnP (Perspective-n-Point) problem on the correspondences between the two-dimensional joint positions and the three-dimensional skeleton data.

At this time, the initial setting unit 24 quantifies the geometric consistency between the three-dimensional skeleton sequence data 14 and the two-dimensional joint coordinates as a likelihood representing the validity of the obtained camera parameters and time synchronization. For example, the initial setting unit 24 calculates, as the likelihood, the proportion of joints in the resampled three-dimensional skeleton data whose three-dimensional joint coordinates have a reprojection error below a threshold. The initial setting unit 24 then adopts, as the quasi-optimal solution, the time synchronization and camera parameters that give the maximum likelihood among all time synchronization trials, and outputs them to the optimization unit 25 as initial values.

The example of FIG. 9 shows the initial setting unit 24 running trial i-1, trial i, and trial i+1, which correspond to synchronization patterns in which the first frame of the moving image data 13 is shifted at regular intervals, to calculate the quasi-optimal solution. Taking trial i-1 as an example, for the frame of the moving image data 13 that corresponds to time t_{s,0} and for which the data acquisition unit 21 acquired two-dimensional coordinates, the initial setting unit 24 resamples using the three-dimensional skeleton data between times t_{s,0} and t_{s,1} and generates data A1. Similarly, for the frames for which two-dimensional coordinates were acquired, the initial setting unit 24 resamples using the three-dimensional skeleton data between times t_{s,2} and t_{s,3} to generate data A3, and resamples using the three-dimensional skeleton data between times t_{s,3} and t_{s,4} to generate data A4.

Next, the initial setting unit 24 estimates the camera parameters by solving the PnP problem using data A1, A3, and A4, each of which contains three-dimensional skeleton data and two-dimensional coordinates. The initial setting unit 24 then uses the estimated camera parameters to project the three-dimensional skeleton data onto the frames of the moving image data 13. For each joint in a frame, the initial setting unit 24 calculates the distance between the acquired two-dimensional coordinates of the joint and the projected position of that joint's three-dimensional coordinates from the three-dimensional skeleton data (only the two-dimensional coordinates of the projection are used). The initial setting unit 24 calculates the proportion of joints whose distance is below the threshold as the likelihood. The likelihood can be calculated from the joint distances obtained by projecting the three-dimensional skeleton data onto one frame or onto multiple frames.

In this way, the initial setting unit 24 performs the above processing for each of trial i-1, trial i, and trial i+1 and calculates the likelihood of each. The initial setting unit 24 then selects the time synchronization and camera parameters of trial i, which has the highest likelihood, as the quasi-optimal solution (initial values).
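The likelihood and the selection of the best trial can be sketched as below. The PnP step itself is omitted; the sketch assumes that the projected joint positions for each trial are already available. The function names, the 5-pixel threshold, and all numeric values are hypothetical.

```python
import numpy as np

def inlier_ratio(projected, observed, threshold):
    """Likelihood: fraction of joints whose reprojection error (pixels) is below threshold."""
    err = np.linalg.norm(projected - observed, axis=-1)
    return float(np.mean(err < threshold))

def pick_best(trials):
    """trials: list of (time_shift, camera_params, likelihood); return the max-likelihood trial."""
    return max(trials, key=lambda c: c[2])

# Hypothetical observed 2D joints and their projections under one trial.
observed = np.array([[100.0, 100.0], [200.0, 100.0], [150.0, 200.0], [150.0, 300.0]])
projected = observed + np.array([[1.0, 0.0], [0.0, 2.0], [10.0, 0.0], [0.0, 20.0]])
likelihood = inlier_ratio(projected, observed, threshold=5.0)  # 2 of 4 joints are inliers

# Hypothetical likelihoods for trials i-1, i, and i+1 (camera parameters left as None here).
trials = [(-1.0 / 30.0, None, 0.25), (0.0, None, likelihood), (1.0 / 30.0, None, 0.375)]
best = pick_best(trials)
print(likelihood, best[0])
```

An inlier ratio is less sensitive to a few badly detected joints than a mean error would be, which may be one reason the text uses it to rank the trials.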

Returning to FIG. 3, the optimization unit 25 is a processing unit that optimizes a cost function over the time synchronization adjustment amount and the camera parameters, using the quasi-optimal solution calculated by the initial setting unit 24 as the initial values. Specifically, the optimization unit 25 defines a cost function C (Equation (1)) over the time synchronization adjustment amount Δt and the camera parameters, and minimizes it by nonlinear optimization starting from the quasi-optimal solution. Here, the optimization unit 25 represents the three-dimensional skeleton sequence data, which is discrete, as a differentiable continuous function f(t) for each joint and incorporates it into the cost function C as the resampling of the three-dimensional skeleton data. Cubic spline interpolation, for example, can be used for f(t).

In Equation (1), i denotes a joint, and t is the time of a frame of the moving image data 13 for which two-dimensional coordinates were acquired in the quasi-optimal solution. p_{i,t} is the two-dimensional position of joint i at time t. f_i(t) is the position of joint i at time t, that is, the resampled three-dimensional joint coordinates. π(X) is the perspective projection of a 3D point X using the camera parameters, that is, the two-dimensional joint coordinates. In the cost function C, the time adjustment amount Δt and π are the optimization targets.

FIG. 10 is a diagram for explaining parameter optimization. As shown in FIG. 10, the optimization unit 25 sets the quasi-optimal solution as the initial values of the cost function C and repeats optimization and resampling to calculate the optimal value of each parameter. For example, the optimization unit 25 optimizes the cost function using data B1, in which two-dimensional coordinates and three-dimensional skeleton data are associated; when it moves on to the next data B2, it first generates data B2 by the resampling described above and then optimizes the cost function using data B2. In this way, the optimization unit 25 can optimize Δt and the camera parameters simultaneously.
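Equation (1) is not reproduced in this text, so the following sketch only evaluates one cost that is consistent with the symbol definitions above: the sum over frames and joints of the squared distance between p_{i,t} and π(f_i(t + Δt)). That form is an assumption, not the published expression. The sketch also substitutes linear interpolation (`np.interp`) for the cubic spline mentioned for f(t), and all names and values are hypothetical.

```python
import numpy as np

def project(K, R, t, X):
    """Perspective projection pi: points X (J, 3) -> pixels (J, 2)."""
    uvw = (K @ (R @ X.T + t[:, None])).T
    return uvw[:, :2] / uvw[:, 2:3]

def resample(ts, joints, tau):
    """Stand-in for f_i: linear interpolation of joints (N, J, 3) at time tau."""
    flat = joints.reshape(len(ts), -1)
    out = np.array([np.interp(tau, ts, flat[:, c]) for c in range(flat.shape[1])])
    return out.reshape(joints.shape[1:])

def cost(dt, K, R, t, ts, joints, frame_times, p2d):
    """Assumed cost: sum over frames and joints of ||p_{i,t} - pi(f_i(t + dt))||^2."""
    total = 0.0
    for tf, p in zip(frame_times, p2d):
        total += np.sum((p - project(K, R, t, resample(ts, joints, tf + dt))) ** 2)
    return total

# Hypothetical setup: two joints moving linearly, observed with a true offset of 0.05 s.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 5.0])
ts = np.linspace(0.0, 1.0, 11)
joints = np.zeros((11, 2, 3))
joints[:, 0, 0] = ts
joints[:, 1, 1] = ts
frame_times = [0.1, 0.3, 0.5]
p2d = [project(K, R, t, resample(ts, joints, tf + 0.05)) for tf in frame_times]

c_true = cost(0.05, K, R, t, ts, joints, frame_times, p2d)
c_init = cost(0.0, K, R, t, ts, joints, frame_times, p2d)
print(c_true, c_init)
```

In a full implementation, a nonlinear least-squares solver would minimize this cost jointly over Δt and the camera parameters, starting from the quasi-optimal solution; here the cost is only evaluated at two values of Δt to show that it is smallest at the true offset.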

FIG. 11 is a diagram for explaining the result of optimization. As shown in FIG. 11, the initial setting unit 24 sets the initial parameters for a frame (image) for which the two-dimensional coordinates of joints (1) to (8) have been acquired. At this stage the geometric consistency is insufficient, so the athlete's body in the image and the resampled three-dimensional skeleton data are misaligned. After the optimization unit 25 performs the optimization, the geometric consistency improves and the athlete's body in the image matches the resampled three-dimensional skeleton data. The optimization unit 25 then outputs the optimized Δt and camera parameters to the output unit 26.

The output unit 26 is a processing unit that generates learning data using the optimization result from the optimization unit 25. Specifically, using the optimized time synchronization adjustment amount and camera parameters, the output unit 26 generates images annotated with two-dimensional joint coordinates and three-dimensional joint coordinates and stores them as learning data in the learning data set 15.

FIG. 12 is a diagram showing an example of the generated learning data. As shown in FIG. 12, the output unit 26 stores entries of the form "image, three-dimensional skeleton data, camera parameters", such as "I_1, ({X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}}), (K, R, t)" and "I_2, ({X_{2,1}, Y_{2,1}, Z_{2,1}} ... {X_{2,j}, Y_{2,j}, Z_{2,j}}), (K, R, t)".

In this example, the three-dimensional joint coordinates "{X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}}" are associated with the two-dimensional image I_1, and the camera parameters used for the association are "K, R, t". One set of camera parameters is set for each series of moving image data 13.
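One way to hold such entries in memory is sketched below. The class and field names are hypothetical and only mirror the "image, three-dimensional skeleton data, camera parameters" layout described above; the source does not specify a storage format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CameraParams:
    K: List[List[float]]   # 3x3 intrinsic matrix
    R: List[List[float]]   # 3x3 rotation matrix
    t: List[float]         # translation vector of length 3

@dataclass
class TrainingSample:
    image_id: str                                   # e.g. "I_1"
    joints_3d: List[Tuple[float, float, float]]     # (X, Y, Z) for each joint
    camera: CameraParams                            # shared by every frame of one video

def build_dataset(image_ids, joints_per_frame, camera):
    """Pair each frame with its resampled 3D joints and the single camera of the video."""
    return [TrainingSample(i, j, camera) for i, j in zip(image_ids, joints_per_frame)]

# Hypothetical values for illustration.
cam = CameraParams(K=[[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]],
                   R=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                   t=[0.0, 0.0, 5.0])
dataset = build_dataset(["I_1", "I_2"],
                        [[(0.0, 0.0, 0.0)], [(0.1, 0.2, 0.0)]],
                        cam)
print(len(dataset), dataset[0].image_id)
```

Sharing one `CameraParams` object across all samples reflects the statement that a single set of camera parameters is set per series of moving image data.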

[Process flow]
FIG. 13 is a flowchart showing the flow of the learning data generation process. As shown in FIG. 13, the coordinate acquisition unit 22 reads the moving image data 13 and the three-dimensional skeleton sequence data 14 from the storage unit 12 (S101) and acquires two-dimensional joint coordinates in several frames of the moving image data 13 (S102).

The learning data generation unit 23 then estimates the initial values of the camera parameters and time synchronization (S103), optimizes the camera parameters and time synchronization (S104), and generates learning data using the optimization result (S105).

The processing performed in S103 and S104 is now described in detail. FIG. 14 is a flowchart showing the flow of processing from initial value estimation to optimization. As shown in FIG. 14, the learning data generation unit 23 generates a group of candidates for the time synchronization between the moving image data 13 and the three-dimensional skeleton sequence data 14 (S201).

Next, for each time synchronization candidate, the learning data generation unit 23 resamples the three-dimensional skeleton data corresponding to the frames that have two-dimensional joint coordinates (S202). Then, for each time synchronization candidate, the learning data generation unit 23 estimates the camera parameters by solving the PnP problem based on the correspondences between the two-dimensional joint positions and the three-dimensional skeleton data (S203).

After that, for each time synchronization candidate, the learning data generation unit 23 calculates the likelihood indicating the validity of the time synchronization and camera parameters (S204), and selects the time synchronization and camera parameters of the candidate with the maximum likelihood as the quasi-optimal solution (S205).

The learning data generation unit 23 then defines a cost function over the time synchronization and camera parameters that incorporates the resampling of the three-dimensional skeleton data (S206), and minimizes the cost function by nonlinear optimization starting from the quasi-optimal solution to obtain the optimal time synchronization and camera parameters (S207).

[Effects]
As described above, the generation device 10 simultaneously estimates quasi-optimal camera parameters and time synchronization so that the geometric consistency between the three-dimensional skeleton sequence data 14 and the two-dimensional joint coordinates is maximized, and then performs spatially and temporally optimal automatic alignment by nonlinear optimization using the estimation result as the initial values. The generation device 10 can therefore use moving image data 13 and three-dimensional skeleton sequence data 14 that were acquired asynchronously by a camera and by 3D sensing technology to efficiently generate learning data for image-based skeleton recognition.

In addition, because the generation device 10 can calculate, as the likelihood, the proportion of joints whose reprojection error is below the threshold when projected onto a frame of the moving image data, it can set accurate initial values. As a result, the generation device 10 can run the optimization from a search range that is already narrowed down to some extent, which reduces both the cost and the processing time of the optimization.

Although an embodiment of the present invention has been described above, the present invention may be implemented in various forms other than the embodiment described above.

[Numerical values, etc.]
The types of target data, cost function, machine learning models, learning data, parameters, and so on used in the above embodiment are merely examples and can be changed as desired. The above embodiment was described using gymnastics as an example, but the invention is not limited to this and can be applied to other sports in which athletes perform a series of elements that judges then score. Examples of such sports include figure skating, rhythmic gymnastics, cheerleading, diving, karate kata, and mogul airs. Beyond sports, it can also be applied to posture detection for drivers of trucks, taxis, trains, and the like, and for pilots.

[Application examples]
The learning data generated as described above can be used for various skeleton recognition models. Application examples of the learning data are described here. FIGS. 15 to 17 are diagrams for explaining examples of skeleton recognition processing.

FIG. 15 shows an example in which two-dimensional skeleton recognition is performed first and three-dimensional skeleton recognition is then performed with a formula. In the example of FIG. 15, a person detection model detects a person in images from multiple viewpoints, a 2D detection model identifies a plurality of two-dimensional joint coordinates from the heat map of each joint of the detected person, and the three-dimensional joint coordinates are then obtained algebraically from the two-dimensional joint coordinates by triangulation.
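The algebraic triangulation step can be illustrated with the standard linear (DLT) method. The source does not state which triangulation variant is used, so this is only one common choice; the two camera matrices and the test point are hypothetical.

```python
import numpy as np

def triangulate(Ps, xs):
    """Linear (DLT) triangulation from 3x4 projection matrices Ps and pixel points xs."""
    rows = []
    for P, (x, y) in zip(Ps, xs):
        rows.append(x * P[2] - P[0])   # x * (p3 . X) - (p1 . X) = 0
        rows.append(y * P[2] - P[1])   # y * (p3 . X) - (p2 . X) = 0
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                         # right singular vector of the smallest singular value
    return X[:3] / X[3]                # homogeneous -> Euclidean coordinates

def project(P, X):
    """Project a 3D point X (3,) with a 3x4 projection matrix P."""
    u = P @ np.append(X, 1.0)
    return u[:2] / u[2]

# Hypothetical two-view setup: same intrinsics, second camera shifted along x.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [5.0]])])

X_true = np.array([0.5, -0.25, 0.0])
xs = [project(P1, X_true), project(P2, X_true)]
X_est = triangulate([P1, P2], xs)
print(X_est)
```

With noise-free inputs the recovered point matches the original; with real 2D detections the result is a least-squares estimate in the algebraic sense.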

This method uses two machine learning models: the person detection model and the 2D detection model. Training these two models on learning data in which "images and two-dimensional coordinates" generated by the first embodiment are associated improves the accuracy of each model.

FIG. 16 shows an example in which two-dimensional skeleton recognition is performed first and three-dimensional skeleton recognition is then performed by a model. In the example of FIG. 16, a person detection model detects a person in images from multiple viewpoints, and a 2D detection model identifies a plurality of two-dimensional joint coordinates from the heat map of each joint of the detected person. The three-dimensional joint coordinates are then obtained from the multi-view two-dimensional joint coordinates using a 3D estimation model obtained by learning. Alternatively, the three-dimensional joint coordinates are obtained, using a 3D estimation model obtained by learning, from 3D voxel data that integrates the two-dimensional heat maps of the multiple viewpoints.

This method uses three machine learning models. The person detection model and the 2D detection model are trained on learning data in which "images and two-dimensional coordinates" generated by the first embodiment are associated. The 3D estimation model is trained on learning data in which "images, three-dimensional coordinates, and camera parameters" are associated. As a result, the accuracy of each model can be improved.

FIG. 17 shows an example in which three-dimensional skeleton recognition is performed directly. In the example of FIG. 17, a person detection model detects a person in an image, and a 3D detection model obtained by learning estimates the three-dimensional joint coordinates of each joint of the detected person. This method uses two machine learning models. These models are trained on learning data in which "images, three-dimensional coordinates, and camera parameters" are associated. As a result, the accuracy of each model can be improved.

[System]
The processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and in the drawings can be changed as desired unless otherwise specified. The coordinate acquisition unit 22 is an example of an acquisition unit, and the learning data generation unit 23 is an example of an identification unit, an execution unit, and a generation unit. The three-dimensional skeleton sequence data 14 is an example of three-dimensional series data, and the camera parameters are an example of projection parameters. Two-dimensional joint coordinates are the coordinates of joint positions expressed in two dimensions and are synonymous with two-dimensional skeleton coordinates. Similarly, three-dimensional joint coordinates are the coordinates of joint positions expressed in three dimensions and are synonymous with three-dimensional skeleton coordinates.

The components of each illustrated device are functional concepts and do not necessarily have to be physically configured as illustrated. That is, the specific form in which the devices are distributed or integrated is not limited to the one illustrated. All or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

Furthermore, all or any part of the processing functions performed by each device can be implemented by a CPU and a program analyzed and executed by that CPU, or implemented as hardware using wired logic.

[Hardware]
Next, a hardware configuration example is described. FIG. 18 is a diagram for explaining a hardware configuration example. As shown in FIG. 18, the generation device 10 has a communication device 10a, an HDD (Hard Disk Drive) 10b, a memory 10c, and a processor 10d. The units shown in FIG. 18 are interconnected by a bus or the like.

The communication device 10a is a network interface card or the like and communicates with other servers. The HDD 10b stores the programs and DBs that operate the functions shown in FIG. 3.

The processor 10d reads, from the HDD 10b or the like, a program that performs the same processing as each processing unit shown in FIG. 3 and loads it into the memory 10c, thereby running a process that performs the functions described with reference to FIG. 3 and other figures. For example, this process performs the same functions as the processing units of the generation device 10. Specifically, the processor 10d reads, from the HDD 10b or the like, a program having the same functions as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and so on. The processor 10d then runs a process that performs the same processing as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and so on.

In this way, the generation device 10 operates as an information processing device that carries out various information processing methods by reading and executing a program. The generation device 10 can also implement the same functions as the embodiment described above by reading the program from a recording medium with a medium reading device and executing the program that was read. The program referred to in this other embodiment is not limited to being executed by the generation device 10. For example, the present invention can be applied in the same way when another computer or server executes the program, or when they execute the program in cooperation.

This program can be distributed via a network such as the Internet. The program can also be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO (Magneto-Optical disk), or a DVD (Digital Versatile Disc), and executed by being read from the recording medium by a computer.

10 generation device
11 communication unit
12 storage unit
13 moving image data
14 three-dimensional skeleton sequence data
15 learning data set
20 control unit
21 data acquisition unit
22 coordinate acquisition unit
23 learning data generation unit
24 initial setting unit
25 optimization unit
26 output unit

Claims

A data generation method in which a computer executes processing comprising:
acquiring two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion;
identifying, from three-dimensional series data including three-dimensional skeleton data relating to a plurality of joint positions of the subject performing the predetermined motion, a plurality of pieces of three-dimensional skeleton data respectively corresponding to the plurality of frames;
optimizing, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, an adjustment amount relating to time synchronization between the moving image data and the three-dimensional series data, and projection parameters used when projecting the three-dimensional series data onto the moving image data; and
generating, using the optimized adjustment amount and the optimized projection parameters, data in which the moving image data and the three-dimensional series data are associated with each other.

The data generation method according to claim 1, wherein the identifying identifies each of the plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames by resampling that uses the three-dimensional skeleton data within the sampling period of the three-dimensional series data that contains the time of the respective frame.
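The resampling described in this claim can be illustrated with a minimal sketch. It assumes linear interpolation between the two three-dimensional samples that bracket the frame time; the claim itself does not fix the interpolation scheme, and the function name `resample_skeleton` and the data layout are hypothetical.

```python
from bisect import bisect_right


def resample_skeleton(times, skeletons, t):
    """Linearly interpolate a 3D skeleton at time t (assumed scheme).

    times: sorted timestamps of the 3D series data (at least two samples).
    skeletons: one skeleton per timestamp; each is a list of (x, y, z) joints.
    """
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the time range of the 3D series data")
    # Index of the sampling interval [times[i], times[i + 1]] that contains t.
    i = min(bisect_right(times, t) - 1, len(times) - 2)
    t0, t1 = times[i], times[i + 1]
    w = (t - t0) / (t1 - t0)  # interpolation weight in [0, 1]
    return [
        tuple((1.0 - w) * a + w * b for a, b in zip(j0, j1))
        for j0, j1 in zip(skeletons[i], skeletons[i + 1])
    ]
```

Calling the function once per video frame time yields one three-dimensional skeleton per frame, even when the frame rate and the sampling rate of the three-dimensional series data differ.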

The data generation method according to claim 2, wherein the computer further executes processing comprising:
calculating, for each synchronization pattern obtained by adjusting the time synchronization between the time of the moving image data and the time of the three-dimensional series data, the projection parameters by solving an estimation problem that estimates the position and orientation of a camera, using each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data associated by the resampling; and
calculating a likelihood representing the validity of the projection parameters calculated for each synchronization pattern and of the time synchronization of that synchronization pattern,
wherein the optimizing optimizes a cost function relating to the adjustment amount and the projection parameters, using as initial values the projection parameters calculated for the synchronization pattern with the highest likelihood and the time synchronization of that synchronization pattern.
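The selection of initial values in this claim can be sketched as a search over candidate synchronization patterns. The sketch below is an illustration only: `estimate_projection` and `likelihood` are hypothetical callables standing in for the camera pose estimation and the likelihood calculation, and a synchronization pattern is represented here simply as a time offset, which is an assumption.

```python
def select_initial_values(candidate_offsets, estimate_projection, likelihood):
    """Return the (offset, projection parameters) pair with the highest likelihood.

    candidate_offsets: iterable of candidate time offsets (synchronization patterns).
    estimate_projection: hypothetical callable, offset -> projection parameters.
    likelihood: hypothetical callable, (offset, parameters) -> score.
    """
    best = None
    for offset in candidate_offsets:
        params = estimate_projection(offset)  # pose estimation for this pattern
        score = likelihood(offset, params)    # validity of this pattern
        if best is None or score > best[0]:
            best = (score, offset, params)
    if best is None:
        raise ValueError("no candidate offsets were given")
    return best[1], best[2]
```

The returned pair would then seed the cost function optimization, so that the optimizer starts from the most plausible synchronization rather than from an arbitrary one.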

The data generation method according to claim 3, wherein the calculating of the likelihood calculates, as the likelihood, the proportion of joints whose reprojection error is less than a threshold when each of the plurality of pieces of resampled three-dimensional skeleton data is projected onto the corresponding frame of the moving image data.
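The likelihood in this claim, the proportion of joints whose reprojection error is below a threshold, can be sketched as follows. The sketch assumes the error is the Euclidean pixel distance between the projected joint and the joint detected in the frame; the function name `inlier_ratio` and the data layout are hypothetical.

```python
import math


def inlier_ratio(projected, observed, threshold):
    """Fraction of joints whose reprojection error is below threshold.

    projected: per-frame lists of (u, v) joints obtained by projecting 3D skeletons.
    observed: per-frame lists of (u, v) joints detected in the video frames.
    threshold: maximum pixel distance for a joint to count as consistent.
    """
    total = 0
    inliers = 0
    for proj_frame, obs_frame in zip(projected, observed):
        for (pu, pv), (ou, ov) in zip(proj_frame, obs_frame):
            total += 1
            # Euclidean distance in the image plane (assumed error measure).
            if math.hypot(pu - ou, pv - ov) < threshold:
                inliers += 1
    return inliers / total if total else 0.0
```

A ratio close to 1 indicates that the synchronization pattern and its projection parameters explain most of the detected joints.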

The data generation method according to claim 1, wherein the generating synchronizes the time of the moving image data with the time of the three-dimensional series data according to the optimized adjustment amount of the time synchronization, and generates the data by projecting, using the optimized projection parameters, each piece of three-dimensional skeleton data in the three-dimensional series data onto the frame of the moving image data whose time is synchronized with it by the time synchronization.
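The generation step in this claim can be sketched under two assumptions that the claim does not state: the projection parameters are those of a pinhole camera (rotation, translation, focal lengths, principal point), and the adjustment amount is a single offset added to the frame time. All names below are hypothetical.

```python
def project_point(point, rotation, translation, fx, fy, cx, cy):
    """Project one 3D joint with an assumed pinhole camera model."""
    # Camera coordinates: R * X + t, with R given as three rows.
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point)) + tr
        for row, tr in zip(rotation, translation)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def label_frames(frame_times, offset, skeleton_at, camera):
    """Pair each frame with its time-synchronized 3D skeleton and 2D projection.

    offset: optimized time adjustment (sign convention is an assumption).
    skeleton_at: hypothetical callable, series time -> list of (x, y, z) joints.
    camera: dict with keys rotation, translation, fx, fy, cx, cy.
    """
    labels = []
    for t in frame_times:
        skeleton = skeleton_at(t + offset)
        joints_2d = [project_point(j, **camera) for j in skeleton]
        labels.append(
            {"frame_time": t, "joints_3d": skeleton, "joints_2d": joints_2d}
        )
    return labels
```

Each returned entry associates a frame with a three-dimensional skeleton and its two-dimensional projection, which is the kind of associated data the method generates for use as training data.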

A data generation program that causes a computer to execute processing comprising:
acquiring two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion;
identifying, from three-dimensional series data including three-dimensional skeleton data relating to a plurality of joint positions of the subject performing the predetermined motion, a plurality of pieces of three-dimensional skeleton data respectively corresponding to the plurality of frames;
optimizing, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, an adjustment amount relating to time synchronization between the moving image data and the three-dimensional series data, and projection parameters used when projecting the three-dimensional series data onto the moving image data; and
generating, using the optimized adjustment amount and the optimized projection parameters, data in which the moving image data and the three-dimensional series data are associated with each other.

An information processing device comprising:
an acquisition unit that acquires two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion;
an identification unit that identifies, from three-dimensional series data including three-dimensional skeleton data relating to a plurality of joint positions of the subject performing the predetermined motion, a plurality of pieces of three-dimensional skeleton data respectively corresponding to the plurality of frames;
an execution unit that optimizes, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, an adjustment amount relating to time synchronization between the moving image data and the three-dimensional series data, and projection parameters used when projecting the three-dimensional series data onto the moving image data; and
a generation unit that generates, using the optimized adjustment amount and the optimized projection parameters, data in which the moving image data and the three-dimensional series data are associated with each other.