JP7502957B2

JP7502957B2 - Haptic metadata generating device, video-haptic interlocking system, and program

Info

Publication number: JP7502957B2
Application number: JP2020170229A
Authority: JP
Inventors: 正樹高橋; 真希子東; 拓也半田; 雅規佐野; 結子山内
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-10-08
Filing date: 2020-10-08
Publication date: 2024-06-19
Anticipated expiration: 2040-10-08
Also published as: JP2022062313A

Description

The present invention relates to a haptic metadata generating device that extracts person objects from video and generates haptic metadata corresponding to the moving person objects, to a video-haptic interlocking system that drives and controls a haptic presentation device based on the generated haptic metadata, and to a program.

Video content from ordinary cameras, such as broadcast video, is a medium that conveys information to two senses, sight and hearing. For people with visual or hearing impairments, however, audiovisual information alone is insufficient and cannot accurately convey what is happening in a program. As a result, many people with disabilities do not own a television, or own one but do not watch it. A system is therefore desired that lets people with visual or hearing impairments understand television broadcasts by presenting, alongside the video content, information that can be perceived through touch rather than through sight or hearing.

For viewers without visual or hearing impairments, too, presenting tactile stimuli can be expected to improve the sense of presence and immersion when watching broadcast programs. In sports content in particular, the movement of people is important information, and presenting it as tactile stimuli heightens the sense of presence.

For example, when watching baseball video, stimulating the viewer through a haptic presentation device at the moment the ball hits the bat lets the viewer vicariously experience the batter's sensation of hitting. Providing tactile stimuli to visually impaired viewers is also thought to help them understand the state of a game. Touch is thus expected to serve as a third sense in video viewing.

Because real-time viewing is especially important for sports, tactile stimuli must be presented for the video automatically and in real time. Tactile stimuli synchronized with the players' movements, reflecting the type, timing, and situation of each play, are often effective when video content is viewed together with touch. They also make it possible to convey the state of play to people with visual or hearing impairments.

Realizing video viewing that also uses touch therefore requires extracting the movements of person objects from the video content and generating haptic information corresponding to the extracted movements as haptic metadata.

With conventional methods of generating haptic metadata, however, the metadata indicating when and what kind of stimulus the haptic presentation device should present to the user had to be edited by hand in synchronization with the video.

For recorded programs, the haptic metadata can be edited by hand over time. For live broadcast video, however, the haptic information cannot be edited in advance, so linking stimulus presentation by a haptic presentation device to the video requires analyzing the video content in real time and generating the haptic metadata from that analysis.

Sports video analysis technology has advanced remarkably in recent years. The Hawk-Eye system for tennis, which is also used at Wimbledon, tracks the tennis ball in three dimensions using video from multiple fixed cameras as sensors and makes the IN/OUT calls that bear on officiating decisions. At the 2014 FIFA World Cup, goal-line technology analyzed video from several fixed cameras to automate goal decisions. Real-time video analysis in sports continues to grow more sophisticated, as in the TRACAB system, which installs many stereo cameras in a soccer stadium and tracks every player on the field in real time.

Meanwhile, measuring a player's posture as a moving person object has conventionally relied on marker-based motion capture. This method requires attaching many markers to the player's body and so cannot be applied to actual matches. More recently, markerless posture measurement has become possible using a depth sensor, which reads an infrared pattern projected onto the player's body and obtains depth information from the distortion of that pattern. Various technologies that apply optical motion capture rather than the marker-based method have also been disclosed (see, for example, Patent Documents 1, 2, and 3).

For example, Patent Document 1 discloses a device that measures the three-dimensional positions of a subject's skeleton by optical motion capture, for use in a stereoscopic virtual reality system that teaches a user a movement by displaying video of another person's model movement. Patent Document 2 discloses a technique that performs motion recognition using information obtained from video of gymnastics and similar events together with motion capture data; it is characterized by using a hidden Markov model to remove constraints on the duration of a motion. Patent Document 3 discloses a training evaluation device that measures a player's movements by optical motion capture and evaluates the player's form based on the measured data and data on a model form. Because these technologies rely on motion capture, however, they cannot be applied to actual matches, and measuring a person's playing movements from general-purpose camera video is difficult.

A device has also been disclosed that, without motion capture, simulates a person's movements solely from camera video of a badminton match or practice session played by one person or by a pair (see, for example, Patent Document 4). The technique of Patent Document 4 detects motions such as shots from the captured video, but it presupposes video from a specially installed camera, so measuring a person's playing movements from general-purpose broadcast camera video is difficult.

Meanwhile, recent advances in deep learning have made it possible to estimate a person's skeletal positions from ordinary still images that contain no depth information, without a depth sensor, which was previously difficult. With this technology, still images can be extracted from ordinary camera video and the posture of the players in each still image can be measured automatically. In other words, measuring the players' posture from ordinary camera video makes it possible to obtain information for tactile stimuli without affecting the competition.

Acquiring skeletal information makes it possible to measure a person's posture, but recognition processing is needed to give that posture a meaning. For example, when judo video is input, whether the action in a given frame is "grappling," a "throwing technique," or a "ground technique" must be determined from image features and skeletal features. A Convolutional Neural Network (CNN) is widely used for recognition in image processing. A CNN is a neural network with many layers that performs particularly well in image recognition. It is built by stacking layers with characteristic functions, such as "convolutional layers" and "pooling layers," and is now used in a wide range of fields.

In a typical neural network, neurons are arranged in layers and every neuron in one layer is usually connected to every neuron in the adjacent layers. A convolutional neural network instead restricts the connections between neurons and uses a technique called weight sharing, so that processing equivalent to image convolution is expressed within the neural network framework. Such a layer is called a "convolutional layer" and is the defining feature of a CNN. The other major feature is the "pooling layer." If the convolutional layer serves to extract features from an image, such as edges, the pooling layer makes the extracted features robust to changes such as translation.
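The roles of the two layer types can be illustrated with a minimal sketch in plain Python. It is not taken from the patent: the function names, the 1x2 edge kernel, and the toy images are assumptions chosen for illustration. One small kernel is reused at every image position (weight sharing), and max pooling then yields the same output for an edge that has shifted by one pixel.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    return [
        [
            max(feature_map[y + i][x + j]
                for i in range(size) for j in range(size))
            for x in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for y in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [[-1, 1]]

# The same vertical edge, once between columns 1 and 2 and once shifted left by one pixel.
image_a = [[0, 0, 1, 1, 1] for _ in range(4)]
image_b = [[0, 1, 1, 1, 1] for _ in range(4)]

features_a = conv2d_valid(image_a, edge_kernel)
features_b = conv2d_valid(image_b, edge_kernel)
pooled_a = max_pool(features_a)
pooled_b = max_pool(features_b)
```

The convolution outputs differ because the edge sits in a different column, while the pooled outputs coincide, which is the kind of robustness to translation described above.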

Apart from using skeletal information, an image called a Motion History Image (MHI) has conventionally been used to recognize motion from images (see, for example, Non-Patent Document 1 and Patent Document 5). In an MHI, regions where a brightness difference arises between frames are filled with high brightness, and that brightness is gradually lowered in subsequent frames; the result is a single image that carries information about the direction of movement of a moving object.
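The MHI update described above can be sketched as follows. This is a minimal illustration, not the implementation of the cited documents: the function name, the difference threshold, and the decay step are assumed values, and frames are represented as plain lists of brightness rows.

```python
def update_mhi(mhi, prev_frame, cur_frame, diff_threshold=30, decay=32, max_value=255):
    """Return the motion history image after observing cur_frame.

    Pixels whose brightness changed between the two frames are set to
    max_value; all other pixels fade by `decay` (assumed parameters).
    """
    new_mhi = []
    for mhi_row, prev_row, cur_row in zip(mhi, prev_frame, cur_frame):
        new_row = []
        for history, prev_px, cur_px in zip(mhi_row, prev_row, cur_row):
            if abs(cur_px - prev_px) >= diff_threshold:
                new_row.append(max_value)                 # fresh motion is brightest
            else:
                new_row.append(max(0, history - decay))   # older motion fades
        new_mhi.append(new_row)
    return new_mhi


# A bright dot moving one pixel to the right per frame on a 1x5 strip.
frames = [
    [[200, 0, 0, 0, 0]],
    [[0, 200, 0, 0, 0]],
    [[0, 0, 200, 0, 0]],
    [[0, 0, 0, 200, 0]],
]
mhi = [[0, 0, 0, 0, 0]]
for prev_frame, cur_frame in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev_frame, cur_frame)
```

After the loop, brightness rises from left to right along the dot's path, so the single image encodes that the motion was rightward.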

Patent Document 5 discloses a technique that uses image recognition to detect pitching motions in baseball video by creating a Motion History Image (MHI) from the video. However, the MHI in the technique of Patent Document 5 involves no skeleton detection, so recognizing detailed motions is difficult.

A technique called Skeleton Motion History Image (SklMHI) has therefore also been disclosed, in which an MHI is generated from images (bone images) showing a person's skeletal points obtained by skeleton detection and the lines connecting them, and the person's posture is measured from camera video using deep learning (see, for example, Non-Patent Document 2).

Patent Document 1: JP 2002-8063 A
Patent Document 2: JP 2002-253718 A
Patent Document 3: JP 2020-38440 A
Patent Document 4: JP 2018-187383 A
Patent Document 5: JP 2008-22142 A

Non-Patent Document 1: "Motion History Image", [online], [retrieved September 15, 2020], Internet: https://web.cse.ohio-state.edu/~davis.1719/CVL/Research/MHI/mhi.html
Non-Patent Document 2: C. N. Ohyo, T. T. Zin, P. Tin, "Skeleton motion history based human action recognition using deep learning", [online], [retrieved September 15, 2020], Internet: https://ieeexplore.ieee.org/document/8229448

As described above, adding haptic information to video content has conventionally required editing the type and timing of stimuli by hand, which made it impossible to present haptic information in live broadcasts. If the extraction of haptic information can be automated through real-time video analysis, haptic information can be provided for live broadcasts as well. Realizing video viewing that also uses touch therefore requires extracting the movements of person objects from the video content and generating haptic information corresponding to those movements as haptic metadata.

Live sports coverage in particular is content for which real-time delivery matters. Haptic information about the competition must therefore also be attached in real time and presented together with the video. Tactile stimuli synchronized with the players' movements are often effective, so extracting haptic metadata from video requires analyzing the players' movements in real time from camera video. To avoid affecting the competition, it is desirable to extract haptic metadata from ordinary broadcast camera video, without motion capture that requires attached markers and without depth sensors, whose shooting distance is limited.

In other words, a technique is desired that generates haptic metadata about the movements of person objects (players and the like) automatically and in real time, using only ordinary camera video of the sport.

To detect the movement of person objects with high accuracy, one could refer to moving objects other than people (for example, the shuttlecock and rackets in badminton). However, a technique is desired that detects the movement of person objects with high accuracy even in sports that have no such non-person moving objects to refer to (for example, judo and wrestling).

As noted, recent advances in deep learning have made it possible to estimate a person's skeletal positions from ordinary still images without depth information and without a depth sensor. Skeleton detection algorithms of this kind, however, basically detect skeletal positions one still image at a time. Further ingenuity is therefore needed to generate haptic metadata about the movements of person objects (players and the like) automatically and in real time from ordinary camera video of a sport alone.

As for machine learning for motion recognition, older supervised methods such as SVM allow fast recognition, but deep learning, which has advanced markedly in recent years, promises further gains in accuracy. CNNs are often used for motion recognition based on video analysis. A CNN, however, is a still-image classifier and does not take the time axis into account. Understanding the action in a video scene requires features describing how people move, and because a still image contains no temporal information, a CNN applied to still images cannot be expected to classify actions with high accuracy.

One way to recognize motion from images with a CNN is therefore to use a Motion History Image (MHI). Analyzing an MHI makes it possible to recognize basic human motions such as "spreading the arms," "crouching," and "raising a hand." Because an MHI does not measure the individual joints of a person, however, it is limited to recognizing large whole-body motions. For example, detecting pitching motions by creating an MHI from baseball video, as disclosed in Patent Document 5, requires detecting the pitcher's region with high accuracy to suppress background noise, and because no skeleton detection is performed, recognizing detailed motions is difficult.

The technique called Skeleton Motion History Image (SklMHI) disclosed in Non-Patent Document 2, which generates an MHI from bone images showing a person's skeletal points and the lines connecting them and measures the person's posture from camera video using deep learning, improves the accuracy of motion recognition. Still greater accuracy is nevertheless desired.

In view of the above problems, an object of the present invention is to provide a haptic metadata generating device that automatically extracts person objects from video and automatically generates, in synchronization, haptic metadata corresponding to the moving person objects; a video-haptic interlocking system that drives and controls a haptic presentation device based on the generated haptic metadata; and a program.

The haptic metadata generating device of the present invention is a device that extracts person objects from video and generates haptic metadata corresponding to the moving person objects, and is characterized by comprising: multi-frame extraction means for extracting, from an input video, a plurality of frame images including a current frame image and a predetermined number of past frame images; person skeleton extraction means for generating, for each of the plurality of frame images, a first skeleton coordinate set for each person object based on a skeleton detection algorithm; person identification means for variably setting, for each of the plurality of frame images, a search range based on the first skeleton coordinate set, identifying each person object by extracting the position and size of its skeleton and the surrounding image information, and generating a second skeleton coordinate set to which a person ID is assigned; skeleton trajectory feature image generation means for generating, with the current frame image as a reference and based on the second skeleton coordinate set in each of the plurality of frame images, a single skeleton trajectory feature image showing only the direction of movement of each identified person skeleton; person action recognition means for recognizing a specific action of a person with a convolutional neural network that takes the skeleton trajectory feature image as input, and detecting information for impact presentation that activates a predetermined haptic presentation device; and metadata generation means for generating haptic metadata that includes the information for impact presentation in correspondence with the current frame image, and outputting it externally frame by frame.

In the haptic metadata generating device of the present invention, the skeleton trajectory feature image generation means is further characterized in that it generates, as the skeleton trajectory feature image, a single image in which a trajectory connecting the positions of each skeleton coordinate of each person across the plurality of frame images is drawn, with the brightness either lowered or raised the further back in time the trajectory goes.
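A minimal sketch of this kind of trajectory drawing is given below. It is an illustration under stated assumptions rather than the patented implementation: the function names, the linear brightness ramp, and the simple segment rasterizer are hypothetical, and only the variant that lowers brightness toward the past is shown.

```python
def draw_segment(canvas, start, end, value):
    """Rasterize a straight segment, keeping the brighter value where strokes overlap."""
    (x0, y0), (x1, y1) = start, end
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for s in range(steps + 1):
        x = round(x0 + (x1 - x0) * s / steps)
        y = round(y0 + (y1 - y0) * s / steps)
        canvas[y][x] = max(canvas[y][x], value)


def skeleton_trajectory_image(joint_tracks, width, height, max_value=255):
    """Draw one trajectory per joint onto a single image.

    joint_tracks holds one list per joint of integer (x, y) positions,
    ordered from the oldest frame to the current frame.
    """
    canvas = [[0] * width for _ in range(height)]
    for track in joint_tracks:
        n_segments = len(track) - 1
        for k in range(n_segments):
            # Segment 0 is the oldest; the newest segment gets max_value.
            value = max_value * (k + 1) // n_segments
            draw_segment(canvas, track[k], track[k + 1], value)
    return canvas


# One hypothetical joint (for example a wrist) moving right over three frames.
sti = skeleton_trajectory_image([[(0, 0), (2, 0), (4, 0)]], width=5, height=1)
```

In the resulting row the older half of the path is dimmer than the newer half, so a single image conveys the direction in which the joint moved.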

In the haptic metadata generating device of the present invention, the skeleton trajectory feature image generation means is further characterized in that it generates, as the skeleton trajectory feature image, a single image in which the skeleton coordinates of each person in the plurality of frame images are color-coded per skeleton coordinate, using colors that are either common to all persons or distinct for each person, and in which the movement of each skeleton coordinate is drawn with a gradation that follows the time series frame by frame.

In the haptic metadata generating device of the present invention, the person identification means is further characterized by having means for variably setting the search range by narrowing, with the range limited at most to a person search range enclosing the entire person skeleton and at least to an attention search range defined as a predetermined region of the person skeleton; for determining the search range, based on state transition estimates of the person's skeleton obtained by a state estimation algorithm, so that it includes at least the attention search range; and for performing the processing that identifies the person object.
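One way to read the bounds on the search range is sketched below. This is an assumption-laden illustration, not the method of the patent: the joint names, the box representation, and the rule of clipping a predicted box to the person box and then enlarging it to cover the attention box are all hypothetical, and `predicted_box` stands in for whatever the state estimation algorithm outputs.

```python
def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def intersection(a, b):
    """Overlap of two boxes, or None when they do not overlap."""
    box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box if box[0] <= box[2] and box[1] <= box[3] else None


def union(a, b):
    """Smallest box containing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def search_range(joints, attention_names, predicted_box):
    """Search range that lies inside the person box and contains the attention box."""
    person_box = bounding_box(list(joints.values()))
    attention_box = bounding_box([joints[name] for name in attention_names])
    clipped = intersection(predicted_box, person_box)
    return attention_box if clipped is None else union(attention_box, clipped)


# Hypothetical skeleton coordinates for one person in one frame.
joints = {
    "head": (50, 10),
    "l_hand": (20, 40),
    "r_hand": (80, 40),
    "l_foot": (40, 100),
    "r_foot": (60, 100),
}
upper_body = ["head", "l_hand", "r_hand"]
box = search_range(joints, upper_body, predicted_box=(30, 0, 90, 60))
```

With these inputs the range never exceeds the box around the whole skeleton and never shrinks below the upper-body attention region, which matches the maximum and minimum limits stated above.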

The haptic metadata generating device of the present invention is further characterized by comprising moving object detection means that detects moving objects from difference images between adjacent frames of the plurality of frame images, selects moving objects other than persons from among the moving objects detected in each difference image by comparison with the skeleton trajectory feature image showing only the direction of movement of each identified person skeleton, and generates, for the moving objects other than persons, a moving object trajectory image that links the coordinate position, size, and direction of movement obtained from each difference image; and in that the person action recognition means recognizes a specific action of a person with a convolutional neural network whose input is the skeleton trajectory feature image with the moving object trajectory image added and composited onto it.
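The frame-difference step can be sketched roughly as follows. The sketch simplifies the description above in one important way, which is an assumption of this example: instead of comparing against the skeleton trajectory feature image, it discards changed pixels that fall inside given person boxes. The function names, the threshold, and the toy frames are likewise hypothetical.

```python
def changed_pixels(prev_frame, cur_frame, threshold=30):
    """Coordinates (x, y) whose brightness differs between adjacent frames."""
    return [
        (x, y)
        for y, (prev_row, cur_row) in enumerate(zip(prev_frame, cur_frame))
        for x, (p, c) in enumerate(zip(prev_row, cur_row))
        if abs(c - p) >= threshold
    ]


def inside(point, box):
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def detect_non_person_object(prev_frame, cur_frame, person_boxes, threshold=30):
    """Position and size of motion that lies outside every person box, or None."""
    points = [
        p for p in changed_pixels(prev_frame, cur_frame, threshold)
        if not any(inside(p, box) for box in person_boxes)
    ]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    center = (sum(xs) / len(xs), sum(ys) / len(ys))
    size = (max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
    return {"center": center, "size": size}


def trajectory(detections):
    """Attach a movement direction (dx, dy) between consecutive detections."""
    track = []
    for prev, cur in zip(detections, detections[1:]):
        dx = cur["center"][0] - prev["center"][0]
        dy = cur["center"][1] - prev["center"][1]
        track.append({**cur, "direction": (dx, dy)})
    return track


# Toy frames: a person flickers in columns 0-1, a small object moves right in row 0.
frames = [
    [[0, 0, 0, 200, 0, 0], [100, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]],
    [[0, 0, 0, 0, 200, 0], [0, 100, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]],
    [[0, 0, 0, 0, 0, 200], [100, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]],
]
person_boxes = [(0, 0, 1, 2)]
detections = [
    detect_non_person_object(prev, cur, person_boxes)
    for prev, cur in zip(frames, frames[1:])
]
track = trajectory(detections)
```

Each entry of `track` carries the position, size, and direction elements named in the text; rendering them into a trajectory image and compositing that onto the skeleton trajectory feature image is left out of this sketch.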

The video-haptic interlocking system of the present invention is characterized by comprising the haptic metadata generating device of the present invention, a haptic presentation device that presents tactile stimuli, and a control unit that, based on the haptic metadata obtained from the haptic metadata generating device, refers to predetermined drive reference data and controls the haptic presentation device so as to drive it.
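The control unit's lookup can be pictured with the short sketch below. Every field name, table entry, and value here is a hypothetical placeholder: the text states only that haptic metadata is mapped to device drive through predetermined drive reference data, not what that data contains or how persons are assigned to devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HapticMetadata:
    """Per-frame metadata record (hypothetical fields)."""
    frame_index: int
    person_id: int
    action: str  # recognized action label, e.g. "throw"


# Hypothetical drive reference data: action label -> (amplitude 0.0-1.0, duration in ms).
DRIVE_REFERENCE = {"throw": (1.0, 200), "grapple": (0.3, 50)}

# Hypothetical assignment of identified persons to the two devices.
DEVICE_FOR_PERSON = {0: "14L", 1: "14R"}


def drive_commands(metadata_for_frame):
    """Translate one frame's metadata into per-device drive commands."""
    commands = []
    for item in metadata_for_frame:
        reference = DRIVE_REFERENCE.get(item.action)
        device = DEVICE_FOR_PERSON.get(item.person_id)
        if reference is None or device is None:
            continue  # no drive data for this action or person: stay silent
        amplitude, duration_ms = reference
        commands.append(
            {"device": device, "amplitude": amplitude, "duration_ms": duration_ms}
        )
    return commands


commands = drive_commands(
    [HapticMetadata(120, 0, "throw"), HapticMetadata(120, 1, "bow")]
)
```

Keeping the drive reference data in a table separate from the recognition step means stimulus strength and duration can be tuned per device without touching the metadata generator.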

Furthermore, the program of the present invention is configured as a program for causing a computer to function as the haptic metadata generating device of the present invention.

According to the present invention, person objects can be extracted automatically from video, and haptic metadata corresponding to the moving person objects can be generated automatically and in synchronization. In particular, tactile stimuli can be presented while sports video is being watched in real time. Presenting information to touch as well as to sight and hearing makes it possible to convey the state of a sporting event clearly to people with visual or hearing impairments. For sighted viewers in general, too, it can provide a sense of presence and immersion that conventional video viewing cannot convey.

Fig. 1 is a block diagram showing a schematic configuration of a video-haptic interlocking system including a haptic metadata generating device according to an embodiment of the present invention.
Fig. 2 is a flowchart showing an example of processing by the haptic metadata generating device according to an embodiment of the present invention.
Fig. 3 is an explanatory diagram of the person skeleton extraction processing in the haptic metadata generating device according to an embodiment of the present invention.
Fig. 4: (a) illustrates one frame image, and (b) shows an example of person skeleton extraction from one frame image in the haptic metadata generating device according to an embodiment of the present invention.
Fig. 5: (a) and (b) each show an example of processing the search range for a person object in the person skeleton extraction processing of the haptic metadata generating device according to an embodiment of the present invention.
Fig. 6: (a) shows an example of a skeleton trajectory feature image (STI: Skeleton Trajectory Image) according to the present invention, and (b) is an explanatory diagram of that skeleton trajectory feature image (STI).
Fig. 7: (a) schematically shows an example of one frame image, (b) shows an example of a prior-art bone image, (c) shows an example of a prior-art SklMHI (Skeleton Motion History Image), and (d) shows an example of a skeleton trajectory feature image (STI) according to the present invention.
Fig. 8 shows a comparative evaluation of the accuracy of detecting person movement with the prior-art bone image, the prior-art SklMHI, and the skeleton trajectory feature image (STI) according to the present invention.
Fig. 9 is a block diagram showing a schematic configuration of the control unit in the video-haptic interlocking system according to an embodiment of the present invention.

(System configuration)
Hereinafter, a video-haptic interlocking system 1 including a haptic metadata generating device 12 according to an embodiment of the present invention will be described in detail with reference to the drawings. Fig. 1 is a block diagram showing a schematic configuration of the video-haptic interlocking system 1 including the haptic metadata generating device 12 according to an embodiment of the present invention.

The video-haptic interlocking system 1 shown in FIG. 1 includes: a haptic metadata generating device 12 that receives video from a video output device 10 such as a camera or a recording device, automatically extracts person objects from the input video, and automatically generates, in synchronization, haptic metadata corresponding to the moving person objects (two types: first haptic metadata and second haptic metadata); two haptic presentation devices 14L, 14R in this example; and a control unit 13 that individually drives and controls each of the haptic presentation devices 14L, 14R based on the generated haptic metadata.

First, the video output by the video output device 10 is assumed, as an example, to be a judo match captured in real time. It is displayed on a display 11 and viewed by a user U.

Judo is a sport in which two athletes grapple and compete using techniques such as pins and throws. By presenting to the user U, as haptic stimuli through the haptic presentation devices 14L, 14R, the moment an impact occurs to each person and the changes in each person's movement, it is possible to heighten the sense of realism and to convey the state of the match to people with visual or hearing impairments.

In particular, judo footage contains frequent overlap and occlusion between the athletes. Therefore, in addition to the timing and speed corresponding to the type of impact that occurs to each athlete, continuously presenting each athlete's actions in haptic form, such as pushing and pulling while grappling and throwing, makes it possible to convey the tension of the match to people with visual or hearing impairments and to heighten the sense of realism.

Accordingly, the user U holds the haptic presentation device 14L in the left hand HL and the haptic presentation device 14R in the right hand HR, and in this example is presented with vibration stimuli synchronized with the video analysis. The control unit 13 individually controls the haptic presentation of the two haptic presentation devices 14L, 14R, which are associated with the person objects Op1, Op2 respectively, based on the haptic metadata obtained from the haptic metadata generating device 12. This haptic metadata includes impact presentation information indicating the timing and speed corresponding to the type of impact that occurs to each person object Op1, Op2. The control unit 13 may instead drive and control only one haptic presentation device, or may individually drive and control three or more haptic presentation devices. Although not limiting, the control unit 13 in this example sorts the stimuli so that vibration stimuli corresponding to the movement of person object Op1 (an athlete) in the video are presented by the haptic presentation device 14L, and vibration stimuli corresponding to the movement of person object Op2 (an athlete) are presented by the haptic presentation device 14R.

Each of the haptic presentation devices 14L, 14R houses, inside a spherical case 141, a vibration actuator 142 capable of presenting vibration stimuli under the control of the control unit 13. The haptic presentation devices 14L, 14R may present stimuli other than vibration, such as electromagnetic pulse stimuli. This example describes a configuration in which the control unit 13 is connected by wire to each of the haptic presentation devices 14L, 14R, and the haptic metadata generating device 12 is also connected by wire to the control unit 13, but each of these connections may instead be wireless, using short-range wireless communication.

The haptic metadata generating device 12 includes a multiple-frame extraction unit 121, a person skeleton extraction unit 122, a person identification unit 123, a skeleton trajectory feature image generation unit 124, a moving object detection unit 125, a person action recognition unit 126, and a metadata generation unit 127.

The multiple-frame extraction unit 121 extracts, from the video input from the video output device 10, multiple frame images consisting of the current frame image and the past T frame images (T is an integer of 1 or more), and outputs them to the person skeleton extraction unit 122 and the moving object detection unit 125.
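The frame window described above can be sketched as a fixed-length buffer. This is only an illustration under stated assumptions: the class name MultiFrameBuffer and its method push are hypothetical and do not appear in the specification, and a frame may be any object (for example, an image array).

```python
from collections import deque


class MultiFrameBuffer:
    """Holds the current frame and up to T past frames (hypothetical helper)."""

    def __init__(self, T):
        if T < 1:
            raise ValueError("T must be an integer of 1 or more")
        self.T = T
        # maxlen = T + 1: the current frame plus T past frames.
        self._frames = deque(maxlen=T + 1)

    def push(self, frame):
        """Add the newest frame; return frames ordered t = 0 (current) to oldest."""
        # appendleft on a full deque discards the oldest frame at the right end.
        self._frames.appendleft(frame)
        return list(self._frames)
```

Until T + 1 frames have arrived, the returned list is shorter than T + 1; a caller could wait for a full window before running the later steps.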

For each of the multiple frame images, the person skeleton extraction unit 122 generates, based on a skeleton detection algorithm, a skeleton coordinate set P^n_b (n: index of the detected person, b: skeleton ID) for each person object Op1, Op2 (hereinafter also simply called a "person"), and outputs it to the person identification unit 123 together with the multiple frame images including the current frame image.

For each of the multiple frame images, the person identification unit 123 variably sets a search range (described in detail later) based on the skeleton coordinate set P^n_b, identifies each person by extracting the position and size of the person's skeleton and the surrounding image information, generates a skeleton coordinate set P^i_b (i: person ID, b: skeleton ID) to which a person ID has been assigned, and outputs it to the skeleton trajectory feature image generation unit 124.

Using the current frame image as the reference, the skeleton trajectory feature image generation unit 124 generates, from the skeleton coordinate sets P^i_b in the multiple frame images, a single skeleton trajectory feature image that shows only the direction of movement of each identified person skeleton, and outputs it to the person action recognition unit 126. The skeleton trajectory feature image is described in detail later; in this specification it is named STI (Skeleton Trajectory Image).

The moving object detection unit 125 is not strictly required for recognizing movement in judo as in this example. When provided, it detects moving objects from the difference images between adjacent frames of the multiple frame images, compares the moving objects detected in each difference image with the skeleton trajectory feature image (STI) obtained from the skeleton trajectory feature image generation unit 124 to select the moving objects other than persons, generates for those non-person moving objects a moving object trajectory image that links the coordinate position, size, and movement direction obtained from each difference image, and outputs it to the skeleton trajectory feature image generation unit 124. In this case, the skeleton trajectory feature image generation unit 124 draws (composites) the moving object trajectory image onto the skeleton trajectory feature image (STI) and outputs the result to the person action recognition unit 126.

The person action recognition unit 126 recognizes specific actions of a person using a CNN (convolutional neural network) that takes the skeleton trajectory feature image (STI) as input. It detects the predetermined impact presentation information used to operate the haptic presentation devices 14L, 14R, namely the identification of each person in the current frame image, the position coordinates (and, for a team sport, the team classification, which does not apply in this example because the sport is judo), and the information indicating the timing and speed at which to operate the haptic presentation devices, and outputs it to the metadata generation unit 127.

For the current frame image, the metadata generation unit 127 generates haptic metadata (for impact presentation) containing the predetermined impact presentation information obtained from the person action recognition unit 126, namely the identification of each person in the current frame image, the position coordinates (and, for a team sport, the team classification, which does not apply in this example because the sport is judo), and the information indicating the timing and speed at which to operate the haptic presentation devices, and outputs it to the control unit 13 frame by frame.
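As a rough illustration of what one per-frame metadata record and its routing to a device might look like, the sketch below uses a Python dataclass. The specification does not define a data format, so every field name, the record type ImpactMetadata, and the function route_to_device are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ImpactMetadata:
    """One per-frame, per-person record (all field names are hypothetical)."""
    frame_index: int
    person_id: int                  # identification of the person
    position: Tuple[float, float]   # position coordinates in the frame image
    trigger: bool                   # whether to operate the device on this frame
    speed: float                    # speed associated with the impact
    team: Optional[str] = None      # team classification; unused for judo


def route_to_device(meta, device_by_person):
    """Return the device associated with the record's person, or None."""
    return device_by_person.get(meta.person_id)
```

With a mapping such as {1: "14L", 2: "14R"}, a record for person 1 would be routed to the device held in the left hand, matching the assignment of Op1 to 14L and Op2 to 14R described for the control unit 13.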

The haptic metadata generation processing in the haptic metadata generating device 12 is described more specifically below, based on FIG. 2 and with reference to FIGS. 3 to 6.

(Haptic metadata generation processing)
FIG. 2 is a flowchart showing an example of processing by the haptic metadata generating device 12 according to an embodiment of the present invention. FIG. 3 is an explanatory diagram of the person skeleton extraction processing in the haptic metadata generating device 12. FIG. 4(a) is a diagram illustrating one frame image, and FIG. 4(b) is a diagram showing an example of person skeleton extraction from one frame image in the haptic metadata generating device 12. FIGS. 5(a) and 5(b) are diagrams each showing an example of processing of the search range for a person object in the person skeleton extraction processing in the haptic metadata generating device 12. FIG. 6(a) is a diagram showing an example of the skeleton trajectory feature image (STI) according to the present invention, and FIG. 6(b) is an explanatory diagram of that STI.

As shown in FIG. 2, the haptic metadata generating device 12 first uses the multiple-frame extraction unit 121 to extract, from the video input from the video output device 10, multiple frame images consisting of the current frame image and the past T frame images (T is an integer of 1 or more) (step S1).

Next, the haptic metadata generating device 12 uses the person skeleton extraction unit 122 to generate, for each of the multiple frame images and based on a skeleton detection algorithm, the skeleton coordinate set P^n_b (n: index of the detected person, b: skeleton ID) of each person object Op1, Op2 (step S2).

Recent advances in deep learning have made it possible to estimate the skeleton positions of a person from an ordinary image. Some skeleton detection algorithms, typified by OpenPose and VisionPose (NextSystem), have been released as open source. The person skeleton extraction unit 122 in this example therefore uses VisionPose to detect, as shown in FIG. 3, 30 skeleton points per person in each frame image, and generates the skeleton coordinate set P^n_b indicating their position coordinates.

In VisionPose, as shown in FIG. 3, the coordinate positions of the following points can be obtained, and the points can be drawn connected by lines as illustrated: P^n_1: "head", P^n_2: "nose", P^n_3: "left eye", P^n_4: "right eye", P^n_5: "left ear", P^n_6: "right ear", P^n_7: "neck", P^n_8: "spine (shoulder)", P^n_9: "left shoulder", P^n_10: "right shoulder", P^n_11: "left elbow", P^n_12: "right elbow", P^n_13: "left wrist", P^n_14: "right wrist", P^n_15: "left hand", P^n_16: "right hand", P^n_17: "left thumb", P^n_18: "right thumb", P^n_19: "left fingertip", P^n_20: "right fingertip", P^n_21: "spine (center)", P^n_22: "spine (base)", P^n_23: "left hip", P^n_24: "right hip", P^n_25: "left knee", P^n_26: "right knee", P^n_27: "left ankle", P^n_28: "right ankle", P^n_29: "left foot", and P^n_30: "right foot".

FIG. 4(b) shows a frame image Fa obtained by applying the VisionPose skeleton detection algorithm to the frame image F of a judo match shown in FIG. 4(a). The frame image F in FIG. 4(a) shows only the person objects Op1, Op2 (the athletes), but other moving person objects such as the referee may also appear, as may non-person moving objects in other sports (for example, the rackets and shuttlecock in badminton) and objects such as spectators (which are effectively static objects). Applying the VisionPose skeleton detection algorithm, however, extracts skeletons only for the person objects, that is, the athletes and the referee. In this example, as shown in FIG. 4(b), the skeleton coordinate sets P^1_b and P^2_b corresponding to the person objects Op1 and Op2 can be estimated and generated. As FIG. 4(b) also shows, the skeleton of each person can be estimated with relatively good accuracy even in judo. Because the skeleton detection algorithm estimates only on a per-still-image basis, the haptic metadata generating device 12 subsequently identifies the persons, draws the transition of each person's skeleton positions into a single skeleton trajectory feature image (STI), and uses a CNN to perform high-accuracy action recognition that takes the time axis into account.

Next, the haptic metadata generating device 12 uses the person identification unit 123 to variably set, for each of the multiple frame images, a search range based on the skeleton coordinate set P^n_b, to identify each person by extracting the position and size of the person's skeleton and the surrounding image information, and to generate the skeleton coordinate set P^i_b (i: person ID, b: skeleton ID) to which a person ID has been assigned (step S3).

The person skeleton extraction unit 122 described above can detect the skeletons of one or more persons in each of the multiple frame images as skeleton coordinate sets P^n_b. However, the skeleton coordinate set P^n_b of each frame image carries no information about who each skeleton belongs to, so each person's skeleton must be identified. This identification uses the image information near the coordinates of each skeleton coordinate set P^n_b in each frame image. That is, based on the skeleton coordinate set P^n_b, the person identification unit 123 identifies each person by extracting the position and size of the person's skeleton and the surrounding image information (color information, and texture information near the face or back), and generates the skeleton coordinate set P^i_b (i: person ID, b: skeleton ID) to which a person ID has been assigned.

For example, judo matches are fought in white and blue uniforms, so the athletes can be identified by referring to the color information in the frame image F near the skeleton positions of each skeleton coordinate set P^n_b. In badminton, when the court is shot lengthwise, a skeleton coordinate set P^n_b whose skeleton positions lie in the upper part of the frame image F can be identified as the far-side player, and one in the lower part as the near-side player.
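A minimal sketch of color-based identification follows. It assumes that RGB pixels have already been sampled near a skeleton's coordinates and assigns the skeleton to whichever reference color is nearest to their mean. The function names, the reference colors, and the use of squared RGB distance are illustrative assumptions, not details taken from the specification.

```python
def mean_color(pixels):
    """Average a non-empty list of (r, g, b) pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def identify_by_color(pixels, reference_colors):
    """Return the person ID whose reference color is nearest to the mean color.

    reference_colors maps a person ID to an (r, g, b) tuple (assumed values).
    """
    m = mean_color(pixels)

    def squared_distance(ref):
        return sum((a - b) ** 2 for a, b in zip(m, ref))

    return min(reference_colors, key=lambda pid: squared_distance(reference_colors[pid]))
```

In practice a color space less sensitive to lighting than raw RGB might be preferable; that choice is left open here.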

Therefore, although the skeleton detection algorithm in the person skeleton extraction unit 122 estimates only on a per-still-image basis, a person can be recognized as a moving object based on the skeleton coordinate set P^n_b.

The person skeleton extraction unit 122 described above often also detects the skeletons of persons other than the athletes, such as the referee and spectators, who are not targets of haptic presentation. Referees usually wear clothing different from the athletes', so they can be distinguished by color information. Spectators are usually farther away than the athletes, so they can be distinguished by skeleton size. By choosing surrounding image information suited to person identification (color information, and texture information near the face or back) in light of the rules of each sport and the shooting conditions, the athletes who are the targets of haptic presentation can be identified.

To cope with overlap and occlusion between persons, the person identification unit 123 of this embodiment variably sets the search ranges (the person search range R^i and the attention search range Rb^i) for each frame image. For example, once the person skeleton extraction unit 122 has extracted the skeleton coordinate sets P^n_b (not shown) for the person objects Op1, Op2 (the athletes) and the person object Op3 (the referee) shown in FIG. 5(a), the person identification unit 123 can variably set the person search range R^i and the attention search range Rb^i for each frame image. In FIG. 5(a), the person search range R^i is set for each person ID (i) and is drawn as a circumscribing rectangle that has the person's position coordinates in the image coordinates of the frame image and the person's size (width and height). The region enclosing each person's waist region (P^n_22, P^n_23, P^n_24) is drawn as the attention search range Rb^i.
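The two ranges can be sketched as circumscribing rectangles computed from one person's skeleton points. This is a simplified illustration: the function names and the (x, y, width, height) rectangle layout are assumptions, and the waist skeleton IDs 22, 23, and 24 follow the numbering given for FIG. 3.

```python
WAIST_IDS = (22, 23, 24)  # spine (base), left hip, right hip, per FIG. 3


def bounding_rect(points):
    """Circumscribing rectangle (x, y, width, height) of a list of (x, y) points."""
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def search_ranges(skeleton):
    """Return (person range R, attention range Rb) for one person.

    skeleton maps a skeleton ID to an (x, y) coordinate; missing IDs are skipped.
    """
    person_range = bounding_rect(list(skeleton.values()))
    waist_points = [skeleton[b] for b in WAIST_IDS if b in skeleton]
    return person_range, bounding_rect(waist_points)
```

If none of the waist points were detected, the attention range comes back as None, and a caller would have to fall back on the person range or on a predicted position.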

More specifically, the person identification unit 123 of this embodiment sets the search range for a person in each frame image variably by narrowing it between two bounds: at most the person search range R^i enclosing the entire person skeleton, and at minimum the attention search range Rb^i, defined as a predetermined region of the person skeleton (in this example, the region enclosing the waist region (P^n_22, P^n_23, P^n_24)). Based on the state transition estimate of the person's skeleton obtained by a state estimation algorithm, it determines the search range so as to include at least the attention search range Rb^i, and identifies the person object. This prevents misrecognition of other persons and also improves processing speed, even when each person's motion changes or the person's size relative to the frame image changes, as shown for example in FIG. 5(b). The search range is particularly effective for accurately identifying athletes in footage such as judo, where the persons to be identified overlap heavily and the background is complex.

In other words, for the skeleton coordinate sets P^n_b of the person objects Op1, Op2, Op3 (the athletes and the referee), the person identification unit 123 of this embodiment predetermines, as the attention search range Rb^i, a predetermined range that enables color identification (in this example, the color (blue, white, or brown) of the waist region (P^n_22, P^n_23, P^n_24)). When the skeleton coordinate sets P^n_b of several detected persons overlap, narrowing the search to the attention search range Rb^i allows the persons to be extracted and tracked accurately in each frame image. Because skeletons other than those to be analyzed may also be detected in the background, the skeletons of the persons to be analyzed are given a person ID (i), so that the skeleton coordinates P^i_b of the tracked persons can be distinguished.

The extent and shape of the search ranges (the person search range R^i and the attention search range Rb^i) are determined, based on the state transition estimate of the person's skeleton obtained by a state estimation algorithm such as a Kalman filter or a particle filter, so as to include at least the attention search range Rb^i (in this example, each person's waist region).
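The specification names the Kalman filter only as one possible state estimation algorithm and gives no model. As a minimal sketch under that caveat, the scalar filter below tracks a single coordinate (for example, the x coordinate of the waist center) under an assumed random-walk state model. The class name and the noise parameters q and r are illustrative choices.

```python
class ScalarKalman:
    """1-D Kalman filter with a random-walk state model (illustrative only)."""

    def __init__(self, x0, p0, q, r):
        self.x = x0  # state estimate, e.g. x coordinate of the waist center
        self.p = p0  # variance of the estimate
        self.q = q   # process noise variance (assumed)
        self.r = r   # measurement noise variance (assumed)

    def predict(self):
        """Random walk: the state is carried over and its uncertainty grows."""
        self.p += self.q
        return self.x

    def update(self, z):
        """Blend the measured coordinate z into the estimate."""
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)
        self.p *= 1.0 - k
        return self.x
```

The predicted value could serve as the center of the next frame's search range, with one filter per coordinate. A constant-velocity model or a particle filter would be natural alternatives.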

The search ranges (the person search range R^i and the attention search range Rb^i) can be narrowed while detection is stable and widened while it is unstable. For example, a search range determined from the state transition estimate of the person's skeleton is set for each person ID (i), and detection is regarded as stable if the state transition estimate is within a predetermined value of that of the immediately preceding frame, and as unstable otherwise. Alternatively, the proportion of successful detections within a time window of T frames is calculated based on the state transition estimate of the person's skeleton obtained by the state estimation algorithm, and detection is regarded as stable if that proportion is at or above a predetermined value and as unstable if it falls below it. Either criterion allows the search range to be set variably.
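The success-rate criterion can be sketched as follows. The class SearchRangeController, the scale value between 0.0 (attention range only) and 1.0 (full person range), the fixed step size, and the linear interpolation between the two rectangles are all assumptions made for illustration; the specification states only that the range narrows when detection is stable and widens when it is unstable.

```python
from collections import deque


class SearchRangeController:
    """Narrows or widens the search range from a detection success rate."""

    def __init__(self, window, min_rate, step=0.25):
        self.history = deque(maxlen=window)  # recent detection results
        self.min_rate = min_rate             # stability threshold (assumed)
        self.step = step                     # change in scale per frame (assumed)
        self.scale = 1.0                     # start from the full person range

    def update(self, detected):
        """Record one frame's result and return (stable, scale)."""
        self.history.append(bool(detected))
        rate = sum(self.history) / len(self.history)
        stable = rate >= self.min_rate
        if stable:
            self.scale = max(0.0, self.scale - self.step)
        else:
            self.scale = min(1.0, self.scale + self.step)
        return stable, self.scale


def interpolate_rect(inner, outer, scale):
    """Blend two (x, y, w, h) rectangles: scale 0 gives inner, 1 gives outer."""
    ix, iy, iw, ih = inner
    ox, oy, ow, oh = outer
    x0 = ix + (ox - ix) * scale
    y0 = iy + (oy - iy) * scale
    x1 = (ix + iw) + ((ox + ow) - (ix + iw)) * scale
    y1 = (iy + ih) + ((oy + oh) - (iy + ih)) * scale
    return (x0, y0, x1 - x0, y1 - y0)
```

Because the inner rectangle is the attention range, any scale from 0 to 1 yields a range that still contains it, provided the attention range lies inside the person range.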

Next, the haptic metadata generating device 12 uses the skeleton trajectory feature image generation unit 124 to generate, with the current frame image as the reference and from the skeleton coordinate sets P^i_b in the multiple frame images, a single skeleton trajectory feature image (STI) that shows only the direction of movement of each identified person skeleton (step S4).

For drawing the skeleton trajectory feature image (STI), the skeleton coordinate set P^i_b in an arbitrary frame image is first written as P^i_b(t). With the current frame image at t = 0, the skeleton coordinate set in the current frame image is P^i_b(0), and the skeleton coordinate set in the frame image T frames in the past is P^i_b(T). That is, taking the frame number of the current frame image as t = 0 and that of the frame T frames in the past as t = T, the skeleton trajectory feature image generation unit 124 can use the frame images F at t = 0, 1, ..., T, with the current frame image as the reference, to generate a single STI that shows only the direction of movement of each identified person skeleton. The STI can be thought of as the past optical flows linked together relative to the current frame image, so that a single image carries time-axis information.

T, the duration of the trajectory feature in the skeleton trajectory feature image (STI), can be set arbitrarily. The skeleton coordinates used to generate one STI need not be all 30 points shown in FIG. 3. Using only specific, predetermined skeleton trajectories improves processing speed.

The skeleton trajectory feature image (STI) is a single image generated by using the skeleton coordinates of each person in the frame images from the current frame image back through the past T frames, drawing a connected trajectory for each skeleton coordinate of each person, and lowering (or raising) the luminance of the drawing the further it goes into the past. Preferably, the STI is color-coded per skeleton coordinate of each person across the current frame and the past T frame images, and the movement (transition) of each skeleton coordinate is drawn with a frame-by-frame gradation in time order.

For example, when drawing the connected trajectory of each person's skeleton coordinates from the current frame back to the past T frames, the luminance b is defined as
b = 255 × (T − t) / T

Alternatively, the trajectory may be drawn so that the luminance increases the further back in time it goes, in which case
b = 255 × t / T
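The two luminance rules can be written as one small function. This is a sketch: the function name and the brighter_in_past flag are hypothetical, and integer truncation of the result is an assumption, since the specification does not say how fractional values are handled.

```python
def trajectory_luminance(t, T, brighter_in_past=False):
    """Luminance b (0 to 255) for a point t frames in the past, 0 <= t <= T."""
    if T < 1 or not 0 <= t <= T:
        raise ValueError("require T >= 1 and 0 <= t <= T")
    if brighter_in_past:
        return 255 * t // T        # b = 255 * t / T
    return 255 * (T - t) // T      # b = 255 * (T - t) / T
```

With the default rule, the current frame (t = 0) is drawn at full luminance and the oldest frame (t = T) at zero; with the flag set, the ordering is reversed.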

Here, with t = 0 as the current frame image and the past T frames as the processing target (t = 0 to T), b preferably ranges from 0 to 255 and is expressed in a color assigned to each skeleton coordinate of each person. For example, FIG. 6(a) shows an example of the skeleton trajectory feature image (STI) according to the present invention. FIG. 6(a) is shown in grayscale as used for the recognition processing, but preferably, as in the explanatory diagram of the STI in FIG. 6(b), the background is "black", the lowest luminance (or "white", the highest luminance), and the skeleton coordinates are drawn in predetermined distinguishable colors that are the same for both person objects Op1, Op2: for example, "blue" for the "head" (P^1_1), (P^2_1) and "red" for the "left fingertip" (P^1_19), (P^2_19). In this embodiment, as shown in FIG. 6(b), no color coding is applied to distinguish the person objects Op1, Op2 from each other, but they may also be color-coded. For example, color-coding up to 30 skeleton coordinates for each of two persons requires defining 60 colors. In the STI according to the present invention, the color of each skeleton coordinate of each person is kept fixed, and only the luminance is drawn darker (or brighter) the further back in time the drawing goes from the current frame image.

従って、骨格軌跡特徴画像生成部１２４は、骨格軌跡特徴画像（ＳＴＩ）として、当該複数フレーム画像における各人物の骨格座標について、各人物に対し共通又は区別して、各人物の骨格座標ごとに色分けし、各人物の骨格座標ごとの動き（遷移）をフレーム単位で時系列に階調するよう描画したものとする。 The skeletal trajectory feature image generating unit 124 therefore generates a skeletal trajectory feature image (STI) by color-coding the skeletal coordinates of each person in the multiple frame images, either common or distinct for each person, and drawing the movement (transition) of each person's skeletal coordinates in chronological gradation on a frame-by-frame basis.
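The drawing rule described above can be illustrated with a minimal Python sketch. It is not the implementation of the skeletal trajectory feature image generating unit 124: the function names `brightness` and `draw_sti`, the 64 × 64 canvas, the dictionary-based input layout, and the linear interpolation of brightness along each segment are all assumptions made for illustration. Only the two brightness formulas, the black background, and the fixed color per skeleton coordinate come from the description.

```python
import numpy as np

def brightness(t, T, brighter_in_past=False):
    """Brightness for a point t frames in the past (t = 0 is the current frame)."""
    return 255.0 * t / T if brighter_in_past else 255.0 * (T - t) / T

def draw_sti(tracks, colors, T, size=(64, 64), brighter_in_past=False):
    """Render one trajectory image (hypothetical helper, not from the patent).

    tracks: dict joint_id -> list of (x, y); index t is "t frames before the
            current frame", so index 0 is the current frame.
    colors: dict joint_id -> (r, g, b) with components in [0, 1], fixed per joint.
    """
    h, w = size
    img = np.zeros((h, w, 3), dtype=np.uint8)  # black background
    for joint, pts in tracks.items():
        base = np.asarray(colors[joint], dtype=float)
        # Draw the oldest segments first so that newer ones overwrite them.
        for t in range(min(T, len(pts) - 1), 0, -1):
            (x0, y0), (x1, y1) = pts[t], pts[t - 1]
            n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
            xs = np.rint(np.linspace(x0, x1, n)).astype(int)
            ys = np.rint(np.linspace(y0, y1, n)).astype(int)
            # Assumption: brightness is interpolated between the two frame ages.
            b = brightness(np.linspace(t, t - 1, n), T, brighter_in_past)
            vals = (base[None, :] * b[:, None]).astype(np.uint8)
            ok = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
            img[ys[ok], xs[ok]] = vals[ok]
    return img
```

With a "head" track drawn in blue over T = 2 past frames, the current position receives full brightness and the oldest position fades to the background, so the direction of motion is encoded in a single image.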

また、骨格軌跡特徴画像生成部１２４は、動オブジェクト検出部１２５の機能により、球技の場合はボールなど、人物骨格以外の軌跡を併せて骨格軌跡特徴画像（ＳＴＩ）上に描画することができる。この場合、ボールの移動方向などが特徴量に付加されるため、動作認識の判定精度が向上する。 The skeletal trajectory feature image generation unit 124 can also use the function of the moving object detection unit 125 to draw trajectories other than the human skeleton, such as the ball in the case of a ball game, on the skeletal trajectory feature image (STI). In this case, the direction of the ball's movement is added to the feature amount, improving the accuracy of the action recognition judgment.

即ち、触覚メタデータ生成装置１２は、動オブジェクト検出部１２５により、当該複数フレーム画像の各々を用いて隣接フレーム間の差分画像を基に動オブジェクトを検出し、各差分画像から検出した動オブジェクトのうち骨格軌跡特徴画像生成部１２４から得られる骨格軌跡特徴画像（ＳＴＩ）と対比して人物以外の動オブジェクトを選定し、その人物以外の動オブジェクトについて、各差分画像から得られる座標位置、大きさ、移動方向を要素とし連結した動オブジェクト軌跡画像を生成し、骨格軌跡特徴画像生成部１２４に出力する。この場合、骨格軌跡特徴画像生成部１２４は、骨格軌跡特徴画像（ＳＴＩ）上に、動オブジェクト軌跡画像を追加して描画（合成）したものを人物動作認識部１２６に出力する（ステップＳ５）。 That is, the haptic metadata generating device 12 uses each of the multiple frame images to detect moving objects based on the difference images between adjacent frames using the moving object detection unit 125, selects moving objects other than people from among the moving objects detected from each difference image by comparing them with the skeletal trajectory feature image (STI) obtained from the skeletal trajectory feature image generating unit 124, generates a moving object trajectory image for the moving object other than people by connecting the coordinate position, size, and movement direction obtained from each difference image as elements, and outputs the generated moving object trajectory image to the skeletal trajectory feature image generating unit 124. In this case, the skeletal trajectory feature image generating unit 124 adds the moving object trajectory image to the skeletal trajectory feature image (STI) and draws (combines) it, and outputs the result to the human action recognition unit 126 (step S5).

即ち、動オブジェクト検出部１２５は、競技に関わる人物以外の動オブジェクトが存在しない、柔道競技のような場合では必要とされないが（処理として設けていても弊害が無い。）、競技に関わる人物以外の動オブジェクトが存在する場合（例えばバドミントン競技のシャトルやラケット、卓球やテニス競技のボールやラケット等）、その人物以外の動オブジェクトの動きの軌跡を検出し、動オブジェクト軌跡画像として生成し、骨格軌跡特徴画像生成部１２４に対して、骨格軌跡特徴画像（ＳＴＩ）上に、動オブジェクト軌跡画像を追加して描画（合成）させる。これにより、例えば競技に関わる人物以外の動オブジェクトが存在する場合、競技に関わる人物の動きに関わる情報が増えるため、後段の人物動作認識部１２６における人物動作の認識精度が向上する。このため、動オブジェクト検出部１２５を設けておくことで、任意の競技に対して同処理で対応できるため、汎用性のある触覚メタデータ生成装置１２を構成できる。 That is, the moving object detection unit 125 is not required in cases such as judo where there are no moving objects other than people involved in the competition (there is no problem even if it is provided as a process). However, when there are moving objects other than people involved in the competition (for example, shuttlecocks and rackets in badminton, balls and rackets in table tennis and tennis), the moving object detection unit 125 detects the trajectory of the movement of the moving object other than people, generates a moving object trajectory image, and causes the skeletal trajectory feature image generation unit 124 to add and draw (combine) the moving object trajectory image on the skeletal trajectory feature image (STI). As a result, when there are moving objects other than people involved in the competition, for example, information related to the movement of people involved in the competition increases, improving the recognition accuracy of human movements in the subsequent human movement recognition unit 126. Therefore, by providing the moving object detection unit 125, the same process can be used for any competition, making it possible to configure a versatile haptic metadata generation device 12.
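As a rough illustration of the moving object detection described above, the following Python sketch detects motion from adjacent-frame differences, discards motion that overlaps the skeleton trajectories, and links the remaining detections into a list of position, size, and movement direction elements. The function names, the fixed threshold of 30, the use of a boolean mask derived from the STI, and the single-object centroid are assumptions for illustration; the description does not specify how the moving object detection unit 125 performs these steps.

```python
import numpy as np

def detect_non_person_motion(prev, curr, sti_mask, thresh=30):
    """Return ((x, y) centroid, pixel count) of motion outside the skeleton mask, or None.

    prev, curr: 2-D uint8 grayscale frames.
    sti_mask:   2-D boolean array, True where skeleton trajectories were drawn.
    """
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > thresh
    diff &= ~sti_mask  # drop motion already explained by person skeletons
    ys, xs = np.nonzero(diff)
    if xs.size == 0:
        return None
    return (float(xs.mean()), float(ys.mean())), int(xs.size)

def object_trajectory(frames, sti_mask, thresh=30):
    """Link per-difference detections into position / size / direction elements."""
    track, last = [], None
    for prev, curr in zip(frames, frames[1:]):
        hit = detect_non_person_motion(prev, curr, sti_mask, thresh)
        if hit is None:
            continue
        pos, size = hit
        direction = None if last is None else (pos[0] - last[0], pos[1] - last[1])
        track.append({"pos": pos, "size": size, "direction": direction})
        last = pos
    return track
```

For a sport such as judo, where no non-person object moves, the sketch simply returns an empty trajectory, which matches the remark that providing the process does no harm.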

続いて、触覚メタデータ生成装置１２は、人物動作認識部１２６により、骨格軌跡特徴画像（ＳＴＩ）を入力とするＣＮＮ（畳み込みニューラルネットワーク）により、人物の特定動作を認識し、触覚提示デバイス１４Ｌ，１４Ｒを作動させる所定の衝撃提示用の情報を検出する（ステップＳ６）。衝撃提示用の情報には、現フレーム画像内の各人物の識別、位置座標（及び、本例では柔道競技としているため対象外となるが、チーム競技であればそのチーム分類）、並びに、触覚提示デバイスを作動させるタイミング及び速さを示す情報が含まれる。 Next, the haptic metadata generating device 12 uses the human action recognition unit 126 to recognize specific actions of the person using a convolutional neural network (CNN) that receives the skeletal trajectory feature image (STI) as input, and detects information for presenting a predetermined impact that activates the haptic presentation devices 14L and 14R (step S6). The information for presenting an impact includes the identification of each person in the current frame image, their position coordinates (and, although this is not included in this example because it is a judo competition, if it is a team competition, their team classification), and information indicating the timing and speed at which to activate the haptic presentation device.

尚、ＣＮＮによる機械学習時には、事前に学習用の骨格軌跡特徴画像（ＳＴＩ）を作成して学習させておく。このように、人物動作認識部１２６における認識処理には、深層学習の一つであるＣＮＮ（畳み込みニューラルネットワーク）を用いる。ＣＮＮは、何段もの深い層を持つニューラルネットワークであり、特に画像認識の分野で優れた性能を発揮しているネットワークである。このネットワークは「畳み込み層」や「プーリング層」などの幾つかの特徴的な機能を持った層を積み上げることで構成され、現在幅広い分野で利用されている。「畳み込み層」の処理により高い精度を、「プーリング層」の処理により撮影画角に依存しない汎用性を実現している。 When performing machine learning using CNN, a skeletal trajectory feature image (STI) for learning is created in advance and trained. In this way, the recognition process in the human action recognition unit 126 uses CNN (convolutional neural network), which is a type of deep learning. CNN is a neural network with many deep layers, and is a network that has demonstrated excellent performance particularly in the field of image recognition. This network is constructed by stacking layers with several characteristic functions, such as "convolutional layers" and "pooling layers," and is currently used in a wide range of fields. High accuracy is achieved through processing in the "convolutional layers," while versatility that is not dependent on the shooting angle is achieved through processing in the "pooling layer."

このＣＮＮを用いて骨格軌跡特徴画像（ＳＴＩ）を解析することで、「組み合い」や「投げ」、「寝技」などの動作イベントを、選手の撮影サイズや位置に依存せずに高い精度で識別することが可能となり、これらの情報を基に触覚デバイス１４Ｌ，１４Ｒを制御するための触覚メタデータを生成することで、スポーツ映像のリアルタイム視聴時でも触覚刺激を提示することが可能となる。 By using this CNN to analyze skeletal trajectory feature images (STI), it becomes possible to identify movement events such as "grappling," "throwing," and "ground fighting" with high accuracy, regardless of the size or position of the athletes in the footage. By generating haptic metadata for controlling haptic devices 14L and 14R based on this information, it becomes possible to present haptic stimuli even when watching sports footage in real time.

最終的に、触覚メタデータ生成装置１２は、メタデータ生成部１２７により、現フレーム画像に対応して、人物動作認識部１２６から得られる所定の衝撃提示用の情報、即ち現フレーム画像内の各人物の識別、位置座標（及び、本例では柔道競技としているため対象外となるが、チーム競技であればそのチーム分類）、並びに、触覚提示デバイスを作動させるタイミング及び速さを示す衝撃提示用の情報を含む触覚メタデータ（衝撃提示用）を生成し、フレーム単位で制御ユニット１３に出力する（ステップＳ７）。 Finally, the haptic metadata generating device 12 generates haptic metadata (for impact presentation) by the metadata generating unit 127, which includes information for presenting a specific impact obtained from the human action recognizing unit 126 in correspondence with the current frame image, i.e., the identification and position coordinates of each person in the current frame image (and team classification in a team sport, which is not included in this example because it is a judo competition), and information for presenting an impact indicating the timing and speed at which to activate the haptic presentation device, and outputs this to the control unit 13 on a frame-by-frame basis (step S7).
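One possible shape for the per-frame haptic metadata is sketched below in Python. The description lists the items the metadata contains (person identification, position coordinates, optional team classification, and the timing and speed at which to activate the haptic presentation device) but gives no concrete format, so the field names, the boolean `trigger` flag standing in for the timing, and the JSON encoding are all assumptions.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass
class PersonImpact:
    """Impact presentation information for one person (hypothetical layout)."""
    person_id: int                  # identification of the person
    position: Tuple[float, float]   # position coordinates in the current frame
    team: Optional[str]             # team classification; None for judo
    trigger: bool                   # whether to activate the device at this frame
    speed: float                    # speed associated with the recognized action

def frame_metadata(frame_index, impacts):
    """Serialize the metadata for one frame as a JSON string."""
    return json.dumps({"frame": frame_index,
                       "persons": [asdict(p) for p in impacts]})
```

Emitting one such record per frame mirrors the frame-by-frame output to the control unit 13 in step S7.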

（実験検証）
本発明に係る触覚メタデータ生成装置１２の有効性を示すため、評価実験を行った。
図７（ａ）は１フレーム画像例を模擬的に示した図であり、図７（ｂ）は従来技術のボーン画像例、図７（ｃ）は従来技術のＳｋｌＭＨＩ（Skeleton Motion History Image）の画像例、図７（ｄ）は本発明に係る骨格軌跡特徴画像（ＳＴＩ）の画像例を示す図である。また、図８は、従来技術のボーン画像、従来技術のＳｋｌＭＨＩ、及び本発明に係る骨格軌跡特徴画像（ＳＴＩ）の人物動きの検出精度の比較評価を示す図である。 (Experimental Verification)
In order to demonstrate the effectiveness of the haptic metadata generating device 12 according to the present invention, an evaluation experiment was carried out.
Fig. 7(a) schematically shows an example of a single frame image, Fig. 7(b) shows an example of a prior-art bone image, Fig. 7(c) shows an example of a prior-art Skeleton Motion History Image (Skl MHI), and Fig. 7(d) shows an example of a skeleton trajectory feature image (STI) according to the present invention. Fig. 8 shows a comparative evaluation of the human motion detection accuracy of the prior-art bone image, the prior-art Skl MHI, and the skeleton trajectory feature image (STI) according to the present invention.

まず、比較評価する前に、柔道の試合映像（図７（ａ）参照）から、従来技術のボーン画像（図７（ｂ）参照）、従来技術のＳｋｌＭＨＩ（図７（ｃ）参照）、及び本発明に係る骨格軌跡特徴画像（ＳＴＩ）（図７（ｄ）参照）について、正例、負例それぞれ約２，０００枚の画像を作成して、それぞれＣＮＮによる事前学習を行った。 First, prior to the comparative evaluation, approximately 2,000 positive and negative example images were created from a judo match video (see FIG. 7(a)) for the bone images of the conventional technology (see FIG. 7(b)), the Skl MHI of the conventional technology (see FIG. 7(c)), and the skeleton trajectory feature image (STI) of the present invention (see FIG. 7(d)), and pre-trained using a CNN for each.

そして、別の試合映像で識別した結果を図８に示している。図８では、「立ち合い」、「投げ」、「寝技」、「待て」の４つの試合状況（シーン分類）の識別結果と、「投げ」動作の検出結果の比較として、適合率、再現率、及びこれらの統合的指標であるＦ値（F-Measure）の値を示した。４つの試合状況（シーン分類）の状態の識別判定、及び「投げ」の検出精度のいずれの場合においても、本発明に係る骨格軌跡特徴画像（ＳＴＩ）を用いて学習した場合が最もよい結果が得られた。従って、従来技術のボーン画像や、従来技術のＳｋｌＭＨＩを用いた動作認識よりも、本発明に係る骨格軌跡特徴画像（ＳＴＩ）を用いる触覚メタデータ生成装置１２の有効性を確認できた。尚、ＳｋｌＭＨＩについても骨格座標ごとに色分けを行って評価したが、それでも本発明に係る骨格軌跡特徴画像（ＳＴＩ）を用いた方が動作認識の精度として向上する理由として、ＳｋｌＭＨＩ（ボーン画像も同様）では、各骨格を結ぶ接続線が動作認識に悪影響を及ぼしていると考えられる。 FIG. 8 shows the results of classification on a different match video. FIG. 8 compares the classification results for four match situations (scene classes), namely "standing" (tachiai), "throw" (nage), "ground technique" (ne-waza), and "wait" (matte), and the detection results for the "throw" action, in terms of precision, recall, and the F-measure, which is a combined index of the two. For both the classification of the four match situations and the detection of "throw", the best results were obtained when training with the skeleton trajectory feature image (STI) according to the present invention. This confirmed that the haptic metadata generating device 12 using the STI of the present invention is more effective than action recognition using the prior-art bone image or the prior-art Skl MHI. Note that the Skl MHI was also evaluated with each skeleton coordinate color-coded, yet the STI of the present invention still gave higher recognition accuracy; the likely reason is that in the Skl MHI (as in the bone image) the lines connecting the skeleton points adversely affect action recognition.
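For reference, the three indices reported in FIG. 8 (適合率, 再現率, Ｆ値) are conventionally computed from true positive, false positive, and false negative counts as in the Python sketch below. The function name is an assumption, and the counts in any usage are hypothetical; they are not the experimental values, which appear only in FIG. 8.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure (harmonic mean of the two)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Because the F-measure is a harmonic mean, a method scores well only when both precision and recall are high, which is why it serves as the combined index in the comparison.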

（制御ユニット）
図９は、本発明による一実施形態の映像触覚連動システム１における制御ユニット１３の概略構成を示すブロック図である。制御ユニット１３は、メタデータ受信部１３１、解析部１３２、記憶部１３３、及び駆動部１３４‐１，１３４‐２を備える。 (Controller unit)
FIG. 9 is a block diagram showing a schematic configuration of the control unit 13 in the video-haptic linking system 1 according to an embodiment of the present invention. The control unit 13 includes a metadata receiving unit 131, an analysis unit 132, a memory unit 133, and drive units 134-1 and 134-2.

メタデータ受信部１３１は、触覚メタデータ生成装置１２から触覚メタデータ（衝撃提示用）を入力し、解析部１３２に出力する機能部である。触覚メタデータは、現フレーム画像内の各人物の識別、位置座標、（及びチーム競技であればそのチーム分類）、並びに、触覚提示デバイスを作動させるタイミング及び速さを示す情報を含む。 The metadata receiving unit 131 is a functional unit that inputs haptic metadata (for impact presentation) from the haptic metadata generating device 12 and outputs it to the analyzing unit 132. The haptic metadata includes the identification and position coordinates of each person in the current frame image (and the team classification in the case of a team sport), as well as information indicating the timing and speed at which the haptic presentation device is activated.

解析部１３２は、触覚メタデータ生成装置１２から得られる触覚メタデータを基に、予め定めた駆動基準データを参照し、駆動部１３４‐１，１３４‐２を介して、対応する各触覚提示デバイス１４Ｌ，１４Ｒの振動アクチュエーター１４２を駆動するよう制御する機能部である。例えば、解析部１３２は、触覚メタデータにおける人物の識別、位置座標、（及びチーム分類）、並びに、触覚提示デバイスを作動させるタイミング及び速さから、予め定めた駆動基準データを参照して、触覚提示デバイス１４Ｌの振動アクチュエーター１４２の作動タイミング、強さ、及び動作時間を決定して駆動制御する。 The analysis unit 132 is a functional unit that refers to predetermined drive reference data based on the haptic metadata obtained from the haptic metadata generation device 12, and controls the driving of the vibration actuators 142 of the corresponding haptic presentation devices 14L, 14R via the drive units 134-1, 134-2. For example, the analysis unit 132 refers to predetermined drive reference data based on the person's identification, position coordinates (and team classification) in the haptic metadata, and the timing and speed at which the haptic presentation device is operated, and determines the operation timing, strength, and operation time of the vibration actuator 142 of the haptic presentation device 14L, and controls the driving.

記憶部１３３は、触覚メタデータに基づいた駆動部１３４‐１，１３４‐２の駆動を制御するための予め定めた駆動基準データを記憶している。駆動基準データは、触覚メタデータに対応付けられた触覚刺激としての振動アクチュエーター１４２の作動タイミング、強さ、及び動作時間について、予め定めたテーブル又は関数で表されている。また、記憶部１３３は、制御ユニット１３の機能を実現するためのプログラムを記憶している。即ち、制御ユニット１３を構成するコンピュータにより当該プログラムを読み出して実行することで、制御ユニット１３の機能を実現する。 The memory unit 133 stores predetermined drive reference data for controlling the drive of the drive units 134-1, 134-2 based on the haptic metadata. The drive reference data is expressed as a predetermined table or function for the operation timing, strength, and operation time of the vibration actuator 142 as a haptic stimulus associated with the haptic metadata. The memory unit 133 also stores a program for realizing the functions of the control unit 13. In other words, the functions of the control unit 13 are realized by reading and executing the program by the computer constituting the control unit 13.

駆動部１３４‐１，１３４‐２は、各触覚提示デバイス１４Ｌ，１４Ｒの振動アクチュエーター１４２を駆動するドライバである。 The driving units 134-1 and 134-2 are drivers that drive the vibration actuators 142 of each tactile presentation device 14L and 14R.
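The table-based form of the drive reference data can be sketched in Python as follows. The description states only that the drive reference data is a predetermined table or function giving the operation timing, strength, and operation time of the vibration actuator 142; the speed thresholds, strengths, durations, class name, and function name below are hypothetical values chosen for illustration. The assignment of person object Op1 to device 14L and Op2 to device 14R follows the embodiment.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DriveCommand:
    device: str        # "14L" or "14R"
    delay_ms: int      # operation timing relative to the frame
    strength: float    # normalized actuator strength, 0.0 to 1.0
    duration_ms: int   # operation time

# Person ID -> haptic presentation device (Op1 -> 14L, Op2 -> 14R).
DEVICE_FOR_PERSON = {1: "14L", 2: "14R"}

# Hypothetical drive reference table: (minimum speed, strength, duration in ms).
DRIVE_REFERENCE = [
    (0.0, 0.3, 50),
    (1.0, 0.6, 100),
    (2.0, 1.0, 200),
]

def analyze(person_id: int, trigger: bool, speed: float) -> Optional[DriveCommand]:
    """Look up the actuator command for one person's metadata entry."""
    if not trigger or person_id not in DEVICE_FOR_PERSON:
        return None
    _, strength, duration = DRIVE_REFERENCE[0]
    for min_speed, s, d in DRIVE_REFERENCE:
        if speed >= min_speed:  # keep the last row whose threshold is met
            strength, duration = s, d
    return DriveCommand(DEVICE_FOR_PERSON[person_id], 0, strength, duration)
```

Keeping the mapping in a table rather than in code means the stimulus can be retuned per sport or per viewer by replacing the stored data, without changing the analysis logic.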

このように、本実施形態の触覚メタデータ生成装置１２を備える映像触覚連動システム１によれば、映像から人物オブジェクトを自動抽出し、動的な人物オブジェクトに対応する触覚メタデータを同期して自動生成することができるので、触覚提示デバイスと映像を連動させることができるようになる。 In this way, the video-haptic linking system 1 equipped with the haptic metadata generating device 12 of this embodiment can automatically extract person objects from video and automatically generate haptic metadata corresponding to dynamic person objects in synchronization, thereby enabling the haptic presentation device to be linked with the video.

特に、本実施形態の触覚メタデータ生成装置１２は、カメラの映像を入力とし、その映像を解析して前述のような触覚メタデータを出力するため、まず映像を解析して映像中の人物骨格を検出し、検出した骨格情報を用いて解析対象とする人物を特定し、その後、対象人物の骨格位置の履歴から骨格軌跡特徴画像（ＳＴＩ）を描画生成する。この骨格軌跡特徴画像（ＳＴＩ）上で、ボールなど人物以外の動オブジェクトの軌跡画像を合成してもよい。 In particular, the haptic metadata generating device 12 of this embodiment takes camera footage as input, analyzes the footage, and outputs the haptic metadata described above. In order to do this, the device first analyzes the footage to detect human skeletons in the footage, identifies the person to be analyzed using the detected skeleton information, and then draws and generates a skeletal trajectory feature image (STI) from the history of the skeletal position of the target person. A trajectory image of a moving object other than a person, such as a ball, may be synthesized on this skeletal trajectory feature image (STI).

そして、本実施形態の触覚メタデータ生成装置１２は、この骨格軌跡特徴画像（ＳＴＩ）を入力とするＣＮＮにより、選手の動作イベント（「投げ」、「組み合い」等）、及びその動作の状況を認識し、対応する人物動作の触覚メタデータ（衝撃提示用）を生成する。そして、制御ユニット１３は、本実施形態の触覚メタデータ生成装置１２から得られる触覚メタデータ（衝撃提示用）を基に、映像内の人物オブジェクトＯｐ１（選手）の動きに対応した振動刺激は触覚提示デバイス１４Ｌで、人物オブジェクトＯｐ２（選手）の動きに対応した振動刺激は触覚提示デバイス１４Ｒで提示するように分類して制御する。 The haptic metadata generating device 12 of this embodiment uses a CNN that uses this skeletal trajectory feature image (STI) as input to recognize the player's action events (such as "throws" and "grappling") and the context of those actions, and generates haptic metadata (for impact presentation) for the corresponding human action. Based on the haptic metadata (for impact presentation) obtained from the haptic metadata generating device 12 of this embodiment, the control unit 13 classifies and controls so that vibration stimuli corresponding to the movement of the person object Op1 (player) in the video are presented by the haptic presentation device 14L, and vibration stimuli corresponding to the movement of the person object Op2 (player) are presented by the haptic presentation device 14R.

従って、本実施形態の映像触覚連動システム１は、「投げ」のような動作イベント以外にも、選手の押し引きなどの状況を連続的に伝えることが可能となり、障害者にも試合の緊迫感を伝えることができ、また臨場感を高めることができる。 Therefore, the video-haptic linkage system 1 of this embodiment can continuously convey situations such as the pushing and pulling of players in addition to motion events such as "throwing," allowing the sense of tension of the match to be conveyed to people with disabilities and enhancing the sense of realism.

ところで、従来のMotion History Image（ＭＨＩ）と呼ばれる画像を解析することで、“腕を広げる”、“しゃがむ”、“手を上げる”など人物の基本的な動きを判定することが可能になるが、人物の関節の各部位を計測しているわけではないため、全身を使った大きな動作の認識に限られる。一方、本発明に係る骨格軌跡特徴画像（ＳＴＩ）は、このＭＨＩの改善版ともいえる画像特徴量を示す画像であり、各人物の骨格の軌跡、もしくはこれに加えて追跡対象となる人物以外の動オブジェクトの軌跡情報を描画したものとすることで、背景に含まれるノイズの影響を抑えた高精度な認識が可能となる。また、各人物の骨格座標の推移を利用して画像を作成しているため、全身運動のみならず、手や足の部分的な動作の認識も、高い精度で行うことができる。 By analyzing a conventional image called a Motion History Image (MHI), it is possible to determine basic human movements such as "spreading arms," "squatting," and "raising hands." However, since each part of a person's joints is not measured, it is limited to recognizing large movements using the whole body. On the other hand, the skeletal trajectory feature image (STI) of the present invention is an image that shows image features that can be said to be an improved version of this MHI, and by drawing the trajectory of each person's skeleton, or in addition, the trajectory information of a moving object other than the person to be tracked, it is possible to perform highly accurate recognition with reduced influence of noise contained in the background. In addition, since the image is created using the transition of each person's skeletal coordinates, it is possible to recognize not only whole-body movements but also partial movements of the hands and feet with high accuracy.

特に、骨格検出アルゴリズムは静止画単位での姿勢推定に留まるが、本発明に係る骨格軌跡特徴画像（ＳＴＩ）は、各骨格位置の推移を軌跡特徴として扱い、この軌跡特徴量を１枚の画像で表現することにより、ＣＮＮによる動作の識別を可能としている。つまり、ＣＮＮでは困難であった時間軸方向の特徴を、本発明に係る骨格軌跡特徴画像（ＳＴＩ）を入力として用いることで高精度な人物動きの動作認識を可能としている。 In particular, while skeleton detection algorithms are limited to pose estimation on individual still images, the skeleton trajectory feature image (STI) of the present invention treats the transition of each skeleton position as a trajectory feature and expresses this feature in a single image, which makes it possible to identify actions with a CNN. In other words, features along the time axis, which are difficult for a CNN to handle directly, are captured by using the STI of the present invention as the input, enabling highly accurate recognition of human actions.

尚、上述した一実施形態の触覚メタデータ生成装置１２をコンピュータとして機能させることができ、当該コンピュータに、本発明に係る各構成要素を実現させるためのプログラムは、当該コンピュータの内部又は外部に備えられるメモリに記憶される。コンピュータに備えられる中央演算処理装置（ＣＰＵ）などの制御で、各構成要素の機能を実現するための処理内容が記述されたプログラムを、適宜、メモリから読み込んで、本実施形態の触覚メタデータ生成装置１２の各構成要素の機能をコンピュータに実現させることができる。ここで、各構成要素の機能をハードウェアの一部で実現してもよい。 The haptic metadata generation device 12 of the above-described embodiment can be made to function as a computer, and a program for causing the computer to realize each of the components of the present invention is stored in a memory provided inside or outside the computer. Under the control of a central processing unit (CPU) or the like provided in the computer, a program describing the processing contents for realizing the functions of each component can be read from the memory as appropriate, and the computer can be made to realize the functions of each component of the haptic metadata generation device 12 of this embodiment. Here, the functions of each component may be realized by part of the hardware.

以上、特定の実施形態の例を挙げて本発明を説明したが、本発明は前述の実施形態の例に限定されるものではなく、その技術思想を逸脱しない範囲で種々変形可能である。例えば、上述した実施形態の例では、主として柔道競技の映像解析を例に説明したが、バドミントンや卓球、その他の様々なスポーツ種目、及びスポーツ以外の映像にも広く応用可能である。例えば、触覚情報を用いたパブリックビューイング、エンターテインメント、将来の触覚放送などのサービス性の向上に繋がる。また、スポーツ以外の例として、工場での触覚アラームへの応用や、監視カメラ映像解析に基づいたセキュリティシステムなど、様々な用途に応用することも可能である。従って、本発明は、前述の実施形態の例に限定されるものではなく、特許請求の範囲の記載によってのみ制限される。 Although the present invention has been described above by giving examples of specific embodiments, the present invention is not limited to the above-mentioned embodiments and can be modified in various ways without departing from the technical concept thereof. For example, the above-mentioned embodiments have been described mainly by taking the video analysis of judo as an example, but the present invention can be widely applied to videos of badminton, table tennis, and various other sports, as well as non-sports. For example, the present invention can lead to improved services such as public viewing using tactile information, entertainment, and future tactile broadcasting. In addition, the present invention can be applied to various uses such as tactile alarms in factories and security systems based on surveillance camera video analysis. Therefore, the present invention is not limited to the above-mentioned embodiments and is limited only by the claims.

本発明によれば、映像から人物オブジェクトを自動抽出し、動的な人物オブジェクトに対応する触覚メタデータを同期して自動生成することができるので、触覚提示デバイスと映像を連動させる用途に有用である。 The present invention can automatically extract human objects from video and automatically generate haptic metadata corresponding to dynamic human objects in synchronization, making it useful for applications that link haptic presentation devices with video.

１映像触覚連動システム
１０映像出力装置
１１ディスプレイ
１２触覚メタデータ生成装置
１３制御ユニット
１４Ｌ，１４Ｒ触覚提示デバイス
１２１複数フレーム抽出部
１２２人物骨格抽出部
１２３人物識別部
１２４骨格軌跡特徴画像生成部
１２５動オブジェクト検出部
１２６人物動作認識部
１２７メタデータ生成部
１３１メタデータ受信部
１３２解析部
１３３記憶部
１３４‐１，１３４‐２駆動部
１４１ケース
１４２振動アクチュエーター
REFERENCE SIGNS LIST
1 Video-haptic interlocking system
10 Video output device
11 Display
12 Haptic metadata generating device
13 Control unit
14L, 14R Haptic presentation device
121 Multiple frame extraction unit
122 Human skeleton extraction unit
123 Human identification unit
124 Skeleton trajectory feature image generation unit
125 Moving object detection unit
126 Human action recognition unit
127 Metadata generation unit
131 Metadata reception unit
132 Analysis unit
133 Memory unit
134-1, 134-2 Drive unit
141 Case
142 Vibration actuator

Claims

映像から人物オブジェクトを抽出し、動的な人物オブジェクトに対応する触覚メタデータを生成する触覚メタデータ生成装置であって、
入力された映像について、現フレーム画像と所定数の過去のフレーム画像を含む複数フレーム画像を抽出する複数フレーム抽出手段と、
当該複数フレーム画像の各々について、骨格検出アルゴリズムに基づき、各人物オブジェクトの第１の骨格座標集合を生成する人物骨格抽出手段と、
当該複数フレーム画像の各々について、前記第１の骨格座標集合を基に探索範囲を可変設定し、各人物オブジェクトの骨格の位置及びサイズと、その周辺画像情報を抽出することにより人物オブジェクトを識別し、人物ＩＤを付与した第２の骨格座標集合を生成する人物識別手段と、
前記現フレーム画像を基準に、当該複数フレーム画像の各々における前記第２の骨格座標集合を基に、識別した人物骨格毎の動きの方向のみを示す１枚の骨格軌跡特徴画像を生成する骨格軌跡特徴画像生成手段と、
前記骨格軌跡特徴画像を入力とする畳み込みニューラルネットワークにより、人物の特定動作を認識し、所定の触覚提示デバイスを作動させる衝撃提示用の情報を検出する人物動作認識手段と、
前記現フレーム画像に対応して、当該衝撃提示用の情報を含む触覚メタデータを生成し、フレーム単位で外部出力するメタデータ生成手段と、
を備えることを特徴とする触覚メタデータ生成装置。 A haptic metadata generation device that extracts a person object from a video and generates haptic metadata corresponding to a dynamic person object, comprising:
a multiple frame extraction means for extracting multiple frame images including a current frame image and a predetermined number of past frame images from an input video;
a human skeleton extracting means for generating a first set of skeleton coordinates of each human object for each of the plurality of frame images based on a skeleton detection algorithm;
a person identification means for variably setting a search range based on the first set of skeleton coordinates for each of the plurality of frame images, identifying a person object by extracting a position and a size of the skeleton of each person object and its surrounding image information, and generating a second set of skeleton coordinates to which a person ID is assigned;
a skeleton trajectory feature image generating means for generating a skeleton trajectory feature image indicating only a direction of movement of each identified human skeleton based on the second skeleton coordinate set in each of the plurality of frame images, with the current frame image as a reference;
a human action recognition means for recognizing a specific action of a person by a convolutional neural network using the skeletal trajectory feature image as an input and detecting information for impact presentation that activates a predetermined tactile presentation device;
a metadata generating means for generating haptic metadata including the information for impact presentation in correspondence with the current frame image, and outputting the generated haptic metadata externally on a frame-by-frame basis.

前記骨格軌跡特徴画像生成手段は、前記骨格軌跡特徴画像として、当該複数フレーム画像における各人物の骨格座標ごとに連結した軌跡を描画し、且つこの描画の際に、過去に向かうほど輝度を下げるか、又は上げて描画して生成した１枚の画像とすることを特徴とする、請求項１に記載の触覚メタデータ生成装置。 The haptic metadata generating device according to claim 1, characterized in that the skeletal trajectory feature image generating means draws, as the skeletal trajectory feature image, a trajectory that connects the skeletal coordinates of each person in the multiple frame images, and when drawing, lowers or raises the brightness toward the past to generate a single image.

前記骨格軌跡特徴画像生成手段は、前記骨格軌跡特徴画像として、当該複数フレーム画像における各人物の骨格座標について、各人物に対し共通又は区別して、各人物の骨格座標ごとに色分けし、各人物の骨格座標ごとの動きをフレーム単位で時系列に階調するよう描画して生成した１枚の画像とすることを特徴とする、請求項１又は２に記載の触覚メタデータ生成装置。 The haptic metadata generating device according to claim 1 or 2, characterized in that the skeletal trajectory feature image generating means generates the skeletal trajectory feature image by color-coding the skeletal coordinates of each person in the multiple frame images, either common or distinct for each person, and drawing the movement of each person at each skeletal coordinate in a time-series gradation on a frame-by-frame basis, thereby generating a single image.

前記人物識別手段は、前記探索範囲として、最大で人物骨格の全体を囲む人物探索範囲に限定し、最小で人物骨格のうち所定領域を注目探索範囲として定めた絞り込みによる可変設定を行い、状態推定アルゴリズムで得られる人物の骨格の状態遷移推定値に基づいて、少なくとも前記注目探索範囲を含むように前記探索範囲を決定して、当該人物オブジェクトを識別する処理を行う手段を有することを特徴とする、請求項１から３のいずれか一項に記載の触覚メタデータ生成装置。 The haptic metadata generating device according to any one of claims 1 to 3, characterized in that the person identification means performs variable setting by narrowing down the search range to a person search range that at most surrounds the entire human skeleton and at least a specified area of the human skeleton as a focus search range, and determines the search range to include at least the focus search range based on a state transition estimate of the human skeleton obtained by a state estimation algorithm, thereby performing processing to identify the person object.

当該複数フレーム画像の各々を用いて隣接フレーム間の差分画像を基に動オブジェクトを検出し、各差分画像から検出した動オブジェクトのうち前記識別した人物骨格毎の動きの方向のみを示す骨格軌跡特徴画像と対比して人物以外の動オブジェクトを選定し、前記人物以外の動オブジェクトについて、各差分画像から得られる座標位置、大きさ、移動方向を要素とし連結した動オブジェクト軌跡画像を生成する動オブジェクト検出手段を更に備え、
前記人物動作認識手段は、前記識別した人物骨格毎の動きの方向のみを示す骨格軌跡特徴画像上に、前記動オブジェクト軌跡画像を追加して合成したものを入力とする畳み込みニューラルネットワークにより、人物の特定動作を認識することを特徴とする、請求項１から４のいずれか一項に記載の触覚メタデータ生成装置。 a moving object detection means for detecting a moving object based on a difference image between adjacent frames using each of the plurality of frame images, selecting a moving object other than a person from among the moving objects detected from each difference image by comparing with a skeleton trajectory feature image showing only the direction of movement of each identified human skeleton, and generating a moving object trajectory image by connecting the coordinate positions, sizes and moving directions obtained from each difference image as elements for the moving objects other than the person,
The haptic metadata generation device according to any one of claims 1 to 4, characterized in that the human action recognition means recognizes a specific action of a person by a convolutional neural network whose input is the composite image obtained by adding the moving object trajectory image onto the skeleton trajectory feature image that indicates only the direction of movement of each identified human skeleton.

請求項１から５のいずれか一項に記載の触覚メタデータ生成装置と、
触覚刺激を提示する触覚提示デバイスと、
前記触覚メタデータ生成装置から得られる触覚メタデータを基に、予め定めた駆動基準データを参照し、前記触覚提示デバイスを駆動するよう制御する制御ユニットと、
を備えることを特徴とする映像触覚連動システム。 A haptic metadata generation device according to any one of claims 1 to 5,
A tactile presentation device that presents a tactile stimulus;
a control unit that controls the haptic presentation device to drive the haptic presentation device by referring to predetermined driving reference data based on the haptic metadata obtained from the haptic metadata generating device;
A video and haptic linkage system comprising:

コンピュータを、請求項１から５のいずれか一項に記載の触覚メタデータ生成装置として機能させるためのプログラム。 A program for causing a computer to function as a haptic metadata generating device according to any one of claims 1 to 5.