WO2024106328A1 - Computer program, information processing terminal, and method for controlling same - Google Patents

Computer program, information processing terminal, and method for controlling same

Info

Publication number
WO2024106328A1
WO2024106328A1 (PCT/JP2023/040545)
Authority
WO
WIPO (PCT)
Prior art keywords
information
image
captured image
distance
imaging
Application number
PCT/JP2023/040545
Other languages
French (fr)
Japanese (ja)
Inventor
良太 片野
剛 山本
光国 高堀
虎太郎 尾嶋
規悦 青木
裕司 永野
Original Assignee
株式会社バンダイ
株式会社バンダイナムコピクチャーズ
Application filed by 株式会社バンダイ and 株式会社バンダイナムコピクチャーズ
Publication of WO2024106328A1 publication Critical patent/WO2024106328A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/04815 Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/14 Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics

Definitions

  • the present invention relates to a computer program, an information processing terminal, and a control method thereof.
  • Patent Document 1 discloses an augmented reality system that recognizes marks on the base of a figure with a camera on a mobile device, generates images for presentation that are prepared in association with the marks, and displays the figure and the images superimposed on the screen of the mobile device.
  • in Patent Document 1, the figure image and the video for the production are displayed superimposed on each other, and the figure image is hidden where the two overlap.
  • in other words, all of the video for the production is displayed in front of the figure image.
  • to create a sense of depth and realism, however, it is desirable for objects superimposed on the captured image, such as the video for the production, to be exposed in front of or hidden behind the figure depending on their positional relationship with the figure.
  • the present invention provides, for example, a mechanism for suitably synthesizing objects with captured images of real space and outputting the resulting images.
  • the present invention is, for example, a computer program that causes a computer of an information processing terminal to function as an imaging means for imaging a surrounding environment including a specified model, an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means, a recognition means for recognizing information regarding the attitude of at least one part of the specified model contained in the captured image and its distance from the imaging means, a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means, an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • an information processing terminal includes an imaging means for imaging a surrounding environment including a predetermined model, an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means, a recognition means for recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means, a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means, an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • the present invention is also characterized in that it is a control method for an information processing terminal, comprising, for example, an imaging step of imaging a surrounding environment including a predetermined model by an imaging means, an acquisition step of acquiring distance information from the imaging means for each pixel of the image captured in the imaging step, a recognition step of recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means, a position determination step of determining position information of an object to be generated based on the at least one part recognized in the recognition step, an object generation step of drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output step of outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • FIG. 1 is a diagram showing an example of the configuration of a system according to an embodiment.
  • FIG. 2 is a diagram showing an example of the configuration of an information processing terminal according to an embodiment.
  • FIG. 3 is a diagram showing an example of an XR gimmick provided by the system according to an embodiment.
  • FIGS. 4(a) to 4(d) are diagrams showing screen transitions of an XR gimmick according to an embodiment.
  • FIG. 5 is a diagram showing a functional configuration relating to effect synthesis according to an embodiment.
  • FIG. 6 is a diagram showing a series of example images according to a processing procedure of effect synthesis according to an embodiment.
  • FIG. 7 is a flowchart showing a processing procedure of basic control according to an embodiment.
  • FIG. 8 is a flowchart showing a processing procedure for effect synthesis output according to an embodiment.
  • FIG. 9 is a flowchart showing a processing procedure for object recognition according to an embodiment.
  • FIG. 10 is a flowchart showing a processing procedure for generating an object according to an embodiment.
  • FIG. 11 is a diagram showing a method for generating a bone structure according to an embodiment.
  • FIG. 12 is a flowchart showing a processing procedure for object recognition according to an embodiment.
  • FIG. 13 is a diagram showing a modified example of the figure according to an embodiment.
  • This system includes an information processing terminal 101, an application server 102, a machine learning server 103, and a database 104.
  • the information processing terminal 101 and the application server 102 are connected to each other via a network so that they can communicate with each other.
  • the application server 102 is connected to the machine learning server 103 via a local area network (LAN) so that they can communicate with each other.
  • the machine learning server 103 is connected to the database 104 via the LAN.
  • the information processing terminal 101 is, for example, a portable information processing terminal such as a smartphone, a mobile phone, or a tablet PC. Any device may be used as long as it has at least an imaging unit such as a camera and a display unit for displaying the captured image.
  • the information processing terminal 101 downloads and installs an application for implementing the present invention from the application server 102 via the network 105.
  • the captured image may be either a still image or a moving image (video).
  • the effect object may be synthesized as an animation.
  • the training data may be generated by the machine learning server 103, or data generated externally may be received.
  • the machine learning server 103 also stores the trained model generated for each model in the database 104, and provides it to the application server 102 as necessary.
  • the machine learning server 103 reads out the corresponding trained model from the database 104, re-learns it, and stores the re-learned model in the database 104 again.
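  • As a purely illustrative sketch (the publication does not disclose any training code), a per-part pose regressor of the kind described above might be trained along the following lines; the layer sizes, the use of PyTorch, and the four-value label format [pitch, yaw, roll, distance] are all assumptions made for the example.

```python
import torch
from torch import nn

class PartPoseNet(nn.Module):
    """Tiny CNN that regresses pitch/yaw/roll angles and camera distance for one part.

    One such network (or one output head) per part keeps inference cheap,
    in line with the per-part training strategy described above.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 4)  # pitch, yaw, roll, distance

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def train_step(model, optimiser, images, labels):
    """One supervised step on (part image, [pitch, yaw, roll, distance]) pairs."""
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(images), labels)
    loss.backward()
    optimiser.step()
    return loss.item()
```

  • The resulting trained weights would then be stored in the database 104 and provided to the application server 102, as described above.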
  • the information processing terminal 101 includes a CPU 201, a storage unit 202, a communication control unit 203, a display unit 204, an operation unit 205, a camera 206, and a speaker 207. These components can send and receive data to and from one another via a system bus 210.
  • the CPU 201 is a central processor that provides overall control of each component connected via the system bus 210.
  • the CPU 201 executes each process described below by executing computer programs stored in the storage unit 202.
  • the storage unit 202 is used as a work area and temporary storage area for the CPU 201, and also stores the control programs executed by the CPU 201 and various data.
  • the communication control unit 203 can perform bidirectional communication with the application server 102 via the network 105 using broadband wireless communication.
  • the communication control unit 203 may have short-range wireless communication functions such as wireless LAN (WiFi), Bluetooth (registered trademark) communication, and infrared communication in addition to or instead of broadband wireless communication.
  • if the communication control unit 203 does not have a broadband wireless communication function but has a WiFi communication function, it connects to the network 105 via a nearby access point.
  • the display unit 204 is a touch panel type liquid crystal display that displays various screens as well as still images and moving images captured by the camera 206.
  • the operation unit 205 is an operation input unit that is integrated with the display unit 204 and accepts user operations.
  • the operation unit 205 may also include physically configured push-type or slide-type buttons, etc.
  • the camera 206 is an imaging unit that captures images of the surrounding environment of the information processing terminal 101, and is preferably located, for example, behind the display unit 204 on the information processing terminal 101. This allows the user to check the captured image on the display unit 204 while taking an image with the camera 206.
  • the camera 206 may be a monocular camera or a compound eye camera.
  • the speaker 207 outputs sound, for example, in accordance with the effect object to be output. Sound data is prepared in advance for each effect.
  • 301 denotes an example of a predetermined model: a humanoid figure. This is not intended to limit the present invention; a model imitating any object or the like can be applied to the present invention.
  • 302 indicates a desk on which the figure 301 is placed. The user starts the above application on the information processing terminal 101, and selects an item corresponding to the figure 301 from multiple items displayed on the application screen. When the user selects the item corresponding to the figure 301, the camera 206 is started, and the user takes a picture of the figure 301 placed on the desk 302. The user can freely move the information processing terminal 101 during shooting, and change the angle at which the figure 301 is shot, as shown by the arrow. The captured image is displayed on the display unit 204 of the information processing terminal 101.
  • in this way, this system provides an augmented real space by superimposing an effect object, such as an animation, on the captured image of the surrounding environment of the information processing terminal 101 including the figure 301, and outputting the result.
  • FIG. 4(a) to FIG. 4(d) explain the screen transitions when the user executes an application that provides the XR gimmick and moves the information processing terminal 101 as shown in FIG. 3.
  • Screen 400 shown in FIG. 4(a) is a screen that is displayed on the display unit 204 when an application that provides an XR gimmick according to this embodiment is launched.
  • selection buttons 401 to 405 are displayed for selecting figures registered in the application.
  • an example is shown in which five figures are registered, but if more figures are registered, undisplayed items can be displayed and selected by scrolling the screen downwards. Different figures are registered in each item, and when the user selects a figure to be photographed, the screen transitions to screen 410 shown in FIG. 4(b).
  • the screen 410 shows the state in which the camera 206 is started, the camera 206 captures an image of the surrounding environment of the information processing terminal 101, and the image is displayed on the display unit 204.
  • the captured image of the surrounding environment includes the figure 301 placed on the desk 302, as shown in FIG. 3.
  • various buttons 411 to 413 are displayed in a selectable manner.
  • the button 411 is a button for capturing a still image during shooting. When the button 411 is operated, a still image is acquired at the timing of the operation and stored in the storage unit 202.
  • the button 412 is a button for adding an effect. When the button 412 is operated, at least one effect registered for the figure is displayed in a selectable manner, and the user can select a desired effect.
  • the button 413 is a button for displaying various menus.
  • when the button 413 is operated, the running XR gimmick can be ended and the screen returned to screen 400, or other settings can be made. Note that although an example including three buttons has been described here, more operation buttons may be included.
  • when an effect is selected via the button 412, composite output of the effect begins, as shown on screen 420 in FIG. 4(c).
  • 421 denotes an effect object composited into the video; here, three rings are displayed surrounding the figure 301. These rings may be animated, for example, appearing above the head of the figure 301 and moving toward its feet. It can be seen that the displayed effect object 421 has a portion that is displayed in front of the figure 301 and a portion that is hidden behind the figure 301 and not displayed. Details of this display control will be described later.
  • screen 430 shows the screen when the user moves the information processing terminal 101 from the state in FIG. 4(c) to a state where the user is photographing the side of the figure 301 while the effect is being output.
  • as shown by effect object 431, the part of the effect that wraps around behind the figure 301 as seen from the information processing terminal 101 is not displayed.
  • in this way, the display of the effect object follows the image captured by the camera 206 and changes according to its positional relationship with the figure 301. Detailed display control will be described later.
  • the information processing terminal 101 includes, as a functional configuration related to effect synthesis output, an image acquisition unit 501, a depth information acquisition unit 502, an object recognition unit 503, an effect position determination unit 504, a trained model 505, an effect drawing unit 506, a synthesis unit 507, and an output unit 508.
  • the image acquisition unit 501 acquires an image (RGB image) captured by the camera 206.
  • the RGB image acquired by the image acquisition unit 501 is output to the depth information acquisition unit 502, the object recognition unit 503, and the synthesis unit 507.
  • the depth information acquisition unit 502 acquires distance information (depth information) from the camera 206 at the time of capturing for each pixel in the captured image received from the image acquisition unit 501.
  • the depth information acquisition unit 502 generates a grayscale image (depth map) indicating the acquired depth information.
  • Any known method may be used as a method for acquiring depth information, for example, a method of acquiring the information using stereo vision or motion parallax due to a time difference, or a method of acquiring the information using a machine-learned model that has been trained to estimate the distance from a two-dimensional image to an object using a convolutional neural network. Note that since the XR gimmick according to this embodiment requires real-time performance, a method with a low processing load is desirable.
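  • For illustration only, the grayscale depth map could be produced as in the following sketch; `estimate_depth` is a placeholder for whichever depth source is actually used (stereo vision, motion parallax, or a learned monocular estimator), and the clipping range is an assumption for a tabletop figure scene. The brightness convention (closer pixels closer to white) matches the depth map 610 described later.

```python
import numpy as np

def make_depth_map(rgb_image: np.ndarray, estimate_depth) -> np.ndarray:
    """Build a grayscale depth map (0-255) from one RGB frame.

    `estimate_depth` is assumed to return an HxW array of distances in metres;
    it stands in for the stereo, motion-parallax, or learned estimator above.
    """
    depth_m = estimate_depth(rgb_image)        # HxW float array of metric distances
    near, far = 0.1, 3.0                       # assumed working range for a tabletop figure
    depth_clipped = np.clip(depth_m, near, far)
    gray = 255.0 * (far - depth_clipped) / (far - near)  # closer -> brighter
    return gray.astype(np.uint8)
```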
  • the depth information acquisition unit 502 outputs the acquired depth information (depth map) to the effect drawing unit 506.
  • the object recognition unit 503 uses the trained model 505 to recognize the posture of at least one part of the figure 301 included in the captured image and the distance from the camera 206 to the part.
  • the posture information includes information on the shape and angle of the part. More specifically, the object recognition unit 503 can use the trained model 505 to detect the angles of the part to be recognized in each of the directions of front, back, up, down, left, right, pitch, yaw, and roll.
  • at least one part is at least one of the head, chest, abdomen, waist, arms, and legs in a specified model such as the figure 301, and is a part related to the selected effect.
  • the granularity of dividing the specified model into parts is arbitrary. For example, in the case of a movable figure, it is desirable to divide it into parts having joints. This makes it possible to recognize the shape, posture, etc. of each movable part, and reduces recognition errors even when movement is performed.
  • parts related to an effect refer to parts located near the effect object to be generated. This is because the position of the effect object to be composited into the captured image is determined taking into consideration its positional relationship with the figure 301. For example, when generating an effect object that emits a ray from the chest of a specific model, the position and direction in which the effect object will be generated can be determined by recognizing the orientation of the model's chest and its distance from the camera 206 in the captured image.
  • note that the object recognition unit 503 recognizes only the at least one part related to the effect object to be generated, rather than recognizing the posture of the entire specified model and its distance from the camera 206 in the captured image. This allows for faster processing compared to recognizing the entire specified model, and ensures the real-time performance of the XR gimmick. Since the object recognition unit 503 holds in advance information on the three-dimensional shape model of the selected model, it is also possible to estimate to some extent the posture and distance of other parts by recognizing the posture and distance of some parts. When the object recognition unit 503 has recognized the information related to the posture and distance of the at least one part related to the effect object to be generated, it outputs the information to the effect position determination unit 504.
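  • A minimal sketch of such part-limited recognition is given below; `pose_model` stands in for the trained model 505, its `predict` call and the returned field names are hypothetical, and the `PartPose` record is introduced purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class PartPose:
    name: str          # e.g. "head" or "chest"
    pitch: float       # angles in degrees
    yaw: float
    roll: float
    distance_m: float  # distance from the camera 206

def recognize_parts(rgb_image, effect_related_parts, pose_model):
    """Recognize only the parts related to the selected effect.

    Restricting recognition to these parts, rather than the whole model,
    keeps the per-frame processing load low, as described above.
    """
    results = []
    for part_name in effect_related_parts:                    # e.g. ["head", "chest"]
        out = pose_model.predict(rgb_image, part=part_name)   # hypothetical API
        results.append(PartPose(part_name, out["pitch"], out["yaw"],
                                out["roll"], out["distance_m"]))
    return results
```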
  • the effect position determination unit 504 determines position information of the effect object to be generated based on the acquired information on the attitude of at least one part and the distance from the camera 206.
  • the position information includes information on at least the attitude (angle) and distance from the camera 206 for the effect object.
  • the determined position information is output to the effect drawing unit 506. Since the effect drawing unit 506 holds model information of the effect object to be generated in advance, information that defines the reference position of the effect object in association with a specified position of the figure 301 can be output here. In other words, the position information of the effect object to be generated only needs to include information required for the effect drawing unit 506 to draw the effect object, and may, for example, indicate information on the attitude (angle) of the effect object and the distance from the camera 206.
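  • One way the position determination could work is to apply a fixed offset in the recognized part's coordinate frame, as in the sketch below; the Euler-angle convention, the offset value, and the reuse of the `PartPose` fields from the earlier sketch are assumptions, not details from the publication.

```python
import numpy as np

def euler_to_matrix(pitch, yaw, roll):
    """Rotation matrix from pitch/yaw/roll in degrees (one common convention)."""
    p, y, r = np.radians([pitch, yaw, roll])
    rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    rz = np.array([[np.cos(r), -np.sin(r), 0], [np.sin(r), np.cos(r), 0], [0, 0, 1]])
    return rz @ ry @ rx

def determine_effect_pose(part_pose, offset=(0.0, 0.15, 0.0)):
    """Place the effect relative to one recognized part.

    Returns only the effect's orientation and its distance from the camera,
    which is all the effect drawing unit needs, since it already holds the
    effect object's model data.
    """
    rotation = euler_to_matrix(part_pose.pitch, part_pose.yaw, part_pose.roll)
    offset_cam = rotation @ np.asarray(offset)        # offset expressed in the part's local frame
    distance = part_pose.distance_m + offset_cam[2]   # shift along the viewing axis (assumed z)
    return {"rotation": rotation, "distance_m": distance}
```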
  • the effect drawing unit 506 draws an effect object based on the depth information acquired from the depth information acquisition unit 502 and the information on the posture and distance of the effect object acquired from the effect position determination unit 504. As described above, the effect drawing unit 506 performs drawing according to the model information stored in advance for the effect to be generated. More specifically, referring to the depth information (distance information) of each pixel of the captured image, the effect drawing unit 506 draws, among the drawing pixels of the effect object, those pixels whose position indicates that the effect object is closer to the camera 206 than the corresponding pixel of the captured image.
  • conversely, the effect drawing unit 506 does not draw those pixels of the effect object that are not closer to the camera 206 than the corresponding pixel of the captured image. As a result, for example, a portion of the effect object hidden behind the figure 301 is not drawn, and only the portion exposed in front of the figure 301 is drawn.
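  • The visibility rule just described amounts to a per-pixel depth test of the effect object against the captured scene. Below is a minimal vectorised sketch; it assumes the effect has already been rasterised into an RGBA layer and a per-pixel distance buffer aligned with the captured image, and the array names are illustrative.

```python
import numpy as np

def draw_visible_effect(effect_rgba, effect_depth_m, scene_depth_m):
    """Keep only the effect pixels that lie in front of the captured scene.

    effect_rgba    : HxWx4 rasterised effect layer (alpha 0 where there is no effect)
    effect_depth_m : HxW distance of each effect pixel from the camera
    scene_depth_m  : HxW distance of each captured-image pixel from the camera
    """
    visible = (effect_rgba[..., 3] > 0) & (effect_depth_m < scene_depth_m)
    out = np.zeros_like(effect_rgba)
    out[visible] = effect_rgba[visible]   # hidden pixels are simply never drawn
    return out
```

  • Because occluded pixels are never drawn in the first place, no later erase pass is needed, which is the low-load property noted further on in this description.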
  • the effect drawing unit 506 outputs the drawn effect object image to the synthesis unit 507.
  • the synthesis unit 507 generates a composite image by superimposing the effect object image acquired from the effect drawing unit 506 on the captured image acquired from the image acquisition unit 501, thereby adding the effect to the real space.
  • the synthesis unit 507 may also perform final adjustments and quality adjustments to the synthesized image by adjusting the brightness of the environment, etc. For example, when displaying an effect with more emphasis in accordance with a selected effect, adjustments such as darkening the image in the real space can be made.
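  • A compositing step along these lines could be an alpha blend preceded by an optional darkening of the real-space image, as sketched below; the darkening factor in the usage comment is purely illustrative.

```python
import numpy as np

def composite(captured_rgb, effect_rgba, darken=1.0):
    """Superimpose the drawn effect layer on the captured image.

    darken < 1.0 dims the real-space image so that the effect stands out,
    corresponding to the brightness/quality adjustment described above.
    """
    base = captured_rgb.astype(np.float32) * darken
    alpha = effect_rgba[..., 3:4].astype(np.float32) / 255.0
    blended = base * (1.0 - alpha) + effect_rgba[..., :3].astype(np.float32) * alpha
    return np.clip(blended, 0, 255).astype(np.uint8)

# e.g. composite(frame, effect_layer, darken=0.6) for an emphasised effect
```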
  • the composite image is passed to the output unit 508, which displays it on the display unit 204.
  • the above-mentioned series of processes may be executed periodically (e.g., at periods of 30 msec, 60 msec, 90 msec, etc.) on the captured images continuously acquired by the image acquisition unit 501.
  • the image displayed by the output unit 508 becomes a moving image.
  • the added effect object may also be displayed as a dynamically changing animation.
  • the output unit 508 can also output a predetermined sound from the speaker 207 in accordance with the displayed composite image (effect animation).
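  • Tying the above units together, the periodic processing could be driven by a loop like the following sketch; the `camera`, `display`, `speaker`, and `effect` objects and their methods are placeholders, and the other functions refer to the earlier illustrative sketches.

```python
import time

def run_effect_loop(camera, display, speaker, effect, pose_model,
                    estimate_depth, period_s=0.030):
    """Capture, synthesise, and display at a fixed period (e.g. 30 msec)."""
    while effect.active:                                   # placeholder stop condition
        start = time.monotonic()
        frame = camera.capture()                           # RGB image from the camera 206
        scene_depth_m = estimate_depth(frame)              # metric depth; the grayscale map is its visualisation
        parts = recognize_parts(frame, effect.related_parts, pose_model)
        pose = determine_effect_pose(parts[0])             # simplest case: one reference part
        effect_rgba, effect_depth_m = effect.rasterize(pose, frame.shape)  # hypothetical renderer
        layer = draw_visible_effect(effect_rgba, effect_depth_m, scene_depth_m)
        display.show(composite(frame, layer))              # output to the display unit 204
        speaker.play_if_needed(effect)                     # optional sound from the speaker 207
        time.sleep(max(0.0, period_s - (time.monotonic() - start)))
```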
  • Reference numeral 600 denotes an image captured by the camera 206.
  • the captured image 600 includes a figure 301 placed on a desk 302.
  • Reference numeral 610 denotes a grayscale depth map, which is depth information obtained from the captured image 600 by the depth information acquisition unit 502. In the depth map 610, the closer each pixel is to white, the closer the distance from the camera 206 is.
  • 620 shows how the object recognition unit 503 uses the trained model 505 to recognize at least one part related to the effect object to be generated.
  • the head 621 and chest 622 of the figure 301 are recognized.
  • only some of the parts related to the effect object to be generated are recognized, rather than recognizing the entire figure 301 contained in the captured image 600.
  • 630 shows a model of an effect object 631 whose position information has been determined based on the parts recognized in 620.
  • based on the recognized parts, the overall position information of the effect object 631 to be generated is determined. Note that since the model information of the effect object 631 is stored in advance, the position information only needs to include information on the posture for drawing the object and the distance from the camera 206.
  • effect rendering unit 506 uses depth map 610 and effect object model 630 to render effect object 641 to be composited into the captured image.
  • effect object 641 includes parts that are not rendered. This is because distance information obtained from depth map 610 is compared with distance information of effect object 631 obtained from model 630, and only effect objects located in front of the corresponding pixel in the captured image (i.e., closer to camera 206) are rendered.
  • 650 shows a composite image generated by simply superimposing the effect object image 640 on the captured image 600.
  • the part of the effect object 631 that overlaps with the image part of the figure 301 and is hidden behind the figure 301 is not drawn.
  • Note that in this embodiment only the part exposed to the front is drawn, without drawing the part of the effect hidden by the target object. Therefore, compared to control in which the part of the effect object that has been drawn once is erased according to its positional relationship with the figure, it is possible to realize processing with a lower processing load and to perform processing at a higher speed.
  • the CPU 201 displays a screen 400 that displays a selectable menu on the display unit 204.
  • the CPU 201 acquires information selected in response to a user operation via the screen 400 or a setting screen (not shown) to which the screen 400 transitions.
  • the selected information here includes, for example, information about a specific model that is captured by the camera 206 and displayed.
  • the CPU 201 starts capturing images with the camera 206 in response to the selected information.
  • the captured image is displayed on the display unit 204, as shown on screen 410.
  • the CPU 201 determines whether or not effect output has been selected via the button 412. If it has been selected, the process proceeds to S106; if not, the process proceeds to S104.
  • in S104, the CPU 201 acquires an image captured by the camera 206, displays it on the display unit 204, and proceeds to S107.
  • in S106, the CPU 201 combines an effect with the captured image, outputs the result, and proceeds to S107. The detailed process of S106 will be described later with reference to FIG. 8.
  • the CPU 201 determines whether or not to end the video output, and if not, returns to S103, and if to end, ends the process of this flowchart. For example, if an instruction to return to the screen 400 is given via the button 413 or if the application is ended, the CPU 201 determines that the video output is to be ended, and stops the startup of the camera 206.
  • the CPU 201 acquires effect information selected via the button 412.
  • the effect information includes identification information for identifying the effect to be generated, information on at least one part to which the effect is related, and the like. This information is received from the application server 102 and is pre-stored in the storage unit 202.
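  • The effect information might be represented by a small record such as the following sketch; the field names are illustrative, since the description only states that it includes identification information, the related parts, and the like, and the model and sound fields are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EffectInfo:
    effect_id: str                     # identifies the effect to be generated
    related_parts: List[str]           # e.g. ["head", "chest"]
    model_asset: str = ""              # pre-stored model data for the effect object
    sound_asset: Optional[str] = None  # sound prepared in advance for this effect

# e.g. EffectInfo("triple_ring", ["head", "chest"], "ring_effect.dat", "ring.wav")
```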
  • the CPU 201 acquires a captured image of the processing target captured by the camera 206.
  • in S203, the CPU 201 acquires, using the depth information acquisition unit 502, a depth map of the captured image acquired in S202.
  • in S204, the CPU 201 inputs the captured image acquired in S202 to the trained model 505, and performs object recognition of at least one part related to the effect acquired in S201 on the model (here, the figure 301) included in the captured image. Detailed control of the object recognition will be described later with reference to FIG. 9.
  • the CPU 201 determines position information of the effect to be generated based on information on the attitude and distance of at least one part recognized in S204.
  • the position information includes information on the attitude (angle) of the effect to be generated and the distance from the camera 206 as information for generating the effect image.
  • the processing order of S203, S204, and S205 has been described in order for ease of explanation, but the processing of acquiring the depth map and the processing of determining the position of the effect object may be performed in the reverse order, or may be performed in parallel.
  • the CPU 201 generates an image of the effect object based on the depth map acquired in S203 and the position information of the effect object determined in S205.
  • the generation control of the image of the effect object will be described later with reference to FIG. 10.
  • the CPU 201 superimposes the effect object image generated in S206 onto the captured image acquired in S202 to synthesize them.
  • the CPU 201 displays the synthesized image on the display unit 204 and outputs sound from the speaker 207 as necessary, and ends the processing of this flowchart.
  • the CPU 201 identifies parts related to the effect to be generated based on the effect information acquired in S201. For example, in the example of the XR gimmick described in FIG. 4 and FIG. 6, the head and chest of the figure 301 are identified as parts related to the effect.
  • the CPU 201 uses the trained model 505 to recognize at least one part identified in S301 that is included in the captured image.
  • the CPU 201 obtains information regarding the shape, angle, and distance of the recognized part from the output result of the trained model 505. After that, in S304, the CPU 201 determines whether or not there are any unanalyzed parts among the parts identified in S301. If there are any unanalyzed parts, the process returns to S302, and if there are no unanalyzed parts, the process of this flowchart ends.
  • the CPU 201 initializes the pixel position x of the effect object to be generated based on the position information of the effect object determined in S205 and the model information of the effect object to be generated that has been stored in advance.
  • the upper left pixel position of the effect object is set as the initial value for pixel position x.
  • the CPU 201 compares the distance information between the pixel position x of the effect object to be processed and the corresponding pixel position y of the captured image.
  • the CPU 201 determines whether the comparison shows that the effect object is located in front (closer to the camera 206). If the effect object is in front, the process proceeds to S404, otherwise, the process proceeds to S405.
  • the CPU 201 draws the effect object of the corresponding pixel, and proceeds to S405.
  • in S405, the CPU 201 determines whether all pixels of the effect object have been compared with the corresponding pixels of the captured image. When processing has been completed for all pixels, the process of this flowchart ends; if not, the process returns to S402.
  • the information processing terminal captures an image of the surrounding environment including a predetermined model, and acquires distance information from the camera for each pixel of the captured image.
  • the information processing terminal recognizes information on the posture of at least one part of the predetermined model included in the captured image and the distance from the camera, and determines position information of an object to be generated based on the recognized at least one part.
  • the information processing terminal draws, as an object, pixels of each pixel of the object to be generated whose position information indicates that the pixels are closer to the camera than the distance information of the corresponding pixel in the captured image, generates an object image, and outputs a composite image in which the object image is superimposed on the captured image to the display unit.
  • the information processing terminal does not draw pixels of each pixel of the object to be generated whose position information does not indicate that the pixels are closer to the camera than the distance information of the corresponding pixel in the captured image.
  • a portion located in front of a predetermined object is drawn, and a portion located behind the predetermined object is not drawn because it is hidden by the predetermined object.
  • the present invention can suitably synthesize an object into a captured image in real space and output it.
  • in the following, control is described in which a bone structure of the target model is constructed to complement the depth map. Furthermore, by using the constructed bone structure, the posture of the target model can be determined, and the effect object can be dynamically changed according to the determined posture. Details will be described later.
  • Reference numeral 1100 indicates the bone structure of a figure, which is a specified model, contained in the captured image.
  • the figure 301 and desk 302 contained in the captured image are indicated by dotted lines.
  • Black circles such as 1101 in Figure 11 indicate feature points of figure 301. These feature points are used to generate a rough outline of the figure, and there is no intention to limit their number or position.
  • 1102 indicates the bone structure connecting each feature point.
  • Bone structure 1102 in Figure 11 indicates the reference bone structure of figure 301.
  • the reference bone structure is a bone structure obtained from the reference posture of a specified model, and is data prepared in advance for each model.
  • the reference bone structure can be obtained, for example, from three-dimensional data consisting of a rough outline with a reduced number of polygons from the three-dimensional data of the target model.
  • each part included in the target model is recognized.
  • the parts to be recognized may include the face, chest, abdomen, waist, both arms, and both legs.
  • the reference bone structure is updated according to the angle of the part recognized from the captured image. Therefore, the updated bone structure indicates the posture of the target model included in the captured image. Furthermore, by mapping the updated bone structure to the captured image, the area near the position of the corresponding bone structure in the captured image can be determined as an area indicating that the target model is being captured.
  • the CPU 201 acquires three-dimensional data including the reference bone structure of the target model (here, the figure 301).
  • This data is information that is pre-stored in the storage unit 202 when the application is installed.
  • three-dimensional data including the reference bone structure 1102 of the figure 301 is read from the storage unit 202.
  • the CPU 201 identifies the parts to be recognized based on the information of the figure 301, which is the target model.
  • here, it is not only the parts related to the selected effect that are identified, but the parts necessary to update the bone structure; basically, every part included in the figure 301 is identified.
  • the CPU 201 updates the corresponding part of the reference bone structure acquired in S501 according to the recognized part (information on posture and distance from the camera 206). Specifically, the CPU 201 compares the recognized part with the corresponding part on the three-dimensional data including the reference bone structure, and updates the reference bone structure by adjusting the position of the feature points to match the angle of the recognized part. Then, in S304 the CPU 201 determines whether any of the parts identified in S502 have not been analyzed. If there are any unanalyzed parts, the process returns to S302, and if there are no unanalyzed parts, the process proceeds to S504.
  • the CPU 201 determines the posture of the target model from the updated reference bone structure, and ends the processing of this flowchart.
  • the posture of the target model may be determined by estimating the posture of the updated reference bone structure, for example, using a trained model that has been machine-learned to train the posture of the human body.
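  • The bone update and posture determination described above might be realised as in the following sketch; the per-part rotation of feature points and the posture classifier (`posture_model`) are illustrative assumptions.

```python
import numpy as np

def update_bone(reference_points, part_rotations):
    """Rotate each part's feature points to match the recognized angles.

    reference_points : dict part name -> (pivot 3-vector, Nx3 feature points)
                       taken from the pre-stored reference bone data
    part_rotations   : dict part name -> 3x3 rotation recognized from the captured image
    """
    updated = {}
    for part, (pivot, points) in reference_points.items():
        rot = part_rotations.get(part, np.eye(3))   # unchanged if the part was not recognized
        updated[part] = pivot + (points - pivot) @ rot.T
    return updated

def classify_posture(updated_points, posture_model):
    """Estimate a posture label for the target model from the updated bone.

    `posture_model` stands in for a trained pose classifier; its output can
    then be used to switch or trigger specific effects, as described below.
    """
    flat = np.concatenate([pts.reshape(-1) for pts in updated_points.values()])
    return posture_model.predict(flat)              # hypothetical call
```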
  • the effect to be output can be changed according to the estimated posture of the target model.
  • a specific effect can be output only when the posture of the target model indicates a specific posture.
  • an effect such as making the belt part light up and rotate can be output.
  • an effect such as blinking can be applied to each part in the following order: waist, torso, shoulders, upper arms, lower arms, and fists.
  • three-dimensional data including the updated bone structure may be used to complement the depth map.
  • the above three-dimensional data may also be used to identify the position of the target model on the depth map (i.e., in the captured image). If the position of the target model on the depth map can be identified, it is only necessary to determine whether or not to draw the pixels of the effect object that overlap the target model. This can complement the depth map when its accuracy is poor, and it can also reduce the processing load by performing the comparison process of S402 above only for pixels that overlap the target model.
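  • The load reduction suggested here can be expressed as a simple mask test: project the updated bone data into the captured image and apply the depth comparison only inside the resulting mask. Below is a sketch under that assumption; the projected `model_mask` is taken as given.

```python
import numpy as np

def draw_effect_near_model(effect_rgba, effect_depth_m, scene_depth_m, model_mask):
    """Depth-test effect pixels only where they overlap the target model.

    model_mask : HxW boolean mask obtained by projecting the updated bone
                 structure (three-dimensional data) onto the captured image.
    Outside the mask the effect is drawn as-is, since nothing there can
    occlude it, so an unreliable depth value cannot hide it by mistake.
    """
    out = effect_rgba.copy()
    has_effect = effect_rgba[..., 3] > 0
    occluded = model_mask & has_effect & (effect_depth_m >= scene_depth_m)
    out[occluded] = 0   # hide only the effect pixels that lie behind the model
    return out
```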
  • the information processing terminal further recognizes information regarding the posture of each part included in the specified model and the distance from the camera, updates the three-dimensional data including the reference bone structure indicating the reference posture of the specified model according to each recognized part, and recognizes the posture of the specified model based on the updated three-dimensional data.
  • the present invention is not limited to the above embodiment, and various modifications and alterations are possible within the scope of the gist of the invention.
  • an example was described in which three rings that surround the body of a specific model figure 301 are synthesized and output as effect objects.
  • an example was described in which an object that is not actually included in figure 301 is synthesized and output, but the present invention is not limited to this.
  • an effect may be synthesized and output for a part that the figure physically has.
  • FIG. 13 shows a modified figure 1101 placed on a desk 302 and photographed by the information processing terminal 101.
  • the figure 1101 has a physical object 1102, for example, a flame.
  • effects may be synthesized and output for such physically existing parts.
  • flickering flames or sparks may be added and output.
  • a different expression may be added to the captured image of the figure's face, for example, a smiling or crying expression, or an image in which the line of sight is changed to face the camera may be synthesized and output.
  • at least one part of the figure can be recognized from the captured image, and any effect can be added as long as it is an effect that can be generated based on the recognized part.
  • a humanoid model has been used as an example of the target model, but this is not intended to limit the present invention, and the present invention can be applied to models of various shapes, such as humans, animals, robots, insects, and dinosaurs.
  • by dividing the model into multiple parts and recognizing them, it is possible to provide augmented reality while ensuring real-time performance.
  • the above embodiment discloses at least the following computer program, information processing terminal, and control method thereof.
  • a computer program that causes a computer of an information processing terminal to function as: an imaging means for imaging the surrounding environment including a specified model; an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means; a recognition means for recognizing information related to the attitude of at least one part of the specified model contained in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means; an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, thereby generating an object image; and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • the computer program described in (1) or (2) is characterized in that the computer of the information processing terminal is further made to function as a selection means for selecting an effect to be composited with a captured image including the specified model, and the recognition means recognizes parts related to the selected effect.
  • the computer program described in (3) is characterized in that the recognition means recognizes parts related to the selected effect using a trained model that is trained to input a captured image and output information about the shape, angle, and distance for each part of the specified model.
  • the computer program described in (7) is characterized in that the parts related to the selected effect are parts located in the vicinity of the object to be generated.
  • An information processing terminal comprising: an imaging means for imaging a surrounding environment including a specified model; an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means; a recognition means for recognizing information related to the attitude of at least one part of the specified model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means; an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, thereby generating an object image; and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • a control method for an information processing terminal comprising: an imaging step of imaging a surrounding environment including a predetermined model by an imaging means; an acquisition step of acquiring distance information from the imaging means for each pixel of the image captured in the imaging step; a recognition step of recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means; a position determination step of determining position information of an object to be generated based on the at least one part recognized in the recognition step; an object generation step of drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, thereby generating an object image; and an output step of outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • 101 Information processing terminal
  • 102 Application server
  • 103 Machine learning server
  • 104 Database
  • 105 Network
  • 201 CPU
  • 202 Storage unit
  • 203 Communication control unit
  • 204 Display unit
  • 205 Operation unit
  • 206 Camera
  • 207 Speaker
  • 210 System bus
  • 301 Figure
  • 501 Image acquisition unit
  • 502 Depth information acquisition unit
  • 503 Object recognition unit
  • 504 Effect position determination unit
  • 505 Learned model
  • 506 Effect drawing unit
  • 507 Synthesis unit
  • 508 Output unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)
  • Digital Computer Display Output (AREA)

Abstract

[Problem] The present invention provides, for example, a mechanism for suitably synthesizing an object with a captured image of a real space and outputting the result of synthesis. [Solution] This information processing terminal: captures an image of a surrounding environment including a prescribed model; and acquires, for each pixel of the captured image that has been captured, information pertaining to the distance from a camera. The information processing terminal also: recognizes information relating to the orientation of at least one part of the prescribed model included in the captured image and the distance from the camera; and determines, with reference to the at least one part for which the orientation was recognized, position information pertaining to an object being generated. The information processing terminal furthermore: draws, as an object, pixels indicated by the position information to be closer to the camera than the distance information for the corresponding pixel of the captured image, from among the pixels of the object being generated; generates an object image; and outputs, to a display unit, a composite image in which the object image is superimposed on the captured image.

Description

Computer program, information processing terminal, and control method thereof
The present invention relates to a computer program, an information processing terminal, and a control method thereof.
Conventionally, XR (Cross Reality) technologies such as AR (Augmented Reality) have been used in various fields to present an augmented real space to a user by superimposing information such as text and CG on the real space. For example, Patent Document 1 discloses an augmented reality system that recognizes marks on the base of a figure with a camera on a mobile device, generates images for presentation that are prepared in association with the marks, and displays the figure and the images superimposed on the screen of the mobile device.
Patent Document 1: Japanese Patent No. 5551205
In the above-mentioned conventional technology, the figure image and the video for the production are displayed superimposed on each other, and the figure image is hidden where the figure image and the video for the production overlap. In other words, in the above-mentioned conventional technology, all the video for the production is displayed in front of the figure image. However, to create a sense of three-dimensionality and realism, it is desirable for objects superimposed on the captured image, such as the video for the production, to be exposed in front of or hidden behind the figure depending on their positional relationship with the figure. Also, in order to add objects to the captured image in real time, it is necessary to reduce the processing load as much as possible.
The present invention provides, for example, a mechanism for suitably synthesizing objects with captured images of real space and outputting the resulting images.
The present invention is, for example, a computer program that causes a computer of an information processing terminal to function as an imaging means for imaging a surrounding environment including a specified model, an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means, a recognition means for recognizing information regarding the attitude of at least one part of the specified model contained in the captured image and its distance from the imaging means, a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means, an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
The present invention is also characterized in that, for example, an information processing terminal includes an imaging means for imaging a surrounding environment including a predetermined model, an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means, a recognition means for recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means, a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means, an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
The present invention is also characterized in that it is a control method for an information processing terminal, comprising, for example, an imaging step of imaging a surrounding environment including a predetermined model by an imaging means, an acquisition step of acquiring distance information from the imaging means for each pixel of the image captured in the imaging step, a recognition step of recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means, a position determination step of determining position information of an object to be generated based on the at least one part recognized in the recognition step, an object generation step of drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output step of outputting a composite image in which the object image is superimposed on the captured image to a display unit.
According to the present invention, for example, it is possible to suitably synthesize an object into a captured image of real space and output it.
FIG. 1 is a diagram showing an example of the configuration of a system according to an embodiment.
FIG. 2 is a diagram showing an example of the configuration of an information processing terminal according to an embodiment.
FIG. 3 is a diagram showing an example of an XR gimmick provided by the system according to an embodiment.
FIG. 4 is a diagram showing screen transitions of an XR gimmick according to an embodiment.
FIG. 5 is a diagram showing a functional configuration relating to effect synthesis according to an embodiment.
FIG. 6 is a diagram showing a series of example images according to a processing procedure of effect synthesis according to an embodiment.
FIG. 7 is a flowchart showing a processing procedure of basic control according to an embodiment.
FIG. 8 is a flowchart showing a processing procedure for effect synthesis output according to an embodiment.
FIG. 9 is a flowchart showing a processing procedure for object recognition according to an embodiment.
FIG. 10 is a flowchart showing a processing procedure for generating an object according to an embodiment.
FIG. 11 is a diagram showing a method for generating a bone structure according to an embodiment.
FIG. 12 is a flowchart showing a processing procedure for object recognition according to an embodiment.
FIG. 13 is a diagram showing a modified example of the figure according to an embodiment.
以下、添付図面を参照して実施形態を詳しく説明する。尚、以下の実施形態は特許請求の範囲に係る発明を限定するものではなく、また実施形態で説明されている特徴の組み合わせの全てが発明に必須のものとは限らない。実施形態で説明されている複数の特徴のうち二つ以上の特徴が任意に組み合わされてもよい。また、同一若しくは同様の構成には同一の参照番号を付し、重複した説明は省略する。  The embodiments are described in detail below with reference to the attached drawings. Note that the following embodiments do not limit the invention as claimed, and not all combinations of features described in the embodiments are necessarily essential to the invention. Two or more of the features described in the embodiments may be combined in any desired manner. Furthermore, the same reference numbers are used for the same or similar configurations, and duplicate descriptions will be omitted.
<第1の実施形態> <システム構成> 以下では本発明の第1の実施形態について説明する。まず図1を参照して、本実施形態に係るシステム構成について説明する。なお、ここでは必要最低限の簡易的な構成について説明するが、本発明を限定するものではない。例えば、各装置については複数の装置が含まれてもよいし、複数のサーバが一体化して設けられてもよい。  <First embodiment> <System configuration> Below, a first embodiment of the present invention will be described. First, with reference to FIG. 1, the system configuration according to this embodiment will be described. Note that a simple configuration with the minimum necessary components will be described here, but this does not limit the present invention. For example, each device may include multiple devices, or multiple servers may be integrated.
本システムは、情報処理端末101、アプリケーションサーバ102、機械学習サーバ103、及びデータベース104を含んで構成される。情報処理端末101及びアプリケーションサーバ102はネットワークを介して相互に通信可能に接続される。アプリケーションサーバ102は、ローカルエリアネットワーク(LAN)を介して機械学習サーバ103に相互に通信可能に接続される。また、機械学習サーバ103はLANを介してデータベース104に接続される。  This system includes an information processing terminal 101, an application server 102, a machine learning server 103, and a database 104. The information processing terminal 101 and the application server 102 are connected to each other via a network so that they can communicate with each other. The application server 102 is connected to the machine learning server 103 via a local area network (LAN) so that they can communicate with each other. In addition, the machine learning server 103 is connected to the database 104 via the LAN.
情報処理端末101は、例えば、スマートフォン、携帯電話機、タブレットPC等の携帯型の情報処理端末である。カメラ等の撮像部と、撮像した画像を表示する表示部とを少なくとも有する情報処理端末であれば任意の装置であってよい。情報処理端末101は、ネットワーク105を介してアプリケーションサーバ102から、本発明を実施するためのアプリケーションをダウンロードしてインストールする。当該アプリケーションが情報処理端末101で実行されることによって、以下で説明する所定の模型を含む撮像画像に、エフェクトオブジェクトを合成した拡張現実を提供することができる。なお、撮像画像は、静止画像及び動画像(映像)の何れであってもよい。また、動画像にエフェクトを付加する場合には、エフェクトオブジェクトをアニメーションとして合成してもよい。  The information processing terminal 101 is, for example, a portable information processing terminal such as a smartphone, a mobile phone, or a tablet PC. Any device may be used as long as it has at least an imaging unit such as a camera and a display unit for displaying the captured image. The information processing terminal 101 downloads and installs an application for implementing the present invention from the application server 102 via the network 105. By executing the application on the information processing terminal 101, it is possible to provide an augmented reality in which an effect object is synthesized into a captured image including a specific model, as described below. The captured image may be either a still image or a moving image (video). Furthermore, when an effect is added to a moving image, the effect object may be synthesized as an animation.
アプリケーションサーバ102は、機械学習サーバ103において機械学習された学習済みモデルを取得し、当該学習済みモデルを組み込んだアプリケーションを情報処理端末101等の外部端末に提供する。機械学習サーバ103は、例えば、深層学習の畳み込みニューラルネットワーク(CNN)によって、画像情報に含まれるフィギュア(模型)の各パーツを認識する学習済みモデルを生成する。学習データとしては、例えば所定の模型のパーツごとに様々な姿勢や角度から撮影された撮像画像に教師データを付与したデータを用いる。このように所定の模型をパーツ毎に学習させることにより、例えば推定フェーズにおいて所定の模型の全体を認識するよりも、高速に且つ処理負荷を抑えた認識処理を実行することができる。学習データについては、機械学習サーバ103で生成してもよいし、外部で生成されたデータを受信してもよい。また、機械学習サーバ103は、模型ごとに生成した学習済みモデルをデータベース104に格納し、必要に応じてアプリケーションサーバ102へ提供する。また、機械学習サーバ103は、追加の学習データに基づいて再学習を行うために、データベース104から対応する学習済みモデルを読み出して再学習させ、再学習後のモデルをデータベース104へ再度格納する。  The application server 102 acquires a trained model that has been machine-learned in the machine learning server 103, and provides an application incorporating the trained model to an external terminal such as the information processing terminal 101. The machine learning server 103 generates a trained model that recognizes each part of a figure (model) included in image information, for example, by a deep learning convolutional neural network (CNN). For example, data obtained by attaching teacher data to captured images taken from various postures and angles for each part of a specific model is used as the training data. By training the specific model for each part in this way, it is possible to execute a recognition process faster and with a reduced processing load than, for example, recognizing the entire specific model in the estimation phase. The training data may be generated by the machine learning server 103, or data generated externally may be received. The machine learning server 103 also stores the trained model generated for each model in the database 104, and provides it to the application server 102 as necessary. In addition, in order to perform re-learning based on additional training data, the machine learning server 103 reads out the corresponding trained model from the database 104, re-learns it, and stores the re-learned model in the database 104 again.
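As a purely illustrative sketch of this training step (the disclosure only specifies a deep-learning convolutional neural network trained per part; the framework, backbone, and label set below are assumptions, and the actual recognizer also outputs posture angles and distance rather than only a class label), a per-part image classifier could be fine-tuned along the following lines.

```python
import torch
import torch.nn as nn
from torchvision import models

# One label per part of a single model kit; the actual part split is chosen per kit (assumption).
PART_CLASSES = ["head", "chest", "abdomen", "waist", "arm", "leg", "background"]

def build_part_recognizer(num_classes: int = len(PART_CLASSES)) -> nn.Module:
    """Small CNN fine-tuned to answer: which part does this image crop show?"""
    net = models.resnet18(weights=None)                  # any compact backbone would do
    net.fc = nn.Linear(net.fc.in_features, num_classes)  # replace the classification head
    return net

def train_step(net, images, labels, optimizer, loss_fn=nn.CrossEntropyLoss()):
    """One supervised update on a batch of labelled part crops."""
    optimizer.zero_grad()
    loss = loss_fn(net(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for crops of each part photographed from many angles and poses.
net = build_part_recognizer()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
print(train_step(net, torch.randn(4, 3, 224, 224), torch.randint(0, len(PART_CLASSES), (4,)), opt))
```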
<情報処理端末の構成> 次に、図2を参照して、本実施形態に係る情報処理端末101の構成例について説明する。ここでは、本実施形態に係る情報処理端末101において本発明を説明する上で重要なデバイスについてのみ説明する。したがって、情報処理端末101は代替的に又は追加的に他のデバイスを含んで構成されてもよい。  <Configuration of information processing terminal> Next, an example of the configuration of the information processing terminal 101 according to this embodiment will be described with reference to FIG. 2. Here, only the devices that are important for explaining the present invention in the information processing terminal 101 according to this embodiment will be described. Therefore, the information processing terminal 101 may be configured to include other devices instead or in addition.
情報処理端末101は、CPU201、記憶部202、通信制御部203、表示部204、操作部205、カメラ206、及びスピーカ207を備える。各コンポーネントは、システムバス210を介して相互にデータを送受することができる。  The information processing terminal 101 includes a CPU 201, a memory unit 202, a communication control unit 203, a display unit 204, an operation unit 205, a camera 206, and a speaker 207. Each component can send and receive data to and from each other via a system bus 210.
CPU201は、システムバス210を介して接続された各コンポーネントを全体的に制御する中央処理プロセッサである。CPU201は、記憶部202に記憶されたコンピュータプログラムを実行することにより、後述する各処理を実行する。記憶部202は、CPU201のワーク領域や一時領域として使用されるとともに、CPU201によって実行される制御プログラムや各種データを記憶している。  The CPU 201 is a central processor that provides overall control of each component connected via the system bus 210. The CPU 201 executes each process described below by executing computer programs stored in the memory unit 202. The memory unit 202 is used as a work area and temporary area for the CPU 201, and also stores the control programs executed by the CPU 201 and various data.
通信制御部203は広帯域無線通信によりネットワーク105を介してアプリケーションサーバ102と双方向通信を行うことができる。なお、通信制御部203は、広帯域無線通信に加えて又は代えて、無線LAN(WiFi)、Bluetooth(登録商標)通信、及び赤外線通信などの近距離無線通信の機能を有してもよい。通信制御部203は、例えば広帯域無線通信機能を有しておらず、WiFi通信機能を有している場合、近くのアクセスポイントを介してネットワーク105へ接続する。  The communication control unit 203 can perform bidirectional communication with the application server 102 via the network 105 using broadband wireless communication. Note that the communication control unit 203 may have short-range wireless communication functions such as wireless LAN (WiFi), Bluetooth (registered trademark) communication, and infrared communication in addition to or instead of broadband wireless communication. For example, if the communication control unit 203 does not have a broadband wireless communication function but has a WiFi communication function, it connects to the network 105 via a nearby access point.
表示部204はタッチパネル式の液晶ディスプレイであり、各種画面を表示するとともに、カメラ206によって撮像された静止画像や動画像を表示する。操作部205は、表示部204と一体化して設けられ、ユーザ操作を受け付ける操作入力部である。また、操作部205は、物理的に構成された押下式やスライド式のボタン等を含んでもよい。  The display unit 204 is a touch panel type liquid crystal display that displays various screens as well as still images and moving images captured by the camera 206. The operation unit 205 is an operation input unit that is integrated with the display unit 204 and accepts user operations. The operation unit 205 may also include physically configured push-type or slide-type buttons, etc.
カメラ206は、情報処理端末101の周辺環境を撮像する撮像部であり、例えば情報処理端末101において表示部204が設けられた裏側に位置することが望ましい。これにより、ユーザはカメラ206で撮影しながら、当該撮像画像を表示部204で確認することができる。なお、カメラ206は単眼カメラであっても、複眼カメラであってもよい。スピーカ207は例えば出力するエフェクトオブジェクトに合わせて音声を出力する。音声データについては、エフェクトごとに予め用意されている。  The camera 206 is an imaging unit that captures images of the surrounding environment of the information processing terminal 101, and is preferably located, for example, behind the display unit 204 on the information processing terminal 101. This allows the user to check the captured image on the display unit 204 while taking an image with the camera 206. The camera 206 may be a monocular camera or a compound eye camera. The speaker 207 outputs sound, for example, in accordance with the effect object to be output. Sound data is prepared in advance for each effect.
<XRギミック> 次に、図3を参照して、本実施形態に係るシステムが提供するXRギミックの一例について説明する。ここでは、アプリケーションサーバ102から、機械学習サーバ103によって生成された学習済みモデルを組み込んだアプリケーションが情報処理端末101にダウンロードされ、インストールされていることを前提とする。当該アプリケーションは、情報処理端末101で実行されることによりXRギミックを提供する。 <XR Gimmick> Next, an example of an XR gimmick provided by the system according to the present embodiment will be described with reference to FIG. 3. Here, it is assumed that an application incorporating a trained model generated by the machine learning server 103 is downloaded from the application server 102 to the information processing terminal 101 and installed. The application provides an XR gimmick by being executed on the information processing terminal 101.
301は所定の模型の一例であり、人型のフィギュアである。本発明を限定する意図はなく、任意の物体等を模した模型であれば本発明に適用することができる。302はフィギュア301を載置した机を示す。ユーザは情報処理端末101上で上記アプリケーションを起動し、アプリケーション画面に表示された複数の項目から当該フィギュア301に対応する項目を選択する。ユーザが当該フィギュア301に対応する項目を選択すると、カメラ206が起動され、ユーザは机302に載置されたフィギュア301を撮影する。ユーザは撮影中において自由に情報処理端末101を動かして、矢印に示すように、フィギュア301を撮影する角度を変更することができる。撮像された映像は情報処理端末101の表示部204に表示される。ここで、後述する操作ボタン等を選択することにより、当該映像にエフェクトオブジェクトを合成して表示させることができる。このように、本システムは、フィギュア301を含む情報処理端末101の周辺環境を撮像した現実空間に、アニメーション等のエフェクトオブジェクトを重畳して出力することにより拡張した現実空間を提供する。  301 is an example of a predetermined model, a human figure. There is no intention to limit the present invention, and any model that imitates any object or the like can be applied to the present invention. 302 indicates a desk on which the figure 301 is placed. The user starts the above application on the information processing terminal 101, and selects the item corresponding to the figure 301 from multiple items displayed on the application screen. When the user selects the item corresponding to the figure 301, the camera 206 is started, and the user takes a picture of the figure 301 placed on the desk 302. The user can freely move the information processing terminal 101 during shooting and change the angle at which the figure 301 is shot, as shown by the arrow. The captured image is displayed on the display unit 204 of the information processing terminal 101. Here, by selecting an operation button or the like described later, an effect object can be synthesized with the image and displayed. In this way, this system provides an augmented real space by superimposing an effect object such as an animation on the captured real space, that is, the imaged surrounding environment of the information processing terminal 101 including the figure 301, and outputting the result.
<XRギミックにおける画面遷移> 次に、図4を参照して、本実施形態に係るXRギミックに係る画面遷移について説明する。図4(a)~図4(d)は、ユーザがXRギミックを提供するアプリケーションを実行して、図3に示すように情報処理端末101を動かした場合の画面遷移について説明する。  <Screen transitions in XR gimmick> Next, with reference to FIG. 4, we will explain the screen transitions related to the XR gimmick of this embodiment. FIG. 4(a) to FIG. 4(d) explain the screen transitions when the user executes an application that provides the XR gimmick and moves the information processing terminal 101 as shown in FIG. 3.
図4(a)に示す画面400は、本実施形態に係るXRギミックを提供するアプリケーションを起動すると表示部204に表示される画面である。ここでは、当該アプリケーションに登録されているフィギュアを選択するための選択ボタン401~405が表示される。ここでは、5つのフィギュアが登録されている例を示すが、さらに多くのフィギュアが登録される場合には、画面を下方向にスクロールすることで未表示の項目が表示され、選択可能となる。各項目には、それぞれ異なるフィギュアが登録されており、ユーザが撮影対象となるフィギュアを選択すると、図4(b)に示す画面410に遷移する。  Screen 400 shown in FIG. 4(a) is a screen that is displayed on the display unit 204 when an application that provides an XR gimmick according to this embodiment is launched. Here, selection buttons 401 to 405 are displayed for selecting figures registered in the application. Here, an example is shown in which five figures are registered, but if more figures are registered, undisplayed items can be displayed and selected by scrolling the screen downwards. Different figures are registered in each item, and when the user selects a figure to be photographed, the screen transitions to screen 410 shown in FIG. 4(b).
画面410では、カメラ206が起動され、カメラ206によって情報処理端末101の周辺環境が撮像され、当該映像が表示部204に表示されている様子を示す。当該周辺環境の撮像画像には、図3に示したように、机302に載置されたフィギュア301が含まれる。また、撮像された映像に加えて、各種ボタン411~413が選択可能に表示される。ボタン411は、撮影中の静止画像を撮像するためのボタンである。ボタン411が操作されると、操作されたタイミングで静止画像を取得し、記憶部202に保存される。ボタン412は、エフェクトを付加するためのボタンである。ボタン412が操作されると、当該フィギュアにおいて登録されている少なくとも1つのエフェクトが選択可能に表示され、さらにユーザは所望のエフェクトを選択することができる。ボタン413は、各種メニューを表示するボタンである。開始したXRギミックを終了させて画面400へ遷移させたり、他の設定等を行ったりすることができる。なお、ここでは3つのボタンを含む例について説明したが、さらに多くの操作ボタンが含まれてもよい。  The screen 410 shows the state in which the camera 206 is started, the camera 206 captures an image of the surrounding environment of the information processing terminal 101, and the image is displayed on the display unit 204. The captured image of the surrounding environment includes the figure 301 placed on the desk 302, as shown in FIG. 3. In addition to the captured image, various buttons 411 to 413 are displayed in a selectable manner. The button 411 is a button for capturing a still image during shooting. When the button 411 is operated, a still image is acquired at the timing of the operation and stored in the storage unit 202. The button 412 is a button for adding an effect. When the button 412 is operated, at least one effect registered for the figure is displayed in a selectable manner, and the user can select a desired effect. The button 413 is a button for displaying various menus. The XR gimmick that has been started can be ended to transition to the screen 400, and other settings can be made. Note that although an example including three buttons has been described here, more operation buttons may be included.
ボタン412を介して所定のエフェクトが選択されると、図4(c)の画面420に示すように、エフェクトの合成出力が開始される。421は映像に合成されたエフェクトオブジェクトであり、フィギュア301を囲むように3つの輪が表示されている。これらの輪は、例えばフィギュア301の頭上から発生し、足下方向に向けてアニメーション表示されてもよい。なお、表示されているエフェクトオブジェクト421は、フィギュア301の前面に表示されている部分と、フィギュア301の背面に隠れて表示されていない部分があることが分かる。これらの表示制御の詳細については後述する。  When a predetermined effect is selected via the button 412, the composite output of the effect begins, as shown on the screen 420 in FIG. 4(c). 421 is an effect object composited into the video, and three rings are displayed so as to surround the figure 301. These rings may be animated, for example, so as to appear above the head of the figure 301 and move toward its feet. It can be seen that the displayed effect object 421 has a portion that is displayed in front of the figure 301 and a portion that is hidden behind the figure 301 and not displayed. Details of this display control will be described later.
図4(d)に示すように、画面430は、エフェクトの出力中において、ユーザが図4(c)の状態からフィギュア301の側面から撮影する状態まで情報処理端末101を動かした際の画面を示す。ここでは、フィギュア301の側面に回った場合においても、エフェクトオブジェクト431に示すように、情報処理端末101から見てフィギュア301の後側に回り込む部分は表示されていないことが分かる。このように、本実施形態に係るエフェクトの合成出力では、カメラ206によって撮影された映像に追従して、エフェクトオブジェクトもフィギュア301の位置関係に応じて表示が変化するものである。詳細な表示制御については後述する。  As shown in FIG. 4(d), screen 430 shows the screen when the user moves the information processing terminal 101 from the state in FIG. 4(c) to a state where the user is photographing the side of the figure 301 while the effect is being output. Here, it can be seen that even when moving around to the side of the figure 301, the part that wraps around to the rear of the figure 301 as seen from the information processing terminal 101 is not displayed, as shown in effect object 431. In this way, in the composite output of effects according to this embodiment, the display of the effect object also changes according to the positional relationship of the figure 301, following the image captured by the camera 206. Detailed display control will be described later.
<エフェクト合成の機能構成> 次に、図5を参照して、本実施形態に係るエフェクト合成出力に係る機能構成について説明する。以下で説明する機能構成は、例えばCPU201が記憶部202に予め記憶された制御プログラムを実行することにより実現されるものである。本情報処理端末101は、エフェクト合成出力に係る機能構成として、画像取得部501、深度情報取得部502、物体認識部503、エフェクト位置決定部504、学習済みモデル505、エフェクト描画部506、合成部507、及び出力部508を含む。  <Functional configuration of effect synthesis> Next, referring to FIG. 5, the functional configuration related to the effect synthesis output according to this embodiment will be described. The functional configuration described below is realized, for example, by the CPU 201 executing a control program pre-stored in the storage unit 202. The information processing terminal 101 includes, as functional configuration related to effect synthesis output, an image acquisition unit 501, a depth information acquisition unit 502, an object recognition unit 503, an effect position determination unit 504, a trained model 505, an effect drawing unit 506, a synthesis unit 507, and an output unit 508.
画像取得部501は、カメラ206によって撮像された撮像画像(RGB画像)を取得する。画像取得部501によって取得されたRGB画像は、深度情報取得部502、物体認識部503、及び合成部507へそれぞれ出力される。深度情報取得部502は、画像取得部501から受け取った撮像画像における各画素について、撮像時におけるカメラ206からの距離情報(深度情報)を取得する。深度情報取得部502は、取得した深度情報を示すグレースケール画像(深度マップ)を生成する。深度情報の取得方法としては、任意の既知の手法を用いてもよく、例えば、ステレオ視や時間差による運動視差を利用して取得する手法や、畳み込みニューラルネットワークを利用して二次元画像から対象物までの距離を推定するように学習させた機械学習済みのモデルによって取得する手法であってもよい。なお、本実施形態に係るXRギミックはリアルタイム性を要するものであるため、処理負荷が低い手法が望ましい。深度情報取得部502は、取得した深度情報(深度マップ)をエフェクト描画部506へ出力する。  The image acquisition unit 501 acquires an image (RGB image) captured by the camera 206. The RGB image acquired by the image acquisition unit 501 is output to the depth information acquisition unit 502, the object recognition unit 503, and the synthesis unit 507. The depth information acquisition unit 502 acquires distance information (depth information) from the camera 206 at the time of capturing for each pixel in the captured image received from the image acquisition unit 501. The depth information acquisition unit 502 generates a grayscale image (depth map) indicating the acquired depth information. Any known method may be used as a method for acquiring depth information, for example, a method of acquiring the information using stereo vision or motion parallax due to a time difference, or a method of acquiring the information using a machine-learned model that has been trained to estimate the distance from a two-dimensional image to an object using a convolutional neural network. Note that since the XR gimmick according to this embodiment requires real-time performance, a method with a low processing load is desirable. The depth information acquisition unit 502 outputs the acquired depth information (depth map) to the effect drawing unit 506.
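The grayscale depth map mentioned here can be illustrated with a minimal sketch: per-pixel camera distances are normalized and inverted so that nearer pixels appear whiter, matching the depth map 610 described later. The function name and the toy distance values are assumptions introduced only for this example.

```python
import numpy as np

def depth_to_grayscale(depth_m: np.ndarray) -> np.ndarray:
    """Convert a per-pixel distance map (metres) into an 8-bit grayscale
    depth map in which nearer pixels are rendered closer to white."""
    d_min, d_max = float(depth_m.min()), float(depth_m.max())
    norm = (depth_m - d_min) / max(d_max - d_min, 1e-6)   # 0 = nearest, 1 = farthest
    return ((1.0 - norm) * 255).astype(np.uint8)          # invert so that near -> white

# Toy example: a 4x4 scene spanning 0.5 m to 2.0 m from the camera.
depth = np.linspace(0.5, 2.0, 16).reshape(4, 4)
print(depth_to_grayscale(depth))
```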
物体認識部503は、学習済みモデル505を用いて、撮像画像に含まれるフィギュア301の少なくとも1つのパーツの姿勢と、カメラ206から当該パーツまでの距離とを認識する。姿勢情報には、当該パーツの形状及び角度の情報が含まれる。より詳細には、物体認識部503は、学習済みモデル505を用いて、認識対象となるパーツの前、後、上、下、左、右、ピッチ、ヨー、及びロールの各方向における角度を検出することができる。ここで、少なくとも1つのパーツとは、フィギュア301等の所定の模型における頭部、胸部、腹部、腰部、腕部、及び脚部の少なくとも1つであり、選択されたエフェクトに関連するパーツである。所定の模型をパーツごとに分割する粒度については任意である。例えば、可動フィギュアの場合には、関節を有するパーツごとに分割することが望ましい。これにより、可動するパーツごとに形状や姿勢等を認識することができ、可動が行われた場合であっても認識誤りを低減させることができる。  The object recognition unit 503 uses the trained model 505 to recognize the posture of at least one part of the figure 301 included in the captured image and the distance from the camera 206 to the part. The posture information includes information on the shape and angle of the part. More specifically, the object recognition unit 503 can use the trained model 505 to detect the angles of the part to be recognized in each of the directions of front, back, up, down, left, right, pitch, yaw, and roll. Here, at least one part is at least one of the head, chest, abdomen, waist, arms, and legs in a specified model such as the figure 301, and is a part related to the selected effect. The granularity of dividing the specified model into parts is arbitrary. For example, in the case of a movable figure, it is desirable to divide it into parts having joints. This makes it possible to recognize the shape, posture, etc. of each movable part, and reduces recognition errors even when movement is performed.
また、エフェクトに関連するパーツとは、生成するエフェクトオブジェクトの近傍に位置するパーツを示す。これは、撮像画像に合成するエフェクトオブジェクトがフィギュア301との位置関係を考慮して配置されるものであり、生成するエフェクトオブジェクトの位置を決定するためである。例えば、所定の模型において胸部の一部から光線を出力するエフェクトオブジェクトを生成する場合には、当該模型の胸部の姿勢及び撮像画像におけるカメラ206から当該模型の胸部までの距離を認識することで、エフェクトオブジェクトの発生位置や発生方向を決定することができる。  Furthermore, parts related to an effect refer to parts located near the effect object to be generated. This is because the effect object to be composited into the captured image is positioned taking into consideration its positional relationship with the figure 301, and the position of the effect object to be generated is determined. For example, when generating an effect object that outputs a ray from part of the chest of a specific model, the position and direction in which the effect object will be generated can be determined by recognizing the orientation of the chest of the model and the distance from the camera 206 to the chest of the model in the captured image.
このように、本実施形態によれば、所定の模型全体の姿勢及び撮像画像におけるカメラ206から当該模型までの距離を認識するものではなく、生成するエフェクトオブジェクトに関連する少なくとも1つのパーツのみを認識する。これにより、所定の模型全体を認識する場合と比較して、高速に処理することができ、XRギミックのリアルタイム性を保証することができる。なお、物体認識部503は、選択された模型の三次元形状モデルの情報を予め保持しているため、一部のパーツの姿勢及び距離を認識することにより、他のパーツの姿勢及び距離をある程度推定することも可能である。物体認識部503は、生成するエフェクトオブジェクトに関連する少なくとも1つのパーツの姿勢及び距離に関する情報を認識すると、当該情報をエフェクト位置決定部504へ出力する。  Thus, according to this embodiment, rather than recognizing the posture of the entire specified model and the distance from the camera 206 to the model in the captured image, only at least one part related to the effect object to be generated is recognized. This allows for faster processing compared to recognizing the entire specified model, and ensures the real-time performance of the XR gimmick. Note that since the object recognition unit 503 holds in advance information on the three-dimensional shape model of the selected model, it is also possible to estimate to some extent the posture and distance of other parts by recognizing the posture and distance of some parts. When the object recognition unit 503 recognizes information related to the posture and distance of at least one part related to the effect object to be generated, it outputs the information to the effect position determination unit 504.
エフェクト位置決定部504は、取得した少なくとも1つのパーツの姿勢及びカメラ206からの距離に関する情報に基づいて、生成するエフェクトオブジェクトの位置情報を決定する。当該位置情報には、エフェクトオブジェクトについての少なくとも姿勢(角度)及びカメラ206からの距離に関する情報が含まれる。決定した位置情報はエフェクト描画部506に出力される。エフェクト描画部506では生成するエフェクトオブジェクトのモデル情報を予め保持しているため、ここでは当該エフェクトオブジェクトの基準位置をフィギュア301の所定位置と関連付けて定義した情報が出力されうる。つまり、生成するエフェクトオブジェクトの位置情報は、エフェクト描画部506がエフェクトオブジェクトを描画するために必要となる情報を含んでいればよく、例えば当該エフェクトオブジェクトの姿勢(角度)及びカメラ206からの距離に関する情報を示すものであればよい。  The effect position determination unit 504 determines position information of the effect object to be generated based on the acquired information on the attitude of at least one part and the distance from the camera 206. The position information includes information on at least the attitude (angle) and distance from the camera 206 for the effect object. The determined position information is output to the effect drawing unit 506. Since the effect drawing unit 506 holds model information of the effect object to be generated in advance, information that defines the reference position of the effect object in association with a specified position of the figure 301 can be output here. In other words, the position information of the effect object to be generated only needs to include information required for the effect drawing unit 506 to draw the effect object, and may, for example, indicate information on the attitude (angle) of the effect object and the distance from the camera 206.
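A minimal sketch of this position determination, under the assumption that a part's recognized posture is expressed as pitch/yaw/roll angles plus a camera-space position, and that the effect is anchored at a fixed offset defined in that part's local coordinate frame (the rotation order, function names, and numeric values are illustrative, not part of the disclosure):

```python
import numpy as np

def rotation_matrix(pitch, yaw, roll):
    """Rotation built from pitch/yaw/roll angles in radians (Z-Y-X order is an assumption)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rz @ ry @ rx

def effect_pose_from_part(part_position, part_angles, local_offset):
    """Place the effect at a fixed offset defined in the recognized part's local
    frame; returns the effect's camera-space position and rotation."""
    r = rotation_matrix(*part_angles)
    return part_position + r @ local_offset, r

# Hypothetical values: a chest part 0.6 m in front of the camera, rotated 20 degrees in yaw.
pos, rot = effect_pose_from_part(np.array([0.0, 0.1, 0.6]),
                                 (0.0, np.deg2rad(20), 0.0),
                                 np.array([0.0, 0.0, 0.05]))
print(pos)
```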
エフェクト描画部506は、深度情報取得部502から取得した深度情報と、エフェクト位置決定部504から取得したエフェクトオブジェクトの姿勢及び距離に関する情報とに基づいて、エフェクトオブジェクトを描画する。エフェクト描画部506は、上述したように、生成するエフェクトについての予め保持しているモデル情報に従って描画を行う。より詳細には、エフェクト描画部506は、撮像画像の各画素の深度情報(距離情報)に応じて、対応するエフェクトオブジェクトの描画画素のうち、撮像画像の対応する画素よりもカメラ206に近いことを示す画素について描画する。一方、エフェクト描画部506は、対応するエフェクトオブジェクトの描画画素のうち、撮像画像の対応する画素よりもカメラ206に近いことを示さない画素については描画しない。これにより、例えばフィギュア301の背面に隠れるエフェクトオブジェクトは描画されず、フィギュア301の前面に露出するエフェクトオブジェクトのみが描画されることになる。エフェクト描画部506は、描画したエフェクトオブジェクト画像を合成部507へ出力する。  The effect drawing unit 506 draws an effect object based on the depth information acquired from the depth information acquisition unit 502 and the information on the posture and distance of the effect object acquired from the effect position determination unit 504. As described above, the effect drawing unit 506 performs drawing according to the model information previously stored for the effect to be generated. More specifically, the effect drawing unit 506 draws pixels that indicate that the corresponding effect object is closer to the camera 206 than the corresponding pixel of the captured image, among the drawing pixels of the corresponding effect object, according to the depth information (distance information) of each pixel of the captured image. On the other hand, the effect drawing unit 506 does not draw pixels that do not indicate that the corresponding effect object is closer to the camera 206 than the corresponding pixel of the captured image, among the drawing pixels of the corresponding effect object. As a result, for example, an effect object hidden behind the figure 301 is not drawn, and only an effect object exposed in front of the figure 301 is drawn. The effect drawing unit 506 outputs the drawn effect object image to the synthesis unit 507.
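The per-pixel visibility rule described here can be sketched compactly. In this illustrative example (the array names and the vectorized formulation are assumptions), an effect pixel is drawn only when its distance is smaller than the captured image's distance at the same position:

```python
import numpy as np

def draw_effect(effect_rgb, effect_depth, scene_depth):
    """Rasterized effect layer: a pixel is kept only when the effect is
    nearer to the camera than the real scene at that pixel (smaller depth)."""
    visible = effect_depth < scene_depth          # boolean occlusion mask
    layer = np.zeros_like(effect_rgb)
    layer[visible] = effect_rgb[visible]          # copy only the visible effect pixels
    return layer, visible

# Toy 2x2 example: the right column of the effect lies behind the scene and is culled.
effect_rgb = np.full((2, 2, 3), 200, dtype=np.uint8)
effect_depth = np.array([[0.4, 0.9], [0.4, 0.9]])
scene_depth = np.array([[0.6, 0.6], [0.6, 0.6]])
layer, mask = draw_effect(effect_rgb, effect_depth, scene_depth)
print(mask)   # [[ True False], [ True False]]
```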
合成部507は、画像取得部501から取得した撮像画像に対して、当該撮像画像に基づいて生成され、エフェクト描画部506から取得したエフェクトオブジェクト画像を重畳して現実空間にエフェクト画像を付加した合成画像を生成する。また、合成部507は、環境の輝度調整などを行うことにより、合成した画像の最終調整や品質調整を行ってもよい。例えば、選択されたエフェクトに合わせて、よりエフェクトを強調して表示する場合には、現実空間の画像を暗くするなどの調整を行うことができる。合成画像は出力部508に渡され、出力部508は、合成画像を表示部204に表示する。  The synthesis unit 507 generates a synthetic image by superimposing the effect object image acquired from the effect drawing unit 506, which is generated based on the captured image acquired from the image acquisition unit 501, and adding the effect image to the real space. The synthesis unit 507 may also perform final adjustments and quality adjustments to the synthesized image by adjusting the brightness of the environment, etc. For example, when displaying an effect with more emphasis in accordance with a selected effect, adjustments such as darkening the image in the real space can be made. The synthetic image is passed to the output unit 508, which displays the synthetic image on the display unit 204.
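A hedged sketch of the superimposition and the optional brightness adjustment described above (the darkening factor and helper names are assumptions; the disclosure does not prescribe a specific blending formula):

```python
import numpy as np

def composite(captured_rgb, effect_layer, visible_mask, dim_background=0.7):
    """Overlay the drawn effect pixels onto the captured frame; optionally darken
    the real-space pixels so that the selected effect stands out."""
    out = (captured_rgb.astype(np.float32) * dim_background).astype(np.uint8)
    out[visible_mask] = effect_layer[visible_mask]
    return out

# Toy frame: one effect pixel at (0, 0); the rest of the frame is dimmed real space.
frame = np.full((2, 2, 3), 120, dtype=np.uint8)
layer = np.zeros((2, 2, 3), dtype=np.uint8)
layer[0, 0] = 255
mask = np.zeros((2, 2), dtype=bool)
mask[0, 0] = True
result = composite(frame, layer, mask)
print(result[0, 0], result[1, 1])    # effect pixel vs. dimmed background pixel
```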
各部はカメラ206によって継続的に取得される撮像画像に対して上述した一連の処理を周期的(例えば、30msec、60msec、90msecなどの周期)に実行してもよい。この場合、出力部508によって表示される画像は動画像となる。なお、付加されるエフェクトオブジェクトについても動的に変化するアニメーションとして表示されてもよい。この場合、生成するエフェクトごとに、周期的な処理に合わせてアニメーションを構成する連続的な複数の画像が予め保持されている。さらに、出力部508は、表示した合成画像(エフェクトのアニメーション)に合わせて、スピーカ207によって所定の音声を出力することも可能である。  Each unit may execute the above-described series of processes periodically (e.g., at intervals of 30 msec, 60 msec, or 90 msec) on the captured images continuously acquired by the camera 206. In this case, the image displayed by the output unit 508 becomes a moving image. The added effect object may also be displayed as a dynamically changing animation. In this case, for each effect to be generated, a plurality of consecutive images that compose the animation in accordance with the periodic processing are stored in advance. Furthermore, the output unit 508 can also output a predetermined sound from the speaker 207 in accordance with the displayed composite image (the effect animation).
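The periodic execution described above might be organized as a simple fixed-period loop; the callback names, the 30 msec figure (one of the periods mentioned), and the stand-in functions are illustrative assumptions:

```python
import time

FRAME_PERIOD_SEC = 0.033   # roughly 30 msec per cycle

def run_xr_loop(capture_frame, process_frame, display, max_frames=10):
    """Drive the capture -> recognize/draw -> composite -> display pipeline
    at a fixed period; the effect animation advances one step per cycle."""
    for frame_index in range(max_frames):
        started = time.monotonic()
        frame = capture_frame()
        composite_image = process_frame(frame, frame_index)   # one animation step per cycle
        display(composite_image)
        # Sleep only for the remainder of the period to keep a stable cadence.
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, FRAME_PERIOD_SEC - elapsed))

# Stand-in callbacks so the sketch runs on its own.
run_xr_loop(lambda: "frame",
            lambda f, i: f"{f}+effect[{i}]",
            print,
            max_frames=3)
```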
<エフェクト合成における処理画像> 次に、図6を参照して、本実施形態に係るエフェクト合成の処理手順に応じた一連の処理画像について説明する。ここでは、図3に示すフィギュア301及び机302を含む情報処理端末101の周辺環境を撮像した撮像画像に対してエフェクトを合成する例について説明する。  <Processed images in effect compositing> Next, with reference to FIG. 6, a series of processed images according to the processing procedure for effect compositing according to this embodiment will be described. Here, an example of compositing an effect on a captured image of the surrounding environment of the information processing terminal 101, including the figure 301 and desk 302 shown in FIG. 3, will be described.
600はカメラ206によって撮像された撮像画像を示す。撮像画像600には机302に載置されたフィギュア301が含まれている。610は深度情報取得部502によって撮像画像600から得られた深度情報である、グレースケールの深度マップを示す。深度マップ610では、各画素において、白に近いほどカメラ206からの距離が近いことを示す。  Reference numeral 600 denotes an image captured by the camera 206. The captured image 600 includes a figure 301 placed on a desk 302. Reference numeral 610 denotes a grayscale depth map, which is depth information obtained from the captured image 600 by the depth information acquisition unit 502. In the depth map 610, the closer each pixel is to white, the closer the distance from the camera 206 is.
620は、物体認識部503によって生成されるエフェクトオブジェクトに関連する少なくとも1つのパーツを学習済みモデル505を用いて認識している様子を示す。ここでは、例えばフィギュア301の頭部621及び胸部622の認識が行われている。このように、本実施形態によれば、撮像画像600に含まれるフィギュア301の全体を認識するのではなく、生成するエフェクトオブジェクトに関連する一部のパーツのみが認識される。  620 shows how at least one part related to the effect object generated by the object recognition unit 503 is recognized using the trained model 505. Here, for example, the head 621 and chest 622 of the figure 301 are recognized. In this way, according to this embodiment, only some of the parts related to the effect object to be generated are recognized, rather than recognizing the entire figure 301 contained in the captured image 600.
630は、620で認識されたパーツに基づいて、位置情報が決定されたエフェクトオブジェクト631のモデルを示す。ここでは、生成するエフェクトオブジェクト631の全体の位置情報が決定される。なお、エフェクトオブジェクト631のモデル情報を予め保持しているため、当該オブジェクトを描画するための姿勢やカメラ206からの距離に関する情報が含まれればよい。  630 shows a model of an effect object 631 whose position information has been determined based on the parts recognized in 620. Here, the overall position information of the effect object 631 to be generated is determined. Note that since the model information of the effect object 631 is stored in advance, it is sufficient that the information included is related to the posture for rendering the object and the distance from the camera 206.
640は合成されるエフェクトオブジェクト画像を示す。エフェクトオブジェクト画像640では、エフェクト描画部506が深度マップ610とエフェクトオブジェクトのモデル630とを用いて、撮像画像に合成するエフェクトオブジェクト641を描画する。エフェクトオブジェクト641は、エフェクトオブジェクトの全体を示すエフェクトオブジェクト631と比較すると、描画されていない部分が含まれる。これは、深度マップ610から得られる距離情報と、モデル630から得られるエフェクトオブジェクト631の距離情報とを比較し、撮像画像の対応画素よりも前面に位置する(つまり、カメラ206に近い)エフェクトオブジェクトのみを描画したためである。  640 indicates the effect object image to be composited. In effect object image 640, effect rendering unit 506 uses depth map 610 and effect object model 630 to render effect object 641 to be composited into the captured image. Compared to effect object 631, which shows the entire effect object, effect object 641 includes parts that are not rendered. This is because distance information obtained from depth map 610 is compared with distance information of effect object 631 obtained from model 630, and only effect objects located in front of the corresponding pixel in the captured image (i.e., closer to camera 206) are rendered.
650は、撮像画像600に対してエフェクトオブジェクト画像640を重畳した合成画像を示す。合成画像650は、エフェクトオブジェクト画像640が単に撮像画像600に対して重ね合わせて生成された画像である。しかしながら、合成画像650では、エフェクトオブジェクト631のうち、フィギュア301の画像部分に重複し、かつフィギュア301の背面に隠れる部分は描画されていないことが分かる。このように、本実施形態によれば、より立体感や臨場感を有するXRギミックを提供することができる。なお、本実施形態では、対象物に隠れるエフェクトの部分を描画することなく、前面に露出する部分のみを描画する。従って、一度描画したエフェクトオブジェクトのうち、フィギュアとの位置関係に応じて隠れる部分を消去する制御と比較して、より処理負荷を低減した処理を実現でき、高速に処理を行うことができる。  650 shows a composite image in which the effect object image 640 is superimposed on the captured image 600. The composite image 650 is an image generated by simply superimposing the effect object image 640 on the captured image 600. However, it can be seen that in the composite image 650, the part of the effect object 631 that overlaps with the image portion of the figure 301 and is hidden behind the figure 301 is not drawn. In this way, this embodiment can provide an XR gimmick with a greater sense of depth and presence. Note that in this embodiment, only the part exposed to the front is drawn, without drawing the part of the effect hidden by the target object. Therefore, compared to control in which the part of an already drawn effect object that is hidden according to its positional relationship with the figure is erased, processing with a lower processing load can be realized and can be performed at a higher speed.
<基本制御> 次に、図7を参照して、本実施形態に係るXRギミックを提供するアプリケーションにおける基本制御の処理手順を説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。  <Basic control> Next, with reference to FIG. 7, the processing procedure of basic control in the application that provides the XR gimmick according to this embodiment will be described. The processing described below is realized, for example, by the CPU 201 reading and executing a control program that has been pre-stored in the memory unit 202.
まずS101でCPU201は、本実施形態に係るXRギミックを提供するアプリケーションが起動されると、メニューを選択可能に表示する画面400を表示部204に表示する。続いて、S102でCPU201は、画面400や当該画面400から遷移する設定画面(不図示)等を介して、ユーザ操作に応じて選択された情報を取得する。ここでの選択情報には、例えばカメラ206で撮像して表示する所定の模型に関する情報が含まれる。CPU201は、選択された情報に応じてカメラ206による撮像を開始させる。撮像画像は、画面410に示すように、表示部204に表示される。  First, in S101, when an application that provides the XR gimmick according to this embodiment is launched, the CPU 201 displays a screen 400 that displays a selectable menu on the display unit 204. Next, in S102, the CPU 201 acquires information selected in response to a user operation via the screen 400 or a setting screen (not shown) to which the screen 400 transitions. The selected information here includes, for example, information about a specific model that is captured by the camera 206 and displayed. The CPU 201 starts capturing images with the camera 206 in response to the selected information. The captured image is displayed on the display unit 204, as shown on screen 410.
次に、S103でCPU201は、ボタン412を介してエフェクト出力が選択されたかどうかを判断する。選択された場合は処理をS106へ進め、そうでない場合は処理をS104へ進める。S104でCPU201は、カメラ206によって撮像された撮像画像を取得して表示部204へ表示し、処理をS107へ進める。一方、S106でCPU201は、撮像画像にエフェクトを合成して出力し、処理をS107へ進める。S106の詳細な処理については図8を用いて後述する。S107でCPU201は、映像の出力を終了するか否かを判断し、終了しない場合は処理をS103に戻し、終了する場合は本フローチャートの処理を終了する。例えば、ボタン413を介して画面400に戻る指示が行われた場合や、当該アプリケーションが終了された場合に、CPU201は映像の出力を終了すると判断して、カメラ206の起動を停止する。  Next, in S103, the CPU 201 determines whether or not effect output has been selected via the button 412. If it has been selected, the process proceeds to S106; if not, the process proceeds to S104. In S104, the CPU 201 acquires an image captured by the camera 206, displays it on the display unit 204, and advances the process to S107. On the other hand, in S106, the CPU 201 combines an effect with the captured image and outputs it, and advances the process to S107. The detailed processing of S106 will be described later with reference to FIG. 8. In S107, the CPU 201 determines whether or not to end the video output; if not, the process returns to S103, and if so, the processing of this flowchart ends. For example, if an instruction to return to the screen 400 is given via the button 413 or if the application is ended, the CPU 201 determines that the video output is to be ended and stops the camera 206.
<エフェクト合成出力制御> 次に、図8を参照して、本実施形態に係るエフェクト合成出力(S106)の処理手順について説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。  <Effects composite output control> Next, the processing procedure for effects composite output (S106) according to this embodiment will be described with reference to FIG. 8. The processing described below is realized, for example, by the CPU 201 reading and executing a control program or the like that is pre-stored in the storage unit 202.
まずS201でCPU201は、ボタン412を介して選択されたエフェクト情報を取得する。エフェクト情報には、生成するエフェクトを識別するための識別情報や、当該エフェクトが関連する少なくとも1つのパーツの情報等を含む。これらの情報はアプリケーションサーバ102から受信して記憶部202に予め記憶されている情報である。続いて、S202でCPU201は、カメラ206によって撮像された処理対象の撮像画像を取得する。  First, in S201, the CPU 201 acquires effect information selected via the button 412. The effect information includes identification information for identifying the effect to be generated, information on at least one part to which the effect is related, and the like. This information is received from the application server 102 and is pre-stored in the storage unit 202. Next, in S202, the CPU 201 acquires a captured image of the processing target captured by the camera 206.
次に、S203でCPU201は、深度情報取得部502によって、S202で取得した撮像画像の深度マップを取得する。また、S204でCPU201は、S202で取得した撮像画像を学習済みモデル505に入力し、撮像画像に含まれる模型(ここでは、フィギュア301)についてS201で取得したエフェクトに関連する少なくとも1つのパーツの物体認識を行う。物体認識の詳細な制御については図9を用いて後述する。物体認識が行われると、S205でCPU201は、S204で認識された少なくとも1つのパーツの姿勢及び距離に関する情報に基づいて、生成するエフェクトの位置情報を決定する。位置情報には、上述したように、エフェクト画像を生成するための情報として、生成するエフェクトの姿勢(角度)及びカメラ206からの距離に関する情報が含まれる。なお、S203と、S204及びS205との処理順序は説明を容易にするために順序付けて説明したが、深度マップを取得する処理と、エフェクトオブジェクトの位置決定を行う処理とは逆の順序で行われてもよく、並行して行われるものであってよい。  Next, in S203, the CPU 201 uses the depth information acquisition unit 502 to acquire a depth map of the captured image acquired in S202. In addition, in S204, the CPU 201 inputs the captured image acquired in S202 to the trained model 505, and performs object recognition of at least one part, related to the effect acquired in S201, of the model (here, the figure 301) included in the captured image. Detailed control of the object recognition will be described later with reference to FIG. 9. After the object recognition is performed, in S205, the CPU 201 determines position information of the effect to be generated based on the information on the attitude and distance of the at least one part recognized in S204. As described above, the position information includes, as information for generating the effect image, information on the attitude (angle) of the effect to be generated and the distance from the camera 206. Note that although S203, S204, and S205 have been described in this order for ease of explanation, the process of acquiring the depth map and the processes of determining the position of the effect object may be performed in the reverse order, or may be performed in parallel.
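Because acquiring the depth map (S203) and recognizing the parts (S204) depend only on the captured image, the parallel execution mentioned above could, for example, be sketched with two concurrent tasks. The placeholder functions below are assumptions standing in for the actual S203/S204 processing:

```python
from concurrent.futures import ThreadPoolExecutor

def build_depth_map(image):
    """Placeholder for S203: derive per-pixel distances from the captured image."""
    return {"depth_map_for": image}

def recognize_parts(image, effect_parts):
    """Placeholder for S204: recognize only the parts related to the selected effect."""
    return {part: "pose+distance" for part in effect_parts}

image = "captured_image"
with ThreadPoolExecutor(max_workers=2) as pool:
    depth_future = pool.submit(build_depth_map, image)
    parts_future = pool.submit(recognize_parts, image, ("head", "chest"))
    depth_map = depth_future.result()
    recognized = parts_future.result()
print(depth_map, recognized)
```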
次に、S206でCPU201は、S203で取得された深度マップと、S205で決定されたエフェクトオブジェクトの位置情報とに基づいてエフェクトオブジェクトの画像を生成する。エフェクトオブジェクトの画像の生成制御については図10を用いて後述する。続いて、S207でCPU201は、S202で取得した撮像画像に対して、S206で生成したエフェクトオブジェクト画像を重畳して合成する。その後、S208でCPU201は、合成画像を表示部204に表示するとともに、必要に応じてスピーカ207から音声を出力し、本フローチャートの処理を終了する。  Next, in S206, the CPU 201 generates an image of the effect object based on the depth map acquired in S203 and the position information of the effect object determined in S205. The generation control of the image of the effect object will be described later with reference to FIG. 10. Next, in S207, the CPU 201 superimposes the effect object image generated in S206 onto the captured image acquired in S202 to synthesize them. After that, in S208, the CPU 201 displays the synthesized image on the display unit 204 and outputs sound from the speaker 207 as necessary, and ends the processing of this flowchart.
<物体認識制御> 次に、図9を参照して、本実施形態に係る物体認識(S204)の処理手順について説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。  <Object Recognition Control> Next, the processing procedure for object recognition (S204) according to this embodiment will be described with reference to FIG. 9. The processing described below is realized, for example, by the CPU 201 reading and executing a control program or the like that is pre-stored in the memory unit 202.
まずS301でCPU201は、S201で取得したエフェクト情報に基づいて、生成するエフェクトに関連するパーツを特定する。例えば図4や図6で説明したXRギミックの例では、フィギュア301の頭部及び胸部をエフェクトに関連するパーツとして特定する。続いて、S302でCPU201は、S301で特定した少なくとも1つのパーツについて、学習済みモデル505を用いて撮像画像に含まれる当該パーツを認識する。  First, in S301, the CPU 201 identifies parts related to the effect to be generated based on the effect information acquired in S201. For example, in the example of the XR gimmick described in FIG. 4 and FIG. 6, the head and chest of the figure 301 are identified as parts related to the effect. Next, in S302, the CPU 201 uses the trained model 505 to recognize at least one part identified in S301 that is included in the captured image.
次に、S303でCPU201は、学習済みモデル505の出力結果から、認識したパーツの形状、角度、及び距離に関する情報を取得する。その後、S304でCPU201は、S301で特定されたパーツのうち、未解析のパーツがあるかどうかを判断する。未解析のパーツがあれば、処理をS302に戻し、未解析のパーツが無ければ本フローチャートの処理を終了する。  Next, in S303, the CPU 201 obtains information regarding the shape, angle, and distance of the recognized part from the output result of the trained model 505. After that, in S304, the CPU 201 determines whether or not there are any unanalyzed parts among the parts identified in S301. If there are any unanalyzed parts, the process returns to S302, and if there are no unanalyzed parts, the process of this flowchart ends.
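A compact sketch of this S301 to S304 loop, restricted to the parts tied to the selected effect; the dictionary layout and the stand-in recognizer are assumptions introduced for illustration:

```python
def recognize_effect_parts(captured_image, effect_info, recognize_part):
    """Loop over only the parts tied to the selected effect (S301 to S304):
    each part is run through the trained model and its shape, angles and
    camera distance are collected; unrelated parts are never analysed."""
    results = {}
    for part in effect_info["related_parts"]:               # S301: parts tied to the effect
        results[part] = recognize_part(captured_image, part)  # S302/S303: recognize and read out
    return results                                           # S304: stop when none remain

# Hypothetical stand-in for the trained model's per-part output.
fake_model = lambda img, part: {"shape": "...", "angles": (0.0, 0.0, 0.0), "distance_m": 0.6}
print(recognize_effect_parts("frame", {"related_parts": ["head", "chest"]}, fake_model))
```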
<エフェクトオブジェクトの生成制御> 次に、図10を参照して、本実施形態に係るエフェクトオブジェクト生成(S206)の処理手順について説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。  <Effect object generation control> Next, the processing procedure for generating an effect object (S206) according to this embodiment will be described with reference to FIG. 10. The processing described below is realized, for example, by the CPU 201 reading and executing a control program or the like that is pre-stored in the storage unit 202.
まずS401でCPU201は、S205で決定されたエフェクトオブジェクトの位置情報と、予め保持している生成するエフェクトオブジェクトのモデル情報とに基づいて、生成するエフェクトオブジェクトの画素位置xを初期化する。ここではエフェクトオブジェクトの全画素について後述する処理を実施するため、例えば初期値としてエフェクトオブジェクトの左上の画素位置を画素位置xとして設定する。  First, in S401, the CPU 201 initializes the pixel position x of the effect object to be generated based on the position information of the effect object determined in S205 and the model information of the effect object to be generated that has been stored in advance. Here, in order to perform the processing described below on all pixels of the effect object, for example, the upper left pixel position of the effect object is set as the initial value for pixel position x.
次にS402でCPU201は、エフェクトオブジェクトの処理対象の画素位置xと、対応する撮像画像の画素位置yとのそれぞれの距離情報を比較する。続いて、S403でCPU201は、比較の結果、エフェクトオブジェクトの方が前方に位置するかどうか(カメラ206に近いかどうか)を判断する。エフェクトオブジェクトが前方であればS404に進み、そうでなければS405へ進む。S404でCPU201は、対応画素のエフェクトオブジェクトを描画し、S405に進む。S405でCPU201は、エフェクトオブジェクトの全ての画素について、対応する撮像画像の画素と比較したかどうかを判断する。全ての画素について処理が終了すると、本フローチャートの処理を終了し、そうでない場合は処理をS402へ戻す。  Next, in S402, the CPU 201 compares the distance information of the pixel position x of the effect object being processed with that of the corresponding pixel position y of the captured image. Next, in S403, the CPU 201 determines from the comparison whether the effect object is located in front (closer to the camera 206). If the effect object is in front, the process proceeds to S404; otherwise, it proceeds to S405. In S404, the CPU 201 draws the effect object at the corresponding pixel and proceeds to S405. In S405, the CPU 201 determines whether all pixels of the effect object have been compared with the corresponding pixels of the captured image. When processing has been completed for all pixels, the processing of this flowchart ends; otherwise, the process returns to S402.
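The S401 to S405 pixel loop can be expressed directly as follows; the data layout (a mapping from pixel position to color and effect distance) is an assumption chosen only to keep the sketch short:

```python
def generate_effect_image(effect_pixels, scene_depth, width, height):
    """Pixel-by-pixel version of the flowchart: for every effect pixel x,
    compare its distance with the captured image's distance at the same
    position (S402/S403) and draw it only when the effect is nearer (S404)."""
    drawn = {}
    for (x, y), (color, effect_distance) in effect_pixels.items():   # S401/S405: scan all pixels
        if 0 <= x < width and 0 <= y < height and effect_distance < scene_depth[y][x]:
            drawn[(x, y)] = color                                     # S404: draw this pixel
    return drawn

scene_depth = [[0.6, 0.6], [0.6, 0.6]]
effect_pixels = {(0, 0): ("ring", 0.4), (1, 0): ("ring", 0.9)}        # nearer / farther than the scene
print(generate_effect_image(effect_pixels, scene_depth, 2, 2))         # only (0, 0) is drawn
```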
以上説明したように、本実施形態に係る情報処理端末は、所定の模型を含む周辺環境を撮像し、撮像された撮像画像の各画素について、カメラからの距離情報を取得する。また、本情報処理端末は、撮像画像に含まれる所定の模型の少なくとも1つのパーツの姿勢及びカメラからの距離に関する情報を認識し、認識した少なくとも1つのパーツを基準に、生成するオブジェクトの位置情報を決定する。さらに、本情報処理端末は、生成するオブジェクトの各画素のうち、それぞれの位置情報が撮像画像の対応する画素の距離情報よりもカメラに近いことを示す画素をオブジェクトとして描画し、オブジェクト画像を生成し、撮像画像にオブジェクト画像を重畳した合成画像を表示部に出力する。一方、本情報処理端末は、生成するオブジェクトの各画素のうち、それぞれの位置情報が撮像画像の対応する画素の距離情報よりもカメラに近いことを示さない画素については描画しない。つまり、本実施形態によれば、合成するエフェクトオブジェクトのうち、所定物よりも前方に位置する部分については描画し、所定物よりも後方に位置する部分については所定物に隠れるため描画しない。このように、本発明は現実空間の撮像画像に好適にオブジェクトを合成して出力することができる。  As described above, the information processing terminal according to this embodiment captures an image of the surrounding environment including a predetermined model, and acquires distance information from the camera for each pixel of the captured image. In addition, the information processing terminal recognizes information on the posture of at least one part of the predetermined model included in the captured image and the distance from the camera, and determines position information of an object to be generated based on the recognized at least one part. Furthermore, the information processing terminal draws, as an object, those pixels of the object to be generated whose position information indicates that they are closer to the camera than the distance information of the corresponding pixel in the captured image, generates an object image, and outputs a composite image in which the object image is superimposed on the captured image to the display unit. On the other hand, the information processing terminal does not draw those pixels of the object to be generated whose position information does not indicate that they are closer to the camera than the distance information of the corresponding pixel in the captured image. In other words, according to this embodiment, of the effect object to be synthesized, a portion located in front of the predetermined object is drawn, and a portion located behind the predetermined object is not drawn because it is hidden by the predetermined object. In this way, the present invention can suitably synthesize an object into a captured image of real space and output it.
<第2の実施形態> 以下では本発明の第2の実施形態について説明する。上記第1の実施形態では物体認識(S204)において、エフェクトオブジェクトを生成するための基準位置を認識すべく、少なくとも1つのパーツを認識する制御について説明した。また、上記実施形態では、エフェクトオブジェクトの描画については撮像画像から生成したデプスマップを用いてエフェクトオブジェクトの位置と、撮像画像の各画素の位置とを比較して描画の有無を制御する例について説明した。  <Second embodiment> The second embodiment of the present invention will be described below. In the above first embodiment, a control for recognizing at least one part in order to recognize a reference position for generating an effect object in object recognition (S204) was described. Also, in the above embodiment, an example was described in which, regarding the rendering of an effect object, a depth map generated from a captured image is used to compare the position of the effect object with the position of each pixel in the captured image to control whether or not to render it.
しかしながら、デプスマップの精度は、カメラの性能や光量等の撮像時の環境条件に応じて変動するものである。そこで、本実施形態では、物体認識(S204)において、上記少なくとも1つのパーツの認識に加えて、対象模型のボーン構造を構築して、デプスマップを補完する制御について説明する。また、構築したボーン構造を利用することにより、対象模型の姿勢を判定することができ、判定した姿勢に応じてエフェクトオブジェクトを動的に変化させることができる。詳細については後述する。  However, the accuracy of the depth map varies depending on the camera's performance and the environmental conditions at the time of image capture, such as the amount of light. Therefore, in this embodiment, in addition to recognizing at least one part as described above, in object recognition (S204), a bone structure of the target model is constructed to complement the depth map, and control is described. Furthermore, by using the constructed bone structure, the posture of the target model can be determined, and the effect object can be dynamically changed according to the determined posture. Details will be described later.
<ボーン構造> まず図11を参照して、本実施形態に係る模型のボーン構造について説明する。1100は、撮像画像に含まれる所定の模型であるフィギュアのボーン構造を示す。1100では撮像画像に含まれるフィギュア301及び机302は点線で示す。  <Bone structure> First, the bone structure of the model according to this embodiment will be described with reference to FIG. 11. Reference numeral 1100 indicates the bone structure of a figure, which is a specified model, contained in the captured image. In 1100, the figure 301 and desk 302 contained in the captured image are indicated by dotted lines.
図11に示す1101などの黒丸は、フィギュア301の特徴点を示す。これらの特徴点はフィギュアの大まかなアウトラインを生成するための点であり、その数や位置を限定する意図はない。1102は各特徴点を連結したボーン構造を示す。図11に示すボーン構造1102は、フィギュア301の基準ボーン構造を示す。基準ボーン構造とは、所定の模型の基準姿勢から得られるボーン構造であり、模型ごとに予め用意されているデータである。基準ボーン構造は、例えば対象模型の3次元データから、ポリゴン数を軽減した大まかなアウトラインからなる三次元データから得ることができる。  Black circles such as 1101 in Figure 11 indicate feature points of figure 301. These feature points are used to generate a rough outline of the figure, and there is no intention to limit their number or position. 1102 indicates the bone structure connecting each feature point. Bone structure 1102 in Figure 11 indicates the reference bone structure of figure 301. The reference bone structure is a bone structure obtained from the reference posture of a specified model, and is data prepared in advance for each model. The reference bone structure can be obtained, for example, from three-dimensional data consisting of a rough outline with a reduced number of polygons from the three-dimensional data of the target model.
上記第1の実施形態では出力するエフェクトに関連のある少なくとも1つのパーツを認識したが、本実施形態では対象模型に含まれる各パーツの認識を行う。例えば、認識するパーツには顔、胸、腹、腰、両腕、及び両脚が含まれてもよい。また、本実施形態では、撮像画像から認識されるパーツの角度に従って、基準ボーン構造を更新する。従って、更新したボーン構造は、撮像画像に含まれる対象模型の姿勢を示すこととなる。さらに、更新したボーン構造を撮像画像にマッピングすることにより、撮像画像中における対応するボーン構造の位置付近においては、対象模型が撮像されていることを示す領域として判定することができる。  In the first embodiment above, at least one part related to the effect to be output is recognized, but in this embodiment, each part included in the target model is recognized. For example, the parts to be recognized may include the face, chest, abdomen, waist, both arms, and both legs. Furthermore, in this embodiment, the reference bone structure is updated according to the angles of the parts recognized from the captured image. The updated bone structure therefore indicates the posture of the target model included in the captured image. Furthermore, by mapping the updated bone structure onto the captured image, the area near the position of the corresponding bone structure in the captured image can be determined as an area in which the target model is captured.
<物体認識(ボーン構築を含む)> 次に、図12を参照して、本実施形態に係る物体認識(S204)の処理手順について説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。なお、ここでは、上記第1の実施形態で説明した図9のフローチャートと異なる処理について説明し、同様の処理については同一のステップ番号を付し、説明を省略する。  <Object recognition (including bone construction)> Next, the processing procedure for object recognition (S204) according to this embodiment will be described with reference to FIG. 12. The processing described below is realized, for example, by the CPU 201 reading and executing a control program or the like pre-stored in the storage unit 202. Note that here, only the processing that differs from the flowchart of FIG. 9 described in the first embodiment above will be described; similar processing steps are given the same step numbers and their description is omitted.
まずS501でCPU201は、対象模型(ここではフィギュア301)の基準ボーン構造を含む三次元データを取得する。当該データは、アプリケーションをインストールした際に記憶部202に予め記憶される情報である。ここでは、例えばフィギュア301の基準ボーン構造1102を含む三次元データが記憶部202から読み出される。続いて、S502でCPU201は、対象模型であるフィギュア301の情報に基づいて、認識するパーツを特定する。ここでは、上記第1の実施形態におけるS301とは異なり、選択されたエフェクトに関連するパーツを特定するのではなく、ボーン構造を更新するために必要なパーツを特定する。なお、基本的には、フィギュア301に含まれる各パーツを特定する。  First, in S501, the CPU 201 acquires three-dimensional data including the reference bone structure of the target model (here, the figure 301). This data is information that is pre-stored in the storage unit 202 when the application is installed. Here, for example, three-dimensional data including the reference bone structure 1102 of the figure 301 is read from the storage unit 202. Next, in S502, the CPU 201 identifies the parts to be recognized based on the information of the figure 301, which is the target model. Here, unlike S301 in the first embodiment above, the parts related to the selected effect are not identified, but the parts necessary to update the bone structure are identified. Note that, basically, each part included in the figure 301 is identified.
その後、S302及びS303で各パーツを認識すると、S503でCPU201は、認識したパーツ(姿勢及びカメラ206からの距離に関する情報)に従って、S501で取得した基準ボーン構造の対応する部分を更新する。具体的には、CPU201は、認識したパーツと、基準ボーン構造を含む三次元データ上の対応する部分とを照合し、当該認識したパーツの角度に合わせるように特徴点の位置を調整して基準ボーン構造を更新する。その後、S304でCPU201は、S502で特定されたパーツのうち、未解析のパーツがあるかどうかを判断する。未解析のパーツがあれば、処理をS302に戻し、未解析のパーツが無ければS504に進む。  After that, when each part is recognized in S302 and S303, in S503 the CPU 201 updates the corresponding portion of the reference bone structure acquired in S501 according to the recognized part (the information on its posture and distance from the camera 206). Specifically, the CPU 201 matches the recognized part against the corresponding portion of the three-dimensional data including the reference bone structure, and updates the reference bone structure by adjusting the positions of the feature points to match the angle of the recognized part. After that, in S304, the CPU 201 determines whether any of the parts identified in S502 remain unanalyzed. If there are unanalyzed parts, the process returns to S302; if there are none, the process proceeds to S504.
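An illustrative sketch of the S503 update, simplified to a single yaw angle per part (the actual processing adjusts feature-point positions of the three-dimensional data; the data layout and names here are assumptions):

```python
import math

def update_reference_bone(reference_bone, recognized_parts):
    """S503 sketch: rotate each reference-bone segment to match the angle the
    trained model reported for the corresponding part, and shift it to the
    recognized camera distance; parts that were not recognized keep their
    reference pose."""
    updated = dict(reference_bone)
    for part, info in recognized_parts.items():
        if part not in updated:
            continue
        x, y, z = updated[part]["direction"]
        yaw = info["yaw"]                      # only yaw is handled, for brevity
        updated[part] = {
            "direction": (x * math.cos(yaw) - z * math.sin(yaw),
                          y,
                          x * math.sin(yaw) + z * math.cos(yaw)),
            "distance_m": info["distance_m"],
        }
    return updated

reference = {"right_arm": {"direction": (1.0, 0.0, 0.0), "distance_m": 0.6}}
recognized = {"right_arm": {"yaw": math.radians(90), "distance_m": 0.55}}
print(update_reference_bone(reference, recognized))
```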
S504でCPU201は、更新された基準ボーン構造から対象模型の姿勢を判定し、本フローチャートの処理を終了する。対象模型の姿勢の判定は、例えば人体の姿勢を機械学習させた学習済みモデルを用いて、更新した基準ボーン構造の姿勢を推定することにより行ってもよい。  In S504, the CPU 201 determines the posture of the target model from the updated reference bone structure, and ends the processing of this flowchart. The posture of the target model may be determined by estimating the posture of the updated reference bone structure, for example, using a trained model that has been machine-learned to train the posture of the human body.
本実施形態によれば、推定した対象模型の姿勢に応じて出力するエフェクトを変化させることができる。例えば、対象模型の姿勢が所定の姿勢を示す場合にのみ特定のエフェクトを出力するようにしてもよい。一例として、推定した姿勢が当該フィギュアの変身ポーズを示す場合には、例えばベルトの部分を発光させ、回転させるようなエフェクトを出力してもよい。また、フィギュア全体の姿勢を認識しているため、腰から胴体、肩を通って上腕、下腕、拳という順序で各パーツに対して連続して点滅等を示すエフェクトを付与してもよい。  According to this embodiment, the effect to be output can be changed according to the estimated posture of the target model. For example, a specific effect can be output only when the posture of the target model indicates a specific posture. As one example, when the estimated posture indicates the transformation pose of the figure, an effect such as making the belt part light up and rotate can be output. In addition, since the posture of the entire figure is recognized, an effect such as blinking can be applied to each part in the following order: waist, torso, shoulders, upper arms, lower arms, and fists.
また、本実施形態によれば、更新したボーン構造を含む三次元データをデプスマップを補完するために利用してもよい。カメラの性能や撮像時の環境条件に応じて、デプスマップを用いたエフェクトオブジェクトの生成制御に加えて、上記三次元データを用いてデプスマップ(即ち、撮像画像)上の対象模型の位置を特定してもよい。デプスマップ上での対象模型の位置が特定できれば、対応する画素と重複するエフェクトオブジェクトの画素を描画するかどうか判定するのみでよく、デプスマップの精度が良くない場合に補完的に利用することができるとともに、対象模型と重複する画素のみについて上記S402の比較処理を行うだけでよく、処理負荷を低減することもできる。  Furthermore, according to this embodiment, three-dimensional data including the updated bone structure may be used to complement the depth map. In addition to controlling the generation of the effect object using the depth map, depending on the camera performance and the environmental conditions at the time of image capture, the above three-dimensional data may be used to identify the position of the target model on the depth map (i.e., the captured image). If the position of the target model on the depth map can be identified, it is only necessary to determine whether or not to draw pixels of the effect object that overlap with corresponding pixels, which can be used to complement when the accuracy of the depth map is poor, and it is also possible to reduce the processing load by only performing the comparison process of S402 above for pixels that overlap with the target model.
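A minimal sketch of how the updated bone structure, once projected into the captured image, could restrict the S402 comparison to pixels near the model; the projection itself is assumed to have been done elsewhere, and the function name and radius are illustrative:

```python
import numpy as np

def model_mask_from_bone(shape, projected_bone_pixels, radius=1):
    """Mark the pixels near the projected (updated) bone structure as the
    'model region'; only these pixels need the per-pixel depth comparison."""
    mask = np.zeros(shape, dtype=bool)
    h, w = shape
    for (px, py) in projected_bone_pixels:
        y0, y1 = max(0, py - radius), min(h, py + radius + 1)
        x0, x1 = max(0, px - radius), min(w, px + radius + 1)
        mask[y0:y1, x0:x1] = True
    return mask

mask = model_mask_from_bone((6, 6), [(2, 2), (2, 3), (3, 3)])
print(mask.sum(), "of", mask.size, "pixels need the depth comparison")
```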
以上説明したように、本実施形態に係る情報処理端末は、さらに、所定の模型に含まれる各パーツの姿勢及びカメラからの距離に関する情報を認識し、所定の模型の基準姿勢を示す基準ボーン構造を含む三次元データを、認識した各パーツに従って更新し、更新した三次元データに基づいて所定の模型の姿勢を認識する。このように、本実施形態によれば、撮像画像から各パーツを認識して予め用意した基準ボーン構造を更新して、撮像画像に含まれるフィギュアの姿勢を判定することができる。これにより、生成するオブジェクトを、認識した所定の模型の姿勢に合わせて変化させることができる。また、更新された三次元データから特定される、撮像画像における所定の模型の位置に基づいて、オブジェクト画像を生成することができる。よって、デプスマップの精度が低い場合において、エフェクトオブジェクトの生成を好適に補完することができる。  As described above, the information processing terminal according to this embodiment further recognizes information on the posture of each part included in the predetermined model and its distance from the camera, updates the three-dimensional data including the reference bone structure indicating the reference posture of the predetermined model according to each recognized part, and recognizes the posture of the predetermined model based on the updated three-dimensional data. In this way, according to this embodiment, each part can be recognized from the captured image, the reference bone structure prepared in advance can be updated, and the posture of the figure included in the captured image can be determined. This makes it possible to change the object to be generated in accordance with the recognized posture of the predetermined model. In addition, the object image can be generated based on the position of the predetermined model in the captured image identified from the updated three-dimensional data. Therefore, when the accuracy of the depth map is low, the generation of the effect object can be suitably complemented.
<Modifications> The present invention is not limited to the above embodiment, and various modifications and alterations are possible within the scope of the gist of the invention. In the above embodiment, an example was described in which three rings surrounding the body of the figure 301, which is the predetermined model, are composited and output as effect objects. That is, an example of compositing and outputting an object that the figure 301 does not actually include has been described, but the present invention is not limited to this. For example, an effect may be composited and output for a part that the figure physically has.
FIG. 13 shows a figure 1101 according to a modification placed on the desk 302 and photographed by the information processing terminal 101. In addition to a main body similar to that of the figure 301, the figure 1101 includes a physical object 1102 representing, for example, a flame. According to the present invention, an effect may be composited and output for such a physically existing part; in the example of FIG. 13, for instance, flickering of the flame or flying sparks may be added to the output. Further, according to the present invention, a different facial expression, such as a smile or a crying face, may be added to the captured face of the figure, or an image in which the line of sight has been changed to face the camera may be composited and output. Note that, according to the present invention, any effect can be added as long as at least one part of the figure is recognized from the captured image and the effect can be generated with the recognized part as a reference.
In this embodiment, a humanoid model has been described as an example of the target model, but this is not intended to limit the present invention; the invention can be applied to models of various shapes, such as humans, animals, robots, insects, and dinosaurs. In any case, as described in the above embodiment, recognizing the model by dividing it into a plurality of parts makes it possible to provide augmented reality while ensuring real-time performance.
<Summary of the Embodiment> The above embodiment discloses at least the following computer program, information processing terminal, and method for controlling the same.
(1) A computer program for causing a computer of an information processing terminal to function as: an imaging means for capturing an image of the surrounding environment including a predetermined model; an acquisition means for acquiring, for each pixel of the captured image captured by the imaging means, distance information from the imaging means; a recognition means for recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated with the at least one part recognized by the recognition means as a reference; an object generation means for generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output means for outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
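A minimal sketch of the per-pixel test in (1) and (2), assuming NumPy arrays for the object's RGBA colors, the object's per-pixel distance, and the captured image's depth information; the names are illustrative and do not come from the embodiment.

```python
import numpy as np

def generate_object_image(object_rgba, object_depth, scene_depth):
    # object_rgba: (H, W, 4) colors of the object to be generated
    # object_depth: (H, W) distance of each object pixel from the imaging means
    # scene_depth: (H, W) distance information acquired for the captured image
    h, w, _ = object_rgba.shape
    object_image = np.zeros((h, w, 4), dtype=object_rgba.dtype)   # start fully transparent
    visible = (object_rgba[..., 3] > 0) & (object_depth < scene_depth)
    object_image[visible] = object_rgba[visible]                  # draw only the nearer pixels
    return object_image   # later superimposed on the captured image
```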
(2) The computer program according to (1), characterized in that the object generation means does not draw those pixels of the object to be generated whose position information does not indicate that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image.
(3) The computer program according to (1) or (2), characterized in that the computer of the information processing terminal is further caused to function as a selection means for selecting an effect to be composited with the captured image including the predetermined model, and the recognition means recognizes a part related to the selected effect.
(4) The computer program according to (3), characterized in that the recognition means recognizes the part related to the selected effect using a trained model that has been trained to take a captured image as input and to output information regarding the shape, angle, and distance of each part of the predetermined model.
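A hedged sketch of how the trained model of (4) might be queried; `trained_model.predict` and the returned dictionary keys are assumptions for illustration, not a documented interface.

```python
def recognize_relevant_parts(captured_image, trained_model, relevant_parts):
    # trained_model is assumed to return, per part name, a dict holding the
    # "shape", "angle_deg", and "distance_m" outputs described in (4).
    outputs = trained_model.predict(captured_image)   # hypothetical inference call
    return {name: info for name, info in outputs.items() if name in relevant_parts}
```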
(5) The computer program according to (2) or (3), characterized in that the object generation means generates the object image based on pre-stored model information of an object corresponding to the effect selected by the selection means.
(6) The computer program according to any one of (1) to (5), characterized in that the position determination means determines the position information, including information regarding the angle and distance of the object to be generated, based on the information regarding the posture of the at least one part and its distance from the imaging means.
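As a rough illustration of (6), the object's placement can be derived directly from the recognized part's angle and distance; the offset value and the key names below are assumptions made for this sketch.

```python
def determine_object_position(part_angle_deg, part_distance_m, local_offset_m=0.05):
    # Align the generated object with the recognized part and place it at a small
    # offset from the part's distance; both choices are illustrative only.
    return {
        "angle_deg": part_angle_deg,
        "distance_m": part_distance_m + local_offset_m,
    }
```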
(7) The computer program according to any one of (3) to (5), characterized in that the at least one part of the predetermined model is at least one of a head, a chest, an abdomen, a waist, an arm, and a leg.
(8) The computer program according to (7), characterized in that the part related to the selected effect is a part located in the vicinity of the object to be generated.
(9) The computer program according to any one of (1) to (8), characterized in that the acquisition means acquires, as the distance information for each pixel of the captured image, a grayscale depth map indicating depth information.
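The grayscale depth map of (9) could be converted into per-pixel distances roughly as follows; the 8-bit encoding, the linear mapping, and the near/far range are assumptions, since actual devices may encode depth differently.

```python
import numpy as np

def depth_map_to_distance(gray_depth, near_m=0.1, far_m=5.0):
    # gray_depth: (H, W) uint8 grayscale depth map, 0 assumed to be nearest
    normalized = gray_depth.astype(np.float32) / 255.0
    return near_m + normalized * (far_m - near_m)   # metres per pixel
```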
(10) The computer program according to any one of (1) to (9), characterized in that the recognition means further recognizes information regarding the posture of each part included in the predetermined model and its distance from the imaging means, updates three-dimensional data including a reference bone structure indicating a reference posture of the predetermined model according to each recognized part, and recognizes the posture of the predetermined model based on the updated three-dimensional data.
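One way the reference bone structure of (10) might be updated, sketched with plain dictionaries; the key names and the per-part fields are assumptions for illustration.

```python
def update_bone_structure(reference_bones, recognized_parts):
    # reference_bones: {"head": {"angle_deg": ..., "distance_m": ...}, ...}
    # recognized_parts: same layout, holding the posture and distance recognized
    # from the captured image for the parts that were actually detected.
    updated = {name: dict(pose) for name, pose in reference_bones.items()}
    for name, estimate in recognized_parts.items():
        if name in updated:
            updated[name].update(estimate)   # overwrite reference pose with recognition
    return updated   # three-dimensional data reflecting the captured posture
```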
(11) The computer program according to (10), characterized in that the object generated by the object generation means changes in accordance with the recognized posture of the predetermined model.
(12) The computer program according to (10) or (11), characterized in that the object generation means generates the object image based on the position of the predetermined model in the captured image, which is identified from the updated three-dimensional data.
(13) The computer program according to any one of (1) to (12), characterized in that the imaging means continuously captures images of the surrounding environment including the predetermined model; the acquisition means, the recognition means, the object generation means, and the output means periodically execute their processing based on the images captured by the imaging means; and the output means composites the object, as a dynamically changing animation, with the video continuously captured by the imaging means and outputs the result.
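The periodic processing of (13) amounts to a per-frame loop along the following lines; the four collaborator objects and their method names are assumptions used only to show the cycle.

```python
import time

def ar_loop(camera, recognizer, renderer, display, period_s=1 / 30):
    while display.is_open():                        # hypothetical interfaces throughout
        frame, depth = camera.capture()             # captured image + distance information
        parts = recognizer.recognize(frame, depth)  # per-part posture and distance
        effect = renderer.generate(parts, depth)    # occlusion-tested object image
        display.show_composite(frame, effect)       # superimpose and output
        time.sleep(period_s)                        # crude pacing for the sketch
```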
(14) An information processing terminal comprising: an imaging means for capturing an image of the surrounding environment including a predetermined model; an acquisition means for acquiring, for each pixel of the captured image captured by the imaging means, distance information from the imaging means; a recognition means for recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated with the at least one part recognized by the recognition means as a reference; an object generation means for generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output means for outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
(15) A method for controlling an information processing terminal, comprising: an imaging step of capturing, by an imaging means, an image of the surrounding environment including a predetermined model; an acquisition step of acquiring, for each pixel of the image captured in the imaging step, distance information from the imaging means; a recognition step of recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination step of determining position information of an object to be generated with the at least one part recognized in the recognition step as a reference; an object generation step of generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output step of outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
101: Information processing terminal, 102: Application server, 103: Machine learning server, 104: Database, 105: Network, 201: CPU, 202: Storage unit, 203: Communication control unit, 204: Display unit, 205: Operation unit, 206: Camera, 207: Speaker, 210: System bus, 301: Figure, 302: Desk, 501: Image acquisition unit, 502: Depth information acquisition unit, 503: Object recognition unit, 504: Effect position determination unit, 505: Trained model, 506: Effect drawing unit, 507: Compositing unit, 508: Output unit

Claims (15)

1. A computer program for causing a computer of an information processing terminal to function as: an imaging means for capturing an image of the surrounding environment including a predetermined model; an acquisition means for acquiring, for each pixel of the captured image captured by the imaging means, distance information from the imaging means; a recognition means for recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated with the at least one part recognized by the recognition means as a reference; an object generation means for generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output means for outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
2. The computer program according to claim 1, characterized in that the object generation means does not draw those pixels of the object to be generated whose position information does not indicate that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image.
3. The computer program according to claim 2, characterized in that the computer of the information processing terminal is further caused to function as a selection means for selecting an effect to be composited with the captured image including the predetermined model, and the recognition means recognizes a part related to the selected effect.
4. The computer program according to claim 3, characterized in that the recognition means recognizes the part related to the selected effect using a trained model that has been trained to take a captured image as input and to output information regarding the shape, angle, and distance of each part of the predetermined model.
5. The computer program according to claim 3, characterized in that the object generation means generates the object image based on pre-stored model information of an object corresponding to the effect selected by the selection means.
6. The computer program according to claim 1, characterized in that the position determination means determines the position information, including information regarding the angle and distance of the object to be generated, based on the information regarding the posture of the at least one part and its distance from the imaging means.
7. The computer program according to claim 3, characterized in that the at least one part of the predetermined model is at least one of a head, a chest, an abdomen, a waist, an arm, and a leg.
8. The computer program according to claim 7, characterized in that the part related to the selected effect is a part located in the vicinity of the object to be generated.
9. The computer program according to claim 1, characterized in that the acquisition means acquires, as the distance information for each pixel of the captured image, a grayscale depth map indicating depth information.
10. The computer program according to claim 1, characterized in that the recognition means further recognizes information regarding the posture of each part included in the predetermined model and its distance from the imaging means, updates three-dimensional data including a reference bone structure indicating a reference posture of the predetermined model according to each recognized part, and recognizes the posture of the predetermined model based on the updated three-dimensional data.
11. The computer program according to claim 10, characterized in that the object generated by the object generation means changes in accordance with the recognized posture of the predetermined model.
12. The computer program according to claim 10, characterized in that the object generation means generates the object image based on the position of the predetermined model in the captured image, which is identified from the updated three-dimensional data.
13. The computer program according to any one of claims 1 to 12, characterized in that the imaging means continuously captures images of the surrounding environment including the predetermined model; the acquisition means, the recognition means, the object generation means, and the output means periodically execute their processing based on the images captured by the imaging means; and the output means composites the object, as a dynamically changing animation, with the video continuously captured by the imaging means and outputs the result.
14. An information processing terminal comprising: an imaging means for capturing an image of the surrounding environment including a predetermined model; an acquisition means for acquiring, for each pixel of the captured image captured by the imaging means, distance information from the imaging means; a recognition means for recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated with the at least one part recognized by the recognition means as a reference; an object generation means for generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output means for outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
15. A method for controlling an information processing terminal, comprising: an imaging step of capturing, by an imaging means, an image of the surrounding environment including a predetermined model; an acquisition step of acquiring, for each pixel of the image captured in the imaging step, distance information from the imaging means; a recognition step of recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination step of determining position information of an object to be generated with the at least one part recognized in the recognition step as a reference; an object generation step of generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output step of outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
PCT/JP2023/040545 2022-11-14 2023-11-10 Computer program, information processing terminal, and method for controlling same WO2024106328A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-182029 2022-11-14
JP2022182029A JP7441289B1 (en) 2022-11-14 2022-11-14 Computer program, information processing terminal, and its control method

Publications (1)

Publication Number Publication Date
WO2024106328A1 (en) 2024-05-23

Family

ID=90011331

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/040545 WO2024106328A1 (en) 2022-11-14 2023-11-10 Computer program, information processing terminal, and method for controlling same

Country Status (2)

Country Link
JP (2) JP7441289B1 (en)
WO (1) WO2024106328A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5551205B2 (en) * 2012-04-26 2014-07-16 株式会社バンダイ Portable terminal device, terminal program, augmented reality system, and toy
JP2019179481A (en) * 2018-03-30 2019-10-17 株式会社バンダイ Computer program and portable terminal device
JP2021033963A (en) * 2019-08-29 2021-03-01 株式会社Sally127 Information processing device, display control method and display control program
JP2021128542A (en) * 2020-02-13 2021-09-02 エヌエイチエヌ コーポレーション Information processing program and information processing system

Also Published As

Publication number Publication date
JP2024071199A (en) 2024-05-24
JP7441289B1 (en) 2024-02-29
JP2024071379A (en) 2024-05-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23891477

Country of ref document: EP

Kind code of ref document: A1