WO2024106328A1 - Computer program, information processing terminal, and method for controlling same - Google Patents

Computer program, information processing terminal, and method for controlling same

Info

Publication number
WO2024106328A1
WO2024106328A1 (PCT/JP2023/040545)
Authority
WO
WIPO (PCT)
Prior art keywords
information
image
captured image
distance
imaging
Application number
PCT/JP2023/040545
Other languages
French (fr)
Japanese (ja)
Inventor
良太 片野
剛 山本
光国 高堀
虎太郎 尾嶋
規悦 青木
裕司 永野
Original Assignee
株式会社バンダイ
株式会社バンダイナムコピクチャーズ
Application filed by 株式会社バンダイ and 株式会社バンダイナムコピクチャーズ
Publication of WO2024106328A1 publication Critical patent/WO2024106328A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/04815 Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/14 Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics

Definitions

  • the present invention relates to a computer program, an information processing terminal, and a control method thereof.
  • Patent Document 1 discloses an augmented reality system that recognizes marks on the base of a figure with a camera on a mobile device, generates images for presentation that are prepared in association with the marks, and displays the figure and the images superimposed on the screen of the mobile device.
  • in Patent Document 1, the figure image and the video for the production are displayed superimposed on each other, and the figure image is hidden where the two overlap.
  • in other words, all of the video for the production is displayed in front of the figure image.
  • to create a sense of depth and realism, however, it is desirable for objects superimposed on the captured image, such as the video for the production, to be exposed in front of or hidden behind the figure depending on their positional relationship with the figure.
  • the present invention provides, for example, a mechanism for suitably synthesizing objects with captured images of real space and outputting the resulting images.
  • the present invention is, for example, a computer program that causes a computer of an information processing terminal to function as an imaging means for imaging a surrounding environment including a specified model, an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means, a recognition means for recognizing information regarding the attitude of at least one part of the specified model contained in the captured image and its distance from the imaging means, a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means, an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • an information processing terminal includes an imaging means for imaging a surrounding environment including a predetermined model, an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means, a recognition means for recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means, a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means, an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • the present invention is also characterized in that it is a control method for an information processing terminal, comprising, for example, an imaging step of imaging a surrounding environment including a predetermined model by an imaging means, an acquisition step of acquiring distance information from the imaging means for each pixel of the image captured in the imaging step, a recognition step of recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means, a position determination step of determining position information of an object to be generated based on the at least one part recognized in the recognition step, an object generation step of drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output step of outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • FIG. 1 is a diagram showing an example of the configuration of a system according to an embodiment.
  • FIG. 2 is a diagram showing an example of the configuration of an information processing terminal according to an embodiment.
  • FIG. 3 is a diagram showing an example of an XR gimmick provided by the system according to an embodiment.
  • FIGS. 4(a) to 4(d) are diagrams showing screen transitions of an XR gimmick according to an embodiment.
  • FIG. 5 is a diagram showing a functional configuration relating to effect synthesis according to an embodiment.
  • FIG. 6 is a diagram showing a series of example images according to a processing procedure of effect synthesis according to an embodiment.
  • FIG. 7 is a flowchart showing a processing procedure of basic control according to an embodiment.
  • FIG. 8 is a flowchart showing a processing procedure for effect synthesis output according to an embodiment.
  • FIG. 9 is a flowchart showing a processing procedure for object recognition according to an embodiment.
  • FIG. 10 is a flowchart showing a processing procedure for generating an object according to an embodiment.
  • FIG. 11 is a diagram showing a method for generating a bone structure according to an embodiment.
  • FIG. 12 is a flowchart showing a processing procedure for object recognition according to an embodiment.
  • FIG. 13 is a diagram showing a modified example of the figure according to an embodiment.
  • This system includes an information processing terminal 101, an application server 102, a machine learning server 103, and a database 104.
  • the information processing terminal 101 and the application server 102 are connected to each other via a network so that they can communicate with each other.
  • the application server 102 is connected to the machine learning server 103 via a local area network (LAN) so that they can communicate with each other.
  • the machine learning server 103 is connected to the database 104 via the LAN.
  • the information processing terminal 101 is, for example, a portable information processing terminal such as a smartphone, a mobile phone, or a tablet PC. Any device may be used as long as it has at least an imaging unit such as a camera and a display unit for displaying the captured image.
  • the information processing terminal 101 downloads and installs an application for implementing the present invention from the application server 102 via the network 105.
  • the captured image may be either a still image or a moving image (video).
  • the effect object may be synthesized as an animation.
  • the training data may be generated by the machine learning server 103, or data generated externally may be received.
  • the machine learning server 103 also stores the trained model generated for each model in the database 104, and provides it to the application server 102 as necessary.
  • the machine learning server 103 reads out the corresponding trained model from the database 104, re-learns it, and stores the re-learned model in the database 104 again.
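  • As a purely illustrative sketch (the publication does not disclose any training code), a per-part pose regressor of the kind described above might be trained along the following lines; the layer sizes, the use of PyTorch, and the four-value label format [pitch, yaw, roll, distance] are all assumptions made for the example.

```python
import torch
from torch import nn

class PartPoseNet(nn.Module):
    """Tiny CNN that regresses pitch/yaw/roll angles and camera distance for one part.

    One such network (or one output head) per part keeps inference cheap,
    in line with the per-part training strategy described above.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 4)  # pitch, yaw, roll, distance

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def train_step(model, optimiser, images, labels):
    """One supervised step on (part image, [pitch, yaw, roll, distance]) pairs."""
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(images), labels)
    loss.backward()
    optimiser.step()
    return loss.item()
```

  • The resulting trained weights would then be stored in the database 104 and provided to the application server 102, as described above.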
  • the information processing terminal 101 includes a CPU 201, a storage unit 202, a communication control unit 203, a display unit 204, an operation unit 205, a camera 206, and a speaker 207. These components can send and receive data to and from one another via a system bus 210.
  • the CPU 201 is a central processor that provides overall control of each component connected via the system bus 210.
  • the CPU 201 executes each process described below by executing computer programs stored in the storage unit 202.
  • the storage unit 202 is used as a work area and temporary storage area for the CPU 201, and also stores the control programs executed by the CPU 201 and various data.
  • the communication control unit 203 can perform bidirectional communication with the application server 102 via the network 105 using broadband wireless communication.
  • the communication control unit 203 may have short-range wireless communication functions such as wireless LAN (WiFi), Bluetooth (registered trademark) communication, and infrared communication in addition to or instead of broadband wireless communication.
  • if the communication control unit 203 does not have a broadband wireless communication function but has a WiFi communication function, it connects to the network 105 via a nearby access point.
  • the display unit 204 is a touch panel type liquid crystal display that displays various screens as well as still images and moving images captured by the camera 206.
  • the operation unit 205 is an operation input unit that is integrated with the display unit 204 and accepts user operations.
  • the operation unit 205 may also include physically configured push-type or slide-type buttons, etc.
  • the camera 206 is an imaging unit that captures images of the surrounding environment of the information processing terminal 101, and is preferably located, for example, behind the display unit 204 on the information processing terminal 101. This allows the user to check the captured image on the display unit 204 while taking an image with the camera 206.
  • the camera 206 may be a monocular camera or a compound eye camera.
  • the speaker 207 outputs sound, for example, in accordance with the effect object to be output. Sound data is prepared in advance for each effect.
  • 301 denotes an example of a predetermined model: a humanoid figure. This is not intended to limit the present invention; a model imitating any object or the like can be applied to the present invention.
  • 302 indicates a desk on which the figure 301 is placed. The user starts the above application on the information processing terminal 101, and selects an item corresponding to the figure 301 from multiple items displayed on the application screen. When the user selects the item corresponding to the figure 301, the camera 206 is started, and the user takes a picture of the figure 301 placed on the desk 302. The user can freely move the information processing terminal 101 during shooting, and change the angle at which the figure 301 is shot, as shown by the arrow. The captured image is displayed on the display unit 204 of the information processing terminal 101.
  • in this way, this system provides an augmented real space by superimposing an effect object, such as an animation, on the captured image of the surrounding environment of the information processing terminal 101 including the figure 301, and outputting the result.
  • FIG. 4(a) to FIG. 4(d) explain the screen transitions when the user executes an application that provides the XR gimmick and moves the information processing terminal 101 as shown in FIG. 3.
  • Screen 400 shown in FIG. 4(a) is a screen that is displayed on the display unit 204 when an application that provides an XR gimmick according to this embodiment is launched.
  • selection buttons 401 to 405 are displayed for selecting figures registered in the application.
  • an example is shown in which five figures are registered, but if more figures are registered, undisplayed items can be displayed and selected by scrolling the screen downwards. Different figures are registered in each item, and when the user selects a figure to be photographed, the screen transitions to screen 410 shown in FIG. 4(b).
  • the screen 410 shows the state in which the camera 206 is started, the camera 206 captures an image of the surrounding environment of the information processing terminal 101, and the image is displayed on the display unit 204.
  • the captured image of the surrounding environment includes the figure 301 placed on the desk 302, as shown in FIG. 3.
  • various buttons 411 to 413 are displayed in a selectable manner.
  • the button 411 is a button for capturing a still image during shooting. When the button 411 is operated, a still image is acquired at the timing of the operation and stored in the storage unit 202.
  • the button 412 is a button for adding an effect. When the button 412 is operated, at least one effect registered for the figure is displayed in a selectable manner, and the user can select a desired effect.
  • the button 413 is a button for displaying various menus.
  • when the button 413 is operated, the running XR gimmick can be ended and the screen returned to screen 400, or other settings can be made. Note that although an example including three buttons has been described here, more operation buttons may be included.
  • when an effect is selected via the button 412, composite output of the effect begins, as shown on screen 420 in FIG. 4(c).
  • 421 denotes an effect object composited into the video; here, three rings are displayed surrounding the figure 301. These rings may be animated, for example, appearing above the head of the figure 301 and moving toward its feet. It can be seen that the displayed effect object 421 has a portion that is displayed in front of the figure 301 and a portion that is hidden behind the figure 301 and not displayed. Details of this display control will be described later.
  • screen 430 shows the screen when the user moves the information processing terminal 101 from the state in FIG. 4(c) to a state where the user is photographing the side of the figure 301 while the effect is being output.
  • as shown by effect object 431, the part of the effect that wraps around behind the figure 301 as seen from the information processing terminal 101 is not displayed.
  • in this way, the display of the effect object follows the image captured by the camera 206 and changes according to its positional relationship with the figure 301. Detailed display control will be described later.
  • the information processing terminal 101 includes, as a functional configuration related to effect synthesis output, an image acquisition unit 501, a depth information acquisition unit 502, an object recognition unit 503, an effect position determination unit 504, a trained model 505, an effect drawing unit 506, a synthesis unit 507, and an output unit 508.
  • the image acquisition unit 501 acquires an image (RGB image) captured by the camera 206.
  • the RGB image acquired by the image acquisition unit 501 is output to the depth information acquisition unit 502, the object recognition unit 503, and the synthesis unit 507.
  • the depth information acquisition unit 502 acquires distance information (depth information) from the camera 206 at the time of capturing for each pixel in the captured image received from the image acquisition unit 501.
  • the depth information acquisition unit 502 generates a grayscale image (depth map) indicating the acquired depth information.
  • Any known method may be used as a method for acquiring depth information, for example, a method of acquiring the information using stereo vision or motion parallax due to a time difference, or a method of acquiring the information using a machine-learned model that has been trained to estimate the distance from a two-dimensional image to an object using a convolutional neural network. Note that since the XR gimmick according to this embodiment requires real-time performance, a method with a low processing load is desirable.
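  • For illustration only, the grayscale depth map could be produced as in the following sketch; `estimate_depth` is a placeholder for whichever depth source is actually used (stereo vision, motion parallax, or a learned monocular estimator), and the clipping range is an assumption for a tabletop figure scene. The brightness convention (closer pixels closer to white) matches the depth map 610 described later.

```python
import numpy as np

def make_depth_map(rgb_image: np.ndarray, estimate_depth) -> np.ndarray:
    """Build a grayscale depth map (0-255) from one RGB frame.

    `estimate_depth` is assumed to return an HxW array of distances in metres;
    it stands in for the stereo, motion-parallax, or learned estimator above.
    """
    depth_m = estimate_depth(rgb_image)        # HxW float array of metric distances
    near, far = 0.1, 3.0                       # assumed working range for a tabletop figure
    depth_clipped = np.clip(depth_m, near, far)
    gray = 255.0 * (far - depth_clipped) / (far - near)  # closer -> brighter
    return gray.astype(np.uint8)
```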
  • the depth information acquisition unit 502 outputs the acquired depth information (depth map) to the effect drawing unit 506.
  • the object recognition unit 503 uses the trained model 505 to recognize the posture of at least one part of the figure 301 included in the captured image and the distance from the camera 206 to the part.
  • the posture information includes information on the shape and angle of the part. More specifically, the object recognition unit 503 can use the trained model 505 to detect the angles of the part to be recognized in each of the directions of front, back, up, down, left, right, pitch, yaw, and roll.
  • at least one part is at least one of the head, chest, abdomen, waist, arms, and legs in a specified model such as the figure 301, and is a part related to the selected effect.
  • the granularity of dividing the specified model into parts is arbitrary. For example, in the case of a movable figure, it is desirable to divide it into parts having joints. This makes it possible to recognize the shape, posture, etc. of each movable part, and reduces recognition errors even when movement is performed.
  • parts related to an effect refer to parts located near the effect object to be generated. This is because the position of the effect object to be composited into the captured image is determined taking into consideration its positional relationship with the figure 301. For example, when generating an effect object that emits a ray from the chest of a specific model, the position and direction in which the effect object will be generated can be determined by recognizing the orientation of the model's chest and its distance from the camera 206 in the captured image.
  • note that the object recognition unit 503 recognizes only the at least one part related to the effect object to be generated, rather than recognizing the posture of the entire specified model and its distance from the camera 206 in the captured image. This allows for faster processing compared to recognizing the entire specified model, and ensures the real-time performance of the XR gimmick. Since the object recognition unit 503 holds in advance information on the three-dimensional shape model of the selected model, it is also possible to estimate to some extent the posture and distance of other parts by recognizing the posture and distance of some parts. When the object recognition unit 503 has recognized the information related to the posture and distance of the at least one part related to the effect object to be generated, it outputs the information to the effect position determination unit 504.
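  • A minimal sketch of such part-limited recognition is given below; `pose_model` stands in for the trained model 505, its `predict` call and the returned field names are hypothetical, and the `PartPose` record is introduced purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class PartPose:
    name: str          # e.g. "head" or "chest"
    pitch: float       # angles in degrees
    yaw: float
    roll: float
    distance_m: float  # distance from the camera 206

def recognize_parts(rgb_image, effect_related_parts, pose_model):
    """Recognize only the parts related to the selected effect.

    Restricting recognition to these parts, rather than the whole model,
    keeps the per-frame processing load low, as described above.
    """
    results = []
    for part_name in effect_related_parts:                    # e.g. ["head", "chest"]
        out = pose_model.predict(rgb_image, part=part_name)   # hypothetical API
        results.append(PartPose(part_name, out["pitch"], out["yaw"],
                                out["roll"], out["distance_m"]))
    return results
```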
  • the effect position determination unit 504 determines position information of the effect object to be generated based on the acquired information on the attitude of at least one part and the distance from the camera 206.
  • the position information includes information on at least the attitude (angle) and distance from the camera 206 for the effect object.
  • the determined position information is output to the effect drawing unit 506. Since the effect drawing unit 506 holds model information of the effect object to be generated in advance, information that defines the reference position of the effect object in association with a specified position of the figure 301 can be output here. In other words, the position information of the effect object to be generated only needs to include information required for the effect drawing unit 506 to draw the effect object, and may, for example, indicate information on the attitude (angle) of the effect object and the distance from the camera 206.
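  • One way the position determination could work is to apply a fixed offset in the recognized part's coordinate frame, as in the sketch below; the Euler-angle convention, the offset value, and the reuse of the `PartPose` fields from the earlier sketch are assumptions, not details from the publication.

```python
import numpy as np

def euler_to_matrix(pitch, yaw, roll):
    """Rotation matrix from pitch/yaw/roll in degrees (one common convention)."""
    p, y, r = np.radians([pitch, yaw, roll])
    rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    rz = np.array([[np.cos(r), -np.sin(r), 0], [np.sin(r), np.cos(r), 0], [0, 0, 1]])
    return rz @ ry @ rx

def determine_effect_pose(part_pose, offset=(0.0, 0.15, 0.0)):
    """Place the effect relative to one recognized part.

    Returns only the effect's orientation and its distance from the camera,
    which is all the effect drawing unit needs, since it already holds the
    effect object's model data.
    """
    rotation = euler_to_matrix(part_pose.pitch, part_pose.yaw, part_pose.roll)
    offset_cam = rotation @ np.asarray(offset)        # offset expressed in the part's local frame
    distance = part_pose.distance_m + offset_cam[2]   # shift along the viewing axis (assumed z)
    return {"rotation": rotation, "distance_m": distance}
```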
  • the effect drawing unit 506 draws an effect object based on the depth information acquired from the depth information acquisition unit 502 and the information on the posture and distance of the effect object acquired from the effect position determination unit 504. As described above, the effect drawing unit 506 performs drawing according to the model information stored in advance for the effect to be generated. More specifically, referring to the depth information (distance information) of each pixel of the captured image, the effect drawing unit 506 draws, among the drawing pixels of the effect object, those pixels whose position indicates that the effect object is closer to the camera 206 than the corresponding pixel of the captured image.
  • conversely, the effect drawing unit 506 does not draw those pixels of the effect object that are not closer to the camera 206 than the corresponding pixel of the captured image. As a result, for example, a portion of the effect object hidden behind the figure 301 is not drawn, and only the portion exposed in front of the figure 301 is drawn.
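  • The visibility rule just described amounts to a per-pixel depth test of the effect object against the captured scene. Below is a minimal vectorised sketch; it assumes the effect has already been rasterised into an RGBA layer and a per-pixel distance buffer aligned with the captured image, and the array names are illustrative.

```python
import numpy as np

def draw_visible_effect(effect_rgba, effect_depth_m, scene_depth_m):
    """Keep only the effect pixels that lie in front of the captured scene.

    effect_rgba    : HxWx4 rasterised effect layer (alpha 0 where there is no effect)
    effect_depth_m : HxW distance of each effect pixel from the camera
    scene_depth_m  : HxW distance of each captured-image pixel from the camera
    """
    visible = (effect_rgba[..., 3] > 0) & (effect_depth_m < scene_depth_m)
    out = np.zeros_like(effect_rgba)
    out[visible] = effect_rgba[visible]   # hidden pixels are simply never drawn
    return out
```

  • Because occluded pixels are never drawn in the first place, no later erase pass is needed, which is the low-load property noted further on in this description.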
  • the effect drawing unit 506 outputs the drawn effect object image to the synthesis unit 507.
  • the synthesis unit 507 generates a composite image by superimposing the effect object image acquired from the effect drawing unit 506 on the captured image acquired from the image acquisition unit 501, thereby adding the effect to the real space.
  • the synthesis unit 507 may also perform final adjustments and quality adjustments to the synthesized image by adjusting the brightness of the environment, etc. For example, when displaying an effect with more emphasis in accordance with a selected effect, adjustments such as darkening the image in the real space can be made.
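  • A compositing step along these lines could be an alpha blend preceded by an optional darkening of the real-space image, as sketched below; the darkening factor in the usage comment is purely illustrative.

```python
import numpy as np

def composite(captured_rgb, effect_rgba, darken=1.0):
    """Superimpose the drawn effect layer on the captured image.

    darken < 1.0 dims the real-space image so that the effect stands out,
    corresponding to the brightness/quality adjustment described above.
    """
    base = captured_rgb.astype(np.float32) * darken
    alpha = effect_rgba[..., 3:4].astype(np.float32) / 255.0
    blended = base * (1.0 - alpha) + effect_rgba[..., :3].astype(np.float32) * alpha
    return np.clip(blended, 0, 255).astype(np.uint8)

# e.g. composite(frame, effect_layer, darken=0.6) for an emphasised effect
```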
  • the composite image is passed to the output unit 508, which displays it on the display unit 204.
  • the above-mentioned series of processes may be executed periodically (e.g., at periods of 30 msec, 60 msec, 90 msec, etc.) on the captured images continuously acquired by the image acquisition unit 501.
  • the image displayed by the output unit 508 becomes a moving image.
  • the added effect object may also be displayed as a dynamically changing animation.
  • the output unit 508 can also output a predetermined sound from the speaker 207 in accordance with the displayed composite image (effect animation).
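  • Tying the above units together, the periodic processing could be driven by a loop like the following sketch; the `camera`, `display`, `speaker`, and `effect` objects and their methods are placeholders, and the other functions refer to the earlier illustrative sketches.

```python
import time

def run_effect_loop(camera, display, speaker, effect, pose_model,
                    estimate_depth, period_s=0.030):
    """Capture, synthesise, and display at a fixed period (e.g. 30 msec)."""
    while effect.active:                                   # placeholder stop condition
        start = time.monotonic()
        frame = camera.capture()                           # RGB image from the camera 206
        scene_depth_m = estimate_depth(frame)              # metric depth; the grayscale map is its visualisation
        parts = recognize_parts(frame, effect.related_parts, pose_model)
        pose = determine_effect_pose(parts[0])             # simplest case: one reference part
        effect_rgba, effect_depth_m = effect.rasterize(pose, frame.shape)  # hypothetical renderer
        layer = draw_visible_effect(effect_rgba, effect_depth_m, scene_depth_m)
        display.show(composite(frame, layer))              # output to the display unit 204
        speaker.play_if_needed(effect)                     # optional sound from the speaker 207
        time.sleep(max(0.0, period_s - (time.monotonic() - start)))
```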
  • Reference numeral 600 denotes an image captured by the camera 206.
  • the captured image 600 includes a figure 301 placed on a desk 302.
  • Reference numeral 610 denotes a grayscale depth map, which is depth information obtained from the captured image 600 by the depth information acquisition unit 502. In the depth map 610, the closer each pixel is to white, the closer the distance from the camera 206 is.
  • 620 shows how the object recognition unit 503 uses the trained model 505 to recognize at least one part related to the effect object to be generated.
  • the head 621 and chest 622 of the figure 301 are recognized.
  • only some of the parts related to the effect object to be generated are recognized, rather than recognizing the entire figure 301 contained in the captured image 600.
  • 630 shows a model of an effect object 631 whose position information has been determined based on the parts recognized in 620.
  • based on the recognized parts, the overall position information of the effect object 631 to be generated is determined. Note that since the model information of the effect object 631 is stored in advance, the position information only needs to include information on the posture for drawing the object and the distance from the camera 206.
  • effect rendering unit 506 uses depth map 610 and effect object model 630 to render effect object 641 to be composited into the captured image.
  • effect object 641 includes parts that are not rendered. This is because distance information obtained from depth map 610 is compared with distance information of effect object 631 obtained from model 630, and only effect objects located in front of the corresponding pixel in the captured image (i.e., closer to camera 206) are rendered.
  • 650 shows a composite image generated by simply superimposing the effect object image 640 on the captured image 600.
  • the part of the effect object 631 that overlaps with the image part of the figure 301 and is hidden behind the figure 301 is not drawn.
  • Note that in this embodiment only the part exposed to the front is drawn, without drawing the part of the effect hidden by the target object. Therefore, compared to control in which the part of the effect object that has been drawn once is erased according to its positional relationship with the figure, it is possible to realize processing with a lower processing load and to perform processing at a higher speed.
  • the CPU 201 displays a screen 400 that displays a selectable menu on the display unit 204.
  • the CPU 201 acquires information selected in response to a user operation via the screen 400 or a setting screen (not shown) to which the screen 400 transitions.
  • the selected information here includes, for example, information about a specific model that is captured by the camera 206 and displayed.
  • the CPU 201 starts capturing images with the camera 206 in response to the selected information.
  • the captured image is displayed on the display unit 204, as shown on screen 410.
  • the CPU 201 determines whether or not effect output has been selected via the button 412. If it has been selected, the process proceeds to S106; if not, the process proceeds to S104.
  • in S104, the CPU 201 acquires an image captured by the camera 206, displays it on the display unit 204, and proceeds to S107.
  • in S106, the CPU 201 combines an effect with the captured image, outputs the result, and proceeds to S107. The detailed process of S106 will be described later with reference to FIG. 8.
  • the CPU 201 determines whether or not to end the video output, and if not, returns to S103, and if to end, ends the process of this flowchart. For example, if an instruction to return to the screen 400 is given via the button 413 or if the application is ended, the CPU 201 determines that the video output is to be ended, and stops the startup of the camera 206.
  • the CPU 201 acquires effect information selected via the button 412.
  • the effect information includes identification information for identifying the effect to be generated, information on at least one part to which the effect is related, and the like. This information is received from the application server 102 and is pre-stored in the storage unit 202.
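  • The effect information might be represented by a small record such as the following sketch; the field names are illustrative, since the description only states that it includes identification information, the related parts, and the like, and the model and sound fields are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EffectInfo:
    effect_id: str                     # identifies the effect to be generated
    related_parts: List[str]           # e.g. ["head", "chest"]
    model_asset: str = ""              # pre-stored model data for the effect object
    sound_asset: Optional[str] = None  # sound prepared in advance for this effect

# e.g. EffectInfo("triple_ring", ["head", "chest"], "ring_effect.dat", "ring.wav")
```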
  • the CPU 201 acquires a captured image of the processing target captured by the camera 206.
  • in S203, the CPU 201 acquires, using the depth information acquisition unit 502, a depth map of the captured image acquired in S202.
  • in S204, the CPU 201 inputs the captured image acquired in S202 to the trained model 505, and performs object recognition of at least one part related to the effect acquired in S201 on the model (here, the figure 301) included in the captured image. Detailed control of the object recognition will be described later with reference to FIG. 9.
  • the CPU 201 determines position information of the effect to be generated based on information on the attitude and distance of at least one part recognized in S204.
  • the position information includes information on the attitude (angle) of the effect to be generated and the distance from the camera 206 as information for generating the effect image.
  • the processing order of S203, S204, and S205 has been described in order for ease of explanation, but the processing of acquiring the depth map and the processing of determining the position of the effect object may be performed in the reverse order, or may be performed in parallel.
  • the CPU 201 generates an image of the effect object based on the depth map acquired in S203 and the position information of the effect object determined in S205.
  • the generation control of the image of the effect object will be described later with reference to FIG. 10.
  • the CPU 201 superimposes the effect object image generated in S206 onto the captured image acquired in S202 to synthesize them.
  • the CPU 201 displays the synthesized image on the display unit 204 and outputs sound from the speaker 207 as necessary, and ends the processing of this flowchart.
  • the CPU 201 identifies parts related to the effect to be generated based on the effect information acquired in S201. For example, in the example of the XR gimmick described in FIG. 4 and FIG. 6, the head and chest of the figure 301 are identified as parts related to the effect.
  • the CPU 201 uses the trained model 505 to recognize at least one part identified in S301 that is included in the captured image.
  • the CPU 201 obtains information regarding the shape, angle, and distance of the recognized part from the output result of the trained model 505. After that, in S304, the CPU 201 determines whether or not there are any unanalyzed parts among the parts identified in S301. If there are any unanalyzed parts, the process returns to S302, and if there are no unanalyzed parts, the process of this flowchart ends.
  • the CPU 201 initializes the pixel position x of the effect object to be generated based on the position information of the effect object determined in S205 and the model information of the effect object to be generated that has been stored in advance.
  • the upper left pixel position of the effect object is set as the initial value for pixel position x.
  • the CPU 201 compares the distance information between the pixel position x of the effect object to be processed and the corresponding pixel position y of the captured image.
  • the CPU 201 determines whether the comparison shows that the effect object is located in front (closer to the camera 206). If the effect object is in front, the process proceeds to S404, otherwise, the process proceeds to S405.
  • the CPU 201 draws the effect object of the corresponding pixel, and proceeds to S405.
  • in S405, the CPU 201 determines whether all pixels of the effect object have been compared with the corresponding pixels of the captured image. When processing has been completed for all pixels, the process of this flowchart ends; if not, the process returns to S402.
  • the information processing terminal captures an image of the surrounding environment including a predetermined model, and acquires distance information from the camera for each pixel of the captured image.
  • the information processing terminal recognizes information on the posture of at least one part of the predetermined model included in the captured image and the distance from the camera, and determines position information of an object to be generated based on the recognized at least one part.
  • the information processing terminal draws, as an object, pixels of each pixel of the object to be generated whose position information indicates that the pixels are closer to the camera than the distance information of the corresponding pixel in the captured image, generates an object image, and outputs a composite image in which the object image is superimposed on the captured image to the display unit.
  • the information processing terminal does not draw pixels of each pixel of the object to be generated whose position information does not indicate that the pixels are closer to the camera than the distance information of the corresponding pixel in the captured image.
  • a portion located in front of a predetermined object is drawn, and a portion located behind the predetermined object is not drawn because it is hidden by the predetermined object.
  • the present invention can suitably synthesize an object into a captured image in real space and output it.
  • in the following, control is described in which a bone structure of the target model is constructed to complement the depth map. Furthermore, by using the constructed bone structure, the posture of the target model can be determined, and the effect object can be dynamically changed according to the determined posture. Details will be described later.
  • Reference numeral 1100 indicates the bone structure of a figure, which is a specified model, contained in the captured image.
  • the figure 301 and desk 302 contained in the captured image are indicated by dotted lines.
  • Black circles such as 1101 in Figure 11 indicate feature points of figure 301. These feature points are used to generate a rough outline of the figure, and there is no intention to limit their number or position.
  • 1102 indicates the bone structure connecting each feature point.
  • Bone structure 1102 in Figure 11 indicates the reference bone structure of figure 301.
  • the reference bone structure is a bone structure obtained from the reference posture of a specified model, and is data prepared in advance for each model.
  • the reference bone structure can be obtained, for example, from three-dimensional data consisting of a rough outline with a reduced number of polygons from the three-dimensional data of the target model.
  • each part included in the target model is recognized.
  • the parts to be recognized may include the face, chest, abdomen, waist, both arms, and both legs.
  • the reference bone structure is updated according to the angle of the part recognized from the captured image. Therefore, the updated bone structure indicates the posture of the target model included in the captured image. Furthermore, by mapping the updated bone structure to the captured image, the area near the position of the corresponding bone structure in the captured image can be determined as an area indicating that the target model is being captured.
  • the CPU 201 acquires three-dimensional data including the reference bone structure of the target model (here, the figure 301).
  • This data is information that is pre-stored in the storage unit 202 when the application is installed.
  • three-dimensional data including the reference bone structure 1102 of the figure 301 is read from the storage unit 202.
  • the CPU 201 identifies the parts to be recognized based on the information of the figure 301, which is the target model.
  • here, it is not only the parts related to the selected effect that are identified, but the parts necessary to update the bone structure; basically, every part included in the figure 301 is identified.
  • the CPU 201 updates the corresponding part of the reference bone structure acquired in S501 according to the recognized part (information on posture and distance from the camera 206). Specifically, the CPU 201 compares the recognized part with the corresponding part on the three-dimensional data including the reference bone structure, and updates the reference bone structure by adjusting the position of the feature points to match the angle of the recognized part. Then, in S304 the CPU 201 determines whether any of the parts identified in S502 have not been analyzed. If there are any unanalyzed parts, the process returns to S302, and if there are no unanalyzed parts, the process proceeds to S504.
  • the CPU 201 determines the posture of the target model from the updated reference bone structure, and ends the processing of this flowchart.
  • the posture of the target model may be determined by estimating the posture of the updated reference bone structure, for example, using a trained model that has been machine-learned to train the posture of the human body.
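  • The bone update and posture determination described above might be realised as in the following sketch; the per-part rotation of feature points and the posture classifier (`posture_model`) are illustrative assumptions.

```python
import numpy as np

def update_bone(reference_points, part_rotations):
    """Rotate each part's feature points to match the recognized angles.

    reference_points : dict part name -> (pivot 3-vector, Nx3 feature points)
                       taken from the pre-stored reference bone data
    part_rotations   : dict part name -> 3x3 rotation recognized from the captured image
    """
    updated = {}
    for part, (pivot, points) in reference_points.items():
        rot = part_rotations.get(part, np.eye(3))   # unchanged if the part was not recognized
        updated[part] = pivot + (points - pivot) @ rot.T
    return updated

def classify_posture(updated_points, posture_model):
    """Estimate a posture label for the target model from the updated bone.

    `posture_model` stands in for a trained pose classifier; its output can
    then be used to switch or trigger specific effects, as described below.
    """
    flat = np.concatenate([pts.reshape(-1) for pts in updated_points.values()])
    return posture_model.predict(flat)              # hypothetical call
```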
  • the effect to be output can be changed according to the estimated posture of the target model.
  • a specific effect can be output only when the posture of the target model indicates a specific posture.
  • an effect such as making the belt part light up and rotate can be output.
  • an effect such as blinking can be applied to each part in the following order: waist, torso, shoulders, upper arms, lower arms, and fists.
  • three-dimensional data including the updated bone structure may be used to complement the depth map.
  • the above three-dimensional data may also be used to identify the position of the target model on the depth map (i.e., in the captured image). If the position of the target model on the depth map can be identified, it is only necessary to determine whether or not to draw the pixels of the effect object that overlap the target model. This can complement the depth map when its accuracy is poor, and it can also reduce the processing load by performing the comparison process of S402 above only for pixels that overlap the target model.
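  • The load reduction suggested here can be expressed as a simple mask test: project the updated bone data into the captured image and apply the depth comparison only inside the resulting mask. Below is a sketch under that assumption; the projected `model_mask` is taken as given.

```python
import numpy as np

def draw_effect_near_model(effect_rgba, effect_depth_m, scene_depth_m, model_mask):
    """Depth-test effect pixels only where they overlap the target model.

    model_mask : HxW boolean mask obtained by projecting the updated bone
                 structure (three-dimensional data) onto the captured image.
    Outside the mask the effect is drawn as-is, since nothing there can
    occlude it, so an unreliable depth value cannot hide it by mistake.
    """
    out = effect_rgba.copy()
    has_effect = effect_rgba[..., 3] > 0
    occluded = model_mask & has_effect & (effect_depth_m >= scene_depth_m)
    out[occluded] = 0   # hide only the effect pixels that lie behind the model
    return out
```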
  • the information processing terminal further recognizes information regarding the posture of each part included in the specified model and the distance from the camera, updates the three-dimensional data including the reference bone structure indicating the reference posture of the specified model according to each recognized part, and recognizes the posture of the specified model based on the updated three-dimensional data.
  • the present invention is not limited to the above embodiment, and various modifications and alterations are possible within the scope of the gist of the invention.
  • an example was described in which three rings that surround the body of a specific model figure 301 are synthesized and output as effect objects.
  • an example was described in which an object that is not actually included in figure 301 is synthesized and output, but the present invention is not limited to this.
  • an effect may be synthesized and output for a part that the figure physically has.
  • FIG. 13 shows a modified figure 1101 placed on a desk 302 and photographed by the information processing terminal 101.
  • the figure 1101 has a physical object 1102, for example, a flame.
  • effects may be synthesized and output for such physically existing parts.
  • flickering flames or sparks may be added and output.
  • a different expression may be added to the captured image of the figure's face, for example, a smiling or crying expression, or an image in which the line of sight is changed to face the camera may be synthesized and output.
  • at least one part of the figure can be recognized from the captured image, and any effect can be added as long as it is an effect that can be generated based on the recognized part.
  • a humanoid model has been used as an example of the target model, but this is not intended to limit the present invention, and the present invention can be applied to models of various shapes, such as humans, animals, robots, insects, and dinosaurs.
  • by dividing the model into multiple parts and recognizing them, it is possible to provide augmented reality while ensuring real-time performance.
  • the above embodiment discloses at least the following computer program, information processing terminal, and control method thereof.
  • a computer program that causes a computer of an information processing terminal to function as: an imaging means for imaging the surrounding environment including a specified model; an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means; a recognition means for recognizing information related to the attitude of at least one part of the specified model contained in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means; an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, thereby generating an object image; and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • the computer program described in (1) or (2) is characterized in that the computer of the information processing terminal is further made to function as a selection means for selecting an effect to be composited with a captured image including the specified model, and the recognition means recognizes parts related to the selected effect.
  • the computer program described in (3) is characterized in that the recognition means recognizes parts related to the selected effect using a trained model that is trained to input a captured image and output information about the shape, angle, and distance for each part of the specified model.
  • the computer program described in (7) is characterized in that the parts related to the selected effect are parts located in the vicinity of the object to be generated.
  • An information processing terminal comprising: an imaging means for imaging a surrounding environment including a specified model; an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means; a recognition means for recognizing information related to the attitude of at least one part of the specified model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means; an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, thereby generating an object image; and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • a control method for an information processing terminal comprising: an imaging step of imaging a surrounding environment including a predetermined model by an imaging means; an acquisition step of acquiring distance information from the imaging means for each pixel of the image captured in the imaging step; a recognition step of recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means; a position determination step of determining position information of an object to be generated based on the at least one part recognized in the recognition step; an object generation step of drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, thereby generating an object image; and an output step of outputting a composite image in which the object image is superimposed on the captured image to a display unit.
  • 101 Information processing terminal
  • 102 Application server
  • 103 Machine learning server
  • 104 Database
  • 105 Network
  • 201 CPU
  • 202 Storage unit
  • 203 Communication control unit
  • 204 Display unit
  • 205 Operation unit
  • 206 Camera
  • 207 Speaker
  • 210 System bus
  • 301 Figure
  • 501 Image acquisition unit
  • 502 Depth information acquisition unit
  • 503 Object recognition unit
  • 504 Effect position determination unit
  • 505 Learned model
  • 506 Effect drawing unit
  • 507 Synthesis unit
  • 508 Output unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)
  • Digital Computer Display Output (AREA)

Abstract

[Problem] The present invention provides, for example, a mechanism for suitably synthesizing an object with a captured image of a real space and outputting the result of synthesis. [Solution] This information processing terminal: captures an image of a surrounding environment including a prescribed model; and acquires, for each pixel of the captured image that has been captured, information pertaining to the distance from a camera. The information processing terminal also: recognizes information relating to the orientation of at least one part of the prescribed model included in the captured image and the distance from the camera; and determines, with reference to the at least one part for which the orientation was recognized, position information pertaining to an object being generated. The information processing terminal furthermore: draws, as an object, pixels indicated by the position information to be closer to the camera than the distance information for the corresponding pixel of the captured image, from among the pixels of the object being generated; generates an object image; and outputs, to a display unit, a composite image in which the object image is superimposed on the captured image.

Description

Computer program, information processing terminal, and control method thereof
The present invention relates to a computer program, an information processing terminal, and a control method thereof.
Conventionally, XR (Cross Reality) technologies such as AR (Augmented Reality) have been used in various fields to present an augmented real space to a user by superimposing information such as text and CG on the real space. For example, Patent Document 1 discloses an augmented reality system that recognizes marks on the base of a figure with a camera on a mobile device, generates images for presentation that are prepared in association with the marks, and displays the figure and the images superimposed on the screen of the mobile device.
Patent Document 1: Japanese Patent No. 5551205
In the above-mentioned conventional technology, the figure image and the video for the production are displayed superimposed on each other, and the figure image is hidden where the figure image and the video for the production overlap. In other words, in the above-mentioned conventional technology, all the video for the production is displayed in front of the figure image. However, to create a sense of three-dimensionality and realism, it is desirable for objects superimposed on the captured image, such as the video for the production, to be exposed in front of or hidden behind the figure depending on their positional relationship with the figure. Also, in order to add objects to the captured image in real time, it is necessary to reduce the processing load as much as possible.
The present invention provides, for example, a mechanism for suitably synthesizing objects with captured images of real space and outputting the resulting images.
The present invention is, for example, a computer program that causes a computer of an information processing terminal to function as an imaging means for imaging a surrounding environment including a specified model, an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means, a recognition means for recognizing information regarding the attitude of at least one part of the specified model contained in the captured image and its distance from the imaging means, a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means, an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
The present invention is also characterized in that, for example, an information processing terminal includes an imaging means for imaging a surrounding environment including a predetermined model, an acquisition means for acquiring distance information from the imaging means for each pixel of the captured image captured by the imaging means, a recognition means for recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means, a position determination means for determining position information of an object to be generated based on the at least one part recognized by the recognition means, an object generation means for drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output means for outputting a composite image in which the object image is superimposed on the captured image to a display unit.
The present invention is also characterized in that it is a control method for an information processing terminal, comprising, for example, an imaging step of imaging a surrounding environment including a predetermined model by an imaging means, an acquisition step of acquiring distance information from the imaging means for each pixel of the image captured in the imaging step, a recognition step of recognizing information regarding the attitude of at least one part of the predetermined model included in the captured image and the distance from the imaging means, a position determination step of determining position information of an object to be generated based on the at least one part recognized in the recognition step, an object generation step of drawing, as an object, pixels among the pixels of the object to be generated whose respective position information indicates that they are closer to the imaging means than the distance information of the corresponding pixel in the captured image, and generating an object image, and an output step of outputting a composite image in which the object image is superimposed on the captured image to a display unit.
According to the present invention, for example, it is possible to suitably synthesize an object into a captured image of real space and output it.
FIG. 1 is a diagram showing an example of the configuration of a system according to an embodiment.
FIG. 2 is a diagram showing an example of the configuration of an information processing terminal according to an embodiment.
FIG. 3 is a diagram showing an example of an XR gimmick provided by the system according to an embodiment.
FIG. 4 is a diagram showing screen transitions of an XR gimmick according to an embodiment.
FIG. 5 is a diagram showing a functional configuration relating to effect synthesis according to an embodiment.
FIG. 6 is a diagram showing a series of example images according to a processing procedure of effect synthesis according to an embodiment.
FIG. 7 is a flowchart showing a processing procedure of basic control according to an embodiment.
FIG. 8 is a flowchart showing a processing procedure for effect synthesis output according to an embodiment.
FIG. 9 is a flowchart showing a processing procedure for object recognition according to an embodiment.
FIG. 10 is a flowchart showing a processing procedure for generating an object according to an embodiment.
FIG. 11 is a diagram showing a method for generating a bone structure according to an embodiment.
FIG. 12 is a flowchart showing a processing procedure for object recognition according to an embodiment.
FIG. 13 is a diagram showing a modified example of the figure according to an embodiment.
以下、添付図面を参照して実施形態を詳しく説明する。尚、以下の実施形態は特許請求の範囲に係る発明を限定するものではなく、また実施形態で説明されている特徴の組み合わせの全てが発明に必須のものとは限らない。実施形態で説明されている複数の特徴のうち二つ以上の特徴が任意に組み合わされてもよい。また、同一若しくは同様の構成には同一の参照番号を付し、重複した説明は省略する。  The embodiments are described in detail below with reference to the attached drawings. Note that the following embodiments do not limit the invention as claimed, and not all combinations of features described in the embodiments are necessarily essential to the invention. Two or more of the features described in the embodiments may be combined in any desired manner. Furthermore, the same reference numbers are used for the same or similar configurations, and duplicate descriptions will be omitted.
<第1の実施形態> <システム構成> 以下では本発明の第1の実施形態について説明する。まず図1を参照して、本実施形態に係るシステム構成について説明する。なお、ここでは必要最低限の簡易的な構成について説明するが、本発明を限定するものではない。例えば、各装置については複数の装置が含まれてもよいし、複数のサーバが一体化して設けられてもよい。  <First embodiment> <System configuration> Below, a first embodiment of the present invention will be described. First, with reference to FIG. 1, the system configuration according to this embodiment will be described. Note that a simple configuration with the minimum necessary components will be described here, but this does not limit the present invention. For example, each device may include multiple devices, or multiple servers may be integrated.
本システムは、情報処理端末101、アプリケーションサーバ102、機械学習サーバ103、及びデータベース104を含んで構成される。情報処理端末101及びアプリケーションサーバ102はネットワークを介して相互に通信可能に接続される。アプリケーションサーバ102は、ローカルエリアネットワーク(LAN)を介して機械学習サーバ103に相互に通信可能に接続される。また、機械学習サーバ103はLANを介してデータベース104に接続される。  This system includes an information processing terminal 101, an application server 102, a machine learning server 103, and a database 104. The information processing terminal 101 and the application server 102 are connected to each other via a network so that they can communicate with each other. The application server 102 is connected to the machine learning server 103 via a local area network (LAN) so that they can communicate with each other. In addition, the machine learning server 103 is connected to the database 104 via the LAN.
情報処理端末101は、例えば、スマートフォン、携帯電話機、タブレットPC等の携帯型の情報処理端末である。カメラ等の撮像部と、撮像した画像を表示する表示部とを少なくとも有する情報処理端末であれば任意の装置であってよい。情報処理端末101は、ネットワーク105を介してアプリケーションサーバ102から、本発明を実施するためのアプリケーションをダウンロードしてインストールする。当該アプリケーションが情報処理端末101で実行されることによって、以下で説明する所定の模型を含む撮像画像に、エフェクトオブジェクトを合成した拡張現実を提供することができる。なお、撮像画像は、静止画像及び動画像(映像)の何れであってもよい。また、動画像にエフェクトを付加する場合には、エフェクトオブジェクトをアニメーションとして合成してもよい。  The information processing terminal 101 is, for example, a portable information processing terminal such as a smartphone, a mobile phone, or a tablet PC. Any device may be used as long as it has at least an imaging unit such as a camera and a display unit for displaying the captured image. The information processing terminal 101 downloads and installs an application for implementing the present invention from the application server 102 via the network 105. By executing the application on the information processing terminal 101, it is possible to provide an augmented reality in which an effect object is synthesized into a captured image including a specific model, as described below. The captured image may be either a still image or a moving image (video). Furthermore, when an effect is added to a moving image, the effect object may be synthesized as an animation.
アプリケーションサーバ102は、機械学習サーバ103において機械学習された学習済みモデルを取得し、当該学習済みモデルを組み込んだアプリケーションを情報処理端末101等の外部端末に提供する。機械学習サーバ103は、例えば、深層学習の畳み込みニューラルネットワーク(CNN)によって、画像情報に含まれるフィギュア(模型)の各パーツを認識する学習済みモデルを生成する。学習データとしては、例えば所定の模型のパーツごとに様々な姿勢や角度から撮影された撮像画像に教師データを付与したデータを用いる。このように所定の模型をパーツ毎に学習させることにより、例えば推定フェーズにおいて所定の模型の全体を認識するよりも、高速に且つ処理負荷を抑えた認識処理を実行することができる。学習データについては、機械学習サーバ103で生成してもよいし、外部で生成されたデータを受信してもよい。また、機械学習サーバ103は、模型ごとに生成した学習済みモデルをデータベース104に格納し、必要に応じてアプリケーションサーバ102へ提供する。また、機械学習サーバ103は、追加の学習データに基づいて再学習を行うために、データベース104から対応する学習済みモデルを読み出して再学習させ、再学習後のモデルをデータベース104へ再度格納する。  The application server 102 acquires a trained model that has been machine-learned in the machine learning server 103, and provides an application incorporating the trained model to an external terminal such as the information processing terminal 101. The machine learning server 103 generates a trained model that recognizes each part of a figure (model) included in image information, for example, by a deep learning convolutional neural network (CNN). For example, data obtained by attaching teacher data to captured images taken from various postures and angles for each part of a specific model is used as the training data. By training the specific model for each part in this way, it is possible to execute a recognition process faster and with a reduced processing load than, for example, recognizing the entire specific model in the estimation phase. The training data may be generated by the machine learning server 103, or data generated externally may be received. The machine learning server 103 also stores the trained model generated for each model in the database 104, and provides it to the application server 102 as necessary. In addition, in order to perform re-learning based on additional training data, the machine learning server 103 reads out the corresponding trained model from the database 104, re-learns it, and stores the re-learned model in the database 104 again.
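As a purely illustrative sketch of this training step (the disclosure only specifies a deep-learning convolutional neural network trained per part; the framework, backbone, and label set below are assumptions, and the actual recognizer also outputs posture angles and distance rather than only a class label), a per-part image classifier could be fine-tuned along the following lines.

```python
import torch
import torch.nn as nn
from torchvision import models

# One label per part of a single model kit; the actual part split is chosen per kit (assumption).
PART_CLASSES = ["head", "chest", "abdomen", "waist", "arm", "leg", "background"]

def build_part_recognizer(num_classes: int = len(PART_CLASSES)) -> nn.Module:
    """Small CNN fine-tuned to answer: which part does this image crop show?"""
    net = models.resnet18(weights=None)                  # any compact backbone would do
    net.fc = nn.Linear(net.fc.in_features, num_classes)  # replace the classification head
    return net

def train_step(net, images, labels, optimizer, loss_fn=nn.CrossEntropyLoss()):
    """One supervised update on a batch of labelled part crops."""
    optimizer.zero_grad()
    loss = loss_fn(net(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for crops of each part photographed from many angles and poses.
net = build_part_recognizer()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
print(train_step(net, torch.randn(4, 3, 224, 224), torch.randint(0, len(PART_CLASSES), (4,)), opt))
```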
<情報処理端末の構成> 次に、図2を参照して、本実施形態に係る情報処理端末101の構成例について説明する。ここでは、本実施形態に係る情報処理端末101において本発明を説明する上で重要なデバイスについてのみ説明する。したがって、情報処理端末101は代替的に又は追加的に他のデバイスを含んで構成されてもよい。  <Configuration of information processing terminal> Next, an example of the configuration of the information processing terminal 101 according to this embodiment will be described with reference to FIG. 2. Here, only the devices that are important for explaining the present invention in the information processing terminal 101 according to this embodiment will be described. Therefore, the information processing terminal 101 may be configured to include other devices instead or in addition.
情報処理端末101は、CPU201、記憶部202、通信制御部203、表示部204、操作部205、カメラ206、及びスピーカ207を備える。各コンポーネントは、システムバス210を介して相互にデータを送受することができる。  The information processing terminal 101 includes a CPU 201, a memory unit 202, a communication control unit 203, a display unit 204, an operation unit 205, a camera 206, and a speaker 207. Each component can send and receive data to and from each other via a system bus 210.
CPU201は、システムバス210を介して接続された各コンポーネントを全体的に制御する中央処理プロセッサである。CPU201は、記憶部202に記憶されたコンピュータプログラムを実行することにより、後述する各処理を実行する。記憶部202は、CPU201のワーク領域や一時領域として使用されるとともに、CPU201によって実行される制御プログラムや各種データを記憶している。  The CPU 201 is a central processor that provides overall control of each component connected via the system bus 210. The CPU 201 executes each process described below by executing computer programs stored in the memory unit 202. The memory unit 202 is used as a work area and temporary area for the CPU 201, and also stores the control programs executed by the CPU 201 and various data.
通信制御部203は広帯域無線通信によりネットワーク105を介してアプリケーションサーバ102と双方向通信を行うことができる。なお、通信制御部203は、広帯域無線通信に加えて又は代えて、無線LAN(WiFi)、Bluetooth(登録商標)通信、及び赤外線通信などの近距離無線通信の機能を有してもよい。通信制御部203は、例えば広帯域無線通信機能を有しておらず、WiFi通信機能を有している場合、近くのアクセスポイントを介してネットワーク105へ接続する。  The communication control unit 203 can perform bidirectional communication with the application server 102 via the network 105 using broadband wireless communication. Note that the communication control unit 203 may have short-range wireless communication functions such as wireless LAN (WiFi), Bluetooth (registered trademark) communication, and infrared communication in addition to or instead of broadband wireless communication. For example, if the communication control unit 203 does not have a broadband wireless communication function but has a WiFi communication function, it connects to the network 105 via a nearby access point.
表示部204はタッチパネル式の液晶ディスプレイであり、各種画面を表示するとともに、カメラ206によって撮像された静止画像や動画像を表示する。操作部205は、表示部204と一体化して設けられ、ユーザ操作を受け付ける操作入力部である。また、操作部205は、物理的に構成された押下式やスライド式のボタン等を含んでもよい。  The display unit 204 is a touch panel type liquid crystal display that displays various screens as well as still images and moving images captured by the camera 206. The operation unit 205 is an operation input unit that is integrated with the display unit 204 and accepts user operations. The operation unit 205 may also include physically configured push-type or slide-type buttons, etc.
カメラ206は、情報処理端末101の周辺環境を撮像する撮像部であり、例えば情報処理端末101において表示部204が設けられた裏側に位置することが望ましい。これにより、ユーザはカメラ206で撮影しながら、当該撮像画像を表示部204で確認することができる。なお、カメラ206は単眼カメラであっても、複眼カメラであってもよい。スピーカ207は例えば出力するエフェクトオブジェクトに合わせて音声を出力する。音声データについては、エフェクトごとに予め用意されている。  The camera 206 is an imaging unit that captures images of the surrounding environment of the information processing terminal 101, and is preferably located, for example, behind the display unit 204 on the information processing terminal 101. This allows the user to check the captured image on the display unit 204 while taking an image with the camera 206. The camera 206 may be a monocular camera or a compound eye camera. The speaker 207 outputs sound, for example, in accordance with the effect object to be output. Sound data is prepared in advance for each effect.
<XRギミック> 次に、図3を参照して、本実施形態に係るシステムが提供するXRギミックの一例について説明する。ここでは、アプリケーションサーバ102から、機械学習サーバ103によって生成された学習済みモデルを組み込んだアプリケーションが情報処理端末101にダウンロードされ、インストールされていることを前提とする。当該アプリケーションは、情報処理端末101で実行されることによりXRギミックを提供する。 <XR Gimmick> Next, an example of an XR gimmick provided by the system according to the present embodiment will be described with reference to FIG. 3. Here, it is assumed that an application incorporating a trained model generated by the machine learning server 103 is downloaded from the application server 102 to the information processing terminal 101 and installed. The application provides an XR gimmick by being executed on the information processing terminal 101.
301は所定の模型の一例であり、人型のフィギュアである。本発明を限定する意図はなく、任意の物体等を模した模型であれば本発明に適用することができる。302はフィギュア301を載置した机を示す。ユーザは情報処理端末101上で上記アプリケーションを起動し、アプリケーション画面に表示された複数の項目から当該フィギュア301に対応する項目を選択する。ユーザが当該フィギュア301に対応する項目を選択すると、カメラ206が起動され、ユーザは机302に載置されたフィギュア301を撮影する。ユーザは撮影中において自由に情報処理端末101を動かして、矢印に示すように、フィギュア301を撮影する角度を変更することができる。撮像された映像は情報処理端末101の表示部204に表示される。ここで、後述する操作ボタン等を選択することにより、当該映像にエフェクトオブジェクトを合成して表示させることができる。このように、本システムは、フィギュア301を含む情報処理端末101の周辺環境を撮像した現実空間に、アニメーション等のエフェクトオブジェクトを重畳して出力することにより拡張した現実空間を提供する。  301 is an example of a predetermined model, a human figure. There is no intention to limit the present invention, and any model that imitates any object or the like can be applied to the present invention. 302 indicates a desk on which the figure 301 is placed. The user starts the above application on the information processing terminal 101, and selects the item corresponding to the figure 301 from multiple items displayed on the application screen. When the user selects the item corresponding to the figure 301, the camera 206 is started, and the user takes a picture of the figure 301 placed on the desk 302. The user can freely move the information processing terminal 101 during shooting and change the angle at which the figure 301 is shot, as shown by the arrow. The captured image is displayed on the display unit 204 of the information processing terminal 101. Here, by selecting an operation button or the like described later, an effect object can be synthesized with the image and displayed. In this way, this system provides an augmented real space by superimposing an effect object such as an animation on the captured real space, that is, the imaged surrounding environment of the information processing terminal 101 including the figure 301, and outputting the result.
<XRギミックにおける画面遷移> 次に、図4を参照して、本実施形態に係るXRギミックに係る画面遷移について説明する。図4(a)~図4(d)は、ユーザがXRギミックを提供するアプリケーションを実行して、図3に示すように情報処理端末101を動かした場合の画面遷移について説明する。  <Screen transitions in XR gimmick> Next, with reference to FIG. 4, we will explain the screen transitions related to the XR gimmick of this embodiment. FIG. 4(a) to FIG. 4(d) explain the screen transitions when the user executes an application that provides the XR gimmick and moves the information processing terminal 101 as shown in FIG. 3.
図4(a)に示す画面400は、本実施形態に係るXRギミックを提供するアプリケーションを起動すると表示部204に表示される画面である。ここでは、当該アプリケーションに登録されているフィギュアを選択するための選択ボタン401~405が表示される。ここでは、5つのフィギュアが登録されている例を示すが、さらに多くのフィギュアが登録される場合には、画面を下方向にスクロールすることで未表示の項目が表示され、選択可能となる。各項目には、それぞれ異なるフィギュアが登録されており、ユーザが撮影対象となるフィギュアを選択すると、図4(b)に示す画面410に遷移する。  Screen 400 shown in FIG. 4(a) is a screen that is displayed on the display unit 204 when an application that provides an XR gimmick according to this embodiment is launched. Here, selection buttons 401 to 405 are displayed for selecting figures registered in the application. Here, an example is shown in which five figures are registered, but if more figures are registered, undisplayed items can be displayed and selected by scrolling the screen downwards. Different figures are registered in each item, and when the user selects a figure to be photographed, the screen transitions to screen 410 shown in FIG. 4(b).
画面410では、カメラ206が起動され、カメラ206によって情報処理端末101の周辺環境が撮像され、当該映像が表示部204に表示されている様子を示す。当該周辺環境の撮像画像には、図3に示したように、机302に載置されたフィギュア301が含まれる。また、撮像された映像に加えて、各種ボタン411~413が選択可能に表示される。ボタン411は、撮影中の静止画像を撮像するためのボタンである。ボタン411が操作されると、操作されたタイミングで静止画像を取得し、記憶部202に保存される。ボタン412は、エフェクトを付加するためのボタンである。ボタン412が操作されると、当該フィギュアにおいて登録されている少なくとも1つのエフェクトが選択可能に表示され、さらにユーザは所望のエフェクトを選択することができる。ボタン413は、各種メニューを表示するボタンである。開始したXRギミックを終了させて画面400へ遷移させたり、他の設定等を行ったりすることができる。なお、ここでは3つのボタンを含む例について説明したが、さらに多くの操作ボタンが含まれてもよい。  The screen 410 shows the state in which the camera 206 is started, the camera 206 captures an image of the surrounding environment of the information processing terminal 101, and the image is displayed on the display unit 204. The captured image of the surrounding environment includes the figure 301 placed on the desk 302, as shown in FIG. 3. In addition to the captured image, various buttons 411 to 413 are displayed in a selectable manner. The button 411 is a button for capturing a still image during shooting. When the button 411 is operated, a still image is acquired at the timing of the operation and stored in the storage unit 202. The button 412 is a button for adding an effect. When the button 412 is operated, at least one effect registered for the figure is displayed in a selectable manner, and the user can select a desired effect. The button 413 is a button for displaying various menus. The XR gimmick that has been started can be ended to transition to the screen 400, and other settings can be made. Note that although an example including three buttons has been described here, more operation buttons may be included.
ボタン412を介して所定のエフェクトが選択されると、図4(c)の画面420に示すように、エフェクトの合成出力が開始される。421は映像に合成されたエフェクトオブジェクトであり、フィギュア301を囲むように3つの輪が表示されている。これらの輪は、例えばフィギュア301の頭上から発生し、足下方向に向けてアニメーション表示されてもよい。なお、表示されているエフェクトオブジェクト421は、フィギュア301の前面に表示されている部分と、フィギュア301の背面に隠れて表示されていない部分があることが分かる。これらの表示制御の詳細については後述する。  When a predetermined effect is selected via the button 412, the composite output of the effect begins, as shown on the screen 420 in FIG. 4(c). 421 is an effect object composited into the video, and three rings are displayed so as to surround the figure 301. These rings may be animated, for example, so as to appear above the head of the figure 301 and move toward its feet. It can be seen that the displayed effect object 421 has a portion that is displayed in front of the figure 301 and a portion that is hidden behind the figure 301 and not displayed. Details of this display control will be described later.
図4(d)に示すように、画面430は、エフェクトの出力中において、ユーザが図4(c)の状態からフィギュア301の側面から撮影する状態まで情報処理端末101を動かした際の画面を示す。ここでは、フィギュア301の側面に回った場合においても、エフェクトオブジェクト431に示すように、情報処理端末101から見てフィギュア301の後側に回り込む部分は表示されていないことが分かる。このように、本実施形態に係るエフェクトの合成出力では、カメラ206によって撮影された映像に追従して、エフェクトオブジェクトもフィギュア301の位置関係に応じて表示が変化するものである。詳細な表示制御については後述する。  As shown in FIG. 4(d), screen 430 shows the screen when the user moves the information processing terminal 101 from the state in FIG. 4(c) to a state where the user is photographing the side of the figure 301 while the effect is being output. Here, it can be seen that even when moving around to the side of the figure 301, the part that wraps around to the rear of the figure 301 as seen from the information processing terminal 101 is not displayed, as shown in effect object 431. In this way, in the composite output of effects according to this embodiment, the display of the effect object also changes according to the positional relationship of the figure 301, following the image captured by the camera 206. Detailed display control will be described later.
<エフェクト合成の機能構成> 次に、図5を参照して、本実施形態に係るエフェクト合成出力に係る機能構成について説明する。以下で説明する機能構成は、例えばCPU201が記憶部202に予め記憶された制御プログラムを実行することにより実現されるものである。本情報処理端末101は、エフェクト合成出力に係る機能構成として、画像取得部501、深度情報取得部502、物体認識部503、エフェクト位置決定部504、学習済みモデル505、エフェクト描画部506、合成部507、及び出力部508を含む。  <Functional configuration of effect synthesis> Next, referring to FIG. 5, the functional configuration related to the effect synthesis output according to this embodiment will be described. The functional configuration described below is realized, for example, by the CPU 201 executing a control program pre-stored in the storage unit 202. The information processing terminal 101 includes, as functional configuration related to effect synthesis output, an image acquisition unit 501, a depth information acquisition unit 502, an object recognition unit 503, an effect position determination unit 504, a trained model 505, an effect drawing unit 506, a synthesis unit 507, and an output unit 508.
画像取得部501は、カメラ206によって撮像された撮像画像(RGB画像)を取得する。画像取得部501によって取得されたRGB画像は、深度情報取得部502、物体認識部503、及び合成部507へそれぞれ出力される。深度情報取得部502は、画像取得部501から受け取った撮像画像における各画素について、撮像時におけるカメラ206からの距離情報(深度情報)を取得する。深度情報取得部502は、取得した深度情報を示すグレースケール画像(深度マップ)を生成する。深度情報の取得方法としては、任意の既知の手法を用いてもよく、例えば、ステレオ視や時間差による運動視差を利用して取得する手法や、畳み込みニューラルネットワークを利用して二次元画像から対象物までの距離を推定するように学習させた機械学習済みのモデルによって取得する手法であってもよい。なお、本実施形態に係るXRギミックはリアルタイム性を要するものであるため、処理負荷が低い手法が望ましい。深度情報取得部502は、取得した深度情報(深度マップ)をエフェクト描画部506へ出力する。  The image acquisition unit 501 acquires an image (RGB image) captured by the camera 206. The RGB image acquired by the image acquisition unit 501 is output to the depth information acquisition unit 502, the object recognition unit 503, and the synthesis unit 507. The depth information acquisition unit 502 acquires distance information (depth information) from the camera 206 at the time of capturing for each pixel in the captured image received from the image acquisition unit 501. The depth information acquisition unit 502 generates a grayscale image (depth map) indicating the acquired depth information. Any known method may be used as a method for acquiring depth information, for example, a method of acquiring the information using stereo vision or motion parallax due to a time difference, or a method of acquiring the information using a machine-learned model that has been trained to estimate the distance from a two-dimensional image to an object using a convolutional neural network. Note that since the XR gimmick according to this embodiment requires real-time performance, a method with a low processing load is desirable. The depth information acquisition unit 502 outputs the acquired depth information (depth map) to the effect drawing unit 506.
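The grayscale depth map mentioned here can be illustrated with a minimal sketch: per-pixel camera distances are normalized and inverted so that nearer pixels appear whiter, matching the depth map 610 described later. The function name and the toy distance values are assumptions introduced only for this example.

```python
import numpy as np

def depth_to_grayscale(depth_m: np.ndarray) -> np.ndarray:
    """Convert a per-pixel distance map (metres) into an 8-bit grayscale
    depth map in which nearer pixels are rendered closer to white."""
    d_min, d_max = float(depth_m.min()), float(depth_m.max())
    norm = (depth_m - d_min) / max(d_max - d_min, 1e-6)   # 0 = nearest, 1 = farthest
    return ((1.0 - norm) * 255).astype(np.uint8)          # invert so that near -> white

# Toy example: a 4x4 scene spanning 0.5 m to 2.0 m from the camera.
depth = np.linspace(0.5, 2.0, 16).reshape(4, 4)
print(depth_to_grayscale(depth))
```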
物体認識部503は、学習済みモデル505を用いて、撮像画像に含まれるフィギュア301の少なくとも1つのパーツの姿勢と、カメラ206から当該パーツまでの距離とを認識する。姿勢情報には、当該パーツの形状及び角度の情報が含まれる。より詳細には、物体認識部503は、学習済みモデル505を用いて、認識対象となるパーツの前、後、上、下、左、右、ピッチ、ヨー、及びロールの各方向における角度を検出することができる。ここで、少なくとも1つのパーツとは、フィギュア301等の所定の模型における頭部、胸部、腹部、腰部、腕部、及び脚部の少なくとも1つであり、選択されたエフェクトに関連するパーツである。所定の模型をパーツごとに分割する粒度については任意である。例えば、可動フィギュアの場合には、関節を有するパーツごとに分割することが望ましい。これにより、可動するパーツごとに形状や姿勢等を認識することができ、可動が行われた場合であっても認識誤りを低減させることができる。  The object recognition unit 503 uses the trained model 505 to recognize the posture of at least one part of the figure 301 included in the captured image and the distance from the camera 206 to the part. The posture information includes information on the shape and angle of the part. More specifically, the object recognition unit 503 can use the trained model 505 to detect the angles of the part to be recognized in each of the directions of front, back, up, down, left, right, pitch, yaw, and roll. Here, at least one part is at least one of the head, chest, abdomen, waist, arms, and legs in a specified model such as the figure 301, and is a part related to the selected effect. The granularity of dividing the specified model into parts is arbitrary. For example, in the case of a movable figure, it is desirable to divide it into parts having joints. This makes it possible to recognize the shape, posture, etc. of each movable part, and reduces recognition errors even when movement is performed.
また、エフェクトに関連するパーツとは、生成するエフェクトオブジェクトの近傍に位置するパーツを示す。これは、撮像画像に合成するエフェクトオブジェクトがフィギュア301との位置関係を考慮して配置されるものであり、生成するエフェクトオブジェクトの位置を決定するためである。例えば、所定の模型において胸部の一部から光線を出力するエフェクトオブジェクトを生成する場合には、当該模型の胸部の姿勢及び撮像画像におけるカメラ206から当該模型の胸部までの距離を認識することで、エフェクトオブジェクトの発生位置や発生方向を決定することができる。  Furthermore, parts related to an effect refer to parts located near the effect object to be generated. This is because the effect object to be composited into the captured image is positioned taking into consideration its positional relationship with the figure 301, and the position of the effect object to be generated is determined. For example, when generating an effect object that outputs a ray from part of the chest of a specific model, the position and direction in which the effect object will be generated can be determined by recognizing the orientation of the chest of the model and the distance from the camera 206 to the chest of the model in the captured image.
このように、本実施形態によれば、所定の模型全体の姿勢及び撮像画像におけるカメラ206から当該模型までの距離を認識するものではなく、生成するエフェクトオブジェクトに関連する少なくとも1つのパーツのみを認識する。これにより、所定の模型全体を認識する場合と比較して、高速に処理することができ、XRギミックのリアルタイム性を保証することができる。なお、物体認識部503は、選択された模型の三次元形状モデルの情報を予め保持しているため、一部のパーツの姿勢及び距離を認識することにより、他のパーツの姿勢及び距離をある程度推定することも可能である。物体認識部503は、生成するエフェクトオブジェクトに関連する少なくとも1つのパーツの姿勢及び距離に関する情報を認識すると、当該情報をエフェクト位置決定部504へ出力する。  Thus, according to this embodiment, rather than recognizing the posture of the entire specified model and the distance from the camera 206 to the model in the captured image, only at least one part related to the effect object to be generated is recognized. This allows for faster processing compared to recognizing the entire specified model, and ensures the real-time performance of the XR gimmick. Note that since the object recognition unit 503 holds in advance information on the three-dimensional shape model of the selected model, it is also possible to estimate to some extent the posture and distance of other parts by recognizing the posture and distance of some parts. When the object recognition unit 503 recognizes information related to the posture and distance of at least one part related to the effect object to be generated, it outputs the information to the effect position determination unit 504.
エフェクト位置決定部504は、取得した少なくとも1つのパーツの姿勢及びカメラ206からの距離に関する情報に基づいて、生成するエフェクトオブジェクトの位置情報を決定する。当該位置情報には、エフェクトオブジェクトについての少なくとも姿勢(角度)及びカメラ206からの距離に関する情報が含まれる。決定した位置情報はエフェクト描画部506に出力される。エフェクト描画部506では生成するエフェクトオブジェクトのモデル情報を予め保持しているため、ここでは当該エフェクトオブジェクトの基準位置をフィギュア301の所定位置と関連付けて定義した情報が出力されうる。つまり、生成するエフェクトオブジェクトの位置情報は、エフェクト描画部506がエフェクトオブジェクトを描画するために必要となる情報を含んでいればよく、例えば当該エフェクトオブジェクトの姿勢(角度)及びカメラ206からの距離に関する情報を示すものであればよい。  The effect position determination unit 504 determines position information of the effect object to be generated based on the acquired information on the attitude of at least one part and the distance from the camera 206. The position information includes information on at least the attitude (angle) and distance from the camera 206 for the effect object. The determined position information is output to the effect drawing unit 506. Since the effect drawing unit 506 holds model information of the effect object to be generated in advance, information that defines the reference position of the effect object in association with a specified position of the figure 301 can be output here. In other words, the position information of the effect object to be generated only needs to include information required for the effect drawing unit 506 to draw the effect object, and may, for example, indicate information on the attitude (angle) of the effect object and the distance from the camera 206.
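A minimal sketch of this position determination, under the assumption that a part's recognized posture is expressed as pitch/yaw/roll angles plus a camera-space position, and that the effect is anchored at a fixed offset defined in that part's local coordinate frame (the rotation order, function names, and numeric values are illustrative, not part of the disclosure):

```python
import numpy as np

def rotation_matrix(pitch, yaw, roll):
    """Rotation built from pitch/yaw/roll angles in radians (Z-Y-X order is an assumption)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rz @ ry @ rx

def effect_pose_from_part(part_position, part_angles, local_offset):
    """Place the effect at a fixed offset defined in the recognized part's local
    frame; returns the effect's camera-space position and rotation."""
    r = rotation_matrix(*part_angles)
    return part_position + r @ local_offset, r

# Hypothetical values: a chest part 0.6 m in front of the camera, rotated 20 degrees in yaw.
pos, rot = effect_pose_from_part(np.array([0.0, 0.1, 0.6]),
                                 (0.0, np.deg2rad(20), 0.0),
                                 np.array([0.0, 0.0, 0.05]))
print(pos)
```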
エフェクト描画部506は、深度情報取得部502から取得した深度情報と、エフェクト位置決定部504から取得したエフェクトオブジェクトの姿勢及び距離に関する情報とに基づいて、エフェクトオブジェクトを描画する。エフェクト描画部506は、上述したように、生成するエフェクトについての予め保持しているモデル情報に従って描画を行う。より詳細には、エフェクト描画部506は、撮像画像の各画素の深度情報(距離情報)に応じて、対応するエフェクトオブジェクトの描画画素のうち、撮像画像の対応する画素よりもカメラ206に近いことを示す画素について描画する。一方、エフェクト描画部506は、対応するエフェクトオブジェクトの描画画素のうち、撮像画像の対応する画素よりもカメラ206に近いことを示さない画素については描画しない。これにより、例えばフィギュア301の背面に隠れるエフェクトオブジェクトは描画されず、フィギュア301の前面に露出するエフェクトオブジェクトのみが描画されることになる。エフェクト描画部506は、描画したエフェクトオブジェクト画像を合成部507へ出力する。  The effect drawing unit 506 draws an effect object based on the depth information acquired from the depth information acquisition unit 502 and the information on the posture and distance of the effect object acquired from the effect position determination unit 504. As described above, the effect drawing unit 506 performs drawing according to the model information previously stored for the effect to be generated. More specifically, the effect drawing unit 506 draws pixels that indicate that the corresponding effect object is closer to the camera 206 than the corresponding pixel of the captured image, among the drawing pixels of the corresponding effect object, according to the depth information (distance information) of each pixel of the captured image. On the other hand, the effect drawing unit 506 does not draw pixels that do not indicate that the corresponding effect object is closer to the camera 206 than the corresponding pixel of the captured image, among the drawing pixels of the corresponding effect object. As a result, for example, an effect object hidden behind the figure 301 is not drawn, and only an effect object exposed in front of the figure 301 is drawn. The effect drawing unit 506 outputs the drawn effect object image to the synthesis unit 507.
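The per-pixel visibility rule described here can be sketched compactly. In this illustrative example (the array names and the vectorized formulation are assumptions), an effect pixel is drawn only when its distance is smaller than the captured image's distance at the same position:

```python
import numpy as np

def draw_effect(effect_rgb, effect_depth, scene_depth):
    """Rasterized effect layer: a pixel is kept only when the effect is
    nearer to the camera than the real scene at that pixel (smaller depth)."""
    visible = effect_depth < scene_depth          # boolean occlusion mask
    layer = np.zeros_like(effect_rgb)
    layer[visible] = effect_rgb[visible]          # copy only the visible effect pixels
    return layer, visible

# Toy 2x2 example: the right column of the effect lies behind the scene and is culled.
effect_rgb = np.full((2, 2, 3), 200, dtype=np.uint8)
effect_depth = np.array([[0.4, 0.9], [0.4, 0.9]])
scene_depth = np.array([[0.6, 0.6], [0.6, 0.6]])
layer, mask = draw_effect(effect_rgb, effect_depth, scene_depth)
print(mask)   # [[ True False], [ True False]]
```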
合成部507は、画像取得部501から取得した撮像画像に対して、当該撮像画像に基づいて生成され、エフェクト描画部506から取得したエフェクトオブジェクト画像を重畳して現実空間にエフェクト画像を付加した合成画像を生成する。また、合成部507は、環境の輝度調整などを行うことにより、合成した画像の最終調整や品質調整を行ってもよい。例えば、選択されたエフェクトに合わせて、よりエフェクトを強調して表示する場合には、現実空間の画像を暗くするなどの調整を行うことができる。合成画像は出力部508に渡され、出力部508は、合成画像を表示部204に表示する。  The synthesis unit 507 generates a synthetic image by superimposing the effect object image acquired from the effect drawing unit 506, which is generated based on the captured image acquired from the image acquisition unit 501, and adding the effect image to the real space. The synthesis unit 507 may also perform final adjustments and quality adjustments to the synthesized image by adjusting the brightness of the environment, etc. For example, when displaying an effect with more emphasis in accordance with a selected effect, adjustments such as darkening the image in the real space can be made. The synthetic image is passed to the output unit 508, which displays the synthetic image on the display unit 204.
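A hedged sketch of the superimposition and the optional brightness adjustment described above (the darkening factor and helper names are assumptions; the disclosure does not prescribe a specific blending formula):

```python
import numpy as np

def composite(captured_rgb, effect_layer, visible_mask, dim_background=0.7):
    """Overlay the drawn effect pixels onto the captured frame; optionally darken
    the real-space pixels so that the selected effect stands out."""
    out = (captured_rgb.astype(np.float32) * dim_background).astype(np.uint8)
    out[visible_mask] = effect_layer[visible_mask]
    return out

# Toy frame: one effect pixel at (0, 0); the rest of the frame is dimmed real space.
frame = np.full((2, 2, 3), 120, dtype=np.uint8)
layer = np.zeros((2, 2, 3), dtype=np.uint8)
layer[0, 0] = 255
mask = np.zeros((2, 2), dtype=bool)
mask[0, 0] = True
result = composite(frame, layer, mask)
print(result[0, 0], result[1, 1])    # effect pixel vs. dimmed background pixel
```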
各部はカメラ206によって継続的に取得される撮像画像に対して上述した一連の処理を周期的(例えば、30msec、60msec、90msecなどの周期)に実行してもよい。この場合、出力部508によって表示される画像は動画像となる。なお、付加されるエフェクトオブジェクトについても動的に変化するアニメーションとして表示されてもよい。この場合、生成するエフェクトごとに、周期的な処理に合わせてアニメーションを構成する連続的な複数の画像が予め保持されている。さらに、出力部508は、表示した合成画像(エフェクトのアニメーション)に合わせて、スピーカ207によって所定の音声を出力することも可能である。  Each unit may execute the above-described series of processes periodically (e.g., at intervals of 30 msec, 60 msec, or 90 msec) on the captured images continuously acquired by the camera 206. In this case, the image displayed by the output unit 508 becomes a moving image. The added effect object may also be displayed as a dynamically changing animation. In this case, for each effect to be generated, a plurality of consecutive images that compose the animation in accordance with the periodic processing are stored in advance. Furthermore, the output unit 508 can also output a predetermined sound from the speaker 207 in accordance with the displayed composite image (the effect animation).
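The periodic execution described above might be organized as a simple fixed-period loop; the callback names, the 30 msec figure (one of the periods mentioned), and the stand-in functions are illustrative assumptions:

```python
import time

FRAME_PERIOD_SEC = 0.033   # roughly 30 msec per cycle

def run_xr_loop(capture_frame, process_frame, display, max_frames=10):
    """Drive the capture -> recognize/draw -> composite -> display pipeline
    at a fixed period; the effect animation advances one step per cycle."""
    for frame_index in range(max_frames):
        started = time.monotonic()
        frame = capture_frame()
        composite_image = process_frame(frame, frame_index)   # one animation step per cycle
        display(composite_image)
        # Sleep only for the remainder of the period to keep a stable cadence.
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, FRAME_PERIOD_SEC - elapsed))

# Stand-in callbacks so the sketch runs on its own.
run_xr_loop(lambda: "frame",
            lambda f, i: f"{f}+effect[{i}]",
            print,
            max_frames=3)
```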
<エフェクト合成における処理画像> 次に、図6を参照して、本実施形態に係るエフェクト合成の処理手順に応じた一連の処理画像について説明する。ここでは、図3に示すフィギュア301及び机302を含む情報処理端末101の周辺環境を撮像した撮像画像に対してエフェクトを合成する例について説明する。  <Processed images in effect compositing> Next, with reference to FIG. 6, a series of processed images according to the processing procedure for effect compositing according to this embodiment will be described. Here, an example of compositing an effect on a captured image of the surrounding environment of the information processing terminal 101, including the figure 301 and desk 302 shown in FIG. 3, will be described.
600はカメラ206によって撮像された撮像画像を示す。撮像画像600には机302に載置されたフィギュア301が含まれている。610は深度情報取得部502によって撮像画像600から得られた深度情報である、グレースケールの深度マップを示す。深度マップ610では、各画素において、白に近いほどカメラ206からの距離が近いことを示す。  Reference numeral 600 denotes an image captured by the camera 206. The captured image 600 includes a figure 301 placed on a desk 302. Reference numeral 610 denotes a grayscale depth map, which is depth information obtained from the captured image 600 by the depth information acquisition unit 502. In the depth map 610, the closer each pixel is to white, the closer the distance from the camera 206 is.
620は、物体認識部503によって生成されるエフェクトオブジェクトに関連する少なくとも1つのパーツを学習済みモデル505を用いて認識している様子を示す。ここでは、例えばフィギュア301の頭部621及び胸部622の認識が行われている。このように、本実施形態によれば、撮像画像600に含まれるフィギュア301の全体を認識するのではなく、生成するエフェクトオブジェクトに関連する一部のパーツのみが認識される。  620 shows how at least one part related to the effect object generated by the object recognition unit 503 is recognized using the trained model 505. Here, for example, the head 621 and chest 622 of the figure 301 are recognized. In this way, according to this embodiment, only some of the parts related to the effect object to be generated are recognized, rather than recognizing the entire figure 301 contained in the captured image 600.
630は、620で認識されたパーツに基づいて、位置情報が決定されたエフェクトオブジェクト631のモデルを示す。ここでは、生成するエフェクトオブジェクト631の全体の位置情報が決定される。なお、エフェクトオブジェクト631のモデル情報を予め保持しているため、当該オブジェクトを描画するための姿勢やカメラ206からの距離に関する情報が含まれればよい。  630 shows a model of an effect object 631 whose position information has been determined based on the parts recognized in 620. Here, the overall position information of the effect object 631 to be generated is determined. Note that since the model information of the effect object 631 is stored in advance, it is sufficient that the information included is related to the posture for rendering the object and the distance from the camera 206.
640は合成されるエフェクトオブジェクト画像を示す。エフェクトオブジェクト画像640では、エフェクト描画部506が深度マップ610とエフェクトオブジェクトのモデル630とを用いて、撮像画像に合成するエフェクトオブジェクト641を描画する。エフェクトオブジェクト641は、エフェクトオブジェクトの全体を示すエフェクトオブジェクト631と比較すると、描画されていない部分が含まれる。これは、深度マップ610から得られる距離情報と、モデル630から得られるエフェクトオブジェクト631の距離情報とを比較し、撮像画像の対応画素よりも前面に位置する(つまり、カメラ206に近い)エフェクトオブジェクトのみを描画したためである。  640 indicates the effect object image to be composited. In effect object image 640, effect rendering unit 506 uses depth map 610 and effect object model 630 to render effect object 641 to be composited into the captured image. Compared to effect object 631, which shows the entire effect object, effect object 641 includes parts that are not rendered. This is because distance information obtained from depth map 610 is compared with distance information of effect object 631 obtained from model 630, and only effect objects located in front of the corresponding pixel in the captured image (i.e., closer to camera 206) are rendered.
650は、撮像画像600に対してエフェクトオブジェクト画像640を重畳した合成画像を示す。合成画像650は、エフェクトオブジェクト画像640が単に撮像画像600に対して重ね合わせて生成された画像である。しかしながら、合成画像650では、エフェクトオブジェクト631のうち、フィギュア301の画像部分に重複し、かつフィギュア301の背面に隠れる部分は描画されていないことが分かる。このように、本実施形態によれば、より立体感や臨場感を有するXRギミックを提供することができる。なお、本実施形態では、対象物に隠れるエフェクトの部分を描画することなく、前面に露出する部分のみを描画する。従って、一度描画したエフェクトオブジェクトのうち、フィギュアとの位置関係に応じて隠れる部分を消去する制御と比較して、より処理負荷を低減した処理を実現でき、高速に処理を行うことができる。  650 shows a composite image in which the effect object image 640 is superimposed on the captured image 600. The composite image 650 is an image generated by simply superimposing the effect object image 640 on the captured image 600. However, it can be seen that in the composite image 650, the part of the effect object 631 that overlaps with the image portion of the figure 301 and is hidden behind the figure 301 is not drawn. In this way, this embodiment can provide an XR gimmick with a greater sense of depth and presence. Note that in this embodiment, only the part exposed to the front is drawn, without drawing the part of the effect hidden by the target object. Therefore, compared to control in which the part of an already drawn effect object that is hidden according to its positional relationship with the figure is erased, processing with a lower processing load can be realized and can be performed at a higher speed.
<基本制御> 次に、図7を参照して、本実施形態に係るXRギミックを提供するアプリケーションにおける基本制御の処理手順を説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。  <Basic control> Next, with reference to FIG. 7, the processing procedure of basic control in the application that provides the XR gimmick according to this embodiment will be described. The processing described below is realized, for example, by the CPU 201 reading and executing a control program that has been pre-stored in the memory unit 202.
まずS101でCPU201は、本実施形態に係るXRギミックを提供するアプリケーションが起動されると、メニューを選択可能に表示する画面400を表示部204に表示する。続いて、S102でCPU201は、画面400や当該画面400から遷移する設定画面(不図示)等を介して、ユーザ操作に応じて選択された情報を取得する。ここでの選択情報には、例えばカメラ206で撮像して表示する所定の模型に関する情報が含まれる。CPU201は、選択された情報に応じてカメラ206による撮像を開始させる。撮像画像は、画面410に示すように、表示部204に表示される。  First, in S101, when an application that provides the XR gimmick according to this embodiment is launched, the CPU 201 displays a screen 400 that displays a selectable menu on the display unit 204. Next, in S102, the CPU 201 acquires information selected in response to a user operation via the screen 400 or a setting screen (not shown) to which the screen 400 transitions. The selected information here includes, for example, information about a specific model that is captured by the camera 206 and displayed. The CPU 201 starts capturing images with the camera 206 in response to the selected information. The captured image is displayed on the display unit 204, as shown on screen 410.
次に、S103でCPU201は、ボタン412を介してエフェクト出力が選択されたかどうかを判断する。選択された場合は処理をS106へ進め、そうでない場合は処理をS104へ進める。S104でCPU201は、カメラ206によって撮像された撮像画像を取得して表示部204へ表示し、処理をS107へ進める。一方、S106でCPU201は、撮像画像にエフェクトを合成して出力し、処理をS107へ進める。S106の詳細な処理については図8を用いて後述する。S107でCPU201は、映像の出力を終了するか否かを判断し、終了しない場合は処理をS103に戻し、終了する場合は本フローチャートの処理を終了する。例えば、ボタン413を介して画面400に戻る指示が行われた場合や、当該アプリケーションが終了された場合に、CPU201は映像の出力を終了すると判断して、カメラ206の起動を停止する。  Next, in S103, the CPU 201 determines whether or not effect output has been selected via the button 412. If it has been selected, the process proceeds to S106; if not, the process proceeds to S104. In S104, the CPU 201 acquires an image captured by the camera 206, displays it on the display unit 204, and advances the process to S107. On the other hand, in S106, the CPU 201 combines an effect with the captured image and outputs it, and advances the process to S107. The detailed processing of S106 will be described later with reference to FIG. 8. In S107, the CPU 201 determines whether or not to end the video output; if not, the process returns to S103, and if so, the processing of this flowchart ends. For example, if an instruction to return to the screen 400 is given via the button 413 or if the application is ended, the CPU 201 determines that the video output is to be ended and stops the camera 206.
<エフェクト合成出力制御> 次に、図8を参照して、本実施形態に係るエフェクト合成出力(S106)の処理手順について説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。  <Effects composite output control> Next, the processing procedure for effects composite output (S106) according to this embodiment will be described with reference to FIG. 8. The processing described below is realized, for example, by the CPU 201 reading and executing a control program or the like that is pre-stored in the storage unit 202.
まずS201でCPU201は、ボタン412を介して選択されたエフェクト情報を取得する。エフェクト情報には、生成するエフェクトを識別するための識別情報や、当該エフェクトが関連する少なくとも1つのパーツの情報等を含む。これらの情報はアプリケーションサーバ102から受信して記憶部202に予め記憶されている情報である。続いて、S202でCPU201は、カメラ206によって撮像された処理対象の撮像画像を取得する。  First, in S201, the CPU 201 acquires effect information selected via the button 412. The effect information includes identification information for identifying the effect to be generated, information on at least one part to which the effect is related, and the like. This information is received from the application server 102 and is pre-stored in the storage unit 202. Next, in S202, the CPU 201 acquires a captured image of the processing target captured by the camera 206.
次に、S203でCPU201は、深度情報取得部502によって、S202で取得した撮像画像の深度マップを取得する。また、S204でCPU201は、S202で取得した撮像画像を学習済みモデル505に入力し、撮像画像に含まれる模型(ここでは、フィギュア301)についてS201で取得したエフェクトに関連する少なくとも1つのパーツの物体認識を行う。物体認識の詳細な制御については図9を用いて後述する。物体認識が行われると、S205でCPU201は、S204で認識された少なくとも1つのパーツの姿勢及び距離に関する情報に基づいて、生成するエフェクトの位置情報を決定する。位置情報には、上述したように、エフェクト画像を生成するための情報として、生成するエフェクトの姿勢(角度)及びカメラ206からの距離に関する情報が含まれる。なお、S203と、S204及びS205との処理順序は説明を容易にするために順序付けて説明したが、深度マップを取得する処理と、エフェクトオブジェクトの位置決定を行う処理とは逆の順序で行われてもよく、並行して行われるものであってよい。  Next, in S203, the CPU 201 uses the depth information acquisition unit 502 to acquire a depth map of the captured image acquired in S202. In addition, in S204, the CPU 201 inputs the captured image acquired in S202 to the trained model 505, and performs object recognition of at least one part, related to the effect acquired in S201, of the model (here, the figure 301) included in the captured image. Detailed control of the object recognition will be described later with reference to FIG. 9. After the object recognition is performed, in S205, the CPU 201 determines position information of the effect to be generated based on the information on the attitude and distance of the at least one part recognized in S204. As described above, the position information includes, as information for generating the effect image, information on the attitude (angle) of the effect to be generated and the distance from the camera 206. Note that although S203, S204, and S205 have been described in this order for ease of explanation, the process of acquiring the depth map and the processes of determining the position of the effect object may be performed in the reverse order, or may be performed in parallel.
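Because acquiring the depth map (S203) and recognizing the parts (S204) depend only on the captured image, the parallel execution mentioned above could, for example, be sketched with two concurrent tasks. The placeholder functions below are assumptions standing in for the actual S203/S204 processing:

```python
from concurrent.futures import ThreadPoolExecutor

def build_depth_map(image):
    """Placeholder for S203: derive per-pixel distances from the captured image."""
    return {"depth_map_for": image}

def recognize_parts(image, effect_parts):
    """Placeholder for S204: recognize only the parts related to the selected effect."""
    return {part: "pose+distance" for part in effect_parts}

image = "captured_image"
with ThreadPoolExecutor(max_workers=2) as pool:
    depth_future = pool.submit(build_depth_map, image)
    parts_future = pool.submit(recognize_parts, image, ("head", "chest"))
    depth_map = depth_future.result()
    recognized = parts_future.result()
print(depth_map, recognized)
```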
次に、S206でCPU201は、S203で取得された深度マップと、S205で決定されたエフェクトオブジェクトの位置情報とに基づいてエフェクトオブジェクトの画像を生成する。エフェクトオブジェクトの画像の生成制御については図10を用いて後述する。続いて、S207でCPU201は、S202で取得した撮像画像に対して、S206で生成したエフェクトオブジェクト画像を重畳して合成する。その後、S208でCPU201は、合成画像を表示部204に表示するとともに、必要に応じてスピーカ207から音声を出力し、本フローチャートの処理を終了する。  Next, in S206, the CPU 201 generates an image of the effect object based on the depth map acquired in S203 and the position information of the effect object determined in S205. The generation control of the image of the effect object will be described later with reference to FIG. 10. Next, in S207, the CPU 201 superimposes the effect object image generated in S206 onto the captured image acquired in S202 to synthesize them. After that, in S208, the CPU 201 displays the synthesized image on the display unit 204 and outputs sound from the speaker 207 as necessary, and ends the processing of this flowchart.
<物体認識制御> 次に、図9を参照して、本実施形態に係る物体認識(S204)の処理手順について説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。  <Object Recognition Control> Next, the processing procedure for object recognition (S204) according to this embodiment will be described with reference to FIG. 9. The processing described below is realized, for example, by the CPU 201 reading and executing a control program or the like that is pre-stored in the memory unit 202.
まずS301でCPU201は、S201で取得したエフェクト情報に基づいて、生成するエフェクトに関連するパーツを特定する。例えば図4や図6で説明したXRギミックの例では、フィギュア301の頭部及び胸部をエフェクトに関連するパーツとして特定する。続いて、S302でCPU201は、S301で特定した少なくとも1つのパーツについて、学習済みモデル505を用いて撮像画像に含まれる当該パーツを認識する。  First, in S301, the CPU 201 identifies parts related to the effect to be generated based on the effect information acquired in S201. For example, in the example of the XR gimmick described in FIG. 4 and FIG. 6, the head and chest of the figure 301 are identified as parts related to the effect. Next, in S302, the CPU 201 uses the trained model 505 to recognize at least one part identified in S301 that is included in the captured image.
次に、S303でCPU201は、学習済みモデル505の出力結果から、認識したパーツの形状、角度、及び距離に関する情報を取得する。その後、S304でCPU201は、S301で特定されたパーツのうち、未解析のパーツがあるかどうかを判断する。未解析のパーツがあれば、処理をS302に戻し、未解析のパーツが無ければ本フローチャートの処理を終了する。  Next, in S303, the CPU 201 obtains information regarding the shape, angle, and distance of the recognized part from the output result of the trained model 505. After that, in S304, the CPU 201 determines whether or not there are any unanalyzed parts among the parts identified in S301. If there are any unanalyzed parts, the process returns to S302, and if there are no unanalyzed parts, the process of this flowchart ends.
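A compact sketch of this S301 to S304 loop, restricted to the parts tied to the selected effect; the dictionary layout and the stand-in recognizer are assumptions introduced for illustration:

```python
def recognize_effect_parts(captured_image, effect_info, recognize_part):
    """Loop over only the parts tied to the selected effect (S301 to S304):
    each part is run through the trained model and its shape, angles and
    camera distance are collected; unrelated parts are never analysed."""
    results = {}
    for part in effect_info["related_parts"]:               # S301: parts tied to the effect
        results[part] = recognize_part(captured_image, part)  # S302/S303: recognize and read out
    return results                                           # S304: stop when none remain

# Hypothetical stand-in for the trained model's per-part output.
fake_model = lambda img, part: {"shape": "...", "angles": (0.0, 0.0, 0.0), "distance_m": 0.6}
print(recognize_effect_parts("frame", {"related_parts": ["head", "chest"]}, fake_model))
```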
<エフェクトオブジェクトの生成制御> 次に、図10を参照して、本実施形態に係るエフェクトオブジェクト生成(S206)の処理手順について説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。  <Effect object generation control> Next, the processing procedure for generating an effect object (S206) according to this embodiment will be described with reference to FIG. 10. The processing described below is realized, for example, by the CPU 201 reading and executing a control program or the like that is pre-stored in the storage unit 202.
まずS401でCPU201は、S205で決定されたエフェクトオブジェクトの位置情報と、予め保持している生成するエフェクトオブジェクトのモデル情報とに基づいて、生成するエフェクトオブジェクトの画素位置xを初期化する。ここではエフェクトオブジェクトの全画素について後述する処理を実施するため、例えば初期値としてエフェクトオブジェクトの左上の画素位置を画素位置xとして設定する。  First, in S401, the CPU 201 initializes the pixel position x of the effect object to be generated based on the position information of the effect object determined in S205 and the model information of the effect object to be generated that has been stored in advance. Here, in order to perform the processing described below on all pixels of the effect object, for example, the upper left pixel position of the effect object is set as the initial value for pixel position x.
次にS402でCPU201は、エフェクトオブジェクトの処理対象の画素位置xと、対応する撮像画像の画素位置yとのそれぞれの距離情報を比較する。続いて、S403でCPU201は、比較の結果、エフェクトオブジェクトの方が前方に位置するかどうか(カメラ206に近いかどうか)を判断する。エフェクトオブジェクトが前方であればS404に進み、そうでなければS405へ進む。S404でCPU201は、対応画素のエフェクトオブジェクトを描画し、S405に進む。S405でCPU201は、エフェクトオブジェクトの全ての画素について、対応する撮像画像の画素と比較したかどうかを判断する。全ての画素について処理が終了すると、本フローチャートの処理を終了し、そうでない場合は処理をS402へ戻す。  Next, in S402, the CPU 201 compares the distance information of the pixel position x of the effect object being processed with that of the corresponding pixel position y of the captured image. Next, in S403, the CPU 201 determines from the comparison whether the effect object is located in front (closer to the camera 206). If the effect object is in front, the process proceeds to S404; otherwise, it proceeds to S405. In S404, the CPU 201 draws the effect object at the corresponding pixel and proceeds to S405. In S405, the CPU 201 determines whether all pixels of the effect object have been compared with the corresponding pixels of the captured image. When processing has been completed for all pixels, the processing of this flowchart ends; otherwise, the process returns to S402.
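The S401 to S405 pixel loop can be expressed directly as follows; the data layout (a mapping from pixel position to color and effect distance) is an assumption chosen only to keep the sketch short:

```python
def generate_effect_image(effect_pixels, scene_depth, width, height):
    """Pixel-by-pixel version of the flowchart: for every effect pixel x,
    compare its distance with the captured image's distance at the same
    position (S402/S403) and draw it only when the effect is nearer (S404)."""
    drawn = {}
    for (x, y), (color, effect_distance) in effect_pixels.items():   # S401/S405: scan all pixels
        if 0 <= x < width and 0 <= y < height and effect_distance < scene_depth[y][x]:
            drawn[(x, y)] = color                                     # S404: draw this pixel
    return drawn

scene_depth = [[0.6, 0.6], [0.6, 0.6]]
effect_pixels = {(0, 0): ("ring", 0.4), (1, 0): ("ring", 0.9)}        # nearer / farther than the scene
print(generate_effect_image(effect_pixels, scene_depth, 2, 2))         # only (0, 0) is drawn
```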
以上説明したように、本実施形態に係る情報処理端末は、所定の模型を含む周辺環境を撮像し、撮像された撮像画像の各画素について、カメラからの距離情報を取得する。また、本情報処理端末は、撮像画像に含まれる所定の模型の少なくとも1つのパーツの姿勢及びカメラからの距離に関する情報を認識し、認識した少なくとも1つのパーツを基準に、生成するオブジェクトの位置情報を決定する。さらに、本情報処理端末は、生成するオブジェクトの各画素のうち、それぞれの位置情報が撮像画像の対応する画素の距離情報よりもカメラに近いことを示す画素をオブジェクトとして描画し、オブジェクト画像を生成し、撮像画像にオブジェクト画像を重畳した合成画像を表示部に出力する。一方、本情報処理端末は、生成するオブジェクトの各画素のうち、それぞれの位置情報が撮像画像の対応する画素の距離情報よりもカメラに近いことを示さない画素については描画しない。つまり、本実施形態によれば、合成するエフェクトオブジェクトのうち、所定物よりも前方に位置する部分については描画し、所定物よりも後方に位置する部分については所定物に隠れるため描画しない。このように、本発明は現実空間の撮像画像に好適にオブジェクトを合成して出力することができる。  As described above, the information processing terminal according to this embodiment captures an image of the surrounding environment including a predetermined model, and acquires distance information from the camera for each pixel of the captured image. In addition, the information processing terminal recognizes information on the posture of at least one part of the predetermined model included in the captured image and the distance from the camera, and determines position information of an object to be generated based on the recognized at least one part. Furthermore, the information processing terminal draws, as an object, those pixels of the object to be generated whose position information indicates that they are closer to the camera than the distance information of the corresponding pixel in the captured image, generates an object image, and outputs a composite image in which the object image is superimposed on the captured image to the display unit. On the other hand, the information processing terminal does not draw those pixels of the object to be generated whose position information does not indicate that they are closer to the camera than the distance information of the corresponding pixel in the captured image. In other words, according to this embodiment, of the effect object to be synthesized, a portion located in front of the predetermined object is drawn, and a portion located behind the predetermined object is not drawn because it is hidden by the predetermined object. In this way, the present invention can suitably synthesize an object into a captured image of real space and output it.
<第2の実施形態> 以下では本発明の第2の実施形態について説明する。上記第1の実施形態では物体認識(S204)において、エフェクトオブジェクトを生成するための基準位置を認識すべく、少なくとも1つのパーツを認識する制御について説明した。また、上記実施形態では、エフェクトオブジェクトの描画については撮像画像から生成したデプスマップを用いてエフェクトオブジェクトの位置と、撮像画像の各画素の位置とを比較して描画の有無を制御する例について説明した。  <Second embodiment> The second embodiment of the present invention will be described below. In the above first embodiment, a control for recognizing at least one part in order to recognize a reference position for generating an effect object in object recognition (S204) was described. Also, in the above embodiment, an example was described in which, regarding the rendering of an effect object, a depth map generated from a captured image is used to compare the position of the effect object with the position of each pixel in the captured image to control whether or not to render it.
しかしながら、デプスマップの精度は、カメラの性能や光量等の撮像時の環境条件に応じて変動するものである。そこで、本実施形態では、物体認識(S204)において、上記少なくとも1つのパーツの認識に加えて、対象模型のボーン構造を構築して、デプスマップを補完する制御について説明する。また、構築したボーン構造を利用することにより、対象模型の姿勢を判定することができ、判定した姿勢に応じてエフェクトオブジェクトを動的に変化させることができる。詳細については後述する。  However, the accuracy of the depth map varies depending on the camera's performance and the environmental conditions at the time of image capture, such as the amount of light. Therefore, in this embodiment, in addition to recognizing at least one part as described above, in object recognition (S204), a bone structure of the target model is constructed to complement the depth map, and control is described. Furthermore, by using the constructed bone structure, the posture of the target model can be determined, and the effect object can be dynamically changed according to the determined posture. Details will be described later.
<ボーン構造> まず図11を参照して、本実施形態に係る模型のボーン構造について説明する。1100は、撮像画像に含まれる所定の模型であるフィギュアのボーン構造を示す。1100では撮像画像に含まれるフィギュア301及び机302は点線で示す。  <Bone structure> First, the bone structure of the model according to this embodiment will be described with reference to FIG. 11. Reference numeral 1100 indicates the bone structure of a figure, which is a specified model, contained in the captured image. In 1100, the figure 301 and desk 302 contained in the captured image are indicated by dotted lines.
図11に示す1101などの黒丸は、フィギュア301の特徴点を示す。これらの特徴点はフィギュアの大まかなアウトラインを生成するための点であり、その数や位置を限定する意図はない。1102は各特徴点を連結したボーン構造を示す。図11に示すボーン構造1102は、フィギュア301の基準ボーン構造を示す。基準ボーン構造とは、所定の模型の基準姿勢から得られるボーン構造であり、模型ごとに予め用意されているデータである。基準ボーン構造は、例えば対象模型の3次元データから、ポリゴン数を軽減した大まかなアウトラインからなる三次元データから得ることができる。  Black circles such as 1101 in Figure 11 indicate feature points of figure 301. These feature points are used to generate a rough outline of the figure, and there is no intention to limit their number or position. 1102 indicates the bone structure connecting each feature point. Bone structure 1102 in Figure 11 indicates the reference bone structure of figure 301. The reference bone structure is a bone structure obtained from the reference posture of a specified model, and is data prepared in advance for each model. The reference bone structure can be obtained, for example, from three-dimensional data consisting of a rough outline with a reduced number of polygons from the three-dimensional data of the target model.
上記第1の実施形態では出力するエフェクトに関連のある少なくとも1つのパーツを認識したが、本実施形態では対象模型に含まれる各パーツの認識を行う。例えば、認識するパーツには顔、胸、腹、腰、両腕、及び両脚が含まれてもよい。また、本実施形態では、撮像画像から認識されるパーツの角度に従って、基準ボーン構造を更新する。従って、更新したボーン構造は、撮像画像に含まれる対象模型の姿勢を示すこととなる。さらに、更新したボーン構造を撮像画像にマッピングすることにより、撮像画像中における対応するボーン構造の位置付近においては、対象模型が撮像されていることを示す領域として判定することができる。  In the first embodiment above, at least one part related to the effect to be output is recognized, but in this embodiment, each part included in the target model is recognized. For example, the parts to be recognized may include the face, chest, abdomen, waist, both arms, and both legs. Furthermore, in this embodiment, the reference bone structure is updated according to the angles of the parts recognized from the captured image. The updated bone structure therefore indicates the posture of the target model included in the captured image. Furthermore, by mapping the updated bone structure onto the captured image, the area near the position of the corresponding bone structure in the captured image can be determined as an area in which the target model is captured.
<物体認識(ボーン構築を含む)> 次に、図12を参照して、本実施形態に係る物体認識(S204)の処理手順について説明する。以下で説明する処理は、例えばCPU201が記憶部202に予め記憶されている制御プログラム等を読み出して実行することにより実現される。なお、ここでは、上記第1の実施形態で説明した図9のフローチャートと異なる処理について説明し、同様の処理については同一のステップ番号を付し、説明を省略する。  <Object recognition (including bone construction)> Next, the processing procedure for object recognition (S204) according to this embodiment will be described with reference to FIG. 12. The processing described below is realized, for example, by the CPU 201 reading and executing a control program or the like pre-stored in the storage unit 202. Note that here, only the processing that differs from the flowchart of FIG. 9 described in the first embodiment above will be described; similar processing steps are given the same step numbers and their description is omitted.
まずS501でCPU201は、対象模型(ここではフィギュア301)の基準ボーン構造を含む三次元データを取得する。当該データは、アプリケーションをインストールした際に記憶部202に予め記憶される情報である。ここでは、例えばフィギュア301の基準ボーン構造1102を含む三次元データが記憶部202から読み出される。続いて、S502でCPU201は、対象模型であるフィギュア301の情報に基づいて、認識するパーツを特定する。ここでは、上記第1の実施形態におけるS301とは異なり、選択されたエフェクトに関連するパーツを特定するのではなく、ボーン構造を更新するために必要なパーツを特定する。なお、基本的には、フィギュア301に含まれる各パーツを特定する。  First, in S501, the CPU 201 acquires three-dimensional data including the reference bone structure of the target model (here, the figure 301). This data is information that is pre-stored in the storage unit 202 when the application is installed. Here, for example, three-dimensional data including the reference bone structure 1102 of the figure 301 is read from the storage unit 202. Next, in S502, the CPU 201 identifies the parts to be recognized based on the information of the figure 301, which is the target model. Here, unlike S301 in the first embodiment above, the parts related to the selected effect are not identified, but the parts necessary to update the bone structure are identified. Note that, basically, each part included in the figure 301 is identified.
その後、S302及びS303で各パーツを認識すると、S503でCPU201は、認識したパーツ(姿勢及びカメラ206からの距離に関する情報)に従って、S501で取得した基準ボーン構造の対応する部分を更新する。具体的には、CPU201は、認識したパーツと、基準ボーン構造を含む三次元データ上の対応する部分とを照合し、当該認識したパーツの角度に合わせるように特徴点の位置を調整して基準ボーン構造を更新する。その後、S304でCPU201は、S502で特定されたパーツのうち、未解析のパーツがあるかどうかを判断する。未解析のパーツがあれば、処理をS302に戻し、未解析のパーツが無ければS504に進む。  After that, when each part is recognized in S302 and S303, in S503 the CPU 201 updates the corresponding portion of the reference bone structure acquired in S501 according to the recognized part (the information on its posture and distance from the camera 206). Specifically, the CPU 201 matches the recognized part against the corresponding portion of the three-dimensional data including the reference bone structure, and updates the reference bone structure by adjusting the positions of the feature points to match the angle of the recognized part. After that, in S304, the CPU 201 determines whether any of the parts identified in S502 remain unanalyzed. If there are unanalyzed parts, the process returns to S302; if there are none, the process proceeds to S504.
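An illustrative sketch of the S503 update, simplified to a single yaw angle per part (the actual processing adjusts feature-point positions of the three-dimensional data; the data layout and names here are assumptions):

```python
import math

def update_reference_bone(reference_bone, recognized_parts):
    """S503 sketch: rotate each reference-bone segment to match the angle the
    trained model reported for the corresponding part, and shift it to the
    recognized camera distance; parts that were not recognized keep their
    reference pose."""
    updated = dict(reference_bone)
    for part, info in recognized_parts.items():
        if part not in updated:
            continue
        x, y, z = updated[part]["direction"]
        yaw = info["yaw"]                      # only yaw is handled, for brevity
        updated[part] = {
            "direction": (x * math.cos(yaw) - z * math.sin(yaw),
                          y,
                          x * math.sin(yaw) + z * math.cos(yaw)),
            "distance_m": info["distance_m"],
        }
    return updated

reference = {"right_arm": {"direction": (1.0, 0.0, 0.0), "distance_m": 0.6}}
recognized = {"right_arm": {"yaw": math.radians(90), "distance_m": 0.55}}
print(update_reference_bone(reference, recognized))
```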
S504でCPU201は、更新された基準ボーン構造から対象模型の姿勢を判定し、本フローチャートの処理を終了する。対象模型の姿勢の判定は、例えば人体の姿勢を機械学習させた学習済みモデルを用いて、更新した基準ボーン構造の姿勢を推定することにより行ってもよい。  In S504, the CPU 201 determines the posture of the target model from the updated reference bone structure, and ends the processing of this flowchart. The posture of the target model may be determined by estimating the posture of the updated reference bone structure, for example, using a trained model that has been machine-learned to train the posture of the human body.
本実施形態によれば、推定した対象模型の姿勢に応じて出力するエフェクトを変化させることができる。例えば、対象模型の姿勢が所定の姿勢を示す場合にのみ特定のエフェクトを出力するようにしてもよい。一例として、推定した姿勢が当該フィギュアの変身ポーズを示す場合には、例えばベルトの部分を発光させ、回転させるようなエフェクトを出力してもよい。また、フィギュア全体の姿勢を認識しているため、腰から胴体、肩を通って上腕、下腕、拳という順序で各パーツに対して連続して点滅等を示すエフェクトを付与してもよい。  According to this embodiment, the effect to be output can be changed according to the estimated posture of the target model. For example, a specific effect can be output only when the posture of the target model indicates a specific posture. As one example, when the estimated posture indicates the transformation pose of the figure, an effect such as making the belt part light up and rotate can be output. In addition, since the posture of the entire figure is recognized, an effect such as blinking can be applied to each part in the following order: waist, torso, shoulders, upper arms, lower arms, and fists.
また、本実施形態によれば、更新したボーン構造を含む三次元データをデプスマップを補完するために利用してもよい。カメラの性能や撮像時の環境条件に応じて、デプスマップを用いたエフェクトオブジェクトの生成制御に加えて、上記三次元データを用いてデプスマップ(即ち、撮像画像)上の対象模型の位置を特定してもよい。デプスマップ上での対象模型の位置が特定できれば、対応する画素と重複するエフェクトオブジェクトの画素を描画するかどうか判定するのみでよく、デプスマップの精度が良くない場合に補完的に利用することができるとともに、対象模型と重複する画素のみについて上記S402の比較処理を行うだけでよく、処理負荷を低減することもできる。  Furthermore, according to this embodiment, three-dimensional data including the updated bone structure may be used to complement the depth map. In addition to controlling the generation of the effect object using the depth map, depending on the camera performance and the environmental conditions at the time of image capture, the above three-dimensional data may be used to identify the position of the target model on the depth map (i.e., the captured image). If the position of the target model on the depth map can be identified, it is only necessary to determine whether or not to draw pixels of the effect object that overlap with corresponding pixels, which can be used to complement when the accuracy of the depth map is poor, and it is also possible to reduce the processing load by only performing the comparison process of S402 above for pixels that overlap with the target model.
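A minimal sketch of how the updated bone structure, once projected into the captured image, could restrict the S402 comparison to pixels near the model; the projection itself is assumed to have been done elsewhere, and the function name and radius are illustrative:

```python
import numpy as np

def model_mask_from_bone(shape, projected_bone_pixels, radius=1):
    """Mark the pixels near the projected (updated) bone structure as the
    'model region'; only these pixels need the per-pixel depth comparison."""
    mask = np.zeros(shape, dtype=bool)
    h, w = shape
    for (px, py) in projected_bone_pixels:
        y0, y1 = max(0, py - radius), min(h, py + radius + 1)
        x0, x1 = max(0, px - radius), min(w, px + radius + 1)
        mask[y0:y1, x0:x1] = True
    return mask

mask = model_mask_from_bone((6, 6), [(2, 2), (2, 3), (3, 3)])
print(mask.sum(), "of", mask.size, "pixels need the depth comparison")
```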
以上説明したように、本実施形態に係る情報処理端末は、さらに、所定の模型に含まれる各パーツの姿勢及びカメラからの距離に関する情報を認識し、所定の模型の基準姿勢を示す基準ボーン構造を含む三次元データを、認識した各パーツに従って更新し、更新した三次元データに基づいて所定の模型の姿勢を認識する。このように、本実施形態によれば、撮像画像から各パーツを認識して予め用意した基準ボーン構造を更新して、撮像画像に含まれるフィギュアの姿勢を判定することができる。これにより、生成するオブジェクトを、認識した所定の模型の姿勢に合わせて変化させることができる。また、更新された三次元データから特定される、撮像画像における所定の模型の位置に基づいて、オブジェクト画像を生成することができる。よって、デプスマップの精度が低い場合において、エフェクトオブジェクトの生成を好適に補完することができる。  As described above, the information processing terminal according to this embodiment further recognizes information on the posture of each part included in the predetermined model and its distance from the camera, updates the three-dimensional data including the reference bone structure indicating the reference posture of the predetermined model according to each recognized part, and recognizes the posture of the predetermined model based on the updated three-dimensional data. In this way, according to this embodiment, each part can be recognized from the captured image, the reference bone structure prepared in advance can be updated, and the posture of the figure included in the captured image can be determined. This makes it possible to change the object to be generated in accordance with the recognized posture of the predetermined model. In addition, the object image can be generated based on the position of the predetermined model in the captured image identified from the updated three-dimensional data. Therefore, when the accuracy of the depth map is low, the generation of the effect object can be suitably complemented.
<Modifications> The present invention is not limited to the above embodiment, and various modifications and alterations are possible within the scope of the gist of the invention. In the above embodiment, an example was described in which three rings surrounding the body of the figure 301, which is the predetermined model, are composited and output as effect objects. That is, an example of compositing and outputting an object that the figure 301 does not actually include has been described, but the present invention is not limited to this. For example, an effect may be composited and output for a part that the figure physically has.
FIG. 13 shows a figure 1101 according to a modification placed on the desk 302 and photographed by the information processing terminal 101. In addition to a main body similar to that of the figure 301, the figure 1101 includes a physical object 1102 representing, for example, a flame. According to the present invention, an effect may be composited and output for such a physically existing part; in the example of FIG. 13, for instance, flickering of the flame or flying sparks may be added to the output. Further, according to the present invention, a different facial expression, such as a smile or a crying face, may be added to the captured face of the figure, or an image in which the line of sight has been changed to face the camera may be composited and output. Note that, according to the present invention, any effect can be added as long as at least one part of the figure is recognized from the captured image and the effect can be generated with the recognized part as a reference.
In this embodiment, a humanoid model has been described as an example of the target model, but this is not intended to limit the present invention; the invention can be applied to models of various shapes, such as humans, animals, robots, insects, and dinosaurs. In any case, as described in the above embodiment, recognizing the model by dividing it into a plurality of parts makes it possible to provide augmented reality while ensuring real-time performance.
<Summary of the Embodiment> The above embodiment discloses at least the following computer program, information processing terminal, and method for controlling the same.
(1) A computer program for causing a computer of an information processing terminal to function as: an imaging means for capturing an image of the surrounding environment including a predetermined model; an acquisition means for acquiring, for each pixel of the captured image captured by the imaging means, distance information from the imaging means; a recognition means for recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated with the at least one part recognized by the recognition means as a reference; an object generation means for generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output means for outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
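A minimal sketch of the per-pixel test in (1) and (2), assuming NumPy arrays for the object's RGBA colors, the object's per-pixel distance, and the captured image's depth information; the names are illustrative and do not come from the embodiment.

```python
import numpy as np

def generate_object_image(object_rgba, object_depth, scene_depth):
    # object_rgba: (H, W, 4) colors of the object to be generated
    # object_depth: (H, W) distance of each object pixel from the imaging means
    # scene_depth: (H, W) distance information acquired for the captured image
    h, w, _ = object_rgba.shape
    object_image = np.zeros((h, w, 4), dtype=object_rgba.dtype)   # start fully transparent
    visible = (object_rgba[..., 3] > 0) & (object_depth < scene_depth)
    object_image[visible] = object_rgba[visible]                  # draw only the nearer pixels
    return object_image   # later superimposed on the captured image
```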
(2) The computer program according to (1), characterized in that the object generation means does not draw those pixels of the object to be generated whose position information does not indicate that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image.
(3) The computer program according to (1) or (2), characterized in that the computer of the information processing terminal is further caused to function as a selection means for selecting an effect to be composited with the captured image including the predetermined model, and the recognition means recognizes a part related to the selected effect.
(4) The computer program according to (3), characterized in that the recognition means recognizes the part related to the selected effect using a trained model that has been trained to take a captured image as input and to output information regarding the shape, angle, and distance of each part of the predetermined model.
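A hedged sketch of how the trained model of (4) might be queried; `trained_model.predict` and the returned dictionary keys are assumptions for illustration, not a documented interface.

```python
def recognize_relevant_parts(captured_image, trained_model, relevant_parts):
    # trained_model is assumed to return, per part name, a dict holding the
    # "shape", "angle_deg", and "distance_m" outputs described in (4).
    outputs = trained_model.predict(captured_image)   # hypothetical inference call
    return {name: info for name, info in outputs.items() if name in relevant_parts}
```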
(5) The computer program according to (2) or (3), characterized in that the object generation means generates the object image based on pre-stored model information of an object corresponding to the effect selected by the selection means.
(6) The computer program according to any one of (1) to (5), characterized in that the position determination means determines the position information, including information regarding the angle and distance of the object to be generated, based on the information regarding the posture of the at least one part and its distance from the imaging means.
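As a rough illustration of (6), the object's placement can be derived directly from the recognized part's angle and distance; the offset value and the key names below are assumptions made for this sketch.

```python
def determine_object_position(part_angle_deg, part_distance_m, local_offset_m=0.05):
    # Align the generated object with the recognized part and place it at a small
    # offset from the part's distance; both choices are illustrative only.
    return {
        "angle_deg": part_angle_deg,
        "distance_m": part_distance_m + local_offset_m,
    }
```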
(7) The computer program according to any one of (3) to (5), characterized in that the at least one part of the predetermined model is at least one of a head, a chest, an abdomen, a waist, an arm, and a leg.
(8) The computer program according to (7), characterized in that the part related to the selected effect is a part located in the vicinity of the object to be generated.
(9) The computer program according to any one of (1) to (8), characterized in that the acquisition means acquires, as the distance information for each pixel of the captured image, a grayscale depth map indicating depth information.
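The grayscale depth map of (9) could be converted into per-pixel distances roughly as follows; the 8-bit encoding, the linear mapping, and the near/far range are assumptions, since actual devices may encode depth differently.

```python
import numpy as np

def depth_map_to_distance(gray_depth, near_m=0.1, far_m=5.0):
    # gray_depth: (H, W) uint8 grayscale depth map, 0 assumed to be nearest
    normalized = gray_depth.astype(np.float32) / 255.0
    return near_m + normalized * (far_m - near_m)   # metres per pixel
```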
(10) The computer program according to any one of (1) to (9), characterized in that the recognition means further recognizes information regarding the posture of each part included in the predetermined model and its distance from the imaging means, updates three-dimensional data including a reference bone structure indicating a reference posture of the predetermined model according to each recognized part, and recognizes the posture of the predetermined model based on the updated three-dimensional data.
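One way the reference bone structure of (10) might be updated, sketched with plain dictionaries; the key names and the per-part fields are assumptions for illustration.

```python
def update_bone_structure(reference_bones, recognized_parts):
    # reference_bones: {"head": {"angle_deg": ..., "distance_m": ...}, ...}
    # recognized_parts: same layout, holding the posture and distance recognized
    # from the captured image for the parts that were actually detected.
    updated = {name: dict(pose) for name, pose in reference_bones.items()}
    for name, estimate in recognized_parts.items():
        if name in updated:
            updated[name].update(estimate)   # overwrite reference pose with recognition
    return updated   # three-dimensional data reflecting the captured posture
```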
(11) The computer program according to (10), characterized in that the object generated by the object generation means changes in accordance with the recognized posture of the predetermined model.
(12) The computer program according to (10) or (11), characterized in that the object generation means generates the object image based on the position of the predetermined model in the captured image, which is identified from the updated three-dimensional data.
(13) The computer program according to any one of (1) to (12), characterized in that the imaging means continuously captures images of the surrounding environment including the predetermined model; the acquisition means, the recognition means, the object generation means, and the output means periodically execute their processing based on the images captured by the imaging means; and the output means composites the object, as a dynamically changing animation, with the video continuously captured by the imaging means and outputs the result.
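The periodic processing of (13) amounts to a per-frame loop along the following lines; the four collaborator objects and their method names are assumptions used only to show the cycle.

```python
import time

def ar_loop(camera, recognizer, renderer, display, period_s=1 / 30):
    while display.is_open():                        # hypothetical interfaces throughout
        frame, depth = camera.capture()             # captured image + distance information
        parts = recognizer.recognize(frame, depth)  # per-part posture and distance
        effect = renderer.generate(parts, depth)    # occlusion-tested object image
        display.show_composite(frame, effect)       # superimpose and output
        time.sleep(period_s)                        # crude pacing for the sketch
```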
(14) An information processing terminal comprising: an imaging means for capturing an image of the surrounding environment including a predetermined model; an acquisition means for acquiring, for each pixel of the captured image captured by the imaging means, distance information from the imaging means; a recognition means for recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated with the at least one part recognized by the recognition means as a reference; an object generation means for generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output means for outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
(15) A method for controlling an information processing terminal, comprising: an imaging step of capturing, by an imaging means, an image of the surrounding environment including a predetermined model; an acquisition step of acquiring, for each pixel of the image captured in the imaging step, distance information from the imaging means; a recognition step of recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination step of determining position information of an object to be generated with the at least one part recognized in the recognition step as a reference; an object generation step of generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output step of outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
101: Information processing terminal, 102: Application server, 103: Machine learning server, 104: Database, 105: Network, 201: CPU, 202: Storage unit, 203: Communication control unit, 204: Display unit, 205: Operation unit, 206: Camera, 207: Speaker, 210: System bus, 301: Figure, 302: Desk, 501: Image acquisition unit, 502: Depth information acquisition unit, 503: Object recognition unit, 504: Effect position determination unit, 505: Trained model, 506: Effect drawing unit, 507: Compositing unit, 508: Output unit

Claims (15)

1. A computer program for causing a computer of an information processing terminal to function as: an imaging means for capturing an image of the surrounding environment including a predetermined model; an acquisition means for acquiring, for each pixel of the captured image captured by the imaging means, distance information from the imaging means; a recognition means for recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated with the at least one part recognized by the recognition means as a reference; an object generation means for generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output means for outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
2. The computer program according to claim 1, characterized in that the object generation means does not draw those pixels of the object to be generated whose position information does not indicate that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image.
3. The computer program according to claim 2, characterized in that the computer of the information processing terminal is further caused to function as a selection means for selecting an effect to be composited with the captured image including the predetermined model, and the recognition means recognizes a part related to the selected effect.
4. The computer program according to claim 3, characterized in that the recognition means recognizes the part related to the selected effect using a trained model that has been trained to take a captured image as input and to output information regarding the shape, angle, and distance of each part of the predetermined model.
5. The computer program according to claim 3, characterized in that the object generation means generates the object image based on pre-stored model information of an object corresponding to the effect selected by the selection means.
6. The computer program according to claim 1, characterized in that the position determination means determines the position information, including information regarding the angle and distance of the object to be generated, based on the information regarding the posture of the at least one part and its distance from the imaging means.
7. The computer program according to claim 3, characterized in that the at least one part of the predetermined model is at least one of a head, a chest, an abdomen, a waist, an arm, and a leg.
8. The computer program according to claim 7, characterized in that the part related to the selected effect is a part located in the vicinity of the object to be generated.
9. The computer program according to claim 1, characterized in that the acquisition means acquires, as the distance information for each pixel of the captured image, a grayscale depth map indicating depth information.
10. The computer program according to claim 1, characterized in that the recognition means further recognizes information regarding the posture of each part included in the predetermined model and its distance from the imaging means, updates three-dimensional data including a reference bone structure indicating a reference posture of the predetermined model according to each recognized part, and recognizes the posture of the predetermined model based on the updated three-dimensional data.
11. The computer program according to claim 10, characterized in that the object generated by the object generation means changes in accordance with the recognized posture of the predetermined model.
12. The computer program according to claim 10, characterized in that the object generation means generates the object image based on the position of the predetermined model in the captured image, which is identified from the updated three-dimensional data.
13. The computer program according to any one of claims 1 to 12, characterized in that the imaging means continuously captures images of the surrounding environment including the predetermined model; the acquisition means, the recognition means, the object generation means, and the output means periodically execute their processing based on the images captured by the imaging means; and the output means composites the object, as a dynamically changing animation, with the video continuously captured by the imaging means and outputs the result.
14. An information processing terminal comprising: an imaging means for capturing an image of the surrounding environment including a predetermined model; an acquisition means for acquiring, for each pixel of the captured image captured by the imaging means, distance information from the imaging means; a recognition means for recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination means for determining position information of an object to be generated with the at least one part recognized by the recognition means as a reference; an object generation means for generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output means for outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
15. A method for controlling an information processing terminal, comprising: an imaging step of capturing, by an imaging means, an image of the surrounding environment including a predetermined model; an acquisition step of acquiring, for each pixel of the image captured in the imaging step, distance information from the imaging means; a recognition step of recognizing information regarding the posture of at least one part of the predetermined model included in the captured image and its distance from the imaging means; a position determination step of determining position information of an object to be generated with the at least one part recognized in the recognition step as a reference; an object generation step of generating an object image by drawing, as the object, those pixels of the object to be generated whose position information indicates that they are closer to the imaging means than the distance information of the corresponding pixels of the captured image; and an output step of outputting, to a display unit, a composite image in which the object image is superimposed on the captured image.
PCT/JP2023/040545 2022-11-14 2023-11-10 Computer program, information processing terminal, and method for controlling same WO2024106328A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-182029 2022-11-14
JP2022182029A JP7441289B1 (en) 2022-11-14 2022-11-14 Computer program, information processing terminal, and its control method

Publications (1)

Publication Number Publication Date
WO2024106328A1 (en) 2024-05-23

Family

ID=90011331

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/040545 WO2024106328A1 (en) 2022-11-14 2023-11-10 Computer program, information processing terminal, and method for controlling same

Country Status (2)

Country Link
JP (2) JP7441289B1 (en)
WO (1) WO2024106328A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5551205B2 (en) * 2012-04-26 2014-07-16 株式会社バンダイ Portable terminal device, terminal program, augmented reality system, and toy
JP2019179481A (en) * 2018-03-30 2019-10-17 株式会社バンダイ Computer program and portable terminal device
JP2021033963A (en) * 2019-08-29 2021-03-01 株式会社Sally127 Information processing device, display control method and display control program
JP2021128542A (en) * 2020-02-13 2021-09-02 エヌエイチエヌ コーポレーション Information processing program and information processing system

Also Published As

Publication number Publication date
JP2024071199A (en) 2024-05-24
JP7441289B1 (en) 2024-02-29
JP2024071379A (en) 2024-05-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23891477

Country of ref document: EP

Kind code of ref document: A1