JP7290546B2

JP7290546B2 - 3D model generation apparatus and method

Info

Publication number: JP7290546B2
Application number: JP2019195844A
Authority: JP
Inventors: 良亮渡邊
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-10-29
Filing date: 2019-10-29
Publication date: 2023-06-13
Anticipated expiration: 2039-10-29
Also published as: JP2021071749A

Description

The present invention relates to an apparatus and method for generating a 3D model of a subject from the video of multiple cameras at high speed and with high quality.

The visual hull method disclosed in Non-Patent Document 1 is widely known as an approach for generating a 3D model of a subject from multiple camera videos. It projects binary silhouette images, each obtained by extracting only the subject from a camera video, into 3D space and generates the 3D model by keeping only the part where the projections intersect.

The smallest unit of a 3D model generated by the visual hull method is called a voxel. A voxel is a small cube that holds a single value and is the regular-grid unit used to represent volumetric data discretely. In the following description, a voxel of size M×M×M (M is a constant) is referred to as "a voxel with a unit voxel size of M".

In general, the larger the unit voxel, the more coarsely the 3D space is sampled, so the visual hull method runs faster but produces a 3D model rougher than the actual shape. Conversely, the smaller the unit voxel size, the closer the reconstructed shape is to the actual one, but the processing time grows explosively as the number of units to compute increases.
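As a rough illustration of this trade-off, the sketch below counts the voxels needed to tile a box-shaped region at a given unit voxel size. The region dimensions and the function name `voxel_count` are assumptions made for illustration and do not come from this document.

```python
import math

def voxel_count(extent_cm, unit_size_cm):
    """Number of cubic voxels needed to tile a box-shaped region.

    extent_cm: (x, y, z) region dimensions in centimetres (hypothetical values).
    unit_size_cm: edge length M of one voxel in centimetres.
    """
    return math.prod(math.ceil(e / unit_size_cm) for e in extent_cm)

# Hypothetical 10 m x 10 m x 3 m region.
region = (1000, 1000, 300)
coarse = voxel_count(region, 5)  # 200 * 200 * 60 voxels
fine = voxel_count(region, 1)    # 1000 * 1000 * 300 voxels
print(coarse, fine, fine // coarse)
```

Shrinking the edge length by a factor of 5 multiplies the voxel count by 5^3 = 125, which is why a uniformly fine grid becomes expensive over a wide area.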

Non-Patent Document 2 discloses a technique that uses the visual hull method in free-viewpoint video and similar applications. Free-viewpoint video reconstructs a 3D space from the video of multiple cameras so that the scene can be viewed even from angles where no camera is placed; for sports video and the like, real-time performance is important. However, generating a 3D model with the ordinary voxel-based visual hull method over a vast area such as a stadium has the drawback that the computation time becomes enormous.

To solve this technical problem, Non-Patent Document 3 discloses a technique for speeding up the visual hull method. When generating a 3D voxel model, it first generates the model with a unit voxel size Ma and obtains a 3D bounding box for each cluster of voxels, treating each cluster as one object. It then models the inside of each 3D bounding box by the visual hull method with a finer unit voxel size Mb (<Ma), which greatly reduces the processing time.

Non-Patent Document 4 discloses a technique for three-dimensional reconstruction with cone-beam CT that represents the target with coarse voxels and fine voxels, thereby speeding up convergence when the reconstruction quality is improved iteratively by a successive approximation method using PWLS (penalized weighted least squares).

Non-Patent Document 4 shows that, near the boundary between the regions of interest (ROIs) obtained from the coarse and the fine voxels, the interpolation result from the coarse voxels is reflected in the fine voxels and vice versa, while the penalty strengths of the ROI obtained from the fine voxels and the ROI obtained from the coarse grid are controlled separately, so that the target can be reconstructed in three dimensions while the error function converges efficiently.

Non-Patent Document 5 discloses a technique that, when representing a 3D model with voxels, repeatedly subdivides along an octree only the regions where the decision is ambiguous, such as those near the contour of the 3D model, so that voxels are divided in a coarse-to-fine manner and the model shape is represented accurately and efficiently.

Patent Document 1 discloses a technique that, when dividing a CAD assembly model into voxels, changes the voxel size based on a previously recorded volume error of the assembly model, thereby changing the number of voxel divisions dynamically and saving machine resources.

Patent Document 1: Japanese Patent No. 4597347

Non-Patent Document 1: A. Laurentini, "The visual hull concept for silhouette based image understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, pp. 150-162 (1994).
Non-Patent Document 2: J. Kilner, J. Starck, A. Hilton and O. Grau, "Dual-Mode Deformable Models for Free-Viewpoint Video of Sports Events," Sixth International Conference on 3-D Digital Imaging and Modeling (3DIM 2007), Montreal, QC, 2007, pp. 177-184.
Non-Patent Document 3: J. Chen, R. Watanabe, K. Nonaka, T. Konno, H. Sankoh and S. Naito, "A Fast Free-viewpoint Video Synthesis Algorithm for Sports Scenes," 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), WeAT17.2.
Non-Patent Document 4: Q. Cao, W. Zbijewski, A. Sisniega, J. Yorkston, J. H. Siewerdsen and J. W. Stayman, "Multiresolution iterative reconstruction in high-resolution extremity cone-beam CT," Phys Med Biol. 2016; 61(20):7263-7281.
Non-Patent Document 5: R. Szeliski, "Rapid octree construction from image sequences," CVGIP: Image Underst. 58, 1, pp. 23-32, 1993.
Non-Patent Document 6: C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 246-252, Vol. 2 (1999).
Non-Patent Document 7: J. Chen, K. Nonaka, H. Sankoh, R. Watanabe, H. Sabirin and S. Naito, "Efficient Parallel Connected Component Labeling with a Coarse-to-Fine Strategy," IEEE Access, 2018, 6, pp. 55731-55740.
Non-Patent Document 8: Zhirong Wu et al., "3D ShapeNets: A deep representation for volumetric shapes," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1912-1920.
Non-Patent Document 9: J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517-6525 (2017).
Non-Patent Document 10: S. Gerke, K. Muller and R. Schafer, "Soccer Jersey Number Recognition Using Convolutional Neural Networks," 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Santiago, 2015, pp. 734-741. doi: 10.1109/ICCVW.2015.100

A method such as that of Non-Patent Document 3, which first creates coarse voxels of size Ma and then models a limited region with a fine voxel size Mb, can greatly reduce the processing time compared with the method of Non-Patent Document 1. However, as in Non-Patent Document 1, the final processing time still depends on the voxel size and the number of voxels.

Meanwhile, when the technique is applied to, for example, free-viewpoint video production of sports scenes, it is important that objects that tend to attract viewers' attention, such as the ball, are modeled correctly. In some sports the ball is very small, and an incorrectly generated model looks unnatural to viewers, so in many cases the unit voxel size has to be set quite small, for example 1 cm.

As a result, even with the technique of Non-Patent Document 3, the voxel size had to be set small to maintain quality, and there were cases, such as 3D model generation over a wide area, where real-time production could not be achieved.

An iterative reconstruction method such as that of Non-Patent Document 4 is effective where high accuracy is required, as in CT, but it still requires a long generation time and is therefore difficult to apply to applications that demand real-time performance.

Even the method of Non-Patent Document 4, which is said to converge faster, needs to optimize the error function over about 50 iterations, and each iteration is shown to take about 2 minutes. Moreover, no mechanism for dynamically changing the size of the fine voxels is disclosed; they are generated at a uniform size.

Because a method using an octree, such as that of Non-Patent Document 5, refines the voxels through repeated stepwise subdivision, the processing time may increase when many repetitions are needed. In addition, in Non-Patent Document 5 the surface of every 3D object is subdivided and generated at a fine unit voxel size. A large object has a correspondingly large surface, so more of it is generated at the fine unit voxel size, which may increase the processing time.

A mechanism that changes the unit voxel size dynamically, as in Patent Document 1, determines the voxel size from the volume error relative to a previously recorded CAD assembly model. It therefore cannot be applied to cases such as free-viewpoint video production, where no ground-truth 3D model is available in advance for comparing errors.

An object of the present invention is to solve the above technical problems and to provide an apparatus and method that can generate a 3D model at high speed and with high quality when a voxel model of a subject is first generated at low resolution to estimate the subject's position and a voxel model is then generated at high resolution only for the estimated position.

To achieve the above object, the present invention provides a 3D model generation apparatus that generates a 3DCG model of a subject from multi-view video, characterized by the following configurations.

(1) The apparatus comprises: means for acquiring a silhouette image for each viewpoint from the multi-view video; low-resolution model generating means for generating, for each subject, a low-resolution voxel model whose voxel size is a first size from the silhouette images by the visual hull method; voxel size determining means for determining, for each low-resolution voxel model, a second size smaller than the first size based on the features of that model; high-resolution model generating means for generating, for each low-resolution voxel model, a high-resolution voxel model whose voxel size is the second size; and means for outputting a 3DCG model of the subject based on the high-resolution voxel model.

(2) The voxel size determining means classifies each low-resolution voxel model based on its features and determines the second size based on the classification result.

(3) Each low-resolution voxel model is classified based on its size and/or position.

(4) Each low-resolution voxel model is classified based on its shape.

(5) Each low-resolution voxel model is classified based on the result of recognizing the subject in the region of the 2D image that overlaps the model's back-projection mask.

(6) For each low-resolution voxel model, it is determined whether the region of the 2D image that overlaps its back-projection mask is a person region; if it is, the model is classified based on the image features of a predetermined part of that region.

(7) A 3D bounding box is generated for each low-resolution voxel model, and the high-resolution voxel model is generated by modeling the inside of the 3D bounding box by the visual hull method at the second size.

(8) For each low-resolution voxel model, the number of voxels of its high-resolution voxel model is estimated, and the second size is determined based on the total number of voxels of all high-resolution voxel models and the allowable processing time.

(9) Based on the features of each low-resolution voxel model, the low-resolution voxel models for which no high-resolution voxel model is to be generated are identified, and no high-resolution voxel model is generated for them.

(10) A priority is set for each low-resolution voxel model, and high-resolution voxel models are generated at the second size in descending order of priority within the allowable processing time.

(11) The second size is varied within a 3D bounding box.
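Configuration (8) can be pictured with the following minimal sketch. It assumes that the allowable processing time has already been converted into a voxel budget, that a single second size is shared by all bounding boxes, and that the voxel count of a box is estimated from its dimensions alone; the function names, candidate sizes, and numbers are hypothetical and are not taken from this document.

```python
def estimate_voxels(box_dims_cm, size_cm):
    """Estimate the voxel count of one bounding box at a given voxel size."""
    count = 1
    for d in box_dims_cm:
        count *= -(-d // size_cm)  # ceiling division on integers
    return count

def pick_second_size(boxes, candidates_cm, voxel_budget):
    """Return the finest candidate size whose total voxel count fits the budget.

    Returns None when even the coarsest candidate exceeds the budget.
    """
    for size in sorted(candidates_cm):  # finest first
        total = sum(estimate_voxels(b, size) for b in boxes)
        if total <= voxel_budget:
            return size
    return None

# Hypothetical boxes: one person-sized, one ball-sized (dimensions in cm).
boxes = [(100, 100, 200), (10, 10, 10)]
print(pick_second_size(boxes, [1, 2, 4], voxel_budget=300_000))
```

With these numbers a 1 cm grid would need about two million voxels, so the sketch falls back to 2 cm. A per-box size, as the classification-based configurations suggest, would need a more elaborate allocation than this single shared value.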

The present invention provides the following effects.

(1) A low-resolution voxel model whose voxel size is a first size is generated to estimate the position of the subject, and a high-resolution voxel model whose voxel size is a second, smaller size is then generated to output the 3DCG model. Because the second size is made variable according to the features of the low-resolution voxel model, high-resolution processing is reduced, the processing time is shortened, and real-time requirements can be met.

(2) Because each low-resolution voxel model is classified based on its features and the second size is determined from the classification result, the low-resolution voxel models can be classified with a consistent index, and the second size can be determined appropriately for each of them.

(3) Because each low-resolution voxel model is classified based on its size and/or position, classification is possible with a low processing load.

(4) Because each low-resolution voxel model is classified based on its shape, the second size can be determined appropriately when the resolution required of the 3DCG model depends on the shape of the subject.

(5) Because each low-resolution voxel model is classified based on the result of recognizing the subject in the region of the 2D image that overlaps its back-projection mask, the second size can be determined from the identification result of the subject.

(6) Because it is determined, for each low-resolution voxel model, whether the region of the 2D image that overlaps its back-projection mask is a person region and, if so, the model is classified based on the image features of a predetermined part of that region, the range to be rendered at high resolution can be narrowed further, high-resolution processing is reduced, the processing time is shortened, and real-time requirements can be met.

(7) Because a 3D bounding box is generated for each low-resolution voxel model and the high-resolution voxel model is generated only for the voxel region inside the 3D bounding box, the region rendered at high resolution is limited, high-resolution processing is reduced, the processing time is shortened, and real-time requirements can be met.

(8) Because the number of voxels after high-resolution conversion is estimated for each low-resolution voxel model and the second size is determined based on the total number of voxels of all high-resolution voxel models and the allowable processing time, more regions can be rendered at high resolution within the processing time.

(9) Because the low-resolution voxel models for which no high-resolution voxel model is to be generated are identified from their features and no high-resolution voxel model is generated for them, wasteful high-resolution processing can be eliminated.

(10) Because a priority is set for each low-resolution voxel model and high-resolution voxel models are generated at the second size in descending order of priority within the allowable processing time, more regions can be rendered at high resolution efficiently within the processing time.

(11) Because the second size is varied within a 3D bounding box, the range rendered at high resolution can be narrowed further, high-resolution processing is reduced, the processing time is shortened, and real-time requirements can be met.
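The priority-based scheduling of item (10) might look like the sketch below. The per-model cost estimates, the priorities, the time budget, and the greedy rule that skips a model that no longer fits are all assumptions for illustration; this document does not specify how priorities or costs are obtained.

```python
def select_for_refinement(models, time_budget_ms):
    """Pick models to refine, highest priority first, within a time budget.

    models: list of (name, priority, estimated_cost_ms) tuples (hypothetical).
    Returns the names selected for high-resolution generation, in order.
    """
    selected = []
    remaining = time_budget_ms
    ranked = sorted(models, key=lambda m: m[1], reverse=True)
    for name, _priority, cost in ranked:
        if cost <= remaining:  # skip models that no longer fit
            selected.append(name)
            remaining -= cost
    return selected

models = [("player_a", 2, 12), ("ball", 3, 1), ("player_b", 1, 12), ("bat", 2, 3)]
print(select_for_refinement(models, time_budget_ms=16))
```

Models left unselected would simply keep their low-resolution voxel model for that frame.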

FIG. 1 is a functional block diagram of a 3D model generation apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of a silhouette image.
FIG. 3 is a diagram showing an example of a 3D bounding box.
FIG. 4 is a diagram schematically showing the classification method using the fourth index.
FIG. 5 is a diagram showing an example of a classification result.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the main part of a 3D model generation apparatus 1 according to an embodiment of the present invention. Generation of 3D models of subjects in a baseball broadcast is described here as an example.

Such a 3D model generation apparatus 1 can be configured by installing, on a general-purpose computer or server, an application (program) that implements each function. Alternatively, it can be configured as a dedicated or single-function machine in which part of the application is implemented in hardware or software.

The silhouette image acquisition unit 101 acquires, frame by frame, the silhouette images used in the visual hull method from multiple camera videos (multi-view video) that capture multiple subjects from different viewpoints. To form a 3D model by the visual hull method, it is desirable to acquire silhouette images from three or more cameras 2.

As shown in the example of FIG. 2, a silhouette image is acquired as a binary mask image in which the subject to be modeled is white and everything else is black. Such a silhouette image can be obtained using the background subtraction method disclosed in Non-Patent Document 6.

Based on the silhouette images acquired from the multi-view video, the low-resolution model generation unit 102 forms a visual hull by the visual hull method in a three-dimensional space whose unit voxel size (in this embodiment, the edge length of a unit voxel) is a first size M₁. The low-resolution model generation unit 102 then computes the connected components of this visual hull from the adjacency of the voxels and regards each connected region as the model of one subject, thereby generating coarse low-resolution voxel models MD_Lo whose unit voxel size is the first size M₁.
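The connected-component step can be sketched as a breadth-first search over occupied voxel indices. The 6-neighbourhood (face adjacency) used here is an assumption, since the text only refers to the adjacency of voxels, and the function name is hypothetical.

```python
from collections import deque

# Face adjacency (6-neighbourhood); the connectivity is an assumption.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def label_components(occupied):
    """Group occupied voxel indices into connected components.

    occupied: set of (x, y, z) integer voxel indices inside the visual hull.
    Returns a list of sets, one per connected component.
    """
    unvisited = set(occupied)
    components = []
    while unvisited:
        seed = unvisited.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                n = (x + dx, y + dy, z + dz)
                if n in unvisited:
                    unvisited.remove(n)
                    component.add(n)
                    queue.append(n)
        components.append(component)
    return components

voxels = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5)}
print(sorted(len(c) for c in label_components(voxels)))
```

Each returned set would correspond to one low-resolution voxel model MD_Lo.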

In this embodiment, the first size M₁ is set to 5 cm. A voxel grid with a unit voxel size of 5 cm is placed over the target range of 3D model generation (in this embodiment, the entire baseball field), and whether each grid voxel becomes part of a 3D model is decided by the visual hull method. The visual hull method obtains, as the visual hull VH(I), the intersection of the viewing frustums obtained when the N silhouette images are projected into 3D world coordinates, as expressed by the following equation.

$$VH(I) = \bigcap_{i \in I} V_i \tag{1}$$

In equation (1), the set I is the set of silhouette images, and V_i is the viewing frustum computed from the silhouette image obtained from the i-th camera. Normally the part common to all N silhouette images is modeled, but the number of silhouette images required may be changed, for example by modeling a voxel when N-1 images agree. Lowering the required number makes it possible to restore the 3D model even when the subject is missing from some silhouette images, but it may cause side effects such as increased noise.
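A minimal sketch of this voxel test, including the relaxed N-1 variant, is given below. Each camera is represented by a hypothetical `project` function that maps a 3D point to a pixel; a real system would use calibrated camera projections, which are not modeled here. The toy orthographic views are for illustration only.

```python
def carve(voxel_centres, views, min_views=None):
    """Keep voxels whose projection lands on the silhouette in enough views.

    voxel_centres: iterable of (x, y, z) points.
    views: list of (project, silhouette) pairs, where project maps a 3D point
           to an integer (row, col) and silhouette is a 2D list of 0/1 values.
    min_views: required number of agreeing views; defaults to all N views.
    """
    if min_views is None:
        min_views = len(views)
    kept = []
    for p in voxel_centres:
        hits = 0
        for project, sil in views:
            r, c = project(p)
            if 0 <= r < len(sil) and 0 <= c < len(sil[0]) and sil[r][c]:
                hits += 1
        if hits >= min_views:
            kept.append(p)
    return kept

# Toy 3x3x3 grid seen by three hypothetical orthographic cameras.
grid = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
centre_only = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
views = [
    (lambda p: (p[1], p[0]), centre_only),  # looking along z
    (lambda p: (p[2], p[1]), centre_only),  # looking along x
    (lambda p: (p[2], p[0]), empty),        # looking along y, subject missed
]
strict = carve(grid, views)                # requires all 3 views
relaxed = carve(grid, views, min_views=2)  # N-1 views suffice
print(strict, relaxed)
```

In this toy case the strict intersection loses the subject because one silhouette is empty, while the N-1 rule still recovers the centre voxel.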

As shown in FIG. 3, the 3D bounding box generation unit 103 generates a 3D bounding box BB circumscribing each low-resolution voxel model MD_Lo. The unit voxel size determination unit 104 includes a bounding box classification unit 104a. As described in detail later, the bounding box classification unit 104a classifies each low-resolution voxel model MD_Lo using multiple classification indices.
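Computing a circumscribing box from voxel indices can be sketched as follows. The box is assumed to be axis-aligned and expressed in centimetres; the function name is hypothetical.

```python
def bounding_box(voxels, voxel_size_cm):
    """Axis-aligned box circumscribing a set of voxel indices, in centimetres.

    Returns ((min_x, min_y, min_z), (max_x, max_y, max_z)), where the max
    corner includes the full extent of the outermost voxels.
    """
    xs, ys, zs = zip(*voxels)
    lo = tuple(min(a) * voxel_size_cm for a in (xs, ys, zs))
    hi = tuple((max(a) + 1) * voxel_size_cm for a in (xs, ys, zs))
    return lo, hi

# One connected component on a 5 cm grid.
print(bounding_box({(0, 0, 0), (1, 0, 0), (1, 1, 0)}, 5))
```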

The unit voxel size determination unit 104 determines the second size M₂ of the unit voxels, which the downstream high-resolution model generation unit 105 uses when generating a high-resolution voxel model MD_Hi for each 3D bounding box BB, based on the classification result of the low-resolution voxel model MD_Lo contained in each 3D bounding box BB.

The classification unit 104a classifies each low-resolution voxel model MD_Lo based on its features. In this embodiment, each low-resolution voxel model MD_Lo is classified using one of the following five indices, or a combination of them.

(1) First index: size of the low-resolution voxel model MD_Lo
Each low-resolution voxel model MD_Lo is classified based on its size (overall size, length, width, height). This embodiment is described for the case where the size of the low-resolution voxel model MD_Lo is represented by the size of its 3D bounding box BB.

When the expected subjects are the ball, people (players or umpires), and baseball equipment other than the ball, the ball is far smaller than anything else. Moreover, because the size of the ball is strictly regulated, the first index can classify each 3D bounding box BB as ball or non-ball.

The unit voxel size determination unit 104 sets the second size M₂ according to the classification result, for example 1 cm for a 3D bounding box BB classified as ball and 2 cm for a 3D bounding box BB classified as non-ball. The second size M₂ is not limited to such fixed values and may be set dynamically according to the size (for example, the volume) of the 3D bounding box BB.

(2) Second index: position of the 3D bounding box
Each low-resolution voxel model MD_Lo is classified based on its position. This embodiment is described for the case where the position of the low-resolution voxel model MD_Lo is represented by the position of its 3D bounding box BB.

Each kind of subject tends to appear at characteristic positions. In baseball, for example, a 3D bounding box BB formed at a height of 10 m is very likely to be the ball and extremely unlikely to be a person or equipment.

Such prior knowledge can therefore be used as the second index to classify subjects: a 3D bounding box BB at a high position is regarded as the ball and given a second size M₂ of 1 cm, and any other box is regarded as non-ball and given a second size M₂ of 2 cm.

The first and second indices are easily obtained once the 3D bounding box BB has been estimated, so the processing time required for classification is minimal, which makes them suitable for systems with strict real-time requirements.
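The first and second indices can be sketched together as a simple rule on the bounding box. The 1 cm and 2 cm second sizes and the 5 cm first size come from the embodiment; the ball diameter, the height cut-off, the z-up axis convention, and the choice to combine the two indices with a logical OR are assumptions made for this illustration.

```python
BALL_DIAMETER_CM = 7.4     # assumed nominal baseball diameter
M1_CM = 5                  # first (coarse) voxel size from the embodiment
HEIGHT_THRESHOLD_CM = 500  # hypothetical cut-off for "high position"
SECOND_SIZE_CM = {"ball": 1, "other": 2}  # values given in the embodiment

def classify(box_min_cm, box_max_cm):
    """Classify a 3D bounding box as 'ball' or 'other' from size and height."""
    dims = [hi - lo for lo, hi in zip(box_min_cm, box_max_cm)]
    # The coarse grid can inflate each edge by up to one voxel per side.
    small = max(dims) <= BALL_DIAMETER_CM + 2 * M1_CM
    high = box_min_cm[2] >= HEIGHT_THRESHOLD_CM  # z axis assumed to point up
    return "ball" if small or high else "other"

def second_size_cm(box_min_cm, box_max_cm):
    """Look up the second voxel size M2 for a bounding box."""
    return SECOND_SIZE_CM[classify(box_min_cm, box_max_cm)]

print(second_size_cm((0, 0, 0), (10, 10, 10)))    # small box near the ground
print(second_size_cm((0, 0, 0), (60, 40, 180)))   # person-sized box
print(second_size_cm((0, 0, 1000), (30, 30, 1030)))  # box 10 m above ground
```

Both checks only read the box corners, which is consistent with the low classification cost noted above.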

(3) Third index: shape of the low-resolution voxel model MD_Lo
Each low-resolution voxel model MD_Lo is classified based on its shape. Because the shape of a low-resolution voxel model MD_Lo is characteristic of each subject, the relationship between shape and subject is learned in advance, for example by deep learning, to build a prediction model. Applying each low-resolution voxel model MD_Lo to the prediction model classifies it as a ball, a person, or a piece of baseball equipment.

(4) Fourth index: 2D image of the low-resolution voxel model MD_Lo
An algorithm that identifies objects in an image, such as the one disclosed in Non-Patent Document 9, is applied to the camera images (2D images) from which the silhouette images are derived, and each low-resolution voxel model MD_Lo is classified based on the identification result.

FIG. 4 schematically shows the classification method based on 2D images. The low-resolution voxel model MD_Lo is back-projected onto the screen of each camera, the resulting back-projection mask is superimposed on the per-pixel recognition result for that camera's 2D image, and the low-resolution voxel model is identified from the recognition result of the 2D image region that overlaps the back-projection mask.

For example, the recognition result for each pixel of the 2D image region overlapping the back-projection mask is examined, and if the proportion of pixels recognized as "person" is sufficiently high, the low-resolution voxel model MD_Lo is classified as "person".
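The pixel-proportion test can be sketched as below. The function name, the nested-list image representation, and the default ratio of 0.5 are assumptions for illustration; the description only says the proportion must be "sufficiently high".

```python
from collections import Counter


def classify_by_mask(mask, labels, min_ratio=0.5):
    """Classify one low-resolution voxel model from per-pixel recognition results.

    mask   -- 2D list of bools, True where the back-projection mask covers the pixel
    labels -- 2D list of class names, one per pixel, from the 2D recognizer
    Returns the dominant class inside the mask, or None if no class reaches
    min_ratio of the masked pixels (or the mask is empty).
    """
    counts = Counter(
        label
        for mask_row, label_row in zip(mask, labels)
        for inside, label in zip(mask_row, label_row)
        if inside
    )
    total = sum(counts.values())
    if total == 0:
        return None
    label, hits = counts.most_common(1)[0]
    return label if hits / total >= min_ratio else None
```

Returning None for an ambiguous mask leaves the caller free to fall back on another index or to try another camera.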

Classification using the fourth index requires the original 2D images as well as the silhouette images, so the silhouette image acquisition unit 101 is assumed to have a function of acquiring each camera's original image in addition to the silhouette image.

Back-projection need not be performed for every camera and may be limited to a subset of cameras to save processing time. When two 3D bounding boxes overlap at the same pixel of a 2D image, the result for the object with the smaller unit voxel size may be given priority in the back-projection mask.

Classification by the third or fourth index has the drawbacks that training is required in advance and that the processing time tends to be relatively long. However, because bounding boxes such as that of a ball are classified based on information learned in advance, highly accurate classification is possible.

For example, when a replay of a particular scene is produced from free-viewpoint video and shown on a stadium's large screen, several tens of seconds of production time may be acceptable for a 10-second replay. In such situations, where fast production is required but strict real-time performance is not, classification using the third or fourth index achieves a good trade-off between quality and production speed.

(5) Fifth index: information unique to the subject
Each low-resolution voxel model is classified using, as the index, the result of recognizing subject-specific information in the 2D image region where the back-projection mask, obtained by back-projecting the low-resolution voxel model MD_Lo onto the screen of each camera, overlaps that camera's 2D image.

For example, when the 2D image region corresponding to a low-resolution voxel model MD_Lo is classified as a person, face recognition and uniform-number recognition are additionally performed to determine whether the person is a player registered in advance as a high-resolution target. For anyone other than a registered player, a first second size M₂₁ smaller than the first size M₁ is set as the second size M₂; for a registered player, an even smaller second second size M₂₂ (< M₂₁) is set.

Classification based on the third or fourth index can generally distinguish persons, balls, bats, and the like, but cannot identify information unique to each subject, such as a person's name or uniform number. In contrast, techniques such as that of Non-Patent Document 10 can classify subjects more finely based on a player's uniform number. Classifying each 3D bounding box by the fifth index makes it possible to display only a featured player or a user's favorite player at high resolution.

When several indices are combined, the final classification result may be determined from the logical OR or logical AND of the individual classification results. Alternatively, the first or second index may be used first to classify the 3D bounding boxes of balls, after which the recognition-based third to fifth indices are applied only to the remaining 3D bounding boxes.

This reduces the number of 3D bounding boxes to which the recognition-based third to fifth indices, which take longer to evaluate, are applied, and thus shortens the processing time.
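The two-stage combination described above might be organized as in the following sketch. The function and parameter names are hypothetical; `cheap_is_ball` stands in for a first- or second-index test and `expensive_classify` for any recognition-based third- to fifth-index classifier.

```python
def cascade_classify(boxes, cheap_is_ball, expensive_classify):
    """Classify boxes in two stages to limit calls to the slow classifier.

    boxes              -- dict mapping box ID to whatever data the tests need
    cheap_is_ball      -- fast predicate (e.g. size or position test)
    expensive_classify -- slow recognition-based classifier returning a label
    """
    results = {}
    remaining = []
    # Stage 1: peel off balls using the cheap index.
    for box_id, box in boxes.items():
        if cheap_is_ball(box):
            results[box_id] = "ball"
        else:
            remaining.append(box_id)
    # Stage 2: run the recognition-based index only on what is left.
    for box_id in remaining:
        results[box_id] = expensive_classify(boxes[box_id])
    return results
```

In this arrangement the slow classifier is invoked once per non-ball box rather than once per box.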

FIG. 5 shows an example of classification results. Each 3D bounding box BB is assigned an ID, and the classification result and the second size M₂ are registered for each 3D bounding box BB. Non-Patent Document 7 discloses a labeling method that efficiently computes the connected components of voxels and assigns IDs to them.

The high-resolution model generation unit 105 places a voxel grid based on the second size M₂ determined by the unit voxel size determination unit 104 only in the narrow region inside each 3D bounding box BB generated by the 3D bounding box generation unit 103, and applies the visual volume intersection method there to generate a high-resolution voxel model MD_Hi. This enables 3D model generation with a good trade-off between quality and speed.
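A minimal sketch of carving restricted to one bounding box is given below. It assumes each view is supplied as a pair of a projection function (3D point to integer pixel) and a set of foreground pixels; these representations, and the function name, are illustrative choices rather than anything specified in the description.

```python
import math


def carve_bounding_box(bbox_min, bbox_max, m2, views):
    """Visual volume intersection limited to one 3D bounding box.

    bbox_min, bbox_max -- opposite corners (x, y, z) of the box
    m2                 -- unit voxel size (second size) in the same units
    views              -- list of (project, silhouette) pairs; project maps a
                          3D point to a pixel (u, v), silhouette is the set
                          of foreground pixels of that camera
    Returns grid indices (i, j, k) of voxels whose centre projects into the
    silhouette of every view.
    """
    # Number of voxels along each axis; the epsilon absorbs float round-off.
    counts = [
        max(1, math.ceil((hi - lo) / m2 - 1e-9))
        for lo, hi in zip(bbox_min, bbox_max)
    ]
    occupied = []
    for i in range(counts[0]):
        for j in range(counts[1]):
            for k in range(counts[2]):
                center = tuple(
                    lo + (idx + 0.5) * m2
                    for lo, idx in zip(bbox_min, (i, j, k))
                )
                # Keep the voxel only if it lies in the intersection of all cones.
                if all(project(center) in sil for project, sil in views):
                    occupied.append((i, j, k))
    return occupied
```

The loop visits only voxels inside the box, so the work scales with the box volume divided by the cube of `m2` rather than with the whole capture volume.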

The 3D model output unit 106 outputs the 3D model obtained by the high-resolution model generation unit 105. The high-resolution voxel model MD_Hi is volume data formed of many voxels, but it is often more convenient to handle 3D model data as a polygon model. The unit may therefore also have a function of converting the voxel model into a polygon model, using a voxel-to-polygon conversion technique such as the marching cubes method, and outputting the 3D model as a polygon model.

In the above embodiment, the second size M₂ of the unit voxel used to generate the high-resolution voxel model MD_Hi is determined solely from the classification result of the 3D bounding box BB (or the low-resolution voxel model MD_Lo). The present invention is not limited to this: to ensure real-time performance, the second size M₂ may also be determined taking into account the processing time required to generate the high-resolution voxel model MD_Hi.

For example, in this embodiment the size and number of the 3D bounding boxes BB are known to the 3D bounding box generation unit 103, and the sum of their voxel regions is the computation range. The processing time of the visual volume intersection method is generally proportional to the number of voxels, and the number of voxels in a voxel region depends on the unit voxel size, so the total processing time can be estimated accurately for each unit voxel size (second size).

Accordingly, the second size M₂ applied to balls may be fixed at 1 cm, while the second size M₂ applied to non-balls is determined dynamically based on the remaining processing time divided by the total number of remaining voxels.
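One possible way to turn the processing-time estimate into a size choice is sketched below. It does not reproduce the exact rule of the description; instead it assumes a measured, hypothetical `seconds_per_voxel` cost and a list of candidate sizes, and picks the finest candidate whose estimated total time fits the budget.

```python
import math


def voxel_count(extent, m):
    """Number of voxels of size m needed to cover a box with the given (x, y, z) extent."""
    return math.prod(max(1, math.ceil(e / m - 1e-9)) for e in extent)


def pick_second_size(extents, candidates, seconds_per_voxel, budget_seconds):
    """Return the smallest candidate size whose estimated carving time fits the budget.

    extents -- list of (x, y, z) extents of the non-ball bounding boxes, in metres
    Returns None if even the coarsest candidate exceeds the budget.
    """
    for m in sorted(candidates):  # finest first
        total_voxels = sum(voxel_count(e, m) for e in extents)
        # Processing time is assumed proportional to the number of voxels.
        if total_voxels * seconds_per_voxel <= budget_seconds:
            return m
    return None
```

For a single 1 m × 1 m × 2 m box, halving the voxel size from 2 cm to 1 cm multiplies the voxel count by eight (250,000 to 2,000,000), which is why the budget check matters.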

Alternatively, the second size M₂ applied to balls may be fixed at 1 cm while priorities are assigned in advance to the non-ball classification results, and the second size M₂ applied to non-balls is determined dynamically from the remaining processing time and the priorities so that higher-priority classification results receive a smaller second size M₂.
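The priority-based variant could be realized, for instance, by the greedy allocation sketched here. The strategy (start every box at the coarsest candidate, then refine in priority order while the estimate stays within budget), the names, and the per-voxel cost are all assumptions made for illustration.

```python
import math


def voxel_count(extent, m):
    """Number of voxels of size m needed to cover a box with the given (x, y, z) extent."""
    return math.prod(max(1, math.ceil(e / m - 1e-9)) for e in extent)


def assign_sizes_by_priority(boxes, candidates, seconds_per_voxel, budget_seconds):
    """Give higher-priority boxes smaller voxels while staying within the time budget.

    boxes -- list of (box_id, priority, extent); a larger priority is more important
    Returns a dict mapping box_id to the chosen second size.
    """
    coarsest = max(candidates)
    assigned = {box_id: coarsest for box_id, _, _ in boxes}
    cost = {
        box_id: voxel_count(extent, coarsest) * seconds_per_voxel
        for box_id, _, extent in boxes
    }
    total = sum(cost.values())
    for box_id, _, extent in sorted(boxes, key=lambda b: b[1], reverse=True):
        for m in sorted(candidates):  # try the finest size first
            new_cost = voxel_count(extent, m) * seconds_per_voxel
            if total - cost[box_id] + new_cost <= budget_seconds:
                total += new_cost - cost[box_id]
                cost[box_id] = new_cost
                assigned[box_id] = m
                break
    return assigned
```

Reserving the coarsest-size cost for every box up front means a high-priority box cannot consume time that a low-priority box needs just to be modelled at all.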

Furthermore, although the above embodiment assumes that every 3D bounding box BB is classified as one of the subjects, the present invention is not limited to this. For example, when the first index is used, a 3D bounding box BB smaller than a predetermined reference size may be regarded as noise and discarded.

Likewise, when the second index is used, a 3D bounding box BB located where no subject can exist may be regarded as noise and discarded. When a recognition-based index such as the third to fifth indices is used, a 3D bounding box BB whose recognition likelihood falls below a predetermined threshold may be regarded as noise and discarded.
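The three noise-rejection criteria can be combined in a single filter, as in the sketch below. The dictionary keys and threshold parameters are hypothetical; the description does not give concrete values.

```python
def reject_noise(boxes, min_volume, max_height, min_likelihood):
    """Drop bounding boxes that look like noise under any of the three criteria.

    boxes -- list of dicts with keys 'volume' (m^3), 'height' (m, box centre)
             and 'likelihood' (recognition score in [0, 1])
    """
    return [
        b for b in boxes
        if b["volume"] >= min_volume          # first index: too small
        and 0.0 <= b["height"] <= max_height  # second index: impossible position
        and b["likelihood"] >= min_likelihood # third to fifth index: low confidence
    ]
```

Discarded boxes never reach the high-resolution stage, so filtering also saves carving time.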

Furthermore, although the above embodiment applies a single second size M₂ throughout the inside of each 3D bounding box, the present invention is not limited to this, and the second size M₂ may differ from one part of the subject to another.

For example, if a 3D bounding box has been classified as a person by the fifth index and its face region or uniform-number region has been identified, the second size M_2a of the face region or uniform-number region may be made even smaller than the second size M_2b of the other regions (M_2a < M_2b).

Furthermore, although the above embodiment applies the visual volume intersection method to the entire voxel region inside the 3D bounding box to generate a high-resolution voxel model MD_Hi whose unit voxel has the second size M₂, the present invention is not limited to this, and only the voxel region of the low-resolution voxel model MD_Lo may be processed.

101...silhouette image acquisition unit, 102...low-resolution model generation unit, 103...3D bounding box generation unit, 104...unit voxel size determination unit, 104a...classification unit, 105...high-resolution model generation unit, 106...3D model output unit

Claims

1. A 3D model generation device that generates a 3DCG model of a subject from multi-view video, comprising:
means for acquiring a silhouette image for each viewpoint from the multi-view video;
low-resolution model generation means for generating, for each subject, a low-resolution voxel model with a voxel size of a first size from the silhouette images by the visual volume intersection method;
voxel size determination means for determining, for each low-resolution voxel model, a second size smaller than the first size based on features of that model;
high-resolution model generation means for generating, for each low-resolution voxel model, a high-resolution voxel model with a voxel size of the determined second size; and
means for outputting a 3DCG model of the subject based on the high-resolution voxel model.

2. The 3D model generation device according to claim 1, wherein the voxel size determination means comprises means for classifying each low-resolution voxel model based on its features, and determines the second size based on the result of the classification.

3. The 3D model generation device according to claim 2, wherein the classifying means classifies each low-resolution voxel model based on its size.

4. The 3D model generation device according to claim 2 or 3, wherein the classifying means classifies each low-resolution voxel model based on its shape.

5. The 3D model generation device according to any one of claims 2 to 4, wherein the classifying means classifies each low-resolution voxel model based on a subject recognition result for the region of a 2D image that its back-projection mask overlaps.

6. The 3D model generation device according to any one of claims 2 to 5, wherein the classifying means identifies, for each low-resolution voxel model, whether the region of the 2D image overlapping its back-projection mask is a person region and, if it is a person region, classifies the low-resolution voxel model based on image features of a predetermined part of that region.

7. The 3D model generation device according to any one of claims 1 to 6, further comprising means for generating a 3D bounding box for each low-resolution voxel model, wherein the high-resolution model generation means generates the high-resolution voxel model by modeling the inside of the 3D bounding box at the second size by the visual volume intersection method.

8. The 3D model generation device according to any one of claims 1 to 6, wherein the voxel size determination means estimates, for each low-resolution voxel model, the number of voxels of its high-resolution voxel model, and determines the second size based on the total number of voxels of all high-resolution voxel models and the allowable processing time.

9. The 3D model generation device according to any one of claims 1 to 6, wherein the voxel size determination means identifies, based on the features of each low-resolution voxel model, any low-resolution voxel model for which no high-resolution voxel model is to be generated, and no high-resolution voxel model is generated for the identified low-resolution voxel model.

10. The 3D model generation device according to any one of claims 1 to 6, wherein the voxel size determination means sets a priority for each low-resolution voxel model, and high-resolution voxel models are generated at the second size in descending order of priority based on the allowable processing time.

11. The 3D model generation device according to claim 7, wherein the voxel size determination means varies the second size within a 3D bounding box.

12. A 3D model generation method in which a computer generates a 3DCG model of a subject from multi-view video, comprising:
a step of acquiring a silhouette image for each viewpoint from the multi-view video;
a step of generating, for each subject, a low-resolution voxel model with a voxel size of a first size from the silhouette images by the visual volume intersection method;
a step of determining, for each low-resolution voxel model, a second size smaller than the first size based on features of that model;
a step of generating, for each 3D bounding box of a low-resolution voxel model, a high-resolution voxel model with a voxel size of the second size; and
a step of outputting a 3DCG model of the subject based on the high-resolution voxel model.