CN111242207A - Three-dimensional model classification and retrieval method based on visual saliency information sharing - Google Patents

Three-dimensional model classification and retrieval method based on visual saliency information sharing

Info

Publication number
CN111242207A
CN111242207A (application number CN202010017062.9A)
Authority
CN
China
Prior art keywords
view
branches
feature
mvcnn
visual saliency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010017062.9A
Other languages
Chinese (zh)
Inventor
聂为之
王亚
屈露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010017062.9A priority Critical patent/CN111242207A/en
Publication of CN111242207A publication Critical patent/CN111242207A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24317 - Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/29 - Graphical models, e.g. Bayesian networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional model classification and retrieval method based on visual saliency information sharing, which comprises the following steps: extracting a view every 30 degrees around the Z-axis of the three-dimensional model, and extracting a feature descriptor of each virtual view through a deep convolutional neural network; taking the feature descriptors as the input of the visual saliency branch, generating view weights through a first LSTM module and a soft attention mechanism, and generating the feature descriptor of the visual saliency branch through a second LSTM module; taking the feature descriptors as the input of the MVCNN branch, guiding visual information fusion in the MVCNN module with the view weights, and obtaining the feature descriptor of the MVCNN branch through a CNN; concatenating the descriptors of the two branches, making a decision through a fully connected layer and a softmax layer for classification, and performing similarity measurement for retrieval. The invention builds on two branches, a multi-view convolutional neural network (MVCNN) branch and a visual saliency branch, and fuses their feature descriptors into a final descriptor for 3D shape classification and retrieval.

Description

Three-dimensional model classification and retrieval method based on visual saliency information sharing
Technical Field
The invention relates to the fields of three-dimensional model feature extraction, three-dimensional model classification and retrieval and the like, in particular to a three-dimensional model classification and retrieval method based on visual saliency information sharing.
Background
In recent years, as applications of 3D technology in the film and television industry have become widespread, people encounter 3D models almost everywhere, so it is natural and reasonable to explore more efficient methods for learning representations of three-dimensional models. Furthermore, with the development of computer vision and 3D reconstruction techniques, 3D shape recognition has become a fundamental task of shape analysis and the most critical technique for processing and analyzing 3D data. Thanks to powerful deep neural networks and large-scale labeled 3D shape datasets, various deep networks for 3D shape recognition have been studied. In general, 3D shape recognition methods can be roughly divided into two types: model-based methods and view-based methods.
Model-based methods learn the shape characteristics of a model directly from a 3D data format, such as voxel grids [1], polygonal meshes or surfaces [2], and point clouds [3]. For example, [4] proposes a novel deep learning model, the mesh convolutional restricted Boltzmann machine (MCRBM), for unsupervised feature learning on 3D meshes. To learn global features, [5] proposes MeshNet (mesh neural network), which uses face units and feature splitting to cope with the complexity and irregularity of meshes and to represent three-dimensional shape. In [6], Kd-networks are proposed, which can process unstructured point clouds and use learned functions to perform retrieval tasks. However, limited shape representations (e.g., smooth manifolds) or high computational complexity place constraints on model-based approaches. This limitation is especially pronounced for voxel-based methods.
In view-based approaches, the input data are views captured from different angles of the 3D object, which are easier to obtain than other representations (e.g., point clouds and polygon meshes). Based on the MVCNN [7] (multi-view convolutional neural network) architecture, a compact shape descriptor can be extracted from multiple rendered views of an object using a CNN (convolutional neural network) with a pooling layer. DeepPano [8] learns PANORAMA (panoramic) view features using a CNN. [9] proposes a method for capturing panoramic image features, which aims at achieving continuity of the three-dimensional model by constructing an enhanced panoramic representation. [10] proposes GIFT, a real-time 3D shape search engine accelerated by the GPU (graphics processing unit) and two inverted files. Most view-based methods treat all views equally, which ignores the dependency and discriminative information among multiple views and limits the performance of existing methods.
The main challenges currently faced by three-dimensional model classification and retrieval are:
1) due to the large amount of information in a three-dimensional model, classification and retrieval tasks have high time and space complexity;
2) the designed feature descriptors must be highly discriminative while keeping computation time and space complexity under control.
Disclosure of Invention
The invention provides a three-dimensional model classification and retrieval method based on visual saliency information sharing, which is built on two branches, a multi-view convolutional neural network (MVCNN) branch and a visual saliency branch, and fuses the feature descriptors of the two branches to generate a final descriptor for 3D shape classification and retrieval, as described in detail below:
a three-dimensional model classification and retrieval method based on visual saliency information sharing, the method comprising:
extracting a view every 30 degrees around the Z-axis direction of the three-dimensional model, and extracting a feature descriptor of each virtual image through a deep convolutional neural network;
taking the feature descriptor as the input of the visual saliency branch, generating the weight of a view through a first LSTM module and a soft attention mechanism, and generating the feature descriptor of the visual saliency branch through a second LSTM module;
the feature descriptors are used as input of MVCNN branches, visual information fusion in the MVCNN module is guided by using view weights, and the feature descriptors of the MVCNN branches are obtained through a CNN;
and concatenating the descriptors of the two branches, making a decision through a fully connected layer and a softmax layer for classification, and performing similarity measurement for retrieval.
Taking the feature descriptors as the input of the visual saliency branch, generating view weights through a first LSTM module and a soft attention mechanism, and generating the feature descriptor of the visual saliency branch through a second LSTM module specifically comprises:
sequentially inputting the 12 feature descriptors into the visual saliency branch in extraction order, and generating the weight of each view through the first LSTM module and the soft attention mechanism, wherein the previous hidden state h_{t-1} is computed from the relation between the hidden state h_t and the internal memory state c_t, and the weight of each view is then obtained from h_{t-1};
the last hidden state is linearly weighted and then used as the input of the second LSTM module to obtain the feature descriptors of the visual saliency branches.
Further, guiding visual information fusion in the MVCNN model by using the view weights and obtaining the feature descriptors of the MVCNN branch through a CNN specifically comprises:
performing feature fusion on the two-dimensional view by applying view saliency pooling;
and obtaining the feature descriptors of the MVCNN branches through a layer of deep neural convolution network.
The technical scheme provided by the invention has the beneficial effects that:
1. the method preserves the visual information and the correlated information of the views by updating the weights of different views in the visual saliency model, thereby improving the flexibility and stability of the feature descriptors;
2. the view weights defined by the visual saliency model are used to guide visual information fusion in the MVCNN model, retaining the visual and correlated information in the views, so that the three-dimensional model is described more comprehensively;
3. the method continuously updates parameters through deep learning, so that the weights used when computing the three-dimensional model feature descriptor approach the optimal solution, improving the soundness and accuracy of the feature descriptor;
4. comparison experiments show that the algorithm outperforms each individual branch and classical 3D classification and retrieval methods.
Drawings
FIG. 1 is a flow chart of a three-dimensional model classification and retrieval method based on visual saliency information sharing;
FIG. 2 is an exemplary diagram of the contents of a three-dimensional model database.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
At this stage, most multi-view based methods treat all views equally, which results in ignoring the dependency and differentiation information of multiple views, limiting the performance of existing methods.
The embodiment of the invention provides a three-dimensional model classification and retrieval method based on visual saliency information sharing. The three-dimensional model is rotated and one view is captured every 30 degrees; a feature descriptor is extracted from each view, and the feature descriptors of the 12 views are input to the visual saliency part. Owing to the strengths of the LSTM (long short-term memory) network architecture, a soft attention mechanism and two LSTM modules are used in the visual saliency branch. The soft attention mechanism and the first LSTM module generate view weights for the convolutional features, and the second LSTM module generates the features of the visual saliency branch. The feature descriptors of the 12 views are also input to the MVCNN branch, where the view weights guide visual information fusion in the MVCNN model, and the feature descriptor of the MVCNN part is then obtained through a CNN. The network is used for both classification and retrieval tasks, and the final result is obtained by a fusion decision. Comparisons of the method with several other methods are provided below; evaluations on the ModelNet40 and ShapeNetCore55 datasets demonstrate the classification and retrieval accuracy of the method for three-dimensional models.
Example 1
A three-dimensional model classification and retrieval method based on visual saliency information sharing, shown in FIG. 1, mainly comprises three parts: first, attention-based view weight calculation; second, view attention pooling; and third, generation of the final shape descriptor. The specific implementation steps are as follows:
101: giving a three-dimensional model, extracting a view every 30 degrees around the Z-axis direction of the three-dimensional model, and extracting a feature descriptor of each virtual image through a deep convolutional neural network;
102: taking the feature descriptor as the input of the visual saliency branch, generating the weight of a view through a first LSTM module and a soft attention mechanism, and generating the feature descriptor of the visual saliency branch through a second LSTM module;
103: similarly, the feature descriptors are used as input of MVCNN branches, visual information fusion in the MVCNN model is guided by using view weights, and then the feature descriptors of the MVCNN part are obtained through a CNN;
104: the feature descriptors of the two branches are obtained through steps 101-103; the two descriptors are concatenated, a decision is made through a fully connected layer and a softmax layer for classification, and a similarity measurement is performed for retrieval. An overall sketch of this pipeline is given below.
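The four steps above can be outlined in code. The following is a minimal PyTorch-style sketch of the two-branch forward pass, not the patented implementation itself: the module names (ViewCNN, SaliencyBranch, MVCNNBranch), the 4096-dimensional descriptor size and the choice of PyTorch are illustrative assumptions.

import torch
import torch.nn as nn

class TwoBranchNet(nn.Module):
    def __init__(self, view_cnn, saliency_branch, mvcnn_branch,
                 feat_dim=4096, num_classes=40):
        super().__init__()
        self.view_cnn = view_cnn                # shared CNN applied to each of the 12 views (step 101)
        self.saliency_branch = saliency_branch  # LSTM + soft attention + LSTM (step 102)
        self.mvcnn_branch = mvcnn_branch        # weight-guided view fusion + CNN (step 103)
        self.fc = nn.Linear(2 * feat_dim, num_classes)  # fully connected decision layer (step 104)

    def forward(self, views):                   # views: (batch, 12, 3, H, W)
        b, v = views.shape[:2]
        feats = self.view_cnn(views.flatten(0, 1)).view(b, v, -1)  # f_1 ... f_12 per model
        sal_feat, weights = self.saliency_branch(feats)  # branch descriptor + view weights
        mv_feat = self.mvcnn_branch(feats, weights)      # visual-saliency-guided fusion
        fused = torch.cat([sal_feat, mv_feat], dim=-1)   # concatenate the two descriptors
        logits = self.fc(fused)                          # softmax applied at inference / in the loss
        return logits, fused                    # logits for classification, fused for retrieval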
The operation of extracting the feature descriptor of the virtual view through the convolutional neural network in step 101 is specifically:
1) extracting 12 views;
2) extracting a feature descriptor for each view (see the sketch below).
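As a concrete illustration of step 101, the sketch below extracts one 4096-dimensional descriptor per rendered view with a pretrained CNN. The patent does not name the backbone network, so torchvision's VGG16 is only an assumed example; the rendering of the 12 views themselves (every 30 degrees around the Z-axis) is taken as given.

import torch
import torchvision.models as models

# Assumed backbone: VGG16 truncated before its last classifier layer, giving a 4096-d output.
backbone = models.vgg16(pretrained=True)
backbone.classifier = torch.nn.Sequential(*list(backbone.classifier.children())[:-1])
backbone.eval()

def extract_view_descriptors(views):
    """views: tensor of shape (12, 3, 224, 224), one rendered image every 30 degrees."""
    with torch.no_grad():
        return backbone(views)   # (12, 4096) feature descriptors f_1 ... f_12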
The visual saliency branch in step 102 takes the output of step 101 as input and finally obtains the weight of each view and the features of the visual saliency branch. The specific steps are as follows:
1) the 12 feature descriptors are input into the visual saliency branch in the extraction order of step 101, and the weight of each view is generated through the first LSTM module and the soft attention mechanism. Since the view weight depends on the previous hidden state h_{t-1}, h_{t-1} is computed from the relation between the hidden state h_t and the internal memory state c_t, and the weight of each view is then obtained from it (a code sketch is given after these steps).
e_i = w^T tanh(U_h [h_{t-1}, v_{i,t}] + b_v)
a_i = exp(e_i) / Σ_j exp(e_j)
wherein e_i is the relevance score of the i-th view; U_h is the weight matrix of h_{t-1}; v_{i,t} is the feature descriptor of the i-th view at time t; b_v is the bias of h_{t-1}; e_j is the relevance score of the j-th view; a_i is the weight of the i-th view; T denotes matrix transposition; w, U_h and b_v are parameters that need to be optimized.
2) The last hidden state is linearly weighted and then used as the input of the second LSTM module to obtain the feature descriptors of the visual saliency branches.
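A sketch of the attention computation above follows. It implements e_i = w^T tanh(U_h [h_{t-1}, v_{i,t}] + b_v) and the softmax normalization over the 12 views; the hidden and feature dimensions are assumed values, and the surrounding LSTM modules are omitted.

import torch
import torch.nn as nn

class ViewAttention(nn.Module):
    """Soft attention that scores each view against the previous LSTM hidden state."""
    def __init__(self, feat_dim=4096, hidden_dim=512):
        super().__init__()
        self.U_h = nn.Linear(hidden_dim + feat_dim, hidden_dim)   # U_h and bias b_v
        self.w = nn.Linear(hidden_dim, 1, bias=False)             # w^T

    def forward(self, h_prev, view_feats):
        # h_prev: (batch, hidden_dim) previous hidden state h_{t-1}
        # view_feats: (batch, 12, feat_dim) descriptors v_{1,t} ... v_{12,t}
        h = h_prev.unsqueeze(1).expand(-1, view_feats.size(1), -1)
        e = self.w(torch.tanh(self.U_h(torch.cat([h, view_feats], dim=-1)))).squeeze(-1)
        return torch.softmax(e, dim=-1)          # view weights a_1 ... a_12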
The MVCNN branch in step 103 also takes the output of step 101 as input, and finally obtains the characteristics of the MVCNN branch, specifically including the following steps:
1) performing feature fusion on the two-dimensional view by applying view saliency pooling;
2) obtaining the feature descriptors of the MVCNN branch through a deep convolutional network layer (a sketch of this pooling is given below).
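The sketch below illustrates the two MVCNN-branch steps: view saliency pooling as a weighted combination of the 12 view descriptors, followed by a final network stage. A single linear layer stands in for the trailing CNN, which is an assumption made for brevity.

import torch
import torch.nn as nn

class ViewSaliencyPooling(nn.Module):
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.post = nn.Linear(feat_dim, feat_dim)   # stand-in for the final CNN stage

    def forward(self, view_feats, weights):
        # view_feats: (batch, 12, feat_dim); weights: (batch, 12) from the saliency branch
        pooled = (weights.unsqueeze(-1) * view_feats).sum(dim=1)   # weight-guided fusion of views
        return self.post(pooled)                                   # MVCNN-branch descriptor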
In the above step 104, the classification and search task is finally completed by fusing the two branches, and the specific steps are as follows:
1) connecting the feature descriptors of the two branches in series to obtain a feature descriptor of the final three-dimensional model;
2) the feature descriptors pass through a full connection layer and a softmax layer to obtain a classification result;
3) the similarity measure is executed to obtain a retrieval result.
In summary, in the embodiment of the present invention, the feature descriptors of the two branches are extracted through the above steps 101 to 103, and then the features are fused and used for classification and retrieval through the step 104, so that the description of the three-dimensional model is more comprehensive, and the quantization of the similarity is more accurate and scientific.
Example 2
The scheme in embodiment 1 is further described below with reference to the network structure, fig. 1, and fig. 2, and is described in detail below:
for extracting the first feature descriptor, the invention takes the Z axis as the rotation center, carries out visual angle sampling on the three-dimensional model at intervals of 30 degrees, and extracts the feature descriptor of the view through a mature deep convolutional neural network, which is concretely as follows:
1. Each three-dimensional model is first normalized by the NPCA (three-dimensional principal component analysis) method (a sketch of this normalization is given after these steps). A visualization tool developed with OpenGL then acts as a virtual observer and extracts one view every 30 degrees around the Z-axis of each three-dimensional model; 12 views are extracted to represent the visual and structural information of the model. These views can thus be regarded as a sequence of images v_1, v_2, ..., v_12, which is essential for the network structure of the invention.
2. The feature descriptor of each view is extracted with a CNN, obtaining f_1, f_2, ..., f_12; the CNN parameters are shared across all views.
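As an illustration of the pose-normalization step, the sketch below aligns a model, assumed to be given as an (N, 3) vertex array, with its principal axes via PCA and rescales it. This is a simplified stand-in in the spirit of NPCA, not the exact procedure of the patent.

import numpy as np

def pca_normalize(vertices):
    """vertices: (N, 3) array of mesh vertex coordinates."""
    centered = vertices - vertices.mean(axis=0)      # move the centroid to the origin
    cov = np.cov(centered, rowvar=False)             # 3x3 covariance of the coordinates
    eigvals, eigvecs = np.linalg.eigh(cov)           # principal axes (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]                # sort axes by decreasing variance
    aligned = centered @ eigvecs[:, order]           # rotate into the principal frame
    return aligned / np.abs(aligned).max()           # scale into a unit bounding box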
For the visual saliency branch, the three-dimensional model is characterized by obtaining the weight of each view and generating a feature descriptor through a soft attention mechanism and two LSTM modules, which is as follows:
1) The feature descriptors f_1, f_2, ..., f_12 are input, and the previous hidden state h_{t-1} is computed from the relation between the hidden state h_t and the internal memory state c_t, h_t = o_t ⊙ c_t, where o_t is the output gate.
2) Each view weight a_i is computed based on the previous hidden state h_{t-1}, where v_{i,t} is the feature descriptor of the virtual view at time t, and w, U_h and b_v are updated along with the overall network parameters.
For the MVCNN branch, the three-dimensional model is represented by view saliency pooling, with the multi-view feature descriptors fused according to their weights, as follows:
(1) the 12 feature descriptors f_1, f_2, ..., f_12 obtained in the first step are input into the MVCNN branch, and visual saliency pooling is applied to obtain a dynamically weighted average of the multi-view feature descriptors;
(2) and inputting the aggregated feature descriptors into a final CNN network for training to obtain the feature descriptors of the MVCNN branches.
After the feature descriptors of the two branches are obtained, fusion is performed, specifically as follows:
(1) setting the dimension of the feature descriptors obtained by the two branches to be (1,4096), and obtaining one feature descriptor (2,4096) in a serial mode;
(2) obtaining the scores of all classifications through a full connection layer, and classifying through the scores by a softmax layer;
(3) the retrieval task is completed through similarity measurement (a sketch of the fusion and decision step is given below).
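A sketch of the fusion and decision step follows: the two branch descriptors are concatenated, a fully connected layer plus softmax produces class scores, and retrieval ranks gallery models by distance between descriptors. Euclidean distance is only an assumed choice here, since the patent specifies a similarity measurement without fixing the metric.

import torch
import torch.nn as nn

fc = nn.Linear(2 * 4096, 40)   # e.g. 40 classes for ModelNet40

def classify_and_retrieve(sal_feat, mv_feat, gallery_feats):
    # sal_feat, mv_feat: (batch, 4096) descriptors of the two branches
    fused = torch.cat([sal_feat, mv_feat], dim=-1)   # final three-dimensional model descriptor
    scores = torch.softmax(fc(fused), dim=-1)        # classification scores
    dists = torch.cdist(fused, gallery_feats)        # distances to gallery descriptors (batch, G)
    ranking = dists.argsort(dim=-1)                  # nearest gallery models first
    return scores, ranking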
In conclusion, the embodiment of the invention enhances the expressiveness of the three-dimensional model through the above steps, eliminates the adverse effect of assigning identical weights to all views on the classification and retrieval results, and improves the accuracy of three-dimensional model classification and retrieval.
Example 3
The following examples are presented to demonstrate the feasibility of the embodiments of examples 1 and 2, and are described in detail below:
the database in the embodiment of the invention is based on ModelNet40 and ShapeNetCore 55. ModelNet40 is a subset of ModelNet, and contains 12,311 CAD models, divided into 40 classes. The model was cleaned up manually, but without pose normalization, the ModelNet40 model used in the present example was in the format of off. ShapeleNet Core55 is a subset of Shapelet, and contains 55 classes, approximately 51,300 three-dimensional models, each of which is subdivided into several subcategories, including a 70% training set, a 10% validation set, and a 20% test set. The ShapeNetCore55 model used in the examples of the present invention is in the form of a.
The table below shows the accuracy of classification experiments performed with different parts of the network on the ModelNet40 dataset. The results show that the attention weights focus the model on more representative views, yielding better 3D shape recognition performance, and that treating the captured views as a sequence and extracting their structural information makes the network architecture effective for obtaining a better 3D object representation.
Table 1 shows the classification results of different components of the framework in the ModelNet40 data set
Embodiments of the invention performed classification and retrieval experiments on ModelNet40 and compared the results with various models, including 3D ShapeNets [1], SPH [11], LFD [12], MVCNN [7], PointNet [3], PointNet++ [13], Kd-Network [6], and others. The following table shows the classification and retrieval results for each method. In the retrieval task, low-rank Mahalanobis metric learning is further applied to MVCNN to improve its retrieval performance. The present method directly uses the final feature descriptor obtained by concatenating the sequence features and the convolutional features, and achieves state-of-the-art performance of 90.7%.
The results show that the method provided by the invention achieves the best performance, with a classification accuracy of 92.69% and a retrieval mAP of 90.7%. Compared with the best results of MVCNN, the two-stream network of the present method improves the classification and retrieval tasks by 1.7% and 7.7%, respectively.
Table 2 shows the classification accuracy of each model in the ModelNet40 data set
The following table shows the results of retrieval experiments on the ShapeNetCore55 dataset, compared with three-dimensional model retrieval methods including RotationNet, Improved GIFT, ReVGG, DLAN, SHREC16-Bai GIFT and SHREC16-Su MVCNN. Under micro-averaged metrics, the method performs well and is always very close to the best result on the dataset; under macro-averaged metrics, its F-score is lower than that of RotationNet but better than the other three-dimensional model retrieval methods.
Table 3 shows the accuracy of the search in the ShapeNetCore55 dataset for each model
In order to study the influence of the number of views on classification and retrieval performance, virtual views are extracted around the Z-axis at angular intervals θ of 180, 90, 60, 45, 36, 30 and 18 degrees in turn, so that each three-dimensional model generates 2, 4, 6, 8, 10, 12 and 20 views, respectively.
The following table gives the classification and retrieval results with different numbers of views as input to the algorithm. The results show that performance can be improved by increasing the number of views, but too many view images lead to information redundancy and thus degrade performance. When the number of views is set to 12, the NN, FT, ST, F_measure, DCG, ANMRR and ACC are improved by 15.8%-46.7%, 11.8%-118.8%, 17.0%-71.5%, 18.0%-52.4%, 12.0%-95.6% and 43.6%-77.9%, respectively. Therefore, the optimal number of views is set to 12.
TABLE 4 Classification and retrieval accuracy for varying view numbers in ModelNet40 data sets
In order to study the influence of the view order on the classification and retrieval results of the three-dimensional model, this embodiment sets up 50 shuffled-view experiments, and the following table provides the classification and retrieval results. The results show that inputting the views in a shuffled order performs even better than inputting them in the original order. Evidently, the network can adaptively compute the importance of each view without being constrained by the camera setup, thereby learning powerful visual and structural information of the three-dimensional model.
TABLE 5 Classification and retrieval accuracy for shuffled and ordered views in the ModelNet40 dataset
Figure RE-GDA0002429140290000091
References
[1] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912-1920, 2015.
[2] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In NIPS, pages 3189-3197, 2016.
[3] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[4] Z. Han, Z. Liu, J. Han, C. M. Vong, S. Bu, and C. L. Chen. Mesh convolutional restricted Boltzmann machines for unsupervised learning of features with structure preservation on 3-D meshes. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2268-2281, 2017.
[5] Y. Feng, Y. Feng, H. You, X. Zhao, and Y. Gao. MeshNet: Mesh neural network for 3D shape representation. arXiv:1811.11424, 2018.
[6] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. arXiv:1704.01222, 2017.
[7] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945-953, 2015.
[8] K. Sfikas, T. Theoharis, and I. Pratikakis. Exploiting the PANORAMA representation for convolutional neural network classification and retrieval. In Eurographics Workshop on 3D Object Retrieval, I. Pratikakis, F. Dupont, and M. Ovsjanikov, Eds. The Eurographics Association, 2017.
[9] K. Sfikas, I. Pratikakis, and T. Theoharis. Ensemble of PANORAMA-based convolutional neural networks for 3D model classification and retrieval. Computers & Graphics, vol. 71, pages 208-218, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0097849317301978
[10] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. Jan Latecki. GIFT: A real-time and scalable 3D shape search engine. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 5023-5032, 2016.
[11] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. In Proc. Symp. Geometry Process., vol. 6, pp. 156-164, 2003.
[12] D. Chen, X. Tian, Y. Shen, and M. Ouhyoung. On visual similarity based 3D model retrieval. Comput. Graph. Forum, vol. 22, no. 3, pp. 223-232, 2003.
[13] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A three-dimensional model classification and retrieval method based on visual saliency information sharing, characterized in that the method comprises:
extracting a view every 30 degrees around the Z-axis direction of the three-dimensional model, and extracting a feature descriptor of each virtual image through a deep convolutional neural network;
taking the feature descriptor as the input of the visual saliency branch, generating the weight of a view through a first LSTM module and a soft attention mechanism, and generating the feature descriptor of the visual saliency branch through a second LSTM module;
the feature descriptors are used as input of MVCNN branches, visual information fusion in the MVCNN module is guided by using view weights, and the feature descriptors of the MVCNN branches are obtained through a CNN;
and concatenating the descriptors of the two branches, making a decision through a fully connected layer and a softmax layer for classification, and performing similarity measurement for retrieval.
2. The method for classifying and retrieving three-dimensional models based on visual saliency information sharing according to claim 1, wherein taking the feature descriptors as the input of the visual saliency branch, generating view weights through a first LSTM module and a soft attention mechanism, and generating the feature descriptor of the visual saliency branch through a second LSTM module specifically comprises:
sequentially inputting the 12 feature descriptors into the visual saliency branch in extraction order, and generating the weight of each view through the first LSTM module and the soft attention mechanism, wherein the previous hidden state h_{t-1} is computed from the relation between the hidden state h_t and the internal memory state c_t, and the weight of each view is then obtained from h_{t-1};
the last hidden state is linearly weighted and then used as the input of the second LSTM module to obtain the feature descriptors of the visual saliency branches.
3. The method as claimed in claim 1, wherein using the view weights to guide visual information fusion in the MVCNN model and obtaining the feature descriptors of the MVCNN branch through a CNN specifically comprises:
performing feature fusion on the two-dimensional view by applying view saliency pooling;
and obtaining the feature descriptors of the MVCNN branches through a layer of deep neural convolution network.
CN202010017062.9A 2020-01-08 2020-01-08 Three-dimensional model classification and retrieval method based on visual saliency information sharing Pending CN111242207A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010017062.9A CN111242207A (en) 2020-01-08 2020-01-08 Three-dimensional model classification and retrieval method based on visual saliency information sharing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010017062.9A CN111242207A (en) 2020-01-08 2020-01-08 Three-dimensional model classification and retrieval method based on visual saliency information sharing

Publications (1)

Publication Number Publication Date
CN111242207A true CN111242207A (en) 2020-06-05

Family

ID=70872998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010017062.9A Pending CN111242207A (en) 2020-01-08 2020-01-08 Three-dimensional model classification and retrieval method based on visual saliency information sharing

Country Status (1)

Country Link
CN (1) CN111242207A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596329A (en) * 2018-05-11 2018-09-28 北方民族大学 Threedimensional model sorting technique based on end-to-end Deep integrating learning network
CN109063139A (en) * 2018-08-03 2018-12-21 天津大学 Based on the classification of the threedimensional model of panorama sketch and multichannel CNN and search method
CN110188228A (en) * 2019-05-28 2019-08-30 北方民族大学 Cross-module state search method based on Sketch Searching threedimensional model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHAO MA et al.: "Learning Multi-View Representation With LSTM for 3-D Shape Recognition and Retrieval", IEEE Transactions on Multimedia *
HANG SU et al.: "Multi-view Convolutional Neural Networks for 3D Shape Recognition", 2015 IEEE International Conference on Computer Vision (ICCV) *
WEIZHI NIE et al.: "Two-Stream Network Based on Visual Saliency Sharing for 3D Model Recognition", IEEE Access *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270762A (en) * 2020-11-18 2021-01-26 天津大学 Three-dimensional model retrieval method based on multi-mode fusion
CN112488122A (en) * 2020-11-25 2021-03-12 南京航空航天大学 Panoramic image visual saliency prediction method based on convolutional neural network
CN112488122B (en) * 2020-11-25 2024-04-16 南京航空航天大学 Panoramic image visual saliency prediction method based on convolutional neural network
CN113032613A (en) * 2021-03-12 2021-06-25 哈尔滨理工大学 Three-dimensional model retrieval method based on interactive attention convolution neural network
CN112801928A (en) * 2021-03-16 2021-05-14 昆明理工大学 Attention mechanism-based millimeter wave radar and visual sensor fusion method
CN112801928B (en) * 2021-03-16 2022-11-29 昆明理工大学 Attention mechanism-based millimeter wave radar and visual sensor fusion method
CN113052231A (en) * 2021-03-23 2021-06-29 哈尔滨理工大学 Three-dimensional model classification method based on voxel and global shape distribution characteristics
CN113191401A (en) * 2021-04-14 2021-07-30 中国海洋大学 Method and device for three-dimensional model recognition based on visual saliency sharing
CN113313140A (en) * 2021-04-14 2021-08-27 中国海洋大学 Three-dimensional model classification and retrieval method and device based on deep attention
CN113313140B (en) * 2021-04-14 2022-11-01 中国海洋大学 Three-dimensional model classification and retrieval method and device based on deep attention
CN116935477A (en) * 2023-09-13 2023-10-24 中南民族大学 Multi-branch cascade face detection method and device based on joint attention
CN116935477B (en) * 2023-09-13 2023-12-26 中南民族大学 Multi-branch cascade face detection method and device based on joint attention

Similar Documents

Publication Publication Date Title
CN111242207A (en) Three-dimensional model classification and retrieval method based on visual saliency information sharing
Qiu et al. Geometric back-projection network for point cloud classification
EP3179407B1 (en) Recognition of a 3d modeled object from a 2d image
You et al. PVRNet: Point-view relation neural network for 3D shape recognition
Papadakis et al. Efficient 3D shape matching and retrieval using a concrete radialized spherical projection representation
Yang et al. Multi-view CNN feature aggregation with ELM auto-encoder for 3D shape recognition
Nie et al. DAN: Deep-attention network for 3D shape recognition
CN109063139B (en) Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN
Demir et al. Skelneton 2019: Dataset and challenge on deep learning for geometric shape understanding
Shi et al. Learning to detect 3D symmetry from single-view RGB-D images with weak supervision
JP2008527473A (en) 3D model search method, search device, and search program
Yu et al. Part-wise AtlasNet for 3D point cloud reconstruction from a single image
Li et al. Shrec’16 track: 3D sketch-based 3D shape retrieval
Liu et al. VFMVAC: View-filtering-based multi-view aggregating convolution for 3D shape recognition and retrieval
Bickel et al. A novel shape retrieval method for 3D mechanical components based on object projection, pre-trained deep learning models and autoencoder
Liang et al. MVCLN: multi-view convolutional LSTM network for cross-media 3D shape recognition
Bu et al. Multimodal feature fusion for 3D shape recognition and retrieval
Williams et al. Voronoinet: General functional approximators with local support
Nie et al. MMFN: Multimodal information fusion networks for 3D model classification and retrieval
Su et al. 3d-assisted image feature synthesis for novel views of an object
Ma et al. A novel 3D shape recognition method based on double-channel attention residual network
Liu et al. Semantic and context information fusion network for view-based 3D model classification and retrieval
Zou et al. Shape-based retrieval and analysis of 3D models using fuzzy weighted symmetrical depth images
Zhou et al. Sketch augmentation-driven shape retrieval learning framework based on convolutional neural networks
Liang et al. 3D shape recognition based on multi-modal information fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200605