CN117216591A - Training method and device for three-dimensional model matching and multi-modal feature mapping model - Google Patents

Training method and device for three-dimensional model matching and multi-modal feature mapping model

Info

Publication number
CN117216591A
Authority
CN
China
Prior art keywords
dimensional model
text
sample
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311228748.2A
Other languages
Chinese (zh)
Inventor
陈明翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ant Blockchain Technology Shanghai Co Ltd
Original Assignee
Ant Blockchain Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ant Blockchain Technology Shanghai Co Ltd filed Critical Ant Blockchain Technology Shanghai Co Ltd
Priority to CN202311228748.2A priority Critical patent/CN117216591A/en
Publication of CN117216591A publication Critical patent/CN117216591A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the specification provides a training method and device for three-dimensional model matching and multi-modal feature mapping models, wherein the method comprises the following steps: determining at least one first feature corresponding to a first object by using a target multi-modal feature mapping model according to the first object, wherein the target multi-modal feature mapping model is obtained based on sample image-text pair training, the sample image-text pair comprises a corresponding text and a depth map, and the first object is a first text or a first three-dimensional model; acquiring at least one second feature corresponding to the second three-dimensional model, wherein each second feature is generated by using a target multi-modal feature mapping model according to a depth map of the second three-dimensional model; and determining a matching result of the first object and the second three-dimensional model based on the first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model.

Description

Training method and device for three-dimensional model matching and multi-modal feature mapping model
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a training method and apparatus for three-dimensional model matching and multi-modal feature mapping models.
Background
In some scenarios, a user needs to find a three-dimensional model that meets his or her requirements. To meet this demand, a three-dimensional model matching method needs to be provided.
Disclosure of Invention
One or more embodiments of the present disclosure provide a three-dimensional model matching method and apparatus, so as to implement matching of a three-dimensional model.
According to a first aspect, there is provided a three-dimensional model matching method, comprising:
determining at least one first feature corresponding to a first object by using a target multi-modal feature mapping model according to the first object, wherein the target multi-modal feature mapping model is obtained based on sample image-text pair training, the sample image-text pair comprises a corresponding text and a depth map, and the first object is a first text or a first three-dimensional model;
acquiring at least one second feature corresponding to a second three-dimensional model, wherein each second feature is generated by using the target multi-modal feature mapping model according to a depth map of the second three-dimensional model;
and determining a matching result of the first object and the second three-dimensional model based on the first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model.
In an alternative embodiment, the first object is a first three-dimensional model;
the determining at least one first feature corresponding to the first object includes:
determining a first depth map of the first three-dimensional model under at least one preset first visual angle;
and determining the first features corresponding to the first depth maps by using the image encoder of the target multi-modal feature mapping model according to the first depth maps.
In an alternative embodiment, the depth map of the second three-dimensional model includes: a depth map of each second three-dimensional model at at least one of the first viewing angles;
the determining a matching result of the first object and the second three-dimensional model includes:
calculating the average value of the first features corresponding to all the first depth maps to obtain the average feature of the first three-dimensional model;
for each second three-dimensional model, calculating the average value of the second features corresponding to the second three-dimensional model to obtain the average feature of each second three-dimensional model;
and determining a matching result of the first three-dimensional model and each second three-dimensional model based on the average characteristic of the first three-dimensional model and the average characteristic of each second three-dimensional model.
In an alternative embodiment, the determining the matching result of the first three-dimensional model and each second three-dimensional model includes:
calculating first similarity values between the average feature of the first three-dimensional model and the average feature of each second three-dimensional model respectively;
and determining the second three-dimensional model with the largest corresponding first similarity value as a three-dimensional model matched with the first three-dimensional model.
In an alternative embodiment, the first object is a first text;
the determining at least one first feature corresponding to the first object includes:
and determining a first feature corresponding to the first text by using a text encoder of the target multi-modal feature mapping model according to the first text.
In an alternative embodiment, the determining the matching result of the first object and the second three-dimensional model includes:
calculating second similarity values between the first feature corresponding to the first text and each second feature respectively;
and determining a matching result of the first text and the second three-dimensional model based on each second similarity value.
In an alternative embodiment, there are a plurality of second three-dimensional models;
the determining, based on each second similarity value, a matching result of the first text and the second three-dimensional model includes:
based on each second similarity value, determining N third features with the maximum corresponding second similarity values from the plurality of second features, wherein N is a positive integer;
determining a second three-dimensional model corresponding to each third feature;
and determining the second three-dimensional model corresponding to the largest number of third features as the three-dimensional model matched with the first text.
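As a non-authoritative illustration of the top-N voting described above, the following sketch ranks precomputed second features by cosine similarity to a text feature and returns the second three-dimensional model owning the most of the N best-matching features; the function and variable names are hypothetical and not defined in this application.

```python
# Hypothetical sketch of the top-N voting scheme; names are illustrative only.
import numpy as np

def match_text_to_models(text_feature, model_ids, second_features, n=5):
    """text_feature: (D,) query feature; second_features: (K, D) one row per depth-map feature;
    model_ids: (K,) id of the second three-dimensional model each row belongs to."""
    # Second similarity values between the text feature and every second feature.
    sims = second_features @ text_feature / (
        np.linalg.norm(second_features, axis=1) * np.linalg.norm(text_feature) + 1e-8)
    # N third features with the largest second similarity values.
    top_n = np.argsort(-sims)[:n]
    # The second three-dimensional model owning the most third features is the match.
    ids, counts = np.unique(np.asarray(model_ids)[top_n], return_counts=True)
    return ids[np.argmax(counts)]
```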
According to a second aspect, there is provided a training method of a multimodal feature mapping model, comprising:
acquiring a sample graph-text pair in a training data set, wherein the sample graph-text pair comprises a sample depth map and a corresponding sample text;
determining graph features corresponding to the sample depth map and text features corresponding to the sample text by using a multi-mode feature mapping model to be trained according to the sample depth map and the sample text;
and adjusting parameters of the multi-modal feature mapping model to be trained with the aim of maximizing the similarity between the graph features and the text features.
In an alternative embodiment, before the acquiring the sample image-text pair in the training dataset, the method further includes:
acquiring a color image and a corresponding second text;
obtaining a second depth map corresponding to the color image by using a depth estimation model according to the color image, wherein the second depth map comprises depth values of pixels, and the depth estimation model is obtained by training based on sample color images and depth maps corresponding to the sample color images;
and forming a group of sample image-text pairs based on the second text and the second depth map, and classifying the sample image-text pairs into the training data set.
In an alternative embodiment, before the acquiring the sample image-text pair in the training dataset, the method further includes:
obtaining a sample three-dimensional model;
rendering the sample three-dimensional model from at least one second view angle to obtain a three-dimensional model rendering map and a third depth map corresponding to each second view angle, wherein the third depth map comprises depth values corresponding to the sample three-dimensional model under the corresponding second view angle;
determining a third text corresponding to each three-dimensional model rendering graph by using an image description generating model according to each three-dimensional model rendering graph, wherein the image description generating model is used for generating a corresponding text description based on an input image;
and forming sample image-text pairs corresponding to each second view angle based on the third text corresponding to each three-dimensional model rendering map and the third depth map corresponding to the sample three-dimensional model under each second view angle, and classifying the sample image-text pairs into the training data set.
According to a third aspect, there is provided a three-dimensional model matching apparatus comprising:
the first determining module is configured to determine at least one first feature corresponding to a first object by using a target multi-modal feature mapping model according to the first object, wherein the target multi-modal feature mapping model is obtained based on sample image-text pair training, the sample image-text pairs comprise corresponding texts and depth maps, and the first object is a first text or a first three-dimensional model;
the first acquisition module is configured to acquire at least one second feature corresponding to a second three-dimensional model, and each second feature is generated by using the target multi-modal feature mapping model according to a depth map of the second three-dimensional model;
and the second determining module is configured to determine a matching result of the first object and the second three-dimensional model based on the first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model.
According to a fourth aspect, there is provided a training device for a multimodal feature mapping model, comprising:
the second acquisition module is configured to acquire a sample graph-text pair in the training data set, wherein the sample graph-text pair comprises a sample depth map and a corresponding sample text;
a third determining module, configured to determine, according to the sample depth map and the sample text, a map feature corresponding to the sample depth map and a text feature corresponding to the sample text using a multi-modal feature mapping model to be trained;
and the adjustment module is configured to adjust parameters of the multi-modal feature mapping model to be trained with the aim of maximizing the similarity between the graph features and the text features.
According to a fifth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first or second aspect.
According to a sixth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, the processor implementing the method of the first or second aspect when executing the executable code.
According to the training method and the training device for the three-dimensional model matching and the multi-modal feature mapping model, in the three-dimensional model matching process, according to a first object, at least one first feature corresponding to the first object is determined by using a target multi-modal feature mapping model, wherein the target multi-modal feature mapping model is obtained based on sample image-text pairs, the sample image-text pairs comprise corresponding texts and depth maps, and the first object is a first text or a first three-dimensional model; acquiring at least one second feature corresponding to the second three-dimensional model, wherein each second feature is generated by using a target multi-modal feature mapping model according to a depth map of the second three-dimensional model; and determining a matching result of the first object and the second three-dimensional model based on the first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model.
In the above process, at least one first feature corresponding to the first object can be determined through training the obtained target multi-modal feature mapping model based on the corresponding text and the depth map, then at least one second feature corresponding to the second three-dimensional model is obtained, and each second feature is generated through using the target multi-modal feature mapping model according to the depth map of the second three-dimensional model, so that alignment between the at least one first feature of the first object and the at least one second feature corresponding to the second three-dimensional model can be ensured, then the first object and the second three-dimensional model are matched based on the aligned first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model, and a matching result of the first object and the second three-dimensional model is determined, so that matching between the text and the three-dimensional model or matching between the three-dimensional model and the three-dimensional model can be realized.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of an implementation framework of one embodiment of the disclosure;
FIG. 2 is a schematic flow chart of a training method of the multi-modal feature mapping model according to the embodiment;
FIG. 3 is a schematic flow chart of a three-dimensional model matching method according to an embodiment;
FIG. 4 is a schematic flow chart of another three-dimensional model matching method according to an embodiment;
FIG. 5 is another schematic flow chart of the three-dimensional model matching method according to an embodiment;
FIG. 6 is a schematic block diagram of a three-dimensional model matching apparatus provided by an embodiment;
FIG. 7 is a schematic block diagram of a training apparatus for a multimodal feature mapping model provided by an embodiment.
Detailed Description
The technical solutions of the embodiments of the present specification will be described in detail below with reference to the accompanying drawings.
As described above, in order to meet the requirement of users for finding the three-dimensional model required by the users, providing a three-dimensional model matching method is a problem to be solved.
In view of this, the embodiments of the present disclosure provide a method and apparatus for training a three-dimensional model matching and a multi-modal feature mapping model, so as to implement matching of the three-dimensional model. Fig. 1 shows a schematic diagram of an implementation scenario according to one embodiment disclosed in the present specification. In the implementation scene, a first object is acquired, wherein the first object is a first text or a first three-dimensional model; according to the first object, determining at least one first feature corresponding to the first object by using a target multi-modal feature mapping model, wherein the target multi-modal feature mapping model is obtained based on sample image-text pair training, and the sample image-text pair comprises a corresponding text and a corresponding depth map; at least one second feature corresponding to the second three-dimensional model is acquired, wherein each second feature is generated by using the target multi-modal feature mapping model according to the depth map of the second three-dimensional model; and matching the first object with the second three-dimensional model based on the first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model, and determining a matching result of the first object and the second three-dimensional model.
In the above process, at least one first feature corresponding to the first object can be determined through training the obtained target multi-modal feature mapping model based on the corresponding text and the depth map, then at least one second feature corresponding to the second three-dimensional model is obtained, and each second feature is generated through using the target multi-modal feature mapping model according to the depth map of the second three-dimensional model, so that alignment between the at least one first feature of the first object and the at least one second feature corresponding to the second three-dimensional model can be ensured, then the first object and the second three-dimensional model are matched based on the aligned first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model, and a matching result of the first object and the second three-dimensional model is determined, so that matching between the text and the three-dimensional model or matching between the three-dimensional model and the three-dimensional model can be realized.
The training method and device for the three-dimensional model matching and the multi-mode feature mapping model provided by the specification are described in detail below with reference to specific embodiments.
In some scenarios, some users need to find a desired three-dimensional model through text, and/or some users need to find, through a known three-dimensional model, other three-dimensional models that match it. To meet the needs of such users, it is important to provide a method that can achieve matching of three-dimensional models. To this end, the embodiments of the present specification also provide a training method of a multi-modal feature mapping model, and the multi-modal feature mapping model obtained through training can support the implementation of the three-dimensional model matching scheme.
The training process of the multi-modal feature mapping model is first described below.
Fig. 2 is a flow chart illustrating a method of training a multimodal feature mapping model in one embodiment of the present description, where the method is performed by an electronic device, which may be implemented by any apparatus, device, platform, cluster of devices, etc. having computing, processing capabilities. In the training process of the multi-modal feature mapping model, the method includes steps S210-S230:
in step S210, a sample image-text pair in the training data set is obtained, where the sample image-text pair includes a sample depth map and a corresponding sample text.
In one implementation, to train the multi-modal feature mapping model, a training dataset for training the multi-modal feature mapping model may first be obtained, where the training dataset may include a plurality of sample image-text pairs, each sample image-text pair including a depth map and a text that have a correspondence. For clarity of description, the depth map in the sample image-text pair may be referred to as the sample depth map, and the text in the sample image-text pair may be referred to as the sample text. The electronic device may obtain sample image-text pairs from the training dataset and then perform the subsequent training procedure.
In one case, the present specification embodiment also provides a process of constructing the training data set, wherein the process of constructing the training data set may include, but is not limited to: constructing a sample image-text pair containing a depth map and a text with corresponding relations by using a two-dimensional color image and a text corresponding to the two-dimensional color image; and/or constructing a sample image-text pair containing the depth map and the text with corresponding relation by using the rendering map and the depth map obtained by rendering the three-dimensional model.
Specifically, in one embodiment, before step S210, the method may further include the following steps 11-13:
in step 11, a color image and its corresponding second text are acquired. In this step, the electronic device may obtain a two-dimensional image-text pair from an arbitrarily specified public data set capable of providing a large number of two-dimensional image-text pairs, where the two-dimensional image-text pair includes a two-dimensional color image and a text corresponding thereto, and for clarity of description, the text corresponding to the two-dimensional color image is called a second text, which may also be called an image description corresponding to the two-dimensional color image, and the second text may be used to generate a two-dimensional color image corresponding thereto.
Next, in step 12, a second depth map corresponding to the color image is obtained by using a depth estimation model according to the color image, wherein the second depth map includes depth values of each pixel, and the depth estimation model is obtained by training based on sample color images and the depth maps corresponding thereto. The depth map corresponding to a sample color image may include a depth value annotated for each pixel of the sample color image. In this step, the electronic device may acquire a depth estimation model from a preset storage space, where the depth estimation model is obtained in advance by training based on sample color images and the depth maps corresponding to the sample color images, and the depth estimation model may output a corresponding depth map for an input color image, where the depth map includes depth values of each pixel. In one case, the depth value may range over [0, 255].
After the electronic device obtains the color image and the second text corresponding to the color image, the color image can be input into a depth estimation model, and the color image is processed through the depth estimation model to obtain a depth map corresponding to the color image, which is called a second depth map. In one implementation, the depth estimation model may be a monocular depth estimation model, which may be implemented based on the MiDaS model.
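The following is a minimal sketch of step 12 under the assumption that the monocular depth estimation model is the publicly released MiDaS model loaded via torch.hub; the model variant, file name, and the rescaling of the prediction to [0, 255] are illustrative assumptions rather than details given in this embodiment.

```python
# Hedged sketch: MiDaS-based depth estimation for a two-dimensional color image.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
midas.eval()

img = cv2.cvtColor(cv2.imread("color_image.png"), cv2.COLOR_BGR2RGB)
input_batch = midas_transforms.dpt_transform(img)

with torch.no_grad():
    prediction = midas(input_batch)                      # relative inverse depth, (1, H', W')
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()

# Rescale to the [0, 255] range used for the second depth map in this example.
second_depth_map = (prediction - prediction.min()) / (prediction.max() - prediction.min() + 1e-8) * 255
```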
Then, in step 13, a set of sample image-text pairs is formed based on the second text and the second depth map and is included in the training data set. In this step, the second text corresponding to the color image and the second depth map corresponding to the color image may be formed into a set of sample image-text pairs, and the sample image-text pairs may be classified into the training data set.
By the above method, for the color image in each two-dimensional image-text pair, the second depth map corresponding to the color image can be obtained by the depth estimation model, so that a plurality of sample image-text pairs are obtained by using the text in the two-dimensional image-text pairs and the corresponding second depth maps. Such a sample image-text pair comprises the original text, namely an image description with relatively high accuracy, and a depth map obtained by the depth estimation model from the two-dimensional color image corresponding to the text. In one case, this kind of sample image-text pair may be referred to as a first type sample image-text pair.
In yet another embodiment, prior to step S210, the method may further comprise the steps of 21-24:
in step 21, a sample three-dimensional model is acquired. In this step, the electronic device may obtain the sample three-dimensional model from any data source that may provide the three-dimensional model. The sample three-dimensional model may be any type of three-dimensional model, for example, an animal model, a plant model, a character model, a building model, various prop models in a game, and the like. In one implementation, the sample three-dimensional model may exist in the form of an obj file that may include geometry information for the corresponding sample three-dimensional model, including, for example: spatial position information of each vertex constituting the sample three-dimensional model, triangular patch information and texture coordinate information of each vertex, and the like.
Next, in step 22, rendering the sample three-dimensional model from at least one second view angle to obtain a three-dimensional model rendering map and a third depth map corresponding to each second view angle, where the third depth map includes depth values corresponding to the sample three-dimensional model under the corresponding second view angle. In this step, the electronic device may render the sample three-dimensional model from at least one second view angle, to obtain a three-dimensional model rendering map and a third depth map corresponding to each second view angle, where the at least one second view angle may be determined randomly or may be preset.
The spatial position information of each vertex of the sample three-dimensional model may be spatial position information under a specified three-dimensional spatial coordinate system. In one implementation, the center point of the sample three-dimensional model coincides with the origin O of a specified three-dimensional space coordinate system, the size of the sample three-dimensional model is adjusted to a specified size, e.g., the coordinates of each axis in the spatial position information of each vertex of the sample three-dimensional model are normalized to [ -1,1].
In order to ensure that the three-dimensional model rendering maps obtained by rendering the sample three-dimensional model from each second view angle have uniform dimensions, and that the third depth maps likewise have uniform dimensions, the distance between the viewpoint used when rendering the sample three-dimensional model from each second view angle and the center point of the sample three-dimensional model, namely the origin of the specified three-dimensional space coordinate system, is kept equal. For example, the sample three-dimensional model may be rendered from at least one location, i.e., viewpoint, that is a specified distance from the center point of the sample three-dimensional model, wherein, in the case where there are a plurality of such locations, i.e., viewpoints, the locations correspond to different second view angles.
In one case, the second view angle may be represented by a set of a horizontal angle and a vertical angle, where the horizontal angle may refer to the angle between the X axis and the line connecting the origin of the specified three-dimensional space coordinate system, namely the center point of the sample three-dimensional model, with the projection point of the viewpoint on the XOZ plane (considered as the horizontal plane) of the specified three-dimensional space coordinate system; the vertical angle may refer to the angle between the line connecting the viewpoint and the origin of the specified three-dimensional space coordinate system and the line connecting the projection point of the viewpoint on the XOZ plane and the origin. The X axis is the horizontal axis of the specified three-dimensional space coordinate system, the Z axis is the longitudinal axis of the specified three-dimensional space coordinate system, and the Y axis is the vertical axis of the specified three-dimensional space coordinate system.
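Under the angle convention just described, the viewpoint position for a given (horizontal angle, vertical angle) pair at a fixed distance from the origin can be computed as in the following sketch; this is only one possible reading of the convention, not an implementation given in the embodiments.

```python
# Illustrative only: viewpoint position from a horizontal/vertical angle pair,
# with XOZ taken as the horizontal plane and Y as the vertical axis.
import numpy as np

def viewpoint_position(horizontal_deg: float, vertical_deg: float, distance: float) -> np.ndarray:
    h, v = np.radians(horizontal_deg), np.radians(vertical_deg)
    proj = distance * np.cos(v)          # length of the projection onto the XOZ plane
    x = proj * np.cos(h)                 # angle with the X axis is the horizontal angle
    z = proj * np.sin(h)
    y = distance * np.sin(v)             # elevation determined by the vertical angle
    return np.array([x, y, z])
```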
In one implementation manner, if the sample three-dimensional model corresponds to the texture map, correspondingly, when the sample three-dimensional model is rendered from at least one second view angle, the sample three-dimensional model can be rendered according to the texture in the texture map based on the corresponding relation between each vertex in the sample three-dimensional model and each pixel point in the texture map, so as to obtain a three-dimensional model rendering map corresponding to the at least one second view angle.
In still another implementation manner, if the sample three-dimensional model does not correspond to the texture map, correspondingly, when the sample three-dimensional model is rendered from at least one second view angle, the sample three-dimensional model is rendered according to the appointed gray map based on the corresponding relation between each vertex in the sample three-dimensional model and each pixel point in the appointed gray map, so as to obtain a three-dimensional model rendering map corresponding to the at least one second view angle. In one case, the designated gray scale map may be a solid color map, i.e., each pixel in the designated gray scale map corresponds to the same color value.
The third depth map corresponding to the at least one second view angle comprises depth values corresponding to the sample three-dimensional model under the corresponding second view angle. The depth value corresponding to the sample three-dimensional model under the corresponding second view angle can represent the relative distance between each vertex included in the sample three-dimensional model and the plane of the view point corresponding to the second view angle, wherein the plane of the view point corresponding to the second view angle is perpendicular to the XOZ plane of the specified three-dimensional space coordinate system. The range of values for each depth value in the third depth map is unified to [0, 255].
In one implementation, the determining of each depth value in the third depth map may include: based on the spatial position information of each vertex of the sample three-dimensional model under the appointed three-dimensional space coordinate system and the position information of the plane of the viewpoint corresponding to the second view angle j under the appointed three-dimensional space coordinate system, determining the distance between each vertex of the sample three-dimensional model and the plane of the viewpoint corresponding to the second view angle j, and then converting the distance between each vertex of the sample three-dimensional model and the plane of the viewpoint corresponding to the second view angle j into an appointed value range, namely [0, 255], so as to obtain each depth value of the third depth map of the sample three-dimensional model under the second view angle j. Wherein j is a positive integer.
The process of converting the distances between each vertex of the sample three-dimensional model and the plane where the viewpoint corresponding to the second view angle j is located into the specified value range may include: determining the distance with the largest value from the distances between each vertex of the sample three-dimensional model and the plane where the viewpoint corresponding to the second view angle j is located; dividing the distance between each vertex and that plane by the distance with the largest value to obtain an intermediate distance for each vertex; and multiplying the intermediate distance for each vertex by 255, so as to convert the distances between each vertex of the sample three-dimensional model and the plane where the viewpoint corresponding to the second view angle j is located into the specified value range.
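A minimal sketch of this conversion is shown below; the array name and the use of NumPy are assumptions made only for illustration.

```python
# Minimal sketch: convert per-vertex plane distances into depth values in [0, 255].
import numpy as np

def to_depth_values(distances: np.ndarray) -> np.ndarray:
    """distances: distances from each vertex to the plane of the viewpoint for second view angle j."""
    max_distance = distances.max()              # distance with the largest value
    intermediate = distances / max_distance     # intermediate distances, in [0, 1]
    return intermediate * 255.0                 # depth values in the specified range [0, 255]
```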
After the three-dimensional model rendering graphs and the third depth graphs corresponding to the second view angles are obtained, in step 23, a third text corresponding to each three-dimensional model rendering graph is determined by using an image description generating model according to each three-dimensional model rendering graph, where the image description generating model is used for generating a corresponding text description based on the input image.
In this step, the electronic device may acquire an image description generation model from a preset storage space, where the image description generation model is obtained in advance based on two-dimensional image-text pair training, and is used to generate a corresponding text based on the input two-dimensional image. The electronic equipment can respectively input the three-dimensional model rendering graphs into the image description generating model, and process the three-dimensional model rendering graphs through the image description generating model to obtain a third text corresponding to the three-dimensional model rendering graphs.
In one implementation, the image description generation model may be implemented by a BLIP (Bootstrapping Language-Image Pre-training) model.
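The snippet below sketches step 23 under the assumption that a pretrained BLIP captioning checkpoint from the Hugging Face transformers library is used; the checkpoint name and file path are illustrative and not specified in this embodiment.

```python
# Hedged sketch: generating a third text (caption) for one three-dimensional model rendering map.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

rendering = Image.open("render_view_0.png").convert("RGB")   # a three-dimensional model rendering map
inputs = processor(images=rendering, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
third_text = processor.decode(output_ids[0], skip_special_tokens=True)
print(third_text)
```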
Next, in step 24, a sample image-text pair corresponding to each second view angle is formed based on the third text corresponding to each three-dimensional model rendering image and the third depth image corresponding to the sample three-dimensional model at each second view angle, and is classified into the training data set.
In this step, for each second view angle, the electronic device may combine the third text corresponding to the three-dimensional model rendering map at that second view angle with the third depth map at the same second view angle, to form the sample image-text pair corresponding to that second view angle, and classify the sample image-text pair into the training data set.
By the above method, at least one group of sample image-text pairs can be obtained for each sample three-dimensional model. Such a sample image-text pair comprises a depth map with relatively high accuracy and a text obtained by the image description generation model from the corresponding three-dimensional model rendering map. In one case, this kind of sample image-text pair may be referred to as a second type sample image-text pair.
According to the implementation mode, under the condition that the second visual angles are multiple, a plurality of groups of sample image-text pairs can be obtained by using one sample three-dimensional model, and the number of the second type of sample image-text pairs can be increased to a certain extent.
By the above methods, a training data set containing first type sample image-text pairs and second type sample image-text pairs can be obtained, and sample image-text pairs can then be obtained from the training data set, where a sample image-text pair may belong to the first type or the second type. In one implementation, in order to better improve the accuracy of the feature mapping result of the multi-modal feature mapping model, sample image-text pairs can be drawn evenly from the first type and the second type sample image-text pairs when acquiring sample image-text pairs from the training data set, so as to balance the accuracy of the mapping result of the multi-modal feature mapping model and the training efficiency.
After the sample image-text pair is obtained, in step S220, according to the sample depth map and the sample text, using the multi-mode feature mapping model to be trained, determining the map feature corresponding to the sample depth map and the text feature corresponding to the sample text.
In one implementation, the multi-modal feature mapping model to be trained includes an image encoder and a text encoder. The electronic device may input the sample depth map into the image encoder of the multi-modal feature mapping model to be trained, and encode the sample depth map by the image encoder to obtain the map feature corresponding to the sample depth map; and input the sample text into the text encoder of the multi-modal feature mapping model to be trained, and encode the sample text by the text encoder to obtain the text feature corresponding to the sample text. The map feature and the text feature thus obtained are mapped into the same feature space.
Then, in step S230, parameters of the multimodal feature mapping model to be trained are adjusted with the objective of maximizing the similarity between the graph features and the text features.
In this step, after obtaining the graph feature corresponding to the sample depth map and the text feature corresponding to the sample text, the electronic device calculates the similarity between the graph feature and the text feature, for example, a cosine similarity value between the graph feature and the text feature may be calculated, where the greater the cosine similarity value, the more similar the two features are. For another example, a distance between the graph feature and the text feature, such as a Euclidean distance, may also be calculated, where a smaller distance between the graph feature and the text feature characterizes a greater similarity between the graph feature and the text feature.
Parameters of the multimodal feature mapping model to be trained are then adjusted with the goal of maximizing similarity between the graph features and the text features. In one implementation, an electronic device may determine a model penalty based on a similarity between the graph feature and the text feature, wherein the model penalty is inversely related to the similarity between the graph feature and the text feature, and adjust parameters of a multimodal feature mapping model to be trained with the objective of minimizing the model penalty, i.e., maximizing the similarity between the graph feature and the text feature.
Specifically, the electronic device can determine the model parameter gradients of the multi-modal feature mapping model to be trained by using a back propagation algorithm based on the model loss; determine updated model parameters of the multi-modal feature mapping model to be trained (i.e., updated values of its model parameters) by using the determined model parameter gradients and the current model parameters of the multi-modal feature mapping model to be trained (i.e., the current values of its model parameters); and further adjust the model parameters of the multi-modal feature mapping model to be trained based on the updated model parameters. The model parameter gradients of the multi-modal feature mapping model to be trained are determined with the goal of minimizing the model loss. By minimizing the model loss, supervised training of the mapping results of the multi-modal feature mapping model to be trained can be realized, thereby maintaining and improving the accuracy of its mapping results.
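An illustrative training step for steps S220 to S230 is sketched below with a generic dual-encoder model, cosine similarity as the alignment objective, and back propagation; the encoder modules, optimizer, and loss form are assumptions and not the specific choices of the embodiments.

```python
# Hedged sketch: one iteration of training the multi-modal feature mapping model.
import torch
import torch.nn.functional as F

def train_step(image_encoder, text_encoder, optimizer, sample_depth_map, sample_text_tokens):
    # Map both modalities into the same feature space.
    map_feature = image_encoder(sample_depth_map)        # (batch, dim)
    text_feature = text_encoder(sample_text_tokens)      # (batch, dim)

    # Similarity between the map feature and the text feature (cosine similarity here).
    similarity = F.cosine_similarity(map_feature, text_feature, dim=-1)

    # Model loss is negatively related to the similarity, so minimizing the loss
    # maximizes the similarity between the two features.
    loss = -similarity.mean()

    optimizer.zero_grad()
    loss.backward()        # model parameter gradients via back propagation
    optimizer.step()       # updated model parameters
    return loss.item()
```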
The steps S210-S230 are a model iterative training process for the multi-modal feature mapping model to be trained. In order to train to obtain a better multi-modal feature mapping model to be trained, the above process can be performed in multiple iterations. That is, after step S230, the process returns to step S210 based on the model parameters updated by the multimodal feature mapping model to be trained.
The stopping conditions of the model iterative training process may include: the number of iterative training reaches a preset number of times threshold, or the iterative training time reaches a preset time, or the model loss is smaller than a set loss threshold, and the like.
The multi-modal feature mapping model to be trained is trained cyclically through the above model iterative training process, and the trained multi-modal feature mapping model is obtained when the iterative training process reaches a stop condition. For a depth map and a text, the trained multi-modal feature mapping model can obtain the map feature corresponding to the depth map and the text feature corresponding to the text mapped into the same feature space, so that the map feature corresponding to the depth map and the text feature corresponding to the text are aligned. The three-dimensional model matching process can then be executed by utilizing the trained multi-modal feature mapping model.
Next, fig. 3 shows a schematic flow chart of a three-dimensional model matching method in an embodiment of the present disclosure. The method is executed by the electronic equipment, the electronic equipment can be realized by any device, equipment, platform, equipment cluster and the like with calculation and processing capabilities, and the electronic equipment for executing the training method of the multi-mode feature mapping model and the electronic equipment for executing the three-dimensional model matching method can be the same physical equipment or different physical equipment. In the three-dimensional model matching process, as shown in fig. 3, the method includes the following steps S310 to S330:
In step S310, at least one first feature corresponding to the first object is determined by using a target multi-modal feature mapping model according to the first object, wherein the target multi-modal feature mapping model is obtained based on sample image-text pair training, the sample image-text pair includes a corresponding text and a depth map, and the first object is a first text or a first three-dimensional model.
In some scenarios, the user may have a need to find a desired three-dimensional model. For example, the user may wish to find the desired three-dimensional model through text, i.e., by entering information through which the three-dimensional model can be found: if the user needs to find a three-dimensional model of a puppy and enters the text "xx puppy", the electronic device may recommend a three-dimensional model of an "xx puppy", where "xx" may represent a class of puppy. Alternatively, the user may wish to find other similar three-dimensional models through a known three-dimensional model, i.e., by inputting a three-dimensional model through which other similar three-dimensional models can be found: if the user needs to find a three-dimensional model of a puppy and inputs one three-dimensional model of an "xx puppy", the electronic device may recommend other three-dimensional models of "xx puppy" that are similar to the input three-dimensional model.
Accordingly, the electronic device may acquire the content input by the user as the first object, where the first object may be a text, referred to as a first text, or may be a three-dimensional model, referred to as a first three-dimensional model. Then, the electronic device uses the target multi-modal feature mapping model according to the first object, namely, the first object is input into the target multi-modal feature mapping model, the first object is processed through the target multi-modal feature mapping model, and at least one first feature corresponding to the first object is determined.
In an embodiment, the target multi-modal feature mapping model may be a multi-modal feature mapping model obtained by training using the training method of the multi-modal feature mapping model provided in the foregoing embodiment, that is, the multi-modal feature mapping model after the foregoing training. The target multi-modal feature mapping model is used for respectively performing feature mapping on the text and the depth map to obtain text features of the text and map features of the depth map in the same feature space.
In step S320, at least one second feature corresponding to the second three-dimensional model is acquired, and each second feature is generated by using the target multi-modal feature mapping model according to the depth map of the second three-dimensional model.
In one implementation, a preset storage space corresponding to the electronic device pre-stores a three-dimensional model set, where the three-dimensional model set may include a plurality of three-dimensional models and the features of at least one depth map corresponding to each three-dimensional model. The features of the at least one depth map corresponding to each three-dimensional model are generated by using the target multi-modal feature mapping model according to the depth maps of the corresponding three-dimensional model. For example, if a three-dimensional model X corresponds to M depth maps, where each depth map corresponds to a feature of Y dimensions, for example 512 dimensions, the features of the M depth maps corresponding to the three-dimensional model X may form an M×Y (e.g., M×512) matrix.
Each three-dimensional model in the three-dimensional model set may be any type of three-dimensional model, for example, an animal model, a plant model, a character model, a building model, various prop models in a game, and the like.
In this step, the electronic device may acquire at least one feature corresponding to the second three-dimensional model from the foregoing three-dimensional model set, and for clarity of description, each feature corresponding to the second three-dimensional model is referred to as a second feature. In one case, the set of three-dimensional models may be referred to as a three-dimensional model library. After determining at least one first feature corresponding to the first object, the electronic device may obtain at least one second feature corresponding to the second three-dimensional model from the three-dimensional model library.
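The sketch below illustrates one way such a three-dimensional model library could be precomputed and stored, with one M×512 feature matrix per model; the helper functions render_depth_maps and image_encoder are hypothetical placeholders, not interfaces defined in the embodiments.

```python
# Hedged sketch: precomputing second features for every model in the library.
import numpy as np

def build_model_library(models, render_depth_maps, image_encoder):
    """models: iterable of (model_id, three_dimensional_model) pairs.
    render_depth_maps(model) is assumed to return the model's M depth maps at the preset first view angles;
    image_encoder(depth_map) is assumed to return a 512-dimensional feature vector."""
    library = {}
    for model_id, model in models:
        depth_maps = render_depth_maps(model)                          # M depth maps
        features = np.stack([image_encoder(d) for d in depth_maps])    # (M, 512) second features
        library[model_id] = features
    return library
```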
After obtaining at least one first feature corresponding to the first object and at least one second feature corresponding to the second three-dimensional model, in step S330, a matching result of the first object and the second three-dimensional model is determined based on the first feature corresponding to the first object and the plurality of second features corresponding to the second three-dimensional model.
It may be appreciated that the at least one first feature corresponding to the first object and the at least one second feature corresponding to the second three-dimensional model are both generated by using the target multi-modal feature mapping model, where the first feature and the second feature are in the same feature space.
It may be appreciated that the first object may be a first text or a first three-dimensional model, and in one embodiment, the first object is a first three-dimensional model; as shown in fig. 4, the method may include the following steps S410 to S460:
in step S410, a first depth map of the first three-dimensional model at least one preset first viewing angle is determined. The first three-dimensional model may be any type of three-dimensional model, such as an animal model, a plant model, a character model, a building model, various prop models in a game, and the like. In one implementation, the first three-dimensional model may exist in the form of an obj file that may include, for example, geometry information for the first three-dimensional model including: spatial position information of each vertex constituting the first three-dimensional model, triangular patch information and texture coordinate information of each vertex, and the like.
In one case, after the electronic device obtains the first three-dimensional model, the electronic device may convert the spatial position information of each vertex of the first three-dimensional model into the specified three-dimensional space coordinate system based on a conversion relationship between the coordinate system where the first three-dimensional model is located and the specified three-dimensional space coordinate system; then, the size of the first three-dimensional model is adjusted to the specified size, for example, the coordinates of each axis in the spatial position information of each vertex of the first three-dimensional model are normalized to [ -1,1], so as to adjust the first three-dimensional model to the specified size; and then adjusting the central point of the first three-dimensional model to coincide with the origin O of the specified three-dimensional space coordinate system.
And then, on the basis of the adjusted first three-dimensional model, determining a first depth map of the first three-dimensional model under at least one preset first visual angle. In one implementation, a first depth map of the first three-dimensional model at a preset first view angle may be determined, where a distance between a viewpoint corresponding to the first three-dimensional model when determining the first depth map at the preset first view angle and an origin of a specified three-dimensional space coordinate system may be the specified distance or other preset distances.
In still another implementation manner, first depth maps of the first three-dimensional model under at least two preset first view angles may be determined, where the distance between the viewpoint corresponding to each first view angle and the origin of the specified three-dimensional space coordinate system may be the specified distance or another preset distance.
The distance between the viewpoint corresponding to each first view angle and the origin of the specified three-dimensional space coordinate system can be determined according to the distance between the viewpoint corresponding to the depth map of the second three-dimensional model and the origin of the specified three-dimensional space coordinate system. The center point of the second three-dimensional model coincides with the origin of the specified three-dimensional space coordinate system. In one case, in order to ensure accuracy of the matching result, a distance between the viewpoint corresponding to each first view angle and the origin of the specified three-dimensional space coordinate system is equal to a distance between the viewpoint corresponding to the depth map of the second three-dimensional model and the origin of the specified three-dimensional space coordinate system, for example, the specified distance.
In one case, each first view angle may be represented by a set of a horizontal angle and a vertical angle, where the horizontal angle may refer to the angle between the X axis and the line connecting the origin of the specified three-dimensional space coordinate system, namely the center point of the first three-dimensional model, with the projection point of the viewpoint on the XOZ plane (considered as the horizontal plane) of the specified three-dimensional space coordinate system; the vertical angle may refer to the angle between the line connecting the viewpoint and the origin of the specified three-dimensional space coordinate system and the line connecting the projection point of the viewpoint on the XOZ plane and the origin. The X axis is the horizontal axis of the specified three-dimensional space coordinate system, the Z axis is the longitudinal axis of the specified three-dimensional space coordinate system, and the Y axis is the vertical axis of the specified three-dimensional space coordinate system.
In one case, the preset horizontal angle corresponding to the at least one first viewing angle may include, but is not limited to: 45 degrees, 135 degrees, 225 degrees, and 315 degrees; the preset vertical angle corresponding to the at least one first viewing angle may include, but is not limited to: 15 degrees and 30 degrees. The horizontal angle and the vertical angle may be arbitrarily combined to obtain at least one first viewing angle.
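For illustration, the preset first view angles can be enumerated as the combinations of the horizontal and vertical angles listed above (eight combinations in this example); the code below is a trivial sketch of that enumeration.

```python
# Example: enumerating the preset first view angles from the listed angle values.
from itertools import product

horizontal_angles = [45, 135, 225, 315]   # degrees
vertical_angles = [15, 30]                # degrees

first_view_angles = list(product(horizontal_angles, vertical_angles))
# -> [(45, 15), (45, 30), (135, 15), (135, 30), (225, 15), (225, 30), (315, 15), (315, 30)]
```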
In one implementation manner, each first depth map includes a depth value corresponding to the first three-dimensional model under the corresponding first view angle, where the depth value corresponding to the first three-dimensional model under the corresponding first view angle may represent a relative distance between each vertex of the first three-dimensional model and a plane where a viewpoint corresponding to the corresponding first view angle is located, where the plane where the viewpoint corresponding to the first view angle is located is perpendicular to the XOZ plane. The range of depth values corresponding to each pixel in each first depth map is [0, 255].
Taking at least one first view angle i as an example, the determining process of the first depth map is described, specifically, the electronic device may determine the distance between each vertex of the adjusted first three-dimensional model and the plane in which the viewpoint corresponding to the first view angle i is located based on the spatial position information of each vertex of the adjusted first three-dimensional model in the specified three-dimensional space coordinate system and the position information of the plane in which the viewpoint corresponding to the first view angle i is located in the specified three-dimensional space coordinate system, and then convert the distance between each vertex of the adjusted first three-dimensional model and the plane in which the viewpoint corresponding to the first view angle i is located into the specified value range, that is, the foregoing [0, 255], to obtain the first depth map of the first three-dimensional model in the first view angle i. Wherein i is a positive integer.
The step of converting the distances between each vertex of the adjusted first three-dimensional model and the plane where the viewpoint corresponding to the first view angle i is located into the specified value range may include: determining the distance with the largest value from the distances between each vertex of the adjusted first three-dimensional model and the plane where the viewpoint corresponding to the first view angle i is located; dividing the distance between each vertex and that plane by the distance with the largest value to obtain an intermediate distance for each vertex; and multiplying the intermediate distance for each vertex by 255, so as to convert the distances between each vertex of the adjusted first three-dimensional model and the plane where the viewpoint corresponding to the first view angle i is located into the specified value range.
Thereafter, in step S420, according to each first depth map, a first feature corresponding to each first depth map is determined using the image encoder of the target multi-modal feature mapping model. In this step, the electronic device may input each first depth map into the target multi-mode feature mapping model, and encode the input first depth maps by using an image encoder of the target multi-mode feature mapping model to obtain first features corresponding to each first depth map, so as to obtain at least one first feature corresponding to the first object.
Next, in step S430, at least one second feature corresponding to the second three-dimensional model is acquired, each second feature being generated by using the target multi-modal feature mapping model according to the depth map of the second three-dimensional model. The depth map of the second three-dimensional model may include a depth map of each second three-dimensional model at the at least one first view angle, where each depth map corresponds to one second feature. The second three-dimensional model may be a three-dimensional model obtained from the three-dimensional model library.
Then, in step S440, an average value of the first features corresponding to all the first depth maps is calculated, so as to obtain an average feature of the first three-dimensional model. And in step S450, for each second three-dimensional model, calculating an average value of the second features corresponding to the second three-dimensional model, to obtain average features of each second three-dimensional model.
In order to improve accuracy of the matching result to a certain extent, the electronic device may calculate an average value of the first features corresponding to all the first depth maps to obtain average features of the first three-dimensional model, and calculate, for each second three-dimensional model, an average value of the second features corresponding to the second three-dimensional model to obtain average features of each second three-dimensional model.
In one case, where there is only one first depth map, that is, there is only one first view angle, the first feature corresponding to that first depth map may be taken as the average feature of the first three-dimensional model. Correspondingly, each second three-dimensional model has one second feature, namely the feature corresponding to its depth map at that first view angle, and that second feature may be determined as the average feature of the second three-dimensional model.
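The averaging in steps S440 and S450 reduces the per-view features of a model to a single model-level feature; a minimal sketch, assuming the per-view features are stacked into a tensor of shape (V, feature_dim), is:

import torch

def average_feature(view_features: torch.Tensor) -> torch.Tensor:
    """Average the per-view features (V, feature_dim) into one model-level feature.

    When only one view is available (V == 1), the mean is simply that view's feature,
    matching the single-view case described above."""
    return view_features.mean(dim=0)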
Next, in step S460, a matching result of the first three-dimensional model and each of the second three-dimensional models is determined based on the average feature of the first three-dimensional model and the average feature of each of the second three-dimensional models.
In one implementation, in the case where there is one second three-dimensional model, a similarity value between the first three-dimensional model and the second three-dimensional model may be determined based on the average feature of the first three-dimensional model and the average feature of the second three-dimensional model; if the similarity value exceeds a preset threshold, it may be determined that the first three-dimensional model matches the second three-dimensional model, and otherwise that they do not match.
In the case where there are a plurality of second three-dimensional models, in one embodiment, step S460 may include the following steps 31 to 32:
In step 31, a first similarity value between the average feature of the first three-dimensional model and the average feature of each of the second three-dimensional models is calculated. In this step, the electronic device may calculate, based on a preset similarity algorithm, a first similarity value between the average feature of the first three-dimensional model and the average feature of each of the second three-dimensional models, as a first similarity value between the first three-dimensional model and each of the second three-dimensional models.
In one case, the preset similarity algorithm is, for example, a cosine similarity algorithm, and the first similarity value is positively correlated with the cosine similarity value; for example, the first similarity value between the first three-dimensional model and a second three-dimensional model equals the cosine similarity value between their average features, so the larger the first similarity value, the more similar the two models are. In another case, the preset similarity algorithm is, for example, a Euclidean distance algorithm, and the first similarity value is negatively correlated with the Euclidean distance; the smaller the Euclidean distance between the first three-dimensional model and a second three-dimensional model, the larger the first similarity value, which characterizes a higher similarity between the two models.
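Both variants of the first similarity value can be sketched as follows; the inverse mapping 1 / (1 + distance) used for the Euclidean case is one illustrative choice of a negatively correlated function, not a choice fixed by this disclosure:

import torch
import torch.nn.functional as F

def cosine_first_similarity(first_avg: torch.Tensor, second_avgs: torch.Tensor) -> torch.Tensor:
    """First similarity values as cosine similarity between the first model's average
    feature (D,) and each second model's average feature (M, D); returns (M,)."""
    return F.cosine_similarity(first_avg.unsqueeze(0), second_avgs, dim=-1)

def euclidean_first_similarity(first_avg: torch.Tensor, second_avgs: torch.Tensor) -> torch.Tensor:
    """Euclidean variant: smaller distance maps to larger similarity."""
    distances = torch.norm(second_avgs - first_avg.unsqueeze(0), dim=-1)
    return 1.0 / (1.0 + distances)

The matching model in steps 31-32 is then the second three-dimensional model at the index of the largest first similarity value, optionally subject to the specified similarity threshold mentioned below.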
Accordingly, in step 32, the second three-dimensional model with the largest corresponding first similarity value is determined as the three-dimensional model matching the first three-dimensional model.
In still another embodiment, in order to better guarantee the user experience and recommend to the user a three-dimensional model that better matches the user's search requirement, after the first similarity values between the first three-dimensional model and the respective second three-dimensional models are determined, the second three-dimensional model whose corresponding first similarity value is both the largest and greater than a specified similarity threshold may be determined as the three-dimensional model matching the first three-dimensional model. The specified similarity threshold ensures that only a second three-dimensional model with high similarity to the first three-dimensional model is recommended to the user.
In still another embodiment, in the case where the first object is the first three-dimensional model, the first depth map of the first three-dimensional model at the at least one first view angle is plural, and the second depth map of the second three-dimensional model at the at least one first view angle is plural, the first view angles corresponding to the plural first depth maps may be the same or different from the first view angles corresponding to the plural depth maps of the second three-dimensional model, for example: the first view angles corresponding to the plurality of first depth maps comprise a first view angle 1, a first view angle 2 and a first view angle 3, and the first view angles corresponding to the plurality of depth maps of the second three-dimensional model comprise a first view angle 1, a first view angle 2 and a first view angle 4. In one case, in order to ensure that the accuracy of the matching result is higher, the first view angles corresponding to the plurality of first depth maps are the same as the first view angles corresponding to the plurality of depth maps of the second three-dimensional model, that is, the first view angles corresponding to the plurality of first depth maps include a first view angle 1, a first view angle 2 and a first view angle 3, and correspondingly, the first view angles corresponding to the plurality of depth maps of the second three-dimensional model include a first view angle 1, a first view angle 2 and a first view angle 3.
In yet another embodiment, in the case where the first object is the first text, as shown in fig. 5, the method may include the following steps S510-S540:
in step S510, according to the first text, a text encoder of the target multimodal feature mapping model is used to determine a first feature corresponding to the first text. The first text may be content input by a user according to a three-dimensional model that the user needs to find. In this step, after the electronic device obtains the first text, the first text may be input into the target multi-modal feature mapping model, and the first text is encoded by a text encoder of the target multi-modal feature mapping model, so as to determine a first feature corresponding to the first text, that is, a first feature corresponding to the first object.
Next, in step S520, at least one second feature corresponding to the second three-dimensional model is acquired, and each second feature is generated by using the target multi-modal feature mapping model according to the depth map of the second three-dimensional model. The second three-dimensional model may be a three-dimensional model obtained from the three-dimensional model library.
Next, in step S530, second similarity values between the first feature corresponding to the first text and the respective second features are calculated. In this step, the electronic device may calculate, based on a preset similarity algorithm, the similarity value between the first feature corresponding to the first text and each second feature, referred to here as a second similarity value.
In one case, the preset similarity algorithm is, for example, a cosine similarity algorithm, and the second similarity value is positively correlated with the cosine similarity value; for example, the second similarity value between the first feature and a second feature equals the cosine similarity value between the two, so the larger the second similarity value, the more similar the first feature and that second feature are. In another case, the preset similarity algorithm is, for example, a Euclidean distance algorithm, and the second similarity value is negatively correlated with the Euclidean distance; the smaller the Euclidean distance between the first feature and a second feature, the larger the second similarity value, which correspondingly characterizes a higher similarity between the two features.
Thereafter, in step S540, a matching result of the first text and the second three-dimensional model is determined based on the respective second similarity values.
In one embodiment, in the case where there is one second three-dimensional model, the second similarity values include the similarity values between the first feature of the first text and the at least one second feature corresponding to that second three-dimensional model. Each second similarity value may be compared with a specified threshold; when more than a specified number of the second similarity values are greater than the specified threshold, it may be determined that the first text matches the second three-dimensional model; conversely, when no more than the specified number of the second similarity values are greater than the specified threshold, it may be determined that the first text does not match the second three-dimensional model.
In yet another embodiment, the aforementioned second three-dimensional model is a plurality of; accordingly, in step S540, the following steps 41-43 may be included:
in step 41, based on each second similarity value, N third features with the largest corresponding second similarity value are determined from the plurality of second features, where N is a positive integer. The specific value of N can be set according to the requirement. In this step, the electronic device may compare the magnitudes of the second similarity values, and determine, from the plurality of second features, N features with the largest corresponding second similarity values, as the third feature, based on the magnitudes of the second similarity values.
Thereafter, at step 42, a second three-dimensional model corresponding to each third feature is determined. And determining the second three-dimensional model corresponding to each third feature based on the corresponding relation between the N second features with the largest corresponding second similarity values and the second three-dimensional model.
Next, in step 43, the second three-dimensional model with the largest number of corresponding third features is determined as the three-dimensional model matching the first text. For example, with N set to 3, the third features include a third feature 1, a third feature 2, and a third feature 3, where the third feature 1 and the third feature 2 correspond to a second three-dimensional model A and the third feature 3 corresponds to a second three-dimensional model B; the second three-dimensional model A is then the second three-dimensional model with the largest number of corresponding third features and is determined as the three-dimensional model matching the first text. For another example, with N set to 3, the third features include a third feature 1, a third feature 2, and a third feature 3, where the third feature 1 corresponds to a second three-dimensional model A, the third feature 2 corresponds to a second three-dimensional model C, and the third feature 3 corresponds to a second three-dimensional model B, and the second similarity value 1 corresponding to the third feature 1 > the second similarity value 2 corresponding to the third feature 2 > the second similarity value 3 corresponding to the third feature 3; in this case, each second three-dimensional model corresponds to the same number of third features, and the second three-dimensional model A, whose third feature 1 has the largest second similarity value, may be determined as the three-dimensional model matching the first text.
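The text-matching branch of steps 41-43, including the tie-breaking rule illustrated above, can be sketched as follows; the mapping from second features to their second three-dimensional models via a list of model identifiers is an illustrative assumption about how the correspondence is stored:

from collections import Counter
import torch
import torch.nn.functional as F

def match_text_to_model(text_feature, second_features, model_ids, n=3):
    """text_feature: (D,) first feature of the first text; second_features: (K, D) second
    features of all depth maps of the second three-dimensional models; model_ids: length-K
    list mapping each second feature to its second three-dimensional model."""
    sims = F.cosine_similarity(text_feature.unsqueeze(0), second_features, dim=-1)  # second similarity values (K,)
    top_idx = torch.topk(sims, k=min(n, len(model_ids))).indices.tolist()           # indices of the N third features
    votes = Counter(model_ids[i] for i in top_idx)                                   # third-feature count per model
    best_count = max(votes.values())
    candidates = [m for m, c in votes.items() if c == best_count]
    if len(candidates) == 1:
        return candidates[0]
    # Tie-break: among tied models, pick the one whose third feature has the largest second similarity value.
    for i in top_idx:                                   # top_idx is ordered by descending similarity
        if model_ids[i] in candidates:
            return model_ids[i]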
In yet another embodiment, after determining a second three-dimensional model that matches the first object, such as the first text or the first three-dimensional model, the second three-dimensional model that matches the first object may be presented to a user in need of finding the three-dimensional model for viewing by the user.
In this embodiment, at least one first feature corresponding to the first object may be determined using the target multi-modal feature mapping model, which is trained based on corresponding texts and depth maps. At least one second feature corresponding to the second three-dimensional model is then acquired, each second feature being generated by using the target multi-modal feature mapping model according to the depth map of the second three-dimensional model, which ensures that the at least one first feature of the first object and the at least one second feature of the second three-dimensional model are aligned in the same feature space. The first object and the second three-dimensional model can then be matched based on the aligned first feature corresponding to the first object and the second features corresponding to the second three-dimensional model, and the matching result of the first object and the second three-dimensional model can be determined. In this way, matching between a text and a three-dimensional model, or between a three-dimensional model and a three-dimensional model, is achieved.
The foregoing describes certain embodiments of the present disclosure, other embodiments being within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. Furthermore, the processes depicted in the accompanying figures are not necessarily required to achieve the desired result in the particular order shown, or in a sequential order. In some embodiments, multitasking and parallel processing are also possible, or may be advantageous.
Corresponding to the above method embodiments, in the present embodiment, a three-dimensional model matching apparatus 600 is provided, whose schematic block diagram is shown in fig. 6, including:
a first determining module 610, configured to determine, according to a first object, at least one first feature corresponding to the first object using a target multi-modal feature mapping model, where the target multi-modal feature mapping model is obtained based on a sample image-text pair training, the sample image-text pair including a corresponding text and a depth map, and the first object is a first text or a first three-dimensional model;
a first obtaining module 620 configured to obtain at least one second feature corresponding to a second three-dimensional model, each of the second features being generated by using the target multi-modal feature mapping model according to a depth map of the second three-dimensional model;
A second determining module 630 is configured to determine a matching result of the first object and the second three-dimensional model based on the first feature corresponding to the first object and the plurality of second features corresponding to the second three-dimensional model.
In an alternative embodiment, the first object is a first three-dimensional model;
the first determining module 610 includes:
a first determining unit (not shown in the figure) configured to determine a first depth map of the first three-dimensional model at least one preset first viewing angle;
a second determining unit (not shown in the figure) configured to determine, according to each first depth map, a first feature corresponding to each first depth map using the image encoder of the target multi-modal feature mapping model.
In an alternative embodiment, the depth map of the second three-dimensional model includes: a depth map of each second three-dimensional model at least one of the first viewing angles;
the second determining module 630 includes:
a first computing unit (not shown in the figure) configured to compute an average value of first features corresponding to all the first depth maps, so as to obtain an average feature of the first three-dimensional model;
a second calculation unit (not shown in the figure) configured to calculate, for each second three-dimensional model, an average value of second features corresponding to the second three-dimensional model, to obtain an average feature of each second three-dimensional model;
A third determining unit (not shown in the figure) configured to determine a matching result of the first three-dimensional model and each second three-dimensional model based on the average feature of the first three-dimensional model and the average feature of each second three-dimensional model.
In an alternative embodiment, the third determining unit is specifically configured to calculate a first similarity value between the average feature of the first three-dimensional model and the average feature of each second three-dimensional model;
and determining the second three-dimensional model with the largest corresponding first similarity value as a three-dimensional model matched with the first three-dimensional model.
In an alternative embodiment, the first object is a first text;
the first determining module 610 is specifically configured to determine, according to the first text, a first feature corresponding to the first text using a text encoder of the target multi-modal feature mapping model.
In an alternative embodiment, the second determining module 630 includes:
a third calculation unit (not shown in the figure) configured to calculate second similarity values between the first features corresponding to the first text and the respective second features;
a fourth determining unit (not shown in the figure) configured to determine a matching result of the first text and the second three-dimensional model based on each second similarity value.
In an alternative embodiment, the second three-dimensional model is a plurality of;
the fourth determining unit is specifically configured to determine, based on each second similarity value, N third features with the largest corresponding second similarity values from the plurality of second features, where N is a positive integer;
determining a second three-dimensional model corresponding to each third feature;
and determining the second three-dimensional model with the largest corresponding third characteristic number as the three-dimensional model matched with the first text.
Corresponding to the above method embodiments, in the present embodiment, a training device 700 for a multimodal feature mapping model is provided, and a schematic block diagram of the training device is shown in fig. 7, including:
a second obtaining module 710 configured to obtain a sample graph-text pair in the training data set, the sample graph-text pair including a sample depth map and a corresponding sample text thereof;
a third determining module 720, configured to determine, according to the sample depth map and the sample text, a map feature corresponding to the sample depth map and a text feature corresponding to the sample text using a multi-modal feature mapping model to be trained;
an adjustment module 730 configured to adjust parameters of the multimodal feature mapping model to be trained with a view to maximizing a similarity between the graph feature and the text feature.
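The training objective handled by the adjustment module 730 is only stated as maximizing the similarity between the graph feature and the text feature of each sample image-text pair; the concrete loss is not fixed by this disclosure. The following sketch uses a symmetric contrastive (CLIP-style) loss over a batch as one possible instantiation of that objective, with the encoder interfaces, optimizer, and temperature being illustrative assumptions:

import torch
import torch.nn.functional as F

def similarity_maximization_step(depth_maps, texts, image_encoder, text_encoder, optimizer, temperature=0.07):
    """One training step on a batch of sample image-text pairs: increase the similarity between
    each sample depth map's graph feature and its corresponding sample text's text feature."""
    img_feat = F.normalize(image_encoder(depth_maps), dim=-1)    # (B, D) graph features
    txt_feat = F.normalize(text_encoder(texts), dim=-1)          # (B, D) text features
    logits = img_feat @ txt_feat.t() / temperature                # pairwise similarities
    labels = torch.arange(len(logits), device=logits.device)      # matching pairs lie on the diagonal
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()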
In an alternative embodiment, the apparatus further comprises:
a third obtaining module (not shown in the figure) configured to obtain a color image and a corresponding second text thereof before the obtaining of the sample image-text pairs in the training dataset;
an obtaining module (not shown in the figure) configured to obtain a second depth map corresponding to the color image according to the color image by using a depth estimation model, wherein the second depth map comprises depth values of pixels, and the depth estimation model is trained based on sample color images and their corresponding depth maps;
a first composition module (not shown) configured to compose a set of sample pairs of graphics based on the second text and the second depth map, and to attribute the pairs to the training dataset.
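The construction of a sample image-text pair from a color image and its second text can be sketched as follows; the depth estimation model is treated as an opaque module whose interface is assumed only for illustration:

import torch

def build_pair_from_color_image(color_image: torch.Tensor, second_text: str,
                                depth_estimation_model: torch.nn.Module) -> dict:
    """Predict a second depth map from a color image (1, 3, H, W) and pair it with the image's text."""
    with torch.no_grad():
        second_depth_map = depth_estimation_model(color_image)   # per-pixel depth values
    return {"depth_map": second_depth_map, "text": second_text}  # one sample image-text pair for the training dataset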
In an alternative embodiment, the apparatus further comprises:
a fourth acquisition module (not shown) configured to acquire a sample three-dimensional model prior to the acquisition of the sample image-text pairs in the training dataset;
a rendering module (not shown in the figure) configured to render the sample three-dimensional model from at least one second view angle, to obtain a three-dimensional model rendering map and a third depth map corresponding to each second view angle, where the third depth map includes depth values corresponding to the sample three-dimensional model under the corresponding second view angle;
A fourth determining module (not shown in the figure) configured to determine, from each three-dimensional model rendering map, a third text corresponding to each three-dimensional model rendering map using an image description generation model for generating a corresponding text based on the input image;
and a second composing module (not shown in the figure) configured to compose a sample image-text pair corresponding to each second view angle based on a third text corresponding to each three-dimensional model rendering image and a third depth image corresponding to each sample three-dimensional model at each second view angle, and to classify the sample image-text pair into the training data set.
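The data-generation path through the rendering module, the fourth determining module, and the second composing module can be sketched as follows; the rendering function and the image description generation model are placeholders whose interfaces are assumed for illustration:

def build_pairs_from_sample_model(sample_model, second_view_angles, render_fn, caption_model):
    """render_fn(model, view) is assumed to return (rgb_rendering, third_depth_map) for one second view angle;
    caption_model(rgb_rendering) is assumed to return the third text describing that rendering."""
    pairs = []
    for view in second_view_angles:
        rendering, third_depth_map = render_fn(sample_model, view)
        third_text = caption_model(rendering)                      # image description generation model
        pairs.append({"depth_map": third_depth_map, "text": third_text})
    return pairs                                                    # sample image-text pairs for the training dataset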
The foregoing apparatus embodiments correspond to the method embodiments, and specific descriptions may be referred to descriptions of method embodiment portions, which are not repeated herein. The device embodiments are obtained based on corresponding method embodiments, and have the same technical effects as the corresponding method embodiments, and specific description can be found in the corresponding method embodiments.
The embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon, which when executed in a computer, causes the computer to perform the three-dimensional model matching method or the multi-modal feature mapping model training method provided in the present specification.
The embodiment of the specification also provides a computing device, which comprises a memory and a processor, wherein executable codes are stored in the memory, and when the processor executes the executable codes, the three-dimensional model matching method or the multi-mode feature mapping model training method provided by the specification is realized.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for storage media and computing device embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing detailed description of the embodiments of the present invention further details the objects, technical solutions and advantageous effects of the embodiments of the present invention. It should be understood that the foregoing description is only specific to the embodiments of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements, etc. made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (13)

1. A three-dimensional model matching method, comprising:
determining at least one first feature corresponding to a first object by using a target multi-modal feature mapping model according to the first object, wherein the target multi-modal feature mapping model is obtained based on sample image-text pair training, the sample image-text pair comprises a corresponding text and a depth map, and the first object is a first text or a first three-dimensional model;
acquiring at least one second feature corresponding to a second three-dimensional model, wherein each second feature is generated by using the target multi-modal feature mapping model according to a depth map of the second three-dimensional model;
and determining a matching result of the first object and the second three-dimensional model based on the first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model.
2. The method of claim 1, the first object being a first three-dimensional model;
the determining at least one first feature corresponding to the first object includes:
determining a first depth map of the first three-dimensional model under at least one preset first visual angle;
and determining the first features corresponding to the first depth maps by using the image encoder of the target multi-modal feature mapping model according to the first depth maps.
3. The method of claim 2, wherein the depth map of the second three-dimensional model comprises: a depth map of each second three-dimensional model at least one of the first viewing angles;
the determining a matching result of the first object and the second three-dimensional model includes:
calculating the average value of the first features corresponding to all the first depth maps to obtain the average feature of the first three-dimensional model;
calculating the average value of the second features corresponding to each second three-dimensional model aiming at each second three-dimensional model to obtain the average features of each second three-dimensional model;
and determining a matching result of the first three-dimensional model and each second three-dimensional model based on the average characteristic of the first three-dimensional model and the average characteristic of each second three-dimensional model.
4. The method of claim 3, wherein the determining a match of the first three-dimensional model to each of the second three-dimensional models comprises:
calculating first similarity values between the average feature of the first three-dimensional model and the average feature of each second three-dimensional model respectively;
and determining the second three-dimensional model with the largest corresponding first similarity value as a three-dimensional model matched with the first three-dimensional model.
5. The method of claim 1, the first object being a first text;
the determining at least one first feature corresponding to the first object includes:
and determining a first feature corresponding to the first text by using a text encoder of the target multi-modal feature mapping model according to the first text.
6. The method of claim 5, wherein the determining a match result of the first object and the second three-dimensional model comprises:
calculating second similarity values between the first feature corresponding to the first text and each second feature respectively;
and determining a matching result of the first text and the second three-dimensional model based on each second similarity value.
7. The method of claim 6, the second three-dimensional model being a plurality;
The determining, based on each second similarity value, a matching result of the first text and the second three-dimensional model includes:
based on each second similarity value, determining N third features with the maximum corresponding second similarity values from the plurality of second features, wherein N is a positive integer;
determining a second three-dimensional model corresponding to each third feature;
and determining the second three-dimensional model with the largest corresponding third characteristic number as the three-dimensional model matched with the first text.
8. A method of training a multimodal feature mapping model, comprising:
acquiring a sample graph-text pair in a training data set, wherein the sample graph-text pair comprises a sample depth map and a corresponding sample text;
determining graph features corresponding to the sample depth map and text features corresponding to the sample text by using a multi-mode feature mapping model to be trained according to the sample depth map and the sample text;
and adjusting parameters of the multi-modal feature mapping model to be trained with the aim of maximizing the similarity between the graph features and the text features.
9. The method of claim 8, further comprising, prior to said acquiring the sample pairs of images in the training dataset:
Acquiring a color image and a corresponding second text;
obtaining a second depth map corresponding to the color image by using a depth estimation model according to the color image, wherein the second depth map comprises depth values of pixels, and the depth estimation model is trained based on sample color images and their corresponding depth maps;
and forming a group of sample image-text pairs based on the second text and the second depth map, and classifying the sample image-text pairs into the training data set.
10. The method of claim 8, further comprising, prior to said acquiring the sample pairs of images in the training dataset:
obtaining a sample three-dimensional model;
rendering the sample three-dimensional model from at least one second view angle to obtain a three-dimensional model rendering map and a third depth map corresponding to each second view angle, wherein the third depth map comprises depth values corresponding to the sample three-dimensional model under the corresponding second view angle;
determining a third text corresponding to each three-dimensional model rendering graph by using an image description generating model according to each three-dimensional model rendering graph, wherein the image description generating model is used for generating a corresponding text based on an input image;
And forming sample image-text pairs corresponding to each second view angle based on a third text corresponding to each three-dimensional model rendering image and a third depth image corresponding to each sample three-dimensional model under each second view angle, and classifying the sample image-text pairs into the training data set.
11. A three-dimensional model matching apparatus comprising:
the first determining module is configured to determine at least one first feature corresponding to a first object by using a target multi-modal feature mapping model according to the first object, wherein the target multi-modal feature mapping model is obtained based on sample image-text pairs, the sample image-text pairs comprise corresponding texts and depth maps, and the first object is a first text or a first three-dimensional model;
the first acquisition module is configured to acquire at least one second feature corresponding to a second three-dimensional model, and each second feature is generated by using the target multi-modal feature mapping model according to a depth map of the second three-dimensional model;
and the second determining module is configured to determine a matching result of the first object and the second three-dimensional model based on the first feature corresponding to the first object and a plurality of second features corresponding to the second three-dimensional model.
12. A training apparatus for a multimodal feature mapping model, comprising:
the second acquisition module is configured to acquire a sample graph-text pair in the training data set, wherein the sample graph-text pair comprises a sample depth map and a corresponding sample text;
a third determining module, configured to determine, according to the sample depth map and the sample text, a map feature corresponding to the sample depth map and a text feature corresponding to the sample text using a multi-modal feature mapping model to be trained;
and the adjustment module is configured to adjust parameters of the multi-mode feature mapping model to be trained aiming at maximizing the similarity between the graph features and the text features.
13. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-10.
CN202311228748.2A 2023-09-21 2023-09-21 Training method and device for three-dimensional model matching and multi-modal feature mapping model Pending CN117216591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311228748.2A CN117216591A (en) 2023-09-21 2023-09-21 Training method and device for three-dimensional model matching and multi-modal feature mapping model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311228748.2A CN117216591A (en) 2023-09-21 2023-09-21 Training method and device for three-dimensional model matching and multi-modal feature mapping model

Publications (1)

Publication Number Publication Date
CN117216591A true CN117216591A (en) 2023-12-12

Family

ID=89047796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311228748.2A Pending CN117216591A (en) 2023-09-21 2023-09-21 Training method and device for three-dimensional model matching and multi-modal feature mapping model

Country Status (1)

Country Link
CN (1) CN117216591A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473105A (en) * 2023-12-28 2024-01-30 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components
CN117473105B (en) * 2023-12-28 2024-04-05 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components

Similar Documents

Publication Publication Date Title
US11380050B2 (en) Face image generation method and apparatus, device, and storage medium
AU2014315181B2 (en) Estimating depth from a single image
CN107209853B (en) Positioning and map construction method
Huang et al. A coarse-to-fine algorithm for matching and registration in 3D cross-source point clouds
CN110648397B (en) Scene map generation method and device, storage medium and electronic equipment
CN117216591A (en) Training method and device for three-dimensional model matching and multi-modal feature mapping model
CN110738730A (en) Point cloud matching method and device, computer equipment and storage medium
JP5500400B1 (en) Image processing apparatus, image processing method, and image processing program
CN113658035B (en) Face transformation method, device, equipment, storage medium and product
CN114463408A (en) Free viewpoint image generation method, device, equipment and storage medium
CN112766348A (en) Method and device for generating sample data based on antagonistic neural network
CN115861515A (en) Three-dimensional face reconstruction method, computer program product and electronic device
CN115409949A (en) Model training method, visual angle image generation method, device, equipment and medium
CN115760888A (en) Image processing method, image processing device, computer and readable storage medium
CN113762059A (en) Image processing method and device, electronic equipment and readable storage medium
CN113470113B (en) Component attitude estimation method integrating BRIEF feature matching and ICP point cloud registration
CN112101330B (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN115439331B (en) Corner correction method and generation method and device of three-dimensional model in meta universe
CN112163509B (en) Image processing method, device, network equipment and storage medium
JP7435907B2 (en) Image matching device, control method, and program
JP6674393B2 (en) Feature amount registration device, method and program
CN113391993A (en) Data checking method, computer device and readable storage medium
CN117372521A (en) Real-time monocular 6D pose estimation method and system suitable for symmetrical object
CN117409142A (en) Mapping generation method and device of three-dimensional model
CN118115663A (en) Face reconstruction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination