CN114782719B

CN114782719B - Training method of feature extraction model, object retrieval method and device

Info

Publication number: CN114782719B
Application number: CN202210450243.XA
Authority: CN
Inventors: 孙准; 冯原; 郑弘晖
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-04-26
Filing date: 2022-04-26
Publication date: 2023-02-03
Anticipated expiration: 2042-04-26
Also published as: CN114782719A

Abstract

The invention provides a training method of a feature extraction model, an object retrieval method and a device. It relates to the technical field of artificial intelligence, in particular to deep learning, image processing and computer vision, and can be applied to scenarios such as image retrieval. The specific implementation scheme is as follows: determining a feature extraction model to be trained; inputting sample pairs representing the same semantics into the feature extraction model to obtain first sample features output by each feature extraction layer of a first extraction network and second sample features output by each feature extraction layer of a second extraction network; constructing a positive sample feature pair for representing the same semantics and a negative sample feature pair for representing different semantics based on the first sample features and the second sample features; and, in response to determining that the feature extraction model is not converged based on the positive sample feature pair and the negative sample feature pair, adjusting the model parameters. With this scheme, a feature extraction model with higher precision can be trained from fewer sample-pair resources.

Description

Training method of feature extraction model, object retrieval method and device

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to deep learning, image processing and computer vision, and can be applied to scenarios such as image retrieval. More specifically, it relates to a training method of a feature extraction model, an object retrieval method and a device.

Background

Multimodal retrieval refers to retrieval across different modalities, for example searching for images with text or searching for text with images.

In the related art, a retrieval system based on contrastive learning is generally used. The retrieval system uses a feature extraction model to obtain features of objects of different modalities, and then compares the similarity of those features to determine the final retrieval result. When the feature extraction model is trained, multi-modal objects with the same semantics are used as positive sample pairs and multi-modal objects with different semantics are used as negative sample pairs, and the positive and negative sample pairs are contrasted so that a positive sample pair obtains a higher score than a negative sample pair.

Disclosure of Invention

The disclosure provides a training method of a feature extraction model, an object retrieval method and an object retrieval device.

According to an aspect of the present disclosure, there is provided a training method of a feature extraction model, including:

determining a feature extraction model to be trained; wherein the feature extraction model comprises a first extraction network for first modality objects and a second extraction network for second modality objects, the first extraction network and the second extraction network each comprising a plurality of feature extraction layers;

inputting sample pairs representing the same semantics into the feature extraction model to obtain first sample features output by each feature extraction layer of the first extraction network and second sample features output by each feature extraction layer of the second extraction network;

constructing a positive sample feature pair for characterizing the same semantics and a negative sample feature pair for characterizing different semantics based on the first sample feature and the second sample feature;

in response to determining that the feature extraction model is not converged based on the pair of positive sample features and the pair of negative sample features, adjusting model parameters of the feature extraction model.

According to another aspect of the present disclosure, there is provided an object retrieval method including:

acquiring a target object as a retrieval basis; wherein the target object belongs to a first modality;

inputting the target object into a first extraction network in a feature extraction model, so that a plurality of feature extraction layers of the first extraction network extract features of the target object to obtain first features of the target object; the feature extraction model is a model obtained by training according to a training method of the feature extraction model; in the plurality of feature extraction layers of the first extraction network, the output of a previous feature extraction layer is the input of a next feature extraction layer, and the input of the first feature extraction layer is the target object;

determining an object matched with the target object from a database containing objects belonging to a second modality based on a pre-constructed feature library and the first feature, wherein the object is used as a retrieval result corresponding to the target object;

the feature library comprises second features of the objects in the database, and the second features of the objects in the database are obtained by performing feature extraction on the objects in the database by utilizing a plurality of feature extraction layers of a second extraction network in the feature extraction model; in the plurality of feature extraction layers in the second extraction network, the output of the previous feature extraction layer is the input of the next feature extraction layer, and the input of the first feature extraction layer is the object in the database.
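The retrieval steps above can be illustrated with a minimal sketch. It assumes that matching is done by cosine similarity, which the text does not specify (it only says an object is determined based on the feature library and the first feature). The names `cosine`, `retrieve`, `feature_library` and the toy vectors are hypothetical and stand in for features produced by the trained extraction networks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(first_feature, feature_library, top_k=1):
    """Return the ids of the top_k library objects most similar to the query."""
    ranked = sorted(
        feature_library.items(),
        key=lambda item: cosine(first_feature, item[1]),
        reverse=True,
    )
    return [obj_id for obj_id, _ in ranked[:top_k]]

# Hypothetical pre-constructed feature library: object id -> second feature.
feature_library = {
    "img_cat": [0.9, 0.1, 0.0],
    "img_dog": [0.1, 0.9, 0.0],
    "img_car": [0.0, 0.1, 0.9],
}

# Stand-in for the first feature of a first-modality target object (e.g. a text query).
query_feature = [0.8, 0.2, 0.1]
result = retrieve(query_feature, feature_library, top_k=2)
```

Because the second features are computed once and stored in the feature library, only the target object has to pass through the first extraction network at query time.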

According to another aspect of the present disclosure, there is provided a training apparatus for a feature extraction model, including:

the determining module is used for determining a feature extraction model to be trained; wherein the feature extraction model comprises a first extraction network for first modality objects and a second extraction network for second modality objects, the first extraction network and the second extraction network each comprising a plurality of feature extraction layers;

the sample feature extraction module is used for inputting sample pairs representing the same semantics into the feature extraction model to obtain first sample features output by each feature extraction layer of the first extraction network and second sample features output by each feature extraction layer of the second extraction network;

a construction module, configured to construct, based on the first sample feature and the second sample feature, a positive sample feature pair for characterizing the same semantic and a negative sample feature pair for characterizing different semantics;

an adjustment module to adjust model parameters of the feature extraction model in response to determining that the feature extraction model is not converged based on the positive sample feature pair and the negative sample feature pair.

According to another aspect of the present disclosure, there is provided an object retrieval apparatus including:

the acquisition module is used for acquiring a target object serving as a retrieval basis; wherein the target object belongs to a first modality;

the feature extraction module is used for inputting the target object into a first extraction network in a feature extraction model, so that a plurality of feature extraction layers of the first extraction network extract features of the target object to obtain a first feature of the target object; the feature extraction model is a model obtained by training according to a training method of the feature extraction model; in the plurality of feature extraction layers of the first extraction network, the output of a previous feature extraction layer is the input of a next feature extraction layer, and the input of the first feature extraction layer is the target object;

the retrieval module is used for determining an object matched with the target object from a database containing objects belonging to a second modality based on a pre-constructed feature library and the first feature, and taking the object matched with the target object as a retrieval result corresponding to the target object;

the feature library comprises second features of the objects in the database, and the second features of the objects in the database are obtained by performing feature extraction on the objects in the database by using a plurality of feature extraction layers of a second extraction network in the feature extraction model; in the plurality of feature extraction layers of the second extraction network, the output of the previous feature extraction layer is the input of the next feature extraction layer, and the input of the first feature extraction layer is the object in the database.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described training method of the feature extraction model or the object retrieval method.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described training method of the feature extraction model, or the object retrieval method.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described training method of a feature extraction model, or object retrieval method.

With this scheme, a feature extraction model with higher precision can be trained from fewer sample-pair resources.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow chart of a method of training a feature extraction model according to the present disclosure;

FIG. 2 is a flow chart of an object retrieval method according to the present disclosure;

FIG. 3 is a schematic diagram illustrating the training of a neural network of a double tower architecture in the related art;

FIG. 4 is a schematic diagram of one particular example of an object retrieval method according to the present disclosure;

FIG. 5 is a schematic diagram of one particular example of a training method of a feature extraction model according to the present disclosure;

FIG. 6 is a schematic diagram of loss value calculation in a training method for a feature extraction model according to the present disclosure;

FIG. 7 is a schematic diagram of a training apparatus for a feature extraction model according to the present disclosure;

FIG. 8 is a schematic diagram of an object retrieval device according to the present disclosure;

FIG. 9 is a block diagram of an electronic device for implementing a method provided by an embodiment of the disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

With the continuing spread of the internet and the explosive growth of multimedia data, how to effectively organize, manage and retrieve such large-scale multimedia information has become a hot topic. The field has made great progress in recent years, but multimodal retrieval remains a difficult task. On the one hand, multimedia information on the network is large in volume, spans many semantic categories, and is complex and varied in content; on the other hand, because information is expressed in different forms such as text and images, multimedia data lie in heterogeneous feature spaces, and the associations among them are complex and diverse.

To address these problems, many multi-modal data analysis methods based on correlation learning have been proposed in recent years. They aim to map data of different modalities, through mapping functions, into a new and comparable feature subspace, and then analyze the data of the different modalities in that subspace. For example, for a multi-modal retrieval task, the related art generally adopts a retrieval system based on contrastive learning. The retrieval system uses a feature extraction model to obtain features of objects of different modalities, and then compares the similarity of those features to determine the final retrieval result. When the feature extraction model is trained, multi-modal objects with the same semantics are used as positive sample pairs and multi-modal objects with different semantics are used as negative sample pairs, and the positive and negative sample pairs are contrasted so that a positive sample pair obtains a higher score than a negative sample pair.

However, because such a retrieval system relies on contrasting samples, training the feature extraction model requires a large number of negative sample pairs to compare against the positive sample pairs, and increasing the number of negative sample pairs consumes more computing resources.

Based on the above, in order to train a feature extraction model with higher precision from fewer sample-pair resources, the embodiments of the present disclosure provide a training method for a feature extraction model.

First, a training method of a feature extraction model provided in the embodiments of the present disclosure is described below.

It should be noted that, in a specific application, the training method of the feature extraction model provided by the embodiment of the present disclosure may be applied to various electronic devices, for example, a personal computer, a server, and other devices with data processing capability. In addition, it can be understood that the training method of the feature extraction model provided by the embodiments of the present disclosure may be implemented by software, hardware, or a combination of software and hardware.

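The training method of the feature extraction model provided by the embodiment of the disclosure may include the following steps:

determining a feature extraction model to be trained; wherein the feature extraction model comprises a first extraction network for first modality objects and a second extraction network for second modality objects, the first extraction network and the second extraction network each comprising a plurality of feature extraction layers;

inputting sample pairs representing the same semantics into the feature extraction model to obtain first sample features output by each feature extraction layer of the first extraction network and second sample features output by each feature extraction layer of the second extraction network;

constructing a positive sample feature pair for characterizing the same semantics and a negative sample feature pair for characterizing different semantics based on the first sample feature and the second sample feature;

in response to determining that the feature extraction model is not converged based on the positive sample feature pair and the negative sample feature pair, adjusting model parameters of the feature extraction model.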

In the scheme provided by the disclosure, when a feature extraction model is trained, features are extracted from a sample pair representing the same semantics through a first extraction network and a second extraction network, each comprising a plurality of feature extraction layers, and each feature extraction layer in each extraction network outputs a sample feature. A positive sample feature pair and a negative sample feature pair are then constructed from the sample features output by the feature extraction layers of the first extraction network and the second extraction network, and the feature extraction model is trained with these pairs. Because the sample features output by every feature extraction layer are used to construct the pairs, features output by different feature extraction layers are added as negative sample feature pairs, in addition to the feature pairs containing different semantic information that serve as negative sample feature pairs in the related art, so the number of negative sample feature pairs obtained is greatly increased. With this scheme, a feature extraction model with higher precision can therefore be trained from fewer sample-pair resources, which reduces the consumption of computing resources and, in turn, the training cost.

The following describes a training method of a feature extraction model provided in the embodiments of the present disclosure with reference to the accompanying drawings.

As shown in fig. 1, the training method for the feature extraction model provided in the embodiment of the present disclosure may include the following steps:

S101, determining a feature extraction model to be trained; wherein the feature extraction model comprises a first extraction network for first modality objects and a second extraction network for second modality objects, the first extraction network and the second extraction network each comprising a plurality of feature extraction layers;

In this embodiment, the first modality object and the second modality object may each be a text object, a video object, a picture object, and the like. The first modality object and the second modality object are objects of different modalities, for example: the first modality object is a picture object and the second modality object is a text object; or the first modality object is a video object and the second modality object is a picture object. It is understood that, because the two objects belong to different modalities, the feature extraction model is used to extract their features so that the features of objects of different modalities can be compared.

In order to extract features of the first modality object and the second modality object, features of objects for different modalities may be extracted using a feature extraction model comprising a first extraction network for extracting features of the first modality object and a second extraction network for extracting features of the second modality object. In order to make the extracted features accurate, the feature extraction model needs to be trained.

It should be noted that training a feature extraction model requires a large number of negative sample pairs to compare against the positive sample pairs, and increasing the number of negative sample pairs consumes more computing resources. To solve this problem, the present disclosure modifies the feature extraction model of the related art: after the modification, both the first extraction network and the second extraction network include a plurality of feature extraction layers, through which the features of the first modality object and the second modality object in different feature spaces are extracted. The features, in different feature spaces, of a sample pair with the same semantics can then be used as negative sample feature pairs, so that a large number of negative sample feature pairs can be constructed from limited sample-pair resources to train the model.

In addition, every feature extraction layer outputs features of the same dimension, so the features extracted by any two feature extraction layers, however they are combined, have the same dimension and are therefore comparable.
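As a rough illustration of the model structure described in S101, the following sketch builds two extraction networks whose layers all output vectors of the same dimension and records the output of every layer. It assumes each layer is a random linear projection followed by `tanh` and that inputs are already fixed-length vectors; `ExtractionNetwork`, `make_layer`, `DIM` and `NUM_LAYERS` are hypothetical names, not taken from the disclosure.

```python
import math
import random

def make_layer(dim, rng):
    """One feature extraction layer: a dim x dim linear projection, then tanh."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def layer(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return layer

class ExtractionNetwork:
    """A chain of layers; the output of each layer is the input of the next."""

    def __init__(self, dim, num_layers, seed):
        rng = random.Random(seed)
        self.layers = [make_layer(dim, rng) for _ in range(num_layers)]

    def forward_all(self, x):
        """Return the feature output by every layer, in layer order."""
        outputs = []
        for layer in self.layers:
            x = layer(x)
            outputs.append(x)
        return outputs

DIM, NUM_LAYERS = 4, 3
first_network = ExtractionNetwork(DIM, NUM_LAYERS, seed=0)   # first modality
second_network = ExtractionNetwork(DIM, NUM_LAYERS, seed=1)  # second modality

# Toy inputs standing in for the two objects of one sample pair.
first_features = first_network.forward_all([0.2, -0.1, 0.5, 0.3])
second_features = second_network.forward_all([0.7, 0.0, -0.4, 0.1])
```

Since every entry of `first_features` and `second_features` has length `DIM`, any first-network output can be compared with any second-network output, which is what the pair construction in the later steps relies on.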

S102, inputting sample pairs representing the same semantics into the feature extraction model to obtain first sample features output by each feature extraction layer of the first extraction network and second sample features output by each feature extraction layer of the second extraction network;

In this embodiment, so that the model learns to extract features of the semantics that objects of different modalities represent, sample pairs representing the same semantics may be used as input data to train the feature extraction model. It can be understood that, during training, for any sample pair representing the same semantics, an object in that sample pair and an object of a different modality in another sample pair may form a negative sample pair. For example, a picture object and a video object that both represent "cat" form a sample pair representing the same semantics, and a picture object and a video object that both represent "dog" form another such sample pair; the picture object in the "cat" sample pair and the video object in the "dog" sample pair may then form a negative sample pair. In this way, negative sample pairs can be constructed from positive sample pairs representing the same semantics, and the feature extraction model can be trained using the features corresponding to the positive sample pairs and the features corresponding to the negative sample pairs.

For example, the sample pairs representing the same semantics may be obtained from a database containing objects in different modalities, and the objects in different modalities may be manually labeled in advance, that is, each object has a tag of the semantics represented by the object, so that the objects belonging to different modalities and having the same tag may be selected as the sample pairs representing the same semantics.
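The tag-based selection described above might look like the following minimal sketch. It assumes each database entry is a small record with an id, a modality and a manually assigned semantic tag, and that the two modalities are text and image; the record layout and the name `build_sample_pairs` are illustrative assumptions.

```python
from collections import defaultdict

def build_sample_pairs(objects):
    """Pair objects of different modalities that carry the same semantic tag."""
    by_tag = defaultdict(lambda: defaultdict(list))
    for obj in objects:
        by_tag[obj["tag"]][obj["modality"]].append(obj["id"])

    pairs = []
    for tag, by_modality in by_tag.items():
        # A tag yields pairs only if both modalities are present for it.
        for text_id in by_modality.get("text", []):
            for image_id in by_modality.get("image", []):
                pairs.append((text_id, image_id, tag))
    return pairs

# Hypothetical labeled database of multi-modal objects.
database = [
    {"id": "t1", "modality": "text", "tag": "cat"},
    {"id": "p1", "modality": "image", "tag": "cat"},
    {"id": "t2", "modality": "text", "tag": "dog"},
    {"id": "p2", "modality": "image", "tag": "dog"},
    {"id": "p3", "modality": "image", "tag": "car"},
]
sample_pairs = build_sample_pairs(database)
```

In this toy data the "car" image has no text counterpart, so it contributes no sample pair.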

In this embodiment, the sample pairs representing the same semantics are input into the feature extraction model, a first extraction network in the feature extraction model performs feature extraction on a first modal object in the sample pair, and a second extraction network in the feature extraction model performs feature extraction on a second modal object in the sample pair, so as to obtain first sample features output by each feature extraction layer of the first extraction network and second sample features output by each feature extraction layer of the second extraction network.

It can be understood that, since the first extraction network and the second extraction network each include a plurality of feature extraction layers, and the sample features output by each feature extraction layer are features of the sample pair in a different feature space, the features, in different feature spaces, of a sample pair with the same semantics can subsequently be used as negative sample feature pairs, so that more negative sample feature pairs are available to train the feature extraction model. For example, suppose the first extraction network and the second extraction network each include 2 feature extraction layers, namely a first extraction layer and a second extraction layer, and the sample pair consists of a text object and a picture object that both represent "cat". The feature output by the first extraction layer for the text object and the feature output by the second extraction layer for the picture object can then be used as a negative sample feature pair, so that more negative sample feature pairs can be constructed from fewer sample-pair resources.

For example, in a specific implementation, the feature extraction layers in the first extraction network and the second extraction network may be neural network layers capable of implementing nonlinear operations, such as convolutional layers, pooling layers, and the like. In practical application, such a neural network layer may map features of the raw data into a high-dimensional feature space by using a linear projection function and an activation function.

S103, constructing a positive sample feature pair for representing the same semantic and a negative sample feature pair for representing different semantics based on the first sample feature and the second sample feature;

In this embodiment, among the sample features output by the feature extraction layers, a first sample feature and a second sample feature that are output by feature extraction layers at corresponding positions and belong to the same sample pair may be constructed as a positive sample feature pair, while a first sample feature and a second sample feature that are output by feature extraction layers at different positions, or that belong to different sample pairs, may be constructed as a negative sample feature pair.
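The pairing rule in S103 can be sketched as follows. The sketch assumes features are indexed as `first_features[i][k]` (sample pair `i`, layer position `k`) and uses string labels in place of real feature vectors so the resulting pairs are easy to read; `build_feature_pairs` is a hypothetical helper name.

```python
from itertools import product

def build_feature_pairs(first_features, second_features):
    """Split all (first, second) feature combinations into positives and negatives.

    first_features[i][k]:  feature of sample pair i at layer k, first network.
    second_features[j][l]: feature of sample pair j at layer l, second network.
    """
    positives, negatives = [], []
    num_samples = len(first_features)
    num_layers = len(first_features[0])
    indices = product(range(num_samples), range(num_layers),
                      range(num_samples), range(num_layers))
    for i, k, j, l in indices:
        pair = (first_features[i][k], second_features[j][l])
        if i == j and k == l:
            positives.append(pair)   # same sample pair, same layer position
        else:
            negatives.append(pair)   # different sample pair or different position
    return positives, negatives

# Labels stand in for feature vectors: two sample pairs, two layers per network.
first_features = [["text_cat@L1", "text_cat@L2"], ["text_dog@L1", "text_dog@L2"]]
second_features = [["img_cat@L1", "img_cat@L2"], ["img_dog@L1", "img_dog@L2"]]
positives, negatives = build_feature_pairs(first_features, second_features)
```

With two sample pairs and two layers this yields 4 positive and 12 negative feature pairs, including cross-layer negatives such as `("text_cat@L1", "img_cat@L2")` that come from a single sample pair.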

It should be noted that, for clarity of layout, the specific positive sample feature pairs and negative sample feature pairs will be described below, and details are not repeated here.

S104, in response to determining, based on the positive sample feature pair and the negative sample feature pair, that the feature extraction model is not converged, adjusting the model parameters of the feature extraction model.

In this embodiment, if the model is not converged, the model parameters of the feature extraction model, that is, the parameters of the first extraction network and/or the second extraction network, are adjusted; the method may then return to the step of obtaining sample pairs representing the same semantics and continue training the feature extraction model until the model converges, at which point training ends.
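The control flow of S104 can be sketched as a loop that scores the pairs, checks convergence, and adjusts parameters otherwise. Two assumptions are made here that this passage does not state: convergence is judged by the loss falling below a threshold, and the loss is an InfoNCE-style contrastive loss. The "model" is a toy with a single parameter, and `contrastive_loss`, `train`, `score_pairs` and `adjust_parameters` are hypothetical names.

```python
import math

def contrastive_loss(pos_score, neg_scores):
    """InfoNCE-style loss (assumed): small when the positive score dominates."""
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

def train(score_pairs, adjust_parameters, loss_threshold=0.05, max_steps=100):
    """Score the pairs, test convergence, and adjust parameters if not converged."""
    loss = float("inf")
    for step in range(1, max_steps + 1):
        pos_score, neg_scores = score_pairs()
        loss = contrastive_loss(pos_score, neg_scores)
        if loss < loss_threshold:      # assumed convergence criterion
            return True, step, loss
        adjust_parameters()            # not converged: update and train again
    return False, max_steps, loss

# Toy stand-in for a model: one parameter that separates the two kinds of score.
state = {"margin": 0.0}

def score_pairs():
    """Positive pair scores +margin; two negative pairs score -margin."""
    return state["margin"], [-state["margin"], -state["margin"]]

def adjust_parameters():
    """Stand-in for a parameter update of the extraction networks."""
    state["margin"] += 0.5

converged, steps, final_loss = train(score_pairs, adjust_parameters)
```

In a real setting `score_pairs` would compute similarities for the positive and negative sample feature pairs, and `adjust_parameters` would update the parameters of the first and/or second extraction network, for example by gradient descent.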

In the scheme provided by the disclosure, when a feature extraction model is trained, features are extracted from a sample pair through a first extraction network and a second extraction network, each comprising a plurality of feature extraction layers, and each feature extraction layer in each extraction network outputs a sample feature. A positive sample feature pair and a negative sample feature pair are constructed from the sample features output by the feature extraction layers of the first extraction network and the second extraction network, and the feature extraction model is trained with these pairs. Because the sample features output by every feature extraction layer are used to construct the pairs, features output by different feature extraction layers are added as negative sample feature pairs, in addition to the feature pairs containing different semantic information that serve as negative sample feature pairs in the related art, so the number of negative sample feature pairs obtained is greatly increased. With this scheme, a feature extraction model with higher precision can therefore be trained from fewer sample-pair resources, which reduces the consumption of computing resources and, in turn, the training cost.

Optionally, in another embodiment of the present disclosure, the plurality of feature extraction layers in the first extraction network include a feature extraction layer constructed based on a first encoder, and a feature extraction layer belonging to a neural network layer, the first encoder being an encoder for the first modal object; the plurality of feature extraction layers in the second extraction network include a feature extraction layer constructed based on a second encoder, which is an encoder for a second modal object, and a feature extraction layer belonging to a neural network layer.

In this embodiment, the first encoder encodes the first modality object, and the second encoder encodes the second modality object, so that the features of the first modality object and the second modality object at respective positions can be fully utilized and mapped into comparable feature spaces to obtain feature vectors with the same dimension, so that the features of the first modality object and the second modality object can be compared. And a plurality of feature extraction layers belonging to the neural network layer are connected behind the first encoder and the second encoder, and features output by the first encoder or the second encoder are subjected to feature extraction through the plurality of feature extraction layers, so that features corresponding to the plurality of feature extraction layers can be obtained. It can be understood that by combining such an encoder with a neural network layer, more comprehensive feature extraction can be achieved.

For example, the feature extraction layer belonging to the neural network in the first extraction network and the feature extraction layer belonging to the neural network in the second extraction network may be convolutional layers, pooling layers, and other network layers capable of performing nonlinear operations in the neural network. It should be noted that, in the feature extraction layer belonging to the neural network in the first extraction network and the feature extraction layer belonging to the neural network in the second extraction network, the feature extraction layers at corresponding positions may have the same network structure, so that features extracted by the respective feature extraction layers may be compared.

Therefore, according to this scheme, the first modality object and the second modality object are encoded by the first encoder in the first extraction network and the second encoder in the second extraction network, respectively, and features are then extracted through the plurality of feature extraction layers belonging to the neural network layer, so that feature extraction is more comprehensive.
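The encoder-plus-layers arrangement of this embodiment could be sketched as below. The two encoders are deliberately crude, hypothetical stand-ins (character bucketing for text, slice means for a flat pixel list) whose only purpose is to show two modalities being mapped to vectors of the same dimension before the neural network layers are applied; a real system would use trained encoders. All names are illustrative.

```python
import math

DIM = 4

def text_encoder(text):
    """Hypothetical first encoder: bucket character codes into DIM bins."""
    vec = [0.0] * DIM
    for ch in text:
        vec[ord(ch) % DIM] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

def image_encoder(pixels):
    """Hypothetical second encoder: mean of DIM equal slices (needs len >= DIM)."""
    size = len(pixels) // DIM
    return [sum(pixels[i * size:(i + 1) * size]) / size for i in range(DIM)]

def dense_tanh(weights):
    """A neural network layer: linear projection followed by tanh."""
    def layer(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return layer

def run_network(encoder, layers, raw_object):
    """Apply the encoder, then each neural layer in turn; keep every output."""
    feature = encoder(raw_object)
    outputs = [feature]
    for layer in layers:
        feature = layer(feature)
        outputs.append(feature)
    return outputs

# Layers at corresponding positions share the same structure (identity weights here).
IDENTITY = [[1.0 if r == c else 0.0 for c in range(DIM)] for r in range(DIM)]
text_layers = [dense_tanh(IDENTITY), dense_tanh(IDENTITY)]
image_layers = [dense_tanh(IDENTITY), dense_tanh(IDENTITY)]

text_features = run_network(text_encoder, text_layers, "cat")
image_features = run_network(
    image_encoder, image_layers, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
)
```

Each network here yields three features per object: one from the encoder-based layer and one from each of the two neural network layers, all of dimension `DIM`.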

Optionally, in another embodiment of the present disclosure, constructing, in step S103, a positive sample feature pair for characterizing the same semantics and a negative sample feature pair for characterizing different semantics based on the first sample feature and the second sample feature may include steps A1-A2:

A1, for the sample pairs representing the same semantics, constructing a positive sample feature pair from the sample features extracted by a first type of extraction layer; wherein the first type of extraction layer comprises two feature extraction layers at the same position in the first extraction network and the second extraction network;

In this embodiment, the sample pair representing the same semantics is a positive sample pair, and the first type of extraction layer comprises two feature extraction layers at the same position in the first extraction network and the second extraction network. Feature extraction layers at the same position map the features of their input data into a comparable feature space, so the features output by the two layers belong to the positive sample pair and lie in a comparable feature space. Therefore, a positive sample feature pair representing the same semantics can be constructed from the sample features that the first type of extraction layer extracts for the sample pair.

A2, constructing a negative sample feature pair by utilizing a second type of extraction layer aiming at the sample features extracted by the sample pair representing the same semantics; wherein the second extraction layer comprises two feature extraction layers belonging to different positions in the first extraction network and the second extraction network.

It can be understood that, because the sample features output by the two feature extraction layers belonging to different positions are obtained by performing nonlinear processing on input data to different degrees, the second type of extraction layer can be used to construct a negative sample feature pair for characterizing different semantics for the sample features extracted by the sample pair for characterizing the same semantics. For example, if the first extraction network and the second extraction network each include 2 feature extraction layers, which are the first extraction layer and the second extraction layer, respectively, the sample features output by the second extraction layer are obtained by further performing nonlinear processing on the sample features output by the first extraction layer, and therefore, the extracted sample features of the two feature extraction layers belonging to different positions in the first extraction network and the second extraction network can construct a negative sample feature pair representing different semantics.

Therefore, according to the scheme, the negative sample feature pairs are constructed by using the sample features output by the second type of extraction layer, the number of the obtained negative sample feature pairs is greatly increased, and more negative sample feature pairs used for training the feature extraction model can be obtained under the condition of less sample pair resources.
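The pairing rule of steps A1-A2 can be sketched for a single sample pair as follows. This is only a minimal illustration, assuming that the per-layer features are held in plain Python lists indexed by layer position; the function name `build_pairs_by_layer` and the toy feature values are hypothetical and are not taken from the disclosure.

```python
from itertools import product


def build_pairs_by_layer(first_feats, second_feats):
    """Split the cross-network feature pairs of ONE sample pair by layer position.

    first_feats[i]  : feature output by layer i of the first extraction network
    second_feats[j] : feature output by layer j of the second extraction network
    """
    positives, negatives = [], []
    for i, j in product(range(len(first_feats)), range(len(second_feats))):
        pair = (first_feats[i], second_feats[j])
        if i == j:
            positives.append(pair)  # same position: positive pair (step A1)
        else:
            negatives.append(pair)  # different positions: negative pair (step A2)
    return positives, negatives


# Toy data (assumed): two feature extraction layers per network.
text_feats = [[0.1, 0.9], [0.3, 0.7]]
image_feats = [[0.2, 0.8], [0.4, 0.6]]
pos, neg = build_pairs_by_layer(text_feats, image_feats)
print(len(pos), len(neg))  # 2 2
```

Under this sketch, a network pair with L layers yields L positive pairs and L x (L - 1) negative pairs from one sample pair.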

Optionally, in another embodiment of the present disclosure, after obtaining the first sample feature and the second sample feature, and before constructing the positive sample feature pair and the negative sample feature pair, the training method of the feature extraction model may further include:

constructing a feature matrix based on the first sample feature and the second sample feature; wherein, in the feature matrix, the row dimension is the number of sample features extracted by the first extraction network for the sample pairs representing the same semantics, the column dimension is the number of sample features extracted by the second extraction network for the sample pairs representing the same semantics, each matrix element is a feature pair formed from a first sample feature and a second sample feature, and the rows are arranged in the same way as the columns;

In this embodiment, the positive sample feature pairs for representing the same semantics and the negative sample feature pairs for representing different semantics are determined by constructing a feature matrix. It can be understood that, after the sample pairs representing the same semantics are input into the feature extraction model, each feature extraction layer in the first extraction network outputs sample features of the sample pairs, so the number of output sample features is the product of the number of feature extraction layers in the first extraction network and the number of sample pairs; when the feature matrix is constructed, this total can be used as the row dimension of the feature matrix. Likewise, each feature extraction layer in the second extraction network outputs sample features of the sample pairs, the number of output sample features is the product of the number of feature extraction layers in the second extraction network and the number of sample pairs, and this total can be used as the column dimension of the feature matrix. In addition, the feature pairs output by two feature extraction layers that are located at the same position in the feature extraction model and that belong to the same sample pair can be used as matrix elements of the feature matrix.

It should be noted that, when the feature matrix is constructed, the rows and the columns are arranged in the same way. For example, if the rows are ordered first by feature extraction layer and, within each layer, by sample pair, then the columns are likewise ordered first by feature extraction layer and, within each layer, by sample pair.

Correspondingly, in this embodiment, constructing, in step A1, a positive sample feature pair from the sample features that the first-type extraction layer extracts for the sample pairs representing the same semantics includes:

selecting the matrix elements on the main diagonal of the feature matrix to obtain the positive sample feature pairs for representing the same semantics;

Correspondingly, in this embodiment, constructing, in step A2, a negative sample feature pair from the sample features that the second-type extraction layer extracts for the sample pairs representing the same semantics includes:

selecting the matrix elements of the feature matrix other than those on the main diagonal to obtain the negative sample feature pairs for representing different semantics.

It can be understood that, because the rows and the columns of the feature matrix are arranged in the same way, each matrix element on the main diagonal is a sample feature pair that belongs to the same sample pair representing the same semantics and that is output by feature extraction layers at the same position; that is, it is a sample feature pair representing the same semantics and can be used as a positive sample feature pair. The matrix elements other than those on the main diagonal are sample feature pairs that belong to different sample pairs or that are output by feature extraction layers at different positions, and can be used as negative sample feature pairs for representing different semantics.
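The construction of the feature matrix and the diagonal selection can be sketched as follows. This is a minimal sketch under stated assumptions: features are represented by placeholder labels rather than vectors, rows and columns are both ordered first by layer and then by sample pair, and the names `build_feature_matrix` and `split_pairs` are hypothetical.

```python
def build_feature_matrix(first_feats, second_feats):
    """Build a square matrix of feature pairs.

    first_feats[layer][sample]  : features from the first extraction network
    second_feats[layer][sample] : features from the second extraction network
    Rows and columns use the same ordering: by layer, then by sample pair.
    """
    rows = [feat for layer in first_feats for feat in layer]
    cols = [feat for layer in second_feats for feat in layer]
    if len(rows) != len(cols):
        raise ValueError("rows and columns must have the same length")
    return [[(r, c) for c in cols] for r in rows]


def split_pairs(matrix):
    """Main diagonal gives positive pairs; all other elements give negative pairs."""
    size = len(matrix)
    positives = [matrix[k][k] for k in range(size)]
    negatives = [matrix[r][c] for r in range(size) for c in range(size) if r != c]
    return positives, negatives


# Toy labels (assumed): 2 sample pairs, 2 feature extraction layers per network.
first = [["T1@L1", "T2@L1"], ["T1@L2", "T2@L2"]]
second = [["I1@L1", "I2@L1"], ["I1@L2", "I2@L2"]]
matrix = build_feature_matrix(first, second)
positives, negatives = split_pairs(matrix)
print(len(positives), len(negatives))  # 4 12
```

Because both axes share one ordering, every diagonal element pairs the same sample at the same layer position, which is exactly the condition the text gives for a positive sample feature pair.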

In addition, it should be noted that, when the feature matrix is constructed from the obtained sample features and the model loss value is subsequently calculated, the number of sample features is greatly increased, so the number of loss values, one per element position of the feature matrix, grows with the square of the number of sample features. Calculating the loss value at every matrix element position would then require a large amount of computing resources. To reduce the amount of calculation, only the loss values of the matrix elements at positions other than the positions of the features output by the intermediate feature extraction layers may be used for gradient back-propagation.

Therefore, with this scheme, the feature matrix is constructed from the first sample feature and the second sample feature, and each positive sample feature pair and each negative sample feature pair can then be determined quickly from the feature matrix, so that the model loss value can be calculated from these pairs.
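The growth in the number of pairs can be made concrete with a short calculation. The sketch below assumes a square feature matrix whose side is the number of sample pairs multiplied by the number of feature extraction layers per network; the function name `pair_counts` and the example sizes are hypothetical.

```python
def pair_counts(num_pairs, num_layers=1):
    """Return (positive pairs, negative pairs) for a square feature matrix."""
    size = num_pairs * num_layers   # features per network = matrix side length
    positives = size                # elements on the main diagonal
    negatives = size * size - size  # all remaining elements
    return positives, negatives


print(pair_counts(4))     # (4, 12)   one feature per sample
print(pair_counts(4, 3))  # (12, 132) three features per sample
```

With the same four sample pairs, tripling the number of features per sample raises the number of negative sample feature pairs from 12 to 132, and the number of loss values to compute rises accordingly.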

Optionally, in another embodiment of the present disclosure, the manner of determining whether the feature extraction model converges based on the positive sample feature pair and the negative sample feature pair may include steps B1-B2:

B1, determining the model loss of the feature extraction model based on the positive sample feature pair and the negative sample feature pair;

B2, judging whether the feature extraction model converges by using the model loss.

It can be understood that, in order to perform multi-modal retrieval, the training target of the feature extraction model may be that the similarity of each positive sample feature pair is close to 1 and the similarity of each negative sample feature pair is close to 0. Accordingly, the loss value between the similarity of the positive sample feature pairs and 1, and the loss value between the similarity of the negative sample feature pairs and 0, are calculated as the model loss. Whether the feature extraction model converges is then judged from the model loss and a loss value threshold set for the model in advance.

Optionally, in an implementation manner, in the step B1, determining a model loss of the feature extraction model based on the positive sample feature pair and the negative sample feature pair may include steps B11 to B12:

B11, calculating the feature similarity of the positive sample feature pair and the feature similarity of the negative sample feature pair;

Illustratively, the feature similarity of a sample feature pair may be determined by calculating the cosine distance between the two features of the pair and then subtracting that cosine distance from 1. Of course, the feature similarity of a sample feature pair may also be calculated in other ways, for example from the Euclidean distance; the embodiments of the present disclosure do not limit the manner of calculating the feature similarity.
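The cosine-based similarity of step B11 can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and cosine distance is taken here in its common definition of one minus the cosine of the angle between two vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Common definition: 1 - cos(u, v)."""
    return 1.0 - cosine(u, v)


def feature_similarity(u, v):
    """Similarity as described in the text: 1 minus the cosine distance."""
    return 1.0 - cosine_distance(u, v)


print(feature_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(feature_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```

Note that this similarity lies in the range from -1 to 1, so an implementation that targets values between 0 and 1 may need an additional mapping; the disclosure does not specify one.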

B12, determining the model loss of the feature extraction model based on the difference between the feature similarity of the positive sample feature pair and a first true value and the difference between the feature similarity of the negative sample feature pair and a second true value; wherein the first true value is the similarity true value that characterizes identical semantics, and the second true value is the similarity true value that characterizes different semantics.

It can be understood that the first true value is the similarity true value characterizing identical semantics, i.e. 1 or 100%, and the second true value is the similarity true value characterizing different semantics, i.e. 0. The model loss of the feature extraction model can therefore be determined from the difference between the feature similarity of each positive sample feature pair and the first true value and the difference between the feature similarity of each negative sample feature pair and the second true value, and the feature extraction model can be trained by minimizing this model loss. For example, the model loss may be determined as the sum of the differences between the feature similarity of each positive sample feature pair and the first true value, plus the sum of the differences between the feature similarity of each negative sample feature pair and the second true value. It can be understood that the model loss may be calculated with a cross-entropy loss function, but is not limited thereto; in addition, the model parameters of the feature extraction model may be adjusted by a gradient descent method or the like, but are not limited thereto.

It can be seen that, with this scheme, the loss value between the similarity of the positive sample feature pairs and 1, and the loss value between the similarity of the negative sample feature pairs and 0, are calculated as the model loss. Whether the feature extraction model converges is then judged from the model loss and a loss value threshold set for the model in advance. The feature extraction model can thus be trained by minimizing the model loss, so that the similarity of positive sample feature pairs is close to 1 and the similarity of negative sample feature pairs is close to 0. In the subsequent retrieval process, the object matching the target object can then be retrieved accurately.
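Steps B1-B2 can be sketched as follows. This is a minimal sketch, assuming that the "difference" is taken as an absolute difference and that convergence means the loss does not exceed a preset threshold; the text also allows other loss functions such as cross entropy. The function names and the threshold value are hypothetical.

```python
def model_loss(pos_sims, neg_sims, first_true=1.0, second_true=0.0):
    """Sum of differences between similarities and their true values.

    pos_sims : similarities of positive sample feature pairs (target: first_true)
    neg_sims : similarities of negative sample feature pairs (target: second_true)
    Absolute differences are an assumption made for this sketch.
    """
    pos_term = sum(abs(s - first_true) for s in pos_sims)
    neg_term = sum(abs(s - second_true) for s in neg_sims)
    return pos_term + neg_term


def has_converged(loss, threshold):
    """Step B2: compare the model loss with a preset loss value threshold."""
    return loss <= threshold


loss = model_loss(pos_sims=[1.0, 0.5], neg_sims=[0.0, 0.25])
print(loss)                                # 0.75
print(has_converged(loss, threshold=0.1))  # False
```

When `has_converged` returns False, the model parameters would be adjusted (for example by gradient descent) and the next training iteration would start.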

After the feature extraction model is trained by the scheme provided by the above embodiment, as shown in fig. 2, an embodiment of the present disclosure further provides an object retrieval method, including the following steps:

S201, acquiring a target object serving as a retrieval basis; wherein the target object belongs to a first modality;

In this embodiment, the target object is the query data used for retrieval and belongs to a first modality, where the first modality may be a picture, a text, a video, or the like. In an actual multi-modal retrieval process, a second modal object having the same semantics as the input target object can be retrieved according to the target object.

S202, inputting the target object into a first extraction network in a feature extraction model, so that a plurality of feature extraction layers in the first extraction network extract the features of the target object to obtain a first feature of the target object; the feature extraction model is a model obtained by training based on the training method of the feature extraction model; in the plurality of feature extraction layers of the first extraction network, the output of the previous feature extraction layer is the input of the next feature extraction layer, and the input of the first feature extraction layer is the target object.

It can be understood that, because the pre-trained feature extraction model is obtained by the training method described above, the first extraction network and the second extraction network in the feature extraction model can fully extract the features of positive sample pairs, so that positive sample feature pairs have higher similarity and negative sample feature pairs have lower similarity. In the retrieval process, a target object belonging to the first modality can therefore be input into the first extraction network to extract the first feature of the target object, and the first feature can then be used to retrieve the matching feature of an object of the second modality. Specifically, feature pairs may be constructed by pairing the first feature with the feature of each second modality object, and each feature pair may be scored according to its similarity. The feature of the second modality object contained in the feature pair with the highest score may then be taken as the feature of the second modality object matching the target object, and that second modality object may be determined as the retrieval result of the target object.

S203, based on the pre-constructed feature library and the first feature, determining an object matched with the target object from a database containing objects belonging to a second modality, and taking the object as a retrieval result corresponding to the target object; the feature library comprises second features of the objects in the database, and the second features of the objects in the database are obtained by performing feature extraction on the objects in the database by utilizing a plurality of feature extraction layers of a second extraction network in the feature extraction model; in the plurality of feature extraction layers in the second extraction network, the output of the previous feature extraction layer is the input of the next feature extraction layer, and the input of the first feature extraction layer is the object in the database.

It can be understood that, in an actual retrieval process, the retrieval system can be divided into an offline part and an online part in order to increase the retrieval speed. The offline part uses the second extraction network in the feature extraction model to extract the second feature of each object in a database containing a large number of second modal objects, and thereby constructs the feature library. During retrieval, the online part calculates the similarity between the feature of the target object and each second feature in the feature library, returns the index of the second feature that has the highest similarity to the feature of the target object, and, according to that index, determines the corresponding object in the database as the retrieval result for the target object.
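The offline/online split can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the two extraction networks are replaced by stand-in lookup tables, cosine similarity is assumed as the scoring function, and all names (`build_feature_library`, `retrieve`, the file names and feature values) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_feature_library(database, extract_second):
    """Offline part: extract the second feature of every object in the database."""
    return [extract_second(obj) for obj in database]


def retrieve(target, extract_first, feature_library, database):
    """Online part: score every stored feature and return the best-matching object."""
    query = extract_first(target)
    scores = [cosine_similarity(query, feat) for feat in feature_library]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return database[best_index], scores[best_index]


# Stand-ins for the trained extraction networks (assumed values).
IMAGE_FEATS = {"cat.jpg": [1.0, 0.0], "dog.jpg": [0.0, 1.0]}
TEXT_FEATS = {"a cat": [0.9, 0.1], "a dog": [0.2, 0.8]}

database = ["cat.jpg", "dog.jpg"]
library = build_feature_library(database, IMAGE_FEATS.__getitem__)  # offline
result, score = retrieve("a cat", TEXT_FEATS.__getitem__, library, database)  # online
print(result)  # cat.jpg
```

Because the feature library is computed once offline, the online part only runs the first extraction network on the query and performs the similarity comparisons.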

Therefore, with this scheme, the feature extraction model is a model trained with a large number of negative sample feature pairs, and the features of the target object and of the second modal objects are extracted with this model. Features belonging to the same semantics and features belonging to different semantics can thus be fully extracted, so that feature pairs belonging to the same semantics have higher similarity and the retrieval precision can be greatly improved.

For a better understanding of the contents of the embodiments of the present disclosure, reference is made to the following description in connection with a specific example.

Fig. 3 shows a schematic diagram of training a neural network with a double-tower architecture in the related art. As shown in fig. 3, the first modality object is a text object, the second modality object is a picture object, and the input is N sample pairs, where N is a positive integer and each sample pair contains one text object and one picture object representing the same semantics. A text encoder (corresponding to the first encoder above) in the neural network encodes each first modality object in the N sample pairs to obtain N features, denoted T_1 … T_N in the figure. A picture encoder (corresponding to the second encoder above) in the neural network encodes each second modality object in the N sample pairs to obtain N features, denoted I_1 … I_N in the figure.

Based on the features output by the text encoder and the picture encoder, an N x N feature matrix is constructed. In the feature matrix, the matrix elements on the main diagonal are sample feature pairs belonging to the same sample pair, and their similarity can be used as the similarity of positive sample pairs. The matrix elements off the main diagonal are sample feature pairs belonging to different sample pairs, that is, sample feature pairs of negative sample pairs, and their similarity can be used as the similarity of negative sample pairs. For example, the similarity between I_1 and T_1 in the figure is the similarity of a positive sample pair, and the similarity between I_1 and T_2 in the figure is the similarity of a negative sample pair.

It can be understood that, so that the neural network with the double-tower architecture can subsequently be used for multi-modal retrieval, the parameters of the neural network can be adjusted during training by calculating, in the feature matrix, the loss value between the similarities on the main diagonal and 1 and the loss value between the similarities off the main diagonal and 0, and back-propagating the gradient of these loss values, so as to optimize the neural network. Specifically, the parameters of the neural network are adjusted by minimizing these two loss values, so that the features that the optimized neural network extracts for a positive sample pair have higher similarity and the features it extracts for a negative sample pair have lower similarity. A retrieval result matching the target object can therefore be retrieved more accurately in the subsequent retrieval process.

Fig. 4 is a schematic diagram of a retrieval system that performs multi-modal retrieval with the object retrieval method provided in this embodiment. The retrieval system performs feature extraction with the feature extraction model trained by the training method of this embodiment, and the feature extraction model may be obtained by modifying the above-mentioned neural network with the double-tower architecture. As shown in fig. 4, the retrieval system includes an offline part and an online part. The online part uses the first extraction network in the feature extraction model to extract the text feature (corresponding to the first feature above) of an input query text (corresponding to the target object above). Then, based on the text feature and a pre-constructed feature library, it calculates the similarity between the text feature and the feature under each index in the feature library, scores the feature under each index according to the calculated similarity, and returns the picture corresponding to the index with the highest score as the retrieval result. The features in the pre-constructed feature library are extracted by the second extraction network in the feature extraction model from a database containing a large number of pictures (corresponding to the above database of objects belonging to the second modality).

As shown in fig. 5, the feature extraction model trained by the training method of this embodiment includes a text encoder, a picture encoder, and a projection network (corresponding to the above feature extraction layers belonging to the neural network layer), where the text encoder and the projection network form the first extraction network, and the picture encoder and the projection network form the second extraction network. The part in the dashed box in fig. 5 corresponds to what is shown in fig. 3. In this example, the projection network includes two feature extraction layers and N is 4, i.e. four sample pairs are input.

In practical applications, the number of feature extraction layers of the projection network can be set as needed; however, the more layers there are, the more features are output and the more computing resources are consumed. The input features and the output features of the projection network have the same width, so that the output features of a sample pair can be compared at feature extraction layers located at the same position. In addition, each feature extraction layer of the projection network can be any neural network layer capable of performing a nonlinear operation.
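A width-preserving projection network of this kind can be sketched as follows. This is only an illustration under stated assumptions: each feature extraction layer is modelled as a square weight matrix followed by tanh (one of many possible nonlinear layers), the weights and the encoded feature are made-up values, and the names `make_layer` and `run_projection` are hypothetical.

```python
import math


def make_layer(weights):
    """Return a layer computing tanh(W x); W must be square to keep the width."""
    if any(len(row) != len(weights) for row in weights):
        raise ValueError("weight matrix must be square to keep the feature width")

    def layer(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return layer


def run_projection(encoded, layers):
    """Feed the encoder output through the layers and keep every layer's output."""
    outputs = []
    x = encoded
    for layer in layers:
        x = layer(x)       # output of the previous layer is the next layer's input
        outputs.append(x)
    return outputs


layer1 = make_layer([[1.0, 0.0], [0.0, 1.0]])
layer2 = make_layer([[0.5, 0.5], [0.5, -0.5]])
encoded_text = [0.2, -0.4]  # stand-in for a text encoder output
outs = run_projection(encoded_text, [layer1, layer2])
print([len(o) for o in outs])  # [2, 2]
```

Every per-layer output has the same width as the encoder output, so features taken from the same layer position of the two extraction networks can be compared directly.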

In the training process of the feature extraction model, the features output by the text encoder and the picture encoder are concatenated with the features output by each feature extraction layer in the corresponding projection network, and a feature matrix is then constructed. The parameters of the feature extraction model are adjusted by calculating, in the feature matrix, the loss value between the similarities on the main diagonal and 1 and the loss value between the similarities off the main diagonal and 0, and back-propagating the gradient of these loss values, thereby optimizing the feature extraction model.

It can be understood that, because a projection network formed by a plurality of feature extraction layers is added to the feature extraction model, the sample feature pairs extracted by feature extraction layers at different positions can be used as negative sample feature pairs after the sample pairs are input into the feature extraction model. A large number of negative sample feature pairs can therefore be constructed to train the model even when sample pair resources are limited, which improves the feature extraction accuracy of the model.

In addition, when the parameters of the feature extraction model are adjusted by back-propagating the gradient of the loss value, the scale of the constructed feature matrix is greatly increased, so the number of loss values, one per element position of the matrix, grows with the square of the number of features, and calculating the loss value at every matrix element position consumes a large amount of computing resources. To reduce the amount of calculation, gradient back-propagation can be performed with only the loss values of the matrix elements at positions other than the positions of the features output by the intermediate feature extraction layers. The positions of the matrix elements that participate in the gradient back-propagation are indicated by the dashed line in fig. 6.
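One possible reading of this selection can be sketched as a boolean mask over the feature matrix. The sketch below is an assumption, since fig. 6 is not reproduced here: it treats every layer position except the first and the last as "intermediate", orders rows and columns by layer and then by sample pair, and drops any matrix element whose row or column feature comes from an intermediate layer. The name `gradient_mask` is hypothetical.

```python
def gradient_mask(num_layers, num_pairs):
    """Return a square boolean mask; True marks elements kept for back-propagation.

    Assumption: layers 1 .. num_layers-2 (0-based) are the intermediate layers,
    and index k of a row or column belongs to layer k // num_pairs.
    """
    size = num_layers * num_pairs
    intermediate = set(range(1, num_layers - 1))

    def layer_of(index):
        return index // num_pairs

    return [
        [
            layer_of(r) not in intermediate and layer_of(c) not in intermediate
            for c in range(size)
        ]
        for r in range(size)
    ]


mask = gradient_mask(num_layers=3, num_pairs=2)
kept = sum(flag for row in mask for flag in row)
print(kept, "of", len(mask) ** 2)  # 16 of 36
```

In this toy setting, 16 of the 36 loss values take part in gradient back-propagation; the actual positions used by the disclosure are those shown in fig. 6.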

Therefore, with this scheme, a large number of negative sample feature pairs can be constructed even when sample pair resources are limited, so that the feature extraction model is optimized based on the idea of contrastive learning. In the retrieval process, the features of the target object and of the second modal objects are extracted with the optimized feature extraction model, so that a second modal object having the same semantics as the target object obtains a higher score and the retrieval accuracy is improved.

Based on the above embodiment of the training method of the feature extraction model, the embodiment of the present disclosure further provides a training apparatus of the feature extraction model, as shown in fig. 7, the apparatus includes:

a determining module 710, configured to determine a feature extraction model to be trained; wherein the feature extraction model comprises a first extraction network for first modality objects and a second extraction network for second modality objects, the first extraction network and the second extraction network each comprising a plurality of feature extraction layers;

a sample feature extraction module 720, configured to input sample pairs representing the same semantics into the feature extraction model, so as to obtain first sample features output by each feature extraction layer of the first extraction network and second sample features output by each feature extraction layer of the second extraction network;

a construction module 730, configured to construct, based on the first sample feature and the second sample feature, a positive sample feature pair for characterizing the same semantics and a negative sample feature pair for characterizing different semantics;

an adjusting module 740 for adjusting model parameters of the feature extraction model in response to determining that the feature extraction model is not converged based on the pair of positive sample features and the pair of negative sample features.

Optionally, the construction module includes:

a first construction sub-module, configured to construct the positive sample feature pair from the sample features that a first-type extraction layer extracts for the sample pairs representing the same semantics; wherein the first-type extraction layer comprises two feature extraction layers located at the same position in the first extraction network and the second extraction network;

a second construction sub-module, configured to construct the negative sample feature pair from the sample features that a second-type extraction layer extracts for the sample pairs representing the same semantics; wherein the second-type extraction layer comprises two feature extraction layers located at different positions in the first extraction network and the second extraction network.

Optionally, the apparatus further comprises:

a matrix construction module for constructing a feature matrix based on the first sample feature and the second sample feature; in the feature matrix, the number of dimensions of a row is the number of sample features extracted by the first extraction network on the sample pairs representing the same semantics, the number of dimensions of a column is the number of sample features extracted by the second extraction network on the sample pairs representing the same semantics, and a matrix element is a feature pair related to the first sample feature and the second sample feature, and the arrangement mode of each dimension of the row is the same as the arrangement mode of each dimension of the column;

the first construction sub-module is specifically configured to:

select the matrix elements on the main diagonal of the feature matrix to obtain the positive sample feature pair;

the second construction sub-module is specifically configured to:

select the matrix elements of the feature matrix other than those on the main diagonal to obtain the negative sample feature pair.

Optionally, the determining whether the feature extraction model converges based on the positive sample feature pair and the negative sample feature pair includes:

determining a model loss for the feature extraction model based on the pair of positive sample features and the pair of negative sample features;

and judging whether the feature extraction model is converged or not by utilizing the model loss.

Optionally, the determining a model loss of the feature extraction model based on the positive sample feature pair and the negative sample feature pair includes:

calculating the feature similarity of the positive sample feature pair and the feature similarity of the negative sample feature pair;

determining a model loss of the feature extraction model based on a difference between the feature similarity of the positive sample feature pair and a first true value and a difference between the feature similarity of the negative sample feature pair and a second true value;

the first truth value is a similarity truth value with the same representation semantics, and the second truth value is a similarity truth value with different representation semantics.

Optionally, the plurality of feature extraction layers in the first extraction network include a feature extraction layer constructed based on a first encoder, and a feature extraction layer belonging to a neural network layer, the first encoder being an encoder for a first modal object;

the plurality of feature extraction layers in the second extraction network include a feature extraction layer constructed based on a second encoder, and a feature extraction layer belonging to a neural network layer, the second encoder being an encoder for a second modal object.

Based on the above embodiment of the object retrieval method, an embodiment of the present disclosure further provides an object retrieval device, as shown in fig. 8, the device includes:

an obtaining module 810, configured to obtain a target object as a search basis; wherein the target object belongs to a first modality;

a feature extraction module 820, configured to input the target object into a first extraction network in a feature extraction model, so that a plurality of feature extraction layers of the first extraction network extract features of the target object to obtain a first feature of the target object; the feature extraction model is a model obtained by training according to the training method of the feature extraction model; in the plurality of feature extraction layers of the first extraction network, the output of the previous feature extraction layer is the input of the next feature extraction layer, and the input of the first feature extraction layer is the target object;

a retrieval module 830, configured to determine, based on a pre-constructed feature library and the first feature, an object matching the target object from a database including objects belonging to a second modality, as a retrieval result corresponding to the target object.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

An electronic device provided by the present disclosure may include:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of any of the above-mentioned training methods for feature extraction models, or the steps of any of the above-mentioned object retrieval methods.

A computer-readable storage medium is provided in the present disclosure, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above-mentioned training methods for feature extraction models, or the steps of any of the above-mentioned object retrieval methods.

In yet another embodiment provided by the present disclosure, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of any of the above-described methods for training a feature extraction model, or any of the above-described methods for object retrieval.

FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as a training method of a feature extraction model, or an object retrieval method. For example, in some embodiments, the training method of the feature extraction model, or the object retrieval method, may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method of the feature extraction model described above, or the object retrieval method, may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured in any other suitable way (e.g., by means of firmware) to perform a training method of the feature extraction model, or an object retrieval method.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A training method of a feature extraction model comprises the following steps:

constructing a positive sample feature pair for characterizing the same semantics and a negative sample feature pair for characterizing different semantics based on the first sample feature and the second sample feature;

in response to determining that the feature extraction model is not converged based on the pair of positive sample features and the pair of negative sample features, adjusting model parameters of the feature extraction model;

wherein the constructing of the positive sample feature pair for characterizing the same semantics and the negative sample feature pair for characterizing different semantics based on the first sample feature and the second sample feature comprises:

utilizing a first type of extraction layer to construct the positive sample feature pair aiming at the sample features extracted by the sample pairs representing the same semantics; wherein the first type of extraction layer comprises two feature extraction layers belonging to the same position in the first extraction network and the second extraction network;

utilizing a second type of extraction layer to construct the negative sample feature pair aiming at the sample features extracted by the sample pairs representing the same semantics; wherein the second type of extraction layer comprises two feature extraction layers belonging to different positions in the first extraction network and the second extraction network.
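The pairing rule recited in this claim can be illustrated with a short sketch. It assumes that each extraction network has already produced one sample feature per feature extraction layer for the same sample pair, and that the layer "position" is simply the list index; the function name `build_feature_pairs` is hypothetical.

```python
def build_feature_pairs(first_features, second_features):
    """Split cross-network feature pairs by layer position.

    first_features[i] is the output of layer i of the first extraction
    network and second_features[j] the output of layer j of the second
    extraction network, both for one sample pair with the same semantics.
    """
    positive_pairs = []
    negative_pairs = []
    for i, first in enumerate(first_features):
        for j, second in enumerate(second_features):
            if i == j:
                # Same position in both networks: positive sample feature pair.
                positive_pairs.append((first, second))
            else:
                # Different positions: negative sample feature pair.
                negative_pairs.append((first, second))
    return positive_pairs, negative_pairs
```

With three layers in each network, a single sample pair yields three positive and six negative feature pairs, so the negative pairs come from the sample pair itself rather than from additional samples.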

2. The method of claim 1, wherein the method further comprises:

constructing a feature matrix based on the first sample feature and the second sample feature; in the feature matrix, the number of dimensions of a row is the number of sample features extracted by the first extraction network for the sample pairs representing the same semantics, the number of dimensions of a column is the number of sample features extracted by the second extraction network for the sample pairs representing the same semantics, and a matrix element is a feature pair related to the first sample feature and the second sample feature, and the arrangement mode of each dimension of the row is the same as the arrangement mode of each dimension of the column;

the constructing, by using the first extraction layer, the positive sample feature pair for the sample features extracted from the sample pairs representing the same semantics includes:

the constructing the negative sample feature pair for the sample features extracted by the sample pair by using the second type of extraction layer comprises:

3. The method of claim 1, wherein determining whether the feature extraction model converges based on the pair of positive and negative sample features comprises:

and judging whether the feature extraction model converges or not by using the model loss.

4. The method of claim 3, wherein the determining a model loss for the feature extraction model based on the pair of positive sample features and the pair of negative sample features comprises:

5. The method according to any one of claims 1-4, wherein the plurality of feature extraction layers in the first extraction network include a feature extraction layer constructed based on a first encoder, the first encoder being an encoder for a first modal object, and a feature extraction layer belonging to a neural network layer;

6. An object retrieval method, comprising:

inputting the target object into a first extraction network in a feature extraction model, so that a plurality of feature extraction layers of the first extraction network extract features of the target object to obtain first features of the target object; wherein the feature extraction model is a model trained according to the method of any one of claims 1 to 5; in the plurality of feature extraction layers of the first extraction network, the output of a previous feature extraction layer is the input of a next feature extraction layer, and the input of the first feature extraction layer is the target object;

the feature library comprises second features of the objects in the database, and the second features of the objects in the database are obtained by performing feature extraction on the objects in the database by utilizing a plurality of feature extraction layers of a second extraction network in the feature extraction model; in the plurality of feature extraction layers of the second extraction network, the output of the previous feature extraction layer is the input of the next feature extraction layer, and the input of the first feature extraction layer is the object in the database.
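A minimal sketch of the matching step follows. It assumes the feature library is a mapping from an object identifier to the second feature of that object, and that matching means ranking by cosine similarity and keeping the top results; the similarity measure, the `top_k` parameter, and the names `cosine_similarity` and `retrieve` are assumptions for illustration, not details taken from the claim.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(first_feature, feature_library, top_k=1):
    """Return the ids of the top_k library objects most similar to the query.

    feature_library maps an object id to the second feature extracted for
    that object by the second extraction network (assumed precomputed).
    """
    ranked = sorted(
        feature_library.items(),
        key=lambda item: cosine_similarity(first_feature, item[1]),
        reverse=True,
    )
    return [object_id for object_id, _ in ranked[:top_k]]


# Hypothetical library of second-modality objects and a first-modality query.
library = {"image_a": [1.0, 0.0], "image_b": [0.0, 1.0]}
result = retrieve([0.9, 0.1], library)
```

Because the library features are computed ahead of time, only the target object has to pass through the first extraction network at query time.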

7. A training apparatus for a feature extraction model, comprising:

an adjustment module to adjust model parameters of the feature extraction model in response to determining that the feature extraction model is not converged based on the pair of positive sample features and the pair of negative sample features;

wherein the building block comprises:

the first construction submodule is used for constructing the positive sample feature pair by utilizing the sample features extracted by a first type of extraction layer for the sample pairs representing the same semantics; wherein the first type of extraction layer comprises two feature extraction layers belonging to the same position in the first extraction network and the second extraction network;

the second construction submodule is used for constructing the negative sample feature pair aiming at the sample features extracted by the sample pairs representing the same semantics by utilizing a second type extraction layer; wherein the second type of extraction layer comprises two feature extraction layers belonging to different positions in the first extraction network and the second extraction network.

8. The apparatus of claim 7, wherein the apparatus further comprises:

a matrix construction module for constructing a feature matrix based on the first sample feature and the second sample feature; in the feature matrix, the number of dimensions of a row is the number of sample features extracted by the first extraction network for the sample pairs representing the same semantics, the number of dimensions of a column is the number of sample features extracted by the second extraction network for the sample pairs representing the same semantics, and a matrix element is a feature pair related to the first sample feature and the second sample feature, and the arrangement mode of each dimension of the row is the same as the arrangement mode of each dimension of the column;

the first building submodule is specifically configured to:

the second building submodule is specifically configured to:

9. The apparatus of claim 7, wherein the manner of determining whether the feature extraction model converges based on the pair of positive sample features and the pair of negative sample features comprises:

10. The apparatus of claim 9, wherein the determining a model loss for the feature extraction model based on the pair of positive sample features and the pair of negative sample features comprises:

11. The apparatus according to any one of claims 7-10, wherein the plurality of feature extraction layers in the first extraction network include a feature extraction layer constructed based on a first encoder, which is an encoder for a first modal object, and a feature extraction layer belonging to a neural network layer;

12. An object retrieval apparatus comprising:

the feature extraction module is used for inputting the target object into a first extraction network in a feature extraction model, so that a plurality of feature extraction layers of the first extraction network extract features of the target object to obtain a first feature of the target object; wherein, the feature extraction model is a model obtained by training according to any one of the methods of claims 1-5; in a plurality of feature extraction layers of the first extraction network, the output of a previous feature extraction layer is the input of a next feature extraction layer, and the input of the first feature extraction layer is the target object;

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of training a feature extraction model according to any one of claims 1 to 5, or a method of object retrieval according to claim 6.

14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a training method of a feature extraction model according to any one of claims 1 to 5, or an object retrieval method of claim 6.