Cross-modal retrieval method, network training method, device, equipment and medium (CN114817655A)

Info

Publication number: CN114817655A
Authority: CN (China)
Prior art keywords: data, sample, retrieved, network, feature
Legal status: Pending (an assumption, not a legal conclusion; no legal analysis has been performed)
Application number: CN202210265872.5A
Other languages: Chinese (zh)
Inventors: 何永明, 李涛, 梅丰
Current and original assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd, with priority to CN202210265872.5A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/903: Querying
    • G06F 16/90335: Query processing
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a cross-modal retrieval method, a network training method, a device, equipment and a medium. The cross-modal retrieval method includes: acquiring data to be retrieved and candidate data, where the data to be retrieved and the candidate data correspond to different modalities; extracting a first feature of the data to be retrieved and a second feature of the candidate data based on a cross-modal retrieval network; and retrieving, from the candidate data, data that matches the data to be retrieved according to the matching degree between the first feature and the second feature. The cross-modal retrieval network of the disclosure can accurately capture local information of the input data to be retrieved and the candidate data and output more effective features, which are more distinctive within the same modality and more identifiable across different modalities, thereby improving fine-grained cross-modal retrieval performance.

Description

Cross-modal retrieval method, network training method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computers, and in particular, to a cross-modal retrieval method, a network training method, an apparatus, a device, and a medium.
Background
Cross-modal retrieval refers to a retrieval mode in which the modality of the retrieval result differs from that of the query data, such as using images to retrieve text, video, or audio.
In the related art, cross-modal retrieval usually performs similarity calculation on the features of different modalities output by a retrieval network to obtain a relevance score, and retrieval is performed according to that score. When the retrieval network is trained, the data of the two modalities are generally mapped into a high-dimensional representation space of the same dimension to obtain the features of the two modalities, which are then trained directly with a contrastive loss function. The features output by a retrieval network trained in this way are not fine-grained enough and support only a coarse-grained judgment of whether two items are related, so the retrieval effect of fine-grained cross-modal retrieval is poor.
Disclosure of Invention
The disclosure provides a cross-modal retrieval method, a network training method, a device, equipment and a medium, which at least address the problems in the related art that the features output by a retrieval network are not fine-grained enough, that only a coarse-grained judgment of relatedness can be made, and that the retrieval effect of fine-grained cross-modal retrieval is therefore poor. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a cross-modal retrieval method, including:
acquiring data to be retrieved and candidate data; the data to be retrieved and the candidate data correspond to different modalities;
extracting a first feature of the data to be retrieved and a second feature of the candidate data based on a cross-modal retrieval network;
retrieving data matched with the data to be retrieved from the candidate data according to the matching degree of the first characteristic and the second characteristic;
the cross-modal retrieval network is obtained by performing countermeasure training on a first sample generation network corresponding to the data to be retrieved of the sample, a second sample generation network corresponding to the associated sample data and a third sample generation network corresponding to the non-associated sample data in an countermeasure network based on the data to be retrieved of the sample, the associated sample data matched with the data to be retrieved of the sample and the non-associated sample data not matched with the data to be retrieved of the sample; the associated sample data and the data to be retrieved of the sample correspond to different modals, and the associated sample data and the non-associated sample data correspond to the same modality.
In an exemplary embodiment, the cross-modal retrieval network includes a first generation network corresponding to the first sample generation network, a second generation network corresponding to the second sample generation network, and a third generation network corresponding to the third sample generation network, and the extracting, based on the cross-modal retrieval network, the first feature of the data to be retrieved and the second feature of the candidate data includes:
inputting the data to be retrieved and the candidate data into the cross-modal retrieval network;
extracting the first feature based on the first generated network;
extracting the second feature from the second and third generation networks.
In an exemplary embodiment, the retrieving, from the candidate data according to the matching degree of the first feature and the second feature, data that matches the data to be retrieved includes:
determining the degree of match between the first feature and the second feature;
taking the candidate data corresponding to a target second feature as the data matched with the data to be retrieved, where the target second feature represents a second feature whose matching degree with the first feature meets a preset condition.
According to a second aspect of the embodiments of the present disclosure, there is provided a training method for a cross-modal retrieval network, including:
acquiring sample data to be retrieved, associated sample data matched with the sample data to be retrieved, and non-associated sample data not matched with the sample data to be retrieved; the associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality;
inputting the sample data to be retrieved, the associated sample data and the non-associated sample data into a first sample generation network, a second sample generation network and a third sample generation network in a countermeasure network to obtain sample characteristics of the sample data to be retrieved, associated sample characteristics of the associated sample data and non-associated sample characteristics of the non-associated sample data;
and carrying out countermeasure training on the countermeasure network based on the sample features, the associated sample features and the non-associated sample features to obtain a cross-modal retrieval network.
In an exemplary embodiment, the inputting the sample data to be retrieved, the associated sample data, and the non-associated sample data into a first sample generation network, a second sample generation network, and a third sample generation network in a countermeasure network to obtain a sample feature of the sample data to be retrieved, an associated sample feature of the associated sample data, and a non-associated sample feature of the non-associated sample data includes:
inputting the data to be retrieved of the sample into the first sample generation network, and extracting the sample features based on the first sample generation network;
inputting the associated sample data into the second sample generation network, and extracting the associated sample characteristics according to the second sample generation network;
inputting the non-associated sample data into the third sample generation network, and extracting the non-associated sample features based on the third sample generation network.
In an exemplary embodiment, the performing countermeasure training on the countermeasure network based on the sample feature, the associated sample feature, and the non-associated sample feature to obtain a cross-modal search network includes:
obtaining first loss information based on the sample characteristics, the associated sample characteristics and the non-associated sample characteristics;
inputting the sample characteristics, the associated sample characteristics and the non-associated sample characteristics into a discrimination network in the countermeasure network to obtain second loss information;
training the countermeasure network based on the first loss information and the second loss information to obtain a first generation network, a second generation network and a third generation network; the first generation network is used for extracting the features of data to be retrieved, and the second generation network and the third generation network are used for extracting the features of candidate data;
and generating a cross-modal retrieval network according to the first generation network, the second generation network and the third generation network.
In an exemplary embodiment, the inputting the sample feature, the associated sample feature, and the non-associated sample feature into a discriminant network in the countermeasure network to obtain second loss information includes:
inputting the sample features, the correlated sample features and the uncorrelated sample features into the discriminative network;
discriminating, based on the discrimination network, the matching degree between the sample features and the associated sample features to obtain a first discrimination result, and discriminating the matching degree between the sample features and the non-associated sample features to obtain a second discrimination result;
and obtaining the second loss information according to the first discrimination result and the second discrimination result.
In an exemplary embodiment, the obtaining the second loss information according to the first discrimination result and the second discrimination result includes:
calculating a first logarithm corresponding to the first discrimination result and a second logarithm corresponding to the second discrimination result;
and obtaining the second loss information according to the first logarithm and the second logarithm.
In an exemplary embodiment, there are multiple pieces of sample data to be retrieved, and the method further includes:
determining target sample data to be retrieved from a plurality of sample data to be retrieved;
determining, from the associated sample data, target associated sample data matched with the target sample data to be retrieved, and determining, from the non-associated sample data, target non-associated sample data not matched with the target sample data to be retrieved;
the inputting the sample feature, the associated sample feature and the non-associated sample feature into a discrimination network in the countermeasure network to obtain second loss information includes:
and inputting the sample features of the target sample data to be retrieved, the associated sample features of the target associated sample data, and the non-associated sample features of the target non-associated sample data into the discrimination network to obtain the second loss information.
According to a third aspect of the embodiments of the present disclosure, there is provided a cross-modal retrieval apparatus, including:
the data acquisition module is configured to acquire data to be retrieved and candidate data; the data to be retrieved and the candidate data correspond to different modalities;
a feature extraction module configured to perform extraction of a first feature of the data to be retrieved and a second feature of the candidate data based on a cross-modal retrieval network;
the data retrieval module is configured to retrieve data matched with the data to be retrieved from the candidate data according to the matching degree of the first characteristic and the second characteristic;
the cross-modal retrieval network is obtained by performing countermeasure training, based on sample data to be retrieved, associated sample data matched with the sample data to be retrieved, and non-associated sample data not matched with the sample data to be retrieved, on a first sample generation network corresponding to the sample data to be retrieved, a second sample generation network corresponding to the associated sample data, and a third sample generation network corresponding to the non-associated sample data in a countermeasure network. The associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality.
In an exemplary embodiment, the cross-modality retrieval network includes a first generation network corresponding to the first sample generation network, a second generation network corresponding to the second sample generation network, and a third generation network corresponding to the third sample generation network, and the feature extraction module includes:
an input unit configured to perform input of the data to be retrieved and the candidate data into the cross-modal retrieval network;
a first feature extraction unit configured to perform extraction of the first feature based on the first generated network;
a second feature extraction unit configured to perform extraction of the second feature from the second generation network and the third generation network.
In an exemplary embodiment, the data retrieval module includes:
a matching degree determination unit configured to perform determining the matching degree between the first feature and the second feature;
a matching data determination unit configured to execute candidate data corresponding to a target second feature as the data matching the data to be retrieved; and the target second feature represents a second feature of which the matching degree with the first feature meets a preset condition.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a training apparatus for a cross-modal retrieval network, including:
the sample data acquisition module is configured to acquire sample data to be retrieved, associated sample data matched with the sample data to be retrieved, and non-associated sample data not matched with the sample data to be retrieved; the associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality;
the sample feature determination module is configured to input the sample data to be retrieved, the associated sample data and the non-associated sample data into a first sample generation network, a second sample generation network and a third sample generation network in a countermeasure network to obtain sample features of the sample data to be retrieved, associated sample features of the associated sample data and non-associated sample features of the non-associated sample data;
and the cross-modal retrieval network determining module is configured to execute countermeasure training on the countermeasure network based on the sample features, the associated sample features and the non-associated sample features to obtain a cross-modal retrieval network.
In an exemplary embodiment, the sample characteristic determination module includes:
a sample feature extraction unit configured to perform input of the sample data to be retrieved into the first sample generation network, and extract the sample feature based on the first sample generation network;
an associated sample feature extraction unit configured to perform input of the associated sample data into the second sample generation network, and extract the associated sample feature according to the second sample generation network;
and the non-associated sample feature extraction unit is configured to input the non-associated sample data into the third sample generation network and extract the non-associated sample features based on the third sample generation network.
In an exemplary embodiment, the cross-modal search network determination module includes:
a first loss information determination unit configured to perform deriving first loss information based on the sample feature, the associated sample feature, and the non-associated sample feature;
a second loss information determination unit configured to perform a discriminant network inputting the sample feature, the associated sample feature, and the non-associated sample feature into the countermeasure network, resulting in second loss information;
a training unit configured to perform training of the countermeasure network based on the first loss information and the second loss information, resulting in a first generation network, a second generation network, and a third generation network; the first generation network is used for extracting the features of data to be retrieved, and the second generation network and the third generation network are used for extracting the features of candidate data;
a cross-modal search network generation unit configured to perform generating a cross-modal search network from the first generated network, the second generated network, and the third generated network.
In an exemplary embodiment, the second loss information determining unit includes:
a sample feature input subunit configured to perform input of the sample feature, the associated sample feature, and the non-associated sample feature into the discriminant network;
a discrimination result determination subunit configured to perform discrimination of a matching degree between the sample feature and the associated sample feature based on the discrimination network to obtain a first discrimination result, and discrimination of a matching degree between the sample feature and the non-associated sample feature to obtain a second discrimination result;
a second loss information determination subunit configured to perform obtaining the second loss information according to the first and second discrimination results.
In an exemplary embodiment, the second loss information determining subunit includes:
a logarithm determination submodule configured to perform calculation of a first logarithm corresponding to the first discrimination result and a second logarithm corresponding to the second discrimination result;
a second loss information determination sub-module configured to perform obtaining the second loss information according to the first logarithm and the second logarithm.
In an exemplary embodiment, there are multiple pieces of sample data to be retrieved, and the apparatus further includes:
a first determination module configured to determine target sample data to be retrieved from the multiple pieces of sample data to be retrieved;
a second determination module configured to determine, from the associated sample data, target associated sample data matched with the target sample data to be retrieved, and to determine, from the non-associated sample data, target non-associated sample data not matched with the target sample data to be retrieved;
the second loss information determining unit is configured to input the sample characteristics of the target sample to-be-retrieved data, the associated sample characteristics of the target associated sample data, and the non-associated sample characteristics of the target non-associated sample data into the discrimination network, so as to obtain the second loss information.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a cross-modal retrieval method or a training method of a cross-modal retrieval network as described in any of the above embodiments.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the cross-modal retrieval method or the training method of the cross-modal retrieval network according to any of the above embodiments.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer program product, including a computer program, which when executed by a processor implements the cross-modal search method or the training method of the cross-modal search network described in any of the above embodiments.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the cross-modal retrieval network is obtained by performing countermeasure training on a first sample generation network corresponding to the data to be retrieved, a second sample generation network corresponding to the associated sample data and a third sample generation network corresponding to the non-associated sample data based on the data to be retrieved, the associated sample data matched with the data to be retrieved and the non-associated sample data not matched with the data to be retrieved. The cross-modal retrieval network obtained by training the framework can accurately capture local information of input data to be retrieved and candidate data, and output more effective characteristics, wherein the more effective characteristics are more distinguishable between the same modal and more distinguishable between different modal, so that the cross-modal retrieval performance of fine granularity is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating an implementation environment of a cross-modal search method or a training method of a cross-modal search network according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a cross-modal retrieval method according to an exemplary embodiment.
Fig. 3 is a flow diagram illustrating extraction of a first feature and a second feature according to an example embodiment.
Fig. 4 is a flowchart illustrating a method for retrieving data matching data to be retrieved from candidate data according to a matching degree of a first feature and a second feature according to an exemplary embodiment.
FIG. 5 is a flow diagram illustrating a method of training a cross-modal retrieval network, according to an example embodiment.
Fig. 6 is a flowchart illustrating a method for obtaining sample characteristics of sample data to be retrieved, associated sample characteristics of associated sample data, and non-associated sample characteristics of non-associated sample data according to an exemplary embodiment.
FIG. 7 is a flow diagram illustrating a method of training a cross-modal retrieval network, in accordance with an exemplary embodiment.
FIG. 8 is a flow diagram illustrating a method for obtaining a cross-modal search network in accordance with an exemplary embodiment.
Fig. 9 is a flow chart illustrating inputting sample features, associated sample features, and non-associated sample features into a discriminative network in a countermeasure network to obtain second loss information according to an example embodiment.
FIG. 10 is a flowchart illustrating a method for determining target sample data to be retrieved, target associated sample data, and target non-associated sample data, according to an example embodiment.
FIG. 11 is a block diagram illustrating a cross-modal retrieval device, according to an example embodiment.
FIG. 12 is a block diagram of a training apparatus for a cross-modal retrieval network, according to an example embodiment.
FIG. 13 is a block diagram illustrating an electronic device for cross-modality retrieval or training of a cross-modality retrieval network, according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a schematic diagram illustrating an implementation environment of a cross-modal retrieval method or a training method of a cross-modal retrieval network according to an exemplary embodiment. As shown in fig. 1, the implementation environment may include at least a terminal 01 and a server 02, and the terminal 01 and the server 02 may be directly or indirectly connected through wired or wireless communication, which is not limited by the disclosure.
In particular, the terminal may be used to collect data to be retrieved. Alternatively, the terminal 01 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, a smart watch, and the like, but is not limited thereto.
Specifically, the server 02 may be configured to train a cross-modal search network, and to search out data matching the data to be searched from the candidate data based on the cross-modal search network. Optionally, the server 02 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
It should be noted that fig. 1 is only an example. In other scenarios, other implementation environments may also be included, for example, the implementation environment may include a terminal, a cross-modal search network is obtained through training of the terminal, and the terminal is configured to search out data matching the data to be searched from the candidate data based on the cross-modal search network.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
In order to better understand the cross-modal retrieval method provided by the embodiments of the present disclosure, the technical scenarios to which it can be applied are first introduced. Specifically, the cross-modal retrieval method of the embodiments of the present disclosure may be applied to the following scene A, scene B, scene C, and scene D, although it is not limited to these four application scenes. The four scenes are briefly introduced below.
Scene A:
Retrieval and matching between text data and video data. With the cross-modal retrieval method, the text to be retrieved can be used as the query input to automatically retrieve a video matching that text, and the video is recommended to the terminal account.
Scene B:
Retrieval and matching between image data and text data. With the cross-modal retrieval method, the image to be retrieved can be used as the query input to automatically retrieve a text description matching that image, and the text description is recommended to the terminal account.
Scene C:
Retrieval and matching between image data and voice data, such as smart voice search on a mobile phone. With the cross-modal retrieval method, the terminal account's voice can be used as the query input, retrieval and matching are carried out over the image content in the mobile phone album, and an image matching the voice is automatically returned.
Scene D:
Retrieval and matching between video data and music data, such as scoring entertaining short videos or advertising videos. With the cross-modal retrieval method, the video to be retrieved can be used as the query to automatically find the music that best matches it and make personalized recommendations, for example by matching the video's motion and shot transitions to the music's beat.
FIG. 2 is a flow diagram illustrating a cross-modal retrieval method according to an exemplary embodiment. As shown in fig. 2, the method is used in the system comprising the terminal and the server in fig. 1, and comprises the following steps.
In step S11, data to be retrieved and candidate data are acquired; the data to be retrieved and the candidate data correspond to different modalities.
Optionally, the data to be retrieved may be used as a search term, and data matching the data to be retrieved is retrieved from the candidate data. Illustratively, the modality of the data to be retrieved includes, but is not limited to: text, video, images, voice, etc. The modalities of the candidate data include, but are not limited to: text, video, images, voice, etc., but in a different modality than the data to be retrieved. For example, if the modality of the data to be retrieved is text, the modality of the candidate data may be video, image, voice, etc. If the modality of the data to be retrieved is an image, the modality of the candidate data can be text, video, voice and the like.
In an embodiment, in step S11, the terminal account may send a data retrieval request to the server, where the data retrieval request may carry the data to be retrieved and a selected target modality from which data is to be retrieved. In response to the data retrieval request, the server may obtain, from the database, candidate data corresponding to the target modality. By acquiring candidate data in response to a data retrieval request sent by the terminal account, personalized recommendation for the terminal account can be achieved.
In another embodiment, in step S11, the server may further automatically obtain data browsed in history of the terminal account, use the data browsed in history as data to be retrieved, and automatically obtain candidate data with a modality different from that of the data to be retrieved from the database. Because the data browsed historically may be data interested by the terminal account, the server can automatically recommend the data of interest to the terminal account in a mode of automatically acquiring candidate data according to the data browsed historically, and experience of the terminal account is improved.
In step S13, the first feature of the data to be retrieved and the second feature of the candidate data are extracted based on the cross-modal retrieval network. The cross-modal retrieval network is obtained by performing countermeasure training, based on sample data to be retrieved, associated sample data matched with the sample data to be retrieved, and non-associated sample data not matched with the sample data to be retrieved, on a first sample generation network corresponding to the sample data to be retrieved, a second sample generation network corresponding to the associated sample data, and a third sample generation network corresponding to the non-associated sample data in the countermeasure network. The associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality.
In the embodiment of the disclosure, the sample data to be retrieved, the associated sample data matched with the sample data to be retrieved, and the non-associated sample data not matched with the sample data to be retrieved may be obtained in advance, where the modalities of the associated sample data and the non-associated sample data both differ from that of the sample data to be retrieved, and the modality of the associated sample data is the same as that of the non-associated sample data. Countermeasure training is performed on the first sample generation network, the second sample generation network and the third sample generation network based on the sample data to be retrieved, the associated sample data and the non-associated sample data to obtain the cross-modal retrieval network, where the first sample generation network takes the sample data to be retrieved as input, the second sample generation network takes the associated sample data as input, and the third sample generation network takes the non-associated sample data as input.
Optionally, the cross-modal search network obtained by the training method includes a first generation network corresponding to the first sample generation network, a second generation network corresponding to the second sample generation network, and a third generation network corresponding to the third sample generation network. The first generation network may be used to extract features of data to be retrieved, and the second and third generation networks may be used to extract features of candidate data.
Fig. 3 is a flow diagram illustrating extraction of a first feature and a second feature according to an example embodiment. As shown in fig. 3, in an alternative embodiment, in step S13, the extracting a first feature of the data to be retrieved and a second feature of the candidate data based on the cross-modal search network includes:
in step S131, the data to be retrieved and the candidate data are input into the cross-modal search network.
In step S133, the first feature is extracted based on the first generated network.
In step S135, the second feature is extracted from the second generation network and the third generation network.
Optionally, after the data to be retrieved is input into the cross-modal retrieval network, a first feature of the data to be retrieved may be extracted through the first generation network, and a second feature of the candidate data may be extracted through the second generation network and the third generation network.
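As an illustrative sketch only (the class and method names below, such as CrossModalRetrievalNet and encode_query, are hypothetical and not taken from this disclosure; the disclosure also does not specify how the outputs of the second and third generation networks are combined, so the averaging used here is one plausible assumption), the three-tower forward pass described above might look like this:

```python
import torch
import torch.nn as nn

class CrossModalRetrievalNet(nn.Module):
    """Hypothetical three-tower network: one query tower (first generation
    network) and two candidate towers (second and third generation networks).
    Simple MLPs stand in for the modality-specific encoders."""

    def __init__(self, query_dim: int, cand_dim: int, feat_dim: int = 256):
        super().__init__()
        def tower(in_dim: int) -> nn.Module:
            return nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))
        self.first_gen = tower(query_dim)   # extracts the first feature
        self.second_gen = tower(cand_dim)   # candidate tower for matching data
        self.third_gen = tower(cand_dim)    # candidate tower for non-matching data

    def encode_query(self, query: torch.Tensor) -> torch.Tensor:
        # Step S133: first feature, extracted by the first generation network.
        return self.first_gen(query)

    def encode_candidate(self, cand: torch.Tensor) -> torch.Tensor:
        # Step S135: second feature, extracted from the second and third
        # generation networks (averaged here; the fusion rule is an assumption).
        return 0.5 * (self.second_gen(cand) + self.third_gen(cand))
```

With such a network, one first feature per query and one second feature per candidate are obtained, ready for the matching-degree computation in step S15.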
The second generation network corresponds to the second sample generation network, which takes the associated sample data as input during training; the second generation network can therefore be used to extract features of data that matches the data to be retrieved and whose modality differs from that of the data to be retrieved. Likewise, the third generation network corresponds to the third sample generation network, which takes the non-associated sample data as input.
If a certain piece of candidate data matches the data to be retrieved, the second generation network can extract its second feature, or a second feature with higher identifiability, while the third generation network cannot extract a feature for it, or can only extract a second feature with lower identifiability.
Similarly, if a certain piece of candidate data does not match the data to be retrieved, the third generation network can extract its second feature, or a second feature with higher identifiability, while the second generation network cannot extract its second feature, or can only extract a second feature with lower identifiability.
For example, assume the candidate data includes candidate data 1 and candidate data 2, where candidate data 1 matches the data to be retrieved and candidate data 2 does not. Candidate data 1 and candidate data 2 are each input to the second generation network and the third generation network. The second generation network can extract a highly identifiable second feature for candidate data 1 but not for candidate data 2 (or only a weakly identifiable one), while the third generation network can extract a highly identifiable second feature for candidate data 2 but not for candidate data 1 (or only a weakly identifiable one).
Illustratively, the first feature or the second feature may be a one-dimensional feature vector, which may also be referred to as a vector characterization.
It can be seen that the cross-modal retrieval network in the embodiment of the present disclosure is a three-tower structure built from the first generation network, the second generation network and the third generation network. For the data to be retrieved, a finer-grained first feature can be extracted; for the candidate data, whether or not it matches the data to be retrieved, a more effective second feature can be extracted through the second generation network and the third generation network. These more effective features are more distinctive within the same modality and more identifiable across different modalities, thereby improving the performance and accuracy of fine-grained cross-modal retrieval.
In step S15, data matching the data to be retrieved is retrieved from the candidate data based on the matching degree between the first feature and the second feature.
In the embodiment of the present disclosure, after the first feature and the second feature are obtained, a matching degree between the first feature and the second feature may be calculated, and data matching the data to be retrieved may be retrieved from the candidate data according to a result of the calculation of the matching degree.
Fig. 4 is a flowchart illustrating a method for retrieving data matching data to be retrieved from candidate data according to a matching degree of a first feature and a second feature according to an exemplary embodiment. As shown in fig. 4, in an alternative embodiment, in the step S15, the retrieving, from the candidate data, data that matches the data to be retrieved according to the matching degree of the first feature and the second feature may include:
in step S151, the degree of matching between the first feature and the second feature is determined.
In step S153, using the candidate data corresponding to the target second feature as the data matched with the data to be retrieved; the target second feature represents a second feature whose matching degree with the first feature satisfies a preset condition.
Alternatively, in the step S151, the present disclosure may calculate the matching degree between the first feature and the second feature in various ways, which is not limited in detail herein.
In one embodiment, a cosine similarity between the first feature and the second feature may be calculated to obtain the degree of matching.
In another embodiment, the matching degree may also be obtained by calculating the euclidean distance between the first feature and the second feature.
In a third embodiment, a Pearson correlation coefficient between the first feature and the second feature may be calculated to characterize the matching degree: the larger the Pearson correlation coefficient, the higher the matching degree, and conversely, the lower.
Optionally, in the step S153, the present disclosure may search the candidate data for data matching the data to be searched in various ways, which is not limited herein.
In one embodiment, the second feature with the highest matching degree with the first feature may be used as the target second feature, and the candidate data corresponding to the target second feature may be used as the data matched with the data to be retrieved.
In another embodiment, the second features may be ranked according to their matching degree with the first feature to obtain a second feature sequence, the top preset number of second features in the sequence may be used as the target second features, and the candidate data corresponding to the target second features used as the data matched with the data to be retrieved.
In a third embodiment, a matching-degree threshold may be set, second features whose matching degree with the first feature exceeds the threshold may be used as the target second features, and the candidate data corresponding to them used as the data matched with the data to be retrieved.
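A minimal sketch of the matching-degree measures and selection strategies listed above, using NumPy (function names such as matching_degree and retrieve are illustrative and not from the disclosure):

```python
import numpy as np

def matching_degree(first_feat, second_feat, method="cosine"):
    """Matching degree between a first feature and a second feature."""
    f = np.asarray(first_feat, dtype=float)
    s = np.asarray(second_feat, dtype=float)
    if method == "cosine":          # first embodiment: cosine similarity
        return float(f @ s / (np.linalg.norm(f) * np.linalg.norm(s)))
    if method == "euclidean":       # second embodiment: Euclidean distance
        return float(-np.linalg.norm(f - s))  # negated so larger = better match
    if method == "pearson":         # third embodiment: Pearson correlation
        return float(np.corrcoef(f, s)[0, 1])
    raise ValueError(f"unknown method: {method}")

def retrieve(first_feat, second_feats, candidates, top_k=None, threshold=None):
    """Return candidates whose target second features satisfy the preset
    condition: highest score, top-k ranking, or a matching-degree threshold."""
    scores = [matching_degree(first_feat, s) for s in second_feats]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if threshold is not None:
        order = [i for i in order if scores[i] > threshold]
    if top_k is not None:
        order = order[:top_k]
    return [candidates[i] for i in order]
```

Calling retrieve with top_k=1 corresponds to the first embodiment, top_k=n to the second, and threshold alone to the third.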
Here, data matched with the data to be retrieved means data whose content actually corresponds to the data to be retrieved. For example, if the data to be retrieved is "usage tips for badminton rackets", a retrieved video on "usage tips for xxx-brand badminton rackets" is considered a match, whereas a video about "xxx-brand badminton rackets that covers not usage tips but a purchasing guide" is considered a mismatch.
In the embodiment of the disclosure, the first feature is extracted through the first generation network of the three-tower structure, and the second feature is extracted through the second generation network and the third generation network of the three-tower structure. If the candidate data matches the data to be retrieved, the second generation network can extract a second feature more related to the first feature, so the matching degree between the second feature and the first feature is higher; if the candidate data does not match the data to be retrieved, the third generation network can extract a second feature less related to the first feature, so the matching degree between the second feature and the first feature is lower. When the matching degree is calculated, matching candidate data therefore scores clearly higher than non-matching candidate data, so the data matched with the data to be retrieved can be determined accurately, improving the performance and accuracy of fine-grained cross-modal retrieval.
Taking a scene of a text retrieval video as an example, the beneficial effects obtained by the embodiment of the present disclosure are explained:
suppose the data to be retrieved is "xxx brand badminton racket use skill".
The candidate data are:
video 1: the introduction is of badminton racket, skill in use, but not xxx brand.
Video 2: xxx-brand badminton racquets are described, but are not skills in use, but are purchasing guidelines.
And 3, video 3: the use skill of xxx-brand badminton racket is introduced.
Because the cross-modal retrieval network disclosed by the invention can output the feature vectors with better quality, the same modality is more distinguishable, and different modalities are more distinguishable, the result retrieved by using the cross-modal retrieval method disclosed by the embodiment of the invention is the video 3, the correlation between the retrieval result and the data to be retrieved is higher, and the retrieval effect is better.
Because the representation obtained by the related technology lacks fine-grained information, the correlation of the three videos cannot be well distinguished, the retrieved video is video 1 or video 2, the correlation between the retrieval result and the data to be retrieved is low, and the retrieval effect is poor.
FIG. 5 is a flow diagram illustrating a method of training across modal search networks, according to an example embodiment. As shown in fig. 5, may include:
in step S21, sample data to be retrieved, associated sample data that matches the sample data to be retrieved, and non-associated sample data that does not match the sample data to be retrieved are obtained.
Optionally, the associated sample data matched with the sample data to be retrieved refers to data that has a different modality from the sample data to be retrieved and can be matched with it. The non-associated sample data not matched with the sample data to be retrieved refers to data that has a different modality from the sample data to be retrieved and cannot be matched with it. For example, if the sample data to be retrieved is the text "Peppa Pig", the associated sample data may be the videos "Peppa Pig Season 1", "Peppa Pig Season 2", "Peppa Pig Season 3", and the like, and the non-associated sample data may be the videos "Snow White", "Ultraman", and the like.
For example, a data set composed of sample data to be retrieved and associated sample data (i.e., related pairs) and a data set composed of sample data to be retrieved and non-associated sample data (i.e., unrelated pairs) may be prepared; the related pairs and unrelated pairs together may be referred to as data set A.
Taking the sample data to be retrieved as text and the associated and non-associated sample data as videos, data set A can be described as follows.
Assuming that the sample data to be retrieved includes the texts "Peppa Pig" and "Ultraman", the related pairs may be:
"Peppa Pig" text - "Peppa Pig Season 1" video, "Peppa Pig" text - "Peppa Pig Season 2" video, "Peppa Pig" text - "Peppa Pig Season 3" video, "Ultraman" text - "Ultraman Orb" video, "Ultraman" text - "Ultraman Belial" video, and the like.
The unrelated pairs may be:
"Peppa Pig" text - "Ultraman" video, "Peppa Pig" text - "Snow White" video, "Ultraman" text - "Peppa Pig" video, "Ultraman" text - "Logger Vick" video, and the like.
The related pairs and unrelated pairs above constitute data set A. It can be seen that data set A may include a plurality of different texts, together with associated videos that match each text and non-associated videos that do not.
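A hedged illustration of how data set A might be organized in code; the triplet layout and field names are assumptions for readability, not a format prescribed by the disclosure:

```python
# Each entry groups one text query with a related video and an unrelated
# video, mirroring the related/unrelated pairs of data set A above.
dataset_a = [
    {"text": "Peppa Pig", "related_video": "Peppa Pig Season 1",
     "unrelated_video": "Ultraman"},
    {"text": "Peppa Pig", "related_video": "Peppa Pig Season 2",
     "unrelated_video": "Snow White"},
    {"text": "Ultraman", "related_video": "Ultraman Orb",
     "unrelated_video": "Peppa Pig"},
    {"text": "Ultraman", "related_video": "Ultraman Belial",
     "unrelated_video": "Logger Vick"},
]
```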
In step S23, the sample data to be retrieved, the associated sample data, and the non-associated sample data are input into the first sample generation network, the second sample generation network, and the third sample generation network in the countermeasure network, so as to obtain the sample features of the sample data to be retrieved, the associated sample features of the associated sample data, and the non-associated sample features of the non-associated sample data.
Optionally, the countermeasure network includes a first sample generation network corresponding to the sample data to be retrieved, a second sample generation network corresponding to the associated sample data, and a third sample generation network corresponding to the non-associated sample data. The three kinds of sample data can be input into the countermeasure network, and feature vectors are extracted from each kind of sample data through the corresponding generation network to obtain the sample features corresponding to each of the three kinds of sample data.
Taking the sample data to be retrieved as text and the associated and non-associated sample data as videos, the first sample generation network may be a Bidirectional Encoder Representations from Transformers (BERT) model or another model that can extract text features and be trained by back propagation. The second sample generation network and the third sample generation network may be models that extract video features and can be trained by back propagation, such as a C3D video recognition network or another three-dimensional convolutional neural network (3D CNN).
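For instance, under the assumption that a pretrained BERT serves as the first sample generation network and an off-the-shelf 3D CNN stands in for the C3D-style video towers (r3d_18 from torchvision is used below purely because it is readily available; the checkpoint name and input shapes are assumptions), the three encoders might be instantiated as follows:

```python
import torch
from transformers import BertModel, BertTokenizer
from torchvision.models.video import r3d_18

# First sample generation network: BERT text encoder.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text_encoder = BertModel.from_pretrained("bert-base-chinese")

# Second and third sample generation networks: 3D CNN video encoders.
assoc_video_encoder = r3d_18(weights=None)       # tower for associated videos
non_assoc_video_encoder = r3d_18(weights=None)   # tower for non-associated videos

tokens = tokenizer("Peppa Pig", return_tensors="pt")
sample_feat = text_encoder(**tokens).pooler_output   # (1, 768) text feature
clip = torch.randn(1, 3, 16, 112, 112)               # dummy (N, C, T, H, W) clip
assoc_feat = assoc_video_encoder(clip)               # (1, 400) video feature
```

In practice, projection layers would map the text and video features into a representation space of the same dimension, as the background section describes.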
Optionally, the countermeasure network may be a Generative Adversarial Network (GAN), which is a deep learning model. The model comprises at least two modules: one is a generation network (also referred to as a generative network in the embodiments of the present disclosure), and the other is a discrimination network (also referred to as a discriminative network in the embodiments of the present disclosure); the two modules learn through a game against each other, thereby producing better output. Both the generative model and the discriminative model may be neural networks, specifically deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows, taking a GAN that generates pictures as an example. Assume there are two networks, G (generator) and D (discriminator). G is a network that generates pictures: it receives random noise z and generates a picture from this noise, denoted G(z). D is a discrimination network that judges whether a picture is "real": its input is a picture, and its output is the probability that the picture is real, where a probability of 1 means the picture is certainly real and a probability of 0 means it cannot be real. During training of the generative adversarial network, the goal of the generation network G is to generate pictures real enough to deceive the discrimination network D, while the goal of D is to distinguish the pictures generated by G from real pictures as well as possible. Thus, G and D constitute a dynamic game process, which is the "adversarial" in "generative adversarial network". In the ideal outcome of the game, G can generate pictures G(z) realistic enough to pass for real, and D can no longer easily judge whether they are real, i.e., D(G(z)) = 0.5. The result is an excellent generative model G that can be used to generate pictures.
Illustratively, the discrimination network may be a model that can be trained by back propagation, such as a Multilayer Perceptron (MLP).
Back propagation can adjust the values of the parameters in the initial sample generation network during training, so that the reconstruction error loss of the initial sample generation network becomes smaller and smaller. Specifically, an error loss arises when an input signal is propagated forward to the output, and the parameters in the initial sample generation network are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, aiming to obtain the optimal parameters of the target sample generation network, such as its weight matrix.
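As a generic illustration of this forward-then-backward flow (not code from this disclosure; the network, loss, and optimizer are placeholders), a single parameter update might look like:

    import torch
    import torch.nn as nn

    net = nn.Linear(8, 1)  # stand-in for an initial sample generation network
    opt = torch.optim.SGD(net.parameters(), lr=0.1)

    x, target = torch.randn(4, 8), torch.randn(4, 1)
    loss = nn.functional.mse_loss(net(x), target)  # error loss from the forward pass
    opt.zero_grad()
    loss.backward()  # back-propagate the error loss information
    opt.step()       # update the parameters, e.g. the weight matrix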
Fig. 6 is a flowchart illustrating a method for obtaining sample characteristics of sample data to be retrieved, associated sample characteristics of associated sample data, and non-associated sample characteristics of non-associated sample data according to an exemplary embodiment. As shown in fig. 6, in an optional embodiment, in the step S23, the inputting the sample data to be retrieved, the associated sample data, and the non-associated sample data into the first sample generation network, the second sample generation network, and the third sample generation network in the countermeasure network to obtain the sample characteristics of the sample data to be retrieved, the associated sample characteristics of the associated sample data, and the non-associated sample characteristics of the non-associated sample data may include:
in step S231, the sample data to be retrieved is input into the first sample generation network, and the sample features are extracted based on the first sample generation network.
In step S233, the associated sample data is input into the second sample generation network, and the associated sample features are extracted based on the second sample generation network.
In step S235, the non-associated sample data is input into the third sample generation network, and the non-associated sample features are extracted based on the third sample generation network.
FIG. 7 is a flow diagram illustrating a method of training a cross-modal retrieval network according to an exemplary embodiment. As shown in fig. 7, the sample data to be retrieved can be input into the first sample generation network corresponding to the sample data to be retrieved, and the sample features can be extracted based on the first sample generation network. The associated sample data is input into the second sample generation network corresponding to the associated sample data, and the associated sample features are extracted according to the second sample generation network. The non-associated sample data is input into the third sample generation network corresponding to the non-associated sample data, and the non-associated sample features are extracted based on the third sample generation network.
For example, the sample feature, the associated sample feature, and the non-associated sample feature may be one-dimensional feature vectors, which may also be referred to as vector characterizations.
The embodiment of the disclosure provides a three-tower structure comprising a first sample generation network, a second sample generation network, and a third sample generation network. For the sample data to be retrieved, finer-grained sample features can be extracted through the first sample generation network; for the associated sample data, finer-grained associated sample features can be extracted through the second sample generation network; and for the non-associated sample data, finer-grained non-associated sample features can be extracted through the third sample generation network. In this way, identifiable features can always be extracted for the sample features, the associated sample features, and the non-associated sample features, and the training difficulty on hard negative samples can be reduced, so that data of different modalities become more identifiable and data of the same modality become more distinguishable, thereby improving the training precision of cross-modal retrieval.
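A minimal sketch of such a three-tower feature extractor is given below. The linear layers are simple stand-ins for the BERT and 3D-CNN backbones mentioned above, and all dimensions are assumptions made only for illustration:

    import torch
    import torch.nn as nn

    class ThreeTower(nn.Module):
        """Three-tower extractor: the first tower encodes the sample text,
        the second the associated video, the third the non-associated video.
        Linear layers stand in for BERT / 3D-CNN backbones."""
        def __init__(self, text_dim=768, video_dim=2048, out_dim=128):
            super().__init__()
            self.text_tower = nn.Linear(text_dim, out_dim)
            self.pos_video_tower = nn.Linear(video_dim, out_dim)
            self.neg_video_tower = nn.Linear(video_dim, out_dim)

        def forward(self, text, pos_video, neg_video):
            q = self.text_tower(text)                # sample feature
            v_pos = self.pos_video_tower(pos_video)  # associated sample feature
            v_neg = self.neg_video_tower(neg_video)  # non-associated sample feature
            return q, v_pos, v_neg

    model = ThreeTower()
    q, v_pos, v_neg = model(torch.randn(4, 768), torch.randn(4, 2048), torch.randn(4, 2048))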
In step S25, the countermeasure network is adversarially trained based on the sample features, the associated sample features, and the non-associated sample features, so as to obtain a cross-modal retrieval network.
In the embodiment of the present disclosure, after the sample features, the associated sample features, and the non-associated sample features are obtained, the countermeasure network may be adversarially trained based on these three kinds of sample features until the countermeasure network converges, so as to obtain the cross-modal retrieval network.
Optionally, the embodiment of the present disclosure may train the countermeasure network in various ways, which are not specifically limited herein.
FIG. 8 is a flow diagram illustrating a method for obtaining a cross-modal search network in accordance with an exemplary embodiment. As shown in fig. 8, in an optional implementation, in the step S25, the performing countermeasure training on the countermeasure network based on the sample feature, the associated sample feature, and the non-associated sample feature to obtain the cross-modal search network may include:
in step S251, first loss information is obtained based on the sample feature, the correlated sample feature, and the non-correlated sample feature.
Alternatively, the first loss information may be a contrastive loss. The contrastive loss is a dimensionality-reduction learning method that can learn a mapping relationship under which points of the same category that are far apart in the high-dimensional space become closer after being mapped to the low-dimensional space, while points of different categories that are close together become farther apart after mapping. As a result, in the low-dimensional space, points of the same kind show a clustering effect, and points of different kinds are separated.
In a possible implementation manner, in the above step S251, the first loss information L1 may be calculated by the following formula:

L1 = L(q, v+, v-) = -log( exp(sim(q, v+)/τ) / ( exp(sim(q, v+)/τ) + exp(sim(q, v-)/τ) ) );

wherein L1 refers to the first loss information, q refers to the sample feature, v+ refers to the associated sample feature, v- refers to the non-associated sample feature, sim denotes a similarity function between two vectors, and τ refers to a hyperparameter.
In another possible embodiment, corresponding weights a, b, and c can also be set for q, v+, and v- and substituted into the above formula, i.e., q is replaced by q·a (the product of q and a), v+ is replaced by v+·b (the product of v+ and b), and v- is replaced by v-·c (the product of v- and c).
In a third possible implementation, another function representing the similarity between vectors may also be designed to obtain the first loss information; for example, cosine similarity may be used to obtain the first loss information.
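Assuming the form given above, with dot-product similarity for sim and a temperature τ (cosine similarity could be substituted, as just noted), the first loss might be sketched in PyTorch as follows; the temperature value is an arbitrary assumption:

    import torch

    def first_loss(q, v_pos, v_neg, tau=0.07):
        """Contrastive loss L1: pulls (q, v+) together and pushes (q, v-) apart."""
        s_pos = (q * v_pos).sum(dim=-1) / tau  # similarity to the associated feature
        s_neg = (q * v_neg).sum(dim=-1) / tau  # similarity to the non-associated feature
        logits = torch.stack([s_pos, s_neg], dim=-1)
        # -log( exp(s_pos) / (exp(s_pos) + exp(s_neg)) ), averaged over the batch
        return -torch.log_softmax(logits, dim=-1)[:, 0].mean()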
In step S253, the sample feature, the associated sample feature, and the non-associated sample feature are input to a discrimination network in the countermeasure network, and second loss information is obtained.
Optionally, the second loss information may be obtained in various ways in the embodiments of the present disclosure, and is not specifically limited herein.
Fig. 9 is a flow chart illustrating inputting sample features, associated sample features, and non-associated sample features into a discriminative network in a countermeasure network to obtain second loss information according to an example embodiment. As shown in fig. 9, in an optional implementation manner, in the step S253, the inputting the sample feature, the associated sample feature, and the non-associated sample feature into a discrimination network in the countermeasure network to obtain the second loss information may include:
in step S2531, the sample feature, the correlated sample feature, and the non-correlated sample feature are input to the discrimination network.
In one approach, continuing with fig. 7, the sample features may be combined pairwise with the associated sample features, and the sample features combined pairwise with the non-associated sample features, and the resulting pairs input into the discrimination network.
In another mode, the sample features, the associated sample features, and the non-associated sample features can be directly input into the discrimination network.

In step S2533, the degree of matching between the sample feature and the associated sample feature is discriminated based on the discrimination network to obtain a first discrimination result, and the degree of matching between the sample feature and the non-associated sample feature is discriminated to obtain a second discrimination result.
In step S2535, the second loss information is obtained based on the first determination result and the second determination result.
In an exemplary embodiment, in step S2535, the obtaining second loss information according to the first and second determination results may include:
and calculating a first logarithm corresponding to the first judgment result and a second logarithm corresponding to the second judgment result.
And obtaining the second loss information according to the first logarithm and the second logarithm.
In one aspect, in the above steps S2531 to S2535, the second loss information L2 may be calculated by the following formula:

L2 = L(q, v+, v-) = -log D(q, v+) - log(1 - D(q, v-));

wherein L2 refers to the second loss information, q refers to the sample feature, v+ refers to the associated sample feature, v- refers to the non-associated sample feature, D(q, v+) refers to the first discrimination result, D(q, v-) refers to the second discrimination result, log D(q, v+) refers to the first logarithm, and log(1 - D(q, v-)) refers to the second logarithm.
In another mode, weights c and d can be set for D(q, v+) and D(q, v-) respectively, with D(q, v+)·c (the product of D(q, v+) and c) replacing D(q, v+) in the above formula, and D(q, v-)·d (the product of D(q, v-) and d) replacing D(q, v-) in the above formula.
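A sketch of this second loss is given below; the pair discriminator D here is an assumed small MLP that scores a concatenated (sample feature, video feature) pair, in line with the pairwise-input mode described above:

    import torch
    import torch.nn as nn

    # Assumed pair discriminator: scores a concatenated 256-d pair of 128-d features.
    D = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

    def second_loss(q, v_pos, v_neg, eps=1e-8):
        """L2 = -log D(q, v+) - log(1 - D(q, v-))."""
        d_pos = D(torch.cat([q, v_pos], dim=-1))  # first discrimination result
        d_neg = D(torch.cat([q, v_neg], dim=-1))  # second discrimination result
        return (-torch.log(d_pos + eps) - torch.log(1 - d_neg + eps)).mean()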
According to the embodiment of the disclosure, the second loss information is calculated from the first discrimination result between the sample feature and the associated sample feature and the second discrimination result between the sample feature and the non-associated sample feature, so that the degree of matching between the sample feature and the associated sample feature and the degree of matching between the sample feature and the non-associated sample feature can be fully considered. This makes data of different modalities more identifiable, improves the accuracy with which the second loss information is determined, and thereby improves the training precision of cross-modal retrieval. In addition, since taking the logarithm does not change the nature or correlation of the data, but compresses the scale of the variable and reduces the absolute value of the data, which facilitates calculation, computing the second loss information from the first logarithm corresponding to the first discrimination result and the second logarithm corresponding to the second discrimination result not only ensures the accuracy of the second loss information, but also improves its calculation efficiency.
In step S255, training the countermeasure network based on the first loss information and the second loss information to obtain a first generation network, a second generation network, and a third generation network; the first generation network is used for extracting the characteristics of data to be retrieved, and the second generation network and the third generation network are used for extracting the characteristics of candidate data.
In step S257, a cross-modal search network is generated based on the first generation network, the second generation network, and the third generation network.
In one embodiment, in the steps S255 to S257, a sum of the first loss information and the second loss information may be used as total loss information, a countermeasure network may be trained based on the total loss information, the model may be optimized by a gradient descent algorithm until the countermeasure network converges to obtain a first generation network, a second generation network, and a third generation network, and the first generation network, the second generation network, and the third generation network may be used as the cross-modal search network.
In another embodiment, in the above steps S255 to S257, the first loss information and the second loss information may be weighted to obtain the total loss information, and the weighting formula may be as follows: L = A·L1 + B·L2; wherein L is the total loss information, A is the weight of L1, and B is the weight of L2. The countermeasure network is trained according to the total loss information, and the model is optimized by a gradient descent algorithm until the countermeasure network converges, so as to obtain the first generation network, the second generation network, and the third generation network, which are used as the cross-modal retrieval network.
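Combining the two losses under the weighted scheme, one possible training step could look like the sketch below, which reuses the ThreeTower model and the first_loss / second_loss functions from the earlier sketches; the weights A and B and the optimizer are illustrative assumptions:

    import torch

    A, B = 1.0, 0.5  # assumed loss weights
    opt = torch.optim.Adam(list(model.parameters()) + list(D.parameters()), lr=1e-4)

    def train_step(text, pos_video, neg_video):
        q, v_pos, v_neg = model(text, pos_video, neg_video)
        total = A * first_loss(q, v_pos, v_neg) + B * second_loss(q, v_pos, v_neg)  # L = A*L1 + B*L2
        opt.zero_grad()
        total.backward()  # gradient-descent updates, repeated until convergence
        opt.step()
        return total.item()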
In the embodiment of the disclosure, the countermeasure network is trained using the first loss information obtained from the sample features, the associated sample features, and the non-associated sample features, together with the second loss information obtained by inputting those features into the discrimination network in the countermeasure network. Because the first loss information can be a contrastive loss, which is a dimensionality-reduction learning method, points of the same category that are far apart in the high-dimensional space become closer after being mapped to the low-dimensional space, while points of different categories that are close together become farther apart after mapping. Meanwhile, the second loss information fully considers the degree of matching between the sample features and the associated sample features and between the sample features and the non-associated sample features, so that different modalities become more identifiable. Therefore, training the countermeasure network with the first loss information and the second loss information improves the training precision of the cross-modal retrieval network, so that the trained cross-modal retrieval network can output more effective vector representations.
FIG. 10 is a flowchart illustrating a method for determining target sample data to be retrieved, target associated sample data, and target non-associated sample data, according to an exemplary embodiment. As shown in fig. 10, in an optional embodiment, there are a plurality of sample data to be retrieved, and the method may further include:
in step S31, target sample data to be retrieved is determined from the plurality of sample data to be retrieved.
In step S33, target associated sample data that matches the target sample data to be retrieved is determined from the associated sample data, and target non-associated sample data that does not match the target sample data to be retrieved is determined from the non-associated sample data.
Accordingly, in the above step S253, the inputting of the sample features, the associated sample features, and the non-associated sample features into the discrimination network in the countermeasure network to obtain the second loss information may include:
and inputting the sample characteristics of the target sample to-be-retrieved data, the associated sample characteristics of the target associated sample data and the non-associated sample characteristics of the target non-associated sample data into the discrimination network to obtain the second loss information.
Optionally, a data set A may include a plurality of different sample data to be retrieved, and each sample data to be retrieved may correspond to a plurality of associated sample data and a plurality of non-associated sample data. In the above steps S31 to S33, target sample data to be retrieved may be determined from the plurality of different sample data to be retrieved, target associated sample data of the target sample data to be retrieved may be determined from the plurality of associated sample data, and target non-associated sample data of the target sample data to be retrieved may be determined from the plurality of non-associated sample data, so as to obtain a data set B, where the data set B is a subset of the data set A.
It should be noted that any sample data to be retrieved in the data set a may be used as the target sample data to be retrieved. For each sample data to be retrieved, a data set B including the sample data to be retrieved, corresponding associated sample data, and corresponding non-associated sample data may be generated.
The following describes the process of generating data set B from data set a:
A related pair is composed of the same text (i.e., the target sample data to be retrieved) and a related video (i.e., target associated sample data), for example:

the text "Peppa Pig" paired with a "Peppa Pig Season 1" video, the text "Peppa Pig" paired with a "Peppa Pig Season 2" video, the text "Peppa Pig" paired with a "Peppa Pig Season 3" video, and so on.

An unrelated pair is composed of the same text (i.e., the target sample data to be retrieved) and an unrelated video (i.e., target non-associated sample data), for example:

the text "Peppa Pig" paired with an "Ultraman" video, the text "Peppa Pig" paired with a "Snow White" video, the text "Peppa Pig" paired with a "Boonie Bears" video, and so on.

The above related pairs and unrelated pairs together constitute a data set B. As can be seen, the data set B contains the same text, associated videos matching that text, and non-associated videos not matching that text.
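The construction of a data set B from a data set A described above can be expressed in a few lines; the dictionary layout and titles below are purely illustrative:

    # Data set A: each sample text maps to its associated and non-associated videos.
    dataset_a = {
        "Peppa Pig": {
            "associated": ["Peppa Pig Season 1", "Peppa Pig Season 2", "Peppa Pig Season 3"],
            "non_associated": ["Ultraman", "Snow White", "Boonie Bears"],
        },
        # ... further sample texts
    }

    target = "Peppa Pig"  # target sample data to be retrieved
    dataset_b = (
        [(target, v, 1) for v in dataset_a[target]["associated"]]        # related pairs
        + [(target, v, 0) for v in dataset_a[target]["non_associated"]]  # unrelated pairs
    )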
Correspondingly, in the above step S253, the sample features of the target sample data to be retrieved in the data set B may be combined pairwise with the associated sample features of the target associated sample data, and the sample features of the target sample data to be retrieved combined pairwise with the non-associated sample features of the target non-associated sample data, and the resulting pairs input into the discrimination network, so as to obtain the second loss information. The process of extracting the sample features of the target sample data to be retrieved, the associated sample features of the target associated sample data, and the non-associated sample features of the target non-associated sample data in the data set B is similar to that for the sample data to be retrieved, and is not repeated here.
According to the embodiment of the disclosure, determining the target sample data to be retrieved from the plurality of sample data to be retrieved and obtaining the target associated sample data and the target non-associated sample data under that target sample data can improve the data reuse rate, save cost, reduce the server burden, and speed up the training of the cross-modal retrieval network. In addition, inputting the target associated sample data and the target non-associated sample data under the same target sample data to be retrieved into the discrimination network to obtain the second loss information can reduce the training difficulty of the discrimination network and improve the accuracy with which the second loss information is determined, thereby improving the training precision of the cross-modal retrieval network.
Of course, in another mode, in steps S31 to S33, the data set B may be collected from a database different from that of the data set A, so as to improve the richness of the data and further improve the training accuracy of the cross-modal retrieval network.
The cross-modal retrieval network is obtained by performing countermeasure training on a first sample generation network corresponding to the sample data to be retrieved, a second sample generation network corresponding to the associated sample data, and a third sample generation network corresponding to the non-associated sample data, based on the sample data to be retrieved, the associated sample data matching it, and the non-associated sample data not matching it. In other words, the disclosure provides a three-tower multimodal representation training framework based on adversarial training. This framework can extract local features during training, making data of the same modality more distinguishable and data of different modalities more identifiable, thereby improving the training precision of the cross-modal retrieval network. The trained cross-modal retrieval network can then accurately capture the local information of the input data to be retrieved and the candidate data and output more effective features, again making data of the same modality more distinguishable and data of different modalities more identifiable, thereby improving fine-grained retrieval performance. In addition, since a countermeasure network is used in the training process, the cross-modal retrieval network can be obtained simply by inputting sample data into the countermeasure network; that is, the training process is end-to-end, which reduces training complexity and the consumption of system resources during training.
FIG. 11 is a block diagram illustrating a cross-modal retrieval device, according to an example embodiment. Referring to fig. 11, the apparatus includes a data acquisition module 41, a feature extraction module 43, and a data retrieval module 45.
A data acquisition module 41 configured to perform acquisition of data to be retrieved and candidate data; the data to be retrieved and the candidate data correspond to different modalities;
and a feature extraction module 43 configured to perform extraction of the first feature of the data to be retrieved and the second feature of the candidate data based on the cross-modal retrieval network.
And a data retrieval module 45 configured to retrieve data matching the data to be retrieved from the candidate data according to the matching degree of the first feature and the second feature.
The cross-modal retrieval network is obtained by performing countermeasure training, in a countermeasure network, on a first sample generation network corresponding to the sample data to be retrieved, a second sample generation network corresponding to the associated sample data, and a third sample generation network corresponding to the non-associated sample data, based on the sample data to be retrieved, the associated sample data matched with the sample data to be retrieved, and the non-associated sample data not matched with the sample data to be retrieved; the associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality.
In an optional embodiment, the cross-modal retrieval network includes a first generation network corresponding to the first sample generation network, a second generation network corresponding to the second sample generation network, and a third generation network corresponding to the third sample generation network, and the feature extraction module includes:
and the input unit is configured to input the data to be retrieved and the candidate data into the cross-modal retrieval network.
A first feature extraction unit configured to perform extraction of the first feature based on the first generated network.
A second feature extraction unit configured to perform extraction of the second feature from the second generation network and the third generation network.
In an optional embodiment, the data retrieving module includes:
a matching degree determination unit configured to perform determination of the matching degree between the first feature and the second feature.
A matching data determining unit configured to execute the candidate data corresponding to the target second feature as the data matching with the data to be retrieved; the target second feature represents a second feature whose matching degree with the first feature satisfies a preset condition.
FIG. 12 is a block diagram of a training apparatus for a cross-modal retrieval network, according to an exemplary embodiment. Referring to fig. 12, the apparatus includes: a sample data acquisition module 51, a sample feature determination module 53, and a cross-modal retrieval network determination module 55.
A sample data acquisition module 51 configured to perform acquisition of sample data to be retrieved, associated sample data matched with the sample data to be retrieved, and non-associated sample data not matched with the sample data to be retrieved; the associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality.
A sample feature determination module 53 configured to perform inputting of the sample data to be retrieved, the associated sample data, and the non-associated sample data into the first sample generation network, the second sample generation network, and the third sample generation network in the countermeasure network, so as to obtain the sample features of the sample data to be retrieved, the associated sample features of the associated sample data, and the non-associated sample features of the non-associated sample data.
A cross-modal retrieval network determination module 55 configured to perform countermeasure training on the countermeasure network based on the sample features, the associated sample features, and the non-associated sample features, so as to obtain a cross-modal retrieval network.
In an optional embodiment, the sample feature determination module includes:
and the sample feature extraction unit is configured to input the sample data to be retrieved into the first sample generation network, and extract the sample features based on the first sample generation network.
And the associated sample feature extraction unit is configured to input the associated sample data into the second sample generation network and extract the associated sample feature according to the second sample generation network.
And a non-related sample feature extraction unit configured to input the non-related sample data into the third sample generation network, and extract the non-related sample feature based on the third sample generation network.
In an optional embodiment, the cross-modal retrieval network determination module includes:
a first loss information determination unit configured to perform obtaining first loss information based on the sample feature, the associated sample feature, and the non-associated sample feature.
A second loss information determination unit configured to perform inputting of the sample features, the associated sample features, and the non-associated sample features into the discrimination network in the countermeasure network, so as to obtain second loss information.
A training unit configured to perform training of the countermeasure network based on the first loss information and the second loss information, resulting in a first generation network, a second generation network, and a third generation network; the first generation network is used for extracting the characteristics of data to be retrieved, and the second generation network and the third generation network are used for extracting the characteristics of candidate data.
A cross-modal search network generation unit configured to generate a cross-modal search network based on the first generation network, the second generation network, and the third generation network.
In an optional embodiment, the second loss information determining unit includes:
a sample feature input subunit configured to perform input of the sample feature, the associated sample feature, and the non-associated sample feature into the discrimination network.
And a discrimination result determination subunit configured to perform discrimination of a degree of matching between the sample feature and the associated sample feature based on the discrimination network to obtain a first discrimination result, and discrimination of a degree of matching between the sample feature and the non-associated sample feature to obtain a second discrimination result.
A second loss information determination subunit configured to perform obtaining the second loss information according to the first determination result and the second determination result.
In an optional embodiment, the second loss information determining subunit includes:
and the logarithm determination submodule is configured to calculate a first logarithm corresponding to the first judgment result and a second logarithm corresponding to the second judgment result.
A second loss information determining sub-module configured to perform obtaining the second loss information according to the first logarithm and the second logarithm.
In an optional embodiment, there are a plurality of sample data to be retrieved, and the apparatus further includes:
the first determining module is configured to determine target sample data to be retrieved from a plurality of sample data to be retrieved.
And the second determining module is configured to determine target associated sample data matched with the data to be retrieved of the target sample from the associated sample data, and determine target non-associated sample data not matched with the data to be retrieved of the target sample from the non-associated sample data.
The second loss information determining unit is configured to input a sample characteristic of the target sample data to be retrieved, a related sample characteristic of the target related sample data, and a non-related sample characteristic of the target non-related sample data into the discrimination network, so as to obtain the second loss information.
With respect to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method, and will not be described in detail here.
In an exemplary embodiment, there is also provided an electronic device, comprising a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the steps of any of the above embodiments of the cross-modal search method or the training method of the cross-modal search network when executing the instructions stored on the memory.
The electronic device may be a terminal, a server, or a similar computing device. Taking the electronic device as a server as an example, fig. 13 is a block diagram of an electronic device for cross-modal retrieval or training of a cross-modal retrieval network, according to an exemplary embodiment. The electronic device 60 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 61 (the CPU 61 may include but is not limited to a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 63 for storing data, and one or more storage media 62 (e.g., one or more mass storage devices) for storing an application program 623 or data 622. The memory 63 and the storage medium 62 may be transient or persistent storage. The program stored on the storage medium 62 may include one or more modules, each of which may include a series of instruction operations on the electronic device. Still further, the central processing unit 61 may be configured to communicate with the storage medium 62 and execute the series of instruction operations in the storage medium 62 on the electronic device 60. The electronic device 60 may also include one or more power supplies 66, one or more wired or wireless network interfaces 65, one or more input/output interfaces 64, and/or one or more operating systems 621, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD.
The input/output interface 64 may be used to receive or transmit data via a network. A specific example of the network described above may include a wireless network provided by a communication provider of the electronic device 60. In one example, the input/output interface 64 includes a network interface controller (NIC) that can be connected to other network devices through a base station so as to communicate with the Internet. In an exemplary embodiment, the input/output interface 64 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
It will be understood by those skilled in the art that the structure shown in fig. 13 is only an illustration, and is not intended to limit the structure of the electronic device. For example, electronic device 60 may also include more or fewer components than shown in FIG. 13, or have a different configuration than shown in FIG. 13.
In an exemplary embodiment, there is also provided a computer-readable storage medium, wherein the instructions of the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the steps of any one of the above-described cross-modal retrieval methods or training methods of a cross-modal retrieval network.
In an exemplary embodiment, a computer program product is further provided, which includes a computer program, and the computer program is executed by a processor to implement the cross-modal search method or the training method of the cross-modal search network provided in any one of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware, and the computer program can be stored in a non-volatile computer-readable storage medium; when executed, it can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided by the present disclosure may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A cross-modal retrieval method, comprising:
acquiring data to be retrieved and candidate data; the data to be retrieved and the candidate data correspond to different modalities;
extracting a first feature of the data to be retrieved and a second feature of the candidate data based on a cross-modal retrieval network;
retrieving data matched with the data to be retrieved from the candidate data according to the matching degree of the first characteristic and the second characteristic;
the cross-modal retrieval network is obtained by performing countermeasure training, in a countermeasure network, on a first sample generation network corresponding to the sample data to be retrieved, a second sample generation network corresponding to the associated sample data, and a third sample generation network corresponding to the non-associated sample data, based on the sample data to be retrieved, the associated sample data matched with the sample data to be retrieved, and the non-associated sample data not matched with the sample data to be retrieved; the associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality.
2. The cross-modal retrieval method according to claim 1, wherein the cross-modal retrieval network comprises a first generation network corresponding to the first sample generation network, a second generation network corresponding to the second sample generation network, and a third generation network corresponding to the third sample generation network, and the extracting the first feature of the data to be retrieved and the second feature of the candidate data based on the cross-modal retrieval network comprises:
inputting the data to be retrieved and the candidate data into the cross-modal retrieval network;
extracting the first feature based on the first generated network;
extracting the second feature from the second and third generation networks.
3. The cross-modal retrieval method according to claim 1 or 2, wherein the retrieving data matching the data to be retrieved from the candidate data according to the matching degree of the first feature and the second feature comprises:
determining the degree of match between the first feature and the second feature;
taking the candidate data corresponding to the target second characteristic as the data matched with the data to be retrieved; and the target second feature represents a second feature of which the matching degree with the first feature meets a preset condition.
4. A training method for cross-modal search networks is characterized by comprising the following steps:
acquiring sample data to be retrieved, associated sample data matched with the sample data to be retrieved, and non-associated sample data not matched with the sample data to be retrieved; the associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality;
inputting the sample data to be retrieved, the associated sample data and the non-associated sample data into a first sample generation network, a second sample generation network and a third sample generation network in a countermeasure network to obtain sample characteristics of the sample data to be retrieved, associated sample characteristics of the associated sample data and non-associated sample characteristics of the non-associated sample data;
and carrying out countermeasure training on the countermeasure network based on the sample features, the associated sample features and the non-associated sample features to obtain a cross-modal retrieval network.
5. The method for training a cross-modal search network according to claim 4, wherein the inputting the sample data to be searched, the associated sample data, and the non-associated sample data into a first sample generation network, a second sample generation network, and a third sample generation network in a countermeasure network to obtain the sample features of the sample data to be searched, the associated sample features of the associated sample data, and the non-associated sample features of the non-associated sample data comprises:
inputting the data to be retrieved of the sample into the first sample generation network, and extracting the sample features based on the first sample generation network;
inputting the associated sample data into the second sample generation network, and extracting the associated sample characteristics according to the second sample generation network;
inputting the non-associated sample data into the third sample generation network, and extracting the non-associated sample features based on the third sample generation network.
6. A cross-modality retrieval apparatus, comprising:
the data acquisition module is configured to acquire data to be retrieved and candidate data; the data to be retrieved and the candidate data correspond to different modalities;
the characteristic extraction module is configured to extract a first characteristic of the data to be retrieved and a second characteristic of the candidate data based on a cross-modal retrieval network;
the data retrieval module is configured to retrieve data matched with the data to be retrieved from the candidate data according to the matching degree of the first characteristic and the second characteristic;
the cross-modal retrieval network is obtained by performing countermeasure training, in a countermeasure network, on a first sample generation network corresponding to the sample data to be retrieved, a second sample generation network corresponding to the associated sample data, and a third sample generation network corresponding to the non-associated sample data, based on the sample data to be retrieved, the associated sample data matched with the sample data to be retrieved, and the non-associated sample data not matched with the sample data to be retrieved; the associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality.
7. A training apparatus for cross-modal search networks, comprising:
the sample data acquisition module is configured to execute acquisition of sample data to be retrieved, associated sample data matched with the sample data to be retrieved, and non-associated sample data not matched with the sample data to be retrieved; the associated sample data and the sample data to be retrieved correspond to different modalities, and the associated sample data and the non-associated sample data correspond to the same modality;
the sample characteristic determination module is configured to input the sample to-be-retrieved data, the associated sample data and the non-associated sample data into a first sample generation network, a second sample generation network and a third sample generation network in a countermeasure network to obtain sample characteristics of the sample to-be-retrieved data, associated sample characteristics of the associated sample data and non-associated sample characteristics of the non-associated sample data;
and the cross-modal retrieval network determining module is configured to execute countermeasure training on the countermeasure network based on the sample features, the associated sample features and the non-associated sample features to obtain a cross-modal retrieval network.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a cross-modal search method as claimed in any of claims 1 to 3 or a training method for a cross-modal search network as claimed in any of claims 4 to 5.
9. A computer-readable storage medium, whose instructions, when executed by a processor of an electronic device, cause the electronic device to perform the cross-modal retrieval method of any of claims 1 to 3 or the training method of the cross-modal retrieval network of any of claims 4 to 5.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the cross-modal search method of any of claims 1 to 3 or the training method of the cross-modal search network of any of claims 4 to 5.
CN202210265872.5A 2022-03-17 2022-03-17 Cross-modal retrieval method, network training method, device, equipment and medium Pending CN114817655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210265872.5A CN114817655A (en) 2022-03-17 2022-03-17 Cross-modal retrieval method, network training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114817655A true CN114817655A (en) 2022-07-29

Family

ID=82529742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210265872.5A Pending CN114817655A (en) 2022-03-17 2022-03-17 Cross-modal retrieval method, network training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114817655A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024045929A1 (en) * 2022-09-01 2024-03-07 腾讯科技(深圳)有限公司 Model training method and apparatus, and computer device and storage medium
CN117112829A (en) * 2023-10-24 2023-11-24 吉林大学 Medical data cross-modal retrieval method and device and related equipment
CN117112829B (en) * 2023-10-24 2024-02-02 吉林大学 Medical data cross-modal retrieval method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination