CN116049468A - Feature extraction model training method, picture searching method, device and equipment - Google Patents

Feature extraction model training method, picture searching method, device and equipment

Info

Publication number
CN116049468A
Authority
CN
China
Prior art keywords
picture
template
real
pictures
extraction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111262536.7A
Other languages
Chinese (zh)
Inventor
沈辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202111262536.7A priority Critical patent/CN116049468A/en
Priority to PCT/CN2022/118666 priority patent/WO2023071577A1/en
Publication of CN116049468A publication Critical patent/CN116049468A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a feature extraction model training method, a picture searching method, a device and equipment, wherein the method comprises the following steps: acquiring training picture data comprising real pictures and template pictures, wherein each real picture in the same batch of training picture data has a corresponding template picture; extracting visual features of each real picture and visual features of each template picture in the same batch of training picture data based on the feature extraction model, wherein the visual features carry visual semantic correlation among regions in the pictures; calculating the similarity between the visual features of each real picture and the visual features of each template picture; and training the feature extraction model according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the corresponding template picture. This can improve the recognizability of the visual features of pictures and the search effect of picture searching that takes template pictures as targets.

Description

Feature extraction model training method, picture searching method, device and equipment
Technical Field
The application relates to the technical field of computers, in particular to a feature extraction model training method, a picture searching method, a device and equipment.
Background
Current image search technology mainly comprises image search based on global features (global feature) and image search based on collaborative filtering schemes.
For the problem of searching template pictures, current global-feature-based schemes produce large errors in the search results because the recognizability of their features is low, and feature extraction models based on convolutional neural networks are aimed at the general picture searching problem and cannot effectively make use of the template pictures, so the search effect is not optimal.
Therefore, image search techniques that take a template picture as the search target require further improvement.
Disclosure of Invention
The embodiment of the application provides a feature extraction model training method, a picture searching method, a device and equipment, which can improve the recognizability of the visual features of pictures and the search effect of picture searching that takes template pictures as targets.
In a first aspect, a feature extraction model training method is provided, the method comprising:
acquiring training picture data comprising real pictures and template pictures, wherein each real picture in the same batch of training picture data has a template picture corresponding to the real picture;
Extracting visual features of each real picture and visual features of each template picture in the same batch of training picture data based on a feature extraction model, wherein the visual features have visual semantic correlation among areas in the pictures;
calculating the similarity between the visual characteristics of each real picture and the visual characteristics of each template picture;
and training the feature extraction model according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to each real picture.
In a second aspect, a method for searching a picture is provided, the method comprising:
acquiring a picture to be searched;
extracting visual features of the picture to be searched based on a trained feature extraction model, wherein the visual features have visual semantic correlation among regions in the picture, and the trained feature extraction model is obtained by training according to the feature extraction model training method according to any embodiment;
and searching template pictures similar to the visual characteristics of the pictures to be searched from a template picture database as search results.
In a third aspect, there is provided a feature extraction model training apparatus, the apparatus comprising:
the first acquisition unit is used for acquiring training picture data comprising real pictures and template pictures, wherein each real picture in the same batch of training picture data has a template picture corresponding to the real picture;
the first extraction unit is used for extracting visual features of each real picture and visual features of each template picture in the same batch of training picture data based on the feature extraction model, wherein the visual features have visual semantic correlation among areas in the pictures;
the computing unit is used for computing the similarity between the visual characteristics of each real picture and the visual characteristics of each template picture;
the training unit is used for training the feature extraction model according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to each real picture.
In a fourth aspect, there is provided a picture searching apparatus, including:
the second acquisition unit is used for acquiring pictures to be searched;
The second extraction unit is used for extracting the visual features of the pictures to be searched based on a trained feature extraction model, wherein the visual features have visual semantic correlation among regions in the pictures, and the trained feature extraction model is obtained by training the feature extraction model training method according to any embodiment;
and the searching unit is used for searching template pictures similar to the visual characteristics of the pictures to be searched from the template picture database as search results.
In a fifth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program adapted to be loaded by a processor for performing the steps of the feature extraction model training method as described in any of the embodiments above or the steps of the picture searching method as described in any of the embodiments above.
In a sixth aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory having stored therein a computer program, the processor being configured to perform the steps in the feature extraction model training method as described in any of the embodiments above or the steps in the picture searching method as described in any of the embodiments above by invoking the computer program stored in the memory.
In a seventh aspect, a computer program product is provided, comprising computer instructions which, when executed by a processor, implement steps in a feature extraction model training method as described in any of the embodiments above or steps in a picture searching method as described in any of the embodiments above.
According to the embodiment of the application, training picture data comprising real pictures and template pictures are obtained, wherein each real picture in the same batch of training picture data has a corresponding template picture; visual features of each real picture and of each template picture in the same batch of training picture data are extracted based on the feature extraction model, wherein the visual features carry the visual semantic correlation between regions in the pictures; the similarity between the visual features of each real picture and the visual features of each template picture is calculated; and the feature extraction model is trained according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to that real picture. In the embodiments of the present application, visual features that carry the visual semantic correlation among regions in a picture can be effectively extracted through the feature learning model, and metric learning is carried out based on the similarity between the visual features of the real pictures and of the template pictures in the training pictures, so as to optimize the feature extraction model; this can improve the recognizability of the visual features of pictures and the search effect of picture searching that takes template pictures as targets.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a feature extraction model training method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a framework of model training according to an embodiment of the present application.
Fig. 3 is a flowchart of a picture searching method according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a feature extraction model training device according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a picture searching apparatus according to an embodiment of the present application.
Fig. 6 is another schematic structural diagram of a device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The embodiment of the application provides a feature extraction model training method, a device, computer equipment and a storage medium. Specifically, the feature extraction model training method in the embodiment of the application may be executed by a computer device, where the computer device may be a device such as a terminal or a server.
First, some of the terms and expressions appearing in the description of the present application are explained as follows:
The Scale Invariant Feature Transform (SIFT) algorithm represents local features. It can be used to detect salient and stable feature points in an image, and then generates a multi-dimensional feature from the local neighborhood of each detected point; this multi-dimensional feature is a description of the current feature point.
The Histogram of Oriented Gradients (HOG) can represent a global feature. HOG is a feature descriptor used for object detection in computer vision and image processing; it is constructed by computing and counting histograms of gradient orientations over local regions of an image.
A local binary pattern (Local Binary Pattern, LBP) is a feature operator that describes the local texture of an image.
The Speeded-Up Robust Features (SURF) algorithm mainly simplifies some of the operations in SIFT. SURF simplifies the Gaussian second-order derivative templates used in SIFT so that the convolution smoothing operation only requires additions and subtractions; as a result, the SURF algorithm has good robustness and low time complexity. The feature vector SURF finally generates for each feature point has 64 dimensions.
The BRISK algorithm is a feature extraction algorithm and also a binary feature description operator.
K-means generally refers to the K-means clustering algorithm, an iterative clustering analysis algorithm. The data are divided into K groups: K objects are randomly selected as the initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to the closest cluster center. The cluster centers and the objects assigned to them represent a cluster. Each time a sample is assigned, the center of its cluster is recalculated from the objects currently in the cluster. This process repeats until a termination condition is met; the termination condition may be that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) cluster centers change again, or that the sum of squared errors reaches a local minimum.
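To make the iteration described above concrete, the following is a minimal Python sketch of K-means; the function name, the random initialisation and the termination test are illustrative choices, not details taken from the application.

```python
import numpy as np

def kmeans(X, k, iters=100, tol=1e-6):
    """Minimal K-means sketch. X: (N, d) array of samples; returns centers and labels.
    Empty clusters are not handled in this sketch."""
    centers = X[np.random.choice(len(X), k, replace=False)]       # random initial cluster centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # assign each object to the closest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:           # termination condition
            centers = new_centers
            break
        centers = new_centers                                     # recompute centers and iterate
    return centers, labels
```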
The BoW (Bag of Words) model is used in computer vision to represent an image as a feature vector. For example, a K-means clustering method can be adopted to group strongly similar features into one cluster; the center of each cluster is defined as a visual word of the image, and the number of clusters is the size of the whole visual dictionary, so that each image can be represented by a series of representative visual words.
The Vector of Locally Aggregated Descriptors (VLAD) is a coding method widely used in many computer vision tasks, such as image retrieval and scene recognition. VLAD coding first clusters the features extracted from the image set to obtain a codebook, then computes the difference between each original feature and each word in the codebook, accumulates these differences, and finally concatenates the accumulated differences of all words into a new vector that represents the image.
The Fisher vector essentially represents an image with the gradient vector of a likelihood function. Fisher vector coding is a coding mode based on the Fisher kernel principle: a Gaussian Mixture Model (GMM) is trained on the training samples by maximum likelihood estimation, the original features extracted from a sample (such as Dense-Traj) are then modeled by the Gaussian mixture model, and the resulting model parameters are used to encode the original features of the sample into a Fisher vector that is convenient for learning and measuring.
An inverted index (inverted index) derives its name from the fact that, in practice, records need to be found according to the value of an attribute. Each entry in such an index table contains an attribute value and the addresses of all records that have that attribute value. Because the position of a record is determined by the attribute value rather than the attribute value being determined by the record, it is called an inverted index. A file with an inverted index is called an inverted index file (inverted file) for short.
TF-IDF (term frequency-inverse document frequency) is a common weighting technique for information retrieval and data mining. TF is the term frequency (Term Frequency) and IDF is the inverse document frequency (Inverse Document Frequency). TF-IDF can be used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with the frequency with which it appears in the corpus. The main idea of TF-IDF is: if a word or phrase appears frequently in one article (high TF) and rarely in other articles, the word or phrase is considered to have good category discrimination and is suitable for classification.
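The weighting idea can be illustrated with a short Python sketch; the function name and the tokenised input format are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists; returns one {term: weight} dict per document."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))                       # document frequency of each term
    n_docs = len(documents)
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf})
    return weights

# A term that is frequent in one document but rare in the corpus receives a high weight.
print(tf_idf([["cat", "cat", "dog"], ["dog", "fish"]]))
```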
Current image search technology mainly comprises image search based on global features (global feature) and image search based on collaborative filtering schemes.
The image searching technical scheme based on global features has two specific variants. One is based on manually designed visual features (hand-crafted visual features), such as SIFT, HOG and LBP, which are then encoded with feature embedding techniques (feature embedding), such as BoW, VLAD or Fisher Vector, to obtain a feature vector of fixed dimension. The other uses a convolutional neural network (convolutional neural network) and a specific model training task, such as metric learning (Distance Metric Learning) or image classification (image classification), to learn a feature extraction model; after model training is completed, each picture yields a fixed-dimension vector as its visual feature. After the features are obtained, feature dimension reduction or feature quantization is used to reduce the feature dimension and compress the amount of calculation, or product quantization (product quantization) is used to segment the features and increase query speed. In the image search phase, a nearest neighbor search algorithm (KNN Search) is used to find similar pictures.
In the technical scheme based on local features, interest points or corner points are first detected in a picture, manually designed visual features such as SIFT, LBP, BRISK and SURF are then extracted at the corresponding positions, and a feature dictionary (codebook) is obtained using K-means; the visual feature of each point can then be matched to the nearest visual word (Visual Word) in the dictionary. The problem is thereby converted from image searching into a problem similar to text searching, and similar pictures are obtained by matching based on technical schemes such as an inverted index table (Inverted Index Table) and TF-IDF.
The recall of the local-feature-based technical scheme is higher than that of the global-feature scheme. However, since the scheme based on an inverted index table generally involves a relatively large amount of calculation, the search speed becomes slow when the data volume is large, so this technique cannot be used for large-scale picture search. At present, commercial image search basically either performs image searching on global features alone, or performs rough ranking on global features and then uses local features for accurate matching. The recognizability of global features directly affects the search results, so obtaining recognizable global features is the core problem of image search.
Aiming at this special search problem, the present application designs a feature learning model suitable for image matching based on a Transformer network. The technical points involved in the present application mainly include metric learning and Transformer vision models.
For the problem of searching template pictures, the current VLAD-based global feature scheme produces large errors in the search results because the recognizability of its features is low, and the feature extraction model based on the convolutional neural network is aimed at the general picture searching problem and cannot effectively make use of the template pictures, so the search effect is not optimal. Therefore, for the problem in which a template picture is used as the search target, the present application improves the metric learning scheme: the visual features of the template picture are used as target features, and metric learning is then applied to optimize the feature learning model.
The current schemes for acquiring global features include visual feature embedding coding based on local features and feature extraction models based on convolutional neural networks; the present application replaces them with a feature learning model based on a Transformer. The Transformer can effectively learn the visual semantic correlation of different regions in a picture. Compared with technical schemes such as VLAD, deep learning (Deep Learning) can obtain global features with higher recognizability and visual semantics, and it effectively avoids the shortcomings of manual visual features, such as weak expressive power, poor rotation and illumination invariance, and inaccurate corner detection. Compared with a convolutional neural network feature extraction model, the Transformer can extract visual features with higher recognizability more effectively in this scenario. Because the content of a template picture never changes, the template picture can be understood as a rigid object; a convolutional neural network cannot capture the visual semantic correlation between the regions in the picture, so the visual features it learns are not the optimal, highly recognizable visual features, whereas the Transformer can effectively learn the visual semantic correlation between regions in the picture from this rigid structure and thereby extract visual features with higher recognizability.
The embodiments of the present application provide a feature extraction model training method, which may be executed by a terminal or a server, or may be executed by the terminal and the server together; the embodiment of the application is described by taking a feature extraction model training method as an example to be executed by a server.
Referring to fig. 1 to fig. 2, fig. 1 is a flow chart of a feature extraction model training method provided in an embodiment of the present application, and fig. 2 is a frame diagram of model training provided in an embodiment of the present application. The method comprises the following steps:
step 110, obtaining training picture data comprising real pictures and template pictures, wherein each real picture in the same batch of training picture data has a template picture corresponding to the real picture.
The framework of the overall model training of the present application is shown in fig. 2. Different from previous metric learning models, the present application divides the training pictures into two types, real pictures and template pictures, for the special feature learning problem of template pictures. The training pictures are manually annotated pictures and can be obtained through technical schemes such as network crowdsourced collection, picture searching and crawlers. A template picture refers to a picture in which the target object occupies more than 50% of the picture, and it can be obtained by dedicated manual acquisition.
In the subsequent training process of the feature extraction model, multiple batches of training picture data need to be input into the feature extraction model in sequence. A batch of training picture data can contain a plurality of real pictures and the template picture corresponding to each real picture, i.e. for each real picture the batch contains its corresponding template picture. For example, each batch of training picture data contains n template pictures, and each template picture must appear in pairs with k real pictures during training. Therefore, one batch of training picture data contains n template pictures and k×n real pictures in total.
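As a rough illustration of this batch composition (the data structure, function name and sampling strategy are assumptions made for the example, not details taken from the application), a batch of n template pictures with k real pictures each might be assembled as follows.

```python
import random

def build_batch(template_to_reals, n=16, k=63):
    """template_to_reals: dict mapping each template picture to the list of real
    pictures annotated as corresponding to it. Returns one training batch."""
    templates = random.sample(list(template_to_reals), n)        # n template pictures
    reals = []
    for t in templates:
        reals.extend(random.sample(template_to_reals[t], k))     # k real pictures per template
    return templates, reals                                      # n + k*n = 16 + 1008 = 1024 pictures
```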
Optionally, when the training picture data including the real picture and the template picture is acquired, the method further includes:
and carrying out data enhancement processing on the real picture.
Optionally, the performing data enhancement processing on the real picture includes:
and carrying out data enhancement processing on the real picture according to at least one of the following processing modes: rotation, occlusion, cropping, color dithering, perspective transformation and ghost superposition.
In the model training process, data enhancement processing can be performed on the real pictures, where the data enhancement processing can comprise at least one of the following processing modes: rotation, occlusion, cropping, color dithering, perspective transformation, and ghost superposition. The template pictures, in contrast, are not subjected to any data enhancement processing.
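An illustrative augmentation pipeline for the real pictures is sketched below with torchvision; the parameter values are arbitrary, and the ghost-superposition transform is one possible reading of that operation, implemented here as blending the image with a shifted copy of itself.

```python
import torch
from torchvision import transforms

def ghost_overlay(x, alpha=0.3, shift=8):
    # Blend the image tensor (C, H, W) with a spatially shifted copy of itself.
    return (1 - alpha) * x + alpha * torch.roll(x, shifts=(shift, shift), dims=(1, 2))

real_picture_aug = transforms.Compose([
    transforms.RandomRotation(15),                                        # rotation
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),                  # cropping
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),            # perspective transformation
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),  # color dithering
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.3),                                      # occlusion
    transforms.Lambda(ghost_overlay),                                     # ghost superposition
])
# Template pictures skip this pipeline and are only resized and converted to tensors.
```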
Each template picture forms a positive sample pair with each of its corresponding real pictures, and forms negative sample pairs with the other pictures; therefore, in a batch of training picture data with n template pictures, each paired with k real pictures, there are kn positive sample pairs and kn(n-1) negative sample pairs in total.
Optionally, to address the imbalance between positive and negative samples, it is necessary to ensure that, within a batch of training picture data, k is greater than n and k and n are of different orders of magnitude, i.e. k >> n. For example, according to experimental results, if there are 1024 training pictures in the same batch of training picture data, n=16 and k=63 are generally set.
And 120, extracting visual features of each real picture and visual features of each template picture in the same batch of training picture data based on a feature extraction model, wherein the visual features have visual semantic correlation among regions in the pictures.
Optionally, the extracting, based on the feature extraction model, the visual feature of each real picture and the visual feature of each template picture in the same batch of training picture data includes:
inputting the same batch of training picture data into the feature extraction model at the same time for processing, and taking the obtained vector of a preset dimension corresponding to each real picture and each template picture as the picture feature;
and calculating the 2-norm of each picture feature, and mapping each picture feature to Euclidean space to obtain the visual feature corresponding to each real picture and the visual feature corresponding to each template picture.
As shown in fig. 2, in the embodiment of the present application the feature extraction model is constructed based on a Transformer network. Unlike features based on traditional manual design and features learned by a convolutional neural network, the embodiment of the present application uses the Transformer as the feature extraction model, with a plurality of cascaded encoders (Encoder), so that the visual structure inside the target picture can be learned effectively and the features obtained by the model have a higher degree of recognizability.
The Encoder connection shown in fig. 2 is a cascade of encoding modules, i.e. the output of one Encoder serves as the input of the next Encoder. Each Encoder module is a coding module of a general Transformer network; it takes a feature map as input and outputs a feature map (feature map).
For example, the same batch of training picture data is input into the feature extraction model at the same time, and pooling-layer processing is finally performed, so that a d-dimensional (preset-dimension) vector is obtained for each picture as its picture feature. For example, if the current batch contains (k+1)n pictures, namely n template pictures and k×n real pictures, these (k+1)n pictures are input into the Transformer network, and after processing by a plurality of cascaded encoders (Encoder), the d-dimensional vectors of the last layer are obtained as the picture features, corresponding to the output of the average pooling layer (AVG Pooling) in fig. 2.
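A hedged sketch of such an extractor is shown below. The patch-embedding step, the embedding dimension d, the number of encoders and the number of attention heads are illustrative assumptions; the application itself only specifies cascaded Transformer encoders followed by average pooling.

```python
import torch
import torch.nn as nn

class TemplateFeatureExtractor(nn.Module):
    def __init__(self, patch_size=16, d=256, num_layers=6, nhead=8):
        super().__init__()
        # Split the picture into patches and project each patch to a d-dimensional token.
        self.patch_embed = nn.Conv2d(3, d, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead, batch_first=True)
        self.encoders = nn.TransformerEncoder(layer, num_layers=num_layers)  # cascaded Encoders

    def forward(self, images):                 # images: (B, 3, H, W)
        x = self.patch_embed(images)           # (B, d, H/p, W/p) feature map
        x = x.flatten(2).transpose(1, 2)       # (B, N, d) token sequence
        x = self.encoders(x)                   # output of the last Encoder
        return x.mean(dim=1)                   # average pooling -> (B, d) picture feature

features = TemplateFeatureExtractor()(torch.randn(4, 3, 224, 224))   # (4, d) picture features
```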
Then, the 2-norm of each picture feature is calculated at the pooling layer, and each picture feature is mapped to Euclidean space to obtain the visual feature corresponding to each real picture and the visual feature corresponding to each template picture, which can be expressed as the following formula:

v = x / ||x||_2

where x denotes a picture feature, ||x||_2 denotes the 2-norm of the picture feature, and v denotes the visual feature obtained after the picture feature is mapped to Euclidean space, i.e. the visual feature corresponding to a real picture or to a template picture.
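In code, this mapping is a single L2-normalisation step (a minimal sketch; the variable names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

picture_features = torch.randn(1024, 256)                     # (batch, d) pooled picture features
visual_features = F.normalize(picture_features, p=2, dim=1)   # x / ||x||_2: unit-length visual features
```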
And step 130, calculating the similarity between the visual characteristics of each real picture and the visual characteristics of each template picture.
For example, the similarity between the real picture and the template picture may be measured in terms of a cosine distance between the visual features of the real picture and the visual features of the template picture.
Optionally, the calculating the similarity between the visual feature of each real picture and the visual feature of each template picture includes:
and calculating the cosine distance between the visual characteristics of each real picture and the visual characteristics of each template picture.
Optionally, the calculating the cosine distance of the visual feature of each real picture and each template picture includes:
Combining visual features of all real pictures in the same batch of training picture data to form a real picture feature matrix;
combining visual features of all template pictures in the same batch of training picture data to form a template picture feature matrix;
and carrying out point multiplication on the real picture feature matrix and the template picture feature matrix to obtain a cosine distance matrix, wherein the cosine distance matrix comprises cosine distances of all visual features of the real picture and the template picture.
For example, the visual features of all real pictures in the same batch of training picture data are combined, and the visual features of all template pictures in the same batch are combined, forming two feature matrices that represent the real picture feature matrix and the template picture feature matrix respectively; the two feature matrices are then multiplied to obtain the cosine distances between the real picture features and the template picture features, which can be expressed as the following formula (1):

S = M_r × M_t^T    (1)

where S ∈ R^(kn×n) represents the cosine distance matrix corresponding to all training pictures, M_r ∈ R^(kn×d) represents the real picture feature matrix, M_t ∈ R^(n×d) represents the template picture feature matrix, and M_t^T denotes the transpose of M_t.
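A sketch of formula (1) on a batch of normalised features follows; the shapes use the n=16, k=63 example from above, and the random tensors are stand-ins for real model outputs.

```python
import torch
import torch.nn.functional as F

k, n, d = 63, 16, 256
M_r = F.normalize(torch.randn(k * n, d), dim=1)   # real picture feature matrix, (kn, d)
M_t = F.normalize(torch.randn(n, d), dim=1)       # template picture feature matrix, (n, d)
S = M_r @ M_t.T                                   # cosine distance matrix S, (kn, n), formula (1)
```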
And 140, training the feature extraction model according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to each real picture.
For example, when the similarity between the real picture and the template picture is measured according to the cosine distance between the visual feature of the real picture and the visual feature of the template picture, the feature extraction model may be trained according to the cosine distance, so that the cosine distance between the visual feature of each real picture and the visual feature of the template picture corresponding to each real picture output by the trained feature extraction model is nearest.
Optionally, the training the feature extraction model according to the similarity includes:
according to the number of the template pictures, the number of the real pictures corresponding to each template picture, a true value matrix formed by the cosine distance matrix, the positive sample pair and the negative sample pair, and a loss function of the feature extraction model is calculated to obtain a trained feature extraction model;
each template picture and the corresponding real picture form a positive sample pair, and each template picture and other pictures except the corresponding real picture form a negative sample pair.
Optionally, the training picture data of the same batch comprises n template pictures and k real pictures corresponding to each template picture, wherein k is greater than n, and k and n are different orders of magnitude;
the loss function of the feature extraction model is determined using the following formula:
Figure BDA0003326291010000102
where Loss represents the Loss function; n represents the number of template pictures; k represents the number of real pictures corresponding to each template picture; s represents the cosine distance matrix; y represents the truth matrix formed by the positive sample pair and the negative sample pair.
For example, the similarity between a real picture and a template picture can be measured by the cosine distance between their visual features. The cosine distance is a measure of sample similarity: the more similar two samples are, the closer the cosine distance is to 1. The computed cosine distances are the prediction results for the kn positive sample pairs and the kn(n-1) negative sample pairs. If the true value (ground truth) of a positive sample pair is 1 and the true value (ground truth) of a negative sample pair is 0, the model is trained by continuous iteration so that, in the prediction results, the cosine distance of each positive sample pair approaches 1 and the cosine distance of each negative sample pair approaches 0.
In mathematics, a metric (or distance function) is a function that defines the distance between elements of a set. The basic principle of metric learning is to learn the metric distance function autonomously for a specific task. Distance metric learning (or simply metric learning) aims to automatically construct a task-dependent distance metric from supervised data in a machine learning manner. The learned distance metric can then be used for different tasks, e.g. K-NN classification, clustering and information retrieval.
For example, a contrastive loss function (contrastive loss) is used to calculate the loss between the network's prediction and the real result, thereby updating the feature extraction model. For example, the loss function of the feature extraction model can be expressed as the following formula (2), and the truth (ground truth) matrix formed by the positive and negative sample pairs can be expressed as the following formula (3):

Loss = (1 / (k·n·n)) · ||S - Y||_F^2    (2)

Y_ij = 1 if y_i = y_j, otherwise Y_ij = 0    (3)

wherein Loss represents the contrastive loss function; n represents the number of template pictures; k represents the number of real pictures corresponding to each template picture; S represents the cosine distance matrix; Y represents the truth matrix formed by the positive and negative sample pairs, and, like S, Y is a matrix in R^(kn×n); y represents the actual label of a picture sample, with y_i denoting a template picture and y_j denoting a real picture; and ||·||_F represents the Frobenius norm of a matrix, which corresponds to the L2 loss between these samples and the prediction.
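The following sketch builds the truth matrix Y from the batch labels and evaluates the loss of formula (2); the exact normalisation constant and the label encoding are assumptions made for the example.

```python
import torch

def contrastive_matrix_loss(S, real_labels, template_labels):
    """S: (k*n, n) predicted cosine distances; real_labels: (k*n,) template label of each
    real picture; template_labels: (n,) label of each template picture."""
    Y = (real_labels.unsqueeze(1) == template_labels.unsqueeze(0)).float()  # truth matrix, formula (3)
    return ((S - Y) ** 2).sum() / S.numel()   # squared Frobenius norm averaged over the k*n*n entries
```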
The template pictures can be regarded as fixed parameters in the feature extraction model training, so they do not participate in the model gradient calculation and do not affect the model parameter update. As shown in fig. 2, the upward arrows indicate gradient back-propagation, but only the model in part (a) of fig. 2 participates in it; the template pictures do not participate in the model gradient calculation or the model parameter update. Since the template picture plays the role of the target (target) of the real picture, this part does not participate in the gradient calculation. Specifically, when the model is trained, two forward passes are carried out, one for the template pictures and one for the real pictures, but no gradient is back-propagated for the forward pass of the template pictures. For example, the gray arrows in part (b) of fig. 2 are invalid because the template pictures do not participate in the model gradient calculation.
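A minimal sketch of this stop-gradient arrangement is given below: the template branch is run without gradient tracking, so only the real-picture forward pass contributes to the parameter update. The stand-in linear model, tensor shapes and batch ordering are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(128, 64)                       # stand-in for the Transformer feature extractor
template_batch = torch.randn(16, 128)            # n template pictures
real_batch = torch.randn(63 * 16, 128)           # k*n real pictures

with torch.no_grad():                            # template pictures act as fixed targets:
    feat_tmpl = F.normalize(model(template_batch), dim=1)   # no gradient flows through this pass

feat_real = F.normalize(model(real_batch), dim=1)
S = feat_real @ feat_tmpl.T                      # only the real-picture branch stays in the graph
Y = torch.eye(16).repeat_interleave(63, dim=0)   # truth matrix for this ordering of the batch
loss = ((S - Y) ** 2).sum() / S.numel()
loss.backward()                                  # gradients reach the model via the real pictures only
```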
The embodiment of the present application provides a new feature extraction model training framework: a Transformer network is used as the feature extraction model for image searching, the cosine distance between the visual features of the real pictures and the template pictures is trained in a metric learning manner, and the template pictures are effectively utilized, which improves the image search effect and makes the extracted visual features more suitable for the picture search problem with template pictures.
All the above technical solutions may be combined to form an optional embodiment of the present application, which is not described here in detail.
According to the embodiment of the application, training picture data comprising real pictures and template pictures are obtained, wherein each real picture in the same batch of training picture data has a corresponding template picture; visual features of each real picture and of each template picture in the same batch of training picture data are extracted based on the feature extraction model, wherein the visual features carry the visual semantic correlation between regions in the pictures; the similarity between the visual features of each real picture and the visual features of each template picture is calculated; and the feature extraction model is trained according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to that real picture. In the embodiments of the present application, visual features that carry the visual semantic correlation among regions in a picture can be effectively extracted through the feature learning model, and metric learning is carried out based on the similarity between the visual features of the real pictures and of the template pictures in the training pictures, so as to optimize the feature extraction model; this can improve the recognizability of the visual features of pictures and the search effect of picture searching that takes template pictures as targets.
Referring to fig. 3, fig. 3 is a flowchart illustrating a picture searching method according to an embodiment of the present application. Fig. 3 shows a flow diagram of a picture searching method based on the feature extraction model training method shown in fig. 1 to 2 according to an embodiment of the present application. The method comprises the following steps:
step 310, obtaining a picture to be searched;
step 320, extracting visual features of the picture to be searched based on a trained feature extraction model, wherein the visual features have visual semantic correlation between regions in the picture, and the trained feature extraction model is obtained by training based on a feature extraction model training method shown in fig. 1-2;
and step 330, searching template pictures similar to the visual characteristics of the pictures to be searched from a template picture database as search results.
Optionally, before searching the template picture similar to the visual feature of the picture to be searched from the template picture database as a search result, the method further comprises:
and extracting visual features of each template picture in the template picture database based on the trained feature extraction model and storing the visual features.
Optionally, the searching, from a template picture database, for a template picture similar to the visual feature of the picture to be searched as a search result includes:
Calculating cosine distances between the visual features of the pictures to be searched and the visual features of each template picture in the template picture database;
and taking the template picture with the nearest cosine distance from the visual characteristic of the picture to be searched in the template picture database as a search result.
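A sketch of this search step, assuming the template features have been pre-extracted and stored as a matrix (the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def search_template(query_feature, template_features):
    """query_feature: (d,) visual feature of the picture to be searched;
    template_features: (m, d) stored features of the template picture database.
    Returns the index of the closest template picture and its cosine similarity."""
    q = F.normalize(query_feature.unsqueeze(0), dim=1)
    db = F.normalize(template_features, dim=1)
    scores = (q @ db.T).squeeze(0)          # cosine distance to every template picture
    best = torch.argmax(scores)             # template with the closest cosine distance
    return best.item(), scores[best].item()
```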
All the above technical solutions may be combined to form an optional embodiment of the present application, which is not described here in detail.
According to the embodiment of the application, the picture to be searched is obtained; extracting visual features of the pictures to be searched based on a trained feature extraction model, wherein the visual features have visual semantic correlation among regions in the pictures, and the trained feature extraction model is obtained by training based on a feature extraction model training method shown in fig. 1-2; and searching template pictures similar to the visual characteristics of the pictures to be searched from a template picture database as search results. According to the method and the device for searching the template pictures, the visual features with visual semantic correlation among the areas in the pictures can be effectively extracted through the feature learning model, the feature identification degree of the visual features of the pictures can be improved, and the searching effect of the picture searching of the template pictures is improved.
In order to facilitate better implementation of the feature extraction model training method in the embodiment of the application, the embodiment of the application also provides a feature extraction model training device. Referring to fig. 4, fig. 4 is a schematic structural diagram of a feature extraction model training device according to an embodiment of the present application. Wherein, the feature extraction model training apparatus 400 may include:
a first obtaining unit 401, configured to obtain training picture data including real pictures and template pictures, where each real picture in the same batch of training picture data has a template picture corresponding to the real picture;
a first extracting unit 402, configured to extract, based on a feature extraction model, the visual features of each real picture and the visual features of each template picture in the same batch of training picture data, wherein the feature extraction model is a model constructed based on a Transformer network, and the visual features have visual semantic correlation between regions in the pictures;
a calculating unit 403, configured to calculate a similarity between a visual feature of each of the real pictures and a visual feature of each of the template pictures;
and the training unit 404 is configured to train the feature extraction model according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to each real picture.
Optionally, when extracting the visual feature of each real picture and the visual feature of each template picture in the same batch of training picture data based on the feature extraction model, the first extraction unit 402 may be configured to:
inputting the same batch of training picture data into the feature extraction model at the same time for processing, and taking the obtained vector of a preset dimension corresponding to each real picture and each template picture as the picture feature;
and calculating the 2-norm of each picture feature, and mapping each picture feature to Euclidean space to obtain the visual feature corresponding to each real picture and the visual feature corresponding to each template picture.
Alternatively, the calculating unit 403 may be configured to, when calculating the similarity between the visual feature of each of the real pictures and the visual feature of each of the template pictures: and calculating the cosine distance between the visual characteristics of each real picture and the visual characteristics of each template picture.
Alternatively, the calculating unit 403 may be configured to, when calculating the cosine distance between the visual feature of each of the real pictures and the visual feature of each of the template pictures:
Combining visual features of all real pictures in the same batch of training picture data to form a real picture feature matrix;
combining visual features of all template pictures in the same batch of training picture data to form a template picture feature matrix;
and carrying out point multiplication on the real picture feature matrix and the template picture feature matrix to obtain a cosine distance matrix, wherein the cosine distance matrix comprises cosine distances of all visual features of the real picture and the template picture.
Optionally, when the training unit 404 trains the feature extraction model according to the similarity, the training unit may be configured to:
according to the number of the template pictures, the number of the real pictures corresponding to each template picture, a true value matrix formed by the cosine distance matrix, the positive sample pair and the negative sample pair, and a loss function of the feature extraction model is calculated to obtain a trained feature extraction model;
each template picture and the corresponding real picture form a positive sample pair, and each template picture and other pictures except the corresponding real picture form a negative sample pair.
Optionally, the training picture data of the same batch comprises n template pictures and k real pictures corresponding to each template picture, wherein k is greater than n, and k and n are different orders of magnitude;
The loss function of the feature extraction model is determined using the following formula:

Loss = (1 / (k·n·n)) · ||S - Y||_F^2

where Loss represents the loss function; n represents the number of template pictures; k represents the number of real pictures corresponding to each template picture; S represents the cosine distance matrix; and Y represents the truth matrix formed by the positive sample pairs and the negative sample pairs.
Optionally, when acquiring training picture data including a real picture and a template picture, the first acquiring unit 401 may be further configured to:
and carrying out data enhancement processing on the real picture.
Optionally, when the data enhancement processing is performed on the real picture, the first obtaining unit 401 may be configured to:
and carrying out data enhancement processing on the real picture according to at least one of the following processing modes: rotation, occlusion, cropping, color dithering, perspective transformation and ghost superposition.
It should be noted that, the functions of each module in the feature extraction model training apparatus 400 in the embodiments of the present application may be correspondingly referred to the specific implementation manner of any embodiment in the above method embodiments, which is not described herein again.
The embodiment of the application also provides a picture searching device. Referring to fig. 5, fig. 5 is a schematic structural diagram of a picture searching apparatus according to an embodiment of the present application. The picture searching apparatus 500 may include:
A second obtaining unit 501, configured to obtain a picture to be searched;
the second extracting unit 502 is configured to extract a visual feature of the picture to be searched by using a trained feature extraction model, where the visual feature has a visual semantic correlation between regions in the picture, and the trained feature extraction model is obtained by training based on a feature extraction model training method shown in fig. 1 to 2;
and a searching unit 503, configured to search a template picture similar to the visual feature of the picture to be searched from the template picture database as a search result.
Optionally, before the searching, from the template picture database, for a template picture similar to the visual feature of the picture to be searched as a search result, the second extracting unit 502 is further configured to:
and extracting visual features of each template picture in the template picture database based on the trained feature extraction model and storing the visual features.
Optionally, when the searching unit 503 searches for a template picture similar to the visual feature of the picture to be searched from the template picture database as a search result, the searching unit may be further configured to:
calculating cosine distances between the visual features of the pictures to be searched and the visual features of each template picture in the template picture database;
And taking the template picture with the nearest cosine distance from the visual characteristic of the picture to be searched in the template picture database as a search result.
It should be noted that, for the functions of each module in the picture searching apparatus 500 in the embodiments of the present application, reference may be made to the specific implementation of any embodiment in the above picture searching method embodiments, which is not described herein again.
The respective units in the above-described feature extraction model training apparatus 400 or the picture searching apparatus 500 may be implemented in whole or in part by software, hardware, and combinations thereof. The above units may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor invokes and executes operations corresponding to the above units.
The feature extraction model training apparatus 400 or the picture searching apparatus 500 may be integrated in a terminal or a server having a memory and a processor mounted therein and having an arithmetic capability, or the feature extraction model training apparatus 400 or the picture searching apparatus 500 may be the terminal or the server. The terminal can be a smart phone, a tablet personal computer, a notebook computer, a smart television, a smart sound box, wearable smart equipment, a personal computer (Personal Computer, PC) and other equipment, and the terminal can also comprise a client, wherein the client can be a video client, a browser client or an instant messaging client and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligent platforms, and the like.
Fig. 6 is another schematic structural diagram of an apparatus provided in an embodiment of the present application, as shown in fig. 6, the apparatus 600 may include: a communication interface 601, a memory 602, a processor 603 and a communication bus 604. Communication interface 601, memory 602, and processor 603 enable communication with each other via communication bus 604. The communication interface 601 is used for data communication of the apparatus 600 with external devices. The memory 602 may be used to store software programs and modules, and the processor 603 may execute the software programs and modules stored in the memory 602, such as software programs for corresponding operations in the foregoing method embodiments.
Alternatively, the processor 603 may call a software program and module stored in the memory 602 to perform the following operations: acquiring training picture data comprising real pictures and template pictures, wherein each real picture in the same batch of training picture data has a template picture corresponding to the real picture; extracting visual features of each real picture and visual features of each template picture in the same batch of training picture data based on a feature extraction model, wherein the visual features have visual semantic correlation among areas in the pictures; calculating the similarity between the visual characteristics of each real picture and the visual characteristics of each template picture; and training the feature extraction model according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to each real picture.
Optionally, the processor 603 may call the software programs and modules stored in the memory 602 to further perform the following operations: acquiring a picture to be searched; extracting visual features of the picture to be searched based on a trained feature extraction model, wherein the visual features have visual semantic correlation among regions in the picture, and the trained feature extraction model is obtained by training according to the feature extraction model training method of any of the foregoing embodiments; and searching a template picture database for template pictures similar to the visual features of the picture to be searched as search results.
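Similarly, a hedged sketch of the search operations is given below: the template picture database is indexed offline with the trained model, and a query is answered by ranking cosine similarity. The function names, tensor shapes, and top-k interface are assumptions introduced for illustration only.

```python
# Minimal sketch, assuming the same trained backbone `model` as above and a
# tensor of pre-decoded template pictures; names and shapes are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_template_index(model, template_pics):
    # Extract and 2-normalize the visual feature of every template picture once, offline.
    return F.normalize(model(template_pics), dim=1)                 # (num_templates, d)

@torch.no_grad()
def search(model, query_pic, template_feats, top_k=1):
    # query_pic: (C, H, W) picture to be searched
    query_feat = F.normalize(model(query_pic.unsqueeze(0)), dim=1)  # (1, d)
    cosine = (query_feat @ template_feats.t()).squeeze(0)           # similarity to each template
    return torch.topk(cosine, k=top_k).indices                      # closest template picture(s)
```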
Optionally, the apparatus 600 may be integrated in a terminal or a server that is equipped with a memory and a processor and has computing capability, or the apparatus 600 may itself be the terminal or the server. The terminal may be a smart phone, a tablet computer, a notebook computer, a smart television, a smart speaker, a wearable smart device, a personal computer, or other device. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
Optionally, the application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
Embodiments of the present application also provide a computer-readable storage medium for storing a computer program. The computer readable storage medium may be applied to a computer device, and the computer program causes the computer device to execute the corresponding processes in the above method embodiments, which are not described herein for brevity.
Embodiments of the present application also provide a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the corresponding processes in the above method embodiments, which are not described herein for brevity.
Embodiments of the present application also provide a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the corresponding processes in the above method embodiments, which are not described herein for brevity.
It should be appreciated that the processor of an embodiment of the present application may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logical blocks disclosed in the embodiments of the present application may be implemented or performed. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be embodied directly as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
It will be appreciated that the memory in embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (Programmable ROM, PROM), an erasable PROM (Erasable PROM, EPROM), an electrically erasable PROM (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
It should be understood that the above memory is exemplary rather than limiting. For example, the memory in the embodiments of the present application may also be static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), direct rambus RAM (DR RAM), and the like. That is, the memory in embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer or a server) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, and the like.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A method of training a feature extraction model, the method comprising:
acquiring training picture data comprising real pictures and template pictures, wherein each real picture in the same batch of training picture data has a template picture corresponding to the real picture;
extracting visual features of each real picture and visual features of each template picture in the same batch of training picture data based on a feature extraction model, wherein the visual features have visual semantic correlation among areas in the pictures;
calculating the similarity between the visual features of each real picture and the visual features of each template picture;
and training the feature extraction model according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to each real picture.
2. The feature extraction model training method of claim 1, wherein the extracting, based on the feature extraction model, the visual features of each real picture and the visual features of each template picture in the same batch of training picture data comprises:
inputting the same batch of training picture data into the feature extraction model simultaneously for processing, and taking the obtained vector of a preset dimension corresponding to each real picture and each template picture as a picture feature;
and calculating the 2-norm of each picture feature, and mapping each picture feature to a Euclidean space to obtain the visual feature of each real picture and the visual feature of each template picture.
3. The feature extraction model training method of claim 2, wherein said calculating the similarity between the visual features of each of the real pictures and the visual features of each of the template pictures comprises:
and calculating the cosine distance between the visual features of each real picture and the visual features of each template picture.
4. The feature extraction model training method of claim 3, wherein said calculating the cosine distance between the visual features of each of said real pictures and the visual features of each of said template pictures comprises:
combining visual features of all real pictures in the same batch of training picture data to form a real picture feature matrix;
combining visual features of all template pictures in the same batch of training picture data to form a template picture feature matrix;
and performing a dot product of the real picture feature matrix and the template picture feature matrix to obtain a cosine distance matrix, wherein the cosine distance matrix comprises the cosine distances between the visual features of all the real pictures and the visual features of all the template pictures.
5. The feature extraction model training method of claim 4, wherein said training the feature extraction model based on the similarity comprises:
calculating a loss function of the feature extraction model according to the number of the template pictures, the number of the real pictures corresponding to each template picture, the cosine distance matrix, and a truth matrix formed by positive sample pairs and negative sample pairs, so as to obtain a trained feature extraction model;
wherein each template picture and its corresponding real pictures form positive sample pairs, and each template picture and the pictures other than its corresponding real pictures form negative sample pairs.
6. The feature extraction model training method of claim 5, wherein the same batch of training picture data comprises n template pictures and k real pictures corresponding to each template picture, k is greater than n, and k and n are of different orders of magnitude;
The loss function of the feature extraction model is determined using the following formula:
[Formula FDA0003326291000000021: Loss expressed as a function of n, k, S, and Y — given as an image in the original publication.]
where Loss represents the Loss function; n represents the number of template pictures; k represents the number of real pictures corresponding to each template picture; s represents the cosine distance matrix; y represents the truth matrix formed by the positive sample pair and the negative sample pair.
7. The feature extraction model training method of claim 1, wherein when the training picture data comprising real pictures and template pictures is acquired, the method further comprises:
and carrying out data enhancement processing on the real picture.
8. The method for training a feature extraction model according to claim 7, wherein the performing data enhancement processing on the real picture includes:
and carrying out data enhancement processing on the real picture according to at least one processing mode among rotation, occlusion, cropping, color jittering, perspective transformation, and ghost superposition.
9. A picture searching method, the method comprising:
acquiring a picture to be searched;
extracting visual features of the picture to be searched based on a trained feature extraction model, wherein the visual features have visual semantic correlation among regions in the picture, and the trained feature extraction model is obtained by training according to the feature extraction model training method of any one of claims 1-8;
and searching, from a template picture database, for template pictures similar to the visual features of the picture to be searched as search results.
10. The picture searching method of claim 9, wherein before the searching, from the template picture database, for template pictures similar to the visual features of the picture to be searched as search results, the method further comprises:
and extracting visual features of each template picture in the template picture database based on the trained feature extraction model and storing the visual features.
11. The picture searching method according to claim 10, wherein the searching for template pictures similar to the visual features of the picture to be searched from the template picture database as search results comprises:
calculating the cosine distance between the visual features of the picture to be searched and the visual features of each template picture in the template picture database;
and taking, as the search result, the template picture in the template picture database whose visual features have the closest cosine distance to the visual features of the picture to be searched.
12. A feature extraction model training apparatus, the apparatus comprising:
the first acquisition unit is used for acquiring training picture data comprising real pictures and template pictures, wherein each real picture in the same batch of training picture data has a template picture corresponding to the real picture;
the first extraction unit is used for extracting visual features of each real picture and visual features of each template picture in the same batch of training picture data based on the feature extraction model, wherein the visual features have visual semantic correlation among areas in the pictures;
the computing unit is used for computing the similarity between the visual features of each real picture and the visual features of each template picture;
the training unit is used for training the feature extraction model according to the similarity, so that the visual features of each real picture output by the trained feature extraction model are similar to the visual features of the template picture corresponding to each real picture.
13. A picture searching apparatus, the apparatus comprising:
the second acquisition unit is used for acquiring pictures to be searched;
a second extraction unit, configured to extract a visual feature of the picture to be searched based on a trained feature extraction model, where the visual feature has a visual semantic correlation between regions in the picture, and the trained feature extraction model is obtained by training according to the feature extraction model training method according to any one of claims 1 to 8;
and the searching unit is used for searching, from the template picture database, for template pictures similar to the visual features of the picture to be searched as search results.
14. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded by a processor for performing the steps of the feature extraction model training method of any of claims 1-8 or the steps of the picture searching method of any of claims 9-11.
15. A computer device, characterized in that it comprises a processor and a memory, in which a computer program is stored, the processor being adapted to perform the steps of the feature extraction model training method of any of claims 1-8 or the steps of the picture searching method of any of claims 9-11 by invoking the computer program stored in the memory.
16. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the feature extraction model training method of any one of claims 1 to 8 or the steps of the picture searching method of any one of claims 9 to 11.
CN202111262536.7A 2021-10-28 2021-10-28 Feature extraction model training method, picture searching method, device and equipment Pending CN116049468A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111262536.7A CN116049468A (en) 2021-10-28 2021-10-28 Feature extraction model training method, picture searching method, device and equipment
PCT/CN2022/118666 WO2023071577A1 (en) 2021-10-28 2022-09-14 Feature extraction model training method and apparatus, picture searching method and apparatus, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111262536.7A CN116049468A (en) 2021-10-28 2021-10-28 Feature extraction model training method, picture searching method, device and equipment

Publications (1)

Publication Number Publication Date
CN116049468A true CN116049468A (en) 2023-05-02

Family

ID=86120556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111262536.7A Pending CN116049468A (en) 2021-10-28 2021-10-28 Feature extraction model training method, picture searching method, device and equipment

Country Status (2)

Country Link
CN (1) CN116049468A (en)
WO (1) WO2023071577A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182447B (en) * 2017-12-14 2020-04-21 南京航空航天大学 Adaptive particle filter target tracking method based on deep learning
CN108228859A (en) * 2018-01-12 2018-06-29 何世容 A kind of Air-disinfection with Ozone mechanism member method for dismounting, device and computer storage media
WO2021191908A1 (en) * 2020-03-25 2021-09-30 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Deep learning-based anomaly detection in images
CN111861672A (en) * 2020-07-28 2020-10-30 青岛科技大学 Multi-mode-based generating type compatible garment matching scheme generating method and system
CN113255694B (en) * 2021-05-21 2022-11-11 北京百度网讯科技有限公司 Training image feature extraction model and method and device for extracting image features

Also Published As

Publication number Publication date
WO2023071577A1 (en) 2023-05-04

Similar Documents

Publication Publication Date Title
CN107256262B (en) Image retrieval method based on object detection
RU2668717C1 (en) Generation of marking of document images for training sample
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN109426831B (en) Image similarity matching and model training method and device and computer equipment
CN115443490A (en) Image auditing method and device, equipment and storage medium
CN112256899B (en) Image reordering method, related device and computer readable storage medium
CN112163114B (en) Image retrieval method based on feature fusion
CN111373393B (en) Image retrieval method and device and image library generation method and device
Lee et al. Improved image retrieval and classification with combined invariant features and color descriptor
Sun et al. Search by detection: Object-level feature for image retrieval
JP6017277B2 (en) Program, apparatus and method for calculating similarity between contents represented by set of feature vectors
CN110083731B (en) Image retrieval method, device, computer equipment and storage medium
US11816909B2 (en) Document clusterization using neural networks
Varish et al. A content based image retrieval using color and texture features
CN116049468A (en) Feature extraction model training method, picture searching method, device and equipment
Jammula Content based image retrieval system using integrated ML and DL-CNN
CN113763313A (en) Text image quality detection method, device, medium and electronic equipment
Mumar Image retrieval using SURF features
CN111611427A (en) Image retrieval method and system based on linear discriminant analysis depth hash algorithm
Du et al. Mvss: Mobile visual search based on saliency
Liu et al. Creating descriptive visual words for tag ranking of compressed social image
Cai et al. Semantic edge detection based on deep metric learning
Nayak et al. IR-HF-WED: Image retrieval using hybrid feature extraction with weighted Euclidean distance
Du et al. A Low Overhead Progressive Transmission for Visual Descriptor Based on Image Saliency.
CN111680722B (en) Content identification method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination