CN114926658A - Picture feature extraction method and device, computer equipment and readable storage medium - Google Patents

Picture feature extraction method and device, computer equipment and readable storage medium

Info

Publication number
CN114926658A
CN114926658A
Authority
CN
China
Prior art keywords: feature, picture, projection, vector, view
Prior art date
Legal status
Pending
Application number
CN202210675949.6A
Other languages
Chinese (zh)
Inventor
谯轶轩 (Qiao Yixuan)
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210675949.6A
Publication of CN114926658A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 - Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a picture feature extraction method and device, computer equipment, and a readable storage medium, relating to the field of computer technology. The scheme weakens the requirement for negative samples in feature learning, reduces the pressure placed on learning and memory, improves efficiency, and improves the accuracy and performance of feature extraction. The method comprises the following steps: acquiring a target picture to be subjected to picture feature extraction, and performing enhancement processing on the target picture to obtain a first view picture and a second view picture; determining a first feature vector of the first view picture and a second feature vector of the second view picture, performing feature projection on the first feature vector and the second feature vector, and generating a loss function value by using the first feature vector after feature projection and the second feature vector after feature projection; performing representation learning of picture features based on the loss function value to obtain a feature extraction model; and inputting the target picture into the feature extraction model, and acquiring the vector output by the feature extraction model as the picture feature vector of the target picture.

Description

Picture feature extraction method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for extracting picture features, a computer device, and a readable storage medium.
Background
With the continuous development of computer technology, it has become easier for people to obtain multimedia information of all kinds, such as pictures and videos, and pictures account for a large share of this information. How to classify pictures so that required pictures can be retrieved effectively and quickly from a large-scale picture database has therefore drawn increasing attention. Extracting picture features is an important link in the process of picture classification and retrieval.
In the related art, to extract the picture features of a picture A, picture A is first transformed to obtain several pictures similar to A as positive samples; then a number of pictures different from A are collected as negative samples. Through contrastive learning over A, the positive samples, and the negative samples, the feature distance between A and the positive samples is reduced and the feature distance between A and the negative samples is increased, after which the feature vector of A is extracted.
In carrying out the present application, the applicant has found that the related art has at least the following problems:
in order to obtain a strong picture feature extraction capability, a large number of high-quality negative samples must be screened. If pictures obviously different from picture A are randomly selected as negative samples, the contrastive training task is too simple: the model learns only the factors that distinguish different pictures and never learns a genuinely valuable feature representation. If a picture similar to A is mistakenly selected as a negative sample, the normal learning task is seriously disturbed. Either way, great pressure is placed on learning and memory, learning efficiency is poor, the accuracy of picture feature extraction is low, and performance degrades.
Disclosure of Invention
In view of this, the present application provides a method and an apparatus for extracting picture features, a computer device, and a readable storage medium, and mainly aims to solve the problems of great pressure on learning and memory, poor learning efficiency, low accuracy of picture feature extraction, and reduced performance.
According to a first aspect of the present application, there is provided a method for extracting picture features, the method including:
acquiring a target picture to be subjected to picture feature extraction, and performing enhancement processing on the target picture to obtain a first view picture and a second view picture;
determining a first feature vector of the first view picture and a second feature vector of the second view picture, performing feature projection on the first feature vector and the second feature vector, and generating a loss function value by using the first feature vector after feature projection and the second feature vector after feature projection;
performing representation learning of picture features based on the loss function values to obtain a feature extraction model;
and inputting the target picture into the feature extraction model, and acquiring a vector output by the feature extraction model as a picture feature vector of the target picture.
Optionally, the obtaining a target picture to be subjected to picture feature extraction, and performing enhancement processing on the target picture to obtain a first view picture and a second view picture includes:
acquiring the input target picture to be subjected to picture feature extraction, and determining a reference edge in the target picture, wherein the reference edge is a picture boundary of the target picture, and the length of the reference edge is less than or equal to the length of other picture boundaries except the reference edge in the target picture;
and cutting, from the target picture, two region pictures whose side length equals the length of the reference edge, taking either one of the two region pictures as the first view picture, and taking the other as the second view picture.
Optionally, the determining a first feature vector of the first view picture and a second feature vector of the second view picture, performing feature projection on the first feature vector and the second feature vector, and generating a loss function value by using the first feature vector after feature projection and the second feature vector after feature projection, includes:
acquiring a main view encoder, determining a preset encoding dimension, and encoding the first view picture based on the main view encoder to obtain the first feature vector, whose dimension is consistent with the preset encoding dimension;
determining an auxiliary view encoder, and encoding the second view picture based on the auxiliary view encoder to obtain the second feature vector with dimension consistent with the preset encoding dimension;
acquiring a main view projection matrix and an auxiliary view projection matrix whose dimensions are a preset projection dimension, performing feature projection on the first feature vector by using the main view projection matrix, and performing feature projection on the second feature vector by using the auxiliary view projection matrix, to obtain the first feature vector after feature projection and the second feature vector after feature projection;
carrying out a nonlinear change on the first feature vector after feature projection to obtain a nonlinear change vector;
in the second feature vector after feature projection, determining, for each element among the plurality of elements included in the nonlinear change vector, the element at the same position, thereby obtaining the corresponding-position element of each element, and performing the following processing on each element: calculating the difference between the element and its corresponding-position element, and squaring the difference to obtain a squared value;
obtaining the squared value of each element to obtain a plurality of squared values for the plurality of elements, calculating the sum of the plurality of squared values, and taking the sum as the loss function value.
Optionally, the performing feature projection on the first feature vector by using the main view projection matrix and on the second feature vector by using the auxiliary view projection matrix, to obtain the first feature vector after feature projection and the second feature vector after feature projection, includes:
multiplying the main view projection matrix with the first feature vector to obtain a vector whose dimension equals the preset projection dimension, as the first feature vector after feature projection;
and multiplying the auxiliary view projection matrix with the second feature vector to obtain a vector whose dimension equals the preset projection dimension, as the second feature vector after feature projection.
Optionally, the performing nonlinear change on the first feature vector after feature projection to obtain a nonlinear change vector includes:
acquiring a first preset transformation matrix whose dimension is a first preset transformation dimension, and multiplying the first preset transformation matrix with the first feature vector after feature projection to obtain a vector whose dimension equals the first preset transformation dimension, as a first intermediate vector;
normalizing the first intermediate vector, and converting the first intermediate vector into a standard normal distribution with a mean value of 0 and a variance of 1 as a second intermediate vector;
acquiring a preset nonlinear activation function, bringing the second intermediate vector into the nonlinear activation function for calculation, and taking the calculated vector as a third intermediate vector;
and acquiring a second preset transformation matrix whose dimension is a second preset transformation dimension, multiplying the second preset transformation matrix with the third intermediate vector, and taking the obtained vector as the nonlinear change vector, where the dimension of the nonlinear change vector is consistent with the dimension of the first feature vector after feature projection.
Optionally, the performing, based on the loss function value, representation learning of a picture feature to obtain a feature extraction model includes:
reading current encoder parameters of a main view encoder as first historical encoder parameters, and reading current encoder parameters of an auxiliary view encoder as second historical encoder parameters;
backpropagating the loss function value, and updating the main view encoder and the main view projection matrix by using the loss function value;
obtaining a delayed update factor, calculating an encoder update value from the first historical encoder parameter and the second historical encoder parameter by using the delayed update factor, and calculating a projection matrix to be updated from the main view projection matrix and the auxiliary view projection matrix by using the delayed update factor;
setting the encoder update value in the auxiliary view encoder, and updating the auxiliary view projection matrix to the projection matrix to be updated;
performing enhancement processing on the target picture again to obtain a new first view picture and a new second view picture; performing feature extraction on the new first view picture based on the updated main view encoder and feature projection based on the updated main view projection matrix; performing feature extraction on the new second view picture based on the updated auxiliary view encoder and feature projection based on the updated auxiliary view projection matrix; regenerating a new loss function value; and updating the main view encoder, the main view projection matrix, the auxiliary view encoder, and the auxiliary view projection matrix according to the new loss function value, until the generated loss function value reaches a threshold, to obtain the feature extraction model.
Optionally, the inputting the target picture into the feature extraction model, and acquiring a vector output by the feature extraction model as a picture feature vector of the target picture includes:
inputting the target picture into the feature extraction model, and coding the target picture based on a main view encoder in the feature extraction model to obtain an initial picture feature vector;
and performing feature projection processing on the initial picture feature vector based on a main view projection matrix in the feature extraction model to obtain the initial picture feature vector after feature projection, and taking the initial picture feature vector after feature projection as the picture feature vector of the target picture.
According to a second aspect of the present application, there is provided an apparatus for extracting picture features, the apparatus comprising:
the enhancement processing module is used for acquiring a target picture to be subjected to picture feature extraction, and performing enhancement processing on the target picture to obtain a first view picture and a second view picture;
the feature extraction module is used for determining a first feature vector of the first view picture and a second feature vector of the second view picture;
the feature projection module is used for performing feature projection on the first feature vector and the second feature vector;
the view prediction module is used for generating a loss function value by using the first feature vector after feature projection and the second feature vector after feature projection;
the feature learning module is used for performing representation learning of picture features based on the loss function value to obtain a feature extraction model;
the feature projection module is further configured to input the target picture into the feature extraction model, and obtain a vector output by the feature extraction model as a picture feature vector of the target picture.
Optionally, the enhancement processing module is configured to acquire the input target picture to be subjected to picture feature extraction, and determine a reference edge in the target picture, where the reference edge is a picture boundary of the target picture and the length of the reference edge is less than or equal to the lengths of the other picture boundaries of the target picture; and cut, from the target picture, two region pictures whose side length equals the length of the reference edge, take either one of the two region pictures as the first view picture, and take the other as the second view picture.
Optionally, the feature extraction module is configured to acquire a main view encoder, determine a preset encoding dimension, and perform encoding processing on the first view picture based on the main view encoder to obtain the first feature vector with a dimension consistent with the preset encoding dimension; determining an auxiliary view encoder, and encoding the second view picture based on the auxiliary view encoder to obtain the second feature vector with dimension consistent with the preset encoding dimension;
the feature projection module is configured to acquire a main view projection matrix and an auxiliary view projection matrix whose dimensions are a preset projection dimension, perform feature projection on the first feature vector by using the main view projection matrix, and perform feature projection on the second feature vector by using the auxiliary view projection matrix, to obtain the first feature vector after feature projection and the second feature vector after feature projection;
the view prediction module is configured to carry out a nonlinear change on the first feature vector after feature projection to obtain a nonlinear change vector; in the second feature vector after feature projection, determine, for each element among the plurality of elements included in the nonlinear change vector, the element at the same position, thereby obtaining the corresponding-position element of each element, and perform the following processing on each element: calculate the difference between the element and its corresponding-position element, and square the difference to obtain a squared value; obtain the squared value of each element to obtain a plurality of squared values for the plurality of elements, calculate the sum of the plurality of squared values, and take the sum as the loss function value.
Optionally, the feature projection module is configured to multiply the main view projection matrix with the first feature vector to obtain a vector whose dimension equals the preset projection dimension, as the first feature vector after feature projection; and multiply the auxiliary view projection matrix with the second feature vector to obtain a vector whose dimension equals the preset projection dimension, as the second feature vector after feature projection.
Optionally, the view prediction module is configured to obtain a first preset transformation matrix whose dimension is a first preset transformation dimension, and multiply the first preset transformation matrix with the first feature vector after feature projection to obtain a vector whose dimension equals the first preset transformation dimension, as a first intermediate vector; normalize the first intermediate vector, converting it into a standard normal distribution with a mean of 0 and a variance of 1, as a second intermediate vector; acquire a preset nonlinear activation function, substitute the second intermediate vector into the nonlinear activation function for calculation, and take the calculated vector as a third intermediate vector; and acquire a second preset transformation matrix whose dimension is a second preset transformation dimension, multiply the second preset transformation matrix with the third intermediate vector, and take the obtained vector as the nonlinear change vector, where the dimension of the nonlinear change vector is consistent with the dimension of the first feature vector after feature projection.
Optionally, the feature learning module is configured to read the current encoder parameters of the main view encoder as first historical encoder parameters, and read the current encoder parameters of the auxiliary view encoder as second historical encoder parameters; backpropagate the loss function value, and update the main view encoder and the main view projection matrix by using the loss function value; obtain a delayed update factor, calculate an encoder update value from the first historical encoder parameters and the second historical encoder parameters by using the delayed update factor, and calculate a projection matrix to be updated from the main view projection matrix and the auxiliary view projection matrix by using the delayed update factor; set the encoder update value in the auxiliary view encoder, and update the auxiliary view projection matrix to the projection matrix to be updated;
the enhancement processing module is further configured to perform enhancement processing on the target picture again to obtain a new first view picture and a new second view picture; perform feature extraction on the new first view picture based on the updated main view encoder and feature projection based on the updated main view projection matrix; perform feature extraction on the new second view picture based on the updated auxiliary view encoder and feature projection based on the updated auxiliary view projection matrix; regenerate a new loss function value; and update the main view encoder, the main view projection matrix, the auxiliary view encoder, and the auxiliary view projection matrix according to the new loss function value, until the generated loss function value reaches a threshold, to obtain the feature extraction model.
Optionally, the feature projection module is further configured to input the target picture into the feature extraction model, and perform coding processing on the target picture based on a main view coder in the feature extraction model to obtain an initial picture feature vector; and performing feature projection processing on the initial picture feature vector based on a main view projection matrix in the feature extraction model to obtain the initial picture feature vector after feature projection, and taking the initial picture feature vector after feature projection as the picture feature vector of the target picture.
According to a third aspect of the present application, there is provided a computer device comprising a memory storing a computer program and a processor implementing the steps of the method of any of the first aspects when the computer program is executed.
According to a fourth aspect of the present application, there is provided a readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above-mentioned first aspects.
By means of the above technical scheme, the method and apparatus for extracting picture features, the computer device, and the readable storage medium provided by the present application acquire a target picture to be subjected to picture feature extraction and perform enhancement processing on it to obtain a first view picture and a second view picture; determine a first feature vector of the first view picture and a second feature vector of the second view picture, perform feature projection on the two feature vectors, and generate a loss function value by using the first feature vector after feature projection and the second feature vector after feature projection; perform representation learning of picture features based on the loss function value to obtain a feature extraction model; and input the target picture into the feature extraction model, acquiring the vector output by the feature extraction model as the picture feature vector of the target picture. By using pictures from different views, the present application weakens the demand for a large number of high-quality negative samples in picture feature learning, relieves the high GPU-memory demand and slow training caused by contrastive learning's need to compare against many negative samples, and alleviates the drop in model performance caused by lax early-stage preprocessing such as screening and filtering. It thereby simplifies the data collection and processing flow, reduces the pressure placed on learning and memory, improves learning efficiency, and improves the accuracy and performance of picture feature extraction.
The foregoing is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer and implementable according to the contents of the specification, and to make the above and other objects, features, and advantages of the present application more readily understandable, detailed embodiments of the present application are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 shows a schematic flow chart of an extraction method of picture features provided in an embodiment of the present application;
fig. 2A is a schematic flowchart illustrating another method for extracting picture features according to an embodiment of the present application;
fig. 2B shows a schematic diagram of a view prediction module provided in an embodiment of the present application;
fig. 2C shows a schematic flowchart of a method for extracting picture features provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram illustrating an apparatus for extracting picture features provided in an embodiment of the present application;
fig. 4 shows a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the application provides a method for extracting picture features, as shown in fig. 1, the method includes:
101. Acquire a target picture to be subjected to picture feature extraction, and perform enhancement processing on the target picture to obtain a first view picture and a second view picture.
In order to learn the target picture from different views, after the target picture to be subjected to picture feature extraction is acquired, enhancement processing is performed on it to obtain a first view picture and a second view picture at different views, and these two pictures are used as samples for model training.
In the embodiment of the present application, the enhancement processing crops the target picture to obtain the first view picture and the second view picture; in practice, the enhancement processing may also rotate or flip the target picture by different angles, scale it by different proportions, and so on.
102. Determine a first feature vector of the first view picture and a second feature vector of the second view picture, perform feature projection on the first feature vector and the second feature vector, and generate a loss function value by using the first feature vector after feature projection and the second feature vector after feature projection.
In this embodiment of the application, after the first view picture and the second view picture are obtained, a first feature vector of the first view picture and a second feature vector of the second view picture are determined, feature projection is performed on both, and a loss function value is generated by using the first feature vector after feature projection and the second feature vector after feature projection. Models such as ResNet50 (residual network-50), ResNet101 (residual network-101), or ViT (Vision Transformer) can be selected to extract the first feature vector and the second feature vector, and a linear transformation is used to project each feature vector to a vector of representative dimensions, from which the loss function value of the two vectors is computed. The loss function value indicates the error between the first feature vector and the second feature vector; using it to adjust the encoding, feature projection, and other processing steers the model in the correct direction during training and improves the accuracy of the subsequently extracted picture features.
103. Perform representation learning of picture features based on the loss function value to obtain a feature extraction model.
In the embodiment of the present application, after the loss function value is determined, the parameters to be learned by the encoding, feature projection, and other models are adjusted based on the loss function value, thereby realizing representation learning of picture features and obtaining the feature extraction model.
104. Input the target picture into the feature extraction model, and acquire the vector output by the feature extraction model as the picture feature vector of the target picture.
In the embodiment of the application, after the feature extraction model is generated, the target picture is input into the feature extraction model, and the vector output by the model is obtained as the picture feature vector of the target picture. The feature extraction model first encodes the target picture to obtain a vector, then performs feature projection on that vector, and takes the result as the final picture feature vector of the target picture.
The method provided by this embodiment acquires a target picture to be subjected to picture feature extraction and performs enhancement processing on it to obtain a first view picture and a second view picture; determines a first feature vector of the first view picture and a second feature vector of the second view picture, performs feature projection on both, and generates a loss function value by using the first feature vector after feature projection and the second feature vector after feature projection; performs representation learning of picture features based on the loss function value to obtain a feature extraction model; and inputs the target picture into the feature extraction model, acquiring the vector output by the model as the picture feature vector of the target picture. By using pictures from different views, this embodiment weakens the demand for a large number of high-quality negative samples in picture feature learning, relieves the high GPU-memory demand and slow training caused by contrastive learning's need to compare against many negative samples, and alleviates the drop in model performance caused by lax early-stage preprocessing such as screening and filtering; it thereby simplifies the data collection and processing flow, reduces the pressure placed on learning and memory, improves learning efficiency, and improves the accuracy and performance of picture feature extraction.
Further, as a refinement and extension of the specific implementation of the foregoing embodiment, and in order to fully describe its implementation process, an embodiment of the present application provides another method for extracting picture features, as shown in fig. 2A. The method includes:
201. Acquire a target picture to be subjected to picture feature extraction, and perform enhancement processing on the target picture to obtain a first view picture and a second view picture.
In the embodiment of the present application, there are various ways to perform enhancement processing on a target picture, such as flipping, rotating, cropping, zooming, translating, and dithering. The description here takes cropping as the example for obtaining the first view picture and the second view picture; in practice, rotating or flipping the target picture by different angles, scaling it by different proportions, and other modes may also be used, and the present application does not limit the specific enhancement mode.
In an optional embodiment, the enhancement processing is performed on the target picture as follows. First, the input target picture to be subjected to picture feature extraction is acquired, and a reference edge is determined in the target picture. The reference edge is a picture boundary of the target picture whose length is less than or equal to the lengths of the other picture boundaries, i.e., the short edge of the target picture serves as the reference edge. Then, two region pictures whose side length equals the length of the reference edge are cut from the target picture; either one of them is taken as the first view picture and the other as the second view picture. In summary, taking the shorter edge of the target picture as the standard, the enhancement processing cuts out two square areas whose side length is that of the short edge, as sketched below. For convenience in the following explanation, the first view picture is denoted V1 and the second view picture V2.
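As an illustration only (not part of the patent text), this cropping-based enhancement can be sketched in Python; the function name two_view_crop, the use of Pillow, and the random placement of the two crops are assumptions of this sketch:

```python
import random
from PIL import Image

def two_view_crop(image: Image.Image):
    """Cut two square regions whose side equals the shorter picture
    boundary (the reference edge); return them as V1 and V2."""
    w, h = image.size
    side = min(w, h)  # length of the reference edge (the short edge)
    views = []
    for _ in range(2):
        left = random.randint(0, w - side)  # horizontal offset of the crop
        top = random.randint(0, h - side)   # vertical offset of the crop
        views.append(image.crop((left, top, left + side, top + side)))
    return views[0], views[1]  # first view picture V1, second view picture V2

# v1, v2 = two_view_crop(Image.open("target.jpg"))
```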
202. Determine a first feature vector of the first view picture and a second feature vector of the second view picture.
In the embodiment of the application, after the first view picture and the second view picture are obtained, the first feature vector of the first view picture and the second feature vector of the second view picture are determined for use in the subsequent feature learning process. The learning and training process of the whole model is actually divided into two routes, a main view route and an auxiliary view route: the first view picture is used in the learning process of the main view route, and the second view picture in that of the auxiliary view route. Therefore, in an optional implementation, when extracting the feature vectors, a main view encoder on the main view route is acquired, a preset encoding dimension is determined, and the first view picture is encoded by the main view encoder to obtain the first feature vector, whose dimension matches the preset encoding dimension; an auxiliary view encoder on the auxiliary view route is acquired, and the second view picture is encoded by the auxiliary view encoder to obtain the second feature vector, whose dimension also matches the preset encoding dimension.
Models such as ResNet50, ResNet101, and ViT can be selected as the main view encoder and the auxiliary view encoder, and the preset encoding dimension can be 2048, so that a 2048-dimensional first feature vector and a 2048-dimensional second feature vector are obtained from the two encoders, as sketched below. For convenience of explanation, let the first feature vector of the first view picture V1 be Y1, and the second feature vector of the second view picture V2 be Y2.
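A minimal sketch of the two encoders, assuming a torchvision ResNet-50 whose classification head is removed so that each view is encoded into a 2048-dimensional vector (the variable names are assumptions of this sketch):

```python
import copy
import torch
import torchvision

def build_encoders():
    # Main view encoder: ResNet-50 with the final fc layer replaced by an
    # identity, so it outputs the 2048-dimensional pooled feature.
    main_encoder = torchvision.models.resnet50()
    main_encoder.fc = torch.nn.Identity()
    # The auxiliary view encoder starts as a copy of the main view encoder;
    # its parameters are later changed only by the delayed update strategy,
    # never by backpropagation.
    aux_encoder = copy.deepcopy(main_encoder)
    for p in aux_encoder.parameters():
        p.requires_grad = False
    return main_encoder, aux_encoder

# main_encoder, aux_encoder = build_encoders()
# y1 = main_encoder(v1_batch)  # Y1: [batch, 2048]
# y2 = aux_encoder(v2_batch)   # Y2: [batch, 2048]
```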
203. Perform feature projection on the first feature vector and the second feature vector to obtain the first feature vector after feature projection and the second feature vector after feature projection.
In the embodiment of the present application, feature projection is performed on the first feature vector and the second feature vector through a fully connected layer providing a linear transformation, giving the first feature vector after feature projection and the second feature vector after feature projection. In an optional embodiment, the feature projection proceeds as follows: a main view projection matrix and an auxiliary view projection matrix whose dimensions are a preset projection dimension are acquired; feature projection is performed on the first feature vector with the main view projection matrix and on the second feature vector with the auxiliary view projection matrix. Specifically, the main view projection matrix and the first feature vector are multiplied to obtain a vector whose dimension equals the preset projection dimension, which serves as the first feature vector after feature projection; likewise, the auxiliary view projection matrix and the second feature vector are multiplied to obtain the second feature vector after feature projection.
In practical application, the preset projection dimension may be [128, 2048]; multiplying this matrix with Y1 and Y2 yields 128-dimensional vectors as the projected Y1 and Y2, as sketched below. From a theoretical perspective, the feature projection further compresses and rectifies the information in the original 2048-dimensional Y1 and Y2, keeping 128 representative dimensions for the subsequent learning process; from a practical perspective, reducing the vector dimension without meaningful loss greatly relieves the resource requirements of model training and speeds it up. For convenience in the following explanation, the vector after Y1's feature projection is denoted Z1, and the vector after Y2's feature projection Z2.
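A sketch of the feature projection, assuming the [128, 2048] projection matrix is stored as a bias-free PyTorch linear layer (the patent specifies only the matrix multiplication; the layer form is an assumption of this sketch):

```python
import torch

# Main and auxiliary view projection matrices, each of preset projection
# dimension [128, 2048]; nn.Linear(2048, 128, bias=False) stores exactly
# such a matrix and applies it by matrix multiplication.
main_projection = torch.nn.Linear(2048, 128, bias=False)
aux_projection = torch.nn.Linear(2048, 128, bias=False)

y1 = torch.randn(4, 2048)  # stand-in for a batch of first feature vectors Y1
y2 = torch.randn(4, 2048)  # stand-in for a batch of second feature vectors Y2
z1 = main_projection(y1)   # Z1: [4, 128], the preset projection dimension
z2 = aux_projection(y2)    # Z2: [4, 128]
```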
204. Carry out a nonlinear change on the first feature vector after feature projection to obtain a nonlinear change vector.
In the embodiment of the present application, after the first feature vector after feature projection is obtained, a nonlinear change is applied to it to obtain a nonlinear change vector. In an optional embodiment, the nonlinear change proceeds as follows. First, a first preset transformation matrix whose dimension is a first preset transformation dimension is acquired and multiplied with the first feature vector after feature projection, giving a vector whose dimension equals the first preset transformation dimension as a first intermediate vector. Next, the first intermediate vector is normalized, converting it to a standard normal distribution with a mean of 0 and a variance of 1, as a second intermediate vector. Then, a preset nonlinear activation function is acquired, the second intermediate vector is substituted into it for calculation, and the resulting vector is taken as a third intermediate vector. Finally, a second preset transformation matrix whose dimension is a second preset transformation dimension is acquired and multiplied with the third intermediate vector, and the obtained vector is taken as the nonlinear change vector, whose dimension is consistent with that of the first feature vector after feature projection.
Summarizing the above, the nonlinear change actually consists of 4 steps, so in practice the view prediction module shown in fig. 2B may be set up and the nonlinear change of the first feature vector realized by this module, as sketched below. As shown in fig. 2B, the view prediction module may comprise 4 sub-modules: a first linear transformation layer, a BN (Batch Normalization) layer, a ReLU (Rectified Linear Unit) layer, and a second linear transformation layer. The first linear transformation layer generates the first intermediate vector; similar to the process described in step 203, it uses the first preset transformation matrix to increase the dimension of the vector, and in practice the first preset transformation dimension may be [128 × 4, 128], raising the dimension by a factor of 4. The BN layer generates the second intermediate vector; it normalizes the first intermediate vector, converting its distribution to a standard normal distribution with a mean of 0 and a variance of 1, which avoids the vanishing-gradient problem, increases the convergence speed, and greatly speeds up training. The ReLU layer generates the third intermediate vector; it increases the complexity of the model and strengthens its feature learning capability. The second linear transformation layer generates the nonlinear change vector; similar to step 203, it uses the second preset transformation matrix to reduce the dimension of the vector back to that of the first feature vector after feature projection, and in practice the second preset transformation dimension may be [128, 128 × 4], lowering the dimension by a factor of 4. For convenience of explanation, the nonlinear change vector obtained from Z1 is denoted q1.
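A sketch of the view prediction module of fig. 2B as a PyTorch module, with the hidden width 128 × 4 = 512 taken from the example dimensions above (the class name is an assumption of this sketch):

```python
import torch.nn as nn

class ViewPredictionModule(nn.Module):
    """Linear -> BatchNorm -> ReLU -> Linear, mapping the 128-dim Z1 up to
    128 * 4 = 512 dimensions and back down to 128, as in fig. 2B."""
    def __init__(self, dim: int = 128, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),  # first preset transformation matrix [512, 128]
            nn.BatchNorm1d(hidden),  # normalize toward mean 0, variance 1
            nn.ReLU(inplace=True),   # preset nonlinear activation function
            nn.Linear(hidden, dim),  # second preset transformation matrix [128, 512]
        )

    def forward(self, z1):
        return self.net(z1)  # the nonlinear change vector q1

# predictor = ViewPredictionModule()
# q1 = predictor(z1)
```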
205. Generate a loss function value.
In the embodiment of the present application, after the nonlinear change vector is generated, it is used to predict Z2, the second feature vector after feature projection. The application therefore computes a mean square error loss between q1 and Z2, generates the loss function value, and uses it for the subsequent model optimization.
In an alternative embodiment, when generating the loss function value, for each element among the plurality of elements in the nonlinear change vector, the element at the same position in the second feature vector after feature projection is determined, giving the corresponding-position element of each element, and the following is done for each element: the difference between the element and its corresponding-position element is calculated and squared to obtain a squared value. This yields a squared value for each element; the plurality of squared values is summed, and the sum is taken as the loss function value. Continuing the example of step 204, where the second preset transformation dimension is [128, 128 × 4], both q1 and Z2 are 128-dimensional, so for each of the 128 element pairs the difference is computed first, then squared, and finally the squares are summed to obtain the loss function value, as sketched below.
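A one-line sketch of this loss in PyTorch; detaching Z2 is an assumption of this sketch, consistent with the auxiliary view route not being updated by backpropagation (step 206):

```python
import torch

def view_prediction_loss(q1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    # For each element of q1, subtract the corresponding-position element of
    # Z2, square the difference, then sum the squares over all elements.
    return ((q1 - z2.detach()) ** 2).sum()
```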
206. Perform representation learning of picture features based on the loss function value to obtain a feature extraction model.
In the embodiment of the present application, after the loss function value is generated, training of the model begins. In the model training process, the main view route follows a normal parameter update strategy: the loss function value is calculated in the forward pass, and in the backward pass it is used to update the parameters of all models on the main view route. On the auxiliary view route, the parameters are not updated by backpropagating the loss function value; instead, the embodiment of the application designs a delayed update strategy for the auxiliary view route. The specific training process of the feature extraction model is as follows:
First, the current encoder parameters of the main view encoder are read as the first historical encoder parameters, and the current encoder parameters of the auxiliary view encoder as the second historical encoder parameters. Then, for the main view route, the loss function value is backpropagated and used to update the main view encoder and the main view projection matrix. For the auxiliary view route, a delayed update factor is obtained; an encoder update value is calculated from the first and second historical encoder parameters with the delayed update factor, and a projection matrix to be updated is calculated from the main view projection matrix and the auxiliary view projection matrix with the delayed update factor; the encoder update value is set in the auxiliary view encoder, and the auxiliary view projection matrix is updated to the projection matrix to be updated. With the parameter adjustment of both routes complete, the target picture is enhanced again to obtain a new first view picture and a new second view picture; feature extraction is performed on the new first view picture with the updated main view encoder and feature projection with the updated main view projection matrix; feature extraction is performed on the new second view picture with the updated auxiliary view encoder and feature projection with the updated auxiliary view projection matrix; a new loss function value is regenerated, and the main view encoder, the main view projection matrix, the auxiliary view encoder, and the auxiliary view projection matrix are updated according to it, until the generated loss function value reaches a threshold and the feature extraction model is obtained.
For convenience of explanation, let the first historical encoder parameter be F1, the second historical encoder parameter F2, the encoder update value f2, the main view projection matrix G1, the auxiliary view projection matrix G2, the projection matrix to be updated g2, and the delayed update factor m. The calculation of the encoder update value and the projection matrix to be updated can then be summarized as Equation 1:
Equation 1: f2 = m × F2 + (1 − m) × F1
            g2 = m × G2 + (1 − m) × G1
Here m controls the degree to which parameters on the auxiliary view route change. In practice, m may be set to a large value, for example 0.995, so that each update largely retains the existing parameters on the auxiliary view route and refers to the parameters on the main view route only to a small extent. Taking f2 as an example, the update largely keeps the previous auxiliary parameters, i.e. 0.995 × F2, and uses the main-route parameters only slightly, i.e. 0.005 × F1. Thus, through the delayed update strategy described in the embodiment of the application, the parameters to be learned by the modules on the auxiliary view route change in step with those on the main view route, rather than being updated directly by backpropagation of the loss function value, as sketched below; representation learning of picture features under this delayed update strategy yields the feature extraction model.
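A sketch of Equation 1 applied parameter-by-parameter to a module pair (using m = 0.995 from the example; applying the same rule to every parameter tensor of the encoders and projection layers is an assumption of this sketch):

```python
import torch

@torch.no_grad()
def delayed_update(main_module, aux_module, m: float = 0.995):
    """f2 = m * F2 + (1 - m) * F1: each auxiliary-route parameter keeps
    most of its old value and takes a small step toward the corresponding
    main-route parameter. Used for both the encoder pair and the
    projection-matrix pair."""
    for p_main, p_aux in zip(main_module.parameters(), aux_module.parameters()):
        p_aux.mul_(m).add_(p_main, alpha=1.0 - m)

# After each backpropagation step on the main view route:
# delayed_update(main_encoder, aux_encoder)
# delayed_update(main_projection, aux_projection)
```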
207. Input the target picture into the feature extraction model, and acquire the vector output by the feature extraction model as the picture feature vector of the target picture.
In the embodiment of the application, after training of the feature extraction model is complete, feature extraction is performed on the target picture: the target picture is input into the feature extraction model, and the vector output by the model is acquired as the picture feature vector of the target picture. In an optional embodiment, when the picture feature vector is extracted, the target picture is input into the feature extraction model, and the main view encoder in the model encodes it to obtain an initial picture feature vector; the main view projection matrix in the model then performs feature projection on the initial picture feature vector, and the result is taken as the picture feature vector of the target picture. That is, at extraction time everything on the auxiliary view route is discarded and no nonlinear change is needed: the target picture is processed directly by the main view route, and the vector Z obtained after feature projection on the main view route serves as the final feature vector representation of the target picture, as sketched below.
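A sketch of this inference path (the function name is an assumption; the module names continue the earlier sketches):

```python
import torch

@torch.no_grad()
def extract_picture_feature(target_batch, main_encoder, main_projection):
    """Discard the auxiliary view route and the view prediction module; the
    picture feature vector is the projected main-route output Z."""
    main_encoder.eval()
    initial = main_encoder(target_batch)  # initial picture feature vector (2048-dim)
    return main_projection(initial)       # final 128-dim picture feature vector Z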
Therefore, this technical scheme weakens the demand for a large number of high-quality negative samples in picture feature learning, relieves the high GPU-memory demand and slow training caused by contrastive learning's need to compare against many negative samples, alleviates the drop in model performance caused by lax early-stage preprocessing such as screening and filtering, simplifies the data collection and processing flow, and shortens the model iteration cycle. In addition, the delayed update strategy greatly reduces the number of parameters that must be learned through backpropagation of the loss function value, reduces the intermediate values that must be stored for backpropagation, lowers the model's GPU memory usage, and greatly improves training efficiency.
In practical application, the technical scheme of the application can be realized as a feature extraction system comprising an enhancement processing module, a feature extraction module, a feature projection module, and a view prediction module. The enhancement processing module enhances the target picture to obtain pictures from two different views. The feature extraction module encodes the two view pictures into two vectors, namely the Y1 and Y2 mentioned in step 202. The feature projection module performs feature projection on Y1 and Y2 to obtain the Z1 and Z2 mentioned in step 203. The view prediction module generates the loss function value used for parameter regulation, from the nonlinear change vector q1 obtained by applying the nonlinear change to Z1, together with Z2. It should be noted that the feature extraction system realizes the complete picture feature extraction process of the present application through the computing capability of the server it runs on, where the server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Referring specifically to fig. 2C, two routes may be set in the feature extraction system: a main view route and an auxiliary view route. Assuming the input target picture is A, two pictures V1 and V2 at different views are obtained after the enhancement processing module; V1 and V2 are then input to the feature extraction module for encoding to obtain Y1 and Y2. Next, Y1 and Y2 are input to the feature projection module to obtain Z1 and Z2; the nonlinear transformation vector q1 is obtained by nonlinear transformation of Z1, and the view prediction module processes q1 and Z2 to obtain the loss function value loss. On the main view route, parameters are regulated by backpropagating loss; on the auxiliary view route, no backpropagation is performed, and a delayed-update strategy lets the auxiliary parameters track the regulation applied to the main view route parameters. The feature extraction system thus adopts a modular design: the enhancement processing module and the feature extraction module can each be replaced as techniques iterate, the system is highly extensible, and the negative-sample-free multi-view self-supervised picture feature learning framework of the application can accommodate a large number of different methods.
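The per-step flow along the two routes can be sketched as below, assuming PyTorch-style modules f_main/f_aux for the two encoders, plain matrices W_main/W_aux for the two projections, and a predictor module for the nonlinear transformation; none of these names come from the application itself. Gradients flow only through the main view route, so the auxiliary route is wrapped in no_grad; the auxiliary parameters are refreshed separately by the delayed update (see the sketch further below).

    import torch

    def training_step(picture, enhance, f_main, W_main, f_aux, W_aux, predictor, opt):
        # opt is assumed to hold the parameters of f_main, W_main and predictor only
        v1, v2 = enhance(picture)        # V1, V2: two views of target picture A
        y1 = f_main(v1)                  # Y1: main-route encoding
        z1 = y1 @ W_main.T               # Z1: main-route feature projection
        with torch.no_grad():            # auxiliary route: no backpropagation
            z2 = f_aux(v2) @ W_aux.T     # Z2
        q1 = predictor(z1)               # nonlinear transformation vector q1
        loss = ((q1 - z2) ** 2).sum()    # loss from q1 and Z2
        opt.zero_grad()
        loss.backward()                  # regulate main-route parameters only
        opt.step()
        return loss.item()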
According to the method provided by the embodiment of the application, pictures at different views weaken the demand for large numbers of high-quality negative samples in picture feature learning, relieve the high video-memory requirements and slow training caused by comparing large numbers of negative samples in contrastive learning, avoid the drop in model performance caused by insufficiently rigorous preprocessing such as screening and filtering, and simplify the data collection and processing flow; this not only reduces the pressure of learning and memory and improves learning efficiency, but also improves the accuracy and performance of picture feature extraction.
Further, as a specific implementation of the method shown in fig. 1, an embodiment of the present application provides an apparatus for extracting picture features. As shown in fig. 3, the apparatus includes: an enhancement processing module 301, a feature extraction module 302, a feature projection module 303, a view prediction module 304 and a feature learning module 305.
The enhancement processing module 301 is configured to acquire a target picture to be subjected to picture feature extraction, and perform enhancement processing on the target picture to obtain a first view picture and a second view picture;
the feature extraction module 302 is configured to determine a first feature vector of the first view picture and a second feature vector of the second view picture;
the feature projection module 303 is configured to perform feature projection on the first feature vector and the second feature vector;
the view prediction module 304 is configured to generate a loss function value by using the first feature vector after feature projection and the second feature vector after feature projection;
the feature learning module 305 is configured to perform representation learning of picture features based on the loss function values to obtain a feature extraction model;
the feature projection module 303 is further configured to input the target picture into the feature extraction model, and obtain a vector output by the feature extraction model as a picture feature vector of the target picture.
In a specific application scenario, the enhancement processing module 301 is configured to acquire the input target picture to be subjected to picture feature extraction and determine a reference edge in the target picture, where the reference edge is a picture boundary of the target picture whose length is less than or equal to the length of every other picture boundary of the target picture; and to crop, from the target picture, two region pictures whose side length equals the length of the reference edge, taking either one of the two region pictures as the first view picture and the other as the second view picture.
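A minimal sketch of this cropping strategy, assuming a channel-first tensor and random crop offsets (the application does not fix how the two crop positions are chosen):

    import random
    import torch

    def two_square_views(img: torch.Tensor):
        """img: (C, H, W). Returns two square crops with side = reference edge."""
        _, h, w = img.shape
        side = min(h, w)                       # length of the reference edge
        views = []
        for _ in range(2):
            top = random.randint(0, h - side)  # zero slack along the reference edge
            left = random.randint(0, w - side)
            views.append(img[:, top:top + side, left:left + side])
        return views[0], views[1]              # first view picture, second view picture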
In a specific application scenario, the feature extraction module 302 is configured to acquire a main view encoder, determine a preset encoding dimension, and encode the first view picture based on the main view encoder to obtain the first feature vector whose dimension is consistent with the preset encoding dimension; and to determine an auxiliary view encoder and encode the second view picture based on the auxiliary view encoder to obtain the second feature vector whose dimension is consistent with the preset encoding dimension;
the feature projection module 303 is configured to acquire a main view projection matrix and an auxiliary view projection matrix whose dimensions are a preset projection dimension, perform feature projection on the first feature vector using the main view projection matrix, and perform feature projection on the second feature vector using the auxiliary view projection matrix, obtaining the first feature vector after feature projection and the second feature vector after feature projection;
the view prediction module 304 is configured to apply a nonlinear transformation to the first feature vector after feature projection to obtain a nonlinear transformation vector; determine, for each of the elements of the nonlinear transformation vector, the element at the same position in the second feature vector after feature projection as its corresponding element, and perform the following processing on each element: calculate the difference between the element and its corresponding element, and square the difference to obtain a squared value; then obtain the squared value of each element, calculate the sum of the squared values, and take the sum as the loss function value.
In a specific application scenario, the feature projection module 303 is configured to multiply the main view projection matrix with the first feature vector by matrix multiplication, obtaining a vector whose dimension equals the preset projection dimension as the first feature vector after feature projection; and to multiply the auxiliary view projection matrix with the second feature vector by matrix multiplication, obtaining a vector whose dimension equals the preset projection dimension as the second feature vector after feature projection.
In a specific application scenario, the view prediction module 304 is configured to acquire a first preset transformation matrix whose dimension is a first preset transformation dimension and multiply it with the first feature vector after feature projection by matrix multiplication, obtaining a vector whose dimension equals the first preset transformation dimension as a first intermediate vector; normalize the first intermediate vector so that its elements follow a standard normal distribution with mean 0 and variance 1, obtaining a second intermediate vector; acquire a preset nonlinear activation function, apply it to the second intermediate vector, and take the calculated vector as a third intermediate vector; and acquire a second preset transformation matrix whose dimension is a second preset transformation dimension, multiply it with the third intermediate vector by matrix multiplication, and take the resulting vector as the nonlinear transformation vector, whose dimension is consistent with the dimension of the first feature vector after feature projection.
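A hedged sketch of this nonlinear transformation as a small PyTorch module; the hidden dimension, the choice of BatchNorm for the normalization, and ReLU for the activation are assumptions, not values fixed by the application (the description specifies only matrix multiplications, a mean-0/variance-1 normalization, and a preset activation function).

    import torch.nn as nn

    class Predictor(nn.Module):
        def __init__(self, proj_dim: int = 256, hidden_dim: int = 1024):
            super().__init__()
            self.fc1 = nn.Linear(proj_dim, hidden_dim, bias=False)  # first preset transformation matrix
            self.norm = nn.BatchNorm1d(hidden_dim)                  # mean 0, variance 1
            self.act = nn.ReLU()                                    # preset nonlinear activation function
            self.fc2 = nn.Linear(hidden_dim, proj_dim, bias=False)  # second preset transformation matrix

        def forward(self, z):  # z: (batch, proj_dim) = first feature vector after projection
            # output dimension matches the projected first feature vector
            return self.fc2(self.act(self.norm(self.fc1(z))))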
In a specific application scenario, the feature learning module 305 is configured to read the current encoder parameters of the main view encoder as first historical encoder parameters, and read the current encoder parameters of the auxiliary view encoder as second historical encoder parameters; backpropagate the loss function value, and update the main view encoder and the main view projection matrix using the loss function value; acquire a delayed-update factor, combine the first historical encoder parameters and the second historical encoder parameters using the delayed-update factor to obtain an encoder update value, and combine the main view projection matrix and the auxiliary view projection matrix using the delayed-update factor to obtain a projection matrix to be updated; and set the encoder update value in the auxiliary view encoder and update the auxiliary view projection matrix to the projection matrix to be updated;
the enhancement processing module 301 is further configured to perform enhancement processing on the target picture again; for the new first view picture and the new second view picture thus obtained, feature extraction is performed on the new first view picture based on the updated main view encoder and feature projection based on the updated main view projection matrix, feature extraction is performed on the new second view picture based on the updated auxiliary view encoder and feature projection based on the updated auxiliary view projection matrix, and a new loss function value is regenerated; the main view encoder, the main view projection matrix, the auxiliary view encoder and the auxiliary view projection matrix are updated according to the new loss function value, and the feature extraction model is obtained once the generated loss function value reaches a threshold value.
In a specific application scenario, the feature projection module 303 is further configured to input the target picture to the feature extraction model, and perform encoding processing on the target picture based on a main view encoder in the feature extraction model to obtain an initial picture feature vector; and performing feature projection processing on the initial picture feature vector based on a main view projection matrix in the feature extraction model to obtain the initial picture feature vector after feature projection, and taking the initial picture feature vector after feature projection as the picture feature vector of the target picture.
The apparatus provided by the embodiment of the application uses pictures at different views to weaken the demand for large numbers of high-quality negative samples in picture feature learning, relieves the high video-memory requirements and slow training caused by comparing large numbers of negative samples in contrastive learning, avoids the drop in model performance caused by insufficiently rigorous preprocessing such as screening and filtering, and simplifies the data collection and processing flow; this not only reduces the pressure of learning and memory and improves learning efficiency, but also improves the accuracy and performance of picture feature extraction.
It should be noted that, for other descriptions of the functional units involved in the apparatus for extracting picture features provided in the embodiment of the present application, reference may be made to the corresponding descriptions of fig. 1 and fig. 2A to fig. 2C, which are not repeated here.
In an exemplary embodiment, referring to fig. 4, a computer device is further provided. The computer device includes a bus, a processor, a memory and a communication interface, and may further include an input/output interface and a display device, the functional units communicating with one another through the bus. The memory stores a computer program, and the processor executes the program stored in the memory to perform the picture feature extraction method in the above embodiments.
In an exemplary embodiment, a computer-readable storage medium is further provided, on which a computer program is stored which, when executed by a processor, implements the steps of the above picture feature extraction method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for enabling a computer device (such as a personal computer, a server, or a network device) to execute the methods of the implementation scenarios of the present application.
Those skilled in the art will appreciate that the drawings are merely schematic diagrams of preferred implementation scenarios, and that the modules or flows in the drawings are not necessarily required for practicing the present application.
Those skilled in the art can understand that the modules in the apparatus of an implementation scenario may be distributed in the apparatus as described in that scenario, or may be located, with corresponding changes, in one or more apparatuses different from that of the present implementation scenario. The modules of the above implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above serial numbers are merely for description and do not represent the superiority or inferiority of the implementation scenarios.
The above disclosure describes only a few specific implementation scenarios of the present application; the present application is not, however, limited to these, and any variation conceivable to those skilled in the art shall fall within the protection scope of the present application.

Claims (10)

1. A method for extracting picture features is characterized by comprising the following steps:
acquiring a target picture to be subjected to picture feature extraction, and performing enhancement processing on the target picture to obtain a first view picture and a second view picture;
determining a first feature vector of the first visual angle picture and a second feature vector of the second visual angle picture, performing feature projection on the first feature vector and the second feature vector, and generating a loss function value by adopting the first feature vector after feature projection and the second feature vector after feature projection;
performing representation learning of picture features based on the loss function values to obtain a feature extraction model;
and inputting the target picture into the feature extraction model, and acquiring a vector output by the feature extraction model as a picture feature vector of the target picture.
2. The method according to claim 1, wherein the obtaining a target picture to be subjected to picture feature extraction, and performing enhancement processing on the target picture to obtain a first view picture and a second view picture comprises:
acquiring the input target picture to be subjected to picture feature extraction, and determining a reference edge in the target picture, wherein the reference edge is a picture boundary of the target picture whose length is less than or equal to the length of every other picture boundary of the target picture;
and cropping, from the target picture, two region pictures whose side length equals the length of the reference edge, taking either one of the two region pictures as the first view picture and the other as the second view picture.
3. The method of claim 1, wherein the determining a first feature vector of the first view picture and a second feature vector of the second view picture, performing feature projection on the first feature vector and the second feature vector, and generating a loss function value using the first feature vector after feature projection and the second feature vector after feature projection comprises:
acquiring a main view encoder, determining a preset encoding dimension, and encoding the first view picture based on the main view encoder to obtain the first feature vector whose dimension is consistent with the preset encoding dimension;
determining an auxiliary view encoder, and encoding the second view picture based on the auxiliary view encoder to obtain the second feature vector whose dimension is consistent with the preset encoding dimension;
acquiring a main view projection matrix and an auxiliary view projection matrix whose dimensions are a preset projection dimension, performing feature projection on the first feature vector using the main view projection matrix, and performing feature projection on the second feature vector using the auxiliary view projection matrix, to obtain the first feature vector after feature projection and the second feature vector after feature projection;
performing a nonlinear transformation on the first feature vector after feature projection to obtain a nonlinear transformation vector;
in the second feature vector after feature projection, determining, for each of the elements of the nonlinear transformation vector, the element at the same position as its corresponding element, and performing the following processing on each element: calculating the difference between the element and its corresponding element, and squaring the difference to obtain a squared value;
obtaining the squared value of each element to obtain a plurality of squared values, calculating the sum of the plurality of squared values, and taking the sum as the loss function value.
4. The method of claim 3, wherein the performing feature projection on the first feature vector using the main view projection matrix and performing feature projection on the second feature vector using the auxiliary view projection matrix to obtain the first feature vector after feature projection and the second feature vector after feature projection comprises:
multiplying the main view projection matrix with the first feature vector by matrix multiplication to obtain a vector whose dimension equals the preset projection dimension as the first feature vector after feature projection;
and multiplying the auxiliary view projection matrix with the second feature vector by matrix multiplication to obtain a vector whose dimension equals the preset projection dimension as the second feature vector after feature projection.
5. The method of claim 3, wherein the performing a nonlinear transformation on the first feature vector after feature projection to obtain a nonlinear transformation vector comprises:
acquiring a first preset transformation matrix whose dimension is a first preset transformation dimension, and multiplying the first preset transformation matrix with the first feature vector after feature projection by matrix multiplication to obtain a vector whose dimension equals the first preset transformation dimension as a first intermediate vector;
normalizing the first intermediate vector so that its elements follow a standard normal distribution with mean 0 and variance 1, and taking the result as a second intermediate vector;
acquiring a preset nonlinear activation function, applying the nonlinear activation function to the second intermediate vector, and taking the calculated vector as a third intermediate vector;
and acquiring a second preset transformation matrix whose dimension is a second preset transformation dimension, multiplying the second preset transformation matrix with the third intermediate vector by matrix multiplication, and taking the resulting vector as the nonlinear transformation vector, wherein the dimension of the nonlinear transformation vector is consistent with the dimension of the first feature vector after feature projection.
6. The method of claim 1, wherein the learning of the representation of the picture features based on the loss function values to obtain a feature extraction model comprises:
reading current encoder parameters of a main view encoder as first historical encoder parameters, and reading current encoder parameters of an auxiliary view encoder as second historical encoder parameters;
backpropagating the loss function value, and updating the main view encoder and the main view projection matrix using the loss function value;
acquiring a delayed-update factor, combining the first historical encoder parameters and the second historical encoder parameters using the delayed-update factor to obtain an encoder update value, and combining the main view projection matrix and the auxiliary view projection matrix using the delayed-update factor to obtain a projection matrix to be updated;
setting the encoder update value in the auxiliary view encoder, and updating the auxiliary view projection matrix to the projection matrix to be updated;
performing enhancement processing on the target picture again; for the new first view picture and the new second view picture thus obtained, performing feature extraction on the new first view picture based on the updated main view encoder and feature projection based on the updated main view projection matrix, performing feature extraction on the new second view picture based on the updated auxiliary view encoder and feature projection based on the updated auxiliary view projection matrix, and regenerating a new loss function value; and updating the main view encoder, the main view projection matrix, the auxiliary view encoder and the auxiliary view projection matrix according to the new loss function value, and obtaining the feature extraction model once the generated loss function value reaches a threshold value.
7. The method according to claim 1, wherein the inputting the target picture into the feature extraction model, and obtaining a vector output by the feature extraction model as a picture feature vector of the target picture, comprises:
inputting the target picture into the feature extraction model, and coding the target picture based on a main view encoder in the feature extraction model to obtain an initial picture feature vector;
and performing feature projection processing on the initial picture feature vector based on a main view projection matrix in the feature extraction model to obtain the initial picture feature vector after feature projection, and taking the initial picture feature vector after feature projection as the picture feature vector of the target picture.
8. An apparatus for extracting picture features, comprising:
the enhancement processing module is used for acquiring a target picture to be subjected to picture feature extraction, and enhancing the target picture to obtain a first view angle picture and a second view angle picture;
the feature extraction module is used for determining a first feature vector of the first view picture and a second feature vector of the second view picture;
the feature projection module is used for performing feature projection on the first feature vector and the second feature vector;
the view prediction module is used for generating a loss function value by adopting the first feature vector after feature projection and the second feature vector after feature projection;
the characteristic learning module is used for performing representation learning of picture characteristics based on the loss function values to obtain a characteristic extraction model;
the feature projection module is further configured to input the target picture to the feature extraction model, and obtain a vector output by the feature extraction model as a picture feature vector of the target picture.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210675949.6A 2022-06-15 2022-06-15 Picture feature extraction method and device, computer equipment and readable storage medium Pending CN114926658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210675949.6A CN114926658A (en) 2022-06-15 2022-06-15 Picture feature extraction method and device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210675949.6A CN114926658A (en) 2022-06-15 2022-06-15 Picture feature extraction method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114926658A true CN114926658A (en) 2022-08-19

Family

ID=82814038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210675949.6A Pending CN114926658A (en) 2022-06-15 2022-06-15 Picture feature extraction method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114926658A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496975A (en) * 2022-08-29 2022-12-20 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN115496975B (en) * 2022-08-29 2023-08-18 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109740499B (en) Video segmentation method, video motion recognition method, device, equipment and medium
US11967151B2 (en) Video classification method and apparatus, model training method and apparatus, device, and storage medium
CN111950723B (en) Neural network model training method, image processing method, device and terminal equipment
US20230186056A1 (en) Grabbing detection method based on rp-resnet
CN111079532B (en) Video content description method based on text self-encoder
CN111325851B (en) Image processing method and device, electronic equipment and computer readable storage medium
EP3779891A1 (en) Method and device for training neural network model, and method and device for generating time-lapse photography video
CN110807757B (en) Image quality evaluation method and device based on artificial intelligence and computer equipment
CN111696112A (en) Automatic image cutting method and system, electronic equipment and storage medium
WO2023035531A1 (en) Super-resolution reconstruction method for text image and related device thereof
CN106503112B (en) Video retrieval method and device
CN112088393A (en) Image processing method, device and equipment
CN114863229A (en) Image classification method and training method and device of image classification model
CN117095019B (en) Image segmentation method and related device
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
CN113837290A (en) Unsupervised unpaired image translation method based on attention generator network
CN114926658A (en) Picture feature extraction method and device, computer equipment and readable storage medium
CN116982089A (en) Method and system for image semantic enhancement
CN112862838A (en) Natural image matting method based on real-time click interaction of user
CN114240954A (en) Network model training method and device and image segmentation method and device
CN114240770A (en) Image processing method, device, server and storage medium
CN114120413A (en) Model training method, image synthesis method, device, equipment and program product
CN112669244A (en) Face image enhancement method and device, computer equipment and readable storage medium
CN110390336B (en) Method for improving feature point matching precision
CN109741300B (en) Image significance rapid detection method and device suitable for video coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination