CN110532409A - Image retrieval method based on a heterogeneous bilinear attention network

Info

Publication number: CN110532409A
Application number: CN201910692241.XA
Authority: CN
Inventors: 王鹏; 苏海波
Original assignee: Northwest University of Technology
Current assignee: Northwest University of Technology
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2019-12-03
Anticipated expiration: 2039-07-30
Also published as: CN110532409B

Abstract

The present invention relates to an image retrieval method based on a heterogeneous bilinear attention network. The network has two dedicated branches: one provides key-region localization information, and the other provides attribute-level descriptions. The outputs of the two branches are integrated into an image-level representation by an attention-based bilinear module. The two branches are pre-trained on two complementary tasks to ensure that they are capable of key-region localization and attribute description. The first branch uses an hourglass network to perform key-region detection; the second branch uses an Inception-ResNet-v2 network to perform attribute prediction. The invention uses a channel-wise attention mechanism, driven jointly by the two branches, to weight the channels of the two branches' output representations, and then integrates the weighted representations into the final image-level representation using compact bilinear pooling. Finally, the Euclidean distances between the representations of different images are computed and ranked to obtain the retrieval result.

Description

Image retrieval method based on a heterogeneous bilinear attention network

Technical field

The invention belongs to the field of content-based image retrieval. Specifically, it concerns an image retrieval method and system comprising a channel attention mechanism that optimizes heterogeneous features, and compact bilinear pooling that models the interaction between the optimized features of the two heterogeneous branches.

Background art

Content-based image retrieval helps users browse a large image database and find the images they need. It has great commercial value and has therefore attracted considerable research interest in recent years. However, the images in a database are often captured under different illumination conditions, from different shooting angles, and against cluttered backgrounds. In addition, the differences between many images lie in the details; for example, the neckline of a dress comes in many styles: crew neck, V-neck, boat neck, and so on. These phenomena make image retrieval highly challenging. The challenges can be summarized as two problems: "where to look" and "how to describe". "Where to look" concerns how to find the key parts of an object. An image usually contains several key parts of the retrieved object, and people can tell two images apart by comparing the visual appearance of these parts. "How to describe" concerns describing the visual content of the image so that the retrieval system is freed from the influence of factors such as illumination, background, pose, and viewing angle, and can focus on the attributes of the retrieval target.

Summary of the invention

Technical problem to be solved

To overcome the shortcomings of the prior art, the present invention proposes an image retrieval method based on a heterogeneous bilinear attention network framework.

Technical solution

An image retrieval method based on a heterogeneous bilinear attention network, characterized by the following steps:

Step 1: Pass a picture through the hourglass network to obtain a feature map V^l with C^l channels; at the same time, pass it through the Inception-ResNet-v2 network to obtain another feature map V^a with C^a channels;

Step 2: Apply global average pooling to each of the two feature maps to obtain two vectors v^a and v^l:

v^a = GlobalAveragePooling(V^a), (1)

v^l = GlobalAveragePooling(V^l). (2)

Step 3: Concatenate v^a and v^l into a single vector and pass it through two parallel multilayer perceptrons to compute the channel-wise attention weights for the feature map of each branch. The formulas are as follows:

α^a = Sigmoid(W_2^a · ReLU(W_1^a · (v^a ⊕ v^l))), (3)

α^l = Sigmoid(W_2^l · ReLU(W_1^l · (v^a ⊕ v^l))). (4)

Here W_1^a ∈ R^(k^a × C), W_2^a ∈ R^(C^a × k^a), W_1^l ∈ R^(k^l × C) and W_2^l ∈ R^(C^l × k^l) are all linear transformation matrices; k^a and k^l are projection dimensions; ⊕ denotes concatenation, so C = C^a + C^l; α^a is the channel-wise attention weight assigned to the attribute classification branch, and α^l is the channel-wise attention weight assigned to the key-region localization branch; Sigmoid and ReLU are common activation functions;

Step 4: Obtain the two weighted feature maps by multiplying each channel of V^a and V^l by its weight in α^a and α^l respectively, and then resample them to the same spatial size W × H;

Step 5: Given the feature vector x_ij at position (i, j) of a feature map obtained in step 4, use the count sketch function Ψ to project x_ij onto a target vector y_ij of dimension d. A sign vector s and a mapping vector p, each with one entry per element of x_ij, are also used. Each value in s is drawn from {+1, -1} with equal probability; each value in p is drawn from {1, …, d} with uniform probability. The count sketch function Ψ is defined as follows:

y_ij = Ψ(x_ij, s, p) = [v₁, …, v_d], (5)

Here v_t = ∑_l s[l]·x_ij[l], where the sum runs over all l such that p[l] = t. If the outer product of two vectors (x_ij^a and x_ij^l, taken from the two branches) is the input of the count sketch function, the result can be written as the convolution of two count sketch functions that each take one of the vectors as input:

Ψ(x_ij^a ⊙ x_ij^l, s, p) = Ψ(x_ij^a, s^a, p^a) * Ψ(x_ij^l, s^l, p^l), (6)

Here ⊙ denotes the outer product and * denotes convolution. The bilinear feature is finally obtained by converting between the time domain and the frequency domain:

F_ij = FFT^-1(FFT(Ψ(x_ij^a, s^a, p^a)) ° FFT(Ψ(x_ij^l, s^l, p^l))), (7)

where ° denotes element-wise multiplication;

Step 6: During training, the resulting bilinear feature is used for ID classification training. At test time, the normalized bilinear features of the query image and of the images in the database are computed, and the Euclidean distances between them are then calculated to obtain the final top-k result.

Beneficial effect

The image retrieval method based on a heterogeneous bilinear attention network proposed by the present invention obtains a robust bilinear feature of an image for content-based image retrieval. The bilinear feature not only solves the "where to look" problem, that is, finding the key regions in the image, but also solves the "how to describe" problem for those regions by providing the attribute features of each key region.

Detailed description of the embodiments

The technical solution of the invention is as follows. The network has two dedicated branches: one provides key-region localization information, and the other provides attribute-level descriptions. The outputs of the two branches are integrated into an image-level representation by an attention-based bilinear module. The two branches are pre-trained on two complementary tasks to ensure that they are capable of key-region localization and attribute description. The first branch uses an hourglass network to perform key-region detection; the second branch uses an Inception-ResNet-v2 network to perform attribute prediction. The invention uses a channel-wise attention mechanism, driven jointly by the two branches, to weight the channels of the two branches' output representations, and then integrates the weighted representations into the final image-level representation using compact bilinear pooling. Finally, the Euclidean distances between the representations of different images are computed and ranked to obtain the retrieval result.

The detailed process is as follows:

1. Pre-training the attribute classification branch

Given a picture, resize it to 299 × 299 using bilinear interpolation. The resized picture is input to the Inception-ResNet-v2 network with its last two layers (the average pooling layer and the fully connected layer) removed, giving the feature map output by the network, of size 1536 × 8 × 8. A new average pooling layer and a new fully connected layer are then appended to the network; unlike the original network, the output dimension of the new fully connected layer is the number of attributes to be classified. To address data imbalance, the invention selects attributes with comparable numbers of samples for prediction, uses the binary cross-entropy loss function to assess performance on the multi-label attribute prediction task, and uses stochastic gradient descent to optimize and update the parameters.
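As an illustration of the multi-label loss named above, the following is a minimal NumPy sketch of binary cross-entropy computed from raw attribute scores. The function name `multilabel_bce`, the use of logits as input, and the example values are assumptions made for illustration; the patent does not specify an implementation.

```python
import numpy as np

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over all attributes, computed from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # Numerically stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
    per_label = (np.maximum(logits, 0.0) - logits * targets
                 + np.log1p(np.exp(-np.abs(logits))))
    return per_label.mean()

# Hypothetical example: one image scored against four attributes.
example_loss = multilabel_bce([2.0, -1.0, 0.0, 3.0], [1, 0, 0, 1])
```

Each attribute is treated as an independent yes/no decision, which is why a per-attribute sigmoid is used here rather than a softmax over attributes.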

2. Pre-training the key-region localization branch

Given a picture, resize it to 256 × 256 using bilinear interpolation. The resized picture is input to the hourglass network, with the number of landmarks set to 8, so that the coordinates of 8 keypoints are output. 64 × 64 heatmaps are then generated from the coordinates of these 8 keypoints, and the normalized mean error is computed against the heatmaps (64 × 64) corresponding to the ground truth. The Adam optimizer is used to update the parameters during training.

3. Data augmentation

A picture is randomly flipped horizontally and randomly rotated by an angle θ ∈ [-30°, +30°], with the same flip and rotation applied to both branch inputs. It is then resized with bilinear interpolation to 299 × 299 and to 256 × 256, and finally normalized to give two tensors (299 × 299 × 3 and 256 × 256 × 3). Because the 256 × 256 × 3 tensor is the input of the key-region localization branch, the keypoint coordinates annotated on the original image must be transformed as well. When the image is flipped horizontally, the coordinates of each left-side point are replaced by those of the corresponding right-side point, and the coordinates of each right-side point are replaced by those of the corresponding left-side point. The keypoint coordinates are likewise adjusted for the random rotation and for the resizing.
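The coordinate bookkeeping described above can be sketched as follows. This is an illustrative sketch only: the function names, the `swap_pairs` argument (index pairs of left/right landmarks), the choice of the image centre as the rotation centre, and the simple proportional scaling on resize are all assumptions not stated in the patent. The sign of the rotation angle must match whatever image rotation routine is used.

```python
import math

def hflip_keypoints(points, width, swap_pairs=()):
    """Mirror (x, y) keypoints horizontally, then swap left/right landmark slots."""
    flipped = [(width - 1 - x, y) for x, y in points]
    for i, j in swap_pairs:
        # After mirroring, the former left landmark sits on the right and vice versa.
        flipped[i], flipped[j] = flipped[j], flipped[i]
    return flipped

def rotate_keypoints(points, angle_deg, width, height):
    """Rotate keypoints by angle_deg about the image centre (pixel coordinates)."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [(cx + c * (x - cx) - s * (y - cy),
             cy + s * (x - cx) + c * (y - cy)) for x, y in points]

def resize_keypoints(points, old_size, new_size):
    """Scale keypoints when an image of old_size=(w, h) is resized to new_size=(w, h)."""
    sx = new_size[0] / old_size[0]
    sy = new_size[1] / old_size[1]
    return [(x * sx, y * sy) for x, y in points]

# Hypothetical example: two landmarks (left, right) on a 10-pixel-wide image.
flipped = hflip_keypoints([(0, 5), (9, 5)], width=10, swap_pairs=[(0, 1)])
```

In the example, landmark 0 remains the left-hand landmark after the flip because its slot is swapped with landmark 1, which is the behaviour the text describes for left and right points.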

4. Obtaining the branch features of an image

The tensor (299 × 299 × 3) obtained from data preprocessing is input to the Inception-ResNet-v2 network to obtain the feature map V^a. The other tensor (256 × 256 × 3) of the same image is input to the hourglass network to obtain the other feature map V^l.

5. Feature optimization based on the channel-wise attention mechanism

The feature maps output by the two branches undergo global average pooling, giving two global vectors v^a and v^l. The two vectors are concatenated, and the concatenated vector is passed through a fully connected layer and a ReLU layer to obtain a 512-dimensional hidden layer, and then through another fully connected layer and a Sigmoid layer to obtain the channel weight vectors α^a and α^l of the feature maps. To assign the weights to the channels of the two feature maps, each value in a weight vector is multiplied by the corresponding channel, giving the optimized features of the two branches.
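The attention step can be sketched in NumPy as below. This is a sketch under stated assumptions rather than the patented implementation: it follows the two parallel perceptrons of the technical solution with a 512-dimensional hidden layer, omits bias terms, uses random matrices in place of trained weights, and takes 256 channels and a 64 × 64 spatial size for the hourglass feature map purely as placeholders, since the patent does not give those numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(V_a, V_l, W1_a, W2_a, W1_l, W2_l):
    """Weight the channels of two (C, H, W) feature maps with jointly driven attention."""
    v_a = V_a.mean(axis=(1, 2))          # global average pooling -> (C_a,)
    v_l = V_l.mean(axis=(1, 2))          # global average pooling -> (C_l,)
    v = np.concatenate([v_a, v_l])       # joint descriptor -> (C_a + C_l,)
    # Each branch has its own perceptron, but both read the joint descriptor.
    alpha_a = sigmoid(W2_a @ np.maximum(W1_a @ v, 0.0))
    alpha_l = sigmoid(W2_l @ np.maximum(W1_l @ v, 0.0))
    return V_a * alpha_a[:, None, None], V_l * alpha_l[:, None, None], alpha_a, alpha_l

rng = np.random.default_rng(0)
C_a, C_l, k = 1536, 256, 512             # C_l = 256 is a placeholder assumption
V_a = rng.standard_normal((C_a, 8, 8))
V_l = rng.standard_normal((C_l, 64, 64))  # 64 x 64 is a placeholder assumption
W1_a, W1_l = (0.01 * rng.standard_normal((k, C_a + C_l)) for _ in range(2))
W2_a = 0.01 * rng.standard_normal((C_a, k))
W2_l = 0.01 * rng.standard_normal((C_l, k))
out_a, out_l, alpha_a, alpha_l = channel_attention(V_a, V_l, W1_a, W2_a, W1_l, W2_l)
```

Because both perceptrons take the concatenated vector as input, the weights for one branch depend on the content of the other, which is what "jointly driven" refers to.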

6. Compact bilinear pooling of the two features

First, the optimized features of the two branches are brought to the same spatial size (8 × 8) by average pooling. Take the vectors at position (i, j) of the two branches' feature maps, x_ij^a and x_ij^l. Each of the two vectors is projected onto a d-dimensional vector by a count sketch function; experiments show that d = 16k works well. Specifically, for x_ij^a, create two vectors s^a and p^a. Each value of s^a is initialized by drawing from {+1, -1} with equal probability, and each value of p^a is initialized by drawing from {1, …, 16k} with equal probability. x_ij^a, s^a and p^a are the inputs of the count sketch function. Inside the function, a vector y₁ = [0, …, 0] of length 16k is initialized first; the value of its t-th dimension is then given by y₁[t] = ∑_l s^a[l]·x_ij^a[l], where the sum runs over all l such that p^a[l] = t. This yields the projection y₁. Likewise, for the inputs x_ij^l, s^l and p^l, the count sketch function outputs a projection y₂. Finally, y₁ and y₂ are converted between the time domain and the frequency domain to obtain the vector F_ij = FFT^-1(FFT(y₁) ° FFT(y₂)). The 8 × 8 vectors F_ij are assembled into a tensor F, and sum pooling over each channel of F gives a vector f. The vector f is then passed through a signed square root and L₂ normalization, giving the bilinear feature of the picture. Finally, the dimension of this bilinear feature is reduced by passing it through a fully connected layer and then a batch normalization layer, which gives the final compact bilinear feature.
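The count sketch projection, the FFT step, and the pooling and normalization can be sketched in NumPy as follows. This is a minimal sketch, not the patented implementation: the function names are hypothetical, bins are 0-based (so p takes values in {0, …, d-1} rather than {1, …, d}), toy sizes replace d = 16k and the real channel counts, and the final fully connected and batch normalization layers are omitted.

```python
import numpy as np

def count_sketch(x, s, p, d):
    """Project x to d dims: y[t] = sum of s[l] * x[l] over all l with p[l] == t."""
    y = np.zeros(d)
    np.add.at(y, p, s * x)               # unbuffered add handles repeated bins in p
    return y

def compact_bilinear(x_a, x_l, s_a, p_a, s_l, p_l, d):
    """F_ij = FFT^-1(FFT(y1) * FFT(y2)), i.e. circular convolution of the two sketches."""
    y1 = count_sketch(x_a, s_a, p_a, d)
    y2 = count_sketch(x_l, s_l, p_l, d)
    return np.real(np.fft.ifft(np.fft.fft(y1) * np.fft.fft(y2)))

def pool_and_normalise(F):
    """F: (H, W, d). Sum-pool over positions, signed square root, L2 normalisation."""
    f = F.sum(axis=(0, 1))
    f = np.sign(f) * np.sqrt(np.abs(f))
    return f / (np.linalg.norm(f) + 1e-12)

rng = np.random.default_rng(0)
c_a, c_l, d = 6, 5, 16                   # toy sizes; the text reports d = 16k
s_a, s_l = rng.choice([-1.0, 1.0], c_a), rng.choice([-1.0, 1.0], c_l)
p_a, p_l = rng.integers(0, d, c_a), rng.integers(0, d, c_l)
X_a = rng.standard_normal((8, 8, c_a))   # stand-in for the optimized attribute features
X_l = rng.standard_normal((8, 8, c_l))   # stand-in for the optimized localization features
F = np.array([[compact_bilinear(X_a[i, j], X_l[i, j], s_a, p_a, s_l, p_l, d)
               for j in range(8)] for i in range(8)])
feature = pool_and_normalise(F)
```

The sign and mapping vectors are drawn once and reused at every position and for every image, so that features of different images live in the same projected space. Multiplying in the frequency domain avoids ever forming the full outer product of the two branch vectors.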

7. Model training

With the compact bilinear feature as input, a single fully connected layer serves as the framework for the ID classification task. An ID comprises all of its positive sample pictures (that is, pictures containing the same object); for a given ID, all other IDs are negative samples. In other words, each instance is treated as a separate class. The output dimension of the fully connected layer equals the total number of IDs. The main loss function is the cross-entropy loss:

loss(x, gt) = -log( exp(x[gt]) / ∑_j exp(x[j]) )

Here x is the prediction vector and gt is the index of the true label. There are also two complementary loss functions: the binary cross-entropy loss for multi-label attribute prediction on the attribute classification branch, and the normalized mean error for keypoint detection on the key-region localization branch. The three losses are assigned different weights and combined into the total loss. The Adam optimizer is used to compute gradients and perform backpropagation for all of them. The initial learning rate is set to 0.0001 and is halved every 5 epochs. Each iteration processes a batch of 20 pictures. The loss levels off after 35 epochs. To avoid overfitting during training, an L₂ regularization term is added to the loss.
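The ID loss and the learning-rate schedule described above can be sketched in plain Python. The function names, the 0-based epoch index, and the explicit weight arguments of `total_loss` are assumptions for illustration; the patent states that the three losses carry different weights but does not give their values.

```python
import math

def cross_entropy(x, gt):
    """-log(softmax(x)[gt]) for a score vector x and true-class index gt."""
    m = max(x)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return log_sum_exp - x[gt]

def learning_rate(epoch, base_lr=1e-4, step=5, factor=0.5):
    """Step decay: start at base_lr and halve it every `step` epochs (epoch is 0-based)."""
    return base_lr * factor ** (epoch // step)

def total_loss(l_id, l_attr, l_landmark, w_id, w_attr, w_landmark):
    """Weighted sum of the ID, attribute and keypoint losses (weights are hypothetical)."""
    return w_id * l_id + w_attr * l_attr + w_landmark * l_landmark

schedule = [learning_rate(e) for e in range(0, 35, 5)]
```

With these settings the learning rate has been halved six times by the 35-epoch mark at which the text reports the loss levelling off.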

8. Model application

No data augmentation is needed at this stage: the image only has to be resized to 299 × 299 and 256 × 256 and normalized to serve as the input of the attribute classification branch and the key-region localization branch. All parameters of the network model are fixed, so only a forward pass on the input image data is required. The compact bilinear feature produced by the model is taken as the feature of the image, so that a feature vector is obtained for every image. After L₂ normalization these feature vectors lie on a sphere and can serve as the basis for measuring similarity. Given a query image, model inference yields its feature vector F_q; model inference on the database images yields all of their feature vectors {F₁, …, F_m}. The Euclidean distance between F_q and each of the feature vectors {F₁, …, F_m} is computed as d_i = ||F_q - F_i||₂, i = 1, …, m, giving D = [d₁, …, d_m]. The values in D are then ranked from most similar to least similar, that is, from the smallest distance to the largest. The top-k result consists of the first k entries, and the corresponding database images are regarded as the retrieved results. If these k images include the database image that truly corresponds to the query image, the retrieval is considered successful. For example, if the top-5 result is [d₁₀, d₃₅, d₆₀, d₆₁, d₂₆] and the database image that the query should retrieve is image No. 61, the retrieval succeeds.
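The ranking step can be sketched in NumPy as follows. The function name `top_k`, the 0-based indices, and the toy two-dimensional features are assumptions for illustration; the sketch ranks by increasing Euclidean distance so that the nearest database images come first.

```python
import numpy as np

def top_k(query, database, k):
    """Return indices and distances of the k database features nearest to the query."""
    q = query / np.linalg.norm(query)                              # L2-normalise the query
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    dists = np.linalg.norm(db - q, axis=1)                         # d_i = ||F_q - F_i||_2
    order = np.argsort(dists, kind="stable")[:k]                   # smallest distance first
    return order, dists[order]

# Hypothetical toy features: four database images and one query.
database = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
indices, distances = top_k(query, database, k=2)
hit = 0 in indices   # retrieval succeeds if the true match (index 0 here) is in the top-k
```

For unit-length vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, so ranking by distance on the sphere gives the same order as ranking by cosine similarity.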

Claims

1. An image retrieval method based on a heterogeneous bilinear attention network, characterized by the following steps:

Step 1: Pass a picture through the hourglass network to obtain a feature map V^l with C^l channels; at the same time, pass it through the Inception-ResNet-v2 network to obtain another feature map V^a with C^a channels;

Step 2: Apply global average pooling to each of the two feature maps to obtain two vectors v^a and v^l:

v^a = GlobalAveragePooling(V^a), (1)

v^l = GlobalAveragePooling(V^l). (2)

Step 3: Concatenate v^a and v^l into a single vector and pass it through two parallel multilayer perceptrons to compute the channel-wise attention weights for the feature map of each branch. The formulas are as follows:

α^a = Sigmoid(W_2^a · ReLU(W_1^a · (v^a ⊕ v^l))), (3)

α^l = Sigmoid(W_2^l · ReLU(W_1^l · (v^a ⊕ v^l))). (4)

Here W_1^a ∈ R^(k^a × C), W_2^a ∈ R^(C^a × k^a), W_1^l ∈ R^(k^l × C) and W_2^l ∈ R^(C^l × k^l) are all linear transformation matrices; k^a and k^l are projection dimensions; ⊕ denotes concatenation, so C = C^a + C^l; α^a is the channel-wise attention weight assigned to the attribute classification branch, and α^l is the channel-wise attention weight assigned to the key-region localization branch; Sigmoid and ReLU are common activation functions;

Step 4: Obtain the two weighted feature maps by multiplying each channel of V^a and V^l by its weight in α^a and α^l respectively, and then resample them to the same spatial size W × H;

Step 5: Given the feature vector x_ij at position (i, j) of a feature map obtained in step 4, use the count sketch function Ψ to project x_ij onto a target vector y_ij of dimension d. A sign vector s and a mapping vector p, each with one entry per element of x_ij, are also used. Each value in s is drawn from {+1, -1} with equal probability; each value in p is drawn from {1, …, d} with uniform probability. The count sketch function Ψ is defined as follows:

y_ij = Ψ(x_ij, s, p) = [v₁, …, v_d], (5)

Here v_t = ∑_l s[l]·x_ij[l], where the sum runs over all l such that p[l] = t. If the outer product of two vectors (x_ij^a and x_ij^l, taken from the two branches) is the input of the count sketch function, the result can be written as the convolution of two count sketch functions that each take one of the vectors as input:

Ψ(x_ij^a ⊙ x_ij^l, s, p) = Ψ(x_ij^a, s^a, p^a) * Ψ(x_ij^l, s^l, p^l), (6)

Here ⊙ denotes the outer product and * denotes convolution. The bilinear feature is finally obtained by converting between the time domain and the frequency domain:

F_ij = FFT^-1(FFT(Ψ(x_ij^a, s^a, p^a)) ° FFT(Ψ(x_ij^l, s^l, p^l))), (7)

where ° denotes element-wise multiplication;

Step 6: During training, the resulting bilinear feature is used for ID classification training. At test time, the normalized bilinear features of the query image and of the images in the database are computed, and the Euclidean distances between them are then calculated to obtain the final top-k result.