CN110704652A - Vehicle image fine-grained retrieval method and device based on multiple attention mechanism - Google Patents

Vehicle image fine-grained retrieval method and device based on multiple attention mechanism

Info

Publication number
CN110704652A
Authority
CN
China
Prior art keywords
vehicle
target object
image
reference image
recognized
Prior art date
Legal status
Pending
Application number
CN201910776963.3A
Other languages
Chinese (zh)
Inventor
张斯尧
王思远
谢喜林
张�诚
文戎
田磊
Current Assignee
Changsha Qianshitong Intelligent Technology Co Ltd
Original Assignee
Changsha Qianshitong Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Changsha Qianshitong Intelligent Technology Co Ltd filed Critical Changsha Qianshitong Intelligent Technology Co Ltd
Priority to CN201910776963.3A
Publication of CN110704652A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures


Abstract

An embodiment of the invention provides a vehicle image fine-grained retrieval method and device based on a multiple attention mechanism. The method comprises the following steps: inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning the target object in the vehicle reference image, and extracting the feature vector of that target object; automatically positioning the target object in a vehicle image to be recognized through the multi-attention convolutional neural network model, and extracting the feature vector of that target object; calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized; and, according to the similarity, obtaining the vehicle images to be recognized that contain a target object of the same category as the target object in the vehicle reference image, to serve as retrieval images of the vehicle reference image. The method and device can improve the accuracy and efficiency of vehicle image retrieval.

Description

Vehicle image fine-grained retrieval method and device based on multiple attention mechanism
Technical Field
The invention belongs to the technical field of computer vision and intelligent traffic, and particularly relates to a vehicle image fine-grained retrieval method and device based on a multiple attention mechanism, terminal equipment and a computer readable medium.
Background
With the rapid development of modern transportation, the security industry and the like, target recognition technology is increasingly applied in various fields, and has been one of the important research subjects of computer vision and pattern recognition in the intelligent transportation field in recent years.
Fine-grained vehicle recognition is an important research direction in the field of computer vision. Recognizing vehicles of the same model is more difficult than traditional vehicle recognition because the differences between similar vehicles are very small; the difference may be only the annual inspection mark on the vehicle or some small decorations inside it. With the rise of deep learning in recent years, many researchers have attempted to apply deep learning to the field of object detection and recognition. Fine-grained image analysis is a popular research topic in computer vision for solving such problems; it studies visual analysis tasks such as locating, recognizing and retrieving object subclasses in fine-grained images, and has wide application value in real scenes.
Meanwhile, image retrieval is a technology for retrieving similar images from an input image, and mainly involves two parts: image feature extraction and image feature similarity analysis. Fine-grained image recognition consists in finding local regional features with subtle differences in images, allowing different subclasses within a large class to be distinguished. Applying fine-grained image recognition to image retrieval makes it possible to extract the fine-grained features of images and analyze their similarity. However, the fine-grained image recognition and retrieval methods in the prior art suffer from low efficiency and low accuracy.
Disclosure of Invention
In view of this, embodiments of the present invention provide a vehicle image fine-grained retrieval method and apparatus based on a multiple attention mechanism, a terminal device, and a computer readable medium, which can improve accuracy and efficiency of vehicle image retrieval.
The first aspect of the embodiment of the invention provides a vehicle image fine-grained retrieval method based on a multiple attention mechanism, which comprises the following steps:
inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning a target object in the vehicle reference image, and extracting a feature vector of the target object in the vehicle reference image;
inputting a vehicle image to be recognized into the trained multiple attention convolutional neural network model, automatically positioning a target object in the vehicle image to be recognized, and extracting a feature vector of the target object in the vehicle image to be recognized;
calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized;
and obtaining the vehicle image to be recognized, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be recognized according to the similarity, and using the vehicle image to be recognized as a retrieval image of the vehicle reference image.
A second aspect of an embodiment of the present invention provides a vehicle image fine-grained retrieval device based on a multiple attention mechanism, including:
the first extraction module is used for inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning a target object in the vehicle reference image and extracting a feature vector of the target object in the vehicle reference image;
the second extraction module is used for inputting the vehicle image to be recognized into the trained multi-attention convolutional neural network model, automatically positioning the target object in the vehicle image to be recognized and extracting the characteristic vector of the target object in the vehicle image to be recognized;
the calculation module is used for calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be identified;
and the retrieval output module is used for obtaining the vehicle image to be identified, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be identified according to the similarity, and using the vehicle image to be identified as the retrieval image of the vehicle reference image.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above-mentioned fine-grained retrieval method for vehicle images based on a multiple attention mechanism when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable medium, which stores a computer program that, when executed by a processor, implements the steps of the above-mentioned vehicle image fine-grained retrieval method based on a multiple attention mechanism.
In the vehicle image fine-grained retrieval method based on a multiple attention mechanism provided by the embodiment of the invention, a vehicle reference image is input into a trained multi-attention convolutional neural network model, the target object in the vehicle reference image is automatically positioned, and its feature vector is extracted; a vehicle image to be recognized is input into the same trained model, the target object in it is automatically positioned, and its feature vector is extracted; the similarity between the feature vector of the target object in the vehicle reference image and that of the target object in the vehicle image to be recognized is calculated; and, according to the similarity, the vehicle images to be recognized that contain a target object of the same category as the target object in the vehicle reference image are obtained and used as retrieval images of the vehicle reference image. The accuracy and efficiency of vehicle image retrieval can thereby be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a fine-grained retrieval method for a vehicle image based on a multiple attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature map provided by an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a vehicle image fine-grained retrieval device based on a multiple attention mechanism according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a refinement of the first extraction module in FIG. 3;
FIG. 5 is a schematic structural diagram of another vehicle image fine-grained retrieval device based on a multiple attention mechanism according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation rather than limitation, specific details such as particular system structures and techniques are set forth in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Referring to fig. 1, fig. 1 is a flowchart of a vehicle image fine-grained retrieval method based on a multiple attention mechanism according to an embodiment of the present invention. As shown in fig. 1, the vehicle image fine-grained retrieval method based on the multiple attention mechanism of this embodiment includes the following steps:
s101: and inputting the vehicle reference image into the trained multi-attention convolutional neural network model, automatically positioning the target object in the vehicle reference image, and extracting the characteristic vector of the target object in the vehicle reference image.
In the embodiment of the invention, a multi-attention convolutional neural network model can be established first. In particular, to prevent training from falling into a locally optimal solution, the channel clustering layer and the classification network layer need to be pre-trained. In pre-training, the channel clustering layer locally positions the input vehicle image, and the classification network layer identifies the local features of each positioned part of the input vehicle image and generates a weight vector for each positioned part. Further, an acquired vehicle image can be input into the convolutional layers of the classification network layer to extract its original depth features $W * X$, where $W$ denotes the base network layers of the multi-attention convolutional neural network model to be established, $X$ is the vehicle image, and $*$ is the convolution operator. The vehicle image can then be divided into N parts, with all the feature channels it contains clustered into N groups; each local region corresponds to a group of channel clustering layers, each channel clustering layer comprises two feature channel layers, and each feature channel layer comprises a plurality of feature channels. Specifically, since each feature channel responds to a particular type of visual pattern, it has a peak response point, so each feature channel can be represented by a position vector whose elements are the peak response coordinates of all the training images on that channel. These position vectors are used as features for clustering, dividing the different channels into N clusters, i.e., N parts. Whether a channel belongs to a cluster is represented by an indicator vector of length c (the number of channels): the channel's position is 1 if it belongs to the cluster and 0 otherwise, and the N indicator vectors are mutually exclusive. To ensure that this process is optimized during training, the N parts can be identified by the classification network layer, where the N fully-connected layers corresponding one-to-one to the N parts receive the original depth features and generate the corresponding weight vectors $d_i(X)$:
$$d_i(X) = f_i(W * X) \qquad (1)$$
where $d_i(X)$ is the weight vector of the i-th of the N parts, $d_i(X) = [d_1, \ldots, d_c]$, $c$ is the number of feature channels, and $f_i$ is the clustering function of the i-th of the N fully-connected layers; to obtain accurate weight vectors, the parameters of each clustering function $f_i$ usually also need to be pre-trained. Based on the learned weight vectors, the attention heatmaps of the N parts can be obtained:
$$M_i(X) = \operatorname{sigmoid}\Big(\sum_{j=1}^{c} d_j \, [W * X]_j\Big) \qquad (2)$$
where $M_i(X)$ is the probability heatmap of the i-th of the N parts, sigmoid is the sigmoid function, $d_j$ is the j-th element of the weight vector, and $[W * X]_j$ is the original depth feature of the j-th feature channel. Through the above pre-training of the channel clustering layer and the classification network layer, the weight vectors and the parameters involved in each function can be set, so that the hierarchical structure of the multi-attention convolutional neural network model can be constructed; this hierarchy also comprises the base network layers and the other layers of an existing neural network. A loss function $L(X)$ can then be determined for the multi-attention convolutional neural network model as:
$$L(X) = \sum_{i=1}^{N} \Big( L_{cls}\big(Y^{(i)}, Y^{*}\big) + L_{cng}(M_i) \Big) \qquad (3)$$
where $L_{cls}$ is the part classification loss function and $L_{cng}$ is the channel clustering loss function, with $L_{cng}$ expressed as:
$$L_{cng}(M_i) = \mathrm{Dis}(M_i) + \lambda \, \mathrm{Div}(M_i) \qquad (4)$$
where Dis and Div denote a distance function and a divergence function, respectively: Dis makes the coordinates within the same one of the N parts tend to gather, while Div makes different ones of the N parts tend to stay apart; $\lambda$ is a weight parameter whose determined value can be obtained through the subsequent training. In the part classification loss function $L_{cls}$, $Y^{(i)}$ denotes the label vector predicted from the i-th of the N parts using the locally refined feature $P_i(X)$, and $Y^{*}$ is the ground-truth label vector. The locally refined feature $P_i(X)$ is expressed as:
$$P_i(X) = \sum_{(x,\,y)} [W * X](x, y) \cdot M_i(X)(x, y) \qquad (5)$$

i.e., an attention-weighted spatial pooling of the depth features over the locations $(x, y)$.
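As an illustrative sketch only (not code from the patent), the weight vectors of equation (1), the attention heatmaps of equation (2) and the pooled part features of equation (5) might be computed along the following lines in PyTorch; the module name, the global pooling feeding the fully-connected layers, and all sizes are assumptions:

import torch
import torch.nn as nn

class MultiAttentionHead(nn.Module):
    """Sketch of eqs. (1), (2), (5): per-part weight vectors d_i(X),
    attention heatmaps M_i(X), and attention-pooled part features P_i(X)."""

    def __init__(self, in_channels: int, num_parts: int):
        super().__init__()
        # One fully-connected layer per part produces d_i(X) from W*X (eq. 1).
        self.fc = nn.ModuleList(
            nn.Linear(in_channels, in_channels) for _ in range(num_parts)
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: original depth features W*X, shape (B, C, H, W)
        b, c, _, _ = feat.shape
        pooled = feat.mean(dim=(2, 3))      # (B, C) summary fed to each fc
        parts = []
        for fc in self.fc:
            d = fc(pooled)                  # weight vector d_i(X), (B, C)
            # Eq. (2): M_i = sigmoid of the channel-weighted sum of features.
            m = torch.sigmoid((d.view(b, c, 1, 1) * feat).sum(dim=1, keepdim=True))
            # Eq. (5): attention-weighted spatial pooling gives P_i(X), (B, C).
            parts.append((feat * m).sum(dim=(2, 3)))
        return torch.stack(parts, dim=1)    # (B, N, C) part features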
The classification network layer and the channel clustering layer are then trained alternately and iteratively through the part classification loss function and the channel clustering loss function until neither loss changes any more. Specifically, as to this alternating iterative training: the convolutional layers can first be fixed and the channel clustering layer trained with the channel clustering loss $L_{cng}$; the channel clustering layer is then fixed and the convolutional layers and the softmax layer in the classification network layer are trained with the part classification loss $L_{cls}$; next, the convolutional layers and the softmax layer in the classification network layer can be fixed and the channel clustering layer trained with $L_{cls}$; the channel clustering layer is fixed again and the convolutional layers and the softmax layer in the classification network layer are trained with $L_{cng}$; and so on, alternating until the part classification loss and the channel clustering loss no longer change.
In this alternating, iterative joint training of $L_{cng}$ and $L_{cls}$, the weight parameter matrices and offset values in the channel clustering layer and the classification network layer are continuously adjusted; when the two loss functions no longer change (i.e., learning is finished), the adjusted values of these weight parameter matrices and offset values are obtained. The pre-adjustment weight parameter matrix comprises the weight vectors described above; specifically, each weight vector plus a corresponding coefficient constitutes the pre-adjustment weight parameter matrix.
After the adjusted weight parameter matrices and offset values of the channel clustering layer and the classification network layer are obtained, the multi-attention convolutional neural network model can be trained on a vehicle dataset containing fine-grained image classes of different vehicle attributes (the training may be supervised learning on labeled data). The training yields the determined values of the weight parameter matrices and offset values of the channel clustering layer and the classification network layer, which can be used for subsequent vehicle feature extraction, multi-attribute vehicle recognition and retrieval, and the like. Once the multi-attention convolutional neural network model is trained, the vehicle reference image can be input into it, the target object in the vehicle reference image automatically positioned, and the feature vector of the target object in the vehicle reference image extracted.
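To make the alternating scheme above concrete, here is a minimal sketch of a channel clustering loss in the form of equation (4), with Dis pulling the peak-response coordinates of channels in the same part toward their center and Div pushing different part centers apart; the hinge margin, the coordinate representation and the assumption that every part is non-empty are choices of this sketch, not details given by the patent:

import torch

def channel_clustering_loss(coords: torch.Tensor,
                            assign: torch.Tensor,
                            num_parts: int,
                            lam: float = 0.5,
                            margin: float = 1.0) -> torch.Tensor:
    """Sketch of eq. (4): L_cng = Dis + lambda * Div.

    coords: (C, 2) peak-response coordinates of the C feature channels.
    assign: (C,) part index in [0, num_parts) for each channel.
    """
    centers = torch.stack(
        [coords[assign == i].mean(dim=0) for i in range(num_parts)]
    )                                            # (N, 2) part centers
    # Dis: coordinates within the same part should gather around its center.
    dis = ((coords - centers[assign]) ** 2).sum(dim=1).mean()
    # Div: centers of different parts should stay apart (hinge penalty).
    pairwise = torch.cdist(centers, centers)     # (N, N) center distances
    off_diag = ~torch.eye(num_parts, dtype=torch.bool)
    div = torch.clamp(margin - pairwise[off_diag], min=0).mean()
    return dis + lam * div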
Specifically, a feature map can be generated first, for example based on ResNet-50 (the 50-layer Residual Network), as shown in fig. 2: a convolution operation is used to generate k × k position-sensitive score maps for each class of target object over the whole vehicle reference image. In fig. 2, the vehicle reference image (left) contains five target objects a, b, c, d and e, each of which can be mapped to a 3 × 3 position-sensitive score map; for example, target object a maps to the score maps a1 through a9. The k × k position-sensitive score maps describe a spatial grid of the corresponding positions of each class of target object. Each position-sensitive score map has c feature channel outputs, representing c-1 object classes plus an image background; the target objects correspond to the large classes, and the image background is added as one further class to the c-1 classes represented by the c channel outputs. Feature vectors of each class of target object, for example feature vectors of the respective object subclasses of each class, can then be extracted based on the k × k position-sensitive score maps. As for the generation of the k × k position-sensitive score maps: for a candidate target box of size w × h (obtained from the RPN network), the box is divided into k × k sub-regions, each of size (w × h)/k²; for any sub-region bin(i, j), 0 ≤ i, j ≤ k-1, a position-sensitive pooling operation is defined:
$$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i, j, c}\big(x + x_0,\, y + y_0 \mid \Theta\big)$$
where $r_c(i, j \mid \Theta)$ is the pooling response of sub-region bin(i, j) for one of the c-1 object classes plus the image background, $z_{i,j,c}$ is the position-sensitive score map corresponding to bin(i, j), $(x_0, y_0)$ denotes the coordinates of the top-left corner of the candidate target box, n is the number of pixels in bin(i, j), and $\Theta$ denotes all the parameters obtained from training the multi-attention convolutional neural network model. The k × k position-sensitive score map of each class of target object is obtained from the pooling responses of the sub-regions bin(i, j) over the c-1 object classes plus the image background.
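For illustration, a naive (loop-based) version of this position-sensitive pooling could look as follows; the tensor layout (one score map per grid cell, c channels each), the grid-cell indexing and the integer bin splitting are assumptions of the sketch:

import torch

def position_sensitive_pool(score_maps: torch.Tensor,
                            box: tuple,
                            k: int) -> torch.Tensor:
    """Sketch of the pooling r_c(i, j | Theta) defined above.

    score_maps: (k*k, c, H, W) position-sensitive score maps z_{i,j},
                each with c channel outputs (c-1 classes plus background).
    box:        (x0, y0, w, h) candidate target box on the feature map.
    Returns:    (k, k, c) pooled responses r_c(i, j).
    """
    x0, y0, w, h = box
    out = torch.zeros(k, k, score_maps.shape[1])
    for i in range(k):
        for j in range(k):
            # Sub-region bin(i, j), one of k*k cells of size (w*h)/k^2.
            xs, xe = x0 + i * w // k, x0 + (i + 1) * w // k
            ys, ye = y0 + j * h // k, y0 + (j + 1) * h // k
            z = score_maps[i * k + j]            # score map for this cell
            region = z[:, ys:ye, xs:xe]          # (c, bin_h, bin_w)
            out[i, j] = region.mean(dim=(1, 2))  # average over the n pixels
    return out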
S102: and inputting the vehicle image to be recognized into the trained multiple attention convolution neural network model, automatically positioning the target object in the vehicle image to be recognized, and extracting the characteristic vector of the target object in the vehicle image to be recognized.
In the embodiment of the present invention, the method for extracting the feature vector of the target object from the to-be-recognized vehicle image is the same as the method for extracting the feature vector of the target object from the vehicle reference image in S101, and reference may be specifically made to the related description of S101, so that details are not repeated herein.
S103: and calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be identified.
Specifically, similarity is usually calculated with the cosine distance in the prior art, and in the embodiment of the present invention, cosine similarity is used to measure the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized. Cosine similarity measures the difference between two individuals by the cosine of the angle between their feature vectors in a vector space. Compared with distance metrics, cosine similarity attends more to the difference of two vectors in direction than to their distance or length. The similarity cos θ between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized is calculated as:

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\lVert \vec{x} \rVert \, \lVert \vec{y} \rVert}$$

where $\vec{x}$ is the feature vector of the target object in the vehicle reference image, $\vec{y}$ is the feature vector of the target object in the vehicle image to be recognized, $\lVert \vec{x} \rVert$ and $\lVert \vec{y} \rVert$ are their norms, and θ is the angle between $\vec{x}$ and $\vec{y}$.
S104: and obtaining the vehicle image to be recognized, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be recognized according to the similarity, and using the vehicle image to be recognized as a retrieval image of the vehicle reference image.
In the embodiment of the present invention, for ease of understanding, suppose there are one or more images to be recognized. If a certain image to be recognized contains a target object of the same category as the target object in the vehicle reference image, that image can be used as a retrieval image of the vehicle reference image. As for determining the target objects of the same category as a target object in the vehicle reference image: for a certain target object in the vehicle reference image, the similarities between the feature vectors of all target objects in the vehicle images to be recognized and the feature vector of that target object can be calculated and ranked, and the target objects in the vehicle images to be recognized corresponding to the highest-ranked similarities are taken as the target objects of the same category as that target object.
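As a sketch of how S103 and S104 fit together, the following ranks gallery target objects by the cosine similarity above and keeps the highest-ranked ones; the function name and the top-k cut-off are assumptions of this sketch, not prescriptions of the patent:

import torch
import torch.nn.functional as F

def retrieve(ref_feat: torch.Tensor,
             gallery_feats: torch.Tensor,
             top_k: int = 5):
    """Rank gallery target objects by cosine similarity to the reference.

    ref_feat:      (D,) feature vector of a target object in the vehicle
                   reference image.
    gallery_feats: (M, D) feature vectors of target objects in the vehicle
                   images to be recognized.
    Returns the top_k (similarities, indices); the images holding those
    objects can then serve as retrieval images of the reference image.
    """
    # cos(theta) = x . y / (||x|| * ||y||), evaluated against all M vectors.
    sims = F.cosine_similarity(ref_feat.unsqueeze(0), gallery_feats, dim=1)
    k = min(top_k, gallery_feats.shape[0])
    return torch.topk(sims, k=k)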
In the vehicle image fine-grained retrieval method based on the multiple attention mechanism provided in fig. 1, the feature vectors of the target object can be respectively extracted from the vehicle reference image and the vehicle image to be recognized based on the trained multiple attention convolutional neural network model, and the retrieval image of the vehicle reference image is finally obtained by calculating the similarity between the feature vectors of the target object of the vehicle reference image and the feature vectors of the target object of the vehicle image to be recognized, so that the accuracy and efficiency of vehicle retrieval can be improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a vehicle image fine-grained retrieval device based on a multiple attention mechanism according to an embodiment of the present invention. As shown in fig. 3, the vehicle image fine-grained retrieval device 30 based on the multiple attention mechanism of this embodiment includes a first extraction module 301, a second extraction module 302, a calculation module 303 and a retrieval output module 304, which are respectively used for executing the specific methods of S101, S102, S103 and S104 in fig. 1; for details, refer to the related description of fig. 1. They are only briefly described here:
the first extraction module 301 is configured to input a vehicle reference image into a trained multi-attention convolutional neural network model, automatically locate a target object in the vehicle reference image, and extract a feature vector of the target object in the vehicle reference image.
The second extraction module 302 is configured to input the vehicle image to be recognized into the trained multi-attention convolutional neural network model, automatically locate the target object in the vehicle image to be recognized, and extract the feature vector of the target object in the vehicle image to be recognized.
A calculating module 303, configured to calculate a similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized.
And a retrieval output module 304, configured to obtain, according to the similarity, a to-be-identified vehicle image that includes a target object of the same category as the target object in the vehicle reference image in the to-be-identified vehicle image, as a retrieval image of the vehicle reference image.
Further, as can be seen in fig. 4, the first extraction module 301 may specifically include a mapping unit 3011 and an extraction unit 3012:
A mapping unit 3011, configured to generate a feature map. The feature map can be generated based on ResNet-50, which may specifically include: using a convolution operation to generate k × k position-sensitive score maps for each class of target object over the whole vehicle reference image, the k × k position-sensitive score maps describing a spatial grid of the corresponding positions of each class of target object, and each score map having c channel outputs representing c-1 object classes plus an image background. More specifically, for a candidate target box of size w × h, the box can be divided into k × k sub-regions, each of size (w × h)/k²; for any sub-region bin(i, j), 0 ≤ i, j ≤ k-1, a position-sensitive pooling operation is defined:
$$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i, j, c}\big(x + x_0,\, y + y_0 \mid \Theta\big)$$
where $r_c(i, j \mid \Theta)$ is the pooling response of sub-region bin(i, j) for one of the c-1 object classes plus the image background, $z_{i,j,c}$ is the position-sensitive score map corresponding to bin(i, j), $(x_0, y_0)$ denotes the coordinates of the top-left corner of the candidate target box, n is the number of pixels in bin(i, j), and $\Theta$ denotes all the learned parameters of the network. The k × k position-sensitive score map of each class of target object is obtained from the pooling responses of the sub-regions bin(i, j) over the c-1 object classes plus the image background.
An extracting unit 3012, configured to extract a feature vector of each type of target object based on the k × k position sensitivity score maps of each type of target object.
Referring to fig. 5, fig. 5 is a schematic structural diagram of another vehicle image fine-grained retrieval device based on a multiple attention mechanism according to an embodiment of the present invention. As shown in fig. 5, the vehicle image fine-grained retrieval device 50 based on the multiple attention mechanism is optimized on the basis of the vehicle image fine-grained retrieval device 30 shown in fig. 3: in addition to the first extraction module 301, the second extraction module 302, the calculation module 303 and the retrieval output module 304 of the device 30, it further includes a model establishing module 501:
the model establishing module 501 is configured to establish a multiple attention convolutional neural network model, and train the established multiple attention convolutional neural network model. The method and steps specifically executed by the model building module 501 may refer to the related description of S101, and therefore are not described herein again.
In the vehicle image fine-grained retrieval device based on the multiple attention mechanism provided in fig. 3 or fig. 5, the feature vectors of the target object can be respectively extracted from the vehicle reference image and the vehicle image to be recognized based on the trained multiple attention convolutional neural network model, and the retrieval image of the vehicle reference image is finally obtained by calculating the similarity between the feature vectors of the target object of the vehicle reference image and the feature vectors of the target object of the vehicle image to be recognized, so that the accuracy and efficiency of vehicle retrieval can be improved.
Fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61 and a computer program 62 stored in said memory 61 and operable on said processor 60, for example a program for performing a fine-grained retrieval of images of a vehicle based on a multiple attention mechanism. The processor 60, when executing the computer program 62, implements the steps in the above-described method embodiments, e.g., S101 to S104 shown in fig. 1. Alternatively, the processor 60, when executing the computer program 62, implements the functions of the modules/units in the above-mentioned device embodiments, such as the functions of the modules 301 to 304 shown in fig. 3.
Illustratively, the computer program 62 may be partitioned into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution of the computer program 62 in the terminal device 6. For example, the computer program 62 may be divided into the first extraction module 301, the second extraction module 302, the calculation module 303 and the retrieval output module 304 (modules in a virtual device), whose specific functions are as follows:
the first extraction module 301 is configured to input a vehicle reference image into a trained multi-attention convolutional neural network model, automatically locate a target object in the vehicle reference image, and extract a feature vector of the target object in the vehicle reference image.
The second extraction module 302 is configured to input the vehicle image to be recognized into the trained multi-attention convolutional neural network model, automatically locate the target object in the vehicle image to be recognized, and extract the feature vector of the target object in the vehicle image to be recognized.
A calculating module 303, configured to calculate a similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized.
And a retrieval output module 304, configured to obtain, according to the similarity, a to-be-identified vehicle image that includes a target object of the same category as the target object in the vehicle reference image in the to-be-identified vehicle image, as a retrieval image of the vehicle reference image.
The terminal device 6 may be a desktop computer, a notebook, a palmtop computer, a cloud server or other computing device. The terminal device 6 may include, but is not limited to, a processor 60 and a memory 61. Those skilled in the art will appreciate that fig. 6 is merely an example of the terminal device 6 and does not constitute a limitation of it; the terminal device may include more or fewer components than those shown, combine some components, or have different components; for example, it may also include input/output devices, network access devices, buses, etc.
The Processor 60 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) and the like provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used for storing the computer programs and other programs and data required by the terminal device 6. The memory 61 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A fine-grained retrieval method for a vehicle image based on a multiple attention mechanism is characterized by comprising the following steps:
inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning a target object in the vehicle reference image, and extracting a feature vector of the target object in the vehicle reference image;
inputting a vehicle image to be recognized into the trained multiple attention convolutional neural network model, automatically positioning a target object in the vehicle image to be recognized, and extracting a feature vector of the target object in the vehicle image to be recognized;
calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized;
and obtaining the vehicle image to be recognized, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be recognized according to the similarity, and using the vehicle image to be recognized as a retrieval image of the vehicle reference image.
2. The method for fine-grained retrieval of vehicle images based on multiple attention mechanisms according to claim 1, wherein before inputting a vehicle reference image into a trained multiple attention convolutional neural network model, automatically locating a target object in the vehicle reference image, and extracting a feature vector of the target object in the vehicle reference image, the method further comprises:
and constructing a multi-attention convolutional neural network model, and training the constructed multi-attention convolutional neural network model.
3. The method for fine-grained retrieval of vehicle images based on multiple attention mechanisms according to claim 1, wherein the step of inputting vehicle reference images into the trained multiple attention convolutional neural network model, automatically locating target objects in the vehicle reference images, and extracting feature vectors of the target objects in the vehicle reference images comprises:
generating a feature map; wherein the generating the feature map comprises: generating k × k position-sensitive score maps for each class of target object over the whole vehicle reference image by a convolution operation, wherein the k × k position-sensitive score maps are used for describing a spatial grid of the corresponding positions of each class of target object, and each position-sensitive score map has c channel outputs representing c-1 object classes plus an image background;
extracting feature vectors of each class of target object based on the k × k position-sensitive score maps of each class of target object.
4. The method for fine-grained retrieval of vehicle images based on multiple attention mechanisms according to claim 3, wherein the generating k × k position-sensitive score maps for each class of target object over the entire vehicle reference image by a convolution operation comprises:
for a candidate target box of size w × h, dividing the box into k × k sub-regions, each of size (w × h)/k²; for any sub-region bin(i, j), 0 ≤ i, j ≤ k-1, defining a position-sensitive pooling operation:
$$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i, j, c}\big(x + x_0,\, y + y_0 \mid \Theta\big)$$
where $r_c(i, j \mid \Theta)$ is the pooling response of sub-region bin(i, j) for one of the c-1 object classes plus the image background, $z_{i,j,c}$ is the position-sensitive score map corresponding to bin(i, j), $(x_0, y_0)$ denotes the coordinates of the top-left corner of the candidate target box, n is the number of pixels in bin(i, j), and $\Theta$ denotes all the learned parameters of the network;
and obtaining the k × k position-sensitive score map of each class of target object according to the pooling responses of the sub-regions bin(i, j) over the c-1 object classes plus the image background.
5. The method for fine-grained retrieval of vehicle images based on multiple attention mechanisms according to claim 1, wherein the calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized comprises:
the similarity cos θ between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized is calculated as:

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\lVert \vec{x} \rVert \, \lVert \vec{y} \rVert}$$

where $\vec{x}$ is the feature vector of the target object in the vehicle reference image, $\vec{y}$ is the feature vector of the target object in the vehicle image to be recognized, $\lVert \vec{x} \rVert$ and $\lVert \vec{y} \rVert$ are their norms, and θ is the angle between $\vec{x}$ and $\vec{y}$.
6. A vehicle image fine-grained retrieval device based on a multiple attention mechanism is characterized by comprising:
the first extraction module is used for inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning a target object in the vehicle reference image and extracting a feature vector of the target object in the vehicle reference image;
the second extraction module is used for inputting the vehicle image to be recognized into the trained multi-attention convolutional neural network model, automatically positioning the target object in the vehicle image to be recognized and extracting the characteristic vector of the target object in the vehicle image to be recognized;
the calculation module is used for calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be identified;
and the retrieval output module is used for obtaining the vehicle image to be identified, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be identified according to the similarity, and using the vehicle image to be identified as the retrieval image of the vehicle reference image.
7. The vehicle image fine-grained retrieval device based on multiple attention mechanisms according to claim 6, wherein the first extraction module comprises:
a mapping unit, configured to generate a feature map; wherein the generating the feature map comprises: generating k × k position-sensitive score maps for each class of target object over the whole vehicle reference image by a convolution operation, the k × k position-sensitive score maps being used for describing a spatial grid of the corresponding positions, and each position-sensitive score map having c channel outputs representing c-1 object classes plus an image background;
an extraction unit, configured to extract the feature vector of each class of target object based on the k × k position-sensitive score maps of each class of target object.
8. The vehicle image fine-grained retrieval device based on multiple attention mechanisms according to claim 6, wherein the calculation module calculates the similarity cos θ between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized as:

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\lVert \vec{x} \rVert \, \lVert \vec{y} \rVert}$$

where $\vec{x}$ is the feature vector of the target object in the vehicle reference image, $\vec{y}$ is the feature vector of the target object in the vehicle image to be recognized, $\lVert \vec{x} \rVert$ and $\lVert \vec{y} \rVert$ are their norms, and θ is the angle between $\vec{x}$ and $\vec{y}$.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-5 when executing the computer program.
10. A computer-readable medium storing a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910776963.3A CN110704652A (en) 2019-08-22 2019-08-22 Vehicle image fine-grained retrieval method and device based on multiple attention mechanism

Publications (1)

Publication Number Publication Date
CN110704652A 2020-01-17

Family

ID=69193631

Country Status (1)

Country Link
CN (1) CN110704652A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019896A (en) * 2017-07-28 2019-07-16 杭州海康威视数字技术股份有限公司 A kind of image search method, device and electronic equipment
CN108197538A (en) * 2017-12-21 2018-06-22 浙江银江研究院有限公司 A kind of bayonet vehicle searching system and method based on local feature and deep learning
CN109284670A (en) * 2018-08-01 2019-01-29 清华大学 A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN110096982A (en) * 2019-04-22 2019-08-06 长沙千视通智能科技有限公司 A kind of video frequency vehicle big data searching method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HELIANG ZHANG et al.: "Learning Multi-Attention Convolutional Neural Network for Fine-Grained", 2017 IEEE International Conference on Computer Vision *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291812A (en) * 2020-02-11 2020-06-16 浙江大华技术股份有限公司 Attribute class acquisition method and device, storage medium and electronic device
CN111291812B (en) * 2020-02-11 2023-10-17 浙江大华技术股份有限公司 Method and device for acquiring attribute category, storage medium and electronic device
CN112541096A (en) * 2020-07-27 2021-03-23 广元量知汇科技有限公司 Video monitoring method for smart city
CN112052350A (en) * 2020-08-25 2020-12-08 腾讯科技(深圳)有限公司 Picture retrieval method, device, equipment and computer readable storage medium
CN112052350B (en) * 2020-08-25 2024-03-01 腾讯科技(深圳)有限公司 Picture retrieval method, device, equipment and computer readable storage medium
CN112818162A (en) * 2021-03-04 2021-05-18 泰康保险集团股份有限公司 Image retrieval method, image retrieval device, storage medium and electronic equipment
CN112818162B (en) * 2021-03-04 2023-10-17 泰康保险集团股份有限公司 Image retrieval method, device, storage medium and electronic equipment
CN112991311A (en) * 2021-03-29 2021-06-18 深圳大学 Vehicle overweight detection method, device and system and terminal equipment
CN112991311B (en) * 2021-03-29 2021-12-10 深圳大学 Vehicle overweight detection method, device and system and terminal equipment

Similar Documents

Publication Publication Date Title
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
CN108898086B (en) Video image processing method and device, computer readable medium and electronic equipment
Rong et al. Radial lens distortion correction using convolutional neural networks trained with synthesized images
Zhou et al. BOMSC-Net: Boundary optimization and multi-scale context awareness based building extraction from high-resolution remote sensing imagery
CN110909611B (en) Method and device for detecting attention area, readable storage medium and terminal equipment
CN110704652A (en) Vehicle image fine-grained retrieval method and device based on multiple attention mechanism
Jiang et al. Robust feature matching for remote sensing image registration via linear adaptive filtering
CN110689043A (en) Vehicle fine granularity identification method and device based on multiple attention mechanism
Kim et al. Color–texture segmentation using unsupervised graph cuts
CN108399386A (en) Information extracting method in pie chart and device
Zhao et al. Recognition of building group patterns using graph convolutional network
CN112990010B (en) Point cloud data processing method and device, computer equipment and storage medium
Zhang et al. Road recognition from remote sensing imagery using incremental learning
CN108428248B (en) Vehicle window positioning method, system, equipment and storage medium
CN111126459A (en) Method and device for identifying fine granularity of vehicle
Ding et al. Efficient vanishing point detection method in complex urban road environments
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN113537180A (en) Tree obstacle identification method and device, computer equipment and storage medium
Ding et al. Efficient vanishing point detection method in unstructured road environments based on dark channel prior
Wang et al. Bdr-net: Bhattacharyya distance-based distribution metric modeling for rotating object detection in remote sensing
Kelenyi et al. SAM-Net: self-attention based feature matching with spatial transformers and knowledge distillation
Zha et al. ENGD-BiFPN: A remote sensing object detection model based on grouped deformable convolution for power transmission towers
CN117765039A (en) Point cloud coarse registration method, device and equipment
Pang et al. ROSE: real one-stage effort to detect the fingerprint singular point based on multi-scale spatial attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117