CN110704652A - Vehicle image fine-grained retrieval method and device based on multiple attention mechanism - Google Patents

Vehicle image fine-grained retrieval method and device based on multiple attention mechanism

Info

Publication number
CN110704652A
Authority
CN
China
Prior art keywords
vehicle
target object
image
reference image
recognized
Prior art date
Legal status
Pending
Application number
CN201910776963.3A
Other languages
Chinese (zh)
Inventor
张斯尧
王思远
谢喜林
张�诚
文戎
田磊
Current Assignee
Changsha Qianshitong Intelligent Technology Co Ltd
Original Assignee
Changsha Qianshitong Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Changsha Qianshitong Intelligent Technology Co Ltd filed Critical Changsha Qianshitong Intelligent Technology Co Ltd
Priority to CN201910776963.3A
Publication of CN110704652A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures


Abstract

An embodiment of the invention provides a vehicle image fine-grained retrieval method and device based on a multiple attention mechanism. The method comprises the following steps: inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning the target object in the vehicle reference image, and extracting the feature vector of that target object; automatically positioning the target object in a vehicle image to be recognized through the multi-attention convolutional neural network model, and extracting the feature vector of that target object; calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized; and, according to the similarity, obtaining the vehicle images to be recognized that contain a target object of the same category as the target object in the vehicle reference image, to serve as retrieval images of the vehicle reference image. The method and device can improve the accuracy and efficiency of vehicle image retrieval.

Description

Vehicle image fine-grained retrieval method and device based on multiple attention mechanism
Technical Field
The invention belongs to the technical field of computer vision and intelligent traffic, and particularly relates to a vehicle image fine-grained retrieval method and device based on a multiple attention mechanism, terminal equipment and a computer readable medium.
Background
With the rapid development of modern transportation, the security industry and the like, target recognition technology is increasingly applied in various fields, and has been one of the important research subjects of computer vision and pattern recognition in the intelligent transportation field in recent years.
Fine-grained vehicle recognition is an important research direction in the field of computer vision. Recognizing vehicles of the same model is more difficult than traditional vehicle recognition because the differences between similar vehicles are very small; the difference may be only the annual inspection mark on the vehicle or some small decorations inside it. With the rise of deep learning in recent years, many researchers have attempted to apply deep learning to the field of object detection and recognition. Fine-grained image analysis is a popular research topic in computer vision for solving such problems; it studies visual analysis tasks such as locating, recognizing and retrieving object subclasses in fine-grained images, and has wide application value in real scenes.
Meanwhile, image retrieval is a technology for retrieving similar images from an input image, and mainly involves two parts: image feature extraction and image feature similarity analysis. Fine-grained image recognition consists in finding local regional features with subtle differences in images, allowing different subclasses within a large class to be distinguished. Applying fine-grained image recognition to image retrieval makes it possible to extract the fine-grained features of images and analyze their similarity. However, the fine-grained image recognition and retrieval methods in the prior art suffer from low efficiency and low accuracy.
Disclosure of Invention
In view of this, embodiments of the present invention provide a vehicle image fine-grained retrieval method and apparatus based on a multiple attention mechanism, a terminal device, and a computer readable medium, which can improve accuracy and efficiency of vehicle image retrieval.
The first aspect of the embodiment of the invention provides a vehicle image fine-grained retrieval method based on a multiple attention mechanism, which comprises the following steps:
inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning a target object in the vehicle reference image, and extracting a feature vector of the target object in the vehicle reference image;
inputting a vehicle image to be recognized into the trained multiple attention convolutional neural network model, automatically positioning a target object in the vehicle image to be recognized, and extracting a feature vector of the target object in the vehicle image to be recognized;
calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized;
and obtaining the vehicle image to be recognized, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be recognized according to the similarity, and using the vehicle image to be recognized as a retrieval image of the vehicle reference image.
A second aspect of an embodiment of the present invention provides a vehicle image fine-grained retrieval device based on a multiple attention mechanism, including:
the first extraction module is used for inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning a target object in the vehicle reference image and extracting a feature vector of the target object in the vehicle reference image;
the second extraction module is used for inputting the vehicle image to be recognized into the trained multi-attention convolutional neural network model, automatically positioning the target object in the vehicle image to be recognized and extracting the characteristic vector of the target object in the vehicle image to be recognized;
the calculation module is used for calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be identified;
and the retrieval output module is used for obtaining the vehicle image to be identified, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be identified according to the similarity, and using the vehicle image to be identified as the retrieval image of the vehicle reference image.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above-mentioned fine-grained retrieval method for vehicle images based on a multiple attention mechanism when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable medium, which stores a computer program that, when executed by a processor, implements the steps of the above-mentioned vehicle image fine-grained retrieval method based on a multiple attention mechanism.
In the vehicle image fine-grained retrieval method based on a multiple attention mechanism provided by the embodiment of the invention, a vehicle reference image is input into a trained multi-attention convolutional neural network model, the target object in the vehicle reference image is automatically positioned, and its feature vector is extracted; a vehicle image to be recognized is input into the same trained model, the target object in it is automatically positioned, and its feature vector is extracted; the similarity between the feature vector of the target object in the vehicle reference image and that of the target object in the vehicle image to be recognized is calculated; and, according to the similarity, the vehicle images to be recognized that contain a target object of the same category as the target object in the vehicle reference image are obtained and used as retrieval images of the vehicle reference image. The accuracy and efficiency of vehicle image retrieval can thereby be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a fine-grained retrieval method for a vehicle image based on a multiple attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature map provided by an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a vehicle image fine-grained retrieval device based on a multiple attention mechanism according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a refinement of the first extraction module in FIG. 3;
FIG. 5 is a schematic structural diagram of another vehicle image fine-grained retrieval device based on a multiple attention mechanism according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation rather than limitation, specific details such as particular system structures and techniques are set forth in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Referring to fig. 1, fig. 1 is a flowchart of a vehicle image fine-grained retrieval method based on a multiple attention mechanism according to an embodiment of the present invention. As shown in fig. 1, the vehicle image fine-grained retrieval method based on the multiple attention mechanism of this embodiment includes the following steps:
s101: and inputting the vehicle reference image into the trained multi-attention convolutional neural network model, automatically positioning the target object in the vehicle reference image, and extracting the characteristic vector of the target object in the vehicle reference image.
In the embodiment of the invention, a multi-attention convolutional neural network model can be established first. In particular, to prevent training from falling into a locally optimal solution, the channel clustering layer and the classification network layer need to be pre-trained. In pre-training, the channel clustering layer locally positions the input vehicle image, and the classification network layer identifies the local features of each positioned part of the input vehicle image and generates a weight vector for each positioned part. Further, an acquired vehicle image can be input into the convolutional layers of the classification network layer to extract its original depth features $W * X$, where $W$ denotes the base network layers of the multi-attention convolutional neural network model to be established, $X$ is the vehicle image, and $*$ is the convolution operator. The vehicle image can then be divided into N parts, with all the feature channels it contains clustered into N groups; each local region corresponds to a group of channel clustering layers, each channel clustering layer comprises two feature channel layers, and each feature channel layer comprises a plurality of feature channels. Specifically, since each feature channel responds to a particular type of visual pattern, it has a peak response point, so each feature channel can be represented by a position vector whose elements are the peak response coordinates of all the training images on that channel. These position vectors are used as features for clustering, dividing the different channels into N clusters, i.e., N parts. Whether a channel belongs to a cluster is represented by an indicator vector of length c (the number of channels): the channel's position is 1 if it belongs to the cluster and 0 otherwise, and the N indicator vectors are mutually exclusive. To ensure that this process is optimized during training, the N parts can be identified by the classification network layer, where the N fully-connected layers corresponding one-to-one to the N parts receive the original depth features and generate the corresponding weight vectors $d_i(X)$:
$$d_i(X) = f_i(W * X) \qquad (1)$$
where $d_i(X)$ is the weight vector of the i-th of the N parts, $d_i(X) = [d_1, \ldots, d_c]$, $c$ is the number of feature channels, and $f_i$ is the clustering function of the i-th of the N fully-connected layers; to obtain accurate weight vectors, the parameters of each clustering function $f_i$ usually also need to be pre-trained. Based on the learned weight vectors, the attention heatmaps of the N parts can be obtained:
$$M_i(X) = \operatorname{sigmoid}\Big(\sum_{j=1}^{c} d_j \, [W * X]_j\Big) \qquad (2)$$
where $M_i(X)$ is the probability heatmap of the i-th of the N parts, sigmoid is the sigmoid function, $d_j$ is the j-th element of the weight vector, and $[W * X]_j$ is the original depth feature of the j-th feature channel. Through the above pre-training of the channel clustering layer and the classification network layer, the weight vectors and the parameters involved in each function can be set, so that the hierarchical structure of the multi-attention convolutional neural network model can be constructed; this hierarchy also comprises the base network layers and the other layers of an existing neural network. A loss function $L(X)$ can then be determined for the multi-attention convolutional neural network model as:
$$L(X) = \sum_{i=1}^{N} \Big( L_{cls}\big(Y^{(i)}, Y^{*}\big) + L_{cng}(M_i) \Big) \qquad (3)$$
where $L_{cls}$ is the part classification loss function and $L_{cng}$ is the channel clustering loss function, with $L_{cng}$ expressed as:
$$L_{cng}(M_i) = \mathrm{Dis}(M_i) + \lambda \, \mathrm{Div}(M_i) \qquad (4)$$
where Dis and Div denote a distance function and a divergence function, respectively: Dis makes the coordinates within the same one of the N parts tend to gather, while Div makes different ones of the N parts tend to stay apart; $\lambda$ is a weight parameter whose determined value can be obtained through the subsequent training. In the part classification loss function $L_{cls}$, $Y^{(i)}$ denotes the label vector predicted from the i-th of the N parts using the locally refined feature $P_i(X)$, and $Y^{*}$ is the ground-truth label vector. The locally refined feature $P_i(X)$ is expressed as:
$$P_i(X) = \sum_{(x,\,y)} [W * X](x, y) \cdot M_i(X)(x, y) \qquad (5)$$

i.e., an attention-weighted spatial pooling of the depth features over the locations $(x, y)$.
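As an illustrative sketch only (not code from the patent), the weight vectors of equation (1), the attention heatmaps of equation (2) and the pooled part features of equation (5) might be computed along the following lines in PyTorch; the module name, the global pooling feeding the fully-connected layers, and all sizes are assumptions:

import torch
import torch.nn as nn

class MultiAttentionHead(nn.Module):
    """Sketch of eqs. (1), (2), (5): per-part weight vectors d_i(X),
    attention heatmaps M_i(X), and attention-pooled part features P_i(X)."""

    def __init__(self, in_channels: int, num_parts: int):
        super().__init__()
        # One fully-connected layer per part produces d_i(X) from W*X (eq. 1).
        self.fc = nn.ModuleList(
            nn.Linear(in_channels, in_channels) for _ in range(num_parts)
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: original depth features W*X, shape (B, C, H, W)
        b, c, _, _ = feat.shape
        pooled = feat.mean(dim=(2, 3))      # (B, C) summary fed to each fc
        parts = []
        for fc in self.fc:
            d = fc(pooled)                  # weight vector d_i(X), (B, C)
            # Eq. (2): M_i = sigmoid of the channel-weighted sum of features.
            m = torch.sigmoid((d.view(b, c, 1, 1) * feat).sum(dim=1, keepdim=True))
            # Eq. (5): attention-weighted spatial pooling gives P_i(X), (B, C).
            parts.append((feat * m).sum(dim=(2, 3)))
        return torch.stack(parts, dim=1)    # (B, N, C) part features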
The classification network layer and the channel clustering layer are then trained alternately and iteratively through the part classification loss function and the channel clustering loss function until neither loss changes any more. Specifically, as to this alternating iterative training: the convolutional layers can first be fixed and the channel clustering layer trained with the channel clustering loss $L_{cng}$; the channel clustering layer is then fixed and the convolutional layers and the softmax layer in the classification network layer are trained with the part classification loss $L_{cls}$; next, the convolutional layers and the softmax layer in the classification network layer can be fixed and the channel clustering layer trained with $L_{cls}$; the channel clustering layer is fixed again and the convolutional layers and the softmax layer in the classification network layer are trained with $L_{cng}$; and so on, alternating until the part classification loss and the channel clustering loss no longer change.
In this alternating, iterative joint training of $L_{cng}$ and $L_{cls}$, the weight parameter matrices and offset values in the channel clustering layer and the classification network layer are continuously adjusted; when the two loss functions no longer change (i.e., learning is finished), the adjusted values of these weight parameter matrices and offset values are obtained. The pre-adjustment weight parameter matrix comprises the weight vectors described above; specifically, each weight vector plus a corresponding coefficient constitutes the pre-adjustment weight parameter matrix.
After the adjusted weight parameter matrices and offset values of the channel clustering layer and the classification network layer are obtained, the multi-attention convolutional neural network model can be trained on a vehicle dataset containing fine-grained image classes of different vehicle attributes (the training may be supervised learning on labeled data). The training yields the determined values of the weight parameter matrices and offset values of the channel clustering layer and the classification network layer, which can be used for subsequent vehicle feature extraction, multi-attribute vehicle recognition and retrieval, and the like. Once the multi-attention convolutional neural network model is trained, the vehicle reference image can be input into it, the target object in the vehicle reference image automatically positioned, and the feature vector of the target object in the vehicle reference image extracted.
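To make the alternating scheme above concrete, here is a minimal sketch of a channel clustering loss in the form of equation (4), with Dis pulling the peak-response coordinates of channels in the same part toward their center and Div pushing different part centers apart; the hinge margin, the coordinate representation and the assumption that every part is non-empty are choices of this sketch, not details given by the patent:

import torch

def channel_clustering_loss(coords: torch.Tensor,
                            assign: torch.Tensor,
                            num_parts: int,
                            lam: float = 0.5,
                            margin: float = 1.0) -> torch.Tensor:
    """Sketch of eq. (4): L_cng = Dis + lambda * Div.

    coords: (C, 2) peak-response coordinates of the C feature channels.
    assign: (C,) part index in [0, num_parts) for each channel.
    """
    centers = torch.stack(
        [coords[assign == i].mean(dim=0) for i in range(num_parts)]
    )                                            # (N, 2) part centers
    # Dis: coordinates within the same part should gather around its center.
    dis = ((coords - centers[assign]) ** 2).sum(dim=1).mean()
    # Div: centers of different parts should stay apart (hinge penalty).
    pairwise = torch.cdist(centers, centers)     # (N, N) center distances
    off_diag = ~torch.eye(num_parts, dtype=torch.bool)
    div = torch.clamp(margin - pairwise[off_diag], min=0).mean()
    return dis + lam * div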
Specifically, a feature map can be generated first, for example based on ResNet-50 (the 50-layer Residual Network), as shown in fig. 2: a convolution operation is used to generate k × k position-sensitive score maps for each class of target object over the whole vehicle reference image. In fig. 2, the vehicle reference image (left) contains five target objects a, b, c, d and e, each of which can be mapped to a 3 × 3 position-sensitive score map; for example, target object a maps to the score maps a1 through a9. The k × k position-sensitive score maps describe a spatial grid of the corresponding positions of each class of target object. Each position-sensitive score map has c feature channel outputs, representing c-1 object classes plus an image background; the target objects correspond to the large classes, and the image background is added as one further class to the c-1 classes represented by the c channel outputs. Feature vectors of each class of target object, for example feature vectors of the respective object subclasses of each class, can then be extracted based on the k × k position-sensitive score maps. As for the generation of the k × k position-sensitive score maps: for a candidate target box of size w × h (obtained from the RPN network), the box is divided into k × k sub-regions, each of size (w × h)/k²; for any sub-region bin(i, j), 0 ≤ i, j ≤ k-1, a position-sensitive pooling operation is defined:
$$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i, j, c}\big(x + x_0,\, y + y_0 \mid \Theta\big)$$
where $r_c(i, j \mid \Theta)$ is the pooling response of sub-region bin(i, j) for one of the c-1 object classes plus the image background, $z_{i,j,c}$ is the position-sensitive score map corresponding to bin(i, j), $(x_0, y_0)$ denotes the coordinates of the top-left corner of the candidate target box, n is the number of pixels in bin(i, j), and $\Theta$ denotes all the parameters obtained from training the multi-attention convolutional neural network model. The k × k position-sensitive score map of each class of target object is obtained from the pooling responses of the sub-regions bin(i, j) over the c-1 object classes plus the image background.
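For illustration, a naive (loop-based) version of this position-sensitive pooling could look as follows; the tensor layout (one score map per grid cell, c channels each), the grid-cell indexing and the integer bin splitting are assumptions of the sketch:

import torch

def position_sensitive_pool(score_maps: torch.Tensor,
                            box: tuple,
                            k: int) -> torch.Tensor:
    """Sketch of the pooling r_c(i, j | Theta) defined above.

    score_maps: (k*k, c, H, W) position-sensitive score maps z_{i,j},
                each with c channel outputs (c-1 classes plus background).
    box:        (x0, y0, w, h) candidate target box on the feature map.
    Returns:    (k, k, c) pooled responses r_c(i, j).
    """
    x0, y0, w, h = box
    out = torch.zeros(k, k, score_maps.shape[1])
    for i in range(k):
        for j in range(k):
            # Sub-region bin(i, j), one of k*k cells of size (w*h)/k^2.
            xs, xe = x0 + i * w // k, x0 + (i + 1) * w // k
            ys, ye = y0 + j * h // k, y0 + (j + 1) * h // k
            z = score_maps[i * k + j]            # score map for this cell
            region = z[:, ys:ye, xs:xe]          # (c, bin_h, bin_w)
            out[i, j] = region.mean(dim=(1, 2))  # average over the n pixels
    return out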
S102: and inputting the vehicle image to be recognized into the trained multiple attention convolution neural network model, automatically positioning the target object in the vehicle image to be recognized, and extracting the characteristic vector of the target object in the vehicle image to be recognized.
In the embodiment of the present invention, the method for extracting the feature vector of the target object from the to-be-recognized vehicle image is the same as the method for extracting the feature vector of the target object from the vehicle reference image in S101, and reference may be specifically made to the related description of S101, so that details are not repeated herein.
S103: and calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be identified.
Specifically, similarity is usually calculated with the cosine distance in the prior art, and in the embodiment of the present invention, cosine similarity is used to measure the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized. Cosine similarity measures the difference between two individuals by the cosine of the angle between their feature vectors in a vector space. Compared with distance metrics, cosine similarity attends more to the difference of two vectors in direction than to their distance or length. The similarity cos θ between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized is calculated as:

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\lVert \vec{x} \rVert \, \lVert \vec{y} \rVert}$$

where $\vec{x}$ is the feature vector of the target object in the vehicle reference image, $\vec{y}$ is the feature vector of the target object in the vehicle image to be recognized, $\lVert \vec{x} \rVert$ and $\lVert \vec{y} \rVert$ are their norms, and θ is the angle between $\vec{x}$ and $\vec{y}$.
S104: and obtaining the vehicle image to be recognized, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be recognized according to the similarity, and using the vehicle image to be recognized as a retrieval image of the vehicle reference image.
In the embodiment of the present invention, for ease of understanding, suppose there are one or more images to be recognized. If a certain image to be recognized contains a target object of the same category as the target object in the vehicle reference image, that image can be used as a retrieval image of the vehicle reference image. As for determining the target objects of the same category as a target object in the vehicle reference image: for a certain target object in the vehicle reference image, the similarities between the feature vectors of all target objects in the vehicle images to be recognized and the feature vector of that target object can be calculated and ranked, and the target objects in the vehicle images to be recognized corresponding to the highest-ranked similarities are taken as the target objects of the same category as that target object.
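As a sketch of how S103 and S104 fit together, the following ranks gallery target objects by the cosine similarity above and keeps the highest-ranked ones; the function name and the top-k cut-off are assumptions of this sketch, not prescriptions of the patent:

import torch
import torch.nn.functional as F

def retrieve(ref_feat: torch.Tensor,
             gallery_feats: torch.Tensor,
             top_k: int = 5):
    """Rank gallery target objects by cosine similarity to the reference.

    ref_feat:      (D,) feature vector of a target object in the vehicle
                   reference image.
    gallery_feats: (M, D) feature vectors of target objects in the vehicle
                   images to be recognized.
    Returns the top_k (similarities, indices); the images holding those
    objects can then serve as retrieval images of the reference image.
    """
    # cos(theta) = x . y / (||x|| * ||y||), evaluated against all M vectors.
    sims = F.cosine_similarity(ref_feat.unsqueeze(0), gallery_feats, dim=1)
    k = min(top_k, gallery_feats.shape[0])
    return torch.topk(sims, k=k)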
In the vehicle image fine-grained retrieval method based on the multiple attention mechanism provided in fig. 1, the feature vectors of the target object can be respectively extracted from the vehicle reference image and the vehicle image to be recognized based on the trained multiple attention convolutional neural network model, and the retrieval image of the vehicle reference image is finally obtained by calculating the similarity between the feature vectors of the target object of the vehicle reference image and the feature vectors of the target object of the vehicle image to be recognized, so that the accuracy and efficiency of vehicle retrieval can be improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a vehicle image fine-grained retrieval device based on a multiple attention mechanism according to an embodiment of the present invention. As shown in fig. 3, the vehicle image fine-grained retrieval device 30 based on the multiple attention mechanism of this embodiment includes a first extraction module 301, a second extraction module 302, a calculation module 303 and a retrieval output module 304, which are respectively used for executing the specific methods of S101, S102, S103 and S104 in fig. 1; for details, refer to the related description of fig. 1. They are only briefly described here:
the first extraction module 301 is configured to input a vehicle reference image into a trained multi-attention convolutional neural network model, automatically locate a target object in the vehicle reference image, and extract a feature vector of the target object in the vehicle reference image.
The second extraction module 302 is configured to input the vehicle image to be recognized into the trained multi-attention convolutional neural network model, automatically locate the target object in the vehicle image to be recognized, and extract the feature vector of the target object in the vehicle image to be recognized.
A calculating module 303, configured to calculate a similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized.
And a retrieval output module 304, configured to obtain, according to the similarity, a to-be-identified vehicle image that includes a target object of the same category as the target object in the vehicle reference image in the to-be-identified vehicle image, as a retrieval image of the vehicle reference image.
Further, as can be seen in fig. 4, the first extraction module 301 may specifically include a mapping unit 3011 and an extraction unit 3012:
A mapping unit 3011, configured to generate a feature map. The feature map can be generated based on ResNet-50, which may specifically include: using a convolution operation to generate k × k position-sensitive score maps for each class of target object over the whole vehicle reference image, the k × k position-sensitive score maps describing a spatial grid of the corresponding positions of each class of target object, and each score map having c channel outputs representing c-1 object classes plus an image background. More specifically, for a candidate target box of size w × h, the box can be divided into k × k sub-regions, each of size (w × h)/k²; for any sub-region bin(i, j), 0 ≤ i, j ≤ k-1, a position-sensitive pooling operation is defined:
$$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i, j, c}\big(x + x_0,\, y + y_0 \mid \Theta\big)$$
where $r_c(i, j \mid \Theta)$ is the pooling response of sub-region bin(i, j) for one of the c-1 object classes plus the image background, $z_{i,j,c}$ is the position-sensitive score map corresponding to bin(i, j), $(x_0, y_0)$ denotes the coordinates of the top-left corner of the candidate target box, n is the number of pixels in bin(i, j), and $\Theta$ denotes all the learned parameters of the network. The k × k position-sensitive score map of each class of target object is obtained from the pooling responses of the sub-regions bin(i, j) over the c-1 object classes plus the image background.
An extracting unit 3012, configured to extract a feature vector of each type of target object based on the k × k position sensitivity score maps of each type of target object.
Referring to fig. 5, fig. 5 is a schematic structural diagram of another vehicle image fine-grained retrieval device based on a multiple attention mechanism according to an embodiment of the present invention. As shown in fig. 5, the vehicle image fine-grained retrieval device 50 based on the multiple attention mechanism is optimized on the basis of the vehicle image fine-grained retrieval device 30 shown in fig. 3: in addition to the first extraction module 301, the second extraction module 302, the calculation module 303 and the retrieval output module 304 of the device 30, it further includes a model establishing module 501:
the model establishing module 501 is configured to establish a multiple attention convolutional neural network model, and train the established multiple attention convolutional neural network model. The method and steps specifically executed by the model building module 501 may refer to the related description of S101, and therefore are not described herein again.
In the vehicle image fine-grained retrieval device based on the multiple attention mechanism provided in fig. 3 or fig. 5, the feature vectors of the target object can be respectively extracted from the vehicle reference image and the vehicle image to be recognized based on the trained multiple attention convolutional neural network model, and the retrieval image of the vehicle reference image is finally obtained by calculating the similarity between the feature vectors of the target object of the vehicle reference image and the feature vectors of the target object of the vehicle image to be recognized, so that the accuracy and efficiency of vehicle retrieval can be improved.
Fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61 and a computer program 62 stored in said memory 61 and operable on said processor 60, for example a program for performing a fine-grained retrieval of images of a vehicle based on a multiple attention mechanism. The processor 60, when executing the computer program 62, implements the steps in the above-described method embodiments, e.g., S101 to S104 shown in fig. 1. Alternatively, the processor 60, when executing the computer program 62, implements the functions of the modules/units in the above-mentioned device embodiments, such as the functions of the modules 301 to 304 shown in fig. 3.
Illustratively, the computer program 62 may be partitioned into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution of the computer program 62 in the terminal device 6. For example, the computer program 62 may be divided into the first extraction module 301, the second extraction module 302, the calculation module 303 and the retrieval output module 304 (modules in a virtual device), whose specific functions are as follows:
the first extraction module 301 is configured to input a vehicle reference image into a trained multi-attention convolutional neural network model, automatically locate a target object in the vehicle reference image, and extract a feature vector of the target object in the vehicle reference image.
The second extraction module 302 is configured to input the vehicle image to be recognized into the trained multi-attention convolutional neural network model, automatically locate the target object in the vehicle image to be recognized, and extract the feature vector of the target object in the vehicle image to be recognized.
A calculating module 303, configured to calculate a similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized.
And a retrieval output module 304, configured to obtain, according to the similarity, a to-be-identified vehicle image that includes a target object of the same category as the target object in the vehicle reference image in the to-be-identified vehicle image, as a retrieval image of the vehicle reference image.
The terminal device 6 may be a desktop computer, a notebook, a palmtop computer, a cloud server or other computing device. The terminal device 6 may include, but is not limited to, a processor 60 and a memory 61. Those skilled in the art will appreciate that fig. 6 is merely an example of the terminal device 6 and does not constitute a limitation of it; the terminal device may include more or fewer components than those shown, combine some components, or have different components; for example, it may also include input/output devices, network access devices, buses, etc.
The Processor 60 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) and the like provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used for storing the computer programs and other programs and data required by the terminal device 6. The memory 61 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A fine-grained retrieval method for a vehicle image based on a multiple attention mechanism is characterized by comprising the following steps:
inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning a target object in the vehicle reference image, and extracting a feature vector of the target object in the vehicle reference image;
inputting a vehicle image to be recognized into the trained multiple attention convolutional neural network model, automatically positioning a target object in the vehicle image to be recognized, and extracting a feature vector of the target object in the vehicle image to be recognized;
calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized;
and obtaining the vehicle image to be recognized, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be recognized according to the similarity, and using the vehicle image to be recognized as a retrieval image of the vehicle reference image.
2. The method for fine-grained retrieval of vehicle images based on multiple attention mechanisms according to claim 1, wherein before inputting a vehicle reference image into a trained multiple attention convolutional neural network model, automatically locating a target object in the vehicle reference image, and extracting a feature vector of the target object in the vehicle reference image, the method further comprises:
and constructing a multi-attention convolutional neural network model, and training the constructed multi-attention convolutional neural network model.
3. The method for fine-grained retrieval of vehicle images based on multiple attention mechanisms according to claim 1, wherein the step of inputting vehicle reference images into the trained multiple attention convolutional neural network model, automatically locating target objects in the vehicle reference images, and extracting feature vectors of the target objects in the vehicle reference images comprises:
generating a feature map; wherein the generating the feature map comprises: generating k × k position-sensitive score maps for each class of target object over the whole vehicle reference image by a convolution operation, wherein the k × k position-sensitive score maps are used for describing a spatial grid of the corresponding positions of each class of target object, and each position-sensitive score map has c channel outputs representing c-1 object classes plus an image background;
extracting feature vectors of each class of target object based on the k × k position-sensitive score maps of each class of target object.
4. The method for fine-grained retrieval of vehicle images based on multiple attention mechanisms according to claim 3, wherein the generating k × k position-sensitive score maps for each class of target object over the entire vehicle reference image by a convolution operation comprises:
for a candidate target box of size w × h, dividing the box into k × k sub-regions, each of size (w × h)/k²; for any sub-region bin(i, j), 0 ≤ i, j ≤ k-1, defining a position-sensitive pooling operation:
$$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i, j, c}\big(x + x_0,\, y + y_0 \mid \Theta\big)$$
where $r_c(i, j \mid \Theta)$ is the pooling response of sub-region bin(i, j) for one of the c-1 object classes plus the image background, $z_{i,j,c}$ is the position-sensitive score map corresponding to bin(i, j), $(x_0, y_0)$ denotes the coordinates of the top-left corner of the candidate target box, n is the number of pixels in bin(i, j), and $\Theta$ denotes all the learned parameters of the network;
and obtaining the k × k position-sensitive score map of each class of target object according to the pooling responses of the sub-regions bin(i, j) over the c-1 object classes plus the image background.
5. The method for fine-grained retrieval of vehicle images based on multiple attention mechanisms according to claim 1, wherein the calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized comprises:
the similarity cos θ between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized is calculated as:

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\lVert \vec{x} \rVert \, \lVert \vec{y} \rVert}$$

where $\vec{x}$ is the feature vector of the target object in the vehicle reference image, $\vec{y}$ is the feature vector of the target object in the vehicle image to be recognized, $\lVert \vec{x} \rVert$ and $\lVert \vec{y} \rVert$ are their norms, and θ is the angle between $\vec{x}$ and $\vec{y}$.
6. A vehicle image fine-grained retrieval device based on a multiple attention mechanism is characterized by comprising:
the first extraction module is used for inputting a vehicle reference image into a trained multi-attention convolutional neural network model, automatically positioning a target object in the vehicle reference image and extracting a feature vector of the target object in the vehicle reference image;
the second extraction module is used for inputting the vehicle image to be recognized into the trained multi-attention convolutional neural network model, automatically positioning the target object in the vehicle image to be recognized and extracting the characteristic vector of the target object in the vehicle image to be recognized;
the calculation module is used for calculating the similarity between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be identified;
and the retrieval output module is used for obtaining the vehicle image to be identified, which contains the target object of the same category as the target object in the vehicle reference image, in the vehicle image to be identified according to the similarity, and using the vehicle image to be identified as the retrieval image of the vehicle reference image.
7. The vehicle image fine-grained retrieval device based on multiple attention mechanisms according to claim 6, wherein the first extraction module comprises:
a mapping unit, configured to generate a feature map; wherein the generating the feature map comprises: generating k × k position-sensitive score maps for each class of target object over the whole vehicle reference image by a convolution operation, the k × k position-sensitive score maps being used for describing a spatial grid of the corresponding positions, and each position-sensitive score map having c channel outputs representing c-1 object classes plus an image background;
an extraction unit, configured to extract the feature vector of each class of target object based on the k × k position-sensitive score maps of each class of target object.
8. The vehicle image fine-grained retrieval device based on multiple attention mechanisms according to claim 6, wherein the calculation module calculates the similarity cos θ between the feature vector of the target object in the vehicle reference image and the feature vector of the target object in the vehicle image to be recognized as:

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\lVert \vec{x} \rVert \, \lVert \vec{y} \rVert}$$

where $\vec{x}$ is the feature vector of the target object in the vehicle reference image, $\vec{y}$ is the feature vector of the target object in the vehicle image to be recognized, $\lVert \vec{x} \rVert$ and $\lVert \vec{y} \rVert$ are their norms, and θ is the angle between $\vec{x}$ and $\vec{y}$.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-5 when executing the computer program.
10. A computer-readable medium storing a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910776963.3A CN110704652A (en) 2019-08-22 2019-08-22 Vehicle image fine-grained retrieval method and device based on multiple attention mechanism

Publications (1)

Publication Number Publication Date
CN110704652A 2020-01-17

Family

ID=69193631

Country Status (1)

Country Link
CN (1) CN110704652A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019896A (en) * 2017-07-28 2019-07-16 杭州海康威视数字技术股份有限公司 A kind of image search method, device and electronic equipment
CN108197538A (en) * 2017-12-21 2018-06-22 浙江银江研究院有限公司 A kind of bayonet vehicle searching system and method based on local feature and deep learning
CN109284670A (en) * 2018-08-01 2019-01-29 清华大学 A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN110096982A (en) * 2019-04-22 2019-08-06 长沙千视通智能科技有限公司 A kind of video frequency vehicle big data searching method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HELIANG ZHANG et al.: "Learning Multi-Attention Convolutional Neural Network for Fine-Grained", 2017 IEEE International Conference on Computer Vision *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291812A (en) * 2020-02-11 2020-06-16 浙江大华技术股份有限公司 Attribute class acquisition method and device, storage medium and electronic device
CN111291812B (en) * 2020-02-11 2023-10-17 浙江大华技术股份有限公司 Method and device for acquiring attribute category, storage medium and electronic device
CN112541096A (en) * 2020-07-27 2021-03-23 广元量知汇科技有限公司 Video monitoring method for smart city
CN112052350A (en) * 2020-08-25 2020-12-08 腾讯科技(深圳)有限公司 Picture retrieval method, device, equipment and computer readable storage medium
CN112052350B (en) * 2020-08-25 2024-03-01 腾讯科技(深圳)有限公司 Picture retrieval method, device, equipment and computer readable storage medium
CN112818162A (en) * 2021-03-04 2021-05-18 泰康保险集团股份有限公司 Image retrieval method, image retrieval device, storage medium and electronic equipment
CN112818162B (en) * 2021-03-04 2023-10-17 泰康保险集团股份有限公司 Image retrieval method, device, storage medium and electronic equipment
CN112991311A (en) * 2021-03-29 2021-06-18 深圳大学 Vehicle overweight detection method, device and system and terminal equipment
CN112991311B (en) * 2021-03-29 2021-12-10 深圳大学 Vehicle overweight detection method, device and system and terminal equipment

Similar Documents

Publication Publication Date Title
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
CN108898086B (en) Video image processing method and device, computer readable medium and electronic equipment
Rong et al. Radial lens distortion correction using convolutional neural networks trained with synthesized images
Zhou et al. BOMSC-Net: Boundary optimization and multi-scale context awareness based building extraction from high-resolution remote sensing imagery
CN110909611B (en) Method and device for detecting attention area, readable storage medium and terminal equipment
CN110704652A (en) Vehicle image fine-grained retrieval method and device based on multiple attention mechanism
Jiang et al. Robust feature matching for remote sensing image registration via linear adaptive filtering
CN110689043A (en) Vehicle fine granularity identification method and device based on multiple attention mechanism
Kim et al. Color–texture segmentation using unsupervised graph cuts
CN108399386A (en) Information extracting method in pie chart and device
Zhao et al. Recognition of building group patterns using graph convolutional network
CN112990010B (en) Point cloud data processing method and device, computer equipment and storage medium
Zhang et al. Road recognition from remote sensing imagery using incremental learning
CN108428248B (en) Vehicle window positioning method, system, equipment and storage medium
CN111126459A (en) Method and device for identifying fine granularity of vehicle
Ding et al. Efficient vanishing point detection method in complex urban road environments
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN113537180A (en) Tree obstacle identification method and device, computer equipment and storage medium
Ding et al. Efficient vanishing point detection method in unstructured road environments based on dark channel prior
Wang et al. Bdr-net: Bhattacharyya distance-based distribution metric modeling for rotating object detection in remote sensing
Kelenyi et al. SAM-Net: self-attention based feature matching with spatial transformers and knowledge distillation
Zha et al. ENGD-BiFPN: A remote sensing object detection model based on grouped deformable convolution for power transmission towers
CN117765039A (en) Point cloud coarse registration method, device and equipment
Pang et al. ROSE: real one-stage effort to detect the fingerprint singular point based on multi-scale spatial attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117