CN115830402A - Fine-grained image recognition classification model training method, device and equipment - Google Patents

Fine-grained image recognition classification model training method, device and equipment

Info

Publication number
CN115830402A
Authority
CN
China
Prior art keywords: attention, classification, fine, self, vector
Legal status
Granted
Application number
CN202310140142.7A
Other languages
Chinese (zh)
Other versions
CN115830402B (en)
Inventor
余鹰
王景辉
Current Assignee
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date
Filing date
Publication date
Application filed by East China Jiaotong University
Priority to CN202310140142.7A
Publication of CN115830402A
Application granted
Publication of CN115830402B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a method, an apparatus, and a device for training a fine-grained image recognition classification model. The method comprises the following steps: inputting a fine-grained image into a preset network model for training, the preset network model comprising a plurality of self-attention layers; acquiring the classification vectors that a preset number of target self-attention layers learn from the fine-grained image; inputting the classification vector of each target self-attention layer into a preset classifier, outputting a classification label for each target self-attention layer, and performing a loss calculation between the classification label of each target self-attention layer and a preset real label; and updating the network parameters through a back propagation mechanism separately according to the loss value of each target self-attention layer. By introducing a progressive training mechanism, the method helps mine the complementary information in classification vectors at different levels and use it for classification. A multi-scale module is also provided, which realizes complementary exchange between global and local information and improves the fine-grained image classification effect.

Description

Fine-grained image recognition classification model training method, device and equipment
Technical Field
The invention relates to the technical field of model training, in particular to a method, a device and equipment for training a fine-grained image recognition classification model.
Background
Fine-grained image classification aims at identifying sub-categories within the same parent category: for example, Mercedes-Benz and Audi belong to the same vehicle category, the blue crow and the parrot belong to the same bird category, and the Labrador Retriever and the Golden Retriever belong to the same dog category. Fine-grained image classification has attracted wide attention because of its many practical applications in face recognition, traffic vehicle recognition, intelligent retail, agricultural disease recognition, endangered animal protection, and the like.
However, unlike conventional image classification, the images in a fine-grained classification training set are often discriminative only in small local regions. Existing fine-grained image classification models fall roughly into two types: strongly supervised models and weakly supervised models. Strongly supervised models depend on fine image annotations (such as manually labeled bounding boxes, key-point information, and the like). Such accurate, fine annotations mostly have to be produced by experts in different domains, and because the sample data sets are large, the annotation work consumes a great deal of time and effort. Furthermore, annotation information can be affected by subjectivity and is prone to error. Recently, weakly supervised work has attracted the attention of researchers; this approach requires no additional image annotation, that is, image-level labels are used as the supervision signal. For example, the Vision Transformer (visual self-attention model, ViT for short) recently proposed by Google has performed impressively in the field of computer vision, and a plain ViT alone can already achieve good results in fine-grained image classification, but there is still room for improvement.
Therefore, many researchers have proposed a variety of ViT-based variants with some success. However, most existing ViT-based work simply migrates ideas from convolutional neural networks and lacks reflection on the multi-head attention mechanism that is unique to the ViT structure. Most recent ViT work focuses on the picture vectors (patch tokens) and the multi-head attention mechanism, but neglects the importance of the classification vector (class token) in classification. The existing ViT and some of its variants consider only the beneficial information learned by the last attention layer for classification, while ignoring the complementary information learned by other layers; this causes a certain loss of information and leaves the model's fine-grained classification accuracy lacking.
Disclosure of Invention
Based on this, the present invention provides a method, an apparatus, and a device for training a fine-grained image recognition classification model to solve at least one technical problem in the prior art.
The invention provides a fine-grained image recognition classification model training method, which comprises the following steps:
obtaining a fine-grained image for model training, and inputting the fine-grained image into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers, and the fine-grained image sequentially passes through each self-attention layer so as to perform classification vector learning on the fine-grained image through the self-attention layers;
obtaining classification vectors obtained by learning the fine-grained images by a preset number of target self-attention layers, wherein the target self-attention layers are positioned at the rear ends of the multiple self-attention layers;
inputting the classification vector of each target self-attention layer into a preset classifier, outputting a classification label of each target self-attention layer, and performing loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;
and updating network parameters through a back propagation mechanism according to the loss value of each target self-attention layer so as to train the fine-grained image recognition classification model.
In addition, the fine-grained image recognition classification model training method according to the above embodiment of the present invention may further have the following additional technical features:
Further, the method further comprises:
calculating a final attention weight matrix of the self-attention layer after classification vector learning is carried out on the fine-grained image according to a preset calculation rule;
determining the position of a classification target according to the final self-attention weight matrix, and intercepting a classification target area image from the fine-grained image according to the position of the classification target;
and scaling the classified target area image to be the same as the fine-grained image in size, and inputting the image into the preset network model for training so as to intensively train the fine-grained image recognition classification model.
Further, the preset network model further includes a linear projection layer and a position coding layer, and the step of inputting the fine-grained image into the preset network model for training includes:
dividing the fine-grained image into preset sub-images according to a preset division rule, and mapping each sub-image to a high-dimensional feature space through the linear projection layer to obtain a picture vector of each sub-image;
coding the picture vectors of each sub-picture through the position coding layer to add position coding information to each picture vector, and adding an empty classification vector in front of the first picture vector to obtain a vector sequence;
and inputting the vector sequence into the multi-layer self-attention layer for classification vector learning, wherein the classification features learned by each layer of self-attention layer are updated in the classification vectors of the vector sequence to obtain the classification vectors of each layer of self-attention layer.
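The preprocessing described above (patch splitting, linear projection, position encoding, and the empty classification vector prepended to the sequence) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the patent's actual implementation; the class name `PatchEmbedding` and the sizes (224-pixel images, 16-pixel patches, 768 dimensions) are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative sketch: split the image into patches, project them to D dimensions,
    prepend an empty classification vector, and add position encoding information."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened patches (implemented as a strided convolution)
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                            # empty classification vector
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))         # position encoding

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)       # (B, K, D) picture vectors
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # classification vector in front of the first picture vector
        return x + self.pos_embed              # vector sequence carrying position information
```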
Further, the self-attention layer comprises a plurality of attention heads, and the step of calculating a final attention weight matrix of the self-attention layer after the classification vector learning of the fine-grained image according to a preset calculation rule comprises:
after the classification vector learning is carried out on the fine-grained image, in each attention head, the attention weight of the classification vector and each picture vector in the current layer is respectively calculated, and an attention weight matrix corresponding to each attention head is obtained;
and performing dot product calculation on the attention weight matrixes of all the attention heads to obtain the final attention weight matrix.
Further, the formula for calculating the attention weight is:
$$a_i^l = \operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$

where a_i^l is the attention weight between the classification vector and the i-th picture vector in the l-th attention head, Q is the query vector, K is the key vector, V is the value vector, d_k is the mapping space dimension of the attention head, and T denotes the matrix transpose. The attention weight matrix A is expressed as:

$$A = \begin{bmatrix} a_1^1 & a_2^1 & \cdots & a_K^1 \\ \vdots & \vdots & \ddots & \vdots \\ a_1^L & a_2^L & \cdots & a_K^L \end{bmatrix}$$

where l ∈ {1, 2, …, L}, i ∈ {1, 2, …, K}, L denotes the number of attention heads, and K denotes the number of picture vectors.
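As a concrete illustration of this formula, the following sketch computes the per-head attention weights of the classification vector over the K picture vectors from the query and key projections. It is a hedged sketch in PyTorch; the tensor layout (L heads, sequence length K+1, head dimension d_k) and the function name `class_token_attention` are assumptions, not part of the patent.

```python
import torch

def class_token_attention(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (L, K+1, d_k) per-head query/key vectors for the classification vector plus K picture vectors.
    Returns (L, K): a_i^l, the weight of the classification vector on picture vector i in head l."""
    d_k = q.size(-1)
    q_cls = q[:, 0:1, :]                                  # classification-vector query per head, (L, 1, d_k)
    scores = q_cls @ k.transpose(-2, -1) / d_k ** 0.5     # QK^T / sqrt(d_k), shape (L, 1, K+1)
    weights = scores.softmax(dim=-1)                      # softmax over the sequence
    return weights[:, 0, 1:]                              # drop the classification-vector column, keep the K patches
```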
Further, the step of determining the position of the classification target according to the final self-attention weight matrix includes:
calculating an average value of all attention weights in the final attention weight matrix;
comparing each attention weight in the final attention weight matrix to the average value, with attention weights greater than the average value being flagged as a first threshold and otherwise as a second threshold;
and determining the position of the classification target according to the position coding information of the target picture vector with the attention weight of the classification vector as a first threshold value.
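A hedged sketch of the localization steps above: the per-head weights are combined into the final attention weight matrix, thresholded against their average, and mapped back to the patch grid to obtain the region of the classification target. The function name `locate_target` is an assumption, and the combination across heads is implemented here as an element-wise product of the per-head weight rows, which is one plausible reading of the dot product described above.

```python
import torch

def locate_target(attn_per_head: torch.Tensor, grid: int):
    """attn_per_head: (L, K) classification-vector attention weights, one row per head.
    Returns a patch mask and a bounding box on the patch grid (assumes K = grid * grid,
    patches laid out row-major according to their position encoding)."""
    final_w = attn_per_head.prod(dim=0)             # combine the L heads into the final weights, (K,)
    # Weights above the average are marked with the first threshold (1), otherwise the second (0)
    mask = (final_w > final_w.mean()).float()
    grid_mask = mask.reshape(grid, grid)
    ys, xs = torch.nonzero(grid_mask, as_tuple=True)
    if len(ys) == 0:                                # fall back to the whole image
        return grid_mask, (0, 0, grid - 1, grid - 1)
    box = (ys.min().item(), xs.min().item(), ys.max().item(), xs.max().item())
    return grid_mask, box
```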
Further, the step of respectively performing loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer includes:
respectively carrying out cross entropy loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;
wherein the formula of the cross entropy loss calculation is as follows:
$$\mathrm{LOSS}_{CE}(y_r, y) = -\sum_{c} y^{(c)} \log\!\left(y_r^{(c)}\right)$$

where y_r is the classification label of the r-th target self-attention layer, y is the preset real label, the superscript (c) indexes the categories, and LOSS_CE(y_r, y) is the cross entropy loss value between the classification label of the r-th target self-attention layer and the preset real label; the preset number is 3, and r ∈ {1, 2, 3}.
The invention provides a fine-grained image recognition classification model training system, which comprises:
the image acquisition module is used for acquiring a fine-grained image for model training and inputting the fine-grained image into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers, and the fine-grained image sequentially passes through each self-attention layer so as to perform classification vector learning on the fine-grained image through the self-attention layers;
a vector acquisition module, configured to acquire classification vectors obtained by learning the fine-grained images through a preset number of target self-attention layers, where the target self-attention layers are located at the rear end of the multiple self-attention layers;
the loss calculation module is used for inputting the classification vector of each target self-attention layer into a preset classifier, outputting the classification label of each target self-attention layer, and performing loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;
and the progressive training module is used for updating network parameters through a back propagation mechanism respectively according to the loss value of each target self-attention layer so as to train the fine-grained image recognition classification model.
The present invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the fine-grained image recognition classification model training method described above.
The invention also provides fine-grained image recognition classification model training equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the fine-grained image recognition classification model training method when executing the program.
The invention has the following beneficial effects: by improving the conventional ViT structure and introducing a progressive training mechanism, classification vectors at different levels of the ViT structure are selected, so that attention is paid not merely to the beneficial information learned by the last attention layer but also to the importance of the classification vectors in classification. The learned information can be passed upward effectively, which helps mine the complementary information in classification vectors at different levels and use it for classification, thereby improving the accuracy of fine-grained image classification.
Drawings
FIG. 1 is a photograph of a California gull as provided in an embodiment of the present invention;
FIG. 2 is a photograph of an Arctic gull as provided in an embodiment of the present invention;
FIG. 3 is a flowchart of a fine-grained image recognition classification model training method according to a first embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an improved ViT model provided in an embodiment of the present invention;
fig. 5 is a block diagram of a fine-grained image recognition classification model training system according to a third embodiment of the present invention.
The following detailed description will further illustrate the invention in conjunction with the above-described figures.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Several embodiments of the invention are presented in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Fine-grained image classification aims at identifying sub-categories within the same parent category: for example, Mercedes-Benz and Audi belong to the same vehicle category, the blue crow and the parrot belong to the same bird category, and the Labrador Retriever and the Golden Retriever belong to the same dog category. Fine-grained image classification has attracted wide attention because of its many practical applications in face recognition, traffic vehicle recognition, intelligent retail, agricultural disease recognition, endangered animal protection, and the like. However, unlike conventional image classification, the images in a fine-grained classification training set are often discriminative only in small local regions. As shown in fig. 1 and fig. 2, fig. 1 shows a California gull and fig. 2 shows an Arctic gull. Although the two kinds of gulls belong to different species, they look very similar and are difficult for an ordinary person to distinguish with the naked eye. Moreover, gulls of the same kind can be hard to recognize as such because of differences in shooting angle, illumination, flying posture, and so on. Because the intra-class differences are large while the inter-class differences are small, fine-grained image recognition is more difficult and more challenging than conventional image classification.
The Vision Transformer (ViT, visual self-attention model) recently proposed by Google has performed impressively in the field of computer vision, and a plain ViT alone can already achieve good results in fine-grained image classification, but there is still room for improvement. Therefore, many researchers have proposed a variety of ViT-based variants with some success. However, most existing ViT-based work simply migrates ideas from convolutional neural networks and lacks reflection on the multi-head attention mechanism that is unique to the ViT structure. Most recent ViT work focuses on the picture vectors (patch tokens) and the multi-head attention mechanism, but neglects the importance of the classification vector (class token) in classification. Moreover, the existing ViT and some of its variants consider only the beneficial information learned by the last attention layer for classification, while ignoring the complementary information learned by other layers; this causes a certain loss of information and leaves the model's fine-grained classification accuracy lacking.
Based on the above, the invention aims to improve the conventional ViT structure and provide a brand-new training method for fine-grained image classification models, so that the trained fine-grained image classification model achieves better classification accuracy and the classification effect of the model is improved. The embodiments are described in detail below with reference to specific examples.
Example one
Referring to fig. 3, a fine-grained image recognition classification model training method according to a first embodiment of the present invention is shown, where the fine-grained image recognition classification model training method can be implemented by software and/or hardware, and the method includes steps S01 to S04.
Step S01, obtaining a fine-grained image for model training, inputting the fine-grained image into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers, and the fine-grained image passes through each self-attention layer in sequence so as to perform classification vector learning on the fine-grained image through the self-attention layers.
In this embodiment, the preset network model is specifically an improved ViT model. Referring to fig. 4, the improved ViT model includes multiple self-attention layers (Transformer layers), of which the last three self-attention layers are each connected to an MLP Head classification head; the three MLP Head classification heads in the drawing are labeled MLP1, MLP2, and MLP3, so that the classification vector (class token) learned by each of these self-attention layers can output a corresponding classification result through its MLP Head classification head.
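To make the structure concrete, the following is a minimal PyTorch sketch of a ViT-style backbone whose last three self-attention layers each expose their class token to a separate MLP Head classification head. The class name `MultiHeadViT`, the use of `nn.TransformerEncoderLayer` in place of the exact ViT block, and the depth and dimension values are illustrative assumptions rather than the patent's configuration.

```python
import torch.nn as nn

class MultiHeadViT(nn.Module):
    """Illustrative backbone: the class tokens of the last `num_outputs`
    self-attention layers are each classified by their own MLP Head."""
    def __init__(self, dim=768, depth=12, heads=12, num_classes=200, num_outputs=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        ])
        self.num_outputs = num_outputs
        # One MLP Head classification head per target self-attention layer (MLP1, MLP2, MLP3)
        self.mlp_heads = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))
            for _ in range(num_outputs)
        ])

    def forward(self, tokens):                     # tokens: (B, K+1, D) from the patch embedding
        cls_tokens = []
        for i, layer in enumerate(self.layers):
            tokens = layer(tokens)
            if i >= len(self.layers) - self.num_outputs:
                cls_tokens.append(tokens[:, 0])    # class token of a target self-attention layer
        return [head(cls) for head, cls in zip(self.mlp_heads, cls_tokens)]
```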
In a specific implementation, a large number of fine-grained pictures of different categories can be collected, pictures of the same category are grouped into one class, and the same real label is preset for every picture of that class. For example, a large number of Arctic gull pictures and a large number of California gull pictures are collected; the Arctic gull pictures are grouped into one class and given a real label representing the Arctic gull characteristics, and the California gull pictures are grouped into another class and given a real label representing the California gull characteristics. Then, the fine-grained pictures of the different categories are used as the training set to train the improved ViT model; during training, each fine-grained picture is input into the self-attention layers of the ViT model in sequence so that the classification vector of the fine-grained picture is learned through the self-attention layers. Preferably, in a practical implementation, the real label may be a number for each category, for example the value of the Arctic gull real label is 1 and the value of the California gull real label is 2; alternatively, the real label may be the name, a fine-grained characteristic, or other identifying information of each category.
Step S02, acquiring classification vectors obtained by learning the fine-grained images by a preset number of target self-attention layers, wherein the target self-attention layers are positioned at the rear ends of the multiple self-attention layers.
Step S03, inputting the classification vector of each target self-attention layer into a preset classifier, outputting the classification label of each target self-attention layer, and performing loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer.
In this embodiment, the predetermined classifier is an MLPHead classification header.
Step S04, updating network parameters through a back propagation mechanism respectively according to the loss value of each target self-attention layer so as to train the fine-grained image recognition classification model.
In a specific implementation, the last three self-attention layers are selected as the target self-attention layers, that is, the preset number is three. The classification vectors learned by the last three self-attention layers are classified and output through the corresponding MLP Head classification heads to obtain the classification label of each target self-attention layer; the loss value between the classification label of each target self-attention layer and the real label is then calculated to obtain the loss values of the last three self-attention layers, and the network parameters of the preceding self-attention layers are iteratively modified by using the loss value of each layer together with a back propagation mechanism, finally training a fine-grained image recognition classification model that can accurately perform fine-grained image recognition classification. Of course, in other embodiments, other numbers and/or other positions of self-attention layers may be used for classification, for example selecting the last four self-attention layers as the target self-attention layers.
Specifically, as a preferred embodiment, the step of performing loss calculation on the classification label and a preset real label of each target self-attention layer respectively to obtain a loss value of each target self-attention layer includes:
respectively carrying out cross entropy loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;
wherein the formula of the cross entropy loss calculation is as follows:
$$\mathrm{LOSS}_{CE}(y_r, y) = -\sum_{c} y^{(c)} \log\!\left(y_r^{(c)}\right)$$

where y_r is the classification label of the r-th target self-attention layer, y is the preset real label, the superscript (c) indexes the categories, and LOSS_CE(y_r, y) is the cross entropy loss value between the classification label of the r-th target self-attention layer and the preset real label; the preset number is 3, and r ∈ {1, 2, 3}.
That is, on the basis of the conventional ViT structure, the last three self-attention layers are selected and each connected to an MLP Head classification head, so that the beneficial information learned by the last three self-attention layers is used for classification. Accordingly, a progressive, step-by-step training scheme is proposed: the loss values of the last three self-attention layers are each used to update the network parameters through the back propagation mechanism, guiding the model to learn multi-layer complementary information. It should be noted that the training method in this embodiment does not simply sum the losses of different layers and back-propagate them together. Instead, each loss is back-propagated and the parameters are updated separately, which helps the different layers of the model cooperate better. In addition, the information learned by the lower layers is passed to the upper layers progressively, which facilitates model learning and convergence.
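The progressive update described here can be sketched as follows: each target layer's cross entropy loss is back-propagated and the optimizer stepped separately, rather than summing the three losses first. This is an illustrative sketch, not the patent's exact procedure; `model` is assumed to wrap the patch embedding and the backbone sketched above and to return a list of three logits, and `progressive_step` is a hypothetical helper name.

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def progressive_step(model, optimizer, images, labels):
    """One training step: a separate backward pass and parameter update per target layer,
    ordered from the lower target self-attention layer to the last one."""
    for r in range(model.num_outputs):       # r = 1, 2, 3 in the patent's notation
        optimizer.zero_grad()
        logits = model(images)[r]            # classification output of the r-th target layer
        loss_r = criterion(logits, labels)   # LOSS_CE(y_r, y)
        loss_r.backward()                    # back-propagate this layer's loss alone
        optimizer.step()                     # forward pass is recomputed per head, so each loss owns its own graph
```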
In summary, the fine-grained image recognition classification model training method in the above embodiment of the present invention improves the conventional ViT structure and introduces a progressive training mechanism. Classification vectors at different levels of the ViT structure are selected, so that attention is paid not merely to the beneficial information learned by the last attention layer but also to the importance of the classification vectors in classification. The learned information can be passed upward effectively, which helps mine the complementary information in classification vectors at different levels and use it for classification, thereby improving the accuracy of fine-grained image classification.
Example two
A second embodiment of the present invention also provides a fine-grained image recognition classification model training method, which may be implemented by software and/or hardware. Referring to fig. 4, the improved ViT model further includes a linear projection layer (Linear Projection of Flattened Patches), a position encoding layer (Position Embedding), and a multi-scale module. The fine-grained image is mapped into a high-dimensional feature space by the linear projection layer to obtain the corresponding picture vectors, which are then position-encoded by the position encoding layer and input into the subsequent multiple self-attention layers (Transformer layers). Meanwhile, on the basis of the conventional ViT structure, this embodiment adds a multi-scale module; as shown in fig. 4, the multi-scale module is located on the right side of the multiple self-attention layers, and each self-attention layer is connected to the multi-scale module. It can be understood that the multi-head attention mechanism of ViT makes it inherently pay more attention to global information, whereas in fine-grained image recognition the discriminative regions tend to be tiny local regions. Therefore, to better guide the model to learn salient region information, this embodiment proposes a multi-scale module. Its main function is to combine the attention weights of the multiple attention heads of each self-attention layer and map the result back onto the original image, thereby finding the local discriminative region corresponding to each attention layer; the corresponding local discriminative region is then cropped from the original image, and the cropped region image of each layer is re-input into the model for training. This helps the model find discriminative local regions on the basis of the global information it has learned, thereby realizing complementary exchange between global and local information and improving the fine-grained image classification effect. Specifically, in this embodiment, the fine-grained image recognition classification model training method includes steps S11 to S16.
Step S11, segmenting the fine-grained image into preset sub-images according to a preset segmentation rule, and mapping each sub-image to a high-dimensional feature space through the linear projection layer to obtain a picture vector of each sub-image.
In a specific implementation, each fine-grained image may be divided equally into a preset number of sub-images according to a preset division size; for example, the fine-grained image may be divided into 9 sub-images on a 3 × 3 grid, and each sub-image corresponds to one picture vector.
Step S12, coding the picture vectors of each sub-image through the position coding layer to add position coding information to each picture vector, and adding an empty classification vector in front of the first picture vector to obtain a vector sequence.
In some alternative embodiments, the position coding information may be position coordinate information of the sub-image in the whole fine-grained image, and since the picture segmentation rule is known, the position coordinate information of each sub-image in the whole fine-grained image is also known. Or in other alternative embodiments, the position coding information may be the number of the sub-images, and specifically, each sub-image may be numbered according to the sequence of the segmentation.
Step S13, inputting the vector sequence into the multiple layers of self-attention layers for classification vector learning, wherein the classification features learned by each layer of self-attention layers are updated in the classification vectors of the vector sequence, so as to obtain the classification vector of each layer of self-attention layers.
Step S14, selecting the last three self-attention layers as target self-attention layers, and acquiring a classification vector obtained by each target self-attention layer through learning the fine-grained image.
Step S15, inputting the classification vector of each target self-attention layer into a preset classifier, outputting the classification label of each target self-attention layer, and performing cross entropy loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer.
Step S16, updating network parameters through a back propagation mechanism respectively according to the loss value of each target self-attention layer so as to train the fine-grained image recognition classification model.
In addition, while performing the above model training, the fine-grained image recognition classification model training method of this embodiment further includes:
calculating a final attention weight matrix of the self-attention layer after classification vector learning is carried out on the fine-grained image according to a preset calculation rule;
determining the position of a classification target according to the final self-attention weight matrix, and intercepting a classification target area image from the fine-grained image according to the position of the classification target;
and scaling the classified target area image to be the same as the fine-grained image in size, and inputting the image into the preset network model for training so as to intensively train the fine-grained image recognition classification model.
It should be understood that the above-mentioned intensive training is part of the model training corresponding to the multi-scale module, which can be performed simultaneously during the above-mentioned progressive training process.
The step of calculating a final attention weight matrix of the self-attention layer after the classification vector learning of the fine-grained image according to a preset calculation rule specifically includes:
after the classification vector learning is carried out on the fine-grained image, in each attention head, the attention weight of the classification vector and each picture vector in the current layer is respectively calculated, and an attention weight matrix corresponding to each attention head is obtained;
and performing dot multiplication (i.e., matrix multiplication) on the attention weight matrices of all the attention heads to obtain the final attention weight matrix. The calculation formula of the attention weight is:

$$a_i^l = \operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$

where a_i^l is the attention weight between the classification vector and the i-th picture vector in the l-th attention head, Q is the query vector, K is the key vector, V is the value vector, d_k is the mapping space dimension of the attention head, and T denotes the matrix transpose. In a specific implementation, the classification vector and each picture vector are each projected into three parts: a query vector Q, a key vector K, and a value vector V. The degree of relation between the classification vector and a picture vector is then calculated from their corresponding query, key, and value vectors to obtain the attention weight.

The attention weight matrix A is expressed as:

$$A = \begin{bmatrix} a_1^1 & a_2^1 & \cdots & a_K^1 \\ \vdots & \vdots & \ddots & \vdots \\ a_1^L & a_2^L & \cdots & a_K^L \end{bmatrix}$$

where l ∈ {1, 2, …, L}, i ∈ {1, 2, …, K}, L denotes the number of attention heads, and K denotes the number of picture vectors.
Based on this, the step of determining the position of the classification target according to the final self-attention weight matrix specifically includes:
calculating an average value of all attention weights in the final attention weight matrix;
comparing each attention weight in the final attention weight matrix with the average value, wherein the attention weight larger than the average value is marked as a first threshold, otherwise, the attention weight is marked as a second threshold, and in particular implementation, the first threshold can be set to be 1, and the second threshold can be set to be 0;
and determining the position of the classification target according to the position coding information of the target picture vector with the attention weight of the classification vector as a first threshold.
It should be noted that, because the classification vector is obtained by the self-attention layers performing classification learning over the whole fine-grained image, it pays more attention to the region where the classification target (e.g., an Arctic gull) is located in the image. The target picture vectors whose attention weight with the classification vector equals the first threshold are therefore necessarily pictures close to, or belonging to, the location of the classification target. Consequently, the position of the classification target can be determined from the position encoding information of those target picture vectors and mapped back to the original image to crop out the classification target region image, which is then used for reinforced training, so that the model finds discriminative local regions on the basis of the global information it has learned. This realizes complementary exchange between global and local information and further improves the fine-grained image classification effect.
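A hedged sketch of this multi-scale reinforcement step: the bounding box found on the patch grid is mapped back to pixel coordinates, the classification target region is cropped and rescaled to the original input size, and the crop is fed through the same training step. The names `locate_target`, `progressive_step`, and `patch_size` are the assumed names from the earlier sketches, and the sketch handles a single image for clarity.

```python
import torch.nn.functional as F

def multiscale_step(model, optimizer, image, label, attn_per_head, patch_size=16):
    """Reinforced training on the cropped discriminative region of a single image.
    image: (1, 3, H, W); attn_per_head: (L, K) classification-vector attention weights."""
    grid = image.size(-1) // patch_size
    _, (y0, x0, y1, x1) = locate_target(attn_per_head, grid)
    # Map patch-grid coordinates back to pixel coordinates on the original image
    top, left = y0 * patch_size, x0 * patch_size
    bottom, right = (y1 + 1) * patch_size, (x1 + 1) * patch_size
    crop = image[:, :, top:bottom, left:right]
    # Scale the classification target region image back to the original input size
    crop = F.interpolate(crop, size=image.shape[-2:], mode="bilinear", align_corners=False)
    # Re-input the cropped region for training (same progressive update as before)
    progressive_step(model, optimizer, crop, label)
```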
Compared with the conventional ViT structure and other ViT variants, the model provided by this embodiment has at least the following advantages and can effectively improve the performance and accuracy of fine-grained image classification tasks. Specifically:
1) The embodiment provides a training method of a fine-grained image recognition classification model, which can perform end-to-end training and can perform training only by using picture-level labels;
2) The conventional ViT structure is improved, progressive training is introduced, and classification vectors of different levels in the ViT structure are selected, so that learned information can be well transmitted upwards, and the complementary information in the classification vectors of different levels can be mined and used for classification;
3) The embodiment provides the multi-scale module, which helps the model to learn the global information and find the discriminant local area, so that the complementary communication of the global information and the local information is realized, and the fine-grained image classification effect is improved.
Example three
Another aspect of the present invention further provides a fine-grained image recognition classification model training system, referring to fig. 5, which is a fine-grained image recognition classification model training system according to a third embodiment of the present invention, and the fine-grained image recognition classification model training system includes:
the image acquisition module 11 is configured to acquire a fine-grained image for model training, and input the fine-grained image into a preset network model for training, where the preset network model includes multiple self-attention layers, and the fine-grained image sequentially passes through each self-attention layer to perform classification vector learning on the fine-grained image through the self-attention layers;
a vector obtaining module 12, configured to obtain classification vectors obtained by learning the fine-grained images through a preset number of target self-attention layers, where the target self-attention layers are located at the rear end of the multiple self-attention layers;
the loss calculation module 13 is configured to input the classification vector of each target self-attention layer into a preset classifier, output a classification label of each target self-attention layer, and perform loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;
and the progressive training module 14 is configured to update network parameters through a back propagation mechanism according to the loss value of each target self-attention layer, so as to train the fine-grained image recognition classification model.
Further, in some optional embodiments of the invention, the system further comprises:
the multi-scale training module is used for calculating a final attention weight matrix of the self-attention layer after classification vector learning is carried out on the fine-grained image according to a preset calculation rule; determining the position of a classification target according to the final self-attention weight matrix, and intercepting a classification target area image from the fine-grained image according to the position of the classification target; and scaling the classified target area image to be the same as the fine-grained image in size, and inputting the image into the preset network model for training so as to intensively train the fine-grained image recognition classification model.
Further, in some optional embodiments of the present invention, the preset network model further includes a linear projection layer and a position coding layer, and the image obtaining module 11 is further configured to segment the fine-grained image into preset sub-images according to a preset segmentation rule, and map each sub-image to a high-dimensional feature space through the linear projection layer, so as to obtain a picture vector of each sub-image; coding the picture vectors of each sub-picture through the position coding layer to add position coding information to each picture vector, and adding an empty classification vector in front of the first picture vector to obtain a vector sequence; and inputting the vector sequence into the multi-layer self-attention layer for classification vector learning, wherein the classification features learned by each layer of self-attention layer are updated in the classification vectors of the vector sequence to obtain the classification vectors of each layer of self-attention layer.
Further, in some optional embodiments of the present invention, the self-attention layer includes a plurality of attention heads, and the multi-scale training module is further configured to, after performing classification vector learning on the fine-grained image, respectively calculate attention weights of a classification vector and each picture vector in the self-attention layer in each of the attention heads, and obtain an attention weight matrix corresponding to each of the attention heads; and performing dot product calculation on the attention weight matrixes of all the attention heads to obtain the final attention weight matrix.
Wherein, the calculation formula of the attention weight is as follows:
$$a_i^l = \operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$

where a_i^l is the attention weight between the classification vector and the i-th picture vector in the l-th attention head, Q is the query vector, K is the key vector, V is the value vector, d_k is the mapping space dimension of the attention head, and T denotes the matrix transpose. The attention weight matrix A is expressed as:

$$A = \begin{bmatrix} a_1^1 & a_2^1 & \cdots & a_K^1 \\ \vdots & \vdots & \ddots & \vdots \\ a_1^L & a_2^L & \cdots & a_K^L \end{bmatrix}$$

where l ∈ {1, 2, …, L}, i ∈ {1, 2, …, K}, L denotes the number of attention heads, and K denotes the number of picture vectors.
Further, in some optional embodiments of the present invention, the multi-scale training module is further configured to calculate an average value of all attention weights in the final attention weight matrix; comparing each attention weight in the final attention weight matrix with the average value, wherein the attention weight larger than the average value is marked as a first threshold value, and otherwise, the attention weight is marked as a second threshold value; and determining the position of the classification target according to the position coding information of the target picture vector with the attention weight of the classification vector as a first threshold.
Further, in some optional embodiments of the present invention, the loss calculating module 13 is further configured to perform cross entropy loss calculation on the classification label of each target self-attention layer and a preset real label, respectively, to obtain a loss value of each target self-attention layer;
wherein the formula of the cross entropy loss calculation is as follows:
$$\mathrm{LOSS}_{CE}(y_r, y) = -\sum_{c} y^{(c)} \log\!\left(y_r^{(c)}\right)$$

where y_r is the classification label of the r-th target self-attention layer, y is the preset real label, the superscript (c) indexes the categories, and LOSS_CE(y_r, y) is the cross entropy loss value between the classification label of the r-th target self-attention layer and the preset real label; the preset number is 3, and r ∈ {1, 2, 3}.
The present invention also provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the fine-grained image recognition classification model training method described above.
The invention further provides a fine-grained image recognition classification model training device, which comprises a processor, a memory and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to implement the fine-grained image recognition classification model training method.
The fine-grained image recognition classification model training equipment can be a computer, a server, a camera device and the like. The processor may be, in some embodiments, a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip for executing program code stored in memory or processing data, such as executing access restriction programs.
Wherein the memory includes at least one type of readable storage medium including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory may be an internal storage unit of the fine-grained image recognition classification model training apparatus in some embodiments, for example, a hard disk of the fine-grained image recognition classification model training apparatus. The memory may also be an external storage device of the fine-grained image recognition and classification model training device in other embodiments, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) Card, a flash Card (FlashCard), and the like, which are equipped on the fine-grained image recognition and classification model training device. Further, the memory may also include both an internal storage unit and an external storage device of the fine-grained image recognition classification model training apparatus. The memory can be used for storing application software installed in the fine-grained image recognition classification model training equipment and various types of data, and can also be used for temporarily storing data which is output or is to be output.
Those of skill in the art will understand that the logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be viewed as implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for training a fine-grained image recognition classification model is characterized by comprising the following steps:
obtaining a fine-grained image for model training, and inputting the fine-grained image into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers, and the fine-grained image sequentially passes through each self-attention layer so as to perform classification vector learning on the fine-grained image through the self-attention layers;
acquiring classification vectors obtained by learning the fine-grained images by a preset number of target self-attention layers, wherein the target self-attention layers are positioned at the rear ends of the multiple self-attention layers;
inputting the classification vector of each target self-attention layer into a preset classifier, outputting a classification label of each target self-attention layer, and performing loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;
and updating network parameters through a back propagation mechanism according to the loss value of each target self-attention layer so as to train the fine-grained image recognition classification model.
2. The fine-grained image recognition classification model training method according to claim 1, further comprising:
calculating a final attention weight matrix of the self-attention layer after classification vector learning is carried out on the fine-grained image according to a preset calculation rule;
determining the position of a classification target according to the final self-attention weight matrix, and intercepting a classification target area image from the fine-grained image according to the position of the classification target;
and scaling the classified target area image to be the same as the fine-grained image in size, and inputting the image into the preset network model for training so as to intensively train the fine-grained image recognition classification model.
3. The fine-grained image recognition classification model training method according to claim 2, wherein the preset network model further comprises a linear projection layer and a position coding layer, and the step of inputting the fine-grained image into the preset network model for training comprises:
dividing the fine-grained image into preset sub-images according to a preset division rule, and mapping each sub-image to a high-dimensional feature space through the linear projection layer to obtain a picture vector of each sub-image;
coding the picture vectors of each sub-picture through the position coding layer to add position coding information to each picture vector, and adding an empty classification vector in front of the first picture vector to obtain a vector sequence;
and inputting the vector sequence into the multi-layer self-attention layer for classification vector learning, wherein the classification features learned by each layer of self-attention layer are updated in the classification vectors of the vector sequence to obtain the classification vectors of each layer of self-attention layer.
4. The fine-grained image recognition classification model training method according to claim 3, wherein the self-attention layer comprises a plurality of attention heads, and the step of calculating a final attention weight matrix of the self-attention layer after classification vector learning on the fine-grained image according to a preset calculation rule comprises the steps of:
after the classification vector learning is carried out on the fine-grained image, in each attention head, the attention weight of the classification vector and each picture vector in the current layer is respectively calculated, and an attention weight matrix corresponding to each attention head is obtained;
and performing dot product calculation on the attention weight matrixes of all the attention heads to obtain the final attention weight matrix.
5. The fine-grained image recognition classification model training method according to claim 4, wherein the calculation formula of the attention weight is as follows:
$$a_i^l = \operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$

where a_i^l is the attention weight between the classification vector and the i-th picture vector in the l-th attention head, Q is the query vector, K is the key vector, V is the value vector, d_k is the mapping space dimension of the attention head, and T denotes the matrix transpose; wherein the attention weight matrix A is expressed as:

$$A = \begin{bmatrix} a_1^1 & a_2^1 & \cdots & a_K^1 \\ \vdots & \vdots & \ddots & \vdots \\ a_1^L & a_2^L & \cdots & a_K^L \end{bmatrix}$$

where l ∈ {1, 2, …, L}, i ∈ {1, 2, …, K}, L denotes the number of attention heads, and K denotes the number of picture vectors.
6. The fine-grained image recognition classification model training method according to claim 3, wherein the step of determining the position of the classification target according to the final self-attention weight matrix comprises the steps of:
calculating an average value of all attention weights in the final attention weight matrix;
comparing each attention weight in the final attention weight matrix with the average value, marking the attention weights greater than the average value as a first threshold value, and otherwise marking them as a second threshold value;
and determining the position of the classification target according to the position coding information of the target picture vectors whose attention weights with the classification vector are marked as the first threshold value.
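A minimal sketch of the mean-thresholding in claim 6, assuming the final attention weights of the classification vector are available as a flat tensor; the values 1 and 0 merely stand in for the unspecified first and second threshold values.

```python
import torch

def locate_target_positions(final_attn: torch.Tensor):
    """final_attn: (K,) final attention weights of the classification vector.
    Marks weights above the mean with a first threshold value (1) and the rest
    with a second threshold value (0); the indices of the marked picture
    vectors, via their position codes, give the target location."""
    avg = final_attn.mean()
    marks = torch.where(final_attn > avg,
                        torch.ones_like(final_attn),      # first threshold value
                        torch.zeros_like(final_attn))     # second threshold value
    return marks, torch.nonzero(marks).squeeze(-1)        # marked map and target indices
```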
7. The fine-grained image recognition classification model training method according to claim 1, wherein the step of performing loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer comprises the steps of:
respectively carrying out cross entropy loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;
wherein the formula of the cross entropy loss calculation is as follows:
$$\mathrm{LOSS}_{CE}\!\left(y_r,\, y\right) = -\sum y \log\left(y_r\right)$$
where $y_r$ is the classification label of the $r$-th target self-attention layer, $y$ is the preset real label, and $\mathrm{LOSS}_{CE}(y_r, y)$ is the cross entropy loss value between the classification label of the $r$-th target self-attention layer and the preset real label; the preset number is 3, and $r \in \{1, 2, 3\}$.
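As a sketch of the per-layer loss calculation, assuming the preset number of target self-attention layers is 3 as stated above; `cls_logits` is a hypothetical list of classifier outputs, one per target layer.

```python
import torch
import torch.nn.functional as F

def target_layer_losses(cls_logits, target):
    """cls_logits: list of 3 tensors of shape (B, num_classes), the preset
    classifier's output for each target self-attention layer; target: (B,)
    real labels. Returns one cross-entropy loss per target layer (r = 1, 2, 3)."""
    return [F.cross_entropy(logits, target) for logits in cls_logits]
```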
8. a system for training a fine-grained image recognition classification model, the system comprising:
the image acquisition module is used for acquiring a fine-grained image for model training and inputting the fine-grained image into a preset network model for training, wherein the preset network model comprises a plurality of self-attention layers, and the fine-grained image sequentially passes through each self-attention layer so as to perform classification vector learning on the fine-grained image through the self-attention layers;
a vector acquisition module, configured to acquire classification vectors obtained by learning the fine-grained images through a preset number of target self-attention layers, where the target self-attention layers are located at the rear end of the multiple self-attention layers;
the loss calculation module is used for inputting the classification vector of each target self-attention layer into a preset classifier, outputting the classification label of each target self-attention layer, and performing loss calculation on the classification label of each target self-attention layer and a preset real label to obtain a loss value of each target self-attention layer;
and the progressive training module is used for updating network parameters through a back propagation mechanism respectively according to the loss value of each target self-attention layer so as to train the fine-grained image recognition classification model.
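The progressive training module could be sketched as below, under the assumption that the network returns one classification vector per target self-attention layer and that a separate preset classifier head exists for each; all interface names here are hypothetical, and the optimizer is assumed to cover both the network and the classifier parameters.

```python
import torch
import torch.nn.functional as F

def progressive_training_step(model, classifiers, optimizer, images, labels):
    """One progressive training step: the classification vector of each target
    self-attention layer is classified separately, and each loss value updates
    the network parameters through its own back-propagation pass.
    Assumed interface: model(images) -> list of per-layer classification vectors."""
    for r, head in enumerate(classifiers):            # one head per target layer
        cls_vectors = model(images)                   # re-run so each backward pass has a fresh graph
        logits = head(cls_vectors[r])                 # preset classifier of target layer r
        loss = F.cross_entropy(logits, labels)        # loss against the preset real labels
        optimizer.zero_grad()
        loss.backward()                               # back-propagation for this layer's loss only
        optimizer.step()                              # update network parameters
```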
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a fine-grained image recognition classification model training method according to any one of claims 1 to 7.
10. A fine-grained image recognition classification model training apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the fine-grained image recognition classification model training method according to any one of claims 1 to 7 when executing the program.
CN202310140142.7A 2023-02-21 2023-02-21 Fine-granularity image recognition classification model training method, device and equipment Active CN115830402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310140142.7A CN115830402B (en) 2023-02-21 2023-02-21 Fine-granularity image recognition classification model training method, device and equipment


Publications (2)

Publication Number Publication Date
CN115830402A true CN115830402A (en) 2023-03-21
CN115830402B CN115830402B (en) 2023-09-12

Family

ID=85521972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310140142.7A Active CN115830402B (en) 2023-02-21 2023-02-21 Fine-granularity image recognition classification model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN115830402B (en)



Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325205A (en) * 2020-03-02 2020-06-23 北京三快在线科技有限公司 Document image direction recognition method and device and model training method and device
CN112487229A (en) * 2020-11-27 2021-03-12 北京邮电大学 Fine-grained image classification method and system and prediction model training method
CN114821146A (en) * 2021-01-27 2022-07-29 四川大学 Enhanced weak supervision-based fine-grained Alzheimer's disease classification method
CN113761936A (en) * 2021-08-19 2021-12-07 哈尔滨工业大学(威海) Multi-task chapter-level event extraction method based on multi-head self-attention mechanism
CN114049303A (en) * 2021-10-12 2022-02-15 杭州电子科技大学 Progressive bone age assessment method based on multi-granularity feature fusion
CN114119979A (en) * 2021-12-06 2022-03-01 西安电子科技大学 Fine-grained image classification method based on segmentation mask and self-attention neural network
CN114549970A (en) * 2022-01-13 2022-05-27 山东师范大学 Night small target fruit detection method and system fusing global fine-grained information
CN114386582A (en) * 2022-01-17 2022-04-22 大连理工大学 Human body action prediction method based on confrontation training attention mechanism
CN114565752A (en) * 2022-02-10 2022-05-31 北京交通大学 Image weak supervision target detection method based on class-agnostic foreground mining
CN114580510A (en) * 2022-02-23 2022-06-03 华南理工大学 Bone marrow cell fine-grained classification method, system, computer device and storage medium
CN114564953A (en) * 2022-02-28 2022-05-31 中山大学 Emotion target extraction model based on multiple word embedding fusion and attention mechanism
CN114332544A (en) * 2022-03-14 2022-04-12 之江实验室 Image block scoring-based fine-grained image classification method and device
CN115034496A (en) * 2022-06-27 2022-09-09 北京交通大学 Urban rail transit holiday short-term passenger flow prediction method based on GCN-Transformer
CN115294265A (en) * 2022-06-27 2022-11-04 北京大学深圳研究生院 Method and system for reconstructing three-dimensional human body grid by utilizing two-dimensional human body posture based on attention of graph skeleton
CN115035389A (en) * 2022-08-10 2022-09-09 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115204241A (en) * 2022-08-16 2022-10-18 北京航空航天大学 Deep learning tiny fault diagnosis method and system considering fault time positioning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUPRAGYA SONKAR ET AL: "Multi-Head Attention on Image Captioning Model with Bert Embedding", 《2021 INTERNATIONAL CONFERENCE ON COMMUNICATION, CONTROL AND INFORMATION SCIENCES (ICCISC)》, pages 124 - 130 *
Zhu Chenguang (author) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071939A (en) * 2023-03-24 2023-05-05 华东交通大学 Traffic signal control model building method and control method
CN116109629A (en) * 2023-04-10 2023-05-12 厦门微图软件科技有限公司 Defect classification method based on fine granularity recognition and attention mechanism
CN116109629B (en) * 2023-04-10 2023-07-25 厦门微图软件科技有限公司 Defect classification method based on fine granularity recognition and attention mechanism
CN116608866A (en) * 2023-07-20 2023-08-18 华南理工大学 Picture navigation method, device and medium based on multi-scale fine granularity feature fusion
CN116608866B (en) * 2023-07-20 2023-09-26 华南理工大学 Picture navigation method, device and medium based on multi-scale fine granularity feature fusion
CN117326557A (en) * 2023-09-28 2024-01-02 连云港市沃鑫高新材料有限公司 Preparation method of silicon carbide high-purity micro powder for reaction sintering ceramic structural part

Also Published As

Publication number Publication date
CN115830402B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
CN115830402A (en) Fine-grained image recognition classification model training method, device and equipment
Huang et al. Point cloud labeling using 3d convolutional neural network
US10410353B2 (en) Multi-label semantic boundary detection system
US8340363B2 (en) System and method for efficient interpretation of images in terms of objects and their parts
Dai et al. Learning to localize detected objects
CN113822209B (en) Hyperspectral image recognition method and device, electronic equipment and readable storage medium
CN114332544B (en) Image block scoring-based fine-grained image classification method and device
CN109740479A Vehicle re-identification method, device, equipment and readable storage medium
CN110210625A (en) Modeling method, device, computer equipment and storage medium based on transfer learning
CN108491848A Image saliency detection method and device based on depth information
CN112801236B (en) Image recognition model migration method, device, equipment and storage medium
CN112052845A (en) Image recognition method, device, equipment and storage medium
CN109522970A (en) Image classification method, apparatus and system
CN110222772B (en) Medical image annotation recommendation method based on block-level active learning
CN117893839B (en) Multi-label classification method and system based on graph attention mechanism
CN109271842A Generic object detection method, system, terminal and storage medium based on key point regression
CN113706551A (en) Image segmentation method, device, equipment and storage medium
CN108509828A (en) A kind of face identification method and face identification device
CN111144466A Image sample self-adaptive deep metric learning method
CN113780066B (en) Pedestrian re-recognition method and device, electronic equipment and readable storage medium
Liu et al. L2-LiteSeg: A Real-Time Semantic Segmentation Method for End-to-End Autonomous Driving
Rubino et al. Semantic multi-body motion segmentation
CN114005005B Dual batch-normalization zero-shot image classification method
Corso Discriminative modeling by boosting on multilevel aggregates
CN117152546B (en) Remote sensing scene classification method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant