CN116797847A - Enhanced complementary fine-grained image classification network system - Google Patents

Enhanced complementary fine-grained image classification network system

Info

Publication number
CN116797847A
Authority
CN
China
Prior art keywords
network
area
complementary
reinforced
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310842868.5A
Other languages
Chinese (zh)
Inventor
胡静 (Hu Jing)
王芳 (Wang Fang)
王梦瑶 (Wang Mengyao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Science and Technology
Original Assignee
Taiyuan University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Science and Technology filed Critical Taiyuan University of Science and Technology
Priority to CN202310842868.5A
Publication of CN116797847A

Landscapes

  • Image Analysis (AREA)

Abstract

An enhanced complementary fine-grained image classification network system, the classification network system comprising: a reinforced complementary learning network structure, in which a main network extracts features and drives two sub-networks to perform reinforced learning and complementary learning respectively, so that target features are extracted jointly and the target object is identified in a detailed and comprehensive manner; a DM driving module, which crops and amplifies the region with the greatest influence on the result, deletes that region from the original image, and sends the results into the reinforced complementary learning network structure, helping the model carry out end-to-end training; a DM loss function, which enables the DM driving module to locate key regions, continuously optimizes the position information of the reinforced complementary learning network structure, and provides accurate mask positions; and a verification data set, used to verify the performance of the reinforced complementary learning network structure.

Description

Enhanced complementary fine-grained image classification network system
Technical field:
The present application relates to an enhanced complementary fine-grained image classification network system.
Background art:
Image analysis uses mathematical models combined with image processing techniques to analyze low-level features and high-level structures and to extract information with a degree of intelligence. Image analysis focuses on constructing descriptions of images: it is more concerned with symbolizing images and reasoning over them with relevant knowledge than with computing on the images themselves. Image analysis is also closely related to research on human vision, and studying identifiable modules of the human visual mechanism may help improve computer vision capabilities.
Fine-grained images are an important component of image analysis. Because the objects to be identified are subclasses of the same category, the differences between subclasses are very subtle and concentrated in a few local regions, which makes fine-grained recognition highly challenging. When judging the target class, some fine-grained networks focus on a single region and lack the features of other auxiliary discriminative regions, so the target cannot be identified in a detailed and comprehensive manner.
Summary of the application:
The embodiment of the present application provides an enhanced complementary fine-grained image classification network system with a reasonable structural design. An enhanced complementary fine-grained image classification network is added: a main network performs feature extraction and drives two sub-networks to perform reinforced learning and complementary learning respectively. The learning method of the reinforcement model is used to acquire finer fine-grained image features, while attention erasure is used to acquire the complementary discriminative regions of the target, increasing the network's overall perception of the target. The performance of the system model is evaluated through verification experiments on several public data sets, achieving detailed and comprehensive identification of the target and improving the effect of fine-grained image recognition.
The technical scheme adopted by the application for solving the technical problems is as follows:
An enhanced complementary fine-grained image classification network system, the classification network system comprising:
a reinforced complementary learning network structure, in which a main network extracts features and drives the other two sub-networks to perform reinforced learning and complementary learning respectively, so that target features are extracted jointly and the target object is identified in a detailed and comprehensive manner;
a DM driving module, which crops and amplifies the region with the greatest influence on the result, deletes that region from the original image, and sends the results into the reinforced complementary learning network structure, helping the model carry out end-to-end training;
a DM loss function, which enables the DM driving module to locate key regions, continuously optimizes the position information of the reinforced complementary learning network structure, and provides accurate mask positions;
a verification data set, used to verify the performance of the reinforced complementary learning network structure, including CUB-200-2011, Stanford Cars, and FGVC-Aircraft.
The backbone network of the reinforced complementary learning network structure is Inception-V3. The reinforced complementary learning network structure comprises a basic network, a reinforcement network, and a complementary network, so that three classification networks are constructed to aggregate the global and local characteristics of the target object; both the overall semantic information and the local semantic information of the object can be obtained. The features output by each network are then subjected to global average pooling, the pooled features are concatenated into a 6144-dimensional vector, a 200-dimensional classification layer is added to this vector for end-to-end training, and the classification result is finally obtained through Softmax.
The DM driving module crops and amplifies the region with the greatest influence on the result and sends it into the reinforcement network; the DM driving module also deletes the region with the greatest influence on the result from the original image and sends the result to the complementary network.
The DM driving module receives the feature map obtained after training the basic network, then generates a square region centered at (x, y) with half side length l, crops and amplifies this region, and sends it into the reinforcement network; an image mask is generated from the region and input into the complementary network for complementary learning.
The DM driving module consists of two fully connected layers; its input is a feature map and its output is the local region most important to the neural network, so the most important local region can be located automatically through the fully connected layers.
Given an image X, it is input into the trained convolution layers for feature extraction, where Tn denotes the overall network parameters. The whole process can be described as convolving, pooling, and activating X, finally generating a probability distribution p:
p(X) = f(Tn * X)
where f(·) denotes the fully connected layer, which converts the features extracted by the convolutional neural network into feature vectors and uses softmax to convert the vectors into probability values.
The initialization parameter calculation formula of the DM driving module is:
F = ∑_{n=1}^{d} F_n
where F_n denotes the n-th feature map output by the last layer of the convolutional neural network, d denotes the total number of feature maps, and F is the total feature map obtained by adding the feature maps together;
the average value comparison formula of the DM driving module is:
F̄ = (1/(h·w)) ∑_{i=1}^{h} ∑_{j=1}^{w} F_{i,j}
where h and w denote the height and width of the feature map respectively, and F̄ denotes the mean value of the feature map;
the initialization coordinates of the bounding box center are generated by comparing F̄ with F_{i,j}. After the initialization coordinates are obtained, the model automatically optimizes them during training; the region is then cropped and amplified to obtain a finer local region, which is sent into the reinforcement network for learning. The upper-left and lower-right corner coordinates of the local region are obtained from the center coordinates and the side length, the upper-left corner being denoted (t_lx, t_rx) and the lower-right corner (t_ly, t_ry), calculated as:
t_lx = x − l, t_rx = y − l
t_ly = x + l, t_ry = y + l.
The cropping operation can be seen as a multiplication between the original image and the mask, expressed as:
X_crop = X ⊙ M(·)
where X_crop is the cropped region, ⊙ denotes the cropping operation between the original image and the template, and M(·) is an attention mask whose expression is:
M(·) = [μ(i − t_lx) − μ(i − t_ly)] × [μ(j − t_rx) − μ(j − t_ry)]
where (i, j) is any point in the feature map; M(·) takes the value 1 if (i, j) lies inside the local region and 0 otherwise. μ(·) is a continuous, differentiable function whose expression is:
μ(x) = 1 / (1 + exp(−kx))
where k is a scaling factor.
The size of the extracted local region is enlarged using a bilinear interpolation algorithm, and the enlarged local region is obtained according to the ratio of the original image to the local region:
λ = X_a / X_p, X_local = Bilinear(X_crop, λ)
where X_p and X_a denote the areas of the local region and the whole image respectively, λ is the area ratio, and X_local is the enlarged local region.
The DM loss function is:
L_s = −(1/m) ∑_{i=1}^{m} log( exp(W_{y_i}ᵀ x_i + b_{y_i}) / ∑_{j=1}^{s} exp(W_jᵀ x_i + b_j) )
where m denotes the batch size, W denotes the weights of the fully connected layer, y_i denotes the category of the i-th picture, x_i denotes the feature vector of the i-th picture before the fully connected layer, b denotes the network bias, and s is the number of target categories. The loss continuously optimizes the position information of the reinforcement network while providing more accurate mask positions so that the complementary network can learn secondary features.
The verification method for the verification data set comprises the following steps:
S1, train the feature extraction network using the weights of the backbone network Inception-V3 pre-trained on ImageNet, retaining the parameters of the pooling, input, and convolution layers while removing the existing fully connected and softmax layers, and fine-tune the network on the training data used in this process;
S2, calculate the key region through the reinforced complementary learning network structure, find the coordinate information of the most critical region, and crop and amplify it to generate a finer training result.
According to the present application, the reinforced complementary learning network structure performs feature extraction with the main network and drives the other two sub-networks to perform reinforced learning and complementary learning respectively, so that target features are extracted jointly and the target object is identified in a detailed and comprehensive manner. The DM driving module crops and amplifies the region with the greatest influence on the result, deletes that region from the original image, and sends the results into the reinforced complementary learning network structure, helping the model carry out end-to-end training. The DM loss function enables the DM driving module to locate key regions, continuously optimizes the position information of the reinforced complementary learning network structure, and provides accurate mask positions. The performance of the reinforced complementary learning network structure is verified on the verification data sets. The method has the advantages of accuracy, practicality, and excellent performance.
Description of the drawings:
fig. 1 is a schematic structural view of the present application.
Fig. 2 is a diagram showing the network configuration of the present application.
The specific embodiment is as follows:
in order to clearly illustrate the technical features of the present solution, the present application will be described in detail below with reference to the following detailed description and the accompanying drawings.
As shown in figs. 1-2, an enhanced complementary fine-grained image classification network system comprises:
a reinforced complementary learning network structure, in which a main network extracts features and drives the other two sub-networks to perform reinforced learning and complementary learning respectively, so that target features are extracted jointly and the target object is identified in a detailed and comprehensive manner;
a DM driving module, which crops and amplifies the region with the greatest influence on the result, deletes that region from the original image, and sends the results into the reinforced complementary learning network structure, helping the model carry out end-to-end training;
a DM loss function, which enables the DM driving module to locate key regions, continuously optimizes the position information of the reinforced complementary learning network structure, and provides accurate mask positions;
a verification data set, used to verify the performance of the reinforced complementary learning network structure, including CUB-200-2011, Stanford Cars, and FGVC-Aircraft.
The backbone network of the reinforced complementary learning network structure is Inception-V3. The reinforced complementary learning network structure comprises a basic network, a reinforcement network, and a complementary network, so that three classification networks are constructed to aggregate the global and local characteristics of the target object; both the overall semantic information and the local semantic information of the object can be obtained. The features output by each network are then subjected to global average pooling, the pooled features are concatenated into a 6144-dimensional vector, a 200-dimensional classification layer is added to this vector for end-to-end training, and the classification result is finally obtained through Softmax.
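For illustration only, a minimal PyTorch sketch of this aggregation head follows, assuming each of the three branches outputs a 2048-channel feature map (as Inception-V3 does); the class and variable names are illustrative and not part of the patent text:

    import torch
    import torch.nn as nn

    class AggregationHead(nn.Module):
        # Sketch: global-average-pool each branch's 2048-channel feature map,
        # concatenate to a 6144-dimensional vector, apply a 200-way
        # classification layer, and return the Softmax result.
        def __init__(self, channels=2048, num_classes=200):
            super().__init__()
            self.gap = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(3 * channels, num_classes)

        def forward(self, f_basic, f_reinforce, f_complement):
            pooled = [self.gap(f).flatten(1) for f in (f_basic, f_reinforce, f_complement)]
            fused = torch.cat(pooled, dim=1)        # 6144-dimensional vector
            return torch.softmax(self.fc(fused), dim=1)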
The DM driving module crops and amplifies the region with the greatest influence on the result and sends it into the reinforcement network; the DM driving module also deletes the region with the greatest influence on the result from the original image and sends the result to the complementary network.
The DM driving module receives the feature map obtained after training the basic network, then generates a square region centered at (x, y) with half side length l, crops and amplifies this region, and sends it into the reinforcement network; an image mask is generated from the region and input into the complementary network for complementary learning.
The DM driving module consists of two fully connected layers; its input is a feature map and its output is the local region most important to the neural network, so the most important local region can be located automatically through the fully connected layers.
Given an image X, it is input into the trained convolution layers for feature extraction, where Tn denotes the overall network parameters. The whole process can be described as convolving, pooling, and activating X, finally generating a probability distribution p:
p(X) = f(Tn * X)
where f(·) denotes the fully connected layer, which converts the features extracted by the convolutional neural network into feature vectors and uses softmax to convert the vectors into probability values.
The initialization parameter calculation formula of the DM driving module is:
F = ∑_{n=1}^{d} F_n
where F_n denotes the n-th feature map output by the last layer of the convolutional neural network, d denotes the total number of feature maps, and F is the total feature map obtained by adding the feature maps together;
the average value comparison formula of the DM driving module is:
F̄ = (1/(h·w)) ∑_{i=1}^{h} ∑_{j=1}^{w} F_{i,j}
where h and w denote the height and width of the feature map respectively, and F̄ denotes the mean value of the feature map;
the initialization coordinates of the bounding box center are generated by comparing F̄ with F_{i,j}. After the initialization coordinates are obtained, the model automatically optimizes them during training; the region is then cropped and amplified to obtain a finer local region, which is sent into the reinforcement network for learning. The upper-left and lower-right corner coordinates of the local region are obtained from the center coordinates and the side length, the upper-left corner being denoted (t_lx, t_rx) and the lower-right corner (t_ly, t_ry), calculated as:
t_lx = x − l, t_rx = y − l
t_ly = x + l, t_ry = y + l.
The cropping operation can be seen as a multiplication between the original image and the mask, expressed as:
X_crop = X ⊙ M(·)
where X_crop is the cropped region, ⊙ denotes the cropping operation between the original image and the template, and M(·) is an attention mask whose expression is:
M(·) = [μ(i − t_lx) − μ(i − t_ly)] × [μ(j − t_rx) − μ(j − t_ry)]
where (i, j) is any point in the feature map; M(·) takes the value 1 if (i, j) lies inside the local region and 0 otherwise. μ(·) is a continuous, differentiable function whose expression is:
μ(x) = 1 / (1 + exp(−kx))
where k is a scaling factor.
The size of the extracted local region is enlarged using a bilinear interpolation algorithm, and the enlarged local region is obtained according to the ratio of the original image to the local region:
λ = X_a / X_p, X_local = Bilinear(X_crop, λ)
where X_p and X_a denote the areas of the local region and the whole image respectively, λ is the area ratio, and X_local is the enlarged local region.
The DM loss function is:
L_s = −(1/m) ∑_{i=1}^{m} log( exp(W_{y_i}ᵀ x_i + b_{y_i}) / ∑_{j=1}^{s} exp(W_jᵀ x_i + b_j) )
where m denotes the batch size, W denotes the weights of the fully connected layer, y_i denotes the category of the i-th picture, x_i denotes the feature vector of the i-th picture before the fully connected layer, b denotes the network bias, and s is the number of target categories. The loss continuously optimizes the position information of the reinforcement network while providing more accurate mask positions so that the complementary network can learn secondary features.
The verification method for the verification data set comprises the following steps:
S1, train the feature extraction network using the weights of the backbone network Inception-V3 pre-trained on ImageNet, retaining the parameters of the pooling, input, and convolution layers while removing the existing fully connected and softmax layers, and fine-tune the network on the training data used in this process;
S2, calculate the key region through the reinforced complementary learning network structure, find the coordinate information of the most critical region, and crop and amplify it to generate a finer training result.
The working principle of the enhanced complementary fine-grained image classification network system in the embodiment of the present application is as follows: an enhanced complementary fine-grained image classification network is added, in which the main network performs feature extraction and drives two sub-networks to perform reinforced learning and complementary learning respectively. The learning method of the reinforcement model is used to acquire finer fine-grained image features, while the complementary network acquires the complementary discriminative regions of the target through attention erasure, increasing the network's overall perception of the target. The performance of the system model is evaluated through verification experiments on several public data sets, achieving detailed and comprehensive recognition of the target and improving the effect of fine-grained image recognition, so the system can be widely applied to classification tasks.
For a recognition network, the features of interest tend to be concentrated in one region of the target, which becomes the most prominent feature for identifying the target. The model designed in the present application can recognize the target over a larger range; it no longer depends on a single salient feature and can also achieve detailed and comprehensive identification of the target by means of secondary features.
The reinforced complementary learning network structure comprises a basic network, a reinforcement network, and a complementary network, so that three classification networks are constructed to aggregate the global and local characteristics of the target object; both the overall semantic information and the local semantic information of the object can be obtained. The features output by each network are then subjected to global average pooling, the pooled features are concatenated into a 6144-dimensional vector, a 200-dimensional classification layer is added to this vector for end-to-end training, and the classification result is finally obtained through Softmax.
Because traditional neural networks do not exploit the advantages of deep neural networks for joint localization and recognition learning, the present application provides a DM driving module to help the backbone network find, during training, the rectangular region with the greatest influence on the result. At the same time, the DM driving module is computationally very cheap, and it helps the model perform end-to-end training.
The DM driving module receives the feature map obtained after training the basic network, then generates a square region centered at (x, y) with half side length l, crops and amplifies this region, and sends it into the reinforcement network; an image mask can be generated from the region and input into the complementary network for complementary learning.
In this process, the high-response area of the feature map is the key to obtaining the coordinates (x, y). The DM driving module consists of two fully connected layers; its input is a feature map and its output is the local region most important to the neural network, so the most important local region can be located automatically through the fully connected layers. The size of the bounding box is limited: it may not exceed 2/3 of the longest side of the whole image, and it may not be smaller than 1/3 of the shortest side of the image.
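A minimal sketch of this size constraint, with an illustrative function name and assuming l is given as half of the side length:

    import torch

    def clamp_half_side(l, img_h, img_w):
        # Box side 2*l is kept within [shortest_side/3, 2*longest_side/3].
        side = torch.clamp(2 * l,
                           min=min(img_h, img_w) / 3.0,
                           max=2.0 * max(img_h, img_w) / 3.0)
        return side / 2.0   # back to half side length l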
Specifically, given an image X, it is input into the trained convolution layers for feature extraction, where Tn denotes the overall network parameters. The whole process can be described as convolving, pooling, and activating X, finally generating a probability distribution p:
p(X) = f(Tn * X)
where f(·) denotes the fully connected layer, which converts the features extracted by the convolutional neural network into feature vectors and uses softmax to convert the vectors into probability values.
The next step is to generate the position and length parameters of the square bounding box:
[x, y, l] = g(Tn * X)
where x and y are the center coordinates of the bounding box in X and l is half of the side length; g(·) denotes the DM driving module, whose structure consists of two fully connected layers. Because the weight parameters used to initialize the network have a great influence on the model, the feature maps output by the last layer of the basic network are added together; the richer the semantic information of the feature map, the more accurate the generated bounding box.
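For illustration, a sketch of g(·) as two fully connected layers follows; the hidden width and the sigmoid scaling to pixel coordinates are assumptions, not taken from the patent text:

    import torch
    import torch.nn as nn

    class DMDriver(nn.Module):
        # Two fully connected layers mapping the pooled feature map to (x, y, l).
        def __init__(self, in_channels=2048, hidden=512):
            super().__init__()
            self.fc1 = nn.Linear(in_channels, hidden)
            self.fc2 = nn.Linear(hidden, 3)

        def forward(self, feat, img_size):
            v = feat.mean(dim=(2, 3))               # pool the (B, C, H, W) feature map
            xyl = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))
            return xyl * img_size                   # (x, y, l) in pixel units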
The calculation formula of the initialization parameters of the DM driving module is:
F = ∑_{n=1}^{d} F_n
where F_n denotes the n-th feature map output by the last layer of the convolutional neural network, d denotes the total number of feature maps, and F is the total feature map obtained by adding the feature maps together.
Further, the average value comparison formula of the DM driving module is:
F̄ = (1/(h·w)) ∑_{i=1}^{h} ∑_{j=1}^{w} F_{i,j}
where h and w denote the height and width of the feature map respectively, and F̄ denotes the mean value of the feature map;
the initialization coordinates of the bounding box center are generated by comparing F̄ with F_{i,j}. After the initialization coordinates are obtained, the model automatically optimizes them during training; the region is then cropped and amplified to obtain a finer local region, which is sent into the reinforcement network for learning. The upper-left and lower-right corner coordinates of the local region are obtained from the center coordinates and the side length, the upper-left corner being denoted (t_lx, t_rx) and the lower-right corner (t_ly, t_ry), calculated as:
t_lx = x − l, t_rx = y − l
t_ly = x + l, t_ry = y + l.
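A sketch of this initialization under one assumption: the comparison with the mean is reduced to a single coordinate pair by taking the centroid of the high-response positions:

    import torch

    def init_center_and_corners(feature_maps, l):
        # feature_maps: (d, h, w) maps from the last convolutional layer.
        F = feature_maps.sum(dim=0)                  # total feature map
        ys, xs = torch.nonzero(F > F.mean(), as_tuple=True)
        x, y = xs.float().mean(), ys.float().mean()  # initialization center (assumed centroid)
        return (x - l, y - l), (x + l, y + l)        # (t_lx, t_rx), (t_ly, t_ry)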
When the corresponding coordinate information is obtained, the cropping operation can be regarded as a multiplication between the original image and the mask, expressed as:
X_crop = X ⊙ M(·)
where X_crop is the cropped region, ⊙ denotes the cropping operation between the original image and the template, and M(·) is an attention mask whose expression is:
M(·) = [μ(i − t_lx) − μ(i − t_ly)] × [μ(j − t_rx) − μ(j − t_ry)]
where (i, j) is any point in the feature map; M(·) takes the value 1 if (i, j) lies inside the local region and 0 otherwise. μ(·) is a continuous, differentiable function whose expression is:
μ(x) = 1 / (1 + exp(−kx))
where k is a scaling factor.
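A sketch of the attention mask, assuming the logistic form of μ(·) with an assumed steepness factor k:

    import torch

    def attention_mask(h, w, t_lx, t_ly, t_rx, t_ry, k=10.0):
        mu = lambda z: torch.sigmoid(k * z)          # mu(x) = 1 / (1 + exp(-k*x))
        i = torch.arange(h, dtype=torch.float32).view(h, 1)
        j = torch.arange(w, dtype=torch.float32).view(1, w)
        # ~1 inside the located square, ~0 outside
        return (mu(i - t_lx) - mu(i - t_ly)) * (mu(j - t_rx) - mu(j - t_ry))

    # The crop is then the element-wise product: X_crop = X * attention_mask(...)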
In order to crop and amplify the picture, the size of the extracted local region is enlarged using bilinear interpolation, and the enlarged local region can be obtained according to the ratio of the original image to the local region:
λ = X_a / X_p, X_local = Bilinear(X_crop, λ)
where X_p and X_a denote the areas of the local region and the whole image respectively, λ is the area ratio, and X_local is the enlarged local region.
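A sketch of the crop-and-zoom step using bilinear interpolation; the output resolution is an assumed value:

    import torch.nn.functional as F

    def crop_and_zoom(x, t_lx, t_ly, t_rx, t_ry, out_size=448):
        # x: image tensor (B, C, H, W); rows span [t_lx, t_ly), columns [t_rx, t_ry).
        region = x[:, :, int(t_lx):int(t_ly), int(t_rx):int(t_ry)]
        return F.interpolate(region, size=(out_size, out_size),
                             mode="bilinear", align_corners=False)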
Similarly, in order to train the complementary network, the generated local region is turned into a mask picture: the pixels inside the mask are set uniformly to the mean pixel value of the original image, and the rest are replaced by white pixels, with the specific formula:
X_mask(i, j) = mean(X) if M(i, j) = 1, and 255 (white) otherwise.
Then, according to the previously obtained position information, the mask is erased from the original image, and the resulting masked image is sent to the complementary model for training:
X_erase = X ⊙ (1 − M(·)) + X_mask ⊙ M(·)
In the above formulas, the values at each position of the pixel matrix formed by the original image represent different pixels; a 1 in the mask image represents a black pixel whose RGB channel values are (0, 0, 0). The computation is performed position by position between the original image and the mask image: where the mask image is black, the original image pixel is kept directly, and elsewhere the mask image pixel replaces the original image pixel, yielding the image with the key region erased.
For the DM loss function: since a suitable loss function has a positive effect on model training, and the loss function commonly used for fine-grained image recognition is the softmax loss, the specific formula is:
L_s = −(1/m) ∑_{i=1}^{m} log( exp(W_{y_i}ᵀ x_i + b_{y_i}) / ∑_{j=1}^{s} exp(W_jᵀ x_i + b_j) )
where m denotes the batch size, W denotes the weights of the fully connected layer, y_i denotes the category of the i-th picture, x_i denotes the feature vector of the i-th picture before the fully connected layer, b denotes the network bias, and s is the number of target categories. The loss continuously optimizes the position information of the reinforcement network while providing more accurate mask positions so that the complementary network can learn secondary features.
To help the reinforced complementary learning network structure find more accurate features, the probability value p_k output by the trunk model for a sample is compared with the probability value p_{k+1} generated by the reinforcement model: when p_k < p_{k+1}, no loss is generated; when p_k > p_{k+1}, a ranking loss L_rank is generated. This loss function therefore helps the reinforcement network find more accurate features, and once accurate features are extracted it in turn helps the backbone network localize more precisely, so the two networks reinforce each other.
Meanwhile, in the complementary model, because the features extracted by the trunk are deleted, the features extracted by the trunk model have no connection with the complementary model; rather, the trunk model provides accurate local regions that help the complementary model learn secondary features, so only the trunk model and the reinforcement model can locate the key local regions. From the above, the total loss of the model is:
L_total = L_s + β·L_rank
where β is the modulation factor that balances the two loss functions.
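A sketch of the two losses, assuming a pairwise ranking form with a small margin for the comparison between p_k and p_{k+1}; the margin and the modulation factor value are assumptions:

    import torch
    import torch.nn.functional as F

    def rank_loss(p_trunk, p_reinforce, margin=0.05):
        # Loss arises only when the reinforcement model is not more confident
        # than the trunk model on the true class.
        return torch.clamp(p_trunk - p_reinforce + margin, min=0.0).mean()

    def total_loss(logits, targets, p_trunk, p_reinforce, beta=1.0):
        # L_total = L_s + beta * L_rank (beta: modulation factor, assumed 1.0)
        return F.cross_entropy(logits, targets) + beta * rank_loss(p_trunk, p_reinforce)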
The verification data sets in the present application mainly comprise CUB-200-2011, Stanford Cars, and FGVC-Aircraft. In order to improve the model's attention to secondary features, a small amount of data may require multiple key-region erasures before all features are obtained, so erasure experiments need to be carried out on the data sets used to find a suitable number of erasures.
The verification method for the verification data set comprises the following steps: train the feature extraction network using the weights of the backbone network Inception-V3 pre-trained on ImageNet, retaining the parameters of the pooling, input, and convolution layers while removing the existing fully connected and softmax layers, and fine-tune the network on the training data used in this process; then calculate the key region through the reinforced complementary learning network structure, find the coordinate information of the most critical region, and crop and amplify it to generate a finer training result.
Specifically, the experiments were performed under PyTorch version 1.7.1, with an Nvidia GeForce 3060 Ti GPU and an i7-10700K CPU. SGD was selected as the optimizer, with an initial learning rate of 0.0001, a momentum hyperparameter of 0.9, a batch size of 32, and 200 training epochs.
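For illustration, the reported configuration can be set up as follows; the tiny stand-in model and the random batch are placeholders for the real network and data loader:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 299 * 299, 200))  # stand-in network
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)

    images = torch.randn(32, 3, 299, 299)           # batch_size = 32
    labels = torch.randint(0, 200, (32,))
    for epoch in range(200):                        # epoch set to 200
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()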
In summary, the enhanced complementary fine-grained image classification network system in the embodiment of the present application adds an enhanced complementary fine-grained image classification network: the main network performs feature extraction and drives the two sub-networks to perform reinforced learning and complementary learning respectively. The learning method of the reinforcement model is used to acquire finer fine-grained image features, while attention erasure is used to acquire the complementary discriminative regions of the target, increasing the network's overall perception of the target. The performance of the system model is evaluated through verification experiments on several public data sets, achieving detailed and comprehensive identification of the target and improving the effect of fine-grained image recognition, so the system can be widely applied to classification tasks.
The above embodiments are not to be taken as limiting the scope of the application, and any alternatives or modifications to the embodiments of the application will be apparent to those skilled in the art and fall within the scope of the application.
Matters not described in detail in the present application are well known to those skilled in the art.

Claims (10)

1. An enhanced complementary fine-grained image classification network system, the classification network system comprising:
a reinforced complementary learning network structure, in which a main network extracts features and drives the other two sub-networks to perform reinforced learning and complementary learning respectively, so that target features are extracted jointly and the target object is identified in a detailed and comprehensive manner;
a DM driving module, which crops and amplifies the region with the greatest influence on the result, deletes that region from the original image, and sends the results into the reinforced complementary learning network structure, helping the model carry out end-to-end training;
a DM loss function, which enables the DM driving module to locate key regions, continuously optimizes the position information of the reinforced complementary learning network structure, and provides accurate mask positions;
a verification data set, used to verify the performance of the reinforced complementary learning network structure, including CUB-200-2011, Stanford Cars, and FGVC-Aircraft.
2. The enhanced complementary fine-grained image classification network system of claim 1, wherein: the backbone network of the reinforced complementary learning network structure is Inception-V3; the reinforced complementary learning network structure comprises a basic network, a reinforcement network, and a complementary network, so that three classification networks are constructed to aggregate the global and local characteristics of the target object; both the overall semantic information and the local semantic information of the object can be obtained; the features output by each network are then subjected to global average pooling, the pooled features are concatenated into a 6144-dimensional vector, a 200-dimensional classification layer is added to this vector for end-to-end training, and the classification result is finally obtained through Softmax.
3. The enhanced complementary fine-grained image classification network system of claim 2, wherein: the DM driving module crops and amplifies the region with the greatest influence on the result and sends it into the reinforcement network; and the DM driving module deletes the region with the greatest influence on the result from the original image and sends the result to the complementary network.
4. The enhanced complementary fine-grained image classification network system according to claim 3, wherein: the DM driving module receives the feature map obtained after training the basic network, then generates a square region centered at (x, y) with half side length l, crops and amplifies this region, and sends it into the reinforcement network; an image mask is generated from the region and input into the complementary network for complementary learning.
5. The enhanced complementary fine-grained image classification network system according to claim 4, wherein: the DM driving module consists of two fully connected layers; its input is a feature map and its output is the local region most important to the neural network, so the most important local region can be located automatically through the fully connected layers;
given an image X, it is input into the trained convolution layers for feature extraction, where Tn denotes the overall network parameters; the whole process can be described as convolving, pooling, and activating X, finally generating a probability distribution p:
p(X) = f(Tn * X)
where f(·) denotes the fully connected layer, which converts the features extracted by the convolutional neural network into feature vectors and uses softmax to convert the vectors into probability values.
6. The enhanced complementary fine-grained image classification network system according to claim 5, wherein the initialization parameter calculation formula of the DM driving module is:
F = ∑_{n=1}^{d} F_n
where F_n denotes the n-th feature map output by the last layer of the convolutional neural network, d denotes the total number of feature maps, and F is the total feature map obtained by adding the feature maps together;
the average value comparison formula of the DM driving module is:
F̄ = (1/(h·w)) ∑_{i=1}^{h} ∑_{j=1}^{w} F_{i,j}
where h and w denote the height and width of the feature map respectively, and F̄ denotes the mean value of the feature map;
the initialization coordinates of the bounding box center are generated by comparing F̄ with F_{i,j}; after the initialization coordinates are obtained, the model automatically optimizes them during training, and the region is then cropped and amplified to obtain a finer local region, which is sent into the reinforcement network for learning; the upper-left and lower-right corner coordinates of the local region are obtained from the center coordinates and the side length, the upper-left corner being denoted (t_lx, t_rx) and the lower-right corner (t_ly, t_ry), calculated as:
t_lx = x − l, t_rx = y − l
t_ly = x + l, t_ry = y + l.
7. The enhanced complementary fine-grained image classification network system according to claim 6, wherein the cropping operation can be regarded as a multiplication between the original image and the mask, expressed as:
X_crop = X ⊙ M(·)
where X_crop is the cropped region, ⊙ denotes the cropping operation between the original image and the template, and M(·) is an attention mask whose expression is:
M(·) = [μ(i − t_lx) − μ(i − t_ly)] × [μ(j − t_rx) − μ(j − t_ry)]
where (i, j) is any point in the feature map; M(·) takes the value 1 if (i, j) lies inside the local region and 0 otherwise; μ(·) is a continuous, differentiable function whose expression is:
μ(x) = 1 / (1 + exp(−kx))
where k is a scaling factor.
8. The enhanced complementary fine-grained image classification network system according to claim 7, wherein the size of the extracted local region is enlarged by a bilinear interpolation algorithm, and the enlarged local region is obtained according to the ratio of the original image to the local region:
λ = X_a / X_p, X_local = Bilinear(X_crop, λ)
where X_p and X_a denote the areas of the local region and the whole image respectively, λ is the area ratio, and X_local is the enlarged local region.
9. The enhanced complementary fine-grained image classification network system according to claim 1, wherein the DM loss function is:
L_s = −(1/m) ∑_{i=1}^{m} log( exp(W_{y_i}ᵀ x_i + b_{y_i}) / ∑_{j=1}^{s} exp(W_jᵀ x_i + b_j) )
where m denotes the batch size, W denotes the weights of the fully connected layer, y_i denotes the category of the i-th picture, x_i denotes the feature vector of the i-th picture before the fully connected layer, b denotes the network bias, and s is the number of target categories; the loss continuously optimizes the position information of the reinforcement network while providing more accurate mask positions so that the complementary network can learn secondary features.
10. The enhanced complementary fine-grained image classification network system of claim 1, wherein the verification method for the verification data set comprises the following steps:
S1, train the feature extraction network using the weights of the backbone network Inception-V3 pre-trained on ImageNet, retaining the parameters of the pooling, input, and convolution layers while removing the existing fully connected and softmax layers, and fine-tune the network on the training data used in this process;
S2, calculate the key region through the reinforced complementary learning network structure, find the coordinate information of the most critical region, and crop and amplify it to generate a finer training result.
CN202310842868.5A 2023-07-10 2023-07-10 Enhanced complementary fine-grained image classification network system Pending CN116797847A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310842868.5A CN116797847A (en) 2023-07-10 2023-07-10 Enhanced complementary fine-grained image classification network system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310842868.5A CN116797847A (en) 2023-07-10 2023-07-10 Enhanced complementary fine-grained image classification network system

Publications (1)

Publication Number Publication Date
CN116797847A (en) 2023-09-22

Family

ID=88044937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310842868.5A Pending CN116797847A (en) 2023-07-10 2023-07-10 Enhanced complementary fine-grained image classification network system

Country Status (1)

Country Link
CN (1) CN116797847A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination