CN110826609A - Double-flow feature fusion image identification method based on reinforcement learning - Google Patents

Double-flow feature fusion image identification method based on reinforcement learning Download PDF

Info

Publication number
CN110826609A
CN110826609A (application CN201911038698.5A; granted as CN110826609B)
Authority
CN
China
Prior art keywords
image
model
feature
reinforcement learning
texture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911038698.5A
Other languages
Chinese (zh)
Other versions
CN110826609B (en)
Inventor
冯镔
唐哲
王豪
李亚婷
朱多旺
刘文予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201911038698.5A priority Critical patent/CN110826609B/en
Publication of CN110826609A publication Critical patent/CN110826609A/en
Application granted granted Critical
Publication of CN110826609B publication Critical patent/CN110826609B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a double-flow feature fusion image identification method based on reinforcement learning. The method uses two models, a texture model and a shape model: the texture model classifies according to the texture information of the object in the image, and the shape model classifies according to the shape information of the object. Both models use reinforcement learning to let the network find the most discriminative region in the whole image and then classify according to that region. The method is simple to implement and generalizes well; it locates regions that make images easy to distinguish, makes full use of the texture and shape information in the image, and effectively overcomes the problems of under-utilized image information and small differences between images.

Description

Double-flow feature fusion image identification method based on reinforcement learning
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a double-flow feature fusion image identification method based on reinforcement learning.
Background
Image recognition has many applications in daily life, such as intelligent security, biomedicine, e-commerce shopping, automatic driving and smart homes. Image recognition studies how to identify, among many categories, the category to which a sample belongs. Several difficulties remain, such as small differences between images and strong background interference.
Current image recognition methods generally feed the image directly into a convolutional neural network for feature extraction and then classify. Although various operations follow feature extraction, most of the extracted features describe the texture of the image. Such approaches share a drawback: shape information cannot be fully exploited, so information useful for recognizing the image is not completely extracted. In addition, to reduce background interference, the common approach is to generate candidate boxes, but this produces a large number of candidates, takes a long time to compute, lacks a clear target, and cannot find the regions that really help image classification.
Therefore, there is a need for a dual-stream feature fusion image recognition method that fuses the texture information and the shape information of an image while keeping computation efficient.
Disclosure of Invention
The invention aims to provide a double-flow feature fusion image identification method based on reinforcement learning, which can effectively find the most informative regions containing texture information and shape information respectively, reduces the influence of background and irrelevant information, and effectively improves identification accuracy. The method comprises the following steps:
(1) generating a shape data set:
Input each image into an image conversion model and output n pictures that are similar in shape but different in texture. n is a preset value: the larger n is, the better the learning effect, but the longer model training takes; in general, values of 5-10 can be tried. The labels of the n converted images are all the same as the label of the input image. The shape data set is the data set, paired with the original data set, in which texture information is reduced and shape information is emphasized. The purpose of generating this paired data set is to let the subsequently trained model learn the shape information of the image.
(2) Training a texture basic model and a shape basic model:
(2.1) Perform data enhancement on each image of the original data set and the shape data set separately: for one image, generate m rectangular boxes at random positions in the image, where m is a preset value. m may range from 4 to 7; too large an m produces too much data and costs too much time. The box side length is generally greater than 1/2 of the image side length and less than or equal to the image side length. The label of each cropped image is the same as the label of the original image.
(2.2) Train the basic models with the cropped images: the texture basic model is trained with the original data set and the shape basic model with the shape data set; the two basic models have the same structure. An adaptive average pooling layer AdaAvgPool is added after the last block of the ResNet50 network; this layer pools the feature map to reduce its size. The feature map output by AdaAvgPool is flattened to one dimension to obtain the Feature vector Feature, and the features before AdaAvgPool are sent to a classifier to obtain the classification prediction probability pred; the classifier may be a fully connected layer. The basic model outputs Feature and pred; its role is to extract features from the input image and predict the classification probability.
(3) Training a texture reinforcement learning model:
(3.1) Read an image image_global and its corresponding category label c. Initialize a rectangular box with the same size as the read image.
(3.2) The position and size of the rectangular box change several times over the whole process. If the size of the rectangular box equals the size of the image, i.e. image_local = image_global, jump to (3.3). If the rectangular box is smaller than the image, crop the image according to the box and then upsample the crop to the same size as the original image to obtain the processed image_local; bilinear interpolation may be used for the upsampling.
(3.3) Input image_local into the texture basic model to obtain the Feature and the classification prediction probability pred.
(3.4) Input the Feature into the texture reinforcement learning model, which consists of several fully connected layers with the ReLU activation function, defined as ReLU(x) = max(0, x).
The last layer converts the feature dimension into the number of actions in the action space, and the output is the Q value of each action in the action space. The action space is a set of actions whose purpose is to change the position or size of the rectangular box; the actions may include translation in each direction, enlargement, reduction, and so on. The Q value of an action is a quantitative assessment of how moving the rectangular box from one position to another by taking that action affects our goal (i.e., classification). A larger Q value means the box after the change will make the classification better; conversely, a lower Q value means the box after the change will make the classification worse.
(3.5) To obtain a definite action, this step chooses between two strategies, exploration and exploitation. A value explore_rate ∈ (0,1) is preset; explore_rate is the probability of choosing exploration, and correspondingly 1 − explore_rate is the probability of choosing exploitation. Exploration randomly selects one action among all actions; exploitation selects the action with the maximum Q value obtained in (3.4). After one of the two strategies determines the action, the size or position of the box is changed according to the selected action and a change coefficient α ∈ (0,1), giving a new rectangular box box'. α, which may be set to 0.1, is the ratio of each change: for example, if the action "enlarge to the right" is selected, box' is 1.1 times the original box extended to the right; if the action "shrink to the left" is selected, box' is 0.9 times the original box shrunk toward the left.
(3.6) With the new rectangular box obtained in (3.5), follow the feature-extraction procedure of (3.2) and (3.3) to obtain another feature Feature' and prediction probability pred'.
(3.8) Based on pred, pred' from (3.6), and c from (3.1), make the following judgment: if the prediction score of pred on category c is higher than that of pred' on category c, the reward is -1; correspondingly, if the prediction score of pred on category c is lower than that of pred' on category c, the reward is 1.
(3.9) Update the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8). The update rule is Q_target = reward + γ·max_a Q(s, a), where Q(s, a) is the Q value of taking action a in state s (i.e., the Feature), and γ is a preset discount coefficient applied at each Q-value update; γ may be set to 0.9.
(3.10) Store the Feature and the Q_target obtained in (3.9) as a pair in an experience pool. The experience pool is a measure for reducing the correlation between samples: paired Feature and Q_target entries are first accumulated in the pool, and once the pool holds a certain amount, data are randomly drawn from it to train the model.
(3.11) Take the new rectangular box as the current box (box = box'), the new feature as the current Feature (Feature = Feature'), and the new classification prediction probability as the current one (pred = pred').
(3.12) Repeat the process from (3.4) to (3.11) a certain number of times. This is a continuous adjustment of the size and position of the box, and the number of repetitions can be tuned together with the change coefficient set in (3.5): with a large coefficient, fewer changes are needed; with a small coefficient, more changes are needed.
(3.13) After the experience pool is filled with a certain number of samples, randomly select paired Feature and Q_target data from the pool, denoted Feature_s and Target. Input Feature_s into the texture reinforcement learning model being trained and output the Q value of the corresponding action, denoted Q_eval. Take the difference between Target and Q_eval as the loss, backpropagate, and update the parameters. The loss may use the mean square error (MSE) function, expressed as loss = (Target − Q_eval)².
(4) Training a shape reinforcement learning model:
and (4.1) training the shape reinforcement learning model by using the shape data set according to the step in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning process, and the structure of the shape reinforcement learning model is the same as that of the texture reinforcement learning model.
(5) Perform double-flow prediction and fusion on the test image using the two trained models, comprising the following substeps:
(5.1) Read the image to be detected, image_global. Initialize a rectangular box with the same size as the read image.
(5.2) Perform the feature extraction of steps (3.2) and (3.3) on the image to obtain the Feature and the classification prediction probability pred at the position of the box.
(5.3) Input the feature obtained in (5.2) into the texture reinforcement learning model, output the Q values of all actions, select the action with the maximum Q value according to the exploitation strategy, and change the size and position of the box according to the selected action.
(5.4) Repeat (5.2) and (5.3) a number of times; as in (3.12), the number of repetitions is related to the change rate of the rectangular box. The last change gives the feature F_texture.
(5.5) Test the shape reinforcement learning model with a procedure similar to (5.1) to (5.4) to obtain F_shape.
(5.6) Input the two different features F_texture and F_shape into the fusion model, whose output is the final prediction probability p_mix. The fusion model is a trainable model intended to fuse F_texture and F_shape and then classify. For example, the fusion model may concatenate the two features and then output the classification probabilities of all classes with a fully connected layer.
(5.7) The class with the highest probability in p_mix is the predicted class.
Through the technical scheme, compared with the prior art, the invention has the following technical effects:
(1) Simple and effective structure: unlike prior art that extracts only texture information with a convolutional neural network, the method extracts texture information and shape information separately through a dual-stream design. Reinforcement learning is used to search for the discriminative regions of texture and shape respectively, so the structure is clear, simple and effective;
(2) High accuracy: different from proposal-generation methods, the optimal region in the image is found by reinforcement learning rather than by selecting the best-performing region among proposals, which lowers the learning cost of the model and better matches the process of searching for discriminative regions; and different from prior art that uses only texture information, using both texture and shape information fully mines the information contained in the image, giving higher accuracy;
(3) Strong robustness: the texture reinforcement learning model of the invention focuses more on texture information, and the shape reinforcement learning model focuses more on shape information; by attending to the two kinds of information separately, the network can adapt to different images and its performance is more robust.
Drawings
FIG. 1 is a flow chart of a double-flow feature fusion image recognition method based on reinforcement learning according to the present invention;
FIG. 2 is a schematic diagram of a reinforcement learning model implementation framework of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical terms used in the present invention are explained first:
iFood dataset: a data set used in a competition hosted on Kaggle. It contains 251 fine-grained food categories, with a total of 120,216 web-collected images as the training set, 12,170 images as the validation set and 28,399 images as the test set, with manually verified labels; each image contains food of a single category.
ResNet-50: a neural network for classification mainly comprises 50 convolutional layers, a pooling layer and a short connecting layer. The convolution layer is used for extracting picture characteristics; the pooling layer has the functions of reducing the dimensionality of the feature vector output by the convolutional layer and reducing overfitting; the shortcut connection layer is used for transferring gradient and solving the problems of extinction and explosion gradient. The network parameters can be updated through a reverse conduction algorithm;
an image conversion model: the style of the image can be transformed but the content is not changed using the structure of the generic adaptive Network, including the generator and the discriminator.
As shown in fig. 1, the present invention provides a double-flow feature fusion image recognition method based on reinforcement learning, which includes the following steps:
(1) generating a shape data set:
Input each image into an image conversion model and output n pictures that are similar in shape but different in texture, as sketched in the code below. n is a preset value: the larger n is, the better the learning effect, but the longer model training takes; in general, values of 5-10 can be tried. The labels of the n converted images are all the same as the label of the input image. The shape data set is the data set, paired with the original data set, in which texture information is reduced and shape information is emphasized. The purpose of generating this paired data set is to let the subsequently trained model learn the shape information of the image.
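A minimal sketch of step (1) in Python, assuming a pre-trained style-transfer generator style_transfer(img, style_id) (a hypothetical callable, not named in the patent) that returns an image with the same content and shape but a different texture; each of the n re-styled copies keeps the original label.

import os
from PIL import Image

def build_shape_dataset(samples, style_transfer, n=5, out_dir="shape_dataset"):
    """samples: list of (image_path, label).
    style_transfer(img, style_id) is assumed to re-texture an image while keeping its shape."""
    os.makedirs(out_dir, exist_ok=True)
    shape_samples = []
    for path, label in samples:
        img = Image.open(path).convert("RGB")
        stem = os.path.splitext(os.path.basename(path))[0]
        for k in range(n):                       # n re-textured copies per image
            styled = style_transfer(img, k)      # shape preserved, texture replaced
            out_path = os.path.join(out_dir, f"{label}_{stem}_{k}.jpg")
            styled.save(out_path)
            shape_samples.append((out_path, label))  # label is unchanged
    return shape_samples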
(2) Training a texture basic model and a shape basic model:
(2.1) Perform data enhancement on each image of the original data set and the shape data set separately; the specific process is as follows (see the sketch after this paragraph): for one image, generate m rectangular boxes at random positions in the image, where m is a preset value. m may range from 4 to 7; too large an m produces too much data and costs too much time. The box side length is generally greater than 1/2 of the image side length and less than or equal to the image side length. The label of each cropped image is the same as the label of the original image.
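A sketch of the random-crop augmentation in (2.1); square crops and the use of the shorter image side are assumptions for brevity. Each of the m boxes has a side length greater than half the (shorter) image side and at most the full side, and every crop inherits the label of the source image.

import random
from PIL import Image

def random_box_crops(img, m=5):
    """Return m random crops of a PIL image; each side length s satisfies W/2 < s <= W,
    where W is the shorter image side."""
    W, H = img.size
    base = min(W, H)
    crops = []
    for _ in range(m):
        side = random.randint(base // 2 + 1, base)   # > 1/2 of the side, <= the side
        x = random.randint(0, W - side)
        y = random.randint(0, H - side)
        crops.append(img.crop((x, y, x + side, y + side)))
    return crops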
(2.2) Train the basic models with the cropped images: the texture basic model is trained with the original data set and the shape basic model with the shape data set; the two basic models have the same structure. An adaptive average pooling layer AdaAvgPool is added after the last block of the ResNet50 network; this layer pools the feature map to reduce its size. The feature map output by AdaAvgPool is flattened to one dimension to obtain the Feature vector Feature, and the features before AdaAvgPool are sent to a classifier to obtain the classification prediction probability pred; the classifier may be a fully connected layer. The basic model outputs Feature and pred; its role is to extract features from the input image and predict the classification probability.
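A sketch of the texture/shape basic model of (2.2), assuming a torchvision ResNet-50 backbone. The 1x1 pooled size and the single fully connected classifier head are assumptions; the patent routes the pre-pool features into the classifier, while this sketch classifies from the pooled Feature vector for simplicity.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class BaseModel(nn.Module):
    """Basic model sketch: ResNet-50 backbone + added AdaAvgPool; outputs (Feature, pred)."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet50()
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # up to the last block
        self.ada_avg_pool = nn.AdaptiveAvgPool2d(1)                 # added AdaAvgPool layer
        self.classifier = nn.Linear(2048, num_classes)              # fully connected classifier (assumed head)

    def forward(self, x):
        fmap = self.stem(x)                                    # feature map of the last block
        feature = torch.flatten(self.ada_avg_pool(fmap), 1)    # Feature vector
        pred = torch.softmax(self.classifier(feature), dim=1)  # classification prediction probability
        return feature, pred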
(3) Training a texture reinforcement learning model:
(3.1) Read an image image_global and its corresponding category label c. Initialize a rectangular box with the same size as the read image.
(3.2) The position and size of the rectangular box change several times over the whole process. If the size of the rectangular box equals the size of the image, i.e. image_local = image_global, jump to (3.3). If the rectangular box is smaller than the image, crop the image according to the box and then upsample the crop to the same size as the original image to obtain the processed image_local; bilinear interpolation may be used for the upsampling. The purpose of this step is as follows: if the box is still the initialized box, no operation is performed and (3.3) is entered; if the box is smaller than the initialized box, the crop is taken and upsampled to the size of the original input image, so that all images fed to the neural network have the same size.
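A sketch of step (3.2): if the current box is smaller than the image, crop the box region and upsample it back to the original resolution; bilinear interpolation follows the text, the tensor layout is an assumption.

import torch.nn.functional as F

def crop_and_resize(image, box):
    """image: float tensor of shape (1, C, H, W); box: (x0, y0, x1, y1) integer pixel coords.
    Returns image_local at the original (H, W) resolution."""
    _, _, H, W = image.shape
    x0, y0, x1, y1 = box
    if (x1 - x0, y1 - y0) == (W, H):           # box covers the whole image
        return image                            # image_local == image_global
    patch = image[:, :, y0:y1, x0:x1]           # crop the box region
    return F.interpolate(patch, size=(H, W), mode="bilinear", align_corners=False)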
(3.3) As shown in FIG. 2, image_local is input into the texture basic model, which yields two outputs: the Feature and the classification prediction probability pred.
(3.4) Input the Feature directly into the reinforcement learning model; through its forward propagation it outputs the Q value of each corresponding action in the action space. Specifically, the Feature is input into the action-selection submodule of the texture reinforcement learning model, which contains an agent network composed of several fully connected layers with the ReLU activation function, defined as ReLU(x) = max(0, x).
The last layer converts the feature dimension into the number of actions in the action space, and the output is the Q value of each action in the action space. The action space is a set of actions whose purpose is to change the position or size of the rectangular box; the actions may include translation in each direction, enlargement, reduction, and so on. The Q value of an action is a quantitative assessment of how moving the rectangular box from one position to another by taking that action affects our goal (i.e., classification). A larger Q value means the box after the change will make the classification better; conversely, a lower Q value means the box after the change will make the classification worse.
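A minimal sketch of the agent network in (3.4): a small fully connected network with ReLU activations whose last layer maps the feature dimension to the number of actions, giving one Q value per action. The layer widths and the number of actions are assumptions.

import torch.nn as nn

class AgentNet(nn.Module):
    """Q network of the reinforcement learning model: Feature -> one Q value per action."""
    def __init__(self, feature_dim=2048, num_actions=6, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),                      # ReLU(x) = max(0, x)
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions)  # Q value for each action in the action space
        )

    def forward(self, feature):
        return self.net(feature)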
(3.5) In the action-selection submodule, to obtain a definite action, this step chooses between two strategies, exploration and exploitation, one of which is selected (see the sketch after this paragraph). Exploration randomly selects one action among all actions; exploitation selects the action with the maximum Q value obtained in (3.4). After one of the two strategies determines the action, the size or position of the box is changed according to the selected action and the change coefficient α ∈ (0,1), which may be set to 0.1, giving a new rectangular box box'. α is the ratio of each change: for example, if the action "enlarge to the right" is selected, box' is 1.1 times the original box extended to the right; if the action "shrink to the left" is selected, box' is 0.9 times the original box shrunk toward the left.
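A sketch of the exploration/exploitation choice and of applying the chosen action with coefficient α, as in (3.5). The concrete action set (four translations plus enlarge/shrink) and the exact geometry of each move are assumptions consistent with the text's example; clipping the box to the image bounds is omitted.

import random

ACTIONS = ["left", "right", "up", "down", "enlarge", "shrink"]  # assumed action space

def select_action(q_values, explore_rate=0.1):
    """Exploration: random action; exploitation: argmax over the Q values."""
    if random.random() < explore_rate:
        return random.randrange(len(ACTIONS))
    return int(q_values.argmax())

def apply_action(box, action_idx, alpha=0.1):
    """box = (x0, y0, x1, y1); each move changes the box by the ratio alpha."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    a = ACTIONS[action_idx]
    if a == "left":    x0, x1 = x0 - alpha * w, x1 - alpha * w
    if a == "right":   x0, x1 = x0 + alpha * w, x1 + alpha * w
    if a == "up":      y0, y1 = y0 - alpha * h, y1 - alpha * h
    if a == "down":    y0, y1 = y0 + alpha * h, y1 + alpha * h
    if a == "enlarge": x1, y1 = x0 + (1 + alpha) * w, y0 + (1 + alpha) * h
    if a == "shrink":  x1, y1 = x0 + (1 - alpha) * w, y0 + (1 - alpha) * h
    return (x0, y0, x1, y1)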
(3.6) With the new rectangular box obtained in (3.5), follow the feature-extraction procedure of (3.2) and (3.3) to obtain another feature Feature' and prediction probability pred'. The purpose of this step is that, after the position of the rectangular box has changed, the same feature-extraction operation can be applied: the region corresponding to the box is cropped, upsampled, and input into the basic model to obtain the feature and the prediction probability.
(3.8) Based on pred, pred' from (3.6), and c from (3.1), make the following judgment: if the prediction score of pred on category c is higher than that of pred' on category c, the reward is -1; correspondingly, if the prediction score of pred on category c is lower than that of pred' on category c, the reward is 1. Specifically, the prediction probability pred contains a prediction score for every category, and each score indicates the probability of predicting that category, so a higher probability for c indicates that the model predicts this sample better. Whether to reward or punish the model can therefore be decided from the two prediction probabilities and the label: reward = sign(pred'[c] − pred[c]).
(3.9) In the Q-value update submodule, update the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8). The update rule is Q_target = reward + γ·max_a Q(s, a), where Q(s, a) is the Q value of taking action a in state s (i.e., the Feature), and γ is a preset discount coefficient applied at each Q-value update; γ may be set to 0.9.
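A sketch of the reward of (3.8) and the Q_target update of (3.9): the reward is the sign of the change of the true-class probability, and Q_target = reward + γ·max_a Q(s, a) with γ = 0.9 as suggested. The 1-D probability-vector layout is an assumption.

def compute_reward(pred, pred_new, c):
    """pred, pred_new: 1-D class-probability vectors; c: true class index.
    +1 if the new box raises the score of class c, otherwise -1."""
    return 1.0 if pred_new[c] > pred[c] else -1.0

def q_target(reward, q_values_next, gamma=0.9):
    """Q_target = reward + gamma * max_a Q(s, a)."""
    return reward + gamma * float(q_values_next.max())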
(3.10) Store the Feature and the Q_target obtained in (3.9) as a pair in an experience pool. The experience pool is a measure for reducing the correlation between samples: paired Feature and Q_target entries are first accumulated in the pool, and once the pool holds a certain amount, data are randomly drawn from it to train the model.
(3.11) Take the new rectangular box as the current box (box = box'), the new feature as the current Feature (Feature = Feature'), and the new classification prediction probability as the current one (pred = pred').
(3.12) Repeat the process from (3.4) to (3.11) a certain number of times. This is a continuous adjustment of the size and position of the box, and the number of repetitions can be tuned together with the change coefficient set in (3.5): with a large coefficient, fewer changes are needed; with a small coefficient, more changes are needed.
(3.13) In the Q-value evaluation submodule, after the experience pool is filled with a certain number of samples, randomly select paired Feature and Q_target data from the pool, denoted Feature_s and Target. Input Feature_s into the texture reinforcement learning model being trained and output the Q value of the corresponding action, denoted Q_eval. Take the difference between Target and Q_eval as the loss, backpropagate, and update the parameters. The loss may use the mean square error (MSE) function, expressed as loss = (Target − Q_eval)².
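A sketch of the replay-based training of (3.10) and (3.13): pairs are accumulated in an experience pool, and once enough are stored a random mini-batch is drawn, the agent produces Q_eval for the stored actions, and the MSE between Target and Q_eval is backpropagated. The pool capacity, batch size, and the fact that the chosen action index is stored alongside each pair are assumptions not fixed by the text.

import random
import torch
import torch.nn.functional as F

class ReplayPool:
    def __init__(self, capacity=10000):
        self.data, self.capacity = [], capacity

    def push(self, feature, action, q_target):
        if len(self.data) >= self.capacity:
            self.data.pop(0)                                  # drop the oldest entry
        self.data.append((feature.detach().squeeze(0), action, q_target))

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def train_step(agent, pool, optimizer, batch_size=32):
    """One update: MSE between the stored Q_target and the agent's current Q_eval."""
    if len(pool.data) < batch_size:
        return None
    batch = pool.sample(batch_size)
    features = torch.stack([f for f, _, _ in batch])          # (B, feature_dim)
    actions = torch.tensor([a for _, a, _ in batch])
    targets = torch.tensor([t for _, _, t in batch], dtype=torch.float32)
    q_eval = agent(features).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_eval, targets)                        # loss = (Target - Q_eval)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()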
(4) Training a shape reinforcement learning model:
and (4.1) training the shape reinforcement learning model by using the shape data set according to the step in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning process, and the structure of the shape reinforcement learning model is the same as that of the texture reinforcement learning model. And (3) and (4) dividing the information in the data set into textures and shapes, and respectively learning two different kinds of information in the data set by using two streams of ideas. The model structure of both streams is the same, and the data sets used for training are pairs of data sets that have been pre-processed. The training process is the same.
(5) The double-flow prediction and fusion of the test image to be detected by utilizing the two trained models comprises the following substeps:
(5.1) Read the image to be detected, image_global. Initialize a rectangular box with the same size as the read image.
(5.2) Perform the feature extraction of steps (3.2) and (3.3) on the image to obtain the Feature and the classification prediction probability pred at the position of the box.
(5.3) Input the feature obtained in (5.2) into the texture reinforcement learning model, output the Q values of all actions, select the action with the maximum Q value according to the exploitation strategy, and change the size and position of the box according to the selected action.
(5.4) Repeat (5.2) and (5.3) a number of times; as in (3.12), the number of repetitions is related to the change rate of the rectangular box. The last change gives the feature F_texture.
(5.5) Test the shape reinforcement learning model with a procedure similar to (5.1) to (5.4) to obtain F_shape.
(5.6) Input the two different features F_texture and F_shape into the fusion model, whose output is the final prediction probability p_mix. The fusion model is a trainable model intended to fuse F_texture and F_shape and then classify; for example, it may concatenate the two features and then output the classification probabilities of all classes with a fully connected layer. Feature fusion can be attempted in various ways and can even be performed directly on the prediction scores, but fusing scores is not as effective as fusing features. The fusion model is trained with the two features as input against the label corresponding to the image.
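A sketch of the concatenation-based fusion model mentioned in (5.6): F_texture and F_shape are concatenated and a fully connected layer outputs the class probabilities p_mix. The feature dimension and the 251-class default (matching the iFood dataset mentioned above) are assumptions.

import torch
import torch.nn as nn

class FusionModel(nn.Module):
    """Concatenate F_texture and F_shape, then classify with a fully connected layer."""
    def __init__(self, feature_dim=2048, num_classes=251):
        super().__init__()
        self.fc = nn.Linear(2 * feature_dim, num_classes)

    def forward(self, f_texture, f_shape):
        fused = torch.cat([f_texture, f_shape], dim=1)   # feature-level fusion
        return torch.softmax(self.fc(fused), dim=1)      # p_mix over all classes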
(5.7) The class with the highest probability in p_mix is the predicted class.
The effectiveness of the invention is demonstrated by the following experimental example; the experimental results show that the invention can improve the accuracy of image recognition.
The invention is compared with the base network we use on the iFood dataset. Table 1 shows the accuracy of the method of the invention on the dataset, where Backbone denotes the basic model ResNet50 we use and DQN denotes the reinforcement learning model we use. The larger the value, the higher the image recognition accuracy; the table shows that the improvement brought by the method is very significant.
TABLE 1 precision on iFood dataset
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A double-flow feature fusion image identification method based on reinforcement learning is characterized by comprising the following steps:
(1) generating a shape data set:
for each image, inputting the image into an image conversion model and outputting n corresponding images that are similar in shape but different in texture, the labels of the n converted images being the same as the label of the input image, and n being a preset value;
(2) training a texture basic model and a shape basic model:
(2.1) performing data enhancement on each image of the original dataset and the shape dataset separately: for an image, generating m rectangular frames at random positions in the image, wherein the side length of each frame is greater than 1/2 of the side length of the image and less than or equal to the side length of the image, the label of the cut image is consistent with that of the original image, and m is a preset value;
(2.2) training a basic model using the cropped images, wherein the texture basic model is trained using the original data set and the shape basic model is trained using the shape data set; the texture basic model and the shape basic model have the same structure, in which an adaptive average pooling layer AdaAvgPool is added after the last block of the ResNet50 network and pools the feature map to reduce its size; the basic model is trained with the images and labels, and the basic model is used for feature extraction and prediction on images;
(3) training a texture reinforcement learning model:
(3.1) reading an image image_global and a corresponding category label c, and initializing a rectangular box with the same size as the read image;
(3.2) if the size of the rectangular box equals the size of the image, jumping to (3.3); if the rectangular box is smaller than the image, cropping the image according to the box and then upsampling the crop to the same size as the original image to obtain the processed image_local;
(3.3) inputting image_local into the texture basic model to obtain the Feature and the classification prediction probability pred;
(3.4) inputting the Feature into a texture reinforcement learning model, wherein the output of the texture reinforcement learning model is the Q value of each action in the action space;
(3.5) obtaining a definite action through the two strategies of exploration and exploitation, one of which is selected: exploration randomly selects one action among all actions, and exploitation selects the action with the maximum Q value obtained in (3.4); after the action is determined by one of the two strategies, changing the size or position of the box according to the selected action and a change coefficient α to obtain a new rectangular box box';
(3.6) with the new rectangular box obtained in (3.5), following the feature-extraction procedure of (3.2) and (3.3) to obtain another feature Feature' and prediction probability pred';
(3.8) based on pred, pred' from (3.6), and c from (3.1), making the following judgment: the reward is -1 if the prediction score of pred on category c is higher than the prediction score of pred' on category c, and the reward is 1 if the prediction score of pred on category c is lower than the prediction score of pred' on category c;
(3.9) updating the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8), the update rule being Q_target = reward + γ·max_a Q(s, a), where Q(s, a) is the Q value of taking action a in state s (i.e., the Feature), and γ is a preset coefficient applied at each Q-value update;
(3.10) storing the Feature and the Q_target obtained in (3.9) as a pair into an experience pool;
(3.11) taking the new rectangular box as the current box (box = box'), the new feature as the current Feature (Feature = Feature'), and the new classification prediction probability as the current one (pred = pred'), and repeating the process from (3.4) to (3.11) a preset number of times; after the experience pool holds a preset number of samples, randomly selecting paired Feature and Q_target data from the pool, denoted Feature_s and Target, inputting Feature_s into the texture reinforcement learning model, outputting the Q value of the corresponding action, denoted Q_eval, taking the difference between Target and Q_eval as the loss, backpropagating, and updating the parameters;
(4) training a shape reinforcement learning model:
(4.1) training a shape reinforcement learning model by using the shape data set according to the step in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning process, and the structures of the shape reinforcement learning model and the texture reinforcement learning model are the same;
(5) performing double-flow prediction and fusion on the test image to be detected using the two trained reinforcement learning models, comprising the following substeps:
(5.1) reading an image to be detected, and initializing a rectangular box with the size same as that of the image to be detected;
(5.2) performing the feature extraction of steps (3.2) and (3.3) on the image to be detected to obtain the Feature and the classification prediction probability pred at the position of the box;
(5.3) inputting the feature obtained in (5.2) into the texture reinforcement learning model, outputting the Q values of all actions, selecting the action with the maximum Q value according to the exploitation strategy, and changing the size and position of the box according to the selected action;
(5.4) repeating the process of (5.2) and (5.3) a number of times, the number of repetitions being related to the change rate of the rectangular box as in (3.12); the last change gives the feature F_texture;
(5.5) testing the shape reinforcement learning model with a procedure similar to (5.1) to (5.4) to obtain F_shape;
(5.6) inputting the two different features F_texture and F_shape into the fusion model, whose output is the final prediction probability p_mix, wherein the fusion model is a trainable model intended to fuse F_texture and F_shape and then classify;
(5.7) the class with the highest probability in p_mix is the class predicted for the image to be detected.
2. The reinforcement learning based double-flow feature fusion image recognition method according to claim 1, wherein the training process of the basic model in step (2.2) is specifically: compressing the feature map output by AdaAvgPool to one dimension to obtain the feature vector Feature, and sending the features before AdaAvgPool to a classifier to obtain the classification prediction probability pred; the basic model outputs Feature and pred, and its role is to extract features from the input image and predict the classification probability.
3. The reinforcement learning based dual-stream feature fusion image recognition method according to claim 1 or 2, wherein the classifier in step (2.2) uses fully connected layers.
4. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein step (3.4) is specifically: inputting the Feature into the action-selection submodule of the texture reinforcement learning model, the action-selection submodule containing an agent network composed of several fully connected layers with the ReLU activation function, defined as ReLU(x) = max(0, x); the last layer converts the feature dimension into the number of actions in the action space, and the output is the Q value of each action in the action space; the action space is a set of actions whose purpose is to change the position or size of the rectangular box; the Q value of an action means that the rectangular box moves to another position after the action is taken at a certain position, and is a quantitative evaluation of the influence of this change on the target; a larger Q value means the box after the change makes the classification better, and conversely a lower Q value means the box after the change makes the classification worse.
5. The method for dual-stream feature fusion image recognition based on reinforcement learning according to claim 1 or 2, wherein the action in step (3.4) comprises translation, zooming in or zooming out in various directions.
6. The reinforcement learning based dual-stream feature fusion image recognition method according to claim 1 or 2, wherein the change coefficient α ∈ (0,1) in step (3.5).
7. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein step (3.8) is specifically: the prediction probability pred contains a prediction score for every category, each score representing the probability of predicting that category; the higher the probability for c, the better the model predicts this sample, so whether to reward or punish the model can be judged from the prediction probabilities and the label, and the reward is sign(pred'[c] − pred[c]).
8. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the loss in step (3.13) uses the MSE function, expressed as loss = (Target − Q_eval)².
9. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the fusion model in step (5.6) concatenates the two features and then outputs the classification probabilities of all classes using a fully connected layer.
10. The dual-flow feature fusion image recognition method based on reinforcement learning according to claim 1 or 2, wherein the value range of n is 5-10, and the value range of m is 4-7.
CN201911038698.5A 2019-10-29 2019-10-29 Double-current feature fusion image identification method based on reinforcement learning Active CN110826609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911038698.5A CN110826609B (en) 2019-10-29 2019-10-29 Double-current feature fusion image identification method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911038698.5A CN110826609B (en) 2019-10-29 2019-10-29 Double-current feature fusion image identification method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN110826609A true CN110826609A (en) 2020-02-21
CN110826609B CN110826609B (en) 2023-03-24

Family

ID=69550977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911038698.5A Active CN110826609B (en) 2019-10-29 2019-10-29 Double-current feature fusion image identification method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN110826609B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597865A (en) * 2020-12-16 2021-04-02 燕山大学 Intelligent identification method for edge defects of hot-rolled strip steel
CN113128522A (en) * 2021-05-11 2021-07-16 四川云从天府人工智能科技有限公司 Target identification method and device, computer equipment and storage medium
CN113240573A (en) * 2020-10-26 2021-08-10 杭州火烧云科技有限公司 Local and global parallel learning-based style transformation method and system for ten-million-level pixel digital image
CN114742800A (en) * 2022-04-18 2022-07-12 合肥工业大学 Reinforced learning fused magnesia furnace working condition identification method based on improved Transformer
TWI801038B (en) * 2021-12-16 2023-05-01 新加坡商鴻運科股份有限公司 Defect detection method, system, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766080A (en) * 2015-05-06 2015-07-08 苏州搜客信息技术有限公司 Image multi-class feature recognizing and pushing method based on electronic commerce
CN108805798A (en) * 2017-05-05 2018-11-13 英特尔公司 Fine granularity for deep learning frame calculates communication and executes
CN109814565A (en) * 2019-01-30 2019-05-28 上海海事大学 The unmanned boat intelligence navigation control method of space-time double fluid data-driven depth Q study
CN110135502A (en) * 2019-05-17 2019-08-16 东南大学 A kind of image fine granularity recognition methods based on intensified learning strategy
CN110348355A (en) * 2019-07-02 2019-10-18 南京信息工程大学 Model recognizing method based on intensified learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766080A (en) * 2015-05-06 2015-07-08 苏州搜客信息技术有限公司 Image multi-class feature recognizing and pushing method based on electronic commerce
CN108805798A (en) * 2017-05-05 2018-11-13 英特尔公司 Fine granularity for deep learning frame calculates communication and executes
CN109814565A (en) * 2019-01-30 2019-05-28 上海海事大学 The unmanned boat intelligence navigation control method of space-time double fluid data-driven depth Q study
CN110135502A (en) * 2019-05-17 2019-08-16 东南大学 A kind of image fine granularity recognition methods based on intensified learning strategy
CN110348355A (en) * 2019-07-02 2019-10-18 南京信息工程大学 Model recognizing method based on intensified learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIANGTENG HE等: "Fine-grained Image Classification via Combining Vision and Language", 《PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240573A (en) * 2020-10-26 2021-08-10 杭州火烧云科技有限公司 Local and global parallel learning-based style transformation method and system for ten-million-level pixel digital image
CN112597865A (en) * 2020-12-16 2021-04-02 燕山大学 Intelligent identification method for edge defects of hot-rolled strip steel
CN113128522A (en) * 2021-05-11 2021-07-16 四川云从天府人工智能科技有限公司 Target identification method and device, computer equipment and storage medium
CN113128522B (en) * 2021-05-11 2024-04-05 四川云从天府人工智能科技有限公司 Target identification method, device, computer equipment and storage medium
TWI801038B (en) * 2021-12-16 2023-05-01 新加坡商鴻運科股份有限公司 Defect detection method, system, electronic device and storage medium
CN114742800A (en) * 2022-04-18 2022-07-12 合肥工业大学 Reinforced learning fused magnesia furnace working condition identification method based on improved Transformer
CN114742800B (en) * 2022-04-18 2024-02-20 合肥工业大学 Reinforced learning electric smelting magnesium furnace working condition identification method based on improved converter

Also Published As

Publication number Publication date
CN110826609B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN110322446B (en) Domain self-adaptive semantic segmentation method based on similarity space alignment
CN110837836B (en) Semi-supervised semantic segmentation method based on maximized confidence
CN109840531B (en) Method and device for training multi-label classification model
CN110222770B (en) Visual question-answering method based on combined relationship attention network
CN110046550B (en) Pedestrian attribute identification system and method based on multilayer feature learning
US20210081695A1 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN115858847B (en) Combined query image retrieval method based on cross-modal attention reservation
CN112257758A (en) Fine-grained image recognition method, convolutional neural network and training method thereof
CN112070040A (en) Text line detection method for video subtitles
CN111274981A (en) Target detection network construction method and device and target detection method
CN115222998B (en) Image classification method
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
CN112527993A (en) Cross-media hierarchical deep video question-answer reasoning framework
CN116912708A (en) Remote sensing image building extraction method based on deep learning
CN114510594A (en) Traditional pattern subgraph retrieval method based on self-attention mechanism
CN115908806A (en) Small sample image segmentation method based on lightweight multi-scale feature enhancement network
Fan et al. A novel sonar target detection and classification algorithm
CN113435461B (en) Point cloud local feature extraction method, device, equipment and storage medium
CN112148994B (en) Information push effect evaluation method and device, electronic equipment and storage medium
CN114972959B (en) Remote sensing image retrieval method for sample generation and in-class sequencing loss in deep learning
CN116680578A (en) Cross-modal model-based deep semantic understanding method
CN114842488A (en) Image title text determination method and device, electronic equipment and storage medium
CN114202765A (en) Image text recognition method and storage medium
CN113743497A (en) Fine granularity identification method and system based on attention mechanism and multi-scale features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant