CN110826609A - Double-flow feature fusion image identification method based on reinforcement learning - Google Patents
- Publication number
- CN110826609A (application CN201911038698.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- model
- feature
- reinforcement learning
- texture
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
Abstract
The invention discloses a double-flow feature fusion image identification method based on reinforcement learning. The method trains two models, a texture model and a shape model: the texture model classifies according to the texture information of the object in the image, and the shape model classifies according to its shape information. Both models use reinforcement learning to let the network find the most discriminative region in the whole image and then classify according to that region. The method is simple to implement and generalizes well: it finds suitable, effective discriminative regions that make images easy to distinguish, makes full use of the texture and shape information in the image, and effectively overcomes the problems of under-utilized image information and small differences between images.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a double-flow feature fusion image identification method based on reinforcement learning.
Background
Image recognition has many applications in daily life, such as intelligent security, biomedicine, e-commerce shopping, automatic driving, and smart homes. Image recognition studies how to identify the category corresponding to a sample from among many categories. It faces many challenges, such as small differences between images and strong background interference.
Current image recognition methods generally input the image directly into a convolutional neural network for feature extraction and then classify. Although various operations follow feature extraction, most of the extracted features describe the texture information of the image. Such methods share a drawback: shape information cannot be fully utilized, so information helpful for recognizing the image cannot be completely extracted. In addition, to reduce background interference, current approaches generate candidate boxes (proposals), but they generate a large number of candidates, take a long time to compute, lack a clear target, and cannot find the regions that truly help image classification.
Therefore, there is a need for a double-flow feature fusion image recognition method that can fuse the texture information and shape information of an image while remaining computationally efficient.
Disclosure of Invention
The invention aims to provide a double-flow feature fusion image identification method based on reinforcement learning, which can effectively find the most informative regions containing texture information and shape information respectively, reduce the influence of background and irrelevant information, and effectively improve identification accuracy. The method comprises the following steps:
(1) generating a shape data set:
Each image is input into an image conversion model, which outputs n pictures with similar shape but different texture. n is a preset value; a larger n gives a better learning effect but a longer model training time, and values of 5-10 are typical. The labels of the n converted images are the same as the label of the input image. The shape data set is the data set paired with the original data set in which texture information is reduced and shape information is emphasized. The purpose of generating this paired data set is that the subsequently trained model can learn the shape information of the images.
(2) Training a texture basic model and a shape basic model:
(2.1) Perform data enhancement on each image of the original data set and the shape data set separately: for one image, m rectangular boxes are generated at random positions in the image, where m is a preset value. m may range from 4 to 7; too large an m produces too much data and costs too much time. Each box side length is generally greater than 1/2 of the image side length and at most the image side length. The label of each cropped image is the same as the label of the original image.
(2.2) Train a base model with the cropped images, where the texture base model is trained with the original data set and the shape base model with the shape data set; the two base models have the same structure. An adaptive average pooling layer AdaAvgPool is added after the last block of the ResNet50 network; it pools the feature map to reduce its size. The feature map output by AdaAvgPool is compressed to one dimension to obtain the feature vector Feature, and the features before AdaAvgPool are sent to a classifier to obtain the classification prediction probability pred, where the classifier may be a fully connected layer. The base model outputs Feature and pred; its function is to extract features from the input image and predict its classification probability.
(3) Training a texture reinforcement learning model:
(3.1) Read image_global and the corresponding category label c. Initialize a rectangular box with the same size as the read image.
(3.2) The position and size of the rectangular box are changed several times over the whole process. If the size of the rectangular box equals the size of the image, i.e. image_local = image_global, jump to (3.3). If the rectangular box is smaller than the image, crop the image to the box and then upsample the crop to the original image size to obtain the processed image_local, where the upsampling may use bilinear interpolation.
(3.3) Input image_local into the texture base model to obtain the Feature and the classification prediction probability pred.
(3.4) Input the Feature into the texture reinforcement learning model, which consists of several fully connected layers with the ReLU activation function, defined as f(x) = max(0, x). The last layer converts the feature dimensionality to the number of actions in the action space, so the output is the Q value of each action in the action space. The action space is the set of available actions; each action changes the position or size of the rectangular box, and the actions may include translation in each direction, enlargement, shrinking, and so on. The Q value of an action quantifies the effect, with respect to our goal (classification), of the box moving to its new position after the action is taken: a larger Q value means the moved box makes the classification better, and a lower Q value means it makes the classification worse.
(3.5) To determine the action, this step chooses between two strategies, exploration and exploitation. A value explore_rate ∈ (0,1) is preset: explore_rate is the probability of choosing exploration and 1 − explore_rate the probability of choosing exploitation. Exploration selects an action at random from all actions; exploitation selects the action with the maximum Q value obtained in (3.4). Once the strategy is chosen the action is determined, and the size or position of the box is changed according to the selected action and the change coefficient α ∈ (0,1) (0.1 may be chosen), which is the proportion of each change, yielding a new rectangular box box'. For example, if the action "enlarge to the right" is selected, box' extends 1.1 times the original box to the right; if "shrink from the left" is selected, box' shrinks to 0.9 times the original box from the left.
(3.6) Using the new rectangular box obtained in (3.5), follow the feature-extraction process of (3.2) and (3.3) to obtain another feature Feature' and prediction probability pred'.
(3.8) Based on pred from (3.3), pred' from (3.6), and c from (3.1), make the following judgment: if pred's prediction score on category c is higher than pred''s, the reward is -1; correspondingly, if pred's score on category c is lower than pred''s, the reward is 1.
(3.9) Update the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8). The update rule is Q_target = reward + γ·max_a Q(s', a), where Q(s, a) denotes the Q value of action a in state s (the state being the Feature) and s' is the state after the action is taken. γ is the discount factor of each Q-value update; γ is a preset value, and 0.9 may be chosen.
(3.10) Store the Feature and the Q_target obtained in (3.9) into the experience pool. The experience pool is a measure for reducing the correlation between samples: paired Feature and Q_target entries are first stored in the pool, and after the pool holds a certain amount, data are randomly selected from it to train the model.
(3.11) Set the new rectangular box as the current box (box = box'), the new feature as the current Feature (Feature = Feature'), and the new classification prediction probability as the current one (pred = pred').
(3.12) Repeat (3.4) to (3.11) a certain number of times. This continually adjusts the size and position of the box; the number of repetitions can be tuned together with the change coefficient set in (3.5): with a large coefficient fewer repetitions are needed, and with a small one more are needed.
(3.13) After the experience pool is filled with a certain number of samples, randomly select paired Feature and Q_target data from it, denoted Feature_s and Target. Input Feature_s into the texture reinforcement learning model being trained; the output Q value of the action is denoted Q_eval. Take the difference between Target and Q_eval as the loss and back-propagate it to update the parameters. The loss may be the mean squared error (MSE), expressed as loss = (Target − Q_eval)².
(4) Training a shape reinforcement learning model:
and (4.1) training the shape reinforcement learning model by using the shape data set according to the step in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning process, and the structure of the shape reinforcement learning model is the same as that of the texture reinforcement learning model.
(5) The double-flow prediction and fusion of the test image to be detected by utilizing the two trained models comprises the following substeps:
(5.1) Read the image to be detected, image_global. Initialize a rectangular box with the same size as the read image.
(5.2) Perform the feature extraction of (3.2) and (3.3) on the image to obtain the Feature and the classification prediction probability pred at the box's current position.
(5.3) Input the feature obtained in (5.2) into the texture reinforcement learning model, output the Q values of all actions, select the action with the maximum Q value following the exploitation strategy, and change the size and position of the box according to the selected action.
(5.4) Repeat (5.2) and (5.3) a number of times; as in (3.12), the number of repetitions is related to the change coefficient of the rectangular box. The last change yields the feature F_texture.
(5.5) Test the shape reinforcement learning model by a process similar to (5.1) to (5.4) to obtain F_shape.
(5.6) Input the two different features F_texture and F_shape into the fusion model, whose output is the final prediction probability p_mix. The fusion model is a trainable model intended to fuse F_texture and F_shape and then classify. For example, the fusion model may concatenate the two features and then use a fully connected layer to output the classification probabilities of all classes.
(5.7) The class with the highest probability in p_mix is the predicted class.
Through the technical scheme, compared with the prior art, the invention has the following technical effects:
(1) Simple and effective structure: unlike prior art that extracts only texture information with a convolutional neural network, the invention extracts texture information and shape information separately through a double-flow design, and uses reinforcement learning to find the discriminative regions of texture and shape respectively, giving a clear, simple and effective structure;
(2) High accuracy: unlike proposal-generation methods, the optimal region in the image is found through reinforcement learning rather than selected from among proposals, which reduces model learning cost and better matches the process of searching for discriminative regions. Unlike prior art that uses only texture information, using both texture and shape information fully mines the information contained in the image, giving higher accuracy;
(3) Strong robustness: the texture reinforcement learning model of the invention focuses more on texture information and the shape reinforcement learning model more on shape information; by attending to the two kinds of information separately, the network can adapt to different images and its performance is more robust.
Drawings
FIG. 1 is a flow chart of a double-flow feature fusion image recognition method based on reinforcement learning according to the present invention;
FIG. 2 is a schematic diagram of a reinforcement learning model implementation framework of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical terms of the present invention are explained and explained first:
iFood dataset: a data set used in a competition held on Kaggle. It contains 251 fine-grained (prepared) food categories, with 120216 images collected from the web as the training set, 12170 images as the validation set, and 28399 images as the test set, with manually verified labels; each image contains food of a single category.
ResNet-50: a neural network for classification, mainly comprising 50 convolutional layers, pooling layers and shortcut connections. The convolutional layers extract picture features; the pooling layers reduce the dimensionality of the feature maps output by the convolutional layers and reduce overfitting; the shortcut connections propagate gradients and alleviate the vanishing- and exploding-gradient problems. The network parameters are updated by the back-propagation algorithm;
Image conversion model: using the structure of a Generative Adversarial Network (GAN), including a generator and a discriminator, the style of an image can be transformed without changing its content.
As shown in fig. 1, the present invention provides a double-flow feature fusion image recognition method based on reinforcement learning, which includes the following steps:
(1) generating a shape data set:
Each image is input into an image conversion model, which outputs n pictures with similar shape but different texture. n is a preset value; a larger n gives a better learning effect but a longer model training time, and values of 5-10 are typical. The labels of the n converted images are the same as the label of the input image. The shape data set is the data set paired with the original data set in which texture information is reduced and shape information is emphasized. The purpose of generating this paired data set is that the subsequently trained model can learn the shape information of the images.
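A minimal sketch of this pairing step (the image conversion model itself is treated as a black box here; `convert_style` is a hypothetical stand-in for the real style-transfer network, not part of the patent):

```python
import random

def convert_style(image, seed):
    """Hypothetical stand-in for the image conversion model: in the real
    method this is a GAN that keeps shape but re-textures the image."""
    rng = random.Random(seed)
    return [[px + rng.random() for px in row] for row in image]

def generate_shape_dataset(dataset, n=5):
    """For each (image, label) pair, emit n re-textured variants that all
    keep the original label, forming the paired shape data set."""
    shape_dataset = []
    for image, label in dataset:
        for i in range(n):
            shape_dataset.append((convert_style(image, i), label))
    return shape_dataset

original = [([[0.1, 0.2], [0.3, 0.4]], "pizza")]
shape_ds = generate_shape_dataset(original, n=5)
```

The key invariant is that every converted image inherits the label of its source image, so the shape stream trains on the same categories as the texture stream.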
(2) Training a texture basic model and a shape basic model:
(2.1) Perform data enhancement on each image of the original data set and the shape data set separately; the specific process is as follows: for one image, m rectangular boxes are generated at random positions in the image, where m is a preset value. m may range from 4 to 7; too large an m produces too much data and costs too much time. Each box side length is generally greater than 1/2 of the image side length and at most the image side length. The label of each cropped image is the same as the label of the original image.
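The box generation of (2.1) can be sketched as follows (pure Python; boxes are (x, y, w, h) tuples, and the side-length constraint — greater than half the image side, at most the image side — is enforced explicitly):

```python
import random

def random_crop_boxes(img_w, img_h, m=4, seed=0):
    """Generate m rectangular boxes at random positions; each side length is
    greater than half the image side and at most the image side, as in (2.1)."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(m):
        w = rng.randint(img_w // 2 + 1, img_w)
        h = rng.randint(img_h // 2 + 1, img_h)
        x = rng.randint(0, img_w - w)   # keep the box inside the image
        y = rng.randint(0, img_h - h)
        boxes.append((x, y, w, h))       # each crop keeps the original label
    return boxes

boxes = random_crop_boxes(224, 224, m=4)
```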
(2.2) Train a base model with the cropped images, where the texture base model is trained with the original data set and the shape base model with the shape data set; the two base models have the same structure. An adaptive average pooling layer AdaAvgPool is added after the last block of the ResNet50 network; it pools the feature map to reduce its size. The feature map output by AdaAvgPool is compressed to one dimension to obtain the feature vector Feature, and the features before AdaAvgPool are sent to a classifier to obtain the classification prediction probability pred, where the classifier may be a fully connected layer. The base model outputs Feature and pred; its function is to extract features from the input image and predict its classification probability.
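What AdaAvgPool and the subsequent flattening compute can be illustrated with a small pure-Python sketch (a single-channel map here; the real base model applies this to multi-channel ResNet50 feature maps):

```python
def adaptive_avg_pool(feature_map, out_h, out_w):
    """Pure-Python sketch of adaptive average pooling: shrink an H x W map
    to out_h x out_w by averaging over the cells each output bin covers."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = i * h // out_h, (i + 1) * h // out_h
        row = []
        for j in range(out_w):
            c0, c1 = j * w // out_w, (j + 1) * w // out_w
            cells = [feature_map[r][c] for r in range(r0, r1)
                                       for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled

def flatten(pooled):
    """Compress the pooled map to the one-dimensional Feature vector."""
    return [v for row in pooled for v in row]

fm = [[1.0, 3.0], [5.0, 7.0]]
feature = flatten(adaptive_avg_pool(fm, 1, 1))
```

Pooling to 1x1 and flattening is exactly the "compress to one dimension" step that yields the Feature vector fed to the reinforcement learning model.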
(3) Training a texture reinforcement learning model:
(3.1) Read image_global and the corresponding category label c. Initialize a rectangular box with the same size as the read image.
(3.2) The position and size of the rectangular box are changed several times over the whole process. If the size of the rectangular box equals the size of the image, i.e. image_local = image_global, jump to (3.3). If the rectangular box is smaller than the image, crop the image to the box and then upsample the crop to the original image size to obtain the processed image_local, where the upsampling may use bilinear interpolation. The purpose of this step: if the box is still the initialized box, no operation is performed and we proceed to (3.3); if the box is smaller than the initialized box, the crop is performed and upsampled to the original input size, so that all images input to the neural network have the same size.
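The crop-and-upsample of (3.2) can be sketched with bilinear interpolation written out directly (single-channel grids, align-corners convention, input at least 2x2; a real implementation would use a library resize):

```python
def crop(image, box):
    """Crop a 2-D grid to box = (x, y, w, h)."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def bilinear_resize(image, out_h, out_w):
    """Upsample a 2-D grid to out_h x out_w with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1)      # source row coordinate
        y0 = min(int(y), in_h - 2)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1)  # source column coordinate
            x0 = min(int(x), in_w - 2)
            dx = x - x0
            row.append(image[y0][x0] * (1 - dx) * (1 - dy)
                       + image[y0][x0 + 1] * dx * (1 - dy)
                       + image[y0 + 1][x0] * (1 - dx) * dy
                       + image[y0 + 1][x0 + 1] * dx * dy)
        out.append(row)
    return out

img = [[0.0, 2.0, 9.0], [4.0, 6.0, 9.0], [9.0, 9.0, 9.0]]
image_local = bilinear_resize(crop(img, (0, 0, 2, 2)), 3, 3)
```

After this step image_local has the same spatial size as image_global, so the base model always sees inputs of one fixed size.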
(3.3) As shown in FIG. 2, image_local is input into the texture base model, yielding two outputs: the Feature and the classification prediction probability pred.
(3.4) Input the Feature directly into the reinforcement learning model, which through forward propagation outputs the Q value of each action in the action space. Specifically, the Feature is input into the action-selection submodule of the texture reinforcement learning model, which contains an agent network composed of several fully connected layers with the ReLU activation function, defined as f(x) = max(0, x). The last layer converts the feature dimensionality to the number of actions in the action space, so the output is the Q value of each action in the action space. The action space is the set of available actions; each action changes the position or size of the rectangular box, and the actions may include translation in each direction, enlargement, shrinking, and so on. The Q value of an action quantifies the effect, with respect to our goal (classification), of the box moving to its new position after the action is taken: a larger Q value means the moved box makes the classification better, and a lower Q value means it makes the classification worse.
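The agent network of (3.4) — fully connected layers with ReLU between them and one Q value per action at the output — can be sketched in plain Python (the toy layer sizes and weights below are assumptions for illustration only):

```python
def relu(v):
    """ReLU activation, f(x) = max(0, x), applied elementwise."""
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    """One fully connected layer; weights is an out x in matrix."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def agent_q_values(feature, layers):
    """Fully connected layers with ReLU between them; the last layer maps
    the feature dimensionality to the number of actions (no ReLU on output),
    so the result is the Q value of each action in the action space."""
    v = feature
    for i, (w, b) in enumerate(layers):
        v = linear(v, w, b)
        if i < len(layers) - 1:
            v = relu(v)
    return v

# toy 2-layer agent: 3-dim feature -> 2 hidden units -> 4 actions
layers = [
    ([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]], [0.0] * 4),
]
q = agent_q_values([2.0, 1.0, 0.5], layers)
```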
(3.5) In the action-selection submodule, to determine the action, this step chooses between the exploration and exploitation strategies. Exploration selects an action at random from all actions; exploitation selects the action with the maximum Q value obtained in (3.4). Once the strategy is chosen the action is determined, and the size or position of the box is changed according to the selected action and the change coefficient α ∈ (0,1) (0.1 may be chosen), which is the proportion of each change, yielding a new rectangular box box'. For example, if the action "enlarge to the right" is selected, box' extends 1.1 times the original box to the right; if "shrink from the left" is selected, box' shrinks to 0.9 times the original box from the left.
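The exploration/exploitation choice and the box update of (3.5) can be sketched as follows (the concrete action list and the (x, y, w, h) update rules are illustrative assumptions; the patent only requires that actions translate or rescale the box by the proportion α):

```python
import random

ACTIONS = ["right", "left", "up", "down", "enlarge", "shrink"]

def select_action(q_values, explore_rate, rng):
    """Exploration picks a random action with probability explore_rate;
    exploitation picks the argmax-Q action otherwise."""
    if rng.random() < explore_rate:
        return rng.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_values[a])

def apply_action(box, action, alpha=0.1):
    """Move or rescale the (x, y, w, h) box by the change coefficient alpha."""
    x, y, w, h = box
    if action == "right":
        x += alpha * w
    elif action == "left":
        x -= alpha * w
    elif action == "up":
        y -= alpha * h
    elif action == "down":
        y += alpha * h
    elif action == "enlarge":
        w, h = w * (1 + alpha), h * (1 + alpha)
    elif action == "shrink":
        w, h = w * (1 - alpha), h * (1 - alpha)
    return (x, y, w, h)

rng = random.Random(0)
a = select_action([0.1, 0.9, 0.2, 0.0, 0.3, 0.5], explore_rate=0.0, rng=rng)
new_box = apply_action((0.0, 0.0, 100.0, 100.0), ACTIONS[a])
```

With explore_rate = 0 the agent is purely exploitative, which is exactly the strategy later reused at test time in step (5.3).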
(3.6) Using the new rectangular box obtained in (3.5), follow the feature-extraction process of (3.2) and (3.3) to obtain another feature Feature' and prediction probability pred'. The purpose of this step is that, after the position of the rectangular box changes, the same feature-extraction operation can be applied: the region covered by the box is cropped, upsampled, and input into the base model to obtain the feature and the prediction probability.
(3.8) Based on pred from (3.3), pred' from (3.6), and c from (3.1), make the following judgment: if pred's prediction score on category c is higher than pred''s, the reward is -1; correspondingly, if it is lower, the reward is 1. Specifically, the prediction probability pred contains a score for every category, and each score indicates the probability of predicting that category; a higher probability on c therefore indicates that the model predicts this sample better. Whether to reward or punish the model can thus be decided from the prediction probabilities and the label: reward = sign(pred'_c − pred_c).
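The reward rule of (3.8), reward = sign(pred'_c − pred_c), is a one-liner; the tie case returning 0 is an assumption, since the text only specifies the two strict inequalities:

```python
def reward_signal(pred_c, pred_prime_c):
    """+1 if the new box raised the score of the true class c, -1 if it
    lowered it; 0 on a tie (assumption - the text covers only the strict
    cases). Equivalent to sign(pred'_c - pred_c)."""
    if pred_prime_c > pred_c:
        return 1
    if pred_prime_c < pred_c:
        return -1
    return 0
```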
(3.9) In the Q-value update submodule, update the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8). The update rule is Q_target = reward + γ·max_a Q(s', a), where Q(s, a) denotes the Q value of action a in state s (the state being the Feature) and s' is the state after the action is taken. γ is the discount factor of each Q-value update; γ is a preset value, and 0.9 may be chosen.
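The update rule of (3.9) can be sketched directly, treating the agent's Q values in the post-action state as an input:

```python
def q_target(reward, next_q_values, gamma=0.9):
    """Q_target = reward + gamma * max_a Q(s', a), where next_q_values are
    the agent's Q values in the state s' reached after taking the action."""
    return reward + gamma * max(next_q_values)

t = q_target(1, [0.2, 0.5, -0.1])
```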
(3.10) Store the Feature and the Q_target obtained in (3.9) into the experience pool. The experience pool is a measure for reducing the correlation between samples: paired Feature and Q_target entries are first stored in the pool, and after the pool holds a certain amount, data are randomly selected from it to train the model.
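The experience pool of (3.10) is a bounded buffer with random sampling; a minimal sketch (the fixed capacity and eviction of the oldest entries are assumptions — the text only requires storing pairs and sampling randomly once enough are stored):

```python
import random
from collections import deque

class ExperiencePool:
    """Pool of (Feature, Q_target) pairs; random sampling breaks the
    correlation between consecutive samples."""
    def __init__(self, capacity=1000):
        self.pool = deque(maxlen=capacity)  # oldest entries evicted when full

    def store(self, feature, q_target):
        self.pool.append((feature, q_target))

    def sample(self, batch_size, seed=None):
        return random.Random(seed).sample(list(self.pool), batch_size)

pool = ExperiencePool(capacity=3)
for i in range(5):
    pool.store([float(i)], i * 0.5)
batch = pool.sample(2, seed=0)
```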
(3.11) Set the new rectangular box as the current box (box = box'), the new feature as the current Feature (Feature = Feature'), and the new classification prediction probability as the current one (pred = pred').
(3.12) Repeat (3.4) to (3.11) a certain number of times. This continually adjusts the size and position of the box; the number of repetitions can be tuned together with the change coefficient set in (3.5): with a large coefficient fewer repetitions are needed, and with a small one more are needed.
(3.13) In the Q-value evaluation submodule, after the experience pool is filled with a certain number of samples, randomly select paired Feature and Q_target data from it, denoted Feature_s and Target. Input Feature_s into the texture reinforcement learning model being trained; the output Q value of the action is denoted Q_eval. Take the difference between Target and Q_eval as the loss and back-propagate it to update the parameters. The loss may be the mean squared error (MSE), expressed as loss = (Target − Q_eval)².
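The loss of (3.13) over a sampled batch is plain MSE between Target and Q_eval (averaging over the batch is an assumption; the text gives the per-sample squared difference):

```python
def mse_loss(targets, q_evals):
    """loss = mean of (Target - Q_eval)^2 over the sampled batch."""
    return sum((t - q) ** 2 for t, q in zip(targets, q_evals)) / len(targets)

loss = mse_loss([1.0, 2.0], [0.5, 2.5])
```

In training, this scalar is what gets back-propagated through the agent network to update its parameters.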
(4) Training a shape reinforcement learning model:
and (4.1) training the shape reinforcement learning model by using the shape data set according to the step in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning process, and the structure of the shape reinforcement learning model is the same as that of the texture reinforcement learning model. And (3) and (4) dividing the information in the data set into textures and shapes, and respectively learning two different kinds of information in the data set by using two streams of ideas. The model structure of both streams is the same, and the data sets used for training are pairs of data sets that have been pre-processed. The training process is the same.
(5) The double-flow prediction and fusion of the test image to be detected by utilizing the two trained models comprises the following substeps:
(5.1) Read the image to be detected, image_global. Initialize a rectangular box with the same size as the read image.
(5.2) Perform the feature extraction of (3.2) and (3.3) on the image to obtain the Feature and the classification prediction probability pred at the box's current position.
(5.3) Input the feature obtained in (5.2) into the texture reinforcement learning model, output the Q values of all actions, select the action with the maximum Q value following the exploitation strategy, and change the size and position of the box according to the selected action.
(5.4) Repeat (5.2) and (5.3) a number of times; as in (3.12), the number of repetitions is related to the change coefficient of the rectangular box. The last change yields the feature F_texture.
(5.5) Test the shape reinforcement learning model by a process similar to (5.1) to (5.4) to obtain F_shape.
(5.6) Input the two different features F_texture and F_shape into the fusion model, whose output is the final prediction probability p_mix. The fusion model is a trainable model intended to fuse F_texture and F_shape and then classify. For example, the fusion model may concatenate the two features and then use a fully connected layer to output the classification probabilities of all classes. Feature fusion can be attempted in various ways; it can even be performed directly on the prediction scores, but fusing the scores performs worse than fusing the features. The fusion model is trained with the two features as input and the label corresponding to the image as supervision.
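The concatenate-then-classify fusion suggested above, plus the argmax of (5.7), can be sketched as follows (the 2-class, 4-dimensional toy weights are assumptions; in the real method the fully connected layer's weights are learned):

```python
import math

def fuse_and_classify(f_texture, f_shape, weights, bias):
    """Concatenate the two stream features, apply one fully connected layer
    and a softmax to obtain the final prediction probability p_mix."""
    f = f_texture + f_shape                      # concatenation
    logits = [sum(w * x for w, x in zip(row, f)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)                              # stabilized softmax
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict(p_mix):
    """Step (5.7): the class with the highest probability in p_mix."""
    return max(range(len(p_mix)), key=lambda c: p_mix[c])

w = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]   # 2 classes, 4-dim fused
p_mix = fuse_and_classify([2.0, 0.1], [0.5, 0.3], w, [0.0, 0.0])
cls = predict(p_mix)
```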
(5.7) The class with the highest probability in p_mix is the predicted class.
The effectiveness of the invention is demonstrated by the following experimental example; the results show that the invention improves the accuracy of image recognition.
On the iFood dataset the invention is compared with the base network we use. Table 1 shows the accuracy of the method on this dataset, where Backbone denotes the base model Resnet50 we use and DQN denotes the reinforcement learning model we use. A larger value means higher image recognition accuracy, and the table shows that the improvement brought by the method is very clear.
TABLE 1 precision on iFood dataset
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A double-flow feature fusion image identification method based on reinforcement learning is characterized by comprising the following steps:
(1) generating a shape data set:
for each image, the image is input into an image conversion model, which outputs n corresponding images with similar shapes but different textures; the labels of the n converted images are the same as the label of the input image, and n is a preset value;
(2) training a texture basic model and a shape basic model:
(2.1) performing data enhancement on each image of the original dataset and the shape dataset separately: for each image, generating m rectangular boxes at random positions in the image, the side length of each box being greater than 1/2 of the side length of the image and at most equal to it; cropping the image according to each box, the label of each cropped image being consistent with that of the original image, where m is a preset value;
(2.2) training the base models using the cropped images, wherein the texture base model is trained using the original dataset and the shape base model is trained using the shape dataset; the texture base model and the shape base model have the same structure: an adaptive average pooling layer AdaAvgPool is newly added after the last block of the ResNet50 network, the adaptive average pooling layer serving to pool the feature map and reduce its size; the base models are trained using the images and labels, and are used to extract features from images and make predictions;
(3) training a texture reinforcement learning model:
(3.1) reading an image image_global and its corresponding category label c, and initializing a rectangular box box with the same size as the read image;
(3.2) if the size of the rectangular box equals the image size, jumping to (3.3); if the box is smaller than the image, cropping the image according to the box and then upsampling the crop to the original image size to obtain the processed image image_local;
(3.3) inputting image_local into the texture base model to obtain the feature Feature and the classification prediction probability pred;
(3.4) inputting Feature into the texture reinforcement learning model, whose output is the Q value of each action in the action space;
(3.5) determining an action through the two strategies of exploration and exploitation, wherein exploration randomly selects one action from all actions and exploitation selects the action corresponding to the maximum Q value obtained in (3.4); after one of the two strategies is chosen and an action determined, changing the size or position of the box according to the selected action and the change coefficient α to obtain a new rectangular box box′;
(3.6) using the new rectangular box obtained in (3.5), obtaining another feature Feature′ and prediction probability pred′ through the feature extraction process of (3.2) and (3.3);
(3.8) based on pred, pred′ in (3.6), and c in (3.1), making the following judgment: reward = −1 if the prediction score of pred on category c is higher than the prediction score of pred′ on category c, and reward = 1 if the prediction score of pred on category c is lower than the prediction score of pred′ on category c;
(3.9) updating the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8), the update rule being Q_target = reward + γ·maxQ′, where maxQ′ is the maximum Q value output by the model for the new box state; the learning rate is updated each time the Q value is updated, and γ is a preset value;
(3.10) storing the feature Feature and the Q_target obtained in (3.9) into the experience pool;
(3.11) taking the new rectangular box as the current box (box = box′), the new feature as the current feature (Feature = Feature′), and the new classification prediction probability as the current one (pred = pred′), and repeating the processes from (3.4) to (3.11) a preset number of times; after the experience pool holds a preset number of samples, randomly selecting from the pool a stored pair of Feature and Q_target, denoted Feature_s and Target, inputting Feature_s into the texture reinforcement learning model, denoting the output Q value of the corresponding action as Q_eval, taking the difference between Target and Q_eval as the loss, backpropagating the loss, and updating the parameters;
(4) training a shape reinforcement learning model:
(4.1) training a shape reinforcement learning model on the shape dataset following the steps in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning model, and the two models have the same structure;
(5) performing dual-stream prediction and fusion on the test image to be detected using the two trained reinforcement learning models, comprising the following substeps:
(5.1) reading the image to be detected and initializing a rectangular box box with the same size as the image to be detected;
(5.2) performing the feature extraction of steps (3.2) and (3.3) on the image to be detected to obtain the feature Feature and the classification prediction probability pred at the position of the box;
(5.3) inputting the feature obtained in (5.2) into the texture reinforcement learning model, outputting the Q values of all actions, selecting the action with the maximum Q value according to the exploitation strategy, and changing the size and position of the box according to the selected action;
(5.4) repeating the process of (5.2) and (5.3) a number of times, the number of repetitions being related to the change rate of the rectangular box, similarly to the repeated process in (3.11); the feature obtained after the last change is F_texture;
(5.5) testing the shape reinforcement learning model through a process similar to (5.1) to (5.4) to obtain F_shape;
(5.6) inputting the two different features F_texture and F_shape into the fusion model, which outputs the final prediction probability p_mix, wherein the fusion model is a trainable model that fuses F_texture and F_shape and then performs classification;
(5.7) the class with the highest probability in p_mix is the predicted class of the image to be detected.
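The reinforcement-learning core of claim 1 — exploration/exploitation over Q values in (3.5), the sign-based reward of (3.8), and the discounted Q_target of (3.9) — can be sketched as follows. This is an illustrative NumPy sketch only; the agent network itself, the size of the action space, and all numeric values here are assumptions rather than the patent's implementation.

```python
import numpy as np

GAMMA, EPSILON = 0.9, 0.1          # assumed preset discount and exploration rate

def select_action(q_values, epsilon, rng):
    """Exploration/exploitation as in step (3.5): with probability epsilon
    pick a random action, otherwise the action with the maximal Q value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def reward_from_preds(pred, pred_next, c):
    """Step (3.8) / claim 7: reward = sign(pred'[c] - pred[c])."""
    return float(np.sign(pred_next[c] - pred[c]))

def q_target(reward, q_values_next, gamma=GAMMA):
    """Target Q value of the selected action as in step (3.9):
    immediate reward plus discounted best Q of the new box state."""
    return reward + gamma * float(np.max(q_values_next))

rng = np.random.default_rng(1)
q_now  = np.array([0.2, 0.8, 0.1])   # Q values for the current box
q_next = np.array([0.5, 0.3, 0.9])   # Q values after moving the box
a = select_action(q_now, 0.0, rng)   # epsilon=0: pure exploitation -> action 1
r = reward_from_preds(np.array([0.4]), np.array([0.6]), 0)  # score rose -> +1
t = q_target(r, q_next)              # 1 + 0.9 * 0.9
```

In the full method, pairs (Feature, Q_target) would be stored in the experience pool and the agent trained on the MSE between Target and Q_eval, per steps (3.10)–(3.11).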
2. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1, wherein the training process of the base model in step (2.2) is specifically as follows: the feature map output by AdaAvgPool is compressed to one dimension to obtain the feature vector Feature, and the features before AdaAvgPool are sent to a classifier to obtain the classification prediction probability pred; the base model outputs Feature and pred, and its function is to extract features from the input image and predict classification probabilities.
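Compressing the pooled feature map into the one-dimensional vector Feature, as described in claim 2, amounts to averaging each channel over its spatial positions when the pool output size is 1×1. A minimal sketch (NumPy; the channel and spatial sizes are assumed toy values):

```python
import numpy as np

def ada_avg_pool_flatten(feature_map):
    """Adaptively average-pool a (C, H, W) feature map to (C, 1, 1)
    and compress it to a one-dimensional feature vector of length C."""
    return feature_map.mean(axis=(1, 2))

fmap = np.arange(24, dtype=float).reshape(2, 3, 4)   # toy 2-channel map
feature = ada_avg_pool_flatten(fmap)                 # Feature vector, length 2
```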
3. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the classifier in step (2.2) is a fully connected layer.
4. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein step (3.4) is specifically as follows: Feature is input into the action selection submodule of the texture reinforcement learning model, which is provided with an agent network; the network consists of several fully connected layers with the ReLU function as the activation function, and its last layer converts the feature dimension into the number of actions in the action space, so that the output is the Q value of each action in the action space; the action space is a set of actions whose purpose is to change the position or size of the rectangular box; the Q value of an action is a quantitative evaluation of the effect of taking that action at a given position and thereby moving the box to another position: the larger the Q value, the better the classification with the repositioned box, and the lower the Q value, the worse the classification.
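The agent network of claim 4 — several fully connected layers with ReLU activations, the last layer mapping the feature dimension to the number of actions — can be sketched as a plain forward pass. NumPy is used for illustration, and the layer sizes and number of actions below are assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def agent_forward(feature, weights, biases):
    """Forward pass of the agent network: ReLU after every layer
    except the last, whose output is one Q value per action."""
    x = feature
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(W @ x + b)
    W, b = weights[-1], biases[-1]
    return W @ x + b                       # Q value of each action

rng = np.random.default_rng(2)
feat_dim, hidden, n_actions = 8, 16, 6     # assumed sizes
Ws = [rng.normal(size=(hidden, feat_dim)),
      rng.normal(size=(n_actions, hidden))]
bs = [np.zeros(hidden), np.zeros(n_actions)]
q = agent_forward(rng.normal(size=feat_dim), Ws, bs)
```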
5. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the actions in step (3.4) comprise translation in various directions, zooming in, and zooming out.
6. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the change coefficient α ∈ (0, 1) in step (3.5).
7. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein step (3.8) is specifically as follows: the prediction probability pred contains a prediction score for each category, the score representing the probability of predicting that category; the higher the probability corresponding to c, the better the model predicts this sample, so whether to reward or punish can be judged from the prediction probabilities and the label, and reward = sign(pred′[c] − pred[c]).
8. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the loss in step (3.11) uses the MSE function, expressed as loss = (Target − Q_eval)².
9. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the fusion model in step (5.6) concatenates the two features and then outputs the classification probabilities of all classes through a fully connected layer.
10. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the value range of n is 5-10 and the value range of m is 4-7.
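The box-generation part of data enhancement step (2.1), with m in the range claimed above, can be sketched as follows. This is an assumed minimal version: square images and square boxes are used for simplicity, which the claims do not require.

```python
import numpy as np

def random_crop_boxes(image_side, m, rng):
    """Generate m square boxes with side length greater than half the
    image side and at most the image side, placed at random positions
    fully inside the image, as in step (2.1)."""
    boxes = []
    for _ in range(m):
        side = int(rng.integers(image_side // 2 + 1, image_side + 1))
        x = int(rng.integers(0, image_side - side + 1))
        y = int(rng.integers(0, image_side - side + 1))
        boxes.append((x, y, side))
    return boxes

rng = np.random.default_rng(3)
boxes = random_crop_boxes(224, m=4, rng=rng)   # m = 4, within the claimed 4-7
```

Each (x, y, side) triple then defines one crop whose label is inherited from the original image.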
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911038698.5A CN110826609B (en) | 2019-10-29 | 2019-10-29 | Double-current feature fusion image identification method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110826609A true CN110826609A (en) | 2020-02-21 |
CN110826609B CN110826609B (en) | 2023-03-24 |
Family
ID=69550977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911038698.5A Active CN110826609B (en) | 2019-10-29 | 2019-10-29 | Double-current feature fusion image identification method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110826609B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597865A (en) * | 2020-12-16 | 2021-04-02 | 燕山大学 | Intelligent identification method for edge defects of hot-rolled strip steel |
CN113128522A (en) * | 2021-05-11 | 2021-07-16 | 四川云从天府人工智能科技有限公司 | Target identification method and device, computer equipment and storage medium |
CN113240573A (en) * | 2020-10-26 | 2021-08-10 | 杭州火烧云科技有限公司 | Local and global parallel learning-based style transformation method and system for ten-million-level pixel digital image |
CN114742800A (en) * | 2022-04-18 | 2022-07-12 | 合肥工业大学 | Reinforced learning fused magnesia furnace working condition identification method based on improved Transformer |
TWI801038B (en) * | 2021-12-16 | 2023-05-01 | 新加坡商鴻運科股份有限公司 | Defect detection method, system, electronic device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104766080A (en) * | 2015-05-06 | 2015-07-08 | 苏州搜客信息技术有限公司 | Image multi-class feature recognizing and pushing method based on electronic commerce |
CN108805798A (en) * | 2017-05-05 | 2018-11-13 | 英特尔公司 | Fine granularity for deep learning frame calculates communication and executes |
CN109814565A (en) * | 2019-01-30 | 2019-05-28 | 上海海事大学 | The unmanned boat intelligence navigation control method of space-time double fluid data-driven depth Q study |
CN110135502A (en) * | 2019-05-17 | 2019-08-16 | 东南大学 | A kind of image fine granularity recognition methods based on intensified learning strategy |
CN110348355A (en) * | 2019-07-02 | 2019-10-18 | 南京信息工程大学 | Model recognizing method based on intensified learning |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104766080A (en) * | 2015-05-06 | 2015-07-08 | 苏州搜客信息技术有限公司 | Image multi-class feature recognizing and pushing method based on electronic commerce |
CN108805798A (en) * | 2017-05-05 | 2018-11-13 | 英特尔公司 | Fine granularity for deep learning frame calculates communication and executes |
CN109814565A (en) * | 2019-01-30 | 2019-05-28 | 上海海事大学 | The unmanned boat intelligence navigation control method of space-time double fluid data-driven depth Q study |
CN110135502A (en) * | 2019-05-17 | 2019-08-16 | 东南大学 | A kind of image fine granularity recognition methods based on intensified learning strategy |
CN110348355A (en) * | 2019-07-02 | 2019-10-18 | 南京信息工程大学 | Model recognizing method based on intensified learning |
Non-Patent Citations (1)
Title |
---|
XIANGTENG HE et al.: "Fine-grained Image Classification via Combining Vision and Language", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113240573A (en) * | 2020-10-26 | 2021-08-10 | 杭州火烧云科技有限公司 | Local and global parallel learning-based style transformation method and system for ten-million-level pixel digital image |
CN112597865A (en) * | 2020-12-16 | 2021-04-02 | 燕山大学 | Intelligent identification method for edge defects of hot-rolled strip steel |
CN113128522A (en) * | 2021-05-11 | 2021-07-16 | 四川云从天府人工智能科技有限公司 | Target identification method and device, computer equipment and storage medium |
CN113128522B (en) * | 2021-05-11 | 2024-04-05 | 四川云从天府人工智能科技有限公司 | Target identification method, device, computer equipment and storage medium |
TWI801038B (en) * | 2021-12-16 | 2023-05-01 | 新加坡商鴻運科股份有限公司 | Defect detection method, system, electronic device and storage medium |
CN114742800A (en) * | 2022-04-18 | 2022-07-12 | 合肥工业大学 | Reinforced learning fused magnesia furnace working condition identification method based on improved Transformer |
CN114742800B (en) * | 2022-04-18 | 2024-02-20 | 合肥工业大学 | Reinforced learning electric smelting magnesium furnace working condition identification method based on improved converter |
Also Published As
Publication number | Publication date |
---|---|
CN110826609B (en) | 2023-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110826609B (en) | Double-current feature fusion image identification method based on reinforcement learning | |
CN110322446B (en) | Domain self-adaptive semantic segmentation method based on similarity space alignment | |
CN110837836B (en) | Semi-supervised semantic segmentation method based on maximized confidence | |
CN109840531B (en) | Method and device for training multi-label classification model | |
CN110222770B (en) | Visual question-answering method based on combined relationship attention network | |
CN110046550B (en) | Pedestrian attribute identification system and method based on multilayer feature learning | |
US20210081695A1 (en) | Image processing method, apparatus, electronic device and computer readable storage medium | |
CN115858847B (en) | Combined query image retrieval method based on cross-modal attention reservation | |
CN112257758A (en) | Fine-grained image recognition method, convolutional neural network and training method thereof | |
CN112070040A (en) | Text line detection method for video subtitles | |
CN111274981A (en) | Target detection network construction method and device and target detection method | |
CN115222998B (en) | Image classification method | |
CN115131797A (en) | Scene text detection method based on feature enhancement pyramid network | |
CN112527993A (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
CN116912708A (en) | Remote sensing image building extraction method based on deep learning | |
CN114510594A (en) | Traditional pattern subgraph retrieval method based on self-attention mechanism | |
CN115908806A (en) | Small sample image segmentation method based on lightweight multi-scale feature enhancement network | |
Fan et al. | A novel sonar target detection and classification algorithm | |
CN113435461B (en) | Point cloud local feature extraction method, device, equipment and storage medium | |
CN112148994B (en) | Information push effect evaluation method and device, electronic equipment and storage medium | |
CN114972959B (en) | Remote sensing image retrieval method for sample generation and in-class sequencing loss in deep learning | |
CN116680578A (en) | Cross-modal model-based deep semantic understanding method | |
CN114842488A (en) | Image title text determination method and device, electronic equipment and storage medium | |
CN114202765A (en) | Image text recognition method and storage medium | |
CN113743497A (en) | Fine granularity identification method and system based on attention mechanism and multi-scale features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||