CN110826609B - Double-current feature fusion image identification method based on reinforcement learning - Google Patents

Double-current feature fusion image identification method based on reinforcement learning

Info

Publication number
CN110826609B
CN110826609B (application CN201911038698.5A)
Authority
CN
China
Prior art keywords
image
feature
model
reinforcement learning
texture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911038698.5A
Other languages
Chinese (zh)
Other versions
CN110826609A (en
Inventor
冯镔
唐哲
王豪
李亚婷
朱多旺
刘文予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201911038698.5A priority Critical patent/CN110826609B/en
Publication of CN110826609A publication Critical patent/CN110826609A/en
Application granted granted Critical
Publication of CN110826609B publication Critical patent/CN110826609B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a double-current feature fusion image identification method based on reinforcement learning. The method uses two models, a texture model and a shape model: the texture model classifies according to the texture information of the object in the image, and the shape model classifies according to the object's shape information. Both models use reinforcement learning to let the network find the most discriminative region in the whole image and then classify according to that region. The method is simple to implement and generalizes well; it locates regions that make images easy to distinguish, selects discriminative regions of appropriate size, makes full use of the texture and shape information in the image, and effectively mitigates the problems of under-utilized image information and small differences between images.

Description

Double-flow feature fusion image identification method based on reinforcement learning
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a double-flow feature fusion image identification method based on reinforcement learning.
Background
Image recognition has many applications in daily life, such as intelligent security, biomedicine, e-commerce shopping, autonomous driving, and smart homes. Image recognition studies how to identify the category to which a sample belongs from a set of candidate categories. Many difficulties remain, such as small differences between images and strong background interference.
Current image recognition methods generally feed the image directly into a convolutional neural network for feature extraction and then perform classification. Although various operations may follow feature extraction, most of the extracted features actually describe the texture of the image. Such approaches share a drawback: shape information is not fully exploited, so information useful for recognizing the image cannot be completely extracted. In addition, to reduce background interference, current approaches generate candidate boxes; however, they generate a large number of candidate boxes, require long computation time, lack a clear target, and cannot find the regions that truly help image classification.
Therefore, there is a need for a double-flow feature fusion image recognition method that can fuse the texture information and shape information of an image while keeping computation efficient.
Disclosure of Invention
The invention aims to provide a double-flow feature fusion image identification method based on reinforcement learning that can effectively find the most informative regions containing texture information and shape information respectively, reduce the influence of background and irrelevant information, and effectively improve recognition accuracy. The method comprises the following steps:
(1) Generating a shape data set:
Each image is input into an image conversion model, which outputs n pictures that are similar in shape but different in texture. n is a preset value; a larger n gives a better learning effect but increases training time, and values of 5-10 are typically tried. The labels of the n converted images are the same as the label of the input image. The shape data set is the data set paired with the original data set in which texture information is reduced and shape information is emphasized. The purpose of generating this paired data set is to let the subsequently trained model learn the shape information of the images.
(2) Training a texture basic model and a shape basic model:
(2.1) Perform data augmentation on each image of the original data set and the shape data set separately: for each image, generate m rectangular boxes at random positions, where m is a preset value. m may range from 4 to 7; too large an m produces too much data and costs too much time. The side length of each box is generally larger than 1/2 of the image side length and no larger than the image side length. The label of each cropped image is the same as the label of the original image.
(2.2) Train a basic model using the cropped images: the texture basic model is trained with the original data set and the shape basic model with the shape data set; the two basic models have the same structure. An adaptive average pooling layer (AdaAvgPool) is added after the last block of the ResNet50 network; this layer pools the feature map to reduce its size. The feature map output by AdaAvgPool is flattened into a one-dimensional Feature vector, and the features before AdaAvgPool are sent to a classifier to obtain the classification prediction probability pred; the classifier may be a fully connected layer. The basic model outputs Feature and pred; its role is to extract features from the input image and predict the classification probability.
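For illustration only, the following Python sketch shows one way this shared base-model structure could be implemented, assuming PyTorch/torchvision, an input of roughly 224x224, and the 251 iFood classes mentioned later in the description; the pooled size and the way the pre-pooling features reach the classifier are assumptions not fixed by the text.

import torch
import torch.nn as nn
import torchvision.models as models

class BaseModel(nn.Module):
    """Shared structure of the texture and shape basic models."""
    def __init__(self, num_classes=251, pooled_size=7):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Everything up to and including the last residual block of ResNet50.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Adaptive average pooling layer (AdaAvgPool) added after the last block.
        self.ada_avg_pool = nn.AdaptiveAvgPool2d(pooled_size)
        # Classifier on the pre-AdaAvgPool features: global pooling followed by
        # a fully connected layer (one possible reading of the text).
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x):
        fmap = self.backbone(x)                       # B x 2048 x H x W
        feature = self.ada_avg_pool(fmap).flatten(1)  # Feature vector
        logits = self.classifier(self.global_pool(fmap).flatten(1))
        pred = torch.softmax(logits, dim=1)           # classification prediction probability
        return feature, pred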
(3) Training a texture reinforcement learning model:
(3.1) Read an image image_global and its category label c. Initialize a rectangular box with the same size as the read image.
(3.2) The position and size of the rectangular box are changed several times during the whole process. If the box size equals the image size, i.e. image_local = image_global, jump to (3.3). If the box is smaller than the image, crop the image to the box and then upsample the crop to the original image size to obtain the processed image_local; bilinear interpolation may be used for upsampling.
(3.3) Input image_local into the texture basic model to obtain the Feature and the classification prediction probability pred.
(3.4) Input the Feature into the texture reinforcement learning model, which consists of several fully connected layers with the ReLU activation function, defined as f(x) = max(0, x). The last layer converts the feature dimension into the number of actions in the action space, and the output is the Q value of each action in the action space. The action space is a set of actions whose purpose is to change the position or size of the rectangular box; the actions may include translation, enlargement, and reduction in each direction. The Q value of an action quantifies the effect, with respect to our goal (classification), of the box moving from one position to another after taking that action. A larger Q value means the moved box improves classification; a lower Q value means the moved box worsens it.
(3.5) To determine the action, this step uses two strategies, exploration and exploitation, and chooses one of them. A value explore_rate ∈ (0, 1) is preset; explore_rate is the probability of choosing exploration, and 1 - explore_rate is the probability of choosing exploitation. Exploration randomly selects one action from all actions; exploitation selects the action with the maximum Q value obtained in (3.4). After choosing between exploration and exploitation, the action is determined, and the box's size or position is changed according to the selected action and a change coefficient α to obtain a new rectangular box box'. The change coefficient α ∈ (0, 1); 0.1 may be chosen. It represents the ratio of each change. For example, if the action is "enlarge to the right", box' is obtained by enlarging box to the right to 1.1 times its size; if the action is "shrink from the left", box' is obtained by reducing box from the left to 0.9 times its size.
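For illustration only, the following Python sketch shows one possible reading of the exploration/exploitation choice and of applying an action to the box, assuming the box is stored as (x, y, w, h) in pixels; the action names and their exact geometric effect are assumptions, since the text only requires translations and scalings in each direction with ratio α (clipping to the image bounds is omitted).

import random
import torch

ACTIONS = ["right_bigger", "left_smaller", "up", "down", "left", "right"]

def select_action(q_values, explore_rate=0.1):
    """q_values: 1-D tensor with one Q value per action in ACTIONS."""
    if random.random() < explore_rate:            # exploration
        return random.randrange(len(ACTIONS))
    return int(torch.argmax(q_values).item())     # exploitation

def apply_action(box, action_idx, alpha=0.1):
    """Change the (x, y, w, h) box according to the chosen action and ratio alpha."""
    x, y, w, h = box
    name = ACTIONS[action_idx]
    if name == "right_bigger":       # enlarge to the right to (1 + alpha) times
        w = w * (1 + alpha)
    elif name == "left_smaller":     # shrink from the left to (1 - alpha) times
        x, w = x + alpha * w, w * (1 - alpha)
    elif name == "right":
        x = x + alpha * w
    elif name == "left":
        x = x - alpha * w
    elif name == "up":
        y = y - alpha * h
    elif name == "down":
        y = y + alpha * h
    return (x, y, w, h)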
(3.6) Using the new rectangular box obtained in (3.5), follow the feature-extraction procedure of (3.2) and (3.3) to obtain another feature Feature' and prediction probability pred'.
(3.8) Based on pred, pred' from (3.6), and c from (3.1), make the following judgment: reward = -1 if the prediction score of pred on category c is higher than that of pred' on category c; reward = 1 if the prediction score of pred on category c is lower than that of pred' on category c.
(3.9) Update the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8). The update rule is Q_target = reward + γ max(Q(s, a)), where Q(s, a) is the Q value after taking action a in state s, i.e. the Feature. γ is a preset coefficient (effectively a discount factor) applied at each Q-value update; γ = 0.9 may be chosen.
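As a small illustrative sketch of this update rule, assuming Q(s, a) is taken as the agent network's output for the Feature of the new box (a standard DQN-style reading; the text itself does not pin down which state is queried):

import torch

def q_target(reward, q_values_next, gamma=0.9):
    # Q_target = reward + gamma * max_a Q(s, a)
    return reward + gamma * torch.max(q_values_next).item()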
(3.10) Store the Feature and the Q_target obtained in (3.9) into an experience pool. The experience pool is a measure for reducing the correlation between samples: the paired Feature and Q_target are first stored in the pool, and once the pool has accumulated a certain amount of data, samples are drawn from it at random to train the model.
(3.11) Take the new rectangular box as the current box (box = box'), the new Feature as the current Feature (Feature = Feature'), and the new classification prediction probability as the current one (pred = pred').
(3.12) Repeat the process from (3.4) to (3.11) a certain number of times. This continually adjusts the size and position of the box; the number of repetitions can be set according to the change rate chosen in (3.5): with a large change rate fewer repetitions are needed, and with a small change rate more repetitions are needed.
(3.13) Once the experience pool holds a certain number of samples, randomly select paired Feature and Q_target data from it, denoted Feature_s and Target. Input Feature_s into the texture reinforcement learning model being trained and take the output Q value of the corresponding action, denoted Q_eval. The difference between Target and Q_eval is used as the loss and back-propagated to update the parameters. The mean squared error (MSE) may be chosen as the loss, i.e. loss = (Target - Q_eval)^2.
(4) Training a shape reinforcement learning model:
and (4.1) training the shape reinforcement learning model by using the shape data set according to the step in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning process, and the structure of the shape reinforcement learning model is the same as that of the texture reinforcement learning model.
(5) Perform double-flow prediction and fusion on the image to be tested using the two trained models, with the following sub-steps:
(5.1) Read the image to be tested, image_global. Initialize a rectangular box with the same size as the read image.
(5.2) Apply the feature extraction of steps (3.2) and (3.3) to the image to obtain the Feature and the classification prediction probability pred at the box's current position.
(5.3) Input the Feature obtained in (5.2) into the texture reinforcement learning model, output the Q values of all actions, select the action with the maximum Q value according to the exploitation strategy, and change the size and position of the box according to the selected action.
(5.4) Repeat (5.2) and (5.3) a number of times; as in (3.12), the number of repetitions is related to the change rate of the rectangular box. The final change yields the feature F_texture.
(5.5) Test the shape reinforcement learning model following a process similar to (5.1) to (5.4) to obtain F_shape.
(5.6) Input the two different features F_texture and F_shape into the fusion model, which outputs the final prediction probability p_mix. The fusion model is a trainable model that fuses F_texture and F_shape and then classifies. For example, the fusion model may concatenate the two features and use a fully connected layer to output the classification probabilities of all classes.
(5.7) The class with the highest probability in p_mix is the predicted class.
Compared with the prior art, the technical scheme of the invention has the following technical effects:
(1) Simple and effective structure: unlike prior art that extracts only texture information with a convolutional neural network, the method extracts texture information and shape information separately through a double-flow design, and uses reinforcement learning to search for the discriminative regions of texture and shape respectively, giving a clear, simple, and effective structure;
(2) High accuracy: unlike methods that generate proposals, the optimal region in the image is found by reinforcement learning rather than by picking the best-performing region from a set of proposals, which reduces the cost of model learning and better matches the process of searching for a discriminative region; and unlike prior art that uses only texture information, using both texture and shape information fully mines the information contained in the image, so the accuracy is higher;
(3) Strong robustness: the texture reinforcement learning model focuses more on texture information and the shape reinforcement learning model focuses more on shape information; by attending to these two kinds of information separately, the network can adapt to different images and its performance is more robust.
Drawings
FIG. 1 is a flow chart of a double-flow feature fusion image recognition method based on reinforcement learning according to the present invention;
FIG. 2 is a schematic diagram of a reinforcement learning model implementation framework of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical terms of the present invention are explained and explained first:
iFood dataset: a data set used in a competition held on Kaggle. It contains 251 fine-grained (prepared) food categories, with a total of 120,216 images collected from the web as the training set, 12,170 images as the validation set, and 28,399 images as the test set, with manually verified labels; each image contains a single food category.
ResNet-50: a neural network for classification, consisting mainly of 50 convolutional layers plus pooling layers and shortcut connections. The convolutional layers extract image features; the pooling layers reduce the dimensionality of the feature maps output by the convolutional layers and reduce overfitting; the shortcut connections propagate gradients and alleviate the vanishing and exploding gradient problems. The network parameters are updated by back-propagation;
Image conversion model: a model using the structure of a Generative Adversarial Network (GAN), comprising a generator and a discriminator, which can transform the style of an image without changing its content.
As shown in fig. 1, the present invention provides a double-flow feature fusion image recognition method based on reinforcement learning, which includes the following steps:
(1) Generating a shape data set:
Each image is input into an image conversion model, which outputs n pictures that are similar in shape but different in texture. n is a preset value; a larger n gives a better learning effect but increases training time, and values of 5-10 are typically tried. The labels of the n converted images are the same as the label of the input image. The shape data set is the data set paired with the original data set in which texture information is reduced and shape information is emphasized. The purpose of generating this paired data set is to let the subsequently trained model learn the shape information of the images.
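For illustration only, the following Python sketch shows how the shape data set could be assembled, with style_transfer standing in for the GAN-based image conversion model; the function name, file layout, and use of PIL are assumptions, not an API from the patent.

import os
from PIL import Image

def build_shape_dataset(image_label_pairs, style_transfer, out_dir, n=5):
    """Generate n texture-altered, shape-preserving variants per image,
    each keeping the original label."""
    os.makedirs(out_dir, exist_ok=True)
    shape_samples = []
    for path, label in image_label_pairs:
        img = Image.open(path).convert("RGB")
        for k in range(n):
            converted = style_transfer(img)   # same shape, different texture
            out_path = os.path.join(out_dir, f"{label}_{os.path.basename(path)}_{k}.jpg")
            converted.save(out_path)
            shape_samples.append((out_path, label))   # label unchanged
    return shape_samples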
(2) Training a texture basic model and a shape basic model:
(2.1) Perform data augmentation on each image of the original data set and the shape data set separately, as follows: for each image, generate m rectangular boxes at random positions, where m is a preset value. m may range from 4 to 7; too large an m produces too much data and costs too much time. The side length of each box is generally larger than 1/2 of the image side length and no larger than the image side length. The label of each cropped image is the same as the label of the original image.
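For illustration only, a minimal Python sketch of this random-crop augmentation is given below; square crops and the use of PIL are assumptions, since the text only constrains the side length.

import random
from PIL import Image

def random_crops(img: Image.Image, m=5):
    """Generate m random square crops whose side is between 1/2 and 1 of the
    shorter image side; each crop keeps the original image's label."""
    w, h = img.size
    crops = []
    for _ in range(m):
        side = random.randint(min(w, h) // 2, min(w, h))
        x = random.randint(0, w - side)
        y = random.randint(0, h - side)
        crops.append(img.crop((x, y, x + side, y + side)))
    return crops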
(2.2) Train a basic model using the cropped images: the texture basic model is trained with the original data set and the shape basic model with the shape data set; the two basic models have the same structure. An adaptive average pooling layer (AdaAvgPool) is added after the last block of the ResNet50 network; this layer pools the feature map to reduce its size. The feature map output by AdaAvgPool is flattened into a one-dimensional Feature vector, and the features before AdaAvgPool are sent to a classifier to obtain the classification prediction probability pred; the classifier may be a fully connected layer. The basic model outputs Feature and pred; its role is to extract features from the input image and predict the classification probability.
(3) Training a texture reinforcement learning model:
(3.1) Read an image image_global and its category label c. Initialize a rectangular box with the same size as the read image.
(3.2) The position and size of the rectangular box are changed several times during the whole process. If the box size equals the image size, i.e. image_local = image_global, jump to (3.3). If the box is smaller than the image, crop the image to the box and then upsample the crop to the original image size to obtain the processed image_local; bilinear interpolation may be used for upsampling. The purpose of this step is: if the box is still the initialized box, do nothing and proceed to (3.3); if the box is smaller than the initialized box, crop the box region and upsample it to the original input size, so that all images fed to the neural network have the same size.
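For illustration only, the following Python sketch performs the crop-and-upsample of this step with PIL and bilinear resampling; the (x, y, w, h) box convention is an assumption.

from PIL import Image

def crop_and_upsample(image_global: Image.Image, box):
    """Crop the box region and resample it to the original resolution with
    bilinear interpolation so all inputs to the base model share one size."""
    x, y, w, h = [int(round(v)) for v in box]
    if (w, h) == image_global.size:
        return image_global                       # image_local == image_global
    patch = image_global.crop((x, y, x + w, y + h))
    return patch.resize(image_global.size, resample=Image.BILINEAR)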
(3.3) As shown in FIG. 2, input image_local into the texture basic model to obtain two outputs: the Feature and the classification prediction probability pred.
(3.4) Input the Feature directly into the reinforcement learning model, which through forward propagation (convolution, pooling, and similar operations) outputs the Q value of each action in the action space. Specifically, the Feature is input into the action-selection submodule of the texture reinforcement learning model, which contains an agent network composed of several fully connected layers with the ReLU activation function, defined as f(x) = max(0, x). The last layer converts the feature dimension into the number of actions in the action space, and the output is the Q value of each action. The action space is a set of actions whose purpose is to change the position or size of the rectangular box; the actions may include translation, enlargement, and reduction in each direction. The Q value of an action quantifies the effect, with respect to our goal (classification), of the box moving from one position to another after taking that action. A larger Q value means the moved box improves classification; a lower Q value means the moved box worsens it.
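For illustration only, the agent network described here could be sketched in PyTorch as below; the number of layers and the hidden size are assumptions, as the text only specifies fully connected layers with ReLU and a final layer sized to the action space.

import torch.nn as nn

class AgentNet(nn.Module):
    """Agent network: fully connected layers with ReLU, outputting one Q value per action."""
    def __init__(self, feature_dim, num_actions, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),                       # f(x) = max(0, x)
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions)   # Q value for each action
        )

    def forward(self, feature):
        return self.net(feature)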
(3.5) In the action-selection submodule, to determine the action, this step uses two strategies, exploration and exploitation, and chooses one of them. A value explore_rate ∈ (0, 1) is preset; explore_rate is the probability of choosing exploration, and 1 - explore_rate is the probability of choosing exploitation. Exploration randomly selects one action from all actions; exploitation selects the action with the maximum Q value obtained in (3.4). After choosing between exploration and exploitation, the action is determined, and the box's size or position is changed according to the selected action and a change coefficient α to obtain a new rectangular box box'. The change coefficient α ∈ (0, 1); 0.1 may be chosen. It represents the ratio of each change. For example, if the action is "enlarge to the right", box' is obtained by enlarging box to the right to 1.1 times its size; if the action is "shrink from the left", box' is obtained by reducing box from the left to 0.9 times its size.
(3.6) Using the new rectangular box obtained in (3.5), follow the feature-extraction procedure of (3.2) and (3.3) to obtain another feature Feature' and prediction probability pred'. The purpose of this step is that, after the position of the rectangular box changes, the same feature-extraction operation can be applied: the region corresponding to the box is cropped, upsampled, and input into the basic model to obtain the feature and the prediction probability.
(3.8) Based on pred, pred' from (3.6), and c from (3.1), make the following judgment: reward = -1 if the prediction score of pred on category c is higher than that of pred' on category c; reward = 1 if it is lower. Specifically, the prediction probability pred contains a prediction score for every category, and the score indicates the probability of predicting that category, so a higher probability for c means the model predicts this sample better. The model can therefore be rewarded or punished according to the two prediction probabilities and the label: reward = sign(pred'[c] - pred[c]).
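A minimal sketch of this reward, assuming pred and pred' are indexable probability vectors and treating the tie case as zero (a choice the text does not specify):

def compute_reward(pred, pred_new, c):
    """reward = sign(pred'[c] - pred[c]): +1 if the new box raises the predicted
    probability of the true class c, -1 if it lowers it (0 if unchanged)."""
    diff = pred_new[c] - pred[c]
    return 1 if diff > 0 else (-1 if diff < 0 else 0)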
(3.9) In the Q-value update submodule, update the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8). The update rule is Q_target = reward + γ max(Q(s, a)), where Q(s, a) is the Q value after taking action a in state s, i.e. the Feature. γ is a preset coefficient (effectively a discount factor) applied at each Q-value update; γ = 0.9 may be chosen.
(3.10) Store the Feature and the Q_target obtained in (3.9) into an experience pool. The experience pool is a measure for reducing the correlation between samples: the paired Feature and Q_target are first stored in the pool, and once the pool has accumulated a certain amount of data, samples are drawn from it at random to train the model.
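For illustration only, the experience pool could be sketched as below; storing the chosen action alongside the (Feature, Q_target) pair is an assumption added so that Q_eval of that action can be read out later, and the capacity and batch size are likewise assumptions.

import random
from collections import deque

class ExperiencePool:
    """Store (Feature, Q_target, action) tuples and sample them at random to
    reduce the correlation between consecutive training samples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, feature, q_target, action):
        self.buffer.append((feature, q_target, action))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)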
(3.11) Take the new rectangular box as the current box (box = box'), the new Feature as the current Feature (Feature = Feature'), and the new classification prediction probability as the current one (pred = pred').
(3.12) Repeat the process from (3.4) to (3.11) a certain number of times. This continually adjusts the size and position of the box; the number of repetitions can be set according to the change rate chosen in (3.5): with a large change rate fewer repetitions are needed, and with a small change rate more repetitions are needed.
(3.13) In the Q-value evaluation submodule, once the experience pool holds a certain number of samples, randomly select paired Feature and Q_target data from it, denoted Feature_s and Target. Input Feature_s into the texture reinforcement learning model being trained and take the output Q value of the corresponding action, denoted Q_eval. The difference between Target and Q_eval is used as the loss and back-propagated to update the parameters. The mean squared error (MSE) may be chosen as the loss, i.e. loss = (Target - Q_eval)^2.
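For illustration only, one training step could be sketched as below, reusing the ExperiencePool sketch above; the batch size of one and the use of torch.nn.functional.mse_loss are assumptions consistent with the MSE loss named in the text.

import random
import torch
import torch.nn.functional as F

def train_step(agent, optimizer, pool):
    """One update: draw a stored (Feature_s, Target, action) tuple, take the
    agent's Q value for that action as Q_eval, and minimize (Target - Q_eval)^2."""
    feature_s, target, action = random.choice(list(pool.buffer))
    q_eval = agent(feature_s)[action]
    loss = F.mse_loss(q_eval, torch.as_tensor(target, dtype=q_eval.dtype))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()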
(4) Training a shape reinforcement learning model:
and (4.1) training the shape reinforcement learning model by using the shape data set according to the step in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning process, and the structure of the shape reinforcement learning model is the same as that of the texture reinforcement learning model. And (3) and (4) dividing the information in the data set into texture and shape, and respectively learning two different information of the data set by using two streams of ideas. The model structure of both streams is the same, and the data sets used for training are pairs of data sets that have been pre-processed. The training process is the same.
(5) Perform double-flow prediction and fusion on the image to be tested using the two trained models, with the following sub-steps:
(5.1) Read the image to be tested, image_global. Initialize a rectangular box with the same size as the read image.
(5.2) Apply the feature extraction of steps (3.2) and (3.3) to the image to obtain the Feature and the classification prediction probability pred at the box's current position.
(5.3) Input the Feature obtained in (5.2) into the texture reinforcement learning model, output the Q values of all actions, select the action with the maximum Q value according to the exploitation strategy, and change the size and position of the box according to the selected action.
(5.4) Repeat (5.2) and (5.3) a number of times; as in (3.12), the number of repetitions is related to the change rate of the rectangular box. The final change yields the feature F_texture.
(5.5) Test the shape reinforcement learning model following a process similar to (5.1) to (5.4) to obtain F_shape.
(5.6) Input the two different features F_texture and F_shape into the fusion model, which outputs the final prediction probability p_mix. The fusion model is a trainable model that fuses F_texture and F_shape and then classifies. For example, the fusion model may concatenate the two features and use a fully connected layer to output the classification probabilities of all classes. Feature fusion can be attempted in various ways, and fusion can even be performed directly on the prediction scores, but fusing scores does not perform as well as fusing features. The fusion model is trained on the two input features using the labels corresponding to the images.
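For illustration only, the concatenation-plus-fully-connected variant of the fusion model mentioned here could be sketched as follows; the single linear layer and softmax output are assumptions.

import torch
import torch.nn as nn

class FusionModel(nn.Module):
    """Concatenate F_texture and F_shape and classify with a fully connected layer."""
    def __init__(self, texture_dim, shape_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(texture_dim + shape_dim, num_classes)

    def forward(self, f_texture, f_shape):
        fused = torch.cat([f_texture, f_shape], dim=-1)
        return torch.softmax(self.fc(fused), dim=-1)   # p_mix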
(5.7) The class with the highest probability in p_mix is the predicted class.
The effectiveness of the invention is demonstrated by the following experiment; the results show that the invention improves the accuracy of image recognition.
The invention is compared on the iFood dataset against the basic network we use. Table 1 shows the accuracy of the proposed method on this data set, where Backbone denotes the basic model ResNet50 we use and DQN denotes the reinforcement learning model we use. A larger value means higher recognition accuracy, and the table shows that the improvement brought by the method is significant.
TABLE 1 precision on iFood dataset
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A double-flow feature fusion image identification method based on reinforcement learning is characterized by comprising the following steps:
(1) Generating a shape data set:
inputting each image into an image conversion model, outputting n images which are similar in shape and different in texture, wherein the labels of the n converted images are the same as those of the input image, and n is a preset value;
(2) Training a texture basic model and a shape basic model:
(2.1) performing data enhancement on each image of the original dataset and the shape dataset separately: for an image, generating m rectangular frames at random positions in the image, wherein the side length of each frame is greater than 1/2 of the side length of the image and less than or equal to the side length of the image, the label of the cut image is consistent with that of the original image, and m is a preset value;
(2.2) training a basic model using the cropped images, wherein the texture basic model is trained using the original data set and the shape basic model is trained using the shape data set; the texture basic model and the shape basic model have the same structure, in which an adaptive average pooling layer AdaAvgPool is newly added after the last block of the ResNet50 network and is used to pool the feature map to reduce its size; the basic model is trained using the images and labels and is subsequently used to extract features from and make predictions on images;
(3) Training a texture reinforcement learning model:
(3.1) reading an image image_global and the corresponding category label c, and initializing a rectangular box with the same size as the read image;
(3.2) if the size of the rectangular box is equal to the size of the image, jumping to (3.3); if the rectangular box is smaller than the image, cropping the image according to the box and then upsampling it to the same size as the original image to obtain the processed image_local;
(3.3) inputting image_local into the texture basic model to obtain the Feature and the classification prediction probability pred;
(3.4) inputting the Feature into a texture reinforcement learning model, wherein the output of the texture reinforcement learning model is the Q value of each action in the action space;
(3.5) determining the action by two strategies, exploration and exploitation, one of which is selected: exploration randomly selects one action from all the actions, and exploitation selects the action corresponding to the maximum Q value obtained in (3.4); after one of exploration and exploitation is selected, the action is determined, and the size or position of the box is changed according to the selected action and the change coefficient α to obtain a new rectangular box box';
(3.6) using the new rectangular box obtained in (3.5), obtaining another feature Feature' and prediction probability pred' according to the feature-extraction process of (3.2) and (3.3);
(3.8) according to pred, pred' from (3.6), and c from (3.1), making the following judgment: reward = -1 if the prediction score of pred on category c is higher than that of pred' on category c, and reward = 1 if the prediction score of pred on category c is lower than that of pred' on category c;
(3.9) updating the Q value Q_target of the action selected in (3.5) according to the reward obtained in (3.8), the update rule being Q_target = reward + γ max(Q(s, a)), where Q(s, a) represents the Q value after taking an action in state s, i.e. the Feature, and γ is a preset coefficient applied at each Q-value update;
(3.10) storing the Feature and the Q_target obtained in (3.9) into an experience pool;
(3.11) taking the new rectangular box as the current box (box = box'), the new Feature as the current Feature (Feature = Feature'), and the new classification prediction probability as the current classification prediction probability (pred = pred'); repeating the process from (3.4) to (3.11) a preset number of times; and after a preset number of samples have been stored in the experience pool, randomly selecting paired Feature and Q_target data from it, denoted Feature_s and Target, inputting Feature_s into the texture reinforcement learning model, taking the output Q value of the obtained action, denoted Q_eval, using the difference between Target and Q_eval as the loss, back-propagating it, and updating the parameters;
(4) Training a shape reinforcement learning model:
(4.1) training a shape reinforcement learning model by using the shape data set according to the step in (3), wherein the training process of the shape reinforcement learning model is the same as that of the texture reinforcement learning process, and the structures of the shape reinforcement learning model and the texture reinforcement learning model are the same;
(5) The double-flow prediction and fusion of the test image to be detected by utilizing the two trained reinforced models comprises the following substeps:
(5.1) reading an image to be detected, and initializing a rectangular box with the size same as that of the image to be detected;
(5.2) carrying out the Feature extraction of the steps (3.2) and (3.3) on the image to be detected to obtain the Feature and the classification prediction probability pred of the corresponding position of the frame;
(5.3) inputting the features obtained in (5.2) into the texture reinforcement learning model, outputting the Q values of all actions, selecting the action with the maximum Q value according to the exploitation strategy, and changing the size and position of the box according to the selected action;
(5.4) repeating the process of (5.2) and (5.3) a number of times, the number of repetitions being related to the change rate of the rectangular box as in the repeated process of (3.12), and obtaining the feature F_texture after the last change;
(5.5) testing the shape reinforcement learning model with a process similar to (5.1) to (5.4) to obtain F_shape;
(5.6) inputting the two different features F_texture and F_shape into the fusion model, which outputs the final prediction probability p_mix, wherein the fusion model is a trainable model whose aim is to fuse F_texture and F_shape and then classify;
(5.7) the class with the highest probability in p_mix is the class predicted for the image to be detected.
2. The method for recognizing the double-flow feature fusion image based on the reinforcement learning according to claim 1, wherein the training process of the basic model in the step (2.2) is specifically as follows: compressing the Feature map output by AdaAvgPool to one dimension to obtain Feature vector Feature, sending the features before AdaAvgPool to a classifier to obtain a classification prediction probability pred, and outputting the Feature and the pred by a basic model to extract the features of an input image and predict the classification probability.
3. The reinforcement learning based dual-stream feature fusion image recognition method according to claim 2, wherein the classifier in step (2.2) uses a fully connected layer.
4. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the step (3.4) is specifically: inputting the Feature into an action-selection submodule of the texture reinforcement learning model, wherein the action-selection submodule contains an agent network composed of several fully connected layers with the ReLU activation function, defined as f(x) = max(0, x), the last layer converting the feature dimension into the number of actions in the action space, and the output being the Q value of each action in the action space; wherein the action space is a set of actions used to change the position or size of the rectangular box, the Q value of an action represents that the rectangular box moves to another position after the action is taken at a certain position, the process being a quantitative evaluation of the effect on the target, a larger Q value indicating that the moved box improves the classification effect and a lower Q value indicating that the moved box worsens it.
5. The method for dual-stream feature fusion image recognition based on reinforcement learning according to claim 1 or 2, wherein the action in step (3.4) comprises translation, zooming in or zooming out in various directions.
6. The reinforcement learning based dual-stream feature fusion image recognition method according to claim 1 or 2, wherein the change coefficient α ∈ (0, 1) in step (3.5).
7. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the step (3.8) is specifically: the prediction probability pred contains a prediction score for every category, the score representing the probability of predicting that category; a higher probability for c means the model predicts the sample better, so whether to reward or punish the model is judged from the two prediction probabilities and the label, and reward = sign(pred'[c] - pred[c]).
8. The dual-stream feature fusion image recognition method based on reinforcement learning according to claim 1 or 2, wherein the loss in step (3.11) uses the mean squared error (MSE) function, expressed as loss = (Target - Q_eval)^2.
9. The double-flow feature fusion image identification method based on reinforcement learning according to claim 1 or 2, wherein the fusion model in step (5.6) splices the two features together and then outputs the classification probabilities of all classes using a fully connected layer.
10. The dual-flow feature fusion image recognition method based on reinforcement learning according to claim 1 or 2, wherein the value range of n is 5-10, and the value range of m is 4-7.
CN201911038698.5A 2019-10-29 2019-10-29 Double-current feature fusion image identification method based on reinforcement learning Active CN110826609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911038698.5A CN110826609B (en) 2019-10-29 2019-10-29 Double-current feature fusion image identification method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911038698.5A CN110826609B (en) 2019-10-29 2019-10-29 Double-current feature fusion image identification method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN110826609A CN110826609A (en) 2020-02-21
CN110826609B true CN110826609B (en) 2023-03-24

Family

ID=69550977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911038698.5A Active CN110826609B (en) 2019-10-29 2019-10-29 Double-current feature fusion image identification method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN110826609B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240573B (en) * 2020-10-26 2022-05-13 杭州火烧云科技有限公司 High-resolution image style transformation method and system for local and global parallel learning
CN112597865A (en) * 2020-12-16 2021-04-02 燕山大学 Intelligent identification method for edge defects of hot-rolled strip steel
CN113128522B (en) * 2021-05-11 2024-04-05 四川云从天府人工智能科技有限公司 Target identification method, device, computer equipment and storage medium
TWI801038B (en) * 2021-12-16 2023-05-01 新加坡商鴻運科股份有限公司 Defect detection method, system, electronic device and storage medium
CN114742800B (en) * 2022-04-18 2024-02-20 合肥工业大学 Reinforced learning electric smelting magnesium furnace working condition identification method based on improved converter

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766080A (en) * 2015-05-06 2015-07-08 苏州搜客信息技术有限公司 Image multi-class feature recognizing and pushing method based on electronic commerce
CN108805798A (en) * 2017-05-05 2018-11-13 英特尔公司 Fine granularity for deep learning frame calculates communication and executes
CN109814565A (en) * 2019-01-30 2019-05-28 上海海事大学 The unmanned boat intelligence navigation control method of space-time double fluid data-driven depth Q study
CN110135502A (en) * 2019-05-17 2019-08-16 东南大学 A kind of image fine granularity recognition methods based on intensified learning strategy
CN110348355A (en) * 2019-07-02 2019-10-18 南京信息工程大学 Model recognizing method based on intensified learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766080A (en) * 2015-05-06 2015-07-08 苏州搜客信息技术有限公司 Image multi-class feature recognizing and pushing method based on electronic commerce
CN108805798A (en) * 2017-05-05 2018-11-13 英特尔公司 Fine granularity for deep learning frame calculates communication and executes
CN109814565A (en) * 2019-01-30 2019-05-28 上海海事大学 The unmanned boat intelligence navigation control method of space-time double fluid data-driven depth Q study
CN110135502A (en) * 2019-05-17 2019-08-16 东南大学 A kind of image fine granularity recognition methods based on intensified learning strategy
CN110348355A (en) * 2019-07-02 2019-10-18 南京信息工程大学 Model recognizing method based on intensified learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fine-grained Image Classification via Combining Vision and Language;Xiangteng He等;《Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition》;20170731;第5994-6002页 *

Also Published As

Publication number Publication date
CN110826609A (en) 2020-02-21

Similar Documents

Publication Publication Date Title
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN110837836B (en) Semi-supervised semantic segmentation method based on maximized confidence
EP3757905A1 (en) Deep neural network training method and apparatus
CN110555433B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN110046550B (en) Pedestrian attribute identification system and method based on multilayer feature learning
CN115858847B (en) Combined query image retrieval method based on cross-modal attention reservation
CN112070040A (en) Text line detection method for video subtitles
CN115222998B (en) Image classification method
CN111274981A (en) Target detection network construction method and device and target detection method
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
CN114998756A (en) Yolov 5-based remote sensing image detection method and device and storage medium
Fan et al. A novel sonar target detection and classification algorithm
CN110287981B (en) Significance detection method and system based on biological heuristic characterization learning
CN113128564B (en) Typical target detection method and system based on deep learning under complex background
CN117217807B (en) Bad asset estimation method based on multi-mode high-dimensional characteristics
US20220301106A1 (en) Training method and apparatus for image processing model, and image processing method and apparatus
CN116229104A (en) Saliency target detection method based on edge feature guidance
CN115512207A (en) Single-stage target detection method based on multipath feature fusion and high-order loss sensing sampling
CN115082840A (en) Action video classification method and device based on data combination and channel correlation
CN111582057B (en) Face verification method based on local receptive field
CN114842488A (en) Image title text determination method and device, electronic equipment and storage medium
CN113743497A (en) Fine granularity identification method and system based on attention mechanism and multi-scale features
CN114170460A (en) Multi-mode fusion-based artwork classification method and system
CN113989671A (en) Remote sensing scene classification method and system based on semantic perception and dynamic graph convolution
CN114202765A (en) Image text recognition method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant