Disclosure of Invention
The present invention addresses the above-mentioned needs of the prior art. The technical problem to be solved by the present invention is to provide an image classification method and system based on a target detection algorithm and a convolutional neural network.
To solve the above problems, the invention adopts the following technical scheme:
an image classification method based on a target detection algorithm and a convolutional neural network, comprising the following steps:
carrying out target detection on the original image to obtain a prediction frame of a target image contained in the image;
randomly filling the target image into a preset grid, wherein the size of each grid cell in the grid is consistent with the size of the target image, and obtaining a new image with the same size as the original image from the filled grid;
performing feature extraction network operation on the new image to obtain feature maps based on the new image;
performing convolution calculation on each feature map with a convolution kernel of the same size as the feature map to respectively obtain corresponding convolution values, and combining all the convolution values to obtain a one-dimensional vector;
forming a training data set from the one-dimensional vectors and the corresponding image labels, and performing supervised training on an image classification model with the training data set to obtain a trained image classification model;
and classifying the target in the image by using the trained image classification model.
Optionally, performing the target detection on the original image comprises:
dividing the original image into a plurality of region images with the same size, and performing residual network calculation on the region images to obtain feature maps of the region images;
performing target detection calculation on the feature map of each region image, predicting the target coordinate values and target scores in each region, and drawing detection frames based on the target coordinate values;
performing a sorting-and-deletion operation on the detection frames, wherein the sorting-and-deletion operation is to sort the detection frames by target score, select the detection frame with the highest target score, calculate the intersection-over-union (IoU) of each other detection frame with the highest-scoring detection frame to obtain an overlap value, and delete every detection frame whose overlap value exceeds a preset overlap threshold;
and repeatedly performing the sorting-and-deletion operation on the remaining detection frames until all the detection frames have been processed, thereby obtaining the prediction frames of the target images contained in the image.
Optionally, randomly filling the target image into the preset grid, wherein the size of each grid cell in the grid is consistent with the size of the target image, and obtaining a new image with the same size as the original image from the filled grid, comprises:
randomly filling all the target images into a preset grid for recombination; when the number of detected targets is insufficient to form a new image, selecting existing target images to fill the vacant cells; and performing bilinear interpolation on the filled grid to obtain a new image with the same size as the original image, wherein the new image consists of the recognition results of the target detection algorithm.
Optionally, the target detection calculation is performed by a target detection model, and the target detection model is obtained by training a target detection algorithm on a multi-object image data set.
Optionally, drawing detection frames based on the target coordinate values comprises:
drawing detection frames of three sizes for the target in each region based on the target coordinate values to obtain the target detection frames, wherein the three sizes are selected from nine prior boxes of different sizes obtained by clustering the sizes of all labels in the multi-object image data set with a K-means algorithm.
An image classification system based on a target detection algorithm and a convolutional neural network, comprising:
the prediction frame detection module is used for carrying out target detection on the original image to obtain a prediction frame of a target image contained in the image;
the image recombination module is used for randomly filling the target image into a preset grid, wherein the size of each grid cell in the grid is consistent with that of the target image, and obtaining a new image with the same size as the original image from the filled grid;
the feature extraction module is used for performing feature extraction network operation on the new image to obtain feature maps based on the new image;
the convolution operation module is used for performing convolution calculation on each feature map with a convolution kernel of the same size as the feature map to respectively obtain corresponding convolution values, and combining all the convolution values to obtain a one-dimensional vector;
the model training module is used for forming a training data set from the one-dimensional vectors and the corresponding image labels, and performing supervised training on the image classification model with the training data set to obtain a trained image classification model;
and the classification module is used for classifying the targets in the images by using the trained image classification model.
Optionally, the prediction frame detection module is configured to:
divide the original image into a plurality of region images with the same size, and perform residual network calculation on the region images to obtain feature maps of the region images;
perform target detection calculation on the feature map of each region image, predict the target coordinate values and target scores in each region, and draw detection frames based on the target coordinate values;
perform a sorting-and-deletion operation on the detection frames, wherein the sorting-and-deletion operation is to sort the detection frames by target score, select the detection frame with the highest target score, calculate the intersection-over-union (IoU) of each other detection frame with the highest-scoring detection frame to obtain an overlap value, and delete every detection frame whose overlap value exceeds a preset overlap threshold;
and repeatedly perform the sorting-and-deletion operation on the remaining detection frames until all the detection frames have been processed, thereby obtaining the prediction frames of the target images contained in the image.
Optionally, the image recombination module is configured to:
randomly fill all the target images into a preset grid for recombination; when the number of detected targets is insufficient to form a new image, select existing target images to fill the vacant cells; and perform bilinear interpolation on the filled grid to obtain a new image with the same size as the original image, wherein the new image consists of the recognition results of the target detection algorithm.
Optionally, the target detection calculation is performed by a target detection model, and the target detection model is obtained by training a target detection algorithm on a multi-object image data set.
Optionally, the prediction frame detection module is further configured to:
draw detection frames of three sizes for the target in each region based on the target coordinate values to obtain the target detection frames, wherein the three sizes are selected from nine prior boxes of different sizes obtained by clustering the sizes of all labels in the multi-object image data set with a K-means algorithm.
Compared with the prior art, the invention has the following advantages: performing target detection on the original image with the target detection algorithm detects the target regions; recombining the target regions filters out the interference of useless information, which aids the subsequent image feature extraction; and a feature extraction network structure is then used for feature extraction, with the accuracy of the model's classification results improved by increasing the number of small convolution kernels and the depth of the network.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention; all other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
To facilitate understanding of the embodiments of the present invention, further explanation is given below with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
Example 1
This embodiment provides an image classification method based on a target detection algorithm and a convolutional neural network. The flow of the method is shown in fig. 1 and specifically comprises:
S1: carrying out target detection on the original image to obtain the prediction frames of the target images contained in the image.
To perform target detection on the original image, the original image is first vectorized and input into a residual feature extraction network for feature extraction; the resulting feature vector is then input into a target detector to generate coordinate information; finally, the detection frames are screened by non-maximum suppression to obtain the prediction frames of the targets. The original image is characterized by a complex scene distribution with no obvious distinction between foreground and background.
For an image to be recognized that contains multiple targets of interest and a complex scene distribution, a target detection algorithm is used for specific target detection. A one-stage target detection method is characterized mainly by fast model operation, although its recognition accuracy is lower than that of a two-stage method; to balance operation speed and recognition accuracy, the target detection algorithm is selected as the preprocessing method of this image classification method based on the feature extraction network.
In this step, the target detection calculation is performed by a target detection model, and the target detection model is obtained by training a target detection algorithm on a multi-object image data set.
Step S1 specifically comprises:
S10: dividing the original image into a plurality of region images with the same size, and performing residual network calculation on the region images to obtain feature maps of the region images.
The residual network is Darknet53, which, as a fully convolutional feature extraction network, can adapt to images of various sizes. In this embodiment of the invention, the original image is divided into a plurality of region images of 416 × 416 pixels.
The Darknet53 network structure consists of 53 layers in total: 52 convolutional layers plus a final connected layer built from 1 × 1 convolution kernels. The specific process of inputting a region image into the residual network for processing is as follows:
firstly, feature extraction is performed on the original image with a convolutional layer of 32 filters, and feature extraction then proceeds through 5 residual stages. Each stage consists of 1 stand-alone convolutional layer followed by a group of repeatedly executed residual blocks; the blocks are repeated 1, 2, 8, 8, and 4 times in the respective stages. Each residual block first performs a convolution with a 1 × 1 kernel and then a convolution with a 3 × 3 kernel, halving the number of filters and then restoring it; note that the residual (shortcut) addition does not count as a convolutional-layer calculation.
Darknet53 introduces a residual structure that adds and fuses the output of a previous layer with the output of the current layer, which avoids the vanishing-gradient and exploding-gradient problems caused by an excessively deep network.
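As an illustrative cross-check of the layer bookkeeping described above (assuming the standard Darknet53 stage repetitions of 1, 2, 8, 8, and 4), the convolutional layers can be counted as follows:

```python
# Illustrative layer count for the Darknet53 backbone described above.
repeats = [1, 2, 8, 8, 4]          # residual-block repetitions per stage (assumed standard)
initial_conv = 1                   # the first convolutional layer with 32 filters
downsample_convs = len(repeats)    # one stand-alone convolutional layer per stage
residual_convs = 2 * sum(repeats)  # each residual block holds one 1x1 and one 3x3 convolution
conv_layers = initial_conv + downsample_convs + residual_convs
# 52 convolutional layers; the final connected layer brings the total to 53
assert conv_layers == 52
```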
S11: performing target detection calculation on the feature map of each region image, predicting the target coordinate values and target scores in each region, and drawing detection frames based on the target coordinate values.
Target detection calculation is then performed on each region feature map to predict the probability that the region belongs to a certain category, and a prediction frame of the target is drawn according to the target coordinate values; the prediction result for each region comprises the target coordinate values and the target score.
In target detection, the sizes of all labels in the multi-object image data set are clustered in advance with the K-means clustering algorithm to obtain 9 prior boxes of different sizes; three detection frames of different sizes are specified for each target to adapt to targets of different sizes, and the most accurate target position information is then obtained by non-maximum suppression.
To adapt to targets of different sizes, three prediction frames with different aspect ratios are generated during the prediction of each target; because the model clusters the labeled frames in the multi-object image data set, prior boxes with 9 aspect ratios are obtained for model training, which improves the detection precision of the model for different objects.
In this step, drawing the prediction frames based on the target coordinate values is specifically performed as follows:
drawing detection frames of three sizes for the target in each region based on the target coordinate values to obtain the target detection frames, wherein the three sizes are selected from nine prior boxes of different sizes obtained by clustering the sizes of all labels in the multi-object image data set with a K-means algorithm.
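The prior-box clustering described above can be sketched as follows. This is a minimal illustration using plain Euclidean distance on (width, height) pairs and synthetic label sizes; YOLO-style implementations typically cluster with a 1 − IoU distance instead:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=50, seed=0):
    """Cluster (width, height) label sizes into k prior boxes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each label box to its nearest cluster center
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():          # skip empty clusters
                centers[j] = wh[labels == j].mean(axis=0)
    # sort the priors by area, small to large
    return centers[np.argsort(centers.prod(axis=1))]

rng = np.random.default_rng(1)
sizes = rng.uniform(5.0, 100.0, size=(200, 2))   # synthetic label-box sizes
anchors = kmeans_anchors(sizes)                  # nine prior boxes
```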
S12: performing a sorting-and-deletion operation on the detection frames, wherein the sorting-and-deletion operation is to sort the detection frames by target score, select the detection frame with the highest target score, calculate the intersection-over-union (IoU) of each other detection frame with the highest-scoring detection frame to obtain an overlap value, and delete every detection frame whose overlap value exceeds a preset overlap threshold.
S13: repeatedly performing the sorting-and-deletion operation on the remaining detection frames until all the detection frames have been processed, thereby obtaining the prediction frames of the target images contained in the image.
Since the target detection algorithm generates multiple detection frames for the same target (every detection frame exceeding a preset score threshold is drawn, producing a large number of useless frames), the useless frames must be filtered out and the frame with the highest confidence score selected as the prediction frame of the target.
S12-S13 are specifically implemented as follows: all detection frames are sorted by target score, and the frame A with the largest score is selected; a threshold b is set; for each remaining frame, the Intersection over Union (IoU) with frame A is calculated, and if the IoU is greater than threshold b, the overlap between the two frames is high and the frame is deleted. Frames that do not overlap frame A, or whose overlap area is very small (IoU less than threshold b), remain; these unprocessed frames are re-sorted, the frame with the largest score among them is selected, the IoU values of the other frames with this new frame are calculated, and frames with IoU greater than threshold b are deleted again. The process iterates until all frames have been processed, and the final detection result is output.
Through this processing, redundant detection frames are eliminated and the optimal target detection positions are found.
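A minimal NumPy sketch of the sorting-and-deletion (non-maximum suppression) procedure of S12-S13, with frames given as [x1, y1, x2, y2] and an illustrative threshold b of 0.5:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one frame and an array of frames, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, b=0.5):
    """Keep the highest-scoring frame, delete frames overlapping it by IoU > b, repeat."""
    order = scores.argsort()[::-1]      # sort by target score, descending
    keep = []
    while order.size:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= b]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the second frame overlaps the first (IoU ≈ 0.68) and is deleted
```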
When S1 is complete, the prediction frames of the target images contained in the image are obtained.
S2: randomly filling the target image into a preset grid, wherein the size of each grid cell in the grid is consistent with the size of the target image, and obtaining a new image with the same size as the original image from the filled grid.
All the target images are randomly filled into a preset grid for recombination; when the number of detected targets is insufficient to form a new image, existing target images are selected to fill the vacant cells; bilinear interpolation is then performed on the filled grid to obtain a new image with the same size as the original image, wherein the new image consists of the recognition results of the target detection algorithm.
To improve the robustness of the model, the target images are recombined by random filling, which eliminates the influence of the arrangement on the recognition result. Because the elements of the recombined image are the recognition results of the target detection algorithm (the salient image regions), the recombined image filters out the interference of the image background, which in turn aids the subsequent feature extraction.
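The recombination of S2 can be sketched as follows. This is an illustrative NumPy version that assumes the target crops have already been resized to the grid-cell size; the grid shape, cell size, and output size are example parameters, and the bilinear interpolation is a minimal hand-rolled implementation:

```python
import numpy as np

def bilinear_resize(img, size):
    """Resize an HxWxC image to size x size with bilinear interpolation."""
    h, w = img.shape[:2]
    ys = np.linspace(0, h - 1, size)
    xs = np.linspace(0, w - 1, size)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def recombine(crops, grid=(2, 2), cell=64, out_size=416, seed=0):
    """Randomly fill target crops into a grid, reusing crops if too few, then resize."""
    rows, cols = grid
    need = rows * cols
    rng = np.random.default_rng(seed)
    idx = np.arange(len(crops))
    if len(crops) < need:   # reuse existing targets for the vacant cells
        idx = np.concatenate([idx, rng.integers(0, len(crops), need - len(crops))])
    rng.shuffle(idx)
    canvas = np.zeros((rows * cell, cols * cell, 3))
    for k in range(need):
        r, c = divmod(k, cols)
        canvas[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = crops[idx[k]]
    return bilinear_resize(canvas, out_size)

crops = [np.full((64, 64, 3), v) for v in (0.2, 0.5, 0.8)]   # three detected targets
new_image = recombine(crops)
```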
S3: performing feature extraction network operation on the new image to obtain feature maps based on the new image.
In this embodiment of the invention, the VGG16 feature extraction network is used for feature extraction. The VGG16 structure uses a large number of 3 × 3 convolution kernels, and the accuracy of the model's classification results is improved by increasing the number of small convolution kernels and the depth of the network.
VGG16 has 16 layers in total: 13 convolutional layers and 3 fully connected layers. The input image of the network is 224 × 224 pixels with 3 channels. The convolutional part comprises 5 segments, each containing 2 or 3 convolutional layers and ending with a max-pooling layer; the convolutional layers within a segment use the same number of kernels. Inputting the new image into the VGG16 network for feature extraction specifically comprises:
for the first segment, processing the new image in its first convolutional layer with 3 × 3 kernels to obtain a 224 × 224 feature map, and in the other convolutional layers of the segment processing the feature map from the previous layer with 3 × 3 kernels to obtain new feature maps; in this segment the number of feature channels increases from 3 to 64, and after the convolutional layers the 224 × 224 feature map is max-pooled to obtain a 112 × 112 feature map;
for the other segments, processing in the first convolutional layer the feature map produced by the immediately preceding max-pooling layer with 3 × 3 kernels, and in the remaining convolutional layers processing the feature map from the previous layer with 3 × 3 kernels to obtain new feature maps; the number of channels of the feature map produced by each segment is twice that of the previous segment (up to a maximum of 512), and after the convolutional layers of each segment the segment's feature map is max-pooled, halving its size.
In this step, the new image undergoes 5 feature-map size reductions in the VGG16 network, from 224 × 224 to 112 × 112, 56 × 56, 28 × 28, 14 × 14, and 7 × 7 in sequence; each reduction is realized by a max-pooling layer. The number of feature channels increases from 3 to 64, 128, 256, and 512 in sequence.
The VGG16 network structure is characterized by stacking small convolutions to achieve the same receptive field as a large convolution kernel, and by performing no size reduction in the convolutional layers: the reduction of the feature maps is realized by the max-pooling layers.
The network structure of VGG16 is simple, involving only convolutional layers, max-pooling layers, and fully connected layers. By stacking convolutional layers, deeper layers obtain progressively larger receptive fields, so that image features in different ranges are extracted; the max-pooling layers retain the most characteristic features in the image.
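The shape bookkeeping of the five segments can be traced with a short illustrative calculation (assuming the standard VGG16 configuration of 2, 2, 3, 3, 3 convolutional layers and 64, 128, 256, 512, 512 channels per segment):

```python
# Trace feature-map size and channel count through the five VGG16 segments.
convs_per_segment = [2, 2, 3, 3, 3]             # assumed standard VGG16 layout
channels_per_segment = [64, 128, 256, 512, 512]
size, total_convs, trace = 224, 0, []
for n_convs, channels in zip(convs_per_segment, channels_per_segment):
    total_convs += n_convs   # the 3x3 convolutions keep the spatial size
    size //= 2               # the max-pooling layer at the end of the segment halves it
    trace.append((size, channels))
assert total_convs == 13     # 13 convolutional layers (+ 3 fully connected = 16)
assert trace == [(112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
```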
S4: performing convolution calculation on each feature map with a convolution kernel of the same size as the feature map to respectively obtain corresponding convolution values, and combining all the convolution values to obtain a one-dimensional vector.
In this embodiment of the invention, a convolution kernel of the same size as the feature map is used to convolve the two-dimensional feature matrix of each feature map, compressing the matrix into a single number and thereby converting the set of feature maps into a one-dimensional vector. This both facilitates the calculation of the class probabilities of the image and reduces the number of parameters.
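Convolving each feature map with a kernel of the same size yields exactly one value per map, so the stack of feature maps collapses into a one-dimensional vector. A minimal sketch with illustrative 512 × 7 × 7 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((512, 7, 7))   # 512 feature maps of 7 x 7
kernels = rng.standard_normal((512, 7, 7))        # one 7 x 7 kernel per map
# a 7x7 kernel slid over a 7x7 map fits in exactly one position,
# so the "convolution" is a single elementwise product-and-sum per map
vector = np.einsum('chw,chw->c', feature_maps, kernels)
```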
S5: forming a training data set from the one-dimensional vectors and the corresponding image labels, and performing supervised training on the image classification model with the training data set to obtain the trained image classification model.
S6: classifying the targets in the image by using the trained image classification model.
The image classification model is used to predict the target class of the image: the feature vector output by VGG16 serves as the input of the fully connected layer, and the output of the fully connected layer is classified by a Softmax classifier.
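The final classification step can be sketched as follows, with an illustrative 512-dimensional feature vector, 10 classes, and randomly initialized weights standing in for the trained fully connected layer:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal(512)         # one-dimensional vector from S4
W = rng.standard_normal((10, 512)) * 0.01   # fully connected layer weights (10 classes, illustrative)
b = np.zeros(10)
probs = softmax(W @ features + b)           # class probabilities
predicted_class = int(probs.argmax())
```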
In summary, the image classification method based on a target detection algorithm and a convolutional neural network provided by the invention comprises: obtaining an original image, performing target detection on it with a target detector, and detecting the target regions; recombining the target regions into a new image with the same size as the original image and resizing the new image; extracting features from the adjusted new image with the feature extraction network to obtain feature maps; converting the feature maps into one-dimensional vectors, forming a training data set together with the corresponding image labels, performing supervised training on this data set to obtain an image classification model, and predicting the target class of images with this model, as shown in the model framework schematic of fig. 2. By detecting the target objects in the image in advance with the target detection model, the method overcomes the difficulty of feature extraction caused by complex scene distributions with multiple salient objects; by obtaining the positions of the target images and emphasizing their features during extraction, the recognition accuracy of image classification is improved.
Example 2
Fig. 3 is a schematic structural diagram of an image classification system based on a target detection algorithm and a convolutional neural network according to an embodiment of the present invention. As shown in fig. 3, the system comprises: a prediction frame detection module 30, an image recombination module 31, a feature extraction module 32, a convolution operation module 33, a model training module 34, and a classification module 35.
A prediction frame detection module 30, configured to perform target detection on an original image to obtain a prediction frame of a target image included in the image;
the image recombination module 31 is configured to randomly fill the target image into a preset grid, wherein the size of each grid cell in the grid is consistent with the size of the target image, and to obtain a new image with the same size as the original image from the filled grid;
the feature extraction module 32 is configured to perform feature extraction network operation on the new image to obtain feature maps based on the new image;
the convolution operation module 33 is configured to perform convolution calculation on each feature map with a convolution kernel of the same size as the feature map to respectively obtain corresponding convolution values, and to combine all the convolution values to obtain a one-dimensional vector;
the model training module 34 is configured to form a training data set from the one-dimensional vectors and the corresponding image labels, and to perform supervised training on an image classification model with the training data set to obtain a trained image classification model;
and the classification module 35 is configured to classify the target in the image by using the trained image classification model.
Optionally, the prediction frame detection module 30 is configured to:
divide the original image into a plurality of region images with the same size, and perform residual network calculation on the region images to obtain feature maps of the region images;
perform target detection calculation on the feature map of each region image, predict the target coordinate values and target scores in each region, and draw detection frames based on the target coordinate values;
perform a sorting-and-deletion operation on the detection frames, wherein the sorting-and-deletion operation is to sort the detection frames by target score, select the detection frame with the highest target score, calculate the intersection-over-union (IoU) of each other detection frame with the highest-scoring detection frame to obtain an overlap value, and delete every detection frame whose overlap value exceeds a preset overlap threshold;
and repeatedly perform the sorting-and-deletion operation on the remaining detection frames until all the detection frames have been processed, thereby obtaining the prediction frames of the target images contained in the image.
Optionally, the image recombination module 31 is configured to:
randomly fill all the target images into a preset grid for recombination; when the number of detected targets is insufficient to form a new image, select existing target images to fill the vacant cells; and perform bilinear interpolation on the filled grid to obtain a new image with the same size as the original image, wherein the new image consists of the recognition results of the target detection algorithm.
Optionally, the target detection calculation is performed by a target detection model, and the target detection model is obtained by training a target detection algorithm on a multi-object image data set.
Optionally, the prediction frame detection module 30 is further configured to:
draw detection frames of three sizes for the target in each region based on the target coordinate values to obtain the target detection frames, wherein the three sizes are selected from nine prior boxes of different sizes obtained by clustering the sizes of all labels in the multi-object image data set with a K-means algorithm.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or alternatively by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product, stored in a computer-readable storage medium such as a ROM/RAM, magnetic disk, or optical disk, and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in parts of the embodiments.
The above embodiments further illustrate the objects, technical solutions, and advantages of the present invention in detail. It should be understood that they are merely exemplary embodiments of the present invention and do not limit its scope; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.