CN109685145B - Small object detection method based on deep learning and image processing - Google Patents

Small object detection method based on deep learning and image processing Download PDF

Info

Publication number
CN109685145B
CN109685145B CN201811605116.2A CN201811605116A
Authority
CN
China
Prior art keywords
max
feature map
size
box
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811605116.2A
Other languages
Chinese (zh)
Other versions
CN109685145A (en)
Inventor
李卫军
吴超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201811605116.2A priority Critical patent/CN109685145B/en
Publication of CN109685145A publication Critical patent/CN109685145A/en
Application granted granted Critical
Publication of CN109685145B publication Critical patent/CN109685145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image processing, and in particular to a small object detection method based on deep learning and image processing. The method replaces the 5x5 convolution of the original Inception module with two 3x3 convolution kernels, which retains more detail; at the same time, to accelerate training and keep the outputs consistent, BN (Batch Normalization) is added at the end of each branch, and a residual network structure is introduced to increase accuracy. In addition, the invention uses deconvolution to enhance the context information between two adjacent layers: the deconvolution result of the upper layer is aligned with the pixels of the lower layer and added one by one, and the resulting new feature map is used as the feature map for detection. The invention improves the recognition of small objects and raises the accuracy of the conventional SSD in detecting small objects without affecting its high FPS.

Description

Small object detection method based on deep learning and image processing
Technical Field
The invention relates to the field of image processing, in particular to a small object detection method based on deep learning and image processing.
Background
Currently, a commonly used object detection algorithm is the SSD (Single Shot MultiBox Detector). The SSD is an end-to-end detection framework based on deep learning whose architecture consists of two main parts: the front end is a convolutional neural network (VGG16) that extracts features of the target, and the back end is a multi-scale feature detection network that extracts features at different scales from the feature layers produced by the front-end network. The Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers are then convolved to obtain coordinate positions and confidence scores, and the final result is obtained by non-maximum suppression (NMS).
Because the SSD adopts a multi-scale detection approach, it keeps the amount of computation low and achieves a high FPS. However, since detection is performed on feature maps of different scales, the receptive fields on those feature maps differ; on the high-level convolutional layers in particular the receptive field is large and the extracted features are abstract, so the SSD is not sensitive to small objects and fine details.
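For orientation, the conventional SSD300 pipeline described above draws its detections from six feature maps of decreasing resolution. The following Python sketch lists those standard SSD300 detection sources and the resulting number of default boxes; it is given only as context for the modifications that follow and is not part of the claimed method.

    # Standard SSD300 detection sources (values from the original SSD design).
    # Each entry: (layer name, spatial resolution, default boxes per location).
    SSD300_DETECTION_LAYERS = [
        ("Conv4_3",  38, 4),
        ("Conv7",    19, 6),
        ("Conv8_2",  10, 6),
        ("Conv9_2",   5, 6),
        ("Conv10_2",  3, 4),
        ("Conv11_2",  1, 4),
    ]

    def total_default_boxes(layers=SSD300_DETECTION_LAYERS):
        """Number of default boxes evaluated per 300x300 image (8732 for SSD300)."""
        return sum(res * res * boxes for _, res, boxes in layers)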
Disclosure of Invention
In order to overcome the defect that the SSD detection algorithm is insensitive to small object detection in the prior art, the invention provides a small object detection method based on deep learning and image processing.
In order to realize this purpose, the technical solution is as follows:
A small object detection method based on deep learning and image processing comprises the following steps:
Step S1: acquiring a data set, wherein the data set comprises labeled object class information and the upper-left (x_min, y_min) and lower-right (x_max, y_max) coordinates of each target frame; selecting an original picture with label information from the training set of the data set and resizing it to 300x300 as input;
Step S2: dividing the picture along the horizontal line from (0, 150) to (300, 150) and the vertical line from (150, 0) to (150, 300) into four 150x150 parts P1, P2, P3 and P4, and taking the image with the four vertex coordinates (75, 75), (225, 75), (75, 225) and (225, 225) as the fifth part P5;
Step S3: according to the upper-left and lower-right coordinates (x_min, y_min), (x_max, y_max) of the target frame of each input picture, judging whether the object in the picture has been divided, and modifying the coordinates according to how the object was divided;
Step S4: interpolating the pictures by cubic interpolation so that the five divided 150x150 parts P1, P2, P3, P4 and P5 are enlarged to the same 300x300 size as the original picture and are named F1, F2, F3, F4 and F5, and at the same time multiplying the modified coordinates obtained in step S3 by 2 and updating them (a code sketch of this division and enlargement is given after step S13);
Step S5: extracting features from each of the five pictures F1, F2, F3, F4 and F5 through a VGG16 network, performing convolution with a convolution kernel of size 3x3x1024 to obtain a Conv6 feature map of size 19x19x1024, and performing convolution with a convolution kernel of size 1x1x1024 to obtain a Conv7 feature map of size 19x19x1024;
Step S6: stacking 1x1, 3x3 and 3x3 convolution kernels to form three branches, adding BN (Batch Normalization) at the end of each branch for batch normalization, and concatenating and fusing the branches while introducing a residual network structure; this structure is named the IRBNet convolution structure;
Step S7: extracting features from the 19x19x1024 Conv7 feature map obtained in step S5 through the IRBNet convolution structure to obtain a feature map Conv8 of size 10x10x512; applying the IRBNet convolution to Conv8 to obtain a feature map Conv9 of size 5x5x256; applying the IRBNet convolution to Conv9 to obtain a feature map Conv10 of size 3x3x256; and applying the IRBNet convolution to Conv10 to obtain a feature map Conv11 of size 1x1x256;
Step S8: deconvolving the higher-level feature map with a 3x3 convolution kernel and a step length of 4, enlarging it by a factor of two so that it has the same size as the adjacent lower layer, and then adding the pixels at corresponding positions one by one to obtain a new feature map whose size is consistent with that of the lower-level feature map; this structure is named HDPANet;
Step S9: passing the feature map Conv8 through step S8 to obtain another feature map of size 19x19x1024 and adding it to Conv7 to obtain the feature map Conv7D; passing Conv9 through step S8 to obtain another feature map of size 10x10x512 and adding it to Conv8 to obtain the feature map Conv8D; passing Conv10 through step S8 to obtain another feature map of size 5x5x256 and adding it to Conv9 to obtain the feature map Conv9D; and passing Conv11 through step S8 to obtain another feature map of size 3x3x256 and adding it to Conv10 to obtain the feature map Conv10D;
Step S10: convolving the Conv4_3, Conv10D and Conv11 feature layers with a 3x3 convolution kernel to obtain feature maps with 4x(class+4) channels, and convolving the Conv7D, Conv8D and Conv9D feature layers with a 3x3 convolution kernel to obtain feature maps with 6x(class+4) channels;
Step S11: F1, F2, F3, F4 and F5 each obtain a corresponding loss function loss through steps S1 to S10; during back propagation, the sum of the five loss functions is optimized with a stochastic gradient descent algorithm, and the number of training iterations (epochs) is set; when the total loss is stable, the obtained network parameters are the optimal solution;
Step S12: selecting pictures without label information from the data set, executing steps S1 and S2 to divide them, feeding the divided pictures into the network trained in steps S1 to S10, and filtering the results by non-maximum suppression to finally obtain, for the five pictures F1, F2, F3, F4 and F5, the predicted class label and the predicted coordinates (x_pred_min, y_pred_min), (x_pred_max, y_pred_max);
Step S13: fusing the pictures according to the predicted class labels and predicted coordinates of the five pictures F1, F2, F3, F4 and F5; the fused result is the final detection result.
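The division and enlargement of steps S2 and S4 can be illustrated with the following minimal Python/OpenCV sketch; the function name, the use of cv2.resize for the cubic interpolation, and the assignment of P1..P4 to the four quadrants are illustrative assumptions, not part of the claimed method.

    import cv2

    def divide_and_upscale(image_300):
        """Split a 300x300 image into P1..P5 (step S2) and enlarge each part back to
        300x300 with cubic interpolation (step S4), returning F1..F5."""
        assert image_300.shape[:2] == (300, 300)
        parts = [
            image_300[0:150,   0:150],    # P1: upper-left quarter
            image_300[0:150, 150:300],    # P2: upper-right quarter
            image_300[150:300, 0:150],    # P3: lower-left quarter
            image_300[150:300, 150:300],  # P4: lower-right quarter
            image_300[75:225, 75:225],    # P5: centre crop with vertices (75,75)..(225,225)
        ]
        # Cubic interpolation doubles each 150x150 part to 300x300; the box
        # coordinates modified in step S3 are therefore multiplied by 2 (step S4).
        return [cv2.resize(p, (300, 300), interpolation=cv2.INTER_CUBIC) for p in parts]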
Preferably, the coordinates in step S3 are modified according to the following three cases (a code sketch of these cases is given below):
1) If x_min < 150 and x_max > 150, with y_min, y_max < 150 or y_min, y_max > 150, the object in the image has been divided vertically into a left part and a right part; the new coordinates are (x_min, y_min), (150, y_max) and (150, y_min), (x_max, y_max), and the category information does not change;
2) If x_min, x_max < 150 or x_min, x_max > 150, with y_min < 150 and y_max > 150, the object in the image has been divided horizontally into an upper part and a lower part; the new coordinates are (x_min, y_min), (x_max, 150) and (x_min, 150), (x_max, y_max), and the category information does not change;
3) If x_min < 150, y_min < 150, x_max > 150 and y_max > 150, the object in the image has been cut into four parts in both the horizontal and vertical directions; the new coordinates are (x_min, y_min), (150, 150); (150, y_min), (x_max, 150); (x_min, 150), (150, y_max); and (150, 150), (x_max, y_max), and the category information does not change.
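A minimal Python sketch of the three cases above follows; the function name is illustrative, and s = 150 is the split line of step S2.

    def split_box(box, s=150):
        """Apply the three cases of step S3 to one ground-truth box
        (x_min, y_min, x_max, y_max). Returns the sub-boxes produced by the division;
        a box that does not cross x = s or y = s is returned unchanged."""
        x_min, y_min, x_max, y_max = box
        crosses_v = x_min < s < x_max   # object cut by the vertical line x = 150
        crosses_h = y_min < s < y_max   # object cut by the horizontal line y = 150
        if crosses_v and crosses_h:     # case 3): cut into four parts
            return [(x_min, y_min, s, s), (s, y_min, x_max, s),
                    (x_min, s, s, y_max), (s, s, x_max, y_max)]
        if crosses_v:                   # case 1): left and right parts
            return [(x_min, y_min, s, y_max), (s, y_min, x_max, y_max)]
        if crosses_h:                   # case 2): upper and lower parts
            return [(x_min, y_min, x_max, s), (x_min, s, x_max, y_max)]
        return [box]                    # object not divided; the category is unchanged in all cases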
Preferably, the loss function loss and the total loss Total_loss in step S11 are obtained as follows:
The loss is divided into two parts, the confidence loss and the location loss:
L(x, c, l, g) = (1/N) ( L_conf(x, c) + α L_loc(x, l, g) )
where L(x, c, l, g) denotes the loss, L_conf denotes the confidence loss, which uses the softmax loss, and L_loc denotes the location loss; N is the number of prior boxes matched to the ground truth in the confidence loss, and the parameter α adjusts the ratio between the confidence loss and the location loss. x_ij^p ∈ {0, 1} indicates whether the i-th prediction box is matched to the j-th real (GT) box of category p; c denotes the confidence, l the prediction box, and g the real box.
The confidence loss is
L_conf(x, c) = - Σ_{i∈Pos}^{N} x_ij^p log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0),   with   ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p),
where ĉ_i^p is the probability value generated by the softmax method, Pos denotes the positive samples, Neg the negative samples, and N is the number of prior boxes matched to the ground truth in the confidence loss; when x_ij^p = 1, ĉ_i^p denotes the probability that the i-th prediction box belongs to category p, where p denotes the p-th category among the classes.
The location loss is
L_loc(x, l, g) = Σ_{i∈Pos}^{N} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1( l_i^m - ĝ_j^m ),
with the encoded targets
ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w,   ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h,   ĝ_j^w = log( g_j^w / d_i^w ),   ĝ_j^h = log( g_j^h / d_i^h ),
where cx denotes the x coordinate of the center point of a box, cy the y coordinate of the center point, w the width, h the height, i the i-th prediction box, j the j-th real box, and d_i the prior (default) box against which the offsets are computed; x_ij^k indicates whether the i-th prediction box and the j-th real box match (1 for a match, 0 for a mismatch); l_i^m denotes the prediction box and ĝ_j^m the offset box of the real box, with m ∈ {cx, cy, w, h}; ĝ_j^cx, ĝ_j^cy, ĝ_j^w and ĝ_j^h are the center x coordinate, the center y coordinate, the width and the height of the offset box of the j-th real box; l_i^cx, l_i^cy, l_i^w and l_i^h are the offsets of the center x coordinate, the center y coordinate, the width and the height of the i-th prediction box; and g_j^cx, g_j^cy, g_j^w and g_j^h are the center x coordinate, the center y coordinate, the width and the height of the j-th real box.
The five loss functions obtained by processing F1, F2, F3, F4 and F5 are denoted L_1(x, c, l, g), L_2(x, c, l, g), L_3(x, c, l, g), L_4(x, c, l, g) and L_5(x, c, l, g), and the total loss function is
Total_loss = L_1(x, c, l, g) + L_2(x, c, l, g) + L_3(x, c, l, g) + L_4(x, c, l, g) + L_5(x, c, l, g).
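For illustration, the per-image loss L(x, c, l, g) and the summation of step S11 can be sketched in PyTorch as follows. This is a simplified sketch: it assumes that prior-box matching and hard-negative mining have already produced the class targets, the encoded location targets and the positive mask, and the function names are illustrative rather than taken from the patent.

    import torch
    import torch.nn.functional as F

    def multibox_loss(conf_pred, loc_pred, conf_target, loc_target, pos_mask, alpha=1.0):
        """Sketch of L(x, c, l, g) = (1/N) * (L_conf + alpha * L_loc).
        conf_pred: (num_priors, num_classes) raw scores; conf_target: (num_priors,)
        class indices (0 = background); loc_pred/loc_target: (num_priors, 4) encoded
        offsets; pos_mask: (num_priors,) boolean mask of matched (positive) priors."""
        num_pos = pos_mask.sum().clamp(min=1).float()        # N in the formula above
        # Confidence loss: softmax cross-entropy over the retained priors
        # (positives plus the hard negatives selected upstream).
        l_conf = F.cross_entropy(conf_pred, conf_target, reduction="sum")
        # Location loss: Smooth-L1 between predicted and encoded offsets, positives only.
        l_loc = F.smooth_l1_loss(loc_pred[pos_mask], loc_target[pos_mask], reduction="sum")
        return (l_conf + alpha * l_loc) / num_pos

    # Step S11: the total loss is the sum of the five per-sub-image losses,
    # Total_loss = L_1 + L_2 + L_3 + L_4 + L_5, e.g.
    # total_loss = sum(multibox_loss(*t) for t in targets_for_F1_to_F5)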
preferably, the step S13 of fusing the pictures specifically includes the following steps:
(1) if the predicted coordinate x of each picture of F1, F2, F3 and F4 is x pred_max ,y pred_max < 300 and x pred_min ,y pred_min If the current position is more than 0, combining F1, F2, F3 and F4 into a picture according to the original position, reducing the size of the fused picture by 4 times to obtain the size of 300x300 of the original image, reducing the predicted coordinate by 4 times, and obtaining the final result of detection;
(2) the types of objects, label1 and label2, on the boundary between the left and right parts are detected, and when label1 is equal to label2, they are represented as the same type, and the coordinate information of both objects is compared in magnitude, and the object is extended in a small direction with a large frame as the reference (x is 2) max -x min ) The length of the picture is then completely compensated, F1, F2, F3 and F4 are combined into a picture according to the original position, the size of the integrated picture is reduced by 4 times to obtain the size of the original picture 300x300, meanwhile, the modified coordinate is reduced by 4 times, and the final result is the final result of detection;
(3) the types of objects on the boundaries of the upper and lower parts, label1 and label2, are detected, and when label1 is equal to label2, they are represented as the same type, and the coordinate information of the two objects is compared in magnitude, and y is extended in a small direction with respect to a large frame max Subtracting y min The length of the steel is then filled; reducing the size of the fused whole picture by 4 times to obtain the size of the original image of 300x300, and reducing the modified coordinates by 4 times, wherein the final result is the final result of detection;
(4) if the predicted coordinate (x) of each of F1, F2, F3 and F4 is pred_min ,y pred_min ) (300 ) or (x) pred_max ,y pred_max ) The term (300 ) indicates that the object is divided into four parts of upper left, lower left, upper right and lower right simultaneously; the detection result of the middle picture F5 is used as the detection result of the intermediate object, the size of the fused whole picture is reduced by 4 times to obtain the size of 300 × 300 of the original image, the obtained coordinate information is reduced by 4 times at the same time, and the final result is the final detection result.
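The coordinate mapping used in case (1) can be sketched as follows; halving each coordinate per axis is taken here as the interpretation of the four-fold reduction of the fused picture (a 600x600 mosaic of F1..F4 reduced back to 300x300), and the function name and the row-major assignment of P1..P4 to the four quadrants are illustrative assumptions.

    def map_sub_image_box(box, part_index, s=150):
        """Map a box predicted on one 300x300 sub-image F1..F4 back onto the original
        300x300 image: halve the coordinates (undoing the 2x enlargement of step S4)
        and shift by the offset of that part. part_index: 0..3 for P1..P4."""
        x_min, y_min, x_max, y_max = (v / 2.0 for v in box)  # undo the cubic upscaling
        dx = s * (part_index % 2)                            # P2 and P4 lie in the right half
        dy = s * (part_index // 2)                           # P3 and P4 lie in the lower half
        return (x_min + dx, y_min + dy, x_max + dx, y_max + dy)

The boundary-merging cases (2) to (4) operate on the boxes produced this way and are omitted from the sketch for brevity.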
Preferably, α is 1.
Compared with the prior art, the invention has the following beneficial effects:
The invention replaces the 5x5 convolution of the original Inception module with two 3x3 convolution kernels, which retains more detail. To accelerate training and keep the outputs consistent, BN (Batch Normalization) is added at the end of each branch for batch normalization, and a residual network structure is introduced to increase accuracy. In addition, deconvolution is used to enhance the context information between two adjacent layers: the deconvolution result of the upper layer is aligned with the pixels of the lower layer and added one by one, and the resulting new feature map is used as the feature map for detection, which improves the recognition of small objects.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of a segmentation point of a segmented image according to the present invention.
FIG. 3 is a flow chart of the image division according to the present invention.
FIG. 4 is a block diagram of the residual network structure of the present invention.
FIG. 5 is a diagram of the structure of IRBNet.
FIG. 6 is a flow chart of the high-level deconvolution and pixel-wise addition.
FIG. 7 is a flow chart for obtaining the predicted class label and the predicted coordinates.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
As shown in fig. 1 to fig. 7, a small object detection method based on deep learning and image processing includes the following steps:
Step S1: acquiring a data set, wherein the data set comprises labeled object class information and the upper-left (x_min, y_min) and lower-right (x_max, y_max) coordinates of each target frame; selecting an image with label information from the training set of the data set and resizing it to 300x300 as input;
Step S2: as shown in fig. 2, dividing the picture along the horizontal line from (0, 150) to (300, 150) and the vertical line from (150, 0) to (150, 300) into four 150x150 parts P1, P2, P3 and P4, and taking the image with the four vertex coordinates (75, 75), (225, 75), (75, 225) and (225, 225) as the fifth part P5;
Step S3: according to the upper-left and lower-right coordinates (x_min, y_min), (x_max, y_max) of the target frame of each input picture, judging whether the object in the picture has been divided, and modifying the coordinates according to how the object was divided;
Step S4: as shown in fig. 3, interpolating the pictures by cubic interpolation so that the five divided 150x150 parts P1, P2, P3, P4 and P5 are enlarged to the same 300x300 size as the original picture and are named F1, F2, F3, F4 and F5, and at the same time multiplying the modified coordinates obtained in step S3 by 2 and updating them;
Step S5: extracting features from each of the five pictures F1, F2, F3, F4 and F5 through a VGG16 network, performing convolution with a convolution kernel of size 3x3x1024 to obtain a Conv6 feature map of size 19x19x1024, and performing convolution with a convolution kernel of size 1x1x1024 to obtain a Conv7 feature map of size 19x19x1024;
Step S6: as shown in fig. 4 and fig. 5, stacking 1x1, 3x3 and 3x3 convolution kernels to form three branches, adding BN (Batch Normalization) at the end of each branch for batch normalization, and concatenating and fusing the branches while introducing a residual network structure; this structure is named the IRBNet convolution structure (code sketches of this block and of the HDPANet fusion of steps S8-S9 are given after step S13 of this embodiment);
Step S7: extracting features from the 19x19x1024 Conv7 feature map obtained in step S5 through the IRBNet convolution structure to obtain a feature map Conv8 of size 10x10x512; applying the IRBNet convolution to Conv8 to obtain a feature map Conv9 of size 5x5x256; applying the IRBNet convolution to Conv9 to obtain a feature map Conv10 of size 3x3x256; and applying the IRBNet convolution to Conv10 to obtain a feature map Conv11 of size 1x1x256;
Step S8: as shown in fig. 6, deconvolving the higher-level feature map with a 3x3 convolution kernel and a step length of 4, enlarging it by a factor of two so that it has the same size as the adjacent lower layer, and then adding the pixels at corresponding positions one by one to obtain a new feature map whose size is consistent with that of the lower-level feature map; this structure is named HDPANet;
Step S9: passing the feature map Conv8 through step S8 to obtain another feature map of size 19x19x1024 and adding it to Conv7 to obtain the feature map Conv7D; passing Conv9 through step S8 to obtain another feature map of size 10x10x512 and adding it to Conv8 to obtain the feature map Conv8D; passing Conv10 through step S8 to obtain another feature map of size 5x5x256 and adding it to Conv9 to obtain the feature map Conv9D; and passing Conv11 through step S8 to obtain another feature map of size 3x3x256 and adding it to Conv10 to obtain the feature map Conv10D;
Step S10: convolving the Conv4_3, Conv10D and Conv11 feature layers with a 3x3 convolution kernel to obtain feature maps with 4x(class+4) channels, and convolving the Conv7D, Conv8D and Conv9D feature layers with a 3x3 convolution kernel to obtain feature maps with 6x(class+4) channels;
Step S11: F1, F2, F3, F4 and F5 each obtain a corresponding loss function loss through steps S1 to S10; during back propagation, the sum of the five loss functions is optimized with a stochastic gradient descent algorithm, and the number of training iterations (epochs) is set; when the total loss is stable, the obtained network parameters are the optimal solution;
Step S12: as shown in fig. 7, selecting pictures without label information from the data set, executing steps S1 and S2 to divide them, feeding the divided pictures into the network trained in steps S1 to S10, and filtering the results by non-maximum suppression to finally obtain, for the five pictures F1, F2, F3, F4 and F5, the predicted class label and the predicted coordinates (x_pred_min, y_pred_min), (x_pred_max, y_pred_max);
Step S13: fusing the pictures according to the predicted class labels and predicted coordinates of the five pictures F1, F2, F3, F4 and F5; the fused result is the final detection result.
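Minimal PyTorch sketches of the IRBNet block of step S6 and of the HDPANet fusion of steps S8 and S9 are given below. The channel numbers, the 1x1 projection before the residual addition and the ReLU placement are illustrative assumptions, and the deconvolution layer is assumed to be configured externally (for example with the 3x3 kernel and step length described in step S8) so that its output matches the lower-level map in channels.

    import torch
    import torch.nn as nn

    class IRBNetBlock(nn.Module):
        """Inception-style block: the 5x5 branch is replaced by two stacked 3x3
        convolutions, BN ends every branch, and a residual connection wraps the
        concatenated branches (step S6)."""
        def __init__(self, in_ch, branch_ch):
            super().__init__()
            self.b1 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                    nn.BatchNorm2d(branch_ch))
            self.b2 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, padding=1),
                                    nn.BatchNorm2d(branch_ch))
            self.b3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, padding=1),
                                    nn.Conv2d(branch_ch, branch_ch, 3, padding=1),  # two 3x3 in place of 5x5
                                    nn.BatchNorm2d(branch_ch))
            self.project = nn.Conv2d(3 * branch_ch, in_ch, 1)  # back to in_ch for the residual add
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
            return self.relu(self.project(out) + x)            # residual connection

    def hdpanet_fuse(upper, lower, deconv):
        """HDPANet fusion (steps S8-S9): deconvolve the higher-level map so that it
        matches the lower-level map, then add the two pixel by pixel."""
        up = deconv(upper)
        up = nn.functional.interpolate(up, size=lower.shape[-2:])  # guard against off-by-one sizes
        return lower + up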
As a preferred embodiment, the coordinates in step S3 are modified according to the following three cases:
1) If x_min < 150 and x_max > 150, with y_min, y_max < 150 or y_min, y_max > 150, the object in the image has been divided vertically into a left part and a right part; the new coordinates are (x_min, y_min), (150, y_max) and (150, y_min), (x_max, y_max), and the category information does not change;
2) If x_min, x_max < 150 or x_min, x_max > 150, with y_min < 150 and y_max > 150, the object in the image has been divided horizontally into an upper part and a lower part; the new coordinates are (x_min, y_min), (x_max, 150) and (x_min, 150), (x_max, y_max), and the category information does not change;
3) If x_min < 150, y_min < 150, x_max > 150 and y_max > 150, the object in the image has been cut into four parts in both the horizontal and vertical directions; the new coordinates are (x_min, y_min), (150, 150); (150, y_min), (x_max, 150); (x_min, 150), (150, y_max); and (150, 150), (x_max, y_max), and the category information does not change.
As a preferred embodiment, the loss function loss and the total loss Total_loss in step S11 are obtained as follows:
The loss is divided into two parts, the confidence loss and the location loss:
L(x, c, l, g) = (1/N) ( L_conf(x, c) + α L_loc(x, l, g) )
where L(x, c, l, g) denotes the loss, L_conf denotes the confidence loss, which uses the softmax loss, and L_loc denotes the location loss; N is the number of prior boxes matched to the ground truth in the confidence loss, and the parameter α adjusts the ratio between the confidence loss and the location loss. x_ij^p ∈ {0, 1} indicates whether the i-th prediction box is matched to the j-th real (GT) box of category p; c denotes the confidence, l the prediction box, and g the real box.
The confidence loss is
L_conf(x, c) = - Σ_{i∈Pos}^{N} x_ij^p log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0),   with   ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p),
where ĉ_i^p is the probability value generated by the softmax method, Pos denotes the positive samples, Neg the negative samples, and N is the number of prior boxes matched to the ground truth in the confidence loss; when x_ij^p = 1, ĉ_i^p denotes the probability that the i-th prediction box belongs to category p, where p denotes the p-th category among the classes.
The location loss is
L_loc(x, l, g) = Σ_{i∈Pos}^{N} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1( l_i^m - ĝ_j^m ),
with the encoded targets
ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w,   ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h,   ĝ_j^w = log( g_j^w / d_i^w ),   ĝ_j^h = log( g_j^h / d_i^h ),
where cx denotes the x coordinate of the center point of a box, cy the y coordinate of the center point, w the width, h the height, i the i-th prediction box, j the j-th real box, and d_i the prior (default) box against which the offsets are computed; x_ij^k indicates whether the i-th prediction box and the j-th real box match (1 for a match, 0 for a mismatch); l_i^m denotes the prediction box and ĝ_j^m the offset box of the real box, with m ∈ {cx, cy, w, h}; ĝ_j^cx, ĝ_j^cy, ĝ_j^w and ĝ_j^h are the center x coordinate, the center y coordinate, the width and the height of the offset box of the j-th real box; l_i^cx, l_i^cy, l_i^w and l_i^h are the offsets of the center x coordinate, the center y coordinate, the width and the height of the i-th prediction box; and g_j^cx, g_j^cy, g_j^w and g_j^h are the center x coordinate, the center y coordinate, the width and the height of the j-th real box.
The five loss functions obtained by processing F1, F2, F3, F4 and F5 are denoted L_1(x, c, l, g), L_2(x, c, l, g), L_3(x, c, l, g), L_4(x, c, l, g) and L_5(x, c, l, g), and the total loss function is
Total_loss = L_1(x, c, l, g) + L_2(x, c, l, g) + L_3(x, c, l, g) + L_4(x, c, l, g) + L_5(x, c, l, g).
as a preferred embodiment, the specific step of fusing the pictures in step S13 is as follows:
(1) If, for each of the pictures F1, F2, F3 and F4, the predicted coordinates satisfy x_pred_max, y_pred_max < 300 and x_pred_min, y_pred_min > 0, F1, F2, F3 and F4 are combined into one picture according to their original positions, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, and the predicted coordinates are reduced correspondingly; the result is the final detection result;
(2) If the object classes label1 and label2 detected on the boundary between the left and right parts satisfy label1 = label2, the two detections belong to the same class; the coordinate information of the two objects is compared, and, taking the larger frame as the reference, the frame is extended toward the smaller one by the length (x_max - x_min) so that the divided object is made complete; F1, F2, F3 and F4 are then combined into one picture according to their original positions, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the modified coordinates are reduced correspondingly, and the result is the final detection result;
(3) If the object classes label1 and label2 detected on the boundary between the upper and lower parts satisfy label1 = label2, the two detections belong to the same class; the coordinate information of the two objects is compared, and, taking the larger frame as the reference, the frame is extended toward the smaller one by the length (y_max - y_min) so that the divided object is made complete; the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the modified coordinates are reduced correspondingly, and the result is the final detection result;
(4) If, for the pictures F1, F2, F3 and F4, the predicted coordinates (x_pred_min, y_pred_min) or (x_pred_max, y_pred_max) are equal to (300, 300), the object has been divided simultaneously into four parts (upper left, lower left, upper right and lower right); in this case the detection result of the middle picture F5 is used as the detection result of the object in the middle, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the obtained coordinate information is reduced correspondingly, and the result is the final detection result.
As a preferred embodiment, α = 1.
It should be understood that the above-described embodiments of the present invention are merely examples given to clearly illustrate the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description, and it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (5)

1. A small object detection method based on deep learning and image processing is characterized by comprising the following steps:
Step S1: obtaining a data set, wherein the data set comprises labeled object class information and the upper-left (x_min, y_min) and lower-right (x_max, y_max) coordinates of each target frame; selecting an original picture with label information from the training set of the data set and resizing it to 300x300 as input;
Step S2: dividing the picture along the horizontal line from (0, 150) to (300, 150) and the vertical line from (150, 0) to (150, 300) into four 150x150 parts P1, P2, P3 and P4, and taking the image with the four vertex coordinates (75, 75), (225, 75), (75, 225) and (225, 225) as the fifth part P5;
Step S3: according to the upper-left and lower-right coordinates (x_min, y_min), (x_max, y_max) of the target frame of each input picture, judging whether the object in the picture has been divided, and modifying the coordinates according to how the object was divided;
Step S4: interpolating the pictures by cubic interpolation so that the five divided 150x150 parts P1, P2, P3, P4 and P5 are enlarged to the same 300x300 size as the original picture and are named F1, F2, F3, F4 and F5, and at the same time multiplying the modified coordinates obtained in step S3 by 2 and updating them;
Step S5: extracting features from each of the five pictures F1, F2, F3, F4 and F5 through a VGG16 network, performing convolution with a convolution kernel of size 3x3x1024 to obtain a Conv6 feature map of size 19x19x1024, and performing convolution with a convolution kernel of size 1x1x1024 to obtain a Conv7 feature map of size 19x19x1024;
Step S6: stacking 1x1, 3x3 and 3x3 convolution kernels to form three branches, adding BN (Batch Normalization) at the end of each branch for batch normalization, and concatenating and fusing the branches while introducing a residual network structure; this structure is named the IRBNet convolution structure;
Step S7: extracting features from the 19x19x1024 Conv7 feature map obtained in step S6 through the IRBNet convolution structure to obtain a feature map Conv8 of size 10x10x512; applying the IRBNet convolution to Conv8 to obtain a feature map Conv9 of size 5x5x256; applying the IRBNet convolution to Conv9 to obtain a feature map Conv10 of size 3x3x256; and applying the IRBNet convolution to Conv10 to obtain a feature map Conv11 of size 1x1x256;
Step S8: deconvolving the higher-level feature map with a 3x3 convolution kernel and a step length of 4, enlarging it by a factor of two so that it has the same size as the adjacent lower layer, and then adding the pixels at corresponding positions one by one to obtain a new feature map whose size is consistent with that of the lower-level feature map; this structure is named HDPANet;
Step S9: passing the feature map Conv8 through step S8 to obtain another feature map of size 19x19x1024 and adding it to Conv7 to obtain the feature map Conv7D; passing Conv9 through step S8 to obtain another feature map of size 10x10x512 and adding it to Conv8 to obtain the feature map Conv8D; passing Conv10 through step S8 to obtain another feature map of size 5x5x256 and adding it to Conv9 to obtain the feature map Conv9D; and passing Conv11 through step S8 to obtain another feature map of size 3x3x256 and adding it to Conv10 to obtain the feature map Conv10D;
Step S10: convolving the Conv4_3, Conv10D and Conv11 feature layers with a 3x3 convolution kernel to obtain feature maps with 4x(class+4) channels, and convolving the Conv7D, Conv8D and Conv9D feature layers with a 3x3 convolution kernel to obtain feature maps with 6x(class+4) channels;
Step S11: F1, F2, F3, F4 and F5 each obtain a corresponding loss function loss through steps S1 to S10; during back propagation, the sum of the five loss functions is optimized with a stochastic gradient descent algorithm, and the number of training iterations (epochs) is set; when the total loss is stable, the obtained network parameters are the optimal solution;
Step S12: selecting pictures without label information from the data set, executing steps S1 and S2 to divide them, feeding the divided pictures into the network trained in steps S1 to S10, and filtering the results by non-maximum suppression to finally obtain, for the five pictures F1, F2, F3, F4 and F5, the predicted class label and the predicted coordinates (x_pred_min, y_pred_min), (x_pred_max, y_pred_max);
Step S13: fusing the pictures according to the predicted class labels and predicted coordinates of the five pictures F1, F2, F3, F4 and F5, wherein the fused result is the final detection result.
2. The method for detecting small objects based on deep learning and image processing as claimed in claim 1, wherein the coordinates in step S3 are modified as follows:
1) If x_min < 150 and x_max > 150, with y_min, y_max < 150 or y_min, y_max > 150, the object in the image has been divided vertically into a left part and a right part; the new coordinates are (x_min, y_min), (150, y_max) and (150, y_min), (x_max, y_max), and the category information does not change;
2) If x_min, x_max < 150 or x_min, x_max > 150, with y_min < 150 and y_max > 150, the object in the image has been divided horizontally into an upper part and a lower part; the new coordinates are (x_min, y_min), (x_max, 150) and (x_min, 150), (x_max, y_max), and the category information does not change;
3) If x_min < 150, y_min < 150, x_max > 150 and y_max > 150, the object in the image has been cut into four parts in both the horizontal and vertical directions; the new coordinates are (x_min, y_min), (150, 150); (150, y_min), (x_max, 150); (x_min, 150), (150, y_max); and (150, 150), (x_max, y_max), and the category information does not change.
3. The method for detecting small objects based on deep learning and image processing as claimed in claim 2, wherein the loss function loss and the total loss Total_loss in step S11 are obtained as follows:
The loss is divided into two parts, the confidence loss and the location loss:
L(x, c, l, g) = (1/N) ( L_conf(x, c) + α L_loc(x, l, g) )
where L(x, c, l, g) denotes the loss, L_conf denotes the confidence loss, which uses the softmax loss, and L_loc denotes the location loss; N is the number of prior boxes matched to the ground truth in the confidence loss, and the parameter α adjusts the ratio between the confidence loss and the location loss. x_ij^p ∈ {0, 1} indicates whether the i-th prediction box is matched to the j-th real (GT) box of category p; c denotes the confidence, l the prediction box, and g the real box.
The confidence loss is
L_conf(x, c) = - Σ_{i∈Pos}^{N} x_ij^p log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0),   with   ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p),
where ĉ_i^p is the probability value generated by the softmax method, Pos denotes the positive samples, Neg the negative samples, and N is the number of prior boxes matched to the ground truth in the confidence loss; when x_ij^p = 1, ĉ_i^p denotes the probability that the i-th prediction box belongs to category p, where p denotes the p-th category among the classes.
The location loss is
L_loc(x, l, g) = Σ_{i∈Pos}^{N} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1( l_i^m - ĝ_j^m ),
with the encoded targets
ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w,   ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h,   ĝ_j^w = log( g_j^w / d_i^w ),   ĝ_j^h = log( g_j^h / d_i^h ),
where cx denotes the x coordinate of the center point of a box, cy the y coordinate of the center point, w the width, h the height, i the i-th prediction box, j the j-th real box, and d_i the prior (default) box against which the offsets are computed; x_ij^k indicates whether the i-th prediction box and the j-th real box match (1 for a match, 0 for a mismatch); l_i^m denotes the prediction box and ĝ_j^m the offset box of the real box, with m ∈ {cx, cy, w, h}; ĝ_j^cx, ĝ_j^cy, ĝ_j^w and ĝ_j^h are the center x coordinate, the center y coordinate, the width and the height of the offset box of the j-th real box; l_i^cx, l_i^cy, l_i^w and l_i^h are the offsets of the center x coordinate, the center y coordinate, the width and the height of the i-th prediction box; and g_j^cx, g_j^cy, g_j^w and g_j^h are the center x coordinate, the center y coordinate, the width and the height of the j-th real box.
The five loss functions obtained by processing F1, F2, F3, F4 and F5 are denoted L_1(x, c, l, g), L_2(x, c, l, g), L_3(x, c, l, g), L_4(x, c, l, g) and L_5(x, c, l, g), and the total loss function is
Total_loss = L_1(x, c, l, g) + L_2(x, c, l, g) + L_3(x, c, l, g) + L_4(x, c, l, g) + L_5(x, c, l, g).
4. The method for detecting small objects based on deep learning and image processing as claimed in claim 3, wherein the pictures in step S13 are fused as follows:
(1) If, for each of the pictures F1, F2, F3 and F4, the predicted coordinates satisfy x_pred_max, y_pred_max < 300 and x_pred_min, y_pred_min > 0, F1, F2, F3 and F4 are combined into one picture according to their original positions, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, and the predicted coordinates are reduced correspondingly; the result is the final detection result;
(2) If the object classes label1 and label2 detected on the boundary between the left and right parts satisfy label1 = label2, the two detections belong to the same class; the coordinate information of the two objects is compared, and, taking the larger frame as the reference, the frame is extended toward the smaller one by the length (x_max - x_min) so that the divided object is made complete; F1, F2, F3 and F4 are then combined into one picture according to their original positions, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the modified coordinates are reduced correspondingly, and the result is the final detection result;
(3) If the object classes label1 and label2 detected on the boundary between the upper and lower parts satisfy label1 = label2, the two detections belong to the same class; the coordinate information of the two objects is compared, and, taking the larger frame as the reference, the frame is extended toward the smaller one by the length (y_max - y_min) so that the divided object is made complete; the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the modified coordinates are reduced correspondingly, and the result is the final detection result;
(4) If, for the pictures F1, F2, F3 and F4, the predicted coordinates (x_pred_min, y_pred_min) or (x_pred_max, y_pred_max) are equal to (300, 300), the object has been divided simultaneously into four parts (upper left, lower left, upper right and lower right); in this case the detection result of the middle picture F5 is used as the detection result of the object in the middle, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the obtained coordinate information is reduced correspondingly, and the result is the final detection result.
5. The method for detecting small objects based on deep learning and image processing as claimed in claim 4, wherein α = 1.
CN201811605116.2A 2018-12-26 2018-12-26 Small object detection method based on deep learning and image processing Active CN109685145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811605116.2A CN109685145B (en) 2018-12-26 2018-12-26 Small object detection method based on deep learning and image processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811605116.2A CN109685145B (en) 2018-12-26 2018-12-26 Small object detection method based on deep learning and image processing

Publications (2)

Publication Number Publication Date
CN109685145A CN109685145A (en) 2019-04-26
CN109685145B true CN109685145B (en) 2022-09-06

Family

ID=66189765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811605116.2A Active CN109685145B (en) 2018-12-26 2018-12-26 Small object detection method based on deep learning and image processing

Country Status (1)

Country Link
CN (1) CN109685145B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110068818A (en) * 2019-05-05 2019-07-30 中国汽车工程研究院股份有限公司 The working method of traffic intersection vehicle and pedestrian detection is carried out by radar and image capture device
CN110276445A (en) * 2019-06-19 2019-09-24 长安大学 Domestic communication label category method based on Inception convolution module
CN110660074B (en) * 2019-10-10 2021-04-16 北京同创信通科技有限公司 Method for establishing steel scrap grade division neural network model
CN113393411A (en) * 2020-02-26 2021-09-14 顺丰科技有限公司 Package counting method and device, server and computer readable storage medium
CN111488938B (en) * 2020-04-15 2022-05-13 闽江学院 Image matching method based on two-step switchable normalized depth neural network
CN111597340A (en) * 2020-05-22 2020-08-28 迪爱斯信息技术股份有限公司 Text classification method and device and readable storage medium
CN111860623A (en) * 2020-07-03 2020-10-30 北京林业大学 Method and system for counting tree number based on improved SSD neural network
CN113762166A (en) * 2021-09-09 2021-12-07 中国矿业大学 Small target detection improvement method and system based on wearable equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN108564065A (en) * 2018-04-28 2018-09-21 广东电网有限责任公司 A kind of cable tunnel open fire recognition methods based on SSD

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN108564065A (en) * 2018-04-28 2018-09-21 广东电网有限责任公司 A kind of cable tunnel open fire recognition methods based on SSD

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yingying Wang et al.; "Robust person head detection based on multi-scale representation fusion of deep convolution neural network"; Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics; 2017-12-08; full text *

Also Published As

Publication number Publication date
CN109685145A (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN109685145B (en) Small object detection method based on deep learning and image processing
CN109859190B (en) Target area detection method based on deep learning
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN111126359B (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN113076871B (en) Fish shoal automatic detection method based on target shielding compensation
CN111640125B (en) Aerial photography graph building detection and segmentation method and device based on Mask R-CNN
CN111524150B (en) Image processing method and device
CN108241854B (en) Depth video saliency detection method based on motion and memory information
CN104820990A (en) Interactive-type image-cutting system
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN107506792B (en) Semi-supervised salient object detection method
CN116645592B (en) Crack detection method based on image processing and storage medium
CN108022244B (en) Hypergraph optimization method for significant target detection based on foreground and background seeds
CN111768415A (en) Image instance segmentation method without quantization pooling
CN113111916A (en) Medical image semantic segmentation method and system based on weak supervision
CN112906794A (en) Target detection method, device, storage medium and terminal
CN111767962A (en) One-stage target detection method, system and device based on generation countermeasure network
CN112016512A (en) Remote sensing image small target detection method based on feedback type multi-scale training
CN113420643A (en) Lightweight underwater target detection method based on depth separable cavity convolution
CN111242066B (en) Large-size image target detection method, device and computer readable storage medium
CN113536896B (en) Insulator defect detection method and device based on improved Faster RCNN and storage medium
CN111931572B (en) Target detection method for remote sensing image
CN114037720A (en) Pathological image segmentation and classification method and device based on semi-supervised learning
CN113744280A (en) Image processing method, apparatus, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant