CN109685145B - Small object detection method based on deep learning and image processing - Google Patents

Small object detection method based on deep learning and image processing Download PDF

Info

Publication number
CN109685145B
CN109685145B CN201811605116.2A CN201811605116A
Authority
CN
China
Prior art keywords
max
feature map
size
box
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811605116.2A
Other languages
Chinese (zh)
Other versions
CN109685145A (en)
Inventor
李卫军
吴超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201811605116.2A priority Critical patent/CN109685145B/en
Publication of CN109685145A publication Critical patent/CN109685145A/en
Application granted granted Critical
Publication of CN109685145B publication Critical patent/CN109685145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image processing, and in particular to a small object detection method based on deep learning and image processing. The method replaces the 5x5 convolution of the original Inception module with two 3x3 convolution kernels, which retains more detail; at the same time, to accelerate training and keep the outputs consistent, BN (Batch Normalization) is added at the end of each branch, and a residual network structure is introduced to increase accuracy. In addition, the invention uses deconvolution to enhance the context information between two adjacent layers: the deconvolution result of the upper layer is aligned with the pixels of the lower layer and added one by one, and the resulting new feature map is used as the feature map for detection. The invention improves the recognition of small objects and raises the accuracy of the conventional SSD in detecting small objects without affecting its high FPS.

Description

Small object detection method based on deep learning and image processing
Technical Field
The invention relates to the field of image processing, in particular to a small object detection method based on deep learning and image processing.
Background
Currently, a commonly used object detection algorithm is the SSD (Single Shot MultiBox Detector). The SSD is an end-to-end detection framework based on deep learning whose architecture consists of two main parts: the front end is a convolutional neural network (VGG16) that extracts features of the target, and the back end is a multi-scale feature detection network that extracts features at different scales from the feature layers produced by the front-end network. The Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers are then convolved to obtain coordinate positions and confidence scores, and the final result is obtained by non-maximum suppression (NMS).
Because the SSD adopts a multi-scale detection approach, it keeps the amount of computation low and achieves a high FPS. However, since detection is performed on feature maps of different scales, the receptive fields on those feature maps differ; on the high-level convolutional layers in particular the receptive field is large and the extracted features are abstract, so the SSD is not sensitive to small objects and fine details.
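For orientation, the conventional SSD300 pipeline described above draws its detections from six feature maps of decreasing resolution. The following Python sketch lists those standard SSD300 detection sources and the resulting number of default boxes; it is given only as context for the modifications that follow and is not part of the claimed method.

    # Standard SSD300 detection sources (values from the original SSD design).
    # Each entry: (layer name, spatial resolution, default boxes per location).
    SSD300_DETECTION_LAYERS = [
        ("Conv4_3",  38, 4),
        ("Conv7",    19, 6),
        ("Conv8_2",  10, 6),
        ("Conv9_2",   5, 6),
        ("Conv10_2",  3, 4),
        ("Conv11_2",  1, 4),
    ]

    def total_default_boxes(layers=SSD300_DETECTION_LAYERS):
        """Number of default boxes evaluated per 300x300 image (8732 for SSD300)."""
        return sum(res * res * boxes for _, res, boxes in layers)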
Disclosure of Invention
In order to overcome the defect that the SSD detection algorithm is insensitive to small object detection in the prior art, the invention provides a small object detection method based on deep learning and image processing.
In order to realize this purpose, the technical solution is as follows:
A small object detection method based on deep learning and image processing comprises the following steps:
Step S1: acquiring a data set, wherein the data set comprises labeled object class information and the upper-left (x_min, y_min) and lower-right (x_max, y_max) coordinates of each target frame; selecting an original picture with label information from the training set of the data set and resizing it to 300x300 as input;
Step S2: dividing the picture along the horizontal line from (0, 150) to (300, 150) and the vertical line from (150, 0) to (150, 300) into four 150x150 parts P1, P2, P3 and P4, and taking the image with the four vertex coordinates (75, 75), (225, 75), (75, 225) and (225, 225) as the fifth part P5;
Step S3: according to the upper-left and lower-right coordinates (x_min, y_min), (x_max, y_max) of the target frame of each input picture, judging whether the object in the picture has been divided, and modifying the coordinates according to how the object was divided;
Step S4: interpolating the pictures by cubic interpolation so that the five divided 150x150 parts P1, P2, P3, P4 and P5 are enlarged to the same 300x300 size as the original picture and are named F1, F2, F3, F4 and F5, and at the same time multiplying the modified coordinates obtained in step S3 by 2 and updating them (a code sketch of this division and enlargement is given after step S13);
Step S5: extracting features from each of the five pictures F1, F2, F3, F4 and F5 through a VGG16 network, performing convolution with a convolution kernel of size 3x3x1024 to obtain a Conv6 feature map of size 19x19x1024, and performing convolution with a convolution kernel of size 1x1x1024 to obtain a Conv7 feature map of size 19x19x1024;
Step S6: stacking 1x1, 3x3 and 3x3 convolution kernels to form three branches, adding BN (Batch Normalization) at the end of each branch for batch normalization, and concatenating and fusing the branches while introducing a residual network structure; this structure is named the IRBNet convolution structure;
Step S7: extracting features from the 19x19x1024 Conv7 feature map obtained in step S5 through the IRBNet convolution structure to obtain a feature map Conv8 of size 10x10x512; applying the IRBNet convolution to Conv8 to obtain a feature map Conv9 of size 5x5x256; applying the IRBNet convolution to Conv9 to obtain a feature map Conv10 of size 3x3x256; and applying the IRBNet convolution to Conv10 to obtain a feature map Conv11 of size 1x1x256;
Step S8: deconvolving the higher-level feature map with a 3x3 convolution kernel and a step length of 4, enlarging it by a factor of two so that it has the same size as the adjacent lower layer, and then adding the pixels at corresponding positions one by one to obtain a new feature map whose size is consistent with that of the lower-level feature map; this structure is named HDPANet;
Step S9: passing the feature map Conv8 through step S8 to obtain another feature map of size 19x19x1024 and adding it to Conv7 to obtain the feature map Conv7D; passing Conv9 through step S8 to obtain another feature map of size 10x10x512 and adding it to Conv8 to obtain the feature map Conv8D; passing Conv10 through step S8 to obtain another feature map of size 5x5x256 and adding it to Conv9 to obtain the feature map Conv9D; and passing Conv11 through step S8 to obtain another feature map of size 3x3x256 and adding it to Conv10 to obtain the feature map Conv10D;
Step S10: convolving the Conv4_3, Conv10D and Conv11 feature layers with a 3x3 convolution kernel to obtain feature maps with 4x(class+4) channels, and convolving the Conv7D, Conv8D and Conv9D feature layers with a 3x3 convolution kernel to obtain feature maps with 6x(class+4) channels;
Step S11: F1, F2, F3, F4 and F5 each obtain a corresponding loss function loss through steps S1 to S10; during back propagation, the sum of the five loss functions is optimized with a stochastic gradient descent algorithm, and the number of training iterations (epochs) is set; when the total loss is stable, the obtained network parameters are the optimal solution;
Step S12: selecting pictures without label information from the data set, executing steps S1 and S2 to divide them, feeding the divided pictures into the network trained in steps S1 to S10, and filtering the results by non-maximum suppression to finally obtain, for the five pictures F1, F2, F3, F4 and F5, the predicted class label and the predicted coordinates (x_pred_min, y_pred_min), (x_pred_max, y_pred_max);
Step S13: fusing the pictures according to the predicted class labels and predicted coordinates of the five pictures F1, F2, F3, F4 and F5; the fused result is the final detection result.
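The division and enlargement of steps S2 and S4 can be illustrated with the following minimal Python/OpenCV sketch; the function name, the use of cv2.resize for the cubic interpolation, and the assignment of P1..P4 to the four quadrants are illustrative assumptions, not part of the claimed method.

    import cv2

    def divide_and_upscale(image_300):
        """Split a 300x300 image into P1..P5 (step S2) and enlarge each part back to
        300x300 with cubic interpolation (step S4), returning F1..F5."""
        assert image_300.shape[:2] == (300, 300)
        parts = [
            image_300[0:150,   0:150],    # P1: upper-left quarter
            image_300[0:150, 150:300],    # P2: upper-right quarter
            image_300[150:300, 0:150],    # P3: lower-left quarter
            image_300[150:300, 150:300],  # P4: lower-right quarter
            image_300[75:225, 75:225],    # P5: centre crop with vertices (75,75)..(225,225)
        ]
        # Cubic interpolation doubles each 150x150 part to 300x300; the box
        # coordinates modified in step S3 are therefore multiplied by 2 (step S4).
        return [cv2.resize(p, (300, 300), interpolation=cv2.INTER_CUBIC) for p in parts]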
Preferably, the coordinates in step S3 are modified according to the following three cases (a code sketch of these cases is given below):
1) If x_min < 150 and x_max > 150, with y_min, y_max < 150 or y_min, y_max > 150, the object in the image has been divided vertically into a left part and a right part; the new coordinates are (x_min, y_min), (150, y_max) and (150, y_min), (x_max, y_max), and the category information does not change;
2) If x_min, x_max < 150 or x_min, x_max > 150, with y_min < 150 and y_max > 150, the object in the image has been divided horizontally into an upper part and a lower part; the new coordinates are (x_min, y_min), (x_max, 150) and (x_min, 150), (x_max, y_max), and the category information does not change;
3) If x_min < 150, y_min < 150, x_max > 150 and y_max > 150, the object in the image has been cut into four parts in both the horizontal and vertical directions; the new coordinates are (x_min, y_min), (150, 150); (150, y_min), (x_max, 150); (x_min, 150), (150, y_max); and (150, 150), (x_max, y_max), and the category information does not change.
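A minimal Python sketch of the three cases above follows; the function name is illustrative, and s = 150 is the split line of step S2.

    def split_box(box, s=150):
        """Apply the three cases of step S3 to one ground-truth box
        (x_min, y_min, x_max, y_max). Returns the sub-boxes produced by the division;
        a box that does not cross x = s or y = s is returned unchanged."""
        x_min, y_min, x_max, y_max = box
        crosses_v = x_min < s < x_max   # object cut by the vertical line x = 150
        crosses_h = y_min < s < y_max   # object cut by the horizontal line y = 150
        if crosses_v and crosses_h:     # case 3): cut into four parts
            return [(x_min, y_min, s, s), (s, y_min, x_max, s),
                    (x_min, s, s, y_max), (s, s, x_max, y_max)]
        if crosses_v:                   # case 1): left and right parts
            return [(x_min, y_min, s, y_max), (s, y_min, x_max, y_max)]
        if crosses_h:                   # case 2): upper and lower parts
            return [(x_min, y_min, x_max, s), (x_min, s, x_max, y_max)]
        return [box]                    # object not divided; the category is unchanged in all cases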
Preferably, the loss function loss and the total loss Total_loss in step S11 are obtained as follows:
The loss is divided into two parts, the confidence loss and the location loss:
L(x, c, l, g) = (1/N) ( L_conf(x, c) + α L_loc(x, l, g) )
where L(x, c, l, g) denotes the loss, L_conf denotes the confidence loss, which uses the softmax loss, and L_loc denotes the location loss; N is the number of prior boxes matched to the ground truth in the confidence loss, and the parameter α adjusts the ratio between the confidence loss and the location loss. x_ij^p ∈ {0, 1} indicates whether the i-th prediction box is matched to the j-th real (GT) box of category p; c denotes the confidence, l the prediction box, and g the real box.
The confidence loss is
L_conf(x, c) = - Σ_{i∈Pos}^{N} x_ij^p log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0),   with   ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p),
where ĉ_i^p is the probability value generated by the softmax method, Pos denotes the positive samples, Neg the negative samples, and N is the number of prior boxes matched to the ground truth in the confidence loss; when x_ij^p = 1, ĉ_i^p denotes the probability that the i-th prediction box belongs to category p, where p denotes the p-th category among the classes.
The location loss is
L_loc(x, l, g) = Σ_{i∈Pos}^{N} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1( l_i^m - ĝ_j^m ),
with the encoded targets
ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w,   ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h,   ĝ_j^w = log( g_j^w / d_i^w ),   ĝ_j^h = log( g_j^h / d_i^h ),
where cx denotes the x coordinate of the center point of a box, cy the y coordinate of the center point, w the width, h the height, i the i-th prediction box, j the j-th real box, and d_i the prior (default) box against which the offsets are computed; x_ij^k indicates whether the i-th prediction box and the j-th real box match (1 for a match, 0 for a mismatch); l_i^m denotes the prediction box and ĝ_j^m the offset box of the real box, with m ∈ {cx, cy, w, h}; ĝ_j^cx, ĝ_j^cy, ĝ_j^w and ĝ_j^h are the center x coordinate, the center y coordinate, the width and the height of the offset box of the j-th real box; l_i^cx, l_i^cy, l_i^w and l_i^h are the offsets of the center x coordinate, the center y coordinate, the width and the height of the i-th prediction box; and g_j^cx, g_j^cy, g_j^w and g_j^h are the center x coordinate, the center y coordinate, the width and the height of the j-th real box.
The five loss functions obtained by processing F1, F2, F3, F4 and F5 are denoted L_1(x, c, l, g), L_2(x, c, l, g), L_3(x, c, l, g), L_4(x, c, l, g) and L_5(x, c, l, g), and the total loss function is
Total_loss = L_1(x, c, l, g) + L_2(x, c, l, g) + L_3(x, c, l, g) + L_4(x, c, l, g) + L_5(x, c, l, g).
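For illustration, the per-image loss L(x, c, l, g) and the summation of step S11 can be sketched in PyTorch as follows. This is a simplified sketch: it assumes that prior-box matching and hard-negative mining have already produced the class targets, the encoded location targets and the positive mask, and the function names are illustrative rather than taken from the patent.

    import torch
    import torch.nn.functional as F

    def multibox_loss(conf_pred, loc_pred, conf_target, loc_target, pos_mask, alpha=1.0):
        """Sketch of L(x, c, l, g) = (1/N) * (L_conf + alpha * L_loc).
        conf_pred: (num_priors, num_classes) raw scores; conf_target: (num_priors,)
        class indices (0 = background); loc_pred/loc_target: (num_priors, 4) encoded
        offsets; pos_mask: (num_priors,) boolean mask of matched (positive) priors."""
        num_pos = pos_mask.sum().clamp(min=1).float()        # N in the formula above
        # Confidence loss: softmax cross-entropy over the retained priors
        # (positives plus the hard negatives selected upstream).
        l_conf = F.cross_entropy(conf_pred, conf_target, reduction="sum")
        # Location loss: Smooth-L1 between predicted and encoded offsets, positives only.
        l_loc = F.smooth_l1_loss(loc_pred[pos_mask], loc_target[pos_mask], reduction="sum")
        return (l_conf + alpha * l_loc) / num_pos

    # Step S11: the total loss is the sum of the five per-sub-image losses,
    # Total_loss = L_1 + L_2 + L_3 + L_4 + L_5, e.g.
    # total_loss = sum(multibox_loss(*t) for t in targets_for_F1_to_F5)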
preferably, the step S13 of fusing the pictures specifically includes the following steps:
(1) if the predicted coordinate x of each picture of F1, F2, F3 and F4 is x pred_max ,y pred_max < 300 and x pred_min ,y pred_min If the current position is more than 0, combining F1, F2, F3 and F4 into a picture according to the original position, reducing the size of the fused picture by 4 times to obtain the size of 300x300 of the original image, reducing the predicted coordinate by 4 times, and obtaining the final result of detection;
(2) the types of objects, label1 and label2, on the boundary between the left and right parts are detected, and when label1 is equal to label2, they are represented as the same type, and the coordinate information of both objects is compared in magnitude, and the object is extended in a small direction with a large frame as the reference (x is 2) max -x min ) The length of the picture is then completely compensated, F1, F2, F3 and F4 are combined into a picture according to the original position, the size of the integrated picture is reduced by 4 times to obtain the size of the original picture 300x300, meanwhile, the modified coordinate is reduced by 4 times, and the final result is the final result of detection;
(3) the types of objects on the boundaries of the upper and lower parts, label1 and label2, are detected, and when label1 is equal to label2, they are represented as the same type, and the coordinate information of the two objects is compared in magnitude, and y is extended in a small direction with respect to a large frame max Subtracting y min The length of the steel is then filled; reducing the size of the fused whole picture by 4 times to obtain the size of the original image of 300x300, and reducing the modified coordinates by 4 times, wherein the final result is the final result of detection;
(4) if the predicted coordinate (x) of each of F1, F2, F3 and F4 is pred_min ,y pred_min ) (300 ) or (x) pred_max ,y pred_max ) The term (300 ) indicates that the object is divided into four parts of upper left, lower left, upper right and lower right simultaneously; the detection result of the middle picture F5 is used as the detection result of the intermediate object, the size of the fused whole picture is reduced by 4 times to obtain the size of 300 × 300 of the original image, the obtained coordinate information is reduced by 4 times at the same time, and the final result is the final detection result.
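The coordinate mapping used in case (1) can be sketched as follows; halving each coordinate per axis is taken here as the interpretation of the four-fold reduction of the fused picture (a 600x600 mosaic of F1..F4 reduced back to 300x300), and the function name and the row-major assignment of P1..P4 to the four quadrants are illustrative assumptions.

    def map_sub_image_box(box, part_index, s=150):
        """Map a box predicted on one 300x300 sub-image F1..F4 back onto the original
        300x300 image: halve the coordinates (undoing the 2x enlargement of step S4)
        and shift by the offset of that part. part_index: 0..3 for P1..P4."""
        x_min, y_min, x_max, y_max = (v / 2.0 for v in box)  # undo the cubic upscaling
        dx = s * (part_index % 2)                            # P2 and P4 lie in the right half
        dy = s * (part_index // 2)                           # P3 and P4 lie in the lower half
        return (x_min + dx, y_min + dy, x_max + dx, y_max + dy)

The boundary-merging cases (2) to (4) operate on the boxes produced this way and are omitted from the sketch for brevity.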
Preferably, α is 1.
Compared with the prior art, the invention has the following beneficial effects:
The invention replaces the 5x5 convolution of the original Inception module with two 3x3 convolution kernels, which retains more detail. To accelerate training and keep the outputs consistent, BN (Batch Normalization) is added at the end of each branch for batch normalization, and a residual network structure is introduced to increase accuracy. In addition, deconvolution is used to enhance the context information between two adjacent layers: the deconvolution result of the upper layer is aligned with the pixels of the lower layer and added one by one, and the resulting new feature map is used as the feature map for detection, which improves the recognition of small objects.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of a segmentation point of a segmented image according to the present invention.
FIG. 3 is a flow chart of the image division according to the present invention.
FIG. 4 is a block diagram of the residual network structure of the present invention.
FIG. 5 is a diagram of the structure of IRBNet.
FIG. 6 is a flow chart of the high-level deconvolution and pixel-wise addition.
FIG. 7 is a flow chart for obtaining the predicted class label and the predicted coordinates.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
As shown in fig. 1 to fig. 7, a small object detection method based on deep learning and image processing includes the following steps:
Step S1: acquiring a data set, wherein the data set comprises labeled object class information and the upper-left (x_min, y_min) and lower-right (x_max, y_max) coordinates of each target frame; selecting an image with label information from the training set of the data set and resizing it to 300x300 as input;
Step S2: as shown in fig. 2, dividing the picture along the horizontal line from (0, 150) to (300, 150) and the vertical line from (150, 0) to (150, 300) into four 150x150 parts P1, P2, P3 and P4, and taking the image with the four vertex coordinates (75, 75), (225, 75), (75, 225) and (225, 225) as the fifth part P5;
Step S3: according to the upper-left and lower-right coordinates (x_min, y_min), (x_max, y_max) of the target frame of each input picture, judging whether the object in the picture has been divided, and modifying the coordinates according to how the object was divided;
Step S4: as shown in fig. 3, interpolating the pictures by cubic interpolation so that the five divided 150x150 parts P1, P2, P3, P4 and P5 are enlarged to the same 300x300 size as the original picture and are named F1, F2, F3, F4 and F5, and at the same time multiplying the modified coordinates obtained in step S3 by 2 and updating them;
Step S5: extracting features from each of the five pictures F1, F2, F3, F4 and F5 through a VGG16 network, performing convolution with a convolution kernel of size 3x3x1024 to obtain a Conv6 feature map of size 19x19x1024, and performing convolution with a convolution kernel of size 1x1x1024 to obtain a Conv7 feature map of size 19x19x1024;
Step S6: as shown in fig. 4 and fig. 5, stacking 1x1, 3x3 and 3x3 convolution kernels to form three branches, adding BN (Batch Normalization) at the end of each branch for batch normalization, and concatenating and fusing the branches while introducing a residual network structure; this structure is named the IRBNet convolution structure (code sketches of this block and of the HDPANet fusion of steps S8-S9 are given after step S13 of this embodiment);
Step S7: extracting features from the 19x19x1024 Conv7 feature map obtained in step S5 through the IRBNet convolution structure to obtain a feature map Conv8 of size 10x10x512; applying the IRBNet convolution to Conv8 to obtain a feature map Conv9 of size 5x5x256; applying the IRBNet convolution to Conv9 to obtain a feature map Conv10 of size 3x3x256; and applying the IRBNet convolution to Conv10 to obtain a feature map Conv11 of size 1x1x256;
Step S8: as shown in fig. 6, deconvolving the higher-level feature map with a 3x3 convolution kernel and a step length of 4, enlarging it by a factor of two so that it has the same size as the adjacent lower layer, and then adding the pixels at corresponding positions one by one to obtain a new feature map whose size is consistent with that of the lower-level feature map; this structure is named HDPANet;
Step S9: passing the feature map Conv8 through step S8 to obtain another feature map of size 19x19x1024 and adding it to Conv7 to obtain the feature map Conv7D; passing Conv9 through step S8 to obtain another feature map of size 10x10x512 and adding it to Conv8 to obtain the feature map Conv8D; passing Conv10 through step S8 to obtain another feature map of size 5x5x256 and adding it to Conv9 to obtain the feature map Conv9D; and passing Conv11 through step S8 to obtain another feature map of size 3x3x256 and adding it to Conv10 to obtain the feature map Conv10D;
Step S10: convolving the Conv4_3, Conv10D and Conv11 feature layers with a 3x3 convolution kernel to obtain feature maps with 4x(class+4) channels, and convolving the Conv7D, Conv8D and Conv9D feature layers with a 3x3 convolution kernel to obtain feature maps with 6x(class+4) channels;
Step S11: F1, F2, F3, F4 and F5 each obtain a corresponding loss function loss through steps S1 to S10; during back propagation, the sum of the five loss functions is optimized with a stochastic gradient descent algorithm, and the number of training iterations (epochs) is set; when the total loss is stable, the obtained network parameters are the optimal solution;
Step S12: as shown in fig. 7, selecting pictures without label information from the data set, executing steps S1 and S2 to divide them, feeding the divided pictures into the network trained in steps S1 to S10, and filtering the results by non-maximum suppression to finally obtain, for the five pictures F1, F2, F3, F4 and F5, the predicted class label and the predicted coordinates (x_pred_min, y_pred_min), (x_pred_max, y_pred_max);
Step S13: fusing the pictures according to the predicted class labels and predicted coordinates of the five pictures F1, F2, F3, F4 and F5; the fused result is the final detection result.
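Minimal PyTorch sketches of the IRBNet block of step S6 and of the HDPANet fusion of steps S8 and S9 are given below. The channel numbers, the 1x1 projection before the residual addition and the ReLU placement are illustrative assumptions, and the deconvolution layer is assumed to be configured externally (for example with the 3x3 kernel and step length described in step S8) so that its output matches the lower-level map in channels.

    import torch
    import torch.nn as nn

    class IRBNetBlock(nn.Module):
        """Inception-style block: the 5x5 branch is replaced by two stacked 3x3
        convolutions, BN ends every branch, and a residual connection wraps the
        concatenated branches (step S6)."""
        def __init__(self, in_ch, branch_ch):
            super().__init__()
            self.b1 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                    nn.BatchNorm2d(branch_ch))
            self.b2 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, padding=1),
                                    nn.BatchNorm2d(branch_ch))
            self.b3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, padding=1),
                                    nn.Conv2d(branch_ch, branch_ch, 3, padding=1),  # two 3x3 in place of 5x5
                                    nn.BatchNorm2d(branch_ch))
            self.project = nn.Conv2d(3 * branch_ch, in_ch, 1)  # back to in_ch for the residual add
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
            return self.relu(self.project(out) + x)            # residual connection

    def hdpanet_fuse(upper, lower, deconv):
        """HDPANet fusion (steps S8-S9): deconvolve the higher-level map so that it
        matches the lower-level map, then add the two pixel by pixel."""
        up = deconv(upper)
        up = nn.functional.interpolate(up, size=lower.shape[-2:])  # guard against off-by-one sizes
        return lower + up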
As a preferred embodiment, the coordinates in step S3 are modified according to the following three cases:
1) If x_min < 150 and x_max > 150, with y_min, y_max < 150 or y_min, y_max > 150, the object in the image has been divided vertically into a left part and a right part; the new coordinates are (x_min, y_min), (150, y_max) and (150, y_min), (x_max, y_max), and the category information does not change;
2) If x_min, x_max < 150 or x_min, x_max > 150, with y_min < 150 and y_max > 150, the object in the image has been divided horizontally into an upper part and a lower part; the new coordinates are (x_min, y_min), (x_max, 150) and (x_min, 150), (x_max, y_max), and the category information does not change;
3) If x_min < 150, y_min < 150, x_max > 150 and y_max > 150, the object in the image has been cut into four parts in both the horizontal and vertical directions; the new coordinates are (x_min, y_min), (150, 150); (150, y_min), (x_max, 150); (x_min, 150), (150, y_max); and (150, 150), (x_max, y_max), and the category information does not change.
As a preferred embodiment, the loss function loss and the total loss Total_loss in step S11 are obtained as follows:
The loss is divided into two parts, the confidence loss and the location loss:
L(x, c, l, g) = (1/N) ( L_conf(x, c) + α L_loc(x, l, g) )
where L(x, c, l, g) denotes the loss, L_conf denotes the confidence loss, which uses the softmax loss, and L_loc denotes the location loss; N is the number of prior boxes matched to the ground truth in the confidence loss, and the parameter α adjusts the ratio between the confidence loss and the location loss. x_ij^p ∈ {0, 1} indicates whether the i-th prediction box is matched to the j-th real (GT) box of category p; c denotes the confidence, l the prediction box, and g the real box.
The confidence loss is
L_conf(x, c) = - Σ_{i∈Pos}^{N} x_ij^p log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0),   with   ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p),
where ĉ_i^p is the probability value generated by the softmax method, Pos denotes the positive samples, Neg the negative samples, and N is the number of prior boxes matched to the ground truth in the confidence loss; when x_ij^p = 1, ĉ_i^p denotes the probability that the i-th prediction box belongs to category p, where p denotes the p-th category among the classes.
The location loss is
L_loc(x, l, g) = Σ_{i∈Pos}^{N} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1( l_i^m - ĝ_j^m ),
with the encoded targets
ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w,   ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h,   ĝ_j^w = log( g_j^w / d_i^w ),   ĝ_j^h = log( g_j^h / d_i^h ),
where cx denotes the x coordinate of the center point of a box, cy the y coordinate of the center point, w the width, h the height, i the i-th prediction box, j the j-th real box, and d_i the prior (default) box against which the offsets are computed; x_ij^k indicates whether the i-th prediction box and the j-th real box match (1 for a match, 0 for a mismatch); l_i^m denotes the prediction box and ĝ_j^m the offset box of the real box, with m ∈ {cx, cy, w, h}; ĝ_j^cx, ĝ_j^cy, ĝ_j^w and ĝ_j^h are the center x coordinate, the center y coordinate, the width and the height of the offset box of the j-th real box; l_i^cx, l_i^cy, l_i^w and l_i^h are the offsets of the center x coordinate, the center y coordinate, the width and the height of the i-th prediction box; and g_j^cx, g_j^cy, g_j^w and g_j^h are the center x coordinate, the center y coordinate, the width and the height of the j-th real box.
The five loss functions obtained by processing F1, F2, F3, F4 and F5 are denoted L_1(x, c, l, g), L_2(x, c, l, g), L_3(x, c, l, g), L_4(x, c, l, g) and L_5(x, c, l, g), and the total loss function is
Total_loss = L_1(x, c, l, g) + L_2(x, c, l, g) + L_3(x, c, l, g) + L_4(x, c, l, g) + L_5(x, c, l, g).
as a preferred embodiment, the specific step of fusing the pictures in step S13 is as follows:
(1) If, for each of the pictures F1, F2, F3 and F4, the predicted coordinates satisfy x_pred_max, y_pred_max < 300 and x_pred_min, y_pred_min > 0, F1, F2, F3 and F4 are combined into one picture according to their original positions, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, and the predicted coordinates are reduced correspondingly; the result is the final detection result;
(2) If the object classes label1 and label2 detected on the boundary between the left and right parts satisfy label1 = label2, the two detections belong to the same class; the coordinate information of the two objects is compared, and, taking the larger frame as the reference, the frame is extended toward the smaller one by the length (x_max - x_min) so that the divided object is made complete; F1, F2, F3 and F4 are then combined into one picture according to their original positions, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the modified coordinates are reduced correspondingly, and the result is the final detection result;
(3) If the object classes label1 and label2 detected on the boundary between the upper and lower parts satisfy label1 = label2, the two detections belong to the same class; the coordinate information of the two objects is compared, and, taking the larger frame as the reference, the frame is extended toward the smaller one by the length (y_max - y_min) so that the divided object is made complete; the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the modified coordinates are reduced correspondingly, and the result is the final detection result;
(4) If, for the pictures F1, F2, F3 and F4, the predicted coordinates (x_pred_min, y_pred_min) or (x_pred_max, y_pred_max) are equal to (300, 300), the object has been divided simultaneously into four parts (upper left, lower left, upper right and lower right); in this case the detection result of the middle picture F5 is used as the detection result of the object in the middle, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the obtained coordinate information is reduced correspondingly, and the result is the final detection result.
As a preferred embodiment, α = 1.
It should be understood that the above-described embodiments of the present invention are merely examples given to clearly illustrate the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description, and it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (5)

1. A small object detection method based on deep learning and image processing is characterized by comprising the following steps:
Step S1: obtaining a data set, wherein the data set comprises labeled object class information and the upper-left (x_min, y_min) and lower-right (x_max, y_max) coordinates of each target frame; selecting an original picture with label information from the training set of the data set and resizing it to 300x300 as input;
Step S2: dividing the picture along the horizontal line from (0, 150) to (300, 150) and the vertical line from (150, 0) to (150, 300) into four 150x150 parts P1, P2, P3 and P4, and taking the image with the four vertex coordinates (75, 75), (225, 75), (75, 225) and (225, 225) as the fifth part P5;
Step S3: according to the upper-left and lower-right coordinates (x_min, y_min), (x_max, y_max) of the target frame of each input picture, judging whether the object in the picture has been divided, and modifying the coordinates according to how the object was divided;
Step S4: interpolating the pictures by cubic interpolation so that the five divided 150x150 parts P1, P2, P3, P4 and P5 are enlarged to the same 300x300 size as the original picture and are named F1, F2, F3, F4 and F5, and at the same time multiplying the modified coordinates obtained in step S3 by 2 and updating them;
Step S5: extracting features from each of the five pictures F1, F2, F3, F4 and F5 through a VGG16 network, performing convolution with a convolution kernel of size 3x3x1024 to obtain a Conv6 feature map of size 19x19x1024, and performing convolution with a convolution kernel of size 1x1x1024 to obtain a Conv7 feature map of size 19x19x1024;
Step S6: stacking 1x1, 3x3 and 3x3 convolution kernels to form three branches, adding BN (Batch Normalization) at the end of each branch for batch normalization, and concatenating and fusing the branches while introducing a residual network structure; this structure is named the IRBNet convolution structure;
Step S7: extracting features from the 19x19x1024 Conv7 feature map obtained in step S6 through the IRBNet convolution structure to obtain a feature map Conv8 of size 10x10x512; applying the IRBNet convolution to Conv8 to obtain a feature map Conv9 of size 5x5x256; applying the IRBNet convolution to Conv9 to obtain a feature map Conv10 of size 3x3x256; and applying the IRBNet convolution to Conv10 to obtain a feature map Conv11 of size 1x1x256;
Step S8: deconvolving the higher-level feature map with a 3x3 convolution kernel and a step length of 4, enlarging it by a factor of two so that it has the same size as the adjacent lower layer, and then adding the pixels at corresponding positions one by one to obtain a new feature map whose size is consistent with that of the lower-level feature map; this structure is named HDPANet;
Step S9: passing the feature map Conv8 through step S8 to obtain another feature map of size 19x19x1024 and adding it to Conv7 to obtain the feature map Conv7D; passing Conv9 through step S8 to obtain another feature map of size 10x10x512 and adding it to Conv8 to obtain the feature map Conv8D; passing Conv10 through step S8 to obtain another feature map of size 5x5x256 and adding it to Conv9 to obtain the feature map Conv9D; and passing Conv11 through step S8 to obtain another feature map of size 3x3x256 and adding it to Conv10 to obtain the feature map Conv10D;
Step S10: convolving the Conv4_3, Conv10D and Conv11 feature layers with a 3x3 convolution kernel to obtain feature maps with 4x(class+4) channels, and convolving the Conv7D, Conv8D and Conv9D feature layers with a 3x3 convolution kernel to obtain feature maps with 6x(class+4) channels;
Step S11: F1, F2, F3, F4 and F5 each obtain a corresponding loss function loss through steps S1 to S10; during back propagation, the sum of the five loss functions is optimized with a stochastic gradient descent algorithm, and the number of training iterations (epochs) is set; when the total loss is stable, the obtained network parameters are the optimal solution;
Step S12: selecting pictures without label information from the data set, executing steps S1 and S2 to divide them, feeding the divided pictures into the network trained in steps S1 to S10, and filtering the results by non-maximum suppression to finally obtain, for the five pictures F1, F2, F3, F4 and F5, the predicted class label and the predicted coordinates (x_pred_min, y_pred_min), (x_pred_max, y_pred_max);
Step S13: fusing the pictures according to the predicted class labels and predicted coordinates of the five pictures F1, F2, F3, F4 and F5, wherein the fused result is the final detection result.
2. The method for detecting small objects based on deep learning and image processing as claimed in claim 1, wherein the coordinates in step S3 are modified as follows:
1) If x_min < 150 and x_max > 150, with y_min, y_max < 150 or y_min, y_max > 150, the object in the image has been divided vertically into a left part and a right part; the new coordinates are (x_min, y_min), (150, y_max) and (150, y_min), (x_max, y_max), and the category information does not change;
2) If x_min, x_max < 150 or x_min, x_max > 150, with y_min < 150 and y_max > 150, the object in the image has been divided horizontally into an upper part and a lower part; the new coordinates are (x_min, y_min), (x_max, 150) and (x_min, 150), (x_max, y_max), and the category information does not change;
3) If x_min < 150, y_min < 150, x_max > 150 and y_max > 150, the object in the image has been cut into four parts in both the horizontal and vertical directions; the new coordinates are (x_min, y_min), (150, 150); (150, y_min), (x_max, 150); (x_min, 150), (150, y_max); and (150, 150), (x_max, y_max), and the category information does not change.
3. The method for detecting small objects based on deep learning and image processing as claimed in claim 2, wherein the loss function loss and the total loss Total_loss in step S11 are obtained as follows:
The loss is divided into two parts, the confidence loss and the location loss:
L(x, c, l, g) = (1/N) ( L_conf(x, c) + α L_loc(x, l, g) )
where L(x, c, l, g) denotes the loss, L_conf denotes the confidence loss, which uses the softmax loss, and L_loc denotes the location loss; N is the number of prior boxes matched to the ground truth in the confidence loss, and the parameter α adjusts the ratio between the confidence loss and the location loss. x_ij^p ∈ {0, 1} indicates whether the i-th prediction box is matched to the j-th real (GT) box of category p; c denotes the confidence, l the prediction box, and g the real box.
The confidence loss is
L_conf(x, c) = - Σ_{i∈Pos}^{N} x_ij^p log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0),   with   ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p),
where ĉ_i^p is the probability value generated by the softmax method, Pos denotes the positive samples, Neg the negative samples, and N is the number of prior boxes matched to the ground truth in the confidence loss; when x_ij^p = 1, ĉ_i^p denotes the probability that the i-th prediction box belongs to category p, where p denotes the p-th category among the classes.
The location loss is
L_loc(x, l, g) = Σ_{i∈Pos}^{N} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1( l_i^m - ĝ_j^m ),
with the encoded targets
ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w,   ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h,   ĝ_j^w = log( g_j^w / d_i^w ),   ĝ_j^h = log( g_j^h / d_i^h ),
where cx denotes the x coordinate of the center point of a box, cy the y coordinate of the center point, w the width, h the height, i the i-th prediction box, j the j-th real box, and d_i the prior (default) box against which the offsets are computed; x_ij^k indicates whether the i-th prediction box and the j-th real box match (1 for a match, 0 for a mismatch); l_i^m denotes the prediction box and ĝ_j^m the offset box of the real box, with m ∈ {cx, cy, w, h}; ĝ_j^cx, ĝ_j^cy, ĝ_j^w and ĝ_j^h are the center x coordinate, the center y coordinate, the width and the height of the offset box of the j-th real box; l_i^cx, l_i^cy, l_i^w and l_i^h are the offsets of the center x coordinate, the center y coordinate, the width and the height of the i-th prediction box; and g_j^cx, g_j^cy, g_j^w and g_j^h are the center x coordinate, the center y coordinate, the width and the height of the j-th real box.
The five loss functions obtained by processing F1, F2, F3, F4 and F5 are denoted L_1(x, c, l, g), L_2(x, c, l, g), L_3(x, c, l, g), L_4(x, c, l, g) and L_5(x, c, l, g), and the total loss function is
Total_loss = L_1(x, c, l, g) + L_2(x, c, l, g) + L_3(x, c, l, g) + L_4(x, c, l, g) + L_5(x, c, l, g).
4. The method for detecting small objects based on deep learning and image processing as claimed in claim 3, wherein the pictures in step S13 are fused as follows:
(1) If, for each of the pictures F1, F2, F3 and F4, the predicted coordinates satisfy x_pred_max, y_pred_max < 300 and x_pred_min, y_pred_min > 0, F1, F2, F3 and F4 are combined into one picture according to their original positions, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, and the predicted coordinates are reduced correspondingly; the result is the final detection result;
(2) If the object classes label1 and label2 detected on the boundary between the left and right parts satisfy label1 = label2, the two detections belong to the same class; the coordinate information of the two objects is compared, and, taking the larger frame as the reference, the frame is extended toward the smaller one by the length (x_max - x_min) so that the divided object is made complete; F1, F2, F3 and F4 are then combined into one picture according to their original positions, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the modified coordinates are reduced correspondingly, and the result is the final detection result;
(3) If the object classes label1 and label2 detected on the boundary between the upper and lower parts satisfy label1 = label2, the two detections belong to the same class; the coordinate information of the two objects is compared, and, taking the larger frame as the reference, the frame is extended toward the smaller one by the length (y_max - y_min) so that the divided object is made complete; the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the modified coordinates are reduced correspondingly, and the result is the final detection result;
(4) If, for the pictures F1, F2, F3 and F4, the predicted coordinates (x_pred_min, y_pred_min) or (x_pred_max, y_pred_max) are equal to (300, 300), the object has been divided simultaneously into four parts (upper left, lower left, upper right and lower right); in this case the detection result of the middle picture F5 is used as the detection result of the object in the middle, the size of the fused picture is reduced by a factor of 4 to recover the original 300x300 size, the obtained coordinate information is reduced correspondingly, and the result is the final detection result.
5. The method for detecting small objects based on deep learning and image processing as claimed in claim 4, wherein α = 1.
CN201811605116.2A 2018-12-26 2018-12-26 Small object detection method based on deep learning and image processing Active CN109685145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811605116.2A CN109685145B (en) 2018-12-26 2018-12-26 Small object detection method based on deep learning and image processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811605116.2A CN109685145B (en) 2018-12-26 2018-12-26 Small object detection method based on deep learning and image processing

Publications (2)

Publication Number Publication Date
CN109685145A CN109685145A (en) 2019-04-26
CN109685145B true CN109685145B (en) 2022-09-06

Family

ID=66189765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811605116.2A Active CN109685145B (en) 2018-12-26 2018-12-26 Small object detection method based on deep learning and image processing

Country Status (1)

Country Link
CN (1) CN109685145B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110068818A (en) * 2019-05-05 2019-07-30 中国汽车工程研究院股份有限公司 The working method of traffic intersection vehicle and pedestrian detection is carried out by radar and image capture device
CN110276445A (en) * 2019-06-19 2019-09-24 长安大学 Domestic communication label category method based on Inception convolution module
CN110660074B (en) * 2019-10-10 2021-04-16 北京同创信通科技有限公司 Method for establishing steel scrap grade division neural network model
CN113393411A (en) * 2020-02-26 2021-09-14 顺丰科技有限公司 Package counting method and device, server and computer readable storage medium
CN111488938B (en) * 2020-04-15 2022-05-13 闽江学院 Image matching method based on two-step switchable normalized depth neural network
CN111597340A (en) * 2020-05-22 2020-08-28 迪爱斯信息技术股份有限公司 Text classification method and device and readable storage medium
CN111860623A (en) * 2020-07-03 2020-10-30 北京林业大学 Method and system for counting tree number based on improved SSD neural network
CN113762166A (en) * 2021-09-09 2021-12-07 中国矿业大学 Small target detection improvement method and system based on wearable equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN108564065A (en) * 2018-04-28 2018-09-21 广东电网有限责任公司 A kind of cable tunnel open fire recognition methods based on SSD

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN108564065A (en) * 2018-04-28 2018-09-21 广东电网有限责任公司 A kind of cable tunnel open fire recognition methods based on SSD

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yingying Wang et al.; "Robust person head detection based on multi-scale representation fusion of deep convolution neural network"; Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics; 2017-12-08; full text *

Also Published As

Publication number Publication date
CN109685145A (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN109685145B (en) Small object detection method based on deep learning and image processing
CN109859190B (en) Target area detection method based on deep learning
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN111126359B (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN113076871B (en) Fish shoal automatic detection method based on target shielding compensation
CN111640125B (en) Aerial photography graph building detection and segmentation method and device based on Mask R-CNN
CN111524150B (en) Image processing method and device
CN108241854B (en) Depth video saliency detection method based on motion and memory information
CN104820990A (en) Interactive-type image-cutting system
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN107506792B (en) Semi-supervised salient object detection method
CN116645592B (en) Crack detection method based on image processing and storage medium
CN108022244B (en) Hypergraph optimization method for significant target detection based on foreground and background seeds
CN111768415A (en) Image instance segmentation method without quantization pooling
CN113111916A (en) Medical image semantic segmentation method and system based on weak supervision
CN112906794A (en) Target detection method, device, storage medium and terminal
CN111767962A (en) One-stage target detection method, system and device based on generation countermeasure network
CN112016512A (en) Remote sensing image small target detection method based on feedback type multi-scale training
CN113420643A (en) Lightweight underwater target detection method based on depth separable cavity convolution
CN111242066B (en) Large-size image target detection method, device and computer readable storage medium
CN113536896B (en) Insulator defect detection method and device based on improved Faster RCNN and storage medium
CN111931572B (en) Target detection method for remote sensing image
CN114037720A (en) Pathological image segmentation and classification method and device based on semi-supervised learning
CN113744280A (en) Image processing method, apparatus, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant