CN109784476A

CN109784476A - A method for improving the DSOD network

Info

Publication number: CN109784476A
Application number: CN201910029814.0A
Authority: CN
Inventors: 程树英; 吴建耀; 郑茜颖; 林培杰; 陈志聪; 吴丽君
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2019-01-12
Filing date: 2019-01-12
Publication date: 2019-05-21
Anticipated expiration: 2039-01-12
Also published as: CN109784476B

Abstract

The present invention relates to a method for improving the DSOD network. An input image is first preprocessed, and the preprocessed image is fed into the DSOD feature extraction sub-network. An RFB_a network module is added after the second transition layer of the feature extraction sub-network. Within the RFB_a module, atrous convolutions with different dilation rates extract features with different receptive fields. An atrous convolutional layer with a dilation rate of 6 is added after the feature extraction sub-network, and the features it produces are fed into the multi-scale prediction layers, whose outputs are fed into the loss function. An IOG penalty term is added to the loss function so that similar predicted boxes do not overlap when densely packed objects of the same class are detected. In addition, the learning rate is set with a warm-up strategy during training, and a suitable batch size is chosen to reduce the hardware needed to train the network. Compared with the original DSOD algorithm, the invention achieves higher detection accuracy and better small-object detection while lowering the hardware requirements for training.

Description

A method for improving the DSOD network

Technical field

The present invention relates to the field of computer vision, and in particular to a method for improving the DSOD network.

Background

Object detection is one of the most important research topics in computer vision. Its main task is to locate the objects of interest in a given image and to determine the position of each object accurately. Object detection algorithms based on convolutional neural networks fall into two categories: region-proposal-based algorithms and regression-based algorithms. Region-proposal-based algorithms achieve high detection accuracy, but because they must extract candidate regions, their detection speed can hardly reach real time. Regression-based algorithms such as SSD and DSOD remove the region-proposal step and thereby achieve real-time detection. However, DSOD detects small objects poorly and places high demands on hardware when the network is trained.

Summary of the invention

In view of this, the object of the present invention is to propose a method for improving the DSOD network that improves both the detection of small objects and the overall detection accuracy.

The present invention is implemented by the following scheme: a method for improving the DSOD network, comprising the following steps:

Step S1: take an image from the dataset as the input image and feed it to the input layer; preprocess the input image by cropping, mirroring and mean subtraction to obtain a preprocessed image, and use normalization to convert the absolute coordinates in the preprocessed image into relative coordinates;
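
The coordinate normalization in step S1 amounts to dividing pixel coordinates by the image size. A minimal sketch, assuming boxes are stored as (xmin, ymin, xmax, ymax) pixel corners (the patent does not state the box layout):

```python
def to_relative(box, img_w, img_h):
    """Convert an absolute (xmin, ymin, xmax, ymax) pixel box to coordinates in [0, 1]."""
    xmin, ymin, xmax, ymax = box
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)


print(to_relative((50, 30, 150, 130), 200, 260))
```

Relative coordinates stay valid when the image is later resized to 300 × 300 in step S13, which is the usual reason for normalizing them first.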

Step S2: add an RFB_a network module after the second transition layer of the feature extraction sub-network in the DSOD network; feed the image preprocessed in step S1 into the feature extraction sub-network of the DSOD network for feature extraction; feed the feature map of the second transition layer of this sub-network into the RFB_a module, whose atrous (dilated) convolutions with different dilation rates extract features with different receptive fields; feed these features into a 3 × 3 convolutional layer to form the first-scale prediction layer of the DSOD network;

Step S3: add an atrous convolutional layer with a preset dilation rate (6 in the present invention) after the feature extraction sub-network of the DSOD network, and feed the feature map of the feature extraction sub-network described in step S2 into this atrous convolutional layer to enlarge the receptive field of the feature map; at the same time, feed the features generated by the atrous convolutional layer into the multi-scale prediction layers of the DSOD network, forming 5 scale prediction layers;

Step S4: feed the features of the first-scale prediction layer described in step S2 and of the 5 scale prediction layers described in step S3 into the multitask loss function L, to which an IOG penalty term has been added;

Step S5: set the learning rate with a warm-up strategy and optimize the weights of all network layers in the DSOD network with a gradient descent algorithm; choose a suitable batch size (16 in the present invention) to reduce the hardware required to train the DSOD network.

Further, the image preprocessing in step S1 is specifically as follows:

Step S11: crop the input image. First randomly choose the width and height of the cropped image, then randomly choose one value from 0.1, 0.3, 0.7 and 0.9 as the Jaccard coefficient threshold, and use the Jaccard coefficient to measure the similarity between each ground-truth box in the original image and the cropped image;

Step S12: judge whether the Jaccard coefficient between a ground-truth box and the cropped image exceeds the threshold randomly chosen in step S11. If at least one ground-truth box has a Jaccard coefficient with the cropped image greater than the chosen threshold, and the center of that ground-truth box lies inside the cropped image, the crop is accepted; otherwise return to step S11. The Jaccard coefficient is calculated as follows:
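
The formula itself did not survive extraction. Assuming the usual Jaccard (intersection-over-union) definition, evaluated between each ground-truth box and the crop with areas written as in the definitions that follow, it would read:

```latex
J_i = \frac{box_i \cap box_{cut}}{box_i + box_{cut} - box_i \cap box_{cut}}, \qquad i = 1, \dots, N
```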

where N denotes the number of ground-truth boxes in the image, box_i denotes the area of the i-th ground-truth box, box_cut denotes the area of the cropped image, and the operator ∩ denotes the overlapping area.
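
Steps S11 and S12 can be sketched as a rejection-sampling loop. This is an illustrative sketch, not the patent's code. The box layout (xmin, ymin, xmax, ymax), the crop-size range of 0.3 to 1.0 of the image side, the `max_trials` cap and the full-image fallback are hypothetical choices. The patent simply returns to step S11 until a crop is accepted.

```python
import random

THRESHOLDS = (0.1, 0.3, 0.7, 0.9)  # Jaccard thresholds listed in step S11


def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a, b):
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def jaccard(a, b):
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def crop_is_valid(crop, gt_boxes, threshold):
    """Step S12: accept the crop if some ground-truth box has Jaccard > threshold
    with it and its centre lies inside the crop."""
    for g in gt_boxes:
        cx, cy = (g[0] + g[2]) / 2, (g[1] + g[3]) / 2
        inside = crop[0] <= cx <= crop[2] and crop[1] <= cy <= crop[3]
        if inside and jaccard(g, crop) > threshold:
            return True
    return False


def sample_crop(gt_boxes, img_w, img_h, rng=random, max_trials=50):
    """Steps S11-S12: draw random crops until one is accepted."""
    for _ in range(max_trials):
        threshold = rng.choice(THRESHOLDS)      # S11: pick a threshold
        w = rng.uniform(0.3, 1.0) * img_w       # S11: random width and height
        h = rng.uniform(0.3, 1.0) * img_h
        x0 = rng.uniform(0, img_w - w)
        y0 = rng.uniform(0, img_h - h)
        crop = (x0, y0, x0 + w, y0 + h)
        if crop_is_valid(crop, gt_boxes, threshold):
            return crop
    return (0.0, 0.0, float(img_w), float(img_h))  # fallback: keep the full image
```

The threshold is re-drawn on every trial because "return to step S11" repeats the threshold choice as well as the crop size.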

Step S13: flip the cropped image horizontally with a preset probability T (T = 0.5 in the present invention) and resize the mirrored image to 300 × 300 to obtain the mirrored image.

Step S14: apply mean subtraction to the mirrored image to obtain the mean-subtracted image.
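
Steps S13 and S14 can be sketched in plain Python. The image is assumed to be a list of rows of 3-channel pixel tuples, and the boxes are assumed to be in the relative (xmin, ymin, xmax, ymax) coordinates produced in step S1. The 300 × 300 resize is omitted. The patent does not give the channel means, so the caller supplies them.

```python
import random


def random_mirror(image, boxes, T=0.5, rng=random):
    """Step S13: flip the image left-right with probability T and mirror the boxes.
    Boxes are relative (xmin, ymin, xmax, ymax), so x maps to 1 - x."""
    if rng.random() < T:
        image = [row[::-1] for row in image]
        boxes = [(1.0 - xmax, ymin, 1.0 - xmin, ymax) for xmin, ymin, xmax, ymax in boxes]
    return image, boxes


def subtract_mean(image, mean):
    """Step S14: subtract a per-channel mean from every pixel."""
    return [[tuple(v - m for v, m in zip(px, mean)) for px in row] for row in image]
```

After mirroring, xmin and xmax swap roles, which keeps xmin ≤ xmax for every box.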

Further, step S2 is specifically as follows. Each RFB_a branch first uses a 1 × 1 convolutional layer to reduce the number of feature channels. The first branch then uses a 3 × 3 convolutional layer with stride 1 to obtain features with a 3 × 3 receptive field. The second branch uses a 1 × 3 convolutional layer and an atrous convolutional layer with dilation rate 3 to obtain features with a 1 × 7 receptive field. The third branch uses a 3 × 1 convolutional layer and an atrous convolutional layer with dilation rate 3 to obtain features with a 7 × 1 receptive field. The fourth branch uses a 3 × 3 convolutional layer and an atrous convolutional layer with dilation rate 5 to obtain features with an 11 × 11 receptive field. The features extracted by the branches are fused by channel concatenation and a 1 × 1 convolutional layer. Finally, the fused features and the features produced by the second transition layer of the DSOD network are merged through a residual connection to form the final output features.
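
The receptive-field sizes quoted above can be checked with the standard effective-extent formula for a k-tap kernel with dilation rate r: k + (k - 1)(r - 1). This is an illustrative sketch. The (k, r) layer lists are our own per-axis encoding of the branches. A 3 × 3 kernel at rate 3 spans 7 samples and at rate 5 spans 11, which matches the figures above if they count only the dilated layer. Stacking the dilated layer after a preceding 3-tap layer widens the field along that axis by 2 more.

```python
def effective_kernel(k, r):
    """Extent covered by a k-tap kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)


def stacked_field(layers):
    """Receptive field along one axis of a stride-1 stack of (k, r) layers."""
    field = 1
    for k, r in layers:
        field += effective_kernel(k, r) - 1
    return field


print(effective_kernel(3, 3), effective_kernel(3, 5))    # 7 11
# Fourth branch along either axis: 1x1 -> 3x3 (r=1) -> 3x3 (r=5)
print(stacked_field([(1, 1), (3, 1), (3, 5)]))           # 13
# Second branch along the height axis: 1x1 -> 1x3 (1 tap vertically) -> 3x3 (r=3)
print(stacked_field([(1, 1), (1, 1), (3, 3)]))           # 7
```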

Further, the atrous convolutional layer with a preset dilation rate described in step S3 is added as follows. First, increase the number of output channels C of the feature extraction sub-network in the DSOD network to extract richer feature information. Then add an atrous convolutional layer with dilation rate r whose number of output channels equals the output channel count of the original DSOD feature extraction sub-network, so that the atrous convolution is embedded in the DSOD network. Finally, add a 1 × 1 convolutional layer to fuse the features.
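
For the atrous layer to slot into the network without changing the feature-map size, its padding must match its dilation rate. The sketch below uses the standard convolution output-size formula to show that a 3 × 3 kernel at rate 6 keeps the spatial size when the padding is 6. The 19 × 19 input size is a hypothetical example, since the patent does not state the map size at this point.

```python
def same_padding(k, r):
    """Padding that preserves spatial size for a stride-1, k x k kernel at rate r."""
    return r * (k - 1) // 2


def conv_out_size(n, k, r=1, padding=0, stride=1):
    """Output length along one axis of a (possibly dilated) convolution."""
    return (n + 2 * padding - r * (k - 1) - 1) // stride + 1


# Hypothetical 19 x 19 map, 3 x 3 kernel, dilation rate 6 as in step S3
pad = same_padding(3, 6)
print(pad, conv_out_size(19, 3, r=6, padding=pad))  # 6 19
```

Dilation does not add weights. The layer still has k × k × C_in × C_out parameters, so the receptive field grows without extra parameters.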

Further, the multitask loss function L with the IOG penalty term described in step S4 is specifically as follows:

Step S41: for each predicted box p output by the DSOD network, find the ground-truth box g_{iou_max} of the sample that has the largest intersection over union (IoU) with p among all ground-truth boxes G, as follows:
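
The formula is missing from this text. Taking IoU in its usual sense, with areas written as in the definitions that follow, a reconstruction is:

```latex
g_{iou\_max} = \arg\max_{g \in G} \frac{box_g \cap box_p}{box_g + box_p - box_g \cap box_p}, \qquad p \in P
```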

where g denotes a ground-truth box, G the set of all ground-truth boxes, p a predicted box, P the set of all predicted boxes, box_g the area of a ground-truth box and box_p the area of a predicted box;

Step S42: remove the ground-truth box with the largest IoU found in step S41, then compute the maximum IOG penalty between the predicted box and the remaining ground-truth boxes, and use it as the loss L_iog, calculated as follows:
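
The penalty formula is missing as well. The sketch below assumes IoG(p, g) = area(p ∩ g) / area(g), the Intersection-over-Ground-truth ratio used by Repulsion Loss. It takes the penalty of a predicted box as its largest IoG over the ground-truth boxes left after the best-IoU match is removed. The box layout (xmin, ymin, xmax, ymax) and the averaging over predicted boxes in `l_iog` are hypothetical choices that the patent does not state.

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def inter(a, b):
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def iou(a, b):
    i = inter(a, b)
    u = area(a) + area(b) - i
    return i / u if u > 0 else 0.0


def iog(p, g):
    """Intersection over Ground truth: overlap area divided by the ground-truth area."""
    ag = area(g)
    return inter(p, g) / ag if ag > 0 else 0.0


def iog_penalty(p, gts):
    """Steps S41-S42 for one predicted box p."""
    if len(gts) < 2:
        return 0.0  # nothing left once the matched ground truth is removed
    best = max(range(len(gts)), key=lambda j: iou(gts[j], p))  # S41: best-IoU match
    rest = gts[:best] + gts[best + 1:]                        # S42: remove it
    return max(iog(p, g) for g in rest)                       # largest remaining IoG


def l_iog(preds, gts):
    """Hypothetical reduction: mean penalty over the predicted boxes."""
    return sum(iog_penalty(p, gts) for p in preds) / len(preds) if preds else 0.0


gts = [(0, 0, 10, 10), (8, 0, 18, 10)]
print(iog_penalty((1, 0, 11, 10), gts))  # 0.3: the box drifts 3 units into its neighbour
```

A predicted box that drifts into a neighbouring object of the same class gets a large penalty, which is what discourages the overlapping boxes that non-maximum suppression would otherwise merge.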

Step S43: combine the loss L_iog, the localization loss L_loc and the classification loss L_conf by weighting to form the final multitask loss function L, as follows:
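
The weighting formula is also missing. One plausible form assumes the SSD-style normalization by N described next and a hypothetical weight β for the IOG term. The patent text gives neither the value nor the placement of β.

```latex
L = \frac{1}{N}\left(L_{conf} + \alpha L_{loc}\right) + \beta L_{iog}
```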

where N denotes the number of matched positive samples and α denotes the weight of the localization loss L_loc. The localization loss L_loc uses the smooth_L1 loss, and the classification loss L_conf uses the cross-entropy. L_loc is calculated as follows:
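
The formula is missing. The symbols defined in the next paragraph match the SSD localization loss, so the following assumes that standard form. Here ĝ_j^m denotes the encoded offsets of the matched ground-truth box, a symbol not defined in this text.

```latex
L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)
```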

where the indicator x_{ij}^k denotes whether the i-th default box is matched to the j-th ground-truth box of class k, l denotes the position coordinates of the predicted box, Pos denotes the positive default boxes, and N denotes the number of positive samples. smooth_L1 is calculated as follows:
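
The formula image is missing; assuming the standard smooth L1 function used by SSD, it is:

```latex
\mathrm{smooth}_{L1}(x) =
\begin{cases}
0.5\, x^{2}, & \lvert x \rvert < 1 \\
\lvert x \rvert - 0.5, & \text{otherwise}
\end{cases}
```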

The classification loss L_conf is calculated as follows:
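
The formula is missing; the symbols defined in the next paragraph (c, Neg, p, background class 0) match the SSD confidence loss, so assuming that form:

```latex
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right)
```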

where c denotes the confidence of each class, Neg denotes the negative samples, p denotes the class, and class 0 is the background. ĉ_i^p denotes the probability that the i-th predicted box belongs to class p and is calculated as follows:
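
The formula is missing; assuming the softmax over class confidences used by SSD, it reads:

```latex
\hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p} \exp\left(c_i^{p}\right)}
```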

Further, the warm-up strategy for setting the learning rate described in step S5 is specifically as follows. The initial learning rate is set to 10^-5 and increased linearly to 10^-2 over the first 5 epochs. The learning rate is then divided by 10 at the 75th, 125th and 175th epochs, and training finishes at the 200th epoch. The batch normalization weights are initialized to 0.5 and the biases to 0, and all convolutional layers are initialized with the Xavier method. With this improved training strategy, the training batch size is reduced from 128 to 16, which lowers the hardware required to train the network.
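
The schedule above can be written as a function of the epoch index. This is a minimal sketch that assumes 0-based epoch indices, with each drop applied from the stated epoch onward. Passing a fractional epoch (iteration / iterations_per_epoch) gives a per-iteration linear ramp during warm-up. The function name and defaults are our own.

```python
def learning_rate(epoch, lr_start=1e-5, lr_base=1e-2, warmup_epochs=5,
                  milestones=(75, 125, 175), total_epochs=200):
    """Warm-up then step decay, following the values given in step S5."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch outside the 200-epoch schedule")
    if epoch < warmup_epochs:
        # linear ramp from 1e-5 to 1e-2 over the first 5 epochs
        return lr_start + (lr_base - lr_start) * epoch / warmup_epochs
    drops = sum(epoch >= m for m in milestones)  # how many milestones have passed
    return lr_base / (10 ** drops)


for e in (0, 2.5, 5, 80, 130, 199):
    print(e, learning_rate(e))
```

Starting from a tiny learning rate keeps the early updates stable, which matters more when the batch is as small as 16.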

Compared with the prior art, the present invention has the following beneficial effects:

The present invention adds an efficient network structure to the lower layers of the network to extract more global feature information, which improves the detection of small objects. The penalty term added to the loss function stops similar predicted boxes from overlapping when objects are densely packed. It therefore avoids missed detections during non-maximum suppression post-processing and improves detection accuracy. In addition, the improved training strategy reduces the hardware required to train the network.

Description of the drawings

Fig. 1 is a structural diagram of an embodiment of the present invention.

Fig. 2 illustrates one-dimensional feature extraction by a standard convolutional layer and by an atrous convolutional layer in an embodiment of the present invention.

Fig. 3 shows the RFB_a network structure of an embodiment of the present invention.

Fig. 4 illustrates the dense sampling of an embodiment of the present invention.

Fig. 5 compares the detection results of Embodiment 1 of the present invention with those of the original DSOD.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and embodiments.

As shown in Fig. 1, this embodiment provides a method for improving the DSOD network comprising steps S1 to S5 described above.

In this embodiment, the image preprocessing in step S1 is carried out as described in steps S11 to S14.

In this embodiment, step S2 is carried out as follows. Each RFB_a branch first uses a 1 × 1 convolutional layer to reduce the number of feature channels. The first branch then uses a 3 × 3 convolutional layer with stride 1 to obtain features with a 3 × 3 receptive field. The second branch uses a 1 × 3 convolutional layer and an atrous convolutional layer with dilation rate 3 to obtain features with a 1 × 7 receptive field. The third branch uses a 3 × 1 convolutional layer and an atrous convolutional layer with dilation rate 3 to obtain features with a 7 × 1 receptive field. The fourth branch uses a 3 × 3 convolutional layer and an atrous convolutional layer with dilation rate 5 to obtain features with an 11 × 11 receptive field. The features extracted by the branches are fused by channel concatenation and a 1 × 1 convolutional layer. Finally, the fused features and the features produced by the second transition layer of the DSOD network are merged through a residual connection to form the final output features.

In this embodiment, the atrous convolutional layer with a preset dilation rate in step S3 is added as follows. First, increase the number of output channels C of the feature extraction sub-network in the DSOD network to extract richer feature information. Then add an atrous convolutional layer with dilation rate r whose number of output channels equals the output channel count of the original DSOD feature extraction sub-network, so that the atrous convolution is embedded in the DSOD network. Finally, add a 1 × 1 convolutional layer to fuse the features.

In this embodiment, the multitask loss function L with the IOG penalty term in step S4 is constructed as described in steps S41 to S43. There, the indicator x_{ij}^k denotes whether the i-th default box is matched to the j-th ground-truth box of class k, l denotes the position coordinates of the predicted box, Pos denotes the positive default boxes and N denotes the number of positive samples. The smooth_L1 function and the classification loss L_conf are computed as in step S43.

In this embodiment, the warm-up strategy for setting the learning rate in step S5 is as follows. The initial learning rate is set to 10^-5 and increased linearly to 10^-2 over the first 5 epochs. The learning rate is then divided by 10 at the 75th, 125th and 175th epochs, and training finishes at the 200th epoch. The batch normalization weights are initialized to 0.5 and the biases to 0, and all convolutional layers are initialized with the Xavier method. With this improved training strategy, the training batch size is reduced from 128 to 16, which lowers the hardware required to train the network.

Preferably, this embodiment preprocesses the input image by cropping, mirroring and mean subtraction, and feeds the preprocessed image into the DSOD feature extraction sub-network. An RFB_a network module is added after the second transition layer of the feature extraction sub-network. Its atrous convolutions with different dilation rates extract features with different receptive fields, providing features with more global information for small-object detection. An atrous convolutional layer with dilation rate 6 added after the feature extraction sub-network enlarges the receptive field of the feature map and provides richer semantic information to the subsequent multi-scale prediction layers. The features generated by the atrous convolutional layer are fed into the multi-scale prediction layers, whose outputs are fed into the loss function. An IOG penalty term in the loss function stops similar predicted boxes from overlapping when densely packed objects of the same class are predicted, thus avoiding missed detections after non-maximum suppression. Meanwhile, during training the learning rate is set with a warm-up strategy and a suitable batch size is chosen, which reduces the hardware required to train the network. Experimental results show that, compared with the original DSOD algorithm, the invention achieves higher detection accuracy and better small-object detection while lowering the hardware requirements for training.

Fig. 2 illustrates one-dimensional feature extraction by a standard convolutional layer and an atrous convolutional layer. When the dilation rate r is 1, atrous convolution is simply standard convolution. When r is 2 and the padding is 2, r - 1 zeros are effectively inserted between consecutive kernel taps. After the atrous convolution, the 3 kernel taps therefore span 5 input samples, and the padding keeps the output at the input resolution. The figure shows that atrous convolution enlarges the receptive field of the kernel. The two-dimensional atrous convolution used in this embodiment has the same effect as the one-dimensional case.
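
The one-dimensional case in Fig. 2 can be reproduced with a short stride-1 atrous convolution. This is an illustrative sketch, written as cross-correlation as in most deep-learning frameworks. It shows that rate 2 is equivalent to a kernel with one zero inserted between taps, and that padding 2 keeps the output length equal to the input length.

```python
def atrous_conv1d(x, w, rate=1, padding=0):
    """Stride-1 atrous convolution (cross-correlation) of sequence x with kernel w.
    Taps are `rate` samples apart, equivalent to inserting rate - 1 zeros between them."""
    xp = [0.0] * padding + list(x) + [0.0] * padding
    span = (len(w) - 1) * rate + 1  # input samples covered by one output
    return [sum(w[k] * xp[i + k * rate] for k in range(len(w)))
            for i in range(len(xp) - span + 1)]


x = [1, 2, 3, 4, 5]
print(atrous_conv1d(x, [1, 1, 1]))                       # [6, 9, 12]
print(atrous_conv1d(x, [1, 1, 1], rate=2, padding=2))    # [4.0, 6.0, 9.0, 6.0, 8.0]
print(atrous_conv1d(x, [1, 0, 1, 0, 1], padding=2))      # same as the rate-2 result
```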

Fig. 3 shows a standard RFB_a network structure. The RFB_a module is a multi-branch convolutional structure. In the different branches, atrous convolutions with different dilation rates extract features with receptive fields of different sizes. These features are fused by channel concatenation, which produces a dense-sampling effect on the original feature map, as shown in Fig. 4. In this embodiment, a ReLU activation is added after the last atrous convolution of each branch of the standard RFB_a structure to extract higher-level features. In addition, to stay consistent with the DSOD network structure, this embodiment moves the batch normalization (BN) and ReLU activation in the RFB_a network in front of the convolutional layers.

Embodiment 1: as shown in Fig. 5, an object detection analysis tool is used to compare how well DSOD and the improved method detect extra-small (XS) objects. Fig. 5 shows that, except for the dining table class, the improved DSOD detector achieves higher detection accuracy, to varying degrees, in classes such as aeroplane, bicycle and bird. In dining table images, cups and other objects placed on the table partially occlude it. This strongly affects the improved DSOD, so its accuracy on this class is lower than that of the original DSOD algorithm. Overall, the improved method of this embodiment detects small objects more accurately.

Embodiment 2: on the PASCAL VOC2007 test set, the improved DSOD is compared with several other typical regression-based object detectors in terms of detection accuracy and detection speed. The main metrics are mAP (mean Average Precision) and FPS (Frames Per Second), and * marks results measured in the experimental environment of this embodiment. The data in the table show that the improved DSOD model is more accurate, raising detection accuracy from 77.4% to 79.0%. Compared with DSSD, the improved DSOD is better in both detection accuracy and detection speed. RFBNet300 uses several RFB blocks and can extract more global features, so its accuracy is roughly the same as that of the improved method of this embodiment. However, RFBNet300 has a higher computational complexity, so the improved method offers better real-time performance.

The above are merely preferred embodiments of the present invention. All equivalent changes and modifications made within the scope of the claims of the present invention fall within the coverage of the present invention.

Claims

1. A method for improving the DSOD network, characterized by comprising the following steps:

Step S1: take an image from the dataset as the input image and feed it to the input layer; preprocess the input image by cropping, mirroring and mean subtraction to obtain a preprocessed image, and use normalization to convert the absolute coordinates in the preprocessed image into relative coordinates;

Step S2: add an RFB_a network module after the second transition layer of the feature extraction sub-network in the DSOD network; feed the image preprocessed in step S1 into the feature extraction sub-network of the DSOD network for feature extraction; feed the feature map of the second transition layer of this sub-network into the RFB_a module, whose atrous (dilated) convolutions with different dilation rates extract features with different receptive fields; feed these features into a 3 × 3 convolutional layer to form the first-scale prediction layer of the DSOD network;

Step S3: add an atrous convolutional layer with a preset dilation rate after the feature extraction sub-network of the DSOD network, and feed the feature map of the feature extraction sub-network described in step S2 into this atrous convolutional layer to enlarge the receptive field of the feature map; at the same time, feed the features generated by the atrous convolutional layer into the multi-scale prediction layers of the DSOD network, forming 5 scale prediction layers;

Step S4: feed the features of the first-scale prediction layer described in step S2 and of the 5 scale prediction layers described in step S3 into the multitask loss function L, to which an IOG penalty term has been added;

Step S5: set the learning rate with a warm-up strategy and optimize the weights of all network layers in the DSOD network with a gradient descent algorithm; choose a suitable batch size to reduce the hardware required to train the DSOD network.

2. The method for improving the DSOD network according to claim 1, characterized in that the image preprocessing in step S1 is specifically as follows:

Step S11: crop the input image. First randomly choose the width and height of the cropped image, then randomly choose one value from 0.1, 0.3, 0.7 and 0.9 as the Jaccard coefficient threshold, and use the Jaccard coefficient to measure the similarity between each ground-truth box in the original image and the cropped image;

where, in the Jaccard coefficient, N denotes the number of ground-truth boxes in the image, box_i denotes the area of the i-th ground-truth box, box_cut denotes the area of the cropped image, and the operator ∩ denotes the overlapping area.

Step S13: flip the cropped image horizontally with a preset probability T and resize the mirrored image to 300 × 300 to obtain the mirrored image.

Step S14: apply mean subtraction to the mirrored image to obtain the mean-subtracted image.

3. The method for improving the DSOD network according to claim 1, characterized in that step S2 is specifically as follows: each RFB_a branch first uses a 1 × 1 convolutional layer to reduce the number of feature channels; the first branch uses a 3 × 3 convolutional layer with stride 1 to obtain features with a 3 × 3 receptive field; the second branch uses a 1 × 3 convolutional layer and an atrous convolutional layer with dilation rate 3 to obtain features with a 1 × 7 receptive field; the third branch uses a 3 × 1 convolutional layer and an atrous convolutional layer with dilation rate 3 to obtain features with a 7 × 1 receptive field; the fourth branch uses a 3 × 3 convolutional layer and an atrous convolutional layer with dilation rate 5 to obtain features with an 11 × 11 receptive field; the features extracted by the branches are fused by channel concatenation and a 1 × 1 convolutional layer; finally, the fused features and the features produced by the second transition layer of the DSOD network are merged through a residual connection to form the final output features.

4. The method for improving the DSOD network according to claim 1, characterized in that the atrous convolutional layer with a preset dilation rate described in step S3 is added as follows: first, increase the number of output channels C of the feature extraction sub-network in the DSOD network to extract richer feature information; then add an atrous convolutional layer with dilation rate r whose number of output channels equals the output channel count of the original DSOD feature extraction sub-network, so that the atrous convolution is embedded in the DSOD network; finally, add a 1 × 1 convolutional layer to fuse the features.

5. The method for improving the DSOD network according to claim 1, characterized in that the multitask loss function L with the IOG penalty term described in step S4 is specifically as follows:

Step S41: for each predicted box p output by the DSOD network, find the ground-truth box g_{iou_max} of the sample that has the largest intersection over union (IoU) with p among all ground-truth boxes G, as follows:

where g denotes a ground-truth box, G the set of all ground-truth boxes, p a predicted box, P the set of all predicted boxes, box_g the area of a ground-truth box and box_p the area of a predicted box;

Step S42: remove the ground-truth box with the largest IoU found in step S41, then compute the maximum IOG penalty between the predicted box and the remaining ground-truth boxes, and use it as the loss L_iog, calculated as follows:

Step S43: combine the loss L_iog, the localization loss L_loc and the classification loss L_conf by weighting to form the final multitask loss function L, as follows:

where N denotes the number of matched positive samples and α denotes the weight of the localization loss L_loc; the localization loss L_loc uses the smooth_L1 loss, and the classification loss L_conf uses the cross-entropy; L_loc is calculated as follows:

where the indicator x_{ij}^k denotes whether the i-th default box is matched to the j-th ground-truth box of class k, l denotes the position coordinates of the predicted box, Pos denotes the positive default boxes, and N denotes the number of positive samples; smooth_L1 is calculated as follows:

The classification loss L_conf is calculated as follows:

where c denotes the confidence of each class, Neg denotes the negative samples, p denotes the class, and class 0 is the background; ĉ_i^p denotes the probability that the i-th predicted box belongs to class p and is calculated as follows:

6. The method for improving the DSOD network according to claim 1, characterized in that the warm-up strategy for setting the learning rate described in step S5 is specifically as follows: the initial learning rate is set to 10^-5 and increased linearly to 10^-2 over the first 5 epochs; the learning rate is divided by 10 at the 75th, 125th and 175th epochs, and training finishes at the 200th epoch; the batch normalization weights are initialized to 0.5 and the biases to 0; all convolutional layers are initialized with the Xavier method; with the improved training strategy, the training batch size is reduced from 128 to 16, which lowers the hardware required to train the network.