CN108460403A

CN108460403A - Object detection method and system with multi-scale feature fusion in images

Info

Publication number: CN108460403A
Application number: CN201810065807.1A
Authority: CN
Inventors: 张重阳; 程浩; 刘泽祥
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2018-01-23
Filing date: 2018-01-23
Publication date: 2018-08-28

Abstract

The invention discloses an object detection method and system with multi-scale feature fusion in images. The method comprises: first, scaling the picture to be detected to different scales to construct an image pyramid; second, using a statistical clustering method to obtain a set of multi-scale detection templates that covers most sample sizes; third, building scale-adaptive target context based on the multi-scale detection templates; fourth, fusing multi-scale deep features; fifth, applying non-maximum suppression based on soft decision. By constructing a sparse multi-resolution image pyramid, multi-scale detection templates, template-scale-adaptive context and multi-scale deep feature fusion, the invention fully mines and fuses deep features and can improve object detection performance.

Description

Object detection method and system with multi-scale feature fusion in images

Technical field

The present invention relates to object detection in images, and specifically to an object detection method and system with multi-feature fusion in images.

Background art

Object detection and recognition in images is widely needed in applications such as intelligent video surveillance, and is a popular research direction in computer vision. Existing image object detection methods still face the following difficulties and challenges, so their results leave room for improvement: (1) Objects of the same class vary widely in appearance features such as color, texture and shape. (2) Objects of the same class take diverse poses, so the structural features of samples within a class vary greatly. In practice a target may be upright, lying down or tilted, and the same kind of target in different poses shows different structural features such as contour and shape. (3) The height, width and aspect ratio of objects of the same class vary over a wide range. On the one hand, the physical height of targets is widely distributed; on the other hand, different shooting distances make targets appear in the image at different sizes and ratios. (4) Occlusion affects detection results. When a target is occluded, part of its information is missing, which makes detection harder. (5) The diversity of backgrounds and illumination increases false detections. When targets appear outdoors, for example on urban roads or at entrances, the background is often complex, and background objects such as trees and street lamps can be confused with targets and cause false detections.

Mature object detection methods currently fall into two broad classes: (1) Methods based on background modeling. These are mainly used to detect moving targets in video: the input image is segmented into foreground and background using a Gaussian mixture model (GMM), motion detection or similar methods, and specific moving targets are then extracted from the foreground. Such methods need a continuous image sequence to build the model and are not suitable for object detection in a single image. (2) Methods based on statistical learning. All images known to contain a certain class of target are collected into a training set, and features are extracted from the training images with a hand-designed method (such as HOG or Haar). The extracted features are generally information such as the gray level, texture, gradient histogram and edges of the target. A pedestrian detection classifier is then built from the feature library of a large number of training samples. The classifier is typically an SVM, AdaBoost or a neural network.

Overall, methods based on statistical learning have performed better in recent years. They can be divided into traditional hand-crafted-feature detection methods and deep-feature machine learning detection methods.

Traditional hand-crafted-feature detection methods model the target with manually designed features. Hand-crafted-feature methods that have performed well in recent years mainly include: the DPM (Deformable Part Model) method proposed by Pedro F. Felzenszwalb et al. in 2010 (Object Detection with Discriminatively Trained Part-Based Models); ICF (Integral Channel Features), proposed by Piotr Dollár et al. in 2009, and the ACF method proposed in 2014 (Fast Feature Pyramids for Object Detection); and the Informed Haar method proposed by Shanshan Zhang et al. in 2014 (Informed Haar-like Features Improve Pedestrian Detection), which aims to extract more informative Haar features for training. Although these hand-crafted features achieve some success, their limited representational power leaves detection accuracy low. Deep convolutional neural network models have stronger feature learning and representation ability and are being applied ever more widely and successfully to image classification and detection. The basic detector is the R-CNN (Region-Convolutional Neural Network) model. Girshick et al. proposed R-CNN in 2014 for general object detection; Fast R-CNN and Faster R-CNN were proposed later, improving the accuracy and speed of deep-learning-based detection. YOLO and SSD, proposed in 2016, achieve fast single-stage detection through ideas such as anchors. Most of these deep-learning-based detectors rely on deep features at a single scale with a fixed-size context, so deep features are still under-used and detection performance needs further improvement.

Summary of the invention

To address the shortcomings of object detection methods based on deep models, the present invention proposes an object detection method and system with multi-scale feature fusion in images. By constructing a sparse multi-resolution image pyramid, multi-scale detection templates, template-scale-adaptive context and multi-scale deep feature fusion, it fully mines and fuses deep features and improves object detection performance.

According to a first aspect of the invention, an object detection method with multi-scale feature fusion in images is provided, comprising:

S1: scaling the picture to be detected to different scales and constructing an image pyramid;

S2: based on the training images obtained from the image pyramid, using a statistical clustering method to obtain a set of multi-scale detection templates that covers most sample sizes;

S3: on the basis of the multi-scale detection templates, building scale-adaptive target context;

S4: according to the result of the target context construction, fusing multi-scale deep features to obtain a multi-scale feature map;

S5: according to the multi-scale feature map, applying non-maximum suppression based on soft decision, realizing object detection with multi-scale feature fusion in images.

Preferably, in S1: so that the detection network, using one or a few detection boxes of limited sizes, can frame targets of different sizes in the image completely and compactly, the original image is scaled to multiple scales; repeated scaling raises the probability that an original target is framed completely and compactly by a detection box. The picture is scaled proportionally into L pictures of different resolutions, constructing an image pyramid whose resolution runs from high to low. Specifically, during training, each original training image is scaled to multiple scales, giving L images at different scales for training. During testing, each image to be detected is likewise scaled to multiple scales, giving L images at different scales for detection, and the detection results of these L images are fused to obtain the final detection result.

Preferably, obtaining a set of multi-scale detection templates that covers most sample sizes using a statistical clustering method means: based on the K-medoids clustering method, with the Jaccard distance as the clustering metric, the targets in the training data set are clustered by their width and height values and aspect ratios, forming the aspect ratios of a set of K cluster centers, which serve as the target templates covering most width-height proportions.

Preferably, building the scale-adaptive target context means: the receptive field of each point on the output feature map of the CNN convolutional layer is taken as a candidate target box; the part of the receptive field that extends beyond the template box serves as the context of the target box and assists target detection and recognition.

More preferably, building the scale-adaptive target context yields a detection model whose contextual information changes with the target scale: small-scale targets obtain more context and large-scale targets obtain less, meeting the different context needs of targets at different scales.

Preferably, the multi-scale deep feature fusion means: from the feature maps output by different CNN convolutional layers, M layers are selected and fused to construct a multi-scale feature pyramid.

More preferably, the multi-scale deep feature fusion is specifically: for the deepest of the M selected convolutional layers of the CNN, the output feature map is upsampled by deconvolution until it has the same resolution as the feature map of the preceding layer, and the two are added pixel by pixel, giving a multi-scale feature map that fuses two adjacent layers. The procedure is then repeated: the result is expanded by deconvolution and fused with the feature map of the next shallower layer, until all M selected feature maps are fused.

Preferably, the non-maximum suppression based on soft decision means:

The detection box with the highest confidence is selected first; the IOU (intersection over union) between every other detection box and this box is computed, and the confidence of any box whose IOU exceeds a threshold is reduced;

After this highest-confidence detection box is removed, the box with the highest confidence among the remaining detection boxes is selected, the IOU between the remaining boxes and this box is computed, and the confidence of any box whose IOU exceeds the threshold is reduced;

Iterating in this way yields the final screened detection boxes.
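The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the text does not state how far the confidence is lowered, so the linear decay `score * (1 - IOU)`, the `score_floor` cut-off and the function names are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, iou_threshold=0.5, score_floor=0.05):
    """Return (box, score) pairs kept after soft-decision suppression."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Select the most confident remaining box and move it to the output.
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_idx)
        kept.append((best_box, best_score))
        rescored = []
        for box, score in remaining:
            overlap = iou(best_box, box)
            if overlap > iou_threshold:
                # Assumed linear decay: lower the confidence instead of deleting.
                score *= 1.0 - overlap
            # Assumed cut-off: drop boxes whose confidence has become negligible.
            if score >= score_floor:
                rescored.append((box, score))
        remaining = rescored
    return kept
```

With two heavily overlapping boxes and one distant box, the overlapping lower-scored box survives with a reduced confidence rather than being discarded, which is the behavior that distinguishes the soft decision from hard suppression.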

According to a second aspect of the invention, an object detection system with multi-scale feature fusion in images is provided, comprising:

Image pyramid construction module: scales the picture to be detected to different scales and constructs an image pyramid;

Multi-scale detection template construction module: based on the training images obtained from the image pyramid of the image pyramid construction module, uses a statistical clustering method to obtain a set of multi-scale detection templates that covers most sample sizes;

Target context construction module: on the basis of the multi-scale detection templates obtained by the multi-scale detection template construction module, builds scale-adaptive target context;

Multi-scale deep feature fusion module: according to the result of the target context construction module, fuses multi-scale deep features to obtain a multi-scale feature map;

Target detection module: according to the multi-scale feature map of the multi-scale deep feature fusion module, applies non-maximum suppression based on soft decision, realizing object detection with multi-scale feature fusion in images.

Preferably, in the image pyramid construction module, during training each original training image is scaled to multiple scales, giving L images at different scales for training. During testing, each image to be detected is likewise scaled to multiple scales, giving L images at different scales for detection, and the detection results of these L images are fused to obtain the final detection result.

Preferably, the multi-scale detection template construction module, based on the K-medoids clustering method and with the Jaccard distance as the clustering metric, clusters the targets in the training data set by their width and height values and aspect ratios, forming the aspect ratios of a set of K cluster centers, which serve as the target templates covering most width-height proportions.

Preferably, the target context construction module takes the receptive field of each point on the output feature map of the CNN convolutional layer as a candidate target box; the part of the receptive field that extends beyond the template box serves as the context of the target box and assists target detection and recognition.

Preferably, the multi-scale deep feature fusion module selects M layers from the feature maps output by different CNN convolutional layers and fuses them to construct a multi-scale feature pyramid.

More preferably, for the deepest of the M selected convolutional layers of the CNN, the multi-scale deep feature fusion module upsamples the output feature map by deconvolution until it has the same resolution as the feature map of the preceding layer and adds the two pixel by pixel, giving a multi-scale feature map that fuses two adjacent layers. The procedure is then repeated: the result is expanded by deconvolution and fused with the feature map of the next shallower layer, until all M selected feature maps are fused.

Preferably, the target detection module first selects the detection box with the highest confidence, computes the IOU (intersection over union) between every other detection box and this box, and reduces the confidence of any box whose IOU exceeds a threshold.

Compared with the prior art, the invention has the following advantages:

Because targets of different scales need different amounts of context for recognition, the invention combines the structural characteristics of CNNs with the concept of the receptive field and models scale-adaptive contextual information to assist target recognition.

The invention optimizes template selection with K-medoids clustering, so that the different templates serve the detection performance of the target detection model.

The image pyramid in the invention handles the requirement that training pictures have a fixed size: cropping is used to obtain fixed-size pictures for training.

The invention makes full use of the features of different CNN layers. Fusing deep and shallow features exploits both the representational power of deep features and the detail of shallow features, improving detection accuracy for small-scale targets.

The invention establishes a multi-scale feature detection mechanism: features are combined and fused, detection is then performed, and the detection results are finally merged, realizing multi-scale feature detection.

The invention uses non-maximum suppression based on soft decision, improving the fusion of detection results.

In summary, by jointly using a multi-resolution pyramid, multi-scale template clustering, scale-adaptive contextual information, multi-scale deep feature fusion and soft-decision non-maximum suppression, the invention strengthens feature learning and representation for image targets and effectively improves the accuracy of detecting targets such as pedestrians. It also better solves prior-art problems such as detecting small-scale, distant and dense targets.

Description of the drawings

Other features, objects and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:

Fig. 1a and Fig. 1b are flowcharts of image pyramid construction in an embodiment of the invention;

Fig. 2 is a flowchart of obtaining the multi-scale templates in an embodiment of the invention;

Fig. 3 is a flowchart of multi-scale feature fusion in an embodiment of the invention;

Fig. 4 is a flowchart of soft-decision non-maximum suppression in an embodiment of the invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the invention.

Existing object detection methods recognize certain larger targets well, but large targets make up only a small share of real scenes, and detection results for distant targets are poor. Object detection has the following characteristics, taking pedestrian targets as an example:

Feature one: diversity of scale. On the one hand, pedestrians include the elderly, the middle-aged and children, so their physical heights are widely distributed. On the other hand, shooting distance matters: for the same pedestrian, a higher and more distant camera yields a pedestrian with fewer pixels and a smaller height in the image, while a lower and closer camera yields more pixels and a height closer to the true one. Existing methods detect pedestrians taller than 100 pixels well, but perform poorly on distant, low-height pedestrians. When pedestrian detection is applied in driver assistance systems, the system is usually expected to detect pedestrians appearing at a distance and alert the driver, so detecting distant pedestrians is also a pressing need.

Feature two: occlusion. Pedestrians captured in real scenes may be occluded. On the one hand, crowds gather: when several people walk together, some part of someone's body is blocked from almost any viewing angle. On the other hand, part of a pedestrian's body may be hidden by objects in the environment, such as trees, vehicles and houses. Part of the body information is then missing, and a detector based on the complete human contour will miss the pedestrian.

Starting from the above problems, the present invention proposes an object detection method based on multi-feature fusion in images, which better solves problems such as detecting small-scale, distant and dense targets. Addressing the practical difficulties of object detection, the invention proposes pyramid-based adaptive detection of multi-scale targets, fusion of features with different receptive fields from multiple layers, and soft-decision non-maximum suppression for dense targets, thereby improving detection. Specifically:

Step 1: scale the picture to be detected to different scales and construct an image pyramid containing multiple resolutions, from large to small;

Step 2: using a statistical clustering method, cluster the samples by the width and height of the training samples to form K cluster centers; each cluster center forms a detection template whose scale is the mean width and height of that cluster; the K clusters form a set of K templates of different scales, which serves as the target template set of the detector;

Step 3: build scale-adaptive target context.

The detection template box of each scale is expanded uniformly upward, downward, left and right until it matches the receptive field of each point on the final output feature map of the convolutional neural network (CNN). The expanded part forms the contextual information surrounding the target, and the amount of context adapts to the template size.

Step 4: multi-scale feature fusion.

The feature maps of multiple CNN convolutional layers are brought to the same resolution and added pixel by pixel, forming the multi-scale deep features needed for target detection and recognition.

Step 5: non-maximum suppression based on soft decision.

Instead of deleting detection boxes directly, the confidence of detection boxes is reduced, and the detection boxes are then screened by repeated iteration.

Meanwhile, by integrating the above method steps, the invention constructs an image object detection system. The system jointly uses a multi-resolution pyramid, multi-scale template clustering, scale-adaptive contextual information, multi-scale deep feature fusion and soft-decision non-maximum suppression to strengthen feature learning and representation for image targets and effectively improve the accuracy of detecting targets such as pedestrians.

Specifically, an object detection system based on multi-feature fusion in images comprises:

The method and system of the invention are described in detail below, taking pedestrian detection as an example, with particular attention to the implementation of the five steps/modules above.

One. Scale the picture to be detected to different scales and construct an image pyramid.

So that the detection network, using one or a few detection boxes of limited sizes, can frame targets of different sizes in the image completely and compactly, the original image is scaled to multiple scales; repeated scaling raises the probability that an original target is framed completely and compactly by a detection box. The picture to be trained is scaled proportionally into L pictures of different resolutions, constructing an image pyramid whose resolution runs from high to low.

When training and testing the pedestrian detection model, the original data is scaled according to the image pyramid principle, as shown in Figs. 1a and 1b.

Specifically, during training each training image is scaled by 0.5X, 1X and 2X, giving images at different scales for training, as shown in Fig. 1a. During testing, each image to be detected is scaled by 0.5X, 1X and 2X, giving images at different scales for detection, and the detection results of the three scaled images are fused to obtain the final detection result, as shown in Fig. 1b.
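Before detections from the 0.5X, 1X and 2X images can be fused, they have to be expressed in one coordinate frame. The sketch below shows only that bookkeeping step, under assumptions: boxes are `(x1, y1, x2, y2)` tuples, each detection is a `(box, score)` pair, and the helper names are hypothetical. The text does not spell out the fusion rule itself; the pooled list would typically be passed to a suppression step such as the soft-decision non-maximum suppression this document describes.

```python
def to_original_coords(box, scale):
    """Map a box detected on a scaled image back to original-image pixels."""
    x1, y1, x2, y2 = box
    return (x1 / scale, y1 / scale, x2 / scale, y2 / scale)


def pool_detections(detections_by_scale):
    """Merge per-scale detections into one list in original coordinates.

    detections_by_scale maps a scale factor (e.g. 0.5, 1.0, 2.0) to a list
    of (box, score) pairs produced on the image resized by that factor.
    """
    pooled = []
    for scale, detections in detections_by_scale.items():
        for box, score in detections:
            pooled.append((to_original_coords(box, scale), score))
    # Highest confidence first, ready for a subsequent suppression step.
    pooled.sort(key=lambda d: d[1], reverse=True)
    return pooled
```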

In this embodiment, to make computing the back-propagated gradients convenient, input pictures need a fixed scale; for example, all pictures in Caltech are 640*480 pixels. After random scaling the scale of a training picture changes: a 640*480 picture becomes 320*240 after 0.5X scaling and 1280*960 after 2X scaling. To obtain training pictures of a uniform scale, pictures of 640*480 are cut from the scaled pictures during training; the procedure is shown in Fig. 1a. That is, a 320*240 picture is padded to 640*480, while a 640*480 patch is randomly cropped from a 1280*960 picture; the three kinds of pictures are then used together for training, as shown in Fig. 1a. This effectively increases the number of training samples and improves the performance of data-driven methods such as deep learning.
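The pad-or-crop step can be illustrated with a small pure-Python sketch on images stored as lists of rows. It is an illustration under assumptions rather than the embodiment's code: nearest-neighbour resizing, zero fill placed at the right and bottom, and the function names are all choices made for the example, and the matching transformation of the annotated boxes is omitted.

```python
import random


def resize_nearest(img, scale):
    """Nearest-neighbour resize of a 2-D image stored as a list of rows."""
    h, w = len(img), len(img[0])
    new_h, new_w = int(h * scale), int(w * scale)
    return [[img[min(h - 1, int(y / scale))][min(w - 1, int(x / scale))]
             for x in range(new_w)] for y in range(new_h)]


def fit_to_size(img, out_h, out_w, rng, fill=0):
    """Pad smaller images and randomly crop larger ones to (out_h, out_w)."""
    h, w = len(img), len(img[0])
    # Random crop offsets; both are zero when the image is not larger.
    top = rng.randint(0, max(0, h - out_h))
    left = rng.randint(0, max(0, w - out_w))
    cropped = [row[left:left + out_w] for row in img[top:top + out_h]]
    # Assumed padding policy: fill on the right and at the bottom.
    padded = [row + [fill] * (out_w - len(row)) for row in cropped]
    padded += [[fill] * out_w for _ in range(out_h - len(padded))]
    return padded


def training_pyramid(img, scales=(0.5, 1.0, 2.0), seed=0):
    """Return one fixed-size training image per scale."""
    rng = random.Random(seed)
    out_h, out_w = len(img), len(img[0])
    return [fit_to_size(resize_nearest(img, s), out_h, out_w, rng)
            for s in scales]
```

Every returned image has the size of the input, so a 640*480 input would produce three 640*480 training pictures, matching the fixed input size described above.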

Two. Obtain the best detector template set using a statistical clustering method.

The present invention obtains the best detector template set using a statistical clustering method: the annotated target rectangles in the training images are extracted and clustered with K-medoids to obtain a set of K multi-scale templates. Here scale refers to the size (height and width) and ratio (aspect ratio) of the target box. Obtaining K multi-scale templates statistically lets them cover most training sample scales while accounting for the differences in sample size within a class. Template selection thus takes the variation in sample size into account and avoids using only a single scale or a few scales, which leaves too few templates to match multi-scale targets accurately. Instead, the distribution of target width, height and aspect ratio in the data set is analyzed, multiple cluster centers are formed by clustering, and each cluster center forms a template of one scale (the mean aspect ratio, but not limited to this), giving a set of multi-scale templates that follows the sample size distribution. Meanwhile, the number of cluster centers is limited, because too many centers leave too few training samples for the template of some scale, so that its detector cannot be trained fully. The clustering can be implemented with K-medoids or similar methods, clustering the widths, heights and aspect ratios of the samples in the data set to select the multi-scale templates.

In practice, the template scales can be defined first so that they cover pedestrian targets of different scales. The aspect ratio of an upright pedestrian is generally 1:3; based on this experience and the typical height distribution of pedestrian targets in the Caltech data set, typical template scales such as 30*90 or 50*150 can be chosen by hand. However, manual selection lacks generality and very likely fails to choose the most suitable templates, for the following reasons:

First, across data sets, differences in picture resolution and surveillance viewpoint change the distribution of pedestrian heights, so choosing templates by hand does not generalize;

Second, in real surveillance scenes, pedestrian pose, camera angle and occlusion make the pedestrian aspect ratio vary widely, so the chosen templates are not representative. In addition, template selection must consider the distribution of target heights in the data set: if the template of some scale has too few training samples, its detector cannot be trained adequately.

Therefore, this embodiment clusters the widths and heights of the samples in the pedestrian data set with K-medoids, obtaining K multi-scale templates statistically (K=32 in this embodiment; other values may be used in other embodiments).

As shown in Fig. 2, pedestrian heights and widths in the training data set are clustered with the K-medoids method, using the Jaccard distance as the clustering metric of K-medoids, i.e.:

d(s_i, s_j) = 1 - J(s_i, s_j)

where s_i = (h_i, w_i) and s_j = (h_j, w_j) denote two different pedestrian boxes in the data set, h and w denote the height and width of a pedestrian box, and J denotes the standard Jaccard similarity coefficient.
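A minimal sketch of this clustering follows. Two points are assumptions not stated in the text: the Jaccard similarity of two (height, width) pairs is computed as the intersection over union of the two boxes aligned at a common corner, and a simple alternating (assign, then update) K-medoids loop with a fixed random seed is used. The function names are hypothetical.

```python
import random


def jaccard_distance(a, b):
    """d = 1 - J for two (height, width) boxes aligned at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return 1.0 - inter / union


def k_medoids(samples, k, iters=100, seed=0):
    """Alternating k-medoids; returns k medoids chosen from `samples`."""
    rng = random.Random(seed)
    medoids = rng.sample(samples, k)
    for _ in range(iters):
        # Assignment step: attach each sample to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k),
                          key=lambda i: jaccard_distance(s, medoids[i]))
            clusters[nearest].append(s)
        # Update step: the new medoid minimises total distance in its cluster.
        new_medoids = []
        for i, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[i])  # keep the medoid of an empty cluster
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(jaccard_distance(c, m) for m in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids
```

Because each medoid is an actual sample, the resulting templates are box sizes that really occur in the data set; calling `k_medoids(boxes, 32)` on the annotated (height, width) pairs would correspond to the K=32 setting of this embodiment.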

Three. Build scale-adaptive target context.

The invention builds scale-adaptive target context. On the one hand, small-scale targets carry little useful information and generally need more context to aid recognition, while large-scale targets often do not need much context. On the other hand, since each feature point of the CNN fully connected layer corresponds to a receptive field of fixed scale in the original image, the context of the different templates can be constructed from the receptive field.

Specifically, each of the K template boxes of different scales returned by K-medoids clustering is expanded upward, downward, left and right until it matches the receptive field of each point of the CNN fully connected layer. Because the receptive field size is fixed, a small-scale template box must be expanded a lot and therefore obtains a large context, while a large-scale template obtains a smaller context. This yields contextual information that adapts to the template box scale and assists target recognition.
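The arithmetic behind this adaptivity is simple enough to show directly. The sketch below assumes a square receptive field given by its side length in pixels and a template centered inside it; the value 300 used in the usage note and the function names are illustrative assumptions, not figures from the embodiment.

```python
def context_margins(template_hw, receptive_field):
    """Per-side padding that grows a template box to the receptive field."""
    h, w = template_hw
    if h > receptive_field or w > receptive_field:
        raise ValueError("template does not fit inside the receptive field")
    # Uniform expansion: spare pixels are split equally between opposite sides.
    return (receptive_field - h) / 2.0, (receptive_field - w) / 2.0


def context_fraction(template_hw, receptive_field):
    """Share of the receptive field area that is context rather than target."""
    h, w = template_hw
    return 1.0 - (h * w) / float(receptive_field ** 2)
```

For an assumed 300-pixel receptive field, a 90*30 template gains 105 pixels above and below and 135 pixels on each side, while a 270*90 template gains only 15 and 105; the smaller template therefore sees proportionally more context, as the paragraph above describes.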

Four, multi-scale deep feature fusion.

The feature maps output by the convolutional layers of a CNN all contain useful features at different scales: shallow outputs usually contain more local detail features, while high-level features usually contain more global and semantic information. Fusing the features of different scales obtained from these different layers yields a richer feature representation. Based on this idea, the present invention provides a multi-scale fusion method:

As shown in Fig. 3, the feature maps output by M main convolutional layers of the CNN (usually the last M layers of the network; M = 3 in this embodiment) are processed from deep to shallow. The deepest feature map is first upsampled by deconvolution so that it has the same resolution as the feature map output by the preceding layer, and the two same-scale feature maps are then added pixel by pixel to obtain a fused multi-scale feature map. Proceeding in the same way, all M layers are fused to obtain a multi-scale feature map containing the fused features of the M layers. For example, after an input image passes through a ResNet network, the res5 output is deconvolved once to reach the same resolution as res4 and is added to the res4 output; the sum is then deconvolved once to reach the same resolution as res3 and is added to the res3 output, giving the final feature-fusion result used for detection.
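
The deep-to-shallow fusion order can be sketched as below. This is a simplified illustration, not the network of the embodiment: nearest-neighbour 2x upsampling stands in for the learned deconvolution, each feature map is a single-channel 2D list, and adjacent layers are assumed to differ by exactly a factor of two in resolution. In a real network the channel counts of the layers would also have to be matched before addition. All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D map (stand-in for deconvolution)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out


def add_maps(a, b):
    """Pixel-by-pixel addition of two maps of identical resolution."""
    same_shape = len(a) == len(b) and all(
        len(ra) == len(rb) for ra, rb in zip(a, b)
    )
    if not same_shape:
        raise ValueError("feature maps must have the same resolution")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_top_down(feature_maps):
    """Fuse maps ordered shallow -> deep, e.g. [res3, res4, res5]."""
    fused = feature_maps[-1]                      # start from the deepest map
    for shallower in reversed(feature_maps[:-1]):
        fused = add_maps(upsample2x(fused), shallower)
    return fused
```

With M = 3, `fuse_top_down([res3, res4, res5])` follows the res5 to res4 to res3 order of the example above and returns a map at the resolution of res3.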

Five, soft-decision non-maximum suppression.

In previous methods, the final detection result is obtained by merging detection boxes with non-maximum suppression. Traditional non-maximum suppression is a greedy hard-decision method, which may cause correct detection boxes to be suppressed during merging, especially when the IOU (Intersection over Union, the ratio of the overlapping area of two rectangular boxes to the area of their union) threshold is chosen inappropriately.

As shown in Fig. 4, the overall idea of the soft-decision non-maximum suppression in the present invention is to reduce the confidence of a detection box instead of deleting it outright.

In this embodiment, the specific operation is as follows. The detection box with the highest confidence is selected first, and the IOU between each of the other detection boxes and this box is computed; any box whose IOU exceeds a certain threshold has its confidence reduced. After this highest-confidence box is removed, the highest-confidence box among the remaining detection boxes is selected, the IOU between it and each remaining box is computed, and the confidence of any box whose IOU exceeds the threshold is again reduced. Through continued iteration, the final screened detection boxes are obtained.

This embodiment may use linear weighting to reduce the score of a detection box according to its IOU value:

In the formula, M denotes the detection box with the highest confidence, b_i denotes the i-th detection box, N_i denotes the IOU threshold (which may be an empirical value or a preset value), and iou(M, b_i) denotes the IOU value of M and b_i.
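
The iterative procedure can be sketched as follows. The weighting formula itself is not reproduced in this text, so the sketch assumes the widely used linear decay in which the score of b_i is multiplied by 1 - iou(M, b_i) whenever iou(M, b_i) reaches the threshold N_i, and is left unchanged otherwise. The (x1, y1, x2, y2) box format, the default threshold, and the `score_floor` used for the final screening are likewise assumptions introduced for illustration.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_linear(boxes, scores, iou_threshold=0.3, score_floor=0.001):
    """Soft-decision NMS with an assumed linear score decay.

    Returns (box, score) pairs in the order in which boxes were selected.
    """
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Select and remove the box M with the highest confidence.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        m_box, m_score = remaining.pop(best)
        kept.append((m_box, m_score))
        decayed = []
        for box, score in remaining:
            overlap = iou(m_box, box)
            if overlap >= iou_threshold:
                score *= 1.0 - overlap  # lower the confidence, do not delete
            if score >= score_floor:    # assumed final screening threshold
                decayed.append((box, score))
        remaining = decayed
    return kept
```

Unlike hard-decision suppression, a box that overlaps M heavily is kept with a lower score, so a genuine neighbouring target in a dense scene can still be reported if its decayed score remains high enough.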

Therefore, the soft decision of the present invention gives better detection results for dense crowds.

Some implementation details and preferred features of the present invention have been described in detail above, taking pedestrian detection as an example. The present invention can also be applied to the detection of other targets in images and is not limited to pedestrian detection; the operations for other target detection are similar to the above embodiment and are not described separately here.

In summary, through a series of inventive methods such as constructing a multi-resolution sparse image pyramid, multi-scale detection templates optimized by clustering, template-scale-adaptive context, and multi-scale deep feature fusion, the present invention fully mines and fuses deep features and improves target detection performance.

Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the above specific embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the present invention.

Claims

1. An object detection method with multi-scale feature fusion in an image, characterized by comprising:

S2: based on the training images obtained from said image pyramid, obtaining, by a statistical clustering method, a set of multi-scale detection templates that covers most sample sizes;

S5: performing soft-decision-based non-maximum suppression according to the above multi-scale feature map, thereby realizing object detection with multi-scale feature fusion in the image.

2. The object detection method with multi-scale feature fusion in an image according to claim 1, characterized in that, in S1: so that the detection network, using detection boxes of one or a few limited sizes, can completely and compactly frame and sample targets of different sizes in the image, the original image is scaled at multiple scales, so that the multiple scalings of an original target increase the probability that it is completely and compactly framed by a detection box; the original image is scaled into pictures of L different resolutions, thereby constructing an image pyramid with resolutions from high to low.

3. The object detection method with multi-scale feature fusion in an image according to claim 1, characterized in that said obtaining, by a statistical clustering method, a set of multi-scale detection templates that covers most sample sizes means: based on the K-medoids clustering method, with the Jaccard distance as the clustering evaluation metric, clustering the targets in the training data set by their width and height values and aspect ratios to form the aspect ratios of a set of K cluster centers, which serve as target templates covering most aspect ratios.

4. The object detection method with multi-scale feature fusion in an image according to claim 1, characterized in that said performing scale-adaptive target context construction means: taking the receptive field of each point on the feature map output by a convolutional layer of the CNN network as a candidate target box; the part of the receptive field that extends beyond the template box serves as the context of the target box and is used to assist the detection and recognition of the target.

5. The object detection method with multi-scale feature fusion in an image according to claim 4, characterized in that said performing scale-adaptive target context construction finally yields a detection model whose contextual information varies with the target scale, namely: small-scale targets obtain more contextual information and large-scale targets obtain less, thereby meeting the different demands of targets of different scales for contextual information.

6. The object detection method with multi-scale feature fusion in an image according to claim 1, characterized in that said performing multi-scale deep feature fusion means: selecting M layers from the feature maps output by different convolutional layers of the CNN and fusing them, so as to construct a multi-scale feature pyramid.

7. The object detection method with multi-scale feature fusion in an image according to claim 6, characterized in that said performing multi-scale deep feature fusion is specifically: for the last layer of the CNN network among the M selected convolutional layers, upsampling its output feature map by deconvolution so that it is expanded to the same resolution as the feature map of the preceding layer, and then adding it pixel by pixel to the feature map of the preceding layer to obtain a multi-scale feature map fusing the two adjacent layers; and so on, expanding by deconvolution and fusing with the feature map of the next preceding layer, until all M selected feature maps have been fused.

8. The object detection method with multi-scale feature fusion in an image according to any one of claims 1-7, characterized in that said soft-decision-based non-maximum suppression means:

first selecting the detection box with the highest confidence, computing the IOU between each of the other detection boxes and the highest-confidence detection box, and reducing the confidence of any detection box whose IOU exceeds a certain threshold;

after this highest-confidence detection box is removed, selecting the highest-confidence detection box among the remaining detection boxes, computing the IOU between each remaining detection box and that highest-confidence detection box, and reducing the confidence of any detection box whose IOU exceeds a certain threshold;

9. An object detection system with multi-scale feature fusion in an image, characterized by comprising:

Image pyramid construction module: scales the picture to be detected at different scales to construct an image pyramid;

Multi-scale detection template construction module: based on the training images obtained from said image pyramid of the image pyramid construction module, obtains, by a statistical clustering method, a set of multi-scale detection templates that covers most sample sizes;

Target context construction module: performs scale-adaptive target context construction on the basis of the multi-scale detection templates obtained by the above multi-scale detection template construction module;

Multi-scale deep feature fusion module: performs multi-scale deep feature fusion according to the result of the target context construction module to obtain a multi-scale feature map;

Object detection module: performs soft-decision-based non-maximum suppression according to the multi-scale feature map of the above multi-scale deep feature fusion module, thereby realizing object detection with multi-scale feature fusion in the image.

10. The object detection system with multi-scale feature fusion in an image according to claim 9, characterized in that the multi-scale detection template construction module, based on the K-medoids clustering method and with the Jaccard distance as the clustering evaluation metric, clusters the targets in the training data set by their width and height values and aspect ratios to form the aspect ratios of a set of K cluster centers, which serve as target templates covering most aspect ratios.

11. The object detection system with multi-scale feature fusion in an image according to claim 9, characterized in that the target context construction module takes the receptive field of each point on the feature map output by a convolutional layer of the CNN network as a candidate target box; the part of the receptive field that extends beyond the template box serves as the context of the target box and is used to assist the detection and recognition of the target.

12. The object detection system with multi-scale feature fusion in an image according to claim 9, characterized in that the multi-scale deep feature fusion module selects M layers from the feature maps output by different convolutional layers of the CNN and fuses them, so as to construct a multi-scale feature pyramid.

13. The object detection method with multi-scale feature fusion in an image according to claim 12, characterized in that the multi-scale deep feature fusion module, for the last layer of the CNN network among the M selected convolutional layers, upsamples its output feature map by deconvolution so that it is expanded to the same resolution as the feature map of the preceding layer, and then adds it pixel by pixel to the feature map of the preceding layer to obtain a multi-scale feature map fusing the two adjacent layers; and so on, expanding by deconvolution and fusing with the feature map of the next preceding layer, until all M selected feature maps have been fused.

14. The object detection system with multi-scale feature fusion in an image according to any one of claims 9-13, characterized in that the object detection module first selects the detection box with the highest confidence, computes the IOU between each of the other detection boxes and the highest-confidence detection box, and reduces the confidence of any detection box whose IOU exceeds a certain threshold;