CN110009679A

CN110009679A - An object localization method based on a multi-scale feature convolutional neural network

Info

Publication number: CN110009679A
Application number: CN201910148554.9A
Authority: CN
Inventors: 孙俊; 周以鹏; 吴豪; 吴小俊; 方伟; 陈祺东; 李超; 游琪; 冒钟杰
Original assignee: Jiangnan University
Current assignee: Uni Entropy Intelligent Technology Wuxi Co Ltd
Priority date: 2019-02-28
Filing date: 2019-02-28
Publication date: 2019-07-12
Anticipated expiration: 2039-02-28
Also published as: CN110009679B

Abstract

The invention provides an object localization method based on a multi-scale feature convolutional neural network and belongs to the field of computer vision. In many practical applications, dataset labels are partly missing and localization annotations are unavailable. To address these problems, the method proposes weakly supervised localization based on a multi-scale feature convolutional neural network. Its core idea is to exploit the layered nature of the neural network: gradient-weighted class activation mapping is applied to multiple convolutional layers to generate a gradient pyramid model, the feature centroid position is computed by mean filtering, a connected pixel segment is generated by the confidence intensity mapping and threshold step-down module, and weakly supervised localization is performed around the maximal boundary. Experimental results on standard test sets show that the algorithm can localize targets across a large number of classes and at multiple image scales with high accuracy.

Description

An object localization method based on a multi-scale feature convolutional neural network

Technical field

The invention belongs to the field of computer vision and relates in particular to an object localization method based on a multi-scale feature convolutional neural network.

Background technique

Object localization is one of the important research directions in computer vision. Its purpose is to determine the position of a target in an image. The common approach today uses supervised learning: a model is trained on the class and position information of targets and then localizes targets on the test set. In many practical applications, such as small-object detection, traffic-target detection, multi-modal target detection, and medical target detection, data are scarce and many annotations are missing, so the requirements of neural-network detection tasks cannot be met. In these applications the classes in a dataset also differ widely, and missing annotations for some targets severely contaminate the background feature space. The classifier then struggles to distinguish known classes from the current target and tends to misclassify targets as background, which confuses the supervised model and reduces its accuracy.

Algorithms have been improved in two respects. At the feature representation level, feature engineering has grown ever larger with the development of deep learning; when the environment is complex and target information is weak, extracting effective target features and fusing multi-dimensional, multi-scale features is very important for improving the target model. At the model level, weakly supervised learning does not rely on target labels; when dataset annotations are missing, or when the detection classes are known but the dataset is too small to complete training, it extends easily to new object classes. Researchers have done much work on feature representation for image targets. Early image-processing work extracted rich feature points and corresponding feature descriptors from textured targets, so that textured objects could be recognized and detected accurately and stably; examples are the SIFT algorithm and other discriminative descriptors such as PCA-SIFT and SURF. Dalal et al. subsequently proposed using histograms of oriented gradients (HOG) of local image regions as features, with a support vector machine (SVM) as the classifier, for pedestrian detection; such hand-crafted features require the designer to have considerable domain knowledge. With the development of neural networks and deep learning, Ross Girshick et al. proposed the R-CNN, Fast-RCNN, and Faster-RCNN family of algorithms, which use convolutional neural networks to build rich feature hierarchies for accurate object detection and semantic segmentation; however, they use only the output of the last feature map and do not fully exploit the multi-scale features of the target. He et al. proposed SPP-NET, which adds a spatial pooling layer after the last convolutional layer so that a feature map of arbitrary size can be converted into a fixed-size feature vector. Liu et al. proposed the SSD network, which uses a single-stage detection structure and predicts from multi-scale feature maps to improve detection accuracy, but it does not use low-level features and lacks connections between feature maps at different levels. Lin et al. proposed the feature pyramid model, which, on top of combining multi-scale feature maps, adds low-level feature maps and fuses them with upsampled feature maps, yielding a more complete target model.

Weakly supervised localization with convolutional neural networks has also been studied extensively; these methods localize objects in an image using only image-level class labels. In recent years, Vinyals et al. proposed class activation mapping (Class Activation Map, CAM), which modifies the image-classification network architecture by replacing the fully connected layers with convolutional layers and global average pooling. Its drawback is that the architecture requires the feature maps to directly precede the classification layer, so on tasks other than classification it may perform worse than general network structures. Lu et al. studied similar methods using global max pooling and log-sum-exp pooling. Building on class activation mapping, Selvaraju et al. introduced gradient-weighted class activation mapping (Gradient Class Activation Map, Grad-CAM), which combines feature maps using gradient signals and requires no modification of the original network architecture, but it does not exploit the gradients and feature fusion of multi-scale features. Other methods localize targets by classifying perturbed versions of the input image. Zeiler and Fergus et al. perturb the input by occluding patches and classifying the occluded image; when the relevant objects are occluded, their classification scores usually drop. Quab et al. classify many patches containing a given pixel and then average the patch-wise scores to obtain the class score of that pixel; this requires multiple forward and backward passes and is inefficient. Zhang et al. introduced contrastive marginal winning probability (c-MWP) to model the top-down attention of neural classification models, which can highlight discriminative regions; it is suited only to image classification tasks and localizes targets poorly.

Summary of the invention

The invention aims to provide an object localization method based on a multi-scale feature convolutional neural network. The method is an end-to-end weakly supervised localization algorithm based on multi-scale features. It makes full use of the multi-scale features of a deep neural network: gradient-weighted class activation mapping generates a gradient pyramid for each predicted class, the feature centroid position is computed by mean filtering, a connected pixel segment is generated by the confidence intensity mapping and threshold step-down module, and weakly supervised localization is performed around the maximal boundary. Multiple experiments show that, given few labels, the algorithm can localize targets precisely and outperforms other methods.

Technical solution of the present invention:

An object localization method based on a multi-scale feature convolutional neural network, with the following steps:

Step 1: A single-scale image I of arbitrary size is input to the convolutional neural network ConNet. Using the feature pyramid model and the gradient-weighted class activation mapping (Grad-CAM) algorithm, the classification cross-entropy error L_cross-entrop and the corresponding guided back-propagation gradient are computed. The outputs of the layers of ConNet are {C₂,C₃,...,C_l}. The backbone convolutional network computes the predicted class c and the score y^c for that class; the input image I has size w*h. The multi-layer feature maps corresponding to these outputs are {F₂,F₃,...,F_l}.

Step 2: The importance weights of each layer are computed; the pixel-level spatial intensity is computed on the multi-layer feature maps and passed through the ReLU activation function.

Step 3: For each layer of the gradient pyramid, upsampling and lateral connection are performed to obtain the superimposed intensity.

Step 4: After the heat map is computed from the superimposed intensity, the global peak γ is computed and scaled by the scaling factor σ to serve as the local-maximum threshold. A maximum filter and a minimum filter are applied to each heat map to obtain the maximum mean-filtered result and the minimum mean-filtered result, and the difference heat map is computed; pixels where the difference is constant are set to 0 to obtain candidate regions containing local-maximum centroids.

Step 5: Through repeated dilation, multiple candidate points are generated and the best centroid is found; threshold step-down is then performed using the scaled global peak.

Step 6: From the maximal boundary obtained after step-down, the coordinates [xmin, ymin, xmax, ymax] of the largest rectangular box are selected. The predicted target classes D_class and the coordinate set D_loc of all images are output.

Beneficial effects of the invention: to address missing datasets, scarce annotations, and the lack of target position information in many application scenarios, a weakly supervised object localization algorithm based on a gradient pyramid is proposed. The invention builds the gradient pyramid from back-propagated gradients and the neural network structure, and completes localization through threshold step-down using class information only. The algorithm has two advantages: 1) it makes full use of multi-scale deep feature information, fusing shallow structural information with deep semantic information; 2) by finding a suitable feature centroid, it completes the localization task accurately with the threshold step-down strategy. Comparison with other algorithms on the datasets shows that the algorithm uses multi-scale feature information effectively, improves performance on weakly supervised localization, and generalizes well. The next research goal is to design a class-based adaptive threshold strategy and a highly robust weakly supervised non-maximum suppression, to address weakly supervised localization without position labels.

Brief description of the drawings

Fig. 1 shows the weakly supervised object localization network framework based on the gradient pyramid.

Fig. 2 shows the gradient pyramid.

Fig. 3 is the flow chart of weakly supervised localization.

Fig. 4 shows experimental results, where (a-1) to (a-4) are the preprocessed original images, (b-1) to (b-4) are the guided back-propagation maps for the predicted class, (c-1) to (c-4) are the heat maps generated by the gradient pyramid, (d-1) to (d-4) are the guided gradient pyramids, and (e-1) to (e-4) are the regression boxes of the weakly supervised predicted regions.

Fig. 5 shows experimental predicted boxes and ground-truth labels, where (a) to (h) are the predicted boxes and ground-truth labels of 8 kinds of targets.

Fig. 6 shows the comparison results on PASCAL VOC2012.

Fig. 7 shows localization results for fine-grained classification, where (a-1) to (a-3) are the preprocessed original images, (b-1) to (b-3) are the gradient pyramid heat maps, and (c-1) to (c-3) are the regression boxes of the weakly supervised predicted regions.

Specific embodiment

The technical solution of the invention is further described below with reference to specific embodiments and the drawings.

1. Datasets and evaluation metrics

(1) ImageNet-ILSVRC2012

The ImageNet dataset is a large-scale image dataset established to promote the development of computer image recognition; every year it hosts the Large Scale Visual Recognition Challenge (ILSVRC). The images can be used for many computer vision tasks, including image classification, object localization, object detection, video object detection, and scene classification, and they carry explicit class annotations and annotations of object positions. We use ILSVRC2012, the dataset released by ImageNet in 2012. It contains 1000 classes with about 1000 images each, comprising 1.2 million training images, 50,000 validation images, and 150,000 test images. We use the validation set to evaluate weakly supervised localization. The evaluation metrics are Top-1 error and Top-5 error, that is, the classification and localization errors for the first predicted class and for the first five predicted classes, where y_i is the correct sample, m is the total number of samples, and D is the sample set.

The localization error judges positive and negative samples using an intersection-over-union threshold of 0.5,

where R_pred is the predicted region and R_gt is the ground-truth region. The lower the error, the better.
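The intersection-over-union test above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the function names `iou` and `is_localized`, the continuous-coordinate box convention, and the use of a single predicted box per image are all assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [xmin, ymin, xmax, ymax]."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_localized(pred_classes, pred_box, gt_class, gt_box, k=1, threshold=0.5):
    """True when gt_class is among the top-k predicted classes and IoU >= threshold.

    pred_classes is assumed to be sorted by descending score (hypothetical input format).
    """
    return gt_class in pred_classes[:k] and iou(pred_box, gt_box) >= threshold
```

With k=1 the check corresponds to a Top-1 criterion and with k=5 to a Top-5 criterion; the error over a dataset would then be the fraction of images for which the check fails.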

(2) PASCAL VOC2012 data set

The PASCAL VOC (Visual Object Classes) challenge dataset is mainly used for object recognition and covers 20 object classes. Image sizes vary. The training and validation data contain 11,530 images with 27,450 annotated objects and 6,929 semantic segmentations.

2. Parameter settings

The experiments are based on the PyTorch deep learning library. The hardware runs the CentOS operating system with an Intel Xeon E5 processor, an Nvidia Tesla K80 graphics card, and 64 GB of memory. Images are preprocessed to 224*224 with 3 channels and normalized per channel with mean [0.485,0.456,0.406] and standard deviation [0.229,0.224,0.225]. On the VGG-19 network, layers 8, 17, 26, and 35 are used as [conv2, conv3, conv4, conv5] of the convolutional network, with output feature sizes [56*56,28*28,14*14,7*7] respectively. The threshold step-down factor is set to 0.85 for the ImageNet dataset and 0.75 for the VOC dataset.
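The per-channel normalization described above can be illustrated with a minimal sketch in plain Python. It assumes pixel values have already been scaled to the range [0, 1]; the helper name `normalize_pixel` is hypothetical, and resizing to 224*224 is left out.

```python
# Per-channel statistics quoted in the parameter settings.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return [(value - mean) / std for value, mean, std in zip(rgb, MEAN, STD)]
```

A pixel equal to the channel means maps to zero in every channel, and a pure white pixel maps to roughly 2.2 to 2.6 depending on the channel.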

3. Gradient pyramid

The feature pyramid model is a module for detecting targets at different convolutional layers of a deep network. It exploits the pyramidal feature hierarchy of a convolutional neural network, whose levels carry semantics from low to high, to build a feature pyramid with high-level semantics throughout. The method takes a single-scale image of arbitrary size as input and, in a fully convolutional manner, outputs feature maps of proportional size at multiple levels. The process is independent of the backbone convolutional architecture. The pyramid structure has two parts. The first is the bottom-up pathway of the feed-forward computation, which computes a feature hierarchy composed of multi-scale feature maps with a scaling step of 2; the output of the last layer of each stage is chosen as the reference set of feature maps. For ResNet, the feature activations output by the residual block of each stage are used. For the convolutional stages CONV2, CONV3, CONV4, and CONV5, the residual block outputs are {C₂,C₃,C₄,C₅}, with strides of {S₂,S₃,S₄,S₅} pixels respectively. The second part is the top-down pathway on the feature maps together with the lateral connections between feature levels. High-level features are spatially coarser but semantically stronger; the top-down pathway and lateral connections enhance the feature maps for more accurate localization. The spatial resolution is upsampled by a factor of 2, and the upsampled information is merged with the current level by element-wise addition. This process is iterated until the pyramid is complete. The resulting set of feature maps is {P₂,P₃,P₄,P₅}, corresponding to {C₂,C₃,C₄,C₅}, with identical sizes respectively.
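The top-down pathway described above (upsample by 2, add element-wise, repeat) can be sketched on single-channel maps as follows. This is an illustration under stated assumptions: nearest-neighbour upsampling is chosen for simplicity because the text does not name an interpolation method, each map is a plain 2-D list, and the names `upsample2x` and `top_down_merge` are hypothetical.

```python
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a 2-D list of numbers."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                             # repeat each row
    return out


def top_down_merge(levels):
    """Merge pyramid levels ordered from finest to coarsest.

    Each level must be exactly twice the size of the next one. Returns the
    merged maps in the same order: the coarsest map is kept unchanged and
    every finer map receives the upsampled merged map above it.
    """
    merged = [levels[-1]]
    for level in reversed(levels[:-1]):
        up = upsample2x(merged[0])
        fused = [[a + b for a, b in zip(row, up_row)] for row, up_row in zip(level, up)]
        merged.insert(0, fused)
    return merged
```

For example, merging a 4*4 map of ones, the 2*2 map [[10, 20], [30, 40]], and the 1*1 map [[100]] leaves the coarsest level unchanged, turns the middle level into [[110, 120], [130, 140]], and adds the upsampled middle result to the finest level.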

4. Grad-CAM algorithm

In the gradient-weighted class activation mapping algorithm (Grad-CAM), because the last convolutional layers of a convolutional neural network capture higher-level visual structure, the gradient information flowing into the last convolutional layer is used to understand the importance of each neuron for the target decision. To obtain the class-discriminative localization map of width u and height v for any class c, first compute the gradient of the class score, that is, the partial derivative of y^c with respect to the feature maps A^k of the convolutional layer, where k indexes the feature maps. These gradients are processed by global average pooling to obtain the neuron importance weights.

These weights represent a linearization of the network downstream of the feature maps and capture the importance of feature map k for the target class c. The algorithm then applies the ReLU activation function to the weighted combination of the forward activation maps.

Through the ReLU function, the algorithm attends only to features that have a positive influence on the target class, that is, pixels whose intensity should be increased in order to raise the confidence for the class label; negative pixels are likely to belong to other classes in the image. The algorithm provides a visualization of pixel-level spatial gradients and can discriminate fine-grained features. In addition, the algorithm upsamples the map to the input image resolution by bilinear interpolation and then fuses guided back-propagation with the Grad-CAM visualization by point-wise multiplication. The method can thus discriminate both local target features and classes.
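The two quantities described in this section, the pooled importance weights and the ReLU of the weighted combination, can be written out as follows. The formulas follow the published Grad-CAM formulation; the symbols α_k^c, Z, and L^c_Grad-CAM are assumed notation, with Z = u·v the number of spatial positions in feature map A^k.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A^{k}_{ij}},
\qquad
L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)
```

The first expression is the global average pooling of the gradients over the positions (i, j); the second keeps only the positions where the weighted sum of feature maps is positive.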

5. Gradient pyramid model

The basic framework of the algorithm is shown in Fig. 1. It is designed to suit the complex environments and data conditions that may arise in object localization, for example visual tasks with little data or no annotation. We build on the structure of convolutional networks, which have an inherent multi-scale pyramidal shape, and compute the feature hierarchy level by level. The method attends not only to deep semantic information but also to the texture and edge information of shallow layers, enriching the feature space. We choose to make full use of the pyramidal structure of the convolutional feature hierarchy to create features with strong semantics at all scales: gradients are back-propagated on the feature maps of each level and combined through the top-down pathway and lateral connections to build the gradient class mapping pyramid model, which increases the importance intensity of features at different scales. Here, our structure uses the fused gradient information to interpret features of different dimensions. First, after the feed-forward pass on the current image, the output of each level is computed as {C₂,C₃,...,C_l}, where l indexes the convolutional layers, and the output of each level is used directly as the returned feature maps {F₂,F₃,...,F_l}. The first layer is not used because it is too close to the input image and carries insufficient discriminative information. The network then computes the predicted output class c and the gradient of the score of class c with respect to every feature level, that is, the partial derivative of the output y^c with respect to the feature maps of the l-th convolutional layer. The partial derivatives are processed by global average pooling to obtain the importance weights, where the pooling for each feature map k runs over the positions i, j.

For the feature maps at each level, the dimensions are {m, n, k}, that is, the height, width, and number of channels of the feature. After the ReLU activation function,

the feature score of each layer of the gradient pyramid is obtained.
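The per-layer computation described above (global average pooling of the gradients, weighted sum over channels, then ReLU) can be sketched for one layer as follows. This is a minimal pure-Python illustration, not the implementation used in the experiments; the function name `grad_cam_layer` and the list-of-lists data layout are assumptions, and the gradients are taken as given rather than obtained from a real network.

```python
def grad_cam_layer(features, grads):
    """Intensity map of one layer.

    features, grads: lists of K channel maps, each an m x n 2-D list, holding
    the forward activations and the gradients of the class score with respect
    to them. Returns the m x n map ReLU(sum_k alpha_k * A_k).
    """
    rows, cols = len(features[0]), len(features[0][0])
    # alpha_k: global average pooling of the gradient over all positions (i, j).
    alphas = [sum(sum(row) for row in grad) / (rows * cols) for grad in grads]
    out = [[0.0] * cols for _ in range(rows)]
    for alpha, fmap in zip(alphas, features):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += alpha * fmap[i][j]
    # ReLU keeps only positions with a positive influence on the class.
    return [[max(0.0, value) for value in row] for row in out]
```

Applying the same function to each of the layers {F₂,F₃,...,F_l} would give one intensity map per pyramid level.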

For each gradient feature map we perform two operations. First, it is upsampled by a factor of two so that its shape matches the gradient map of the next layer. Then it is laterally connected with the gradient intensity map of the next layer, fusing the enhanced shallow feature intensity with the deep feature intensity. This operation is applied between every pair of adjacent layers.

For the gradient feature map output at the bottom level, the result accumulates the contributions of all layers above it.

The top feature maps carry greater weight than the bottom feature maps, because the semantic information of high-level feature maps is more concentrated and captures more visual structure. The lateral connections between levels strengthen the gradient intensity step by step. The feature information based on the gradient pyramid is therefore richer and provides more evidence for computer vision tasks.
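One plausible way to write the layer-to-layer fusion is the recurrence below. It is a sketch under stated assumptions rather than the formula of the original text: S^c_l denotes the gradient intensity map of class c at layer l, a tilde marks the fused map, L is the index of the top layer, and Up_2 is a linear 2x upsampling operator (for example nearest-neighbour or bilinear).

```latex
\tilde{S}^{c}_{L} = S^{c}_{L},
\qquad
\tilde{S}^{c}_{l} = S^{c}_{l} + \mathrm{Up}_{2}\!\left( \tilde{S}^{c}_{l+1} \right),
\quad l = L-1, \dots, 2

\tilde{S}^{c}_{2} = \sum_{l=2}^{L} \mathrm{Up}_{2}^{\,(l-2)}\!\left( S^{c}_{l} \right)
```

The second line unrolls the recurrence for the bottom level, where Up_2^{(l-2)} means applying the upsampling l-2 times; it holds because the upsampling operator is linear.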

6. Weakly supervised localization

The algorithm applies the gradient pyramid structure to the weakly supervised localization task: the network predicts the target class, back-propagation generates the gradient pyramid, and after mean filtering the confidence intensity mapping and threshold step-down module determines the effective feature region of the target class, thereby localizing the target under weak supervision.

Fig. 3 is the flow chart of weakly supervised localization. First, the backbone convolutional network computes the predicted class c. From the class score, the superimposed feature intensity is generated according to the gradient pyramid. After computing the heat map, we take the global peak γ and scale it by a maximum-intensity factor to serve as the threshold for selecting local maxima, so that the intensity at each selected local position is sufficiently high. The setting of the maximum-intensity factor depends on prior knowledge of the dataset: partly on the average proportion of image pixels occupied by targets in the dataset, and partly on how fine-grained the image classes are. We take the average proportion as the initial value and then tune the factor as a hyperparameter. To select salient feature points in the heat map, we apply a maximum filter and a minimum filter to each heat map and compute the difference heat map, obtaining candidate regions that contain local-maximum centroids.
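The peak selection described above can be sketched as follows. This is one interpretation, labeled as such: a position is kept when it equals the maximum of its neighbourhood, when the neighbourhood is not flat (the difference between the maximum-filtered and minimum-filtered maps is nonzero), and when its intensity reaches the scaled global peak. The names `window_filter` and `local_peaks`, the square window, and the `radius` parameter are assumptions made for the example.

```python
def window_filter(grid, reducer, radius=1):
    """Apply reducer (max or min) over a square window clipped at the borders."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [grid[y][x]
                      for y in range(max(0, i - radius), min(rows, i + radius + 1))
                      for x in range(max(0, j - radius), min(cols, j + radius + 1))]
            out[i][j] = reducer(window)
    return out


def local_peaks(heat, sigma, radius=1):
    """Return (row, col) positions of local maxima at or above sigma * global peak."""
    gamma = max(max(row) for row in heat)      # global peak
    hi = window_filter(heat, max, radius)      # maximum-filtered map
    lo = window_filter(heat, min, radius)      # minimum-filtered map
    peaks = []
    for i, row in enumerate(heat):
        for j, value in enumerate(row):
            flat = hi[i][j] - lo[i][j] == 0    # zero difference: no peak here
            if value == hi[i][j] and not flat and value >= sigma * gamma:
                peaks.append((i, j))
    return peaks
```

On a 5*5 map that is zero except for a value of 1 at (1, 1) and a value of 5 at (3, 3), a factor of 0.5 keeps only (3, 3), while a factor of 0.1 keeps both positions.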

For all local maxima in the image above the threshold, we use dilation to merge them into multiple candidate points and select the centroid of each merged component as the center of a predicted bounding box. Once the centroid is determined, we apply threshold step-down, using the scaled peak γ_local to obtain the percentage region around the centroid. For multiple local centroids that lie far apart, we select the target bounding box by non-maximum suppression.
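The step from a centroid to a bounding box can be sketched as region growing. This is only one reading of the threshold step-down, offered as an assumption: starting from the centroid, 4-connected pixels whose intensity is at least the step-down factor times the centroid intensity are collected, and the enclosing rectangle is returned. The function name `region_box` and the choice of 4-connectivity are hypothetical.

```python
from collections import deque


def region_box(heat, seed, factor):
    """Bounding box [xmin, ymin, xmax, ymax] of the 4-connected region around
    seed (row, col) whose intensity is at least factor * heat at the seed."""
    rows, cols = len(heat), len(heat[0])
    limit = factor * heat[seed[0]][seed[1]]
    seen = {seed}
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        for y, x in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            inside = 0 <= y < rows and 0 <= x < cols
            if inside and (y, x) not in seen and heat[y][x] >= limit:
                seen.add((y, x))
                queue.append((y, x))
    ys = [point[0] for point in seen]
    xs = [point[1] for point in seen]
    return [min(xs), min(ys), max(xs), max(ys)]
```

A smaller factor admits more pixels and yields a larger box, which matches the use of a lower factor (0.75) on VOC than on ImageNet (0.85) in the parameter settings.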

The weakly supervised network structure based on the gradient pyramid requires no retraining of the original network and relies only on the class decision of the original network, so it is faster. At the same time, because it builds on feature visualization, the model is highly interpretable: unlike other network structures, for every image we know which features have a positive influence on the classification decision, so a position decision based on feature intensity is more trustworthy. On top of the gradient heat map, we also provide a spatial visualization of fine-grained class importance: after localizing the target region, we fuse the back-propagated derivative with the gradient pyramid by point-wise multiplication to construct the guided gradient pyramid E^c (GGP, Guided Grad-Pyramid).

E^c=S^c⊙I (9)

where S^c is the gradient intensity map of class c and I is the back-propagated derivative of the error with respect to the image. This visualization has high resolution and class-discriminative ability, and it clearly identifies fine-grained features of the target in the image (such as stripes, ears, and eyes). This helps us assess the classification ability of the model and guides model adjustment for more precise weakly supervised localization.

7. Within-group comparison experiments

To verify the effectiveness of the gradient pyramid on different convolutional network structures, we carried out comparison experiments on the networks VGG-19, ResNet50, and ResNet101. Inference is run directly on the source networks, which are classification networks whose labels contain class information only. No image uses target position information for training; the entire dataset is treated as a dataset without position annotations.

Table 1. Backbone network comparison experiment

Table 1 lists the results of the proposed algorithm on three backbone convolutional networks, reporting the localization and classification errors on the 50,000 validation images of ILSVRC2012. The results show that the algorithm localizes targets well under different backbone structures. The classification error depends on the pre-training of the network. The results also show that the deeper the network, the better the fusion effect of the gradient pyramid.

For the modified network structure of the gradient pyramid, we ran computational complexity experiments on the ImageNet dataset with the VGG backbone, measuring backbone computation, gradient pyramid generation, and multi-layer fusion of the gradient pyramid.

Table 2. Computation time of the gradient pyramid network structure

During the forward and backward computation of the backbone, 4 feature maps are saved; compared with other algorithms, which also save intermediate results, this adds no extra computation time. In the upsampling and stacking operations, the shape of each feature map is fixed, and the larger the gradient map, the longer the computation. The time of a single upsampling and stacking operation is much smaller than the time needed to compute the gradient map. The complexity can be summarized as a constant backbone time τ plus a gradient pyramid generation time O(n), where n is the number of stacked layers; because the gradient information in the first few layers is not distinctive, generally only the last 4 layers are used as feature maps. Table 2 shows the average time of each layer's operation. With preprocessing included, the average processing speed on the dataset is 10 FPS.

8. Comparison experiment analysis

To verify the weakly supervised performance of the gradient pyramid, we compare it with three algorithms from recent years: Backprop, c-MWP, and Grad-CAM. Backprop visualizes the back-propagated gradient directly, without pooling or activation. c-MWP introduces contrastive marginal winning probability to model neural classification models that highlight discriminative regions. Grad-CAM uses only the gradients of the last feature layer. Table 3 shows the weakly supervised localization results of the algorithms on ImageNet-ILSVRC2012. Errors are reported as Top-1 localization and classification errors and Top-5 localization and classification errors. Lower values are better.

Table 3. Algorithm comparison experiment

To compare with the other algorithms, we use the VGG-19 network in place of the ResNet101 network. Table 3 shows that our algorithm ranks first on the standard metrics. On the Top-1 error it outperforms the second-place Grad-CAM algorithm by 4.1 percentage points and the c-MWP algorithm by 18 percentage points, reflecting its strong localization of the main target. On the Top-5 localization error it outperforms the second-place algorithm by 0.8 percentage points and the c-MWP algorithm by 18 percentage points; when multiple fine-grained classes are predicted, the position is localized more accurately. The classification error is unchanged, because all methods use the same backbone network without additional training.

We also fine-tuned the network on the VOC2012 dataset. Fine-tuning targets the classification task only, and no target position information is added. The entire dataset is treated as a dataset without position annotations.

The 20 object classes are grouped into 4 categories of targets, and the ratio of the number of predicted boxes with IoU above 0.5 to the total number of target boxes is computed. The results are shown in Fig. 6.

The algorithm predicts the animal, indoor object, and person categories better than the vehicle category; for all 4 categories, the proposed algorithm outperforms the other algorithms.

9. Analysis of experimental results

Fig. 4 shows the weakly supervised localization results of the algorithm on 4 classes. For the different kinds of targets, the algorithm successfully identifies the contours and edge details of the target in the guided back-propagation, and the gradient pyramid heat maps exhibit strongly connected deep features, providing a reliable basis for the target task decision. Fig. 5 compares our algorithm with the ground-truth labels: the algorithm localizes the edges and contour structure of the target in the deep features and finds the best target position on multi-scale targets. For fine-grained and differing classes alike, appropriate threshold step-down accurately completes weakly supervised localization across multiple scenes and classes. Fig. 7 shows localization under fine-grained classification; the three images are different sub-classes of dog. Localization concentrates on the face of the target, a feature region with high confidence; the limbs contribute less, and the background contribution is essentially negligible. High-confidence features help us determine the core region for weakly supervised localization.

Claims

1. An object localization method based on a multi-scale feature convolutional neural network, characterized in that the steps are as follows:

Step 1: input a single-scale image I of arbitrary size into the convolutional neural network ConNet; using the feature pyramid model and the gradient-weighted class activation mapping (Grad-CAM) algorithm, compute the classification cross-entropy error L_cross-entrop and the corresponding guided backpropagation gradient; the outputs of the layers of ConNet are {C₂,C₃,...,C_l}; the backbone convolutional network computes the predicted class c and the score y^c of that class; the size of the input image I is w*h; the multi-layer feature maps corresponding to the outputs are {F₂,F₃,...,F_l};

Step 2: compute the importance weight of each layer, then compute the pixel-level spatial intensity on the multi-layer feature maps using the ReLU activation function;

Step 4: after the heat map has been computed from the superimposed intensity maps, compute the global peak γ and scale it by the scaling factor σ to serve as the local maximum threshold; apply a maximum filter and a minimum filter to each heat map, compute the corresponding maximum-filtered and minimum-filtered maps and their difference heat map, and set the pixels whose difference is constant to 0, thereby obtaining candidate regions with local-maximum centroids;

Step 5: generate multiple candidate points by repeated dilation and find the best centroid, then perform stepwise (ladder) subtraction using the scaled global peak;

Step 6: from the maximum boundary obtained after the stepwise (ladder) subtraction, select the coordinates [xmin, ymin, xmax, ymax] of the largest rectangular box; output the predicted target classes D_class and the coordinate set D_loc for all images.
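The peak selection of step 4 and the box extraction of step 6 can be sketched as follows. This is only an illustration under stated assumptions, not the claimed procedure itself: the window size, the reading of "constant difference" as a flat neighbourhood, and the 4-connected region growing are all assumptions, the function names `local_peaks` and `box_around` are hypothetical, and the repeated dilation and iterative subtraction of step 5 are omitted.

```python
from collections import deque

import numpy as np


def local_peaks(heat, sigma=0.5, size=3):
    """Boolean mask of candidate peaks in a 2-D heat map.

    A pixel is kept when it equals its neighbourhood maximum, its
    neighbourhood is not flat, and it reaches sigma * global peak.
    """
    gamma = heat.max()                       # global peak
    threshold = sigma * gamma                # scaled peak used as local threshold
    pad = size // 2
    padded = np.pad(heat, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    h_max = windows.max(axis=(2, 3))         # maximum-filtered map
    h_min = windows.min(axis=(2, 3))         # minimum-filtered map
    diff = h_max - h_min                     # difference heat map
    return (heat == h_max) & (diff > 0) & (heat >= threshold)


def box_around(heat, seed, threshold):
    """Bounding box [xmin, ymin, xmax, ymax] of the 4-connected region of
    pixels >= threshold that contains seed = (row, col)."""
    h, w = heat.shape
    seen = np.zeros((h, w), dtype=bool)
    seen[seed] = True
    queue = deque([seed])
    rows, cols = [seed[0]], [seed[1]]
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < h and 0 <= nc < w
            if inside and not seen[nr, nc] and heat[nr, nc] >= threshold:
                seen[nr, nc] = True
                queue.append((nr, nc))
                rows.append(nr)
                cols.append(nc)
    return [min(cols), min(rows), max(cols), max(rows)]
```

A typical use would be to take a peak from `local_peaks` as the seed and call `box_around` with the scaled global peak as the threshold.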

2. The object localization method according to claim 1, characterized in that, in step 1, the pyramid structure in the feature pyramid model mainly comprises two aspects: the first aspect is the bottom-up path of the feedforward computation, which computes a feature hierarchy composed of multi-scale feature maps with a scaling step of 2, and the output of the last layer of each stage is selected as the reference set of feature maps; for a ResNet network, the feature activations output by the residual blocks of each stage are used; for the different convolutional layers, the outputs of the residual blocks are {C₂,C₃,...,C_l}, with strides of {S₂,S₃,...,S_l} pixels respectively; the second aspect is the top-down path on the feature maps and the lateral connections between feature layers; high-level features are relatively coarse but carry stronger semantic information, and the top-down path and lateral connections enhance the feature maps for more accurate localization; the spatial resolution is up-sampled by a factor of 2, and the up-sampled information is then merged with the information of the current layer by element-wise addition; this process is iterated until the pyramid is constructed; the feature map set is {P₂,P₃,...,P_l}, corresponding to {C₂,C₃,...,C_l}, with each pair having the same size.
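The top-down merge described in this claim can be written compactly. The following is a sketch in assumed notation: L stands for the topmost level, Up_{×2} for the 2× spatial up-sampling, and φ_l for a lateral mapping that makes the channel counts of the two terms agree (a 1×1 convolution in the common feature pyramid design). None of these symbols appears in the claim text.

```latex
P_L = \varphi_L(C_L), \qquad
P_l = \mathrm{Up}_{\times 2}\left(P_{l+1}\right) + \varphi_l\left(C_l\right),
\quad l = L-1, \dots, 2
```

Under this reading each P_l has the same spatial size as C_l, which matches the statement that the two sets have identical sizes.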

3. The object localization method according to claim 1 or 2, characterized in that the gradient pyramid uses the fused gradient information to interpret features at different scales, with the specific steps as follows:

(1.1) taking a single-scale image of arbitrary size as input, compute the output of each level after the feedforward pass of the current image, {C₂,C₃,...,C_l}, where l indexes the different convolutional layers, and take the output of each level directly as the returned feature maps {F₂,F₃,...,F_l};

(1.2) the network computes the predicted output class c; the gradient of the score of each class c with respect to all feature layers is computed, i.e., the partial derivative ∂y^c/∂F_l^k of the output y^c with respect to the feature map F_l^k of convolutional layer l; global average pooling is applied to the partial derivatives to obtain the weights, where the pooling range for sub-block k of each feature map is indexed by i, j, giving:

For the different feature maps at each level, the dimensions are {m, n, k}, i.e., the length, width, and number of channels of a single feature; after passing through the ReLU activation layer,

the feature score of each layer of the current gradient pyramid is obtained;
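Step (1.2) follows the general shape of the Grad-CAM computation, which can be sketched for a single layer as below. This is an illustration under assumptions: the arrays use a (k, m, n) channel-first layout, the gradients are taken as given rather than computed by backpropagation, and the function name `grad_cam_map` is hypothetical.

```python
import numpy as np


def grad_cam_map(features, grads):
    """Class activation intensity for one layer.

    features: array of shape (k, m, n), the feature maps F_l^k of the layer.
    grads:    array of shape (k, m, n), the partial derivatives of the class
              score y^c with respect to F_l^k.
    """
    # Global average pooling of the gradients over spatial positions (i, j)
    # gives one importance weight per channel k.
    weights = grads.mean(axis=(1, 2))
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, features, axes=([0], [0]))
    # ReLU keeps only the positions that support class c.
    return np.maximum(cam, 0.0)
```

The result has shape (m, n), so each layer of the pyramid yields one intensity map at its own resolution.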

Two operations are performed on each gradient feature map: first, it is up-sampled by a factor of two so that its shape matches that of the gradient map of the next layer; then it is laterally connected with the gradient intensity map of the next layer, so that the shallow feature intensity is enhanced and merged with the deep feature intensity; the operation between layers is:

where the up-sampling function is implemented by image interpolation;

For the gradient feature map output at the bottommost layer, we obtain:

where L denotes the number of network layers;

The top-level feature maps carry greater weight than the low-level feature maps, because the semantic information of high-level feature maps is more concentrated and captures more visual structure; through the lateral connections between layers, the gradient intensity is enhanced step by step.
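The layer-by-layer fusion of the gradient pyramid can be sketched as follows. The sketch assumes nearest-neighbour interpolation for the 2× up-sampling, plain element-wise addition for the lateral connection, and intensity maps whose sizes halve exactly from one layer to the next; the function names `upsample2x` and `fuse_pyramid` are hypothetical.

```python
import numpy as np


def upsample2x(heat):
    """Nearest-neighbour 2x up-sampling of a 2-D map (an assumed interpolation)."""
    return np.repeat(np.repeat(heat, 2, axis=0), 2, axis=1)


def fuse_pyramid(maps):
    """Fuse per-layer intensity maps ordered from shallow (large) to deep (small).

    Each map is assumed to be exactly half the size of the previous one.
    Returns the fused map at the resolution of the shallowest layer.
    """
    fused = maps[-1]                        # start from the top (deepest) layer
    for current in reversed(maps[:-1]):     # walk down towards the bottom layer
        # Up-sample the running result, then add the current layer's map.
        fused = upsample2x(fused) + current
    return fused
```

Because the top-level map is added into every lower level, its contribution reaches every pixel of the final map, which is consistent with the statement that the top-level maps carry greater weight.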