CN104573744A - Fine granularity classification recognition method and object part location and feature extraction method thereof - Google Patents

Fine granularity classification recognition method and object part location and feature extraction method thereof Download PDF

Info

Publication number
CN104573744A
CN104573744A (application CN201510026025.3A)
Authority
CN
China
Prior art keywords
feature
expression
detector
detecting device
lambda
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510026025.3A
Other languages
Chinese (zh)
Other versions
CN104573744B (en)
Inventor
Xiong Hongkai (熊红凯)
Zhang Xiaopeng (张晓鹏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201510026025.3A priority Critical patent/CN104573744B/en
Publication of CN104573744A publication Critical patent/CN104573744A/en
Application granted granted Critical
Publication of CN104573744B publication Critical patent/CN104573744B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 - Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a fine-grained classification recognition method, together with the object part localization and feature extraction methods it relies on, which effectively address part localization and feature representation in fine-grained classification recognition. For object part localization, a series of part detectors is trained by supervised learning; considering the pose variation and deformation of the targets to be localized, the method detects only parts with small deformation, and different detectors are trained for the same object part by a pose clustering method, so that the pose variation of the objects is taken into account. For the feature representation of objects or parts, features are extracted at multiple scales and multiple positions and then fused into the final object representation, so that the features possess a degree of scale and translation invariance. At the same time, part localization and feature representation are complementary, so the accuracy of fine-grained classification recognition can be effectively improved.

Description

Fine-grained classification recognition and object part localization and feature extraction methods
Technical field
The present invention relates to a method in the technical field of image processing, and specifically to a fine-grained classification recognition method, together with the object part localization and feature extraction methods involved in this recognition problem.
Background art
The goal of the fine-grained classification problem is to distinguish the hundreds of subcategories within a single coarse category, for example different species of flowers, birds, or dogs. For a layperson these subcategories are very difficult to identify; the fine-grained classification problem was proposed precisely to solve this: the user only needs to supply a target object, and a fine-grained classification recognition method returns the subcategory of the target, from which a series of attributes of that subcategory can be obtained. Unlike generic category recognition (e.g. distinguishing cars from people), the differences between subcategories are small and highly localized, which makes separating them very difficult. The spatial pyramid model widely used in generic category recognition cannot capture such highly localized inter-subclass differences and therefore cannot reach satisfactory recognition results.
A search of the prior-art literature shows that the difficulty of the fine-grained classification problem lies mainly in two aspects: part localization and image description. For part localization, the widely adopted approach is "A discriminatively trained, multiscale, deformable part model", published by P. Felzenszwalb in 2010 in IEEE Transactions on Pattern Analysis and Machine Intelligence, i.e. the deformable part model and its variants. This model finds target objects or object parts by training template detectors and takes the geometric correlation between part models into account. However, it achieves good detection only for parts with small deformation; for highly deformable parts, such as a bird's wing, the part detection model performs poorly. For image description, the most common choice is "Distinctive image features from scale-invariant keypoints", published by D. G. Lowe in 2004 in the International Journal of Computer Vision, i.e. the scale-invariant feature. But this feature is only a combination of gradient statistics, independent of the concrete dataset, and lacks strong discriminative power. Other features, such as those of Krizhevsky in "ImageNet classification with deep convolutional neural networks", published at Neural Information Processing Systems in 2012, i.e. convolutional neural network features, are semantically rich and adapted to the data, but lack sufficient scale and translation invariance. When a detected object part deviates substantially from its true position, such features cannot overcome the translation.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a fine-grained classification recognition method and the object part localization and feature extraction methods it uses, improving the precision of part localization and the scale and translation invariance of the feature representation, and thereby the recognition accuracy on fine-grained classification problems.
The present invention is achieved by the following technical solutions:
According to a first aspect of the invention, a part localization method for objects is provided, namely a part localization method for fine-grained classification. The method uses an object detector and part detectors to detect the target object and its parts with small deformation; said detectors are learned with a pose-clustering-based supervised method, so that the pose variation of the object or parts is taken into account. The object detector and part detectors run independently, each returning its highest-scoring detection regions as candidates, and the final detection result is obtained by jointly rectifying the object and part detections.
Preferably, said detectors are learned with a pose-clustering-based supervised method, specifically: for the object and each part, positive samples are grouped into several mixture components according to pose;
Suppose each part $p_i$ is defined by a bounding box and the whole object by a bounding box $p_0$, where $(l, t, r, b)$ denote the left, top, right, and bottom coordinates of a box. The annotated parts parametrize the pose $\theta_I$ of sample $I$ by the vector:

$$\theta_I = (p'_1, p'_2, \ldots, p'_n)$$

$$p'_i = \left(\frac{p_i^l + p_i^r}{2w}, \frac{p_i^t + p_i^b}{2h}\right), \quad i = 1, 2, \ldots, n$$

where $w$ and $h$ are the width and height of the object $p_0$, and $n$ is the number of object parts. $p'_i$ is the normalized expression of $p_i$; this normalization considers only the relative positions of the parts and ignores scale differences between object parts. All positive samples are then clustered into $C$ components by k-means on their pose vectors.
Further, to resolve possible inconsistencies between the positions of the object parts and the object in the returned detections, the object detector and part detectors return the highest-scoring detection regions of each detector as candidates, specifically:
Let $X = \{x_0, x_1, \ldots, x_n\}$ denote the high-scoring detections of the object and its $n$ parts, and $\phi(X) = \{\phi(x_0), \phi(x_1), \ldots, \phi(x_n)\}$ the corresponding convolutional features. Given the trained detectors $\{w_0, w_1, \ldots, w_n\}$, the detections are updated by optimizing:

$$\arg\max_X \; \Psi(w_0^T \phi(x_0)) + \sum_{i=1}^{n} [\lambda_i]_\varepsilon \, \Psi(w_i^T \phi(x_i))$$

where

$$\Psi(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \qquad [\lambda_i]_\varepsilon = \begin{cases} \lambda_i, & \text{if } \lambda_i \ge \varepsilon \\ 0, & \text{if } \lambda_i < \varepsilon \end{cases}$$

$\Psi(\cdot)$ is a nonlinear function mapping detection scores to the range $[-1, 1]$, and $[\cdot]_\varepsilon$ is a loss function. The parameter $\lambda_i$ measures the overlap between part $i$ and the object, ranging over $[0, 1]$; the weighting term $[\lambda_i]_\varepsilon$ penalizes detected parts that are inconsistent with the object.
Part localization in the present invention targets only the parts of the object with small deformation; detector training accounts for pose variation, and the detected relations between the object and its parts are updated geometrically, yielding reliable localization accuracy for the object and its parts.
According to a second aspect of the invention, a feature extraction method is provided: invariant convolutional features are extracted on each object-part localization result, i.e. convolutional features are extracted at multiple scales and multiple views and then fused into the final feature representation, which is used for the final classification.
Further, the feature extraction method comprises the following steps:
Step 1: for an image at a given scale, extract the fifth-convolutional-layer feature map $f_{w \times h \times C}$, where $w \times h$ is the size of the convolved image and $C$ the number of channels of the feature map. The downsampling ratio of the input image to the fifth convolutional layer is 16, meaning that the stride of the fifth-layer feature map relative to the input image is 16.
Step 2: zero-pad the border of each channel of the feature map by two pixels on every side, obtaining the padded feature map $f'_{w' \times h' \times C}$. On each channel of $f'_{w' \times h' \times C}$, slide a window of the original size with stride 1 to select subgraphs $f_{w \times h \times C}$; this gives $5 \times 5$ subgraphs in total, one per top-left offset $(\Delta x, \Delta y) \in \{0, 1, 2, 3, 4\}^2$. Then pool each subgraph to a target output size of $n \times n$.
Step 3: feed the pooled subgraphs obtained in step 2 through the subsequent fully connected layers to compute the feature vectors.
Preferably, the above operations are carried out on 5 scales of the input image and its horizontally flipped version, finally yielding $25 \times 5 \times 2$ feature vectors in total. These feature vectors are pooled separately on each scale to obtain a single per-scale feature expression, and the features over the multiple scales are finally concatenated into the final expression of the image. This gives the feature a degree of scale and translation invariance.
According to a third aspect of the invention, a method for improving fine-grained image recognition is provided, comprising the following steps:
Step 1: for a test image, detect the target object and its parts with small deformation using an object detector and part detectors; the detectors are learned with a pose-clustering-based supervised method that takes the pose variation of the object or parts into account. Because object and part detection run independently, the geometric relationship between them is not considered; as an improvement, each detector returns several high-scoring detection regions as candidates, and the final detection result is obtained by jointly rectifying the object and part detections.
Step 2: for each object or part detected in step 1, extract convolutional features at multiple scales and multiple views, fuse these convolutional features into the final feature representation, and use this representation for the final classification. The invention can thereby improve fine-grained image recognition.
In summary, the method of the invention effectively addresses the part localization and feature representation problems of objects in fine-grained classification recognition: it improves part detection performance and gives the features a degree of scale and translation invariance. At the same time, part localization and feature representation are complementary, so the accuracy of fine-grained classification recognition can be effectively improved.
Compared with the prior art, the present invention has the following beneficial effects:
The above technical scheme effectively addresses the part localization and feature representation problems of objects in fine-grained classification recognition. Both the part localization and the feature representation of the invention exploit current convolutional neural networks, which have strong expressive power. The invention trains the target detectors with a strongly supervised learning method based on pose clustering and applies a geometric update to the final detections, yielding comparatively accurate part localization. Meanwhile, the invariant feature representation technique can overcome localization inaccuracy to a certain extent, giving the features a degree of scale and translation invariance. The combination of the two methods enables the invention to obtain good recognition performance on fine-grained classification problems.
Brief description of the drawings
Other features, objects, and advantages of the present invention will become more apparent by reading the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is the framework diagram of one embodiment of the invention;
Fig. 2 is the invariant feature extraction flowchart of one embodiment of the invention.
Detailed description of the embodiments
The present invention is described in detail below in conjunction with specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any form. It should be pointed out that those skilled in the art may make several variations and improvements without departing from the inventive concept; these all fall within the protection scope of the present invention.
As shown in Fig. 1, the framework of one embodiment of the invention comprises two parts: object part localization, and scale- and translation-invariant feature representation. Given a test image, an object detector and part detectors first detect the target object and its parts with small deformation; the detectors are learned with the pose-clustering-based supervised method and take the pose variation of the object or parts into account. Because object and part detection run independently, the geometric relationship between them is not considered; as an improvement, each detector returns its higher-scoring detection regions as candidates, and the final detection result is obtained by jointly rectifying the object and part detections. The feature extraction module then extracts convolutional features at multiple scales and multiple views for each detected object or part, fuses these convolutional features into the final feature representation, and uses this representation for the final classification.
As a preferred implementation, the object and part detection proceeds through the following steps:
Step 1:
Pose clustering: the method trains the detectors with strongly supervised learning; for the training samples, the bounding boxes of the whole object and of several object parts are all known. For the object and each part, positive samples are grouped into several mixture components according to pose. Suppose each part $p_i$ is defined by a bounding box (the whole object by the bounding box $p_0$), where $(l, t, r, b)$ denote the left, top, right, and bottom coordinates of a box. The annotated parts parametrize the pose $\theta_I$ of sample $I$:

$$\theta_I = (p'_1, p'_2, \ldots, p'_n)$$

$$p'_i = \left(\frac{p_i^l + p_i^r}{2w}, \frac{p_i^t + p_i^b}{2h}\right), \quad i = 1, 2, \ldots, n$$

where $w$ and $h$ are the width and height of the object $p_0$, and $n$ is the number of object parts. $p'_i$ is the normalized expression of $p_i$; the normalization considers only the relative positions of the parts and ignores scale differences between object parts. All positive samples are clustered into $C$ components by k-means on their pose vectors. This clustering accounts for the pose variation of the object, which is important for detector training.
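The pose parametrization and clustering step can be sketched as follows (a minimal illustration in Python with NumPy; the `(l, t, r, b)` box tuples and the deterministic k-means initialization are assumptions of this sketch, not specified by the patent):

```python
import numpy as np

def pose_vector(parts, obj):
    """Pose theta_I of one sample: the normalized part centers.

    parts: list of n part boxes (l, t, r, b); obj: object box (l, t, r, b).
    Each part center is divided by the object width w and height h, so only
    the relative positions of the parts matter, not the object scale.
    """
    w, h = obj[2] - obj[0], obj[3] - obj[1]
    return np.array([((l + r) / (2.0 * w), (t + b) / (2.0 * h))
                     for (l, t, r, b) in parts]).ravel()

def pose_clusters(pose_vecs, C, iters=20):
    """Plain k-means over pose vectors -> C mixture components.

    Uses a deterministic initialization (the first C vectors) for
    simplicity; any standard k-means initialization would do.
    Returns (labels, centers).
    """
    X = np.asarray(pose_vecs, dtype=float)
    centers = X[:C].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each sample to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each non-empty cluster center
        for c in range(C):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

The labels returned by `pose_clusters` would then select which of the `C` mixture detectors each positive sample contributes to.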
Step 2:
Convolutional network training and detector learning: the features for detector training are extracted from a convolutional network. To adapt the network to the concrete fine-grained dataset, the convolutional neural network is first fine-tuned. Because training samples are limited, a selective-search method first produces a series of subregion images; every subregion whose overlap with an original positive sample exceeds 0.5 is treated as a positive, and all other subregions as negatives, which yields the fine-tuned convolutional neural network. In the detector training process, only the original sample features are treated as positives, and the subregions whose overlap with the original samples is below 0.3 as negatives. For the object and each of its parts, a series of detectors $\{w_0, w_1, \ldots, w_n\}$ is obtained by independent training.
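The overlap-based labeling of subregions can be sketched as follows (a hedged illustration; `iou` is the standard intersection-over-union overlap on `(l, t, r, b)` boxes, and the function names are placeholders of this sketch):

```python
def iou(a, b):
    """Intersection-over-union overlap of two boxes (l, t, r, b)."""
    il, it = max(a[0], b[0]), max(a[1], b[1])
    ir, ib = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ir - il) * max(0, ib - it)
    area = lambda x: (x[2] - x[0]) * (x[3] - x[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def split_samples(proposals, gt, pos_thresh=0.5, neg_thresh=0.3):
    """Label selective-search proposals against one ground-truth box.

    For network fine-tuning: overlap > pos_thresh -> positive.
    For detector training: overlap < neg_thresh -> negative; only the
    original (ground-truth) sample itself is used as a positive there.
    """
    pos = [p for p in proposals if iou(p, gt) > pos_thresh]
    neg = [p for p in proposals if iou(p, gt) < neg_thresh]
    return pos, neg
```

Proposals whose overlap falls between the two thresholds would simply be ignored during detector training.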
At test time, the selective-search method likewise produces a series of candidate subregions from a test image. The feature of each candidate subregion $x$ is denoted $\phi(x)$, and its score under detector $w_i$ is $w_i^T \phi(x)$; the regions with the highest scores (for example 100; this number can be set as required) are selected as candidate detections.
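The candidate selection amounts to scoring every proposal with the linear detector and keeping the top k, which can be sketched as (an illustrative NumPy fragment; the feature matrix layout is an assumption of this sketch):

```python
import numpy as np

def top_candidates(features, w, k=100):
    """Score candidate regions with a linear detector and keep the top k.

    features: (m, d) array, one row phi(x) per candidate region;
    w: (d,) detector weights. Returns (indices of the k highest scores,
    all scores).
    """
    scores = features @ w          # w_i^T phi(x) for every candidate
    order = np.argsort(-scores)    # indices sorted by descending score
    return order[:k], scores
```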
Step 3:
Object and part detection update: because the detection of the object and of the object parts is carried out independently, the part positions in the returned detections may be inconsistent with the object. A geometric update method is used to resolve this. Let $X = \{x_0, x_1, \ldots, x_n\}$ denote the 100 highest-scoring detection regions (this number can be set as required) of the object and its $n$ parts, and $\phi(X) = \{\phi(x_0), \phi(x_1), \ldots, \phi(x_n)\}$ the corresponding convolutional features. Given the trained detectors $\{w_0, w_1, \ldots, w_n\}$, the detections are updated by optimizing:

$$\arg\max_X \; \Psi(w_0^T \phi(x_0)) + \sum_{i=1}^{n} [\lambda_i]_\varepsilon \, \Psi(w_i^T \phi(x_i))$$

where

$$\Psi(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \qquad [\lambda_i]_\varepsilon = \begin{cases} \lambda_i, & \text{if } \lambda_i \ge \varepsilon \\ 0, & \text{if } \lambda_i < \varepsilon \end{cases}$$

$\Psi(z)$ is a nonlinear function mapping detection scores to the range $[-1, 1]$, and $[\lambda_i]_\varepsilon$ is a loss function. The parameter $\lambda_i$ measures the overlap between a part and the object, ranging over $[0, 1]$; the weighting term $[\lambda_i]_\varepsilon$ (with $\varepsilon = 0.6$) penalizes detected parts that are inconsistent with the object.
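The geometric update can be sketched as an exhaustive search over the object candidates: given a fixed object candidate, each part's term in the objective is independent, so the best part candidate can be chosen greedily per part. This is a minimal NumPy illustration under assumed data layouts (precomputed score arrays and overlap matrices), not the patent's implementation:

```python
import numpy as np

def psi(z):
    """Maps a raw detection score into [-1, 1] (hyperbolic tangent)."""
    return np.tanh(z)

def clipped(lam, eps=0.6):
    """[lambda]_eps: the overlap weight, zeroed below the threshold eps."""
    return lam if lam >= eps else 0.0

def geometric_update(obj_scores, part_scores, overlaps, eps=0.6):
    """Maximize psi(s_obj) + sum_i [lambda_i]_eps * psi(s_i).

    obj_scores: (m0,) raw scores of object candidates;
    part_scores: list of n arrays, raw scores of each part's candidates;
    overlaps: list of n arrays of shape (m0, m_i) with the overlap
    lambda between object candidate j and part candidate k.
    Returns (best object index, list of best part indices).
    """
    best_val, best = -np.inf, None
    for j in range(len(obj_scores)):
        val = psi(obj_scores[j])
        picks = []
        for s, ov in zip(part_scores, overlaps):
            # each part term depends only on the object choice j,
            # so the per-part maximum is exact
            terms = np.array([clipped(ov[j, k], eps) * psi(s[k])
                              for k in range(len(s))])
            k_best = int(terms.argmax())
            val += terms[k_best]
            picks.append(k_best)
        if val > best_val:
            best_val, best = val, (j, picks)
    return best
```

A part candidate that overlaps the chosen object by less than `eps` contributes zero, matching the $[\lambda_i]_\varepsilon$ weighting.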
As shown in Fig. 2, the feature extraction part resizes the test image to different scales and, for each scale, extracts features mainly through the following steps:
Step 1: for the image at the given scale, extract the fifth-convolutional-layer feature map $f_{w \times h \times C}$, where $w \times h$ is the size of the convolved image and $C$ the number of channels of the feature map. The downsampling ratio of the input image to the fifth convolutional layer is 16, meaning that the stride of the fifth-layer feature map relative to the input image is 16.
Step 2: zero-pad the border of each channel of the feature map by two pixels on every side, obtaining the padded feature map $f'_{w' \times h' \times C}$. On each channel of $f'_{w' \times h' \times C}$, slide a window of the original size with stride 1 to select subgraphs $f_{w \times h \times C}$; this gives $5 \times 5$ subgraphs in total, one per top-left offset $(\Delta x, \Delta y) \in \{0, 1, 2, 3, 4\}^2$. Then pool each subgraph to a target output size of $n \times n$.
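The padding, shifted-window selection, and pooling of step 2 can be sketched as follows (a NumPy illustration; the adaptive max-pooling grid is one plausible choice of pooling, which the patent does not pin down):

```python
import numpy as np

def offset_subgraphs(fmap, pad=2):
    """25 shifted views of a conv5 feature map of shape (H, W, C).

    Zero-pads every border by `pad`, then slides a window of the original
    size with stride 1, giving (2*pad+1)^2 = 25 offsets (dx, dy) in {0..4}.
    """
    H, W, C = fmap.shape
    padded = np.zeros((H + 2 * pad, W + 2 * pad, C), fmap.dtype)
    padded[pad:pad + H, pad:pad + W] = fmap
    return [padded[dy:dy + H, dx:dx + W]
            for dy in range(2 * pad + 1) for dx in range(2 * pad + 1)]

def max_pool_to(fmap, n):
    """Adaptive max pooling of (H, W, C) down to an (n, n, C) grid."""
    H, W, C = fmap.shape
    out = np.empty((n, n, C), fmap.dtype)
    ys = np.linspace(0, H, n + 1).astype(int)
    xs = np.linspace(0, W, n + 1).astype(int)
    for i in range(n):
        for j in range(n):
            out[i, j] = fmap[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(0, 1))
    return out
```

The center offset `(2, 2)` reproduces the unshifted feature map, and each of the 25 pooled subgraphs would then be fed through the fully connected layers.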
Step 3: feed the pooled subgraphs obtained in step 2 through the subsequent fully connected layers to compute the feature vectors.
The above operations are carried out on 5 scales of the input image and its horizontally flipped version, therefore finally yielding $25 \times 5 \times 2$ feature vectors in total. These feature vectors are pooled separately on each scale to obtain a single per-scale feature expression, and the features over the multiple scales are finally concatenated into the final expression of the image.
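The per-scale pooling and cross-scale concatenation can be sketched as follows (average pooling within a scale is an assumption of this sketch; the patent only specifies that a pooling operation is applied per scale before concatenation):

```python
import numpy as np

def fuse_features(per_scale_vectors):
    """Fuse multi-scale feature vectors into one image representation.

    per_scale_vectors: list of 5 arrays, each of shape (50, d) — the
    25 offset subgraphs times 2 flips at one scale. Pools (here:
    averages) within each scale, then concatenates across scales,
    returning a (5*d,) vector.
    """
    return np.concatenate([v.mean(axis=0) for v in per_scale_vectors])
```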
Implementation results:
Experiments were carried out on the widely used fine-grained dataset CUB-200-2011. This dataset contains 200 different bird species, 11,788 images in total; even for humans these subcategories are very difficult to distinguish. As parts with small deformation, only the head and body were chosen as part detection targets. In pose clustering, each detection target was clustered into 3 mixture components, and the 5 feature-extraction scales were chosen as {227, 280, 340, 400, 454}. The final experimental criterion is classification accuracy.
Object/part localization results:
Localization accuracy is measured as the fraction of correct localizations, where a localization is deemed correct if the overlap between the detected target and the ground truth exceeds 0.5. For the object, head, and body parts, the method achieves localization accuracies of 96.36%, 75.22%, and 70.14% respectively.
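Using the standard intersection-over-union overlap, the correct-localization criterion can be computed as follows (a small self-contained sketch; the box format is an assumption):

```python
def iou(a, b):
    """Intersection-over-union overlap of two boxes (l, t, r, b)."""
    il, it = max(a[0], b[0]), max(a[1], b[1])
    ir, ib = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ir - il) * max(0, ib - it)
    ua = ((a[2] - a[0]) * (a[3] - a[1])
          + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / ua if ua > 0 else 0.0

def localization_accuracy(dets, gts, thresh=0.5):
    """Fraction of detections overlapping their ground truth by > thresh."""
    hits = sum(iou(d, g) > thresh for d, g in zip(dets, gts))
    return hits / len(dets)
```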
Classification results:
Benefiting from the high object/part localization accuracy, the part-based classification recognition method finally achieves a recognition rate of 77.51%, far above existing recognition accuracy under the same experimental conditions. The effectiveness of the method stems from accurate part localization and from the invariant feature representation; the two are also complementary, in that the invariance of the features compensates to some extent for localization inaccuracy, further improving the final image recognition accuracy.
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the above particular embodiments; those skilled in the art may make various variations or modifications within the scope of the claims, and this does not affect the substance of the invention.

Claims (7)

1. A part localization method for objects in fine-grained classification recognition, characterized in that the method uses an object detector and part detectors to detect a target object and its parts with small deformation, said detectors being learned with a pose-clustering-based supervised method that takes the pose variation of the object or parts into account; the object detector and part detectors run independently, each returning its highest-scoring detection regions as candidates, and the final detection result is obtained by jointly rectifying the object and part detections.
2. The part localization method for objects according to claim 1, characterized in that said detectors are learned with a pose-clustering-based supervised method, specifically: for the object and each part, positive samples are grouped into several mixture components according to pose;
suppose each part $p_i$ is defined by a bounding box and the whole object by a bounding box $p_0$, where $(l, t, r, b)$ denote the left, top, right, and bottom coordinates of a box; the annotated parts parametrize the pose $\theta_I$ of sample $I$ by the vector:

$$\theta_I = (p'_1, p'_2, \ldots, p'_n)$$

$$p'_i = \left(\frac{p_i^l + p_i^r}{2w}, \frac{p_i^t + p_i^b}{2h}\right), \quad i = 1, 2, \ldots, n$$

where $w$ and $h$ are the width and height of the object $p_0$, and $n$ is the number of object parts; $p'_i$ is the normalized expression of $p_i$, and this normalization considers only the relative positions of the parts while ignoring scale differences between object parts; all positive samples are clustered into $C$ components by k-means on their pose vectors.
3. The part localization method for objects according to claim 2, characterized in that, to resolve possible inconsistencies between the positions of object parts and the object in the returned detections, the object detector and part detectors return the highest-scoring detection regions of each detector as candidates, specifically:
let $X = \{x_0, x_1, \ldots, x_n\}$ denote the high-scoring detections of the object and its $n$ parts, and $\phi(X) = \{\phi(x_0), \phi(x_1), \ldots, \phi(x_n)\}$ the corresponding convolutional features; given the trained detectors $\{w_0, w_1, \ldots, w_n\}$, the detections are updated by optimizing:

$$\arg\max_X \; \Psi(w_0^T \phi(x_0)) + \sum_{i=1}^{n} [\lambda_i]_\varepsilon \, \Psi(w_i^T \phi(x_i))$$

where

$$\Psi(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \qquad [\lambda_i]_\varepsilon = \begin{cases} \lambda_i, & \text{if } \lambda_i \ge \varepsilon \\ 0, & \text{if } \lambda_i < \varepsilon \end{cases}$$

$\Psi(\cdot)$ is a nonlinear function mapping detection scores to the range $[-1, 1]$, and $[\cdot]_\varepsilon$ is a loss function; the parameter $\lambda_i$ measures the overlap between part $i$ and the object, ranging over $[0, 1]$; the weighting term $[\lambda_i]_\varepsilon$ penalizes detected parts that are inconsistent with the object.
4. A feature extraction method in fine-grained classification recognition, characterized in that invariant convolutional features are extracted on each object-part localization result, i.e. convolutional features are extracted at multiple scales and multiple views and then fused into the final feature representation, which is used for the final classification.
5. The feature extraction method according to claim 4, characterized by comprising the following steps:
step 1: for an image at a given scale, extract the fifth-convolutional-layer feature map $f_{w \times h \times C}$, where $w \times h$ is the size of the convolved image and $C$ the number of channels of the feature map; the downsampling ratio of the input image to the fifth convolutional layer is 16, meaning that the stride of the fifth-layer feature map relative to the input image is 16;
step 2: zero-pad the border of each channel of the feature map by two pixels on every side, obtaining the padded feature map $f'_{w' \times h' \times C}$; on each channel of $f'_{w' \times h' \times C}$, slide a window of the original size with stride 1 to select subgraphs $f_{w \times h \times C}$, giving $5 \times 5$ subgraphs in total, one per top-left offset $(\Delta x, \Delta y) \in \{0, 1, 2, 3, 4\}^2$; then pool each subgraph to a target output size of $n \times n$;
step 3: feed the pooled subgraphs obtained in step 2 through the subsequent fully connected layers to compute the feature vectors.
6. The feature extraction method according to claim 5, characterized in that the above operations are carried out on 5 scales of the input image and its horizontally flipped version, finally yielding $25 \times 5 \times 2$ feature vectors in total; these feature vectors are pooled separately on each scale to obtain a single per-scale feature expression, and the features over the multiple scales are finally concatenated into the final expression of the image.
7. A fine-grained classification recognition method adopting the method according to any one of the preceding claims, characterized by comprising two steps:
Step 1: for a test image, use an object detector and part detectors to detect the target object and those of its parts that deform relatively little; the detectors are learned with a supervised method based on pose clustering, so that pose variations of the object or part are taken into account; object detection and part detection are carried out independently, each detector returns its high-scoring detected regions as candidates, and the final detection result is obtained by jointly rectifying the object and part detections;
Step 2: for each object or part detected in Step 1, extract convolutional features at multiple scales and from multiple views; these convolutional features are fused to obtain the final feature representation, and this representation is used for the final classification.
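The claim does not spell out how the object and part detections are jointly rectified; one plausible geometric reading is that part boxes must lie inside the chosen object box. The sketch below scores each candidate object box by its own detector score plus the best contained candidate for every part; the function `refine` and the (x1, y1, x2, y2) box format are assumptions made for illustration.

```python
def contains(obj, part):
    """True if box `part` lies entirely inside box `obj` (x1, y1, x2, y2)."""
    return (obj[0] <= part[0] and obj[1] <= part[1]
            and part[2] <= obj[2] and part[3] <= obj[3])

def refine(obj_cands, part_cands):
    """obj_cands: list of (box, score) from the object detector.
    part_cands: dict part_name -> list of (box, score) from part detectors.
    For each object candidate, pick the highest-scoring candidate of every
    part that lies inside it; return (total_score, object_box, part_boxes)
    for the best-scoring consistent configuration, or None if none exists."""
    best = None
    for obox, oscore in obj_cands:
        picked, total = {}, oscore
        for name, cands in part_cands.items():
            inside = [(b, s) for b, s in cands if contains(obox, b)]
            if not inside:
                break                      # some part has no box in this object
            b, s = max(inside, key=lambda t: t[1])
            picked[name] = b
            total += s
        else:                              # all parts placed inside obox
            if best is None or total > best[0]:
                best = (total, obox, picked)
    return best
```

Note that the jointly best configuration need not use the top-scoring object box in isolation: a lower-scoring object candidate can win if it encloses much better part candidates.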
CN201510026025.3A 2015-01-19 2015-01-19 Fine granularity classification recognition method and object part location and feature extraction method thereof Active CN104573744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510026025.3A CN104573744B (en) 2015-01-19 2015-01-19 Fine granularity classification recognition method and object part location and feature extraction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510026025.3A CN104573744B (en) 2015-01-19 2015-01-19 Fine granularity classification recognition method and object part location and feature extraction method thereof

Publications (2)

Publication Number Publication Date
CN104573744A true CN104573744A (en) 2015-04-29
CN104573744B CN104573744B (en) 2018-07-20

Family

ID=53089763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510026025.3A Active CN104573744B (en) Fine granularity classification recognition method and object part location and feature extraction method thereof

Country Status (1)

Country Link
CN (1) CN104573744B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016183766A1 (en) * 2015-05-18 2016-11-24 Xiaogang Wang Method and apparatus for generating predictive models
CN106780479A (en) * 2016-12-31 2017-05-31 天津大学 High-precision image blur detection method based on deep learning
WO2017215622A1 (en) * 2016-06-15 2017-12-21 北京市商汤科技开发有限公司 Object segmentation method and apparatus and computing device
CN107766890A (en) * 2017-10-31 2018-03-06 天津大学 Improved method for discriminative patch learning in fine-grained recognition
CN108509939A (en) * 2018-04-18 2018-09-07 北京大学深圳研究生院 Bird recognition method based on deep learning
CN108764247A (en) * 2018-04-13 2018-11-06 中国科学院自动化研究所 Dense connection-based deep learning object detection method and device
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Fine-grained image classification method based on a detection and recognition network architecture
CN111091150A (en) * 2019-12-12 2020-05-01 哈尔滨市科佳通用机电股份有限公司 Railway wagon cross rod cover plate fracture detection method
CN111177305A (en) * 2019-12-30 2020-05-19 广州骏伯网络科技有限公司 Method, system, device and storage medium for time segment selection
CN112329603A (en) * 2020-11-03 2021-02-05 西南科技大学 Dam face crack defect positioning method based on image cascade
CN112613371A (en) * 2020-12-16 2021-04-06 上海大学 Hyperspectral image road extraction method based on dense connection convolution neural network
CN113255568A (en) * 2021-06-15 2021-08-13 湖南星汉数智科技有限公司 Bill image classification method and device, computer equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A. KRIZHEVSKY ET AL.: "ImageNet Classification with Deep Convolutional Neural Networks", International Conference on Neural Information Processing Systems *
STEVE BRANSON ET AL.: "Bird Species Categorization Using Pose Normalized Deep Convolutional Nets", ePrint arXiv *
ZHANG YING: "Research and Application of Vehicle Type Recognition Technology", China Masters' Theses Full-text Database, Information Science and Technology *
WANG WEN: "Research and Design of Fine-grained Expression Classification", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107624189A (en) * 2015-05-18 2018-01-23 北京市商汤科技开发有限公司 Method and apparatus for generating a predictive model
CN107624189B (en) * 2015-05-18 2020-11-20 北京市商汤科技开发有限公司 Method and apparatus for generating a predictive model
WO2016183766A1 (en) * 2015-05-18 2016-11-24 Xiaogang Wang Method and apparatus for generating predictive models
US10489913B2 (en) 2016-06-15 2019-11-26 Beijing Sensetime Technology Development Co., Ltd. Methods and apparatuses, and computing devices for segmenting object
WO2017215622A1 (en) * 2016-06-15 2017-12-21 北京市商汤科技开发有限公司 Object segmentation method and apparatus and computing device
CN106780479A (en) * 2016-12-31 2017-05-31 天津大学 High-precision image blur detection method based on deep learning
CN107766890B (en) * 2017-10-31 2021-09-14 天津大学 Improved method for discriminant graph block learning in fine-grained identification
CN107766890A (en) * 2017-10-31 2018-03-06 天津大学 Improved method for discriminative patch learning in fine-grained recognition
CN108764247A (en) * 2018-04-13 2018-11-06 中国科学院自动化研究所 Dense connection-based deep learning object detection method and device
CN108764247B (en) * 2018-04-13 2020-11-10 中国科学院自动化研究所 Dense connection-based deep learning object detection method and device
CN108509939A (en) * 2018-04-18 2018-09-07 北京大学深圳研究生院 Bird recognition method based on deep learning
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Fine-grained image classification method based on a detection and recognition network architecture
CN111091150A (en) * 2019-12-12 2020-05-01 哈尔滨市科佳通用机电股份有限公司 Railway wagon cross rod cover plate fracture detection method
CN111177305A (en) * 2019-12-30 2020-05-19 广州骏伯网络科技有限公司 Method, system, device and storage medium for time segment selection
CN111177305B (en) * 2019-12-30 2021-01-05 广州骏伯网络科技有限公司 Method, system, device and storage medium for time segment selection
CN112329603A (en) * 2020-11-03 2021-02-05 西南科技大学 Dam face crack defect positioning method based on image cascade
CN112613371A (en) * 2020-12-16 2021-04-06 上海大学 Hyperspectral image road extraction method based on dense connection convolution neural network
CN113255568A (en) * 2021-06-15 2021-08-13 湖南星汉数智科技有限公司 Bill image classification method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN104573744B (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN104573744A (en) Fine granularity classification recognition method and object part location and feature extraction method thereof
Zeng et al. Wsod2: Learning bottom-up and top-down objectness distillation for weakly-supervised object detection
Grinias et al. MRF-based segmentation and unsupervised classification for building and road detection in peri-urban areas of high-resolution satellite images
Tao et al. Airport detection from large IKONOS images using clustered SIFT keypoints and region information
Sirmacek et al. A probabilistic framework to detect buildings in aerial and satellite images
Tao et al. Unsupervised detection of built-up areas from multiple high-resolution remote sensing images
WO2018005413A1 (en) Method and system for cell annotation with adaptive incremental learning
Kovács et al. Improved harris feature point set for orientation-sensitive urban-area detection in aerial images
CN102208038A (en) Image classification method based on visual dictionary
Zhang et al. Road recognition from remote sensing imagery using incremental learning
CN105069811A (en) Multi-temporal remote sensing image change detection method
CN103310195A (en) LLC-feature-based weak-supervision recognition method for vehicle high-resolution remote sensing images
Yuan et al. Learning to count buildings in diverse aerial scenes
CN113705570B (en) Deep learning-based few-sample target detection method
Lian et al. Weakly supervised road segmentation in high-resolution remote sensing images using point annotations
Löw et al. Per-field crop classification in irrigated agricultural regions in middle Asia using random forest and support vector machine ensemble
CN104751475A (en) Feature point optimization matching method for static image object recognition
Jing et al. Unsupervised oil tank detection by shape-guide saliency model
CN114998748A (en) Remote sensing image target fine identification method, electronic equipment and storage medium
CN116740758A (en) Bird image recognition method and system for preventing misjudgment
Chandra et al. Building detection methods from remotely sensed images
CN115588178A (en) Method for automatically extracting high-precision map elements
CN114927236A (en) Detection method and system for multiple target images
Bai et al. Semantic segmentation of sparse irregular point clouds for leaf/wood discrimination
Tsai et al. Detection of roadway sign condition changes using multi-scale sign image matching (M-SIM)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant