CN104331717A - Image classification method integrating feature dictionary structure and visual feature coding


Info

Publication number
CN104331717A
CN104331717A (application CN201410693888.1A; granted as CN104331717B)
Authority
CN
China
Prior art keywords
visual feature
image
visual word
coding
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410693888.1A
Other languages
Chinese (zh)
Other versions
CN104331717B (en)
Inventor
杨育彬
朱启海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201410693888.1A
Publication of CN104331717A
Application granted
Publication of CN104331717B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means
    • G06V30/192: Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194: References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image classification method that integrates feature dictionary structure with visual feature coding. The method comprises the steps of visual feature extraction, feature dictionary learning, visual feature coding, spatial pooling of feature codes, and training and classification. By integrating the structural information of the feature dictionary into the visual feature coding process, the method obtains a more precise image feature representation with stronger discriminative power, so that images are classified more effectively and more accurately, giving the method high practical value.

Description

Image classification method integrating feature dictionary structure and visual feature coding
Technical field
The present invention relates to the field of image classification, and in particular to an image classification method, based on the Bag-of-Words (BoW) model, that integrates feature dictionary structure and visual feature coding.
Background art
With the rapid development of information technology, every field produces various types of data at an astonishing rate every day, including text, images, video, and music. Among this wealth of data, images are favored for their vividness, rich content, high information density, and ease of storage and transmission, and have become one of the most important information carriers of the 21st century. In particular, with the growing ubiquity of camera-equipped mobile devices such as phones and tablets, and the rise of social networks, people acquire images in ever more ways, driving a sharp increase in image data; as a result, finding the desired images quickly and accurately and managing them efficiently has become increasingly difficult. People urgently wish that computers could help analyze the semantics contained in the vast number of images on the Internet and fully understand the content they express, so that images can be managed, annotated, and retrieved more effectively.
Image classification, as one of the most fundamental technologies for computer understanding of images, has received extensive study in academia and in industrial research institutions, features as an important theme in authoritative journals and major academic conferences at home and abroad, and is a crucial research topic in computer vision. Image classification is the process of intelligently assigning an image to one of a set of predefined categories according to some classification criterion; it encompasses object recognition, scene semantic classification, activity recognition, and more, and has become an important means of researching image semantic understanding. Researchers have gradually recognized the importance of these problems and analyzed them in depth. In recent years, the Bag-of-Words model has brought new inspiration to high-level semantic representation of images, and image classification built on it as a key technology has achieved notable results, yet many research questions remain open and there is still large room for breakthroughs. Research on image classification methods based on the Bag-of-Words model has become a frontier focus across artificial intelligence, computer vision, machine learning, data mining, and related fields, and plays an important role in advancing the informatization of society. While this field has created irreplaceable social value, many key technical problems remain unsolved and many functions still need refinement; therefore, studying how to use the Bag-of-Words model to understand and describe high-level image semantics more effectively and to realize image classification more flexibly is of profound significance.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is, in view of the deficiencies of the prior art, to provide an image classification method that integrates feature dictionary structure and visual feature coding, using the distribution information of the visual words in the feature dictionary to assist visual feature coding, so that the coding result is more discriminative and the accuracy of image classification is improved.
To solve the above technical problem, the invention discloses an image classification method integrating feature dictionary structure and visual feature coding, comprising the following steps:
Step 1, extract the visual features of the images: perform local sampling on each image to obtain a group of region blocks, extract a visual feature from each region block, and obtain the visual feature set corresponding to each image; the union of the visual feature sets of all images is called the visual feature collection of all images, denoted as set X;
Step 2, feature dictionary learning: with set X as input, use a feature dictionary learning method to obtain a feature dictionary composed of a group of representative visual words;
Step 3, visual feature coding: express each visual feature of each image as a linear combination of visual words, with one coefficient per visual word; this group of coefficients is called the coding of the visual feature;
Step 4, spatial pooling of visual feature codes: with the codings of all visual features of each image as input, use a statistical method to express each image as a vector; this vector is the image feature representation of the corresponding image;
Step 5, take the image feature representation of each image obtained in step 4 as input, use a classification model for training and classification, and obtain the classification result.
Step 1 specifically comprises the following steps:
Perform local sampling on each image I by dense sampling with a fixed step, obtaining a number of region blocks of identical size, and extract one visual feature from each region block using a visual feature extraction method; the extracted feature represents that local block. Visual feature extraction methods include the Histogram of Oriented Gradients (HOG), the Scale-Invariant Feature Transform (SIFT), and so on. This yields the visual feature set LFS_I of image I, and finally the overall visual feature collection of all images, X = [x_1, x_2, ..., x_N] ∈ R^(d×N), where d is the dimension of a visual feature, determined by the extraction technique, N is the total number of visual features of all images, and x_i is the i-th visual feature, i taking values 1 ~ N.
Step 2 specifically comprises the following steps:
With set X as input, use a feature dictionary learning method to obtain a feature dictionary composed of a group of representative visual words, denoted B = [b_1, b_2, ..., b_M] ∈ R^(d×M), where M is the number of visual words and b_j is a column vector of dimension d representing the j-th visual word, j taking values 1 ~ M. Common feature dictionary learning methods include k-means, K-SVD, and so on.
Step 3 specifically comprises the following steps:
This step codes each visual feature in set X one by one. For a visual feature x_i, the coding process is as follows:
First, select from the feature dictionary B the p nearest-neighbor visual words of x_i, i.e., the p visual words with the smallest distance to x_i; denote the feature dictionary composed of these p visual words as B_i, with p taking a value in 1 ~ M and i in 1 ~ N.
Second, obtain the matrix D_i of distances between the visual words of B_i, and compute the column vector d_i of distances from x_i to each visual word of B_i, i in 1 ~ N. The element in row m, column s of D_i is the distance between the corresponding visual words of B_i, m, s = 1, 2, ..., p; the n-th component d_in of d_i is the distance between x_i and the n-th visual word of B_i, n = 1, 2, ..., p. The distance vector is computed as

    d_i = exp( (dist(x_i, B_i) − max(dist(x_i, B_i))) / σ ),

where σ > 0 is a smoothing parameter controlling how fast the weights decay, dist(x_i, B_i) = [dist(x_i, b_i1), dist(x_i, b_i2), ..., dist(x_i, b_ip)]^T, b_il is the l-th visual word of B_i, l = 1, 2, ..., p, each component dist(x_i, b_il) is the distance between x_i and b_il, and max(dist(x_i, B_i)) is the largest component of dist(x_i, B_i), whose subtraction makes the components of d_i fall in (0, 1]. The same strategy is used when computing the distances between a visual word and the other visual words. To speed up the computation of D_i, the matrix D of distances between all visual words of B is computed once; each D_i is then a submatrix of D and can be obtained by directly indexing D, i = 1, 2, ..., N.
Third, with x_i, d_i, D_i, B_i and the two parameters λ and β as input, λ, β ≥ 0, minimize

    ||x_i − B_i z_i^p||_2^2 + λ ||d_i ⊗ z_i^p||_2^2 + β (z_i^p)^T D_i z_i^p,   subject to: 1^T z_i^p = 1,

to obtain the coding z_i^p of x_i on B_i, where ⊗ denotes the element-wise product, i.e., corresponding components of two vectors are multiplied to give a new vector. Solving this yields the coding result z_i^p of x_i on these p visual words.
Finally, sort the components of the coding z_i^p, and keep the k largest coding coefficients together with the feature dictionary formed by the corresponding k visual words, k = 1, 2, ..., p; the coding z_i of visual feature x_i is then the M-dimensional vector whose components corresponding to those k visual words take the retained coefficients, with all remaining components set to 0.
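Because the objective is quadratic in z_i^p and the constraint is linear, the minimizer has an analytic form; the following derivation sketch (consistent with the computation steps listed in the embodiment below) makes this explicit. Since 1^T z_i^p = 1, one may write x_i − B_i z_i^p = (x_i 1^T − B_i) z_i^p, so the objective equals

    (z_i^p)^T Θ z_i^p,   with Θ = Ψ + λ·diag^2(d_i) + β·D_i,   Ψ = (x_i 1^T − B_i)^T (x_i 1^T − B_i).

Setting the gradient of the Lagrangian (z_i^p)^T Θ z_i^p − 2μ(1^T z_i^p − 1) to zero gives Θ z_i^p = μ·1, hence

    z_i^p = Θ^{-1} 1 / (1^T Θ^{-1} 1),

which is exactly what the coding procedure of the embodiment computes.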
Step 4 specifically comprises the following steps:
To take account of the spatial statistical information of the visual features within each image, a three-level Spatial Pyramid Matching (SPM) model is used with the codings of all visual features of an image I as input, combined with the max-pooling technique; the spatial pyramid then outputs a vector of dimension (2^0 + 2^2 + 2^4) × M, and this vector is the image feature representation of I.
Step 5 specifically comprises the following steps:
After the image feature representation of each image has been obtained, the representations can be used for training and classification. The set formed by the image feature representations of all images is divided into a training set and a test set; the training set is used to train a classification model, and the test set is classified with the trained model. A Support Vector Machine (SVM) is usually selected as the classifier model.
The present invention is directed at visual feature coding methods in the field of image classification and has the following features: 1) when coding a visual feature, the invention considers not only the relation between the visual feature and the visual words, but also the influence of the relations among the visual words on the coding; 2) the visual feature coding obtained by the invention is an analytic solution and requires no iterative optimization, so the coding method of the invention is fast.
Beneficial effects: the invention fully exploits the structural information given by the distribution of the visual words in the feature dictionary and uses this information in visual feature coding, so that the coding better reflects the distribution of the visual words in the dictionary. The image feature representation therefore has strong discriminative power, which improves the accuracy of image classification.
Brief description of the drawings
The present invention is further described below in conjunction with the drawings and specific embodiments, whereby the above and/or other advantages of the invention will become clearer.
Fig. 1 is the flow chart of the invention.
Fig. 2 is a schematic diagram of visual feature extraction.
Fig. 3 is the flow chart of the coding of one visual feature.
Fig. 4 is a schematic diagram of the three-level spatial pyramid structure.
Embodiment
As shown in Fig. 1, the invention discloses an image classification method integrating feature dictionary structure and visual feature coding, comprising the following steps:
Step 1, extract the visual features of the images: perform local sampling on each image to obtain a group of region blocks, extract a visual feature from each region block, and obtain the visual feature set corresponding to each image; the union of the visual feature sets of all images is denoted as set X;
Step 2, feature dictionary learning: with set X as input, use a feature dictionary learning method to obtain a feature dictionary composed of a group of representative visual words;
Step 3, visual feature coding: express each visual feature of each image as a linear combination of visual words, with one coefficient per visual word, obtaining the visual feature code set;
Step 4, spatial pooling of visual feature codes: with the codings of all visual features of each image as input, use a statistical method to express each image as a vector; this vector is the image feature representation of the corresponding image;
Step 5, take the image feature representation of each image obtained in step 4 as input, use a classification model for training and classification, and obtain the classification result.
1. Step 1 comprises the following steps:
As shown in Fig. 2, for an image I, a number of region blocks of equal size are usually extracted from I by dense sampling with a fixed step, and one visual feature, a d-dimensional vector, is extracted from each region block. Common visual feature extraction methods include the Histogram of Oriented Gradients (HOG), the Scale-Invariant Feature Transform (SIFT), and so on. This finally yields the overall visual feature collection of all images, X = [x_1, x_2, ..., x_N] ∈ R^(d×N), where d is the dimension of a visual feature, N is the total number of visual features of all images, and x_i is the i-th visual feature, i taking values 1 ~ N. X is used as the input to step 2 to learn the feature dictionary.
2. Step 2 comprises the following steps:
In this step, with set X as input, a feature dictionary learning method is used to obtain the feature dictionary B = [b_1, b_2, ..., b_M] ∈ R^(d×M) formed by M d-dimensional visual words, where M is the number of visual words and b_j is a column vector of dimension d representing the j-th visual word, j taking values 1 ~ M. Taking the k-means method as an example, k-means is used to cluster set X into M classes, and each class center is one visual word.
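As a sketch of this dictionary-learning step, the snippet below clusters the feature collection with scikit-learn (the use of MiniBatchKMeans instead of plain k-means, and all function and variable names, are illustrative choices rather than part of the invention):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def learn_dictionary(X, M=1024, seed=0):
        """Cluster the feature collection X (N x d, one feature per row)
        into M clusters; each cluster center is one visual word."""
        km = MiniBatchKMeans(n_clusters=M, random_state=seed, batch_size=1024)
        km.fit(X)
        return km.cluster_centers_  # dictionary B, shape (M, d)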
3. Step 3 comprises the following steps:
This step codes each visual feature in set X one by one.
The flow chart shown in Fig. 3 describes the coding process of one visual feature. For a visual feature x_i, choose the p nearest-neighbor visual words of x_i from the feature dictionary B obtained in step 2, i.e., the p visual words with the smallest distance to x_i, p taking a value in 1 ~ M; denote the feature dictionary composed of these p visual words as B_i, i in 1 ~ N. Obtain the matrix D_i of distances between the visual words of B_i, whose element in row m, column s is the distance between the corresponding visual words of B_i, m, s = 1, 2, ..., p; then compute the column vector d_i of distances from x_i to each visual word of B_i, whose n-th component d_in is the distance between x_i and the n-th visual word of B_i, n = 1, 2, ..., p. With x_i, d_i, D_i, B_i and the two parameters λ and β as input, λ, β ≥ 0, minimize

    ||x_i − B_i z_i^p||_2^2 + λ ||d_i ⊗ z_i^p||_2^2 + β (z_i^p)^T D_i z_i^p,   subject to: 1^T z_i^p = 1,

to obtain the coding z_i^p of x_i on B_i, where ⊗ denotes the element-wise product, i.e., corresponding components of two vectors are multiplied to give a new vector. Solving this yields the coding result z_i^p of x_i on these p visual words. Finally, sort the components of the coding z_i^p and keep the k largest coding coefficients together with the corresponding k visual words, k = 1, 2, ..., p; the coding z_i of x_i is then the M-dimensional vector whose components corresponding to those k visual words take the retained coefficients, with all remaining components set to 0.
The specific coding method of visual feature x_i on B is as follows:
Input: image visual feature x_i; feature dictionary B = [b_1, b_2, ..., b_M] ∈ R^(d×M), where M is the number of visual words in B and also the dimension of the coding of x_i on B; the number p of nearest-neighbor words of x_i; parameters k, λ and β.
Coding process:
1) Compute the M-dimensional vector d'_i formed by the distances from visual feature x_i to all visual words;
2) Sort the components of d'_i in ascending order, and select the set B_i formed by the p visual words with the smallest distances, together with the corresponding distance vector d_i;
3) Obtain the matrix D_i of distances between the visual words of B_i;
4) Obtain the coding z_i^p according to the following formulas:

    Ψ = (x_i 1^T − B_i)^T (x_i 1^T − B_i)
    Θ = Ψ + λ·diag^2(d_i) + β·D_i
    z̃_i^p = Θ^{-1} 1
    z_i^p = (1^T z̃_i^p)^{-1} z̃_i^p

where diag(d_i) denotes the diagonal matrix whose diagonal vector is d_i, and 1 denotes the column vector whose components are all 1 (any nonzero scalar factor on z̃_i^p is irrelevant here, since the final step normalizes the coding so that 1^T z_i^p = 1);
5) Sort the components of z_i^p in descending order, and keep the k largest coding coefficients together with the feature dictionary formed by the corresponding k visual words; the coding z_i of x_i is then the M-dimensional vector whose components corresponding to those k visual words take the retained coefficients, with all other components set to 0. Normalize z_i with the formula z_i = (1^T z_i)^{-1} z_i;
Output: the coding z_i of visual feature x_i.
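A runnable NumPy sketch of this coding procedure is given below. The Euclidean distance and the exponential weighting of d_i and D_i follow the reconstruction in step 3 above; all names are illustrative, and the weighting details are assumptions where the original formula images are missing:

    import numpy as np

    def encode_feature(x, B, p=10, k=5, lam=1e-4, beta=1e-4, sigma=1.0):
        """Code one visual feature x (d,) against dictionary B (M, d);
        returns the M-dimensional coding z with at most k nonzeros."""
        M = B.shape[0]
        # 1) distances from x to all M visual words
        dists = np.linalg.norm(B - x, axis=1)
        # 2) p nearest words; weighted distance vector d_i with values in (0, 1]
        nn = np.argsort(dists)[:p]
        Bi = B[nn]
        di = np.exp((dists[nn] - dists[nn].max()) / sigma)
        # 3) pairwise distances between the selected words, same weighting
        Dd = np.linalg.norm(Bi[:, None, :] - Bi[None, :, :], axis=2)
        Di = np.exp((Dd - Dd.max()) / sigma)
        # 4) analytic solution of the constrained quadratic problem
        R = Bi - x                                    # row m is b_im - x_i
        Psi = R @ R.T                                 # (x 1^T - B_i)^T (x 1^T - B_i)
        Theta = Psi + lam * np.diag(di ** 2) + beta * Di
        z_tilde = np.linalg.solve(Theta, np.ones(p))  # Theta^{-1} 1
        zp = z_tilde / z_tilde.sum()                  # enforce 1^T z = 1
        # 5) keep the k largest coefficients and renormalize
        keep = np.argsort(zp)[::-1][:k]
        z = np.zeros(M)
        z[nn[keep]] = zp[keep]
        return z / z.sum()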
4. Step 4 comprises the following steps:
Fig. 4 shows the three-level spatial pyramid matching model. After all visual feature codings of an image have been obtained, the Spatial Pyramid Matching (SPM) model is adopted in combination with the max-pooling spatial aggregation technique; with all visual feature codings of the image as input, a vector is obtained, and this vector is the image feature representation of the image. The concrete operation is: taking the image center as origin, the image is recursively divided into subregions at different scales; for example, the three-level spatial pyramid matching model of Fig. 4 gives 2^0 + 2^2 + 2^4 = 21 subregions in total. For a region a, a taking values 1 ~ 21, max pooling is used to obtain the coding z'_a of the region: suppose the region contains t visual features with codings z_a1, ..., z_at, where z_ah is the coding of the h-th visual feature in the region, h taking values 1 ~ t; then z'_a is a column vector of the same dimension M as the z_ah, whose q-th component is the maximum of the q-th row of the matrix [z_a1, ..., z_at], i.e., z'_aq = max_h z_ahq, q taking values 1 ~ M. z'_a is further normalized, for example with the 2-norm: z'_a = z'_a / ||z'_a||_2. Finally the codings of all subregions are concatenated in order to obtain the image feature representation of the image.
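The snippet below sketches this three-level max pooling (the 1 × 1, 2 × 2, 4 × 4 grid layout and the per-cell 2-norm normalization follow the description; the cell-assignment convention and all names are illustrative):

    import numpy as np

    def spm_max_pool(codes, xy, width, height, levels=(1, 2, 4)):
        """Pool the per-feature codes (t, M) of one image into a (21*M,) vector.
        codes: one coding per row; xy: (t, 2) patch centers of those features."""
        parts = []
        for g in levels:                        # 1x1, 2x2, 4x4 grids -> 21 cells
            cx = np.minimum((xy[:, 0] * g / width).astype(int), g - 1)
            cy = np.minimum((xy[:, 1] * g / height).astype(int), g - 1)
            cell = cy * g + cx                  # subregion index of every feature
            for c in range(g * g):
                sel = codes[cell == c]
                z = sel.max(axis=0) if len(sel) else np.zeros(codes.shape[1])
                n = np.linalg.norm(z)
                parts.append(z / n if n > 0 else z)   # 2-norm normalize each cell
        return np.concatenate(parts)            # dimension (1 + 4 + 16) * M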
5. Step 5 comprises the following steps:
After the image feature representations of all images have been obtained, the representations of the training-set images are used to train an SVM classification model, and the trained SVM model is then used to classify the representations of the test-set images.
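A minimal training and classification sketch with scikit-learn follows (the patent only specifies an SVM; the linear kernel and the names used here are assumptions):

    from sklearn.svm import LinearSVC

    def train_and_classify(train_feats, train_labels, test_feats):
        """Train a linear SVM on the training representations
        and predict labels for the test representations."""
        clf = LinearSVC(C=1.0)   # C: regularization strength, tune as needed
        clf.fit(train_feats, train_labels)
        return clf.predict(test_feats)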
Embodiment 1
The present embodiment comprises the following parts:
1. First, each image is scaled down to a size of no more than 300 × 300 pixels and converted to a gray-scale image; a dense sampling strategy is then used to extract 16 × 16 pixel image blocks from the image, sampling once every 6 pixels, and one SIFT feature is extracted from each image block. An image may therefore contain hundreds or thousands of features, depending on the block size and step size used during feature extraction (a code sketch of this extraction step is given after this list).
2. First, k-means is used to cluster the visual features of all images into M clusters; each cluster center represents one visual word. The nearest-neighbor visual word number p, the number k of retained nearest-neighbor visual words, the distance smoothing parameter σ, and the regularization parameters λ and β are set, and every visual feature is coded.
3. The spatial pyramid matching model and the max-pooling technique are used to aggregate all visual feature codings of each image into a single vector, which serves as the image feature representation of that image, and a support vector machine model is used for training and classification.
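The extraction step of part 1 can be sketched with OpenCV as follows (cv2.SIFT_create requires OpenCV ≥ 4.4; interpreting "no more than 300 × 300" as capping the longer side at 300 pixels is an assumption, as are all names):

    import cv2
    import numpy as np

    def extract_dense_sift(path, max_side=300, patch=16, step=6):
        """Resize, convert to gray, and compute one 128-d SIFT descriptor
        per 16x16 patch on a grid with a 6-pixel step."""
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        h, w = img.shape
        scale = min(1.0, max_side / max(h, w))    # shrink so longer side <= 300
        if scale < 1.0:
            img = cv2.resize(img, (int(w * scale), int(h * scale)))
        h, w = img.shape
        # one keypoint every `step` pixels; the KeyPoint size sets the patch scale
        kps = [cv2.KeyPoint(float(x), float(y), float(patch))
               for y in range(patch // 2, h - patch // 2, step)
               for x in range(patch // 2, w - patch // 2, step)]
        _, desc = cv2.SIFT_create().compute(img, kps)
        xy = np.array([kp.pt for kp in kps])      # patch centers, for SPM pooling
        return desc, xy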
Embodiment 2
Visual features of dimension 128 are extracted from the images, and the size of the feature dictionary, i.e., the number of visual words, is set to 1024. p and k are set to 10 and 5 respectively. The other parameters are set as λ = 10^-4 and β = 10^-4. Three-level spatial pyramid matching and the max-pooling technique are used, yielding a 21504-dimensional image feature representation for each image. The image feature representations of the training-set images are used to train an SVM classification model, and the trained model classifies the image feature representations of the test-set images to obtain the final classification result.
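Tying the sketches above together with the parameters of this embodiment (the helper functions are the illustrative ones defined earlier, not part of the patent; inferring the image extent from the patch grid is an approximation):

    import numpy as np

    # parameters of Embodiment 2
    M, p, k, lam, beta = 1024, 10, 5, 1e-4, 1e-4

    def represent(path, B):
        """One image -> (2^0 + 2^2 + 2^4) * M = 21504-dimensional representation."""
        desc, xy = extract_dense_sift(path)
        codes = np.stack([encode_feature(f, B, p, k, lam, beta) for f in desc])
        w, h = xy[:, 0].max() + 1.0, xy[:, 1].max() + 1.0
        return spm_max_pool(codes, xy, w, h)

    # usage sketch:
    # X = np.vstack([extract_dense_sift(f)[0] for f in train_paths])
    # B = learn_dictionary(X, M)
    # train_feats = np.stack([represent(f, B) for f in train_paths])
    # test_feats = np.stack([represent(f, B) for f in test_paths])
    # predictions = train_and_classify(train_feats, train_labels, test_feats)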
The invention provides an image classification method integrating feature dictionary structure and visual feature coding; there are many concrete ways to implement this technical scheme, and the above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the invention. Any component not explicitly specified in this embodiment can be implemented with existing technology.

Claims (6)

1. An image classification method integrating feature dictionary structure and visual feature coding, characterized in that it comprises the following steps:
Step 1, extract the visual features of the images: perform local sampling on each image to obtain a group of region blocks, extract a visual feature from each region block, and obtain the visual feature set corresponding to each image; the union of the visual feature sets of all images is called the visual feature collection of all images, denoted as set X;
Step 2, feature dictionary learning: with set X as input, use a feature dictionary learning method to obtain a feature dictionary composed of a group of representative visual words;
Step 3, visual feature coding: express each visual feature of each image as a linear combination of visual words, with one coefficient per visual word; this group of coefficients is called the coding of the visual feature;
Step 4, spatial pooling of visual feature codes: with the codings of all visual features of each image as input, use a statistical method to express each image as a vector; this vector is the image feature representation of the corresponding image;
Step 5, take the image feature representation of each image obtained in step 4 as input, use a classification model for training and classification, and obtain the classification result.
2. The method according to claim 1, characterized in that step 1 comprises the following steps:
Perform local sampling on an image I, each sample yielding one region block, and extract one visual feature from each region block, obtaining the visual feature set LFS_I of image I; finally obtain the visual feature collection of all images, X = [x_1, x_2, ..., x_N] ∈ R^(d×N), where d is the dimension of a visual feature, N is the total number of visual features of all images, and x_i is the i-th visual feature, i taking values 1 ~ N.
3. The method according to claim 2, characterized in that step 2 comprises the following steps:
With set X as input, use a feature dictionary learning method to obtain a feature dictionary composed of a group of representative visual words, denoted B = [b_1, b_2, ..., b_M] ∈ R^(d×M), where M is the number of visual words and b_j is a column vector of dimension d representing the j-th visual word, j taking values 1 ~ M.
4. The method according to claim 3, characterized in that step 3 comprises the following steps:
For a visual feature x_i, choose the p nearest-neighbor visual words of x_i from the feature dictionary B obtained in step 2, i.e., the p visual words with the smallest distance to x_i, p taking a value in 1 ~ M; denote the feature dictionary composed of these p visual words as B_i, i in 1 ~ N; obtain the matrix D_i of distances between the visual words of B_i, whose element in row m, column s is the distance between the corresponding visual words of B_i, m, s = 1, 2, ..., p; then compute the column vector d_i of distances from x_i to each visual word of B_i, whose n-th component d_in is the distance between x_i and the n-th visual word of B_i, n = 1, 2, ..., p; with x_i, d_i, D_i, B_i and the two parameters λ and β as input, λ, β ≥ 0, minimize

    ||x_i − B_i z_i^p||_2^2 + λ ||d_i ⊗ z_i^p||_2^2 + β (z_i^p)^T D_i z_i^p,   subject to: 1^T z_i^p = 1,

where ⊗ denotes the element-wise product, i.e., corresponding components of two vectors are multiplied to give a new vector; solving this yields the coding result z_i^p of x_i on these p visual words; finally, sort the components of the coding z_i^p and keep the k largest coding coefficients together with the feature dictionary formed by the corresponding k visual words, k = 1, 2, ..., p; the coding z_i of visual feature x_i is then the M-dimensional vector whose components corresponding to those k visual words take the retained coefficients, with all remaining components set to 0.
5. The method according to claim 4, characterized in that step 4 comprises the following step: adopt the spatial pyramid matching model to merge the codings of all visual features of each image into a single vector, which serves as the image feature representation of that image.
6. The method according to claim 5, characterized in that step 5 comprises the following steps: after the set formed by the image feature representations of all images is obtained, divide this set into a training set and a test set; use the training set to train a classification model, and classify the test set with the trained model.
CN201410693888.1A 2014-11-26 2014-11-26 Image classification method integrating feature dictionary structure and visual feature coding Active CN104331717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410693888.1A CN104331717B (en) 2014-11-26 2014-11-26 Image classification method integrating feature dictionary structure and visual feature coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410693888.1A CN104331717B (en) 2014-11-26 2014-11-26 Image classification method integrating feature dictionary structure and visual feature coding

Publications (2)

Publication Number Publication Date
CN104331717A true CN104331717A (en) 2015-02-04
CN104331717B CN104331717B (en) 2017-10-17

Family

ID=52406438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410693888.1A Active CN104331717B (en) 2014-11-26 2014-11-26 Image classification method integrating feature dictionary structure and visual feature coding

Country Status (1)

Country Link
CN (1) CN104331717B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224942A (en) * 2015-07-09 2016-01-06 华南农业大学 A kind of RGB-D image classification method and system
CN105224619A (en) * 2015-09-18 2016-01-06 中国科学院计算技术研究所 A kind of spatial relationship matching process and system being applicable to video/image local feature
CN105808757A (en) * 2016-03-15 2016-07-27 浙江大学 Chinese herbal medicine plant picture retrieval method based on multi-feature fusion BOW model
CN110598776A (en) * 2019-09-03 2019-12-20 成都信息工程大学 Image classification method based on intra-class visual mode sharing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646256A (en) * 2013-12-17 2014-03-19 上海电机学院 Image characteristic sparse reconstruction based image classification method
CN103699436A (en) * 2013-12-30 2014-04-02 西北工业大学 Image coding method based on local linear constraint and global structural information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646256A (en) * 2013-12-17 2014-03-19 上海电机学院 Image characteristic sparse reconstruction based image classification method
CN103699436A (en) * 2013-12-30 2014-04-02 西北工业大学 Image coding method based on local linear constraint and global structural information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JINJUN WANG et al.: "Locality-constrained Linear Coding for Image Classification", COMPUTER VISION AND PATTERN RECOGNITION *
YU-BIN YANG et al.: "Structurally Enhanced Incremental Neural Learning for Image Classification with Subgraph Extraction", INTERNATIONAL JOURNAL OF NEURAL SYSTEMS *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224942A (en) * 2015-07-09 2016-01-06 华南农业大学 A kind of RGB-D image classification method and system
CN105224619A (en) * 2015-09-18 2016-01-06 中国科学院计算技术研究所 A kind of spatial relationship matching process and system being applicable to video/image local feature
CN105224619B (en) * 2015-09-18 2018-06-05 中国科学院计算技术研究所 A kind of spatial relationship matching process and system suitable for video/image local feature
CN105808757A (en) * 2016-03-15 2016-07-27 浙江大学 Chinese herbal medicine plant picture retrieval method based on multi-feature fusion BOW model
CN105808757B (en) * 2016-03-15 2018-12-25 浙江大学 The Chinese herbal medicine picture retrieval method of BOW model based on multi-feature fusion
CN110598776A (en) * 2019-09-03 2019-12-20 成都信息工程大学 Image classification method based on intra-class visual mode sharing

Also Published As

Publication number Publication date
CN104331717B (en) 2017-10-17

Similar Documents

Publication Publication Date Title
Zhang et al. Vector of locally and adaptively aggregated descriptors for image feature representation
CN107256246B (en) printed fabric image retrieval method based on convolutional neural network
CN110321967B (en) Image classification improvement method based on convolutional neural network
CN108875816A (en) Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
CN106126581A (en) Cartographical sketching image search method based on degree of depth study
CN103605794A (en) Website classifying method
CN110674407A (en) Hybrid recommendation method based on graph convolution neural network
CN108897778B (en) Image annotation method based on multi-source big data analysis
CN102289522A (en) Method of intelligently classifying texts
CN110647907B (en) Multi-label image classification algorithm using multi-layer classification and dictionary learning
CN103116766A (en) Increment neural network and sub-graph code based image classification method
CN104834693A (en) Depth-search-based visual image searching method and system thereof
CN106776896A (en) A kind of quick figure fused images search method
CN103295032B (en) Based on the image classification method of spatial Fisher vector
CN107832335A (en) A kind of image search method based on context deep semantic information
Zhang et al. Automatic discrimination of text and non-text natural images
CN106951551A (en) The cumulative index image search method of joint GIST features
CN105574540A (en) Method for learning and automatically classifying pest image features based on unsupervised learning technology
CN109344898A (en) Convolutional neural networks image classification method based on sparse coding pre-training
CN104331717A (en) Feature dictionary structure and visual feature coding integrating image classifying method
CN104850859A (en) Multi-scale analysis based image feature bag constructing method
CN107577994A (en) A kind of pedestrian based on deep learning, the identification of vehicle auxiliary product and search method
Peng et al. Deep boosting: joint feature selection and analysis dictionary learning in hierarchy
Nguyen et al. Adaptive nonparametric image parsing
CN105389588A (en) Multi-semantic-codebook-based image feature representation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant