CN104933445A

CN104933445A - Mass image classification method based on distributed K-means

Info

Publication number: CN104933445A
Application number: CN201510363396.0A
Authority: CN
Inventors: 董乐; 张宁
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2015-06-26
Filing date: 2015-06-26
Publication date: 2015-09-23
Anticipated expiration: 2035-06-26
Also published as: CN104933445B

Abstract

The invention provides a mass image classification method based on distributed K-means and belongs to the technical field of machine learning and image processing. The method is applicable to large-scale image classification: it uses a distributed K-means algorithm on the big data processing platform Hadoop to extract image features and thereby classify large-scale image sets. By performing dictionary learning on large-scale image data and constructing a feature mapping function and a classification algorithm, the invention provides a feature extraction algorithm based on distributed K-means on top of Hadoop. The method avoids the tedious work of manually designing features for large-scale image sets and reduces training time while maintaining classification accuracy; the results are significant for large-scale database management, military, and medical applications.

Description

A mass image classification method based on distributed K-means

Technical field

The invention belongs to the technical field of machine learning and image processing, relates to mass image processing on a distributed platform, and in particular relates to a mass image classification method based on distributed K-means.

Background art

In recent years, clustering algorithms have been widely used in daily life. In business, clustering helps analysts extract specific consumption information from various consumer databases and summarize the consumption patterns it reveals. Clustering is an important part of data mining: it is a good tool for discovering deep feature representations in a database and for summarizing the characteristics of each category, and, most importantly, it can serve as a preprocessing step for other data mining algorithms. As image libraries keep growing in size and complexity, extracting manually designed features on a single machine falls far short of demand, and parallel processing is a good solution. The big data processing platform Hadoop, an open-source implementation of the Map-Reduce framework, is mainly used for parallel computation over large-scale data sets; because its framework is simple, it supports data-intensive applications effectively. On the basis of Hadoop, the present invention parallelizes the single-machine K-means algorithm so that the input data are processed in parallel, and designs and implements an image feature extraction algorithm based on distributed K-means.

Summary of the invention

The present invention addresses the feature extraction problem for large-scale image sets in order to classify images. With classification accuracy in mind, it proposes a mass image classification method based on distributed K-means. The method is implemented on the big data processing platform Hadoop and provides a parallelized image feature extraction algorithm; for the multi-class image classification problem, a DAG-SVM classifier performs the final classification.

To achieve these goals, the present invention adopts the following technical solution:

A mass image classification method based on distributed K-means, whose flow is shown in Figure 1, comprises the following steps:

Step 1. Preprocess the training images.

Input the training image data set and divide each training image into multiple image blocks. Apply regularization and then whitening to each image block to remove interfering information and retain key information; the result is passed to the next step as input.

Step 2. On the big data processing platform Hadoop, parallelize the K-means algorithm and, taking the preprocessed image blocks from step 1 as input, extract a dictionary.

Step 3. After the dictionary is extracted, construct a feature mapping function that maps the preprocessed training image blocks to a new feature representation.

Step 4. Input the new feature representation of the training image blocks obtained in step 3 into an SVM classifier to train it for image classification.

Step 5. For a target image to be classified, perform image block division, regularization, and whitening in sequence, then classify it with the SVM classifier trained for image classification.
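The block division in step 1 can be sketched as follows. This is a minimal illustration in plain Python, assuming a grayscale image stored as a list of rows and square blocks taken at a fixed stride. The function name `extract_blocks` and the `block_size` and `stride` parameters are hypothetical; the source does not specify the block size or the sampling scheme.

```python
def extract_blocks(image, block_size, stride):
    """Split a 2-D grayscale image (list of rows) into flattened square blocks.

    block_size and stride are illustrative assumptions; the source does not
    fix either value.
    """
    height, width = len(image), len(image[0])
    blocks = []
    for top in range(0, height - block_size + 1, stride):
        for left in range(0, width - block_size + 1, stride):
            # Flatten the block row by row into one vector of pixel values.
            block = [image[top + r][left + c]
                     for r in range(block_size)
                     for c in range(block_size)]
            blocks.append(block)
    return blocks


# A 4 x 4 toy image with pixel values 0..15.
toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
toy_blocks = extract_blocks(toy_image, block_size=2, stride=2)
```

With a stride equal to the block size the blocks do not overlap; a smaller stride would produce overlapping blocks and therefore more training samples per image.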

Further, the regularization operation described in step 1 is as follows:

\tilde{x}^{(i)} = \frac{x^{(i)} - \mathrm{mean}(x^{(i)})}{\sqrt{\mathrm{var}(x^{(i)}) + \sigma}} \quad (1)

where x^(i) is the i-th input image block, and var(x^(i)) and mean(x^(i)) are the variance and mean of all elements of x^(i), respectively. σ is a predetermined constant added before the division to suppress noise and to prevent division by zero when the variance approaches zero. For pixel values in the range [0, 255], σ = 10 generally works well. Its value is usually chosen experimentally: set a reasonable value from experience, then adjust it according to the observed results.
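Formula (1) can be sketched in a few lines of plain Python. The function name `normalize_block` is hypothetical, and the use of the population variance (dividing by the number of elements) is an assumption, since the source does not say which variance estimator is meant. The default `sigma=10.0` follows the value the text suggests for pixel values in [0, 255].

```python
import math


def normalize_block(block, sigma=10.0):
    """Apply formula (1): subtract the mean, divide by sqrt(variance + sigma)."""
    n = len(block)
    mean = sum(block) / n
    # Population variance over all elements of the block (an assumption).
    var = sum((v - mean) ** 2 for v in block) / n
    scale = math.sqrt(var + sigma)
    return [(v - mean) / scale for v in block]


normalized = normalize_block([0.0, 50.0, 100.0, 250.0])
flat = normalize_block([128.0, 128.0, 128.0, 128.0])  # zero-variance block
```

The second call shows the role of σ: a constant block has zero variance, yet the division is still well defined and the result is a vector of zeros.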

Further, PCA whitening is applied to each regularized image block to reduce the correlation between pixels:

x_{\mathrm{rot}}^{(i)} = (U^{(i)})^{T} \tilde{x}^{(i)} \quad (2)

x_{\mathrm{PCAwhite}}^{(i)} = \frac{x_{\mathrm{rot}}^{(i)}}{\sqrt{\lambda^{(i)} + \epsilon}} \quad (3)

where λ^(i) and U^(i) are the eigenvalues and eigenvectors of the image blocks, respectively. Formula (2) reduces the correlation between the pixels of the input image, and formula (3) yields the whitened image block data. ε is a preset constant that smooths the image data and thereby improves performance; it is generally small and is chosen in the same way as σ.
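Formulas (2) and (3) can be illustrated on toy data. The sketch below assumes that the eigenvalues and eigenvectors come from the covariance matrix of the zero-mean block data, which is the usual reading of PCA whitening but is not spelled out in the source. To stay within the standard library it uses two-pixel "blocks", for which the eigendecomposition has a closed form; the function name `pca_whiten_2d` and the value of `epsilon` are hypothetical.

```python
import math


def pca_whiten_2d(blocks, epsilon=1e-5):
    """PCA-whiten zero-mean 2-pixel blocks following formulas (2) and (3)."""
    m = len(blocks)
    # Covariance matrix [[a, b], [b, c]] of the (already zero-mean) data.
    a = sum(x * x for x, _ in blocks) / m
    b = sum(x * y for x, y in blocks) / m
    c = sum(y * y for _, y in blocks) / m
    # Closed-form eigendecomposition of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    u1, u2 = (cos_t, sin_t), (-sin_t, cos_t)  # eigenvectors
    lam1 = a * cos_t ** 2 + 2 * b * sin_t * cos_t + c * sin_t ** 2
    lam2 = a * sin_t ** 2 - 2 * b * sin_t * cos_t + c * cos_t ** 2
    whitened = []
    for x, y in blocks:
        # Formula (2): rotate into the eigenvector basis.
        rot1 = u1[0] * x + u1[1] * y
        rot2 = u2[0] * x + u2[1] * y
        # Formula (3): rescale each component by 1 / sqrt(lambda + epsilon).
        whitened.append((rot1 / math.sqrt(lam1 + epsilon),
                         rot2 / math.sqrt(lam2 + epsilon)))
    return whitened


# Strongly correlated zero-mean toy data: the second pixel tracks the first.
data = [(-2.0, -1.9), (-1.0, -1.2), (0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
white = pca_whiten_2d(data)
```

After whitening, the two components are uncorrelated and each has variance close to 1; a larger `epsilon` would damp the component with the small eigenvalue more strongly. Real image blocks have many more dimensions and would need a general eigensolver.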

Further, the dictionary extraction process described in step 2 is as follows:

The image blocks preprocessed in step 1 are the input to the Map nodes. First, the cluster centers are initialized. Multiple Map nodes read the preprocessed image data in parallel and assign each element to its cluster center. Then, on the Reduce nodes, all elements of each cluster are accumulated and a new cluster center is computed. The change between the new cluster centers and the previous ones is compared with a preset threshold: if it is smaller than the threshold, the iteration ends and the cluster centers are output; otherwise the cluster centers are updated and a new iteration begins.
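The Map and Reduce roles in this iteration can be sketched in a single Python process. This is not Hadoop code: the functions `map_phase`, `reduce_phase`, and `distributed_kmeans` are hypothetical names that simulate what the Map and Reduce nodes compute. Two further assumptions are labeled in the comments: the caller supplies the initial centers, and "change" is measured as the largest Euclidean shift of any center.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_phase(split, centers):
    """Map: emit (nearest cluster index, (point, 1)) for each point in a split."""
    pairs = []
    for point in split:
        k = min(range(len(centers)), key=lambda j: euclidean(point, centers[j]))
        pairs.append((k, (point, 1)))
    return pairs


def reduce_phase(pairs, centers):
    """Reduce: accumulate the points of each cluster and recompute its center."""
    sums = {}
    for k, (point, count) in pairs:
        total, n = sums.get(k, ([0.0] * len(point), 0))
        sums[k] = ([t + v for t, v in zip(total, point)], n + count)
    # Assumption: a cluster that received no points keeps its previous center.
    return [[t / sums[k][1] for t in sums[k][0]] if k in sums else list(c)
            for k, c in enumerate(centers)]


def distributed_kmeans(splits, centers, threshold=1e-6, max_iter=100):
    for _ in range(max_iter):
        # On Hadoop each split would be processed by a separate Map task.
        pairs = [kv for split in splits for kv in map_phase(split, centers)]
        new_centers = reduce_phase(pairs, centers)
        # Assumption: convergence is judged by the largest center shift.
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    return centers


splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
dictionary = distributed_kmeans(splits, centers=[[0.0, 0.0], [10.0, 10.0]])
```

Only per-cluster sums and counts need to travel from Map to Reduce, which is what makes the assignment step easy to parallelize across data splits.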

Further, the process described in step 3 is as follows:

The dictionary obtained in step 2 is distributed in parallel to multiple Map nodes, and a new image data set without labels is input to each Map node. Feature learning is carried out on the image data set at each Map node, and the features are obtained by applying the feature mapping to the input image data, according to the following formula:

\begin{matrix} f^{(i)}(x) = [f_{1}^{(i)}(x), \ldots, f_{k}^{(i)}(x), \ldots, f_{N}^{(i)}(x)], \quad k = 1, \ldots, N \\ f_{k}^{(i)}(x) = \max\{0, \mu^{(i)}(z) - z_{k}^{(i)}\} \\ z_{k}^{(i)} = \| x_{\mathrm{PCAwhite}}^{(i)} - c^{(k)} \|_{2} \\ \mu^{(i)}(z) = (z_{1}^{(i)} + \ldots + z_{k}^{(i)} + \ldots + z_{N}^{(i)}) / N \end{matrix} \quad (5)

where f^(i)(x) is the new feature representation of the image block, N is the total number of cluster centers in the dictionary extracted in step 2, and c^(k) is the k-th cluster center. The formula shows that when the distance z_k from an image block to cluster center c^(k) exceeds the mean distance μ(z), the feature f_k output by the mapping function is 0.
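A direct transcription of formula (5) for a single whitened image block might look like the sketch below. The function name `feature_map` and the toy cluster centers are hypothetical and chosen so that the distances are easy to check by hand.

```python
import math


def feature_map(block, centers):
    """Formula (5): f_k = max(0, mean(z) - z_k) with z_k = ||block - c_k||_2."""
    z = [math.sqrt(sum((b - c) ** 2 for b, c in zip(block, center)))
         for center in centers]
    mu = sum(z) / len(z)
    return [max(0.0, mu - z_k) for z_k in z]


toy_centers = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
features = feature_map([0.0, 0.0], toy_centers)
```

Here the distances are 0, 5, and 10 with mean 5, so only the nearest center yields a nonzero feature. In general, about half of the entries tend to be zero, which gives a sparse representation without solving an optimization problem.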

Further, in the above technical solution, once the image features have been obtained, classifying the images is a multi-class problem; steps 4 and 5 therefore use a DAG-SVM classifier for the final training and classification.

The beneficial effects of the invention are as follows:

The image feature extraction algorithm of the present invention uses the unsupervised learning method K-means to learn features. Because K-means has far fewer training parameters than traditional unsupervised learning methods, the algorithm greatly reduces classification complexity while maintaining classification accuracy. In addition, on the basis of the big data processing platform Hadoop, the invention parallelizes each layer of the deep hierarchical feature learning process, reducing time cost and resource overhead.

Brief description of the drawings

Figure 1 is the flow framework of the image classification method based on distributed K-means.

Figure 2 is the flow chart of dictionary extraction in the image classification method based on distributed K-means.

Figure 3 illustrates the classification process for the multi-class image classification problem.

Figure 4 shows the effect of the whitening operation on the dictionary.

Figure 5 is the Hadoop network topology.

Embodiment

To make the objects, technical solutions, and beneficial effects of the present invention clearer, the invention is described in more detail below with reference to an example and the accompanying drawings.

The present invention can be used for large-scale image classification. The method uses a distributed K-means algorithm on the big data processing platform Hadoop to extract image features and thereby classify large-scale image sets. Drawing on recent research in image processing, machine learning, and related fields, the invention performs dictionary learning on large-scale image data, constructs a feature mapping function, and designs a classification algorithm, yielding a feature extraction algorithm based on distributed K-means on top of Hadoop. The method avoids the tedious work of manually designing features for large-scale image sets and reduces training time while maintaining classification accuracy; the results are significant for large-scale database management, military, and medical applications.

Example

The hardware environment for the test experiments of this embodiment is as follows; the experimental topology is shown in Figure 5:

Hardware environment:

Computer type: desktop computer

CPU: Pentium(R) Dual-Core CPU E5600 2.93GHz

Memory: 4.00 GB (3.49 GB usable)

System type: 32-bit operating system

Graphics card: integrated graphics card

Software environment:

IDE: Eclipse

Image processing SDK: JavaCV

Development language: Java

As shown in Figure 1, the feature extraction algorithm of the present invention for large-scale image classification comprises the following steps:

Step 1. Preprocess the training images.

Step 2. On the big data processing platform Hadoop, parallelize the K-means algorithm and, taking the preprocessed image block information from step 1 as input, extract a dictionary.

Figure 2 shows how the distributed K-means algorithm extracts the dictionary. First, the cluster centers are initialized. Multiple Map nodes read the preprocessed image data in parallel and assign each element to its cluster center. Then, on the Reduce nodes, all elements of each cluster are accumulated and a new cluster center is computed. The change between the new cluster centers and the previous ones is compared with a preset threshold: if it is smaller than the threshold, the iteration ends and the cluster centers are output; otherwise the cluster centers are updated and a new iteration begins.

Step 3. After the dictionary is extracted, construct a feature mapping function that maps the preprocessed training image blocks to a new feature representation. In computer vision, many feature mapping functions require very large amounts of computation time and storage because each must solve a complex optimization problem. The literature shows that sparse coding can perform feature mapping well, but it shows certain limitations when labeled image data are very scarce. For feature extraction on large-scale image sets, A. Coates et al. [A. Coates, A. Y. Ng, H. Lee. An analysis of single-layer networks in unsupervised feature learning [C]. International Conference on Artificial Intelligence and Statistics, 2011: 215–223.] demonstrated that formula (5) above achieves good results. This embodiment therefore uses that method for feature extraction after the dictionary has been extracted.

Step 4. Train the classifier. Cortes and Vapnik first proposed the support vector machine (SVM) theory in 1995. It handles small-sample, nonlinear, and high-dimensional pattern recognition problems well and extends to other machine learning tasks such as function fitting. The SVM method is based on statistical learning theory and the structural risk minimization principle: given limited sample information, it seeks the best compromise between model complexity and learning ability in order to obtain the best generalization ability. In the algorithm of the present invention, once the features of the labeled image data have been obtained, the image features and the corresponding image labels are input into the SVM classifier to train it. Because large-scale image classification involves many categories, it is treated as a multi-class problem, which this example solves with the DAG-SVM classification method; the classification process is shown in Figure 3. In this example, the ImageNet data set has 50 classes, the CIFAR-100 data set has 10 classes, and the STL-10 data set has 10 classes. For ImageNet, n = 50 and 49 classifiers are needed; for CIFAR-100, n = 10 and 9 classifiers are needed; for STL-10, n = 10 and 9 classifiers are needed (n − 1 in each case). This not only speeds up classification but also avoids overlapping and unclassifiable regions.
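The classifier counts quoted above follow from how a decision DAG is evaluated: each pairwise test eliminates one class, so a sample passes through n − 1 binary classifiers (49 for n = 50, 9 for n = 10), even though a DAG-SVM trains n(n − 1)/2 pairwise classifiers in total. The sketch below shows only the evaluation walk. `dag_classify` is a hypothetical name, and `toy_decide` is a stand-in for trained binary SVMs, not an SVM itself.

```python
def dag_classify(sample, num_classes, pairwise_decide):
    """Walk a decision DAG: each pairwise test eliminates one class.

    pairwise_decide(sample, i, j) must return i or j; it stands in for a
    trained binary SVM that separates class i from class j.
    """
    candidates = list(range(num_classes))
    evaluations = 0
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        winner = pairwise_decide(sample, first, last)
        evaluations += 1
        # Drop whichever of the two classes lost the pairwise test.
        candidates.remove(last if winner == first else first)
    return candidates[0], evaluations


def toy_decide(sample, i, j):
    """Hypothetical stand-in for a binary SVM: pick the nearer class index."""
    return i if abs(sample - i) <= abs(sample - j) else j


label, used = dag_classify(sample=7.2, num_classes=10, pairwise_decide=toy_decide)
```

Because every sample ends at exactly one leaf of the DAG, no region of the feature space is left unclassified or claimed by several classes, which is the property the text refers to.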

Step 5. For a target image to be classified, perform image block division, regularization, whitening, and feature extraction in sequence, then classify it with the SVM classifier trained for image classification.

To verify the effect of the invention, this embodiment ran experiments on the large-scale data sets ImageNet, CIFAR-100, and STL-10. From ImageNet, 50 classes were selected, 60,000 images in total, of which 40,000 were used as the training set and the rest as the test set. The whole CIFAR-10 data set was used, with 50,000 images for training and 10,000 for testing. The whole STL-10 data set was used; it contains 10 classes of 96 × 96 pixel images, with 500 training images and 800 test images per class. Very good classification results were achieved on all three data sets.

Figure 4 illustrates the effect of the whitening operation on the learned dictionary. Figure 4(a) shows the dictionary (cluster centers) learned by K-means from image data that has not been whitened. Because the correlation between pixels is high, the learned cluster centers are themselves highly correlated, and such a dictionary performs very poorly in image classification tasks. Figure 4(c) shows the dictionary (cluster centers) learned by K-means from whitened image data. Whitening removes the correlation between pixels, so the resulting cluster centers are closer to orthogonal and perform well in image classification.

Figure 4(b) depicts the effect of the whitening operation on the K-means algorithm. On the left, without whitening, the cluster centers are biased by the correlated data; on the right, with whitening, the cluster centers obtained by K-means are closer to orthogonal.

Table 1 shows the effect of the whitening operation on image classification accuracy for the image feature extraction algorithm based on distributed K-means, on the ImageNet and CIFAR-100 data sets. It compares the accuracy obtained with features extracted directly from the original image data against the accuracy obtained with features extracted from whitened image data. On ImageNet, features extracted from whitened image data give a classification accuracy of 70.19%, which is 7.71% higher than features extracted directly from the original image data. On CIFAR-100, features extracted from whitened data reach 55.38%, whereas features extracted from the original image data reach only 48.07%. These results show that the whitening operation is vital to image classification accuracy.

This embodiment tested the performance of the image feature extraction algorithm based on distributed K-means on the STL-10 data set, verifying its effectiveness by comparing final image classification accuracy. As shown in Table 2, the method of the invention reaches 56.17% on STL-10, which is 0.17 percentage points higher than SC Features, K-means encoding [2], 1.27 higher than VQ (1 layer) [1], 2.67 higher than Sparse Filtering [3], and 3.27 higher than Reconstruction ICA [2]. These figures show that the method of the invention has an advantage in classification accuracy over these methods. Sparse Coding [1] reaches 59.0% on STL-10, 2.83 percentage points higher than the present method. The method of the invention is implemented on the big data processing platform Hadoop, whose internal processing involves distributing data across multiple Map nodes and aggregating it on Reduce nodes, which can cost some accuracy. Moreover, an image feature extraction algorithm built on Hadoop needs large-scale image data for training, so on the relatively small STL-10 data set the method falls slightly short in accuracy.

Table 1. Effect of the whitening operation on image classification accuracy

Table 2. Comparison of image classification accuracy on the STL-10 data set

Algorithm	Accuracy
The method of this embodiment	56.17%
Sparse Coding	59.0%
SC Features, K-means encoding	56.0%
VQ (1 layer)	54.9%
Sparse Filtering	53.5%
Reconstruction ICA	52.9%
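The margins quoted in the discussion of Table 2 can be rechecked from the table values with a short script. The dictionary below simply restates the table; the variable names are illustrative.

```python
accuracy = {
    "The method of this embodiment": 56.17,
    "Sparse Coding": 59.0,
    "SC Features, K-means encoding": 56.0,
    "VQ (1 layer)": 54.9,
    "Sparse Filtering": 53.5,
    "Reconstruction ICA": 52.9,
}
ours = accuracy["The method of this embodiment"]
# Positive margin: the embodiment is more accurate; negative: less accurate.
margins = {name: round(ours - value, 2)
           for name, value in accuracy.items()
           if name != "The method of this embodiment"}
```

The margins are differences in percentage points; the only negative one is against Sparse Coding.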

The references cited in this embodiment are as follows:

[1] A. Coates, A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization [C]. International Conference on Machine Learning, 2011: 921–928.

[2] Q. V. Le, A. Karpenko, J. Ngiam, et al. ICA with reconstruction cost for efficient overcomplete feature learning [C]. Advances in Neural Information Processing Systems, 2011: 1017–1025.

[3] J. Ngiam, Z. Chen, S. A. Bhaskar, et al. Sparse filtering [C]. Advances in Neural Information Processing Systems, 2011: 1125–1133.

Claims

1. A mass image classification method based on distributed K-means, comprising the following steps:

Step 1. Preprocess the training images.

Input the training image data set and divide each training image into multiple image blocks; apply regularization and then whitening to each image block to remove interfering information and retain key information, and pass the result to the next step as input.

Step 2. On the big data processing platform Hadoop, parallelize the K-means algorithm and, taking the preprocessed image block information from step 1 as input, extract a dictionary.

Step 3. After the dictionary is extracted, construct a feature mapping function that maps the preprocessed training image blocks to a new feature representation.

Step 4. Input the new feature representation of the training image blocks obtained in step 3 into an SVM classifier to train it for image classification.

Step 5. For a target image to be classified, perform image block division, regularization, whitening, and feature extraction in sequence, then classify it with the SVM classifier trained for image classification.

2. The mass image classification method based on distributed K-means according to claim 1, characterized in that the regularization operation described in step 1 is as follows:

\tilde{x}^{(i)} = \frac{x^{(i)} - \mathrm{mean}(x^{(i)})}{\sqrt{\mathrm{var}(x^{(i)}) + \sigma}} \quad (1)

where x^(i) is the i-th input image block, and var(x^(i)) and mean(x^(i)) are the variance and mean of all elements of x^(i), respectively; σ is a predetermined constant added before the division to suppress noise and to prevent division by zero when the variance approaches zero.

3. The mass image classification method based on distributed K-means according to claim 1, characterized in that PCA whitening is applied to each regularized image block to reduce the correlation between pixels:

x_{\mathrm{rot}}^{(i)} = (U^{(i)})^{T} \tilde{x}^{(i)} \quad (2)

x_{\mathrm{PCAwhite}}^{(i)} = \frac{x_{\mathrm{rot}}^{(i)}}{\sqrt{\lambda^{(i)} + \epsilon}} \quad (3)

where λ^(i) and U^(i) are the eigenvalues and eigenvectors of the image blocks, respectively; formula (2) reduces the correlation between the pixels of the input image, formula (3) yields the whitened image block data, and ε is a preset constant.

4. The mass image classification method based on distributed K-means according to claim 1, characterized in that the dictionary extraction process described in step 2 is as follows:

The image blocks preprocessed in step 1 are the input to the Map nodes. First, the cluster centers are initialized; multiple Map nodes read the preprocessed image data in parallel and assign each element to its cluster center; then, on the Reduce nodes, all elements of each cluster are accumulated and a new cluster center is computed; the change between the new cluster centers and the previous ones is compared with a preset threshold, and if it is smaller than the threshold, the iteration ends and the cluster centers are output; otherwise the cluster centers are updated and a new iteration begins.

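5. The mass image classification method based on distributed K-means according to claim 1, characterized in that the process described in step 3 is as follows:

The dictionary obtained in step 2 is distributed in parallel to multiple Map nodes, and a new image data set without labels is input to each Map node; feature learning is carried out on the image data set at each Map node, and the features are obtained by applying the feature mapping to the input image data, according to the following formula:

\begin{matrix} f^{(i)}(x) = [f_{1}^{(i)}(x), \ldots, f_{k}^{(i)}(x), \ldots, f_{N}^{(i)}(x)], \quad k = 1, \ldots, N \\ f_{k}^{(i)}(x) = \max\{0, \mu^{(i)}(z) - z_{k}^{(i)}\} \\ z_{k}^{(i)} = \| x_{\mathrm{PCAwhite}}^{(i)} - c^{(k)} \|_{2} \\ \mu^{(i)}(z) = (z_{1}^{(i)} + \ldots + z_{k}^{(i)} + \ldots + z_{N}^{(i)}) / N \end{matrix} \quad (5)

where f^(i)(x) is the new feature representation of the image block, N is the total number of cluster centers in the dictionary extracted in step 2, and c^(k) is the k-th cluster center.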

6. The mass image classification method based on distributed K-means according to claim 1, characterized in that steps 4 and 5 use a DAG-SVM classifier for the final training and classification.