CN103699523B - Product classification method and apparatus - Google Patents

Product classification method and apparatus

Info

Publication number
CN103699523B
CN103699523B (application CN201310692950.0A)
Authority
CN
China
Prior art keywords
product
sample
word
feature
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310692950.0A
Other languages
Chinese (zh)
Other versions
CN103699523A (en)
Inventor
樊春玲
邓亮
冯良炳
张冠军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310692950.0A priority Critical patent/CN103699523B/en
Publication of CN103699523A publication Critical patent/CN103699523A/en
Application granted granted Critical
Publication of CN103699523B publication Critical patent/CN103699523B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A product classification method is provided. The method includes: extracting a product text feature from the product text used to describe a product to be classified; extracting a product image feature from a product image of the product to be classified; generating a product feature of the product to be classified from the product text feature and the product image feature; and inputting the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result. By extracting both the product text feature and the product image feature of the product to be classified, generating a combined product feature from them, and classifying with this product feature, the method takes both the text and the image of the product into account. Compared with classifying according to the product's text information alone, this improves classification accuracy. A product classification apparatus is also provided.

Description

Product classification method and apparatus
Technical field
The present invention relates to the field of pattern recognition, and in particular to a product classification method and apparatus.
Background technology
With the rapid development of e-commerce, online shopping has increasingly become part of netizens' daily lives. According to the 2012 China Online Shopping Market Analysis Report issued by CNNIC in March 2013, China's online shopping market turnover reached 1,259.4 billion yuan in 2012. Online products come in a wide range of categories and enormous quantities, and e-commerce websites must invest considerable effort in product management in order to provide users with a good shopping experience.
Product classification is the primary problem of product management. At present, product categories are mainly calibrated manually. Although methods exist that classify products according to their text information, the text information cannot fully describe every aspect of a product; if the textual description deviates, the product may be misclassified, and substantial labor is then required to correct the product category. Existing product classification therefore suffers from poor classification accuracy.
Summary of the invention
Based on this, it is necessary to provide a product classification method and apparatus that address the poor accuracy of classifying products according to their text information alone.
A product classification method, the method including:
extracting a product text feature from the product text used to describe a product to be classified;
extracting a product image feature from a product image of the product to be classified;
generating a product feature of the product to be classified from the product text feature and the product image feature;
inputting the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result.
A product classification apparatus, the apparatus including:
a product text feature extraction module, configured to extract a product text feature from the product text used to describe a product to be classified;
a product image feature extraction module, configured to extract a product image feature from a product image of the product to be classified;
a product feature generation module, configured to generate a product feature of the product to be classified from the product text feature and the product image feature;
a classification module, configured to input the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result.
With the above product classification method and apparatus, the product text feature and product image feature of the product to be classified are extracted, a product feature is generated from them, and this product feature is used for classification to obtain the classification result. Because both the text feature and the image feature of the product are taken into account, classification accuracy is improved compared with classifying according to the product's text information alone.
Brief description of the drawings
Fig. 1 is a flow diagram of a product classification method in one embodiment;
Fig. 2 is a flow diagram of the step of training a product classification model in one embodiment;
Fig. 3 is a flow diagram of the step of extracting a product text feature from the product text describing a product to be classified in one embodiment;
Fig. 4 is a flow diagram of the step of extracting a product image feature from a product image of the product to be classified in one embodiment;
Fig. 5 is a schematic diagram of segmenting image blocks from a product image, or image sub-blocks from a sample image, in one embodiment;
Fig. 6 is a schematic diagram of dividing an image block into multiple image units, or an image sub-block into multiple sub-units, in one embodiment;
Fig. 7 is a flow diagram of the step of extracting a sample text feature from the sample text of a sample product in the training sample set in one embodiment;
Fig. 8 is a flow diagram of the step of extracting a sample image feature from the sample image of a sample product in the training sample set in one embodiment;
Fig. 9 is a schematic diagram of the process of generating a product feature in a concrete application scenario;
Fig. 10 is a schematic diagram of the process of classifying a product to be classified with the trained product classification model to obtain a classification result in a concrete application scenario;
Fig. 11 is a structural block diagram of a product classification apparatus in one embodiment;
Fig. 12 is a structural block diagram of a product classification apparatus in another embodiment;
Fig. 13 is a structural block diagram of a product text feature extraction module in one embodiment;
Fig. 14 is a structural block diagram of a product feature word screening module in one embodiment;
Fig. 15 is a structural block diagram of a product image feature extraction module in one embodiment;
Fig. 16 is a structural block diagram of a sample text feature extraction module in one embodiment;
Fig. 17 is a structural block diagram of a sample feature word screening module in one embodiment;
Fig. 18 is a structural block diagram of a sample image feature extraction module in one embodiment;
Fig. 19 is a structural block diagram of a product classification apparatus in a further embodiment.
Detailed description of the invention
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
Unless the context clearly requires otherwise, elements and components in the present invention may exist in either singular or plural form, which the present invention does not restrict. Although the steps of the present invention are numbered, the numbering is not intended to limit their order of precedence; unless it is expressly stated that the order of steps or the execution of a certain step depends on other steps, the relative order of the steps is adjustable. It is appreciated that the term "and/or" used herein relates to and covers any and all possible combinations of one or more of the associated listed items.
As shown in Fig. 1, in one embodiment, a product classification method is provided, including:
Step 102: extract a product text feature from the product text used to describe a product to be classified.
The product text refers to the text describing the product to be classified, including words, symbols, numbers, and so on. Product text can be stored in a corresponding product text document, with one product text per product text document.
Specifically, extracting the product text feature is the process of quantizing the feature words extracted from the product text to represent that text, so that the unstructured original product text is converted into structured information that a computer can identify and process for classification. Existing text feature extraction methods can be used to extract the product text feature from the product text, such as principal component analysis (PCA) and simulated annealing (SA).
Step 104: extract a product image feature from a product image of the product to be classified.
The product image refers to an image of the product to be classified. Color features (such as a color histogram), texture features, or shape features of the product image can be extracted as the product image feature.
Step 106: generate a product feature of the product to be classified from the product text feature and the product image feature.
After the product text feature and the product image feature are obtained, they can be spliced together to form the product feature of the product to be classified. Specifically, the vector representing the product text feature and the vector representing the product image feature can be connected end to end to form a single vector representing the product feature, thereby realizing the splicing of the features.
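As an illustrative sketch (not part of the claimed method itself), the end-to-end splicing of the two feature vectors can be expressed with NumPy's `concatenate`; the feature values below are invented placeholders:

```python
import numpy as np

# Hypothetical feature vectors: a 5-dimensional text feature
# and a 4-dimensional image feature (values are illustrative).
text_feature = np.array([0.2, 0.0, 1.5, 0.7, 0.0])
image_feature = np.array([3.0, 1.0, 0.0, 2.0])

# Splicing: connect the two vectors end to end into one product feature.
product_feature = np.concatenate([text_feature, image_feature])
# product_feature now has 5 + 4 = 9 dimensions.
```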
Step 108: input the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result.
Before the product to be classified is classified, the product classification model is obtained in advance by training on a training sample set. During classification, the product feature of the product to be classified is input into the trained product classification model to obtain the classification result.
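For illustration only, a minimal sketch of the classification step, assuming the trained model reduces to one hyperplane coefficient vector per preset category and the category with the largest score wins; `W_trained` and its values are invented stand-ins, not coefficients produced by the patent's training procedure:

```python
import numpy as np

# Once trained, the model is a set of hyperplane coefficient vectors,
# one per preset category (the values below are illustrative stand-ins).
W_trained = np.array([
    [1.0, 0.0, 0.0],   # category 0
    [0.0, 1.0, 0.0],   # category 1
    [0.0, 0.0, 1.0],   # category 2
])

def classify(product_feature, W):
    """Return the preset category whose hyperplane score is largest."""
    scores = W @ product_feature
    return int(np.argmax(scores))

result = classify(np.array([0.1, 0.9, 0.2]), W_trained)  # category 1
```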
In one embodiment, the training sample set includes multiple sample products corresponding to preset categories, each sample product being associated with sample text and a sample image that describe it. Thus the training sample set contains preset categories and, for each preset category, multiple corresponding sample products. Each sample product corresponds to one sample text and at least one sample image.
As shown in Fig. 2, the product classification method also includes the step of training the product classification model, comprising steps 202 to 208:
Step 202: extract a sample text feature from the sample text of each sample product in the training sample set.
Specifically, the sample text feature can be extracted from the sample text of each sample product in the training sample set by the same means used to extract the product text feature from the product text of the product to be classified. Existing text feature extraction methods can be used, such as principal component analysis (PCA) and simulated annealing (SA).
Step 204: extract a sample image feature from the sample image of each sample product in the training sample set.
Specifically, the sample image feature is extracted from the sample image of each sample product in the training sample set by the same means used to extract the product image feature from the product image of the product to be classified. Color features (such as a color histogram), texture features, or shape features of each sample image can be extracted as the sample image feature.
Step 206: generate a sample feature from the sample text feature and the sample image feature.
After the sample text feature and sample image feature are obtained, they can be spliced together to obtain the sample feature of each sample product. Specifically, the vector representing the sample text feature and the vector representing the sample image feature can be connected end to end to form a single vector representing the sample feature, thereby realizing the splicing of the features.
Step 208: train a support-vector-machine-based product classification model on the sample features.
This embodiment adopts the support vector machine (SVM) method to train the product classification model. The basic idea of the SVM method is to establish a hyperplane, or a series of hyperplanes in a high-dimensional space, such that the distance from the hyperplane to the nearest training samples is maximized. Existing SVM training methods can be used to obtain the SVM-based product classification model. An important task in the SVM method is the selection of the kernel function. When the sample features carry heterogeneous information, the sample size is very large, or the data are irregular or multi-dimensional and unevenly distributed in the high-order feature space, mapping all samples with a single kernel is unreasonable; multiple kernel functions need to be combined, i.e., a multiple kernel learning method is required.
There are many ways to combine kernels. This embodiment adopts the sparse-coding-based multiple kernel learning method UFO-MKL (Ultra-Fast Optimization algorithm for sparse Multi Kernel Learning); in some cases the increased sparsity reduces redundancy and improves operating efficiency.
Specifically, let the obtained sample feature be x ∈ X and the preset categories be y ∈ Y = {1, 2, …, F}, where F is the total number of preset categories. Define φ_j(x, y), j = 1, …, F, where φ_j is the feature mapping corresponding to the j-th preset category.
Define φ̄(x, y) = [φ_1(x, y), …, φ_F(x, y)] and w̄ = [w_1, …, w_F], where w_j is the hyperplane coefficient corresponding to φ_j. The modulus is defined as in formula (1), where ‖·‖_p is the p-norm of a vector.
‖w̄‖_{2,p} = ‖( ‖w_1‖_2, ‖w_2‖_2, ‖w_3‖_2, …, ‖w_F‖_2 )‖_p    Formula (1)
The training of the multi-kernel product classification model can be defined as the optimization problem of formula (2):
min_{w̄}  λ/2 ‖w̄‖²_{2,p} + α ‖w̄‖_{2,1} + (1/N) Σ_{t=1}^{N} ℓ(w̄; x_t, y_t)    Formula (2)
where λ/2 ‖w̄‖²_{2,p} + α ‖w̄‖_{2,1} is the coefficient regularization term, ℓ(w̄; x_t, y_t) is the classification-error loss term, and N is the number of sample products in the training set. Here := denotes assignment, λ and α are regularization coefficients, and p = 2logF/(2logF − 1) is the norm exponent.
Let ℓ be a common convex cost function and ∂ℓ its subgradient. The UFO-MKL-based coefficient solving algorithm is then given by steps 11) to 18):
11) initialize the coefficients λ, α and the number of iteration cycles T;
12) initialize the coefficients θ_j = 0 for j = 1, 2, …, F, and the variable q = 2logF;
13) for t = 1, 2, …, T do;
14) obtain a training sample (x_t, y_t) at random;
15) update the variables θ_j ← θ_j − ∂ℓ_j(w̄; x_t, y_t), j = 1, 2, …, F;
16) compute v_j = max(0, ‖θ_j‖₁ − αt), for all j = 1, 2, …, F;
17) update the coefficients w_j = v_j θ_j / (tλ‖θ_j‖₁) · (v_j/‖v‖_q)^{q−2}, for all j = 1, 2, …, F;
18) end for.
Here, steps 13) to 18) mean that for t = 1, 2, …, T in turn, steps 14) to 17) are executed repeatedly; the loop stops after the coefficients w_j have been updated at t = T, and the algorithm terminates. Once the hyperplane coefficients w_j corresponding to φ_j are obtained, the series of hyperplanes in the high-dimensional space can be established, thereby obtaining the product classification model.
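A hedged sketch of steps 11) to 18) follows. The loss subgradient here is a simple multiclass-hinge stand-in (the patent leaves the cost function generic), and all parameter values are illustrative, not values prescribed by the patent:

```python
import math
import numpy as np

def ufo_mkl_train(samples, num_categories, dim, lam=0.1, alpha=0.001, T=100, seed=0):
    """Sketch of the stochastic coefficient-update loop (steps 11-18).

    `samples` is a list of (x, y) pairs with x a dim-vector and y in
    {0, ..., num_categories - 1}. The subgradient below is a
    multiclass-hinge stand-in; the patent leaves the loss generic.
    """
    rng = np.random.default_rng(seed)
    F = num_categories
    q = 2.0 * math.log(F)                        # step 12: q = 2 log F
    theta = np.zeros((F, dim))                   # step 12: theta initialized to 0
    w = np.zeros((F, dim))
    for t in range(1, T + 1):                    # step 13
        x, y = samples[rng.integers(len(samples))]   # step 14: random sample
        scores = w @ x
        y_hat = int(np.argmax(scores + (np.arange(F) != y)))  # worst margin violator
        if y_hat != y:                           # step 15: theta minus loss subgradient
            theta[y] += x
            theta[y_hat] -= x
        norms = np.abs(theta).sum(axis=1)        # ||theta_j||_1 as in step 16
        v = np.maximum(0.0, norms - alpha * t)   # step 16
        vq = np.linalg.norm(v, ord=q) or 1.0
        for j in range(F):                       # step 17: update coefficients
            if v[j] > 0 and norms[j] > 0:
                w[j] = v[j] * theta[j] / (t * lam * norms[j]) * (v[j] / vq) ** (q - 2)
            else:
                w[j] = 0.0
    return w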
In this embodiment, steps 202 to 208 provide a method of training a support-vector-machine-based product classification model; using this training method can improve operating efficiency.
With the above product classification method, through steps 102 to 108, the product text feature and product image feature of the product to be classified are first extracted, a product feature is then generated from them, and this product feature is used for classification to obtain the classification result. Because both the text feature and the image feature of the product are taken into account, classification accuracy is improved compared with classifying according to the product's text information alone; the improvement in classification accuracy also makes automatic product classification feasible, saving the labor cost of the product classification process.
In one embodiment, each sample text is stored in a corresponding sample document, with sample texts and sample documents in one-to-one correspondence. As shown in Fig. 3, step 102 specifically includes steps 302 to 308.
Step 302: segment the product text into words to obtain candidate words.
Word segmentation is the process of dividing a character sequence into individual words. Specifically, in one embodiment, the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) of the Institute of Computing Technology, Chinese Academy of Sciences, which is based on multi-layer hidden Markov models, can be used to perform Chinese word segmentation on the product text to obtain candidate words; its segmentation precision reaches 98.45%.
Step 304: screen out product feature words from the candidate words according to preset evaluation functions.
Specifically, an evaluation function is constructed to evaluate and score each candidate word, so that each candidate word obtains an evaluation value, also called a weight. All candidate words are then sorted by weight, and a predetermined number of the best candidates are extracted as the selected feature subset.
In one embodiment, step 304 specifically includes at least one of steps 21) to 25), and preferably includes all of steps 21) to 25):
21) Count the number of times each candidate word appears in the sample documents, and take the candidate words whose number of appearances is greater than or equal to a count threshold as product feature words.
In step 21), the preset evaluation function is the term frequency (TF) function. Specifically, all candidate words are traversed and the number of times each appears in the sample documents is obtained; a count threshold (such as 10) is set, candidate words whose number of appearances falls below the threshold, which contribute little to classification, are deleted, and the candidate words whose count is greater than or equal to the threshold are chosen as product feature words.
22) Calculate the proportion of sample documents containing each candidate word relative to the total number of sample documents, and take the candidate words whose proportion lies within a preset range as product feature words.
Specifically, the document frequency P_Γ of each candidate word Γ, namely the proportion of sample documents containing the candidate word relative to the total number of sample documents, is first calculated according to formula (3). Formula (3) is a preset evaluation function called the document frequency (DF) function.
P_Γ = n_Γ / n    Formula (3)
where n_Γ is the number of sample documents containing candidate word Γ, and n is the total number of sample documents.
A preset range, such as (0.005, 0.08), is set, and the candidate words Γ whose proportion lies within this range are screened out as product feature words.
23) Calculate the information gain weight of each candidate word, and take the candidate words whose information gain weight is greater than an information gain weight threshold as product feature words.
Specifically, the information gain weight IG(Γ_k) of each candidate word Γ_k is first calculated according to formula (4). Formula (4) is a preset evaluation function called the information gain (IG) function.
IG(Γ_k) = −Σ_{i=1}^{F} P(y_i) log P(y_i) + P(Γ_k) Σ_{i=1}^{F} P(y_i|Γ_k) log P(y_i|Γ_k) + P(Γ̄_k) Σ_{i=1}^{F} P(y_i|Γ̄_k) log P(y_i|Γ̄_k)    Formula (4)
where Γ_k denotes the k-th candidate word, y_i denotes a preset category, F denotes the number of preset categories, P(y_i) denotes the probability that a sample document of category y_i appears in the sample document set (the set of all sample documents), P(Γ_k) denotes the probability that a sample document containing candidate word Γ_k appears in the sample document set, P(y_i|Γ_k) denotes the conditional probability that a sample document belongs to category y_i given that it contains candidate word Γ_k, and P(y_i|Γ̄_k) denotes the conditional probability that a sample document belongs to category y_i given that it does not contain candidate word Γ_k.
An information gain weight threshold, such as 0.006, is set. After the information gain weight of each candidate word is obtained, the candidate words whose information gain weight exceeds this threshold are chosen as product feature words.
24) Calculate the mutual information value of each candidate word, and take the candidate words whose mutual information value is greater than a mutual information threshold as product feature words.
Specifically, the mutual information MI(Γ_k, y_i) between each candidate word Γ_k and each category y_i is first calculated according to formula (5).
MI(Γ_k, y_i) = log [ P(Γ_k, y_i) / ( P(Γ_k) P(y_i) ) ]    Formula (5)
Formula (5) can also be written as formula (6):
MI(Γ_k, y_i) = log P(Γ_k|y_i) − log P(Γ_k)    Formula (6)
where P(Γ_k, y_i) is the probability that a sample document in the sample document set contains candidate word Γ_k and belongs to preset category y_i, P(Γ_k) is the probability that candidate word Γ_k appears in the whole training sample set, P(y_i) is the probability that a sample document of category y_i appears in the whole sample document set, and P(Γ_k|y_i) is the conditional probability that candidate word Γ_k appears in a sample document of category y_i. Formula (5), equivalently (6), is a preset evaluation function called the mutual information (MI) function.
A mutual information threshold, such as 1.54, is set, and the candidate words whose mutual information exceeds the threshold are chosen as product feature words.
25) According to the probabilities of whether a candidate word appears in the training sample set and whether the documents belong to a preset category, calculate the degree of association between each candidate word and the preset categories, and take the candidate words whose degree of association is greater than an association threshold as product feature words.
Specifically, the degree of association CHI(Γ_k, y_i) between each candidate word Γ_k and preset category y_i is first calculated according to formula (7). Formula (7) is a preset evaluation function called the chi-square (CHI) goodness-of-fit function.
CHI(Γ_k, y_i) = n [ P(Γ_k, y_i) × P(Γ̄_k, ȳ_i) − P(Γ̄_k, y_i) × P(Γ_k, ȳ_i) ]² / [ P(Γ_k) × P(y_i) × P(Γ̄_k) × P(ȳ_i) ]    Formula (7)
where n is the total number of sample documents in the training sample set, P(Γ_k, y_i) is the probability that a sample document in the sample document set contains candidate word Γ_k and belongs to preset category y_i, P(Γ̄_k, ȳ_i) is the probability that a sample document neither contains candidate word Γ_k nor belongs to preset category y_i, P(Γ_k, ȳ_i) is the probability that a sample document contains candidate word Γ_k but does not belong to preset category y_i, and P(Γ̄_k, y_i) is the probability that a sample document does not contain candidate word Γ_k but belongs to preset category y_i.
An association threshold, such as 10, is set, and the candidate words whose degree of association exceeds this threshold are screened out as product feature words.
Through all of the above steps 21) to 25), five sets of product feature words can be generated, correspondingly producing five kinds of product text features, which markedly improves the ability of the product text features to describe the product to be classified and thus improves the accuracy of classification.
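As a hedged illustration of two of the five evaluation functions, the document frequency of formula (3) and the mutual information of formula (6) can be computed over a toy corpus as follows; the corpus, words, and categories are invented for the example:

```python
import math

# Toy corpus: (set of candidate words in the document, category label).
docs = [
    ({"phone", "screen", "battery"}, "electronics"),
    ({"phone", "camera"}, "electronics"),
    ({"shirt", "cotton"}, "clothing"),
    ({"shirt", "sleeve", "cotton"}, "clothing"),
]

def doc_frequency(word, docs):
    """Formula (3): fraction of sample documents containing the word."""
    return sum(1 for words, _ in docs if word in words) / len(docs)

def mutual_information(word, category, docs):
    """Formula (6): MI = log P(word | category) - log P(word)."""
    in_cat = [words for words, c in docs if c == category]
    p_w_given_c = sum(1 for words in in_cat if word in words) / len(in_cat)
    p_w = doc_frequency(word, docs)
    if p_w_given_c == 0:
        return float("-inf")
    return math.log(p_w_given_c) - math.log(p_w)
```

Words scoring above the chosen thresholds (the patent's examples: DF in (0.005, 0.08), MI above 1.54) would then be kept as feature words.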
In one embodiment, before step 304 the method also includes: filtering out candidate words that are included in a preset stop-word list. The candidate words may include words that interfere with classification, such as modal particles and auxiliary words. A stop-word list is therefore preset, and words that would interfere with classification are added to it; filtering out the candidate words included in the preset stop-word list avoids unnecessary computation and saves product classification time.
Step 306: calculate product feature word weights according to the frequency of each product feature word in a sample document, the total number of sample documents, and the number of sample documents containing the product feature word.
Specifically, after the product feature words have been screened out through steps 21) to 25) above (each step correspondingly generating one set of product feature words), the product feature word weight W_i of each set is calculated according to formula (8):
W_i = TF_i(Γ, d) × n / DF(Γ)    Formula (8)
where W_i is the product feature word weight of the i-th product feature word, TF_i(Γ, d) is the frequency with which product feature word Γ appears in sample document d, n is the total number of sample documents, and DF(Γ) is the number of documents containing product feature word Γ.
Step 308: generate the product text feature of the product to be classified from the product feature word weights.
Specifically, after the weight of each product feature word obtained in each of steps 21) to 25) is calculated according to formula (8), the product text can be converted into a vector whose dimensions correspond to the product feature words, the attribute value of each dimension being the weight of the corresponding product feature word. Each of steps 21) to 25) yields one vector, i.e., one product text feature; for a given product text, all of steps 21) to 25) together therefore yield five vectors, i.e., five kinds of product text features, which constitute the product text feature of the product to be classified. Adopting five kinds of product text features improves the accuracy of product classification.
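A minimal sketch of formula (8) and the vectorization of step 308, assuming TF_i(Γ, d) is the within-document frequency of the word (the patent says only "frequency"); the toy documents and feature words are invented:

```python
# Formula (8) as given: W_i = TF_i(word, doc) * n / DF(word).
docs = [
    ["phone", "screen", "phone"],
    ["phone", "camera"],
    ["shirt", "cotton"],
]
feature_words = ["phone", "screen", "shirt"]

def weight(word, doc, docs):
    tf = doc.count(word) / len(doc)          # frequency of the word in this doc
    df = sum(1 for d in docs if word in d)   # documents containing the word
    return tf * len(docs) / df

def text_feature(doc, feature_words, docs):
    """Product text as a vector with one dimension per feature word."""
    return [weight(w, doc, docs) for w in feature_words]

vec = text_feature(docs[0], feature_words, docs)
```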
In this embodiment, steps 302 to 308 extract from the product text of the product to be classified a product text feature that accurately represents the text, which helps classify the product to be classified correctly.
As shown in Fig. 4, in one embodiment, step 104 includes steps 402 to 408:
Step 402: segment the product image of the product to be classified into multiple image blocks of the same size, with overlapping parts between adjacent image blocks.
Specifically, the product to be classified corresponds to at least one product image, and image blocks are cut densely from each image, each sized 16 pixels wide by 16 pixels long. As shown in Fig. 5, during cutting the starting point is moved in steps of 8 pixels along both the horizontal and vertical directions of the product image, so that adjacent image blocks overlap and each picture is cut into many image blocks.
Step 404: extract the gradient histogram feature of each image block.
Specifically, step 404 includes steps 31) to 32):
Step 31): divide each image block into multiple image units of the same size without overlap.
Specifically, as shown in Fig. 6, each image block is divided into 4 equal parts horizontally and vertically, obtaining 16 image units C_i, i = 1, 2, …, 16.
Step 32): for each image unit, accumulate a gradient histogram over 8 directions, and splice the gradient histograms of the image units of each image block together to obtain the gradient histogram feature of that image block.
Specifically, the gradient magnitude M(a, b) and direction β(a, b) of each pixel in each image unit C_i are first calculated according to formulas (9) and (10):
M(a, b) = sqrt( (C_i(a+1, b) − C_i(a−1, b))² + (C_i(a, b+1) − C_i(a, b−1))² )    Formula (9)
β(a, b) = arctan( (C_i(a, b+1) − C_i(a, b−1)) / (C_i(a+1, b) − C_i(a−1, b)) )    Formula (10)
where M(a, b) is the gradient magnitude of the pixel, β(a, b) is its gradient direction, and a and b are respectively the abscissa and ordinate of the pixel in image unit C_i.
Then, according to the direction β(a, b) of each pixel in image unit C_i, the gradient magnitude M(a, b) of each pixel is added to the position of the 8-direction histogram vector h_i corresponding to that direction, thereby obtaining the gradient histogram feature h_i of the image unit. The gradient histogram features h_i of the 16 image units of each image block are then spliced together to obtain the gradient orientation histogram feature feat = (h_1, h_2, …, h_16) of the image block, where feat is a 128-dimensional feature vector.
In this embodiment, steps 31) to 32) extract the gradient histogram feature of each image block, which makes it convenient to generate the product image feature from these gradient histogram features.
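A hedged sketch of the per-unit 8-direction histogram of steps 31) to 32); `atan2` is used in place of the arctan ratio of formula (10) so that all 8 directions are distinguishable, border pixels are skipped to keep the central differences inside the unit, and the ramp image is an invented example:

```python
import math

def gradient_histogram(unit, bins=8):
    """8-direction gradient histogram of one image unit (formulas (9)-(10))."""
    h = [0.0] * bins
    rows, cols = len(unit), len(unit[0])
    for a in range(1, rows - 1):
        for b in range(1, cols - 1):
            dx = unit[a + 1][b] - unit[a - 1][b]
            dy = unit[a][b + 1] - unit[a][b - 1]
            mag = math.hypot(dx, dy)                  # formula (9)
            ang = math.atan2(dy, dx) % (2 * math.pi)  # formula (10), full circle
            h[int(ang / (2 * math.pi / bins)) % bins] += mag
    return h

# A unit with a uniform intensity ramp: all gradient energy lands in one bin.
unit = [[float(b) for b in range(6)] for _ in range(6)]
hist = gradient_histogram(unit)
```

Splicing 16 such 8-bin histograms yields the 128-dimensional block descriptor feat.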
Step 406: calculate the Euclidean distance between the gradient histogram feature of each image block and each cluster center in a cluster center set obtained by prior learning, and for each image block, find the nearest cluster center in the set and increment its count.
Step 408: generate the product image feature from the cluster centers counted for the gradient histogram features of the image blocks and the resulting counts.
First, the cluster center set needs to be learned in advance through steps 41) to 44):
Step 41): select a preset number of sample products for each preset category from the training sample set.
Specifically, from the sample products of the preset categories in the training sample set, a preset number of sample products is selected for each category. For example, if the training sample set contains sample products of F categories and M sample products are selected per category, M*F sample products are obtained in total.
Step 42): divide the sample product image corresponding to each selected sample product into multiple image sub-blocks of equal size, with overlap between adjacent image sub-blocks.
Step 43): extract the gradient histogram feature of each image sub-block.
Steps 42) and 43) segment image sub-blocks from the sample product images of the selected sample products and extract their gradient histogram features. This is essentially the same as segmenting image blocks from the product image of the product to be classified and extracting their gradient histogram features in steps 402 and 404 above; the only difference is the object being processed, so the details are not repeated here.
Step 44): cluster the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
Specifically, steps 41) to 43) yield a gradient histogram feature set FEAT = {feat_1, feat_2, …, feat_m} over the image sub-blocks, where m is the total number of image sub-blocks. With the preset number of cluster centers set to 1024, the feature set FEAT is clustered using the k-means clustering algorithm, producing 1024 cluster center points denoted Dict = {d_1, d_2, …, d_1024}; Dict is the learned cluster center set.
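Step 44) can be sketched with a plain Lloyd's-algorithm k-means over the sub-block features. This is a minimal illustration, not the patent's implementation: function name, iteration count, and initialization scheme are assumptions, and in practice k = 1024 would be used as in the text (a small k is shown for the demonstration).

```python
import numpy as np

def learn_dictionary(feats, k=1024, iters=20, seed=0):
    """Plain k-means over sub-block features; `feats` is an (m, d) array,
    returns the (k, d) cluster center set Dict."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest center (squared Euclidean distance).
        d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Move each center to the mean of its assigned features.
        for j in range(k):
            members = feats[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers
```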
Step 404 yields the set of gradient histogram features of the multiple image blocks of the product to be classified, Feat = {feat_1, feat_2, …, feat_s}, where s is the total number of image blocks.
Specifically, in steps 406 and 408, the product image feature is first initialized as an all-zero vector R = [r_1, r_2, …, r_1024] whose length equals the number of elements in the cluster center set. For each gradient histogram feature feat_i in the feature set Feat of the image blocks of the product to be classified, the Euclidean distance to every cluster center point is calculated, and the cluster center point in Dict nearest to feat_i is found according to formula (11):

min_idx = argmin_j ‖feat_i − d_j‖₂   Formula (11)

where min_idx is the position of the cluster center point nearest, in Euclidean distance, to the gradient histogram feature feat_i, and d_j ∈ Dict.
Then, the count of that cluster center point is incremented according to formula (12):

r[min_idx] = r[min_idx] + 1   Formula (12)
The operations in steps 406 and 408 above amount to letting the gradient histogram feature feat_i of each image block cast a vote for one element of the cluster center set Dict. The resulting vector R = [r_1, r_2, …, r_1024] is the generated product image feature.
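The voting of formulas (11) and (12) is the hard-assignment step of a bag-of-visual-words representation and can be sketched as below; the function name and toy data are illustrative only.

```python
import numpy as np

def product_image_feature(block_feats, dictionary):
    """Each block feature votes for its nearest cluster center (formula (11));
    the vote histogram R (formula (12)) is the product image feature."""
    R = np.zeros(len(dictionary), dtype=int)         # all-zero vector R
    for feat in block_feats:
        d2 = ((dictionary - feat) ** 2).sum(axis=1)  # squared Euclidean distances
        R[d2.argmin()] += 1                          # r[min_idx] = r[min_idx] + 1
    return R

dic = np.array([[0.0, 0.0], [10.0, 10.0]])
feats = np.array([[0.2, 0.1], [9.5, 10.2], [10.1, 9.9]])
print(product_image_feature(feats, dic))  # → [1 2]
```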
As shown in Figure 7, in one embodiment, step 202 includes steps 702 to 708:
Step 702: perform word segmentation on the sample text to obtain candidate words.
Step 704: select sample feature words from the candidate words according to preset evaluation functions.
In one embodiment, step 704 specifically includes at least one of steps 51) to 55), and preferably includes all of steps 51) to 55):
Step 51): count the number of times each candidate word occurs in the sample documents, and take the candidate words whose occurrence count exceeds a count threshold as sample feature words.
Specifically, all candidate words are traversed to obtain the number of times each occurs in the sample documents. A count threshold (for example, 10) is set; candidate words occurring fewer times than the threshold contribute little to classification and are deleted, while those occurring at least the threshold number of times are selected as sample feature words.
Step 52): calculate the proportion of sample documents containing each candidate word out of the total number of sample documents, and take the candidate words whose proportion falls within a preset range as sample feature words.
Specifically, the document frequency of each candidate word, namely the proportion of sample documents containing the word out of the total number of sample documents, is first calculated according to formula (3). A preset range, for example (0.005, 0.08), is set, and the candidate words whose document frequency falls within this range are selected as sample feature words.
Step 53): calculate the information gain weight of each candidate word, and take the candidate words whose information gain weight exceeds an information gain weight threshold as sample feature words.
Specifically, an information gain weight threshold, for example 0.006, is set. The information gain weight of each candidate word is calculated according to formula (4), and the candidate words whose information gain weight exceeds this threshold are taken as sample feature words.
Step 54): calculate the mutual information value of each candidate word, and take the candidate words whose mutual information value exceeds a mutual information threshold as sample feature words.
Specifically, a mutual information threshold, for example 1.54, is set. The mutual information value of each candidate word is calculated according to formula (5) or formula (6), and the candidate words whose mutual information value exceeds the threshold are taken as sample feature words.
Step 55): calculate the degree of association between each candidate word and the preset categories, based on the probabilities of whether the candidate word appears in the training sample set and whether it belongs to a preset category, and take the candidate words whose degree of association exceeds an association threshold as sample feature words.
Specifically, an association threshold, for example 10, is set. The degree of association of each candidate word is calculated by formula (7) from the probabilities of whether the candidate word appears in the training sample set and whether it belongs to a preset category, and the candidate words whose degree of association exceeds the threshold are taken as sample feature words.
Through all of the above steps 51) to 55), five groups of sample feature words can be generated, correspondingly producing five kinds of sample text features. This markedly improves the ability of the sample text features to describe the sample products, thereby improving classification accuracy.
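Two of the five evaluation functions — the occurrence count of step 51) and the document frequency of step 52) — can be sketched as below. Function and variable names are illustrative, and the demonstration thresholds are chosen for the tiny example rather than the values (10 and (0.005, 0.08)) given in the text.

```python
from collections import Counter

def select_feature_words(docs, count_threshold=10, df_range=(0.005, 0.08)):
    """docs: list of segmented documents (lists of words).
    Returns the words kept by step 51) and by step 52), as two sets."""
    counts = Counter(w for doc in docs for w in doc)        # total occurrences
    df = Counter(w for doc in docs for w in set(doc))       # documents containing w
    n = len(docs)
    by_count = {w for w, c in counts.items() if c >= count_threshold}   # step 51)
    by_df = {w for w, d in df.items()
             if df_range[0] < d / n < df_range[1]}                       # step 52)
    return by_count, by_df

docs = [["a", "b", "a"], ["a", "c"], ["b", "c"], ["c"]]
w51, w52 = select_feature_words(docs, count_threshold=2, df_range=(0.4, 0.6))
print(sorted(w51), sorted(w52))  # → ['a', 'b', 'c'] ['a', 'b']
```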
In one embodiment, before step 704, the method further includes a step of filtering out candidate words that appear in a preset stop-word list. The candidate words may include words that interfere with classification, such as modal particles and auxiliary words. A stop-word list is therefore prepared in advance and such interfering words are added to it; filtering out the candidate words that appear in the stop-word list avoids unnecessary computation and saves time in product classification.
Step 706: calculate a sample feature word weight for each sample feature word from the frequency with which it occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the word.
Specifically, after the sample feature words are selected through steps 51) to 55) above, the sample feature word weights of each group of sample feature words (each of steps 51) to 55) generates one group) are calculated according to formula (8) above.
Step 708: generate the sample text feature of the sample product from the sample feature word weights.
Specifically, after the sample feature word weights are calculated, each sample text can be converted into a vector whose dimensions correspond to the sample feature words, the value of each dimension being the weight of the corresponding sample feature word. All of steps 51) to 55) together yield five kinds of sample text features, and using all five improves the classification accuracy of the trained classification model.
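Steps 706 and 708 can be sketched with a TF-IDF-style weight, a common construction built from exactly the three quantities step 706 names (term frequency, total document count, and the number of documents containing the word). The patent's formula (8) is not reproduced here and may differ in detail; the smoothing below is an assumption.

```python
import math
from collections import Counter

def text_feature(doc, feature_words, docs):
    """Convert one segmented document into a vector with one dimension per
    feature word; each value is a TF-IDF-style weight (stand-in for formula (8))."""
    n = len(docs)
    tf = Counter(doc)
    vec = []
    for w in feature_words:
        df = sum(1 for d in docs if w in d)        # documents containing w
        idf = math.log((n + 1) / (df + 1)) + 1     # smoothed inverse document frequency
        vec.append(tf[w] * idf)                    # weight = term frequency x idf
    return vec

docs = [["a", "b"], ["a", "c"]]
print(text_feature(["a", "a", "b"], ["a", "b"], docs))  # → [2.0, 1.405...]
```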
As shown in Figure 8, in one embodiment, step 204 includes:
Step 802: segment the sample images of the sample products in the training sample set into multiple small image blocks of equal size, with overlap between adjacent small image blocks.
Specifically, each sample product corresponds to at least one sample image, and each image is densely cut into small image blocks of width 16 and height 16. As shown in Figure 5, during cutting the start point is moved in steps of 8 pixels in both the horizontal and vertical directions of the sample image, so adjacent small image blocks overlap and each image is cut into many small image blocks. This process is essentially the same as the segmentation into image blocks and image sub-blocks described above; the only difference is the source image being segmented.
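The dense cutting of step 802 — 16×16 blocks with an 8-pixel step in both directions — can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def dense_blocks(image, size=16, step=8):
    """Cut an image into overlapping size x size blocks, moving the cutting
    start point by `step` pixels horizontally and vertically."""
    h, w = image.shape
    return [image[y:y + size, x:x + size]
            for y in range(0, h - size + 1, step)
            for x in range(0, w - size + 1, step)]

blocks = dense_blocks(np.zeros((32, 32)))
print(len(blocks))  # → 9 (3 x 3 start positions for a 32x32 image)
```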
Step 804: extract the gradient histogram feature of each small image block.
Specifically, step 804 includes steps 61) and 62):
Step 61): divide each small image block into multiple sub-units of equal size without overlap.
Specifically, as shown in Figure 6, each small image block is divided into 4 equal parts horizontally and 4 equal parts vertically, obtaining 16 sub-units. The process of dividing sub-units is essentially the same as the division into image units described above, differing only in the object processed, and is not repeated here.
Step 62): compute an 8-direction gradient histogram feature for each sub-unit, and concatenate the gradient histogram features of the sub-units belonging to each small image block to obtain the gradient histogram feature of that small image block.
Specifically, the gradient magnitude and direction of each pixel in each sub-unit are first calculated according to formulas (9) and (10) above. Then, according to the direction of each pixel in the sub-unit, the gradient magnitude of that pixel is accumulated into the corresponding bin of a vector, yielding the gradient histogram feature of the sub-unit. The gradient histogram features of the sub-units of each small image block are then concatenated to obtain the gradient histogram feature of the small image block.
Step 806: calculate the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the cluster center set obtained by prior learning, and for each small image block, find the nearest cluster center in the set and increment its count.
Step 808: generate the sample image feature from the cluster centers counted for the gradient histogram features of the small image blocks and the resulting counts.
Specifically, in steps 806 and 808, each sample image feature is first initialized as an all-zero vector whose length equals the number of elements in the cluster center set. For each gradient histogram feature in the feature set of the small image blocks of a sample product, the Euclidean distance to every cluster center point in the set is calculated, the nearest cluster center point is found, and the count at the corresponding position of the initialized all-zero vector is incremented. The resulting vector is the generated sample image feature.
The principle of the above product classification method is described below with a concrete application scenario. Assume the training sample set includes five classes of e-commerce sample products from men's clothing — sweaters, T-shirts, overcoats, trousers, and shirts — with 300 products in each class. Each sample product corresponds to a sample document and at least one sample image describing it, and all the sample documents in the training sample set form a sample document set.
As shown in Figure 9, each sample document is segmented into candidate words, the candidate words appearing in the stop-word list are filtered out, and sample feature words are selected from the remaining candidate words according to five evaluation functions: word frequency, document frequency, information gain, mutual information, and the chi-square test. The sample feature word weight of each sample feature word in each group is then calculated, and from these weights five one-dimensional vectors, i.e., five kinds of sample text features, are obtained.
The sample images of the sample products are segmented into small image blocks, with overlap between adjacent small image blocks. Each small image block is divided into 16 sub-units, an 8-direction gradient histogram feature is computed for each sub-unit, and the gradient histogram features of the sub-units of each small image block are concatenated to obtain that block's gradient histogram feature. The Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the previously learned cluster center set is then calculated, the nearest cluster center is found and counted, and the sample image feature — a one-dimensional vector — is generated from the counted cluster centers and counts.
The sample text feature vectors and the sample image feature vector of each sample product are concatenated to obtain the sample feature of that sample product. A product classification model based on a support vector machine is then trained from the sample features of the sample products.
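The concatenate-and-train step can be sketched as below. The patent specifies a support vector machine but not a particular solver; the minimal hinge-loss sub-gradient trainer here is a stand-in under that assumption, and all names and the toy data are illustrative.

```python
import numpy as np

def sample_feature(text_feats, image_feat):
    """Concatenate the text feature vectors and the image feature vector
    into one sample feature vector."""
    return np.concatenate(list(text_feats) + [image_feat])

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1, seed=0):
    """Minimal linear SVM: hinge loss with L2 regularization, trained by
    sub-gradient descent; labels y are +1/-1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 1:          # margin violated: hinge gradient
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                  # only the regularizer acts
                w = (1 - lr * lam) * w
    return w, b

# Toy usage: two separable classes in the concatenated feature space.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.2, 2.9]])
y = np.array([-1, -1, 1, 1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # → [-1. -1.  1.  1.]
```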
The product to be classified corresponds to a product document and at least one product image. The product document is segmented into candidate words, the candidate words appearing in the stop-word list are filtered out, and product feature words are selected from the remaining candidate words according to the five evaluation functions: word frequency, document frequency, information gain, mutual information, and the chi-square test. The product feature word weight of each product feature word in each group is then calculated, and from these weights five one-dimensional vectors, i.e., five kinds of product text features, are obtained.
The product image of the product to be classified is segmented into image blocks, with overlap between adjacent image blocks. Each image block is divided into 16 image units, an 8-direction gradient histogram feature is computed for each image unit, and the gradient histogram features of the image units of each image block are concatenated to obtain that block's gradient histogram feature. The Euclidean distance between the gradient histogram feature of each image block and each cluster center in the previously learned cluster center set is then calculated, the nearest cluster center is found and counted, and the one-dimensional product image feature is generated from the counted cluster centers and counts.
The product text feature vectors and the product image feature vector are concatenated to obtain the product feature. As shown in Figure 10, the product feature is input into the trained product classification model, which outputs a category label as the classification result.
As shown in Figure 11, in one embodiment, a product classification apparatus is provided, which includes a product text feature extraction module 1120, a product image feature extraction module 1140, a product feature generation module 1160, and a classification module 1180.
The product text feature extraction module 1120 is configured to extract a product text feature from a product text describing a product to be classified.
The product image feature extraction module 1140 is configured to extract a product image feature from a product image of the product to be classified.
The product feature generation module 1160 is configured to generate the product feature of the product to be classified from the product text feature and the product image feature.
The classification module 1180 is configured to input the product feature of the product to be classified into a product classification model obtained by prior training, and obtain a classification result.
As shown in Figure 12, in one embodiment, the training sample set includes multiple sample products corresponding to preset categories, each sample product corresponding to a sample text and a sample image describing it. The product classification apparatus further includes a training module 1110, which includes a sample text feature extraction module 1112, a sample image feature extraction module 1114, a sample feature generation module 1116, and a training execution module 1118.
The sample text feature extraction module 1112 is configured to extract sample text features from the sample texts of the sample products in the training sample set.
The sample image feature extraction module 1114 is configured to extract sample image features from the sample images of the sample products in the training sample set.
The sample feature generation module 1116 is configured to generate sample features from the sample text features and the sample image features.
The training execution module 1118 is configured to train a product classification model based on a support vector machine from the sample features.
In one embodiment, the sample texts are correspondingly stored in sample documents. As shown in Figure 13, the product text feature extraction module 1120 includes a first word segmentation module 1122, a product feature word screening module 1124, a product feature word weight calculation module 1126, and a product text feature generation module 1128.
The first word segmentation module 1122 is configured to segment the product text into candidate words.
The product feature word screening module 1124 is configured to select product feature words from the candidate words according to preset evaluation functions.
The product feature word weight calculation module 1126 is configured to calculate product feature word weights from the frequency with which the product feature words occur in the sample documents, the total number of sample documents, and the number of sample documents containing the product feature words.
The product text feature generation module 1128 is configured to generate the product text feature of the product to be classified from the product feature word weights.
In one embodiment, the product text feature extraction module 1120 further includes a candidate word filtering module 1123 configured to filter out candidate words that appear in a preset stop-word list.
As shown in Figure 14, in one embodiment, the product feature word screening module 1124 includes at least one of a first screening module 1124a, a second screening module 1124b, a third screening module 1124c, a fourth screening module 1124d, and a fifth screening module 1124e.
The first screening module 1124a is configured to count the number of times each candidate word occurs in the sample documents and take the candidate words whose occurrence count is greater than or equal to a count threshold as product feature words.
The second screening module 1124b is configured to calculate the proportion of sample documents containing each candidate word out of the total number of sample documents, and take the candidate words whose proportion falls within a preset range as product feature words.
The third screening module 1124c is configured to calculate the information gain weight of each candidate word and take the candidate words whose information gain weight exceeds an information gain weight threshold as product feature words.
The fourth screening module 1124d is configured to calculate the mutual information value of each candidate word and take the candidate words whose mutual information value exceeds a mutual information threshold as product feature words.
The fifth screening module 1124e is configured to calculate the degree of association between each candidate word and the preset categories, based on the probabilities of whether the candidate word appears in the training sample set and whether it belongs to a preset category, and take the candidate words whose degree of association exceeds an association threshold as product feature words.
As shown in Figure 15, in one embodiment, the product image feature extraction module 1140 includes an image block segmentation module 1142, an image block feature extraction module 1144, a first statistics and counting module 1146, and a product image feature generation module 1148.
The image block segmentation module 1142 is configured to segment the product image of the product to be classified into multiple image blocks of equal size, with overlap between adjacent image blocks.
The image block feature extraction module 1144 is configured to extract the gradient histogram features of the image blocks.
The first statistics and counting module 1146 is configured to calculate the Euclidean distance between the gradient histogram feature of each image block and each cluster center in a cluster center set obtained by prior learning, and, for each image block, find the nearest cluster center in the set and increment its count.
The product image feature generation module 1148 is configured to generate the product image feature from the cluster centers counted for the gradient histogram features of the image blocks and the resulting counts.
In one embodiment, the image block feature extraction module 1144 includes an image unit division module 1144a and a first feature concatenation module 1144b.
The image unit division module 1144a is configured to divide each image block into multiple image units of equal size without overlap.
The first feature concatenation module 1144b is configured to compute an 8-direction gradient histogram feature for each image unit and concatenate the gradient histogram features of the image units belonging to each image block to obtain the gradient histogram feature of that image block.
As shown in Figure 16, in one embodiment, the sample text feature extraction module 1112 includes a second word segmentation module 1112a, a sample feature word screening module 1112c, a sample feature word weight calculation module 1112d, and a sample text feature generation module 1112e.
The second word segmentation module 1112a is configured to segment the sample text into candidate words.
The sample feature word screening module 1112c is configured to select sample feature words from the candidate words according to preset evaluation functions.
The sample feature word weight calculation module 1112d is configured to calculate sample feature word weights from the frequency with which the sample feature words occur in the sample documents, the total number of sample documents, and the number of sample documents containing the sample feature words.
The sample text feature generation module 1112e is configured to generate the sample text features of the sample products from the sample feature word weights.
In one embodiment, the sample text feature extraction module 1112 further includes a candidate word filtering module 1112b configured to filter out candidate words that appear in a preset stop-word list.
As shown in Figure 17, in one embodiment, the sample feature word screening module 1112c includes at least one of a count-based screening module 1112c1, a document-proportion-based screening module 1112c2, an information-gain-based screening module 1112c3, a mutual-information-based screening module 1112c4, and an association-degree-based screening module 1112c5.
The count-based screening module 1112c1 is configured to count the number of times each candidate word occurs in the sample documents and take the candidate words whose occurrence count exceeds a count threshold as sample feature words.
The document-proportion-based screening module 1112c2 is configured to calculate the proportion of sample documents containing each candidate word out of the total number of sample documents, and take the candidate words whose proportion falls within a preset range as sample feature words.
The information-gain-based screening module 1112c3 is configured to calculate the information gain weight of each candidate word and take the candidate words whose information gain weight exceeds an information gain weight threshold as sample feature words.
The mutual-information-based screening module 1112c4 is configured to calculate the mutual information value of each candidate word and take the candidate words whose mutual information value exceeds a mutual information threshold as sample feature words.
The association-degree-based screening module 1112c5 is configured to calculate the degree of association between each candidate word and the preset categories, based on the probabilities of whether the candidate word appears in the training sample set and whether it belongs to a preset category, and take the candidate words whose degree of association exceeds an association threshold as sample feature words.
As shown in Figure 18, in one embodiment, the sample image feature extraction module 1114 includes a small image block segmentation module 1114a, a small image block feature extraction module 1114b, a second statistics and counting module 1114c, and a sample image feature generation module 1114d.
The small image block segmentation module 1114a is configured to segment the sample images of the sample products in the training sample set into multiple small image blocks of equal size, with overlap between adjacent small image blocks.
The small image block feature extraction module 1114b is configured to extract the gradient histogram features of the small image blocks.
The second statistics and counting module 1114c is configured to calculate the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the cluster center set obtained by prior learning, and, for each small image block, find the nearest cluster center in the set and increment its count.
The sample image feature generation module 1114d is configured to generate the sample image features from the cluster centers counted for the gradient histogram features of the small image blocks and the resulting counts.
In one embodiment, the small image block feature extraction module 1114b includes a sub-unit division module 1114b1 and a second feature concatenation module 1114b2.
The sub-unit division module 1114b1 is configured to divide each small image block into multiple sub-units of equal size without overlap.
The second feature concatenation module 1114b2 is configured to compute an 8-direction gradient histogram feature for each sub-unit and concatenate the gradient histogram features of the sub-units belonging to each small image block to obtain the gradient histogram feature of that small image block.
As shown in Figure 19, in one embodiment, the product classification apparatus further includes a cluster center set acquisition module 1130, which includes a sample product selection module 1132, an image sub-block segmentation module 1134, a sub-image feature extraction module 1136, and a clustering module 1138.
The sample product selection module 1132 is configured to select a preset number of sample products for each preset category from the training sample set.
The image sub-block segmentation module 1134 is configured to divide the sample product images corresponding to the selected sample products into multiple image sub-blocks of equal size, with overlap between adjacent image sub-blocks.
The sub-image feature extraction module 1136 is configured to extract the gradient histogram features of the image sub-blocks.
The clustering module 1138 is configured to cluster the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent claims. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the invention, all of which fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (24)

1. A product classification method, the method comprising:
extracting product text features from product text describing a product to be classified;
extracting product image features from a product image of the product to be classified;
generating a product feature of the product to be classified from the product text features and the product image features;
inputting the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result;
wherein extracting product image features from the product image of the product to be classified comprises:
segmenting the product image of the product to be classified into multiple image blocks of the same size, with adjacent image blocks overlapping;
extracting the gradient histogram features of the image blocks;
calculating the Euclidean distance between the gradient histogram feature of each image block and each cluster center in a cluster center set obtained by prior learning, and counting, for each image block, the cluster center in the set nearest to its gradient histogram feature;
generating the product image features from the counted cluster centers and count results corresponding to the gradient histogram features of the image blocks.
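The image-feature steps of claim 1 amount to a bag-of-visual-words encoding: each block's gradient histogram votes for its nearest cluster center by Euclidean distance, and the accumulated votes form the image feature. A minimal NumPy sketch (function name and toy data are illustrative, not from the patent):

```python
import numpy as np

def encode_image(block_features, cluster_centers):
    """Bag-of-visual-words encoding: for each block's gradient-histogram
    feature, find the nearest cluster center by Euclidean distance and
    count how often each center is hit."""
    counts = np.zeros(len(cluster_centers), dtype=int)
    for f in block_features:
        dists = np.linalg.norm(cluster_centers - f, axis=1)  # Euclidean distances
        counts[np.argmin(dists)] += 1                        # vote for nearest center
    return counts  # histogram over cluster centers = image feature

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, 0.1], [0.9, 1.0], [1.1, 0.9]])
print(encode_image(feats, centers))  # → [1 2]
```

The feature length equals the cluster-center count, so images of different sizes map to fixed-length vectors.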
2. The method according to claim 1, wherein the training sample set includes multiple sample products for the preset categories, each sample product having a sample text and a sample image describing it; the method further comprises a step of training the product classification model, including:
extracting sample text features from the sample texts of the sample products in the training sample set;
extracting sample image features from the sample images of the sample products in the training sample set;
generating sample features from the sample text features and the sample image features;
training a support-vector-machine-based product classification model on the sample features.
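Claim 2's training step concatenates text and image features into one sample feature per product and fits a support vector machine. A hedged sketch using scikit-learn's LinearSVC as a stand-in (the patent names no library; all data here is toy):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-ins: text features (e.g. feature-word weights) and image
# features (cluster-center counts) concatenated into one sample feature.
text_feats  = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]])
image_feats = np.array([[3, 0], [2, 1], [0, 3], [1, 2]], dtype=float)
X = np.hstack([text_feats, image_feats])   # sample feature = text + image
y = np.array([0, 0, 1, 1])                 # preset category labels

clf = LinearSVC().fit(X, y)                # SVM-based classification model
print(clf.predict(np.hstack([[[0.95, 0.05]], [[3.0, 0.0]]])))
```

Classification of a product to be classified is then the same concatenation followed by `clf.predict`.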
3. The method according to claim 2, wherein the sample texts are stored in corresponding sample documents; extracting product text features from the product text describing the product to be classified comprises:
segmenting the product text into words to obtain candidate words;
selecting product feature words from the candidate words according to a preset evaluation function;
calculating product feature word weights from the frequency with which each product feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the product feature word;
generating the product text features of the product to be classified from the product feature word weights.
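The weight in claim 3 depends on a word's occurrence frequency, the total document count, and the count of documents containing the word — the ingredients of a TF-IDF weight. The claim does not fix the formula, so the classic `tf * log(N/df)` form is assumed below:

```python
import math

def tfidf(tf, n_docs, df):
    """TF-IDF-style weight from the three quantities in claim 3:
    tf: occurrences of the feature word, n_docs: total number of
    sample documents, df: documents containing the word.
    The exact formula is not given in the claim; the classic
    tf * log(N/df) form is a plausible stand-in."""
    return tf * math.log(n_docs / df)

print(round(tfidf(tf=3, n_docs=100, df=10), 4))  # → 6.9078
```

Words common across all documents get `log(N/df)` near zero, so they contribute little to the text feature vector.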
4. The method according to claim 3, wherein before selecting product feature words from the candidate words according to the preset evaluation function, the method further comprises:
filtering out candidate words that appear in a preset stop-word list.
5. The method according to claim 3, wherein selecting product feature words from the candidate words according to the preset evaluation function comprises:
counting the number of times each candidate word occurs in the sample documents, and taking candidate words occurring at least a frequency threshold number of times as product feature words; and/or
calculating the proportion of sample documents containing each candidate word to the total number of sample documents, and taking candidate words whose proportion falls within a preset range as product feature words; and/or
calculating the information gain weight of each candidate word, and taking candidate words whose information gain weight exceeds an information gain weight threshold as product feature words; and/or
calculating the mutual information value of each candidate word, and taking candidate words whose mutual information value exceeds a mutual information threshold as product feature words; and/or
calculating the degree of association between each candidate word and the preset categories from the probabilities of whether the candidate word occurs in the training sample set and whether it belongs to a preset category, and taking candidate words whose degree of association exceeds an association threshold as product feature words.
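One of the evaluation functions above is the mutual information between a candidate word and a category. A small sketch under the standard pointwise-mutual-information estimate (the claim leaves the exact formula open, so the counts and formula here are assumptions):

```python
import math

def mutual_information(n_wc, n_w, n_c, n_total):
    """Pointwise mutual information between a candidate word and a
    category, one of the evaluation functions listed in claim 5.
    n_wc: category-c documents containing word w, n_w: documents
    containing w, n_c: category-c documents, n_total: all documents."""
    p_wc = n_wc / n_total
    p_w = n_w / n_total
    p_c = n_c / n_total
    return math.log(p_wc / (p_w * p_c))  # log P(w,c) / (P(w) P(c))

# A word occurring mostly inside one category scores high:
print(round(mutual_information(n_wc=40, n_w=50, n_c=100, n_total=500), 4))  # → 1.3863
```

Candidate words whose score exceeds the chosen threshold would be kept as feature words.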
6. The method according to claim 1, wherein extracting the gradient histogram features of the image blocks comprises:
dividing each image block into multiple non-overlapping image units of the same size;
computing an 8-direction gradient histogram over each image unit, and concatenating the gradient histogram features of the image units belonging to each image block to obtain the gradient histogram feature of that image block.
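The 8-direction histograms of claim 6 can be sketched as follows: each unit accumulates magnitude-weighted orientation votes, and the unit histograms are concatenated per block. This is an illustrative simplification (finite-difference gradients, assumed unit size), not the patent's exact implementation:

```python
import numpy as np

def unit_histogram(patch, n_bins=8):
    """8-direction gradient histogram for one image unit: each pixel
    votes its gradient magnitude into one of 8 orientation bins."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)               # [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m                                     # magnitude-weighted vote
    return hist

def block_feature(block, unit=8):
    """Split a block into non-overlapping unit x unit cells and
    concatenate their 8-bin histograms, as in claim 6."""
    h, w = block.shape
    hists = [unit_histogram(block[i:i + unit, j:j + unit])
             for i in range(0, h, unit) for j in range(0, w, unit)]
    return np.concatenate(hists)

block = np.tile(np.arange(16, dtype=float), (16, 1))  # horizontal intensity ramp
print(block_feature(block).shape)  # 4 cells * 8 bins → (32,)
```

A 16x16 block with 8x8 units yields a fixed 32-dimensional descriptor regardless of image content.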
7. The method according to claim 2, wherein the sample texts are stored in corresponding sample documents; extracting sample text features from the sample texts of the sample products in the training sample set comprises:
segmenting the sample texts into words to obtain words to be selected;
selecting sample feature words from the words to be selected according to a preset evaluation function;
calculating sample feature word weights from the frequency with which each sample feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the sample feature word;
generating the sample text features of the sample products from the sample feature word weights.
8. The method according to claim 7, wherein before selecting sample feature words from the words to be selected according to the preset evaluation function, the method further comprises:
filtering out words to be selected that appear in a preset stop-word list.
9. The method according to claim 7, wherein selecting sample feature words from the words to be selected according to the preset evaluation function comprises:
counting the number of times each word to be selected occurs in the sample documents, and taking words to be selected occurring more than a frequency threshold number of times as sample feature words;
calculating the proportion of sample documents containing each word to be selected to the total number of sample documents, and taking words to be selected whose proportion falls within a preset range as sample feature words;
calculating the information gain weight of each word to be selected, and taking words to be selected whose information gain weight exceeds an information gain weight threshold as sample feature words;
calculating the mutual information value of each word to be selected, and taking words to be selected whose mutual information value exceeds a mutual information threshold as sample feature words; and/or
calculating the degree of association between each word to be selected and the preset categories from the probabilities of whether the word occurs in the training sample set and whether it belongs to a preset category, and taking words to be selected whose degree of association exceeds an association threshold as sample feature words.
10. The method according to claim 2, wherein extracting sample image features from the sample images of the sample products in the training sample set comprises:
segmenting the sample images of the sample products in the training sample set into multiple small image blocks of the same size, with adjacent small image blocks overlapping;
extracting the gradient histogram features of the small image blocks;
calculating the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the cluster center set obtained by prior learning, and counting, for each small image block, the cluster center in the set nearest to its gradient histogram feature;
generating the sample image features from the counted cluster centers and count results corresponding to the gradient histogram features of the small image blocks.
11. The method according to claim 10, wherein extracting the gradient histogram features of the small image blocks comprises:
dividing each small image block into multiple non-overlapping sub-units of the same size;
computing an 8-direction gradient histogram over each sub-unit, and concatenating the gradient histogram features of the sub-units belonging to each small image block to obtain the gradient histogram feature of that small image block.
12. The method according to claim 1 or 10, wherein the method further comprises a step of learning the cluster center set, including:
selecting a preset number of sample products for each preset category from the training sample set;
dividing the sample images corresponding to the selected sample products into multiple image sub-blocks of the same size, with adjacent image sub-blocks overlapping;
extracting the gradient histogram features of the image sub-blocks;
clustering the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
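Claim 12 learns the cluster center set from sub-block features. The clustering algorithm is unnamed; k-means is the usual choice for building such visual vocabularies, and is sketched here self-contained (toy data, illustrative function names):

```python
import numpy as np

def learn_centers(features, k, iters=20, seed=0):
    """Plain k-means over the sub-blocks' gradient-histogram features,
    yielding the cluster center set of claim 12. K-means is assumed;
    the claim only requires clustering into k centers."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # distance of every feature to every center, shape (n, k)
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)                 # nearest center per feature
        for j in range(k):
            if np.any(labels == j):               # keep stale center if cluster empties
                centers[j] = features[labels == j].mean(axis=0)
    return centers

feats = np.vstack([np.zeros((10, 8)), np.ones((10, 8))])  # two obvious groups
centers = learn_centers(feats, k=2)
print(sorted(float(c[0]) for c in centers))  # → [0.0, 1.0]
```

The resulting `centers` array is exactly what the counting step of claims 1 and 10 consumes.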
13. A product classification apparatus, wherein the apparatus comprises:
a product text feature extraction module, configured to extract product text features from product text describing a product to be classified;
a product image feature extraction module, configured to extract product image features from a product image of the product to be classified;
a product feature generation module, configured to generate a product feature of the product to be classified from the product text features and the product image features;
a classification module, configured to input the product feature of the product to be classified into a pre-trained product classification model to obtain a classification result;
wherein the product image feature extraction module comprises:
an image block segmentation module, configured to segment the product image of the product to be classified into multiple image blocks of the same size, with adjacent image blocks overlapping;
an image block feature extraction module, configured to extract the gradient histogram features of the image blocks;
a first statistics and counting module, configured to calculate the Euclidean distance between the gradient histogram feature of each image block and each cluster center in a cluster center set obtained by prior learning, and to count, for each image block, the cluster center in the set nearest to its gradient histogram feature;
a product image feature generation module, configured to generate the product image features from the counted cluster centers and count results corresponding to the gradient histogram features of the image blocks.
14. The apparatus according to claim 13, wherein the training sample set includes multiple sample products for the preset categories, each sample product having a sample text and a sample image describing it; the apparatus further comprises a training module, including:
a sample text feature extraction module, configured to extract sample text features from the sample texts of the sample products in the training sample set;
a sample image feature extraction module, configured to extract sample image features from the sample images of the sample products in the training sample set;
a sample feature generation module, configured to generate sample features from the sample text features and the sample image features;
a training execution module, configured to train a support-vector-machine-based product classification model on the sample features.
15. The apparatus according to claim 14, wherein the sample texts are stored in corresponding sample documents; the product text feature extraction module comprises:
a first word segmentation module, configured to segment the product text into words to obtain candidate words;
a product feature word selection module, configured to select product feature words from the candidate words according to a preset evaluation function;
a product feature word weight calculation module, configured to calculate product feature word weights from the frequency with which each product feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the product feature word;
a product text feature generation module, configured to generate the product text features of the product to be classified from the product feature word weights.
16. The apparatus according to claim 15, wherein the product text feature extraction module further comprises a candidate word filtering module, configured to filter out candidate words that appear in a preset stop-word list.
17. The apparatus according to claim 15, wherein the product feature word selection module comprises at least one of a first selection module, a second selection module, a third selection module, a fourth selection module and a fifth selection module:
the first selection module is configured to count the number of times each candidate word occurs in the sample documents, and to take candidate words occurring at least a frequency threshold number of times as product feature words;
the second selection module is configured to calculate the proportion of sample documents containing each candidate word to the total number of sample documents, and to take candidate words whose proportion falls within a preset range as product feature words;
the third selection module is configured to calculate the information gain weight of each candidate word, and to take candidate words whose information gain weight exceeds an information gain weight threshold as product feature words;
the fourth selection module is configured to calculate the mutual information value of each candidate word, and to take candidate words whose mutual information value exceeds a mutual information threshold as product feature words;
the fifth selection module is configured to calculate the degree of association between each candidate word and the preset categories from the probabilities of whether the candidate word occurs in the training sample set and whether it belongs to a preset category, and to take candidate words whose degree of association exceeds an association threshold as product feature words.
18. The apparatus according to claim 13, wherein the image block feature extraction module comprises:
an image unit division module, configured to divide each image block into multiple non-overlapping image units of the same size;
a first feature concatenation module, configured to compute an 8-direction gradient histogram over each image unit, and to concatenate the gradient histogram features of the image units belonging to each image block to obtain the gradient histogram feature of that image block.
19. The apparatus according to claim 14, wherein the sample texts are stored in corresponding sample documents; the sample text feature extraction module comprises:
a second word segmentation module, configured to segment the sample texts into words to obtain words to be selected;
a sample feature word selection module, configured to select sample feature words from the words to be selected according to a preset evaluation function;
a sample feature word weight calculation module, configured to calculate sample feature word weights from the frequency with which each sample feature word occurs in the sample documents, the total number of sample documents, and the number of sample documents containing the sample feature word;
a sample text feature generation module, configured to generate the sample text features of the sample products from the sample feature word weights.
20. The apparatus according to claim 19, wherein the sample text feature extraction module further comprises a word-to-be-selected filtering module, configured to filter out words to be selected that appear in a preset stop-word list.
21. The apparatus according to claim 19, wherein the sample feature word selection module comprises at least one of a frequency-based selection module, a document-proportion-based selection module, an information-gain-weight-based selection module, a mutual-information-based selection module and an association-degree-based selection module:
the frequency-based selection module is configured to count the number of times each word to be selected occurs in the sample documents, and to take words to be selected occurring more than a frequency threshold number of times as sample feature words;
the document-proportion-based selection module is configured to calculate the proportion of sample documents containing each word to be selected to the total number of sample documents, and to take words to be selected whose proportion falls within a preset range as sample feature words;
the information-gain-weight-based selection module is configured to calculate the information gain weight of each word to be selected, and to take words to be selected whose information gain weight exceeds an information gain weight threshold as sample feature words;
the mutual-information-based selection module is configured to calculate the mutual information value of each word to be selected, and to take words to be selected whose mutual information value exceeds a mutual information threshold as sample feature words;
the association-degree-based selection module is configured to calculate the degree of association between each word to be selected and the preset categories from the probabilities of whether the word occurs in the training sample set and whether it belongs to a preset category, and to take words to be selected whose degree of association exceeds an association threshold as sample feature words.
22. The apparatus according to claim 14, wherein the sample image feature extraction module comprises:
a small image block segmentation module, configured to segment the sample images of the sample products in the training sample set into multiple small image blocks of the same size, with adjacent small image blocks overlapping;
a small image block feature extraction module, configured to extract the gradient histogram features of the small image blocks;
a second statistics and counting module, configured to calculate the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the cluster center set obtained by prior learning, and to count, for each small image block, the cluster center in the set nearest to its gradient histogram feature;
a sample image feature generation module, configured to generate the sample image features from the counted cluster centers and count results corresponding to the gradient histogram features of the small image blocks.
23. The apparatus according to claim 22, wherein the small image block feature extraction module comprises:
a sub-unit division module, configured to divide each small image block into multiple non-overlapping sub-units of the same size;
a second feature concatenation module, configured to compute an 8-direction gradient histogram over each sub-unit, and to concatenate the gradient histogram features of the sub-units belonging to each small image block to obtain the gradient histogram feature of that small image block.
24. The apparatus according to claim 13 or 22, wherein the apparatus further comprises a cluster center set acquisition module, including:
a sample product selection module, configured to select a preset number of sample products for each preset category from the training sample set;
an image sub-block segmentation module, configured to divide the sample images corresponding to the selected sample products into multiple image sub-blocks of the same size, with adjacent image sub-blocks overlapping;
a sub-image feature extraction module, configured to extract the gradient histogram features of the image sub-blocks;
a clustering module, configured to cluster the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
CN201310692950.0A 2013-12-16 2013-12-16 Product classification method and apparatus Active CN103699523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310692950.0A CN103699523B (en) 2013-12-16 2013-12-16 Product classification method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310692950.0A CN103699523B (en) 2013-12-16 2013-12-16 Product classification method and apparatus

Publications (2)

Publication Number Publication Date
CN103699523A CN103699523A (en) 2014-04-02
CN103699523B true CN103699523B (en) 2016-06-29

Family

ID=50361054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310692950.0A Active CN103699523B (en) 2013-12-16 2013-12-16 Product classification method and apparatus

Country Status (1)

Country Link
CN (1) CN103699523B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095396A (en) * 2015-07-03 2015-11-25 北京京东尚科信息技术有限公司 Model establishment method, quality assessment method and device
WO2017113232A1 (en) * 2015-12-30 2017-07-06 中国科学院深圳先进技术研究院 Product classification method and apparatus based on deep learning
CN105824889A (en) * 2016-03-11 2016-08-03 杨晟志 Classification method based on virtual map
CN105824512B (en) * 2016-03-11 2019-10-01 杨晟志 A kind of catalogue interactive system based on virtual map
CN107346433B (en) * 2016-05-06 2020-09-18 华为技术有限公司 Text data classification method and server
CN106021350A (en) * 2016-05-10 2016-10-12 湖北工程学院 An artwork collection and management method and an artwork collection and management system
CN106250398B (en) * 2016-07-19 2020-03-27 北京京东尚科信息技术有限公司 Method and device for classifying and judging complaint content of complaint event
CN107784372B (en) * 2016-08-24 2022-02-22 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN106919954A (en) * 2017-03-02 2017-07-04 深圳明创自控技术有限公司 A kind of cloud computing system for commodity classification
CN107133208B (en) * 2017-03-24 2021-08-24 南京柯基数据科技有限公司 Entity extraction method and device
CN107220875B (en) * 2017-05-25 2020-09-22 黄华 Electronic commerce platform with good service
CN107194739B (en) * 2017-05-25 2018-10-26 广州百奕信息科技有限公司 A kind of intelligent recommendation system based on big data
CN109241379A (en) * 2017-07-11 2019-01-18 北京交通大学 A method of across Modal detection network navy
CN108256549B (en) * 2017-12-13 2019-03-15 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN107977794B (en) * 2017-12-14 2021-09-17 方物语(深圳)科技文化有限公司 Data processing method and device for industrial product, computer equipment and storage medium
CN109035630A (en) * 2018-08-21 2018-12-18 深圳码隆科技有限公司 Commodity information identification method and system
CN110852329B (en) * 2019-10-21 2021-06-15 南京航空航天大学 Method for defining product appearance attribute
CN114375465A (en) * 2019-11-05 2022-04-19 深圳市欢太科技有限公司 Picture classification method and device, storage medium and electronic equipment
CN111368926B (en) * 2020-03-06 2021-07-06 腾讯科技(深圳)有限公司 Image screening method, device and computer readable storage medium
TWI754972B (en) * 2020-06-23 2022-02-11 財團法人亞洲大學 Image verification method and real-time product verification system
CN112101018B (en) * 2020-08-05 2024-03-12 北京工联科技有限公司 Method and system for calculating new words in text based on word frequency matrix feature vector
CN113570427A (en) * 2021-07-22 2021-10-29 上海普洛斯普新数字科技有限公司 System for extracting and identifying on-line or system commodity characteristic information
CN113962773A (en) * 2021-10-22 2022-01-21 广州华多网络科技有限公司 Same-style commodity polymerization method and device, equipment, medium and product thereof

Citations (1)

Publication number Priority date Publication date Assignee Title
CN101315663A (en) * 2008-06-25 2008-12-03 中国人民解放军国防科学技术大学 Nature scene image classification method based on area dormant semantic characteristic

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8768050B2 (en) * 2011-06-13 2014-07-01 Microsoft Corporation Accurate text classification through selective use of image data

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN101315663A (en) * 2008-06-25 2008-12-03 中国人民解放军国防科学技术大学 Nature scene image classification method based on area dormant semantic characteristic

Non-Patent Citations (4)

Title
Segmentation as Selective Search for Object Recognition; Koen E.A. van de Sande et al.; 2011 IEEE International Conference on Computer Vision; 2011-11-13; pp. 1879-1886 *
Research on motion representation of local spatio-temporal features in action recognition; Lei Qing et al.; Computer Engineering and Applications; 2010-12-31; Vol. 46, No. 34; pp. 7-10 *
Research on feature selection methods in text classification; Song Liping; China Master's Theses Full-text Database, Information Science and Technology; 2011-12-15; No. S2; pp. 5-20, 41-53 *
Research on feature selection techniques for text classification; Zheng Wei; China Master's Theses Full-text Database, Information Science and Technology; 2009-02-15; No. 02; pp. 4-34 *

Also Published As

Publication number Publication date
CN103699523A (en) 2014-04-02

Similar Documents

Publication Publication Date Title
CN103699523B (en) Product classification method and apparatus
US11715313B2 (en) Apparatus and methods for extracting data from lineless table using delaunay triangulation and excess edge removal
Chen et al. A survey of document image classification: problem statement, classifier architecture and performance evaluation
CN105824802A (en) Method and device for acquiring knowledge graph vectoring expression
CN106598920B (en) A kind of nearly word form classification method of stroke coding combination Chinese character dot matrix
CN102156871B (en) Image classification method based on category correlated codebook and classifier voting strategy
CN109063649B (en) Pedestrian re-identification method based on twin pedestrian alignment residual error network
CN107291723A (en) The method and apparatus of web page text classification, the method and apparatus of web page text identification
CN107871144A (en) Invoice trade name sorting technique, system, equipment and computer-readable recording medium
CN103258037A (en) Trademark identification searching method for multiple combined contents
CN103473545B (en) A kind of text image method for measuring similarity based on multiple features
CN102193936A (en) Data classification method and device
CN105139041A (en) Method and device for recognizing languages based on image
CN102855492A (en) Classification method based on mineral flotation foam image
CN104008375A (en) Integrated human face recognition mehtod based on feature fusion
CN106844481B (en) Font similarity and font replacement method
CN105787488A (en) Image feature extraction method and device realizing transmission from whole to local
CN103593674A (en) Cervical lymph node ultrasonoscopy feature selection method
CN104156690A (en) Gesture recognition method based on image space pyramid bag of features
CN105447492A (en) Image description method based on 2D local binary pattern
CN107895117A (en) Malicious code mask method and device
CN102768732A (en) Face recognition method integrating sparse preserving mapping and multi-class property Bagging
CN106022359A (en) Fuzzy entropy space clustering analysis method based on orderly information entropy
CN103258186A (en) Integrated face recognition method based on image segmentation
CN102306179B (en) Image content retrieval method based on hierarchical color distribution descriptor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant