CN103294798B - Commodity automatic classification method based on binary word segmentation and support vector machine - Google Patents

Info

Publication number
CN103294798B
Authority
CN
China
Prior art keywords
commodity
classification
word segmentation
binary word
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310201322.8A
Other languages
Chinese (zh)
Other versions
CN103294798A (en)
Inventor
许大伦
毛颖
张立群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lele Kaihang (Beijing) Education Technology Co., Ltd.
Original Assignee
BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310201322.8A
Publication of CN103294798A
Application granted
Publication of CN103294798B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically classifying commodities based on binary word segmentation and a support vector machine. The method mainly comprises: performing binary word segmentation on all commodity titles in a training set to construct a feature dictionary; constructing a commodity classification set and, at the same time, representing each commodity title as a feature vector according to the feature dictionary, generating training data from the feature vector and the category to which the commodity belongs, and optimizing parameters on the training data with a sequential dual method to obtain optimal classification vectors; and computing the inner product of each optimal classification vector with the feature vector representing the title of a commodity to be classified, and selecting the category corresponding to the largest inner product as the category of that commodity. The invention addresses the problems in the prior art that a product feature information base is difficult to build and that, because of how the feature space is constructed, automatic commodity classification methods train slowly and perform poorly.

Description

Commodity automatic classification method based on binary word segmentation and support vector machine
Technical field
The present invention relates to the field of data mining and, more specifically, to an automatic commodity classification method based on binary word segmentation and a support vector machine (Support Vector Machine, SVM, a classification algorithm that learns automatically).
Background technology
Data mining generally refers to the process of automatically searching large amounts of data for hidden information with particular relationships. Classification is an important step in data mining.
With the rapid development of electronic information technology, data mining has reached every field. In e-commerce in particular, an efficient automatic commodity classification method is crucial for managing the massive volume of commodity information. Many automatic classification methods currently exist, for example: decision tree methods based on logical rules, naive Bayes or Bayesian network methods based on statistical correlation, neural network methods based on the perceptron, the k-nearest-neighbor method based on instance-based learning, and support vector machine methods based on vector spaces. According to reports in the literature, the classification accuracy of these common methods is about 80%.
In the prior art, the support vector machine method is widely used because of its fast classification speed and high accuracy.
However, the effectiveness of this method in practice depends mainly on the construction of the feature space. If the feature space is too small, the data become linearly inseparable and a non-linear kernel function must be used, which leads to problems such as long training times and unsatisfactory results.
Meanwhile, the Chinese title of a commodity contains many kinds of feature information (such as manufacturer brand, product name, specification and model, and price) whose relevance to the commodity's category varies. In theory, treating them differently would help improve classification accuracy. However, because the volume of information is huge, building and maintaining such a product feature information base is very costly and computationally expensive, and it is hard to do in practice.
Therefore, how to overcome the difficulty of building a product feature information base in the prior art, and the long training time and unsatisfactory results that the construction of the feature space causes in automatic commodity classification methods, has become a technical problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic commodity classification method based on binary word segmentation and a support vector machine, so as to solve the problems in the prior art that a product feature information base is difficult to build and that the construction of the feature space causes long training times and unsatisfactory results.
To solve the above technical problem, the invention provides an automatic commodity classification method based on binary word segmentation and a support vector machine, characterized in that it comprises:
performing binary word segmentation on all commodity titles in a training set to construct a feature dictionary;
constructing a commodity classification set and, at the same time, representing each commodity title as a feature vector according to the feature dictionary; generating training data from the feature vector and the category to which the commodity belongs; and optimizing parameters on the training data with a sequential dual method to obtain optimal classification vectors;
computing the inner product of each optimal classification vector with the feature vector representing the title of a commodity to be classified, and selecting the category corresponding to the largest inner product as the category of that commodity.
Preferably, performing binary word segmentation on the commodity titles to construct the feature dictionary further comprises: performing binary word segmentation on all commodity titles in the training set, counting word frequencies, and selecting the words with higher frequency to construct the feature dictionary.
Preferably, the training set further comprises all commodity titles of a given e-commerce website, and the feature dictionary further comprises the feature words, obtained through binary word segmentation, that reflect commodity information.
Preferably, representing a commodity title as a feature vector according to the feature dictionary further comprises: representing the occurrence counts of the feature words obtained by binary word segmentation of any commodity title in the training set as an n-dimensional vector.
Preferably, computing the inner product of the optimal classification vector with the feature vector representing the title of the commodity to be classified further comprises: representing the occurrence counts of the feature words obtained by binary word segmentation of the title to be classified as an n-dimensional vector, and computing the inner product of this n-dimensional vector with the optimal classification vector.
Compared with the prior art, the automatic commodity classification method based on binary word segmentation and a support vector machine of the present invention achieves the following effects:
1) The invention applies binary word segmentation to commodity titles, which makes the feature information base much easier to build.
2) The invention uses feature words to represent each commodity title as a feature vector in the feature space, which greatly improves the separability of commodities and thereby effectively solves the problem that the construction of the feature space causes long training times and unsatisfactory results in automatic commodity classification methods.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flow diagram of the automatic commodity classification method based on binary word segmentation and a support vector machine according to an embodiment of the invention.
Detailed description of the invention
Certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will understand that hardware manufacturers may refer to the same component by different names. This description and the claims do not distinguish components by differences in name but by differences in function. The term "comprising" as used throughout the description and claims is open-ended and should be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, a person skilled in the art can solve the stated technical problem and basically achieve the stated technical effect. In addition, the term "coupled" here covers any direct or indirect means of electrical coupling. Therefore, if the text says that a first device is coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows sets out preferred embodiments of the invention; it is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the invention is defined by the appended claims.
The invention is described in further detail below with reference to the drawing, but this description does not limit the invention.
Fig. 1 shows the flow of the automatic commodity classification method based on binary word segmentation and a support vector machine according to an embodiment of the invention.
Step 101: perform binary word segmentation on all commodity titles in the training set to construct a feature dictionary.
Here, the training set may also be called the commodity title set and contains all commodity titles of a given e-commerce website. The feature dictionary may also be called the feature information base and contains the feature words, obtained through binary word segmentation, that reflect commodity information.
Further, constructing the feature information base by binary word segmentation of commodity titles specifically means: perform binary word segmentation on all commodity titles in the training set, count word frequencies, and select the words with higher frequency to construct the feature dictionary.
Further, step 101 is specifically as follows:
First, let a commodity title be L, of the form $C_1 C_2 C_3 \dots C_{k-1} C_k$, where each $C_i$ is one Chinese character or one English word and $k$ is the length of the title.
Next, perform binary word segmentation on title L to obtain the word set $\{C_1C_2, C_2C_3, \dots, C_{k-1}C_k\}$. Each pair $C_iC_{i+1}$ in this set is treated as one word, denoted W.
Next, traverse all commodity titles in the training set and count the number of occurrences Count(W) of each word W.
Then, set a threshold $C_T$. If $\mathrm{Count}(W) \ge C_T$ (that is, word W occurs at least as many times as the threshold $C_T$), W is a feature word.
All feature words obtained in this way form the feature dictionary $\{W_1, W_2, \dots, W_n\}$.
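The sub-steps of step 101 can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names, the toy titles, and the threshold value are all hypothetical, and each title is assumed to be already split into its units $C_i$ (one Chinese character or one English word each). A two-unit word W is stored as a tuple so that English words are not run together.

```python
from collections import Counter

def binary_segment(tokens):
    """Return the overlapping pairs (C1,C2), (C2,C3), ..., (C(k-1),Ck)."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def build_feature_dictionary(titles, threshold):
    """Count every two-unit word over all titles; keep those with Count(W) >= threshold."""
    counts = Counter()
    for tokens in titles:
        counts.update(binary_segment(tokens))
    # Sorting only makes the feature order reproducible; the source does not prescribe an order.
    return sorted(word for word, count in counts.items() if count >= threshold)

# Hypothetical toy training set, already tokenized into units C_i.
titles = [
    ["red", "cotton", "dress"],
    ["blue", "cotton", "dress"],
    ["red", "running", "shoes"],
]
feature_words = build_feature_dictionary(titles, threshold=2)
print(feature_words)
```

With this toy data only the pair that occurs in two titles survives the threshold, which mirrors how rare words are dropped from the dictionary.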
Step 102: construct a commodity classification set and, at the same time, represent each commodity title as a feature vector according to the feature dictionary; generate training data from the feature vector and the category to which the commodity belongs; and optimize parameters on the training data with a sequential dual method to obtain optimal classification vectors.
Further, representing a commodity title as a feature vector according to the feature dictionary specifically means: for any commodity title $L_i$ in the training set, the occurrence counts of the feature words W obtained by binary word segmentation are represented as an n-dimensional vector.
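As a rough sketch of this representation, the snippet below maps one tokenized title to an n-dimensional count vector, where position j holds the number of times feature word $W_j$ occurs among the title's two-unit words. The function names and the three-entry feature dictionary are hypothetical and chosen only for illustration.

```python
def binary_segment(tokens):
    """Overlapping pairs of adjacent units, as in step 101."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def title_to_vector(tokens, feature_words):
    """Return the n-dimensional vector of feature-word counts for one title."""
    index = {word: j for j, word in enumerate(feature_words)}
    vector = [0] * len(feature_words)
    for word in binary_segment(tokens):
        j = index.get(word)
        if j is not None:  # pairs outside the feature dictionary are ignored
            vector[j] += 1
    return vector

# Hypothetical feature dictionary {W_1, W_2, W_3}.
feature_words = [("cotton", "dress"), ("red", "cotton"), ("running", "shoes")]
x = title_to_vector(["red", "cotton", "dress", "cotton", "dress"], feature_words)
print(x)  # [2, 1, 0]
```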
Further, step 102 is specifically as follows:
Number all commodity categories (concrete categories may be clothes, trousers, shoes, food, daily necessities, and so on). If m is the total number of categories, the classification set can be written as $\{Y_1, Y_2, \dots, Y_m\}$.
Represent any commodity title $L_i$ in the training set as an n-dimensional vector $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,n})$, where $x_{i,j}$ is the number of times feature word $W_j$ occurs after binary word segmentation of $L_i$.
Look up the category $Y_i$ to which this commodity belongs, $Y_i \in \{1, 2, \dots, m\}$, to obtain the training data $\{X_i, Y_i\}$.
Apply sequential dual method optimization to the training data $\{X_i, Y_i\}$ to obtain the optimal classification vectors $V_k$, where $V_k$ can be written as $(V_{k,1}, V_{k,2}, \dots, V_{k,n})$, $k = 1, 2, \dots, m$ (one vector per category).
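The source does not spell out the sequential dual method, so the sketch below substitutes a simpler, well-known relative: one-vs-rest training of a linear SVM by dual coordinate descent (hinge loss, no bias term). It should be read as an assumption-laden stand-in that produces one weight vector $V_k$ per category, not as the patent's optimizer. The names `train_binary_dual`, `train_one_vs_rest`, `predict`, the penalty `C`, and the epoch count are hypothetical, and categories are numbered from 0 here rather than from 1.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_binary_dual(X, signs, C=1.0, epochs=20):
    """Dual coordinate descent for a linear hinge-loss SVM without bias (stand-in optimizer)."""
    w = [0.0] * len(X[0])
    alpha = [0.0] * len(X)
    q = [dot(x, x) for x in X]  # diagonal of the dual Hessian
    for _ in range(epochs):
        for i, x in enumerate(X):  # fixed visiting order keeps the sketch deterministic
            if q[i] == 0:
                continue  # an all-zero vector carries no information
            gradient = signs[i] * dot(w, x) - 1.0
            new_alpha = min(max(alpha[i] - gradient / q[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                alpha[i] = new_alpha
                for j, value in enumerate(x):
                    w[j] += delta * signs[i] * value
    return w

def train_one_vs_rest(X, labels, num_classes, C=1.0, epochs=20):
    """Return one classification vector V_k per category (category k against all others)."""
    return [
        train_binary_dual(X, [1.0 if y == k else -1.0 for y in labels], C, epochs)
        for k in range(num_classes)
    ]

def predict(V, x):
    """Index of the category whose vector has the largest inner product with x."""
    scores = [dot(v, x) for v in V]
    return scores.index(max(scores))

# Hypothetical count vectors X_i with 0-based category labels Y_i.
X = [[2, 0, 0], [1, 0, 0], [0, 2, 0], [0, 1, 0], [0, 0, 1], [0, 0, 3]]
labels = [0, 0, 1, 1, 2, 2]
V = train_one_vs_rest(X, labels, num_classes=3)
print([predict(V, x) for x in X])
```

A production system would more likely rely on an established linear SVM library; the pure-Python loop is kept only so that the update rule is visible.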
Step 103: compute the inner product of each optimal classification vector with the feature vector representing the title of the commodity to be classified, and select the category corresponding to the largest inner product as the category of that commodity.
Further, the occurrence counts of the feature words W obtained by binary word segmentation of the title L of the commodity to be classified are represented as an n-dimensional vector X; the inner product of X with each optimal classification vector is computed; and the category with the largest inner product is taken as the category of the commodity.
Further, step 103 is specifically as follows:
Represent the title L of the commodity to be classified as an n-dimensional vector $X = (x_1, x_2, \dots, x_n)$, where $x_i$ is the number of times feature word $W_i$ occurs after binary word segmentation of L.
Compute the inner product of X with every optimal classification vector:
$$S_k = \sum_{i=1}^{n} V_{k,i}\, x_i, \quad k = 1, 2, \dots, m$$
Take the category with the largest inner product as the prediction; that is, if
$$S_{k^*} = \max\{S_1, S_2, \dots, S_m\}$$
then the commodity belongs to category $Y_{k^*}$.
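The scoring rule of step 103 amounts to m inner products followed by an argmax. The sketch below works through it on made-up numbers: the classification vectors `V`, the count vector `x`, and the function name `classify` are hypothetical, and the category names are borrowed from the examples listed in step 102.

```python
def classify(V, x, categories):
    """Compute S_k = sum_i V[k][i] * x[i] for every category and pick the largest."""
    scores = [sum(v_i * x_i for v_i, x_i in zip(v, x)) for v in V]
    best = max(range(len(scores)), key=scores.__getitem__)
    return categories[best], scores

# Hypothetical optimal classification vectors V_1..V_3 over n = 3 feature words.
V = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]
categories = ["clothes", "trousers", "shoes"]
x = [0, 2, 1]  # counts of W_1, W_2, W_3 in the title to be classified

category, scores = classify(V, x, categories)
print(scores)    # [-3, 1, -1]
print(category)  # trousers
```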
The classification method above performs binary word segmentation on commodity titles, discards rare words whose frequency falls below a threshold, and constructs the feature dictionary, which contains about 70,000 feature words. Each commodity title is represented, according to the counts of the feature words it contains, as a sparse vector in a high-dimensional feature space. This way of extracting and representing product features is simple to carry out and makes commodities of different categories well separable. Training the support vector machine with a linear kernel function gives good classification results: using all commodities on Jingdong (JD.com), with half for training and half for testing, the accuracy is 94%.
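Because a title contains only a handful of the roughly 70,000 feature words, it is natural to store just the non-zero counts. The sketch below shows one possible sparse layout (a dict from feature index to count) and the matching inner product. The layout, the function names, the feature indices, and the weights are all assumptions made for illustration; the source does not describe its storage format.

```python
from collections import Counter

def sparse_vector(tokens, feature_index):
    """Map a tokenized title to {feature index: count}, keeping only non-zero entries."""
    pairs = ((tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1))
    return dict(Counter(feature_index[p] for p in pairs if p in feature_index))

def sparse_dot(weights, sparse_x):
    """Inner product that touches only the non-zero coordinates of the title vector."""
    return sum(weights[j] * count for j, count in sparse_x.items())

# Hypothetical positions of two feature words inside a 70,000-word dictionary.
feature_index = {("cotton", "dress"): 5, ("red", "cotton"): 42}
weights = [0.0] * 70000  # one dense classification vector V_k
weights[5] = 1.5
weights[42] = -0.5

x = sparse_vector(["red", "cotton", "dress"], feature_index)
print(x)                       # {42: 1, 5: 1}
print(sparse_dot(weights, x))  # 1.0
```

The cost of scoring a title then depends on the number of feature words it contains rather than on the size of the dictionary.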
Compared with the prior art, the automatic commodity classification method based on binary word segmentation and a support vector machine of the present invention achieves the following effects:
1) The invention applies binary word segmentation to commodity titles, which makes the feature information base much easier to build.
2) The invention uses feature words to represent each commodity title as a feature vector in the feature space, which greatly improves the separability of commodities and thereby effectively solves the problem that the construction of the feature space causes long training times and unsatisfactory results in automatic commodity classification methods.
The description above shows and describes some preferred embodiments of the invention. As stated earlier, however, the invention is not limited to the forms disclosed here, and these should not be read as excluding other embodiments. The invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described here through the above teachings or through the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention all fall within the scope of protection of the appended claims.

Claims (3)

1. An automatic commodity classification method based on binary word segmentation and a support vector machine, characterized in that it comprises:
performing binary word segmentation on all commodity titles in a training set to construct a feature dictionary;
constructing a commodity classification set and, at the same time, representing each commodity title as a feature vector according to the feature dictionary, generating training data from the feature vector and the category to which the commodity belongs, and optimizing parameters on the training data with a sequential dual method to obtain optimal classification vectors, wherein representing a commodity title as a feature vector according to the feature dictionary further comprises: representing the occurrence counts of the feature words obtained by binary word segmentation of any commodity title in the training set as an n-dimensional vector;
computing the inner product of each optimal classification vector with the feature vector representing the title of a commodity to be classified, and selecting the category corresponding to the largest inner product as the category of that commodity, wherein computing the inner product of the optimal classification vector with the feature vector representing the title of the commodity to be classified further comprises: representing the occurrence counts of the feature words obtained by binary word segmentation of the title to be classified as an n-dimensional vector, and computing the inner product of this n-dimensional vector with the optimal classification vector.
2. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 1, characterized in that performing binary word segmentation on the commodity titles to construct the feature dictionary further comprises: performing binary word segmentation on all commodity titles in the training set, counting word frequencies, and selecting the words with higher frequency to construct the feature dictionary.
3. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 2, characterized in that the training set further comprises all commodity titles of a given e-commerce website, and the feature dictionary further comprises the feature words, obtained through binary word segmentation, that reflect commodity information.
CN201310201322.8A 2013-05-27 2013-05-27 Commodity automatic classification method based on binary word segmentation and support vector machine Active CN103294798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310201322.8A CN103294798B (en) 2013-05-27 2013-05-27 Commodity automatic classification method based on binary word segmentation and support vector machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310201322.8A CN103294798B (en) 2013-05-27 2013-05-27 Commodity automatic classification method based on binary word segmentation and support vector machine

Publications (2)

Publication Number Publication Date
CN103294798A CN103294798A (en) 2013-09-11
CN103294798B true CN103294798B (en) 2016-08-31

Family

ID=49095660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310201322.8A Active CN103294798B (en) 2013-05-27 2013-05-27 Commodity automatic classification method based on binary word segmentation and support vector machine

Country Status (1)

Country Link
CN (1) CN103294798B (en)

Also Published As

Publication number Publication date
CN103294798A (en) 2013-09-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190925

Address after: 100088 Beijing Haidian District Garden Road No. 13 Courtyard 7 Floor 12, 1203-1

Patentee after: Lele Kaihang (Beijing) Education Technology Co., Ltd.

Address before: 100085, room 2, building 5, building 1, No. 516, ten Street, Haidian District, Beijing

Patentee before: Beijing Shangyou Tongda Information Technology Co., Ltd.