CN103294798B - Commodity automatic classification method based on binary word segmentation and support vector machine - Google Patents
Commodity automatic classification method based on binary word segmentation and support vector machine
- Publication number
- CN103294798B
- Authority
- CN
- China
- Prior art keywords
- commodity
- classification
- word segmentation
- binary word
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic commodity classification method based on binary word segmentation and a support vector machine. The method mainly comprises: performing binary word segmentation on all commodity titles in a training set to construct a feature dictionary; constructing a commodity category set, representing each commodity title as a feature vector according to the feature dictionary, generating training data from the feature vectors and the categories to which the commodities belong, and optimizing parameters on the training data with a sequential dual method to obtain optimal classification vectors; and computing the inner product of each optimal classification vector with the feature vector representing the title of a commodity to be classified, and selecting the category corresponding to the largest inner product as the category of that commodity. The invention addresses two problems in the prior art: the difficulty of building a product feature information base, and the long training time and unsatisfactory results of automatic commodity classification methods caused by the way the feature space is constructed.
Description
Technical field
The present invention relates to the field of data mining and, in particular, to an automatic commodity classification method based on binary word segmentation and a support vector machine (SVM, a classification algorithm that learns automatically from data).
Background art
Data mining generally refers to the process of automatically searching large amounts of data for hidden information that has particular relationships. Classification is an important step in data mining.
With the rapid development of electronic information technology, data mining has penetrated every field. In e-commerce in particular, an efficient automatic commodity classification method is essential for managing massive amounts of commodity information. Many automatic commodity classification methods exist at present, for example: decision tree methods based on logical rules, naive Bayes or Bayesian network methods based on statistical correlation, neural network methods based on the perceptron, the k-nearest-neighbor method based on instance learning, and support vector machine methods based on vector spaces. According to reports in the literature, the classification accuracy of these common methods is about 80%.
In the prior art, support vector machine methods are widely used because they classify quickly and give accurate results.
However, the effectiveness of the method in practice depends mainly on how the feature space is constructed. If the feature space is too small, the data are not linearly separable and a nonlinear kernel function must be used, which leads to problems such as long training times and unsatisfactory results.
Meanwhile, the Chinese title of a commodity contains many pieces of feature information (such as manufacturer brand, commodity name, specification and model, and price), which differ in how strongly they correlate with the commodity category. In theory, treating them differently would help improve classification accuracy. However, because the amount of information is huge, building and maintaining such a product feature information base is costly and computationally expensive, and is hard to carry out in practice.
Therefore, how to overcome the difficulty of building a product feature information base in the prior art, and the long training time and unsatisfactory results of automatic commodity classification methods caused by the construction of the feature space, has become a technical problem that urgently needs to be solved.
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic commodity classification method based on binary word segmentation and a support vector machine, so as to solve the prior-art problems that a product feature information base is difficult to build and that the construction of the feature space causes long training times and unsatisfactory results in automatic commodity classification methods.
To solve the above technical problem, the invention provides an automatic commodity classification method based on binary word segmentation and a support vector machine, characterized by comprising:
Performing binary word segmentation on all commodity titles in a training set to construct a feature dictionary;
Constructing a commodity category set, representing commodity titles as feature vectors according to the feature dictionary, generating training data from the feature vectors and the categories to which the commodities belong, and optimizing parameters on the training data with a sequential dual method to obtain optimal classification vectors;
Computing the inner product of each optimal classification vector with the feature vector representing the title of a commodity to be classified, and selecting the category corresponding to the largest inner product as the category of that commodity.
Preferably, performing binary word segmentation on commodity titles to construct the feature dictionary further comprises: performing binary word segmentation on all commodity titles in the training set, counting word frequencies, and selecting the words with higher frequency to construct the feature dictionary.
Preferably, the training set further comprises all commodity titles in a given e-commerce website, and the feature dictionary further comprises the feature words, obtained through binary word segmentation, that reflect commodity information.
Preferably, representing commodity titles as feature vectors according to the feature dictionary further comprises: expressing the counts of the feature words obtained after binary word segmentation of any commodity title in the training set as an n-dimensional vector.
Preferably, computing the inner product of the optimal classification vectors with the feature vector representing the title of the commodity to be classified further comprises: expressing the counts of the feature words obtained after binary word segmentation of the title of the commodity to be classified as an n-dimensional vector, and computing the inner product of this n-dimensional vector with the optimal classification vectors.
Compared with the prior art, the automatic commodity classification method based on binary word segmentation and a support vector machine according to the invention achieves the following effects:
1) The invention performs binary word segmentation on commodity titles, which makes the feature information base much easier to build.
2) The invention uses feature words to represent each commodity title as a feature vector in the feature space, which greatly improves the separability of commodities and thereby effectively solves the problem of long training times and unsatisfactory results caused by the construction of the feature space.
Brief description of the drawings
The accompanying drawing described here provides a further understanding of the invention and forms part of this application. The illustrative embodiments of the invention and their description explain the invention and do not unduly limit it. In the drawing:
Fig. 1 is a schematic flow chart of the automatic commodity classification method based on binary word segmentation and a support vector machine according to an embodiment of the invention.
Detailed description of the embodiments
Certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. This description and the claims do not distinguish components by differences in name, but by differences in function. "Comprising", as used throughout the description and claims, is an open-ended term and should be interpreted as "comprising but not limited to". "Substantially" means that, within an acceptable error range, those skilled in the art can solve the technical problem and basically achieve the technical effect. In addition, the term "coupled" here includes any direct or indirect means of electrical coupling. Thus, if the text describes a first device as coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows sets out preferred embodiments of the invention; it is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the invention is defined by the appended claims.
The invention is described in further detail below with reference to the accompanying drawing, but this description does not limit the invention.
Fig. 1 shows the flow of the automatic commodity classification method based on binary word segmentation and a support vector machine according to an embodiment of the invention.
Step 101: perform binary word segmentation on all commodity titles in the training set to construct a feature dictionary.
Here, the training set may also be called the commodity title set; it contains all commodity titles in a given e-commerce website. The feature dictionary may also be called the feature information base; it contains the feature words, obtained through binary word segmentation, that reflect commodity information.
Further, performing binary word segmentation on commodity titles to construct the feature information base is specifically: perform binary word segmentation on all commodity titles in the training set, count word frequencies, and select the words with higher frequency to construct the feature dictionary.
Further, step 101 is specifically as follows:
First, let a commodity title be L, of the form $C_1 C_2 C_3 \dots C_{k-1} C_k$, where each $C_i$ is a Chinese character or an English word and $k$ is the length of the title;
Next, binary word segmentation is performed on title L to obtain the word set $\{C_1 C_2, C_2 C_3, \dots, C_{k-1} C_k\}$; in this set each pair $C_i C_{i+1}$ is regarded as one word, denoted W;
Next, all commodity titles in the training set are traversed and the number of occurrences Count(W) of each word W is counted;
Then a threshold $C_T$ is set; if $\mathrm{Count}(W) \ge C_T$ (that is, word W occurs at least $C_T$ times), W is a feature word;
All feature words obtained in this way constitute the feature dictionary $\{W_1, W_2, \dots, W_n\}$.
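The sub-steps of step 101 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the tokenization rule (one CJK character, or one run of ASCII letters and digits, per unit), the names `binary_segment` and `build_feature_dictionary`, and the representation of a word as a pair of units are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenization: one unit is a single CJK character or a run of
# ASCII letters/digits. The description only says "a Chinese character
# or an English word", so this regex is an illustrative choice.
UNIT_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def binary_segment(title):
    """Return the words C_i C_{i+1} of a title as pairs of adjacent units."""
    units = UNIT_RE.findall(title)
    return [(units[i], units[i + 1]) for i in range(len(units) - 1)]


def build_feature_dictionary(titles, threshold):
    """Count every word over all titles and keep those with Count(W) >= threshold."""
    counts = Counter()
    for title in titles:
        counts.update(binary_segment(title))
    # Sorting gives each feature word a stable position W_1 ... W_n.
    return sorted(word for word, count in counts.items() if count >= threshold)


if __name__ == "__main__":
    sample_titles = ["苹果手机", "苹果手机壳", "Apple iPhone 手机"]
    print(build_feature_dictionary(sample_titles, threshold=2))
```

Representing a word as a pair rather than a concatenated string keeps two adjacent English words distinct from a single longer token; any other unambiguous encoding would serve equally well.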
Step 102: construct a commodity category set, represent commodity titles as feature vectors according to the feature dictionary, generate training data from the feature vectors and the categories to which the commodities belong, and optimize parameters on the training data with a sequential dual method to obtain optimal classification vectors.
Further, representing commodity titles as feature vectors according to the feature dictionary is specifically: the counts of the feature words W obtained after binary word segmentation of any commodity title $L_i$ in the training set are expressed as an n-dimensional vector.
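As an illustration of this representation, the sketch below maps the words of one already-segmented title to a sparse count vector. The function name `title_to_vector`, the 0-based feature indices, and the dictionary-based sparse format are assumptions for the example; the description only requires an n-dimensional vector of feature-word counts.

```python
from collections import Counter


def title_to_vector(words, feature_index):
    """Map the binary-segmented words of one title to a sparse vector {j: x_j}.

    words: words obtained by binary word segmentation of the title.
    feature_index: dict from feature word to its 0-based position j.
    Words outside the feature dictionary are ignored; absent keys mean 0.
    """
    counts = Counter(w for w in words if w in feature_index)
    return {feature_index[w]: c for w, c in counts.items()}


if __name__ == "__main__":
    feature_words = [("手", "机"), ("苹", "果")]
    feature_index = {w: j for j, w in enumerate(feature_words)}
    words = [("苹", "果"), ("果", "手"), ("手", "机")]
    print(title_to_vector(words, feature_index))
```

Storing only the nonzero counts is a natural fit when the dictionary is large and each title contains only a handful of feature words.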
Further, step 102 is specifically as follows:
All commodity categories are numbered (specific categories may be, for example, clothes, trousers, shoes, food, or daily necessities). Let m be the total number of categories; the category set can then be expressed as $\{Y_1, Y_2, \dots, Y_m\}$;
Any commodity title $L_i$ in the training set is expressed as an n-dimensional vector $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,n})$, where $x_{i,j}$ is the number of times feature word $W_j$ is obtained after binary word segmentation of $L_i$;
The category $Y_i$ to which the commodity belongs is looked up, $Y_i \in \{1, 2, \dots, m\}$, giving the training data $\{X_i, Y_i\}$;
The training data $\{X_i, Y_i\}$ are optimized with the sequential dual method to obtain the optimal classification vectors $V_k$, where $V_k$ can be expressed as $(V_{k,1}, V_{k,2}, \dots, V_{k,n})$, $k = 1, 2, \dots, m$.
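The description names a sequential dual method but does not give its update rule. As a stand-in, the sketch below trains one linear SVM per category (one-vs-rest) with the well-known dual coordinate descent update for the L1-loss linear SVM. This substitution, the function names, the 0-based category labels, and the parameters `C`, `epochs`, and `seed` are all assumptions for illustration and should not be read as the method of the patent.

```python
import random


def dot(w, x):
    """Inner product of a dense weight list w and a sparse vector x = {j: value}."""
    return sum(w[j] * v for j, v in x.items())


def train_binary(X, y, n, C=1.0, epochs=20, seed=0):
    """Dual coordinate descent for a linear L1-loss SVM without a bias term.

    X: list of sparse vectors, y: list of +1/-1 labels, n: dimension.
    Maintains w = sum_i alpha_i * y_i * x_i with 0 <= alpha_i <= C.
    """
    rng = random.Random(seed)
    w = [0.0] * n
    alpha = [0.0] * len(X)
    q = [sum(v * v for v in x.values()) for x in X]  # x_i . x_i
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            if q[i] == 0.0:
                continue  # an all-zero vector cannot change w
            grad = y[i] * dot(w, X[i]) - 1.0
            new_alpha = min(max(alpha[i] - grad / q[i], 0.0), C)
            delta = (new_alpha - alpha[i]) * y[i]
            if delta != 0.0:
                for j, v in X[i].items():
                    w[j] += delta * v
                alpha[i] = new_alpha
    return w


def train_one_vs_rest(X, labels, n, m, C=1.0, epochs=20):
    """Return one classification vector per category 0..m-1 (labels are 0-based here)."""
    return [
        train_binary(X, [1 if lab == k else -1 for lab in labels], n, C, epochs)
        for k in range(m)
    ]


if __name__ == "__main__":
    X = [{0: 2.0}, {0: 1.0, 2: 1.0}, {1: 2.0}, {1: 1.0, 2: 1.0}]
    labels = [0, 0, 1, 1]
    for k, v in enumerate(train_one_vs_rest(X, labels, n=3, m=2)):
        print(k, v)
```

Each update touches only the nonzero entries of one training vector, which is why this family of solvers suits sparse, high-dimensional count vectors.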
Step 103: compute the inner product of each optimal classification vector with the feature vector representing the title of the commodity to be classified, and select the category corresponding to the largest inner product as the category of that commodity.
Further, the counts of the feature words W obtained after binary word segmentation of the title L of the commodity to be classified are expressed as an n-dimensional vector X, the inner product of X with each optimal classification vector is computed, and the category with the largest inner product is taken as the category of the commodity.
Further, step 103 is specifically as follows:
The title L of the commodity to be classified is expressed as an n-dimensional vector $X = (x_1, x_2, \dots, x_n)$, where $x_i$ is the number of times feature word $W_i$ is obtained after binary word segmentation of L;
The inner product of X with every optimal classification vector is computed:
$$X \cdot V_k = \sum_{j=1}^{n} x_j V_{k,j}, \quad k = 1, 2, \dots, m$$
The category with the largest inner product is taken as the predicted category; that is, if
$$k = \arg\max_{1 \le i \le m} X \cdot V_i$$
then the commodity belongs to category $Y_k$.
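The decision rule of step 103 can be sketched as below. The names `inner_product` and `classify`, the sparse dictionary format for X, and the 0-based category index returned are assumptions for the example.

```python
def inner_product(x, v):
    """Inner product of a sparse title vector x = {j: x_j} and a dense vector v."""
    return sum(count * v[j] for j, count in x.items())


def classify(x, class_vectors):
    """Return the 0-based index k whose classification vector maximizes X . V_k."""
    scores = [inner_product(x, v) for v in class_vectors]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Two hypothetical classification vectors over three feature words.
    class_vectors = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]
    print(classify({0: 2, 2: 1}, class_vectors))
```

If several categories tie for the largest inner product, this sketch returns the lowest index; the description does not say how ties should be broken.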
The above classification method performs binary word segmentation on commodity titles, discards rare words whose frequency of occurrence is below a certain threshold, and constructs the feature dictionary, which contains about 70,000 feature words. Each commodity title is represented, according to the counts of the feature words it contains, as a sparse vector in a high-dimensional feature space. This method of extracting and representing commodity features is simple to operate and makes commodities of different categories easy to distinguish. A support vector machine trained with a linear kernel achieved good classification results: using all commodities on Jingdong (JD.com), with half used for training and half for testing, the accuracy was 94%.
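A half-and-half evaluation of this kind could be set up along the following lines. The description does not say how the two halves were chosen, so the random shuffle, the `seed` parameter, and the function names are assumptions for illustration only.

```python
import random


def half_split(items, seed=0):
    """Shuffle the items and split them into two halves (training, testing)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def accuracy(predicted, actual):
    """Fraction of test items whose predicted category equals the true one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    train, test = half_split(range(10))
    print(len(train), len(test))
    print(accuracy([1, 2, 3, 4], [1, 2, 3, 0]))
```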
The above description shows and describes several preferred embodiments of the invention. As stated earlier, however, the invention is not limited to the forms disclosed here, and these should not be regarded as excluding other embodiments. The invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described here, through the above teachings or through the skill or knowledge of the relevant field. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the scope of protection of the appended claims.
Claims (3)
1. An automatic commodity classification method based on binary word segmentation and a support vector machine, characterized by comprising:
performing binary word segmentation on all commodity titles in a training set to construct a feature dictionary;
constructing a commodity category set, representing commodity titles as feature vectors according to the feature dictionary, generating training data from the feature vectors and the categories to which the commodities belong, and optimizing parameters on the training data with a sequential dual method to obtain optimal classification vectors, wherein representing commodity titles as feature vectors according to the feature dictionary further comprises: expressing the counts of the feature words obtained after binary word segmentation of any commodity title in the training set as an n-dimensional vector;
computing the inner product of each optimal classification vector with the feature vector representing the title of a commodity to be classified, and selecting the category corresponding to the largest inner product as the category of the commodity, wherein computing the inner product further comprises: expressing the counts of the feature words obtained after binary word segmentation of the title of the commodity to be classified as an n-dimensional vector, and computing the inner product of this n-dimensional vector with the optimal classification vectors.
2. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 1, characterized in that performing binary word segmentation on commodity titles to construct the feature dictionary further comprises: performing binary word segmentation on all commodity titles in the training set, counting word frequencies, and selecting the words with higher frequency to construct the feature dictionary.
3. The automatic commodity classification method based on binary word segmentation and a support vector machine according to claim 2, characterized in that the training set further comprises all commodity titles in a given e-commerce website, and the feature dictionary further comprises the feature words, obtained through binary word segmentation, that reflect commodity information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310201322.8A CN103294798B (en) | 2013-05-27 | 2013-05-27 | Commodity automatic classification method based on binary word segmentation and support vector machine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310201322.8A CN103294798B (en) | 2013-05-27 | 2013-05-27 | Commodity automatic classification method based on binary word segmentation and support vector machine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103294798A CN103294798A (en) | 2013-09-11 |
CN103294798B true CN103294798B (en) | 2016-08-31 |
Family
ID=49095660
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201310201322.8A | Active | CN103294798B (en) | 2013-05-27 | 2013-05-27 | Commodity automatic classification method based on binary word segmentation and support vector machine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103294798B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605815B (en) * | 2013-12-11 | 2016-08-31 | 焦点科技股份有限公司 | A kind of merchandise news being applicable to B2B E-commerce platform is classified recommendation method automatically |
CN103778205B (en) * | 2014-01-13 | 2018-07-06 | 北京奇虎科技有限公司 | A kind of commodity classification method and system based on mutual information |
CN104063428A (en) * | 2014-06-09 | 2014-09-24 | 国家计算机网络与信息安全管理中心 | Method for detecting unexpected hot topics in Chinese microblogs |
CN104268134B (en) * | 2014-09-28 | 2017-04-19 | 苏州大学 | Subjective and objective classifier building method and system |
CN108563782B (en) * | 2018-04-25 | 2023-04-18 | 平安科技(深圳)有限公司 | Commodity information format processing method and device, computer equipment and storage medium |
CN110245800A (en) * | 2019-06-19 | 2019-09-17 | 南京大学金陵学院 | A method of based on superior vector spatial model goods made to order information class indication |
CN110334306A (en) * | 2019-06-21 | 2019-10-15 | 无线生活(北京)信息技术有限公司 | Label processing method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
CN102193936A (en) * | 2010-03-09 | 2011-09-21 | 阿里巴巴集团控股有限公司 | Data classification method and device |
CN102289522A (en) * | 2011-09-19 | 2011-12-21 | 北京金和软件股份有限公司 | Method of intelligently classifying texts |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102193936A (en) * | 2010-03-09 | 2011-09-21 | 阿里巴巴集团控股有限公司 | Data classification method and device |
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
CN102289522A (en) * | 2011-09-19 | 2011-12-21 | 北京金和软件股份有限公司 | Method of intelligently classifying texts |
Also Published As
Publication number | Publication date |
---|---|
CN103294798A (en) | 2013-09-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103294798B (en) | Commodity automatic classification method based on binary word segmentation and support vector machine | |
Jain et al. | Application of machine learning techniques to sentiment analysis | |
Gokulakrishnan et al. | Opinion mining and sentiment analysis on a twitter data stream | |
Bhardwaj et al. | Sentiment analysis for Indian stock market prediction using Sensex and nifty | |
Lu et al. | Rated aspect summarization of short comments | |
US20180293294A1 (en) | Similar Term Aggregation Method and Apparatus | |
Adeborna et al. | An approach to sentiment analysis–the case of airline quality rating | |
Weichselbraun et al. | A context-dependent supervised learning approach to sentiment detection in large textual databases | |
Rani et al. | Study and comparision of vectorization techniques used in text classification | |
Khanvilkar et al. | Smart recommendation system based on product reviews using Random Forest | |
Wassan et al. | [Retracted] Customer Experience towards the Product during a Coronavirus Outbreak | |
Priyanka et al. | Identifying the best feature combination for sentiment analysis of customer reviews | |
Zirpe et al. | Negation handling using stacking ensemble method | |
Li et al. | A hybrid model for role-related user classification on twitter | |
Bach et al. | Leveraging user ratings for resource-poor sentiment classification | |
Polpinij | Multilingual sentiment classification on large textual data | |
Sonawane et al. | Extracting sentiments from reviews: A lexicon-based approach | |
Khanvilkar et al. | Product recommendation using sentiment analysis of reviews: a random forest approach | |
Suman | Sentiment analysis: A survey | |
Clarizia et al. | Sentiment analysis in social networks: A methodology based on the latent dirichlet allocation approach | |
Kunneman et al. | Aspect-based summarization of pros and cons in unstructured product reviews. | |
Lasne et al. | Food reviews classification using multi-label convolutional neural network text classifier | |
Im et al. | Confirmatory aspect-based opinion mining processes | |
Vaitheeswaran et al. | Hybrid based approach to enhance the accuracy of sentiment analysis on tweets | |
bin Harunasir et al. | Sentiment analysis of amazon product reviews by supervised machine learning models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20190925. Patentee after: Lele Kaihang (Beijing) Education Technology Co., Ltd. Address after: 100088 Beijing Haidian District Garden Road No. 13 Courtyard 7 Floor 12, 1203-1. Patentee before: Beijing Shangyou Tongda Information Technology Co., Ltd. Address before: 100085, room 2, building 5, building 1, No. 516, ten Street, Haidian District, Beijing |