CN109657057A - Short-text sentiment classification method combining SVM and document vectors
- Publication number
- CN109657057A (application CN201811401134.9A)
- Authority
- CN
- China
- Prior art keywords
- short text
- vector
- comment
- svm
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a short-text sentiment classification method combining an SVM with document vectors, comprising the following steps: first, the short texts are preprocessed; second, the short texts are trained into multi-dimensional vectors using Doc2Vec; third, an SVM classifier is trained on the labeled short-text data; finally, the trained SVM classifier performs sentiment classification on the unlabeled short texts. By using an SVM, a machine learning method, for sentiment classification, the invention achieves a good classification result while requiring little labeled data, improving accuracy while reducing training cost.
Description
Technical field
The invention belongs to the field of computer natural language processing, and in particular relates to a short-text sentiment classification method combining SVM and document vectors. It is a technique for performing sentiment classification on news media data.
Background
Sentiment analysis has a wide range of applications. For example, an enterprise can use sentiment analysis to analyze the sentiment tendencies of its users and then improve its products and formulate sales strategies; a film and television company can obtain viewers' feedback on a film and then adjust what it broadcasts. Driven by these practical demands, sentiment analysis technology has made significant progress.
Sentiment tendencies are divided in two main ways: coarse-grained and fine-grained. Coarse-grained division classifies sentiment as positive, neutral, or negative. Some studies, in order to simplify the subsequent analysis, divide sentiment only into positive and negative. Fine-grained division classifies sentiment into categories such as joy, anger, sorrow, happiness, and fear.
In practice, some work assumes that a piece of text carries only one sentiment and assigns the whole text to a single class. Other work holds that a text can express different sentiments toward different aspects, for example: "Although this piece of clothing looks good, it is not the style I like." With respect to "the clothing", the sentiment of this text is positive, but with respect to "I", the sentiment tendency is negative. This observation led to the concept of an aspect, that is, aspect-based sentiment analysis. In many cases, however, the actual requirement is simply to know the overall sentiment tendency of a text, so research in recent years has concentrated on coarse-grained division. The sentiment analysis described herein is likewise coarse-grained and is hereinafter referred to simply as sentiment analysis.
Technically, the mainstream approaches to sentiment analysis are sentiment-dictionary-based methods and machine-learning-based methods. Machine learning research centers on obtaining high-quality data and good algorithm models: a model is trained on labeled data and the trained model is then used to judge the sentiment of new data. Dictionary-based methods center on constructing a good sentiment dictionary, and the quality of the dictionary strongly affects the result of the analysis. Because machine learning is an emerging technology that can both analyze large amounts of data and generate continuous vectors, more of the current research on sentiment classification uses machine learning.
Pang et al. were the first to apply machine learning algorithms to the sentiment analysis of text. However, Pang et al. used one-hot word vectors, which are sparse when applied to short texts. Mikolov, T. later obtained continuous word vectors with a neural network model and completed sentiment classification by summing word vectors and applying KNN. The present method is based on Doc2Vec: each short text is trained directly into a vector, and sentiment classification is completed with an SVM, which gives a good classification result while requiring little labeled data.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and to provide a short-text sentiment classification method combining SVM and document vectors. The invention is widely applicable. For example, an enterprise can use the technique to analyze user comments automatically, learn from them how well users accept a product, and then improve the product and increase its economic returns.
To solve the technical problem described in the background, the invention adopts the following technical solution: a short-text sentiment classification method combining SVM and document vectors, comprising the following steps:
1) preprocessing the short texts;
2) training the short texts into multi-dimensional vectors using Doc2Vec;
3) training an SVM classifier on the labeled short-text data;
4) performing sentiment classification on the unlabeled short texts using the trained SVM classifier.
Step 1), preprocessing the short texts, comprises the following steps:
(1) crawling the comment data of the target website to form the short-text corpus for the experiment;
(2) removing irrelevant symbols from the corpus, the punctuation marks including . ? ! , ; : " " ' ' ( ) - ... " ";
(3) segmenting the acquired comment data into words using a word segmentation tool;
(4) removing irrelevant stop words from the segmented corpus.
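Sub-steps (2) to (4) can be sketched in Python as follows. This is a minimal illustration, not the patent's own code: the punctuation set is an assumed reading of the list in sub-step (2), the stop-word list is hypothetical, and the default tokenizer simply splits on whitespace so that the sketch runs without third-party packages.

```python
# Punctuation to strip; the ASCII and full-width forms below are an assumed
# reading of the list in sub-step (2), not a verified copy of it.
PUNCTUATION = ".?!,;:\"'()-。?!,;:“”‘’()…《》"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})

# Hypothetical stop-word list; a real one would be loaded from a file.
STOP_WORDS = {"the", "a", "is", "of"}


def preprocess(comment, tokenize=str.split, stop_words=STOP_WORDS):
    """Return the tokens of one comment after sub-steps (2) to (4)."""
    cleaned = comment.translate(_PUNCT_TABLE)  # (2) replace punctuation with spaces
    tokens = tokenize(cleaned)                 # (3) word segmentation
    # (4) drop whitespace-only tokens and stop words
    return [t for t in tokens if t.strip() and t.lower() not in stop_words]


if __name__ == "__main__":
    print(preprocess("The plot is great, but the ending... is weak!"))
```

For Chinese comments, a segmenter such as `jieba.lcut` could be passed as `tokenize`, assuming the jieba package is installed.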
Step 2), training the short texts into multi-dimensional vectors using Doc2Vec, comprises the following specific steps:
(1) randomly initializing a vector matrix An*m, where m can be chosen arbitrarily and n is the sum of the number of distinct words in all news comments and the number of comments in the corpus;
(2) for a news comment, converting the words it contains, C = (t1, t2, ..., tn-1, tn), together with the comment itself, into the corresponding multi-dimensional vectors in An*m, i.e. W = (w1, w2, w3, ..., wn, wn+1);
(3) summing the vectors wi (i = 1, 2, 3, ..., n+1) in W to obtain T;
(4) substituting T into the hyperbolic tangent activation function to obtain Y, where U and P are parameters of the hyperbolic tangent function that are updated dynamically:
Y = tanh(UT + P);
(5) substituting the resulting Y into the softmax function to obtain the final probability of each word, p(wi | w1, w2, ..., wi-1, wi+1, ..., wn+1);
(6) obtaining the objective function f and averaging it;
(7) updating the above parameters by the neural network backpropagation algorithm, finally obtaining the vector matrix An*m.
Step 3) trains an SVM classifier on the labeled short-text data. Training the SVM classifier must satisfy the following constraint, so that the hyperplane found is optimal:
s.t. yi(ω^T xi + b) ≥ 1, i = 1, 2, ..., m
Solving the above problem, with αi as the Lagrange multipliers, yields the formula for ω, where ω and b are the parameters of the SVM, xi and yi are the sample data, i is the index of a sample, and m is the total number of samples.
The training steps are as follows:
(1) selecting p positive comments and p negative comments, where p is generally greater than or equal to 300;
(2) converting these comments into the corresponding vectors in An*m to obtain X = (x1, x2, ..., x2p); the sentiment labels of the comments form the vector Y = (y1, y2, ..., y2p), where yi is 0 or 1, 0 denoting negative sentiment and 1 denoting positive sentiment;
(3) substituting X and Y into the above formula for ω to obtain the value of ω, which gives the trained SVM;
(4) converting a comment to be classified into the corresponding vector X′ in An*m and inputting X′ into the SVM to obtain the sentiment category Y′ of the comment.
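Several formulas referred to in steps 2) and 3) do not appear in this text. Assuming the usual paragraph-vector formulation and the standard hard-margin SVM, which is what the surrounding definitions suggest, they would take roughly the following form. The vocabulary index j, the number of training positions N, and the shorthand "context" are notation introduced here for illustration and are not taken from the original.

```latex
\begin{align*}
% Step 2), sub-steps (3) to (6)
T &= \sum_{i=1}^{n+1} w_i \\
Y &= \tanh(UT + P) \\
p(w_i \mid w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_{n+1})
  &= \frac{e^{Y_{w_i}}}{\sum_{j} e^{Y_j}} \\
f &= \frac{1}{N} \sum_{k=1}^{N} \log p(w_k \mid \text{context}_k) \\[1ex]
% Step 3)
\min_{\omega,\, b} \ \frac{1}{2}\lVert \omega \rVert^2
  \quad &\text{s.t.} \quad y_i(\omega^{T} x_i + b) \ge 1, \quad i = 1, 2, \dots, m \\
\omega &= \sum_{i=1}^{m} \alpha_i y_i x_i
\end{align*}
```

Note that the margin constraint in this form presumes labels of +1 and -1, which matches the labeling used in the embodiments.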
Beneficial effects
1. The invention is widely applicable in everyday life. For example, an enterprise can use sentiment analysis to analyze the sentiment tendencies of its users and then improve its products and formulate sales strategies; a film and television company can obtain viewers' feedback on a film and then adjust what it broadcasts, so as ultimately to maximize revenue.
2. The invention uses an SVM, a machine learning method, for sentiment classification, which gives a good classification result while requiring little labeled data. Accuracy is improved while training cost is reduced.
3. The invention is based on Doc2Vec and trains each short text directly into a vector. Doc2Vec training yields not only a vector representation of each word but also a vector representation of each paragraph. The paragraph vector contains contextual information, which can further improve the accuracy of sentiment classification. Pang et al. were the first to apply machine learning algorithms to the sentiment analysis of text, but they used one-hot word vectors, which are sparse when applied to short texts. Mikolov, T. obtained continuous word vectors with a neural network model and completed classification by summing word vectors and applying KNN, but classification based on word vectors alone ignores contextual information.
Brief description of the drawings
Fig. 1 is the flow chart of the method.
Fig. 2 is a schematic diagram of SVM classification.
Fig. 3 is a sample of the vectors produced by Doc2Vec.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings and specific embodiments. The invention is a method that combines an SVM classifier with document vectors to perform sentiment classification on documents. The implementation steps of the invention are illustrated below with specific use cases:
Embodiment 1: sentiment classification of NetEase news comments
Fig. 1 is the flow chart for sentiment classification of text combining SVM and document vectors. The specific steps of each module in this embodiment are as follows:
Step 1: use the address of the home news module of NetEase News as the crawl address.
Step 2: write a crawler in Python 3.0 using PyCharm and the Scrapy crawler framework, to crawl key fields of NetEase News pages such as news title, time, news content, news address (URL), commenter ID, commenter region, and comment time. The main code of the crawler is as follows:
Step 3: store the crawled key fields in a MySQL database. Because the crawled data involves three entities (news, news comments, and users), three MySQL tables are designed according to the ER diagram, with the following structure:
Table 1: news data table
Table 2: comment data table
Table 3: user data table
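The table definitions themselves are not reproduced in this text. As an illustration only, a three-table layout covering the fields listed in Step 2 could look like the sketch below. All table and column names are assumptions, the `label` column is an addition for the later labeling step, and Python's built-in `sqlite3` stands in for MySQL so that the sketch runs without a database server.

```python
import sqlite3

# Hypothetical schema: names and types are illustrative, not the patent's design.
SCHEMA = """
CREATE TABLE news (
    news_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    pub_time  TEXT,
    content   TEXT,
    url       TEXT UNIQUE
);
CREATE TABLE news_user (
    user_id   TEXT PRIMARY KEY,   -- commenter ID
    region    TEXT                -- commenter region
);
CREATE TABLE news_comment (
    comment_id   INTEGER PRIMARY KEY,
    news_id      INTEGER NOT NULL REFERENCES news(news_id),
    user_id      TEXT NOT NULL REFERENCES news_user(user_id),
    content      TEXT NOT NULL,
    comment_time TEXT,
    label        INTEGER DEFAULT 0  -- 1 positive, -1 negative, 0 otherwise
);
"""


def create_schema(conn):
    """Create the three tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO news (title, url) VALUES (?, ?)", ("Example", "http://example.com/1"))
    print(conn.execute("SELECT COUNT(*) FROM news").fetchone()[0])
```

Keeping users in their own table avoids repeating the region for every comment by the same commenter.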
A) Read the comment data from the database using Python and label 300 positive comments and 300 negative comments. Comments are labeled as positive or negative according to the emoticons listed below: positive sentiment is labeled 1, negative sentiment is labeled -1, and the rest are labeled 0.
Comments in which the following symbols appear are defined as positive sentiment, as shown in Table 4:
Table 4
Comments in which the following symbols appear are defined as negative sentiment, as shown in Table 5:
Table 5
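The contents of Tables 4 and 5 are not reproduced in this text. The labeling rule itself can still be sketched as below, with two hypothetical emoticon sets standing in for the tables. Treating a comment that contains both kinds of symbol as 0 is an assumption, since the text does not say how such a comment is handled.

```python
# Hypothetical emoticon sets standing in for Tables 4 and 5.
POSITIVE_MARKS = {":)", ":D", "[thumbs-up]"}
NEGATIVE_MARKS = {":(", "[angry]", "[thumbs-down]"}


def label_comment(text):
    """Return 1 for positive, -1 for negative, 0 for everything else."""
    has_positive = any(mark in text for mark in POSITIVE_MARKS)
    has_negative = any(mark in text for mark in NEGATIVE_MARKS)
    if has_positive and not has_negative:
        return 1
    if has_negative and not has_positive:
        return -1
    return 0  # no emoticon, or conflicting emoticons (assumed behaviour)


if __name__ == "__main__":
    for comment in ["great match :)", "terrible referee :(", "kick-off at 8pm"]:
        print(comment, "->", label_comment(comment))
```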
B) Segment the news comment data using the jieba word segmentation tool, then use the segmented comments as input to the Doc2Vec model to train multi-dimensional vectors for the comments; the form of the vectors after training is shown in Fig. 3. The main code of the Doc2Vec model is as follows:
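The original listing is not reproduced in this text. As a rough illustration of what the model computes, the forward pass described in sub-steps (3) to (5) of step 2) can be written in plain Python as below. The shapes are assumptions: each input vector has m components, `U` has one row of m weights per vocabulary entry, and `P` has one bias per vocabulary entry. A practical implementation would more likely use a library such as gensim's `Doc2Vec` than hand-written code.

```python
import math


def forward(vectors, U, P):
    """Forward pass for one comment.

    vectors: rows of the matrix A for the comment's words and the comment itself.
    U, P:    weights (V x m) and biases (V) of the tanh layer; assumed shapes.
    Returns a probability for each of the V vocabulary entries.
    """
    dim = len(vectors[0])
    # (3) sum the input vectors component-wise to obtain T
    T = [sum(vec[k] for vec in vectors) for k in range(dim)]
    # (4) Y = tanh(U T + P)
    Y = [math.tanh(sum(u * t for u, t in zip(row, T)) + bias)
         for row, bias in zip(U, P)]
    # (5) softmax over Y, shifted by the maximum for numerical stability
    peak = max(Y)
    exps = [math.exp(y - peak) for y in Y]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    vectors = [[0.1, 0.2], [0.3, -0.1]]
    U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    P = [0.0, 0.0, 0.0]
    print(forward(vectors, U, P))
```

Training would then adjust `U`, `P`, and the rows of the matrix by backpropagation, which this sketch does not cover.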
C) Train the SVM classifier using the labeled comment data; the principle by which the SVM classifies data is shown in Fig. 2. The main code for training the SVM classifier and classifying the unlabeled data is as follows:
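The original listing is not reproduced in this text. The sketch below shows one way to train a linear SVM and classify new vectors in plain Python. It swaps in sub-gradient descent on the regularized hinge loss (a soft-margin formulation) rather than solving the Lagrangian form described earlier, and it uses labels +1 and -1 as in the labeling step above. The hyperparameters `lam`, `eta`, and `epochs` and the toy 2-dimensional vectors are assumptions for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=200):
    """Fit w and b by sub-gradient descent on the regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * (dot(w, xi) + b) < 1     # margin constraint not met
            w = [wj * (1 - eta * lam) for wj in w]   # shrink w (regularization)
            if violated:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def predict(w, b, x):
    """Return 1 (positive) or -1 (negative) for one document vector."""
    return 1 if dot(w, x) + b >= 0 else -1


if __name__ == "__main__":
    # Toy document vectors; real inputs would be the Doc2Vec vectors of comments.
    X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
    y = [1, 1, 1, -1, -1, -1]
    w, b = train_linear_svm(X, y)
    print(predict(w, b, [1.0, 1.0]), predict(w, b, [-1.0, -2.0]))
```

With a library available, the same step would more likely be done with an off-the-shelf SVM implementation, which also supports non-linear kernels.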
Embodiment 2: studying the sentiment evolution of an event
Fig. 1 is the flow chart for sentiment classification of text combining SVM and document vectors. The specific steps of each module in this embodiment are as follows:
Step 1: using tools such as IDEA, Java, WebMagic, and XPath, crawl all news and news comments about the event on the Sina platform over one year. The fields to obtain are news title, time, news content, news address (URL), commenter ID, commenter region, and comment time.
Step 2: store the crawled key fields in a MySQL database. Because the crawled data involves three entities (news, news comments, and users), three MySQL tables, News, Comment, and User, are designed according to the ER diagram. The Java code for storing data into the User table is as follows:
Step 3: read the comment data from the database using a Java program and label 300 positive comments and 300 negative comments. Positive sentiment is labeled 1, negative sentiment is labeled -1, and the rest are labeled 0; positive and negative sentiment are defined as in Tables 4 and 5 of Embodiment 1.
Segment the news comments using Java; the segmentation code is as follows:
A) Use the segmented comment data as input to the Doc2Vec model to train multi-dimensional vectors for the comments; the form of the vectors after training is shown in Fig. 3;
B) Train the SVM classifier using the labeled comment data; the principle by which the SVM classifies data is shown in Fig. 2;
C) Divide the news comments by quarter and classify the sentiment of the comments in each quarter using the trained SVM. Let the numbers of positive and negative comments in each quarter be mi and ni (i = 1, 2, 3, 4);
D) For each quarter, calculate the ratio of the number of positive comments to the number of negative comments, wi = mi/ni, and let W = (w1, w2, w3, w4);
E) Plot the evolution curve of W using HTML, ECharts, and JavaScript.
It should be understood that the embodiments discussed herein are for illustration only. Those skilled in the art may make improvements or modifications, and all such improvements and modifications shall fall within the scope of protection of the appended claims.
Claims (4)
1. A short-text sentiment classification method combining SVM and document vectors, characterized by comprising the following steps:
1) preprocessing the short texts;
2) training the short texts into multi-dimensional vectors using Doc2Vec;
3) training an SVM classifier on the labeled short-text data;
4) performing sentiment classification on the unlabeled short texts using the trained SVM classifier.
2. The short-text sentiment classification method combining SVM and document vectors according to claim 1, characterized in that step 1), preprocessing the short texts, comprises the following steps:
(1) crawling the comment data of the target website to form the short-text corpus for the experiment;
(2) removing irrelevant symbols from the corpus, the punctuation marks including . ? ! , ; : " " ' ' ( ) - ... " ";
(3) segmenting the acquired comment data into words using a word segmentation tool;
(4) removing irrelevant stop words from the segmented corpus.
3. The short-text sentiment classification method combining SVM and document vectors according to claim 1, characterized in that step 2), training the short texts into multi-dimensional vectors using Doc2Vec, comprises the following specific steps:
(1) randomly initializing a vector matrix An*m, where m can be chosen arbitrarily and n is the sum of the number of distinct words in all news comments and the number of comments in the corpus;
(2) for a news comment, converting the words it contains, C = (t1, t2, ..., tn-1, tn), together with the comment itself, into the corresponding multi-dimensional vectors in An*m, i.e. W = (w1, w2, w3, ..., wn, wn+1);
(3) summing the vectors wi (i = 1, 2, 3, ..., n+1) in W to obtain T;
(4) substituting T into the hyperbolic tangent activation function to obtain Y, where U and P are parameters of the hyperbolic tangent function that are updated dynamically:
Y = tanh(UT + P);
(5) substituting the resulting Y into the softmax function to obtain the final probability of each word, p(wi | w1, w2, ..., wi-1, wi+1, ..., wn+1);
(6) obtaining the objective function f and averaging it;
(7) updating the above parameters by the neural network backpropagation algorithm, finally obtaining the vector matrix An*m.
4. The short-text sentiment classification method combining SVM and document vectors according to claim 1, characterized in that step 3) trains an SVM classifier on the labeled short-text data, and training the SVM classifier must satisfy the following constraint:
s.t. yi(ω^T xi + b) ≥ 1, i = 1, 2, ..., m
Solving the above problem, with αi as the Lagrange multipliers, yields the formula for ω, where ω and b are the parameters of the SVM, xi and yi are the sample data, i is the index of a sample, and m is the total number of samples;
the training steps are as follows:
(1) selecting p positive comments and p negative comments, where p is generally greater than or equal to 300;
(2) converting these comments into the corresponding vectors in An*m to obtain X = (x1, x2, ..., x2p); the sentiment labels of the comments form the vector Y = (y1, y2, ..., y2p), where yi is 0 or 1, 0 denoting negative sentiment and 1 denoting positive sentiment;
(3) substituting X and Y into the above formula for ω to obtain the value of ω, which gives the trained SVM;
(4) converting a comment to be classified into the corresponding vector X′ in An*m and inputting X′ into the SVM to obtain the sentiment category Y′ of the comment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811401134.9A CN109657057A (en) | 2018-11-22 | 2018-11-22 | A kind of short text sensibility classification method of combination SVM and document vector |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109657057A true CN109657057A (en) | 2019-04-19 |
Family
ID=66112174
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811401134.9A Pending CN109657057A (en) | 2018-11-22 | 2018-11-22 | A kind of short text sensibility classification method of combination SVM and document vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109657057A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103116637A (en) * | 2013-02-08 | 2013-05-22 | 无锡南理工科技发展有限公司 | Text sentiment classification method facing Chinese Web comments |
CN105824922A (en) * | 2016-03-16 | 2016-08-03 | 重庆邮电大学 | Emotion classifying method fusing intrinsic feature and shallow feature |
CN106407449A (en) * | 2016-09-30 | 2017-02-15 | 四川长虹电器股份有限公司 | Emotion classification method based on support vector machine |
CN107193801A (en) * | 2017-05-21 | 2017-09-22 | 北京工业大学 | A kind of short text characteristic optimization and sentiment analysis method based on depth belief network |
CN107315797A (en) * | 2017-06-19 | 2017-11-03 | 江西洪都航空工业集团有限责任公司 | A kind of Internet news is obtained and text emotion forecasting system |
CN108108468A (en) * | 2017-12-29 | 2018-06-01 | 华中科技大学鄂州工业技术研究院 | A kind of short text sentiment analysis method and apparatus based on concept and text emotion |
CN108509629A (en) * | 2018-04-09 | 2018-09-07 | 南京大学 | Text emotion analysis method based on emotion dictionary and support vector machine |
CN108733653A (en) * | 2018-05-18 | 2018-11-02 | 华中科技大学 | A kind of sentiment analysis method of the Skip-gram models based on fusion part of speech and semantic information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190419 |