CN109657057A - Short-text sentiment classification method combining SVM and document vectors - Google Patents


Publication number
CN109657057A
Authority
CN
China
Prior art keywords
short text
vector
comment
svm
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811401134.9A
Other languages
Chinese (zh)
Inventor
沈幸博
王文俊
孙越恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201811401134.9A
Publication of CN109657057A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a short-text sentiment classification method combining SVM and document vectors, comprising the following steps: first, the short texts are preprocessed; second, the short texts are trained into multi-dimensional vectors using Doc2Vec; then an SVM classifier is trained on the labeled short-text data; finally, the trained SVM classifier is used to classify the sentiment of the unlabeled short texts. The invention uses the SVM from machine learning for sentiment classification: the classification effect is significant and little labeled data is needed, so precision improves while the cost of training falls.

Description

A short-text sentiment classification method combining SVM and document vectors
Technical field
The invention belongs to the field of computer natural language processing, and specifically relates to a short-text sentiment classification method combining SVM and document vectors. It is a technique for classifying the sentiment of news media data.
Background
Sentiment analysis has a wide range of applications. For example, an enterprise can use sentiment analysis to gauge users' sentiment tendencies and then improve its products and sales strategy; a film and television company can collect viewers' feedback on a film and adjust the programs it broadcasts accordingly. Driven by these practical demands, sentiment analysis technology has advanced considerably.
Sentiment orientation is divided in two main ways: coarse-grained and fine-grained. Coarse-grained division classifies sentiment as positive, neutral, or negative; some studies, to simplify the subsequent analysis, keep only positive and negative. Fine-grained division classifies sentiment into emotional categories such as joy, anger, sorrow, happiness, and fear.
In practice, some works assume that a passage of text carries a single sentiment and assign the whole text to one class. Others hold that a text expresses different sentiments toward different aspects, as in "Although this piece of clothing looks very good, it is not the style I like." With respect to "clothing" the sentiment of this text is positive, but with respect to "I" it is negative. This observation led to the concept of the aspect, that is, aspect-based sentiment analysis. In many cases, however, the real need is simply to know the overall sentiment orientation of a text, so research in recent years has concentrated on coarse-grained division. The sentiment analysis described here is likewise coarse-grained and is referred to below simply as sentiment analysis.
On the technical side, the mainstream methods are of two kinds: those based on a sentiment dictionary and those based on machine learning. Machine learning research centers on obtaining high-quality data and a good algorithmic model: the model is trained on labeled data and then used to judge the sentiment of new data. Dictionary-based methods center on building a good sentiment dictionary, and the quality of the dictionary strongly affects the result of the analysis. Because machine learning is an emerging technology that can both analyze large amounts of data and produce continuous vectors, more of the current research on sentiment classification uses machine learning.
Pang et al. were the first to apply machine learning algorithms to the sentiment analysis of text. However, Pang et al. used one-hot word vectors, which are sparse when applied to short texts. Mikolov, T. then obtained continuous word vectors with a neural network model and completed sentiment classification by summing word vectors and applying KNN. The present work is based on Doc2Vec: short texts are trained directly into vectors, and sentiment classification is completed with an SVM, which not only classifies well but also needs little labeled data.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a short-text sentiment classification method combining SVM and document vectors. The invention is widely applicable; for example, a business can use it to analyze user comments automatically, gauge how well users accept a product, and then improve the product and its economic returns.
To solve the technical problem described in the background, the invention adopts the following technical solution: a short-text sentiment classification method combining SVM and document vectors, comprising the following steps:
1) preprocess the short texts;
2) train the short texts into multi-dimensional vectors using Doc2Vec;
3) train an SVM classifier on the labeled short-text data;
4) classify the sentiment of the unlabeled short texts with the trained SVM classifier.
Step 1), preprocessing the short texts, comprises the following steps:
(1) crawl the comment data of the target website to form the short-text corpus for the experiment;
(2) remove irrelevant symbols from the corpus; the punctuation marks include . ? ! , ; : " " ' ' ( ) - ... " ";
(3) segment the collected comment data into words with a word-segmentation tool;
(4) remove irrelevant stop words from the segmented corpus.
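Preprocessing steps (2) to (4) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the punctuation set, the stop-word list, and the function name `preprocess` are assumptions, and the default tokenizer simply splits on whitespace. For Chinese comments, a segmenter such as `jieba.lcut` could be passed as `tokenize` instead.

```python
import re

# Symbols to strip. The exact character set is an assumption based on the
# punctuation listed in step (2), with full-width Chinese forms added.
PUNCTUATION = "。？！，；：“”‘’（）….?!,;:\"'()-"
_PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")

# Hypothetical stop-word list; a real list would be loaded from a file.
STOP_WORDS = {"the", "a", "is", "of"}


def preprocess(comment, tokenize=str.split, stop_words=STOP_WORDS):
    """Strip punctuation, segment into words, and drop stop words."""
    cleaned = _PUNCT_RE.sub(" ", comment)   # step (2): remove symbols
    tokens = tokenize(cleaned)              # step (3): word segmentation
    return [t for t in tokens if t not in stop_words]  # step (4)


tokens = preprocess("the film is good, very good!")
```

Keeping the tokenizer as a parameter lets the same function serve both whitespace-delimited and Chinese text without changing the other two steps.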
Step 2), training the short texts into multi-dimensional vectors using Doc2Vec, proceeds as follows:
(1) randomly initialize a vector matrix $A_{n \times m}$, where m can be chosen arbitrarily and n is the sum of the number of distinct words in all news comments and the number of comments in the corpus;
(2) for a news comment, map the words it contains, $C = (t_1, t_2, \ldots, t_{n-1}, t_n)$, together with the comment itself, to the corresponding vectors in $A_{n \times m}$, that is, $W = (w_1, w_2, w_3, \ldots, w_n, w_{n+1})$;
(3) sum the vectors $w_i$ ($i = 1, 2, 3, \ldots, n+1$) in W to obtain T:
$$T = \sum_{i=1}^{n+1} w_i$$
(4) feed T into the tanh activation function to obtain Y, where U and P are parameters that are updated dynamically:
$$Y = \tanh(UT + P)$$
(5) then feed Y into the SoftMax function to obtain the final probability of each word, where $Y_j$ denotes the component of Y for word j:
$$p(w_i \mid w_1, w_2, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{n+1}) = \frac{e^{Y_{w_i}}}{\sum_{j} e^{Y_j}}$$
(6) obtain the objective function f as the average over the words:
$$f = \frac{1}{n+1} \sum_{i=1}^{n+1} \log p(w_i \mid w_1, w_2, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{n+1})$$
(7) update the above parameters with the neural network back-propagation algorithm, finally obtaining the vector matrix $A_{n \times m}$.
Step 3) trains an SVM classifier on the labeled short-text data. Training the SVM classifier must satisfy the following constraint, so that the hyperplane found is optimal:
$$\min_{\omega, b} \frac{1}{2} \lVert \omega \rVert^2 \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, m$$
Solving the above problem gives the following formula, where $\alpha_i$ is the Lagrange multiplier:
$$\omega = \sum_{i=1}^{m} \alpha_i y_i x_i$$
where ω and b are the parameters of the SVM, $x_i$ and $y_i$ are the sample data, i is the index of a sample, and m is the total number of samples;
The training steps are as follows:
(1) select p positive-sentiment comments and p negative-sentiment comments, where p is generally greater than or equal to 300;
(2) convert these comments into the corresponding vectors in $A_{n \times m}$, giving $X = (x_1, x_2, \ldots, x_{2p})$; the sentiment labels of the comments form the vector $Y = (y_1, y_2, \ldots, y_{2p})$, where $y_i$ is 0 or 1, with 0 representing negative sentiment and 1 representing positive sentiment;
(3) substitute X and Y into the above formula for ω to obtain the value of ω, which yields the trained SVM;
(4) convert each comment to be classified into its corresponding vector X′ in $A_{n \times m}$ and feed it into the SVM to obtain the sentiment category Y′ of the comment.
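The relation between the Lagrange multipliers and ω can be checked on a toy case. The sketch below is an illustration under stated assumptions: the two sample points and their multipliers are made up, the function names are hypothetical, and the labels are taken as +1 and -1 because the margin constraint requires that form (0/1 labels would be mapped to -1/+1 first).

```python
def weight_vector(alphas, labels, samples):
    """omega = sum_i alpha_i * y_i * x_i."""
    dim = len(samples[0])
    omega = [0.0] * dim
    for a, y, x in zip(alphas, labels, samples):
        for k in range(dim):
            omega[k] += a * y * x[k]
    return omega


def bias(omega, alphas, labels, samples, tol=1e-12):
    """b from any support vector (alpha > 0): b = y_s - omega . x_s."""
    for a, y, x in zip(alphas, labels, samples):
        if a > tol:
            return y - sum(w * v for w, v in zip(omega, x))
    raise ValueError("no support vector found")


def classify(omega, b, x):
    """Sign of omega . x + b, returned as +1 or -1."""
    score = sum(w * v for w, v in zip(omega, x)) + b
    return 1 if score >= 0 else -1


# Toy data: one positive and one negative sample, both support vectors.
samples = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]   # satisfy sum_i alpha_i * y_i = 0

omega = weight_vector(alphas, labels, samples)   # [0.5, 0.5]
b = bias(omega, alphas, labels, samples)         # 0.0
```

With these values both samples sit exactly on the margin, since $y_i(\omega^T x_i + b) = 1$ for each of them.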
Beneficial effects
1. The invention has wide practical application. For example, an enterprise can use sentiment analysis to gauge users' sentiment tendencies and then improve its products and sales strategy; a film and television company can collect viewers' feedback on a film and adjust the programs it broadcasts, so as to maximize its final revenue.
2. The invention uses the SVM from machine learning for sentiment classification: the classification effect is significant and little labeled data is needed, so precision improves while the cost of training falls.
3. The invention is based on Doc2Vec and trains short texts directly into vectors. Doc2Vec training produces not only a vector for each word but also a vector for each paragraph. The paragraph vector carries contextual information, which can further improve the precision of sentiment classification. By contrast, Pang et al., who first applied machine learning algorithms to the sentiment analysis of text, used one-hot word vectors, which are sparse on short texts. Mikolov, T. obtained continuous word vectors with a neural network model and classified by summing word vectors and applying KNN, but classifying with word vectors alone ignores contextual information.
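How word rows and a per-comment row share one matrix can be seen in a small forward pass covering steps (1) to (5) of step 2). This is a dependency-free sketch under assumptions: the shapes of U and P (one row per vocabulary word) are not given in the text and are chosen here so that the softmax ranges over the vocabulary; the toy comments and all names are hypothetical; back-propagation is omitted.

```python
import math
import random


def init_matrix(rows, cols, rng):
    """Step (1): random initialization."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def forward(A, U, P, row_ids):
    """Steps (3)-(5): sum rows of A, apply tanh, then softmax."""
    m = len(A[0])
    # (3) T is the sum of the selected word rows and the comment row
    T = [sum(A[r][k] for r in row_ids) for k in range(m)]
    # (4) Y = tanh(U T + P), one entry per vocabulary word
    Y = [math.tanh(sum(u_k * t_k for u_k, t_k in zip(u, T)) + p)
         for u, p in zip(U, P)]
    # (5) softmax over the vocabulary (shifted by the max for stability)
    peak = max(Y)
    exps = [math.exp(y - peak) for y in Y]
    total = sum(exps)
    return [e / total for e in exps]


comments = [["good", "film"], ["bad", "film"]]
vocab = sorted({w for c in comments for w in c})
word_row = {w: i for i, w in enumerate(vocab)}
# Each comment gets its own row after the word rows, so n = words + comments.
doc_row = {d: len(vocab) + d for d in range(len(comments))}

m = 4                                   # vector dimension, chosen arbitrarily
rng = random.Random(0)
A = init_matrix(len(vocab) + len(comments), m, rng)
U = init_matrix(len(vocab), m, rng)     # assumed shape: vocabulary x m
P = [0.0] * len(vocab)

# Step (2): rows for the words of comment 0 plus the comment's own row.
rows = [word_row[w] for w in comments[0]] + [doc_row[0]]
probs = forward(A, U, P, rows)
```

After training, the rows `A[doc_row[d]]` are the document vectors that would be passed to the classifier.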
Brief description of the drawings
Fig. 1 is the flow chart of the method.
Fig. 2 is a schematic diagram of SVM classification.
Fig. 3 is a sample of Doc2Vec vectors.
Specific embodiments
The invention is described in detail below with reference to the drawings and specific embodiments. The invention is a method that combines an SVM classifier with document vectors to classify the sentiment of documents. Its implementation steps are illustrated below with specific examples:
Embodiment 1: sentiment classification of NetEase news comments
Fig. 1 is the flow chart for classifying the sentiment of text by combining SVM and document vectors. The specific steps of each module in this embodiment are as follows:
Step 1: use the address of the home news module of NetEase News as the crawl address.
Step 2: write a crawler in Python 3.0 with PyCharm and the Scrapy crawler framework to crawl the key fields of NetEase news pages: news title, time, news content, news address (URL), commenter ID, commenter region, and comment time. The main code of the crawler is as follows:
Step 3: store the crawled key fields in a MySQL database. Because the crawled data involve three entities (news, news comments, and users), the ER diagram calls for three MySQL tables, structured as follows:
Table 1: news data table
Table 2: comment data table
Table 3: user data table
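A three-table layout for the crawled fields could look like the sketch below. It is an assumption-laden illustration rather than the patent's schema: the column names are hypothetical, derived only from the fields listed in Step 2, and Python's built-in `sqlite3` stands in for MySQL so the example runs without a database server.

```python
import sqlite3

# Hypothetical columns based on the crawled fields; not the patent's tables.
SCHEMA = """
CREATE TABLE news (
    id       INTEGER PRIMARY KEY,
    title    TEXT,
    pub_time TEXT,
    content  TEXT,
    url      TEXT
);
CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    region TEXT
);
CREATE TABLE comments (
    id           INTEGER PRIMARY KEY,
    news_id      INTEGER REFERENCES news(id),
    user_id      INTEGER REFERENCES users(id),
    comment_time TEXT,
    content      TEXT,
    label        INTEGER DEFAULT 0   -- 1 positive, -1 negative, 0 unlabeled
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows, only to show how the three entities link together.
conn.execute("INSERT INTO news (id, title) VALUES (1, 'example title')")
conn.execute("INSERT INTO users (id, region) VALUES (7, 'Tianjin')")
conn.execute(
    "INSERT INTO comments (news_id, user_id, content, label) VALUES (?, ?, ?, ?)",
    (1, 7, "example comment", 1),
)

rows = conn.execute(
    "SELECT n.title, u.region, c.label "
    "FROM comments c "
    "JOIN news n ON n.id = c.news_id "
    "JOIN users u ON u.id = c.user_id"
).fetchall()
```

Storing the sentiment label on the comment row keeps the labeled and unlabeled comments in one table, which suits the later training and classification steps.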
A) Read the comment data from the database using Python, and label 300 positive comments and 300 negative comments. Comments are labeled positive or negative according to the emoticons below: positive sentiment is labeled 1, negative sentiment is labeled -1, and everything else is labeled 0.
Text in which the following symbols appear is defined as positive sentiment, as in Table 4:
Table 4
Text in which the following symbols appear is defined as negative sentiment, as in Table 5:
Table 5
B) Segment the news comment data with the Jieba word-segmentation tool, then use the segmented comments as input to the Doc2Vec model to train multi-dimensional vectors for the comments. The form of the trained vectors is shown in Fig. 3. The main code of the Doc2Vec model is as follows:
C) Train the SVM classifier on the labeled comment data; the principle by which the SVM classifies data is shown in Fig. 2. The main code for training the SVM classifier and classifying the unlabeled data is as follows:
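A minimal, dependency-free sketch of training and applying a linear SVM is given here for illustration; it is not the patent's listing. It swaps in sub-gradient descent on the regularized hinge loss rather than solving the Lagrangian dual, the two-dimensional vectors stand in for Doc2Vec comment vectors, and the hyperparameters are arbitrary. In practice a library classifier such as scikit-learn's `sklearn.svm.SVC` would normally be used.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            # L2 regularization shrinks the weights on every step
            w = [wk * (1.0 - lr * lam) for wk in w]
            if margin < 1:
                # sample violates the margin: move the hyperplane toward it
                w = [wk + lr * yi * xk for wk, xk in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (positive) or -1 (negative)."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Made-up stand-ins for labeled comment vectors.
X_train = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
           (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)

# Classify vectors of unlabeled comments.
unlabeled = [(2.5, 2.0), (-2.0, -2.5)]
predicted = [predict(w, b, x) for x in unlabeled]
```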
Embodiment 2: studying the sentiment evolution of an event
Fig. 1 is the flow chart for classifying the sentiment of text by combining SVM and document vectors. The specific steps of each module in this embodiment are as follows:
Step 1: use IDEA, Java, WebMagic, XPath, and other tools to crawl all news and news comments about the event published on the Sina platform within one year. The fields to be collected are news title, time, news content, news address (URL), commenter ID, commenter region, and comment time.
Step 2: store the crawled key fields in a MySQL database. Because the crawled data involve three entities (news, news comments, and users), the ER diagram calls for three MySQL tables: News, Comment, and User. The Java code for storing data into the User table is as follows:
Step 3: read the comment data from the database with a Java program, and label 300 positive comments and 300 negative comments. Positive sentiment is labeled 1, negative sentiment is labeled -1, and everything else is labeled 0, where positive and negative sentiment are defined as in Tables 4 and 5 of the previous embodiment.
Segment the news comments using Java; the segmentation code is as follows:
A) Use the segmented comment data as input to the Doc2Vec model to train multi-dimensional vectors for the comments. The form of the trained vectors is shown in Fig. 3;
B) Train the SVM classifier on the labeled comment data; the principle by which the SVM classifies data is shown in Fig. 2;
C) Divide the news comments by quarter, and classify the sentiment of each quarter's comments with the trained SVM. Count the positive and negative comments in each quarter as $m_i$ and $n_i$ ($i = 1, 2, 3, 4$);
D) Compute for each quarter the ratio of the number of positive comments to the number of negative comments, $w_i = m_i / n_i$, and let $W = (w_1, w_2, w_3, w_4)$;
E) Use HTML, ECharts, and JavaScript to plot the evolution curve of W.
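The quarterly counting and ratio can be sketched as follows. The sample dates and labels are made up, the function names are hypothetical, and returning `None` for a quarter with no negative comments is a design choice of this sketch, since the ratio is undefined there.

```python
from collections import defaultdict
from datetime import date


def quarter(d):
    """Map a date to its quarter, 1 through 4."""
    return (d.month - 1) // 3 + 1


def quarterly_ratios(dated_labels):
    """dated_labels: iterable of (date, label); 1 = positive, -1 = negative."""
    pos = defaultdict(int)   # m_i
    neg = defaultdict(int)   # n_i
    for d, label in dated_labels:
        q = quarter(d)
        if label == 1:
            pos[q] += 1
        elif label == -1:
            neg[q] += 1
    # w_i = m_i / n_i; undefined (None) when a quarter has no negatives
    return [pos[q] / neg[q] if neg[q] else None for q in (1, 2, 3, 4)]


# Made-up classified comments for one year.
sample = [
    (date(2018, 1, 5), 1), (date(2018, 2, 9), 1), (date(2018, 3, 1), -1),
    (date(2018, 5, 20), 1), (date(2018, 6, 2), -1), (date(2018, 6, 30), -1),
    (date(2018, 8, 15), -1),
    (date(2018, 11, 11), 1),
]
W = quarterly_ratios(sample)
```

The resulting list W is the series that would be passed to the charting layer.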
It should be understood that the embodiments discussed here are for illustration only. Those skilled in the art may make improvements or changes, and all such modifications and variations fall within the scope of protection of the appended claims.

Claims (4)

1. A short-text sentiment classification method combining SVM and document vectors, characterized by comprising the following steps:
1) preprocess the short texts;
2) train the short texts into multi-dimensional vectors using Doc2Vec;
3) train an SVM classifier on the labeled short-text data;
4) classify the sentiment of the unlabeled short texts with the trained SVM classifier.
2. The short-text sentiment classification method combining SVM and document vectors according to claim 1, characterized in that step 1), preprocessing the short texts, comprises the following steps:
(1) crawl the comment data of the target website to form the short-text corpus for the experiment;
(2) remove irrelevant symbols from the corpus; the punctuation marks include . ? ! , ; : " " ' ' ( ) - ... " ";
(3) segment the collected comment data into words with a word-segmentation tool;
(4) remove irrelevant stop words from the segmented corpus.
3. The short-text sentiment classification method combining SVM and document vectors according to claim 1, characterized in that step 2), training the short texts into multi-dimensional vectors using Doc2Vec, proceeds as follows:
(1) randomly initialize a vector matrix $A_{n \times m}$, where m can be chosen arbitrarily and n is the sum of the number of distinct words in all news comments and the number of comments in the corpus;
(2) for a news comment, map the words it contains, $C = (t_1, t_2, \ldots, t_{n-1}, t_n)$, together with the comment itself, to the corresponding vectors in $A_{n \times m}$, that is, $W = (w_1, w_2, w_3, \ldots, w_n, w_{n+1})$;
(3) sum the vectors $w_i$ ($i = 1, 2, 3, \ldots, n+1$) in W to obtain T:
$$T = \sum_{i=1}^{n+1} w_i$$
(4) feed T into the tanh activation function to obtain Y, where U and P are parameters that are updated dynamically:
$$Y = \tanh(UT + P)$$
(5) then feed Y into the SoftMax function to obtain the final probability of each word, where $Y_j$ denotes the component of Y for word j:
$$p(w_i \mid w_1, w_2, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{n+1}) = \frac{e^{Y_{w_i}}}{\sum_{j} e^{Y_j}}$$
(6) obtain the objective function f as the average over the words:
$$f = \frac{1}{n+1} \sum_{i=1}^{n+1} \log p(w_i \mid w_1, w_2, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{n+1})$$
(7) update the above parameters with the neural network back-propagation algorithm, finally obtaining the vector matrix $A_{n \times m}$.
4. The short-text sentiment classification method combining SVM and document vectors according to claim 1, characterized in that step 3) trains an SVM classifier on the labeled short-text data, and training the SVM classifier must satisfy the following constraint:
$$\min_{\omega, b} \frac{1}{2} \lVert \omega \rVert^2 \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, m$$
Solving the above problem gives the following formula, where $\alpha_i$ is the Lagrange multiplier:
$$\omega = \sum_{i=1}^{m} \alpha_i y_i x_i$$
where ω and b are the parameters of the SVM, $x_i$ and $y_i$ are the sample data, i is the index of a sample, and m is the total number of samples;
The training steps are as follows:
(1) select p positive-sentiment comments and p negative-sentiment comments, where p is generally greater than or equal to 300;
(2) convert these comments into the corresponding vectors in $A_{n \times m}$, giving $X = (x_1, x_2, \ldots, x_{2p})$; the sentiment labels of the comments form the vector $Y = (y_1, y_2, \ldots, y_{2p})$, where $y_i$ is 0 or 1, with 0 representing negative sentiment and 1 representing positive sentiment;
(3) substitute X and Y into the above formula for ω to obtain the value of ω, which yields the trained SVM;
(4) convert each comment to be classified into its corresponding vector X′ in $A_{n \times m}$ and feed it into the SVM to obtain the sentiment category Y′ of the comment.
CN201811401134.9A 2018-11-22 2018-11-22 Short-text sentiment classification method combining SVM and document vectors Pending CN109657057A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811401134.9A CN109657057A (en) 2018-11-22 2018-11-22 Short-text sentiment classification method combining SVM and document vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811401134.9A CN109657057A (en) 2018-11-22 2018-11-22 Short-text sentiment classification method combining SVM and document vectors

Publications (1)

Publication Number Publication Date
CN109657057A true CN109657057A (en) 2019-04-19

Family

ID=66112174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811401134.9A Pending CN109657057A (en) 2018-11-22 2018-11-22 Short-text sentiment classification method combining SVM and document vectors

Country Status (1)

Country Link
CN (1) CN109657057A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN105824922A (en) * 2016-03-16 2016-08-03 重庆邮电大学 Emotion classifying method fusing intrinsic feature and shallow feature
CN106407449A (en) * 2016-09-30 2017-02-15 四川长虹电器股份有限公司 Emotion classification method based on support vector machine
CN107193801A (en) * 2017-05-21 2017-09-22 北京工业大学 A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN107315797A (en) * 2017-06-19 2017-11-03 江西洪都航空工业集团有限责任公司 A kind of Internet news is obtained and text emotion forecasting system
CN108108468A (en) * 2017-12-29 2018-06-01 华中科技大学鄂州工业技术研究院 A kind of short text sentiment analysis method and apparatus based on concept and text emotion
CN108509629A (en) * 2018-04-09 2018-09-07 南京大学 Text emotion analysis method based on emotion dictionary and support vector machine
CN108733653A (en) * 2018-05-18 2018-11-02 华中科技大学 A kind of sentiment analysis method of the Skip-gram models based on fusion part of speech and semantic information

Similar Documents

Publication Publication Date Title
Kumar et al. Sentiment analysis of multimodal twitter data
CN108446271B (en) Text emotion analysis method of convolutional neural network based on Chinese character component characteristics
CN109933664B (en) Fine-grained emotion analysis improvement method based on emotion word embedding
CN106599022B (en) User portrait forming method based on user access data
Hassan et al. Sentiment analysis on bangla and romanized bangla text using deep recurrent models
Wu et al. Towards building a high-quality microblog-specific Chinese sentiment lexicon
CN108427670A (en) A kind of sentiment analysis method based on context word vector sum deep learning
CN107391483A (en) A kind of comment on commodity data sensibility classification method based on convolutional neural networks
CN107862343A (en) The rule-based and comment on commodity property level sensibility classification method of neutral net
CN108363725B (en) Method for extracting user comment opinions and generating opinion labels
CN108154395A (en) A kind of customer network behavior portrait method based on big data
CN107301199A (en) A kind of data label generation method and device
Al-Nabki et al. Improving named entity recognition in noisy user-generated text with local distance neighbor feature
CN108388554A (en) Text emotion identifying system based on collaborative filtering attention mechanism
Yeole et al. Opinion mining for emotions determination
CN109325120A (en) A kind of text sentiment classification method separating user and product attention mechanism
CN110196945A (en) A kind of microblog users age prediction technique merged based on LSTM with LeNet
CN112084333B (en) Social user generation method based on emotional tendency analysis
Du et al. A heuristic approach for website classification with mixed feature extractors
Thomas et al. Deep learning architectures for named entity recognition: A survey
CN110472115A (en) A kind of social networks text emotion fine grit classification method based on deep learning
CN107908749B (en) Character retrieval system and method based on search engine
Dabade Sentiment analysis of Twitter data by using deep learning And machine learning
Song et al. Extracting product features from online reviews for sentimental analysis
Nazir et al. Sentiment analysis of user reviews about hotel in Roman Urdu

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190419