CN108228655A

CN108228655A - A preprocessing method for feature verification in text sentiment analysis

Info

Publication number: CN108228655A
Application number: CN201611195601.8A
Authority: CN
Inventors: Name withheld at the inventor's request
Original assignee: Qingdao Xiangzhi Electronic Technology Co Ltd
Current assignee: Qingdao Xiangzhi Electronic Technology Co Ltd
Priority date: 2016-12-21
Filing date: 2016-12-21
Publication date: 2018-06-29

Abstract

The invention discloses a preprocessing method for feature verification in text sentiment analysis. Preprocessing information is obtained by preprocessing the original training set: a summary of the original training set is determined, a summary of the original feature vector set is determined, and the original data is augmented, so that integrated preprocessing information is constructed. Feature verification and feature selection are then performed on the preprocessing information. The beneficial effects of the invention are as follows: the invention generates analysis information from two aspects, the training set and the feature vectors, which both ensures the richness of the evaluation results and helps improve the accuracy of the overall sentiment analysis pipeline. The invention also has good generality and extensibility, and works well with sentiment analysis algorithms built on a variety of models and implementations.

Description

A preprocessing method for feature verification in text sentiment analysis

Technical field

The invention belongs to the field of text sentiment analysis, and in particular relates to a preprocessing method for feature verification in text sentiment analysis.

Background art

Existing feature selection and verification schemes for text classification work well for classifying content by topic domain, but the following problems arise when they are applied to sentiment analysis:

1. They target general-purpose scenarios and do not address the sentiment analysis domain in sufficient depth. In particular, sentiment information on the Internet shows a clearly imbalanced corpus distribution, and manually constructed sentiment corpora are prone to misclassified entries; existing algorithms generally give insufficient consideration to both problems.

2. There is no well-designed, practical common basis for verifying and comparing different feature extraction algorithms. For example, among common verification methods, TFIDF focuses on word frequency, so keywords that occur infrequently are easily ignored; information gain, by contrast, considers whether a feature occurs at all, but because it ignores word frequency it tends to exaggerate the effect of low-frequency words.

In addition, existing sentiment analysis solutions have the following problems:

1. The accuracy of most existing Chinese sentiment analysis algorithms is relatively low, and feature verification or feature selection schemes that could guide algorithm improvement are lacking. For example, according to the results of the fifth Chinese Opinion Analysis Evaluation workshop (COAE2013), accuracy is generally around 60%.

2. Text information is represented by feature vectors, but because the field lacks generally accepted best practices, sentiment analysis models and algorithms vary widely. A verification scheme for sentiment analysis feature vectors therefore has to take into account the characteristics of each of these algorithms and models, for example the common bag-of-words, n-gram, and word2vec models.

In summary, the invention aims to solve the input processing step for the feature vector set in feature verification for text sentiment analysis: it produces a preliminary judgment on whether the features are suitable for sentiment analysis and generates useful information for subsequent processing.

Summary of the invention

To overcome the above shortcomings of the prior art, the invention provides a preprocessing method for feature verification in text sentiment analysis.

The technical solution adopted by the invention to solve the technical problem is a preprocessing method for feature verification in text sentiment analysis, comprising the following steps:

Step 1: preprocess the original training set to obtain preprocessing information:

S1. Determine a summary of the original training set and output the result sample_struct, comprising: the total sample count parameter sample_size, the sentiment class distribution parameter sample_dist, and the text content distribution parameter sample_text_info;

S2. Input the feature vector set to be verified, determine a summary of the original feature vector set, and output the result vector_struct, comprising: the multi-faceted parameter vector_multi, the hard/soft parameter vector_prop, and the feature vector dimensionality parameter vector_dimen;

S3. Augment the original data and output the result addtion_sets, comprising:

(1) if the dimensionality parameter vector_dimen indicates that the feature vectors are low-dimensional and contain no word frequency information, build feature vectors based on the bag-of-words model to supplement the low-dimensional feature vectors, obtaining the word frequency supplementation result tf_addition_set;

(2) if the sentiment class distribution parameter sample_dist indicates an imbalanced distribution, balance the training set, obtaining the balancing result even_addition_set;

S4. Construct the integrated preprocessing information, comprising: the original feature vector set origin_set, the additional feature vector sets addtion_sets, the feature vector set attributes vector_struct, and the training set attributes sample_struct;

Step 2: perform feature verification and feature selection on the preprocessing information:

S1. For the original feature vector set: according to the values of sample_size and vector_dimen, carry out both of the following processes, feature selection using cross-validation as the decision criterion and verification of classification accuracy based on bootstrap, and then multiply the respective results by different weights;

S2. For the reference vector sets: for each feature vector set in turn, select representative features using general-purpose InfoGain; build a vector set using a bagging-like algorithm, determine the theoretically inferred class value of each vector in the vector set by majority vote, and then obtain information about the training set from the gap between the inferred class values of the vector set and the actual class values of the training set.

Compared with the prior art, the beneficial effects of the invention are as follows: the invention generates analysis information from two aspects, the training set and the feature vectors, which both ensures the richness of the evaluation results and helps improve the accuracy of the overall sentiment analysis pipeline. The invention also has good generality and extensibility, and works well with sentiment analysis algorithms built on a variety of models and implementations. Specifically:

1. The data attributes of the training set and of the original feature vector set are extracted separately;

2. Multiple algorithms are integrated to generate feature vector sets used as reference comparisons, and extensibility is retained;

3. Sentiment word frequency information and TFIDF word frequency information are integrated and appended to the original features;

4. The sentiment class distribution of imbalanced training sets is corrected;

5. The original sentiment feature vector set is analyzed in depth by combining feature selection with hypothesis testing algorithms;

6. Features are deleted one at a time from the original sentiment analysis algorithm, and the results are compared and verified with the chi-square test;

7. Multiple reference feature vector set extraction models serve as the basis for breadth analysis.

Detailed description of the embodiments

A preprocessing method for feature verification in text sentiment analysis comprises the following steps:

1. Preprocess the original training set to obtain preprocessing information:

This step comprises the following:

1.1. Profile the original training set; the output is denoted sample_struct:

(1) Determine whether the total number of samples is large enough: the result is represented by the parameter sample_size. For sentiment analysis samples, "large enough" means that every valid class contains more than 1000 non-duplicate samples.

(2) Determine whether the distribution of sentiment classes is balanced: the result is represented by the parameter sample_dist, which includes the number of samples in each class. If the sample counts of the different classes do not differ much, the distribution is balanced; otherwise it is imbalanced.

(3) Obtain the distribution of text content, represented by the parameter sample_text_info, which includes the number of words, sentences, and paragraphs.

The profiling results are reflected in the output of the overall process and serve as input parameters in subsequent processing.
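
The profiling in step 1.1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `profile_training_set`, the `balance_ratio` threshold (the description only says the class counts should "not differ much"), and the whitespace-based word count are not part of the original text.

```python
import re
from collections import Counter

def profile_training_set(samples, min_per_class=1000, balance_ratio=1.5):
    """Summarize a list of (text, label) pairs into a sample_struct-like dict."""
    unique = set(samples)  # only non-duplicate samples are counted
    per_class = Counter(label for _, label in unique)
    # "Large enough": every class has more than min_per_class unique samples.
    large_enough = all(n > min_per_class for n in per_class.values())
    # balance_ratio is an assumed threshold, not a value given in the description.
    balanced = max(per_class.values()) <= balance_ratio * min(per_class.values())
    words = sentences = paragraphs = 0
    for text, _ in unique:
        words += len(text.split())  # whitespace tokens; Chinese text would need a segmenter
        sentences += len([s for s in re.split(r"[.!?。！？]+", text) if s.strip()])
        paragraphs += len([p for p in text.split("\n") if p.strip()])
    return {
        "sample_size": large_enough,
        "sample_dist": {"counts": dict(per_class), "balanced": balanced},
        "sample_text_info": {"words": words, "sentences": sentences,
                             "paragraphs": paragraphs},
    }
```

The returned dictionary stands in for sample_struct and can be passed unchanged to later steps.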

1.2. Input the feature vector set to be verified and assess it; the output is denoted vector_struct:

(1) Whether it is multi-faceted or single-label, that is, whether it contains classification information other than sentiment features; denoted by the parameter vector_multi;

(2) Whether the analysis result is hard or soft, that is, whether the analysis result consists of probabilities over multiple sentiment values; denoted by the parameter vector_prop;

(3) Whether the dimensionality of the feature vectors (the number of features) is high enough (dimensionality is considered high when the number of features exceeds 500 or exceeds 20% of the number of samples), and whether word frequency information is included; denoted by the parameter vector_dimen.
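
As a hedged illustration of the dimensionality rule in item (3), the sketch below encodes the two thresholds from the description (more than 500 features, or more than 20% of the sample count). The function names and the dictionary layout are assumptions made for this example.

```python
def profile_feature_vectors(n_features, n_samples, has_term_frequency,
                            multi_label=False, soft_output=False):
    """Build a vector_struct-like dict from basic facts about the feature vector set."""
    # High-dimensional: more than 500 features, or more than 20% of the sample count.
    high_dim = n_features > 500 or n_features > 0.2 * n_samples
    return {
        "vector_multi": multi_label,
        "vector_prop": "soft" if soft_output else "hard",
        "vector_dimen": {"high": high_dim, "has_tf": has_term_frequency},
    }

def needs_tf_supplement(vector_struct):
    """Word frequency supplementation applies to low-dimensional sets without word frequency."""
    dimen = vector_struct["vector_dimen"]
    return (not dimen["high"]) and (not dimen["has_tf"])
```

For example, 100 features over 2000 samples without word frequency information would be flagged for supplementation, while 300 features over 1000 samples would not.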

1.3. Data augmentation; the output is denoted addtion_sets

This step augments the original data in several ways that can be cross-referenced, providing a comprehensive basis for deciding how to adjust the algorithm.

1.3.1. Word frequency supplementation; the result is denoted tf_addition_set. Based on the dimensionality parameter vector_dimen: if the feature vector set is low-dimensional and contains no word frequency information, bag-of-words feature vectors are built to supplement the low-dimensional feature vectors. The training set is mainly processed in the following ways:

1) Generate bag-of-words feature vectors from a sentiment dictionary, where each feature value is the product of the word frequency and the sentiment value;

2) Generate bag-of-words feature vectors from a dictionary extracted with TFIDF.

The dimensionality of the newly added features needs to be reduced using weighted-average KL divergence.

By supplementing the vector set, this step ensures that word information is taken into account in sentiment analysis and combines the informational strengths of both an empirical sentiment dictionary and a general word frequency dictionary, while keeping the two kinds of information independent.
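
The first variant (sentiment dictionary bag-of-words) might look like the following minimal sketch. The function name `lexicon_bow_vector`, the toy lexicon, and the choice of sorting the vocabulary alphabetically are assumptions for illustration; the TFIDF variant and the KL-divergence reduction are not shown.

```python
from collections import Counter

def lexicon_bow_vector(tokens, lexicon):
    """Return one bag-of-words vector for a tokenized text.

    Each feature value is (term frequency of a lexicon word) * (its sentiment value).
    The feature order is the alphabetically sorted lexicon vocabulary.
    """
    counts = Counter(tokens)
    vocab = sorted(lexicon)
    return [counts[word] * lexicon[word] for word in vocab]

# Hypothetical lexicon: word -> sentiment value
lexicon = {"good": 1.0, "bad": -1.0, "great": 2.0}
vector = lexicon_bow_vector(["good", "good", "bad", "movie"], lexicon)
```

Words outside the lexicon (here "movie") contribute nothing, which keeps this vector independent of the general word frequency dictionary.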

1.3.2. Training set balancing; the result is denoted even_addition_set. Based on the sentiment class distribution sample_dist: when the sentiment classes in the training set are unevenly distributed, balancing is performed by repeatedly sampling the under-represented classes or reducing the over-represented classes, after which reference vector sets are regenerated with the given sentiment analysis algorithm.

This step provides a basis for subsequent adjustment by comparing the results obtained on the balanced and imbalanced test sets.
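
The two balancing options (repeating under-represented classes or reducing over-represented ones) could be sketched as below. The function name, the `mode` parameter, and the use of random sampling with a fixed seed are assumptions of this example, not details from the description.

```python
import random
from collections import defaultdict

def balance_training_set(samples, mode="oversample", seed=0):
    """Balance (text, label) pairs so that every class has the same count."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    sizes = [len(items) for items in by_label.values()]
    # Oversampling grows every class to the largest size; undersampling shrinks to the smallest.
    target = max(sizes) if mode == "oversample" else min(sizes)
    balanced = []
    for items in by_label.values():
        if len(items) >= target:
            balanced.extend(rng.sample(items, target))  # keep `target` distinct items
        else:
            extra = rng.choices(items, k=target - len(items))  # repeat existing items
            balanced.extend(items + extra)
    return balanced
```

The balanced list would then be fed to the given sentiment analysis algorithm to regenerate the reference vector set.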

1.3.3. Extension space is reserved in the model so that users can add their own schemes for generating additional vector sets; the output is custom_addition_set.

1.4. Construct the integrated preprocessing information, which mainly includes:

the original feature vector set origin_set;

the additional feature vector sets addtion_sets;

the feature vector set attributes vector_struct;

the training set attributes sample_struct.

After the preprocessing information has been constructed, the feature verification and selection process begins, proceeding to feature selection and feature verification.
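
One possible way to hold the integrated preprocessing information is a small container type. The class name `PreprocessInfo` and the field types are assumptions; the field names follow the identifiers used in the description (including the spelling addtion_sets).

```python
from dataclasses import dataclass, field

@dataclass
class PreprocessInfo:
    """Container for the integrated preprocessing information of step 1.4."""
    origin_set: list        # original feature vector set
    sample_struct: dict     # training set attributes from step 1.1
    vector_struct: dict     # feature vector set attributes from step 1.2
    # Additional vector sets keyed by name, for example "tf_addition_set",
    # "even_addition_set" or "custom_addition_set".
    addtion_sets: dict = field(default_factory=dict)

info = PreprocessInfo(origin_set=[[0.1, 0.9]], sample_struct={}, vector_struct={})
info.addtion_sets["tf_addition_set"] = [[-1.0, 2.0, 0.0]]
```

Keeping the additional sets in a dictionary leaves room for user-defined entries, in line with the extension space mentioned in 1.3.3.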

2. Perform feature verification and feature selection on the preprocessing information:

This step comprises the following:

2.1. After the preprocessing information is input, processing proceeds in two directions, the original vector set and the reference vector sets, following a depth-first and a breadth-first approach respectively.

2.1.1. For the original feature vector set, the focus is on how reliably each feature reflects classification accuracy

When analyzing the original feature vector set in depth, both large and small samples must be considered comprehensively, and the features of the original vector set must be analyzed more thoroughly. Therefore, according to the values of sample_size and vector_dimen, the following two processes are both carried out and their results are then multiplied by different weights:

2.1.1.1. Feature selection using cross-validation as the decision criterion. This has advantages for large-scale samples. The feature selection algorithm can be chosen freely; a combination of the InfoGain and DF algorithms is recommended, so that both word frequency information and the effect of whether a word occurs are taken into account.

2.1.1.2. Verification of classification accuracy based on bootstrap: for the reference vector sets used for comparison, the original algorithm is modified to delete features one at a time, and the chi-square test is used to judge whether the resulting classification results differ significantly, thereby determining the usefulness of each individual feature. The hypothesis tested by the chi-square test is whether the number of samples assigned to each sentiment class changes significantly after the feature change. This method is intended mainly for low-dimensional features and is well suited to small samples.

Using the non-parametric chi-square test avoids problems caused by irregularly distributed sentiment information, while the fine-grained judgment of each feature fully meets the needs of in-depth analysis.
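
The chi-square comparison can be illustrated with a small sketch that compares per-class sample counts before and after one feature is deleted. It is a plain Pearson chi-square test of homogeneity on a 2 x k table; the function names, the 0.05 significance level, and the hard-coded critical values for small degrees of freedom are assumptions of this example.

```python
# Chi-square critical values at significance level 0.05, indexed by degrees of freedom.
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def chi_square_statistic(before, after):
    """before/after map each sentiment class to the number of samples assigned to it."""
    classes = sorted(set(before) | set(after))
    rows = [[before.get(c, 0) for c in classes], [after.get(c, 0) for c in classes]]
    total = sum(sum(row) for row in rows)
    col_totals = [rows[0][j] + rows[1][j] for j in range(len(classes))]
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat, len(classes) - 1  # statistic and degrees of freedom

def feature_matters(before, after):
    """True if deleting the feature changed the class counts significantly."""
    stat, dof = chi_square_statistic(before, after)
    return stat > CRITICAL_005[dof]
```

A feature whose deletion leaves the class counts essentially unchanged would be a candidate for removal, while a significant change suggests the feature carries usable information.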

2.1.2. Processing of the reference vector sets mainly consists of comparison between different sets; the breadth-first approach gives it good generality:

2.1.2.1. For each feature vector set in turn, representative feature items are selected using general-purpose InfoGain;
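
For reference, information gain for a single binary (present/absent) feature can be computed as in the sketch below. This is the textbook definition, not code from the patent; the function names are chosen for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(present, labels):
    """Information gain of a feature; present[i] is True if it occurs in sample i."""
    n = len(labels)
    gain = entropy(labels)
    for flag in (True, False):
        subset = [y for p, y in zip(present, labels) if p == flag]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain
```

Ranking the features of each vector set by this score and keeping the top ones is one straightforward way to pick representative feature items.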

2.1.2.2. The bagging-like algorithm is as follows: let the vector sets be indexed from 1 to n, that is, D1 to Dn. Each element of a vector set corresponds to one entry of the original training set. Suppose the original training set has m entries, the class of the i-th entry is Ci, and its class in vector set Dj is Cij; then the following vector set can be built:

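$$
\begin{aligned}
&(1,\ C_{11},\ C_{12},\ \dots,\ C_{1j},\ \dots,\ C_{1n},\ C_{1}),\\
&\qquad\vdots\\
&(i,\ C_{i1},\ \dots,\ C_{ij},\ \dots,\ C_{in},\ C_{i}),\\
&\qquad\vdots\\
&(m,\ C_{m1},\ \dots,\ C_{mj},\ \dots,\ C_{mn},\ C_{m})
\end{aligned}
$$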

Then, for each vector in this vector set, the theoretically inferred class value can be determined by majority vote. The gap between the inferred class values of the vector set and the actual class values of the training set then yields a variety of information, including whether the corresponding entry in the training set was mislabeled.

This step establishes a combined structure for the training set and the information from multiple vector sets that is general, intuitive, and easy to compute. Finally, the derived information is briefly summarized so that it can subsequently be handled with more targeted strategies.
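
The majority-vote comparison can be sketched as follows, with each row holding an entry index i, its classes in the n vector sets, and its actual class Ci. The function name and the tie-breaking behaviour (the class seen first among tied classes wins) are assumptions of this example.

```python
from collections import Counter

def find_suspect_entries(rows):
    """rows: iterable of (i, [C_i1, ..., C_in], C_i).

    Returns (i, voted_class, actual_class) for every entry whose majority-vote
    class differs from its label in the training set.
    """
    suspects = []
    for index, predicted, actual in rows:
        # most_common(1) returns the most frequent class; ties go to the class seen first.
        voted, _ = Counter(predicted).most_common(1)[0]
        if voted != actual:
            suspects.append((index, voted, actual))
    return suspects

rows = [
    (1, ["pos", "pos", "neg"], "pos"),
    (2, ["neg", "neg", "pos"], "pos"),  # vote disagrees with the training label
]
suspects = find_suspect_entries(rows)
```

Entries returned here are candidates for a mislabeled training sample, or alternatively for a class that the reference algorithms handle poorly.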

Claims

1. A preprocessing method for feature verification in text sentiment analysis, characterized by comprising the following steps:

Step 1: preprocess the original training set to obtain preprocessing information:

S1. Determine a summary of the original training set and output the result sample_struct, comprising: the total sample count parameter sample_size, the sentiment class distribution parameter sample_dist, and the text content distribution parameter sample_text_info;

S2. Input the feature vector set to be verified, determine a summary of the original feature vector set, and output the result vector_struct, comprising: the multi-faceted parameter vector_multi, the hard/soft parameter vector_prop, and the feature vector dimensionality parameter vector_dimen;

S3. Augment the original data and output the result addtion_sets, comprising:

(1) if the dimensionality parameter vector_dimen indicates that the feature vectors are low-dimensional and contain no word frequency information, build feature vectors based on the bag-of-words model to supplement the low-dimensional feature vectors, obtaining the word frequency supplementation result tf_addition_set;

(2) if the sentiment class distribution parameter sample_dist indicates an imbalanced distribution, balance the training set, obtaining the balancing result even_addition_set;

S4. Construct the integrated preprocessing information, comprising: the original feature vector set origin_set, the additional feature vector sets addtion_sets, the feature vector set attributes vector_struct, and the training set attributes sample_struct;

Step 2: perform feature verification and feature selection on the preprocessing information:

S1. For the original feature vector set: according to the values of sample_size and vector_dimen, carry out both of the following processes, feature selection using cross-validation as the decision criterion and verification of classification accuracy based on bootstrap, and then multiply the respective results by different weights;

S2. For the reference vector sets: for each feature vector set in turn, select representative feature items using general-purpose InfoGain; build a vector set using a bagging-like algorithm, determine the theoretically inferred class value of each vector in the vector set by majority vote, and then obtain information about the training set from the gap between the inferred class values of the vector set and the actual class values of the training set.

2. The preprocessing method for feature verification in text sentiment analysis according to claim 1, characterized in that: for sentiment analysis samples, the total number of samples is considered large enough when every valid class contains more than 1000 non-duplicate samples.

3. The preprocessing method for feature verification in text sentiment analysis according to claim 1, characterized in that: the text content distribution parameter sample_text_info includes the number of words, sentences, and paragraphs.

4. The preprocessing method for feature verification in text sentiment analysis according to claim 1, characterized in that: the feature vectors based on the bag-of-words model are built as follows: bag-of-words feature vectors are generated from a sentiment dictionary, each feature value being the product of the word frequency and the sentiment value; bag-of-words feature vectors are generated from a dictionary extracted with TFIDF; and the dimensionality of the newly added features is reduced using weighted-average KL divergence.

5. The preprocessing method for feature verification in text sentiment analysis according to claim 1, characterized in that: the training set is balanced by repeatedly sampling the under-represented classes or reducing the over-represented classes, after which reference vector sets are regenerated with the given sentiment analysis algorithm.

6. The preprocessing method for feature verification in text sentiment analysis according to claim 1, characterized in that: when the original data is augmented, extension space is reserved in the model so that users can add their own generated additional vector set custom_addition_set.

7. The preprocessing method for feature verification in text sentiment analysis according to claim 1, characterized in that: when feature selection using cross-validation as the decision criterion is performed on large-scale samples, a combination of the InfoGain and DF algorithms is used.

8. The preprocessing method for feature verification in text sentiment analysis according to claim 1, characterized in that: when classification accuracy is verified based on bootstrap on small-scale samples, the original algorithm is modified, for the reference vector sets used for comparison, to delete features one at a time, and the chi-square test is used to judge whether the resulting classification results differ significantly, thereby determining the usefulness of each individual feature.

9. The preprocessing method for feature verification in text sentiment analysis according to claim 1, characterized in that: the bagging-like algorithm is as follows: let the vector sets be indexed from 1 to n, that is, D1 to Dn. Each element of a vector set corresponds to one entry of the original training set. Suppose the original training set has m entries, the class of the i-th entry is Ci, and its class in vector set Dj is Cij; then the vector set is built as follows:

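$$
\begin{aligned}
&(1,\ C_{11},\ C_{12},\ \dots,\ C_{1j},\ \dots,\ C_{1n},\ C_{1}),\\
&\qquad\vdots\\
&(i,\ C_{i1},\ \dots,\ C_{ij},\ \dots,\ C_{in},\ C_{i}),\\
&\qquad\vdots\\
&(m,\ C_{m1},\ \dots,\ C_{mj},\ \dots,\ C_{mn},\ C_{m}).
\end{aligned}
$$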