CN103970864A

CN103970864A - Emotion classification and emotion component analyzing method and system based on microblog texts

Info

Publication number: CN103970864A
Application number: CN201410193638.1A
Authority: CN
Inventors: 徐华; 杨炜炜; 王玮
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2014-05-08
Filing date: 2014-05-08
Publication date: 2014-08-06
Anticipated expiration: 2034-05-08
Also published as: CN103970864B

Abstract

The invention discloses an emotion classification and emotion component analysis method based on microblog texts. The method comprises the following steps: acquiring a plurality of microblog texts published by users from the Internet; performing word segmentation on the microblog texts and obtaining a plurality of words according to the part of speech of each word; extracting a plurality of feature words from the plurality of words; training a classifier for each node of an emotion classification system according to the feature words so as to construct the emotion classification system, and performing emotion classification with the emotion classification system; and analyzing the emotion components of the microblog texts according to the classification result. With this method, the emotion classification system is constructed from the extracted feature words, emotion classification is performed, and the emotion components of the microblog texts are analyzed according to the classification result. Time is saved, classification speed and classification quality are improved, and emotion components can be analyzed quickly, so that users' needs are well met. The invention further discloses an emotion classification and emotion component analysis system based on microblog texts.

Description

Emotion classification and emotion component analysis method and system based on microblog texts

Technical field

The present invention relates to the technical fields of computer applications and the Internet, and in particular to an emotion classification and emotion component analysis method and system based on microblog texts.

Background technology

With the development of the Internet and Web 2.0, microblogs have become an indispensable channel for obtaining information and publishing news in people's daily lives. On a microblog, users can record their own lives, comment on current hot topics, and express their own opinions, and such microblogs often carry the emotions of their publishers. Analyzing the microblog texts published by users therefore makes it possible to infer their emotions, to mine from microblogs the emotional state of a single user or of all users toward a given hot event, and thus to provide data support for later decision-making. However, taking Sina Weibo as an example, it has about 500 million registered users, and more than 200 million new microblogs are published every day. Processing all of these microblogs manually would be extremely time-consuming and laborious, would waste resources, and could not meet users' needs well.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, one object of the present invention is to propose an emotion classification and emotion component analysis method based on microblog texts that saves time, improves classification speed and classification quality, and can also analyze emotion components quickly.

Another object of the present invention is to propose an emotion classification and emotion component analysis system based on microblog texts.

To achieve the above objects, an embodiment of one aspect of the present invention proposes an emotion classification and emotion component analysis method based on microblog texts, comprising the following steps: obtaining a plurality of microblog texts published by users from the Internet; performing word segmentation on the plurality of microblog texts, and obtaining a plurality of words according to the part of speech of each word; extracting a plurality of feature words from the plurality of words; training a classifier for each node of an emotion classification system according to the plurality of feature words so as to build the emotion classification system, and performing emotion classification with the emotion classification system; and analyzing the emotion components of the microblog texts according to the classification result.

According to the emotion classification and emotion component analysis method based on microblog texts of the embodiment of the present invention, a plurality of words are obtained by segmenting the microblog texts, and a plurality of feature words are extracted from them; the classifier of each node in the emotion classification system is trained according to the feature words, which completes the construction of the emotion classification system; emotion classification is then performed with the emotion classification system, and the emotion components of the microblog texts are quickly analyzed according to the classification result to detect the dominant emotions in the text. This not only saves time and improves classification speed, but also improves classification quality and better meets users' needs.

In addition, the emotion classification and emotion component analysis method based on microblog texts according to the above embodiment of the present invention may also have the following additional technical features:

In one embodiment of the present invention, extracting the plurality of feature words from the plurality of words specifically comprises: judging whether each word is a high-frequency word; if the word is a high-frequency word, calculating the degree of correlation of the word; and if the word is a low-frequency word, calculating the PMI value of the word.

Further, in one embodiment of the present invention, the degree of correlation of the word is calculated according to the following formula:

χ^{2} (t, c) = \frac{N {(AD - BC)}^{2}}{(A + C) (A + B) (B + D) (C + D)}

Wherein t is the word being evaluated, c is the category, and N is the number of documents; A is the number of documents that belong to category c and contain word t, B is the number of documents that do not belong to category c and contain word t, C is the number of documents that belong to category c and do not contain word t, and D is the number of documents that neither belong to category c nor contain word t.

Further, in one embodiment of the present invention, the PMI (Pointwise Mutual Information) value of the word is calculated according to the following formula:

PMI (t, c) = \log \frac{p (t, c)}{p (t) p (c)}

Wherein p(t, c) denotes the probability that a document contains word t and belongs to category c, p(t) denotes the probability that a document contains word t, and p(c) denotes the probability that a document belongs to category c.

Further, in one embodiment of the present invention, the emotion classification and emotion component analysis method based on microblog texts also comprises: if the degree of correlation of the word is greater than a first preset threshold, extracting the word as a feature word; and if the PMI value of the word is greater than a second preset threshold, extracting the word as a feature word.

Further, in one embodiment of the present invention, analyzing the emotion components of the microblog text according to the classification result further comprises: obtaining the regression value of the microblog text for each emotion in the emotion classification system; and calculating the score of each emotion according to its regression value, so as to select a preset number of emotions, and calculating the proportion of each of the selected emotions.

Further, in one embodiment of the present invention, the score of each emotion is calculated according to the following formula:

S_{i} = e^{V_{i, 3} + V_{i, 4}}

Wherein S_{i} denotes the score of the i-th emotion, V_{i, 3} denotes the layer-3 regression value of the i-th emotion, and V_{i, 4} denotes the layer-4 regression value of the i-th emotion. The proportion of each of the preset number of emotions is calculated according to the following formula:

P_{i} = \frac{e^{V_{i, 3} + V_{i, 4}}}{Σ_{k = 1}^{4} e^{V_{k, 3} + V_{k, 4}}}

Wherein P_{i} denotes the proportion of the i-th emotion, and the sum runs over the K selected emotions (K = 4 in the formula above).

An embodiment of another aspect of the present invention proposes an emotion classification and emotion component analysis system based on microblog texts, comprising: an acquisition module, for obtaining a plurality of microblog texts published by users from the Internet; a word segmentation module, for performing word segmentation on the plurality of microblog texts and obtaining a plurality of words according to the part of speech of each word; an extraction module, for extracting a plurality of feature words from the plurality of words; a creation module, for training a classifier for each node of an emotion classification system according to the plurality of feature words so as to build the emotion classification system, and for performing emotion classification with the emotion classification system; and an analysis module, for analyzing the emotion components of the microblog texts according to the classification result.

According to the emotion classification and emotion component analysis system based on microblog texts of the embodiment of the present invention, a plurality of words are obtained by segmenting the microblog texts, and a plurality of feature words are extracted from them; the classifier of each node in the emotion classification system is trained according to the feature words, which completes the construction of the emotion classification system; emotion classification is then performed with the emotion classification system, and the emotion components of the microblog texts are quickly analyzed according to the classification result to detect the dominant emotions in the text. This not only saves time and improves classification speed, but also improves classification quality and better meets users' needs.

In addition, the emotion classification and emotion component analysis system based on microblog texts according to the above embodiment of the present invention may also have the following additional technical features:

In one embodiment of the present invention, the extraction module is also used for: judging whether each word is a high-frequency word; if the word is a high-frequency word, calculating the degree of correlation of the word; and if the word is a low-frequency word, calculating the PMI value of the word.

Further, in one embodiment of the present invention, the degree of correlation of the word is calculated according to the following formula:

χ^{2} (t, c) = \frac{N {(AD - BC)}^{2}}{(A + C) (A + B) (B + D) (C + D)}

Further, in one embodiment of the present invention, the PMI value of the word is calculated according to the following formula:

PMI (t, c) = \log \frac{p (t, c)}{p (t) p (c)}

Further, in one embodiment of the present invention, the extraction module is also used for: if the degree of correlation of the word is greater than a first preset threshold, extracting the word as a feature word; and if the PMI value of the word is greater than a second preset threshold, extracting the word as a feature word.

Further, in one embodiment of the present invention, the analysis module is also used for: obtaining the regression value of the microblog text for each emotion in the emotion classification system; and calculating the score of each emotion according to its regression value, so as to select a preset number of emotions, and calculating the proportion of each of the selected emotions. The score and the proportion are calculated according to the following formulas:

S_{i} = e^{V_{i, 3} + V_{i, 4}}

P_{i} = \frac{e^{V_{i, 3} + V_{i, 4}}}{Σ_{k = 1}^{4} e^{V_{k, 3} + V_{k, 4}}}

Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the drawings, wherein:

Fig. 1 is a flowchart of an emotion classification and emotion component analysis method based on microblog texts according to one embodiment of the present invention;

Fig. 2 is a flowchart of an emotion classification and emotion component analysis method based on microblog texts according to a specific embodiment of the present invention;

Fig. 3 shows a four-layer fine-grained emotion classification system according to one embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an emotion classification and emotion component analysis system based on microblog texts according to one embodiment of the present invention; and

Fig. 5 is a schematic structural diagram of an emotion classification and emotion component analysis system based on microblog texts according to a specific embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.

The following disclosure provides many different embodiments or examples for realizing different structures of the present invention. To simplify the disclosure, the components and arrangements of specific examples are described below. They are, of course, only examples and are not intended to limit the present invention. In addition, the present invention may repeat reference numerals and/or letters in different examples. This repetition is for the purpose of simplicity and clarity and does not in itself indicate a relationship between the various embodiments and/or arrangements discussed. Furthermore, the present invention provides examples of various specific processes and materials, but those of ordinary skill in the art will recognize the applicability of other processes and/or the use of other materials. In addition, a structure described below in which a first feature is "on" a second feature may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which other features are formed between the first and second features, so that the first and second features may not be in direct contact.

In the description of the present invention, it should be noted that, unless otherwise specified and limited, the terms "mounted", "connected" and "connection" are to be interpreted broadly. For example, a connection may be mechanical or electrical, may be an internal connection between two elements, and may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms according to the particular case.

The emotion classification and emotion component analysis method and system based on microblog texts proposed according to the embodiments of the present invention are described below with reference to the drawings, beginning with the method. Referring to Fig. 1, the emotion classification and emotion component analysis method based on microblog texts (hereinafter referred to as the analysis method) comprises the following steps:

S101: obtain a plurality of microblog texts published by users from the Internet.

In one embodiment of the present invention, referring to Fig. 2, the original microblog texts are obtained from the Internet for emotion classification and emotion component analysis. The data are crawled from the microblog platform by a web crawler, mainly through the platform's API (Application Programming Interface), and are saved in a corresponding database. Further, the crawled data are generally microblog texts; if the microblogs related to a particular event are to be analyzed, the corresponding API can be used to crawl the data.

S102: perform word segmentation on the plurality of microblog texts, and obtain a plurality of words according to the part of speech of each word.

In one embodiment of the present invention, the microblog texts are preferably segmented with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and after segmentation the words with the following parts of speech are retained: noun (n), character string (x), numeral (m), measure word (q), verb (v), adjective (a), descriptive word (z), distinguishing word (b), adverb (d), unknown part of speech (un), interrogative pronoun (ry), question mark (ww), exclamation mark (wt), left parenthesis (wkz) and right parenthesis (wky).
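The part-of-speech filter described above can be sketched as follows. This is a minimal illustration, assuming the segmenter output is already available as (word, tag) pairs; the function name `filter_by_pos`, the input format and the sample tokens are hypothetical, ICTCLAS itself is not invoked, and exact tag matching is an assumption.

```python
# Part-of-speech tags retained after segmentation, as listed in the text.
KEPT_TAGS = {"n", "x", "m", "q", "v", "a", "z", "b", "d",
             "un", "ry", "ww", "wt", "wkz", "wky"}


def filter_by_pos(tagged_words):
    """Keep only the (word, tag) pairs whose tag is in KEPT_TAGS.

    Assumption: tags are compared by exact match.
    """
    return [(word, tag) for word, tag in tagged_words if tag in KEPT_TAGS]


# Illustrative input; the tag "t" stands for any tag outside the retained list.
sample = [("today", "t"), ("very", "d"), ("happy", "a"), ("!", "wt")]
kept = filter_by_pos(sample)
print(kept)
```

Only the tokens tagged `d`, `a` and `wt` survive in this example, because `t` is not among the retained tags.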

Further, in one embodiment of the present invention, two processing rules are added in order to better extract the words whose parts of speech are to be retained, so that the resulting words form the feature space from which the feature words are extracted. Rule one is the repeated punctuation rule. To distinguish several consecutive question marks (exclamation marks) from a single one while keeping the features uniform, the present invention represents any run of consecutive question marks (exclamation marks) with two question marks (exclamation marks). Rule two is the negation word rule. When a negated phrase such as "not very happy" occurs, the word segmentation system splits it at the negation word into "not/d very/d happy/a", which matches neither the requirement nor the actual semantics. Therefore, if an adjective appears among the three words following a negation word, these words are treated together as one word, and the segmentation result becomes "not very happy/a".
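The two rules can be sketched as below. This is only an illustration under stated assumptions: the negation word list `NEGATION_WORDS`, the function names and the `joiner` parameter are hypothetical, and English tokens stand in for the segmenter output (for Chinese tokens an empty joiner would be the natural choice).

```python
import re

# Hypothetical negation word list, for illustration only.
NEGATION_WORDS = {"not", "no", "never"}


def collapse_punctuation(text):
    """Rule one: replace any run of two or more identical question or
    exclamation marks (ASCII or full-width) with exactly two of them."""
    return re.sub(r"([?!？！])\1+", r"\1\1", text)


def merge_negation(tagged_words, window=3, joiner=" "):
    """Rule two: if an adjective (tag 'a') occurs among the `window` words
    following a negation word, merge the span from the negation word up to
    that adjective into a single token tagged 'a'."""
    merged = []
    i = 0
    while i < len(tagged_words):
        word, tag = tagged_words[i]
        if word in NEGATION_WORDS:
            end = min(i + 1 + window, len(tagged_words))
            for j in range(i + 1, end):
                if tagged_words[j][1] == "a":
                    span = joiner.join(w for w, _ in tagged_words[i:j + 1])
                    merged.append((span, "a"))
                    i = j + 1
                    break
            else:
                # No adjective within the window: keep the negation word as is.
                merged.append((word, tag))
                i += 1
        else:
            merged.append((word, tag))
            i += 1
    return merged


print(collapse_punctuation("really???? yes!"))
print(merge_negation([("not", "d"), ("very", "d"), ("happy", "a")]))
```

A single question or exclamation mark is left untouched, so the single and repeated cases remain distinguishable as features.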

S103: extract a plurality of feature words from the plurality of words.

In one embodiment of the present invention, extracting the plurality of feature words from the plurality of words specifically comprises: judging whether each word is a high-frequency word; if the word is a high-frequency word, calculating the degree of correlation of the word; and if the word is a low-frequency word, calculating the PMI value of the word.

Specifically, in one embodiment of the present invention, the target feature word set to be selected by the proposed feature selection algorithm, i.e. the plurality of feature words, is divided into two parts: a high-frequency word set and a low-frequency word set. A high-frequency word is one that occurs frequently in the sampled texts, and a low-frequency word is one that occurs rarely. Specifically, a preset frequency can be determined according to the actual situation: a word whose frequency is higher than the preset frequency is a high-frequency word, and a word whose frequency is lower than the preset frequency is a low-frequency word. In addition, it should be noted that the thresholds involved in the following process are all determined by iteration.
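As a rough sketch of this split, the following assumes that frequency means the raw occurrence count over all segmented texts and that a word whose count equals the preset frequency falls into the low-frequency set; neither detail is specified in the text, and the function name is hypothetical.

```python
from collections import Counter


def split_by_frequency(segmented_texts, preset_frequency):
    """Split the vocabulary into high- and low-frequency word sets."""
    counts = Counter(word for text in segmented_texts for word in text)
    high = {w for w, n in counts.items() if n > preset_frequency}
    # Assumption: a count equal to the preset frequency counts as low-frequency.
    low = set(counts) - high
    return high, low


texts = [["happy", "day", "happy"], ["happy", "rain"]]
high, low = split_by_frequency(texts, preset_frequency=2)
print(sorted(high), sorted(low))
```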

Specifically, for the high-frequency word set, a method combining the chi-square test and the odds ratio is adopted. The chi-square test algorithm is as follows: let the word to be evaluated be t and the category be c, and let there be N documents in total, i.e. N microblog texts. As shown in Table 1, the documents are divided into the following four classes according to whether they contain t and whether they belong to c:

Table 1

	Belongs to category c	Does not belong to category c
Contains word t	A	B
Does not contain word t	C	D

Further, in one embodiment of the present invention, the degree of correlation of the word is calculated according to the following formula:

χ^{2} (t, c) = \frac{N {(AD - BC)}^{2}}{(A + C) (A + B) (B + D) (C + D)}

Wherein t is the word being evaluated, c is the category, and N is the number of documents; consistent with Table 1, A is the number of documents that belong to category c and contain word t, B is the number of documents that do not belong to category c and contain word t, C is the number of documents that belong to category c and do not contain word t, and D is the number of documents that neither belong to category c nor contain word t.
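The chi-square statistic can be computed directly from the four cells of Table 1. The sketch below assumes that each document is a list of words with one category label; the function names are hypothetical, and returning 0.0 for a zero denominator is a convention chosen here, not something stated in the text.

```python
def contingency_counts(documents, labels, t, c):
    """Count the cells of Table 1: A = in c and contains t, B = not in c and
    contains t, C = in c without t, D = not in c without t."""
    A = B = C = D = 0
    for words, label in zip(documents, labels):
        has_t = t in words
        in_c = label == c
        if has_t and in_c:
            A += 1
        elif has_t:
            B += 1
        elif in_c:
            C += 1
        else:
            D += 1
    return A, B, C, D


def chi_square(A, B, C, D):
    """chi^2(t, c) = N (AD - BC)^2 / ((A + C)(A + B)(B + D)(C + D))."""
    N = A + B + C + D
    denominator = (A + C) * (A + B) * (B + D) * (C + D)
    if denominator == 0:
        return 0.0  # Convention chosen here for degenerate tables.
    return N * (A * D - B * C) ** 2 / denominator


docs = [["happy", "day"], ["happy"], ["rain"], ["sad", "rain"]]
labels = ["joy", "joy", "sorrow", "sorrow"]
cells = contingency_counts(docs, labels, t="happy", c="joy")
print(cells, chi_square(*cells))
```

In this toy example the word "happy" occurs in exactly the documents of category "joy", so the statistic reaches its maximum value N = 4.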

Further, in one embodiment of the present invention, the PMI value of the word is calculated according to the following formula:

PMI (t, c) = \log \frac{p (t, c)}{p (t) p (c)}

Wherein p(t, c) denotes the probability that a document contains word t and belongs to category c, p(t) denotes the probability that a document contains word t, and p(c) denotes the probability that a document belongs to category c.
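A minimal sketch of the PMI computation follows, assuming the probabilities are estimated as relative document frequencies from the cell counts A, B, C and D of Table 1 and that the logarithm is natural; the text does not fix the estimator or the base, and the function name is hypothetical.

```python
import math


def pmi(A, B, C, D):
    """PMI(t, c) = log(p(t, c) / (p(t) p(c))), with probabilities estimated
    from the Table 1 counts: p(t, c) = A/N, p(t) = (A+B)/N, p(c) = (A+C)/N."""
    N = A + B + C + D
    p_tc = A / N
    p_t = (A + B) / N
    p_c = (A + C) / N
    if p_tc == 0:
        # The word never co-occurs with the category (assumed convention).
        return float("-inf")
    return math.log(p_tc / (p_t * p_c))


print(pmi(2, 0, 0, 2))
print(pmi(1, 1, 1, 1))
```

A positive value indicates that the word and the category co-occur more often than they would if independent; a value of zero corresponds to independence.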

Further, in one embodiment of the present invention, the analysis method also comprises: if the degree of correlation of the word is greater than a first preset threshold, extracting the word as a feature word; and if the PMI value of the word is greater than a second preset threshold, extracting the word as a feature word.

Specifically, in one embodiment of the present invention, when selecting from the high-frequency word set, the words are traversed from high to low; if the odds ratio of a word, i.e. its degree of correlation, is greater than the set threshold, the word is selected, and this continues until no more words are available or the number of selected words reaches a threshold. When selecting from the low-frequency word set, the embodiment of the present invention uses PMI: for each word, if its positive PMI or its negative PMI is higher than the set threshold, the word is selected. Finally, the high-frequency word set and the low-frequency word set are merged to form the final retained feature word set.
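The selection procedure might be sketched as follows. Several details are assumptions made for illustration: the high-frequency words are traversed in order of decreasing frequency, the relevance and PMI scores are precomputed, and all names, thresholds and sample values are hypothetical.

```python
def select_features(high_words, low_words, relevance_threshold,
                    pmi_threshold, max_high):
    """Select feature words from both sets and merge them.

    high_words: list of (word, frequency, relevance) tuples.
    low_words: dict mapping word -> (positive_pmi, negative_pmi).
    """
    selected_high = []
    # Assumption: "from high to low" refers to word frequency.
    for word, _freq, relevance in sorted(high_words, key=lambda x: x[1],
                                         reverse=True):
        if len(selected_high) >= max_high:
            break  # The word-count threshold has been reached.
        if relevance > relevance_threshold:
            selected_high.append(word)
    selected_low = [word for word, (pos, neg) in low_words.items()
                    if pos > pmi_threshold or neg > pmi_threshold]
    return selected_high + selected_low


# Made-up scores, for illustration only.
high = [("good", 50, 12.0), ("the", 90, 0.5), ("bad", 40, 9.0), ("sad", 30, 8.0)]
low = {"ecstatic": (2.1, -0.3), "table": (0.1, 0.2)}
features = select_features(high, low, relevance_threshold=3.0,
                           pmi_threshold=1.0, max_high=2)
print(features)
```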

S104: train the classifier of each node in the emotion classification system according to the plurality of feature words so as to build the emotion classification system, and perform emotion classification with the emotion classification system.

Further, in one embodiment of the present invention, SVM (Support Vector Machine) is a machine learning algorithm used to process linearly separable data. When the data are not linearly separable, SVM can map them into a higher-dimensional space in which they become linearly separable. Meanwhile, to avoid the computational complexity of working in the higher-dimensional space, SVM can use a kernel function to compute the result. The classifier used in the embodiment of the present invention is SVR (Support Vector Regression), a branch of SVM. Specifically, unlike SVM, which directly gives a classification result, SVR gives a regression value for each sample, so that the classification threshold can be adjusted more flexibly. For multi-class cases, SVR first calculates the regression value of each class, then calculates the difference between each regression value and the threshold, and selects the class with the largest difference as the final class. In other words, the embodiment of the present invention trains the SVR of each node in the emotion classification system according to the plurality of feature words so as to build a four-layer fine-grained emotion classification system. It should be noted that, in practical applications, the feature selection algorithm of each layer can also be adjusted flexibly according to the characteristics of the data, and algorithms different from those of the present invention can be chosen to build the emotion classification system. The analysis method of the embodiment of the present invention not only improves the quality of emotion classification but also its speed.
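The multi-class decision rule described above can be illustrated in isolation. The sketch below assumes the per-class regression values have already been produced by trained regressors (for instance instances of scikit-learn's `sklearn.svm.SVR`, which is not used here) and that each class may have its own threshold; the function name and the numbers are hypothetical.

```python
def pick_class(regression_values, thresholds):
    """Return the class whose regression value exceeds its threshold by the
    largest margin (regression value minus threshold)."""
    return max(regression_values,
               key=lambda c: regression_values[c] - thresholds[c])


# Made-up regression values and thresholds for three sibling classes.
values = {"happy": 0.60, "sad": 0.50, "wish": 0.20}
thresholds = {"happy": 0.50, "sad": 0.10, "wish": 0.30}
print(pick_class(values, thresholds))
```

Here "sad" is chosen although "happy" has the larger raw regression value, which shows how adjusting the thresholds changes the outcome without retraining.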

Preferably, in one embodiment of the present invention, referring to Fig. 3, the emotion classification system is a four-layer fine-grained emotion classification system. Specifically, earlier emotion classification algorithms generally use a three-layer classification system with 7 emotions in total, whereas the emotion classification system adopted in the embodiment of the present invention adds one more layer on that basis and has 19 fine-grained basic emotions, so that emotions can be characterized more finely.

In an embodiment of the present invention, the classifiers are trained according to the plurality of feature words. The plurality of feature words are divided into a training set and a test set; the classifiers are trained on the training set and evaluated on the test set. Precision, recall and F1 score are used as the evaluation metrics. In a specific embodiment of the present invention, the classification results are shown in Table 2. All data used are original microblog texts crawled from Sina Weibo, 9,960 in total. As Table 2 shows, the embodiment of the present invention improves the precision and coverage of emotion classification and classifies the emotions of microblog texts better.

Table 2

Emotion	Precision	Recall	F1 score
Sad	0.398	0.412	0.415
Compunction	0.333	0.130	0.188
Disappointed	0.327	0.358	0.341
Miss	0.446	0.465	0.455
In surprise	0.417	0.312	0.357
Unbearably	0.529	0.429	0.474
Frightened	0.500	0.583	0.538
Shy	0.267	0.267	0.267
Indignation	0.750	0.493	0.595
Censure	0.284	0.338	0.309
Unhappy	0.300	0.401	0.344
Suspect	0.188	0.115	0.143
Abhor	0.514	0.463	0.487
Like	0.273	0.185	0.220
Believe	0.467	0.389	0.424
Praise	0.111	0.070	0.086
Wish	0.606	0.680	0.641
Feel at ease	0.294	0.294	0.294
Happy	0.578	0.585	0.581
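For reference, the three metrics reported in Table 2 can be computed per emotion as sketched below. This uses the standard definitions of precision, recall and F1; the function name and the toy labels are hypothetical and are not taken from the patent's data.

```python
def precision_recall_f1(gold, predicted, label):
    """Per-class precision, recall and F1 from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, for illustration only.
gold = ["happy", "happy", "sad", "sad"]
predicted = ["happy", "sad", "sad", "sad"]
print(precision_recall_f1(gold, predicted, "sad"))
```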

S105: analyze the emotion components of the microblog text according to the classification result.

Further, in one embodiment of the present invention, analyzing the emotion components of the microblog text according to the classification result further comprises: obtaining the regression value of the microblog text for each emotion in the emotion classification system; and calculating the score of each emotion according to its regression value, so as to select a preset number of emotions, and calculating the proportion of each of the selected emotions.

Further, in one embodiment of the present invention, the score of each emotion is calculated according to the following formula:

S_{i} = e^{V_{i, 3} + V_{i, 4}}

Wherein S_{i} denotes the score of the i-th emotion, V_{i, 3} denotes the layer-3 regression value of the i-th emotion, and V_{i, 4} denotes the layer-4 regression value of the i-th emotion. The proportion of each of the preset number of emotions is calculated according to the following formula:

P_{i} = \frac{e^{V_{i, 3} + V_{i, 4}}}{Σ_{k = 1}^{4} e^{V_{k, 3} + V_{k, 4}}}

Wherein P_{i} denotes the proportion of the i-th emotion, and the sum runs over the K selected emotions (K = 4 in the formula above).

Specifically, emotion component analysis depends on the emotion classification result and detects the dominant emotions in the current text. The scores are computed mainly from the regression values of the current text on layer 3 and layer 4 of the emotion classification system, and the preset number of emotions with the highest scores, for example 4, are selected. For example, for the i-th basic emotion, the score is:

S_{i} = e^{V_{i, 3} + V_{i, 4}}

where V_{i, 3} and V_{i, 4} are the regression values of emotion i on layer 3 and layer 4, respectively. Further, the 4 emotions with the highest scores are selected by S_{i}, and the proportion of each is calculated, which completes the analysis of the emotion components of the microblog text. The proportion is calculated as:

P_{i} = \frac{e^{V_{i, 3} + V_{i, 4}}}{Σ_{k = 1}^{4} e^{V_{k, 3} + V_{k, 4}}}

The embodiment of the present invention automatically identifies the emotions in a microblog by computer, detects the 4 dominant emotions, calculates their proportions, and displays the result dynamically.
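The score and proportion computation amounts to a softmax over the selected emotions, using the score S_i = e^{V_{i,3} + V_{i,4}} given in the summary above. The sketch below assumes the layer-3 and layer-4 regression values are available as a mapping from emotion name to a pair of numbers; the function name and the sample values are hypothetical.

```python
import math


def emotion_composition(regression_values, top_n=4):
    """Score each emotion as exp(V3 + V4), keep the top_n highest-scoring
    emotions and normalize their scores so that the proportions sum to 1."""
    scores = {emotion: math.exp(v3 + v4)
              for emotion, (v3, v4) in regression_values.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    total = sum(scores[emotion] for emotion in top)
    return {emotion: scores[emotion] / total for emotion in top}


# Made-up (layer-3, layer-4) regression values, for illustration only.
values = {
    "happy": (1.0, 1.0),
    "wish": (0.5, 0.5),
    "like": (0.2, 0.1),
    "sad": (0.0, 0.0),
    "frightened": (-1.0, -1.0),
}
composition = emotion_composition(values)
for emotion, share in composition.items():
    print(f"{emotion}: {share:.3f}")
```

Normalizing only over the selected emotions, rather than over all 19, keeps the displayed proportions summing to one.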

The analysis method of the embodiment of the present invention has the following main features: 1) It saves time. The emotion classes and the dominant emotions of a microblog text can be obtained quickly without manual analysis. 2) It is widely applicable. The method can be used by vendors or competent authorities to analyze the emotional trend of users as a whole, and it can also be used by individual users to analyze their own emotional state and that of others. 3) Its emotion granularity is fine. Earlier emotion classification algorithms generally use a three-layer classification system with 7 emotions in total, whereas the emotion classification system adopted in the embodiment of the present invention adds one more layer on that basis and has 19 fine-grained basic emotions, so that emotions can be characterized more finely.

Fig. 4 is according to the mood classification based on microblogging text of the embodiment of the present invention and the structural representation of mood elemental analysis system.Shown in Fig. 4, according to mood classification and the mood elemental analysis system (hereinafter to be referred as analytic system 100) based on microblogging text of the embodiment of the present invention, comprise: acquisition module 10, word-dividing mode 20, extraction module 30, creation module 40 and analysis module 50.

Wherein, acquisition module 10 is for obtaining the microblogging text of many user's issues from internet.Word-dividing mode 20 is for carrying out participle to many microblogging texts, to obtain a plurality of words according to the part of speech of each word.Extraction module 30 is for extracting a plurality of Feature Words from a plurality of words.Creation module 40, for according to the sorter of a plurality of each node of Feature Words training mood taxonomic hierarchies, is carried out mood classification to build mood taxonomic hierarchies, and by mood taxonomic hierarchies, is realized mood and classify.Analysis module 50 is for analyzing microblogging text mood composition according to classification results.

In one embodiment of the invention, shown in Fig. 2, the embodiment of the present invention is mainly obtained original microblogging text from internet, to carry out mood classification and mood constituent analysis.The data of the embodiment of the present invention mainly API based on microblogging crawl from microblogging by web crawlers, and are saved in database 80.Further, the data of crawl are generally microblogging text, if the relevant microblogging of a certain event is analyzed, can use corresponding API to capture data.

Preferably, in one embodiment of the invention, the embodiment of the present invention is preferably by using the ICTCLAS of Chinese Academy of Sciences Words partition system to carry out participle to microblogging text, retains the word of following part of speech: noun (n), character string (x), number (m), measure word (q), verb (v), adjective (a), descriptive word (z), distinction word (b), adverbial word (d), unknown part of speech (un), interrogative pronoun (ry), question mark (ww), exclamation (wt), left parenthesis (wkz) and right parenthesis (wky) after participle.

Further, in one embodiment of the invention, the extraction module 30 is also used for judging whether each word is a high-frequency word; if the word is a high-frequency word, calculating the relevance of the word; and if the word is a low-frequency word, calculating the PMI (pointwise mutual information) value of the word.

Specifically, for the set of high-frequency words, a method combining the chi-square test and odds ratios is adopted. The chi-square test algorithm is as follows: let the word to be evaluated be t and the category be c, with N documents in total, that is, N microblog texts. As shown in Table 1, the documents are divided into 4 classes according to whether they contain t and whether they belong to c.

χ^{2} (t, c) = \frac{N {(AD - BC)}^{2}}{(A + C) (A + B) (B + D) (C + D)}

PMI (t, c) = \log \frac{p (t, c)}{p (t) p (c)}

Further, in one embodiment of the invention, the extraction module 30 is also used for extracting a word as a feature word if its relevance is greater than a first preset threshold, and for extracting a word as a feature word if its PMI value is greater than a second preset threshold.
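The feature-word selection described above can be sketched as follows, under several assumptions that are not stated in the text:

- A, B, C and D are taken to be the numbers of documents that contain t and belong to c, contain t but do not belong to c, do not contain t but belong to c, and neither contain t nor belong to c.
- The probabilities in the PMI formula are estimated from these counts, and the logarithm is natural.
- A word counts as high-frequency when its document frequency A + B reaches a cutoff.
- The chi-square value alone stands in for the relevance, because the rule for combining it with the odds ratio is not given here.

All function names, the cutoff and the thresholds are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic from the 2x2 counts, with N = A + B + C + D."""
    n = a + b + c + d
    denom = (a + c) * (a + b) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: treat the word as uninformative
    return n * (a * d - b * c) ** 2 / denom


def pmi(a, b, c, d):
    """PMI(t, c) = log(p(t, c) / (p(t) p(c))), estimated from counts."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # t never co-occurs with c
    p_tc = a / n
    p_t = (a + b) / n
    p_c = (a + c) / n
    return math.log(p_tc / (p_t * p_c))


def select_features(counts, freq_cutoff, chi_threshold, pmi_threshold):
    """Select feature words from {word: (A, B, C, D)}.

    High-frequency words are kept when chi-square exceeds chi_threshold;
    low-frequency words are kept when PMI exceeds pmi_threshold.
    """
    selected = []
    for word, (a, b, c, d) in counts.items():
        if a + b >= freq_cutoff:
            if chi_square(a, b, c, d) > chi_threshold:
                selected.append(word)
        elif pmi(a, b, c, d) > pmi_threshold:
            selected.append(word)
    return selected


# Hypothetical counts for one category over N = 100 microblog texts.
counts = {
    "开心": (40, 5, 10, 45),   # frequent and strongly associated with c
    "呵呵": (30, 30, 20, 20),  # frequent but independent of c
    "泪奔": (3, 0, 47, 50),    # rare, but only appears in c
}
features = select_features(counts, freq_cutoff=10, chi_threshold=3.84, pmi_threshold=0.5)
print(features)
```

Using a different statistic for rare words reflects the split made in the text: the chi-square statistic is unreliable when the counts are small, so low-frequency words are scored with PMI instead.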

Further, in one embodiment of the invention, the classifier is based on the support vector machine (SVM), a machine learning algorithm for processing linearly separable data. When the data are not linearly separable, SVM maps them into a higher-dimensional space in which they become linearly separable. To avoid the computational complexity of working in the higher-dimensional space, SVM uses a kernel function to compute the result. The classifier used in the embodiment of the present invention is support vector regression (SVR), a branch of SVM. Unlike SVM, which directly gives a classification result, SVR gives a regression value for each sample, so that the classification threshold can be adjusted more flexibly. For multi-class problems, SVR first calculates the regression value for each class, then calculates the difference between each regression value and its threshold, and selects the class with the largest difference as the final class. In other words, the embodiment of the present invention trains the SVR of each node in the emotion classification system according to the feature words, so as to construct a four-layer fine-grained emotion classification system. It should be noted that, in practical applications, the feature selection algorithm of each layer can be adjusted flexibly according to the characteristics of the data, and algorithms different from those of the present invention can be selected to construct the emotion classification system. The analysis method of the embodiment of the present invention improves both the emotion classification effect and the emotion classification speed.
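The multi-class decision rule described above can be sketched as follows. The sketch assumes that the per-class regression values have already been produced by trained SVR models and that each class has its own threshold; the function name `classify`, the class labels and all numbers are hypothetical.

```python
def classify(regression_values, thresholds):
    """Return the class whose regression value exceeds its threshold by the most."""
    margins = {
        label: regression_values[label] - thresholds[label]
        for label in regression_values
    }
    return max(margins, key=margins.get)


# Hypothetical regression values and thresholds for one microblog text.
values = {"joy": 0.62, "anger": 0.70, "sadness": 0.10}
limits = {"joy": 0.30, "anger": 0.55, "sadness": 0.20}
label = classify(values, limits)
print(label)
```

In this example "anger" has the highest raw regression value but "joy" is selected, because the decision is made on the difference from the threshold rather than on the regression value itself.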

Further, in one embodiment of the invention, the analysis module 50 is also used for obtaining the regression value of the microblog text for each emotion in the emotion classification system; calculating a score for each emotion according to its regression value, so as to select a preset number of emotions; and calculating the proportion of each of the preset number of emotions.
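The emotion component calculation can be sketched with the formulas given in the claims, where the score of the i-th emotion is S_{i} = e^{V_{i, 3} + V_{i, 4}} and the proportion is that score divided by the sum of the scores of the selected emotions. The sketch assumes that the preset number is 4 (the upper limit of the sum in the claims) and that the emotions with the highest scores are the ones selected, which the text does not state explicitly. The function name and the emotion labels are hypothetical.

```python
import math


def emotion_composition(layer3, layer4, top_k=4):
    """Return {emotion: proportion} for the top_k highest-scoring emotions.

    layer3 and layer4 map each emotion to its layer-3 and layer-4
    regression value; the score is exp(V_3 + V_4).
    """
    scores = {e: math.exp(layer3[e] + layer4[e]) for e in layer3}
    # Assumption: the preset number of emotions are those with the highest scores.
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    total = sum(scores[e] for e in top)
    return {e: scores[e] / total for e in top}


# Hypothetical regression values for five emotions.
v3 = {"joy": 0.8, "love": 0.5, "surprise": 0.1, "anger": -0.4, "fear": -1.0}
v4 = {"joy": 0.6, "love": 0.2, "surprise": 0.0, "anger": -0.3, "fear": -0.9}
composition = emotion_composition(v3, v4)
for emotion, share in composition.items():
    print(emotion, round(share, 3))
```

Because the scores are exponentials, every proportion is positive and the proportions of the selected emotions sum to 1, so they can be read directly as the emotion components of the text.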

Further, in one embodiment of the invention, as shown in Fig. 5, the analysis system 100 may also comprise a user interface module 60 and a database interface module 70.

The user interface module 60 provides users of the analysis system 100 with a friendly graphical user interface, making it convenient for a user to browse their own emotional state and that of others. The database interface module 70 provides the read/write interface to database 80 for the whole system, so that the other functional modules can conveniently perform data I/O operations.

Further, in one embodiment of the invention, the acquisition module 10, word segmentation module 20, extraction module 30, creation module 40, analysis module 50, user interface module 60 and database interface module 70 of the analysis system 100 are all developed in Java, Python and JSP under Windows. Based on this development platform, deploying and running the analysis system 100 requires the following runtime support. At the operating system layer, the analysis system 100 needs to run on Windows XP or a compatible operating system platform; it also needs the program runtime support environment, namely the Java and Python runtime environments. Only when this supporting environment is in place can the analysis system 100 run normally. A user only needs to access the system through a web browser to browse their own emotional state and that of others.

The analysis system 100 of the embodiment of the present invention has the following main features: 1) It saves time. Microblog texts no longer need to be analyzed manually; the emotion classification and the main emotions of a microblog text can be obtained quickly. 2) It is widely applicable. The system can be used by a manufacturer or a competent authority to analyze the overall emotional trend of users, and can also be used by an individual user to analyze their own emotional state and that of others. 3) The emotion granularity is fine. Earlier emotion classification algorithms generally use a classification system of 3 layers and 7 emotions in total, whereas the emotion classification system adopted in the embodiment of the present invention adds one more layer on that basis and has 19 fine-grained basic emotions, so that emotions can be characterized more precisely.

Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in a flow chart or otherwise described herein, for example an ordered list of executable instructions that can be considered to implement logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate or transport the program for use by, or in connection with, such an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium can even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that each part of the present invention can be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, a plurality of steps or methods can be implemented with software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following technologies known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.

Those skilled in the art can understand that all or part of the steps carried by the method of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, may each exist physically on their own, or two or more units may be integrated in one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless specifically defined otherwise.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, such schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described can be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art can understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the present invention is defined by the claims and their equivalents.

Claims

1. An emotion classification and emotion component analysis method based on microblog texts, characterized by comprising the following steps:

acquiring multiple microblog texts posted by users from the internet;

segmenting the multiple microblog texts to obtain a plurality of words according to the part of speech of each word;

extracting a plurality of feature words from the plurality of words;

training the classifier of each node in an emotion classification system according to the plurality of feature words, so as to construct the emotion classification system, and performing emotion classification through the emotion classification system; and

analyzing the emotion components of the microblog texts according to the classification result.

2. The method according to claim 1, characterized in that extracting the plurality of feature words from the plurality of words specifically comprises:

judging whether each word is a high-frequency word;

if the word is judged to be a high-frequency word, calculating the relevance of the word; and

if the word is judged to be a low-frequency word, calculating the PMI value of the word.

3. The method according to claim 2, characterized in that the relevance of the word is calculated according to the following formula:

χ^{2} (t, c) = \frac{N {(AD - BC)}^{2}}{(A + C) (A + B) (B + D) (C + D)}

4. The method according to claim 2, characterized in that the PMI value of the word is calculated according to the following formula:

PMI (t, c) = \log \frac{p (t, c)}{p (t) p (c)}

5. The method according to any one of claims 2-4, characterized by further comprising:

if the relevance of the word is greater than a first preset threshold, extracting the word as a feature word;

if the PMI value of the word is greater than a second preset threshold, extracting the word as a feature word.

6. The method according to claim 1, characterized in that analyzing the emotion components of the microblog texts according to the classification result further comprises:

obtaining the regression value of the microblog text for each emotion in the emotion classification system;

calculating the score of each emotion according to the regression value of that emotion, so as to select a preset number of emotions, and calculating the proportion of each of the preset number of emotions.

7. The method according to claim 6, characterized in that the score of each emotion is calculated according to the following formula:

S_{i} = e^{V_{i, 3} + V_{i, 4}}

wherein S_{i} denotes the score of the i-th emotion, V_{i, 3} denotes the layer-3 regression value of the i-th emotion, and V_{i, 4} denotes the layer-4 regression value of the i-th emotion;

the proportion of each of the preset number of emotions is calculated according to the following formula:

P_{i} = \frac{e^{V_{i, 3} + V_{i, 4}}}{Σ_{k = 1}^{4} e^{V_{k, 3} + V_{k, 4}}}

8. An emotion classification and emotion component analysis system based on microblog texts, characterized by comprising:

an acquisition module, for acquiring multiple microblog texts posted by users from the internet;

a word segmentation module, for segmenting the multiple microblog texts to obtain a plurality of words according to the part of speech of each word;

an extraction module, for extracting a plurality of feature words from the plurality of words;

a creation module, for training the classifier of each node in an emotion classification system according to the plurality of feature words, so as to construct the emotion classification system, and performing emotion classification through the emotion classification system; and

an analysis module, for analyzing the emotion components of the microblog texts according to the classification result.

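9. The system according to claim 8, characterized in that the extraction module is also used for:

judging whether each word is a high-frequency word;

if the word is judged to be a high-frequency word, calculating the relevance of the word; and

if the word is judged to be a low-frequency word, calculating the PMI value of the word.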

10. The system according to claim 9, characterized in that the relevance of the word is calculated according to the following formula:

χ^{2} (t, c) = \frac{N {(AD - BC)}^{2}}{(A + C) (A + B) (B + D) (C + D)}

11. The system according to claim 9, characterized in that the PMI value of the word is calculated according to the following formula:

PMI (t, c) = \log \frac{p (t, c)}{p (t) p (c)}

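12. The system according to any one of claims 9-11, characterized in that the extraction module is also used for:

if the relevance of the word is greater than a first preset threshold, extracting the word as a feature word;

if the PMI value of the word is greater than a second preset threshold, extracting the word as a feature word.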

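13. The system according to claim 8, characterized in that the analysis module is also used for:

obtaining the regression value of the microblog text for each emotion in the emotion classification system;

calculating the score of each emotion according to the regression value of that emotion, so as to select a preset number of emotions, and calculating the proportion of each of the preset number of emotions.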

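14. The system according to claim 13, characterized in that the score of each emotion is calculated according to the following formula:

S_{i} = e^{V_{i, 3} + V_{i, 4}}

wherein S_{i} denotes the score of the i-th emotion, V_{i, 3} denotes the layer-3 regression value of the i-th emotion, and V_{i, 4} denotes the layer-4 regression value of the i-th emotion;

the proportion of each of the preset number of emotions is calculated according to the following formula:

P_{i} = \frac{e^{V_{i, 3} + V_{i, 4}}}{Σ_{k = 1}^{4} e^{V_{k, 3} + V_{k, 4}}}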