CN103020712A - Distributed classification device and distributed classification method for massive micro-blog data - Google Patents
- Publication number: CN103020712A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: Information Retrieval, Db Structures And Fs Structures Therefor
Abstract
The invention discloses a distributed classification device and a distributed classification method for massive microblog data, belonging to the field of data mining technology. The device has a distributed structure. In the method, each slave controller applies extreme learning machine (ELM) processing to its local data and transmits the resulting intermediate result, used for generating the final microblog data classifier, to the master controller; after receiving the intermediate results sent by all slave controllers, the master controller obtains the final microblog data classifier according to the ELM principle; the generated classifier then classifies the microblog data. The device and method overcome the shortcoming that existing ELM-based methods can only be applied in centralized environments and cannot perform ELM classification over large-scale training sample sets. Massive microblog data can thus be processed and analyzed, the value of the massive microblog data accumulated in applications can be fully exploited, and an effective application service is achieved.
Description
Technical field
The invention belongs to the field of data mining technology and relates to an extreme learning machine classifier and method based on distributed processing, in particular to a distributed classification device and method for massive microblog data.
Background technology
At present, a large amount of information is produced on the internet at all times, in many forms, and the volume of information generated on microblog platforms in particular is increasing sharply. A microblog (micro-blog) is a blog form that allows users to publish brief texts (usually about 140 characters) in a timely manner. The rapid development of microblogging means that anyone can become a microblog user and, at any time and from any client that supports microblogs, publish and read information, interact with others, and express their own emotions. Microblogs have become a powerful information carrier on the internet; the amount of microblog information has reached a massive scale, making microblogging one of the most popular platforms for information sharing, dissemination and interaction. For this reason, how to take appropriate measures and techniques to mine useful information from massive microblog data and make predictive judgments about future events has become a focus and difficulty of current data mining research.
In existing studies of microblog data, the volume of data handled is often small enough to be processed in a centralized environment. However, with the rapid growth of microblog data on the internet, the data volume far exceeds the processing capacity of a single computer, and it is difficult for existing methods to carry out large-scale data analysis.
Summary of the invention
In view of the deficiencies of the prior art, the object of the invention is to propose a distributed classification device and method for massive microblog data that classify microblog data using the extreme learning machine (Extreme Learning Machine, ELM) technique, so that massive microblog data can be processed and analyzed effectively, the value of the massive microblog data accumulated in applications is fully exploited, and better application services are provided.
The technical scheme of the invention is realized as follows. A distributed classification device for massive microblog data adopts a distributed structure comprising one master controller and at least one slave controller. Each slave controller is interconnected with the master controller and communicates with it; the slave controllers are independent of one another and complete their own tasks independently. Following the ELM processing method, each slave controller sends the intermediate result it computes, used for generating the final microblog data classifier, to the master controller; after receiving the intermediate results sent by all slave controllers, the master controller obtains the final microblog data classifier according to the ELM principle.
The slave controller comprises:
Vectorizer: converts every microblog training record with its classification result into vector form, comprising the feature vector x_i of the data part and the classification result part t_i of each microblog record.
Stripper: separates, from the microblog training set processed by the vectorizer, the feature vector matrix X_i and the classification result matrix T_i of all microblog records.
Converter: using the extreme learning machine (ELM) principle, converts the feature vector matrix X_i extracted by the stripper into the ELM hidden-layer output matrix H_i.
Antecedent calculator: using the ELM principle, computes from the hidden-layer output matrix H_i the intermediate result H_i^T H_i and submits it to the master controller.
Consequent calculator: using the ELM principle, computes from the hidden-layer output matrix H_i and the classification result matrix T_i of the microblog dataset the intermediate result H_i^T T_i and submits it to the master controller.
The master controller comprises:
Antecedent accumulator: merges the intermediate results H_i^T H_i submitted by the slave controllers into the summarized result H^T H.
Consequent accumulator: merges the intermediate results H_i^T T_i submitted by the slave controllers into the summarized result H^T T.
Parameter generator: using the ELM principle, calculates the weight vector parameter β of the output nodes from the summarized outputs of the antecedent accumulator and the consequent accumulator.
Classifier generator: builds the microblog data classifier from the parameter β obtained by the parameter generator; the classifier is used to classify microblog data under test.
A distributed classification method for massive microblog data comprises the following steps.
Step 1: preparation of the microblog training dataset.
Preparation of the microblog training dataset comprises two parts: grabbing the original microblog data and manually labeling them. Two modes may be adopted. In the first mode, the master controller grabs the original microblog data to be processed, each training record is manually labeled with the classification result that represents it, and the master controller then distributes these microblog data to the slave controllers. In the second mode, the master controller communicates with each slave controller and informs it which microblog data to grab; each slave controller grabs its own original microblog data and manually labels them with the classification results.
Step 2: the master controller initializes the required parameters and sends them to all slave controllers.
Using the ELM principle, the master controller randomly generates the parameters in advance, comprising the number of hidden nodes L, the input-node weight vectors w_1, w_2, ..., w_L and the hidden-node offsets b_1, b_2, ..., b_L, and sends these parameters to all slave controllers.
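Step 2 can be sketched as follows. This is a minimal illustration, not the patent's prescribed implementation: the uniform [-1, 1] range, the seed argument, and the function name are assumptions, since the patent only requires the parameters to be generated at random by the master controller.

```python
import numpy as np

def init_elm_params(n_features, n_hidden, seed=None):
    """Randomly generate the input-node weight vectors w_1..w_L (here the
    columns of W) and the hidden-node offsets b_1..b_L that the master
    controller broadcasts to every slave controller."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # column i is w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # b_1..b_L
    return W, b

# Six text features and L = 3 hidden nodes, matching the embodiment below.
W, b = init_elm_params(n_features=6, n_hidden=3, seed=0)
```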
Step 3: each slave controller processes its own local microblog dataset and sends the result to the master controller, which generates the microblog data classifier.
Step 3-1: vectorization of the microblog data.
Every microblog training record with its classification result part is vectorized, comprising the feature vector x_i of the data part and the classification result part t_i of each microblog record.
Step 3-2: stripping of the microblog data.
For each slave controller's feature-extracted microblog training set, the feature vector part and the classification result part of the data are stripped apart, forming the feature vector matrix X_i and the classification result matrix T_i of that slave controller's training set, so that each slave controller generates its own local microblog dataset (X_i, T_i), where X_i is the feature matrix of the microblog dataset and T_i is its classification result matrix.
Step 3-3: each slave controller generates intermediate results from its own local microblog dataset and sends them to the master controller.
Each slave controller n_i uses the received input-node weight vectors w_1, w_2, ..., w_L, the hidden-node thresholds b_1, b_2, ..., b_L, and the local microblog training dataset (X_i, T_i) to compute the intermediate results required for building the classifier, and submits them to the master controller.
Step 3-3-1: convert the feature matrix X_i of the local microblog dataset into the ELM hidden-layer output matrix H_i.
Step 3-3-2: from the hidden-layer output matrix H_i, compute the intermediate result U_i = H_i^T H_i.
Step 3-3-3: from the hidden-layer output matrix H_i and the classification result matrix T_i of the local training dataset, compute the intermediate result V_i = H_i^T T_i.
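Steps 3-3-1 to 3-3-3 on one slave controller can be sketched as below. The sigmoid excitation function g and the function name are assumptions introduced for illustration; the patent leaves g unspecified.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slave_intermediate_results(X_i, T_i, W, b, g=sigmoid):
    """Map the local feature matrix X_i to the hidden-layer output
    H_i = g(X_i W + b), then form the two intermediate results
    U_i = H_i^T H_i and V_i = H_i^T T_i submitted to the master controller."""
    H_i = g(X_i @ W + b)   # N_i x L hidden-layer output matrix
    U_i = H_i.T @ H_i      # L x L, independent of the labels
    V_i = H_i.T @ T_i      # L x m, m = output dimension
    return U_i, V_i

# Toy local dataset: two records, six features, one output column.
X_i = np.array([[0.375, 1.667, 0.0, 0.0, 0.125, 0.0],
                [0.083, 2.0, 0.167, -1.5, 0.25, 0.083]])
T_i = np.array([[1.0], [3.0]])
rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, (6, 3))
b = rng.uniform(-1, 1, 3)
U_i, V_i = slave_intermediate_results(X_i, T_i, W, b)
```

Note that U_i and V_i are L x L and L x m regardless of the number of local records N_i, which is what keeps the messages to the master controller small.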
Step 3-4: the master controller receives and aggregates the intermediate results from the slave controllers, calculates the weight vector parameter β of the output nodes from the aggregated intermediate results according to the ELM computing principle, and thereby obtains the microblog data classifier.
Step 3-4-1: merge the intermediate results U_i submitted by the slave controllers into the summarized result U = Σ U_i = Σ H_i^T H_i = H^T H.
Step 3-4-2: merge the intermediate results V_i submitted by the slave controllers into the summarized result V = Σ V_i = Σ H_i^T T_i = H^T T.
Step 3-4-3: calculate the weight vector parameter β of the output nodes from the aggregated U and V:
β = (I/λ + U)^(-1) V = (I/λ + H^T H)^(-1) H^T T
where I is the identity matrix, λ is a user-specified parameter, and (·)^(-1) is the matrix inversion operation. The formula of the microblog data classifier is then determined as
f(x) = h(x)β
where f(x) denotes the classification result of the microblog data to be classified and h(x) denotes the hidden-layer output vector of the microblog data to be classified.
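The master-side aggregation of steps 3-4-1 to 3-4-3 can be sketched as follows. Solving the linear system rather than forming the explicit inverse is an implementation choice, and the function names are assumptions; the arithmetic follows the summarized U, V and the β formula above.

```python
import numpy as np

def master_solve_beta(U_parts, V_parts, lam):
    """Sum the slaves' intermediate results into U = sum U_i = H^T H and
    V = sum V_i = H^T T, then compute the output weights
    beta = (I/lam + U)^(-1) V."""
    U = sum(U_parts)
    V = sum(V_parts)
    L = U.shape[0]
    return np.linalg.solve(np.eye(L) / lam + U, V)

def classify(h_x, beta):
    """f(x) = h(x) beta for a record's hidden-layer output vector h(x)."""
    return h_x @ beta

# Two toy slave contributions with L = 2 hidden nodes.
U_parts = [np.array([[2.0, 0.0], [0.0, 1.0]]),
           np.array([[1.0, 0.0], [0.0, 2.0]])]
V_parts = [np.array([[1.0], [2.0]]),
           np.array([[2.0], [1.0]])]
beta = master_solve_beta(U_parts, V_parts, lam=1.0)
# Here I/lam + U = [[4, 0], [0, 4]] and V = [[3], [3]], so beta = [[0.75], [0.75]].
```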
Step 4: automatic classification of the microblog data.
Automatic classification can take two modes. In the first, the master controller keeps grabbing microblog data and uses the classifier generated in step 3 to output the classification results of the microblog data to be classified directly. In the second, the master controller sends the classifier generated in step 3 to each slave controller, and each slave controller then uses the classifier to classify its own microblog data to be classified and obtain the classification results.
Beneficial effects: the invention provides a distributed classification device and method for massive microblog data that overcome the defect that previous extreme-learning-machine techniques apply only to centralized environments and cannot perform ELM classification over large-scale training sample sets. Processing and analyzing massive microblog data thus becomes possible, the value of the massive microblog data accumulated in applications is fully exploited, and better application services are provided.
Description of drawings
Fig. 1 is a schematic diagram of the distributed structure of one embodiment of the invention;
Fig. 2 is a schematic diagram of the connection between the master controller and the slave controllers in one embodiment of the invention;
Fig. 3 is a structural block diagram of the master controller and slave controllers in one embodiment of the invention;
Fig. 4 is a schematic diagram of the distributed microblog training set in one embodiment of the invention;
Fig. 5 is a flow chart of the distributed microblog data training method in one embodiment of the invention;
Fig. 6 is a flow chart of the method of generating the microblog data classifier in one embodiment of the invention;
Fig. 7 is a schematic diagram of the local intermediate results after transformation by the slave controllers in one embodiment of the invention;
Fig. 8 is a schematic diagram of the slave controllers computing intermediate results and the master controller aggregating them in one embodiment of the invention.
Embodiment
Embodiments of the invention are described in further detail below with reference to the accompanying drawings.
Today's microblog data contain a large amount of microblog users' sentiment information expressing users' opinions and views on events, commodities, persons and so on. This sentiment information has high research and application value, so sentiment analysis of microblog data has received wide attention and has broad application prospects in opinion analysis, commodity evaluation, public-opinion detection and other fields. Therefore, in the specific embodiments of the invention, microblog data are classified according to their sentiment orientation.
The invention analyzes massive microblog data in a distributed environment whose structure is shown in Fig. 1. It comprises a master node n_0 and a plurality of slave nodes n_1, n_2, ..., n_s, where the master node n_0 is interconnected with, and can communicate with, all the slave nodes n_1, n_2, ..., n_s.
One embodiment of the invention adopts the overall connection shown in Fig. 2, comprising one master controller and a plurality of slave controllers (slave controller 1, slave controller 2, ..., slave controller m), each interconnected with the master controller. According to the extreme learning machine (ELM) principle, each slave controller processes its own part of the microblog training dataset, produces the intermediate results used for generating the final classifier, and sends these intermediate results to the master controller; after receiving them, the master controller, likewise according to the ELM principle, produces the final microblog data classifier.
The slave controller comprises a vectorizer, a stripper, a converter, an antecedent calculator and a consequent calculator; the master controller comprises an antecedent accumulator, a consequent accumulator, a parameter generator and a classifier generator.
These components operate as described above for the device.
In the present embodiment, the slave controllers and the master controller both analyze the microblog data using the ELM technique, which is as follows.
The extreme learning machine is a training method for single-hidden-layer feedforward neural networks (Single Hidden-Layer Feedforward Neural Networks, SLFNs). Before training, ELM sets the connection weights between the input layer and the hidden layer, and the hidden-layer biases, at random; during the algorithm there is no need to adjust the network's input weights or the hidden units' biases, and a unique optimal analytical solution for the output-layer weights is produced, providing good generalization ability and extremely fast learning speed.
The basic principle of ELM is as follows. In training, ELM first randomly generates the input weights and hidden-node thresholds, and then calculates the output weights of the SLFN from the training data. Suppose N training samples (x_j, t_j) are given, where x_j is the feature vector part and t_j the classification result part of a sample. An SLFN with L hidden nodes and excitation function g(x) can be expressed formally as
o_j = Σ_{i=1}^{L} β_i g(w_i · x_j + b_i), j = 1, ..., N
where w_i is the weight vector connecting the i-th hidden node and the input nodes, β_i is the weight vector connecting the i-th hidden node and the output nodes, b_i is the threshold of the i-th hidden node, and o_j is the j-th output vector of the SLFN.
If the SLFN can approximate the training samples with zero error, then Σ_{j=1}^{N} ||o_j − t_j|| = 0; that is, there exist w_i, β_i and b_i such that
Σ_{i=1}^{L} β_i g(w_i · x_j + b_i) = t_j, j = 1, ..., N,
abbreviated as Hβ = T, where the N×L matrix H = [g(w_i · x_j + b_i)] is called the hidden-layer output matrix. In Hβ = T only β is unknown, giving the solution β = H†T, where H† is the Moore-Penrose generalized inverse of H.
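The basic centralized ELM summarized above can be sketched as follows, under stated assumptions: tanh as the excitation function g and the function names are illustrative choices, not prescribed by the source.

```python
import numpy as np

def train_elm(X, T, W, b, g=np.tanh):
    """With the input weights W and thresholds b fixed at random, build the
    hidden-layer output matrix H = g(X W + b) and take the minimum-norm
    least-squares output weights beta = pinv(H) T via the Moore-Penrose
    generalized inverse."""
    H = g(X @ W + b)
    return np.linalg.pinv(H) @ T

def elm_predict(X, W, b, beta, g=np.tanh):
    return g(X @ W + b) @ beta

# Tiny example: N = 4 samples, 2 features, L = 4 hidden nodes, so the
# network can fit the training targets essentially exactly.
rng = np.random.default_rng(7)
X = rng.normal(size=(4, 2))
T = rng.normal(size=(4, 1))
W = rng.uniform(-1, 1, (2, 4))
b = rng.uniform(-1, 1, 4)
beta = train_elm(X, T, W, b)
```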
On the basis of the basic extreme learning machine, scholars have further proposed an ELM based on random hidden-layer feature mapping, in which
β = (I/λ + H^T H)^(-1) H^T T
where I is the identity matrix and λ is a user-specified parameter. In addition, the kernel-based ELM (Kernel based ELM) and many other ELM variants, such as the fully complex ELM (Fully Complex ELM), online sequential ELM (Online Sequential ELM), incremental ELM (Incremental ELM) and ensemble of ELM (Ensemble of ELM), have been widely used in different applications with good practical effect.
The present embodiment analyzes microblog users' sentiment orientation toward the Apple tablet computer from microblog data related to it. Such sentiment-orientation analysis helps product manufacturers, suppliers, dealers and others make correct judgments about the future development of the Apple tablet, and also helps prospective purchasers deepen their understanding of the product and make suitable choices.
Fig. 4 shows a distributed system consisting of one master controller (master node n_0) and three slave controllers (slave nodes n_1, n_2 and n_3). According to the above process and the basic principle of ELM, the following processing is carried out in the distributed system shown in Fig. 4.
The present embodiment applies the distributed classification method for massive microblog data to sentiment analysis of tablet-related microblog data; the flow is shown in Fig. 5 and begins at step 501.
In step 502, the microblog training data are prepared. As described above, two modes are available; the first mode is adopted in the present embodiment. The master controller grabs the original microblog data related to the Apple tablet. The original data contain multiple fields, for example posting time, poster, type, access rights, body text, picture URL and video URL; in the present embodiment only the body-text field is needed for sentiment analysis. At the same time, manual labeling adds a sentiment-orientation dimension, i.e. the classification result part of the microblog data, expressing the sentiment of the text; in the present embodiment the sentiment of a text is divided into three levels: approve, neutral and oppose. Listed below are 7 manually sentiment-labeled microblog records. The master controller distributes these 7 training records to the three slave controllers: statements 1-2 go to slave controller n_1, statements 3-5 to slave controller n_2, and statements 6-7 to slave controller n_3.
Microblog training dataset of slave controller n_1:
Statement 1: The Apple tablet's quality is quite good, its response is fast enough, and it feels fine in the hand. (Sentiment of statement 1: approve)
Statement 2: Having used the Apple tablet for a while, its functions are too few; it is not as good as the legend says, just too ordinary. (Sentiment of statement 2: oppose)
Microblog training dataset of slave controller n_2:
Statement 3: The Apple tablet is very fast, its networking is stable, and online gaming is quite perfect; worth a thumbs-up! (Sentiment of statement 3: approve)
Statement 4: With its single product line and high price, who knows how long the Apple tablet can last against competition from several other rivals. (Sentiment of statement 4: neutral)
Statement 5: The Apple tablet's operating system is uncomfortable, the screen ratio is not good for widescreen films, exporting is troublesome, and downloading software is expensive. (Sentiment of statement 5: oppose)
Microblog training dataset of slave controller n_3:
Statement 6: The Apple tablet is very fast, its resolution is very high, and the applications are quite rich. (Sentiment of statement 6: approve)
Statement 7: The Apple tablet's body is too heavy and inconvenient to hold, and downloads have to go through iTunes; pretty troublesome! (Sentiment of statement 7: oppose)
In step 503, the master controller initializes the required parameters and sends them to all slave controllers.
The parameters are randomly generated in advance by the master controller and comprise the input-node weight vectors w_1, w_2, w_3 and the hidden-node thresholds b_1, b_2, b_3; the number of hidden nodes is set to L = 3, and the parameters are issued to slave nodes n_1, n_2 and n_3:
w_1 = (-0.9286, 0.3575, -0.2155, 0.4121, -0.9077, 0.3897)
w_2 = (0.6983, 0.5155, 0.3110, -0.9363, -0.8057, -0.3658)
w_3 = (0.8680, 0.4863, -0.6576, -0.4462, 0.6469, 0.9004)
b_1 = 0.0344
b_2 = 0.4387
b_3 = 0.3816
In step 504, each slave controller processes its own local microblog dataset and sends the result to the master controller, which produces the microblog data classifier. The detailed flow is shown in Fig. 6 and starts at step 601.
In step 602, every microblog training record with its classification result part is vectorized into the feature vector x_i of the data part and the classification result part t_i.
Vectorization of the data part is feature extraction, which is the basis of sentiment-orientation analysis: its quality directly affects the results of sentiment prediction. Feature extraction transforms primitive features into the most representative new features by mapping (or conversion). The present embodiment mainly studies the influence on text sentiment analysis of commendatory sentiment words, derogatory sentiment words, degree adverbs and negative words as features, introduced concretely below.
Sentiment words: sentiment words are nouns, verbs, adjectives and some idioms that carry a sentiment orientation. The sentiment of a text is transmitted mainly through sentiment words, so they are one of the key features of text sentiment analysis and prediction. For sentiment analysis, the present embodiment divides the sentiment words in text data into two kinds, commendatory and derogatory. Commendatory words express praise or affirmation, such as "like", "approve", "appreciate", "praise", "admire" and "fine". Derogatory words express disparagement, negation, hatred or contempt, such as "detest", "oppose", "ignorant", "gloomy", "mean" and "deceive". The present embodiment divides commendatory sentiment words into three grades [+3, +2, +1], with the degree of commendation decreasing in turn, and derogatory sentiment words into three grades [-1, -2, -3], with the degree of derogation increasing in turn.
The sentiment words contribute four feature components: commendatory word frequency, commendatory average rank, derogatory word frequency and derogatory average rank. The word frequency is the number of such words divided by the total number of words in the statement, and the average rank is the mean of the grades of those words.
Degree adverbs: degree adverbs are a kind of adverb expressing degree, such as "very", "extremely", "too", "more", "especially", "slightly", "a little" and "almost"; the present embodiment extracts the degree-adverb word frequency as one feature component.
Negative adverbs: negative adverbs are a kind of adverb expressing affirmation or negation, such as "not", "no", "don't", "must", "never" and "needn't"; the present embodiment extracts the negative-adverb word frequency as one feature component.
In summary, the text feature vector extracted in the present embodiment has six components: commendatory word frequency, commendatory average rank, derogatory word frequency, derogatory average rank, degree-adverb word frequency and negative-adverb word frequency. In the classification result part of the microblog data, the sentiment of the text is divided into three levels, approve, neutral and oppose, represented by [+1, +2, +3]. In this way the feature vector part and classification result part of every microblog record are obtained, in the following concrete form.
Extracting features from the 7 microblog records by the above method gives the following results.
Statement 1 analyzed: statement 1 divides into 8 words, among which the commendatory words are "good", "fast" and "fine", 3 in all, so its commendatory word frequency is 3/8; the corresponding commendatory ranks are +1, +2 and +2, so its commendatory average rank is (1+2+2)/3. Statement 1 contains no derogatory words, so its derogatory word frequency and average rank are both 0; the degree adverb is "quite", with word frequency 1/8; the negative-adverb word frequency is 0; the sentiment is approve, so the classification result is +1. After extraction, statement 1 therefore converts to (0.375, 1.667, 0, 0, 0.125, 0, 1).
Using the same method, the feature vectors of the other statements are obtained:
Statement 2 analyzed: (0.083, 2, 0.167, -1.5, 0.25, 0.083, 3).
Statement 3 analyzed: (0.333, 2.5, 0, 0, 0.25, 0, 1).
Statement 4 analyzed: (0.077, 2, 0.077, -1, 0, 0, 2).
Statement 5 analyzed: (0, 0, 0.188, -2.333, 0.125, 0.063, 3).
Statement 6 analyzed: (0.273, 2.333, 0, 0, 0.273, 0, 1).
Statement 7 analyzed: (0, 0, 0.154, -2.5, 0.154, 0.077, 3).
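The six-component feature computation above can be sketched as follows. Passing pre-tallied word counts and rank lists is an assumption about how the word lists were applied, and the function name is hypothetical; the arithmetic matches the analysis of statement 1.

```python
def sentence_features(n_words, pos_ranks, neg_ranks, n_degree, n_negation):
    """Compute the embodiment's six text features from plain counts:
    commendatory word frequency and average rank, derogatory word frequency
    and average rank, degree-adverb frequency and negative-adverb frequency."""
    pos_freq = len(pos_ranks) / n_words
    pos_avg = sum(pos_ranks) / len(pos_ranks) if pos_ranks else 0.0
    neg_freq = len(neg_ranks) / n_words
    neg_avg = sum(neg_ranks) / len(neg_ranks) if neg_ranks else 0.0
    return (pos_freq, pos_avg, neg_freq, neg_avg,
            n_degree / n_words, n_negation / n_words)

# Statement 1: 8 words, commendatory ranks +1, +2, +2, one degree adverb.
feat = sentence_features(8, [1, 2, 2], [], 1, 0)
# feat reproduces (0.375, 1.667, 0, 0, 0.125, 0) up to rounding.
```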
In step 603, each slave controller strips the vectorized microblog training data of its own part into a feature-vector part and a classification-result part, so that each slave controller generates its own local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i is the classification-result matrix of the data set. In the distributed environment shown in Figure 4, the training data of slave controller n_1 are:
Statement 1 (0.375, 1.667, 0, 0, 0.125, 0, 1)
Statement 2 (0.083, 2, 0.167, -1.5, 0.25, 0.083, 3)
The feature matrix X_1 (the first six components of each vector) and the classification-result matrix T_1 (the last component) of slave controller n_1's microblog training data after stripping are as follows:
X_1 = [0.375 1.667 0 0 0.125 0; 0.083 2 0.167 -1.5 0.25 0.083]
T_1 = [1; 3]
(rows are separated by semicolons)
The training data of slave controller n_2 are:
Statement 3 (0.333, 2.5, 0, 0, 0.25, 0, 1)
Statement 4 (0.077, 2, 0.077, -1, 0, 0, 2)
Statement 5 (0, 0, 0.188, -2.333, 0.125, 0.063, 3)
The feature matrix X_2 and the classification-result matrix T_2 of slave controller n_2's microblog training data after stripping are as follows:
X_2 = [0.333 2.5 0 0 0.25 0; 0.077 2 0.077 -1 0 0; 0 0 0.188 -2.333 0.125 0.063]
T_2 = [1; 2; 3]
The training data of slave controller n_3 are:
Statement 6 (0.273, 2.333, 0, 0, 0.273, 0, 1)
Statement 7 (0, 0, 0.154, -2.5, 0.154, 0.077, 3)
The feature matrix X_3 and the classification-result matrix T_3 of slave controller n_3's microblog training data after stripping are as follows:
X_3 = [0.273 2.333 0 0 0.273 0; 0 0 0.154 -2.5 0.154 0.077]
T_3 = [1; 3]
In step 604, each slave controller n_i, according to the received parameters w_1, w_2, ..., w_L and b_1, b_2, ..., b_L and its local microblog data set (X_i, T_i), computes the intermediate results required by ELM and submits them to the master controller; in (X_i, T_i), X_i is the feature matrix and T_i is the classification-result matrix of the microblog data set, as shown in Figure 7.
It should be noted that, in ELM, every element of the input feature matrix X_i must be normalized so that all elements of X_i lie in [-1, +1]; different choices of normalization method lead to different input data. In addition, for the activation function g(w_i · x + b_i), ELM offers several activation functions for the user to choose from, and different choices of activation function likewise lead to different intermediate results, and hence to different final classification results. In this embodiment, the statement vectors are first normalized, an activation function is then selected, and the intermediate results required by ELM are computed. The three slave controllers are described in turn below.
For slave controller n_1:
In step 604-1, the data processed by slave controller n_1 are statement 1 (0.375, 1.667, 0, 0, 0.125, 0, 1) and statement 2 (0.083, 2, 0.167, -1.5, 0.25, 0.083, 3), and the received parameters are w_1, w_2, w_3 and b_1, b_2, b_3. After normalization and the choice of an activation function, the hidden-layer output matrix H_1 and the classification-result matrix T_1 are obtained (their values are shown in the accompanying figures).
In step 604-2, the intermediate result U_1 = H_1^T H_1 is computed from H_1. In step 604-3, the intermediate result V_1 = H_1^T T_1 is computed from H_1 and T_1. The intermediate results U_1 and V_1 are then submitted to the master controller.
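The slave-side intermediate results are just two small matrix products. A sketch follows; the numeric values of H_1 are made up, and T_1 is assumed here to use the one-of-three ±1 target encoding common in multiclass ELM, which the patent does not spell out:

```python
import numpy as np

def intermediate_results(H, T):
    """Compute the slave-side intermediate results U_i = H_i^T H_i and
    V_i = H_i^T T_i that are submitted to the master controller."""
    return H.T @ H, H.T @ T

H1 = np.array([[0.5, 0.7, 0.7],        # hidden-layer outputs for 2 statements
               [0.2, 0.6, 0.9]])       # (L = 3 hidden nodes)
T1 = np.array([[ 1, -1, -1],           # statement 1: label 1 ("approve")
               [-1, -1,  1]])          # statement 2: label 3 ("oppose")

U1, V1 = intermediate_results(H1, T1)
# U1 is L x L and symmetric; V1 is L x (number of classes)
```

Because U_i and V_i are only L x L and L x m regardless of how many statements the slave holds, the traffic to the master stays small even for massive data sets.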
For slave controller n_2:
In step 604-4, the data processed by slave controller n_2 are statement 3 (0.333, 2.5, 0, 0, 0.25, 0, 1), statement 4 (0.077, 2, 0.077, -1, 0, 0, 2) and statement 5 (0, 0, 0.188, -2.333, 0.125, 0.063, 3), and the received parameters are w_1, w_2, w_3 and b_1, b_2, b_3. After normalization and the choice of an activation function, the hidden-layer output matrix H_2 and the classification-result matrix T_2 are obtained.
In step 604-5, the intermediate result U_2 = H_2^T H_2 is computed from H_2. In step 604-6, the intermediate result V_2 = H_2^T T_2 is computed from H_2 and T_2. The intermediate results U_2 and V_2 are then submitted to the master controller.
For slave controller n_3:
In step 604-7, the data processed by slave controller n_3 are statement 6 (0.273, 2.333, 0, 0, 0.273, 0, 1) and statement 7 (0, 0, 0.154, -2.5, 0.154, 0.077, 3), and the received parameters are w_1, w_2, w_3 and b_1, b_2, b_3. After normalization and the choice of an activation function, the hidden-layer output matrix H_3 and the classification-result matrix T_3 are obtained.
In step 604-8, the intermediate result U_3 = H_3^T H_3 is computed from H_3. In step 604-9, the intermediate result V_3 = H_3^T T_3 is computed from H_3 and T_3. The intermediate results U_3 and V_3 are then submitted to the master controller.
In step 605, the master controller n_0 receives U_1 and V_1 from slave controller n_1, U_2 and V_2 from slave controller n_2, and U_3 and V_3 from slave controller n_3, and computes the final result, as shown in Figure 8.
In step 605-1, the intermediate results U_1, U_2, U_3 submitted by the slave controllers are merged to obtain the aggregated result U = U_1 + U_2 + U_3 = H^T H.
In step 605-2, the intermediate results V_1, V_2, V_3 submitted by the slave controllers are merged to obtain the aggregated result V = V_1 + V_2 + V_3 = H^T T.
In step 605-3, the output-node weight vector parameter β is computed from the aggregated U and V as β = (I/λ + U)^(-1) V, where I is the identity matrix and λ is a user-specified parameter. The weight vector parameter β is thus obtained.
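The master-side aggregation of steps 605-1 through 605-3 can be sketched with NumPy. The data are random stand-ins, and λ = 100 is illustrative; the point of the sketch is that summing the slaves' U_i and V_i and solving once gives exactly the same β as a centralized solve over the stacked data:

```python
import numpy as np

def aggregate_and_solve(Us, Vs, lam=100.0):
    """Master side: sum the slaves' intermediate results and solve
    beta = (I/lam + U)^(-1) V with U = sum U_i and V = sum V_i."""
    U = sum(Us)
    V = sum(Vs)
    L = U.shape[0]
    return np.linalg.solve(np.eye(L) / lam + U, V)

rng = np.random.default_rng(1)
# three slaves with 2, 3 and 2 statements; L = 3 hidden nodes, 3 classes
Hs = [rng.normal(size=(2, 3)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))]
Ts = [rng.choice([-1.0, 1.0], size=(h.shape[0], 3)) for h in Hs]

beta = aggregate_and_solve([h.T @ h for h in Hs],
                           [h.T @ t for h, t in zip(Hs, Ts)])

# centralized reference computed from the stacked H and T
H, T = np.vstack(Hs), np.vstack(Ts)
central = np.linalg.solve(np.eye(3) / 100.0 + H.T @ H, H.T @ T)
# beta and central agree, which is what makes the distributed scheme exact
```

This equality, not an approximation, is what lets the device split a massive training set across slave controllers without changing the resulting classifier.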
In step 605-4, according to the parameter β obtained by the parameter generator, a classifier that can predict the sentiment orientation of microblog data is constructed and used to perform sentiment-orientation analysis on microblog data to be tested. Its formula is as follows:
f(x) = h(x)β
In step 505: automatic classification of the microblog data.
The automatic classification of microblog data can be done in two main ways; this embodiment adopts the first: the master controller continues to crawl microblog data and uses the generated microblog data classifier to output the classification results of the microblog data to be classified directly. The following two statements are microblog data to be classified that the master controller continued to crawl, shown after applying the same feature extraction method.
Statement 8: I gave the Apple tablet to a friend; the friend is delighted with it, it is very good! The speed and the styling are all fine! I like it!
Analysis of statement 8: (0.286, 2.25, 0, 0, 0.214, 0, unknown classification result).
Statement 9: The Apple tablet's screen quality is very low, it is very troublesome to use, and its battery life is very poor.
Analysis of statement 9: (0, 0, 0.25, -2.333, 0.25, 0, unknown classification result).
After applying the same normalization method and choosing the same activation function, the classification result of statement 8 is obtained as follows:
Hidden-layer output vector h(x_8) = [g(w_1 · x_8 + b_1)  g(w_2 · x_8 + b_2)  g(w_3 · x_8 + b_3)] = [0.5467 0.7244 0.7388]
Substituting this into the classifier formula gives
f(x) = h(x)β = [0.6332 -0.6207 -1.0061]
For the above result, ELM adopts a maximization rule to decide the classification result of the microblog data to be predicted. The basic principle is to find the dimension in which the largest element of the result vector lies; the class label corresponding to that dimension is then the classification result of the data. For example, the largest element in the classifier output of statement 8 is 0.6332, whose dimension is 1, so the classification result of statement 8 is the class represented by label 1, namely "approve".
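The maximization rule can be sketched in a few lines, using the output vectors of statements 8 and 9 from the text; the English class names are translations assumed for illustration:

```python
def classify(fx, labels=("approve", "neutral", "oppose")):
    """Maximization rule: the class whose dimension holds the largest
    element of f(x) is the classification result (1-based label)."""
    idx = max(range(len(fx)), key=lambda i: fx[i])
    return idx + 1, labels[idx]

fx8 = [0.6332, -0.6207, -1.0061]    # classifier output for statement 8
fx9 = [-1.2055, -0.8521, 1.0684]    # classifier output for statement 9

label8 = classify(fx8)   # largest element in dimension 1 -> label 1
label9 = classify(fx9)   # largest element in dimension 3 -> label 3
```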
The prediction process for statement 9 is identical to that for statement 8 and is summarized as follows:
Hidden-layer output vector h(x_9) = [g(w_1 · x_9 + b_1)  g(w_2 · x_9 + b_2)  g(w_3 · x_9 + b_3)] = [0.2222 0.6704 0.9174]
Substituting this into the classifier formula gives
f(x) = h(x)β = [-1.2055 -0.8521 1.0684]
The largest element in the classifier output of statement 9 is 1.0684, whose dimension is 3, so the classification result of statement 9 is the class represented by label 3, namely "oppose".
When statements 8 and 9 are used as test data, the generated microblog data classifier correctly obtains their sentiment orientations; microblog data to be classified can thus be classified accurately.
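Putting steps 602 through 605 together, the whole distributed training and prediction flow can be sketched end to end. The toy data, the three-way partition, the sigmoid activation and the λ value are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_distributed(partitions, L=20, lam=100.0, seed=0):
    """Master: generate random (W, b); each slave: compute U_i = H_i^T H_i
    and V_i = H_i^T T_i; master: sum the results and solve for beta."""
    rng = np.random.default_rng(seed)
    d = partitions[0][0].shape[1]
    W = rng.uniform(-1, 1, (d, L))        # input-node weight vectors
    b = rng.uniform(-1, 1, L)             # hidden-node offsets
    Us, Vs = [], []
    for X, T in partitions:               # slave-side work
        H = sigmoid(X @ W + b)
        Us.append(H.T @ H)
        Vs.append(H.T @ T)
    beta = np.linalg.solve(np.eye(L) / lam + sum(Us), sum(Vs))  # master side
    return W, b, beta

def predict(X, W, b, beta):
    """Maximization rule: 1-based label of the largest output dimension."""
    return np.argmax(sigmoid(X @ W + b) @ beta, axis=1) + 1

# Toy two-class data: label 1 when the first feature is positive, else 2.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, (60, 6))
T = np.where(X[:, :1] > 0, [1.0, -1.0], [-1.0, 1.0])   # one-of-two +/-1 targets
parts = [(X[:20], T[:20]), (X[20:40], T[20:40]), (X[40:], T[40:])]

W, b, beta = train_distributed(parts)
labels = predict(X, W, b, beta)
accuracy = float(np.mean(labels == np.where(X[:, 0] > 0, 1, 2)))
```

The three tuples in `parts` play the role of the three slave controllers n_1, n_2 and n_3; only their small U_i and V_i matrices would cross the network in the real device.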
Besides analyzing the sentiment orientation of microblog data, the present invention can also be applied to many other analyses, such as box-office receipts, song click-through rates, financial product recommendation, stock analysis, equipment performance, hot news events and public opinion analysis.
Although specific embodiments of the present invention have been described above, those skilled in the art should appreciate that they are merely illustrative, and that numerous variations or modifications may be made to these embodiments without departing from the principle and essence of the present invention. The scope of the present invention is limited only by the appended claims.
Claims (5)
1. A distributed classification device for massive microblog data, adopting a distributed structure, characterized in that it comprises one master controller and at least one slave controller, each slave controller being interconnected with the master controller, the master controller communicating with each slave controller, and all slave controllers being mutually independent;
the slave controller comprises:
a vectorizer, used to convert every microblog training datum with a classification result on the slave controller into vector form, comprising the feature vector x_i of the data part and the classification-result part t_i of each microblog datum;
a stripper, used to strip the feature-vector matrix X_i and the classification-result matrix T_i from all microblog data in the microblog training set processed by the vectorizer;
a converter, which uses the principle of the extreme learning machine (ELM) to convert the feature-vector matrix X_i extracted by the stripper into the hidden-layer output matrix H_i of the ELM;
an antecedent calculator, which uses the ELM principle to compute the intermediate result H_i^T H_i from the hidden-layer output matrix H_i and submit it to the master controller;
a consequent calculator, which uses the ELM principle to compute the intermediate result H_i^T T_i from the hidden-layer output matrix H_i and the classification-result matrix T_i of the microblog data set and submit it to the master controller;
the master controller comprises:
an antecedent accumulator, used to merge the intermediate results H_i^T H_i submitted by each slave controller to obtain the aggregated result H^T H;
a consequent accumulator, used to merge the intermediate results H_i^T T_i submitted by each slave controller to obtain the aggregated result H^T T;
a parameter generator, which uses the ELM principle to compute the output-node weight vector parameter β from the aggregated outputs of the antecedent accumulator and the consequent accumulator;
a classifier generator, which constructs the microblog data classifier from the parameter β obtained by the parameter generator and uses it to classify microblog data to be tested.
2. The distributed classification device for massive microblog data according to claim 1, characterized in that each slave controller sends the intermediate results it computes, which are used to generate the final microblog data classifier, to the master controller, and after the master controller receives all the intermediate results sent by the slave controllers, it obtains the final microblog data classifier according to the ELM principle.
3. A distributed classification method for massive microblog data, implemented with the distributed classification device for massive microblog data according to claim 1, characterized in that it comprises the following steps:
Step 1: preparation of the microblog training data set.
The preparation of the microblog training data set comprises two parts: crawling the original microblog data and manually labeling the microblog data. Either of two ways is adopted. In the first way, the master controller crawls the original microblog data to be processed, each training datum is manually labeled with the classification result representing that microblog datum, and the master controller then allocates these microblog data to the corresponding slave controllers. In the second way, the master controller communicates with each slave controller to notify it of the microblog data it needs to crawl; each slave controller then crawls the original microblog data itself and manually labels the data it has crawled with their classification results.
Step 2: the master controller initializes the required parameters and sends them to all slave controllers.
Using the principle of the extreme learning machine (ELM), the master controller randomly generates parameters in advance, including the number of hidden nodes L, the input-node weight vectors w_1, w_2, ..., w_L, and the hidden-node offsets b_1, b_2, ..., b_L, and sends these parameters to all slave controllers.
Step 3: each slave controller processes its own local microblog data set and sends the results to the master controller, which generates the microblog data classifier.
Step 3-1: vectorization of the microblog data.
Every microblog training datum with a classification-result part is vectorized, yielding the feature vector x_i of the data part and the classification-result part t_i of each microblog datum.
Step 3-2: stripping of the microblog data.
For each slave controller's feature-extracted microblog training set, the feature-vector part and the classification-result part of the data are stripped apart, forming each slave controller's feature-vector matrix X_i and classification-result matrix T_i of the microblog training set, so that each slave controller generates its own local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i is the classification-result matrix of the data set.
Step 3-3: each slave controller generates intermediate results from its own local microblog data set and sends them to the master controller.
Step 3-4: the master controller receives and aggregates the intermediate results of all slave controllers, computes the output-node weight vector parameter β from the aggregated intermediate results according to the computing principle of ELM, and thereby obtains the microblog data classifier.
Step 4: automatic classification of the microblog data.
The automatic classification of the microblog data can be done in two ways. In the first way, the master controller continues to crawl microblog data and uses the microblog data classifier generated in step 3 to output the classification results of the microblog data to be classified directly. In the second way, the master controller sends the microblog data classifier generated in step 3 to each slave controller, and each slave controller then uses the classifier to classify its own microblog data to be classified and obtain the classification results.
4. The distributed classification method for massive microblog data according to claim 3, characterized in that in step 3-3 each slave controller generates intermediate results from its own local microblog data set and sends them to the master controller, specifically as follows:
each slave controller n_i, according to the received input-node weight vectors w_1, w_2, ..., w_L, the hidden-node thresholds b_1, b_2, ..., b_L, and its local microblog training data set (X_i, T_i), computes the intermediate results required to build the classifier and submits them to the master controller;
Step 3-3-1: convert the feature matrix X_i of the local microblog data set into the hidden-layer output matrix H_i of the ELM;
Step 3-3-2: compute the intermediate result U_i = H_i^T H_i from the hidden-layer output matrix H_i;
Step 3-3-3: compute the intermediate result V_i = H_i^T T_i from the hidden-layer output matrix H_i and the classification-result matrix T_i of the local training data set.
5. The distributed classification method for massive microblog data according to claim 3, characterized in that the computation of the output-node weight vector parameter β in step 3-4 is specifically as follows:
Step 3-4-1: merge the intermediate results U_i submitted by each slave controller to obtain the aggregated result U = ΣU_i = ΣH_i^T H_i = H^T H;
Step 3-4-2: merge the intermediate results V_i submitted by each slave controller to obtain the aggregated result V = ΣV_i = ΣH_i^T T_i = H^T T;
Step 3-4-3: compute the output-node weight vector parameter β from the aggregated U and V as β = (I/λ + U)^(-1) V, where I is the identity matrix, λ is a user-specified parameter, and (·)^(-1) denotes matrix inversion;
the formula of the microblog data classifier is then determined as follows:
f(x) = h(x)β
where f(x) denotes the classification result of the microblog data to be classified, and h(x) denotes the hidden-layer output vector of the microblog data to be classified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210583886.8A CN103020712B (en) | 2012-12-28 | 2012-12-28 | A kind of distributed sorter of massive micro-blog data and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103020712A true CN103020712A (en) | 2013-04-03 |
CN103020712B CN103020712B (en) | 2015-10-28 |
Family
ID=47969298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210583886.8A Active CN103020712B (en) | 2012-12-28 | 2012-12-28 | A kind of distributed sorter of massive micro-blog data and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103020712B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593462A (en) * | 2013-11-25 | 2014-02-19 | 中国科学院深圳先进技术研究院 | Microblog-data-oriented flu epidemic surveillance analysis method and system |
CN105760899A (en) * | 2016-03-31 | 2016-07-13 | 大连楼兰科技股份有限公司 | Adboost training learning method and device based on distributed computation and detection cost ordering |
WO2017133568A1 (en) * | 2016-02-05 | 2017-08-10 | 阿里巴巴集团控股有限公司 | Mining method and device for target characteristic data |
CN107590134A (en) * | 2017-10-26 | 2018-01-16 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
CN109034366A (en) * | 2018-07-18 | 2018-12-18 | 北京化工大学 | Application based on the ELM integrated model of more activation primitives in chemical engineering modeling |
CN109657061A (en) * | 2018-12-21 | 2019-04-19 | 合肥工业大学 | A kind of Ensemble classifier method for the more word short texts of magnanimity |
CN110381456A (en) * | 2019-07-19 | 2019-10-25 | 珠海格力电器股份有限公司 | Flow management system, flow threshold calculation method and air conditioning system |
CN113177163A (en) * | 2021-04-28 | 2021-07-27 | 烟台中科网络技术研究所 | Method, system and storage medium for social dynamic information sentiment analysis |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1185796A (en) * | 1997-09-01 | 1999-03-30 | Canon Inc | Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method and storage medium |
US20120189194A1 (en) * | 2011-01-26 | 2012-07-26 | Microsoft Corporation | Mitigating use of machine solvable hips |
CN102789498A (en) * | 2012-07-16 | 2012-11-21 | 钱钢 | Method and system for carrying out sentiment classification on Chinese comment text on basis of ensemble learning |
2012-12-28: CN CN201210583886.8A patent/CN103020712B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1185796A (en) * | 1997-09-01 | 1999-03-30 | Canon Inc | Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method and storage medium |
US20120189194A1 (en) * | 2011-01-26 | 2012-07-26 | Microsoft Corporation | Mitigating use of machine solvable hips |
CN102789498A (en) * | 2012-07-16 | 2012-11-21 | 钱钢 | Method and system for carrying out sentiment classification on Chinese comment text on basis of ensemble learning |
Non-Patent Citations (3)
Title |
---|
HUANG GUANGBIN ET AL: "Extreme Learning Machine for Regression and Multiclass Classification", 《IEEE TRANSACTIONS ON SYSTEMS,MAN AND CYBERNETICS-PARTB》 * |
WANG Lei et al.: "Parallel extreme learning machine algorithm based on a binary cascade structure", Journal of Jilin University (Information Science Edition) * |
ZHAO Xiangguo et al.: "ELM-based protein secondary structure prediction and its post-processing", Journal of Northeastern University (Natural Science) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593462A (en) * | 2013-11-25 | 2014-02-19 | 中国科学院深圳先进技术研究院 | Microblog-data-oriented flu epidemic surveillance analysis method and system |
CN103593462B (en) * | 2013-11-25 | 2017-02-15 | 中国科学院深圳先进技术研究院 | Microblog-data-oriented flu epidemic surveillance analysis method and system |
WO2017133568A1 (en) * | 2016-02-05 | 2017-08-10 | 阿里巴巴集团控股有限公司 | Mining method and device for target characteristic data |
CN105760899A (en) * | 2016-03-31 | 2016-07-13 | 大连楼兰科技股份有限公司 | Adboost training learning method and device based on distributed computation and detection cost ordering |
CN105760899B (en) * | 2016-03-31 | 2019-04-05 | 大连楼兰科技股份有限公司 | Training learning method and device based on distributed computing and detection cost sequence |
CN107590134A (en) * | 2017-10-26 | 2018-01-16 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
CN109034366A (en) * | 2018-07-18 | 2018-12-18 | 北京化工大学 | Application based on the ELM integrated model of more activation primitives in chemical engineering modeling |
CN109657061A (en) * | 2018-12-21 | 2019-04-19 | 合肥工业大学 | A kind of Ensemble classifier method for the more word short texts of magnanimity |
CN109657061B (en) * | 2018-12-21 | 2020-11-27 | 合肥工业大学 | Integrated classification method for massive multi-word short texts |
CN110381456A (en) * | 2019-07-19 | 2019-10-25 | 珠海格力电器股份有限公司 | Flow management system, flow threshold calculation method and air conditioning system |
CN113177163A (en) * | 2021-04-28 | 2021-07-27 | 烟台中科网络技术研究所 | Method, system and storage medium for social dynamic information sentiment analysis |
Also Published As
Publication number | Publication date |
---|---|
CN103020712B (en) | 2015-10-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103020712A (en) | Distributed classification device and distributed classification method for massive micro-blog data | |
Koncel-Kedziorski et al. | Text generation from knowledge graphs with graph transformers | |
Shi et al. | WE-LDA: a word embeddings augmented LDA model for web services clustering | |
CN102495860B (en) | Expert recommendation method based on language model | |
CN103853824B (en) | In-text advertisement releasing method and system based on deep semantic mining | |
Wu et al. | Neural news recommendation with heterogeneous user behavior | |
Li et al. | Knowledge-grounded dialogue generation with a unified knowledge representation | |
CN101354714B (en) | Method for recommending problem based on probability latent semantic analysis | |
CN102831119B (en) | Short text clustering Apparatus and method for | |
CN111222332A (en) | Commodity recommendation method combining attention network and user emotion | |
CN106897914A (en) | A kind of Method of Commodity Recommendation and system based on topic model | |
CN109993583A (en) | Information-pushing method and device, storage medium and electronic device | |
CN110210933A (en) | A kind of enigmatic language justice recommended method based on generation confrontation network | |
Yin et al. | Ranking products through online reviews considering the mass assignment of features based on BERT and q-rung orthopair fuzzy set theory | |
CN103729431A (en) | Massive microblog data distributed classification device and method with increment and decrement function | |
Pathan et al. | Unsupervised aspect extraction algorithm for opinion mining using topic modeling | |
Huang et al. | Sentiment analysis in e-commerce platforms: A review of current techniques and future directions | |
CN109902273A (en) | The modeling method and device of keyword generation model | |
Perez-Castro et al. | Efficiency of automatic text generators for online review content generation | |
Zong et al. | Double sparse learning model for speech emotion recognition | |
Zhang et al. | Local-global graph pooling via mutual information maximization for video-paragraph retrieval | |
Chakraborty et al. | LSTM-ANN based price hike sentiment analysis from Bangla social media comments | |
Jangra et al. | Semantic extractor-paraphraser based abstractive summarization | |
Wang et al. | The application of factorization machines in user behavior prediction | |
Tran et al. | Sentiment classification for beauty-fashion reviews |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2022-03-24 Address after: 100081 No. 5, Zhongguancun South Street, Haidian District, Beijing Patentee after: BEIJING INSTITUTE OF TECHNOLOGY Address before: 110819 No. 11, Lane 3, Heping Road, Heping District, Shenyang, Liaoning Patentee before: Northeastern University |