CN103020712A - Distributed classification device and distributed classification method for massive micro-blog data - Google Patents
- Publication number: CN103020712A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: Information Retrieval, Db Structures And Fs Structures Therefor
Abstract
The invention discloses a distributed classification device and a distributed classification method for massive microblog data, belonging to the field of data mining technology. The device has a distributed structure. In the method, each slave controller applies extreme learning machine (ELM) processing to its local data and transmits the resulting intermediate result, used for generating the final microblog data classifier, to the master controller; after receiving the intermediate results sent by all slave controllers, the master controller obtains the final microblog data classifier according to the ELM principle; the generated classifier then classifies the microblog data. The device and method overcome the shortcoming that existing ELM-based methods can only be applied in centralized environments and cannot perform ELM classification over large-scale training sample sets. Massive microblog data can thus be processed and analyzed, the value of the massive microblog data accumulated in applications can be fully exploited, and an effective application service is achieved.
Description
Technical field
The invention belongs to the field of data mining technology and relates to an extreme learning machine classifier and method based on distributed processing, in particular to a distributed classification device and method for massive microblog data.
Background technology
At present, a large amount of information is produced on the internet at all times, in many forms, and the volume of information generated on microblog platforms in particular is increasing sharply. A microblog (micro-blog) is a blog form that allows users to publish brief texts (usually about 140 characters) in a timely manner. The rapid development of microblogging means that anyone can become a microblog user and, at any time and from any client that supports microblogs, publish and read information, interact with others, and express their own emotions. Microblogs have become a powerful information carrier on the internet; the amount of microblog information has reached a massive scale, making microblogging one of the most popular platforms for information sharing, dissemination and interaction. For this reason, how to take appropriate measures and techniques to mine useful information from massive microblog data and make predictive judgments about future events has become a focus and difficulty of current data mining research.
In existing studies of microblog data, the volume of data handled is often small enough to be processed in a centralized environment. However, with the rapid growth of microblog data on the internet, the data volume far exceeds the processing capacity of a single computer, and it is difficult for existing methods to carry out large-scale data analysis.
Summary of the invention
In view of the deficiencies of the prior art, the object of the invention is to propose a distributed classification device and method for massive microblog data that classify microblog data using the extreme learning machine (Extreme Learning Machine, ELM) technique, so that massive microblog data can be processed and analyzed effectively, the value of the massive microblog data accumulated in applications is fully exploited, and better application services are provided.
The technical scheme of the invention is realized as follows. A distributed classification device for massive microblog data adopts a distributed structure comprising one master controller and at least one slave controller. Each slave controller is interconnected with the master controller and communicates with it; the slave controllers are independent of one another and complete their own tasks independently. Following the ELM processing method, each slave controller sends the intermediate result it computes, used for generating the final microblog data classifier, to the master controller; after receiving the intermediate results sent by all slave controllers, the master controller obtains the final microblog data classifier according to the ELM principle.
The slave controller comprises:
Vectorizer: converts every microblog training record with its classification result into vector form, comprising the feature vector x_i of the data part and the classification result part t_i of each microblog record.
Stripper: separates, from the microblog training set processed by the vectorizer, the feature vector matrix X_i and the classification result matrix T_i of all microblog records.
Converter: using the extreme learning machine (ELM) principle, converts the feature vector matrix X_i extracted by the stripper into the ELM hidden-layer output matrix H_i.
Antecedent calculator: using the ELM principle, computes from the hidden-layer output matrix H_i the intermediate result H_i^T H_i and submits it to the master controller.
Consequent calculator: using the ELM principle, computes from the hidden-layer output matrix H_i and the classification result matrix T_i of the microblog dataset the intermediate result H_i^T T_i and submits it to the master controller.
The master controller comprises:
Antecedent accumulator: merges the intermediate results H_i^T H_i submitted by the slave controllers into the summarized result H^T H.
Consequent accumulator: merges the intermediate results H_i^T T_i submitted by the slave controllers into the summarized result H^T T.
Parameter generator: using the ELM principle, calculates the weight vector parameter β of the output nodes from the summarized outputs of the antecedent accumulator and the consequent accumulator.
Classifier generator: builds the microblog data classifier from the parameter β obtained by the parameter generator; the classifier is used to classify microblog data under test.
A distributed classification method for massive microblog data comprises the following steps.
Step 1: preparation of the microblog training dataset.
Preparation of the microblog training dataset comprises two parts: grabbing the original microblog data and manually labeling them. Two modes may be adopted. In the first mode, the master controller grabs the original microblog data to be processed, each training record is manually labeled with the classification result that represents it, and the master controller then distributes these microblog data to the slave controllers. In the second mode, the master controller communicates with each slave controller and informs it which microblog data to grab; each slave controller grabs its own original microblog data and manually labels them with the classification results.
Step 2: the master controller initializes the required parameters and sends them to all slave controllers.
Using the ELM principle, the master controller randomly generates the parameters in advance, comprising the number of hidden nodes L, the input-node weight vectors w_1, w_2, ..., w_L and the hidden-node offsets b_1, b_2, ..., b_L, and sends these parameters to all slave controllers.
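Step 2 can be sketched as follows. This is a minimal illustration, not the patent's prescribed implementation: the uniform [-1, 1] range, the seed argument, and the function name are assumptions, since the patent only requires the parameters to be generated at random by the master controller.

```python
import numpy as np

def init_elm_params(n_features, n_hidden, seed=None):
    """Randomly generate the input-node weight vectors w_1..w_L (here the
    columns of W) and the hidden-node offsets b_1..b_L that the master
    controller broadcasts to every slave controller."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # column i is w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # b_1..b_L
    return W, b

# Six text features and L = 3 hidden nodes, matching the embodiment below.
W, b = init_elm_params(n_features=6, n_hidden=3, seed=0)
```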
Step 3: each slave controller processes its own local microblog dataset and sends the result to the master controller, which generates the microblog data classifier.
Step 3-1: vectorization of the microblog data.
Every microblog training record with its classification result part is vectorized, comprising the feature vector x_i of the data part and the classification result part t_i of each microblog record.
Step 3-2: stripping of the microblog data.
For each slave controller's feature-extracted microblog training set, the feature vector part and the classification result part of the data are stripped apart, forming the feature vector matrix X_i and the classification result matrix T_i of that slave controller's training set, so that each slave controller generates its own local microblog dataset (X_i, T_i), where X_i is the feature matrix of the microblog dataset and T_i is its classification result matrix.
Step 3-3: each slave controller generates intermediate results from its own local microblog dataset and sends them to the master controller.
Each slave controller n_i uses the received input-node weight vectors w_1, w_2, ..., w_L, the hidden-node thresholds b_1, b_2, ..., b_L, and the local microblog training dataset (X_i, T_i) to compute the intermediate results required for building the classifier, and submits them to the master controller.
Step 3-3-1: convert the feature matrix X_i of the local microblog dataset into the ELM hidden-layer output matrix H_i.
Step 3-3-2: from the hidden-layer output matrix H_i, compute the intermediate result U_i = H_i^T H_i.
Step 3-3-3: from the hidden-layer output matrix H_i and the classification result matrix T_i of the local training dataset, compute the intermediate result V_i = H_i^T T_i.
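Steps 3-3-1 to 3-3-3 on one slave controller can be sketched as below. The sigmoid excitation function g and the function name are assumptions introduced for illustration; the patent leaves g unspecified.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slave_intermediate_results(X_i, T_i, W, b, g=sigmoid):
    """Map the local feature matrix X_i to the hidden-layer output
    H_i = g(X_i W + b), then form the two intermediate results
    U_i = H_i^T H_i and V_i = H_i^T T_i submitted to the master controller."""
    H_i = g(X_i @ W + b)   # N_i x L hidden-layer output matrix
    U_i = H_i.T @ H_i      # L x L, independent of the labels
    V_i = H_i.T @ T_i      # L x m, m = output dimension
    return U_i, V_i

# Toy local dataset: two records, six features, one output column.
X_i = np.array([[0.375, 1.667, 0.0, 0.0, 0.125, 0.0],
                [0.083, 2.0, 0.167, -1.5, 0.25, 0.083]])
T_i = np.array([[1.0], [3.0]])
rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, (6, 3))
b = rng.uniform(-1, 1, 3)
U_i, V_i = slave_intermediate_results(X_i, T_i, W, b)
```

Note that U_i and V_i are L x L and L x m regardless of the number of local records N_i, which is what keeps the messages to the master controller small.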
Step 3-4: the master controller receives and aggregates the intermediate results from the slave controllers, calculates the weight vector parameter β of the output nodes from the aggregated intermediate results according to the ELM computing principle, and thereby obtains the microblog data classifier.
Step 3-4-1: merge the intermediate results U_i submitted by the slave controllers into the summarized result U = Σ U_i = Σ H_i^T H_i = H^T H.
Step 3-4-2: merge the intermediate results V_i submitted by the slave controllers into the summarized result V = Σ V_i = Σ H_i^T T_i = H^T T.
Step 3-4-3: calculate the weight vector parameter β of the output nodes from the aggregated U and V:
β = (I/λ + U)^(-1) V = (I/λ + H^T H)^(-1) H^T T
where I is the identity matrix, λ is a user-specified parameter, and (·)^(-1) is the matrix inversion operation. The formula of the microblog data classifier is then determined as
f(x) = h(x)β
where f(x) denotes the classification result of the microblog data to be classified and h(x) denotes the hidden-layer output vector of the microblog data to be classified.
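The master-side aggregation of steps 3-4-1 to 3-4-3 can be sketched as follows. Solving the linear system rather than forming the explicit inverse is an implementation choice, and the function names are assumptions; the arithmetic follows the summarized U, V and the β formula above.

```python
import numpy as np

def master_solve_beta(U_parts, V_parts, lam):
    """Sum the slaves' intermediate results into U = sum U_i = H^T H and
    V = sum V_i = H^T T, then compute the output weights
    beta = (I/lam + U)^(-1) V."""
    U = sum(U_parts)
    V = sum(V_parts)
    L = U.shape[0]
    return np.linalg.solve(np.eye(L) / lam + U, V)

def classify(h_x, beta):
    """f(x) = h(x) beta for a record's hidden-layer output vector h(x)."""
    return h_x @ beta

# Two toy slave contributions with L = 2 hidden nodes.
U_parts = [np.array([[2.0, 0.0], [0.0, 1.0]]),
           np.array([[1.0, 0.0], [0.0, 2.0]])]
V_parts = [np.array([[1.0], [2.0]]),
           np.array([[2.0], [1.0]])]
beta = master_solve_beta(U_parts, V_parts, lam=1.0)
# Here I/lam + U = [[4, 0], [0, 4]] and V = [[3], [3]], so beta = [[0.75], [0.75]].
```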
Step 4: automatic classification of the microblog data.
Automatic classification can take two modes. In the first, the master controller keeps grabbing microblog data and uses the classifier generated in step 3 to output the classification results of the microblog data to be classified directly. In the second, the master controller sends the classifier generated in step 3 to each slave controller, and each slave controller then uses the classifier to classify its own microblog data to be classified and obtain the classification results.
Beneficial effects: the invention provides a distributed classification device and method for massive microblog data that overcome the defect that previous extreme-learning-machine techniques apply only to centralized environments and cannot perform ELM classification over large-scale training sample sets. Processing and analyzing massive microblog data thus becomes possible, the value of the massive microblog data accumulated in applications is fully exploited, and better application services are provided.
Description of drawings
Fig. 1 is a schematic diagram of the distributed structure of one embodiment of the invention;
Fig. 2 is a schematic diagram of the connection between the master controller and the slave controllers in one embodiment of the invention;
Fig. 3 is a structural block diagram of the master controller and slave controllers in one embodiment of the invention;
Fig. 4 is a schematic diagram of the distributed microblog training set in one embodiment of the invention;
Fig. 5 is a flow chart of the distributed microblog data training method in one embodiment of the invention;
Fig. 6 is a flow chart of the method of generating the microblog data classifier in one embodiment of the invention;
Fig. 7 is a schematic diagram of the local intermediate results after transformation by the slave controllers in one embodiment of the invention;
Fig. 8 is a schematic diagram of the slave controllers computing intermediate results and the master controller aggregating them in one embodiment of the invention.
Embodiment
Embodiments of the invention are described in further detail below with reference to the accompanying drawings.
Today's microblog data contain a large amount of microblog users' sentiment information expressing users' opinions and views on events, commodities, persons and so on. This sentiment information has high research and application value, so sentiment analysis of microblog data has received wide attention and has broad application prospects in opinion analysis, commodity evaluation, public-opinion detection and other fields. Therefore, in the specific embodiments of the invention, microblog data are classified according to their sentiment orientation.
The invention analyzes massive microblog data in a distributed environment whose structure is shown in Fig. 1. It comprises a master node n_0 and a plurality of slave nodes n_1, n_2, ..., n_s, where the master node n_0 is interconnected with, and can communicate with, all the slave nodes n_1, n_2, ..., n_s.
One embodiment of the invention adopts the overall connection shown in Fig. 2, comprising one master controller and a plurality of slave controllers (slave controller 1, slave controller 2, ..., slave controller m), each interconnected with the master controller. According to the extreme learning machine (ELM) principle, each slave controller processes its own part of the microblog training dataset, produces the intermediate results used for generating the final classifier, and sends these intermediate results to the master controller; after receiving them, the master controller, likewise according to the ELM principle, produces the final microblog data classifier.
The slave controller comprises a vectorizer, a stripper, a converter, an antecedent calculator and a consequent calculator; the master controller comprises an antecedent accumulator, a consequent accumulator, a parameter generator and a classifier generator.
These components operate as described above for the device.
In the present embodiment, the slave controllers and the master controller both analyze the microblog data using the ELM technique, which is as follows.
The extreme learning machine is a training method for single-hidden-layer feedforward neural networks (Single Hidden-Layer Feedforward Neural Networks, SLFNs). Before training, ELM sets the connection weights between the input layer and the hidden layer, and the hidden-layer biases, at random; during the algorithm there is no need to adjust the network's input weights or the hidden units' biases, and a unique optimal analytical solution for the output-layer weights is produced, providing good generalization ability and extremely fast learning speed.
The basic principle of ELM is as follows. In training, ELM first randomly generates the input weights and hidden-node thresholds, and then calculates the output weights of the SLFN from the training data. Suppose N training samples (x_j, t_j) are given, where x_j is the feature vector part and t_j the classification result part of a sample. An SLFN with L hidden nodes and excitation function g(x) can be expressed formally as
o_j = Σ_{i=1}^{L} β_i g(w_i · x_j + b_i), j = 1, ..., N
where w_i is the weight vector connecting the i-th hidden node and the input nodes, β_i is the weight vector connecting the i-th hidden node and the output nodes, b_i is the threshold of the i-th hidden node, and o_j is the j-th output vector of the SLFN.
If the SLFN can approximate the training samples with zero error, then Σ_{j=1}^{N} ||o_j − t_j|| = 0; that is, there exist w_i, β_i and b_i such that
Σ_{i=1}^{L} β_i g(w_i · x_j + b_i) = t_j, j = 1, ..., N,
abbreviated as Hβ = T, where the N×L matrix H = [g(w_i · x_j + b_i)] is called the hidden-layer output matrix. In Hβ = T only β is unknown, giving the solution β = H†T, where H† is the Moore-Penrose generalized inverse of H.
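The basic centralized ELM summarized above can be sketched as follows, under stated assumptions: tanh as the excitation function g and the function names are illustrative choices, not prescribed by the source.

```python
import numpy as np

def train_elm(X, T, W, b, g=np.tanh):
    """With the input weights W and thresholds b fixed at random, build the
    hidden-layer output matrix H = g(X W + b) and take the minimum-norm
    least-squares output weights beta = pinv(H) T via the Moore-Penrose
    generalized inverse."""
    H = g(X @ W + b)
    return np.linalg.pinv(H) @ T

def elm_predict(X, W, b, beta, g=np.tanh):
    return g(X @ W + b) @ beta

# Tiny example: N = 4 samples, 2 features, L = 4 hidden nodes, so the
# network can fit the training targets essentially exactly.
rng = np.random.default_rng(7)
X = rng.normal(size=(4, 2))
T = rng.normal(size=(4, 1))
W = rng.uniform(-1, 1, (2, 4))
b = rng.uniform(-1, 1, 4)
beta = train_elm(X, T, W, b)
```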
On the basis of the basic extreme learning machine, scholars have further proposed an ELM based on random hidden-layer feature mapping, in which
β = (I/λ + H^T H)^(-1) H^T T
where I is the identity matrix and λ is a user-specified parameter. In addition, the kernel-based ELM (Kernel based ELM) and many other ELM variants, such as the fully complex ELM (Fully Complex ELM), online sequential ELM (Online Sequential ELM), incremental ELM (Incremental ELM) and ensemble of ELM (Ensemble of ELM), have been widely used in different applications with good practical effect.
The present embodiment analyzes microblog users' sentiment orientation toward the Apple tablet computer from microblog data related to it. Such sentiment-orientation analysis helps product manufacturers, suppliers, dealers and others make correct judgments about the future development of the Apple tablet, and also helps prospective purchasers deepen their understanding of the product and make suitable choices.
Fig. 4 shows a distributed system consisting of one master controller (master node n_0) and three slave controllers (slave nodes n_1, n_2 and n_3). According to the above process and the basic principle of ELM, the following processing is carried out in the distributed system shown in Fig. 4.
The present embodiment applies the distributed classification method for massive microblog data to sentiment analysis of tablet-related microblog data; the flow is shown in Fig. 5 and begins at step 501.
In step 502, the microblog training data are prepared. As described above, two modes are available; the first mode is adopted in the present embodiment. The master controller grabs the original microblog data related to the Apple tablet. The original data contain multiple fields, for example posting time, poster, type, access rights, body text, picture URL and video URL; in the present embodiment only the body-text field is needed for sentiment analysis. At the same time, manual labeling adds a sentiment-orientation dimension, i.e. the classification result part of the microblog data, expressing the sentiment of the text; in the present embodiment the sentiment of a text is divided into three levels: approve, neutral and oppose. Listed below are 7 manually sentiment-labeled microblog records. The master controller distributes these 7 training records to the three slave controllers: statements 1-2 go to slave controller n_1, statements 3-5 to slave controller n_2, and statements 6-7 to slave controller n_3.
Microblog training dataset of slave controller n_1:
Statement 1: The Apple tablet's quality is quite good, its response is fast enough, and it feels fine in the hand. (Sentiment of statement 1: approve)
Statement 2: Having used the Apple tablet for a while, its functions are too few; it is not as good as the legend says, just too ordinary. (Sentiment of statement 2: oppose)
Microblog training dataset of slave controller n_2:
Statement 3: The Apple tablet is very fast, its networking is stable, and online gaming is quite perfect; worth a thumbs-up! (Sentiment of statement 3: approve)
Statement 4: With its single product line and high price, who knows how long the Apple tablet can last against competition from several other rivals. (Sentiment of statement 4: neutral)
Statement 5: The Apple tablet's operating system is uncomfortable, the screen ratio is not good for widescreen films, exporting is troublesome, and downloading software is expensive. (Sentiment of statement 5: oppose)
Microblog training dataset of slave controller n_3:
Statement 6: The Apple tablet is very fast, its resolution is very high, and the applications are quite rich. (Sentiment of statement 6: approve)
Statement 7: The Apple tablet's body is too heavy and inconvenient to hold, and downloads have to go through iTunes; pretty troublesome! (Sentiment of statement 7: oppose)
In step 503, the master controller initializes the required parameters and sends them to all slave controllers.
The parameters are randomly generated in advance by the master controller and comprise the input-node weight vectors w_1, w_2, w_3 and the hidden-node thresholds b_1, b_2, b_3; the number of hidden nodes is set to L = 3, and the parameters are issued to slave nodes n_1, n_2 and n_3:
w_1 = (-0.9286, 0.3575, -0.2155, 0.4121, -0.9077, 0.3897)
w_2 = (0.6983, 0.5155, 0.3110, -0.9363, -0.8057, -0.3658)
w_3 = (0.8680, 0.4863, -0.6576, -0.4462, 0.6469, 0.9004)
b_1 = 0.0344
b_2 = 0.4387
b_3 = 0.3816
In step 504, each slave controller processes its own local microblog dataset and sends the result to the master controller, which produces the microblog data classifier. The detailed flow is shown in Fig. 6 and starts at step 601.
In step 602, every microblog training record with its classification result part is vectorized into the feature vector x_i of the data part and the classification result part t_i.
Vectorization of the data part is feature extraction, which is the basis of sentiment-orientation analysis: its quality directly affects the results of sentiment prediction. Feature extraction transforms primitive features into the most representative new features by mapping (or conversion). The present embodiment mainly studies the influence on text sentiment analysis of commendatory sentiment words, derogatory sentiment words, degree adverbs and negative words as features, introduced concretely below.
Sentiment words: sentiment words are nouns, verbs, adjectives and some idioms that carry a sentiment orientation. The sentiment of a text is transmitted mainly through sentiment words, so they are one of the key features of text sentiment analysis and prediction. For sentiment analysis, the present embodiment divides the sentiment words in text data into two kinds, commendatory and derogatory. Commendatory words express praise or affirmation, such as "like", "approve", "appreciate", "praise", "admire" and "fine". Derogatory words express disparagement, negation, hatred or contempt, such as "detest", "oppose", "ignorant", "gloomy", "mean" and "deceive". The present embodiment divides commendatory sentiment words into three grades [+3, +2, +1], with the degree of commendation decreasing in turn, and derogatory sentiment words into three grades [-1, -2, -3], with the degree of derogation increasing in turn.
The sentiment words contribute four feature components: commendatory word frequency, commendatory average rank, derogatory word frequency and derogatory average rank. The word frequency is the number of such words divided by the total number of words in the statement, and the average rank is the mean of the grades of those words.
Degree adverbs: degree adverbs are a kind of adverb expressing degree, such as "very", "extremely", "too", "more", "especially", "slightly", "a little" and "almost"; the present embodiment extracts the degree-adverb word frequency as one feature component.
Negative adverbs: negative adverbs are a kind of adverb expressing affirmation or negation, such as "not", "no", "don't", "must", "never" and "needn't"; the present embodiment extracts the negative-adverb word frequency as one feature component.
In summary, the text feature vector extracted in the present embodiment has six components: commendatory word frequency, commendatory average rank, derogatory word frequency, derogatory average rank, degree-adverb word frequency and negative-adverb word frequency. In the classification result part of the microblog data, the sentiment of the text is divided into three levels, approve, neutral and oppose, represented by [+1, +2, +3]. In this way the feature vector part and classification result part of every microblog record are obtained, in the following concrete form.
Extracting features from the 7 microblog records by the above method gives the following results.
Statement 1 analyzed: statement 1 divides into 8 words, among which the commendatory words are "good", "fast" and "fine", 3 in all, so its commendatory word frequency is 3/8; the corresponding commendatory ranks are +1, +2 and +2, so its commendatory average rank is (1+2+2)/3. Statement 1 contains no derogatory words, so its derogatory word frequency and average rank are both 0; the degree adverb is "quite", with word frequency 1/8; the negative-adverb word frequency is 0; the sentiment is approve, so the classification result is +1. After extraction, statement 1 therefore converts to (0.375, 1.667, 0, 0, 0.125, 0, 1).
Using the same method, the feature vectors of the other statements are obtained:
Statement 2 analyzed: (0.083, 2, 0.167, -1.5, 0.25, 0.083, 3).
Statement 3 analyzed: (0.333, 2.5, 0, 0, 0.25, 0, 1).
Statement 4 analyzed: (0.077, 2, 0.077, -1, 0, 0, 2).
Statement 5 analyzed: (0, 0, 0.188, -2.333, 0.125, 0.063, 3).
Statement 6 analyzed: (0.273, 2.333, 0, 0, 0.273, 0, 1).
Statement 7 analyzed: (0, 0, 0.154, -2.5, 0.154, 0.077, 3).
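The six-component feature computation above can be sketched as follows. Passing pre-tallied word counts and rank lists is an assumption about how the word lists were applied, and the function name is hypothetical; the arithmetic matches the analysis of statement 1.

```python
def sentence_features(n_words, pos_ranks, neg_ranks, n_degree, n_negation):
    """Compute the embodiment's six text features from plain counts:
    commendatory word frequency and average rank, derogatory word frequency
    and average rank, degree-adverb frequency and negative-adverb frequency."""
    pos_freq = len(pos_ranks) / n_words
    pos_avg = sum(pos_ranks) / len(pos_ranks) if pos_ranks else 0.0
    neg_freq = len(neg_ranks) / n_words
    neg_avg = sum(neg_ranks) / len(neg_ranks) if neg_ranks else 0.0
    return (pos_freq, pos_avg, neg_freq, neg_avg,
            n_degree / n_words, n_negation / n_words)

# Statement 1: 8 words, commendatory ranks +1, +2, +2, one degree adverb.
feat = sentence_features(8, [1, 2, 2], [], 1, 0)
# feat reproduces (0.375, 1.667, 0, 0, 0.125, 0) up to rounding.
```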
In step 603, each slave controller strips the vectorized microblog training data of its own part into a feature-vector part and a classification-result part, so that each slave controller generates its own local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i is the classification-result matrix of the data set. In the distributed environment shown in Figure 4, the training data of slave controller n_1 are:
Statement 1 (0.375, 1.667, 0, 0, 0.125, 0, 1)
Statement 2 (0.083, 2, 0.167, -1.5, 0.25, 0.083, 3)
The feature matrix X_1 (the first six components of each vector) and the classification-result matrix T_1 (the last component) of slave controller n_1's microblog training data after stripping are as follows:
X_1 = [0.375 1.667 0 0 0.125 0; 0.083 2 0.167 -1.5 0.25 0.083]
T_1 = [1; 3]
(rows are separated by semicolons)
The training data of slave controller n_2 are:
Statement 3 (0.333, 2.5, 0, 0, 0.25, 0, 1)
Statement 4 (0.077, 2, 0.077, -1, 0, 0, 2)
Statement 5 (0, 0, 0.188, -2.333, 0.125, 0.063, 3)
The feature matrix X_2 and the classification-result matrix T_2 of slave controller n_2's microblog training data after stripping are as follows:
X_2 = [0.333 2.5 0 0 0.25 0; 0.077 2 0.077 -1 0 0; 0 0 0.188 -2.333 0.125 0.063]
T_2 = [1; 2; 3]
The training data of slave controller n_3 are:
Statement 6 (0.273, 2.333, 0, 0, 0.273, 0, 1)
Statement 7 (0, 0, 0.154, -2.5, 0.154, 0.077, 3)
The feature matrix X_3 and the classification-result matrix T_3 of slave controller n_3's microblog training data after stripping are as follows:
X_3 = [0.273 2.333 0 0 0.273 0; 0 0 0.154 -2.5 0.154 0.077]
T_3 = [1; 3]
In step 604, each slave controller n_i, according to the received parameters w_1, w_2, ..., w_L and b_1, b_2, ..., b_L and its local microblog data set (X_i, T_i), computes the intermediate results required by ELM and submits them to the master controller; in (X_i, T_i), X_i is the feature matrix and T_i is the classification-result matrix of the microblog data set, as shown in Figure 7.
It should be noted that, in ELM, every element of the input feature matrix X_i must be normalized so that all elements of X_i lie in [-1, +1]; different choices of normalization method lead to different input data. In addition, for the activation function g(w_i · x + b_i), ELM offers several activation functions for the user to choose from, and different choices of activation function likewise lead to different intermediate results, and hence to different final classification results. In this embodiment, the statement vectors are first normalized, an activation function is then selected, and the intermediate results required by ELM are computed. The three slave controllers are described in turn below.
For slave controller n_1:
In step 604-1, the data processed by slave controller n_1 are statement 1 (0.375, 1.667, 0, 0, 0.125, 0, 1) and statement 2 (0.083, 2, 0.167, -1.5, 0.25, 0.083, 3), and the received parameters are w_1, w_2, w_3 and b_1, b_2, b_3. After normalization and the choice of an activation function, the hidden-layer output matrix H_1 and the classification-result matrix T_1 are obtained (their values are shown in the accompanying figures).
In step 604-2, the intermediate result U_1 = H_1^T H_1 is computed from H_1. In step 604-3, the intermediate result V_1 = H_1^T T_1 is computed from H_1 and T_1. The intermediate results U_1 and V_1 are then submitted to the master controller.
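The slave-side intermediate results are just two small matrix products. A sketch follows; the numeric values of H_1 are made up, and T_1 is assumed here to use the one-of-three ±1 target encoding common in multiclass ELM, which the patent does not spell out:

```python
import numpy as np

def intermediate_results(H, T):
    """Compute the slave-side intermediate results U_i = H_i^T H_i and
    V_i = H_i^T T_i that are submitted to the master controller."""
    return H.T @ H, H.T @ T

H1 = np.array([[0.5, 0.7, 0.7],        # hidden-layer outputs for 2 statements
               [0.2, 0.6, 0.9]])       # (L = 3 hidden nodes)
T1 = np.array([[ 1, -1, -1],           # statement 1: label 1 ("approve")
               [-1, -1,  1]])          # statement 2: label 3 ("oppose")

U1, V1 = intermediate_results(H1, T1)
# U1 is L x L and symmetric; V1 is L x (number of classes)
```

Because U_i and V_i are only L x L and L x m regardless of how many statements the slave holds, the traffic to the master stays small even for massive data sets.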
For slave controller n_2:
In step 604-4, the data processed by slave controller n_2 are statement 3 (0.333, 2.5, 0, 0, 0.25, 0, 1), statement 4 (0.077, 2, 0.077, -1, 0, 0, 2) and statement 5 (0, 0, 0.188, -2.333, 0.125, 0.063, 3), and the received parameters are w_1, w_2, w_3 and b_1, b_2, b_3. After normalization and the choice of an activation function, the hidden-layer output matrix H_2 and the classification-result matrix T_2 are obtained.
In step 604-5, the intermediate result U_2 = H_2^T H_2 is computed from H_2. In step 604-6, the intermediate result V_2 = H_2^T T_2 is computed from H_2 and T_2. The intermediate results U_2 and V_2 are then submitted to the master controller.
For slave controller n_3:
In step 604-7, the data processed by slave controller n_3 are statement 6 (0.273, 2.333, 0, 0, 0.273, 0, 1) and statement 7 (0, 0, 0.154, -2.5, 0.154, 0.077, 3), and the received parameters are w_1, w_2, w_3 and b_1, b_2, b_3. After normalization and the choice of an activation function, the hidden-layer output matrix H_3 and the classification-result matrix T_3 are obtained.
In step 604-8, the intermediate result U_3 = H_3^T H_3 is computed from H_3. In step 604-9, the intermediate result V_3 = H_3^T T_3 is computed from H_3 and T_3. The intermediate results U_3 and V_3 are then submitted to the master controller.
In step 605, the master controller n_0 receives U_1 and V_1 from slave controller n_1, U_2 and V_2 from slave controller n_2, and U_3 and V_3 from slave controller n_3, and computes the final result, as shown in Figure 8.
In step 605-1, the intermediate results U_1, U_2, U_3 submitted by the slave controllers are merged to obtain the aggregated result U = U_1 + U_2 + U_3 = H^T H.
In step 605-2, the intermediate results V_1, V_2, V_3 submitted by the slave controllers are merged to obtain the aggregated result V = V_1 + V_2 + V_3 = H^T T.
In step 605-3, the output-node weight vector parameter β is computed from the aggregated U and V as β = (I/λ + U)^(-1) V, where I is the identity matrix and λ is a user-specified parameter. The weight vector parameter β is thus obtained.
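The master-side aggregation of steps 605-1 through 605-3 can be sketched with NumPy. The data are random stand-ins, and λ = 100 is illustrative; the point of the sketch is that summing the slaves' U_i and V_i and solving once gives exactly the same β as a centralized solve over the stacked data:

```python
import numpy as np

def aggregate_and_solve(Us, Vs, lam=100.0):
    """Master side: sum the slaves' intermediate results and solve
    beta = (I/lam + U)^(-1) V with U = sum U_i and V = sum V_i."""
    U = sum(Us)
    V = sum(Vs)
    L = U.shape[0]
    return np.linalg.solve(np.eye(L) / lam + U, V)

rng = np.random.default_rng(1)
# three slaves with 2, 3 and 2 statements; L = 3 hidden nodes, 3 classes
Hs = [rng.normal(size=(2, 3)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))]
Ts = [rng.choice([-1.0, 1.0], size=(h.shape[0], 3)) for h in Hs]

beta = aggregate_and_solve([h.T @ h for h in Hs],
                           [h.T @ t for h, t in zip(Hs, Ts)])

# centralized reference computed from the stacked H and T
H, T = np.vstack(Hs), np.vstack(Ts)
central = np.linalg.solve(np.eye(3) / 100.0 + H.T @ H, H.T @ T)
# beta and central agree, which is what makes the distributed scheme exact
```

This equality, not an approximation, is what lets the device split a massive training set across slave controllers without changing the resulting classifier.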
In step 605-4, according to the parameter β obtained by the parameter generator, a classifier that can predict the sentiment orientation of microblog data is constructed and used to perform sentiment-orientation analysis on microblog data to be tested. Its formula is as follows:
f(x) = h(x)β
In step 505: automatic classification of the microblog data.
The automatic classification of microblog data can be done in two main ways; this embodiment adopts the first: the master controller continues to crawl microblog data and uses the generated microblog data classifier to output the classification results of the microblog data to be classified directly. The following two statements are microblog data to be classified that the master controller continued to crawl, shown after applying the same feature extraction method.
Statement 8: I gave the Apple tablet to a friend; the friend is delighted with it, it is very good! The speed and the styling are all fine! I like it!
Analysis of statement 8: (0.286, 2.25, 0, 0, 0.214, 0, unknown classification result).
Statement 9: The Apple tablet's screen quality is very low, it is very troublesome to use, and its battery life is very poor.
Analysis of statement 9: (0, 0, 0.25, -2.333, 0.25, 0, unknown classification result).
After applying the same normalization method and choosing the same activation function, the classification result of statement 8 is obtained as follows:
Hidden-layer output vector h(x_8) = [g(w_1 · x_8 + b_1)  g(w_2 · x_8 + b_2)  g(w_3 · x_8 + b_3)] = [0.5467 0.7244 0.7388]
Substituting this into the classifier formula gives
f(x) = h(x)β = [0.6332 -0.6207 -1.0061]
For the above result, ELM adopts a maximization rule to decide the classification result of the microblog data to be predicted. The basic principle is to find the dimension in which the largest element of the result vector lies; the class label corresponding to that dimension is then the classification result of the data. For example, the largest element in the classifier output of statement 8 is 0.6332, whose dimension is 1, so the classification result of statement 8 is the class represented by label 1, namely "approve".
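The maximization rule can be sketched in a few lines, using the output vectors of statements 8 and 9 from the text; the English class names are translations assumed for illustration:

```python
def classify(fx, labels=("approve", "neutral", "oppose")):
    """Maximization rule: the class whose dimension holds the largest
    element of f(x) is the classification result (1-based label)."""
    idx = max(range(len(fx)), key=lambda i: fx[i])
    return idx + 1, labels[idx]

fx8 = [0.6332, -0.6207, -1.0061]    # classifier output for statement 8
fx9 = [-1.2055, -0.8521, 1.0684]    # classifier output for statement 9

label8 = classify(fx8)   # largest element in dimension 1 -> label 1
label9 = classify(fx9)   # largest element in dimension 3 -> label 3
```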
The prediction process for statement 9 is identical to that for statement 8 and is summarized as follows:
Hidden-layer output vector h(x_9) = [g(w_1 · x_9 + b_1)  g(w_2 · x_9 + b_2)  g(w_3 · x_9 + b_3)] = [0.2222 0.6704 0.9174]
Substituting this into the classifier formula gives
f(x) = h(x)β = [-1.2055 -0.8521 1.0684]
The largest element in the classifier output of statement 9 is 1.0684, whose dimension is 3, so the classification result of statement 9 is the class represented by label 3, namely "oppose".
When statements 8 and 9 are used as test data, the generated microblog data classifier correctly obtains their sentiment orientations; microblog data to be classified can thus be classified accurately.
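Putting steps 602 through 605 together, the whole distributed training and prediction flow can be sketched end to end. The toy data, the three-way partition, the sigmoid activation and the λ value are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_distributed(partitions, L=20, lam=100.0, seed=0):
    """Master: generate random (W, b); each slave: compute U_i = H_i^T H_i
    and V_i = H_i^T T_i; master: sum the results and solve for beta."""
    rng = np.random.default_rng(seed)
    d = partitions[0][0].shape[1]
    W = rng.uniform(-1, 1, (d, L))        # input-node weight vectors
    b = rng.uniform(-1, 1, L)             # hidden-node offsets
    Us, Vs = [], []
    for X, T in partitions:               # slave-side work
        H = sigmoid(X @ W + b)
        Us.append(H.T @ H)
        Vs.append(H.T @ T)
    beta = np.linalg.solve(np.eye(L) / lam + sum(Us), sum(Vs))  # master side
    return W, b, beta

def predict(X, W, b, beta):
    """Maximization rule: 1-based label of the largest output dimension."""
    return np.argmax(sigmoid(X @ W + b) @ beta, axis=1) + 1

# Toy two-class data: label 1 when the first feature is positive, else 2.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, (60, 6))
T = np.where(X[:, :1] > 0, [1.0, -1.0], [-1.0, 1.0])   # one-of-two +/-1 targets
parts = [(X[:20], T[:20]), (X[20:40], T[20:40]), (X[40:], T[40:])]

W, b, beta = train_distributed(parts)
labels = predict(X, W, b, beta)
accuracy = float(np.mean(labels == np.where(X[:, 0] > 0, 1, 2)))
```

The three tuples in `parts` play the role of the three slave controllers n_1, n_2 and n_3; only their small U_i and V_i matrices would cross the network in the real device.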
Besides analyzing the sentiment orientation of microblog data, the present invention can also be applied to many other analyses, such as box-office receipts, song click-through rates, financial product recommendation, stock analysis, equipment performance, hot news events and public opinion analysis.
Although specific embodiments of the present invention have been described above, those skilled in the art should appreciate that they are merely illustrative, and that numerous variations or modifications may be made to these embodiments without departing from the principle and essence of the present invention. The scope of the present invention is limited only by the appended claims.
Claims (5)
1. A distributed classification device for massive microblog data, adopting a distributed structure, characterized in that it comprises one master controller and at least one slave controller, each slave controller being interconnected with the master controller, the master controller communicating with each slave controller, and all slave controllers being mutually independent;
the slave controller comprises:
a vectorizer, used to convert every microblog training datum with a classification result on the slave controller into vector form, comprising the feature vector x_i of the data part and the classification-result part t_i of each microblog datum;
a stripper, used to strip the feature-vector matrix X_i and the classification-result matrix T_i from all microblog data in the microblog training set processed by the vectorizer;
a converter, which uses the principle of the extreme learning machine (ELM) to convert the feature-vector matrix X_i extracted by the stripper into the hidden-layer output matrix H_i of the ELM;
an antecedent calculator, which uses the ELM principle to compute the intermediate result H_i^T H_i from the hidden-layer output matrix H_i and submit it to the master controller;
a consequent calculator, which uses the ELM principle to compute the intermediate result H_i^T T_i from the hidden-layer output matrix H_i and the classification-result matrix T_i of the microblog data set and submit it to the master controller;
the master controller comprises:
an antecedent accumulator, used to merge the intermediate results H_i^T H_i submitted by each slave controller to obtain the aggregated result H^T H;
a consequent accumulator, used to merge the intermediate results H_i^T T_i submitted by each slave controller to obtain the aggregated result H^T T;
a parameter generator, which uses the ELM principle to compute the output-node weight vector parameter β from the aggregated outputs of the antecedent accumulator and the consequent accumulator;
a classifier generator, which constructs the microblog data classifier from the parameter β obtained by the parameter generator and uses it to classify microblog data to be tested.
2. The distributed classification device for massive microblog data according to claim 1, characterized in that each slave controller sends the intermediate results it computes, which are used to generate the final microblog data classifier, to the master controller, and after the master controller receives all the intermediate results sent by the slave controllers, it obtains the final microblog data classifier according to the ELM principle.
3. A distributed classification method for massive microblog data, implemented with the distributed classification device for massive microblog data according to claim 1, characterized in that it comprises the following steps:
Step 1: preparation of the microblog training data set.
The preparation of the microblog training data set comprises two parts: crawling the original microblog data and manually labeling the microblog data. Either of two ways is adopted. In the first way, the master controller crawls the original microblog data to be processed, each training datum is manually labeled with the classification result representing that microblog datum, and the master controller then allocates these microblog data to the corresponding slave controllers. In the second way, the master controller communicates with each slave controller to notify it of the microblog data it needs to crawl; each slave controller then crawls the original microblog data itself and manually labels the data it has crawled with their classification results.
Step 2: the master controller initializes the required parameters and sends them to all slave controllers.
Using the principle of the extreme learning machine (ELM), the master controller randomly generates parameters in advance, including the number of hidden nodes L, the input-node weight vectors w_1, w_2, ..., w_L, and the hidden-node offsets b_1, b_2, ..., b_L, and sends these parameters to all slave controllers.
Step 3: each slave controller processes its own local microblog data set and sends the results to the master controller, which generates the microblog data classifier.
Step 3-1: vectorization of the microblog data.
Every microblog training datum with a classification-result part is vectorized, yielding the feature vector x_i of the data part and the classification-result part t_i of each microblog datum.
Step 3-2: stripping of the microblog data.
For each slave controller's feature-extracted microblog training set, the feature-vector part and the classification-result part of the data are stripped apart, forming each slave controller's feature-vector matrix X_i and classification-result matrix T_i of the microblog training set, so that each slave controller generates its own local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i is the classification-result matrix of the data set.
Step 3-3: each slave controller generates intermediate results from its own local microblog data set and sends them to the master controller.
Step 3-4: the master controller receives and aggregates the intermediate results of all slave controllers, computes the output-node weight vector parameter β from the aggregated intermediate results according to the computing principle of ELM, and thereby obtains the microblog data classifier.
Step 4: automatic classification of the microblog data.
The automatic classification of the microblog data can be done in two ways. In the first way, the master controller continues to crawl microblog data and uses the microblog data classifier generated in step 3 to output the classification results of the microblog data to be classified directly. In the second way, the master controller sends the microblog data classifier generated in step 3 to each slave controller, and each slave controller then uses the classifier to classify its own microblog data to be classified and obtain the classification results.
4. The distributed classification method for massive microblog data according to claim 3, characterized in that in step 3-3 each slave controller generates intermediate results from its own local microblog data set and sends them to the master controller, specifically as follows:
each slave controller n_i, according to the received input-node weight vectors w_1, w_2, ..., w_L, the hidden-node thresholds b_1, b_2, ..., b_L, and its local microblog training data set (X_i, T_i), computes the intermediate results required to build the classifier and submits them to the master controller;
Step 3-3-1: convert the feature matrix X_i of the local microblog data set into the hidden-layer output matrix H_i of the ELM;
Step 3-3-2: compute the intermediate result U_i = H_i^T H_i from the hidden-layer output matrix H_i;
Step 3-3-3: compute the intermediate result V_i = H_i^T T_i from the hidden-layer output matrix H_i and the classification-result matrix T_i of the local training data set.
5. The distributed classification method for massive microblog data according to claim 3, characterized in that the computation of the output-node weight vector parameter β in step 3-4 is specifically as follows:
Step 3-4-1: merge the intermediate results U_i submitted by each slave controller to obtain the aggregated result U = ΣU_i = ΣH_i^T H_i = H^T H;
Step 3-4-2: merge the intermediate results V_i submitted by each slave controller to obtain the aggregated result V = ΣV_i = ΣH_i^T T_i = H^T T;
Step 3-4-3: compute the output-node weight vector parameter β from the aggregated U and V as β = (I/λ + U)^(-1) V, where I is the identity matrix, λ is a user-specified parameter, and (·)^(-1) denotes matrix inversion;
the formula of the microblog data classifier is then determined as follows:
f(x) = h(x)β
where f(x) denotes the classification result of the microblog data to be classified, and h(x) denotes the hidden-layer output vector of the microblog data to be classified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210583886.8A CN103020712B (en) | 2012-12-28 | 2012-12-28 | A kind of distributed sorter of massive micro-blog data and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103020712A true CN103020712A (en) | 2013-04-03 |
CN103020712B CN103020712B (en) | 2015-10-28 |
Family
ID=47969298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210583886.8A Active CN103020712B (en) | 2012-12-28 | 2012-12-28 | A kind of distributed sorter of massive micro-blog data and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103020712B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593462A (en) * | 2013-11-25 | 2014-02-19 | 中国科学院深圳先进技术研究院 | Microblog-data-oriented flu epidemic surveillance analysis method and system |
CN105760899A (en) * | 2016-03-31 | 2016-07-13 | 大连楼兰科技股份有限公司 | Adboost training learning method and device based on distributed computation and detection cost ordering |
WO2017133568A1 (en) * | 2016-02-05 | 2017-08-10 | 阿里巴巴集团控股有限公司 | Mining method and device for target characteristic data |
CN107590134A (en) * | 2017-10-26 | 2018-01-16 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
CN109034366A (en) * | 2018-07-18 | 2018-12-18 | 北京化工大学 | Application based on the ELM integrated model of more activation primitives in chemical engineering modeling |
CN109657061A (en) * | 2018-12-21 | 2019-04-19 | 合肥工业大学 | A kind of Ensemble classifier method for the more word short texts of magnanimity |
CN110381456A (en) * | 2019-07-19 | 2019-10-25 | 珠海格力电器股份有限公司 | Flow management system, flow threshold calculation method and air conditioning system |
CN113177163A (en) * | 2021-04-28 | 2021-07-27 | 烟台中科网络技术研究所 | Method, system and storage medium for social dynamic information sentiment analysis |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1185796A (en) * | 1997-09-01 | 1999-03-30 | Canon Inc | Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method and storage medium |
US20120189194A1 (en) * | 2011-01-26 | 2012-07-26 | Microsoft Corporation | Mitigating use of machine solvable hips |
CN102789498A (en) * | 2012-07-16 | 2012-11-21 | 钱钢 | Method and system for carrying out sentiment classification on Chinese comment text on basis of ensemble learning |
2012-12-28: CN CN201210583886.8A patent/CN103020712B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1185796A (en) * | 1997-09-01 | 1999-03-30 | Canon Inc | Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method and storage medium |
US20120189194A1 (en) * | 2011-01-26 | 2012-07-26 | Microsoft Corporation | Mitigating use of machine solvable hips |
CN102789498A (en) * | 2012-07-16 | 2012-11-21 | 钱钢 | Method and system for carrying out sentiment classification on Chinese comment text on basis of ensemble learning |
Non-Patent Citations (3)
Title |
---|
HUANG GUANGBIN ET AL: "Extreme Learning Machine for Regression and Multiclass Classification", 《IEEE TRANSACTIONS ON SYSTEMS,MAN AND CYBERNETICS-PARTB》 * |
WANG Lei et al.: "Parallel extreme learning machine algorithm based on a binary cascade structure", Journal of Jilin University (Information Science Edition) * |
ZHAO Xiangguo et al.: "ELM-based protein secondary structure prediction and its post-processing", Journal of Northeastern University (Natural Science) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593462A (en) * | 2013-11-25 | 2014-02-19 | 中国科学院深圳先进技术研究院 | Microblog-data-oriented flu epidemic surveillance analysis method and system |
CN103593462B (en) * | 2013-11-25 | 2017-02-15 | 中国科学院深圳先进技术研究院 | Microblog-data-oriented flu epidemic surveillance analysis method and system |
WO2017133568A1 (en) * | 2016-02-05 | 2017-08-10 | 阿里巴巴集团控股有限公司 | Mining method and device for target characteristic data |
CN105760899A (en) * | 2016-03-31 | 2016-07-13 | 大连楼兰科技股份有限公司 | Adboost training learning method and device based on distributed computation and detection cost ordering |
CN105760899B (en) * | 2016-03-31 | 2019-04-05 | 大连楼兰科技股份有限公司 | Training learning method and device based on distributed computing and detection cost sequence |
CN107590134A (en) * | 2017-10-26 | 2018-01-16 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
CN109034366A (en) * | 2018-07-18 | 2018-12-18 | 北京化工大学 | Application based on the ELM integrated model of more activation primitives in chemical engineering modeling |
CN109657061A (en) * | 2018-12-21 | 2019-04-19 | 合肥工业大学 | A kind of Ensemble classifier method for the more word short texts of magnanimity |
CN109657061B (en) * | 2018-12-21 | 2020-11-27 | 合肥工业大学 | Integrated classification method for massive multi-word short texts |
CN110381456A (en) * | 2019-07-19 | 2019-10-25 | 珠海格力电器股份有限公司 | Flow management system, flow threshold calculation method and air conditioning system |
CN113177163A (en) * | 2021-04-28 | 2021-07-27 | 烟台中科网络技术研究所 | Method, system and storage medium for social dynamic information sentiment analysis |
Also Published As
Publication number | Publication date |
---|---|
CN103020712B (en) | 2015-10-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103020712A (en) | Distributed classification device and distributed classification method for massive micro-blog data | |
Koncel-Kedziorski et al. | Text generation from knowledge graphs with graph transformers | |
Shi et al. | WE-LDA: a word embeddings augmented LDA model for web services clustering | |
CN102495860B (en) | Expert recommendation method based on language model | |
CN103853824B (en) | In-text advertisement releasing method and system based on deep semantic mining | |
Wu et al. | Neural news recommendation with heterogeneous user behavior | |
Li et al. | Knowledge-grounded dialogue generation with a unified knowledge representation | |
CN101354714B (en) | Method for recommending problem based on probability latent semantic analysis | |
CN102831119B (en) | Short text clustering Apparatus and method for | |
CN111222332A (en) | Commodity recommendation method combining attention network and user emotion | |
CN106897914A (en) | A kind of Method of Commodity Recommendation and system based on topic model | |
CN109993583A (en) | Information-pushing method and device, storage medium and electronic device | |
CN110210933A (en) | A kind of enigmatic language justice recommended method based on generation confrontation network | |
Yin et al. | Ranking products through online reviews considering the mass assignment of features based on BERT and q-rung orthopair fuzzy set theory | |
CN103729431A (en) | Massive microblog data distributed classification device and method with increment and decrement function | |
Pathan et al. | Unsupervised aspect extraction algorithm for opinion mining using topic modeling | |
Huang et al. | Sentiment analysis in e-commerce platforms: A review of current techniques and future directions | |
CN109902273A (en) | The modeling method and device of keyword generation model | |
Perez-Castro et al. | Efficiency of automatic text generators for online review content generation | |
Zong et al. | Double sparse learning model for speech emotion recognition | |
Zhang et al. | Local-global graph pooling via mutual information maximization for video-paragraph retrieval | |
Chakraborty et al. | LSTM-ANN based price hike sentiment analysis from Bangla social media comments | |
Jangra et al. | Semantic extractor-paraphraser based abstractive summarization | |
Wang et al. | The application of factorization machines in user behavior prediction | |
Tran et al. | Sentiment classification for beauty-fashion reviews |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2022-03-24 Address after: 100081 No. 5, Zhongguancun South Street, Haidian District, Beijing Patentee after: BEIJING INSTITUTE OF TECHNOLOGY Address before: 110819 No. 11, Lane 3, Heping Road, Heping District, Shenyang, Liaoning Patentee before: Northeastern University |