CN103020712B - Distributed classification device and method for massive microblog data - Google Patents

Distributed classification device and method for massive microblog data

Info

Publication number
CN103020712B
CN103020712B CN201210583886.8A
Authority
CN
China
Prior art keywords
microblog
microblog data
data
Prior art date
Legal status
Active
Application number
CN201210583886.8A
Other languages
Chinese (zh)
Other versions
CN103020712A (en)
Inventor
王国仁
信俊昌
聂铁铮
赵相国
丁琳琳
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201210583886.8A priority Critical patent/CN103020712B/en
Publication of CN103020712A publication Critical patent/CN103020712A/en
Application granted granted Critical
Publication of CN103020712B publication Critical patent/CN103020712B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed classification device and method for massive microblog data, belonging to the technical field of data mining. The device adopts a distributed structure: following the ELM processing method, each slave controller sends the intermediate results it has computed for generating the final microblog data classifier to the master controller; after the master controller receives the intermediate results sent by all slave controllers, it obtains the final microblog data classifier according to the ELM principle and uses it to classify newly produced microblog data. By exploiting extreme learning machine technology, the invention overcomes the defect that ELM classification could previously only be applied in a centralized environment and could not adapt to large-scale training sample sets, making it possible to process and analyze massive microblog data, so that the utility of the massive microblog data accumulated in applications is fully exerted and the applications are better served.

Description

Distributed classification device and method for mass microblog data
Technical Field
The invention belongs to the technical field of data mining, relates to an extreme learning machine classification device and method based on a distributed processing technology, and particularly relates to a distributed classification device and method for massive microblog data.
Background
At present, a large amount of information is generated on the internet at every moment, in diverse forms of expression, and the amount of information generated by microblog platforms is growing rapidly. Microblogs (micro-blogs) are a form of blog that lets users publish short texts (usually around 140 characters) publicly and update them in a timely manner. The rapid development of microblogging enables anyone to become a microblog user and to send and read information at any time on any client that supports microblogs, interacting with others and expressing their emotions. Microblogs have become a powerful information carrier on the internet, and the amount of microblog information has reached a massive scale, making microblogs the most popular platform for information sharing, spreading, and interaction today. Therefore, how to adopt appropriate measures and technologies to mine useful information from massive microblog data and to make predictive judgments about future events has become a hotspot and a difficulty of current research in the field of data mining.
In the existing related research aiming at microblog data, the data volume of the processed microblog data is usually relatively small, and the microblog data can be processed in a centralized environment; however, with the rapid increase of microblog data in the internet, the data volume of the microblog data far exceeds the processing capacity of a single computer, and large-scale data analysis is difficult to realize by adopting the existing method.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a distributed classification device and a distributed classification method for mass microblog data, which classify the microblog data by utilizing an Extreme Learning Machine (ELM) technology, and further can effectively process and analyze the mass microblog data so as to fully exert the utility of the mass microblog data accumulated in application and better serve the application.
The technical scheme of the invention is realized as follows: a distributed classification device for massive microblog data adopts a distributed structure and comprises a master controller and at least one slave controller, wherein each slave controller is interconnected with the master controller, the master controller and each slave controller are communicated with each other, and all the slave controllers are independent of each other and independently complete respective tasks; according to the ELM processing method, each slave control machine sends the intermediate result processed by the slave control machine and used for generating the final microblog data classifier to the master control machine, and the master control machine receives the intermediate results sent by all the slave control machines and then obtains the final microblog data classifier according to the ELM principle.
The slave controller comprises:
A vector machine: converts each piece of microblog training data with a classification result on the slave controller into a vector representation, comprising a feature vector x_i for the data part of each microblog datum and a classification result part t_i.
A stripper: strips out the feature vector matrix X_i and the classification result matrix T_i of all microblog data in the microblog data training set processed by the vector machine.
A converter: uses the principle of the extreme learning machine (ELM) to convert the stripper's feature vector matrix X_i into the ELM hidden layer output matrix H_i.
An antecedent calculator: uses the ELM principle to calculate the intermediate result H_i^T H_i from the hidden layer output matrix H_i, and submits it to the master controller.
A consequent calculator: uses the ELM principle to calculate the intermediate result H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the microblog data set, and submits it to the master controller.
The master controller comprises:
An antecedent accumulator: merges the intermediate results H_i^T H_i submitted by the slave controllers to obtain the aggregate result H^T H.
A consequent accumulator: merges the intermediate results H_i^T T_i submitted by the slave controllers to obtain the aggregate result H^T T.
A parameter generator: uses the ELM principle to calculate the output-node weight vector parameter β from the aggregate results output by the antecedent accumulator and the consequent accumulator.
A classification generator: constructs the microblog data classifier from the parameter β obtained by the parameter generator and classifies the microblog data under test.
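The division of labor between the two kinds of machines can be illustrated with a short sketch. The following Python/NumPy code is our own minimal illustration of the data flow, not code from the patent; the sigmoid excitation function and all names are assumptions:

```python
import numpy as np

def sigmoid(z):
    """One common ELM excitation function (assumed choice)."""
    return 1.0 / (1.0 + np.exp(-z))

def slave_intermediates(X_i, T_i, W, b):
    """Slave side (converter + antecedent/consequent calculators).
    X_i: (N_i, d) local feature matrix; T_i: (N_i, c) classification results;
    W: (d, L) input-node weights; b: (L,) hidden-node offsets."""
    H_i = sigmoid(X_i @ W + b)   # hidden layer output matrix H_i
    U_i = H_i.T @ H_i            # antecedent calculator: H_i^T H_i
    V_i = H_i.T @ T_i            # consequent calculator: H_i^T T_i
    return U_i, V_i              # submitted to the master controller

def master_classifier(intermediates, L, lam):
    """Master side (accumulators + parameter generator): sum the slaves'
    (U_i, V_i) pairs and solve for the output weight parameter beta."""
    U = sum(U for U, _ in intermediates)            # antecedent accumulator: H^T H
    V = sum(V for _, V in intermediates)            # consequent accumulator: H^T T
    return np.linalg.solve(np.eye(L) / lam + U, V)  # beta = (I/lambda + U)^{-1} V
```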
A distributed classification method for massive microblog data comprises the following steps:
step 1: preparing a microblog training data set;
The preparation of the microblog training data set comprises two parts: capturing original microblog data and manually labeling it. Either of two approaches can be adopted: in the first, the master controller captures the original microblog data to be processed, each piece of training data is labeled manually to represent its classification result, and the data are then distributed to the corresponding slave controllers; in the second, the master controller communicates with each slave controller to tell it which microblog data to capture, and each slave controller captures its own original microblog data and labels it manually to represent the classification results;
step 2: the master controller initializes the required parameters and sends the parameters to all the slave controllers;
The method utilizes the principle of the extreme learning machine (ELM); the master controller randomly generates the parameters in advance, including: the number L of hidden nodes, the input-node weight vectors w_1, w_2, ..., w_L, and the hidden-node offsets b_1, b_2, ..., b_L, and sends these parameters to all slave controllers;
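A hedged sketch of this initialization step follows; the uniform sampling ranges are assumptions consistent with the example values in the embodiment below (weights in [-1, 1], offsets in [0, 1]), not something the patent prescribes:

```python
import numpy as np

rng = np.random.default_rng(0)            # seed is our choice
L, d = 3, 6                               # hidden nodes and feature dimension, as in the embodiment
W = rng.uniform(-1.0, 1.0, size=(d, L))   # input-node weight vectors w_1..w_L (columns of W)
b = rng.uniform(0.0, 1.0, size=L)         # hidden-node offsets b_1..b_L
# L, W and b would then be sent to every slave controller.
```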
Step 3: each slave controller processes its respective local microblog data set and sends the processing results to the master controller, which generates the microblog data classifier;
Step 3-1: vectorize the microblog data;
Each piece of microblog training data with a classification result part is vectorized, yielding a feature vector x_i for the data part of each microblog datum and a classification result part t_i.
Step 3-2: strip the microblog data;
For each feature-extracted microblog datum in the slave controller's microblog data training set, the feature vector part and the classification result part are stripped apart to form the feature vector matrix X_i and the classification result matrix T_i of that slave controller's training set; that is, each slave controller generates its respective local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i the classification result matrix of the microblog data set.
Step 3-3: each slave control machine generates an intermediate result according to the respective local microblog data set and sends the intermediate result to the master control machine;
Each slave controller n_i calculates the intermediate results required to construct the classifier according to the received input-node weight vectors w_1, w_2, ..., w_L, the hidden-node thresholds b_1, b_2, ..., b_L, and its local microblog training data set (X_i, T_i), and submits them to the master controller (a note on the sizes of these intermediates follows the sub-steps below);
Step 3-3-1: convert the feature matrix X_i of the local microblog data set into the ELM hidden layer output matrix H_i;
Step 3-3-2: calculate the intermediate result U_i = H_i^T H_i from the hidden layer output matrix H_i;
Step 3-3-3: calculate the intermediate result V_i = H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the local training data set;
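It is worth making explicit why only these intermediates need to be communicated (the observation is ours, but it follows directly from the matrix dimensions): with N_i local samples, L hidden nodes, and c output columns,

$$H_i \in \mathbb{R}^{N_i \times L}, \qquad U_i = H_i^T H_i \in \mathbb{R}^{L \times L}, \qquad V_i = H_i^T T_i \in \mathbb{R}^{L \times c},$$

so each slave controller transmits only L(L + c) numbers to the master controller, independent of the size N_i of its local microblog data set.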
Step 3-4: the master controller receives and summarizes the intermediate results of the slave controllers; calculating a weight vector parameter beta of the output node according to the collected intermediate result and the ELM calculation principle, and further solving a microblog data classifier;
Step 3-4-1: merge the intermediate results U_i submitted by each slave controller to obtain the aggregate result U = ΣU_i = ΣH_i^T H_i = H^T H;
Step 3-4-2: merge the intermediate results V_i submitted by each slave controller to obtain the aggregate result V = ΣV_i = ΣH_i^T T_i = H^T T;
Step 3-4-3: calculate the output-node weight vector parameter β from the aggregated U and V:

$$\beta = \left(\frac{I}{\lambda} + H^T H\right)^{-1} H^T T = \left(\frac{I}{\lambda} + U\right)^{-1} V$$

where I is the identity matrix, λ is a user-specified parameter, and (·)^{-1} denotes matrix inversion;
further determining a formula of the microblog data classifier,
f(x)=h(x)β
where f(x) represents the classification result of the microblog data to be classified, and h(x) represents the hidden layer output vector of the microblog data to be classified;
Step 4: automatic classification of microblog data
Automatic classification of microblog data can proceed in two ways: in the first, the master controller continuously captures microblog data and directly outputs the classification results of the microblog data to be classified by applying the microblog data classifier generated in step 3; in the second, the master controller sends the classifier generated in step 3 to each slave controller, and each slave controller then applies it to classify the microblog data to be classified and obtain the classification results.
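A minimal prediction sketch under the same assumptions as the earlier device sketch (sigmoid excitation; `W`, `b`, and `beta` as produced in steps 2-3; it reuses the `sigmoid` helper and NumPy import from that sketch):

```python
def classify(x, W, b, beta):
    """Classify one microblog feature vector x with the trained classifier."""
    h = sigmoid(x @ W + b)        # hidden layer output vector h(x)
    f = h @ beta                  # f(x) = h(x) * beta
    return int(np.argmax(f)) + 1  # label of the largest output component
```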
Advantageous effects: the distributed classification device and method for massive microblog data of the invention overcome the defect that traditional extreme learning machine technology can only be applied in a centralized environment and cannot adapt to ELM classification over large-scale training sample sets. Massive microblog data can thus be processed and analyzed, the utility of the massive microblog data accumulated in applications can be fully exerted, and the applications can be better served.
Drawings
FIG. 1 is a schematic diagram of a distributed architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the connection between the master controller and the slave controller according to an embodiment of the present invention;
FIG. 3 is a block diagram of the master controller and slave controllers according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a distributed microblog data training set according to an embodiment of the invention;
FIG. 5 is a flow chart of a distributed microblog data training method according to an embodiment of the invention;
FIG. 6 is a flowchart of a method for generating a microblog data classifier according to an embodiment of the invention;
FIG. 7 is a diagram of partial intermediate results after conversion by a slave controller in accordance with one embodiment of the present invention;
FIG. 8 is a diagram illustrating the calculation of intermediate results from the slave controller and the collection of the master controller according to one embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Current microblog data contain a large amount of emotion information from microblog users; this information expresses the viewpoints and opinions of microblog users on certain events, commodities, public figures, and the like, and has high research and application value. Emotion analysis of microblog data has therefore received wide attention and has broad application prospects, such as viewpoint analysis, commodity evaluation, and opinion detection. Accordingly, in the embodiment of the invention, microblog data are classified according to their emotional tendency.
The invention analyzes massive microblog data in a distributed environment, whose system structure is shown in FIG. 1. It comprises a master node n_0 and a plurality of slave nodes n_1, n_2, ..., n_s, where the master node n_0 is interconnected with all of the slave nodes n_1, n_2, ..., n_s and can communicate with each of them.
One embodiment of the present invention uses the general connection scheme shown in FIG. 2, which includes a master controller and a plurality of slave controllers (slave controller 1, slave controller 2, ..., slave controller m), each of which is interconnected with the master controller. According to the principle of the extreme learning machine (ELM), each slave controller processes its local microblog training data set to generate its intermediate results for producing the final classifier and sends them to the master controller; after receiving the intermediate results, the master controller generates the final microblog data classifier according to the ELM principle.
The slave controller comprises a vector machine, a stripper, a converter, an antecedent calculator, and a consequent calculator. The master controller comprises an antecedent accumulator, a consequent accumulator, a parameter generator, and a classification generator.
A vector machine: converts each piece of microblog training data with a classification result on the slave controller into a vector representation, comprising a feature vector x_i for the data part of each microblog datum and a classification result part t_i.
A stripper: strips out the feature vector matrix X_i and the classification result matrix T_i of all microblog data in the microblog data training set processed by the vector machine.
A converter: uses the principle of the extreme learning machine (ELM) to convert the stripper's feature vector matrix X_i into the ELM hidden layer output matrix H_i.
An antecedent calculator: uses the ELM principle to calculate the intermediate result H_i^T H_i from the hidden layer output matrix H_i, and submits it to the master controller.
A consequent calculator: uses the ELM principle to calculate the intermediate result H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the microblog data set, and submits it to the master controller.
The former term accumulator: for merging intermediate results H submitted from the controllersi THiTo obtain a summary result HTH。
A last term accumulator: for merging intermediate results H submitted from the controllersi TTiTo obtain a summary result HTT。
A parameter generator: the principle of an Extreme Learning Machine (ELM) is used for calculating a weight vector parameter beta of an output node according to the summarized results output by the antecedent accumulator and the consequent accumulator.
And (3) a classification generator: and constructing a classification device of the microblog data according to the parameter beta obtained by the parameter generator, and classifying the microblog data to be tested.
In this embodiment, each slave controller and the master controller analyze microblog data with the ELM technique, which is as follows:
The extreme learning machine is a training method based on single hidden-layer feedforward neural networks (SLFNs). The ELM randomly sets the connection weights and offsets between the input layer and the hidden layer before training and does not need to adjust the network's input weights or the hidden-unit offsets during execution of the algorithm; it produces a unique optimal analytic solution for the output-layer weights and offers good generalization capability at an extremely fast learning speed.
The basic principle of the ELM is as follows: during training, the ELM first randomly generates the input weights and hidden-node thresholds and then calculates the output weights of the SLFNs from the training data. Suppose N training samples (x_j, t_j) are given, where x_j is the feature vector part and t_j the classification result part of a sample. SLFNs with L hidden nodes and excitation function g(x) can be formally expressed as:
$$\sum_{i=1}^{L} \beta_i g_i(x_j) = \sum_{i=1}^{L} \beta_i g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N. \qquad (1)$$
where w_i is the weight vector connecting the ith hidden node and the input nodes; β_i is the weight vector connecting the ith hidden node and the output nodes; b_i is the threshold of the ith hidden node; and o_j is the jth output vector of the SLFNs.
If the SLFNs can approximate the training samples with zero error, then $\sum_{j=1}^{N} \| o_j - t_j \| = 0$ holds, i.e., there exist w_i, β_i, and b_i such that

$$\sum_{i=1}^{L} \beta_i g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N,$$

abbreviated as Hβ = T, where
$$H(w_1, w_2, \ldots, w_L, b_1, b_2, \ldots, b_L, x_1, x_2, \ldots, x_N) = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & g(w_2 \cdot x_1 + b_2) & \cdots & g(w_L \cdot x_1 + b_L) \\ g(w_1 \cdot x_2 + b_1) & g(w_2 \cdot x_2 + b_2) & \cdots & g(w_L \cdot x_2 + b_L) \\ \vdots & \vdots & & \vdots \\ g(w_1 \cdot x_N + b_1) & g(w_2 \cdot x_N + b_2) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix} \qquad (2)$$
$$\beta = [\beta_1^T, \beta_2^T, \ldots, \beta_L^T]^T, \qquad T = [t_1^T, t_2^T, \ldots, t_N^T]^T,$$

where x^T denotes the transpose of the matrix x.
The matrix H is called the hidden layer output matrix. In the formula Hβ = T, only β is unknown, and it can be obtained as $\beta = H^{\dagger} T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of H.
Based on the basic extreme learning machine, several scholars have further proposed an ELM based on random hidden-layer feature mappings, in which case $\beta = \left(\frac{I}{\lambda} + H^T H\right)^{-1} H^T T$, where I is the identity matrix and λ is a user-specified parameter.
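To make both solutions concrete, here is a hedged NumPy sketch (our illustration, not the patent's code; H and T as defined above, λ user-supplied):

```python
import numpy as np

def elm_solve(H, T, lam=None):
    """Solve H beta = T for the output weights beta.
    lam=None: basic ELM, beta = pinv(H) @ T (Moore-Penrose generalized inverse).
    lam set:  regularized ELM, beta = (I/lam + H^T H)^{-1} H^T T."""
    if lam is None:
        return np.linalg.pinv(H) @ T
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ T)
```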
in addition, a plurality of variations of ELMs such as kernel-based ELM (kernel based ELM), fully Complex ELM (full Complex ELM), online continuous ELM (online Sequential ELM), incremental ELM (incremental ELM) and integrated ELM (Ensemble of ELM) are widely used in different application fields, and good practical application effects are achieved.
In this embodiment, the emotional tendency of current microblog users toward the Apple tablet computer is analyzed from microblog data related to it. Such emotional tendency analysis helps relevant product manufacturers, suppliers, and distributors make correct judgments about the product's future development trend, and also helps users who have purchased or intend to purchase the device deepen their understanding of it and make appropriate choices.
FIG. 4 shows a master controller (i.e., master node n_0) and three slave controllers (i.e., slave nodes n_1, n_2, and n_3) together forming a distributed system. According to the above-described procedure and the basic principle of the ELM, the following process is carried out in the distributed system shown in FIG. 4.
In the embodiment, a distributed classification method of mass microblog data is adopted to perform emotion analysis on microblog data related to a tablet computer, and the flow is shown in fig. 5. The flow begins with step 501.
At step 502, microblog training data are prepared. As described above, the preparation of microblog training data can follow two approaches; the first is adopted in this embodiment. The master controller captures original microblog data related to the Apple tablet computer; each record contains a number of fields, such as publication time, publisher, type, access authority, text content, picture URL, and video URL. In this embodiment, only the text content field of the original data is used for emotional tendency analysis. Meanwhile, an emotional tendency dimension is added during manual labeling, i.e., the classification result part of the microblog data, representing the emotional tendency of the microblog content. Listed below are 7 pieces of microblog data with manual emotion labels; the master controller distributes these 7 pieces of training data to the three slave controllers, sentences 1-2 to slave controller n_1, sentences 3-5 to slave controller n_2, and sentences 6-7 to slave controller n_3.
The microblog training data set of slave controller n_1 is as follows:
statement 1: the apple tablet computer has good quality, fast reaction speed and good hand feeling. (the emotional tendency of statement 1 is: praise)
Statement 2: the apple tablet personal computer is used for a while, has too few functions, is not as good as the legend, and is too common. (the emotional tendency of statement 2 is: against)
The microblog training data set of slave controller n_2 is as follows:
statement 3: the apple tablet personal computer has the advantages of high speed, stable networking and perfect game surfing, and praises one! (the emotional tendency of statement 3 is: praise)
Statement 4: the single product route and high price of apple tablet computers is not known how long it can last in the competition of other opponents such as samsung. (the emotional tendency of statement 4 is: neutral)
Statement 5: the apple tablet computer operating system is not used to, the screen is not comfortable to watch wide-screen movies in proportion, files are difficult to export, and software is expensive to download. (the emotional tendency of statement 5 is: against)
The microblog training data set of slave controller n_3 is as follows:
statement 6: apple tablet computers are very fast, high in resolution and quite rich in application programs. (the emotional tendency of statement 6 is: praise)
Statement 7: the apple tablet computer body is too heavy to pick up, downloading requires access to itunes, is cumbersome! (the emotional tendency of statement 7 is: against)
In step 503: the master controller initializes the required parameters and sends the parameters to all the slave controllers;
The preset parameters are all generated randomly in advance by the master controller and comprise: the input-node weight vectors w_1, w_2, w_3 and the hidden-node thresholds b_1, b_2, b_3. The master controller issues these parameters to the slave nodes n_1, n_2, and n_3, and the number of hidden nodes is set to L = 3.
w_1 = (-0.9286, 0.3575, -0.2155, 0.4121, -0.9077, 0.3897)
w_2 = (0.6983, 0.5155, 0.3110, -0.9363, -0.8057, -0.3658)
w_3 = (0.8680, 0.4863, -0.6576, -0.4462, 0.6469, 0.9004)
b_1 = 0.0344
b_2 = 0.4387
b_3 = 0.3816
In step 504, each slave controller processes its respective local microblog data set and sends the processing results to the master controller, which generates the microblog data classifier; the specific process is shown in FIG. 6 and starts at step 601.
In step 602, each piece of microblog training data with a classification result part is vectorized, yielding a feature vector x_i for the data part of each microblog datum and a classification result part t_i.
Vectorizing the data part means performing feature extraction on it. Feature extraction is the basis of emotional tendency analysis, and its quality directly influences the result of emotional tendency prediction. Feature extraction transforms the original features into the most representative new features through a mapping (or transformation) method. This work mainly studies the influence of positive emotion words, negative emotion words, degree adverbs, and negative words in text data, taken as features, on the analysis of the text's emotional tendency. The details are as follows:
Emotion words: emotion words are nouns, verbs, adjectives, and idioms with emotional tendencies. The emotional tendency of a text is conveyed mainly through emotion words, so they are one of the important features for analyzing and predicting the emotional tendency of a text. According to the requirements of emotion analysis, this embodiment divides the emotion words in text data into two types: praise words and derogatory words. Praise words are words whose sense carries positive emotion, such as 'like', 'accept', 'enjoy', 'praise', 'honor', and 'nice'. Derogatory words are words whose sense carries emotions of dislike, negation, hate, and contempt, such as 'dislike', 'oppose', 'ignore', 'depressed', and 'cheat'. In this embodiment, the praise emotion words are divided into three levels [+3, +2, +1] of decreasing positive degree, and the derogatory emotion words are likewise divided into three levels [-1, -2, -3] of increasing derogatory degree.
Four features relate to the emotion words: the praise word frequency, the praise word average level, the derogatory word frequency, and the derogatory word average level. For each polarity, word frequency = (number of emotion words of that polarity) / (total number of words in the sentence), and average level = (sum of the levels of those emotion words) / (number of those emotion words).
Degree adverbs: a degree adverb is a kind of adverb that expresses degree, such as 'very', 'extremely', 'quite', 'too', 'somewhat', 'slightly', 'almost', and 'especially'. This embodiment extracts the word frequency of degree adverbs as a feature.
Negative adverbs: a negative adverb is a kind of adverb that expresses negation, such as 'not', 'no', 'never', 'hardly', and 'not necessarily'. This embodiment extracts the word frequency of negative adverbs as a feature.
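A sketch of this feature extraction (covering all six features summarized below) is given here; the mini-lexicons and the token list are placeholders, and real word lists plus a Chinese word segmenter would be substituted:

```python
# Hypothetical mini-lexicons; the level values follow the scheme above.
PRAISE = {"good": 1, "fast": 2, "nice": 2}   # praise word -> level in {+1, +2, +3}
DEROG  = {"bad": -1, "awful": -3}            # derogatory word -> level in {-1, -2, -3}
DEGREE = {"very", "too", "quite"}            # degree adverbs
NEGATE = {"not", "never"}                    # negative adverbs

def extract_features(tokens):
    """Return the six features: praise frequency, praise average level,
    derogatory frequency, derogatory average level, degree-adverb frequency,
    negative-adverb frequency."""
    n = len(tokens)
    praise = [PRAISE[t] for t in tokens if t in PRAISE]
    derog = [DEROG[t] for t in tokens if t in DEROG]
    return (
        len(praise) / n,
        sum(praise) / len(praise) if praise else 0.0,
        len(derog) / n,
        sum(derog) / len(derog) if derog else 0.0,
        sum(t in DEGREE for t in tokens) / n,
        sum(t in NEGATE for t in tokens) / n,
    )
```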
In summary, the text features extracted in this embodiment are six: praise word frequency, praise word average level, derogatory word frequency, derogatory word average level, degree adverb frequency, and negative adverb frequency. Meanwhile, in the classification result part of the microblog data, the emotional tendency of a text is divided into three classes: praise (label 1), neutral (label 2), and against (label 3). The feature vector part and classification result part of each microblog datum therefore take the following specific form:
Applying this feature extraction method, the 7 pieces of microblog data are vectorized as follows:
Statement 1: the apple tablet computer has good quality, fast reaction speed and good hand feeling. (the emotional tendency of statement 1 is: praise)
Statement 1 analysis: sentence 1 can be divided into 8 words, among which there are 3 praise words ('good', 'fast', 'good'), so the praise word frequency of sentence 1 is 3/8; the levels of these praise words are +1, +2, and +2 respectively, so the average praise word level of sentence 1 is (1+2+2)/3. Sentence 1 contains no derogatory words, so the derogatory word frequency and average level are both 0. It contains one degree adverb, giving a frequency of 1/8, and no negative adverbs, giving a frequency of 0. The emotional tendency is praise, so the classification result is 1. After feature extraction, sentence 1 is therefore converted into (0.375, 1.667, 0, 0, 0.125, 0, 1).
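The arithmetic for statement 1 can be checked directly:

```python
praise_freq = 3 / 8            # 3 praise words among 8 tokens -> 0.375
praise_avg = (1 + 2 + 2) / 3   # levels +1, +2, +2 -> 1.667 (rounded)
degree_freq = 1 / 8            # one degree adverb -> 0.125
# Derogatory and negative-adverb features are 0, and the label is 1 ("praise"),
# giving the vector (0.375, 1.667, 0, 0, 0.125, 0, 1).
```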
The feature vector portions of other statements can be obtained using the same method.
Statement 2: the apple tablet personal computer is used for a while, has too few functions, is not as good as the legend, and is too common. (the emotional tendency of statement 2 is: against)
Statement 2 analysis: (0.083,2,0.167, -1.5,0.25,0.083,3).
Statement 3: the apple tablet personal computer has the advantages of high speed, stable networking and perfect game surfing, and praises one! (the emotional tendency of statement 3 is: praise)
Statement 3 analysis: (0.333,2.5,0,0,0.25,0,1).
Statement 4: the single product route and high price of apple tablet computers is not known how long it can last in the competition of other opponents such as samsung. (the emotional tendency of statement 4 is: neutral)
Statement 4 analysis: (0.077,2,0.077, -1,0,0,2).
Statement 5: the apple tablet computer operating system is not used to, the screen is not comfortable to watch wide-screen movies in proportion, files are difficult to export, and software is expensive to download. (the emotional tendency of statement 5 is: against)
Statement 5 analysis: (0,0,0.188, -2.333,0.125,0.063,3).
Statement 6: apple tablet computers are very fast, high in resolution and quite rich in application programs. (the emotional tendency of statement 6 is: praise)
Statement 6 analysis: (0.273,2.333,0,0,0.273,0,1).
Statement 7: the apple tablet computer body is too heavy to pick up, downloading requires access to itunes, is cumbersome! (the emotional tendency of statement 7 is: against)
Statement 7 analysis: (0,0,0.154, -2.5,0.154,0.077,3).
In step 603, each slave controller strips its local vectorized microblog training data, separating the feature vector part and the classification result part; that is, each slave controller generates its local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i the classification result matrix of the microblog data set. In the distributed environment shown in FIG. 4, the training data of slave controller n_1 are:
Statement 1: (0.375, 1.667, 0, 0, 0.125, 0, 1)
Statement 2: (0.083, 2, 0.167, -1.5, 0.25, 0.083, 3)
The feature matrix X_1 and classification result matrix T_1 stripped from the microblog training data of slave controller n_1 are as follows:

$$X_1 = \begin{bmatrix} 0.375 & 1.667 & 0 & 0 & 0.125 & 0 \\ 0.083 & 2 & 0.167 & -1.5 & 0.25 & 0.083 \end{bmatrix}, \qquad T_1 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
The training data of slave controller n_2 are:
Statement 3: (0.333, 2.5, 0, 0, 0.25, 0, 1)
Statement 4: (0.077, 2, 0.077, -1, 0, 0, 2)
Statement 5: (0, 0, 0.188, -2.333, 0.125, 0.063, 3)
The feature matrix X_2 and classification result matrix T_2 stripped from the microblog training data of slave controller n_2 are as follows:

$$X_2 = \begin{bmatrix} 0.333 & 2.5 & 0 & 0 & 0.25 & 0 \\ 0.077 & 2 & 0.077 & -1 & 0 & 0 \\ 0 & 0 & 0.188 & -2.333 & 0.125 & 0.063 \end{bmatrix}, \qquad T_2 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$
The training data of slave controller n_3 are:
Statement 6: (0.273, 2.333, 0, 0, 0.273, 0, 1)
Statement 7: (0, 0, 0.154, -2.5, 0.154, 0.077, 3)
The feature matrix X_3 and classification result matrix T_3 stripped from the microblog training data of slave controller n_3 are as follows:

$$X_3 = \begin{bmatrix} 0.273 & 2.333 & 0 & 0 & 0.273 & 0 \\ 0 & 0 & 0.154 & -2.5 & 0.154 & 0.077 \end{bmatrix}, \qquad T_3 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
At step 604, each slave controller n_i calculates the intermediate results required by the ELM according to the received parameters w_1, ..., w_L and b_1, ..., b_L and its local microblog data set (X_i, T_i), and submits them to the master controller; in (X_i, T_i), X_i is the feature matrix and T_i the classification result matrix of the microblog data set, as shown in FIG. 7.
Here, in the ELM, the feature matrix X_i of the input data needs to be normalized so that all elements of X_i lie in [-1, +1]; different choices of normalization method lead to different input data. In addition, for the excitation function g(w_i · x_i + b_i), the ELM offers a variety of excitation functions for the user to choose from, and different choices also lead to different intermediate results and hence different final classification results. In the embodiment of the present invention, the vectors of these statements are first normalized and an excitation function is then selected, yielding the intermediate results required by the ELM; one possible normalization choice is sketched below, after which the processing of the three slave controllers is described in turn.
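A sketch of one possible normalization choice (min-max scaling to [-1, +1]; the patent leaves the method open, and the embodiment's printed numbers depend on its particular, unstated choice):

```python
import numpy as np

def minmax_scale(X):
    """Scale each feature column of X into [-1, +1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return 2.0 * (X - lo) / span - 1.0
```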
for the slave node n1In a word:
in step 604-1, the slave controller n1The processed data are statement 1(0.375,1.667,0,0,0.125,0,1) and statement 2(0.083,2,0.167, -1.5,0.25,0.083,3), and the received parameter is w1,w2,w3,b1,b2,b3Normalization and selection of the excitation function
Hidden layer output matrix <math> <mrow> <msub> <mi>H</mi> <mn>1</mn> </msub> <mo>=</mo> <mfenced open='[' close=']'> <mtable> <mtr> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>1</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>1</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>1</mn> </msub> <mo>)</mo> </mrow> </mtd> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>2</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>1</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> </mtd> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>3</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>1</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>3</mn> </msub> <mo>)</mo> </mrow> </mtd> </mtr> <mtr> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>1</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>2</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>1</mn> </msub> <mo>)</mo> </mrow> </mtd> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>2</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>2</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> </mtd> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>3</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>2</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>3</mn> </msub> <mo>)</mo> </mrow> </mtd> </mtr> </mtable> </mfenced> <mo>=</mo> <mfenced open='[' close=']'> <mtable> <mtr> <mtd> <mn>0.5287</mn> </mtd> <mtd> <mn>0.7409</mn> </mtd> <mtd> <mn>0.7524</mn> </mtd> </mtr> <mtr> <mtd> <mn>0.5442</mn> </mtd> <mtd> <mn>0.7244</mn> </mtd> <mtd> <mn>0.7404</mn> </mtd> </mtr> </mtable> </mfenced> <mo>,</mo> </mrow> </math>
Classification result matrix T 1 = 1 3
In step 604-2, the intermediate result U_1 is calculated from H_1:

$$U_1 = H_1^T H_1 = \begin{bmatrix} 0.5867 & 0.7932 & 0.8081 \\ 0.7932 & 1.0737 & 1.0938 \\ 0.8081 & 1.0938 & 1.1143 \end{bmatrix};$$
In step 604-3, the intermediate result V_1 is calculated from H_1 and T_1:

$$V_1 = H_1^T T_1 = \begin{bmatrix} 2.1913 \\ 2.9141 \\ 2.9736 \end{bmatrix},$$

and the intermediate results U_1 and V_1 are submitted to the master controller.
For slave controller n_2:
In step 604-4, the data processed by slave controller n_2 are statement 3 (0.333, 2.5, 0, 0, 0.25, 0, 1), statement 4 (0.077, 2, 0.077, -1, 0, 0, 2), and statement 5 (0, 0, 0.188, -2.333, 0.125, 0.063, 3), and the received parameters are w_1, w_2, w_3, b_1, b_2, b_3. After normalization and selection of the excitation function, the hidden layer output matrix is

$$H_2 = \begin{bmatrix} g(w_1 \cdot x_3 + b_1) & g(w_2 \cdot x_3 + b_2) & g(w_3 \cdot x_3 + b_3) \\ g(w_1 \cdot x_4 + b_1) & g(w_2 \cdot x_4 + b_2) & g(w_3 \cdot x_4 + b_3) \\ g(w_1 \cdot x_5 + b_1) & g(w_2 \cdot x_5 + b_2) & g(w_3 \cdot x_5 + b_3) \end{bmatrix} = \begin{bmatrix} 0.5441 & 0.7194 & 0.7388 \\ 0.5467 & 0.7244 & 0.7163 \\ 0.7398 & 0.7388 & 0.8114 \end{bmatrix},$$

and the classification result matrix is

$$T_2 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$
In step 604-5, the intermediate result U_2 is calculated from H_2:

$$U_2 = H_2^T H_2 = \begin{bmatrix} 1.1422 & 1.3340 & 1.3961 \\ 1.3340 & 1.5881 & 1.6521 \\ 1.3961 & 1.6521 & 1.7222 \end{bmatrix};$$
In step 604-6, the intermediate result V_2 is calculated from H_2 and T_2:

$$V_2 = H_2^T T_2 = \begin{bmatrix} 3.8569 \\ 4.3846 \\ 4.6146 \end{bmatrix},$$

and the intermediate results U_2 and V_2 are submitted to the master controller.
For slave controller n_3:
In step 604-7, the data processed by slave controller n_3 are statement 6 (0.273, 2.333, 0, 0, 0.273, 0, 1) and statement 7 (0, 0, 0.154, -2.5, 0.154, 0.077, 3), and the received parameters are w_1, w_2, w_3, b_1, b_2, b_3. After normalization and selection of the excitation function, the hidden layer output matrix is

$$H_3 = \begin{bmatrix} g(w_1 \cdot x_6 + b_1) & g(w_2 \cdot x_6 + b_2) & g(w_3 \cdot x_6 + b_3) \\ g(w_1 \cdot x_7 + b_1) & g(w_2 \cdot x_7 + b_2) & g(w_3 \cdot x_7 + b_3) \end{bmatrix} = \begin{bmatrix} 0.3993 & 0.7005 & 0.8426 \\ 0.2272 & 0.6769 & 0.8216 \end{bmatrix},$$

and the classification result matrix is

$$T_3 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
In step 604-8, the intermediate result U_3 is calculated from H_3:

$$U_3 = H_3^T H_3 = \begin{bmatrix} 0.2111 & 0.4335 & 0.5458 \\ 0.4335 & 0.9489 & 1.2141 \\ 0.5458 & 1.2141 & 1.5593 \end{bmatrix};$$
In step 604-9, the intermediate result V_3 is calculated from H_3 and T_3:

$$V_3 = H_3^T T_3 = \begin{bmatrix} 1.0809 \\ 2.7312 \\ 3.6074 \end{bmatrix},$$

and the intermediate results U_3 and V_3 are submitted to the master controller.
In step 605, the master controller n_0 receives U_1 and V_1 submitted by slave controller n_1, U_2 and V_2 submitted by slave controller n_2, and U_3 and V_3 submitted by slave controller n_3, and calculates the final result, as shown in FIG. 8.
Step 605-1: merge the intermediate results U_1, U_2, U_3 submitted by the slave controllers to obtain the aggregate result

$$U = U_1 + U_2 + U_3 = \begin{bmatrix} 1.9400 & 2.5607 & 2.7500 \\ 2.5607 & 3.6107 & 3.9600 \\ 2.7500 & 3.9600 & 4.3958 \end{bmatrix};$$
Step 605-2: merge the intermediate results V_1, V_2, V_3 submitted by the slave controllers to obtain the aggregate result

$$V = V_1 + V_2 + V_3 = \begin{bmatrix} 7.1390 \\ 11.0317 \\ 11.1956 \end{bmatrix};$$
Step 605-3: calculate the output-node weight vector parameter β from the aggregated U and V:

$$\beta = \left(\frac{I}{\lambda} + U\right)^{-1} V = \begin{pmatrix} -16.8925 & 9.9534 & 6.6591 \\ 42.3653 & -19.4846 & -23.3897 \\ -28.1804 & 10.8984 & 16.6435 \end{pmatrix}$$
thus, the weight vector parameter β can be obtained.
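The final solve is mechanical, although the patent states neither λ nor the label encoding of T (the printed β is 3×3 while the printed V has a single column), so the assumed λ below will not reproduce the printed β exactly; the sketch only illustrates the computation:

```python
import numpy as np

U = np.array([[1.9400, 2.5607, 2.7500],
              [2.5607, 3.6107, 3.9600],
              [2.7500, 3.9600, 4.3958]])
V = np.array([7.1390, 11.0317, 11.1956])
lam = 100.0  # assumed value; not given in the patent
beta = np.linalg.solve(np.eye(3) / lam + U, V)
```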
In step 605-4, a classifier for predicting the emotional tendency of microblog data is constructed from the parameter β obtained by the parameter generator and is used to analyze the emotional tendency of the microblog data under test, with the formula:
f(x)=h(x)β
in step 505: and automatically classifying microblog data.
The main controller continuously captures microblog data, and a generated microblog data classifier is used to directly output a classification result of the microblog data to be classified, wherein the following two sentences are microblog data to be classified continuously captured by the main controller and a result obtained by applying the same feature extraction method.
Statement 8: apple tablet is sent to friends, friends like well! Speed and shape are good! Like!
Statement 8 analysis: (0.286, 2.25, 0, 0, 0.214, 0, classification result unknown).
Statement 9: the apple tablet personal computer has low screen quality, is very troublesome to use and has poor endurance time.
Statement 9 analysis: (0, 0, 0.25, -2.333, 0.25, 0, classification result unknown).
After applying the same normalization method and selecting the same excitation function, the classification result of statement 8 is obtained as follows:
Hidden layer output vector h(x_8) = [g(w_1·x_8+b_1), g(w_2·x_8+b_2), g(w_3·x_8+b_3)] = [0.5467, 0.7244, 0.7388]
Substituting into the classifier formula gives
f(x_8) = h(x_8)β = [0.6332, -0.6207, -1.0061]
For this result, the ELM uses a maximization rule to determine the classification of the microblog data to be predicted: the dimension holding the largest element of the output vector is found, and the class label corresponding to that dimension is the classification result. The largest element in the classifier output for statement 8 is 0.6332, in dimension 1, so the classification result of statement 8 is the class represented by label 1, i.e., "praise".
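In code, this maximization rule is a one-line argmax over the classifier output:

```python
import numpy as np

f_x8 = np.array([0.6332, -0.6207, -1.0061])  # classifier output for statement 8
label = int(np.argmax(f_x8)) + 1             # dimensions correspond to labels 1..3
# label == 1 -> "praise"
```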
The prediction process for statement 9 is the same as for statement 8. Briefly, the classification result of statement 9 is obtained as follows:
Hidden layer output vector h(x_9) = [g(w_1·x_9+b_1), g(w_2·x_9+b_2), g(w_3·x_9+b_3)] = [0.2222, 0.6704, 0.9174]
Substituting into the classifier formula gives
f(x_9) = h(x_9)β = [-1.2055, -0.8521, 1.0684]
The largest element in the classifier output for statement 9 is 1.0684, in dimension 3, so the classification result of statement 9 is the class represented by label 3, i.e., "against".
With statements 8 and 9 as test data, the generated microblog data classifier correctly recovers their emotional tendencies, showing that the microblog data to be classified can be classified accurately.
Besides analyzing the emotional tendency of microblog data, the invention can also be applied to movie box office analysis, song click-rate analysis, financial product recommendation, stock analysis, instrument efficiency, hot news event analysis, social public opinion analysis, and other applications.
Although specific embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are merely illustrative and that various changes or modifications may be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is only limited by the appended claims.

Claims (1)

1. A distributed classification method of massive microblog data is realized by adopting a distributed classification device of massive microblog data, the device adopts a distributed structure and comprises a master controller and at least one slave controller, each slave controller is interconnected with the master controller, the master controller and each slave controller are communicated with each other, and all the slave controllers are independent from each other;
the slave control machine comprises:
A vector machine: for converting each piece of microblog training data with a classification result on the slave controller into a vector representation, comprising a feature vector x_i for the data part of each microblog datum and a classification result part t_i;
A stripper: for stripping out the feature vector matrix X_i and the classification result matrix T_i of all microblog data in the training set processed by the vector machine;
A converter: for converting the stripper's feature vector matrix X_i into the ELM hidden layer output matrix H_i using the principle of the extreme learning machine ELM;
An antecedent calculator: for calculating the intermediate result H_i^T H_i from the hidden layer output matrix H_i using the ELM principle, and submitting it to the master controller;
A consequent calculator: for calculating the intermediate result H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the microblog data set using the ELM principle, and submitting it to the master controller;
the main control machine comprises:
An antecedent accumulator: for merging the intermediate results H_i^T H_i submitted by the slave controllers to obtain the aggregate result H^T H;
A consequent accumulator: for merging the intermediate results H_i^T T_i submitted by the slave controllers to obtain the aggregate result H^T T;
A parameter generator: for calculating the output-node weight vector parameter β from the aggregate results output by the antecedent accumulator and the consequent accumulator, using the principle of the extreme learning machine ELM;
A classification generator: for constructing the microblog data classifier from the parameter β obtained by the parameter generator and classifying the microblog data under test;
each slave controller sends an intermediate result which is processed by the slave controller and used for generating a final microblog data classifier to the master controller, and the master controller receives the intermediate results sent by all the slave controllers and then obtains the final microblog data classifier according to the ELM principle;
the method is characterized in that: the method comprises the following steps:
step 1: preparing a microblog training data set;
The preparation of the microblog training data set comprises two parts: capturing original microblog data and manually labeling it; either of two approaches is adopted: in the first, the master controller captures the original microblog data to be processed, each piece of training data is labeled manually to represent its classification result, and the data are then distributed to the corresponding slave controllers; in the second, the master controller communicates with each slave controller to tell it which microblog data to capture, and each slave controller captures its own original microblog data and labels it manually to represent the classification results;
step 2: the master controller initializes the required parameters and sends the parameters to all the slave controllers;
the method utilizes the principle of an extreme learning machine ELM, and a main controller generates parameters randomly in advance, and comprises the following steps: number L of hidden nodes and weight vector w of input node1,w2,...,wLOffset b of hidden node1,b2,...,bLAnd sends these parameters to all slave controllers;
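By way of illustration only, a minimal numpy sketch of this initialization follows; the feature dimensionality d, the uniform sampling range, and every name in it are assumptions, since the claim does not prescribe an implementation (all sketches in this document use Python):

```python
import numpy as np

d = 1000   # dimensionality of the microblog feature vectors (assumed)
L = 200    # number of hidden nodes, chosen by the user

rng = np.random.default_rng(42)
W = rng.uniform(-1.0, 1.0, size=(d, L))  # columns are the input-node weight vectors w_1..w_L
b = rng.uniform(-1.0, 1.0, size=L)       # hidden-node biases b_1..b_L

# In a real deployment these would be serialized and sent to every slave controller.
params = {"W": W, "b": b, "L": L}
```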
Step 3: each slave controller processes its local microblog data set and sends the processing result to the master controller, which generates the microblog data classifier;
Step 3-1: vectorizing the microblog data;
each piece of microblog training data with a classification result is vectorized into a feature vector x_i for its data portion and a classification result portion t_i;
Step 3-2: stripping the microblog data;
for each feature-extracted microblog data set in the slave controller's training set, the feature vector portion and the classification result portion of the data are stripped apart to form the feature vector matrix X_i and the classification result matrix T_i of that slave controller's microblog data training set; that is, each slave controller generates its local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i the classification result matrix of the set;
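A toy sketch of Steps 3-1 and 3-2 under loud assumptions: a three-word bag-of-words vocabulary and one-hot class labels stand in for whatever feature extraction a real deployment would use, and the two-microblog corpus is invented:

```python
import numpy as np

vocab = {"rain": 0, "flu": 1, "match": 2}          # toy vocabulary (assumed)
labels = {"weather": 0, "health": 1, "sports": 2}  # toy class set (assumed)

def vectorize(text, label):
    """Step 3-1: one microblog -> feature vector x_i and result vector t_i."""
    x = np.zeros(len(vocab))
    for word in text.split():
        if word in vocab:
            x[vocab[word]] += 1.0          # bag-of-words count
    t = np.zeros(len(labels))
    t[labels[label]] = 1.0                 # one-hot classification result
    return x, t

# Step 3-2: strip the (x, t) pairs apart into the local set (X_i, T_i).
corpus = [("heavy rain again today", "weather"), ("flu season is here", "health")]
pairs = [vectorize(text, lab) for text, lab in corpus]
X_i = np.vstack([x for x, _ in pairs])     # feature matrix of the local set
T_i = np.vstack([t for _, t in pairs])     # classification result matrix
```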
Step 3-3: each slave controller generates intermediate results from its local microblog data set and sends them to the master controller;
the specific steps are as follows:
each slave controller n_i calculates the intermediate results required to construct the classifier from the received input-node weight vectors w_1, w_2, ..., w_L, the hidden-node biases b_1, b_2, ..., b_L, and its local microblog training data set (X_i, T_i), and submits them to the master controller;
Step 3-3-1: converting the feature matrix X_i of the local microblog data set into the hidden layer output matrix H_i of the ELM;
Step 3-3-2: calculating the intermediate result U_i = H_i^T H_i from the hidden layer output matrix H_i;
Step 3-3-3: calculating the intermediate result V_i = H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the local training data set;
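Continuing that sketch, the slave-side computations of Steps 3-3-1 through 3-3-3 might look as follows. The sigmoid activation, the hidden-node count, and the redrawing of (W, b) to match the toy feature width are all assumptions; the claim fixes none of them:

```python
import numpy as np

# The master's broadcast parameters, redrawn here so the shapes match the
# toy X_i from the previous sketch (in a real run the master would have
# drawn W against the true feature dimensionality from the start).
L = 200
rng = np.random.default_rng(42)
W = rng.uniform(-1.0, 1.0, size=(X_i.shape[1], L))
b = rng.uniform(-1.0, 1.0, size=L)

def hidden_output(X, W, b):
    # H[j, k] = g(x_j . w_k + b_k): one row per microblog, one column per hidden node
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid activation (assumed)

H_i = hidden_output(X_i, W, b)   # Step 3-3-1: N_i x L hidden layer output matrix
U_i = H_i.T @ H_i                # Step 3-3-2: L x L intermediate result H_i^T H_i
V_i = H_i.T @ T_i                # Step 3-3-3: L x m intermediate result H_i^T T_i
# (U_i, V_i) is all this slave ever ships to the master controller.
```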
Step 3-4: the master controller receives and aggregates the intermediate results of the slave controllers, calculates the weight vector parameter β of the output nodes from the aggregated results according to the ELM calculation principle, and thereby obtains the microblog data classifier; β is computed as follows:
Step 3-4-1: merging the intermediate results U_i submitted by the slave controllers to obtain the aggregate result U = Σ U_i = Σ H_i^T H_i = H^T H;
Step 3-4-2: merging the intermediate results V_i submitted by the slave controllers to obtain the aggregate result V = Σ V_i = Σ H_i^T T_i = H^T T;
Step 3-4-3: calculating the output-node weight vector parameter β from the aggregated U and V:
β = (I/λ + H^T H)^{-1} H^T T = (I/λ + U)^{-1} V
where I is the identity matrix, λ is a user-specified parameter, and (·)^{-1} denotes matrix inversion;
the formula of the microblog data classifier is then determined as:
f(x) = h(x)β
where f(x) denotes the classification result of the microblog data to be classified and h(x) denotes its hidden layer output vector;
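The master-side aggregation of Step 3-4 then reduces to summing the submissions and solving one small linear system. This is a sketch under the same assumptions as above; submissions stands in for the (U_i, V_i) pairs received over the network, and a linear solve replaces the explicit inverse for numerical stability:

```python
import numpy as np

# Stand-in for the pairs received from the slave controllers; with the
# single-slave sketch above it is just [(U_i, V_i)].
submissions = [(U_i, V_i)]

U = sum(U for U, _ in submissions)   # U = sum_i H_i^T H_i = H^T H
V = sum(V for _, V in submissions)   # V = sum_i H_i^T T_i = H^T T

lam = 100.0                          # the user-specified parameter lambda
I = np.eye(U.shape[0])
beta = np.linalg.solve(I / lam + U, V)   # beta = (I/lambda + U)^{-1} V
```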
Step 4: automatically classifying microblog data;
the automatic classification of microblog data can proceed in two ways: in the first, the master controller continuously captures microblog data and directly outputs the classification results of the data to be classified by applying the microblog data classifier generated in Step 3; in the second, the master controller sends the classifier generated in Step 3 to each slave controller, and each slave controller applies it to classify its microblog data and obtain the classification results.
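Finally, a sketch of the first mode of Step 4, reusing vectorize, hidden_output, W, b, and beta from the sketches above; the captured text and its discarded placeholder label are invented:

```python
import numpy as np

# Classify one newly captured microblog with f(x) = h(x) beta.
x_new, _ = vectorize("rain and more rain today", "weather")  # label is a placeholder, ignored here
h_new = hidden_output(x_new[None, :], W, b)                  # 1 x L hidden layer output h(x)
scores = h_new @ beta                                        # f(x) = h(x) beta, one score per class
predicted = int(np.argmax(scores))                           # index of the predicted class
print(predicted)
```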
CN201210583886.8A 2012-12-28 2012-12-28 A kind of distributed sorter of massive micro-blog data and method Active CN103020712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210583886.8A CN103020712B (en) 2012-12-28 2012-12-28 A kind of distributed sorter of massive micro-blog data and method

Publications (2)

Publication Number Publication Date
CN103020712A CN103020712A (en) 2013-04-03
CN103020712B true CN103020712B (en) 2015-10-28

Family

ID=47969298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210583886.8A Active CN103020712B (en) 2012-12-28 2012-12-28 A kind of distributed sorter of massive micro-blog data and method

Country Status (1)

Country Link
CN (1) CN103020712B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593462B (en) * 2013-11-25 2017-02-15 中国科学院深圳先进技术研究院 Microblog-data-oriented flu epidemic surveillance analysis method and system
CN107045511B (en) * 2016-02-05 2021-03-02 阿里巴巴集团控股有限公司 Target feature data mining method and device
CN105760899B (en) * 2016-03-31 2019-04-05 大连楼兰科技股份有限公司 Training learning method and device based on distributed computing and detection cost sequence
CN107590134A (en) * 2017-10-26 2018-01-16 福建亿榕信息技术有限公司 Text sentiment classification method, storage medium and computer
CN109034366B (en) * 2018-07-18 2021-10-01 北京化工大学 Application of ELM integrated model based on multiple activation functions in chemical engineering modeling
CN109657061B (en) * 2018-12-21 2020-11-27 合肥工业大学 Integrated classification method for massive multi-word short texts
CN110381456B (en) * 2019-07-19 2020-10-02 珠海格力电器股份有限公司 Flow management system, flow threshold calculation method and air conditioning system
CN113177163B (en) * 2021-04-28 2022-08-02 烟台中科网络技术研究所 Method, system and storage medium for social dynamic information sentiment analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1185796A (en) * 1997-09-01 1999-03-30 Canon Inc Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method and storage medium
US20120189194A1 (en) * 2011-01-26 2012-07-26 Microsoft Corporation Mitigating use of machine solvable hips
CN102789498A (en) * 2012-07-16 2012-11-21 钱钢 Method and system for carrying out sentiment classification on Chinese comment text on basis of ensemble learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Huang Guangbin et al., "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 42, no. 2, Apr. 2012, pp. 513-529. *
Zhao Xiangguo et al., "ELM-based protein secondary structure prediction and its post-processing," Journal of Northeastern University (Natural Science), vol. 30, no. 10, Oct. 2009, pp. 1402-1405. *
Wang Lei et al., "Parallel extreme learning machine algorithm based on a binary cascade structure," Journal of Jilin University (Information Science Edition), vol. 30, no. 4, Jul. 2012, pp. 418-425. *

Similar Documents

Publication Publication Date Title
CN103020712B (en) A kind of distributed sorter of massive micro-blog data and method
US11436414B2 (en) Device and text representation method applied to sentence embedding
Kumar et al. Sentiment analysis of multimodal twitter data
Kontopoulos et al. Ontology-based sentiment analysis of twitter posts
CN110377740B (en) Emotion polarity analysis method and device, electronic equipment and storage medium
CN106105096A (en) System and method for continuous social communication
Mac Kim et al. Demographic inference on twitter using recursive neural networks
CN109241424A (en) A kind of recommended method
JP2020523699A (en) Generate point of interest copy
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN109033433B (en) Comment data emotion classification method and system based on convolutional neural network
CN103365867A (en) Method and device for emotion analysis of user evaluation
CN110046353B (en) Aspect level emotion analysis method based on multi-language level mechanism
CN111177559B (en) Text travel service recommendation method and device, electronic equipment and storage medium
Claster et al. Naïve Bayes and unsupervised artificial neural nets for Cancun tourism social media data analysis
CN109993583A (en) Information-pushing method and device, storage medium and electronic device
Kim et al. Text mining and sentiment analysis for predicting box office success
CN110706028A (en) Commodity evaluation emotion analysis system based on attribute characteristics
CN105447193A (en) Music recommending system based on machine learning and collaborative filtering
CN105760499A (en) Method for analyzing and predicting network public sentiment based on LDA topic model
CN112256866A (en) Text fine-grained emotion analysis method based on deep learning
CN102789449A (en) Method and device for evaluating comment text
CN103729431B (en) Massive microblog data distributed classification device and method with increment and decrement function
CN114443899A (en) Video classification method, device, equipment and medium
CN115294427A (en) Stylized image description generation method based on transfer learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220324

Address after: No. 5 Zhongguancun South Street, Haidian District, Beijing 100081

Patentee after: BEIJING INSTITUTE OF TECHNOLOGY

Address before: No. 11, Lane 3, Heping Road, Heping District, Shenyang, Liaoning 110819

Patentee before: Northeastern University