CN103020712B - Distributed classification device and method for massive microblog data - Google Patents

Distributed classification device and method for massive microblog data

Info

Publication number
CN103020712B
CN103020712B CN201210583886.8A
Authority
CN
China
Prior art keywords
microblog
microblog data
data
Prior art date
Legal status
Active
Application number
CN201210583886.8A
Other languages
Chinese (zh)
Other versions
CN103020712A (en)
Inventor
王国仁
信俊昌
聂铁铮
赵相国
丁琳琳
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201210583886.8A priority Critical patent/CN103020712B/en
Publication of CN103020712A publication Critical patent/CN103020712A/en
Application granted granted Critical
Publication of CN103020712B publication Critical patent/CN103020712B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed classification device and method for massive microblog data, belonging to the technical field of data mining. The device adopts a distributed structure: following the ELM processing method, each slave controller sends the intermediate results it has computed for generating the final microblog data classifier to the master controller; after the master controller receives the intermediate results sent by all slave controllers, it obtains the final microblog data classifier according to the ELM principle and uses it to classify newly produced microblog data. By exploiting extreme learning machine technology, the invention overcomes the defect that ELM classification could previously only be applied in a centralized environment and could not adapt to large-scale training sample sets, making it possible to process and analyze massive microblog data, so that the utility of the massive microblog data accumulated in applications is fully exerted and the applications are better served.

Description

Distributed classification device and method for mass microblog data
Technical Field
The invention belongs to the technical field of data mining, relates to an extreme learning machine classification device and method based on a distributed processing technology, and particularly relates to a distributed classification device and method for massive microblog data.
Background
At present, a large amount of information is generated on the internet at every moment, in diverse forms of expression, and the amount of information generated by microblog platforms is growing rapidly. Microblogs (micro-blogs) are a form of blog that lets users publish short texts (usually around 140 characters) publicly and update them in a timely manner. The rapid development of microblogging enables anyone to become a microblog user and to send and read information at any time on any client that supports microblogs, interacting with others and expressing their emotions. Microblogs have become a powerful information carrier on the internet, and the amount of microblog information has reached a massive scale, making microblogs the most popular platform for information sharing, spreading, and interaction today. Therefore, how to adopt appropriate measures and technologies to mine useful information from massive microblog data and to make predictive judgments about future events has become a hotspot and a difficulty of current research in the field of data mining.
In the existing related research aiming at microblog data, the data volume of the processed microblog data is usually relatively small, and the microblog data can be processed in a centralized environment; however, with the rapid increase of microblog data in the internet, the data volume of the microblog data far exceeds the processing capacity of a single computer, and large-scale data analysis is difficult to realize by adopting the existing method.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a distributed classification device and a distributed classification method for mass microblog data, which classify the microblog data by utilizing an Extreme Learning Machine (ELM) technology, and further can effectively process and analyze the mass microblog data so as to fully exert the utility of the mass microblog data accumulated in application and better serve the application.
The technical scheme of the invention is realized as follows: a distributed classification device for massive microblog data adopts a distributed structure and comprises a master controller and at least one slave controller, wherein each slave controller is interconnected with the master controller, the master controller and each slave controller are communicated with each other, and all the slave controllers are independent of each other and independently complete respective tasks; according to the ELM processing method, each slave control machine sends the intermediate result processed by the slave control machine and used for generating the final microblog data classifier to the master control machine, and the master control machine receives the intermediate results sent by all the slave control machines and then obtains the final microblog data classifier according to the ELM principle.
The slave controller comprises:
A vector machine: converts each piece of microblog training data with a classification result on the slave controller into a vector representation, comprising a feature vector x_i for the data part of each microblog datum and a classification result part t_i.
A stripper: strips out the feature vector matrix X_i and the classification result matrix T_i of all microblog data in the microblog data training set processed by the vector machine.
A converter: uses the principle of the extreme learning machine (ELM) to convert the stripper's feature vector matrix X_i into the ELM hidden layer output matrix H_i.
An antecedent calculator: uses the ELM principle to calculate the intermediate result H_i^T H_i from the hidden layer output matrix H_i, and submits it to the master controller.
A consequent calculator: uses the ELM principle to calculate the intermediate result H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the microblog data set, and submits it to the master controller.
The master controller comprises:
An antecedent accumulator: merges the intermediate results H_i^T H_i submitted by the slave controllers to obtain the aggregate result H^T H.
A consequent accumulator: merges the intermediate results H_i^T T_i submitted by the slave controllers to obtain the aggregate result H^T T.
A parameter generator: uses the ELM principle to calculate the output-node weight vector parameter β from the aggregate results output by the antecedent accumulator and the consequent accumulator.
A classification generator: constructs the microblog data classifier from the parameter β obtained by the parameter generator and classifies the microblog data under test.
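The division of labor between the two kinds of machines can be illustrated with a short sketch. The following Python/NumPy code is our own minimal illustration of the data flow, not code from the patent; the sigmoid excitation function and all names are assumptions:

```python
import numpy as np

def sigmoid(z):
    """One common ELM excitation function (assumed choice)."""
    return 1.0 / (1.0 + np.exp(-z))

def slave_intermediates(X_i, T_i, W, b):
    """Slave side (converter + antecedent/consequent calculators).
    X_i: (N_i, d) local feature matrix; T_i: (N_i, c) classification results;
    W: (d, L) input-node weights; b: (L,) hidden-node offsets."""
    H_i = sigmoid(X_i @ W + b)   # hidden layer output matrix H_i
    U_i = H_i.T @ H_i            # antecedent calculator: H_i^T H_i
    V_i = H_i.T @ T_i            # consequent calculator: H_i^T T_i
    return U_i, V_i              # submitted to the master controller

def master_classifier(intermediates, L, lam):
    """Master side (accumulators + parameter generator): sum the slaves'
    (U_i, V_i) pairs and solve for the output weight parameter beta."""
    U = sum(U for U, _ in intermediates)            # antecedent accumulator: H^T H
    V = sum(V for _, V in intermediates)            # consequent accumulator: H^T T
    return np.linalg.solve(np.eye(L) / lam + U, V)  # beta = (I/lambda + U)^{-1} V
```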
A distributed classification method for massive microblog data comprises the following steps:
step 1: preparing a microblog training data set;
The preparation of the microblog training data set comprises two parts: capturing original microblog data and manually labeling it. Either of two approaches can be adopted: in the first, the master controller captures the original microblog data to be processed, each piece of training data is labeled manually to represent its classification result, and the data are then distributed to the corresponding slave controllers; in the second, the master controller communicates with each slave controller to tell it which microblog data to capture, and each slave controller captures its own original microblog data and labels it manually to represent the classification results;
step 2: the master controller initializes the required parameters and sends the parameters to all the slave controllers;
The method utilizes the principle of the extreme learning machine (ELM); the master controller randomly generates the parameters in advance, including: the number L of hidden nodes, the input-node weight vectors w_1, w_2, ..., w_L, and the hidden-node offsets b_1, b_2, ..., b_L, and sends these parameters to all slave controllers;
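A hedged sketch of this initialization step follows; the uniform sampling ranges are assumptions consistent with the example values in the embodiment below (weights in [-1, 1], offsets in [0, 1]), not something the patent prescribes:

```python
import numpy as np

rng = np.random.default_rng(0)            # seed is our choice
L, d = 3, 6                               # hidden nodes and feature dimension, as in the embodiment
W = rng.uniform(-1.0, 1.0, size=(d, L))   # input-node weight vectors w_1..w_L (columns of W)
b = rng.uniform(0.0, 1.0, size=L)         # hidden-node offsets b_1..b_L
# L, W and b would then be sent to every slave controller.
```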
Step 3: each slave controller processes its respective local microblog data set and sends the processing results to the master controller, which generates the microblog data classifier;
Step 3-1: vectorize the microblog data;
Each piece of microblog training data with a classification result part is vectorized, yielding a feature vector x_i for the data part of each microblog datum and a classification result part t_i.
Step 3-2: strip the microblog data;
For each feature-extracted microblog datum in the slave controller's microblog data training set, the feature vector part and the classification result part are stripped apart to form the feature vector matrix X_i and the classification result matrix T_i of that slave controller's training set; that is, each slave controller generates its respective local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i the classification result matrix of the microblog data set.
Step 3-3: each slave control machine generates an intermediate result according to the respective local microblog data set and sends the intermediate result to the master control machine;
Each slave controller n_i calculates the intermediate results required to construct the classifier according to the received input-node weight vectors w_1, w_2, ..., w_L, the hidden-node thresholds b_1, b_2, ..., b_L, and its local microblog training data set (X_i, T_i), and submits them to the master controller (a note on the sizes of these intermediates follows the sub-steps below);
Step 3-3-1: convert the feature matrix X_i of the local microblog data set into the ELM hidden layer output matrix H_i;
Step 3-3-2: calculate the intermediate result U_i = H_i^T H_i from the hidden layer output matrix H_i;
Step 3-3-3: calculate the intermediate result V_i = H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the local training data set;
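It is worth making explicit why only these intermediates need to be communicated (the observation is ours, but it follows directly from the matrix dimensions): with N_i local samples, L hidden nodes, and c output columns,

$$H_i \in \mathbb{R}^{N_i \times L}, \qquad U_i = H_i^T H_i \in \mathbb{R}^{L \times L}, \qquad V_i = H_i^T T_i \in \mathbb{R}^{L \times c},$$

so each slave controller transmits only L(L + c) numbers to the master controller, independent of the size N_i of its local microblog data set.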
Step 3-4: the master controller receives and summarizes the intermediate results of the slave controllers; calculating a weight vector parameter beta of the output node according to the collected intermediate result and the ELM calculation principle, and further solving a microblog data classifier;
Step 3-4-1: merge the intermediate results U_i submitted by each slave controller to obtain the aggregate result U = ΣU_i = ΣH_i^T H_i = H^T H;
Step 3-4-2: merge the intermediate results V_i submitted by each slave controller to obtain the aggregate result V = ΣV_i = ΣH_i^T T_i = H^T T;
Step 3-4-3: calculate the output-node weight vector parameter β from the aggregated U and V:

$$\beta = \left(\frac{I}{\lambda} + H^T H\right)^{-1} H^T T = \left(\frac{I}{\lambda} + U\right)^{-1} V$$

where I is the identity matrix, λ is a user-specified parameter, and (·)^{-1} denotes matrix inversion;
further determining a formula of the microblog data classifier,
f(x)=h(x)β
where f(x) represents the classification result of the microblog data to be classified, and h(x) represents the hidden layer output vector of the microblog data to be classified;
Step 4: automatic classification of microblog data
Automatic classification of microblog data can proceed in two ways: in the first, the master controller continuously captures microblog data and directly outputs the classification results of the microblog data to be classified by applying the microblog data classifier generated in step 3; in the second, the master controller sends the classifier generated in step 3 to each slave controller, and each slave controller then applies it to classify the microblog data to be classified and obtain the classification results.
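A minimal prediction sketch under the same assumptions as the earlier device sketch (sigmoid excitation; `W`, `b`, and `beta` as produced in steps 2-3; it reuses the `sigmoid` helper and NumPy import from that sketch):

```python
def classify(x, W, b, beta):
    """Classify one microblog feature vector x with the trained classifier."""
    h = sigmoid(x @ W + b)        # hidden layer output vector h(x)
    f = h @ beta                  # f(x) = h(x) * beta
    return int(np.argmax(f)) + 1  # label of the largest output component
```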
Advantageous effects: the distributed classification device and method for massive microblog data of the invention overcome the defect that traditional extreme learning machine technology can only be applied in a centralized environment and cannot adapt to ELM classification over large-scale training sample sets. Massive microblog data can thus be processed and analyzed, the utility of the massive microblog data accumulated in applications can be fully exerted, and the applications can be better served.
Drawings
FIG. 1 is a schematic diagram of a distributed architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the connection between the master controller and the slave controller according to an embodiment of the present invention;
FIG. 3 is a block diagram of the master controller and slave controllers according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a distributed microblog data training set according to an embodiment of the invention;
FIG. 5 is a flow chart of a distributed microblog data training method according to an embodiment of the invention;
FIG. 6 is a flowchart of a method for generating a microblog data classifier according to an embodiment of the invention;
FIG. 7 is a diagram of partial intermediate results after conversion by a slave controller in accordance with one embodiment of the present invention;
FIG. 8 is a diagram illustrating the calculation of intermediate results from the slave controller and the collection of the master controller according to one embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Current microblog data contain a large amount of emotion information from microblog users; this information expresses the viewpoints and opinions of microblog users on certain events, commodities, public figures, and the like, and has high research and application value. Emotion analysis of microblog data has therefore received wide attention and has broad application prospects, such as viewpoint analysis, commodity evaluation, and opinion detection. Accordingly, in the embodiment of the invention, microblog data are classified according to their emotional tendency.
The invention analyzes massive microblog data in a distributed environment, whose system structure is shown in FIG. 1. It comprises a master node n_0 and a plurality of slave nodes n_1, n_2, ..., n_s, where the master node n_0 is interconnected with all of the slave nodes n_1, n_2, ..., n_s and can communicate with each of them.
One embodiment of the present invention uses the general connection scheme shown in FIG. 2, which includes a master controller and a plurality of slave controllers (slave controller 1, slave controller 2, ..., slave controller m), each of which is interconnected with the master controller. According to the principle of the extreme learning machine (ELM), each slave controller processes its local microblog training data set to generate its intermediate results for producing the final classifier and sends them to the master controller; after receiving the intermediate results, the master controller generates the final microblog data classifier according to the ELM principle.
The slave controller comprises a vector machine, a stripper, a converter, an antecedent calculator, and a consequent calculator. The master controller comprises an antecedent accumulator, a consequent accumulator, a parameter generator, and a classification generator.
A vector machine: converts each piece of microblog training data with a classification result on the slave controller into a vector representation, comprising a feature vector x_i for the data part of each microblog datum and a classification result part t_i.
A stripper: strips out the feature vector matrix X_i and the classification result matrix T_i of all microblog data in the microblog data training set processed by the vector machine.
A converter: uses the principle of the extreme learning machine (ELM) to convert the stripper's feature vector matrix X_i into the ELM hidden layer output matrix H_i.
An antecedent calculator: uses the ELM principle to calculate the intermediate result H_i^T H_i from the hidden layer output matrix H_i, and submits it to the master controller.
A consequent calculator: uses the ELM principle to calculate the intermediate result H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the microblog data set, and submits it to the master controller.
The former term accumulator: for merging intermediate results H submitted from the controllersi THiTo obtain a summary result HTH。
A last term accumulator: for merging intermediate results H submitted from the controllersi TTiTo obtain a summary result HTT。
A parameter generator: the principle of an Extreme Learning Machine (ELM) is used for calculating a weight vector parameter beta of an output node according to the summarized results output by the antecedent accumulator and the consequent accumulator.
And (3) a classification generator: and constructing a classification device of the microblog data according to the parameter beta obtained by the parameter generator, and classifying the microblog data to be tested.
In this embodiment, each slave controller and the master controller analyze microblog data with the ELM technique, which is as follows:
The extreme learning machine is a training method based on single hidden-layer feedforward neural networks (SLFNs). The ELM randomly sets the connection weights and offsets between the input layer and the hidden layer before training and does not need to adjust the network's input weights or the hidden-unit offsets during execution of the algorithm; it produces a unique optimal analytic solution for the output-layer weights and offers good generalization capability at an extremely fast learning speed.
The basic principle of the ELM is as follows: during training, the ELM first randomly generates the input weights and hidden-node thresholds and then calculates the output weights of the SLFNs from the training data. Suppose N training samples (x_j, t_j) are given, where x_j is the feature vector part and t_j the classification result part of a sample. SLFNs with L hidden nodes and excitation function g(x) can be formally expressed as:
$$\sum_{i=1}^{L} \beta_i g_i(x_j) = \sum_{i=1}^{L} \beta_i g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N. \qquad (1)$$
where w_i is the weight vector connecting the ith hidden node and the input nodes; β_i is the weight vector connecting the ith hidden node and the output nodes; b_i is the threshold of the ith hidden node; and o_j is the jth output vector of the SLFNs.
If the SLFNs can approximate the training samples with zero error, then $\sum_{j=1}^{N} \| o_j - t_j \| = 0$ holds, i.e., there exist w_i, β_i, and b_i such that

$$\sum_{i=1}^{L} \beta_i g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N,$$

abbreviated as Hβ = T, where
$$H(w_1, w_2, \ldots, w_L, b_1, b_2, \ldots, b_L, x_1, x_2, \ldots, x_N) = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & g(w_2 \cdot x_1 + b_2) & \cdots & g(w_L \cdot x_1 + b_L) \\ g(w_1 \cdot x_2 + b_1) & g(w_2 \cdot x_2 + b_2) & \cdots & g(w_L \cdot x_2 + b_L) \\ \vdots & \vdots & & \vdots \\ g(w_1 \cdot x_N + b_1) & g(w_2 \cdot x_N + b_2) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix} \qquad (2)$$
$$\beta = [\beta_1^T, \beta_2^T, \ldots, \beta_L^T]^T, \qquad T = [t_1^T, t_2^T, \ldots, t_N^T]^T,$$

where x^T denotes the transpose of the matrix x.
The matrix H is called the hidden layer output matrix. In the formula Hβ = T, only β is unknown, and it can be obtained as $\beta = H^{\dagger} T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of H.
Based on the basic extreme learning machine, several scholars have further proposed an ELM based on random hidden-layer feature mappings, in which case $\beta = \left(\frac{I}{\lambda} + H^T H\right)^{-1} H^T T$, where I is the identity matrix and λ is a user-specified parameter.
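To make both solutions concrete, here is a hedged NumPy sketch (our illustration, not the patent's code; H and T as defined above, λ user-supplied):

```python
import numpy as np

def elm_solve(H, T, lam=None):
    """Solve H beta = T for the output weights beta.
    lam=None: basic ELM, beta = pinv(H) @ T (Moore-Penrose generalized inverse).
    lam set:  regularized ELM, beta = (I/lam + H^T H)^{-1} H^T T."""
    if lam is None:
        return np.linalg.pinv(H) @ T
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ T)
```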
in addition, a plurality of variations of ELMs such as kernel-based ELM (kernel based ELM), fully Complex ELM (full Complex ELM), online continuous ELM (online Sequential ELM), incremental ELM (incremental ELM) and integrated ELM (Ensemble of ELM) are widely used in different application fields, and good practical application effects are achieved.
In this embodiment, the emotional tendency of current microblog users toward the Apple tablet computer is analyzed from microblog data related to it. Such emotional tendency analysis helps relevant product manufacturers, suppliers, and distributors make correct judgments about the product's future development trend, and also helps users who have purchased or intend to purchase the device deepen their understanding of it and make appropriate choices.
FIG. 4 shows a master controller (i.e., master node n_0) and three slave controllers (i.e., slave nodes n_1, n_2, and n_3) together forming a distributed system. According to the above-described procedure and the basic principle of the ELM, the following process is carried out in the distributed system shown in FIG. 4.
In the embodiment, a distributed classification method of mass microblog data is adopted to perform emotion analysis on microblog data related to a tablet computer, and the flow is shown in fig. 5. The flow begins with step 501.
At step 502, microblog training data are prepared. As described above, the preparation of microblog training data can follow two approaches; the first is adopted in this embodiment. The master controller captures original microblog data related to the Apple tablet computer; each record contains a number of fields, such as publication time, publisher, type, access authority, text content, picture URL, and video URL. In this embodiment, only the text content field of the original data is used for emotional tendency analysis. Meanwhile, an emotional tendency dimension is added during manual labeling, i.e., the classification result part of the microblog data, representing the emotional tendency of the microblog content. Listed below are 7 pieces of microblog data with manual emotion labels; the master controller distributes these 7 pieces of training data to the three slave controllers, sentences 1-2 to slave controller n_1, sentences 3-5 to slave controller n_2, and sentences 6-7 to slave controller n_3.
The microblog training data set of slave controller n_1 is as follows:
statement 1: the apple tablet computer has good quality, fast reaction speed and good hand feeling. (the emotional tendency of statement 1 is: praise)
Statement 2: the apple tablet personal computer is used for a while, has too few functions, is not as good as the legend, and is too common. (the emotional tendency of statement 2 is: against)
The microblog training data set of slave controller n_2 is as follows:
statement 3: the apple tablet personal computer has the advantages of high speed, stable networking and perfect game surfing, and praises one! (the emotional tendency of statement 3 is: praise)
Statement 4: the single product route and high price of apple tablet computers is not known how long it can last in the competition of other opponents such as samsung. (the emotional tendency of statement 4 is: neutral)
Statement 5: the apple tablet computer operating system is not used to, the screen is not comfortable to watch wide-screen movies in proportion, files are difficult to export, and software is expensive to download. (the emotional tendency of statement 5 is: against)
The microblog training data set of slave controller n_3 is as follows:
statement 6: apple tablet computers are very fast, high in resolution and quite rich in application programs. (the emotional tendency of statement 6 is: praise)
Statement 7: the apple tablet computer body is too heavy to pick up, downloading requires access to itunes, is cumbersome! (the emotional tendency of statement 7 is: against)
In step 503: the master controller initializes the required parameters and sends the parameters to all the slave controllers;
The preset parameters are all generated randomly in advance by the master controller and comprise: the input-node weight vectors w_1, w_2, w_3 and the hidden-node thresholds b_1, b_2, b_3. The master controller issues these parameters to the slave nodes n_1, n_2, and n_3, and the number of hidden nodes is set to L = 3.
w_1 = (-0.9286, 0.3575, -0.2155, 0.4121, -0.9077, 0.3897)
w_2 = (0.6983, 0.5155, 0.3110, -0.9363, -0.8057, -0.3658)
w_3 = (0.8680, 0.4863, -0.6576, -0.4462, 0.6469, 0.9004)
b_1 = 0.0344
b_2 = 0.4387
b_3 = 0.3816
In step 504, each slave controller processes its respective local microblog data set and sends the processing results to the master controller, which generates the microblog data classifier; the specific process is shown in FIG. 6 and starts at step 601.
In step 602, each piece of microblog training data with a classification result part is vectorized, yielding a feature vector x_i for the data part of each microblog datum and a classification result part t_i.
Vectorizing the data part means performing feature extraction on it. Feature extraction is the basis of emotional tendency analysis, and its quality directly influences the result of emotional tendency prediction. Feature extraction transforms the original features into the most representative new features through a mapping (or transformation) method. This work mainly studies the influence of positive emotion words, negative emotion words, degree adverbs, and negative words in text data, taken as features, on the analysis of the text's emotional tendency. The details are as follows:
Emotion words: emotion words are nouns, verbs, adjectives, and idioms with emotional tendencies. The emotional tendency of a text is conveyed mainly through emotion words, so they are one of the important features for analyzing and predicting the emotional tendency of a text. According to the requirements of emotion analysis, this embodiment divides the emotion words in text data into two types: praise words and derogatory words. Praise words are words whose sense carries positive emotion, such as 'like', 'accept', 'enjoy', 'praise', 'honor', and 'nice'. Derogatory words are words whose sense carries emotions of dislike, negation, hate, and contempt, such as 'dislike', 'oppose', 'ignore', 'depressed', and 'cheat'. In this embodiment, the praise emotion words are divided into three levels [+3, +2, +1] of decreasing positive degree, and the derogatory emotion words are likewise divided into three levels [-1, -2, -3] of increasing derogatory degree.
Four features relate to the emotion words: the praise word frequency, the praise word average level, the derogatory word frequency, and the derogatory word average level. For each polarity, word frequency = (number of emotion words of that polarity) / (total number of words in the sentence), and average level = (sum of the levels of those emotion words) / (number of those emotion words).
Degree adverbs: a degree adverb is a kind of adverb that expresses degree, such as 'very', 'extremely', 'quite', 'too', 'somewhat', 'slightly', 'almost', and 'especially'. This embodiment extracts the word frequency of degree adverbs as a feature.
Negative adverbs: a negative adverb is a kind of adverb that expresses negation, such as 'not', 'no', 'never', 'hardly', and 'not necessarily'. This embodiment extracts the word frequency of negative adverbs as a feature.
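A sketch of this feature extraction (covering all six features summarized below) is given here; the mini-lexicons and the token list are placeholders, and real word lists plus a Chinese word segmenter would be substituted:

```python
# Hypothetical mini-lexicons; the level values follow the scheme above.
PRAISE = {"good": 1, "fast": 2, "nice": 2}   # praise word -> level in {+1, +2, +3}
DEROG  = {"bad": -1, "awful": -3}            # derogatory word -> level in {-1, -2, -3}
DEGREE = {"very", "too", "quite"}            # degree adverbs
NEGATE = {"not", "never"}                    # negative adverbs

def extract_features(tokens):
    """Return the six features: praise frequency, praise average level,
    derogatory frequency, derogatory average level, degree-adverb frequency,
    negative-adverb frequency."""
    n = len(tokens)
    praise = [PRAISE[t] for t in tokens if t in PRAISE]
    derog = [DEROG[t] for t in tokens if t in DEROG]
    return (
        len(praise) / n,
        sum(praise) / len(praise) if praise else 0.0,
        len(derog) / n,
        sum(derog) / len(derog) if derog else 0.0,
        sum(t in DEGREE for t in tokens) / n,
        sum(t in NEGATE for t in tokens) / n,
    )
```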
In summary, the text features extracted in this embodiment are six: praise word frequency, praise word average level, derogatory word frequency, derogatory word average level, degree adverb frequency, and negative adverb frequency. Meanwhile, in the classification result part of the microblog data, the emotional tendency of a text is divided into three classes: praise (label 1), neutral (label 2), and against (label 3). The feature vector part and classification result part of each microblog datum therefore take the following specific form:
Applying this feature extraction method, the 7 pieces of microblog data are vectorized as follows:
Statement 1: the apple tablet computer has good quality, fast reaction speed and good hand feeling. (the emotional tendency of statement 1 is: praise)
Statement 1 analysis: sentence 1 can be divided into 8 words, among which there are 3 praise words ('good', 'fast', 'good'), so the praise word frequency of sentence 1 is 3/8; the levels of these praise words are +1, +2, and +2 respectively, so the average praise word level of sentence 1 is (1+2+2)/3. Sentence 1 contains no derogatory words, so the derogatory word frequency and average level are both 0. It contains one degree adverb, giving a frequency of 1/8, and no negative adverbs, giving a frequency of 0. The emotional tendency is praise, so the classification result is 1. After feature extraction, sentence 1 is therefore converted into (0.375, 1.667, 0, 0, 0.125, 0, 1).
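The arithmetic for statement 1 can be checked directly:

```python
praise_freq = 3 / 8            # 3 praise words among 8 tokens -> 0.375
praise_avg = (1 + 2 + 2) / 3   # levels +1, +2, +2 -> 1.667 (rounded)
degree_freq = 1 / 8            # one degree adverb -> 0.125
# Derogatory and negative-adverb features are 0, and the label is 1 ("praise"),
# giving the vector (0.375, 1.667, 0, 0, 0.125, 0, 1).
```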
The feature vector portions of other statements can be obtained using the same method.
Statement 2: the apple tablet personal computer is used for a while, has too few functions, is not as good as the legend, and is too common. (the emotional tendency of statement 2 is: against)
Statement 2 analysis: (0.083,2,0.167, -1.5,0.25,0.083,3).
Statement 3: the apple tablet personal computer has the advantages of high speed, stable networking and perfect game surfing, and praises one! (the emotional tendency of statement 3 is: praise)
Statement 3 analysis: (0.333,2.5,0,0,0.25,0,1).
Statement 4: the single product route and high price of apple tablet computers is not known how long it can last in the competition of other opponents such as samsung. (the emotional tendency of statement 4 is: neutral)
Statement 4 analysis: (0.077,2,0.077, -1,0,0,2).
Statement 5: the apple tablet computer operating system is not used to, the screen is not comfortable to watch wide-screen movies in proportion, files are difficult to export, and software is expensive to download. (the emotional tendency of statement 5 is: against)
Statement 5 analysis: (0,0,0.188, -2.333,0.125,0.063,3).
Statement 6: apple tablet computers are very fast, high in resolution and quite rich in application programs. (the emotional tendency of statement 6 is: praise)
Statement 6 analysis: (0.273,2.333,0,0,0.273,0,1).
Statement 7: the apple tablet computer body is too heavy to pick up, downloading requires access to itunes, is cumbersome! (the emotional tendency of statement 7 is: against)
Statement 7 analysis: (0,0,0.154, -2.5,0.154,0.077,3).
In step 603, each slave controller strips its local vectorized microblog training data, separating the feature vector part and the classification result part; that is, each slave controller generates its local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i the classification result matrix of the microblog data set. In the distributed environment shown in FIG. 4, the training data of slave controller n_1 are:
Statement 1: (0.375, 1.667, 0, 0, 0.125, 0, 1)
Statement 2: (0.083, 2, 0.167, -1.5, 0.25, 0.083, 3)
The feature matrix X_1 and classification result matrix T_1 stripped from the microblog training data of slave controller n_1 are as follows:

$$X_1 = \begin{bmatrix} 0.375 & 1.667 & 0 & 0 & 0.125 & 0 \\ 0.083 & 2 & 0.167 & -1.5 & 0.25 & 0.083 \end{bmatrix}, \qquad T_1 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
The training data of slave controller n_2 are:
Statement 3: (0.333, 2.5, 0, 0, 0.25, 0, 1)
Statement 4: (0.077, 2, 0.077, -1, 0, 0, 2)
Statement 5: (0, 0, 0.188, -2.333, 0.125, 0.063, 3)
The feature matrix X_2 and classification result matrix T_2 stripped from the microblog training data of slave controller n_2 are as follows:

$$X_2 = \begin{bmatrix} 0.333 & 2.5 & 0 & 0 & 0.25 & 0 \\ 0.077 & 2 & 0.077 & -1 & 0 & 0 \\ 0 & 0 & 0.188 & -2.333 & 0.125 & 0.063 \end{bmatrix}, \qquad T_2 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$
The training data of slave controller n_3 are:
Statement 6: (0.273, 2.333, 0, 0, 0.273, 0, 1)
Statement 7: (0, 0, 0.154, -2.5, 0.154, 0.077, 3)
The feature matrix X_3 and classification result matrix T_3 stripped from the microblog training data of slave controller n_3 are as follows:

$$X_3 = \begin{bmatrix} 0.273 & 2.333 & 0 & 0 & 0.273 & 0 \\ 0 & 0 & 0.154 & -2.5 & 0.154 & 0.077 \end{bmatrix}, \qquad T_3 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
At step 604, each slave controller n_i calculates the intermediate results required by the ELM according to the received parameters w_1, ..., w_L and b_1, ..., b_L and its local microblog data set (X_i, T_i), and submits them to the master controller; in (X_i, T_i), X_i is the feature matrix and T_i the classification result matrix of the microblog data set, as shown in FIG. 7.
Here, in the ELM, the feature matrix X_i of the input data needs to be normalized so that all elements of X_i lie in [-1, +1]; different choices of normalization method lead to different input data. In addition, for the excitation function g(w_i · x_i + b_i), the ELM offers a variety of excitation functions for the user to choose from, and different choices also lead to different intermediate results and hence different final classification results. In the embodiment of the present invention, the vectors of these statements are first normalized and an excitation function is then selected, yielding the intermediate results required by the ELM; one possible normalization choice is sketched below, after which the processing of the three slave controllers is described in turn.
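A sketch of one possible normalization choice (min-max scaling to [-1, +1]; the patent leaves the method open, and the embodiment's printed numbers depend on its particular, unstated choice):

```python
import numpy as np

def minmax_scale(X):
    """Scale each feature column of X into [-1, +1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return 2.0 * (X - lo) / span - 1.0
```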
for the slave node n1In a word:
in step 604-1, the slave controller n1The processed data are statement 1(0.375,1.667,0,0,0.125,0,1) and statement 2(0.083,2,0.167, -1.5,0.25,0.083,3), and the received parameter is w1,w2,w3,b1,b2,b3Normalization and selection of the excitation function
Hidden layer output matrix <math> <mrow> <msub> <mi>H</mi> <mn>1</mn> </msub> <mo>=</mo> <mfenced open='[' close=']'> <mtable> <mtr> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>1</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>1</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>1</mn> </msub> <mo>)</mo> </mrow> </mtd> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>2</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>1</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> </mtd> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>3</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>1</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>3</mn> </msub> <mo>)</mo> </mrow> </mtd> </mtr> <mtr> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>1</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>2</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>1</mn> </msub> <mo>)</mo> </mrow> </mtd> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>2</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>2</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> </mtd> <mtd> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>3</mn> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mn>2</mn> </msub> <mo>+</mo> <msub> <mi>b</mi> <mn>3</mn> </msub> <mo>)</mo> </mrow> </mtd> </mtr> </mtable> </mfenced> <mo>=</mo> <mfenced open='[' close=']'> <mtable> <mtr> <mtd> <mn>0.5287</mn> </mtd> <mtd> <mn>0.7409</mn> </mtd> <mtd> <mn>0.7524</mn> </mtd> </mtr> <mtr> <mtd> <mn>0.5442</mn> </mtd> <mtd> <mn>0.7244</mn> </mtd> <mtd> <mn>0.7404</mn> </mtd> </mtr> </mtable> </mfenced> <mo>,</mo> </mrow> </math>
Classification result matrix T 1 = 1 3
In step 604-2, the intermediate result U_1 is calculated from H_1:

$$U_1 = H_1^T H_1 = \begin{bmatrix} 0.5867 & 0.7932 & 0.8081 \\ 0.7932 & 1.0737 & 1.0938 \\ 0.8081 & 1.0938 & 1.1143 \end{bmatrix};$$
In step 604-3, the intermediate result V_1 is calculated from H_1 and T_1:

$$V_1 = H_1^T T_1 = \begin{bmatrix} 2.1913 \\ 2.9141 \\ 2.9736 \end{bmatrix},$$

and the intermediate results U_1 and V_1 are submitted to the master controller.
For slave controller n_2:
In step 604-4, the data processed by slave controller n_2 are statement 3 (0.333, 2.5, 0, 0, 0.25, 0, 1), statement 4 (0.077, 2, 0.077, -1, 0, 0, 2), and statement 5 (0, 0, 0.188, -2.333, 0.125, 0.063, 3), and the received parameters are w_1, w_2, w_3, b_1, b_2, b_3. After normalization and selection of the excitation function, the hidden layer output matrix is

$$H_2 = \begin{bmatrix} g(w_1 \cdot x_3 + b_1) & g(w_2 \cdot x_3 + b_2) & g(w_3 \cdot x_3 + b_3) \\ g(w_1 \cdot x_4 + b_1) & g(w_2 \cdot x_4 + b_2) & g(w_3 \cdot x_4 + b_3) \\ g(w_1 \cdot x_5 + b_1) & g(w_2 \cdot x_5 + b_2) & g(w_3 \cdot x_5 + b_3) \end{bmatrix} = \begin{bmatrix} 0.5441 & 0.7194 & 0.7388 \\ 0.5467 & 0.7244 & 0.7163 \\ 0.7398 & 0.7388 & 0.8114 \end{bmatrix},$$

and the classification result matrix is

$$T_2 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$
In step 604-5, the intermediate result U_2 is calculated from H_2:

$$U_2 = H_2^T H_2 = \begin{bmatrix} 1.1422 & 1.3340 & 1.3961 \\ 1.3340 & 1.5881 & 1.6521 \\ 1.3961 & 1.6521 & 1.7222 \end{bmatrix};$$
In step 604-6, the intermediate result V_2 is calculated from H_2 and T_2:

$$V_2 = H_2^T T_2 = \begin{bmatrix} 3.8569 \\ 4.3846 \\ 4.6146 \end{bmatrix},$$

and the intermediate results U_2 and V_2 are submitted to the master controller.
For slave controller n_3:
In step 604-7, the data processed by slave controller n_3 are statement 6 (0.273, 2.333, 0, 0, 0.273, 0, 1) and statement 7 (0, 0, 0.154, -2.5, 0.154, 0.077, 3), and the received parameters are w_1, w_2, w_3, b_1, b_2, b_3. After normalization and selection of the excitation function, the hidden layer output matrix is

$$H_3 = \begin{bmatrix} g(w_1 \cdot x_6 + b_1) & g(w_2 \cdot x_6 + b_2) & g(w_3 \cdot x_6 + b_3) \\ g(w_1 \cdot x_7 + b_1) & g(w_2 \cdot x_7 + b_2) & g(w_3 \cdot x_7 + b_3) \end{bmatrix} = \begin{bmatrix} 0.3993 & 0.7005 & 0.8426 \\ 0.2272 & 0.6769 & 0.8216 \end{bmatrix},$$

and the classification result matrix is

$$T_3 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
In step 604-8, the intermediate result U_3 is calculated from H_3:

$$U_3 = H_3^T H_3 = \begin{bmatrix} 0.2111 & 0.4335 & 0.5458 \\ 0.4335 & 0.9489 & 1.2141 \\ 0.5458 & 1.2141 & 1.5593 \end{bmatrix};$$
In step 604-9, the intermediate result V_3 is calculated from H_3 and T_3:

$$V_3 = H_3^T T_3 = \begin{bmatrix} 1.0809 \\ 2.7312 \\ 3.6074 \end{bmatrix},$$

and the intermediate results U_3 and V_3 are submitted to the master controller.
In step 605, the master controller n_0 receives U_1 and V_1 submitted by slave controller n_1, U_2 and V_2 submitted by slave controller n_2, and U_3 and V_3 submitted by slave controller n_3, and calculates the final result, as shown in FIG. 8.
Step 605-1: merge the intermediate results U_1, U_2, U_3 submitted by the slave controllers to obtain the aggregate result

$$U = U_1 + U_2 + U_3 = \begin{bmatrix} 1.9400 & 2.5607 & 2.7500 \\ 2.5607 & 3.6107 & 3.9600 \\ 2.7500 & 3.9600 & 4.3958 \end{bmatrix};$$
Step 605-2: merge the intermediate results V_1, V_2, V_3 submitted by the slave controllers to obtain the aggregate result

$$V = V_1 + V_2 + V_3 = \begin{bmatrix} 7.1390 \\ 11.0317 \\ 11.1956 \end{bmatrix};$$
Step 605-3: calculate the output-node weight vector parameter β from the aggregated U and V:

$$\beta = \left(\frac{I}{\lambda} + U\right)^{-1} V = \begin{pmatrix} -16.8925 & 9.9534 & 6.6591 \\ 42.3653 & -19.4846 & -23.3897 \\ -28.1804 & 10.8984 & 16.6435 \end{pmatrix}$$
thus, the weight vector parameter β can be obtained.
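The final solve is mechanical, although the patent states neither λ nor the label encoding of T (the printed β is 3×3 while the printed V has a single column), so the assumed λ below will not reproduce the printed β exactly; the sketch only illustrates the computation:

```python
import numpy as np

U = np.array([[1.9400, 2.5607, 2.7500],
              [2.5607, 3.6107, 3.9600],
              [2.7500, 3.9600, 4.3958]])
V = np.array([7.1390, 11.0317, 11.1956])
lam = 100.0  # assumed value; not given in the patent
beta = np.linalg.solve(np.eye(3) / lam + U, V)
```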
In step 605-4, a classifier for predicting the emotional tendency of microblog data is constructed from the parameter β obtained by the parameter generator and is used to analyze the emotional tendency of the microblog data under test, with the formula:
f(x)=h(x)β
in step 505: and automatically classifying microblog data.
The main controller continuously captures microblog data, and a generated microblog data classifier is used to directly output a classification result of the microblog data to be classified, wherein the following two sentences are microblog data to be classified continuously captured by the main controller and a result obtained by applying the same feature extraction method.
Statement 8: apple tablet is sent to friends, friends like well! Speed and shape are good! Like!
Statement 8 analysis: (0.286, 2.25, 0, 0, 0.214, 0, classification result unknown).
Statement 9: the apple tablet personal computer has low screen quality, is very troublesome to use and has poor endurance time.
Statement 9 analysis: (0, 0, 0.25, -2.333, 0.25, 0, classification result unknown).
After applying the same normalization method and selecting the same excitation function, the classification result of statement 8 is obtained as follows:
Hidden layer output vector h(x_8) = [g(w_1·x_8+b_1), g(w_2·x_8+b_2), g(w_3·x_8+b_3)] = [0.5467, 0.7244, 0.7388]
Substituting into the classifier formula gives
f(x_8) = h(x_8)β = [0.6332, -0.6207, -1.0061]
For this result, the ELM uses a maximization rule to determine the classification of the microblog data to be predicted: the dimension holding the largest element of the output vector is found, and the class label corresponding to that dimension is the classification result. The largest element in the classifier output for statement 8 is 0.6332, in dimension 1, so the classification result of statement 8 is the class represented by label 1, i.e., "praise".
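In code, this maximization rule is a one-line argmax over the classifier output:

```python
import numpy as np

f_x8 = np.array([0.6332, -0.6207, -1.0061])  # classifier output for statement 8
label = int(np.argmax(f_x8)) + 1             # dimensions correspond to labels 1..3
# label == 1 -> "praise"
```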
The prediction process for statement 9 is the same as for statement 8. Briefly, the classification result of statement 9 is obtained as follows:
Hidden layer output vector h(x_9) = [g(w_1·x_9+b_1), g(w_2·x_9+b_2), g(w_3·x_9+b_3)] = [0.2222, 0.6704, 0.9174]
Substituting into the classifier formula gives
f(x_9) = h(x_9)β = [-1.2055, -0.8521, 1.0684]
The largest element in the classifier output for statement 9 is 1.0684, in dimension 3, so the classification result of statement 9 is the class represented by label 3, i.e., "against".
With statements 8 and 9 as test data, the generated microblog data classifier correctly recovers their emotional tendencies, showing that the microblog data to be classified can be classified accurately.
Besides analyzing the emotional tendency of microblog data, the invention can also be applied to movie box office analysis, song click-rate analysis, financial product recommendation, stock analysis, instrument efficiency, hot news event analysis, social public opinion analysis, and other applications.
Although specific embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are merely illustrative and that various changes or modifications may be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is only limited by the appended claims.

Claims (1)

1. A distributed classification method of massive microblog data is realized by adopting a distributed classification device of massive microblog data, the device adopts a distributed structure and comprises a master controller and at least one slave controller, each slave controller is interconnected with the master controller, the master controller and each slave controller are communicated with each other, and all the slave controllers are independent from each other;
the slave control machine comprises:
A vector machine: for converting each piece of microblog training data with a classification result on the slave controller into a vector representation, comprising a feature vector x_i for the data part of each microblog datum and a classification result part t_i;
A stripper: for stripping out the feature vector matrix X_i and the classification result matrix T_i of all microblog data in the training set processed by the vector machine;
A converter: for converting the stripper's feature vector matrix X_i into the ELM hidden layer output matrix H_i using the principle of the extreme learning machine ELM;
An antecedent calculator: for calculating the intermediate result H_i^T H_i from the hidden layer output matrix H_i using the ELM principle, and submitting it to the master controller;
A consequent calculator: for calculating the intermediate result H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the microblog data set using the ELM principle, and submitting it to the master controller;
the main control machine comprises:
An antecedent accumulator: for merging the intermediate results H_i^T H_i submitted by the slave controllers to obtain the aggregate result H^T H;
A consequent accumulator: for merging the intermediate results H_i^T T_i submitted by the slave controllers to obtain the aggregate result H^T T;
A parameter generator: for calculating the output-node weight vector parameter β from the aggregate results output by the antecedent accumulator and the consequent accumulator, using the principle of the extreme learning machine ELM;
A classification generator: for constructing the microblog data classifier from the parameter β obtained by the parameter generator and classifying the microblog data under test;
each slave controller sends an intermediate result which is processed by the slave controller and used for generating a final microblog data classifier to the master controller, and the master controller receives the intermediate results sent by all the slave controllers and then obtains the final microblog data classifier according to the ELM principle;
the method is characterized in that: the method comprises the following steps:
step 1: preparing a microblog training data set;
The preparation of the microblog training data set comprises two parts: capturing original microblog data and manually labeling it; either of two approaches is adopted: in the first, the master controller captures the original microblog data to be processed, each piece of training data is labeled manually to represent its classification result, and the data are then distributed to the corresponding slave controllers; in the second, the master controller communicates with each slave controller to tell it which microblog data to capture, and each slave controller captures its own original microblog data and labels it manually to represent the classification results;
step 2: the master controller initializes the required parameters and sends the parameters to all the slave controllers;
the method utilizes the principle of an extreme learning machine ELM, and a main controller generates parameters randomly in advance, and comprises the following steps: number L of hidden nodes and weight vector w of input node1,w2,...,wLOffset b of hidden node1,b2,...,bLAnd sends these parameters to all slave controllers;
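By way of illustration only, a minimal numpy sketch of this initialization follows; the feature dimensionality d, the uniform sampling range, and every name in it are assumptions, since the claim does not prescribe an implementation (all sketches in this document use Python):

```python
import numpy as np

d = 1000   # dimensionality of the microblog feature vectors (assumed)
L = 200    # number of hidden nodes, chosen by the user

rng = np.random.default_rng(42)
W = rng.uniform(-1.0, 1.0, size=(d, L))  # columns are the input-node weight vectors w_1..w_L
b = rng.uniform(-1.0, 1.0, size=L)       # hidden-node biases b_1..b_L

# In a real deployment these would be serialized and sent to every slave controller.
params = {"W": W, "b": b, "L": L}
```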
Step 3: each slave controller processes its local microblog data set and sends the processing result to the master controller, which generates the microblog data classifier;
Step 3-1: vectorizing the microblog data;
each piece of microblog training data with a classification result is vectorized into a feature vector x_i for its data portion and a classification result portion t_i;
Step 3-2: stripping the microblog data;
for each feature-extracted microblog data set in the slave controller's training set, the feature vector portion and the classification result portion of the data are stripped apart to form the feature vector matrix X_i and the classification result matrix T_i of that slave controller's microblog data training set; that is, each slave controller generates its local microblog data set (X_i, T_i), where X_i is the feature matrix and T_i the classification result matrix of the set;
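A toy sketch of Steps 3-1 and 3-2 under loud assumptions: a three-word bag-of-words vocabulary and one-hot class labels stand in for whatever feature extraction a real deployment would use, and the two-microblog corpus is invented:

```python
import numpy as np

vocab = {"rain": 0, "flu": 1, "match": 2}          # toy vocabulary (assumed)
labels = {"weather": 0, "health": 1, "sports": 2}  # toy class set (assumed)

def vectorize(text, label):
    """Step 3-1: one microblog -> feature vector x_i and result vector t_i."""
    x = np.zeros(len(vocab))
    for word in text.split():
        if word in vocab:
            x[vocab[word]] += 1.0          # bag-of-words count
    t = np.zeros(len(labels))
    t[labels[label]] = 1.0                 # one-hot classification result
    return x, t

# Step 3-2: strip the (x, t) pairs apart into the local set (X_i, T_i).
corpus = [("heavy rain again today", "weather"), ("flu season is here", "health")]
pairs = [vectorize(text, lab) for text, lab in corpus]
X_i = np.vstack([x for x, _ in pairs])     # feature matrix of the local set
T_i = np.vstack([t for _, t in pairs])     # classification result matrix
```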
Step 3-3: each slave controller generates intermediate results from its local microblog data set and sends them to the master controller;
the specific steps are as follows:
each slave controller n_i calculates the intermediate results required to construct the classifier from the received input-node weight vectors w_1, w_2, ..., w_L, the hidden-node biases b_1, b_2, ..., b_L, and its local microblog training data set (X_i, T_i), and submits them to the master controller;
Step 3-3-1: converting the feature matrix X_i of the local microblog data set into the hidden layer output matrix H_i of the ELM;
Step 3-3-2: calculating the intermediate result U_i = H_i^T H_i from the hidden layer output matrix H_i;
Step 3-3-3: calculating the intermediate result V_i = H_i^T T_i from the hidden layer output matrix H_i and the classification result matrix T_i of the local training data set;
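Continuing that sketch, the slave-side computations of Steps 3-3-1 through 3-3-3 might look as follows. The sigmoid activation, the hidden-node count, and the redrawing of (W, b) to match the toy feature width are all assumptions; the claim fixes none of them:

```python
import numpy as np

# The master's broadcast parameters, redrawn here so the shapes match the
# toy X_i from the previous sketch (in a real run the master would have
# drawn W against the true feature dimensionality from the start).
L = 200
rng = np.random.default_rng(42)
W = rng.uniform(-1.0, 1.0, size=(X_i.shape[1], L))
b = rng.uniform(-1.0, 1.0, size=L)

def hidden_output(X, W, b):
    # H[j, k] = g(x_j . w_k + b_k): one row per microblog, one column per hidden node
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid activation (assumed)

H_i = hidden_output(X_i, W, b)   # Step 3-3-1: N_i x L hidden layer output matrix
U_i = H_i.T @ H_i                # Step 3-3-2: L x L intermediate result H_i^T H_i
V_i = H_i.T @ T_i                # Step 3-3-3: L x m intermediate result H_i^T T_i
# (U_i, V_i) is all this slave ever ships to the master controller.
```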
Step 3-4: the master controller receives and aggregates the intermediate results of the slave controllers, calculates the weight vector parameter β of the output nodes from the aggregated results according to the ELM calculation principle, and thereby obtains the microblog data classifier; β is computed as follows:
Step 3-4-1: merging the intermediate results U_i submitted by the slave controllers to obtain the aggregate result U = Σ U_i = Σ H_i^T H_i = H^T H;
Step 3-4-2: merging the intermediate results V_i submitted by the slave controllers to obtain the aggregate result V = Σ V_i = Σ H_i^T T_i = H^T T;
Step 3-4-3: calculating the output-node weight vector parameter β from the aggregated U and V:
β = (I/λ + H^T H)^{-1} H^T T = (I/λ + U)^{-1} V
where I is the identity matrix, λ is a user-specified parameter, and (·)^{-1} denotes matrix inversion;
the formula of the microblog data classifier is then determined as:
f(x) = h(x)β
where f(x) denotes the classification result of the microblog data to be classified and h(x) denotes its hidden layer output vector;
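The master-side aggregation of Step 3-4 then reduces to summing the submissions and solving one small linear system. This is a sketch under the same assumptions as above; submissions stands in for the (U_i, V_i) pairs received over the network, and a linear solve replaces the explicit inverse for numerical stability:

```python
import numpy as np

# Stand-in for the pairs received from the slave controllers; with the
# single-slave sketch above it is just [(U_i, V_i)].
submissions = [(U_i, V_i)]

U = sum(U for U, _ in submissions)   # U = sum_i H_i^T H_i = H^T H
V = sum(V for _, V in submissions)   # V = sum_i H_i^T T_i = H^T T

lam = 100.0                          # the user-specified parameter lambda
I = np.eye(U.shape[0])
beta = np.linalg.solve(I / lam + U, V)   # beta = (I/lambda + U)^{-1} V
```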
Step 4: automatically classifying microblog data;
the automatic classification of microblog data can proceed in two ways: in the first, the master controller continuously captures microblog data and directly outputs the classification results of the data to be classified by applying the microblog data classifier generated in Step 3; in the second, the master controller sends the classifier generated in Step 3 to each slave controller, and each slave controller applies it to classify its microblog data and obtain the classification results.
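Finally, a sketch of the first mode of Step 4, reusing vectorize, hidden_output, W, b, and beta from the sketches above; the captured text and its discarded placeholder label are invented:

```python
import numpy as np

# Classify one newly captured microblog with f(x) = h(x) beta.
x_new, _ = vectorize("rain and more rain today", "weather")  # label is a placeholder, ignored here
h_new = hidden_output(x_new[None, :], W, b)                  # 1 x L hidden layer output h(x)
scores = h_new @ beta                                        # f(x) = h(x) beta, one score per class
predicted = int(np.argmax(scores))                           # index of the predicted class
print(predicted)
```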
CN201210583886.8A 2012-12-28 2012-12-28 A kind of distributed sorter of massive micro-blog data and method Active CN103020712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210583886.8A CN103020712B (en) 2012-12-28 2012-12-28 A kind of distributed sorter of massive micro-blog data and method

Publications (2)

Publication Number Publication Date
CN103020712A CN103020712A (en) 2013-04-03
CN103020712B true CN103020712B (en) 2015-10-28

Family

ID=47969298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210583886.8A Active CN103020712B (en) 2012-12-28 2012-12-28 A kind of distributed sorter of massive micro-blog data and method

Country Status (1)

Country Link
CN (1) CN103020712B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593462B (en) * 2013-11-25 2017-02-15 中国科学院深圳先进技术研究院 Microblog-data-oriented flu epidemic surveillance analysis method and system
CN107045511B (en) * 2016-02-05 2021-03-02 阿里巴巴集团控股有限公司 Target feature data mining method and device
CN105760899B (en) * 2016-03-31 2019-04-05 大连楼兰科技股份有限公司 Training learning method and device based on distributed computing and detection cost sequence
CN107590134A (en) * 2017-10-26 2018-01-16 福建亿榕信息技术有限公司 Text sentiment classification method, storage medium and computer
CN109034366B (en) * 2018-07-18 2021-10-01 北京化工大学 Application of ELM integrated model based on multiple activation functions in chemical engineering modeling
CN109657061B (en) * 2018-12-21 2020-11-27 合肥工业大学 Integrated classification method for massive multi-word short texts
CN110381456B (en) * 2019-07-19 2020-10-02 珠海格力电器股份有限公司 Flow management system, flow threshold calculation method and air conditioning system
CN113177163B (en) * 2021-04-28 2022-08-02 烟台中科网络技术研究所 Method, system and storage medium for social dynamic information sentiment analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1185796A (en) * 1997-09-01 1999-03-30 Canon Inc Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method and storage medium
US20120189194A1 (en) * 2011-01-26 2012-07-26 Microsoft Corporation Mitigating use of machine solvable hips
CN102789498A (en) * 2012-07-16 2012-11-21 钱钢 Method and system for carrying out sentiment classification on Chinese comment text on basis of ensemble learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Huang Guangbin et al., "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 42, no. 2, Apr. 2012, pp. 513-529. *
Zhao Xiangguo et al., "ELM-based protein secondary structure prediction and its post-processing," Journal of Northeastern University (Natural Science), vol. 30, no. 10, Oct. 2009, pp. 1402-1405. *
Wang Lei et al., "Parallel extreme learning machine algorithm based on a binary cascade structure," Journal of Jilin University (Information Science Edition), vol. 30, no. 4, Jul. 2012, pp. 418-425. *

Similar Documents

Publication Publication Date Title
CN103020712B (en) A kind of distributed sorter of massive micro-blog data and method
US11436414B2 (en) Device and text representation method applied to sentence embedding
Kumar et al. Sentiment analysis of multimodal twitter data
Kontopoulos et al. Ontology-based sentiment analysis of twitter posts
CN110377740B (en) Emotion polarity analysis method and device, electronic equipment and storage medium
CN106105096A (en) System and method for continuous social communication
Mac Kim et al. Demographic inference on twitter using recursive neural networks
CN109241424A (en) A kind of recommended method
JP2020523699A (en) Generate point of interest copy
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN109033433B (en) Comment data emotion classification method and system based on convolutional neural network
CN103365867A (en) Method and device for emotion analysis of user evaluation
CN110046353B (en) Aspect level emotion analysis method based on multi-language level mechanism
CN111177559B (en) Text travel service recommendation method and device, electronic equipment and storage medium
Claster et al. Naïve Bayes and unsupervised artificial neural nets for Cancun tourism social media data analysis
CN109993583A (en) Information-pushing method and device, storage medium and electronic device
Kim et al. Text mining and sentiment analysis for predicting box office success
CN110706028A (en) Commodity evaluation emotion analysis system based on attribute characteristics
CN105447193A (en) Music recommending system based on machine learning and collaborative filtering
CN105760499A (en) Method for analyzing and predicting network public sentiment based on LDA topic model
CN112256866A (en) Text fine-grained emotion analysis method based on deep learning
CN102789449A (en) Method and device for evaluating comment text
CN103729431B (en) Massive microblog data distributed classification device and method with increment and decrement function
CN114443899A (en) Video classification method, device, equipment and medium
CN115294427A (en) Stylized image description generation method based on transfer learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220324

Address after: No. 5 Zhongguancun South Street, Haidian District, Beijing 100081

Patentee after: BEIJING INSTITUTE OF TECHNOLOGY

Address before: No. 11, Lane 3, Heping Road, Heping District, Shenyang, Liaoning 110819

Patentee before: Northeastern University