A public opinion hotspot prediction method based on big data
Technical field
When analyzing and publishing information, people generally do not know what content will arouse readers' interest, or which specific types of content readers are most eager to propagate. For example: at a press conference to be held tomorrow, which related topics will reporters raise; on news portals and social media platforms, which actively set agenda items will become the focus of propagation; at a particular time, which topics will become current public opinion hotspots; for a particular person, which matters concerning that person does the public frequently follow and discuss. Prediction of this kind usually relies on human knowledge and experience. It is proposed here to use computer technology to perform big data computation that assists in predicting, for a given time, object, department, person, or event, which topics will become public opinion hotspots that people follow, discuss, and propagate.
Background art
To meet this need, existing prediction methods are usually manual. Their process and results depend heavily on the knowledge and experience of the user, and their accuracy and repeatability are unstable. The present invention therefore proposes a public opinion hotspot prediction method based on big data, which assists in predicting public opinion hotspots at a given time node so that targeted agenda setting and countermeasures can be prepared. The present invention relates to information communication, computer technology, big data mining algorithms, and user content tendency modeling.
Summary of the invention
The technical problem to be solved by the present invention is how to discover user information needs and the laws of information propagation through big data analysis, and to predict whether a given topic will become a public opinion hotspot.
The prediction covers two kinds of object. In the first, topics are mined from massive historical data through big data analysis, and the likelihood that each extracted topic becomes a public opinion hotspot is predicted, determining whether the topic will become a hotspot at a given time node or during a given time period. In the second, for a topic actively set by the user, the association between the user information demand model and the information disclosure model is used, together with big data mining of historical data, to calculate how well the topic matches user content tendency and the laws of information propagation, and finally to determine whether the topic will become a public opinion hotspot at a given time node or during a given time period.
The technical solution adopted by the present invention to solve the above technical problem is as follows:
1. Build a big data storage structure, and use crawler technology, file format parsing, databases, and other data acquisition techniques to collect, deduplicate, parse, and store in structured form both information and information propagation data.
2. Use semantic analysis technology to preprocess the data, including word segmentation, word frequency statistics, sentiment computation, and topic extraction.
3. Using big data mining methods such as statistical analysis, association rules, time series analysis, clustering, and classification, analyze dimensions of the historical data such as users' content preferences, the propagation, content, and temporal characteristics of public opinion hotspots, and the user sentiment index, and build a big data analysis model, referred to as the user content tendency analysis model.
4. On the basis of step 3, mine the preprocessed historical data. Through time series analysis, count and cluster the topics that occurred at the same time nodes and time periods in history to obtain the topics with higher propagation heat at a given time node or period. Match these against the time series of user content tendency and calculate how well each topic matches users' current content tendency and propagation characteristics. If the matching degree reaches or exceeds a certain threshold, the topic extracted from the historical data is judged able to become a public opinion hotspot.
5. Establish a data input interface through which the user enters an actively set topic together with feature words for that topic.
6. Perform full-text retrieval over the massive historical data and compute similarity against the topic's feature words, extracting the historical content similar to the topic along with its propagation data and publication times. Through time series analysis, calculate the topic's heat value at specific historical time nodes, or for periodic propagation within a time period. If the periodic propagation heat exceeds a certain threshold, match the topic against user content tendency and propagation characteristics; when that match reaches a certain threshold, match it against the prediction time. If the time overlap exceeds a certain value, the topic is judged able to become a public opinion hotspot.
7. Compute the similarity between the actively set topic and the historical data. If the similarity within the same category reaches a certain threshold, match the topic against user content tendency and determine its degree of overlap with current users' content tendency and propagation characteristics. If the overlap exceeds a certain threshold, the topic is judged able to become a public opinion hotspot.
8. For topics judged to have low similarity in the previous step, extract related topics from the historical data and perform cluster analysis to calculate whether the topic belongs to the same category as a historical topic, and analyze whether new subtopics and new details have been added to the original topic. If so, match the new subtopics and details against user content tendency and propagation characteristics; if the match exceeds a certain threshold, the topic is judged able to become a public opinion hotspot.
9. Mine the content of users' public opinion hotspots in the historical data over a long time span, and calculate the development law and temporal variation of users' current content tendency. Compare the input topic and its feature words with user content tendency and propagation characteristics; if the overlap exceeds a certain threshold, the topic is judged able to become a public opinion hotspot.
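The periodic check described in step 6 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the calendar-month window, the use of a share count as the heat value, and the names periodic_heat, is_periodic_hotspot, heat_threshold, and min_year_ratio are all assumptions introduced here for illustration, since the specification leaves the window, the heat measure, and the thresholds unspecified.

```python
from collections import defaultdict
from datetime import date


def periodic_heat(records, month):
    """Sum share counts per year for items published in the given month.

    records: iterable of (publication_date, share_count) pairs for historical
    content already judged similar to the topic (hypothetical input format).
    """
    heat_by_year = defaultdict(int)
    for published, shares in records:
        if published.month == month:
            heat_by_year[published.year] += shares
    return dict(heat_by_year)


def is_periodic_hotspot(records, target, heat_threshold, min_year_ratio):
    """Return True if the topic was hot in the target month in enough past years.

    heat_threshold and min_year_ratio stand in for the unspecified
    "certain threshold" and "certain value" of step 6 (assumed parameters).
    """
    past = [(d, s) for d, s in records if d.year < target.year]
    years = {d.year for d, _ in past}
    if not years:
        return False
    heat = periodic_heat(past, target.month)
    hot_years = sum(1 for y in years if heat.get(y, 0) >= heat_threshold)
    return hot_years / len(years) >= min_year_ratio
```

For example, a topic that drew several hundred shares every March for three consecutive years would pass the check for a prediction date in the following March, while the same history would fail for a prediction date in August. In the method as described, a topic that passes would then still be matched against user content tendency and propagation characteristics before the final judgment.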
Compared with the prior art, the present invention has the following advantages:
1. This method overcomes the drawbacks of existing manual methods, namely low efficiency and accuracy that depends heavily on knowledge and experience. By implementing big data and semantic analysis technology as computer algorithms, it greatly improves speed, efficiency, and the range of applicable scenarios.
2. Through big data technology, this method collects and analyzes massive data, greatly expanding the sample data and cases available for analysis. It makes full use of the large number of cases accumulated historically to mine the various characteristics of user content tendency and of public opinion hotspot propagation, so that the model is more scientific and reasonable, the analysis results improve continuously, and a certain degree of accuracy is reached.
3. Through semantic analysis technology, this method performs fine-grained segmentation and topic extraction on historical data, covering more details of public opinion hotspots and analyzing users' content tendency within those hotspots more comprehensively, which gives better control over the granularity of prediction.
4. Data can be obtained through crawler technology and from other data sources, covering network data and other categories of data. Computer technology is used to collect data automatically, parse it intelligently, structure it comprehensively, and store it at scale, which provides broad coverage of information sources and an ample accumulation of analysis cases. This supplies the data reserve and the basis for algorithm learning and iteration needed to continuously improve prediction.
5. The prediction process is based on the user content tendency model and combines the dimensions of public opinion hotspot propagation, such as time, content, propagation, and user feedback, to analyze comprehensively the characteristics of wide propagation. This strengthens the comprehensive analysis of the individual and joint effects of multiple factors in the prediction judgment, so that the predictions are more accurate and closer to reality.
Brief description of the drawings
Figure 1 is the calculation flowchart of the method.
Embodiment
The main points of the public opinion hotspot prediction method based on big data according to the present invention include:
A. Establish an interactive window in which the user enters a topic and its related features, and which accepts text or files submitted by the user.
B. For different historical data sources, the incoming massive data can be preprocessed through crawlers, a file format parsing module, and databases to form structured storage, with support for finer-grained manual labeling. A big data architecture is introduced to provide mass data storage, automated data extraction, and stream computing, giving high-performance public opinion hotspot prediction.
C. Perform fine-grained segmentation and labeling of the historical data. Prediction is based on the user content tendency model and on the temporal, content, and propagation characteristics of public opinion propagation. The data therefore combine the information itself, its time, its publishing platform, user comments, replies, likes, and read counts with other data produced during propagation, such as the user coverage of the publishing platform, the propagation power of reposters, the content tendency of reposters, the propagation mode of the platform, and the overall mood of current users.
D. Use statistical and semantic analysis algorithms to perform word segmentation, part-of-speech tagging, and topic extraction on the historical data, preprocessing the data to form the basis for subsequent big data analysis.
E. Through techniques such as time series analysis, content analysis, topic mining, and clustering, analyze historical public opinion hotspots and their propagation processes at fine granularity, forming a system of factors that influence hotspot propagation, including user content tendency, temporal characteristics, propagation characteristics, and content characteristics. This builds the overall framework for prediction and yields certain rules and regularities that serve as the criteria for the prediction calculation.
F. During big data time series analysis, topics that recur with a certain period and match a particular user content tendency can be discovered from the historical data. Once such a topic also matches the temporal characteristics and wide-propagation characteristics of current propagation, it can become a focus of public opinion and break out within a certain time.
G. Periodic public opinion hotspot judgment. For a topic set by the user, a goodness of fit can be calculated in terms of periodic temporal regularity by comparing the topic to be predicted with the hotspots that occur periodically in the historical data. If this goodness of fit exceeds a certain threshold (C), and the topic's extracted propagation, temporal, and content characteristics reach a goodness of fit (K) with the hotspot within a certain time, then the topic matches the characteristics of a periodic public opinion hotspot and will occur and become a hotspot within a certain time.
H. Content-type public opinion hotspot judgment. The user inputs a topic and its related features, and the goodness of fit with the user content tendency characteristics of a certain period is calculated. In one case, the goodness of fit with a hotspot is high (C), and the topic readily becomes a public opinion hotspot. In the other case, the goodness of fit reaches a certain threshold (C) and cluster analysis places the topic in the same category as the hotspot topic, but the topic has new features (P). Such a topic has propagation novelty and can attract user attention and discussion; if it also fits the propagation characteristics, it can become a public opinion hotspot.
I. Propagation-type public opinion hotspot judgment. Society keeps developing, users keep changing, and their needs evolve accordingly. The information propagation model can be analyzed in real time from continuously replenished data to mine new propagation laws, new phenomena, and new topics that match user content tendency, together with their development laws. The topic input by the user is compared with the propagation characteristics to calculate its goodness of fit (D), and its novelty, innovativeness, attractiveness, and propagation power are analyzed to judge whether, by virtue of its freshness, it can win user attention and discussion and become a public opinion hotspot.
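The content-type judgment of point H can be sketched in Python as follows. This is one possible reading under stated assumptions, not the method as claimed. The specification does not say how the goodness of fit (C) or the new features (P) are measured, so the sketch substitutes cosine similarity over bags of feature words for the fit and the share of previously unseen feature words for novelty. The function names and the parameters c_high, c_min, and p_min are hypothetical, and the further check against propagation characteristics is omitted.

```python
import math
from collections import Counter


def cosine_fit(words_a, words_b):
    """Cosine similarity between two bags of feature words (0.0 to 1.0)."""
    va, vb = Counter(words_a), Counter(words_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty(topic_words, hotspot_words):
    """Share of the topic's distinct feature words absent from the hotspot."""
    distinct = set(topic_words)
    if not distinct:
        return 0.0
    return len(distinct - set(hotspot_words)) / len(distinct)


def judge_content_hotspot(topic_words, hotspot_words, c_high, c_min, p_min):
    """Apply the two cases of point H with assumed thresholds.

    Case 1: the fit is high (at least c_high).
    Case 2: the fit reaches c_min and the topic adds enough new features
    (novelty at least p_min).
    """
    fit = cosine_fit(topic_words, hotspot_words)
    if fit >= c_high:
        return True
    return fit >= c_min and novelty(topic_words, hotspot_words) >= p_min
```

Splitting threshold C into a higher and a lower value is a design choice made here to separate the two cases; the source uses the same label for both.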