Summary of the Invention
The present invention aims to provide a public opinion analysis method, system, computer device and storage medium that address the following shortcomings of the background art. Comments deliberately written in unusual ways, for example by substituting homophones or inserting special characters as interference, make data hard to crawl, so that key information is missed and junk information is crawled indiscriminately. Because the collection of data sources is disturbed, a dynamic tendency cannot be accurately analyzed from the limited information that has been crawled. The analysis of the information also lacks professionalism, so risk cannot be perceived effectively and accurately for the high-risk groups who exploit internet marketing campaigns to obtain high-value rewards at low or even zero cost.
To achieve the above object, the present invention provides the following technical solution:
A public opinion analysis method, comprising the following steps:
S101: According to a predefined search strategy, search for and read web page files with a web crawler, and extract public opinion data from the web page files;
S102: Filter the extracted public opinion data to remove junk information;
S103: Sort and classify the filtered public opinion data, where the classification types include source, strong relevance, and posts by active users;
S104: Analyze and process the public opinion data in each classification result to obtain a public opinion analysis result that includes the origin, the emotional tone of public opinion, the network diffusion state, the development trend, the regional distribution information, the age range information, and the focus of attention;
S105: Display and output the public opinion analysis result obtained in step S104 in the form of charts and reports.
Preferably, the public opinion data includes the URL, title, time, author, source, body text, comments, click rate, number of replies, and number of reposts.
Preferably, in step S102, filtering the public opinion data includes: when a preset condition is triggered, determining that the public opinion data is junk information and filtering it out, where junk information = A || B || C || D, A = the Chinese text is shorter than 4 characters, B = a run of consecutive English letters is longer than 15, C = the text contains a blacklisted word, and D = the text contains any of the symbols *&^%$#@.
Preferably, in step S104, analyzing and processing the public opinion data in each classification result includes: S401: analyzing the crawl source to obtain the origin of the public opinion data; S402: performing sentiment analysis on the public opinion data within a statistical unit of time to obtain the emotional tone of public opinion; S403: analyzing whether each crawler source contains the public opinion event to obtain the network diffusion state of the event; S404: analyzing the frequency with which keywords occur per unit of time to obtain the development trend of the event; S405: analyzing the login IP addresses and age information of the users participating in the event to obtain the regional distribution information and age range information of the event; S406: analyzing the frequency with which words occur per unit of time to obtain the focus of attention.
Preferably, in step S402, performing sentiment analysis on the public opinion data within the statistical unit of time includes: using a dictionary-based sentiment analysis method built on a sentence weighting algorithm.
Preferably, the emotional tone of public opinion is happy, neutral, or angry, and the network diffusion state is early diffusion, mid diffusion, or late diffusion.
Preferably, in step S105, the chart is one or more of a pie chart, line chart, column chart, bar chart, area chart, scatter plot, and table, or a composite chart formed by overlaying two or more of a pie chart, line chart, column chart, bar chart, area chart, scatter plot, and table.
Based on the same technical concept, the present invention also provides a public opinion analysis system, which includes a crawler module, a filtering module, a classification module, an analysis module, and a display module.
The crawler module is configured to search for and read web page files with a web crawler according to a predefined search strategy, and to extract public opinion data from the web page files;
The filtering module is configured to filter the extracted public opinion data and remove junk information;
The classification module is configured to sort and classify the filtered public opinion data, where the classification types include source, strong relevance, and posts by active users;
The analysis module is configured to analyze and process the public opinion data in each classification result to obtain a public opinion analysis result that includes the origin, the emotional tone of public opinion, the network diffusion state, the development trend, the regional distribution information, the age range information, and the focus of attention;
The display module is configured to display and output, in the form of charts and reports, the public opinion analysis result obtained in step S104.
Based on the same technical concept, the present invention also provides a computer device that includes a memory and a processor. Computer-readable instructions are stored in the memory and, when executed by the processor, cause the processor to perform the steps of the above public opinion analysis method.
Based on the same technical concept, the present invention also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above public opinion analysis method.
With the above public opinion analysis method, system, computer device and storage medium, web page files are searched for and read by a web crawler according to a predefined search strategy, and public opinion data is extracted from them. The extracted public opinion data is filtered: when a preset condition is triggered, the data is determined to be junk information and filtered out, where junk information = A || B || C || D, A = the Chinese text is shorter than 4 characters, B = a run of consecutive English letters is longer than 15, C = the text contains a blacklisted word, and D = the text contains any of the symbols *&^%$#@. The filtered data is sorted and classified by source, strong relevance, and posts by active users. The data in each classification result is analyzed and processed to obtain a public opinion analysis result that includes the origin, the emotional tone of public opinion, the network diffusion state, the development trend, the regional distribution information, the age range information, and the focus of attention. The result obtained in step S104 is displayed and output in the form of charts and reports.
Compared with the prior art, the beneficial effects of the present invention are as follows. Data collection precision is improved: after the data is formatted and the word frequency of hot-topic vocabulary is actively increased, the accuracy of public opinion analysis improves. The tendency of concern is obtained: speech is collected from disorderly community forums in order to perceive people's tendencies of concern and emotional tendencies. Risk is perceived actively: across forums, official accounts, microblogs, and newspapers and periodicals, the posting activity of the top 50 members by post views or the top 50 members by post replies is tracked. These are the core members who exploit internet marketing campaigns to obtain high-value rewards at low or even zero cost, so tracking them makes it possible to predict in advance the main content of the next such exploitation by the high-risk group, to perceive risk actively, and to deal with risk points effectively as early as possible.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Figure 1, the present invention provides a technical solution:
A public opinion analysis method, comprising the following steps:
S101: According to a predefined search strategy, search for and read web page files with a web crawler, and extract public opinion data from the web page files.
Several forums are selected in which the high-risk groups who exploit internet marketing campaigns to obtain high-value rewards at low or even zero cost (so-called "wool pulling") are active. A targeted crawler is then written using an open-source crawler framework; this embodiment uses the Scrapy framework. The web crawler searches for and reads all posts and comments each day, and public opinion data is extracted from all the information in the forums, including the URL, title, time, author, source, body text, comments, number of reads, and number of replies.
A web crawler is a program or script that automatically captures information from the network according to certain rules. It traverses the resources of a website in depth and fetches them to local storage. Specifically, it analyzes each valid URL of the website and submits an HTTP request to obtain the corresponding result, then generates a local file and the corresponding log information. The crawl strategy is based mainly on hyperlinks and the mapping relationships between them and the corresponding web pages; it can be a depth-first search strategy, a breadth-first search strategy, or a focused search strategy.
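The breadth-first strategy mentioned above can be sketched as follows. This is a minimal illustration, not the crawler of the embodiment: the function name, the `get_links` callable, and the toy link graph are all assumptions made for the example, and a real crawler would issue HTTP requests where `get_links` is called.

```python
from collections import deque


def breadth_first_crawl_order(start_url, get_links, max_pages=100):
    """Return URLs in the order a breadth-first crawler would visit them.

    get_links is a hypothetical callable that returns the hyperlinks found
    on a page; a real crawler would fetch the page over HTTP here.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()          # FIFO queue gives breadth-first order
        order.append(url)
        for link in get_links(url):
            if link not in seen:       # never schedule the same URL twice
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph standing in for a forum (illustrative data only).
SITE = {
    "/index": ["/board/1", "/board/2"],
    "/board/1": ["/post/10", "/post/11"],
    "/board/2": ["/post/11", "/post/20"],
    "/post/10": [],
    "/post/11": [],
    "/post/20": [],
}

visit_order = breadth_first_crawl_order("/index", lambda url: SITE.get(url, []))
```

Replacing `popleft()` with `pop()` turns the queue into a stack and gives a depth-first style traversal instead.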
In this embodiment, besides Scrapy, the open-source crawler framework may also be PySpider, Nutch, Crawler4j, WebMagic, WebCollector, or another open-source crawler framework.
The title and body text essentially contain the information of the entire web page and are the focus of raw text data collection. The page publication time makes it convenient to retrieve content in chronological order when a public opinion event occurs. Netizen participation information, expressed as the number of comments, reposts, clicks and so on, can be used to analyze the degree of attention the public opinion receives.
In this embodiment, besides the URL, title, time, author, source, body text, comments, number of reads, and number of replies, the extracted public opinion data may also include other extractable information, such as the scan interval.
S102: Filter the extracted public opinion data to remove junk information.
The method of filtering the public opinion data may include:
When a preset condition is triggered, determining that the public opinion data is junk information and filtering it out, where junk information = A || B || C || D, A = the Chinese text is shorter than 4 characters, B = a run of consecutive English letters is longer than 15, C = the text contains a blacklisted word, and D = the text contains any of the symbols *&^%$#@.
This embodiment filters out useless information in the web page content, such as audio, video, and the markup language of the page itself, and keeps only the text content, which saves storage space. Information that does not meet the format requirements needs to be converted. This embodiment also filters comments deliberately written in unusual ways, such as those that substitute homophones or insert special characters as interference, as well as high-frequency words on the blacklist. It cleans out the noise data in which netizens make heavy use of homophones, near-homophones, and special characters to stand in for the intended words, along with the high-frequency blacklisted words, so that public opinion data on the network is obtained efficiently and correctly. This filtering method is effectively targeted at the high-risk groups who exploit internet marketing campaigns to obtain high-value rewards at low or even zero cost.
For example, netizens express themselves on the network in very casual and varied ways: digits, letters, and symbols are mixed in with Chinese characters; sentences and paragraphs are interrupted and incomplete; and there are many repeated phrases and short sentences. Someone may comment "zan zan zan" (a repeated "like"), "ddddddddddddddd", or "A&B". Text cleaning washes out such noise data.
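The junk rule A || B || C || D can be sketched as a single predicate. This is one possible reading under stated assumptions, not the embodiment's implementation: "Chinese length" is taken to mean the number of CJK characters in the text, rule B is taken as a run of more than 15 consecutive Latin letters, and the function name and the blacklist contents are hypothetical.

```python
import re

SPECIAL_SYMBOLS = set("*&^%$#@")                 # rule D, as listed in the text
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")        # assumption: basic CJK block only
LONG_LATIN_RUN = re.compile(r"[A-Za-z]{16,}")    # rule B: more than 15 letters in a row


def is_junk(text, blacklist=frozenset()):
    """Return True if any of the four conditions A, B, C, D holds."""
    a = len(CJK_CHAR.findall(text)) < 4              # A: fewer than 4 Chinese characters
    b = LONG_LATIN_RUN.search(text) is not None      # B: long run of English letters
    c = any(word in text for word in blacklist)      # C: contains a blacklisted word
    d = any(ch in SPECIAL_SYMBOLS for ch in text)    # D: contains a listed symbol
    return a or b or c or d


normal_comment = "这个活动的流量今天到账了"
```

Under this reading a comment with no Chinese characters at all is always junk by rule A, which may or may not be the intended behavior for mixed-language forums.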
S103: Sort and classify the filtered public opinion data, where the classification types include source, strong relevance, and posts by active users.
Source: distinguishes the publishing source of each piece of data. Strong relevance: sensitive words are matched and the data is labeled as strongly relevant or not. Active users: using accumulated data, the IDs of active users are matched and the data is labeled as coming from an active user.
For example, Xiao Zhang publishes on website A a post about wool pulling that contains the keyword "data traffic". Three pieces of information can then be labeled for this record in advance in the database: it comes from website A; it is strongly relevant data because it contains a preset sensitive keyword; and Xiao Zhang's ID is an active-user ID, so it is content published by an active user.
This classification method relies on one full crawl of the data, from which the sensitive keywords and active-user IDs are derived statistically. During subsequent collection of public opinion data, records can then be labeled and classified, which further increases the value of the public opinion data.
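The three labels described above can be attached to a record roughly as follows. This is a sketch under assumptions: the record fields (`url`, `text`, `author_id`), the function name, and the example values are hypothetical, and the source is taken to be the host name of the record's URL.

```python
from urllib.parse import urlparse


def tag_record(record, sensitive_words, active_ids):
    """Label one crawled record with source, strong relevance, and active-user flags."""
    return {
        # Source: which site published the record (host part of the URL).
        "source": urlparse(record["url"]).netloc,
        # Strong relevance: the text contains at least one preset sensitive keyword.
        "strongly_relevant": any(word in record["text"] for word in sensitive_words),
        # Active user: the author ID is in the statistically derived active-ID set.
        "from_active_user": record["author_id"] in active_ids,
    }


# Illustrative record modeled on the Xiao Zhang example (made-up values).
record = {
    "url": "https://a.example.com/post/1",
    "text": "免费流量活动",
    "author_id": "xiaozhang",
}
tags = tag_record(record, sensitive_words={"流量"}, active_ids={"xiaozhang"})
```

The sensitive-word set and the active-ID set would come from the statistics of the initial full crawl.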
S104: Analyze and process the public opinion data in each classification result to obtain a public opinion analysis result that includes the origin, the emotional tone of public opinion, the network diffusion state, the development trend, the regional distribution information, the age range information, and the focus of attention.
When the public opinion data in each classification result is analyzed and processed, professional analysis rules can be set for the forums in which the high-risk groups who exploit internet marketing campaigns to obtain high-value rewards at low or even zero cost are active, in order to learn the public opinion trends of the forum. These include which posts have a large number of replies, which posts are featured posts, whether the number of posts surges before holidays, which types of such exploitation the posts focus on, and which new platforms for such exploitation are becoming active.
In one embodiment, in step S104, professional analysis rules can be set for the blogs in which these high-risk groups are active, in order to learn the public opinion trends of the blogs. These include which blog posts have a large number of replies, which blog posts are featured posts, whether the number of blog posts surges before holidays, which types of exploitation of internet marketing campaigns for high-value rewards at low or even zero cost the blog posts focus on, and which new blogs active in such exploitation are emerging.
In one embodiment, in step S104, professional analysis rules can be set for the microblogs in which these high-risk groups are active, in order to learn the public opinion trends of the microblogs. These include which microblog posts have a large number of replies, which microblog posts are pinned, whether the number of microblog posts surges before holidays, which types of exploitation of internet marketing campaigns for high-value rewards at low or even zero cost the microblog posts focus on, and which new microblog home pages active in such exploitation are emerging.
In one embodiment, in step S104, professional analysis rules can be set for the WeChat official accounts in which these high-risk groups are active, in order to learn the public opinion trends of the official accounts. These include which official account articles have a large number of replies, whether the number of official account articles surges before holidays, which types of exploitation of internet marketing campaigns for high-value rewards at low or even zero cost the official accounts focus on, and which new WeChat official accounts active in such exploitation are emerging.
S401: Analyze the crawl source to obtain the origin of the public opinion data.
The URL is the basis for determining the source of a web page. Because a URL uniquely identifies a web page, websites are analyzed and counted by URL.
S402: Perform sentiment analysis on the public opinion data within a statistical unit of time to obtain the emotional tone of public opinion.
Specifically, a dictionary-based sentiment analysis method built on a sentence weighting algorithm can be used. The resulting emotional tone of public opinion is happy, neutral, or angry.
For example, a sentiment dictionary, a table of degree adverbs, summary words, association words, negation words and so on are preset; each word is assigned a corresponding sentiment weight, and the final sentiment of a sentence is derived from the word sentiment weights.
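A dictionary-based sentence score of this kind might look like the sketch below. Everything specific in it is an assumption for illustration: the tiny English word lists, the weights, the rule that degree adverbs and negation words modify the next sentiment word, and the threshold that maps a score to happy, neutral, or angry. Summary words and association words are left out, and the input is assumed to be already segmented into tokens.

```python
# Hypothetical dictionaries; a real system would load much larger ones.
SENTIMENT = {"good": 1.0, "great": 2.0, "bad": -1.0, "scam": -2.0}
DEGREE = {"very": 2.0, "slightly": 0.5}
NEGATION = {"not", "never"}


def sentence_score(tokens):
    """Sum the sentiment weights, scaled by preceding degree adverbs and negations."""
    score = 0.0
    multiplier = 1.0
    for token in tokens:
        if token in DEGREE:
            multiplier *= DEGREE[token]       # intensify or soften the next sentiment word
        elif token in NEGATION:
            multiplier *= -1.0                # flip the polarity of the next sentiment word
        elif token in SENTIMENT:
            score += multiplier * SENTIMENT[token]
            multiplier = 1.0                  # modifiers apply to one sentiment word only
    return score


def emotional_tone(tokens, threshold=0.5):
    """Map a sentence score to one of the three tones named in the text."""
    score = sentence_score(tokens)
    if score > threshold:
        return "happy"
    if score < -threshold:
        return "angry"
    return "neutral"
```

For Chinese text, a word segmentation step would be needed before scoring; that step is outside this sketch.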
S403: Analyze whether each crawler source contains the public opinion event to obtain the network diffusion state of the event.
The network diffusion state is early diffusion, mid diffusion, or late diffusion:
Early diffusion: the event has only just begun to appear in the public opinion data and is not found in more than half of the data sources;
Mid diffusion: the event appears in more than half of the data sources, and the data volume is rising markedly;
Late diffusion: the event appears in more than half of the data sources, and the growth in data volume is almost 0.
This diffusion analysis is not necessarily applicable to public opinion events about exploiting internet marketing campaigns to obtain high-value rewards at low or even zero cost unless the event is especially large.
For example, a mobile data giveaway campaign run by an operator goes wrong and is exploited by the high-risk groups who exploit internet marketing campaigns to obtain high-value rewards at low or even zero cost. They grab large amounts of mobile data and spread the news in five places, A, B, C, D, and E. Assuming that 8 data sources are currently being collected and that 5 new items of this information were added today, the event appears in more than half of the data sources and the data volume is rising markedly, so the state is determined to be mid diffusion.
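The three diffusion rules can be written as a small decision function. This is a sketch under assumptions: the function and parameter names are hypothetical, and because the text does not quantify a marked rise, the `min_growth` threshold (at least one new item in the period) is an illustrative choice.

```python
def diffusion_stage(sources_with_event, total_sources, new_items, min_growth=1):
    """Classify the network diffusion state of an event.

    sources_with_event: number of monitored data sources in which the event appears
    total_sources:      number of monitored data sources
    new_items:          items about the event added in the current period
    min_growth:         assumed threshold for a marked rise in data volume
    """
    # Early: the event is not yet present in more than half of the sources.
    if sources_with_event * 2 <= total_sources:
        return "early"
    # Mid: widespread and still growing.
    if new_items >= min_growth:
        return "mid"
    # Late: widespread but growth is (almost) zero.
    return "late"


# The operator example: found in 5 of 8 sources, 5 new items today.
stage = diffusion_stage(sources_with_event=5, total_sources=8, new_items=5)
```

With exactly half of the sources containing the event, this sketch still reports early diffusion, since the rules require more than half.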
S404: Analyze the frequency with which keywords occur per unit of time to obtain the development trend of the public opinion event.
The development trend is obtained from the difference between the frequency with which a word occurs at the end of a specific time period and its frequency at the start. If the difference is positive, the event is on a rising trend; if it is negative, the event is on a declining trend; if it is 0, the event is on a stable trend.
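The sign rule for the trend can be sketched in a few lines. The function name and the input shape are assumptions: the input is taken to be the keyword's frequency in each consecutive unit of time within the period, and only the first and last values are compared, as the text describes.

```python
def development_trend(frequency_per_unit):
    """Classify the trend from keyword frequencies per unit of time.

    frequency_per_unit: counts for consecutive time units, oldest first.
    Only the end-minus-start difference is used.
    """
    difference = frequency_per_unit[-1] - frequency_per_unit[0]
    if difference > 0:
        return "rising"
    if difference < 0:
        return "declining"
    return "stable"


# Illustrative daily counts of one keyword over a week (made-up numbers).
weekly_counts = [12, 15, 30, 41, 38, 52, 60]
trend = development_trend(weekly_counts)
```

Because only the endpoints are compared, a spike in the middle of the period does not affect the result; smoothing or a fitted slope would be a different technique from the one described.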
S405, analysis participate in the login IP and age information of the user of the public sentiment event, obtain public sentiment event generation
The distributed intelligence of place region and age bracket range information.
For example, IP 116.238.88.116, age are 19 years old, and analysis obtains regional information as Shanghai and age model
Segment information is enclosed for 19 years old.
S406: Analyze the frequency with which words occur per unit of time to obtain the focus of attention.
The basis is the top 10 words by number of occurrences per unit of time (excluding words of no interest, such as "WeChat" and "mobile phone"), which are taken as the current hot topics of the high-risk groups who exploit internet marketing campaigns to obtain high-value rewards at low or even zero cost.
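Selecting the top words while excluding words of no interest can be sketched with the standard library's `collections.Counter`. The function name, the exclusion set, and the sample tokens are assumptions for illustration, and the input is assumed to be already segmented into words.

```python
from collections import Counter


def focus_of_attention(tokens, excluded=frozenset(), top_n=10):
    """Return the top_n most frequent words, skipping excluded ones."""
    counts = Counter(token for token in tokens if token not in excluded)
    return [word for word, _ in counts.most_common(top_n)]


# Made-up tokens for one unit of time.
tokens = ["流量"] * 3 + ["微信"] * 5 + ["话费"] * 2 + ["红包"]
hot_words = focus_of_attention(tokens, excluded={"微信", "手机"}, top_n=2)
```

Words with equal counts are returned in the order they were first seen, which is how `Counter.most_common` breaks ties.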
S105: Display and output the public opinion analysis result obtained in step S104 in the form of charts and reports.
When the public opinion analysis result for the high-risk groups who exploit internet marketing campaigns to obtain high-value rewards at low or even zero cost is displayed, some screening rules in the analysis module are called, and a complete public opinion analysis report on these high-risk groups is assembled and presented in the display module.
The public opinion analysis result obtained in step S104 is displayed and output in the form of charts and reports, and includes the origin, the emotional tone of public opinion, the network diffusion state, the development trend, the regional distribution information, the age range information, and the focus of attention.
Specifically, the chart in step S105 is one or more of a pie chart, line chart, column chart, bar chart, area chart, scatter plot, and table, or a composite chart formed by overlaying two or more of a pie chart, line chart, column chart, bar chart, area chart, scatter plot, and table.
An illustrative public opinion state comparison table is shown in Table 1.
Table 1:
In this embodiment, predictions of the network public opinion heat trend over the next 3 to 5 days have a small prediction error and a good prediction effect.
Based on the same technical concept, an embodiment of the present invention further provides a public opinion analysis system. As shown in Figure 2, the system includes a crawler module, a filtering module, a classification module, an analysis module, and a display module.
Specifically, the crawler module is configured to search for and read web page files with a web crawler according to a predefined search strategy, and to extract public opinion data from the web page files;
Specifically, the filtering module is configured to filter the extracted public opinion data and remove junk information;
Specifically, the classification module is configured to sort and classify the filtered public opinion data, where the classification types include source, strong relevance, and posts by active users;
Specifically, the analysis module is configured to analyze and process the public opinion data in each classification result to obtain a public opinion analysis result that includes the origin, the emotional tone of public opinion, the network diffusion state, the development trend, the regional distribution information, the age range information, and the focus of attention;
Specifically, the display module is configured to display and output, in the form of charts and reports, the public opinion analysis result obtained in step S104.
Based on the same technical concept, an embodiment of the present invention further provides a computer device. The computer device includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor. When the processor executes the computer program, the steps of the above public opinion analysis method are implemented: according to a predefined search strategy, searching for and reading web page files with a web crawler, and extracting public opinion data from the web page files; filtering the extracted public opinion data to remove junk information; sorting and classifying the filtered public opinion data, where the classification types include source, strong relevance, and posts by active users; analyzing and processing the public opinion data in each classification result to obtain a public opinion analysis result that includes the origin, the emotional tone of public opinion, the network diffusion state, the development trend, the regional distribution information, the age range information, and the focus of attention; and displaying and outputting the public opinion analysis result obtained in step S104 in the form of charts and reports.
Specifically, the public opinion data includes the URL, title, time, author, source, body text, comments, click rate, number of replies, and number of reposts.
Specifically, in step S102, filtering the extracted public opinion data includes: when a preset condition is triggered, determining that the public opinion data is junk information and filtering it out, where junk information = A || B || C || D, A = the Chinese text is shorter than 4 characters, B = a run of consecutive English letters is longer than 15, C = the text contains a blacklisted word, and D = the text contains any of the symbols *&^%$#@.
Specifically, in step S104, analyzing and processing the public opinion data in each classification result includes:
S401: Analyze the crawl source to obtain the origin of the public opinion data;
S402: Perform sentiment analysis on the public opinion data within a statistical unit of time to obtain the emotional tone of public opinion;
S403: Analyze whether each crawler source contains the public opinion event to obtain the network diffusion state of the event;
S404: Analyze the frequency with which keywords occur per unit of time to obtain the development trend of the event;
S405: Analyze the login IP addresses and age information of the users participating in the event to obtain the regional distribution information and the age range information of the event;
S406: Analyze the frequency with which words occur per unit of time to obtain the focus of attention.
Further, in step S402, performing sentiment analysis on the public opinion data within the statistical unit of time includes: using a dictionary-based sentiment analysis method built on a sentence weighting algorithm.
Further, the emotional tone of public opinion is happy, neutral, or angry, and the network diffusion state is early diffusion, mid diffusion, or late diffusion.
Specifically, in step S105, the chart is one or more of a pie chart, line chart, column chart, bar chart, area chart, scatter plot, and table, or a composite chart formed by overlaying two or more of a pie chart, line chart, column chart, bar chart, area chart, scatter plot, and table.
Based on the same technical concept, an embodiment of the present invention further provides a storage medium storing computer-readable instructions. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the steps of the above public opinion analysis method: according to a predefined search strategy, searching for and reading web page files with a web crawler, and extracting public opinion data from the web page files; filtering the extracted public opinion data to remove junk information; sorting and classifying the filtered public opinion data, where the classification types include source, strong relevance, and posts by active users; analyzing and processing the public opinion data in each classification result to obtain a public opinion analysis result that includes the origin, the emotional tone of public opinion, the network diffusion state, the development trend, the regional distribution information, the age range information, and the focus of attention; and displaying and outputting the public opinion analysis result obtained in step S104 in the form of charts and reports.
Specifically, the public opinion data includes the URL, title, time, author, source, body text, comments, click rate, number of replies, and number of reposts.
Specifically, in step S102, filtering the extracted public opinion data includes: when a preset condition is triggered, determining that the public opinion data is junk information and filtering it out, where junk information = A || B || C || D, A = the Chinese text is shorter than 4 characters, B = a run of consecutive English letters is longer than 15, C = the text contains a blacklisted word, and D = the text contains any of the symbols *&^%$#@.
Specifically, in step S104, analyzing and processing the public opinion data in each classification result includes:
S401: Analyze the crawl source to obtain the origin of the public opinion data;
S402: Perform sentiment analysis on the public opinion data within a statistical unit of time to obtain the emotional tone of public opinion;
S403: Analyze whether each crawler source contains the public opinion event to obtain the network diffusion state of the event;
S404: Analyze the frequency with which keywords occur per unit of time to obtain the development trend of the event;
S405: Analyze the login IP addresses and age information of the users participating in the event to obtain the regional distribution information and the age range information of the event;
S406: Analyze the frequency with which words occur per unit of time to obtain the focus of attention.
Further, in step S402, performing sentiment analysis on the public opinion data within the statistical unit of time includes: using a dictionary-based sentiment analysis method built on a sentence weighting algorithm.
Further, the emotional tone of public opinion is happy, neutral, or angry, and the network diffusion state is early diffusion, mid diffusion, or late diffusion.
Specifically, in step S105, the chart is one or more of a pie chart, line chart, column chart, bar chart, area chart, scatter plot, and table, or a composite chart formed by overlaying two or more of a pie chart, line chart, column chart, bar chart, area chart, scatter plot, and table.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.