CN102708176B - Microblog data mining method based on active users - Google Patents
- Publication number: CN102708176B
- Application number: CN201210140531.1A
- Authority: CN (China)
- Prior art keywords: user, microblogging, topic, real, users
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a microblog data mining method based on active users and belongs to the technical field of data mining in network microblogs. The method includes: firstly, regularly and randomly selecting a batch of new potential users, selectively screening to allow the new potential users to enter an effective mining user group, and continuously updating to generate new active users, so that data comprehensiveness is guaranteed; and secondly, re-screening the effective user group by real-time topic models generated during traditional data mining, and removing inactive users in the field, so that timeliness of users in the effective user group can be guaranteed constantly. By setting a user selecting and updating mechanism in advance, comprehensiveness and effectiveness of mined data can be guaranteed, and mined user groups of each topic are maintained and updated in real time. Further, effective user groups of real-time topic models are re-updated to guarantee timeliness of mined users.
Description
Technical field
The invention belongs to the technical field of data mining for online microblogs, and in particular relates to a microblog data mining method based on active users.
Background art
Microblogging, a new form of network application in the Web 2.0 era, not only suits the fast pace of modern life but also enables information sharing anytime and anywhere. It is a platform for sharing, disseminating, and obtaining information based on user relationships: users can build personal communities through the Web, WAP, and various clients, post updates of about 140 characters, and share them instantly.
Because it is real-time and easy to access, the microblog has rapidly become a new medium for spreading breaking news. Unlike traditional media, on a microblog platform everyone is an information publisher (the concept of "self-media") and can share information anytime and anywhere. Microblog users comment on and repost news events the moment they occur, often reacting and expressing opinions ahead of traditional media. Analysis based on real-time microblog data has therefore become a research direction worth attention.
From a data perspective, the microblog is a platform carrying an enormous volume of information, characterized by inconsistent data formats, heavy noise, and useful content that is hard to extract. Traditional topic detection methods cannot adapt to this new form and struggle to refine and detect bursting hot topics from massive data.
At present, microblog information mining is still at a fairly early stage. Most work stops at analyzing user relationships and community structure, and seldom analyzes real-time microblog content directly. Raw microblog data is mainly obtained in two ways: through the microblog's public API, or by parsing microblog user pages with a web crawler. According to the analysis in "Sina Weibo data mining scheme" by Lian Jie et al., both approaches have significant defects. With the public API, the service provider limits the call frequency and query scope of the API interface, and the API itself is far from complete, so the microblog data cannot be acquired comprehensively. With crawler-based parsing of individual users' microblog pages, the lack of any preselection mechanism makes user selection somewhat blind, which inevitably leads to false detections and missed detections. The article "Sina Weibo data mining scheme" (authors: Lian Jie, Zhou Xin, Cao Wei, Liu Yun), published in the Journal of Tsinghua University (Natural Science Edition), 2011, issue 10, belongs to this category.
Summary of the invention
To overcome the defects and deficiencies of the prior art, the invention provides a microblog data mining method based on active users. A batch of new potential users is periodically chosen at random and, after screening, admitted to the effective mining user group. The real-time topic model produced by the conventional data mining process is then used to filter the effective user group again, excluding users who are no longer active in a given field, so that the timeliness of the users in the effective group is always guaranteed.
To achieve the above object, the invention adopts the following technical solution:
A microblog data mining method based on active users, with the following steps:
1) A random number generator produces a batch of random user IDs every 10-30 minutes as the candidate microblog user group for mining. Taking Sina Weibo as an example, user IDs are 6 to 9 digits long, so the random number generator is correspondingly divided into four types: 6-digit, 7-digit, 8-digit, and 9-digit random numbers;
The rule by which the random number generator produces a 6- to 9-digit random number is as follows:
One digit is produced by RAND() % 10; digits are produced in order from the highest position to the lowest, repeating 6 to 9 times;
2) Personal information is collected for the users in the candidate set; the personal information comes from the URL of the user's personal microblog page, e.g. http://weibo.com/ID;
3) The collected personal information is screened against desired indicators, which include expert status, user interests, user region, frequency of use, number of microblog posts per day, number of reposts and comments per day, and average microblog reach; the candidate user group is updated accordingly;
4) The crawler-based microblog user page parsing method is used to dynamically crawl the personal microblog pages of the screened candidate user group, as the users' preselected raw data;
5) Using the recent real-time hot microblog topic model, the microblog posts are clustered; for clusters exceeding a preset threshold, the posts are included;
6) Taking each individual user in the candidate user group as the object of analysis, the number of that user's posts in each topic-field cluster is counted. If the number of the user's clustered posts in a field exceeds a threshold, the user is considered a valid user for that topic field and enters the user group mined for that topic;
7) At this point, processing and screening of this batch of randomly generated users is complete, and the users are added to the effective mining user set of each topic field;
8) Using the crawler-based page parsing method, the latest microblog pages of the valid users are mined at intervals of 10-30 minutes and fed into the conventional data mining flow;
9) The data pass through preprocessing steps, which include handling of the special microblog symbols @ and #, word-count limit filtering, repost/comment count threshold filtering, word segmentation, and keyword clustering; the real-time posts are then clustered to mine and generate real-time topics;
10) The real-time topic model is dynamically updated and used to select new user groups;
11) The existing valid user population is evaluated: if a user has produced no comments on or reposts of hot topics for three consecutive days, the user is removed from the effective mining user group.
Here, RAND() % 10 takes the remainder of a generated random number divided by 10, which yields a random digit in the range 0 to 9.
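The digit-by-digit rule in step 1) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `random_user_id` and `candidate_batch`, the `per_length` batch size, and the use of `randrange(2**31)` as a stand-in for RAND() are not from the patent.

```python
import random

def random_user_id(length, rng=random):
    """Build an ID of `length` decimal digits, highest digit first."""
    if not 6 <= length <= 9:
        raise ValueError("length must be between 6 and 9")
    # Each digit is an independent "random integer mod 10" draw.
    # randrange(2**31) is an assumed stand-in for the patent's RAND().
    digits = [rng.randrange(2**31) % 10 for _ in range(length)]
    return "".join(str(d) for d in digits)

def candidate_batch(per_length, rng=random):
    """Return one batch with `per_length` IDs for each of the four lengths."""
    return [random_user_id(n, rng)
            for n in range(6, 10)
            for _ in range(per_length)]
```

Because every position is drawn independently, an ID may begin with 0, and many generated IDs will not correspond to real accounts; the source does not say how such cases are handled, so the sketch leaves them to the later screening steps.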
URL stands for Uniform (or Universal) Resource Locator. It is also called a web address and is the standard address of a resource on the Internet.
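Steps 2) and 3) can be illustrated with a small sketch that builds the profile URL from an ID and screens a collected profile. The `Profile` fields, the `passes_screening` function, and both threshold values are hypothetical; the patent lists the indicators but gives no concrete thresholds, and only two of the indicators are used here.

```python
from dataclasses import dataclass

def profile_url(user_id: str) -> str:
    """Personal page address in the form given in step 2)."""
    return f"http://weibo.com/{user_id}"

@dataclass
class Profile:
    user_id: str
    posts_per_day: float             # number of posts per day
    reposts_comments_per_day: float  # reposts and comments per day

def passes_screening(p: Profile,
                     min_posts: float = 1.0,
                     min_interactions: float = 1.0) -> bool:
    """Keep a candidate only if both activity indicators meet the
    (assumed) minimum values."""
    return (p.posts_per_day >= min_posts
            and p.reposts_comments_per_day >= min_interactions)
```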
Through its mechanism for preselecting and updating users, the method of the invention guarantees the comprehensiveness and validity of the mined data and maintains and updates the mined user population of each topic in real time. In addition, the existing real-time topic model is used to update the valid user population again, guaranteeing the timeliness of the mined users.
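The per-topic admission rule of step 6) amounts to counting each user's posts per topic cluster and comparing the count with a threshold. A minimal sketch follows, assuming the posts have already been assigned to topic clusters; the function name `assign_valid_users` and the `(user_id, topic)` input shape are illustrative choices, not part of the patent.

```python
from collections import Counter, defaultdict

def assign_valid_users(clustered_posts, threshold):
    """clustered_posts: iterable of (user_id, topic) pairs, one per post
    that was placed in a topic cluster.
    Returns {topic: set of user_ids whose post count exceeds threshold}."""
    counts = Counter(clustered_posts)
    groups = defaultdict(set)
    for (user_id, topic), n in counts.items():
        if n > threshold:  # "exceeds the threshold" as in step 6)
            groups[topic].add(user_id)
    return dict(groups)
```

A user can be a valid user for several topic fields at once, since each topic is counted separately.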
Description of the drawings
Fig. 1 is a schematic flow diagram of the method of the invention, in which 1) to 11) are the steps of the method.
Embodiments
The invention is further described below with reference to the drawings and embodiments, but is not limited to them.
Embodiment 1:
A microblog data mining method based on active users, as shown in Fig. 1, with the following steps:
1) A random number generator produces a batch of random user IDs every 20 minutes as the candidate microblog user group for mining. Taking Sina Weibo as an example, user IDs are 6 to 9 digits long, so the random number generator is correspondingly divided into four types: 6-digit, 7-digit, 8-digit, and 9-digit random numbers;
The rule by which the random number generator produces a 6- to 9-digit random number is as follows:
One digit is produced by RAND() % 10; digits are produced in order from the highest position to the lowest, repeating 6 to 9 times;
2) Personal information is collected for the users in the candidate set; the personal information comes from the URL of the user's personal microblog page, e.g. http://weibo.com/ID;
3) The collected personal information is screened against desired indicators, which include expert status, user interests, user region, frequency of use, number of microblog posts per day, number of reposts and comments per day, and average microblog reach; the candidate user group is updated accordingly;
4) The crawler-based microblog user page parsing method is used to dynamically crawl the personal microblog pages of the screened candidate user group, as the users' preselected raw data;
5) Using the recent real-time hot microblog topic model, the microblog posts are clustered; for clusters exceeding a preset threshold, the posts are included;
6) Taking each individual user in the candidate user group as the object of analysis, the number of that user's posts in each topic-field cluster is counted. If the number of the user's clustered posts in a field exceeds a threshold, the user is considered a valid user for that topic field and enters the user group mined for that topic;
7) At this point, processing and screening of this batch of randomly generated users is complete, and the users are added to the effective mining user set of each topic field;
8) Using the crawler-based page parsing method, the latest microblog pages of the valid users are mined at intervals of 20 minutes and fed into the conventional data mining flow;
9) The data pass through preprocessing steps, which include handling of the special microblog symbols @ and #, word-count limit filtering, repost/comment count threshold filtering, word segmentation, and keyword clustering; the real-time posts are then clustered to mine and generate real-time topics;
10) The real-time topic model is dynamically updated and used to select new user groups;
11) The existing valid user population is evaluated: if a user has produced no comments on or reposts of hot topics for three consecutive days, the user is removed from the effective mining user group.
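Part of the preprocessing in step 9) can be sketched as below: handling the @ and # symbols, then filtering by length and by repost/comment count. The `#topic#` hashtag form, the regular expressions, the dictionary keys, and both default thresholds are assumptions made for illustration; word segmentation and keyword clustering are omitted.

```python
import re

MENTION = re.compile(r"@[\w-]+")    # assumed form of an @mention
HASHTAG = re.compile(r"#([^#]+)#")  # assumed #topic# hashtag form

def preprocess(post, min_chars=5, min_interactions=1):
    """post: dict with 'text', 'reposts', 'comments'.
    Returns a cleaned dict, or None if the post is filtered out."""
    # Repost/comment count threshold filter.
    if post["reposts"] + post["comments"] < min_interactions:
        return None
    # Record hashtag topics, then strip mentions and hashtag markers.
    topics = HASHTAG.findall(post["text"])
    text = MENTION.sub("", post["text"])
    text = HASHTAG.sub(r"\1", text).strip()
    # Length filter on the remaining text.
    if len(text) < min_chars:
        return None
    return {"text": text, "topics": topics}
```

Posts that survive these filters would then go on to segmentation and clustering, which depend on tools the source does not name.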
Embodiment 2:
Same as Embodiment 1, except that step 1) reads "a random number generator produces a batch of random user IDs every 20 minutes" and step 8) reads "using the crawler-based page parsing method, the latest microblog pages of the valid users are mined at intervals of 20 minutes and fed into the conventional data mining flow".
Embodiment 3:
Same as Embodiment 1, except that step 1) reads "a random number generator produces a batch of random user IDs every 30 minutes" and step 8) reads "using the crawler-based page parsing method, the latest microblog pages of the valid users are mined at intervals of 30 minutes and fed into the conventional data mining flow".
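All embodiments share the removal rule of step 11). A minimal sketch is given below, assuming the system records the time of each user's most recent comment on or repost of a hot topic; the function name `prune_inactive` and the reading of "three consecutive days" as a 72-hour window are assumptions.

```python
from datetime import datetime, timedelta

def prune_inactive(last_activity, now, max_idle=timedelta(days=3)):
    """last_activity: {user_id: datetime of the user's latest comment on
    or repost of a hot topic}.
    Returns the users that stay in the effective mining user group."""
    return {user: seen
            for user, seen in last_activity.items()
            if now - seen < max_idle}
```

Running this once per mining interval keeps the effective user group limited to users who were recently active on hot topics.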
Claims (1)
1. A microblog data mining method based on active users, comprising the following steps:
1) A random number generator produces a batch of random user IDs every 10-30 minutes as the candidate microblog user group for mining. Taking Sina Weibo as an example, user IDs are 6 to 9 digits long, so the random number generator is correspondingly divided into four types: 6-digit, 7-digit, 8-digit, and 9-digit random numbers;
The rule by which the random number generator produces a 6- to 9-digit random number is as follows:
One digit is produced by RAND() % 10; digits are produced in order from the highest position to the lowest, repeating 6 to 9 times;
2) Personal information is collected for the users in the candidate set; the personal information comes from the URL of the user's personal microblog page;
3) The collected personal information is screened against desired indicators, which include expert status, user interests, user region, frequency of use, number of microblog posts per day, number of reposts and comments per day, and average microblog reach; the candidate user group is updated accordingly;
4) The crawler-based microblog user page parsing method is used to dynamically crawl the personal microblog pages of the screened candidate user group, as the users' preselected raw data;
5) Using the recent real-time hot microblog topic model, the microblog posts are clustered; for clusters exceeding a preset threshold, the posts are included;
6) Taking each individual user in the candidate user group as the object of analysis, the number of that user's posts in each topic-field cluster is counted. If the number of the user's clustered posts in a field exceeds a threshold, the user is considered a valid user for that topic field and enters the user group mined for that topic;
7) At this point, processing and screening of this batch of randomly generated users is complete, and the users are added to the effective mining user set of each topic field;
8) Using the crawler-based page parsing method, the latest microblog pages of the valid users are mined at intervals of 10-30 minutes and fed into the conventional data mining flow;
9) The data pass through preprocessing steps, which include handling of the special microblog symbols @ and #, word-count limit filtering, repost/comment count threshold filtering, word segmentation, and keyword clustering; the real-time posts are then clustered to mine and generate real-time topics;
10) The real-time topic model is dynamically updated and used to select new user groups;
11) The existing valid user population is evaluated: if a user has produced no comments on or reposts of hot topics for three consecutive days, the user is removed from the effective mining user group.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210140531.1A CN102708176B (en) | 2012-05-08 | 2012-05-08 | Microblog data mining method based on active users |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210140531.1A CN102708176B (en) | 2012-05-08 | 2012-05-08 | Microblog data mining method based on active users |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102708176A CN102708176A (en) | 2012-10-03 |
CN102708176B true CN102708176B (en) | 2013-12-04 |
Family
ID=46900942
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210140531.1A Expired - Fee Related CN102708176B (en) | 2012-05-08 | 2012-05-08 | Microblog data mining method based on active users |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102708176B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103810169B (en) * | 2012-11-06 | 2018-01-09 | 腾讯科技(深圳)有限公司 | A kind of method and apparatus for excavating community domain expert |
CN102930029A (en) * | 2012-11-07 | 2013-02-13 | 北京网智天元科技有限公司 | Socialized search engine method and system |
CN103902566B (en) * | 2012-12-26 | 2018-04-24 | 中国科学院心理研究所 | A kind of personality Forecasting Methodology based on microblog users behavior |
CN103914491B (en) * | 2013-01-09 | 2017-11-17 | 腾讯科技(北京)有限公司 | To the data digging method and system of high-quality user-generated content |
CN103116605B (en) * | 2013-01-17 | 2016-02-10 | 上海交通大学 | A kind of microblog hot event real-time detection method based on monitoring subnet and system |
CN104102675A (en) * | 2013-04-15 | 2014-10-15 | 中国人民大学 | Method for detecting blogger interest community based on user relationship |
CN104252461B (en) | 2013-06-26 | 2017-12-05 | 国际商业机器公司 | Monitor the method and system of subject of interest |
CN103399968B (en) * | 2013-07-16 | 2016-08-10 | 中国科学院计算技术研究所 | A kind of micro-blog information acquisition method and system |
CN103345535B (en) * | 2013-07-26 | 2017-03-29 | 人民搜索网络股份公司 | A kind of microblog users method for digging and device |
CN103366018B (en) * | 2013-08-02 | 2017-11-03 | 人民搜索网络股份公司 | A kind of micro-blog information grasping means and device |
CN103488683B (en) * | 2013-08-21 | 2017-05-10 | 北京航空航天大学 | Microblog data management system and implementation method thereof |
CN103593398A (en) * | 2013-10-12 | 2014-02-19 | 北京奇虎科技有限公司 | Method and equipment for updating microblog user library |
CN103593397B (en) * | 2013-10-12 | 2018-10-09 | 北京奇虎科技有限公司 | A kind of method and apparatus of acquisition content of microblog |
CN103593399A (en) * | 2013-10-12 | 2014-02-19 | 北京奇虎科技有限公司 | Method and equipment for collecting microblog content according to microblog user library |
CN104618216B (en) * | 2013-11-05 | 2019-05-17 | 腾讯科技(北京)有限公司 | Information management method, equipment and system |
CN104699679B (en) * | 2013-12-04 | 2019-03-26 | 腾讯科技(北京)有限公司 | The method and system of user property in a kind of determining social network-i i-platform |
CN106095839B (en) * | 2016-06-03 | 2020-02-14 | 网智天元科技集团股份有限公司 | Method for extracting and processing specific film watching group data |
CN107870913B (en) * | 2016-09-23 | 2021-12-14 | 腾讯科技(深圳)有限公司 | Efficient time high expectation weight item set mining method and device and processing equipment |
CN108898428A (en) * | 2018-06-19 | 2018-11-27 | 努比亚技术有限公司 | A kind of terminal user enlivens determination method, server and the storage medium of index |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060020056A1 (en) * | 2004-07-23 | 2006-01-26 | Specialty Minerals (Michigan) Inc. | Method for improved melt flow rate fo filled polymeric resin |
CN102289447B (en) * | 2011-06-16 | 2013-04-10 | 北京亿赞普网络技术有限公司 | Website webpage evaluation system based on communication network message |
Also Published As
Publication number | Publication date |
---|---|
CN102708176A (en) | 2012-10-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102708176B (en) | Microblog data mining method based on active users | |
Cao et al. | Detecting spam urls in social media via behavioral analysis | |
Zhao | Web scraping | |
CN102779174B (en) | A kind of public opinion information display system and method | |
Chhabra et al. | Phi. sh/$ ocial: the phishing landscape through short urls | |
CN103116605B (en) | A kind of microblog hot event real-time detection method based on monitoring subnet and system | |
Narayanan et al. | Russian involvement and junk news during Brexit | |
CN103617169A (en) | Microblog hot topic extracting method based on Hadoop | |
Chowdhury et al. | On Twitter purge: a retrospective analysis of suspended users | |
CN102724059A (en) | Website operation state monitoring and abnormal detection based on MapReduce | |
CN109246064A (en) | Safe access control, the generation method of networkaccess rules, device and equipment | |
CN103152442A (en) | Detection and processing method and system for botnet domain names | |
CN103177076A (en) | Public sentiment monitoring system and method based on fixed point websites | |
CN110691080A (en) | Automatic tracing method, device, equipment and medium | |
Li et al. | PhishBox: An approach for phishing validation and detection | |
Cao et al. | Behavioral detection of spam URL sharing: posting patterns versus click patterns | |
CN103544165A (en) | Neologism mining method and system | |
Zhou et al. | Feature analysis of spammers in social networks with active honeypots: A case study of chinese microblogging networks | |
Flores et al. | Searching for spam: detecting fraudulent accounts via web search | |
CN104133908A (en) | Method, server, client and system for displaying or generating discussion box on page | |
Chen et al. | Cost-effective node monitoring for online hot eventdetection in sina weibo microblogging | |
CN104199947A (en) | Important person speech supervision and incidence relation excavating method | |
CN103853848A (en) | Method and device for establishing social monitoring subnetwork | |
CN110110188A (en) | A kind of network public-opinion monitoring system based on cloud computing technology | |
Wang et al. | Detection of compromised accounts for online social networks based on a supervised analytical hierarchy process |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20131204; Termination date: 20160508 |
CF01 | Termination of patent right due to non-payment of annual fee |