CN102708176B - Microblog data mining method based on active users - Google Patents

Publication number
CN102708176B
Authority
CN
China
Prior art keywords
user
microblogging
topic
real
users
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210140531.1A
Other languages
Chinese (zh)
Other versions
CN102708176A (en)
Inventor
江铭炎
王伟
袁东风
宋玉川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-05-08
Filing date
2012-05-08
Publication date
2013-12-04

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a microblog data mining method based on active users and belongs to the technical field of data mining in network microblogs. The method includes: firstly, regularly and randomly selecting a batch of new potential users, selectively screening to allow the new potential users to enter an effective mining user group, and continuously updating to generate new active users, so that data comprehensiveness is guaranteed; and secondly, re-screening the effective user group by real-time topic models generated during traditional data mining, and removing inactive users in the field, so that timeliness of users in the effective user group can be guaranteed constantly. By setting a user selecting and updating mechanism in advance, comprehensiveness and effectiveness of mined data can be guaranteed, and mined user groups of each topic are maintained and updated in real time. Further, effective user groups of real-time topic models are re-updated to guarantee timeliness of mined users.

Description

Microblog data mining method based on active users
Technical field
The invention belongs to the technical field of data mining in network microblogs, and in particular relates to a microblog data mining method based on active users.
Background art
Microblogging, a new form of network application of the Web 2.0 era, suits the fast pace of modern life and allows information to be shared anytime and anywhere. It is a platform for sharing, disseminating and obtaining information based on user relationships: users can set up personal communities through the Web, WAP and various clients, post updates of about 140 characters, and share them instantly.
Because it is real-time and easy to access, the microblog has rapidly become a new medium for spreading breaking news. Unlike traditional media, on a microblog platform everyone is an information publisher (the concept of "self-media") and can share information anytime and anywhere. Microblog users comment on and forward a news event as soon as it occurs, often reacting and expressing views before traditional media do. Analysis based on real-time microblog data has therefore become a research direction worth attention.
From a data perspective, the microblog is a platform with a huge volume of information, characterized by disordered data formats, heavy noise, and valuable content that is hard to extract. Traditional topic detection methods cannot adapt to this new form and have difficulty refining and detecting sudden hot topics from massive data.
At present, microblog information mining is at a fairly early stage. Most work stops at the analysis of user relationships and community structure, and seldom analyzes real-time microblog content directly. Raw microblog data is mainly obtained in two ways: through the open API of the microblog platform, and by web-crawler-based parsing of microblog user pages. According to the analysis in "Sina Weibo data mining scheme" by Lian Jie et al., both have fairly obvious defects. The API approach cannot obtain microblog data comprehensively, because the API provider limits call frequency and query scope and the API itself is far from complete. The crawler approach, which parses the microblog pages of individual users, lacks a prior selection mechanism, so user selection is somewhat blind and false detections and missed detections are likewise unavoidable. The article "Sina Weibo data mining scheme" (authors: Lian Jie, Zhou Xin, Cao Wei, Liu Yun), published in the Journal of Tsinghua University (Science and Technology), 2011, issue 10, belongs to this category.
Summary of the invention
To overcome the defects and deficiencies of the prior art, the invention provides a microblog data mining method based on active users. A batch of new potential users is chosen at random at regular intervals and, after screening, enters the effective mining user group. The real-time topic model produced by the conventional data mining process is then used to filter the effective user group again and remove users who are no longer active in a field, so that the timeliness of the users in the effective group is always guaranteed.
To achieve the above object, the invention adopts the following technical scheme:
A microblog data mining method based on active users, with the following steps:
1) Every 10-30 minutes, a random number generator produces a batch of random user IDs, which form the candidate microblog user group for mining. Taking Sina Weibo as an example, user IDs are 6 to 9 digits long, so the random number generator is correspondingly divided into four types: 6-digit, 7-digit, 8-digit and 9-digit random numbers;
The rule by which the random number generator produces a 6- to 9-digit random number is as follows:
One digit is produced by RAND() % 10; digits are produced in turn from the high-order position to the low-order position, repeating 6 to 9 times;
2) Personal information is collected for the users in the candidate set; the personal information comes from the URL of the personal microblog page, such as http://weibo.com/ID;
3) The collected personal information is screened against preset indicators, and the candidate user group is updated. The preset indicators include whether the user holds expert status, the user's interests, the user's region, frequency of use, number of microblogs per day, number of forwards and comments per day, and average microblog reach;
4) A web-crawler-based method for parsing microblog user pages is used to dynamically crawl the personal microblog pages of the screened candidate user group, as the users' preselected raw data;
5) Using the recent real-time hot microblog topic model, cluster analysis is performed on the microblogs; where the preset threshold for a cluster is exceeded, the microblog is included in that cluster;
6) Taking each individual user in the candidate user group as the object of analysis, the number of that user's microblogs in each topic-field cluster is counted; if the number of the user's clustered microblogs in a field exceeds a threshold, the user is considered a valid user of that topic field and enters the user group mined for that topic;
7) At this point, processing and screening of this batch of randomly generated users is complete, and the users are added to the effective mining user set of each topic field;
8) Using the web-crawler-based page parsing method, the latest microblog pages of the valid users are mined at intervals of 10-30 minutes and enter the conventional data mining process;
9) After preprocessing steps, which comprise handling of the special microblog symbols @ and #, filtering by word-count limit, filtering by a threshold on the number of forwards and comments, word segmentation, and keyword clustering, cluster analysis is performed on the real-time microblogs to mine and generate real-time topics;
10) The real-time topic model is updated dynamically and used to select the new user group;
11) The existing valid user population is evaluated; if a user has produced no comment or forward on a hot topic for three consecutive days, the user is removed from the effective mining user group.
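Step 6) can be illustrated with a minimal sketch. It assumes the clustering of step 5) has already labeled each microblog with a topic field. The function name `assign_valid_users`, the `(user_id, topic_field)` pair format and the threshold value are illustrative assumptions, not part of the patent text.

```python
from collections import Counter, defaultdict


def assign_valid_users(clustered_posts, threshold):
    """Group users by the topic fields in which they are 'valid'.

    clustered_posts: iterable of (user_id, topic_field) pairs, one pair per
        microblog that step 5) placed in a topic-field cluster (assumed format).
    threshold: a user's count in a field must exceed this value.
    Returns {topic_field: set of user_ids}.
    """
    # Count each user's microblogs per topic field.
    counts = Counter(clustered_posts)
    valid = defaultdict(set)
    for (user_id, topic_field), n in counts.items():
        if n > threshold:
            valid[topic_field].add(user_id)
    return dict(valid)
```

A user can be valid in several fields at once, or in none, in which case the user does not enter any mined group.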
The above RAND() % 10 means taking the remainder of the generated random number divided by 10, which yields a random number in the range 0 to 9.
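The digit-by-digit rule can be sketched as follows. This is a minimal illustration, not the patent's implementation: Python's `random.Random.randrange(10)` stands in for RAND() % 10, and the names `random_user_id`, `candidate_batch` and the `per_length` batch size are assumptions.

```python
import random


def random_user_id(num_digits, rng):
    """Build one ID digit by digit, from the high-order to the low-order position."""
    if not 6 <= num_digits <= 9:
        raise ValueError("the text describes user IDs of 6 to 9 digits")
    # rng.randrange(10) plays the role of RAND() % 10: one digit in 0..9.
    # As in the stated rule, the leading digit may be 0.
    return "".join(str(rng.randrange(10)) for _ in range(num_digits))


def candidate_batch(per_length, seed=None):
    """Produce one batch with IDs of each of the four lengths (6, 7, 8, 9 digits)."""
    rng = random.Random(seed)
    return [random_user_id(n, rng) for n in (6, 7, 8, 9) for _ in range(per_length)]
```

Many randomly generated IDs will not correspond to a real account, which is why the later profile collection and screening steps are needed.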
The above URL is the abbreviation of Uniform/Universal Resource Locator, also called a web page address; it is the standard address of a resource on the Internet.
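Steps 2) and 3) can be sketched as forming the profile URL and then screening the collected profiles. The `Profile` fields cover only some of the listed indicators, and every threshold below is a placeholder chosen for illustration; the patent does not give concrete values.

```python
from dataclasses import dataclass


def profile_url(user_id):
    """Profile address in the form given in step 2): http://weibo.com/ID."""
    return f"http://weibo.com/{user_id}"


@dataclass
class Profile:
    user_id: str
    region: str
    posts_per_day: float
    interactions_per_day: float  # forwards plus comments per day


def screen_candidates(profiles, min_posts_per_day=1.0,
                      min_interactions_per_day=1.0, regions=None):
    """Keep the candidates that meet every indicator (placeholder thresholds)."""
    kept = []
    for p in profiles:
        if p.posts_per_day < min_posts_per_day:
            continue
        if p.interactions_per_day < min_interactions_per_day:
            continue
        if regions is not None and p.region not in regions:
            continue
        kept.append(p)
    return kept
```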
By means of a prior user selection and update mechanism, the method of the invention guarantees the comprehensiveness and validity of the mined data and maintains and updates, in real time, the user population mined for each topic. In addition, the existing real-time topic model is used to update the valid user population again, guaranteeing the timeliness of the mined users.
Description of the drawings
Fig. 1 is a schematic flow chart of the method of the invention, in which 1)-11) are the steps of the method.
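The timeliness rule of step 11) can be sketched as below. One assumption is made loudly: "three consecutive days" is approximated as a rolling 72-hour window since the user's last comment or forward on a hot topic. The name `prune_inactive` and the dictionary input format are illustrative.

```python
from datetime import datetime, timedelta

# Assumption: "three consecutive days" is treated as a rolling 72-hour window.
INACTIVITY_LIMIT = timedelta(days=3)


def prune_inactive(last_hot_topic_activity, now):
    """Split users into those kept and those removed from the valid group.

    last_hot_topic_activity: {user_id: datetime of the user's latest comment
        or forward on a hot topic} (assumed bookkeeping format).
    Returns (kept_dict, removed_set).
    """
    kept, removed = {}, set()
    for user_id, last_seen in last_hot_topic_activity.items():
        if now - last_seen >= INACTIVITY_LIMIT:
            removed.add(user_id)
        else:
            kept[user_id] = last_seen
    return kept, removed
```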
Embodiments
The invention is further described below with reference to the drawing and embodiments, but is not limited to them.
Embodiment 1:
A microblog data mining method based on active users, as shown in Fig. 1, with the following steps:
1) Every 20 minutes, a random number generator produces a batch of random user IDs, which form the candidate microblog user group for mining. Taking Sina Weibo as an example, user IDs are 6 to 9 digits long, so the random number generator is correspondingly divided into four types: 6-digit, 7-digit, 8-digit and 9-digit random numbers;
The rule by which the random number generator produces a 6- to 9-digit random number is as follows:
One digit is produced by RAND() % 10; digits are produced in turn from the high-order position to the low-order position, repeating 6 to 9 times;
2) Personal information is collected for the users in the candidate set; the personal information comes from the URL of the personal microblog page, such as http://weibo.com/ID;
3) The collected personal information is screened against preset indicators, and the candidate user group is updated. The preset indicators include whether the user holds expert status, the user's interests, the user's region, frequency of use, number of microblogs per day, number of forwards and comments per day, and average microblog reach;
4) A web-crawler-based method for parsing microblog user pages is used to dynamically crawl the personal microblog pages of the screened candidate user group, as the users' preselected raw data;
5) Using the recent real-time hot microblog topic model, cluster analysis is performed on the microblogs; where the preset threshold for a cluster is exceeded, the microblog is included in that cluster;
6) Taking each individual user in the candidate user group as the object of analysis, the number of that user's microblogs in each topic-field cluster is counted; if the number of the user's clustered microblogs in a field exceeds a threshold, the user is considered a valid user of that topic field and enters the user group mined for that topic;
7) At this point, processing and screening of this batch of randomly generated users is complete, and the users are added to the effective mining user set of each topic field;
8) Using the web-crawler-based page parsing method, the latest microblog pages of the valid users are mined at 20-minute intervals and enter the conventional data mining process;
9) After preprocessing steps, which comprise handling of the special microblog symbols @ and #, filtering by word-count limit, filtering by a threshold on the number of forwards and comments, word segmentation, and keyword clustering, cluster analysis is performed on the real-time microblogs to mine and generate real-time topics;
10) The real-time topic model is updated dynamically and used to select the new user group;
11) The existing valid user population is evaluated; if a user has produced no comment or forward on a hot topic for three consecutive days, the user is removed from the effective mining user group.
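Part of the preprocessing in step 9) can be sketched with regular expressions. The sketch covers only symbol handling and the two filters; word segmentation and keyword clustering are left out. It assumes Weibo-style `#topic#` markers and `@name` mentions, and the names `preprocess`, `min_chars` and `min_interactions`, along with their default values, are placeholders rather than values from the patent.

```python
import re

MENTION = re.compile(r"@[\w-]+")    # @name mentions (assumed pattern)
HASHTAG = re.compile(r"#([^#]+)#")  # paired #topic# markers (assumed pattern)


def preprocess(post, min_chars=5, min_interactions=1):
    """Clean one microblog, or return None if it is filtered out.

    post: dict with 'text', 'forwards' and 'comments' keys (assumed format).
    """
    # Threshold filter on the number of forwards and comments.
    if post["forwards"] + post["comments"] < min_interactions:
        return None
    # Record the topics named between # markers, then keep only their words.
    topics = HASHTAG.findall(post["text"])
    text = HASHTAG.sub(r"\1", post["text"])
    # Drop @mentions and collapse whitespace.
    text = MENTION.sub("", text)
    text = " ".join(text.split())
    # Length filter, applied here as a minimum character count.
    if len(text) < min_chars:
        return None
    return {"text": text, "topics": topics}
```

Because `\w` matches Chinese characters in Python 3, a mention followed directly by Chinese text without a space would be removed together with that text; a production pattern would need a stricter rule for where a user name ends.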
Embodiment 2:
Same as Embodiment 1, with step 1) reading "the random number generator produces a batch of random user IDs every 20 minutes" and step 8) reading "using the web-crawler-based page parsing method, the latest microblog pages of the valid users are mined at 20-minute intervals and enter the conventional data mining process".
Embodiment 3:
Same as Embodiment 1, with step 1) reading "the random number generator produces a batch of random user IDs every 30 minutes" and step 8) reading "using the web-crawler-based page parsing method, the latest microblog pages of the valid users are mined at 30-minute intervals and enter the conventional data mining process".
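The embodiments set the interval used in steps 1) and 8) to specific values inside the 10-30 minute range of the method. A minimal sketch of that single parameter, with the hypothetical helper `run_times` computing when each batch would start:

```python
from datetime import datetime, timedelta


def run_times(start, interval_minutes, count):
    """Return the start times of `count` consecutive batches."""
    if not 10 <= interval_minutes <= 30:
        raise ValueError("the method specifies an interval of 10 to 30 minutes")
    step = timedelta(minutes=interval_minutes)
    return [start + i * step for i in range(count)]
```

For example, `run_times(datetime(2012, 5, 8, 9, 0), 30, 2)` gives batches starting at 09:00 and 09:30.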

Claims (1)

1. A microblog data mining method based on active users, with the following steps:
1) Every 10-30 minutes, a random number generator produces a batch of random user IDs, which form the candidate microblog user group for mining. Taking Sina Weibo as an example, user IDs are 6 to 9 digits long, so the random number generator is correspondingly divided into four types: 6-digit, 7-digit, 8-digit and 9-digit random numbers;
The rule by which the random number generator produces a 6- to 9-digit random number is as follows:
One digit is produced by RAND() % 10; digits are produced in turn from the high-order position to the low-order position, repeating 6 to 9 times;
2) Personal information is collected for the users in the candidate set; the personal information comes from the URL of the personal microblog page;
3) The collected personal information is screened against preset indicators, and the candidate user group is updated. The preset indicators include whether the user holds expert status, the user's interests, the user's region, frequency of use, number of microblogs per day, number of forwards and comments per day, and average microblog reach;
4) A web-crawler-based method for parsing microblog user pages is used to dynamically crawl the personal microblog pages of the screened candidate user group, as the users' preselected raw data;
5) Using the recent real-time hot microblog topic model, cluster analysis is performed on the microblogs; where the preset threshold for a cluster is exceeded, the microblog is included in that cluster;
6) Taking each individual user in the candidate user group as the object of analysis, the number of that user's microblogs in each topic-field cluster is counted; if the number of the user's clustered microblogs in a field exceeds a threshold, the user is considered a valid user of that topic field and enters the user group mined for that topic;
7) At this point, processing and screening of this batch of randomly generated users is complete, and the users are added to the effective mining user set of each topic field;
8) Using the web-crawler-based page parsing method, the latest microblog pages of the valid users are mined at intervals of 10-30 minutes and enter the conventional data mining process;
9) After preprocessing steps, which comprise handling of the special microblog symbols @ and #, filtering by word-count limit, filtering by a threshold on the number of forwards and comments, word segmentation, and keyword clustering, cluster analysis is performed on the real-time microblogs to mine and generate real-time topics;
10) The real-time topic model is updated dynamically and used to select the new user group;
11) The existing valid user population is evaluated; if a user has produced no comment or forward on a hot topic for three consecutive days, the user is removed from the effective mining user group.
CN201210140531.1A 2012-05-08 2012-05-08 Microblog data mining method based on active users Expired - Fee Related CN102708176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210140531.1A CN102708176B (en) 2012-05-08 2012-05-08 Microblog data mining method based on active users

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210140531.1A CN102708176B (en) 2012-05-08 2012-05-08 Microblog data mining method based on active users

Publications (2)

Publication Number Publication Date
CN102708176A CN102708176A (en) 2012-10-03
CN102708176B true CN102708176B (en) 2013-12-04

Family

ID=46900942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210140531.1A Expired - Fee Related CN102708176B (en) 2012-05-08 2012-05-08 Microblog data mining method based on active users

Country Status (1)

Country Link
CN (1) CN102708176B (en)

Also Published As

Publication number Publication date
CN102708176A (en) 2012-10-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131204

Termination date: 20160508
