CN109325159A

CN109325159A - A method for mining microblog hot events

Info

Publication number: CN109325159A
Application number: CN201810860009.8A
Authority: CN
Inventors: 龙华; 吴睿; 熊新; 邵玉斌; 杜庆治
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2019-02-12

Abstract

The present invention relates to a method for mining microblog hot events and belongs to the technical field of data processing. Microblog data is first crawled and a microblog database is established; the crawled data is then preprocessed; named entity recognition is then performed on the preprocessed data; next, based on the results of preprocessing and named entity recognition, the entities and event trigger words of the microblog data are extracted to determine the event that each microblog expresses; finally, the similarity between microblogs is calculated, and the similarity results, publisher information and publication times are analyzed to obtain the microblog hot events. Compared with the prior art, the invention mainly addresses two problems: the lack of large, complete training corpora in microblog preprocessing, and the non-standard nature of microblog data in the named entity recognition stage, which introduces large errors into entity recognition and lowers the accuracy of microblog event extraction. It thereby improves the efficiency of microblog hot event mining.

Description

A method for mining microblog hot events

Technical field

The present invention relates to a method for mining microblog hot events and belongs to the technical field of data processing.

Background art

In recent years, social media platforms such as microblogs have emerged in large numbers. As a representative new communication medium, the microblog has become one of the most popular online tools for expressing ideas, sharing information and exchanging opinions. Compared with formal news text, microblogs allow more accurate and richer event information to be extracted in a more timely manner. By mining hot events in microblogs, we can promptly learn of events large and small at home and abroad, understand people's reactions to and views on those events, and filter out useful information, which provides good support for real-time monitoring, risk assessment, decision support and the like.

In general, because microblog data is updated quickly, traditional microblog preprocessing techniques often lack large, complete training corpora. At the same time, because each microblog is short and carries limited information, traditional microblog named entity recognition techniques struggle to fully integrate large amounts of related information. Both issues make microblog hot event mining difficult.

Summary of the invention

The technical problem to be solved by the invention is to address the limitations and deficiencies of the prior art by providing a method for mining microblog hot events. The method mainly addresses the lack of large, complete training corpora in microblog preprocessing and the non-standard nature of microblog data in the named entity recognition stage, which introduces large errors into entity recognition and lowers the accuracy of microblog event extraction. It thereby improves the efficiency of microblog hot event mining.

The technical solution of the invention is a method for mining microblog hot events, comprising the following 6 steps:

1. Crawl microblog data and establish a microblog database.

2. Preprocess the crawled microblog data.

3. Perform named entity recognition on the preprocessed microblog data.

4. Based on the results of preprocessing and named entity recognition, extract the entities and event trigger words of the microblog data.

5. Combine the entities and trigger words to determine the event expressed by each microblog, and calculate the similarity between microblogs.

6. Analyze the similarity results, publisher information and publication times to obtain the microblog hot events.

Further, the process of establishing the microblog database in step 1 is: crawl 100,000 microblog entries in order of publication time, each including the microblog text, publisher and publication time; then write the microblog text, publisher and publication time to a local database.
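The storage half of step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source names no database engine, so SQLite is an assumption, and the table name `microblog`, the column names and both helper functions are hypothetical. Crawling itself is left out.

```python
import sqlite3


def create_microblog_db(path=":memory:"):
    """Open (or create) a local SQLite database holding one table of posts.

    SQLite and the schema below are assumptions; the source only says
    "local database" with text, publisher and publication time.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS microblog ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "text TEXT NOT NULL, "
        "publisher TEXT NOT NULL, "
        "published_at TEXT NOT NULL)"  # ISO 8601 timestamp string
    )
    return conn


def store_posts(conn, posts):
    """Insert (text, publisher, published_at) tuples in publication order."""
    # ISO 8601 strings sort chronologically when compared as text.
    ordered = sorted(posts, key=lambda p: p[2])
    conn.executemany(
        "INSERT INTO microblog (text, publisher, published_at) VALUES (?, ?, ?)",
        ordered,
    )
    conn.commit()
    return len(ordered)
```

Passing a file path instead of `":memory:"` keeps the data on disk between runs.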

Further, the preprocessing in step 2 includes filtering, word segmentation and part-of-speech tagging of the microblog data.

Further, the filtering of microblog data specifically consists of screening out microblogs whose text is shorter than 5 characters and removing the punctuation marks and emoticons from the microblog text. The word segmentation proceeds as follows: to obtain more accurate segmentation results for microblog text, a segmentation dictionary is first established; one fifth of the filtered microblog texts is segmented and the segmentation results are added to the dictionary; a further fifth of the filtered texts is then segmented and its results are added to the dictionary; and so on, until all filtered microblog texts have been segmented in this manner. The part-of-speech tagging labels the segmented microblog data with parts of speech for use in subsequent processing.
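The filtering and batch-wise segmentation just described might look like the sketch below. Several points are assumptions rather than details from the source: emoticons are taken to be either Unicode symbols or bracketed codes such as "[哈哈]", text length is measured after cleaning, and `fmm_segment` is a toy forward-maximum-matching segmenter standing in for whatever segmenter is actually used. All function names are hypothetical.

```python
import re
import unicodedata

# Bracketed emoticon codes such as "[哈哈]"; this format is an assumption.
BRACKET_EMOTICON = re.compile(r"\[[^\[\]]{1,8}\]")


def clean_text(text):
    """Remove bracketed emoticon codes, punctuation and symbol characters."""
    text = BRACKET_EMOTICON.sub("", text)
    # Unicode categories starting with P are punctuation, S are symbols (emoji).
    kept = [ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S")]
    return "".join(kept).strip()


def filter_posts(texts, min_chars=5):
    """Clean every text, then drop those shorter than min_chars characters."""
    cleaned = (clean_text(t) for t in texts)
    return [t for t in cleaned if len(t) >= min_chars]


def fmm_segment(text, dictionary, max_len=4):
    """Toy forward-maximum-matching segmenter (a stand-in, not the patent's)."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def segment_in_batches(texts, dictionary, segment=fmm_segment, n_batches=5):
    """Segment texts one fifth at a time, growing the dictionary per batch."""
    results = []
    batch_size = max(1, -(-len(texts) // n_batches))  # ceiling division
    for start in range(0, len(texts), batch_size):
        batch = [segment(t, dictionary) for t in texts[start:start + batch_size]]
        for words in batch:
            # Feed multi-character results back into the dictionary.
            dictionary.update(w for w in words if len(w) > 1)
        results.extend(batch)
    return results
```

With the toy segmenter the dictionary update adds nothing new, because it can only emit words it already knows. The feedback loop matters when `segment` is replaced by a segmenter that can propose unseen words.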

Further, the named entity recognition in step 3 identifies the entities in the microblog text, such as personal names, place names, organization names and proper nouns.

Further, the part-of-speech tagging is implemented with the HanLP natural language processing package. The named entity recognition uses semi-supervised learning: entity data that has already been identified is fed into the model, which then continues to identify the entities in the remaining microblog texts; this cycle repeats, with identified entity data continually fed back into the model to identify the entities in the remaining texts.
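The feedback cycle of the semi-supervised recognition can be illustrated with a generic bootstrapping loop. The source does not describe the model, so everything concrete here is an assumption: `bootstrap_entities` takes any recognizer as a callable, and `coordination_recognizer` is a deliberately simple hypothetical rule (a token joined to a known entity by "和" is also treated as an entity) used only to show how newly found entities unlock further texts.

```python
def bootstrap_entities(texts, seed_entities, recognize, max_rounds=10):
    """Repeatedly label texts, feeding found entities back into the known set.

    texts: list of token lists; recognize(tokens, known) returns a set.
    Stops when a full round labels no additional text.
    """
    known = set(seed_entities)
    remaining = list(texts)
    labeled = []
    for _ in range(max_rounds):
        unresolved = []
        for tokens in remaining:
            found = recognize(tokens, known)
            if found:
                labeled.append((tokens, found))
                known |= found  # feed results back for later texts
            else:
                unresolved.append(tokens)
        if len(unresolved) == len(remaining):
            break  # no progress in this round
        remaining = unresolved
    return labeled, known


def coordination_recognizer(tokens, known, link="和"):
    """Hypothetical toy rule: 'X 和 Y' with one known entity marks both."""
    found = {t for t in tokens if t in known}
    for left, mid, right in zip(tokens, tokens[1:], tokens[2:]):
        if mid == link and (left in known or right in known):
            found.update((left, right))
    return found
```

In a realistic setting `recognize` would wrap a trained sequence labeler that is retrained or updated with the growing entity set between rounds.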

Further, the event trigger word in step 4 is a word that indicates the occurrence of an event, usually a verb, noun, gerund or preposition. The part of speech of each word is available from the part-of-speech tagging. The selection priority of these four parts of speech, from high to low, is verb, noun, gerund, preposition. If only one of these parts of speech is present, that word is taken as the event trigger word; if two or more are present, the two words with the highest priority are taken as event trigger words. Combining the entities with the event trigger words yields the event expressed by the microblog.
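The priority rule for trigger words can be sketched as below. The tag names `v`, `n`, `vn` and `p` are an assumption based on the common PKU-style tag set; the actual tagger output should be checked. The source also leaves open what happens when several words share the only part of speech present; this sketch takes the first such word. The function name is hypothetical.

```python
# Lower number = higher priority: verb, noun, gerund, preposition.
# Tag names are assumed (PKU-style); adjust to the tagger actually used.
PRIORITY = {"v": 0, "n": 1, "vn": 2, "p": 3}


def select_trigger_words(tagged_words):
    """Pick event trigger words from (word, pos_tag) pairs.

    One candidate part of speech present -> one trigger word.
    Two or more present -> the two highest-priority words.
    """
    candidates = [(w, t) for w, t in tagged_words if t in PRIORITY]
    if not candidates:
        return []
    # sorted() is stable, so words with equal priority keep sentence order.
    ranked = sorted(candidates, key=lambda wt: PRIORITY[wt[1]])
    distinct_tags = {t for _, t in candidates}
    limit = 1 if len(distinct_tags) == 1 else 2
    return [w for w, _ in ranked[:limit]]
```

For example, `select_trigger_words([("昆明", "ns"), ("发生", "v"), ("地震", "n"), ("在", "p")])` selects the verb and the noun and ignores the place name, whose tag is not among the four candidate parts of speech.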

Further, the formula for calculating the similarity between microblogs in step 5 is:

where A and B denote two microblog texts, Sim(A, B) denotes the similarity of A and B, and a_i and b_i are the i-th values of the word-frequency vectors formed from the entities and event trigger words obtained after processing A and B, respectively.
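The formula itself is missing from this text. Because a_i and b_i are components of word-frequency vectors, one plausible reading is cosine similarity, Sim(A, B) = Σ a_i·b_i / (√(Σ a_i²) · √(Σ b_i²)); this is an assumption, not something the source confirms. Under that assumption, a minimal sketch (the function name is hypothetical):

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity of two term lists, using raw term counts.

    terms_a / terms_b: the entities and trigger words extracted from
    microblogs A and B. Cosine similarity is an assumed reading of the
    missing formula.
    """
    a, b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Two microblogs with identical term lists score approximately 1, and two with no terms in common score 0.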

Further, the microblog hot events in step 6 are obtained as follows: each microblog is traversed, and the similarity results, publisher information and publication times are analyzed. If three conditions hold at the same time, namely the similarity between microblogs is higher than 85%, the publishers are different, and the publication times are within 12 hours of each other, the microblogs are regarded as the same event and the number of microblogs for that event is counted. Finally, the proportion of each event's microblog count in the total microblog count is calculated, and the events sorted by this proportion from high to low are the microblog hot events.
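The three-condition test can be sketched as a grouping pass. The source does not say how pairwise decisions are merged into groups, so the greedy strategy here (compare each post, in time order, with the first post of every existing event) is an assumption, as are the dictionary keys `terms`, `publisher` and `time` and the function name. The similarity function is passed in as a callable.

```python
from datetime import timedelta


def group_events(posts, similarity, min_sim=0.85, max_gap=timedelta(hours=12)):
    """Greedily group posts into events using the three conditions.

    posts: dicts with 'terms', 'publisher' and 'time' (datetime) keys.
    A post joins an event when, against that event's first post,
    similarity > min_sim, the publishers differ, and the time gap is
    at most max_gap. Otherwise it starts a new event.
    """
    events = []
    for post in sorted(posts, key=lambda p: p["time"]):
        for event in events:
            seed = event[0]
            if (similarity(seed["terms"], post["terms"]) > min_sim
                    and seed["publisher"] != post["publisher"]
                    and post["time"] - seed["time"] <= max_gap):
                event.append(post)
                break
        else:
            # No existing event matched, so this post opens a new one.
            events.append([post])
    return events
```

The length of each returned list is the microblog count of that event.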

Further, the formula for calculating the proportion of each event's microblog count in the total microblog count is:

K = W / N

where K is the proportion of a given event's microblog count in the total microblog count, W is the microblog count of that event, and N is the total microblog count.
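Reading K as the ratio W / N, as the definitions of K, W and N imply, the final ranking can be sketched as follows. The function name and the dictionary input are hypothetical.

```python
def rank_hot_events(event_counts, total_posts):
    """Return (event, K) pairs sorted by K = W / N from high to low.

    event_counts: mapping from event label to its microblog count W.
    total_posts: total microblog count N.
    """
    if total_posts <= 0:
        return []
    ratios = [(event, count / total_posts) for event, count in event_counts.items()]
    return sorted(ratios, key=lambda item: item[1], reverse=True)
```

For instance, with counts of 30, 50 and 20 out of 100 microblogs, the event with 50 microblogs ranks first with K = 0.5.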

The beneficial effects of the invention are as follows: it mainly addresses the lack of large, complete training corpora in microblog preprocessing and the non-standard nature of microblog data in the named entity recognition stage, which introduces large errors into entity recognition and lowers the accuracy of microblog event extraction. It thereby improves the efficiency of microblog hot event mining.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the invention;

Fig. 2 is a flow chart of step 2 of the invention;

Fig. 3 is a flow chart of steps 3 to 4 of the invention;

Fig. 4 is a flow chart of steps 5 to 6 of the invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings and specific embodiments.

Embodiment 1: as shown in Figs. 1 to 4, in a method for mining microblog hot events, microblog data is first crawled and a microblog database is established; the crawled microblog data is then preprocessed; named entity recognition is then performed on the preprocessed microblog data; next, based on the results of preprocessing and named entity recognition, the entities and event trigger words of the microblog data are extracted to determine the event expressed by each microblog; finally, the similarity between microblogs is calculated, and the similarity results, publisher information and publication times are analyzed to obtain the microblog hot events.

The specific steps are as follows:

1. Crawl microblog data and establish a microblog database.

2. Preprocess the crawled microblog data.

3. Perform named entity recognition on the preprocessed microblog data.

The embodiments of the invention have been explained in detail above with reference to the accompanying drawings. However, the invention is not limited to the above embodiments; within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of the invention.

Claims

1. A method for mining microblog hot events, characterized by comprising:

1. Crawl microblog data and establish a microblog database.

2. Preprocess the crawled microblog data.

3. Perform named entity recognition on the preprocessed microblog data.

2. The method for mining microblog hot events according to claim 1, characterized in that the process of establishing the microblog database in step 1 is: crawl 100,000 microblog entries in order of publication time, each including the microblog text, publisher and publication time; then write the microblog text, publisher and publication time to a local database.

3. The method for mining microblog hot events according to claim 1, characterized in that the preprocessing in step 2 includes filtering, word segmentation and part-of-speech tagging of the microblog data.

4. The method for mining microblog hot events according to claim 3, characterized in that the filtering of microblog data specifically consists of screening out microblogs whose text is shorter than 5 characters and removing the punctuation marks and emoticons from the microblog text; the word segmentation specifically consists of first establishing a segmentation dictionary, segmenting one fifth of the filtered microblog texts and adding the segmentation results to the dictionary, then segmenting a further fifth of the filtered texts and adding the results to the dictionary, and so on until all filtered microblog texts have been segmented; and the part-of-speech tagging labels the segmented microblog data with parts of speech.

5. The method for mining microblog hot events according to claim 4, characterized in that the part-of-speech tagging is implemented with the HanLP natural language processing package, and the named entity recognition uses semi-supervised learning, in which entity data that has already been identified is fed into the model, which then continues to identify the entities in the remaining microblog texts, this cycle repeating with identified entity data continually fed back into the model to identify the entities in the remaining texts.

6. The method for mining microblog hot events according to claim 1, characterized in that the formula for calculating the similarity between microblogs in step 5 is:

where A and B denote two microblog texts, Sim(A, B) denotes the similarity of A and B, and a_i and b_i are the i-th values of the word-frequency vectors formed from the entities and event trigger words obtained after processing A and B, respectively.

7. The method for mining microblog hot events according to claim 1, characterized in that the microblog hot events in step 6 are obtained as follows: each microblog is traversed, and the similarity results, publisher information and publication times are analyzed; if the similarity between microblogs is higher than 85%, the publishers are different, and the publication times are within 12 hours of each other, all at the same time, the microblogs are regarded as the same event and the number of microblogs for that event is counted; finally, the proportion of each event's microblog count in the total microblog count is calculated, and the events sorted by this proportion from high to low are the microblog hot events.

8. The method for mining microblog hot events according to claim 7, characterized in that the formula for calculating the proportion of each event's microblog count in the total microblog count is:

K = W / N

where K is the proportion of a given event's microblog count in the total microblog count, W is the microblog count of that event, and N is the total microblog count.