CN109325159A - Microblog hot event mining method - Google Patents

Microblog hot event mining method

Publication number
CN109325159A
Authority
CN
China
Prior art keywords
microblogging
microblog
event
data
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810860009.8A
Other languages
Chinese (zh)
Inventor
龙华
吴睿
熊新
邵玉斌
杜庆治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201810860009.8A
Publication of CN109325159A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a microblog hot event mining method and belongs to the technical field of data processing. Microblog data is first crawled and a microblog database is established. The crawled data is then preprocessed, and named entity recognition is performed on the preprocessed data. From the results of preprocessing and named entity recognition, the entities and event trigger words of each microblog post are extracted to determine the event the post expresses. Finally, the similarity between posts is calculated, and the similarity results, publisher information, and publication times are analyzed to obtain the microblog hot events. Compared with the prior art, the invention mainly addresses two problems: the lack of a large, complete training corpus during microblog preprocessing, and the irregularity of microblog data during named entity recognition. Both cause large errors in entity recognition and therefore low event extraction accuracy. Addressing them improves the efficiency of microblog hot event mining.

Description

Microblog hot event mining method
Technical field
The invention relates to a microblog hot event mining method and belongs to the technical field of data processing.
Background art
In recent years, social media platforms such as microblogs have emerged in large numbers. As a representative new medium, the microblog has become one of the most popular online tools for expressing ideas, sharing information, and exchanging opinions. Compared with formal news text, microblogs allow more accurate and richer event information to be extracted in a more timely way. By mining hot events in microblogs, we can learn promptly about major and minor events at home and abroad, understand how people react to and view various events, and filter out useful information, which supports real-time monitoring, risk assessment, and decision support.
In general, because microblog information is updated quickly, traditional microblog preprocessing techniques often lack a large, complete training corpus. In addition, because each post is short and carries limited information, traditional microblog named entity recognition techniques struggle to fully incorporate large amounts of related information. Both problems make microblog hot event mining difficult.
Summary of the invention
The technical problem to be solved by the invention is to address the limitations and deficiencies of the prior art by providing a microblog hot event mining method. The method mainly addresses the lack of a large, complete training corpus during microblog preprocessing and the irregularity of microblog data during named entity recognition, which together cause large errors in entity recognition and therefore low event extraction accuracy. Addressing these problems improves the efficiency of microblog hot event mining.
The technical solution of the invention is a microblog hot event mining method comprising the following six steps:
1. Crawl microblog data and establish a microblog database.
2. Preprocess the crawled microblog data.
3. Perform named entity recognition on the preprocessed microblog data.
4. Extract the entities and event trigger words of the microblog data from the results of preprocessing and named entity recognition.
5. Combine the entities and trigger words to determine the event expressed by each microblog post, and calculate the similarity between posts.
6. Analyze the similarity results, publisher information, and publication times to obtain the microblog hot events.
Further, the microblog database in step 1 is established as follows: 100,000 microblog posts are crawled in order of publication time, each including the post text, publisher, and publication time; the post text, publisher, and publication time are then written to a local database.
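The storage part of step 1 might be sketched as follows. This is only an illustration under stated assumptions: the source does not name a database engine, so SQLite is assumed here, and the table name, column names, and the `create_store` and `save_posts` helpers are hypothetical. The crawler itself is omitted.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open a local SQLite database (an assumption; the source only says
    'local database') and create a table for crawled posts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS microblog ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "text TEXT NOT NULL, "
        "publisher TEXT NOT NULL, "
        "published_at TEXT NOT NULL)"
    )
    return conn


def save_posts(conn, posts):
    """Write posts in order of publication time.

    Each post is a dict with the three fields named in the text:
    'text', 'publisher', and 'published_at' (ISO 8601 strings, which
    sort chronologically when compared as text).
    """
    ordered = sorted(posts, key=lambda p: p["published_at"])
    conn.executemany(
        "INSERT INTO microblog (text, publisher, published_at) VALUES (?, ?, ?)",
        [(p["text"], p["publisher"], p["published_at"]) for p in ordered],
    )
    conn.commit()
    return len(ordered)
```

Storing the publication time as an ISO 8601 string keeps the sketch dependency-free; a real deployment would likely choose its own schema.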
Further, the preprocessing in step 2 includes filtering, word segmentation, and part-of-speech tagging of the microblog data.
Further, the filtering consists of discarding posts whose text is shorter than 5 characters and removing punctuation marks and emoticons from the post text. The word segmentation, intended to give more accurate results for microblog text, proceeds as follows: a segmentation dictionary is first established; one fifth of the filtered post texts is segmented and the results are added to the dictionary; another fifth is then segmented and its results are added to the dictionary; and so on, until all filtered post texts have been segmented. The part-of-speech tagging labels each word of the segmented data with its part of speech for use in subsequent processing.
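The filtering and batch-wise segmentation described above can be sketched roughly as follows. Several points are assumptions rather than details from the source: the length check is applied to the raw text before cleaning, punctuation and emoticons are approximated by the Unicode categories P* and S*, and `segment` is a hypothetical caller-supplied function standing in for whatever segmenter is actually used.

```python
import unicodedata

MIN_CHARS = 5  # threshold stated in the text


def clean_text(text):
    """Drop punctuation (Unicode P*) and symbols such as emoji (Unicode S*)."""
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S")
    )


def filter_posts(texts):
    """Discard posts shorter than MIN_CHARS, then clean the remaining ones."""
    kept = []
    for text in texts:
        if len(text) < MIN_CHARS:
            continue
        cleaned = clean_text(text)
        if cleaned:
            kept.append(cleaned)
    return kept


def segment_in_batches(texts, segment, n_batches=5):
    """Segment texts in n_batches portions, growing the dictionary each time.

    segment(text, dictionary) -> list of words is a hypothetical hook.
    Words found in one batch are added to the dictionary before the next
    batch is processed, mirroring the one-fifth-at-a-time procedure.
    """
    dictionary = set()
    results = []
    size = max(1, -(-len(texts) // n_batches))  # ceiling division
    for start in range(0, len(texts), size):
        batch_words = []
        for text in texts[start:start + size]:
            words = segment(text, dictionary)
            results.append(words)
            batch_words.extend(words)
        dictionary.update(batch_words)
    return results, dictionary
```

Passing the segmenter in as a function keeps the sketch independent of any particular library; the point it illustrates is only the incremental growth of the dictionary between batches.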
Further, the named entity recognition in step 3 identifies the entities in the post text, such as person names, place names, organization names, and proper nouns.
Further, the part-of-speech tagging is implemented with the HanLP natural language processing package. The named entity recognition uses semi-supervised learning: entities that have already been identified are fed into the model, which then identifies entities in the remaining post texts, and this cycle repeats, with newly identified entities continually fed back into the model.
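The feedback loop of the semi-supervised recognition can be illustrated with a small sketch. The source does not specify the model, so `recognize` below is a hypothetical stand-in that receives a text and the set of entities known so far; the stopping rule (stop when a round finds nothing new, or after `max_rounds`) is likewise an assumption.

```python
def bootstrap_entities(texts, recognize, seed_entities=(), max_rounds=10):
    """Repeatedly feed already identified entities back into the recognizer.

    recognize(text, known_entities) -> iterable of entity strings is a
    hypothetical hook standing in for the actual model.
    """
    known = set(seed_entities)
    for _ in range(max_rounds):
        found = set()
        for text in texts:
            found.update(recognize(text, known))
        new = found - known
        if not new:  # nothing new was learned in this round
            break
        known |= new
    return known


def toy_recognize(text, known):
    """Toy rule for demonstration only: if a text mentions a known entity,
    every capitalized word in that text is treated as an entity."""
    words = text.split()
    if any(w in known for w in words):
        return {w for w in words if w[:1].isupper()}
    return set()


entities = bootstrap_entities(
    ["Kunming Yunnan rain", "Yunnan Dali travel"],
    toy_recognize,
    seed_entities={"Kunming"},
)
```

In the toy run, "Yunnan" is learned in the first round from the text that mentions the seed "Kunming", and "Dali" is learned in the second round from the text that mentions "Yunnan", which is the kind of propagation the cycle is meant to achieve.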
Further, the event trigger words in step 4 are the words that indicate the occurrence of an event, usually verbs, nouns, gerunds, and prepositions. The part of speech of each word is obtained from the part-of-speech tagging. The selection priority of these four parts of speech, from high to low, is verb, noun, gerund, preposition. If only one of these parts of speech is present, that word is taken as the event trigger word; if two or more are present, the two words with the highest priority are taken as the event trigger words. Combining the entities with the event trigger words gives the event expressed by the post.
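The priority rule for trigger words might look like this in code. The tag names `verb`, `noun`, `gerund`, and `preposition` are illustrative labels, not the tag set of any particular tagger, and two details are assumptions: words already recognized as entities are expected to carry a different tag, and when several words share the only part of speech present, the first one is taken.

```python
# Selection priority from the text: verb > noun > gerund > preposition.
PRIORITY = {"verb": 0, "noun": 1, "gerund": 2, "preposition": 3}


def select_triggers(tagged_words):
    """Pick event trigger words from (word, part_of_speech) pairs."""
    candidates = [(w, pos) for w, pos in tagged_words if pos in PRIORITY]
    if not candidates:
        return []
    kinds = {pos for _, pos in candidates}
    # sorted() is stable, so words with equal priority keep their text order.
    ranked = sorted(candidates, key=lambda wp: PRIORITY[wp[1]])
    if len(kinds) == 1:
        return [ranked[0][0]]
    return [w for w, _ in ranked[:2]]


triggers = select_triggers(
    [("昆明", "entity"), ("发生", "verb"), ("地震", "noun"), ("在", "preposition")]
)
```

Here the verb and the noun outrank the preposition, so the post would be summarized by the entity "昆明" together with the triggers "发生" and "地震".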
Further, the similarity between posts in step 5 is calculated as:
$$\mathrm{Sim}(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
where A and B denote two post texts, Sim(A, B) denotes the similarity of A and B, a_i and b_i are the i-th values of the word frequency vectors formed from the entities and event trigger words obtained by processing A and B, respectively, and n is the length of these vectors.
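A minimal sketch of the similarity computation follows, assuming that Sim(A, B) is the cosine similarity of the two word frequency vectors. That reading is inferred from the definitions of a_i and b_i and is not guaranteed to match the original formula; the function name and the zero-vector convention are also assumptions.

```python
import math
from collections import Counter


def similarity(terms_a, terms_b):
    """Cosine similarity of word frequency vectors.

    terms_a and terms_b are the entities and trigger words extracted
    from posts A and B. Returns 0.0 if either post has no terms.
    """
    freq_a, freq_b = Counter(terms_a), Counter(terms_b)
    vocabulary = set(freq_a) | set(freq_b)
    dot = sum(freq_a[t] * freq_b[t] for t in vocabulary)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because word frequencies are non-negative, the result lies between 0 and 1, which makes a percentage threshold on the similarity straightforward to apply.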
Further, the hot events in step 6 are obtained as follows: every post is traversed and its similarity results, publisher information, and publication time are analyzed. Posts that simultaneously have a similarity above 85%, different publishers, and publication times within 12 hours of each other are regarded as the same event, and the number of posts for that event is counted. Finally, the ratio of each event's post count to the total post count is calculated, and the events, sorted by this ratio from high to low, are the microblog hot events.
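The three conditions can be combined in a simple grouping pass, sketched below. The source does not say how posts are compared when an event already contains several posts, so this sketch assumes each new post is compared with the first post of every existing event. The `similarity` argument is any function returning a value between 0 and 1; the dictionary keys `terms`, `publisher`, and `time` are hypothetical.

```python
from datetime import timedelta


def group_events(posts, similarity, threshold=0.85, window=timedelta(hours=12)):
    """Group posts into events using the three conditions from the text.

    A post joins an event when, relative to that event's first post,
    the similarity exceeds the threshold, the publishers differ, and
    the publication times are at most `window` apart.
    """
    events = []
    for post in posts:
        for event in events:
            first = event[0]
            if (
                similarity(post["terms"], first["terms"]) > threshold
                and post["publisher"] != first["publisher"]
                and abs(post["time"] - first["time"]) <= window
            ):
                event.append(post)
                break
        else:
            # No existing event matched, so this post starts a new one.
            events.append([post])
    return events
```

Comparing against a single representative keeps the pass linear in the number of events per post; comparing against every member of an event would be a stricter alternative.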
Further, the ratio of each event's post count to the total post count is calculated as:
$$K = \frac{W}{N}$$
where K is the ratio of the event's post count to the total post count, W is the number of posts of the event, and N is the total number of posts.
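As a small worked example, assuming the ratio is K = W / N as the definitions of K, W, and N indicate, events can be ranked as follows. The function name and the event labels are hypothetical.

```python
def rank_events(event_counts, total_posts):
    """Rank events by K = W / N, highest first.

    event_counts maps an event label to W, its number of posts;
    total_posts is N, the total number of posts.
    """
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    ratios = [(event, w / total_posts) for event, w in event_counts.items()]
    return sorted(ratios, key=lambda item: item[1], reverse=True)


ranking = rank_events({"concert": 1, "earthquake": 3}, total_posts=10)
```

With 10 posts in total, the event with 3 posts gets K = 0.3 and is ranked ahead of the event with 1 post (K = 0.1).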
The beneficial effects of the invention are as follows: it mainly addresses the lack of a large, complete training corpus during microblog preprocessing and the irregularity of microblog data during named entity recognition, which together cause large errors in entity recognition and therefore low event extraction accuracy, and it thereby improves the efficiency of microblog hot event mining.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the invention;
Fig. 2 is a flow chart of step 2 of the invention;
Fig. 3 is a flow chart of steps 3 to 4 of the invention;
Fig. 4 is a flow chart of steps 5 to 6 of the invention.
Specific embodiment
The invention is further described below with reference to the drawings and a specific embodiment.
Embodiment 1: as shown in Figs. 1-4, in a microblog hot event mining method, microblog data is first crawled and a microblog database is established. The crawled data is then preprocessed, and named entity recognition is performed on the preprocessed data. From the results of preprocessing and named entity recognition, the entities and event trigger words of the microblog data are extracted to determine the event expressed by each post. Finally, the similarity between posts is calculated, and the similarity results, publisher information, and publication times are analyzed to obtain the microblog hot events.
The specific steps are:
1. Crawl microblog data and establish a microblog database.
2. Preprocess the crawled microblog data.
3. Perform named entity recognition on the preprocessed microblog data.
4. Extract the entities and event trigger words of the microblog data from the results of preprocessing and named entity recognition.
5. Combine the entities and trigger words to determine the event expressed by each microblog post, and calculate the similarity between posts.
6. Analyze the similarity results, publisher information, and publication times to obtain the microblog hot events.
Further, the microblog database in step 1 is established as follows: 100,000 microblog posts are crawled in order of publication time, each including the post text, publisher, and publication time; the post text, publisher, and publication time are then written to a local database.
Further, the preprocessing in step 2 includes filtering, word segmentation, and part-of-speech tagging of the microblog data.
Further, the filtering consists of discarding posts whose text is shorter than 5 characters and removing punctuation marks and emoticons from the post text. The word segmentation, intended to give more accurate results for microblog text, proceeds as follows: a segmentation dictionary is first established; one fifth of the filtered post texts is segmented and the results are added to the dictionary; another fifth is then segmented and its results are added to the dictionary; and so on, until all filtered post texts have been segmented. The part-of-speech tagging labels each word of the segmented data with its part of speech for use in subsequent processing.
Further, the named entity recognition in step 3 identifies the entities in the post text, such as person names, place names, organization names, and proper nouns.
Further, the part-of-speech tagging is implemented with the HanLP natural language processing package. The named entity recognition uses semi-supervised learning: entities that have already been identified are fed into the model, which then identifies entities in the remaining post texts, and this cycle repeats, with newly identified entities continually fed back into the model.
Further, the event trigger words in step 4 are the words that indicate the occurrence of an event, usually verbs, nouns, gerunds, and prepositions. The part of speech of each word is obtained from the part-of-speech tagging. The selection priority of these four parts of speech, from high to low, is verb, noun, gerund, preposition. If only one of these parts of speech is present, that word is taken as the event trigger word; if two or more are present, the two words with the highest priority are taken as the event trigger words. Combining the entities with the event trigger words gives the event expressed by the post.
Further, the similarity between posts in step 5 is calculated as:
$$\mathrm{Sim}(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
where A and B denote two post texts, Sim(A, B) denotes the similarity of A and B, a_i and b_i are the i-th values of the word frequency vectors formed from the entities and event trigger words obtained by processing A and B, respectively, and n is the length of these vectors.
Further, the hot events in step 6 are obtained as follows: every post is traversed and its similarity results, publisher information, and publication time are analyzed. Posts that simultaneously have a similarity above 85%, different publishers, and publication times within 12 hours of each other are regarded as the same event, and the number of posts for that event is counted. Finally, the ratio of each event's post count to the total post count is calculated, and the events, sorted by this ratio from high to low, are the microblog hot events.
Further, the ratio of each event's post count to the total post count is calculated as:
$$K = \frac{W}{N}$$
where K is the ratio of the event's post count to the total post count, W is the number of posts of the event, and N is the total number of posts.
The embodiment of the invention has been explained in detail above with reference to the drawings, but the invention is not limited to this embodiment; within the knowledge of a person skilled in the art, various changes can be made without departing from the concept of the invention.

Claims (8)

1. A microblog hot event mining method, characterized by comprising:
(1) crawling microblog data and establishing a microblog database;
(2) preprocessing the crawled microblog data;
(3) performing named entity recognition on the preprocessed microblog data;
(4) extracting the entities and event trigger words of the microblog data from the results of preprocessing and named entity recognition;
(5) combining the entities and trigger words to determine the event expressed by each microblog post, and calculating the similarity between posts;
(6) analyzing the similarity results, publisher information, and publication times to obtain the microblog hot events.
2. The microblog hot event mining method according to claim 1, characterized in that the microblog database in step (1) is established as follows: 100,000 microblog posts are crawled in order of publication time, each including the post text, publisher, and publication time; the post text, publisher, and publication time are then written to a local database.
3. The microblog hot event mining method according to claim 1, characterized in that the preprocessing in step (2) includes filtering, word segmentation, and part-of-speech tagging of the microblog data.
4. The microblog hot event mining method according to claim 3, characterized in that the filtering of the microblog data specifically comprises discarding posts whose text is shorter than 5 characters and removing punctuation marks and emoticons from the post text; the word segmentation specifically comprises first establishing a segmentation dictionary, segmenting one fifth of the filtered post texts and adding the results to the dictionary, then segmenting another fifth of the filtered post texts and adding the results to the dictionary, until all filtered post texts have been segmented; and the part-of-speech tagging labels each word of the segmented microblog data with its part of speech.
5. The microblog hot event mining method according to claim 4, characterized in that the part-of-speech tagging is implemented with the HanLP natural language processing package, and the named entity recognition uses semi-supervised learning, in which entities that have already been identified are fed into the model, which then identifies entities in the remaining post texts, and this cycle repeats, with newly identified entities continually fed back into the model.
6. The microblog hot event mining method according to claim 1, characterized in that the similarity between posts in step (5) is calculated as:
$$\mathrm{Sim}(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
where A and B denote two post texts, Sim(A, B) denotes the similarity of A and B, a_i and b_i are the i-th values of the word frequency vectors formed from the entities and event trigger words obtained by processing A and B, respectively, and n is the length of these vectors.
7. The microblog hot event mining method according to claim 1, characterized in that the microblog hot events in step (6) are obtained as follows: every post is traversed and its similarity results, publisher information, and publication time are analyzed; posts that simultaneously have a similarity above 85%, different publishers, and publication times within 12 hours of each other are regarded as the same event, and the number of posts for that event is counted; finally, the ratio of each event's post count to the total post count is calculated, and the events, sorted by this ratio from high to low, are the microblog hot events.
8. The microblog hot event mining method according to claim 7, characterized in that the ratio of each event's post count to the total post count is calculated as:
$$K = \frac{W}{N}$$
where K is the ratio of the event's post count to the total post count, W is the number of posts of the event, and N is the total number of posts.
CN201810860009.8A 2018-08-01 2018-08-01 Microblog hot event mining method Pending CN109325159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810860009.8A CN109325159A (en) 2018-08-01 2018-08-01 A kind of microblog hot event method for digging

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810860009.8A CN109325159A (en) 2018-08-01 2018-08-01 A kind of microblog hot event method for digging

Publications (1)

Publication Number Publication Date
CN109325159A (en) 2019-02-12

Family

ID=65264063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810860009.8A Pending CN109325159A (en) 2018-08-01 2018-08-01 Microblog hot event mining method

Country Status (1)

Country Link
CN (1) CN109325159A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615593A (en) * 2013-11-01 2015-05-13 北大方正集团有限公司 Method and device for automatic detection of microblog hot topics
CN104484343A (en) * 2014-11-26 2015-04-01 无锡清华信息科学与技术国家实验室物联网技术中心 Topic detection and tracking method for microblog
CN107633044A (en) * 2017-09-14 2018-01-26 国家计算机网络与信息安全管理中心 A kind of public sentiment knowledge mapping construction method based on focus incident
CN107885793A (en) * 2017-10-20 2018-04-06 江苏大学 A kind of hot microblog topic analyzing and predicting method and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334702A (en) * 2019-05-30 2019-10-15 深圳壹账通智能科技有限公司 Data transmission method, device and computer equipment based on configuration platform
WO2020238556A1 (en) * 2019-05-30 2020-12-03 深圳壹账通智能科技有限公司 Configuration platform-based data transmission method, apparatus and computer device
CN112800767A (en) * 2021-01-31 2021-05-14 云知声智能科技股份有限公司 Method and system for checking basic information of patient in medical record text
CN112800767B (en) * 2021-01-31 2023-11-21 云知声智能科技股份有限公司 Method and system for checking basic information of patient in medical record text

Similar Documents

Publication Publication Date Title
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
WO2021114745A1 (en) Named entity recognition method employing affix perception for use in social media
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN103336766B (en) Short text garbage identification and modeling method and device
CN106407236B (en) A kind of emotion tendency detection method towards comment data
CN108628828A (en) A kind of joint abstracting method of viewpoint and its holder based on from attention
CN107092596A (en) Text emotion analysis method based on attention CNNs and CCR
CN101127042A (en) Sensibility classification method based on language model
CN105975625A (en) Chinglish inquiring correcting method and system oriented to English search engine
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN106095749A (en) A kind of text key word extracting method based on degree of depth study
CN105512687A (en) Emotion classification model training and textual emotion polarity analysis method and system
CN111274814B (en) Novel semi-supervised text entity information extraction method
CN105095190B (en) A kind of sentiment analysis method combined based on Chinese semantic structure and subdivision dictionary
CN103853834B (en) Text structure analysis-based Web document abstract generation method
CN112307153B (en) Automatic construction method and device of industrial knowledge base and storage medium
CN103207855A (en) Fine-grained sentiment analysis system and method specific to product comment information
CN112069826A (en) Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN106503256B (en) A kind of hot information method for digging based on social networks document
CN110941720A (en) Knowledge base-based specific personnel information error correction method
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
CN109325159A (en) Microblog hot event mining method
CN111460147A (en) Title short text classification method based on semantic enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190212