CN109325159A - A kind of microblog hot event method for digging - Google Patents
- Publication number
- CN109325159A (application CN201810860009.8A)
- Authority
- CN
- China
- Prior art keywords
- microblogging
- microblog
- event
- data
- entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a method for mining microblog hot events and belongs to the technical field of data processing. Microblog data is first crawled and a microblog database is established. The crawled data is then preprocessed, and named entity recognition is performed on the preprocessed data. Based on the results of preprocessing and named entity recognition, the entities and event trigger words of the microblog data are extracted to determine the event expressed by each microblog post. Finally, the similarity between posts is calculated, and the similarity results, publisher information, and posting times are analyzed to obtain the microblog hot events. Compared with the prior art, the invention mainly addresses two problems: the lack of a large, complete training corpus during microblog preprocessing, and the large errors in entity recognition caused by the non-standard nature of microblog data. Both problems lower the accuracy of event extraction from microblogs; addressing them improves the efficiency of microblog hot event mining.
Description
Technical field
The invention relates to a method for mining microblog hot events and belongs to the technical field of data processing.
Background art
In recent years, social media platforms such as microblogs have emerged in large numbers. As a representative new communication medium, the microblog has become one of the most popular online tools for expressing ideas, sharing information, and exchanging opinions. Compared with formal news text, microblogs allow more accurate and richer event information to be extracted in a more timely manner. By mining hot events in microblogs, we can promptly learn of events large and small at home and abroad, understand people's reactions to and views on those events, and filter out useful information, which supports real-time monitoring, risk assessment, and decision-making.
Because microblog data is updated rapidly, traditional microblog preprocessing techniques often lack a large, complete training corpus. At the same time, because each microblog post is short and carries limited information, traditional named entity recognition techniques for microblogs struggle to fuse large amounts of related information. Both issues make microblog hot event mining difficult.
Summary of the invention
The technical problem to be solved by the invention is to overcome the limitations and deficiencies of the prior art by providing a method for mining microblog hot events. The method mainly addresses the lack of a large, complete training corpus during microblog preprocessing, and the large errors that arise during named entity recognition because microblog data is non-standard. Both problems lower the accuracy of event extraction from microblogs; addressing them improves the efficiency of microblog hot event mining.
The technical solution of the invention is a method for mining microblog hot events, comprising the following six steps:
Step 1: Crawl microblog data and establish a microblog database.
Step 2: Preprocess the crawled microblog data.
Step 3: Perform named entity recognition on the preprocessed microblog data.
Step 4: Based on the results of preprocessing and named entity recognition, extract the entities and event trigger words of the microblog data.
Step 5: Combine the entities and trigger words to determine the event expressed by each microblog post, and calculate the similarity between posts.
Step 6: Analyze the similarity results, publisher information, and posting times to obtain the microblog hot events.
Further, in Step 1 the microblog database is established as follows: 100,000 microblog posts are crawled in order of posting time, each including the post text, publisher, and posting time; the post text, publisher, and posting time are then written to a local database.
Further, the preprocessing in Step 2 comprises filtering, word segmentation, and part-of-speech tagging of the microblog data.
Further, filtering the microblog data consists of discarding posts whose text contains fewer than 5 characters and removing punctuation marks and emoticons from the post text. Word segmentation, in order to obtain more accurate segmentation results for microblog text, proceeds as follows: a segmentation dictionary is first established; one fifth of the filtered post texts are segmented and the results are added to the segmentation dictionary; another fifth of the filtered post texts are then segmented and the results are again added to the dictionary; and so on, until all filtered post texts have been segmented. Part-of-speech tagging labels each word of the segmented microblog data with its part of speech for subsequent processing.
Further, the named entity recognition in Step 3 identifies the entities in the post text, such as person names, place names, organization names, and other proper nouns.
Further, the part-of-speech tagging is implemented with the HanLP natural language processing package. The named entity recognition uses semi-supervised learning: the entity data already recognized is fed into the model, which then continues to recognize entities in the remaining post texts; this cycle repeats, with recognized entity data continually fed back into the model to recognize entities in the remaining post texts.
Further, the event trigger word in Step 4 is a word that indicates the occurrence of an event, usually a verb, noun, gerund, or preposition. The part of speech of each word is available from the part-of-speech tagging. The four parts of speech are prioritized, from high to low, as verb, noun, gerund, preposition. If only one of these parts of speech is present, that word is taken as the event trigger word; if two or more are present, the two highest-priority words are taken as the event trigger words. Combining the entities with the event trigger words yields the event expressed by the post.
Further, in Step 5 the similarity Sim(A, B) between two microblog texts A and B is calculated from their word-frequency vectors, where a_i and b_i are the i-th values of the word-frequency vectors formed from the entities and event trigger words obtained by processing A and B, respectively.
Further, the microblog hot events in Step 6 are obtained as follows: each post is traversed, and the similarity results, publisher information, and posting times are analyzed. Posts that simultaneously have a similarity above 85%, different publishers, and posting times no more than 12 hours apart are regarded as the same event, and the number of posts for that event is counted. Finally, the ratio of each event's post count to the total post count is calculated, and the events, sorted by this ratio from high to low, are the microblog hot events.
Further, the ratio of each event's post count to the total post count is calculated as K = W / N, where K is the ratio of an event's post count to the total post count, W is the number of posts for that event, and N is the total number of posts.
The beneficial effects of the invention are as follows: it mainly addresses the lack of a large, complete training corpus during microblog preprocessing, and the large errors that arise during named entity recognition because microblog data is non-standard. Both problems lower the accuracy of event extraction from microblogs; addressing them improves the efficiency of microblog hot event mining.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the invention;
Fig. 2 is a flow chart of Step 2 of the invention;
Fig. 3 is a flow chart of Steps 3 and 4 of the invention;
Fig. 4 is a flow chart of Steps 5 and 6 of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
Embodiment 1: As shown in Figs. 1-4, in a method for mining microblog hot events, microblog data is first crawled and a microblog database is established. The crawled data is then preprocessed, and named entity recognition is performed on the preprocessed data. Based on the results of preprocessing and named entity recognition, the entities and event trigger words of the microblog data are extracted to determine the event expressed by each microblog post. Finally, the similarity between posts is calculated, and the similarity results, publisher information, and posting times are analyzed to obtain the microblog hot events.
The specific steps are as follows:
Step 1: Crawl microblog data and establish a microblog database.
Step 2: Preprocess the crawled microblog data.
Step 3: Perform named entity recognition on the preprocessed microblog data.
Step 4: Based on the results of preprocessing and named entity recognition, extract the entities and event trigger words of the microblog data.
Step 5: Combine the entities and trigger words to determine the event expressed by each microblog post, and calculate the similarity between posts.
Step 6: Analyze the similarity results, publisher information, and posting times to obtain the microblog hot events.
Further, in Step 1 the microblog database is established as follows: 100,000 microblog posts are crawled in order of posting time, each including the post text, publisher, and posting time; the post text, publisher, and posting time are then written to a local database.
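As a minimal sketch of the storage half of Step 1, the snippet below writes (text, publisher, posting time) records to a local database in order of posting time. The crawler itself is omitted. The choice of SQLite, the table name `microblog`, the column names, and the ISO-8601 timestamp strings are illustrative assumptions; the document only states that the three fields are written to a local database.

```python
import sqlite3
from typing import Iterable, Tuple

# Hypothetical record layout: (text, publisher, posted_at as an ISO-8601 string).
Post = Tuple[str, str, str]


def store_posts(conn: sqlite3.Connection, posts: Iterable[Post]) -> int:
    """Write posts to a local table in order of posting time; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS microblog ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "text TEXT NOT NULL, "
        "publisher TEXT NOT NULL, "
        "posted_at TEXT NOT NULL)"
    )
    # ISO-8601 strings sort chronologically, so a plain string sort is enough here.
    rows = sorted(posts, key=lambda post: post[2])
    conn.executemany(
        "INSERT INTO microblog (text, publisher, posted_at) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

In use, `store_posts(sqlite3.connect("microblog.db"), crawled_posts)` would persist one crawl batch, where `crawled_posts` is whatever the crawler produces.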
Further, the preprocessing in Step 2 comprises filtering, word segmentation, and part-of-speech tagging of the microblog data.
Further, filtering the microblog data consists of discarding posts whose text contains fewer than 5 characters and removing punctuation marks and emoticons from the post text. Word segmentation, in order to obtain more accurate segmentation results for microblog text, proceeds as follows: a segmentation dictionary is first established; one fifth of the filtered post texts are segmented and the results are added to the segmentation dictionary; another fifth of the filtered post texts are then segmented and the results are again added to the dictionary; and so on, until all filtered post texts have been segmented. Part-of-speech tagging labels each word of the segmented microblog data with its part of speech for subsequent processing.
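The filtering and batch-wise segmentation described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: punctuation and emoticons are approximated by the Unicode punctuation and symbol categories, the length check is applied to the raw text before cleaning (following the order in which the operations are listed), and the segmenter is passed in as a hypothetical `segment(text, dictionary)` callable because the document does not name one.

```python
import math
import unicodedata
from typing import Callable, List, Set


def clean_text(text: str) -> str:
    """Drop characters in Unicode punctuation (P*) and symbol (S*) categories."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in "PS")


def filter_posts(texts: List[str], min_chars: int = 5) -> List[str]:
    """Discard posts shorter than min_chars, then strip punctuation and emoticons."""
    return [clean_text(t) for t in texts if len(t) >= min_chars]


def segment_in_batches(
    texts: List[str],
    segment: Callable[[str, Set[str]], List[str]],  # hypothetical segmenter
    dictionary: Set[str],
    parts: int = 5,
) -> List[List[str]]:
    """Segment texts one fifth at a time, growing the dictionary after each batch."""
    batch_size = max(1, math.ceil(len(texts) / parts))
    results: List[List[str]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_results = [segment(text, dictionary) for text in batch]
        # Words found in this batch become available to the next batch.
        for words in batch_results:
            dictionary.update(words)
        results.extend(batch_results)
    return results
```

The point of the batching is that each later fifth is segmented with a dictionary that already contains the words found in the earlier fifths.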
Further, the named entity recognition in Step 3 identifies the entities in the post text, such as person names, place names, organization names, and other proper nouns.
Further, the part-of-speech tagging is implemented with the HanLP natural language processing package. The named entity recognition uses semi-supervised learning: the entity data already recognized is fed into the model, which then continues to recognize entities in the remaining post texts; this cycle repeats, with recognized entity data continually fed back into the model to recognize entities in the remaining post texts.
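The feedback cycle of the semi-supervised recognition can be sketched as a simple bootstrapping loop. The recognizer is a hypothetical `recognize(text, known_entities)` callable standing in for the model, which the document does not specify; the stopping rule (no new entities, or a round limit) is also an assumption.

```python
from typing import Callable, Iterable, List, Set


def bootstrap_entities(
    texts: List[str],
    recognize: Callable[[str, Set[str]], Iterable[str]],  # hypothetical model
    seed_entities: Iterable[str],
    max_rounds: int = 10,
) -> Set[str]:
    """Repeatedly feed recognized entities back into the recognizer."""
    known: Set[str] = set(seed_entities)
    for _ in range(max_rounds):
        found: Set[str] = set()
        for text in texts:
            found.update(recognize(text, known))
        new_entities = found - known
        if not new_entities:
            break  # Nothing new was recognized, so another round cannot help.
        known |= new_entities
    return known
```

Each round uses only the entities known at its start, so entities discovered in one round influence recognition from the next round onward.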
Further, the event trigger word in Step 4 is a word that indicates the occurrence of an event, usually a verb, noun, gerund, or preposition. The part of speech of each word is available from the part-of-speech tagging. The four parts of speech are prioritized, from high to low, as verb, noun, gerund, preposition. If only one of these parts of speech is present, that word is taken as the event trigger word; if two or more are present, the two highest-priority words are taken as the event trigger words. Combining the entities with the event trigger words yields the event expressed by the post.
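The priority rule for choosing trigger words can be sketched as below. The tag names `v`, `n`, `vn`, and `p` are assumed to follow PKU-style part-of-speech abbreviations for verb, noun, gerund, and preposition; the tie-breaking choices (first word in text order when only one part of speech is present, stable ordering within a part of speech otherwise) are also assumptions, since the document does not spell them out.

```python
from typing import Dict, List, Tuple

# Lower number = higher priority: verb, noun, gerund, preposition (assumed tag names).
PRIORITY: Dict[str, int] = {"v": 0, "n": 1, "vn": 2, "p": 3}


def select_triggers(tagged: List[Tuple[str, str]]) -> List[str]:
    """Pick event trigger words from (word, part-of-speech tag) pairs."""
    candidates = [(word, tag) for word, tag in tagged if tag in PRIORITY]
    if not candidates:
        return []
    tags_present = {tag for _, tag in candidates}
    if len(tags_present) == 1:
        # Only one candidate part of speech: take a single word.
        return [candidates[0][0]]
    # Two or more parts of speech: take the two highest-priority words.
    ranked = sorted(candidates, key=lambda pair: PRIORITY[pair[1]])
    return [word for word, _ in ranked[:2]]
```

For a tagged post such as `[("北京", "ns"), ("发生", "v"), ("地震", "n"), ("在", "p")]`, the place name is left to entity recognition and the verb and noun are selected as triggers.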
Further, in Step 5 the similarity Sim(A, B) between two microblog texts A and B is calculated from their word-frequency vectors, where a_i and b_i are the i-th values of the word-frequency vectors formed from the entities and event trigger words obtained by processing A and B, respectively.
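The similarity formula itself is not reproduced in this text. The variable definitions (two word-frequency vectors with components a_i and b_i, and a result later compared against 85%) are consistent with cosine similarity, Sim(A, B) = Σ a_i·b_i / (√(Σ a_i²) · √(Σ b_i²)). The sketch below assumes that reading; it should not be taken as the formula the applicants actually claimed.

```python
import math
from collections import Counter
from typing import Sequence


def cosine_similarity(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Cosine similarity of the word-frequency vectors of two word lists."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Build both vectors over a shared vocabulary so that index i means the same word.
    vocabulary = sorted(set(freq_a) | set(freq_b))
    a = [freq_a[word] for word in vocabulary]
    b = [freq_b[word] for word in vocabulary]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty word list has no direction to compare.
    return dot / (norm_a * norm_b)
```

Here each word list is the set of entities and event trigger words extracted from one post, so two posts naming the same entities with the same triggers score close to 1.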
Further, the microblog hot events in Step 6 are obtained as follows: each post is traversed, and the similarity results, publisher information, and posting times are analyzed. Posts that simultaneously have a similarity above 85%, different publishers, and posting times no more than 12 hours apart are regarded as the same event, and the number of posts for that event is counted. Finally, the ratio of each event's post count to the total post count is calculated, and the events, sorted by this ratio from high to low, are the microblog hot events.
Further, the ratio of each event's post count to the total post count is calculated as K = W / N, where K is the ratio of an event's post count to the total post count, W is the number of posts for that event, and N is the total number of posts.
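Step 6 and the ratio K = W / N can be sketched together as follows. The three pairwise conditions (similarity above 0.85, different publishers, posting times within 12 hours) come from the description; merging matching pairs with a union-find structure is an assumption, since the document does not say how pairwise matches are combined into events. The `Post` class and the `similarity` callable are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Post:
    words: Sequence[str]  # entities and event trigger words of the post
    publisher: str
    posted_at: datetime


def rank_events(
    posts: List[Post],
    similarity: Callable[[Sequence[str], Sequence[str]], float],
    threshold: float = 0.85,
    window: timedelta = timedelta(hours=12),
) -> List[Tuple[List[int], float]]:
    """Group posts into events and return (post indices, K = W / N), highest K first."""
    parent = list(range(len(posts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(posts)):
        for j in range(i + 1, len(posts)):
            p, q = posts[i], posts[j]
            if (
                similarity(p.words, q.words) > threshold
                and p.publisher != q.publisher
                and abs(p.posted_at - q.posted_at) <= window
            ):
                parent[find(i)] = find(j)  # same event: merge the two groups

    groups: Dict[int, List[int]] = {}
    for i in range(len(posts)):
        groups.setdefault(find(i), []).append(i)

    total = len(posts)
    ranked = sorted(groups.values(), key=len, reverse=True)
    return [(members, len(members) / total) for members in ranked]
```

The pairwise loop is quadratic in the number of posts, so for a corpus on the scale of 100,000 posts a practical system would likely restrict comparisons, for example to posts within the same 12-hour window.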
The embodiments of the invention have been explained in detail above with reference to the drawings, but the invention is not limited to these embodiments. Within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the concept of the invention.
Claims (8)
1. A method for mining microblog hot events, characterized by comprising:
Step 1: crawling microblog data and establishing a microblog database;
Step 2: preprocessing the crawled microblog data;
Step 3: performing named entity recognition on the preprocessed microblog data;
Step 4: based on the results of preprocessing and named entity recognition, extracting the entities and event trigger words of the microblog data;
Step 5: combining the entities and trigger words to determine the event expressed by each microblog post, and calculating the similarity between posts;
Step 6: analyzing the similarity results, publisher information, and posting times to obtain the microblog hot events.
2. The method for mining microblog hot events according to claim 1, characterized in that the microblog database in Step 1 is established as follows: 100,000 microblog posts are crawled in order of posting time, each including the post text, publisher, and posting time; the post text, publisher, and posting time are then written to a local database.
3. The method for mining microblog hot events according to claim 1, characterized in that the preprocessing in Step 2 comprises filtering, word segmentation, and part-of-speech tagging of the microblog data.
4. The method for mining microblog hot events according to claim 3, characterized in that the filtering of the microblog data specifically comprises discarding posts whose text contains fewer than 5 characters and removing punctuation marks and emoticons from the post text; the word segmentation specifically comprises first establishing a segmentation dictionary, segmenting one fifth of the filtered post texts and adding the results to the segmentation dictionary, then segmenting another fifth of the filtered post texts and adding the results to the segmentation dictionary, until all filtered post texts have been segmented; and the part-of-speech tagging labels the segmented microblog data with parts of speech.
5. The method for mining microblog hot events according to claim 4, characterized in that the part-of-speech tagging is implemented with the HanLP natural language processing package, and the named entity recognition uses semi-supervised learning, namely the entity data already recognized is fed into the model, which continues to recognize entities in the remaining post texts, and this cycle repeats, with recognized entity data continually fed back into the model to recognize entities in the remaining post texts.
6. The method for mining microblog hot events according to claim 1, characterized in that in Step 5 the similarity Sim(A, B) between two microblog texts A and B is calculated from their word-frequency vectors, where a_i and b_i are the i-th values of the word-frequency vectors formed from the entities and event trigger words obtained by processing A and B, respectively.
7. The method for mining microblog hot events according to claim 1, characterized in that the microblog hot events in Step 6 are obtained as follows: each post is traversed, and the similarity results, publisher information, and posting times are analyzed; posts that simultaneously have a similarity above 85%, different publishers, and posting times no more than 12 hours apart are regarded as the same event, and the number of posts for that event is counted; finally, the ratio of each event's post count to the total post count is calculated, and the events, sorted by this ratio from high to low, are the microblog hot events.
8. The method for mining microblog hot events according to claim 7, characterized in that the ratio of each event's post count to the total post count is calculated as K = W / N, where K is the ratio of an event's post count to the total post count, W is the number of posts for that event, and N is the total number of posts.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810860009.8A CN109325159A (en) | 2018-08-01 | 2018-08-01 | A kind of microblog hot event method for digging |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109325159A true CN109325159A (en) | 2019-02-12 |
Family
ID=65264063
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810860009.8A Pending CN109325159A (en) | 2018-08-01 | 2018-08-01 | A kind of microblog hot event method for digging |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109325159A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334702A (en) * | 2019-05-30 | 2019-10-15 | 深圳壹账通智能科技有限公司 | Data transmission method, device and computer equipment based on configuration platform |
CN112800767A (en) * | 2021-01-31 | 2021-05-14 | 云知声智能科技股份有限公司 | Method and system for checking basic information of patient in medical record text |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104484343A (en) * | 2014-11-26 | 2015-04-01 | 无锡清华信息科学与技术国家实验室物联网技术中心 | Topic detection and tracking method for microblog |
CN104615593A (en) * | 2013-11-01 | 2015-05-13 | 北大方正集团有限公司 | Method and device for automatic detection of microblog hot topics |
CN107633044A (en) * | 2017-09-14 | 2018-01-26 | 国家计算机网络与信息安全管理中心 | A kind of public sentiment knowledge mapping construction method based on focus incident |
CN107885793A (en) * | 2017-10-20 | 2018-04-06 | 江苏大学 | A kind of hot microblog topic analyzing and predicting method and system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334702A (en) * | 2019-05-30 | 2019-10-15 | 深圳壹账通智能科技有限公司 | Data transmission method, device and computer equipment based on configuration platform |
WO2020238556A1 (en) * | 2019-05-30 | 2020-12-03 | 深圳壹账通智能科技有限公司 | Configuration platform-based data transmission method, apparatus and computer device |
CN112800767A (en) * | 2021-01-31 | 2021-05-14 | 云知声智能科技股份有限公司 | Method and system for checking basic information of patient in medical record text |
CN112800767B (en) * | 2021-01-31 | 2023-11-21 | 云知声智能科技股份有限公司 | Method and system for checking basic information of patient in medical record text |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
WO2021114745A1 (en) | Named entity recognition method employing affix perception for use in social media | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
CN106407236B (en) | A kind of emotion tendency detection method towards comment data | |
CN108628828A (en) | A kind of joint abstracting method of viewpoint and its holder based on from attention | |
CN107092596A (en) | Text emotion analysis method based on attention CNNs and CCR | |
CN101127042A (en) | Sensibility classification method based on language model | |
CN105975625A (en) | Chinglish inquiring correcting method and system oriented to English search engine | |
CN106776538A (en) | The information extracting method of enterprise's noncanonical format document | |
CN106095749A (en) | A kind of text key word extracting method based on degree of depth study | |
CN105512687A (en) | Emotion classification model training and textual emotion polarity analysis method and system | |
CN111274814B (en) | Novel semi-supervised text entity information extraction method | |
CN105095190B (en) | A kind of sentiment analysis method combined based on Chinese semantic structure and subdivision dictionary | |
CN103853834B (en) | Text structure analysis-based Web document abstract generation method | |
CN112307153B (en) | Automatic construction method and device of industrial knowledge base and storage medium | |
CN103207855A (en) | Fine-grained sentiment analysis system and method specific to product comment information | |
CN112069826A (en) | Vertical domain entity disambiguation method fusing topic model and convolutional neural network | |
CN106503256B (en) | A kind of hot information method for digging based on social networks document | |
CN110941720A (en) | Knowledge base-based specific personnel information error correction method | |
CN107818173B (en) | Vector space model-based Chinese false comment filtering method | |
CN107451116B (en) | Statistical analysis method for mobile application endogenous big data | |
CN109325159A (en) | A kind of microblog hot event method for digging | |
CN111460147A (en) | Title short text classification method based on semantic enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190212 |