CN103631862B - Event characteristic evolution excavation method and system based on microblogs - Google Patents

Event characteristic evolution excavation method and system based on microblogs

Info

Publication number
CN103631862B
Authority
CN
China
Prior art keywords
event
evolution
microblog
micro
edge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310532377.7A
Other languages
Chinese (zh)
Other versions
CN103631862A (en
Inventor
邓镭
贾焰
邹鹏
杨树强
周斌
韩伟红
李爱平
韩毅
李莎莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a microblog-based method and system for mining the evolution of event features. In the method, an evolution starting document set is selected from a time-ordered microblog sequence, and graph models of the documents are constructed over the microblog document set based on word co-occurrence, yielding a knowledge network structure of the event. The microblog graph models are then merged according to the literal form of the words and the compatibility of their tendency scores, producing a micro-evolution graph of the event features. Finally, the micro-evolution graph is pruned, segmented and transformed into a macro-evolution graph of the event features. Because the evolution of event features is mined with a graph-mining method built on the event's knowledge network, the method as a whole better preserves the inheritance of knowledge, and its mining results are more interpretable.

Description

Event feature evolution mining method and system based on microblog
Technical Field
The invention relates to the fields of text mining and topic detection and tracking, and in particular to a method and system for mining the evolution of event features from microblog text data.
Background
With the rapid development of Web 2.0 technologies and applications in recent years, online microblog services have gradually become a new information dissemination platform with a large number of users and a large volume of generated information. According to the 29th China Internet report, by the end of December 2011 the number of microblog users had reached 250 million, an increase of 296.0% over the end of the previous year, and 48.7% of Internet users used microblogs.
Unlike strong-relationship social networking services such as Facebook, the social relationships in microblog services are generally unidirectional: a user can follow other users without their authorization and receive the information they generate. The people a user follows are called the user's friends; the people who follow a user are called the user's fans. All posts (tweets) published by a user appear on the public timeline, and all of the user's messages are displayed on the timelines of the user's fans.
The projection of a real-world topic or event onto the microblog text space is the set of posts in which users discuss that topic or event. (In the field of text analysis, the two concepts of topic and event are sometimes not distinguished, and that convention is adopted hereinafter.) In reality, topics and events evolve, and correspondingly so do topics and events in the microblog text space. A topic/event evolves at the moments when fans forward or comment on information posted by the users they follow. Besides repeating, explicitly or implicitly, the viewpoints and statements of the original post, a forward or comment introduces new viewpoints and new statements, and the topic changes to some degree. The evolution of a topic begins the first time the original post is forwarded or commented on. As forwarding and commenting continue, the scope of the topic keeps expanding and the topic keeps evolving. Studying the evolution of a topic/event during propagation means tracking the slight change in the topic/event information at each propagation step while also considering the change of the topic/event at the macro scale.
At present, research on the propagation and evolution of topic/event information on microblogs falls into two categories. The first category analyzes the behavioral elements of topic/event propagation, establishes a mathematical model of topic propagation and evolution, and simulates the propagation and evolution process in order to answer the question of why topics/events propagate. This research leans toward simulation modeling theory at the level of communication studies and has no practical significance for studying the propagation and evolution of a specific topic/event. The second category combines the social network information in microblogs with a traditional topic/event model and infers the propagation process of a topic/event in microblogs. It yields two results: the explicit and implicit propagation paths of the topic/event in microblogs, and the change of the topic/event model during propagation. The basic steps of such studies are:
1. Arrange the texts discussing the same topic/event in the microblog in chronological order, preserving explicit forwarding relations, and process the texts from earliest to latest and in forwarding order. If necessary, introduce the concept of a time slice and process the texts in the same time slice together. Where no time slice is introduced, each document can be regarded as occupying a time slice of its own.
2. Establish a topic/event model for each time slice; vector space models and probabilistic models may be considered. If necessary, split the topic model of a time slice into several sub-topics that express different aspects of the topic.
3. Taking the topic/event model at time 0 as a reference, examine in turn the topic/event model of each text in the subsequent time slices, compare the similarity between the later and earlier models, and infer the propagation relation. Given the locality of information flow in microblogs, this step must also consider the relationship between the users who produced the two texts: if there is no evident relationship between the two users, the probability of a propagation relation between the texts is considered small.
4. In step 3, each document can be regarded as a vertex and each propagation relation between documents as an edge between vertices, so that a propagation tree or propagation graph of the text information can be constructed. This graph depicts the explicit/implicit propagation paths of the topic/event information in the microblog. Examining the topic/event model of each vertex along a path, the pattern of change in the model is the evolution pattern of the topic/event along that path.
As the above description shows, because the evolution of the topic/event is examined while the propagation model is being built, the evolution process has no independent model of its own and instead depends on topic models such as vector space or probabilistic models. These topic models are effective representations of document sets but are weak at expressing topic evolution. As a result, the evolution analysis obtained in this way amounts to no more than the change of word frequencies or word vectors over time: it carries no association information among words, shows no inheritance of the domain knowledge of the topic/event, and lacks interpretability. A new topic/event feature evolution mining method is therefore needed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a novel microblog-based event characteristic evolution mining method and system.
The purpose of the invention is realized by the following technical scheme:
In one aspect, the invention provides a microblog-based event feature evolution mining method, comprising the following steps:
step 1, selecting, from a set of microblog texts related to an event to be analyzed, a plurality of microblogs representing the starting point of the event, to form an event evolution starting point microblog set;
step 2, constructing a graph model of the event evolution starting point microblog set as the initial event micro-evolution graph, wherein the vertices of the graph model are the nouns/verbs appearing in the microblog texts of the event evolution starting point microblog set, and an edge between two vertices indicates that the words corresponding to the two vertices appear together in the same microblog or that their co-occurrence distance is smaller than a preset threshold;
step 3, for each of the remaining microblogs in the set of microblog texts related to the event to be analyzed, constructing a graph model of the microblog and adding it to the current event micro-evolution graph;
step 4, obtaining an event macro-evolution graph from the event micro-evolution graph obtained in step 3, and observing the evolution of the event features on the basis of the event macro-evolution graph.
In the above method, the microblog representing the event starting point in the step 1 may have the following characteristics: a) the publication time is early; b) is the original microblog, not the forwarded or commented microblog.
In the method, the vertex of the graph model in step 2 may be represented by a triplet including a noun/verb corresponding to the vertex, a set of microblog documents including the noun/verb, and a tendency score of the noun/verb, where the tendency score of the noun/verb is an average of tendency scores corresponding to adjectives and adverbs that modify the noun/verb.
In the above method, the step 2 may include:
step 2-1) performing word segmentation and part-of-speech tagging on each microblog text in the event evolution starting point microblog set;
step 2-2) assigning tendency scores to the adjectives and adverbs obtained from word segmentation;
step 2-3) for each noun and verb obtained from word segmentation, averaging the tendency scores of the adjectives and adverbs that modify that noun/verb, and taking the average as the tendency score of the noun or verb;
step 2-4) taking the nouns and verbs as vertices, and creating an edge between any two vertices whose corresponding words appear together in the same microblog or whose co-occurrence distance is smaller than a preset threshold.
In the above method, adding the constructed microblog graph model to the current event micro-evolution graph in step 3 may include, for each edge in the graph model of the microblog being processed:
a) if both vertices of the edge exist in the current event micro-evolution graph, accumulating the occurrence count of the edge if the edge already exists in the event micro-evolution graph, or copying the edge into the event micro-evolution graph if it does not;
b) if one and only one vertex of the edge appears in the current event micro-evolution graph, copying the missing vertex and the edge into the event micro-evolution graph;
c) if neither vertex of the edge is in the current event micro-evolution graph, copying the edge and both vertices into the event micro-evolution graph.
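The three cases a) to c) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the dictionary layout, the function name `merge_post_graph`, and the choice to identify a vertex by its word alone (leaving out the forwarding and tendency test) are all assumptions made for the example.

```python
def merge_post_graph(evolution, post_vertices, post_edges):
    """Merge one microblog's graph into the micro-evolution graph, in place.

    Assumed layout (hypothetical, for illustration only):
      vertices: {word: {"docs": set_of_doc_ids, "score": float}}
      edges:    {(word1, word2): {"count": int, "times": set_of_timestamps}}
    `evolution` is a (vertices, edges) pair using the same layout.
    """
    ev_vertices, ev_edges = evolution
    for (w1, w2), label in post_edges.items():
        for w in (w1, w2):
            if w not in ev_vertices:
                # Cases b) and c): copy every endpoint that is missing.
                src = post_vertices[w]
                ev_vertices[w] = {"docs": set(src["docs"]), "score": src["score"]}
            else:
                # Endpoint already known: record the new document as well.
                ev_vertices[w]["docs"] |= post_vertices[w]["docs"]
        if (w1, w2) in ev_edges:
            # Case a), edge already present: accumulate its occurrence count.
            ev_edges[(w1, w2)]["count"] += label["count"]
            ev_edges[(w1, w2)]["times"] |= label["times"]
        else:
            # Edge absent (any case): copy it into the evolution graph.
            ev_edges[(w1, w2)] = {"count": label["count"],
                                  "times": set(label["times"])}
    return evolution
```

Keying edges by a sorted word pair keeps `{v1, v2}` and `{v2, v1}` from being stored twice; the caller is assumed to build the keys that way.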
In the above method, step 3 may further include determining whether a vertex of the microblog graph model is already in the event micro-evolution graph: a given vertex of the microblog graph model is determined to be included in the event micro-evolution graph if (i) the event micro-evolution graph contains a vertex corresponding to the same word, (ii) the microblog has a forwarding or commenting relationship with a microblog text associated with that corresponding vertex, and (iii) the tendency scores of the two vertices are compatible, where compatibility means that the difference between the tendency score of the corresponding vertex in the event micro-evolution graph and that of the given vertex is less than a certain threshold.
In the above method, step 4 may include segmenting and transforming the event micro-evolution graph to obtain the event macro-evolution graph.
In the above method, segmenting and transforming the event micro-evolution graph may include:
step 4-1) sorting the microblog texts related to the event to be analyzed by time, and slicing the microblog text sequence by time into time slices of the required granularity;
step 4-2) creating a vertex in the event macro-evolution graph, the vertex corresponding to the initial event micro-evolution graph;
step 4-3) performing the following for each time slice:
4-3-a) selecting the vertices and edges of the event micro-evolution graph that correspond to the time slice, and constructing the minimum connected subgraph based on them;
4-3-b) creating a vertex in the event macro-evolution graph corresponding to that minimum connected subgraph, and, if the minimum connected subgraph intersects the subgraph corresponding to another vertex of the event macro-evolution graph, creating an edge connecting the two corresponding vertices.
In the above method, step 4-3) may further include assigning a weight to the created edge, the weight being the Jaccard coefficient of the subgraphs corresponding to its two vertices. For any two vertices $v$ and $v'$ of the event macro-evolution graph, the Jaccard coefficient of the corresponding subgraphs is computed as

$$J(v, v') = \frac{\#(G_v \cap G_{v'})}{\#(G_v \cup G_{v'})}$$

where $G_v \cap G_{v'}$ and $G_v \cup G_{v'}$ denote the intersection and union, respectively, of the vertex sets of the subgraphs corresponding to the two vertices, and the function $\#(\cdot)$ returns the number of elements in a set.
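The Jaccard weighting of macro-level edges can be illustrated with a short Python sketch. The function names and the representation of each subgraph as a plain set of vertex words are assumptions made for the example; the construction of the minimum connected subgraphs themselves is not shown.

```python
def jaccard(vertices_a, vertices_b):
    """Jaccard coefficient of two subgraphs, computed on their vertex sets."""
    a, b = set(vertices_a), set(vertices_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_macro_vertices(subgraphs):
    """Return weighted macro edges {(i, j): weight} for intersecting subgraphs.

    `subgraphs` is a list of vertex sets, one per macro vertex, in time order
    (index 0 is assumed to be the subgraph of the initial graph).
    """
    macro_edges = {}
    for j in range(len(subgraphs)):
        for i in range(j):
            weight = jaccard(subgraphs[i], subgraphs[j])
            if weight > 0:  # the two subgraphs share at least one vertex
                macro_edges[(i, j)] = weight
    return macro_edges
```

For instance, the vertex sets `{"a", "b"}` and `{"b", "c"}` share one of three distinct vertices, so the edge between them receives weight 1/3, while a disjoint subgraph receives no edge.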
In the above method, the step 4 may further include pruning the event micro-evolution graph, where the pruning includes deleting an edge whose occurrence frequency in the event micro-evolution graph is lower than a given threshold, and then deleting a branch that is not connected to the initial event micro-evolution graph, where the occurrence frequency of the edge refers to a frequency of occurrence of words corresponding to two vertices of the edge in the same microblog in the set of microblog texts related to the event to be analyzed.
In another aspect, the present invention provides a microblog-based event feature evolution mining system, including:
a device for selecting, from a set of microblog texts related to an event to be analyzed, a plurality of microblogs representing the starting point of the event, to form an event evolution starting point microblog set;
a device for constructing a graph model of the event evolution starting point microblog set as the initial event micro-evolution graph, wherein the vertices of the graph model are the nouns/verbs appearing in the microblog texts of the event evolution starting point microblog set, and an edge between two vertices indicates that the words corresponding to the two vertices appear together in the same microblog or that their co-occurrence distance is smaller than a preset threshold;
a device for constructing a graph model of each microblog in the set of microblog texts related to the event to be analyzed and adding it to the current event micro-evolution graph;
and a device for obtaining the event macro-evolution graph from the final event micro-evolution graph and observing the evolution of the event features on the basis of the event macro-evolution graph.
Compared with the prior art, the invention has the advantages that:
A knowledge structure among words is constructed on the basis of the event's graph model, yielding an event evolution model that is more interpretable at the knowledge level. The event evolution graph is built on the event graph model with the knowledge network as its unit, which improves the inheritance of event knowledge. The method takes the characteristics of microblog texts into account: by applying statistical methods to the large number of texts and users, it compensates for the shortness of individual microblog texts and the sparsity of their features.
Drawings
Embodiments of the invention are described in detail below with reference to the attached drawing figures, wherein:
Fig. 1 is a schematic flow chart of a microblog-based event feature evolution mining method according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In one embodiment, the invention provides a microblog-based event feature evolution mining method with higher discriminability and interpretability, which goes beyond document boundaries to mine and track the event evolution process at a fine granularity from the perspective of event knowledge. The specific steps of the method are described below with reference to Fig. 1.
Step 1, acquiring a set of microblog texts discussing the same event, and selecting from the set a plurality of evolution starting point microblogs. An evolution starting point microblog is a microblog that represents the starting point of the event and has the following characteristics: a) it was published early; b) it is an original microblog rather than a forward or comment. According to an embodiment of the present invention, step 1 may include the following steps:
Step 1-1, acquiring a set of microblog texts discussing the same event, for example by keyword search.
Step 1-2, sorting the microblogs discussing the same event chronologically, that is, arranging the microblog texts in the set from earliest to latest by publication time while preserving the explicit forwarding and comment relations among them (forwarding and commenting are treated alike in this application). The sequence can be written as $D = \{d_1, d_2, \ldots, d_n\}$, where the subscripts $1$ to $n$ serve as the time marks of the documents; because time is infinitely divisible, at most one document can be considered to be produced at any one moment. A forwarding indication function $Rt: D \times D \to \{0,1\}$ is defined on the sequence to represent the forwarding relation between documents: for documents $d_i, d_j$ with $0 < i < j < n$, $Rt(d_i, d_j) = 1$ if document $d_j$ forwards document $d_i$, and $0$ otherwise. On this basis a function $isRt: D \to \{0,1\}$ can be defined to indicate whether each document is an original document (0) or a forwarded document (1). In addition, a set-level version of the forwarding indication function, $Rt: 2^D \times 2^D \to \{0,1\}$, is defined for document sets $D_1$ and $D_2$.
Step 1-3, selecting a plurality of evolution starting point microblogs from the set as the starting document set $D_0$. Since the microblog serving as the starting point of an event may not be unique, the number of forwarded documents can be used as the limiting condition for the starting document set. The candidate range $D_{candidate}$ of the starting document set $D_0$ consists of contiguous document subsequences aligned with the front of the microblog sequence $D$ above (that is, prefixes of $D$), satisfying:

$$D_{candidate} = \left\{ \{d_1, d_2, \ldots, d_k\} \;\middle|\; \sum_{i=1}^{k} isRt(d_i) \le \varepsilon_{start} \right\}$$

$D_0$ is the subset of $D_{candidate}$ consisting of original microblogs. Here $\varepsilon_{start}$ is a forwarding threshold, which may be set to 5, and the largest $k$ satisfying the inequality can be taken. $D_0$ is also called the event evolution starting point microblog set.
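The selection of the starting document set can be sketched as follows. This is only an illustration under stated assumptions: the `Post` record and its field names are hypothetical, `is_forward` plays the role of isRt, and the largest admissible prefix is always taken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one microblog; the field names are assumptions."""
    post_id: str
    timestamp: int      # publication time in any sortable unit
    is_forward: bool    # isRt(d): True for a forward/comment, False for an original


def select_start_set(posts, eps_start=5):
    """Return (D_candidate, D_0) for a collection of posts.

    D_candidate is the longest time-ordered prefix that contains at most
    eps_start forwarded posts; D_0 keeps only the originals in that prefix.
    """
    ordered = sorted(posts, key=lambda p: p.timestamp)
    forwards_seen = 0
    k = 0
    for post in ordered:
        if post.is_forward:
            if forwards_seen == eps_start:
                break  # including this post would exceed the threshold
            forwards_seen += 1
        k += 1
    candidate = ordered[:k]
    start_set = [p for p in candidate if not p.is_forward]
    return candidate, start_set
```

With a threshold of 2 and the sequence original, forward, original, forward, original, forward, the prefix stops just before the third forward, and the starting set consists of the three originals.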
Step 2, constructing a graph model of the event evolution starting point microblog set, that is, the knowledge network of the event's starting point.
For a single microblog text, according to an embodiment of the present invention, the graph model of the microblog can be built as follows. (1) Perform word segmentation and part-of-speech tagging on the text. (2) Obtain the tendency scores of the adjectives and adverbs among the segmented words by querying a tendency database, such as a tendency dictionary. The tendency score of an adjective/adverb can be attached as a feature value to the vertex of the noun or verb it modifies. A tendency score may be, for example, a real number in $[-1, 1]$: the closer to $-1$, the stronger the negative tendency; the closer to $1$, the stronger the positive tendency. More simply, tendency scores can be restricted to the three values $\{-1, 0, 1\}$, representing negative, neutral and positive tendency respectively. (3) Identify the adjectives and adverbs that modify each noun and verb among the segmented words, and average the tendency scores of the modifiers of the same word to obtain the tendency score of that noun or verb. The nouns and verbs then serve as vertices in the graph model, and each can be stamped with the current time mark. Edges are created between noun and verb vertices that satisfy a specified condition, namely that the two words appear in the same sentence or that their co-occurrence distance is less than a specified threshold. The number of times the association occurs and the time of occurrence may also be attached to the edge. The co-occurrence distance of two words is the number of characters or words between them when they appear in the same microblog.
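A minimal sketch of this construction for one microblog is given below. It assumes that segmentation, part-of-speech tagging and modifier attachment have already been done by external tools, so the inputs `tokens`, `modifies` and `lexicon`, the tag names `"n"`/`"v"`, and the neutral default score of 0.0 for unmodified words are all hypothetical choices made for the example. Co-occurrence is tested over the whole token list of the post, with the distance measured in words.

```python
from itertools import combinations
from statistics import mean


def build_post_graph(post_id, timestamp, tokens, modifies, lexicon, max_gap=None):
    """Build the graph model of one microblog.

    tokens   : list of (word, pos) pairs; "n" marks nouns and "v" verbs
    modifies : dict mapping a modifier's token index to the index of the
               noun/verb it modifies (assumed to come from an external parser)
    lexicon  : dict mapping adjectives/adverbs to tendency scores in [-1, 1]
    max_gap  : optional co-occurrence distance threshold, counted in words
    """
    # Collect the tendency scores of the modifiers attached to each head word.
    modifier_scores = {}
    for mod_idx, head_idx in modifies.items():
        mod_word, head_word = tokens[mod_idx][0], tokens[head_idx][0]
        modifier_scores.setdefault(head_word, []).append(lexicon.get(mod_word, 0.0))

    # Vertices: nouns and verbs, labelled with document set and tendency score.
    positions = {}
    for idx, (word, pos) in enumerate(tokens):
        if pos in ("n", "v"):
            positions.setdefault(word, []).append(idx)
    vertices = {
        word: {"docs": {post_id},
               "score": mean(modifier_scores[word]) if word in modifier_scores else 0.0}
        for word in positions
    }

    # Edges: co-occurrence in the post, optionally limited by word distance.
    edges = {}
    for w1, w2 in combinations(sorted(positions), 2):
        gap = min(abs(i - j) - 1 for i in positions[w1] for j in positions[w2])
        if max_gap is None or gap < max_gap:
            edges[(w1, w2)] = {"count": 1, "times": {timestamp}}
    return vertices, edges
```

Storing the count and timestamp set on each edge from the start means that merging per-post graphs later only has to add counts and union timestamp sets.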
For a set of event evolution starting point microblogs, according to an embodiment of the present invention, the following steps may be adopted to construct a graph model of the set:
Step 2-1, performing word segmentation and part-of-speech tagging on each document in the starting document set $D_0$.
Step 2-2, for the adjectives and adverbs obtained from segmentation, querying the tendency dictionary to obtain tendency scores, which may be real numbers in $[-1, 1]$ as mentioned above. For simplicity, the scores can be restricted to the three values $\{-1, 0, 1\}$, representing negative, neutral and positive tendency respectively. The tendency scores of adjectives and adverbs are ultimately attributed to nouns and verbs: the tendency score $s(w)$ of a word $w$ may be the average of the tendency scores of the adjectives or adverbs that modify the noun or verb $w$.
Step 2-3, constructing the graph model $G^{(0)} = \langle V, E \cup R, L_v, L_e \rangle$ of the starting document set $D_0$, which serves as the starting point for constructing the event micro-evolution graph. Here $V$ is the vertex set; $E$ and $R$ are edge sets of different types, $E$ containing direct connections and $R$ associative connections; $L_v$ is the labeling function of the vertex set $V$; and $L_e$ is the labeling function of the edge set $E$.
A vertex $v \in V$ represents a noun or verb and can be represented by a triplet consisting of the literal word, the set of documents in which the word occurs, and the word's tendency. The labeling function of a vertex is therefore:
$L_v(v) = \langle w_v, D_v, s(w_v) \rangle$, where $w_v$ is the word corresponding to vertex $v$, $D_v$ is the set of microblog documents containing the word $w_v$, and $s(w_v)$ is the tendency score of the word $w_v$, also called the tendency feature value of vertex $v$.
An edge in $E$ indicates a specific relationship between two vertices, for example that the words corresponding to the two vertices appear in the same microblog, or that their co-occurrence distance is smaller than a pre-specified threshold. The labeling function of an edge consists of the co-occurrence count of the words corresponding to its two vertices and the corresponding set of document timestamps, that is, the set of publication times of the microblogs containing both words. For $e = \{v_1, v_2\} \in E$:
$L_e(e) = \langle c(v_1, v_2), t_{v_1 v_2} \rangle$
where $c(v_1, v_2)$ is the number of times the words corresponding to vertices $v_1, v_2$ appear together in the same microblog, and $t_{v_1 v_2}$ is the set of timestamps of the microblogs containing the words corresponding to both vertices, comprising $c(v_1, v_2)$ document timestamps.
The graph model $G^{(0)}$ of the event evolution starting point microblog set thus obtained is also called the micro-evolution model of the event at the initial time.
Step 3, processing the remaining microblogs one by one in chronological order: building the graph model of each microblog and adding it to the event model of the previous moment, until all microblogs have been processed. The result is the micro-level graph model of event evolution. According to an embodiment of the present invention, adding the graph model of the microblog being processed to the existing graph model may proceed as follows.
For each edge in the graph model of the microblog being processed:
If both vertices of the edge exist in the existing graph model, the counter of the edge's occurrences is incremented if the edge already exists in the existing graph model; otherwise the edge is copied into the existing graph model.
If one and only one vertex of the edge is present in the existing graph model, the vertex and edge that are not in the existing graph model are copied into it.
If neither vertex of the edge is in the existing graph model, the edge and both vertices are copied into the existing graph model.
Taking again the microblog set $D$ and the starting document set $D_0$ above as an example, the documents $d_i$ of the remaining document sequence $D - D_0$ are taken in turn, and the graph model of each is constructed as discussed above and denoted $G_i$. The micro-evolution model of the event at that moment is denoted $G^{(i)}$. Through the following steps, $G_i$ is merged into $G^{(i)}$ to obtain $G^{(i+1)}$.
When adding the graph model of the microblog being processed to the existing graph model, it must be determined whether a given vertex is already included in the existing graph model. A given vertex is included in the graph if the graph contains a vertex corresponding to the same word, the forwarding indication function on the document sets of the two vertices evaluates to 1 (true), and the tendency feature values of the two vertices are compatible. Tendency feature values are compatible when the difference between the tendency feature value of the vertex in the graph and that of the given vertex is less than a certain threshold. A function $Eqv: V \times V \to \{0,1\}$ is defined to determine whether two vertices are equal; it takes the value 1 when these three conditions hold.
A further function $Mtv: V \times V \to \{0,1\}$ is defined; its value is 1 for a pair of vertices that have the same word but no evolutionary relationship (e.g., forwarding or commenting), in which case the two vertices are said to have an association relationship.
Here $\varepsilon_s$ is a pre-specified tendency gap threshold, with an empirical value of 0.3.
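The vertex equality test can be sketched in Python as below. The set-level forwarding test is an assumed reading (true when any document of the newer vertex forwards or comments on a document of the existing vertex), since its exact definition is not reproduced in this text; the names `forwards_any`, `eqv` and `forwarded_from` and the dictionary layout are likewise hypothetical.

```python
def forwards_any(earlier_docs, later_docs, forwarded_from):
    """Set-level forwarding test (assumed semantics).

    Returns True if any document in later_docs forwards or comments on a
    document in earlier_docs. `forwarded_from` maps a document id to the id
    of the document it forwards; originals are simply absent from the map.
    """
    return any(forwarded_from.get(doc) in earlier_docs for doc in later_docs)


def eqv(existing, new, forwarded_from, eps_s=0.3):
    """Return True when `new` should be treated as the same vertex as `existing`.

    Each vertex is a dict {"word": str, "docs": set, "score": float}.
    The three conditions are: same word, a forwarding/comment relation between
    the two document sets, and a tendency gap smaller than eps_s.
    """
    same_word = existing["word"] == new["word"]
    related = forwards_any(existing["docs"], new["docs"], forwarded_from)
    compatible = abs(existing["score"] - new["score"]) < eps_s
    return same_word and related and compatible
```

A vertex that shares the word but fails the forwarding test is the situation the association relationship is meant to capture; how that case interacts with the tendency threshold is not specified here, so it is left out of the sketch.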
For the graph model $G_i$ of document $d_i$, take each edge $e = \{v_1, v_2\} \in E$, and let $v$ denote either of its vertices:
(a) if there is a vertex $v'$ in $G^{(i)}$ such that $Eqv(v, v') = 1$, then $v$ and $v'$ are merged and regarded as the same vertex:
$D_{v'} \leftarrow D_{v'} \cup D_v$
$s(v') \leftarrow (s(v') + s(v))/2$
(b) if there is a vertex $v'$ in $G^{(i)}$ such that $Mtv(v, v') = 1$, then $v$ is introduced into the graph as a new vertex, and the edge $r = \{v', v\}$ is added to the edge set $R$.
(c) if neither condition (a) nor condition (b) is met, the vertex $v$ is added directly to the graph $G^{(i)}$.
Then, if $G^{(i)}$ contains no edge $e' = \{v_1, v_2\} \in E$, the edge is added; if such an edge exists, $e$ and $e'$ are merged:
$c(v_1, v_2)_{e'} \leftarrow c(v_1, v_2)_{e'} + c(v_1, v_2)_e$
$t_{e'}^{v_1 v_2} \leftarrow t_{e'}^{v_1 v_2} \cup t_e^{v_1 v_2}$
This process is repeated until all documents to be processed in the document set have been handled, and the resulting event micro-evolution graph is denoted $G$.
And 4, pruning, segmenting and converting the event micro evolution diagram to finally obtain a macro event evolution diagram.
Wherein, the event micro evolution diagram is pruned and can be deletedRemoving the edges with the number of the co-occurrences being lower than a specified threshold value in the event micro evolution graph, and deleting the edges and the initial graph G(0)A branch that is not connected. The step of segmenting the event micro evolution diagram can comprise the step of dividing the initial microblog sequence mentioned in the step 1) according to time, and can be divided into time slices with different granularities according to different requirements. According to one embodiment of the invention, the conversion of the event micro evolution diagram refers to the conversion of the event micro evolution diagram into the macro evolution diagram, which comprises the following steps: establishing a starting vertex of the event macro evolutionary graph, wherein the starting vertex corresponds to a subgraph expressing a starting part in the micro evolutionary graph; and then, sequentially inspecting each time slice, selecting a vertex and an edge corresponding to the time slice in the micro evolution diagram, constructing a minimum connected subgraph based on the subgraph in the micro evolution diagram, adding the subgraph into the macro evolution diagram as a vertex, constructing an edge in the macro evolution diagram to connect the two vertexes if the subgraph is intersected with subgraphs corresponding to other vertexes, and endowing the edge with Jaccard coefficients of the two subgraphs as characteristic values.
Still taking the microblog set $D$, the initial document set $D_0$ and the remaining document sequence $D - D_0$ as an example, the execution of step 4 according to an embodiment of the present invention is described below.
Step 4-1: set a threshold, written here as $\theta_{co}$, that specifies the minimum number of co-occurrences between words (i.e. the number of times they appear together in the same microblog); alternatively, a minimum co-occurrence frequency may be given, namely the required minimum co-occurrence count divided by the total number of documents. Scan every edge of the event micro-evolution diagram G: for $e = \{v_1, v_2\} \in E$, if $c(v_1,v_2)_e < \theta_{co}$, remove the edge from E. Then, starting from the initial graph $G^{(0)}$, search for the connected branches of the graph and delete from the vertex set every vertex that is not connected to the initial graph.
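The pruning step can be illustrated with a short sketch. It assumes that edges are stored as two-element `frozenset` keys mapped to co-occurrence counts, that there are no self-loops, and that the vertices of the initial graph are passed in as a set; the name `prune` and the sample data are hypothetical.

```python
from collections import deque


def prune(edge_counts, initial_vertices, min_count):
    """Drop rare edges, then keep only what is still connected to the initial graph."""
    # 1. Remove edges whose co-occurrence count is below the threshold.
    kept = {e: c for e, c in edge_counts.items() if c >= min_count}

    # 2. Build an adjacency map over the surviving edges (no self-loops assumed).
    adjacency = {}
    for e in kept:
        a, b = tuple(e)
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # 3. Breadth-first search outward from the vertices of the initial graph.
    reachable = set(initial_vertices)
    queue = deque(initial_vertices)
    while queue:
        v = queue.popleft()
        for w in adjacency.get(v, ()):
            if w not in reachable:
                reachable.add(w)
                queue.append(w)

    # 4. Keep only edges whose two endpoints are both reachable.
    pruned_edges = {e: c for e, c in kept.items() if e <= reachable}
    return reachable, pruned_edges


# Made-up example: edge b-c is too rare, so c and d become disconnected.
example = {
    frozenset(("a", "b")): 3,
    frozenset(("b", "c")): 1,
    frozenset(("c", "d")): 5,
    frozenset(("b", "e")): 2,
}
vertices, edges = prune(example, {"a"}, min_count=2)
```

Here the threshold is an absolute count; to use a frequency instead, one could multiply it by the total number of documents before calling the function.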
Step 4-2: divide the document sequence $D - D_0$ into time slices. The slices can be formed in different ways as needed, including:
(a) specifying a fixed time interval, e.g. dividing by hour or by day;
(b) computing the time span of the initial document set $D_0$ and using that span as the fixed slice length;
(c) clustering the documents by their temporal density, producing time slices of varying length.
The sequence of time slices obtained in this step is denoted $T = \{T_1, T_2, \ldots, T_m\}$; each time slice contains one or more documents.
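As an illustration of option (a), the following sketch groups documents into fixed-length time slices. The representation of a document as a `(timestamp, doc_id)` pair, the function name `slice_by_interval`, and the sample timestamps are assumptions for the example; empty intervals are skipped so that every slice contains at least one document.

```python
from datetime import datetime, timedelta


def slice_by_interval(docs, interval):
    """Group (timestamp, doc_id) pairs into consecutive fixed-length time slices."""
    if not docs:
        return []
    docs = sorted(docs)                      # order by time
    start = docs[0][0]
    slices = {}
    for ts, doc_id in docs:
        index = (ts - start) // interval     # timedelta // timedelta gives an int
        slices.setdefault(index, []).append(doc_id)
    # Return non-empty slices in chronological order.
    return [slices[i] for i in sorted(slices)]


# Made-up example: three microblogs sliced by the hour.
sample = [
    (datetime(2013, 1, 1, 8, 0), "d1"),
    (datetime(2013, 1, 1, 10, 10), "d3"),
    (datetime(2013, 1, 1, 8, 30), "d2"),
]
T = slice_by_interval(sample, timedelta(hours=1))
```

Option (b) could reuse the same function by passing the time span of the initial document set as `interval`.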
Step 4-3: create the event macro-evolution diagram, where $V_\Psi$ is the set of vertices of the macro-evolution diagram, $E_\Psi$ is its set of edges, and the vertex set $V_\Psi$ and the edge set $E_\Psi$ each have a labeling function. Create a vertex $v_0 \in V_\Psi$ and label it with the initial graph $G^{(0)}$.
Step 4-4: examine each time slice in the time slice set T in turn. For each $T_i \in T$:
Select the vertices and edges of graph G that fall within time slice $T_i$, denoted V and E respectively. To speed up this query, the time labels recorded for vertices and edges when they were added (as described above) may be used to select the vertex set and edge set of a time slice.
Mark the maximal connected branches (connected components) within the vertex set V and, based on them, construct a minimum connected subgraph $G_V$ of G that contains V. According to one embodiment of the invention, this comprises:
(a) computing, with Dijkstra's algorithm, the shortest path between every pair of maximal connected branches;
(b) selecting the shortest of these paths and adding all of its vertices and edges to the subgraph;
(c) repeating steps (a) and (b) until the subgraph is fully connected.
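Steps (a) to (c) can be sketched as below. Several assumptions are made for illustration: every edge has unit weight (the text does not state edge weights), the graph is given as a symmetric adjacency dictionary, vertices are mutually comparable (e.g. strings), and the result is returned as a vertex set whose induced subgraph is taken as the connected subgraph. The procedure is a greedy heuristic and is not guaranteed to find a globally smallest subgraph.

```python
import heapq


def components(vertices, adjacency):
    """Connected components of the subgraph induced by `vertices`."""
    seen, result = set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in adjacency.get(u, ()):
                if w in vertices and w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        result.append(comp)
    return result


def shortest_path_from(source_comp, targets, adjacency):
    """Dijkstra (unit weights) from a whole component to the nearest target vertex."""
    dist = {v: 0 for v in source_comp}
    prev = {}
    heap = [(0, v) for v in source_comp]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # stale heap entry
        if u in targets:
            path = [u]
            while path[-1] in prev:           # walk back to the source component
                path.append(prev[path[-1]])
            return d, path
        for w in adjacency.get(u, ()):
            if d + 1 < dist.get(w, float("inf")):
                dist[w] = d + 1
                prev[w] = u
                heapq.heappush(heap, (d + 1, w))
    return None


def minimal_connected_subgraph(selected, adjacency):
    """Greedily join components by their shortest connecting path."""
    vertices = set(selected)
    while True:
        comps = components(vertices, adjacency)
        if len(comps) <= 1:
            return vertices
        best = None
        for comp in comps:                    # step (a): shortest paths between components
            found = shortest_path_from(comp, vertices - comp, adjacency)
            if found and (best is None or found[0] < best[0]):
                best = found
        if best is None:
            return vertices                   # components cannot be joined within G
        vertices |= set(best[1])              # step (b): add the shortest path


# Made-up graph: a-b-c-d-e is a chain, a-x-y-e is a longer detour.
adj = {}
for p, q in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
             ("a", "x"), ("x", "y"), ("y", "e")]:
    adj.setdefault(p, set()).add(q)
    adj.setdefault(q, set()).add(p)
G_V = minimal_connected_subgraph({"a", "c", "e"}, adj)
```

With unit weights Dijkstra's algorithm behaves like a breadth-first search; the priority queue is kept so that real edge weights could be substituted if a weighting were defined.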
Create a vertex $v$ corresponding to $G_V$ and add it to $V_\Psi$. For every other vertex $v'$ in $V_\Psi$, if $G_v \cap G_{v'} \neq \emptyset$, create an edge $e = \{v, v'\}$, add it to $E_\Psi$, and label it with the Jaccard coefficient of the vertices $v$ and $v'$, calculated as:
$\#(G_v \cap G_{v'}) / \#(G_v \cup G_{v'})$
where $G_v \cap G_{v'}$ and $G_v \cup G_{v'}$ denote the intersection and union of the two sets of micro-evolution diagram vertices, and the function $\#(\cdot)$ gives the number of elements in a set.
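A minimal sketch of this edge weight, assuming each macro vertex is represented simply by the set of micro-evolution vertices (words) in its subgraph; the function name `jaccard` and the handling of two empty sets are choices made for the example.

```python
def jaccard(vertices_a, vertices_b):
    """Jaccard coefficient #(A ∩ B) / #(A ∪ B) of two vertex sets."""
    a, b = set(vertices_a), set(vertices_b)
    union = a | b
    if not union:
        return 0.0            # both sets empty: defined as 0 here by convention
    return len(a & b) / len(union)


# Made-up example: two subgraphs sharing two of four distinct words.
weight = jaccard({"a", "b", "c"}, {"b", "c", "d"})
print(weight)  # prints: 0.5
```

A macro edge would only be created when the intersection is non-empty, i.e. when the returned value is greater than zero.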
Step 4-4 is repeated until all time slices have been processed, at which point the construction of the event macro-evolution diagram is complete. The evolution of the event features can then be observed on the basis of the event macro-evolution diagram.
The event micro-evolution diagram uses words as its granularity. It mainly reflects how knowledge about the event keeps expanding as the event develops and changes, and its edge construction rules capture how that knowledge is inherited and evolves; in terms of interpretability it is therefore superior to traditional evolution analysis methods that rely purely on vocabulary similarity. However, the micro-evolution diagram has a large number of nodes and complex connections, which makes it suitable for computation but hard for a person to inspect. The macro-evolution diagram distilled from the micro-evolution diagram uses time slices as its granularity, so the numbers of nodes and edges are greatly reduced and the diagram is suitable for human observation. In addition, the observation granularity can be changed by adjusting the size of the time slices, so the macro-evolution diagram can be zoomed. The evolution of the event features can thus be observed easily from the event macro-evolution diagram.
In another embodiment of the present invention, a microblog-based event feature evolution mining system is further provided, which includes: a device for selecting, from a set of microblog texts related to the event to be analyzed, a plurality of microblogs representing event starting points so as to form an event evolution starting point microblog set; a device for constructing, by the method discussed above, a graph model of the event evolution starting point microblog set as an initial event micro-evolution diagram; a device for constructing, by the method discussed above, a graph model of each remaining microblog in the set of microblog texts related to the event to be analyzed and adding that graph model to the current event micro-evolution diagram; and a device for obtaining, by the method described above, an event macro-evolution diagram from the final event micro-evolution diagram and observing the evolution of the event features based on the event macro-evolution diagram.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A microblog-based event feature evolution mining method comprises the following steps:
step 1, selecting a plurality of microblogs representing event starting points from a set of microblog texts related to an event to be analyzed to form an event evolution starting point microblog set, wherein the microblogs representing the event starting points have the following characteristics: a) the publication time is early; b) the microblog is an original microblog rather than a forwarded or commented microblog;
step 2, constructing a graph model of the event evolution starting point microblog set as an initial event micro evolution graph; the vertices in the graph model are the nouns/verbs appearing in the microblog texts of the event evolution starting point microblog set, and an edge between two vertices indicates that the words corresponding to the two vertices appear together in the same microblog or that their co-occurrence distance is smaller than a preset threshold value;
step 3, for each of the remaining microblogs in the set of microblog texts related to the event to be analyzed, constructing a graph model of the microblog and adding the graph model into the current event evolution micro-graph;
step 4, segmenting and converting the event micro-evolution diagram obtained in the step 3 to obtain an event macro-evolution diagram, and observing the evolution of the event characteristics based on the event macro-evolution diagram;
adding the constructed microblog graph model to the current event evolution micro graph in the step 3 comprises:
for each edge in the graph model of the microblog to be processed:
a) if both vertexes of the edge exist in the current event evolution micro-graph and the edge exists in the event evolution micro-graph, accumulating the occurrence count of the edge; if the edge does not exist in the event evolution micro-graph, copying the edge into the event evolution micro-graph;
b) if exactly one vertex of the edge appears in the current event evolution micro-graph, copying the vertex that is not in the event evolution micro-graph, together with the edge, into the event evolution micro-graph;
c) if neither vertex of the edge is in the current event evolution micro-graph, copying the edge and both vertexes into the event evolution micro-graph;
wherein the segmenting and transforming the event micro-evolution diagram in the step 4 comprises:
step 4-1), sequencing the microblog texts related to the event to be analyzed according to time, and slicing the microblog text sequence according to time to form a time slice with required granularity;
step 4-2) creating a vertex in the event micro-evolution diagram, wherein the vertex corresponds to the initial event micro-evolution diagram;
step 4-3) the following steps are performed for each time slice:
4-3-a) sequentially selecting a vertex and an edge corresponding to each time slice in the event micro evolution diagram, and constructing a minimum connected subgraph based on the subgraph;
4-3-b) creating a vertex in the event macro evolution diagram, corresponding to the minimum connected subgraph, and creating an edge connecting two subgraphs if the minimum connected subgraph is intersected with the subgraphs corresponding to other vertices in the event macro evolution diagram.
2. The method according to claim 1, wherein each vertex of the graph model in step 2 is represented by a triple consisting of the noun/verb corresponding to the vertex, the set of microblog documents containing the noun/verb, and the tendency score of the noun/verb, wherein the tendency score of the noun/verb is the average of the tendency scores of the adjectives and adverbs modifying the noun/verb.
3. The method of claim 2, the step 2 comprising:
step 2-1), performing word segmentation and part-of-speech tagging on each microblog text in the event evolution starting point microblog set;
step 2-2), setting tendency scores of the adjectives and the adverbs after word segmentation;
step 2-3), for the nouns and verbs after word segmentation, averaging the tendency scores corresponding to the adjectives and adverbs which modify the same nouns/verbs, and taking the average as the tendency score of the nouns or verbs;
step 2-4), taking the nouns and the verbs as vertexes, and if the words corresponding to any two vertexes appear together in the same microblog or their co-occurrence distance is smaller than a preset threshold value, creating an edge between the two vertexes.
4. The method of claim 3, wherein the step 3 further comprises a step of determining whether a vertex in the microblog graph model is in the event evolution micro-graph, and the step comprises: for a given vertex in a microblog graph model, if an event evolution micro graph comprises a vertex which is the same as a word corresponding to the vertex, the microblog and a microblog text related to the corresponding vertex in the event evolution micro graph have a forwarding or commenting relationship, and the tendency scores of the two vertices are compatible, it is determined that the given vertex is included in the event evolution micro graph, wherein the compatibility of the tendency scores means that the difference between the tendency score of the corresponding vertex in the event evolution micro graph and the tendency score of the given vertex is less than a certain threshold value.
5. The method according to claim 1, wherein the step 4-3) further comprises the step of assigning a weight to the created edge connecting the two subgraphs, wherein the weight of the edge is the Jaccard coefficient of the subgraphs corresponding to the two vertexes; for any two vertexes v and v' in the event macroscopic evolution diagram, the Jaccard coefficient of the corresponding subgraphs is calculated as $\#(G_v \cap G_{v'}) / \#(G_v \cup G_{v'})$, wherein $G_v \cap G_{v'}$ and $G_v \cup G_{v'}$ respectively represent the intersection and union of the vertex sets of the subgraphs corresponding to the two vertices, and the function $\#(\cdot)$ represents the number of elements in a set.
6. The method according to claim 1, wherein the step 4 further comprises a step of pruning the event micro-evolution diagram, which comprises deleting edges in the event micro-evolution diagram, the occurrence times of which are lower than a given threshold, and then deleting branches which are not communicated with the initial event micro-evolution diagram, wherein the occurrence times of the edges refer to the times of common occurrence of words corresponding to two vertexes of the edges in the same microblog in the set of microblog texts related to the event to be analyzed.
7. A microblog-based event feature evolution mining system comprises:
a device for selecting a plurality of microblogs representing event starting points from a set of microblog texts related to an event to be analyzed so as to form an event evolution starting point microblog set, wherein the microblogs representing the event starting points have the following characteristics: a) the publication time is early; b) the microblog is an original microblog rather than a forwarded or commented microblog;
a device for constructing a graph model of the event evolution starting point microblog set as an initial event micro evolution graph; the vertices in the graph model are the nouns/verbs appearing in the microblog texts of the event evolution starting point microblog set, and an edge between two vertices indicates that the words corresponding to the two vertices appear together in the same microblog or that their co-occurrence distance is smaller than a preset threshold value;
a device for constructing a graph model of each microblog in the set of microblog texts related to the event to be analyzed and adding the graph model into the current event evolution micro-graph;
a device for segmenting and transforming the final event micro-evolution diagram to obtain an event macro-evolution diagram and observing the evolution of the event characteristics based on the event macro-evolution diagram;
adding the constructed microblog graph model into the current event evolution micro graph comprises the following steps:
for each edge in the graph model of the microblog to be processed:
a) if both vertexes of the edge exist in the current event evolution micro-graph and the edge exists in the event evolution micro-graph, accumulating the occurrence count of the edge; if the edge does not exist in the event evolution micro-graph, copying the edge into the event evolution micro-graph;
b) if exactly one vertex of the edge appears in the current event evolution micro-graph, copying the vertex that is not in the event evolution micro-graph, together with the edge, into the event evolution micro-graph;
c) if neither vertex of the edge is in the current event evolution micro-graph, copying the edge and both vertexes into the event evolution micro-graph;
wherein the segmenting and transforming the event micro-evolution diagram comprises:
sequencing microblog texts related to an event to be analyzed according to time, and slicing the microblog text sequence according to time to form a time slice with required granularity;
creating a vertex in the event micro-evolution diagram, and corresponding to the initial event micro-evolution diagram; for each time slice, the following steps are performed:
i) sequentially selecting a vertex and an edge corresponding to each time slice in the event micro evolution diagram, and constructing a minimum connected subgraph based on the subgraph;
ii) creating a vertex in the event macro evolution diagram, corresponding to the minimum connected subgraph, and creating an edge connecting the two subgraphs if the minimum connected subgraph is intersected with the subgraphs corresponding to other vertices in the event macro evolution diagram.
CN201310532377.7A 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs Active CN103631862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310532377.7A CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201210433713 2012-11-02
CN2012104337138 2012-11-02
CN201210433713.8 2012-11-02
CN201310532377.7A CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Publications (2)

Publication Number Publication Date
CN103631862A CN103631862A (en) 2014-03-12
CN103631862B true CN103631862B (en) 2017-01-11

Family

ID=50212904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310532377.7A Active CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Country Status (1)

Country Link
CN (1) CN103631862B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536956A (en) * 2014-07-23 2015-04-22 中国科学院计算技术研究所 A Microblog platform based event visualization method and system
CN104899908B (en) * 2015-06-12 2018-09-11 百度在线网络技术(北京)有限公司 The method and apparatus for generating event group evolution diagram
CN104933129B (en) 2015-06-12 2019-04-30 百度在线网络技术(北京)有限公司 Event train of thought acquisition methods and system based on microblogging
CN106708947B (en) * 2016-11-25 2020-06-09 成都寻道科技有限公司 Web article forwarding and identifying method based on big data
CN109145224B (en) * 2018-08-20 2021-11-23 电子科技大学 Social network event time sequence relation analysis method
CN110472105A (en) * 2019-08-06 2019-11-19 电子科技大学 A kind of social networks event evolution method for tracing divided based on the time
CN110781317B (en) * 2019-10-29 2022-03-01 北京明略软件***有限公司 Method and device for constructing event map and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
CN102214241A (en) * 2011-07-05 2011-10-12 清华大学 Method for detecting burst topic in user generation text stream based on graph clustering
CN102289487A (en) * 2011-08-09 2011-12-21 浙江大学 Network burst hotspot event detection method based on topic model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566360B2 (en) * 2010-05-28 2013-10-22 Drexel University System and method for automatically generating systematic reviews of a scientific field

Also Published As

Publication number Publication date
CN103631862A (en) 2014-03-12

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant