CN103179198B - Topic influence individual mining method based on multi-relation network


Info

Publication number
CN103179198B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310071162.XA
Other languages
Chinese (zh)
Other versions
CN103179198A (en
Inventor
丁兆云
贾焰
杨树强
周斌
韩伟红
李爱平
韩毅
李莎莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN201310071162.XA
Publication of CN103179198A
Application granted
Publication of CN103179198B

Abstract

A method for determining whether a copy relationship exists between blog posts of network users, the method comprising: obtaining the time probability distribution obeyed by the time interval between two blog posts that have an explicit forwarding relationship; inferring, based on that distribution, the time probability distribution obeyed by the time interval between two blog posts that have a copy relationship; setting, based on the inferred distribution, the range that the time interval between two blog posts with a copy relationship should satisfy; calculating the similarity of any two blog posts whose time interval falls within that range; and determining, based on the similarity, whether a copy relationship exists between the two blog posts.

Description

Topic influence individual mining method based on multi-relation network
Technical Field
The invention relates to the technical field of network data mining, in particular to a topic influence individual mining technology in microblog.
Background
Twitter-like microblog services have developed rapidly in recent years as a new communication medium. According to the 29th China Internet statistical report, by the end of December 2011 the number of microblog users had reached 250 million, an increase of 296.0% over the end of the previous year, and the usage rate among netizens was 48.7%. Unlike Facebook-like social network services, the social relationship in a microblog service is unidirectional: a user can follow another user without that user's permission. For example, the social network in Twitter is formed by the following relationship. A person followed by a user is called a friend of that user, and a person who follows a user is called a follower of that user. All tweets posted by a user appear on the public timeline, and all of the user's messages are displayed on the timelines of the user's followers.
With the popularization of microblog services, a large number of users take part in topic discussions, so microblog services generate a large amount of information on many topics every day. The influential individuals on each topic are buried in this information, which makes mining them from the mass of microblog data challenging.
In recent years, researchers have proposed topic-level influential individual mining methods for Twitter data.
To measure the influence of a user on each topic in Twitter, Weng et al. proposed the TwitterRank method; see reference 1, "[1] Weng J, Lim E P, Jiang J, et al. TwitterRank: Finding topic-sensitive influential twitterers [C] // Proc of the 3rd ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2010: 261-". The TwitterRank method outperforms the PageRank method and the Topic-sensitive PageRank method to a certain extent, but the transition probability of its random walk considers only the number of posts and the topic similarity, and ignores related factors such as forwarding and replying.
In reference 2, "[2] Pal A, Counts S. Identifying topical authorities in microblogs [C] // Conference on Web Search and Data Mining. New York, NY: ACM, 2011: 45-54", Pal et al. consider a number of user attributes in Twitter in order to identify authoritative users for each topic, but ignore the link structure of the multi-relation network, which makes it difficult to characterize the relative influence of a user in the entire network.
Accordingly, there is a need in the art for an improved topic influence individual mining technique.
Disclosure of Invention
It is an object of the present invention to provide a method for determining whether a copy relationship exists between blog posts of network users. A further object of the invention is to provide an improved topic influence individual mining method that makes up for the incompleteness of traditional methods for evaluating individual influence.
To achieve the above objects, in one aspect of the present invention, there is provided a method for determining whether a copy relationship exists between blog posts of network users, the method comprising: obtaining the time probability distribution obeyed by the time interval between two blog posts with an explicit forwarding relationship; inferring, based on that distribution, the time probability distribution obeyed by the time interval between two blog posts with a copy relationship; setting, based on the inferred distribution, the range that the time interval between two blog posts with a copy relationship should satisfy; calculating the similarity of any two blog posts whose time interval is within the range; and determining, based on the similarity, whether a copy relationship exists between the two blog posts.
Preferably, the time probability distribution obeyed by the time interval between two blog posts with a copy relationship is inferred to be the same as the time probability distribution obeyed by the time interval between two blog posts with an explicit forwarding relationship.
In another aspect of the present invention, a topic influence individual mining method based on a multi-relation network is provided, which includes: extracting the forwarding relationships among users to construct a forwarding relationship network, and calculating the transition probability that one user randomly forwards a blog post of another user in the forwarding relationship network; extracting the reply relationships among users to construct a reply relationship network, and calculating the transition probability that one user randomly replies to a blog post of another user in the reply relationship network; extracting the copy relationships among users, from the blog posts determined by the above method to have a copy relationship, to construct a copy relationship network, and calculating the transition probability that one user randomly copies a blog post of another user in the copy relationship network; extracting the reading relationships among users to construct a reading relationship network, and calculating the transition probability that one user randomly reads a blog post of another user in the reading relationship network; and combining these transition probabilities to calculate the probability that any user is randomly accessed by other users.
Preferably, the reading relationships between users are extracted based on any one or more of the similarity of posting time series patterns between users, the number of blog posts, and the interest similarity between users.
Preferably, the transition probability that one user randomly forwards a blog post of another user in the forwarding relationship network is calculated based on the number of times the one user has forwarded blog posts of the other user.
Preferably, the transition probability that one user randomly replies to a blog post of another user in the reply relationship network is calculated based on the number of times the one user has replied to blog posts of the other user.
Preferably, the transition probability that one user randomly copies a blog post of another user in the copy relationship network is calculated based on the time interval and the similarity between the blog posts of the two users.
Preferably, the transition probability that one user randomly reads a blog post of another user in the reading relationship network is calculated based on any one or more of the similarity of posting time series patterns between the two users, the number of blog posts, and the interest similarity between the two users.
Preferably, the access includes forwarding, replying, copying, and reading.
Preferably, a blog post is text, video, audio, or any combination thereof published by a user.
With the technical scheme of the invention, whether a copy relationship exists between blog posts of network users can be determined efficiently. In addition, by using several different relationship networks, the different interaction characteristics among network users can be considered together, so that influential individuals in the network can be found more accurately.
Drawings
The present invention is described in detail with reference to the attached drawings, it being understood that the drawings and the corresponding description are to be considered illustrative and not restrictive, wherein:
FIG. 1 illustrates an example of a multi-relationship network that may exist among multiple microblog users;
FIG. 2 illustrates an example time interval distribution of user forwarding operations;
FIG. 3 shows the negative exponential distribution of the time interval between two blog posts in a copy relationship;
FIG. 4 exemplarily shows a posting time series pattern for each day for 3 users;
FIG. 5 shows the accuracy of various algorithms in various topics; and
FIG. 6 shows the average accuracy of each algorithm across all topics.
Detailed Description
The following describes preferred embodiments of the present invention in detail with reference to the accompanying drawings by taking microblog applications as examples.
To measure comprehensively the influence of a user at the level of a given topic, the invention considers multiple types of network relationship in the microblog. For example, as shown in FIG. 1(a), the influence of user B on user A is expressed through 4 relationship types: 1) user A forwards a blog post of user B, using a marker such as "RT @B" or "via @B" in A's own post; 2) user A replies to a blog post of user B, using a marker such as "@B" in A's own post; 3) user A copies a blog post of user B without explicitly using a forwarding marker such as "RT @B" or "via @B"; 4) user A reads a blog post of user B. In FIG. 1(a), 4 different types of directed edge (A, B) between users A and B represent these four relationship types. FIG. 1(b) shows another example multi-relation network, consisting of 3 users and the 4 types of directed edge described above. The influence network among microblog users is therefore a multi-relation network, which can be decomposed by relationship type into 4 networks: a forwarding network, a reply network, a copy network, and a reading network.
To mine influential individuals in the microblog, the forwarding network, the reply network, the copy network, and the reading network are extracted first, and individual influence is then analyzed or calculated by fusing these networks. The mining process is based on microblog-related data (e.g., user information, blog posts published by users, forwarding records, reply records, etc.). Those skilled in the art will understand that such data can be collected in various existing ways, which is not the focus of the present disclosure. Therefore, to avoid obscuring the present invention, the process of obtaining the microblog-related data is not described further.
The above is the basic implementation process of the method of the present invention. In the following detailed description, the related definitions are explained first.
Related definitions
First, define $C$ as the entire set of blog posts and $V$ as the entire set of users. Define $k$ topics ($k$ is a positive integer); then $C^i$ and $V^i$ denote the set of blog posts and the set of users in the $i$-th topic space ($0 < i \le k$), respectively.
Define the follow network in the microblog as a directed unweighted graph $G_f = (V, E_f)$, where $V$ is the set of nodes (i.e., the set of users) in $G_f$ and $E_f$ is the set of directed edges between nodes in $G_f$, representing the follow relationships between users. The subscript $f$ has no meaning of its own; it only marks that the network or relationship concerns following.
Define the multi-relation network in the $i$-th topic space ($0 < i \le k$) as a multi-relation graph
$$G^i = (V^i,\ E^i_{Retweet} \cup E^i_{Reply} \cup E^i_{Copy} \cup E^i_{Read}),$$
where $V^i$ is the set of users in the $i$-th topic space, and $E^i_{Retweet}$, $E^i_{Reply}$, $E^i_{Copy}$, and $E^i_{Read}$ are sets of directed edges representing the forwarding, reply, copy, and reading relationships in the $i$-th topic space, respectively. The multi-relation graph can be decomposed into 4 graphs of different relationship types: a forwarding (Retweet) network graph, a reply (Reply) network graph, a copy (Copy) network graph, and a reading (Read) network graph.
In particular, $G^i_a = (V^i_a, E^i_{Retweet}, w_a)$ denotes the weighted directed forwarding network graph, where $V^i_a$ is the set of users in the $i$-th topic space that are involved in a forwarding relationship; $E^i_{Retweet}$ is the set of directed edges representing the forwarding relationships in the $i$-th topic space; and $w_a$ is the weight of a forwarding edge, which may be, for example, the number of times one user forwards the other in the $i$-th topic space.
Likewise, $G^i_b = (V^i_b, E^i_{Reply}, w_b)$ denotes the weighted directed reply network graph, where $V^i_b$ is the set of users in the $i$-th topic space that are involved in a reply relationship; $E^i_{Reply}$ is the set of directed edges representing the reply relationships in the $i$-th topic space; and $w_b$ is the weight of a reply edge, which may be, for example, the number of replies between two users in the $i$-th topic space.
$G^i_c = (V^i_c, E^i_{Copy}, w_c)$ denotes the weighted directed copy network graph, where $V^i_c$ is the set of users in the $i$-th topic space that are involved in a copy relationship; $E^i_{Copy}$ is the set of directed edges representing the copy relationships among users in the $i$-th topic space; and $w_c$ is the weight of a copy edge (described below).
$G^i_d = (V^i_d, E^i_{Read}, w_d)$ denotes the weighted directed reading network graph, where $V^i_d$ is the set of users in the $i$-th topic space that are involved in a reading relationship; $E^i_{Read}$ is the set of directed edges representing the reading relationships among users in the $i$-th topic space; and $w_d$ is the weight of a reading edge (described below).
The subscripts $a$, $b$, $c$, $d$ used here only identify or distinguish the individual networks and their user sets; they have no meaning of their own.
The following explains the transition probability calculation method for each network by way of example in the preferred embodiment.
Transition probability calculation method of forwarding network
The random walk process in the forwarding network graph $G^i_a$ is constructed as follows: a user is influenced by friends in the $i$-th topic space and forwards a friend's blog post with a certain transition probability. This random walk simulates the forwarding behavior of users in the microblog. Let the transition probability matrix of the forwarding network in the $i$-th topic space be $P^i_a$. The transition probabilities between users are defined as follows.
Definition 1. In the forwarding network in the $i$-th topic space, the transition probability that user $u^i_s$ randomly forwards a blog post of user $u^i_t$ is defined as:
$$P^i_a(u^i_t \mid u^i_s) = \frac{w_a(u^i_s, u^i_t)}{\sum_{u^i \in out(u^i_s)} w_a(u^i_s, u^i)},$$
where $w_a(u^i_s, u^i_t)$ is the number of times user $u^i_s$ forwards blog posts of user $u^i_t$ in the $i$-th topic space; $\sum_{u^i \in out(u^i_s)} w_a(u^i_s, u^i)$ is the number of times user $u^i_s$ forwards blog posts of all of his or her friends in the $i$-th topic space; and $out(u^i_s)$ is the set of end points of all directed edges in the follow network that start at $u^i_s$, i.e., the set of friends of $u^i_s$.
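Definition 1 amounts to normalizing each user's outgoing edge weights so that they sum to 1. The following is a minimal Python sketch of that normalization, not the patent's implementation. The function name `transition_probabilities`, the dictionary layout, and the example counts are all assumptions made for illustration. Friends who were never forwarded are simply absent from the dictionary, which is equivalent to a weight of 0 in the denominator.

```python
from collections import defaultdict

def transition_probabilities(edge_weights):
    """Normalise weights w(u_s, u_t) into P(u_t | u_s) per source user.

    edge_weights maps (source_user, target_user) -> non-negative weight,
    e.g. the number of times source_user forwarded target_user's posts.
    """
    totals = defaultdict(float)
    for (source, _target), weight in edge_weights.items():
        totals[source] += weight
    return {
        (source, target): weight / totals[source]
        for (source, target), weight in edge_weights.items()
        if totals[source] > 0
    }

# Hypothetical forwarding counts in one topic space:
# A forwarded B three times and C once; B forwarded C twice.
retweet_counts = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
P_a = transition_probabilities(retweet_counts)
print(P_a[("A", "B")])  # 0.75
```

The same helper would apply unchanged to the reply counts of Definition 2, since only the meaning of the weights differs.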
Transition probability calculation method of reply network
The random walk process in the reply network graph $G^i_b$ is constructed as follows: a user is influenced by friends in the $i$-th topic space and replies to a friend's blog post with a certain transition probability. This random walk simulates the reply behavior of users in the microblog. Let the transition probability matrix of the reply network in the $i$-th topic space be $P^i_b$. The transition probabilities between users are defined as follows.
Definition 2. In the reply network in the $i$-th topic space, the transition probability that user $u^i_s$ randomly replies to a blog post of user $u^i_t$ is defined as:
$$P^i_b(u^i_t \mid u^i_s) = \frac{w_b(u^i_s, u^i_t)}{\sum_{u^i \in out(u^i_s)} w_b(u^i_s, u^i)},$$
where $w_b(u^i_s, u^i_t)$ is the number of times user $u^i_s$ replies to blog posts of user $u^i_t$ in the $i$-th topic space; $\sum_{u^i \in out(u^i_s)} w_b(u^i_s, u^i)$ is the number of times user $u^i_s$ replies to blog posts of all of his or her friends in the $i$-th topic space; and $out(u^i_s)$ is the set of end points of all directed edges in the follow network that start at $u^i_s$, i.e., the set of friends of $u^i_s$.
Transition probability calculation method of copy network
Because the microblog has no explicit "copy" relationship marker, constructing the random walk process in the copy network graph $G^i_c$ first requires inferring the "copy" relationships in order to uncover the implied relationship edges.
One embodiment of the invention considers both the time interval between two blog posts and their similarity. In general, two blog posts with a "copy" relationship have high content similarity, so a naive method of inferring the "copy" relationship is to calculate the similarity between blog posts and, if it exceeds a certain threshold, infer the source of the post. The naive method must calculate the similarity between a user's blog posts and all the blog posts of all of the user's friends, which is computationally expensive. In one embodiment of the invention, to reduce the computational cost, not only the similarity between blog posts but also the time interval $\Delta t$ between them is considered.
"Copy" behavior in the microblog is to some extent a forwarding behavior, except that forwarding markers such as "RT @B" or "via @B" are not used explicitly. Therefore, the probability distribution of the set $T_{Retweet}$ of time intervals $\Delta t_{Retweet}$ between blog posts with an explicit forwarding relationship is used here to fit the probability distribution of the set $T_{Copy}$ of time intervals $\Delta t_{Copy}$ between blog posts with a "copy" relationship. Based on these considerations, the present invention proposes a method for determining whether a copy relationship exists between blog posts of network users, which may include the following steps: obtaining the time probability distribution obeyed by the time interval between two blog posts with an explicit forwarding relationship; inferring, based on that distribution, the time probability distribution obeyed by the time interval between two blog posts with a copy relationship; setting, based on the inferred distribution, the range that the time interval between two blog posts with a copy relationship should satisfy; calculating the similarity of any two blog posts whose time interval is within the range; and determining, based on the similarity, whether a copy relationship exists between the two blog posts.
Specifically, in one embodiment, a set $T_{Retweet}$ of time intervals $\Delta t_{Retweet}$ between blog posts with an explicit forwarding relationship is first randomly sampled from the data set, with size 71000 (other sizes are possible), i.e., $|T_{Retweet}| = 71000$. FIG. 2 shows the distribution of this data. As can be seen from FIG. 2, most of the time intervals are within several hours, and only a small part span a relatively long time, a few even more than 10 days. To characterize the time interval distribution more finely, the long-tail points whose time intervals span more than 10 days can be removed. Removing these points is reasonable because data analysis shows that users who forward after more than 10 days are usually spam users.
After removing the long-tail points spanning more than 10 days, the sample size becomes 69770, i.e., $|T'_{Retweet}| = 69770$, and the time interval $\Delta t$ is found statistically to approximately follow a negative exponential distribution. Given the sample set $T'_{Retweet}$, the parameter of the negative exponential distribution is estimated as $\lambda = 1.9768 \times 10^4$, and the probability density function is:
$$f(x) = \begin{cases} \dfrac{1}{19768}\, e^{-x/19768}, & x \ge 0, \\ 0, & x < 0, \end{cases}$$
where $e$ is the base of the natural logarithm. Note that $\lambda$ enters this density as the scale (mean) of the distribution, i.e., the density has the form $\frac{1}{\lambda} e^{-x/\lambda}$.
The set of time intervals between blog posts with an explicit forwarding relationship thus obeys a negative exponential distribution with parameter $\lambda = 1.9768 \times 10^4$. Because "copy" behavior is to some extent a forwarding behavior, it can be inferred that the time interval between two blog posts with a "copy" relationship also approximately obeys a negative exponential distribution with parameter $\lambda = 1.9768 \times 10^4$, whose distribution function is (see FIG. 3):
$$F(x) = \begin{cases} 1 - e^{-x/19768}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$
From the negative exponential distribution, the range $\Delta t_{range}$ of the time interval between two blog posts with a "copy" relationship can be approximated. Balancing computational cost against accuracy, in one embodiment the range is set, by way of example, to $\Delta t_{range} \in (0\,\text{ks},\ 1.08 \times 10^{2}\,\text{ks}]$, where ks denotes 1000 seconds, the parenthesis "(" means that the endpoint is excluded, and the square bracket "]" means that the endpoint is included. From the negative exponential distribution, the recall $R$ of the "copy" relationship can be inferred as:
$$R = F(1.08 \times 10^{5}) - F(0) = 99.58\%.$$
If the time interval $\Delta t$ between two blog posts falls outside $\Delta t_{range}$, the probability that a "copy" relationship exists is low, so the similarity of the two blog posts need not be calculated, which reduces the computational cost.
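The recall figure can be checked numerically from the distribution function. The sketch below is illustrative only. It assumes the time intervals are measured in seconds, which is consistent with the upper bound $1.08 \times 10^{2}$ ks $= 1.08 \times 10^{5}$ s. It also assumes that the parameter would be estimated as the sample mean (the maximum-likelihood estimate of the scale of an exponential distribution); the text does not state which estimator was actually used. The names `exp_cdf` and `fit_scale` are hypothetical.

```python
import math

SCALE_SECONDS = 19768.0  # parameter value reported in the text

def exp_cdf(x, scale=SCALE_SECONDS):
    """F(x) = 1 - exp(-x / scale) for x >= 0, and 0 for x < 0."""
    return 1.0 - math.exp(-x / scale) if x >= 0 else 0.0

def fit_scale(intervals):
    """Assumed estimator: the sample mean of the observed intervals."""
    return sum(intervals) / len(intervals)

upper_bound_seconds = 1.08e5  # 1.08 x 10^2 ks
recall = exp_cdf(upper_bound_seconds) - exp_cdf(0.0)
print(f"{recall:.2%}")  # 99.58%
```

Under these assumptions, a window of about 30 hours covers almost all of the probability mass, which is why pairs outside the window can be skipped at little cost in recall.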
Therefore, to infer whether blog posts published by users with a friend relationship have a "copy" relationship, the following two conditions must be met simultaneously:
1) $sim(p_t, p_s) \ge \xi$;
2) $\Delta t_{range} \in (0\,\text{ks},\ 1.08 \times 10^{2}\,\text{ks}]$.
The first condition requires the similarity $sim(p_t, p_s)$ of the two blog posts $p_t$ and $p_s$ to be greater than or equal to a certain threshold $\xi$. In one embodiment, the cosine of the included angle may be used to calculate the similarity between two blog posts; other methods of calculating document similarity, such as the KL (Kullback-Leibler) distance, are equally applicable:
$$sim(p_t, p_s) = \cos(v_t, v_s),$$
where $v_t$ and $v_s$ are the vectors of the two blog posts, respectively.
The second condition requires the time interval between the two blog posts to lie within a certain range. The value $1.08 \times 10^{2}$ ks used above is merely an example, and one skilled in the art will appreciate that any other suitable value may be chosen by weighing computational cost against accuracy.
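The two conditions can be combined so that the cheap time-window check runs first and the similarity is computed only for pairs that pass it. The following is a rough Python sketch under several assumptions that are not in the text: posts are represented as `(timestamp_seconds, text)` tuples, post vectors are plain bag-of-words counts, and the threshold value `SIM_THRESHOLD = 0.8` is an arbitrary placeholder for $\xi$.

```python
import math
from collections import Counter

MAX_INTERVAL_SECONDS = 1.08e5  # upper bound of the example range in the text
SIM_THRESHOLD = 0.8            # hypothetical value for the threshold xi

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between bag-of-words count vectors (assumed vectorisation)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_copy(original_post, candidate_post):
    """Each post is (timestamp_seconds, text); candidate_post is the suspected copy."""
    dt = candidate_post[0] - original_post[0]
    if not (0 < dt <= MAX_INTERVAL_SECONDS):
        return False  # outside the time window: similarity is never computed
    return cosine_similarity(original_post[1], candidate_post[1]) >= SIM_THRESHOLD

original = (0, "breaking news about the topic")
print(is_copy(original, (3600, "breaking news about the topic")))    # True
print(is_copy(original, (200000, "breaking news about the topic")))  # False
```

A production system would likely use a weighted vectorisation such as TF-IDF and a tuned threshold; both are left open by the text.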
A pair of blog posts with a "copy" relationship is defined as a 2-tuple $\langle p_t, p_s \rangle$, and all pairs of blog posts with a "copy" relationship between two friends form a set $U$ of 2-tuples. From this, the copy network graph $G^i_c$ can be inferred as a weighted directed graph, in which the "copy" weight $w_c(u^i_s, u^i_t)$ from user $u^i_s$ to user $u^i_t$ (i.e., the weight of the "copy" relationship that arises because user $u^i_s$ copies blog posts of user $u^i_t$) is defined as:
$$w_c(u^i_s, u^i_t) = \sum_{\langle p^i_t,\, p^i_s \rangle \in U^i_{s,t}} sim(p^i_s, p^i_t) \times f\!\left(\Delta t_{p^i_s, p^i_t}\right).$$
Here $U^i_{s,t}$ is the set of all 2-tuples of blog posts in the $i$-th topic space that have a "copy" relationship between $u^i_s$ and $u^i_t$ and in which the copying is initiated by user $u^i_s$. In the above equation $f$ denotes a function which, in general, takes a higher value the smaller $\Delta t$ is. The weight calculation therefore combines the similarity and the time interval of the two blog posts: the higher the similarity between two blog posts and the smaller the time interval, the higher the probability that a "copy" relationship exists.
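The weight is a sum over the detected copy pairs of similarity multiplied by a time-decay factor. The text leaves the function $f$ open, requiring only that it be larger for smaller $\Delta t$. The sketch below assumes an exponential decay with the scale 19768 seconds reported earlier; that choice, the function names, and the example pairs are illustrative assumptions, not the patent's definition.

```python
import math

def decay(dt_seconds, scale=19768.0):
    """Hypothetical choice of f: equals 1 at dt = 0 and decreases as dt grows."""
    return math.exp(-dt_seconds / scale)

def copy_weight(copy_pairs, f=decay):
    """Sum sim * f(dt) over the copy pairs between one ordered pair of users.

    copy_pairs is an iterable of (similarity, dt_seconds), one entry per
    post pair in which the first user copied the second user's post.
    """
    return sum(sim * f(dt) for sim, dt in copy_pairs)

# Two hypothetical copy pairs: one 10 minutes apart, one 2 hours apart.
pairs = [(0.9, 600.0), (0.85, 7200.0)]
w_c = copy_weight(pairs)
print(round(w_c, 3))
```

Any other decreasing function can be passed through the `f` argument without changing the rest of the computation.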
The random walk process in the copy network graph $G^i_c$ is constructed as follows: a user is influenced by friends in the $i$-th topic space and copies a friend's blog post with a certain transition probability. This random walk simulates the copy behavior of users in the microblog. Let the transition probability matrix of the copy network in the $i$-th topic space be $P^i_c$. The transition probabilities between users are defined as follows.
Definition 3. In the copy network in the $i$-th topic space, the transition probability that user $u^i_s$ randomly copies a blog post of user $u^i_t$ is defined as:
$$P^i_c(u^i_t \mid u^i_s) = \frac{w_c(u^i_s, u^i_t)}{\sum_{u^i \in out(u^i_s)} w_c(u^i_s, u^i)},$$
where $w_c(u^i_s, u^i_t)$ is the "copy" weight from user $u^i_s$ to user $u^i_t$ in the $i$-th topic space (i.e., the weight of the "copy" relationship that arises because user $u^i_s$ copies blog posts of user $u^i_t$), and $\sum_{u^i \in out(u^i_s)} w_c(u^i_s, u^i)$ is the sum of the "copy" weights from user $u^i_s$ to all of his or her friends in the $i$-th topic space.
Transition probability calculation method of reading network
The more users read a published blog post, the wider the influence of that blog post. Constructing the random walk process in the reading network graph $G^i_d$ first requires constructing the reading network.
A naive way to construct the reading network graph is to use the follow network $G_f = (V, E_f)$ for the relationships among users and take the number of blog posts published by a user as the edge weight. Intuitively, in a given topic space, the more blog posts a user publishes and the more followers the user has, the wider the influence of those blog posts.
The TwitterRank algorithm generalizes this naive method by adding the topic similarity between friends; intuitively, a user is more likely to read blog posts of friends with similar topics. The TwitterRank transition probability is:
$$P_t(i, j) = \frac{|\tau_j|}{\sum_{a:\ s_i \text{ follows } s_a} |\tau_a|} \times sim_t(i, j),$$
where $|\tau_j|$ is the number of blog posts published by user $s_j$; $\sum_{a:\ s_i \text{ follows } s_a} |\tau_a|$ is the number of blog posts published by all friends of user $s_i$; and $sim_t(i, j)$ is the similarity between the two users in the $t$-th topic space.
In the microblog, all blog posts published by a user are actively pushed to the timelines of the user's followers, and a follower who logs in to his or her personal homepage can usually read the posts shown there. It can therefore be inferred that the smaller the interval between a user's login time and the posting time of a friend, the more likely the user is to read that friend's blog post. User login times, however, are difficult to obtain. Therefore, in one embodiment of the present invention, the daily posting time series pattern of each user is calculated from the timing statistics of a certain number of that user's posts, and it is assumed that the higher the similarity of the daily posting time series patterns of two users, the greater the probability that a "reading" relationship exists between them. The assumption is reasonable because the timing statistics of a certain number of posts reflect, to some extent, when a user is logged in. The more similar the daily posting patterns, the smaller the interval between the user's login time and the friend's posting time, and the more likely the user is to read the friend's blog posts.
For example, FIG. 4 shows the daily posting time series patterns of 3 users. Suppose user A follows both user B and user C. As the figure shows, the posting time series patterns of users A and B are significantly more similar than those of users A and C, so user A is more likely to read the blog posts of user B. Because user B is more likely to publish a blog post while user A is online, and the microblog service actively pushes the post to the personal homepage of user A, user A reads the blog posts of user B more often.
In one embodiment, the posting regularity of a user is measured by the probability that the user posts in each hour of the 24 hours of a day, and the posting time series pattern is defined as follows.
Definition 4. For an arbitrary user $u$, the 2-tuple $\langle t, p \rangle$ indicates that the probability of the user posting within hour $t$ is $p$. The time set $\{t_0, t_1, \ldots, t_{23}\}$ ($t_0 < t_1 < \ldots < t_{23}$) represents 24 discrete points, each of 1 hour duration. The posting time series pattern of the user is defined as:
$$ts = \langle ts_0 = \langle t_0, p_0 \rangle,\ ts_1 = \langle t_1, p_1 \rangle,\ \ldots,\ ts_{23} = \langle t_{23}, p_{23} \rangle \rangle.$$
The timing of $N$ posts published by each user is counted, and the probability of the user posting within hour $i$ is calculated as $p_i = N_i / N$, where $N_i$ is the number of the user's $N$ posts that were published within hour $i$. That is, the more posts a user publishes within hour $i$, the greater the probability that the user is online in the microblog during that hour of the day.
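The posting pattern is a 24-bin histogram of post times normalized by the number of posts. The following minimal Python sketch computes the probabilities $p_0, \ldots, p_{23}$. It assumes that post times are available as `datetime` objects in a single consistent time zone; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime

def posting_pattern(timestamps):
    """Return [p_0, ..., p_23] with p_i = N_i / N over the given post times."""
    counts = [0] * 24
    for ts in timestamps:
        counts[ts.hour] += 1
    total = len(timestamps)
    return [c / total for c in counts] if total else [0.0] * 24

# Four hypothetical posts: three in the 09:00 hour, one in the 21:00 hour.
posts = [
    datetime(2013, 3, 1, 9, 5),
    datetime(2013, 3, 1, 9, 40),
    datetime(2013, 3, 2, 21, 15),
    datetime(2013, 3, 3, 9, 30),
]
pattern = posting_pattern(posts)
print(pattern[9], pattern[21])  # 0.75 0.25
```

Returning an all-zero pattern for a user with no posts is a design choice of this sketch, not something the text specifies.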
If the posting probabilities for multiple users are the same for each hour, their time series patterns are more similar. In one embodiment, the time series similarity may be measured using Euclidean (Euclidean) distance.
Specifically, let Q and C denote two posting time series, let $q_i$ and $c_i$ denote the values of Q and C at the i-th point, and let i and n denote the index of the current point and the length of the whole sequence, respectively. The similarity of the two time series is calculated as:
$$\mathrm{simSeries}(Q, C) = \frac{1}{\sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}}, \quad |Q| \neq 0 \text{ and } |C| \neq 0,$$
where |Q| and |C| denote the lengths of the two sequences (which are equal). The smaller the Euclidean distance, the higher the similarity.
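A minimal sketch of this similarity measure is given below. It assumes the similarity is the reciprocal of the Euclidean distance; the function name `sim_series` and the choice to return infinity for identical sequences (a case the text does not address) are assumptions of the example.

```python
import math


def sim_series(q, c):
    """Reciprocal of the Euclidean distance between two equal-length series."""
    if len(q) == 0 or len(q) != len(c):
        raise ValueError("sequences must be non-empty and of equal length")
    distance = math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))
    # Assumption: identical patterns are treated as infinitely similar.
    return math.inf if distance == 0 else 1.0 / distance


# Hypothetical two-point patterns, shortened from 24 hours for readability.
user_a = [1.0, 0.0]
user_b = [0.5, 0.5]
user_c = [0.0, 1.0]
closer = sim_series(user_a, user_b) > sim_series(user_a, user_c)
```

In practice a capped value or a small additive constant in the denominator may be preferable to infinity, since the result is later multiplied into the reading probability.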
In one embodiment, it can be assumed that the probability that a reading relationship exists between users is related to the following 3 factors:
1) a user reads friends who publish a large number of posts with higher probability;
2) a user reads friends with high topic similarity with higher probability;
3) a user reads friends with high posting time series pattern similarity with higher probability.
Thus, the probability that user $u_s$ reads the posts of friend $u_t$ is defined as follows:
$$P_{read}(u_s, u_t) = \frac{\tau_t \times \mathrm{sim}(u_s, u_t) \times \mathrm{simSeries}(u_s, u_t)}{\sum_{u \in out(u_s)} \tau_u \times \mathrm{sim}(u_s, u) \times \mathrm{simSeries}(u_s, u)},$$
where $\tau$ denotes the number of posts published by the corresponding user in the data set (excluding posts with forwarding, reply, or copy relations); $\mathrm{sim}(u_s, u_t)$ denotes the topic similarity between user $u_s$ and the friend $u_t$ that he follows; $\mathrm{simSeries}(u_s, u_t)$ denotes the posting time series similarity between user $u_s$ and the friend $u_t$ that he follows; and $out(u_s)$ denotes the set of friends followed by user $u_s$. This embodiment takes all 3 of the above factors into account; it is understood that in some embodiments only one or two of the factors may be considered, for example to reduce computational complexity.
In one embodiment of the present invention, the topic similarity $\mathrm{sim}(u_s, u_t)$ is calculated by treating all posts of one user as one document and again using the LDA model to determine the topic distribution of each user. Define t as the "user-topic" distribution vector of a user, i.e., $t = (t_1, t_2, \ldots, t_k)^T$, where the elements $t_1, t_2, \ldots, t_k$ give the probability distribution of the user over the topic spaces, k is the preset number of topics, and the set of user topic categories is $\{t_1, t_2, \ldots, t_k\}$.
User topic similarity is defined in terms of the KL distance over the set of user topic categories $\{t_1, t_2, \ldots, t_k\}$:
$$\mathrm{sim}(u_s, u_t) = \frac{1}{KL(u_s, u_t)} = \frac{1}{\sum_{0 < i \le k} t_s^i \lg \frac{t_s^i}{t_t^i}}, \quad |u_t| \neq 0 \text{ and } |u_s| \neq 0,$$
where $|u_s|$ and $|u_t|$ denote the number of posts published by the respective users; $t_s^i$ denotes the probability of user $u_s$ in the i-th topic space, and $t_t^i$ denotes the probability of user $u_t$ in the i-th topic space. The smaller the KL distance, the higher the topic similarity.
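The topic similarity can be sketched as the reciprocal of the KL distance between two user-topic distributions. The sketch assumes that lg denotes the base-10 logarithm, that the distributions come from a smoothed topic model so that the second distribution has no zero where the first is positive, and that identical distributions are treated as infinitely similar; the function names are hypothetical.

```python
import math


def kl_distance(ts, tt):
    """Sum of ts[i] * lg(ts[i] / tt[i]) over topics, with lg taken as log base 10."""
    total = 0.0
    for p, q in zip(ts, tt):
        if p == 0:
            continue  # a zero-probability topic contributes nothing
        if q <= 0:
            raise ValueError("tt must be positive wherever ts is positive")
        total += p * math.log10(p / q)
    return total


def topic_similarity(ts, tt):
    """Reciprocal of the KL distance; larger values mean more similar topics."""
    kl = kl_distance(ts, tt)
    # Assumption: identical distributions are treated as infinitely similar.
    return math.inf if kl <= 0 else 1.0 / kl


# Hypothetical user-topic vectors with k = 3 topics.
reader = [0.7, 0.2, 0.1]
near_friend = [0.6, 0.3, 0.1]
far_friend = [0.1, 0.2, 0.7]
```

Note that the KL distance is not symmetric, so `topic_similarity(a, b)` and `topic_similarity(b, a)` generally differ; the order used here follows the formula, with the reading user first.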
If $|u_t| = 0$, i.e., friend $u_t$ followed by user $u_s$ has published no posts in the data set, the probability that user $u_s$ reads friend $u_t$ is 0.
If $|u_s| = 0$, i.e., user $u_s$ has published no posts in the data set, the probability that user $u_s$ reads friend $u_t$ is defined as:
$$P_{read}(u_s, u_t) = \frac{\tau_t}{\sum_{u \in out(u_s)} \tau_u}$$
That is, the probability of reading a friend depends only on the number of posts the friend has published: the more posts a friend publishes, the higher the probability that the friend is read.
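The reading probability and its two special cases can be combined in one routine, sketched below. The function name `read_probabilities`, the dictionary-based inputs, and the convention of passing `None` for the similarities when the reading user has no posts are assumptions of this illustration.

```python
def read_probabilities(tau, topic_sim=None, series_sim=None):
    """Return {friend: P_read} for one reading user.

    tau        -- {friend: number of posts the friend published}
    topic_sim  -- {friend: topic similarity to the reader}, or None
    series_sim -- {friend: posting time series similarity}, or None
    Pass None for both similarities when the reader has no posts.
    """
    if topic_sim is None or series_sim is None:
        # Reader has no posts: only the friends' post counts matter.
        scores = {u: float(n) for u, n in tau.items()}
    else:
        # A friend with no posts gets score 0 regardless of similarity.
        scores = {
            u: (n * topic_sim[u] * series_sim[u] if n > 0 else 0.0)
            for u, n in tau.items()
        }
    total = sum(scores.values())
    if total == 0:
        return {u: 0.0 for u in scores}
    return {u: s / total for u, s in scores.items()}


# Hypothetical friends B, C, D of one reading user.
tau = {"B": 100, "C": 100, "D": 0}
topic_sim = {"B": 2.0, "C": 1.0, "D": 5.0}
series_sim = {"B": 3.0, "C": 1.0, "D": 5.0}
full = read_probabilities(tau, topic_sim, series_sim)
fallback = read_probabilities(tau)
```

With equal post counts, friend B is read more often than friend C because both similarities are higher, while friend D is never read because D has published nothing.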
In one embodiment, for the set of all users $U_i$ in the i-th topic space, inferring whether a follower of a user u in $U_i$ has a "reading" relationship with u requires the following condition to be satisfied:
$$t_u^i \times P_{read} \geq \eta$$
where $t_u^i$ denotes the probability of user u in the i-th topic space and $P_{read}$ denotes the probability that a follower of user u reads the user's posts. That is, the "reading" probability between the two users in the i-th topic space, $t_u^i \times P_{read}$, must be greater than or equal to a certain threshold $\eta$.
The reading network graph is a weighted directed graph. In the i-th topic space, the weight of the reading relationship from user $u_s^i$ to user $u_t^i$ (i.e., the reading relationship that arises because user $u_s^i$ reads the posts of user $u_t^i$) can be defined as:
$$w_d(u_s^i, u_t^i) = t_t^i \times P_{read}(u_s^i, u_t^i)$$
where $t_t^i$ denotes the probability of user $u_t^i$ in the i-th topic space.
Let $P_d^i$ denote the transition probability matrix of the reading network graph in the i-th topic space. The transition probability between users is defined as follows.
Definition 5. In the reading network in the i-th topic space, the transition probability that user $u_s^i$ randomly reads the posts of user $u_t^i$ is defined as:
$$P_d^i(u_t^i \mid u_s^i) = \frac{w_d(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i)},$$
where $w_d(u_s^i, u_t^i)$ denotes the weight of the "reading" relationship from user $u_s^i$ to user $u_t^i$ in the i-th topic space (i.e., the reading relationship that arises because user $u_s^i$ reads the posts of user $u_t^i$), and the denominator is the sum of the weights of the "reading" relationships from user $u_s^i$ to all of its friends in the i-th topic space.
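The construction of the reading network for one topic space can be sketched as follows: edges are weighted by the friend's topic probability times the reading probability, edges below the threshold are dropped, and the remaining weights are normalized per reading user. The function name, the nested-dictionary layout, and the decision to normalize only over the edges that pass the threshold are assumptions of this example.

```python
def reading_transitions(p_read, topic_prob, eta):
    """Return {reader: {friend: transition probability}} for one topic space.

    p_read     -- {reader: {friend: P_read(reader, friend)}}
    topic_prob -- {user: probability of the user in this topic space}
    eta        -- threshold on topic_prob[friend] * P_read
    """
    transitions = {}
    for reader, friends in p_read.items():
        weights = {}
        for friend, prob in friends.items():
            weight = topic_prob[friend] * prob
            if weight >= eta:  # keep only edges that count as "reading"
                weights[friend] = weight
        total = sum(weights.values())
        transitions[reader] = (
            {friend: w / total for friend, w in weights.items()} if total > 0 else {}
        )
    return transitions


# Hypothetical values for a single reading user A.
p_read = {"A": {"B": 0.6, "C": 0.3, "D": 0.1}}
topic_prob = {"B": 0.5, "C": 0.5, "D": 0.1}
reading = reading_transitions(p_read, topic_prob, eta=0.05)
```

In this example the edge from A to D has weight of about 0.01, which is below the threshold, so A's transition probability is split between B and C only.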
The embodiments for calculating the transition probabilities in the forwarding network, the reply network, the copy network, and the reading network are described above in sequence. In the actual calculation process, however, the transition probabilities in these networks may be calculated in any order, and one or more of the calculations may be performed in parallel; the calculation is not limited to the order described above.
Method for fusing multiple relation networks
In order to comprehensively measure the influence of the individual microblog users, the transition probabilities in the four networks can be comprehensively considered for each user to calculate the influence ranking score of each user, wherein the influence ranking score reflects the probability that other users randomly access the user (i.e. forward, reply, copy and read the blog of the user), and further reflects the topic influence of the user.
The influence of friends on a user appears as a random walk inside the 4 influence networks; at the same time, the user jumps to another influence network with a certain probability. In one embodiment of the invention, the probabilities that the user stays in the forwarding network, the reply network, the copy network, and the reading network are defined as $\lambda_1, \lambda_2, \lambda_3, \lambda_4$, respectively, satisfying $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$. The user jumps from the forwarding network to the other networks with probability $1 - \lambda_1$, from the reply network to the other networks with probability $1 - \lambda_2$, from the copy network to the other networks with probability $1 - \lambda_3$, and from the reading network to the other networks with probability $1 - \lambda_4$. Taking the stay probability $\lambda$ of each network into account and letting the transition probability matrix be B, the transition probabilities between users (e.g., from user $u_s^i$ to user $u_t^i$) in the 4 networks in the i-th topic space are respectively defined as follows:
1) forwarding network:
$$B_a^i(u_t^i \mid u_s^i) = \lambda_1 \frac{w_a(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_a(u_s^i, u^i)}$$
2) reply network:
$$B_b^i(u_t^i \mid u_s^i) = \lambda_2 \frac{w_b(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_b(u_s^i, u^i)}$$
3) copy network:
$$B_c^i(u_t^i \mid u_s^i) = \lambda_3 \frac{w_c(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_c(u_s^i, u^i)}$$
4) reading network:
$$B_d^i(u_t^i \mid u_s^i) = \lambda_4 \frac{w_d(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i)}$$
In another embodiment of the present invention, following the PageRank algorithm, the user not only walks randomly along the network (e.g., the user directly accesses a friend's microblog through the friend link) but also jumps randomly to other nodes with a certain probability $\beta$ (e.g., the user accesses the microblog of a user who is not a friend by other means, such as manual input). This embodiment therefore takes both the jump probability $\beta$ between nodes inside a network and the stay probability $\lambda$ of each network into account. Letting the transition probability matrix be B, the transition probabilities between users (e.g., from user $u_s^i$ to user $u_t^i$) in the 4 networks in the i-th topic space are respectively defined as follows:
1) forwarding network:
$$B_a^i(u_t^i \mid u_s^i) = \lambda_1 \times (1 - \beta) \times \frac{w_a(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_a(u_s^i, u^i)} + \frac{\beta}{n}$$
2) reply network:
$$B_b^i(u_t^i \mid u_s^i) = \lambda_2 \times (1 - \beta) \times \frac{w_b(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_b(u_s^i, u^i)} + \frac{\beta}{n}$$
3) copy network:
$$B_c^i(u_t^i \mid u_s^i) = \lambda_3 \times (1 - \beta) \times \frac{w_c(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_c(u_s^i, u^i)} + \frac{\beta}{n}$$
4) reading network:
$$B_d^i(u_t^i \mid u_s^i) = \lambda_4 \times (1 - \beta) \times \frac{w_d(u_s^i, u_t^i)}{\sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i)} + \frac{\beta}{n}$$
In the above formulas, n denotes the number of nodes in the corresponding network.
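Both embodiments of the per-network transition probability can be sketched with one helper, where setting the jump probability to zero recovers the first embodiment. The function name `fused_transitions` and its parameter names are hypothetical, and the helper only returns values for the friends of one user; nodes that are not friends would receive just the jump term beta / n.

```python
def fused_transitions(out_weights, stay_prob, beta=0.0, n_nodes=1):
    """Transition probabilities from one user to each friend in one network.

    out_weights -- {friend: edge weight w(u_s, friend)} in this network
    stay_prob   -- stay probability (lambda) of this network
    beta        -- random jump probability; 0.0 gives the first embodiment
    n_nodes     -- number of nodes in this network
    """
    total = sum(out_weights.values())
    if total == 0:
        return {}
    jump = beta / n_nodes
    return {
        friend: stay_prob * (1.0 - beta) * weight / total + jump
        for friend, weight in out_weights.items()
    }


# Hypothetical forwarding counts from one user to friends B and C.
forward_weights = {"B": 3, "C": 1}
without_jump = fused_transitions(forward_weights, stay_prob=0.4)
with_jump = fused_transitions(forward_weights, stay_prob=0.4, beta=0.15, n_nodes=10)
```

The same helper would be called once per network with that network's weights and stay probability.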
Let $r_i(u)$ be the ranking score of user u in the i-th topic space. Taking the random walks of users in the 4 networks into account together, the ranking score of user u in the i-th topic space is defined in one embodiment as follows:
$$r_i(u) = \sum_{(u_t^i, u) \in E^i_{Retweet}} B_a^i(u \mid u_t^i)\, r_i(u_t^i) + \sum_{(u_t^i, u) \in E^i_{Reply}} B_b^i(u \mid u_t^i)\, r_i(u_t^i) + \sum_{(u_t^i, u) \in E^i_{Copy}} B_c^i(u \mid u_t^i)\, r_i(u_t^i) + \sum_{(u_t^i, u) \in E^i_{Read}} B_d^i(u \mid u_t^i)\, r_i(u_t^i).$$
That is, the ranking score of a user is largely determined by the probability that followers randomly access the user.
The ranking score of user u in the i-th topic space can be calculated by transforming the above formula:
$$r_i(u) = \Big( \sum_{(u_t^i, u) \in E^i_{Retweet}} B_a^i(u \mid u_t^i) + \sum_{(u_t^i, u) \in E^i_{Reply}} B_b^i(u \mid u_t^i) + \sum_{(u_t^i, u) \in E^i_{Copy}} B_c^i(u \mid u_t^i) + \sum_{(u_t^i, u) \in E^i_{Read}} B_d^i(u \mid u_t^i) \Big) \times r_i(u_t^i).$$
Let M be the transition probability matrix of the fused multi-relational network random walk model. The multi-relational network random walk iteration model is then:
$$\pi_i^{t+1} = M \pi_i^t$$
where $\pi_i^t$ is the user ranking score vector at the t-th iteration in the i-th topic space.
As can be seen from the above formula, the random walk iteration model of the multi-relational network is a Markov process in which every state can be traversed. Thus, an initial vector $\pi_i^0$ can be given (for example, each node in the network may initially have the same ranking score), and the computation results converge gradually after n iterations. The iteration may be stopped when a certain stop condition is met.
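The iteration can be sketched as a plain power iteration. The sketch assumes M is stored as a list of rows with M[r][c] holding the probability of moving from user c to user r, renormalizes the score vector after every step (the text does not say whether M is exactly stochastic), and stops when the L1 change falls below a tolerance; the function name, the tolerance, and the stop condition are assumptions.

```python
def multirank(M, tol=1e-10, max_iter=1000):
    """Iterate pi <- M * pi from a uniform start until the scores stabilize."""
    n = len(M)
    pi = [1.0 / n] * n  # every node starts with the same ranking score
    for _ in range(max_iter):
        nxt = [sum(M[r][c] * pi[c] for c in range(n)) for r in range(n)]
        norm = sum(nxt)
        if norm == 0:
            raise ValueError("transition matrix sends all score mass to zero")
        nxt = [x / norm for x in nxt]  # assumption: keep scores summing to 1
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical fused matrix for 3 users; each column sums to 1.
M = [
    [0.1, 0.6, 0.5],
    [0.3, 0.1, 0.4],
    [0.6, 0.3, 0.1],
]
scores = multirank(M)
```

Sorting users by the resulting scores in descending order would give the Top-K influential individuals for the topic.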
Results of the experiment
The effect of the method of the present invention will be described below with reference to the experimental results.
The experiment used 261954 Chinese users on Twitter. Of these, 103836 users (39.6% of the Chinese users) posted tweets during the three months from 2011-04-15 to 2011-07-15, and 2660281 tweets were collected.
To verify the effectiveness of the method, the accuracy of influential-individual mining is evaluated. Since the truly influential individuals on Twitter are difficult to determine manually, cross-validation across multiple algorithms is used to determine the accuracy of each algorithm. The experiment considers 6 influential-individual mining algorithms:
1) individual influence is found in the social network formed by forwarding relations by means of the PageRank algorithm; this algorithm is denoted ReposTransnk;
2) individual influence is found in the reply network by means of the PageRank algorithm; this algorithm is denoted ReplyRank;
3) the TwitterRank algorithm proposed by Weng et al.;
4) individual influence is measured by the number of followers; this algorithm is denoted FollowerNum;
5) individual influence is measured by the number of posts; this algorithm is denoted TweetNum;
6) the random walk model in the multi-relational network provided by the invention; this algorithm is denoted MultiRank (three different parameter settings of the MultiRank algorithm are shown in Figs. 5 and 6, named MultiRank1, MultiRank2, and MultiRank3, to further measure the effect of different parameter settings on the MultiRank algorithm).
The experiment uses a cross-validation method: a result that N of the algorithms all consider correct is taken as a correct reference result. For example, given 4 algorithms A, B, C, and D whose Top-K high-influence individual sets are $I_A$, $I_B$, $I_C$, and $I_D$, and assuming that a result that any 2 algorithms both consider correct is taken as a correct reference result, the reference standard set of influential individuals is defined as:
$$I_2 = (I_A \cap I_B) \cup (I_A \cap I_C) \cup (I_A \cap I_D) \cup (I_B \cap I_C) \cup (I_B \cap I_D) \cup (I_C \cap I_D)$$
The precision P reflects how genuine the influential individuals found on Twitter are. The precision of influential-individual discovery for algorithm A is defined as follows:
$$P_A = \frac{|I_A \cap I_2|}{|I_A|}$$
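The reference set and the precision can be sketched as follows. The function names and the toy Top-K sets are hypothetical; the reference set is built as the union of the intersections of every group of N algorithm results, which matches the expansion of $I_2$ for 4 algorithms.

```python
from itertools import combinations


def reference_set(top_k_sets, n_agree):
    """Union of the intersections of every group of n_agree result sets."""
    reference = set()
    for group in combinations(top_k_sets.values(), n_agree):
        reference |= set.intersection(*group)
    return reference


def precision(found, reference):
    """Fraction of an algorithm's Top-K users that appear in the reference set."""
    return len(found & reference) / len(found) if found else 0.0


# Hypothetical Top-4 user IDs returned by 4 algorithms.
results = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 5, 6},
    "C": {3, 4, 6, 7},
    "D": {8, 9, 10, 11},
}
i2 = reference_set(results, 2)
i3 = reference_set(results, 3)
p_a = precision(results["A"], i2)
```

Raising the agreement level from 2 to 3 shrinks the reference set in this toy example, which mirrors the drop in measured precision reported for larger N.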
In the experiment, the Top-10, 20, 50, 100, 200, and 500 influential individuals in each topic are obtained with each of the above algorithms.
For N = 2, 3, 4, 5, 6, 7, the distribution of the average accuracy of the influential-individual mining algorithms over Top-10, 20, 50, 100, 200, and 500 in each topic category is shown in Fig. 5, where the ordinate represents accuracy and the abscissa represents each topic.
As can be seen from the experimental results shown in Fig. 5, for N = 2, 3, 4, 5, 6, and 7, when the second group of parameters is selected for the MultiRank algorithm proposed by the present invention (MultiRank2 in Fig. 5), the accuracy is higher in all 10 topic categories. The MultiRank2 parameters are chosen to balance the effect of network size on the influence ranking score $r_i(u)$. The MultiRank1 and MultiRank3 parameters are chosen so that, respectively, the reading network ranking score dominates the influence ranking score $r_i(u)$, and the forwarding, reply, and copy network ranking scores dominate the influence ranking score $r_i(u)$. In the multi-relational network, MultiRank1 and MultiRank3 are therefore each affected by network size, which results in a relative reduction in accuracy.
The experimental results also show that, in each topic, the accuracy tends to decrease as the number N of reference standards in the cross-validation increases. Increasing N reduces the number of elements in the reference set $I_N$ formed from the intersections of multiple reference standards, so the intersection of any algorithm's result set $I_A$ with $I_N$ has fewer elements, which reduces the accuracy. The results further show that when N = 3 or 4, the accuracies of the algorithms are relatively well separated and the experimental effect is optimal. If N is set too low (N = 2), $I_N$ contains many elements, so the intersections of MultiRank1, MultiRank2, and MultiRank3 with the reference set $I_N$ are essentially the same, and the accuracies of these 3 algorithms are not well separated. If N is set too high (N = 5, 6, 7), $I_N$ contains few elements, and the intersections of the algorithms with the reference set $I_N$ are again essentially the same, so the accuracies are not well separated.
Then, for N = 2, 3, 4, 5, 6, 7, the distribution of the average accuracy over the 10 topic categories at Top-10, 20, 50, 100, 200, and 500 is shown in Fig. 6.
As can be seen from the experimental results shown in Fig. 6, the MultiRank algorithm provided by the present invention has higher accuracy across the 6 groups of Top-K influential-individual mining, and when the number of reference standards is N = 3 or 4, the accuracies of the algorithms are better separated and the experimental effect is optimal. The experimental results also show that, in Top-K influential-individual mining, the accuracy tends to increase as K increases: a larger number of influential individuals increases the number of elements shared across the algorithms in cross-validation, so the accuracy increases.
While the preferred embodiments of the present invention have been described above, it should be noted that the preferred embodiments are described in detail with respect to microblog applications, but those skilled in the art will appreciate that the methods described herein may be applied to other applications besides microblog applications, and that the corresponding users may issue other messages, such as text, video, audio, and the like, and any combination thereof, and are not limited to blog messages. In addition, the specific algorithms, formulas, parameter settings, etc. mentioned in the above detailed description of the present application are for illustration only and are not intended to limit the present invention. Those skilled in the art can make appropriate changes and substitutions to the above algorithms, formulas, parameters, etc. without departing from the spirit and scope of the present invention.

Claims (10)

1. A method for determining whether a copy relationship exists between posts of network users, the method comprising:
obtaining the time probability distribution obeyed by the time interval between two posts having an explicit forwarding relationship;
inferring, based on said time probability distribution, the time probability distribution obeyed by the time interval between two posts having a copy relationship;
setting, based on the inferred time probability distribution, a range that the time interval between two posts having a copy relationship should satisfy;
calculating the similarity of any two posts whose time interval is within the range; and
determining whether a copy relationship exists between the two posts based on the similarity.
2. The method of claim 1, wherein inferring, based on said time probability distribution, the time probability distribution obeyed by the time interval between two posts having a copy relationship comprises: inferring that the time probability distribution obeyed by the time interval between two posts having a copy relationship is the same as the time probability distribution obeyed by the time interval between two posts having an explicit forwarding relationship.
3. A method for mining topic-influential individuals based on a multi-relational network, comprising the following steps:
extracting the forwarding relationships among users to construct a forwarding relationship network, and calculating the transition probability that one user randomly forwards the posts of another user in the forwarding relationship network;
extracting the reply relationships among users to construct a reply relationship network, and calculating the transition probability that one user randomly replies to the posts of another user in the reply relationship network;
extracting the copy relationships among users to construct a copy relationship network according to the posts determined by the method of claim 1 or 2 to have a copy relationship, and calculating the transition probability that one user randomly copies the posts of another user in the copy relationship network;
extracting the reading relationships among users to construct a reading relationship network, and calculating the transition probability that one user randomly reads the posts of another user in the reading relationship network; and
taking the transition probabilities in the four networks into account together to calculate the probability that any user is randomly accessed by other users.
4. The method according to claim 3, wherein the reading relationships among users are extracted based on any one or more of the posting time series pattern similarity between users, the number of posts, and the interest similarity between users.
5. The method of claim 3, wherein the transition probability that one user randomly forwards the posts of another user in the forwarding relationship network is calculated based on the number of times the one user forwards the posts of the other user.
6. The method according to claim 3, wherein the transition probability that one user randomly replies to the posts of another user in the reply relationship network is calculated based on the number of times the one user replies to the posts of the other user.
7. The method of claim 3, wherein the transition probability that one user randomly copies the posts of another user in the copy relationship network is calculated based on the time interval and the similarity between the posts of the two users.
8. The method according to claim 3, wherein the transition probability that one user randomly reads the posts of another user in the reading relationship network is calculated based on any one or more of the posting time series pattern similarity between the two users, the number of posts, and the interest similarity between the two users.
9. The method of claim 3, wherein the access comprises forwarding, replying, copying, and reading.
10. The method of claim 3, wherein the post is text, video, audio, or any combination thereof published by a user.
CN201310071162.XA 2012-11-02 2013-03-06 Based on the topic influence individual method for digging of many relational networks Active CN103179198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310071162.XA CN103179198B (en) 2012-11-02 2013-03-06 Based on the topic influence individual method for digging of many relational networks

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201210432184 2012-11-02
CN201210432184.X 2012-11-02
CN201310071162.XA CN103179198B (en) 2012-11-02 2013-03-06 Based on the topic influence individual method for digging of many relational networks

Publications (2)

Publication Number Publication Date
CN103179198A CN103179198A (en) 2013-06-26
CN103179198B true CN103179198B (en) 2016-01-20

Family

ID=48638816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310071162.XA Active CN103179198B (en) 2012-11-02 2013-03-06 Based on the topic influence individual method for digging of many relational networks

Country Status (1)

Country Link
CN (1) CN103179198B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2081126A2 (en) * 2008-01-21 2009-07-22 NEC Corporation Information processing system, information processing apparatus, information processing program and recording medium
CN102254025A (en) * 2011-07-28 2011-11-23 清华大学 Information memory retrieving method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant