CN103179198B

CN103179198B - Topic influence individual mining method based on multi-relation network

Info

Publication number: CN103179198B
Application number: CN201310071162.XA
Authority: CN
Inventors: 丁兆云; 贾焰; 杨树强; 周斌; 韩伟红; 李爱平; 韩毅; 李莎莎
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2012-11-02
Filing date: 2013-03-06
Publication date: 2016-01-20
Anticipated expiration: 2033-03-06
Also published as: CN103179198A

Abstract

A method for determining whether a copy relationship exists between blog posts of network users, the method comprising: obtaining the time probability distribution obeyed by the time interval between two blog posts that have an explicit forwarding relationship; inferring, based on that distribution, the time probability distribution obeyed by the time interval between two blog posts that have a copy relationship; setting, based on the inferred distribution, the range that the time interval between two blog posts with a copy relationship should satisfy; calculating the similarity of any two blog posts whose time interval falls within that range; and determining whether a copy relationship exists between the two blog posts based on the similarity.

Description

Topic influence individual mining method based on multi-relation network

Technical Field

The invention relates to the technical field of network data mining, in particular to a topic influence individual mining technology in microblog.

Background

Twitter-like microblog services have developed rapidly in recent years as a new communication medium. According to the 29th Statistical Report on Internet Development in China, by the end of December 2011 the number of microblog users had reached 250 million, an increase of 296.0% over the end of the previous year, and the usage rate among netizens was 48.7%. Unlike Facebook-like social network services, the social relationship in a microblog service is unidirectional: a user can follow another user without that user's permission. For example, the social network in Twitter is formed by the follow (following) relationship. A person followed by a user is called a friend (friend) of the user, and a person who follows a user is called a follower (follower) of the user. All tweets (tweets) posted by a user appear on the public timeline (public timeline), and all of the user's messages are displayed on the timelines of the user's followers.

With the popularization of microblog services, a large number of users participate in topic discussions in microblogs, so that microblog services generate a large amount of information related to many topics every day. The influential individuals on each topic are submerged in this information, so mining the influential individuals on each topic from the large amount of microblog information is challenging.

Recently, related scholars have proposed topic level influence individual mining methods for Twitter data.

To measure the influence of a user on each topic in Twitter, Weng et al. proposed the TwitterRank method in reference 1, "[1] Weng J, Lim E P, Jiang J, et al. TwitterRank: Finding topic-sensitive influential twitterers [C]// Proc of the 3rd ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2010: 261-". The TwitterRank method outperforms the PageRank method and the Topic-sensitive PageRank method to a certain extent, but the transition probability of the random walk in TwitterRank considers only the number of blog posts and the topic similarity, and ignores relevant factors such as forwarding and replying.

In reference 2, "[2] Pal A, Counts S. Identifying topical authorities in microblogs [C]// Proc of the 4th ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2011: 45-54", Pal et al. consider a number of user attributes in order to identify authoritative users for each topic in Twitter, but ignore the link structure of the multi-relation network, making it difficult to characterize the relative influence of a user in the entire network.

Accordingly, there is a need in the art for an improved topic influencing individual mining technique.

Disclosure of Invention

It is an object of the present invention to provide a method for determining whether a copy relationship exists between blog posts of network users. The invention further aims to provide an improved topic influence individual mining method that makes up for the incompleteness of traditional methods for evaluating individual influence.

To achieve the above object, in one aspect of the present invention, there is provided a method for determining whether a copy relationship exists between blog posts of network users, the method comprising: obtaining the time probability distribution obeyed by the time interval between two blog posts with an explicit forwarding relationship; inferring, based on that distribution, the time probability distribution obeyed by the time interval between two blog posts with a copy relationship; setting, based on the inferred distribution, the range that the time interval between two blog posts with a copy relationship should satisfy; calculating the similarity of any two blog posts whose time interval is within the range; and determining whether a copy relationship exists between the two blog posts based on the similarity.

Preferably, the time probability distribution obeyed by the time interval between two blog posts with a copy relationship is inferred to be the same as the time probability distribution obeyed by the time interval between two blog posts with an explicit forwarding relationship.

In another aspect of the present invention, a topic influence individual mining method based on a multi-relation network is provided, which includes: extracting the forwarding relationships among users to construct a forwarding relationship network, and calculating the transition probability that one user randomly forwards a blog post of another user in the forwarding relationship network; extracting the reply relationships among users to construct a reply relationship network, and calculating the transition probability that one user randomly replies to a blog post of another user in the reply relationship network; extracting the copy relationships among users, according to the blog posts determined by the above method to have a copy relationship, to construct a copy relationship network, and calculating the transition probability that one user randomly copies a blog post of another user in the copy relationship network; extracting the reading relationships among users to construct a reading relationship network, and calculating the transition probability that one user randomly reads a blog post of another user in the reading relationship network; and combining these transition probabilities to calculate the probability that any user is randomly accessed by other users.

Preferably, the reading relationship between users is extracted based on any one or more of the posting time series pattern similarity between the users, the number of blog posts, and the interest similarity between the users.

Preferably, the transition probability that one user randomly forwards a blog post of another user in the forwarding relationship network is calculated based on the number of times the one user forwards blog posts of the other user.

Preferably, the transition probability that one user randomly replies to a blog post of another user in the reply relationship network is calculated based on the number of times the one user replies to blog posts of the other user.

Preferably, the transition probability that one user randomly copies a blog post of another user in the copy relationship network is calculated based on the time interval and the similarity between the blog posts of the two users.

Preferably, the transition probability that one user randomly reads a blog post of another user in the reading relationship network is calculated based on any one or more of the posting time series pattern similarity between the two users, the number of blog posts, and the interest similarity between the two users.

Preferably, the access includes forwarding, replying, copying, and reading.

Preferably, the blog post is text, video, audio or any combination thereof published by a user.

By adopting the technical scheme of the invention, whether a copy relationship exists between blog posts of network users can be determined efficiently. In addition, by using several different relationship networks, the different interaction characteristics among network users can be considered together, so that influential individuals in the network can be found more accurately.

Drawings

The present invention is described in detail with reference to the attached drawings, it being understood that the drawings and the corresponding description are to be considered illustrative and not restrictive, wherein:

FIG. 1 illustrates an example of a multi-relationship network that may exist among multiple microblog users;

FIG. 2 illustrates an example time interval distribution of user forwarding operations;

FIG. 3 shows the negative exponential distribution of the time interval between two blog posts in a copy relationship;

FIG. 4 exemplarily shows a posting time series pattern for each day for 3 users;

FIG. 5 shows the accuracy of various algorithms in various topics; and

FIG. 6 shows the average accuracy of each algorithm across all topics.

Detailed Description

The following describes preferred embodiments of the present invention in detail with reference to the accompanying drawings by taking microblog applications as examples.

In order to comprehensively measure the influence of a user at the level of a given topic, the invention considers the various types of network relationship in the microblog. For example, as shown in FIG. 1 (a), the influence of user B on user A is represented by 4 relationship types: 1) user A forwards a blog post of user B, using a tag such as "RT @B" or "via @B" in his own blog post; 2) user A replies to a blog post of user B, using a tag such as "@B" in his own blog post; 3) user A copies a blog post of user B without explicitly using a forwarding tag such as "RT @B" or "via @B"; 4) user A reads a blog post of user B. FIG. 1 (a) shows 4 different types of directed edges (A, B) between users A and B, representing the above four relationship types respectively. FIG. 1 (b) shows another example multi-relation network, which consists of 3 users and the 4 different types of directed edges described above. The influence network among microblog users is therefore a multi-relation network. According to the above four relationship types, the multi-relation network can be decomposed into 4 different relationship networks: the forwarding network, the reply network, the copy network, and the reading network.

To realize the individual mining of the influence in the microblog, the forwarding network, the reply network, the copy network and the reading network should be extracted first, and then the individual influence can be analyzed or calculated by fusing the networks. The influence individual mining process is based on microblog-related data (e.g., user information, user published blog articles, user forwarding records, reply records, etc.), and those skilled in the art understand that the microblog-related data can be collected in various existing manners, which is not the focus of the present disclosure. Therefore, in order to avoid obscuring the present invention, the process of obtaining the microblog-related data is not repeated.

The above is the basic implementation process of the method of the present invention, and in the following detailed description, the relevant definitions are first explained separately.

Related definitions

First, C is defined as the entire set of blog posts and V as the entire set of users. With k topics defined (k is a positive integer), Cⁱ and Vⁱ respectively represent the set of all blog posts and the set of all users in the ith (0 < i ≤ k) topic space.

The follow network in the microblog is defined as a directed unweighted graph G_f=(V,E_f), where V represents the set of nodes (i.e., the set of users) in the directed graph G_f, and E_f represents the set of directed edges between nodes in G_f, which represent the follow relationships between users. The subscript f has no meaning of its own and only identifies the network or relationship as the follow (following) network or follow relationship.

The multi-relation network under the ith (0 < i ≤ k) topic space is defined as a multi-relation graph

G^{i} = (V^{i}, E_{Retweet}^{i} \cup E_{Reply}^{i} \cup E_{Copy}^{i} \cup E_{Read}^{i}),

where V^{i} represents the set of users under the ith topic space, and E_{Retweet}^{i}, E_{Reply}^{i}, E_{Copy}^{i} and E_{Read}^{i} are all directed edge sets, representing respectively the forwarding, reply, copy and reading relationships under the ith topic space. The multi-relation graph can be decomposed into 4 graphs of different relationship types, namely a forwarding (Retweet) network graph, a reply (Reply) network graph, a copy (Copy) network graph and a reading (Read) network graph.

Specifically, G_{a}^{i} = (V_{a}^{i}, E_{Retweet}^{i}, w_{a}) represents a weighted directed forwarding network graph, where V_{a}^{i} represents the set of users under the ith topic space that are involved in a forwarding relationship; E_{Retweet}^{i} is a set of directed edges representing the forwarding relationships under the ith topic space; and w_{a} is the weight of a forwarding edge, which may be, for example, the number of times of forwarding between two users in the ith topic space.

Similarly, G_{b}^{i} = (V_{b}^{i}, E_{Reply}^{i}, w_{b}) represents a weighted directed reply network graph, where V_{b}^{i} represents the set of users under the ith topic space that are involved in a reply relationship; E_{Reply}^{i} is a set of directed edges representing the reply relationships under the ith topic space; and w_{b} is the weight of a reply edge, which may be, for example, the number of replies between two users in the ith topic space.

G_{c}^{i} = (V_{c}^{i}, E_{Copy}^{i}, w_{c}) represents a weighted directed copy network graph, where V_{c}^{i} represents the set of users under the ith topic space that are involved in a copy relationship; E_{Copy}^{i} is a set of directed edges representing the copy relationships among users under the ith topic space; and w_{c} is the weight of a copy edge (described below).

G_{d}^{i} = (V_{d}^{i}, E_{Read}^{i}, w_{d}) represents a weighted directed reading network graph, where V_{d}^{i} represents the set of users under the ith topic space that are involved in a reading relationship; E_{Read}^{i} is a set of directed edges representing the reading relationships among users under the ith topic space; and w_{d} is the weight of a reading edge (described below).

The subscripts a, b, c and d used herein only identify or distinguish the individual networks and their user sets, and have no meaning of their own.

The following explains the transition probability calculation method for each network by way of example in the preferred embodiment.

Transition probability calculation method of forwarding network

The random walk process in the forwarding network graph G_{a}^{i} is constructed as follows: a user is influenced by friends in the ith topic space and forwards the friends' blog posts with a certain transition probability. The random walk process in the forwarding network graph simulates the forwarding behavior of users in the microblog. Let the transition probability matrix of the forwarding network under the ith topic space be P_{a}^{i}. The transition probability between users is defined as follows.

Definition 1. In the forwarding network under the ith topic space, the transition probability that user u_{s}^{i} randomly forwards a blog post of user u_{t}^{i} is defined as:

P_{a}^{i}(u_{t}^{i} | u_{s}^{i}) = \frac{w_{a}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{a}(u_{s}^{i}, u^{i})},

where w_{a}(u_{s}^{i}, u_{t}^{i}) represents the number of times user u_{s}^{i} forwards blog posts of user u_{t}^{i} in the ith topic space; the denominator represents the number of times user u_{s}^{i} forwards blog posts of all of his friends in the ith topic space; and out(u_{s}^{i}) represents the set of end points of all directed edges in the follow network that start at u_{s}^{i}, i.e., the set of friends of u_{s}^{i}.
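The normalization in Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `transition_probabilities`, the dictionary layout and the sample counts are all hypothetical, and the sketch assumes that every edge in the input already points from a user to one of that user's friends.

```python
from collections import defaultdict

def transition_probabilities(edge_weights):
    """Row-normalize edge weights w(u_s, u_t) into P(u_t | u_s).

    edge_weights maps (source_user, target_user) to a non-negative weight,
    e.g. the number of times source_user forwarded target_user's posts.
    """
    totals = defaultdict(float)
    for (src, _dst), weight in edge_weights.items():
        totals[src] += weight
    return {
        (src, dst): weight / totals[src]
        for (src, dst), weight in edge_weights.items()
        if totals[src] > 0  # users with no outgoing weight get no row
    }

# Hypothetical counts: A forwarded B three times and C once; B forwarded C twice.
retweet_counts = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
p_retweet = transition_probabilities(retweet_counts)
```

The same helper would apply unchanged to any of the weighted networks whose transition probability has this ratio form, since only the meaning of the weight differs.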

Transition probability calculation method of reply network

The random walk process in the reply network graph G_{b}^{i} is constructed as follows: a user is influenced by friends in the ith topic space and replies to the friends' blog posts with a certain transition probability. The random walk process in the reply network graph simulates the reply behavior of users in the microblog. Let the transition probability matrix of the reply network under the ith topic space be P_{b}^{i}. The transition probability between users is defined as follows.

Definition 2. In the reply network under the ith topic space, the transition probability that user u_{s}^{i} randomly replies to a blog post of user u_{t}^{i} is defined as:

P_{b}^{i}(u_{t}^{i} | u_{s}^{i}) = \frac{w_{b}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{b}(u_{s}^{i}, u^{i})},

where w_{b}(u_{s}^{i}, u_{t}^{i}) represents the number of times user u_{s}^{i} replies to blog posts of user u_{t}^{i} in the ith topic space; the denominator represents the number of times user u_{s}^{i} replies to blog posts of all of his friends in the ith topic space; and out(u_{s}^{i}) represents the set of end points of all directed edges in the follow network that start at u_{s}^{i}, i.e., the set of friends of u_{s}^{i}.

Transfer probability calculation method for copy network

Because no "copy" relationship label exists in the microblog, constructing the random walk process in the copy network graph G_{c}^{i} first requires inferring the "copy" relationships in order to dig out the implied relationship edges.

One embodiment of the invention jointly considers the time interval between two blog posts and their similarity. Generally, two blog posts in a "copy" relationship have high content similarity, so a naive method for inferring the "copy" relationship is to compute the similarity between blog posts and, if it exceeds a certain threshold, infer the source of a blog post. The naive method must compute the similarity between a user's blog posts and all blog posts of all of the user's friends, which is computationally expensive. In one embodiment of the invention, in order to reduce the calculation cost, not only the similarity between blog posts but also the time interval Δt between them is considered.

"Copy" behavior in the microblog is to a certain extent a forwarding behavior, except that forwarding tags such as "RT @B" or "via @B" are not explicitly used. The probability distribution of the set T_Retweet of time intervals Δt_Retweet between blog posts with an explicit forwarding relationship is therefore used here to fit the probability distribution of the set T_Copy of time intervals Δt_Copy between blog posts with a "copy" relationship. Based on the above considerations, the present invention proposes a method for determining whether a copy relationship exists between blog posts of network users, which may include the steps of: obtaining the time probability distribution obeyed by the time interval between two blog posts with an explicit forwarding relationship; inferring, based on that distribution, the time probability distribution obeyed by the time interval between two blog posts with a copy relationship; setting, based on the inferred distribution, the range that the time interval between two blog posts with a copy relationship should satisfy; calculating the similarity of any two blog posts whose time interval is within the range; and determining whether a copy relationship exists between the two blog posts based on the similarity.

Specifically, in one embodiment, a set T_Retweet of time intervals Δt_Retweet between blog posts with an explicit forwarding relationship, of size 71000 (other sizes are possible), is first randomly sampled from the data set, i.e., |T_Retweet| = 71000. FIG. 2 shows the data distribution. As can be seen from FIG. 2, most of the time intervals are within several hours, only a small part span a relatively long time, and an even smaller part span more than 10 days. To characterize the time interval distribution more finely, the long-tail points whose time intervals span more than 10 days can be removed. Removing these points is reasonable because data analysis shows that forwarding users whose time intervals span more than 10 days are usually spam users.

After removing the long-tail points that span more than 10 days, the sample size becomes 69770, i.e., |T'_Retweet| = 69770, and the time interval Δt is found statistically to approximately follow a negative exponential distribution. Given the sample set T'_Retweet, the parameter of the negative exponential distribution is estimated as λ = 1.9768 × 10⁴ (here λ is the mean of the distribution, in seconds), and the probability density function of the negative exponential distribution is as follows:

f(x) = \begin{cases} \frac{1}{19768} e^{-x / 19768}, & x \geq 0, \\ 0, & x < 0. \end{cases}

where e is the base of the natural logarithm function.
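The text does not state how the parameter was estimated. A minimal sketch, assuming maximum-likelihood estimation (for this distribution the estimate is simply the sample mean) and using hypothetical names and toy data, could look as follows:

```python
import math

TEN_DAYS_S = 10 * 24 * 3600  # long-tail cutoff from the text, in seconds

def fit_exponential_scale(intervals_s, max_interval_s=TEN_DAYS_S):
    """Trim long-tail intervals, then return the sample mean.

    Assumption: maximum-likelihood fitting, for which the estimated
    scale (mean) of a negative exponential distribution is the sample mean.
    """
    kept = [x for x in intervals_s if 0 <= x <= max_interval_s]
    if not kept:
        raise ValueError("no intervals left after trimming")
    return sum(kept) / len(kept)

def exponential_pdf(x, scale):
    """Density f(x) = (1/scale) * exp(-x/scale) for x >= 0, else 0."""
    return math.exp(-x / scale) / scale if x >= 0 else 0.0

# Toy sample (seconds); the last value exceeds 10 days and is dropped.
sample = [600.0, 3600.0, 7200.0, 86400.0, 2_000_000.0]
scale = fit_exponential_scale(sample)
```

With the patent's own sample the corresponding value is the reported 19768; the toy data above are only there to make the sketch runnable.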

The set of time intervals between blog posts with an explicit forwarding relationship obeys a negative exponential distribution with parameter λ = 1.9768 × 10⁴. Because "copy" behavior is to some extent a forwarding behavior, it can be inferred that the time interval between two blog posts with a "copy" relationship also approximately obeys a negative exponential distribution with parameter λ = 1.9768 × 10⁴, whose distribution function is as follows (see FIG. 3):

F(x) = \begin{cases} 1 - e^{-x / 19768}, & x \geq 0, \\ 0, & x < 0. \end{cases}

From the negative exponential distribution, the range Δt_range of the time interval between two blog posts with a "copy" relationship can be approximated. Taking both calculation cost and accuracy into account, in one embodiment the range is set, by way of example, to Δt_range = (0 ks, 1.08×10² ks], where ks denotes 1000 seconds, the round bracket "(" means that the end point is excluded, and the square bracket "]" means that the end point is included. From the negative exponential distribution, the recall R (Recall) of the "copy" relationship can be inferred as:

R = F(1.08×10⁵) - F(0) = 99.58%.

If the time interval Δt of two blog posts is not within Δt_range, the probability that a "copy" relationship exists is low, so the similarity of the two blog posts need not be calculated, which reduces the calculation cost.
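The recall figure can be checked directly from the distribution function. The short sketch below uses the fitted value 19768 s and the example window of 1.08 × 10⁵ s from the text; the constant and function names are illustrative only.

```python
import math

SCALE_S = 19768.0   # fitted parameter from the text, in seconds
WINDOW_S = 1.08e5   # upper end of the example range, 1.08 x 10^2 ks

def exponential_cdf(x, scale=SCALE_S):
    """Distribution function F(x) = 1 - exp(-x/scale) for x >= 0, else 0."""
    return 1.0 - math.exp(-x / scale) if x >= 0 else 0.0

recall = exponential_cdf(WINDOW_S) - exponential_cdf(0.0)
print(f"{recall:.2%}")  # prints 99.58%
```

Under the fitted distribution, a window of 30 hours therefore discards fewer than one in two hundred true "copy" pairs while removing all later posts from the similarity computation.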

Therefore, to infer that blog posts published by users in a friend relationship have a "copy" relationship, the following two conditions must be met simultaneously:

1) sim(p_t, p_s) ≥ ξ;

2) Δt ∈ Δt_range = (0 ks, 1.08×10² ks].

The first condition is that the similarity sim(p_t, p_s) of the two blog posts p_t and p_s must be equal to or greater than a certain threshold ξ. In one embodiment, the cosine of the included angle may be used to calculate the similarity between two blog posts; other methods of calculating document similarity, such as the Kullback-Leibler (KL) distance, are equally applicable.

sim(p_t, p_s) = cos(v_t, v_s),

where v_t and v_s represent the vectors of the two blog posts, respectively.

The second condition is that the time interval between the two blog posts is within a certain threshold. The value 1.08×10² ks used above is merely an example, and those skilled in the art will appreciate that any other suitable value may be chosen by weighing calculation cost against accuracy.
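The two conditions can be combined into a single check that tests the inexpensive time condition first. The sketch below is an illustration under stated assumptions: blog posts are represented as dictionaries with hypothetical `time` (seconds) and `text` fields, the vectors are plain whitespace-token counts, and the threshold 0.8 is an arbitrary stand-in for ξ, which the text does not fix.

```python
import math
from collections import Counter

WINDOW_S = 1.08e5  # example window from the text (1.08 x 10^2 ks)

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between bag-of-words count vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def is_copy(source_post, later_post, threshold=0.8, window_s=WINDOW_S):
    """Return True if later_post is inferred to copy source_post."""
    dt = later_post["time"] - source_post["time"]
    if not (0 < dt <= window_s):
        return False  # outside the range: similarity is never computed
    return cosine_similarity(source_post["text"], later_post["text"]) >= threshold

src = {"time": 0, "text": "big news about the launch today"}
near = {"time": 3600, "text": "big news about the launch today"}
late = {"time": 200_000, "text": "big news about the launch today"}
other = {"time": 3600, "text": "what a lovely sunny morning"}
```

Here `is_copy(src, near)` holds, while `late` is rejected by the time condition alone and `other` by the similarity condition.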

Two blog posts with a "copy" relationship are defined as a pair <p_t, p_s>, and the set of all such pairs between two friends is denoted U. From these pairs the copy network graph G_{c}^{i} can be inferred as a weighted directed graph, in which the weight w_{c}(u_{s}^{i}, u_{t}^{i}) of the "copy" relationship from user u_{s}^{i} to user u_{t}^{i} (i.e., the "copy" relationship arising because user u_{s}^{i} copies blog posts of user u_{t}^{i}) is defined as follows:

w_{c}(u_{s}^{i}, u_{t}^{i}) = \sum_{<p_{t}^{i}, p_{s}^{i}> \in U_{s,t}^{i}} sim(p_{s}^{i}, p_{t}^{i}) \times f(\Delta t_{p_{s}^{i}, p_{t}^{i}}).

U_{s,t}^{i} is the set of all blog post pairs under the ith topic space that have a "copy" relationship between u_{s}^{i} and u_{t}^{i} and in which the copying is initiated by user u_{s}^{i}. f in the above equation denotes a function; in general, the smaller Δt is, the larger the value of f. The weight calculation therefore takes both the similarity and the time interval of two blog posts into account: the higher the similarity between two blog posts and the smaller the time interval, the higher the probability that a "copy" relationship exists.
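The text leaves the function f open, requiring only that it grow as Δt shrinks. The sketch below assumes, purely for illustration, the decay f(Δt) = exp(-Δt / 19768), which reuses the fitted parameter; the function names and sample pairs are hypothetical.

```python
import math

SCALE_S = 19768.0  # fitted parameter from the text, reused here by assumption

def time_decay(dt_s, scale=SCALE_S):
    """Hypothetical choice of f: larger for smaller time intervals."""
    return math.exp(-dt_s / scale)

def copy_edge_weight(copy_pairs):
    """Sum sim * f(dt) over all inferred copy pairs between two users.

    copy_pairs is an iterable of (similarity, dt_seconds) tuples, one per
    blog post pair in which the source user copied the target user.
    """
    return sum(sim * time_decay(dt) for sim, dt in copy_pairs)

# Two inferred copies from user s of user t: one after 10 minutes, one after ~11 hours.
pairs_s_t = [(0.9, 600.0), (0.85, 40_000.0)]
w_c = copy_edge_weight(pairs_s_t)
```

Any other monotonically decreasing f could be substituted without changing the structure of the computation.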

The random walk process in the copy network graph G_{c}^{i} is constructed as follows: a user is influenced by friends in the ith topic space and copies the friends' blog posts with a certain transition probability. The random walk process in the copy network graph simulates the copy behavior of users in the microblog. Let the transition probability matrix of the copy network under the ith topic space be P_{c}^{i}. The transition probability between users is defined as follows.

Definition 3. In the copy network under the ith topic space, the transition probability that user u_{s}^{i} randomly copies a blog post of user u_{t}^{i} is defined as:

P_{c}^{i}(u_{t}^{i} | u_{s}^{i}) = \frac{w_{c}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{c}(u_{s}^{i}, u^{i})},

where w_{c}(u_{s}^{i}, u_{t}^{i}) represents the weight, in the ith topic space, of the "copy" relationship from user u_{s}^{i} to user u_{t}^{i} (i.e., the "copy" relationship arising because user u_{s}^{i} copies blog posts of user u_{t}^{i}); and the denominator represents the sum of the weights of the "copy" relationships from user u_{s}^{i} to all of his friends in the ith topic space.

Transition probability calculation method of reading network

The more users read a published blog post, the wider its range of influence. To construct the random walk process in the reading network graph G_{d}^{i}, the reading network must first be constructed.

A simple idea for constructing the reading network graph is to use the follow network G_f=(V,E_f) to construct the relationships among users and to take the number of blog posts published by a user as the edge weight. Intuitively, under a specific topic space, the more blog posts a user publishes and the more followers the user has, the wider the influence of the user's blog posts.

A generalization of this naive method is the TwitterRank algorithm, which adds the topic similarity between friends; intuitively, a user is more likely to read friends' blog posts on similar topics. The transition probability of the TwitterRank algorithm is as follows:

P_{t}(i, j) = \frac{|\tau_{j}|}{\sum_{a :\, s_{i} \text{ follows } s_{a}} |\tau_{a}|} \times sim_{t}(i, j).

Here |τ_j| is the number of blog posts published by user u_j; the sum in the denominator is the number of blog posts published by all friends of user u_i; and sim_t(i, j) is the similarity between users u_i and u_j in the t-th topic space.

In the microblog, all blog posts published by a user are actively pushed to the timelines of the user's followers, and a follower can usually read them on his personal homepage after logging in. It can therefore be inferred that the smaller the interval between a user's login time and the posting time of a friend, the more likely the user is to read the friend's blog post. User login times, however, are difficult to obtain. Therefore, in one embodiment of the present invention, the daily posting time series pattern of each user is calculated from the time statistics of a certain number of the user's posts, and it is assumed that the higher the similarity of the daily posting time series patterns between users, the greater the probability that a "reading" relationship exists. The assumption is reasonable because the time statistics of a certain number of posts reflect, to some extent, when the user is logged in: the more similar the daily posting time series patterns, the smaller the interval between the user's login time and the friends' posting times, and the more likely the user is to read the friends' blog posts.

For example, FIG. 4 shows the daily posting time series patterns of 3 users. Suppose user A follows both user B and user C. As can be seen from the figure, the posting time series pattern similarity of users A and B is significantly higher than that of users A and C, so user A has a greater probability of reading the blog posts of B. Because user B is more likely to publish blog posts while user A is online, and the microblog service actively pushes those blog posts to the personal homepage of user A, user A reads the blog posts of user B more often.

In one embodiment, the posting rules of the user are measured by the posting probability of the user in each hour within 24 hours of a day, and the posting time sequence mode is defined as follows.

Definition 4. For an arbitrary user u, the pair <t, p> represents that the probability of the user posting within hour t is p. With the time series set {t₀, t₁, ..., t₂₃} (t₀ < t₁ < ... < t₂₃) representing 24 discrete points, each of 1 hour duration, the user posting time series pattern is defined as follows:

ts = <ts₀ = <t₀, p₀>, ts₁ = <t₁, p₁>, ..., ts₂₃ = <t₂₃, p₂₃>>

The time statistics of N posts published by each user are collected, and the posting probability of the user within hour i is calculated as p_i = N_i / N, where N_i represents the number of posts published within hour i among all N posts of the user. That is, the more posts the user publishes within hour i, the greater the probability that the user is online in the microblog during that hour of the day.
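Definition 4 amounts to a normalized 24-bin histogram of posting hours. A minimal sketch follows; the function name and sample timestamps are hypothetical, and returning an all-zero pattern for a user with no posts is an assumption, since the text does not cover that case.

```python
from collections import Counter
from datetime import datetime

def posting_pattern(post_times):
    """Return [p_0, ..., p_23] with p_i = N_i / N over the given posts."""
    n = len(post_times)
    if n == 0:
        return [0.0] * 24  # assumption: no posts gives an all-zero pattern
    counts = Counter(t.hour for t in post_times)
    return [counts[h] / n for h in range(24)]

# Hypothetical posting times of one user.
times = [
    datetime(2012, 10, 1, 9, 5),
    datetime(2012, 10, 1, 9, 40),
    datetime(2012, 10, 2, 21, 15),
    datetime(2012, 10, 3, 9, 0),
]
pattern = posting_pattern(times)
```

For this user three of the four posts fall in the 09:00 hour, so `pattern[9]` is 0.75 and `pattern[21]` is 0.25.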

The closer the posting probabilities of two users are in each hour, the more similar their time series patterns. In one embodiment, the time series similarity may be measured using the Euclidean distance.

Specifically, let Q and C respectively represent two posting time series; q_i represents the value of sequence Q at the ith point; c_i represents the value of sequence C at the ith point; and i and n respectively represent the index of the current point in the sequence and the length of the whole sequence. The similarity of the two time series is calculated as follows:

simSeries(Q, C) = \frac{1}{\sqrt{\sum_{i=1}^{n} (q_{i} - c_{i})^{2}}}, |Q| \neq 0 and |C| \neq 0,

where | Q | and | C | each represent the length of the two sequences (the lengths of both are equal), the smaller the euclidean distance, the higher the similarity.
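A direct transcription of this formula is sketched below. The formula is undefined when the two sequences are identical (distance 0); returning infinity in that case is an assumption of this sketch, as is the function name.

```python
import math

def sim_series(q, c):
    """Inverse Euclidean distance between two equal-length sequences."""
    if len(q) == 0 or len(q) != len(c):
        raise ValueError("sequences must be non-empty and of equal length")
    distance = math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))
    # Assumption: identical patterns are treated as maximally similar.
    return math.inf if distance == 0 else 1.0 / distance
```

For example, `sim_series([0, 0], [3, 4])` is 0.2, because the Euclidean distance between the two sequences is 5. In practice an implementation would probably cap or smooth the value so that a single pair of identical patterns does not dominate a later normalization.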

In one embodiment, it can be assumed that the probability that there is a reading relationship between users is related to the following 3 factors:

1) the user reads friends with a large number of blog articles with higher probability;

2) a user reads friends with high topic similarity at a higher probability;

3) the user reads the friends with high posting time sequence pattern similarity with higher probability.

Thus, the probability that user u_s reads the blog posts of friend u_t can be defined as follows:

P_{read}(u_{s}, u_{t}) = \frac{\tau_{t} \times sim(u_{s}, u_{t}) \times simSeries(u_{s}, u_{t})}{\sum_{u \in out(u_{s})} \tau_{u} \times sim(u_{s}, u) \times simSeries(u_{s}, u)},

where τ represents the number of blog posts published by the corresponding user in the data set (excluding blog posts with forwarding, reply or copy relationships); sim(u_s, u_t) represents the topic similarity between user u_s and the friend u_t whom he follows; simSeries(u_s, u_t) represents the posting time series similarity between user u_s and the friend u_t whom he follows; and out(u_s) represents the set of friends followed by user u_s. The above embodiment takes all 3 factors into account; it is understood that in some embodiments only one or two of the factors may be considered, for example to reduce computational complexity.

In one embodiment of the present invention, to calculate the topic similarity sim(u_s, u_t), all posts of one user are treated as one document, and the LDA model is also used to determine the topic distribution of each user. Define t as the "user-topic" distribution vector of a user, i.e. t = (t^1, t^2, ..., t^k)^T, where the elements t^1, t^2, ..., t^k of the "user-topic" distribution vector represent the user's probability in each topic space, k represents the preset number of topics, and the user's topic category set is denoted {t^1, t^2, ..., t^k}.

User topic similarity is defined as the reciprocal of the KL distance between the users' topic category sets {t^1, t^2, ..., t^k}:

sim(u_{s}, u_{t}) = \frac{1}{KL(u_{s}, u_{t})} = \frac{1}{\sum_{0 < i \leq k} t_{s}^{i} \lg \frac{t_{s}^{i}}{t_{t}^{i}}}, \quad |u_{t}| \neq 0 \text{ and } |u_{s}| \neq 0,

where |u_s| and |u_t| represent the number of posts issued by users u_s and u_t, respectively; t_s^i represents the probability of user u_s in the i-th topic space, and t_t^i represents the probability of user u_t in the i-th topic space. The smaller the KL distance, the higher the topic similarity.
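A minimal sketch of the topic similarity computation follows. The function names are hypothetical. The base-10 logarithm is an assumption based on reading "lg" as log10, and the conventions for zero probabilities and for identical distributions (KL distance 0) are not stated in the text and are chosen here only for illustration.

```python
import math

def kl_distance(ts, tt):
    """KL distance between two user-topic distributions of equal length k."""
    if len(ts) != len(tt) or len(ts) == 0:
        raise ValueError("distributions must be non-empty and of equal length")
    total = 0.0
    for p, q in zip(ts, tt):
        if p == 0:
            continue  # Assumption: a zero-probability topic contributes nothing.
        if q == 0:
            return math.inf  # Assumption: unmatched topic mass gives infinite distance.
        total += p * math.log10(p / q)  # Assumption: "lg" is the base-10 logarithm.
    return total

def topic_similarity(ts, tt):
    """Reciprocal of the KL distance; identical distributions map to infinity."""
    kl = kl_distance(ts, tt)
    if kl == 0:
        return math.inf  # Assumption: not specified in the text.
    return 1.0 / kl
```

Note that the KL distance is not symmetric, so topic_similarity(ts, tt) and topic_similarity(tt, ts) generally differ; the formula above takes the reading user u_s as the first argument.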

If |u_t| = 0, i.e. friend u_t followed by user u_s has issued no posts in the data set, the probability that user u_s reads friend u_t is 0.

If |u_s| = 0, i.e. user u_s has issued no posts in the data set, the probability that user u_s reads friend u_t is defined as:

P_{read}(u_{s}, u_{t}) = \frac{τ_{t}}{\sum_{u \in out(u_{s})} τ_{u}}

That is, the probability of reading a friend depends only on the number of messages the friend has issued: the more messages a friend issues, the higher the probability of reading that friend.
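The reading probability, including the two special cases for users without posts, can be sketched as below. The function name, the dictionary-based inputs and the source_has_posts flag are assumptions made for illustration; the similarity values are assumed to be finite and precomputed.

```python
def read_probabilities(tau, topic_sim, series_sim, source_has_posts=True):
    """Return P_read(u_s, u_t) for every friend u_t of one user u_s.

    tau:        dict friend -> number of posts issued by that friend
    topic_sim:  dict friend -> sim(u_s, friend), needed only for friends with posts
    series_sim: dict friend -> simSeries(u_s, friend), same restriction
    source_has_posts: False when |u_s| = 0, which selects the count-only formula
    """
    scores = {}
    for friend, count in tau.items():
        if count == 0:
            scores[friend] = 0.0  # |u_t| = 0: reading probability is 0
        elif source_has_posts:
            scores[friend] = count * topic_sim[friend] * series_sim[friend]
        else:
            scores[friend] = float(count)  # |u_s| = 0: only post counts matter
    total = sum(scores.values())
    if total == 0:
        return {friend: 0.0 for friend in scores}
    return {friend: score / total for friend, score in scores.items()}
```

With post counts of 10, 30 and 0 for friends a, b and c, and similarity products of 6 and 1 for a and b, the scores are 60, 30 and 0, so the reading probabilities are 2/3, 1/3 and 0.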

In one embodiment, for the set U^i of all users in the i-th topic space, inferring whether a follower of a user u in U^i has a "reading" relationship with u requires the following condition to be satisfied:

t_{u}^{i} \times P_{read} \geq η

where t_u^i represents the probability of user u in the i-th topic space, and P_read represents the probability that a follower of user u reads the user's posts. That is, the "reading" probability of the two users in the i-th topic space, t_u^i × P_read, must be greater than or equal to a certain threshold η.

The reading network graph is a weighted directed graph. In the i-th topic space, the weight w_d(u_s^i, u_t^i) of the reading relationship from user u_s^i to user u_t^i (i.e., the reading relationship formed because user u_s^i reads the posts of user u_t^i) can be defined as follows:

w_{d}(u_{s}^{i}, u_{t}^{i}) = t_{t}^{i} \times P_{read}(u_{s}^{i}, u_{t}^{i})

where t_t^i represents the probability of user u_t^i in the i-th topic space.

Let P_d^i denote the transition probability matrix of the reading network graph in the i-th topic space; the transition probability between users is defined as follows.

Definition 5. In the reading network under the i-th topic space, the transition probability that user u_s^i randomly reads the posts of user u_t^i is defined as:

P_{d}^{i}(u_{t}^{i} | u_{s}^{i}) = \frac{w_{d}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{d}(u_{s}^{i}, u^{i})},

where w_d(u_s^i, u_t^i) represents the weight of the "reading" relationship from user u_s^i to user u_t^i in the i-th topic space (i.e., the reading relationship formed because user u_s^i reads the posts of user u_t^i), and \sum_{u^i \in out(u_s^i)} w_d(u_s^i, u^i) represents the sum of the weights of the "reading" relationships from user u_s^i to all of his friends in the i-th topic space.
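The construction of the reading network in one topic space, from the threshold test to the transition probability of Definition 5, can be sketched as follows. The function name and the dictionary layout are hypothetical, and the normalization is taken over the friends whose reading edge survives the threshold, which is one possible reading of out(u_s^i) and is an assumption.

```python
def reading_transition(topic_prob, p_read, eta):
    """Build reading-network transition probabilities for one topic space.

    topic_prob: dict user -> probability of that user in this topic space
    p_read:     dict (reader, friend) -> reading probability for that pair
    eta:        threshold; an edge is kept only if its weight is at least eta
    Returns a dict (reader, friend) -> transition probability.
    """
    # Edge weight = topic probability of the friend times the reading probability.
    weights = {}
    for (reader, friend), prob in p_read.items():
        weight = topic_prob[friend] * prob
        if weight >= eta:
            weights[(reader, friend)] = weight
    # Sum of outgoing weights per reader (assumption: retained edges only).
    out_sum = {}
    for (reader, _), weight in weights.items():
        out_sum[reader] = out_sum.get(reader, 0.0) + weight
    return {
        (reader, friend): weight / out_sum[reader]
        for (reader, friend), weight in weights.items()
        if out_sum[reader] > 0
    }
```

For a reader with three friends whose edge weights are 0.3, 0.06 and 0.001 and a threshold of 0.01, the third edge is dropped and the remaining transition probabilities are 0.3 / 0.36 and 0.06 / 0.36.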

The embodiments of calculating the transition probabilities in the forwarding network, the replying network, the copying network and the reading network are sequentially described above, but it is obvious that in the actual calculation process, the transition probabilities in the networks may be calculated in any order, and one or more of the calculations may be performed in parallel, and are not limited to the order described above.

Method for fusing multiple relation networks

In order to comprehensively measure the influence of the individual microblog users, the transition probabilities in the four networks can be comprehensively considered for each user to calculate the influence ranking score of each user, wherein the influence ranking score reflects the probability that other users randomly access the user (i.e. forward, reply, copy and read the blog of the user), and further reflects the topic influence of the user.

The influence exerted on a user by his friends appears as a random walk process inside the 4 kinds of influence networks; at the same time, the user will jump to another kind of influence network with a certain probability. In one embodiment of the invention, the probabilities that the user stays in the forwarding network, the reply network, the copy network and the reading network are defined as λ₁, λ₂, λ₃, λ₄ respectively, satisfying λ₁ + λ₂ + λ₃ + λ₄ = 1. The user jumps from the forwarding network to other networks with probability 1 - λ₁; from the reply network to other networks with probability 1 - λ₂; from the copy network to other networks with probability 1 - λ₃; and from the reading network to other networks with probability 1 - λ₄. Taking the stay probability λ of each network into account, let the transition probability matrix be B; the transition probabilities between users (e.g., from user u_s^i to user u_t^i) in the 4 networks in the i-th topic space are respectively defined as follows:

1) forwarding network:

B_{a}^{i}(u_{t}^{i} | u_{s}^{i}) = λ_{1} \frac{w_{a}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{a}(u_{s}^{i}, u^{i})}

2) reply network:

B_{b}^{i}(u_{t}^{i} | u_{s}^{i}) = λ_{2} \frac{w_{b}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{b}(u_{s}^{i}, u^{i})}

3) copy network:

B_{c}^{i}(u_{t}^{i} | u_{s}^{i}) = λ_{3} \frac{w_{c}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{c}(u_{s}^{i}, u^{i})}

4) reading network:

B_{d}^{i}(u_{t}^{i} | u_{s}^{i}) = λ_{4} \frac{w_{d}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{d}(u_{s}^{i}, u^{i})}

In another embodiment of the present invention, following the PageRank algorithm, the user not only walks randomly along the network (e.g., the user directly accesses a friend's microblog via the friend link), but also jumps randomly to other nodes with a certain probability β (e.g., the user accesses the microblog of a user who is not his friend by other means, such as manual input). This embodiment therefore considers both the jump probability β between nodes inside a network and the stay probability λ of each network. Let the transition probability matrix be B; the transition probabilities between users (e.g., from user u_s^i to user u_t^i) in the 4 networks in the i-th topic space are respectively defined as follows:

1) forwarding network:

B_{a}^{i}(u_{t}^{i} | u_{s}^{i}) = λ_{1} \times (1 - β) \times \frac{w_{a}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{a}(u_{s}^{i}, u^{i})} + \frac{β}{n}

2) reply network:

B_{b}^{i}(u_{t}^{i} | u_{s}^{i}) = λ_{2} \times (1 - β) \times \frac{w_{b}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{b}(u_{s}^{i}, u^{i})} + \frac{β}{n}

3) copy network:

B_{c}^{i}(u_{t}^{i} | u_{s}^{i}) = λ_{3} \times (1 - β) \times \frac{w_{c}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{c}(u_{s}^{i}, u^{i})} + \frac{β}{n}

4) reading network:

B_{d}^{i}(u_{t}^{i} | u_{s}^{i}) = λ_{4} \times (1 - β) \times \frac{w_{d}(u_{s}^{i}, u_{t}^{i})}{\sum_{u^{i} \in out(u_{s}^{i})} w_{d}(u_{s}^{i}, u^{i})} + \frac{β}{n}

In the above formulas, n represents the number of nodes in the corresponding network.
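The per-network transition probability with stay probability λ and jump probability β can be sketched for a single network as follows. The function name and the data layout are hypothetical. Two choices are assumptions not settled by the text: the β/n term is added for every ordered pair of users, including pairs without an edge, and a user with no outgoing edges contributes only the β/n term.

```python
def network_transition(weights, users, lam, beta):
    """Transition probabilities for one relation network in one topic space.

    weights: dict (source, target) -> edge weight w(source, target)
    users:   list of all nodes in this network (n = len(users))
    lam:     stay probability of this network
    beta:    probability of jumping to a random node
    Returns a dict (source, target) -> transition probability.
    """
    n = len(users)
    out_sum = {user: 0.0 for user in users}
    for (source, _), weight in weights.items():
        out_sum[source] += weight
    transition = {}
    for source in users:
        for target in users:
            weight = weights.get((source, target), 0.0)
            # Random-walk part: normalized edge weight, 0 if there is no edge.
            walk = weight / out_sum[source] if out_sum[source] > 0 else 0.0
            transition[(source, target)] = lam * (1 - beta) * walk + beta / n
    return transition
```

Calling the function once per network with λ₁ to λ₄ and the weights w_a to w_d yields the four sets of transition probabilities; setting beta to 0 gives the first embodiment, which uses the stay probability alone.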

Let r^i(u) be the ranking score of user u in the i-th topic space. Considering the random walk of users in the 4 networks together, the ranking score of user u in the i-th topic space is defined as follows in one embodiment:

r^{i}(u) = \sum_{(u_{t}^{i}, u) \in E_{Retweet}^{i}} B_{a}^{i}(u | u_{t}^{i}) r^{i}(u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Reply}^{i}} B_{b}^{i}(u | u_{t}^{i}) r^{i}(u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Copy}^{i}} B_{c}^{i}(u | u_{t}^{i}) r^{i}(u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Read}^{i}} B_{d}^{i}(u | u_{t}^{i}) r^{i}(u_{t}^{i}).

That is, the ranking score of a user is largely determined by the probability that followers randomly access the user.

Rearranging the above formula, the ranking score of user u in the i-th topic space can be written as:

r^{i}(u) = \left( \sum_{(u_{t}^{i}, u) \in E_{Retweet}^{i}} B_{a}^{i}(u | u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Reply}^{i}} B_{b}^{i}(u | u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Copy}^{i}} B_{c}^{i}(u | u_{t}^{i}) + \sum_{(u_{t}^{i}, u) \in E_{Read}^{i}} B_{d}^{i}(u | u_{t}^{i}) \right) \times r^{i}(u_{t}^{i}).

Let M be the transition probability matrix of the fused multi-relation network random walk model; the multi-relation network random walk iteration model is then:

π_{i}^{t + 1} = M π_{i}^{t}

where π_i^t is the user ranking score vector in the i-th topic space at the t-th iteration.

As can be seen from the above formula, the random walk iteration model of the multi-relation network is an ergodic Markov process. Thus, an initial vector π_i^0 can be given (for example, each node in the network may initially have the same ranking score), and then, through n iterations, the computation results gradually converge. The iteration may be stopped when a certain stop condition is met.
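The iteration can be sketched as a plain power iteration. The function name, the nested-list matrix layout (entry M[u][v] is the probability of moving from node v to node u), the L1-change stop condition and the renormalization after each step are all assumptions made for illustration; the text only states that the iteration stops when a stop condition is met.

```python
def rank_scores(M, tol=1e-10, max_iter=1000):
    """Iterate pi <- M * pi from a uniform start until the scores stabilize.

    M: square nested list; M[u][v] is the transition probability from v to u.
    Returns the list of ranking scores, normalized to sum to 1.
    """
    n = len(M)
    pi = [1.0 / n] * n  # every node starts with the same ranking score
    for _ in range(max_iter):
        nxt = [sum(M[u][v] * pi[v] for v in range(n)) for u in range(n)]
        total = sum(nxt)
        if total > 0:
            # Assumption: renormalize so the scores stay comparable even if
            # the columns of M do not sum exactly to 1.
            nxt = [x / total for x in nxt]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi
```

For the two-node matrix [[0.5, 0.2], [0.5, 0.8]] the scores converge to 2/7 and 5/7, so the second node, which receives most of the transition probability, ranks higher.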

Results of the experiment

The effect of the method of the present invention will be described below with reference to the experimental results.

The experiment used 261,954 Chinese users on Twitter, of whom 103,836 (39.6%) posted tweets during the three months from 2011-04-15 to 2011-07-15, yielding 2,660,281 tweets.

To verify the effectiveness of the method, the accuracy of influential-individual mining is evaluated. Since the truly influential individuals on Twitter are difficult to determine manually, cross-validation among multiple algorithms is relied upon to determine the accuracy of each algorithm. The experiment considers 6 influence individual mining algorithms:

1) in the social network formed by forwarding relations, individual influence in the forwarding network is found by means of the PageRank algorithm, and this algorithm is defined as ReposTransnk;

2) finding individual influence in the reply network by means of a PageRank algorithm, and defining the algorithm as ReplyRank;

3) the TwitterRank algorithm proposed by Weng et al.;

4) measuring the individual influence by depending on the number of followers, and defining the algorithm as FollowerNum;

5) measuring the individual influence by depending on the number of posts, and defining the algorithm as TweetNum;

6) the random walk model in the multi-relation network provided by the invention, and this algorithm is defined as MultiRank (three different parameter settings of the MultiRank algorithm are shown in fig. 5 and 6, named MultiRank1, MultiRank2 and MultiRank3 respectively, to further measure the influence of different parameter settings on the MultiRank algorithm).

The experiment uses a cross-validation method, i.e. a result that N algorithms all consider correct is taken as a correct reference result. For example, given 4 algorithms A, B, C, D whose Top-K high-influence individual sets are I_A, I_B, I_C and I_D respectively, and assuming that a result considered correct by 2 algorithms is taken as a correct reference result, the reference standard set of influential individuals is defined as:

I_2 = (I_A ∩ I_B) ∪ (I_A ∩ I_C) ∪ (I_A ∩ I_D) ∪ (I_B ∩ I_C) ∪ (I_B ∩ I_D) ∪ (I_C ∩ I_D)

The accuracy P (precision) reflects how many of the individuals found by an algorithm are truly influential on Twitter; the influential-individual discovery accuracy of algorithm A is defined as follows:

P_{A} = \frac{| I_{A} \cap I_{2} |}{| I_{A} |}
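The reference set and the accuracy can be sketched with set operations. The function names are hypothetical, and the generalization from 2 to N agreeing algorithms (the union of the intersections over all N-element combinations of algorithms) is inferred from the 4-algorithm example rather than stated as a general formula in the text.

```python
from itertools import combinations

def reference_set(top_sets, n_agree):
    """Union of the intersections over every combination of n_agree algorithms.

    top_sets: dict algorithm name -> set of Top-K individuals it returned
    n_agree:  number of algorithms that must agree on an individual (N)
    """
    reference = set()
    for combo in combinations(top_sets.values(), n_agree):
        reference |= set.intersection(*combo)
    return reference

def precision(own_set, reference):
    """Fraction of an algorithm's Top-K individuals found in the reference set."""
    if not own_set:
        return 0.0
    return len(own_set & reference) / len(own_set)
```

For example, with I_A = {1, 2, 3, 4}, I_B = {2, 3, 5}, I_C = {3, 4, 6} and I_D = {7}, the reference set for N = 2 is {2, 3, 4}, so algorithm A has an accuracy of 3/4 and algorithm D an accuracy of 0.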

In the experiment, the Top-10, 20, 50, 100, 200 and 500 influential individuals in each topic are obtained with each of the above algorithms.

For N =2, 3, 4, 5, 6, 7, the average accuracy distribution of Top-10, 20, 50, 100, 200, 500 influence individual mining algorithms in each topic category is shown in fig. 5, where the ordinate represents accuracy and the abscissa represents each topic.

As can be seen from the experimental results shown in fig. 5, for N = 2, 3, 4, 5, 6, 7, when the second group of parameters is selected for the MultiRank algorithm proposed by the present invention (MultiRank2 in fig. 5), the accuracy is higher across the 10 topic categories. The parameters of MultiRank2 are chosen to balance the effect of network size on the influence ranking score r^i(u). The parameters of MultiRank1 and MultiRank3 are chosen so that, respectively, the reading network ranking score dominates r^i(u), and the forwarding, reply and copy network ranking scores dominate r^i(u). In the multi-relation network, MultiRank1 and MultiRank3 are therefore each affected by network size, which results in a relative reduction of accuracy.

The experimental results also show that, in each topic, the accuracy decreases as the number N of reference standards in the cross validation increases: a larger N leaves fewer elements in the reference set I_N formed from the intersections of multiple reference standards, so the intersection of any algorithm's result set I_A with I_N has fewer elements and the accuracy falls. The results further show that when N = 3 or 4, the accuracy of the algorithms is relatively well differentiated and the experimental effect is best. If N is set too low (N = 2), the reference set I_N has many elements, so the intersections of MultiRank1, MultiRank2 and MultiRank3 with I_N are essentially the same and the accuracies of these 3 algorithms are poorly differentiated. If N is set too high (N = 5, 6, 7), I_N has few elements, the intersections of the algorithms with I_N are again essentially the same, and the accuracy differences are small.

Then, for N =2, 3, 4, 5, 6, 7, the distribution of the average accuracy of 10 topic categories at Top-10, 20, 50, 100, 200, 500 is shown in fig. 6.

As can be seen from the experimental results shown in fig. 6, the MultiRank algorithm provided by the present invention has higher accuracy across the 6 groups of Top-K influence individual mining, and when the number of reference standards is N = 3 or 4, the accuracy of the algorithms is better differentiated and the experimental effect is best. The results also show that in Top-K influence individual mining the accuracy rises as K increases: a larger number of influential individuals increases the number of elements shared by the cross-validated algorithms, which raises the accuracy.

While the preferred embodiments of the present invention have been described above, it should be noted that the preferred embodiments are described in detail with respect to microblog applications, but those skilled in the art will appreciate that the methods described herein may be applied to other applications besides microblog applications, and that the corresponding users may issue other messages, such as text, video, audio, and the like, and any combination thereof, and are not limited to blog messages. In addition, the specific algorithms, formulas, parameter settings, etc. mentioned in the above detailed description of the present application are for illustration only and are not intended to limit the present invention. Those skilled in the art can make appropriate changes and substitutions to the above algorithms, formulas, parameters, etc. without departing from the spirit and scope of the present invention.

Claims

1. A method for determining whether a duplication relationship exists between messages of network users, the method comprising:

obtaining a time probability distribution obeyed by the time interval between two texts having an explicit forwarding relationship;

inferring, based on the time probability distribution, a time probability distribution obeyed by the time interval between two texts having a copy relationship;

setting, based on the inferred time probability distribution, a range that the time interval between two texts having a copy relationship should satisfy;

calculating the similarity of any two texts whose time interval is within the range; and

determining whether a copy relationship exists between the two texts based on the similarity.

2. The method of claim 1, wherein the inferring, based on the time probability distribution, of a time probability distribution obeyed by the time interval between two texts having a copy relationship comprises: inferring that the time probability distribution obeyed by the time interval between two texts having a copy relationship is the same as the time probability distribution obeyed by the time interval between two texts having an explicit forwarding relationship.

3. A topic influence individual mining method based on a multi-relation network comprises the following steps:

extracting the forwarding relation among users to construct a forwarding relation network, and calculating the transition probability of one user randomly forwarding the text of another user in the forwarding relation network;

extracting the reply relationship among the users to construct a reply relationship network, and calculating the transition probability of one user randomly replying the text of another user in the reply relationship network;

extracting the copy relationship between users to construct a copy relationship network according to the messages having the copy relationship determined by the method of claim 1 or 2, and calculating a transition probability that one user randomly copies the message of another user in the copy relationship network;

extracting the reading relation among the users to construct a reading relation network, and calculating the transition probability of one user randomly reading the text of the other user in the reading relation network;

and comprehensively considering the transition probabilities in the four networks to calculate the probability of random access of any user by other users.

4. The method according to claim 3, wherein the reading relationship between the users is extracted based on any one or more of a posting time series pattern similarity between the users, the number of texts, and an interest similarity between the users.

5. The method of claim 3, wherein the transition probability of the one user randomly forwarding the text of the other user in the forwarding relationship network is calculated based on the number of times the one user forwards the text of the other user.

6. The method according to claim 3, wherein a transition probability that the one user randomly replies to the texts of another user in the reply relationship network is calculated based on the number of times the one user replies to the texts of other users.

7. The method of claim 3, wherein the transition probability of one user randomly replicating the text of another user in the replication relationship network is calculated based on the time interval and similarity between the texts of two users.

8. The method according to claim 3, wherein a transition probability that one user randomly reads the texts of another user in the reading relationship network is calculated based on any one or more of a posting time series pattern similarity between two users, a number of texts, and an interest similarity between two users.

9. The method of claim 3, wherein the access comprises forwarding, replying, copying, and reading.

10. The method of claim 3, wherein the text is a user-posted text, video, audio, or any combination thereof.