CN110287424B - Collaborative filtering recommendation method based on single-source SimRank - Google Patents

Info

Publication number
CN110287424B
Authority
CN
China
Prior art keywords
node
simrank
nodes
random walk
current sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910577524.XA
Other languages
Chinese (zh)
Other versions
CN110287424A (en)
Inventor
魏哲巍
何晓东
王涵之
萧小奎
王思博
刘钰
杜小勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China
Priority to CN201910577524.XA
Publication of CN110287424A
Application granted
Publication of CN110287424B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a collaborative filtering recommendation method based on single-source SimRank, which comprises the following steps: converting the user to be recommended, the users, and the relationships among the users into a graph structure; estimating the SimRank similarity between nodes u and v according to the probability π_l(u, w) that a random walk starting from the source node u reaches a node w in l steps and stops there, the probability η(w) that two random walks starting from the node w never meet again during the walk, and the probability π_l(v, w) that a reverse walk starting from the node w reaches the node v in l steps and stops there, and repeating the estimation until it has been completed between the source node u and every node in the graph structure; finding, according to the estimation results, the top k nodes with the highest similarity to the node to be recommended; and acquiring the behavior information of those k nodes, integrating it, and pushing it to the source node u. The collaborative filtering recommendation method based on single-source SimRank provided by the embodiment reduces time complexity and meets the requirements of real-time recommendation and interactive query.

Description

Collaborative filtering recommendation method based on single-source SimRank
Technical Field
The invention relates to collaborative filtering recommendation, in particular to a collaborative filtering recommendation method based on single-source SimRank.
Background
Recommendation systems are a core technology in fields such as e-commerce, social network analysis, personalized advertising, and user interest recommendation, and their importance has grown steadily with the development of the internet.
A recommendation system is a personalized information system that analyzes users' interests and needs from existing historical data and recommends information, products, services, and the like that may interest them. According to the algorithm used, recommendation can be divided into three types: collaborative-filtering-based recommendation, content-based recommendation, and hybrid recommendation. Collaborative-filtering-based recommendation does not need feature data about users or items in advance; it models users only from their historical behavior, is highly portable, and is widely used. Specifically, in a recommendation system oriented to a social network, users who are highly similar to a specified user can be recommended to that user according to existing friend relationships, and items liked by those similar users can be recommended to the target user, which broadens the application scenarios of the recommendation system and improves its robustness and portability.
Executing a collaborative filtering recommendation method requires calculating and comparing the similarity of graph nodes. How to define the similarity of graph nodes accurately and calculate it efficiently is a problem that researchers continue to explore. To make the concrete problem easier to abstract and define, the social network is converted into a graph structure G = (V, E) in the sense of graph theory, where V and E denote the sets of all nodes and all edges on the graph structure, corresponding respectively to the users and to the friend relationships between users in the actual social network. The practical problem of finding similar users on a social network thus becomes the abstract problem of computing the similarity between nodes on a graph structure.
Among node similarity measures, SimRank is an important one in the field of graph node similarity calculation. Its definition is based entirely on the graph structure and does not depend on any additional features, and through its iterative form it incorporates the neighborhood information of a node's multi-hop neighbors, which has attracted considerable attention. The basic definition of SimRank is as follows, where I(u) and I(v) denote the in-neighbor sets of the nodes u and v and c is an attenuation coefficient:

$$
s(u, v) =
\begin{cases}
1, & u = v \\
\dfrac{c}{|I(u)|\,|I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s(x, y), & u \neq v
\end{cases}
$$

Based on this, the inventors of the present application found that, because SimRank is defined iteratively, directly calculating the SimRank of a single source node requires a time complexity of O(t·d²·n²) and a space complexity of O(n²), where t denotes the number of iterations, d denotes the average degree of the nodes on the graph, and n denotes the number of graph nodes.
With the advent of the big data era, the graph structures generated in real life are becoming larger and larger; examples include the social networks formed by the hundreds of millions of users of WeChat and Twitter. The time complexity of calculating SimRank from its original definition is too high, so the single-source SimRank result on a large graph is difficult to compute within a practical time, and the requirements of real-time recommendation and interactive query cannot be met.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a collaborative filtering recommendation method based on single-source SimRank, which can calculate the result of the single-source SimRank of a large graph in effective time and meet the requirements of real-time recommendation and interactive query.
To achieve this purpose, the invention provides a collaborative filtering recommendation method based on single-source SimRank, comprising the following steps: converting the user to be recommended, the users, and the relationships among the users into a graph structure, wherein the graph structure comprises a plurality of nodes corresponding to the users and a plurality of edges corresponding to the relationships among the users, the edges comprise out-edges and in-edges, and the user to be recommended is the source node u in the graph structure; estimating the SimRank similarity s(u, v) between a node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches a node w in l steps and stops there, the probability η(w) that two random walks starting from the node w never meet again during the walk, and the probability π_l(v, w) that a reverse walk starting from the node w reaches the node v in l steps and stops there, wherein the random walk moves to a randomly chosen in-neighbor of the current node with probability √c and stops at the current node with probability 1 − √c, c is an attenuation coefficient with 0 ≤ c ≤ 1, and v is any node in the graph structure; repeating the estimation of the SimRank similarity until it has been completed between the source node u and every node in the graph structure; finding, according to the estimation results of the SimRank similarity, the top k nodes with the highest similarity to the node to be recommended; and acquiring the behavior information of those k nodes, integrating it, and pushing it to the source node u.
In a preferred embodiment, estimating the SimRank similarity s(u, v) between the node v and the source node u according to π_l(u, w), η(w), and π_l(v, w) comprises: calculating an estimate of π_l(u, w) · η(w); judging whether the node w is a core node; if so, acquiring the estimate of π_l(v, w) for the core node from a pre-stored index table; if not, dynamically calculating the estimate of π_l(v, w) by a backward walk, wherein a backward walk moves from the current node to an out-neighbor of the current node according to a certain probability distribution; and estimating the SimRank similarity s(u, v) between the node v and the source node u according to formula one:
Figure BDA0002112398160000033
In a preferred embodiment, judging whether the node w is a core node comprises: determining the core nodes in the graph structure according to the PageRank of the nodes in the graph structure, wherein the core nodes are the top j_0 nodes when the nodes are sorted by reverse PageRank in descending order, j_0 is an integer with 0 ≤ j_0 ≤ n, and n is the number of nodes in the graph structure; judging whether a preset condition is met; if the preset condition is not met, starting from the core node and performing a random walk along the out-edge direction; and if the preset condition is met, stopping the random walk and determining the estimate of π_l(v, w).
In a preferred embodiment, performing the random walk along the out-edge direction comprises: during the random walk, updating, according to formula two, the residual value r_{l+1}(v', w) of the reverse random walk that starts from the node w and reaches the node v' in l + 1 steps, where formula two is:
Figure BDA0002112398160000041
where d_in(v') denotes the in-degree of the node v', v' is a randomly selected out-neighbor of the node v, and r_l(v, w) is the residual value of the reverse random walk that starts from the node w and reaches the node v in l steps; and updating, according to formula three, the value π_l(v, w) corresponding to the node v reached at the l-th step of the random walk, where formula three is:
Figure BDA0002112398160000042
In a preferred embodiment, the preset condition is that the residual values r_l(v, w) of all nodes v on the graph structure reached by l steps of the random walk are all less than a preset value r_max.
In a preferred embodiment, stopping the random walk and determining the value of π_l(v, w) when the preset condition is met comprises: stopping the random walk when the preset condition is met; and storing the values of π_l(v, w), as updated at the time the random walk stops, in an index table.
In a preferred embodiment, calculating the estimate of π_l(u, w) · η(w) comprises: acquiring a first current sampling count; at the first current sampling count, generating one √c-walk starting from the source node u, the walk reaching a node w in l steps and stopping at w; generating two random walks W_1(w) and W_2(w) starting from the node w, the result of this sample being 1 if W_1(w) and W_2(w) do not meet during the walk and 0 otherwise; judging whether the first current sampling count has reached a first preset sampling value; if it has not, repeating the steps of acquiring the first current sampling count, generating one √c-walk from the source node u that reaches a node w in l steps and stops at w, generating two random walks from w and judging whether they meet, and updating the first current sampling count, until the first current sampling count reaches the first preset sampling value; and if it has, taking the mean of all sampling results as the estimate of π_l(u, w) · η(w).
In a preferred embodiment, dynamically calculating the estimate of π_l(v, w) by the backward walk comprises: acquiring a second current sampling count; at the second current sampling count, calculating π_{i+1}(v', w) for the reached node v' according to formula four, where formula four is:
Figure BDA0002112398160000051
where v denotes the node reached when the number of walk steps is i, v' denotes the node reached when the number of walk steps is i + 1, d_in(v) denotes the in-degree of the node v, and i is an integer with 0 ≤ i ≤ l; continuing the random walk and applying formula four until the number of walk steps reaches the required l, thereby obtaining the estimate of π_l(v, w) at the second current sampling count; judging whether the second current sampling count has reached a second preset sampling value; if it has not, repeating the steps of acquiring the second current sampling count, calculating π_{i+1}(v', w) and obtaining the estimate of π_l(v, w) at the second current sampling count, and updating the second current sampling count, until the second current sampling count reaches the second preset sampling value; and if it has, taking the mean of all sampling results as the estimate of π_l(v, w).
In a preferred embodiment, the first preset sampling value and/or the second preset sampling value is d_r, where
Figure BDA0002112398160000052
and ε is a preset error.
In a preferred embodiment, after the SimRank similarity s(u, v) between the node v and the source node u is estimated according to formula one, the method further comprises: repeating the calculation of the estimate of the SimRank similarity between u and v for f_r rounds, where
Figure BDA0002112398160000053
and taking the median of the estimates from the f_r rounds as the estimate s(u, v) of the SimRank similarity between the node v and the source node u.
Compared with the prior art, the collaborative filtering recommendation method based on the single-source SimRank can obtain the first k nodes with the highest SimRank similarity between the source node and all other nodes on the graph structure in effective time on the premise of ensuring the accuracy of the calculation result, so that k users most similar to the user to be recommended are correspondingly obtained, the time complexity is reduced, and the requirements of real-time recommendation and interactive query are met.
Drawings
Fig. 1 is a flowchart of a collaborative filtering recommendation method based on a single-source SimRank according to an embodiment of the present invention.
Fig. 2 is a flowchart of estimation of SimRank similarity between nodes u, v according to an embodiment of the present invention.
Detailed Description
The following detailed description of the present invention is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.
Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.
SimRank is an important algorithm in the field of graph node similarity calculation. It was originally proposed in 2002 by Glen Jeh and Jennifer Widom, and its core ideas can be summarized in the following two points:
a. A node has the greatest similarity to itself.
b. If two nodes are similar, their respective neighbor nodes are also similar to some degree.
It is specifically defined as follows:

$$
s(u, v) =
\begin{cases}
1, & u = v \\
\dfrac{c}{|I(u)|\,|I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s(x, y), & u \neq v
\end{cases}
$$

where u and v denote two nodes, s(u, v) denotes the SimRank similarity between the nodes u and v, and c is a real number between 0 and 1, also called the attenuation coefficient. By definition, the similarity of two different nodes u and v depends on the similarity between their in-neighbors. In particular, if |I(u)| = 0 or |I(v)| = 0, then s(u, v) = 0, where I(u) and I(v) denote the in-neighbor sets of the nodes u and v.
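The recursive definition can be made concrete with a naive fixed-point iteration. The following is only a minimal sketch for a tiny graph, not the method of the invention; the function name `simrank`, the dictionary representation, and the default values of `c` and `iters` are assumptions chosen for illustration.

```python
from itertools import product

def simrank(in_nbrs, c=0.6, iters=10):
    """Naive iterative SimRank (illustrative sketch).

    in_nbrs maps every node to the list of its in-neighbors I(node).
    """
    nodes = list(in_nbrs)
    # Start from s(a, a) = 1 and s(a, b) = 0 for a != b.
    s = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iters):
        nxt = {}
        for a, b in product(nodes, nodes):
            if a == b:
                nxt[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                nxt[(a, b)] = 0.0  # |I(a)| = 0 or |I(b)| = 0
            else:
                total = sum(s[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                nxt[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        s = nxt
    return s

# Toy graph with edges a -> b and a -> c: b and c share the in-neighbor a.
toy = {"a": [], "b": ["a"], "c": ["a"]}
scores = simrank(toy, c=0.6)
```

Every iteration touches all node pairs and all pairs of their in-neighbors, which is why this direct approach does not scale to large graphs.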
Among SimRank computation problems, the problem of returning, for a given source node, the SimRank values between the source node and all other nodes on the graph is called the single-source SimRank problem. Here a completely accurate SimRank result is not required; it is only expected that the absolute error between the SimRank value calculated by the algorithm and the true SimRank value of the corresponding node pair does not exceed a given error parameter ε. Specifically, given nodes u and v on the graph, the estimate $\hat{s}(u, v)$ of the SimRank between any node v on the graph and the source node u should, with probability 1 − δ, satisfy the following with respect to the true value s(u, v):

$$
|\hat{s}(u, v) - s(u, v)| \le \varepsilon
$$
Single-source SimRank has important applications in fields such as web mining, social network analysis, and spam detection. However, because the original definition of SimRank is iterative, the definition-based computation method requires a time complexity of O(t·d²·n²) and a space complexity of O(n²), where t denotes the number of iterations, d denotes the average degree of the nodes on the graph, and n denotes the number of nodes on the graph; such high time and space complexity can hardly meet the needs of practical applications. Since SimRank was proposed, many researchers have optimized algorithms for single-source SimRank, but at present no algorithm can complete the single-source SimRank computation in sub-linear time. On large-scale graphs the number of nodes is often of a very high order of magnitude, and an algorithm whose query time is linear in the number of nodes cannot meet the requirements of practical scenarios such as real-time analysis and interactive query, which limits the range of application of SimRank.
As shown in fig. 1, the flowchart of the collaborative filtering recommendation method based on the single source SimRank according to the preferred embodiment of the present invention includes steps S1-S5.
Step S1, converting the user to be recommended, the users, and the relationships among the users into a graph structure, wherein the graph structure comprises a plurality of nodes corresponding to the users and a plurality of edges corresponding to the relationships among the users, the edges comprise out-edges and in-edges, and the user to be recommended is the source node u in the graph structure.
The users in this embodiment are all registered users on the platform, and the relationship between users may specifically be a follow relationship. An example is all registered users on Facebook together with their friendship network.
Specifically, for social networks with follow relationships, such as Weibo, Facebook, and Instagram, the users of the social network correspond to nodes on the graph structure, and the follow relationships among the users correspond to edges on the graph structure. If user A follows user B, a directed edge from the node of user B to the node of user A is established on the graph structure (B → A). Here B is an in-neighbor of A, A is an out-neighbor of B, and this edge is an out-edge of node B and an in-edge of node A. The number of out-edges of a node is called its out-degree, and the number of in-edges of a node is called its in-degree.
For social networks with friend relationships, such as WeChat and QQ, the users of the social network correspond to graph nodes, and the friend relationships correspond to edges on the graph structure. Specifically, if user A and user B are friends of each other, a directed edge from the node of user A to the node of user B and a directed edge from the node of user B to the node of user A are both established on the graph structure.
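The two conventions above (a follow relationship yields one directed edge, a friendship yields two) can be sketched as follows. This is an illustrative sketch only; the function name `build_graph` and the input formats (lists of user-id pairs) are assumptions, not part of the described method.

```python
from collections import defaultdict

def build_graph(follows=(), friendships=()):
    """Build out-neighbor and in-neighbor sets from social relationships.

    follows:     iterable of (follower, followee) pairs
    friendships: iterable of (a, b) pairs of mutual friends
    """
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)

    def add_edge(src, dst):
        out_nbrs[src].add(dst)  # the edge is an out-edge of src
        in_nbrs[dst].add(src)   # and an in-edge of dst

    for follower, followee in follows:
        # "A follows B" becomes the directed edge B -> A.
        add_edge(followee, follower)
    for a, b in friendships:
        # A friendship becomes two directed edges, one in each direction.
        add_edge(a, b)
        add_edge(b, a)
    return out_nbrs, in_nbrs

out_nbrs, in_nbrs = build_graph(follows=[("A", "B")], friendships=[("C", "D")])
in_degree_of_A = len(in_nbrs["A"])    # number of in-edges of A
out_degree_of_B = len(out_nbrs["B"])  # number of out-edges of B
```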
Step S2, estimating the SimRank similarity between a node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches a node w in l steps and stops there, the probability η(w) that two random walks starting from the node w never meet again during the walk, and the probability π_l(v, w) that a reverse walk starting from the node w reaches the node v in l steps and stops there.
The random walk in this embodiment proceeds as follows: with probability √c it moves to a randomly chosen in-neighbor of the current node, an in-neighbor being a node reached from the current node by following an in-edge, and with probability 1 − √c it stops at the current node. Such a walk is also called a √c-walk. Here c is the attenuation coefficient, a real number with 0 ≤ c ≤ 1 that may take values such as 0.6 or 0.8, and v is any node in the graph structure.
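A single walk of this kind can be sampled as in the sketch below. The function name `sqrt_c_walk` and the handling of nodes without in-neighbors (the walk simply stops there) are assumptions made for illustration; the text does not specify that case.

```python
import math
import random

def sqrt_c_walk(in_nbrs, start, c=0.6, rng=random):
    """Sample one walk and return (node where it stops, number of steps)."""
    p_continue = math.sqrt(c)
    node, steps = start, 0
    # Continue with probability sqrt(c); otherwise stop at the current node.
    while rng.random() < p_continue:
        nbrs = in_nbrs.get(node, [])
        if not nbrs:
            break  # assumption: a node with no in-neighbors ends the walk
        node = rng.choice(nbrs)  # uniformly chosen in-neighbor
        steps += 1
    return node, steps

# Chain c -> b -> a, so the in-neighbors are a: [b], b: [c], c: [].
chain = {"a": ["b"], "b": ["c"], "c": []}
stop_node, walk_len = sqrt_c_walk(chain, "a", c=0.6, rng=random.Random(1))
```

Counting how often walks from u stop at w after exactly l steps gives an empirical estimate of π_l(u, w).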
Step S3, repeating the estimation of the SimRank similarity until it has been completed between the source node u and every node in the graph structure.
Specifically, this can be implemented by letting the node v traverse all nodes on the graph.
Step S4, finding, according to the estimation results of the SimRank similarity, the top k nodes with the highest similarity to the node to be recommended.
Step S5, acquiring the behavior information of those k nodes, integrating it, and pushing it to the source node u.
Therefore, the collaborative filtering recommendation method based on the single-source SimRank provided by the embodiment can obtain the first k nodes with the highest SimRank similarity between the source node and all other nodes on the graph structure in the effective time on the premise of ensuring the accuracy of the calculation result, so that the k users most similar to the user to be recommended are correspondingly obtained, the time complexity is reduced, and the requirements of real-time recommendation and interactive query are met.
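Steps S4 and S5 can be sketched as a top-k selection followed by an aggregation of what the selected users did. The text does not say how the behavior information is integrated, so the similarity-weighted item count below is purely an assumption, as are the names `recommend`, `scores`, and `behaviors`.

```python
import heapq
from collections import Counter

def recommend(scores, behaviors, source, k=2, top_items=3):
    """Pick the k nodes most similar to `source` and aggregate their items.

    scores:    node -> estimated SimRank similarity to the source node
    behaviors: node -> list of items the user interacted with (hypothetical)
    """
    candidates = {v: s for v, s in scores.items() if v != source}
    top_k = heapq.nlargest(k, candidates, key=candidates.get)
    already_seen = set(behaviors.get(source, []))
    weight = Counter()
    for v in top_k:
        for item in behaviors.get(v, []):
            if item not in already_seen:
                weight[item] += candidates[v]  # assumption: weight by similarity
    return top_k, [item for item, _ in weight.most_common(top_items)]

scores = {"u": 1.0, "a": 0.5, "b": 0.3, "c": 0.1}
behaviors = {"u": ["x"], "a": ["x", "y"], "b": ["y", "z"], "c": ["q"]}
similar_users, pushed_items = recommend(scores, behaviors, "u", k=2)
```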
In an implementation manner, the present embodiment further provides a method for specifically estimating the SimRank similarity between the nodes u and v. As shown in fig. 2, a flowchart of estimation of SimRank similarity between nodes u and v according to a preferred embodiment of the present invention includes: S201-S208.
Step S201, determining the core nodes in the graph structure according to the PageRank of the nodes in the graph structure, wherein the core nodes are the top j_0 nodes when the nodes are sorted by reverse PageRank in descending order, j_0 is an integer with 0 ≤ j_0 ≤ n, and n is the number of nodes in the graph structure.
The reverse PageRank is the PageRank of each node computed after reversing the direction of every edge on the graph.
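One way to pick core nodes is to run an ordinary PageRank power iteration on the edge-reversed graph and keep the top j_0 nodes. The sketch below assumes a conventional damping factor of 0.85 and a uniform treatment of dangling nodes; neither is specified in the text, and the function names are illustrative.

```python
def reverse_pagerank(out_nbrs, damping=0.85, iters=50):
    """PageRank on the graph with every edge reversed (illustrative sketch).

    out_nbrs maps every node of the ORIGINAL graph to its out-neighbor list.
    """
    nodes = list(out_nbrs)
    n = len(nodes)
    # Reverse every edge: an original edge x -> y becomes y -> x.
    rev_out = {v: [] for v in nodes}
    for x, ys in out_nbrs.items():
        for y in ys:
            rev_out[y].append(x)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = rev_out[v] or nodes  # dangling node: spread uniformly
            share = damping * rank[v] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank

def core_nodes(out_nbrs, j0):
    """Return the j0 nodes with the largest reverse PageRank."""
    rank = reverse_pagerank(out_nbrs)
    return sorted(rank, key=rank.get, reverse=True)[:j0]

# "hub" has out-edges to a, b and c, so it collects rank in the reversed graph.
g = {"hub": ["a", "b", "c"], "a": [], "b": [], "c": []}
ranks = reverse_pagerank(g)
top = core_nodes(g, 1)
```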
Step S202, judging whether a preset condition is met, and if the preset condition is not met, starting from the core node and performing a random walk along the out-edge direction.
Step S2021, during the random walk, updating, according to formula two, the residual value r_{l+1}(v', w) of the reverse √c-walk that starts from the node w and reaches the node v' in l + 1 steps, where formula two is:
Figure BDA0002112398160000091
where d_in(v') denotes the in-degree of the node v', that is, its number of in-edges, v' is a randomly selected out-neighbor of the node v, and r_l(v, w) is the residual value of the reverse √c-walk that starts from the node w and reaches the node v in l steps; in steps S2021 and S2022, w is the core node.
Step S2022, updating, according to formula three, the value π_l(v, w) corresponding to the node v reached at the l-th step of the random walk, where formula three is:
Figure BDA0002112398160000093
The preset condition is that the residual values r_l(v, w) of all nodes v on the graph structure reached by l steps of the random walk are all less than a preset value r_max.
Step S203, if the preset condition is met, stopping the random walk and determining the estimate of π_l(v, w).
Specifically, determining the estimate of π_l(v, w) includes storing the values of π_l(v, w), as updated at the time the random walk stops, in an index table.
It should be noted that steps S201 to S203 can be carried out before the user to be recommended is determined. For example, in step S1 only the users on the platform and the relationships among them are converted into a graph structure, which comprises nodes corresponding to all users in the social network and edges corresponding to the relationships among all users, so that the computations in steps S201 to S203 can be performed.
Steps S201 to S203 are explained as follows. To increase the query speed of the algorithm, π_l(v, w) can be calculated in advance and then looked up in an index file at query time. However, storing π_l(v, w) for all nodes v and w on the graph structure requires an index space of size O(n/ε). For a small error parameter ε, O(n/ε) is significantly larger than the graph data size O(m), which hinders the use and extension of the algorithm. To reduce the index space required, only the core nodes are indexed, and π_l(v, w) for the other nodes is computed on the fly during the query. A core node here is a node with a large reverse PageRank. A parameter j_0 is set, and a backward search is used to calculate π_l(v, w) for the nodes w ranked in the top j_0 by reverse PageRank; the results are stored in the index file. In this way, some intermediate results can be calculated before the user to be recommended is known, which shortens the SimRank computation time during the query.
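The index construction for one core node w can be sketched as a level-by-level backward search. Formulas two and three are not legible in this text, so the sketch assumes the usual backward-push form: the residue passed to an out-neighbor v' is √c · r_l(v, w) / d_in(v'), and the amount kept for π_l(v, w) is (1 − √c) · r_l(v, w). The function name `backward_search`, the `max_steps` guard, and the returned dictionary layout are likewise assumptions.

```python
import math
from collections import defaultdict

def backward_search(out_nbrs, in_deg, w, c=0.6, r_max=1e-4, max_steps=100):
    """Return {(l, v): estimate of pi_l(v, w)} for one core node w (sketch)."""
    sqrt_c = math.sqrt(c)
    residue = {w: 1.0}  # r_0(w, w) = 1, every other residue starts at 0
    pi = {}
    for l in range(max_steps):
        next_residue = defaultdict(float)
        for v, r in residue.items():
            if r <= r_max:
                continue  # residues at or below r_max are not pushed further
            pi[(l, v)] = (1.0 - sqrt_c) * r  # assumed form of formula three
            for v2 in out_nbrs.get(v, []):
                # assumed form of formula two
                next_residue[v2] += sqrt_c * r / in_deg[v2]
        if not next_residue:
            break  # stopping condition: no residue above r_max remains
        residue = next_residue
    return pi

# Toy graph with the single edge w -> x.
out_nbrs = {"w": ["x"], "x": []}
in_deg = {"w": 0, "x": 1}
index_for_w = backward_search(out_nbrs, in_deg, "w", c=0.64)
```

In the toy graph, a walk from x moves to its only in-neighbor w with probability 0.8 and then stops with probability 0.2, which matches the stored value for (1, "x").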
Step S204, calculating the estimate of π_l(u, w) · η(w), which may specifically include steps S2041 to S2046.
Step S2041, acquiring a first current sampling count.
Step S2042, at the first current sampling count, generating one √c-walk starting from the source node u, the walk reaching a node w in l steps and stopping at w.
Step S2043, generating two random walks W_1(w) and W_2(w) starting from the node w; if W_1(w) and W_2(w) do not meet during the walk, the result of this sample is 1, otherwise it is 0.
Step S2044, judging whether the first current sampling count has reached a first preset sampling value.
Step S2045, if the first preset sampling value has not been reached, repeating the steps of acquiring the first current sampling count, generating one √c-walk from the source node u that reaches a node w in l steps and stops at w, generating two random walks from w and judging whether they meet, and updating the first current sampling count, until the first current sampling count reaches the first preset sampling value.
Step S2046, if the first preset sampling value has been reached, taking the mean of all sampling results as the estimate of π_l(u, w) · η(w).
The first preset sampling value may be d_r, where
Figure BDA0002112398160000111
and ε is a preset error.
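The sampling loop of steps S2041 to S2046 can be sketched as follows. Several details are assumptions made for illustration: two walks are taken to "meet" when both are still walking and land on the same node at the same step, a walk stops at a node without in-neighbors, and the names `estimate_pi_eta` and `num_samples` (standing in for the first preset sampling value) are hypothetical.

```python
import math
import random
from collections import Counter

def walk_step(in_nbrs, node, p_continue, rng):
    """One step of a walk: the next in-neighbor, or None if the walk stops."""
    nbrs = in_nbrs.get(node, [])
    if nbrs and rng.random() < p_continue:
        return rng.choice(nbrs)
    return None

def estimate_pi_eta(in_nbrs, u, c=0.6, num_samples=1000, seed=0):
    """Return {(l, w): estimate of pi_l(u, w) * eta(w)} by sampling."""
    rng = random.Random(seed)
    p = math.sqrt(c)
    hits = Counter()
    for _ in range(num_samples):
        # Walk from u until it stops at some node w after l steps.
        w, l = u, 0
        nxt = walk_step(in_nbrs, w, p, rng)
        while nxt is not None:
            w, l = nxt, l + 1
            nxt = walk_step(in_nbrs, w, p, rng)
        # Two further walks from w; the sample counts 1 if they never meet.
        a = b = w
        met = False
        while True:
            a = walk_step(in_nbrs, a, p, rng)
            b = walk_step(in_nbrs, b, p, rng)
            if a is None or b is None:
                break  # one walk has stopped, so they can no longer meet
            if a == b:
                met = True
                break
        if not met:
            hits[(l, w)] += 1
    # Mean of the 0/1 sample results for each (l, w).
    return {key: count / num_samples for key, count in hits.items()}

toy = {"a": ["b", "c"], "b": ["c"], "c": []}
estimates = estimate_pi_eta(toy, "a", c=0.6, num_samples=500, seed=7)
```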
Step S205, judge whether node w is the core node;
step S206, if yes, acquiring pi of the core node from a prestored index tablel(v, w) an estimate of;
step S207, if not, dynamically calculating pi by a backward migration methodl(v, w), wherein the backward walking refers to walking from the current node to the exit neighbor of the current node with a certain probability distribution;
step S2071, acquiring a second current sampling frequency;
step S2072, in the secondUnder the current sampling times, calculating pi of the arriving node v' according to the formula IVi+1(v', w), the formula four is:
Figure BDA0002112398160000112
where v is the node reached after i steps of the walk, v' is the node reached after i+1 steps, d_in(v) is the in-degree of node v, and i is an integer with 0 ≤ i ≤ l;
Step S2073, continuing the random walk and applying formula four at each step until the number of steps reaches the required l, thereby obtaining the estimate of π_l(v, w) under the second current sampling count;
Step S2074, determining whether the second current sampling count has reached a second preset sampling value;
Step S2075, if the second preset sampling value has not been reached, repeating the sampling: computing π_{i+1}(v', w) and obtaining the estimate of π_l(v, w) under the second current sampling count, then updating the second current sampling count, until the second current sampling count reaches the second preset sampling value;
Step S2076, if the second preset sampling value is reached, taking the mean of all sampling results as the estimate of π_l(v, w). The second preset sampling value may be d_r, where
Figure BDA0002112398160000121
Figure BDA0002112398160000122
ε is a preset error.
The explanation for steps S205-S207 is as follows:
1) The estimate of π_l(v, w) is unbiased with respect to its true value, i.e. for any
Figure BDA0002112398160000123
and l ≥ 0, we have
Figure BDA0002112398160000124
where
Figure BDA0002112398160000125
denotes the estimate of the true value π_l(v, w).
2) The time complexity of this procedure is O(n·π(w)). Note that merely outputting the values of π_l(v, w) (l = 0, 1, …, ∞; v ∈ V) already requires
Figure BDA0002112398160000126
time, so the running time of the method is the best that can be achieved. Here, π(w) denotes the reverse PageRank of node w.
Step S208, calculating the estimate s(u, v) of the SimRank similarity between u and v according to formula one:
Figure BDA0002112398160000127
Step S209, repeating the calculation of the estimate of the SimRank similarity between u and v so that f_r rounds are performed in total, where
Figure BDA0002112398160000128
Specifically, step S209 may be implemented as follows: in the current round, compute d_r sampled estimates each of π_l(v, w) and π_l(u, w)·η(w), and take the means of the d_r results as the estimates of π_l(v, w) and π_l(u, w)·η(w). Then compute s(u, v) according to formula one as this round's estimate of s(u, v). Update the current round to the current round + 1. Repeat these steps until the current round reaches f_r.
Step S210, taking the median of the estimates obtained in the f_r rounds as the estimate s(u, v) of the SimRank similarity between nodes u and v.
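Steps S209 and S210 amount to a median-of-means scheme: each round averages d_r samples, and the final answer is the median over f_r rounds. The following is a minimal Python sketch under stated assumptions, not the patented implementation. The function name `median_of_means` and the caller-supplied `sample_once` are hypothetical, and `d_r` and `f_r` are passed in directly because their formulas appear only in the figures. For brevity each round here is the mean of d_r scalar samples; in the method itself a round averages the samples of π_l(v, w) and π_l(u, w)·η(w) and combines them through formula one before the median is taken.

```python
import statistics
from typing import Callable


def median_of_means(sample_once: Callable[[], float], d_r: int, f_r: int) -> float:
    """Average d_r samples per round, then return the median of f_r round means."""
    if d_r < 1 or f_r < 1:
        raise ValueError("d_r and f_r must be positive integers")
    round_estimates = []
    for _ in range(f_r):
        # One round: the mean of d_r independent samples.
        round_mean = sum(sample_once() for _ in range(d_r)) / d_r
        round_estimates.append(round_mean)
    # The median discards the influence of a few badly off rounds.
    return statistics.median(round_estimates)


# Usage with a deterministic stand-in sampler (hypothetical data):
# the three rounds average to 2.0, 3.0 and 100.0.
values = iter([1.0, 3.0, 2.0, 4.0, 100.0, 100.0])
estimate = median_of_means(lambda: next(values), d_r=2, f_r=3)
print(estimate)  # 3.0
```

The outlying round (100.0) does not move the result, which is the reason for taking a median over rounds rather than one overall mean.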
For the calculation of the single-source SimRank, in the worst case, the time complexity of the whole method is
Figure BDA0002112398160000131
For graphs whose reverse PageRank follows a power-law distribution, the time complexity of the whole method satisfies:
Figure BDA0002112398160000132
where γ denotes the coefficient of the power-law distribution that the reverse PageRank follows.
Therefore, compared with the O(t·d²·n²) time needed to compute the SimRank of a single source node directly, the time complexity is significantly reduced.
Further, this embodiment rewrites the original definition of SimRank as a sum of products of three terms: π_l(u, w), η(w), and π_l(v, w). For the η(w) part, instead of estimating η(w) separately, the method estimates the product π_l(u, w)·η(w) as a whole, which reduces the time complexity of this part from O(n·log(n/δ)/ε²) to O(log(n/δ)/ε²). This improves the efficiency of the method and supports queries with a small error parameter ε and a large number of graph nodes n. For the π_l(v, w) part, this embodiment stores all possible π_l(v, w) only for the subset of graph nodes with larger reverse PageRank, which reduces the size of the index file while keeping the query time of the method largely unaffected.
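The joint estimation of π_l(u, w)·η(w) described above can be sketched in Python as follows. This is an illustration under assumptions that the text does not state: the walk moves to a uniformly random in-neighbor with probability √c and otherwise stops (a convention common in the SimRank literature; the patent gives its probabilities only in figures), a walk that must continue from a node with no in-neighbors is discarded, and two walks "meet" when they occupy the same node after the same number of steps. The names `sample_stop`, `walks_never_meet`, `estimate_pi_eta`, and the dictionary `in_nbrs` (node to list of in-neighbors) are hypothetical.

```python
import math
import random
from collections import Counter


def sample_stop(in_nbrs, start, sqrt_c, rng):
    """Run one walk from `start`; return (steps, node) where it stops, or None if discarded."""
    node, steps = start, 0
    while rng.random() < sqrt_c:      # continue with probability sqrt_c
        nbrs = in_nbrs.get(node, [])
        if not nbrs:                  # assumption: a walk that cannot continue is discarded
            return None
        node = rng.choice(nbrs)
        steps += 1
    return steps, node


def walks_never_meet(in_nbrs, w, sqrt_c, rng):
    """Return True if two independent walks from `w` never share a node at the same step."""
    a = b = w
    while True:
        # A later meeting is only possible if both walks continue.
        if rng.random() >= sqrt_c or rng.random() >= sqrt_c:
            return True
        nbrs_a, nbrs_b = in_nbrs.get(a, []), in_nbrs.get(b, [])
        if not nbrs_a or not nbrs_b:
            return True
        a, b = rng.choice(nbrs_a), rng.choice(nbrs_b)
        if a == b:
            return False


def estimate_pi_eta(in_nbrs, u, c, d_r, seed=None):
    """Estimate pi_l(u, w) * eta(w) for every (l, w) hit by the samples, as one quantity."""
    if not 0.0 < c < 1.0:
        raise ValueError("this sketch assumes 0 < c < 1 so that walks terminate")
    if d_r < 1:
        raise ValueError("d_r must be a positive integer")
    rng = random.Random(seed)
    sqrt_c = math.sqrt(c)
    hits = Counter()
    for _ in range(d_r):
        stop = sample_stop(in_nbrs, u, sqrt_c, rng)
        # The sample counts as 1 for key (l, w) only if the two walks from w never meet.
        if stop is not None and walks_never_meet(in_nbrs, stop[1], sqrt_c, rng):
            hits[stop] += 1
    # Mean of the 0/1 sampling results over all d_r samples.
    return {key: count / d_r for key, count in hits.items()}


# Usage on a two-node cycle (hypothetical data). Under the assumptions above,
# the expected value for key (0, "a") is (1 - sqrt(c)) * (1 - c) = 0.375.
graph = {"a": ["b"], "b": ["a"]}
estimates = estimate_pi_eta(graph, "a", c=0.25, d_r=20000, seed=1)
```

Each sample follows only three short walks, so the work per sample does not depend on the number of nodes n, which is consistent with the reduction in complexity that the paragraph above attributes to estimating the product as a whole.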
In addition, most graph structures follow a power-law distribution, and for graphs whose reverse PageRank follows a power-law distribution, computing the single-source SimRank with the method of this embodiment takes time sub-linear in the number of nodes n.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (9)

1. A collaborative filtering recommendation method based on single-source SimRank, characterized by comprising the following steps:
converting a user to be recommended, users, and the relationships among the users into a graph structure, wherein the graph structure comprises a plurality of nodes corresponding to the users and a plurality of edges corresponding to the relationships among the users, the edges comprise outgoing edges and incoming edges, and the user to be recommended is the source node u in the graph structure;
estimating the SimRank similarity s(u, v) between a node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches node w in l steps and stops there, the probability η(w) that two random walks starting from node w never meet again during their walks, and the probability π_l(v, w) that a reverse walk starting from node w reaches node v in l steps and stops there, wherein the random walk moves, with probability
Figure FDA0003068596240000011
to a randomly selected in-neighbor of the current node and, with probability
Figure FDA0003068596240000012
stops at the current node, c is an attenuation coefficient with 0 ≤ c ≤ 1, and v is any node in the graph structure;
repeatedly executing the estimation of the SimRank similarity until the estimation of the SimRank similarity between all the nodes and the source node u in the graph structure is completed;
finding, according to the SimRank similarity estimates, the top k nodes with the highest similarity to the node to be recommended;
acquiring behavior information of the top k nodes, and integrating and pushing the behavior information to the source node u;
wherein estimating the SimRank similarity s(u, v) between node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches node w in l steps and stops there, the probability η(w) that two random walks starting from node w never meet again during their walks, and the probability π_l(v, w) that a reverse walk starting from node w reaches node v in l steps and stops there, comprises:
calculating an estimate of π_l(u, w)·η(w);
determining whether node w is a core node;
if so, obtaining the estimate of π_l(v, w) for the core node from a pre-stored index table;
if not, computing the estimate of π_l(v, w) dynamically by a backward walk, wherein a backward walk moves from the current node to one of its out-neighbors according to a certain probability distribution;
estimating the SimRank similarity s(u, v) between node v and the source node u according to formula one:
Figure FDA0003068596240000021
wherein V is the set of all nodes in the graph structure.
2. The collaborative filtering recommendation method according to claim 1, wherein the determining whether the node w is a core node comprises:
determining the core nodes in the graph structure according to the PageRank of the nodes in the graph structure, wherein the core nodes are the top j_0 nodes when the nodes are sorted by reverse PageRank in descending order, j_0 is an integer with 0 ≤ j_0 ≤ n, and n is the number of nodes in the graph structure;
determining whether a preset condition is met, and if the preset condition is not met, performing a random walk starting from the core node along the out-edge direction;
if the preset condition is met, stopping the random walk and determining the estimate of π_l(v, w).
3. The collaborative filtering recommendation method of claim 2, wherein the random walk along the out-edge direction is followed by:
during the random walk, updating, according to formula two, the residual value r_{l+1}(v', w) of the reverse random walk that starts from node w and reaches node v' in l+1 steps, formula two being:
Figure FDA0003068596240000022
wherein d_in(v') is the in-degree of node v', v' is an out-neighbor of the randomly selected node v, and r_l(v, w) is the residual value of the reverse random walk that starts from node w and reaches node v in l steps;
updating, according to formula three, the value π_l(v, w) corresponding to the node v reached at step l of the random walk, formula three being:
Figure FDA0003068596240000031
4. The collaborative filtering recommendation method of claim 3, wherein the preset condition comprises:
for all nodes in the graph structure, the residual values r_l(v, w) of reaching node v by an l-step random walk are all less than a predetermined value r_max.
5. The collaborative filtering recommendation method according to claim 3, wherein stopping the random walk and determining the value of π_l(v, w) if the preset condition is met comprises:
if the preset condition is met, stopping the random walk;
storing in an index table the values of π_l(v, w) as updated when the random walk stops.
6. The collaborative filtering recommendation method of claim 1, wherein calculating the estimate of π_l(u, w)·η(w) comprises:
acquiring a first current sampling count;
under the first current sampling count, generating a random walk starting from the source node u, the random walk reaching node w in l steps and stopping at node w;
generating two random walks W_1(w) and W_2(w) starting from node w, wherein if W_1(w) and W_2(w) do not meet during their walks, the result of this sample is 1, and otherwise the result of this sample is 0;
determining whether the first current sampling count has reached a first preset sampling value;
if the first preset sampling value has not been reached, repeating the steps of acquiring the first current sampling count, generating under that count a random walk from the source node u that reaches node w in l steps and stops at node w, generating two random walks starting from node w, and determining whether the two random walks meet, and updating the first current sampling count, until the first current sampling count reaches the first preset sampling value;
if the first preset sampling value is reached, taking the mean of all sampling results as the estimate of π_l(u, w)·η(w).
7. The collaborative filtering recommendation method of claim 6, wherein dynamically computing the estimate of π_l(v, w) by a backward walk comprises:
acquiring a second current sampling count;
under the second current sampling count, computing π_{i+1}(v', w) for the reached node v' according to formula four, formula four being:
Figure FDA0003068596240000041
where v is the node reached after i steps of the walk, v' is the node reached after i+1 steps, d_in(v) is the in-degree of node v, and i is an integer with 0 ≤ i ≤ l;
continuing the random walk and applying formula four at each step until the number of steps reaches the required l, thereby obtaining the estimate of π_l(v, w) under the second current sampling count;
determining whether the second current sampling count has reached a second preset sampling value;
if the second preset sampling value has not been reached, repeating the steps of acquiring the second current sampling count, computing π_{i+1}(v', w) under that count, and obtaining the estimate of π_l(v, w) under that count, and updating the second current sampling count, until the second current sampling count reaches the second preset sampling value;
if the second preset sampling value is reached, taking the mean of all sampling results as the estimate of π_l(v, w).
8. The collaborative filtering recommendation method of claim 7, wherein the first preset sampling value and/or the second preset sampling value is d_r, where
Figure FDA0003068596240000042
ε is a preset error.
9. The collaborative filtering recommendation method of claim 8, further comprising, after estimating the SimRank similarity s(u, v) between node v and the source node u according to formula one:
repeating the calculation of the estimate of the SimRank similarity between u and v for f_r rounds, where
Figure FDA0003068596240000051
Figure FDA0003068596240000052
wherein δ is a preset parameter;
taking the median of the estimates obtained in the f_r rounds as the estimate s(u, v) of the SimRank similarity between node v and the source node u.
CN201910577524.XA 2019-06-28 2019-06-28 Collaborative filtering recommendation method based on single-source SimRank Active CN110287424B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910577524.XA CN110287424B (en) 2019-06-28 2019-06-28 Collaborative filtering recommendation method based on single-source SimRank

Publications (2)

Publication Number Publication Date
CN110287424A (en) 2019-09-27
CN110287424B (en) 2021-07-20

Family

ID=68019724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910577524.XA Active CN110287424B (en) 2019-06-28 2019-06-28 Collaborative filtering recommendation method based on single-source SimRank

Country Status (1)

Country Link
CN (1) CN110287424B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506833B (en) * 2020-06-12 2023-05-02 中国人民大学 Friend recommendation method based on single-source SimRank accurate solution
CN111984832B (en) * 2020-08-21 2023-07-07 中国人民大学 Friend recommendation method based on personalized PageRank
CN112667402B (en) * 2020-12-30 2021-09-21 深圳计算科学研究院 Method and device for distributively calculating similarity of SimRank single-source nodes

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761225A (en) * 2014-01-23 2014-04-30 天津大学 Data-driven method for calculating the semantic similarity of Chinese terms
CN105512242A (en) * 2015-11-30 2016-04-20 浙江工业大学 Parallel recommendation method based on social network structure
CN108664658A (en) * 2018-05-21 2018-10-16 南京大学 Collaborative filtering video recommendation method considering dynamic changes in user preference
CN108776816A (en) * 2018-06-04 2018-11-09 中国人民大学 TopPPR method for calculating node proximity in large graphs
CN108959370A (en) * 2018-05-23 2018-12-07 哈尔滨工业大学 Community discovery method and device based on entity similarity in a knowledge graph

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060190225A1 (en) * 2005-02-18 2006-08-24 Brand Matthew E Collaborative filtering using random walks of Markov chains
CN102779182A (en) * 2012-07-02 2012-11-14 吉林大学 Collaborative filtering recommendation method for integrating preference relationship and trust relationship
CN106445989A (en) * 2016-06-03 2017-02-22 新乡学院 Query click graph-based search recommendation model optimization
US20190114373A1 (en) * 2017-10-13 2019-04-18 Facebook, Inc. Scalable candidate selection for recommendations
CN109684520A (en) * 2019-01-07 2019-04-26 东北大学 Large-scale time sequence diagram vertex similarity calculation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
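UniWalk: Unidirectional Random Walk Based Scalable SimRank Computation over Large Graph; Junshuai Song et al.; IEEE Transactions on Knowledge and Data Engineering; 2017-11-04; Vol. 30, No. 5; pp. 992-1006 *
Semantic similarity calculation of Baidu Baike entries based on SimRank; Yin Kun et al.; Journal of Shandong University (Engineering Science); 2014-03-17; Vol. 44, No. 3; pp. 29-35 *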

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant