CN110287424B

CN110287424B - Collaborative filtering recommendation method based on single-source SimRank

Info

Publication number: CN110287424B
Application number: CN201910577524.XA
Authority: CN
Inventors: 魏哲巍; 何晓东; 王涵之; 萧小奎; 王思博; 刘钰; 杜小勇
Original assignee: Renmin University of China
Current assignee: Renmin University of China
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2021-07-20
Anticipated expiration: 2039-06-28
Also published as: CN110287424A

Abstract

The invention discloses a collaborative filtering recommendation method based on single-source SimRank, which comprises the following steps: converting the user to be recommended, the other users, and the relationships among the users into a graph structure; estimating the SimRank similarity between nodes u and v according to the probability π_l(u, w) that a random walk starting from the source node u reaches node w in l steps and stops there, the probability η(w) that two random walks starting from node w never meet again while walking, and the probability π_l(v, w) that a reverse walk starting from node w reaches node v in l steps and stops there; repeating the estimation until the similarity between the source node u and every node in the graph structure has been estimated; finding, according to the estimation results, the k nodes with the highest similarity to the node to be recommended; and acquiring the behavior information of these k nodes, then integrating it and pushing it to the source node u. The method provided by this embodiment reduces time complexity and meets the requirements of real-time recommendation and interactive query.

Description

Collaborative filtering recommendation method based on single-source SimRank

Technical Field

The invention relates to collaborative filtering recommendation, in particular to a collaborative filtering recommendation method based on single-source SimRank.

Background

Recommendation systems are a core technology in many fields, such as e-commerce, social network analysis, personalized advertising, and user interest recommendation, and their importance keeps growing as the internet develops.

A recommendation system is a personalized information system that analyzes users' interests and needs from existing historical data and recommends information, products, services, and the like that may interest them. By recommendation algorithm, such systems fall into three types: collaborative-filtering-based, content-based, and hybrid. Collaborative-filtering-based recommendation does not need feature data about users or items in advance; it models users from their historical behavior alone, is highly portable, and is widely used. In particular, in a recommendation system for a social network, users highly similar to a specified user can be recommended to that user according to the existing friend relationships, and items liked by similar users can be recommended to the target user, which broadens the application scenarios of the recommendation system and improves its robustness and portability.

Executing a collaborative filtering recommendation method requires computing and comparing the similarity of graph nodes. How to define node similarity accurately and compute it efficiently is a problem researchers continue to explore. To make the concrete problem easier to abstract and define, the social network is converted into a graph structure G = (V, E) in the graph-theoretic sense, where V and E denote the sets of all nodes and all edges of the graph structure, corresponding respectively to the users and the friend relationships between users in the actual social network. The practical problem of finding similar users on a social network can thus be translated into the abstract problem of computing the similarity between nodes of a graph structure.

Among node similarity measures, SimRank is an important one in the field of graph node similarity computation. Its definition is based entirely on the graph structure and does not depend on any additional features; through its iterative form it also incorporates information from the multi-hop neighborhoods of nodes, which has attracted considerable attention. The basic definition of SimRank is:

$$
s(u, v) =
\begin{cases}
1, & u = v \\
\dfrac{c}{|I(u)|\,|I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s(x, y), & u \neq v
\end{cases}
$$

where I(u) and I(v) denote the in-neighbor sets of nodes u and v, and c is the attenuation coefficient.

Based on this, the inventors of the present application found that, because SimRank is defined iteratively, directly computing the SimRank of a single source node requires O(t d² n²) time and O(n²) space, where t is the number of iterations, d is the average node degree of the graph, and n is the number of graph nodes.

With the advent of the big data era, the graph structures generated in real life keep growing, for example the social networks formed by the hundreds of millions of users of WeChat and Twitter. The time complexity of computing SimRank from its original definition is too high: the single-source SimRank result on a large graph is difficult to compute within a practical time, so the requirements of real-time recommendation and interactive query cannot be met.

The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

The invention aims to provide a collaborative filtering recommendation method based on single-source SimRank that can compute the single-source SimRank result on a large graph within a practical time and meet the requirements of real-time recommendation and interactive query.

In order to achieve this purpose, the invention provides a collaborative filtering recommendation method based on single-source SimRank, which comprises the following steps: converting the user to be recommended, the other users, and the relationships among the users into a graph structure, wherein the graph structure comprises a plurality of nodes corresponding to the users and a plurality of edges corresponding to the relationships among the users, the edges comprise out-edges and in-edges, and the user to be recommended is the source node u in the graph structure; estimating the SimRank similarity s(u, v) between node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches node w in l steps and stops there, the probability η(w) that two random walks starting from node w never meet again while walking, and the probability π_l(v, w) that a reverse walk starting from node w reaches node v in l steps and stops there, wherein the random walk moves with probability √c to a randomly chosen in-neighbor of the current node and stops at the current node with probability 1 − √c, c is the attenuation coefficient with 0 ≤ c ≤ 1, and v is any node in the graph structure; repeating the estimation of the SimRank similarity until the SimRank similarity between the source node u and every node in the graph structure has been estimated; finding, according to the SimRank similarity estimates, the k nodes with the highest similarity to the node to be recommended; and acquiring the behavior information of these k nodes, then integrating it and pushing it to the source node u.

In a preferred embodiment, estimating the SimRank similarity s(u, v) between node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches node w in l steps and stops there, the probability η(w) that two random walks starting from node w never meet again while walking, and the probability π_l(v, w) that a reverse walk starting from node w reaches node v in l steps and stops there comprises: calculating an estimate of (π_l(u, w) · η(w)); judging whether node w is a core node; if so, acquiring the estimate of π_l(v, w) for the core node from a prestored index table; if not, dynamically calculating the estimate of π_l(v, w) by a backward walk method, wherein a backward walk moves from the current node to an out-neighbor of the current node according to a certain probability distribution; and estimating the SimRank similarity s(u, v) between node v and the source node u according to formula one:

In a preferred embodiment, determining whether node w is a core node includes: determining the core nodes in the graph structure according to the PageRank of the nodes in the graph structure, wherein the core nodes are the top j₀ nodes when sorted in descending order of inverse PageRank, j₀ is an integer with 0 ≤ j₀ ≤ n, and n is the number of nodes in the graph structure; judging whether a preset condition is met; if the preset condition is not met, starting from the core node and performing a random walk along the out-edge direction; and if the preset condition is met, stopping the random walk and determining the estimate of π_l(v, w).

In a preferred embodiment, performing the random walk along the out-edge direction includes: while the random walk is performed, updating according to formula two the residual value r_{l+1}(v′, w) of the reverse random walk that starts from node w and reaches node v′ in l + 1 steps, formula two being:

where d_in(v′) denotes the in-degree of node v′, v′ is a randomly selected out-neighbor of node v, and r_l(v, w) is the residual value of the reverse random walk that starts from node w and reaches node v in l steps; and updating according to formula three the value π_l(v, w) corresponding to the node v reached at the l-th step of the random walk, formula three being:

In a preferred embodiment, the preset condition includes: the residual values r_l(v, w) of reaching node v after l steps of random walk are, for all nodes on the graph structure, less than a preset value r_max.

In a preferred embodiment, stopping the random walk and determining the estimate of π_l(v, w) if the preset condition is met includes: if the preset condition is met, stopping the random walk; and storing the values of π_l(v, w), as updated when the random walk stops, in an index table.

In a preferred embodiment, calculating the estimate of (π_l(u, w) · η(w)) comprises: acquiring a first current sampling count; under the first current sampling count, generating a random walk from the source node u that reaches node w in l steps and stops at node w; generating two random walks W₁(w) and W₂(w) from node w, the result of this sample being 1 if W₁(w) and W₂(w) do not meet while walking and 0 otherwise; judging whether the first current sampling count reaches a first preset sampling value; if the first preset sampling value is not reached, repeating the steps of acquiring the first current sampling count, generating under it a √c-walk from the source node u that reaches node w in l steps and stops at node w, generating two random walks from node w and judging whether they meet, and updating the first current sampling count, until the first current sampling count reaches the first preset sampling value; and if the first preset sampling value is reached, taking the mean of all sample results as the estimate of (π_l(u, w) · η(w)).

In a preferred embodiment, dynamically calculating the estimate of π_l(v, w) by the backward walk method includes: acquiring a second current sampling count; under the second current sampling count, calculating π_{i+1}(v′, w) of the reached node v′ according to formula four, formula four being:

where v denotes the node reached when the number of walk steps is i, v′ denotes the node reached when the number of walk steps is i + 1, d_in(v) denotes the in-degree of node v, and i is an integer with 0 ≤ i ≤ l; continuing the random walk and applying formula four until the number of walk steps reaches the required l, to obtain the estimate of π_l(v, w) under the second current sampling count; judging whether the second current sampling count reaches a second preset sampling value; if the second preset sampling value is not reached, repeating the steps of acquiring the second current sampling count, calculating π_{i+1}(v′, w) under it, obtaining the estimate of π_l(v, w) under it, and updating the second current sampling count, until the second current sampling count reaches the second preset sampling value; and if the second preset sampling value is reached, taking the mean of all sample results as the estimate of π_l(v, w).

In a preferred embodiment, the first preset sampling value and/or the second preset sampling value is d_r, where

ε is a preset error.

In a preferred embodiment, after estimating the SimRank similarity s(u, v) between node v and the source node u according to formula one, the method further includes: repeating the calculation of the estimate of the SimRank similarity between u and v for f_r rounds, where

and taking the median of the f_r per-round estimates as the estimate s(u, v) of the SimRank similarity between node v and the source node u.

Compared with the prior art, the collaborative filtering recommendation method based on single-source SimRank can, while ensuring the accuracy of the computed result, obtain within a practical time the k nodes with the highest SimRank similarity to the source node among all other nodes on the graph structure, and thereby the k users most similar to the user to be recommended; this reduces time complexity and meets the requirements of real-time recommendation and interactive query.

Drawings

Fig. 1 is a flowchart of a collaborative filtering recommendation method based on a single-source SimRank according to an embodiment of the present invention.

Fig. 2 is a flowchart of estimation of SimRank similarity between nodes u, v according to an embodiment of the present invention.

Detailed Description

The following detailed description of the present invention is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.

Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.

SimRank is an important algorithm in the field of graph node similarity computation, originally proposed in 2002 by Glen Jeh and Jennifer Widom. Its core ideas can be summarized in the following two points:

a. A node is most similar to itself.

b. Two nodes have a certain degree of similarity if their respective neighbor nodes are similar.

It is specifically defined as follows:

$$
s(u, v) =
\begin{cases}
1, & u = v \\
\dfrac{c}{|I(u)|\,|I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s(x, y), & u \neq v
\end{cases}
$$

where u and v denote two nodes, s(u, v) denotes the SimRank similarity between nodes u and v, and c is a real number between 0 and 1, also called the attenuation coefficient. By definition, the similarity of two different nodes u and v depends on the similarity between their neighbors. In particular, if |I(u)| = 0 or |I(v)| = 0, then s(u, v) = 0, where I(u) and I(v) denote the in-neighbor sets of nodes u and v.
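The recursive definition can be evaluated directly by fixed-point iteration on a small graph. The following is a minimal sketch of that naive computation (not the method of the invention); the function name `simrank`, the dictionary-based graph representation, and the toy graph are assumptions made for illustration.

```python
from itertools import product


def simrank(in_nbrs, c=0.6, iterations=10):
    """Naive iterative SimRank.

    in_nbrs maps every node to the list of its in-neighbors I(node).
    Returns a dict {(a, b): s(a, b)} over all ordered node pairs.
    """
    nodes = list(in_nbrs)
    # Iteration 0: a node is fully similar to itself and to nothing else.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                # |I(a)| = 0 or |I(b)| = 0 gives similarity 0.
                new[(a, b)] = 0.0
            else:
                total = sum(sim[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                new[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim


# Toy graph: node 0 points to both 1 and 2, so I(1) = I(2) = [0].
toy_in = {0: [], 1: [0], 2: [0]}
scores = simrank(toy_in, c=0.6)
```

On this toy graph, nodes 1 and 2 share their only in-neighbor, so s(1, 2) = c · s(0, 0) = c. Each iteration touches every node pair and every pair of their in-neighbors, which is where the quadratic cost of the definition-based computation comes from.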

In SimRank computation, the problem of returning, for a given source node, the SimRank values between the source node and all other nodes on the graph is called the single-source SimRank problem. Here, a completely accurate SimRank result is not required; it is only required that the absolute error between the computed SimRank value and the true SimRank value of the corresponding node pair does not exceed a given error parameter ε. Specifically, given a source node u, for any node v on the graph the estimate ŝ(u, v) of the SimRank similarity should, with probability 1 − δ, satisfy |ŝ(u, v) − s(u, v)| ≤ ε with respect to the true value s(u, v).

Single-source SimRank has important applications in fields such as web mining, social network analysis, and spam detection. However, because the original definition of SimRank is iterative, the definition-based computation method requires O(t d² n²) time and O(n²) space, where t is the number of iterations, d is the average node degree of the graph, and n is the number of nodes; such high time and space complexity can hardly meet the needs of practical applications. Since SimRank was proposed, many researchers have optimized single-source SimRank algorithms, but so far no algorithm can complete single-source SimRank computation in sub-linear time. On large-scale graphs the number of nodes is often very large, and an algorithm whose query time is linear in the number of nodes cannot meet the needs of real-time analysis, interactive query, and other practical scenarios, which limits the range of application of SimRank.

As shown in fig. 1, the flowchart of the collaborative filtering recommendation method based on the single source SimRank according to the preferred embodiment of the present invention includes steps S1-S5.

Step S1, converting the user to be recommended, the other users, and the relationships among the users into a graph structure, wherein the graph structure comprises a plurality of nodes corresponding to the users and a plurality of edges corresponding to the relationships among the users, the edges comprise out-edges and in-edges, and the user to be recommended is the source node u in the graph structure.

The users in this embodiment are all registered users on the platform, and the relationship between users may specifically be a follow relationship. An example is all registered users on Facebook together with their friend-relationship network.

Specifically, for social networks with follow relationships, such as microblogging platforms, Facebook, and Instagram, the users of the social network correspond to nodes of the graph structure, and the follow relationships among users correspond to edges of the graph structure. If user A follows user B, a directed edge from the node of user B to the node of user A is established on the graph structure (denoted B → A); B is then an in-neighbor of A, A is an out-neighbor of B, and this edge is an out-edge of node B and an in-edge of node A. The number of out-edges of a node is called its out-degree, and the number of in-edges of a node is called its in-degree.

For social networks with friend relationships, such as WeChat and QQ, the users correspond to graph nodes and the friend relationships correspond to edges of the graph structure. Specifically, if user A and user B are friends of each other, a directed edge from the node of user A to the node of user B and a directed edge from the node of user B to the node of user A are both established on the graph structure.
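The two edge conventions described for step S1 can be sketched as follows. This is an illustrative sketch only; the function name `build_graph` and the input format (lists of user-name pairs) are assumptions, not part of the described method.

```python
from collections import defaultdict


def build_graph(follows=(), friendships=()):
    """Build out- and in-neighbor sets from social relationships.

    follows:     iterable of (follower, followee) pairs.
    friendships: iterable of (a, b) pairs of mutual friends.
    """
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)

    def add_edge(src, dst):
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)

    for follower, followee in follows:
        # "A follows B" becomes the directed edge B -> A.
        add_edge(followee, follower)
    for a, b in friendships:
        # A friendship becomes one directed edge in each direction.
        add_edge(a, b)
        add_edge(b, a)
    return out_nbrs, in_nbrs


out_nbrs, in_nbrs = build_graph(follows=[("A", "B")], friendships=[("C", "D")])
```

Here the in-degree of a node is `len(in_nbrs[node])` and its out-degree is `len(out_nbrs[node])`.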

Step S2, estimating the SimRank similarity between node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches node w in l steps and stops there, the probability η(w) that two random walks starting from node w never meet again while walking, and the probability π_l(v, w) that a reverse walk starting from node w reaches node v in l steps and stops there.

In this embodiment, the random walk moves with probability √c to a randomly chosen in-neighbor of the current node, that is, a node reached by walking along an in-edge of the current node, and stops at the current node with probability 1 − √c; it is also called a √c-walk. c is the attenuation coefficient, a real number with 0 ≤ c ≤ 1, for example 0.6 or 0.8, and v is any node in the graph structure.
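A single walk of this kind can be sampled as in the sketch below. It assumes that the walk continues with probability √c and stops with probability 1 − √c, and, as a further assumption not stated in the text, that a walk also ends when the current node has no in-neighbors. The function name `sqrt_c_walk` is hypothetical.

```python
import math
import random


def sqrt_c_walk(in_nbrs, start, c=0.6, rng=random):
    """Sample one walk and return the list of visited nodes.

    in_nbrs maps node -> list of in-neighbors. The last element of the
    returned path is the node where the walk ended, and len(path) - 1 is
    the number of steps l that were taken.
    """
    p_continue = math.sqrt(c)  # assumed continuation probability sqrt(c)
    path = [start]
    current = start
    # Assumption: the walk also ends at a node without in-neighbors.
    while in_nbrs.get(current) and rng.random() < p_continue:
        current = rng.choice(in_nbrs[current])
        path.append(current)
    return path


# Chain 0 -> 1 -> 2, so I(2) = [1], I(1) = [0], I(0) = [].
chain_in = {2: [1], 1: [0], 0: []}
```

With c = 0 the walk never moves, and with c = 1 it keeps walking until it runs out of in-neighbors, which gives two easy sanity checks for the sampler.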

Step S3, repeating the estimation of the SimRank similarity until the SimRank similarity between the source node u and every node in the graph structure has been estimated;

Specifically, this can be implemented by letting node v traverse all nodes of the graph.

Step S4, finding the k nodes with the highest similarity to the node to be recommended according to the SimRank similarity estimates;

Step S5, acquiring the behavior information of these k nodes, then integrating it and pushing it to the source node u.
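Steps S4 and S5 can be sketched as a top-k selection followed by an aggregation of the selected users' behavior. The text does not prescribe how the behavior information is integrated, so the rule used here (sum the similarity of every top-k user who has an item, and skip items the source user already has) is an assumption, as are the names `top_k_similar`, `recommend`, and the sample data.

```python
import heapq
from collections import defaultdict


def top_k_similar(scores, source, k):
    """Return the k (node, score) pairs most similar to `source`, best first."""
    candidates = [(v, s) for v, s in scores.items() if v != source]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


def recommend(scores, source, k, behavior):
    """Rank items used by the top-k similar users.

    Assumption: an item's weight is the summed similarity of the top-k
    users who have it, and items the source already has are skipped.
    """
    seen = set(behavior.get(source, ()))
    weight = defaultdict(float)
    for user, sim in top_k_similar(scores, source, k):
        for item in behavior.get(user, ()):
            if item not in seen:
                weight[item] += sim
    return sorted(weight, key=weight.get, reverse=True)


# Hypothetical similarity estimates s(u, v) and per-user item lists.
scores = {"u": 1.0, "a": 0.5, "b": 0.2, "c": 0.4}
behavior = {"u": ["x"], "a": ["x", "y"], "b": ["z"], "c": ["y", "q"]}
```

For example, `recommend(scores, "u", 2, behavior)` considers only users a and c, and ranks item y first because both of them have it.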

Therefore, the collaborative filtering recommendation method based on single-source SimRank provided by this embodiment can, while ensuring the accuracy of the computed result, obtain within a practical time the k nodes with the highest SimRank similarity to the source node among all other nodes on the graph structure, and thereby the k users most similar to the user to be recommended; this reduces time complexity and meets the requirements of real-time recommendation and interactive query.

In an implementation manner, the present embodiment further provides a method for specifically estimating the SimRank similarity between the nodes u and v. As shown in fig. 2, a flowchart of estimation of SimRank similarity between nodes u and v according to a preferred embodiment of the present invention includes: S201-S208.

Step S201, determining the core nodes in the graph structure according to the PageRank of the nodes in the graph structure, wherein the core nodes are the top j₀ nodes when sorted in descending order of inverse PageRank, j₀ is an integer with 0 ≤ j₀ ≤ n, and n is the number of nodes in the graph structure.

The inverse PageRank is obtained by reversing the direction of every edge of the graph and then computing the PageRank of each node on the resulting graph.
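Core-node selection can be sketched with a standard power-iteration PageRank on the reversed graph. The damping factor, the number of iterations, and the uniform treatment of nodes that have no in-neighbors in the original graph are assumptions; the text only states that PageRank is computed after reversing all edges. The names `inverse_pagerank` and `core_nodes` are hypothetical.

```python
def inverse_pagerank(in_nbrs, damping=0.85, iterations=50):
    """PageRank of the edge-reversed graph, by power iteration.

    in_nbrs maps every node to the list of its in-neighbors in the
    original graph; after reversal these are the targets of its edges.
    """
    nodes = list(in_nbrs)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # Assumption: a node with no reversed out-edges spreads its
            # rank uniformly over all nodes.
            targets = in_nbrs[v] or nodes
            share = damping * rank[v] / len(targets)
            for x in targets:
                new[x] += share
        rank = new
    return rank


def core_nodes(in_nbrs, j0, damping=0.85):
    """Return the j0 nodes with the largest inverse PageRank."""
    rank = inverse_pagerank(in_nbrs, damping=damping)
    return sorted(rank, key=rank.get, reverse=True)[:j0]


# Star graph: node 0 is the only in-neighbor of nodes 1, 2 and 3.
star_in = {0: [], 1: [0], 2: [0], 3: [0]}
```

On the star graph, node 0 is reached from every other node in the reversed graph, so it has the largest inverse PageRank and is the first core node.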

Step S202, judging whether a preset condition is met, and if the preset condition is not met, starting from the core node and performing a random walk along the out-edge direction;

Step S2021, while the random walk is performed, updating according to formula two the residual value r_{l+1}(v′, w) of the reverse √c-walk that starts from node w and reaches node v′ in l + 1 steps, formula two being:

where d_in(v′) denotes the in-degree of node v′, that is, its number of in-edges; v′ is a randomly selected out-neighbor of node v; r_l(v, w) is the residual value of the reverse √c-walk that starts from node w and reaches node v in l steps; and in steps S2021 and S2022, w is a core node;

Step S2022, updating according to formula three the value π_l(v, w) corresponding to the node v reached at the l-th step of the random walk, formula three being:

wherein the preset condition includes: the residual values r_l(v, w) of reaching node v after l steps of random walk are, for all nodes on the graph structure, less than a preset value r_max.

Step S203, if the preset condition is met, stopping the random walk and determining the estimate of π_l(v, w).

Specifically, determining the estimate of π_l(v, w) includes: storing the values of π_l(v, w), as updated when the random walk stops, in an index table.
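Formulas two and three do not appear in this text. The sketch below therefore uses the update rules of a standard backward push, which are an assumption: pushing a residual r_l(v, w) adds (1 − √c) · r_l(v, w) to π_l(v, w) and adds √c · r_l(v, w) / d_in(v′) to r_{l+1}(v′, w) for every out-neighbor v′ of v. It also pushes to all out-neighbors level by level instead of picking one at random. The function name `backward_push` and the `max_level` cap are hypothetical.

```python
import math
from collections import defaultdict


def backward_push(out_nbrs, in_deg, w, c=0.6, r_max=1e-4, max_level=50):
    """Build index entries {(l, v): estimate of pi_l(v, w)} for target w.

    out_nbrs maps node -> list of out-neighbors, in_deg maps node -> in-degree.
    Residuals below r_max are left unpushed, as in the preset condition.
    """
    sqrt_c = math.sqrt(c)
    residue = {w: 1.0}          # r_0(w, w) = 1
    index = defaultdict(float)
    for level in range(max_level + 1):
        next_residue = defaultdict(float)
        for v, r in residue.items():
            if r < r_max:
                continue  # below the threshold: stop pushing this residual
            # Assumed form of formula three.
            index[(level, v)] += (1.0 - sqrt_c) * r
            for v2 in out_nbrs.get(v, ()):
                # Assumed form of formula two.
                next_residue[v2] += sqrt_c * r / in_deg[v2]
        if not next_residue:
            break  # every remaining residual is below r_max
        residue = next_residue
    return dict(index)


# Edges 0 -> 1, 0 -> 2, 1 -> 2.
out_nbrs = {0: [1, 2], 1: [2]}
in_deg = {0: 0, 1: 1, 2: 2}
index_for_0 = backward_push(out_nbrs, in_deg, w=0, c=0.64, r_max=0.0)
```

With r_max = 0 every residual is pushed, and under the stated assumptions the entries equal (1 − √c)(√c)^l times the l-step probability of walking from v to w along in-edges. A larger r_max yields a smaller index at the price of underestimating the entries that were not pushed.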

It should be noted that steps S201 to S203 can be carried out before the user to be recommended is determined. For example, in step S1 only the users on the platform and the relationships between them are converted into a graph structure that includes nodes corresponding to all users in the social network and edges corresponding to all relationships between users, which is sufficient for the computations in steps S201 to S203.

Steps S201-S203 can be explained as follows. To speed up queries, the values π_l(v, w) can be computed in advance and looked up from an index file at query time. However, storing π_l(v, w) for all nodes v and w of the graph structure requires index space of size O(n/ε). For small error parameters ε, O(n/ε) is significantly larger than the graph size O(m), which hinders the use and extension of the algorithm. To reduce the index space, only the core nodes are indexed, and π_l(v, w) for the other nodes is computed on the fly during the query. A core node here is a node with a large inverse PageRank. A parameter j₀ is set, and the backward query method is used to compute π_l(v, w) for the nodes w ranked in the top j₀ by inverse PageRank and store them in the index file. In this way, part of the intermediate results can be computed before the user to be recommended is known, which shortens the SimRank computation time at query time.

Step S204, calculating the estimate of (π_l(u, w) · η(w)), which may specifically include steps S2041-S2046.

Step S2041, acquiring a first current sampling count;

Step S2042, under the first current sampling count, generating a √c-walk from the source node u that reaches node w in l steps and stops at node w;

Step S2043, generating two random walks W₁(w) and W₂(w) from node w; if W₁(w) and W₂(w) do not meet while walking, the result of this sample is 1, otherwise it is 0;

Step S2044, judging whether the first current sampling count reaches a first preset sampling value;

Step S2045, if the first preset sampling value is not reached, repeating the steps of acquiring the first current sampling count, generating under it a √c-walk from the source node u that reaches node w in l steps and stops at node w, generating two random walks from node w and judging whether they meet, and updating the first current sampling count, until the first current sampling count reaches the first preset sampling value;

Step S2046, if the first preset sampling value is reached, taking the mean of all sample results as the estimate of (π_l(u, w) · η(w)).

The first preset sampling value may be d_r, where

ε is a preset error.
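Steps S2041-S2046 amount to a Monte Carlo estimate of the product π_l(u, w) · η(w). The sketch below illustrates the sampling loop under several assumptions: the walk continues with probability √c, a walk that tries to move from a node without in-neighbors is discarded rather than counted as stopped, two walks "meet" when they occupy the same node after the same positive number of steps, and the number of samples is passed in directly because the formula for d_r is not reproduced here. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def walk(in_nbrs, start, sqrt_c, rng):
    """One walk; returns (path, stopped).

    `stopped` is False when the walk tried to move from a node that has
    no in-neighbors (assumption: such a walk does not count as stopped).
    """
    path = [start]
    while rng.random() < sqrt_c:
        nbrs = in_nbrs.get(path[-1])
        if not nbrs:
            return path, False
        path.append(rng.choice(nbrs))
    return path, True


def meet(p1, p2):
    """True if the two walks are at the same node after the same step i >= 1."""
    return any(a == b for a, b in zip(p1[1:], p2[1:]))


def estimate_pi_eta(in_nbrs, u, c=0.6, samples=1000, seed=0):
    """Return {(l, w): estimate of pi_l(u, w) * eta(w)} from `samples` trials."""
    rng = random.Random(seed)
    sqrt_c = math.sqrt(c)
    hits = Counter()
    for _ in range(samples):
        path, stopped = walk(in_nbrs, u, sqrt_c, rng)
        if not stopped:
            continue  # this sample contributes 0 to every (l, w)
        w, l = path[-1], len(path) - 1
        p1, _ = walk(in_nbrs, w, sqrt_c, rng)
        p2, _ = walk(in_nbrs, w, sqrt_c, rng)
        if not meet(p1, p2):
            hits[(l, w)] += 1  # sample result 1 for this (l, w)
    return {key: count / samples for key, count in hits.items()}


# Single edge a -> u, so I(u) = ["a"] and I(a) = [].
tiny_in = {"u": ["a"], "a": []}
```

Estimating the product in one pass means a single sample never has to estimate η(w) separately for every node w, which is the saving described later in the text.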

Step S205, judging whether node w is a core node;

Step S206, if so, acquiring the estimate of π_l(v, w) for the core node from a prestored index table;

Step S207, if not, dynamically calculating the estimate of π_l(v, w) by a backward walk method, wherein a backward walk moves from the current node to an out-neighbor of the current node according to a certain probability distribution;

Step S2071, acquiring a second current sampling count;

Step S2072, under the second current sampling count, calculating π_{i+1}(v′, w) of the reached node v′ according to formula four, formula four being:

where v denotes the node reached when the number of walk steps is i, v′ denotes the node reached when the number of walk steps is i + 1, d_in(v) denotes the in-degree of node v, and i is an integer with 0 ≤ i ≤ l;

Step S2073, continuing the random walk and applying formula four until the number of walk steps reaches the required l, to obtain the estimate of π_l(v, w) under the second current sampling count;

Step S2074, judging whether the second current sampling count reaches a second preset sampling value;

Step S2075, if the second preset sampling value is not reached, repeating the steps of acquiring the second current sampling count, calculating π_{i+1}(v′, w) under it, obtaining the estimate of π_l(v, w) under it, and updating the second current sampling count, until the second current sampling count reaches the second preset sampling value;

Step S2076, if the second preset sampling value is reached, taking the mean of all sample results as the estimate of π_l(v, w). The second preset sampling value may be d_r, where

ε is a preset error.

The explanation for steps S205-S207 is as follows:

1) The estimate of π_l(v, w) is unbiased with respect to its true value, i.e. for any l ≥ 0,

$$
\mathbb{E}\left[\hat{\pi}_l(v, w)\right] = \pi_l(v, w),
$$

where π̂_l(v, w) denotes the estimate of the true value π_l(v, w).

2) The time complexity of the process is O(n π(w)). It is worth noting that merely outputting the values π_l(v, w) (l = 0, 1, …, ∞; v ∈ V) also requires

running time in the worst case, so the running time of the method is optimal. Here, π(w) denotes the inverse PageRank of node w.

Step S208, calculating the estimate s(u, v) of the SimRank similarity between u and v according to formula one:
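Formula one itself does not appear in this text. The following is a reconstruction rather than the patent's own wording. It assumes that the walk stops with probability 1 − √c at every step, so that π_l(u, w) = (1 − √c)(√c)^l P^l(u, w), where P (a symbol introduced here) is the transition matrix of a move to a uniformly random in-neighbor. Under that assumption, the sum of products of the three quantities from step S2 takes the form:

```latex
\begin{align*}
s(u, v)
  &= \sum_{l=0}^{\infty} \sum_{w \in V} c^{\,l}\, P^{l}(u, w)\, P^{l}(v, w)\, \eta(w) \\
  &= \frac{1}{(1 - \sqrt{c})^{2}} \sum_{l=0}^{\infty} \sum_{w \in V}
     \pi_l(u, w)\, \pi_l(v, w)\, \eta(w)
\end{align*}
```

The first line sums, over the step l and the node w at which two walks from u and v meet for the last time, the probability of that event; the second line substitutes the assumed expression for π_l. As a check, on a graph where nodes 1 and 2 share the single in-neighbor 0 and node 0 has no in-neighbors, only l = 1 and w = 0 contribute, with π_1(1, 0) = π_1(2, 0) = (1 − √c)√c and η(0) = 1, which gives s(1, 2) = c, in agreement with the recursive definition.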

Step S209, repeating the calculation of the estimate of the SimRank similarity between u and v for f_r rounds, where

Specifically, step S209 may be implemented as follows: in the current round, d_r samples are taken for each of the estimates of π_l(v, w) and (π_l(u, w) · η(w)), and the estimates of π_l(v, w) and (π_l(u, w) · η(w)) are determined from the means of the d_r sample results. s(u, v) is then computed according to formula one as the estimate of s(u, v) for that round. The current round is updated to the current round + 1. These steps are repeated until the current round reaches f_r.

Step S210, taking the median of the f_r per-round estimates as the estimate s(u, v) of the SimRank similarity between nodes u and v.
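Steps S209 and S210 follow the familiar median-of-means pattern: each round produces an estimate from the mean of its samples, and the median over rounds suppresses the occasional bad round. A minimal sketch, in which `run_round` is a hypothetical callable standing in for one full round of steps S204-S208 and the round count is passed in directly because the formula for f_r is not reproduced here:

```python
import statistics


def median_of_rounds(run_round, rounds):
    """Run `run_round` the given number of times and return the median."""
    return statistics.median(run_round() for _ in range(rounds))


# Hypothetical per-round estimates of s(u, v); one round is an outlier.
round_values = iter([0.30, 0.31, 5.0, 0.29, 0.30])
estimate = median_of_rounds(lambda: next(round_values), rounds=5)
```

In this example the outlying round (5.0) does not affect the result, whereas a plain mean over the five rounds would be pulled above 1.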

For the calculation of the single-source SimRank, in the worst case, the time complexity of the whole method is

For graphs whose inverse PageRank follows a power-law distribution, the time complexity of the whole method satisfies:

where γ denotes the coefficient of the power-law distribution that the inverse PageRank follows.

Therefore, compared with the O(t d² n²) time complexity required to directly compute single-source SimRank, the time complexity is significantly reduced.

Further, this embodiment converts the original definition of SimRank into a sum of products of three quantities: π_l(u, w), η(w), and π_l(v, w). For the η(w) part, the method estimates (π_l(u, w)·η(w)) as a whole instead of estimating η(w) separately, which lowers the time complexity of this part of the program from O(n·log(n/δ)/ε²) to O(log(n/δ)/ε²). This improves the running efficiency of the method and allows it to support queries with a small error parameter ε on graphs with a large number of nodes n. For the π_l(v, w) part, this embodiment stores all possible π_l(v, w) values only for the subset of graph nodes with the largest reverse PageRank, which reduces the size of the index file while keeping the impact on query time small.
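Written out, the decomposition into the three quantities takes the following shape. This is a sketch under the assumption that the walks are √c-walks (continue with probability √c, stop with probability 1 − √c), which is where the normalizing factor comes from; the sums over all walk lengths l and all nodes w ∈ V follow from the definitions of π_l(u, w), η(w), and π_l(v, w).

```latex
s(u, v) = \frac{1}{(1 - \sqrt{c})^{2}} \sum_{l=0}^{\infty} \sum_{w \in V} \pi_l(u, w) \cdot \eta(w) \cdot \pi_l(v, w)
```

Grouping the first two factors as (π_l(u, w)·η(w)) is what allows them to be estimated jointly from walks that start at the source node u.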

In addition, most graph structures follow a power-law distribution, and for a graph whose reverse PageRank follows a power-law distribution, the time needed to compute single-source SimRank with the method of this embodiment is sublinear in the number of nodes n.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. A collaborative filtering recommendation method based on single-source SimRank is characterized by comprising the following steps:

converting the user to be recommended, the users, and the relationships among the users into a graph structure, wherein the graph structure comprises a plurality of nodes corresponding to the users and a plurality of edges corresponding to the relationships among the users, the edges comprise out-edges and in-edges, and the user to be recommended is the source node u in the graph structure;

estimating the SimRank similarity s(u, v) between a node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches node w in l steps and stops there, the probability η(w) that two random walks starting from node w never meet again during their walks, and the probability π_l(v, w) that a reverse walk starting from node w reaches node v in l steps and stops there, wherein the random walk moves to a randomly chosen in-neighbor of the current node with probability √c and stops at the current node with probability 1 − √c, c is the attenuation coefficient, 0 ≤ c ≤ 1, and v is any node in the graph structure;

repeatedly executing the estimation of the SimRank similarity until the estimation of the SimRank similarity between all the nodes and the source node u in the graph structure is completed;

finding, according to the estimation results of the SimRank similarity, the top k nodes with the highest similarity to the node to be recommended;

acquiring behavior information of the first k nodes, and integrating and pushing the behavior information to the source node u;

wherein estimating the SimRank similarity s(u, v) between node v and the source node u according to the probability π_l(u, w) that a random walk starting from the source node u reaches node w in l steps and stops there, the probability η(w) that two random walks starting from node w never meet again during their walks, and the probability π_l(v, w) that a reverse walk starting from node w reaches node v in l steps and stops there comprises:

calculating an estimate of (π_l(u, w)·η(w));

determining whether node w is a core node;

if so, obtaining the estimate of π_l(v, w) for the core node from a pre-stored index table;

if not, dynamically calculating the estimate of π_l(v, w) by a backward walk, wherein a backward walk moves from the current node to an out-neighbor of the current node according to a certain probability distribution;

the estimate s(u, v) of the SimRank similarity between node v and the source node u is calculated according to Formula 1:

wherein V is the set of all nodes in the graph structure.
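As an illustration of how the pieces of claim 1 fit together, the sketch below combines already-computed estimates into scores and then selects the top k nodes. It is a minimal sketch, not the claimed implementation: the names `combine_scores` and `top_k_similar`, the dictionary layouts, and the callback `compute_on_the_fly` are hypothetical, the core-node test is simplified to a lookup in the index, and the 1/(1 − √c)² factor assumes √c-walks with c < 1.

```python
import heapq
import math
from typing import Callable, Dict, Hashable, List, Tuple

Node = Hashable
LevelNode = Tuple[int, Node]  # (l, w)


def combine_scores(
    pu_eta: Dict[LevelNode, float],
    index: Dict[LevelNode, Dict[Node, float]],
    compute_on_the_fly: Callable[[int, Node], Dict[Node, float]],
    c: float,
) -> Dict[Node, float]:
    """Accumulate s(u, v) over all (l, w) pairs that have a joint estimate."""
    scale = 1.0 / (1.0 - math.sqrt(c)) ** 2  # assumed sqrt(c)-walk normalization
    scores: Dict[Node, float] = {}
    for (l, w), joint in pu_eta.items():
        # Entries present in the index play the role of core nodes;
        # everything else is computed on demand by the hypothetical callback.
        pi_v = index[(l, w)] if (l, w) in index else compute_on_the_fly(l, w)
        for v, p in pi_v.items():
            scores[v] = scores.get(v, 0.0) + scale * joint * p
    return scores


def top_k_similar(scores: Dict[Node, float], source: Node, k: int) -> List[Tuple[Node, float]]:
    """Return the k highest-scoring nodes, excluding the source itself."""
    candidates = ((v, s) for v, s in scores.items() if v != source)
    return heapq.nlargest(k, candidates, key=lambda item: item[1])
```

Here `pu_eta[(l, w)]` holds the estimate of (π_l(u, w)·η(w)) and each inner dictionary maps a node v to its estimate of π_l(v, w). Excluding the source node from the ranking is a choice made for the example.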

2. The collaborative filtering recommendation method according to claim 1, wherein the determining whether the node w is a core node comprises:

determining the core nodes in the graph structure according to the PageRank of the nodes in the graph structure, wherein the core nodes are the nodes ranked in the top j₀ positions when the nodes are sorted by reverse PageRank in descending order, j₀ is an integer with 0 ≤ j₀ ≤ n, and n is the number of nodes in the graph structure;

determining whether a preset condition is met and, if the preset condition is not met, performing a random walk from the core node along the out-edge direction;

if the preset condition is met, stopping the random walk and determining the estimate of π_l(v, w).

3. The collaborative filtering recommendation method of claim 2, wherein, after the random walk along the out-edge direction is performed, the method further comprises:

during the random walk, updating, according to Formula 2, the residual r_{l+1}(v', w) of the reverse random walk that starts from node w and reaches node v' in l + 1 steps, Formula 2 being:

wherein d_in(v') denotes the in-degree of node v', v' is an out-neighbor of the randomly selected node v, and r_l(v, w) is the residual of the reverse random walk that starts from node w and reaches node v in l steps;

updating, according to Formula 3, the value π_l(v, w) corresponding to the node v reached at step l of the random walk, Formula 3 being:

4. The collaborative filtering recommendation method of claim 3, wherein the preset condition includes:

the residuals r_l(v, w) of reaching node v by an l-step random walk are, for all nodes in the graph structure, less than the preset value r_max.

5. The collaborative filtering recommendation method according to claim 3, wherein stopping the random walk and determining the estimate of π_l(v, w) if the preset condition is met comprises:

if the preset condition is met, stopping the random walk;

storing in the index table the values of π_l(v, w) as updated at the time the random walk stops.
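Claims 2 to 5 describe building the index entries for a core node w by propagating residuals along out-edges until they fall below r_max. The sketch below shows one deterministic, level-by-level way to do this, used here in place of a sampled walk. It assumes that Formula 2 has the standard backward-push form r_{l+1}(v', w) += √c · r_l(v, w) / d_in(v') and that Formula 3 has the form π_l(v, w) += (1 − √c) · r_l(v, w). Both forms, the function name `backward_push`, and the `max_level` cap are assumptions made for illustration.

```python
import math
from typing import Dict, Hashable, List, Tuple

Node = Hashable


def backward_push(
    out_neighbors: Dict[Node, List[Node]],
    in_degree: Dict[Node, int],
    w: Node,
    c: float,
    r_max: float,
    max_level: int = 50,
) -> Dict[Tuple[int, Node], float]:
    """Approximate pi_l(v, w) for the (l, v) pairs reached from w along out-edges."""
    sqrt_c = math.sqrt(c)
    pi: Dict[Tuple[int, Node], float] = {}
    residual: Dict[Node, float] = {w: 1.0}  # r_0(w, w) = 1
    for level in range(max_level + 1):
        next_residual: Dict[Node, float] = {}
        for v, r in residual.items():
            if r <= r_max:
                continue  # residual at or below the threshold is not pushed
            # Assumed Formula 3: turn the residual into a stopping probability.
            pi[(level, v)] = pi.get((level, v), 0.0) + (1.0 - sqrt_c) * r
            # Assumed Formula 2: forward the residual to each out-neighbor v'.
            for v_next in out_neighbors.get(v, []):
                share = sqrt_c * r / in_degree[v_next]
                next_residual[v_next] = next_residual.get(v_next, 0.0) + share
        if not next_residual:
            break  # nothing was forwarded at this level, so the push ends
        residual = next_residual
    return pi
```

The returned dictionary is the kind of per-node table that could be written to the index. A larger r_max yields a smaller table at the cost of dropping low-probability entries.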

6. The collaborative filtering recommendation method of claim 1, wherein the calculating of an estimate of (π_l(u, w)·η(w)) comprises:

obtaining a first current sampling count;

in the sample with the first current sampling count, generating a random walk starting from the source node u, the random walk reaching node w in l steps and stopping at node w;

generating two random walks W₁(w) and W₂(w) starting from node w, wherein if W₁(w) and W₂(w) do not meet during their walks, the result of this sample is 1, and otherwise the result of this sample is 0;

determining whether the first current sampling count has reached a first preset sampling value;

if the first preset sampling value has not been reached, repeating the steps of obtaining the first current sampling count, generating a random walk from the source node u that reaches node w in l steps and stops at node w, generating two random walks from node w, determining whether the two random walks meet, and updating the first current sampling count, until the first current sampling count reaches the first preset sampling value;

if the first preset sampling value has been reached, taking the average of all sampling results as the estimate of (π_l(u, w)·η(w)).
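The sampling loop of claim 6 can be sketched as follows. This is a minimal illustration under stated assumptions: the walks are √c-walks with c < 1, a walk that tries to continue from a node with no in-neighbors is discarded, and two walks meet when they occupy the same node at the same step. The function names and the `in_neighbors` adjacency dictionary are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict, Hashable, List, Optional, Tuple

Node = Hashable


def sample_stop(in_neighbors, u, sqrt_c, rng) -> Optional[Tuple[int, Node]]:
    """Run one walk from u; return (l, w) where it stops, or None if discarded."""
    node, steps = u, 0
    while True:
        if rng.random() >= sqrt_c:
            return steps, node  # stops here with probability 1 - sqrt(c)
        if not in_neighbors.get(node):
            return None  # assumption: no in-neighbor to move to
        node = rng.choice(in_neighbors[node])
        steps += 1


def never_meet(in_neighbors, w, sqrt_c, rng) -> bool:
    """Run two walks from w in lockstep; True if they never share a node again."""
    a = b = w
    while True:
        a_continues = rng.random() < sqrt_c and bool(in_neighbors.get(a))
        b_continues = rng.random() < sqrt_c and bool(in_neighbors.get(b))
        if not (a_continues and b_continues):
            return True  # at least one walk has ended, so no later meeting
        a = rng.choice(in_neighbors[a])
        b = rng.choice(in_neighbors[b])
        if a == b:
            return False


def estimate_joint(
    in_neighbors: Dict[Node, List[Node]], u: Node, c: float, d_r: int, seed=None
) -> Dict[Tuple[int, Node], float]:
    """Estimate pi_l(u, w) * eta(w) for every observed (l, w) from d_r samples."""
    if not 0.0 <= c < 1.0:
        raise ValueError("this sketch requires 0 <= c < 1 so that walks terminate")
    rng = random.Random(seed)
    sqrt_c = math.sqrt(c)
    hits: Counter = Counter()
    for _ in range(d_r):
        stop = sample_stop(in_neighbors, u, sqrt_c, rng)
        if stop is not None and never_meet(in_neighbors, stop[1], sqrt_c, rng):
            hits[stop] += 1  # sample result 1; any other outcome contributes 0
    return {key: count / d_r for key, count in hits.items()}
```

Pairs (l, w) that never receive a sample result of 1 are simply absent from the returned dictionary, which corresponds to an estimate of 0.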

7. The collaborative filtering recommendation method of claim 6, wherein dynamically calculating the estimate of π_l(v, w) by a backward walk comprises:

obtaining a second current sampling count;

in the sample with the second current sampling count, calculating, according to Formula 4, the value π_{i+1}(v', w) for the reached node v', Formula 4 being:

continuing the random walk and the updating according to Formula 4 until the number of walk steps reaches the required step l, thereby obtaining the estimate of π_l(v, w) for the second current sampling count;

determining whether the second current sampling count has reached a second preset sampling value;

if the second preset sampling value has not been reached, repeating the steps of obtaining the second current sampling count, calculating π_{i+1}(v', w) in that sample, obtaining the estimate of π_l(v, w) for that sample, and updating the second current sampling count, until the second current sampling count reaches the second preset sampling value;

if the second preset sampling value has been reached, taking the average of all sampling results as the estimate of π_l(v, w).

8. The collaborative filtering recommendation method of claim 7, wherein the first preset sampling value and/or the second preset sampling value is d_r, wherein

ε is a preset error.

9. The collaborative filtering recommendation method of claim 8, wherein, after the estimate s(u, v) of the SimRank similarity between node v and the source node u is calculated according to Formula 1, the method further comprises:

repeating the calculation of the estimate of the SimRank similarity between u and v for f_r rounds, wherein

wherein δ is a preset parameter;